Pinyin support for Chinese (Mandarin) subtitles?

I checked the article: https://www.hackingchinese.com/focusing-on-mandarin-tones-without-being-distracted-by-pinyin/

We could show tones by stripping the number off the end of the numbered pinyin (Style.TONE3), like this:

from pypinyin import pinyin, Style

pinyin('中心', style=Style.TONE3, heteronym=True)
# [['zhong1', 'zhong4'], ['xin1']]
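Here's a rough sketch of that idea (just an illustration, not the final implementation): split each TONE3 syllable into its toneless base and the trailing tone number, so either part can be shown on its own.

from pypinyin import pinyin, Style

def split_tone(syllable):
    # 'zhong1' -> ('zhong', '1'); neutral-tone syllables carry no trailing digit
    if syllable and syllable[-1].isdigit():
        return syllable[:-1], syllable[-1]
    return syllable, ''

for readings in pinyin('中心', style=Style.TONE3, heteronym=True):
    base, tone = split_tone(readings[0])  # take the first listed reading
    print(base, tone)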

This morning I hooked up this word tokeniser… it's new and fancy and looks the business: https://github.com/lancopku/pkuseg-python

The model is only trained for Simplified Chinese, it seems. Does anyone need Traditional Chinese? I think jieba still gives better results for that.
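For anyone curious, basic usage of the two tokenisers looks roughly like this (a minimal sketch; the example sentence is just for illustration):

import pkuseg
import jieba

seg = pkuseg.pkuseg()                  # default model, trained on Simplified Chinese
print(seg.cut('我来到北京清华大学'))      # e.g. ['我', '来到', '北京', '清华大学']

print(jieba.lcut('我來到北京清華大學'))   # jieba tends to cope better with Traditional text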



(will go live in a few days)


Yes, definitely. Traditional Chinese is used primarily in Taiwan, where quite a significant number of foreigners (me included) are studying Mandarin. I also know some Chinese learners who are studying elsewhere using simplified characters yet are interested in knowing traditional as well.

Edit: don't get confused by the English name “Traditional”: it does not imply some historical language (think Latin or Ancient Greek), just that the place where it is used did not go through the script reform in the middle of the last century.

Wow wow wow… LLN is amazing. I just got it and wish I'd found it sooner. Very, very impressed.

I was searching for transliteration support for Thai and Korean, and came across this thread. The Chinese / Japanese functionality is great; in the past 30 minutes with this I think I learned more than I ever did watching Netflix / YouTube in a foreign language.

Korean + Thai transliteration would also be amazing. I'm not sure how easy or challenging it would be to implement, but I'm guessing there is a large K-drama audience. Learning to read Korean is certainly easier than Chinese or Japanese, but it would be pretty amazing to see the same functionality for the same reasons.

Thai is a relatively hard language to learn to read, and I believe there is huge demand for watching Thai series with English subs. Being able to learn the language on top of that would be an amazing bonus.

I would love to see how I can help support or contribute to making this happen :slight_smile: Amazing job David!


New Chinese word tokenisers are now hooked up (pkuseg for simplified, jieba for traditional); they should go live in the next few days.

I added a proper tokeniser for Thai (it is probably pretty broken at the moment), and also transliteration for Thai and Korean. :slight_smile: I was trying to find something for Hindi… didn't see any nice library. There's 'polyglot' (Python), but it has a lot of dependencies and seems a bit fickle, and I didn't like its output for Cyrillic.
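For anyone who wants to play with this locally, pythainlp covers both Thai word segmentation and romanisation; a minimal sketch (just an illustration, not the extension's actual pipeline):

from pythainlp import word_tokenize
from pythainlp.transliterate import romanize

text = 'สวัสดีครับ'
tokens = word_tokenize(text)            # Thai writes no spaces between words
print(tokens)
print([romanize(t) for t in tokens])    # romanised with the default engine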

@David_Wilkinson It looks like you're already well ahead of needing an answer to this, but just in case… There are lots of characters that have completely different pronunciations, with completely different pinyin, depending on which word they're used in. Reportedly something like 20% of the most commonly used characters have readings that differ at least in tone depending on context.

Here are some examples:

行 is xíng or háng
的 is de, dí or dì
会 is huì or kuài
都 is dōu or dū
便 is biàn or pián
薄 is bó, báo or bò
着 can be zhe, zháo, zhuó or zhāo

Here’s a couple of articles on this:


https://www.fluentinmandarin.com/content/chinese-characters-multiple-pronunciations/
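Just to illustrate, pypinyin's heteronym flag will list those alternative readings for a single character (a quick check of my own, so treat the exact output as approximate, not something from the extension):

from pypinyin import pinyin, Style

print(pinyin('行', style=Style.TONE, heteronym=True))
# expected to include both 'xíng' and 'háng' among the readings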

@David_Wilkinson Oh, and in addition to those contextual pinyin differences for the same characters used in different words or parts of speech in standard Mandarin Chinese, there's also the complication that Taiwanese Mandarin differs from it in pronunciation and vocabulary about as much as British English does from American English.

This means that the correct pinyin for Taiwanese Mandarin speech will be different to what’s correct for content from Chinese television, because the same characters should have different tones to reflect how they’re said by Mandarin speakers in Taiwan. Often, the correct pinyin for Taiwanese Mandarin is wrong for standard Mandarin Chinese and vice versa.

Additionally, Taiwanese Mandarin doesn’t tend to use the neutral tone for the second character in double character words. It also has a lot of distinctly different vocabulary, so the translations would be different for the same words, and it has more/different loan words.

So when I've complained in the past that I can't find 'Simplified Chinese' content on YouTube through LLY and everything is 'Traditional Chinese', I'm actually complaining that I can only find Taiwanese Mandarin content, which a lot of the time uses the wrong tones for my target language, plus a different accent and dialect. That's all very much something I plan to get familiar with later on, but it's actively unhelpful for immersion learning right now.

Right now I want to be thoroughly immersed in Chinese TV dramas, documentaries, social media videos and films that all use the tones of standard Mandarin Chinese (which frankly has enough accents and dialects already).

Hopefully, in a year or so from now, I'll have the tones and vocabulary of standard Mandarin thoroughly ingrained, and then I'll be ready to start learning how Taiwanese Mandarin differs in vocabulary, tones and pronunciation.

Equally, given that there’s a lot of Taiwanese Mandarin content on YouTube, you probably want to be offering the correct pinyin and definitions when your LLY users are choosing to watch Taiwanese content (like the Taiwanese dubs of Steven Universe and classic Disney movies).

Here’s a relevant article on that https://www.theworldofchinese.com/2014/05/taiwanese-mandarin-starter-kit/

Getting there…

I think pypinyin should handle different pronunciations ok. Well, at least this sounds promising:
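pypinyin ships phrase dictionaries, so word-level context should pick the right reading. A quick sanity check, just for illustration:

from pypinyin import pinyin

print(pinyin('银行'))  # 'háng' reading expected for 行 in the word for "bank"
print(pinyin('行动'))  # 'xíng' reading expected for 行 in the word for "action"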

The new word tokenisers should help a bit too. Today we are updating the schema/database for saved items so they will work with the new code (this will fix the 'undefined' entries in your saved items view)… that's the last thing to do before rolling out the update. It won't be the last Chinese update; I expect there will still be a few things off.

I’ve been told Traditional characters can be mapped to Simplified reliably, and vice versa I guess.
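OpenCC is the usual tool for that mapping; a minimal sketch (the exact config names vary a little between the OpenCC Python packages):

from opencc import OpenCC

t2s = OpenCC('t2s')   # Traditional -> Simplified
s2t = OpenCC('s2t')   # Simplified -> Traditional (one-to-many characters make this direction less reliable)

print(t2s.convert('漢語'))  # 汉语
print(s2t.convert('汉语'))  # 漢語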

btw, have a look here for Channels: https://extension.dioco.io/catalogue/youtube_catalogue/best_youtube_channels_to_learn_chinese.html

Code is live, you should get it within 48hrs (rolling out slowly so the servers don’t get squashed under load).


Hey… is it ok how we show pinyin in the main subs? Do we need different spacing or hyphens or something?


Looks great! Maybe a smaller font would be a bit better? The pinyin is there as a hint, not to be read primarily. It should not take attention away from the characters, which are the main focus for a learner after all.


Ok.
Side panel char/pinyin highlighting should be working right.

Will add extra fields to Anki export soon.

Uh, it would be easy to optionally show pinyin only for infrequent words (based on the word frequency option in the settings).
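Roughly what I have in mind, sketched with a made-up frequency-rank lookup and threshold (nothing final):

# Hypothetical: only annotate words rarer than the user's frequency setting
FREQ_RANK = {'的': 1, '中心': 1200, '薄': 4500}   # made-up ranks, for illustration only

def should_show_pinyin(word, threshold_rank=2000):
    # unknown words are treated as rare, so they always get pinyin
    return FREQ_RANK.get(word, 10**6) > threshold_rank

for w in ['的', '中心', '薄']:
    print(w, should_show_pinyin(w))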

Working on some better machine translation stuff now; we should be able to do better than what is possible using translation APIs.

Hm, now that the tokeniser isn’t awful, we could show word boundaries visually… by alternating color underlining or something. Any takers? :slight_smile:
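Something like this, say, wrapping each token in a span with alternating classes so the CSS can underline neighbouring words in different colours (assuming the tokeniser output is already to hand; just a sketch):

def mark_word_boundaries(tokens):
    # alternate a class per token: w0, w1, w0, w1, ...
    return ''.join(f'<span class="w{i % 2}">{tok}</span>' for i, tok in enumerate(tokens))

print(mark_word_boundaries(['我', '来到', '北京', '清华大学']))
# pair with CSS such as: .w0 { border-bottom: 2px solid orange; } .w1 { border-bottom: 2px solid teal; }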

Hi, I don't know if it's because I'm using Netflix in France or for some other reason, but I can't manage to enable the pinyin feature; there is no transliteration option available.

Is Jyutping, as well as Yale romanization, available for Cantonese? Also, would it be possible to have 3 subtitles on the screen simultaneously, in addition to the one that is shown and selected on the Netflix side?

Hi, just discovered LLN. I really, really love it!

However, I also can't get pinyin to work AT ALL (no transliteration slider below the “Force Original” slider). Any help?

Hah! Got it. The trick is to set Chinese as the Netflix subtitle and English as your LLN secondary subtitle. Then the transliteration option appears. Amazing!


So currently, can this export to Anki’s front card with phonetic symbols?