Pinyin support for Chinese (Mandarin) subtitles?

How do I turn the pinyin off?

Thanks for adding this! I’d already noticed it on YouTube, but a few refreshes got it working on Netflix too.

My feedback is that this will generally be a huge, huge help and makes LLN/LLY a lot more useful for Chinese learners, but there absolutely needs to be an option to turn it off, and greater control over how and when it appears.

There are stages in learning where you’re trying to learn the characters without pinyin. At this point you should only be looking up the pinyin if you want to check that you remembered correctly, or see what the tones are.

I think for a significant portion of Chinese learners, they’re only going to want the pinyin to appear on hover or click in the definition that appears - and I’ve reloaded several times and it still doesn’t seem to be there yet. Is this feature planned?

I’m not quite at the level where I only want pinyin in the definition modals; I would like to keep the pinyin on screen. But I’m finding that putting the pinyin at the top means I’m not actually reading the characters, which I’m pretty good at doing for a few hundred of the most common ones. I think an option to put the pinyin below the characters would cause me to read the characters first, then glance down at the pinyin only if I need it, aiding my learning.

Ultimately my goal is to be able to read books, signs and subtitles, none of which would provide the pinyin, so there’s a point at which it would be limiting how much I’m learning by having the pinyin ‘training wheels’ always on screen.

Similarly, I think that pinyin should be blurred out when ‘hide translations’ is on. Ideally this functionality would have more granular control and there’d be a third option, ‘hide pronunciation’, that would also apply to similar additions in Japanese.

The only other bit of feedback would be about how layout is working both in subtitles and the sidebar transcript - I’ll cover transcript later, it needs a different approach.

I would prefer it if the subtitle layout logic attempted to centre the pinyin alternative directly above (or below) the individual character it relates to.

Currently there seems to be some guessing of which sets of characters make up ‘words’ (often these are quite arbitrary) and then all the pinyin for those characters is centred within that ‘word’ rather than the individual characters.

I’m not sure if this way of doing things would be desirable even if the word matching was 100% accurate, but when the groupings are often different from even the machine translation’s interpretation, this becomes extremely distracting.

(This relates to a separate issue with the definitions, but that should be in its own separate topic, I think.)

It’s even worse in the transcript sidebar where the characters start getting grouped into dubious chunks and the pinyin mostly displays as simple sentences.

This is really not desirable. The characters should never be clumped up in that way - a gap makes it look like it’s a different sentence or cuts words in two.

I’d say in the sidebar you want the pinyin to be one sentence, and the Chinese characters another. This is how pinyin alternatives would be provided in a textbook, it’s more like a translation than an annotation or key.

This has also had the side effect of breaking copy-paste from the transcript. This is the mess resulting from copying one line of subtitle dialogue now:

zhè

wèi wú xiàn zuò
魏无羡做
de

dōng xī
东西
jiù shì
就是


xíng

This should be provided simply as plain text, with no gaps in either line, one line for each:

zhè wèi wú xiàn zuò de dōng xī jiù shì bù xíng
这魏无羡做的东西就是不行

I can quickly read either of those at a glance. With gaps added, they become harder to read and I’m distracted by why it’s decided that 魏无羡做 is a word and 不行 isn’t.

Oh and you’ve done this to all my saved items:

Having tested Anki export, I was relieved to find that the lines aren’t mangled on export to Anki, but equally, there’s no pinyin field included.

Generally all my flashcards have the Chinese sentence on the front and both the pinyin and the English translations on the back.

Here’s an example with card definition above, preview of front of card below (audio would play):

And here’s the card after you click to see the rear (audio plays again):

I hope this makes the point that the pinyin should be treated as a type of translation, an intermediary step on the way to a translation.

While we’re talking about Anki export, in general it would be great if you could provide everything as simple named plain text fields that I can then set up my own formatting rules to change into cards - or at least give me the option to do it that way.

Anki is extremely powerful and customisable, but not if you’ve forced everything into your own ‘Front’ and ‘Back’ fields and filled them with HTML. Just trying to edit them to put the TV show title and episode on the other side was extremely laborious.

Here’s an example of what a card format definition looks like for Chinese sentence mining flashcards:

And a different, highly customised deck for learning Chinese characters, where I’ve added my own animated GIFs:

I’ve just found the answer to this in a thread asking about the Japanese equivalent.

To turn them off you need to go to settings (the cog icon next to the LLN or LLY logo and On/Off control) and find the ‘Show Transliterations:’ drop down, that should look like this:

From there, “No transliterations” means “Don’t show pinyin”, “With original form” means “Include pinyin” and “Transliterations only” means “Only show pinyin”. (Hopefully these will be changed to include the word ‘pinyin’, to match what’s been done with the Japanese version.)

I’d only seen pinyin referred to as ‘romanisation’ before now, so my eyes completely skipped over ‘Transliterations’ and assumed it said ‘Translations’ when I went looking for this setting.

Hope this has helped!

The ‘separate issue’ I mentioned in the comment above about the layout / formatting of the pinyin in chunks is now documented here in excruciating detail (and a tl;dr at the start):

Hey, thanks for this useful feedback. We’ll come back to Chinese shortly and sort it out properly. Maybe we can get a couple of the quicker items done in the next couple of days, will post updates here. :slight_smile:

Og coded up the draggable subs:

(purple words are the new/old word frequency highlighting)

Ok, so tasks for Chinese:
– Add pinyin to hover dict (if not already displayed), and full dict.
– Pinyin below subs, and a ‘blur transliteration’ mode (Hopefully the change doesn’t upset anyone).

Centre the pinyin alternative directly above (or below) the individual character it relates to.

Question, can Chinese characters always be mapped to the same pinyin? Can the pinyin change when characters appear as part of combinations? If not, I think we should be able to do it. We’d convert every Chinese character to pinyin in isolation.
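To illustrate the idea, per-character conversion could be sketched like this. The mapping below is a tiny hand-made stand-in (a real implementation would use a pinyin library), and it deliberately ignores context-dependent readings, which is exactly the limitation being asked about:

```python
# Minimal sketch of converting each Chinese character to pinyin in
# isolation. The mapping here is a tiny illustrative stand-in for a
# real pinyin library; note it can only ever give one reading per
# character, so context-dependent readings would come out wrong.
CHAR_TO_PINYIN = {
    "中": "zhōng",
    "心": "xīn",
    "了": "le",  # context-dependent: can also be "liǎo"
}

def char_by_char_pinyin(text):
    """Convert each character in isolation, keeping unknowns as-is."""
    return [CHAR_TO_PINYIN.get(ch, ch) for ch in text]

print(char_by_char_pinyin("中心"))  # ['zhōng', 'xīn']
```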

Currently there seems to be some guessing of which sets of characters make up ‘words’.

Yes, there absolutely is. We could upgrade to jieba as a word tokeniser for Chinese; it’s probably a bit better, but not perfect either… Maybe Tencent etc. have made a better tool; it would need to be researched and integrated.
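For illustration of what a word tokeniser does, here is a minimal dictionary-based longest-match sketch. This is the simplest family of approaches; real tools like jieba and pkuseg use statistical models and much larger vocabularies, and the word list here is made up for the example sentence from this thread:

```python
# Minimal greedy longest-match tokeniser over a hand-made word list,
# purely to illustrate Chinese word segmentation. Real tokenisers
# (jieba, pkuseg) use statistical models and large dictionaries.
VOCAB = {"东西", "就是", "不行", "魏无羡"}
MAX_WORD_LEN = 3

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        # Try the longest dictionary word starting at position i.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 1, -1):
            if text[i:i + length] in VOCAB:
                tokens.append(text[i:i + length])
                i += length
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

print(tokenize("这魏无羡做的东西就是不行"))
# ['这', '魏无羡', '做', '的', '东西', '就是', '不行']
```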

This is really not desirable. The characters should never be clumped up in that way - a gap makes it look like it’s a different sentence or cuts words in two.

Ok, spaces between Chinese chars are undesirable. We’ll try to implement your suggestion.

Oh and you’ve done this to all my saved items:

Will fix shortly.

Pinyin to Anki.

While we’re talking about Anki export, in general it would be great if you could provide everything as simple named plain text fields that I can then set up my own formatting rules to change into cards - or at least give me the option to do it that way.

We made a CSV export, I think it’s what you want; it can be imported into Anki if you set up the fields right, although it’s still missing pinyin.

We used ‘Transliterations’ because I found a library that can transliterate 50 languages or so… we were going to make it a feature for many languages (Hindi, Russian etc.), but, on closer inspection, it wasn’t so great.

[From the other thread]

You need to be able to define all the characters in the compound because so many Chinese words are compounds and often learning a new word and its characters actually teaches you two or more new words, and gives you the tools to intuitively understand what other unfamiliar words mean.

Maybe you can use a custom dictionary url that breaks down the word in chars? You can open the last used custom dict with shift + click. We could try showing definitions for each character in the word… I’m not sure it will be as useful as a dedicated Chinese dictionary. If you have a suggestion, we can add it to the list of external dictionaries.

‘cat-head-eagle’ :slight_smile:

There’s more to respond to, I’ll get to it soon. Thanks for the detailed feedback.

The pronunciation might change for some characters. Just to give an example:
了 as a particle is most of the time ‘le’, although in ‘了解’ it is ‘liǎo jiě’. Similarly, in some grammatical constructs like 看不了、看得了、去不了、去得了 it is ‘liǎo’. See complement liao.

  • 睡覺 ‘shuì jiào’ vs. 覺得 ‘jué de’ (this is one of the reasons why good word splitting is important)
  • 得 which can be ‘de’ or ‘dei’

Although I would say, having pinyin correct at least for words consisting of multiple characters and maybe at least the 不了/得了 pattern is good enough for the time being.

Ruby Character might have some inspiration.

Many good points were made.
But I believe a better word tokenizer should be the most important task.
As for pinyin, maybe you could figure it out for a word, split it into syllables, which should not be that hard, and then center the syllables above the characters. I think this problem is overrated, however, and with better word splitting it will partly go away. Most Mandarin words are only two syllables anyway.
I’m also not convinced that pinyin below the characters is better, because it will overlap with the English translations, which use the same Roman letters. But different learners could have different expectations.
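The per-character centring idea could be sketched roughly like this. It is a plain-text mock-up with a hand-fed character/syllable pairing (real rendering would be done in CSS or ruby markup, and CJK glyphs are double-width in terminals), so it only illustrates the pairing and centring logic, not pixel-perfect alignment:

```python
# Sketch: pair each character with its own pinyin syllable and centre
# the syllable over a per-character column. The pairing is hand-fed
# here; this shows the layout logic only, not real text rendering.
def annotate(chars, syllables):
    """Return (pinyin_line, char_line) with one column per character."""
    assert len(chars) == len(syllables)
    widths = [max(len(s), 2) for s in syllables]  # 2 ~ CJK glyph width
    pinyin_line = "  ".join(s.center(w) for s, w in zip(syllables, widths))
    char_line = "  ".join(c.center(w) for c, w in zip(chars, widths))
    return pinyin_line, char_line

top, bottom = annotate(list("中心"), ["zhōng", "xīn"])
print(top)
print(bottom)
```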

I’m still trying to digest everything and make an actionable list. I got these so far:

  • Add pinyin to hover dict (if not already displayed), and full dict.
    This will give an option for ‘optional’ pinyin. (Adding to Top Priority)

  • Fix pinyin in saved items. (Adding to Top Priority)

  • Upgrade to jieba as a word tokeniser for Chinese. (This is already in Coming Soon)

  • Pinyin to Anki. (Adding to Top Priority)

  • This format for side panel (Adding to Top Priority)
    zhè wèi wú xiàn zuò de dōng xī jiù shì bù xíng
    这魏无羡做的东西就是不行

  • Integrate some code from Zhongwen extension or cedict dictionary directly. I think this probably should be done at some point, it’s just hard to find time. Chinese learners are a small percent of our users… mostly it’s Koreans/Japanese/Turkish learning English.

Outstanding questions in my mind:

For a ‘blur pinyin’ mode, I think pinyin would need to go below the characters for a practical reason: unblurring lines with the mouse would start with the characters (I assume), then the mouse moves down to the pinyin, then the translations. Procion is not fond of the idea of pinyin below chars. Hmm.

Centre the pinyin alternative directly above (or below) the individual character it relates to. I think Stiivi is saying that there isn’t a 1-to-1 correspondence between chars and pinyin. Here’s the library we are using to get the pinyin for ‘words’: https://github.com/mozillazg/python-pinyin


Actually, it looks like it may be outputting the pinyin for each character individually. I was wondering why the output was nested twice.

I am not fond of the “below” idea either, and would suggest going “below” only if it is technically impossible or impractical to go above. The common typographic practice is above (or on the right side for Bopomofo), so that’s how people are used to it, in my experience so far.

I am not a web developer (just a different kind of software engineer), but I found this W3C spec for ruby characters – which is what we are actually looking for here. I’m not sure how good browser support is for this at the moment, as it seems to be quite new. In that case, the control of blurring or other ways of hiding could be done through CSS. But again, I am definitely not an expert in this domain, so I might be wrong or naive.
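As a sketch of that idea, annotations could be emitted as standard HTML `<ruby>`/`<rt>` markup generated from (character, pinyin) pairs. The pairs here are hand-fed for illustration; blurring or hiding the `<rt>` elements could then be done with plain CSS:

```python
# Sketch: generate HTML <ruby> markup from (character, pinyin) pairs.
# <ruby>/<rt> is the standard HTML way to place annotations above CJK
# text; hiding or blurring the <rt> elements is then a CSS concern.
from html import escape

def to_ruby(pairs):
    return "".join(
        f"<ruby>{escape(ch)}<rt>{escape(py)}</rt></ruby>"
        for ch, py in pairs
    )

print(to_ruby([("中", "zhōng"), ("心", "xīn")]))
# <ruby>中<rt>zhōng</rt></ruby><ruby>心<rt>xīn</rt></ruby>
```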

Btw another reason for going “above” is that you can reuse the layout not only for pinyin but also for tones and one would be able to display just tones (those diacritic marks, just with larger font) centered over characters, without pinyin. This is very useful for Mandarin learners. (I’m even doing it outside of school while reading a book just to remember/emphasise the tone I forgot or want to learn, as the tone makes as much difference as the sound itself)

See the article “Focusing on Mandarin tones without being distracted by Pinyin”.

Re 1-to-1: it is only about the pronunciation, and it is always one syllable per character. The pronunciation can mostly be resolved if a dictionary is provided, and ambiguity is not that frequent outside of the grammatical patterns (like (得 | 不) + 了) mentioned in my post above.

Did you try those phrases using pypinyin? Try for example 我吃不了 and see if it gets ‘liao’ (correct) or ‘le’ (incorrect) for the last character. The last three characters are not technically a single word that can be looked up in a dictionary, they are three tokens: ((VERB), (得 | 不), (了)).
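If pypinyin gets this wrong, one stopgap could be a small post-processing rule covering just the (得 | 不) + 了 pattern. This is a hand-rolled heuristic sketch over per-character readings, not a general heteronym solution:

```python
# Sketch of a post-processing rule for the (得 | 不) + 了 pattern:
# if 了 directly follows 得 or 不, read it as "liǎo" instead of the
# default "le". Illustrative heuristic only, covering just the
# grammatical cases discussed above.
def fix_liao(chars, readings):
    """chars: list of characters; readings: default pinyin per char."""
    fixed = list(readings)
    for i, ch in enumerate(chars):
        if ch == "了" and i > 0 and chars[i - 1] in ("得", "不"):
            fixed[i] = "liǎo"
    return fixed

chars = list("我吃不了")
print(fix_liao(chars, ["wǒ", "chī", "bù", "le"]))
# ['wǒ', 'chī', 'bù', 'liǎo']
```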


p.s.: I believe there might be more Mandarin learners coming for the plugin :wink: Whenever I mention this plugin to people around me who study Mandarin, they get quite excited, although the word splitting was quite a reason for not bothering much.


Because of the possibility of different pronunciations:

>>> pypinyin.pinyin("了", heteronym=True)
[['le', 'liǎo', 'liào']]

So the output of the function is a list of lists of potential variants of the character.

Thanks for this info. We are finishing overhauling some backend code, Ognjen is working on the ‘Top Priority’ items above today.


I checked the article: https://www.hackingchinese.com/focusing-on-mandarin-tones-without-being-distracted-by-pinyin/

We could show tones by stripping the number from the end of the pinyin, like this:

pinyin('中心', style=Style.TONE3, heteronym=True)
[['zhong1', 'zhong4'], ['xin1']]
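Extracting just the tone number from numbered (Style.TONE3-style) syllables is then trivial. A small sketch operating directly on the numbered strings, with the unnumbered case assumed to be the neutral tone:

```python
# Sketch: given numbered pinyin syllables (like pypinyin's
# Style.TONE3 output), keep just the trailing tone number.
# A syllable with no trailing digit is treated as neutral tone (0).
def tone_of(numbered_syllable):
    last = numbered_syllable[-1]
    return int(last) if last.isdigit() else 0

print([tone_of(s) for s in ["zhong1", "xin1", "le"]])
# [1, 1, 0]
```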

This morning I hooked up this word tokeniser… it’s new and fancy and looks the business: https://github.com/lancopku/pkuseg-python

The model is only trained for Simplified Chinese, it seems. Does anyone need Traditional Chinese? I think jieba still gives better results for that.



(will go live in a few days)


Yes, definitely. It is used primarily in Taiwan, where quite a significant number of foreigners study Mandarin (me included). I also know some Chinese learners who are studying elsewhere using simplified characters yet are interested in knowing traditional as well.

Edit: don’t get confused by the English name “Traditional”; it does not imply some historical language (think Latin or Ancient Greek), just that the place where it is used did not go through the script reform in the middle of the last century.

Wow wow wow… LLN is amazing. I just got it and wish I found it sooner. Very very impressed.

I was searching for transliteration support for Thai + Korean, and came across this thread. The Chinese / Japanese functionalities are great; I think I learned more in the past 30 minutes with this than I ever did watching Netflix / YouTube in a foreign language.

Korean + Thai transliteration would also be amazing. I’m not sure how easy or challenging it is to implement, but I am guessing there is a large kdrama audience. Learning to read Korean is certainly easier than Chinese or Japanese, but it would be pretty amazing to see the same functionality for the same reason.

Thai is a relatively hard language to learn to read, and I believe there is huge demand for being able to watch Thai series with English subs. Being able to learn the language on top would be an amazing bonus.

I would love to see how I can help support or contribute to making this happen :slight_smile: Amazing job David!
