Parsing chengyu (成語) in Chinese subtitles.

I often use LR to watch Chinese shows with both Mandarin and English subtitles. One thing I have noticed after using it for several hours is that it doesn’t seem to parse chengyu very well. In case you are not familiar with the term, Mandarin Chinese has a TON (many thousands) of idiomatic expressions used in daily conversation, which are almost always exactly four characters long. These strings of four characters form new words that often have a totally different meaning from the individual characters that make them up.

To give one example, the phrase/word 亂七八糟 means a giant mess, or something that is in complete disorder. But the characters in that word don’t necessarily reflect that meaning when you look at them individually. For example, the two characters in the middle are the numbers 7 (七) and 8 (八).

When I use the app, it doesn’t seem to recognize the chengyu and group the four characters together into a single meaning. Instead, it lists the meanings of the individual characters separately.


Is there a way to improve this? I might be able to assist. In addition to being an avid student of Chinese, I’m also an ETL Developer for my day job. If you would like, I could probably generate a machine-readable file (e.g. XML or JSON) of thousands of the most commonly used chengyu and send it to you all, and you could then use it to map the chengyu in the subtitles better. Not sure if that would be helpful or not, but I would just need to know how you all want it formatted so it could be imported into whatever database you are using on the backend.
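To make the idea concrete, here is a rough sketch of what such a file and lookup could look like. Everything in it is my own assumption: the JSON field names (`chengyu`, `pinyin`, `meaning`), the helper names, and the greedy longest-match approach are just for illustration, and I have no idea how LR actually segments subtitles on the backend.

```python
import json

# Hypothetical file format: a JSON list of entries. The field names
# are made up for this example, not anything LR has asked for.
SAMPLE_JSON = """
[
  {"chengyu": "亂七八糟",
   "pinyin": "luàn qī bā zāo",
   "meaning": "a complete mess; in total disorder"}
]
"""


def load_chengyu(text):
    """Return a dict mapping each chengyu string to its full entry."""
    return {entry["chengyu"]: entry for entry in json.loads(text)}


def segment(line, lexicon, max_len=4):
    """Greedy longest-match segmentation (a sketch, not LR's method).

    At each position, try the longest window first and accept it if it
    is in the lexicon; otherwise fall back to a single character.
    """
    tokens = []
    i = 0
    while i < len(line):
        for size in range(min(max_len, len(line) - i), 0, -1):
            piece = line[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


lexicon = load_chengyu(SAMPLE_JSON)
print(segment("房間亂七八糟", lexicon))
# → ['房', '間', '亂七八糟']
```

With only the chengyu list loaded, 房間 is still split into single characters, but 亂七八糟 comes back as one token that can be looked up as a unit. A fuller word list loaded into the same lookup would presumably catch shorter words too.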

Keep up the great work, this is by far my favorite language learning app!


I was just about to make a thread about this problem until I saw this. I’ve only been learning for about 6 months so often I’m not aware if a sentence has chengyu or if it’s just 4 words that I don’t know the meaning of. Often, I get a sense that it is in fact chengyu and then have to type it into Pleco on my phone while watching to confirm.

A larger problem is that I find LR fails to parse even some two-character words. For example, just now 願意 came up and LR didn’t register it as one word meaning ‘to wish/to be willing’ but instead defined the characters individually. I have noticed that when I use Zhongzhong (another Chrome pop-up Chinese dictionary extension) and hover over the same subtitle, it also doesn’t recognise 願意 as a single word. However, if I open the sidebar with the list of subtitles and hover over 願意 again, Zhongzhong recognises it as one word, while LR still doesn’t. The characters in the sidebar view are much closer together, which I guess is why the Zhongzhong extension is then able to recognise them together as a word.

Hopefully the LR dev team takes you up on your offer!
