(Chinese) LLN seems to incorrectly parse individual words. Is there any way to override this?

Example:

In this scene, the word I want to highlight is 礼服 (formal attire). LLN parses the section 买礼服 (buy formal attire) as 【买礼】+【服】 instead of 【买】+【礼服】. Since Chinese has no spaces, I totally get why picking out individual words is difficult (although in this case it’s a clear error, because 买礼 is not a word). Is there a way to override this, that is, to highlight the two characters 礼服 and tell LLN that they form a whole word?

I would say that there is an error like this in around 10% of all Chinese subs. Another really common one is where a character in somebody’s name gets grouped with the neighbouring characters and messes things up. Again, it is super difficult or impossible to program that behaviour correctly, but an override option would be excellent.
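
To make the request concrete, here is a minimal sketch of how an override could work on top of whatever segmentation LLN already produces. Everything in it is hypothetical: `apply_overrides` and the `overrides` set are names made up for illustration, not anything LLN exposes. The idea is simply that a word the user has marked wins over the tokenizer's boundaries.

```python
def apply_overrides(tokens, overrides):
    """Re-split tokens so that every user-marked word becomes its own token.

    tokens    -- the tokenizer's output, e.g. ["买礼", "服"]
    overrides -- words the user marked as whole words (hypothetical store)
    """
    text = "".join(tokens)

    # Start from the tokenizer's own boundaries, as character offsets.
    cuts = {0}
    pos = 0
    for tok in tokens:
        pos += len(tok)
        cuts.add(pos)

    for word in overrides:
        if not word:
            continue  # an empty override would match everywhere
        start = text.find(word)
        while start != -1:
            end = start + len(word)
            # Drop boundaries inside the word, add boundaries around it.
            cuts -= set(range(start + 1, end))
            cuts |= {start, end}
            start = text.find(word, end)

    bounds = sorted(cuts)
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]


print(apply_overrides(["买礼", "服"], {"礼服"}))  # ['买', '礼服']
```

This assumes overrides do not overlap each other. A real implementation would also need to decide whether an override applies to every subtitle or only to the one where it was marked.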

I am observing this behaviour as well. I would say the error rate is above 10%, especially around very frequent characters such as 我、你、他、不、沒、就、好.

A few more examples:

Subtitle:【妳就直接告訴我】
LLN split:【妳】【就直】【接告】【訴我】
Expected: 【妳】【就】【直接】【告訴】【我】.

Fragment:【看起來好像】
LLN split:【看】【起】【來好】【像】
Expected:【看】【起來】【好像】or【看起來】【好像】

I’m not sure what LLN uses to segment Chinese, but here is a link that might be useful: MDBG dictionary (download). The same dictionary is also used in the Pleco app.

Current behaviour makes LLN quite frustrating to use because:

  • words can’t be translated or bookmarked when they are split across two separate clusters of characters
  • words can’t be translated when they are part of a larger cluster/phrase that LLN treats as one unit
  • bookmarked words are not highlighted in other subtitles where the split happens to be different

Note that the current word split is not 100% consistent across subtitles. In one it might be A BC DE, in another it might be AB CD.


Adding a few more examples, in case they help with analysing the problem:

This one has a longer sequence of words joined together in a translated phrase:

Text:【真的很謝謝妳】
LLN:【真的】【很謝謝妳】
Expected:【真的】【很】【謝謝】【妳】

This one has 3 out of 8 correct:

Text:【你今天應該有帶腦袋出門吧】
LLN:【你】【今天】【應】【該】【有】【帶腦】【袋出】【門吧】
Expected:【你】【今天】【應該】【有】【帶】【腦袋】【出門】【吧】
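
For reference, the expected splits above are what a plain dictionary-based segmenter would give. Below is a minimal sketch of forward maximum matching, a common baseline technique. It is not how LLN works (I don't know what it uses), and the word list is a tiny hypothetical stand-in for a real dictionary such as the MDBG/CC-CEDICT download.

```python
# Hypothetical word list; a real one would be loaded from a dictionary file.
WORDS = {"今天", "應該", "腦袋", "出門", "直接", "告訴", "起來", "看起來", "好像"}
MAX_WORD_LEN = max(len(w) for w in WORDS)


def segment(text):
    """Forward maximum matching: at each position take the longest
    dictionary word, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        longest = min(MAX_WORD_LEN, len(text) - i)
        for length in range(longest, 0, -1):
            chunk = text[i:i + length]
            # A single character is always accepted as a last resort.
            if length == 1 or chunk in WORDS:
                tokens.append(chunk)
                i += length
                break
    return tokens


print(segment("你今天應該有帶腦袋出門吧"))
# ['你', '今天', '應該', '有', '帶', '腦袋', '出門', '吧']
```

Greedy matching like this still makes mistakes when the longest match is not the right one in context, so it is a baseline rather than a complete fix, but it never produces chunks that are not in the dictionary.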

Hi. Chinese didn’t really get the proper NLP processing tools it deserved, as we were lacking time. We did make a better effort for Japanese, because we anticipated anime fans would like using LLN. Pinyin will be added shortly, and I’ll look into better tokenisation.

Could you recommend any Mainland Mandarin (not Taiwanese) content on Netflix?
I plan to learn Mainland Chinese, but almost all the Netflix dramas and cartoons are from Taiwan, and they don’t match simplified Mainland Mandarin subtitles.

I don’t know if things have just changed in 2 years or you just aren’t looking in the right place, but most of what I see available on Netflix these days is from Mainland China. I especially love school days rom coms because the language is the easiest, and the vast majority are from the Mainland. Also all those historical and wuxia ones, but that’s not so great for practical language learning.

I have been color coding the words I know and the words I am learning in different colors, so I can tell at a glance whether I should really work at understanding a sentence 100% or let my brain rest. This inaccurate parsing is definitely a headache. A lot of the “words” aren’t even words when you look them up in the dictionary. I use Pleco with MANY dictionaries, so if it’s not there, it’s very unlikely to be a word in a romcom.

Would the technology behind the “Zhongwen Chinese Popup Dictionary” extension be useful? It is open source on GitHub. It seems to do a good job of parsing, and it gives the range of shortest to longest viable character chunks that could be considered words or phrases (even some chengyu). You could limit parsing to words in the open source CC-CEDICT dictionary or a dictionary of your choice. Then, depending on which character in a string you click or hover over, you would have the option to see and save every definition. For example, 在後面 can be parsed into the words 在 後 面, 在 後面, or 在後 面. The last one doesn’t make sense in context, but these are all real words. If you hover over or click on the 後, you should have the option to see and save the definition for both 後 and 後面. If you click on the 在, you should have the option to see and save the definition for 在 and 在後. It would also eliminate a lot of the nonsensical groupings that LLN is currently giving. This would cause some color coding conflicts, but those could be resolved by prioritizing certain colors over others where there is overlap and/or by preferring the longest character string found in the dictionary.
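
A rough sketch of the lookup I mean, written from the behaviour I see as a user and not taken from the Zhongwen source code. The word set is a hypothetical stand-in for CC-CEDICT, and `candidates_at` is a made-up name.

```python
# Hypothetical mini-dictionary standing in for CC-CEDICT (traditional forms).
WORDS = {"在", "後", "面", "後面", "在後"}
MAX_WORD_LEN = max(len(w) for w in WORDS)


def candidates_at(text, index):
    """Return every dictionary word that starts at text[index], shortest first."""
    found = []
    for length in range(1, MAX_WORD_LEN + 1):
        chunk = text[index:index + length]
        if len(chunk) < length:
            break  # ran past the end of the subtitle
        if chunk in WORDS:
            found.append(chunk)
    return found


print(candidates_at("在後面", 0))  # ['在', '在後']
print(candidates_at("在後面", 1))  # ['後', '後面']
```

Because only dictionary entries are ever returned, a chunk like 买礼 could not show up as a "word", and the user chooses among the candidates instead of depending on a single fixed split.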

That said, this tool has been invaluable for my learning, so thank you so much for your work so far! It is the only reason I even subscribed to Netflix, because of all the soft subs. I just joined the Pro version because color coding what I know, what I’m learning, and what I don’t need to worry about has made a huge difference in learning efficiency. I hope you can make these parsing changes soon to make it even better.
