(Chinese) LLN seems to incorrectly parse individual words. Is there any way to override this?

Example:

In this scene, the word I want to highlight is 礼服 (formal attire). LLN parses the section 买礼服 (buy formal attire) as 【买礼】+【服】 instead of 【买】+【礼服】. Since Chinese has no spaces, I totally get why picking out individual words is difficult (although in this case it's a clear error, because 买礼 is not a word). Is there a way to override this, i.e. to highlight the two characters 礼服 and tell LLN that they form a whole word?

I would say there is an error like this in around 10% of all Chinese subs. Another really common one is where a character in somebody's name gets grouped with neighbouring characters and messes things up. Again, that behaviour is super difficult or impossible to program correctly, but an override option would be excellent.
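
To make the request concrete, here is a rough sketch of what such an override could do. It is not how LLN works internally (that is not known here); the function name `apply_override` and the list-of-strings token format are assumptions made purely for illustration. Given an existing split and a word the user has marked, it re-cuts the tokens so that the marked word becomes a single token.

```python
def apply_override(tokens, word):
    """Re-split `tokens` so that the first occurrence of `word` is one token.

    Hypothetical helper: `tokens` is the tokenizer's output as a list of
    strings, `word` is the character sequence the user highlighted.
    """
    text = "".join(tokens)
    start = text.find(word)
    if start == -1:
        return list(tokens)  # word not present, keep the split unchanged
    end = start + len(word)

    result = []
    pos = 0
    inserted = False
    for tok in tokens:
        tok_start, tok_end = pos, pos + len(tok)
        pos = tok_end
        # Tokens entirely outside the overridden span are kept as they are.
        if tok_end <= start or tok_start >= end:
            result.append(tok)
            continue
        # Part of the token that lies before the overridden word.
        left = tok[:max(0, start - tok_start)]
        if left:
            result.append(left)
        # Emit the overridden word exactly once.
        if not inserted:
            result.append(word)
            inserted = True
        # Part of the token that lies after the overridden word.
        right = tok[max(0, end - tok_start):]
        if right:
            result.append(right)
    return result


print(apply_override(["买礼", "服"], "礼服"))
```

This only handles the first occurrence of the word in a line, and the leftover pieces (such as 买 here) are simply kept as separate tokens.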


I am observing this behaviour as well. I would say that the error rate is higher than 10%, especially around very frequent characters such as 我、你、他、不、沒、就、好.

A few more examples:

Subtitle:【妳就直接告訴我】
LLN split:【妳】【就直】【接告】【訴我】
Expected: 【妳】【就】【直接】【告訴】【我】.

Fragment:【看起來好像】
LLN split:【看】【起】【來好】【像】
Expected:【看】【起來】【好像】or【看起來】【好像】

I'm not sure what LLN uses for Chinese under the hood, but here is one link that might be useful: MDBG dictionary (download). The same dictionary is also used by the Pleco app.
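
As an illustration of how a word list like that could help, here is a minimal sketch of greedy forward maximum matching, a simple dictionary-based segmentation technique. It is not a claim about what LLN uses or should use. The tiny `LEXICON` below is a hand-picked stand-in for entries loaded from a real dictionary file, and `segment` and `max_len` are names made up for this example.

```python
# Hand-picked stand-in for a real dictionary's headwords (assumption).
LEXICON = {
    "礼服", "直接", "告訴", "起來", "看起來", "好像",
    "真的", "謝謝", "今天", "應該", "腦袋", "出門",
}


def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest substring (up to `max_len`
    characters) that is in `lexicon`; fall back to a single character.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


for line in ["买礼服", "妳就直接告訴我", "你今天應該有帶腦袋出門吧"]:
    print(" | ".join(segment(line, LEXICON)))
```

With this toy word list the three lines come out as the expected splits from this thread. Greedy matching still makes mistakes on genuinely ambiguous text, and names that are not in the dictionary fall apart into single characters, so it is a baseline rather than a full solution.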

The current behaviour makes LLN quite frustrating to use because:

  • you can't translate or bookmark a word if it is split across two separate clusters of characters
  • you can't translate a word that is part of a larger cluster/phrase that LLN treats as one unit
  • a bookmarked word is not highlighted in other subtitles when the split there happens to be different

Note that the current word split is not 100% consistent across subtitles. The same characters might be split as A BC DE in one subtitle and as AB CD in another.
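
One possible way around the inconsistent splits, at least for highlighting, would be to look up bookmarked words in the raw subtitle text instead of comparing them against tokens. The sketch below only illustrates that idea; `find_bookmarks` and its return format are made up for this example and are not part of LLN.

```python
def find_bookmarks(subtitle, bookmarks):
    """Return sorted (start, end, word) spans for every bookmarked word
    found in the raw subtitle text, independent of any token split."""
    spans = []
    for word in bookmarks:
        start = subtitle.find(word)
        while start != -1:
            spans.append((start, start + len(word), word))
            # Continue searching after this match's first character.
            start = subtitle.find(word, start + 1)
    return sorted(spans)


print(find_bookmarks("妳就直接告訴我", {"直接", "告訴"}))
```

Plain substring search can also produce false hits when a bookmarked word happens to straddle two real words, so it would probably need to be combined with a better tokenizer.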


Adding a few more examples, in case they help with analysing the problem:

This one has a longer sequence of words joined together in a translated phrase:

Text:【真的很謝謝妳】
LLN:【真的】【很謝謝妳】
Expected:【真的】【很】【謝謝】【妳】

This one has 3 out of 8 correct:

Text:【你今天應該有帶腦袋出門吧】
LLN:【你】【今天】【應】【該】【有】【帶腦】【袋出】【門吧】
Expected:【你】【今天】【應該】【有】【帶】【腦袋】【出門】【吧】

Hi. Chinese didn't really get the proper NLP processing tools it deserved, because we were short on time. We did make a better effort for Japanese, because we anticipated anime fans would like using LLN. Pinyin will be added shortly, and I'll look into better tokenisation.


Could you recommend any mainland Mandarin (not Taiwanese) content on Netflix?
I am planning to learn mainland Chinese, but almost all the Netflix dramas and cartoons are from Taiwan, and they don't match the simplified mainland Mandarin subtitles.