(Chinese) LLN seems to incorrectly parse individual words. Is there any way to override this?

rjm2109 · April 6, 2020, 12:27pm

Example:

In this scene, the word I want to highlight is 礼服 (formal attire). LLN parses the section 买礼服 (buy formal attire) as 【买礼】+【服】 instead of 【买】+【礼服】. Because Chinese has no spaces, I totally get why picking out individual words is difficult (although in this case it’s a clear error, because 买礼 is not a word). I would love to know if there is a way to override this: highlight the two characters 礼服 and indicate to LLN that this is a whole word.

I would say that there is an error like this in around 10% of all Chinese subs. Another really common one is where a character in somebody’s name gets caught up with other characters and messes things up. Again, this is super difficult or impossible to program correctly, but an override option would be excellent.
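
To make the request concrete, here is a minimal sketch of what such an override could do. It is only an illustration: `apply_override` is a made-up name, and nothing here reflects how LLN is actually implemented. It takes the tokens the parser produced and re-cuts them so that the user-selected word stands alone:

```python
def apply_override(tokens, word):
    """Re-cut `tokens` so that the first occurrence of `word` becomes one token.

    Hypothetical helper for illustration only; not part of LLN.
    """
    text = "".join(tokens)
    start = text.find(word)
    if start == -1:
        return list(tokens)  # word not present, nothing to override
    end = start + len(word)

    result = []
    pos = 0
    inserted = False
    for tok in tokens:
        tok_start, tok_end = pos, pos + len(tok)
        pos = tok_end
        # Tokens entirely outside the selected span are kept unchanged.
        if tok_end <= start or tok_start >= end:
            result.append(tok)
            continue
        # Keep whatever part of this token sits before the selected word.
        if tok_start < start:
            result.append(text[tok_start:start])
        # Emit the selected word exactly once.
        if not inserted:
            result.append(word)
            inserted = True
        # Keep whatever part of this token sits after the selected word.
        if tok_end > end:
            result.append(text[end:tok_end])
    return result


print(apply_override(["买礼", "服"], "礼服"))
```

Leftover pieces of the cut tokens (here 买) are simply kept as their own tokens, so a real feature would probably want to re-run them through the dictionary as well.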

Stiivi · April 23, 2020, 4:50pm

I am observing this behaviour as well. I would say that the error rate is more than 10%, especially around very frequent characters such as 我、你、他、不、沒、就、好.

A few more examples:

Subtitle:【妳就直接告訴我】
LLN split:【妳】【就直】【接告】【訴我】
Expected: 【妳】【就】【直接】【告訴】【我】.

Fragment:【看起來好像】
LLN split:【看】【起】【來好】【像】
Expected:【看】【起來】【好像】or【看起來】【好像】

Stiivi · May 6, 2020, 4:00pm

Not sure what is backing LLN for Chinese, but here is one link that might be useful: MDBG dictionary (download). The dictionary is also used in the Pleco app.

Current behaviour makes LLN quite frustrating to use because:

can’t translate or bookmark a word if it is split across two separate clusters of characters
can’t translate words that are part of a larger cluster/phrase treated as one unit by LLN
bookmarking a word does not highlight it in other subtitles where the split happens to be different

Note that the current word split is not 100% consistent across subtitles. In one it might be A BC DE, in another it might be AB CD.

Adding a few more examples, in case they help with analysing the problem:

This one has a longer sequence of words joined together in a translated phrase:

Text:【真的很謝謝妳】
LLN:【真的】【很謝謝妳】
Expected:【真的】【很】【謝謝】【妳】

This one has 3 out of 8 correct:

Text:【你今天應該有帶腦袋出門吧】
LLN:【你】【今天】【應】【該】【有】【帶腦】【袋出】【門吧】
Expected:【你】【今天】【應該】【有】【帶】【腦袋】【出門】【吧】
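
A plain dictionary lookup would already recover the expected splits in these examples. As a minimal sketch (this is not a claim about how LLN works internally), greedy forward maximum matching takes the longest dictionary word at each position. `TOY_LEXICON` is a hand-written stand-in for a real dictionary such as the MDBG one linked above, and `segment` and `max_len` are made-up names:

```python
# Toy stand-in for a real dictionary; only multi-character words need listing.
TOY_LEXICON = {
    "今天", "應該", "腦袋", "出門", "直接", "告訴",
    "謝謝", "真的", "起來", "好像", "看起來",
}


def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest substring (up to `max_len` characters)
    found in `lexicon`; fall back to a single character if nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


for line in ["妳就直接告訴我", "真的很謝謝妳", "你今天應該有帶腦袋出門吧"]:
    print("".join(f"【{tok}】" for tok in segment(line, TOY_LEXICON)))
```

Greedy matching still gets genuinely ambiguous strings wrong, so it would reduce the errors rather than remove the need for a manual override.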

David_Wilkinson · May 9, 2020, 12:11am

Hi. Chinese didn’t really get the proper NLP processing tools it deserved, as we were lacking time. We did make a better effort for Japanese, because we anticipated anime fans would like using LLN. Pinyin will be added shortly, and I’ll look into better tokenisation.

Bear · October 9, 2020, 4:05am

Could you recommend any mainland Mandarin (not Taiwanese) content on Netflix?
I plan to learn mainland Chinese, but almost all Netflix dramas and cartoons are from Taiwan, and they don’t fit with simplified mainland Mandarin subtitles.

asane · April 19, 2022, 6:03pm

I don’t know if things have just changed in 2 years or you just aren’t looking in the right place, but most of what I see available on Netflix these days is from Mainland China. I especially love school days rom coms because the language is the easiest, and the vast majority are from the Mainland. Also all those historical and wuxia ones, but that’s not so great for practical language learning.

asane · April 19, 2022, 6:33pm

I have been color coding all the words I know and am learning in different colors, so I can tell at a glance whether I should really work at understanding a sentence 100% or let my brain rest. This inaccurate parsing is definitely a headache. A lot of the “words” aren’t even words when you look them up in the dictionary. I use Pleco with MANY dictionaries, so if it’s not there, it’s very unlikely to be a word in a romcom.

Would the technology behind the “Zhongwen Chinese Popup Dictionary” extension be useful? It is open source on GitHub. It seems to do a good job parsing and gives the range of shortest and longest viable character chunks that could be considered words or phrases (even some chengyu). You could limit parsing to only the words in the open source CC-CEDICT dictionary or a dictionary of choice. So depending on which character in a string you click/hover, you will have the option to save and get every definition. For example: 在後面 can be parsed into the words: 在後面, 在後面, or 在後面. The latter doesn’t make sense in context, but they are real words. If you hover or click on the 後, you should have the option to see and save the definition for both 後 and 後面. If you click on the 在, you should have the option to see and save the definition for 在 and 在後. It will also eliminate a lot of the nonsensical groupings that LLR is currently giving. This will cause some color coding conflicts, but those could be resolved by prioritizing certain colors over others where there is overlap and/or by preferring the longest dictionary-approved character string.
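
Roughly what I mean, as a minimal sketch. This is not Zhongwen’s actual code; `words_at` and `TOY_LEXICON` are made-up names, and the small set stands in for CC-CEDICT. For the character under the cursor, list every dictionary word that starts there, shortest to longest:

```python
# Toy stand-in for CC-CEDICT headwords; a real lookup would load the full file.
TOY_LEXICON = {"在", "後", "面", "後面", "在後"}


def words_at(text, index, lexicon, max_len=4):
    """Return all lexicon words starting at `index`, shortest first."""
    found = []
    for size in range(1, min(max_len, len(text) - index) + 1):
        candidate = text[index:index + size]
        if candidate in lexicon:
            found.append(candidate)
    return found


print(words_at("在後面", 0, TOY_LEXICON))  # hovering 在
print(words_at("在後面", 1, TOY_LEXICON))  # hovering 後
```

The user then picks which candidate to save, and the longest match could serve as the default for color coding.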

That said, this tool has been invaluable for my learning, so thank you so much for your work so far! It is the only reason I even subscribed to Netflix, because of all the soft subs. I just joined the Pro version because color coding what I know, what I’m learning, and what I don’t need to worry about has made a huge difference to my learning efficiency. I hope you can make these parsing changes soon to make it even better.
