Chinese Character/Word Segmentation Issues

Hi, just wanted to reach out about the Chinese character/word segmentation issues I’m seeing quite a lot of. They’re letting down an otherwise awesomely helpful extension.

Here’s a video with an example:

你仔細想一想他是一種動物
LLY will break this up like this:
你仔/細想/一/想/他/是/一/種/動物
Google Translate will break this up like this:
你/仔細/想/一/想/他/是/一種/動物
Also, the English translation LLY provides looks like it’s from the Google Translate API, since it’s the same. So, if the English is the same, why isn’t the Chinese segmentation the same?

Otherwise loving the addon, just hope this can be fixed.

Just found another interesting bug in the segmentation.

對~因為你身體如果是濕的
The first character 對 has a ~ after it, and so it somehow didn’t get parsed and cannot be hovered over; however, the 對 got translated into “Yes” in the English, so it did get parsed there.
Also, Google Translate segments the last three characters as 是/濕/的 but LLY segments them as 是濕/的.
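
To illustrate what I suspect is happening (this is just a guess at the failure mode, not the extension’s actual code): if hover targets are built only from unbroken runs of Chinese characters, the ~ would cut 對 off into its own one-character run, which the hover logic might then drop, while the translation API still sees the full line. A minimal sketch:

```python
import re

# Hypothetical pre-pass: find each unbroken run of CJK characters in a line.
# A stray '~' splits the line into two runs instead of being part of either.
CJK_RUN = re.compile(r'[\u4e00-\u9fff]+')

def cjk_runs(line):
    """Return (start, end, text) for each run of Chinese characters."""
    return [(m.start(), m.end(), m.group()) for m in CJK_RUN.finditer(line)]

print(cjk_runs("對~因為你身體如果是濕的"))
# -> [(0, 1, '對'), (2, 12, '因為你身體如果是濕的')]
```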

Hi @Russell_Sancto, welcome to the forums!

This issue seems heavily related to a report I made elsewhere (which should have been in its own topic); here’s a link to that so we can all keep track of everyone’s feedback:

The ~ issue seems to be a new one.

In addition to the segmentation issue, there’s the problem that there’s really no way to reliably segment a sentence with an algorithm, and even if you do so completely accurately, it’s still desirable for learners to be able to access a breakdown of all the characters within the segment.

Hi @quarridors, it’s true that there’s no perfect algorithmic segmentation. I’m simply curious why the LLY segmentation of the original Chinese sentence doesn’t match the Google Translate segmentation that is produced as part of the process of giving us the English translation of the sentence. If the segmentation of the Chinese sentence were produced using the Google Translate API, it should match up with the segmentation we see when putting the sentence directly into Google Translate. So, I’m guessing that there are two APIs being used here: one for segmenting the subtitle into words so that we can save/hover over the words etc., and one for producing the English/German (etc.) translation.
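
For what it’s worth, here’s a minimal sketch assuming the official google-cloud-translate v2 Python client (and valid API credentials): the translation response carries only the translated text and detected source language, with no word-boundary information, which is why the segmentation would have to come from a separate service.

```python
# Sketch only, assuming the google-cloud-translate package and credentials.
from google.cloud import translate_v2 as translate

client = translate.Client()
result = client.translate("你仔細想一想他是一種動物", target_language="en")
print(result["translatedText"])  # an English sentence, with no token info
print(sorted(result))  # typically ['detectedSourceLanguage', 'input', 'translatedText']
```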

Chinese does have words, but no word boundaries. Most words in Chinese are composed of two characters, and many characters can’t be used by themselves as meaningful words; they only really carry meaning when paired up into two-, three- (etc.) character words. There are some characters that do constitute words by themselves, of course, but most characters don’t behave like this. When you look up a character in a dictionary, some characters will have multiple definitions, and these definitions relate to the words that character appears in.

Hi. I’ve got better Chinese word tokenisers hooked up now (pkuseg for simplified, jieba for traditional). They should go live in the next few days, and things should be better then. There might be some new issues - we will see. :slight_smile:
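
For anyone who wants to preview how these two tokenisers behave on the example sentences from earlier in the thread, a minimal sketch (pip install jieba pkuseg; exact output depends on the models and versions):

```python
import jieba   # used for traditional characters
import pkuseg  # used for simplified characters

# jieba on the traditional-character sentence from this thread
print("/".join(jieba.lcut("你仔細想一想他是一種動物")))

# pkuseg (default mixed-domain model, downloaded on first run)
# on the simplified equivalent
seg = pkuseg.pkuseg()
print("/".join(seg.cut("你仔细想一想他是一种动物")))
```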

@David_Wilkinson that’s great news - looking forward to trying it out. Thanks for all your hard work.

@Russell_Sancto Yes, I don’t disagree with anything you’re saying here, particularly that the segmentation algorithm needed to be replaced with something more consistent with the actual machine translation - I’m very glad that’s happening!

However, I do think that names of people and organisations the algorithm may not recognise - which are very common in drama dialogue - cause a number of complications and far more segmentation failures than usual, as I’ve written about elsewhere.

Even putting aside the names of people, organisations and so on, just things like types of animals, fruits, vegetables etc. are compound words that can become four or more characters long. These can get longer still with electronic devices.

For example, 自动取款机 is a single word, ‘cashpoint’ (or ‘ATM’ in US English), but it’s also three words: automatic/voluntary + withdrawal + machine. And you can segment those characters further into self + moving, take/fetch + funds, machine.

Or, in another example, Brussels sprouts are 球芽甘蓝, a compound word: ball + bud + wild cabbage (which is itself sweet + blue).

Of course many English vegetable names are similar, like ‘eggplant’, ‘sweet potato’ and of course ‘Brussels sprouts’. You’d also want those segmented as single words/names for translation in the other direction, despite being multiple words, because, for example, ‘sweet potato’ translates to 甘薯 (sweet + yam) while ‘potato’ translates to 土豆 (earth + bean).

Languages are messy, organic, highly contextual, subtle things with lots of exceptions to any rules an algorithm might follow, and that’s before you add translation into other languages as a multiplier of complexity.

My point here is that there should be the ability to break down the components of words within segments, as in the sketch below. Partly because even the best segmentation algorithms will sometimes be wrong, and partly because it’s just easier to learn when you also see ‘sweet’ and ‘potato’ separately defined under your definition for ‘sweet potato’, rather than just having the entire segment defined and nothing more.
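
As a toy sketch of what I mean (an illustration only, with a made-up mini-dictionary, not how LLN or any real dictionary app is implemented), a greedy longest-match pass can list the known sub-words and characters beneath a segment:

```python
# Toy illustration with a made-up mini-dictionary: list the sub-words and
# characters beneath a segment by greedy longest-match lookup.
DICT = {
    "自动取款机": "cashpoint / ATM",
    "自动": "automatic; voluntary",
    "取款": "to withdraw money",
    "自": "self",
    "动": "to move",
    "取": "to take / to fetch",
    "款": "funds",
    "机": "machine",
}

def breakdown(segment, max_len):
    """Greedy longest-match split, capped at max_len characters per sub-word."""
    parts, i = [], 0
    while i < len(segment):
        for j in range(min(len(segment), i + max_len), i, -1):
            if segment[i:j] in DICT:
                parts.append((segment[i:j], DICT[segment[i:j]]))
                i = j
                break
        else:
            parts.append((segment[i], "?"))  # unknown character
            i += 1
    return parts

word = "自动取款机"
print(word, "=", DICT[word])                   # the whole segment
for sub, gloss in breakdown(word, max_len=2):  # word level: 自动 / 取款 / 机
    print(" ", sub, "=", gloss)
for ch, gloss in breakdown(word, max_len=1):   # character level
    print("   ", ch, "=", gloss)
```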

If I were integrating this sort of functionality into LLN, I would have it work within the definition panels that appear when a subtitle segment is clicked, defining smaller word chunks below the main definition, and also making the characters in the definition clickable to switch what’s currently being defined. But I’m sure there are many other ways to do this.

I use an excellent free Chrome extension called Zhongwen: Chinese-English Dictionary that does this with any Chinese characters in webpages by hovering on the first character in a chunk. For example, here it is hovering on characters in the preview of this post:


[Screenshot: the Zhongwen popup shown while hovering over characters in the preview of this post]

However, I fully recognise that this might not be a priority for the developers, especially given what I understand is the breakdown of users and their language learning goals.

So I’m mainly just hoping that they make sure that their development decisions don’t break the functionality of the Zhongwen extension and any others like it that users may be making use of for other languages.

@Russell_Sancto I do agree with you that it would be ideal if single characters that have no meaning when used alone were always flagged as such. This is definitely the main weakness of the Zhongwen extension mentioned above. I think the Pleco mobile app presents bound characters much more clearly, but then it isn’t trying to fit everything into hover text.

Having said that though, I would say that in most cases bound characters do tend to have had distinct meanings in ancient forms of Chinese, rather than only taking meaning from their usage in compound words.

So I think a better analogy for definitions of single bound characters is that it’s a little more like listing, below the main definition for ‘telephone’, that ‘tele-’ is a prefix meaning ‘far away’ and that ‘phone’ refers to the sound of a voice, both deriving from ancient Greek words that you don’t need to know and can’t use in English.

Knowing them isn’t harmful, though, as long as you understand that they’re bound syllables. They might actually be useful in helping you understand new words later, say when you see ‘telemedicine’ and ‘phonetics’ for the first time.

I’d say this goes doubly for a language like Chinese, where you can see that it’s definitely the same character and not coincidentally the same syllable, so there’s a high chance that it has the same linguistic root.

There can still be ambiguity, though, over whether a character in a new word is the same as one you know from elsewhere. And of course, loan words from other languages cause problems when the characters used are purely phonetic.

Also, in some cases the character simplification process has caused different characters with distinct meanings to be combined into a single character, as with 面. The measure word for surfaces isn’t meant to be ‘a noodles of’; in traditional characters flour/noodles is 麵, which has 面 on its right side because it’s pronounced the same as surface/plane/aspect (in ancient Chinese this character originally meant ‘face’). Personally I’d have preferred it if they’d simplified 麵 to something like 饣 + 面 (i.e. added the radical for eating/food) - it’s not technically correct by the simplification rules, but it sure is better than arbitrarily combining it with another word.
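
To make the 麵/面 point concrete, a small sketch assuming the opencc-python-reimplemented package: traditional-to-simplified conversion maps both characters onto 面, so the distinction is unrecoverable from the simplified text alone.

```python
# Both 麵 'flour/noodles' and 面 'face/surface' simplify to the same 面,
# so the reverse (simplified-to-traditional) direction is ambiguous.
from opencc import OpenCC

t2s = OpenCC("t2s")
for trad in ["麵", "面", "一碗麵", "表面"]:
    print(trad, "->", t2s.convert(trad))
```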

Ideally a really useful definition list would omit irrelevant definitions when it knows it’s presenting a loan word, or that a particular character in the word is the simplified form of a specific traditional character in this context - that’s probably a bit much to ask of a simple definition panel, though!

@David_Wilkinson That’s fantastic, thanks for looking into a fix! Do let us know when it’s gone live, and I’ll compare against previously reported issues.

Hi, I can confirm after checking back that the issues I reported above, including the 對~ issue and the other segmentation issue, have been fixed, and the segmentation has improved a lot. So many thanks for that!

Hi, David! Just wondering… why not jieba for simplified too, for the sake of consistency? A quick comparison of both tells me neither is perfect, both make mistakes, but jieba is better, at least as jieba is implemented inside SegmentAnt (I downloaded subtitles and segmented them with this software):
https://www.laurenceanthony.net/software/segmentant/