Korean has a huge number of endings, and it blurs the line between “grammatical ending” and “compound word,” so it could be a nightmare to process. One popular Korean learning resource, “Korean Grammar in Use,” is a three-book series all about grammatical endings.
So, after thinking about it for a hot minute (probably way less than your team has), I can imagine a couple of approaches:
The “simplify everything” approach: just take the roots and ignore all the endings.
So you’d take this subtitle:
나는 네가 먹고 있는 것을 알았어 (“I knew you were eating”)
and just process it like this:
나 - 너 - 먹다 - 있다 - 것 - 알다
Meaning, if I’ve marked 먹다 (“to eat”) as highlighted, it will mark this form, 먹고, as well. Maybe it shows the root on hover so I understand what’s going on as a user.
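To make that concrete, here’s a rough Python sketch of the idea. It assumes some morphological analyzer has already paired each surface form with its root; I’ve just typed the pairs in by hand for the example subtitle, and the names `Token` and `mark_known` are made up for illustration, not anything from your codebase.

```python
from typing import NamedTuple


class Token(NamedTuple):
    surface: str  # the form as it appears in the subtitle
    root: str     # dictionary form (assumed to come from an analyzer)


# Hand-written analysis of the example subtitle. A real system would get
# these pairs from a morphological analyzer, which is not shown here.
SUBTITLE = [
    Token("나는", "나"),
    Token("네가", "너"),
    Token("먹고", "먹다"),
    Token("있는", "있다"),
    Token("것을", "것"),
    Token("알았어", "알다"),
]


def mark_known(tokens, known_roots):
    """Flag every token whose root the learner has marked, ignoring endings."""
    return [(t.surface, t.root, t.root in known_roots) for t in tokens]


marked = mark_known(SUBTITLE, {"먹다"})
for surface, root, known in marked:
    flag = "*" if known else " "
    print(f"{flag} {surface} (root: {root})")
```

The root is kept next to each surface form so the UI could show it on hover.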
Pros:
- Simplifies processing.
- This is how most people consciously parse sentences anyway, with grammar processing mostly happening instinctively (MIA and AJATT are big proponents of this philosophy).
Cons:
- Controversial: a lot of people like treating grammar as a first-class citizen in language learning.
- False positives: the system could insist a learner knows words that look very strange because they have changed dramatically through the addition of many stacked particles.
- Inconsistent with how you probably handle other languages, and for scalability you probably don’t want too many special cases.
The “completionist” approach: let people highlight grammatical principles too, treat those as separate words.
So if the system encounters 먹고 싶어요 (“I want to eat”), it breaks it down to 먹다 (“to eat”) + -고 (“-to X”) + 싶다 (“to want”) + -아/어요 (“present tense, polite style”).
IFF I’ve tagged both 먹다 AND -고, then the system would highlight “먹고” fully.
If I’ve tagged 먹다 BUT NOT -고, it will only highlight half the word.
If I’ve tagged -고 only, then it will highlight that part only.
In general it will treat roots and endings as separate words.
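Here’s a minimal Python sketch of that highlighting rule, just to show what I mean. The breakdown of 먹고 싶어요 is typed in by hand, and `Morpheme` and `render` are hypothetical names. It also assumes every morpheme maps cleanly onto a run of characters, which isn’t always true in Korean (contractions like 가다 + -았어요 becoming 갔어요 would need extra handling).

```python
from typing import NamedTuple


class Morpheme(NamedTuple):
    surface: str  # characters this morpheme contributes to the written word
    entry: str    # the root or ending the learner can tag


# Hand-made breakdown of "먹고 싶어요"; a real analyzer would produce this.
WORDS = [
    [Morpheme("먹", "먹다"), Morpheme("고", "-고")],
    [Morpheme("싶", "싶다"), Morpheme("어요", "-아/어요")],
]


def render(words, tagged):
    """Wrap tagged morphemes in [brackets]; adjacent highlights are merged."""
    rendered = []
    for word in words:
        text = "".join(
            f"[{m.surface}]" if m.entry in tagged else m.surface for m in word
        )
        # "[먹][고]" becomes "[먹고]" so a fully known word reads as one highlight.
        rendered.append(text.replace("][", ""))
    return " ".join(rendered)


print(render(WORDS, {"먹다", "-고"}))  # [먹고] 싶어요
print(render(WORDS, {"먹다"}))         # [먹]고 싶어요
print(render(WORDS, {"-고"}))          # 먹[고] 싶어요
```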
Pros:
- Accuracy.
- Going back and looking at the subtitles, it seems like your system is already trying to break down words into their components, so you may have a head start on this.
Cons:
- Complexity. The engine for breaking words down will need a lot of tuning, including maybe just hard-coding a lot of corner cases.*
There might be a third way? You guys have moved crazy fast tackling a lot of languages, when a developer could lose years figuring out any one of these, so I’m impressed and sure you’ll find a good path.
Good luck, thanks for building this, definitely worth the subscription. If you need any help testing anything let me know.
Thomas
* Here’s a tough case:
물어보다 is the commonly used word for “to ask.” It’s really 묻다 (“to ask”) plus the -아/어보다 ending, which means “to try to do.”
Nobody says 묻다 (“to ask”) on its own because, when conjugated, it sounds way too much like 물다 (“to bite”), and you really don’t want people to misunderstand when you say you want to ask them something.
So… do you tag 물어보다 as 묻다 + some ending? I think you just hard-code that one as one word, because that’s how people think of it, and how it’s listed in common vocab lists.
좋아하다 (“to like”) is another really common one of these. Technically 좋다 plus an ending, but nobody seems to think of it like that.
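The hard-coding could be as simple as an exception list that gets checked before the normal breakdown is used. A rough Python sketch, where `LEXICALIZED` and `entries_for` are names I made up and the decomposition is assumed to come from whatever analyzer you already have:

```python
# Compounds that learners treat as single vocabulary items (illustrative list).
LEXICALIZED = {"물어보다", "좋아하다"}


def entries_for(dictionary_form, decomposition):
    """Return the taggable entries for a word.

    `decomposition` is the root + ending list from an analyzer (assumed).
    Hard-coded compounds override it and stay as one word.
    """
    if dictionary_form in LEXICALIZED:
        return [dictionary_form]
    return decomposition


print(entries_for("물어보다", ["묻다", "-아/어보다"]))  # ['물어보다']
print(entries_for("먹어보다", ["먹다", "-아/어보다"]))  # ['먹다', '-아/어보다']
```

That way 물어보다 stays one word, while a regular “try doing X” form like 먹어보다 still gets split into root and ending.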