No excuse necessary, Korean is hard!
I. Verbs / Adjectives
Fair warning… this answer starts simple, but then gets much messier.
The simple:
- For verbs, the “dictionary” form ends in “다”.
- To get the “true” root, you just remove that “다”. In other words, for any conjugation, the “다” is removed and a new ending is applied to the remaining stem.
Example:
먹다 (to eat) + 었어 (informal past) = 먹었어 (I ate)
Bad transliteration: mokda + osso = mokosso
Easy!
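In code, the simple case is just string slicing. Here's a minimal Python sketch of it; `attach_ending` is a name I made up for illustration, and it deliberately applies no sound changes at all:

```python
def attach_ending(dictionary_form: str, ending: str) -> str:
    """Drop the final 다 and append the ending, with no sound changes."""
    if not dictionary_form.endswith("다"):
        raise ValueError(f"not a dictionary form: {dictionary_form!r}")
    stem = dictionary_form[:-1]  # 먹다 -> 먹
    return stem + ending


print(attach_ending("먹다", "었어"))  # 먹었어
```

This only gives the right answer when the stem and the ending don't interact, which is the easy case.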
But…
If the remaining verb stem ends in a vowel, the vowel often shifts or merges with the ending.
가다 (to go) + 었어 (informal past) = 갔어 (I went)
Explanation: the ㅓ in the ending flips to ㅏ to match the ㅏ in the stem, and then the two ㅏ merge into a single syllable
Bad transliteration: kada + osso = kasso
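To code a rule like this you have to take the syllable blocks apart, which works because precomposed Hangul syllables are laid out arithmetically in Unicode. Below is a rough sketch, not a complete conjugator: the function names are mine, and it only handles the ㅏ/ㅓ merge, the 하다 case, and the plain 았/었 choice. Other contractions (오다 → 왔어, 마시다 → 마셨어, and so on) and all the irregulars are left out.

```python
HANGUL_BASE = 0xAC00          # code point of 가, the first precomposed syllable
VOWEL_A, VOWEL_EO, VOWEL_O = 0, 4, 8   # indices of ㅏ, ㅓ, ㅗ in Unicode order
TAIL_SS = 20                  # index of final ㅆ


def decompose(syllable: str) -> tuple[int, int, int]:
    """Split one Hangul syllable into (lead, vowel, tail) indices."""
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < 11172:
        raise ValueError(f"not a Hangul syllable: {syllable!r}")
    lead, rest = divmod(index, 21 * 28)
    vowel, tail = divmod(rest, 28)
    return lead, vowel, tail


def compose(lead: int, vowel: int, tail: int) -> str:
    """Inverse of decompose."""
    return chr(HANGUL_BASE + (lead * 21 + vowel) * 28 + tail)


def informal_past(dictionary_form: str) -> str:
    """Informal past for a few regular patterns only (see caveats above)."""
    if not dictionary_form.endswith("다") or len(dictionary_form) < 2:
        raise ValueError(f"not a dictionary form: {dictionary_form!r}")
    stem = dictionary_form[:-1]
    if stem.endswith("하"):                      # 하다 verbs: 하 -> 했
        return stem[:-1] + "했어"
    lead, vowel, tail = decompose(stem[-1])
    if tail == 0 and vowel in (VOWEL_A, VOWEL_EO):
        # Stem vowel and ending vowel merge: just add ㅆ under the last syllable.
        return stem[:-1] + compose(lead, vowel, TAIL_SS) + "어"
    # Otherwise pick 았 after ㅏ/ㅗ and 었 after anything else.
    past = "았" if vowel in (VOWEL_A, VOWEL_O) else "었"
    return stem + past + "어"


print(informal_past("가다"))  # 갔어
print(informal_past("먹다"))  # 먹었어
print(informal_past("받다"))  # 받았어
```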
And sometimes consonants shift too.
쉽다 (to be easy) + 었어 = 쉬웠어 (it was easy)
Explanation: the 다 is removed, then the final ㅂ is dropped from the stem and replaced with ㅜ in the next character, then that ㅜ merges with the ㅓ of the ending to give ㅝ…
Bad transliteration: suipda + osso = suiuosso
(I don’t know the official transliterations, but it’d be something like this, just to give you a feel for it)
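The annoying part of coding this one is that not every stem ending in ㅂ does it (입다 → 입었어 is regular), so you need a word list as well as the rule. A small sketch, assuming a hand-maintained set of irregular verbs; `B_IRREGULAR` here is a tiny illustrative sample rather than a real lexicon, and the handful of verbs that take 와 instead of 워 (돕다 → 도왔어) are not handled:

```python
HANGUL_BASE = 0xAC00
TAIL_B = 17  # index of final ㅂ in Unicode order

# Illustrative sample only; a real list would be much longer.
B_IRREGULAR = {"쉽다", "어렵다", "춥다", "덥다"}


def strip_final_b(syllable: str) -> str:
    """Remove a final ㅂ from one syllable: 쉽 -> 쉬."""
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < 11172 or index % 28 != TAIL_B:
        raise ValueError(f"syllable does not end in ㅂ: {syllable!r}")
    return chr(ord(syllable) - TAIL_B)


def b_irregular_past(dictionary_form: str) -> str:
    """Informal past for verbs listed in B_IRREGULAR."""
    if dictionary_form not in B_IRREGULAR:
        raise ValueError(f"not in the irregular list: {dictionary_form!r}")
    stem = dictionary_form[:-1]                  # 쉽다 -> 쉽
    # Drop ㅂ, then 우 + 었 contracts to 웠.
    return stem[:-1] + strip_final_b(stem[-1]) + "웠어"


print(b_irregular_past("쉽다"))  # 쉬웠어
```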
So, there are 20ish rules like this. Once you code all these rules and exceptions you’re well on your way to learning Korean.
II. You
- 네가 (you) and 내가 (I) are irregulars unrelated to these verbs. There’s no general rule here and no other words do this; it’s just that when you add ‘가’ (a subject marker) to these two, 너 (you) or 나 (I), the vowel shifts for no apparent reason.
So, back to verbs/adjectives…
1. You can just subtract the 다 for an 80% solution. (60%?)
2. You can code an entire morphological analyzer for Korean… which seems like a massive resource burn…
- or -
3. Pick an analyzer/tagger from KoNLPy and learn its API. https://konlpy.org/ko/v0.4.1/morph/
As I’ve thought about this for my own projects, I feel like (3) is the textbook option. These analyzers are pretty big research projects, and still not perfect, so they made me much more humble about my ability to build something like that from scratch. The APIs took a few hours to get the hang of, but they were pretty intuitive. Most of them wrap Java libraries though, so since it’s not pure Python the setup isn’t trivial. Not sure how it would interface with your code; it might mean a messy refactor on your end.
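For a feel of what the KoNLPy route looks like, here’s a rough sketch. The taggers expose a `pos()` method that returns (morpheme, tag) pairs, so getting back to dictionary forms is mostly a filter. The helper names are mine, and I’m assuming the `VV` (verb) and `VA` (adjective) tags that Kkma and Komoran use; other taggers have their own tag sets, so check the tag table for whichever one you pick.

```python
def dictionary_forms(tagger, sentence: str) -> list[str]:
    """Return dictionary forms (stem + 다) of the verbs/adjectives in a sentence.

    `tagger` is anything with a KoNLPy-style pos() method returning
    (morpheme, tag) pairs. The VV/VA tag names are an assumption that
    holds for Kkma and Komoran; adjust for other taggers.
    """
    return [
        morpheme + "다"
        for morpheme, tag in tagger.pos(sentence)
        if tag in ("VV", "VA")
    ]


def load_kkma():
    """Create a Kkma tagger. Needs konlpy installed plus a working JVM."""
    from konlpy.tag import Kkma  # imported lazily because startup is slow
    return Kkma()


# Usage (not run here, since it needs the Java setup):
#   tagger = load_kkma()
#   dictionary_forms(tagger, "밥을 먹었어")
```

Passing the tagger in as an argument keeps the filtering logic independent of which analyzer you end up using, and lets you create the tagger once and reuse it, since starting one up is the slow part.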
Also not sure about speed. Are you preprocessing all the subtitles or doing it live?