Small QoL recommendations

Fantastic extension! Really glad I subbed.

It’s great out of the box, but there are just a few things that might improve the experience a bit.

It’d be really nice to have a shortcut to replay the current subtitle, instead of just forward and back. (I guess a workaround is just hitting both arrows, so this isn’t a showstopper.)

I’d like to be able to re-hide subtitles after my mouse accidentally drifts over them (maybe they are blurred again after the mouse moves away? Or an option to make them visible only on a specific keypress and not on mouseover?).

I’d also like some way to import lists to mark words different colors. And an ability to export just the words that have been marked. (Currently if I go to saved items > green words > export, I get all the subs those words appear in, which is great usually, but sometimes I really just want the wordlist.)

With Korean and other highly inflected languages, being able to identify roots that take different endings would be really useful. I’m sure it would be complicated though. If you know “run” maybe that includes knowing “running” and “ran,” but Korean has some long compounds built off of words with endings and I’m not sure where the line would be. German is another tough one, less stacking of particles and more just massive compound words.

Agree with the other comments, subs2srs + morphman support or emulation would be a god feature. You might not be able to let people grab AV from Netflix for legal reasons (definitely be careful of that so the extension stays up!). But if the exports included timestamps for the start and end of saved subs, it might be a good halfway step (people might be able to get the rest of the way with some other tools or methods, I don’t know). Might be easier to implement in the short run too.

Thanks for all you’re doing, great extension.

Oh, also, the ability to add alternate or suggested translations for words. Maybe one of the dictionaries could be “User suggestions.”

검사님 comes out as “check +” on hover. The show’s about prosecutors, they clearly mean prosecutor here, I’d like to just add that translation over the top for this show. Names too, a lot of names get translated literally, I’d like to just write in the romanization for them so it’s easy not to confuse them with other words.

Not sure where you pull the machine translated subtitles from. But the community might even help you gradually improve the machine translated subtitles this way.

Don’t you just hit the down arrow key to repeat the current sub?

Yeah, I also realized ‘s’ can repeat the current one, but it doesn’t re-hide the subtitle, so I still end up mashing back and forward together. That one’s not a showstopper really.

Importing/exporting lists of words to mark,
Identifying roots of words or word families,
Adding suggested translations,
and more ways to create decks for Anki would be incredible.

Hey, sorry for the late reply. Glad you like the extension.

Or an option to make them visible only on a specific keypress and not on mouseover?

You can use the ‘e’ key, that should work, but there’s no way to deactivate the mouse expose thing…

import lists to mark words different colors

We’re working on something that I think will solve your problem, more or less.

sometimes I really just want the wordlist

There are lots of ways to export data. I’ll make a little user guide, there’s not much info currently.

subs2srs + morphman support

Yeah, bit of a dilemma, don’t want to poke the lion too much. Maybe we can see if there’s some really good TTS. Hoping to add some i+1 type feature soon.

Oh, also, the ability to add alternate or suggested translations for words. Maybe one of the dictionaries could be “User suggestions.”

That could be nice. We considered it for translations, just didn’t have time to implement it.

“check +” on hover

I made a quick fix, will still need to find time to work on Korean dict a bit more (with compound stuff).

Thanks for feedback.

If you’re working on Korean dictionary support, specifically separating out roots from endings, I’m looking to see if I can integrate this in one of my projects: https://konlpy.org/en/latest/

Seems really promising.

Actually, we already have words broken down into roots… but I’m not sure how to handle them. Korean is a bit unique here, it’s problematic for the dictionary and for the word frequency feature.

필요했다

필요


Korean has so many endings, and blurs the line between “grammatical ending” and “compound word.” Could be a nightmare to process. One popular Korean learning resource, “Korean Grammar in Use” is a three book series all about grammatical endings.

So, after thinking about it a hot minute, probably way less than your team, I can imagine a couple approaches:

The “simplify everything” approach: just take the roots and ignore all the endings.

So you’d take this subtitle:

나는 네가 먹고 있는 것을 알았어 (“I knew you were eating”)

and just process it like this:

나 - 너 - 먹다 - 있다 - 것 - 알다

Meaning, if I’ve marked 먹다 (“to eat”) as highlighted, it will mark this form, 먹, as well. Maybe shows the root on hover so I understand what’s going on as a user.
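A minimal sketch of that lookup, assuming each subtitle token already arrives as a (surface form, root) pair. The token list is hand-written for this one sentence and `highlight_by_root` is a made-up name, not anything from the extension:

```python
# Hypothetical token shape: (form as it appears in the subtitle, dictionary form).
# This analysis is written by hand for illustration, not produced by a tagger.
SUBTITLE = [
    ("나는", "나"),
    ("네가", "너"),
    ("먹고", "먹다"),
    ("있는", "있다"),
    ("것을", "것"),
    ("알았어", "알다"),
]


def highlight_by_root(tokens, marked_roots):
    """Return (surface, root, is_highlighted) for every token.

    A token counts as highlighted when its root is in the user's marked
    set, whatever endings happen to be attached to it.
    """
    return [(surface, root, root in marked_roots) for surface, root in tokens]


marked = {"먹다", "알다"}
for surface, root, hit in highlight_by_root(SUBTITLE, marked):
    flag = "*" if hit else " "
    print(f"{flag} {surface}  (root: {root})")
```

Carrying the root along with the surface form is what would let the hover text show where the highlight came from.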

Pros:

  • simplifies processing

  • this is how most people consciously parse sentences anyway, with grammar processing mostly happening instinctively (MIA and AJATT are big proponents of this philosophy)

Cons:

  • Controversial, a lot of people like treating grammar as a first class citizen in language learning

  • False positives: The system could insist a learner knows words that look very strange, that have changed dramatically through the addition of many stacked particles

  • Inconsistent with how you probably handle other languages, and for scalability you probably don’t want too many special cases.

The “completionist” approach: let people highlight grammatical principles too, treat those as separate words.

So if the system encounters 먹고 싶어요 (“I want to eat”)

It breaks it down to 먹다 (“to eat”) + -고 ("-to X") + 싶다 (“to want”) + -아/어요 (“present tense, polite style”).

IFF I’ve tagged both 먹다 AND -고, then the system would highlight “먹고” fully.

If I’ve tagged 먹다 BUT NOT -고, it will only highlight half the word.

If I’ve tagged -고 only, then it will highlight that part only.

In general it will treat roots and endings as separate words.
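Roughly, in code. This is only a sketch: the `WORDS` table is a hand-written analysis of the example phrase (it assumes something upstream already knows which characters belong to which piece), and the bracket notation just stands in for highlighting:

```python
# Hypothetical analysis: each word maps to (surface piece, dictionary form) pairs.
WORDS = {
    "먹고": [("먹", "먹다"), ("고", "-고")],
    "싶어요": [("싶", "싶다"), ("어요", "-아/어요")],
}


def highlight_parts(word, tagged):
    """Wrap every tagged piece in [brackets]; leave untagged pieces bare."""
    out = []
    for piece, lemma in WORDS[word]:
        out.append(f"[{piece}]" if lemma in tagged else piece)
    return "".join(out)


print(highlight_parts("먹고", {"먹다", "-고"}))  # [먹][고]  root and ending both tagged
print(highlight_parts("먹고", {"먹다"}))         # [먹]고    root only
print(highlight_parts("먹고", {"-고"}))          # 먹[고]    ending only
```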

Pros:

  • Accuracy

  • Going back and looking at the subtitles, it seems like your system is already trying to break down words into their components, so you may have a head start on this.

Cons:

  • Complexity. The engine for breaking words down will need a lot of tuning, including maybe just hard coding a lot of corner cases.*

There might be a third way? You guys have moved crazy fast tackling a lot of languages, when a developer could lose years figuring out any one of these, so I’m impressed and sure you’ll find a good path.

Good luck, thanks for building this, definitely worth the subscription. If you need any help testing anything let me know.

Thomas

* Here’s a tough case:

물어보다 is the commonly used word for “to ask.” It’s really 묻다 (“to ask”) plus the -아/어보다 ending, which means “to try to do.”

Nobody says 묻다 (“ask”) because it sounds way too much like 물다 (“to bite”) when conjugated, and you really don’t want people to misunderstand when you say you want to ask them something.

So… do you tag 물어보다 as 묻다 + some ending? I think you just hard code that one as one word, because that’s how people think of it, and how it’s listed in common vocab lists.

좋아하다 (“to like”) is another really common one of these. Technically 좋다 plus an ending, but nobody seems to think of it like that.


Hey, thanks for this info. I’ll look into this shortly.

Hey. I’m looking at this part of the code again as we prepare to roll out some new features (words tab), would be good to sort out the Korean dilemma.

If you see this message, I’d like your opinion. :slight_smile:

The “simplify everything” approach - this is very easy to implement.

The “completionist” approach - this is also easy to implement, with a potential wrinkle. We have the lemma/root/dictionary forms of the endings, but I don’t know if it’s possible to divide the original word neatly and reliably into parts, one for each ending. This is needed for highlighting each part of the word individually.

To use your example:

먹고 싶어요 (“I want to eat”)

먹다 (“to eat”)
고 ("-to X")

싶다 (“to want”)
아/어요 (“present tense, polite style”)

So, the issue is that 먹다 does not appear in the original 먹고, only the first part of it, 먹, followed by 고. We need to know which part of the original word corresponds to each constituent component (we don’t have this info). Perhaps there are some simple rules/transformations, or perhaps it would work to walk through the word from the beginning, switching the highlighting to the next constituent component when we see its first character?
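The walk-through idea (scan the word from the start and switch to the next component when its first character shows up) could look something like the sketch below. `split_by_first_char` is a made-up name, and it assumes the lemmas come in order as plain non-empty strings without a leading hyphen:

```python
def split_by_first_char(surface, lemmas):
    """Greedy split: start a new piece whenever the next lemma's first
    character appears in the surface form."""
    pieces = []
    current = ""
    idx = 0  # index of the lemma the current piece belongs to
    for ch in surface:
        nxt = idx + 1
        if current and nxt < len(lemmas) and ch == lemmas[nxt][0]:
            pieces.append(current)
            current = ""
            idx = nxt
        current += ch
    pieces.append(current)
    return pieces


print(split_by_first_char("먹고", ["먹다", "고"]))      # ['먹', '고']
print(split_by_first_char("먹었어", ["먹다", "었어"]))  # ['먹', '었어']
print(split_by_first_char("갔어", ["가다", "었어"]))    # ['갔어']
```

The last call is the catch: when the ending has merged into the stem's syllable, the ending's first character never appears on its own, so the word comes back as one piece. Comparing `len(pieces)` with `len(lemmas)` would at least flag those cases for a fallback, such as highlighting the whole word.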

In the other example:

나는 네가 먹고 있는 것을 알았어 (“I knew you were eating”)

나 - 너 - 먹다 - 있다 - 것 - 알다

The character 너 does not appear in the original subtitle, it’s kind of merged with its neighbour into a different (unicode) character.

Please excuse my ignorance of Korean. :no_mouth:

No excuse necessary, Korean is hard!

I. Verbs / Adjectives

Fair warning… this answer starts simple, but then gets much messier.

The simple:

  • For verbs, the “dictionary” form ends in “다”

  • To get the “true” root, you just remove that “다” – In other words, for any conjugation, the “다” is removed and a new ending is applied to the remaining stem.

Example:

먹다 (to eat) + 었어 (informal past) = 먹었어 (i ate)
mokda + osso = mokosso

Easy!

But…

If the remaining verb stem ends in a vowel, the vowel often shifts or merges with the endings.

가다 (to go) + 었어 (informal past) = 갔어 (i went)
Explanation: the ㅓ flips to ㅏ in this case
Bad transliteration: kada + osso = kasso

And sometimes consonants shift too.

쉽다 (to be easy) + 었어 = 쉬웠어 (it was easy)
Explanation: the 다 is removed, then the final ㅂ is moved to the next character and replaced with ㅜ, then ㅜ merges with ㅓ…
Bad transliteration: suipda + osso = suiuosso
(i don’t know the official transliterations, but it’d be something like this, just to give you the feel for it)

So, there are 20ish rules like this. Once you code all these rules and exceptions you’re well on your way to learning Korean.
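To see the merging at the character level: each Hangul syllable block is a single Unicode code point, but NFD normalization splits it into its component jamo. The sketch below uses that to test whether a stem (dictionary form minus 다) is a jamo-level prefix of the surface form. The helper names are made up, and this is only a heuristic, not a replacement for the rules above:

```python
import unicodedata


def jamo(text):
    """Decompose Hangul syllable blocks into individual jamo (NFD)."""
    return unicodedata.normalize("NFD", text)


def stem_of(dictionary_form):
    """Drop the final 다 of a verb/adjective dictionary form."""
    return dictionary_form[:-1] if dictionary_form.endswith("다") else dictionary_form


def stem_matches(surface, dictionary_form):
    """True when the stem's jamo are a prefix of the surface form's jamo."""
    return jamo(surface).startswith(jamo(stem_of(dictionary_form)))


for surface, lemma in [("먹었어", "먹다"), ("갔어", "가다"), ("쉬웠어", "쉽다")]:
    print(surface, lemma, stem_matches(surface, lemma))
```

The regular case (먹었어) and the vowel-merge case (갔어) both match, because 갔 still begins with the jamo of 가. The consonant-shift case (쉬웠어) does not, since the ㅂ of 쉽 is gone. A plain prefix test like this can also over-match unrelated words that happen to start with the same jamo.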

II. You

  • 네가 (you) and 내가 (i) are irregulars unrelated to these verbs. There’s not a rule or other words that do this, just when you add ‘가’ (a subject marker) to these two: 너 (you) or 나 (i), the vowel shifts for no apparent reason.

So, back to verbs/adjectives…

  1. You can just subtract the 다 for an 80% solution. (60%?)

  2. You can code an entire morphological analyzer for Korean… which seems like a massive resource burn…

  - or -
  3. Pick an analyzer/tagger from KoNLPy and learn its API. https://konlpy.org/ko/v0.4.1/morph/

As I’ve thought about this for my own projects, I feel like (3) is the textbook option. They’re pretty big research projects, and still not perfect, so they made me much more humble about my ability to build something like that from scratch. The APIs took a few hours to get the hang of, but they were pretty intuitive. They all pull from Java though, so since it’s not pure Python the setup isn’t trivial. Not sure how it would interface with your code, might mean a messy refactor on your end.

Also not sure about speed. Are you preprocessing all the subtitles or doing it live?

We process live when a user opens a video. Processing is done on a pair of Ryzen 3600 boxes running Ubuntu server, takes ~1-3 seconds, and there’s an NGINX cache/load balancer in front of them. The NLP code is various libraries duct-taped together for segmentation, lemmatisation, tagging and transliteration. UDPipe (https://github.com/ufal/udpipe) is the main one, kudos to the guys in Prague. I dumped the code here if you want to have a poke around: https://github.com/hobodrifterdavid/dioco-nlp

I should give KoNLPy a go when I have an afternoon. It looks like it’s the way to go. It seems there’s no quick fix for Korean just now though, so I guess it will go on the back burner until after this round of updates is done. :neutral_face: Thanks for your input, it’s invaluable.

Oh interesting, UDPipe already has a trained Korean model; I’ll look into that, maybe it’s on par with the tokenizers in konlpy.

I’ll try to document the repo a bit in the next couple of weeks, and figure out how to package it as a Docker (or… vagrant?) image, I guess. Some of these tools/libraries are a bit fickle to install… I think distributing an image is the best way to go. Ideally the code would be usable with just an ‘npm install’ or a ‘pip install’, and you could immediately pass in text to process… I’m not sure what the best way to achieve that is, will have to research a bit. There’s a fair amount of back-endy knowledge required to make use of the repo in its current state, would be nice to make it available to the average python/js user. UDPipe does most of the heavy lifting, but we clean up and augment the data in a few ways.

+1 for this. It would be extremely helpful. As you say, it sounds easy to implement and it would be trivial to feed the timestamps into something like ffmpeg. The app itself wouldn’t need to record any audio, so hopefully wouldn’t tread on any toes.
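As a rough sketch of the ffmpeg step, assuming the export gave start and end times in seconds and the user already has a local media file of their own. `clip_command` and the file names are made up; `-i`, `-ss`, `-to` and `-vn` are standard ffmpeg options:

```python
def clip_command(source, start, end, out_path):
    """Build an ffmpeg argument list that cuts start..end (in seconds)
    out of `source` and keeps only the audio."""
    return [
        "ffmpeg",
        "-i", source,
        "-ss", f"{start:.3f}",  # clip start, from the exported timestamp
        "-to", f"{end:.3f}",    # clip end, from the exported timestamp
        "-vn",                  # drop the video stream, keep audio
        out_path,
    ]


cmd = clip_command("episode.mkv", 12.5, 15.25, "line_0001.mp3")
print(" ".join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` to actually cut the clip, one call per saved subtitle.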

I use ←→ to re-hide the subtitle when memorizing a sentence. Would you add a re-hide key, too?