What you suggested sounds good to me.
Among the frequent words left in white, perhaps it would be useful to offer the option of also marking them by their grammatical function in the sentence (noun, verb, adjective, adverb, preposition). That can help you get familiar with complex new sentence structures and anticipate what each word corresponds to in the translation. I would do that with the underlines you suggested. You could also try applying this to all the words in the sentence rather than just the white ones.
If I select the first 3,000, I would still like a special color to highlight the first 1,000 within those, as frequency matters most by far in that range (beyond it there is little difference in frequency). Maybe slightly bolding them would suffice, which would also keep the number of colors down.
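To make the tiering concrete, here is a rough sketch of how I picture it, not how your code necessarily works. The function name, the tier labels, and the idea that each word comes with a 1-based frequency rank (or None when it isn't in the list) are all my own assumptions.

```python
from typing import Optional

CORE_CUTOFF = 1_000      # the "first 1,000" that I'd like bolded
SELECTED_CUTOFF = 3_000  # the range the user selected

def tier_for_rank(rank: Optional[int],
                  selected_cutoff: int = SELECTED_CUTOFF,
                  core_cutoff: int = CORE_CUTOFF) -> str:
    """Classify a word by its 1-based frequency rank (hypothetical helper).

    Returns "core" for the most frequent words, "selected" for the rest of
    the user's chosen range, and "outside" for everything else, including
    words with no known rank.
    """
    if rank is None or rank < 1 or rank > selected_cutoff:
        return "outside"
    return "core" if rank <= core_cutoff else "selected"

def is_bold(rank: Optional[int]) -> bool:
    """Bold only the core tier, so no extra color is needed."""
    return tier_for_rank(rank) == "core"
```

With the cutoffs above, rank 250 would be bold, rank 2,400 would be in the selected range but not bold, and rank 5,000 would fall outside it.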
Also, perhaps it would be better to color the frequent words purple instead of the infrequent ones, since the frequent ones are the ones we're interested in anyway. When no purple words remain, you will know you have completed phase 1 of your studies. Doing it this way also reduces the amount of purple over time, which is better, so as not to overload the user with too many different colors.
Since white is the default color, if you highlight just the infrequent words, your code is likely to leave behind stragglers it failed to catch. They would stay white, a color that suggests they're frequent when they're not. That is another reason why it would be better to highlight the focus group directly rather than the other way around.
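Here is a small sketch of the straggler problem as I understand it. Everything in it is hypothetical: I'm assuming a lookup that returns a frequency rank, or None for a word it failed to catch, and the function names and the 3,000 cutoff are only for illustration.

```python
from typing import Optional

FOCUS_CUTOFF = 3_000     # assumed size of the frequent "focus group"
DEFAULT_COLOR = "white"  # what a word gets when no rule matches

def color_marking_infrequent(rank: Optional[int]) -> str:
    """Scheme A: highlight the infrequent words, leave frequent ones white."""
    if rank is not None and rank > FOCUS_CUTOFF:
        return "purple"
    # A missed word (rank None) lands here and looks like a frequent word.
    return DEFAULT_COLOR

def color_marking_frequent(rank: Optional[int]) -> str:
    """Scheme B: highlight the focus group directly."""
    if rank is not None and rank <= FOCUS_CUTOFF:
        return "purple"
    # A missed word (rank None) lands here and simply stays unmarked.
    return DEFAULT_COLOR
```

In both schemes a missed word comes out white. The difference is what white means: under scheme A it claims "frequent", which is wrong for an infrequent straggler, while under scheme B it only means "not in the focus group", which is harmless for that same word.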
It actually gives me great pleasure to see it all turn green, as it does now. I quite like it.
If it still bugs you guys, perhaps you could offer two display styles to choose from, to satisfy everyone's preferences.