Lemmatisers have been getting better in the last few years; check this project, for example:
It should give correct lemmas for ~95-98% of the words in Spanish subtitles. For frequency counting, it’s not even crucial that the lemma is always correct, only that it’s consistent. In this case, the lemma can be thought of as a kind of ‘hash’: a key that groups related forms together.
You can see the lemmatiser in action here:
I think it’s useful to be able to group different verb forms into one unit. Of course there are pros and cons.
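To make the ‘hash’ idea concrete, here is a minimal sketch of counting by lemma instead of by surface form. The lookup table and the names `TOY_LEMMAS`, `lemmatise` and `lemma_frequencies` are made up for illustration; a real lemmatiser would replace the table, and this is not how our pipeline is actually written.

```python
from collections import Counter

# Hypothetical toy lookup standing in for a real lemmatiser.
TOY_LEMMAS = {
    "dije": "decir",
    "dijiste": "decir",
    "dijo": "decir",
    "digo": "decir",
    "casas": "casa",
}


def lemmatise(token: str) -> str:
    """Return the lemma for a token, falling back to the lowercased token."""
    lowered = token.lower()
    return TOY_LEMMAS.get(lowered, lowered)


def lemma_frequencies(tokens):
    """Count tokens by lemma, so all forms of a verb share one bucket."""
    return Counter(lemmatise(t) for t in tokens)


tokens = ["Dije", "dijo", "casa", "casas", "digo", "dijiste"]
freqs = lemma_frequencies(tokens)
print(freqs.most_common())
```

The counts only depend on every occurrence of a form landing in the same bucket, which is why consistency matters more than correctness here.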
One problem we are dealing with now is that we would like to show the lemmas to the user in their saved items list, but some lemmas will be wrong (‘dijistir’ lol). I think the solution will be to let the user edit the lemma manually. Maybe a better idea will come along. The bigger problem with this approach, in my opinion, is that some words form compounds that have quite different or specific meanings, and we don’t really support handling compounds yet. It should be possible, I suppose. Another problem: a ‘word’ is really just a string of letters that can often have ten quite different usages in different contexts. Some usages are common, some less common. I am not entirely happy with just giving a word in isolation a frequency number… there should be a better way… somehow. But still, word frequency is quite simple and quite useful.
You’re right that many American movies are used for compiling the frequency lists; it’s not something I paid particular attention to. At the moment, the frequency lists are just an approximate indication of the usefulness of the words, and they are used only for highlighting infrequent words in purple. The next step would be to make a better list that would be used in more places in the extension (Og is building a view that lists the lemmas in the movie, ordered by frequency, where you can click on a lemma to see the usage examples in the movie). It’s not good if there are bad lemmas in that list… maybe we will just hide lemmas that we don’t find in the dictionary.
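Roughly, the ‘hide what is not in the dictionary’ idea could look like the sketch below. The function name `visible_lemmas`, the sample counts and the tiny dictionary are all invented for illustration, and sorting most-frequent-first is just an assumption about how the view would be ordered.

```python
from collections import Counter


def visible_lemmas(lemma_counts, dictionary):
    """Keep only lemmas found in the dictionary, most frequent first.

    Ties are broken alphabetically so the order is stable.
    """
    known = [(lemma, n) for lemma, n in lemma_counts.items() if lemma in dictionary]
    return sorted(known, key=lambda pair: (-pair[1], pair[0]))


# Made-up counts for one movie; 'dijistir' is a bad lemma.
counts = Counter({"decir": 12, "dijistir": 1, "casa": 5})
dictionary = {"decir", "casa", "ser"}

result = visible_lemmas(counts, dictionary)
print(result)
```

The trade-off is that a correct lemma missing from the dictionary gets hidden too, so this only works well when dictionary coverage is good.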
Thanks for your offer to use your list. I think you have a good method. We need to support 30+ languages (with limited time), and we already have a certain pipeline that is used by many features. I will look at your method in more detail; I think I can incorporate ‘contextual diversity’, for example, which looks like a good idea.
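As I understand it, contextual diversity counts how many different films (or documents) a lemma appears in, rather than how many times it appears in total. A minimal sketch, with the name `contextual_diversity` and the film data made up for illustration (your actual method may differ in the details):

```python
from collections import Counter


def contextual_diversity(docs):
    """Count, for each lemma, the number of documents it occurs in.

    `docs` maps a document name to its list of lemmas. Each lemma is
    counted at most once per document, however often it is repeated.
    """
    diversity = Counter()
    for lemmas in docs.values():
        diversity.update(set(lemmas))
    return diversity


# Made-up data: 'hobbit' is frequent overall but confined to one film.
docs = {
    "film_a": ["decir", "casa", "decir", "hobbit", "hobbit", "hobbit"],
    "film_b": ["decir", "casa"],
    "film_c": ["decir"],
}

cd = contextual_diversity(docs)
print(cd.most_common())
```

A word repeated many times in a single film scores low here, which is the case where raw frequency can be misleading.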