What word list are you using for Spanish?

Hey. The frequency lists come from subtitles (http://opus.nlpl.eu/OpenSubtitles-v2018.php). I processed 10,000 subtitles for each language, tokenised the subs into words, then lemmatised. 5,000 was not really enough to get a good list of the top 10,000 most frequent words. 10,000 subtitles is kind of OK, but more is better, I think. I just got a bit impatient to get the feature online; I will run more subs through the script sometime soon.
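For anyone curious what that kind of pipeline looks like, here is a minimal Python sketch. It is not the actual script: the names `tokenise`, `identity_lemma` and `build_frequency_list` are made up for illustration, and the lemmatiser is a do-nothing placeholder where a real one would be plugged in.

```python
import re
from collections import Counter

# Runs of letters (Unicode-aware), optionally with one internal apostrophe.
# Digits and punctuation never match, so they are dropped here.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?")


def tokenise(text):
    """Split subtitle text into lowercase word tokens."""
    return [m.group(0).lower() for m in TOKEN_RE.finditer(text)]


def identity_lemma(word):
    """Placeholder lemmatiser (assumption): returns the word unchanged."""
    return word


def build_frequency_list(subtitles, lemmatise=identity_lemma, top_n=10_000):
    """Count lemmas across all subtitle texts and return the top_n lemmas."""
    counts = Counter()
    for text in subtitles:
        counts.update(lemmatise(tok) for tok in tokenise(text))
    return [lemma for lemma, _ in counts.most_common(top_n)]
```

With a placeholder lemmatiser this just counts surface forms, so the quality of the final list depends almost entirely on which lemmatiser is swapped in and how much text goes through it.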

Here’s the top 30k items for English: https://extension.dioco.io/en.txt
and Spanish: https://extension.dioco.io/es.txt

You can see lemmatisation isn’t perfect. When I have time, I’ll make a mapping of bad lemmas to correct lemmas for the top ~3000 words.
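A correction mapping like that could be as simple as a dictionary applied after counting. The sketch below is only an assumption about how it might work; `merge_counts` is a hypothetical helper and the entries in `LEMMA_FIXES` are invented examples, not real errors taken from the lists.

```python
from collections import Counter

# Hypothetical corrections: bad lemma -> correct lemma (invented examples).
LEMMA_FIXES = {
    "vamo": "ir",
    "tien": "tener",
}


def merge_counts(counts, fixes=LEMMA_FIXES):
    """Fold the count of each bad lemma into its corrected lemma."""
    merged = Counter()
    for lemma, n in counts.items():
        merged[fixes.get(lemma, lemma)] += n
    return merged
```

Merging the counts rather than just renaming entries matters because the bad lemma and the correct one usually both appear in the list, and their frequencies should be added together before ranking.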

AntConc looks like an interesting tool… I’ll take a look at that.

The issue with “tres” (three) was actually a bug. I tried to exclude numbers when making the frequency list, but I forgot to add the code to exclude them from the frequency logic on the front end (the extension). It should be fixed now.
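One way to avoid this kind of mismatch is to have the list builder and the lookup share a single exclusion check. This is a sketch in Python for illustration only (the extension's front end is a separate codebase and is not shown here); `EXCLUDED_WORDS`, `is_excluded`, `build_ranks` and `word_rank` are hypothetical names, and the word set is a tiny invented sample.

```python
# Hypothetical sample of number words to exclude (Spanish, illustrative only).
EXCLUDED_WORDS = frozenset({"uno", "dos", "tres", "cuatro", "cinco"})


def is_excluded(word):
    """True for digit strings and for words in the exclusion set."""
    return word.isdigit() or word.lower() in EXCLUDED_WORDS


def build_ranks(freq_list):
    """Map each non-excluded word to its 1-based rank in the list."""
    kept = [w for w in freq_list if not is_excluded(w)]
    return {w: i + 1 for i, w in enumerate(kept)}


def word_rank(word, ranks):
    """Look up a word's rank; excluded or unknown words return None."""
    if is_excluded(word):
        return None
    return ranks.get(word.lower())
```

Because `word_rank` applies the same check that `build_ranks` does, an excluded word such as "tres" is treated as excluded on both sides instead of being mistaken for a rare word that is simply missing from the list.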
