What word list are you using for Spanish?

David_Wilkinson · July 10, 2020, 4:02pm

Hey. So the frequency lists come from subtitles (http://opus.nlpl.eu/OpenSubtitles-v2018.php). I processed 10,000 subtitles for each language, tokenised the subs into words, then lemmatised. 5,000… was not really enough to have a good list for top 10,000 frequent words… 10,000 subtitles is kind of ok, but more is better I think, I just got a bit impatient to get the feature online, I will run more subs through the script sometime soon.

Here’s the top 30k items for English: https://extension.dioco.io/en.txt
and Spanish: https://extension.dioco.io/es.txt

You can see lemmatisation isn’t perfect, when I have time, I’d make a mapping of bad lemmas to correct lemmas for the top ~3000 words.

AntConc looks like an interesting tool… I’ll take a look at that.

The issue with “tres” (three) was a bug actually. I tried to exclude numbers when making the frequency list, but I forgot to add the code to exclude them from the frequency logic on the front end (the extension). It should be fixed now.

Topic		Replies	Views
frequency dictionary created by "Language Reactor" Ask the community	2	818	October 17, 2023
Source of the Corpus used Ask the community	1	271	October 20, 2023
Word Highlighting / Suggested Words based on Word Frequency	6	2625	June 13, 2020
Request- Beginner Learner Option: Show reduced subtitles (every sentence, extension just picks one or two main nouns or verbs to show) Request	3	63	December 1, 2024
Feature TODO List and roadmap (continuously updated) News from the Team	59	87298	March 2, 2022

What word list are you using for Spanish?

Related topics