What word list are you using for Spanish?

Ernalzar · July 10, 2020, 8:38pm

I was perusing the 30k list you posted… and now I’m more curious.
“Cre” is word 174 and “dijistir” is word 293, “harar” is word 298, “damar” is word 325, “pap” is word 389, “hagar” is 405 and so on, but I haven’t seen these words in my life. Probably errors by the lemmatizer. As I said, trying to lemmatize Spanish is just… not gonna work.

Then I start to see proper nouns: “Michael” as word 249, “George” as word 429, “Jack” as word 466 and “Bob” as word 469, none of which are Spanish names. This confirms that the list comes from English-language movies and TV shows subtitled in Spanish.

“ar” is word 345, which is a country code, which tells me a lot of subtitles were probably translated by an Argentine team and they left their URL at the end of each one.

Just use my list. No need to give me credit (although it would be nice if you credited me as “spanishinput.com”). Using contextual diversity has produced a list so balanced that there’s not a single personal name (no Bob, no Jack, not even Juan or María) among the first 1,000 word forms, and these top 1,000 words cover 78%-79%, on average, of any Netflix TV show or movie from Spain or Latin America.

Because… Netflix was the main source of this list. It’s 96% Netflix. The rest is a few subtitles from Telemundo TV shows I compiled from Telemundo’s YouTube channel, so it’s also relevant to your extensions (some of these shows are on Hulu). Here’s the entire corpus as a zip file so you can use it, too. Each country has its own folder. Each file starts with the country code (AR, CL, CO, ES). Telemundo shows are inside subfolders. Each show has been compiled into a single TXT file, because each show counts as only one context. This prevents long shows (common in Latin America) from having too much influence on the results. Subtitles were cleaned up before converting to text, to remove song lyrics and captions for the deaf.
https://www.spanishinput.com/uploads/1/1/9/0/11905267/netflix_corpus_by_show__2_.zip
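
In case it helps to see what "one show = one context" means in practice, here is a minimal Python sketch of ranking by contextual diversity (the number of shows a word form appears in) instead of raw frequency. It is only an illustration, not the exact script behind the list: the tokenizer, the function names (`tokenize`, `contextual_diversity`, `coverage`) and the three tiny sample "shows" are all made up for the example.

```python
import re
from collections import Counter

# Assumption: a word form is any run of Unicode letters, lowercased.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Split text into lowercase word forms (no lemmatization)."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def contextual_diversity(shows):
    """Count, for each word form, how many shows contain it at least once.

    `shows` maps a show name to the full text of that show, so a long
    show still contributes at most 1 to any word's count.
    """
    cd = Counter()
    for text in shows.values():
        cd.update(set(tokenize(text)))
    return cd


def raw_frequency(shows):
    """Count every occurrence of each word form across all shows."""
    freq = Counter()
    for text in shows.values():
        freq.update(tokenize(text))
    return freq


def coverage(text, top_words):
    """Fraction of the tokens in `text` that belong to `top_words`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(t in top_words for t in tokens) / len(tokens)


# Hypothetical miniature corpus, one string per show.
shows = {
    "AR_show_a": "No sé qué hacer con Bob. Bob no quiere hablar.",
    "ES_show_b": "No quiero hablar de eso. ¿Qué quieres hacer?",
    "CO_show_c": "Bob, Bob, Bob. No sé.",
}

cd = contextual_diversity(shows)
freq = raw_frequency(shows)

# "bob" is said more often than "no" overall, but appears in fewer shows.
print(freq["bob"], freq["no"])  # 5 4
print(cd["bob"], cd["no"])      # 2 3
```

With raw counts, a name repeated many times in one show can outrank a common word; counting shows instead pushes it down, which is likely why names drop out of the top of a contextual diversity list. `coverage` can then be run on any single show with the top-N word forms to get a percentage like the one quoted above.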
