What word list are you using for Spanish?

Ernalzar · July 2, 2020, 12:48am

Hi! I’ve just noticed you’ve added a new feature that highlights less important words for different levels. I’m curious: What list are you using? It’s funny because it considers “tres” (three) not to be among the top 1,500 words, but in my data it’s word number 106.

I have a list of more than 10K words ordered by Contextual Diversity and by frequency. (You should never rely on frequency alone, because if there’s a movie where Finn spends 2 hours shouting “Rey!” then Rey will end up near the top of your list). I created this list from thousands of subtitles from Spanish language TV shows and movies I downloaded from Netflix and then processed using AntConc. This list comes from actual Spanish-language TV shows, unlike the Subtlex list and other lists circulating online, which come from Spanish subtitles of American movies and thus do not reflect real Spanish.

Also, what’s the logic behind 300 words, 800 words, 1500 words? Why not just
Level 1: 1000 words
Level 2: 2000 words
and so on?

I’d happily contribute my list to the project. Feel free to find my contact info on arqui3d.com.

procion · July 6, 2020, 1:04am

I study a different language, but have found the same thing. It looks like the extension get frequencies from a series itself, so proper names are indeed registered as frequent words. I wonder if it takes frequencies from a current episode or from a whole series at once. If it’s only a current episode, I would not trust it too much.
As for levels, each next level is roughly 1.5-2 times more words than previous. It’s a logarithmic scale and it’s much more reliable to measure language level than linear one. I guess, because low levels are easier for learner to acquire, the multiplier is higher at first and than keeps gradually lowering.

David_Wilkinson · July 10, 2020, 4:02pm

Hey. So the frequency lists come from subtitles (http://opus.nlpl.eu/OpenSubtitles-v2018.php). I processed 10,000 subtitles for each language, tokenised the subs into words, then lemmatised. 5,000… was not really enough to have a good list for top 10,000 frequent words… 10,000 subtitles is kind of ok, but more is better I think, I just got a bit impatient to get the feature online, I will run more subs through the script sometime soon.

Here’s the top 30k items for English: https://extension.dioco.io/en.txt
and Spanish: https://extension.dioco.io/es.txt

You can see lemmatisation isn’t perfect, when I have time, I’d make a mapping of bad lemmas to correct lemmas for the top ~3000 words.

AntConc looks like an interesting tool… I’ll take a look at that.

The issue with “tres” (three) was a bug actually. I tried to exclude numbers when making the frequency list, but I forgot to add the code to exclude them from the frequency logic on the front end (the extension). It should be fixed now.

Ernalzar · July 10, 2020, 8:03pm

Thanks for the detailed explanation!

As you know, Spanish lemmatization is… just not there. There are many word forms, including verb forms, that can have completely different things. Besides this, lemmatisation causes both rare and common forms to be lumped together, and when you’re facing several dozen forms per verb, it’s very important to prioritize the most common forms first. (hablas is a LOT more common than hablaríais, but both are lumped together if lemmatized. Some verb forms are actually purely theoretical and only found in grammar discussions). That’s why I’ve stuck to word forms for my list.

I’m curious as to where you found 10,000 Spanish subtitles. I’ve checked the Subtlex project, but they mostly used Spanish translations of American movies (which don’t even match the dubs), so their frequency list does not represent how we actually speak. It represents American thoughts translated to Spanish.

Outside of Netflix, it’s rare to find subtitles for Spanish language stuff. I’ve checked iTunes, Hulu, Amazon, Disney Plus (they have shows from Argentina) and even Pantaya, a service specifically made for the Latin American market, but no luck. We just don’t have a tradition of adding captions for the hard of hearing, as is the case in the United Sates. So Netflix has really been a godsend for Spanish language learners. The other companies just don’t care. An example: When Soy Luna (A Disney show made in Argentina) was on Netflix, it did have Spanish captions. But now that it’s on Disney Plus, it has subtitles in several languages… Except Spanish.

For my list, I only used subtitles from Spanish language TV shows and movies, not translated ones. Most come from Netflix. As I said, I used contextual diversity instead of frequency as the primary ranking criteria, so even though my sample is small, it’s effective. (5,000 episodes/movies which I condensed into 376 text files, one per show and per movie. Some Latin American shows have 300+ episodes!. I wish I could find more shows and movies, but as I said, outside of Netflix it’s a desert, and this extension is for watching Netflix, anyway). Contextual diversity has been shown in research to be a better predictor of lexical decision times than frequency. (Lexical decision times: A type of test that tells us if a native speaker recognizes a word or not. In other words, an experiment that tells us what words native speakers know.)

Feel free to PM me if you’d like to use this list. It would be my pleasure!

Ernalzar · July 10, 2020, 8:14pm

Ok, I think I’ll just leave the list here:
https://www.spanishinput.com/uploads/1/1/9/0/11905267/raw_results_2020-06-17__complete_.ods

I have also compiled the most common (meaning, the ones with the most common contextual diversity) n-grams, up to 7-word n-grams. I’m also compiling usage examples of every single verb form included in the top 1,000 forms list. The examples come from… Netflix shows. This is a WIP, but I’ll let you know when it’s ready.

Ernalzar · July 10, 2020, 8:38pm

I was perusing the 30k list you posted… and now I’m more curious.
“Cre” is word 174 and “dijistir” is word 293, “harar” is word 298, “damar” is word 325, “pap” is word 389, “hagar” is 405 and so on, but I haven’t seen these words in my life. Probably errors by the lemmatizer. As I said, trying to lemmatize Spanish is just… not gonna work.

I start to see proper nouns starting with “Michael” as word 249, “George” as word 429, “Jack” in number 466 and “Bob” as 469, which are not Spanish names. This confirms that this comes from English language movies and TV shows subtitled to Spanish.

“ar” is word 345, which is a country code, which tells me a lot of subtitles were probably translated by an Argentine team and they left their URL at the end of each one.

Just use my list. No need to give me credit (although it would be nice if you credited me as “spanishinput.com”). Using contextual diversity has produced a list so balanced that there’s not a single personal name (No Bob, no Jack, not even Juan or María) among the first 1,000 word forms, and these top 1,000 words cover 78%-79%, in average, of any Netflix TV show or movie from Spain or Latin America. Because… Netflix was the main source of this list. It’s 96% Netflix. The rest is a few subtitles from Telemundo TV shows I compiled from Telemundo’s YouTube channel, so it’s also relevant to your extensions (Some of these shows are on Hulu). Here’s the entire corpus as a zip file so you can use it, too. Each country has its own folder. Each file starts with the country code (AR, CL, CO, ES). Telemundo shows are inside subfolders. Each show has been compiled into a single TXT file, because each show is only one context. This prevents long shows (common in Latin America) from having too much of a influence in the results. Subtitles were cleaned up before converting to text, to remove song lyrics and captions for the deaf.
https://www.spanishinput.com/uploads/1/1/9/0/11905267/netflix_corpus_by_show__2_.zip

David_Wilkinson · July 11, 2020, 3:33am

Lemmatisers are getting better in the last few year, check this project for example:

It should give correct lemmas for ~95-98% of words in the subtitles for Spanish. For frequency, it’s not even crucial that the lemma is always correct, but that it’s consistent. In this case, the lemma is can be thought of as a kind of ‘hash.’

You can see the lemmatiser in action here:

I think it’s a useful to be able to group different verb forms into one unit. Of course there are pros and cons.

One problem we are dealing with now is that we would like to show the lemmas to the user in their saved items list, but some lemmas will be wrong (‘dijistir’ lol). I think the solution will be that the user can edit the lemma manually. Maybe a better idea will come along. The bigger problem with this approach in my opinion is that some words form compounds that have quite different or specific meanings, and we don’t really support handling compounds yet. Should be possible I suppose. Another problem, a ‘word’ is really a string of letters that can often have ten quite different usages, in different contexts. Some usages are common, some less common. I am not entirely happy with just giving a word in isolation a frequency number… there should be a better way… somehow. But still, word frequency is quite simple and quite useful.

You’re right that many American movies are used for compiling the frequency lists, it’s not something I paid particular attention to. At the moment, the frequency lists are just an approximate indication of the usefulness of the words, it’s used for highlighting infrequent words in purple only. Next step would be to make a better list that would be used in more places in the extension (Og is building a view that is a list of lemmas in the movie, ordered by frequency, and you can click on a lemma to see the usage examples in the movie). It’s not good if there are bad lemmas in that list… maybe we will just hide lemmas that we don’t find in the dictionary.

Thanks for your offer to use your list. I think you have a good method. We need to support 30+ languages (with limited time), and we have already a certain pipeline that is used by many features. I will look at your method in more detail, I think I can incorporate ‘contextual diversity’ for example, that looks like a good idea.

Ernalzar · July 11, 2020, 4:15pm

When you do, I have a list of 2, 3, 4, 5 6 and 7 word n-grams I’ve found using AntConc. The most common 2-word n-gram is “lo que”, which could be translated as “what”, or “the thing that”.

It’s true that using lemmas instead of forms seems more logical for students… That is, if they have fully mastered conjugation. But in my 4 years of experience as an italki tutor, a student can be very familiar with a verb and still not be able to recognize it when they see it in a new form. In Spanish, verb forms go beyond what you can find in conjugation tables: They also include forms with attached pronouns, such as “dime”, “decirte”, etc. This means we have tons and tons of forms for each verb. So far I’ve only seen the most advanced students being able to recognize a verb in any form, including with attached pronouns. For beginners and intermediates, it’s best to treat each word form separately. That’s why I’m compiling usage examples trying my best to only include examples that reflect how the language is used in Netflix. Every day I’m surprised at how much my students , even intermediate ones, can learn from this file (WIP):

David_Wilkinson · July 31, 2020, 1:35am

Hey.

Decirte etc. should be handled correctly by the libraries, although there are some details in handling the output correctly.

I made two simple changes when calculating the word lists…

I count the number of occurances of a lemma in a subtitle file, then square root that number before adding to the global count. This means words that appear in more files have more weight. If a word occurs 16 times in one movie, it’s score is sqrt(16) = 4. If it occurs 4 times in 4 movies (total 16 times), the score is 4 x sqrt(4) = 8. This helps with contextual diversity. Could use the cube root instead, would favor it even more.
I check the lemma reported occurs somewhere in the subtites as a word before adding it to the list (Ognjen’s idea). This gets rid of ‘fake lemmas’.

The list has definately improved (new one on the right):

Names sneaking in at 5000, but that’s maybe not bad behaviour:

Still a couple of tweaks to do.

David_Wilkinson · July 31, 2020, 2:08pm

New frequency lists:
https://extension.dioco.io/en_2.txt
https://extension.dioco.io/es_2.txt

(better to right-click and ‘save as’, then open in a local editor, as the browser might give you the wrong file encoding. It’s UTF-8).

Ernalzar · July 31, 2020, 2:40pm

Hi, David. It does look a LOT better! I’ve only noticed six non-words among the top 1,000 in Spanish, and far fewer lemmatization errors.

Sorry for insisting on this, but I would still love to have option to use a list based on word types instead of lemmas. This would be fantastic to practice grammar with the cloze-style flashcards that LLN generates. For example, let’s talk about the verb “ser”, which has many forms completely different from one another. If a student wants to remember that in this particular context he should use “es”, he could right-click on the word to save it. And if he later finds “soy”, “fue”, “éramos”, “sería”, and the like, he can also right-click to save each one of them. He would then be able to generate individual Anki flashcards with the right context for every single one of them. Now this would be a real game-changer for learning Spanish grammar!

If switching everyone to word type-based lists instead of lemma-based lists is not in the cards, maybe the extension could let Pro subscribers upload their own word type-based lists. For example, I’ve just created my own list for Russian based on lots of movies and TV shows that I found mainly on YouTube. Netflix has only 6 titles in Russian!

David_Wilkinson · August 2, 2020, 3:15pm

Here’s something me and Og are prototyping:

You can see the words in the movie, ordered by ‘global’ frequency (frequency in the languge, not the movie). You can click the words to see the subs with that word, mark the words etc.

Next step is to add additional examples from http://tatoeba.org/ and Youtube videos. I tried FluentU the other day and noticed they had this feature already. Our projects have a lot in common actually.

Next step is to show the ‘global’ frequency lists in the cards app: https://cards.dioco.io/ (not implemented yet). When you click a word in the list, you will see examples in a seperate panel.

The step after that is to change the way the flashcards work, so that they are not even really flashcards anymore, but show you examples many sentences with words you are trying to learn. I would like to turn away from the idea of learning lists of words and their translations, and rather use the words in the list as ways to organise and index example sentences, which are the real source of learning. I think this will kind of serve your idea about verbs and their different forms, especially if we add some special logic for verbs.

The holy grail is something I refer to as ‘FACE PUMP’. This is where the user visits a site, and we, uh, pump video (maybe audio too?) content into their face that is appropriately selected, based not only on vocabulary, but also integrating some knowledge of tenses, grammatical structures etc. (perhaps some kind of machine learning, combined with our existing grammatical info, will make this feasible). A kind of special TV channel. There are still lots of questions and work to get to that point.

David_Wilkinson · August 2, 2020, 8:00pm

That’s… kind of where we’re trying to go. How far we get down that road using only code, that’s a question. Your idea to prepare materials specifically for a similar purpose, I think that’s good.

Ernalzar · August 2, 2020, 11:14pm

That sounds amazing! This way you can keep both short term and long term learning in mind!

Yes, I did use FluentU many years ago, when it was called FluentFlix. It was exciting at first, but it quickly grew old due to the lack of proper organization of new videos, and the lack of full-length TV shows and movies. Since then I’ve been looking for something similar that I can use with any mp4 video or any mp3 audio. LaMP, Lingo and Voracious were good, but LLN and LLY are far, far better. (Actually LaMP is the closest to FluentU and LLN).

Back to topic, yes, FluentU’s most exciting feature was the ability to watch lots of examples of the same word from tons of other YouTube videos.

I heard about Tatoeba before for Japanese, but really, LLN’s flashcard capabilities are 1000x better, because when you review you remember the context. (A snapshot would help, too)

Hey, the cards app looks promising! I love Anki, but not everybody if fond of it.

BTW, inside LLN it would be really cool to have each card link to the relevant episode and time, even when I’m watching something completely different. This way I could save a bunch of of phrases I want to discuss with my teacher over Zoom, and then have LLN play the relevant section from the relevant episode.

Yeah, I’m also a bit tired of word lists by themselves, even though they have helped me a lot. That’s why I love the “cloze” style flashcards LLN creates. I remember the context every time I review one. Context is the key.

The only “special logic” for verbs, at least in Spanish, is to treat each word type separately. Spanish verb forms can even include direct and indirect object pronouns, so they can be unrecognizable. If I had a dollar for every time a student was not able to recognize a new form of a verb he “knows”…

The “face pump” idea sounds like what FluentU tries to be. As I said, it was exciting for a few months when I was starting out with Chinese, but then I found it too limiting and preferred to just pick materials myself.

BTW, it would be really cool to resurrect the Lingo player project and give it some of LLN’s features, such as auto-pause and the ASD 12 QE keyboard shortcuts. I actually have lots of mp4s with text-based subtitles at my disposal in hundreds of languages. All translated from the English originals, so I can always use the original subtitles, too. I would totally pay for it and I know my students would, too.

EDIT: Maybe it would be better to give LLN the ability to load mp4s both from your hard drive and from any URL within Chrome. This way it would work in both Windows And MacOS. Some mp4s have embedded text-based subtitles encoded in tx3g format, and both Chrome and Safari can read those subtitles and turn them on and off.

Topic		Replies	Views
frequency dictionary created by "Language Reactor" Ask the community	2	831	October 17, 2023
Source of the Corpus used Ask the community	1	271	October 20, 2023
Word Highlighting / Suggested Words based on Word Frequency	6	2627	June 13, 2020
Request- Beginner Learner Option: Show reduced subtitles (every sentence, extension just picks one or two main nouns or verbs to show) Request	3	63	December 1, 2024
Feature TODO List and roadmap (continuously updated) News from the Team	59	87921	March 2, 2022

What word list are you using for Spanish?

Related topics