March 02, 2012

Near-Miss Words

I do a fair amount of morphological analysis of my corpus of Japanese content. Today I thought up and quickly implemented a new way to order words I should learn next. I call them near-miss words because each one is the single word that I don't understand in an otherwise comprehensible sentence. In other words, they are the missing "+1" in i+1 sentences. Learning near-miss words could be a productive way to fill gaps in my understanding.

I have a lot of modules and scripts written in Perl to help me learn Japanese, but the most important piece is MeCab. MeCab breaks sentences up into words and reduces each one to its dictionary form (i.e. normalizes it). Without it, I would have no way to analyze what I do and do not know.
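To give a sense of what that normalization looks like, here is a minimal Python sketch (not the Perl code mentioned above) that pulls dictionary forms out of MeCab's text output. It assumes the IPADIC-style output format, where each line is the surface form, a tab, and then comma-separated features with the base form in the seventh field; other dictionaries lay their fields out differently. The function name `base_forms` and the hand-written sample string are purely illustrative.

```python
def base_forms(mecab_output):
    """Return the dictionary form of each token in MeCab text output.

    Assumption: IPADIC-style lines of the form
        surface<TAB>feature,feature,...
    with the base form in the seventh feature field.
    """
    forms = []
    for line in mecab_output.splitlines():
        if line == "EOS" or not line.strip():
            continue  # sentence terminator or blank line
        surface, _, features = line.partition("\t")
        fields = features.split(",")
        # Unknown words may have no base form (or "*"); fall back to the surface.
        if len(fields) > 6 and fields[6] != "*":
            forms.append(fields[6])
        else:
            forms.append(surface)
    return forms


# Hand-written sample in the assumed format for 驚いた ("was surprised").
SAMPLE = (
    "驚い\t動詞,自立,*,*,五段・カ行イ音便,連用タ接続,驚く,オドロイ,オドロイ\n"
    "た\t助動詞,*,*,*,特殊・タ,基本形,た,タ,タ\n"
    "EOS\n"
)

print(base_forms(SAMPLE))  # ['驚く', 'た']
```

The inflected 驚い comes back as 驚く, which is what lets a past-tense verb in a sentence match the dictionary form on a flash card.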

Using MeCab, I scan through all of the flash cards that I irregularly review in Anki to produce a list of words I know (currently ~5500). I then use that list to comb through my large (~200K) sentence corpus for sentences in which every word but one is known. Finally, I dump out those unknown words, sorted by the number of times each shows up as a near-miss.
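The counting step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual Perl scripts: sentences are assumed to be already tokenized into dictionary forms, a word repeated within one sentence is treated as a single unknown, and the function name `near_miss_counts` and the tiny sample corpus are made up for the example.

```python
from collections import Counter


def near_miss_counts(sentences, known):
    """Count how often each word is the only unknown word in a sentence.

    sentences: iterable of token lists (dictionary forms).
    known: set of words considered known.
    """
    counts = Counter()
    for tokens in sentences:
        # Distinct unknown words; repeats within a sentence count once.
        unknown = {t for t in tokens if t not in known}
        if len(unknown) == 1:
            counts[unknown.pop()] += 1
    return counts


# Made-up data, for illustration only.
known_words = {"手紙", "を", "もらう", "た", "時", "私", "は"}
corpus = [
    ["手紙", "を", "もらう", "た", "時", "驚く", "た"],  # one unknown: 驚く
    ["私", "は", "驚く", "た"],                          # one unknown: 驚く
    ["私", "は", "震える", "た"],                        # one unknown: 震える
    ["株式会社", "は", "震える", "た"],                  # two unknowns: skipped
]

for word, n in near_miss_counts(corpus, known_words).most_common():
    print(f"{word} ({n}x)")
# 驚く (2x)
# 震える (1x)
```

Sentences with two or more unknown words are skipped entirely, so a word only scores when learning it alone would make the whole sentence comprehensible.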

Top five answers are on the board...

  1. 驚く - marvel (123x)
  2. 株式会社 - corporation (98x)
  3. 震える - tremble (75x)
  4. 同士 - pal (69x)
  5. だいぶ - great (68x)

One example among the 123 sentences for that top word follows. Notice that the verb is in its past-tense form rather than its dictionary form; MeCab's got me covered.

That's why I was surprised when I got the letter!

Pretty soon I will need a meta-analyzer for finding the words that most frequently appear at the tops of these lists.