Sunday, November 18, 2012

Non-English language models

Part of Thought Into Design's work was to build language models that measure how similar web pages are when they link to each other. This is in response to Google's recent updates, which appear to penalise irrelevant links for SEO purposes.

The English model was our first focus because that's what the largest share of customers use.

And it took ages to do. As always with statistical work, the time-consuming part was preparing the data to a good enough standard. When creating statistical models, it's easy to jump in, tokenise / stem / lemmatise / whatever the corpus and go ahead, but even a cursory inspection of your dictionary will show nonsense values that should not be there. The critical part is to reduce this noise to a level so low that the good stuff shines through.
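To make the "inspect your dictionary" step concrete, here is a minimal sketch in Python. It is not the tokenising code used on the project (that stays custom and proprietary); the letter-only token pattern, the `min_doc_freq` threshold and the function names are all assumptions chosen for illustration. The idea is simply to count how many documents each token appears in and to set aside the rare ones for a human to look at.

```python
import re
from collections import Counter

# Assumed token pattern: runs of two or more letters (Unicode-aware),
# so digits, punctuation and underscores never enter the dictionary.
WORD_RE = re.compile(r"[^\W\d_]{2,}")


def tokenise(text):
    """Lowercase the text and return its runs of letters."""
    return WORD_RE.findall(text.lower())


def build_dictionary(documents, min_doc_freq=2):
    """Return (kept, dropped).

    kept maps each token to the number of documents it appears in,
    for tokens seen in at least min_doc_freq documents.
    dropped is a sorted list of the remaining tokens, for manual inspection.
    """
    doc_freq = Counter()
    for doc in documents:
        # Use a set so a token counts once per document.
        doc_freq.update(set(tokenise(doc)))
    kept = {tok: n for tok, n in doc_freq.items() if n >= min_doc_freq}
    dropped = sorted(tok for tok in doc_freq if tok not in kept)
    return kept, dropped


if __name__ == "__main__":
    docs = [
        "Fjord og fjell i Norge",
        "Norge har fjord",
        "xyzzy1 click fjell",
    ]
    kept, dropped = build_dictionary(docs)
    print("kept:", sorted(kept))
    print("dropped:", dropped)
```

On a real corpus the dropped list is where the nonsense tends to show up, and reading through it is the quickest way to decide which extra cleaning rules are worth writing. A threshold this crude will also throw away legitimate rare words, so treat it as a starting point for inspection rather than a finished filter.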

In all, it took a couple of months of work to prepare the corpus properly along with some mind-bending custom-written tokenising code and many late nights.

And there was another requirement. The process had to be repeatable. The client a) will need to rerun it every so often for their model to stay up to date and b) will need models in other languages. A large part of the time was sunk into creating scripts that could do all this as a repeatable process for any language.

But now we have it. The English-language models were created in just a few days of intensive analysis, and they work, and work well, using an amalgam of different transforms with some proprietary magic.

I wanted to see how well the process adapted to new languages so I tried to create a model in Norwegian. Why Norwegian? Well, I like Norway lots, particularly the outbreak of peace in response to Breivik's murders, and the corpus was very small compared to English.

The whole process took less than 24 hours, including compiling the corpus. In theory, I can do this for most European languages.

Actually, this is a lie. I need to check the veracity of each dictionary and even a superficial examination requires a good knowledge of the language. I can make estimates but there's nothing like having a handy Norwegian to glance over my shoulder.

But that aside, I can create large working statistical language models in 14 languages (based on the evidence so far!). Further blood, sweat and tears are still needed for Asian languages but that's in hand.