Wednesday 15 June 2016

Week 2

Hey everyone! Here's a short reflection on what I set out to do over the last month and how it panned out.

My agenda for last month was to complete my normalization PR, finish my doc2vec/word2vec warning PR, code two modules required by the topic coherence API and fix any other bugs I encountered along the way.

The normalization PR has been completed and merged into gensim. Users now have explicit control over which normalization is applied and can normalize documents in-place by creating a "normalization model".
The word2vec/doc2vec warning PR has also been completed and merged. Previously, users had no indication of the mistakes they were making while training doc2vec and word2vec models. Warnings are now raised for some common mistakes.
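To give a feel for what such a warning can look like, here is a minimal sketch in plain Python. It is not the code from the PR: the function name `check_sentences` is hypothetical, and the mistake it checks for (passing a raw string where a list of tokens is expected) is only an assumed example of a "common mistake".

```python
import warnings


def check_sentences(sentences):
    """Hypothetical helper: warn if a sentence is a raw string, not a token list.

    Returns True when every sentence looks like a list of tokens.
    """
    for index, sentence in enumerate(sentences):
        if isinstance(sentence, str):
            # Iterating over a string yields characters, so a model would
            # silently treat each character as a "word".
            warnings.warn(
                "sentence %d is a string; expected a list of tokens" % index,
                UserWarning,
            )
            return False
    return True


# Record warnings instead of printing them so the result can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ok = check_sentences(["the quick brown fox"])

print(ok, [str(w.message) for w in caught])
```

A warning rather than an exception seems like the right fit here, since the input is technically usable and the user may really mean it.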

Coming to my topic coherence project: topic coherence is a measure that quantifies how interpretable a set of topics is to humans. It can therefore be used to judge the quality of a topic modelling algorithm by checking the coherence of the topics it produces. The AKSW group of Michael Röder did a nice comparison of various methods for evaluating topic coherence and came up with this wonderful paper. My project is to add the topic coherence pipeline described in this paper to gensim.
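As a rough illustration of the idea, here is a small sketch of one well-known, document co-occurrence based score (in the style of the UMass measure). It is a simplification for intuition only, not the pipeline from the paper or the code in my PR; the function names and the toy corpus are made up for this example.

```python
import math


def doc_frequency(doc_sets, *words):
    """Count the documents that contain every word in `words`."""
    return sum(1 for doc in doc_sets if all(w in doc for w in words))


def umass_style_coherence(topic_words, docs):
    """Sum log((D(w_i, w_j) + 1) / D(w_j)) over pairs of a topic's top words.

    `topic_words` is ordered from most to least probable; w_j always
    precedes w_i in that ordering. Higher (closer to zero) is more coherent.
    """
    doc_sets = [set(doc) for doc in docs]
    score = 0.0
    for i in range(1, len(topic_words)):
        for j in range(i):
            w_i, w_j = topic_words[i], topic_words[j]
            d_j = doc_frequency(doc_sets, w_j)
            if d_j == 0:
                continue  # skip words that never occur in the corpus
            d_ij = doc_frequency(doc_sets, w_i, w_j)
            score += math.log((d_ij + 1) / d_j)
    return score


# Toy corpus: each document is a list of tokens.
docs = [
    ["human", "computer", "interface"],
    ["computer", "system", "user"],
    ["graph", "trees", "minors"],
    ["graph", "trees"],
    ["human", "system", "computer"],
]

coherent_score = umass_style_coherence(["computer", "human", "system"], docs)
mixed_score = umass_style_coherence(["computer", "graph", "trees"], docs)
print(coherent_score, mixed_score)
```

Words that tend to appear in the same documents score higher together than words drawn from unrelated themes, which is the intuition behind using coherence as a proxy for human interpretability.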

The topic coherence API is also going well. You can check out the open PR here. You can also check out a sample usage of this API here. It is still a work in progress and far from being merged, but it's starting to take shape. My next step is to benchmark it against Palmetto using some popular datasets such as the English Wikipedia. My next blog post will be dedicated to this very interesting part of my project, so stay tuned for more!

Lastly, thanks a lot, Lev and Radim, for the assistance offered for this project! It's been a great experience with Gensim so far!

Thursday 26 May 2016

Getting started

Here's my first post as part of the RaRe Technologies Incubator Programme! Over the course of this summer I will be working on (and hopefully improving) the functionality of gensim, an open source library for topic modelling.

My interest in machine learning and natural language processing started when I took an online course on machine learning by BerkeleyX. I was fascinated by how maths can be used to "learn" a language! Towards the end of last year I also started collaborating with Bhargav Srinivasa (who is currently doing his GSoC with gensim) on building a WhatsApp Chat Analyser, which is hosted on GitHub and is still a work in progress. This year, while searching for tools that could be used out of the box to help me with this project, I came across gensim. I found their work to be stellar! If you're still here and have been hearing a lot about "Deep Learning" (psst Google vs Lee Sedol), you should surely check this out to see what gensim can do! Gensim is a very easy-to-use, robust library which works at lightning-fast speeds. So if you're into natural language processing and haven't used it yet, I strongly recommend you try it. Seriously, topic modelling can't be easier than this.

I started work on gensim by making normalization an explicit transformation (issue #69). The problem was that gensim always applied L2 normalization, and did so implicitly, without giving the user a choice. My first pull request adds an L1 normalization option and lets the user pick which normalization to use, making normalization an "explicit transformation". It also offers the option of passing in a corpus and storing it as a normalized corpus, or of normalizing documents in-place.
My second pull request raises warnings if unexpected input is encountered while using Word2Vec and Doc2Vec. This can make the user more aware of what is going on inside the "box" and can also help them rectify mistakes made when initializing these models.
After I finish work on these two pull requests, I will proceed with my project of adding a four-stage topic coherence pipeline (more on that later!) to gensim.
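To make the L1 versus L2 distinction from the normalization pull request concrete, here is a minimal plain-Python sketch. It is not gensim's API: the `normalize` function is hypothetical, and I'm only assuming a gensim-style sparse document, i.e. a list of `(term_id, weight)` pairs.

```python
import math


def normalize(doc, norm="l2"):
    """Return a copy of `doc` scaled to unit L1 or L2 length.

    `doc` is a sparse vector given as a list of (term_id, weight) pairs.
    """
    if norm == "l1":
        length = sum(abs(weight) for _, weight in doc)
    elif norm == "l2":
        length = math.sqrt(sum(weight * weight for _, weight in doc))
    else:
        raise ValueError("norm must be 'l1' or 'l2', got %r" % norm)
    if length == 0:
        return list(doc)  # nothing to scale in an all-zero document
    return [(term_id, weight / length) for term_id, weight in doc]


doc = [(0, 3.0), (1, 4.0)]
l1_doc = normalize(doc, "l1")  # absolute weights sum to 1
l2_doc = normalize(doc, "l2")  # Euclidean length is 1
print(l1_doc)
print(l2_doc)
```

With L1 the weights read like proportions of the document, while L2 keeps the vector on the unit sphere, which is convenient for cosine similarity. Letting the caller choose is the whole point of making the transformation explicit.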

A big shout out to Radim, Lev and Gordon for helping me out with the PRs and giving me an opportunity to work on this project! It's been a brilliant learning experience so far and hopefully by the end of this project, topic modelling will become even better for humans!