Automated topic extraction with Latent Dirichlet Allocation

Last semester, I became aware of a really interesting NLP algorithm called LDA, short for “Latent Dirichlet Allocation.” The goal of the algorithm is to discover a set of “topics,” each of which represents a specific human-interpretable concept.

The core assumption of LDA is that documents are generated as follows:

  1. For each document:
    1. Generate a distribution over topics based on model parameters
    2. For each word in the document:
      1. Sample a topic from the document’s topic distribution
      2. Sample a word from that topic’s distribution over words
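
To make the steps above concrete, here is a minimal sketch of the generative story using only the Python standard library. The tiny vocabulary, the hyperparameter values, and the helper names (`sample_dirichlet`, `generate_document`) are assumptions made up for illustration. In full LDA the per-topic word distributions are themselves drawn from a Dirichlet prior, which is what the sketch does before generating a document.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, doc_len, alpha, rng):
    """Generate one document by following the LDA generative story."""
    num_topics = len(topic_word)
    # Step 1: draw this document's distribution over topics.
    theta = sample_dirichlet(alpha, num_topics, rng)
    words = []
    for _ in range(doc_len):
        # Step 2.1: sample a topic from the document's topic distribution.
        z = rng.choices(range(num_topics), weights=theta)[0]
        # Step 2.2: sample a word from that topic's distribution over words.
        words.append(rng.choices(vocab, weights=topic_word[z])[0])
    return theta, words

rng = random.Random(0)
# Hypothetical toy vocabulary and hyperparameters, chosen only for illustration.
vocab = ["game", "run", "hit", "soviet", "radiation", "test"]
topic_word = [sample_dirichlet(0.5, len(vocab), rng) for _ in range(2)]
theta, doc = generate_document(topic_word, vocab, doc_len=10, alpha=0.5, rng=rng)
print("topic mixture:", theta)
print("document:", " ".join(doc))
```

Inference runs this story in reverse: given only the words, it tries to recover plausible values for `theta` and `topic_word`.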

The idea is that there is some latent variable (the topic) that informs word choice.

Unfortunately, the document->topic and topic->word distributions are intractable to compute exactly, but it is possible to approximate them using Gibbs sampling or variational inference, approximation techniques which can eventually converge to something close to the true solution (insofar as such a thing exists). These techniques have the side-effect of being slow, so the algorithm is not exactly the most efficient. Compared to training a neural network, though, it’s not actually unreasonable.
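
For a sense of what the Gibbs sampling approach looks like, here is a minimal sketch of a collapsed Gibbs sampler in plain Python. It is not the code behind the results in this post. The function name `gibbs_lda`, the toy corpus, and the values of `alpha`, `beta`, and `iters` are all assumptions for illustration. Each sweep removes one token's topic assignment, computes how likely each topic is given all the other assignments, and resamples.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # tokens in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # times word w is assigned to topic k
    topic_total = [0] * num_topics                              # total tokens assigned to topic k
    assign = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove this token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word

# Hypothetical toy corpus: word ids index into vocab.
vocab = ["game", "hit", "run", "soviet", "test", "radiation"]
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 4, 3, 5]]
doc_topic, topic_word = gibbs_lda(docs, num_topics=2, vocab_size=len(vocab))
for k, counts in enumerate(topic_word):
    top = sorted(range(len(vocab)), key=lambda w: -counts[w])[:3]
    print("topic", k, [vocab[w] for w in top])
```

The nested loop over every token in every sweep is where the slowness comes from: the cost grows with the number of tokens times the number of topics times the number of iterations.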

Here are some results from running the algorithm on a dataset of news articles, where each line is a separate topic. See if you can figure out what each topic is about:

command, field, one, marshal, last, general, boy, first, slim, perform

belgian, congo, government, independent, u.n., congolese, lumumba, nation, one, province

sept., oct., new, lake, 23, color, first, fall, river, season

run, two, mantle, one, home, game, mariners, record, hit, play

would, effect, generation, fallout, megaton, radiation, test, soviet, human, government

law, anti-trust, union, labor, act, price, manufacture, company, collect, bargain

And from a combination of science fiction and scientific papers:

brown, soil, detergent, wash, surface, john, house, daily, provide

casework, service, prevent, family, treatment, use, interview, care, time, help

company, questionnaire, list, manchester, store, busy, mail, plant, downtown

food, radiation, object, cost, meat, process, irradiate, product, visual, refrigerate

And finally, the SCP Foundation. One thing to note is that I didn’t do as much data cleaning or parameter selection as I did with the previous datasets, so the quality could be better. I’ll fine-tune the results later.

film, dust, character, neutrino, drug, pill, site-64t, supplemen, text, brand

photograph, photo, cognitohazard, photographed, effect, viewer, ascension, baxter, epsilon, ride

mr., kiryu, d-13321, head, dr., mad, corrupted, little, penny, rust

playback, inside, data, ritual, becam, went, object, orbit, contain, unit

tlaloc, station, turkey, none, d-69601:, dubcek, albright, materialized:, stop., deities

These are just a few examples, but you can see how easily interpretable the results are with basically no human intervention or annotation. I’m hoping to apply this to some other datasets in the near future to see what sort of results I get.
