Last semester, I came across a really interesting NLP algorithm called LDA, short for “Latent Dirichlet Allocation.” The goal of the algorithm is to discover a set of “topics,” each of which ideally represents a specific human-interpretable concept.
The core assumption of LDA is that documents are generated as follows:
- For each document:
  - Generate a distribution over topics based on model parameters
  - For each word in the document:
    - Sample a topic from the document’s topic distribution
    - Sample a word from that topic’s distribution over words
The idea is that there is some latent variable (the topics) that informs word choice.
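To make the generative story concrete, here is a minimal sketch of it in plain Python. Everything in it is an assumption for illustration: the toy vocabulary, the hyperparameters `alpha` and `beta`, and the helper names `sample_dirichlet` and `generate_document` are all made up, and the Dirichlet draw is done by normalizing Gamma samples.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocab, alpha, n_words, rng):
    """Generate one document by following the LDA generative story."""
    n_topics = len(topic_word)
    # Generate this document's distribution over topics.
    theta = sample_dirichlet([alpha] * n_topics, rng)
    words = []
    for _ in range(n_words):
        # Sample a topic from the document's topic distribution.
        z = rng.choices(range(n_topics), weights=theta)[0]
        # Sample a word from that topic's distribution over words.
        words.append(rng.choices(vocab, weights=topic_word[z])[0])
    return theta, words


rng = random.Random(0)
vocab = ["game", "hit", "run", "soviet", "test", "radiation"]  # toy vocabulary
beta = 0.5  # hypothetical prior on each topic's word distribution
# Two hypothetical topics, each a distribution over the vocabulary.
topic_word = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(2)]

theta, doc = generate_document(topic_word, vocab, alpha=0.3, n_words=10, rng=rng)
print("topic mixture:", theta)
print("document:", " ".join(doc))
```

Of course, real LDA runs this in reverse: we only ever see the words, and have to infer the topic mixtures and topic->word distributions that could plausibly have produced them.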
Unfortunately, the document->topic and topic->word distributions can’t be computed exactly, because exact inference is intractable. It is possible to approximate them using Gibbs sampling or variational inference, approximation techniques which allow us to eventually converge to a solution close to the true solution (insofar as such a thing exists). The catch is that both are quite slow, so the algorithm is not exactly the most efficient. Compared to training a neural network, though, it’s not actually unreasonable.
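For a rough feel of what the Gibbs sampling route looks like, here is a minimal sketch of a collapsed Gibbs sampler in plain Python. It is an illustration only, not necessarily how the results in this post were produced: the function names `gibbs_lda` and `top_words`, the hyperparameter values, and the toy corpus are all assumptions. Each pass removes one word's topic assignment, then resamples it in proportion to how much the document likes each topic times how much each topic likes that word.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA. docs is a list of lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # words in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # times word w is assigned to topic k
    topic_total = [0] * n_topics                              # total words assigned to topic k
    assignments = []
    # Start from a random topic assignment for every word.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic for this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                # Resample the topic and put the word back into the counts.
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


def top_words(topic_word, vocab, n=3):
    """Return the n most frequently assigned words for each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
        for row in topic_word
    ]


vocab = ["game", "hit", "run", "soviet", "test", "radiation"]  # toy vocabulary
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 5], [0, 2, 1, 2], [4, 5, 3, 4]]  # word ids
doc_topic, topic_word = gibbs_lda(docs, n_topics=2, vocab_size=len(vocab))
for words in top_words(topic_word, vocab):
    print(", ".join(words))
```

The nested loops over every word in every document, repeated for many iterations, are a big part of why this approach is slow on real corpora.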
Here are some results from running the algorithm on a dataset of news articles, where each line is a separate topic. See if you can figure out what each topic is about:
command, field, one, marshal, last, general, boy, first, slim, perform
belgian, congo, government, independent, u.n., congolese, lumumba, nation, one, province
Sept., oct., new, lake, 23, color, first, fall, river, season
Run, two, mantle, one, home, game, mariners, record, hit, play
Would, effect, generation, fallout, megaton, radiation, test, soviet, human, government
Law, anti-trust, union, labor, act, price, manufacture, company, collect, bargain
And from a combination of science fiction and scientific papers:
brown, soil, detergent, wash, surface, john, house, daily, provide
casework, service, prevent, family, treatment, use, interview, care, time, help
company, questionnaire, list, manchester, store, busy, mail, plant, downtown
food, radiation, object, cost, meat, process, irradiate, product, visual, refrigerate
And finally, the SCP Foundation. One thing to note is that I didn’t do as much data cleaning or parameter selection as I did with the previous datasets, so the quality could be better. I’ll fine-tune the results later.
film, dust, character, neutrino, drug, pill, site-64t, supplemen, text, brand
photograph, photo, cognitohazard, photographed, effect, viewer, ascension, baxter, epsilon, ride
mr., kiryu, d-13321, head , dr., mad, corrupted, little, penny, rust
playback, inside, data, ritual, becam, went, object, orbit, contain, unit
tlaloc, station, turkey, none, d-69601:, dubcek, albright, materialized:, stop., deities
These are just a few examples, but you can see how easily interpretable the results are with basically no human intervention or annotation. I’m hoping to apply this to some other datasets in the near future to see what sort of results I get.