Adventures in Reinforcement Learning: Trials and Errors

Over the last few weeks, I’ve been working on learning some basic reinforcement learning models. All my code is available here (warning: basically totally undocumented and uses code from a few sources. It’s not the prettiest).

The Environment

The environment I used was the OpenAI Gym. The Gym is a library created by OpenAI to use as a benchmark for reinforcement learning systems. For my first experiments, I used the CartPole environment:

The goal of this environment is to keep the pole balanced for as long as possible by moving the cart left and right. It has three actions– left, right, and no-op, and four inputs: position and velocity of the cart, and position and velocity of the pole.


{\displaystyle Q(s_{t},a_{t})\leftarrow (1-\alpha )\cdot \underbrace {Q(s_{t},a_{t})} _{\rm {old~value}}+\underbrace {\alpha } _{\rm {learning~rate}}\cdot \overbrace {{\bigg (}\underbrace {r_{t}} _{\rm {reward}}+\underbrace {\gamma } _{\rm {discount~factor}}\cdot \underbrace {\max _{a}Q(s_{t+1},a)} _{\rm {estimate~of~optimal~future~value}}{\bigg )}} ^{\rm {learned~value}}}

Q-learning update equation, from Wikipedia

The first model I implemented was a basic Q-Learning model. Q-learning operates by estimating the expected value of each action given the current state, including both the reward from the immediate next time step and the expected reward from the resulting state.  For my implementation, the Q function is simply a matrix mapping (state, action) -> value, and at any time step it selects the action which has the highest value in the given state, with some small probability of choosing randomly instead (this helps it explore and makes it less likely to get trapped in local minima.

And it learns! Quite quickly, in fact.

A graph showing progress of a Q learning model (x-axis is training iteration, y-axis is how many frames it survived)

This is comparatively simple to implement, but has some problems. Specifically, the state space must be discrete. In order to make it possible to index into this matrix with (state, action) pairs, the state space has to be discrete, In order to do this I binned the parameters provided by the environment (8 bins for each of the cart parameters, 10 bins for each of the pole’s parameters, which is 6400 possible states). There’s a trade-off here, since more states results in a finer-grained model which is better able to comprehend the state of the environment, but takes longer to train because there are more spaces to fill in the matrix, and each one will be encountered less frequently (the state space becomes sparser).

How do we solve this problem? We need to go deeper. 


The only difference between basic Q-Learning and Deep-Q Learning is that the Q function is a neural network instead of a matrix. This allows us to use a continuous state space. The network is trained to produce the same value used above for the more basic Q-Learning model. This model should perform better in theory, because of this.

Unfortunately, this model wasn’t very practical. The Q function estimates the value of each action separately, which means that the model runs several times per iteration. This makes it very slow to train, which makes it kind of miserable to iterate. It was learning, but so slowly, I moved right on to my next experiment for the sake of time and my sanity.

Asynchronous Actor-Critic

The Actor-Critic model improves over Q-learning in two significant ways. First off, it’s asynchronous. Rather than having a single network, it parallelizes the learning across multiple workers which each have their own environment. Each workers’ experiences are used to update the shared network asynchronously, which results in a dramatic improvement in training performance.

Secondly, it doesn’t just output the value (V(s), below), which corresponds to the value of the current state (similar to the Q value above, but it doesn’t consider actions), it also outputs the policy (π(s) below), which is a distribution over all possible actions. Importantly, the network’s own value judgements are used to train its policy- based on the action taken, it compares the value of the new state and updates the policy with respect to the previous state to reflect that.

We can make one more improvement: Instead of using the value estimations and rewards directly, we can calculate the advantage. This represents how much higher the reward was than estimated, which gives a larger weight to places where the network struggles and lets it exploit sparse rewards more effectively.

A diagram of the Asynchronous Actor-Critic model (from here)

I used a different environment than before, moving out of the parameterized space into a more complex environment: Breakout. It’s one of the atari

The Breakout environment

My specific architecture used 2 convolutional layers to parse the images, followed by a single LSTM layer to get time relations, and then a set of fully-connected layers for the policy and value outputs. You can see an early example of it above! This model actually worked very well for me– here’s the tensorboard output from 20 hours of training:

Screenshot from 2018-05-08 18:33:02.png

Graphs showing a steady increase in length, reward, and value

Interestingly, the gradient and variable norm also increased over the course of the training, which suggests that I need to normalize the model more strongly, but given the duration of the test, I’m not sure I want to cross-validate that.

Unfortunately, due to an error which would very rarely not terminate a training episode, the workers all died at the end and my computer ran out of memory as the episodes dragged on forever (alas). However, it did still seem to be improving when it died, and I still have the model, so I plan to continue training and see just how good it eventually gets. I also made a few improvements over that first try above, which you can see in the sample below:


A black-and-white, cropped version of Breakout

As you can probably see, the images have been pre-processed to crop out the score, resized, and converted to black-and-white for efficiency. Additionally, I moved the training and sampling of the network run on my GPU, which freed up the CPU to run the simulations of the environment. This more than doubled the speed of the training and was what made it reasonable to train overnight like I did above.


One feature of the Atari environments caught my attention, though: Frame skip. In the Breakout environment, each action has a random chance of being repeated 2, 3, or 4 times, which adds some noise to the behavior of the agents that they have to account for. I began to wonder, however, how an agent might perform that was able to choose how many frames to execute an action for. It’s been done before with a Q-learning model, but this model simply adds an additional version of each decision that corresponds to each number of time steps, which dramatically increases the cost of choosing an action and the number of hyperparameters, and makes the model more sparse in general. What would be ideal is a model which can use some sort of regression to select. Fortuitously, the Actor-Critic model is ideal for this!

Screenshot from 2018-05-08 18:46:57.png

The architecture proposed by Lakshminarayanan et. al. in the paper above

We can simply add another output to the model which selects the number of times to repeat the chosen action.

This is remarkably similar to a concept called “options.” Options generalize decisions over time by outputting not a single discrete decision, but two components: a distribution over possible actions (we have that already), and a termination condition. Ideally, the termination condition would be some subset of the possible states, such that the option could continue until some state is reached, however this is difficult to accomplish with the neural network interpreting the states. Instead, our termination condition is a simple scalar which will sample from that distribution the given number of times.

This is what I’ve been working on, but unfortunately, learning the options seems to be much more difficult than learning one-frame actions. My best results have come from bootstrapping the network with the pre-trained checkpoint from the actor-critic model above, but it simply always chose a duration of 1 frame and then continued learning exactly as before. If I choose to reward the model for using larger time steps, it always chooses the maximum and ignores the ball completely to cash in on that easy reward. Reward engineering is tough!

In the near future, I’ll continue working on this problem and see what I can make of it, but unfortunately with this sort of reinforcement learning model, sometimes it’s nigh-impossible to identify the source of a bug, and iteration takes so long that it’s difficult to find them all in a reasonable amount of time.

Next steps

My next step in this is really to move out of the OpenAI gym and out into the wild a little bit, or rather, into unity. Unity is a very popular game development engine which recently released a toolkit for training machine learning agents in environments created in the engine. This opens up some really interesting possibilities, as the environments can be anything you can create in unity, including physics and collision simulations, realistic lighting, and, most importantly, human interaction. This opens up a lot of really interesting possibilities that I’m excited to explore.

Posted in Reinforcement Learning | Tagged , , , , , , , , , , , , | Leave a comment

Unexpected results in ML & project updates

There was a recent paper that caught my eye that I think is worth sharing with anyone interested in the themes of my blog:

The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities

If you’re into reading papers, definitely take a look at this one, if not, Károly Zsolnai-Fehér of Two Minute Papers did an excellent video about this paper. This is one of the things that really inspires my love of machine learning, and it’s great to see awareness of this really interesting stuff on the rise.

An additional note, if you haven’t yet, check out some other blogs hosting generative humor, including AI Weirdness and Botnik Studios (not neural networks, but still hilarious). There’s also a (very new) subreddit for Botnik with similar predictive text humor, so check that out.

Finally, since I’ve been getting a ton of new subscribers (Welcome!) an update on my current projects: The first one, which I’ll be doing a much more substantial write-up of some time in the next month, is a gimp plugin interface for Deep Dream. Here’s an example of the kind of stuff you might see some more of:


Tubingen with the “honeycomb” class deep-dreamed onto it

I’m really excited about this project finally coming together– once this is set up, I may look into the feasibility of porting it to a Photoshop plugin and/or creating a plugin for Neural-Style, since these tools really need to be put into the hands of artists.

The second is just learning more about reinforcement learning using the OpenAI Gym and (eventually) the Unity ML Agent framework.  But look! It learns:


Asynchronous Actor-Critic RL model playing Breakout

It’s not very smart yet, but it’s trying its best. (I’ve since switched to grayscale for better performance)

In the meantime, I’ve found a set of web scrapers that pull data from BoardGameGeek, so once I get that set up to get natural language descriptions instead of just ratings data, expect some AI-generated board games.

[last-minute edit]

I should also mention, I did run char-rnn on the database of Urban Dictionary results, but then realized belatedly that this website is on my resume which I hand out to employers, so I decided to do a bit more thinking about whether I wanted to dip into the NSFW on this blog. I’ll keep you updated on that.

Posted in Uncategorized | Tagged , , , , , | Leave a comment

Sentiment Translator


A very common problem in natural language processing is that of sentiment analysis, or rather, attempting to analytically determine if a given input has an overall positive or negative tone. For example, a positive string might be “I really enjoyed this movie!” whereas a negative input might be “it was boring, vapid, and dull.”

My idea is essentially thus: given recent advancements in machine translation, it should be possible to translate between sentiments (as opposed to between languages. The difficulty in doing this presented itself fairly immediately: there is no dataset available with a one-to-one mapping between negative and positive sentiments. There are models for unpaired machine translation, but they’re still relatively unproven.

My first implementation was a fairly simple rule-based approach: try to remove or add inverters (the word “not,” for example) and replace words with their antonyms until the sentiment changes appropriately. This worked well for very simple cases, but just wasn’t smart enough to capture more complex relationships (for example, it loved to translate “good” to “evil,” even when “bad” would make a lot more sense). My new implementation takes a different approach, using (and abusing) a model loosely adapted from Daniele Grattarola’s Twitter Sentiment CNN.

The Data

I used the aclimdb dataset, a set of reviews scraped from the International Movie Database, split into four parts of ~12500 reviews each: positive training, negative training, positive test, and negative test. Movie reviews work very well for this problem because they are essentially already annotated with the sentiment in the form of the user’s star rating for the film.

In pre-processing, I split the reviews into sentences to reduce the length of each input and convert each review in to word vectors (in my experiments, I used the googlenews-300 pretrained vectors). Unfortunately, due to the size of the input when converted into 300-dimensional vectors, I frequently ran out of memory during training. To reduce this issue, I only load the million most common words in the google news negative 300 vectors

The Model

The model is based on a set of one-dimensional convolutions over the series of word vectors, followed by a max pooling layer, ReLU, and a fully-connected layer. This is trained as a standard sentiment classifier, learning to predict the sentiment of a given input sentence.

At sampling time, however, we do something different. We run the input sentence through the classifier, as normal, however we give the classifier a different target sentiment. We then find the gradient of the loss of the classifier with respect to the input word vectors. This may be familiar to anyone who’s implemented Google’s Deep Dream algorithm or worked with Adversarial images. In essence, this will give us the direction we should perturb the input vectors to cause the largest change in the sentiment. Additionally, the magnitude of the gradient for a given word roughly corresponds to how much that word contributes to the sentiment (and therefore, how important it is to change).

We hit on another problem here. The space of possible words is discrete, but the word vector space is continuous (and very high-dimensional, and thus sparse). How can we be sure that these gradients are moving towards an actual word? To be honest, I’m not entirely sure. My first approach was to use multiple gradient steps, however this appeared to find minima in the sentiment that didn’t correspond to actual words in the input set. My second approach was to extend the gradient out as a ray from the original word and find the word vectors closest to this line: this worked a good deal better (specifically, it captures the “hate” <-> “love” relationship), but still isn’t perfect: we still need an heuristic method to select which of the proposed word replacements to use, which in the end will make this method little better than the rule-based approach from my initial implementation.


The biggest realization I came to was that when mapping a discrete space to a continuous space, the meaning of the intermediate values is not always intuitive. This is what we see when we simply perform gradient descent on the word vectors– the sentiment converges very nicely, but the resulting vectors have barely changed from their original values. This is interesting in the domain of computer vision, as it typically results in an “adversarial image,” or an image which can fool the classifier into misclassifying it with a very high confidence while being indistinguishable from the original to a human. However, as we are hoping for some of the words to converge to different discrete values in the word space, this is less than ideal.

Additionally, one unanticipated disadvantage of the lack of unpaired data was the inability to mathematically verify the accuracy of the translations– there was no ground truth to translate to. 

Future Work

One thought I’ve had is to try to do something similar to CycleGAN, which performs “image translation” in an unpaired fashion through a combination of GAN loss and “reconstruction loss,” however this still introduces problems as we cannot easily calculate gradients of the sentiment loss through the discretization into the word space.

It’s a tricky problem, but if anyone has any ideas, I’m interested.

Posted in nlp | Tagged , , , , , | Leave a comment

Automated topic extraction with Latent Dirichlet Allocation

During last semester, I became aware of a really interesting NLP algorithm called LDA, short for “Latent Dirichlet Allocation.” The goal of the algorithm is to create a set of “topics” which represent a specific human-interpretable concept

The core assumption of LDA is that documents are generated as follows:

  1. For each document:
    1. generate a distribution over topics based on model parameters
    2. for each word in the document:
      1. Sample a topic from the document’s topic distribution
      2. Sample a word from that topic’s distribution over words

The idea being that there is some latent variable (the topics) that informs word choice.

Unfortunately, the document->topic and topic->word distributions are impossible to calculate exactly, but it is possible to approximate them using gibbs sampling or variational inference- approximation techniques which will allow us to eventually converge to a solution close to the true solution (insofar as such a thing exists). Unfortunately, these have the side-effect of being very slow, so the algorithm is not exactly the most efficient. Compared to training a neural network, though, it’s not actually unreasonable.

Here are some results from running the algorithm on a datset of news articles, where each line is a discrete topic. See if you can figure out what each topic is about:

command, field, one, marshal, last, general, boy, first, slim, perform

belgian, congo, government, independent, u.n., congolese, lumumba, nation, one, province

Sept., oct., new, lake, 23, color, first, fall, river, season

Run, two, mantle, one, home, game, mariners, record, hit, play

Would, effect, generation, fallout, megaton, radiation, test, soviet, human, government

Law, anti-trust, union, labor, act, price, manufacture, company, collect, bargain

And from a combination of science fiction and scientific papers:

brown, soil, detergent, wash, surface, john, house, daily, provide

casework, service, prevent, family, treatment, use, interview, care, time, help

company, questionnaire, list, manchester, store, busy, mail, plant, downtown

food, radiation, object, cost, meat, process, irradiate, product, visual, refrigerate

And finally, the SCP foundation. One thing to note is that I didn’t do as much data cleaning or parameter selection as I did with the previous datasets, so quality could be better. I’ll fine-tune the results later.

film, dust, character, neutrino, drug, pill, site-64t, supplemen, text, brand

photograph, photo, cognitohazard, photographed, effect, viewer, ascension, baxter, epsilon, ride

mr., kiryu, d-13321, head , dr., mad, corrupted, little, penny, rust

playback, inside, data, ritual, becam, went, object, orbit, contain, unit

tlaloc, station, turkey, none, d-69601:, dubcek, albright, materialized:, stop., deities

These are just a few examples but you can see how easily interpretable the results are with basically no human intervention or annotation. I’m hoping to apply this to some other datasets in the near future to see what sort of results I get.

Posted in LDA | Leave a comment

Making art with neural style and fractals

I recently attempted to see if I could create art with Neural Style using only photos I’ve taken and fractal flames I’ve created with Fractorium and Apophysis. I must say, I’m very happy with the results! I generated the outputs using the Neural Style GUI and fine-tuned them using Photoshop– the latter was necessisary to mitigate the black splotches still present in some images.


I’m planning to write up a post soon about a really interesting NLP algorithm, and I’ve been having fun recently training char-RNN on a database of Urban Dictionary so stay tuned for that.

Posted in neural style | Leave a comment

Steam Games

While I work on my next big project, I decided to generate some random steam game names. All of these are games that don’t actually exist:

Happy Panic

Unraveled Land

Mad Sharkness

The Heart’s Medicine

formic innocence

Heroes Over Europe – Full Mojo

Lovely Ventures

The Gravesable Moon Visual Novel

Hotline Miami 3: Back under Begins

Redemption Cemetery: Secret Of The Angel

Nightman: Trigger Element Space


Princess Maker Expedition

Gorescripted Addiction: Possibility

Mars Indian

The Ember Sigil

Train Simulator: Eternal Gun

5-Bit Soundtrack

Best Force

Happy Fantasy

Jackal Journey

Signal Flyng


And the winner for the most probable steam game name is:

The Walking Dead: Inside The Walking Dead: “The Secret of the Magic Crystal”

and also:

Steam Dev Days: Steam Edition

Posted in Uncategorized | Leave a comment

Espresso is the marshmallow of coffee (fun with Word2Vec)

In exploring the recipe dataset, I decided to have some fun with Word2vec, an algorithm originally created by Google. For the layperson, this algorithm works by looking at the context in which a given word appears and learns vectors to represent words such that words that appear in similar context have similar vectors. On the recipe dataset, this means that, for example, the vectors for vodka and cognac are very close together, wheat and rye are very close, chocolate and butterscotch are very close together, etc.

What’s really neat about this, though, is that it enables us to do some very interesting things. One of the properties of the vectors created is the ability to perform vector arithmetic, adding and subtracting these semantic vectors to create word analogies. Here are a few examples: (read a – b + c = d as “b is to a as c is to d”)

pie – pizza + calzone = blintz

That makes sense! Never would have thought of that to be honest.

banana – plantain + apple = blueberry

I guess an apple is just a big blueberry. Who knew

candy – marshmallow + coffee = espresso

I guess that makes sense. Weird though.

 fish – tuna + chocolate = candy

Ok, tuna is a type of fish, chocolate is a type of candy. I guess I’ll let that one slide.

coffee – tea + lemon = orange

tea – coffee + lemon = lime

That’s interesting. It seems to think coffee is sweeter than tea.

coffee – knife + spoon = expresso

Interesting. In addition to the marshmallow of candy, it’s also the spoon of cutlery.

rasin – grape + fish = offal

Wow, ok, I guess it doesn’t like rasins.

brie – cheese + candy = meringues (closely followed by “fondant”)

Makes sense. Fancy, soft, light.

ribbon – bar + dome = tented

Let’s try the classic word2vec analogy:

king – man + woman = bruce

What. The next closest option is “retired.”

I’m going to continue experimenting with this. I’ve also been getting some really good results with the chef-rnn, so I’ll get back to you with more of that soonish as well.

Posted in word2vec | Tagged | Leave a comment