Jokebot redux

Jokebot has finished training. It still has some serious issues; if anything, they're even more serious than before.

DISCLAIMER: The “jokes” the network produces do not represent my views or sense of humor. The data was scraped from reddit and it shows. Also the network has become VERY RUDE, so proceed with caution.

Very low temperature:

What do you call a cow with no legs? A: A pilot, you racist.

What do you call a black man who speaks a piece of course? A small medium at large.

What do you call a black guy who likes to be a company lived in a car? A paragraph and a condom.

What do you call a black man that has a porcupine? A pilot, you racist piece of cake.

What do you call a cow that doesn’t work? A condescending con descending.

What do you call a black guy who can’t see? A small medium at large.

It appears that at this temperature the network has overfit a bit: most of the answers are "A small medium at large" or "A pilot, you racist piece of [cheese, cake, shit, etc.]." I did laugh at "you racist piece of cake," though.

Low temperature:

What do you call a short pilot? A pedophile.

Why do black people have such big noses? Because they can’t even.

What do you call a transformer who can’t fly? A baked potato.

What do you call a black man who doesn’t scream at least year? A little shit.

What do you call a bunch of dead babies? A family seat.

What do you call a dead baby flying a plane? A baked potato.

What do you call a deaf prostitute? A broken nose.

What do you call a scary girl with a shit with her brothers? A cock in your mouth

Dear god it’s only gotten worse. What have I created?

Medium temperature:

Why did the dick go to the man in the mud? He was taking a shit!

Why couldn’t the chicken cross the road? It was two-tired.

What do you call a deaf man with no arms and no legs in the middle of the ocean? A pilot, you racist mother.

What’s the difference between a comedian and a gay guy? A hockey player showers after 3 periods.

What do you call a superhero who is going to be a tree? An asshole.

What do you call a fat person who only eats gay men? A semi-chicken

What did the pedophile say to the pirate? Nothing.


But on the plus side, it also created these:

What do you call a woman with an extra leg? A woman

What did the doctor say when he fell out of the closet? Damn

High temperature:

What did the cremate say to the stove? Whoat. Oh, it was out yet.

What do you call a stoner with a bad real paint in your jean? Half of course!

How do you make a blind man organ? With a snowblower.

What do Jewish people with breasts and dumb games have in common? Everyone wants to smell it, but it’s gonna be dead.

What do you call a cow with a pet dog? A space member

What’s the difference between Michael Jackson and a bag of cocaine? one spits and the other is a group of cunning.

What do you call a gun on a wheelchair? A tooth crip.

What do you call a cow with no eyes? The Nemon Roll.

What do you call a chicken coop with a donkey and a white guy? A crustacean!

What do you call two monkeys floating in the middle of the ocean? The Amazon.

What’s a stormtrooper’s favorite sport? Project and Tour Debate

This is where the network got the most laughs; some of them are just so absurd. It also had a few of my least favorite "jokes."

Very high temperature:

What do you call a Mexican with one phone in his arse? No PROCEDO

Which have you call a Graveyard nurse? Shroting me Debatins

What has 3 beans? A Brown.

What’s the difference between 8 out of roux and figure?,You can tuna piano, but you can’t jelly until your mom on your ass.

What do you call a confused asian? Spaghetti

How do you cut an elephant into a snowblower? I’ll tell you tomorrow.

What did the buffalo say to the ground? Nothing. He just came back.

What is Bruce Lee’s favourite food???8? URDUMA

How many average people does it take to change a light bulb?None, it’s still dark dirty.

What did the dumb brothel say?I wooden hanger.

Why did captain say the toaster between her boyfriend?Cause the dick waves pings.

What do you call a cow machine? A cow with cheese.

As you can see, it got a bit dadaist, as is wont to occur.
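Temperature is the knob behind this whole progression, from the repetitive low-temperature answers to the high-temperature word salad. As a rough sketch (the logits, vocabulary size, and seeds below are made up for illustration), sampling with temperature just divides the network's output scores by T before the softmax:

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Divide the logits by the temperature before the softmax.
    T < 1 sharpens the distribution (safe, repetitive output);
    T > 1 flattens it (adventurous, eventually dadaist output)."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # subtract max for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.2]  # made-up scores for a 3-character vocabulary
cold = [sample_with_temperature(logits, 0.1, np.random.default_rng(i)) for i in range(100)]
hot = [sample_with_temperature(logits, 100.0, np.random.default_rng(i)) for i in range(100)]
# cold almost always picks index 0; hot spreads over all three choices.
```

At very low temperature the model keeps hitting its single favorite answer ("A small medium at large"); at very high temperature nearly everything becomes plausible, which is exactly where the dadaism comes from.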

Posted in Uncategorized


I fixed the problems! It seemed that I needed to update Cutorch, and in order to do that, I had to update CUDA, and in the process I inadvertently uninstalled my graphics drivers. What an adventure. To celebrate, I’ve been training char-rnn on a database of question/answer jokes scraped from reddit. Still early in training, but here are some highlights so far.

DISCLAIMER: The jokes the network produced are not representative of my opinions. The source data came from scraping a subreddit and it shows. The network is… a little offensive sometimes.

Low temperature:

What do you call a man who can’t even live? A star track.

What do you call a group of banana that can’t stand a lot of leaves? An angry banana.

Why do mathematicians have to go to the other to talk to the bar? Because they can go stop the political and store they go to the bar.

What do you call a prostitute with no legs? A pilot of the bar.

What do you call a group of children who goes to a chicken star? A sandwich.

What do you call a chicken that starts a basketball team? A star bang.

What do you call a prostitute that can’t even live? A star to the chimney.

What do you call a black man who has a bar and a redneck in the world?,A person that has a great salad.

For some reason it really likes the answers "A star" and "A pilot," and it keeps putting prostitutes in the questions.

Raising the temperature a bit, we see:

What do you call a porn star that doesn’t wark? A pencil

What do you call an alligator with no legs? A space mate.

What do you call a prostitute who is going to be a computer? A pot battery

What do you call a group of baby that doesn’t have a bar? Lettuce

Did you hear about the person who got a few bar stars? He had a horse with the shopping story.

There were also a few that started to get a bit lewd, which I guess says something about the data source. Let’s keep cranking up the heat!

What’s the difference between a terrorist and a chickpea? Errrrrrrrrly marks out of a stranded college.

What do you get when you put an elephant in a car? The holly-convention

What do you get when black girls want to pee? 1st light.

Okay, what the hell, jokebot. That got bad fast. And it doesn’t get much better:

What do you call a midget in a prostitute? A cross character.

What do you call an Indian snake fighting his brother? A HAR GUUR NELLAR!

What does a porn star say to a Jewish bank?,Hello Game of a toilet Life.

Did you hear about the cock-worker who was in the statistic on the stool?,He had a man from the weather.


For science, let’s crank the temperature all the way up.

How do you react a hippie? An angry salad.

What is the sound of irony? Osian.

What’s the difference between a Day and a gas bill? Thought in the oven.

Why can’t the chicken take a deud at the main crag countant? At the swseek.

What did the doctor say to the mathematician? Fuck mississippy!

What’s the difference between an alcoholic and a baby? With a portuplage binguins, they’re both tattooed.

Did you know about Pokemon massacre Tunnels? His son makes a tight in the Olympics teaches.

Why doesn’t Usian greet a pothead? He’s always stopped up bunched!

Why won’t Michelle coop continue?,Because a punched people in pedophiles.

This is actually better… just because they make less sense. It clearly has a really twisted sense of humor though. Pokemon massacre Tunnels? WHAT?

This seems to be a pretty clear example of why data is important. I expected most of the jokes to be clean with a few bad ones, but it seems to be the other way around. I’ll keep training to see what happens, mostly because I’m curious.

Posted in char-rnn


While I’ve been working out some issues I’ve been having with torch, I did some training on a database of all Jeopardy questions. Unfortunately, training was cut short by said torch problems, so I’ll have to resume that tonight. Here’s a sampling of my favorites: (read in Alex Trebek’s voice)

SINGERS,$800,’In 1969 this film classical former president says, I read that branch of the park recorded the top of the memoir’,Born in the Martian Empire

MUSEUMS,$600,’His first major commission treats a color’,a political plant

ROCK PRIZE,$400,’In 1992 this season was a Philip Harrison in 1999 in the Mark of the Dance Age in 1996′,Alice

FAMOUS WOMEN,$400,’The best state capital is a distinguished by this film by Elizabeth II’,Shakespeare

VICE PRESIDENTS,$1200,’An example of this senator displays with 100 miles in Mark the Palace Committee on April 1991′,John Hancock

THE AMERICA,$400,’In 1797 this president enters the southernmost word for the same name’,the Standard Sea

PEOPLE FROM PENNSYLVANIA,$2000,’The famous cathedral of this word meaning to hold it to the model of the Roman Empire’, Parthenon

The answer actually matches the question! Kinda! Too bad the category is “people from Pennsylvania.” As you can see, most of the questions tend to be gibberish. I hope that’ll resolve itself with more training or a larger network, though.

Setting the network’s temperature to the lowest possible results in variations on:

A MOVIE SCIENCE,$1000,’The state state of this country is a state where the state is a state in the state’,the Roman Party

THE SILVER SCREEN,$1000,’The state of this country is a state for the state of the state’,Mark Twain

Mostly lots of "The state" and "Mark Twain." Also common occurrences: "The first president," "Mariah Carey," "Marie Antoinette," "Charles Martin," and something called alternately "The Band of the Road" and "The Band of the World."

It did also produce this oddity:

THE SILVER SCREEN,$1000,’This country was a popular president of the state of the Sea of Fame’,The Man With The Brothers

Oooh. The Man With The Brothers is a little spooky. And “the Sea of Fame” sounds cool.

Some of the categories generated at higher temperatures are hilarious, even when the questions start to fall apart. For example:

B__INERS [sic]

And my all-time favorite:

Honestly, that’s probably a real category, (as are some of the others I’m sure) but I don’t care. It genuinely made me laugh.

I’ll train the network for a bit longer tonight and see if results improve.

Posted in Uncategorized

Machine Learning vs. Human Learning

I've been a bit too busy to run any experiments, unfortunately, but I've still been thinking about this quite a bit. Since I haven't posted in about a month, I figured I'd share one of my motivations for getting into machine learning: it offers a lot of really interesting parallels to human learning. Below I've collected a few techniques I've picked up that have some surprising connections, and that have even given me a bit more insight into how people work.

Learning Rate Annealing

This is a common training technique that I've been using quite a bit with Wavenet. The idea of learning rate annealing is that over the course of the training regimen, the system you're optimizing learns more and more slowly. A high learning rate is better early on: in many of the problems machine learning is used for, getting a tiny bit closer to the solution shows no measurable improvement until you're right on top of it, and large steps also help the optimizer avoid getting stuck in local minima that might distract from a better solution. But if the learning rate stays that high, the optimizer will skate right over the minimum it's looking for. Lowering the learning rate lets it narrow in and get more precise as it approaches the answer.

In plainer English, the high learning rate at the beginning lets the model quickly get a general sense of how things work, and lowering it as training progresses lets it fill in the details.
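As a concrete sketch, one simple form of annealing is step decay (the schedule shape and all the constants below are illustrative defaults, not the settings from any of my runs):

```python
def annealed_learning_rate(base_lr, step, decay_rate=0.5, decay_every=10000):
    """Step decay: cut the learning rate in half every `decay_every`
    training steps. Big steps early to find the right neighborhood,
    small steps late to settle into the minimum."""
    return base_lr * decay_rate ** (step // decay_every)

lr_early = annealed_learning_rate(0.01, step=500)    # still the full base rate
lr_late = annealed_learning_rate(0.01, step=30000)   # halved three times
```

Other common variants decay the rate smoothly (exponentially, or proportionally to 1/step) rather than in discrete jumps, but the intent is the same.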

What's fascinating is that this mirrors something that happens to humans as we age. As we get older, neuroplasticity drops, and it gets harder and harder to learn new skills, change tastes and opinions, or adapt to new environments. While this might seem like a purely bad thing, it has the added benefit of proofing the system (be it a machine or a human) against outliers. For example: if the first time you see someone drop a rock, it falls, you might conclude that this happens every time (and you would be correct). As you see more and more objects dropped, this belief solidifies, until one day someone drops something and it doesn't fall. If your pattern recognition were as elastic as it was the day you saw that first rock dropped, you might conclude that dropped objects sometimes float, which is wrong. What's more likely is that you'll assume something fishy is going on, and this data point won't skew your internal model of how dropped objects behave.

Learning rate annealing is what stops the first observation that contradicts our worldview from making us throw out all of our accumulated assumptions (for better or worse).

Unreliable Parts & Noisy Data

One recurring problem with machine learning models is the tendency to overfit. This essentially means that the model is learning to match patterns that exist in the training data that are not representative of reality. This will create a model that can faithfully reproduce/categorize/recognize the training data, but fails miserably in the wild. There are a lot of ways to avoid this, but one of the most common ones used for neural networks is called dropout.

The idea behind dropout is to randomly disable a certain number of the neurons in the network at each training step. Since any neuron could be down at any given time, the network has to learn redundancy, forcing it to build a more robust representation of the data with overlapping roles for its neurons.
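A minimal sketch of what this looks like in practice, using the common "inverted" dropout formulation (the drop probability and layer size below are arbitrary choices for illustration):

```python
import numpy as np

def dropout(activations, p_drop, rng):
    """Inverted dropout: zero each unit with probability p_drop, then
    rescale the survivors by 1/(1 - p_drop) so the expected activation
    is unchanged. At test time this function is simply skipped."""
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
h = np.ones(10000)            # a stand-in layer of activations
h_dropped = dropout(h, 0.5, rng)
# Roughly half the units are zeroed, but the mean stays near 1.0.
```

The rescaling is the detail that makes this convenient: because the expected activation is preserved during training, the network can run at test time with all neurons enabled and no extra bookkeeping.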

Of course, the only way to completely avoid overfitting is with more data, but that isn't always possible. In this situation, one technique that gets used a lot is to multiply the amount of data by distorting each training sample in some way: rotating it, shifting it, scaling it vertically or horizontally, adding random noise, shifting colors, and so on. This is mostly applicable to images; for text-based data, the equivalent might be using a thesaurus to replace words with synonyms, or deliberately introducing misspellings.

This artificial noise gives the model a larger range of possible environments to interpret, which will make it better at generalizing (though it’s not quite as good as just having more data) and better at interpreting poorly sanitized data, which is also important for working in the wild. In addition, this random artificial noise prevents the model from overfitting to noise patterns present in the training data because the noise is always changing. Even though each individual sample is distorted, the noise averages out in the end and results in a more robust system.
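A sketch of a simple image-augmentation step along these lines (the flip probability, shift range, and noise level are made-up values for illustration; real pipelines tune them per dataset):

```python
import numpy as np

def augment(image, rng):
    """Make a distorted copy of a training image: random horizontal
    flip, a small random horizontal shift, and additive Gaussian
    pixel noise."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]                        # horizontal flip
    shift = int(rng.integers(-2, 3))              # shift by up to 2 pixels
    out = np.roll(out, shift, axis=1)
    out = out + rng.normal(0.0, 0.05, out.shape)  # per-pixel noise
    return out

rng = np.random.default_rng(0)
image = np.zeros((8, 8))                          # stand-in training image
batch = [augment(image, rng) for _ in range(4)]   # 4 distorted copies
```

Because a fresh distortion is drawn every time the sample is seen, the model never gets to memorize any one noisy version of it.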

These two techniques are so powerful that Google has actually created a piece of hardware called the "Tensor Processing Unit," a parallel processing chip with "reduced computational precision, which means it requires fewer transistors per operation," making it "an order of magnitude better-optimized" than conventional hardware. In a sense, it bakes the spirit of dropout and noisy data into the silicon: by giving up the precision and reliability that are so important to many other kinds of computation and just packing together noisy, imprecise circuits, it actually gets better for this workload.

This also mirrors life. Biology is noisy, unreliable, and messy. The parts don't always work, and when they do, they're not very precise, but for this application it not only doesn't matter, it actually helps. This is, in large part, why intelligent life was able to evolve at all. Neural networks are the ideal system for creating intelligence in a noisy, unreliable, ever-changing environment.

Internal Vector Encoding

This is one of the things I find most fascinating about machine learning. One of the most basic neural network setups, exemplified by the Restricted Boltzmann Machine and its close cousin the autoencoder, has only two layers of weights (or three layers of units, depending on how you count). One layer acts as the input, which passes through a second, smaller layer, and the network then attempts to recreate the original input. In doing so, the model is trying to figure out the best way to compress the input while still being able to recover the original data. The result is a compressed vector representation that reflects the structure of the input in a more compact way.
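Here's a toy version of that bottleneck idea, stripped down to a linear autoencoder trained by plain gradient descent (the data, layer sizes, learning rate, and iteration count are all made up for illustration): three-dimensional inputs that secretly lie on a one-dimensional line get squeezed through a single hidden unit and reconstructed almost perfectly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 3-D points that actually lie on a 1-D line, so a single
# hidden unit is enough information to reconstruct them.
t = rng.normal(size=(200, 1))
X = t @ np.array([[1.0, 2.0, -1.0]])

W_enc = rng.normal(scale=0.1, size=(3, 1))  # encoder: 3 inputs -> 1 hidden unit
W_dec = rng.normal(scale=0.1, size=(1, 3))  # decoder: 1 hidden unit -> 3 outputs

lr = 0.01
for _ in range(2000):
    H = X @ W_enc                       # compressed code: one number per point
    err = H @ W_dec - X                 # reconstruction error
    grad_dec = H.T @ err / len(X)       # gradients of mean squared error
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

mse = float(np.mean((X @ W_enc @ W_dec - X) ** 2))
# mse ends up tiny: each 3-D point survives being squeezed to one number.
```

Real autoencoders stack nonlinear layers on both sides of the bottleneck, but the principle is the same: the network only learns a useful code because it is forced to reconstruct through a layer too small to memorize everything.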

Expanding this simple model with more layers can create some really interesting structures. The output doesn't even have to be in the same form as the input; for example, the input could be English and the output French, or vice versa. In this case, the internal vector representation stores the meaning of the sentence, independent of the language. What's really fascinating about this particular example is that it works better the more languages are used. Because the internal representation stays the same, having more languages lets the model build a better compressed representation, using ideas that may not exist in one language to help translate another. Translating English to Chinese is easier if the network can also translate English->French and French->Chinese.

Once again, we see this in humans too. It is much more likely for someone who knows two languages to learn a third than for someone who knows only one to learn a second. It could be argued that this is because of cultural differences that shape a person's upbringing and let them learn multiple languages while their learning rate is still relatively high, but I think there are other factors at work. This is just a personal belief, but I would not be surprised if something similar is going on: knowing multiple languages lets someone build a more efficient internal representation of ideas and of language in general, since they can integrate aspects of multiple languages into their thought processes. This is the strongest argument I've ever been presented with for learning multiple languages.

Posted in Uncategorized

Encouraging results with Wavenet!

After digging into the code and resolving an error that caused the network's output to devolve into white noise, talking with some other folks about what seemed to work for them, and doing a whole bunch of hyper-parameter optimization, I've had some encouraging results!

For these, I restricted training to one speaker. Each successive test represents one round of hyper-parameter optimization, and for the last one I switched the optimizer from Adam with normalization to SGD with momentum.

It is also very interesting that the most successful test, test 7, used the smallest network of these tests and trained for the shortest time: only 26,000 iterations, versus 50,000, 100,000, and 150,000 for tests 6a, 6b, and 6c. My next test will be to continue training this network with a reduced learning rate to see if I can get it even better, but I'm really happy with these results.
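For reference, the momentum update that replaced Adam is simple enough to sketch in a few lines (the learning rate and momentum values below are common defaults, not the settings from these tests):

```python
def sgd_momentum_step(params, grads, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update. The velocity accumulates an
    exponentially decaying average of past gradients, which smooths
    the trajectory compared to plain SGD."""
    for i in range(len(params)):
        velocity[i] = momentum * velocity[i] - lr * grads[i]
        params[i] += velocity[i]
    return params, velocity

# Toy usage: minimize f(x) = x**2 starting from x = 1.
params, velocity = [1.0], [0.0]
for _ in range(3):
    grads = [2.0 * params[0]]   # gradient of x**2
    params, velocity = sgd_momentum_step(params, grads, velocity)
```

Unlike Adam, there is no per-parameter adaptive scaling here, which is sometimes exactly why it generalizes differently; whether it helps seems to depend heavily on the problem.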

My eventual goal is to get this running with some of my music to see what it spits out.

Posted in Wavenet


So I got wavenet working.

Well, for a limited definition of "working." I've been running experiments to try to figure out how the model behaves, and unfortunately I don't seem to have a very good grasp of the underlying theory of the network architecture, because my experiments have gotten progressively worse.

  • The first sample was using the default settings for the implementation I’m using.
  • The second sample was the same except for longer training.
  • The third and fourth were the same as the previous except with a larger network (more layers).
  • For the fifth I changed the activation function and the optimizer to something I thought might work better. (It didn’t.)
  • For the sixth, I tried training with L2 normalization turned on.

To be honest, these results are somewhat disheartening, given that I haven't had much success yet. But I'm really interested in the idea, and I've seen some amazing results from other people experimenting with it, so I'm definitely going to stay at it.

Posted in Wavenet

Parameter Experimentation (initialization and merging styles)

While I wait to get my hands on Deepmind’s wavenet, I’ve been experimenting some with the parameters of neural-style.

Here I have two images that use the same style and content images; in the first, the result is initialized randomly, while in the second it is initialized from the content image:


The largest difference appears to be the lamp; the style image didn't have any areas bright enough to fill that in by itself, but when initialized from the content image, the network can keep it around.

Next was an experiment with merging styles together. First are the results of the individual style images: one has the best texture, one gets the red of the barn, and one gets the red sky.


And here’s the combination of all three:


So it looks as though, with multiple style images, in addition to creating an odd fusion of the painting styles, the network will also pull appropriate colors from the different style images to fill in areas the others may not cover (see the blue sky and the red barn).

Posted in Uncategorized