Jeopardy!

While I’ve been working out some issues I’ve been having with torch, I did some training on a database of all Jeopardy questions. Unfortunately, training was cut short by said torch problems, so I’ll have to resume that tonight. Here’s a sampling of my favorites: (read in Alex Trebek’s voice)

SINGERS,$800,’In 1969 this film classical former president says, I read that branch of the park recorded the top of the memoir’,Born in the Martian Empire

MUSEUMS,$600,’His first major commission treats a color’,a political plant

ROCK PRIZE,$400,’In 1992 this season was a Philip Harrison in 1999 in the Mark of the Dance Age in 1996′,Alice

FAMOUS WOMEN,$400,’The best state capital is a distinguished by this film by Elizabeth II’,Shakespeare

VICE PRESIDENTS,$1200,’An example of this senator displays with 100 miles in Mark the Palace Committee on April 1991′,John Hancock

THE AMERICA,$400,’In 1797 this president enters the southernmost word for the same name’,the Standard Sea

PEOPLE FROM PENNSYLVANIA,$2000,’The famous cathedral of this word meaning to hold it to the model of the Roman Empire’, Parthenon

The answer actually matches the question! Kinda! Too bad the category is “people from Pennsylvania.” As you can see, most of the questions tend to be gibberish. I hope that’ll resolve itself with more training or a larger network, though.

Setting the network’s temperature to the lowest possible value results in variations on:

A MOVIE SCIENCE,$1000,’The state state of this country is a state where the state is a state in the state’,the Roman Party

THE SILVER SCREEN,$1000,’The state of this country is a state for the state of the state’,Mark Twain

Mostly lots of “The state” and “Mark Twain.” Also common occurrences: “The first president,” “Mariah Carey,” “Marie Antoinette,” “Charles Martin,” and something alternately called “The Band of the Road” and “The Band of the World.”
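
For anyone wondering what “temperature” actually does here: it just rescales the network’s output distribution before sampling. A minimal sketch (the logits and vocabulary below are made up for illustration):

import numpy as np

def sample_char(logits, temperature=1.0):
    """Sample one character index from raw network outputs (logits).

    Low temperature sharpens the distribution, so the model keeps picking
    its single most probable continuation; high temperature flattens it
    and lets unlikely characters through."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()                        # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return np.random.choice(len(probs), p=probs)

# Hypothetical logits over a 4-character vocabulary:
logits = [2.0, 1.0, 0.5, 0.1]
print(sample_char(logits, temperature=0.1))   # almost always index 0
print(sample_char(logits, temperature=1.0))   # much more varied

At temperature 0.1 the most probable character wins almost every time, which is exactly why the low-temperature output collapses into “the state” over and over; at 1.0 the network samples from its distribution as-is, and the weirdness leaks back in.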

It did also produce this oddity:

THE SILVER SCREEN,$1000,’This country was a popular president of the state of the Sea of Fame’,The Man With The Brothers

Oooh. The Man With The Brothers is a little spooky. And “the Sea of Fame” sounds cool.

Some of the categories generated at higher temperatures are hilarious, even when the questions start to fall apart. For example:

THE DIGESTIVE SYSTEM

POSTAL FOOTBALL

WHAT A WAY A PERFARY WITH A PENNED?

THE SUPREME COURT WITH S

THE MISSING STREET

THE PRESIDENT’S FIRST 2000

BIBLICAL BOOK TITLES

ART TO THE LUMBER, VAMP

BEASTLY EXPRESSIONS

WHO ARE YOUR MEDICI

U.S. FOOD HEADLINES

I AM POETIC

B__INERS [sic]

GOFF-PHO

And my all-time favorite:

DON’T PICK ME!

Honestly, that’s probably a real category (as are some of the others, I’m sure), but I don’t care. It genuinely made me laugh.

I’ll train the network for a bit longer tonight and see if results improve.


Machine Learning vs. Human Learning

I’ve been a bit busy to run any experiments, unfortunately, but I’ve still been thinking about this quite a bit. Since I haven’t posted in about a month, I figured I’d share one of my motivations for getting into machine learning: it offers a lot of really interesting parallels to human learning. Below I’ve collected a few examples of techniques I’ve picked up that have some surprising connections, and have even given me a bit more insight into how people work.

Learning Rate Annealing

This is a common training technique that I’ve been using quite a bit with Wavenet. Essentially, the idea of learning rate annealing is that over the course of the training regimen, whatever system you’re optimizing learns more and more slowly. This is due to the nature of a lot of the problems machine learning is used for– often, getting a tiny bit closer to the solution won’t show any measurable improvement until you’re right on top of it, so it’s better to have a high learning rate early on. However, if the learning rate is kept that high, it will skate right over the global minimum. Lowering the learning rate lets the model home in and get more precise as it gets closer to the answer it’s looking for. The learning rate has to start high because that helps it avoid getting stuck in local minima that might distract it from another, more optimal solution.

In plainer English, the high learning rate at the beginning lets the model quickly get a general sense of how things work, and lowering it as training progresses lets it fill in the details.
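
As a concrete (if minimal) sketch of what that looks like in code, here’s a step-decay schedule using PyTorch’s built-in scheduler, with a placeholder model and random data standing in for a real training setup:

import torch
import torch.nn as nn

# Placeholder model and data, just to make the schedule runnable.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Halve the learning rate every 1000 steps.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.5)

for step in range(5000):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()   # big steps early on, finer and finer steps later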

What’s fascinating is that this mirrors something that happens to humans as we age. As we get older, neuroplasticity drops, and it gets harder and harder to learn new skills, change tastes and opinions, or adapt to new environments. While this might seem like a bad thing, since it makes learning harder, it has the added benefit of proofing the system (be it a machine or a human) against outliers. For example– if the first time you see someone drop a rock, it falls, you might conclude that this happens all the time (and you would be correct). As you see more and more objects dropped, this belief solidifies, until one day, someone drops something and it doesn’t fall. If your pattern recognition were as elastic as it was the day you saw that first rock dropped, you might conclude that dropped objects sometimes float, which is wrong. What’s more likely is that you’ll assume something fishy is going on, and thus this data point won’t skew your internal model of how dropped objects behave.

Learning rate annealing is what stops the first thing we see that contradicts our worldview from causing us to throw out all of the assumptions we’d built up to that point (for better or worse).

Unreliable Parts & Noisy Data

One recurring problem with machine learning models is the tendency to overfit. This essentially means that the model is learning to match patterns that exist in the training data that are not representative of reality. This will create a model that can faithfully reproduce/categorize/recognize the training data, but fails miserably in the wild. There are a lot of ways to avoid this, but one of the most common ones used for neural networks is called dropout.

The idea behind dropout is essentially to randomly disable a certain fraction of the neurons in the network at each training step. Since any neuron could be down at any given time, the network has to learn redundancy, forcing it to create a more robust representation of the data with overlapping roles for the neurons.
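
A minimal sketch of the mechanism, in plain NumPy (the layer sizes are arbitrary):

import numpy as np

def dropout(activations, p=0.5, training=True):
    """Zero out each unit with probability p and rescale the survivors.

    Because a different random subset is disabled at every training step,
    the network can't lean on any one neuron and has to spread each
    feature across several of them."""
    if not training:
        return activations            # use the full network at test time
    mask = (np.random.rand(*activations.shape) >= p).astype(activations.dtype)
    return activations * mask / (1.0 - p)   # "inverted" dropout scaling

hidden = np.random.randn(4, 8).astype(np.float32)
print(dropout(hidden, p=0.5))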

Of course, the only way to completely avoid overfitting is with more data, but this isn’t always possible. In that situation, one technique that gets used a lot is to multiply the amount of data by taking each training sample and distorting it in some way– rotating it, shifting it, scaling it vertically or horizontally, adding random noise, shifting colors, etc. This is mostly applicable to images; for text-based data, it might involve using a thesaurus to replace words with synonyms, or deliberately including misspellings.
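
For images, a sketch of that kind of augmentation pipeline might look like this, using torchvision’s standard transforms (the specific distortions and amounts are just examples):

import torch
from torchvision import transforms

# Each time a training image is drawn, it gets a slightly different
# random rotation, shift, scale, and color jitter, effectively
# multiplying the dataset.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=10),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    # Add a little random noise on top of the pixel values.
    transforms.Lambda(lambda x: x + 0.05 * torch.randn_like(x)),
])

Applied to each image as it’s loaded, every epoch sees a slightly different version of the dataset, so the model never sees exactly the same noise twice.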

This artificial noise gives the model a larger range of possible environments to interpret, which will make it better at generalizing (though it’s not quite as good as just having more data) and better at interpreting poorly sanitized data, which is also important for working in the wild. In addition, this random artificial noise prevents the model from overfitting to noise patterns present in the training data because the noise is always changing. Even though each individual sample is distorted, the noise averages out in the end and results in a more robust system.

These two techniques are so powerful that Google has actually created a piece of hardware they call the “Tensor Processing Unit,” a parallel processing chip with “reduced computational precision, which means it requires fewer transistors per operation,” making it “an order of magnitude better-optimized” than conventional hardware. In effect, they’ve baked the spirit of dropout and noisy data into the silicon by giving up the precision and reliability that are so important to many other types of computation, just packing together noisier, less reliable circuits, and it actually makes things better.

This also mirrors life. Biology is noisy, unreliable, and messy. The parts don’t always work, and when they do, they’re not very precise, but for this application, it not only doesn’t matter, it actually helps. This is, in large part, why intelligent life was able to evolve at all. Neural networks are the ideal system for creating intelligence in a noisy, unreliable, ever-changing environment.

Internal Vector Encoding

This is one of the things I find the most fascinating about machine learning. One of the most basic types of neural network, called a Restricted Boltzmann Machine, has only two layers (or three, depending on how you interpret it). One layer acts as the input; its data passes through a second, smaller layer, and the model then attempts to recreate the original input from it. In doing so, the model is attempting to figure out the best way to compress the input and still be able to recover the original data. This results in a compressed vector representation that reflects the structure of the input in a more compact way.
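
Here’s a sketch of that squeeze-and-reconstruct idea, written as a tiny autoencoder (a close cousin of the RBM described above), with made-up layer sizes:

import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """Squeeze the input through a smaller hidden layer and try to
    reconstruct it; the hidden activations become a compressed vector
    representation of the input."""
    def __init__(self, n_in=784, n_hidden=64):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
        self.decode = nn.Sequential(nn.Linear(n_hidden, n_in), nn.Sigmoid())

    def forward(self, x):
        return self.decode(self.encode(x))

model = TinyAutoencoder()
x = torch.rand(32, 784)                      # placeholder input batch
loss = nn.functional.mse_loss(model(x), x)   # reconstruction error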

Expanding this simple model with more layers can create some really interesting structures. The output doesn’t even have to be in the same form as the input– for example, the input could be English, and the output French, or vice versa. In this case, the internal vector representation stores the meaning of the sentence, independently of the language. What’s really fascinating about this particular example is that it works better the more languages are used. Because the internal representation stays the same, having more languages allows the model to create a better compressed representation, using ideas that may not exist in one language to help translate it to another. Translating from English to Chinese is easier if the network can also translate English->French and French->Chinese.
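
Structurally, the multilingual version looks something like the sketch below. This is only the wiring, not a trained translation model, and the languages, layer sizes, and the idea of a sentence arriving as a single fixed vector are all simplifications:

import torch
import torch.nn as nn

LATENT = 256  # every language maps into (and out of) this shared space

encoders = nn.ModuleDict({
    lang: nn.Linear(300, LATENT) for lang in ("en", "fr", "zh")
})
decoders = nn.ModuleDict({
    lang: nn.Linear(LATENT, 300) for lang in ("en", "fr", "zh")
})

def translate(sentence_vec, src, tgt):
    """Route any language pair through the same latent representation."""
    meaning = encoders[src](sentence_vec)   # language-independent "meaning"
    return decoders[tgt](meaning)

# Training on en->fr and fr->zh pairs shapes the same latent space
# that an en->zh translation passes through.
out = translate(torch.randn(1, 300), "en", "zh")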

Once again, we see this in humans too. It is much more likely for someone who knows two languages to learn a third than it is for someone who knows only one to learn a second. It could be argued that this is because of cultural differences that change a person’s upbringing and allow them to learn multiple languages while their learning rate is still relatively high, but I think there are other factors at work. This is just a personal belief, but I would not be surprised if something similar were at work– knowing multiple languages allows someone to have a more efficient internal representation of ideas and of language in general, as they can integrate aspects of multiple languages into their thought processes. This is the strongest argument I’ve ever been presented with for learning multiple languages.


Encouraging results with Wavenet!

After doing some digging into the code and resolving an error that caused the network to devolve into white noise, talking with some other folks about what seemed to work for them, and doing a whole bunch of hyper-parameter optimization, I’ve had some encouraging results!

For these, I’ve restricted the training to one speaker. Each successive test represents one round of hyper-parameter optimization, and for the last one, I switched to SGD with momentum as the optimizer instead of ADAM with normalization.
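
For reference, the optimizer swap itself is a one-liner in most frameworks. Roughly, in PyTorch terms, with placeholder parameters (the actual Wavenet implementation wires this up in its own training script):

import torch
import torch.nn as nn

# Placeholder parameters standing in for the Wavenet model's weights.
params = [nn.Parameter(torch.randn(10, 10))]

# What I was using before:
adam = torch.optim.Adam(params, lr=1e-3)

# What worked better for test 7: plain SGD with momentum.
sgd = torch.optim.SGD(params, lr=1e-2, momentum=0.9)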

It is also very interesting to note that the most successful test, test 7, was also the smallest of the networks used in these tests, and trained for the shortest time– only 26,000 iterations instead of 50,000, 100,000, and 150,000 for tests 6a, 6b, and 6c. My next test will be to continue training this network with a reduced learning rate to see if I can get it even better, but I’m really happy with these results.

My eventual goal is to get this running with some of my music to see what it spits out.


Wavenet!

So I got wavenet working.

Well, for a limited definition of “working.” I’ve been running some experiments to try to figure out how the model behaves, and unfortunately I don’t seem to have a very good grasp of the underlying theory of the network architecture, because my experiments have gotten progressively worse.

  • The first sample was using the default settings for the implementation I’m using.
  • The second sample was the same except for longer training.
  • The third and fourth were the same as the previous except with a larger network (more layers).
  • For the fifth I changed the activation function and the optimizer to something I thought might work better. (It didn’t.)
  • And for the sixth, I tried training with L2 normalization on.

To be honest, these results are somewhat disheartening, given that I haven’t really had much success. But I’m really interested in the idea, and I’ve seen some really amazing results from other people experimenting with it, so I’m definitely going to stay at it.


Parameter Experimentation (initialization and merging styles)

While I wait to get my hands on Deepmind’s wavenet, I’ve been experimenting some with the parameters of neural-style.

Here, I have two images that use the same style and content images; however, in the first one, the result is initialized randomly, and in the second, it’s initialized from the content image:

[Images: result with random initialization vs. result initialized from the content image]

The largest difference appears to be the lamp; the style image didn’t have any areas bright enough to fill that in by itself, but when initialized from the content image, the network can keep it around.
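
Under the hood, the only difference between the two is what the output image starts as before the optimization runs. Roughly, as a sketch in PyTorch (not the actual neural-style code, and with a placeholder content tensor):

import torch

content = torch.rand(1, 3, 512, 512)   # placeholder content image tensor

# Random initialization: the optimizer has to invent everything from
# scratch, so bright regions the style image can't supply never show up.
x_rand = torch.randn_like(content).requires_grad_(True)

# Content initialization: start from the photo itself, so features like
# the lamp survive unless the style loss actively pushes them out.
x_content = content.clone().requires_grad_(True)

# Either tensor is then handed to the optimizer (neural-style defaults to L-BFGS).
optimizer = torch.optim.LBFGS([x_content])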

The next experiment was with merging styles together. First are the results of the individual style images– one has the best texture, one gets the red of the barn, and one gets the red sky.

[Images: results from each of the three individual style images]

And here’s the combination of all three:

[Image: result combining all three style images]

So it looks as though, in the case of multiple style images, in addition to creating an odd fusion of the painting styles, the network will also pull appropriate colors from the different style images to fill in areas that the others may not cover (see the blue sky and the red barn).


Hallucinating words

After a few failed experiments (namely, trying to train char-rnn on the grant award database, only to discover that the file had been corrupted with non-Unicode characters that caused the loss to explode, and training on the Linux kernel source, which overfitted really badly), I decided to move back to something a bit simpler– the dictionary. My training rig isn’t connected to the internet at the moment, so no dumps yet, but here are some favorites from the current checkpoints.

From the best checkpoint:

This checkpoint actually produced correct definitions for a lot of words, including:

Preponderation (n.) The act of prepositing

Extravagance (n.) The state of being extravagant 

Tattler (n.) One who, or that which, tattles. 

This implies either that the network is representing the language on some deeper level (ha ha, that’s likely), or just that it’s overfitting to the data. However, this only seems to occur with words where the definition contains the word itself, or another similar word, like the above examples. I’m beginning to wonder if I could do something with a word2vec-style algorithm to make an even better neural dictionary, but for now, I’m just filtering out words whose definitions contain the word itself.
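
The filter itself is nothing fancy: roughly the sketch below, with a crude shared-prefix check standing in for real stemming and a made-up handful of definitions.

def is_self_referential(word, definition, prefix_len=5):
    """Flag definitions that lean on the headword or a close variant,
    using a crude shared-prefix test instead of real stemming."""
    stem = word.lower()[:prefix_len]
    return any(tok.strip(".,;()").lower().startswith(stem)
               for tok in definition.split())

definitions = {
    "Extravagance": "The state of being extravagant",
    "Tattler": "One who, or that which, tattles.",
    "Cloud": "A striving in a church; as of men.",
}
kept = {w: d for w, d in definitions.items() if not is_self_referential(w, d)}
print(kept)   # only "Cloud" survives the filter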

Temperature 0.1

Station (n.) The act of stating; a statement or condition of any object or position; a statement of the particulars of a person or thing; as, the station of a country or court

Temperature 0.5

Infusorianism (n.) The doctrine of the infusorian ministry

Manure (n.) To make a manufacturer; to seize or be confirmed to the mind 

Temperature 0.75

Confine (v. t.) To interchange or impress as an expression of assignment; to reduce or to indicate or represent; to consent; to disapprove of; as, to constitute the title of the rules of a province or a firearm. 

Endoderm (n.) A white crystalline substance, C0H10, of the acetylene series, found in alkaloids, and having been elongated in the proportion of ordinary odor, in which the phenomena of certain compounds are produced artificially, and is derived from its natural ore, and is now a mixture of granular copper;– also called hexanelin

Encrin (n.) A bishop composed of sensible colors. 

Stick (v. t.) To fix or defeat with a stick

Cloud (n.) A striving in a church; as of men.

Temperature 1.0

Imbreviate (v. t. & i.) To increase a disease, office, or claim

Nipperty (a.) Like a nipple; of or pertaining to the nipping.

Sympathetic (n.) Syphilis; execution

OK wow, that wins for biggest miss.

Encognat (n.) A printed person; an otolith

Smoke (v.) The spot or strap by which swings are driven

Hey that’s not a verb!

Ensifer (n.) A person who held or performs the privileges of the discriminal world itself

Gavash (v. t.) To cause to swing into game

Cloyer (n.) The harsh, uncertain part; any body or degree of obstruction; a handle of screws and judges

Tattlery (n.) A vessel for catching a plate or animal like a strumpet

Levator (n.) One who annoys the occupation of men

It is notable that the validation loss for this network never got very low at all– this is due to the nature of the problem. In most cases, a good deal of the loss can be avoided just by capturing the structure of the database. The recipes, for example, all follow the same basic format: a name, categories, ingredients, and then instructions.

This also gives the network a lot of information to work with when it gets to the instructions. A recipe called “fried chicken” will probably contain chicken. Something with flour will probably be baked at some point. And so on. In this way, to make each discrete recipe, it only had to remember the details for as long as the recipe lasted.

In this case, however, that’s just not enough. The network can’t really guess what the meaning of a word is from the letters in the word except by looking at its structure (prefixes, suffixes, roots, etc.), but it’ll only be beneficial to remember that information for a very short time. It would be much more effective to hardcode those relationships with some kind of word2vec (or char2vec?) system that would actually represent the meaning of the word in some abstract way. Especially because then it could possibly make sense of all those “wordly: the state of being a word; see wordy” definitions.
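
What I’m imagining is basically an embedding table that maps each word to a dense vector whose neighbors are related words. A minimal sketch (the vocabulary and sizes are invented, and a real word2vec model would learn the vectors from a large corpus):

import torch
import torch.nn as nn

# A tiny made-up vocabulary; real embeddings would be trained from
# co-occurrence statistics over lots of text.
vocab = {"word": 0, "wordy": 1, "state": 2, "being": 3}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

ids = torch.tensor([vocab["word"], vocab["wordy"]])
vecs = embedding(ids)              # each word becomes a dense vector

# After training, related words end up near each other, which is the
# kind of relationship I'd like the dictionary model to get for free.
similarity = torch.cosine_similarity(vecs[0], vecs[1], dim=0)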

But anyway. I’d really like to get the grants database fixed up and running so I can get some real output, and I’m also looking into some other fun things– Google Magenta, for one, and a reinforcement-learning agent that plays Super Mario. Could be neat. 


The adventures of Robo-Chef

Since my previous post, I’ve made the model better. How much better? About 22%! That’s pretty exciting. Here, exclusive and brand new, are some highlights and some dumps from this 404-star recipe bot.

At temperature=0.1, we get the least creative, most probable recipe the network can come up with:

  title: chocolate dipped chocolate chip cookies
categories: cookies, chocolate
yield: 1 servings

1 c  butter or margarine
1 c  sugar
2    eggs
1 ts vanilla
1 c  flour
1 ts baking soda
1/2 ts salt
1 c  chopped nuts
1 c  chopped nuts

cream butter and sugar until light and fluffy. add eggs, one at a
time, beating well after each addition. add eggs, one at a time, beating
well after each addition. blend in flour mixture. stir in chocolate
chips. spread in greased and floured 9-inch square pan. bake at 350
degrees for 15 minutes. cool completely. cut into squares. serve
warm.

This completely legitimate recipe for some kind of cookie might actually work and it scares me. It also contains a full cup of butter, so… maybe don’t eat them. Assuaging my worries about accidental sentience, the next recipe is:

title: baked stuffed chicken wings
categories: poultry
yield: 4 servings

1 lb beef stew meat
1 tb olive oil
1    onion, chopped
1    clove garlic, minced
1 tb chili powder
1 ts cumin
1 ts cumin
1 ts cumin
1 ts cumin
1 ts cumin
1 ts cumin

And it continues with “1 ts cumin” forever after that point.

Increasing the temperature to 0.5, we see that the list of ingredients is often completely disassociated from the instructions. For example:

     title: chicken noodle casserole
categories: main dish, poultry, main dish
yield: 4 servings

1 c  chicken broth
1 c  fresh parsley; chopped
2 tb butter
2 tb vegetable oil
1    carrot, sliced
1    onion, chopped
2    carrots, cut in chunks
3    garlic cloves, minced
1 ts chili powder
1 ts cumin
1 ts ground cumin
2 ts ground cumin
1 ts ground cumin
1/2 ts ground cumin
1/2 ts cayenne pepper
1/4 ts ground coriander
2    cloves garlic, minced
1/4 c  green pepper, chopped
1 tb peanut oil
2 tb chopped fresh parsley
salt and pepper

1. combine the soy sauce, salt, pepper, celery salt, basil and pepper.
shape into balls and place on a lightly greased cookie sheet. bake at 350
for 30-40 minutes.  the chicken should be soft.  serve with fresh
vegetables.

Dang, this one made me hungry. Add some sticky rice or chicken or something to those and that could be really tasty. But what’s with those ingredients?

We also start to see hints of the absurd creeping in, such as:

[truncated due to really long ingredient list]

cook pasta according to package directions; drain. rinse with cold
water; drain.  combine cornstarch and water in a large bowl and add
to meat mixture.  mix lightly with potato masher. stir in milk and beat
until smooth. pour into shallow baking dish and bake in preheated 350
degree oven for 25 minutes. remove from oven.  cool on wire rack.

Mix… with a potato masher?

      title: cranberry crisp potatoes
categories: vegetables, ethnic
yield: 1 servings

2 lb smoked beef, cubed
salt and pepper
cilantro sprigs
serrano pepper
chopped canned tomatoes

combine all ingredients in a medium bowl.  add the beans to the cooked
pasta and set aside.  place the oil in a 2-quart saucepan and cook
over medium heat for 5 minutes. add the onions and cook for 2
minutes. add the remaining ingredients, except the cheese. cook for
another 5 minutes or until the sauce thickens. serve over rice.

Another one that sounds pretty good, if you assume “all ingredients” refers to the ingredients above and that you have extra ingredients like the beans and the cooked pasta on hand.

We also see stuff like this:

     title: chinese cabbage & tomato salad
categories: salads, salads
yield: 6 servings

1    carrot, cut in 1 inch pieces
1    onion, chopped
1    garlic clove, chopped
1 ts salt
1/4 ts pepper
1    red bell pepper, seeded
-and chopped
1 tb chopped fresh parsley
2 tb sugar
1 ts salt
1 ts pepper
1/2 ts cayenne pepper
1 ts salt
1 ts sugar
1/2 ts sugar
1 ts sesame oil
1/4 c  cider vinegar
1 tb water
salt and pepper to taste
freshly ground black pepper

place all ingredients in a small saucepan and cook until the mixture
boils and thickens.  remove from heat.  add beans and cook for another
30 minutes.  add remaining ingredients and simmer for a further 5
minutes.  add soy sauce and simmer for about 10 minutes or until
thickened. serve over rice.

Notice how many times sugar and salt show up in that ingredients list?

And also this nonsense:

     title: chinese chicken
categories: chicken, main dish
yield: 4 servings

—————————steak——————————–
1 c  crushed corn
1 c  chopped onion
1/2 c  chopped green pepper
1 c  chopped celery
1 c  chopped green pepper
1 c  chopped celery
1 c  chopped onion
1 c  chopped onion
1 c  chopped green pepper
1 c  chopped carrots
1 c  chopped celery
2 c  chopped celery
1 c  chopped celery
1 c  chopped celery
1 c  sliced fresh mushrooms
1 c  chopped onion
1 c  chopped onion
1 c  chopped celery
2 tb fresh lemon juice
2 tb white vinegar
1 ts salt
1 ts ground cumin

place the sausage, chicken in a small container and puree until smooth.
add the chicken stock and bring to a boil.  reduce heat and simmer
for 10 minutes. add the beans, cover and cook for about 20 minutes.
mash the chicken and reduce the heat to medium and cook until
the meat is tender, about 20 minutes. meanwhile, bring the chicken to a
boil and simmer for 5 minutes, stirring occasionally. add the corn,
salt, pepper, sugar and worcestershire sauce, and cook for 1 minute.
add the garlic, salt and pepper and stir for another minute. add the
chicken and continue to stir for a few minutes longer. remove the chicken
from the heat and stir in the salt and pepper. return the corn to
the pot and simmer for about 20 minutes. serve the sauce with the
sauce. serves 6.

How would you even do this? You start by PUREEING SAUSAGE and end by serving the sauce with sauce. I laughed so hard at this one it hurt.

Moving on to the opposite extreme, temperature = 1.0

Warning: Absurdist Cooking

     title: danish icebox
categories: breads
yield: 10 servings

1 c  plain yogurt
2 tb light soy sauce
1/2 ts salt
1/8 ts freshly ground pepper
1    onion

cut beef into 1″ x 1/4″ rings.  blanch oranges in salt fat oil and
lemon flavour. meanwhile, cut bacon over chicken to 1/2 inch thickness.
carefully scoop chicken breasts off cod that look skins desired with a
doagh cloth masher; cut kabobs into cubes. in batches, combine
cinnamon, cayenne, salt and pepper; stir into the dressing mixture.
bake 20 minutes.

add chopped clams to marinade and garnish with scallions.

Ahh, the Danish Icebox. A classic comfort food from the old world. Those quaint traditions, like cutting the bacon over chicken, which you then scoop off cod. Cod that looks like skins.

     title: fettucine soup
categories: soups, ceideburg 2
yield: 1 servings

2 tb canola oil
3 ea chicken breast halves
5 tb lemon juice
2 tb sugar
1/2 ts curry powder
2 tb soy sauce
3 tb cornstarch
1/3 c  dry white wine
7 ea chick peas, sliced
1 c  chopped green pepper
2 tb salt
1 tb parsley flakes
1    whole alternight bones
2 c  star brown broth
2 oz cream cheese
paprika

brown beef in butter. add vegetables and cook on high heat until
browned. increase heat to 315 degrees f. heat 2 teaspoons butter in a
saucepan over medium heat. when hot, add onions, onions, parsley,
bell pepper, basil, parsley, capers and salt. cook for 5
minutes (add parsley to the oil and cook until the wine has been reduced
by half,about 12min.ends the same manner to warm over moderate
heat just until egg mixture just comes to a boil. remove as much of
the cooking liquid and foam in pan.  combine the egg white and sugar in
a small bowl. form into balls and press down to finish cooking. bake
at 375 f for one hour, until the patties pull away from sides of
the pan. serves 6 to 8.

That… is not fettuccine soup. I don’t know what that is, but that unmatched open-paren haunts me). There we go.

      title: apricot-apple deluxe
categories: side dishes, tex-mex, poultry
yield: 12 servings

4 ea egg whites
1/2 c  coconut
2 c  pecan halves
6    shredded horseradish
fresh ripe toasted walnuts*
sliced strawberries
mint extract
fruit preserves:
3/4 ts almond extract
1/2 ts vanilla extract
combine warm orange juice
concentrate, pecans,
whipped, lemon rind

:       preheat oven to 275 degrees f.

in a large bowl, combine the champagne and sugar. melt, stirring
constantly, or a few minutes to combine it well. stir in
margarine and chocolate.  set the bowl over simmering water in a
microwave oven 5 minutes or until the mixture is liquid is evaporated. stir in
the grape-nuts and frost it evenly with melted chocolate and mix
well. pour the mixture into the center of the pudding (using a
syrup), then combine in hot water until smooth. beat vigorously until smooth.

Why is this tex-mex? WHY DOES IT HAVE HORSERADISH? Some questions will never have answers. I love those instructions, though. It feels so avant-garde. You melt sugar into champagne, then add butter and chocolate and evaporate out the liquid in a microwave? I… kinda want to try this one.

Here are some dumps:

t=0.5

t=0.75

t=1.0

I might do more intervals in between at some point– I typically do a 0.6, 0.8, 1.0 spread, but thought I’d space them out for a bit more variety this time.

Next I think I’m going to go back and try to generate some grants again. That sounds like fun.
