I’ve been a bit too busy to run any experiments, unfortunately, but I’ve still been thinking about this quite a bit. Since I haven’t posted in about a month, I figured I’d share one of my motivations for getting into machine learning: it offers a lot of really interesting parallels to human learning. Below I’ve collected a few examples of techniques I’ve picked up that have some surprising connections, and have even given me a bit more insight into how people work.
Learning Rate Annealing
This is a common training technique that I’ve been using quite a bit with WaveNet. Essentially, the idea of learning rate annealing is that over the course of training, the learning rate (the size of the steps the optimizer takes) is gradually lowered, so the system learns slower and slower. This suits the nature of a lot of problems machine learning is used for: early on, taking tiny, careful steps doesn’t buy you any measurable improvement, so it’s better to start with a high learning rate. The high starting rate also helps the optimizer avoid getting stuck in local minima that might distract from another, more optimal solution. However, if the learning rate is kept that high, the optimizer will skate right over the global minimum. Lowering the rate as training progresses lets it home in and get more precise as it gets closer to the answer it’s looking for.
In plainer English, the high learning rate at the beginning lets the model quickly get a general sense of how things work, and lowering it as training progresses lets it fill in the details.
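To make that concrete, here’s a toy sketch in plain Python: gradient descent on a simple one-dimensional loss, with the step size shrinking a little every iteration. The loss function, starting point, and decay factor are all made up for illustration.

```python
# A toy sketch of learning rate annealing: gradient descent on f(x) = (x - 3)^2,
# where the step size starts large and decays exponentially over training.

def grad(x):
    return 2.0 * (x - 3.0)   # derivative of (x - 3)^2

x = 10.0          # initial guess, far from the minimum at x = 3
initial_lr = 1.0  # big early steps: fast progress, but it overshoots the minimum
decay = 0.99      # every step, the learning rate shrinks by 1%

for step in range(500):
    lr = initial_lr * decay ** step   # the annealed learning rate
    x -= lr * grad(x)                 # standard gradient descent update

print(x)  # ends up very close to 3.0 once the steps have become small
```

If you hold the learning rate at its starting value of 1.0 here, the guess just bounces between 10 and -4 forever, skating over the minimum every time; letting the rate decay is what finally lets it settle.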
What’s fascinating is that this mirrors something that happens to humans as we age. As we get older, neuroplasticity drops, and it gets harder and harder to learn new skills, change tastes and opinions, or adapt to new environments. While this might seem like a purely bad thing, since it makes learning harder, it also proofs the system (be it a machine or a human) against outliers. For example: if the first time you see someone drop a rock, it falls, you might conclude that this happens every time (and you would be correct). As you see more and more objects dropped, this belief solidifies, until one day someone drops something and it doesn’t fall. If your pattern recognition were still as elastic as it was the day you saw that first rock dropped, you might conclude that dropped objects sometimes float, which is wrong. What’s more likely is that you’ll assume something fishy is going on, and this one data point won’t skew your internal model of how dropped objects behave.
Learning rate annealing is what stops the first thing we see that contradicts our worldview from making us throw out all of the assumptions we’ve built up so far (for better or worse).
Unreliable Parts & Noisy Data
One recurring problem with machine learning models is the tendency to overfit. This essentially means that the model is learning to match patterns that exist in the training data but are not representative of reality. The result is a model that can faithfully reproduce/categorize/recognize the training data, but fails miserably in the wild. There are a lot of ways to avoid this, but one of the most common ones used for neural networks is called dropout.
The idea behind dropout is to randomly disable some fraction of the neurons in the network at each training step. Since any given neuron could be down at any time, the network has to learn redundancy, forcing it to build a more robust representation of the data with overlapping roles for the neurons.
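As a rough sketch of what that looks like in code, here’s the “inverted” form of dropout written in NumPy, with an arbitrary 50% drop probability:

```python
import numpy as np

def dropout(activations, drop_prob=0.5, training=True):
    """Minimal sketch of (inverted) dropout: during training, randomly zero out
    each unit with probability drop_prob and rescale the survivors so the
    expected activation stays the same. At test time, do nothing."""
    if not training or drop_prob == 0.0:
        return activations          # every neuron participates at test time
    keep_prob = 1.0 - drop_prob
    mask = np.random.rand(*activations.shape) < keep_prob
    return activations * mask / keep_prob

# Example: on average, half the units in this layer's output get silenced,
# and a different half each time the function is called.
layer_output = np.random.randn(4, 8)
print(dropout(layer_output, drop_prob=0.5))
```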
Of course, the surest way to avoid overfitting is with more data, but that isn’t always possible. In this situation, a common technique is to multiply the amount of data by taking each training sample and distorting it in some way: rotating it, shifting it, scaling it vertically or horizontally, adding random noise, shifting colors, and so on. This is mostly applicable to images; for text-based data, it might instead mean using a thesaurus to replace words with synonyms, or deliberately including misspellings.
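Here’s a sketch of what that can look like for images, using NumPy only and covering just a couple of the distortions mentioned above; the flip probability, shift range, and noise level are arbitrary:

```python
import numpy as np

def augment(image):
    """Rough sketch of on-the-fly augmentation for an image stored as a 2-D
    NumPy array. Each call applies a random combination of distortions, so the
    model never sees exactly the same training sample twice."""
    if np.random.rand() < 0.5:
        image = np.fliplr(image)                    # random horizontal flip
    shift = np.random.randint(-3, 4)
    image = np.roll(image, shift, axis=1)           # small random horizontal shift
    image = image + np.random.normal(0.0, 0.05, image.shape)  # a little Gaussian noise
    return image

# During training you'd call augment() on every sample, every epoch,
# instead of feeding in the raw image each time.
```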
This artificial noise gives the model a larger range of possible environments to interpret, which will make it better at generalizing (though it’s not quite as good as just having more data) and better at interpreting poorly sanitized data, which is also important for working in the wild. In addition, this random artificial noise prevents the model from overfitting to noise patterns present in the training data because the noise is always changing. Even though each individual sample is distorted, the noise averages out in the end and results in a more robust system.
These two techniques are so powerful that Google has actually built hardware around the same principle: a parallel processing chip they call the “Tensor Processing Unit,” which has “reduced computational precision, which means it requires fewer transistors per operation,” and is “an order of magnitude better-optimized” than conventional hardware for this kind of work. By stripping away the precision and reliability that are so important to most other kinds of computation, and accepting noisier, less exact circuitry, they’ve ended up with something that actually does the job better.
This also mirrors life. Biology is noisy, unreliable, and messy. The parts don’t always work, and when they do, they’re not very precise, but for this application it not only doesn’t matter, it actually helps. This is, in large part, why intelligent life was able to evolve at all. Neural networks are the ideal system for creating intelligence in a noisy, unreliable, ever-changing environment.
Internal Vector Encoding
This is one of the things I find most fascinating about machine learning. One of the most basic types of neural network, the Restricted Boltzmann Machine, has only two layers (or three, if you count the reconstructed output as its own layer): a visible layer that holds the input and a smaller hidden layer. Training it amounts to learning to reconstruct the input from the hidden layer’s activity, which means the model has to figure out the best way to compress the input while still being able to recover the original data. The result is a compressed vector representation that reflects the structure of the input in a more compact way.
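Here’s a structural sketch of that compress-and-reconstruct idea in PyTorch. It’s written as a simple bottleneck autoencoder rather than a literal RBM (an RBM is trained differently), and the layer sizes are arbitrary; the point is just the shape: wide input, narrow hidden layer, reconstruction of the input.

```python
import torch
from torch import nn

# A bottleneck model: compress the input into a small code, then try to
# reconstruct the original input from that code alone.
class Bottleneck(nn.Module):
    def __init__(self, n_inputs=784, n_hidden=32):
        super().__init__()
        self.encode = nn.Linear(n_inputs, n_hidden)   # compress
        self.decode = nn.Linear(n_hidden, n_inputs)   # try to recover the input

    def forward(self, x):
        code = torch.sigmoid(self.encode(x))          # the compact internal vector
        return torch.sigmoid(self.decode(code))

model = Bottleneck()
batch = torch.rand(16, 784)                           # a fake batch of inputs
loss = nn.functional.mse_loss(model(batch), batch)    # reconstruction error to minimize
```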
Expanding this simple model with more layers creates some really interesting structures. The output doesn’t even have to be in the same form as the input: the input could be English and the output French, or vice versa. In that case, the internal vector representation stores the meaning of the sentence, independently of the language. What’s really fascinating about this particular example is that it works better the more languages are involved. Because the internal representation stays the same, more languages let the model build a better compressed representation, using ideas that may not exist in one language to help translate between others. Translating from English to Chinese is easier if the network can also translate English->French and French->Chinese.
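As a structural sketch (not a working translator), here’s roughly where that shared, language-independent representation would sit; every language code, vocabulary size, and dimension below is invented for illustration:

```python
import torch
from torch import nn

# One shared encoder maps any input sentence to a language-independent vector;
# each target language gets its own decoder head reading from that same vector.
class SharedMeaningTranslator(nn.Module):
    def __init__(self, vocab_sizes, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.ModuleDict(
            {lang: nn.Embedding(n, embed_dim) for lang, n in vocab_sizes.items()})
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)  # shared by all languages
        self.decode = nn.ModuleDict(
            {lang: nn.Linear(hidden_dim, n) for lang, n in vocab_sizes.items()})

    def forward(self, tokens, src_lang, tgt_lang):
        embedded = self.embed[src_lang](tokens)
        _, meaning = self.encoder(embedded)       # the language-independent representation
        return self.decode[tgt_lang](meaning)     # turn it back into the target language

model = SharedMeaningTranslator({"en": 10000, "fr": 10000, "zh": 12000})
logits = model(torch.randint(0, 10000, (1, 7)), src_lang="en", tgt_lang="fr")
```

Because every language pair exercises the same encoder, adding more languages gives that shared representation more to learn from, which is the effect described above.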
Once again, we see this in humans too. Someone who knows two languages is much more likely to pick up a third than someone who knows only one is to pick up a second. It could be argued that this is down to cultural differences, that a different upbringing let them learn multiple languages while their learning rate was still relatively high, but I think there are other factors at work. This is just a personal belief, but I would not be surprised if something similar were going on: knowing multiple languages lets someone build a more efficient internal representation of ideas and of language in general, since they can integrate aspects of multiple languages into their thought processes. This is the strongest argument I’ve ever been presented with for learning multiple languages.