After a few failed experiments (namely, trying to train char-rnn on the grant award database, only to discover that the file had been corrupted with non-unicode characters, which made the loss explode, and training on the Linux kernel source, which overfitted really badly), I decided to move back to something a bit simpler– the dictionary. My training rig isn’t connected to the internet at the moment, so no full dumps yet, but here are some favorites from current checkpoints.
From the best checkpoint:
This checkpoint actually produced correct definitions for a lot of words, including:
Preponderation (n.) The act of prepositing
Extravagance (n.) The state of being extravagant
Tattler (n.) One who, or that which, tattles.
which implies that the network is either representing the language on some deeper level (ha ha, that’s likely) or just overfitting to the data. However, this only seems to happen with words whose definition contains the word itself, or a close relative of it, like the examples above. I’m starting to wonder if I could do something with a word2vec-style algorithm to make an even better neural dictionary, but for now I’m just filtering out entries whose definitions contain their own headword.
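The filter itself is nothing clever. Roughly, and assuming the dictionary has already been parsed into (word, definition) pairs, it looks something like the sketch below (the crude stemming is just for illustration, not the exact heuristic):

```python
def is_self_referential(word, definition):
    # Crude check: does the headword, or a short stem of it, appear in its own
    # definition? Catches "Tattler: one who tattles" and
    # "Extravagance: the state of being extravagant".
    stem = word.lower()[:max(4, len(word) - 3)]  # chop common suffixes like -er, -ance
    return stem in definition.lower()

entries = [
    ("Tattler", "One who, or that which, tattles."),
    ("Smoke", "The spot or strap by which swings are driven"),
]

kept = [(w, d) for w, d in entries if not is_self_referential(w, d)]
# kept only contains "Smoke"; the self-referential "Tattler" entry is dropped.
```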
Temperature 0.1
Station (n.) The act of stating; a statement or condition of any object or position; a statement of the particulars of a person or thing; as, the station of a country or court
Temperature 0.5
Infusorianism (n.) The doctrine of the infusorian ministry
Manure (n.) To make a manufacturer; to seize or be confirmed to the mind
Temperature 0.75
Confine (v. t.) To interchange or impress as an expression of assignment; to reduce or to indicate or represent; to consent; to disapprove of; as, to constitute the title of the rules of a province or a firearm.
Endoderm (n.) A white crystalline substance, C0H10, of the acetylene series, found in alkaloids, and having been elongated in the proportion of ordinary odor, in which the phenomena of certain compounds are produced artificially, and is derived from its natural ore, and is now a mixture of granular copper;– also called hexanelin
Encrin (n.) A bishop composed of sensible colors.
Stick (v. t.) To fix or defeat with a stick
Cloud (n.) A striving in a church; as of men.
Temperature 1.0
Imbreviate (v. t. & i.) To increase a disease, office, or claim
Nipperty (a.) Like a nipple; of or pertaining to the nipping.
Sympathetic (n.) Syphilis; execution
OK wow, that wins for biggest miss.
Encognat (n.) A printed person; an otolith
Smoke (v.) The spot or strap by which swings are driven
Hey that’s not a verb!
Ensifer (n.) A person who held or performs the privileges of the discriminal world itself
Gavash (v. t.) To cause to swing into game
Cloyer (n.) The harsh, uncertain part; any body or degree of obstruction; a handle of screws and judges
Tattlery (n.) A vessel for catching a plate or animal like a strumpet
Levator (n.) One who annoys the occupation of men
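For anyone who hasn’t played with char-rnn: the temperature numbers above just divide the output logits before sampling, so low temperatures stick to the safest, most probable characters and high temperatures let the network gamble. Here’s a minimal sketch of that sampling step, in plain numpy rather than char-rnn’s actual Lua/Torch code:

```python
import numpy as np

def sample_char(logits, temperature=1.0):
    # Divide the logits by the temperature, softmax, then draw one character index.
    # temperature -> 0 approaches argmax (the repetitive "act of stating" regime above);
    # temperature = 1.0 samples from the model's distribution unchanged.
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)
```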
It’s notable that the validation loss for this network never got very low at all– this is due to the nature of the problem. With most datasets, a good deal of the loss can be avoided just by capturing the structure of the database. The recipes, for example, all follow the same basic format: a name, categories, ingredients, and then instructions.
This also gives the network a lot of information to work with when it gets to the instructions. A recipe called “fried chicken” will probably contain chicken. Something with flour will probably be baked at some point. And so on. In this way, to produce each individual recipe, the network only has to remember the details for the length of that recipe.
In this case, however, that’s just not enough. The network can’t really guess the meaning of a word from its letters except by looking at its structure (prefixes, suffixes, roots, etc.), and even that information is only worth remembering for a very short time. It would be much more effective to hardcode those relationships with some kind of word2vec (or char2vec?) system that would actually represent the meaning of the word in some abstract way. Especially because then it could possibly make sense of all those “wordly: the state of being a word; see wordy” definitions.
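To make that idea a little more concrete, here’s one way it could be wired up: a char-level decoder that gets a pretrained word embedding as its starting hidden state. This is a sketch I haven’t trained, written in PyTorch rather than char-rnn’s Torch/Lua, and the dimensions and names are all made up:

```python
import torch
import torch.nn as nn

class DefinitionDecoder(nn.Module):
    """Char-level decoder whose initial hidden state comes from a pretrained
    word embedding, so the headword's 'meaning' is handed to the network
    instead of being squeezed out of its spelling."""

    def __init__(self, n_chars, embed_dim=300, char_dim=64, hidden_dim=512):
        super().__init__()
        self.char_embed = nn.Embedding(n_chars, char_dim)
        self.project = nn.Linear(embed_dim, hidden_dim)  # word vector -> initial hidden state
        self.rnn = nn.GRU(char_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_chars)

    def forward(self, word_vec, char_ids):
        # word_vec: (batch, embed_dim) pretrained embedding of the headword
        # char_ids: (batch, seq_len) characters of the definition generated so far
        h0 = torch.tanh(self.project(word_vec)).unsqueeze(0)  # (1, batch, hidden_dim)
        x = self.char_embed(char_ids)
        out, _ = self.rnn(x, h0)
        return self.out(out)  # per-character logits over the vocabulary
```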
But anyway. I’d really like to get the grants database fixed up and running so I can get some real output, and I’m also looking into some other fun things– Google Magenta, for one, and a reinforcement-learning agent that plays Super Mario. Could be neat.
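On the grants file, the fix is probably nothing more exotic than a lenient re-decode to throw away whatever bytes aren’t valid UTF-8 before training. A minimal sketch, with grants.txt standing in for the real filename:

```python
# Drop the byte sequences that won't decode as UTF-8 so the loss doesn't explode again.
# "grants.txt" is a placeholder, not the actual dump's filename.
with open("grants.txt", "rb") as f:
    raw = f.read()

# errors="ignore" silently discards invalid bytes; errors="replace" would leave
# a U+FFFD marker instead, which is handy if you want to see where the damage is.
with open("grants_clean.txt", "w", encoding="utf-8") as f:
    f.write(raw.decode("utf-8", errors="ignore"))
```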