Genetic Algorithms are cool
10 Dec 2019

Introduction
Genetic Algorithms are a method of optimizing a function by a process similar to that of evolution. In nature, creatures have genes which affect traits, which in turn determine whether they are successful organisms or not. The successful organisms survive, and their genes are passed on. Similarly, we can generate a population of solutions to a problem, decide which are successful (which of them are fit), and then combine the successful solutions into a second generation.
I'm interested in using a Lisp to perform function fitting: an unknown function is applied to an input data series \(x\), resulting in \(y\). The function could be recovered by trying different functions \(f(x)\) until the results match. In Lisp, functions are data, so this problem becomes simply building up the correct list out of building blocks (operations).
So in the interest of shoe-horning anything ML into a project, I'm gonna use a genetic algorithm to perform this function fitting.
The Algorithm
Some definitions:
- Gene → Basic building block which can combine with others (add, subtract, multiply, divide, integrate, differentiate).
- Organism → A collection of genes (a function).
- Fitness score → How well does the organism perform?
- Reproduction → Organisms are combined to give a (hopefully) more successful offspring.
- Population → A collection of different organisms which compete to reproduce.
- Mutation → When an organism reproduces, there is a chance a gene will activate/change independent of parent genetics.
Let's start with a sketch of the process:
We have a bunch of genes (different genes represented by different colours here), which combine together into an organism, and a bunch of organisms together form a population. The population is assessed according to fitness, and the most fit organisms are allowed to reproduce. The next generation is therefore more fit. The process is repeated until the population is at maximum fitness (i.e. the search has converged on a solution).
The general algorithm is:
- START: start with an initial population of a variety of organisms.
- FITNESS: assess fitness of population.
- SELECTION: pair up successful members of population for reproduction.
- REPRODUCTION: combine pairs to give offspring (the new generation).
- MUTATION: mutate offspring.
- REPEAT: GOTO 2.
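The loop above can be sketched in code. This is in Python rather than the Lisp the post targets, and it runs on a toy "one-max" problem (organisms are bit-lists, fitness is the number of 1s) just to show the START → FITNESS → SELECTION → REPRODUCTION → MUTATION cycle; all names here are my own, not from the post:

```python
import random

def evolve(fitness, population, generations=100, elite_frac=0.1, p_mutate=0.05):
    """Generic GA loop: FITNESS -> SELECTION -> REPRODUCTION -> MUTATION, repeated."""
    for _ in range(generations):
        # FITNESS: score and rank every organism
        ranked = sorted(population, key=fitness, reverse=True)
        # SELECTION: keep only the top slice as parents
        n_elite = max(2, int(len(ranked) * elite_frac))
        parents = ranked[:n_elite]
        # REPRODUCTION: refill the population from random parent pairs
        children = []
        while len(children) < len(population):
            a, b = random.sample(parents, 2)
            cut = random.randrange(len(a))
            children.append(a[:cut] + b[cut:])
        # MUTATION: flip each gene with a small probability
        population = [
            [g ^ 1 if random.random() < p_mutate else g for g in child]
            for child in children
        ]
    return max(population, key=fitness)

# START: a random population of 50 organisms, 20 binary genes each
random.seed(0)
pop = [[random.randint(0, 1) for _ in range(20)] for _ in range(50)]
best = evolve(sum, pop)
print(sum(best))  # close to the maximum of 20
```

The same skeleton carries over to the function-fitting problem: only the organism representation and the fitness function change.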
General Problem
Suppose we have a function \(f(x)\) that we can measure. We have a sample of output \(y\) and the respective input \(x\), and we want to know the form of \(f(x)\). The functional form is often known for problems like this, so the problem reduces to finding the correct set of coefficients, and we can employ methods like linear regression and stochastic optimisation to solve it. However, if the function is unknown, the problem remains complex. We could guess the form, update our guess based upon the result, and iterate until we get a good enough answer.
This is okay if the functional form is easy enough to guess, but what if the function is difficult to discern? A neural network can be used to observe patterns, but the resulting weights of the perceptrons may not mean anything to the reader. We can employ a mechanism similar to a neural network, except instead of changing the weights combining perceptrons/neurons, we're changing the form of the pattern function.
Using the genetic algorithm gives us a good way to decide how to perform the update ("these functions performed well, let's keep them and work from there").
General Solution
GA in the context of the function solver:
- START: generate a large set of functions of random operations.
- FITNESS: find the most successful functions (which ones have the smallest residual error).
- SELECTION: pair up successful functions, taking the top n% of organisms.
- REPRODUCTION: common operations in successful functions are taken into offspring along with a random subset of the uncommon operations.
- MUTATION: each offspring will have some (low, non-zero) chance to mutate, and each gene will have a mutation chance associated with it. If a gene is selected for mutation, its operation is changed in some way.
Fitness
Let's start with something a bit tangible: fitness. We need to decide on a good method for judging the fitness of an "organism". In the context of the symbolic function fitter, fitness is how well the function recreates the expected output – simply an error function. Let \(y(x)\) be our function, \(y_{expected}\) our expected result given the input series \(x\), and \(y_{model}\) the result given by our model function (the "organism"). The error could be defined simply as:
\[ \varepsilon = y_{expected} - y_{model}\]
or we can weight it against the expectation, and square it so large errors are punished:
\[ \varepsilon = {\left(\frac{y_{expected} - y_{model}}{y_{expected}}\right)}^2\]
Either way we end up with a measure of how well the function recreates the expectation – the fitness of our organism.
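The second, weighted form might look like this in Python. Summing over the series is my choice (the equations above are per-point and leave the aggregation unspecified), and note the weighting divides by \(y_{expected}\), so this form breaks down wherever the expected value is zero:

```python
def error(y_expected, y_model):
    """Relative squared error, summed over the data series; lower = fitter."""
    return sum(((ye - ym) / ye) ** 2 for ye, ym in zip(y_expected, y_model))

y_expected = [1.0, 2.0, 4.0]
print(error(y_expected, y_expected))       # perfect model scores 0.0
print(error(y_expected, [1.0, 2.0, 2.0]))  # ((4-2)/4)^2 = 0.25
```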
Selection
Now that we have an idea of good/bad organisms, we can decide which ones to take forward into the next generation. This is a very important step. We want to take enough genes forward that we have sufficient diversity to create the solution, but we don't want to just take everything forward or we won't converge. This step is sometimes also called "Elitism" - only the best move forward.
We need to decide on a cut-off for who moves forward. I reckon \(10\%\) is a good number (I've seen others use this amount, and it can always be tuned later), so the top \(10\%\) of the population will move up to the next level.
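The top-\(10\%\) cut can be sketched in a few lines of Python (a sketch of mine, not the post's code), ranking by the error function so that a lower score means a fitter organism:

```python
def select(population, error, frac=0.1):
    """Elitism: keep the top `frac` of the population, ranked by error
    (lower error = fitter organism)."""
    n = max(2, int(len(population) * frac))  # always keep at least one pair
    return sorted(population, key=error)[:n]

# Stand-in organisms: plain integers, with the "error" being the value itself
pop = list(range(100))
survivors = select(pop, error=lambda o: o)
print(len(survivors))  # 10
```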
Reproduction
Organisms selected for reproduction can now transfer their genes and create the next generation. We can pair off the organisms and create offspring from the pairs.
What happens to the old generation?
If the old generation is cleared after reproduction, we lose a lot of genetic information, and the population will rapidly die out unless organisms are allowed to produce multiple offspring.

If the old generation persists and reproduction adds new organisms every time, the population rapidly grows and convergence is never met. Let's call this rising population reproduction.

We could clear a small portion of the old generation (small, but larger than the amount added by reproduction). We then have a slowly declining population which will end in convergence, but not until sufficient generations have passed to allow a good solution to be found. Let's call this falling population reproduction.

Finally, we could ensure reproduction and death are matched. The top \(10\%\) reproduce to form an additional \(10\%\), which is matched by the bottom \(10\%\) dying off. This ensures the population slowly improves towards a solution, which can be improved further by increasing the number of generations. This method we'll call constant population reproduction.
A reproduction will take two organisms to produce an offspring. We can take the common parts of the organisms and put them directly into the offspring, and randomly choose from the individual (non-common) genes.
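That crossover rule, shared genes pass straight through, unshared genes are coin-flipped, can be sketched like so (a simplification of mine that treats an organism as a set of gene names, ignoring coefficients and gene order):

```python
import random

def reproduce(parent_a, parent_b):
    """Common genes go directly to the offspring; each non-common gene
    has a 50% chance of being inherited."""
    a, b = set(parent_a), set(parent_b)
    child = a & b                 # the common part is kept for certain
    for gene in a ^ b:            # the non-common genes are coin-flipped
        if random.random() < 0.5:
            child.add(gene)
    return child

child = reproduce({"add", "multiply", "integrate"}, {"add", "divide"})
print(child)  # always contains "add"; the rest varies run to run
```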
Mutation
Again, there are many different ways we can take this. Individual genes can mutate, depending on their form (`add-coefficient` could be mutated by changing its coefficient), new genes could be added, or old genes removed. That's three possibilities. Each reproduction will have the possibility of one of these happening.
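The three possibilities could be sketched as below, reusing the hypothetical (operation, coefficient) gene encoding from earlier (the operation names and distributions here are arbitrary choices of mine):

```python
import random

def mutate(organism):
    """Apply one of the three mutations described in the text, chosen at
    random: tweak a coefficient, add a new gene, or remove a gene."""
    organism = list(organism)
    kind = random.choice(["tweak", "add", "remove"])
    if kind == "tweak" and organism:
        i = random.randrange(len(organism))
        op, c = organism[i]
        organism[i] = (op, c + random.gauss(0, 1))  # nudge the coefficient
    elif kind == "add":
        organism.append((random.choice(["add", "multiply"]), random.random()))
    elif kind == "remove" and len(organism) > 1:    # never empty an organism
        organism.pop(random.randrange(len(organism)))
    return organism

print(mutate([("multiply", 3.0), ("add", 1.0)]))
```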
Initial population
Finally, let's talk about the start. We need to inoculate our petri dish with enough genetic diversity to ensure convergence on a good solution (especially if falling population reproduction is used). If a constant or rising population is maintained, we can still get a good answer so long as the population is allowed to mutate.
The population complexity (how many genes to an organism) should start high enough to capture the solution, but not so high as to overshoot it. Perhaps it would be best to start at minimal complexity (only four or five genes) and turn up the gene mutation probability.
Questions? Comments? Get in touch on Twitter!