What are fitness sharing and niche count in evolutionary computation?

What are "fitness sharing" and "niche count" in the context of evolutionary computation?

Evolutionary algorithms (EAs) tend to converge to a single solution as the diversity of the population diminishes [1]. This behavior is known as genetic drift. Any technique that maintains diversity in the population based on the distance between population members is called a niching technique.
Fitness sharing is a type of niching in which the fitness of each individual is scaled down based on its proximity to others. This means that good solutions in densely populated regions receive a lower (shared) fitness value than comparably good solutions in sparsely populated regions, so the algorithm's selection operator places less emphasis on these high-quality, high-density solutions. The distance can be calculated based on the values in either decision space (genotype), solution space (phenotype), or both (as in Goldberg and Richardson [1]). Distance in genotype is usually defined using the Hamming distance, whereas distance in phenotype is usually defined using the Euclidean distance.
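More precisely, the "niche count" is the quantity the raw fitness is divided by. In the standard formulation [1], two individuals i and j share fitness according to a sharing function sh(d_ij) = 1 - (d_ij / sigma_share)^alpha if d_ij < sigma_share, and sh(d_ij) = 0 otherwise, where sigma_share is the niche radius and alpha (often 1) controls the shape of the sharing function. The niche count of individual i is m_i = sum over j of sh(d_ij), and its shared fitness is f'_i = f_i / m_i. An individual with no neighbours inside the niche radius has m_i = 1 (it only "shares" with itself, since d_ii = 0) and keeps its raw fitness, while individuals in crowded regions have larger niche counts and are penalised accordingly. In the Java method below, the denominator variable plays the role of this niche count.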
A simple fitness sharing method is given by the following Java method:
/**
 * Computes the shared fitness value for a solution.
 *
 * @param index      the index of the solution for which a shared fitness value will be computed
 * @param minDist    any solution closer than minDist shares fitness with the current solution
 * @param shareParam a parameter that defines how much influence sharing has; higher = more sharing
 * @param population the array of solutions; each solution has a genotype and an associated fitness value
 */
public double computeSharedFitnessValue(int index, double minDist, double shareParam, Solution[] population){
    // The denominator is the niche count: the sum of sharing-function values over all
    // solutions within minDist of the current one (including itself, which contributes 1
    // because its distance to itself is 0, so the denominator is never zero).
    double denominator = 0;
    for (int j = 0; j < population.length; j++){
        final double dist = hamming_dist(population[index], population[j]);
        if (dist < minDist){
            denominator += 1 - (dist / shareParam);
        }
    }
    return population[index].getFitnessValue() / denominator;
}
Motivational Example: The following figure perfectly illustrates why fitness sharing is so important in multi-objective problems. In Figure A (left), diversity was maintained throughout execution. As a result, the solutions span a considerable portion of the true Pareto front (shown here as wire frame). In Figure B (right), the population only converged to a small area of the Pareto front. In many situations, even if the solutions in Figure B were of higher quality, a decision maker would prefer the diversity of options provided in Figure A to the (nominal) improvement in quality of Figure B.
Additional Resources:
[1] Goldberg, D. E. and Richardson, J., "Genetic algorithms with sharing for multimodal function optimization," Proceedings of the Second International Conference on Genetic Algorithms, 1987.
[2] Fonseca, C. M. and Fleming, P. J., "Genetic Algorithms for Multi-Objective Optimization: Formulation, Discussion and Generalization," Proceedings of the Fifth International Conference on Genetic Algorithms, 1993.

Related

Fitness Function alternatives in Genetic Algorithms for game AI

I have created a Gomoku(5 in a row) AI using Alpha-Beta Pruning. It makes moves on a not-so-stupid level. First, let me vaguely describe the grading function of the Alpha-Beta algorithm.
When it receives a board as input, it first finds all runs of stones and gives each one a score (one of 4 possible values) depending on its usefulness as a threat, which is decided by its length. It then returns the sum of all the run scores.
But the problem is that I chose these scores (4 in total) by hand, and they don't seem like the best choices. So I've decided to use a genetic algorithm to generate them. Each gene will be one of the 4 scores, so, for example, the chromosome of the hard-coded scores would be: [5, 40000, 10000000, 50000]
However, because I'm using the genetic algorithm to create the scores of the grading function, I'm not sure how I should implement the genetic fitness function. So instead, I have thought of the following:
Instead of using a fitness function, I'll just merge it into the selection process: if I have 2 chromosomes, A and B, and need to select one, I'll simulate a game between an AI using chromosome A and an AI using chromosome B, and select the chromosome that wins.
1. Is this a viable replacement for the fitness function?
2. Because of the characteristics of the Alpha-Beta algorithm, I need to give the max score to the win condition, which in most cases is set to infinity. However, because I can't use infinity, I just used an absurdly large number. Do I also need to add this score to the chromosome? Or, because it's insignificant and doesn't change the values of the grading function, should I leave it as a constant?
3. When initially creating chromosomes, random generation following a standard distribution is said to be optimal. However, the genes in my case have large deviations. Would it still be okay to generate chromosomes randomly?
Is this a viable replacement for the fitness function?
Yes, it is. It's a fairly common way to define a fitness function for board games. A single round is probably not enough (but you have to experiment).
A slight variant is something like:
double fitness(Agent_k)
    fit = 0
    repeat M times
        randomly extract an individual Agent_i (i <> k)
        switch (result of Agent_k vs Agent_i)
            case Agent_k wins: fit = fit + 1
            case Agent_i wins: fit = fit - 2
            case draw:         fit doesn't change
    return fit
i.e. an agent plays against M randomly selected opponents from the population (with replacement but avoiding self match).
Increasing M decreases the noise but requires longer simulation times (M = 5 is a value used in some chess-related experiments).
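A minimal Java sketch of this scheme (chromosomes stored as double arrays; GameResult and playGame are hypothetical placeholders for your own Gomoku engine) might look like:
import java.util.Random;

enum GameResult { K_WINS, OPPONENT_WINS, DRAW }

class CompetitiveFitness {
    // Hypothetical hook into your Gomoku engine: play one game between an AI using
    // chromosome a and an AI using chromosome b, and report the outcome.
    static GameResult playGame(double[] a, double[] b) {
        return GameResult.DRAW; // placeholder: plug in your alpha-beta players here
    }

    // Fitness of chromosome k: play M games against randomly chosen opponents
    // from the population (avoiding self-match) and accumulate +1 / -2 / 0.
    static double fitness(int k, double[][] population, int M, Random rng) {
        double fit = 0;
        for (int m = 0; m < M; m++) {
            int i;
            do { i = rng.nextInt(population.length); } while (i == k); // avoid self-match
            switch (playGame(population[k], population[i])) {
                case K_WINS:        fit += 1; break;
                case OPPONENT_WINS: fit -= 2; break;
                case DRAW:          break; // fitness unchanged on a draw
            }
        }
        return fit;
    }
}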
2. Because of the characteristics of the Alpha-Beta algorithm...
Not sure about the question. A very large value is a standard approach for a static evaluation function signaling a winning condition.
The exact value isn't very important and probably shouldn't be subject to optimization.
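For instance, a rough sketch (assuming the board can be summarised as a count of runs per threat class; Evaluation, runCountsPerClass and isWin are hypothetical names) that keeps the win score as a constant outside the evolved chromosome:
class Evaluation {
    // The terminal win score is a fixed constant, deliberately kept outside the evolved
    // chromosome and chosen to dominate any sum the evolved scores can reach.
    static final double WIN_SCORE = 1e12; // stands in for "infinity"

    // chromosome[c] is the evolved score for threat class c; runCountsPerClass[c] is how
    // many runs of that class appear on the board (a hypothetical summary of your grading input).
    static double evaluate(int[] runCountsPerClass, double[] chromosome, boolean isWin) {
        if (isWin) {
            return WIN_SCORE; // constant, not part of the chromosome
        }
        double score = 0;
        for (int c = 0; c < chromosome.length; c++) {
            score += chromosome[c] * runCountsPerClass[c];
        }
        return score;
    }
}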
3. When initially creating chromosomes, random generation following a standard distribution is said to be optimal. However, the genes in my case have large deviations. Would it still be okay to generate chromosomes randomly?
This is somewhat related to the specific genetic algorithm "flavor" you are going to use.
A standard genetic algorithm could work better with not completely random initial values.
Other variants (e.g. Differential Evolution) could be less sensitive to this aspect.
Also take a look at this question/answer: Getting started with machine learning a zero sum game?

Difference between using different population sizes and different crossover methods

I have a couple of general questions on genetic algorithms. In the selection step, where you pick chromosomes from the population, is there an ideal number of chromosomes to pick? What difference does it make if I pick, say, 10 chromosomes instead of 20? Does it have any effect on the final result? At the mutation stage, I've learnt there are different ways to mutate - single-point crossover, two-point crossover, uniform crossover and arithmetic crossover. When should I choose one over the other? I know these sound very basic, but I couldn't find an answer anywhere, so I thought I should ask on Stack Overflow.
Thanks
It seems to me that your terminology and concepts are a little bit messed up. Let me clarify.
First of all, there are many names for the members of the population: genotype, genome, chromosome, individual, solution... I will use "solution" for now because, in my opinion, it is the most general term and it is what we are ultimately evolving (also, I'm not a biologist, so I don't know whether genotype, genome and chromosome somehow differ, and if they do, what the difference is).
Population
Genetic Algorithms are population-based evolutionary algorithms. Such an algorithm (usually) maintains a fixed-size population of solutions to the problem it is solving.
Genetic operators
There are two principal genetic operators - crossover and mutation. The goal of crossover is to take two (or more in some cases) solutions and combine them to create a solution that has some properties of both, optimally the best of both. The goal of mutation is to create new genetic material that was not previously present in the population by doing a small random change.
The choice of the particular operators, i.e. whether a single-point or multi-point crossover..., is totally problem-dependent. For example, if your solutions are composed of logical blocks of bits that work together, it might not be a good idea to use uniform crossover because it will destroy these blocks. In such a case, single- or multi-point crossover is a better choice, and the best option is probably to restrict the crossover points to the block boundaries only.
You have to try what works best for your problem. Also, you can always use all of them, i.e. by randomly choosing which crossover operator is going to be used each time the crossover is about to be performed. Similarly for mutation.
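For illustration, here is a small sketch of single-point crossover and bit-flip mutation for bit-string solutions (not tied to any particular GA library; Operators is just a hypothetical container class):
import java.util.Random;

class Operators {
    // Single-point crossover for bit-string solutions: the children swap tails
    // after a randomly chosen cut point strictly inside the string.
    static boolean[][] singlePointCrossover(boolean[] a, boolean[] b, Random rng) {
        int cut = 1 + rng.nextInt(a.length - 1);
        boolean[] childA = a.clone();
        boolean[] childB = b.clone();
        for (int i = cut; i < a.length; i++) {
            childA[i] = b[i];
            childB[i] = a[i];
        }
        return new boolean[][] { childA, childB };
    }

    // Bit-flip mutation: each bit flips independently with a small probability.
    static void mutate(boolean[] solution, double perBitProbability, Random rng) {
        for (int i = 0; i < solution.length; i++) {
            if (rng.nextDouble() < perBitProbability) {
                solution[i] = !solution[i];
            }
        }
    }
}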
Modes of operation
Now to your first question about the number of selected solutions. Genetic Algorithms can run in two basic modes - generational mode and steady-state mode.
Generational mode
In generational mode, the whole population is replaced in every generation (iteration) of the algorithm. A simple Python-like pseudo-code for a generational-mode GA could look like this:
P = [...]  # initial population
while not stopping_condition():
    Pc = []  # empty population of children
    while len(Pc) < len(P):
        a = select(P)  # select a solution from P using some selection strategy
        b = select(P)
        if rand() < crossover_probability:
            a, b = crossover(a, b)
        if rand() < mutation_probability:
            a = mutation(a)
        if rand() < mutation_probability:
            b = mutation(b)
        Pc.append(a)
        Pc.append(b)
    P = Pc  # replace the population with the population of children
Evaluation of the solutions was omitted.
Steady-state mode
In steady-state mode, the population persists and only a few solutions are replaced in each iteration. Again, a simple steady-state GA could look like this:
P = [...]  # initial population
while not stopping_condition():
    a = select(P)  # select a solution from P using some selection strategy
    b = select(P)
    if rand() < crossover_probability:
        a, b = crossover(a, b)
    if rand() < mutation_probability:
        a = mutation(a)
    if rand() < mutation_probability:
        b = mutation(b)
    replace(P, a)  # put a child back into P based on some replacement strategy
    replace(P, b)
Evaluation of the solutions was omitted.
So, the number of selected solutions depends on how you want your algorithm to operate.

Generating a Population for Genetic Algorithm in C

I have just learned the basics (more of an introduction) of genetic algorithms. For an assignment, we are to find the value of x that maximizes f(x) = sin(x*pi/256) on the interval 0 <= x <= 256.
While I understand how to get the fitness of an individual and how to normalize the fitness, I am a little lost on generating the population. In the text, for the purposes of performing crossover and mutation, each individual is represented using 8 bits. Example:
189 = 10111101
35 = 00100011
My questions are this:
Using C, what is the best way to generate the population? I have looked it up and all I could find was using uint8_t. I'm thinking of generating it as an array and then finding a way to convert it to its integer representation.
What purpose does normalizing the fitness serve?
As this is my first time writing a program that uses a genetic algorithm, is there any advice I should keep in mind?
Thank you for your time.
The usual way is for the initial population to be random, but if you have some preliminary optimization results you can form the population around solutions that are already available.
It is very common to use hybrid algorithms, where the GA is mixed with algorithms like PSO, simulated annealing and so on.
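As a rough sketch of both options for 8-bit individuals (random initialisation, and seeding part of the population around a value already known to be good; knownGood and spread are hypothetical parameters) - in C the same idea maps to an array of uint8_t filled with rand() % 256:
import java.util.Random;

class PopulationInit {
    // Random initialisation: each individual is an 8-bit value in [0, 255],
    // stored as an int for convenience.
    static int[] randomPopulation(int size, Random rng) {
        int[] population = new int[size];
        for (int i = 0; i < size; i++) {
            population[i] = rng.nextInt(256); // 8 bits: 0 .. 255
        }
        return population;
    }

    // Seeded initialisation: cluster the population around a known good value
    // (an example of "forming the population around results already available").
    static int[] seededPopulation(int size, int knownGood, int spread, Random rng) {
        int[] population = new int[size];
        for (int i = 0; i < size; i++) {
            int candidate = knownGood + rng.nextInt(2 * spread + 1) - spread;
            population[i] = Math.min(255, Math.max(0, candidate)); // clamp to 8 bits
        }
        return population;
    }
}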

Neural Network Architecture Design

I'm playing around with Neural Networks trying to understand the best practices for designing their architecture based on the kind of problem you need to solve.
I generated a very simple data set composed of a single convex region as you can see below:
Everything works fine when I use an architecture with L = 1, or L = 2 hidden layers (plus the output layer), but as soon as I add a third hidden layer (L = 3) my performance drops down to slightly better than chance.
I know that the more complexity you add to a network (number of weights and parameters to learn) the more you tend to go towards over-fitting your data, but I believe this is not the nature of my problem for two reasons:
my performance on the training set is also around 60% (whereas over-fitting typically means you have a very low training error and a high test error),
and I have a very large number of data examples (don't judge by the figure; that's only a toy example I uploaded).
Can anybody help me understand why adding an extra hidden layer gives me this drop in performance on such a simple task?
Here is an image of my performance as a function of the number of layers used:
ADDED PART DUE TO COMMENTS:
I am using a sigmoid function with values between 0 and 1, L(s) = 1 / (1 + exp(-s)).
I am using early stopping (after 40000 iterations of backprop) as the criterion to stop the learning. I know it is not the best way to stop, but I thought it would be OK for such a simple classification task; if you believe this is the main reason I'm not converging, I might implement a better criterion.
At least on the surface of it, this appears to be a case of the so-called "vanishing gradient" problem.
Activation functions
Your neurons activate according to the logistic sigmoid function, f(x) = 1 / (1 + e^-x).
This activation function is used frequently because it has several nice properties. One of these nice properties is that the derivative of f(x) is expressible computationally using the value of the function itself, as f'(x) = f(x)(1 - f(x)). This derivative has a nonzero value for x near zero, but quickly goes to zero as |x| gets large.
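To see how quickly the derivative shrinks, here is a tiny illustrative snippet; the printed values follow directly from the formula above:
// Prints the logistic function and its derivative for a few inputs, showing how
// quickly f'(x) = f(x) * (1 - f(x)) vanishes as |x| grows.
class SigmoidDerivative {
    static double f(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    public static void main(String[] args) {
        for (double x : new double[] { 0, 2, 5, 10 }) {
            double fx = f(x);
            double dfx = fx * (1 - fx);
            System.out.printf("x = %5.1f   f(x) = %.6f   f'(x) = %.6f%n", x, fx, dfx);
        }
        // f'(0) = 0.25, f'(5) is already about 0.0066, and f'(10) is about 0.000045.
    }
}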
Gradient descent
In a feedforward neural network with logistic activations, the error is typically propagated backwards through the network using the first derivative as a learning signal. The usual update for a weight in your network is proportional to the error attributable to that weight times the current weight value times the derivative of the logistic function.
delta_w(w) ~= w * f'(err(w)) * err(w)
As the product of three potentially very small values, the gradient in such networks can become small very rapidly if the weights fall outside the "middle" regime of the logistic function's derivative. In addition, this rapidly vanishing derivative is exacerbated by adding more layers, because the error in a layer gets "split up" and partitioned out to each unit in that layer, which in turn further reduces the gradient in the layers below.
In networks with more than, say, two hidden layers, this can become a serious problem for training the network, since the first-order gradient information will lead you to believe that the weights cannot usefully change.
However, there are some solutions that can help! The ones I can think of involve changing your learning method to use something more sophisticated than first-order gradient descent, generally incorporating some second-order derivative information.
Momentum
The simplest way to approximate the use of second-order information is to include a momentum term in your network parameter updates. Instead of updating parameters using:
w_new = w_old - learning_rate * delta_w(w_old)
incorporate a momentum term:
w_dir_new = mu * w_dir_old - learning_rate * delta_w(w_old)
w_new = w_old + w_dir_new
Intuitively, you want to use information from past derivatives to help determine whether you want to follow the new derivative entirely (which you can do by setting mu = 0), or to keep going in the direction you were heading on the previous update, tempered by the new gradient information (by setting mu > 0).
You can actually do even better than this by using "Nesterov's Accelerated Gradient":
w_dir_new = mu * w_dir_old - learning_rate * delta_w(w_old + mu * w_dir_old)
w_new = w_old + w_dir_new
I think the idea here is that instead of computing the derivative at the "old" parameter value w, compute it at what would be the "new" setting for w if you went ahead and moved there according to a standard momentum term. Read more in a neural-networks context here (PDF).
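As a small, self-contained sketch (a single parameter vector with a hypothetical gradient callback, not tied to the network above), classical momentum and the Nesterov variant differ only in where the gradient is evaluated:
// Gradient-descent update with classical momentum and with Nesterov's variant.
// gradient.at(w) stands in for whatever computes delta_w for your network.
class MomentumUpdate {
    interface Gradient { double[] at(double[] w); }

    static void step(double[] w, double[] dir, double mu, double learningRate,
                     Gradient gradient, boolean nesterov) {
        double[] lookahead = w.clone();
        if (nesterov) {
            // Nesterov: evaluate the gradient at the look-ahead point w + mu * dir.
            for (int i = 0; i < w.length; i++) lookahead[i] += mu * dir[i];
        }
        double[] g = gradient.at(lookahead);
        for (int i = 0; i < w.length; i++) {
            dir[i] = mu * dir[i] - learningRate * g[i]; // velocity update
            w[i] += dir[i];                             // parameter update
        }
    }
}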
Hessian-Free
The textbook way to incorporate second-order gradient information into your neural network training algorithm is to use Newton's Method to compute the first and second order derivatives of your objective function with respect to the parameters. However, the second order derivative, called the Hessian matrix, is often extremely large and prohibitively expensive to compute.
Instead of computing the entire Hessian, some clever research in the past few years has indicated a way to compute just the values of the Hessian in a particular search direction. You can then use this process to identify a better parameter update than just the first-order gradient.
You can learn more about this by reading through a research paper (PDF) or looking at a sample implementation.
Others
There are many other optimization methods that could be useful for this task -- conjugate gradient (PDF -- definitely worth a read), Levenberg-Marquardt (PDF), L-BFGS -- but from what I've seen in the research literature, momentum and Hessian-free methods seem to be the most common ones.
Because the number of iterations of training required for convergence increases as you add complexity to a neural network, holding the length of training constant while adding layers to a neural network will certainly result in you eventually observing a drop like this. To figure out whether that is the explanation for this particular observation, try increasing the number of iterations of training that you're using and see if it improves. Using a more intelligent stopping criterion is also a good option, but a simple increase in the cut-off will give you answers faster.

Getting fitness in TSP

I'm using a genetic algorithm (GA) to optimise a traveling salesman problem (TSP). My problem is how to calculate the fitness of an individual.
Obviously solutions with shorter routes are fitter, but how exactly do I assign a fitness value without knowing the shortest and longest possible routes, which would tell me where my solution fits in that range?
Having fitness equal to the path length is fine. Keep in mind that in genetic algorithms the fitness is only used for selecting individuals: consequently, with the usual selection procedures, the scale does not matter, only the rank does.
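For instance, a minimal sketch (assuming tours are stored as permutations of city indices and a symmetric distance matrix is available) that uses the tour length directly and lets only the rank matter via binary tournament selection:
import java.util.Random;

class TspFitness {
    // Fitness is simply the tour length (lower is better); no normalisation against
    // the best or worst possible tour is needed.
    static double tourLength(int[] tour, double[][] dist) {
        double length = 0;
        for (int i = 0; i < tour.length; i++) {
            int from = tour[i];
            int to = tour[(i + 1) % tour.length]; // wrap around to the start city
            length += dist[from][to];
        }
        return length;
    }

    // Binary tournament selection: only the rank (which tour is shorter) matters,
    // not the absolute scale of the lengths.
    static int[] select(int[][] population, double[][] dist, Random rng) {
        int[] a = population[rng.nextInt(population.length)];
        int[] b = population[rng.nextInt(population.length)];
        return tourLength(a, dist) <= tourLength(b, dist) ? a : b;
    }
}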
Examples of implementation:
http://www.codeproject.com/Articles/1403/Genetic-Algorithms-and-the-Traveling-Salesman-Prob
http://khayyam.developpez.com/articles/algo/voyageur-de-commerce/genetique/ (use Google translate)
http://www.lalena.com/ai/tsp/
http://www.mathworks.com/matlabcentral/fileexchange/13680
More subtleties (2001 - Swarm Intelligence - Kennedy & Eberhart - page 249):
Pablo Moscato is a South American researcher who has pioneered the
study of memetic algorithms (e.g., Moscato, 1989). He and Michael
Norman, who is now in Scotland at the University of Edinburgh, began
working together in the 1980s at Caltech. In a recent paper they
describe the use of a memetic algorithm for optimization of a
traveling salesman problem (TSP) (Moscato and Norman, 1992). Recall
that the TSP requires finding the shortest path through a number of
cities, passing through each one only once. The problem has a rich
history in applied mathematics, as it is very hard to solve,
especially when the number of cities is large. TSP is an NP-hard
problem, which suggests that if a way is found to solve it, then a
large number of other problems will also have been solved. Moscato and
Norman use an algorithm with both cooperation and competition among
agents in the population, and implement a hybrid version of simulated
annealing for local search.
A population of individuals—these
researchers usually use a population size of 16—searches the problem
space, which is defined by permutations of the cities, called “tours.”
The population is conceptualized as a ring, where each individual is
attached to its two immediately adjacent neighbors, with whom it
competes in the search; individuals are also connected to others on
the far side of the ring, with whom they cooperate. Each individual in
the population comprises a tour of the cities. Competition is seen as
“challenge” and “battles” between pairs of individuals, where the tour
lengths of an individual and its neighbor are compared and a
probability threshold is set based on the difference. The difference
between the tours’ lengths affects the steepness of the s-shaped curve;
when the difference is small or the temperature is cool, the
probability distribution becomes nearly uniform, and when the
difference in lengths between the two tours is great, the probability
is increased that tour 1 will be deleted and replaced with a copy of
tour 0.
Cooperation is used to let more successful individuals “mate”
with one another, rather than with less-fit members of the population.
The same rule that is used in deciding competitive interactions is
used to assess the desirability of partners for crossover, which is
implemented just as it is in GA. One individual “proposes” to another,
and if the proposition is accepted, that is, if the stochastic
decision favors their interaction, then the crossover operator is
implemented. Thus the next generation is created.
You could normalise all candidate solutions, such that the shortest path you've seen to date gets the fitness score 1.0 (or 10, or 42, or 3.14... whatever you like), and then scale all paths longer than this relatively. Same with the longest path - the longest path that you've observed is considered the worst possible score.
The trick comes with what you do when you find an even shorter path (given that you assigned some longer path the highest possible score, such as 1.0) - you have to then raise the ceiling on your normalisation function. Start assigning fitness 2.0, for example (or 1.1, or some other arbitrarily larger fitness score).
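One way to implement this kind of running normalisation (a sketch; the exact scale is arbitrary, as noted above) is to keep track of the best and worst tour lengths seen so far:
// Running normalisation of tour lengths into a fitness score: the shortest tour seen
// so far scores highest, the longest scores lowest, and the bounds are simply updated
// whenever a new best or worst tour appears.
class NormalisedFitness {
    private double bestLength = Double.POSITIVE_INFINITY;
    private double worstLength = Double.NEGATIVE_INFINITY;

    double fitness(double tourLength) {
        bestLength = Math.min(bestLength, tourLength);   // raise the ceiling if needed
        worstLength = Math.max(worstLength, tourLength); // lower the floor if needed
        if (worstLength == bestLength) {
            return 1.0; // only one distinct length observed so far
        }
        // 1.0 for the best tour observed, 0.0 for the worst observed.
        return (worstLength - tourLength) / (worstLength - bestLength);
    }
}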
If your program is maximizing fitness values, you would want to maximize the fitness function
f = -(tour length)
EDITED: I had added 1000000000000, an arbitrary number, to the fitness to make it positive. On reading a few comments, I realize it is not necessary.
If your program is minimizing fitness values, you would want to minimize the fitness function
f = tour length
