Genetic algorithm - iterative optimization - theory

Hi I am absolute beginner and I've got rather theoretical question about iterative optimization in genetic algorithms.
Where in the genetic cycle (see below) does the iterative optimization logically belong? I am not really sure, but I think it could be in population initialization or mutation, depending on given problem.
To be more specific about about the iterative optimization algorithm - I would like to use "hill climbing" or "simulated annealing".
I use this model as reference:

Well, there are several possibilities and basically all of them make sense.
If you put the optimisation phase after population initialization (being run once before the genetic algorithm itself) then what you get is an already optimized initial population. This might be useful because the genetic algorithm does not have to search so much, but it may be harmful in a way that you optimize to some local optima, possibly losing useful information in the process.
If you put the optimisation phase before selection, you get a so-called memetic algorithm. MA is based on the fact that an organism learns throughout its lifetime (the optimization). You have two possibilites of how to do that:
Take an individual, optimize it and replace the original individual with its optimized version. This is called "Lamarckian evolution" and is based on the idea (originally by Jean-Baptiste Lamarck at the beginning of 19th century) that the learned features can be passed on to the offsprings.
Take an individual, optimize it, but then throw the optimized individual away but assign its fitness to the original, unoptimized individual. In this variant the optimization effectively becomes the part of the fitness evaluation process. This is called "Baldwin effect" and is based on the idea (originally by James Mark Baldwin at the end of 19th century) that the learned features cannot be passed on to the offsprings and the genetic information rather describes the ability to learn. By the way, this is the way the natural evolution actually works.
The optimization can, of course, be placed in all the other places, but it is not used (at least I don't know of any such case). However, your problem and mutation/crossover operators might be such that the optimization in those places might be somehow beneficial. Or you can just try and see what you get.

Related

What is good measure to compare algorithms?

Well I was reading an article about comparing two algorithms by firstly analyzing them.
My teacher taught me that you can analyze algorithm by directly using number of steps for that algorithm.
for ex:
algo printArray(arr[n]){
for(int i=0;i<n;i++){
write arr[i];
}
}
will have complexity of O(N), where N is size of array. and it repeats the for loop for N times.
while
algo printMatrix(arr,m,n){
for(i=0;i<m;i++){
for(j=0;j<n;j++){
write arr[i][j];
}
}
}
will have complexity of O(MXN) ~ O(N^2) when M=N. statements inside for are executed MXN times.
similarly O(log N). if it divides input into 2 equal parts. and So on.
But according to that article:
The Measures Execution Time, Number of statements aren't good for analyzing the algorithm.
because:
Execution Time will be system Dependent and,
Number of statements will vary with the programming language used.
and It states that
Ideal Solution will be to express running time of algorithm as a function of input size N that is f(n).
That confused me a little, How can you calculate running time if you consider execution time as not good measure?
Can experts here please elaborate this?
Thanks in advance.
When you were saying "complexity of O(N)" that is referred to as "Big-O notation" which is the same as the "Ideal Solution" that you mentioned in your post. It is a way of expressing run time as a function of input size.
I think were you got confused was when it said "express running time" - it didn't mean express it in a numerical value (which is what execution time is), it meant express it in Big-O notation. I think you just got tripped up on the terminology.
Execution time is indeed system-dependent, but it also depends on the number of instructions the algorithm executes.
Also, I do not understand how the number of steps is irrelevant, given that algorithms are analyzed as language-agnostic and without paying any attention to whatever features and syntactic-sugars various languages imply.
The one measure of algorithm analysis I have always encountered since I started analyzing algorithms is the number of executed instructions and I fail to see how this metric may be irrelevant.
At the same time, complexity classes are meant as an "order of magnitude" indication of how fast or slow an algorithm is. They are dependent of the number of executed instructions and independent of the system the algorithm runs on, because by definition an elementary operation (such as addition of two numbers) should take constant time, however large or small this "constant" means in practice, therefore complexity classes do not change. The constants inside the expression for the exact complexity function may indeed vary from system to system, but what is actually relevant for algorithm comparison is the complexity class, as only by comparing those can you find out how an algorithm behaves on increasingly large inputs (asymptotically) compared to another algorithm.
Big-O notation waves away constants (both fixed cost and constant multipliers). So any function that takes kn+c operations to complete is (by definition!) O(n), regardless of k and c. This is why it's often better to take real-world measurements (profiling) of your algorithms in action with real data, to see how fast they effectively are.
But execution time, obviously, varies depending on the data set -- if you're trying to come up with a general measure of performance that's not based on a specific usage scenario, then execution time is less valuable (unless you're comparing all algorithms under the same conditions, and even then it's not necessarily fair unless you model the majority of possible scenarios, and not just one).
Big-O notation becomes more valuable as you move to larger data sets. It gives you a rough idea of the performance of an algorithm, assuming reasonable values for k and c. If you have a million numbers you want to sort, then it's safe to say you want to stay away from any O(n^2) algorithm, and try to find a better O(n lg n) algorithm. If you're sorting three numbers, the theoretical complexity bound doesn't matter anymore, because the constants dominate the resources taken.
Note also that while the number of statements a given algorithm can be expressed in varies wildly between programming languages, the number of constant-time steps that need to be executed (at the machine level for your target architecture, which is typically one where integer arithmetic and memory accesses take a fixed amount of time, or more precisely are bounded by a fixed amount of time). It is this bound on the maximum number of fixed-cost steps required by an algorithm that big-O measures, which has no direct relation to actual running time for a given input, yet still describes roughly how much work must be done for a given data set as the size of the set grows.
In comparing algorithms, execution speed is important as well mentioned by others, but other factors like memory space are crucial too.
Memory space also uses order of complexity notation.
Code could sort an array in place using a bubble sort needing only a handful of extra memory O(1). Other methods, though faster, may need O(ln N) memory.
Other more esoteric measures include code complexity like Cyclomatic complexity and Readability
Traditionally, computer science measures algorithm effectivity (speed) by the number of comparisons or sometimes data accesses, using "Big O notation". This is so, because the number of comparisons (and/or data accesses) is a good mathematical model to describe efficiency of certain algorithms, searching and sorting ones in particular, where O(log n) is considered the fastest possible in theory.
This theoretic model has always had several flaws though. It assumes that comparisons (and/or data accessing) are what takes time, and that the time for performing things like function calls and branching/looping is neglectible. This is of course nonsense in the real world.
In the real world, a recursive binary search algorithm might for example be extremely slow compared to a quick & dirty linear search implemented with a plain for loop, because on the given system, the function call overhead is what takes the most time, not the comparisons.
There are a whole lot of things that affect performance. As CPUs evolve, more such things are invented. Nowadays, you might have to consider things like data alignment, instruction pipe-lining, branch prediction, data cache memory, multiple CPU cores and so on. All these technologies make traditional algorithm theory rather irrelevant.
To write the most effective code possible, you need to have a specific system in mind and you need in-depth knowledge about said system. Fortunately, compilers have evolved a lot too, so a lot of the in-depth system knowledge can be left to the person who implements a compiler port for the specific system.
Generally, I think many programmers today spend far too much time pondering about program speed and coming up with "clever things" to get better performance. Back in the days when CPUs were slow and compilers were terrible, such things were very important. But today, a good, modern programmer focus on making the code bug-free, readable, maintainable, re-useable, secure, portable etc. It doesn't matter how fast your program is, if it is a buggy mess of unreadable crap. So deal with performance when the need arises.

Are there problems that can't be solved efficiently without arrays? [duplicate]

Does anyone know what is the worst possible asymptotic slowdown that can happen when programming purely functionally as opposed to imperatively (i.e. allowing side-effects)?
Clarification from comment by itowlson: is there any problem for which the best known non-destructive algorithm is asymptotically worse than the best known destructive algorithm, and if so by how much?
According to Pippenger [1996], when comparing a Lisp system that is purely functional (and has strict evaluation semantics, not lazy) to one that can mutate data, an algorithm written for the impure Lisp that runs in O(n) can be translated to an algorithm in the pure Lisp that runs in O(n log n) time (based on work by Ben-Amram and Galil [1992] about simulating random access memory using only pointers). Pippenger also establishes that there are algorithms for which that is the best you can do; there are problems which are O(n) in the impure system which are Ω(n log n) in the pure system.
There are a few caveats to be made about this paper. The most significant is that it does not address lazy functional languages, such as Haskell. Bird, Jones and De Moor [1997] demonstrate that the problem constructed by Pippenger can be solved in a lazy functional language in O(n) time, but they do not establish (and as far as I know, no one has) whether or not a lazy functional language can solve all problems in the same asymptotic running time as a language with mutation.
The problem constructed by Pippenger requires Ω(n log n) is specifically constructed to achieve this result, and is not necessarily representative of practical, real-world problems. There are a few restrictions on the problem that are a bit unexpected, but necessary for the proof to work; in particular, the problem requires that results are computed on-line, without being able to access future input, and that the input consists of a sequence of atoms from an unbounded set of possible atoms, rather than a fixed size set. And the paper only establishes (lower bound) results for an impure algorithm of linear running time; for problems that require a greater running time, it is possible that the extra O(log n) factor seen in the linear problem may be able to be "absorbed" in the process of extra operations necessary for algorithms with greater running times. These clarifications and open questions are explored briefly by Ben-Amram [1996].
In practice, many algorithms can be implemented in a pure functional language at the same efficiency as in a language with mutable data structures. For a good reference on techniques to use for implementing purely functional data structures efficiently, see Chris Okasaki's "Purely Functional Data Structures" [Okasaki 1998] (which is an expanded version of his thesis [Okasaki 1996]).
Anyone who needs to implement algorithms on purely-functional data structures should read Okasaki. You can always get at worst an O(log n) slowdown per operation by simulating mutable memory with a balanced binary tree, but in many cases you can do considerably better than that, and Okasaki describes many useful techniques, from amortized techniques to real-time ones that do the amortized work incrementally. Purely functional data structures can be a bit difficult to work with and analyze, but they provide many benefits like referential transparency that are helpful in compiler optimization, in parallel and distributed computing, and in implementation of features like versioning, undo, and rollback.
Note also that all of this discusses only asymptotic running times. Many techniques for implementing purely functional data structures give you a certain amount of constant factor slowdown, due to extra bookkeeping necessary for them to work, and implementation details of the language in question. The benefits of purely functional data structures may outweigh these constant factor slowdowns, so you will generally need to make trade-offs based on the problem in question.
References
Ben-Amram, Amir and Galil, Zvi 1992. "On Pointers versus Addresses" Journal of the ACM, 39(3), pp. 617-648, July 1992
Ben-Amram, Amir 1996. "Notes on Pippenger's Comparison of Pure and Impure Lisp" Unpublished manuscript, DIKU, University of Copenhagen, Denmark
Bird, Richard, Jones, Geraint, and De Moor, Oege 1997. "More haste, less speed: lazy versus eager evaluation" Journal of Functional Programming 7, 5 pp. 541–547, September 1997
Okasaki, Chris 1996. "Purely Functional Data Structures" PhD Thesis, Carnegie Mellon University
Okasaki, Chris 1998. "Purely Functional Data Structures" Cambridge University Press, Cambridge, UK
Pippenger, Nicholas 1996. "Pure Versus Impure Lisp" ACM Symposium on Principles of Programming Languages, pages 104–109, January 1996
There are indeed several algorithms and data structures for which no asymptotically efficient purely functional solution (t.i. one implementable in pure lambda calculus) is known, even with laziness.
The aforementioned union-find
Hash tables
Arrays
Some graph algorithms
...
However, we assume that in "imperative" languages access to memory is O(1) whereas in theory that can't be so asymptotically (i.e. for unbounded problem sizes) and access to memory within a huge dataset is always O(log n), which can be emulated in a functional language.
Also, we must remember that actually all modern functional languages provide mutable data, and Haskell even provides it without sacrificing purity (the ST monad).
This article claims that the known purely functional implementations of the union-find algorithm all have worse asymptotic complexity than the one they publish, which has a purely functional interface but uses mutable data internally.
The fact that other answers claim that there can never be any difference and that for instance, the only "drawback" of purely functional code is that it can be parallelized gives you an idea of the informedness/objectivity of the functional programming community on these matters.
EDIT:
Comments below point out that a biased discussion of the pros and cons of pure functional programming may not come from the “functional programming community”. Good point. Perhaps the advocates I see are just, to cite a comment, “illiterate”.
For instance, I think that this blog post is written by someone who could be said to be representative of the functional programming community, and since it's a list of “points for lazy evaluation”, it would be a good place to mention any drawback that lazy and purely functional programming might have. A good place would have been in place of the following (technically true, but biased to the point of not being funny) dismissal:
If strict a function has O(f(n)) complexity in a strict language then it has complexity O(f(n)) in a lazy language as well. Why worry? :)
With a fixed upper bound on memory usage, there should be no difference.
Proof sketch:
Given a fixed upper bound on memory usage, one should be able to write a virtual machine which executes an imperative instruction set with the same asymptotic complexity as if you were actually executing on that machine. This is so because you can manage the mutable memory as a persistent data structure, giving O(log(n)) read and writes, but with a fixed upper bound on memory usage, you can have a fixed amount of memory, causing these to decay to O(1). Thus the functional implementation can be the imperative version running in the functional implementation of the VM, and so they should both have the same asymptotic complexity.

which encoding to use for genetic algorithm?

I want to code a genetic algorithm in C for optimizing a function of 10 variables (x1 to x10). However I am not able to figure out which encoding I should use. I have mostly seen binary encoding being used in example but the variables in my case can take real values. Also, is value encoding a good option for these types of problems?
For real valued problems I would suggest to try CMA-ES or another ES variant. CMA-ES certainly is the current state of the art for real-valued problems. It is designed to find good solutions in multidimensional problems quickly. There are implementations available on Hansen's page. There's also a C# implementation in the work for HeuristicLab. Evolution strategies are algorithms that were specifically designed for real-valued optimization problems. They are very similar to genetic algorithms (both were invented around the same time, but in different places). The main distinction is that for ES the main driver is mutation and it features a clever adaption of the mutation strength. Without this adaption the (local) optimum cannot be located in time. CMA-ES is easy to configure, all it needs is the initial standard deviation and optionally the population size (otherwise there's a formula that estimates this given the problem size).
Genetic algorithms can of course also be applied, but you have to use some specific operators which are able to mutate variables only with very small degree. For example there's the Breeder Genetic Algorithm from Mühlenbein. In general however genetic algorithms are more suited for problems that need a right combination of things. E.g. which items to include in a knapsack problem or which functions and terminals to combine to a formula (genetic programming). Less for problems, where you need to find the right value for something. Although of course there are variants of the genetic algorithm to solve these, look for Real coded Genetic Algorithm (RCGA or RGA).
Another algorithm suited for real-valued problems is Particle Swarm Optimization, but in my opinion it is harder to configure. I'd start with SPSO-2011 the 2011 standard PSO.
If your problem contains integer variables choices become more difficult. Evolution strategies do not perform so well when variables are discrete, because the adaptation schemes for integer variables are different. A genetic algorithm becomes an interesting first-choice algorithm again.
A genetic algorithm is best used when two answers that are pretty close to optimal will make something else pretty close to optimal when combined. The problem with a pure binary encoding is that if you don't check your crossover you end up getting two answers which may not have all that much to do with the original answers.
That said, this is only really an issue if your number of variables is very small and the amount of data in your variables is large. As far as picking an encoding, it's more of an art than a science and it depends on your problem. I would suggest going with an encoding that fits the amount of precision you want. With 10 variables you won't got that far wrong however you encode it, an 8-bit ASCII encoder would probably work fine.
Hope that helps.

What's differential evolution and how does it compare to a genetic algorithm?

From what I've read so far they seem very similar.
Differential evolution uses floating point numbers instead, and the solutions are called vectors? I'm not quite sure what that means.
If someone could provide an overview with a little bit about the advantages and disadvantages of both.
Well, both genetic algorithms and differential evolution are examples of evolutionary computation.
Genetic algorithms keep pretty closely to the metaphor of genetic reproduction. Even the language is mostly the same-- both talk of chromosomes, both talk of genes, the genes are distinct alphabets, both talk of crossover, and the crossover is fairly close to a low-level understanding of genetic reproduction, etc.
Differential evolution is in the same style, but the correspondences are not as exact. The first big change is that DE is using actual real numbers (in the strict mathematical sense-- they're implemented as floats, or doubles, or whatever, but in theory they're ranging over the field of reals.) As a result, the ideas of mutation and crossover are substantially different. The mutation operator is modified so far that it's hard for me to even see why it's called mutation, as such, except that it serves the same purpose of breaking things out of local minima.
On the plus side, there are a handful of results showing DEs are often more effective and/or more efficient than genetic algorithms. And when working in numerical optimization, it's nice to be able to represent things as actual real numbers instead of having to work your way around to a chromosomal kind of representation, first. (Note: I've read about them, but I've not messed extensively with them so I can't really comment from first hand knowledge.)
On the negative side, I don't think there's been any proof of convergence for DEs, yet.
Differential evolution is actually a specific subset of the broader space of genetic algorithms, with the following restrictions:
The genotype is some form of real-valued vector
The mutation / crossover operations make use of the difference between two or more vectors in the population to create a new vector (typically by adding some random proportion of the difference to one of the existing vectors, plus a small amount of random noise)
DE performs well for certain situations because the vectors can be considered to form a "cloud" that explores the high value areas of the solution solution space quite effectively. It's pretty closely related to particle swarm optimization in some senses.
It still has the usual GA problem of getting stuck in local minima however.

Why do safety requirements like to discourage use of AI?

Seems that requirements on safety do not seem to like systems that use AI for safety-related requirements (particularly where large potential risks of destruction/death are involved). Can anyone suggest why? I always thought that, provided you program your logic properly, the more intelligence you put in an algorithm, the more likely this algorithm is capable of preventing a dangerous situation. Are things different in practice?
Most AI algorithms are fuzzy -- typically learning as they go along. For items that are of critical safety importance what you want is deterministic. These algorithms are easier to prove correct, which is essential for many safety critical applications.
I would think that the reason is twofold.
First it is possible that the AI will make unpredictable decisions. Granted, they can be beneficial, but when talking about safety-concerns, you can't take risks like that, especially if people's lives are on the line.
The second is that the "reasoning" behind the decisions can't always be traced (sometimes there is a random element used for generating results with an AI) and when something goes wrong, not having the ability to determine "why" (in a very precise manner) becomes a liability.
In the end, it comes down to accountability and reliability.
The more complex a system is, the harder it is to test.
And the more crucial a system is, the more important it becomes to have 100% comprehensive tests.
Therefore for crucial systems people prefer to have sub-optimal features, that can be tested, and rely on human interaction for complex decision making.
From a safety standpoint, one often is concerned with guaranteed predictability/determinism of behavior and rapid response time. While it's possible to do either or both with AI-style programming techniques, as a system's control logic becomes more complex it's harder to provide convincing arguments about how the system will behave (convincing enough to satisfy an auditor).
I would guess that AI systems are generally considered more complex. Complexity is usually a bad thing, especially when it relates to "magic" which is how some people perceive AI systems.
That's not to say that the alternative is necessarily simpler (or better).
When we've done control systems coding, we've had to show trace tables for every single code path, and permutation of inputs. This was required to insure that we didn't put equipment into a dangerous state (for employees or infrastructure), and to "prove" that the programs did what they were supposed to do.
That'd be awfully tricky to do if the program were fuzzy and non-deterministic, as #tvanfosson indicated. I think you should accept that answer.
The key statement is "provided you program your logic properly". Well, how do you "provide" that? Experience shows that most programs are chock full of bugs.
The only way to guarantee that there are no bugs would be formal verification, but that is practically infeasible for all but the most primitively simple systems, and (worse) is usually done on specifications rather than code, so you still don't know of the code correctly implements your spec after you've proven the spec to be flawless.
I think that is because AI is very hard to understand and that becomes impossible to maintain.
Even if a AI program is considered fuzzy, or that it "learns" by the moment it is released, it is very well tested to all know cases(and it already learned from it) before its even finished. Most of the cases this "learning" will change some "thresholds" or weights in the program and after that, it is very hard to really understand and maintain that code, even for the creators.
This have been changing in the last 30 years by creating languages easier to understand for mathematicians, making it easier for them to test, and deliver new pseudo-code around the problem(like mat lab AI toolbox)
As there is no accepted definition of AI, the question shall be more specific.
My answer is on adaptive algorithms merely employing parameter estimation - a kind of learning - to improve the safety of the output information. Even this is not welcome in functional safety although it may seem that the behaviour of a proposed algorithm is not only deterministic (all computer programs are) but also easy to determine.
Be prepared for the assessor asking you to demonstrate test reports covering all combinations of input data and failure modes. Your algorithm being adaptive means it depends not only on current input values but on many or all of the earlier values. You know that a full test coverage is impossible within the age of the universe.
One way to score is showing that previously accepted simpler algorithms (state of the art) are not safe. This shall be easy if you know your problem space (if not, keep away from AI).
Another possibility may exist for your problem: a compelling monitoring function indicating whether the parameter is estimated accurately.
There are enough ways that ordinary algorithms, when shoddily designed and tested, can wind up killing people. If you haven't read about it, you should look up the case of Therac 25. This was a system where the behaviour was supposed to be completely deterministic, and things still went horribly, horribly wrong. Imagine if it were trying to reason "intelligently", too.
"Ordinary algorithms" for a complex problem space tend to be arkward. On the other hand, some "intelligent" algorithms have a simple structure. This is especially true for applications of Bayesian inference. You just have to know the likelihood function(s) for your data (plural applies if the data separates into statistically independent subsets).
Likelihood functions can be tested. If the test cannot cover the tails far enough to reach the required confidence level, just add more data, for example from another sensor. The structure of your algorithm will not change.
A drawback is/was the CPU performance required for Bayesian inference.
Besides, mentioning Therac 25 is not helpful, since no algorithm at all was involved, just multitasking spaghetti code. Citing the authors, "[the] accidents were fairly unique in having software coding errors involved -- most computer-related accidents have not involved coding errors but rather errors in the software requirements such as omissions and mishandled environmental conditions and system states."

Resources