Most appropriate AI for parameter weighting? - artificial-intelligence

I have data of this form:
[(v1, A1, B1), (v2, A2, B2), (v3, A3, B3), ...]
The vs correspond to the data elements and the As and Bs to numerical values characterizing the vs.
A human looking at this data can see which tuple seems the best "match" according to the A and B values. I want a form of AI that I could train by picking one of these tuples as the best, and that would adjust the weights given to A and B accordingly.
Basically, each tuple represents an approximation to a value. A represents an error and B represents the complexity of each approximation. I want some compromise between error and complexity by assigning them different weightings. I want to run several trials with approximations to different values, and choose the one I think looks the best, and have the AI adjust the weightings correspondingly.

What you described is also known as a model selection problem, something often encountered in machine learning and statistics. You basically have some models that fit your data by some measure of goodness (typically measured as error or log likelihood) and those models have some complexity measure (typically the number of parameters in the model). You want to pick the best fitting model and penalize its complexity because that can be a sign of overfitting.
Typically, the degree to which overfitting can affect you is driven by the size of your data. But there are some measures that explicitly allow you to trade off model fitness and complexity:
Akaike information criterion
Bayesian information criterion
Regularization
Choosing a model based on your data as above can bias the model choice toward that data. Thus, the choice is typically made using a validation set and then evaluated on a test set.
I don't know if your approach in having an algorithm solve this problem is a good one. Typically it is dependent on your data and some degree of intuition. The meta-machine-learning technique you described probably won't be too reliable, in my opinion. Better to start with some more principled and simpler ideas first.
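For what it's worth, the two information criteria listed above are short enough to sketch directly. This is only an illustration: the candidate names, log-likelihoods, parameter counts and sample size below are made-up numbers, not anything from the question. In this sketch the asker's "error" would enter through the log-likelihood and the "complexity" through the parameter count.

```python
import math

def aic(log_likelihood, num_params):
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * num_params - 2 * log_likelihood

def bic(log_likelihood, num_params, num_samples):
    """Bayesian information criterion: k ln(n) - 2 ln(L). Lower is better."""
    return num_params * math.log(num_samples) - 2 * log_likelihood

# Hypothetical candidates: (name, log-likelihood, number of parameters).
candidates = [("simple", -120.0, 2), ("medium", -110.0, 5), ("complex", -108.0, 12)]
n = 100  # hypothetical number of data points

best_by_aic = min(candidates, key=lambda c: aic(c[1], c[2]))
best_by_bic = min(candidates, key=lambda c: bic(c[1], c[2], n))
print(best_by_aic[0], best_by_bic[0])
```

Both criteria add a complexity penalty to the fit term; BIC's penalty grows with the amount of data, so it tends to prefer smaller models than AIC on large data sets.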

Related

SIFT- image comparison

I am comparing two images using SIFT in Java, using the SIFT implementation by Stephan Saalfeld: http://fly.mpi-cbg.de/~saalfeld/Projects/javasift.html. Because there is no proper example, I am finding it difficult to use. I am able to get the descriptors for the two images, then their matching descriptors, and finally apply RANSAC to discard the false matches. Now I am left with a number of inliers, but I am confused about how to conclude whether the two images are similar or not.
RANSAC gives you the transformation matrix (including translation, rotation, and scaling values). Using this information you can try to fit the images on each other in order to see the matches that are found by SIFT.
An advantage of RANSAC is its ability to do robust estimation of the model parameters, i.e., it can estimate the parameters with a high degree of accuracy even when a significant number of outliers are present in the data set. A disadvantage of RANSAC is that there is no upper bound on the time it takes to compute these parameters. When the number of iterations computed is limited the solution obtained may not be optimal, and it may not even be one that fits the data in a good way. In this way RANSAC offers a trade-off; by computing a greater number of iterations the probability of a reasonable model being produced is increased. Another disadvantage of RANSAC is that it requires the setting of problem-specific thresholds.
RANSAC can only estimate one model for a particular data set. As for any one-model approach when two (or more) model instances exist, RANSAC may fail to find either one. The Hough transform is an alternative robust estimation technique that may be useful when more than one model instance is present.
In conclusion, you can say how similar two images are, but the method cannot always tell you that they are a total match or totally different. After applying RANSAC you have the surviving matches (the inliers). You can then compute the percentage of good matches over total matches and decide based on that figure.
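The decision step described above can be sketched in a few lines. This is not part of the javasift library; the function names and both thresholds (min_ratio, min_inliers) are assumptions that would need tuning on your own image pairs.

```python
def inlier_ratio(num_inliers, num_matches):
    """Fraction of SIFT matches that survived RANSAC."""
    if num_matches == 0:
        return 0.0
    return num_inliers / num_matches

def images_similar(num_inliers, num_matches, min_ratio=0.5, min_inliers=10):
    """Call two images similar when enough matches survive RANSAC.

    Both thresholds are illustrative guesses, not recommended values.
    The absolute floor guards against a high ratio from a handful of matches.
    """
    return (num_inliers >= min_inliers
            and inlier_ratio(num_inliers, num_matches) >= min_ratio)

print(images_similar(40, 50))  # many matches, most of them inliers
print(images_similar(5, 6))    # high ratio but too few inliers to trust
```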

Is the HTM cortical learning algorithm defined by Numenta's paper restricted by Euclidean geometry?

Specifically, their most recent implementation.
http://www.numenta.com/htm-overview/htm-algorithms.php
Essentially, I'm asking whether non-euclidean relationships, or relationships in patterns that exceed the dimensionality of the inputs, can be effectively inferred by the algorithm in its present state?
HTM uses Euclidean geometry to determine "neighborship" when analyzing patterns. Consistently framed input causes the algorithm to exhibit predictive behavior, and sequence length is practically unlimited. This algorithm learns very well - but I'm wondering whether it has the capacity to infer nonlinear attributes from its input data.
For example, if you input the entire set of texts from Project Gutenberg, it's going to pick up on the set of probabilistic rules that comprise English spelling, grammar, and readily apparent features from the subject matter, such as gender associations with words, and so forth. These are first level "linear" relations, and can be easily defined with probabilities in a logical network.
A nonlinear relation would be an association of assumptions and implications, such as "Time flies like an arrow, fruit flies like a banana." If correctly framed, the ambiguity of the sentence causes a predictive interpretation of the sentence to generate many possible meanings.
If the algorithm is capable of "understanding" nonlinear relations, then it would be able to process the first phrase and correctly identify that "Time flies" is talking about time doing something, and "fruit flies" are a type of bug.
The answer to the question is probably a simple one to find, but I can't decide either way. Does mapping down the input into a uniform, 2d, Euclidean plane preclude the association of nonlinear attributes of the data?
If it doesn't prevent nonlinear associations, my assumption would then be that you could simply vary the resolution, repetition, and other input attributes to automate the discovery of nonlinear relations - in effect, adding a "think harder" process to the algorithm.
From what I understand of HTM's, the structure of layers and columns mimics the structure of the neocortex. See appendix B here: http://www.numenta.com/htm-overview/education/HTM_CorticalLearningAlgorithms.pdf
So the short answer would be that since the brain can understand non-linear phenomena with this structure, so can an HTM.
Initial, instantaneous sensory input is indeed mapped to 2D regions within an HTM. This does not limit HTM's to dealing with 2D representations any more than a one dimensional string of bits is limited to representing only one dimensional things. It's just a way of encoding stuff so that sparse distributed representations can be formed and their efficiencies can be taken advantage of.
To answer your question about Project Gutenberg, I don't think an HTM will really understand language without first understanding the physical world on which language is based and creates symbols for. That said, this is a very interesting sequence for an HTM, since predictions are only made in one direction, and in a way the understanding of what's happening to the fruit goes backwards. i.e. I see the pattern 'flies like a' and assume the phrase applies to the fruit the same way it did to time. HTM's do group subsequent input (words in this case) together at higher levels, so if you used Fuzzy Grouping (perhaps) as Davide Maltoni has shown to be effective, the two halves of the sentence could be grouped together into the same high level representation and feedback could be sent down linking the two specific sentences. Numenta, to my knowledge has not done too much with feedback messages yet, but it's definitely part of the theory.
The software which runs the HTM is called NuPIC (Numenta Platform for Intelligent Computing). A NuPIC region (representing a region of neocortex) can be configured to either use topology or not, depending on the type of data it's receiving.
If you use topology, the usual setup maps each column to a set of inputs which is centred on the corresponding position in the input space (the connections will be selected randomly according to a probability distribution which favours the centre). The spatial pattern recognising component of NuPIC, known as the Spatial Pooler (SP), will then learn to recognise and represent localised topological features in the data.
There is absolutely no restriction on the "linearity" of the input data which NuPIC can learn. NuPIC can learn sequences of spatial patterns in extremely high-dimensional spaces, and is limited only by the presence (or lack of) spatial and temporal structure in the data.
To answer the specific part of your question, yes, NuPIC can learn non-Euclidean and non-linear relationships, because NuPIC is not, and cannot be modelled by, a linear system. On the other hand, it seems logically impossible to infer relationships of a dimensionality which exceeds that of the data.
The best place to find out about HTM and NuPIC, its Open Source implementation, is at NuPIC's community website (and mailing list).
Yes, it can do non-linear. Basically it is multilayer, and multilayer neural networks (with non-linear units) can infer non-linear relationships. I think the neighborship is calculated locally. If it is calculated locally, then globally it can be piecewise non-linear; for example, look at Locally Linear Embedding.
Yes HTM uses euclidean geometry to connect synapses, but this is only because it is mimicking a biological system that sends out dendrites and creates connections to other nearby cells that have strong activation at that point in time.
The Cortical Learning Algorithm (CLA) is very good at predicting sequences, so it would be good at taking "Time flies like an arrow, fruit flies like a" and predicting "banana" if it has encountered this sequence, or something close to it, before. I don't think it could infer that a fruit fly is a type of insect unless you trained it on that sequence. Thus the T for Temporal. HTMs are sequence association compressors and retrievers (a form of memory). To get a pattern out of the HTM you play in a sequence, and it will match the strongest representation it has encountered to date and predict the next bits of the sequence. It seems to be very good at this, and the main applications for HTMs right now are predicting sequences and detecting anomalies in streams of data.
To get more complex representations and more abstraction you would cascade a trained HTM's outputs to another HTM's inputs, along with some other new sequence-based input to correlate to. I suppose you could wire in some feedback and do some other tricks to combine multiple HTMs, but you would need lots of training on primitives first, just like a baby does, before you will ever get something as sophisticated as associating concepts based on the syntax of the written word.
ok guys, don't get silly, htms just copy data into them, if you want a concept, it's going to be a group of the data, and then you can have motor depend on the relation, and then it all works.
our cortex is probably way better, and actually generates new images, but a computer cortex WON'T, but as it happens, it doesn't matter, and it's very very useful already.
but drawing concepts from a data pool is tricky, the easiest way to do it is by recording an invariant combination of its senses, and when it comes up, associate everything else to it, this will give you organism or animal like intelligence.
drawing harder relations is what humans do, and it's ad hoc logic, imagine a set explaining the most ad hoc relation, and then it slowly gets more and more specific, until it gets to exact motor programs... and all knowledge you have is controlling your motor, and making relations that trigger pathways in the cortex, and tell it where to go, from the blast search that checks all motor, and finds the most successful trigger.
woah that was a mouthful, but watch out dummies, you won't get no concepts from a predictive assimilator, which is what htm is, unless you work out how people draw relations in the data pool, like a machine, and if you do that, it's like a program that's programming itself.
no shit.

Best way to automate testing of AI algorithms?

I'm wondering how people test artificial intelligence algorithms in an automated fashion.
One example would be for the Turing Test - say there were a number of submissions for a contest. Is there any conceivable way to score candidates in an automated fashion, other than just having humans test them out?
I've also seen some data sets (obscured images of numbers/letters, groups of photos, etc) that can be fed in and learned over time. What good resources are out there for this?
One challenge I see: you don't want an algorithm that tailors itself to the test data over time, since you are trying to see how well it does in the general case. Are there any techniques to ensure it doesn't do this? Such as giving it a random test each time, or averaging its results over a bunch of random tests.
Basically, given a bunch of algorithms, I want some automated process to feed it data and see how well it "learned" it or can predict new stuff it hasn't seen yet.
This is a complex topic: good AI algorithms are generally the ones that generalize well to "unseen" data. The simplest method is to have two datasets: a training set and an evaluation set used for measuring performance. But generally you want to "tune" your algorithm, so you may want three datasets: one for learning, one for tuning, and one for evaluation. What counts as tuning depends on your algorithm, but a typical example is a model with a few hyper-parameters (for example, parameters in your Bayesian prior under the Bayesian view of learning) that you would like to tune on a separate dataset. The learning procedure may already set values for them (or you may have hardcoded them), but with enough data you can tune them separately.
As for making those separate datasets, there are many ways to do so, for example by dividing the data you have available into subsets used for different purposes. There is a tradeoff to be made because you want as much data as possible for training, but you want enough data for evaluation too (assuming you are in the design phase of your new algorithm/product).
A standard method to do so in a systematic way from a known dataset is cross validation.
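As a rough sketch of what k-fold cross validation does with the data, here is a small index splitter. The function name and the seed parameter are my own; libraries such as scikit-learn ship their own versions, which you would normally use instead.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation.

    Every sample lands in exactly one test fold; the other k-1 folds
    form the training set for that round.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]   # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for train, test in k_fold_indices(10, 5):
    print(sorted(test), len(train))
```

You would train on each `train` split, score on the matching `test` split, and average the k scores, which addresses the asker's worry about an algorithm tailoring itself to one particular test set.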
Generally when it comes to this sort of thing you have two datasets - one large "training set" which you use to build and tune the algorithm, and a separate smaller "probe set" that you use to evaluate its performance.
@Anon has the right of things - training and what I'll call validation sets. That noted, the bits and pieces I see about developments in this field point at two things:
Bayesian Classifiers: there's something like this probably filtering your email. In short you train the algorithm to make a probabilistic decision if a particular item is part of a group or not (e.g. spam and ham).
Multiple Classifiers: this is the approach that the winning group in the Netflix challenge took, whereby it's not about optimizing one particular algorithm (e.g. Bayesian, Genetic Programming, Neural Networks, etc.) but about combining several to get a better result.
As for data sets Weka has several available. I haven't explored other libraries for data sets, but mloss.org appears to be a good resource. Finally data.gov offers a lot of sets that provide some interesting opportunities.
Training data sets and test sets are very common for K-means and other clustering algorithms, but to have something that's artificially intelligent without supervised learning (which means having a training set) you are building a "brain" so-to-speak based on:
In chess: all possible future states possible from the current gameState.
In most AI-learning (reinforcement learning) you have a problem where the "agent" is trained by doing the game over and over. Basically you ascribe a value to every state. Then you assign an expected value of each possible action at a state.
So say you have S states and A actions per state (although you might have more possible moves in one state, and not as many in another), then you want to figure out the most valuable states in S to be in, and the most valuable actions to take.
In order to figure out the value of states and their corresponding actions, you have to iterate the game through. Probabilistically, a certain sequence of states will lead to victory or defeat, and basically you learn which states lead to failure and are "bad states". You also learn which ones are more likely to lead to victory, and these are subsequently "good" states. They each get a mathematical value associated, usually as an expected reward.
Reward from second-last state to a winning state: +10
Reward if entering a losing state: -10
So the states that give negative rewards then give negative rewards backwards, to the state that called the second-last state, and then the state that called the third-last state and so-on.
Eventually, you have a mapping of expected reward based on which state you're in, and based on which action you take. You eventually find the "optimal" sequence of steps to take. This is often referred to as an optimal policy.
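To make the backward propagation of rewards concrete, here is a minimal tabular Q-learning sketch. Everything about the setup is an assumption for illustration: a chain of five states where state 0 is the losing state and state 4 the winning state, the +10/-10 rewards from above, and arbitrary values for the learning rate, discount and exploration rate.

```python
import random

N_STATES = 5                  # states 0..4; 0 is losing, 4 is winning
ACTIONS = (-1, +1)            # step left or step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2   # illustrative values only

def step(state, action):
    """Return (next_state, reward, done) for the toy chain."""
    nxt = state + action
    if nxt == N_STATES - 1:
        return nxt, 10.0, True    # entered the winning state
    if nxt == 0:
        return nxt, -10.0, True   # entered the losing state
    return nxt, 0.0, False

def train(episodes=2000, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = rng.choice([1, 2, 3]), False
        while not done:
            # Mostly follow the current best guess, sometimes explore.
            if rng.random() < EPSILON:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            nxt, reward, done = step(state, action)
            future = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            # Pull the estimate toward reward plus discounted future value.
            q[(state, action)] += ALPHA * (reward + GAMMA * future - q[(state, action)])
            state = nxt
    return q

q = train()
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in (1, 2, 3)}
print(policy)
```

The learned values shrink with distance from the winning state, which is the "rewards flowing backwards" effect described above, and the greedy choice in each state is the resulting policy.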
Conversely, the ordinary courses of action that you step through while deriving the optimal policy are simply called policies, and in Q-Learning you are always following some "policy".
Usually the way of determining the reward is the interesting part. Suppose I reward you for each state-transition that does not lead to failure. Then the value of walking all the states until I terminated is however many increments I made, however many state transitions I had.
If certain states have extremely low value, then loss is easy to avoid because almost all bad states are avoided.
However, you don't want to discourage discovery of new, potentially more efficient paths beyond the one that merely works. You want to reward and punish the agent in such a way as to ensure "victory" or "keeping the pole balanced" or whatever for as long as possible, but if failure is too painful, no new, unexplored routes will be tried and the agent gets stuck at local maxima and minima. (There are many approaches in addition to this one.)
So when you ask "how do you test AI algorithms" the best part is that the testing itself is how many "algorithms" are constructed. The algorithm is designed to test a certain course-of-action (policy). It's much more complicated than
"turn left every half mile"
it's more like
"turn left every half mile if I have turned right 3 times and then turned left 2 times and had a quarter in my left pocket to pay fare... etc etc"
It's very precise.
So the testing is usually actually how the A.I. is being programmed. Most models are just probabilistic representations of what is probably good and probably bad. Calculating every possible state is easier for computers (we thought!) because they can focus on one task for very long periods of time and how much they remember is exactly how much RAM you have. However, we learn by affecting neurons in a probabilistic manner, which is why the memristor is such a great discovery -- it's just like a neuron!
You should look at Neural Networks, it's mindblowing. The first time I read about making a "brain" out of a matrix of fake-neuron synaptic connections... A brain that can "remember" basically rocked my universe.
A.I. research is mostly probabilistic because we don't know how to make "thinking" we just know how to imitate our own inner learning process of try, try again.

What are the differences between simulated annealing and genetic algorithms?

What are the relevant differences, in terms of performance and use cases, between simulated annealing (with beam search) and genetic algorithms?
I know that SA can be thought as GA where the population size is only one, but I don't know the key difference between the two.
Also, I am trying to think of a situation where SA will outperform GA or GA will outperform SA. Just one simple example which will help me understand will be enough.
Well, strictly speaking, these two things, simulated annealing (SA) and genetic algorithms (GA), are not algorithms, nor is their purpose 'data mining'.
Both are meta-heuristics--a couple of levels above 'algorithm' on the abstraction scale. In other words, both terms refer to high-level metaphors--one borrowed from metallurgy and the other from evolutionary biology. In the meta-heuristic taxonomy, SA is a single-state method and GA is a population method (in a sub-class along with PSO, ACO, et al, usually referred to as biologically-inspired meta-heuristics).
These two meta-heuristics are used to solve optimization problems, particularly (though not exclusively) in combinatorial optimization (aka constraint-satisfaction programming). Combinatorial optimization refers to optimization by selecting from among a set of discrete items--in other words, there is no continuous function to minimize. The knapsack problem, traveling salesman problem, cutting stock problem--are all combinatorial optimization problems.
The connection to data mining is that the core of many (most?) supervised Machine Learning (ML) algorithms is the solution of an optimization problem--(Multi-Layer Perceptron and Support Vector Machines for instance).
Any technique for solving c/o (combinatorial optimization) problems, regardless of the algorithm, will consist essentially of these steps (which are typically coded as a single block within a recursive loop):
(i) encode the domain-specific details in a cost function (it's the step-wise minimization of the value returned from this function that constitutes a 'solution' to the c/o problem);
(ii) evaluate the cost function passing in an initial 'guess' (to begin iteration);
(iii) based on the value returned from the cost function, generate a subsequent candidate solution (or more than one, depending on the meta-heuristic) to the cost function;
(iv) evaluate each candidate solution by passing it in an argument set, to the cost function;
(v) repeat steps (iii) and (iv) until either some convergence criterion is satisfied or a maximum number of iterations is reached.
Meta-heuristics are directed to step (iii) above; hence, SA and GA differ in how they generate candidate solutions for evaluation by the cost function. In other words, that's the place to look to understand how these two meta-heuristics differ.
Informally, the essence of an algorithm directed to solution of combinatorial optimization is how it handles a candidate solution whose value returned from the cost function is worse than the current best candidate solution (the one that returns the lowest value from the cost function). The simplest way for an optimization algorithm to handle such a candidate solution is to reject it outright--that's what the hill climbing algorithm does. But by doing this, simple hill climbing will always miss a better solution separated from the current solution by a hill. Put another way, a sophisticated optimization algorithm has to include a technique for (temporarily) accepting a candidate solution worse than (i.e., uphill from) the current best solution because an even better solution than the current one might lie along a path through that worse solution.
So how do SA and GA generate candidate solutions?
The essence of SA is usually expressed in terms of the probability that a higher-cost candidate solution will be accepted (the entire expression inside the outer parentheses is an exponent):
p = e^(-(highCost - lowCost) / temperature)
Or in python:
p = math.exp(-(hiCost - loCost) / T)
The 'temperature' term is a variable whose value decays during progress of the optimization--and therefore, the probability that SA will accept a worse solution decreases as iteration number increases.
Put another way, when the algorithm begins iterating, T is very large, which as you can see, causes the algorithm to move to every newly created candidate solution, whether better or worse than the current best solution--i.e., it is doing a random walk in the solution space. As iteration number increases (i.e., as the temperature cools) the algorithm's search of the solution space becomes less permissive, until at T = 0, the behavior is identical to a simple hill-climbing algorithm (i.e., only solutions better than the current best solution are accepted).
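A minimal sketch of that behavior, assuming the usual Metropolis-style rule in which a worse candidate is accepted with probability exp(-(candidate cost - current cost) / T). The geometric cooling schedule, the starting temperature and the toy problem (minimize (x - 7)^2 over the integers) are all my own choices for illustration.

```python
import math
import random

def acceptance_probability(current_cost, candidate_cost, temperature):
    """Probability of moving to the candidate solution."""
    if candidate_cost <= current_cost:
        return 1.0                      # never reject an improvement
    if temperature <= 0:
        return 0.0                      # fully cooled: plain hill climbing
    return math.exp(-(candidate_cost - current_cost) / temperature)

def simulated_annealing(cost, neighbor, start, t_start=10.0, cooling=0.95,
                        steps=1000, seed=0):
    rng = random.Random(seed)
    current = best = start
    temperature = t_start
    for _ in range(steps):
        candidate = neighbor(current, rng)
        p = acceptance_probability(cost(current), cost(candidate), temperature)
        if rng.random() < p:
            current = candidate
            if cost(current) < cost(best):
                best = current          # remember the best state ever visited
        temperature *= cooling          # geometric cooling (an assumption)
    return best

def toy_cost(x):
    return (x - 7) ** 2

def toy_neighbor(x, rng):
    return x + rng.choice((-1, 1))

result = simulated_annealing(toy_cost, toy_neighbor, start=0)
print(result)
```

Early on the large temperature lets uphill moves through fairly often; as it decays the exponent becomes very negative and the loop degenerates into hill climbing.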
Genetic Algorithms are very different. For one thing (and this is a big thing) a GA generates not a single candidate solution but an entire 'population' of them. It works like this: GA calls the cost function on each member (candidate solution) of the population. It then ranks them, from best to worst, ordered by the value returned from the cost function ('best' has the lowest value). From these ranked values (and their corresponding candidate solutions) the next population is created. New members of the population are created in essentially one of three ways. The first is usually referred to as 'elitism' and in practice usually refers to just taking the highest ranked candidate solutions and passing them straight through, unmodified, to the next generation. The other two ways that new members of the population are created are usually referred to as 'mutation' and 'crossover'. Mutation usually involves a change in one element in a candidate solution vector from the current population to create a solution vector in the new population, e.g., [4, 5, 1, 0, 2] => [4, 5, 2, 0, 2]. The result of the crossover operation is like what would happen if vectors could have sex, i.e., a new child vector whose elements are drawn from each of two parents.
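Here is a rough sketch of one GA generation built from those three operations. The details are assumptions on my part: single-point crossover, parents drawn from the better half of the ranking, a 20% mutation rate, and a toy cost function (the sum of the vector) standing in for a real problem.

```python
import random

def mutate(vector, rng, low=0, high=5):
    """Re-draw one randomly chosen element (it may draw the same value again)."""
    child = list(vector)
    child[rng.randrange(len(child))] = rng.randint(low, high)
    return child

def crossover(parent_a, parent_b, rng):
    """Single-point crossover: head of one parent, tail of the other."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]

def next_generation(population, cost, rng, n_elite=2, mutation_rate=0.2):
    ranked = sorted(population, key=cost)              # best (lowest cost) first
    new_population = [list(v) for v in ranked[:n_elite]]   # elitism
    parents = ranked[:max(2, len(ranked) // 2)]        # breed from the better half
    while len(new_population) < len(population):
        a, b = rng.sample(parents, 2)
        child = crossover(a, b, rng)
        if rng.random() < mutation_rate:
            child = mutate(child, rng)
        new_population.append(child)
    return new_population

rng = random.Random(0)
population = [[rng.randint(0, 5) for _ in range(5)] for _ in range(10)]
best_before = min(sum(v) for v in population)
for _ in range(30):
    population = next_generation(population, sum, rng)
best_after = min(sum(v) for v in population)
print(best_before, best_after)
```

Because the elite members are copied through unchanged, the best cost in the population can never get worse from one generation to the next.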
So those are the algorithmic differences between GA and SA. What about the differences in performance?
In practice (my observations are limited to combinatorial optimization problems), GA nearly always beats SA (it returns a lower 'best' value from the cost function, i.e., a value closer to the solution space's global minimum), but at a higher computational cost. As far as I am aware, the textbooks and technical publications report the same conclusion.
But here's the thing: GA is inherently parallelizable; what's more, it's trivial to do so because the individual "search agents" comprising each population do not need to exchange messages, i.e., they work independently of each other. That means GA computation can be distributed, which means that in practice you can get much better results (closer to the global minimum) and better performance (execution speed).
In what circumstances might SA outperform GA? The general scenario i think would be those optimization problems having a small solution space so that the result from SA and GA are practically the same, yet the execution context (e.g., hundreds of similar problems run in batch mode) favors the faster algorithm (which should always be SA).
It is really difficult to compare the two since they were inspired from different domains..
A Genetic Algorithm maintains a population of possible solutions, and at each step selects pairs of possible solutions, combines them (crossover), and applies some random changes (mutation). The algorithm is based on the idea of "survival of the fittest", where the selection process is done according to a fitness criterion (in optimization problems it is usually simply the value of the objective function evaluated at the current solution). The crossover is done in the hope that two good solutions, when combined, might give an even better solution.
On the other hand, Simulated Annealing only tracks one solution in the space of possible solutions, and at each iteration considers whether to move to a neighboring solution or stay at the current one according to some probability (which decays over time). This is different from a heuristic search (say greedy search) in that it doesn't suffer from the problem of local optima, since it can get unstuck from cases where all neighboring solutions are worse than the current one.
I'm far from an expert on these algorithms, but I'll try and help out.
I think the biggest difference between the two is the idea of crossover in GA and so any example of a learning task that is better suited to GA than SA is going to hinge on what crossover means in that situation and how it is implemented.
The idea of crossover is that you can meaningfully combine two solutions to produce a better one. I think this only makes sense if the solutions to a problem are structured in some way. I could imagine, for example, in multi-class classification taking two (or many) classifiers that are good at classifying a particular class and combining them by voting to make a much better classifier. Another example might be Genetic Programming, where the solution can be expressed as a tree, but I find it hard to come up with a good example where you could combine two programs to create a better one.
I think it's difficult to come up with a compelling case for one over the other because they really are quite similar algorithms, perhaps having been developed from very different starting points.

Pruning Deductions in Expert Systems

In a rule system, or any reasoning system that deduces facts via forward-chaining inference rules, how would you prune "unnecessary" branches? I'm not sure what the formal terminology is, but I'm just trying to understand how people are able to limit their train-of-thought when reasoning over problems, whereas all semantic reasoners I've seen appear unable to do this.
For example, in John McCarthy's paper An Example for Natural Language Understanding and the AI Problems It Raises, he describes potential problems in getting a program to intelligently answer questions about a news article in the New York Times. In section 4, "The Need For Nonmonotonic Reasoning", he discusses the use of Occam's Razor to restrict the inclusion of facts when reasoning about the story. The sample story he uses is one about robbers who victimize a furniture store owner.
If a program were asked to form a "minimal completion" of the story in predicate calculus, it might need to include facts not directly mentioned in the original story. However, it would also need some way of knowing when to limit its chain of deduction, so as not to include irrelevant details. For example, it might want to include the exact number of police involved in the case, which the article omits, but it won't want to include the fact that each police officer has a mother.
Good Question.
From your question, I think what you refer to as 'pruning' is a model-building step performed ex ante, i.e., to limit the inputs available to the algorithm to build the model. The term 'pruning' when used in Machine Learning refers to something different: an ex post step, performed after model construction, that operates upon the model itself and not on the available inputs. (There could be a second meaning for the term 'pruning' in the ML domain, but I'm not aware of it.) In other words, pruning is indeed literally a technique to "limit its chain of deduction" as you put it, but it does so ex post, by excision of components of a complete (working) model, and not by limiting the inputs used to create that model.
On the other hand, isolating or limiting the inputs available for model construction, which is what I think you might have had in mind, is indeed a key Machine Learning theme; it's clearly a factor responsible for the superior performance of many of the more recent ML algorithms. Examples are Support Vector Machines (the insight that underlies SVM is construction of the maximum-margin hyperplane from only a small subset of the data, i.e., the 'support vectors') and Multivariate Adaptive Regression Splines, or MARS (a regression technique in which no attempt is made to fit the data by "drawing a single continuous curve through it"; instead, discrete sections of the data are fit, one by one, using a bounded linear equation for each portion, i.e., the 'splines', so the predicate step of optimal partitioning of the data is obviously the crux of this algorithm).
What problem is solved by pruning?
At least w/r/t the specific ML algorithms I have actually coded and used (Decision Trees, MARS, and Neural Networks), pruning is performed on an initially over-fit model (a model that fits the training data so closely that it is unable to generalize, i.e., accurately predict new instances). In each instance, pruning involves removing marginal nodes (DT, NN) or terms in the regression equation (MARS) one by one.
Second, why is pruning necessary/desirable?
Isn't it better to just accurately set the convergence/splitting criteria? That won't always help. Pruning works from "the bottom up"; the model is constructed from the top down, so tuning the model (to achieve the same benefit as pruning) eliminates not just one or more decision nodes but also the child nodes below them (like trimming a tree closer to the trunk). So eliminating a marginal node might also eliminate one or more strong nodes subordinate to that marginal node, but the modeler would never know that because his/her tuning prevented further node creation at that marginal node. Pruning works from the other direction: from the most subordinate (lowest-level) child nodes upward in the direction of the root node.
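To illustrate the bottom-up direction on a decision tree, here is a small sketch of reduced-error pruning, which is one specific pruning variant that I am picking for the example (the discussion above does not name one). The dict-based tree layout, the "majority" field and the four validation points are all invented for illustration.

```python
def predict(node, x):
    """Walk from the given node down to a leaf and return its label."""
    while "label" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

def accuracy(tree, data):
    return sum(predict(tree, x) == y for x, y in data) / len(data)

def prune(node, root, validation):
    """Bottom-up: prune the children first, then try collapsing this node."""
    if "label" in node:
        return
    prune(node["left"], root, validation)
    prune(node["right"], root, validation)
    before = accuracy(root, validation)
    saved = dict(node)
    node.clear()
    node["label"] = saved["majority"]      # tentatively turn the node into a leaf
    if accuracy(root, validation) < before:
        node.clear()
        node.update(saved)                 # collapsing hurt, so restore the split

# Hypothetical over-fit tree: the split under "right" only models noise.
tree = {
    "feature": 0, "threshold": 0.5, "majority": 0,
    "left": {"label": 0},
    "right": {
        "feature": 1, "threshold": 0.5, "majority": 1,
        "left": {"label": 1},
        "right": {"label": 0},
    },
}
validation = [((0.2, 0.1), 0), ((0.3, 0.9), 0), ((0.8, 0.2), 1), ((0.9, 0.8), 1)]

print(accuracy(tree, validation))
prune(tree, tree, validation)
print(accuracy(tree, validation), tree["right"])
```

The marginal lower split is collapsed into a leaf while the useful split at the root survives, because collapsing the root would lower the validation accuracy.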

Resources