Dog face detection with dlib - need advice on improving recall - face-detection

I'm trying to train a dog face detector with dlib's HOG pyramid detector.
I used the Columbia dogs dataset: ftp://ftp.umiacs.umd.edu/pub/kanazawa/CU_Dogs.zip
At first I got a recall of 0%, but by increasing the C value I managed to raise it to 62% on the training set and 53% on the testing set. Beyond a certain point (C of 1000+), increasing C stopped helping and only slowed down training.
Precision is really high though: when it does find a dog's face it's always correct, and I haven't seen any false positives.
Could you give any advice on how I could improve recall to a decent level?
Thanks in advance
UPDATE:
Following Davis King's advice, I got the accuracy to 100% on the training set and 80% on the testing set just by training a different detector per breed. I imagine it could be even higher if I cluster them by the direction they're looking in.

You probably need to train different detectors for different head poses and dogs that look very different. I would try running dlib's imglab command line tool with the --cluster option. That will cluster the images into coherent poses and you can train detectors for each pose.
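A rough sketch of the per-pose training loop, assuming the dlib Python bindings and assuming that the clustering step has already produced one XML annotation file per pose. The file names, the `train_per_cluster` helper and the default C value are made up for illustration; left-right flips are switched off here on the assumption that mirrored images would blur the pose clusters.

```python
import os


def detector_path_for(xml_path):
    """Map a cluster XML file (e.g. 'cluster_001.xml') to a detector file name."""
    root, _ = os.path.splitext(xml_path)
    return root + ".svm"


def train_per_cluster(xml_paths, c_value=5.0):
    """Train one HOG detector per cluster file; return {xml: (precision, recall)}."""
    import dlib  # imported lazily so the helper above works without dlib installed

    results = {}
    for xml_path in xml_paths:
        options = dlib.simple_object_detector_training_options()
        options.C = c_value                         # SVM regularization (hypothetical value)
        options.add_left_right_image_flips = False  # assumption: flips would mix poses
        options.be_verbose = True
        out_path = detector_path_for(xml_path)
        dlib.train_simple_object_detector(xml_path, out_path, options)
        # Scores on the training file itself; use a held-out XML for honest numbers.
        stats = dlib.test_simple_object_detector(xml_path, out_path)
        results[xml_path] = (stats.precision, stats.recall)
    return results
```

At detection time you would then run every per-pose detector on an image and merge the hits.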

Related

Help--100% accuracy with LibSVM?

Nominally a good problem to have, but I'm pretty sure it is because something funny is going on...
As context, I'm working on a problem in the facial expression/recognition space, so getting 100% accuracy seems incredibly implausible (not that it would be plausible in most applications...). I'm guessing there is either some consistent bias in the data set that is making it overly easy for an SVM to pull out the answer, =or=, more likely, I've done something wrong on the SVM side.
I'm looking for suggestions to help understand what is going on--is it me (=my usage of LibSVM)? Or is it the data?
The details:
About ~2500 labeled data vectors/instances (transformed video frames of individuals--<20 individual persons total), binary classification problem. ~900 features/instance. Unbalanced data set at about a 1:4 ratio.
Ran subset.py to separate the data into test (500 instances) and train (remaining).
Ran "svm-train -t 0 ". (Note: apparently no need for '-w1 1 -w-1 4'...)
Ran svm-predict on the test file. Accuracy=100%!
Things tried:
Checked about 10 times over that I'm not training & testing on the same data files, through some inadvertent command-line argument error
re-ran subset.py (even with -s 1) multiple times and did train/test only multiple different data sets (in case I randomly upon the most magical train/test pa
ran a simple diff-like check to confirm that the test file is not a subset of the training data
svm-scale on the data has no effect on accuracy (accuracy=100%). (Although the number of support vectors does drop from nSV=127, bSV=64 to nBSV=72, bSV=0.)
((weird)) using the default RBF kernel (vice linear -- i.e., removing '-t 0') results in accuracy going to garbage(?!)
(sanity check) running svm-predict using a model trained on a scaled data set against an unscaled data set results in accuracy = 80% (i.e., it always guesses the dominant class). This is strictly a sanity check to make sure that somehow svm-predict is nominally acting right on my machine.
Tentative conclusion?:
Something with the data is wacked--somehow, within the data set, there is a subtle, experimenter-driven effect that the SVM is picking up on.
(This doesn't, on first pass, explain why the RBF kernel gives garbage results, however.)
Would greatly appreciate any suggestions on a) how to fix my usage of LibSVM (if that is actually the problem) or b) determine what subtle experimenter-bias in the data LibSVM is picking up on.
Two other ideas:
Make sure you're not training and testing on the same data. This sounds kind of dumb, but in computer vision applications you should take care that you're not repeating data (say, two frames of the same video falling into different folds), that you're not training and testing on the same individual, etc. It is more subtle than it sounds.
Make sure you search for gamma and C parameters for the RBF kernel. There are good theoretical (asymptotic) results that justify that a linear classifier is just a degenerate RBF classifier. So you should just look for a good (C, gamma) pair.
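To make the first idea concrete, here is a minimal sketch of a split that keeps every individual entirely on one side, instead of sampling frames at random. The `split_by_group` helper and the person IDs are hypothetical; the real data would supply one ID per instance.

```python
import random


def split_by_group(groups, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that no group appears on both sides."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Hypothetical example: one person ID per video frame.
person_ids = ["p1", "p1", "p2", "p2", "p2", "p3", "p4", "p4", "p5", "p5"]
train_idx, test_idx = split_by_group(person_ids, test_fraction=0.4)
```

If accuracy drops sharply with a split like this, the 100% figure was most likely the classifier recognizing individuals rather than expressions.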
Notwithstanding that the devil is in the details, here are three simple tests you could try:
Quickie (~2 minutes): Run the data through a decision tree algorithm. This is available in Matlab via classregtree, or you can load into R and use rpart. This could tell you if one or just a few features happen to give a perfect separation.
Not-so-quickie (~10-60 minutes, depending on your infrastructure): Iteratively split the features (i.e. from 900 to 2 sets of 450), train, and test. If one of the subsets gives you perfect classification, split it again. It would take fewer than 10 such splits to find out where the problem variables are. If it happens to "break" with many variables remaining (or even in the first split), select a different random subset of features, shave off fewer variables at a time, etc. It can't possibly need all 900 to split the data.
Deeper analysis (minutes to several hours): try permutations of labels. If you can permute all of them and still get perfect separation, you have some problem in your train/test setup. If you select increasingly larger subsets to permute (or, if going in the other direction, to leave static), you can see where you begin to lose separability. Alternatively, consider decreasing your training set size and if you get separability even with a very small training set, then something is weird.
Method #1 is fast & should be insightful. There are some other methods I could recommend, but #1 and #2 are easy and it would be odd if they don't give any insights.
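If neither Matlab nor R is at hand, a cruder version of test #1 is easy to write directly: check whether any single feature separates the classes with one threshold (a depth-1 tree). This is only a sketch on made-up data, and it will not catch separations that need two or more features together.

```python
def perfectly_separating_features(rows, labels):
    """Return indices of features where one threshold splits the two classes exactly."""
    hits = []
    n_features = len(rows[0])
    for j in range(n_features):
        pos = [row[j] for row, y in zip(rows, labels) if y == 1]
        neg = [row[j] for row, y in zip(rows, labels) if y != 1]
        # Perfect split: all positives lie strictly on one side of all negatives.
        if max(pos) < min(neg) or max(neg) < min(pos):
            hits.append(j)
    return hits


# Toy data (made up): feature 1 leaks the label, features 0 and 2 do not.
rows = [
    [0.2, 5.0, 1.0],
    [0.9, 6.1, 0.3],
    [0.4, 1.0, 0.8],
    [0.7, 0.5, 0.2],
]
labels = [1, 1, -1, -1]
suspects = perfectly_separating_features(rows, labels)
```

Any index that shows up in `suspects` on the real data is worth inspecting by hand for experimenter-driven leakage.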

Find optimal/good-enough strategy and AI for the game 'Proximity'?

'Proximity' is a strategy game of territorial domination similar to Othello, Go and Risk.
Two players, uses a 10x12 hex grid. Game invented by Brian Cable in 2007.
Seems to be a worthy game for discussing a) optimal algorithm then b) how to build an AI.
Strategies are going to be probabilistic or heuristic-based, due to the randomness factor, and the insane branching factor (20^120).
So it will be kind of hard to compare objectively.
A compute time limit of 5 seconds max per turn seems reasonable => this rules out all brute-force attempts. (Play the game's AI on Expert level to get a feel - it does a very good job based on some simple heuristic)
Game: Flash version here, iPhone version iProximity here and many copies elsewhere on the web
Rules: here
Object: to have control of the most armies after all tiles have been placed. You start with an empty hexboard. Each turn you receive a randomly numbered tile (value between 1 and 20 armies) to place on any vacant board space. If this tile is adjacent to any ALLY tiles, it will strengthen each of those tile's defenses +1 (up to a max value of 20). If it is adjacent to any ENEMY tiles, it will take control over them IF its number is higher than the number on the enemy tile.
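As a starting point for any AI, the placement rule above can be sketched in a few lines. Assumptions that are mine and not from the rules text: the board is a dict keyed by axial hex coordinates, captured tiles keep their army value, and board bounds are not checked.

```python
MAX_ARMIES = 20
# Axial-coordinate neighbor offsets for a hex grid (an assumed layout).
HEX_NEIGHBORS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]


def place_tile(board, pos, owner, value):
    """Place a tile on a vacant cell, then apply reinforcement and capture."""
    if pos in board:
        raise ValueError("cell is already occupied")
    board[pos] = (owner, value)
    q, r = pos
    for dq, dr in HEX_NEIGHBORS:
        npos = (q + dq, r + dr)
        if npos not in board:
            continue
        n_owner, n_value = board[npos]
        if n_owner == owner:
            # Adjacent ally: +1 defense, capped at 20.
            board[npos] = (n_owner, min(MAX_ARMIES, n_value + 1))
        elif value > n_value:
            # Adjacent weaker enemy: take control (value assumed unchanged).
            board[npos] = (owner, n_value)


def score(board, owner):
    """Total armies controlled by one player."""
    return sum(v for o, v in board.values() if o == owner)


board = {}
place_tile(board, (0, 0), "A", 5)
place_tile(board, (1, 0), "B", 9)   # captures A's 5
place_tile(board, (0, 1), "A", 12)  # captures both B tiles
```

A heuristic AI can copy the board, call `place_tile` for every vacant cell, and compare `score` differences.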
Thoughts on strategy: Here are some initial thoughts; setting the computer AI to Expert will probably teach a lot:
minimizing your perimeter seems to be a good strategy, to prevent flips and minimize worst-case damage
like in Go, leaving holes inside your formation is lethal, only more so with the hex grid because you can lose armies on up to 6 squares in one move
low-numbered tiles are a liability, so place them away from your main territory, near the board edges and scattered. You can also use low-numbered tiles to plug holes in your formation, or make small gains along the perimeter which the opponent will not tend to bother attacking.
a triangle formation of three pieces is strong since they mutually reinforce, and also reduce the perimeter
Each tile can be flipped at most 6 times, i.e. when its neighbor tiles are occupied. Control of a formation can flow back and forth. Sometimes you lose part of a formation and plug any holes to render that part of the board 'dead' and lock in your territory/ prevent further losses.
Low-numbered tiles are obvious-but-low-valued liabilities, but high-numbered tiles can be bigger liabilities if they get flipped (which is harder). One lucky play with a 20-army tile can cause a swing of 200 (from +100 to -100 armies). So tile placement will have both offensive and defensive considerations.
Comments 1, 2 and 4 seem to resemble a minimax strategy where we minimize the maximum expected possible loss (modified by some probabilistic consideration of the value ß the opponent can get from 1..20, i.e. a structure which can only be flipped by a ß=20 tile is 'nearly impregnable').
I'm not clear what the implications of comments 3,5,6 are for optimal strategy.
Interested in comments from Go, Chess or Othello players.
(The sequel ProximityHD for XBox Live allows 4-player cooperative or competitive local multiplayer and increases the branching factor, since you now have 5 tiles in your hand at any given time, of which you can only play one. Reinforcement of ally tiles is increased to +2 per ally.)
A former member of the U of A GAMES group here.
That branching factor is insane. Far worse than Go.
Basically, you're hooped.
The problem with this game is that it is not deterministic due to the selection of a random tile. This actually adds another layer of nodes between each existing layer of nodes in the tree. You'll be interested in my publications on *-Minimax to learn about techniques for searching in stochastic domains.
In order to complete one-ply searches before the end of this century, you're going to need some very aggressive forward pruning techniques. Throw provably best move out the window early and concentrate on building good move ordering.
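To illustrate the extra layer of chance nodes, here is plain expectiminimax on a toy tree. Note that this is the unpruned baseline, not *-Minimax itself (which adds pruning at chance nodes); the tree encoding and the numbers are invented for the example.

```python
def expectiminimax(node):
    """Evaluate a small game tree that contains MAX, MIN and CHANCE nodes."""
    if isinstance(node, (int, float)):
        return node  # leaf: static evaluation
    kind, children = node
    if kind == "max":
        return max(expectiminimax(c) for c in children)
    if kind == "min":
        return min(expectiminimax(c) for c in children)
    if kind == "chance":
        # Children are (probability, subtree) pairs, e.g. one per possible tile value.
        return sum(p * expectiminimax(c) for p, c in children)
    raise ValueError(f"unknown node kind: {kind!r}")


tree = ("chance", [
    (0.5, ("max", [3, ("min", [10, 2])])),
    (0.5, ("max", [8, 4])),
])
value = expectiminimax(tree)
```

In Proximity each chance node would have 20 children (one per tile value) sitting on top of up to 120 placement moves, which is why forward pruning and move ordering matter so much.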
For general algorithms, I would suggest you check the research done by the University of Alberta AI Games group: http://games.cs.ualberta.ca Many of the algorithms there are guaranteed to find optimal policies. However, I doubt you're really interested in finding the optimal one; aim for "good enough" unless you want to sell that game in Korea :D
From your description, I have understood the game to be two-player with full observability (i.e. no hidden units and such) and fully deterministic (i.e. the outcomes of a player's actions do not require rolling). If so, you should take a look at the real-time bounded-search minimax derivatives proposed by the U Alberta guys. Being able to bound the depth of the backups of the value function as well would perhaps be a nice way to add a "difficulty level" to your game. They have been doing some work - a bit fishy imo - on sampling the search space for improving value function estimates.
About the "strategy" section you describe: in the framework I am mentioning, you will have to encode that knowledge as an evaluation function. Look at the work of Michael Büro and others - also in the U Alberta group - for examples of such knowledge engineering.
Another possibility would be to pose the problem as a Reinforcement Learning problem, where adversary moves are compiled as "afterstates". Look that up on the Barto & Sutton book: http://webdocs.cs.ualberta.ca/~sutton/book/the-book.html However the value function for a RL problem resulting from such a compilation might prove a bit difficult to solve optimally - the number of states will blow up like an H-Bomb. However, if you see how to use a factored representation, things can be much easier. And your "strategy" could perhaps be encoded as some shaping function, which would be speeding up the learning process considerably.
EDIT: Damn English prepositions

How to program a neural network for chess?

I want to program a chess engine which learns to make good moves and win against other players. I've already coded a representation of the chess board and a function which outputs all possible moves. So I only need an evaluation function which says how good a given situation of the board is. Therefore, I would like to use an artificial neural network which should then evaluate a given position. The output should be a numerical value. The higher the value is, the better is the position for the white player.
My approach is to build a network of 385 neurons: There are six unique chess pieces and 64 fields on the board. So for every field we take 6 neurons (1 for every piece). If there is a white piece, the input value is 1. If there is a black piece, the value is -1. And if there is no piece of that sort on that field, the value is 0. In addition to that there should be 1 neuron for the player to move. If it is White's turn, the input value is 1 and if it's Black's turn, the value is -1.
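For illustration, the 385-input encoding described above might look like this (Python here rather than Delphi; the square numbering, the piece letters and the `encode_position` name are arbitrary choices for this sketch):

```python
PIECE_TYPES = ["P", "N", "B", "R", "Q", "K"]  # pawn, knight, bishop, rook, queen, king


def encode_position(pieces, white_to_move):
    """Encode a position as 64 * 6 + 1 = 385 inputs with values in {-1, 0, 1}.

    `pieces` maps a square index 0..63 to a (color, piece_type) pair,
    where color is "w" or "b".
    """
    inputs = [0] * (64 * len(PIECE_TYPES) + 1)
    for square, (color, piece_type) in pieces.items():
        offset = square * len(PIECE_TYPES) + PIECE_TYPES.index(piece_type)
        inputs[offset] = 1 if color == "w" else -1
    inputs[-1] = 1 if white_to_move else -1  # side-to-move neuron
    return inputs


# Toy position: white king on square 4, black king on 60, white pawn on 12.
x = encode_position({4: ("w", "K"), 60: ("b", "K"), 12: ("w", "P")}, white_to_move=True)
```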
I think that configuration of the neural network is quite good. But the main part is missing: How can I implement this neural network into a coding language (e.g. Delphi)? I think the weights for each neuron should be the same in the beginning. Depending on the result of a match, the weights should then be adjusted. But how? I think I should let 2 computer players (both using my engine) play against each other. If White wins, Black gets the feedback that its weights aren't good.
So it would be great if you could help me implementing the neural network into a coding language (best would be Delphi, otherwise pseudo-code). Thanks in advance!
In case somebody randomly finds this page. Given what we know now, what the OP proposes is almost certainly possible. In fact we managed to do it for a game with much larger state space - Go ( https://deepmind.com/research/case-studies/alphago-the-story-so-far ).
I don't see why you can't have a neural net for a static evaluator if you also do some classic mini-max lookahead with alpha-beta pruning. Lots of Chess engines use minimax with a braindead static evaluator that just adds up the pieces or something; it doesn't matter so much if you have enough levels of minimax. I don't know how much of an improvement the net would make but there's little to lose. Training it would be tricky though. I'd suggest using an engine that looks ahead many moves (and takes loads of CPU etc) to train the evaluator for an engine that looks ahead fewer moves. That way you end up with an engine that doesn't take as much CPU (hopefully).
Edit: I wrote the above in 2010, and now in 2020 Stockfish NNUE has done it. "The network is optimized and trained on the [classical Stockfish] evaluations of millions of positions at moderate search depth" and then used as a static evaluator, and in their initial tests they got an 80-elo improvement when using this static evaluator instead of their previous one (or, equivalently, the same elo with a little less CPU time). So yes it does work, and you don't even have to train the network at high search depth as I originally suggested: moderate search depth is enough, but the key is to use many millions of positions.
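A compact sketch of the lookahead part, with the static evaluator passed in as a function so that a piece counter or a trained network can be swapped in. The nested-list toy tree and the helper names are assumptions for the example; a real engine would generate children from its move generator.

```python
def alphabeta(node, depth, alpha, beta, maximizing, evaluate, children):
    """Depth-limited minimax with alpha-beta pruning and a pluggable static evaluator."""
    kids = children(node)
    if depth == 0 or not kids:
        return evaluate(node)
    if maximizing:
        best = float("-inf")
        for child in kids:
            best = max(best, alphabeta(child, depth - 1, alpha, beta, False, evaluate, children))
            alpha = max(alpha, best)
            if alpha >= beta:
                break  # the minimizing player will never allow this line
        return best
    best = float("inf")
    for child in kids:
        best = min(best, alphabeta(child, depth - 1, alpha, beta, True, evaluate, children))
        beta = min(beta, best)
        if alpha >= beta:
            break  # the maximizing player already has a better option
    return best


def toy_children(node):
    """Inner nodes of the toy tree are lists; leaves have no children."""
    return node if isinstance(node, list) else []


def toy_evaluate(node):
    """Stand-in for a neural-net evaluator: leaves carry their own score."""
    return node if not isinstance(node, list) else 0


tree = [[3, 5], [6, [9, 1]], [1, 2]]
best = alphabeta(tree, 3, float("-inf"), float("inf"), True, toy_evaluate, toy_children)
```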
Been there, done that. Since there is no continuity in your problem (the value of a position is not closely related to another position with only 1 change in the value of one input), there is very little chance a NN would work. And it never did in my experiments.
I would rather see a simulated annealing system with an ad-hoc heuristic (of which there are plenty out there) to evaluate the value of the position...
However, if you are set on using a NN, it is relatively easy to represent. A general NN is simply a graph, with each node being a neuron. Each neuron has a current activation value, and a transition formula to compute the next activation value, based on input values, i.e. activation values of all the nodes that have a link to it.
A more classical NN, that is with an input layer, an output layer, identical neurons for each layer, and no time-dependency, can thus be represented by an array of input nodes, an array of output nodes, and a linked graph of nodes connecting those. Each node possesses a current activation value, and a list of nodes it forwards to. Computing the output value is simply setting the activations of the input neurons to the input values, and iterating through each subsequent layer in turn, computing the activation values from the previous layer using the transition formula. When you have reached the last (output) layer, you have your result.
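The layer-by-layer computation just described fits in a few lines. This sketch assumes fully connected layers and tanh as the transition formula; both are choices made for the example, and the weights are made up.

```python
import math


def forward(layers, inputs):
    """Propagate inputs through fully connected layers with tanh activations.

    `layers` is a list of (weights, biases); weights[i][j] connects input j
    of the layer to its neuron i.
    """
    activations = list(inputs)
    for weights, biases in layers:
        activations = [
            math.tanh(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations


# A tiny 2-3-1 network with made-up weights.
hidden = ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1])
output = ([[1.0, -1.0, 0.5]], [0.0])
result = forward([hidden, output], [1.0, -1.0])
```

For the chess evaluator, the input list would have 385 entries and the single output would be the position score.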
It is possible, but not trivial by any means.
https://erikbern.com/2014/11/29/deep-learning-for-chess/
To train his evaluation function, he used a lot of computing power.
To summarize generally, you could go about it as follows. Your evaluation function is a feedforward NN. Let the matrix computations lead to a scalar output valuing how good the move is. The input vector for the network is the board state represented by all the pieces on the board so say white pawn is 1, white knight is 2... and empty space is 0. An example board state input vector is simply a sequence of 0-12's. This evaluation can be trained using grandmaster games (available at a fics database for example) for many games, minimizing loss between what the current parameters say is the highest valuation and what move the grandmasters made (which should have the highest valuation). This of course assumes that the grandmaster moves are correct and optimal.
What you need to train an ANN is either something like backpropagation learning or some form of a genetic algorithm. But chess is such a complex game that it is unlikely that a simple ANN will learn to play it - even more so if the learning process is unsupervised.
Further, your question does not say anything about the number of layers. You want to use 385 input neurons to encode the current situation. But how do you want to decide what to do? One neuron per field? Highest excitation wins? But there is often more than one possible move.
Further you will need several hidden layers - the functions that can be represented with an input and an output layer without hidden layer are really limited.
So I do not want to prevent you from trying it, but the chances of a successful implementation and training within, say, one year or so are practically zero.
I tried to build and train an ANN to play Tic-tac-toe when I was 16 years or so ... and I failed. I would suggest trying such a simple game first.
The main problem I see here is one of training. You say you want your ANN to take the current board position and evaluate how good it is for a player. (I assume you will take every possible move for a player, apply it to the current board state, evaluate via the ANN and then take the one with the highest output - ie: hill climbing)
Your options as I see them are:
Develop some heuristic function to evaluate the board state and train the network off that. But that begs the question of why use an ANN at all, when you could just use your heuristic.
Use some statistical measure such as "How many games were won by white or black from this board configuration?", which would give you a fitness value between white or black. The difficulty with that is the amount of training data required for the size of your problem space.
With the second option you could always feed it board sequences from grandmaster games and hope there is enough coverage for the ANN to develop a solution.
Due to the complexity of the problem I'd want to throw the largest network (ie: lots of internal nodes) at it as I could without slowing down the training too much.
Your input algorithm is sound - all positions, all pieces, and both players are accounted for. You may need an input layer for every past state of the gameboard, so that past events are used as input again.
The output layer should (in some form) give the piece to move, and the location to move to.
Write a genetic algorithm using a connectome which contains all neuron weights and synapse strengths, and begin multiple separated gene pools with a large number of connectomes in each.
Make them play one another, keep the best handful, crossover and mutate the best connectomes to repopulate the pool.
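One generation of that loop could be sketched as follows. The fitness function here is a stand-in (distance to a hidden target vector) so the example runs on its own; in the chess setting it would be replaced by results of games between the networks. The uniform crossover, Gaussian mutation and all parameter values are assumptions.

```python
import random


def evolve(population, fitness, rng, keep=2, mutation_scale=0.1):
    """One generation: keep the best `keep` genomes, refill by crossover + mutation."""
    ranked = sorted(population, key=fitness, reverse=True)
    elite = ranked[:keep]
    next_gen = [list(g) for g in elite]
    while len(next_gen) < len(population):
        mom, dad = rng.sample(elite, 2)
        # Uniform crossover: each weight comes from one of the two parents.
        child = [rng.choice(pair) for pair in zip(mom, dad)]
        # Gaussian mutation on every weight.
        child = [w + rng.gauss(0.0, mutation_scale) for w in child]
        next_gen.append(child)
    return next_gen


# Stand-in fitness (an assumption): closeness to a hidden target vector.
TARGET = [0.5, -1.0, 2.0]


def fitness(genome):
    return -sum((w - t) ** 2 for w, t in zip(genome, TARGET))


rng = random.Random(42)
population = [[rng.uniform(-3, 3) for _ in TARGET] for _ in range(20)]
start_best = max(fitness(g) for g in population)
for _ in range(50):
    population = evolve(population, fitness, rng, keep=4)
end_best = max(fitness(g) for g in population)
```

Because the elite genomes are carried over unchanged, the best fitness can never get worse from one generation to the next.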
Read blondie24 : http://www.amazon.co.uk/Blondie24-Playing-Kaufmann-Artificial-Intelligence/dp/1558607838.
It deals with checkers instead of chess but the principles are the same.
Came here to say what Silas said. Using a minimax algorithm, you can expect to be able to look ahead N moves. Using Alpha-beta pruning, you can expand that to theoretically 2*N moves, but more realistically 3*N/4 moves. Neural networks are really appropriate here.
Perhaps though a genetic algorithm could be used.

What optimization problems do you want to have solved?

I love to work on AI optimization software (Genetic Algorithms, Particle Swarm, Ant Colony, ...). Unfortunately I have run out of interesting problems to solve. What problem would you like to have solved?
This list of NP complete problems should keep you busy for a while...
How about the Hutter Prize?
From the entry on Wikipedia:
The Hutter Prize is a cash prize
funded by Marcus Hutter which rewards
data compression improvements on a
specific 100 MB English text file.
[...]
The goal of the Hutter Prize is to
encourage research in artificial
intelligence (AI). The organizers
believe that text compression and AI
are equivalent problems.
Basically the idea is that in order to make a compressor which is able to compress data most efficiently, the compressor must be, in Marcus Hutter's words, "smarter". For more information on the relation between artificial intelligence and compression, see the Motivation and FAQ sections of the Hutter Prize website.
Does the Netflix Prize count?
I would like my bank balance optimised so that there is as much money as possible left at the end of the month, instead of the other way round.
What about the Go Game ?
Here's an interesting practical problem I came up while tinkering with color quantization and image compression.
The basic idea is that I would like a program to which I give a picture and it reduces the number of colors in it as much as possible without me noticing. Since every person has a different sensitivity of the eye (and eyes have different sensitivity to red/green/blue intensities), it should be possible to specify this sensitivity threshold in some way.
In other words, in a truecolor picture, replace every pixel's color with another color so that:
The total count of different colors in a picture would be the smallest possible; and
Every new pixel would have its color no further from the original color than some user-specified value D.
The D can be defined in different ways, pick your favorite. For example:
Separate red, green and blue components for specifying the maximum possible deviation for each of them (for every pixel you get a rectangular cuboid of valid replacement values);
A real number which would represent the maximum allowable distance in the RGB cube (for every pixel you get a sphere of valid replacement values);
Something in between or completely different.
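For the per-channel variant of D, a simple baseline that always satisfies the distance constraint is to snap each channel to the center of a bucket of width 2D+1. It does not minimize the color count (that is the actual optimization problem), but it gives an upper bound to beat. The function names and pixel values below are made up.

```python
def quantize_channel(value, d):
    """Snap an 8-bit channel value to a bucket center at most `d` away."""
    width = 2 * d + 1
    center = (value // width) * width + d
    return min(255, center)  # clamping keeps the result within `d` of `value`


def quantize_image(pixels, d):
    """Replace every (r, g, b) pixel; each channel moves by at most `d`."""
    return [tuple(quantize_channel(c, d) for c in px) for px in pixels]


# Made-up pixels: three near-identical reds and one blue.
pixels = [(244, 10, 12), (247, 12, 9), (250, 11, 13), (5, 7, 240)]
reduced = quantize_image(pixels, d=4)
```

Here the three reds collapse into one color. Reds that straddle a bucket boundary would not, which is exactly the slack an optimizer could exploit.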
Most efficient solution to a given set of Sudoku puzzles. (excluding brute-force methods)

PID controller effect on a differential driving robot when the parameters (Kp, Ki, and Kd) are increased individually. [full Q written below]

Question: A PID controller has three parameters Kp, Ki and Kd which could affect the output performance. A differential driving robot is controlled by a PID controller. The heading information is sensed by a compass sensor. The moving forward speed is kept constant. The PID controller is able to control the heading information to follow a given direction. Explain the outcome on the differential driving robot performance when the three parameters are increased individually.
This is a question that has come up in a past paper but most likely won't show up this year, but it still worries me. It's the only question that has had me thinking for quite some time. I'd love an answer in simple terms. Most stuff I've read on the internet doesn't make much sense to me, as it goes heavily into detail and is off topic for my case.
My take on this:
I know that the proportional term, Kp, is entirely based on the error and that, let's say, double the error would mean doubling Kp (applying proportional force). This therefore implies that increasing Kp is a result of the robot heading in the wrong direction so Kp is increased to ensure the robot goes on the right direction or at least tries to reduce the error as time passes so an increase in Kp would affect the robot in such a way to adjust the heading of the robot so it stays on the right path.
The derivative term, Kd, is based on the rate of change of the error so an increase in Kd implies that the rate of change of error has increased over time so double the error would result in double the force. An increase by double the change in the robot's heading would take place if the robot's heading is doubled in error from the previous feedback result. Kd causes the robot to react faster as the error increases.
An increase in the integral term, Ki, means that the error is increased over time. The integral accounts for the sum of error over time. Even a small increase in the error would increase the integral so the robot would have to head in the right direction for an equal amount of time for the integral to balance to zero.
I would appreciate a much better answer and it would be great to be confident for a similar upcoming question in the finals.
Side note: I've posted this question on the Robotics part earlier but seeing that the questions there are hardly ever noticed, I've also posted it here.
I would highly recommend reading the article PID Without a PhD; it gives a great explanation along with some implementation details. The best part is the numerous graphs. They show you what changing the P, I, or D term does while holding the others constant.
And if you want a real-world application, Atmel provides example code on their site (for 8-bit MCUs) that mirrors the PID Without a PhD article almost exactly. (Atmel makes the ATmega328P microcontroller chip used on the Arduino UNO boards.) See: PDF explanation and Atmel Code in C
But here is a general explanation the way I understand it.
Proportional: This is a proportional relationship between the error and the output. Something like Kp * (target - actual). It's simply a scaling factor on the error. It decides how quickly the system should react to an error (if it is of any help, you can think of it like amplifier slew rate). A large value will quickly try to fix errors, and a small value will take longer. With higher values, though, we get into an overshoot condition, and that's where the next terms come into play.
Integral: This is meant to account for errors in the past. In fact it is the sum of all past errors. This is often useful for things like a small DC/constant offset that a proportional controller can't fix on its own. Imagine you give a step input of 1, and after a while the output settles at 0.9 and it's clear it's not going anywhere. The integral portion will see this error is always ~0.1 too small, so it will add it back in, to hopefully bring the output closer to 1. This term usually helps to stabilize the response curve. Since it is taken over a long period of time, it should reduce noise and any fast changes (like those found in overshoot/ringing conditions). Because it's aggregate, it is a very sensitive measurement and is usually very small when compared to other terms. A lower value will make changes happen very slowly, and create a very smooth response (this can also cause "wind-up", see the article).
Derivative: This is supposed to account for the "future". It uses the slope of the most recent samples. Remember this is the slope; it has nothing to do with the position error (current - goal), it is previous measured position - current measured position. This is most sensitive to noise, and when it is too high it often causes ringing. A higher value encourages change since we are "amplifying" the slope.
I hope that helps. Maybe someone else can offer another viewpoint, but that's typically how I think about it.
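To see the terms at work on the heading problem from the question, here is a toy simulation. The robot model is an assumption made purely for illustration: the PID output is used directly as the turn rate, and a constant `bias` stands in for a drift such as mismatched wheels. The gains are made up. For a fixed target, the slope of the error equals (previous measurement - current measurement) / dt, so the derivative below matches the description above.

```python
def wrap_angle(deg):
    """Wrap an angle difference into [-180, 180) degrees."""
    return (deg + 180.0) % 360.0 - 180.0


class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, error, dt):
        self.integral += error * dt  # sum of past errors
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(kp, ki, kd, target=90.0, steps=400, dt=0.05, bias=5.0):
    """Toy robot: PID output is the turn rate (deg/s); `bias` is a constant drift."""
    pid = PID(kp, ki, kd)
    heading = 0.0
    for _ in range(steps):
        error = wrap_angle(target - heading)  # compass headings wrap at 360
        turn_rate = pid.update(error, dt)
        heading = (heading + (turn_rate + bias) * dt) % 360.0
    return wrap_angle(target - heading)  # remaining heading error


p_only = simulate(kp=2.0, ki=0.0, kd=0.0)  # settles with a constant offset
with_i = simulate(kp=2.0, ki=1.0, kd=0.0)  # integral term removes the offset
```

With P alone the robot settles about 2.5 degrees off target (drift / Kp), so raising Kp shrinks that offset but never removes it; adding the integral term drives it to zero in this model.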

Resources