Coefficients and significance of lasso/ridge - logistic-regression

I had 628 predictors after forming dummy of all categorical variables. When I ran lot many iterations traditional logistic regression iteration, I came across 15 variables that was giving me pretty good model with good ROC, recall & precision(for certain cut-off) values on test data and also all variables were significant(at p<=0.05). But since it took lot of time, I tried using lasso that gave me 50 non-zero-coefficient variables after taking best lambda value post running 10 fold cross-validation. But only 5 variables were common between 15 variables of traditional method and 50 of lasso. Moreover, when I tried to calculate its SE and t-stats, I figured out that many variables are insignificant(low t-stats and high p-value). In addition to it, the AUC for ROC was less than traditional method.The ROC drops even more when I used traditional logistic regression on 50 variables that were result of lasso. Can someone help me understand the dynamics of it and how I will be able to justify the coefficients of lasso model as they are penalized(I have normalized all the variables before using lasso)?

Related

Viola - jones adaboost method

I have read the paper about Viola-Jones method for object detection and am confused by a few things.
1 - For Adaboost does each round mean that we calculate all the 160k features across all images then find the one with the least error (which as i understand is a 'weak classifier? please correct me if I am wrong). If yes then won't this take extremely long to train on a large set of images which can take months ? And also how many rounds will you run it for if this is correct.
2- If the above point is wrong then does it mean that for each feature, we evaluate all the non-face and face images with one feature and compare it to a certain acceptable threshold of error, and if it lies below this acceptable threshold then we take this feature as a weak classifier and then update the weights before using the next feature from the 160k features.
I have tried understanding the MATLAB code in this link http://www.ece301.com/ml-doc/54-face-detect-matlab-1.html but not sure if the way he implemented adaboost was correct.
It would also be a great help if there was a link that would explain adaboost used by Viola-Jones in a simple and clear way.
AdaBoost tries out multiple weak classifier over several rounds , Selecting the best weak classifier in each round and combining the best classifier to create a strong classifier.
Example for adaboost :
Data point Classifier 1 Classifier 2 Classifier 3
x1 Fail pass fail
x2 Pass fail pass
x3 fail pass pass
x4 pass fail pass
AdaBoost can use classifier that are consistently wrong by reversing their decision .
1) Yes. If your parameters are, say, 5000 positive windows and 5000 negative ones (extracted from the set of negative background images you provided initially), then at each stage every feature is evaluated in turn for all 10000 windows and the best one is added to the feature set.
And yes in the original paper of Viola-Jones each feature is a weak classifier.
The process of checking all features will take a long time, but not weeks. As a matter of fact the bottle neck is the gathering of negative widows over the last stages than could require days or also weeks The number of rounds depends on the reaching of the given conditions. Each stage requires a number of round (usually more in final stages). So for example: stage 1 requires 3 features (3 rounds), stage 2 requires 5 features (5 rounds), stage 3 requires 6 features (6 rounds), etc.
2) The above point is true so it doesn’t apply.

Neural Network Architecture Design

I'm playing around with Neural Networks trying to understand the best practices for designing their architecture based on the kind of problem you need to solve.
I generated a very simple data set composed of a single convex region as you can see below:
Everything works fine when I use an architecture with L = 1, or L = 2 hidden layers (plus the output layer), but as soon as I add a third hidden layer (L = 3) my performance drops down to slightly better than chance.
I know that the more complexity you add to a network (number of weights and parameters to learn) the more you tend to go towards over-fitting your data, but I believe this is not the nature of my problem for two reasons:
my performance on the Training set is also around 60% (whereas over-fitting typically means you have a very low training error and high test error),
and I have a very large amount of data examples (don't look at the figure that's only a toy figure I uplaoded).
Can anybody help me understand why adding an extra hidden layer gives
me this drop in performances on such a simple task?
Here is an image of my performance as a function of the number of layers used:
ADDED PART DUE TO COMMENTS:
I am using a sigmoid functions assuming values between 0 and 1, L(s) = 1 / 1 + exp(-s)
I am using early stopping (after 40000 iterations of backprop) as a criteria to stop the learning. I know it is not the best way to stop but I thought that it would ok for such a simple classification task, if you believe this is the main reason I'm not converging I I might implement some better criteria.
At least on the surface of it, this appears to be a case of the so-called "vanishing gradient" problem.
Activation functions
Your neurons activate according to the logistic sigmoid function, f(x) = 1 / (1 + e^-x) :
This activation function is used frequently because it has several nice properties. One of these nice properties is that the derivative of f(x) is expressible computationally using the value of the function itself, as f'(x) = f(x)(1 - f(x)). This function has a nonzero value for x near zero, but quickly goes to zero as |x| gets large :
Gradient descent
In a feedforward neural network with logistic activations, the error is typically propagated backwards through the network using the first derivative as a learning signal. The usual update for a weight in your network is proportional to the error attributable to that weight times the current weight value times the derivative of the logistic function.
delta_w(w) ~= w * f'(err(w)) * err(w)
As the product of three potentially very small values, the first derivative in such networks can become small very rapidly if the weights in the network fall outside the "middle" regime of the logistic function's derivative. In addition, this rapidly vanishing derivative becomes exacerbated by adding more layers, because the error in a layer gets "split up" and partitioned out to each unit in the layer. This, in turn, further reduces the gradient in layers below that.
In networks with more than, say, two hidden layers, this can become a serious problem for training the network, since the first-order gradient information will lead you to believe that the weights cannot usefully change.
However, there are some solutions that can help ! The ones I can think of involve changing your learning method to use something more sophisticated than first-order gradient descent, generally incorporating some second-order derivative information.
Momentum
The simplest solution to approximate using some second-order information is to include a momentum term in your network parameter updates. Instead of updating parameters using :
w_new = w_old - learning_rate * delta_w(w_old)
incorporate a momentum term :
w_dir_new = mu * w_dir_old - learning_rate * delta_w(w_old)
w_new = w_old + w_dir_new
Intuitively, you want to use information from past derivatives to help determine whether you want to follow the new derivative entirely (which you can do by setting mu = 0), or to keep going in the direction you were heading on the previous update, tempered by the new gradient information (by setting mu > 0).
You can actually get even better than this by using "Nesterov's Accelerated Gradient" :
w_dir_new = mu * w_dir_old - learning_rate * delta_w(w_old + mu * w_dir_old)
w_new = w_old + w_dir_new
I think the idea here is that instead of computing the derivative at the "old" parameter value w, compute it at what would be the "new" setting for w if you went ahead and moved there according to a standard momentum term. Read more in a neural-networks context here (PDF).
Hessian-Free
The textbook way to incorporate second-order gradient information into your neural network training algorithm is to use Newton's Method to compute the first and second order derivatives of your objective function with respect to the parameters. However, the second order derivative, called the Hessian matrix, is often extremely large and prohibitively expensive to compute.
Instead of computing the entire Hessian, some clever research in the past few years has indicated a way to compute just the values of the Hessian in a particular search direction. You can then use this process to identify a better parameter update than just the first-order gradient.
You can learn more about this by reading through a research paper (PDF) or looking at a sample implementation.
Others
There are many other optimization methods that could be useful for this task -- conjugate gradient (PDF -- definitely worth a read), Levenberg-Marquardt (PDF), L-BFGS -- but from what I've seen in the research literature, momentum and Hessian-free methods seem to be the most common ones.
Because the number of iterations of training required for convergence increases as you add complexity to a neural network, holding the length of training constant while adding layers to a neural network will certainly result in you eventually observing a drop like this. To figure out whether that is the explanation for this particular observation, try increasing the number of iterations of training that you're using and see if it improves. Using a more intelligent stopping criterion is also a good option, but a simple increase in the cut-off will give you answers faster.

Getting win percentages for Texas hold'em poker without monte carlo/exhaustive enumeration

Sorry I'm just starting this project and don't have any ideas or code, I'm asking more of a theoretical question than a programming one.
It seems that every google search provides the same responses and it's very hard to find an answer to this question:
Is there a way to calculate win percentages for texas holdem poker (the same way they do on poker after dark or other televised poker events) without using the monte carlo/exhaustive enumeration methods. Assuming all cards are face up and we know every card in the deck.
Every response on other forums just seems to be "use pokerstove" or something similar, I'm looking for the theory to write the code.
Thanks.
Is there a way to calculate win percentages for texas holdem poker
(the same way they do on poker after dark or other televised poker
events) without using the monte carlo/exhaustive enumeration methods.
In specific instances it is possible...
You can use a perfect lookup table preflop for two players heads-up preflop matchups: note that the "typical" 169 vs 169 approximation ain't good enough (say Jh Th vs 9h 8h ain't really "JTs vs 98s": I mean, that would quite a gross approximation).
Besides that if you have a lot of memory and if you can live with gigantic cache misses, you technically could precompute gigantic lookup tables (say on the server side) and do lookups for other cases (e.g. for every possible three players all-in matchups preflop), but you'd really need a lot of memory : )
Note that "full enumeration" at flop and turn ain't an issue: at flop there's only 2 more cards to come, so there are typically only C(45,2) [two players all-in at flop, we know 2*2 holecards + 3 community cards -- hence leaving 990 possibilities] or C(43,2) [three players all-in at at flop, we know 3*2 holecards + 3 community cards].
So an actual evaluator would not use one but several methods. For example:
lookup table for two players all-in preflop (the fastest)
full enum for any number of players all-in at flop or turn because it's tiny (max 990 possibilities) -- very fast
monte-carlo or full enum for three players or more all-in preflop -- incredibly slower
It is interesting to see here that in the most typical cases you'll get the result very, very fast: most actual all-ins involve two players, not three or more.
So you're either looking up in a "1 vs 1 preflop" lookup table or doing full C(45,2) or C(46,1) full enum (which are, in both case, amazingly fast).
It's really only the "three players or more all-in preflop" case which do take time.
The answer is no.
There is no closed form computation that you can do to compute poker equities. Using combinatorics, you can identify and solve many subproblems, which speeds up computation.
For example, if you are considering all five card hands, there are 52 choose 5 = 2,598,960 different hands. But knowing that suits are equivalent and using combinatorial methods (either analytic or computational), you can reduce the space of all hands to 134,459 classes each weighted according to the number of different hands in each equivalent class.
There are also various ways of using exhaustive evaluations tailored to your application. If you need to perform some subset of evaluations repeatedly, you can use caches or precomputed lookup tables targeted to your specific needs.

Help--100% accuracy with LibSVM?

Nominally a good problem to have, but I'm pretty sure it is because something funny is going on...
As context, I'm working on a problem in the facial expression/recognition space, so getting 100% accuracy seems incredibly implausible (not that it would be plausible in most applications...). I'm guessing there is either some consistent bias in the data set that it making it overly easy for an SVM to pull out the answer, =or=, more likely, I've done something wrong on the SVM side.
I'm looking for suggestions to help understand what is going on--is it me (=my usage of LibSVM)? Or is it the data?
The details:
About ~2500 labeled data vectors/instances (transformed video frames of individuals--<20 individual persons total), binary classification problem. ~900 features/instance. Unbalanced data set at about a 1:4 ratio.
Ran subset.py to separate the data into test (500 instances) and train (remaining).
Ran "svm-train -t 0 ". (Note: apparently no need for '-w1 1 -w-1 4'...)
Ran svm-predict on the test file. Accuracy=100%!
Things tried:
Checked about 10 times over that I'm not training & testing on the same data files, through some inadvertent command-line argument error
re-ran subset.py (even with -s 1) multiple times and did train/test only multiple different data sets (in case I randomly upon the most magical train/test pa
ran a simple diff-like check to confirm that the test file is not a subset of the training data
svm-scale on the data has no effect on accuracy (accuracy=100%). (Although the number of support vectors does drop from nSV=127, bSV=64 to nBSV=72, bSV=0.)
((weird)) using the default RBF kernel (vice linear -- i.e., removing '-t 0') results in accuracy going to garbage(?!)
(sanity check) running svm-predict using a model trained on a scaled data set against an unscaled data set results in accuracy = 80% (i.e., it always guesses the dominant class). This is strictly a sanity check to make sure that somehow svm-predict is nominally acting right on my machine.
Tentative conclusion?:
Something with the data is wacked--somehow, within the data set, there is a subtle, experimenter-driven effect that the SVM is picking up on.
(This doesn't, on first pass, explain why the RBF kernel gives garbage results, however.)
Would greatly appreciate any suggestions on a) how to fix my usage of LibSVM (if that is actually the problem) or b) determine what subtle experimenter-bias in the data LibSVM is picking up on.
Two other ideas:
Make sure you're not training and testing on the same data. This sounds kind of dumb, but in computer vision applications you should take care that: make sure you're not repeating data (say two frames of the same video fall on different folds), you're not training and testing on the same individual, etc. It is more subtle than it sounds.
Make sure you search for gamma and C parameters for the RBF kernel. There are good theoretical (asymptotic) results that justify that a linear classifier is just a degenerate RBF classifier. So you should just look for a good (C, gamma) pair.
Notwithstanding that the devil is in the details, here are three simple tests you could try:
Quickie (~2 minutes): Run the data through a decision tree algorithm. This is available in Matlab via classregtree, or you can load into R and use rpart. This could tell you if one or just a few features happen to give a perfect separation.
Not-so-quickie (~10-60 minutes, depending on your infrastructure): Iteratively split the features (i.e. from 900 to 2 sets of 450), train, and test. If one of the subsets gives you perfect classification, split it again. It would take fewer than 10 such splits to find out where the problem variables are. If it happens to "break" with many variables remaining (or even in the first split), select a different random subset of features, shave off fewer variables at a time, etc. It can't possibly need all 900 to split the data.
Deeper analysis (minutes to several hours): try permutations of labels. If you can permute all of them and still get perfect separation, you have some problem in your train/test setup. If you select increasingly larger subsets to permute (or, if going in the other direction, to leave static), you can see where you begin to lose separability. Alternatively, consider decreasing your training set size and if you get separability even with a very small training set, then something is weird.
Method #1 is fast & should be insightful. There are some other methods I could recommend, but #1 and #2 are easy and it would be odd if they don't give any insights.

What is fuzzy logic?

I'm working with a couple of AI algorithms at school and I find people use the words Fuzzy Logic to explain any situation that they can solve with a couple of cases. When I go back to the books I just read about how instead of a state going from On to Off it's a diagonal line and something can be in both states but in different "levels".
I've read the wikipedia entry and a couple of tutorials and even programmed stuff that "uses fuzzy logic" (an edge detector and a 1-wheel self-controlled robot) and still I find it very confusing going from Theory to Code... for you, in the less complicated definition, what is fuzzy logic?
Fuzzy logic is logic where state membership is, essentially, a float with range 0..1 instead of an int 0 or 1. The mileage you get out of it is that things like, for example, the changes you make in a control system are somewhat naturally more fine-tuned than what you'd get with naive binary logic.
An example might be logic that throttles back system activity based on active TCP connections. Say you define "a little bit too many" TCP connections on your machine as 1000 and "a lot too many" as 2000. At any given time, your system has a "too many TCP connections" state from 0 (<= 1000) to 1 (>= 2000), which you can use as a coefficient in applying whatever throttling mechanisms you have available. This is much more forgiving and responsive to system behavior than naive binary logic that only knows how to determine "too many", and throttle completely, or "not too many", and not throttle at all.
I'd like to add to the answers (that have been modded up) that, a good way to visualize fuzzy logic is follows:
Traditionally, with binary logic you would have a graph whose membership function is true or false whereas in a fuzzy logic system, the membership function is not.
1|
| /\
| / \
| / \
0|/ \
------------
a b c d
Assume for a second that the function is "likes peanuts"
a. kinda likes peanuts
b. really likes peanuts
c. kinda likes peanuts
d. doesn't like peanuts
The function itself doesn't have to be triangular and often isn't (it's just easier with ascii art).
A fuzzy system will likely have many of these, some even overlapping (even opposites) like so:
1| A B
| /\ /\ A = Likes Peanuts
| / \/ \ B = Doesn't Like Peanuts
| / /\ \
0|/ / \ \
------------
a b c d
so now c is "kind likes peanuts, kinda doesn't like peanuts" and d is "really doesn't like peanuts"
And you can program accordingly based on that info.
Hope this helps for the visual learners out there.
The best definition of fuzzy logic is given by its inventor Lotfi Zadeh:
“Fuzzy logic means of representing problems to computers in a way akin to the way human solve them and the essence of fuzzy logic is that everything is a matter of degree.”
The meaning of solving problems with computers akin to the way human solve can easily be explained with a simple example from a basketball game; if a player wants to guard another player firstly he should consider how tall he is and how his playing skills are. Simply if the player that he wants to guard is tall and plays very slow relative to him then he will use his instinct to determine to consider if he should guard that player as there is an uncertainty for him. In this example the important point is the properties are relative to the player and there is a degree for the height and playing skill for the rival player. Fuzzy logic provides a deterministic way for this uncertain situation.
There are some steps to process the fuzzy logic (Figure-1). These steps are; firstly fuzzification where crisp inputs get converted to fuzzy inputs secondly these inputs get processed with fuzzy rules to create fuzzy output and lastly defuzzification which results with degree of result as in fuzzy logic there can be more than one result with different degrees.
Figure 1 – Fuzzy Process Steps (David M. Bourg P.192)
To exemplify the fuzzy process steps, the previous basketball game situation could be used. As mentioned in the example the rival player is tall with 1.87 meters which is quite tall relative to our player and can dribble with 3 m/s which is slow relative to our player. Addition to these data some rules are needed to consider which are called fuzzy rules such as;
if player is short but not fast then guard,
if player is fast but not short then don’t guard
If player is tall then don’t guard
If player is average tall and average fast guard
Figure 2 – how tall
Figure 3- how fast
According to the rules and the input data an output will be created by fuzzy system such as; the degree for guard is 0.7, degree for sometimes guard is 0.4 and never guard is 0.2.
Figure 4-output fuzzy sets
On the last step, defuzzication, is using for creating a crisp output which is a number which may determine the energy that we should use to guard the player during game. The centre of mass is a common method to create the output. On this phase the weights to calculate the mean point is totally depends on the implementation. On this application it is considered to give high weight to guard or not guard but low weight given to sometimes guard. (David M. Bourg, 2004)
Figure 5- fuzzy output (David M. Bourg P.204)
Output = [0.7 * (-10) + 0.4 * 1 + 0.2 * 10] / (0.7 + 0.4 + 0.2) ≈ -3.5
As a result fuzzy logic is using under uncertainty to make a decision and to find out the degree of decision. The problem of fuzzy logic is as the number of inputs increase the number of rules increase exponential.
For more information and its possible application in a game I wrote a little article check this out
To build off of chaos' answer, a formal logic is nothing but an inductively defined set that maps sentences to a valuation. At least, that's how a model theorist thinks of logic. In the case of a sentential boolean logic:
(basis clause) For all A, v(A) in {0,1}
(iterative) For the following connectives,
v(!A) = 1 - v(A)
v(A & B) = min{v(A), v(B)}
v(A | B) = max{v(A), v(B)}
(closure) All sentences in a boolean sentential logic are evaluated per above.
A fuzzy logic changes would be inductively defined:
(basis clause) For all A, v(A) between [0,1]
(iterative) For the following connectives,
v(!A) = 1 - v(A)
v(A & B) = min{v(A), v(B)}
v(A | B) = max{v(A), v(B)}
(closure) All sentences in a fuzzy sentential logic are evaluated per above.
Notice the only difference in the underlying logic is the permission to evaluate a sentence as having the "truth value" of 0.5. An important question for a fuzzy logic model is the threshold that counts for truth satisfaction. This is to ask: for a valuation v(A), for what value D it is the case the v(A) > D means that A is satisfied.
If you really want to found out more about non-classical logics like fuzzy logic, I would recommend either An Introduction to Non-Classical Logic: From If to Is or Possibilities and Paradox
Putting my coder hat back on, I would be careful with the use of fuzzy logic in real world programming, because of the tendency for a fuzzy logic to be undecidable. Maybe it's too much complexity for little gain. For instance a supervaluational logic may do just fine to help a program model vagueness. Or maybe probability would be good enough. In short, I need to be convinced that the domain model dovetails with a fuzzy logic.
Maybe an example clears up what the benefits can be:
Let's say you want to make a thermostat and you want it to be 24 degrees.
This is how you'd implement it using boolean logic:
Rule1: heat up at full power when
it's colder than 21 degrees.
Rule2:
cool down at full power when it's
warmer than 27 degrees.
Such a system will only once and a while be 24 degrees, and it will be very inefficient.
Now, using fuzzy logic, it would be like something like this:
Rule1: For each degree that it's colder than 24 degrees, turn up the heater one notch (0 at 24).
Rule2: For each degree that it's warmer than 24 degress, turn up the cooler one notch (0 at 24).
This system will always be somewhere around 24 degrees, and it only once and will only once and a while make a tiny adjustment. It will also be more energy-efficient.
Well, you could read the works of Bart Kosko, one of the 'founding fathers'. 'Fuzzy Thinking: The New Science of Fuzzy Logic' from 1994 is readable (and available quite cheaply secondhand via Amazon). Apparently, he has a newer book 'Noise' from 2006 which is also quite approachable.
Basically though (in my paraphrase - not having read the first of those books for several years now), fuzzy logic is about how to deal with the world where something is perhaps 10% cool, 50% warm, and 10% hot, where different decisions may be made on the degree to which the different states are true (and no, it wasn't entirely an accident that those percentages don't add up to 100% - though I'd accept correction if needed).
A very good explanation, with a help of Fuzzy Logic Washing Machines.
I know what you mean about it being difficult to go from concept to code. I'm writing a scoring system that looks at the values of sysinfo and /proc on Linux systems and comes up with a number between 0 and 10, 10 being the absolute worst. A simple example:
You have 3 load averages (1, 5, 15 minute) with (at least) three possible states, good, getting bad, bad. Expanding that, you could have six possible states per average, adding 'about to' to the three that I just noted. Yet, the result of all 18 possibilities can only deduct 1 from the score. Repeat that with swap consumed, actual VM allocated (committed) memory and other stuff .. and you have one big bowl of conditional spaghetti :)
Its as much a definition as it is an art, how you implement the decision making process is always more interesting than the paradigm itself .. whereas in a boolean world, its rather cut and dry.
It would be very easy for me to say if load1 < 2 deduct 1, but not very accurate at all.
If you can teach a program to do what you would do when evaluating some set of circumstances and keep the code readable, you have implemented a good example of fuzzy logic.
Fuzzy Logic is a problem-solving methodology that lends itself to implementation in systems ranging from simple, small, embedded micro-controllers to large, networked, multi-channel PC or workstation-based data acquisition and control systems. It can be implemented in hardware, software, or a combination of both. Fuzzy Logic provides a simple way to arrive at a definite conclusion based upon vague, ambiguous, imprecise, noisy, or missing input information. Fuzzy Logic approach to control problems mimics how a person would make decisions, only much faster.
Fuzzy logic has proved to be particularly useful in expert system and other artificial intelligence applications. It is also used in some spell checkers to suggest a list of probable words to replace a misspelled one.
To learn more, just check out: http://en.wikipedia.org/wiki/Fuzzy_logic.
The following is sort of an empirical answer.
A simple (possibly simplistic answer) is that "fuzzy logic" is any logic that returns values other than straight true / false, or 1 / 0. There are a lot of variations on this and they tend to be highly domain specific.
For example, in my previous life I did search engines that used "content similarity searching" as opposed to then common "boolean search". Our similarity system used the Cosine Coefficient of weighted-attribute vectors representing the query and the documents and produced values in the range 0..1. Users would supply "relevance feedback" which was used to shift the query vector in the direction of desirable documents. This is somewhat related to the training done in certain AI systems where the logic gets "rewarded" or "punished" for results of trial runs.
Right now Netflix is running a competition to find a better suggestion algorithm for their company. See http://www.netflixprize.com/. Effectively all of the algorithms could be characterized as "fuzzy logic"
Fuzzy logic is calculating algorithm based on human like way of thinking. It is particularly useful when there is a large number of input variables. One online fuzzy logic calculator for two variables input is given:
http://www.cirvirlab.com/simulation/fuzzy_logic_calculator.php

Resources