Logistic regression using glm give strange predictions, only very small probabilities

Logistic regression using glm give strange predictions, only very small probabilities - logistic-regression

I want to regress a binary variable of 0 and 1 onto a continuous variable ranging from 0 to 120. So I fit glm(y~x,family='binomial') and make predictions using predict(fit,newdata=x,type=c('response')), and plotting these gives
where the black dots are observations and red dots are predicted probabilities. I don't understand why the curve does not go down from 1 but only produce probabilities from 0 to 0.05, is there anything I have done wrong or is it just because there are too many 0 observations?
One more thing is that the predictor variable is a positive variable, so extending the axis to the left does not really give sensible predictions. Is there a way to resolve this?
Here's the summary table for observed responses:
> table(test_y)
test_y
0 1
9778 462

Related

Correctness of logistic regression in Vowpal Wabbit?

I have started using Vowpal Wabbit for logistic regression, however I am unable to reproduce the results it gives. Perhaps there is some undocumented "magic" it does, but has anyone been able to replicate / verify / check the calculations for logistic regression?
For example, with the simple data below, we aim to model the way age predicts label. It is obvious there is a strong relationship as when age increases the probability of observing 1 increases.
As a simple unit test, I used the 12 rows of data below:
age label
20 0
25 0
30 0
35 0
40 0
50 0
60 1
65 0
70 1
75 1
77 1
80 1
Now, performing a logistic regression on this dataset, using R , SPSS or even by hand, produces a model which looks like L = 0.2294*age - 14.08. So if I substitude the age, and use the logit transform prob=1/(1+EXP(-L)) I can obtain the predicted probabilities which range from 0.0001 for the first row, to 0.9864 for the last row, as reasonably expected.
If I plug in the same data in Vowpal Wabbit,
-1 'P1 |f age:20
-1 'P2 |f age:25
-1 'P3 |f age:30
-1 'P4 |f age:35
-1 'P5 |f age:40
-1 'P6 |f age:50
1 'P7 |f age:60
-1 'P8 |f age:65
1 'P9 |f age:70
1 'P10 |f age:75
1 'P11 |f age:77
1 'P12 |f age:80
And then perform a logistic regression using
vw -d data.txt -f demo_model.vw --loss_function logistic --invert_hash aaa
(command line consistent with How to perform logistic regression using vowpal wabbit on very imbalanced dataset ) , I obtain a model L= -0.00094*age - 0.03857 , which is very different.
The predicted values obtained using -r or -p further confirm this. The resulting probabilities end up nearly all the same, for example 0.4857 for age=20, and 0.4716 for age=80, which is extremely off.
I have noticed this inconsistency with larger datasets too. In what sense is Vowpal Wabbit carrying out the logistic regression differently, and how are the results to be interpreted?

This is a common misunderstanding of vowpal wabbit.
One cannot compare batch learning with online learning.
vowpal wabbit is not a batch learner. It is an online learner. Online learners learn by looking at examples one at a time and slightly adjusting the weights of the model as they go.
There are advantages and disadvantages to online learning. The downside is that convergence to the final model is slow/gradual. The learner doesn't do a "perfect" job at extracting information from each example, because the process is iterative. Convergence on a final result is deliberately restrained/slow. This can make online learners appear weak on tiny data-sets like the above.
There are several upsides though:
Online learners don't need to load the full data into memory (they work by examining one example at a time and adjusting the model based on the real-time observed per-example loss) so they can scale easily to billions of examples. A 2011 paper by 4 Yahoo! researchers describes how vowpal wabbit was used to learn from a tera (10^12) feature data-set in 1 hour on 1k nodes. Users regularly use vw to learn from billions of examples data-sets on their desktops and laptops.
Online learning is adaptive and can track changes in conditions over time, so it can learn from non-stationary data, like learning against an adaptive adversary.
Learning introspection: one can observe loss convergence rates while training and identify specific issues, and even gain significant insights from specific data-set examples or features.
Online learners can learn in an incremental fashion so users can intermix labeled and unlabeled examples to keep learning while predicting at the same time.
The estimated error, even during training, is always "out-of-sample" which is a good estimate of the test error. There's no need to split the data into train and test subsets or perform N-way cross-validation. The next (yet unseen) example is always used as a hold-out. This is a tremendous advantage over batch methods from the operational aspect. It greatly simplifies the typical machine-learning process. In addition, as long as you don't run multiple-passes over the data, it serves as a great over-fitting avoidance mechanism.
Online learners are very sensitive to example order. The worst possible order for an online learner is when classes are clustered together (all, or almost all, -1s appear first, followed by all 1s) like the example above does. So the first thing to do to get better results from an online learner like vowpal wabbit, is to uniformly shuffle the 1s and -1s (or simply order by time, as the examples typically appear in real-life).
OK now what?
Q: Is there any way to produce a reasonable model in the sense that it gives reasonable predictions on small data when using an online learner?
A: Yes, there is!
You can emulate what a batch learner does more closely, by taking two simple steps:
Uniformly shuffle 1 and -1 examples.
Run multiple passes over the data to give the learner a chance to converge
Caveat: if you run multiple passes until error goes to 0, there's a danger of over-fitting. The online learner has perfectly learned your examples, but it may not generalize well to unseen data.
The second issue here is that the predictions vw gives are not logistic-function transformed (this is unfortunate). They are akin to standard deviations from the middle point (truncated at [-50, 50]). You need to pipe the predictions via utl/logistic (in the source tree) to get signed probabilities. Note that these signed probabilities are in the range [-1, +1] rather than [0, 1]. You may use logistic -0 instead of logistic to map them to a [0, 1] range.
So given the above, here's a recipe that should give you more expected results:
# Train:
vw train.vw -c --passes 1000 -f model.vw --loss_function logistic --holdout_off
# Predict on train set (just as a sanity check) using the just generated model:
vw -t -i model.vw train.vw -p /dev/stdout | logistic | sort -tP -n -k 2
Giving this more expected result on your data-set:
-0.95674145247658 P1
-0.930208359811439 P2
-0.888329575506748 P3
-0.823617739247262 P4
-0.726830630992614 P5
-0.405323815830325 P6
0.0618902961794472 P7
0.298575998150221 P8
0.503468453150847 P9
0.663996516371277 P10
0.715480084449868 P11
0.780212725426778 P12
You could make the results more/less polarized (closer to 1 on the older ages and closer to -1 on the younger) by increasing/decreasing the number of passes. You may also be interested in the following options for training:
--max_prediction <arg> sets the max prediction to <arg>
--min_prediction <arg> sets the min prediction to <arg>
-l <arg> set learning rate to <arg>
For example, by increasing the learning rate from the default 0.5 to a large number (e.g. 10) you can force vw to converge much faster when training on small data-sets thus requiring less passes to get there.
Update
As of mid 2014, vw no longer requires the external logistic utility to map predictions back to [0,1] range. A new --link logistic option maps predictions to the logistic function [0, 1] range. Similarly --link glf1 maps predictions to a generalized logistic function [-1, 1] range.

How would I organise a clustered set of 2D coordinates into groups of close together sets?

I have a large amount of 2D sets of coordinates on a 6000x6000 plane (2116 sets), available here: http://pastebin.com/kiMQi7yu (the context isn't really important so I just pasted the raw data).
I need to write an algorithm to group together coordinates that are close to each other by some threshold. The coordinates in my list are already in groups on that plane, but the order is very scattered.
Despite this task being rather brain-melting to me at first, I didn't admit defeat instantly; this is what I tried:
First sort the list by the Y value, then sort it by the X value. Run through the list checking the distance between the current set and the previous. If they are close enough (100 units) then add them to the same group.
This method didn't really work out (as I expected). There are still objects that are pretty close that are in different groups, because I'm only comparing the next set in the list and the list is sorted by the X position.
I'm out of ideas! The language I'm using is C but I suppose that's not really relevant since all I need is an idea for how the algorithm should work. Thanks!

Though I haven't looked at the data set, it seems that you already know how many groups there are. Have you considered using k means? http://en.m.wikipedia.org/wiki/K-means_clustering

I'm just thinking this along while I write.
Tile the "arena" with squares that have the diameter of your distance (200) as their diagonal.
If there are any points within a square (x,y), they are tentatively part of Cluster(x,y).
Within each square (x,y), there are (up to) 4 areas where the circles of Cluster(x-1,y), Cluster(x+1,y), Cluster(x, y-1) and Cluster(x,y+1) overlap "into" the square; of these consider only those Clusters that are tentatively non-empty.
If all points of Cluster(x,y) are in the (up to 4) overlapping segments of non-empty neighbouring clusters: reallocate these points to the pertaining Cluster and remove Cluster(x,y) from the set of non-empty Clusters.
Added later: Regarding 3., the set of points to be investigated for one neighbour can be coarsely but quickly (!) determined by looking at the rectangle enclosing the segment. [End of addition]
This is just an idea - I can't claim that I've ever done anything remotely like this.

A simple, often used method for spatially grouping points, is to calculate the distance between each unique pair of points. If the distance does not exceed some predefined limit, then the points belong to the same group.
One way to think about this algorithm, is to consider each point as a limit-diameter ball (made of soft foam, so that balls can intersect each other). All balls that are in contact belong to the same group.
In practice, you calculate the squared distance, (x2 - x1)2 + (y2 - y1)2, to avoid the relatively slow square root operation. (Just remember to square the limit, too.)
To track which group each point belongs to, a disjoint-set data structure is used.
If you have many points (a few thousand is not many), you can use partitioning or other methods to limit the number of pairs to consider. Partitioning is probably the most used, as it is very simple to implement: just divide the space into squares of limit size, and then you only need to consider points within each square, and between points in neighboring squares.
I wrote a small awk script to find the groups (no partitioning, about 84 lines or awk code, also numbers the groups consecutively from 1 onwards, and outputs each input point, the group number, and the number of points in each group). Here's the results summarized:
Limit Singles Pairs Triplets Clusters (of four or more points)
1.0 1313 290 29 24
2.0 1062 234 50 52
3.0 904 179 53 75
4.0 767 174 55 81
5.0 638 173 52 84
10.0 272 99 41 99
20.0 66 20 8 68
50.0 21 11 3 39
100.0 13 6 2 29
200.0 6 5 0 23
300.0 3 1 0 20
400.0 1 0 0 18
500.0 0 0 0 15
where Limit is the maximum distance at which the points are considered to belong to the same group.
If the data set is very detailed, you can have intertwined but separate groups. You can easily have a separate group in the hole of a donut-shaped group (or hollow ball in 3D). This is important to remember, so you don't make wrong assumptions on how the groups are separated.
Questions?

You can use a space-filling-curve, I.e a z curve a.k.a morton curve. Basically you translate x-and y value to binary and then concatenate th,e coordinates. The spatial index puts together close coordinates. You can verify it with the upper bounds and the mostsignificant bits.

Using genetic programming to estimate probability

I would like to use a genetic program (gp) to estimate the probability of an 'outcome' from an 'event'. To train the nn I am using a genetic algorithm.
So, in my database I have many events, with each event containing many possible outcomes.
I will give the gp a set of input variables that relate to each outcome in each event.
My questions is - what should the fitness function be in the gp be ????
For instance, right now I am giving the gp a set of input data (outcome input variables), and a set of target data (1 if outcome DID occur, 0 if outcome DIDN'T occur, with the fitness function being the mean squared error of the outputs and targets). I then take the sum of each output for each outcome, and divide each output by the sum (to give the probability). However, I know for sure that this is not the right way to be doing this.
For clarity, this is how I am CURRENTLY doing this:
I would like to estimate the probability of 5 different outcomes occurring in an event:
Outcome 1 - inputs = [0.1, 0.2, 0.1, 0.4]
Outcome 1 - inputs = [0.1, 0.3, 0.1, 0.3]
Outcome 1 - inputs = [0.5, 0.6, 0.2, 0.1]
Outcome 1 - inputs = [0.9, 0.2, 0.1, 0.3]
Outcome 1 - inputs = [0.9, 0.2, 0.9, 0.2]
I will then calculate the gp output for each input:
Outcome 1 - output = 0.1
Outcome 1 - output = 0.7
Outcome 1 - output = 0.2
Outcome 1 - output = 0.4
Outcome 1 - output = 0.4
The sum of the outputs for each outcome in this event would be: 1.80. I would then calculate the 'probability' of each outcome by dividing the output by the sum:
Outcome 1 - p = 0.055
Outcome 1 - p = 0.388
Outcome 1 - p = 0.111
Outcome 1 - p = 0.222
Outcome 1 - p = 0.222
Before you start - I know that these aren't real probabilities, and that this approach does not work !! I just put this here to help you understand what I am trying to achieve.
Can anyone give me some pointers on how I can estimate the probability of each outcome ? (also, please note my maths is not great)
Many thanks

I understand the first part of your question: What you described is a classification problem. You're learning if your inputs relate to whether an outcome was observed (1) or not (0).
There are difficulties with the second part though. If I understand you correctly you take the raw GP output for a certain row of inputs (e.g. 0.7) and treat it as a probability. You said this doesn't work, obviously. In GP you can do classification by introducing a threshold value that splits your classes. If it's bigger than say 0.3 the outcome should be 1 if it's smaller it should be 0. This threshold isn't necessarily 0.5 (again it's just a number, not a probability).
I think if you want to obtain a probability you should attempt to learn multiple models that all explain your classification problem well. I don't expect you have a perfect model that explains your data perfectly, respectively if you have you wouldn't want a probability anyway. You can bag these models together (create an ensemble) and for each outcome you can observe how many models predicted 1 and how many models predicted 0. The amount of models that predicted 1 divided by the number of models could then be interpreted as a probability that this outcome will be observed. If the models are all equally good then you can forget weighing between them, if they're different in quality of course you could factor these into your decision. Models with less quality on their training set are less likely to contribute to a good estimate.
So in summary you should attempt to apply GP e.g. 10 times and then use all 10 models on the training set to calculate their estimate (0 or 1). However, don't force yourself to GP only, there are many classification algorithms that can give good results.
As a sidenote, I'm part of the development team of a software called HeuristicLab which runs under Windows and with which you can run GP and create such ensembles. The software is open source.

AI is all about complex algorithms. Think about it, the downside is very often, that these algorithms become black boxes. So the counterside to algoritms, such as NN and GA, are they are inherently opaque. That is what you want if you want to have a car driving itself. On the other hand this means, that you need tools to look into the black box.
What I'm saying is that GA is probably not what you want to solve your problem. If you want to solve AI types of problems, you first have to know how to use standard techniques, such as regression, LDA etc.
So, combining NN and GA is usually a bad sign, because you are stacking one black box on another. I believe this is bad design. An NN and GA are nothing else than non-linear optimizers. I would suggest to you to look at principal component analysis (PDA), SVD and linear classifiers first (see wikipedia). If you figure out to solve simple statistical problems move on to more complex ones. Check out the great textbook by Russell/Norvig, read some of their source code.
To answer the questions one really has to look at the dataset extensively. If you are working on a small problem, define the probabilities etc., and you might get an answer here. Perhaps check out Bayesian statistics as well. This will get you started I believe.

Is the Leptonica implementation of 'Modified Median Cut' not using the median at all?

I'm playing around a bit with image processing and decided to read up on how color quantization worked and after a bit of reading I found the Modified Median Cut Quantization algorithm.
I've been reading the code of the C implementation in Leptonica library and came across something I thought was a bit odd.
Now I want to stress that I am far from an expert in this area, not am I a math-head, so I am predicting that this all comes down to me not understanding all of it and not that the implementation of the algorithm is wrong at all.
The algorithm states that the vbox should be split along the lagest axis and that it should be split using the following logic
The largest axis is divided by locating the bin with the median pixel
(by population), selecting the longer side, and dividing in the center
of that side. We could have simply put the bin with the median pixel
in the shorter side, but in the early stages of subdivision, this
tends to put low density clusters (that are not considered in the
subdivision) in the same vbox as part of a high density cluster that
will outvote it in median vbox color, even with future median-based
subdivisions. The algorithm used here is particularly important in
early subdivisions, and 3is useful for giving visible but low
population color clusters their own vbox. This has little effect on
the subdivision of high density clusters, which ultimately will have
roughly equal population in their vboxes.
For the sake of the argument, let's assume that we have a vbox that we are in the process of splitting and that the red axis is the largest. In the Leptonica algorithm, on line 01297, the code appears to do the following
Iterate over all the possible green and blue variations of the red color
For each iteration it adds to the total number of pixels (population) it's found along the red axis
For each red color it sum up the population of the current red and the previous ones, thus storing an accumulated value, for each red
note: when I say 'red' I mean each point along the axis that is covered by the iteration, the actual color may not be red but contains a certain amount of red
So for the sake of illustration, assume we have 9 "bins" along the red axis and that they have the following populations
4 8 20 16 1 9 12 8 8
After the iteration of all red bins, the partialsum array will contain the following count for the bins mentioned above
4 12 32 48 49 58 70 78 86
And total would have a value of 86
Once that's done it's time to perform the actual median cut and for the red axis this is performed on line 01346
It iterates over bins and check they accumulated sum. And here's the part that throws me of from the description of the algorithm. It looks for the first bin that has a value that is greater than total/2
Wouldn't total/2 mean that it is looking for a bin that has a value that is greater than the average value and not the median ? The median for the above bins would be 49
The use of 43 or 49 could potentially have a huge impact on how the boxes are split, even though the algorithm then proceeds by moving to the center of the larger side of where the matched value was..
Another thing that puzzles me a bit is that the paper specified that the bin with the median value should be located, but does not mention how to proceed if there are an even number of bins.. the median would be the result of (a+b)/2 and it's not guaranteed that any of the bins contains that population count. So this is what makes me thing that there are some approximations going on that are negligible because of how the split actually takes part at the center of the larger side of the selected bin.
Sorry if it got a bit long winded, but I wanted to be as thoroughas I could because it's been driving me nuts for a couple of days now ;)

In the 9-bin example, 49 is the number of pixels in the first 5 bins. 49 is the median number in the set of 9 partial sums, but we want the median pixel in the set of 86 pixels, which is 43 (or 44), and it resides in the 4th bin.
Inspection of the modified median cut algorithm in colorquant2.c of leptonica shows that the actual cut location for the 3d box does not necessarily occur adjacent to the bin containing the median pixel. The reasons for this are explained in the function medianCutApply(). This is one of the "modifications" to Paul Heckbert's original method. The other significant modification is to make the decision of which 3d box to cut next based on a combination of both population and the product (population * volume), thus permitting splitting of large but sparsely populated regions of color space.

I do not know the algo, but I would assume your array contains the population of each red; let's explain this with an example:
Assume you have four gradations of red: A,B,C and D
And you have the following sequence of red values:
AABDCADBBBAAA
To find the median, you would have to sort them according to red value and take the middle:
median
v
AAAAAABBBBCDD
Now let's use their approach:
A:6 => 6
B:4 => 10
C:1 => 11
D:2 => 13
13/2 = 6.5 => B
I think the mismatch happened because you are counting the population; the average color would be:
(6*A+4*B+1*C+2*D)/13

is this classification result acceptable?

I have a very simple linear classification problem,which is to work out a linear classification problem for the following three classes in coordinates:
Class 1: points (0,1) (1,0)
Class 2: points (-1,0) (1,0)
Class 3: points (0,-1) (1,-1)
I manually used a random initial weight [ 1 0,0 1] (2*2 matrix) and a random initial bias
[1,1] by applying each iteration on the six samples,I finally get a classification which is X=-1 and Y=-1,so when x and Y are both >-1,it is class1;
if X<=-1 and Y>-1,it is class2;
if x >-1 and Y <=-1,it is class3.
After plotting this on the graph,I think it has some problems since the decision boundary cross samples in class2 and class3,I wonder if that is acceptable.By observing the graph,I would say the ideal classification would be x =-1/2 and y=1/2,but I really cannot get that result after calculation.
Please kindly share your thoughts with me,thanks in advance.

I'd say the results are acceptable. All the points are correctly classified except for the point at (1,0) that is labelled as class 2 and classified as class 1. The problem is that there is also a point at (1,0) labelled as class 1, so it's impossible to separate classes 1 and 2.
Of course, the model is quite probably awful when evaluated on a test set. If you want the decision boundaries to be placed equidistant between points, you need to look at max margin classifiers.

The results are not acceptable. Class 2 and 3 are linearly separable, so you shouldn't accept any classifier that doesn't classify them perfectly.
As far as I know, with these samples and a feed-forward network trained with backpropagation, you are unlikely to get your desired x=-1/2 and y=1/2. You need a maximum margin classifier for that.
I recommend you to check a SVM linear classifier. You can check SVMlight for multiclass problems.