CFA not yielding same results as EFA in R - psych

I am using the 'psych' package in R to run EFA. After running the EFA, I use Lavaan to run a CFA using the factor structure from EFA, purely just to humor myself to compare results. After doing so, I am suspicious of results produced by 'psych'.
I am aware that CFA is used to essentially test a hypothesis of how it is believed certain items may factor. And I am aware that the two are not typically ran together. However, it is my understanding that if I am aware of certain factor structure, that results from EFA (RMSEA, TLI, factor loadings, residual variance, etc.) should be approximately similar to results from CFA.
My issue is that sometimes in using 'psych' I can have a 3 factor structure derived from EFA, where all factor loadings appear to be below 1. Then when putting the same factor structure through CFA, I can have a standardized loading exceed 1, and warnings of negative residual variance, thus yielding unreliable estimates.
This is troubling to me, because I feel uneasy reporting results from 'psych'. Especially, because I am dealing with oblique rotation (correlated factors) that can yield factor loadings greater than 1 (say if I use a different number of factors). I have read that factor loadings can exceed 1, if residual variance is non-negative. I am just having a hard time deciphering if I do have negative residual variance, when my EFA is telling me no, but the CFA is telling me yes. Anyone see anything like this previously?

Related

Sample size calculation for experimental design

I have three treatments (Wild type, Mutant1 and Mutant2); I request inputs on how to decide the sample size that would be statistically significant (alpha <0.05) with high statistical power (1-beta=0.8).
Questions
I understand that we need the information of effect size. We approach this problem if we don't know the expected effect size prior; a trial experiment to estimate the effect size. In such case if we want to estimate the effect size with trial experiment; what could be the sample size to start with; a high (n=10) or as low as n=3? Can n=3 among treatments provide a good estimate of effect size or n=10 is better to get this estimate. Let's be specific; if we have resource for n=10 max. and we are given option to choose between n=3 or n=10 for this trial
This question is better asked in https://stats.stackexchange.com.
I would discourage you from trying to estimate effects sizes from pilot experiments with low n. Your estimates will be quite noisy and this is rarely done (at least in my field of neuroscience). Instead, I would suggest you estimate your effect size from the literature. Have other people measured something similar to what you are planning to do? What are the sample sizes they use? What kind of effect sizes do they report.
If you were going to go ahead with the plan to run a pilot study, I would recommend pre-registering your experimental design (https://www.cos.io/initiatives/prereg). Something like:
We will test the effects of mutation 1 and mutation 2 on XXXX (compared to wild type) in a cohort of 30 mice (10 in each group). Based on the results of this study, we will then conduct a power analysis and reproduce the experiments in a sample size required to have a power of 0.8 at p=0.05.
Our criteria for excluding animals from the power analysis will be .....
The statistical test for estimating effect size will be......"
etc.

Understanding forecast accuracy MAPE, WMAPE,WAPE?

I am new to the forecast space and I am trying to understand the different forecast accuracy measures. I am referring to the below link
https://www.otexts.org/fpp/2/5
Can anyone please help me understand the below things:
1. MAPE: I am trying to understand the disadvantage of MAPE "They also have the disadvantage that they put a heavier penalty on negative errors than on positive errors. " Can anyone please provide an example to explain this in detail?
2. Also, I was assuming that WMAPE and WAPE are same. I saw this post at stackoverflow which formulates them differently.
What's the gaps for the forecast error metrics: MAPE and WMAPE?
Also, can you please help me understand how the weights are calculated? My understanding is higher the value more important it is. But I am not sure how the value is calculated.
Thanks in advance!
MAPE = 100* mean(|(Actual-forecast)/Actual|)
If you check the website https://robjhyndman.com/hyndsight/smape/ and the example given u will notice that the denominator taken is the forecast which is incorrect (Should be the actual value). With this formula you can see that MAPE does not put a heavier penalty on negative errors than on positive errors.
WMAPE applies weights which may in fact be biased towards the error which would make the metric worse. The weightage for WMAPE is as far as I know based on the use case. For example you are trying to predict the loss but the percentage of loss needs to be weighted with volume of sales because a loss on a huge sale needs better prediction.
In cases where values to be predicted is very low MAD/Mean (a.k.a WAPE) should be used. For example if the sales is 3 units in one particular week (maybe a holiday) and the predicted value is 9 then the MAPE would be 200%. This would bloat up the total MAPE when you look at multiple weeks of data.
The link given below has details of some other stats used for error measurement
http://www.forecastpro.com/Trends/forecasting101August2011.html
I'm not very sure about the rest, but I came across an answer for the first question recently.
Check out this website - http://robjhyndman.com/hyndsight/smape/
The example given there is presented below -
"Armstrong and Collopy (1992) argued that the MAPE "puts a heavier penalty on forecasts that exceed the actual than those that are less than the actual". Makridakis (1993) took up the argument saying that "equal errors above the actual value result in a greater APE than those below the actual value". He provided an example where yt=150 and y^t=100, so that the relative error is 50/150=0.33, in contrast to the situation where yt=100 and y^t=150, when the relative error would be 50/100=0.50."
y^t == estimated value of y
WMAPE and MAPE are different measures.
MAPE is Mean Absolute Percent Error - this just averages the percent errors.
WMAPE is Weighted Mean Absolute Percent Error = This weights the errors by Volume so this is more rigorous and reliable.
Negative errors do not influence the calculation is this is all absolute error. This could result from the denominator used which is a separate debate.
You can download a detailed presentation from our website at https://valuechainplanning.com/download/24. The PDF can be downloaded at https://valuechainplanning.com/upload/details/Forecast_Accuracy_Presentation.pdf.

Help--100% accuracy with LibSVM?

Nominally a good problem to have, but I'm pretty sure it is because something funny is going on...
As context, I'm working on a problem in the facial expression/recognition space, so getting 100% accuracy seems incredibly implausible (not that it would be plausible in most applications...). I'm guessing there is either some consistent bias in the data set that it making it overly easy for an SVM to pull out the answer, =or=, more likely, I've done something wrong on the SVM side.
I'm looking for suggestions to help understand what is going on--is it me (=my usage of LibSVM)? Or is it the data?
The details:
About ~2500 labeled data vectors/instances (transformed video frames of individuals--<20 individual persons total), binary classification problem. ~900 features/instance. Unbalanced data set at about a 1:4 ratio.
Ran subset.py to separate the data into test (500 instances) and train (remaining).
Ran "svm-train -t 0 ". (Note: apparently no need for '-w1 1 -w-1 4'...)
Ran svm-predict on the test file. Accuracy=100%!
Things tried:
Checked about 10 times over that I'm not training & testing on the same data files, through some inadvertent command-line argument error
re-ran subset.py (even with -s 1) multiple times and did train/test only multiple different data sets (in case I randomly upon the most magical train/test pa
ran a simple diff-like check to confirm that the test file is not a subset of the training data
svm-scale on the data has no effect on accuracy (accuracy=100%). (Although the number of support vectors does drop from nSV=127, bSV=64 to nBSV=72, bSV=0.)
((weird)) using the default RBF kernel (vice linear -- i.e., removing '-t 0') results in accuracy going to garbage(?!)
(sanity check) running svm-predict using a model trained on a scaled data set against an unscaled data set results in accuracy = 80% (i.e., it always guesses the dominant class). This is strictly a sanity check to make sure that somehow svm-predict is nominally acting right on my machine.
Tentative conclusion?:
Something with the data is wacked--somehow, within the data set, there is a subtle, experimenter-driven effect that the SVM is picking up on.
(This doesn't, on first pass, explain why the RBF kernel gives garbage results, however.)
Would greatly appreciate any suggestions on a) how to fix my usage of LibSVM (if that is actually the problem) or b) determine what subtle experimenter-bias in the data LibSVM is picking up on.
Two other ideas:
Make sure you're not training and testing on the same data. This sounds kind of dumb, but in computer vision applications you should take care that: make sure you're not repeating data (say two frames of the same video fall on different folds), you're not training and testing on the same individual, etc. It is more subtle than it sounds.
Make sure you search for gamma and C parameters for the RBF kernel. There are good theoretical (asymptotic) results that justify that a linear classifier is just a degenerate RBF classifier. So you should just look for a good (C, gamma) pair.
Notwithstanding that the devil is in the details, here are three simple tests you could try:
Quickie (~2 minutes): Run the data through a decision tree algorithm. This is available in Matlab via classregtree, or you can load into R and use rpart. This could tell you if one or just a few features happen to give a perfect separation.
Not-so-quickie (~10-60 minutes, depending on your infrastructure): Iteratively split the features (i.e. from 900 to 2 sets of 450), train, and test. If one of the subsets gives you perfect classification, split it again. It would take fewer than 10 such splits to find out where the problem variables are. If it happens to "break" with many variables remaining (or even in the first split), select a different random subset of features, shave off fewer variables at a time, etc. It can't possibly need all 900 to split the data.
Deeper analysis (minutes to several hours): try permutations of labels. If you can permute all of them and still get perfect separation, you have some problem in your train/test setup. If you select increasingly larger subsets to permute (or, if going in the other direction, to leave static), you can see where you begin to lose separability. Alternatively, consider decreasing your training set size and if you get separability even with a very small training set, then something is weird.
Method #1 is fast & should be insightful. There are some other methods I could recommend, but #1 and #2 are easy and it would be odd if they don't give any insights.

How to convert the output of an artificial neural network into probabilities?

I've read about neural network a little while ago and I understand how an ANN (especially a multilayer perceptron that learns via backpropagation) can learn to classify an event as true or false.
I think there are two ways :
1) You get one output neuron. It it's value is > 0.5 the events is likely true, if it's value is <=0.5 the event is likely to be false.
2) You get two output neurons, if the value of the first is > than the value of the second the event is likely true and vice versa.
In these case, the ANN tells you if an event is likely true or likely false. It does not tell how likely it is.
Is there a way to convert this value to some odds or to directly get odds out of the ANN. I'd like to get an output like "The event has a 84% probability to be true"
Once a NN has been trained, for eg. using backprogation as mentioned in the question (whereby the backprogation logic has "nudged" the weights in ways that minimize the error function) the weights associated with all individual inputs ("outside" inputs or intra-NN inputs) are fixed. The NN can then be used for classifying purposes.
Whereby the math (and the "options") during the learning phase can get a bit thick, it is relatively simple and straightfoward when operating as a classifier. The main algorithm is to compute an activation value for each neuron, as the sum of the input x weight for that neuron. This value is then fed to an activation function which purpose's is to normalize it and convert it to a boolean (in typical cases, as some networks do not have an all-or-nothing rule for some of their layers). The activation function can be more complex than you indicated, in particular it needn't be linear, but whatever its shape, typically sigmoid, it operate in the same fashion: figuring out where the activation fits on the curve, and if applicable, above or below a threshold. The basic algorithm then processes all neurons at a given layer before proceeding to the next.
With this in mind, the question of using the perceptron's ability to qualify its guess (or indeed guesses - plural) with a percentage value, finds an easy answer: you bet it can, its output(s) is real-valued (if anything in need of normalizing) before we convert it to a discrete value (a boolean or a category ID in the case of several categories), using the activation functions and the threshold/comparison methods described in the question.
So... How and Where do I get "my percentages"?... All depends on the NN implementation, and more importantly, the implementation dictates the type of normalization functions that can be used to bring activation values in the 0-1 range and in a fashion that the sum of all percentages "add up" to 1. In its simplest form, the activation function can be used to normalize the value and the weights of the input to the output layer can be used as factors to ensure the "add up" to 1 question (provided that these weights are indeed so normalized themselves).
Et voilà!
Claritication: (following Mathieu's note)
One doesn't need to change anything in the way the Neural Network itself works; the only thing needed is to somehow "hook into" the logic of output neurons to access the [real-valued] activation value they computed, or, possibly better, to access the real-valued output of the activation function, prior its boolean conversion (which is typically based on a threshold value or on some stochastic function).
In other words, the NN works as previously, neither its training nor recognition logic are altered, the inputs to the NN stay the same, as do the connections between various layers etc. We only get a copy of the real-valued activation of the neurons in the output layer, and we use this to compute a percentage. The actual formula for the percentage calculation depends on the nature of the activation value and its associated function (its scale, its range relative to other neurons' output etc.).
Here are a few simple cases (taken from the question's suggested output rules)
1) If there is a single output neuron: the ratio of the value provided by the activation function relative to the range of that function should do.
2) If there are two (or more output neurons), as with classifiers for example: If all output neurons have the same activation function, the percentage for a given neuron is that of its activation function value divided by the sum of all activation function values. If the activation functions vary, it becomes a case by case situation because the distinct activation functions may be indicative of a purposeful desire to give more weight to some of the neurons, and the percentage should respect this.
What you can do is to use a sigmoid transfer function on the output layer nodes (that accepts data ranges (-inf,inf) and outputs a value in [-1,1]).
Then by using the 1-of-n output encoding (one node for each class), you can map the range [-1,1] to [0,1] and use it as probability for each class value (note that this works naturally for more than just two classes).
The activation value of a single output neuron is a linearly weighted sum, and may be directly interpreted as an approximate probability if the network is trained to give outputs a range from 0 to 1. This would tend to be the case if the transfer function (or output function) in both the preceding stage and providing the final output is in the 0 to 1 range too (typically the sigmoidal logistic function). However, there is no guarantee that it will but repairs are possible. Moreover unless the sigmoids are logistic and the weights are constrained to be positive and sum to 1, it is unlikely. Generally a neural network will train in a more balanced way using the tanh sigmoid and weights and activations that range positive and negative (due to the symmetry of this model). Another factor is the prevalence of the class - if it is 50% then a 0.5 threshold is likely to be effective for logistic and a 0.0 threshold for tanh. The sigmoid is designed to push things towards the centre of the range (on backpropogation) and constrain it from going out of the range (in feedforward). The significance of the performance (with respect to the Bernoulli distribution) can also be interpreted as a probability that the neuron is making real predictions rather than guessing. Ideally the bias of the predictor to positives should match the prevalence of positives in the real world (which may vary at different times and places, e.g. bull vs bear markets, e.g. credit worthiness of people applying for loans vs people who fail to make loan payments) - calibrating to probabilities has the advantage that any desired bias can be set easily.
If you have two neurons for two classes, each can be interpreted independently as above, and the halved difference between them can also be. It is like flipping the negative class neuron and averaging. The differences can also give rise to a probability of significance estimate (using the T-test).
The Brier score and its Murphy decomposition give a more direct estimate of the probability that an average answer is correct, while Informedness gives the probability the classifier is making an informed decision rather than a guess, ROC AUC gives the probability a positive class will be ranked higher than a negative class (by a positive predictor), and Kappa will give a similar number that matches Informedness when prevalence = bias.
What you normally want is both a significance probability for the overall classifier (to ensure that you are playing on a real field, and not in an imaginary framework of guestimates) and a probability estimate for a specific example. There are various ways to calibrate, including doing a regression (linear or nonlinear) versus probability and using its inverse function to remap to a more accurate probability estimate. This can be seen by the Brier score improving, with the calibration component reducing towards 0, but the discrimination component remaining the same, as should ROC AUC and Informedness (Kappa is subject to bias and may worsen).
A simple non-linear way to calibrate to probabilities is to use the ROC curve - as the threshold changes for the output of a single neuron or the difference between two competing neurons, we plot the results true and false positive rates on a ROC curve (the false and true negative rates are naturally the complements, as what isn't really a positive is a negative). Then you scan the ROC curve (polyline) point by point (each time the gradient changes) sample by sample and the proportion of positive samples gives you a probability estimate for positives corresponding to the neural threshold that produced that point. Values between points on the curve can be linearly interpolated between those that are represented in the calibration set - and in fact any bad points in the ROC curve, represented by deconvexities (dents) can be smoothed over by the convex hull - probabilistically interpolating between the endpoints of the hull segment. Flach and Wu propose a technique that actually flips the segment, but this depends on information being used the wrong way round and although it could be used repeatedly for arbitrary improvement on the calibration set, it will be increasingly unlikely to generalize to a test situation.
(I came here looking for papers I'd seen ages ago on these ROC-based approaches - so this is from memory and without these lost references.)
I will be very prudent in interpreting the outputs of a neural networks (in fact any machine learning classifier) as a probability. The machine is trained to discriminate between classes, not to estimate the probability density. In fact, we don't have this information in the data, we have to infer it. For my experience I din't advice anyone to interpret directly the outputs as probabilities.
did you try prof. Hinton's suggestion of training the network with softmax activation function and cross entropy error?
as an example create a three layer network with the following:
linear neurons [ number of features ]
sigmoid neurons [ 3 x number of features ]
linear neurons [ number of classes ]
then train them with cross entropy error softmax transfer with your favourite optimizer stochastic descent/iprop plus/ grad descent. After training the output neurons should be normalized to sum of 1.
Please see http://en.wikipedia.org/wiki/Softmax_activation_function for details. Shark Machine Learning framework does provide Softmax feature through combining two models. And prof. Hinton an excellent online course # http://coursera.com regarding the details.
I can remember I saw an example of Neural network trained with back propagation to approximate the probability of an outcome in the book Introduction to the theory of neural computation (hertz krogh palmer). I think the key to the example was a special learning rule so that you didn't have to convert the output of a unit to probability, but instead you got automatically the probability as output.
If you have the opportunity, try to check that book.
(by the way, "boltzman machines", although less famous, are neural networks designed specifically to learn probability distributions, you may want to check them as well)
When using ANN for 2-class classification and logistic sigmoid activation function is used in the output layer, the output values could be interpreted as probabilities.
So if you choosing between 2 classes, you train using 1-of-C encoding, where 2 ANN outputs will have training values (1,0) and (0,1) for each of classes respectively.
To get probability of first class in percent, just multiply first ANN output to 100. To get probability of other class use the second output.
This could be generalized for multi-class classification using softmax activation function.
You can read more, including proofs of probabilistic interpretation here:
[1] Bishop, Christopher M. Neural networks for pattern recognition. Oxford university press, 1995.

Is there a good reason for storing percentages that are less than 1 as numbers greater than 1?

I inherited a project that uses SQL Server 200x, wherein a column that stores a value that is always considered as a percentage in the problem domain is stored as its greater than 1 decimal equivalent. For example, 70% (0.7, literally) is stored as 70, 100% as 100, etc. Aside from the need to remember to * 0.01 on retrieved values and * 100 before persisting values, it doesn't seem to be a problem in and of itself. It does make my head explode though... so is there a good reason for it that I'm missing? Are there compelling reasons to fix it, given that there is a fair amount of code written to work with the pseudo-percentages?
There are a few cases where greater than 100% occurs, but I don't see why the value wouldn't just be stored as 1.05, for example, in those cases.
EDIT: Head feeling better, and slightly smarter. Thanks for all the insights.
There are actually four good reasons I can think of that you might want to store—and calculate with—whole-number percentage values rather than floating-point equivalents:
Depending on the data types chosen, the integer value may take up less space.
Depending on the data type, the floating-point value may lose precision (remember that not all languages have a data type equivalent to SQL Server's decimal type).
If the value will be input from or output to the user very frequently, it may be more convenient to keep it in a more user-friendly format (decision between convert when you display and convert when you calculate ... but see the next point).
If the principle values are also integers, then
principle * integerPercentage / 100
which uses all integer arithmetic is usually faster than its floating-point equivalent (likely significantly faster in the case of a floating-point type equivalent to T-SQL's decimal type).
If its a byte field then it takes up less room in the db than floating point numbers, but unless you have millions and millions of records, you'll hardly see a difference.
Since floating-point values can't be compared for equality, an integer may have been used to make the SQL simpler.
For example
(0.3==3*.1)
is usually False.
However
abs( 0.3 - 3*.1 )
Is a tiny number (5.55e-17). But it's pain to have to do everything with (column-SomeValue) BETWEEN -0.0001 AND 0.0001 or ABS(column-SomeValue) < 0.0001. You'd rather do column = SomeValue in your WHERE clause.
Floating point numbers are prone to rounding errors and, therefore, can act "funny" in comparisons. If you always want to deal with it as fixed decimal, you could either choose a decimal type, say decimal(5,2), or do the convert and store as int thing that your db does. I'd probably go the decimal route, even though the int would take up less space.
A good guess is because anything you do with integers (storing, calculating, stuffing into an edit for for a user, etc.) is marginally easier and more efficient than doing the same with floating point numbers. And the rounding issues aren't so obvious when you look at the data.
If these are numbers that end users are likely to see and interact with, percentages are easier to understand than decimals.
This is one of those situations where a notation aid can help; in the program, be consistent in using a prefix (Hungarian) or postfix to specify values that are percentages vs. those that are decimal. If you can extend a naming convention to the database fields themselves, so much the better.
And to add to the data storage issue, if you can use integer arithmetic for whatever processing you are doing, the performance is much better than when doing floating point arithmetic... So storing ther percetages as integer values may allow the processing logic to itilize integer arithmetic
If you're actually using them as a coefficient (or expect users of the database to do this sort of thing in reports), there's a case for storing them as a coefficient - particularly if there's a reason to do calculations involving more than one.
However, if you do this you should be consistent - either all percentages or all coefficients.

Resources