I am currently trying to understand a topic in Artificial Intelligence (Learning) and need assistance in understanding the following:
Why would a leave-one-out cross-validation algorithm, when used in conjunction with a majority classifier, score zero rather than 50% on a data set with an equal number of positive and negative examples?
Thank you for your guidance on this.
If I understand the question correctly: when you leave out a positive sample, the training set contains more negative samples than positive ones, so the left-out sample is classified as negative, and vice versa. Every prediction is therefore wrong, which gives an accuracy of zero.
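A minimal sketch of this in Python (the tiny balanced data set is made up for illustration): leave-one-out cross-validation of a majority classifier scores 0% here, not 50%.

```python
# Toy balanced data set: equal numbers of positive and negative labels.
labels = [1, 1, 1, 0, 0, 0]

correct = 0
for i, held_out in enumerate(labels):
    train = labels[:i] + labels[i + 1:]                   # leave one example out
    majority = 1 if sum(train) > len(train) / 2 else 0    # majority vote on the rest
    correct += (majority == held_out)                     # always wrong: the majority flips away

print(correct / len(labels))  # 0.0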
Hello, dear reader
I have been thinking about how to store data efficiently since the beginning of my studies, and while taking a shower I came up with the following idea:
For example, you take a picture and convert it into zeros and ones. Then you take this enormously long number and divide it by, say, 10, then by 10 again, and again, and so on, until at the end you have a small number. You store the small number and the calculation path, and anyone who wants to read the data only has to perform the inverse operation to get the result back.
My gut feeling tells me the idea is too good to be true, but I would still like to know why it should not work.
Kind regards
Fun theorem: no bijection on the natural numbers can map every number to a strictly smaller one. Proof by contradiction: consider the sequence F(1), F(F(1)), F(F(F(1))), ... It would have to decrease forever, but a strictly decreasing sequence of natural numbers must terminate.
There are plenty of ways to map numbers one-to-one so that many of them map to smaller numbers; these are lossless compression algorithms. Most of them have the property that applying the algorithm repeatedly makes the data larger, or leaves it unchanged.
In your proposal, to the extent I understand it, you would have to store all the remainders of the divisions, and together those remainders would be as large as the original data.
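To make the remainder point concrete, here is a small sketch (the encode/decode helpers and the example number are made up for illustration): repeatedly dividing by 10 only shrinks the number if you throw the remainders away, and then the data cannot be recovered; keeping them takes as much space as the original digits.

```python
def encode(n, divisor=10):
    """Divide n down to 0, recording every remainder along the way."""
    remainders = []
    while n > 0:
        n, r = divmod(n, divisor)
        remainders.append(r)
    return remainders  # one entry per digit of the original number

def decode(remainders, divisor=10):
    """Inverse operation: rebuild the original number from the remainders."""
    n = 0
    for r in reversed(remainders):
        n = n * divisor + r
    return n

original = 123456789012345678901234567890   # stands in for the "eternally long number"
path = encode(original)
assert decode(path) == original
print(len(str(original)), len(path))        # the "calculation path" is as long as the number itself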
I am asked to normalize a probability distribution P(x) = A * x^2 * e^(-x) on the interval from 0 to infinity by finding the value of A. I know algorithms for computing the numerical value of an integral, but how do I deal with one of the limits being infinity?
The only way I have been able to solve this problem with some accuracy (full accuracy, in fact) is by doing some math first, in order to obtain the Taylor series that represents the integral.
I have been looking for my sample code, but I can't find it. I'll edit my post if I get a working solution.
The basic idea is to calculate all the derivatives of the function exp(-(x*x)) and use the coefficients to derive the integral form (by dividing each coefficient by one more than the corresponding exponent of x), which gives you the Taylor series of the integral. I recommend using the unnormalized version described above to get simple numeric coefficients, then adjusting the result by multiplying by the proper constants. You'll get a Taylor series with good convergence, giving precise values at full precision. (Direct quadrature requires a lot of subdivision, and you cannot divide an unbounded interval into a finite number of intervals that are all finite.)
I'll edit this post if I find the code I wrote (so stay online, and don't change the channel :) ).
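For completeness, here is a different, more elementary approach than the Taylor-series route above (a sketch under assumed step counts, not a definitive implementation): substitute x = t/(1-t), which maps [0, 1) onto [0, infinity), and then apply an ordinary quadrature rule on the finite interval. For the integrand in the question, x^2 * e^(-x), the exact integral is 2, so A = 1/2.

```python
import math

def f(x):
    return x * x * math.exp(-x)          # unnormalized integrand x^2 * e^(-x)

def integrate_to_infinity(f, n=100_000):
    """Composite midpoint rule on t in [0, 1) after the substitution x = t/(1-t)."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * h                # midpoint of each subinterval
        x = t / (1.0 - t)
        jacobian = 1.0 / (1.0 - t) ** 2  # dx/dt for the substitution
        total += f(x) * jacobian * h
    return total

integral = integrate_to_infinity(f)
A = 1.0 / integral
print(integral, A)   # approximately 2.0 and 0.5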
According to the official documentation, there is no doubt that in LevelDB each level is 10 times bigger than the one above it.
The question is: why 10? Why not 2, or 20? Is it due to some rigorous mathematical calculation, or does it just work?
I have read the original LSM-tree paper. I can understand the multi-component part, because it would be too hard to merge the C0 tree with a very large C1 tree. But the paper says nothing about what the best parameter is.
Am I right? It is actually an interview question. How can I answer it properly if there is no single best parameter?
10x is a reasonable value, though it is not derived rigorously.
The coefficient can't be too small, because that would create too many levels, which is unfriendly to reads and results in more space amplification.
It can't be too large either: as you mentioned, the cost of compaction increases, because the average number of SSTs participating in each compaction grows.
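A rough back-of-the-envelope sketch of that trade-off (the data sizes are assumed for illustration, and this is a simplified cost model, not LevelDB's actual code): with fan-out T, the number of levels is about log_T(total/level1), and each key is rewritten roughly on the order of T times per level during compaction, so a larger fan-out means fewer levels but more write amplification.

```python
import math

total_data_mb = 1_000_000   # assume ~1 TB of data
level1_mb = 100             # assume ~100 MB in the first level

for fan_out in (2, 5, 10, 20, 50):
    levels = math.ceil(math.log(total_data_mb / level1_mb, fan_out))
    write_amp = fan_out * levels   # crude upper bound: ~T rewrites per level
    print(f"fan-out {fan_out:>2}: ~{levels} levels, write amplification ~{write_amp}")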
I am new to the forecasting space and I am trying to understand the different forecast accuracy measures. I am referring to the link below:
https://www.otexts.org/fpp/2/5
Can anyone please help me understand the following:
1. MAPE: I am trying to understand the disadvantage of MAPE "They also have the disadvantage that they put a heavier penalty on negative errors than on positive errors. " Can anyone please provide an example to explain this in detail?
2. Also, I was assuming that WMAPE and WAPE are the same. I saw a post on Stack Overflow which formulates them differently:
What's the gaps for the forecast error metrics: MAPE and WMAPE?
Also, can you please help me understand how the weights are calculated? My understanding is that the higher the weight, the more important the observation, but I am not sure how the value is computed.
Thanks in advance!
MAPE = 100 * mean( |(Actual - Forecast) / Actual| )
If you check the website https://robjhyndman.com/hyndsight/smape/ and the example given there, you will notice that the denominator used is the forecast, which is incorrect (it should be the actual value). With this formula you can see that MAPE does not put a heavier penalty on negative errors than on positive errors.
WMAPE applies weights, which may in fact bias the metric towards particular errors and make it look worse. As far as I know, the weighting in WMAPE is based on the use case. For example, if you are trying to predict losses, the percentage loss needs to be weighted by sales volume, because a loss on a huge sale needs a better prediction.
In cases where the values to be predicted are very low, MAD/Mean (a.k.a. WAPE) should be used. For example, if sales are 3 units in one particular week (maybe a holiday) and the predicted value is 9, then the MAPE for that week would be 200%. This blows up the total MAPE when you look at multiple weeks of data, as illustrated in the sketch after this answer.
The link given below has details of some other statistics used for error measurement:
http://www.forecastpro.com/Trends/forecasting101August2011.html
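To make the holiday-week example above concrete, here is a small sketch with made-up weekly numbers; it computes MAPE and a volume-weighted WMAPE (weights = actuals, i.e. WAPE) side by side.

```python
actuals   = [100, 120, 3, 110]   # third week is the low-volume "holiday" week
forecasts = [ 95, 125, 9, 100]

# MAPE: average of the absolute percentage errors, one per week
ape = [abs(a - f) / a for a, f in zip(actuals, forecasts)]
mape = 100 * sum(ape) / len(ape)

# WMAPE with weights = actual volume, i.e. sum of |errors| / sum of actuals
wmape = 100 * sum(abs(a - f) for a, f in zip(actuals, forecasts)) / sum(actuals)

print(f"MAPE  = {mape:.1f}%")    # blown up by the 200% holiday week
print(f"WMAPE = {wmape:.1f}%")   # weighted by volume, much less distorted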
I'm not very sure about the rest, but I came across an answer for the first question recently.
Check out this website - http://robjhyndman.com/hyndsight/smape/
The example given there is presented below -
"Armstrong and Collopy (1992) argued that the MAPE "puts a heavier penalty on forecasts that exceed the actual than those that are less than the actual". Makridakis (1993) took up the argument saying that "equal errors above the actual value result in a greater APE than those below the actual value". He provided an example where yt=150 and y^t=100, so that the relative error is 50/150=0.33, in contrast to the situation where yt=100 and y^t=150, when the relative error would be 50/100=0.50."
y^t == estimated value of y
WMAPE and MAPE are different measures.
MAPE is Mean Absolute Percent Error - this just averages the percent errors.
WMAPE is Weighted Mean Absolute Percent Error: this weights the errors by volume, so it is more rigorous and reliable.
Negative errors do not influence the calculation, since everything is an absolute error; any asymmetry could only come from the denominator used, which is a separate debate.
You can download a detailed presentation from our website at https://valuechainplanning.com/download/24. The PDF can be downloaded at https://valuechainplanning.com/upload/details/Forecast_Accuracy_Presentation.pdf.
I'm experimenting with single-layer perceptrons, and I think I understand (mostly) everything. However, what I don't understand is to which weights the correction (learning rate*error) should be added. In the examples I've seen it seems arbitrary.
Well, it looks like you half answered your own question: it's true that you correct all of the weights that received a non-zero input, but you don't correct them all by the same amount.
Instead, you correct the weights in proportion to their incoming activation: if unit X activated very strongly and unit Y activated only a little, and there was a large error, then the weight going from unit X to the output would be corrected far more than the weight from unit Y.
The technical term for this process is the delta rule, and its details can be found in its Wikipedia article. Additionally, if you ever want to upgrade to multilayer perceptrons (single-layer perceptrons are very limited in computational power; see a discussion of Minsky and Papert's argument against them here), the analogous learning algorithm, called backpropagation, is discussed here.
Answered my own question.
According to http://intsys.mgt.qub.ac.uk/notes/perceptr.html, you "add this correction to any weight for which there was an input". In other words, do not add the correction to weights whose input neurons had a value of 0.
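Putting both answers together, here is a minimal sketch (the toy inputs, weights, and learning rate are made up) of the delta-rule update: each weight's correction is proportional to its own input, so weights whose input was 0 are left unchanged.

```python
def perceptron_update(weights, bias, inputs, target, learning_rate=0.1):
    """One delta-rule step for a single-layer perceptron with a step activation."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    output = 1 if activation >= 0 else 0
    error = target - output
    # correction = learning_rate * error * input  (zero input => zero correction)
    new_weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
    new_bias = bias + learning_rate * error      # the bias input is implicitly 1
    return new_weights, new_bias

weights, bias = [0.2, -0.4, 0.1], 0.0
inputs, target = [1, 0, 1], 0                    # the second input is 0
weights, bias = perceptron_update(weights, bias, inputs, target)
print(weights)   # only the first and third weights changed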