What does it mean when apriori min support = 0.05, and how does this differ from min support = 0.5?

I used Python for Apriori, but I was confused by seeing min support = 0.05 in many models. Doesn't this mean the minimum support is only 5%, and how can that be relied on when discovering the existing patterns in the data?

0.05 refers to 5%: support is the fraction of transactions that contain the itemset, so a minimum support of 0.05 means items 'A' and 'B' must occur together in at least 5 out of every 100 transactions for the itemset to be kept. A minimum support of 0.5 is a far stricter threshold, requiring the items to co-occur in at least half of all transactions.
https://medium.com/analytics-vidhya/market-basket-analysis-127c73f353d7
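To make the definition concrete, here is a minimal sketch in plain Python with made-up transactions (no particular apriori library assumed):

# Made-up transactions for illustration; support(itemset) is the fraction of
# transactions that contain every item in the itemset.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(support({"bread", "milk"}, transactions))   # 0.6 -> kept at min_support of 0.05 or 0.5
print(support({"milk", "butter"}, transactions))  # 0.4 -> kept at 0.05, filtered out at 0.5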

Related

probability of collision in MD5

Worst case, I have 180 million values in a cache (15-minute window before they go stale), and an MD5 hash has 2^128 possible values. What is my probability of a collision? Or better yet, is there a web page somewhere to answer that question, or a rough estimate thereof? That would rock so I know my chances.
The probability is 1-m!/(mⁿ(m-n)!) where m = 2¹²⁸ and n = 180000000.
Running it through on-line Wolfram exceeds available computational time!
If you have SmallTalk installed locally, you can run this:
| m n p |
m := 2 raisedTo: 128.
n := 180000000.
p := (1 - (m factorial / ((m raisedTo: n) * (m - n) factorial))) asFloat.
Transcript show: p printString; cr.
A search for the Birthday Problem brings up a Wikipedia page with a table showing that for 128 bits and 2.6×10¹⁰ hashes, the probability of a collision is 1 in 10¹⁸. That is about 140× the number of hashes you're considering, so you know your chance of a collision is even lower than that.
A good approximation when n ≪ m is 1 − e^(−n²/(2m)); plugging in m and n above gives 4.76×10⁻²³, or about 1 in 2.10×10²², as the probability of a collision.
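That approximation is easy to check numerically; here is a small Python sketch (expm1 is used because 1 − e^(−x) would round to 0 in ordinary floating point for an x this small):

import math

# Birthday-bound approximation for the chance of at least one collision among
# n random values drawn from a space of size m (valid when n << m).
m = 2 ** 128        # possible MD5 values
n = 180_000_000     # values held in the cache in the worst case

x = n * n / (2 * m)
p_collision = -math.expm1(-x)   # = 1 - e^(-x), computed without underflow
print(p_collision)              # ~4.76e-23, i.e. about 1 in 2.1e22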
Even though the probability of a collision is very low, it is prudent in the FOOBAR case, say if there is an issue and the hashes accumulate for more than 15 minutes, to at least confirm what would happen in the event of a collision. This will also help if someone somehow injects duplicate hashes in order to try to compromise it.

combination of smote and undersampling on weka

According to the paper by Chawla et al. (2002), the best performance for balancing data comes from combining undersampling with SMOTE.
I've tried to combine undersampling and SMOTE on my dataset, but I'm a bit confused about the attribute for undersampling.
In Weka there is Resample to decrease the majority class. Resample has an attribute:
biasToUniformClass -- Whether to use bias towards a uniform class. A value of 0 leaves the class distribution as-is, a value of 1 ensures the class distribution is uniform in the output data.
When I use value 0, the majority class shrinks but so does the minority class; when I use value 1, the majority class shrinks and the minority class grows.
I tried value 1 for that attribute without applying SMOTE to increase the minority-class instances, because the data is already balanced, and the result is good too.
So, is that the same as combining SMOTE and undersampling, or should I still use value 0 for that attribute and then apply SMOTE?
For undersampling, see the EasyEnsemble algorithm (a Weka implementation was developed by Schubach, Robinson, and Valentini).
The EasyEnsemble algorithm allows you to split the data into a certain number of balanced partitions. To achieve this balance, set the numIterations parameter equal to:
numIterations = (# of majority instances) / (# of minority instances)
For example, if there are 30 total instances with 20 in the majority class and 10 in the minority class, set the numIterations parameter equal to 2 (i.e., 20 majority instances / 10 instances equals 2 balanced partitions). These 2 partitions should each contain 20 instances; each has the same 10 minority instances along with a different 10 instances from the majority class.
The algorithm then trains a classifier on each balanced partition and, at test time, ensembles those classifiers to make predictions.
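As a rough sketch of that partitioning step (illustrative Python with made-up instances, not the Weka implementation):

import random

# Split the majority class into numIterations chunks and pair each chunk with
# the full minority class, giving balanced partitions.
majority = [("maj", i) for i in range(20)]
minority = [("min", i) for i in range(10)]

num_iterations = len(majority) // len(minority)   # 20 / 10 = 2 partitions

random.shuffle(majority)
chunk = len(majority) // num_iterations
partitions = [majority[i * chunk:(i + 1) * chunk] + minority
              for i in range(num_iterations)]

for p in partitions:
    print(len(p))   # 20 instances each: 10 majority + the same 10 minority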

How does the HyperLogLog algorithm work?

I've been learning about different algorithms in my spare time recently, and one that I came across which appears to be very interesting is called the HyperLogLog algorithm - which estimates how many unique items are in a list.
This was particularly interesting to me because it brought me back to my MySQL days, when I saw that "Cardinality" value (which until recently I always assumed was calculated, not estimated).
So I know how to write an algorithm in O(n) that will calculate how many unique items are in an array. I wrote this in JavaScript:
function countUniqueAlgo1(arr) {
    var Table = {};
    var numUnique = 0;
    var numDataPoints = arr.length;
    for (var j = 0; j < numDataPoints; j++) {
        var val = arr[j];
        if (Table[val] != null) {
            continue;
        }
        Table[val] = 1;
        numUnique++;
    }
    return numUnique;
}
But the problem is that my algorithm, while O(n), uses a lot of memory (storing values in Table).
I've been reading this paper about how to count the number of distinct items in a list in O(n) time using minimal memory.
It explains that by hashing and counting bits or something one can estimate within a certain probability (assuming the list is evenly distributed) the number of unique items in a list.
I've read the paper, but I can't seem to understand it. Can someone give a more layperson's explanation? I know what hashes are, but I don't understand how they are used in this HyperLogLog algorithm.
The main trick behind this algorithm is that if you, observing a stream of random integers, see an integer whose binary representation starts with some known prefix, there is a higher chance that the cardinality of the stream is 2^(size of the prefix).
That is, in a random stream of integers, ~50% of the numbers (in binary) start with "1", 25% start with "01", and 12.5% start with "001". This means that if you observe a random stream and see a "001", there is a higher chance that this stream has a cardinality of 8.
(The prefix "00..1" has no special meaning. It's there just because it's easy to find the most significant bit in a binary number on most processors.)
Of course, if you observe just one integer, the chance this value is wrong is high. That's why the algorithm divides the stream into "m" independent substreams and keeps, for each substream, the maximum length of a "00...1" prefix seen so far. It then estimates the final value by taking the mean of these per-substream values.
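Here is a rough sketch of that scheme (illustrative Python; the bucket count, the MD5 hash, and the absence of any bias correction are simplifications for clarity, not the algorithm as published):

import hashlib

def leading_zeros(x, bits):
    # Number of leading zero bits of x when viewed as a `bits`-wide integer.
    return bits - x.bit_length()

def rough_cardinality(items, b=4):
    m = 2 ** b                       # number of substreams ("buckets")
    max_zeros = [0] * m
    for item in items:
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:8], "big")
        bucket = h >> (64 - b)       # first b bits pick the substream
        rest = h & ((1 << (64 - b)) - 1)
        max_zeros[bucket] = max(max_zeros[bucket], leading_zeros(rest, 64 - b))
    mean = sum(max_zeros) / m
    return m * 2 ** mean             # crude estimate, no correction constant

print(rough_cardinality(range(10000)))   # right order of magnitude; the paper's
                                         # correction constant removes the remaining bias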
That's the main idea of this algorithm. There are some missing details (the correction for low estimates, for example), but it's all well explained in the paper.
A HyperLogLog is a probabilistic data structure. It counts the number of distinct elements in a list, but in comparison to the straightforward way of doing it (keeping a set and adding elements to it), it does this in an approximate way.
Before looking at how HyperLogLog does this, one has to understand why you need it. The problem with the straightforward way is that it consumes O(distinct elements) of space. Why is there big-O notation here instead of just the number of distinct elements? Because elements can be of different sizes: one element might be the number 1, another the string "is this big string". So if you have a huge list (or a huge stream of elements), it will take a lot of memory.
Probabilistic Counting
How can one get a reasonable estimate of the number of unique elements? Assume that you have a string of length m which consists of {0, 1} with equal probability. What is the probability that it starts with 0, with 2 zeros, with k zeros? It is 1/2, 1/4 and 1/2^k. This means that if you have encountered a string starting with k zeros, you have approximately looked through 2^k elements. So this is a good starting point: given a list of elements that are evenly distributed between 0 and 2^k - 1, you can track the maximum length of the zero prefix in the binary representations, and this will give you a reasonable estimate.
The problem is that the assumption of having numbers evenly distributed from 0 to 2^k - 1 is too hard to achieve: the data we encounter is mostly not numbers, almost never evenly distributed, and can take any values. But using a good hashing function you can assume that the output bits are evenly distributed, and most hashing functions have outputs between 0 and 2^k - 1 (SHA-1 gives you values between 0 and 2^160 - 1). So what we have achieved so far is that we can estimate the number of unique elements up to a maximum cardinality of 2^k while storing just one number of about log2(k) bits. The downside is that we have a huge variance in our estimate. A cool thing is that we have almost recreated the 1984 probabilistic counting paper (its estimator is a little bit smarter, but we are close).
LogLog
Before moving further, we have to understand why our first estimate is not that great. The reason is that one random outlier, an element whose hash happens to start with a long run of zeros, can spoil everything. One way to improve this is to use many hash functions, track the max for each of them, and average the results at the end. This is an excellent idea, which will improve the estimate, but the LogLog paper used a slightly different approach (probably because hashing is kind of expensive).
They used one hash but divided it into two parts. One part is called a bucket (the total number of buckets is 2^x) and the other is basically the same as our hash. It was hard for me to get what was going on, so I will give an example. Assume you have two elements and a hash function which gives values from 0 to 2^10 - 1, and it produced 2 values: 344 and 387. You decided to have 16 buckets, so the first 4 bits choose the bucket:
0101 011000 -> bucket 5 stores 1 (one leading zero in the remaining bits)
0110 000011 -> bucket 6 stores 4 (four leading zeros in the remaining bits)
By having more buckets you decrease the variance (you use slightly more space, but it is still tiny). Using math skills they were able to quantify the error (which is 1.3/sqrt(number of buckets)).
HyperLogLog
HyperLogLog does not introduce any new ideas, but mostly uses a lot of math to improve the previous estimate. Researchers have found that if you remove 30% of the biggest numbers from the buckets you significantly improve the estimate. They also used another algorithm for averaging numbers. The paper is math-heavy.
And I want to finish with a recent paper which shows an improved version of the HyperLogLog algorithm (up until now I didn't have time to fully understand it, but maybe later I will improve this answer).
The intuition is that if your input is a large set of random numbers (e.g. hashed values), they should be distributed evenly over the range. Say the range is 10 bits, representing values up to 1024, and the minimum value observed is 10. Then the cardinality is estimated to be about 100 (10 × 100 ≈ 1024).
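A toy version of that minimum-value intuition (illustrative Python; this demonstrates the intuition only, not HyperLogLog itself):

import hashlib

# Map each item to a pseudo-random fraction in [0, 1) and track the minimum.
# With n distinct items spread evenly, the minimum lands near 1/(n+1),
# so 1/min - 1 is a (very high-variance) cardinality estimate.
def min_value_estimate(items):
    smallest = 1.0
    for item in items:
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:8], "big")
        smallest = min(smallest, h / 2 ** 64)
    return 1 / smallest - 1

print(round(min_value_estimate(range(1000))))   # roughly 1000, give or take a lot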
Read the paper for the real logic of course.
Another good explanation with sample code can be found here:
Damn Cool Algorithms: Cardinality Estimation - Nick's Blog

Suggestions on designing a metric

I am designing a metric to measure when a search term is "ambiguous." A score near one means that it is ambiguous ("Ajax" could be a programming language, a cleaning solution, a Greek hero, a European soccer club, etc.) and a score near zero means it is pretty clear what the user meant ("Lady Gaga" probably means only one thing). Part of this metric is that I have a list of possible interpretations and the frequency of those interpretations from past data, and I need to turn this into a number between 0 and 1.
For example: let's say the term is "Cats" -- of a million trials, 850,000 times the user meant the furry thing that meows, 80,000 times they meant the musical by that name, and the rest are abbreviations for things, each meant only a trivial number of times. I would say this should have a low ambiguity score because even though there were multiple possible meanings, one was by far the preferred meaning. In contrast, let's say the term is "Friends" -- of a million trials, 500,000 times the user meant the people they hang out with all the time, 450,000 times they meant the TV show by that name, and the rest were some other meaning. This should get a higher ambiguity score because the different meanings were much closer in frequency.
TLDR: If I sort the array in decreasing order, I need a way to map arrays that fall off quickly to numbers close to zero and arrays that fall off more slowly to numbers closer to one. If the array was [1,0,0,0...] it should get a perfect score of 0, and if it was [1/n,1/n,1/n...] it should get a perfect score of 1. Any suggestions?
What you are looking for sounds very similar to the Entropy measure in information theory. It is a measure of how uncertain a random variable is based on the probabilities of each outcome. It is given by:
H(X) = -sum( p(x[i]) * log(p(x[i])) )
where p(x[i]) is the probability of the ith possibility. In your case, p(x[i]) would be the probability that a certain search phrase corresponds to a particular meaning. In the Cats example, you would have:
p(x[0]) = 850,000 / (850,000+80,000) = 0.914
p(x[1]) = 80,000 / (850,000+80,000) = 0.086
H(X) = -(0.914*log2(0.914) + 0.086*log2(0.086)) = 0.423
For the Friends case, you would have: (assuming only one other category)
H(X) = -(0.5*log2(0.5) + 0.45*log2(0.45) + 0.05*log2(0.05)) = 1.234
The higher number here means more uncertainty.
Note that I am using log base 2 in both cases, but if you use a logarithm of the base equal to the number of possibilities, you can get the scale to work out to 0 to 1.
H(X) = -(0.5*log3(0.5) + 0.45*log3(0.45) + 0.05*log3(0.05)) = 0.779
Note also that the most ambiguous case is when all possibilities have the same probability:
H(X) = -(0.33*log3(0.33) + 0.33*log3(0.33) + 0.33*log3(0.33)) = 1.0
and the least ambiguous case is when there is only one possibility:
H(X) = -log(1) = 0.0
Since you want the most ambiguous terms to be near 1, the normalized H(X) already has the right orientation: it is 0 when a single meaning dominates completely and 1 when all meanings are equally likely, so you can use it directly as your metric.
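A minimal sketch of this suggestion in Python, using the question's example counts (the log base is set to the number of interpretations so the score lands in [0, 1]):

import math

def ambiguity(counts):
    # Normalized entropy of the interpretation frequencies:
    # 0 = one meaning dominates completely, 1 = all meanings equally likely.
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    if len(probs) <= 1:
        return 0.0            # a single possible meaning is unambiguous
    return -sum(p * math.log(p, len(probs)) for p in probs)

print(ambiguity([850_000, 80_000]))           # "Cats": ~0.42
print(ambiguity([500_000, 450_000, 50_000]))  # "Friends": ~0.78, more ambiguous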

RANDOM Number Generation in C

Recently I have begun development of a simple game. It is an improved version of an earlier one I had developed. A large part of the game's success depends on random number generation in different modes:
MODE1 - Truly Random mode
myRand(min,max,mode=1);
Should return a random integer between min & max.
MODE2 - Pseudo Random: Token out of a bag mode
myRand(min,max,mode=2);
Should return a random integer between min & max. It should also internally keep track of the values returned and must not return the same value again until all the other values have been returned at least once.
MODE3 - Pseudo Random: Human mode
myRand(min,max,mode=3);
Should return a random integer between min & max. The randomisation ought to be not purely mathematically random, but rather random as users perceive it: how humans see RANDOM.
* Assume the code is time-critical (i.e. any performance optimisations are welcome).
* Pseudo-code will do, but an implementation in C is what I'm looking for.
* Please keep it simple. A single function should be sufficient (that's what I'm looking for).
Thank You
First, research the Mersenne Twister. This should be an excellent foundation for your problem.
Mode 1: Directly use the values. Given that the values are 32 bit, depending on the ranges of min and max, modulo (max-min+1) may be good enough, though there is a small bias if this interval is not a power of two. Else you can treat the value as a float value between 0 and 1 and need some additional operations. There may be other solutions to get equal distribution with integers, but I haven't researched this specific problem yet. Wikipedia may be of help here.
Mode 2: Use an array that you fill with min..max and then shuffle it. Return the shuffled values in order. When you're through the array, refill and reshuffle.
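A sketch of that "bag" idea, in Python for brevity since the question allows pseudo-code (the structure carries over directly to C, with the Mersenne Twister doing the shuffling):

import random

# Fill a bag with min..max, shuffle it, deal values in order, refill when empty.
class Bag:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.items = []

    def draw(self):
        if not self.items:
            self.items = list(range(self.lo, self.hi + 1))
            random.shuffle(self.items)   # Fisher-Yates shuffle under the hood
        return self.items.pop()

bag = Bag(1, 5)
print([bag.draw() for _ in range(10)])   # each of 1..5 appears exactly once per 5 draws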
Mode 3 is the most complicated. Small samples of random values show clusters, i.e. if you count the occurrences of the different values, you get an average count and the individual counts are usually above or below this average. As I understand your link, humans expect randomness to have all counts exactly at the average. So count the occurrences and give the different values a higher or lower probability depending on their distance from the average count. It may be enough to simply reuse mode 2 with a larger array, e.g. an array 10 times the size of (max-min+1), filled with 10x min, 10x min+1, and so on, and shuffled. After each full 10 rounds the counts are then exactly equal.
EDIT on mode 3:
Say you have min=1 and max=5 and you count the occurrences. If all values have the same probability (which they should, using a good random generator), then the probability for each value to occur is 0.2, because the probabilities add up to 1.0:
Value Occur Probability
1 7x 0.2
2 7x 0.2
3 7x 0.2
4 7x 0.2
5 7x 0.2
Average: 7x
But now let's say that 3 occurred only 5x and 5 occurred 9x. If you want to maintain the equal distribution, then 3 has to get a higher probability so it catches up with the average occurrence count, and 5 has to get a lower probability so it does not grow further until all the other values have caught up. Nonetheless, all the individual probabilities must still add up to 1.0:
Value Occur Probability
1 7x 0.2
2 7x 0.2
3 5x 0.3
4 7x 0.2
5 9x 0.1
Average: Still 7x
The different occurrences should have different probabilities, too, depending on their distance to the average:
Value Occur Probability
1 10x 0.05
2 4x 0.35
3 5x 0.3
4 7x 0.2
5 9x 0.1
Average: Still 7x
This is not that trivial to implement and is most likely rather slow, because the underlying random generator still provides equal probabilities, so a modified mode 2 may be a good-enough choice.
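One possible weighting scheme along those lines, again sketched in Python rather than C (the exact formula here is an illustrative choice, not the only option):

import random

# Values that have occurred less often than average get more weight, values
# that have occurred more get less, so the counts drift back toward the average.
def human_random(lo, hi, counts):
    values = list(range(lo, hi + 1))
    avg = sum(counts[v] for v in values) / len(values)
    weights = [max(0.05, 1.0 + (avg - counts[v]) / (avg + 1.0)) for v in values]
    choice = random.choices(values, weights=weights)[0]
    counts[choice] += 1
    return choice

counts = {v: 0 for v in range(1, 6)}
print([human_random(1, 5, counts) for _ in range(20)])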
As a first step, go and read Knuth
You can use a linear feedback shift register (LFSR) for mode 2 if the number of values, max - min + 1, is 2^N - 1. This kind of generator produces a repeating sequence of 2^N - 1 numbers using N bits of internal storage. See http://en.wikipedia.org/wiki/LFSR for a more detailed explanation and code.
Human-perceivable randomness would be the same as the first mode, except that results which are multiples of 2, 5, or 10 can be arbitrarily rejected.
If I asked for a random number and got 5 or 10, I would think it's not random enough.
