A/B testing sorting algorithm [closed] - sampling

I want to write an algorithm that will support A/B testing over a variable number of subjects with a variable number of properties per subject.
For example, I have 1000 people with the following properties: they come from two departments, some are managers, some are women, etc. The set of properties may grow or shrink depending on the situation.
I want the algorithm to split the population in two so that all the properties are represented as evenly as possible in both A and B. So I want two groups of 500 people, each with an equal number of people from both departments, an equal number of managers and an equal number of women. More specifically, I would like to maintain the ratio of each property in both A and B: if 10% of the population are managers, I want 10% of sample A and 10% of sample B to be managers.
Any pointers on where to begin? I am pretty sure that such an algorithm exists. I have a gut feeling that this may be unsolvable in some cases, as there may be an odd number of people who are managers AND women AND in Dept. 1.

Make a list of the permutations of all the A/B variables.
Dept1,Manager,Male
Dept1,Manager,Female
Dept1,Junior,Male
...
Dept2,Junior,Female
Go through all the people and assign them to their respective permutation. Maybe randomise the order of the people first just to be sure there is no bias in the order they are added to each permutation.
Dept1,Manager,Male-> Person1, Person16, Person143...
Dept1,Manager,Female-> Person7, Person10, Person83...
Have a second process that goes through each permutation and assigns half the people to one test group and half to the other. You will need to account for odd numbers of people in a group, but that should be fairly easy to factor in; obviously, a larger sample size will reduce the impact of these odd leftovers on the final results.
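A minimal sketch of those two steps (my own code, not the answerer's; the property names dept, role and gender are made up for illustration): bucket people by their exact combination of properties, shuffle each bucket, then deal half of each bucket into A and half into B.

import random
from collections import defaultdict

def stratified_split(people, key=lambda p: (p["dept"], p["role"], p["gender"])):
    buckets = defaultdict(list)
    for person in people:
        buckets[key(person)].append(person)   # one bucket per property combination
    group_a, group_b = [], []
    for members in buckets.values():
        random.shuffle(members)               # remove any ordering bias
        half = len(members) // 2
        if len(members) % 2 and random.random() < 0.5:
            half += 1                         # randomly decide which group gets the odd person
        group_a.extend(members[:half])
        group_b.extend(members[half:])
    return group_a, group_b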

The algorithm for splitting the groups is simple: take each group of people who have all dimensions in common and assign half to the treatment and half to the control. You don't need to worry about odd numbers of people; whatever statistical test you are using will account for that. If some dimension is heavily skewed (e.g., there are only 2 women in your entire sample), it may be wise to throw that dimension out.
Simple A/B tests usually use a t-test or G-test, but in your case you'd be better off using an ANOVA to determine the significance of the treatment on each of the individual dimensions.
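As a rough, hedged illustration of testing the treatment within a single dimension (my own sketch, not necessarily the exact ANOVA design the answer has in mind; outcome, group and is_manager are hypothetical per-person arrays):

from scipy import stats

def treatment_effect_within(outcome, group, mask):
    # F-test of A vs. B restricted to the people selected by mask;
    # with exactly two groups this is equivalent to a t-test
    a = [y for y, g, m in zip(outcome, group, mask) if m and g == "A"]
    b = [y for y, g, m in zip(outcome, group, mask) if m and g == "B"]
    return stats.f_oneway(a, b)

# e.g. does the treatment move the metric for managers specifically?
# f_stat, p_value = treatment_effect_within(outcome, group, is_manager)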

Related

When to use quicksort, mergesort, and radix sort? (with regard to quantities rather than paradigms) [closed]

I've seen plenty of questions asking whether quicksort or mergesort is 'better', and when to use each of them, but what I'd like to see is some input on when to use them with regard to the size of the data being sorted.
Let's say I have a number of items, whether they be ints or custom objects. I sort these items.
I see mergesort, in a way, as being the optimal case of quicksort (picking the median as the pivot) at every step, but with some overhead. So at a certain size, when the overhead is negligible in comparison to the consistent optimal nature of mergesort, it would make sense to use it in favor of quicksort.
Radix sort has a 'linear' runtime provided that the number of digits in the keys being sorted does not approach the number of items being sorted. However, to my knowledge, radix sort also carries a relatively large constant factor in its runtime.
If I recall from some testing in the past, it made sense to use mergesort when the number of items being sorted began to number in the millions, and radix in the high millions/billion range.
Am I reasonably accurate in these assessments? Can someone confirm, deny, or correct them to some extent?
(I'm talking about rather 'simple' implementations of each sort. Also, in the case of radix sort, let's say that the largest single key is no larger than twice the number of items being sorted; e.g., when sorting 4,000,000 items, the largest possible key is 8,000,000.)
Edit: I would like some input on the general size ranges in which each of the given sorts is fastest. I provided some in the question, and that may have been a mistake; what I'd like to see in an answer is an opinion on those ranges. I know quicksort tends to be the default since it's usually 'good enough', doesn't have the space overhead of mergesort, and doesn't come with radix sort's worry about malicious data purposely crafted with obscenely large keys.
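The crossover points are heavily machine- and implementation-dependent, so rather than quoting fixed numbers, here is a rough, hedged harness (my own sketch) for measuring them under the question's own assumptions (random keys no larger than 2n). Python's built-in sorted (Timsort, a mergesort hybrid) stands in for mergesort; a hand-rolled quicksort could be timed the same way.

import random
import time

def lsd_radix_sort(items):
    # simple least-significant-digit radix sort (one byte at a time) for non-negative integers
    if not items:
        return items
    max_key, shift = max(items), 0
    while (max_key >> shift) > 0:
        buckets = [[] for _ in range(256)]
        for x in items:
            buckets[(x >> shift) & 0xFF].append(x)
        items = [x for bucket in buckets for x in bucket]
        shift += 8
    return items

def bench(sort_fn, n):
    data = [random.randrange(2 * n) for _ in range(n)]   # keys no larger than 2n, as stated above
    start = time.perf_counter()
    sort_fn(data)
    return time.perf_counter() - start

for n in (10_000, 100_000, 1_000_000):
    print(n, bench(sorted, n), bench(lsd_radix_sort, n))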

How should I write this generic algorithm [closed]

Suppose you have two strings. Each string has lines, separated by a newline character. Now you want to compare both strings and find the best method (smallest number of steps), by only adding or deleting lines of one string, to transform the second string into the first string.
For example:
string #2:
abc
def
efg
hello
123
and string #1:
abc
def
efg
adc
123
The best (fewest steps) solution to transform string #2 into string #1 would be:
remove the line 'hello' (line position 3)
add 'adc' at that position
How would one write a generic algorithm to find the quickest (fewest-step) solution for transforming one string into another, given that you can only add or remove lines?
This is a classic problem.
For a given set of allowed operations the edit distance between two strings is the minimal number of operations required to transform one into the other.
When the set of allowed operations consists of insertion and deletion only, it is known as the longest common subsequence edit distance.
You'll find everything you need to compute this distance in Longest common subsequence problem.
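A small Python sketch of that LCS dynamic program (my own code, assuming only whole-line additions and deletions are allowed, as in the question); it returns a minimal script of 'remove'/'add' operations turning src_lines into dst_lines.

def lcs_diff(src_lines, dst_lines):
    n, m = len(src_lines), len(dst_lines)
    # dp[i][j] = length of the longest common subsequence of src_lines[i:] and dst_lines[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if src_lines[i] == dst_lines[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    ops, i, j = [], 0, 0
    while i < n or j < m:
        if i < n and j < m and src_lines[i] == dst_lines[j]:
            i, j = i + 1, j + 1                              # line kept as-is
        elif i < n and (j == m or dp[i + 1][j] >= dp[i][j + 1]):
            ops.append(("remove", i, src_lines[i]))          # position in the source string
            i += 1
        else:
            ops.append(("add", j, dst_lines[j]))             # position in the target string
            j += 1
    return ops

For the question's example this yields [('remove', 3, 'hello'), ('add', 3, 'adc')].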
Note that to answer this question fully, one would have to thoroughly cover the huge subject of graph similarity search / graph edit distance, which I will not do here. I will, however, point you in directions where you can study the problem more thoroughly on your own.
... to find the quickest (fewest-step) solution for transforming one string into another ...
This is a quite common problem known as the (minimum) edit distance problem (originally the 'String-to-String Correction Problem' of R. Wagner and M. Fischer). Computing the optimal (minimum, i.e., fewest-step) edit distance, which is what you ask for in your question, is non-trivial.
See e.g.:
https://en.wikipedia.org/wiki/Edit_distance
https://web.stanford.edu/class/cs124/lec/med.pdf
The minimum edit distance problem for string similarity is in itself a subclass of the more general minimum graph edit distance problem, or graph similarity search (since any string or even sequenced object, as you have noted yourself, can be represented as a graph), see e.g. A survey on graph edit distance.
For details regarding this problem here on SO, refer to e.g. Edit Distance Algorithm and Faster edit distance algorithm.
This should get you started.
I'd tag this as a math/algorithm problem rather than a language-specific one, unless someone can guide you to an existing library (e.g., in C) for solving edit distance problems.
The fastest way would be to remove all sub-strings, then append (not insert) all new sub-strings; and to do "all sub-strings at once" if you can (possibly leading to a destPointer = sourcePointer approach).
The overhead of minimising the number of sub-strings removed and inserted will be higher than removing and inserting/appending without checking whether it's necessary. It's like spending $100 to pay a consultant to determine if you should spend $5.

clustering a SUPER large data set [closed]

I am working on a project as part of my class curriculum. It's a project for Advanced Database Management Systems and it goes like this:
1) Download a large number of images (1,000,000) --> Done
2) Cluster them according to their visual similarity
a) Find the histogram of each image --> Done
b) Now group (cluster) the images according to their visual similarity.
Now, I am having a problem with part 2b. Here is what I did:
A) I found the histogram of each image using MATLAB and have represented it as a 1D vector (16 x 16 x 16 bins, flattened), so there are 4096 values in a single vector.
B) I generated an ARFF file in the following format: there are 1,000,000 histograms (one per image, thus 1,000,000 rows in the file) and 4097 values in each row (image_name + 4096 double values representing the histogram).
C) The file size is 34 GB. The big question: how the heck do I cluster this file?
I tried using Weka and other tools, but they all hang; Weka gets stuck and says "Reading a file".
I have 8 GB of RAM on my desktop. I don't have access to any compute cluster as such. I tried googling but couldn't find anything helpful about clustering large datasets. How do I cluster these entries?
This is what I thought:
Approach One:
Should I do it in batches of 50,000 or so? For instance, cluster the first 50,000 entries and find as many clusters as possible; call them k1, k2, k3, ..., kn.
Then pick the next 50,000 and allot them to one of these clusters, and so on? Will this be an accurate representation of all the images, given that the clustering is done only on the basis of the first 50,000 images?
Approach Two:
Do the above process using a random 50,000 entries instead?
Anyone have any input?
Thanks!
EDIT 1:
Any clustering algorithm can be used.
Weka isn't your best tool for this. I found ELKI to be much more powerful (and faster) when it comes to clustering. The largest dataset I've run is ~3 million objects in 128 dimensions.
However, note that at this size and dimensionality, your main concern should be result quality.
If you run e.g. k-means, the result will essentially be random, because you are using 4096 histogram bins (way too many, in particular with squared Euclidean distance).
To get a good result, you need to step back and think some more.
What makes two images similar? How can you measure similarity? Verify your similarity measure first.
Which algorithm can use this notion of similarity? Verify the algorithm on a small data set first.
How can the algorithm be scaled up using indexing or parallelism?
In my experience, color histograms worked best on the range of 8 bins for hue x 3 bins for saturation x 3 bins for brightness. Beyond that, the binning is too fine grained. Plus it destroys your similarity measure.
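For illustration (a sketch using NumPy, assuming hsv_pixels is an (N, 3) array of one image's pixels with hue, saturation and value each scaled to [0, 1]), the coarser 8 x 3 x 3 binning suggested above looks like this:

import numpy as np

def coarse_hsv_histogram(hsv_pixels, bins=(8, 3, 3)):
    hist, _ = np.histogramdd(hsv_pixels, bins=bins, range=[(0, 1), (0, 1), (0, 1)])
    hist = hist.ravel()              # 72 values instead of 4096
    return hist / hist.sum()         # normalise so image size does not matter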
If you run k-means, you gain absolutely nothing by adding more data. It searches for statistical means and adding more data won't find a different mean, but just some more digits of precision. So you may just as well use a sample of just 10k or 100k pictures, and you will get virtually the same results.
Running it several times on independent sets of pictures produces different clusterings that are difficult to merge, so two similar images may end up in different clusters. I would run the clustering algorithm on one random set of images (as large as possible) and use those cluster definitions to sort all other images.
Alternative: reduce the complexity of your data, e.g. to a histogram of 1024 double values.
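A sketch (my own, not the answerer's code) of the "cluster a large random sample, then sort every other image into those clusters" suggestion, using plain NumPy k-means on the reduced histograms; the array name histograms, the sample size and k are assumptions.

import numpy as np

def nearest_centroid(data, centroids):
    # squared Euclidean distances via ||x||^2 - 2 x.c + ||c||^2 (avoids an n x k x d array)
    d = ((data ** 2).sum(axis=1, keepdims=True)
         - 2.0 * data @ centroids.T
         + (centroids ** 2).sum(axis=1))
    return d.argmin(axis=1)

def kmeans(data, k, iterations=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(iterations):
        labels = nearest_centroid(data, centroids)
        for j in range(k):
            members = data[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

# histograms: an n x d array of the (reduced) image histograms -- an assumed name
rng = np.random.default_rng(0)
sample = histograms[rng.choice(len(histograms), size=100_000, replace=False)]
centroids = kmeans(sample, k=50)                    # 1. learn clusters from a sample that fits in memory
labels = np.concatenate([nearest_centroid(chunk, centroids)
                         for chunk in np.array_split(histograms, 100)])   # 2. assign everything else in chunks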

Which are admissible heuristics and why? [closed]

There are n vehicles on an n x n grid. At the start they are lined up in row 1 (the top row). The vehicles have to get to the bottom row in reverse order, so the vehicle i that starts at (1, i) must get to (n, n − i + 1). On each time step, each of the vehicles can move one square up, down, left or right, or it can stay put. If a vehicle stays put, one adjacent vehicle (but not more than one) can hop over it. Two vehicles cannot occupy the same square.
Which of the following heuristics are admissible for the problem of moving all the vehicles to their destination?
i. h1 + h2 + ... + hn (the sum)
ii. max(h1, ..., hn)
iii. min(h1, ..., hn)
I think that iii is the only correct one, but I'm not sure how to formulate my reasoning on why.
I am sure someone will come along with a very detailed answer, but as a favour to those who, like me, can be a bit overwhelmed by all things AI, an admissible heuristic is quite simply:
A heuristic that never overestimates the true cost of getting to the goal
Not to sound too uncharitable, but it sounds as if maybe the problems you've posted are from a homework problem or assignment. I wouldn't want to spoil your fun working out exactly which of those three heuristics are and aren't admissible - but hopefully that one sentence definition should help you along.
If you get confused, just remember: if, once your vehicles have all reached their goals, you find the actual cost was less than what the heuristic said it would be, then the heuristic is inadmissible.
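For concreteness (my own illustration, under the common reading that each hi is vehicle i's own distance estimate to its goal, e.g. its Manhattan distance), the three candidates would be computed like this; whether each of them can overestimate the true cost of moving all the vehicles, given that vehicles may also hop, is the part left to you.

def candidate_heuristics(positions, goals):
    # positions and goals are lists of (row, column) pairs, one per vehicle
    per_vehicle = [abs(r - gr) + abs(c - gc)
                   for (r, c), (gr, gc) in zip(positions, goals)]
    return sum(per_vehicle), max(per_vehicle), min(per_vehicle)   # heuristics i, ii, iii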

Generating non-uniform random numbers [closed]

Can you tell me any ways to generate non-uniform random numbers?
I am using Java but the code examples can be in whatever you want.
One way is to create a skewed distribution by adding two uniform random numbers together (i.e. rolling 2 dice).
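For instance, a tiny sketch of that dice idea (sums near 7 come up far more often than 2 or 12):

import random

def two_dice():
    # triangular distribution over 2..12, peaked at 7
    return random.randint(1, 6) + random.randint(1, 6)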
Try generating uniformly distributed random numbers, then applying the inverse of your target distribution's cumulative distribution function to each of them.
What distribution of deviates do you want?
Here is a technique which always works, but isn't always the most efficient. The cumulative distribution function P(x) gives the fraction of the time that values fall below x. Thus P(x) = 0 at the lowest possible value of x and P(x) = 1 at the highest possible value of x. Every distribution has a unique CDF, which encodes all the properties of the distribution in the way that P(x) rises from 0 to 1. If y is a uniform deviate on the interval [0,1], then x satisfying P(x) = y will be distributed according to your distribution. To make this work computationally, you just need a way of computing the inverse of P(x) for your distribution.
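A small sketch of that idea for one concrete case, the exponential distribution, whose CDF P(x) = 1 - exp(-rate * x) inverts in closed form:

import math
import random

def exponential_deviate(rate):
    y = random.random()                   # uniform deviate on [0, 1)
    return -math.log(1.0 - y) / rate      # the x that satisfies P(x) = y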
The Meta.Numerics library defines a large number of commonly used distributions (e.g. normal, lognormal, exponential, chi-squared) and has functions for computing the CDF (Distribution.LeftProbability) and the inverse CDF (Distribution.InverseLeftProbability) of each.
For specialized techniques that are fast for particular distributions, e.g. the Box-Muller technique for normally distributed deviates, see the book Numerical Recipes.
If you are using Java then my Uncommons Maths library may be of interest. It includes classes for generating random numbers for Uniform, Gaussian, Poisson, Binomial and Exponential distributions. This article shows how you might use these distributions.
