Zipfian vs Uniform - What's the difference between these two YCSB distributions? - benchmarking

Can anyone please describe the differences between the Zipfian and Uniform distributions when running YCSB workloads?
Here's the YCSB core properties: https://github.com/brianfrankcooper/YCSB/wiki/Core-Properties

The Yahoo Team has explained it in their paper.
In a nutshell, the distribution affects how YCSB reads and scans over the keyspace:
uniform: each row has an equal probability of being read.
zipfian: some rows have a higher probability of being targeted by reads or scans. Those rows are called the "hot set" or "hot spot" and represent popular data, for instance popular threads of a forum. You select it with requestdistribution=zipfian; the related hotspot distribution is tuned with hotspotdatafraction and hotspotopnfraction. See $YCSB_HOME/workloads/workload_template for more details.
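To make the contrast concrete, here is a small Python sketch (not YCSB's actual generator code; the keyspace size and skew constant below are just illustrative assumptions) that simulates key selection under the two distributions:

```python
import numpy as np

RECORD_COUNT = 1_000      # assumed keyspace size
N_OPS = 100_000           # simulated read operations
ZIPF_CONSTANT = 0.99      # YCSB's default Zipfian constant is roughly 0.99

rng = np.random.default_rng(42)

# uniform: every key is equally likely to be read.
uniform_keys = rng.integers(0, RECORD_COUNT, size=N_OPS)

# zipfian: P(key with rank k) ~ 1 / k**s, so a small "hot set" of keys
# receives most of the reads.
ranks = np.arange(1, RECORD_COUNT + 1)
probs = 1.0 / ranks**ZIPF_CONSTANT
probs /= probs.sum()
zipfian_keys = rng.choice(RECORD_COUNT, size=N_OPS, p=probs)

# How concentrated are the accesses on the 10 most-read keys?
def top10_share(keys):
    counts = np.bincount(keys, minlength=RECORD_COUNT)
    return np.sort(counts)[-10:].sum() / N_OPS

print(f"uniform: top 10 keys get {top10_share(uniform_keys):.1%} of reads")
print(f"zipfian: top 10 keys get {top10_share(zipfian_keys):.1%} of reads")
```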
Hope this helps.

Related

How to determine the parameters of DBSCAN?

I tried the k-dist plot as introduced in one paper I read and used the knee distance to set epsilon. However, the results were not satisfying.
I use WEKA to run DBSCAN, but it always returns only one cluster.
Can anyone please give me some advice?
This can happen if the k-dist plot has more than one knee, which occurs when the dataset contains clusters of different densities; the outcome you obtained arises when the high-density clusters are nested inside the low-density ones.
The solution is to look for the next knee and re-apply the algorithm to the core points you have already found.
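A minimal sketch of the k-dist-plot approach with scikit-learn (the data matrix X and the chosen eps below are placeholders; pick eps at the knee you read off the sorted-distance curve):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

# X: your (n_samples, n_features) data matrix -- random placeholder here.
X = np.random.rand(500, 2)

min_pts = 4  # common heuristic: minPts around 2 * n_features

# 1) k-dist plot: sorted distance of each point to its min_pts-th neighbor
#    (+1 because the first neighbor returned is the point itself).
nn = NearestNeighbors(n_neighbors=min_pts + 1).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])
plt.plot(k_dist)
plt.ylabel(f"{min_pts}-NN distance")
plt.show()

# 2) Read eps off the "knee" of the plot. If there are several knees
#    (clusters of different density), try the next knee on the core
#    points already found, as suggested above.
eps = 0.05  # placeholder value
labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
```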

Finding a handwritten dataset with an already extracted features

I want to test my clustering algorithms on data of handwritten text, so I'm searching for a dataset of handwritten text (e.g. words) with already extracted features (the goal is to test my clustering algorithms, not to extract features). Does anyone have any information on that?
Thanks.
There is a dataset of images of handwritten digits: http://yann.lecun.com/exdb/mnist/
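If you just need digit images as ready-made feature vectors to try a clustering algorithm on, a quick sanity check might look like the sketch below; it uses scikit-learn's small 8x8 digits set rather than full MNIST, and KMeans is only a stand-in for your own algorithm:

```python
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# 1797 handwritten digits, each already a 64-dimensional feature vector.
# (For the full MNIST set, fetch_openml("mnist_784") downloads it as
# 784-dimensional vectors.)
digits = load_digits()
X, y = digits.data, digits.target

# Stand-in for your own clustering algorithm.
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)

# Ground-truth digit labels exist, so the clustering can be scored.
print("adjusted Rand index:", adjusted_rand_score(y, labels))
```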
Texmex has 128d SIFT vectors "to evaluate the quality of approximate nearest neighbors search algorithm on different kinds of data and varying database sizes", but I don't know what their images are of; you could try asking the authors.

What classifiers to use for deciding if two datasets depict the same individual?

Suppose I have pictures of faces of a set of individuals. The question I'm trying to answer is: "do these two pictures represent the same individual"?
As usual, I have a training set containing several pictures for a number of individuals. The individuals and pictures the algorithm will have to process are of course not in the training set.
My question is not about image processing algorithms or particular features I should use, but on the issue of classification. I don't see how traditional classifier algorithms such as SVM or Adaboost can be used in this context. How should I use them? Should I use other classifiers? Which ones?
NB: my real application is not faces (I don't want to disclose it), but it's close enough.
Note: the training dataset isn't enormous, in the low thousands at best. Each dataset is pretty big though (a few megabytes), even if it doesn't hold a lot of real information.
You should probably look at the following methods:
P. Jonathon Phillips: Support Vector Machines Applied to Face Recognition. NIPS 1998: 803-809
Haibin Ling, Stefano Soatto, Narayanan Ramanathan, and David W. Jacobs: A Study of Face Recognition as People Age. IEEE International Conference on Computer Vision (ICCV), 2007.
These papers describe applying SVMs to same-person/different-person problems like the one you describe. If the alignment of the features (eyes, nose, mouth) is good, these methods work very nicely.
How big is your dataset?
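One common way to cast verification as binary classification (not necessarily exactly what the cited papers do) is to train an SVM on the difference of the two feature vectors of a pair, labelled "same" or "different". A sketch with entirely made-up features and identities:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Placeholder: 5 feature vectors per (hypothetical) person, 20 people.
features_by_person = {p: rng.normal(loc=p, size=(5, 32)) for p in range(20)}

def make_pairs(features_by_person, n_pairs=500):
    """Build (|f1 - f2|, label) training pairs: 1 = same person, 0 = different."""
    X, y = [], []
    people = list(features_by_person)
    for _ in range(n_pairs):
        same = rng.random() < 0.5
        if same:
            p = rng.choice(people)
            a, b = features_by_person[p][rng.choice(5, 2, replace=False)]
        else:
            p, q = rng.choice(people, 2, replace=False)
            a = features_by_person[p][rng.choice(5)]
            b = features_by_person[q][rng.choice(5)]
        X.append(np.abs(a - b))
        y.append(int(same))
    return np.array(X), np.array(y)

X_train, y_train = make_pairs(features_by_person)
clf = SVC(kernel="rbf").fit(X_train, y_train)
# At test time: clf.predict([np.abs(f1 - f2)]) -> 1 means "same individual",
# even for people who never appeared in the training set.
```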
I would start this problem by coming up with some kind of distance metric (say Euclidean) that characterizes differences between images (such as differences in color, shape, etc., or local differences). Two images representing the same individual would have a small distance compared to images representing different individuals, though this depends heavily on the type of dataset you are working with.
Forgive me for stating the obvious, but why not use any supervised classifier (SVM, GMM, k-NN, etc.), get one label for each test sample (e.g., face, voice, text, etc.), and then see if the two labels match?
Otherwise, you could perform a binary hypothesis test. H0 = two samples do not match. H1 = two samples match. For two test samples, x1 and x2, compute a distance, d(x1, x2). Choose H1 if d(x1, x2) < epsilon and H0 otherwise. Adjusting epsilon will adjust your probability of detection and probability of false alarm. Your application would dictate which epsilon is best; for example, maybe you can tolerate misses but cannot tolerate false alarms, or vice versa. This is called Neyman-Pearson hypothesis testing.
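A bare-bones version of that threshold test (the Euclidean distance and the epsilon value below are placeholders; in practice epsilon would be tuned on a validation set to hit the false-alarm rate you can tolerate):

```python
import numpy as np

def same_individual(x1, x2, epsilon):
    """Choose H1 ('same individual') if the distance is below epsilon, else H0."""
    d = np.linalg.norm(np.asarray(x1) - np.asarray(x2))  # Euclidean distance
    return d < epsilon

# Example with made-up feature vectors and a made-up threshold.
x1 = np.array([0.9, 1.1, 0.2])
x2 = np.array([1.0, 1.0, 0.3])
print(same_individual(x1, x2, epsilon=0.5))  # True -> declare "same individual"
```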

Bioinformatics databases with negative results?

Bioinformatics databases such as BioGRID collect a lot of interaction results for proteins and genes in different species from all sorts of publications and experiments, but such collations suffer from testing biases since not all combinations are tested, and some are tested more than once. Shouldn't they also collect all negative results? Is there such a resource which systematically collects both positive and negative interactions from high-throughput and low-throughput experiments?
These might help:
http://www.jnrbm.com/info/about/
http://www.jnr-eeb.org/index.php/jnr
and as far as I know, databases of non-hitters or non-binding drug-like compounds also exist.
You should look for the 'Negatome', a database of non-interacting protein pairs.
Smialowski P, Pagel P, Wong P, Brauner B, Dunger I, Fobo G, Frishman G, Montrone C, Rattei T, Frishman D, Ruepp A. The Negatome database: a reference set of non-interacting protein pairs. Nucleic Acids Res. 2010 Jan;38(Database issue):D540-4. Epub 2009 Nov 17. PubMed PMID: 19920129; PubMed Central PMCID: PMC2808923. Available from: http://www.ncbi.nlm.nih.gov/pubmed/19920129
1) High-throughput screens that are published in peer-reviewed journals often have such data. Cessarini has published negative results regarding domain/peptide interactions.
2) You can contact databases like MINT/Reactome/etc... and mention that you want the negative results where they are available. Many such organizations are required by mandate to share any such data with you, even if it's not on their site.
3) A good resource on this subject is here http://www.nature.com/nmeth/journal/v4/n5/full/nmeth0507-377.html
We have been working on an open-source protein interaction meta-database & prediction server (which does include data from BioGRID among other sources) that deals with both negative and positive data, as you asked for...
MAYETdb does the following:
Classifies protein interactions as either "interacting" or "not interacting".
Includes data from a variety of experimental set-ups (Y2H, TAP-MS, and more) and species (yeast, human, C. elegans), including both literature-mined and database data, e.g. BioGRID.
It also reports the false-positive and false-negative error rates of those classifications.
A random forest machine learning system makes predictions for previously untested interactions by learning from a wide variety of protein features, and works at rather high accuracy (~92% AUC).
It is not yet running on a server but the source code is available and heavily commented if you are curiously impatient: https://bitbucket.org/dknightg/ppidb/src
Please ask if you have any queries :)
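As a rough illustration of that last idea (not MAYETdb's actual code; the protein-pair features and labels below are entirely made up), a random forest trained on labelled interacting/non-interacting pairs can score untested pairs:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Made-up feature matrix: one row per protein pair (e.g. co-expression,
# shared GO terms, domain co-occurrence, ...); label 1 = interacting.
X_train = rng.random((200, 8))
y_train = rng.integers(0, 2, size=200)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Score previously untested pairs: estimated probability of "interacting".
X_untested = rng.random((5, 8))
print(clf.predict_proba(X_untested)[:, 1])
```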

need some suggestions on my SVM feature refinement

I've trained an SVM-based system that, given a question, decides whether a webpage is a good one for answering that question.
The features I selected are "term frequency in the webpage", "whether the term matches the webpage title", "number of images in the webpage", "length of the webpage", "is it a Wikipedia page?", and "the position of this webpage in the list returned by the search engine".
Currently, my system maintains a precision of around 0.4 and a recall of 1. It has a large portion of false-positive errors (many bad links are classified as good links by my classifier).
Since the accuracy could be improved a bit, I would like to ask for help on refining the features I selected for training/testing; I could remove some or add more.
Thanks in advance.
Hmm...
How large is your training set? i.e., how many training documents are you using?
What is your test set composed of?
Since you're getting too many FPs, I would try training with more (and varied) "bad" webpages.
Can you give more details about your different features, like "tf in webpage," etc.?
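For what it's worth, here is a sketch of how the listed features might be assembled and how class weights can be used to push the classifier toward fewer false positives; the feature values and weights are placeholders, not a recommended setting:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# One row per (question, webpage) pair, columns in the order of the features
# listed in the question: [term frequency, title match (0/1), #images,
# page length, is Wikipedia (0/1), rank in the search-engine results]
X_train = np.array([
    [0.12, 1,  3, 5400, 1,  1],
    [0.01, 0, 25,  900, 0, 40],
    # ... more labelled examples ...
])
y_train = np.array([1, 0])  # 1 = good page for the question, 0 = bad

# Scale the features (their ranges differ wildly) and weight the "bad" class
# more heavily so misclassifying a bad page as good costs more.
clf = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", class_weight={0: 5.0, 1: 1.0}),
).fit(X_train, y_train)
```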
