Dataset help for TF-IDF and Vector Model - dataset

I want to compare TF-IDF, the vector model, and some optimizations of the TF-IDF algorithm.
For that I need a dataset (at least 100 documents of English text), but I am not able to find one. Any suggestions?

It depends on the application you are using TF-IDF for. For example, if you want to find keywords you could use the "Mendeley" dataset, or for tagging you could use "Delicious" data.


If possible, what is the Solr query syntax to filter by doc size?

Solr 4.3.0
I want to find the larger size documents.
I'm trying to build some test data for testing memory usage, but I keep getting the smaller sized documents. So, if I could add a doc size clause to my query it would help me find more suitable documents.
I'm not aware of this possibility, most likely there is no support for it.
I can see one possible approach: you could store the size of the document in a separate field during indexing, and later filter on that field.
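A minimal sketch of that approach, assuming the document body lives in a field called `text` and that you add a numeric field called `doc_size` to your schema (both names are hypothetical). The size is computed on the client before the documents are sent to Solr, and the filter uses Solr's standard range query syntax.

```python
import json

def with_size_field(doc, text_field="text", size_field="doc_size"):
    """Return a copy of doc with the UTF-8 byte length of its text in size_field."""
    enriched = dict(doc)
    enriched[size_field] = len(doc.get(text_field, "").encode("utf-8"))
    return enriched

def min_size_filter(min_bytes, size_field="doc_size"):
    """Build a range filter that matches documents of at least min_bytes."""
    return f"{size_field}:[{min_bytes} TO *]"

# Hypothetical documents; in practice these come from your own data source.
docs = [
    {"id": "1", "text": "short"},
    {"id": "2", "text": "a much longer body of text " * 100},
]

# JSON payload to post to your update handler, and a filter to use as fq.
enriched_docs = [with_size_field(d) for d in docs]
payload = json.dumps(enriched_docs)
fq = min_size_filter(1000)
```

The resulting `fq` value can be passed as a filter query, or the field can be used for sorting (`doc_size desc`) to pull the largest documents first.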
Another option is to use the TermVectorComponent, which can return term vectors for matched documents and so give some idea of how big each document is. It is neither easy nor simple, though.
Example of the possibly useful output:
A third possible option (kudos to MatsLindh for the idea) is to use the sorting function norm() on a specific field. There are some limitations:
You need to use a classic similarity
The field you're sorting on should contain norms
Example of the sorting function: sort:norm(field_name) desc

SOLR: Create term vector (like data returned from TermVectorComponent) from raw text

Using TermVectorComponent I can get indexed terms and their frequencies for any document stored in my index. How can I get the same information for a text, without storing the text in my index? I just want SOLR to process the text and return the information, but without having to store the document in my index.
AFAIK this isn't possible without storing data in SOLR.
If you are looking to do text analysis (I understand this is broader than what you ask for), I would recommend the below alternatives:
MAUI - does keyphrase and terminology extraction.
Gensim - does topic modelling
Kea - keyword extraction
I've also come across some Python scripts that do term frequency analysis. Have a look at Mincemeat, particularly the example, which does term frequency calculation.
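For a rough idea of what such a term frequency calculation looks like outside of Solr, here is a minimal sketch using only the Python standard library. It assumes a simple lowercase, alphanumeric tokenizer, which is not the same as a Solr analyzer chain, so the terms will not match what TermVectorComponent returns for a configured field type.

```python
import re
from collections import Counter

def term_frequencies(text):
    """Count terms using a naive lowercase alphanumeric tokenizer."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

tf = term_frequencies("The quick fox. The lazy dog; the END.")
print(tf.most_common(3))
```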
From what you ask for I conclude that you actually need a search library, not a full search engine (service). That library is Lucene. Perhaps, this will help for starters: How to extract Document Term Vector in Lucene 3.5.0. You could store the index in RAM for the sake of computing necessary bits and then get rid of the index.
I wrote an application in Java several years ago that did heavy text analysis based on Lucene. I had to custom-write the search functions to find words within a certain distance of each other. You can import your text documents into the software and have it count the term frequencies, or you can take the code and tailor it to your needs.
Free download:
If you are using Solr4 and you are not storing the text, you can use a Solr pivot on the text field. But then, obviously you will get terms after the analyzer processing:,yourText
This is a pretty heavy query, I hope you don't have too many documents that match...

how to do fuzzy search in big data

I'm new to this area and I am mostly wondering what the state of the art is and where I can read about it.
Let's assume that I just have a key/value store and I have some distance(key1,key2) defined somehow (not sure if it must be a metric, i.e. if the triangle inequality must hold always).
What I want is mostly a search(key) function which returns all items with keys up to a certain distance from the search key. Maybe that distance limit is configurable. Maybe this is also just a lazy iterator. Maybe there can also be a count limit, and an item (key,value) is in the returned set with some probability P where P = 1/distance(key,search-key) or so (i.e., the perfect match would certainly be in the set and close matches at least with high probability).
One example application is fingerprint matching in MusicBrainz. They use the AcoustId fingerprint and have defined this compare function. They use the PostgreSQL GIN Index and I guess (although I haven't fully understood/read the acoustid-server code) the GIN Partial Match Algorithm, but I haven't fully understood whether that is what I asked for and how it works.
For text, what I have found so far is to use some phonetic algorithm to simplify words based on their pronunciation. An example is here. This is mostly to break the search-space down to a smaller space. However, that has several limitations, e.g. it must still be a perfect match in the smaller space.
But anyway, I am also searching for a more generic solution, if that exists.
There is no (fast) generic solution; each application will need a different approach.
Neither of the two examples actually does traditional nearest neighbor search. AcoustID (I'm the author) is just looking for exact matches, but it searches in a very high number of hashes in hope that some of them will match. The phonetic search example uses metaphone to convert words to their phonetic representation and is also only looking for exact matches.
You will find that if you have a lot of data, exact search using huge hash tables is the only thing you can realistically do. The problem then becomes how to convert your fuzzy matching to exact search.
A common approach is to use locality-sensitive hashing (LSH) with a smart hashing method, but as you can see in your two examples, sometimes you can get away with an even simpler approach.
Btw, if you are looking specifically for text search, the simplest thing you can do is split your input into N-grams and index those. Depending on how your distance function is defined, that might give you the right candidate matches without too much work.
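The N-gram idea can be sketched as a small in-memory inverted index. This is only an illustration: the class name `NGramIndex`, the trigram size, the padding scheme, and the `min_shared` threshold are all assumptions chosen for the example, and a real system would keep the postings in a hash table or database. Each lookup is an exact match on an N-gram, which is how the fuzzy search gets turned into exact search.

```python
from collections import defaultdict

def ngrams(s, n=3):
    """Return the set of character n-grams of s, padded with spaces at both ends."""
    pad = " " * (n - 1)
    padded = f"{pad}{s.lower()}{pad}"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

class NGramIndex:
    """Inverted index from n-gram to the keys that contain it."""

    def __init__(self, n=3):
        self.n = n
        self.postings = defaultdict(set)

    def add(self, key):
        for gram in ngrams(key, self.n):
            self.postings[gram].add(key)

    def candidates(self, query, min_shared=2):
        """Return keys sharing at least min_shared n-grams with query, best first."""
        counts = defaultdict(int)
        for gram in ngrams(query, self.n):
            for key in self.postings.get(gram, ()):
                counts[key] += 1
        matches = [k for k, c in counts.items() if c >= min_shared]
        return sorted(matches, key=lambda k: (-counts[k], k))

index = NGramIndex()
for word in ["levenshtein", "lichtenstein", "einstein", "banana"]:
    index.add(word)

# Misspelled query; the candidates can then be re-ranked with the real distance.
hits = index.candidates("levenstein")
```

The candidate list is only a pre-filter. Whether shared N-gram counts are a good proxy depends on how your distance function is defined.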
I suggest you take a look at FLANN Fast Approximate Nearest Neighbors. Fuzzy search in big data is also known as approximate nearest neighbors.
This library offers you different metrics, e.g. Euclidean and Hamming, and different methods of clustering: LSH or k-means, for instance.
The search is always in 2 phases. First you feed the system with data to train the algorithm, this is potentially time consuming depending on your data.
I successfully clustered 13 million data points in less than a minute though (using LSH).
Then comes the search phase, which is very fast. You can specify a maximum distance and/or the maximum numbers of neighbors.
As Lukas said, there is no good generic solution; each domain will have its own tricks to make it faster or to find a better way using the inner properties of the data you're using.
Shazam uses a special technique with geometrical projections to quickly find your song. In computer vision we often use the BOW: Bag of words, which originally appeared in text retrieval.
If you can see your data as a graph, there are other methods for approximate matching using spectral graph theory for instance.
Let us know.
Depending on what your keys/values are like, the Levenshtein algorithm (also called edit distance) can help. It calculates the least number of edit operations that are necessary to modify one string to obtain another string.
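A minimal sketch of the edit distance computation, using the standard dynamic programming formulation with insertions, deletions, and substitutions each costing 1. The `search` helper is a hypothetical illustration of the search(key) function from the question; it scans every key linearly, so it is only practical for small stores.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def search(store, query, max_distance):
    """Return all items whose key is within max_distance of query (linear scan)."""
    return {k: v for k, v in store.items() if levenshtein(k, query) <= max_distance}

store = {"kitten": 1, "sitting": 2, "mitten": 3, "banana": 4}
near = search(store, "kitten", 1)
```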

Finding a handwritten dataset with already extracted features

I want to test my clustering algorithms on handwritten text data, so I'm searching for a dataset of handwritten text (e.g. words) with already extracted features (the goal is to test my clustering algorithms, not to extract features). Does anyone have any information on that?
There is a dataset of images of handwritten digits : .
Texmex has 128d SIFT vectors "to evaluate the quality of approximate nearest neighbors search algorithm on different kinds of data and varying database sizes", but I don't know what their images are of; you could try asking the authors.

Dataset for Apriori algorithm

I am going to develop an app for market basket analysis (using the Apriori algorithm), and I found a dataset with more than 90,000 transaction records.
The problem is that this dataset doesn't include the names of the items; it only contains their barcodes.
I have just started the project and am researching the Apriori algorithm. Can anyone help me with this case? What is the best way to implement the algorithm using the following dataset?
These kinds of datasets are considered critical information, and chain stores will not give them to you, but you can generate a sample dataset yourself using SQL Server.
The algorithm is defined independent of the identifiers used for the object. Also, you didn't post the 'following data set' :P If your problem is that the algorithm expects your items to be numbered 0,1,2,... then just scan your data set and map each individual barcode to a number.
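The barcode-to-number mapping can be sketched as follows. The function name `encode_transactions` and the sample barcodes are made up for illustration, and the sketch assumes each transaction is available as a list of barcode strings.

```python
def encode_transactions(transactions):
    """Map each distinct barcode to 0, 1, 2, ... in order of first appearance."""
    barcode_to_id = {}
    encoded = []
    for basket in transactions:
        ids = set()
        for barcode in basket:
            # Assign the next free integer the first time a barcode is seen.
            ids.add(barcode_to_id.setdefault(barcode, len(barcode_to_id)))
        encoded.append(sorted(ids))
    return encoded, barcode_to_id

# Hypothetical transactions; real data would be read from the dataset file.
transactions = [
    ["5901234123457", "4006381333931"],
    ["4006381333931", "9780201379624"],
    ["5901234123457"],
]
encoded, barcode_to_id = encode_transactions(transactions)

# Invert the mapping to translate frequent itemsets back to barcodes later.
id_to_barcode = {i: b for b, i in barcode_to_id.items()}
```

The encoded baskets can be fed to any Apriori implementation that expects integer item IDs, and `id_to_barcode` turns the resulting itemsets back into barcodes.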
If you're interested, there's been some papers on how to represent frequent item sets very efficiently:
The algorithm does not need the name of the items.
