Finding a handwritten-text dataset with already extracted features

I want to test my clustering algorithms on handwritten-text data, so I'm searching for a dataset of handwritten text (e.g. words) with already extracted features (the goal is to test my clustering algorithms, not to extract features). Does anyone have any information on that?
Thanks.

There is a dataset of images of handwritten digits, MNIST: http://yann.lecun.com/exdb/mnist/

Texmex has 128-dimensional SIFT vectors, intended "to evaluate the quality of approximate nearest neighbors search algorithm on different kinds of data and varying database sizes", but I don't know what their images are of; you could try asking the authors.

Related

Data mining and Weka

Hi, I've been asked to search for at least 20 different datasets (with a maximum of 40 datasets). I need to apply the following classification techniques, using the WEKA software, to the chosen datasets:
(1) Decision tree (SimpleCart),
(2) Naïve Bayes, and
(3) K-NN (IBk) (with K taking values from 1 up to the number of class labels in the dataset)
Once you have applied WEKA to all the datasets, it is required to accomplish the following tasks:
Compare the performance of the applied techniques as measured through WEKA.
Analyse the results with regard to the dataset properties.
I've never used Weka before and am unsure how to apply the classification techniques and what I'm actually comparing, but I'm quick at learning. I'm not really sure what I'm required to do... I just need some direction or an example. Please, anyone?
To find datasets, you can use
https://archive.ics.uci.edu/ml/datasets.html
To compare the performance of classifiers, there are many measures, such as AUC (Area Under Curve), the ROC curve, accuracy, precision, and recall. Weka can generate these measures. I recommend using AUC and accuracy.
To learn how to use Weka, there are many online tutorials, such as http://www.ibm.com/developerworks/library/os-weka2/
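If you end up scripting the comparison rather than clicking through the Explorer GUI, something along these lines works with Weka's Java API. This is a minimal sketch: the ARFF filename is a placeholder, and SimpleCart is omitted because it lives in weka.classifiers.trees in older Weka versions and in the separate simpleCART package in newer ones.

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaCompare {
    public static void main(String[] args) throws Exception {
        // Load an ARFF file, e.g. one downloaded from the UCI repository.
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // class is usually the last attribute

        Classifier[] classifiers = { new NaiveBayes(), new IBk(3) }; // K=3 for K-NN
        for (Classifier c : classifiers) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1)); // 10-fold cross-validation
            System.out.println(c.getClass().getSimpleName()
                    + "  accuracy: " + eval.pctCorrect()
                    + "  AUC (class 0): " + eval.areaUnderROC(0));
        }
    }
}
```

Running this over each dataset and tabulating accuracy and AUC gives you exactly the comparison the assignment asks for.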

SOLR: Create term vector (like data returned from TermVectorComponent) from raw text

Using http://wiki.apache.org/solr/TermVectorComponent I can get indexed terms and their frequencies for any document stored in my index. How can I get the same information for a text, without storing the text in my index? I just want SOLR to process the text and return the information, but without having to store the document in my index.
AFAIK this isn't possible without storing data in SOLR.
If you are looking to do text analysis (I understand this is broader than what you ask for), I would recommend the alternatives below:
MAUI - keyphrase and terminology extraction
Gensim - topic modelling
Kea - keyword extraction
I've also come across some Python scripts that do term frequency analysis. Have a look at Mincemeat, particularly the example, which does term frequency calculation.
From what you ask, I conclude that you actually need a search library, not a full search engine (service). That library is Lucene. Perhaps this will help for starters: How to extract Document Term Vector in Lucene 3.5.0. You could keep the index in RAM just long enough to compute the necessary bits and then discard it.
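For illustration, here is a rough sketch of that approach against the Lucene 3.5 API referenced above (class names changed in Lucene 4+): index one throwaway document in a RAMDirectory with term vectors enabled, read the vector back, then drop the whole index.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class InMemoryTermVector {
    public static void main(String[] args) throws IOException {
        RAMDirectory dir = new RAMDirectory(); // throwaway in-memory index
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_35,
                new StandardAnalyzer(Version.LUCENE_35));
        IndexWriter writer = new IndexWriter(dir, cfg);

        Document doc = new Document();
        doc.add(new Field("text", "the quick brown fox jumps over the lazy dog",
                Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES));
        writer.addDocument(doc);
        writer.close();

        IndexReader reader = IndexReader.open(dir);
        TermFreqVector tfv = reader.getTermFreqVector(0, "text");
        String[] terms = tfv.getTerms();
        int[] freqs = tfv.getTermFrequencies();
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + ": " + freqs[i]);
        }
        reader.close();
        dir.close(); // discard the index entirely
    }
}
```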
I wrote an application in Java several years ago that did heavy text analysis based on Lucene. I had to custom-write the search functions to find words within a certain distance of each other. You can import your text documents into the software and have it count the term frequencies, or you can take the code and tailor it to your needs.
Free download:
http://www.minoesoftware.com/download.php
Source:
https://github.com/danspiteri/MINOE/blob/master/src/minoe/SearchFiles.java
If you are using Solr 4 and you are not storing the text, you can use a Solr pivot facet on the text field. But then, obviously, you will get the terms after analyzer processing:
http://192.168.0.202:8080/solr/fr_00_0425_sem/select?q=renault&wt=xml&facet=true&facet.pivot=uniqueKey,yourText
This is a pretty heavy query, I hope you don't have too many documents that match...

Dataset help for TF-IDF and Vector Model

I want to compare TF-IDF, the vector space model, and some optimizations of the TF-IDF algorithm.
For that I need a dataset (at least 100 documents of English text). I am not able to find one. Any suggestions?
It depends on the application in which you use TF-IDF. For example, if you want to find keywords, you could use the Mendeley dataset; for tagging, the Delicious data.
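As a side note, once you have a corpus, the baseline TF-IDF weight itself is straightforward to compute. A minimal sketch with toy data (plain term frequency times smoothed log-IDF; the optimizations you want to compare would replace this weighting):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class TfIdf {
    // TF-IDF of a term in one document, given the whole corpus
    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        double tf = Collections.frequency(doc, term) / (double) doc.size();
        long docsWithTerm = corpus.stream().filter(d -> d.contains(term)).count();
        // +1 in the denominator avoids division by zero for unseen terms
        double idf = Math.log(corpus.size() / (double) (1 + docsWithTerm));
        return tf * idf;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("the", "quick", "brown", "fox"),
                Arrays.asList("the", "lazy", "dog"),
                Arrays.asList("the", "fox", "and", "the", "dog"));
        System.out.println(tfIdf("quick", corpus.get(0), corpus)); // "quick" in document 0
    }
}
```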

Is there a way to rank the difficulty of pronunciation of a word?

I'm trying to build a collection of English words that are difficult to pronounce.
I was wondering if there is an algorithm of some kind or a theory, that can be used to show how difficult a word is to pronounce.
Does this appear to you as something that can be computed?
As this seems to be a very subjective thing, let me make it more objective: let's say, the hardest words to pronounce for text-to-speech technologies.
One approach would be to build a list with two versions of each word: one the correct spelling, and the other the word spelled using the simplest phonetic spelling. Apply a distance function to the two versions (like the Levenshtein distance, http://en.wikipedia.org/wiki/Levenshtein_distance). The greater the distance between the two versions, the harder the word would be to pronounce.
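A quick sketch of that idea, using the standard dynamic-programming edit distance (the phonetic respellings in the example are made up for illustration):

```java
public class PronunciationDifficulty {
    // classic dynamic-programming Levenshtein edit distance
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // spelling vs. a simple phonetic respelling (illustrative values)
        System.out.println(levenshtein("yacht", "yot"));   // larger gap -> harder
        System.out.println(levenshtein("grasp", "grasp")); // zero gap -> easier
    }
}
```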
Great problem! Off the top of my head: you could create a system that contains all the letters of the phonetic alphabet, with weights connecting every combination based on difficulty (highly subjective, so it may need multiple people testing and averaging the results, etc.). Then keep a list of all words from the English dictionary on disk and run a script that cycles through each entry, scrapes Wikipedia for the phonetic spelling, and ranks its difficulty. This could take into consideration the length of the word as well as the difficulty of joining the phonetics, then order the list by difficulty.
That's what I would try to do :P
To a certain extent...
Speech programs, for example, use a system of phonetics to try to pronounce words.
For example, "grasp" would be split into:
Gr-A-Sp
However, for foreign words (or words that don't follow this pattern), exception lists have to be kept, e.g. "yacht".
Suggestion
Fortunately, pronunciation as a process depends on two factors:
the phones making up the word and the locations of vowels and semivowels, i.e.
/a/, /ae/, /e/, /i/, /o/, /u/, /w/, /j/...
the length of the word.
The first relates to the mechanics of phone production: the velum, cheeks, and tongue have to be repositioned to produce the sounds of individual phones (nasal, etc.), which makes some words more difficult to pronounce because a lot of movement may be required. Refer to books on phonetics for the articulatory position of each phone.
Algorithm
A weighted spanning tree, with the weights being the difficulty of pronouncing two consecutive phones, e.g. /l/ and /r/, or /sh/ and /s/ (a rough sketch of the scoring idea follows).
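A sketch of that scoring idea, assuming the word is already split into phones; all the pair weights here are invented placeholders that would need calibrating against real phonetics data:

```java
import java.util.HashMap;
import java.util.Map;

public class PhonePairDifficulty {
    // assumed difficulty weights for consecutive phone pairs (placeholder values)
    static final Map<String, Double> WEIGHTS = new HashMap<>();
    static {
        WEIGHTS.put("l|r", 0.9);  // hard transition
        WEIGHTS.put("sh|s", 0.8);
        WEIGHTS.put("g|r", 0.3);
        // ... fill in from phonetics references / user testing
    }

    // sum the transition weights over every consecutive phone pair
    static double difficulty(String[] phones) {
        double score = 0.0;
        for (int i = 0; i + 1 < phones.length; i++) {
            // unknown pairs default to an easy transition
            score += WEIGHTS.getOrDefault(phones[i] + "|" + phones[i + 1], 0.1);
        }
        return score;
    }

    public static void main(String[] args) {
        // "grasp" split into phones, as in the earlier answer
        System.out.println(difficulty(new String[] {"g", "r", "a", "s", "p"}));
    }
}
```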
Good luck.

Dataset for Apriori algorithm

I am going to develop an app for market basket analysis (using the Apriori algorithm), and I found a dataset that has more than 90,000 transaction records.
The problem is that this dataset doesn't have the names of the items in it; it only contains the items' barcodes.
I have just started the project and am doing research on the Apriori algorithm. Can anyone help me with this case? What is the best way to implement this algorithm using the following dataset?
These kinds of datasets are considered critical information, and chain stores will not give it out, but you can generate a sample dataset yourself using SQL Server.
The algorithm is defined independently of the identifiers used for the objects. Also, you didn't post the 'following data set' :P If your problem is that the algorithm expects your items to be numbered 0, 1, 2, ..., then just scan your data set and map each individual barcode to a number (see the sketch after this answer).
If you're interested, there have been some papers on how to represent frequent item sets very efficiently: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.163.4827&rep=rep1&type=pdf
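For illustration, a minimal sketch of that renumbering step (the barcodes are toy data; keep the map around so you can translate frequent itemsets back to barcodes afterwards):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BarcodeMapper {
    public static void main(String[] args) {
        // each transaction is a list of barcodes (toy data)
        List<List<String>> transactions = Arrays.asList(
                Arrays.asList("4006381333931", "9780201379624"),
                Arrays.asList("4006381333931", "4012345678901"));

        Map<String, Integer> ids = new HashMap<>();
        List<List<Integer>> encoded = new ArrayList<>();
        for (List<String> t : transactions) {
            List<Integer> row = new ArrayList<>();
            for (String barcode : t) {
                // assign the next free integer id the first time a barcode is seen
                row.add(ids.computeIfAbsent(barcode, b -> ids.size()));
            }
            encoded.add(row);
        }
        System.out.println(encoded); // e.g. [[0, 1], [0, 2]] -- ready for Apriori
    }
}
```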
The algorithm does not need the names of the items.
