Data mining and Weka - database

Hi, I've been asked to search for at least 20 different datasets (with a maximum of 40 datasets). I need to apply the following classification techniques, using the WEKA software, to the chosen datasets:
(1) Decision tree (SimpleCart),
(2) Naïve Bayes, and
(3) K-NN (IBk) (with K taking the value of 1 up to the number of class labels in the dataset)
Once you have applied WEKA to all the datasets, you are required to accomplish the following tasks:
Compare the performance of the applied techniques, as reported by WEKA.
Analyse the results with regard to the dataset properties.
I've never used Weka before, and I'm unsure how to apply the classification techniques and what I'm actually comparing, but I'm a quick learner. I'm not really sure what I'm required to do... I just need some direction or an example, please. Anyone?

To find datasets, you can use
https://archive.ics.uci.edu/ml/datasets.html
To compare the performance of classifiers, there are many measures, such as AUC (Area Under the Curve), the ROC curve, accuracy, precision, and recall. Weka can generate all of these. I recommend using AUC and accuracy.
To learn how to use Weka, there are many online tutorials, such as http://www.ibm.com/developerworks/library/os-weka2/
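If you prefer scripting the experiment to clicking through the Explorer, a minimal sketch along the following lines shows what ends up being compared: each classifier is run with 10-fold cross-validation on each dataset, and you record its accuracy and AUC. This is only an assumption-laden sketch: it assumes the third-party python-weka-wrapper3 package, that Weka's simpleCART package is installed, and a placeholder iris.arff file. The Weka Explorer or Experimenter gives you the same numbers.

# Hedged sketch: driving Weka's own classifiers from Python via the
# third-party python-weka-wrapper3 package (assumed installed, along with
# Weka's "simpleCART" package); "iris.arff" is a placeholder dataset.
import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.core.classes import Random
from weka.classifiers import Classifier, Evaluation

jvm.start(packages=True)

loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_file("iris.arff")
data.class_is_last()                       # class label is the last attribute

classifiers = {
    "SimpleCart":  Classifier(classname="weka.classifiers.trees.SimpleCart"),
    "Naive Bayes": Classifier(classname="weka.classifiers.bayes.NaiveBayes"),
    "IBk (k=3)":   Classifier(classname="weka.classifiers.lazy.IBk",
                              options=["-K", "3"]),   # repeat for k = 1 .. number of classes
}

for name, cls in classifiers.items():
    evl = Evaluation(data)
    evl.crossvalidate_model(cls, data, 10, Random(1))   # 10-fold cross-validation
    print(f"{name}: accuracy={evl.percent_correct:.2f}%, "
          f"AUC(class 0)={evl.area_under_roc(0):.3f}")

jvm.stop()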

Related

Compare MAPE for markets with different volatilities

I am trying to compare the forecast accuracy of a number of methods using MAPE across different commodity markets, such as corn, wheat, soybeans, coffee, and cotton. Obviously the relative MAPEs are impacted by the relative volatilities of each commodity: a high MAPE for wheat may simply reflect a volatile market, not necessarily a poor forecast.
I am wondering how to correct for this: some kind of volatility-adjusted MAPE, I suppose, but I cannot find any literature on this. Alternatively, I was thinking of comparing the MAPE of a given forecast method with the MAPE of a naïve forecast... this should also correct for the volatility difference somewhat, I suppose.
Any further suggestions/comments are greatly appreciated.
I'm not aware of any measure that directly incorporates volatility in order to enable comparison across markets. I would also question the relevance of directly comparing accuracy measures across series like that, as the accuracy depends, as you point out, on the volatility/signal-to-noise ratio of the time series.
I approach a problem like this the way you also suggest: create a naïve forecast, use it as the lowest acceptable accuracy for that series, and also treat it as an initial measure of the forecastability of the series.
Note: I define a naïve forecast as a very simple forecast model - naive1, naive2, a moving average, or a combination of those - where no further work needs to be done on parameters.
Have a look at the work of Michael Gilliland on FVA (Forecast Value Added) for inspiration.
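As a rough illustration of that benchmark idea, here is a minimal sketch (the series and forecasts are made-up placeholders): compute the MAPE of your method and of a naive1 last-value forecast on the same series, then look at their ratio, which is more comparable across markets with different volatilities than the raw MAPE.

# Hedged sketch: MAPE of a candidate method vs. a naive1 (previous-actual)
# forecast for one commodity series. All numbers are hypothetical.
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))

actual         = np.array([102.0,  98.0, 110.0, 105.0, 115.0])
model_forecast = np.array([100.0, 101.0, 107.0, 108.0, 112.0])   # your method
naive_forecast = np.array([100.0, 102.0,  98.0, 110.0, 105.0])   # naive1: previous actual

mape_model = mape(actual, model_forecast)
mape_naive = mape(actual, naive_forecast)

# A ratio below 1 means the method adds value over the naive benchmark for
# this series, regardless of how volatile the series itself is.
print(f"model MAPE {mape_model:.1f}%  naive MAPE {mape_naive:.1f}%  "
      f"ratio {mape_model / mape_naive:.2f}")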

Recommendation for mailing address matching scenario?

My SQL Server database contains 2 tables with a similar set of fields for a mailing (physical) address. NB these tables are populated before the data gets to my database (I can't change that). The sets of fields in the two tables are similar though not identical - most fields exist in both tables, some only in one, some only in the other. The goal is to determine with "high confidence" whether or not two mailing addresses match.
Example fields:
Street Number
Predirection
Street Name
Street Suffix
Postdirection (one table and not the other)
Unit name (one table) vs Address 2 (other table) - adds complexity
Zip code (length varies between the tables: 5 vs 5+ digits)
Legal description
Ideally I'd like a simple way to call a "function" which returns either a boolean or a confidence level of match (0.0 - 1.0). This call can be made in SQL or Python within my solution; free/open source is highly preferred by the client.
Among options such as SOUNDEX, DIFFERENCE, and Levenshtein distance (all SQL), and usaddress and dedupe (Python), none stands out as a good-fit solution.
Ideally I'd like a simple way to call a "function" which returns either a boolean or a confidence level of match (0.0 - 1.0).
A similarity metric is what you're looking for. You can use distance metrics to calculate similarity; the Levenshtein distance, Damerau-Levenshtein distance, and Hamming distance are examples.
With M the length of the shorter of the two strings, N the length of the longer, and D your distance metric, you can measure string similarity using (M - D)/N. You can also use the Longest Common Subsequence or Longest Common Substring (LCS) to measure similarity, by dividing the LCS length by N (LCS/N).
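As a concrete illustration of the (M - D)/N idea, here is a minimal, dependency-free sketch; the two address strings are made up, and a real pipeline would normalise abbreviations such as "St"/"Street" before scoring.

# Hedged sketch: the (M - D)/N similarity described above, using a plain
# dynamic-programming Levenshtein distance (no external libraries; the two
# addresses are hypothetical examples).
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert/delete/substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """(M - D)/N with M = shorter length, N = longer length, D = distance, clamped at 0."""
    m, n = sorted((len(a), len(b)))
    d = levenshtein(a, b)
    return max(0.0, (m - d) / n) if n else 1.0

print(similarity("123 N Main St Apt 4", "123 North Main Street #4"))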
If you can use CLRs I HIGHLY recommend mdq.similarity which you can get from here. It will give a similarity metric using these algorithms:
The Damerau-Levenshtein distance (the documentation only says "Levenshtein", but it is mistaken)
The Jaccard similarity coefficient algorithm
A form of the Jaro-Winkler distance algorithm
A longest common subsequence algorithm (which grows by one when transpositions are involved)
If performance is important (these metrics can be quite slow depending on what you're feeding them), then I would get familiar with my Bernie function. It's designed to help measure similarity using any of the aforementioned algorithms much, much faster. Bernie is 100% open source and can easily be re-created in any language (Python, C#, etc.). Ditto my N-Grams function.
You can easily create your own metric using NGrams8K.
For pure T-SQL versions of Levenshtein or the Longest Common Subsequence you can check Phil Factor's blog. (Note these cannot compete with the CLR I mentioned).
I'll stop for now. The best advice can be given after we better understand what is making the strings different (note my question under your comment).

How to run a calculation over a lot of records in a DB in reasonable time

Suppose I have a vector in my application (for example, (5, 4, 6, 8)) and I want to find similar vectors in my DB; let's say for simplicity that I'm calculating the distance between two vectors with the Manhattan distance.
What I need is a way to compute that distance (Manhattan distance in my example) between my vector and all the vectors stored in my DB. Can I do 10 million vectors in under a couple of seconds?
If you really deal with a lot of data, what you need is an Approximate Nearest Neighbor (ANN) implementation - http://en.wikipedia.org/wiki/Nearest_neighbor_search#Approximate_nearest_neighbor. Take a look at the Annoy project page - https://pypi.python.org/pypi/annoy/1.8.0. It points to a benchmark against other ANN projects which you may find interesting. Maybe there is an implementation as a plugin for a DB, but I am not aware of one. However, ANN can also be used to pre-compute the top-n nearest neighbors and store them in the DB as a list per user/item.
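For a sense of how it looks in practice, here is a minimal Annoy sketch (assuming a recent annoy release, which supports the "manhattan" metric, and random placeholder vectors standing in for the DB export). The index is built once offline; individual queries then take milliseconds even over millions of items.

# Hedged sketch: an Annoy index over Manhattan distance. The vectors are
# random placeholders for whatever gets exported from the database.
import random
from annoy import AnnoyIndex

dim = 4                                  # e.g. vectors like (5, 4, 6, 8)
index = AnnoyIndex(dim, "manhattan")

for i in range(100000):                  # scale this loop up to your 10 million rows
    index.add_item(i, [random.random() for _ in range(dim)])

index.build(10)                          # 10 trees: more trees = better recall, bigger index
index.save("vectors.ann")                # the saved index can be memory-mapped later

ids, dists = index.get_nns_by_vector([5, 4, 6, 8], 10, include_distances=True)
print(ids, dists)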

Finding a handwritten dataset with already extracted features

I want to test my clustering algorithms on handwritten-text data, so I'm searching for a dataset of handwritten text (e.g. words) with already extracted features (the goal is to test my clustering algorithms, not to extract features). Does anyone have any information on that?
Thanks.
There is a dataset of images of handwritten digits: http://yann.lecun.com/exdb/mnist/
Texmex has 128-dimensional SIFT vectors, "to evaluate the quality of approximate nearest neighbors search algorithm on different kinds of data and varying database sizes", but I don't know what their images are of; you could try asking the authors.
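If any fixed-length feature matrix will do for a first test, a small sketch like this one shows the drop-in usage (assuming scikit-learn, whose bundled 8x8 digits set is a small UCI stand-in for MNIST); your own clustering algorithm would replace KMeans.

# Hedged sketch: a handwritten-digits dataset that already comes as
# fixed-length feature vectors, used to exercise a clustering algorithm.
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import adjusted_rand_score

digits = load_digits()                   # digits.data: (1797, 64) pixel-intensity features
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(digits.data)

# The true digit labels are included, so clustering quality can be scored.
print("adjusted Rand index:", adjusted_rand_score(digits.target, labels))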

What classifiers to use for deciding if two datasets depict the same individual?

Suppose I have pictures of faces of a set of individuals. The question I'm trying to answer is: "do these two pictures represent the same individual"?
As usual, I have a training set containing several pictures for a number of individuals. The individuals and pictures the algorithm will have to process are of course not in the training set.
My question is not about image processing algorithms or particular features I should use, but on the issue of classification. I don't see how traditional classifier algorithms such as SVM or Adaboost can be used in this context. How should I use them? Should I use other classifiers? Which ones?
NB: my real application is not faces (I don't want to disclose it), but it's close enough.
Note: the training dataset isn't enormous, in the low thousands at best. Each dataset is pretty big though (a few megabytes), even if it doesn't hold a lot of real information.
You should probably look at the following methods:
P. Jonathon Phillips, "Support Vector Machines Applied to Face Recognition," NIPS 1998, pp. 803-809.
Haibin Ling, Stefano Soatto, Narayanan Ramanathan, and David W. Jacobs, "A Study of Face Recognition as People Age," IEEE International Conference on Computer Vision (ICCV), 2007.
These papers describe applying SVMs to same-person/different-person problems like the one you describe. If the alignment of the features (eyes, nose, mouth) is good, these methods work very nicely.
How big is your dataset?
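One common way to make a standard classifier such as an SVM fit the same/different formulation (loosely the idea behind the papers above, though their details differ) is to train it on the difference of the two feature vectors of a pair. Here is a hedged sketch with random placeholder features; feature extraction is assumed to happen elsewhere.

# Hedged sketch: verification as binary classification on pair differences.
# The arrays below are random placeholders for real extracted features.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

a = rng.normal(size=(200, 64))           # features of the first item of each pair
b = rng.normal(size=(200, 64))           # features of the second item of each pair
y = rng.integers(0, 2, size=200)         # 1 = same individual, 0 = different

pair_features = np.abs(a - b)            # one vector per pair
clf = SVC(kernel="rbf", probability=True).fit(pair_features, y)

# For a new pair, the positive-class probability acts as a match confidence.
new_pair = np.abs(a[0] - b[1]).reshape(1, -1)
print(clf.predict_proba(new_pair)[0, 1])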
I would start this problem by coming up with some kind of distance metric (say, Euclidean) that characterizes differences between images (such as differences in color, shape, etc., or local differences). Two images representing the same individual would have a small distance compared with two images representing different individuals, though this depends heavily on the type of dataset you are working with.
Forgive me for stating the obvious, but why not use any supervised classifier (SVM, GMM, k-NN, etc.), get one label for each test sample (e.g., face, voice, text, etc.), and then see if the two labels match?
Otherwise, you could perform a binary hypothesis test. H0 = two samples do not match. H1 = two samples match. For two test samples, x1 and x2, compute a distance, d(x1, x2). Choose H1 if d(x1, x2) < epsilon and H0 otherwise. Adjusting epsilon will adjust your probability of detection and probability of false alarm. Your application would dictate which epsilon is best; for example, maybe you can tolerate misses but cannot tolerate false alarms, or vice versa. This is called Neyman-Pearson hypothesis testing.
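A minimal sketch of that threshold test, with hypothetical feature vectors and an arbitrary epsilon (in practice epsilon is tuned on labelled pairs to balance misses against false alarms):

# Hedged sketch: choose H1 (same individual) when d(x1, x2) < epsilon.
import numpy as np

def same_individual(x1, x2, epsilon):
    """H1 if the Euclidean distance between the feature vectors is below epsilon."""
    d = np.linalg.norm(np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float))
    return d < epsilon

x1 = np.array([0.12, 0.80, 0.33])        # features extracted from sample 1
x2 = np.array([0.10, 0.78, 0.35])        # features extracted from sample 2
print(same_individual(x1, x2, epsilon=0.1))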
