Random Forest/SVM in time series classification - c

I have a classification task and tried some feature extraction (such as mean and standard deviation). The overall accuracy using random forest was around 85%, so I thought of looking for better algorithms. Is it possible to extract the formula of the decision curve produced by the random forest, so that every test sample can simply be checked against that curve (or line)? How about SVM, since it produces a line (or a curve) that separates the data for classification? (Are there any other classification algorithms that allow this, i.e. extracting the final curve function as in curve fitting?)
I have done the task in scikit-learn but want to port the model to C, so any recommendation based on C would be of high priority.

Related

Logistic Regression with Gradient Descent on large data

I have a training set with about 300,000 examples and about 50-60 features, and it is a multiclass problem with about 7 classes. I have a logistic regression function that fits the parameters using gradient descent. My gradient descent algorithm updates the parameters in matrix form, since that is faster than updating them one by one in loops.
Ex :
Matrix(P) <- Matrix(P) - LearningRate * ( T(Matrix(X)) * ( Matrix(h(X)) - Matrix(Y) ) )
For small training data it is quite fast and gives correct values with a maximum of around 1,000,000 iterations. With the full training set, however, it is extremely slow: around 500 iterations take 18 minutes, and after that many iterations the cost is still high and the classes are not predicted correctly.
I know I should probably implement feature selection or feature scaling, but I can't use the packages provided. The language used is R. How do I go about implementing feature selection or scaling without using any library packages?
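According to link, you can use either Z-score normalization or min-max scaling. Z-score normalization rescales each feature to mean 0 and standard deviation 1, while min-max scaling maps each feature to the [0,1] range. Z-score normalization is calculated as:
z = (x - mean(x)) / sd(x)
Min-max scaling is calculated as:
x_scaled = (x - min(x)) / (max(x) - min(x))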
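Both scalings need nothing beyond basic arithmetic, so they can be written without any packages. The following is a minimal sketch in plain Python (it should translate line by line to base R). The function names zscore and minmax are made up for illustration, and the use of the sample standard deviation (dividing by n - 1, as R's sd() does) is an assumption, not something the question specifies.

```python
import math

def zscore(values):
    """Rescale one feature column to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: sample standard deviation (n - 1 in the denominator).
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    sd = math.sqrt(var)
    if sd == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0] * n
    return [(v - mean) / sd for v in values]

def minmax(values):
    """Rescale one feature column linearly into the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

column = [2.0, 4.0, 6.0, 10.0]
print(minmax(column))
print(zscore(column))
```

Each feature column is scaled on its own. The mean, standard deviation, minimum and maximum computed on the training set would normally be reused to scale any test data.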

How to calculate a lot of records in DB with reasonable time

If I have a vector (for example (5,4,6,8)) in my application and I want to find its similarity to other vectors in my DB, let's say for simplicity that I'm calculating the distance between two vectors with Manhattan distance.
What I need is a way to compute the measure (Manhattan distance in my example) between my vector and all the vectors stored in my DB. Can I do this for 10 million vectors in under a couple of seconds?
If you really deal with a lot of data, what you need is an approximate nearest neighbor (ANN) implementation - http://en.wikipedia.org/wiki/Nearest_neighbor_search#Approximate_nearest_neighbor. Take a look at the Annoy project page - https://pypi.python.org/pypi/annoy/1.8.0. There is a benchmark against other ANN projects which you may find interesting. Maybe there is an implementation available as a DB plugin, but I am not aware of one. However, ANN can also be used to pre-compute the top-n nearest neighbors and store them in the DB as a list per user/item.
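For reference, the exact (non-approximate) search that an ANN index replaces looks roughly like the sketch below. It is a plain Python baseline, with the hypothetical names manhattan and nearest, and it assumes the vectors have already been loaded from the DB into a list. It makes no claim about meeting the timing target for 10 million vectors.

```python
import heapq

def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest(query, vectors, n=3):
    """Return the n closest vectors as (distance, index) pairs, closest first."""
    scored = ((manhattan(query, v), i) for i, v in enumerate(vectors))
    # nsmallest keeps only n candidates in memory while scanning all vectors.
    return heapq.nsmallest(n, scored)

stored = [(5, 4, 6, 8), (0, 0, 0, 0), (5, 5, 6, 8), (9, 9, 9, 9)]
print(nearest((5, 4, 6, 8), stored, n=2))
```

The scan is linear in the number of stored vectors, which is the cost an ANN library such as Annoy trades away in exchange for approximate results.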

Is dimensionality reduction reversible?

I have implemented a dimensionality reduction algorithm using ENCOG that takes a dataset (call it A) with multiple features and reduces it to a dataset (B) with only one feature (I need that for time series analysis).
Now my question is: given a value from B predicted by the time series analysis, can I convert it back to two dimensions, as in the A dataset?
No, dimensionality reduction is not reversible in general. It loses information.
Dimensionality reduction (compression of information) is approximately reversible in auto-encoders. An auto-encoder is a regular neural network with a bottleneck layer in the middle. You have, for instance, 20 inputs in the first layer, 10 neurons in the middle layer and again 20 neurons in the last layer. When you train such a network, you force it to compress the information into 10 neurons and then uncompress it again, minimizing the error in the last layer (the desired output vector equals the input vector). When you train such a network with the well-known back-propagation algorithm and the units are linear, it learns essentially the same subspace as PCA (Principal Component Analysis). PCA returns uncorrelated features. It's not very powerful.
By using a more sophisticated algorithm to train the auto-encoder, you can make it perform nonlinear ICA (Independent Component Analysis). ICA returns statistically independent features. This training algorithm searches for low-complexity neural networks with high generalization capability. As a byproduct of regularization you get ICA.
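To make the "loses information" point concrete, here is a small sketch that reduces 2-D points to one feature with PCA and then maps them back. It is plain Python, not ENCOG code. The names pca_1d, encode and decode are hypothetical, and the closed-form angle for the principal axis only applies to the 2-D case assumed here.

```python
import math

def pca_1d(points):
    """Fit the mean and first principal axis of a list of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Orientation of the major axis of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))

def encode(p, mean, axis):
    """Reduce a 2-D point to a single coordinate along the principal axis."""
    return (p[0] - mean[0]) * axis[0] + (p[1] - mean[1]) * axis[1]

def decode(z, mean, axis):
    """Map the single coordinate back to the closest point on the axis."""
    return (mean[0] + z * axis[0], mean[1] + z * axis[1])

points = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 7.0)]
mean, axis = pca_1d(points)
for p in points:
    print(p, "->", decode(encode(p, mean, axis), mean, axis))
```

The decoded points always lie on a single line, so the original points are recovered exactly only when they were on that line to begin with. Any component perpendicular to the axis is discarded by encode and cannot be restored.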

Data mining and weka

Hi, I've been asked to find at least 20 and at most 40 different datasets. I need to apply the following classification techniques to the chosen datasets using the WEKA software:
(1) Decision tree (SimpleCart),
(2) Naïve Bayes, and
(3) K-NN (IBk) (with K taking the value of 1 up to the number of class labels in the dataset)
Once you have applied WEKA on all the datasets, it is required to accomplish the following tasks:
Compare the performance of the applied techniques you have achieved through WEKA.
Analyse the results with regards to the dataset properties.
I've never used WEKA before and am unsure how to apply the classification techniques and what I am actually comparing, but I am quick at learning. I'm not really sure what I am required to do. I just need some direction or an example, please.
To find datasets, you can use
https://archive.ics.uci.edu/ml/datasets.html
To compare the performance of classifiers, there are many measures, such as AUC (Area Under Curve), the ROC curve, accuracy, precision and recall. Weka can generate these measures. I recommend using AUC and accuracy.
To learn how to use Weka, there are many online tutorials, such as http://www.ibm.com/developerworks/library/os-weka2/
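To show what is actually being compared, here is a minimal sketch of how accuracy, precision and recall are derived from true and predicted labels. It is plain Python rather than Weka output, the function name binary_metrics is hypothetical, and it assumes a two-class problem with one class treated as "positive".

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
```

Computing the same numbers for each classifier on each dataset gives a table that can then be related to dataset properties such as size or number of classes.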

Algorithm for voice comparison

Given two recorded voices in digital format, is there an algorithm to compare the two and return a coefficient of similarity?
I recommend taking a look at the HTK toolkit for speech recognition http://htk.eng.cam.ac.uk/, especially the part on feature extraction.
Features that I would assume to be good indicators:
Mel-Cepstrum coefficients (general timbre)
LPC (for the harmonics)
Given your clarification I think what you are looking for falls under speech recognition algorithms.
Even though you are only looking for the measure of similarity and not trying to turn speech into text, still the concepts are the same and I would not be surprised if a large part of the algorithms would be quite useful.
However, you will have to define this coefficient of similarity more formally and precisely to get anywhere.
EDIT:
I believe speech recognition algorithms would be useful because they do abstraction of the sound and comparison to some known forms. Conceptually this might not be that different from taking two recordings, abstracting them and comparing them.
From the Wikipedia article on HMM:
"In speech recognition, the hidden Markov model would output a sequence of n-dimensional real-valued vectors (with n being a small integer, such as 10), outputting one of these every 10 milliseconds. The vectors would consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short time window of speech and decorrelating the spectrum using a cosine transform, then taking the first (most significant) coefficients."
So if you run such an algorithm on both recordings you would end up with coefficients that represent the recordings and it might be far easier to measure and establish similarities between the two.
But again now you come to the question of defining the 'similarity coefficient' and introducing dogs and horses did not really help.
(Well it does a bit, but in terms of evaluating algorithms and choosing one over another, you will have to do better).
There are many different algorithms - the general name for this task is Speaker Identification - start with this Wikipedia page and work from there: http://en.wikipedia.org/wiki/Speaker_recognition
I'm not sure this will work for sound files, but I hope it gives you an idea of how to proceed. This is a basic way of finding a pattern (image) in another image.
You first have to calculate the FFT of both sound files and then do a correlation. As a formula it would look like this (pseudocode):
fftSoundFile1 = fft(soundFile1);
fftConjSoundFile2 = conj(fft(soundFile2));
result_corr = real(ifft(fftSoundFile1.*fftConjSoundFile2));
Where fft= fast Fourier transform, ifft = inverse, conj = conjugate complex.
The fft is performed on the sample values of the soundfiles.
The peaks in the result_corr vector will then give you the positions of high correlation.
Note that both sound files must in this case be of the same size. Otherwise you have to zero-pad the shorter one to a vector of length max(soundFileLength).
Regards
Edit: .* means (in matlab style) a component wise mult, you must not do a vector mult!
Next Edit: Note that you have to operate with complex numbers - but there are several Complex classes out there so I think you don't have to bother about this.
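The correlation recipe above can be sketched in plain Python as follows. This is only an illustration: the names dft and circular_xcorr are hypothetical, the sample lists stand in for decoded sound files, and a naive O(n^2) discrete Fourier transform is used in place of a real FFT so that the example needs nothing beyond the standard library.

```python
import cmath

def dft(x, inverse=False):
    """Naive discrete Fourier transform (stand-in for a real FFT routine)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out

def circular_xcorr(a, b):
    """Circular cross-correlation: real(ifft(fft(a) .* conj(fft(b))))."""
    n = max(len(a), len(b))
    # Zero-pad the shorter signal so both have the same length.
    a = list(a) + [0.0] * (n - len(a))
    b = list(b) + [0.0] * (n - len(b))
    fa = dft(a)
    fb_conj = [v.conjugate() for v in dft(b)]
    prod = [p * q for p, q in zip(fa, fb_conj)]  # component-wise product
    return [v.real for v in dft(prod, inverse=True)]

pattern = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]
shifted = [0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0]  # pattern delayed by 3
corr = circular_xcorr(shifted, pattern)
print(max(range(len(corr)), key=corr.__getitem__))
```

The index of the largest value in corr is the lag at which the two signals line up best. For real recordings, a library FFT would be needed to make this practical.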

Resources