Is the idf part of tf-idf useless when creating a random forest model? - tf-idf

I am creating a random forest model for a classification task on text. My question is: will the idf part of tf-idf help in any way?
As I understand it, idf scales each word depending on how likely it is to appear in any document. However, for a given word this scaling factor is the same in every document, which means each feature column is effectively multiplied by a constant. If that is correct, will it really be that useful to use tf-idf versus just tf on its own?
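One quick way to check is to build both representations and train the forest on each; a minimal scikit-learn sketch (the toy corpus is illustrative):

```python
# Minimal sketch: compare plain tf with tf-idf features (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog ate my homework"]

tf_only = TfidfVectorizer(use_idf=False, norm=None)  # raw term frequencies
tf_idf = TfidfVectorizer(use_idf=True, norm=None)    # tf scaled column-wise by idf

print(tf_only.fit_transform(docs).toarray())
print(tf_idf.fit_transform(docs).toarray())
# Note: idf rescales each column (term) by its own constant. For threshold-based
# learners such as decision trees and random forests, rescaling a column does not
# change which splits are achievable, only where the thresholds sit.
```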

Related

How Does Adaboost Work with Viola and Jones Algorithm?

I am implementing a working face detection algorithm in C using the Viola-Jones algorithm. I'm having trouble understanding AdaBoost for training a strong classifier.
I can compute all 5 basic Haar feature types in a single image (162,336 features in a 24x24 image). I'm pretty sure this is good and working, and my algorithm outputs an array containing all the features, sorted.
Then I started working on AdaBoost, and here's what I understand: we create a weak classifier (slightly better than random) and make a linear combination of many weak classifiers (approx. 200) to get a strong classifier.
What I don't understand is how to create this weak classifier. From what I read online:
Normalize the weights of our training examples (1 by default in the first round).
Then get a feature (here's one of my problems: do I have to process each feature of each training example? That would be 162,336 * number of examples, which would be a lot of computing power, no?).
"Apply" this feature to each image to get an optimal threshold and toggle (here's my main problem: I don't understand what "apply" means here. Compare it with each feature of the image? I really don't see what I have to do with it. I also don't understand what the threshold and the toggle are, and that's where I'm looking for help).
Then many more things to do.
I'm really looking forward to your help in understanding this!
I should have answered my own question sooner, but I forgot about it. It was a project for my computer science school, so I can provide answers.
AdaBoost is in fact fairly simple once you understand it.
First you need to compute the features for every image in your base (we used 4,000 images to have a large set). You can store them if you have enough memory, or compute them when you need them in your program. For 4,000 images with the 5 Haar feature types we used more than 16 GB of RAM (the code was written in C with no memory leaks; it was just arrays of doubles).
The training algorithm assigns a weight to each image. That weight represents how difficult it is for the algorithm to make a good prediction for that image (face or no face).
Your training algorithm will be composed of rounds (200 rounds is enough to get 90%+ correct predictions).
In the first round, every image has the same weight because the algorithm has never worked on them.
Here is how a round goes:
Find the best Haar feature among the X candidates (for each type) across the images. To do this, compare each feature to the same one (same type, dimensions and position) on every image and see whether it gives a good or a bad prediction. The feature with the best predictions among the X candidates is the best one; keep it stored. You will find 5 best features (one per type, since there are 5 types); combine them in a single struct and that is your weak classifier.
Calculate the weighted error of the classifier. The weighted error is the error of the weak classifier applied to each image, taking into account the weight assigned to each image. In later rounds, an image with a bigger weight (one the algorithm has made a lot of mistakes on) counts for much more.
Add the weak classifier and its alpha to the strong classifier (which is an array of weak classifiers). The alpha reflects the performance of the weak classifier and is determined from the weighted error. Thanks to the alpha, weak classifiers created at later stages of the algorithm, when the training is harder, have more weight in the final prediction of the strong classifier.
Update the weight of each image according to the prediction of the weak classifier you just created. If the classifier is right, the weight goes down; otherwise it goes up.
At the end of the 200 rounds of training, you will have a strong classifier composed of 200 weak classifiers. To make a prediction about a single image, apply each weak classifier to the image and take the alpha-weighted majority vote.
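To make the inside of a round concrete, here is a minimal sketch (in Python with NumPy, not the original C) of picking the best decision stump over precomputed feature values and updating the image weights; the array and function names are illustrative, not the poster's actual code:

```python
import numpy as np

def adaboost_round(feature_values, labels, weights):
    """One boosting round over precomputed features.

    feature_values: array of shape (n_features, n_images)
    labels: +1 (face) / -1 (non-face), weights: sum to 1.
    Returns the chosen stump (feature, threshold, polarity, alpha) and new weights.
    """
    best = None
    for f_idx, values in enumerate(feature_values):
        for threshold in np.unique(values):
            for polarity in (+1, -1):                        # the "toggle"
                preds = np.where(polarity * values < polarity * threshold, 1, -1)
                err = np.sum(weights[preds != labels])       # weighted error
                if best is None or err < best[0]:
                    best = (err, f_idx, threshold, polarity, preds)

    err, f_idx, threshold, polarity, preds = best
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))        # the stump's say in the final vote
    weights = weights * np.exp(-alpha * labels * preds)      # misclassified images get heavier
    weights /= weights.sum()                                 # re-normalize
    return (f_idx, threshold, polarity, alpha), weights
```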
I voluntarily simplified the explanation, but the gist is here. For more information, look here; it really helped me during my project: http://www.ipol.im/pub/art/2014/104/article.pdf
I suggest that anyone interested in AI and optimisation work on a project like this. As a student it made me really interested in AI and made me think a lot about optimisation, which I had never done before.

How to find the smallest Euclidean vector difference using a database engine?

I want to store thousands of ~100 element vectors in a database, and then I need to search for the record with the smallest difference.
e.g. when comparing [4,9,3] and [5,7,2], take the element-wise diff: [-1,2,1] and then compute the Euclidean length: sqrt(1+4+1) = 2.45.
I need to be able to search for the record containing this lowest value.
I don't think I can do this efficiently in MySQL. I hear Solr or Elasticsearch might provide a solution; can someone point me towards, or post, an example of how this kind of search can be done efficiently?
I think the answer to your question is here.
But this is also quite an interesting link.
Unfortunately, in general you have to compare the input vector to all the others in the database.
Maybe, if you know something more about your data, you can partition it into smaller subsets of vectors and reduce the number of comparisons.
In a PostgreSQL database, you can use its C/C++ extensibility to write your own function (like here) or use k-nearest-neighbor indexing. When a GPU is available, you can look at "GPU-based PostgreSQL Extensions for Scalable High-throughput Pattern Matching" for GPU extensions of PostgreSQL.
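For scale: with only thousands of ~100-element vectors, the brute-force comparison is also feasible outside the database. A minimal NumPy sketch (array names are illustrative):

```python
import numpy as np

stored = np.random.rand(5000, 100)   # all vectors from the database
query = np.random.rand(100)          # the input vector

# Euclidean length of each element-wise difference, as in the [4,9,3] vs [5,7,2] example.
distances = np.linalg.norm(stored - query, axis=1)
best_index = int(np.argmin(distances))
print(best_index, distances[best_index])
```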

Is it possible to use Lucene of Solr for image retrieval?

I am searching for a retrieval server for my image retrieval project. From what I see on the Internet, Lucene and Solr are specialized for textual search, but do you think it is possible and reasonable to adapt them for image retrieval?
You might suggest an image-specific tool like LIRE, but it has predefined feature extraction algorithms and is not very flexible for new features. Basically, all I need is to index the image features from my extraction pipeline (written in Python) with a server like Lucene or Solr and perform retrieval tasks based on Euclidean distance over the indexed features.
Any suggestion or pointer to any reference would be very useful. Thanks.
Based on your post, you could store the features as keyword fields in Lucene or ES (Solr has a strict schema definition and I don't think it would fit your needs very well, as the feature matrix is usually sparsely populated in my understanding), and have a unique ID field derived from the image hash. Then you can just search for feature values (feature1:value1 AND feature2:value2) and see what matches the query.
If you're going to work with Euclidean distances, you'll want to look into using the Spatial Features of Solr. This will allow you to index your values as coordinates, then perform indexed lookups from other points and sort by their Euclidean distances.
You might also want to look at the dist and sqedist functions.
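If you end up on Elasticsearch instead, a hedged sketch of indexing a feature vector and ranking by Euclidean (L2) distance could look like this; the index name, field names and the 128-dimension size are assumptions, and the dense_vector type plus the l2norm script function require a reasonably recent Elasticsearch and its Python client:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index with a dense_vector field for the extracted image features (assumed 128-dim).
es.indices.create(index="images", mappings={
    "properties": {
        "image_id": {"type": "keyword"},
        "features": {"type": "dense_vector", "dims": 128},
    },
})

es.index(index="images", id="img-001",
         document={"image_id": "img-001", "features": [0.1] * 128})

# Score = 1 / (1 + L2 distance), so the nearest vectors rank first.
resp = es.search(index="images", size=5, query={
    "script_score": {
        "query": {"match_all": {}},
        "script": {
            "source": "1 / (1 + l2norm(params.query_vector, 'features'))",
            "params": {"query_vector": [0.1] * 128},
        },
    },
})
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["image_id"], hit["_score"])
```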

Caching user-specific proximity searches

The situation and the goal
Imagine a user search system that provides a proximity search from a user's own position, which is specified by a decimal latitude/longitude pair. An Atlanta resident's position, for instance, would be represented by 33.756944,-84.390278, and a perimeter search by this user should yield other users in their area within a radius of 10 mi, 50 mi, and so on.
A table-valued function calculates distances and returns users accordingly, ordered by ascending distance to the user who started the search. It's always a live query, and it's a heavy and frequent one. Now, we want to build some sort of caching to reduce load.
On the way to solutions
So far, all users have been grouped by the integer portion of their lat/long. The idea is to create cache files with all users from a grid square, so accessing the relevant cache file would be easy. If a grid square contains more users than a cache file should hold, the square is quartered or further subdivided, and so on. To fully utilize a square and its cache file, multiple overlapping squares are contemplated. One deficiency of this approach is that gridding and quartering high-density metropolitan areas and spacious countryside into overlapping cache files may not be optimal.
Reading on, I stumbled upon topics like nearest neighbor searches, the Manhattan distance and tree-like space partitioning techniques such as a k-d tree, a quadtree or binary space partitioning. Also, SQL Server provides its own geographical data types and functions (though I'd guess the pure-mathematical FLOAT way has adequate performance). And of course, the crux is making user-centric proximity searches cacheable.
Question!
I haven't found many resources on this, but I'm sure I'm not the first one with this plan. Remember, it's not about the search, but about caching.
Can I scrap my approach?
Are there ways of an advantageous partitioning of users into geographical divisions of equal size?
Is there a best practice for storing spatial user information for efficient proximity searches?
What do you think of the techniques mentioned above (quadtrees, etc.) and how would you pair them with caching?
Do you know an example of successfully caching user-specific proximity search?
Can I scrap my approach?
You can adapt your approach, because, as you already noted, a quadtree uses this technique. Or you can use a geospatial extension, which is available for MySQL, too.
Are there ways of an advantageous partitioning of users into geographical divisions of equal size?
A simple fixed grid of equal size is fine when locations are equally distributed or when the area is very small. Geo locations are hardly ever equally distributed, so usually a geospatial structure is used; see the next answer.
Is there a best practice for storing spatial user information for efficient proximity searches?
A quadtree, k-d tree or R-tree.
What do you think of the techniques mentioned above (quadtrees, etc.) and how would you pair them with caching?
There is some work by Hanan Samet that describes quadtrees and caching.
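As a concrete illustration of pairing such a partition with caching, here is a minimal sketch (Python, illustrative names) that maps a lat/long to a quadtree cell key. Every user in the same cell shares the key, so the key can name a cache file, and dropping the last character of the key moves up to the parent cell (four children merge into one):

```python
def quadtree_cell(lat, lon, depth):
    """Return a cache key identifying the quadtree cell containing (lat, lon).

    The world is recursively split into four quadrants; each level appends
    one character ('0'-'3') to the key, so deeper keys mean smaller cells.
    """
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    key = []
    for _ in range(depth):
        lat_mid = (lat_lo + lat_hi) / 2
        lon_mid = (lon_lo + lon_hi) / 2
        quadrant = 0
        if lat >= lat_mid:
            quadrant |= 2
            lat_lo = lat_mid
        else:
            lat_hi = lat_mid
        if lon >= lon_mid:
            quadrant |= 1
            lon_lo = lon_mid
        else:
            lon_hi = lon_mid
        key.append(str(quadrant))
    return "".join(key)

# The Atlanta user (33.756944, -84.390278): users whose key shares this prefix
# fall into the same cell, so the cell key can name the cache file.
print(quadtree_cell(33.756944, -84.390278, depth=6))
```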

Support Vector Machine or Artificial Neural Network for text processing? [closed]

We need to decide between Support Vector Machines and Fast Artificial Neural Network (FANN) for a text processing project.
It includes contextual spelling correction and then tagging the text with certain phrases and their synonyms.
Which would be the right approach? Or is there an alternative to both of these... something more appropriate than FANN as well as SVM?
I think you'll get competitive results from both algorithms, so you should aggregate the results... think about ensemble learning.
Update:
I don't know if this is specific enough: use a Bayes Optimal Classifier to combine the predictions from each algorithm. You have to train both of your algorithms, then train the Bayes Optimal Classifier to take the algorithms' outputs as input and make optimal predictions from them.
Separate your training data into 3 parts:
1st data set will be used to train the (Artificial) Neural Network and the Support Vector Machines.
2nd data set will be used to train the Bayes Optimal Classifier by taking the raw predictions from the ANN and SVM.
3rd data set will be your qualification data set where you will test your trained Bayes Optimal Classifier.
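A minimal sketch of that three-way split, using scikit-learn stand-ins (MLPClassifier for the ANN, SVC for the SVM, and GaussianNB as a simple combiner in place of a true Bayes Optimal Classifier; the toy data is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

X, y = np.random.rand(300, 20), np.random.randint(0, 2, 300)  # toy data

# Split into three parts: base-model training, combiner training, final qualification set.
X_base, X_rest, y_base, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_comb, X_test, y_comb, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

ann = MLPClassifier(max_iter=500).fit(X_base, y_base)
svm = SVC(probability=True).fit(X_base, y_base)

def meta_features(X_part):
    # Raw predictions from both base models become the combiner's inputs.
    return np.column_stack([ann.predict_proba(X_part)[:, 1],
                            svm.predict_proba(X_part)[:, 1]])

combiner = GaussianNB().fit(meta_features(X_comb), y_comb)
print("held-out accuracy:", combiner.score(meta_features(X_test), y_test))
```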
Update 2.0:
Another way to create an ensemble of the algorithms is to use 10-fold (or more generally, k-fold) cross-validation:
Break data into 10 sets of size n/10.
Train on 9 datasets and test on 1.
Repeat 10 times and take the mean accuracy.
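With scikit-learn this comparison is only a few lines (toy data; SVC and MLPClassifier as stand-ins for the SVM and the ANN):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X, y = np.random.rand(200, 20), np.random.randint(0, 2, 200)  # toy data

for name, model in [("SVM", SVC()), ("ANN", MLPClassifier(max_iter=500))]:
    scores = cross_val_score(model, X, y, cv=10)  # 10-fold cross-validation
    print(name, "mean accuracy:", scores.mean())
```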
Remember that you can generally combine many of the classifiers and validation methods in order to produce better results. It's just a matter of finding what works best for your domain.
You might also want to take a look at maxent classifiers (log-linear models).
They're really popular for NLP problems. Modern implementations, which use quasi-Newton methods for optimization rather than the slower iterative scaling algorithms, train more quickly than SVMs. They also seem to be less sensitive to the exact value of the regularization hyperparameter. You should probably only prefer SVMs over maxent if you'd like to use a kernel to get feature conjunctions for free.
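For reference, a maxent (logistic regression / log-linear) text classifier trained with a quasi-Newton solver is only a few lines in scikit-learn; the toy documents and the C value are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["great product, loved it", "terrible, waste of money",
        "really enjoyed this", "awful experience"]
labels = [1, 0, 1, 0]

# lbfgs is a quasi-Newton optimizer; C is the inverse regularization strength.
clf = make_pipeline(TfidfVectorizer(),
                    LogisticRegression(solver="lbfgs", C=1.0))
clf.fit(docs, labels)
print(clf.predict(["loved the experience"]))
```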
As for SVMs vs. neural networks, using SVMs would probably be better than using ANNs. Like maxent models, training SVMs is a convex optimization problem. This means, given a data set and a particular classifier configuration, SVMs will consistently find the same solution. When training multilayer neural networks, the system can converge to various local minima. So, you'll get better or worse solutions depending on what weights you use to initialize the model. With ANNs, you'll need to perform multiple training runs in order to evaluate how good or bad a given model configuration is.
This question is very old. A lot of development has happened in the NLP area in the last 7 years.
Convolutional neural networks and recurrent neural networks evolved during this time.
Word embeddings: words appearing in similar contexts have similar meanings. Word embeddings are pre-trained on a task where the objective is to predict a word based on its context.
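A hedged sketch of training such embeddings with gensim's Word2Vec (toy corpus; parameter names follow gensim 4.x):

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

# Skip-gram (sg=1): learn vectors by predicting surrounding words from the centre word.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["cat"].shape)          # (50,) embedding vector
print(model.wv.most_similar("cat"))   # nearest words in the embedding space
```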
CNN for NLP:
Sentences are first tokenized into words, which are then transformed into a word embedding matrix (i.e., the input embedding layer) of dimension d.
Convolutional filters are applied to this input embedding layer to produce a feature map.
A max-pooling operation on each filter's feature map obtains a fixed-length output and reduces the dimensionality of the output.
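A minimal sketch of that CNN-for-text pipeline in PyTorch; the vocabulary size, embedding dimension and filter settings are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=100, num_filters=64,
                 kernel_size=3, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)        # input embedding layer
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size)  # convolutional filters
        self.fc = nn.Linear(num_filters, num_classes)

    def forward(self, token_ids):            # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)         # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                 # Conv1d expects (batch, channels, seq_len)
        x = torch.relu(self.conv(x))          # feature map
        x = torch.max(x, dim=2).values        # max-pooling over time -> fixed-length vector
        return self.fc(x)

logits = TextCNN()(torch.randint(0, 10000, (8, 40)))  # batch of 8 sentences, 40 tokens each
print(logits.shape)                                    # torch.Size([8, 2])
```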
Since CNNs have the shortcoming of not preserving long-distance contextual information, RNNs were introduced.
RNNs are specialized neural approaches that are effective at processing sequential information.
An RNN memorizes the results of previous computations and uses them in the current computation.
There are a few variations of RNNs: Long Short-Term Memory units (LSTMs) and Gated Recurrent Units (GRUs).
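And a comparable RNN sketch with an LSTM (a GRU variant would simply swap nn.LSTM for nn.GRU); the sizes are again illustrative:

```python
import torch
import torch.nn as nn

class TextLSTM(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):             # (batch, seq_len)
        x = self.embedding(token_ids)
        _, (h_n, _) = self.lstm(x)             # h_n: final hidden state, carries the
        return self.fc(h_n[-1])                # "memory" of the whole sequence

print(TextLSTM()(torch.randint(0, 10000, (8, 40))).shape)  # torch.Size([8, 2])
```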
Have a look at the resources below:
deep-learning-for-nlp
Recent trends in deep learning paper
You can use a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN) for NLP tasks. I think CNNs have achieved state-of-the-art results now.
