Inconsistent # of Samples NLP IMDB Sentiment Classification - arrays

I'm trying to build a sentiment classifier on the IMDB dataset. I am fairly new to NLP and data science, and I keep getting this error while trying to fit my model:
ValueError: Found input variables with inconsistent numbers of samples: [24745, 40000]
I've looked at many other threads and they all say to reshape your data, which I did. The shape of my X variable is (24745, 100) and my y variable is (40000, 1).
I am currently trying to use any of these models:
MultinomialNB
BernoulliNB
GaussianNB
I've also tried to create a TensorFlow Sequential model with a bidirectional LSTM, but that produced terrible accuracy; it was stuck around 50% for a long time, essentially random guessing.
Any help is appreciated.
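The error means X and y must have the same number of rows before calling fit. Below is a minimal sketch of keeping them aligned, assuming a hypothetical CSV with "review" and "sentiment" columns (the file name, column names, and vectorizer settings are placeholders, not the asker's actual setup):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv("imdb_reviews.csv")            # hypothetical file
vectorizer = CountVectorizer(max_features=100)
X = vectorizer.fit_transform(df["review"])      # features built from the same rows...
y = df["sentiment"]                             # ...that supply the labels
assert X.shape[0] == y.shape[0], (X.shape, y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = MultinomialNB().fit(X_train, y_train)
print(clf.score(X_test, y_test))

If the shapes differ, X was most likely built from a filtered or deduplicated subset of the reviews while y still comes from the full 40000-row dataset.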

Related

How to store FaceNet data efficiently?

I am using the FaceNet algorithm for face recognition and want to build an application on top of it, but the problem is that FaceNet returns an array of length 128, the face embedding per person.
For person identification, I have to compute the Euclidean distance between two people's face embeddings and check whether it is below a threshold. If it is, the two faces belong to the same person; otherwise they are different people.
Let's say I have to find person X in a database of 10k people. I would have to calculate the distance against each and every person's embedding, which is not efficient.
Is there any way to store these face embeddings efficiently and search for a person more efficiently?
I guess reading this blog post will help others; it goes into detail and covers most aspects of the implementation:
Face recognition on 330 million faces at 400 images per second
I recommend storing them in Redis or Cassandra; these key-value stores can hold a multidimensional vector as a value and will outperform relational databases here.
You can extract the embedding vectors with deepface. I shared a sample code snippet below.
#!pip install deepface
from deepface import DeepFace

img_list = ["img1.jpg", "img2.jpg", ...]

model = DeepFace.build_model("Facenet")
for img_path in img_list:
    img_embedding = DeepFace.represent(img_path, model=model)
    # store img_embedding into redis here
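A hedged sketch of what that "store into redis" step could look like, continuing the snippet above (a local Redis at the default port, the key naming, and represent returning a plain list of floats are assumptions, not part of the original answer):

import json
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)   # assumes a local Redis server
for img_path in img_list:
    img_embedding = DeepFace.represent(img_path, model=model)
    # serialize the 128-d vector and store it under a per-image key
    r.set("embedding:" + img_path, json.dumps(img_embedding))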
Sounds like you want a nearest neighbour search. You could have a look at the various space-partitioning data structures, like k-d trees.
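As an illustration of that idea, here is a minimal sketch using SciPy's cKDTree (the array contents and the threshold value are placeholders you would replace with your stored FaceNet embeddings and a tuned threshold):

import numpy as np
from scipy.spatial import cKDTree

stored = np.random.rand(10_000, 128)       # stand-in for the 10k stored embeddings
tree = cKDTree(stored)                     # build the k-d tree once, offline

query = np.random.rand(128)                # stand-in for the new face embedding
distance, index = tree.query(query, k=1)   # nearest stored embedding
threshold = 10.0                           # hypothetical value; tune on your own data
same_person = distance < threshold

Note that exact k-d tree search tends to degrade in high dimensions, so approximate nearest-neighbour libraries are often preferred for 128-dimensional embeddings.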
First make a dictionary with the 10000 face encodings, as shown in the face_recognition examples, then store it as a pickle file. Once loaded into memory, it takes about a second to find the distance between face encoding X and those 10000 pre-encoded ones. Take a look at how that works; I'm operating on millions of faces this way.
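A minimal sketch of that approach, assuming a hypothetical encodings.pkl that maps names to 128-d encodings (the file name and the 0.6 threshold are placeholders):

import pickle
import numpy as np

with open("encodings.pkl", "rb") as f:            # built once, offline
    known = pickle.load(f)                        # {person_name: 128-d encoding}

names = list(known.keys())
encodings = np.array([known[n] for n in names])   # shape (10000, 128)

def identify(query, threshold=0.6):
    # compute all 10k Euclidean distances in one vectorized call
    distances = np.linalg.norm(encodings - query, axis=1)
    best = int(np.argmin(distances))
    return names[best] if distances[best] < threshold else None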

Import CSV file into python, then turn it into numpy array, then feed it to sklearn algorithm

Sklearn algorithms require features and a label to learn from.
I have a CSV file containing some data. The data comes from a challenge on the HackerEarth website: participants need to create a learning algorithm that learns from data on a massive number of individuals in an affiliate network and their ad-click performance, and then predicts the future performance of other individuals in the network, so the company can optimize its ad performance.
The features in this data include id, date, siteid, offerid, category, merchant, countrycode, type of browser, type of device, and the number of clicks their ads have gotten.
https://www.hackerearth.com/practice/algorithms/string-algorithm/string-searching/practice-problems/machine-learning/predict-ad-clicks/
So my plan is to use the first 7 pieces of information as my features and the ad clicks as the label. Unfortunately, the countrycode, browser, and device information is text (Google Chrome, Desktop) rather than integers that can be turned into an array.
Q1: Is there a way for sklearn to accept not just numpy arrays but also words as features? Am I supposed to use a vectorizer for this? If so, how would I do it? If not, can I just replace the text data with numbers (Google Chrome replaced by 1, Firefox replaced by 2) and still have it work? (I am using a Naive Bayes algorithm.)
Q2: Would a Naive Bayes algorithm be suitable for this task? Since this competition requires participants to create a program that predicts the probability of individuals in the affiliate network having their ads clicked, I assume Naive Bayes would be best suited.
Training data : https://drive.google.com/open?id=1vWdzm0uadoro3WcpWmJ0SVEebeaSsHvr
Testing data : https://drive.google.com/open?id=1M8gR1ZSpNEyVi5W19y0d_qR6EGUeGBQl
My messy coding and horrible attempt at this challenge which I don't think will be much help:
import pandas as pd
import numpy as np
from numpy import genfromtxt
from sklearn.naive_bayes import GaussianNB

# load the CSV both as a raw numpy array and as a DataFrame
data = genfromtxt('smaller.csv', delimiter=',')
dat = pd.read_csv('smaller.csv', delimiter=',')
print(dat['siteid'])

feature = []
label = []
i = 1
j = 1
while i < 17:
    feature.append(data[i][2:8])   # columns 2..7 as features
    i += 1
while j < 17:
    label.append(data[j][9])       # click column as label
    j += 1

clf = GaussianNB()
clf.fit(feature, label)
print(clf.predict([data[18][2:8]]))
print(data[18])
Answer to Question 1: No. Sklearn only works with numerical data, so you need to convert your text to numbers.
There are multiple approaches for converting text to numbers. The first is, as you said, to just assign numbers to the categories, but you need to take into account whether the text data actually implies the kind of ordering those numbers would introduce. When it doesn't, one-hot encoding is most often used instead. Please see the scikit-learn documentation for that:
- http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features
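As an illustration, here is a minimal one-hot encoding sketch with pandas (the column names are taken from the question's description and may need adjusting to the actual CSV headers):

import pandas as pd
from sklearn.naive_bayes import GaussianNB

df = pd.read_csv("smaller.csv")

# one-hot encode the text columns; the numeric ones pass through unchanged
categorical = ["countrycode", "browser", "device"]
X = pd.get_dummies(df[["siteid", "offerid", "category", "merchant"] + categorical],
                   columns=categorical)
y = df["click"]    # hypothetical label column name

clf = GaussianNB().fit(X, y)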
Answer to Question 2: It depends on the data and the task at hand.
No single algorithm handles every type of data optimally.
Most of the time we need to compare multiple algorithms and see which one gives the best result for our data. See this example:
http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#sphx-glr-auto-examples-classification-plot-classifier-comparison-py
Even within a single algorithm we need to check various parameter values and tune them for the maximum score. This is called grid search. See this example:
http://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html#sphx-glr-auto-examples-model-selection-plot-grid-search-digits-py
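A minimal sketch of what a grid search could look like here (the estimator and parameter grid are hypothetical; X and y are as prepared in the one-hot encoding sketch above):

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {"n_estimators": [50, 100], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=3, scoring="roc_auc")
search.fit(X, y)                 # X, y from the encoding sketch above
print(search.best_params_, search.best_score_)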
Hope this clears your doubts. Make sure to go through the scikit-learn documentation and examples:
http://scikit-learn.org/stable/user_guide.html
http://scikit-learn.org/stable/auto_examples/index.html
They are among the best out there.

Natural language classifier returns classifications for untrained items

I am confused as to how NLC works. My expectation is that when it is asked to classify text that has no relation to the training data it learned from, it should return no results, or results with very low confidence scores.
I have trained a model with a set of training data, and when I attempt to classify text that is outside of that training data, I get results with high confidence values (~60%).
Here's an example of my training data:
foo,1,2,3,4
bar,1,2,3,4
baz,1,2,3,4
When I try to classify the text "This should not exist", I receive a high confidence that this text is class "1".
Is my assumption correct that I shouldn't be returned such values in this case? Am I training the data to classify foo, bar, and baz incorrectly? If not, what should I expect from the NLC service?
Imagine that you have 3 buckets and you have to throw a coin into one of them. Each bucket has a 33.3% chance of getting the coin. The same happens with the Natural Language Classifier service: it is trained to classify input text into predefined classes.
If you create a classifier with 3 classes and you try to classify text that wasn't in the training data, NLC will still classify your sentence into one of the three classes you defined. If your top class gets 60%, the other two buckets share the remaining 40%.
Sometimes you can get a high score, and that's normal when your classes are very different from each other.

I want to use deep learning to classify features to scores

I have a problem and was wondering if I can use deep learning to solve it.
I have lists of 7 features, and for each list I have 7 scores.
For example, for the features:
[0.2,0.6,0.2,0.6,0.1,0.3,0.1]
I have the following scores:
[100,0,123,2,14,15,2]
and for the features:
[0.1,0.2,0.3,0.6,0.5,0.1,0.2]
I have the following scores:
[10,10,13,22,4,135,22]
etc.
Any ideas on how to use deep learning to train a network that, given a list of features, will give me back the correct scores?
Thanks
You have the basic setup here for a regression problem. You could try solving this problem using a neural network toolkit. I wrote a toolkit called theanets that might help, so I'll give a simple example of how you might use it:
import numpy as np
import theanets

# set up data arrays: X is input, Y is target output
X = np.array([
    [0.2, 0.6, 0.2, 0.6, 0.1, 0.3, 0.1],
    [0.1, 0.2, 0.3, 0.6, 0.5, 0.1, 0.2],
], 'f')
Y = np.array([
    [100, 0, 123, 2, 14, 15, 2],
    [10, 10, 13, 22, 4, 135, 22],
], 'f')

# set up a regression model:
# map from X to Y using one hidden layer.
exp = theanets.Experiment(
    theanets.Regressor,
    (X.shape[1], 100, Y.shape[1]))

# train the model using rmsprop.
exp.train([X, Y], algorithm='rmsprop')

# predict outputs for some inputs.
Yhat = exp.network.predict(X)
There are several options for configuring and training your model; have a look at the documentation for more info.
There are also many, many other neural network toolkits out there. Here are just a few popular ones that I'm familiar with:
Lasagne
Keras
Caffe
You might want to give these a try to see whether they fit better with your mental model of the problem you're trying to solve.
Another option is a simple evolutionary approach (a sketch of the loop follows the formula below):
1. You generate a big number of neural networks.
2. You give a fitness score to each neural net based on its results (the higher the fitness score, the better).
3. You sort the neural nets by their fitness score.
4. You take the first x%.
5. You apply small mutations to each selected neural net.
6. Repeat steps 2-5 until the results are satisfactory.
That big number mentioned in the first step should be roughly equal to:
(100/x)^generationCount
where x is the same one as in step 4 and generationCount is the number of generations until the final result.
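A minimal Python sketch of that select-and-mutate loop, treating each "network" as a flat weight vector (the fitness function, population size, survivor fraction, and mutation scale are hypothetical placeholders):

import numpy as np

def evolve(fitness_fn, n_weights, pop_size=100, top_frac=0.1,
           generations=20, mutation_scale=0.1):
    # 1. generate a big number of candidate weight vectors
    population = [np.random.randn(n_weights) for _ in range(pop_size)]
    for _ in range(generations):
        # 2-3. score every candidate and sort, best first
        scored = sorted(population, key=fitness_fn, reverse=True)
        # 4. keep the top x%
        survivors = scored[:max(1, int(pop_size * top_frac))]
        # 5. refill the population with mutated copies of the survivors
        population = [s + mutation_scale * np.random.randn(n_weights)
                      for _ in range(pop_size // len(survivors))
                      for s in survivors]
        # 6. loop back to scoring (steps 2-5)
    return max(population, key=fitness_fn)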

Neural network weights adjustment by the user ratings

I'm really new to NNs, and I'm trying to use one in my recommendation system, which gives users recommendations based on user similarities.
The thing is that I have 4 different user similarities computed from different parameters, and I use weights to set the importance of each similarity in the total similarity:
region similarity = 0.5, weightRegion = 0.6
interests similarity = 0.3, weightInterest = 0.8
education similarity = 0.75, weightEducation = 1.1
positions similarity = 0.6, weightPositions = 1.5
So the total similarity is the weighted sum divided by the sum of the weights: (0.5*0.6 + 0.3*0.8 + 0.75*1.1 + 0.6*1.5) / 4
// I divide by the sum of the weights (which here happens to be 4) to keep the result in [0..1]
So the thing is, I need to adjust those weights from the user's rating (the user clicks a rating from 1 to 10 and the weights are corrected).
I've built the following NN:
So what I'm doing is:
n = 0.25;                          // learning rate
rating = 0.7;                      // that is my 7/10 rating
net5 = x1*w15 + x2*w25 + x3*w35 + x4*w45;
out5 = 1/(1 + pow(e, -net5));      // sigmoid
real = out5*(1 + 1 - rating);
err = out5*(1 - out5)*(real - out5);
w15n = w15 + err*n*x1;
w25n = w25 + err*n*x2;
w35n = w35 + err*n*x3;
w45n = w45 + err*n*x4;
(Sorry for the code formatting; the editor kept saying it wasn't properly formatted.)
What am I doing wrong? The results of this correction aren't good at all.
Thanks
I think you are going the wrong way. Backpropagation isn't a good choice for this type of (somewhat incremental) learning.
To use backpropagation you need some data, say 1000 samples where the different types of similarity (input) and the true similarity (output) are given. The weights are then updated again and again until the error rate comes down. Besides that, you also need a test data set, to make sure the resulting network does well even on similarity values it didn't see during training.
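A minimal sketch of what that batch training could look like, with a hypothetical dataset of four-similarity inputs and the ratings users actually gave (everything here is placeholder data, not the asker's setup):

import numpy as np

# hypothetical dataset: 1000 rows of [region, interest, education, position]
# similarities, and the rating each user gave, scaled to 0..1
X = np.random.rand(1000, 4)
y = np.random.rand(1000)

w = np.random.randn(4)
lr = 0.25
for _ in range(500):
    out = 1.0 / (1.0 + np.exp(-X @ w))                      # sigmoid output per sample
    grad = X.T @ ((out - y) * out * (1 - out)) / len(X)     # batch gradient of the MSE
    w -= lr * grad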
