Predict probabilities of a recommender system? - sparse-matrix

I have a dataset where I have a sparse utility matrix (user-product) with binary input: 1 if the user 𝑖 bought the product 𝑗, and 0 if it hasn't.
However it has a different meaning on the test set, 0 means that we don't know if the user bought this product, and 1 means that we're sure that the user bought that given product.
I need to get for each user in each product the probability that a user 𝑖 bought a product 𝑗 in the test set. For this I wish to use different matrix factorization techniques like FunkSVD, NMF or SVD++ , but I'm quite confused:
These techniques would only allow me to get the label (1 or 0) on test set, but I need to compute a probability of getting 1 and not the label in it self.
How can I approach this problem ? Or do I treat it as a classification problem and then use all common classification techniques ?

Okay, One approach that might help based on Recurrent-Knowledge-Graph-Embedding.
Try converting user-item interactions into a knowledge graph and mine the top-n paths in the graph between user_i and item_j. Build an RNN model as described in the paper and its code, to get probabilities.

Related

Motivation for k-medoids

Why would one use kmedoids algoirthm rather then kmeans? Is it only the fact that
the number of metrics that can be used in kmeans is very limited or is there something more?
Is there an example of data, for which it makes much more sense to choose the best representatives
of cluster from the data rather then from R^n?
The problem with k-means is that it is not interpretable. By interpretability i mean the model should also be able to output the reason that why it has resulted a certain output.
lets take an example.
Suppose there is food review dataset which has two posibility that there is a +ve review or a -ve review so we can say we will have k= 2 where k is the number of clusters. Now if you go with k-means where in the algorithm the third step is updation step where you update your k-centroids based on the mean distance of the points that lie in a particular cluster. The example that we have chosen is text problem, so you would also apply some kind of text-featured vector schemes like BagOfWords(BOW), word2vec. now for every review you would get the corresponding vector. Now the generated centroid c_i that you will get after running the k-means would be the mean of the vectors present in that cluster. Now with that centroid you cannot interpret much or rather i should say nothing.
But for same problem you apply k-medoids wherein you choose your k-centroids/medoids from your dataset itself. lets say you choose x_5 point from your dataset as first medoid. From this your interpretability will increase beacuse now you have the review itself which is termed as medoid/centroid. So in k-medoids you choose the centroids from your dataset itself.
This is the foremost motivation of introducing k-mediods
Coming to the metrics part you can apply all the metrics that you apply for k-means
Hope this helps.
Why would we use k-medoids instead of k-means in case of (squared) Euclidean distance?
1. Technical justification
In case of relatively small data sets (as k-medoids complexity is greater) - to obtain a clustering more robust to noise and outliers.
Example 2D data showing that:
The graph on the left shows clusters obtained with K-medoids (sklearn_extra.cluster.KMedoids method in Python with default options) and the one on the right with K-means for K=2. Blue crosses are cluster centers.
The Python code used to generate green points:
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(seed=32)
a = rng.random((6,2))*2.35 - 3*np.ones((6,2))
b = rng.random((50,2))*0.25 - 2*np.ones((50,2))
c = rng.random((100,2))*0.5 - 1.5*np.ones((100,2))
d = rng.random((7,2))*0.55
points = np.concatenate((a, b, c, d))
plt.plot(points[:,0],points[:,1],"g.", markersize=8, alpha=0.3) # green points
2. Business case justification
Here are some example business cases showing why we would prefer k-medoids. They mostly come down to the interpretability of the results and the fact that in k-medoids the resulting cluster centers are members of the original dataset.
2.1 We have a recommender engine based only on user-item preference data and want to recommend to the user those items (e.g. movies) that other similar people enjoyed. So we assign the user to his/her closest cluster and recommend top movies that the cluster representant (actual person) watched. If the cluster representant wasn't an actual person we wouldn't possess the history of actually watched movies to recommend. Each time we'd have to search additionally e.g. for the closest person from the cluster. Example data: classic MovieLens 1M Dataset
2.2 We have a database of patients and want to pick a small representative group of size K to test a new drug with them. After clustering the patients with K-medoids, cluster representants are invited to the drug trial.
Difference between is that in k-means centroids(cluster centrum) are calculated as average of vectors containing in the cluster, and in k-medoids the medoid (cluster centrum) is record from dataset closest to centroid, so if you need to represent cluster centrum by record of your data you use k-medoids, otherwise i should use k-means (but concept of these algorithms are same)
The K-Means algorithm uses a Distance Function such as Euclidean Distance or Manhattan Distance, which are computed over vector-based instances. The K-Medoid algorithm instead uses a more general (and less constrained) distance function: aka pair-wise distance function.
This distinction works well in contexts like Complex Data Types or relational rows, where the instances have a high number of dimensions.
High dimensionality problem
In standard clustering libraries and the k-means algorithms, the distance computation phase can spend a lot of time scanning the entire vector of attributes that belongs to an instance; for instance, in the context of documents clustering, using the standard TF-IDF representation. During the computation of the cosine similarity, the distance function scans all the possible words that appear in the whole collection of documents. Which in many cases can be composed by millions of entries. This is why, in this domain, some authors [1] suggests to restrict the words considered to a subset of N most frequent word of that language.
Using K-Kedoids there is no need to represent and store the documents as vectors of word frequencies.
As an alternative representation for the documents is possible to use the set of words appearing at least twice in the document; and as a distance measure, there can be used Jaccard Distance.
In this case, vector representation is long as the number of words in your dictionary.
Heterogeneousity and Complex Data Types.
There are many domains where is considerably better to abstract the implementation of an instance:
Graph's nodes clustering;
Car driving behaviour, represented as GPS routes;
Complex data type allows the design of ad-hoc distance measures which can fit better with the proper data domain.
[1] Christopher D. Manning, Prabhakar Raghavan, and Hinrich SchΓΌtze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.
Source: https://github.com/eracle/Gap

Data Mining KNN Classifier

Suppose a data analyst working for an insurance company was asked to build a predictive model for predicting whether a customer will buy a mobile home insurance policy. S/he tried kNN classifier with different number of neighbours (k=1,2,3,4,5). S/he got the following F-scores measured on the training data: (1.0; 0.92; 0.90; 0.85; 0.82). Based on that the analyst decided to deploy kNN with k=1. Was it a good choice? How would you select an optimal number of neighbours in this case?
It is not a good idea to select a parameter of a prediction algorithm using the whole training set as the result will be biased towards this particular training set and has no information about generalization performance (i.e. performance towards unseen cases). You should apply a cross-validation technique e.g. 10-fold cross-validation to select the best K (i.e. K with largest F-value) within a range.
This involves splitting your training data in 10 equal parts retain 9 parts for training and 1 for validation. Iterate such that each part has been left out for validation. If you take enough folds this will allow you as well to obtain statistics of the F-value and then you can test whether these values for different K values are statistically significant.
See e.g. also:
http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Falg_knn_training_crossvalidation.htm
The subtlety here however is that there is likely a dependency between the number of data points for prediction and the K-value. So If you apply cross-validation you use 9/10 of the training set for training...Not sure whether any research has been performed on this and how to correct for that in the final training set. Anyway most software packages just use the abovementioned techniques e.g. see SPSS in the link.
A solution is to use leave-one-out cross-validation (each data samples is left out once for testing) in that case you have N-1 training samples(the original training set has N).

how to solve this with kmeans clustering and use cosine similiraty

can anyone tell to me how k-means clustering working on textmining..
and i using cosine similarity as the distance metric.
nim 310910022 320910044 310910043 310910021
access 0 2 3 5
abdi 1 0 0 0
actual 5 0 0 1
arrow 0 1 1 2
this data is on listview
How can I do that in VB.net? to get any cluster and trending topic of the term ?
Thank in advance
First I would separate the problem into two parts:
Computing the k-means clustering assignments
Getting the data from the GUI (you mention the data is on a listview)
I assume 2 is straight forward and you just need help on 1.
I would start by re-writing the code to just read a TSV text file of the data as you specified. This will make things a lot easier to debug.
Then ask if you want to implement the kmeans algorithm yourself or use a library.
If you want to implement it, here is a link to the pseudo code
http://www.scribd.com/doc/89373376/K-Means-Pseudocode
You can also search up other kmeans pseudocode.
If you want to use a library to just "run" your data against a kmeans algorithm, here is one example in python/scipy.
http://glowingpython.blogspot.com/2012/04/k-means-clustering-with-scipy.html
Regardless of which approach you use, realize that kmeans is non-deterministic and you may get different answers everytime you run it. I would recommend computing against a known validation set to see if the data is roughly what you think it should be.

How do I handle uncertainty/missing data in an Artifical Neural Network?

The context:
I'm experimenting with using a feed-forward artificial neural network to create AI for a video game, and I've run into the problem that some of my input features are dependent upon the existence or value of other input features.
The most basic, simplified example I can think of is this:
feature 1 is the number of players (range 2...5)
feature 2 to ? is the score of each player (range >=0)
The number of features needed to inform the ANN of the scores is dependent on the number of players.
The question: How can I represent this dynamic knowledge input to an ANN?
Things I've already considered:
Simply not using such features, or consolidating them into static input.
I.E using the sum of the players scores instead. I seriously doubt this is applicable to my problem, it would result in the loss of too much information and the ANN would fail to perform well.
Passing in an error value (eg -1) or default value (eg 0) for non-existant input
I'm not sure how well this would work, in theory the ANN could easily learn from this input and model the function appropriately. In practise I'm worried about the sheer number of non-existant input causing problems for the ANN. For example if the range of players was 2-10, if there were only 2 players, 80% of the input data would be non-existant and would introduce weird bias into the ANN resulting in a poor performance.
Passing in the mean value over the training set in place on non-existant input
Again, the amount of non-existant input would be a problem, and I'm worried this would introduce weird problems for discrete-valued inputs.
So, I'm asking this, does anybody have any other solutions I could think about? And is there a standard or commonly used method for handling this problem?
I know it's a rather niche and complicated question for SO, but I was getting bored of the "how do I fix this code?" and "how do I do this in PHP/Javascript?" questions :P, thanks guys.
It sounds like you have multiple data sets (for each number of players) that aren't really compatible with each other. Would lessons learned from a 5-player game really apply to a 2-player game? Try simplifying the problem, such as #1, and see how the program performs. In AI, absurd simplifications can sometimes give you a lot of traction, like bag of words in spam filters.
Try thinking about some model like the following:
Say xi (e.g. x1) is one of the inputs that a variable number of can exist. You can have n of these (x1 to xn). Let y be the rest of the inputs.
On your first hidden layer, pass x1 and y to the first c nodes, x1,x2 and y to the next c nodes, x1,x2,x3 and y to the next c nodes, and so on. This assumes x1 and x3 can't both be active without x2. The model will have to change appropriately if this needs to be possible.
The rest of the network is a standard feed-forward network with all nodes connected to all nodes of the next layer, or however you choose.
Whenever you have w active inputs, disable all but the wth set of c nodes (completely exclude them from training for that input set, don't include them when calculating the value for the nodes they output to, don't update the weights for their inputs or outputs). This will allow most of the network to train, but for the first hidden layer, only parts applicable to that number of inputs.
I suggest c is chosen such that c*n (the number of nodes in the first hidden layer) is greater than (or equal to) the number of nodes in the 2nd hidden layer (and have c be at the very least 10 for a moderately sized network (into the 100s is also fine)) and I also suggest the network have at least 2 other hidden layers (so 3 in total excluding input and output). This is not from experience, but just what my intuition tells me.
This working is dependent on a certain (possibly undefinable) similarity between the different numbers of inputs, and might not work well, if at all, if this similarity doesn't exist. This also probably requires quite a bit of training data for each number of inputs.
If you try it, let me / us know if it works.
If you're interested in Artificial Intelligence discussions, I suggest joining some Linked-In group dedicated to it, there are some that are quite active and have interesting discussions. There doesn't seem to be much happening on stackoverflow when it comes to Artificial Intelligence, or maybe we should just work to change that, or both.
UPDATE:
Here is a list of the names of a few decent Artificial Intelligence LinkedIn groups (unless they changed their policies recently, it should be easy enough to join):
'Artificial Intelligence Researchers, Faculty + Professionals'
'Artificial Intelligence Applications'
'Artificial Neural Networks'
'AGI β€” Artificial General Intelligence'
'Applied Artificial Intelligence' (not too much going on at the moment, and still dealing with some spam, but it is getting better)
'Text Analytics' (if you're interested in that)

Determining which inputs to weigh in an evolutionary algorithm

I once wrote a Tetris AI that played Tetris quite well. The algorithm I used (described in this paper) is a two-step process.
In the first step, the programmer decides to track inputs that are "interesting" to the problem. In Tetris we might be interested in tracking how many gaps there are in a row because minimizing gaps could help place future pieces more easily. Another might be the average column height because it may be a bad idea to take risks if you're about to lose.
The second step is determining weights associated with each input. This is the part where I used a genetic algorithm. Any learning algorithm will do here, as long as the weights are adjusted over time based on the results. The idea is to let the computer decide how the input relates to the solution.
Using these inputs and their weights we can determine the value of taking any action. For example, if putting the straight line shape all the way in the right column will eliminate the gaps of 4 different rows, then this action could get a very high score if its weight is high. Likewise, laying it flat on top might actually cause gaps and so that action gets a low score.
I've always wondered if there's a way to apply a learning algorithm to the first step, where we find "interesting" potential inputs. It seems possible to write an algorithm where the computer first learns what inputs might be useful, then applies learning to weigh those inputs. Has anything been done like this before? Is it already being used in any AI applications?
In neural networks, you can select 'interesting' potential inputs by finding the ones that have the strongest correlation, positive or negative, with the classifications you're training for. I imagine you can do similarly in other contexts.
I think I might approach the problem you're describing by feeding more primitive data to a learning algorithm. For instance, a tetris game state may be described by the list of occupied cells. A string of bits describing this information would be a suitable input to that stage of the learning algorithm. actually training on that is still challenging; how do you know whether those are useful results. I suppose you could roll the whole algorithm into a single blob, where the algorithm is fed with the successive states of play and the output would just be the block placements, with higher scoring algorithms selected for future generations.
Another choice might be to use a large corpus of plays from other sources; such as recorded plays from human players or a hand-crafted ai, and select the algorithms who's outputs bear a strong correlation to some interesting fact or another from the future play, such as the score earned over the next 10 moves.
Yes, there is a way.
If you choose M selected features there are 2^M subsets, so there is a lot to look at.
I would to the following:
For each subset S
run your code to optimize the weights W
save S and the corresponding W
Then for each pair S-W, you can run G games for each pair and save the score L for each one. Now you have a table like this:
feature1 feature2 feature3 featureM subset_code game_number scoreL
1 0 1 1 S1 1 10500
1 0 1 1 S1 2 6230
...
0 1 1 0 S2 G + 1 30120
0 1 1 0 S2 G + 2 25900
Now you can run some component selection algorithm (PCA for example) and decide which features are worth to explain scoreL.
A tip: When running the code to optimize W, seed the random number generator, so that each different 'evolving brain' is tested against the same piece sequence.
I hope it helps in something!

Resources