How to find similarity for a large number of features in a dataset

I'm not sure if I'm asking this question in the right place as I'm new to Stack Overflow; please move it if required.
I'm trying to solve a link prediction problem for a Flickr dataset. My dataset has 5K nodes, and each node has around 27K features; the data is sparse.
I want to find the similarity between nodes so that I can predict a link between them if the similarity value is greater than some threshold that I decide. The problem is the number of features: I cannot even load the file in Weka (where I wanted to reduce the features by information gain or something similar, and then try clustering or a cosine similarity measure).
One more problem: how do I define this as a classification problem? I wanted to find overlapping tags for two nodes, so the table contains node pairs and some of their features (there will be thousands), and all of the rows belong to the positive class only, since I know there is a link between them.
I want to create a test dataset with some of the nodes, build a similar table, and label the pairs as positive or negative. But my problem is that all the data I have is positive, so I think the model would never be able to label anything as negative. How do I turn this into a classification problem correctly?
Any pointers or help is very much appreciated.

Weka can deal with 27K features; it shouldn't be a problem... However, I would approach this not as a classification problem but as a link-discovery one, which in this case can be seen as a matching problem.
My approach would be:
1. new node appears
2. search for the most similar elements
3. assume they are related (there is a link) if the similarity is greater than your threshold.
The main problem would be to tune the threshold based on some quality measure.
For this approach, Lucene would probably be the best option.
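If you just want to prototype the thresholded-similarity idea outside Weka/Lucene, here is a minimal sketch, assuming scipy/scikit-learn and a sparse node-feature matrix (the toy matrix and the THRESHOLD value are placeholders):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the real 5K x 27K node-feature matrix
# (rows = nodes, columns = sparse tag/feature indicators).
features = csr_matrix(np.array([
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]))

THRESHOLD = 0.5  # placeholder; tune against the known links

# Cosine similarity works directly on the sparse matrix.
sim = cosine_similarity(features)

# Predict a link wherever similarity exceeds the threshold.
links = [(i, j)
         for i in range(sim.shape[0])
         for j in range(i + 1, sim.shape[0])
         if sim[i, j] > THRESHOLD]
print(links)
```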
I hope this helps.

Related

Algo to find least amount of items that satisfy a given requirement

I want to implement an algorithm for the following problem. It will later need to be implemented in T-SQL.
I have a set of providers - let's say shops. Each shop has its own set of items it offers. Some items overlap between shops; some are only present in one shop.
I have a list of items - let's say a shopping list with the set of items I want.
I now have to find the combination of shops that offers ALL the items while requiring the fewest shops.
I am pretty sure this problem is frequently solved and the algorithm has its own name but I was not able to find it via search.
Sorry, I think I misinterpreted the question the first time. Your problem is essentially the Set Cover Problem, which is NP-hard. There are good heuristics, but no known efficient algorithm that always finds the optimal solution.
(This is similar to, but not quite, your problem; worth looking at, though.) Knapsack Problem
Not sure about any particular algorithm. But given your requirements, I would:
Start looking for the shop with the largest number of matching items from your list.
Iterate until you fill your list.
If you want to really optimize this, you can then look for any redundant shops. A redundant shop is one whose items are all also provided by other shops in your selection.
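A minimal sketch of that greedy heuristic (in Python rather than T-SQL; the shops and the shopping list are made up):

```python
# Greedy set-cover heuristic: repeatedly pick the shop that covers the most
# still-missing items. Simple and fast, but not guaranteed to be optimal.
shops = {
    "A": {"milk", "bread", "eggs"},
    "B": {"eggs", "butter"},
    "C": {"milk", "butter", "jam"},
}
wanted = {"milk", "bread", "eggs", "butter", "jam"}

chosen, missing = [], set(wanted)
while missing:
    best = max(shops, key=lambda s: len(shops[s] & missing))
    if not shops[best] & missing:
        raise ValueError("no combination of shops covers every item")
    chosen.append(best)
    missing -= shops[best]

print(chosen)  # ['A', 'C'] for this toy data
```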
On second thought, this could also be solved using binary linear programming, where each shop is a variable and the constraints say that each item must be served by at least one chosen shop; you then minimize the number of shops used. I'm not sure how you would solve this inside T-SQL, though.
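A sketch of that binary-programming formulation, assuming the third-party PuLP package is available (in practice you would hand this off to an external solver rather than do it inside T-SQL):

```python
import pulp

shops = {
    "A": {"milk", "bread", "eggs"},
    "B": {"eggs", "butter"},
    "C": {"milk", "butter", "jam"},
}
wanted = {"milk", "bread", "eggs", "butter", "jam"}

prob = pulp.LpProblem("min_shops", pulp.LpMinimize)
use = {s: pulp.LpVariable(f"use_{s}", cat="Binary") for s in shops}

# Objective: open as few shops as possible.
prob += pulp.lpSum(use.values())

# Each wanted item must be offered by at least one chosen shop.
for item in wanted:
    prob += pulp.lpSum(use[s] for s in shops if item in shops[s]) >= 1

prob.solve()
print([s for s in shops if use[s].value() == 1])  # e.g. ['A', 'C']
```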

Algorithm to find related words in a text

I would like to take a word (e.g. "Apple") and process a text (or maybe several). I'd like to come up with related terms. For example: process a document for Apple and find that iPod, iPhone, and Mac are terms related to "Apple".
Any idea on how to solve this?
As a starting point: your question relates to text mining.
There are two ways: a statistical approach, and one from natural language processing (NLP).
I don't know much about NLP, but I can say something about the statistical approach:
You need some vector space representation of your documents; see:
http://en.wikipedia.org/wiki/Vector_space_model
http://en.wikipedia.org/wiki/Document-term_matrix
http://en.wikipedia.org/wiki/Tf%E2%80%93idf
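For instance, building a tf-idf document-term matrix is only a few lines with a recent scikit-learn (a sketch; the documents are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Apple announced a new iPhone and an updated iPod today",
    "The iPod and the Mac are made by Apple",
    "Bananas and oranges are fruit",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)           # sparse documents x terms matrix
terms = vectorizer.get_feature_names_out()   # the column labels (the vocabulary)
print(X.shape, list(terms))
```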
In order to learn semantics - that is, that different words can mean the same thing, or that one word can have different meanings - you need a large text corpus for learning. As I said, this is a statistical approach, so you need lots of samples.
http://www.daviddlewis.com/resources/testcollections/
Maybe you have lots of documents from the context you are going to use. That is the best situation.
You have to retrieve latent factors from this corpus. Most common are:
LSA (http://en.wikipedia.org/wiki/Latent_semantic_analysis)
PLSA (http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis)
Non-negative matrix factorization (http://en.wikipedia.org/wiki/Non-negative_matrix_factorization)
Latent Dirichlet allocation (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
These methods involve lots of math. Either you dig it, or you have to find good libraries.
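As a library-based sketch of the LSA route, assuming scikit-learn (toy documents only; the related-term rankings only become meaningful with a much larger corpus):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Apple announced a new iPhone and an updated iPod today",
    "The iPod and the Mac are made by Apple",
    "Bananas and oranges are fruit",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)               # documents x terms
terms = list(vectorizer.get_feature_names_out())

# LSA: truncated SVD of the tf-idf matrix; components_.T gives one
# latent-space vector per term.
svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(X)
term_vectors = svd.components_.T

sim = cosine_similarity(term_vectors)
query = terms.index("apple")                     # the vectorizer lowercases terms
ranked = np.argsort(sim[query])[::-1]
print([terms[i] for i in ranked if i != query][:5])
```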
I can recommend the following books:
http://www.oreilly.de/catalog/9780596529321/toc.html
http://www.oreilly.de/catalog/9780596516499/index.html
Like all of AI, it's a very difficult problem. You should look into natural language processing to learn about some of the issues.
One very, very simplistic approach can be to build a 2d-table of words, with for each pair of words the average distance (in words) that they appear in the text. Obviously you'll need to limit the maximum distance considered, and possibly the number of words as well. Then, after processing a lot of text you'll have an indicator of how often certain words appear in the same context.
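A rough sketch of that word-pair distance table (tokenization and the window size are deliberately naive, and the text is a placeholder):

```python
from collections import defaultdict
from itertools import combinations

text = "apple released a new iphone today and apple also updated the ipod line"
max_distance = 5  # ignore word pairs further apart than this

tokens = text.lower().split()
totals = defaultdict(lambda: [0, 0])  # pair -> [sum of distances, count]

for i, j in combinations(range(len(tokens)), 2):
    dist = j - i
    if dist <= max_distance and tokens[i] != tokens[j]:
        pair = tuple(sorted((tokens[i], tokens[j])))
        totals[pair][0] += dist
        totals[pair][1] += 1

# Lower average distance = the words tend to appear in the same context.
averages = {pair: s / n for pair, (s, n) in totals.items()}
for pair, avg in sorted(averages.items(), key=lambda kv: kv[1])[:5]:
    print(pair, round(avg, 2))
```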
What I would do is get all the words in a text and make a frequency list (how often each word appears). Maybe also add to it a heuristic factor on how far the word is from "Apple". Then read multiple documents, and cross out words that are not common in all the documents. Then prioritize based on the frequency and distance from the keyword. Of course, you will get a lot of garbage and possibly miss some relevant words, but by adjusting the heuristics you should get at least some decent matches.
The technique that you are looking for is called Latent Semantic Analysis (LSA). It is also sometimes called Latent Semantic Indexing. The technique operates on the idea that related concepts occur together in text. It uses statistics to build the word relationships. Given a large enough corpus of documents it will definitely solve your problem of finding related words.
Take a look at vector space models.

In what sequence is cluster analysis done?

First find the minimum frequent patterns from the database.
Then divide them into various data types (interval-based, binary, ordinal variables, etc.) and define distance measures for all the variables.
Finally, apply a cluster analysis method.
Is this sequence right, or am I missing something?
Whether you're right or not depends on what you want to do. The general approach you describe seems to go in the right direction, but you'll never know if you're on target until you answer the following questions:
What is your data?
What are you trying to find? Which clustering method do you want to use?
From what you describe, it seems to me that you want to do 'preprocessing' steps like feature selection and vectorization. Unfortunately, this by itself can be quite challenging. For example, one of the biggest practical problems is the design of a distance function (there's a tremendous amount of research available).
So, please give us more information on your specific target application.

AI: Determining what tests to run to get the most useful data

This is for http://cssfingerprint.com
I have a system (see about page on site for details) where:
I need to output a ranked list, with confidences, of categories that match a particular feature vector
the binary feature vectors are a list of site IDs & whether this session detected a hit
feature vectors are, for a given categorization, somewhat noisy (sites will decay out of history, and people will visit sites they don't normally visit)
categories are a large, non-closed set (user IDs)
my total feature space is approximately 50 million items (URLs)
for any given test, I can only query approx. 0.2% of that space
I can only make the decision of what to query, based on results so far, ~10-30 times, and must do so in <~100ms (though I can take much longer to do post-processing, relevant aggregation, etc)
getting the AI's probability ranking of categories based on results so far is mildly expensive; ideally the decision will depend mostly on a few cheap SQL queries
I have training data that can say authoritatively that any two feature vectors are the same category but not that they are different (people sometimes forget their codes and use new ones, thereby making a new user id)
I need an algorithm to determine what features (sites) are most likely to have a high ROI to query (i.e. to better discriminate between plausible-so-far categories [users], and to increase certainty that it's any given one).
This needs to balance exploitation (testing based on prior test data) and exploration (testing things that haven't been tested enough to find out how they perform).
There's another question that deals with a priori ranking; this one is specifically about a posteriori ranking based on results gathered so far.
Right now, I have little enough data that I can just always test everything that anyone else has ever gotten a hit for, but eventually that won't be the case, at which point this problem will need to be solved.
I imagine that this is a fairly standard problem in AI - having a cheap heuristic for what expensive queries to make - but it wasn't covered in my AI class, so I don't actually know whether there's a standard answer. So, relevant reading that's not too math-heavy would be helpful, as well as suggestions for particular algorithms.
What's a good way to approach this problem?
If you know nothing about the features you have not sampled, then you have little to go on when deciding whether to explore or exploit your data. If you can express your ROI as a single number following every query, then there is an optimal way of making this choice by keeping track of upper confidence bounds. See the paper Finite-time Analysis of the Multiarmed Bandit Problem.
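For reference, the UCB1 rule from that paper fits in a few lines. This sketch fakes the reward signal with a random number; in your case it would be the single-number ROI observed after each query:

```python
import math
import random

n_arms = 10                # e.g. candidate sites/features you could query
counts = [0] * n_arms      # how many times each arm has been tried
means = [0.0] * n_arms     # running mean reward per arm

def reward(arm):
    # Placeholder for the ROI number you observe after querying `arm`.
    return random.random()

for t in range(1, 1001):
    if 0 in counts:
        arm = counts.index(0)  # play every arm once first
    else:
        # UCB1: favour high observed means, but keep trying under-sampled arms.
        arm = max(range(n_arms),
                  key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
    r = reward(arm)
    counts[arm] += 1
    means[arm] += (r - means[arm]) / counts[arm]

print(max(range(n_arms), key=lambda a: means[a]))  # arm that looks best so far
```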

Generating 'neighbours' for users based on rating

I'm looking for techniques to generate 'neighbours' (people with similar taste) for users on a site I am working on; something similar to the way last.fm works.
Currently, I have a compatibility function for users which could come into play. It ranks users on having 1) rated similar items and 2) rated those items similarly. The function weighs point 2 higher, and this would be the most important factor if I had to use only one of them when generating 'neighbours'.
One idea I had was to just calculate the compatibility of every combination of users and select the highest-rated users to be the neighbours for each user. The downside of this is that as the number of users goes up, the process could take a very long time. For just 1000 users, it needs 1000C2 (0.5 * 1000 * 999 = 499,500) calls to the compatibility function, which could also be very heavy on the server.
So I am looking for any advice, links to articles etc on how best to achieve a system like this.
In the book Programming Collective Intelligence
http://oreilly.com/catalog/9780596529321
Chapter 2 "Making Recommendations" does a really good job of outlining methods of recommending items to people based on similarities between users. You could use the similarity algorithms to find the 'neighbours' you are looking for. The chapter is available on google book search here:
http://books.google.com/books?id=fEsZ3Ey-Hq4C&printsec=frontcover
Be sure to look at Collaborative Filtering. Many recommendation systems use collaborative filtering to suggest items to users. They do it by finding 'neighbors' and then suggesting items your neighbors rated highly but you haven't rated. You could go as far as finding neighbors, and who knows, maybe you'll want recommendations in the future.
GroupLens is a research lab at the University of Minnesota that studies collaborative filtering techniques. They have a ton of published research as well as a few sample datasets.
The Netflix Prize is a competition to determine who can most effectively solve this sort of problem. Follow the links off their LeaderBoard. A few of the competitors share their solutions.
As far as a computationally inexpensive solution, you could try this:
Create categories for your items. If we're talking about music, they might be classical, rock, jazz, hip-hop... or go further: Grindcore, Math Rock, Riot Grrrl...
Now, every time a user rates an item, roll up their ratings at the category level. So you know 'User A' likes Honky Tonk and Acid House because they give those items high ratings frequently. Frequency and strength are probably both important for your category aggregate score.
When it's time to find neighbors, instead of cruising through all ratings, just look for similar scores in the categories.
This method wouldn't be as accurate but it's fast.
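A rough sketch of that rollup idea (the categories and ratings are made up):

```python
from collections import defaultdict
from math import sqrt

item_category = {"song1": "jazz", "song2": "jazz", "song3": "rock"}
ratings = {
    "user_a": {"song1": 5, "song2": 4, "song3": 1},
    "user_b": {"song1": 4, "song3": 2},
}

# Roll per-item ratings up to per-category scores.
profiles = {}
for user, user_ratings in ratings.items():
    scores = defaultdict(float)
    for item, rating in user_ratings.items():
        scores[item_category[item]] += rating
    profiles[user] = dict(scores)

def cosine(p, q):
    dot = sum(p.get(c, 0) * q.get(c, 0) for c in set(p) | set(q))
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Compare a handful of category scores instead of every individual rating.
print(cosine(profiles["user_a"], profiles["user_b"]))
```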
Cheers.
What you need is a clustering algorithm, which would automatically group similar users together. The first difficulty that you are facing is that most clustering algorithms expect the items they cluster to be represented as points in a Euclidean space. In your case, you don't have the coordinates of the points. Instead, you can compute the value of the "similarity" function between pairs of them.
One good possibility here is to use spectral clustering, which needs precisely what you have: a similarity matrix. The downside is that you still need to compute your compatibility function for every pair of points, i.e. the algorithm is O(n^2).
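With scikit-learn this looks roughly like the following, assuming you have already filled in the symmetric similarity matrix (the toy matrix here is a placeholder):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Toy symmetric compatibility/similarity matrix for 4 users.
similarity = np.array([
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
])

clustering = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
labels = clustering.fit_predict(similarity)
print(labels)  # users with the same label are candidate 'neighbours'
```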
If you absolutely need an algorithm faster than O(n^2), then you can try an approach called dissimilarity spaces. The idea is very simple. You invert your compatibility function (e.g. by taking its reciprocal) to turn it into a measure of dissimilarity or distance. Then you compare every item (user, in your case) to a set of prototype items, and treat the resulting distances as coordinates in a space. For instance, if you have 100 prototypes, then each user would be represented by a vector of 100 elements, i.e. by a point in 100-dimensional space. Then you can use any standard clustering algorithm, such as k-means.
The question now is how you choose the prototypes, and how many you need. Various heuristics have been tried; however, here is a dissertation which argues that choosing prototypes randomly may be sufficient. It shows experiments in which using 100 or 200 randomly selected prototypes produced good results. In your case, if you have 1000 users and you choose 200 of them to be prototypes, then you would need to evaluate your compatibility function 200,000 times, which is an improvement of a factor of 2.5 over comparing every pair. The real advantage, though, is that for 1,000,000 users, 200 prototypes would still be sufficient, and you would need to make 200,000,000 comparisons rather than 500,000,000,000, an improvement of a factor of 2500. What you get is an O(n) algorithm, which is better than O(n^2), despite a potentially large constant factor.
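A sketch of the prototype idea; `compatibility` here stands in for your real function, and the random user vectors are placeholders:

```python
import random
import numpy as np
from sklearn.cluster import KMeans

n_users, n_prototypes, n_clusters = 1000, 200, 20
rng = np.random.default_rng(0)

# Stand-ins for real users; only the compatibility function matters here.
users = [rng.random(10) for _ in range(n_users)]

def compatibility(a, b):
    # Placeholder for the real compatibility function (higher = more alike).
    return 1.0 / (1.0 + np.linalg.norm(a - b))

prototypes = random.sample(users, n_prototypes)

# Each user becomes a vector of dissimilarities to the prototypes:
# n_users * n_prototypes = 200,000 calls instead of ~500,000 pairwise calls.
embedded = np.array([[1.0 / compatibility(u, p) for p in prototypes]
                     for u in users])

labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embedded)
print(labels[:10])  # users sharing a label are candidate neighbours
```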
The problem seems to be a 'classification' problem. Yes, there are many solutions and approaches.
To start exploring, check this:
http://en.wikipedia.org/wiki/Statistical_classification
Have you heard of Kohonen networks?
It's a self-organizing learning algorithm that clusters similar variables into similar slots. Although most sites, like the one I linked you to, display the net as two-dimensional, there is little involved in extending the algorithm into a multi-dimensional hypercube.
With such a data structure, finding and storing neighbours with similar tastes is trivial, as similar users should be stored in similar locations (almost like a reverse hash code).
This reduces your problem to one of finding the variables that define similarity and establishing distances between the possible enumerated values - for example, classical and acoustic are close together, while death metal and reggae are quite distant (at least in my opinion).
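A minimal sketch of that idea, assuming the third-party minisom package and made-up per-genre taste scores:

```python
import numpy as np
from minisom import MiniSom

# Each row: one user's scores for (classical, acoustic, death metal, reggae).
tastes = np.array([
    [0.9, 0.8, 0.1, 0.0],
    [0.8, 0.9, 0.0, 0.1],
    [0.1, 0.0, 0.9, 0.2],
    [0.0, 0.1, 0.2, 0.9],
])

som = MiniSom(5, 5, input_len=4, sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(tastes, num_iteration=500)

# Users mapped to the same (or an adjacent) grid cell are likely 'neighbours'.
for i, row in enumerate(tastes):
    print(i, som.winner(row))
```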
By the way, in order to find good dividing variables, the best algorithm is a decision tree. The nodes closer to the root will be the most important variables for establishing 'closeness'.
It looks like you need to read about clustering algorithms. The general idea is that instead of comparing every point with every other point each time, you divide them into clusters of similar points; the neighbourhood can then be all the points in the same cluster. The number/size of the clusters is usually a parameter of the clustering algorithm.
You can find a video about clustering in Google's series about cluster computing and MapReduce.
Concerns over performance can be greatly mitigated if you treat this as a build/batch problem rather than a real-time query.
The graph can be computed statically and then periodically updated (e.g. hourly or daily) to generate edges and storage optimized for runtime queries, e.g. the top 10 similar users for each user.
+1 for Programming Collective Intelligence too - it is very informative - wish it wasn't (or I was!) as Python-oriented, but still good.
