Locality-Sensitive Hashing for similarity matching

I am experimenting with the Locality-Sensitive Hashing (LSH) algorithm, which I want to use to find patient similarities in computed tomography (CT) images.
I have built a deep neural network that extracts features. Now I want to take the images of a new patient, extract features, and send them together with the features of all other (training) patients through the LSH algorithm to find the most similar patient.
As the features are high-dimensional, I thought LSH would be a good choice.
Now I am wondering how I should set the hyperparameters of the algorithm for such a task.
Should each patient get its own bucket (for example, 100 patients, 100 buckets), or should I just have one bucket and put all patients into it to get the nearest one?
All CT images contain the same anatomical structures.
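To make the setup concrete, here is a minimal random-hyperplane LSH sketch of the kind of pipeline I have in mind (the bit length, table count and feature dimensionality are placeholder values, and the random vectors below stand in for the real CNN features):

import numpy as np

rng = np.random.default_rng(0)

n_planes = 16    # hash length in bits per table (placeholder choice)
n_tables = 8     # number of independent hash tables (placeholder choice)
dim = 2048       # dimensionality of the extracted feature vectors (placeholder)

# One random hyperplane matrix per table.
planes = [rng.standard_normal((n_planes, dim)) for _ in range(n_tables)]

def lsh_keys(feature):
    """Return one bucket key per table for a single feature vector."""
    keys = []
    for P in planes:
        bits = (P @ feature) > 0      # sign of the projection onto each hyperplane
        keys.append(bits.tobytes())   # pack the bit pattern into a hashable key
    return keys

# Index the training patients (random vectors stand in for the real features).
train_features = rng.standard_normal((100, dim))
tables = [dict() for _ in range(n_tables)]
for patient_id, feat in enumerate(train_features):
    for t, key in enumerate(lsh_keys(feat)):
        tables[t].setdefault(key, []).append(patient_id)

# Query with a new patient: collect bucket candidates, then rank them exactly.
query = rng.standard_normal(dim)
candidates = set()
for t, key in enumerate(lsh_keys(query)):
    candidates.update(tables[t].get(key, []))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

best = max(candidates, key=lambda i: cosine(train_features[i], query), default=None)
print("most similar training patient:", best)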
Thanks for any advice,
Kind regards,
Michael

Related

Best way to correlate weights to nodes (Use a graph?)

I don't have a strong software engineering background and I am currently tasked with correlating fuzzy hashes (Nilsimsa) and their similarity to one another. For example: I have a ton of strings (hundreds of thousands), I hash each one using Nilsimsa, and then I want to be able to correlate each hash to a similar hash (indicating a similar string).
I've been trying to think of the best way to go about this in a way that can scale (ultimately it will need to handle millions of input strings per day). I was thinking of maybe a graph database, or just a graph in general, where the weight between each pair of nodes would be the similarity of the Nilsimsa hashes.
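For a sense of what I mean by scaling it, here is a rough sketch of band-based bucketing, so that only digests sharing a band are ever compared directly (nilsimsa_hex and similarity below are placeholders for whatever the Nilsimsa library actually provides):

from collections import defaultdict
from itertools import combinations
import hashlib

def nilsimsa_hex(s):
    # Placeholder for the real Nilsimsa digest (64 hex characters).
    # NOTE: sha256 is NOT a fuzzy hash; with the real Nilsimsa, similar strings
    # produce digests that differ in only a few positions and so share bands.
    return hashlib.sha256(s.encode()).hexdigest()

def similarity(h1, h2):
    # Placeholder for the Nilsimsa compare score (the real one ranges -128..128).
    return sum(a == b for a, b in zip(h1, h2))

BANDS = 8           # split each digest into 8 bands; tune for recall vs. volume
THRESHOLD = 40      # placeholder similarity threshold

def band_keys(digest):
    step = len(digest) // BANDS
    return [(i, digest[i * step:(i + 1) * step]) for i in range(BANDS)]

strings = ["the quick brown fox", "the quick brown fix", "something else entirely"]
digests = {s: nilsimsa_hex(s) for s in strings}

# Bucket by band so candidate generation stays roughly linear in the input size.
buckets = defaultdict(list)
for s, d in digests.items():
    for key in band_keys(d):
        buckets[key].append(s)

# Only compare pairs that landed in the same bucket at least once.
candidates = set()
for members in buckets.values():
    candidates.update(combinations(sorted(members), 2))

similar_pairs = [(a, b) for a, b in candidates
                 if similarity(digests[a], digests[b]) >= THRESHOLD]
print(similar_pairs)   # these pairs become the edges of the similarity graph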
I'm not asking anyone to implement anything; just ideas or suggestions would be very much appreciated.

Location based horizontal scalable dating app database model

I am assessing a backend for a location-based dating app similar to Tinder.
The main app feature is showing nearby online users (with sex and age filters).
Some database engines I have in mind are Redis, Cassandra, and MySQL Cluster.
The app should scale horizontally by adding nodes at high-traffic times.
After researching, I am very confused about whether there is a common "best practice" data model or algorithm for this.
My approach is using Redis Cluster:
// Store all online users in the same location (city) in a Set. In this case, store user:1 in the New York set
SADD location:NewYork 1
// Store every user's age in a Sorted Set. In this case, user:1 has age 30
ZADD age 30 "1"
// Retrieve users in New York aged 20 to 40
ZINTERSTORE tmpkey 2 location:NewYork age AGGREGATE MAX
ZRANGEBYSCORE tmpkey 20 40
I am inexperienced and cannot foresee the potential problems if scaling happens for millions of concurrent users.
I hope a veteran could shed some light.
For your use case, MongoDB would be a good choice.
You can store each user in a single document, along with their current location.
Create indexes on the fields you want to query on, e.g. age, gender, location.
MongoDB has built-in support for geospatial queries, hence it is easy to find users within a 1 km radius of another user.
Most NoSQL geo/proximity index features rely on the GeoHash algorithm:
http://www.bigfastblog.com/geohash-intro
It's a good thing to understand how it works, and it's really quite fascinating. This technique can also be used to create highly efficient indexes on a relational database.
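To get a feel for how it works, here is a from-scratch sketch of GeoHash encoding (the precision and the test coordinates are arbitrary):

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"   # the GeoHash base32 alphabet

def geohash(lat, lon, precision=9):
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    result = []
    use_lon = True     # GeoHash interleaves bits, starting with longitude
    bit_count = 0
    char = 0
    while len(result) < precision:
        rng = lon_range if use_lon else lat_range
        value = lon if use_lon else lat
        mid = (rng[0] + rng[1]) / 2
        char <<= 1
        if value >= mid:
            char |= 1
            rng[0] = mid      # value is in the upper half of the range
        else:
            rng[1] = mid      # value is in the lower half of the range
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:    # 5 bits per base32 character
            result.append(BASE32[char])
            bit_count = 0
            char = 0
    return "".join(result)

# Nearby points share a common prefix, so a plain B-tree index on the hash
# (e.g. WHERE geohash LIKE 'dr5re%') gives cheap proximity filtering.
print(geohash(40.7128, -74.0060))   # New York City -> starts with "dr5re"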
Redis does have native support for this, but if you're using ElastiCache, that version of Redis does not, and you'll need to manage this in your API.
A relational database will give you the most flexibility and the simplest solution. The problem you may face is query time. If you're optimizing for searches on your DB instance (possibly with a 'search DB' separate from profile/content data), then it's possible to keep the entire index in memory for fast results.
I can also talk a bit about Redis: the sorted set operations are blazingly fast, but you need to filter. Either you scan through your nearby results and look up metadata to filter, or you maintain separate sets for every combination of filters you may need. The first has more performance overhead. The second requires you to manage the indexes yourself. For example: what if someone removes one of their 'likes'? What if they move around?
It's not flashy or fancy, but in most cases where you need to search a range of data, relational databases win due to their simplicity and support. Think of your search index as a replica of your master data, and you can always migrate to another solution, or re-shard/scale if you need to in the future.
You may be interested in the Redis Geo API.
The Geo API consists of a set of new commands that add support for storing and querying pairs of longitude/latitude coordinates into Redis keys. GeoSet is the name of the data structure holding a set of (x,y) coordinates. Actually, there isn’t any new data structure under the hood: a GeoSet is simply a Redis SortedSet.
Redis Geo Tutorial
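For illustration, a minimal redis-py sketch driving the raw GEO commands (key names and coordinates are made up; execute_command is used so the exact wrapper signature across redis-py versions does not matter, and on Redis 6.2+ GEOSEARCH can replace GEORADIUS):

import redis

r = redis.Redis()   # assumes a local Redis >= 3.2, where the GEO commands exist

# GEOADD key longitude latitude member  -- stored internally as a sorted set
r.execute_command("GEOADD", "online:NewYork", -73.9857, 40.7484, "user:1")
r.execute_command("GEOADD", "online:NewYork", -73.9680, 40.7851, "user:2")

# All members within 5 km of a query point, closest first, with distances.
nearby = r.execute_command(
    "GEORADIUS", "online:NewYork", -73.9772, 40.7527, 5, "km", "WITHDIST", "ASC"
)
print(nearby)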
I will also suggest MongoDB for these requirements; with MongoDB Compass you can also visualize your geospatial data. The MongoDB Compass documentation is at https://docs.mongodb.com/compass/getting-started/.

Algorithm for finding visually similar photos from a database?

TinEye, Google and others offer a "reverse image search" -- you can upload a photo and within seconds it will find similar photos.
Is there an open-source version of these algorithms?
I know about "SIFT" and other algorithms for finding "visually similar" photos, but they only work for comparing one photo directly to another. i.e., to find similar photos to a given photo is an O(n) operation, to find all visually similar photos would be O(n^2) -- both of which are prohibitively slow.
I need a feature descriptor that is indexable by a [relational] database to reduce the result set to something more manageable.
By "visually similar" I mean very similar. i.e, a photo that has been lightly touched up/recolored in Photoshop, slightly cropped or resized, photos taken in rapid succession of the same scene, or flipped or rotated images.
A valid approach you can consider is the Bag-of-Words model.
Basically, you do an offline computation on the target images: you extract a bunch of features from those images in order to create a codebook, using an algorithm like k-means clustering. Searching for the nearest images then amounts to running a nearest-neighbour search in the space of the codebook.
For the neighbour search you can use FLANN
http://www.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN
http://opencv.willowgarage.com/documentation/cpp/flann_fast_approximate_nearest_neighbor_search.html
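A hedged sketch of that pipeline using scikit-learn in place of FLANN (extract_local_descriptors is a placeholder for a real keypoint descriptor such as SIFT or ORB from OpenCV, and the codebook size is arbitrary):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def extract_local_descriptors(image):
    # Placeholder: return an (n_keypoints, descriptor_dim) array, e.g. from SIFT/ORB.
    rng = np.random.default_rng(abs(hash(image)) % (2 ** 32))
    return rng.standard_normal((200, 128))

database_images = ["img_001.jpg", "img_002.jpg", "img_003.jpg"]

# 1) Offline: build a codebook (visual vocabulary) from all local descriptors.
all_descriptors = np.vstack([extract_local_descriptors(im) for im in database_images])
codebook = KMeans(n_clusters=64, n_init=10, random_state=0).fit(all_descriptors)

def bow_histogram(image):
    # Quantize an image's descriptors against the codebook -> fixed-length vector.
    words = codebook.predict(extract_local_descriptors(image))
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / (hist.sum() or 1.0)

# 2) Offline: index every database image by its Bag-of-Words histogram.
index_vectors = np.array([bow_histogram(im) for im in database_images])
nn = NearestNeighbors(n_neighbors=2).fit(index_vectors)

# 3) Online: query with a new image and return the closest database images.
distances, indices = nn.kneighbors([bow_histogram("query.jpg")])
print([database_images[i] for i in indices[0]])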
Take a look also at:
Visual similarity search algorithm
This is only one possibility and, truth be told, this topic is really challenging and the literature on it is huge.
Just some references:
http://www.cs.nott.ac.uk/~qiu/webpages/Papers/ColorPatternRecognition.pdf
http://cs.brown.edu/~th/papers/Hofmann-UAI99.pdf
http://www.ifp.illinois.edu/~jyang29/ScSPM.htm
http://johnwinn.org/Publications/papers/Savarese_Winn_Criminisi_Correlatons_CVPR2006.pdf
http://www-cvr.ai.uiuc.edu/ponce_grp/publication/paper/cvpr06b.pdf
Take a look at http://vision.caltech.edu/malaa/software/research/image-search/ -- it uses the LSH algorithm and a kind of kd-tree.
This task is also called CBIR (content-based image retrieval) or image duplicate search.

Clustering of 10's of millions of high dimensional data

I have a set of 50 million text snippets and I would like to create some clusters out of them. The dimensionality might be somewhere between 60k and 100k. The average text snippet length is 16 words. As you can imagine, the frequency matrix would be pretty sparse. I am looking for a software package / library / SDK that would allow me to find those clusters. I tried CLUTO in the past, but this seems a very heavy task for CLUTO. From my research online I found that BIRCH is an algorithm that can handle such problems, but, unfortunately, I couldn't find any BIRCH implementation software online (I only found a couple of ad-hoc implementations, like assignment projects, that lacked any sort of documentation whatsoever). Any suggestions?
You may be interested in checking out the Streaming EM-tree algorithm, which uses the TopSig representation. Both of these are from my Ph.D. thesis on the topic of large-scale document clustering.
We recently clustered 733 million documents on a single 16-core machine (http://ktree.sf.net). It took about 2.5 days to index the documents and 15 hours to cluster them.
The Streaming EM-tree algorithm can be found at https://github.com/cmdevries/LMW-tree. It works with binary document vectors produced by TopSig which can be found at http://topsig.googlecode.com.
I wrote a blog post about a similar approach earlier at http://chris.de-vries.id.au/2013/07/large-scale-document-clustering.html. However, the EM-tree scales better for parallel execution and also produces better quality clusters.
If you have any questions please feel free to contact me at chris#de-vries.id.au.
My professor made this implementation of the BIRCH algorithm in Java. It is easy to read, with some inline comments.
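For Python users, scikit-learn also ships a BIRCH implementation; below is a minimal sketch for sparse text snippets, with TruncatedSVD applied first to tame the 60k-100k dimensionality (all parameter values are placeholders, not tuned recommendations).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import Birch

snippets = [
    "cheap flights to london this weekend",
    "cheap weekend flights to london",
    "completely unrelated text about football",
    # ... tens of millions of snippets in the real setting
]

# Sparse term-frequency matrix (60k-100k columns in the real data).
X = TfidfVectorizer().fit_transform(snippets)

# Reduce dimensionality so BIRCH's CF-tree stays manageable; a few hundred
# components is a common ballpark for text, but it is a tunable choice.
X_reduced = TruncatedSVD(n_components=2).fit_transform(X)

# threshold and branching_factor control the size and shape of the CF-tree.
clusterer = Birch(threshold=0.25, branching_factor=50, n_clusters=2)
labels = clusterer.fit_predict(X_reduced)
print(labels)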
Try it with a graph partitioning algorithm. It may help you make clustering of high-dimensional data possible.
I suppose you're rather looking for something like all-pairs similarity search.
This will give you pairs of similar records up to a desired threshold. You can use a bit of graph theory to extract clusters afterwards: consider each pair an edge. Extracting connected components will then give you something like single-linkage clustering, while cliques will give you complete-linkage clusters.
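A small sketch of that graph step, assuming the pairs come from whatever all-pairs search you run, with scores already above your threshold:

import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

n_records = 6
# (i, j) pairs whose similarity exceeded the threshold, i.e. the graph's edges.
similar_pairs = [(0, 1), (1, 2), (4, 5)]

rows = [i for i, _ in similar_pairs]
cols = [j for _, j in similar_pairs]
data = np.ones(len(similar_pairs))

# Adjacency matrix; connected components behave like single-linkage clusters.
adjacency = coo_matrix((data, (rows, cols)), shape=(n_records, n_records))
n_clusters, labels = connected_components(adjacency, directed=False)

print(n_clusters, labels)   # 3 clusters here: {0, 1, 2}, {3} and {4, 5}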
I just found an implementation of BIRCH in C++.

Generating 'neighbours' for users based on rating

I'm looking for techniques to generate 'neighbours' (people with similar taste) for users on a site I am working on; something similar to the way last.fm works.
Currently, I have a compatibility function for users which could come into play. It ranks users on having 1) rated similar items and 2) rated the items similarly. The function weighs point 2 higher, and this would be the most important factor if I had to use only one of them when generating 'neighbours'.
One idea I had would be to just calculate the compatibility of every combination of users and select the highest-rated users to be the neighbours for each user. The downside of this is that as the number of users goes up, this process could take a very long time. For just 1,000 users, it needs 1000C2 (0.5 * 1000 * 999 = 499,500) calls to the compatibility function, which could also be very heavy on the server.
So I am looking for any advice, links to articles etc on how best to achieve a system like this.
In the book Programming Collective Intelligence
http://oreilly.com/catalog/9780596529321
Chapter 2 "Making Recommendations" does a really good job of outlining methods of recommending items to people based on similarities between users. You could use the similarity algorithms to find the 'neighbours' you are looking for. The chapter is available on google book search here:
http://books.google.com/books?id=fEsZ3Ey-Hq4C&printsec=frontcover
Be sure to look at Collaborative Filtering. Many recommendation systems use collaborative filtering to suggest items to users. They do it by finding 'neighbors' and then suggesting items your neighbors rated highly but you haven't rated. You could go as far as finding neighbors, and who knows, maybe you'll want recommendations in the future.
GroupLens is a research lab at the University of Minnesota that studies collaborative filtering techniques. They have a ton of published research as well as a few sample datasets.
The Netflix Prize is a competition to determine who can most effectively solve this sort of problem. Follow the links off their LeaderBoard. A few of the competitors share their solutions.
As far as a computationally inexpensive solution, you could try this:
Create categories for your items. If we're talking about music, they might be classical, rock, jazz, hip-hop... or go further: Grindcore, Math Rock, Riot Grrrl...
Now, every time a user rates an item, roll their rating up at the category level. So you know 'User A' likes Honky Tonk and Acid House because they frequently give those items high ratings. Frequency and strength are probably both important for your category aggregate score.
When it's time to find neighbors, instead of cruising through all ratings, just look for similar scores in the categories.
This method wouldn't be as accurate but it's fast.
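A rough sketch of that roll-up (the category map, the ratings and the cosine comparison are just one possible concrete choice):

from collections import defaultdict
from math import sqrt

item_category = {"song_a": "Honky Tonk", "song_b": "Honky Tonk",
                 "song_c": "Acid House", "song_d": "Grindcore"}

# ratings[user][item] = rating on whatever scale the site uses
ratings = {
    "user_a": {"song_a": 5, "song_b": 4, "song_c": 5},
    "user_b": {"song_a": 4, "song_c": 4},
    "user_c": {"song_d": 5},
}

def category_profile(user_ratings):
    # Roll item ratings up to the category level (summing keeps both
    # frequency and strength in the aggregate score).
    profile = defaultdict(float)
    for item, score in user_ratings.items():
        profile[item_category[item]] += score
    return profile

def cosine(p, q):
    keys = set(p) | set(q)
    dot = sum(p.get(k, 0) * q.get(k, 0) for k in keys)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

profiles = {u: category_profile(r) for u, r in ratings.items()}

# Neighbours of user_a, ranked by category-level similarity instead of raw ratings.
target = profiles["user_a"]
neighbours = sorted((u for u in profiles if u != "user_a"),
                    key=lambda u: cosine(target, profiles[u]), reverse=True)
print(neighbours)   # user_b first (shares Honky Tonk / Acid House), then user_c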
Cheers.
What you need is a clustering algorithm, which would automatically group similar users together. The first difficulty that you are facing is that most clustering algorithms expect the items they cluster to be represented as points in a Euclidean space. In your case, you don't have the coordinates of the points. Instead, you can compute the value of the "similarity" function between pairs of them.
One good possibility here is to use spectral clustering, which needs precisely what you have: a similarity matrix. The downside is that you still need to compute your compatibility function for every pair of points, i.e. the algorithm is O(n^2).
If you absolutely need an algorithm faster than O(n^2), then you can try an approach called dissimilarity spaces. The idea is very simple. You invert your compatibility function (e.g. by taking its reciprocal) to turn it into a measure of dissimilarity or distance. Then you compare every item (user, in your case) to a set of prototype items, and treat the resulting distances as coordinates in a space. For instance, if you have 100 prototypes, then each user would be represented by a vector of 100 elements, i.e. by a point in 100-dimensional space. Then you can use any standard clustering algorithm, such as K-means.
The question now is how you choose the prototypes, and how many you need. Various heuristics have been tried; however, here is a dissertation which argues that choosing prototypes randomly may be sufficient. It shows experiments in which using 100 or 200 randomly selected prototypes produced good results. In your case, if you have 1,000 users and you choose 200 of them to be prototypes, then you would need to evaluate your compatibility function 200,000 times, which is an improvement of a factor of 2.5 over comparing every pair. The real advantage, though, is that for 1,000,000 users 200 prototypes would still be sufficient, and you would need to make 200,000,000 comparisons rather than 500,000,000,000, an improvement of a factor of 2500. What you get is an O(n) algorithm, which is better than O(n^2), despite a potentially large constant factor.
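A hedged sketch of that prototype embedding (the compatibility function here is a toy stand-in for the real one; the prototype and cluster counts are placeholders):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_users, n_prototypes = 1000, 200

# Toy stand-in for user data; in practice only the compatibility function is needed.
user_taste = rng.standard_normal((n_users, 20))

def compatibility(a, b):
    # Placeholder for the real compatibility score (higher = more similar).
    return float(user_taste[a] @ user_taste[b])

def dissimilarity(a, b):
    # Invert compatibility into a distance-like quantity.
    return 1.0 / (1.0 + max(compatibility(a, b), 0.0))

# Pick prototypes at random and embed every user by its distances to them:
# 1000 users x 200 prototypes = 200,000 compatibility calls instead of ~500,000.
prototypes = rng.choice(n_users, size=n_prototypes, replace=False)
embedding = np.array([[dissimilarity(u, p) for p in prototypes]
                      for u in range(n_users)])

# Any standard clustering algorithm now works on the embedded points.
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(embedding)
# Candidate neighbours for a user = the other users in the same cluster.
print(np.bincount(labels))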
The problem seems to be a 'classification problem'. Yes, there are many solutions and approaches.
To start exploring, check this:
http://en.wikipedia.org/wiki/Statistical_classification
Have you heard of Kohonen networks?
It's a self-organizing learning algorithm that clusters similar variables into similar slots. Although most sites, like the one I linked to, display the net as two-dimensional, there is little involved in extending the algorithm into a multi-dimensional hypercube.
With such a data structure, finding and storing neighbours with similar tastes is trivial, as similar users should be stored in similar locations (almost like a reverse hash code).
This reduces your problem to one of finding the variables that will define similarity and establishing distances between possible enumerated values; for example, classical and acoustic are close together while death metal and reggae are quite distant (at least in my opinion).
By the way, in order to find good dividing variables, the best algorithm is a decision tree. The nodes closer to the root will be the most important variables for establishing 'closeness'.
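A minimal NumPy sketch of such a self-organizing map over toy taste vectors (the grid size, learning rate and neighbourhood schedule are all placeholder choices):

import numpy as np

rng = np.random.default_rng(0)

# Toy user taste vectors; in practice these would be rating-derived features.
users = rng.random((200, 5))

grid_w, grid_h, dim = 6, 6, users.shape[1]
weights = rng.random((grid_w, grid_h, dim))    # one codebook vector per map cell

def best_matching_unit(x):
    d = np.linalg.norm(weights - x, axis=2)    # distance from x to every cell
    return np.unravel_index(np.argmin(d), d.shape)

# Training: pull the winning cell and its grid neighbours towards each sample.
n_steps = 2000
gx, gy = np.meshgrid(np.arange(grid_w), np.arange(grid_h), indexing="ij")
for step in range(n_steps):
    x = users[rng.integers(len(users))]
    bx, by = best_matching_unit(x)
    lr = 0.5 * (1 - step / n_steps)                # decaying learning rate
    radius = max(1.0, 3.0 * (1 - step / n_steps))  # decaying neighbourhood radius
    grid_dist2 = (gx - bx) ** 2 + (gy - by) ** 2
    influence = np.exp(-grid_dist2 / (2 * radius ** 2))
    weights += lr * influence[..., None] * (x - weights)

# Users mapped to the same (or adjacent) cells are 'neighbours' with similar taste.
cells = [best_matching_unit(u) for u in users]
print(cells[:10])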
It looks like you need to read about clustering algorithms. The general idea is that instead of comparing every point with every other point each time, you divide them into clusters of similar points. Then the neighbourhood may be all the points in the same cluster. The number/size of the clusters is usually a parameter of the clustering algorithm.
You can find a video about clustering in Google's series about cluster computing and MapReduce.
Concerns over performance can be greatly mitigated if you treat this as a build/batch problem rather than a realtime query.
The graph can be statically computed and then periodically updated, e.g. hourly or daily, to generate edges and storage optimized for runtime queries, e.g. the top 10 similar users for each user.
+1 for Programming Collective Intelligence too - it is very informative - wish it wasn't (or I was!) as Python-oriented, but still good.
