Algorithm for finding visually similar photos from a database? - database

TinEye, Google and others offer a "reverse image search" -- you can upload a photo and within seconds it will find similar photos.
Is there an open-source version of these algorithms?
I know about "SIFT" and other algorithms for finding "visually similar" photos, but they only work for comparing one photo directly to another. i.e., to find similar photos to a given photo is an O(n) operation, to find all visually similar photos would be O(n^2) -- both of which are prohibitively slow.
I need a feature descriptor that is indexable by a [relational] database to reduce the result set to something more manageable.
By "visually similar" I mean very similar. i.e, a photo that has been lightly touched up/recolored in Photoshop, slightly cropped or resized, photos taken in rapid succession of the same scene, or flipped or rotated images.

A valid approach you can consider is the Bag-of-Words model.
Basically you can do an offline computation of the target images. You can extract from those images a bunch of features in order to create a codebook with algorithms like k-means clustering. Searching for the nearest images will lead to the applications of an algorithm like Nearest neighbor search in the space of the codebook.
For the neighbour search you can use FLANN
http://www.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN
http://opencv.willowgarage.com/documentation/cpp/flann_fast_approximate_nearest_neighbor_search.html
Take a look also at:
Visual similarity search algorithm
This is only a possibility and, truth must be told, this topic is really challenging and litterature on it is really huge.
Just some references:
http://www.cs.nott.ac.uk/~qiu/webpages/Papers/ColorPatternRecognition.pdf
http://cs.brown.edu/~th/papers/Hofmann-UAI99.pdf
http://www.ifp.illinois.edu/~jyang29/ScSPM.htm
http://johnwinn.org/Publications/papers/Savarese_Winn_Criminisi_Correlatons_CVPR2006.pdf
http://www-cvr.ai.uiuc.edu/ponce_grp/publication/paper/cvpr06b.pdf

Take a look at http://vision.caltech.edu/malaa/software/research/image-search/ it uses LSH algorithm and some kind of kd-tree.
Also this task is called CBIR or image duplicate search.

Related

Is it possible to find a optimal heuristic for this graph search example?

If we have a problem like the one in this image.
We are in Arad, for example, and we want to go to Fagaras using a informed search algorithm (like A*) and the only information available is the one in the picture, is it possible to get a heuristics that is not trivial, but consistent and admissible?

Caching user-specific proximity searches

The situation and the goal
Imagine a user search system that provides a proximity search from a user’s own position, which is specified by a decimal latitude/longitude combination. An Atlanta resident’s position, for instance, would be represented by 33.756944,-84.390278 and a perimeter search by this user should yield other users in his area from a radius of 10 mi, 50 mi, and so on.
A table-valued function calculates distances and provides users accordingly, ordered by ascending distance to the user that started the search. It’s always a live query, and it’s a tough and frequent one. Now, we want to built some sort of caching to reduce load.
On the way to solutions
So far, all users were grouped by the integer portion of their lat/long. The idea is to create cache files with all users from a grid square, so accessing the relevant cache file would be easy. If a grid square contains more users than a cache file should, the square is quartered or further divided into eight pieces and so on. To fully utilize a square and its cache file, multiple overlaying squares are contemplated. One deficiency of this approach is that gridding and quartering high density metropolitan areas and spacious countrysides into overlaying cache files may not be optimal.
Reading on, I stumbled upon topics like nearest neighbor searches, the Manhattan distance and tree-esque space partitioning techniques like a k-d tree, a quadtree or binary space partitioning. Also, SQL Server provides its own geographical datatypes and functions (though I’d guess the pure-mathematical FLOAT way has an adequate performance). And of course, the crux is making user-centric proximity searches cacheable.
Question!
I haven’t found much resources on this, but I’m sure I’m not the first one with this plan. Remember, it’s not about the search, but about caching.
Can I scrap my approach?
Are there ways of an advantageous partitioning of users into geographical divisions of equal size?
Is there a best practice for storing spatial user information for efficient proximity searches?
What do you think of the techniques mentioned above (quadtrees, etc.) and how would you pair them with caching?
Do you know an example of successfully caching user-specific proximity search?
Can I scrap my approach?
You can adapt your appoach, because as you already noted, a quadtree uses this technic. Or you use a geo spatial extension. That is available for MySql, too.
Are there ways of an advantageous partitioning of users into
geographical divisions of equal size
A simple fixed grid of equal size is fine when locations are equally distributed or if the area is very small. Geo locations are hardly equal distributed. Usually a geo spatial structure is used. see next answer:
Is there a best practice for storing spatial user information for
efficient proximity searches
Quadtree, k-dTree or R-Tree.
What do you think of the techniques mentioned above (quadtrees, etc.) and how would you pair them with caching?
There is some work from Hannan Samet, which describes Quadtrees and caching.

Automatic product classification and query weighting

I'm facing ranking problems using solr and I'm stucked.
Given a e-commerce site, for the query "ipad" i obtain:
ipad case for ipad 2
ipad case
ipad connection kit
ipad 32gb wifi
This is a problem, since we want to rank first the main products (or products by itself) and tf/idf ranks first the accessories due to descriptions like "ipad case compatible with ipad, ipad2, ipad3, ipad retina, ipad mini, etc".
Furthermore, using the categories we have no way of determining whether is an accessory or a product.
I wonder if using automatic classification would help. Another solution that improves this ranking (like Named Entity Recognition) would be appreciated.
Could you provide tagged data?
If you have >50k items a Naive Bayes with a bigram language model trained on the product name will almost catch all accessories with 99% accuracy. I guess you can train such a naive bayes with Mahout, however product names have a pretty limited bigram amount so this can be trained even on a smartphone easily and fast nowadays.
This is a typical mechanical turk task, shouldn't be that expensive to tag a few items. However if you insist on some semi-supervised algorithm, I found Iterative similarity aggregation pretty useful.
The main idea is that you give a few tokens like "case"/"power adapter" and it iteratively finds new tokens that are indicators of spam because they appear in the same context.
Here is the paper, but I have written a blogpost about this as well which sums up the intention in plain language. This paper also mentions the same "let the user find the right item" paradigm that Sean has proposed, so both can be used in conjunction.
Oh and if you need some advice of machine learning with Lucene&SOLR I can recommend you the talk of my friend Tommaso Teofili at ApacheCon Europe this year. You can find the slides on slideshare. There is also a youtube video of the talk out there, just search for it ;)
TF/IDF is just going to rank based on the words in the query vs words in the title as you have found. That sounds like it is not the right definition of "good result" and that you want to favor products over accessories.
Of course you can simply attach heuristics to patch the problem. For example, consider the title as a set of words, not multiset, so the appearance of "iPad" several times makes no difference. Or just boost the score of items that you know are products. This isn't learning per se, but are simple, directly reflect your business knowledge, and probably have some positive effect.
If you want to learn here, you probably need to use the one best source of knowledge about what the best results are: your users. You know what they click in response to each query. You can learn a term-item model that associates search terms to items clicked. You can view that as many types of problem -- actually a latent-factor recommender model could work well there.
Have a look at Ted's slides on how to use a recommender as a "search engine": http://www.slideshare.net/tdunning/search-as-recommendation

Metric for finding similar images in a database

There's a lot of different algorithms for computing the similarity between two images, but I can't find anything on how you would store this information in a database such that you can find similar images quickly.
By "similar" I mean exact duplicates that have been rotated (90 degree increments), color-adjusted, and/or re-saved (lossy jpeg compression).
I'm trying to come up with a "fingerprint" of the images such that I can look them up quickly.
The best I've come up with so far is to generate a grayscale histogram. With 16 bins and 256 shades of gray, I can easily create a 16-byte fingerprint. This works reasonably well, but it's not quite as robust as I'd like.
Another solution I tried was to resize the images, rotate them so they're all oriented the same way, grayscale them, normalize the histograms, and then shrink them down to about 8x8, and reduce the colors to 16 shades of gray. Although the miniature images were very similar, they were usually off by a pixel or two, which means that exact matching can't work.
Without exact-matching, I don't believe there's any efficient way to group similar photos (without comparing every photo to every other photo, i.e., O(n^2)).
So, (1) How can I create I create a fingerprint/signature that is invariant to the requirements mentioned above? Or, (2) if that's not possible, what other metric can I use such that given a single image, I can find it's best matches in a database of thousands?
There's one little confusing thing in your question: the "fingerprint" you linked to is explicitly not meant to find similar images (quote):
TinEye does not typically find similar images (i.e. a different image with the same subject matter); it finds exact matches including those that have been cropped, edited or resized.
Now, that said, I'm just going to assume you know what you are asking, and that you actually want to be able to find all similar images, not just edited exact copies.
If you want to try and get into it in detail, I would suggest looking up papers by Sivic, Zisserman and Nister, Stewenius. The idea these two papers (as well as quite a bit of others lately) have been using is to try and apply text-searching techniques to image databases, and search the image database in a same manner Google would search it's document (web-page) database.
The first paper I have linked to is a good starting point for this kind of approach, since it addresses mainly the big question: What are the "words" in the images?. Text searching techniques all focus on words, and base their similarity measures on calculations including word counts. Successful representation of images as collections of visual words is thus the first step to applying text-searching techniques to image databases.
The second paper then expands on the idea of using text-techniques, presenting a more suitable search structure. With this, they allow for a faster image retrieval and larger image databases. They also propose how to construct an image descriptor based on the underlying search structure.
The features used as visual words in both papers should satisfy your invariance constraints, and the second one definitely should be able to work with your required database size (maybe even the approach from the 1st paper would work).
Finally, I recommend looking up newer papers from the same authors (I'm positive Nister did something new, it's just that the approach from the linked paper has been enough for me until now), looking up some of their references and just generally searching for papers concerning Content based image (indexing and) retrieval (CBIR) - it is a very popular subject right now, so there should be plenty.

Generating 'neighbours' for users based on rating

I'm looking for techniques to generate 'neighbours' (people with similar taste) for users on a site I am working on; something similar to the way last.fm works.
Currently, I have a compatibilty function for users which could come into play. It ranks users on having 1) rated similar items 2) rated the item similarly. The function weighs point 2 heigher and this would be the most important if I had to use only one of these factors when generating 'neighbours'.
One idea I had would be to just calculate the compatibilty of every combination of users and selecting the highest rated users to be the neighbours for the user. The downside of this is that as the number of users go up then this process couls take a very long time. For just a 1000 users, it needs 1000C2 (0.5 * 1000 * 999 = = 499 500) calls to the compatibility function which could be very heavy on the server also.
So I am looking for any advice, links to articles etc on how best to achieve a system like this.
In the book Programming Collective Intelligence
http://oreilly.com/catalog/9780596529321
Chapter 2 "Making Recommendations" does a really good job of outlining methods of recommending items to people based on similarities between users. You could use the similarity algorithms to find the 'neighbours' you are looking for. The chapter is available on google book search here:
http://books.google.com/books?id=fEsZ3Ey-Hq4C&printsec=frontcover
Be sure to look at Collaborative Filtering. Many recommendation systems use collaborative filtering to suggest items to users. They do it by finding 'neighbors' and then suggesting items your neighbors rated highly but you haven't rated. You could go as far as finding neighbors, and who knows, maybe you'll want recommendations in the future.
GroupLens is a research lab at the University of Minnesota that studies collaborative filtering techniques. They have a ton of published research as well as a few sample datasets.
The Netflix Prize is a competition to determine who can most effectively solve this sort of problem. Follow the links off their LeaderBoard. A few of the competitors share their solutions.
As far as a computationally inexpensive solution, you could try this:
Create categories for your items. If we're talking about music, they might be classical, rock, jazz, hip-hop... or go further: Grindcore, Math Rock, Riot Grrrl...
Now, every time a user rates an item, roll up their ratings at the category level. So you know 'User A' likes Honky Tonk and Acid House because they give those items high ratings frequently. Frequency and strength is probably important for your category aggregate score.
When it's time to find neighbors, instead of cruising through all ratings, just look for similar scores in the categories.
This method wouldn't be as accurate but it's fast.
Cheers.
What you need is a clustering algorithm, which would automatically group similar users together. The first difficulty that you are facing is that most clustering algorithms expect the items they cluster to be represented as points in a Euclidean space. In your case, you don't have the coordinates of the points. Instead, you can compute the value of the "similarity" function between pairs of them.
One good possibility here is to use spectral clustering, which needs precisely what you have: a similarity matrix. The downside is that you still need to compute your compatibility function for every pair of points, i. e. the algorithm is O(n^2).
If you absolutely need an algorithm faster than O(n^2), then you can try an approach called dissimilarity spaces. The idea is very simple. You invert your compatibility function (e. g. by taking its reciprocal) to turn it into a measure of dissimilarity or distance. Then you compare every item (user, in your case) to a set of prototype items, and treat the resulting distances as coordinates in a space. For instance, if you have 100 prototypes, then each user would be represented by a vector of 100 elements, i. e. by a point in 100-dimensional space. Then you can use any standard clustering algorithm, such as K-means.
The question now is how do you choose the prototypes, and how many do you need. Various heuristics have been tried, however, here is a dissertation which argues that choosing prototypes randomly may be sufficient. It shows experiments in which using 100 or 200 randomly selected prototypes produced good results. In your case if you have 1000 users, and you choose 200 of them to be prototypes, then you would need to evaluate your compatibility function 200,000 times, which is an improvement of a factor of 2.5 over comparing every pair. The real advantage, though, is that for 1,000,000 users 200 prototypes would still be sufficient, and you would need to make 200,000,000 comparisons, rather than 500,000,000,000 an improvement of a factor of 2500. What you get is O(n) algorithm, which is better than O(n^2), despite a potentially large constant factor.
The problem seems like to be 'classification problems'. Yes there are so many solutions and approaches.
To start exploration check this:
http://en.wikipedia.org/wiki/Statistical_classification
Have you heard of kohonen networks?
Its a self organing learning algorithm that clusters similar variables into similar slots. Although most sites like the one I link you to displays the net as bidimensional there is little involved in extending the algorithm into a multiple dimension hypercube.
With such a data structure finding and storing neighbours with similar tastes is trivial as similar users should be stores into similar locations (almost like a reverse hash code).
This reduces your problem into one of finding the variables that will define similarity and establishing distances between possible enumerate values ,like for example classical and acoustic are close toghether while death metal and reggae are quite distant (at least in my oppinion)
By the way in order to find good dividing variables the best algorithm is a decision tree. The nodes closer to the root will be the most important variables to establish 'closeness'.
It looks like you need to read about clustering algorithms. The general idea is that instead of comparing every point with every other point each time you divide them in clusters of similar points. Then the neighborhood may be all the points in the same cluster. The number/size of the clusters is usually a parameter of the clustering algorithm.
Yo can find a video about clustering in Google's series about cluster computing and mapreduce.
Concerns over performance can be greatly mitigated if you consider this as a build/batch problem rather than a realtime query.
The graph can be statically computed then latently updated e.g. hourly, daily etc. to then generate edges and storage optimized for runtime query e.g. top 10 similar users for each user.
+1 for Programming Collective Intelligence too - it is very informative - wish it wasn't (or I was!) as Python-oriented, but still good.

Resources