Metric for finding similar images in a database - database

There's a lot of different algorithms for computing the similarity between two images, but I can't find anything on how you would store this information in a database such that you can find similar images quickly.
By "similar" I mean exact duplicates that have been rotated (90 degree increments), color-adjusted, and/or re-saved (lossy jpeg compression).
I'm trying to come up with a "fingerprint" of the images such that I can look them up quickly.
The best I've come up with so far is to generate a grayscale histogram. With 16 bins and 256 shades of gray, I can easily create a 16-byte fingerprint. This works reasonably well, but it's not quite as robust as I'd like.
Another solution I tried was to resize the images, rotate them so they're all oriented the same way, grayscale them, normalize the histograms, and then shrink them down to about 8x8, and reduce the colors to 16 shades of gray. Although the miniature images were very similar, they were usually off by a pixel or two, which means that exact matching can't work.
Without exact-matching, I don't believe there's any efficient way to group similar photos (without comparing every photo to every other photo, i.e., O(n^2)).
So, (1) How can I create I create a fingerprint/signature that is invariant to the requirements mentioned above? Or, (2) if that's not possible, what other metric can I use such that given a single image, I can find it's best matches in a database of thousands?

There's one little confusing thing in your question: the "fingerprint" you linked to is explicitly not meant to find similar images (quote):
TinEye does not typically find similar images (i.e. a different image with the same subject matter); it finds exact matches including those that have been cropped, edited or resized.
Now, that said, I'm just going to assume you know what you are asking, and that you actually want to be able to find all similar images, not just edited exact copies.
If you want to try and get into it in detail, I would suggest looking up papers by Sivic, Zisserman and Nister, Stewenius. The idea these two papers (as well as quite a bit of others lately) have been using is to try and apply text-searching techniques to image databases, and search the image database in a same manner Google would search it's document (web-page) database.
The first paper I have linked to is a good starting point for this kind of approach, since it addresses mainly the big question: What are the "words" in the images?. Text searching techniques all focus on words, and base their similarity measures on calculations including word counts. Successful representation of images as collections of visual words is thus the first step to applying text-searching techniques to image databases.
The second paper then expands on the idea of using text-techniques, presenting a more suitable search structure. With this, they allow for a faster image retrieval and larger image databases. They also propose how to construct an image descriptor based on the underlying search structure.
The features used as visual words in both papers should satisfy your invariance constraints, and the second one definitely should be able to work with your required database size (maybe even the approach from the 1st paper would work).
Finally, I recommend looking up newer papers from the same authors (I'm positive Nister did something new, it's just that the approach from the linked paper has been enough for me until now), looking up some of their references and just generally searching for papers concerning Content based image (indexing and) retrieval (CBIR) - it is a very popular subject right now, so there should be plenty.

Related

Algorithm for finding visually similar photos from a database?

TinEye, Google and others offer a "reverse image search" -- you can upload a photo and within seconds it will find similar photos.
Is there an open-source version of these algorithms?
I know about "SIFT" and other algorithms for finding "visually similar" photos, but they only work for comparing one photo directly to another. i.e., to find similar photos to a given photo is an O(n) operation, to find all visually similar photos would be O(n^2) -- both of which are prohibitively slow.
I need a feature descriptor that is indexable by a [relational] database to reduce the result set to something more manageable.
By "visually similar" I mean very similar. i.e, a photo that has been lightly touched up/recolored in Photoshop, slightly cropped or resized, photos taken in rapid succession of the same scene, or flipped or rotated images.
A valid approach you can consider is the Bag-of-Words model.
Basically you can do an offline computation of the target images. You can extract from those images a bunch of features in order to create a codebook with algorithms like k-means clustering. Searching for the nearest images will lead to the applications of an algorithm like Nearest neighbor search in the space of the codebook.
For the neighbour search you can use FLANN
http://www.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN
http://opencv.willowgarage.com/documentation/cpp/flann_fast_approximate_nearest_neighbor_search.html
Take a look also at:
Visual similarity search algorithm
This is only a possibility and, truth must be told, this topic is really challenging and litterature on it is really huge.
Just some references:
http://www.cs.nott.ac.uk/~qiu/webpages/Papers/ColorPatternRecognition.pdf
http://cs.brown.edu/~th/papers/Hofmann-UAI99.pdf
http://www.ifp.illinois.edu/~jyang29/ScSPM.htm
http://johnwinn.org/Publications/papers/Savarese_Winn_Criminisi_Correlatons_CVPR2006.pdf
http://www-cvr.ai.uiuc.edu/ponce_grp/publication/paper/cvpr06b.pdf
Take a look at http://vision.caltech.edu/malaa/software/research/image-search/ it uses LSH algorithm and some kind of kd-tree.
Also this task is called CBIR or image duplicate search.

Get info from map data file by coordinate

Imagine I have a map shape file (.shp) or osm xml, I'm able to see different kind of data from different layers in GIS oriented programs, e.g. ArcGIS, QGIS etc. But how can I get this info programmatically? Is there a specific library for that?
What I'm really looking for is a some kind of method getMapData(longitude, latitude) to get landscape/terrain info (e.g. forest, river, city, highway) in specified location
Thanks in advance for your answers!
It still depends what you want to achieve whether you are better off using raster or vector data.
If your are using your grid to subdivide an area as an array of containers for geographic features, then stick with vector data. To do this, I would create a polygon grid file and intersect it with each of your data layers. You can then add an ID field that represents the cell's location in the array (and hence it's relative position to a known lat/long coordinate - let's say lower left). Alternatively you can use spatial queries to access your data by selecting a polygon in your vector grid file and then finding all the features in your other file that are contained by it.
OTOH, if you want to do some multi-feature analysis based on presence/abscence then you may be better going down the route of raster analysis. My gut feeling from what you have said is that this is what you are trying to achieve but I am still not 100% sure. You would handle this by creating a set of boolean rasters of a suitable resolution and then performing maths operations on the set (add, subtract, average etc - depending on what questions your are asking).
Let's say you are looking at animal migration. Let's say your model assumes that streams, hedges and towns are all obstacles to migration but roads only reduce the chance of an area being crossed. So you convert your obstacles to a value of '1' and NoData to '0' in each case, except roads where you decide to set the value to 0.5. You can then add all your rasters together in one big stack and predict migration routes.
Ok that's a simplistic example but perhaps you can see why we need EVEN more information on what you are wanting to do.
Shapefiles or an osm xml file are just containers that hold geometric shapes. There are plenty of software libraries out there that let you read these files and extract the data. I would recommend looking at GDAL/OGR as a starting point.
A method like getMapData(longitude, latitude) is essentially a search/query function. You need to be a little more specific too, do you want geometries that contain the point, are within a distance of a point, etc?
You could find the map data using a brute force algorithm
for shape in shapefile:
if shape.contains(query_point):
return shape
Or you can use more advanced algorithms/data structures such as RTrees, KDTrees, QuadTrees, etc. The easiest way to get start with querying map data is to load it into a spatial database. I would recommending investigating PostgreSQL+PostGIS and SpatiaLite
You may also like to look at Spatialite and/or PostGIS which are two spatial enabled databses that you could use separately or in conjunction with GDAL/OGR.
I must echo Charles' request that you explain your use-case in more detail because the actual implementation will depend greatly on exactly what you are wanting to achieve. My reading of this is that you may want to convert your data into a series of aligned rasters which you can overlay and treat as a 3 dimensional array.

How to best do server-side geo clustering?

I want to do pre-clustering for a set of approx. 500,000 points.
I haven't started yet but this is what I had thought I would do:
store all points in a localSOLR index
determine "natural cluster positions" according to some administrative information (big cities for example)
and then calculate a cluster for each city:
for each city
for each zoom level
query the index to get the points contained in a radius around the city (the length of the radius depends on the zoom level)
This should be quite efficient because there are only 100 major cities and SOLR queries are very fast. But a little more thinking revealed it was wrong:
there may be clusters of points that are more "near" one another than near a city: they should get their own cluster
at some zoom levels, some points will not be within the acceptable distance of any city, and so they will not be counted
some cities are near one another and therefore, some points will be counted twice (added to both clusters)
There are other approaches:
examine each point and determine to which cluster it belongs; this eliminates problems 2 and 3 above, but not 1, and is also extremely inefficient
make a (rectangular) grid (for each zoom level); this works but will result in crazy / arbitrary clusters that don't "mean" anything
I guess I'm looking for a general purpose geo-clustering algorithm (or idea) and can't seem to find any.
Edit to answer comment from Geert-Jan
I'd like to build "natural" clusters, yes, and yes I'm afraid that if I use an arbitrary grid, it will not reflect the reality of the data. For example if there are many events that occur around a point that is at or near the intersection of two rectangles, I should get just one cluster but will in fact build two (one in each rectangle).
Originally I wanted to use localSOLR for performance reasons (and because I know it, and have better experience indexing a lot of data into SOLR than loading it in a conventional database); but since we're talking of pre-clustering, maybe performance is not that important (although it should not take days to visualize a result of a new clustering experiment). My first approach of querying lots of points according to a predefined set of "big points" is clearly flawed anyway, the first reason I mentioned being the strongest: clusters should reflect the reality of the data, and not some other bureaucratic definition (they will clearly overlap, sure, but data should come first).
There is a great clusterer for live clustering, that has been added to the core Google Maps API: Marker Clusterer. I wonder if anyone has tried to run it "offline": run it for any amount of time it needs, and then store the results?
Or is there a clusterer that examines each point, point after point, and outputs clusters with their coordinates and number of points included, and which does this in a reasonable amount of time?
You may want to look into advanced clustering algorithms such as OPTICS.
With a good database index, it should be fairly fast.

Algorithm to find related words in a text

I would like to have a word (e.g. "Apple) and process a text (or maybe more). I'd like to come up with related terms. For example: process a document for Apple and find that iPod, iPhone, Mac are terms related to "Apple".
Any idea on how to solve this?
As a starting point: your question relates to text mining.
There are two ways: a statistical approach, and one form natural language processing (nlp).
I do not know much about nlp, but can say something about the statistical approach:
You need some vector space representation of your documents, see
http://en.wikipedia.org/wiki/Vector_space_model
http://en.wikipedia.org/wiki/Document-term_matrix
http://en.wikipedia.org/wiki/Tf%E2%80%93idf
In order to learn semantics, that is: different words mean the same, or one word can have different meanings, you need a large text corpus for learning. As I said this is a statistical approach, so you need lots of samples.
http://www.daviddlewis.com/resources/testcollections/
Maybe you have lots of documents from the context you are going to use. That is the best situation.
You have to retrieve latent factors from this corpus. Most common are:
LSA (http://en.wikipedia.org/wiki/Latent_semantic_analysis)
PLSA (http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis)
nonnegative matrix factorization (http://en.wikipedia.org/wiki/Non-negative_matrix_factorization)
latent dirichlet allocation (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
These methods involve lots of math. Either you dig it, or you have to find good libraries.
I can recommend the following books:
http://www.oreilly.de/catalog/9780596529321/toc.html
http://www.oreilly.de/catalog/9780596516499/index.html
Like all of AI, it's a very difficult problem. You should look into natural language processing to learn about some of the issues.
One very, very simplistic approach can be to build a 2d-table of words, with for each pair of words the average distance (in words) that they appear in the text. Obviously you'll need to limit the maximum distance considered, and possibly the number of words as well. Then, after processing a lot of text you'll have an indicator of how often certain words appear in the same context.
What I would do is get all the words in a text and make a frequency list (how often each word appears). Maybe also add to it a heuristic factor on how far the word is from "Apple". Then read multiple documents, and cross out words that are not common in all the documents. Then prioritize based on the frequency and distance from the keyword. Of course, you will get a lot of garbage and possibly miss some relevant words, but by adjusting the heuristics you should get at least some decent matches.
The technique that you are looking for is called Latent Semantic Analysis (LSA). It is also sometimes called Latent Semantic Indexing. The technique operates on the idea that related concepts occur together in text. It uses statistics to build the word relationships. Given a large enough corpus of documents it will definitely solve your problem of finding related words.
Take a look at vector space models.

Best way to store large searchable text files

I am developing an online Bible search program. The Bible is a pretty large book, taking up nearly 5MB of space in plain text. I am planning on implementing an API in the program as well allowing other websites to include their own Bible search widgets and programs without having to develop the search queries or storing Bibles on their own servers.
With this in mind, I am going to expect that eventually I will have a moderate flow of queries passing through the program. Also, for those not familiar with the Bible, it has 2 methods of formatting the text. It can contain both red text and italics. I need a way to store the Scriptures along with the red letter and italics formatting but allowing the search queries to ignore the formatting.
It also needs to be fast and as efficient (memory and cpu usage) as possible. Any storage format will be considered (MySQL, JSON or XML text files, etc) as long as the querying can be done ignoring the formatting. File size and count doesn't really matter, so splitting up the books or even chapters into separate files is fine by me.
One more important thing to keep in mind though, is that I want to have some form of search method that can search across multiple verses. So a search for "but have everlasting life for God sent not his Son" would return John 3:16,17. Thanks for all ideas!
There are a bunch of different open source document search engines which are made for precisely what you're trying to do. Solr, Elastic Search, Xapian, Whoosh, Haystack (made for Django) and others. There are other posts on S.O. and elsewhere that go into the benefits of using one vs another, but your requirements are simple enough that any of them will be more than fine (and easily scale with very minimal effort should your project take off, which is always nice to know). So look at their examples and see which one looks most intuitive to you - Solr is arguably the most popular and it's the only one I've worked with, but Elastic Search uses the same popular Lucene backend and is apparently much easier to get up and running, so I would start there.
As for the actual implementation, you'll want to index each verse as a separate "document" if the single verse (or just verse number) is what you want to return. The search engine handles the ranking of the results based on relevancy (usually using a tf/idf algorithm, in case you're interested).
The way I'd handle the italics and red text is to include some kind of markup in the text (i.e. wrap the phrase in single asterisks for italics, double asterisks for red) and then tell the analyzer to ignore those characters - there may be a simpler way in the framework you end up choosing, though, so take that with a grain of salt. The queries spanning multiple verses requirement is more complicated, but the answer will probably involve indexing each whole chapter as a document instead of (or maybe in addition to? I'd have to think about it more) each verse.
A word of caution - if you're not familiar with search indexing, even something designed to be plug-and-play like Elastic Search will probably still require some time and effort to set up, so if you absolutely need to get this up and running quickly and you're already familiar with MySQL I suppose it could work (it does do fulltext search). But it's certainly not the best tool for the job, so if this is a project that you're invested in you will thank yourself later if you put in a little bit of work to learn one of these search frameworks. It may be overkill in terms of the amount of text you're dealing with, as others have pointed out, but it will be extremely flexible in how you can search on that text which seems to be what you want. For instance, adding other requirements later on would be very straightforward (for instance, you could let people limit their search to only matches in the red text).
I didn't know the bible had formatting. What is it used for? If it is for the verses, I'd suggest you store every verse in a database. In a highly normalized form, you got a table with books, a table with chapters and a table with verses. Each verse consists of a verse number and a verse text.
Now, I think the chapters don't have titles so they are actually just a number as well. In that case it it silly to store them separately, so you got just your table of books and a table of verses, in which each verse has a chapter number and a verse number and a verse text. That text I think of to be plain text, isn't it?
If the verse is plain text, you can easily make it searchable by storing it in MySQL and create a FULLTEXT index for it. That way, you can search quite efficiently and even use wildcards and such.
If the verse was to have formatting, you could choose to create two columns, one with the plain text for searching, and one with the formatted text for display, but I doubt you would need this.
PS: 5 MB of text is nothing really. If you got a dedicated program, you could keep it in memory in a single string and use strpos or a similar function to find a text. What language, database and platform are you using?

Resources