Is it possible to use Lucene of Solr for image retrieval? - solr

I am searching for a retrieval server right now for my image retrieval project. As I see from the Internet, Lucene and Solr are particularized for textual seraching but do you think is it possible and reasonable to convey them for image retrieval.
You might suggest a image specific tool like LIRE but it has predefined featreu extraction algorithms and not very flexible for new features. Basically, all I need to index my image features from my extraction pipeline (written in Python) with a server like Lucene or Solr and perform some retrieval tasks based on Euclidean distance on indexed features.
Any suggestion or pointer to any reference would be very useful. Thanks.

Based on your post , you could store the features as keyword fields in Lucene or ES (solr has a strict schema definition and i don't think it would fit your needs very well as the feature matrix is usually sparsely populated in my understanding), and have a unique ID field from the image hash. Then you can just search for feature values ( feature1:value1 AND feature2:value2) and see what matches the query.

If you're going to work with Euclidean distances, you'll want to look into using the Spatial Features of Solr. This will allow you to index your values as coordinates, then perform indexed lookups from other points and sort by their Euclidean distances.
You might also want to look at the dist and sqedist functions.

Related

lucene Fields vs. DocValues

I'm using and playing with Lucene to index our data and I've come across some strange behaviors concerning DocValues Fields.
So, Could anyone please just explain the difference between a regular Document field (like StringField, TextField, IntField etc.) and DocValues fields
(like IntDocValuesField, SortedDocValuesField (the types seem to have change in Lucene 5.0) etc.) ?
First, why can't I access DocValues using document.get(fieldname)? if so, how can I access them?
Second, I've seen that in Lucene 5.0 some features are changed, for example sorting can only be done on DocValues... why is that?
Third, DocValues can be updated but regular fields cannot (you have to delete and add the whole document)...
Also, and perhaps most important, when should I use DocValues and when regular fields?
Joseph
Most of these questions are quickly answered by either referring to the Solr Wiki or to a web search, but to get the gist of DocValues: they're useful for all the other stuff associated with a modern Search service except for the actual searching. From the Solr Community Wiki:
DocValues are a way of recording field values internally that is more efficient for some purposes, such as sorting and faceting, than traditional indexing.
...
DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.
This should also answer why Lucene 5 requires DocValues for sorting - it's a lot more efficient than the previous approach.
The reason for this is that the storage format is turned around from the standard format when gathering data for these operations, where the application previously have to go through each document to find the values, it can now look up the values and find the corresponding documents instead. Which is very useful when you already have a list of documents that you need to perform an intersection on.
If I remember correctly, updating a DocValue-based field involves yanking the document out from the previous token list, and then re-inserting it into the new location, compared to the previous approach where it would change loads of dependencies (and reindexing was the only viable strategy).
Use DocValues for fields that need any of the properties mentioned above, such as sorting / faceting / etc.

Lucene and SQL Server - best practice

I am pretty new to Lucene, so would like to get some help from you guys :)
BACKGROUND: Currently I have documents stored in SQL Server and want to use Lucene for full-text/tag searches on those documents in SQL Server.
Q1) In this case, in order to do the keyword search on the documents, should I insert all of those documents to the Lucene index? Does this mean there will be data duplication (one in SQL Server and the other one in the Lucene index?) It could be a matter since we have a massive amount of documents (about 100GB). Is it inevitable?
Q2) Also, each documents has a set of tags (up to 3). Lucene is also good choice for the tag search? If so, how to do it?
Thanks,
Yes, providing full-text search through Lucene and data storage through a traditional database is a well-supported architecture. Take a look here, for a brief introduction. A typical implementation would be to index anything you wish to be able to support searching on, and store only a unique identifier in the Lucene index, and pull any records founds by a search from the database, based on the ID. If you want to reduce DB load, you can store some information in Lucene to display a list of search results, and only query the database in order to fetch the full document.
As for saving on space, there will be some measure of duplication. This is true even if you only Lucene, though. Lucene stores the inverted index used for searching entirely separately from stored data. For saving on space, I'd recommend being very deliberate about what data you choose to index, and what you need to store and be able to retrieve later. What you store is particularly important for saving space in Lucene, since indexed-only values tend to be very space-efficient, in most cases.
Lucene can certainly implement a tag search. The simple way to implement it would be to add each tag to a field of your choosing (I'll call is "tags", which seems to make sense), while building the document, such as:
document.add(new Field("tags", "widget", Field.Store.NO, Field.Index.ANALYZED));
document.add(new Field("tags", "forkids", Field.Store.NO, Field.Index.ANALYZED));
and I could simply add a required term to any query to search only within a particular tag. For instance, if I was to search for "some stuff", but only with the tag "forkids", I could write a query like:
some stuff +tags:forkids
Documents can also be stored in Lucene, you can retrieve and reference them using the document ID.
I would suggest using Solr http://lucene.apache.org/solr/ on top of Lucene, is more user friendly and has multiValued fields (for the tags) available by default.
http://wiki.apache.org/solr/SchemaXml

Algorithms for key value pair, where key is string

I have a problem where there is a huge list of strings or phrases it might scale from 100,000 to 100Million. when i search for a phrase if found it gives me the Id or index to database for further operation. I know hash table can be used for this, but i am looking for other algorithm which could serve me to generate index based on strings and can also be useful in some other features like autocomplete etc.
I read suffix tree/array based on some SO threads they serve the purpose but consumes alot memory than i can afford. Any alternatives to this?
Since my search is only in a huge list of millions of strings. No docs no webpages not interested in search engine like lucene etc.
Also read about inverted index sounds helpful but which algorithm i need to study for it?.
If this Database index is within MS SQL Server you may get good results with SQL Full Text Indexing. Other SQL providers may have a similar function but I would not be able to help with those.
Check out: http://www.simple-talk.com/sql/learn-sql-server/understanding-full-text-indexing-in-sql-server/
and
http://msdn.microsoft.com/en-us/library/ms142571.aspx

How to store document vectors in a database for a search engine?

I have implemented a search engine in Java. It has a database that stores the inverted index ie mapping from terms to list of documents the term appears in. There is a feature that allows a user to upload a document which can be added to document for indexing. The problem that i'm facing is that , everytime a new document is added , the index is reconstructed in memory instead of being updated . To update , i would need a database that stores document vectors that are essentially tf-idf's(term frequency* inverse document frequency) of each and every term in the index. I'm not able to work out database structure for it as in what rows and columns or multiple tables would be needed for storing such a structure.
I need to store
1. Document ID
2. Document Title
3. N dimensional Document vector where N is the number of unique terms
4. N terms
5. IDF of each term
6. TF of each term for every document.
I need it so that at the time of query matching i can extract this vector and calculate its similarity with the query vector.If you want any additional information, please let me know.
Thank you very much , I'm sure i would get some help here.
Are you sure you wanna use a Database to implement a search engine?
You may take a look at this Java framework which does an excellent job and very simple to learn .
Lucene Tutorial in 5 mins
It uses the Vector Space Model and there's no need for you to worry about all the above fields you mentioned in your post, since Lucene stores them along with much more advanced ranking factors.
I am sorry that my reply doesn't help you if you are intentionally using the Databases.

Geoclusters in SOLR

We're reimplementing a search that includes locations that need to be clustered on a map. I've been searching without luck for an implementation in SOLR.
The current search with map clustering implemented is at http://www.uship.com/find
Has anyone seen similar or have ideas about how to best do this?
Regards,
Nick
If the requirement is to cluster a fairly small number of points, perhaps less than 1000, then Solr needn't be involved. Grab the points and plot them using something like HeatmapJS.
I presume the requirement is to cluster all results in a search which may potentially be many thousands or even millions of documents. I suggest starting with generating a heatmap of the densities over a grid of the search area. You can do this by indexing each point encoded in geohash form at each length (e.g. D2RY, D2R, D2, D). But then precede the length by how long it is: 4_D2RY, 3_D2R, 2_D2, 1_D. These little strings go into a multi-valued "string" type field in Solr that you will then facet on. When faceting, you'll come up with a suitable grid resolution (e.g. goehash prefix length) and then use that as a prefix query, like facet.prefix=4_ You can index the point using a LatLonType field separately and do a standard bounding box query there. At this point, you're faceted search results will give you the information to fill in a grid of numbers. The beauty of this scheme is that it is fast -- you could generate such heat-maps on the fly. It will use a fair amount of RAM though since this is faceting on a multi-valued field that will have a ton of values. This is something I want to add to the new Lucene spatial module (or perhaps at the Solr layer) in a way that won't need extra memory and to make it easy. It won't make it to Solr 4.0, but maybe 4.1.
At this stage, perhaps a heatmap is fine as-is. But you may want to apply clustering on top of this, as your question states. Someone tipped me off to some interesting geo clustering algorithms that can be applied to heatmaps.
I don't know whether you searched lucidworks, but there are many interesting resources there:
Search with Polygons: Another Approach to Solr Geospatial Search
Go through these:
http://www.lucidimagination.com/search/?q=geospatial#%2Fn
Already implemented in Solr:
http://wiki.apache.org/solr/SpatialSearch/ (what's wrong with this approach?)
http://wiki.apache.org/solr/SpatialSearchDev
https://issues.apache.org/jira/browse/SOLR-3304

Resources