Similarity/approximate queries in Solr

What is the simplest way to query Solr for documents that contain text similar to a (longish) passage? This is similar to what ElasticSearch match queries do, or what probabilistic search engines like Indri do by default. It is something between an AND and an OR query: none of the terms is required, but you get documents that contain many of the terms. You can also just pass a passage of raw text to the engine, and it returns documents with high term overlap with the passage, without having to parse or tokenize the text in the client. The best option I can see in the Solr query reference is to tokenize the query text myself, insert an OR between each pair of terms, and return the top N results. Is there a more concise way of doing it with Solr?

The answer above is correct: MoreLikeThis (MLT) is the feature to use. You can choose to find documents similar to another document in the index, similar to a given external URL, or similar to some given text. You can choose which field(s) to target, along with various other parameters. Here's the official Solr Reference Guide documentation page for MLT: https://cwiki.apache.org/confluence/display/solr/MoreLikeThis
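As a rough sketch of the "similar to some given text" case, the request parameters could be assembled as below. This is only an illustration under assumptions: it presumes a MoreLikeThisHandler is registered in solrconfig.xml (commonly at /mlt), that the handler accepts the passage as a content stream via stream.body (which may need to be enabled on newer Solr versions), and the class name, field name, and threshold values are made up for the example.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper; not part of Solr or SolrJ.
final class MltRequestSketch {

    // Builds the URL query string for a MoreLikeThis request that passes raw text.
    static String buildQueryString(String passage, String fields, int rows) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("stream.body", passage); // the raw passage (assumes stream bodies are allowed)
        params.put("mlt.fl", fields);       // field(s) used to compute similarity
        params.put("mlt.mintf", "1");       // minimum term frequency in the passage (illustrative value)
        params.put("mlt.mindf", "1");       // minimum document frequency in the index (illustrative value)
        params.put("rows", String.valueOf(rows));
        // URL-encode every value and join the pairs with '&'.
        return params.entrySet().stream()
                .map(e -> e.getKey() + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }
}
```

The resulting string would be appended to something like http://localhost:8983/solr/yourcore/mlt? (host, port, and core name are placeholders).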

Related

Apache Solr use entire string for search within collection

I have managed to create a dataset using Apache Solr. I have also managed to make queries, such as in this example:
content:(test1 OR test2) OR title: test2
I would now like to search the dataset using an entire string, in similar fashion to searching on google. Is the correct way to approach this to keep using or tags on the title and content for each word within the query, or is there a better way to achieve this ? (I am not looking for exact matches, just the most relevant ones)

You can use the dismax or edismax query parser for this, and pass phrases with boosts if you have any.
The DisMax query parser is designed to process simple phrases (without
complex syntax) entered by users and to search for individual terms
across several fields using different weighting (boosts) based on the
significance of each field. Additional options enable users to
influence the score based on rules specific to each use case
(independent of user input).
The parameters are described in detail on the Solr DisMax documentation page.
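To make that concrete, here is a minimal sketch of the parameters such a request might carry. The parameter names (defType, q, qf, mm) are standard for these parsers, but the class name, field names, boosts, and the mm value are assumptions chosen for illustration, so tune them against your own schema.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper; not part of Solr or SolrJ.
final class EdismaxRequestSketch {

    // Builds the URL query string for a /select request using the edismax parser.
    static String buildQueryString(String userText, String queryFields, String minShouldMatch) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("defType", "edismax");  // select the extended DisMax parser
        params.put("q", userText);         // the raw user input, passed as-is
        params.put("qf", queryFields);     // fields to search, with optional boosts, e.g. "title^2 content"
        params.put("mm", minShouldMatch);  // how many of the terms must match, e.g. "30%"
        // URL-encode every value and join the pairs with '&'.
        return params.entrySet().stream()
                .map(e -> e.getKey() + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }
}
```

With a low mm value no single term is required, but documents matching more of the terms score higher, which is the "most relevant, not exact" behavior asked about.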

Difference between full text and free text search in solr (other search db)

New to search databases and working with one. What is the difference between full text and free text search/index?

They are more or less the same. More precisely, they are synonyms.
They are techniques used by search engines to find results in a database.
Solr uses the Lucene project for its search engine. It is used when you have large documents to search and cannot use LIKE queries in a regular RDBMS because of performance.
It mainly follows two stages: indexing and searching. The indexing stage scans the text of all the documents and builds a list of search terms. In the search stage, when performing a specific query, only the index is referenced, rather than the text of the original documents.
Suppose you type John and Ryan: the query will return all the documents that contain either "John" or "Ryan". Order and case do not matter.
In a nutshell, unless you are using the terms in a specific use case, they are just different names for the same person.
Call him Cristiano or CR7, he is the same :)

Extract query terms from text for querying Solr server

I am using Solrj to build queries for Solr server.
So I have some pretty short free-form texts that can contain various special characters - like Mr. John's New-Wall, "Hotels & Food".
A phrase query for text like this would not produce enough matches. So from this text I would like to extract terms for building a simple query, something like content:Mr OR content:John's OR content:Hotels OR content:Food. (It probably would be good to somehow consider the term proximity, but I have to start with something).
The field that I am searching is the default text_general field. I started with replacing some special characters with spaces and splitting them up to extract the terms. But it feels kind of redundant.
Isn't there an easier way to extract terms from text using Solrj and Solr? Basically I would like to extract terms from text similarly to how it is done by Solr when it creates its index.

I am not sure exactly what your question is, however here is a bit of info that you may find helpful:
Basically I would like to extract terms from text similarly to how it is done by Solr when it creates its index.
You can configure indexing and query field processing in your schema. I would suggest you take a look in here. This gives you a bit of flexibility to normalize your data.
So from this text I would like to extract terms for building a simple query, something like content:Mr OR content:John's OR content:Hotels OR content:Food.
This is roughly how Solr queries by default under the hood. I would suggest you look up the edismax query parser and its qf and tie parameters.
Hope it helps
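If you do still want to build the OR query on the client, a minimal sketch could look like the following. Note the assumptions: the regular expression is only a crude stand-in for the analysis Solr applies to text_general (it is not the same tokenizer), the class and method names are made up, and tokens that happen to be query operators such as AND, OR, or NOT are not handled.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical helper; not part of Solr or SolrJ.
final class OrQuerySketch {

    // Splits free text into rough terms and joins them as field:term clauses with OR.
    static String buildOrQuery(String field, String text) {
        // Treat any run of characters that are not letters, digits, or apostrophes as a separator.
        return Arrays.stream(text.split("[^\\p{L}\\p{N}']+"))
                .filter(term -> !term.isEmpty())   // drop the empty token a leading separator produces
                .map(term -> field + ":" + term)
                .collect(Collectors.joining(" OR "));
    }
}
```

Because every separator character is discarded, none of the query syntax characters (quotes, ampersands, hyphens, and so on) survive into the query, so no extra escaping is needed for this particular splitting rule.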

How to get full documents via MoreLikeThis search in solr?

I'm quite new to the MoreLikeThis search in Solr, but I find one option is missing.
The wiki pages and Google (and Stack Overflow) searches say nothing about the document format of the value returned by an MLT search.
My aim is to get either all fields or at least a specified set of fields in the returned documents, but it seems that one has no influence over which fields are included in the similar documents.
Of course one can run a query for each of the documents from the MoreLikeThis result to get those fields, but I don't like the idea of doing multiple queries where just one could really be sufficient.
I would really appreciate it if anybody knows a way to influence the result format of the documents.
Thanks.

Lucene and SQL Server - best practice

I am pretty new to Lucene, so would like to get some help from you guys :)
BACKGROUND: Currently I have documents stored in SQL Server and want to use Lucene for full-text/tag searches on those documents in SQL Server.
Q1) In this case, in order to do the keyword search on the documents, should I insert all of those documents into the Lucene index? Does this mean there will be data duplication (one copy in SQL Server and the other in the Lucene index)? It could be a problem since we have a massive amount of documents (about 100GB). Is it inevitable?
Q2) Also, each document has a set of tags (up to 3). Is Lucene also a good choice for the tag search? If so, how do I do it?
Thanks,

Yes, providing full-text search through Lucene and data storage through a traditional database is a well-supported architecture. Take a look here, for a brief introduction. A typical implementation would be to index anything you wish to be able to support searching on, store only a unique identifier in the Lucene index, and pull any records found by a search from the database, based on the ID. If you want to reduce DB load, you can store some information in Lucene to display a list of search results, and only query the database in order to fetch the full document.
As for saving on space, there will be some measure of duplication. This is true even if you only use Lucene, though. Lucene stores the inverted index used for searching entirely separately from stored data. For saving on space, I'd recommend being very deliberate about what data you choose to index, and what you need to store and be able to retrieve later. What you store is particularly important for saving space in Lucene, since indexed-only values tend to be very space-efficient, in most cases.
Lucene can certainly implement a tag search. The simple way to implement it would be to add each tag to a field of your choosing (I'll call it "tags", which seems to make sense), while building the document, such as:
document.add(new Field("tags", "widget", Field.Store.NO, Field.Index.ANALYZED));
document.add(new Field("tags", "forkids", Field.Store.NO, Field.Index.ANALYZED));
and I could simply add a required term to any query to search only within a particular tag. For instance, if I was to search for "some stuff", but only with the tag "forkids", I could write a query like:
some stuff +tags:forkids

Documents can also be stored in Lucene; you can retrieve and reference them using the document ID.
I would suggest using Solr (http://lucene.apache.org/solr/) on top of Lucene. It is more user-friendly and has multiValued fields (for the tags) available by default.
http://wiki.apache.org/solr/SchemaXml
