Extract query terms from text for querying Solr server - solr

I am using Solrj to build queries for Solr server.
So I have some pretty short free-form texts that can contain various special characters - like Mr. John's New-Wall, "Hotels & Food".
A phrase query for text like this would not produce enough matches. So from this text I would like to extract terms for building a simple query, something like content:Mr OR content:John's OR content:Hotels OR content:Food. (It probably would be good to somehow consider the term proximity, but I have to start with something).
The field that I am searching is the default text_general field. I started with replacing some special characters with spaces and splitting them up to extract the terms. But it feels kind of redundant.
Isn't there an easier way to extract terms from text using Solrj and Solr? Basically I would like to extract terms from text similarly to how it is done by Solr when it creates its index.

I am not sure exactly what your question is, but here is some information that you may find helpful:
"Basically I would like to extract terms from text similarly to how it is done by Solr when it creates its index."
You can configure how a field is processed at index time and at query time in your schema. I would suggest you take a look here. This gives you some flexibility to normalize your data.
"So from this text I would like to extract terms for building a simple query, something like content:Mr OR content:John's OR content:Hotels OR content:Food."
This is roughly what Solr already does under the hood: the query text is run through the field's query analyzer and the resulting terms are searched. I would suggest you look up the edismax query parser and its qf and tie parameters.
Hope it helps
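As a rough illustration of the edismax suggestion, here is a minimal Python sketch (not Solrj) that escapes the free-form text and builds the request parameters. The helper names escape_query_text and edismax_params are made up for this example, the content field comes from the question, and mm=1 is an assumption chosen to get OR-like matching; the analyzer configured for the field still does the actual term splitting on the server.

```python
import re
from urllib.parse import urlencode

# Characters (and the two-character operators && and ||) that have special
# meaning in the Lucene/Solr query syntax.
_SPECIAL = re.compile(r'(&&|\|\||[+\-!(){}\[\]^"~*?:\\/])')


def escape_query_text(text):
    """Backslash-escape query syntax so the text is parsed as plain terms."""
    return _SPECIAL.sub(r"\\\1", text)


def edismax_params(text, fields, tie=0.1, mm="1"):
    """Build request parameters for an edismax query (hypothetical helper)."""
    return {
        "q": escape_query_text(text),
        "defType": "edismax",
        "qf": " ".join(fields),  # fields to search
        "mm": mm,                # "1": at least one term must match (OR-like)
        "tie": str(tie),         # how much the non-best field scores contribute
    }


params = edismax_params('Mr. John\'s New-Wall, "Hotels & Food"', ["content"])
print(params["q"])
print(urlencode(params))
```

The same parameter names can be set on a SolrQuery object in Solrj; only the escaping and the choice of parameters matter here.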

Related

Apache Solr use entire string for search within collection

I have managed to create a dataset using Apache Solr. I have also managed to make queries, such as in this example:
content:(test1 OR test2) OR title: test2
I would now like to search the dataset using an entire string, in a similar fashion to searching on Google. Is the correct approach to keep using OR clauses on the title and content fields for each word in the query, or is there a better way to achieve this? (I am not looking for exact matches, just the most relevant ones.)
You can use the dismax or edismax query parser for this, and you can also pass phrases with boosts if you have any.
The DisMax query parser is designed to process simple phrases (without
complex syntax) entered by users and to search for individual terms
across several fields using different weighting (boosts) based on the
significance of each field. Additional options enable users to
influence the score based on rules specific to each use case
(independent of user input).
The parameters are described in detail on the Solr documentation page for the DisMax query parser (Solr Dismax).
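To make that concrete, here is a small Python sketch that builds a dismax request URL for a whole user string. The title and content fields are taken from the question; the helper name dismax_url, the boost values and the localhost URL are assumptions for illustration only.

```python
from urllib.parse import urlencode


def dismax_url(base_url, user_text, field_boosts, phrase_fields=None):
    """Build a select URL that searches user_text across weighted fields."""
    params = [
        ("q", user_text),
        ("defType", "dismax"),
        # qf: fields to search, each with a boost, e.g. "title^2 content^1"
        ("qf", " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())),
    ]
    if phrase_fields:
        # pf: boost documents where the whole input appears as a phrase
        params.append(("pf", " ".join(phrase_fields)))
    return f"{base_url}?{urlencode(params)}"


url = dismax_url(
    "http://localhost:8983/solr/mycollection/select",  # assumed host and collection
    "test1 test2",
    {"title": 2, "content": 1},
    phrase_fields=["title"],
)
print(url)
```

With this approach the user's string goes into q unchanged, and the per-field OR clauses no longer need to be written by hand.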

Solr multilingual search

I'm currently working on a project where we have indexed text content in Solr. Each piece of content is written in one specific language (we have 4 different European languages), but we would like to add a feature: if the primary search (the search text entered by the user) doesn't return many results, we also try to look for documents in the other languages. So we would somehow need to translate the query.
Our starting point is that we can have a mapping list of translated words commonly used in the field of the project.
One solution that came to me was to use the synonym search feature. But this might not provide the best results.
Do people have pointers to existing modules that could help us achieve this multilingual search feature? Or design ideas we could try to investigate?
Thanks
It seems that multilingual search is not a unique problem.
Please take a look at
http://lucene.472066.n3.nabble.com/Multilingual-Search-td484201.html
and
Solr index and search multilingual data
Those two links suggest having a dedicated field for each language, but you can also have a single field that states the document's language and add a filter query (&fq=) for the language you have detected from the user's query. I think this is the more scalable solution.
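A minimal sketch of the filter-query idea, including the fallback to the other languages that the question describes. The field names content and language and both helper names are hypothetical; language detection itself is assumed to happen elsewhere.

```python
def language_filtered_params(user_text, language,
                             text_field="content", language_field="language"):
    """Primary search: restrict results to the detected language via fq."""
    return {
        "q": user_text,
        "defType": "edismax",
        "qf": text_field,
        "fq": f"{language_field}:{language}",
    }


def fallback_params(user_text, primary, all_languages,
                    text_field="content", language_field="language"):
    """Fallback search: same query, restricted to the non-primary languages."""
    others = [lang for lang in all_languages if lang != primary]
    params = language_filtered_params(user_text, primary, text_field, language_field)
    params["fq"] = f"{language_field}:({' OR '.join(others)})"
    return params


languages = ["en", "fi", "de", "fr"]
print(language_filtered_params("hotel", "en"))
print(fallback_params("hotel", "en", languages))
```

The application would run the first request and only send the second one when the first returns too few results; the query terms would still need to be translated for the fallback to be useful.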
One option would be to translate your terms at index time. This could probably be done at the Solr level, or even before Solr at the application level. You would then store the translated texts in different fields, so you would have fields like:
text_en: "Hello",
text_fi: "Hei"
Then you can just query text_en:Hello and it would match.
And if you want to score primary language matches higher, you could have a primary_language field and then boost documents where it matches the search language higher.
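Here is one way this could look as request parameters, sketched in Python. The text_en / text_fi fields and primary_language come from the answer above; the helper name and the boost value are assumptions, and bq is the edismax boost-query parameter.

```python
def multilingual_params(user_text, languages, search_language, boost=2.0):
    """Search every per-language field, boosting the primary-language docs."""
    return {
        "q": user_text,
        "defType": "edismax",
        # one text field per language, e.g. "text_en text_fi"
        "qf": " ".join(f"text_{lang}" for lang in languages),
        # bq adds score to documents whose primary language matches
        "bq": f"primary_language:{search_language}^{boost}",
    }


params = multilingual_params("Hello", ["en", "fi"], "en")
print(params)
```

Documents in other languages still match through their translated fields; they just rank lower than documents whose primary_language equals the search language.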

Solr queries stored within Solr field

I have a set of keywords defined by client requirements stored in a SOLR field. I also have a never ending stream of sentences entering the system.
By using the sentence as the query against the keywords I am able to find those sentences that match the keywords. This is working well and I am pleased. What I have essentially done is reverse the way in which SOLR is normally used by storing the query in Solr and passing the text in as the query.
Now I would like to be able to extend the idea of having just a keyword in a field to having a more fully formed SOLR query in a field. Doing so would allow proximity searching etc. But, of course, this is where life becomes awkward. Placing SOLR query operators into a field will not work as they need to be escaped.
Does anyone know if it might be possible to use the SOLR "query" function or perhaps write a java class that would enable such functionality? Or is the idea blowing just a bit too much against the SOLR winds?
Thanks in advance.
ES has percolate for this - for Solr you'll usually index the document as a single document in a memory based core / index and then run the queries against that (which is what ES at least used to do internally, IIRC).
I would check out the percolate api with ElasticSearch. It would sure be easier using this api than having to write your own in Solr.
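To show the idea behind percolation without any search engine, here is a toy Python sketch: the queries are stored, and each incoming sentence is treated as a tiny one-document index that every stored query is run against. This is not the Solr or Elasticsearch API; the query format (a list of terms plus an optional maximum span) is invented for the example.

```python
import re
from itertools import product


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def query_matches(terms, max_span, tokens):
    """True if every term occurs and, when max_span is set, some choice of
    occurrences lies within max_span token positions of each other."""
    positions = [[i for i, tok in enumerate(tokens) if tok == term]
                 for term in terms]
    if any(not p for p in positions):
        return False
    if max_span is None:
        return True
    return any(max(combo) - min(combo) <= max_span
               for combo in product(*positions))


def percolate(sentence, stored_queries):
    """Return the names of all stored queries that match the sentence."""
    tokens = tokenize(sentence)
    return [name for name, (terms, max_span) in stored_queries.items()
            if query_matches(terms, max_span, tokens)]


stored_queries = {
    "hotels-near-food": (["hotels", "food"], 2),  # proximity-style query
    "john": (["john"], None),                     # plain keyword query
}
print(percolate("Hotels & Food", stored_queries))
print(percolate("Mr. John likes hotels and good food", stored_queries))
```

A real setup would parse the stored strings with the engine's own query parser instead of this hand-rolled matcher, which is what makes proximity and other operators available.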

Similarity/approximate queries in Solr

What is the simplest way to query Solr for documents that contain text similar to a (longish) passage? This is similar to what Elasticsearch match queries do, or what probabilistic search engines like Indri do by default. It is something between an AND and an OR query: none of the terms is required, but you get documents that contain many of the terms. You can also just pass a passage of raw text to the engine, and it returns documents with high term overlap with the passage, without having to parse or tokenize the text in the client. The best option I can see in the Solr query reference is to tokenize the query text myself, insert an OR between each pair of terms, and return the top N results. Is there a more concise way of doing it with Solr?
The answer above is correct. With MoreLikeThis (MLT) you can choose to find documents similar to another document in the index, similar to a given external URL, or similar to some given text. You can choose which field(s) to target and set various other parameters. Here's the official Solr Reference Guide documentation page for MLT: https://cwiki.apache.org/confluence/display/solr/MoreLikeThis
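For the "similar to some given text" case, a request to the MoreLikeThis handler might be built roughly like this. This is a sketch under assumptions: it presumes a MoreLikeThis handler is registered at /mlt in solrconfig.xml and that the server accepts stream.body, neither of which is guaranteed on a given installation; the host, collection and field names are made up.

```python
from urllib.parse import urlencode


def mlt_text_params(passage, fields, rows=10):
    """Parameters asking MoreLikeThis for documents similar to raw text."""
    return {
        "stream.body": passage,          # the passage to compare against
        "mlt.fl": ",".join(fields),      # fields used for similarity
        "mlt.mintf": "1",                # min term frequency in the passage
        "mlt.mindf": "1",                # min document frequency in the index
        "mlt.interestingTerms": "list",  # also return the terms MLT picked
        "rows": str(rows),               # top N results
    }


params = mlt_text_params("a longish passage of raw text", ["content"])
# Assumed host, collection and handler path:
url = "http://localhost:8983/solr/mycollection/mlt?" + urlencode(params)
print(url)
```

Solr then selects the interesting terms from the passage itself, so the client does not have to tokenize the text or join terms with OR.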

Complex queries with Solr 4

I would like to fire complex queries in Solr 4. If I am using Lucene, I can search using XML Query parser and get the results I need. However, I am not able to see how to use the XML Query Parser in Solr.
I need to be able to execute queries with proximity searches, booleans, wildcards, span or, phrases (although these can be handled by proximity searches).
Guidance on material on how to proceed also welcome.
Regards
Puneet
As far as I know it's still a work in progress. More info can be found at their Jira. You can of course use the normal query language, it's also capable of doing pretty complex things, for example:
"a proximity search"~2 AND *wildcards* OR "a phrase"
As you can see you can search for phrases, boolean operators (AND, OR, ...), span, proximity and wildcards. For more information about the query syntax look at the Lucene documentation. Solr also added some extra features on top of the Lucene query parser and more information about that can be found at the Solr wiki.
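If the query string is assembled in code, a tiny helper keeps the quoting and the ~slop suffix consistent. The sketch below is Python with a hypothetical phrase helper; it only rebuilds the example query shown above.

```python
def phrase(text, slop=None):
    """Quote text as a phrase; with slop, make it a proximity search."""
    # Escape backslashes and double quotes so they cannot end the phrase.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    quoted = f'"{escaped}"'
    return f"{quoted}~{slop}" if slop is not None else quoted


query = " ".join([
    phrase("a proximity search", slop=2),
    "AND",
    "*wildcards*",
    "OR",
    phrase("a phrase"),
])
print(query)
```

When AND and OR are mixed like this, adding explicit parentheses is usually the safer choice, because the parser's handling of mixed operators is easy to misread.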
Solr 4.8 now has the "complexphrase" query parser built in that can construct all sorts of complex proximity queries (i.e. phrase queries with embedded boolean logic and wildcards).
You can use a query URL such as:
http://xx.xxx.xx.xx:8983/solr/collectionname/select?indent=on&q={!complexphrase%20inOrder=true}"good*"&wt=json&fl=Category,keywords,ImageID
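Since the local-params prefix contains braces, spaces and quotes, it is safer to let a library do the URL encoding. A minimal Python sketch follows; the helper name and the localhost address are assumptions, while the collection and field names are taken from the URL above.

```python
from urllib.parse import urlencode, quote


def complexphrase_url(base_url, phrase, in_order=True, fields=()):
    """Build a select URL that uses the complexphrase query parser."""
    order = "true" if in_order else "false"
    params = {
        "indent": "on",
        # Local params select the parser; the phrase itself stays quoted.
        "q": f'{{!complexphrase inOrder={order}}}"{phrase}"',
        "wt": "json",
    }
    if fields:
        params["fl"] = ",".join(fields)
    # quote_via=quote encodes spaces as %20 instead of "+".
    return base_url + "?" + urlencode(params, quote_via=quote)


url = complexphrase_url(
    "http://localhost:8983/solr/collectionname/select",  # assumed host
    "good*",
    fields=["Category", "keywords", "ImageID"],
)
print(url)
```

The encoded URL is harder to read than the hand-written one, but it should survive clients and proxies that reject raw braces or quotes.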

Resources