Extracting key phrases from documents indexed in solr - solr

I want to extract key phrases from the documents indexed in Solr and show those phrases to the user as tags. This should happen after the query has been executed.
For example, if I type the query India and get the first 50 results on the page, I want to display the important phrases from those 50 documents as tags that the user can use for further filtering.
How do I extract key phrases from the text?

I believe org.apache.lucene.search.highlight.Highlighter is what you are looking for.
An example showing its use can be seen here: http://www.tinhtruong.me/2012/04/highlighting-text-with-lucene.html (among other places).

Related

SOLR: Search for a value in multiple fields

I am looking for a way to query for values in multiple fields. I am building a simple search engine where the user can type, e.g., "Java How to XML JSON" and it will search for these values in three different fields: categories, tags, and description.
I read on a blog that I should query all fields with q=*:* and then filter on those fields, for example fq=categories:java,xml,how,to,json description:java,xml,how,to,json tags:java,xml,how,to,json
This works :| But it seems incorrect to just copy and paste the values like this.
Is there a correct way of doing this? I have been researching this for some time but I haven't found a solution.
Any help is appreciated,
Thank you
You can use defType=edismax to get the extended dismax handler. It is meant to handle user-typed queries (i.e. what you'd type into a search box). You can then use qf (query fields) to tell the edismax handler which fields you want to search, with an optional weight for each field:
q=Java How to XML JSON&defType=edismax&qf=categories^5 tags description
.. will search each part of the string "Java How to XML JSON" in all the fields, and any hits in the categories field will be weighted five times higher than hits in the other two fields.
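As a minimal sketch of how a client might assemble that request, the snippet below URL-encodes the same three parameters with the Python standard library. The field names come from the question; nothing here talks to Solr, and the resulting string would still have to be appended to the select URL of your own collection.

```python
from urllib.parse import urlencode

# Parameters from the answer above; the field names come from the question.
params = {
    "q": "Java How to XML JSON",
    "defType": "edismax",
    "qf": "categories^5 tags description",
}

# urlencode escapes the spaces and the ^ boost marker so the values are URL-safe.
query_string = urlencode(params)
print(query_string)
```

Encoding the values matters here because both the user's text and the qf value contain spaces.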

Solr query string not working for full text searches

I'm following this tutorial on how to perform indexing on sample documents using Solr. The default collection is "gettingstarted" as shown. Now I'm trying to query it. There are 52 entries as shown:
However, when I replace the q argument with, say, electronics, I get nothing, although it should return 14 results.
When I replace the query string q with cat:electronics, I do get the 14 results. Why is this the case? Isn't q=word supposed to search for word wherever it appears?
No, it's not. Your assumption that:
isn't q=word supposed to search for word wherever it appears?
is wrong. If you're using word as your only query, and nothing more - you're searching for word in the default search field. It does not search all available fields in all available documents.
Also be aware that the default query parser assumes that your query is in the Lucene Query Syntax. To handle more "natural" querying, you can use the edismax query parser. This query parser supports the qf parameter, which tells Solr which fields to search instead of requiring the cat:electronics syntax. Your example would then be q=electronics&defType=edismax&qf=cat.
In the example documents you've given, qf=series_t author name cat is probably a decent value to search all these fields for the given query. You can also append ^<weight> to a field name to give hits in the different fields different weights. qf=name^10 cat would give a hit in name ten times the weight of a hit in the cat field.
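The weighting syntax can be generated rather than typed by hand. The following is only a sketch: build_qf is a hypothetical helper (not part of Solr or any client library) that turns a mapping of field names to optional weights into a qf value.

```python
def build_qf(weights):
    """Join field names into a qf value, adding ^weight where one is given.

    `weights` maps a field name to a number, or to None for no explicit weight.
    This helper is hypothetical and only formats a string.
    """
    parts = []
    for field, weight in weights.items():
        parts.append(field if weight is None else f"{field}^{weight}")
    return " ".join(parts)


params = {
    "q": "electronics",
    "defType": "edismax",
    "qf": build_qf({"name": 10, "cat": None}),
}
print(params["qf"])
```

Keeping the weights in a mapping makes it easier to tune them without editing the query string itself.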

Solr - Nested Edismax Query

I am using Solr (with pySolr) to search products in my database, returning products, facets and facet.pivots:
result = solr.search(query_s, **{
    'rows': '24',
    'sort': formatted_sort,
    'facet': 'on',
    'facet.limit': '-1',
    'facet.mincount': '1',
    'facet.field': ['gender', 'material'],
    'facet.pivot': 'brand,series',
    'fq': '-in_stock:(0 OR 99 OR 100 OR 101)'
})
The query_s selects specific fields, for example: brand:Target AND gender:Men's.
I would like to combine the above query with a DisMax query so that I can also run a full-text search over specified fields. I found an article which demonstrates nested queries, and I have tried to implement something like this:
q: "gender:* AND _query_:"{!edismax qf=brand series}Summer""
For some reason 'Target' will return results for Target brand shirts, but only with correct capitalization. 'Summer' which is a series of Target, won't return any results. Why am I not seeing a list of docs ordered by relevancy?
Am I overcomplicating things by using Dismax altogether?
The dismax parsers are useful for making sense of more "natural" queries, i.e. queries where users just type what they're looking for, which is how most search engines work.
In your case it sounds like brand:Target AND gender:Men's are filters for which documents should be shown, while the query is the part that the user has typed. Usually you'll want to have the filters in fq as they don't affect score (i.e. they're exact values matching a field value), and the query in q.
I assume that Summer is what the user would have typed into your search box, which would give you:
q=Summer&defType=edismax&qf=series
But this assumes that the series field is defined as a text field that has an analyzer attached, so that the values are lowercased and split appropriately.
If you also have a description field you'd like to search, you can do:
q=Summer&defType=edismax&qf=series^20 description
.. which would search for Summer in both the series and description fields, but give 20 times more weight to a hit in the series field. This is a good way to naturally boost documents that match more exact data in your documents. If you also include the brand field, you'd be able to let your users search for "target summer" and similar queries.
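Put together, the split between filters and user text might look like the sketch below. build_search_params is a hypothetical helper, and quoting the brand and gender values as phrases is an assumption about how those fields are indexed. The snippet only builds the keyword arguments; the pysolr call is shown in a comment.

```python
def build_search_params(fields, filters):
    """Return keyword arguments for an edismax search with filter queries.

    Hypothetical helper: `fields` are qf entries (optionally with ^weight),
    `filters` are fq clauses that restrict results without affecting score.
    """
    return {
        "defType": "edismax",
        "qf": " ".join(fields),
        "fq": list(filters),
    }


user_text = "Summer"  # what the user typed into the search box
params = build_search_params(
    fields=["series^20", "description"],
    filters=[
        'brand:"Target"',  # assumed exact-match filter
        'gender:"Men\'s"',  # assumed exact-match filter
        "-in_stock:(0 OR 99 OR 100 OR 101)",  # filter from the question
    ],
)

# With pysolr this would be passed as: result = solr.search(user_text, **params)
print(params)
```

The facet parameters from the question could be merged into the same dictionary before the call.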

Manipulating and Removing Facets in Apache Solr

I am creating a front end application which queries through a database using the Apache Solr engine, but I have two issues that I just cannot find the answer to.
When Solr is processing a facet query, how do I get the facet to be a single phrase ("Department of the Navy (160)") instead of four separate terms ("Department (160)", "of (200)", "the (200)", "Navy (160)")?
Also, how do I exclude certain terms from the facets, for example "and", "to", "the", etc.?
Thank you.
It looks like your phrase is being indexed into a text field which, among other things, splits on whitespace. This is very good for full-text search but not for faceting.
You can add a duplicate field of type string (not text), which is not split. You can still use the original field for searching and use the new string field for faceting.
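To see why the field type changes the facet counts, here is a small simulation in plain Python. It does not use Solr: the sample values are made up, and splitting on whitespace only approximates what a text field's analyzer does (a real analyzer would typically lowercase as well).

```python
from collections import Counter

# Hypothetical field values for three documents.
values = [
    "Department of the Navy",
    "Department of the Navy",
    "Department of the Army",
]

# Roughly what faceting on a whitespace-tokenized text field counts:
# the number of documents containing each individual term.
tokenized_facets = Counter(term for v in values for term in set(v.split()))

# What faceting on an untokenized string copy counts: each whole value.
string_facets = Counter(values)

print(tokenized_facets)
print(string_facets)
```

The string copy also avoids standalone entries such as "of" and "the", since no value is ever broken into terms.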

SOLR: Is it possible to index multiple timestamp:value pairs per document?

Is it possible in Solr to index key-value pairs for a single document, like:
Document ID: 100
2011-05-01,20
2011-08-23,200
2011-08-30,1000
Document ID: 200
2011-04-23,10
2011-04-24,100
and then querying for documents with a specific value aggregation in a specific time range, i.e. "give me documents with sum(value) > 0 between 2011-08-01 and 2011-09-01" would return the document with id 100 in the example data above.
Here is a post from the Solr User Mailing List where a couple of approaches for dealing with fields as key/value pairs are discussed.
1) encode the "id" and the "label" in the field value; facet on it;
require clients to know how to decode. This works really well for simple
things where the id=>label mappings don't ever change, and are
easy to encode (i.e. "01234:Chris Hostetter"). This is a horrible approach
when id=>label mappings do change with any frequency.
2) have a separate type of "metadata" document, one per "thing" that you
are faceting on containing fields for id and the label (and probably a
doc_type field so you can tell it apart from your main docs) then once
you've done your main query and gotten the results back faceted on id,
you can query for those ids to get the corresponding labels. This works
really well if the labels ever change (just reindex the corresponding
metadata document) and has the added bonus that you can store additional
metadata in each of those docs, and in many use cases for presenting an
initial "browse" interface, you can sometimes get away with a cheap
search for all metadata docs (or all metadata docs meeting a certain
criteria) instead of an expensive facet query across all of your main
documents.
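The first approach can be sketched in a few lines. The function names are hypothetical, and the sketch assumes that an id never contains a colon, so the first colon always separates the id from the label.

```python
def encode(item_id, label):
    """Pack an id and its label into one field value, e.g. "01234:Chris Hostetter"."""
    return f"{item_id}:{label}"


def decode(value):
    """Split an encoded facet value back into (id, label).

    Splits on the first colon only, so labels may contain colons
    but ids may not (an assumption of this sketch).
    """
    item_id, label = value.split(":", 1)
    return item_id, label


encoded = encode("01234", "Chris Hostetter")
print(encoded)
print(decode(encoded))
```

As the quoted post notes, this only suits mappings that never change, because every indexed value embeds the label.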
