search most frequently used word in a selected set of documents - solr

I need to find the most frequently used words in a given field from a selected set of documents. I tried the Luke handler:
http://localhost:8983/solr/admin/luke?fl=my_field&numTerms=1
But this query computes its results over the whole index, not just the selected documents.

Assuming your field tokenizes to your definition of a word, you can just use faceting for that, restricted to the documents matched by your query. Faceting counts the tokens generated for the field, which is why facet fields are usually plain strings.
So, in your case, you want the opposite effect: a tokenized field, so that each word is counted separately.
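As a rough illustration of the faceting approach, the sketch below builds a select URL that counts the most frequent terms of a field within a document selection. The base URL, the selection query category:news and the helper name are assumptions for illustration; facet, facet.field, facet.limit and facet.mincount are standard Solr faceting parameters.

```python
from urllib.parse import urlencode

def build_facet_url(base_url, selection_query, field, limit=10):
    """Return a select URL that facets on `field` within the matched documents."""
    params = [
        ("q", selection_query),   # restricts counting to the selected documents
        ("rows", 0),              # only facet counts are needed, not the documents
        ("facet", "true"),
        ("facet.field", field),   # must be a tokenized field to count single words
        ("facet.limit", limit),   # keep only the most frequent terms
        ("facet.mincount", 1),    # skip terms that do not occur in the selection
        ("wt", "json"),
    ]
    return base_url + "?" + urlencode(params)

# Hypothetical selection: all documents whose category is "news".
url = build_facet_url("http://localhost:8983/solr/select", "category:news", "my_field")
print(url)
```

With facet.limit set, the terms come back ordered by count, so the first entry is the most frequently used word in the selection.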

Related

Solr query string not working for full text searches

I'm following this tutorial on how to perform indexing on sample documents using Solr. The default collection is "gettingstarted" as shown. Now I'm trying to query it. There are 52 entries as shown:
When I replace the q argument with, say, electronics, I expect 14 results, but I get nothing.
When I replace the query string q with cat:electronics, I do get the 14 results. But why is this the case? Isn't q=word supposed to search for word wherever it appears?
No, it's not. Your assumption that:
isn't q=word supposed to search for word wherever it appears?
is wrong. If you're using word as your only query, and nothing more - you're searching for word in the default search field. It does not search all available fields in all available documents.
Also be aware that the default query parser assumes that your query is in the Lucene Query Syntax. To handle more "natural" querying, you can use the edismax query parser. This query parser supports the qf parameter that tells Solr which fields to search, instead of having to use the cat:electronics syntax. Your example would then be q=electronics&qf=cat.
In the example documents you've given, qf=series_t author name cat is probably a decent value to search all these fields for the given query. You can also append ^<weight> to a field name to give hits in the different fields different weights. qf=name^10 cat would give a hit in name ten times the weight of a hit in the cat field.
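A minimal sketch of such a request, assuming the gettingstarted collection from the question is served under the usual select handler path; the helper name is hypothetical, while defType, q and qf are the parameters discussed above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_edismax_url(base_url, user_query, field_weights):
    """Build an edismax request; field_weights maps field name -> boost (or None)."""
    qf = " ".join(
        name if weight is None else f"{name}^{weight}"
        for name, weight in field_weights.items()
    )
    params = {"defType": "edismax", "q": user_query, "qf": qf}
    return base_url + "?" + urlencode(params)

url = build_edismax_url(
    "http://localhost:8983/solr/gettingstarted/select",  # assumed handler path
    "electronics",
    {"name": 10, "cat": None},  # a hit in name weighs ten times a hit in cat
)
print(url)
print(parse_qs(urlsplit(url).query)["qf"])  # the decoded qf value
```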

Solrnet facet returning spaces

I'm using Solrnet to return search results and am also requesting the facets, in particular categories which is a multi-valued field.
The problem I'm coming up against is that the category "house products" is being returned as two separate facets because of the space.
Is there a way of ensuring this is returned as a single facet value, or should I be escaping the value when it is added to the index?
Thanks in advance
Al
If separate tokens are generated for house products, then you are using text analysis for the field.
Text fields are not recommended for faceting.
You won't get the desired behavior, because text fields are tokenized and filtered. This generates multiple tokens per value, which is what you see in the facets returned in the response.
Use a copy field to copy the value to a String field, so that you can facet on it without splitting the words.
From SolrFacetingOverview:
Because faceting fields are often specified to serve two purposes, human-readable text and drill-down query value, they are frequently indexed differently from fields used for searching and sorting:
They are often not tokenized into separate words
They are often not mapped into lower case
Human-readable punctuation is often not removed (other than double-quotes)
There is often no need to store them, since stored values would look much like indexed values and the faceting mechanism is used for value retrieval.
Try to use String fields and it would be good enough without any overheads.
Faceting works on tokens, so if you have a field that is tokenized into several words, the facet will be split too.
I suggest you create another field of type string used only for faceting.
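The difference between faceting on a tokenized field and on a string field can be imitated in a few lines. This is only a toy simulation with made-up documents, not Solr code: the "text" analyzer lowercases and splits on whitespace, while the "string" analyzer keeps each value whole.

```python
from collections import Counter

# Hypothetical multi-valued category field for three documents.
docs = [
    ["house products", "garden"],
    ["house products"],
    ["garden", "kitchen"],
]

def facet_counts(docs, analyzer):
    """Count, per indexed term, how many documents contain it."""
    counts = Counter()
    for values in docs:
        terms = set()
        for value in values:
            terms.update(analyzer(value))
        counts.update(terms)  # each document counts once per term
    return counts

# Text-like field: every word becomes its own facet value.
tokenized = facet_counts(docs, lambda value: value.lower().split())
# String-like field: the whole value is a single facet value.
as_string = facet_counts(docs, lambda value: [value])

print(dict(tokenized))
print(dict(as_string))
```

In the tokenized case "house" and "products" show up as two facets, which is the behavior described in the question; the string variant keeps "house products" together.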

Lucene search for a filename, using WordDelimiterFilterFactory

I'm indexing some data, including filenames.
What I want is, according to indexed filename:
MySupercool123girlfriend.jpg
And to be able to search it with:
supercool
supercool123
123
girlfriend
jpg
So at index time it is pretty easy to use WordDelimiterFilterFactory so that some tokens are created, like:
my
supercool
mysupercool
mysupercool123
supercool123
123
girlfriend
jpg
girlfriend.jpg
etc...
The problem is that at search time, I don't really know what I should do.
If I use WordDelimiterFilterFactory at search time, MySupercool123girlfriend.jpg would match even with toto.jpg because in both cases a token jpg is created.
toto.jpg should not be in the result list at all, so having both results with the appropriate one scoring higher is not a solution for me.
Do you have any recommendations for indexing and searching filenames?
For your specific example, i.e. if the search is for MySupercool123girlfriend.jpg and you want it to return only documents that contain the entire string, you can keep a copyField, say named filename_str, whose fieldType is string. Matching against a string field guarantees an exact match. This could be a first-level "exact match" search.
However, I am guessing that you would also want a search for 123girlfriend.jpg to return the document containing MySupercool123girlfriend.jpg. You can do a second-level search for this. Beginning with Solr 4.0 you can run a regex search like
q=filename_str:/.*123girlfriend.jpg/
(This regex query should also work for filename field itself, if you are using preserveOriginal=1 in WordDelimiterFilterFactory at index time.)
Else you can do a leading wild-card search, which works in earlier Solr versions too.
If you also want MySupercool.jpg to match MySupercool123girlfriend.jpg, then I guess you would have to do the work of WordDelimiterFilterFactory manually and construct a regex query like
q=filename_str:/.*My.*Supercool.*.jpg/
Another issue is that jpg is going to match lot of documents, so you may want to split the filename and the extension and keep them as separate fields.
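The regex queries above could be assembled programmatically. The sketch below assumes the filename_str copy field from this answer; the helper names are hypothetical. Unlike the hand-written examples, it backslash-escapes non-alphanumeric characters, so the dot before the extension only matches a literal dot.

```python
def lucene_regex_escape(text):
    """Backslash-escape every non-alphanumeric character for a Lucene regex."""
    return "".join(ch if ch.isalnum() else "\\" + ch for ch in text)

def suffix_regex_query(field, suffix):
    """Match values of `field` that end with `suffix`."""
    return f"{field}:/.*{lucene_regex_escape(suffix)}/"

def parts_regex_query(field, parts):
    """Match values of `field` that contain all `parts` in order and end with the last one."""
    body = ".*".join(lucene_regex_escape(part) for part in parts)
    return f"{field}:/.*{body}/"

print(suffix_regex_query("filename_str", "123girlfriend.jpg"))
print(parts_regex_query("filename_str", ["My", "Supercool", ".jpg"]))
```

The resulting string still has to be URL-encoded before it is sent as the q parameter.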
Can you come up with a DisMax mm parameter that is meaningful for your use case?
See http://wiki.apache.org/solr/DisMaxQParserPlugin#mm_.28Minimum_.27Should.27_Match.29
E.g.
mm=100% and "MySupercool123girlfriend.jpg" would match only filenames that have all ["my", "supercool", "123", "girlfriend", "jpg"] terms in them
You can probably find a less strict expression that still gives relevant results. See http://lucene.apache.org/solr/4_1_0/solr-core/org/apache/solr/util/doc-files/min-should-match.html
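To see what mm=100% means for the filenames in the question, here is a toy simulation. The split_filename function is only a rough stand-in for WordDelimiterFilterFactory plus lowercasing (splitting on case changes, letter/digit boundaries and punctuation), and matches_all imitates "every query term must be present"; neither is Solr code.

```python
import re

def split_filename(name):
    """Rough imitation of word-delimiter splitting, lowercased."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)
    return [part.lower() for part in parts]

def matches_all(query, filename):
    """mm=100% in this toy model: every query term must occur in the filename's terms."""
    return set(split_filename(query)) <= set(split_filename(filename))

print(split_filename("MySupercool123girlfriend.jpg"))
print(matches_all("MySupercool123girlfriend.jpg", "toto.jpg"))      # shares only "jpg"
print(matches_all("supercool123", "MySupercool123girlfriend.jpg"))  # all terms present
```

A query for jpg alone would still match toto.jpg in this model, which is one more reason to consider keeping the extension in a separate field.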

Different indexing and search strategies on same field without doubling index size?

For a phrase search, we want to bring up results only if there's an exact match (without ignoring stopwords). If it's a non-phrase search, we are fine displaying results even if the root form of the word matches etc.
We currently pass our data through StandardTokenizer, StopFilter, PorterStemFilter and LowerCaseFilter. Because of this, when a user searches for "password management", the search brings up results containing "password manager".
If I remove the stem filter, I will no longer be able to match the root form of a word for non-phrase queries. I was wondering whether I should index the same data in two fields of the document.
For the first field (to be used for phrase searches), following tokenizers/filters will be used:
StandardTokenizer, LowerCaseFilter
For the second field (Non-phrase searches)
StandardTokenizer, StopFilter, PorterStemFilter, LowerCaseFilter
Now, based on whether it's a phrase search or not, I need to rewrite user's query to search in the appropriate field.
Is this the right way to address this issue? Is there any other way to achieve this without doubling index size?
Let's say the user's query is
summary:"Furthermore, we should also fix this"
Internally this will be translated to
summary_field1:"Furthermore, we should also fix this"
If user's query is
summary:(Furthermore, we should also fix this)
Internally this will be translated to
+summary_field2:furthermor +summary_field2:we +summary_field2:should +summary_field2:also +summary_field2:fix
Both summary_field1 and summary_field2 index the same data. summary_field1 passes through only StandardTokenizer and LowerCaseFilter, whereas summary_field2 passes through StandardTokenizer, StopFilter, PorterStemFilter and LowerCaseFilter.
Please let me know if I'm missing something here.
By defining two different fields you can search for exact matches.
By using boosts you can also bring results in one query. For example:
(firstField:"password management")^5 OR (secondField:"password management")^1
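A boosted query of this shape could be generated from the user's phrase. This is a minimal sketch that reuses the field names summary_field1 (unstemmed) and summary_field2 (stemmed) from the question; the helper names and default boosts are assumptions.

```python
def escape_phrase(text):
    """Escape backslashes and double quotes so the text is safe inside a quoted phrase."""
    return text.replace("\\", "\\\\").replace('"', '\\"')

def boosted_phrase_query(phrase, exact_field, stemmed_field,
                         exact_boost=5, stemmed_boost=1):
    """Search both fields, ranking matches in the unstemmed field higher."""
    quoted = '"' + escape_phrase(phrase) + '"'
    return (f"({exact_field}:{quoted})^{exact_boost} OR "
            f"({stemmed_field}:{quoted})^{stemmed_boost}")

query = boosted_phrase_query("password management", "summary_field1", "summary_field2")
print(query)
```

Note that this variant still returns stemmed matches such as "password manager", only ranked lower; if phrase searches must exclude them entirely, query the unstemmed field alone.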

Solr query results using *

I want to provide for partial matching, so I am tacking on * to the end of search queries. What I've noticed is that a search query of gatorade will return 12 results whereas gatorade* returns 7. So * seems to be 1 or many as opposed to 0 or many ... how can I achieve this? Am I going about partial matching in Solr all wrong? Thanks.
First, I think Solr wildcards are better summarized by "0 or many" than "1 or many". I doubt that's the source of your problem. (For example, see the javadocs for WildcardQuery.)
Second, are you using stemming? My first guess is that you're dealing with a stemming issue. Solr wildcards can behave oddly with stemming. This is because wildcard expansion works by searching through the list of terms stored in the inverted index; these terms are going to be in stemmed form (perhaps something like "gatorad"), rather than the words from the original source text (perhaps "gatorade" or "gatorades").
For example, suppose you have a stemmer that maps both "gatorade" and "gatorades" to the stem "gatorad". This means your inverted index will not contain either "gatorade" or "gatorades", only "gatorad". If you then issue the query gatorade*, Solr will walk the term index looking for all the stems beginning with "gatorade". But there are no such stems, so you won't get any matches. Similarly, if you searched gatorades*, Solr will look for all stems beginning with "gatorades". But there are no such stems, so you won't get any matches.
Third, for optimal help, I'd suggest posting some more information, in particular:
Some particular query URLs you are submitting to Solr
An excerpt from your schema.xml file. In particular, include A) the field elements for the fields you are having trouble with, and B) the field type definitions corresponding to those fields
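The stemming effect described in this answer can be reproduced with a toy index. The stemmer below is a deliberately crude stand-in (not the Porter stemmer), chosen only so that "gatorade" and "gatorades" both become "gatorad"; the sample words are made up.

```python
def toy_stem(word):
    """Crude stand-in for a stemmer: lowercase, then strip a trailing es, e or s."""
    word = word.lower()
    for suffix in ("es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Terms as they would be stored in the inverted index (stemmed).
index_terms = sorted({toy_stem(w) for w in ["Gatorade", "gatorades", "water"]})

def term_matches(index_terms, word):
    """A normal term query is analyzed, so the query word is stemmed too."""
    return toy_stem(word) in index_terms

def prefix_matches(index_terms, pattern):
    """A wildcard query is not stemmed: the raw prefix is compared with index terms."""
    prefix = pattern.rstrip("*")
    return [term for term in index_terms if term.startswith(prefix)]

print(index_terms)
print(term_matches(index_terms, "gatorade"))     # stemmed query finds the stem
print(prefix_matches(index_terms, "gatorade*"))  # no stem starts with "gatorade"
print(prefix_matches(index_terms, "gatorad*"))   # the stem itself is found
```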
So what I was looking for is to rewrite the search term 'gatorade' as 'gatorade OR gatorade*', which gives me all the matches I'm looking for.
If you want a query to return all documents that match either a stemmed form of gatorade or words that begin with gatorade, you'll need to construct the query yourself: +(gatorade gatorade*). You could alternatively extend the SolrParser to do this, but that's more work.
Another alternative is to use n-grams via a token filter factory, specifically the EdgeNGramFilterFactory.
This indexes n-grams, i.e. parts of words. For example, with a min ngram size of 5 and a max ngram size of 8, the word Documents would be indexed as: Docum Docume Documen Document
There is a tradeoff in index size and indexing time. One of the Solr books quotes as a rough guide for n-grams: indexing takes 10 times longer, uses 5 times more disk space, and creates 6 times more distinct terms.
However, the EdgeNGram will do better than that.
You do need to make sure that you don't submit a wildcard character in your queries. You aren't doing a wildcard search; you are matching the search term against n-grams (parts of words).
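A small sketch of what edge n-gram indexing produces, assuming n-grams are taken from the front of each token; the function name and the lowercasing step are illustrative assumptions, not the filter's actual implementation.

```python
def edge_ngrams(token, min_size, max_size):
    """Return the prefixes of `token` with lengths min_size..max_size."""
    upper = min(max_size, len(token))
    return [token[:n] for n in range(min_size, upper + 1)]

grams = edge_ngrams("Documents", 5, 8)
print(grams)

# Index side: store the lowercased n-grams. Query side: plain term lookup, no wildcard.
indexed_terms = {gram.lower() for gram in grams}
print("docum" in indexed_terms)      # a partial word matches as an ordinary term
print("documents" in indexed_terms)  # longer than max_size, so it is not indexed
```

Tokens shorter than the minimum size yield no n-grams in this sketch, so very short search terms would not match at all.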
My guess is the missing matches are "Gatorade" (with a capital 'G'), and you have a lowercase filter on your field. The idea is that you have filters in your schema.xml that preprocess the input data, but wildcard queries do not use them;
see this about how Solr deals with wildcard queries:
http://solr.pl/en/2010/12/20/wildcard-queries-and-how-solr-handles-them/
("Solr and wildcard handling").
From what I've read the wildcards only matched words with additional characters after the search term. "Gatorade*" would match Gatorades but not Gatorade itself. It appears there's been an update to Solr in version 3.6 that takes this into account by using the 'multiterm' field type instead of the 'text' field.
A better description is here:
http://bensch.be/the-solr-wildcard-problem-and-multiterm-solution
