Search for any number in Solr

How can I set up a SOLR index in a way that allows me to search for any number?
I believe that the following works, more or less:
0* OR 1* OR 2* OR 3* OR 4* OR 5* OR 6* OR 7* OR 8* OR 9*
But it really does not seem to be ideal, and cannot be used as part of double-quoted expressions, etc.

If you're looking for all documents that contain a token that just is a number, a regular expression search should work:
q=field:/[0-9]+/
If you have tokens in your text that embed a number among other characters (those wouldn't have matched your example anyway), you can add a wildcard before and after the number pattern:
q=field:/.*[0-9]+.*/
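Note that Solr regex queries are implicitly anchored to the whole indexed term, which is why the second form needs the leading and trailing .*. A quick sketch of how the two patterns behave, using Python's re.fullmatch to mimic that whole-term anchoring:

```python
import re

# Solr matches a regex query against each whole indexed term;
# re.fullmatch reproduces that anchoring.
tokens = ["42", "abc", "n4me", "2nd"]

whole_numbers = [t for t in tokens if re.fullmatch(r"[0-9]+", t)]
digit_anywhere = [t for t in tokens if re.fullmatch(r".*[0-9]+.*", t)]

print(whole_numbers)   # ['42']
print(digit_anywhere)  # ['42', 'n4me', '2nd']
```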

Related

in AIML, can I give priority to the pattern matching

In AIML, if I have multiple files matching the same pattern, how can I give precedence to a match in one file?
You should use AIML's wildcards to control the priority of pattern matching.
AIML 1.0 only has * and _ to match 1 or more words. AIML 2.0 adds ^ and # to match 0 or more words.
Below is the priority rank of AIML 2.0 wildcards, from the highest matching priority to the lowest.
"$" : indicates that the word now has higher matching priority than "_"
"#" : 0 or more words
"_" : 1 or more words
word : exact word match
"^" : 0 or more words
"*" : 1 or more words
Please see the AIML 2.0 working draft for details, specifically section 5.A, "Zero or more words wildcards", for the wildcard and priority description.
The AIML 1.0 wildcards * and _ are defined so that they match one or more words. AIML 2.0 introduces two new wildcards, ^ and #, defined to match zero or more words. As a shorthand description, we refer to these as “zero+ wildcards”.
Both ^ and # are defined to match 0 or more words. The difference between them is the same as the difference between * and _. The # matching operator has the highest priority in matching, followed by _, followed by an exact word match, followed by ^, and finally * has the lowest matching priority.
When defining a zero+ wildcard it is necessary to consider what the value of <star/> (as well as <thatstar/> and <topicstar/>) should be when the wildcard match has zero length. In AIML 2.0 we leave this up to the botmaster. Each bot can have a global property named nullstar which the botmaster can set to “”, “unknown”, or any other value.
— “What’s new in AIML 2.0?” (AIML 2.0 working draft)
The Alice site has the following notes on how priority is determined:
At every node, the "_" has first priority, an atomic word match second priority, and a "*" match lowest priority.
The patterns need not be ordered alphabetically, only partially ordered so that "_" comes before any word and "*" after any word.
The matching is word-by-word, not category-by-category.
The algorithm combines the input pattern, the <that> pattern, and the <topic> pattern into a single "path" or sentence such as: "PATTERN <that> THAT <topic> TOPIC" and treats the tokens <that> and <topic> like ordinary words. The PATTERN, THAT and TOPIC patterns may contain multiple wildcards.
The matching algorithm is a highly restricted version of depth-first search, also known as backtracking.
You can simplify the algorithm by removing the "_" wildcard and considering just the last two steps. Also try understanding the simple case of PATTERNs without <that> and <topic>.
From Alicebot.org
Based on this you could use the '_' to give something precedence. Take the following example:
<category>
<pattern>_ BAR</pattern>
<template>Which bar?</template>
</category>
<category>
<pattern>FOO BAR</pattern>
<template>Don't you mean FUBAR? That's an old military acronym, that roughly translates to "broken". I can't directly translate it because I don't curse.</template>
</category>
<category>
<pattern>* BAR</pattern>
<template>There are a lot of bars. There's a crow bar, the state bar, a bar for drinking, and foo bar.</template>
</category>
The _ takes highest priority and is matched first, the exact FOO BAR match is second in priority, and the * is last.

Lucene: How to search for at least m out of n words

Suppose I have 5 words that I'm searching for. Is there a way to specify that the matching documents should have at least 4 of those words?
In case of a BooleanQuery, you can set the 'minimumShouldMatch' property. Here is the API link for more details: http://lucene.apache.org/core/5_1_0/core/org/apache/lucene/search/BooleanQuery.html#setMinimumNumberShouldMatch(int)
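On the Solr side, the equivalent knob is the mm ("minimum should match") parameter of the dismax/edismax query parsers. A minimal sketch of building such a request with Python's standard library (the host, core, and field names are placeholders):

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust host, core, and field to your setup.
params = {
    "q": "alpha bravo charlie delta echo",  # the 5 words
    "defType": "edismax",
    "qf": "body",
    "mm": "4",  # at least 4 of the 5 terms must match
}
url = "http://localhost:8983/solr/mycore/select?" + urlencode(params)
print(url)
```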

Lucene and Nickname matching

I have a series of docs containing a nickname ( even with spaces ) and an ID.
The nickname can be like ["example","nick n4me", "nosp4ces","A fancy guy"].
I have to find a query that allows me to find profiles by a perfect match, a fuzzy match, or even a partial-character match.
So if I write down "nick" or "nick name", the document "nickname" always has to come out.
I tried with something like:
nickname:(%1%^4 %1%~^3 %1%*^1)
where "%1%" is what I'm searching for, but it doesn't work, especially for nicknames with spaces or numbers. For example if I try to search "nick n" the query would be:
nickname:(nick n^4 nick n~^3 nick n*^1)
Boosting with ^ will only affect the scoring and not the matching, i.e. if your query does not match at all, boosting the terms or not won't make any difference.
In your specific example, the query won't match because:
1) nick n won't match because that would require that either the token nick or the token n had been produced at indexing time;
2) EDIT: I found out that fuzzy queries work only on single terms, if you use the standard query parser. In your case, you should probably rewrite nick n~ using ComplexPhraseQueryParser, so you can do a fuzzy query on the whole PhraseQuery. Also, you can specify a threshold for your fuzzy query (technically, you are specifying a minimum Levenshtein distance). Obviously you have to adjust the threshold, and that usually requires some trial and error.
An easier tactic is to load all nicknames into one field -- in your example you would have 4 values for your nickname field. If you want embedded spaces in your nicknames, you will need to use a simpler analyzer than StandardAnalyzer or use phrase searches.
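The original boosted query can also be assembled per token rather than per phrase, which sidesteps the whitespace problem (the field name and boosts come from the question; the escaping list and the fuzzy edit distance of 1 are assumptions):

```python
def build_nickname_query(text: str, field: str = "nickname") -> str:
    """Build an exact/fuzzy/prefix clause per whitespace-separated token."""
    # Escape characters that are special in Lucene query syntax (abridged list).
    for ch in '\\+-&|!(){}[]^"~*?:/':
        text = text.replace(ch, "\\" + ch)
    clauses = []
    for t in text.split():
        clauses.append(f"{t}^4")    # exact term, highest boost
        clauses.append(f"{t}~1^3")  # fuzzy match, edit distance 1
        clauses.append(f"{t}*^1")   # prefix match, lowest boost
    return f"{field}:(" + " OR ".join(clauses) + ")"

print(build_nickname_query("nick n"))
```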

difference between q=word1 word2 and q="word1 word2" in Solr/Lucene

Can someone please tell me what is the difference between:
q=word1 word2
and
q="word1 word2"
I'm trying to match a keyword "word1 word2" (yes, my keyword can have whitespaces) that is analyzed with KeywordTokenizerFactory and it seems it only works when I add the quotes in the query.
By the way I use Solr extended Dismax, don't know if this matters.
The syntax is then:
q="some text"&qf=KeywordField&qf=FrenchtextField
Edit:
The problem I have with quotes is that I have another field that contains full text (analysis is basic and close to FrenchAnalyzer, including a lowercase filter).
I have 'HelloWorld' text indexed, and I can find it back with q=helloWoRLD but not with q="helloWoRLD": this unit test is broken since I added quotes in all my queries. I don't understand what is the difference between q=helloWoRLD and q="helloWoRLD" since it would still be 1 term search right?
Lucene query syntax uses spaces to separate terms, so q=word1 word2 performs a search for the two separate terms word1 and word2 (against the default search field, or the qf fields when using dismax).
If you want to search for the string "word1 word2" (consecutive words) in the field q then you will have to use quotes i.e. q="word1 word2"
If you want to search for records which contain both of these words (non-consecutive) then you can search for q=word1 AND word2
I don't quite follow your hello world problem so I can't comment on that. Hope this helps.
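The two forms already differ at the URL level; a quick standard-library sketch of how the parameter is encoded (consistent with the observation above: with KeywordTokenizerFactory, only the quoted form reaches the field as a single token):

```python
from urllib.parse import urlencode

terms = urlencode({"q": "word1 word2"})     # two separate terms
phrase = urlencode({"q": '"word1 word2"'})  # one phrase, a single keyword

print(terms)   # q=word1+word2
print(phrase)  # q=%22word1+word2%22
```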

Solr query results using *

I want to provide for partial matching, so I am tacking on * to the end of search queries. What I've noticed is that a search query of gatorade will return 12 results whereas gatorade* returns 7. So * seems to be 1 or many as opposed to 0 or many ... how can I achieve this? Am I going about partial matching in Solr all wrong? Thanks.
First, I think Solr wildcards are better summarized by "0 or many" than "1 or many". I doubt that's the source of your problem. (For example, see the javadocs for WildcardQuery.)
Second, are you using stemming, because my first guess is that you're dealing with a stemming issue. Solr wildcards can behave kind of oddly with stemming. This is because wildcard expansion is based by searching through the list of terms stored in the inverted index; these terms are going to be in stemmed form (perhaps something like "gatorad"), rather than the words from the original source text (perhaps "gatorade" or "gatorades").
For example, suppose you have a stemmer that maps both "gatorade" and "gatorades" to the stem "gatorad". This means your inverted index will not contain either "gatorade" or "gatorades", only "gatorad". If you then issue the query gatorade*, Solr will walk the term index looking for all the stems beginning with "gatorade". But there are no such stems, so you won't get any matches. Similarly, if you searched gatorades*, Solr will look for all stems beginning with "gatorades". But there are no such stems, so you won't get any matches.
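A toy illustration of that mismatch (the stemmer here is invented purely so that both words map to "gatorad"):

```python
def toy_stem(word: str) -> str:
    # Invented stemmer: strips a trailing "es" or "e".
    if word.endswith("es"):
        return word[:-2]
    if word.endswith("e"):
        return word[:-1]
    return word

# The inverted index stores stems, not the original words.
indexed_terms = {toy_stem(w) for w in ["gatorade", "gatorades"]}
print(indexed_terms)  # {'gatorad'}

# Wildcard expansion walks the indexed terms looking for the literal
# prefix, which is NOT stemmed -- so "gatorade*" finds nothing.
matches = [t for t in indexed_terms if t.startswith("gatorade")]
print(matches)  # []
```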
Third, for optimal help, I'd suggest posting some more information, in particular:
Some particular query URLs you are submitting to Solr
An excerpt from your schema.xml file. In particular, include A) the field elements for the fields you are having trouble with, and B) the field type definitions corresponding to those fields
So what I was looking for is to turn the search term 'gatorade' into 'gatorade OR gatorade*', which will give me all the matches I'm looking for.
If you want a query to return all documents that match either a stemmed form of gatorade or words that begin with gatorade, you'll need to construct the query yourself: +(gatorade gatorade*). You could alternatively extend the SolrParser to do this, but that's more work.
Another alternative is to use NGrams and TokenFilterFactories, specifically the EdgeNGramFilterFactory.
This will create indexes for ngrams, or parts of words. The word Documents, with a min ngram size of 5 and a max ngram size of 8, would index: Docum, Docume, Documen, Document.
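A minimal sketch of that prefix generation, assuming the usual EdgeNGram semantics (every front-anchored prefix whose length falls between the min and max gram size):

```python
def edge_ngrams(token: str, min_gram: int = 5, max_gram: int = 8) -> list[str]:
    """Front-anchored n-grams, as EdgeNGramFilterFactory produces them."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]

print(edge_ngrams("Documents"))  # ['Docum', 'Docume', 'Documen', 'Document']
```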
There is a bit of a tradeoff for index size and time. One of the Solr books quotes as a rough guide: indexing takes 10 times longer, uses 5 times more disk space, and creates 6 times more distinct terms.
However, the EdgeNGram will do better than that.
You do need to make sure that you don't submit wildcard characters in your queries: you are no longer doing a wildcard search, you are matching the search term against ngrams (parts of words).
My guess is the missing matches are "Gatorade" (with a capital 'G'), and you have a lowercase filter on your field. The idea is that you have filters in your schema.xml that preprocess the input data, but wildcard queries do not use them;
see this about how Solr deals with wildcard queries:
http://solr.pl/en/2010/12/20/wildcard-queries-and-how-solr-handles-them/
("Solr and wildcard handling").
From what I've read the wildcards only matched words with additional characters after the search term. "Gatorade*" would match Gatorades but not Gatorade itself. It appears there's been an update to Solr in version 3.6 that takes this into account by using the 'multiterm' field type instead of the 'text' field.
A better description is here:
http://bensch.be/the-solr-wildcard-problem-and-multiterm-solution
