I am querying Solr with http://localhost:8983/solr/matching/select?facet.field=content&facet=on&q=*:*, but not all words show up in the facets. These are my documents in Solr:
{
"id":"1",
"content":["Jakarta - KM Sinar Bangun tenggelam di Danau Toba, Sumut. Menteri Pariwisata Arief Yahya berharap, audit transportasi dan keamanan di sana diperketat.\n\n\"Pertama-tama kita berbelasungkawa atas KM Sinar Baru yang tenggelam di Danau Toba. Saya juga ikut memonitor dan apa yang sudah dilakukan rekan-rekan Basarnas sudah bagus,\" katanya di Balairung Soesilo Soedarman, Gedung Sapta Pesona, Jakarta, Kamis (21/6/2018) setelah acara Halal Bi Halal Kementerian Pariwisata.\n\nKM Sinar Bangun tenggelam di Danau Toba, Senin (18/6) sekitar pukul 17.30 WIB. Kapal tenggelam saat berlayar dari Pelabuhan Simanindo, Kabupaten Samosir, menuju Pelabuhan Tigaras, Kabupaten Simalungun.\n\nKorban hilang penumpang KM Sinar Bangun yang tenggelam berjumlah 186. Sebanyak 94 orang teridentifikasi, sedangkan 92 orang belum diketahui identitasnya.\n\n\"Kabarnya kapal itu over capacity atau tidak memenuhi spesifik teknis. Saya setuju. Ke depannya diaudit kepada semua kapal yang berlayar di Danau Toba,\" tegas Arief.\n\nDanau Toba merupakan salah satu 10 Destinasi Prioritas atau 10 Bali Baru. Maka itu, poin keamanan, keselamatan dan pelayanannya harus terus ditingkatkan. Agar tidak terulang lagi musibah Danau Toba.\n\n\"Kita harapkan akan lebih ketat dan selektif terutama saat hari-hari besar di sana. Nanti ketemu lagi di Natal dan Tahun Baru harus dipersiapkan lebih bagus,\" tutupnya. (aff/fay)\n"],
"_version_":1603877168829431808},
{
"id":"2",
"content":["Jakarta - KM Sinar Bangun tenggelam di Danau Toba, Sumut. Menteri Pariwisata Arief Yahya berharap, audit transportasi dan keamanan di sana diperketat.\n\n\"Pertama-tama kita berbelasungkawa atas KM Sinar Baru yang tenggelam di Danau Toba. Saya juga ikut memonitor dan apa yang sudah dilakukan rekan-rekan Basarnas sudah bagus,\" katanya di Balairung Soesilo Soedarman, Gedung Sapta Pesona, Jakarta, Kamis (21/6/2018) setelah acara Halal Bi Halal Kementerian Pariwisata.\n\nKM Sinar Bangun tenggelam di Danau Toba, Senin (18/6) sekitar pukul 17.30 WIB. Kapal tenggelam saat berlayar dari Pelabuhan Simanindo, Kabupaten Samosir, menuju Pelabuhan Tigaras, Kabupaten Simalungun.\n\nKorban hilang penumpang KM Sinar Bangun yang tenggelam berjumlah 186. Sebanyak 94 orang teridentifikasi, sedangkan 92 orang belum diketahui identitasnya.\n\n\"Kabarnya kapal itu over capacity atau tidak memenuhi spesifik teknis. Saya setuju. Ke depannya diaudit kepada semua kapal yang berlayar di Danau Toba,\" tegas Arief.\n\nDanau Toba merupakan salah satu 10 Destinasi Prioritas atau 10 Bali Baru. Maka itu, poin keamanan, keselamatan dan pelayanannya harus terus ditingkatkan. Agar tidak terulang lagi musibah Danau Toba.\n\n\"Kita harapkan akan lebih ketat dan selektif terutama saat hari-hari besar di sana. Nanti ketemu lagi di Natal dan Tahun Baru harus dipersiapkan lebih bagus,\" tutupnya. (aff/fay)\n"],
"_version_":1603877168887103488}
After the query, the result is:
"facet_counts":{
"facet_queries":{},
"facet_fields":{
"content":[
"10",2,
"17",2,
"18",2,
"186",2,
"2018",2,
"21",2,
"30",2,
"6",2,
"92",2,
"94",2,
"acara",2,
"aff",2,
"agar",2,
"akan",2,
"apa",2,
"arief",2,
"atas",2,
"atau",2,
"audit",2,
"bagus",2,
"balairung",2,
"bali",2,
"bangun",2,
"baru",2,
"basarnas",2,
"belum",2,
"berbelasungkawa",2,
"berharap",2,
"berjumlah",2,
"berlayar",2,
"besar",2,
"bi",2,
"capacity",2,
"dan",2,
"danau",2,
"dari",2,
"depannya",2,
"destinasi",2,
"di",2,
"diaudit",2,
"diketahui",2,
"dilakukan",2,
"diperketat",2,
"dipersiapkan",2,
"ditingkatkan",2,
"fay",2,
"gedung",2,
"halal",2,
"harapkan",2,
"hari",2,
"harus",2,
"hilang",2,
"identitasnya",2,
"ikut",2,
"itu",2,
"jakarta",2,
"juga",2,
"kabarnya",2,
"kabupaten",2,
"kamis",2,
"kapal",2,
"katanya",2,
"ke",2,
"keamanan",2,
"kementerian",2,
"kepada",2,
"keselamatan",2,
"ketat",2,
"ketemu",2,
"kita",2,
"km",2,
"korban",2,
"lagi",2,
"lebih",2,
"maka",2,
"memenuhi",2,
"memonitor",2,
"menteri",2,
"menuju",2,
"merupakan",2,
"musibah",2,
"nanti",2,
"natal",2,
"orang",2,
"over",2,
"pariwisata",2,
"pelabuhan",2,
"pelayanannya",2,
"penumpang",2,
"pertama",2,
"pesona",2,
"poin",2,
"prioritas",2,
"pukul",2,
"rekan",2,
"saat",2,
"salah",2,
"samosir",2,
"sana",2,
"sapta",2]},
"facet_ranges":{},
"facet_intervals":{},
"facet_heatmaps":{}}
In the result, "sinar", "tenggelam", "toba" and some other words do not show up.
This is my field configuration:
<fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
So what's wrong with my query or Solr field configuration?
Your list of facets cuts off at 100 - all the words you are missing are after this cutoff point. The problem isn't that the values haven't been indexed, it's that you're not retrieving them. By default the facet.limit parameter is set to 100 - set it to -1 to return all terms for a field and their associated counts.
&facet=true&facet.field=content&facet.limit=-1
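For example, against the matching core from the question, the full request would be:
http://localhost:8983/solr/matching/select?q=*:*&facet=true&facet.field=content&facet.limit=-1
With facet.limit=-1 in place, the previously missing terms ("sinar", "tenggelam", "toba", ...) should appear in the facet list.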
Hi, my suggestion is to replace your field type edgytext with the definition below.
<fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100"
multiValued="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
If you want all words with their counts in the facet, then you need the StandardTokenizerFactory.
What is solr.StandardTokenizerFactory?
It splits text on word boundaries, treating whitespace and most punctuation as delimiters and discarding them.
Example:
http://google.com/i+love+birds
would generate 5 tokens (separated by commas):
http, google.com, i, love, birds
What is KeywordTokenizerFactory?
KeywordTokenizer does not split the input at all: no processing is performed on the string, and the whole text is returned as a single token.
It is mainly used for sorting or faceting requirements, where you want to match the exact value when filtering on multi-word terms; sorting does not work on tokenized fields.
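For contrast, the KeywordTokenizer applied to the same example input would emit exactly one token:
http://google.com/i+love+birds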
Using Solr 6.0.1.
I have this type declaration:
<fieldType name="customy_icu" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.LengthFilterFactory" min="1" max="100"/>
<filter class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="20"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.LengthFilterFactory" min="1" max="100"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
customy_icu is used for storing Hebrew-language text data (words are read and written right to left).
When the query is "מי פנים", I get the results in the wrong order: product_3351 scores higher (more relevant) than product_3407, but it should be the other way around.
Here is the debug output:
<str name="product_3351">
2.711071 = sum of:
2.711071 = max of:
0.12766865 = weight(meta_keyword:"מי פנים" in 882) [ClassicSimilarity], result of:
0.12766865 = score(doc=882,freq=1.0), product of:
0.05998979 = queryWeight, product of:
8.5126915 = idf(), sum of:
4.7235003 = idf(docFreq=21, docCount=910)
3.7891912 = idf(docFreq=55, docCount=910)
0.0070471005 = queryNorm
2.1281729 = fieldWeight in 882, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = phraseFreq=1.0
8.5126915 = idf(), sum of:
4.7235003 = idf(docFreq=21, docCount=910)
3.7891912 = idf(docFreq=55, docCount=910)
0.25 = fieldNorm(doc=882)
2.711071 = weight(name:"מי פנים" in 882) [ClassicSimilarity], result of:
2.711071 = score(doc=882,freq=1.0), product of:
0.6178363 = queryWeight, product of:
9.99 = boost
8.776017 = idf(), sum of:
4.8417873 = idf(docFreq=22, docCount=1071)
3.93423 = idf(docFreq=56, docCount=1071)
0.0070471005 = queryNorm
4.3880086 = fieldWeight in 882, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = phraseFreq=1.0
8.776017 = idf(), sum of:
4.8417873 = idf(docFreq=22, docCount=1071)
3.93423 = idf(docFreq=56, docCount=1071)
0.5 = fieldNorm(doc=882)
</str>
and
<str name="product_3407">
2.711071 = sum of:
2.711071 = max of:
2.711071 = weight(name:"מי פנים" in 919) [ClassicSimilarity], result of:
2.711071 = score(doc=919,freq=1.0), product of:
0.6178363 = queryWeight, product of:
9.99 = boost
8.776017 = idf(), sum of:
4.8417873 = idf(docFreq=22, docCount=1071)
3.93423 = idf(docFreq=56, docCount=1071)
0.0070471005 = queryNorm
4.3880086 = fieldWeight in 919, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = phraseFreq=1.0
8.776017 = idf(), sum of:
4.8417873 = idf(docFreq=22, docCount=1071)
3.93423 = idf(docFreq=56, docCount=1071)
0.5 = fieldNorm(doc=919)
</str>
Product 3351 has this name field value:
סאבליים סופט מי פנים
And product 3407 has this name field value:
מי פנים מיסלרים
http://screencast.com/t/2iBwLQqu
How can I boost product 3407 so that it appears higher in the result list?
Thanks a lot!
If you have a specific query where you want to boost a document to the top of the result set, regardless of its own score, use the Query Elevation Component.
There is no automagic boosting for "appears earlier in the document", but there are a few ways to work around it. See How to boost scores for early matches for a couple of possible solutions.
"Relevancy" is a fluent term, and you have to implement the kind of scoring that you feel is suitable for your application outside of the standard rules. The debugQuery you've included shows that the documents are scored identically on relevancy by default.
You can use the elevate.xml file to make a particular document appear at the top of the result set for a specific search term.
Example:
<elevate>
<query text ="מי פנים">
<doc id="your_product_ID" />
</query>
</elevate>
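Note that elevate.xml only takes effect if the Query Elevation Component is registered in solrconfig.xml and attached to a request handler. A minimal sketch (the component and handler names here are illustrative; adapt them to your setup):
<searchComponent name="elevator" class="solr.QueryElevationComponent">
<!-- field type used to analyze the incoming query text; "string" matches it verbatim -->
<str name="queryFieldType">string</str>
<!-- the elevation rules file, relative to the conf directory -->
<str name="config-file">elevate.xml</str>
</searchComponent>
<requestHandler name="/elevate" class="solr.SearchHandler">
<arr name="last-components">
<str>elevator</str>
</arr>
</requestHandler>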
I am trying to use Solr to find exact matches on categories in a user search (e.g. "skinny jeans" in "blue skinny jeans"). I am using the following type definition:
<fieldType name="subphrase" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="\ "
replacement="_"/>
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory"
outputUnigrams="true"
outputUnigramsIfNoShingles="true"
tokenSeparator="_"
minShingleSize="2"
maxShingleSize="99"/>
</analyzer>
</fieldType>
The type will index categories without tokenizing, only replacing whitespace with underscores. But it will tokenize queries and shingle them (with underscores).
What I am trying to do is match the query shingles against the indexed categories. In the Solr Analysis page I can see that the whitespace/underscore replacement works on both index and query, and I can see that the query is being shingled correctly.
My problem is that in the Solr Query page, I cannot see shingles being generated, and I presume that as a result the category "skinny jeans" is not matched, but the category "jeans" is matched :(
This is the debug output:
{
"responseHeader": {
"status": 0,
"QTime": 1,
"params": {
"q": "name:(skinny jeans)",
"indent": "true",
"wt": "json",
"debugQuery": "true",
"_": "1464170217438"
}
},
"response": {
"numFound": 1,
"start": 0,
"docs": [
{
"id": 33,
"name": "jeans",
}
]
},
"debug": {
"rawquerystring": "name:(skinny jeans)",
"querystring": "name:(skinny jeans)",
"parsedquery": "name:skinny name:jeans",
"parsedquery_toString": "name:skinny name:jeans",
"explain": {
"33": "\n2.2143755 = product of:\n 4.428751 = sum of:\n 4.428751 = weight(name:jeans in 54) [DefaultSimilarity], result of:\n 4.428751 = score(doc=54,freq=1.0), product of:\n 0.6709952 = queryWeight, product of:\n 6.600272 = idf(docFreq=1, maxDocs=541)\n 0.10166174 = queryNorm\n 6.600272 = fieldWeight in 54, product of:\n 1.0 = tf(freq=1.0), with freq of:\n 1.0 = termFreq=1.0\n 6.600272 = idf(docFreq=1, maxDocs=541)\n 1.0 = fieldNorm(doc=54)\n 0.5 = coord(1/2)\n"
},
"QParser": "LuceneQParser"
}
}
It's clear that the parsedquery parameter does not display the shingled query. What do I need to do to complete the process of matching query shingles against indexed values? I feel like I am very close to cracking this problem. Any advice is appreciated!
This is an incomplete answer, but it might be enough to get you moving.
1: You probably want outputUnigrams="false", so you don't match the category "jeans" on the query "skinny jeans".
2: You actually do want to search with quotes (a phrase), or the field won't ever see more than one token to shingle; see the example query at the end of this answer.
3: It seems like you're trying to do the same thing as this person was:
http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746
That thread looks like it led to the inclusion of the PositionFilterFactory
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory
If you're using Solr < 5.0, try putting that at the end of your query time analysis and see if it works.
Unfortunately, that filter factory was removed in 5.0. This is the only comment I've found about what to do instead:
http://lucene.apache.org/core/4_10_0/analyzers-common/org/apache/lucene/analysis/position/PositionFilter.html
I played with autoGeneratePhraseQueries a little, but I have yet to find another way to prevent Solr from generating a MultiPhraseQuery.
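To illustrate point 2, query the field as a phrase so the analyzer receives both words together:
name:"skinny jeans"
With outputUnigrams="false" at query time, debugQuery=true should then show the parsedquery collapsing to a single shingled term such as name:skinny_jeans, which matches the underscore-joined value indexed for that category.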
I have a number with hyphens: 91-21-22020-4.
My problem is that I would like hits even if the hyphens are moved within the number string. As it is now, 912122020-4 gives one hit but 91212202-04 does not.
The debug info looks like:
"debug": {
"rawquerystring": "91212202-04",
"querystring": "91212202-04",
"parsedquery": "+((freetext:91212202 freetext:9121220204)/no_coord) +freetext:04",
"parsedquery_toString": "+(freetext:91212202 freetext:9121220204) +freetext:04",
"explain": {},
"QParser": "LuceneQParser",
AND
"debug": {
"rawquerystring": "912122020-4",
"querystring": "912122020-4",
"parsedquery": "+((freetext:912122020 freetext:9121220204)/no_coord) +freetext:4",
"parsedquery_toString": "+(freetext:912122020 freetext:9121220204) +freetext:4",
"explain": {
"ATEST003-81419": "\n0.33174315 = (MATCH) sum of:\n 0.17618936 = (MATCH) sum of:\n 0.17618936 = (MATCH) weight(freetext:9121220204 in 0) [DefaultSimilarity], result of:\n 0.17618936 = score(doc=0,freq=1.0), product of:\n 0.5690552 = queryWeight, product of:\n 3.3025851 = idf(docFreq=1, maxDocs=20)\n 0.17230599 = queryNorm\n 0.30961734 = fieldWeight in 0, product of:\n 1.0 = tf(freq=1.0), with freq of:\n 1.0 = termFreq=1.0\n 3.3025851 = idf(docFreq=1, maxDocs=20)\n 0.09375 = fieldNorm(doc=0)\n 0.15555379 = (MATCH) weight(freetext:4 in 0) [DefaultSimilarity], result of:\n 0.15555379 = score(doc=0,freq=2.0), product of:\n 0.44962177 = queryWeight, product of:\n 2.609438 = idf(docFreq=3, maxDocs=20)\n 0.17230599 = queryNorm\n 0.34596586 = fieldWeight in 0, product of:\n 1.4142135 = tf(freq=2.0), with freq of:\n 2.0 = termFreq=2.0\n 2.609438 = idf(docFreq=3, maxDocs=20)\n 0.09375 = fieldNorm(doc=0)\n"
},
My schema.xml looks like:
<fieldType name="text_indexed" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.HyphenatedWordsFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-index.txt"/>
<filter class="solr.ReversedWildcardFilterFactory" withOriginal="true" maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-index.txt"/>
</analyzer>
</fieldType>
Use a PatternReplaceCharFilter to remove all traces of the hyphens before they're indexed in Solr (or use PatternReplaceFilter to change the tokens stored and not the text indexed).
91212202-04 would then be indexed (and searched) as 9121220204, which would effectively remove any dependency on hyphens.
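A minimal sketch of that approach; the char filter goes first in both the index and query analyzers, ahead of the tokenizer, and the rest of your chain stays as it is:
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="-" replacement=""/>
With this in place, 91-21-22020-4, 912122020-4 and 91212202-04 should all reduce to the same term, 9121220204, at both index and query time.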
I have indexed shop names in Solr, such as:
H&M
Lotte & Anna
fan & more
Tele2
Pure Tea
I have the following two issues (in order of priority):
If I search for "H&M" I never get any result. If I search for "te & Ann" I get the expected results.
If I search for "te & an" the results I get are Tele2 and Pure Tea, whereas I would have expected "Lotte & Anna" to appear first in the list.
It appears as if the & character is not taken into consideration. What am I doing wrong here?
These are my analyzers for the field in question (both query and index):
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
OK, so the 1st problem was addressed with the WordDelimiterFilterFactory, specifying & => ALPHA in wdfftypes.txt and switching from the StandardTokenizerFactory to the WhitespaceTokenizerFactory:
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" types="wdfftypes.txt"/>
(applied to both the index and query analyzers; the wdfftypes.txt mapping is shown below).
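For completeness, the wdfftypes.txt referenced above needs just one mapping to make the WordDelimiterFilter treat & as a letter:
& => ALPHA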
The 2nd issue still remains.
In the debugQuery output I get the following:
"debug": {
"rawquerystring": "te & an",
"querystring": "te & an",
"parsedquery": "text:te text:an",
"parsedquery_toString": "text:te text:an",
"explain": {
"": "\n0.8152958 = (MATCH) product of:\n 1.6305916 = (MATCH) sum of:\n 1.6305916 = (MATCH) weight(text:te in 498) [DefaultSimilarity], result of:\n 1.6305916 = score(doc=498,freq=1.0 = termFreq=1.0\n), product of:\n 0.8202942 = queryWeight, product of:\n 5.300835 = idf(docFreq=87, maxDocs=6491)\n 0.15474811 = queryNorm\n 1.9878132 = fieldWeight in 498, product of:\n 1.0 = tf(freq=1.0), with freq of:\n 1.0 = termFreq=1.0\n 5.300835 = idf(docFreq=87, maxDocs=6491)\n 0.375 = fieldNorm(doc=498)\n 0.5 = coord(1/2)\n"
},
So, what should I modify so that the weights shift in favour of the desired result?
Use "NGramFilterFactory" instead of "EdgeNGramFilterFactory". That way, "Lotte & Anne", gets indexed into "lo, ot, tt, te, lot, ott, tte, lott, otte, lotte" and "an, nn, ne, ann, nne, anne". so when you search for "tte & ann", the document will match.