Solr Facet Group Query Filtering - solr

I have about 12 million Solr documents structured like the sample below. I want to group the sentiment values with a Solr facet query and get the count of each value per day. How can I do this?
<doc>
<str name="id">389352936</str>
<str name="title">Title</str>
<str name="body">Body</str>
<date name="feeddate">2019-05-05T09:22:46Z</date>
<int name="sentiment">0</int>
</doc>
<doc>
<str name="id">389352937</str>
<str name="title">Title</str>
<str name="body">Body</str>
<date name="feeddate">2019-05-06T09:22:46Z</date>
<int name="sentiment">1</int>
</doc>
The result structure I want:
[
{"feeddate":"2019-05-05T00:00:00Z","sentiment":{"0":10,"1":20,"2":30}},{"feeddate":"2019-05-06T00:00:00Z","sentiment":{"0":5,"1":10,"2":15}},{"feeddate":"2019-05-07T00:00:00Z","sentiment":{"0":12,"1":21,"2":12}}
]
I tried the query below, but it does not return the structure I want:
facet.range={!tag=rdt}feeddate
&facet.range.start=2019-01-01T00:00:00Z
&facet.sort=feeddate
&facet.field=sentiment
&facet.range.end=2019-02-01T00:00:00Z
&facet.range.gap=%2B1DAY&facet=true
&facet.pivot={!range=piv1}sentiment
Returned data:
[ {field:"sentiment",value:0,count:6258160,ranges:{feeddate:{counts:["2019-01-01T00:00:00Z",7983,"2019-01-02T00:00:00Z",9673,"2019-01-03T00:00:00Z",12727,"2019-01-04T00:00:00Z"]}},
{field:"sentiment",value:1,count:1830481,ranges:{feeddate:{counts:["2019-01-01T00:00:00Z",4983,"2019-01-02T00:00:00Z",9673,"2019-01-03T00:00:00Z",23727,"2019-01-04T00:00:00Z"]}}
{field:"sentiment",value:2,count:3086818,ranges:{feeddate:{counts:["2019-01-01T00:00:00Z",3983,"2019-01-02T00:00:00Z",9673,"2019-01-03T00:00:00Z",10727,"2019-01-04T00:00:00Z"]}}
]
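The pivot response above already holds every number the target structure needs; it is only keyed by sentiment first and by day second. A minimal client-side sketch that inverts it, assuming the response has been parsed into Python dicts shaped like the returned data above. The function name and the sample literal are illustrative and not part of Solr:

```python
from collections import defaultdict


def pivot_ranges_to_daily(pivot_entries, range_field="feeddate"):
    """Turn 'sentiment -> per-day counts' into one record per day."""
    daily = defaultdict(dict)
    for entry in pivot_entries:
        sentiment = str(entry["value"])
        flat = entry["ranges"][range_field]["counts"]
        # Assumption: counts is a flat list alternating date, count, date, count.
        # zip() drops a trailing date that has no count after it.
        for day, count in zip(flat[0::2], flat[1::2]):
            daily[day][sentiment] = count
    return [{range_field: day, "sentiment": daily[day]} for day in sorted(daily)]


# Sample built from the first two days of the returned data in the question.
sample = [
    {"field": "sentiment", "value": 0,
     "ranges": {"feeddate": {"counts": [
         "2019-01-01T00:00:00Z", 7983, "2019-01-02T00:00:00Z", 9673]}}},
    {"field": "sentiment", "value": 1,
     "ranges": {"feeddate": {"counts": [
         "2019-01-01T00:00:00Z", 4983, "2019-01-02T00:00:00Z", 9673]}}},
]
daily_counts = pivot_ranges_to_daily(sample)
```

Each element of daily_counts has the form {"feeddate": ..., "sentiment": {"0": ..., "1": ...}}, which matches the requested layout.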

Related

How to boost repeated values in a multiValue field on Solr

I have some repeated data (identical strings) in a multiValued field in my Solr index, and I want to boost documents by the number of matches in that field. For example:
doc1 : { locales : ['en_US', 'de_DE', 'fr_FR', 'en_US'] }
doc2 : { locales : ['en_US'] }
When I run the query q=locales:en_US, I would like to see doc1 at the top because it has two "en_US" values. What is the proper way to boost this kind of data?
Should I use a special tokenizer?
Solr version: 4.5
Disclaimer
To use either of the following solutions, you will need to make one of the following two changes:
Create a copyField for locales:
<field name="locales" type="string" indexed="true" stored="true" multiValued="true"/>
<!-- No need to store(stored="false") locales_text as it will only be used for searching/sorting/boosting -->
<field name="locales_text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="locales" dest="locales_text"/>
Alternatively, change the type of locales to "text_general" (this type is provided in the standard Solr collection1 example).
First solution (Ordering):
Results can be ordered by some function. So we can order by number of occurrences (termfreq function) in field:
If copyField is used, then sort query will be: termfreq(locales_text,'en_US') DESC
If locales is of text_general type, then sort query will be: termfreq(locales,'en_US') DESC
Example response for copyField option (the result is the same for text_general type):
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
<lst name="params">
<str name="fl">*,score</str>
<str name="sort">termfreq(locales_text,'en_US') DESC</str>
<str name="indent">true</str>
<str name="q">locales:en_US</str>
<str name="_">1383598933337</str>
<str name="wt">xml</str>
</lst>
</lst>
<result name="response" numFound="2" start="0" maxScore="0.5945348">
<doc>
<arr name="locales">
<str>en_US</str>
<str>de_DE</str>
<str>fr_FR</str>
<str>en_US</str>
</arr>
<str name="id">4f9f71f6-7811-4c22-b5d6-c62887983d08</str>
<long name="_version_">1450808563062538240</long>
<float name="score">0.4203996</float></doc>
<doc>
<arr name="locales">
<str>en_US</str>
</arr>
<str name="id">7f93e620-cf7b-4b90-b741-f6edc9db77c9</str>
<long name="_version_">1450808391856291840</long>
<float name="score">0.5945348</float></doc>
</result>
</response>
You can also use fl=*,termfreq(locales_text,'en_US') to see the number of matches.
One thing to keep in mind: this is an ordering function, not a boost function. If you would rather boost the score based on multiple matches, you will probably be more interested in the second solution.
I included the score in the results to demonstrate what @arun was talking about. You can see that the scores differ (probably because of the field length). It was quite unexpected (for me) that the behavior is the same for a multivalued string field.
Second solution (Boosting):
If copyField is used, then the query will be : {!boost b=termfreq(locales_text,'en_US')}locales:en_US
If locales is of text_general type, then the query will be: {!boost b=termfreq(locales,'en_US')}locales:en_US
Example response for copyField option (the result is the same for text_general type):
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="lowercaseOperators">true</str>
<str name="fl">*,score</str>
<str name="indent">true</str>
<str name="q">{!boost b=termfreq(locales_text,'en_US')}locales:en_US</str>
<str name="_">1383599910386</str>
<str name="stopwords">true</str>
<str name="wt">xml</str>
<str name="defType">edismax</str>
</lst>
</lst>
<result name="response" numFound="2" start="0" maxScore="1.1890696">
<doc>
<arr name="locales">
<str>en_US</str>
<str>de_DE</str>
<str>fr_FR</str>
<str>en_US</str>
</arr>
<str name="id">4f9f71f6-7811-4c22-b5d6-c62887983d08</str>
<long name="_version_">1450808563062538240</long>
<float name="score">1.1890696</float></doc>
<doc>
<arr name="locales">
<str>en_US</str>
</arr>
<str name="id">7f93e620-cf7b-4b90-b741-f6edc9db77c9</str>
<long name="_version_">1450808391856291840</long>
<float name="score">0.5945348</float></doc>
</result>
</response>
You can see that the score changed significantly. The first document now scores twice as high as the second, because there were two matches, each scored as 0.5945348.
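Both solutions differ only in where the termfreq function goes: in the sort parameter or inside a {!boost} wrapper around the query. A small sketch that builds the two parameter sets with the standard library; the helper names are hypothetical and the field and term values are taken from the examples above:

```python
from urllib.parse import urlencode, parse_qs


def sort_params(field, term):
    """First solution: order results by the number of occurrences."""
    return {
        "q": f"locales:{term}",
        "fl": "*,score",
        "sort": f"termfreq({field},'{term}') desc",
    }


def boost_params(field, term):
    """Second solution: multiply the score by the number of occurrences."""
    return {
        "q": f"{{!boost b=termfreq({field},'{term}')}}locales:{term}",
        "fl": "*,score",
    }


sort_qs = urlencode(sort_params("locales_text", "en_US"))
boost_qs = urlencode(boost_params("locales_text", "en_US"))
```

urlencode takes care of escaping the braces, quotes, and spaces, so the strings can be appended to a select URL as they are.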
Third solution (omitNorms=true)
Based on the answer from @arun, I figured out that there is also a third option.
If you convert your field to (for example) text_general AND set omitNorms=true for that field, it should give the same result.
The default standard request handler in Solr does not use only the term frequency to compute the scores. Along with term frequency, it also uses the length of the field. See the lucene scoring algorithm, where it says:
lengthNorm - computed when the document is added to the index in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score.
Since doc2 has a shorter field it might have scored higher. Check the score for the results with fl=*,score in your query. To know how Solr arrived at the score, use fl=*,score&wt=xml&debugQuery=on (then right click on your browser and view page-source to see a properly indented score calculation). I believe you will see the lengthNorm contributing to a lower score for doc1.
To have length of the field not contribute to the score, you need to disable it. Set omitNorms=true for that field. (Ref: http://wiki.apache.org/solr/SchemaXml) Then see what the scores are.
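The effect of lengthNorm can be checked by hand. The sketch below is a simplified model, assuming Lucene's classic TF-IDF similarity (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), lengthNorm = 1 / sqrt(number of tokens)), a single-term query, and an index holding only the two example documents. It ignores the lossy one-byte norm encoding, and the function name is illustrative:

```python
import math


def classic_score(term_freq, field_length, num_docs, doc_freq, use_norms=True):
    """Single-term score under a simplified classic TF-IDF model."""
    tf = math.sqrt(term_freq)
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))
    # With omitNorms=true the field length no longer influences the score.
    length_norm = 1.0 / math.sqrt(field_length) if use_norms else 1.0
    # For a single-term query the query norm is 1 / sqrt(idf^2) = 1 / idf.
    query_norm = 1.0 / idf
    return tf * idf * idf * length_norm * query_norm


# doc1: 'en_US' occurs twice among 4 values; doc2: once among 1 value.
doc1_with_norms = classic_score(2, 4, num_docs=2, doc_freq=2)
doc2_with_norms = classic_score(1, 1, num_docs=2, doc_freq=2)
doc1_without_norms = classic_score(2, 4, num_docs=2, doc_freq=2, use_norms=False)
doc2_without_norms = classic_score(1, 1, num_docs=2, doc_freq=2, use_norms=False)
```

Under these assumptions the first two values agree with the scores in the first example response (0.4203996 for doc1, 0.5945348 for doc2), and with norms disabled doc1 moves ahead of doc2.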

Solr Group By query

I have schema.xml like this:
Sample data
id     Country  State     City   Area
1      India    abc       cd     mnv
15131  India    Delhi     HauzK  asdf   (ids 1 to 15131 were inserted using the CSV handler)
15132  India    Karnatka  Bang   mno    (id 15132 was inserted using the Solarium API)
All fields are of type text_general, with the following analysis chain applied:
Whitespace tokenizer
Lowercase filter factory
NGram filter factory
One thing to note:
I inserted the records with id = 1 to id = 15131 using the CSV request handler, and the document with id = 15132 using the Solarium API.
Now, I have a suggestion box for country. I want to show only distinct countries, so I grouped by country.
http://localhost:8983/solr/searchLocation/country?
q=country%3Ain&wt=xml&indent=true&group=true&group.field=country
I got the following result:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">4</int>
</lst>
<lst name="grouped">
<lst name="country">
<int name="matches">15132</int>
<arr name="groups">
<lst>
<str name="groupValue">ndia</str>
<result name="doclist" numFound="15131" start="0" maxScore="0.24998347">
<doc>
<str name="country">india</str>
<str name="state">Andaman and Nicobar</str>
<str name="city">A&N Islands</str>
<str name="area">Marine Jetty</str>
<str name="id">02cb8ba4-bffe-4c4e-a976-29f01ad8d275</str>
<float name="score">0.24998347</float>
</doc>
</result>
</lst>
<lst>
<str name="groupValue">d</str>
<result name="doclist" numFound="1" start="0" maxScore="0.24998347">
<doc>
<str name="country">india</str>
<str name="state">Kerala</str>
<str name="city">Palghat</str>
<str name="area">Padagirinew</str>
<str name="id">0158f635-24dd-4d2f-9697-e79272684c95</str>
<float name="score">0.24998347</float>
</doc>
</result>
</lst>
</arr>
</lst>
</lst>
</response>
What confuses me is how I could get two groups:
all records from id = 1 to id = 15131, with country value = india
the last record, with id = 15132, with country value = india
Why is it making two different groups? It should be a single group, because the value of the country field is India in every document.
Thanks
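No answer to this question is recorded here. One observation that may help: both groupValue strings in the response (ndia and d) are substrings of india, which is the kind of term an n-gram filter emits at index time. The sketch below imitates the whitespace, lowercase, n-gram chain in plain Python. The gram sizes 1 to 5 are an assumption, since the schema is not shown, and this is not Solr's own code:

```python
def ngram_tokens(text, min_gram, max_gram):
    """Imitate: whitespace tokenizer -> lowercase filter -> n-gram filter."""
    tokens = []
    for token in text.lower().split():
        for size in range(min_gram, max_gram + 1):
            for start in range(len(token) - size + 1):
                tokens.append(token[start:start + size])
    return tokens


# Hypothetical gram sizes; the real ones depend on the schema.
india_terms = ngram_tokens("India", 1, 5)
```

Both ndia and d appear in india_terms. If grouping is meant to operate on whole country names, grouping on a non-tokenized copy of the field (for example a string field filled through copyField) may be worth trying; that is a suggestion, not something confirmed in this thread.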

Solr : How can I group on two different fields?

My schema looks like this:
product_id
category_id
A category contains products.
In Solr 3.6, I group results on category_id and it works well.
I just added a new field:
group_id
A group contains products that vary on size or color.
Example: a shoe in blue, red, and yellow is 3 different products that share the same group_id.
In addition to the result grouping on category_id, I would like my results to contain only one product per group_id, bearing in mind that group_id can be null (for products that aren't part of a group).
To follow the shoe example, this means that for the request "shoe", only one of the 3 products should appear in the results.
I thought of doing a second result grouping on group_id, but it doesn't seem possible to do it that way.
Any ideas?
EDIT: For now, I process the results in PHP to remove documents whose group_id is already in the results. I am leaving this question open in case someone finds a way to group on 2 fields.
If your aim is to get grouping counts based on multiple "group by" fields, you can use pivot faceting to achieve this.
&facet.pivot=category_id,group_id
Solr will give you back a hierarchy of grouped result counts, following the page of search results, under the facet_pivot element.
http://wiki.apache.org/solr/SimpleFacetParameters?highlight=%28pivot%29#Pivot_.28ie_Decision_Tree.29_Faceting
It is not possible to group results on two fields in a single query.
If you only need counts, you can use facet.field (for a single field) or facet.pivot (for multiple fields).
This is not actually grouping, but it gives you the count of each combination of values across multiple fields.
Example Output:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<bool name="zkConnected">true</bool>
<int name="status">0</int>
<int name="QTime">306</int>
</lst>
<result name="response" numFound="667" start="0" maxScore="0.70710677">
<doc>
<int name="idField">7393</int>
<int name="field_one">12</int>
</doc>
</result>
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields"/>
<lst name="facet_ranges"/>
<lst name="facet_intervals"/>
<lst name="facet_heatmaps"/>
<lst name="facet_pivot">
<arr name="field_one,field_two">
<lst>
<str name="field">field_one</str>
<int name="value">3</int>
<int name="count">562</int>
<arr name="pivot">
<lst>
<str name="field">field_two</str>
<bool name="value">true</bool>
<int name="count">347</int>
</lst>
<lst>
<str name="field">field_two</str>
<bool name="value">false</bool>
<int name="count">215</int>
</lst>
</arr>
</lst>
<lst>
<str name="field">field_one</str>
<int name="value">12</int>
<int name="count">105</int>
<arr name="pivot">
<lst>
<str name="field">field_two</str>
<bool name="value">true</bool>
<int name="count">97</int>
</lst>
<lst>
<str name="field">field_two</str>
<bool name="value">false</bool>
<int name="count">8</int>
</lst>
</arr>
</lst>
</arr>
</lst>
</lst>
</response>
Example Query :
http://192.168.100.145:7983/solr/<collection>/select?facet.pivot=field_one,field_two&facet=on&fl=idField,field_one&indent=on&q=field_one:(3%2012)&rows=1&wt=xml
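To use the pivot counts like "group by two fields" totals, the nested structure can be flattened into one count per value combination. A minimal sketch, assuming the facet_pivot section has been parsed into nested Python dicts and lists mirroring the XML above (as a JSON response would be); the function name is illustrative:

```python
def flatten_pivot(pivot_entries, prefix=()):
    """Map each (field_one value, field_two value, ...) path to its leaf count."""
    rows = {}
    for entry in pivot_entries:
        key = prefix + (entry["value"],)
        children = entry.get("pivot")
        if children:
            rows.update(flatten_pivot(children, key))
        else:
            rows[key] = entry["count"]
    return rows


# Values copied from the example output above.
facet_pivot = [
    {"field": "field_one", "value": 3, "count": 562, "pivot": [
        {"field": "field_two", "value": True, "count": 347},
        {"field": "field_two", "value": False, "count": 215}]},
    {"field": "field_one", "value": 12, "count": 105, "pivot": [
        {"field": "field_two", "value": True, "count": 97},
        {"field": "field_two", "value": False, "count": 8}]},
]
combo_counts = flatten_pivot(facet_pivot)
```

combo_counts is keyed by tuples such as (3, True), so a lookup per combination is a plain dictionary access.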
If you can change the data that you are posting to Solr, then I suggest creating a string field that holds the concatenation of category_id and group_id. For example, if category_id = 5 and group_id = 2, the string field would be '5,2' (using ',' or any other character as a delimiter). You can then group on this string field.
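A sketch of how such a concatenated key could be computed before indexing. The fallback for a null group_id (reusing the product id with a "p" prefix so that ungrouped products stay distinct) is an assumption of this example, and the function names are hypothetical:

```python
def composite_key(category_id, group_id, product_id, delimiter=","):
    """Build the value of the combined string field for one product."""
    # Assumption: a product without a group forms its own one-member group.
    member = str(group_id) if group_id is not None else f"p{product_id}"
    return f"{category_id}{delimiter}{member}"


def first_per_key(docs):
    """Keep the first document seen for each composite key."""
    seen = {}
    for doc in docs:
        key = composite_key(doc["category_id"], doc.get("group_id"), doc["product_id"])
        seen.setdefault(key, doc)
    return list(seen.values())


products = [
    {"product_id": 1, "category_id": 5, "group_id": 2},     # blue shoe
    {"product_id": 2, "category_id": 5, "group_id": 2},     # red shoe
    {"product_id": 3, "category_id": 5, "group_id": None},  # ungrouped product
]
unique_products = first_per_key(products)
```

first_per_key mirrors the client-side de-duplication described in the question; once the key is indexed as its own field, Solr can group on it directly.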

Multiple cores join query

My Solr version is 4.0.
I have a multicore environment with one core for products and one core for the availability records of these products.
The products core contains detailed descriptions and has about 10,000 documents.
The availabilities core contains up to 4 million documents.
I built a small test set and I'm trying to get results using the join syntax, meant to find all availabilities of products containing "disney".
http://localhost:8080/solr/product/select?q={!join%20from=productid%20to=id%20fromindex=availp}disney&fl=*
I get zero results.
Individual queries on each of the cores do yield results.
Questions:
1. How should I construct the query in order to get results?
2. When I refine my query to filter on a specific date, what would the syntax be?
For example: ?fq=period:"november 2012" AND country:France
country is a field from the product index; period is a field from the availp index.
Results from individual queries: product core
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
<lst name="params">
<str name="fl">id,productname</str>
<str name="indent">1</str>
<str name="q">disney</str>
<str name="rows">1</str>
</lst>
</lst>
<result name="response" numFound="31" start="0">
<doc>
<str name="productname">DPAZ00 DPAZ00-02 DPAZ0002 Disneyland Parijs Hotel Disney's Santa Fe</str>
<str name="id">44044</str></doc>
</result>
</response>
other core: availp
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="fl">*</str>
<str name="indent">1</str>
<str name="q">productid:44044</str>
<str name="rows">1</str>
</lst>
</lst>
<result name="response" numFound="42" start="0">
<doc>
<date name="datefrom">2012-10-01T10:00:00Z</date>
<arr name="period">
<str>oktober 2012</str>
</arr>
<str name="productid">44044</str>
<double name="minpriceperperson">209.0</double>
<int name="durationcode">1</int>
<str name="id">3890</str>
<int name="budgetcode">2</int></doc>
</result>
</response>
1) You should query the availability core (availp), with product as the inner (from) index.
The join belongs in q or fq, not in fl. This is how the query should look:
http://localhost:8080/solr/availp/select?q={!join from=id to=productid fromIndex=product}productname:disney&fl=*
2) You can use the same join syntax as a filter:
http://localhost:8080/solr/availp/select?q=period:november&fq={!join from=id to=productid fromIndex=product}productname:disney AND country:France
You can remove the productname clause from the filter if it is not needed.
Have you tried by changing the fromindex to fromIndex (uppercase I)?
According to Adventures with Solr Join, the query should look like this:
http://localhost:8983/solr/parents/select?q=alive:yes AND _query_:"{!join fromIndex=children from=fatherid to=parentid v='childname:Tom'}"
It should work.
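A small sketch of assembling a cross-core join request so that the local parameters (including the case-sensitive fromIndex) and the URL escaping come out right. It assumes the availabilities are fetched from the availp core, that product.id links to availp.productid as in the sample documents, and that productname is searchable in the product core; the helper name is hypothetical:

```python
from urllib.parse import urlencode, parse_qs


def join_query(from_field, to_field, from_index, inner_query):
    """Build a {!join} query; the inner query runs against from_index."""
    return (
        f"{{!join from={from_field} to={to_field} fromIndex={from_index}}}"
        f"{inner_query}"
    )


params = {
    "q": join_query("id", "productid", "product", "productname:disney"),
    "fl": "*",
}
# Host, port, and core name follow the question; adjust to the real setup.
url = "http://localhost:8080/solr/availp/select?" + urlencode(params)
```

The from field is read in the fromIndex core and the to field in the core being queried, which is why the request goes to availp when availability documents are wanted.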

Can I restrict the search to a specific date range?

I want to get all results AFTER a given date. Can this be done with Solr?
(http://lucene.apache.org/solr/)
Right now the search covers the entire result set; I want to restrict it to anything after a given date.
Update
This isn't working for me yet.
I'm trying:
http://www.example.com:8085/solr/select/?q=test&version=2.2&start=0&rows=10&indent=on&indexed_at:2009-08-27T13%3A15%3A27.73Z
My returned doc:
<doc>
<str name="apptype">Forum</str>
<str name="collapse">forum:334</str>
<str name="content"> testing </str>
<str name="contentid">357</str>
<str name="createdby">some_user</str>
<str name="date">20090819</str>
<str name="dummy_id">1</str>
<int name="group">5</int>
<date name="indexed_at">2009-08-25T16:48:45.121Z</date>
<str name="rating">000.0</str>
<str name="rawcontent"><p>testing</p></str>
<arr name="roles">
<str>1</str>
<str>2</str>
<str>3</str>
<str>4</str>
<str>14</str>
<str>15</str>
<str>16</str>
</arr>
<int name="section">79</int>
<int name="thread">334</int>
<str name="title">testing</str>
<str name="titlesort">testing</str>
<str name="type">forum</str>
<str name="unique_id">
BLAHBLAH|357
</str>
<str name="url">/blahey/f/79/p/334/357.aspx#357</str>
<str name="user">21625</str>
<str name="username">some_user</str>
</doc>
Yes, you can. I assume you have a field holding the date value you want to filter on. Then you do:
yourdatefield:[2008-08-27T23:59:59.999Z TO *]
A sample URL would be localhost:8983/solr/select?q=yourdatefield:[2008-08-27T23:59:59.999Z TO *]
You need to submit the date condition as part of the query, that is, inside the value of q, like this:
localhost:8983/solr/select?q=(text:test+AND+indexed_at:[2009-08-27T13:15:27.73Z TO *])
So the entire query is contained within the q query-string parameter.
The date format is ISO 8601.
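Hand-written URLs are easy to get wrong here because of the colons, brackets, and spaces in the range expression. A minimal sketch that formats the timestamp and lets the standard library do the escaping; it places the range in fq (Solr's filter-query parameter) rather than q, which is a choice of this example, and the helper name is hypothetical:

```python
from datetime import datetime, timezone
from urllib.parse import urlencode, parse_qs


def after_date_filter(field, moment):
    """Build an open-ended range clause: field:[<UTC timestamp> TO *]."""
    stamp = moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{field}:[{stamp} TO *]"


cutoff = datetime(2009, 8, 27, 13, 15, 27, tzinfo=timezone.utc)
params = {
    "q": "text:test",
    "fq": after_date_filter("indexed_at", cutoff),
    "rows": 10,
}
# Host and port follow the sample URL in the answer.
url = "http://localhost:8983/solr/select?" + urlencode(params)
```

The same clause can be combined into q with AND, as in the answer above, if a single parameter is preferred.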
You can add an automatic timestamp to the documents as they are indexed by using:
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
in schema.xml. The default schema has this line commented out, so if you copied the default, you just need to uncomment it.
You could add that and use olle's suggested search pattern to find the documents indexed after a certain date. (You would have to replace yourdatefield with timestamp, or whatever you name the field in the XML.)
You will need to create a query that compares dates, here is the syntax for queries:
http://wiki.apache.org/solr/SolrQuerySyntax
And here is how you can make date comparisons in the query:
http://lucene.apache.org/solr/api/org/apache/solr/util/DateMathParser.html
