Solr: Newly Observed facets - solr

I have two fields in my solr index data: "userName" and "startTimeISO" along with many other fields.
Now I want to query for all the "userNames" that were seen TODAY but not seen in the last 30 days.
Basically, I am trying to find out Newly Observed UserNames for today.
Now the Solr Facet query I am running is:
facet.pivot: "userName,startTimeISO",
fq: " NOT startTimeISO:["2014-12-20T00:00:00.000Z" TO "2015-01-18T00:00:00.000Z"] AND startTimeISO:["2015-01-19T00:00:00.000Z" TO "2015-01-20T00:00:00.000Z"]"
But for some reason I am getting incorrect results.
For example, I see userName: "bla" with the above query.
If I run the same query for tomorrow, I again see "bla" in my facet results.
Somehow I am not able to get the logic right. Perhaps I am not using all the tools provided by Solr, some of which I may be unaware of?
Can someone help me here? I don't mind testing all of your suggestions and going back and forth with different ideas.
In the meanwhile I am looking online to see if there is some other way to facet.
Update:
SOLUTION:
In case your data looks like:
{
"id": "1",
"userName": "one",
"startTimeISO": "2015-01-20T17:24:32.888Z"
},
{
"id": "2",
"userName": "one",
"startTimeISO": "2015-01-16T17:24:50.208Z"
},
{
"id": "3",
"userName": "two",
"startTimeISO": "2015-01-20T17:25:06.109Z"
}
You could use the below query combination:
q=*:*
fq=startTimeISO:[NOW-1DAY TO NOW] //this will give you all the users that were seen today
fq=-_query_:"{!join from=userName to=userName}startTimeISO:[NOW-30DAYS TO NOW-1DAYS]" //don't include those documents that have others with the same name and were viewed during the last 30 days.
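To make the logic of the two filters concrete, here is a small plain-Python sketch that applies the same rule to the sample documents above, without talking to Solr at all. The function names and the fixed "now" value are assumptions made for illustration; the join filter itself is evaluated by Solr, not by code like this.

```python
from datetime import datetime, timedelta, timezone


def parse_iso(ts):
    # Timestamps in the sample data look like 2015-01-20T17:24:32.888Z.
    parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def newly_observed(docs, now):
    """Names seen in [now-1day, now] but not in [now-30days, now-1day]."""
    day_ago = now - timedelta(days=1)
    month_ago = now - timedelta(days=30)
    seen_today = set()
    seen_before = set()
    for doc in docs:
        ts = parse_iso(doc["startTimeISO"])
        if day_ago <= ts <= now:
            seen_today.add(doc["userName"])
        if month_ago <= ts <= day_ago:
            seen_before.add(doc["userName"])
    # Mirrors the negated join: drop names that also appear in the older window.
    return seen_today - seen_before


DOCS = [
    {"id": "1", "userName": "one", "startTimeISO": "2015-01-20T17:24:32.888Z"},
    {"id": "2", "userName": "one", "startTimeISO": "2015-01-16T17:24:50.208Z"},
    {"id": "3", "userName": "two", "startTimeISO": "2015-01-20T17:25:06.109Z"},
]

# Hypothetical "current time", chosen so that documents 1 and 3 count as today.
NOW = datetime(2015, 1, 20, 18, 0, 0, tzinfo=timezone.utc)

print(newly_observed(DOCS, NOW))
```

With this sample data only "two" survives, because "one" also has a document from 2015-01-16.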
Thanks to Alvaro Cabrerizo for helping me out.
Here is the link to the same question on Solr mailing list:
http://lucene.472066.n3.nabble.com/Newly-observed-Facets-td4180538.html

There isn't one query that will do what you want. Your best bet is to first query for the user names seen today (a smaller number than all those seen in the last 30 days), returning that list to your client. A typical 'fq' restricted to the last day will select those documents; add facet.field=userName with facet.limit=1000000 (unfortunately high) and facet.mincount=1. Now that you have this list on your client, submit a second, large query to Solr that facets on the userName field again, with one filter query for the previous 29 days (don't include today) and an additional filter query that matches just the usernames you found in the first query. Ideally the username filter would use the 'terms' QParser available in Solr 4.10, but it's not essential. When this second query returns, it will show you which of the usernames seen today were also seen in the previous 29 days. With that information, you can subtract the second set of names from the first, and you are left with the usernames seen today but not in the previous 29 days.
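A rough client-side sketch of this two-query approach might look like the following. It only builds the request parameters and does the set subtraction; actually sending the requests is left out. The helper names are hypothetical, the date ranges are one possible reading of "today" and "the previous 29 days", and the terms filter assumes user names contain no commas.

```python
def names_from_facet(flat):
    # Solr's default JSON layout for facet_fields alternates value, count, ...
    return {flat[i] for i in range(0, len(flat), 2) if flat[i + 1] > 0}


def today_params():
    # First query: facet on userName over the last day only.
    return {
        "q": "*:*",
        "rows": 0,
        "fq": ["startTimeISO:[NOW-1DAY TO NOW]"],
        "facet": "true",
        "facet.field": "userName",
        "facet.limit": 1000000,
        "facet.mincount": 1,
    }


def previous_params(names):
    # Second query: same facet, older window, restricted to today's names.
    terms_filter = "{!terms f=userName}" + ",".join(sorted(names))
    params = today_params()
    params["fq"] = ["startTimeISO:[NOW-30DAYS TO NOW-1DAY]", terms_filter]
    return params


def newly_observed(today_facet, previous_facet):
    # Subtract the names also seen earlier from the names seen today.
    return names_from_facet(today_facet) - names_from_facet(previous_facet)


# Made-up facet lists standing in for the two Solr responses.
today = ["one", 1, "two", 1]
previous = ["one", 1]
print(previous_params(names_from_facet(today))["fq"])
print(newly_observed(today, previous))
```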

Related

Can I find documents based on duplicated fields?

I have a Solr server with data under this format:
{
id: 1,
text_1: "some_text1",
text_2: "some_text2",
},
{
id: 2,
text_1: "some_text1",
text_2: "some_text2",
}
I need to find documents like the ones I wrote above. Documents that have the same "text_1" and "text_2" values but different ids.
I've tried using facets, but I'm not sure if it helps. Firstly, it only returns a count of the duplicates and I need the id's of these documents. Secondly, I'm not sure that faceting over multiple fields does what I want. I'm not sure that:
facet.field=text_1&facet.field=text_2 shows me a count of documents that have both those fields.
Thank you, I don't know much about Solr. Any help is greatly appreciated!
I think facets are your best bet to get this done, but as you noticed you will need to issue at least two queries: one to get the facets and another to fetch the actual documents that belong to each facet (i.e. the duplicates in your case).
To get the multi facets to work for what you are trying to do you'll need to use PivotFaceting (https://lucene.apache.org/solr/guide/7_0/faceting.html#pivot-decision-tree-faceting). The syntax is facet=on&facet.pivot=field1,field2
Make sure the field that you use for facets is a string field and not a text field.
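As a sketch of the two-step idea, the snippet below walks a pivot facet response, keeps the value pairs that occur more than once, and builds the follow-up query that would fetch those documents. The response is a hand-written sample in the shape Solr returns for facet.pivot, the helper names are made up, and the follow-up query assumes the values contain no double quotes.

```python
def duplicate_pairs(response, pivot_key="text_1,text_2"):
    """Return (text_1, text_2, count) for every pair seen more than once."""
    pairs = []
    pivots = response["facet_counts"]["facet_pivot"].get(pivot_key, [])
    for outer in pivots:
        for inner in outer.get("pivot", []):
            if inner["count"] > 1:
                pairs.append((outer["value"], inner["value"], inner["count"]))
    return pairs


def fetch_query(text_1, text_2):
    # Second request: ask for the documents behind one duplicated pair.
    return 'text_1:"{}" AND text_2:"{}"'.format(text_1, text_2)


# Hand-written sample response, not real server output.
SAMPLE = {
    "facet_counts": {
        "facet_pivot": {
            "text_1,text_2": [
                {
                    "field": "text_1",
                    "value": "some_text1",
                    "count": 2,
                    "pivot": [
                        {"field": "text_2", "value": "some_text2", "count": 2}
                    ],
                },
                {
                    "field": "text_1",
                    "value": "other",
                    "count": 1,
                    "pivot": [
                        {"field": "text_2", "value": "unique", "count": 1}
                    ],
                },
            ]
        }
    }
}

for t1, t2, count in duplicate_pairs(SAMPLE):
    print(count, fetch_query(t1, t2))
```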

Solr search on specific field gives weird results

Solr experts -
I have two records in my solr database:
{
"keywords":["jaime kelly jkelly natixis sales and marketing manager"],
"job_role":"natixis sales & marketing manager",
"empl_name":["jaime kelly jkelly"],
},
{
"keywords":["schwayb jackson sjackson"],
"job_role":"portfolio manager",
"empl_name":["schwayb jackson sjackson"],
}
When I search on the field empl_name with the query:
empl_name:schwayb natixis
the first record returned is jaime kelly instead of schwayb jackson. This is weird. I am explicitly searching the field empl_name and among the two records the second one is the closer match. Why does Solr not order it correctly?
Looks like Solr sees the string "natixis" in the job_role and keywords of the earlier record and is giving it more preference. But I want solr to ONLY look at empl_name and no other field. How can I achieve this?
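You need to group the query terms in parentheses: empl_name:(schwayb natixis). Otherwise the query gets parsed as empl_name:schwayb OR natixis, so natixis is searched in the default field rather than in empl_name.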
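If the query string is assembled in client code, a tiny helper can apply the grouping consistently. This is only a sketch: the function name is hypothetical, and it assumes the terms are plain words that need no escaping.

```python
from urllib.parse import urlencode


def field_query(field, terms):
    # Wrap all terms in parentheses so every one is scoped to the field.
    return "{}:({})".format(field, " ".join(terms))


query = field_query("empl_name", ["schwayb", "natixis"])
print(query)
print(urlencode({"q": query}))
```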

Solr group query based on the sum aggregation of function query

I have tried to implement the Page and Post relation in single Solr Schema. In my use case each page has multiple posts. Page and Post fields are as follows:
Post:{post_content, owner_page_id, document_type}
Page:{page_id, document_type}
Suppose I want to query this single core for results sorted by the total term frequency of a specific term for each Page. At first I thought the following query could help me meet this requirement for the term "hello":
http://localhost:8983/solr/document/select?wt=json&indent=true&fl=id,name&q=*:*&group=true&group.field=owner_page_id&sort=termfreq(post_content,%27hello%27)+desc&fl=result:termfreq(post_content_text,%27hello%27),owner_page_id
But, it seems that this query returns the term frequency for single post of each page and the result is not aggregated for all of the page posts, and I am looking for the aggregate result.
I would be really grateful if somebody can help me to find the required query for my requirement.
P.S: I am using Solr 6, so JSON Facet is available for me.

Lucene/Solr: Store offset information for certain keywords

We are using Solr to store documents with keywords; each keyword is associated with a span within the document.
The keywords were produced by some fancy analytics and/or manual work prior to loading them into Solr. A keyword can be repeated multiple times in a document. On the other hand, different instances of the same string in a single document can be connected with different keywords.
For example, this document
Bill studied The Bill of Rights last summer.
could be accompanied by the following keywords (with offsets in parentheses):
William Brown (0:4)
legal term (13:31)
summer 2011 (32:43)
(Obviously in other documents, Bill could refer to Bill Clinton or Bill Gates. Similarly, last summer will refer to different years in different documents. We do have all this information for all the documents.)
I know the document can have a field, say KEYWORD, which will store William Brown. Then when I search for William Brown I will get the above document. That part is easy.
But I have no idea how to store the info that William Brown corresponds to the text span 0:4 so I can highlight the first Bill, but not the second.
I thought I could use TermVectors, but I am not sure if/how I can store custom offsets. I would think this is a fairly common scenario ...
EDIT: edited to make clear that Bill can refer to different people/things in different documents.
EDIT2: edited to make clear that a document can contain homonyms (identical strings with different meanings).
Two Q Monte
Solution Pros:
Annotations logically stored with source docs
No knowledge of highlighter implementation or custom Java highlighter development required
Since all customization happens outside of Solr, this solution should be forward-compatible to future Solr versions.
Solution Cons:
Requires two queries to be run
Requires code in your search client to merge results from one query into the other.
With Solr 4.8+ you can nest child documents (annotations) underneath each primary document (text)...
curl http://localhost:8983/solr/update/json?softCommit=true -H 'Content-type:application/json' -d '
[
{
"id": "123",
"text" : "Bill studied The Bill of Rights last summer.",
"content_type": "source",
"_childDocuments_": [
{
"id": "123-1",
"content_type": "source_annotation",
"annotation": "William Brown",
"start_offset": 0,
"end_offset": 4
},
{
"id": "123-2",
"content_type": "source_annotation",
"annotation": "legal term",
"start_offset": 13,
"end_offset": 31
},
{
"id": "123-3",
"content_type": "source_annotation",
"annotation": "summer 2011",
"start_offset": 32,
"end_offset": 43
}
]
}
]'
... using block join to query the annotations.
1) Annotation Query: http://localhost:8983/solr/query?fl=id,start_offset,end_offset&q={!child of=content_type:source}annotation:"William Brown"
"response":{"numFound":1,"start":0,
"docs":[
{
"id": "123-1",
"content_type": "source_annotation",
"annotation": "William Brown",
"start_offset": 0,
"end_offset": 4
}
]
}
Store these results in your code so that you can fold in the annotation offsets after the next query returns.
2) Source Query + Highlighting: http://localhost:8983/solr/query?hl=true&hl.fl=text&fq=content_type:source&q=text:"William Brown" OR id:123
(id:123 discovered in Annotation Query gets ORed into second query)
"response":{"numFound":1,"start":0,
"docs":[
{
"id": "123",
"content_type": "source",
"text": "Bill studied The Bill of Rights last summer."
}
],
"highlighting":{}
}
Note: In this example there is no highlighting information returned because the search terms didn't match any content_type:source documents. However we have the explicit annotations and offsets from the first query!
Your client code then needs to take the content_type:source_annotation results from the first query and manually insert highlighting markers into the content_type:source results from the second query.
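A minimal sketch of that client-side merge step, in Python: it takes the source text plus the annotation documents and wraps each annotated span in markers. The function name and the <em> markers are assumptions, and the code assumes the spans do not overlap.

```python
def apply_highlights(text, annotations, pre="<em>", post="</em>"):
    """Wrap each annotated span of text in highlight markers."""
    out = text
    # Insert from right to left so earlier offsets stay valid.
    ordered = sorted(annotations, key=lambda a: a["start_offset"], reverse=True)
    for ann in ordered:
        start, end = ann["start_offset"], ann["end_offset"]
        out = out[:start] + pre + out[start:end] + post + out[end:]
    return out


TEXT = "Bill studied The Bill of Rights last summer."

# The single hit returned by the annotation query in step 1.
HITS = [{"id": "123-1", "start_offset": 0, "end_offset": 4}]

print(apply_highlights(TEXT, HITS))
```

Only the first "Bill" is marked, because the offsets come from the annotation rather than from matching the string.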
More block join info on Yonik's blog here.
By default Solr stores the start/end position of each token once it is tokenized, for instance using the StandardTokenizer. This info is encoded in the underlying index. The use case that you described here sounds a lot like the SynonymFilterFactory.
When you define a synonym using the SynonymFilterFactory stating, for instance, foo => bar (foo is equivalent to bar), the bar term is added to the token stream generated when the text is tokenized, and it will have the same offset information as the original token. So, for instance, if your text is "foo is awesome", the term foo will have the offset information (start=0, end=3), and a new token bar (start=0, end=3) will be added to your index (assuming that you're using the SynonymFilterFactory at index time):
text:  foo  is  awesome
start: 0    4   7
end:   3    6   14
Once the SynonymFilterFactory is applied:
       bar
text:  foo  is  awesome
start: 0    4   7
end:   3    6   14
So if you fire a query using foo, the document will match, but if you use bar as your query the document will also match since a bar token is added by the SynonymFilterFactory
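The idea of a synonym sharing the offsets of the original token can be imitated in a few lines of plain Python. This is only an illustration of the concept, using whitespace tokenization and a made-up synonym table; it is not how Solr's analysis chain is implemented.

```python
import re


def tokenize(text):
    # Whitespace tokenizer that records (term, start, end) for each token.
    return [(m.group(0), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def add_synonyms(tokens, synonyms):
    # Emit each synonym with the same offsets as the token it belongs to.
    out = []
    for term, start, end in tokens:
        out.append((term, start, end))
        for syn in synonyms.get(term, []):
            out.append((syn, start, end))
    return out


tokens = tokenize("foo is awesome")
for term, start, end in add_synonyms(tokens, {"foo": ["bar"]}):
    print(term, start, end)
```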
In your particular case, you're trying to accomplish multi-term synonyms, which is a difficult problem; you may need something more than Solr's default synonym filter. Check this post from the guys at OpenSourceConnections and this other post from Lucidworks (the company behind Solr/Lucene). These two posts should provide additional information and the caveats of each approach.
Do you need to fetch the stored offsets for some later processing?

Solr - How do I get the number of documents for each field containing the search term within that field in Solr?

Imagine an index like the following:
id partno name description
1 1000.001 Apple iPod iPod by Apple
2 1000.123 Apple iPhone The iPhone
When the user searches for "Apple" both documents would be returned. Now I'd like to give the user the possibility to narrow down the results by limiting the search to one or more fields that have documents containing the term "Apple" within those fields.
So, ideally, the user would see something like this in the filter section of the ui after his first query:
Filter by field
name (2)
description (1)
When the user applies the filter for field "description", only documents which contain the term "Apple" within the field "description" would be returned. So the result set of that second request would be the iPod document only. For that I'd use a query like ?q=Apple&qf=description (I'm using the Extended DisMax Query Parser)
How can I accomplish that with Solr?
I already experimented with faceting, grouping and highlighting components, but did not really come to a decent solution to this.
[Update]
Just to make that clear again: The main problem here is to get the information needed for displaying the "Filter by field" section. This includes the names of the fields and the hits per field. Sending a second request with one of those filters applied already works.
Solr just plain Doesn't Do This. If you absolutely need it, I'd try the multiple-requests solution and benchmark it -- Solr tends to be a lot faster than what people put in front of it, so a few requests might not be that big of a deal.
You could achieve this with two different search requests/queries:
name:apple -> 2 hits
description:apple -> 1 hit
EDIT:
You also could implement your own SearchComponent that executes multiple queries in the background and put it in the SearchHandler processing chain so you only will need a single query in the frontend.
If you want the term to be searched over the same fields every time, you have 2 options that don't break the "single query" requirement:
1) copyField: at index time you group all the fields that should match together. With just one copyField your problem doesn't exist; if you need more than one, you're at the same spot.
2) You could filter the query each time by dynamically adding the "fq" parameter at the end:
http://<your_url_and_stuff>/?q=Apple&fq=name:Apple ...
This works if you'll always be searching on the same two fields (or you can set them up before querying); otherwise you'll always need at least a second query.
Since I said "you have 2 options" but you actually have 3 (and I rushed my answer), here's the third:
3) The dismax plugin, described in its documentation like this:
The DisMaxQParserPlugin is designed to process simple user entered phrases
(without heavy syntax) and search for the individual words across several fields
using different weighting (boosts) based on the significance of each field.
So, if you can use it, you may want to give it a look and start from the qf parameter (that is what option number 2 was meant to be about, but I changed it in favor of fq... don't ask me why...).
SolrFaceting should solve your problem.
Have a look at the Examples.
This can be achieved with Solr faceting, but it's not neat. For example, I can issue this query:
/select?q=*:*&rows=0&facet=true&facet.query=title:donkey&facet.query=text:donkey&wt=json
to find the number of documents containing donkey in the title and text fields. I may get this response:
{
"responseHeader":{"status":0,"QTime":1,"params":{"facet":"true","facet.query":["title:donkey","text:donkey"],"q":"*:*","wt":"json","rows":"0"}},
"response":{"numFound":3365840,"start":0,"docs":[]},
"facet_counts":{
"facet_queries":{
"title:donkey":127,
"text:donkey":4108
},
"facet_fields":{},
"facet_dates":{},
"facet_ranges":{}
}
}
Since you also want the documents back for the field-disjunctive query, something like the following works:
/select?q=donkey&defType=edismax&qf=text+title&rows=10&facet=true&facet.query=title:donkey&facet.query=text:donkey&wt=json
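For the "Filter by field" section from the question, a client could generate one facet.query per candidate field and read the counts back from facet_queries. The sketch below does that in Python; the helper names are assumptions, the search term is assumed to be a single plain word, and the sample response simply reuses the counts shown above.

```python
from urllib.parse import urlencode


def field_count_params(term, fields):
    # One facet.query per field, next to the normal edismax search.
    params = [
        ("q", term),
        ("defType", "edismax"),
        ("qf", " ".join(fields)),
        ("rows", "10"),
        ("facet", "true"),
        ("wt", "json"),
    ]
    params += [("facet.query", "{}:{}".format(f, term)) for f in fields]
    return urlencode(params)


def filter_section(response):
    # Turn {"title:donkey": 127, ...} into [("title", 127), ...], skipping zeros.
    counts = response["facet_counts"]["facet_queries"]
    return [(key.split(":", 1)[0], n) for key, n in counts.items() if n > 0]


SAMPLE = {"facet_counts": {"facet_queries": {"title:donkey": 127, "text:donkey": 4108}}}

print(field_count_params("Apple", ["name", "description"]))
for field, hits in filter_section(SAMPLE):
    print("{} ({})".format(field, hits))
```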
