Solr pivot faceting or hierarchical faceting, overall limit - solr

A Solr pivot faceting query returns a tree of facets such as:
"cat,popularity,inStock":[{
"field":"cat",
"value":"electronics",
"count":14,
"pivot":[{
"field":"popularity",
"value":6,
"count":7,
"pivot":[
{
"field":"inStock",
"value":true,
"count":5
},
{
"field":"inStock",
"value":false,
"count":2
}]},
{
"field":"popularity",
"value":1,
"count":4,
"pivot":[
{
"field":"inStock",
"value":true,
"count":3
},
{
"field":"inStock",
"value":false,
"count":1
}]}
]}]
Is there a way to specify a limit on the number of returned leaves (i.e. the number of objects with "field":"inStock" in the example above, which is 4)?
Note that this is not the same as defining a limit on the individual facets, such as "f.popularity.facet.limit", which only limits the fan-out of the internal nodes of the facet tree.
Thanks
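To make "leaves" concrete, here is a small Python sketch that counts them in a pivot response that has already been parsed from JSON. It runs on the client and is not a Solr parameter; the function name count_leaves is made up for illustration.

```python
def count_leaves(pivots):
    """Count pivot entries that have no nested "pivot" list."""
    total = 0
    for entry in pivots:
        children = entry.get("pivot")
        if children:
            total += count_leaves(children)
        else:
            total += 1
    return total


# Copy of the response from the question, as Python data.
example = [{
    "field": "cat", "value": "electronics", "count": 14,
    "pivot": [
        {"field": "popularity", "value": 6, "count": 7, "pivot": [
            {"field": "inStock", "value": True, "count": 5},
            {"field": "inStock", "value": False, "count": 2}]},
        {"field": "popularity", "value": 1, "count": 4, "pivot": [
            {"field": "inStock", "value": True, "count": 3},
            {"field": "inStock", "value": False, "count": 1}]},
    ],
}]

print(count_leaves(example))  # 4
```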

Related

Delimit records by recurring value

I have documents that contain an object array. Within that array are pulses in a dataset. For example:
samples: [{"time":1224960,"flow":0,"temp":null},{"time":1224970,"flow":0,"temp":null},
{"time":1224980,"flow":23,"temp":null},{"time":1224990,"flow":44,"temp":null},
{"time":1225000,"flow":66,"temp":null},{"time":1225010,"flow":0,"temp":null},
{"time":1225020,"flow":650,"temp":null},{"time":1225030,"flow":40,"temp":null},
{"time":1225040,"flow":60,"temp":null},{"time":1225050,"flow":0,"temp":null},
{"time":1225060,"flow":0,"temp":null},{"time":1225070,"flow":0,"temp":null},
{"time":1225080,"flow":0,"temp":null},{"time":1225090,"flow":0,"temp":null},
{"time":1225100,"flow":0,"temp":null},{"time":1225110,"flow":67,"temp":null},
{"time":1225120,"flow":23,"temp":null},{"time":1225130,"flow":0,"temp":null},
{"time":1225140,"flow":0,"temp":null},{"time":1225150,"flow":0,"temp":null}]
I would like to construct an aggregation pipeline that acts on each run of consecutive 'samples.flow' values above zero. In other words, the sample pulses are delimited by one or more zero flow values. I can use an $unwind stage to flatten the data, but I'm at a loss as to how to subsequently group each pulse. I have no objection to this being a multistep process, but I'd rather not have to loop through it in code on the client side. The data will comprise fields from a number of documents and could total hundreds of thousands of entries.
From the example above I'd like to be able to extract:
[{"time":1224980,"total_flow":133,"temp":null},
{"time":1225020,"total_flow":750,"temp":null},
{"time":1225110,"total_flow":90,"temp":null}]
or variations thereof.
If you do not need specific values in the time field, you can use this pipeline with $bucketAuto.
[
  {
    "$bucketAuto": {
      "groupBy": "$time",
      "buckets": 3,
      "output": {
        "total_flow": { "$sum": "$flow" },
        "temp": { "$first": "$temp" },
        "time": { "$min": "$time" }
      }
    }
  },
  {
    "$project": { "_id": 0 }
  }
]
Note that $bucketAuto distributes the documents evenly across the requested number of buckets rather than splitting on zero-flow samples, so the bucket edges may not line up with the pulses. If you need specific values for time, you will need to use $bucket instead and give it a boundaries argument with precalculated lower bounds. I think this should do the job.
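For checking whatever pipeline you end up with, the grouping rule from the question (runs of flow above zero, delimited by zeros) can be sketched in plain Python. This is only a reference implementation of the rule, not an aggregation pipeline, and the name sum_pulses is made up for illustration. On the sample data it yields totals of 133 (23 + 44 + 66), 750 and 90.

```python
from itertools import groupby


def sum_pulses(samples):
    """Sum each run of consecutive samples whose flow is above zero."""
    pulses = []
    for is_pulse, run in groupby(samples, key=lambda s: s["flow"] > 0):
        if not is_pulse:
            continue  # runs of zero flow only delimit pulses
        run = list(run)
        pulses.append({
            "time": run[0]["time"],  # time of the first sample in the pulse
            "total_flow": sum(s["flow"] for s in run),
            "temp": run[0]["temp"],
        })
    return pulses


# The flow values from the question, sampled every 10 time units.
flows = [0, 0, 23, 44, 66, 0, 650, 40, 60, 0, 0, 0, 0, 0, 0, 67, 23, 0, 0, 0]
samples = [{"time": 1224960 + 10 * i, "flow": f, "temp": None}
           for i, f in enumerate(flows)]

for pulse in sum_pulses(samples):
    print(pulse)
```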

How to get the total word count per document in SOLR?

I would like to retrieve some summary statistics from the text documents I have indexed in Solr. In particular, the word count per document.
For example, I have the following three documents indexed:
{
"id":"1",
"text":["This is the text in document 1"]},
{
"id":"2",
"text":["some text in document 2"]},
{
"id":"3",
"text":["and document 3"]}
I would like to get the total number of words per each individual document:
"1",7,
"2",5,
"3",3,
What query can I use to get such a result?
I am new to Solr and I am aware that I can use facets to get the count of the individual words over all documents using something like:
http://localhost:8983/solr/corename/select?q=*&facet=true&facet.field=text&facet.mincount=1
But how to get the total word count per document is not clear to me.
I appreciate your help!
If you run a terms facet over id with a nested facet over text, the nested facet count gives the number of words in the document with that id. The text field must be tokenized, for example of type text_general or an equivalent.
If you only want to count distinct words per document id, it is actually much easier (the examples use a field called message; replace it with your own field, text in the question):
{
"query": "*:*",
"facet": {
"document": {
"type": "terms",
"field": "id",
"facet": {
"wordCount": "unique(message)"
}
}
}
}
This gives the distinct word count per document. The following returns every word with its count per document, but it is up to you to sum them to get the total (it is also an expensive call):
{
"query": "*:*",
"facet": {
"document": {
"type": "terms",
"field": "id",
"facet": {
"wordCount": {
"type": "terms",
"field": "message",
"limit": -1
}
}
}
}
}
@MatsLindth's comment is something to consider too. Solr and you might not agree on what a "word" is. The tokenizer is configurable to a point, but depending on your needs it might not be very easy.
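The summing step could look roughly like the Python sketch below. It assumes the usual JSON Facet API response layout (a "facets" object whose named facets contain "buckets" with "val" and "count"); the sample response is written by hand for illustration and is not real Solr output. One caveat, as far as I understand it: terms facet counts are document counts, so inside a single-document bucket every term counts once and a repeated word is not counted twice.

```python
def total_terms_per_document(response):
    """Sum the nested term bucket counts for each document bucket."""
    totals = {}
    for doc in response["facets"]["document"]["buckets"]:
        words = doc.get("wordCount", {}).get("buckets", [])
        totals[doc["val"]] = sum(w["count"] for w in words)
    return totals


# Hand-written illustration shaped like a JSON Facet API response.
sample_response = {
    "facets": {
        "count": 2,
        "document": {"buckets": [
            {"val": "2", "count": 1, "wordCount": {"buckets": [
                {"val": "some", "count": 1}, {"val": "text", "count": 1},
                {"val": "in", "count": 1}, {"val": "document", "count": 1},
                {"val": "2", "count": 1}]}},
            {"val": "3", "count": 1, "wordCount": {"buckets": [
                {"val": "and", "count": 1}, {"val": "document", "count": 1},
                {"val": "3", "count": 1}]}},
        ]},
    }
}

print(total_terms_per_document(sample_response))  # {'2': 5, '3': 3}
```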

How to fetch only condition satisfied/match object from an array in mongodb?

I am new to MongoDB.
This is my 'masterpatients' collection; it has many documents. Every document contains a 'visits' array, and every visits array contains multiple objects. I want only those objects that match my input.
I expect only the output shown below: if the facility matches my input and the visit date falls within the range I provide, the query should return only that object.
_id:5ef59134a3d8d92e580510fe
flag:0
name:"emicon_test"
dob:2020-06-25T00:00:00.000+00:00
visits:[
{
visit:2020-06-09T10:36:10.635+00:00,
facility:"Atria Lady Lake"
},
{
visit:2020-05-09T10:36:10.635+00:00,
facility:"demo"
}]
_id:5ee3213040f8830e04ff74a8
flag:0
name:"xyz"
dob:1995-06-25T00:00:00.000+00:00
visits:[
{
visit:2020-05-01T10:36:10.635+00:00,
facility:"pqr"
},
{
visit:2020-05-15T10:36:10.635+00:00,
facility:"demo"
},
{
visit:2020-05-09T10:36:10.635+00:00,
facility:"efg"
}]
My query input parameters are facility='demo' and a visit date range from '1st May 2020' to '10th May 2020'.
Expected output:
_id:5ef59134a3d8d92e580510fe
flag:0
name:"emicon_test"
dob:2020-06-25T00:00:00.000+00:00
visits:[
{
visit:2020-05-09T10:36:10.635+00:00,
facility:"demo"
}]
Thanks in advance.
I got an answer.
MasterPatientModel.aggregate([
  { "$unwind": "$visits" },
  {
    "$match": {
      "visits.visit": {
        "$gte": new Date(req.body.facilitySummaryFromdate),
        "$lte": new Date(req.body.facilitySummaryTodate)
      },
      "visits.facility": req.body.facilitySummary
    }
  }
])
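A plain query filter selects whole documents; it does not by itself trim the visits array, so an aggregation such as the one above is needed for that.
Alternatively, make the visits array a top-level collection/model, and you can filter each visit by criteria on the server.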
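If visits should stay an array on each patient document, as in the expected output, one more option is the $filter aggregation operator. The Python sketch below only builds the pipeline as plain data; the function name is made up, and passing the result to something like pymongo's collection.aggregate(pipeline) is an assumption about your setup.

```python
from datetime import datetime


def build_visit_filter_pipeline(facility, start, end):
    """Build a pipeline that keeps only the matching entries of visits."""
    return [
        # Keep only patients with at least one matching visit.
        {"$match": {"visits": {"$elemMatch": {
            "facility": facility,
            "visit": {"$gte": start, "$lte": end},
        }}}},
        # Trim the array in place; the other fields are left untouched.
        {"$addFields": {"visits": {"$filter": {
            "input": "$visits",
            "as": "v",
            "cond": {"$and": [
                {"$eq": ["$$v.facility", facility]},
                {"$gte": ["$$v.visit", start]},
                {"$lte": ["$$v.visit", end]},
            ]},
        }}}},
    ]


pipeline = build_visit_filter_pipeline(
    "demo", datetime(2020, 5, 1), datetime(2020, 5, 10, 23, 59, 59))
print(pipeline)
```

Unlike $unwind followed by $match, this returns one document per patient with visits still an array.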

Elasticsearch query array field across documents

I want to query an array field in Elasticsearch. The field contains one or several GPU node numbers that were allocated to a job. Several people may be using the same node at the same time, since a GPU node can be shared. I want to get the total number of distinct nodes that were in use at a specific time.
Say I have three rows of data which fall in the same time interval. I want to plot a histogram showing that there are three nodes occupied in that period. Can I achieve this on Kibana?
Example :
[3]
[3,4,5]
[4,5]
I am expecting an output of 3 since there were only 3 distinct nodes used.
Thanks in advance
You can accomplish this using a combination of a date histogram aggregation along with either a terms aggregation (if the exact number of nodes is important) or a cardinality aggregation (if you can accept some inaccuracy at higher cardinalities).
Full example:
# Start with a clean slate
DELETE test-index
# Create the index
PUT test-index
{
"mappings": {
"event": {
"properties": {
"nodes": {
"type": "integer"
},
"timestamp": {
"type": "date"
}
}
}
}
}
# Index a few events (using the rows from your question)
POST test-index/event/_bulk
{"index":{}}
{"timestamp": "2018-06-10T00:00:00Z", "nodes":[3]}
{"index":{}}
{"timestamp": "2018-06-10T00:01:00Z", "nodes":[3,4,5]}
{"index":{}}
{"timestamp": "2018-06-10T00:02:00Z", "nodes":[4,5]}
# STRATEGY 1: Cardinality aggregation (scalable, but potentially inaccurate)
POST test-index/event/_search
{
"size": 0,
"aggs": {
"active_nodes_histo": {
"date_histogram": {
"field": "timestamp",
"interval": "hour"
},
"aggs": {
"active_nodes": {
"cardinality": {
"field": "nodes"
}
}
}
}
}
}
# STRATEGY 2: Terms aggregation (exact, but potentially much more expensive)
POST test-index/event/_search
{
"size": 0,
"aggs": {
"active_nodes_histo": {
"date_histogram": {
"field": "timestamp",
"interval": "hour"
},
"aggs": {
"active_nodes": {
"terms": {
"field": "nodes",
"size": 10
}
}
}
}
}
}
Notes:
Terms vs. cardinality aggregation: Use the cardinality agg unless you need to know WHICH nodes are in use. It is significantly more scalable, and until you get into cardinality of 1000s, you likely won't see any inaccuracy.
Date histogram interval: You can play with the interval until it is something that makes sense for you. If you run the example above, you'll only see one histogram bucket; if you change hour to minute, you'll see the histogram build itself out with more data points.
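What both strategies compute per histogram bucket can be sketched in plain Python: collect the node numbers of all events in a bucket into a set and take its size. This is only an illustration of the exact result (the cardinality aggregation approximates it); the function name is made up.

```python
from collections import defaultdict
from datetime import datetime


def distinct_nodes_per_hour(events):
    """Return {hour bucket: number of distinct nodes seen in that hour}."""
    buckets = defaultdict(set)
    for event in events:
        ts = datetime.strptime(event["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].update(event["nodes"])
    return {hour: len(nodes) for hour, nodes in sorted(buckets.items())}


# The three events indexed in the example above.
events = [
    {"timestamp": "2018-06-10T00:00:00Z", "nodes": [3]},
    {"timestamp": "2018-06-10T00:01:00Z", "nodes": [3, 4, 5]},
    {"timestamp": "2018-06-10T00:02:00Z", "nodes": [4, 5]},
]

result = distinct_nodes_per_hour(events)
print(result)  # one bucket (2018-06-10 00:00) with 3 distinct nodes
```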

how to order groups by count in solr

I'm wondering how to order groups in a Solr result. I want to order the groups by numFound. I saw how to order the groups by score here, but that didn't seem to actually make a difference in the examples I looked at, and isn't exactly what I wanted.
In the xml you can see the number per group as numFound and that is what I want to sort the groups by, so for example I could see the largest group at the top.
<arr name="groups">
<lst>
<str name="groupValue">top secret</str>
<result name="doclist" numFound="12" start="0">
...
Any tips appreciated! Thanks!
This is an old question, but it is possible with two queries.
First query: bring back the field you're grouping by as a set of facets for your navigation state. You can limit the number of records returned to 0 here: you just need the facets. The number of facets you return should be the size of your page.
group_id:
23 (6)
143 (3)
5 (2)
Second query: this one is for the records, so no facets are required. The query should be an OR query over the facet field values returned from the first query (group_id:23 OR group_id:143 OR group_id:5 and so on), grouped by the id you are using for grouping.
Sorting: reorder the records from query 2 to match the order from query 1.
That'll do it, with the proviso that I'm not sure how scalable that OR query will be. If you're looking to paginate, remember that you can offset facets: use that as the mechanism instead of offsetting the records.
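The glue between the two queries could look roughly like this in Python. It is a sketch that assumes you have already parsed the facet values from the first response and the groups from the second; the helper names are made up, and no Solr client calls are shown.

```python
def build_group_query(field, values):
    """OR together the facet values returned by the first query."""
    return " OR ".join(f"{field}:{value}" for value in values)


def order_groups_by_facets(groups, facet_values):
    """Reorder the groups from query 2 to match the facet order of query 1."""
    rank = {str(value): i for i, value in enumerate(facet_values)}
    # Groups that are missing from the facet list go to the end.
    return sorted(groups, key=lambda g: rank.get(str(g["groupValue"]), len(rank)))


# Facet values and counts from the first query, largest count first.
facet_counts = [("23", 6), ("143", 3), ("5", 2)]
values = [value for value, _ in facet_counts]

print(build_group_query("group_id", values))
# group_id:23 OR group_id:143 OR group_id:5

# Groups as they might come back from the second query, in arbitrary order.
groups = [{"groupValue": "5"}, {"groupValue": "23"}, {"groupValue": "143"}]
print([g["groupValue"] for g in order_groups_by_facets(groups, values)])
# ['23', '143', '5']
```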
Sorting on numFound is not possible, as numFound is not a field in Solr.
There is a discussion mentioning that it is not supported, and I did not find an open JIRA issue for it either.
Not possible, as of the last time I looked into this.
You can sort by using fields.
Consider an example:
If you have 5 facets, each with an associated count, you can sort by those counts.
The same approach applies to normal (non-facet) fields.
import java.io.Serializable;

// Holder for one facet value and its count.
// Category is an interface from the poster's own application.
public class FacetBean implements Category, Serializable {
    private final String facetName;
    private final long facetCount;

    public FacetBean(String facetName, long facetCount) {
        this.facetName = facetName;
        this.facetCount = facetCount;
    }

    public String getFacetName() { return facetName; }
    public long getFacetCount() { return facetCount; }
}
Your calling method should look like this (it needs java.util.ArrayList, java.util.Comparator and java.util.List, plus QueryResponse and FacetField from SolrJ):
private List<FacetBean> getFacetFieldsbyCount(QueryResponse queryResponse) {
    List<FacetBean> facetList = new ArrayList<>();
    List<FacetField> flds = queryResponse.getFacetFields();
    if (flds != null) {
        for (FacetField fld : flds) {
            List<FacetField.Count> counts = fld.getValues();
            if (counts == null) {
                continue;
            }
            // One bean per facet value, so no count is overwritten.
            for (FacetField.Count count : counts) {
                facetList.add(new FacetBean(count.getName(), count.getCount()));
            }
        }
    }
    // Sort by count, largest first.
    facetList.sort(Comparator.comparingLong(FacetBean::getFacetCount).reversed());
    return facetList;
}
The same URL also mentions the following parameters:
sort: for example, sort=popularity desc will cause the groups to be sorted according to the highest popularity doc.
group.sort: sorts the documents within each group; you can apply your field here.
Hope it helps.
