I would like to retrieve some summary statistics from the text documents I have indexed in Solr, in particular the word count per document.
For example, I have the following three documents indexed:
{
"id":"1",
"text":["This is the text in document 1"]},
{
"id":"2",
"text":["some text in document 2"]},
{
"id":"3",
"text":["and document 3"]}
I would like to get the total number of words for each individual document:
"1",7,
"2",5,
"3",3,
What query can I use to get such a result?
I am new to Solr and I am aware that I can use facets to get the count of the individual words over all documents using something like:
http://localhost:8983/solr/corename/select?q=*&facet=true&facet.field=text&facet.mincount=1
But how to get the total word count per document is not clear to me.
I appreciate your help!
If you facet on id and nest an inner facet on the text field, the inner facet gives you the words in the document with that id. The text field's type must be text_general or an equivalent tokenized type.
If you only want to count "distinct" words per document id, it is actually much easier (the examples below use a field called message; substitute your own tokenized field, e.g. text):
{
"query": "*:*",
"facet": {
"document": {
"type": "terms",
"field": "id",
"facet": {
"wordCount": "unique(message)"
}
}
}
}
This gives the distinct word count per document. The following returns all words and all counts per document, but it is up to you to sum them to get the total (it is also an expensive call):
{
"query": "*:*",
"facet": {
"document": {
"type": "terms",
"field": "id",
"facet": {
"wordCount": {
"type": "terms",
"field": "message",
"limit": -1
}
}
}
}
}
@MatsLindth's comment is something to consider too: Solr and you might not agree on what a "word" is. The tokenizer is configurable up to a point, but depending on your needs it might not be easy.
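The client-side summing step could look roughly like the sketch below. It assumes the usual JSON Facet API response shape ("facets", then the facet name, then "buckets" with "val" and "count"); the sample response is hand-written for illustration and is not real Solr output. The function name words_per_document is hypothetical.

```python
from typing import Dict


def words_per_document(response: dict) -> Dict[str, int]:
    """Sum the inner term-bucket counts for every outer id bucket."""
    totals = {}
    for doc_bucket in response["facets"]["document"]["buckets"]:
        inner = doc_bucket.get("wordCount", {}).get("buckets", [])
        totals[doc_bucket["val"]] = sum(b["count"] for b in inner)
    return totals


# Hand-written sample in the assumed response shape (not real Solr output).
sample = {
    "facets": {
        "count": 2,
        "document": {
            "buckets": [
                {"val": "2", "count": 1, "wordCount": {"buckets": [
                    {"val": "some", "count": 1},
                    {"val": "text", "count": 1},
                    {"val": "in", "count": 1},
                    {"val": "document", "count": 1},
                    {"val": "2", "count": 1},
                ]}},
                {"val": "3", "count": 1, "wordCount": {"buckets": [
                    {"val": "and", "count": 1},
                    {"val": "document", "count": 1},
                    {"val": "3", "count": 1},
                ]}},
            ]
        },
    }
}

print(words_per_document(sample))
```

One caveat, as far as I understand it: facet counts are document counts, so inside a single-document bucket a repeated word may still be counted once. If repeated words matter, it is worth checking the result against a document you have counted by hand.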
The document structure has a round collection, which has an array of holes Objects embedded within it, with each hole played/scored entered.
The structure looks like this (there are more fields, but this summarises):
{
"_id": {
"$oid": "60701a691c071256e4f0d0d6"
},
"schema": {
"$numberDecimal": "1.0"
},
"playerName": "T Woods",
"comp": {
"id": {
"$oid": "607019361c071256e4f0d0d5"
},
"name": "US Open",
"tees": "Pro Tees",
"roundNo": {
"$numberInt": "1"
},
"scoringMethod": "Stableford"
},
"holes": [
{
"holeNo": {
"$numberInt": "1"
},
"holePar": {
"$numberInt": "4"
},
"holeSI": {
"$numberInt": "3"
},
"holeGross": {
"$numberInt": "4"
},
"holeStrokes": {
"$numberInt": "1"
},
"holeNett": {
"$numberInt": "3"
},
"holeGrossPoints": {
"$numberInt": "2"
},
"holeNettPoints": {
"$numberInt": "3"
}
}
]
}
In the Atlas web UI, it shows as (note there are 9 holes in this particular round of golf - limited to 3 for brevity):
I would like to find the players who have a holeGross of 2, or less, somewhere in their round of golf (i.e. a birdie on par 3 or better).
Being new to MongoDB, and NoSQL constructs, I am stuck with this. Reading around the aggregation pipeline framework, I have tried to break down the stages I will need as:
Filter by the comp.id and comp.roundNo
Filter this result with any hole within the holes array of Objects
Maybe I have approached this wrong, and should filter or structure this pipeline differently?
So far, using the Atlas web UI, I can apply these filters individually as:
{
"comp.id": ObjectId("607019361c071256e4f0d0d5"),
"comp.roundNo": 2
}
And:
{ "holes.0.holeGross": 2 }
But I have 2 problems:
In the second filter query, I have hard-coded the array index to get this value. I need to search across all the sub-elements of every document that matches this comp.id && comp.roundNo.
How do I combine these? I am presuming this is where the aggregation comes in, as well as enumerating across the whole array (as above).
I note in particular it is the extra ".0." part of the second query that I am not seeing from various other online postings trying to do the same thing. Is my data structure incorrect? Do I need the [0]...[17] Objects for an 18-hole round of golf?
I would like to find the players who have a holeGross of 2, or less, somewhere in their round of golf
If that is the goal, a simple $lte search inside the holes array like the following would do:
db.collection.find({ "holes.holeGross": { $lte: 2 } })
Simply leave the array index (such as 0) out of the property path to search every element of the array.
https://mongoplayground.net/p/KhZLnj9mJe5
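To combine this with the comp.id and comp.roundNo filters from the question, all three conditions can go in one filter document, since top-level conditions in a MongoDB filter are implicitly ANDed. Here is a minimal Python sketch that builds such a filter; the helper name build_round_filter, the collection name rounds, and the use of pymongo are assumptions for illustration only.

```python
from typing import Any, Dict


def build_round_filter(comp_id: Any, round_no: int, max_gross: int) -> Dict[str, Any]:
    """Build one filter document; its top-level keys are implicitly ANDed."""
    return {
        "comp.id": comp_id,
        "comp.roundNo": round_no,
        # No array index in the path: matches if any element of holes qualifies.
        "holes.holeGross": {"$lte": max_gross},
    }


# Placeholder id as a plain string; with pymongo you would pass
# bson.ObjectId("607019361c071256e4f0d0d5") instead.
query = build_round_filter("607019361c071256e4f0d0d5", 2, 2)
print(query)

# Hypothetical usage with pymongo (the collection name "rounds" is an assumption):
# for doc in db.rounds.find(query, {"playerName": 1}):
#     print(doc["playerName"])
```

For this particular question no aggregation pipeline is needed; a plain find with the combined filter should be enough.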
I want to query an array field in Elasticsearch. The field contains one or several node numbers of the GPU nodes that were allocated to a job. Several people may be using the same node at the same time, since a GPU node can be shared. I want to get the total number of distinct nodes that were used at a specific time.
Say I have three rows of data which fall in the same time interval. I want to plot a histogram showing that there are three nodes occupied in that period. Can I achieve this in Kibana?
Example :
[3]
[3,4,5]
[4,5]
I am expecting an output of 3 since there were only 3 distinct nodes used.
Thanks in advance
You can accomplish this using a combination of a date histogram aggregation along with either a terms aggregation (if the exact number of nodes is important) or a cardinality aggregation (if you can accept some inaccuracy at higher cardinalities).
Full example:
# Start with a clean slate
DELETE test-index
# Create the index
PUT test-index
{
"mappings": {
"event": {
"properties": {
"nodes": {
"type": "integer"
},
"timestamp": {
"type": "date"
}
}
}
}
}
# Index a few events (using the rows from your question)
POST test-index/event/_bulk
{"index":{}}
{"timestamp": "2018-06-10T00:00:00Z", "nodes":[3]}
{"index":{}}
{"timestamp": "2018-06-10T00:01:00Z", "nodes":[3,4,5]}
{"index":{}}
{"timestamp": "2018-06-10T00:02:00Z", "nodes":[4,5]}
# STRATEGY 1: Cardinality aggregation (scalable, but potentially inaccurate)
POST test-index/event/_search
{
"size": 0,
"aggs": {
"active_nodes_histo": {
"date_histogram": {
"field": "timestamp",
"interval": "hour"
},
"aggs": {
"active_nodes": {
"cardinality": {
"field": "nodes"
}
}
}
}
}
}
# STRATEGY 2: Terms aggregation (exact, but potentially much more expensive)
POST test-index/event/_search
{
"size": 0,
"aggs": {
"active_nodes_histo": {
"date_histogram": {
"field": "timestamp",
"interval": "hour"
},
"aggs": {
"active_nodes": {
"terms": {
"field": "nodes",
"size": 10
}
}
}
}
}
}
Notes:
Terms vs. cardinality aggregation: Use the cardinality agg unless you need to know WHICH nodes are in use. It is significantly more scalable, and until you get into cardinality of 1000s, you likely won't see any inaccuracy.
Date histogram interval: You can play with the interval until it is something that makes sense for you. If you run through the example above, you'll only see one histogram bucket. However, if you change hour to minute, you'll see the histogram build itself out with more data points.
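To make it concrete what the two aggregations compute, here is a plain-Python sketch (not Elasticsearch code) that buckets the three sample events by hour or by minute and counts the distinct node ids in each bucket. The function name distinct_nodes_per_bucket is hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# The three events from the bulk request above.
events = [
    {"timestamp": "2018-06-10T00:00:00Z", "nodes": [3]},
    {"timestamp": "2018-06-10T00:01:00Z", "nodes": [3, 4, 5]},
    {"timestamp": "2018-06-10T00:02:00Z", "nodes": [4, 5]},
]


def distinct_nodes_per_bucket(events, fmt="%Y-%m-%dT%H"):
    """Group events by a truncated timestamp and count distinct node ids."""
    buckets = defaultdict(set)
    for event in events:
        ts = datetime.strptime(event["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
        buckets[ts.strftime(fmt)].update(event["nodes"])
    return {key: len(nodes) for key, nodes in sorted(buckets.items())}


hourly = distinct_nodes_per_bucket(events)
per_minute = distinct_nodes_per_bucket(events, fmt="%Y-%m-%dT%H:%M")
print(hourly)      # one bucket containing nodes 3, 4 and 5
print(per_minute)  # one bucket per event
```

With the hourly format all three events land in a single bucket with 3 distinct nodes, which matches the expected output in the question.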
A Solr pivot faceting query returns a tree of facets such as:
"cat,popularity,inStock":[{
"field":"cat",
"value":"electronics",
"count":14,
"pivot":[{
"field":"popularity",
"value":6,
"count":7,
"pivot":[
{
"field":"inStock",
"value":true,
"count":5
},
{
"field":"inStock",
"value":false,
"count":2
}]},
{
"field":"popularity",
"value":1,
"count":4,
"pivot":[
{
"field":"inStock",
"value":true,
"count":3
},
{
"field":"inStock",
"value":false,
"count":1
}]}
]}]
Is there a way to specify a limit on the number of returned leaves (i.e. the number of objects with "field":"inStock" in the example above, which is 4)?
Note that this is not the same as defining a limit on the individual facets, such as "f.popularity.facet.limit", which only limits the degree of the facet tree's internal nodes.
Thanks
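Independently of whatever Solr offers server-side, the leaves can be counted or truncated on the client once the response has arrived. The sketch below is only that, a client-side illustration of what "number of leaves" means here; the function names iter_leaves and first_n_leaves are hypothetical, and the tree is the example from the question.

```python
from itertools import islice
from typing import Iterator, List


def iter_leaves(pivots: List[dict]) -> Iterator[dict]:
    """Yield pivot nodes that have no nested "pivot" list, depth first."""
    for node in pivots:
        children = node.get("pivot")
        if children:
            yield from iter_leaves(children)
        else:
            yield node


def first_n_leaves(pivots: List[dict], n: int) -> List[dict]:
    """Return at most n leaves in depth-first order."""
    return list(islice(iter_leaves(pivots), n))


# The pivot tree from the example above.
tree = [{
    "field": "cat", "value": "electronics", "count": 14,
    "pivot": [
        {"field": "popularity", "value": 6, "count": 7, "pivot": [
            {"field": "inStock", "value": True, "count": 5},
            {"field": "inStock", "value": False, "count": 2}]},
        {"field": "popularity", "value": 1, "count": 4, "pivot": [
            {"field": "inStock", "value": True, "count": 3},
            {"field": "inStock", "value": False, "count": 1}]},
    ],
}]

leaf_count = sum(1 for _ in iter_leaves(tree))
print(leaf_count)
print(first_n_leaves(tree, 3))
```

This does not reduce the work Solr does or the size of the response; it only trims the result after the fact.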
I am using Solr to retrieve results from a Cassandra table.
Table structure:
CREATE TABLE mytable (
field1 uuid,
field2 text ,
bfield blob,
custmdata_ map<text, text>,
PRIMARY KEY (field1)
);
Table content
INSERT INTO mytable (field1, field2, custmdata_) VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'test1', { 'custmdata_data1' : 'data1value', 'custmdata_data2' : 'data2value' });
INSERT INTO mytable (field1, field2, custmdata_) VALUES (e26690db-dd54-4b61-b002-d3c07125f359, 'test2', { 'custmdata_data5' : 'data5value', 'custmdata_data1' : 'mydata1value' });
I am able to retrieve the results using a Solr query:
{
"responseHeader": {
"status": 0,
"QTime": 1
},
"response": {
"numFound": 2,
"start": 0,
"docs": [
{
"field1": "62c36092-82a1-3a00-93d1-46196ee77204",
"field2": "test1",
"custmdata_data1":"data1value",
"custmdata_data2" : "data2value"
},
{
"field1": "e26690db-dd54-4b61-b002-d3c07125f359",
"field2": "test2",
"custmdata_data5":"data5value",
"custmdata_data1" : "mydata1value"
}
]
}
}
Is there any way to specify the field name in result so that I can retrieve the dynamic fields without having the field name prefix? I need result like this:
{
"responseHeader": {
"status": 0,
"QTime": 1
},
"response": {
"numFound": 2,
"start": 0,
"docs": [
{
"field1": "62c36092-82a1-3a00-93d1-46196ee77204",
"field2": "test1",
"data1":"data1value",
"data2" : "data2value"
},
{
"field1": "e26690db-dd54-4b61-b002-d3c07125f359",
"field2": "test2",
"data5":"data5value",
"data1" : "mydata1value"
}
]
}
}
Update:
From the DataStax documentation, I found that:
Avoid or limit the use of dynamic fields. Lucene allocates memory for
each unique field (column) name, so if you have a row with columns A,
B, C, and another row with B, D, E, Lucene allocates 5 chunks of
memory. For millions of rows, the heap is unwieldy.
So is there a better way to achieve dynamic field based filtering in Solr? What I need is to filter against custom fields that may vary for each insert.
Instead of calling your dynamic field custmdata_, call it data, and that should get rid of the bit you don't want.
Otherwise, removing the prepended dynamic field label is not supported, but you can rename returned fields with fl: https://wiki.apache.org/solr/CommonQueryParameters#Field_alias
If you're worried about having too many dynamic fields, try to work around it with a C* collection type, if your scenario lends itself to that.
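As a rough illustration of the fl alias approach, the sketch below builds an fl value in the alias:fieldname form for a known list of dynamic field names. The helper name build_fl is hypothetical, and the sketch assumes you already know which dynamic field names you want back, since each alias has to be listed explicitly.

```python
from typing import Iterable
from urllib.parse import urlencode


def build_fl(plain_fields: Iterable[str], dynamic_fields: Iterable[str], prefix: str) -> str:
    """Build an fl value that aliases each prefixed field to its bare name."""
    parts = list(plain_fields)
    for name in dynamic_fields:
        if name.startswith(prefix) and len(name) > len(prefix):
            alias = name[len(prefix):]
            parts.append(f"{alias}:{name}")  # Solr field alias syntax: alias:field
        else:
            parts.append(name)  # nothing to strip, request the field as-is
    return ",".join(parts)


fl = build_fl(
    ["field1", "field2"],
    ["custmdata_data1", "custmdata_data2"],
    "custmdata_",
)
print(fl)

# URL-encoded query string to append to the select handler URL.
params = urlencode({"q": "*:*", "fl": fl})
print(params)
```

The returned documents would then carry data1 and data2 as keys instead of the prefixed names, at the cost of having to enumerate the fields per request.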
Does Solr return the fields (dynamic fields) of a result document in the same sequence that was used when indexing the document?
For Example:
Consider the following record being indexed
School_txt , Class_txt , Section_txt
When I get this document back as a result, will the sequence of fields be maintained, or can it be random, like Class_txt, School_txt, Section_txt?
If it can be random, how can I preserve the sequence of fields?
Yes, the sequence of the fields is maintained (at least with 4.9.0) for each document. This is also true for multiValued fields, where the values are returned in the same sequence as they were added (which is useful if you want to merge two fields into a separate value later). Here's an example where I rotated the field sequence while indexing:
{
"id": "1",
"School_txt": "School",
"Class_txt": "Class",
"Section_txt": "Section1",
"_version_": 1473987528354693000
},
{
"id": "2",
"Class_txt": "School2",
"Section_txt": "Class2",
"School_txt": "Section2",
"_version_": 1473987528356790300
},
{
"id": "3",
"Section_txt": "School3",
"School_txt": "Class3",
"Class_txt": "Section3",
"_version_": 1473987528356790300
}