I am using Solr to retrieve results from a Cassandra table.
Table structure:
CREATE TABLE mytable (
field1 uuid,
field2 text ,
bfield blob,
custmdata_ map<text, text>,
PRIMARY KEY (field1)
);
Table content:
INSERT INTO mytable (field1, field2, custmdata_) VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'test1', { 'custmdata_data1' : 'data1value', 'custmdata_data2' : 'data2value' });
INSERT INTO mytable (field1, field2, custmdata_) VALUES (e26690db-dd54-4b61-b002-d3c07125f359, 'test2', { 'custmdata_data5' : 'data5value', 'custmdata_data1' : 'mydata1value' });
I am able to retrieve the results using a Solr query:
{
"responseHeader": {
"status": 0,
"QTime": 1
},
"response": {
"numFound": 2,
"start": 0,
"docs": [
{
"field1": "62c36092-82a1-3a00-93d1-46196ee77204",
"field2": "test1",
"custmdata_data1":"data1value",
"custmdata_data2" : "data2value"
},
{
"field1": "e26690db-dd54-4b61-b002-d3c07125f359",
"field2": "test2",
"custmdata_data5":"data5value",
"custmdata_data1" : "mydata1value"
}
]
}
}
Is there any way to specify the field names in the result so that I can retrieve the dynamic fields without the field name prefix? I need a result like this:
{
"responseHeader": {
"status": 0,
"QTime": 1
},
"response": {
"numFound": 2,
"start": 0,
"docs": [
{
"field1": "62c36092-82a1-3a00-93d1-46196ee77204",
"field2": "test1",
"data1":"data1value",
"data2" : "data2value"
},
{
"field1": "e26690db-dd54-4b61-b002-d3c07125f359",
"field2": "test2",
"data5":"data5value",
"data1" : "mydata1value"
}
]
}
}
Update:
From the DataStax documentation, I found this:
Avoid or limit the use of dynamic fields. Lucene allocates memory for
each unique field (column) name, so if you have a row with columns A,
B, C, and another row with B, D, E, Lucene allocates 5 chunks of
memory. For millions of rows, the heap is unwieldy.
So is there a better way to achieve dynamic field based filtering in Solr? What I need is to filter against custom fields that may vary for each insert.
Instead of calling your dynamic field custmdata_, call it data; that should get rid of the part you don't want.
Otherwise, removing the prepended dynamic field label is not supported, but you can rename returned fields with fl field aliases: https://wiki.apache.org/solr/CommonQueryParameters#Field_alias
If you're worried about having too many dynamic fields, try to work around it with a C* collection type, if your scenario lends itself to that.
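As a rough illustration of the fl alias approach, the Python sketch below builds an fl value of the form alias:fieldname for each dynamic field, and also shows a client-side fallback that strips the prefix after the response arrives. The helper names (build_fl, strip_prefix) are made up for this example, and the alias approach assumes you already know which dynamic field names to request.

```python
from urllib.parse import urlencode

PREFIX = "custmdata_"  # dynamic field prefix taken from the question


def build_fl(fixed_fields, dynamic_fields, prefix=PREFIX):
    """Build an fl value that aliases each prefixed field to its bare name.

    Assumes every name in dynamic_fields starts with the prefix.
    """
    aliases = [f"{name[len(prefix):]}:{name}" for name in dynamic_fields]
    return ",".join(list(fixed_fields) + aliases)


def strip_prefix(doc, prefix=PREFIX):
    """Client-side fallback: rename prefixed keys in one result document."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in doc.items()
    }


fl = build_fl(["field1", "field2"], ["custmdata_data1", "custmdata_data2"])
# Query string to append to the core's /select handler.
query_string = urlencode({"q": "*:*", "fl": fl, "wt": "json"})

doc = {"field2": "test1", "custmdata_data1": "data1value"}
renamed = strip_prefix(doc)
```

The alias route keeps the renaming on the Solr side but needs the field list up front; the client-side route works when the dynamic field names vary per row, as they do in the question.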
I would like to retrieve some summary statistics from the text documents I have indexed in Solr. In particular, the word count per document.
For example, I have the following three documents indexed:
{
"id":"1",
"text":["This is the text in document 1"]},
{
"id":"2",
"text":["some text in document 2"]},
{
"id":"3",
"text":["and document 3"]}
I would like to get the total number of words per each individual document:
"1",7,
"2",5,
"3",3,
What query can I use to get such a result?
I am new to Solr and I am aware that I can use facets to get the count of the individual words over all documents using something like:
http://localhost:8983/solr/corename/select?q=*&facet=true&facet.field=text&facet.mincount=1
But how to get the total word count per document is not clear to me.
I appreciate your help!
If you do a faceted search over id with an inner facet over text, the inner facet counts give the number of words in the document with that id. The text field type must be text_general or something equivalent (tokenized).
If you only want to count "distinct" words per document id, it is actually much easier:
{
"query": "*:*",
"facet": {
"document": {
"type": "terms",
"field": "id",
"facet": {
"wordCount": "unique(message)"
}
}
}
}
This gives the distinct word count per document. The following returns every word and its count per document, but it is up to you to sum them to get the total (it is also an expensive call):
{
"query": "*:*",
"facet": {
"document": {
"type": "terms",
"field": "id",
"facet": {
"wordCount": {
"type": "terms",
"field": "message",
"limit": -1
}
}
}
}
}
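The summing step can be sketched in Python as follows. This assumes the response follows the usual JSON Facet layout (a top-level "facets" object with "buckets" lists under the facet names used in the request above); the sample data is hand-written for illustration and is not real Solr output. As far as I understand, terms facet counts are document counts, so a word repeated inside one document may only contribute 1 to the sum.

```python
def total_counts_per_document(facet_response):
    """Sum the inner wordCount bucket counts for each outer id bucket."""
    totals = {}
    for doc_bucket in facet_response["facets"]["document"]["buckets"]:
        word_buckets = doc_bucket.get("wordCount", {}).get("buckets", [])
        totals[doc_bucket["val"]] = sum(b["count"] for b in word_buckets)
    return totals


def _buckets(words):
    """Helper for the sample only: one bucket per token, each with count 1."""
    return [{"val": word, "count": 1} for word in words]


# Hand-written sample shaped like a JSON Facet response (an assumption).
sample = {
    "facets": {
        "count": 2,
        "document": {
            "buckets": [
                {"val": "2", "count": 1,
                 "wordCount": {"buckets": _buckets(["some", "text", "in", "document", "2"])}},
                {"val": "3", "count": 1,
                 "wordCount": {"buckets": _buckets(["and", "document", "3"])}},
            ]
        },
    }
}

totals = total_counts_per_document(sample)
```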
@MatsLindth's comment is something to consider too: Solr and you might not agree on what a "word" is. The tokenizer is configurable to a point, but depending on your needs it might not be very easy.
I am working with Cosmos DB and I want to write a SQL query that returns multiple documents combined into a single document with an embedded array.
To elaborate, imagine you have the following two document types in one container. The OrderId of the Order document is referenced in the OrderDetail documents.
1. Order
{
"OrderId": "31d4c08b-ee59-4ede-b801-3cacaea38808",
"Name": "ABC",
"Type": "Order",
"DeptName": "ABC",
"TotalAmount": 100.05
}
2. OrderDetail
{
"OrderDetailId": "689bdc38-9849-4a11-b856-53f8628b76c9",
"OrderId": "31d4c08b-ee59-4ede-b801-3cacaea38808",
"Type": "OrderDetail",
"ItemNo": 202,
"Quantity": 10,
"UnitPrice": 10.05
},
{
"OrderDetailId": "789bdc38-9849-4a11-b856-53f8628b76c9",
"OrderId": "31d4c08b-ee59-4ede-b801-3cacaea38808",
"Type": "OrderDetail",
"ItemNo": 200,
"Quantity": 11,
"UnitPrice": 15.05
}
I want to write a query that returns all OrderDetail entries in one array, based on the reference OrderId="31d4c08b-ee59-4ede-b801-3cacaea38808".
The output should look like this:
{
"OrderId":"31d4c08b-ee59-4ede-b801-3cacaea38808",
"Name":"ABC",
"Type":"Order",
"OrderDetail":[
{
"OrderDetailId":"689bdc38-9849-4a11-b856-53f8628b76c9",
"Type":"OrderDetail",
"ItemNo":202,
"Quantity":10,
"UnitPrice":10.05
},
{
"OrderDetailId":"789bdc38-9849-4a11-b856-53f8628b76c9",
"Type":"OrderDetail",
"ItemNo":200,
"Quantity":11,
"UnitPrice":15.05
}
]
}
I have no idea how to write a query in Cosmos DB that gets the above result.
Your desired output is the kind of result a join in a relational database would produce. Cosmos DB is a non-relational database and is not well suited to this scenario. To my knowledge, no SQL query can produce the above output directly.
I suggest executing two queries. One produces:
{"OrderId":"31d4c08b-ee59-4ede-b801-3cacaea38808",
"Name":"ABC",
"Type":"Order"}
The other produces:
"OrderDetail":[
{
"OrderDetailId":"689bdc38-9849-4a11-b856-53f8628b76c9",
"Type":"OrderDetail",
"ItemNo":202,
"Quantity":10,
"UnitPrice":10.05
},
{
"OrderDetailId":"789bdc38-9849-4a11-b856-53f8628b76c9",
"Type":"OrderDetail",
"ItemNo":200,
"Quantity":11,
"UnitPrice":15.05
}
]
Then combine them. You could also do this in a stored procedure.
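A minimal Python sketch of the combine step, assuming the two result sets have already been fetched as plain dictionaries. The queries in the comments and the function name embed_details are illustrative assumptions, not something prescribed by Cosmos DB.

```python
# Assumed queries, for illustration only:
#   SELECT c.OrderId, c.Name, c.Type FROM c
#     WHERE c.Type = 'Order' AND c.OrderId = @orderId
#   SELECT * FROM c
#     WHERE c.Type = 'OrderDetail' AND c.OrderId = @orderId


def embed_details(order, details, order_fields=("OrderId", "Name", "Type")):
    """Return the order with its matching details embedded as an array."""
    merged = {key: order[key] for key in order_fields if key in order}
    merged["OrderDetail"] = [
        # Drop the redundant OrderId from each embedded detail.
        {key: value for key, value in detail.items() if key != "OrderId"}
        for detail in details
        if detail.get("OrderId") == order["OrderId"]
    ]
    return merged


order = {
    "OrderId": "31d4c08b-ee59-4ede-b801-3cacaea38808",
    "Name": "ABC",
    "Type": "Order",
    "DeptName": "ABC",
    "TotalAmount": 100.05,
}
details = [
    {"OrderDetailId": "689bdc38-9849-4a11-b856-53f8628b76c9",
     "OrderId": "31d4c08b-ee59-4ede-b801-3cacaea38808",
     "Type": "OrderDetail", "ItemNo": 202, "Quantity": 10, "UnitPrice": 10.05},
    {"OrderDetailId": "789bdc38-9849-4a11-b856-53f8628b76c9",
     "OrderId": "31d4c08b-ee59-4ede-b801-3cacaea38808",
     "Type": "OrderDetail", "ItemNo": 200, "Quantity": 11, "UnitPrice": 15.05},
]

result = embed_details(order, details)
```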
I want to query an array field in Elasticsearch. The field contains one or several node numbers of the GPU nodes that were allocated to a job. Different people may be using the same node at the same time, since a GPU node can be shared. I want to get the total number of distinct nodes that were in use at a specific time.
Say I have three rows of data that fall in the same time interval. I want to plot a histogram showing that three nodes were occupied in that period. Can I achieve this in Kibana?
Example:
[3]
[3,4,5]
[4,5]
I am expecting an output of 3 since there were only 3 distinct nodes used.
Thanks in advance
You can accomplish this using a combination of a date histogram aggregation along with either a terms aggregation (if the exact number of nodes is important) or a cardinality aggregation (if you can accept some inaccuracy at higher cardinalities).
Full example:
# Start with a clean slate
DELETE test-index
# Create the index
PUT test-index
{
"mappings": {
"event": {
"properties": {
"nodes": {
"type": "integer"
},
"timestamp": {
"type": "date"
}
}
}
}
}
# Index a few events (using the rows from your question)
POST test-index/event/_bulk
{"index":{}}
{"timestamp": "2018-06-10T00:00:00Z", "nodes":[3]}
{"index":{}}
{"timestamp": "2018-06-10T00:01:00Z", "nodes":[3,4,5]}
{"index":{}}
{"timestamp": "2018-06-10T00:02:00Z", "nodes":[4,5]}
# STRATEGY 1: Cardinality aggregation (scalable, but potentially inaccurate)
POST test-index/event/_search
{
"size": 0,
"aggs": {
"active_nodes_histo": {
"date_histogram": {
"field": "timestamp",
"interval": "hour"
},
"aggs": {
"active_nodes": {
"cardinality": {
"field": "nodes"
}
}
}
}
}
}
# STRATEGY 2: Terms aggregation (exact, but potentially much more expensive)
POST test-index/event/_search
{
"size": 0,
"aggs": {
"active_nodes_histo": {
"date_histogram": {
"field": "timestamp",
"interval": "hour"
},
"aggs": {
"active_nodes": {
"terms": {
"field": "nodes",
"size": 10
}
}
}
}
}
}
Notes:
Terms vs. cardinality aggregation: Use the cardinality agg unless you need to know WHICH nodes are in use. It is significantly more scalable, and until you get into cardinality of 1000s, you likely won't see any inaccuracy.
Date histogram interval: You can adjust the interval to whatever makes sense for you. If you run through the example above, you'll only see one histogram bucket; if you change hour to minute, you'll see the histogram build itself out with more data points.
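To make clear what the two aggregations compute, here is a small Python sketch that does the same bucketing client-side on the three sample events: group by hour, then count the distinct node numbers in each bucket. It is only an illustration of the logic with a made-up function name, not a replacement for running the aggregation in Elasticsearch.

```python
from collections import defaultdict
from datetime import datetime


def distinct_nodes_per_hour(events):
    """Group events into hourly buckets and count distinct nodes per bucket."""
    buckets = defaultdict(set)
    for event in events:
        ts = datetime.strptime(event["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
        hour = ts.replace(minute=0, second=0)
        # Adding array values to a set removes duplicates across events.
        buckets[hour].update(event["nodes"])
    return {hour.isoformat(): len(nodes) for hour, nodes in sorted(buckets.items())}


events = [
    {"timestamp": "2018-06-10T00:00:00Z", "nodes": [3]},
    {"timestamp": "2018-06-10T00:01:00Z", "nodes": [3, 4, 5]},
    {"timestamp": "2018-06-10T00:02:00Z", "nodes": [4, 5]},
]

counts = distinct_nodes_per_hour(events)
```

With the three sample rows, the single hourly bucket contains nodes 3, 4 and 5, which matches the expected answer of 3 in the question.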
I have some JSON in a field in my Postgres 9.4 db and I want to find rows where the given name is a certain value, where the field is named model and the JSON structure is as follows:
{
"resourceType": "Person",
"id": "8a7b72b1-49ec-43e5-bd21-bc62674d9875",
"name": [
{
"family": [
"NEWMAN"
],
"given": [
"JOHN"
]
}
]
}
So I tried this: SELECT * FROM current WHERE model->'name' @> '{"given":["JOHN"]}'; (as well as various other guesses), but that does not match the above data. How should I do this?
Use the function jsonb_array_elements():
select t.*
from current t,
jsonb_array_elements(model->'name') names
where names->'given' ? 'JOHN'
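The query works because name is an array of objects: jsonb_array_elements() unnests it into one row per entry, and the ? operator then checks whether 'JOHN' is one of the strings in that entry's given array. The same matching logic, sketched in Python on the sample document (the function name is made up for this example):

```python
import json


def has_given_name(model, given):
    """True if any entry of model['name'] lists the given name."""
    return any(given in entry.get("given", []) for entry in model.get("name", []))


model = json.loads("""
{
  "resourceType": "Person",
  "id": "8a7b72b1-49ec-43e5-bd21-bc62674d9875",
  "name": [
    {"family": ["NEWMAN"], "given": ["JOHN"]}
  ]
}
""")

match = has_given_name(model, "JOHN")
no_match = has_given_name(model, "JANE")
```

Note that the SQL version returns one row per matching name entry, so a person with two entries that both contain the name would appear twice.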
Does Solr maintain the sequence of fields (dynamic fields) in the result document, i.e. the sequence used when the document was indexed?
For Example:
Consider the following record being indexed
School_txt , Class_txt , Section_txt
So when I get this document as a result, will the sequence of fields be maintained, or can it be random, like Class_txt, School_txt, Section_txt?
If it can be random, how can I preserve the sequence of fields?
Yes, the sequence of the fields is maintained (at least with 4.9.0) for each document. This is also true for multiValued fields, where the values are returned in the same sequence as they were added (which is useful if you want to merge two fields into a separate value later). Here's an example where I rotated the field sequence while indexing:
{
"id": "1",
"School_txt": "School",
"Class_txt": "Class",
"Section_txt": "Section1",
"_version_": 1473987528354693000
},
{
"id": "2",
"Class_txt": "School2",
"Section_txt": "Class2",
"School_txt": "Section2",
"_version_": 1473987528356790300
},
{
"id": "3",
"Section_txt": "School3",
"School_txt": "Class3",
"Class_txt": "Section3",
"_version_": 1473987528356790300
}