For example, there is time-series data like root partition utilization. The data structure looks like this:
name: root_disk_utilizatoin
ip: 1.1.1.1
timestamp: 1234567890
value: 0.5
We have millions of servers reporting this data every few minutes. My expectation is to find the latest data point for each server.
The first idea is to store this time-series data in some storage like Elasticsearch or a TSDB (InfluxDB/OpenTSDB), then query the storage to get the result. But I worry about the performance. No matter what storage I choose, it must do the two steps below to achieve the result:
group data by ip
sort data by timestamp and return the latest one
I guess this will be a very expensive process (it will cost a lot of time), so this may not be a good idea.
Do you have similar requirements, and how do you solve them?
Will it be a problem for a time-series DB like InfluxDB?
You can use a combination of a terms aggregation with a max aggregation.
Below is a working example with index data, search query, and search result.
Index Data:
{
  "name": "root_disk_utilizatoin",
  "ip": "1.1.1.2",
  "timestamp": 1234567891,
  "value": 0.5
}
{
  "name": "root_disk_utilizatoin",
  "ip": "1.1.1.1",
  "timestamp": 1234567890,
  "value": 0.5
}
Search Query:
{
  "size": 0,
  "aggs": {
    "unique_id": {
      "terms": {
        "field": "ip.keyword",
        "order": {
          "latestOrder": "desc"
        },
        "size": 1
      },
      "aggs": {
        "latestOrder": {
          "max": {
            "field": "timestamp"
          }
        }
      }
    }
  }
}
Search Result:
"aggregations": {
  "unique_id": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 1,
    "buckets": [
      {
        "key": "1.1.1.2",
        "doc_count": 1,
        "latestOrder": {
          "value": 1.234567891E9
        }
      }
    ]
  }
}
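In case it helps, the per-ip logic this aggregation expresses, keep the document with the maximum timestamp for each ip, can be sketched client-side (plain Python over the two sample documents above; illustrative only, not how Elasticsearch executes it):

```python
# Group documents by ip and keep the one with the highest timestamp
# (the same result the terms + max aggregation is after).
docs = [
    {"name": "root_disk_utilizatoin", "ip": "1.1.1.2", "timestamp": 1234567891, "value": 0.5},
    {"name": "root_disk_utilizatoin", "ip": "1.1.1.1", "timestamp": 1234567890, "value": 0.5},
]

latest = {}
for doc in docs:
    seen = latest.get(doc["ip"])
    if seen is None or doc["timestamp"] > seen["timestamp"]:
        latest[doc["ip"]] = doc

for ip, doc in latest.items():
    print(ip, doc["timestamp"])
```

A single pass like this is linear in the number of documents, which is why the group-by-ip plus max-timestamp pattern is cheaper than a full sort per group.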
Let's say I have a User table with fields like name, address, age, etc. There are more than 1000 records in this table, so I use Elasticsearch to retrieve the data one page at a time, 20 records per page.
And let's say I just want to search for some text, "Alexia", and display whether any record contains it. The special thing is that I want to search this text across all the fields in the table.
Does the search text match the name field, or age, or address, or any other? If it does, those records should be returned. We are not going to pass any specific field to the Elasticsearch query. If more than 20 records match my text, the pagination should work.
Any idea how to write such a query, or another way to approach this with Elasticsearch?
Yes, you can do that with a query_string query:
{
  "size": 20,
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "Alexia"
          }
        },
        {
          "range": {
            "dateField": {
              "gte": **currentTime** -------> This could be the current time, or age, or any property you would like to range-query on
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "dateField": {
        "order": "desc"
      }
    }
  ]
}
To get only 20 records you can pass size as 20, and for pagination you can use a range query to get the next set of results:
{
  "size": 20,
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "Alexia"
          }
        },
        {
          "range": {
            "dateField": {
              "gt": 1589570610732 ------------> From the previous response
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "dateField": {
        "order": "desc"
      }
    }
  ]
}
You can do the same using a match query as well. If you specify _all in the match query, it will search in all fields:
{
  "size": 20,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "_all": "Alexia"
          }
        },
        {
          "range": {
            "dateField": {
              "gte": **currentTime**
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "dateField": {
        "order": "desc"
      }
    }
  ]
}
When you are using Elasticsearch to provide search functionality in search boxes, you should avoid using query_string, because it throws an error in case of invalid syntax, whereas other queries return an empty result. You can read about this in the query_string documentation.
_all is deprecated since ES 6.0, so if you are using ES 6.x onwards you can use copy_to to copy the values of all fields into a single field and then search on that single field. You can read more in the copy_to documentation.
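As a sketch of what that can look like (the index name, field names, and types here are illustrative assumptions, not the question's actual mapping):

```json
PUT /users
{
  "mappings": {
    "properties": {
      "name":       { "type": "text", "copy_to": "all_fields" },
      "address":    { "type": "text", "copy_to": "all_fields" },
      "age":        { "type": "text", "copy_to": "all_fields" },
      "all_fields": { "type": "text" }
    }
  }
}

GET /users/_search
{
  "query": {
    "match": { "all_fields": "Alexia" }
  }
}
```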
For pagination you can make use of the from and size parameters. size tells Elasticsearch how many documents you want to retrieve, and from tells it from which hit to start.
Query :
{
  "from": <current-count>,
  "size": 20,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "_all": "Alexia"
          }
        },
        {
          "range": {
            "dateField": {
              "gte": **currentTime**
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "dateField": {
        "order": "desc"
      }
    }
  ]
}
You can increase the from value on each iteration by the number of documents you have already fetched. For example, in the first iteration you can set from to 0; in the next iteration you set it to 20 (since the first iteration returned the first 20 hits and you now want the documents after them). You can refer to this.
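The offset arithmetic above can be sketched as a tiny helper (hypothetical name, plain Python):

```python
# Hypothetical helper: compute the "from" offset for a 0-based page number
# and a fixed page size.
def page_offset(page, size=20):
    return page * size

# Page 0 starts at hit 0, page 1 at hit 20, page 2 at hit 40.
print([page_offset(p) for p in range(3)])  # [0, 20, 40]
```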
I'm new to databases, but I feel what I am trying to do should be pretty commonplace...
What I am trying to achieve is to allow a user to apply both a price range filter and a price sort to the results that my site fetches for them. So I want to find all prices within the specified price range, and then sort them by price.
I have a Hasura DB running on Heroku. The DB has two tables, seeds and prices. One row in the seeds table, i.e. one seed, can be related to multiple rows in the prices table, i.e. many prices. They are joined by a foreign key constraint and a one-to-many (Object to Array) relationship.
I am attempting to query seeds using the following GraphQL query:
{
  seeds(
    where: {prices: {price: {_gte: "10", _lte: "200"}}},
    order_by: {prices_aggregate: {min: {price: asc_nulls_last}}}
  ) {
    product_name
    prices {
      price
    }
  }
}
What I want this query to do is filter out all prices.price values that are not within the valid range (10, 200), and then sort by prices.price ascending. However, what happens is that the results are sorted by prices.price ascending INCLUDING values that are not within the range, and only then are the results filtered.
I will give an example to clarify. Consider the following results from the above query:
{
  "data": {
    "seeds": [
      {
        "product_name": "Northern Lights Auto Feminised Seeds",
        "prices": [
          { "price": 3.48 },
          { "price": 6.79 },
          { "price": 9.58 },
          { "price": 104.5 }
        ]
      },
      {
        "product_name": "The White OG Feminised Seeds",
        "prices": [
          { "price": 3.48 },
          { "price": 6.79 },
          { "price": 15.68 }
        ]
      },
      {
        "product_name": "Special Kush #1 Feminised Seeds from Royal Queen Seeds",
        "prices": [
          { "price": 3.49 },
          { "price": 13.53 },
          { "price": 8.29 }
        ]
      }
    ]
  }
}
The above results are correct, because there are valid values of prices.price within the specified range (104.5, 15.68, and 13.53 respectively); however, the results are not in the right order. They are instead ordered by the lowest prices.price overall, regardless of the specified filter range (10, 200).
The correct order for the results would be:
{
  "data": {
    "seeds": [
      {
        "product_name": "Special Kush #1 Feminised Seeds from Royal Queen Seeds",
        "prices": [
          { "price": 3.49 },
          { "price": 13.53 },
          { "price": 8.29 }
        ]
      },
      {
        "product_name": "The White OG Feminised Seeds",
        "prices": [
          { "price": 3.48 },
          { "price": 6.79 },
          { "price": 15.68 }
        ]
      },
      {
        "product_name": "Northern Lights Auto Feminised Seeds",
        "prices": [
          { "price": 3.48 },
          { "price": 6.79 },
          { "price": 9.58 },
          { "price": 104.5 }
        ]
      }
    ]
  }
}
Can anyone help me, and explain how I can achieve these correct results? It is worth mentioning that it is not possible to sort the results after the query, as there are thousands of them and the sort definitely affects which results are returned from the DB.
Thanks, in advance!
The where clause you are applying to seeds will fetch all seeds that have at least one price within the range you specified. It will NOT filter the price data that you are fetching in the nested query.
To filter the prices, you need to apply a where clause inside the prices array relationship:
{
  seeds(
    where: {prices: {price: {_gte: "10", _lte: "200"}}},  -> this filters seeds
    order_by: {prices_aggregate: {min: {price: asc_nulls_last}}}
  ) {
    product_name
    prices(where: {price: {_gte: "10", _lte: "200"}}) {   -> this filters prices
      price
    }
  }
}
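To make the intended semantics concrete, here is the same filter-then-order logic client-side over the sample data from the question (plain Python, illustrative only; this is not how Hasura evaluates the query):

```python
# Keep only in-range prices, then order seeds by their lowest in-range price.
seeds = [
    {"product_name": "Northern Lights Auto Feminised Seeds", "prices": [3.48, 6.79, 9.58, 104.5]},
    {"product_name": "The White OG Feminised Seeds", "prices": [3.48, 6.79, 15.68]},
    {"product_name": "Special Kush #1 Feminised Seeds from Royal Queen Seeds", "prices": [3.49, 13.53, 8.29]},
]

lo, hi = 10, 200
for seed in seeds:
    seed["prices"] = [p for p in seed["prices"] if lo <= p <= hi]
seeds = [s for s in seeds if s["prices"]]       # drop seeds with no in-range price
seeds.sort(key=lambda s: min(s["prices"]))      # order by lowest remaining price

# Yields Special Kush (13.53), then The White OG (15.68), then
# Northern Lights (104.5): the "correct order" from the question.
print([s["product_name"] for s in seeds])
```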
I have quite a big number of records currently stored in MongoDB, each looking something like this:
{
  "_id" : ObjectId("5c38d267b87d0a05d8cd4dc2"),
  "tech" : "NodeJs",
  "packagename" : "package-name",
  "packageversion" : "0.0.1",
  "total_loc" : 474,
  "total_files" : 7,
  "tecloc" : {
    "JavaScript" : 316,
    "Markdown" : 116,
    "JSON" : 42
  }
}
What I want to do is find similar records, e.g. records that have about the same (±10%) total_loc, or that use some of the same technologies (tecloc).
Can I somehow do this with a query against MongoDB, or is there a technology that fits better for what I want to do? I am fine with regenerating the data and storing it e.g. in Elasticsearch or some graph DB.
Thank you
One possibility for solving this problem is to use Elasticsearch. I'm not claiming it's the only solution you have.
At a high level, you would need to set up Elasticsearch and index your data. There are various ways to achieve this: mongo-connector, Logstash with the JDBC input plugin, or even just dumping data from MongoDB and indexing it manually.
The change I would propose initially is to make tecloc a multi-valued field, by replacing { with [ and adding explicit fields for the name and lines of code, e.g.:
{
  "tech": "NodeJs",
  "packagename": "package-name",
  "packageversion": "0.0.1",
  "total_loc": 474,
  "total_files": 7,
  "tecloc": [
    {
      "name": "JavaScript",
      "loc": 316
    },
    {
      "name": "Markdown",
      "loc": 116
    },
    {
      "name": "JSON",
      "loc": 42
    }
  ]
}
This data model is very trivial and obviously has some limitations, but it's already something for you to start with and see how well it fits your other use cases. Later you could look into the nested type as a way to model your data more accurately.
Regarding your exact search scenario, you could search those kinds of documents with a query like this:
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "tecloc.name.keyword": {
              "value": "Java"
            }
          }
        },
        {
          "term": {
            "tecloc.name.keyword": {
              "value": "Markdown"
            }
          }
        }
      ],
      "must": [
        {
          "range": {
            "total_loc": {
              "gte": 426,
              "lte": 521
            }
          }
        }
      ]
    }
  }
}
Unfortunately, there is no built-in support for a ±10% syntax, so the bounds need to be calculated on the client side.
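For example, the ±10% window used in the query above (426..521 for total_loc = 474) can be computed like this (plain Python; truncating to integers is an assumption that happens to reproduce those bounds):

```python
# Client-side calculation of the +/-10% range bounds for total_loc.
def loc_range(total_loc, pct=0.10):
    return int(total_loc * (1 - pct)), int(total_loc * (1 + pct))

gte, lte = loc_range(474)
print(gte, lte)  # 426 521
```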
On the other hand, I specified that we are searching for documents which should have Java or Markdown, which returns the example document as well. If a document had both Java and Markdown, its score would be higher.
In my data, I have two fields that I want to use as an index together. They are sensorid (any string) and timestamp (yyyy-mm-dd hh:mm:ss).
So I made an index for these two using the Cloudant index generator. This was created successfully and it appears as a design document.
{
  "index": {
    "fields": [
      {
        "name": "sensorid",
        "type": "string"
      },
      {
        "name": "timestamp",
        "type": "string"
      }
    ]
  },
  "type": "text"
}
However, when I try to make the following query to find all documents with a timestamp newer than some value, I am told there is no index available for the selector:
{
  "selector": {
    "timestamp": {
      "$gt": "2015-10-13 16:00:00"
    }
  },
  "fields": ["_id", "_rev"],
  "sort": [
    {
      "_id": "asc"
    }
  ]
}
What have I done wrong?
It seems to me that Cloudant Query only allows sorting on fields that are part of the selector.
Therefore your selector should include the _id field and look like:
"selector": {
  "_id": {
    "$gt": 0
  },
  "timestamp": {
    "$gt": "2015-10-13 16:00:00"
  }
}
I hope this works for you!
I'm using Elasticsearch for this project but a Solr solution might be appropriate too. In the query I'd like to include a portion of a should clause that will return results even if none of the other terms can. This will be used for document popularity. I'll periodically calculate reading popularity and add a float field to each doc with a numeric value.
The idea is to return docs based on terms but when that fails, return popular docs ranked by popularity. These should be ordered by term match scores or magnitude of popularity score.
I realize that I could quantize the popularity and treat it like a tag ("hottest", "hotter", "hot", ...), but I would like to use a numeric field since the ranking is well defined.
Here is the current form of my data (from fetch by id):
GET /index/docs/ipad
returns a sample object
{
  "_index": "index",
  "_type": "docs",
  "_id": "doc1",
  "_version": 1,
  "found": true,
  "_source": {
    "category": ["tablets", "electronics"],
    "text": ["buy", "an", "ipad"],
    "popularity": 0.95347457,
    "id": "doc1"
  }
}
Current query format
POST /index/docs/_search
{
  "size": 10,
  "query": {
    "bool": {
      "should": [
        {"terms": {"text": ["ipad"]}}
      ],
      "must": [
        {"terms": {"category": ["electronics"]}}
      ]
    }
  }
}
This may seem an odd query format but these are structured objects, not free form text.
Can I add popularity to this query so that it returns items ranked by popularity magnitude along with those returned by the should terms? I'd boost the actual terms above the popularity so they'd be favored.
Note that I do not want to boost by popularity; I want to return popular docs if the rest of the query returns nothing.
One approach I can think of is wrapping a match_all query in a constant_score with boost 0, and sorting on score followed by popularity. Example:
{
  "size": 10,
  "query": {
    "bool": {
      "should": [
        {
          "terms": {
            "text": ["ipad"]
          }
        },
        {
          "constant_score": {
            "filter": {
              "match_all": {}
            },
            "boost": 0
          }
        }
      ],
      "must": [
        {
          "terms": {
            "category": ["electronics"]
          }
        }
      ],
      "minimum_should_match": 1
    }
  },
  "sort": [
    {
      "_score": {
        "order": "desc"
      }
    },
    {
      "popularity": {
        "order": "desc",
        "unmapped_type": "double"
      }
    }
  ]
}
You want to look into the function_score query and a decay function for this.
Here's a gentle intro: https://www.found.no/foundation/function-scoring/
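As a sketch of that approach against the sample mapping in the question (parameter choices here are illustrative assumptions, not a tested solution): field_value_factor reads the stored popularity value and boost_mode: sum adds it to the relevance score, so term matches still dominate while popular documents surface even when no should terms match.

```json
POST /index/docs/_search
{
  "size": 10,
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "should": [
            {"terms": {"text": ["ipad"]}}
          ],
          "must": [
            {"terms": {"category": ["electronics"]}}
          ]
        }
      },
      "field_value_factor": {
        "field": "popularity",
        "missing": 0
      },
      "boost_mode": "sum"
    }
  }
}
```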