Get the last N data points for each node from the database (Cloudant/CouchDB)

TL;DR: MapReduce or POST request?
What is the correct (i.e., most efficient) way to fetch the latest n data points for multiple sensors from Cloudant or an equivalent database?
Sensor data is stored in individual documents like this:
{
"_id": "2d26dbd8e655ae02bdab611afc92b6cf",
"_rev": "1-a64448521f05935b915e4bee12328e84",
"date": "2017-06-20T15:59:50.509Z",
"name": "Sensor01",
"temperature": 24.5,
"humidity": 45.3,
"rssi": -33
}
I want to fetch the latest 10 documents for sensor01-sensor99 so I can feed them to the UI.
I have discovered a few options:
1. Use a MapReduce function
Reduce each sensor's data to an array under sensor01, sensor02, etc.
E.g.
Map:
function (doc) {
if (doc.name && doc.temperature) emit(doc.name, doc.temperature);
}
Reduce:
function (keys, values, rereduce) {
var temp_arr=[];
for (i=0;i<values.length;i++)
{
temp_arr.push(values);
}
return temp_arr;
}
I couldn't get this to work, but I think the method should be viable.
2. Multi-document fetching
{
"queries":[
{sensor01},{sensor02},{sensor03} etc....
]};
Where each {sensor0x} is filtered using
{"startkey": [sensors[i],{}],"endkey": [sensors[i]],"limit": 5}
This way I can order documents using ?descending=true
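Spelled out in full, the body of that POST (sent to the view's /queries endpoint, or to the view itself with a "queries" key on older CouchDB versions) would look roughly like this sketch:
{
  "queries": [
    {"startkey": ["Sensor01", {}], "endkey": ["Sensor01"], "descending": true, "limit": 5},
    {"startkey": ["Sensor02", {}], "endkey": ["Sensor02"], "descending": true, "limit": 5},
    {"startkey": ["Sensor03", {}], "endkey": ["Sensor03"], "descending": true, "limit": 5}
  ]
}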
I implemented it and it works, but I have my doubts about whether I should use this if I have 1000 sensors with 10,000 data points each.
Also, for hundreds of sensors I need to send a very large POST request.
Something better?
Is my architecture even correct?
Storing sensor data in individual documents, and then filling the UI by fetching all the data through the REST API.
Thank you very much!

There's nothing wrong with your method of storing one reading per document, but there's no truly efficient way of getting "the last n data points" for a number of sensors.
We could create a MapReduce function:
function (doc) {
if (doc.name && doc.temperature && doc.date) {
emit([doc.name, doc.date], doc.temperature);
}
}
This creates an index ordered on name and date.
We can access the most recent readings for a single sensor by querying the view:
/_design/mydesigndoc/_view/myview?startkey=["Sensor01","2020-01-01"]&descending=true&limit=10
This fetches readings for "Sensor01" in newest-first order:
startkey & endkey are reversed when using descending=true
descending=true means the results come back in reverse order
limit - the number of readings required (or n in your parlance)
This is a very efficient use of Cloudant/CouchDB, but it only returns the last n readings for a single sensor. To retrieve other sensors' data, additional API calls would be required.
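One refinement (a sketch, mirroring the startkey/endkey pair from option 2 in the question): adding an endkey stops the query at the sensor boundary instead of relying on limit alone:
/_design/mydesigndoc/_view/myview?startkey=["Sensor01",{}]&endkey=["Sensor01"]&descending=true&limit=10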
Creating an index like this:
function (doc) {
if (doc.name && doc.temperature && doc.date) {
emit(doc.date, doc.temperature);
}
}
orders each reading by date. You can then retrieve the newest n readings with:
/_design/mydesigndoc/_view/myview?startkey="2020-01-01"&descending=true&limit=200
If all of your sensors are saving data at the same rate, then simply using a larger limit should get you the latest readings of all sensors.
This too is an efficient use of CouchDB/Cloudant.
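One thing to watch with this approach: the map function above only emits the temperature, so the returned rows don't say which sensor they came from. You'd either emit the sensor name as part of the value, or query with include_docs=true and group on the client. A rough sketch of that client-side grouping (assuming include_docs=true and a generous limit):
// rows: the "rows" array of the view response, already newest-first (descending=true)
// Returns a map of sensor name -> up to n latest documents
function latestPerSensor(rows, n) {
  var bySensor = {};
  rows.forEach(function (row) {
    var name = row.doc.name;
    if (!bySensor[name]) bySensor[name] = [];
    if (bySensor[name].length < n) bySensor[name].push(row.doc);
  });
  return bySensor;
}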
You may also want to look at the built-in reducers (_count, _sum and _stats) to get the database to aggregate readings for you. They are a great way to create year/month/day groupings of IoT data.
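To illustrate, here is a sketch of a map function (assuming the document shape from the question) that, paired with the built-in _stats reducer, returns min/max/sum/count of temperature per sensor per day, month or year, depending on the group_level you query with:
function (doc) {
  if (doc.name && doc.temperature && doc.date) {
    // Split "2017-06-20T15:59:50.509Z" into [2017, 6, 20]
    var d = doc.date.split("T")[0].split("-").map(Number);
    emit([doc.name, d[0], d[1], d[2]], doc.temperature);
  }
}
Querying with group_level=4 gives per-sensor daily statistics, group_level=3 monthly, group_level=2 yearly, and group_level=1 one row per sensor.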
In general, I would recommend not using custom reducers: they are many times less efficient than the built-in reducers, which are written in Erlang.

Related

Azure Data Factory - converting lookup result array

I'm pretty new to Azure Data Factory (ADF) and have stumbled into something I would have solved with a couple of lines of code.
Background
Main flow:
A Lookup Activity fetching an array of IDs to process
A ForEach Activity looping over the input array and using a Copy Activity to pull data from a REST API and store it in a database
Step #1 results in an array containing IDs:
{
"count": 10000,
"value": [
{
"id": "799128160"
},
{
"id": "817379102"
},
{
"id": "859061172"
},
... many more...
Step #2: When the lookup returns a lot of IDs, the individual REST calls take a lot of time. The REST API supports batching IDs using a comma-separated input.
The question
How can I convert the input array into a new array with comma-separated fields? This will reduce the number of Activities and the time to run.
Expecting something like this:
{
"count": 1000,
"value": [
{
"ids": "799128160,817379102,859061172,...."
},
{
"ids": "n,n,n,n,n,n,n,n,n,n,n,n,...."
}
... many more...
EDIT 1 - 19th Dec 22
Using an Until Activity and keeping track of positions, I managed to do it in plain ADF. It would have been nice if this could have been done with some simple array manipulation in a code snippet.
The ideal way to do this manipulation is with a Data Flow.
My sample input was similar to the above.
First, in a Data Flow I added a Surrogate Key transformation after the source; say the new key field is 'SrcKey'.
(Data preview of the Surrogate key transformation)
Add an Aggregate transformation where you group by mod(SrcKey, 3). This puts rows with the same remainder into the same bucket.
In the same aggregate, add a column that collects the ids into a comma-separated string, using the expression trim(toString(collect(id)), '[]').
(Data preview of the Aggregate transformation)
Store the output in a single file in blob storage.
(Output preview)

ImportJSON for Google Sheets Can't Handle File Without Properties?

I'm trying to pull historical pricing data from CoinGecko's free API to use in a Google Sheet. It presents OHLC numbers in the following format:
[
[1589155200000,0.05129,0.05129,0.047632,0.047632],
[1589500800000,0.047784,0.052329,0.047784,0.052329],
[1589846400000,0.049656,0.053302,0.049656,0.053302],
...
]
As you can see, this isn't typical JSON format since there are no property names. So that everyone is on the same page, for this data the properties of each subarray in order are Time (in UNIX epoch format), Open Price, High Price, Low Price, and Close Price.
I'm using the ImportJSON code found here to try and pull this data, but it does not work. Instead of putting each subarray into a separate row, split into columns for the 5 properties, it prints everything out into a single cell like so:
1589155200000,0.05129,0.05129,0.047632,0.047632,1589500800000,0.047784,0.052329,0.047784,0.052329,1589846400000,0.049656,0.053302,0.049656,0.053302,...
This is incredibly unhelpful. I'm trying to avoid using a paid API add-on since I really don't want to have to pay the frankly exorbitant fees they want to charge, but I can't figure out a way to get ImportJSON to play nicely with this data. Does anyone know of a solution?
It's simpler: your data is already in an array structure. I put
[
[1589155200000,0.05129,0.05129,0.047632,0.047632],
[1589500800000,0.047784,0.052329,0.047784,0.052329],
[1589846400000,0.049656,0.053302,0.049656,0.053302]
]
in A1, and I get the individual values in this simpler way:
function myArray(){
  var f = SpreadsheetApp.getActiveSheet();
  // Parse the array literal stored in cell A1 into a real JavaScript array
  var result = eval(f.getRange('A1').getValue());
  // Write it out starting at A2, one sub-array per row, one value per column
  f.getRange(2, 1, result.length, result[0].length).setValues(result);
}
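If you'd rather not paste the JSON into A1 at all, a small custom function can fetch and spread the data directly. This is only a sketch; the CoinGecko /coins/{id}/ohlc endpoint and its parameters are quoted from memory, so double-check them against the API docs:
/**
 * Fetch OHLC data from CoinGecko and return it as a 2D array,
 * which Sheets expands into rows and columns automatically.
 * Example use in a cell: =IMPORTOHLC("bitcoin", "usd", 30)
 */
function IMPORTOHLC(coinId, vsCurrency, days) {
  var url = 'https://api.coingecko.com/api/v3/coins/' + coinId +
            '/ohlc?vs_currency=' + vsCurrency + '&days=' + days;
  var rows = JSON.parse(UrlFetchApp.fetch(url).getContentText());
  // Each row is [time, open, high, low, close]; prepend a header row
  return [['Time', 'Open', 'High', 'Low', 'Close']].concat(rows);
}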

Iterate over big resultset in batches (like foreach, but grouped)

I am using ScalikeJDBC to fetch a large table, convert the data to JSON and then calling a web service with 50 JSON objects (rows) each. This is my code:
val rows = sql"SELECT * FROM bigtable"
val jsons = rows.map { row =>
// build JSON object for each row
}.toList().apply()
jsons.grouped(50).foreach { batch =>
// send 50 objects at once to an HTTP server
}
This works, but unfortunately the intermediate list is huge and consumes a lot of memory. I am looking for a way to iterate over the result set in a "lazy" fashion, similar to foreach, except that I want to iterate over batches of 50 rows. Is that possible with ScalikeJDBC?
I solved the memory issues by filling and clearing a mutable list instead of using grouped, but I am still looking for a better solution.
Try specifying fetchSize.
See also: http://scalikejdbc.org/documentation/operations.html#setting-jdbc-fetchsize

Creating logger in CouchDB?

I would like to create a logger using CouchDB. Basically, every time someone accesses the file, I would like to write the username and the time of access to the database. If this were MySQL, I would just add a row for every access by the user. I am not sure what to do in CouchDB. Would I need to store each access in an array? Then what do I do during an update? Is there a way to append to the document? Would each user have his own document?
I couldn't find any documentation on how to append to an existing document or array without retrieving and updating the entire document. So for every event you log, you'll have to retrieve the entire document, update it and save it back to the database. That means you'll want to keep the documents small, for two reasons:
Log files/documents tend to grow big. You don't want to send large documents across the wire for each new log entry you add.
Log files/documents tend to get updated a lot. If all log entries are stored in a single document and you're trying to write a lot of concurrent log entries, you're likely to run into mismatching document revisions on updates.
Your suggestion of user-based documents sounds like a good solution, as it will keep the documents small. Also, a single user is unlikely to generate concurrent log entries, minimizing any race conditions.
Another option would be to store a new document for each log entry. Then you'll never have to update an existing document, eliminating any race conditions and the need to send large documents between your application and the database.
Niels' answer is going down the right path with transactions. As he said, you will want to create a different document for each access - think of them as actions. Here's what one of those documents might look like
{
"_id": "32 char hash",
"_rev": "32 char hash",
"when": Unix time stamp,
"by": "some unique identifier
}
If you were tracking multiple files, then you'd want to add a "file" field and include a unique identifier.
Now the power of Map/Reduce begins to really shine, as it's extremely good at aggregating multiple pieces of data. Here's how to get the total number of views:
Map:
function(doc)
{
emit(doc.when, 1);
}
Reduce:
function(keys, values, rereduce)
{
return sum(values);
}
The reason I put the timestamp (doc.when) into the key is that it allows us to get the total number of views for a range of time. E.g., /dbName/_design/designDocName/_view/viewName?startkey=1000&endkey=2000 gives us the total number of views between those two timestamps (add group=true if you want a separate count per timestamp instead).
Cheers.
Although Sam's answer is an ok pattern to follow I wanted to point out that there is, indeed, a nice way to append to a Couch document. It just isn't very well documented yet.
By defining an update function in your design document and using that to append to an array inside a couch document you may be able to save considerable disk space. Plus, you end up with a 1:1 correlation between the file you're logging accesses on and the couch doc that represents that file. This is how I imagine a doc might look:
{
"_id": "some/file/path/name.txt",
"_rev": "32 char hash",
"accesses": [
{"at": 1282839291, "by": "ben"},
{"at": 1282839305, "by": "kate"},
{"at": 1282839367, "by": "ozone"}
]
}
One caveat: You will need to encode the "/" as %2F when you request it from CouchDB or you'll get an error. Using slashes in document ids is totally ok.
And here is a pair of map/reduce functions:
function(doc)
{
  if (doc.accesses) {
    for (var i = 0; i < doc.accesses.length; i++) {
      var event = doc.accesses[i];
      emit([doc._id, event.by, event.at], 1);
    }
  }
}
function(keys, values, rereduce)
{
return sum(values);
}
And now we can see another benefit of storing all accesses for a given file in one JSON document: to get a list of all accesses on a document just make a get request for the corresponding document. In this case:
GET http://127.0.0.1:5984/dbname/some%2Ffile%2Fpath%2Fname.txt
If you want to count the number of times each file was accessed by each user, query the view like so:
GET http://127.0.0.1:5984/test/_design/touch/_view/log?group_level=2
Use group_level=1 if you just want to count total accesses per file.
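With the example document above, the group_level=2 response would look roughly like this (one per-file, per-user count per row):
{
  "rows": [
    {"key": ["some/file/path/name.txt", "ben"],   "value": 1},
    {"key": ["some/file/path/name.txt", "kate"],  "value": 1},
    {"key": ["some/file/path/name.txt", "ozone"], "value": 1}
  ]
}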
Finally, here is the update function you can use to append onto that doc.accesses array:
function(doc, req) {
  // The username arrives as a query parameter, e.g. ?by=username
  var whom = req.query.by;
  // Unix timestamp in seconds
  var when = Math.round(new Date().getTime() / 1000);
  if (!doc.accesses) doc.accesses = [];
  var event = {"at": when, "by": whom};
  doc.accesses.push(event);
  var message = 'Logged ' + event.by + ' accessing ' + doc._id + ' at ' + event.at;
  // Return the updated document (which CouchDB saves) and the response body
  return [doc, message];
}
Now, whenever you need to log an access to a file, issue a PUT (or POST) request like the following (the exact path depends on how you name your design document and update function):
http://127.0.0.1:5984/my_database/_design/my_designdoc/_update/update_function_name/some%2Ffile%2Fpath%2Fname.txt?by=username
A comment on the last two answers: they refer to Couchbase, not Apache CouchDB.
It is, however, possible to define update handlers in CouchDB, but I have not used them.
http://wiki.apache.org/couchdb/Document_Update_Handlers

key-value store for time series data?

I've been using SQL Server to store historical time series data for a couple hundred thousand objects, observed about 100 times per day. I'm finding that queries (give me all values for object XYZ between time t1 and time t2) are too slow (for my needs, slow is more than a second). I'm indexing by timestamp and object ID.
I've entertained the thought of using a key-value store like MongoDB instead, but I'm not sure if this is an "appropriate" use of this sort of thing, and I couldn't find any mention of using such a database for time series data. Ideally, I'd be able to do the following queries:
retrieve all the data for object XYZ between time t1 and time t2
do the above, but return one data point per day (first, last, closest to time t...)
retrieve all data for all objects for a particular timestamp
the data should be ordered, and ideally it should be fast to write new data as well as update existing data.
It seems like my desire to query by object ID as well as by timestamp might necessitate having two copies of the database indexed in different ways to get optimal performance... Anyone have any experience building a system like this, with a key-value store, or HDF5, or something else? Or is this totally doable in SQL Server and I'm just not doing it right?
It sounds like MongoDB would be a very good fit. Updates and inserts are super fast, so you might want to create a document for every event, such as:
{
object: XYZ,
ts : new Date()
}
Then you can index the ts field and queries will also be fast. (By the way, you can create multiple indexes on a single collection.)
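For example, a quick sketch in the mongo shell (assuming the collection is called data, as in the queries below):
// Compound index supporting "object XYZ between t1 and t2" queries
db.data.ensureIndex({object: 1, ts: 1})
// Single-field index for "all objects at a particular timestamp" lookups
db.data.ensureIndex({ts: 1})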
How to do your three queries:
retrieve all the data for object XYZ between time t1 and time t2
db.data.find({object : XYZ, ts : {$gt : t1, $lt : t2}})
do the above, but return one data point per day (first, last, closest to time t...)
// first
db.data.find({object : XYZ, ts : {$gt : new Date(/* start of day */)}}).sort({ts : 1}).limit(1)
// last
db.data.find({object : XYZ, ts : {$lt : new Date(/* end of day */)}}).sort({ts : -1}).limit(1)
For closest to some time, you'd probably need a custom JavaScript function, but it's doable.
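A rough sketch of such a helper in the mongo shell (names are illustrative): fetch the nearest event on each side of t and pick whichever is closer.
function closestTo(obj, t) {
  var before = db.data.find({object: obj, ts: {$lte: t}}).sort({ts: -1}).limit(1).toArray()[0];
  var after  = db.data.find({object: obj, ts: {$gte: t}}).sort({ts: 1}).limit(1).toArray()[0];
  if (!before) return after;
  if (!after) return before;
  // ts values are Dates; subtraction gives the difference in milliseconds
  return (t - before.ts < after.ts - t) ? before : after;
}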
retrieve all data for all objects for a particular timestamp
db.data.find({ts : timestamp})
Feel free to ask on the user list if you have any questions, someone else might be able to think of an easier way of getting closest-to-a-time events.
This is why databases specific to time series data exist - relational databases simply aren't fast enough for large time series.
I've used Fame quite a lot at investment banks. It's very fast but I imagine very expensive. However, if your application requires the speed, it might be worth looking at.
There is an open source time series database under active development (.NET only for now) that I wrote. It can store massive amounts (terabytes) of uniform data in a "binary flat file" fashion. All usage is stream-oriented (forward or reverse). We actively use it for stock tick storage and analysis at our company.
I am not sure this will be exactly what you need, but it will allow you to get the first two points - get values from t1 to t2 for any series (one series per file) or just take one data point.
https://code.google.com/p/timeseriesdb/
// Create a new file for MyStruct data.
// Use BinCompressedFile<,> for compressed storage of deltas
using (var file = new BinSeriesFile<UtcDateTime, MyStruct>("data.bts"))
{
    file.UniqueIndexes = true;   // enforces index uniqueness
    file.InitializeNewFile();    // create file and write header
    file.AppendData(data);       // append data (stream of ArraySegment<>)
}

// Read the needed data.
using (var file = (IEnumerableFeed<UtcDateTime, MyStruct>) BinaryFile.Open("data.bts", false))
{
    // Enumerate one item at a time, maximum 10 items, starting at 2011-1-1
    // (can also get one segment at a time with StreamSegments)
    foreach (var val in file.Stream(new UtcDateTime(2011, 1, 1), maxItemCount: 10))
        Console.WriteLine(val);
}
I recently tried something similar in F#. I started with the 1-minute bar format for the symbol in question in a space-delimited file which has roughly 80,000 1-minute bar readings. The code to load and parse from disk was under 1 ms. The code to calculate a 100-minute SMA for every period in the file was 530 ms. I can pull any slice I want from the SMA sequence, once calculated, in under 1 ms. I am just learning F#, so there are probably ways to optimize. Note this was after multiple test runs, so the data was already in the Windows cache, but even when loaded from disk it never adds more than 15 ms to the load.
date,time,open,high,low,close,volume
01/03/2011,08:00:00,94.38,94.38,93.66,93.66,3800
To reduce the recalculation time I save the entire calculated indicator sequence to disk in a single file with a \n delimiter, and it generally takes less than 0.5 ms to load and parse when it's in the Windows file cache. Simple iteration across the full time series to return the set of records inside a date range is a sub-3 ms operation with a full year of 1-minute bars. I also keep the daily bars in a separate file, which loads even faster because of the lower data volumes.
I use the .NET 4 System.Runtime.Caching layer to cache the serialized representation of the pre-calculated series, and with a couple of gigs of RAM dedicated to cache I get nearly a 100% cache hit rate, so access to any pre-computed indicator set for any symbol generally runs in under 1 ms.
Pulling any slice of data I want from the indicator is typically less than 1 ms, so advanced queries simply do not make sense. Using this strategy I could easily load 10 years of 1-minute bars in less than 20 ms.
// Parse a \n delimited file into RAM, then
// split each line on spaces into an
// array of tokens. Return the entire array
// as string[][]
let readSpaceDelimFile fname =
    System.IO.File.ReadAllLines(fname)
    |> Array.map (fun line -> line.Split [|' '|])

// Based on a two dimensional array
// pull out a single column for bar
// close and convert every value
// for every row to a float
// and return the array of floats.
let GetArrClose(tarr : string[][]) =
    [| for aLine in tarr do
           //printfn "aLine=%A" aLine
           let closep = float(aLine.[5])
           yield closep |]
I use HDF5 as my time series repository. It has a number of effective and fast compression styles which can be mixed and matched. It can be used with a number of different programming languages.
I use boost::date_time for the timestamp field.
In the financial realm, I then create specific data structures for each of bars, ticks, trades, quotes, ...
I created a number of custom iterators and used standard template library features to be able to efficiently search for specific values or ranges of time-based records.
