Cassandra data model for time series - database

I am working on a Cassandra data model for storing time series (I'm a Cassandra newbie).
I have two applications: intraday stock data and sensor data.
The stock data will be saved with a time resolution of one minute.
Seven datafields build one timeframe:
Symbol, Datetime, Open, High, Low, Close, Volume
I will query the data mostly by Symbol and Date. e.g. give me all data for AAPL between 2013-01-01 and 2013-01-31 ordered by Datetime.
The recommendation for cassandra queries is to query whole columns. So you could create five rows with the keys Open, High, Low, Close, Volume. And for each Symbol and Minute an own column. E.g. "AAPL:2013-01-04T130400Z".
This would result in a table of five rows and n*NT columns where n = number of symbols, nT = number of minutes.
Most of the time I will query date ranges. I.e. all minutes of a day. So I could rearrange the data to have columns named "AAPL:2013-01-04" and rows: OpenT130400Z, HighT130400Z, LowT130400Z, CloseT130400Z, VolumeT130400Z.
This would result in a table with n*nD columns (n: number of Symbols, nD: number of Days) and 5*nM rows (nM: number of minutes/entries per day).
To sum up: I have columns, which hold the information for a whole day for one symbol.
I have found a description how to deal with time series data in cassandra here http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
But I don't really get, if they use the hour (1332960000) as a column name or as a row key!?
I understood they use the hour as row key and have the small timesteps as columns. So they would have a fixed column number. But that would have disadvantages in reading because I would have to do a range query on keys! Am I right?
Second question:
If I have sensor data, which is much more fine grained than 1 minute stock data (let's say I have to save timesteps with a resolution of microseconds) how would I deal with this?
If I use columns for saving a composite of sensor channel and hours, and rows for microseconds since the last hour this would result in 3,600,000,000 rows and n*nH columns (n: number of sensors, nH: number of Hours).
I could not use the microseconds since last hour for columns because I have 3,6 billion points which is higher than the allowed number of 2 billion columns.
Did I get it?
What do you think about this problem? How to solve it?
Thank you!
Best,
Malte

So I have a suggestion for your first question about the stock data. A naive implementation might look like this:
RowKey:
Column Format:
Name: The current datetime granular to a minute
Value: a composite column of Open,High,Low,Close,Volume
So you would have something like
AAPL = [2013-05-02-15:38:00 | 441.78:448.59:440.63:15066146:445.52] ... [2013-05-02-15:39:00 | 441.78:448.59:440.63:15066146:445.52] ... [2013-05-02-15:40:00 | 441.78:448.59:440.63:15066146:445.52]
That would give you roughly half a million columns in one year so it might be ok for maybe 4 years. I wouldn't go and attempt to hit the 2 billion limit. What you could do is define a splitting factor on the row key. It all depends on your usage pattern, but a simple one might be on the year so the column family entry might look like this with a composite row key and that would guarantee that you always have less than a million columns per row.
AAPL:2013 = [05-02-15:38:00 | 441.78:448.59:440.63:15066146:445.52] ... [05-02-15:39:00 | 441.78:448.59:440.63:15066146:445.52] ... [05-02-15:40:00 | 441.78:448.59:440.63:15066146:445.52]

Related

SQL Server full text search max performance limits within time window

Let's say we have a table with 100 million records. Which are transactions of sales form multiple sellers. Each record has around 14 columns
TABLE SellerTransactions
string SellerId,
string ProductId,
DateTime CreateDate,
string BankNumber,
string Name(name+' '+surname+' 'alias),
string Comments,
decimal Amount
etc...
Each year we will add around 60 million new records.. and the record count will increase by 10% yearly +-
Search will be done by seller Id, then by product Id or products Id's(for multiple products in a time period for that seller).
Now every search will be filtered by time usually 1 week period/ mostly the last week but its also possible to have worst case scenarios when we will search within all the data we have: years.
And also a search should be possible by bank number, or full text search by name or part of it.
SELECT *
FROM SellerTransactions
WHERE SelledId = 'Seller1Guid'
AND ProductId IN 'ProductGuid1,ProductGuid2..'
AND (CreateDate <= CurrentDate AND CreateDate >= (CurrendDate - 7 days))
AND CONTAINS(Name, "BoB Skynet")
ORDER BY CreateDateTime
TAKE 20 SKIP 20
So search time consumption in regards to these scenarios:
After filtering by id's & time range search in up to 10k records
After filtering by id's & time range search in up to 100k records
After filtering by id's & time range search in up to 1 million records
After filtering by id's & time range search in up to 1 to 10 million records
in these cases would it be 0.5s? or 1s or up to 20s?
Also how would the search performance would change if we would also add search in comments column as well as name?
CONTAINS(Name, "BoB Skynet") OR CONTAINS(Comments, "online")
In most cases the searching should be done on small records counts, in very rare cases we would go though millions of rows.. but how much time would it take?
When it would be an good idea to move to Elastic search for example?
Well we are storing a bit of data, but request count for this data is usually small.

Google Data Studio date aggregation - average number of daily users over time

This should be simple so I think I am missing it. I have a simple line chart that shows Users per day over 28 days (X axis is date, Y axis is number of users). I am using hard-coded 28 days here just to get it to work.
I want to add a scorecard for average daily users over the 28 day time frame. I tried to use a calculated field AVG(Users) but this shows an error for re-aggregating an aggregated value. Then I tried Users/28, but the result oddly is the value of Users for today. The division seems to be completely ignored.
What is the best way to show average number of daily users over a time frame? Average daily users over 10 days, 20 day, etc.
Try to create a new metric that counts the dates eg
Count of Date = COUNT(Date) or
Count of Date = COUNT_DISTINCT(Date) in case you have duplicated dates
Then create another metric for average users
Users AVG = (Users / Count of Date)
The average depends on the timeframe you have selected. If you are selecting the last 28 days the average is for those 28 days (dates), if you filter 20 days the average is for those 20 days etc.
Hope that helps.
I have been able to do this in an extremely crude and ugly manner using Google Sheets as a means to do the calculation and serve as a data source for Data studio.
This may be useful for other people trying to do the same thing. This assumes you know how to work with GA data in Sheets and are starting with a Report Configuration. There must be a better way.
Example for Average Number of Daily Users over the last 7 days:
Edit the Report Configuration fields:
Report Name: create one report per day, in this case 7 reports. Name them (for example) Users-1 through Users-7. These are your Row 2 values. You'll have 7 columns, with the first report name in column B.
Start Date and End Date: use TODAY()-X where X is the number of days previous to define the start and end dates for each report. Each report will contain the user count for one day. Report Users-1 will use TODAY()-1 for start and end, etc.
Metrics: enter the metrics e.g. ga:users and ga:new users
Create the reports
Use 'Run reports' to have the result sheets created and populated.
Create a sheet for an interim data set you will use as the basis for the average calculation. The first column is date, the remaining columns are for the metrics, in this case Users and New Users.
Populate the interim data set with the dates and values. You will reference the Report Configuration to get the dates, and you will pull the metrics from each of the individual reports. At this stage you have a sheet with date in first columns and values in subsequent columns with a row for each day's values. Be sure to use a header.
Finally, create a sheet that averages the values in the interim data set. This sheet will have a column for each metric, with one value per column. The one value is calculated from the series in the interim data set, for example =AVG(interim_sheet_reference:range) or any other calculation you'd like to do.
At last, you can use Data Studio to connect to this data source and use the values. For counts of users such as this example, you would use Sum as the aggregation field type when you are creating the data source.
It's super ugly but it works.

Tricky: SQL Server-side aggregation of time-series data for charting

I have a large time-series data set in a table that contains 5 years of data. The data is very structured; it is clustered/ordered on the time column and there is exactly one record for exactly every 10 minutes over this entire 5 year period.
In my user-side application I have a time-series chart that is 400 pixels wide, and users can set the time scale from 1 hour up to 5 years. Therefore any query to the database by this chart that returns more than 400 records provides data that cannot be physically displayed.
What I want to know is; can anyone suggest an approach such that when the database is queried for a certain time range, the SQL database would dynamically make a suitable averaging aggregation that returns no more than 400 records?
Example 1): if the time range was 5 years, SQL Server would calculate ~1 value for every 4.5 days (5yrs*365days/400records required), so would average all the 10 minute samples for each 4.5 day bin and return a record for each bin. About 400 in total.
Example 2): If the time range was one month, SQL Server would calculate ~1 record for every 1.85 hours (31 days/400records), so would average all the 10 minute samples for each 1.85 hour bin and return a record for each bin. About 400 in total.
Ideally I'd like a solution that from the applications perspective can be queried just like a static Table.
I'd really appreciate any suggested approaches or code snippets.
some examples, if you have a datetime column (which is not quite clear from your question, as there is not table schema):
Grouping into interval of 5 minutes within a time range
SELECT / GROUP BY - segments of time (10 seconds, 30 seconds, etc)
They should be quite easy to port to SQL server, use datediff to convert your datetime values into an unix timestamp and use round() with the function parameter <> 0 for the div.

How to optimize large database requests

I am working with a database that contains information (measurements) about ships. The ships send an update with their position, fuel use, etc. So an entry in the database looks like this
| measurement_id | ship_id | timestamp | position | fuel_use |
| key | f_key | dd-mm-yy hh:ss| lat-lon | in l/km |
A new one of these entries gets added for every ship every second so the amount of entries in the database gets large very fast.
What I need for the application I am working on is not the information for one second but rather cumulative data for 1 minute, 1 day, or even 1 year. For example the total fuel use over a day, the distance traveled in a year, or the average fuel use per day over a month.
To get that and calculate that from this raw data is unfeasible, you would have to get 31,5 million records from the server to calculate the distance traveled in a year.
What I thought was the smart thing to do is combining entries into one bigger entry. For example get 60 measurements and combine them into 1 minute measurement entry in a separate table. By averaging the fuel use, and by summing the distance traveled between two entries. A minute entry would then look like this.
| min_measurement_id | ship_id | timestamp | position | distance_traveled | fuel_use |
| new key |same ship| dd-mm-yy hh| avg lat-lon | sum distance_traveled | avg fuel_use |
This process could then be repeated to work with hours, days, months, years. This way a query for a week could be done by requesting only 7 queries, or if I want hourly details 168 entries. Those look like way more usable numbers to me.
The new tables can be filled by querying the original database every 10 minutes, that data then fills the minute table, which in turn updates the hours table, etc.
However this seems to be a lot of management and duplication of almost the same data, with constantly the same operation being done.
So what I am interested in is if there is some way of structuring this data. Could it be sorted hierarchically (after all seconds, days, minutes are pretty hierarchical) or are there other ways to optimize this?
This is the first time I am using a database this size so I also did not really know what to look for on the internet.
Aggregates are common in data warehouses so your approach to group data is fine. Yes, you are duplicating some of the data, but you'll get the speed benefit.

Choosing appropriate DataType in SQL Server table

I have a large transaction table in SQL server which is used to store about 400-500 records each day. What is the data type should I use in my PK column? The PK column stores numeric values, for which integer seems suitable but I'm afraid it will exceed the maximum value for integer since I have so many records everyday.
I am currently using integer data type for my PK column.
With a type INT, starting at 1, you get over 2 billion possible rows - that should be more than sufficient for the vast majority of cases. With BIGINT, you get roughly 922 quadrillion (922 with 15 zeros - 922'000 billions) - enough for you??
If you use an INT IDENTITY starting at 1, and you insert a row every second, you need 66.5 years before you hit the 2 billion limit .... so with 400-500 rows per day - it will take centuries before you run out of possible values... take 1'000 rows per day - you should be fine for 5883 years - good enougH?
If you use a BIGINT IDENTITY starting at 1, and you insert one thousand rows every second, you need a mind-boggling 292 million years before you hit the 922 quadrillion limit ....
Read more about it (with all the options there are) in the MSDN Books Online.
I may be wrong here as maths has never been my strong point, but if you use bigint this has a max size of 2^63-1 (9,223,372,036,854,775,807)
so if you divide that by say 500 to get roughly the number of days-worth of records you get 18446744073709600 days-worth of 500 new records.
divide again by 365, gives you 50539024859478.2 years-worth of 500 records a day
so (((2^63-1) / 500) / 365)
if that's not me being stupid then that's a lot of days :-)

Resources