Time selections on InfluxDB - inner-join

I have some stock price data in InfluxDB with columns Ticker and Price. For example:
Time Ticker Price
------ ------ ------
12:02 IBM 100.12
12:02 MSFT 50.15
12:03 IBM 100.15
12:04 MSFT 51.00
12:05 AMZN 200.00
I would like to extract the latest prices for each stock, even though they may be at different times. So the final selection should look like:
Time Ticker Price
------ ------ ------
12:03 IBM 100.15
12:04 MSFT 51.00
12:05 AMZN 200.00
In regular SQL, one would usually do it like this:
SELECT values.*
FROM (SELECT Ticker, MAX(Time) AS MaxTime
FROM StockHistory
GROUP BY Ticker) as keys
INNER JOIN StockHistory as values
ON keys.Ticker = values.Ticker
AND keys.MaxTime = values.Time
The problem is, Influx does not seem to support INNER JOIN or any other kind of join. I am just starting to learn it, and for a time-series DB this type of problem must be one of the most frequent ones it is built for. How do you do this in a fast way?
Speed is of concern for me, since I am looking at roughly 5-15 million rows in the table, and 150,000 different tickers (not all are stocks, there are many instruments I am storing).
Thank you very much.
PS If it matters, I will be accessing Influx through the Python API and can do this filtering in the code, but would strongly prefer to do it on the DB side to minimize transmission of a huge number of rows over the network...
UPDATE
I saw this question about a left join, but it seems to be outdated, and I need an inner one on time, something the answer specifically implies would be supported...
Possible Approach
Will this work?
SELECT LAST(Time), Price
FROM StockHistory
GROUP BY Ticker
If yes, how fast would it be on a large table (see above for measurements)?
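For concreteness, this is the variant I would try, assuming Ticker is stored as a tag and Price as a field (names as in my example above; as far as I understand, LAST() returns the most recent field value per group together with its timestamp, so time does not need to be selected explicitly):
SELECT LAST(Price)
FROM StockHistory
GROUP BY Ticker
If that is correct, the remaining question is whether it stays fast with ~150,000 distinct Ticker values.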

Related

Periodic snapshot fact table - Design question

I'm working on the design of a new periodic snapshot fact table. I'm looking into health insurance claims and the amount of money people owe to the insurance company and the amount they've already paid. Data in the table will look like this.
CLAIM_ID TIME_KEY AMOUNT_OWED PAID
123 31.1.2000 1000 0
123 28.2.2000 900 100
123 31.3.2000 800 200
123 30.4.2000 0 1000
123 31.5.2000 0 1000
123 30.6.2000 0 1000
123 31.7.2000 0 1000
123 31.8.2000 0 1000
...
As you can see, after 30.4.2000 it doesn't make sense to insert new data for claim_id 123, as it no longer changes (there is a reasonable degree of certainty this won't happen). Is it a good idea to stop inserting data for this claim, or should I keep doing so till the end of time :)?
I'm mainly concerned about sticking to best practices when designing Data Warehouse tables.
Thanks for any answer!
Just a few thoughts...
Unless you can have multiple payments in a day against a claim (and potentially other transactions, e.g. interest that increases the amount owed), what you have shown is not really a snapshot fact; it is a transactional fact. The usual example is a bank account, where you have multiple in/out transactions per day and then a snapshot of the end-of-day (or end-of-month) position. Obviously I don't know your business model, but it seems unlikely that there would be multiple transactions per day against a single claim.
If there have been no changes to a claim since the last fact record was created, there seems little point in creating a new fact record.
Typically you choose a periodic snapshot if you have
a) a large number of transactions, and
b) you need efficient access to the data as of some point in time (end of the month in your case).
If you have, say, 50 claim transactions per month and the claim is active for one year on average, you will profit from this design even if you hold the inactive claims for 50 years (which you probably will not do ;)).
Your doubts suggest that you do not have that many transactions per claim life cycle. In that case you should consider a fact table storing each transaction.
You will definitely have no overhead for inactive claims, but to get snapshot information as of a specific time you will have to read the whole table.
By contrast, the periodic snapshot is typically partitioned on the snapshot time, so access is very efficient.
There is no free lunch: you trade storage space against efficient access.
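To make the trade-off concrete, here is a minimal sketch of the point-in-time query against a transactional fact table (table and column names are hypothetical):
-- Balance of every claim as of a given date, rebuilt from individual transactions;
-- the whole history up to @as_of has to be scanned, which is the price of skipping snapshots.
SELECT claim_id,
       SUM(owed_delta) AS amount_owed,
       SUM(paid_delta) AS paid
FROM claim_transaction_fact
WHERE transaction_date <= @as_of
GROUP BY claim_id
With a periodic snapshot you would instead read a single partition: WHERE snapshot_date = @as_of.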

Large db table with many rows or many columns

I have tried to keep a normalized table design. The problem (maybe) is that we are generating a lot of data, and therefore a lot of rows. Currently the database is growing by 0.25 GB per day.
The main tables are Samples and Boxes. There is a one-to-many relation from Samples to Boxes.
Sample table:
ID | Timestamp | CamId
Boxes table:
ID | SampleID | Volume | ...
We analyse 19 samples every 5 seconds, and each sample has 7 boxes on average. That's 19*7*12 = 1596 boxes each minute and 1596*60*24 = 2,298,240 rows in the Boxes table each day on average.
This setup might run for months. At this time the Boxes table has about 25 million rows.
Question is: should I be worried about database size, table size and table design with so much data?
Or should I have a table like
ID | SampleID | CamId | Volume1 | Volume2 | ... | Volume9 | ...
Depending on the validity of your data, you could implement a purge of your data.
What I mean is: do you really need data from days ago, months ago, years ago? If your data has a limited useful lifetime, purge the old rows and your tables should stop growing (or are likely to) after a set amount of time.
This way you wouldn't need to care that much about either architecture for the sake of size.
Otherwise the answer is yes, you should care. Separating notions into a lot of tables could give you a good performance tweak, but may not be sufficient in terms of access time in the long run. Consider looking at NoSQL solutions or the like for storing heavy rows.
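For the purge option, a minimal sketch assuming SQL Server and the table layout from the question (the 90-day retention window is just an example value):
-- Scheduled job: delete boxes older than the retention window in small batches
-- to keep transaction log growth under control; repeat until no rows are affected.
DELETE TOP (10000) b
FROM Boxes AS b
JOIN Samples AS s ON s.ID = b.SampleID
WHERE s.Timestamp < DATEADD(DAY, -90, SYSUTCDATETIME())
The matching Samples rows can be removed the same way once their boxes are gone.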
There is one simple rule: whenever you think you have to put a number in a column's name, you probably need a related table.
The amount of data will be roughly the same, so no wins there.
I'd try to partition the table. AFAIK this feature used to be bound to the Enterprise Edition, but - according to this document - with SQL Server 2016 SP1 table and index partitioning comes down even to Express!
The main question is: what are you going to do with this data?
If you have to run analytical scripts over everything, there won't be a much better hint than: buy better hardware.
If your needs only concern the data of the last 3 weeks, you will be fine with partitioning.
If you cannot use this feature yet (due to your server's version), you can create an archive table and move older data into it with regular jobs. A UNION ALL view would still allow you to grab the whole lot. With SCHEMA BINDING you might even get the advantages of indexed views.
In this case it is wise to hold your working data on your fastest drive and put the archive table in a separate file on larger storage somewhere else.
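A rough sketch of that archive setup, with hypothetical object names and T-SQL syntax assumed (each statement would run as its own batch/job step):
-- Same structure as the hot table; can be placed on a filegroup on cheaper storage
CREATE TABLE dbo.Boxes_Archive (ID bigint PRIMARY KEY, SampleID bigint, Volume float)
-- Regular job: move rows older than the working window into the archive
INSERT INTO dbo.Boxes_Archive (ID, SampleID, Volume)
SELECT b.ID, b.SampleID, b.Volume
FROM dbo.Boxes AS b
JOIN dbo.Samples AS s ON s.ID = b.SampleID
WHERE s.Timestamp < DATEADD(WEEK, -3, SYSUTCDATETIME())
DELETE b
FROM dbo.Boxes AS b
WHERE b.ID IN (SELECT ID FROM dbo.Boxes_Archive)
-- One view over both, so existing queries keep working unchanged
CREATE VIEW dbo.Boxes_All AS
SELECT ID, SampleID, Volume FROM dbo.Boxes
UNION ALL
SELECT ID, SampleID, Volume FROM dbo.Boxes_Archive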
Question is, should I be worried about database size, table size and table design with so much data?
My answer is YES:
1. A huge amount of data (daily) will affect your storage on the hardware side.
2. Keeping the tables normalized is a must, especially if you are storing binary data or images.

Data Store Design for NxN Data Aggregation

I am trying to come up with a theoretical solution to an NxN problem for data aggregation and storage. As an example I have a huge amount of data that comes in via a stream. The stream sends the data in points. Each point has 5 dimensions:
Location
Date
Time
Name
Statistics
This data then needs to be aggregated and stored to allow another user to come along and query the data for both location and time. The user should be able to query like the following (pseudo-code):
Show me aggregated statistics for Location 1,2,3,4,....N between Dates 01/01/2011 and 01/03/2011 between times 11am and 4pm
Unfortunately due to the scale of the data it is not possible to aggregate all this data from the points on the fly and so aggregation prior to this needs to be done. As you can see though there are multiple dimensions that the data could be aggregated on.
They can query for any number of days or locations and so finding all the combinations would require huge pre-aggregation:
Record for Locations 1 Today
Record for Locations 1,2 Today
Record for Locations 1,3 Today
Record for Locations 1,2,3 Today
etc... up to N
Preprocessing all of these combinations prior to querying could result in an amount of processing that is not viable. If we have 200 different locations then we have 2^200 combinations, which would be nearly impossible to precompute in any reasonable amount of time.
I did think about creating records on one dimension and then merging on the fly when requested, but that would also take too long at scale.
Questions:
How should I go about choosing the right dimension and/or combination of dimensions given that the user is as likely to query on all dimensions?
Are there any case studies I could refer to, books I could read or anything else you can think of that would help?
Thank you for your time.
EDIT 1
When I say aggregating the data together I mean combining the statistics and name (dimensions 4 & 5) for the other dimensions. So for example if I request data for Locations 1,2,3,4..N then I must merge the statistics and counts of name together for those N Locations before serving it up to the user.
Similarly, if I request the data for dates 01/01/2015 - 01/12/2015 then I must aggregate all data between those dates (by summing the name counts/statistics).
Finally, if I ask for data between dates 01/01/2015 - 01/12/2015 for Locations 1,2,3,4..N then I must aggregate all data between those dates for all those locations.
For the sake of this example, let's say that going through the statistics requires some sort of nested loop and does not scale well, especially on the fly.
Try a time-series database!
From your description it seems that your data is a time-series dataset.
The user seems to be mostly concerned about the time when querying and after selecting a time frame, the user will refine the results by additional conditions.
With this in mind, I suggest you try a time-series database like InfluxDB or OpenTSDB.
For example, Influx provides a query language that is capable of handling queries like the following, which comes quite close to what you are trying to achieve:
SELECT count(location) FROM events
WHERE time > '2013-08-12 22:32:01.232' AND time < '2013-08-13'
GROUP BY time(10m);
I am not sure what you mean by scale, but time-series DBs have been designed to be fast for lots of data points.
I'd definitely suggest giving them a try before rolling your own solution!
Denormalization is a means of addressing performance or scalability in a relational database.
IMO, having some new tables to hold aggregated data and using them for reporting will help you.
I have a huge amount of data that comes in via a stream. The stream sends the data in points.
There are multiple ways to achieve denormalization in this case:
Adding a new parallel endpoint for data aggregation functionality at the streaming level
Scheduling a job to aggregate data at the DBMS level
Using the DBMS triggering mechanism (less efficient)
In an ideal scenario, when a message reaches the streaming level there will be two copies of the data message containing the location, date, time, name and statistics dimensions, dispatched for processing: one goes to OLTP (the current application logic), the second goes to an OLAP (BI) process.
The BI process will create denormalized, aggregated structures for reporting.
I would suggest having an aggregated data record per (location, date) group.
So the end user will query preprocessed data that won't need heavy recalculation, with some acceptable inaccuracy.
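A minimal sketch of such a scheduled aggregation job in SQL (all table and column names are hypothetical):
-- Scheduled BI job: roll raw stream points up into one row per (location, date, name)
INSERT INTO point_daily_agg (location, event_date, name, point_count, stat_sum)
SELECT location,
       CAST(event_time AS DATE) AS event_date,
       name,
       COUNT(*) AS point_count,
       SUM(statistic_value) AS stat_sum
FROM raw_points
WHERE event_time >= @window_start AND event_time < @window_end
GROUP BY location, CAST(event_time AS DATE), name
Reporting queries then sum these daily rows over the requested locations and date range instead of touching the raw data.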
How should I go about choosing the right dimension and/or combination of dimensions given that the user is as likely to query on all dimensions?
That will depend on your application logic. If possible, limit the user to predefined queries whose values the user can fill in (like dates from 01/01/2015 to 01/12/2015). In more complex systems, using a report generator on top of the BI warehouse is an option.
I'd recommend Kimball's The Data Warehouse ETL Toolkit.
You can at least reduce Date and Time to a single dimension, and pre-aggregate your data based on your minimum granularity, e.g. 1-second or 1-minute resolution. It could be useful to cache and chunk your incoming stream for the same resolution, e.g. append totals to the datastore every second instead of updating for every point.
What's the size and likelihood of change of the name and location domains? Is there any relation between them? You said that there could be as many as 200 locations. I'm thinking that if name is a very small set and unlikely to change, you could hold counts of names in per-name columns in a single record, reducing the scale of the table to 1 row per location per unit of time.
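A sketch of that layout, assuming a small and stable name set and 1-minute resolution (all names hypothetical):
-- One row per location per minute; each known name gets its own count/statistics column
CREATE TABLE minute_agg (
    bucket_start TIMESTAMP NOT NULL,          -- time quantised to the minute
    location_id  INT NOT NULL,
    donut_count  INT NOT NULL DEFAULT 0,      -- example per-name columns
    donut_stats  DOUBLE PRECISION NOT NULL DEFAULT 0,
    coffee_count INT NOT NULL DEFAULT 0,
    coffee_stats DOUBLE PRECISION NOT NULL DEFAULT 0,
    PRIMARY KEY (location_id, bucket_start)
)
With 200 locations at 1-minute resolution that is only 200 * 1,440 = 288,000 rows per day, regardless of how many raw points arrive.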
You have a lot of data. It will take a lot of time with any method, given the amount of data you are trying to scan.
I have two methods to offer.
The first one is the brute-force one you probably already thought of:
id | location | date | time | name | statistics
0 | blablabl | blab | blbl | blab | blablablab
1 | blablabl | blab | blbl | blab | blablablab
etc.
With this one you can easily scan and get elements, since they are all in the same table, but the scan is long and the table is enormous.
The second one is better, I think:
Multiple tables:
id | location
0 | blablabl
id | date
0 | blab
id | time
0 | blab
id | name
0 | blab
id | statistics
0 | blablablab
With this you could scan (a lot) faster, getting the IDs first and then fetching all the needed information.
It also allows you to pre-sort all the data:
you can have the locations sorted by location, the times sorted by time, the names sorted alphabetically, etc., because we don't care about how the IDs are ordered:
whether the IDs are 1 2 3 or 1 3 2, no one actually cares, and you will go a lot faster if your data is already organised in its respective tables.
So, if you use the second method: the moment you receive a data point, give an ID to each of its columns.
You receive:
London 12/12/12 02:23:32 donut verygoodstatsblablabla
You attach the ID to each part of this and insert each part into its respective table:
42 | London ==> goes with the London locations in the location table
42 | 12/12/12 ==> goes with the 12/12/12 dates in the date table
42 | ...
With this, when you want all the London data, it is all side by side: you just take all the IDs and fetch the other data with them. If you want all the data between 11/11/11 and 12/12/12, it is also all side by side: you just take the IDs, etc.
Hope I helped.
You should check out Apache Flume and Hadoop
http://hortonworks.com/hadoop/flume/#tutorials
The Flume agent can be used to capture and aggregate the data into HDFS, and you can scale this as needed. Once the data is in HDFS there are many options to visualize it, and you can even use MapReduce or Elasticsearch to view the data sets you are looking for in the examples provided.
I have worked with a point-of-sale database with a hundred thousand products and ten thousand stores (typically week-level aggregated sales, but also receipt-level data for basket analysis, cross sales etc.). I would suggest you have a look at these:
Amazon Redshift: highly scalable and relatively simple to get started with, cost-efficient
Microsoft Columnstore Indexes: compresses data and has a familiar SQL interface, quite expensive (a 1-year reserved r3.2xlarge instance at AWS is about 37,000 USD), no experience of how it scales within a cluster
ElasticSearch, my personal favourite: highly scalable, very efficient searches via inverted indexes, nice aggregation framework, no license fees, has its own query language but simple queries are simple to express
In my experiments ElasticSearch was 20-50% faster than Microsoft's columnstore or clustered index tables for small and medium-size queries on the same hardware. To get fast response times you must have a sufficient amount of RAM to keep the necessary data structures in memory.
I know I'm missing many other DB engines and platforms, but these are the ones I am most familiar with. I have also used Apache Spark, though not in a data aggregation context but for distributed mathematical model training.
Is there really likely to be a way of doing this without brute-forcing it in some way?
I'm only familiar with relational databases, and I think the only real way to tackle this is with a flat table as suggested before, i.e. all your data points as fields in a single table. I guess you just have to decide how to do this, and how to optimize it.
Unless you have to maintain accuracy down to the single record, I think the question really needs to be: what can we throw away?
I think my approach would be to:
Work out what the smallest time fragment would be and quantise the time domain on that, e.g. each analysable record is 15 minutes long.
Collect raw records together into a raw table as they come in, but as the quantising window passes, summarize the rows into the analytical table (for the 15 minute window).
Deletion of old raw records can be done by a less time-sensitive routine.
Location looks like a restricted set, so use a table to convert these to integers.
Index all the columns in the summary table.
Run queries.
Obviously I'm betting that quantising the time domain in this way is acceptable. You could supply interactive drill-down by querying back onto the raw data by time domain too, but that would still be slow.
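A sketch of the kind of query the 15-minute summary table would then serve (table and column names are hypothetical):
-- Summary table has one row per 15-minute window, location and name, indexed on all of these
SELECT location_id, name, SUM(stat_sum) AS statistics, SUM(point_count) AS points
FROM summary_15min
WHERE location_id IN (1, 2, 3, 4)
  AND window_start >= '2011-01-01' AND window_start < '2011-03-01'
  AND CAST(window_start AS TIME) BETWEEN '11:00' AND '16:00'
GROUP BY location_id, name
The time-of-day filter works because each row covers a whole quantised window, so no raw records need to be touched.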
Hope this helps.
Mark

how to design table in sql server for obtaining summary results

I have this situation: the logic is that a customer is given a credit sale and repays the money in installments. I need to store details about the products, quantities and the amounts they pay in installments.
In a dashboard I need to show all customers with their name, total sale amount, paid amount and balance amount.
The approach I thought of:
tblCredit stores a row every time an amount is involved, e.g.
shan (Name), paper (Product), 1500 (Qty), 2000 (Price), 100 (Debit) -- initial purchase
shan (Name), -, -, -, 200 (Debit)
In a query, filter by name; SUM(Price) - SUM(Debit) gives the balance.
But with this approach, once the data grows, will this aggregation become troublesome?
Is it possible to do something like caching the aggregated result with timestamps, updating it on every insert into that table, and showing results from that?
Note
The data growth rate will be high.
I am very new to designing.
Please guide me to the best approach to handle this.
Update
Apart from the dashboard, I need to show a report when users click Report, to see how much credit was given to whom. So in any case I need an optimized query, logic and design to handle this.
Usually a dashboard does not need to get the data in real time. You may think of using a data snapshot (a scheduled insert after your aggregation) rather than maintaining a summary table updated by the different types of sales transactions, which makes it difficult to maintain integrity, especially when handling back-dated processing.
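A minimal sketch of such a snapshot job, based on the tblCredit layout described in the question (the summary table name is hypothetical):
-- Scheduled job (e.g. nightly): rebuild the dashboard snapshot from the detail rows
INSERT INTO tblCreditSummary (Name, TotalSale, TotalPaid, Balance, SnapshotTime)
SELECT Name,
       SUM(Price) AS TotalSale,
       SUM(Debit) AS TotalPaid,
       SUM(Price) - SUM(Debit) AS Balance,
       CURRENT_TIMESTAMP AS SnapshotTime
FROM tblCredit
GROUP BY Name
The dashboard reads only tblCreditSummary, while the on-demand report can still go against tblCredit directly.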

What database to use for statistic / report data?

My application is a kind of POS system.
The problem is with reports: sales per product, per table, per staff member, per category. Reports over a 1-year date range are very slow because they have to sum lots of rows, etc. So I am wondering if a NoSQL database could help, e.g. by having summaries per day or something. But maybe it's not easy, because there could be a query for each item * product * category * staff combination. So what could I do?
If you're comfortable with relational databases, I'd recommend sticking with them, and using daily aggregate tables for the reports you commonly use.
For example, if you like to do sales reports grouped by product number, figure out what stats you're looking for (i.e. quantity sold) and aggregate your raw data into day-sized "buckets" by product number.
+-----------+------------+------------+-------+-------+
| salesdate | productNum | totalSales | stat2 | stat3 |
+-----------+------------+------------+-------+-------+
If you do day-sized buckets at the end of every day, you will only have 30 buckets per month for your report, or 365 buckets per year. Much faster to summarize. I've done this with network performance metrics when building out dashboards (hour-sized buckets), and it greatly reduces query time. You can always dig into the raw data if need be, but for the average user who wants to see something at a glance, the aggregated buckets are enough.
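A minimal sketch of the end-of-day job that fills such a bucket table (raw table and column names are hypothetical):
-- One summary row per product per day
INSERT INTO daily_product_sales (salesdate, productNum, totalSales, qtySold)
SELECT CAST(sale_time AS DATE) AS salesdate,
       productNum,
       SUM(sale_amount) AS totalSales,
       SUM(quantity) AS qtySold
FROM raw_sales
WHERE sale_time >= @day_start AND sale_time < @day_end
GROUP BY CAST(sale_time AS DATE), productNum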
You may also consider putting the summary tables in a separate database.
Just keep in mind, if one of your stats is an average, the average of a series of averages is not the average for the overall range.
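A quick made-up illustration: if day 1 has 10 sales averaging 5.00 and day 2 has 1,000 sales averaging 10.00, the average of the two daily averages is 7.50, but the true average over the range is (50 + 10,000) / 1,010 ≈ 9.95. Storing the SUM and COUNT per bucket instead of the average lets you recompute the correct average for any date range.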
