Cassandra append-only timeseries modelling/querying - database

I am modelling financial stock price storage in Cassandra, where I need to cater for retrospective changes.
An append-only design is what came to mind.
CREATE TABLE historical_data (
ticker text,
eoddate timestamp,
price double,
created timestamp,
PRIMARY KEY(ticker, eoddate)
) WITH CLUSTERING ORDER BY (eoddate DESC);
e.g. a record might be:
ticker=AAPL, eoddate=2016-09-28, price=123.4, created=2016-09-28 16:30:00
A day later, if there were a retro data fix, I'd insert another record:
ticker=AAPL, eoddate=2016-09-28, price=120.9, created=2016-09-29 09:00:00
What is the best way to model/query this data, if I'd like to get the latest series for AAPL (i.e. keeping only the most recent value for each eoddate)?
In SQL I could write a partition query. How about in CQL?
Or should the filter be applied at the application level?
Thanks.

If I understand your need correctly, your table is good.
With this schema, you can run a query like:
SELECT price
FROM historical_data
WHERE ticker = 'AAPL'
LIMIT 1;
It will return the latest price for ticker AAPL.
The CLUSTERING ORDER BY clause orders your data physically in descending order within each ticker partition; it won't order the whole table. So this query should be enough.
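One side note, not from the answer above but standard Cassandra behaviour: because (ticker, eoddate) is the full primary key, the retro-fix INSERT with the same ticker and eoddate overwrites the earlier row (INSERTs are upserts). So a plain partition read already returns the latest series without any application-level filtering:

-- Reads the latest value per eoddate for AAPL with this exact schema.
SELECT eoddate, price, created
FROM historical_data
WHERE ticker = 'AAPL';

If you actually need to keep every historical version (a true append-only log), you would have to add created to the clustering key, at the cost of filtering for the latest version at read time.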

Related

How do I design a primary key for time-series data in DynamoDB?

I have a data model that consists of 8 entity types and I need to design a DynamoDB NoSQL model around some access patterns. The access patterns are not entity specific so I am in the process of translating them, but most of the access patterns rely on getting items by a date range. From previous related questions, people usually assume that getting the item by both an itemID (Partition Key) and date range (Sort Key) is the norm, but in my case, I need to get all entities by a date range.
This would mean the partition key is the entity type and the sort key is the date range. Am I correct with this statement?
Given the large size of the data (>100GB), I am not sure if this is true.
Update: List of access patterns and data example
The access patterns so far look like this:
Get all transactions during a given date range
Get all transactions during a given date range for a given locationId
Get all transactions during a given date range for a given departmentId
Get all transactions during a given date range for a given categoryId
Get all transactions during a given date range for a given productId
Get all transactions during a given date range for a given transactionItemId
Get all transactions during a given date range for a given supplierId
Get all products on transactions during a given date range
And a transaction entity has the following attributes (I have only included a snippet but there are 52 attributes altogether):
identifier
confirmationNumber (contains date information)
priceCurrency
discount
discountInvoiced
price
shippingAmount
subtotal
subtotalInclTax
.
.
.
I don't think DynamoDB will make you very happy for this use case; you have a bunch of different filter categories, and that's typically not what DynamoDB excels at.
An implementation would require lots of data duplication through global secondary indexes as well as trouble with hot partitions. The general approach could be to have a base table with the date as the PK and the timestamp as the SK. You would then create global secondary indexes based on locationId, departmentId, and the other categories you filter by. This results in data duplication and, depending on your filter categories, hot partitions.
I'd probably use a relational database with indexes on the filter fields and partition it by the transaction time.
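As a rough sketch of that relational alternative (PostgreSQL-style range partitioning; the table and column names here are illustrative, not from the question):

CREATE TABLE transactions (
    transaction_id   bigint       NOT NULL,
    transaction_time timestamptz  NOT NULL,
    location_id      bigint,
    department_id    bigint,
    category_id      bigint,
    product_id       bigint,
    supplier_id      bigint,
    price            numeric(12,2),
    PRIMARY KEY (transaction_id, transaction_time)
) PARTITION BY RANGE (transaction_time);

-- One partition per month; old months can be detached or dropped cheaply.
CREATE TABLE transactions_2016_03 PARTITION OF transactions
    FOR VALUES FROM ('2016-03-01') TO ('2016-04-01');

-- Indexes to support the per-category date-range access patterns.
CREATE INDEX ON transactions (location_id, transaction_time);
CREATE INDEX ON transactions (department_id, transaction_time);
CREATE INDEX ON transactions (product_id, transaction_time);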

MS SQL Server, arrays, looping, and inserting qualified data into a table

I've searched around for an answer and I'm not sure how best to frame the question since I'm rather new to SQL Server.
Here's what I got going on: I get a weekly report detailing the products that have been sold and the quantity of each. This data needs to go into a yearly totals table. In this table the first column is the product_id and the next 52 columns are week numbers, 1-52.
There's a JOIN that runs on the product_id of both the weekly and yearly tables. That finds the proper row and column to put the weekly quantity data for that product.
Here's where I'm not sure what to do. In 2019 there are no product_id values in that column yet, so there's nothing to JOIN on. Those product_id values need to be added weekly if they aren't there. I need to take the weekly report of product_id and quantity and check each product_id to see if it's in the yearly table; if not, I need to add it.
If I had it my way I'd create an array of the product_id numbers from the weekly data and loop through each one creating a new record in the yearly table for any product_id that is not already there. I don't know how best to do that in SSMS.
I've searched around and have found different strategies for this, but nothing strikes me as a perfect solution: creating a #temp table or table variable, a UNION/EXCEPT to get just those that aren't in the table, and a WHILE loop. Any suggestions would be helpful.
I ended up using a MERGE to solve this. I create a table WeeklyParts to dump the weekly data into, then I do a MERGE with the yearly table, inserting only those rows where there is no match. Works well.
-- Merge the PartNo's so that only unique ones are added to the yearly table
MERGE INTO dbo.WeeklySales2018
USING dbo.WeeklyParts
ON (dbo.WeeklySales2018.PartNo = dbo.WeeklyParts.PartNo)
WHEN NOT MATCHED THEN
INSERT (PartNo) VALUES (dbo.WeeklyParts.PartNo);
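For reference, the same "insert only the missing PartNo values" step can also be written without MERGE, as a plain INSERT ... WHERE NOT EXISTS (a sketch using the table names above):

INSERT INTO dbo.WeeklySales2018 (PartNo)
SELECT wp.PartNo
FROM dbo.WeeklyParts AS wp
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.WeeklySales2018 AS ws
    WHERE ws.PartNo = wp.PartNo
);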

Partition Hive based on the first chars of a field

I am looking to store data in Hive to run analysis on the past months (~100 GB per day).
My rows contain a date (STRING) field looking like this: 2016-03-06T04:31:59.933012793+08:00
I want to partition based on this field, but only on the date part (2016-03-06); I don't care about the timezone. Is there any way to achieve that without changing the original format?
The reason for partitioning is both performance and the ability to delete old days to keep a rolling window of data.
Thank you
You can achieve this through INSERT OVERWRITE TABLE with dynamic partitioning.
You can apply the substring or regexp_extract function to your datetime column to get the value in the required format.
Below is a sample query that loads a partitioned table by applying a function to the partition column.
CREATE TABLE base2(id int, name String)
PARTITIONED BY (state string);

INSERT OVERWRITE TABLE base2 PARTITION (state)
SELECT id, name, substring(state,0,1)
FROM base;

Here I am applying a transformation to the partition column. Hope this helps.
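Applied to the datetime string in the question, the same pattern might look like the sketch below. The table and column names are hypothetical, substring(ts, 1, 10) keeps just the 2016-03-06 part, and dynamic partitioning normally has to be enabled first:

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

CREATE TABLE events_by_day (payload string, ts string)
PARTITIONED BY (event_date string);

-- The partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE events_by_day PARTITION (event_date)
SELECT payload, ts, substring(ts, 1, 10) AS event_date
FROM events_raw;

Dropping a day for the rolling window then stays a cheap metadata operation, e.g. ALTER TABLE events_by_day DROP PARTITION (event_date='2016-03-06');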

data synchronization from unreliable data source to SQL table

I am looking for a pattern, framework, or best practice to handle a generic problem of application-level data synchronisation.
Let's take an example with only one table to make it easier.
I have an unreliable data source for a product catalog. Data can occasionally be unavailable, incomplete, or inconsistent (the issue might come from manual data entry errors, ETL failures...).
I have a live copy in a MySQL table in use by a live system, let's say a website.
I need to implement safety mechanisms when updating the MySQL table to "synchronize" with the original data source. Here are the safety criteria and the solutions I am suggesting:
avoid deleting records when they temporarily disappear from the data source => use a "deleted" boolean/date column or an archive/history table.
check for inconsistent changes => configure rules per column, such as: should never change, should only increment.
check for integrity issues => (standard problem, no point discussing the approach)
ability to roll back the last sync => restore from a history table? use a version increment/date column?
What I am looking for is best practice and patterns/tools to handle such a problem. Even if you are not pointing to THE solution, I would be grateful for any keyword suggestions that would help me narrow down which field of expertise to explore.
We have the same problem importing data from web analytics providers - they suffer the same problems as your catalog. This is what we did:
Every import/sync is assigned a unique id (auto_increment int64)
Every table has a history table that is identical to the original, but has an additional column "superseded_id" which gets the import id of the import that changed the row (deletion is a change), and the primary key is (row_id, superseded_id)
Every UPDATE copies the row to the history table before changing it
Every DELETE moves the row to the history table
This makes rollback very easy:
Find out the import_id of the bad import
REPLACE INTO main_table SELECT <everything but superseded_id> FROM history_table WHERE superseded_id=<bad import id>
DELETE FROM history_table WHERE superseded_id>=<bad import id>
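As a minimal sketch of that scheme (MySQL-flavoured; the table and column names are invented for illustration, only superseded_id comes from the description above):

CREATE TABLE imports (
    import_id  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    started_at DATETIME NOT NULL
);

CREATE TABLE product (
    product_id  BIGINT UNSIGNED NOT NULL PRIMARY KEY,
    price       DECIMAL(10,2),
    description TEXT
);

-- Identical to product, plus superseded_id; the PK is (row id, superseded_id).
CREATE TABLE product_history (
    product_id    BIGINT UNSIGNED NOT NULL,
    price         DECIMAL(10,2),
    description   TEXT,
    superseded_id BIGINT UNSIGNED NOT NULL,
    PRIMARY KEY (product_id, superseded_id)
);

-- Before the current import updates or deletes a row, copy it to history:
INSERT INTO product_history (product_id, price, description, superseded_id)
SELECT product_id, price, description, @current_import_id
FROM product
WHERE product_id = @changed_product_id;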
For databases where performance is a problem, we do this in a secondary database on a different server, then copy the found-to-be-good main table to the production database into a new table main_table_$id, with $id being the highest import id, and have main_table be a trivial view of SELECT * FROM main_table_$someid. By redefining the view to SELECT * FROM main_table_$newid we can atomically switch the table.
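The atomic switch at the end is then just a view redefinition, something like (the _1234 suffix is illustrative):

CREATE OR REPLACE VIEW main_table AS
SELECT * FROM main_table_1234;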
I'm not aware of a single solution to all this - probably because each project is so different. However, here are two techniques I've used in the past:
Embed the concept of version and validity into your data model
This is a way to deal with change over time without having to resort to history tables; it does complicate your queries, so you should use it sparingly.
For instance, instead of having a product table as follows
PRODUCTS
Product_ID primary key
Price
Description
AvailableFlag
In this model, if you want to delete a product, you execute "delete from products where product_id = ..."; modifying the price would be "update products set price = 1 where product_id = ...".
With the versioned model, you have:
PRODUCTS
product_ID primary key
valid_from datetime
valid_until datetime
deleted_flag
Price
Description
AvailableFlag
In this model, deleting a product requires you to update products set valid_until = getdate() where product_id = xxx and valid_until is null, and then insert a new row with the "deleted_flag = true".
Changing price works the same way.
This means that you can run queries against your "dirty" data and insert it into this table without worrying about deleting items that were accidentally missed off the import. It also allows you to see the evolution of the record over time, and roll-back easily.
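A rough MySQL-flavoured sketch of the versioned model and of "deleting" a product under it (the column types and the product id 42 are assumptions for illustration; the getdate() from the description becomes NOW()):

CREATE TABLE products (
    product_id     INT NOT NULL,
    valid_from     DATETIME NOT NULL,
    valid_until    DATETIME NULL,
    deleted_flag   BOOLEAN NOT NULL DEFAULT FALSE,
    price          DECIMAL(10,2),
    description    VARCHAR(255),
    available_flag BOOLEAN,
    PRIMARY KEY (product_id, valid_from)
);

-- "Delete" product 42: close the current version, then add a deleted marker row.
UPDATE products
SET valid_until = NOW()
WHERE product_id = 42 AND valid_until IS NULL;

INSERT INTO products (product_id, valid_from, deleted_flag)
VALUES (42, NOW(), TRUE);

-- The current, non-deleted view of the catalogue:
SELECT * FROM products
WHERE valid_until IS NULL AND deleted_flag = FALSE;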
Use a ledger-like mechanism for cumulative values
Where you have things like "number of products in stock", it helps to create transactions to modify the amount, rather than take the current amount from your data feed.
For instance, instead of having an amount_in_stock column on your products table, have a "product_stock_transaction" table:
product_stock_transactions
product_id (FK) | transaction_date | transaction_quantity | transaction_source
1               | 1 Jan 2012       | 100                  | product_feed
1               | 2 Jan 2012       | -3                   | stock_adjust_feed
1               | 3 Jan 2012       | 10                   | product_feed
On 2 Jan, the quantity in stock was 97; on 3 Jan, 107.
This design allows you to keep track of adjustments and their source, and is easier to manage when moving data from multiple sources.
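Current stock under this design is simply the sum of the transactions per product, e.g. (a sketch using the column names above):

SELECT product_id, SUM(transaction_quantity) AS quantity_in_stock
FROM product_stock_transactions
GROUP BY product_id;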
Both approaches can create large amounts of data - depending on the number of imports and the amount of data - and can lead to complex queries to retrieve relatively simple data sets.
It's hard to plan for performance concerns up front - I've seen both "history" and "ledger" work with large amounts of data. However, as Eugen notes, if you get to an excessively large ledger, it may be necessary to clean up the ledger table by summarizing the current levels and deleting (or archiving) old records.

Maintain valid records in a table

I have a table which holds flight schedule data. Every schedule has an effective_from and effective_to date. I load this table from a flat file which doesn't provide an effective_from and effective_to date, so at load time I ask the user for this information.
Suppose the user gave the from date as the current date and the to date as 31st March. Then on 1st March the user loads a new flight schedule and gives the from date as the current date and the to date as 31st May.
If I query the table for effective dates between 1st March and 31st March, the query returns two records for each flight, whereas I want only one record per flight, and it should be the latest record.
How do I do this? Should I handle this in the query, or check and correct the data while loading?
You need to identify the primary key for the data (which might be a 'business' key). There must be something which uniquely identifies each flight schedule (it sounds like it shouldn't include effective_from). Once that key is established, you check for it when importing and then either update the existing record or insert a new one.
I assume that each flight has some unique ID, otherwise how could you tell them apart? Then you can add an extra field "Active" to the schedule table.
When loading in a new schedule, first query the existing records with the same flight ID and set them to Active=false. Enter the new record with Active=true.
Query is then simple: select * from schedule where active=1
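The load step for that approach might look like this (a sketch; the column names are assumptions based on the schema the asker posts below, plus the active flag):

-- Deactivate any existing schedules for the flight being re-loaded:
update schedule set active = 0 where flightNumber = 'XYZ12';

-- Insert the new schedule as the active one:
insert into schedule (flightNumber, effective_from, effective_to, active)
values ('XYZ12', '2009-03-01', '2009-05-31', 1);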
I developed this solution, but I am looking for an even better solution if possible.
Table Schedule {
scheduleId, flightNumber, effective_from,effective_to
}
Data in Schedule table {
1, XYZ12, 01/01/2009, 31/03/2009
2, ABC12, 01/01/2009, 30/04/2009
}
Now the user loads another record: 3, XYZ12, 01/03/2009, 31/05/2009
select scheduleId
from Schedule
where flightNumber = 'XYZ12'
  and (   (effective_from < '01/03/2009' and effective_to > '01/03/2009')
       or (effective_from < '31/05/2009' and effective_to > '31/05/2009'))
If the above query returns any result, that means there is an overlap and I should throw an error to the user.
The problem description and the comment to one of the suggestions gives the business rules:
querying flights with an effective date should return only one record per flight
the record returned should be the latest record
previous schedules must be kept in the table
The key to the answer is how to determine which is the latest record - the simplest answer would be to add a column that records the timestamp when the row is inserted. Now when you query for a flight and a given effective date, you just get the result with the latest inserted timestamp (which can be done using ORDER BY on that timestamp DESC and taking the first row returned).
You can do a similar query with just the effective date and return all flights - again, for each flight you just want to return the row that includes the effective date, but with the largest timestamp. There's a neat trick for finding the maximum of something on a group-by-group basis - left join the results with themselves so that left < right, then the maximum is the left value where the right is null. The author of High Performance MySQL gives a simple example of this.
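A sketch of that left-join trick against the schedule table from this question, assuming an added inserted_at timestamp column and an effective date of 2009-03-15:

-- For each flight covering the date, keep the row with the largest inserted_at.
select s1.*
from schedule s1
left join schedule s2
    on  s2.flightNumber = s1.flightNumber
    and '2009-03-15' between s2.effective_from and s2.effective_to
    and s2.inserted_at > s1.inserted_at
where '2009-03-15' between s1.effective_from and s1.effective_to
  and s2.flightNumber is null;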
It's much easier than trying to retroactively correct the older schedules - and, by the sound of things, the older schedules have to be kept intact to satisfy your business requirements. It also means you can answer historical questions - you can always find out what your schedule table used to look like on a given date - which means it's very handy when generating reports such as "This month's schedule changes" and so on.
