Efficient data retention policy other than time in TimescaleDB

I have a hypertable which looks like this:
 Column | Type   | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
--------+--------+-----------+----------+---------+----------+-------------+--------------+-------------
 state  | text   |           |          |         | extended |             |              |
 device | text   |           |          |         | extended |             |              |
 time   | bigint |           | not null |         | plain    |             |              |
Indexes:
"device_state_time" btree ("time")
Triggers:
ts_insert_blocker BEFORE INSERT ON "device_state" FOR EACH ROW EXECUTE FUNCTION _timescaledb_internal.insert_blocker()
Child tables: _timescaledb_internal._hyper_4_2_chunk
Access method: heap
I have 100k devices, each sending its state at a different time interval. For example, device1 sends its state every second, device2 every day, device3 every 5 days, etc. And I MUST keep at least the 10 latest states for each device, so I can't really use the default time-based data retention policy that Timescale provides.
Is there any way to achieve this efficiently other than manually selecting the latest 10 entries for each device and deleting the rest?
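The manual approach would look something like this (a sketch, assuming plain PostgreSQL; it keeps each device's 10 newest states by deleting everything older than that device's 10th-latest time):

-- rank each device's states newest-first, find the 10th-latest time per
-- device, and delete everything older than that cutoff
DELETE FROM device_state d
USING (
    SELECT device, "time" AS cutoff
    FROM (
        SELECT device, "time",
               row_number() OVER (PARTITION BY device ORDER BY "time" DESC) AS rn
        FROM device_state
    ) ranked
    WHERE rn = 10                -- the 10th-latest state per device
) k
WHERE d.device = k.device
  AND d."time" < k.cutoff;      -- devices with fewer than 10 states are untouched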
Thanks!

That sounds like a corner case, because the chunks are time-based. Can you categorize these devices in advance?
Maybe you can insert data into different hypertables based on the insert timeframe if you still want to use the retention policies.
For example, Promscale solves this by using one table per metric, which allows users to redefine the retention policy for every metric.
It will depend on how you read the data later; fragmenting it into several hypertables may make that harder.
Also, consider the optional arguments to hypertable creation; maybe you can get something out of partitioning_func and time_partitioning_func.
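If the devices can be bucketed in advance like that (say, fast and slow reporters in separate hypertables), each bucket can then get its own time-based policy. A minimal sketch (TimescaleDB 2.x API; the table names and intervals are hypothetical):

-- hypothetical split by reporting frequency
SELECT add_retention_policy('device_state_fast', INTERVAL '1 day');
SELECT add_retention_policy('device_state_slow', INTERVAL '60 days');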

Related

What is the name of the approach where you store the history in the same table?

I'm developing an application that uses a MySQL database, and for history purposes we wanted an approach where we store the current state and the history in the same table, for performance reasons (on updates the application doesn't have the id of an entity, only a key pair, so it is easier to just insert a new row).
The table looks like this:
+----+---------+----------+--------------------+
| id | user_id | type     | content            |
+----+---------+----------+--------------------+
| 1  | '1-2-3' | position | *creation          |
| 2  | '1-2-3' | position | *something_changed |
| 3  | '1-2-3' | device   | *creation          |
| 4  | '1-2-4' | position | *creation          |
| 5  | '1-2-4' | device   | *creation          |
| 6  | '1-2-4' | device   | *something_changed |
+----+---------+----------+--------------------+
Every entity is identified by the (user_id, type) "key" pair; when something changes in an entity, a new row is inserted. The current state of an entity is the row with the highest id in its (user_id, type) group. Performance-wise, the updates should be super fast, while the selects can be slower, because they are not used often.
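For concreteness, reading the current state under this design looks like this (a sketch; the table name history is made up, since we haven't named the table):

-- the newest row per (user_id, type) group wins
SELECT t.*
FROM history t
JOIN (
    SELECT user_id, type, MAX(id) AS max_id
    FROM history
    GROUP BY user_id, type
) latest ON latest.max_id = t.id;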
I would like to look up best practices and other people's experiences with this method, but I don't know what to search for. Can you help me? I'm interested in your experiences and opinions on this topic as well.
I know about Kafka and other streaming platforms, but sadly that was not an option here.

Ad-hoc slowly changing dimension materialization from an external table of timestamped CSVs in a data lake

Question
main question
How can I ephemerally materialize a type 2 slowly changing dimension from a folder of daily extracts, where each CSV is one full extract of a table from a source system?
rationale
We're designing ephemeral data warehouses as data marts for end users that can be spun up and burned down without consequence. This requires that we keep all data in a lake/blob/bucket.
We're ripping daily full extracts because:
we couldn't reliably extract just the changeset (for reasons out of our control), and
we'd like to maintain a data lake with the "rawest" possible data.
challenge question
Is there a solution that could give me the state as of a specific date and not just the "newest" state?
existential question
Am I thinking about this completely backwards and there's a much easier way to do this?
Possible Approaches
custom dbt materialization
There's an insert_by_period dbt materialization in the dbt-utils package that I think might be exactly what I'm looking for. But I'm confused, since what I really want seems to be dbt snapshot, but:
run dbt snapshot for each file incrementally, all at once; and,
built directly off of an external table?
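For context, a plain dbt snapshot over the current extract would look roughly like this (a sketch; the source name is hypothetical, and replaying it over each historical file is exactly the part dbt doesn't do out of the box):

{% snapshot crm_history %}
{{
    config(
        target_schema='snapshots',
        unique_key='OppId',
        strategy='timestamp',
        updated_at='LastModified'
    )
}}
select * from {{ source('lake', 'crm_extract') }}
{% endsnapshot %}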
Delta Lake
I don't know much about Databricks's Delta Lake, but it seems like it should be possible with Delta Tables?
Fix the extraction job
Is our problem solved if we can make our extracts contain only what has changed since the previous extract?
Example
Suppose the following three files are in a folder of a data lake. (Gist with the 3 CSVs and the desired table outcome as CSV.)
I added the Extracted column in case parsing the timestamp from the filename is too tricky.
2020-09-14_CRM_extract.csv
| OppId | CustId | Stage | Won | LastModified | Extracted |
|-------|--------|-------------|-----|--------------|-----------|
| 1 | A | 2 - Qualify | | 9/1 | 9/14 |
| 2 | B | 3 - Propose | | 9/12 | 9/14 |
2020-09-15_CRM_extract.csv
| OppId | CustId | Stage | Won | LastModified | Extracted |
|-------|--------|-------------|-----|--------------|-----------|
| 1 | A | 2 - Qualify | | 9/1 | 9/15 |
| 2 | B | 4 - Closed | Y | 9/14 | 9/15 |
| 3 | C | 1 - Lead | | 9/14 | 9/15 |
2020-09-16_CRM_extract.csv
| OppId | CustId | Stage | Won | LastModified | Extracted |
|-------|--------|-------------|-----|--------------|-----------|
| 1 | A | 2 - Qualify | | 9/1 | 9/16 |
| 2 | B | 4 - Closed | Y | 9/14 | 9/16 |
| 3 | C | 2 - Qualify | | 9/15 | 9/16 |
End Result
Below is the SCD-II for the three files as of 9/16. The SCD-II as of 9/15 would be the same, except OppId=3 would have only one row, with valid_from=9/15 and valid_to=null.
| OppId | CustId | Stage | Won | LastModified | valid_from | valid_to |
|-------|--------|-------------|-----|--------------|------------|----------|
| 1 | A | 2 - Qualify | | 9/1 | 9/14 | null |
| 2 | B | 3 - Propose | | 9/12 | 9/14 | 9/15 |
| 2 | B | 4 - Closed | Y | 9/14 | 9/15 | null |
| 3 | C | 1 - Lead | | 9/14 | 9/15 | 9/16 |
| 3 | C | 2 - Qualify | | 9/15 | 9/16 | null |
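For illustration, the derivation can be sketched in generic SQL (assuming all_extracts is the union of the daily CSVs, e.g. an external table over the folder): keep each OppId's first appearance plus every row whose tracked attributes changed since the previous extract, then take the next change's Extracted date as valid_to.

WITH changes AS (
    SELECT *
    FROM (
        SELECT e.*,
               LAG(Stage) OVER (PARTITION BY OppId ORDER BY Extracted) AS prev_stage,
               LAG(Won)   OVER (PARTITION BY OppId ORDER BY Extracted) AS prev_won
        FROM all_extracts e
    ) t
    WHERE prev_stage IS NULL                          -- first appearance of the key
       OR Stage <> prev_stage                         -- tracked attribute changed
       OR COALESCE(Won, '') <> COALESCE(prev_won, '')
)
SELECT OppId, CustId, Stage, Won, LastModified,
       Extracted AS valid_from,
       LEAD(Extracted) OVER (PARTITION BY OppId ORDER BY Extracted) AS valid_to
FROM changes;

Comparing a hash of all tracked columns, as one answer below suggests, generalizes the change test beyond two columns.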
Interesting concept, and of course it would be a longer conversation than is possible in this forum to fully understand your business, stakeholders, data, etc. I can see that it might work if you had a relatively small volume of data, your source systems rarely changed, your reporting requirements (and hence, data marts) also rarely changed, and you only needed to spin up these data marts very infrequently.
My concerns would be:
If your source or target requirements change, how are you going to handle this? You will need to spin up your data mart, do full regression testing on it, apply your changes, and then test them. If you do this as/when the changes are known, then it's a lot of effort for a data mart that's not being used - especially if you need to do this multiple times between uses; if you do this when the data mart is needed, then you're not meeting your objective of having the data mart available for "instant" use.
I'm not sure your statement "we have a DW as code that can be deleted, updated, and recreated without the complexity that goes along with traditional DW change management" is true. How are you going to test updates to your code without spinning up the data mart(s) and going through a standard test cycle with data - and then how is this different from traditional DW change management?
What happens if there is corrupt/unexpected data in your source systems? In a "normal" DW, where you are loading data daily, this would normally be noticed and fixed on the day. In your solution the dodgy data might have occurred days or weeks ago and, assuming it loaded into your data mart rather than erroring on load, you would need processes in place to spot it, and then potentially have to unravel days of SCD records to fix the problem.
(Only relevant if you have a significant volume of data.) Given the low cost of storage, I'm not sure I see the benefit of spinning up a data mart when needed as opposed to just holding the data so it's ready for use. Loading large volumes of data every time you spin up a data mart is going to be time-consuming and expensive. A possible hybrid approach might be to run incremental loads only when the data mart is needed, rather than running them every day - so you have the data from when the data mart was last used ready to go at all times, and you just add the records created/updated since the last load.
I don't know whether this is the best approach or not, but I've seen it done. When you build your initial SCD-II table, add a column that stores a HASH() value of all of the values of the record (you can exclude the primary key). Then you can create an external table over your incoming full data set each day, which includes the same HASH() function. Now you can execute a MERGE or INSERT/UPDATE against your SCD-II based on the primary key and whether the HASH value has changed.
The main advantage of doing things this way is that you avoid loading all of the data into Snowflake each day to do the comparison, though it will be slower to execute. You could also load into a temp table with the HASH() function included in your COPY INTO statement, then update your SCD-II and drop the temp table, which could actually be faster.
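Putting those two pieces together, the merge might look like this (a sketch in Snowflake-flavored SQL; the table and column names are illustrative):

-- close the current version when the hash differs; brand-new keys are inserted.
-- Note a MERGE cannot both close the old row and insert the new version for
-- the same key, so changed rows typically need a follow-up INSERT.
MERGE INTO scd2 t
USING (
    SELECT OppId, CustId, Stage, Won, LastModified,
           HASH(CustId, Stage, Won, LastModified) AS row_hash
    FROM ext_crm_extract      -- external table over today's csv
) s
ON t.OppId = s.OppId AND t.valid_to IS NULL
WHEN MATCHED AND t.row_hash <> s.row_hash THEN
    UPDATE SET valid_to = CURRENT_DATE()
WHEN NOT MATCHED THEN
    INSERT (OppId, CustId, Stage, Won, LastModified, row_hash, valid_from, valid_to)
    VALUES (s.OppId, s.CustId, s.Stage, s.Won, s.LastModified, s.row_hash, CURRENT_DATE(), NULL);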

Best database storage for matching products from offers

I have the following problem: I have products, offers, and their parameters (about 300,000,000 rows in MySQL). Based on the offer parameters and their ratings (parameters are dynamic, and every parameter type has a different rating) I must join offers to products. Of course there will be a lot of updates, deletes, and inserts (for example, around 5000 req/s).
The second piece of functionality is sending this joined information via an API. Does anyone have recommendations on what NoSQL database, relational database, or something similar to use for storage?
Edit
I'll show my example on a small sample of data in MySQL:
Offer
+----------+-----------------+
| offer_id | name |
+----------+-----------------+
| 1 | iphone_se_black |
| 2 | iphone_se_red |
| 3 | iphone_se_white |
+----------+-----------------+
Parameter_rating
+--------------+----------------+--------+
| parameter_id | parameter_name | rating |
+--------------+----------------+--------+
| 1 | os | 10 |
| 2 | processor | 10 |
| 3 | ram | 10 |
| 4 | color | 1 |
+--------------+----------------+--------+
Parameter value
+----+--------------+----------------+
| id | parameter_id | value |
+----+--------------+----------------+
| 1 | 1 | iOS |
| 2 | 2 | some_processor |
| 3 | 3 | 2GB |
| 4 | 4 | black |
| 5 | 4 | red |
| 6 | 4 | white |
+----+--------------+----------------+
Parameter_to_value
+----------+--------------------+
| offer_id | parameter_value_id |
+----------+--------------------+
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 2 | 1 |
| 2 | 2 |
| 2 | 3 |
| 2 | 5 |
| 3 | 1 |
| 3 | 2 |
| 3 | 3 |
| 3 | 6 |
+----------+--------------------+
and based on this data I must return that offers 1, 2, and 3 are one product.
The biggest problem is that the data changes often - for example, prices change, offers are removed, etc. Therefore I don't think MySQL is the most suitable technology, and I'm trying to choose another.
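Whatever the platform ends up being, the grouping itself can be expressed directly against the sample tables (a sketch in MySQL; the rating threshold of 5 for "parameters that matter" is an assumption):

-- offers sharing the same set of high-rated parameter values form one product
SELECT grp.value_set,
       GROUP_CONCAT(grp.offer_id) AS offers_in_product
FROM (
    SELECT ptv.offer_id,
           GROUP_CONCAT(pv.id ORDER BY pv.id) AS value_set
    FROM Parameter_to_value ptv
    JOIN Parameter_value  pv ON pv.id = ptv.parameter_value_id
    JOIN Parameter_rating pr ON pr.parameter_id = pv.parameter_id
    WHERE pr.rating >= 5      -- ignore low-rated parameters such as color
    GROUP BY ptv.offer_id
) grp
GROUP BY grp.value_set;

On the sample data this yields one product containing offers 1, 2, and 3, since they differ only in color (rating 1).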
Platform
any recommendations what NoSQL, relational database or something similar to use for storage?
Therefore, I do not think that MySQL is the most suitable technology and I try to choose another.
All of that is ordinary fare for a Relational database. Tens of thousands of banks run trading and pricing systems that are extremely active, serving hundreds of thousands of users, on such systems. Every day. The changes you allude to are normal on such systems (eg. pricing and pricing basis change all the time, in response to Buys & Sells).
But they use genuine SQL platforms. Freeware/shareware/vapourware/nowhere suites such as MySQL and PostgreSQL are neither SQL-compliant, nor viable platforms for high-throughput OLTP systems (no server architecture; no ACID Transactions; etc). They are still implementing the basics that SQL platforms have had since 1984, which is very difficult (impossible!) because they do not have a server architecture.
Therefore MySQL and PostgreSQL are not suitable, for reasons of abject performance; zero concurrency; etc - not for any database design concerns.
For an appreciation of the value of a genuine OLTP Server Architecture, refer to Oracle vs Sybase ASE. Although the article deals with Oracle explicitly, it applies to all freeware because all freeware has the same non-architecture that Oracle has. Actually, even less than Oracle. You get what you pay for.
Data Analysis
This answer is limited to Relational databases; SQL, its designated data sublanguage; and a genuine, commercially viable, SQL platform.
It appears the system supports an auction of some kind, which means you have to maintain an inventory of available/sold items. The database design that is required is quite ordinary.
However, your question is not clear enough to be answered. You are making many assumptions, that we are not party to. Allow me to ask some leading questions, which you need to consider and answer (update your Question):
what are the fundamental things that the system transacts operations against ?
(products such as phones ?)
how are those things identified ?
(Not the ID, but how humans identify each thing)
what are the properties of those things ?
(please, not "parameter" ... maybe OS; RAM; Processor; Colour) ?
Then property values can be understood
(You can't mess with the attributes of a thing unless you hold and maintain the thing)
what are the operations or transactions against those things ?
(a) internal or admin transactions
(eg. AddProperty; AddPropertyValue; AddProduct; etc)
(b) external or online user transactions
(eg. BidProduct [offer to buy]; CloseBid; etc)
who are the operators, to which those transactions are permitted ?
(eg. Admins; product suppliers; online bidders; etc)
I can't make any sense of your Parameter_to_value, please explain
What is rating ? Some kind of weighting for the property vs the other properties, or something the bidders declare ?
Database Design • Tentative
This might take a few iterations.
Don't worry about ID fields on each and every file: first we have to understand the data, how it relates to other data, and how it is identified. We can add ID fields at the end.
Note
All my data models are rendered in IDEF1X, the Standard for modelling Relational databases since 1993
My IDEF1X Introduction is essential reading for beginners.
The IDEF1X Anatomy is a refresher for those who have lapsed.
If you have trouble reading the Predicates from the Data Model, let me know and I will produce them in text form.

Mapping Tables of Database in NiFi

Here is my requirement.
I have a big table in Vertica say base_table as follows.
base_table
ID | path     | service | experience
20 | /abc/xyz | trz     | moderate
22 | /wer/cmz | brd     | professional
Mapping Tables
map_table1
path_id | path
1       | /abc/xyz
map_table2
exp_id | experience
1      | beginner
Final Table
ID | path_id | service | exp_id
20 | 1       | trz     | -
22 | -       | brd     | 2
In the first case, I need to get path_id 1, since the path is already present in map_table1 as well as in the base table, and insert that record into the final table.
In the second case, I need to insert a row with exp_id 2 into map_table2, since the experience 'professional' is not present in that table, and then insert the mapped record into the final table.
Which processors should I go for, and what should the flow look like in NiFi?
I am not sure I understand your question correctly, but to generalize: you want to insert a record if it does not exist, and then get the value of the corresponding ID (which may or may not have existed before).
The good news is that NiFi can easily work with a database like Vertica; have a look at the QueryDatabaseTable processor.
The challenge, however, is that NiFi is designed to efficiently handle many individual messages, and is therefore designed not to be very context-aware. For your use case you would probably want a tool that is built to work with tables. In general the solution for this would be Spark, or perhaps it can be built into your database with some stored procedures.
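That said, the lookup-or-insert step itself is plain SQL; in NiFi it could, for instance, be issued with a PutSQL processor followed by an ExecuteSQL lookup. A sketch against the example's map_table2 (MAX+1 id generation is an assumption; a sequence is safer under concurrency):

-- add 'professional' only if it is not mapped yet
INSERT INTO map_table2 (exp_id, experience)
SELECT m.next_id, 'professional'
FROM (SELECT COALESCE(MAX(exp_id), 0) + 1 AS next_id FROM map_table2) m
WHERE NOT EXISTS (SELECT 1 FROM map_table2 WHERE experience = 'professional');

-- then fetch the id, which now exists either way
SELECT exp_id FROM map_table2 WHERE experience = 'professional';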

Another way to build database structure

I have to optimize my little-big database, because it's too slow; maybe we'll find another solution together.
First of all, let's talk about the data stored in the database. There are two objects: users and, let's say, messages.
Users
There is something like that:
+----+---------+-------+------+
| id | user_id | login | etc  |
+----+---------+-------+------+
| 1  | 100001  | A     | .... |
| 2  | 100002  | B     | .... |
| 3  | 100003  | C     | .... |
| .. | ....... | ...   | .... |
+----+---------+-------+------+
There is no problem with this table. (Don't be afraid of id and user_id; user_id is used by another application, so it has to be here.)
Messages
And the second table has a problem. Each user has messages, for example like this:
+----+---------+------+-----+
| id | user_id | from | to  |
+----+---------+------+-----+
| 1  | 1       | aab  | bbc |
| 2  | 2       | vfd  | gfg |
| 3  | 1       | aab  | bbc |
| 4  | 1       | fge  | gfg |
| 5  | 3       | aab  | gdf |
| .. | ....... | ...  | ... |
+----+---------+------+-----+
There is no need to edit messages, but there should be a way to update the list of messages for a user. For example, an external service sends all of a user's messages to the db and the list has to be updated.
And the most important thing is that there are about 30 million users, and the average user has 500+ messages. Another problem is that I have to search through the from field and count the number of matches. I designed a simple SQL query with a join, but it takes too much time to get the data.
So... it's quite a big amount of data. I decided not to use RDS (I used PostgreSQL) and decided to move to databases like ClickHouse and so on.
However, I ran into the problem that ClickHouse, for example, doesn't support the UPDATE statement.
To resolve this issue I decided to store all of a user's messages as one row. Here I'd like to store the messages in JSON format:

{"from":"aaa", "to":"bbe"}
{"from":"ret", "to":"fdd"}
{"from":"gfd", "to":"dgf"}

so the Messages table would look like this, where hash is a hash of the messages column:

+----+---------+----------+------+
| id | user_id | messages | hash |
+----+---------+----------+------+
I think that full-text search inside the messages column will save some time, resources, and so on.
Do you have any ideas? :)
In ClickHouse, the optimal way is to store data in a "big flat table".
So, you store every message in a separate row.
15 billion rows is OK for ClickHouse, even on a single node.
Also, it's reasonable to have each user's attributes directly in the messages table (pre-joined), so you don't need to do JOINs. This is suitable if the user attributes are not updated.
These attributes will have repeated values for each user's messages - that's OK, because ClickHouse compresses data well, especially repeated values.
If users' attributes are updated, consider storing the users table in a separate database and using the 'External dictionaries' feature to join it.
If a message is updated, just don't update it. Write another row with the modified message to the table instead, and leave the old message as is.
It's important to have the right primary key for your table. You should use a table from the MergeTree family, which constantly reorders data by primary key and so maintains the efficiency of range queries. The primary key is not required to be unique; for example, you could define the primary key as just (from) if you frequently filter on from = ... and those queries must be processed in a short time.
You could also use user_id as the primary key: if queries by user id are frequent and must be processed as fast as possible - but then queries with a predicate on from will scan the whole table (mind that ClickHouse does full scans efficiently).
If you need fast lookups by many different attributes, you can just duplicate the table with different primary keys. Typically the table will compress well enough that you can afford to keep the data in a few copies, sorted differently for different range queries.
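For illustration, the "big flat table" might be declared like this (a sketch; the column set and types are assumptions, and from/to need quoting because they are keywords):

CREATE TABLE messages
(
    user_id UInt64,
    `from`  String,
    `to`    String,
    ts      DateTime
)
ENGINE = MergeTree
ORDER BY (user_id, ts);   -- primary key tuned for per-user queries

A duplicate copy with ORDER BY (`from`, ts) would then serve the frequent from = ... lookups, per the last point above.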
First of all, with such a big dataset, the from and to columns should be integers if possible, as their comparison is faster.
Second, you should consider creating proper indexes (see the sketch after this answer). As each user has relatively few records (500 compared to 30M in total), this should give you a huge performance benefit.
If everything else fails, consider using partitions:
https://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
In your case they would be dynamic and would hinder first-time inserts immensely, so I would consider them only as a last, if very efficient, resort.
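For reference, the indexes suggested above would be (a sketch; the table and column names follow the example, and "from" must be quoted since it is a reserved word in PostgreSQL):

CREATE INDEX messages_user_id_idx ON messages (user_id);
CREATE INDEX messages_from_idx ON messages ("from");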
