I was profiling an application that uses Cassandra, and it turned out that reads were the bottleneck. On closer inspection they seem to take way too long, and I would really appreciate some help in understanding why.
The application always reads the whole set of rows for a given partition key (the query is of the form SELECT * FROM table WHERE partition_key = ?). Unsurprisingly, the read time is O(number of rows for the partition key); however, the constant seems way too high. After examining the query trace, it turns out that the majority of the time is spent on "Merging data from memtables and sstables".
This step takes over 200ms for a partition key with ~5000 rows, where each row consists of 9 columns and is less than 100 bytes. Given the read throughput of an SSD, reading 0.5MB sequentially should take next to no time.
Actually, I doubt this has anything to do with I/O at all. The machine used to have a spinning disk, which was replaced with the SSD it has now, and the change had no impact on query performance. I think there is something very involved in how Cassandra processes the data, or how it reads it off disk, that makes this operation so expensive.
Merging from more than one SSTable or iterating over tombstoned cells does not explain this. First of all, it should take milliseconds; second, it happens consistently, regardless of whether 2 or 4 SSTables are involved and whether or not there are tombstoned cells.
To give some background:
Hardware: The machine running Cassandra is an 8-core, bare-metal, SSD-backed box. I query it from cqlsh on the machine itself, and the data is stored locally. There is no other load on it, and looking at iostat, there is barely any I/O.
Data model: The partition key, PK, is of text type; the primary key is a composite of the partition key and a bigint column K; the rest are 7 mutable columns. The schema creation command is listed below.
CREATE TABLE inboxes (
PK text,
K bigint,
A boolean,
B boolean,
C boolean,
D boolean,
E bigint,
F bigint,
G int,
PRIMARY KEY (PK, K)
) WITH CLUSTERING ORDER BY (K DESC);
Here is an example trace, with 3 SSTables involved and quite a large number of tombstones.
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------------------------+--------------+-------------+----------------
execute_cql3_query | 03:14:07,507 | 10.161.4.77 | 0
Parsing select * from table where PK = 'key_value' LIMIT 10000;| 03:14:07,508 | 10.161.4.77 | 123
Preparing statement | 03:14:07,508 | 10.161.4.77 | 244
Executing single-partition query on table | 03:14:07,509 | 10.161.4.77 | 1155
Acquiring sstable references | 03:14:07,509 | 10.161.4.77 | 1173
Merging memtable tombstones | 03:14:07,509 | 10.161.4.77 | 1195
Key cache hit for sstable 2906 | 03:14:07,509 | 10.161.4.77 | 1231
Seeking to partition beginning in data file | 03:14:07,509 | 10.161.4.77 | 1240
Key cache hit for sstable 1533 | 03:14:07,509 | 10.161.4.77 | 1550
Seeking to partition beginning in data file | 03:14:07,509 | 10.161.4.77 | 1561
Key cache hit for sstable 1316 | 03:14:07,509 | 10.161.4.77 | 1867
Seeking to partition beginning in data file | 03:14:07,509 | 10.161.4.77 | 1878
Merging data from memtables and 3 sstables | 03:14:07,510 | 10.161.4.77 | 2180
Read 5141 live and 1944 tombstoned cells | 03:14:07,646 | 10.161.4.77 | 138734
Request complete | 03:14:07,742 | 10.161.4.77 | 235030
You're not just "reading 0.5MB sequentially"; you're asking Cassandra to deserialize it into rows, filter out tombstones (deleted rows), and assemble a result set. 0.04ms per row is pretty reasonable; my rule of thumb is 0.5ms per 10 rows for an entire query.
Remember that Cassandra is optimized for the short requests typical of online applications; result sets of 10 to 100 rows are the norm. There is no parallelization within a single query.
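If the application can tolerate it, one way to stay in that sweet spot is to page through the partition by the clustering key instead of pulling all ~5000 rows in one request. A minimal CQL sketch against the schema above (the literal K value is only a placeholder for the last key seen on the previous page, not a value from the trace):

-- first page; CLUSTERING ORDER BY (K DESC) means rows come back newest-first
SELECT * FROM inboxes WHERE PK = 'key_value' LIMIT 100;

-- next page: continue below the smallest K returned by the previous page
SELECT * FROM inboxes WHERE PK = 'key_value' AND K < 1234567890 LIMIT 100;

Most drivers can also do this automatically with a fetch size and paging state, which keeps each round trip in the 10-to-100-row range the engine is tuned for.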
Related
I have a hypertable which looks like this:
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
-------------+---------+-----------+----------+---------+----------+-------------+--------------+-------------
state | text | | | | extended | | |
device | text | | | | extended | | |
time | bigint | | not null | | plain | | |
Indexes:
"device_state_time" btree ("time")
Triggers:
ts_insert_blocker BEFORE INSERT ON "device_state" FOR EACH ROW EXECUTE FUNCTION _timescaledb_internal.insert_blocker()
Child tables: _timescaledb_internal._hyper_4_2_chunk
Access method: heap
I have 100k devices, each sending its state at a different time interval. For example, device1 sends its state every second, device2 every day, device3 every 5 days, etc. And I MUST keep at least the 10 latest states for each device, so I can't really use the default data retention policy provided by Timescale.
Is there any way to achieve this efficiently other than manually selecting the latest 10 entries for each device and deleting the rest?
Thanks!
That sounds like a corner case because the chunks are time-based. Can you categorize these devices in advance?
Maybe you can insert data into different hypertables based on the insert timeframe if you still want to use the retention policies.
For example, on promscale, the solution uses one table for each metric, allowing users to redefine the retention policy for every metric.
It will depend on how you read the data later; maybe fragmenting it into several hypertables will make it harder.
Also, consider hacking the optional arguments of hypertable creation; maybe you can get something out of partitioning_func and time_partitioning_func.
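If you do end up pruning manually, it can at least be done in a single set-based statement per run. A rough sketch in plain PostgreSQL, assuming the table and columns from the question (device_state, device, time) and that (device, time) identifies a row; this is not a Timescale feature, just an ordinary DELETE driven by a window function:

-- remove everything except the 10 most recent states per device
DELETE FROM device_state ds
USING (
    SELECT device, "time",
           row_number() OVER (PARTITION BY device ORDER BY "time" DESC) AS rn
    FROM device_state
) ranked
WHERE ds.device = ranked.device
  AND ds."time" = ranked."time"
  AND ranked.rn > 10;

Scheduled once a day (cron or a user-defined action), this keeps the table bounded while guaranteeing the 10 latest states survive; the trade-off is a full scan of the hypertable on every run.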
Question
main question
How can I ephemerally materialize a slowly changing dimension (type 2) from a folder of daily extracts, where each CSV is one full extract of a table from a source system?
rationale
We're designing ephemeral data warehouses as data marts for end users that can be spun up and burned down without consequence. This requires we have all data in a lake/blob/bucket.
We're ripping daily full extracts because:
we couldn't reliably extract just the changeset (for reasons out of our control), and
we'd like to maintain a data lake with the "rawest" possible data.
challenge question
Is there a solution that could give me the state as of a specific date and not just the "newest" state?
existential question
Am I thinking about this completely backwards and there's a much easier way to do this?
Possible Approaches
custom dbt materialization
There's an insert_by_period dbt materialization in the dbt-utils package that I think might be exactly what I'm looking for. But I'm confused, as it would essentially have to behave like dbt snapshot, but:
run dbt snapshot for each file incrementally, all at once; and,
built directly off of an external table?
Delta Lake
I don't know much about Databricks's Delta Lake, but it seems like it should be possible with Delta Tables?
Fix the extraction job
Is our problem solved if we can make our extracts contain only what has changed since the previous extract?
Example
Suppose the following three files are in a folder of a data lake. (Gist with the 3 csvs and desired table outcome as csv).
I added the Extracted column in case parsing the timestamp from the filename is too tricky.
2020-09-14_CRM_extract.csv
| OppId | CustId | Stage | Won | LastModified | Extracted |
|-------|--------|-------------|-----|--------------|-----------|
| 1 | A | 2 - Qualify | | 9/1 | 9/14 |
| 2 | B | 3 - Propose | | 9/12 | 9/14 |
2020-09-15_CRM_extract.csv
| OppId | CustId | Stage | Won | LastModified | Extracted |
|-------|--------|-------------|-----|--------------|-----------|
| 1 | A | 2 - Qualify | | 9/1 | 9/15 |
| 2 | B | 4 - Closed | Y | 9/14 | 9/15 |
| 3 | C | 1 - Lead | | 9/14 | 9/15 |
2020-09-16_CRM_extract.csv
| OppId | CustId | Stage | Won | LastModified | Extracted |
|-------|--------|-------------|-----|--------------|-----------|
| 1 | A | 2 - Qualify | | 9/1 | 9/16 |
| 2 | B | 4 - Closed | Y | 9/14 | 9/16 |
| 3 | C | 2 - Qualify | | 9/15 | 9/16 |
End Result
Below is SCD-II for the three files as of 9/16. SCD-II as of 9/15 would be the same, except OppId=3 would have only one row, with valid_from=9/15 and valid_to=null.
| OppId | CustId | Stage | Won | LastModified | valid_from | valid_to |
|-------|--------|-------------|-----|--------------|------------|----------|
| 1 | A | 2 - Qualify | | 9/1 | 9/14 | null |
| 2 | B | 3 - Propose | | 9/12 | 9/14 | 9/15 |
| 2 | B | 4 - Closed | Y | 9/14 | 9/15 | null |
| 3 | C | 1 - Lead | | 9/14 | 9/15 | 9/16 |
| 3 | C | 2 - Qualify | | 9/15 | 9/16 | null |
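For what it's worth, here is the rough shape of the query I imagine producing that result once all the daily CSVs are stacked into one table; Postgres-flavored SQL, where raw_extracts is a made-up name for the stacked files, and hard deletes (rows that disappear from a later extract) are ignored:

WITH changes AS (
    SELECT *,
           LAG(Stage)        OVER w AS prev_stage,
           LAG(Won)          OVER w AS prev_won,
           LAG(LastModified) OVER w AS prev_last_modified
    FROM raw_extracts
    WINDOW w AS (PARTITION BY OppId ORDER BY Extracted)
),
versions AS (
    -- keep the first appearance of each key plus every row where something changed
    SELECT OppId, CustId, Stage, Won, LastModified, Extracted AS valid_from
    FROM changes
    WHERE prev_stage IS NULL
       OR (Stage, Won, LastModified) IS DISTINCT FROM (prev_stage, prev_won, prev_last_modified)
)
SELECT OppId, CustId, Stage, Won, LastModified, valid_from,
       LEAD(valid_from) OVER (PARTITION BY OppId ORDER BY valid_from) AS valid_to
FROM versions;

Restricting raw_extracts to files with Extracted <= some date before running this is also what makes me think the "as of a specific date" part of the challenge question is doable.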
Interesting concept, and of course it would be a longer conversation than is possible in this forum to fully understand your business, stakeholders, data, etc. I can see that it might work if you had a relatively small volume of data, your source systems rarely changed, your reporting requirements (and hence datamarts) also rarely changed, and you only needed to spin up these datamarts very infrequently.
My concerns would be:
If your source or target requirements change, how are you going to handle this? You will need to spin up your datamart, do full regression testing on it, apply your changes, and then test them. If you do this as/when the changes are known, then it's a lot of effort for a datamart that's not being used - especially if you need to do this multiple times between uses; if you do this when the datamart is needed, then you're not meeting your objective of having the datamart available for "instant" use.
Your statement "we have a DW as code that can be deleted, updated, and recreated without the complexity that goes along with traditional DW change management" I'm not sure is true. How are you going to test updates to your code without spinning up the datamart(s) and going through a standard test cycle with data - and then how is this different from traditional DW change management?
What happens if there is corrupt/unexpected data in your source systems? In a "normal" DW, where you are loading data daily, this would normally be noticed and fixed on the day. In your solution the dodgy data might have occurred days/weeks ago and, assuming it loaded into your datamart rather than erroring on load, you would need processes in place to spot it and then potentially have to unravel days of SCD records to fix the problem.
(Only relevant if you have a significant volume of data.) Given the low cost of storage, I'm not sure I see the benefit of spinning up a datamart when needed as opposed to just holding the data so it's ready for use. Loading large volumes of data every time you spin up a datamart is going to be time-consuming and expensive. A possible hybrid approach might be to only run incremental loads when the datamart is needed, rather than running them every day - so you always have the data from when the datamart was last used ready to go, and you just add the records created/updated since the last load.
I don't know whether this is the best or not, but I've seen it done. When you build your initial SCD-II table, add a column that is a stored HASH() value of all of the values of the record (you can exclude the primary key). Then, you can create an External Table over your incoming full data set each day, which includes the same HASH() function. Now, you can execute a MERGE or INSERT/UPDATE against your SCD-II based on primary key and whether the HASH value has changed.
Your main advantage in doing things this way is that you avoid loading all of the data into Snowflake each day to do the comparison, but it will be slower to execute. You could also load into a temp table with the HASH() function included in your COPY INTO statement, then update your SCD-II and drop the temp table, which could actually be faster.
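A rough Snowflake-flavored sketch of the external-table variant, where every table and column name is made up for illustration. Note that a single MERGE can close out changed rows and insert never-before-seen keys, but the replacement version of a changed row typically needs a second INSERT:

-- close out changed rows and add brand-new keys
MERGE INTO scd2_opportunity tgt
USING (
    SELECT OppId, CustId, Stage, Won, LastModified,
           HASH(CustId, Stage, Won, LastModified) AS row_hash
    FROM ext_crm_extract                -- external table over today's CSV
) src
    ON tgt.OppId = src.OppId AND tgt.valid_to IS NULL
WHEN MATCHED AND tgt.row_hash <> src.row_hash THEN
    UPDATE SET valid_to = CURRENT_DATE()
WHEN NOT MATCHED THEN
    INSERT (OppId, CustId, Stage, Won, LastModified, row_hash, valid_from, valid_to)
    VALUES (src.OppId, src.CustId, src.Stage, src.Won, src.LastModified,
            src.row_hash, CURRENT_DATE(), NULL);

-- second pass: open a new version for every row that was just closed out
INSERT INTO scd2_opportunity (OppId, CustId, Stage, Won, LastModified, row_hash, valid_from, valid_to)
SELECT src.OppId, src.CustId, src.Stage, src.Won, src.LastModified,
       HASH(src.CustId, src.Stage, src.Won, src.LastModified),
       CURRENT_DATE(), NULL
FROM ext_crm_extract src
JOIN scd2_opportunity tgt
  ON tgt.OppId = src.OppId AND tgt.valid_to = CURRENT_DATE();

The same two statements work against a temp table loaded via COPY INTO; the only difference is where the HASH() gets computed.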
Here is my requirement.
I have a big table in Vertica, say base_table, as follows.
base_table
ID | path | service | experience
20 | /abc/xyz | trz | moderate
22 | /wer/cmz | brd | professional
Mapping Tables
map_table1
path_id | path
1 | /abc/xyz
map_table2
exp_id | experience
1 | beginner
Final Table
ID | path_id | service | exp_id
20 | 1 | trz | -
22 | - | brd | 2
In the first case, I need to get the path_id 1, since the path value is present in map_table1 as well as in the base table, and insert that record into the final table.
In the second case, I need to insert the experience 'professional' into map_table2 with exp_id 2, since it is not yet present in that table, and then insert the record into the final table as well.
Which processors should I go for, and what should the flow look like in NiFi?
I am not sure if I understand your question correctly, but if I generalize the situation: you want to insert a record if it does not exist, and then get the value of the corresponding ID (which may or may not have existed before).
The good news is that NiFi can easily work with a database like Vertica, have a look at the QueryDatabaseTable processor.
The challenge here, however, is that NiFi is designed to efficiently handle many individual messages, and is therefore designed not to be very context-aware. For your use case you would probably want to use a tool that is built to work with tables. In general the solution for this would be Spark, or perhaps it can be built into your database with some procedures, as sketched below.
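For the in-database route, the whole thing can be expressed as two set-based statements per mapping table. A rough sketch in plain SQL, assuming that path_id and exp_id are populated by an identity column or sequence, and that every missing value should be added to the mapping table (both are assumptions, since the question does not say how the ids are assigned):

-- 1) add any experience values from the base table that the mapping table does not know yet
INSERT INTO map_table2 (experience)
SELECT DISTINCT b.experience
FROM base_table b
WHERE NOT EXISTS (SELECT 1 FROM map_table2 m WHERE m.experience = b.experience);

-- 2) build the final table by joining the lookups back onto the base table
INSERT INTO final_table (ID, path_id, service, exp_id)
SELECT b.ID, p.path_id, b.service, e.exp_id
FROM base_table b
LEFT JOIN map_table1 p ON p.path       = b.path
LEFT JOIN map_table2 e ON e.experience = b.experience;

A symmetric version of step 1 covers map_table1 if new paths can appear. In NiFi terms, the closest equivalent would be PutSQL or ExecuteSQL processors running these statements on a schedule, rather than trying to do the lookup-and-insert logic record by record.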
I have to optimize my little-big database, because it's too slow. Maybe we'll find another solution together.
First of all, let's talk about the data that is stored in the database. There are two kinds of objects: users and, let's say, messages.
Users
It looks something like this:
+----+---------+-------+-----+
| id | user_id | login | etc |
+----+---------+-------+-----+
| 1 | 100001 | A | ....|
| 2 | 100002 | B | ....|
| 3 | 100003 | C | ....|
|... | ...... | ... | ....|
+----+---------+-------+-----+
There is no problem with this table. (Don't be afraid of id and user_id; user_id is used by another application, so it has to be here.)
Messages
And the second table is where the problem is. Each user has messages, for example like this:
+----+---------+------+----+
| id | user_id | from | to |
+----+---------+------+----+
| 1 | 1 | aab | bbc|
| 2 | 2 | vfd | gfg|
| 3 | 1 | aab | bbc|
| 4 | 1 | fge | gfg|
| 5 | 3 | aab | gdf|
|... | ...... | ... | ...|
+----+---------+------+----+
There is no need to edit messages, but there should be a way to update the list of messages for a user. For example, an external service sends all of a user's messages to the DB and the list has to be updated.
And the most important thing is that there are about 30 million users, and the average user has 500+ messages. Another problem is that I have to search through the from field and count the number of matches. I designed a simple SQL query with a join, but it takes too much time to get the data.
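For context, the slow query is shaped roughly like this (PostgreSQL; 'aab' is just an example value, messages.user_id is assumed to reference users.id, and "from" has to be quoted because it is a reserved word):

SELECT u.user_id, count(*) AS matches
FROM users u
JOIN messages m ON m.user_id = u.id
WHERE m."from" = 'aab'
GROUP BY u.user_id;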
So... it's quite a big amount of data. I decided not to use an RDS (I used PostgreSQL) and decided to move to a database like ClickHouse or similar.
However, I ran into the problem that ClickHouse, for example, doesn't support the UPDATE statement.
To resolve this issue, I decided to store all of a user's messages as one row. So the Messages table would look like this:
Here I'd like to store messages in JSON format
{"from":"aaa", "to":bbe"}
{"from":"ret", "to":fdd"}
{"from":"gfd", "to":dgf"}
||
\/
+----+---------+----------+------+ And there I'd like to store the
| id | user_id | messages | hash | <= hash of the messages.
+----+---------+----------+------+
I think that full-text search inside the messages column would save some time, resources, and so on.
Do you have any ideas? :)
In ClickHouse, the most optimal way is to store data in "big flat table".
So, you store every message in a separate row.
15 billion rows is OK for ClickHouse, even on a single node.
Also, it's reasonable to have each user's attributes directly in the messages table (pre-joined), so you don't need to do JOINs. This is suitable if the user attributes are not updated.
These attributes will have repeated values for each of a user's messages - that's OK, because ClickHouse compresses data well, especially repeated values.
If users' attributes are updated, consider storing the users table in a separate database and using the 'External dictionaries' feature to join it.
If a message is updated, just don't update it. Write another row with the modified message to the table instead, and leave the old message as is.
It's important to have the right primary key for your table. You should use a table from the MergeTree family, which constantly reorders data by primary key and so maintains the efficiency of range queries. The primary key is not required to be unique; for example, you could define the primary key as just (from) if you frequently run queries with "from = ..." and those queries must be processed quickly.
Alternatively, you could use user_id as the primary key if queries by user id are frequent and must be processed as fast as possible - but then queries with a predicate on 'from' will scan the whole table (mind that ClickHouse does full scans efficiently).
If you need fast lookups by many different attributes, you can just duplicate the table with different primary keys. Typically the table will compress well enough that you can afford to keep the data in a few copies, ordered differently for different range queries.
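A minimal ClickHouse sketch of that "big flat table" idea; the column names are illustrative (from is renamed msg_from because it is awkward as an identifier), and in a MergeTree table the ORDER BY clause is what defines the primary key:

CREATE TABLE messages
(
    user_id     UInt64,
    login       String,      -- user attribute stored pre-joined with each message
    msg_from    String,
    msg_to      String,
    received_at DateTime
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(received_at)
ORDER BY (user_id, received_at);

A second copy of the same data with ORDER BY (msg_from, received_at) would then serve the "how many messages match this from value" queries without touching the per-user copy.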
First of all, with such a big dataset, the from and to columns should be integers if possible, since comparing them is faster.
Second, you should consider creating proper indexes, as shown below. Since each user has relatively few records (500, compared to 30M rows in total), this should give you a huge performance benefit.
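For example (PostgreSQL; the table and column names are taken from the question, and "from" must be quoted because it is a reserved word), an index leading on "from" lets the match-counting query read only the matching rows:

CREATE INDEX messages_from_user_idx ON messages ("from", user_id);

-- the counting query can then be answered from an index range scan
SELECT user_id, count(*) AS matches
FROM messages
WHERE "from" = 'aab'
GROUP BY user_id;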
If everything else fails, consider using partitions:
https://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
In your case they would be dynamic and would hinder first-time inserts immensely, so I would consider them only as a last, if very efficient, resort.
I've noticed that my meta_keys are getting pretty long, e.g. user_event_first_impression_ratings, and I retrieve most of the data with WordPress functions, e.g. get_post_meta($post_id, $meta_key);
I've thought about this often - there's no way to make the names shorter, because I've got a lot of different things going on, and not naming them like that would defeat the purpose, which is to understand quickly, in phpMyAdmin and in code, what is going on and where.
I've thought of making a table (in Excel, for example) where I give every meta_key a very short, 2-3 digit numeric code, replacing the keys, and then using that table to navigate the database and the code. I'm sure I would know all these codes by heart pretty soon.
Does meta_key length have any impact on the performance of queries and get_post_meta()?
String vs integer?
Let's leave query quality out of this and pretend that the query is well written.
If some of you are not familiar with the WordPress database, here's an example:
--------------------------------------------------------------------------
| meta_id (unique row nr) | post_id | meta_key | meta_value |
--------------------------------------------------------------------------
| 1 | 343 | my_event_color | red |
| 2 | 623 | my_event_id | 235 |
| 3 | 423 | my_event_length | 537644 |
| 4 | 243 | my_event_name | tortilla |
| 5 | 732 | my_event_is_xxx | 1 |
| ... | ... | ... | ... |
And so on for many, many, many rows - meta_id is the only unique number here.
To your first question: no. Or rather, the difference in performance between a long key and a short key is so tiny that it's not worth thinking about. So don't worry about your Excel reference table.
See the following:
https://dba.stackexchange.com/questions/91057/does-the-length-of-the-index-name-have-any-performance-impact
Table name or column name length affect performance?
To your second question: I don't really understand what you're asking.