How does sortkey in Redshift work internally? [closed] - database

I'm a beginner to Redshift and Data Warehouses in general.
When a numeric or timestamp column is specified as sortkey, does the Redshift DBMS use binary search during a query to find the desired row as efficiently as possible?
I feel that knowing more about this would improve my table design skill.

Amazon Redshift is a columnar datastore, which means that each column is stored separately. This is great for wide tables because Redshift only needs to read in the columns that are specifically used in the query. The most time-consuming part of database queries is disk access, so anything that reduces/avoids disk access is a good thing.
When data is stored on disk, it is stored in 1MB disk blocks. Each column can consume multiple blocks, and each block only contains data relating to one column. Redshift keeps a Zone Map of each block, which stores the minimum and maximum values stored in the block. For example, if a query is searching for data from 2021 and Redshift knows that the timestamp column for a particular block has a maximum value in 2018, it does not need to read the block from disk to examine the contents. This greatly reduces query time.
Data is stored in the blocks based upon the selected Compression Encoding. These are very clever techniques for reducing the storage space for data. For example, if a column contains a list of Countries and the rows are sorted in alphabetical order by country, then Redshift could simply store the fact that the block contains Jamaica x 63, then Japan x 104, then Jordan x 26. This might only require 24 bytes to store 193 rows of data, and don't forget that each block is 1MB in size. Thus, compression reduces the amount of disk access required to retrieve data, again making queries faster.
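For example, here is a hedged sketch of how you could inspect the encodings Redshift actually chose; the table name my_table is just a placeholder:
-- Encoding currently applied to each column of my_table
SELECT "column", type, encoding
FROM pg_table_def
WHERE tablename = 'my_table';

-- Ask Redshift to recommend encodings based on a sample of the table's data
ANALYZE COMPRESSION my_table;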
To answer your question about how Redshift would find the desired rows:
If the SORTKEY is used in a WHERE clause, then Redshift can quickly find the relevant blocks that potentially contain the desired data. I'm not sure whether it does that with a binary search.
If the WHERE clause does not use the SORTKEY, then finding the right rows is less efficient: because the data is not sorted on those columns, the matching rows can be scattered across many blocks on disk, so more blocks have to be read.
Redshift keeps Zone Maps for all columns, so it can still 'skip over' blocks whose Zone Maps show they cannot contain matching data, avoiding the need to read those blocks from disk. Plus, compression on the various columns can reduce the number of blocks that need to be read from disk.
The general rules for Amazon Redshift are:
Set the DISTKEY to the column that is most frequently used in JOIN
Set the SORTKEY to the column that is most frequently used in WHERE
See: Tuning query performance - Amazon Redshift
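Putting those two rules together, here is a minimal sketch of a table definition; the sales table and its columns are made up for illustration, not taken from the question:
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,          -- most frequently joined on, so chosen as DISTKEY
    sold_at     TIMESTAMP,       -- most frequently filtered on, so chosen as SORTKEY
    amount      DECIMAL(12,2)
)
DISTKEY (customer_id)
SORTKEY (sold_at);

-- A filter on the sort key lets Redshift use the Zone Maps to skip whole 1MB blocks
SELECT SUM(amount)
FROM sales
WHERE sold_at BETWEEN '2021-01-01' AND '2021-12-31';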

Related

Structuring database for financial data [closed]

I have a lot of stock price data saved in CSV files that I've been collecting for a while and intend to keep collecting, but now into a DB instead of CSVs.
There are 73 files (a file for each asset), each with around 2 million rows. The data is formatted the same way in all of them:
date, timestamp, open, high, low, close, volume
I want to create an individual table for each of the CSV files because:
For the uses I have in mind, I won't need more than one asset at a time.
I know 140 million rows isn't a heavy load for an RDBMS, but I think searching a table of 2M records would perform better than searching one of 140M.
By separating by asset I can put a unique constraint on a column (like date or timestamp) and prevent records from being duplicated.
Are any of those points a wrong assumption or bad practice? Is there a compelling reason to save them all in a single table?
I've read this question; although it deals with a similar problem, I don't think the answer applies to my case.
In case it wasn't clear, I don't have much experience with DBs, so guidance and educational answers are greatly appreciated.
I would store them in a single table just because I wouldn't have to maintain 73 tables.
If you update your data on a daily, weekly, or even monthly basis, you would have to insert into 73 tables from 73 CSV files, or maintain an automated script for that purpose, which I think is a bit much for this.
For the uses I have in mind, I won't need more than one asset at a time. -> I don't understand this.
By separating by asset I can put a unique constraint on a column (like date or timestamp) and prevent records from being duplicated. -> If you store them in a single table you could add an asset_id column, which identifies which asset each date/timestamp row belongs to.
I know 140 million rows isn't a heavy load for an RDBMS, but I think searching a table of 2M records would perform better than searching one of 140M. -> You could partition your table on date and asset_id; that is a much broader discussion, but with the details you've given, this is what I would do.
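As a sketch of the single-table approach (illustrative names only; the syntax shown is PostgreSQL-style):
CREATE TABLE price_bar (
    asset_id  INTEGER       NOT NULL,  -- identifies which of the 73 assets the row belongs to
    ts        TIMESTAMP     NOT NULL,  -- combined date + timestamp of the bar
    open      NUMERIC(18,6),
    high      NUMERIC(18,6),
    low       NUMERIC(18,6),
    close     NUMERIC(18,6),
    volume    BIGINT,
    UNIQUE (asset_id, ts)              -- prevents duplicate records per asset
);

-- Typical single-asset query; the index backing the unique constraint serves it
SELECT *
FROM price_bar
WHERE asset_id = 7
  AND ts BETWEEN '2022-01-01' AND '2022-01-31';
If the table ever grows far beyond 140M rows, it could additionally be partitioned on date and asset_id, as mentioned above.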

Redshift: Disadvantages of having a lot of nulls/empties in a large varchar column

I have a varchar column of max size 20,000 in my Redshift table. About 60% of the rows will have this column null or empty. What is the performance impact in such cases?
From this documentation I read:
Because Amazon Redshift compresses column data very effectively, creating columns much larger than necessary has minimal impact on the size of data tables. During processing for complex queries, however, intermediate query results might need to be stored in temporary tables. Because temporary tables are not compressed, unnecessarily large columns consume excessive memory and temporary disk space, which can affect query performance.
So this means query performance might be bad in this case. Is there any other disadvantage apart from this?
For storage in a Redshift table, there is no significant performance degradation; as the documentation suggests, compression encoding keeps the data compact.
However, when you query the column containing null values, extra processing is required, for instance when it is used in a WHERE clause. This might impact the performance of your query, so the actual cost depends on the query.
EDIT (answer to your comment) - Redshift stores each column in "blocks", and these blocks are sorted according to the sort key you specified. Redshift keeps a record of the min/max values of each block and can skip over any blocks that could not contain data to be returned. Query the disk space used by that particular column and compare its size against the other columns.
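For example, a hedged sketch of how to compare the on-disk footprint of each column (assuming a table named my_table; each block counted below is 1MB):
-- Number of 1MB blocks used per column of my_table
SELECT b.col, COUNT(*) AS mb_used
FROM stv_blocklist b
JOIN svv_table_info t ON t.table_id = b.tbl
WHERE t."table" = 'my_table'
GROUP BY b.col
ORDER BY b.col;
Note that the col numbers also include Redshift's hidden metadata columns at the end, so match them against the column order in your table definition.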
If I’ve made a bad assumption please comment and I’ll refocus my answer.

Huge database table design issue [closed]

In my DB design I've encountered an issue. My app consists of a workflow on a specific kind of media, which has 6 stages:
Resources
Ingest
Review
VideoRepair
Listing
backup
In all the stages the type of data being added (or updated) in the table is the same; only the column names change. For example, in Ingest we have the following columns:
CaptureSup_Name, Assign_DateByCaptureSup, AssignedCaptureOp_Name,
LastCapture_Date, LastCaptureOp_Name, LastCapture_Date,
In Review we have exactly the same columns, except that Ingest is replaced with Review, and almost the same holds for the other columns of the table (with one or two columns more or less). For the sake of better performance on SELECT queries, I decided not to assign one table to each stage (traditional normalization) and instead wrapped them all in one unified table.
Now my table has 30 columns, and that number scares me because I have never designed such a big table. Which of the following scenarios is most suitable for my case, considering that my database is supposed to support a huge amount of data (about 1,500 records added daily) and speed is vital?
Follow the traditional normalization approach and break my big table into 6 or 7 tables, each with about 5 or 6 columns, so that I have to write (n-1) joins to retrieve the complete data of a cycle.
Keep my current design (one table with 30 columns) and find a solution for reducing the size of the logs, because in this case the logs would be much larger (because of updates).
30 fields is not a wide table. SQL Server supports over 1,000 fields per table.
1,500 records per day is not a large volume of inserts. That is only about 500 thousand rows per year. SQL Server can handle billions of rows.
If you are having an issue with logs, then you have several options, such as:
switching to simple recovery mode;
periodically backing up the database and truncating the log; and,
using database checkpoints.
You should break your table up into other tables, if that makes sense in terms of the relational model -- this is likely, because 30 columns in a table would often be combinations of other entities. But your data structure is easily in the realm of what databases readily support. And, it can grow much, much larger with no problems (assuming none of your columns are really humongous).
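As a hedged sketch of the log-management options listed above (the database name MediaWorkflow is made up for illustration):
-- Switch to the SIMPLE recovery model so the log is truncated automatically at checkpoints
ALTER DATABASE MediaWorkflow SET RECOVERY SIMPLE;

-- Force a checkpoint so committed work is flushed and log space can be reused
CHECKPOINT;

-- Under the FULL recovery model you would instead back up the log periodically, e.g.:
-- BACKUP LOG MediaWorkflow TO DISK = 'D:\backups\MediaWorkflow.trn';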

Fastest persistent key/value db for fixed size keys and only insert/fetch (no delete/update)? [closed]

Given the following requirements of a persistent key/value store:
Only fetch, insert, and full iteration of all values (for exports) are required
No deleting values or updating values
Keys are always the same size
Code embedded in the host application
And given this usage pattern:
Fetches are random
Inserts and fetches are interleaved with no predictability
Keys are random, and inserted in random order
What is the best on-disk data structure/algorithm given the requirements?
Can a custom implementation exceed the performance of LSM-based (Log Structured Merge) implementations (i.e. leveldb, rocksdb)?
Would a high-performance custom implementation for these requirements also be considerably simpler to implement?
While it might be possible to build a custom implementation with better performance for your requirements, a well-configured RocksDB should be able to beat most such custom implementations in your case. Here is how I would configure RocksDB:
First, since you don't have updates and deletes, it's best to compact all data into big files in RocksDB. Because RocksDB sorts keys in a customizable order, having a few big files gives you faster read performance, as a binary search across a few big files is faster than across many small files. Typically, big files hurt the performance of compaction --- the process of reorganizing a large part of the data in RocksDB so that 1. multiple updates associated with a single key are merged; 2. deletions are executed to free disk space; 3. the data stays sorted. However, since you have no updates and deletes, big files give you fast read and write performance.
Second, specify a generous number of bits per key for the Bloom filter; this allows you to avoid most I/Os when you issue queries for keys that do not exist in RocksDB.
So a read request goes like this. First, RocksDB compares the query key with the key ranges of those big files to identify which file(s) might contain the query key. Then, for the file(s) whose key range covers the query key, it computes the Bloom filter bits of the query key to check whether the key may exist in those files. If so, a binary search inside each file is triggered to locate the matching data.
As for scan operations, since RocksDB internally sorts data in a user-customizable order, scans can be done very efficiently using RocksDB's iterator.
More information about RocksDB basics can be found here.

Is there a way for a database query to perform at O(1) [closed]

A query on a non-indexed column will result in O(n) because it has to search the entire table. Adding an index allows for O(log n) due to binary search. Is there a way for databases to achieve O(1) using the same technique a hash table uses (or perhaps another way) if you are searching on a unique key?
Hash-based indices are supported by some RDBMSs under certain conditions. For example, MySQL supports the syntax CREATE INDEX indexname USING HASH ON tablename (cols…), but only if the named table is stored in memory, not if it is stored on disk. Clustered tables are a special case.
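For example, a minimal MySQL sketch (the table and its columns are made up): a MEMORY table with an explicit hash index gives roughly O(1) equality lookups on the key:
CREATE TABLE session_cache (
    session_id CHAR(32) NOT NULL,
    user_id    INT      NOT NULL,
    PRIMARY KEY (session_id) USING HASH   -- hash index; honoured by the MEMORY engine
) ENGINE = MEMORY;

-- Equality lookups can use the hash index; range predicates cannot
SELECT user_id FROM session_cache WHERE session_id = 'abc123';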
I guess the main reason against widespread use of hash indices in RDBMSs is the fact that they scale poorly. Since disk I/O is expensive, a sparsely populated index requires lots of I/O for little gain in information. Therefore you would prefer a rather densely populated index (e.g. keep the filled portion between ⅓ and ⅔ at all times), which can be problematic in terms of hash collisions. But even more problematic: as you insert values, such a dense index might become too full, and you'd have to increase the size of the hash table fairly often. Doing so means completely rebuilding the hash table. That's an expensive operation, which requires lots of disk I/O and will likely block all concurrent queries on that table. Not something to look forward to.
A B-tree, on the other hand, can be extended without too much overhead. Even increasing its depth, which is the closest analogue to extending the size of a hash table, can be done more cheaply, and is required less often. Since B-trees tend to be shallow, and since disk I/O tends to outweigh anything you do in memory, they are still the preferred solution for most practical applications. Not to mention the fact that they provide cheap access to ranges of values, which isn't possible with a hash.
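To illustrate that last point, a quick sketch with a hypothetical orders table: the default B-tree index serves both equality and range predicates, which a hash index cannot do:
-- B-tree index (the default in most RDBMSs)
CREATE INDEX idx_orders_created_at ON orders (created_at);

-- Equality lookup: served by the B-tree (a hash index could also handle this)
SELECT * FROM orders WHERE created_at = '2024-03-01 00:00:00';

-- Range scan: served by the B-tree, but not possible with a hash index
SELECT * FROM orders WHERE created_at BETWEEN '2024-03-01' AND '2024-03-31';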
