I'm reading the book Designing Data-Intensive Applications and am currently on the chapter about partitioning, which describes secondary indexes: one example is the local index (document-partitioned) and the other is the global index, which is term-partitioned. The picture below shows an example of the global index.
The book says that global indexes perform better for reads, since the index can be read from a single partition based on the "term". What I don't get is this: does the index itself hold all rows that contain the term, or must further queries be sent after the index read to fetch the data from all partitions that might contain it? The latter would be only slightly more efficient than local indexes, where the query must be sent to all partitions, depending on the number of partitions.
In the summary of the chapter, the author wrote:
Term-partitioned indexes (global indexes), where the secondary indexes are partitioned separately, using the indexed values. An entry in the secondary index may include records from all partitions of the primary key. When a document is written, several partitions of the secondary index need to be updated; however, a read can be served from a single partition.
Am I missing something here?
Following the index read, must further queries be sent to fetch the data from all partitions that might contain it?
Yeah, I think that would be the case, because storing the data alongside the index (a clustered index) could lead to more painful consistency issues. In the worst case, I agree that you would need to do a scatter-gather query similar to document partitioning. But if your use case involves querying for a small amount of data that likely lives in a small subset of all your partitions, then a term-partitioned (global) index may be much better.
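A small sketch of the difference in plain Python (purely conceptual; the data, the hash-by-doc-id scheme, and the split-by-term-letter scheme are all made up for illustration). A local (document-partitioned) index forces a scatter-gather read across every data partition, while a global (term-partitioned) index lets you read the posting list for a term from a single index partition; you then fetch the matching documents, which in the worst case can still live on many data partitions:

```python
# Conceptual sketch: local vs global secondary index on "color".
# Data, partitioning schemes, and helper names are invented for illustration.
documents = {   # doc_id -> record, document-partitioned by doc_id % 3
    0: {"color": "red"}, 1: {"color": "blue"}, 2: {"color": "red"},
    3: {"color": "blue"}, 4: {"color": "red"}, 5: {"color": "green"},
}

def doc_partition(doc_id):            # which data partition holds the document
    return doc_id % 3

def term_partition(term):             # which index partition holds the term
    return 0 if term < "n" else 1

# Local (document-partitioned) index: each data partition indexes only its own docs.
local_index = {}
for doc_id, doc in documents.items():
    local_index.setdefault(doc_partition(doc_id), {}) \
               .setdefault(doc["color"], []).append(doc_id)

# Global (term-partitioned) index: one posting list per term, on one index partition.
global_index = {}
for doc_id, doc in documents.items():
    global_index.setdefault(term_partition(doc["color"]), {}) \
                .setdefault(doc["color"], []).append(doc_id)

def query_local(term):
    # Scatter-gather: every data partition must be asked for its posting list.
    return sorted(d for part in local_index.values() for d in part.get(term, []))

def query_global(term):
    # One read against the index partition that owns this term...
    doc_ids = global_index[term_partition(term)].get(term, [])
    # ...then fetch the documents, which may still span several data partitions.
    touched = {doc_partition(d) for d in doc_ids}
    return sorted(doc_ids), sorted(touched)

print(query_local("red"))    # [0, 2, 4], after asking all 3 data partitions
print(query_global("red"))   # ([0, 2, 4], [0, 1, 2]): one index read, then row fetches
```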
References:
https://docs.oracle.com/database/121/VLDBG/GUID-EE7C7B09-81BD-4996-8AC1-42A50D26FC25.htm
https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/
I was reading this link and trying to understand the relation between the two (micro-partitions and data clustering); it is quite confusing. Please explain.
Much like Gokhan's answer, but I would describe it differently.
Micro-partitions:
Every time you write data to Snowflake it's written to a new file, because the files are immutable. This means you end up with many fragments. But because Snowflake keeps metadata for tables, when you query it can prune the micro-partitions known not to contain the data being looked for. Otherwise it loads all the data (for the columns you select) and does a brute-force "full table scan".
Snowflake micro-partitions have no relation to classic partitioning, except that, if you are lucky, you might get pruning.
Also, the micro-partitions are written in the order you load the data, and are just "split" into more partitions as your writes go over a size threshold, much like you get from WinZip/7-Zip/gzip with a maximum file chunk size parameter.
The next thing to note is that if you update a row in a partition, the whole partition is rewritten, and the order of the updated rows is not controllable and can be random (based on your table join logic).
Thus if you do many writes, or many "small" updates, you will get very bad fragmentation of your partitions, which negatively impacts query compile time, as all the metadata needs to be loaded (which they are now charging for).
It is like this because S3 is an immutable file store, but this is also why you can separate compute from data. It's also how "time travel" and "days of data retention" work: the prior state of the table is not deleted for that period, which is why you pay for the S3 storage. It also means you should watch out for churn, as you pay for all data written, cumulatively, for those days.
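As a toy model of that behaviour, here is a minimal Python sketch (entirely illustrative; the file layout, names, and metadata are invented, not Snowflake internals): every load appends a new immutable file carrying min/max metadata for a column, and a query prunes the files whose range cannot contain the value being looked for.

```python
# Toy model of immutable micro-partition files with per-file min/max metadata.
# Names, sizes, and structure are illustrative, not Snowflake internals.
from dataclasses import dataclass

@dataclass(frozen=True)
class MicroPartition:
    rows: tuple        # immutable: an update would rewrite the whole file
    min_x: int
    max_x: int

def write_batch(files, batch):
    """Each load appends a brand-new file; existing files are never modified."""
    xs = [row["x"] for row in batch]
    return files + [MicroPartition(tuple(batch), min(xs), max(xs))]

def query_eq(files, x):
    """Prune files whose [min_x, max_x] range cannot contain x, scan the rest."""
    candidates = [f for f in files if f.min_x <= x <= f.max_x]
    hits = [row for f in candidates for row in f.rows if row["x"] == x]
    return hits, len(candidates), len(files)

files = []
files = write_batch(files, [{"x": 1}, {"x": 3}])
files = write_batch(files, [{"x": 7}, {"x": 9}])
files = write_batch(files, [{"x": 2}, {"x": 8}])   # wide value range: hard to prune

hits, scanned, total = query_eq(files, 7)
print(hits, f"-- scanned {scanned} of {total} files")
```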
Data-Clustering:
This is a way to indicate how you would like the data to be ordered. Either the legacy manual re-cluster commands or auto-clustering will rewrite the partitions to improve the clustering. Think of Norton SpeedDisk (if you are old school).
Writing to the table in the order you want it clustered (i.e. always have an ORDER BY on your INSERT) will improve things. But a table can only be clustered on one set of keys, so you need to think about how you mostly use the data before clustering it, or keep multiple copies of the data with the minimal subset/sort behaviour you need (we do this).
Warning: UPDATEs currently do not respect this clustering, and you can pay upwards of 4x the cost of a full table rewrite by having auto-clustering running; you need to watch this, as it's a potentially unbounded cost.
So in short, clustering is like a poor man's index, and Snowflake is basically a massive full-table-scan/map/reduce engine. But it's really good at that, and once you understand how it works, it's super fun to use.
Snowflake stores data in micro-partitions. Each micro-partition contains between 50 MB and 500 MB of uncompressed data. If you are familiar with partitions in traditional databases, micro-partitions are very similar, but they are generated automatically by Snowflake; you do not need to partition the table yourself as you do in traditional database systems.
Data clustering distributes the data into these micro-partitions based on a clustering key. If clustering is not enabled for your table, the table will still have micro-partitions, but the data will not be distributed based on a specific key.
Let's assume the X column in our table (t) has the unique values A, B, C, D, and we have 5 partitions:
P1: AABBC
P2: ABDAC
P3: BBBCA
P4: CBDCC
P5: BBCCD
If we run the query "SELECT * FROM t WHERE X=A", Snowflake needs to read partitions P1, P2, and P3. If the table is clustered on the X column, the data will be distributed like this (in theory):
P1: AAAAA
P2: BBBBB
P3: BBBBC
P4: CCCCC
P5: CCDDD
In this case, when we run the "SELECT * FROM t WHERE X=A" query, Snowflake only needs to read partition P1.
Micro-partitions (or partitioning in general) are very important when accessing a portion of the data in a large table, because Snowflake can prune partitions based on the filter conditions in your query. If the right key (column) is chosen for clustering, partition pruning becomes much more effective.
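The same example in a few lines of Python (illustrative only; the pruning decision here uses each partition's min/max value of X, which is roughly what value-range metadata gives you):

```python
# Reproduce the P1..P5 example: how many partitions a point query on X must
# read, with and without clustering. Illustrative only.
unclustered = ["AABBC", "ABDAC", "BBBCA", "CBDCC", "BBCCD"]
clustered   = ["AAAAA", "BBBBB", "BBBBC", "CCCCC", "CCDDD"]

def partitions_to_read(partitions, x):
    # A partition can be skipped only if its min/max value range excludes x.
    return [f"P{i + 1}" for i, p in enumerate(partitions) if min(p) <= x <= max(p)]

print(partitions_to_read(unclustered, "A"))  # ['P1', 'P2', 'P3']
print(partitions_to_read(clustered, "A"))    # ['P1']
```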
I have a table where my queries will be purely based on id and created_time. I have 50 other columns that will also be queried purely based on id and created_time. I can design it in two ways:
Multiple small tables with 5 columns each, covering all 50 parameters
A single table with all 50 columns, with id and created_at as the primary key
Which will be better? My row count will increase tremendously, so should I worry about the size of the column family while modelling?
Actually, you should use small tables to decrease the load on a single table, and you should also try to maintain query-based tables. If your queries read all 50 columns, then you can proceed with a single table. But if you plan to fetch only part of the data in each query, then you should maintain query-based small tables, which will redistribute the data evenly across the nodes, or maintain multiple partitions as Alex suggested (but then you cannot do range-based queries).
This really depends on how you structure your partition key and how the data is distributed inside a partition. CQL has some limits, such as a maximum of 2 billion cells per partition, but that is a theoretical limit; the practical limits are more like not having partitions bigger than 100 MB, etc. (DSE has recommendations in its planning guide).
If you will always search by both id and created_time, and not do range queries on created_time, then you could even use a composite partition key comprising both columns; this will distribute the data more evenly across the cluster. Otherwise, make sure that you don't have too much data inside each partition.
Or you can add another piece to the partition key; for example, people sometimes add a truncated date-time (time rounded to the hour or the day) to the partition key, but this will affect your queries. It really depends on them.
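Here is a minimal sketch of that bucketing idea in plain Python (illustrative only; the sensor id, timestamps, and day-sized bucket are made up). Adding a time bucket to the partition key bounds how large any one partition can grow and spreads one id's rows over more partitions:

```python
# Illustrative only: adding a time bucket to the partition key spreads one
# id's rows across more partitions (and therefore more nodes) and bounds
# how large any single partition can grow.
from datetime import datetime, timedelta

rows = [("sensor-42", datetime(2024, 1, 1) + timedelta(hours=h)) for h in range(48)]

# Partition key = (id): every row for this id lands in a single partition.
id_only_partitions = {(sensor,) for sensor, _ in rows}

# Partition key = (id, day): one partition per id per day.
id_day_partitions = {(sensor, ts.date()) for sensor, ts in rows}

print(len(id_only_partitions))  # 1
print(len(id_day_partitions))   # 2 (2024-01-01 and 2024-01-02)
```

The trade-off is exactly the one mentioned above: every query now has to supply the day bucket (or iterate over the buckets it needs).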
Sort of in line with what Alex mentions, the determining factor here is going to be the size of your various partitions (which is an extension of the size of your columns).
Practically speaking, you can have problems going both ways - partitions that are too narrow can be as problematic as partitions that are too wide, so this is the type of thing you may want to try benchmarking and seeing which works best. I suspect for normal data models (staying away from the pathological edge cases), either will work just fine, and you won't see a meaningful difference (assuming 3.11).
In 3.11.x, Cassandra does a better job of skipping unrequested values than in 3.0.x, so if you do choose to join it all in one table, do consider using 3.11.2 or whatever the latest available release is in the 3.11 (or newer) branch.
I'm new to databases. Recently I started using TimescaleDB, which is a PostgreSQL extension, so I guess this is also PostgreSQL related.
I observed a strange behavior. From my table structure (1 timestamp, 2 doubles), each row should take 24 bytes. I imported 2,750,182 rows from a CSV file (using psycopg2 copy_from), so I calculated the size should be about 63 MB. But when I query TimescaleDB, it tells me the table size is 137 MB, the index size is 100 MB, and the total is 237 MB. I expected the table size to match my calculation, but it doesn't. Any idea why?
There are two basic reasons your table is bigger than you expect:
1. Per tuple overhead in Postgres
2. Index size
Per tuple overhead: An answer to a related question goes into detail that I won't repeat here, but basically Postgres uses 23 (+ padding) bytes per row for various internal things, mostly multi-version concurrency control (MVCC) management (Bruce Momjian has some good intros if you want more info). That gets you pretty darn close to the 137 MB you are seeing. The rest might be due to the fill factor setting of the table, or to dead rows still present in the table from, say, a previous insert and a subsequent delete.
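A rough back-of-the-envelope in Python, assuming the standard Postgres heap layout (8 kB pages with a 24-byte page header, a 23-byte tuple header padded to 24 bytes, and a 4-byte line pointer per tuple); the exact numbers can shift with alignment, fill factor, and dead rows:

```python
# Back-of-the-envelope estimate of the on-disk heap size for the table above.
rows = 2_750_182
data_per_row = 8 + 8 + 8           # timestamp + two doubles = 24 bytes
tuple_header = 24                  # 23-byte heap tuple header, aligned to 24
line_pointer = 4                   # per-tuple item pointer in each page

bytes_per_row = data_per_row + tuple_header + line_pointer    # 52
page_size, page_header = 8192, 24
rows_per_page = (page_size - page_header) // bytes_per_row    # ~157

pages = -(-rows // rows_per_page)                             # ceiling division
print(f"raw data : {rows * data_per_row / 2**20:.0f} MiB")    # ~63 MiB
print(f"heap size: {pages * page_size / 2**20:.0f} MiB")      # ~137 MiB
```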
Index Size: Unlike some other DBMSs, Postgres does not organize its tables on disk around an index unless you manually cluster the table on an index, and even then it will not maintain the clustering over time (see https://www.postgresql.org/docs/10/static/sql-cluster.html). Rather, it keeps its indexes separately, which is why there is extra space for your index. If on-disk size is really important to you and you aren't using the index for, say, uniqueness constraint enforcement, you might consider a BRIN index, especially if your data is inserted in roughly sorted order (see https://www.postgresql.org/docs/10/static/brin-intro.html).
I have another question, but I'll be more specific.
I see that selecting from a million-row table takes < 1 second. What I don't understand is how it can do this with indexes. A disk seek seems to take 10 ms, so to finish in 1 second it must do fewer than 100 seeks. If there is an index entry per row, then 1M rows needs at least 1K blocks just to store the index entries (actually more if it's 8 bytes per entry: a 32-bit index value + a 32-bit key offset). Then we would still need to travel to the rows themselves and collect the data. How do databases keep the number of seeks low and pull the data as fast as they do?
One way is something called a 'clustered index', where the rows of the table are physically ordered according to the clustered index's sort. Then when you want to read a range of values along the indexed field, you find the first one, and you can just read the rest in sequentially with no extra seeks.
Also:
1) When reading an index, a large chunk of the index will be read in at once. If descending the B-tree (or moving along the children at the bottom, once you've found your range) moves you to another node already read into memory, you've saved an IO.
2) If the number of records the database statistically expects to retrieve is so high that the random access required to go from the index to the underlying rows would cost more IO operations than a table scan, then it will do a table scan instead. You can see this, e.g., using the query planner in SQL Server or PostgreSQL. But for small ranges the index is usually better, and the query plan will reflect this. (A rough seek estimate for an indexed lookup is sketched below.)
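As a rough illustration of why so few seeks are needed, here is a small Python estimate with assumed numbers (around 100 index entries per 8 kB page). With that fanout a B-tree over 1M rows is only about 3 levels deep, so a cold point lookup touches just a handful of pages, and the upper levels are usually already cached:

```python
# Rough estimate of B-tree depth and worst-case seeks for a point lookup.
# Page size and entries-per-page are assumptions for illustration only.
rows = 1_000_000
entries_per_page = 100       # e.g. ~8 kB page / ~80 bytes per index entry
seek_ms = 10

depth, capacity = 1, entries_per_page
while capacity < rows:       # how many levels until the tree can hold all rows
    depth += 1
    capacity *= entries_per_page

cold_seeks = depth + 1       # descend the tree, then fetch the table row itself
print(f"B-tree depth: {depth}")                               # 3
print(f"worst-case cold lookup: ~{cold_seeks * seek_ms} ms")  # ~40 ms
print(f"with upper levels cached: ~{2 * seek_ms} ms")         # leaf page + row
```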
How does indexing increase the performance of data retrieval?
How does indexing work?
Database products (RDBMSs) such as Oracle and MySQL build their own indexing systems. They give some control to database administrators, but nobody knows exactly what happens in the background except the people doing research in that area. So why indexing?
Put simply, database indexes help speed up retrieval of data. The other great benefit of indexes is that your server doesn't have to work as hard to get the data. They are much the same as book indexes, providing the database with quick jump points on where to find the full reference (or to find the database row).
There are many indexing techniques, for example:
Primary indexing, secondary indexing
B-trees and variants (B+-trees, B*-trees)
Hashing and variants (linear hashing, spiral, etc.)
For example, suppose you have a database whose primary keys are sorted and all the data is stored in blocks (on an HDD). Every time you want to access the data, you don't want to increase the access time (sometimes called transaction time or I/O time); the index tells you which data is stored in which block by using these primary keys.
Alice (the primary key is a name; not a good example, but it gives the idea)
Alice
...
...
AZ...
Bob
Bri
...
Bza
...
Now you have an index; in this index you store only Alice and Bob and the blocks they point to. This way, users can access the data faster. The RDBMS deals with the details.
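A tiny sketch of that sparse block index in Python (illustrative only; blocks are just Python lists and the "index" stores only the first key of each block):

```python
# Sparse index over sorted blocks: keep only the first key of each block,
# binary-search that small index, then scan a single block. Illustrative only.
from bisect import bisect_right

blocks = [
    ["Alice", "Amy", "Ann", "Aziz"],   # block 0
    ["Bob", "Bri", "Bza"],             # block 1
    ["Carl", "Cleo", "Dana"],          # block 2
]
index = [block[0] for block in blocks]         # ["Alice", "Bob", "Carl"]

def lookup(key):
    block_no = bisect_right(index, key) - 1    # last block whose first key <= key
    if block_no < 0:
        return None
    return key if key in blocks[block_no] else None

print(lookup("Bri"))   # "Bri" -- only block 1 is read
print(lookup("Zoe"))   # None
```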
I won't go into the details, but if you want to delve into these topics, I suggest you take a database course or look at this popular book, which is taught at most universities:
Database Management Systems by Ramakrishnan and Gehrke
Each index keeps the indexed fields stored separately, sorted (typically), and in a data structure that makes finding the right entries particularly easy. The database finds the entries in the index, then cross-references them to the entries in the table (except in the case of clustered indexes and covering indexes, where the index already has everything it needs). This cross-referencing takes time but is faster (you hope) than scanning the entire table.
A clustered index is where the rows themselves with all columns* are stored together with the index. Scanning clustered indexes is better than scanning non-clustered non-covering indexes because fewer lookups are required.
A covering index is where the query only requires columns which are part of the index, so the rest of the row does not need to be looked up (This is often good for performance).
* typically excluding blob / long text columns etc
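A rough sketch of that cross-referencing in Python (illustrative only; rows stands in for the table/heap and the index maps indexed values to row positions):

```python
# Non-clustered index modelled as a mapping: indexed value -> row positions.
# A covering query is answered from the index alone; otherwise each match
# costs an extra lookup ("cross-reference") into the table. Illustrative only.
rows = [
    {"id": 1, "city": "Oslo",  "name": "Ann",  "bio": "..."},
    {"id": 2, "city": "Paris", "name": "Bob",  "bio": "..."},
    {"id": 3, "city": "Oslo",  "name": "Cleo", "bio": "..."},
]

city_index = {}                                  # city -> list of row positions
for pos, row in enumerate(rows):
    city_index.setdefault(row["city"], []).append(pos)

# Covering: the query needs only 'city', which the index itself stores.
print(len(city_index.get("Oslo", [])))           # 2 -- no table lookups at all

# Not covering: 'name' is not in the index, so cross-reference into the table.
print([rows[pos]["name"] for pos in city_index.get("Oslo", [])])   # ['Ann', 'Cleo']
```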
How does an index in a book increase the ease with which you find the right page?
It's much easier to look through an alphabetical list and then go to the right page than to read every page.
This is a gross oversimplification, but in general, database indexing creates another list of some of the contents of the table, arranged in a way that lets the database engine find information quickly. By organizing the table contents deliberately, this eliminates the need to find a row of data by scanning the entire table, making searches far more efficient.
Indexes provide an optimal data structure for lookup queries. If your dataset changes a lot, you might consider the performance of updating/regenerating the index as well.
There are lots of open-source indexing engines, such as Lucene, available, and you can search online for detailed information about performance benchmarks.