Related
I was reading this link. and trying to understand what is the relation between both, it is quite confusing. Please explain
Much like Gokhan's answer, but I would describe it differently.
Micro-partitions:
Every time to write data to snowflake it's written to a new file, because the files are immutable. This means you have many fragments. But due to keep metadata for tables, when you query, Snowflake can prune tables known to not contain the data being looked for. Otherwise it load all data (for the columns you select) and does brute force "full table scans".
Snowflake micro-partitions have no relation to classic partitioning, except for if you are lucky you might get pruning.
Also the micro partitions are written how you load the data, and are just "split" into more partitions as your writes go over a threshold. Much like you get from WinZip/7Zip/Gzip with a max file chunk size parameter..
The next thing to note is if you update a ROW in a partition, the whole partition is rewritten. AND the order of the updated rows is not controllable and can be random (based of your table join logic).
Thus if you do many writes, or many "small" updates, you will have very bad fragmentation of your partitions, which very negatively impact compile time, as all the meta data needs to be loaded. Which they are now charging for.
This is like this because S3 is immutable file store, but this is also why you can separate compute form data. It's also how "timetravel" and "history days data retention" work, because the prior state of the table is not deleted for this time, thus why you pay for the S3 storage. Which also means watchout for your churn as you pay for all data written cumulatively for days
Data-Clustering:
Is a way to indicate how you would like the data to be ordered by. And ether the legacy manual cluster commands or the auto-clustering will rewrite the partitions to improve the clustering. Think of Norton SpeedDisk (if you are old school)
Writing to table in the order you want it clustered in (aka always have an ORDER BY on your INSERT), will improve things. But you can only have a table clustered on one set of "KEYS", thus you need to think about how your mostly use the data before clustering it. Or have multiple copies of the data with the min-sub-set/sort behavior we need (we do this).
warning: UPDATEs currently do not respect this clustering, and you can pay upwards of 4x the cost of a full table rewrite by having auto clustering running, you need to watch this as it's potentially unbounded cost.
So in short clustering is like poor man's indexes, and Snowflake is basically a massive full table scan/map/reduce processing. But it's really good at that, and when you understand how it works, it's super fun to use.
Snowflake stored data in micro-partitions. Each micro-partition contains between 50 MB and 500 MB of uncompressed data. If you are familiar with partition on traditional databases, micro-partitions are very similar to them but micro-partitions are automatically generated by Snowflake. You do not need to partition table as you need to do in traditional database systems.
Data Clustering is to distribute the data based on a clustering key into these micro-partitions. If clustering is not enabled for your table, your table will still have micro-partitions but the data will not be distributed based on a specific key.
Let's assume we have A, B, C, D unique values of X column in our table (t), and we have 5 partitions:
P1: AABBC
P2: ABDAC
P3: BBBCA
P4: CBDCC
P5: BBCCD
If we try to run "SELECT * FROM t WHERE X=A" query, Snowflake needs to read P1, P2, P3 partitions. If this table is clustered based on X column, the data will be distributed liked this (in theory):
P1: AAAAA
P2: BBBBB
P3: BBBBC
P4: CCCCC
P5: CCDDD
In this case, when we run "SELECT * FROM t WHERE X=A" query, Snowflake will need to read only P1 partition.
Micro-partitions (or partitioning) is very important when accessing a portion of data in a large table, because Snowflake can prune partitions based on your filter conditions in your query. If a right key (column) is defined for clustering, the partition pruning would be much more effective.
I have a table where my queries will be purely based on the id and created_time, I have the 50 other columns which will be queried purely based on the id and created_time, I can design it in two ways,
Either by multiple small tables with 5 column each for all 50 parameters
A single table with all 50 columns with id and created_at as primary
key
Which will be better, my rows will increase tremendously, so should I bother on the length of column family while modelling?
Actually, you need to have small tables to decrease the load on single table and should also try to maintain a query based table. If the query used contains the read statement to get all the 50 columns, then you can proceed with single table. But if you are planning to get part of data in each of your query, then you should maintain query based small tables which will redistribute the data evenly across the nodes or maintain multiple partitions as alex suggested(but you cannot get range based queries).
This really depends on how you structure of your partition key & distribution of data inside partition. CQL has some limits, like, max 2 billion cells per partitions, but this is a theoretical limit, and practical limits - something like, not having partitions bigger than 100Mb, etc. (DSE has recommendations in the planning guide).
If you'll always search by id & created_time, and not doing range queries on created_time, then you may even have the composite partition key comprising of both - this will distribute data more evenly across the cluster. Otherwise make sure that you don't have too much data inside partitions.
Or you can add another another piece into partition key, for example, sometimes people add the truncated date-time into partition key, for example, time rounded to hour, or to the day - but this will affect your queries. It's really depends on them.
Sort of in line with what Alex mentions, the determining factor here is going to be the size of your various partitions (which is an extension of the size of your columns).
Practically speaking, you can have problems going both ways - partitions that are too narrow can be as problematic as partitions that are too wide, so this is the type of thing you may want to try benchmarking and seeing which works best. I suspect for normal data models (staying away from the pathological edge cases), either will work just fine, and you won't see a meaningful difference (assuming 3.11).
In 3.11.x, Cassandra does a better job of skipping unrequested values than in 3.0.x, so if you do choose to join it all in one table, do consider using 3.11.2 or whatever the latest available release is in the 3.11 (or newer) branch.
I'm considering various ways to partition my data in SQL Server. One approach I'm looking at is to partition a particular huge table into 8 partitions, then within each of these partitions to partition on a different partition column. Is this even possible in SQL Server, or am I limited to definining one parition column+function+scheme per table?
I'm interested in the more general answer, but this strategy is one I'm considering for Distributed Partitioned View, where I'd partition the data under the first scheme using DPV to distribute the huge amount of data over 8 machines, and then on each machine partition that portion of the full table on another parition key in order to be able to drop (for example) sub-paritions as required.
You are incorrect that the partitioning key cannot be computed. Use a computed, persisted column for the key:
ALTER TABLE MYTABLE ADD PartitionID AS ISNULL(Column1 * Column2,0) persisted
I do it all the time, very simple.
The DPV across a set of Partitioned Tables is your only clean option to achieve this, something like a DPV across tblSales2007, tblSales2008, tblSales2009, and then each of the respective sales tables are partitioned again, but they could then be partitioned by a different key. There are some very good benefits in doing this in terms of operational resiliance (one partitioned table going offline does not take the DPV down - it can satisfy queries for the other timelines still)
The hack option is to create an arbitary hash of 2 columns and store this per record, and partition by it. You would have to generate this hash for every query / insertion etc since the partition key can not be computed, it must be a stored value. It's a hack and I suspect would lose more performance than you would gain.
You do have to be thinking of specific management issues / DR over data quantities though, if the data volumes are very large and you are accessing it in a primarily read mechanism then you should look into SQL 'Madison' which will scale enormously in both number of rows as well as overall size of data. But it really only suits the 99.9% read type data warehouse, it is not suitable for an OLTP.
I have production data sets sitting in the 'billions' bracket, and they reside on partitioned table systems and provide very good performance - although much of this is based on the hardware underlying a system, not the database itself. Scalaing up to this level is not an issue and I know of other's who have gone well beyond those quantities as well.
The max partitions per table remains at 1000, from what I remember of a conversation about this, it was a figure set by the testing performed - not a figure in place due to a technical limitation.
I'm not much of a database guru so I would like some advice.
Background
We have 4 tables that are currently stored in Sybase IQ. We don't currently have any choice over this, we're basically stuck with what someone else decided for us. Sybase IQ is a column-oriented database that is perfect for a data warehouse. Unfortunately, my project needs to do a lot of transactional updating (we're more of an operational database) so I'm looking for more mainstream alternatives.
Question
Given these tables' dimensions, would anyone consider SQL Server or Oracle to be a viable alternative?
Table 1 : 172 columns * 32 million rows
Table 2 : 453 columns * 7 million rows
Table 3 : 112 columns * 13 million rows
Table 4 : 147 columns * 2.5 million rows
Given the size of data what are the things I should be concerned about in terms of database choice, server configuration, memory, platform, etc.?
Yes, both should be able to handle your tables (if your server is suited for it). But, I would consider redesigning your database a bit. Even in a datawarehouse where you denormalize your data, a table with 453 columns is not normal.
It really depends on what's in the columns. If there are lots of big VARCHAR columns -- and they are frequently filled to near capacity -- then you could be in for some problems. If it's all integer data then you should be fine.
453 * 4 = 1812 # columns are 4 byte integers, row size is ~1.8k
453 * 255 = 115,515 # columns are VARCHAR(255), theoretical row size is ~112k
The rule of thumb is that row size should not exceed the disk block size, which is generally 8k. As you can see, your big table is not a problem in this regard if it consists entirely of 4-byte integers but if it consists of 255-char VARCHAR columns then you could be exceeding the limit substantially. This 8k limit used to be a hard limit in SQL Server but I think these days it's just a soft limit and performance guideline.
Note that VARCHAR columns don't necessarily consume memory commensurate with the size you specify for them. That is the max size, but they only consume as much as they need. If the actual data in the VARCHAR columns is always 3-4 chars long then size will be similar to that of integer columns regardless of whether you created them as VARCHAR(4) or VARCHAR(255).
The general rule is that you want row size to be small so that there are many rows per disk block, this reduces the number of disk reads necessary to scan the table. Once you get above 8k you have two reads per row.
Oracle has another potential problem which is that ANSI joins have a hard limit on the total number of columns in all tables in the join. You can avoid this by avoiding the Oracle ANSI join syntax. (There are equivalents that don't suffer from this bug.) I don't recall what the limit is or which versions it applies to (I don't think it's been fixed yet).
The numbers of rows you're talking about should be no problem at all, presuming you have adequate hardware.
With suitable sized hardware and I/O subsystem to meet your demands both are quite adequate - Wihlst you have a lot of columns the row counts are really very low - we regularily use datasets that are expressed in billions, not millions. (Just do not try it on SQL 2000 :) )
If you know your usages and I/O requirements, most I/O vendors will translate that into hardware specs for you. Memory, processors etc again is dependant on workloads that only you can model.
Oracle 11g has no problems with such data and structure.
More info at: http://neworacledba.blogspot.com/2008/05/database-limits.html
Regards.
Oracle limitations
SQL Server limitations
You might be close on SQL Server, depending on what data types you have in that 453 column table (note the bytes per row limitation, but also read the footnote). I know you said that this is normalized, but I suggest looking at your workflow and considering ways of reducing the column count.
Also, these tables are big enough that hardware considerations are a major issue with performance. You'll need an experienced DBA to help you spec and set up the server with either RDBMS. Properly configuring your disk subsystem will be vital. You will probably also want to consider table partitioning among other things to help with performance, but this all depends on exactly how the data is being used.
Based on your comments in the other answers I think what I'd recommend is:
1) Isolate which data is actually updated vs. which data is more or less read only (or infrequently)
2) Move the updated data to separate tables joined on an id to the bigger tables (deleting those columns from the big tables)
3) Do your OLTP transactions against the smaller, more relational tables
4) Use inner joins to hook back up to the big tables to retrieve data when necessary.
As others have noted you are trying to make the DB do both OLTP and OLAP at the same time and that is difficult. Server settings need to be tweaked differently for either scenario.
Either SQL Server or Oracle should work. I use census data as well and my giganto table has around 300+ columns. I use SQL Server 2005 and it complains that if all the columns were to be filled to their capacity it would exceed that max possible size for a record. We use our census data in an OLAP fashion, so it isn't such a big deal to have so many columns.
Are all of the columns in all of those tables updated by your application?
You could consider having data marts (AKA operational or online data store) that are updated during the day, and then the new records are migrated into the main warehouse at night? I say this because rows with massive amounts of columns are going to be slower to insert and update, so you may want to consider tailoring your specific online architecture to your application's update requirements.
Asking one DB to act as an operational and warehouse system at the same time is still a bit of a tall order. I would consider using SQL server or Oracle for operational system and having a separate DW for reporting and analytic, probably keeping the system you have.
Expect some table re-design and normalization to happen on the operational side to fit one-row per page limitations of row-based storage.
If you need to have fast updates of the DW, you may consider EP for ETL approach, as opposed to standard (scheduled) ETL.
Considering that you are in the early stage of this, take a look at Microsoft project Madison, which is auto-scalable DW appliance up to 100s TB. They have already shipped some installations.
I would very carefully consider switching from a column oriented database to a relational one. Column oriented databases are indeed inadequate for operational work as updates are very slow, but they are more than adequate for reporting and business intelligence support.
More often than not one has to split the operational work into a OLTP database containing the current activity needed for operations (accounts, inventory etc) and use an ETL process to populate the data warehouse (history, trends). A column oriented DW will beat hands down a relational one in almost any circumstance, so I wouldn't give up the Sybase IQ so easily. Perhaps you can design your system to have an operational OLTP side using your relational product of choice (I would choose SQL Server, but I'm biased) and keep the OLAP part you have now.
Sybase have a product called RAP that combines IQ with an in-memory instance of ASE (their relational database) which is designed to help in situations such as this.
Your data isn't so vast that you couldn't consider moving to a row-oriented database but, depending on the structure of the data, you could end up using considerably more disk space and slowing down many kinds of queries.
Disclaimer: I do work for Sybase but not currently on the ASE/IQ/RAP side.
Suppose you have a dense table with an integer primary key, where you know the table will contain 99% of all values from 0 to 1,000,000.
A super-efficient way to implement such a table is an array (or a flat file on disk), assuming a fixed record size.
Is there a way to achieve similar efficiency using a database?
Clarification - When stored in a simple table / array, access to entries are O(1) - just a memory read (or read from disk). As I understand, all databases store their nodes in trees, so they cannot achieve identical performance - access to an average node will take a few hops.
Perhaps I don't understand your question but a database is designed to handle data. I work with database all day long that have millions of rows. They are efficiency enough.
I don't know what your definition of "achieve similar efficiency using a database" means. In a database (from my experience) what are exactly trying to do matters with performance.
If you simply need a single record based on a primary key, the the database should be naturally efficient enough assuming it is properly structure (For example, 3NF).
Again, you need to design your database to be efficient for what you need. Furthermore, consider how you will write queries against the database in a given structure.
In my work, I've been able to cut query execution time from >15 minutes to 1 or 2 seconds simply by optimizing my joins, the where clause and overall query structure. Proper indexing, obviously, is also important.
Also, consider the database engine you are going to use. I've been assuming SQL server or MySql, but those may not be right. I've heard (but have never tested the idea) that SQLite is very quick - faster than either of the a fore mentioned. There are also many other options, I'm sure.
Update: Based on your explanation in the comments, I'd say no -- you can't. You are asking about mechanizes designed for two completely different things. A database persist data over a long amount of time and is usually optimized for many connections and data read/writes. In your description the data in an array, in memory is for a single program to access and that program owns the memory. It's not (usually) shared. I do not see how you could achieve the same performance.
Another thought: The absolute closest thing you could get to this, in SQL server specifically, is using a table variable. A table variable (in theory) is held in memory only. I've heard people refer to table variables as SQL server's "array". Any regular table write or create statements prompts the RDMS to write to the disk (I think, first the log and then to the data files). And large data reads can also cause the DB to write to private temp tables to store data for later or what-have.
There is not much you can do to specify how data will be physically stored in database. Most you can do is to specify if data and indices will be stored separately or data will be stored in one index tree (clustered index as Brian described).
But in your case this does not matter at all because of:
All databases heavily use caching. 1.000.000 of records hardly can exceed 1GB of memory, so your complete database will quickly be cached in database cache.
If you are reading single record at a time, main overhead you will see is accessing data over database protocol. Process goes something like this:
connect to database - open communication channel
send SQL text from application to database
database analyzes SQL (parse SQL, checks if SQL command is previously compiled, compiles command if it is first time issued, ...)
database executes SQL. After few executions data from your example will be cached in memory, so execution will be very fast.
database packs fetched records for transport to application
data is sent over communication channel
database component in application unpacks received data into some dataset representation (e.g. ADO.Net dataset)
In your scenario, executing SQL and finding records needs very little time compared to total time needed to get data from database to application. Even if you could force database to store data into array, there will be no visible gain.
If you've got a decent amount of records in a DB (and 1MM is decent, not really that big), then indexes are your friend.
You're talking about old fixed record length flat files. And yes, they are super-efficient compared to databases, but like structure/value arrays vs. classes, they just do not have the kind of features that we typically expect today.
Things like:
searching on different columns/combintations
variable length columns
nullable columns
editiablility
restructuring
concurrency control
transaction control
etc., etc.
Create a DB with an ID column and a bit column. Use a clustered index for the ID column (the ID column is your primary key). Insert all 1,000,000 elements (do so in order or it will be slow). This is kind of inefficient in terms of space (you're using nlgn space instead of n space).
I don't claim this is efficient, but it will be stored in a similar manner to how an array would have been stored.
Note that the ID column can be marked as being a counter in most DB systems, in which case you can just insert 1000000 items and it will do the counting for you. I am not sure if such a DB avoids explicitely storing the counter's value, but if it does then you'd only end up using n space)
When you have your primary key as a integer sequence it would be a good idea to have reverse index. This kind of makes sure that the contiguous values are spread apart in the index tree.
However, there is a catch - with reverse indexes you will not be able to do range searching.
The big question is: efficient for what?
for oracle ideas might include:
read access by id: index organized table (this might be what you are looking for)
insert only, no update: no indexes, no spare space
read access full table scan: compressed
high concurrent write when id comes from a sequence: reverse index
for the actual question, precisely as asked: write all rows in a single blob (the table contains one column, one row. You might be able to access this like an array, but I am not sure since I don't know what operations are possible on blobs. Even if it works I don't think this approach would be useful in any realistic scenario.