I have a scenario where I have to run parallel inserts/deletes on a Snowflake table.
For example: the table contains data related to different countries, and each insert pipe or thread will contain data for only a specific country.
Similarly, when I am running parallel deletes, each delete thread will delete data for only a specific country.
I was looking to partition the data in the Snowflake table by country, which might have helped in avoiding locks; however, it seems that option is not available in Snowflake.
Can you suggest how I can achieve parallel inserts/deletes and avoid any contention or locks?
Note: I am using Matillion to run different ELT jobs in parallel to do the inserts.
In Snowflake there is no option to define static partitions. It's a true SaaS product with almost zero administration.
Coming to your question about parallel inserts and deletes: in case there is a large delay, you can either scale up the warehouse or enable auto-scaling. The Snowflake Data Platform implements a powerful and unique form of partitioning, called micro-partitioning, that delivers all the advantages of static partitioning without the known limitations, as well as providing additional significant benefits.
You can also go for table clustering.
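Since static partitions are not an option, a clustering key on the country column is the closest equivalent. A minimal sketch, assuming a table and column called sales_data and country (both invented for illustration):

```sql
-- Sketch only: "sales_data" and "country" are assumed names based on the question.
-- A clustering key co-locates rows for the same country in the same
-- micro-partitions, which improves pruning for per-country DML,
-- though it does not by itself change Snowflake's locking behaviour.
ALTER TABLE sales_data CLUSTER BY (country);

-- A per-country delete can then prune most micro-partitions:
DELETE FROM sales_data WHERE country = 'DE';
```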
It is highly unlikely (almost impossible) that you will get locking on your data tables. If you are getting locking on the metadata tables then follow the solution suggested in #sergiu's link above.
If you have performance issues (but not table/record locking) then possible solutions include:
Larger warehouse (unlikely to improve performance much but worth trying if all else fails)
Auto-scale your warehouse (as suggested in the previous answer) so that more warehouses are running in parallel when there is a high workload
Run multiple warehouses: 1 per country or 1 per group of countries. No real benefit over auto-scaling unless there is significantly more data for some countries compared to others and there is a benefit to sizing the warehouses to match the data size
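If you try the auto-scaling route mentioned above, a multi-cluster warehouse is configured roughly like this; the warehouse name and sizing below are placeholders, and multi-cluster warehouses require Enterprise Edition or above:

```sql
-- Sketch only: name, size and cluster counts are illustrative.
-- In auto-scale mode Snowflake adds clusters when concurrent Matillion jobs
-- start queuing and removes them again when the load drops.
CREATE WAREHOUSE IF NOT EXISTS elt_wh
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4          -- upper bound on parallel clusters
  SCALING_POLICY    = 'STANDARD'
  AUTO_SUSPEND      = 300
  AUTO_RESUME       = TRUE;
```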
What kind of optimizer does Snowflake use, rule-based or cost-based? I could not find any documentation and need an explanation of how it works in order to write better queries.
I find "knowing the 'rules'" less helpful, than understanding what the system is doing as more helpful.
I have found describing it to new team members has massive table scans, that do map/reduce/merge joins.
You can make the tables scans faster by selecting the smallest set of columns needed to get the answer you need.
There is partition pruning, so if your data was inserted/sorted in order of x (1-2, 3-4, 5-6) and your query has x = 5, then the first two partitions will not be read.
Next, because it's all merge joins, equi-joins are the fastest thing to do. [Edit:] What this is trying to say is that at the order of millions of rows and up, joining 1M rows to 1M rows on complex join logic like a.v1 > b.v2 or a.v2 < b.v3, etc., means you more or less have to generate trillions of candidate row pairs and just try and see. Whereas if you can join on exact values (a.v1 = b.v1 and a.v2 = b.v2), the data can be sorted with respect to those keys and a merge join can be done, and your performance is very good (see sort-merge join on Wikipedia).
This means that sometimes reading from the same set of source tables many times in different CTEs and joining those can be the fastest way to process large volumes of data.
[Edit:] In the context of the above statement: in small-database SQL, people often write correlated sub-queries, because a) you can, so why not, and b) they can be fast on indexed databases. But in Snowflake, which has no indexes (and whose optimizer doesn't support most correlated sub-queries anyway), you should generally avoid them: read the data twice in two CTEs and join/left-join those via an equi-join to answer the question. The CTEs' tasks are independent, thus parallelisable, and the merge join is near-optimal. The waste of calculating (let's pretend) sub-totals for data that is not in the main join body is less than the gains from parallelism. (This holds best for queries in the 30-seconds-or-longer range, as opposed to speeding up sub-5-second queries.) But as with everything, have a base model, try/experiment, and poke at the slow stuff until you cannot restructure your data or query to make it any faster.
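A hedged illustration of the "read the data twice in two CTEs and equi-join" pattern; the orders table and its columns are invented for the example:

```sql
-- Illustrative only: "orders", "customer_id", "order_id", "amount" are made up.
-- Rather than a correlated sub-query per row, compute the sub-totals once in
-- their own CTE and join back on an equality key, which the engine can handle
-- with independent parallel scans and a merge-style join.
WITH order_detail AS (
    SELECT customer_id, order_id, amount
    FROM   orders
),
customer_totals AS (
    SELECT customer_id, SUM(amount) AS total_amount
    FROM   orders
    GROUP  BY customer_id
)
SELECT d.customer_id,
       d.order_id,
       d.amount,
       t.total_amount
FROM   order_detail d
JOIN   customer_totals t
  ON   t.customer_id = d.customer_id;   -- equi-join on the shared key
```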
As always, look at the profile of the executed query, look for areas where many rows are dropped, and think about how you can restructure the logic to push those restrictions earlier in the pipeline.
A brief description can be found in the following document:
The Snowflake Elastic Data Warehouse by Snowflake Computing
3.3.1 Query Management and Optimization
(...)
Snowflake's query optimizer follows a typical Cascades-style approach [28], with top-down cost-based optimization. All statistics used for optimization are automatically maintained on data load and updates. Since Snowflake does not use indices (cf. Section 3.3.3), the plan search space is smaller than in some other systems. The plan space is further reduced by postponing many decisions until execution time, for example the type of data distribution for joins. This design reduces the number of bad decisions made by the optimizer, increasing robustness at the cost of a small loss in peak performance. It also makes the system easier to use (performance becomes more predictable), which is in line with Snowflake's overall focus on service experience. Once the optimizer completes, the resulting execution plan is distributed to all the worker nodes that are part of the query. As the query executes, Cloud Services continuously tracks the state of the query to collect performance counters and detect node failures. All query information and statistics are stored for audits and performance analysis. (...)
Query Optimization:
Snowflake supports query vectorization and does some cost-based optimization, but first-time runs of queries typically take seconds to minutes. Snowflake has added local disk caching and also a result cache to speed up subsequent queries for repetitive workloads like reporting and dashboards.
Optimized Storage:
Snowflake has a micro-partition file system that is more optimized than raw S3 storage and supports partitioning and sorting with clustering keys.
Split the data into multiple small files to support optimal data loading into Snowflake.
IMPROVING QUERY PERFORMANCE
Consider implementing clustering keys for large tables.
Try to execute relatively homogeneous queries (size, complexity, data sets, etc.) on the same warehouse.
IMPROVING LOAD PERFORMANCE
Use bulk loading to get the data into tables in Snowflake. Consider splitting large data files so the load can be efficiently distributed across servers in a cluster.
Delete files that are no longer needed from internal stages. You may notice an improvement in performance in addition to saving on costs.
Isolate load and transform jobs from queries to prevent resource contention.
Dedicate separate warehouses for loading and querying operations to optimize performance for each.
Leverage the scalable compute layer to do the bulk of the data processing.
Consider using Snowpipe in micro-batching scenarios. Your query may benefit from cached results from a previous execution.
Use separate warehouses for your queries and load tasks. This will facilitate targeted provisioning of warehouses and avoid any resource contention between dissimilar operations.
Use a separate warehouse for large files.
The number and capacity of the servers in a warehouse determine how many data files can be loaded in parallel.
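A sketch of the split-and-bulk-load advice above; the stage, file format, table name and file pattern are all placeholders:

```sql
-- Placeholder names throughout; the point is the pattern, not the objects.
-- Compressed files of roughly 100-250 MB each tend to parallelise well
-- across the servers of a warehouse during COPY.
CREATE OR REPLACE FILE FORMAT my_csv_format
  TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1;

CREATE OR REPLACE STAGE my_load_stage FILE_FORMAT = my_csv_format;

-- Split the files before uploading them to the stage, then let COPY
-- fan the work out across the warehouse servers.
COPY INTO target_table
FROM @my_load_stage
PATTERN = '.*sales_part_[0-9]+\\.csv\\.gz'
ON_ERROR = 'ABORT_STATEMENT';
```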
Segment Data
Snowflake caches data in the virtual data warehouse, but it's still essential to segment data. Consider these best practices for data query performance:
Group users with common queries in the same virtual data warehouse to optimize data retrieval and use.
The Snowflake Query Profile supports query analysis to help identify and address performance concerns.
Snowflake draws from the same virtual data warehouse to support complex data science operations, business intelligence queries, and ELT data integration.
Scale-Up
Snowflake allows for a scale-up in the virtual data warehouse to better handle large workloads. When using scale-up to improve performance, make note of the following:
Snowflake supports fast and easy adjustments to the warehouse size to handle the workload.
It can also automatically suspend or resume the warehouse, with complete transparency for the user.
Snowflake's scale-up functionality supports the continually changing requirements for processing.
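Resizing is a one-line operation; a small sketch with a placeholder warehouse name:

```sql
-- Placeholder warehouse name. A resize applies to new and queued queries;
-- statements already running finish on the previous size.
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE';

-- Scale back down once the heavy workload has passed.
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'SMALL';
```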
Scale-Out
Snowflake supports the deployment of same-size clusters to support concurrency. Keep these points in mind for how scale-out can help performance optimization:
As users execute queries, the virtual data warehouse automatically adds clusters up to a fixed limit.
It can scale out in a more controlled way, instead of deploying one or more clusters of larger machines the way legacy data platforms do.
Snowflake automatically adjusts based on user queries, adding and removing clusters during peak and off-peak hours as needed.
As per this answer, it is recommended to go for a single table in Cassandra.
Cassandra 3.0
We are planning the schema below:
The second table has a composite primary key, PK(domain_id, item_id): domain_id is the partition key and item_id is the clustering key.
The GET request handler will read from two tables.
The POST request handler will write into two tables.
The PUT request handler will write into the details table only.
As per the CAP theorem:
What are the consistency issues in having a multi-table schema in Cassandra?
Can we avoid consistency issues in Cassandra with QUORUM, consistency levels, etc.?
recommended to go for a single table in Cassandra.
I would recommend the opposite. If you have to support multiple queries for the same data in Apache Cassandra, you should have one table for each query.
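For example, using the id_table and domain_details_table names that come up later in this answer (the column names and types are assumptions), the two query-specific tables might look like this:

```sql
-- CQL sketch; columns and types are guesses based on the question.
-- One table serves the "get all items for a domain" query...
CREATE TABLE domain_details_table (
    domain_id  uuid,
    item_id    uuid,
    details    text,
    PRIMARY KEY ((domain_id), item_id)
);

-- ...and a second, denormalised table serves the "look up the domain itself"
-- query. Both are written on every POST, which is why they can drift apart.
CREATE TABLE id_table (
    domain_id   uuid,
    domain_name text,
    PRIMARY KEY (domain_id)
);
```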
What are the consistency issues in having a multi-table schema in Cassandra?
Consistency issues between query tables can happen when writes are applied to one table but not the other(s). In that case, the application should have a way to gracefully handle it. If it becomes problematic, perhaps running a nightly job to keep them in-sync might be necessary.
You can also have consistency issues within a table. Maybe something happens (node crashes, down longer than 3 hours, hints not replayed) during the write process. In that case, a given data point may have only a subset of its intended replicas.
This scenario can be countered by running regularly-scheduled repairs. Additionally, consistency can be increased on a per-query basis (QUORUM vs. ONE, etc), and consistency levels of QUORUM and higher will occasionally trigger a read-repair (which syncs all replicas in the current operation).
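For instance, the consistency level can be raised per session in cqlsh (drivers expose the same setting per statement); the keyspace name and key value below are placeholders:

```sql
-- cqlsh sketch. With a replication factor of 3, QUORUM requires 2 of the
-- 3 replicas to respond, and a digest mismatch between them can trigger
-- a read repair for the data being read.
CONSISTENCY QUORUM;

SELECT * FROM my_keyspace.domain_details_table
WHERE  domain_id = 123e4567-e89b-12d3-a456-426614174000;
```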
Can we avoid consistency issues in Cassandra with QUORUM, consistency levels, etc.?
So Apache Cassandra was engineered to be highly available (HA), thereby embracing the paradigm of eventual consistency. Some might interpret that to mean Cassandra is inconsistent by design, and they would not be incorrect. I can say, after several years of supporting hundreds of clusters at web/retail scale, that consistency issues (while they do happen) are rare, and are usually caused by failures of components outside of a Cassandra cluster.
Ultimately though, it comes down to the business requirements of the application. For some applications like product reviews or recommendations, a little inconsistency shouldn't be a problem. On the other hand, things like location-based pricing may need a higher level of query consistency. And if 100% consistency is indeed a hard requirement, I would question whether or not Cassandra is the proper choice for data storage.
Edit
I did not get this: "Consistency issues between query tables can happen when writes are applied to one table but not the other(s)." When writes are applied to one table but not the other(s), what happens?
So let's say that a new domain is added. Perhaps a scenario arises where the domain_details_table gets updated, but the id_table does not. Nothing wrong here on the database side. Except that when the application expects to find that domain_id in the id_table, but cannot.
In that case, maybe the application can retry using a secondary index on domain_details_table.domain_id. It won't be fast, but the decision to be made is more around which scenario is more preferable; no answer, or a slow answer? Again, application requirements come into play here.
For your point: "You can also have consistency issues within a table. Maybe something happens (node crashes, down longer than 3 hours, hints not replayed) during the write process." How does RDBMS(like MySQL) deal with this?
So the answer to this used to be simple. RDBMSs only run on a single server, so there's only one replica to keep in-sync. But today, most RDBMSs have HA solutions which can be used, and thus have to be kept in-sync. In that case (from what I understand), most of them will asynchronously update the secondary replica(s), while restricting traffic only to the primary.
It's also good to remember that RDBMSs enforce consistency through locking strategies, as well. So even a single-instance RDBMS will lock a data point during an update, blocking any reads until the lock is released.
In a node-down scenario, a single-instance RDBMS will be completely offline, so instead of inconsistent data you'd have data loss. In an HA RDBMS scenario, there would be a short pause (during which you would likely encounter connection/query failures) until it has failed over to the new primary. Once the replica comes back up, there would probably be additional time needed to sync up the replicas before HA can be restored.
We have observed one problem in PostgreSQL: it doesn't use multiple CPU cores for a single query. For example, I have 8 cores. We have 40 million entries in the stock.move table. When we run a massive reporting query on a single database connection and watch the backend, we see only one core at 100% while the other 7 are idle. Because of that, query execution takes much longer and our Odoo system becomes slow. The problem is inside the PostgreSQL core: if we could somehow split a query across two or more cores, we would get a performance boost in PostgreSQL query execution.
I am sure that by solving parallel query execution we can make Odoo performance even faster. Does anyone have any suggestions regarding this?
----------- Edit: showing the answer from a PostgreSQL core contributor -----------
Here is the answer I got from one of the top contributors to the PostgreSQL database. (I hope this information is useful.)
Hello Hiren,
It is expected behaviour. PostgreSQL doesn't support parallel CPU for a single query. This topic is under heavy development, and this feature will probably be in the planned release 9.6 (~September 2016). But a table with 40M rows isn't too big, so more CPUs would probably not help you much (there is some overhead in starting and coordinating a multi-CPU query). You have to use the usual tricks like materialized views, pre-aggregations, etc.; the main idea of these tricks is to avoid repeating the same calculation often. Check the health of PostgreSQL: indexes, vacuum processing, statistics. Check the hardware: speed of IO. Check the PostgreSQL configuration: shared_buffers, work_mem. Some queries can be slow due to bad estimations; check the EXPLAIN output of slow queries. There are some tools that can break a query into more queries and start parallel execution, but I haven't used them: https://launchpad.net/stado http://www.pgpool.net/docs/latest/tutorial-en.html#parallel
Regards, Pavel Stehule
Well, I think you have your answer there: PostgreSQL does not currently support parallel query. The general performance advice is very apt, and you might also consider partitioning, which could allow you to truncate partitions instead of deleting parts of a table, or increasing memory allocation. It's impossible to give good advice on that without knowing more about the query.
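If you do revisit partitioning, here is a hedged sketch of the truncate-instead-of-delete idea; the columns are invented, and declarative partitioning like this needs PostgreSQL 10 or later (older versions would use inheritance-based partitioning):

```sql
-- Sketch only: columns are invented; stock_move mirrors the Odoo table name.
CREATE TABLE stock_move (
    id        bigint      NOT NULL,
    moved_at  timestamptz NOT NULL,
    qty       numeric
) PARTITION BY RANGE (moved_at);

CREATE TABLE stock_move_2016_q1 PARTITION OF stock_move
    FOR VALUES FROM ('2016-01-01') TO ('2016-04-01');

-- Emptying a quarter becomes a fast metadata-level operation
-- instead of a row-by-row DELETE over millions of rows:
TRUNCATE TABLE stock_move_2016_q1;
```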
Having had experience with this sort of issue on non-parallel query Oracle systems, I suggest that you also consider what hardware you're using.
The modern trend towards CPUs with very many cores is a great help for web servers or other multi-process systems with many short-lived transactions, but you have a data processing system with few, large transactions. You need the correct hardware to support that. CPUs with fewer, more powerful cores are a better choice, and you have to pay attention to bandwidth to memory and storage.
This is why engineered systems have been popular with big data and data warehousing.
I have a problem where I need to load a lot of data (5+ billion rows) into a database very quickly (ideally in less than 30 minutes, but quicker is better), and I was recently advised to look into PostgreSQL (I failed with MySQL and was looking at HBase/Cassandra). My setup is a cluster (currently 8 servers) that generates a lot of data, and I was thinking of running databases locally on each machine in the cluster so it writes quickly locally, and then at the end (or throughout the data generation) the data is merged together. The data is not in any order, so I don't care which specific server it's on (as long as it's eventually there).
My questions are: are there any good tutorials or places to learn about PostgreSQL auto-sharding (I found results of firms like Skype doing auto-sharding but no tutorials; I want to play with this myself)? Is what I'm trying to do possible? Because the data is not in any order, I was going to use an auto-incrementing ID number; will that cause a conflict if the data is merged? (This is not a big issue anymore.)
Update: Frank's idea below pretty much eliminated the auto-incrementing conflict issue I was asking about. The question now is basically: how can I learn about auto-sharding, and would it support distributed uploads of data to multiple servers?
First: Do you really need to insert the generated data from your cluster straight into a relational database? You don't mind merging it at the end anyway, so why bother inserting into a database at all? In your position I'd have your cluster nodes write flat files, probably gzip'd CSV data. I'd then bulk import and merge that data using a tool like pg_bulkload.
If you do need to insert directly into a relational database: that's (part of) what PgPool-II and (especially) PgBouncer are for. Configure PgBouncer to load-balance across the different nodes and you should be pretty much sorted.
Note that PostgreSQL is a transactional database with strong data durability guarantees. That also means that if you use it in a simplistic way, doing lots of small writes can be slow. You have to consider what trade-offs you're willing to make between data durability, speed, and cost of hardware.
At one extreme, each INSERT can be its own transaction that's synchronously committed to disk before returning success. This limits the number of transactions per second to the number of fsync()s your disk subsystem can do, which is often only in the tens or hundreds per second (without a battery-backed RAID controller). This is the default if you do nothing special and don't wrap your INSERTs in a BEGIN and COMMIT.
At the other extreme, you say "I really don't care if I lose all this data" and use unlogged tables for your inserts. This basically gives the database permission to throw your data away if it can't guarantee it's OK - say, after an OS crash, database crash, power loss, etc.
The middle ground is where you will probably want to be. This involves some combination of asynchronous commit, group commits (commit_delay and commit_siblings), batching inserts into groups wrapped in explicit BEGIN and END, etc. Instead of INSERT batching you could do COPY loads of a few thousand records at a time. All these things trade data durability off against speed.
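A hedged sketch of that middle ground; the table, columns, and file path are illustrative, not a tuned recommendation:

```sql
-- Illustrative only. Relaxing synchronous_commit means a crash can lose the
-- last few transactions, but never corrupts data that was already committed.
SET synchronous_commit = off;

-- Batch many rows per transaction instead of one transaction per INSERT...
BEGIN;
INSERT INTO measurements (instrument_id, ts, value) VALUES (1, now(), 20.5);
-- ... thousands more INSERTs ...
COMMIT;

-- ...or, better, load a few thousand records at a time with COPY:
COPY measurements (instrument_id, ts, value)
FROM '/tmp/measurements_batch_0001.csv' WITH (FORMAT csv);
```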
For fast bulk inserts you should also consider inserting into tables without any indexes except a primary key. Maybe not even that. Create the indexes once your bulk inserts are done. This will be a hell of a lot faster.
Here are a few things that might help:
The DB on each server should have a small metadata table with that server's unique characteristics, such as which server it is; servers can be numbered sequentially. Apart from the contents of that table, it's probably wise to try to keep the schema on each server as similar as possible.
With billions of rows you'll want bigint ids (or UUIDs or the like). With bigints, you could allocate a generous range for each server and set its sequence up to use it. E.g. server 1 gets 1..1000000000000000, server 2 gets 1000000000000001 to 2000000000000000, etc. (see the sketch after this list).
If the data is simple data points (like a temperature reading from exactly 10 instruments every second) you might get efficiency gains by storing it in a table with columns (time timestamp, values double precision[]) rather than the more correct (time timestamp, instrument_id int, value double precision). This is an explicit denormalisation in aid of efficiency. (I blogged about my own experience with this scheme.)
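The non-overlapping bigint ranges from the second tip might be set up like this (the table name and the exact numbers are purely illustrative):

```sql
-- Illustrative: run on server 2 of the cluster. Each server's sequence is
-- boxed into its own range, so ids never collide when the data is merged.
CREATE SEQUENCE events_id_seq
    MINVALUE   1000000000000001
    MAXVALUE   2000000000000000
    START WITH 1000000000000001
    NO CYCLE;

CREATE TABLE events (
    id      bigint PRIMARY KEY DEFAULT nextval('events_id_seq'),
    payload text
);
```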
Use Citus for PostgreSQL auto-sharding. Also, this link is helpful.
Sorry I don't have a tutorial at hand, but here's an outline of a possible solution:
Load one eighth of your data into a PG instance on each of the servers
For optimum load speed, don't use inserts but the COPY method
When the data is loaded, do not combine the eight databases into one. Instead, use plProxy to launch a single statement that queries all databases at once (or the right one to satisfy your query).
As already noted, keys might be an issue. Use non-overlapping sequences, UUIDs, or sequence numbers with a string prefix; it shouldn't be too hard to solve.
You should start with a COPY test on one of the servers and see how close to your 30-minute goal you can get. If your data is not important and you have a recent Postgresql version, you can try using unlogged tables which should be a lot faster (but not crash-safe). Sounds like a fun project, good luck.
You could use MySQL Cluster, which supports auto-sharding across a cluster.
I'm building a system which has the potential to require support for 500+ concurrent users, each making dozens of queries (selects, inserts AND updates) each minute. Based on these requirements and tables with many millions of rows I suspect that there will be the need to use database replication in the future to reduce some of the query load.
Having not used replication in the past, I am wondering if there is anything I need to consider in the schema design?
For instance, I was once told that it is necessary to use GUIDs for primary keys to enable replication. Is this true?
What special considerations or best practices for database design are there for a database that will be replicated?
Due to time constraints on the project I don't want to waste any time by implementing replication when it may not be needed. (I have enough definite problems to overcome at the moment without worrying about having to solve possible ones.) However, I don't want to have to make potentially avoidable schema changes when/if replication is required in the future.
Any other advice on this subject, including good places to learn about implementing replication, would also be appreciated.
While every row must have a rowguid column, you are not required to use a Guid for your primary key. In reality, you aren't even required to have a primary key (though you will be stoned to death for failing to create one). Even if you define your primary key as a guid, not making it the rowguid column will result in Replication Services creating an additional column for you. You definitely can do this, and it's not a bad idea, but it is by no means necessary nor particularly advantageous.
Here are some tips:
Keep table (or, rather, row) sizes small; unless you use column-level replication, you'll be downloading/uploading the entire contents of a row, even if only one column changes. Additionally, smaller tables make conflict resolution both easier and less frequent.
Don't use sequential or deterministic algorithm-driven primary keys. This includes identity columns. Yes, Replication Services will handle identity columns and allocating key allotments by itself, but it's a headache that you don't want to deal with. This alone is a great argument for using a Guid for your primary key.
Don't let your applications perform needless updates. This is obviously a bad idea to begin with, but this issue is made exponentially worse in replication scenarios, both from a bandwidth usage and a conflict resolution perspective.
You may want to use GUIDs for primary keys: in a replicated system, rows must be unique throughout your entire topology, and GUID PKs are one way of achieving this.
Here's a short article about use of GUIDs in SQL Server
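A hedged T-SQL sketch of a merge-replication-friendly table; dbo.Customer and its columns are invented. NEWSEQUENTIALID keeps the GUIDs roughly ordered (index-friendly), and marking the column ROWGUIDCOL means merge replication reuses it instead of adding its own:

```sql
-- Illustrative only: table and column names are made up.
-- The uniqueidentifier serves both as the primary key and as the rowguid
-- column that merge replication requires, so no extra column is added.
CREATE TABLE dbo.Customer (
    CustomerId  uniqueidentifier ROWGUIDCOL NOT NULL
                CONSTRAINT DF_Customer_Id DEFAULT (NEWSEQUENTIALID()),
    Name        nvarchar(200) NOT NULL,
    CONSTRAINT PK_Customer PRIMARY KEY (CustomerId)
);
```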
I'd say your real question is not how to handle replication, but how to handle scale out, or at least scale out for queryability. And while there are various answers to this conundrum, one answer will stand out: not using replication.
The problem with replication, especially with merge replication, is that writes get multiplied. Say you have a system which handles a load of 100 queries (90 reads and 10 writes) per second. You want to scale out and you choose replication. Now you have 2 systems, each handling 50 queries: 45 reads and 5 writes each. Now those writes have to be replicated, so the actual number of writes is not 5+5 but 5+5 (the original writes) plus another 5+5 (the replicated writes), so you have 90 reads and 20 writes. While the load on each system was reduced, the ratio of writes to reads has increased. This not only changes the IO patterns, but most importantly it changes the concurrency pattern of the load. Add a third system and you'll have 90 reads and 30 writes, and so on. Soon you'll have more writes than reads, and the replication update latency combined with the concurrency issues and merge conflicts will derail your project. The gist of it is that "soon" is much sooner than you expect. It is soon enough to justify looking into scale-up instead, since you're talking about a scale-out of 6-8 peers at best anyway, and a 6-8x capacity increase via scale-up will be faster, much simpler, and possibly even cheaper to start with.
And keep in mind that all these are purely theoretical numbers. In practice, the replication infrastructure is not free; it adds its own load on the system. Writes need to be tracked, changes have to be read, a distributor has to exist to store changes until they are distributed to subscribers, and then changes have to be written and mediated for possible conflicts. That's why I've seen very few deployments that could claim success with a replication-based scale-out strategy.
One alternative is to scale out only reads, and here replication does work, usually via transactional replication; but so does log-shipping or mirroring with a database snapshot.
The real alternative is partitioning (i.e. sharding). Requests are routed in the application to the proper partition and land on the server containing the appropriate data. Changes on one partition that need to be reflected on another partition are shipped via asynchronous (usually messaging-based) means. Data can only be joined within a partition. For a more detailed discussion of what I'm talking about, read how MySpace does it. Needless to say, such a strategy has a major impact on the application design and cannot simply be glued on after v1.