How to auto resize Snowflake warehouses up/down?

I know multi-cluster warehouses can have an autoscaling policy to scale out, but is there a way to automate resizing up or down? I have a set of queries that deal with varying sizes of data, which means I sometimes only need an S warehouse but sometimes need an XL. I don't think Snowflake provides a built-in mechanism for this, so I'm looking for advice on how to automate it, maybe with a stored procedure?

You can use the ALTER WAREHOUSE DDL to do what you describe and CALL a stored procedure prior to your queries.
Another alternative is to create a warehouse of each size, then run USE WAREHOUSE <foo> prior to your query, which should wake it up, run the query, and then suspend once it is inactive (although this comes with the disadvantage of not being able to reuse locally cached data).
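A minimal sketch of both approaches, assuming hypothetical warehouse names my_wh and my_wh_xl:

    -- resize an existing warehouse before the heavy queries, then shrink it back
    ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'XLARGE';
    -- ... run the large queries ...
    ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'SMALL';

    -- or switch to a pre-created warehouse of the right size instead
    USE WAREHOUSE my_wh_xl;

The ALTER statements could also live inside a stored procedure that you CALL at the start of the batch.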

Related

Does Snowflake have a workload management feature like Redshift?

In AWS Redshift we can manage query priority using WLM. Is there any such feature for Snowflake, or is it handled using a multi-warehouse strategy?
I think you've got the right idea that warehouses are typically the best approach to this problem in Snowflake.
If you have a high-priority query/process/account, it's entirely reasonable to give it a dedicated warehouse. That guarantees your query won't be competing for resources with queries running on other warehouses.
You can also then size that warehouse appropriately. If it's a small query, or a file-copy query, for example, and it's just really important that it runs right away, you can give it a dedicated Small/X-Small warehouse. If it's a big query that doesn't run very frequently, you can give it a larger warehouse. If you set it to auto-suspend, you won't even incur much extra cost for the extra dedicated compute.
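A minimal sketch of such a dedicated warehouse, assuming a hypothetical name priority_wh:

    CREATE WAREHOUSE IF NOT EXISTS priority_wh
      WAREHOUSE_SIZE = 'XSMALL'
      AUTO_SUSPEND   = 60      -- suspend after 60 seconds of inactivity
      AUTO_RESUME    = TRUE;   -- wake up automatically when a query arrives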

Shrinking pg_toast on RDS instance

I have a Postgres 9.6 RDS instance and it is growing by 1 GB a day. We have made some optimizations to the relation associated with the pg_toast table, but the pg_toast size is not changing.
Autovacuum is on, but since autovacuum/VACUUM FREEZE do not reclaim space and VACUUM FULL takes an exclusive lock, I am no longer sure what the best approach is.
The data in the table is core to our user experience, and although following this approach makes sense, it would take away the data our users expect to see for the duration of the VACUUM FULL.
What are the other options here to shrink the pg_toast?
Here is some data about the table sizes (attached as screenshots); you can see that the relation scoring_responsescore is the one associated with the pg_toast table. I have also attached the autovacuum settings and the output of the currently running autovacuum process for that specific pg_toast table, in case it helps.
VACUUM (FULL) is the only method PostgreSQL provides to reduce the size of a table.
Is the bloated TOAST table such a problem for you? TOAST tables are always accessed via the TOAST index, so the bloat shouldn't be a performance problem.
I know of two projects that provide table reorganization with only a short ACCESS EXCLUSIVE lock, namely pg_squeeze and pg_repack, but you probably won't be able to use those in an Amazon RDS database.
To keep the problem from getting worse, you should first try to raise autovacuum_vacuum_cost_limit to 2000 for the affected table, and if that doesn't do the trick, lower autovacuum_vacuum_cost_delay to 0. You can use ALTER TABLE to change the settings for a single table.
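For example, a sketch using the table name from the question (the exact relation and values may differ):

    -- per-table setting for the main relation
    ALTER TABLE scoring_responsescore SET (autovacuum_vacuum_cost_limit = 2000);
    -- the associated TOAST table has its own settings, prefixed with "toast."
    ALTER TABLE scoring_responsescore SET (toast.autovacuum_vacuum_cost_limit = 2000);
    -- and if that doesn't do the trick, remove the cost delay as well
    ALTER TABLE scoring_responsescore SET (toast.autovacuum_vacuum_cost_delay = 0);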
pg_repack still does not allow reducing the size of TOAST segments in RDS.
And in RDS we cannot run pg_repack with superuser privileges; we have to use the "--no-superuser-check" option, with which it will not be able to access the pg_toast.* tables.

How often is a table schema supposed to change, and how do you deal with this?

We are using Redshift at my workplace, and in the last week I have been working through a series of requests to change the schema of a certain table, which has become a very tedious daily process (involving updates to ETL jobs and Redshift views).
The process can be summarized to:
Change the ETL job that produces the raw data before loading it to Redshift
Temporarily modify a Redshift view that uses the underlying table so that the table can be modified (sketched below).
Modify the table (e.g. add/change/remove column(s))
Modify the view back to use the updated table.
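A minimal sketch of what steps 2-4 can look like in Redshift, using hypothetical view and table names (here the dependent view is simply dropped so the table can change, then recreated against the new schema):

    DROP VIEW reporting.v_orders;                                     -- release the dependency on the table
    ALTER TABLE staging.orders ADD COLUMN order_channel VARCHAR(32);  -- apply the schema change
    CREATE VIEW reporting.v_orders AS                                 -- recreate the view on the updated table
        SELECT order_id, order_date, order_channel
        FROM staging.orders;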
Of course, in the process there's testing involved and other time-consuming steps.
How often is it "natural" for a table schema to change? What are the best practices to deal with this without losing too much time or having to do all the "mechanic" process all over again?
Thanks!
This is one of the reasons that data warehouse automation tools exist. We know that users will change their minds when they see the warehouse, or as business requirements change. Automating the process means that everything you asked for could be delivered in a few mouse clicks.
You'll find a list of all the data warehouse automation products we know, on our web site, at http://ajilius.com/competitors/

Architecting a high-performing "inserting solution"

I am tasked with putting together a solution that can handle a high volume of inserts into a database. There will be many AJAX-type calls from web pages; it is not only one web site/page, but several different ones.
It will be dealing with tracking people's behavior on a web site, triggered by various JavaScript events, etc.
It is important for the solution to be able to handle the heavy database insert load.
After it has been inserted, I don't mind migrating the data to an alternative/supplementary data store.
We are initially looking at using the MEAN stack with MongoDB and migrating some data to MySQL for reporting purposes. I am also wondering about some sort of queueing before the inserts hit the DB, or caching with something like memcached.
I didn't manage to find much help on this elsewhere. I did see this post, but it is now close to 5 years old, feels a bit outdated, and doesn't quite ask the same questions.
Your thoughts and comments are most appreciated. Thanks.
Why do you need a stack at all? Are you looking for a web-application to do the inserting? Or do you already have an application?
It's doubtful any caching layer will outrun your NoSQL database for inserts, but you should probably confirm that you even need a NoSQL database. MySQL has pretty solid raw insert performance, as long as your load can be handled on a single box. Most NoSQL solutions scale better horizontally. This is probably worth a read. But realistically, if you already have MySQL in-house, and you separate your reporting from your insert instances, you will probably be fine with MySQL.
Some initial theory
To understand how you can optimize for a heavy insert workload, I suggest first understanding the main overheads involved in inserting data into a database. Once the various overheads are understood, all kinds of optimizations will come to you naturally. The bonus is that you will have more confidence in the solution, you will know more about databases, and you can apply these optimizations to multiple engines (MySQL, PostgreSQL, Oracle, etc.).
I will first give a non-exhaustive list of insertion overheads and then show simple solutions to avoid them.
1. SQL query overhead: In order to communicate with a database you first need to create a network connection to the server, pass credentials, get the credentials verified, serialize the data and send it over the network, and so on.
And once the query is accepted, it needs to be parsed, its grammar validated, its data types parsed and validated, the objects (tables, indexes, etc.) referenced by the query looked up and their access permissions checked, and so on. All of these steps (and I'm sure I forgot quite a few) represent significant overhead when inserting a single value. The overhead is so large that some databases, e.g. Oracle, have a SQL cache to avoid part of it.
Solution: Reuse database connections, use prepared statements, and insert many values with every SQL statement (thousands to hundreds of thousands).
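For example, a single multi-row INSERT (a sketch with hypothetical table and column names) replaces thousands of per-row round trips:

    INSERT INTO page_events (user_id, event_type, event_time) VALUES
        (101, 'click', '2024-01-01 00:00:01'),
        (102, 'view',  '2024-01-01 00:00:02'),
        (103, 'click', '2024-01-01 00:00:03');
    -- ... batched up to thousands of rows per statement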
2. Ensuring strong ACID guarantees: The ACID properties of a DB come at the cost of logging every logical and physical modification to the database ahead of time, and they require complex synchronization techniques (fine-grained locking and/or snapshot isolation). The actual time spent providing the ACID guarantees can be several orders of magnitude higher than the time it takes to actually copy a 200-byte row into a database page.
Solution: Disable undo/redo logging when you import data into a table. Alternatively, you could (1) drop the isolation level to trade weaker ACID guarantees for lower overhead, or (2) use asynchronous commit (a feature that allows the DB engine to complete an insert before the redo logs are properly hardened to disk).
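In PostgreSQL, for instance, the rough equivalents are asynchronous commit and unlogged tables (a sketch; weigh your durability requirements before using either):

    -- asynchronous commit: COMMIT returns before the WAL is flushed to disk
    SET synchronous_commit = off;

    -- an unlogged staging table skips WAL entirely (its data is lost after a crash)
    CREATE UNLOGGED TABLE page_events_staging (
        user_id    bigint,
        event_type text,
        event_time timestamptz
    );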
3. Updating the physical design / database constraints: Inserting a value into a table usually requires updating multiple indexes and materialized views and/or executing various triggers. These overheads can again easily dominate the insertion time itself.
Solution: Consider dropping all secondary data structures (indexes, materialized views, triggers) for the duration of the insert/import. Once the bulk of the inserts is done, you can re-create them. For example, it is significantly faster to create an index from scratch than to populate it through individual insertions.
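A sketch of that pattern in PostgreSQL-style syntax, with hypothetical index and table names:

    DROP INDEX IF EXISTS idx_page_events_user;                    -- before the bulk load
    -- ... run the bulk inserts ...
    CREATE INDEX idx_page_events_user ON page_events (user_id);   -- rebuild afterwards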
In practice
Now let's see how we can apply these concepts to your particular design. The main issue I see in your case is that the insert requests are sent by many distributed clients, so there is little opportunity for bulk processing of the inserts.
You could consider adding a caching layer in front of whatever database engine you end up choosing. I don't think memcached is a good fit for such a caching layer -- memcached is typically used to cache query results, not new insertions. I have personal experience with VoltDB and I definitely recommend it (I have no connection with the company). VoltDB is an in-memory, scale-out, relational DB optimized for transactional workloads that should give you orders of magnitude higher insert performance than MongoDB or MySQL. It is open source, but not all features are free, so I'm not sure whether you would need to pay for a license. If you cannot use VoltDB, you could look at the MEMORY engine for MySQL or other similar in-memory engines.
Another optimization to consider is using a different database for the analytics. Most likely, a database with a high data-ingest rate is quite bad at executing OLAP-style queries, and vice versa. Coming back to my recommendation, VoltDB is no exception and is also suboptimal at executing long analytical queries. The idea would be to create a background process that reads all new data in the frontend DB (i.e. the VoltDB cluster) and moves it in bulk to the backend DB for analytics (MongoDB, or maybe something more efficient). You can then apply all the optimizations above for the bulk data movement, create a rich set of additional index structures to speed up data access, run your favourite analytical queries, and save the results as a new set of tables/materialized views for later access. The import/analysis process can be repeated continuously in the background.
Tables are usually designed with the implied assumption that queries will far outnumber DML of all sorts, so the table is optimized for queries with indexes and such. If you have a table where DML (particularly inserts) will far outnumber queries, then you can go a long way just by eliminating any indexes, including the primary key. Keys and indexes can be added to the table(s) the data will be moved to and subsequently queried from.
Fronting your web application with a NoSQL table to handle the high insert rate, then moving the data more or less at your leisure to a standard relational DB for further processing, is a good idea.

DB Aggregation and BI Best Practices

We have a fair number of aggregating queries in our DB that are used for making (sometimes real-time) business decisions. Unfortunately, the pages that present these aggregates are some of the most frequently called, and the SPs are passed parameters by the page. The queries themselves have been tuned, but unfortunately each SP generates a handful of aggregate fields.
We're working on some performance tuning, and one of the tasks is to re-work how/where these aggregations are done.
Our thought is to possibly create SPs that perform some of these aggregations and store the results in a table. The page could then run a simpler select query on that table, still using a parameter to limit it to the correct data set. It wouldn't be as "real time", but it could be refreshed frequently enough.
The other suggested solution was to perform the aggregation queries in our DWH and, through SSIS, push the data back to a table in the prod DB. There is significantly less traffic on our DWH, so it could easily handle the heavy lifting.
What are your thoughts on ways to streamline querying and presenting what would typically be considered "reporting" data in a prod environment? The current SPs are called probably a couple of thousand times a day. Is pushing DWH data back to a prod DB against BI best practices? Is this something better done in a CLR proc (I'm not that familiar with CLR)?
To make use of pre-aggregated calculations and still answer all questions in near real time, this is what you could do:
Calculate an aggregate every minute/hour/day and store it in the database as an aggregated value. Then, when a query hits the database, you use those aggregates, and for the time not yet covered by an aggregate you use the raw data inserted after the final timestamp of the last aggregate. It requires a bit more coding, but this is the ultimate solution.
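A sketch of such a query, assuming hypothetical tables sales_agg_hourly (pre-aggregated rows, with hour_end marking how far the aggregate covers) and sales_raw (detail rows):

    SELECT region, SUM(amount) AS total_amount
    FROM (
        -- pre-aggregated history
        SELECT region, amount_sum AS amount FROM sales_agg_hourly
        UNION ALL
        -- raw rows inserted after the last aggregate was built
        SELECT region, amount
        FROM sales_raw
        WHERE event_time >= (SELECT MAX(hour_end) FROM sales_agg_hourly)
    ) AS combined
    GROUP BY region;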
I don't see any problem with pushing aggregate reporting data back to a prod DB from a data warehouse for delivery, but as a commenter mentioned, you would inherently introduce a delay which might not be acceptable for your 'real-time' decisions.
Another option, as long as the nature of your aggregation is relatively straightforward, is indexed views. Essentially you create a view of the source table at the level of aggregation required for reporting, and then put an index on it. The index means that SQL Server actually precalculates the view and stores it on disk like a real table, so a reporting query doesn't need to read the full detail of the source table and aggregate it.
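A minimal sketch with hypothetical table and view names (SCHEMABINDING and COUNT_BIG(*) are required, and the summed column must not be nullable):

    CREATE VIEW dbo.v_sales_by_region
    WITH SCHEMABINDING
    AS
    SELECT region,
           SUM(amount)  AS total_amount,
           COUNT_BIG(*) AS row_count
    FROM dbo.sales
    GROUP BY region;
    GO

    -- the unique clustered index is what materializes the view on disk
    CREATE UNIQUE CLUSTERED INDEX ix_v_sales_by_region
        ON dbo.v_sales_by_region (region);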
Another neat trick SQL Server can do with indexed views is 'aggregate awareness': even if you query the source table directly, if the planner realises it can get the answer from the indexed view more quickly, it will use it. You wouldn't even need to update your existing procs for them to benefit.
It probably sounds too good to be true, and in some ways it is, since there are a couple of significant downsides:
There are a large number of limitations on the SQL syntax you can use in the view: for example, you can use SUM and COUNT_BIG, but not MAX/MIN (which seems odd, to be honest); you can do inner but not outer joins; you can't use window functions. The list goes on...
Also, there will obviously be a performance cost when writing to the tables the view references, as the view has to be maintained. So you would want to test performance in your environment with a typical workload before turning something like this on in prod.
