WITH in Snowflake/SQL: Comparison to Temporary Tables

Other than coder preference, is there any reason to use a series of WITH (SELECT ...) clauses in Snowflake rather than creating several temporary tables? Are there performance or other issues that would lead one to prefer one method over the other?

I have some real-world experience with this. We followed a hybrid approach in our ETLs.
CTEs keep results in memory, which saves I/O cost while processing huge volumes of data. Based on our experience processing an ETL involving 10 billion rows, the CTE approach took 2 hours while the table approach took 4.5 hours. But we should choose our weapon carefully; CTEs will not perform well in all scenarios. In the scenarios below, you should do some testing before using CTEs:
When your ETL query has more than 7-8 steps
Query is a combination of large and small tables
Volume of data is low and there is an opportunity to reuse the transformation logic
Please also remember that CTEs cannot take advantage of metadata, so row-count estimates for joins and scans will be inaccurate and the optimizer will not be able to optimize the query path.
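For illustration only, here is a minimal sketch of the two styles in Snowflake SQL; the table and column names (raw_orders, order_id, and so on) are invented for the example, not taken from the question.

-- Chained-CTE style: intermediate results live only inside one statement.
WITH filtered AS (
    SELECT order_id, customer_id, amount
    FROM raw_orders
    WHERE order_date >= '2023-01-01'
),
aggregated AS (
    SELECT customer_id, SUM(amount) AS total_amount
    FROM filtered
    GROUP BY customer_id
)
SELECT * FROM aggregated;

-- Temporary-table style: each step is materialized, can be reused by later
-- steps, and the optimizer can see statistics on the intermediate result.
CREATE OR REPLACE TEMPORARY TABLE tmp_filtered AS
    SELECT order_id, customer_id, amount
    FROM raw_orders
    WHERE order_date >= '2023-01-01';

SELECT customer_id, SUM(amount) AS total_amount
FROM tmp_filtered
GROUP BY customer_id;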

Related

clickhouse schema design, predefined set of columns

I have multiple sources of input with different schemas. To do some analytics using ClickHouse, I thought of 2 approaches to handling the analytic workload, using a join or an aggregation operation:
Using a join involves defining a table corresponding to each input.
Using aggregate functions requires a single table with a predefined set of columns. The number of columns and their types would be based on my approximations, and may change in the future.
My question is: if I go with the second approach, defining lots of columns, let's say hundreds of columns, how does it affect performance, storage cost, etc.?
Generally speaking, a large table with all your values plus the use of aggregate functions is often the use case ClickHouse was designed for.
Various types of join-based queries only start being efficient on large datasets when the queries are distributed between machines. But if you can afford to keep your data on a single SSD RAID, try using a single table and aggregate functions.
Of course, that's generic advice, it really depends on your data.
As far as irregular data goes, depending on how varied it can be, you may want to look into using a dynamic solution (e.g. Spark or Elastic Search) or a database that supports "sparse" columns (e.g. Cassandra or ScyllaDb).
If you want to use Clickhouse for this, look into using arrays and tuples to hold them.
Overall, ClickHouse is pretty clever about compressing data, so adding a lot of empty values should be fine: they will barely increase query time and they won't occupy extra space. Queries are column-oriented, so if you don't need a column for a specific query, performance won't be affected by the simple fact that the column exists (unlike in a row-oriented RDBMS).
So even if your table has, say, 200 columns, as long as your query only uses 2 of those columns, it will be basically as efficient as if the table only had 2 columns. Also, the lower the granularity of a column, the faster the queries on that column (with some caveats). That being said, if you plan to query hundreds of columns in the same query... it's probably going to be fairly slow, but ClickHouse is very good at parallelizing work, so if your data is in the lower dozens of TB (uncompressed), a machine with some large SSDs and two Xeons will usually do the trick.
But, again, this all depends heavily on the dataset; you have to describe your data and the types of queries you need in order to get a more meaningful answer.
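As a rough sketch of the single-wide-table approach (the table, columns and engine settings below are hypothetical, not from the question): a query that touches only a couple of columns reads only those columns from disk.

-- Hypothetical wide table; columns a given source doesn't provide simply
-- stay at their defaults, which compresses very well.
CREATE TABLE events
(
    event_date  Date,
    source      String,
    metric_a    Float64,
    metric_b    Float64
    -- ... potentially hundreds more columns
)
ENGINE = MergeTree()
ORDER BY (source, event_date);

-- Only event_date, source and metric_a are read here,
-- regardless of how many other columns the table has.
SELECT source, avg(metric_a)
FROM events
WHERE event_date >= '2024-01-01'
GROUP BY source;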

Query optimization for hive table

We have a table which is 100 TB in size and we have multiple customers using the same table (i.e. every customer uses different WHERE conditions). Now the problem statement is that every time a customer queries the table it gets scanned from top to bottom.
This creates a lot of slowness for all queries. We cannot even partition/bucket the table based on any business keys. Can someone provide a solution or point to similar problem statements and their resolutions?
You can provide suggestions as well as alternative technologies so that we can pick the most suitable one. Thanks.
My 2 cents: experiment with an ORC table with GZip compression (default) and clever partitioning / ordering...
every SELECT that uses a partition key in its WHERE clause will do "partition pruning" and thus avoid scanning everything [OK, OK, you said you had no good candidate in your specific case, but in general it can be done so I had to mention it first]
then within each ORC file in scope, the min/max counters will be checked for "stripe pruning", limiting the I/O further
With clever partitioning & clever ordering of the data at INSERT time, using the most-frequent filters, the pruning can be quite efficient.
Then you can look into optimizations such as using a non-default ORC stripe size, a non-default "bytes-per-reducer" threshold, etc.
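To make the above concrete, here is a minimal HiveQL sketch, under the assumption that something like event_date is a usable partition key and customer_id is the most frequent filter; all table names, columns and settings are illustrative only.

-- Assumed settings for dynamic partition inserts.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- ORC table partitioned by a frequently filtered column, with the default
-- ZLIB (gzip-family) compression.
CREATE TABLE events_orc (
    customer_id  BIGINT,
    event_time   TIMESTAMP,
    payload      STRING
)
PARTITIONED BY (event_date STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');

-- Sorting on the most common filter column at INSERT time keeps related
-- values together, so stripe-level min/max indexes prune effectively.
INSERT OVERWRITE TABLE events_orc PARTITION (event_date)
SELECT customer_id, event_time, payload, event_date
FROM events_staging
SORT BY customer_id;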
Reference:
http://fr.slideshare.net/oom65/orc-andvectorizationhadoopsummit
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
https://streever.atlassian.net/wiki/display/HADOOP/Optimizing+ORC+Files+for+Query+Performance
http://thinkbig.teradata.com/hadoop-performance-tuning-orc-snappy-heres-youre-missing/
One last thing: with 15 nodes for running queries and a replication factor of 3, each HDFS block is available "locally" on 3 of the nodes (20%) and only "remotely" on the rest (80%). A higher replication factor may reduce I/O and network bottlenecks -- at the cost of disk space, of course.

Is it better to use one complex query or several simpler ones?

Which option is better:
Writing a very complex query with a large number of joins, or
Writing 2 queries one after the other, applying the result set of the first query to the second.
Generally, one query is better than two, because the optimizer has more information to work with and may be able to produce a more efficient query plan than it could for either query separately. Additionally, using two (or more) queries typically means you'll be running the second query multiple times, and the DBMS might have to generate the query plan for it repeatedly (though not if you prepare the statement and pass the parameters as placeholders when the query is re-executed). A single query also means fewer back-and-forth exchanges between the program and the DBMS. If your DBMS is on a server on the other side of the world (or country), this can be a big factor.
Arguing against combining the two queries, you might end up shipping a lot of repetitive data between the DBMS and the application. If each of 10,000 rows in table T1 is joined with an average of 30 rows from table T2 (so there are 300,000 rows returned in total), then you might be shipping a lot of data repeatedly back to the client. If the row size of (the relevant projection of) T1 is relatively small and the data from T2 is relatively large, then this doesn't matter. If the data from T1 is large and the data from T2 is small, then this may matter; measure before deciding.
When I was a junior DB person I once worked for a year in a marketing dept where I had so much free time I did each task 2 or 3 different ways. I made a habit of writing one mega-select that grabbed everything in one go and comparing it to a script that built interim tables of selected primary keys and then once I had the correct keys went and got the data values.
In almost every case the second method was faster. The cases where it wasn't were when dealing with a small number of small tables. Where it was most noticeably faster was, of course, large tables and multiple joins.
I got into the habit of selecting the required primary keys from tableA, selecting the required primary keys from tableB, etc., joining them to get the final set of primary keys, and then using those keys to go back to the tables and get the data values (see the sketch below).
As a DBA I now understand that this method resulted in less purging of the data cache and played nicer with others using the DB (as mentioned by Amir Raminfar).
It does however require the use of temporary tables, which some places / DBAs don't like (unfairly, in my mind).
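A rough sketch of that keys-first pattern, with invented table names (tableA, tableB) and a generic CREATE TEMPORARY TABLE syntax that varies slightly between DBMSs:

-- Step 1: resolve just the qualifying primary keys into a small temp table.
CREATE TEMPORARY TABLE tmp_keys AS
SELECT a.id
FROM tableA a
JOIN tableB b ON b.a_id = a.id
WHERE a.status = 'ACTIVE'
  AND b.created_at >= '2024-01-01';

-- Step 2: go back for the wide data values, but only for those keys.
SELECT a.*, b.*
FROM tmp_keys k
JOIN tableA a ON a.id = k.id
JOIN tableB b ON b.a_id = a.id;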
Depends a lot on the actual query and the actual database, e.g. SQL Server, Oracle, or MySQL.
At large companies, they prefer option 2 because option 1 will hog the database CPU. This results in all other connections being slow and everything becoming a bottleneck. That being said, it all depends on your data and the amount you are joining. If you are joining 10,000 rows to 1,000 rows you could get back up to 10,000 x 1,000 records (assuming an inner join).
Possible duplicate: MySQL JOIN Abuse? How bad can it get?
Assuming "better" means "faster", you can easily test these scenarios in a junit test. Note that a determining factor that you may not be able to get from a unit test is network latency. If the database sits right next to your machine where you run the unit test, you may see no difference in performance that is attributed to the network. If your production servers are in another town, country, or continent from the database, network traffic becomes more of a bottleneck. You do not want to go back and forth across the wire- you more likely want to make one round trip and get everything at once.
Again, it all depends :)
It could depend on many things:
the indexes you have set up
how many tables,
what the actual query is,
how big the data set is,
what the underlying DB is,
what table engine you are using
The best thing to do would probably be to test both methods on a variety of test data and see which one bottlenecks.
If you are using MySQL (and maybe Oracle?) you can use
EXPLAIN SELECT .....
and it will give you a lot of info on how it will execute the query, and therefore how you can improve it, etc.
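For example (with hypothetical orders and customers tables):

-- Shows the join order, the indexes considered, and the estimated rows
-- examined, without actually running the query.
EXPLAIN
SELECT o.id, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.created_at >= '2024-01-01';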

Table design and performance related to database?

I have a table with 158 columns in SQL Server 2005.
Are there any disadvantages to keeping so many columns?
Also, given that I have to keep that many columns,
how can I improve performance, e.g. by using stored procedures or indexes?
Wide tables can be quite performant when you usually want all the fields for a particular row. Have you traced your users' usage patterns? If they're usually pulling just one or two fields from multiple rows then your performance will suffer. The main issue is when your total row size hits the 8k page mark. That means SQL Server has to hit the disk twice for every row (first page + overflow page), and that's not counting any index hits.
The guys at Universal Data Models will have some good ideas for refactoring your table. And Red Gate's SQL Refactor makes splitting a table heaps easier.
There is nothing inherently wrong with wide tables. The main case for normalization is database size, where lots of null columns take up a lot of space.
The more columns you have, the slower your queries will be.
That's just a fact. That isn't to say you aren't justified in having many columns. The above does not give one carte blanche to split one entity's worth of table with many columns into multiple tables with fewer columns. The administrative overhead of such a solution would most probably outweigh any perceived performance gains.
My number one recommendation to you, based on my experience with abnormally wide tables (denormalized schemas of bulk-imported data), is to keep the columns as thin as possible. I had to work with a lot of crazy data and left most of the columns as VARCHAR(255). I recommend against this. Although convenient for development purposes, performance would spiral out of control, especially when working with Perl. Shrinking the columns to their bare minimum (VARCHAR(18), for instance) helped immensely.
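For instance, narrowing a column in SQL Server might look like the following; the table and column names are hypothetical, and you would first confirm that no existing value exceeds the new length and that no indexes or constraints depend on the column.

-- Check the longest existing value before shrinking.
SELECT MAX(LEN(product_code)) FROM dbo.ImportedData;

-- Then narrow the column from VARCHAR(255) to something realistic.
ALTER TABLE dbo.ImportedData
    ALTER COLUMN product_code VARCHAR(18) NOT NULL;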
Stored procedures are just batches of SQL commands; they don't have any direct effect on speed, other than that regular use of certain types of stored procedures will end up using cached query plans (which is a performance boost).
You can use indexes to speed up certain queries, but there's no hard and fast rule here. Good index design depends entirely on the type of queries you're running. Indexing will, by definition, make your writes slower; they exist only to make your reads faster.
The problem with having that many columns in a table is that finding rows using the clustered primary key can be expensive. If it were possible to change the schema, breaking this up into many normalized tables will be the best way to improve efficiency. I would strongly recommend this course.
If not, then you may be able to use indices to make some SELECT queries faster. If you have queries that only use a small number of the columns, adding indices on those columns could mean that the clustered index will not need to be scanned. Of course, there is always a price to pay with indices, in terms of storage space and INSERT, UPDATE and DELETE time, so this may not be a good idea for you.
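As a sketch with hypothetical names, a narrow non-clustered index that covers such a query means the wide clustered-index rows never need to be touched:

-- Covers queries that filter on customer_id and only read order_date and
-- amount, so the 158-column rows are never visited.
CREATE NONCLUSTERED INDEX IX_Orders_Customer
    ON dbo.Orders (customer_id)
    INCLUDE (order_date, amount);

SELECT order_date, amount
FROM dbo.Orders
WHERE customer_id = 42;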

How to gain performance when maintaining historical and current data?

I want to maintain the last ten years of stock market data in a single table. Certain analyses need only the last month of data. When I do this short-term analysis it takes a long time to complete the operation.
To overcome this I created another table to hold the current year's data alone. When I perform the analysis from this table it is 20 times faster than from the previous one.
Now my question is:
Is a separate table the right way to handle this kind of problem? (Or should we use a separate database instead of a table?)
If I have a separate table, is there any way to update the secondary table automatically?
Or can we use something like a materialized view to gain performance?
Note: I'm using Postgresql database.
You want table partitioning. This will automatically split the data between multiple tables, and will in general work much better than doing it by hand.
I'm working on nearly the exact same issue.
Table partitioning is definitely the way to go here. I would segment by more than just year, though; it gives you a greater degree of control. Just set up your partitions and then constrain them by month (or some other date range). In postgresql.conf you'll need to set constraint_exclusion = on to really get the benefit. The additional benefit here is that you can index only the exact tables you really want to pull information from. If you're batch importing large amounts of data into this table, you may get slightly better results with a rule vs. a trigger, and for partitioning I find rules easier to maintain. But for smaller transactions, triggers are much faster. The PostgreSQL manual has a great section on partitioning via inheritance.
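A minimal sketch of that inheritance-based setup, assuming a hypothetical market_data table keyed by trade_date (names and date ranges are invented; newer PostgreSQL versions also offer declarative partitioning):

-- Parent table; queries are written against this table.
CREATE TABLE market_data (
    trade_date  date     NOT NULL,
    symbol      text     NOT NULL,
    price       numeric  NOT NULL
);

-- One child table per month; the CHECK constraint is what lets
-- constraint exclusion skip irrelevant partitions.
CREATE TABLE market_data_2024_01 (
    CHECK (trade_date >= DATE '2024-01-01' AND trade_date < DATE '2024-02-01')
) INHERITS (market_data);

CREATE INDEX ON market_data_2024_01 (trade_date);

-- Route inserts to the right child (trigger shown; a rule is also possible).
CREATE OR REPLACE FUNCTION market_data_insert() RETURNS trigger AS $$
BEGIN
    IF NEW.trade_date >= DATE '2024-01-01' AND NEW.trade_date < DATE '2024-02-01' THEN
        INSERT INTO market_data_2024_01 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'no partition for date %', NEW.trade_date;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER market_data_insert_trg
    BEFORE INSERT ON market_data
    FOR EACH ROW EXECUTE PROCEDURE market_data_insert();

-- With constraint_exclusion = on, this scans only the January partition.
SELECT avg(price) FROM market_data
WHERE trade_date >= DATE '2024-01-01' AND trade_date < DATE '2024-02-01';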
I'm not sure about PostgreSQL specifics, but I can confirm that you are on the right track. When dealing with large data volumes, partitioning data into multiple tables and then using some kind of query generator to build your queries is absolutely the right way to go. This approach is well established in data warehousing, and specifically for your case, stock market data.
However, I'm curious why you need to update your historical data. If you're dealing with stock splits, it's common to implement that using a separate multiplier table that is used in conjunction with the raw historical data to give an accurate price/share.
It is perfectly sensible to use a separate table for historical records. It's much more problematic with a separate database, as it's not simple to write cross-database queries.
Automatic updates: that's a job for a cron job.
You can use partial indexes for such things; they do a wonderful job (see the sketch below).
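For example, on a single big table, a partial index over just the recent slice might look like this (names and the cutoff date are hypothetical):

-- Index only the rows the short-term analysis actually touches.
CREATE INDEX idx_market_data_recent
    ON market_data (symbol, trade_date)
    WHERE trade_date >= DATE '2024-01-01';

-- Queries whose WHERE clause implies the index predicate can use it.
SELECT symbol, avg(price)
FROM market_data
WHERE trade_date >= DATE '2024-01-01'
GROUP BY symbol;

Note that a fixed-date predicate like this has to be recreated periodically as the "recent" window moves forward.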
Frankly, you should check your execution plans and try fixing your queries or indexing before taking more radical steps.
Indexing comes at very little cost (unless you do a lot of insertions) and your existing code will be faster (if you index properly) without modifying it.
Other measures, such as partitioning, come after that...
