We are currently using a Snowflake DWH for our project. The columns in the tables are defined without any size specification; I'm not sure why, as this was done long ago.
Will there be any performance hit in Snowflake from leaving the size at the default? For example, by default the size of VARCHAR is 16777216 and NUMBER is (38,0).
Actually, we're just about to add more info about it to our doc, coming very soon.
In short, the length for VARCHAR and the precision (the "15" in DECIMAL(15,2)) for DECIMAL/NUMBER act only as constraints and have no effect on performance. Snowflake automatically detects the range of values and optimizes storage and processing for it. The scale (the "2" in DECIMAL(15,2)) for NUMBER and TIMESTAMP can influence storage size and performance, though.
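For illustration, here is a minimal sketch (table and column names are made up) showing that an unsized definition and the explicit defaults are equivalent:

```sql
-- Hypothetical tables: the two definitions below are equivalent in Snowflake.
CREATE OR REPLACE TABLE customer_defaults (
    customer_name VARCHAR,       -- defaults to VARCHAR(16777216)
    customer_id   NUMBER         -- defaults to NUMBER(38,0)
);

CREATE OR REPLACE TABLE customer_explicit (
    customer_name VARCHAR(16777216),
    customer_id   NUMBER(38,0)
);

-- In both cases the declared length/precision acts only as a constraint;
-- storage and processing are optimized for the actual values.
```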
Related
I'm trying to get my head around Snowflake's capabilities around wide-tables.
I have a table of the form:
userId | metricName         | value | asOfDate
-------|--------------------|-------|-----------
1      | 'meanSessionTime'  | 30    | 2022-01-04
1      | 'meanSessionSpend' | 20    | 2022-01-04
2      | 'meanSessionTime'  | 34    | 2022-01-05
...    | ...                | ...   | ...
However, for my analysis I usually pull big subsets of this table into Python and pivot out the metric names:
userId | asOfDate   | meanSessionTime | meanSessionSpend | ...
-------|------------|-----------------|------------------|-----
1      | 2022-01-04 | 30              | 20               | ...
2      | 2022-01-05 | 43              | 12               | ...
...    | ...        | ...             | ...              | ...
I am thinking of generating this Pivot in Snowflake (via DBT, the SQL itself is not hard), but I'm not sure if this is good/bad.
Any good reasons to keep the data in the long format? Any good reasons to go wide?
Note that I don't plan to always SELECT * from the wide table, so it may be a good use case for columnar storage.
Note:
These are big tables (billions of records, hundreds of metrics), so I am looking for a sense-check before I burn a few hundred $ in credits doing an experiment.
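For reference, the pivot SQL I have in mind is roughly the following, using the metric names from the example (metrics_long is a placeholder name for the tall table, and in practice the hundreds of metric expressions would be generated, e.g. with a dbt macro):

```sql
-- Sketch: long-to-wide pivot via conditional aggregation.
CREATE OR REPLACE TABLE metrics_wide AS
SELECT
    userId,
    asOfDate,
    MAX(IFF(metricName = 'meanSessionTime',  value, NULL)) AS meanSessionTime,
    MAX(IFF(metricName = 'meanSessionSpend', value, NULL)) AS meanSessionSpend
    -- ... one expression per metric ...
FROM metrics_long
GROUP BY userId, asOfDate;
```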
Thanks for the additional details provided in the comments, and apologies for the delayed response. A few thoughts.
I've used both Wide and Tall tables to represent feature/metric stores in Snowflake. You can also potentially use semi-structured column(s) to store the Wide representation. Or, in the Tall format, if your metrics can be of different data types (e.g. numeric & character), you can store the metric value in a single VARIANT column.
With ~600 metrics (columns), you are still within the limits of Snowflake's row width, but the wider the table gets, generally the less usable/manageable it becomes when writing queries against it, or just retrieving the results for further analysis.
The wide format will typically result in a smaller storage footprint than the tall format, due to the repetition of the key (e.g. user-id, asOfDate) and metricName, plus any additional columns you might need in the tall form. I've seen 3-5x greater storage in the Tall format in some implementations, so you should see some storage savings if you move to the Wide model.
In the Tall table this can be minimised through clustering the table so the same key and/or metric column values are gathered into the same micro-partitions, which then favours better compression and access. Also, as referenced in my comments/questions, if some metrics are sparse, or have a dominant default value distribution, or change value at significantly different rates, moving to a sparse-tall form can enable much more efficient storage and processing. In the wide form, if only one metric value out of 600 changes on a given day, you still need to write a new record with all 599 unchanged values, whereas in the tall form you could write a single record for the metric with the changed value.
In the wide format, Snowflake's columnar storage/access should effectively eliminate physical scanning of columns not included in the queries, so they should be at least as efficient as the tall format, and columnar compression techniques can effectively minimise the physical storage.
Assuming your data is not being inserted into the tall table in the optimal sequence for your analysis patterns, the table will need to be clustered to get the best performance, using CLUSTER BY. For example, if you are always filtering on a subset of user-ids, it should take precedence in your CLUSTER BY, but if you are mainly going after a subset of metrics, for all, or a subset of all, user-ids, then the metricName should take precedence. Clustering has an additional service cost, which may become a factor in using the tall format.
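For example (metrics_long is a placeholder name for your tall table; the two statements are alternatives, not both):

```sql
-- If queries mostly filter on a subset of user-ids:
ALTER TABLE metrics_long CLUSTER BY (userId, metricName);

-- If queries mostly pull a subset of metrics across most user-ids:
ALTER TABLE metrics_long CLUSTER BY (metricName, asOfDate);
```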
In the tall format, having a well-defined standard for metric names enables a programmatic approach to column selection (e.g. column names as contracts). This makes working with groups of columns as a unit very effective, using the WHERE clause to 'select' the column groups (e.g. with LIKE) and apply operations on them efficiently. IMO this enables much more concise & maintainable SQL to be written, without necessarily needing to use a templating tool like Jinja or DBT.
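For example, with a naming standard in place, a whole group of metrics can be selected as a unit (names here follow your example; metrics_long is the placeholder tall table):

```sql
SELECT userId, asOfDate, metricName, value
FROM metrics_long
WHERE metricName LIKE 'meanSession%'    -- selects the whole 'meanSession*' group
  AND asOfDate >= '2022-01-01';
```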
Similar flexibility can be achieved in the wide format by grouping and storing the metric name/value pairs within OBJECT columns, rather than as individual columns. They can be gathered (pivoted) into an OBJECT with OBJECT_AGG. Snowflake's semi-structured functionality can then be used on the object. Snowflake implicitly columnarises semi-structured columns, up to a point/limit, but with 600+ columns some of your data will not benefit from this, which may impact performance. If you know which columns are the most commonly used for filtering or returned in queries, you could use a hybrid of the two approaches.
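A rough sketch of that approach (again with placeholder table names):

```sql
-- Gather the tall name/value pairs into a single OBJECT per userId/asOfDate.
CREATE OR REPLACE TABLE metrics_obj AS
SELECT
    userId,
    asOfDate,
    OBJECT_AGG(metricName, value::VARIANT) AS metrics
FROM metrics_long
GROUP BY userId, asOfDate;

-- Individual metrics are then reached with semi-structured notation.
SELECT userId, asOfDate, metrics:meanSessionTime::NUMBER AS meanSessionTime
FROM metrics_obj
WHERE metrics:meanSessionSpend::NUMBER > 10;
```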
I've also used Snowflake UDFs to effectively perform commonly required filter, project or transform operations over the OBJECT columns using Javascript, but noting that you're using Python, the new Python UDF functionality may be a better option for you. When you retrieve the data to Python for further analysis you can easily convert the OBJECT to a DICT in Python for further iteration. You could also take a look at Snowpark for Python, which should enable you to push further analysis and processing from Python into Snowflake.
You could of course not choose between the two options, but go with both. If CPU dominates storage in your cloud costs, then you might get the best bang for your buck by maintaining the data in both forms, and picking the best target for any given query.
You can even consider creating views that present the one form as the other, if query convenience outweighs other concerns.
Another option is to split your measures out by volatility. Store the slow-moving ones with a date range key in a narrow (6NF) table and the fast ones with snapshot dates in a wide (3NF) table. Again, a view can help present a simpler user access point (although I guess the Snowflake optimizer won't do join pruning over range predicates, so YMMV on the view idea).
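A minimal sketch of that split (all table, column and metric names here are made up):

```sql
-- Slow-moving metrics: narrow (6NF-style) with a validity date range.
CREATE TABLE metrics_slow (
    userId     NUMBER,
    metricName VARCHAR,
    value      NUMBER,
    valid_from DATE,
    valid_to   DATE
);

-- Fast-moving metrics: wide (3NF-style) daily snapshot.
CREATE TABLE metrics_fast (
    userId           NUMBER,
    asOfDate         DATE,
    meanSessionTime  NUMBER,
    meanSessionSpend NUMBER
    -- ... other fast-moving metrics ...
);

-- A view re-assembles a single access point, joining each snapshot row to the
-- slow metrics valid on that date (join pruning over the range predicate may
-- be limited, as noted above).
CREATE VIEW metrics_combined AS
SELECT f.*, s.metricName, s.value
FROM metrics_fast f
LEFT JOIN metrics_slow s
  ON  s.userId = f.userId
  AND f.asOfDate BETWEEN s.valid_from AND s.valid_to;
```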
Non-1NF gives you more options too on DBMSes like Snowflake that have native support for "semi-structured" ARRAY, OBJECT, and VARIANT column values.
BTW do update us if you do any experiments and get comparison numbers out. Would make a good blog post!
I have a varchar column with a max size of 20,000 in my Redshift table. About 60% of the rows will have this column null or empty. What is the performance impact in such cases?
From this documentation I read:
Because Amazon Redshift compresses column data very effectively, creating columns much larger than necessary has minimal impact on the size of data tables. During processing for complex queries, however, intermediate query results might need to be stored in temporary tables. Because temporary tables are not compressed, unnecessarily large columns consume excessive memory and temporary disk space, which can affect query performance.
So this means query performance might be bad in this case. Is there any other disadvantage apart from this?
For storing the data in the Redshift table, there is no significant performance degradation, as suggested in the documentation; compression encoding helps keep the data compact.
However, when you query the column with null values, extra processing is required, for instance when using it in a WHERE clause. This might impact the performance of your query, so performance depends on your query.
EDIT (answer to your comment) - Redshift stores each column in "blocks" and these blocks are sorted according to the sort key you specified. Redshift keeps a record of the min/max of each block and can skip over any blocks that could not contain data to be returned. Query the disk space used by the particular column and check its size against the other columns.
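For example, per-column block usage (each block is 1 MB) can be checked with something along these lines; 'your_table' is a placeholder, and the exact system-table query may need adjusting for your setup:

```sql
-- Blocks used per column (col is the column position; the last few entries
-- are Redshift's hidden system columns).
SELECT b.col, COUNT(*) AS blocks
FROM stv_blocklist b
JOIN stv_tbl_perm p
  ON b.tbl = p.id
 AND b.slice = p.slice
WHERE p.name = 'your_table'
GROUP BY b.col
ORDER BY b.col;
```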
If I’ve made a bad assumption please comment and I’ll refocus my answer.
I have a SQL Server 2008 database that stores millions of rows. There are several NVARCHAR columns that will never exceed the current max length of the column, nor get close to it due to application constraints.
i.e.
The Address NVARCHAR field has a length of 50 characters, but it'll never exceed 32 characters.
Is there a performance benefit or space-saving benefit in reducing the size of the NVARCHAR column to what its actual max length will be (i.e. in the case of the Address field, 32 characters)? Or will it not make a difference, since it's a variable-length field?
Setting the number of characters in NVARCHAR is mainly for validation purposes. If there is some reason why you don't want the data to exceed 50 characters then the database will enforce that rule for you by not allowing extra data.
If the total row size exceeds a threshold then it can affect performance, so by restricting the length you could benefit by not allowing your row size to exceed that threshold. But in your case, that does not seem to matter.
The reason for this is that SQL Server can fit more rows onto a Page, which results in less disk I/O and more rows can be stored in memory.
Also, the maximum row size in SQL Server is 8KB, as that is the size of a page and rows cannot cross page boundaries. If you insert a row that exceeds 8KB, the extra data will be stored in a row overflow page, which will likely have a negative effect on performance.
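If you do decide to tighten the definition, it's worth checking the actual data first; a rough sketch (dbo.Customers / Address are stand-ins for your own table and column):

```sql
-- How long is the data really?
SELECT MAX(LEN(Address))        AS max_chars,
       MAX(DATALENGTH(Address)) AS max_bytes   -- NVARCHAR stores 2 bytes per character
FROM dbo.Customers;

-- If nothing exceeds 32 characters, the column can be narrowed.
-- Keep the NULL/NOT NULL clause in line with the existing definition.
ALTER TABLE dbo.Customers ALTER COLUMN Address NVARCHAR(32) NULL;
```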
There is no expected performance or space saving benefit for reducing your n/var/char column definitions to their maximum length. However, there may be other benefits.
The column won't accidentally have a longer value inserted without generating an error (desirable for the "fail fast" characteristic of well-designed systems).
The column communicates to the next developer examining the table something about the data, that aids in understanding. No developer will be confused about the purpose of the data and have to expend wasted time determining if the code's field validation rules are wrong or if the column definition is wrong (as they logically should match).
If your column does need to be extended in length, you can do so with potential consequences ascertained in advance. A professional who is well-versed in databases can use the opportunity to see if upcoming values that will need the new column length will have a negative impact on existing rows or on query performance—as the amount of data per row affects the number of reads required to satisfy queries.
I know storing multiple values in a column is not a good idea. It violates first normal form, which states no multi-valued attributes. Normalize, period...
I am using SQL Server 2005
I have a table that requires storing a lower limit and an upper limit for a measurement; think of it as a minimum and maximum speed limit... the only problem is that I only need the upper limit for about 2 rows out of a hundred. I will only have data for the lower limit.
I was thinking of storing both values in one column (sparse columns were introduced in 2008, so not an option for me).
Is there a way...? I'm not sure about XML...
You'd have to be storing an insane number of rows for this to even matter. The price of a 1 terabyte disk is now 60 dollars!
Two floats use up 8 bytes; an XML string will use a multiple of that just to store one float. So even though XML would store only one instead of two columns, it would still consume more space.
Just use a nullable column.
To answer your question, you could store it as a string with a particular format that you know how to parse (e.g. "low:high").
But ... this is really not a good idea.
Dealing with 98% of the rows having NULL value for upper limit is totally fine IMHO. Keep it clean, and you won't regret it later.
Even so, I agree with Andomar. Use two columns, low limit and high limit. If either value could be unknown, make those columns nullable.
Alternatively, designate default arbitrary minimum and maximum values, and use those values instead of nulls. (Doing this means you never have to mess with ternary logic, e.g. having to wrap everything with ISNULL or COALESCE.)
Once you define your schema, there are tricks you can use to reduce storage space (such as compression and sparse columns).
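A minimal sketch of the two-column approach (table and column names are made up):

```sql
CREATE TABLE dbo.SpeedLimit (
    MeasurementId INT IDENTITY(1,1) PRIMARY KEY,
    LowerLimit    FLOAT NOT NULL,
    UpperLimit    FLOAT NULL       -- populated for the ~2% of rows that have one
);

-- With NULLs you occasionally wrap the column, e.g. treating "no upper limit"
-- as effectively unbounded:
SELECT MeasurementId
FROM dbo.SpeedLimit
WHERE ISNULL(UpperLimit, 999999) >= 80;
```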
I have a column that is declared as nvarchar(4000), and 4000 is SQL Server's limit on the length of nvarchar. It would not be used as a key to sort rows.
Are there any implication that i should be aware of before setting the length of the nvarchar field to 4000?
Update
I'm storing an XML, an XML Serialized Object to be exact. I know this isn't favorable but we will most likely never need to perform queries on it and this implementation dramatically decreases development time for certain features that we plan on extending. I expect the XML data to be 1500 characters long on average but there can be those exceptions where it can be longer than 4000. Could it be longer than 4000 characters? It could be but in very rare occasions, if it ever happens. Is this application mission critical? Nope, not at all.
SQL Server has three types of storage: in-row, LOB and Row-Overflow, see Table and Index Organization. The in-row storage is fastest to access. LOB and Row-Overflow are similar to each other, both slightly slower than in-row.
If you have a column of NVARCHAR(4000), it will be stored in-row if possible; if not, it will be stored in the row-overflow storage. Having such a column does not necessarily indicate future performance problems, but it begs the question: why nvarchar(4000)? Is your data likely to be always near 4000 characters long? Can it be 4001, and how will your application handle it in that case? Why not nvarchar(max)? Have you measured performance and found that nvarchar(max) is too slow for you?
My recommendation would be to either use a small nvarchar length, appropriate for the real data, or nvarchar(max) if it is expected to be large. nvarchar(4000) smells like unjustified and untested premature optimisation.
Update
For XML, use the XML data type. It has many advantages over varchar or nvarchar, like the fact that it supports XML indexes and XML methods, and can actually validate the XML for compliance with a specific schema, or at least check that it is well-formed.
XML will be stored in the LOB storage, outside the row.
Even if the data is not XML, I would still recommend LOB storage (nvarchar(max)) for something of a length of 1500. There is a cost associated with retrieving LOB-stored data, but the cost is more than compensated by making the table narrower. The width of a table row is a primary factor of performance, because wider tables fit fewer rows per page, so any operation that has to scan a range of rows or the entire table needs to fetch more pages into memory, and this shows up in the query cost (it is actually the driving factor of the overall cost). A LOB-stored column only expands the size of the row by the width of a 'page id', which is 8 bytes if I remember correctly, so you can get much better density of rows per page, hence faster queries.
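To make the XML suggestion concrete, a small sketch (the table and element names are made up):

```sql
CREATE TABLE dbo.SerializedObjects (
    ObjectId INT IDENTITY(1,1) PRIMARY KEY,
    Payload  XML NOT NULL          -- stored as LOB, outside the row
);

-- XML methods become available, e.g. pulling one value out of the payload:
SELECT ObjectId,
       Payload.value('(/Order/Total)[1]', 'decimal(10,2)') AS OrderTotal
FROM dbo.SerializedObjects;
```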
Are you sure that you'll actually need as many as 4000 characters? If 1000 is the practical upper limit, why not set it to that? Conversely, if you're likely to get more than 4000 bytes, you'll want to look at nvarchar(max).
I like to "encourage" users not use storage space too freely. The more space required to store a given row, the less space you can store per page, which potentially results in more disk I/O when the table is read or written to. Even though only as many bytes of data are stored as are necessary (i.e. not the full 4000 per row), whenever you get a bit more than 2000 characters of nvarchar data, you'll only have one row per page, and performance can really suffer.
This of course assumes you need to store unicode (double-byte) data, and that you only have one such column per row. If you don't, drop down to varchar.
Do you really need nvarchar, or can you go with varchar? The limitation applies mainly to SQL Server 2000. Are you using 2k5 / 2k8?