Create column based on another column in Snowflake

I have a Snowflake table that has 50 billion rows (~10 TB) and ~40 columns. This table has a timestamp column that's always filled. I need to create a new date column and backfill all values for it.
This query works, but it's really inefficient (it takes 1 hour to scan 1% of the table with a 2XL warehouse). Is there a more efficient way to do this?
UPDATE DAILY_TABLE b
SET b.ts_date = to_date(a.ts_tstamp)
FROM
(SELECT a.ts_date, a.ts_tstamp
FROM DAILY_TABLE a) a;
Current table:
ts_tstamp | ts_date
2021-04-28 05:01:32.883 | null
2021-01-28 05:01:32.883 | null
2020-01-25 05:01:32.883 | null
Desired table:
ts_tstamp | ts_date
2021-04-28 05:01:32.883 | 2021-04-28
2021-01-28 05:01:32.883 | 2021-01-28
2020-01-25 05:01:32.883 | 2020-01-25

If the column does not need to be materialized and is always dependent on ts_tstamp then another option could be a virtual column:
ALTER TABLE DAILY_TABLE
ADD COLUMN ts_date_virtual DATE AS TO_DATE(ts_tstamp);
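If the column does need to be materialized, note that the FROM subquery in the question has no join condition, and the value only depends on the same row anyway, so a self-join is not required. A minimal sketch of a single-table backfill, using only the table and column names from the question:

```sql
-- Backfill ts_date directly from ts_tstamp on the same row.
-- The WHERE clause skips rows that were already filled, so the
-- statement can be re-run if an earlier attempt was interrupted.
UPDATE DAILY_TABLE
SET ts_date = TO_DATE(ts_tstamp)
WHERE ts_date IS NULL;
```

Because Snowflake micro-partitions are immutable, this still rewrites every affected partition, so it will not be cheap on 10 TB; it only removes the unnecessary join.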

Related

How to add data to a single column

I have a question about adding data to a particular column of a table. I had a post yesterday where a user guided me (thanks for that) to what I needed and said an UPDATE was the way to go, but I still can't achieve my goal.
I have two tables: the table the information will be added from and the table the information will be added to. Here is an example:
source_table (has only a column called "name_expedient_reviser" that is nvarchar(50))
name_expedient_reviser
kim
randy
phil
cathy
josh
etc.
On the other hand I have the destination table. It has two columns, one with the ids and the other where the names will be inserted. The values in this column are NULL, and there are some ids that are going to be used for this.
This is what the other table looks like:
dbo_expedient_reviser (has 2 columns: unique_reviser_code, a numeric PK that is not auto-incremented, and name_expedient_reviser, an nvarchar(50) holding the users who check expedients). This is how the table looks now:
dbo_expedient_reviser
unique_reviser_code | name_expedient_reviser
1 | NULL
2 | NULL
3 | NULL
4 | NULL
5 | NULL
6 | NULL
What I need is for the information in source_table to be inserted into the column name_expedient_reviser, so the result should look like this:
dbo_expedient_reviser
unique_reviser_code | name_expedient_reviser
1 | kim
2 | randy
3 | phil
4 | cathy
5 | josh
6 | etc.
How can I pass the information into this table? What do I have to do?
EDIT
The query I saw that should have worked, but doesn't update anything, is this one:
UPDATE dbo_expedient_reviser
SET dbo_expedient_reviser.name_expedient_reviser = source_table.name_expedient_reviser
FROM source_table
JOIN dbo_expedient_reviser ON source_table.name_expedient_reviser = dbo_expedient_reviser.name_expedient_reviser
WHERE dbo_expedient_reviser.name_expedient_reviser IS NULL
The query was supposed to copy the information from source_table into the table wherever name_expedient_reviser is NULL, which it is, but it doesn't work.
Since the Names do not have an Id associated with them I would just use ROW_NUMBER and join on ROW_NUMBER = unique_reviser_code. The only problem is, knowing what rows are null. From what I see, they all appear null. In your data, is this the case or are there names sporadically in the table like 5,17,29...etc? If the name_expedient_reviser is empty in dbo_expedient_reviser you could also truncate the table and insert values directly. Hopefully that unique_reviser_code isn't already linked to other things.
WITH CTE (name_expedient_reviser, unique_reviser_code)
AS
(
SELECT name_expedient_reviser
,ROW_NUMBER() OVER (ORDER BY name_expedient_reviser)
FROM source_table
)
UPDATE er
SET er.name_expedient_reviser = cte.name_expedient_reviser
FROM dbo_expedient_reviser er
JOIN CTE on cte.unique_reviser_code = er.unique_reviser_code
Or Truncate:
TRUNCATE TABLE dbo_expedient_reviser;
INSERT INTO dbo_expedient_reviser (unique_reviser_code, name_expedient_reviser)
SELECT
unique_reviser_code = ROW_NUMBER() OVER (ORDER BY name_expedient_reviser)
,name_expedient_reviser
FROM source_table
It is not possible to INSERT data into a single column of existing rows; an UPDATE that moves the data you want is the only way to go in that case.

Partitioning table based on first letter of a varchar field

I have a massive table (over 1B records) that has a specific requirement for table partitioning:
(1) Is it possible to partition a table in Postgres based on the first character of a varchar field?
For example:
For the following 3 records:
a-blah
a-blah2
b-blah
a-blah and a-blah2 would go in the "A" partition, b-blah would go into the "B" partition.
(2) If the above is not possible with Postgres, what is a good way to evenly partition a large growing table? (without partitioning by create date -- since that is not something these records have).
You can use an expression in the partition by clause, e.g.:
create table my_table(name text)
partition by list (left(name, 1));
create table my_table_a
partition of my_table
for values in ('a');
create table my_table_b
partition of my_table
for values in ('b');
Results:
insert into my_table
values
('abba'), ('alfa'), ('beta');
select 'a' as partition, name from my_table_a
union all
select 'b' as partition, name from my_table_b;
partition | name
-----------+------
a | abba
a | alfa
b | beta
(3 rows)
If the partitioning should be case insensitive you might use
create table my_table(name text)
partition by list (lower(left(name, 1)));
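With list partitioning, a row whose first character matches none of the listed values is rejected on insert. One way to handle that, sketched here on the assumption that PostgreSQL 11 or later is in use (the partition name my_table_other is made up for illustration), is a default partition:

```sql
-- Catch-all partition for any first character that has
-- no dedicated partition (digits, punctuation, new letters).
create table my_table_other
partition of my_table
default;
```

Rows such as '1-blah' would then land in my_table_other instead of raising an error.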
Read in the documentation:
Table Partitioning
CREATE TABLE

T-SQL Select, manipulate, and re-insert via stored procedure

The short version is I'm trying to map from a flat table to a new set of tables with a stored procedure.
The long version: I want to SELECT records from an existing table, and then for each record INSERT into a new set of tables (most columns will go into one table, but some will go to others and be related back to this new table).
I'm a little new to stored procedures and T-SQL. I haven't been able to find anything particularly clear on this subject.
It would appear I want to do something along the lines of
INSERT INTO [dbo].[MyNewTable] (col1, col2, col3)
SELECT
OldCol1, OldCol2, OldCol3
FROM
[dbo].[MyOldTable]
But I'm uncertain how to get that to save related records since I'm splitting it into multiple tables. I'll also need to manipulate some of the data from the old columns before it will fit into the new columns.
Thanks
Example data
MyOldTable
Id | Year | Make | Model | Customer Name
572 | 2001 | Ford | Focus | Bobby Smith
782 | 2015 | Ford | Mustang | Bobby Smith
Into (with no worries about duplicate customers or retaining old Ids):
MyNewCarTable
Id | Year | Make | Model
1 | 2001 | Ford | Focus
2 | 2015 | Ford | Mustang
MyNewCustomerTable
Id | FirstName | LastName | CarId
1 | Bobby | Smith | 1
2 | Bobby | Smith | 2
I would suggest preserving the old table's Id in the new table until you have finished processing the data.
I assume you create an identity column Id on MyNewCarTable, plus an OldId column to hold the original Id:
INSERT INTO MyNewCarTable (OldId, Year, Make, Model)
SELECT Id, Year, Make, Model FROM MyOldTable
Then, join the new table and above table to insert into your second table. I assume your MyNewCustomerTable also has Id column with Identity enabled.
INSERT INTO MyNewCustomerTable (CustomerName, CarId)
SELECT CustomerName, new.Id
FROM MyOldTable old
JOIN MyNewCarTable new ON old.Id = new.OldId
Note: I have not split Customer Name into first name and last name, as I was unsure about the existing data.
If you don't want OldId in MyNewCarTable afterwards, you can drop it:
ALTER TABLE MyNewCarTable DROP COLUMN OldId
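The customer-name split that this answer leaves out could look like the following. This is only a sketch: it assumes every name is stored as "First Last" separated by a single space, that the old column is called CustomerName as in the query above (the question's sample header shows it as "Customer Name"), and that MyNewCustomerTable has the FirstName and LastName columns shown in the question.

```sql
-- Run this while OldId still exists on MyNewCarTable.
INSERT INTO MyNewCustomerTable (FirstName, LastName, CarId)
SELECT
    -- Everything before the first space. Appending ' ' keeps LEFT
    -- from failing on names that contain no space at all.
    LEFT(o.CustomerName, CHARINDEX(' ', o.CustomerName + ' ') - 1),
    -- Everything after the first space (empty string if there is none).
    LTRIM(SUBSTRING(o.CustomerName,
                    CHARINDEX(' ', o.CustomerName + ' ') + 1,
                    LEN(o.CustomerName))),
    c.Id
FROM MyOldTable AS o
JOIN MyNewCarTable AS c ON o.Id = c.OldId;
```

Names with middle names or suffixes would put everything after the first space into LastName, so real data may need a more careful rule.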
You are missing a step in your normalization. You do not need to duplicate your customer information per vehicle. You need three tables for 4th Normal form. This will reduce storage size and more importantly allow an update to the customer data to take place in one location.
Customer
  CustomerID
  FirstName
  LastName
Car
  CarID
  Make
  Model
  Year
CustomerCar
  CustomerCarID
  CarID
  CustomerID
  DatePurchased
This way you can have multiple owners per car, multiple cars per owner and only one record needs to be updated per car and or customer...4th Normal Form.
If I am reading this correctly, you want to take each row from table 1, and create a new record into table A using some of that row data, and then data from the same original row into Table B, Table C but referencing back to Table A again?
If that's the case, you will create TableA with an Identity column and make that the PK.
Insert the required column data into that table and use @@IDENTITY to retrieve the last identity value. Then insert the remaining data from the original table into the other tables (TableB, TableC, etc.) and use the identity you retrieved from TableA as the FK in those tables.
By Example:
Table 1 has columns col1, col2, col3, col4, col5
Table A has TabAID, col1, col2
Table B has TabBID, TabAID, col3
TableC has TabCID, TabAID, col4
When the first row is read, the values for col1 & col2 are inserted into TableA.
The Identity is captured from that row inserted, and then value for col3 AND the identity are entered into TableB, and then value for col4 AND the identity are entered into TableC.
This is a standard data migration technique for normalizing data.
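For a single source row, the steps in this example can be sketched as below. The table and column names follow the example; the literal values are placeholders, and the sketch swaps in SCOPE_IDENTITY(), which returns the last identity value generated in the current scope, instead of a global identity lookup that could pick up a value generated by a trigger.

```sql
DECLARE @TabAID INT;

-- Step 1: insert the parent row; TabAID is assumed to be an IDENTITY column.
INSERT INTO TableA (col1, col2) VALUES ('value1', 'value2');

-- Step 2: capture the identity value generated by that insert.
SET @TabAID = SCOPE_IDENTITY();

-- Step 3: insert the child rows, using the captured value as the FK.
INSERT INTO TableB (TabAID, col3) VALUES (@TabAID, 'value3');
INSERT INTO TableC (TabAID, col4) VALUES (@TabAID, 'value4');
```

Doing this for every row of Table 1 means wrapping these statements in a loop or cursor, which can be slow on large tables.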
Hope this assists.

Why sort on sorted non clustered index field?

Say I have a table with ID, Name, and Date.
And I have a non-clustered index like,
CREATE NONCLUSTERED INDEX IX_Test_NameDate ON [dbo].[Test] (Name, Date)
When I run the query,
select
[Name], [Date]
from
[dbo].[Test] WITH (INDEX(IX_Test_NameDate))
where
[Name] like 'A%'
order by
[Date] asc
I get in SQL Server's execution plan,
Select <-- Sort <-- Index Seek (NonClustered)
Why the sort? Isn't the date already sorted in the non-clustered index? What would a better non-clustered index look like that doesn't require a sort (only an index seek)?
(Can't use a clustered index as this example is a condensed version of a bigger example with multiple rows/indexes).
For example, I get the execution plan (with sort) for a table that looks like this,
ID Name Date
1 A 2014-01-01
2 A 2014-02-01
3 A 2014-03-01
4 A 2014-04-01
5 B 2014-01-01
6 B 2014-02-01
7 B 2014-03-01
8 B 2014-04-01
9 B 2014-05-01
10 B 2014-06-01
Shouldn't the dates be sorted in this case?
No, the Date column is not "already sorted in the non-clustered index", at least, not by itself. It is sorted after Name.
Consider the following trivial table data:
Name Date
----- --------
Allen 1/1/2014
Barb 1/1/2013
Charlie 1/1/2015
Darlene 1/1/2012
Ernie 1/1/2016
Faith 1/1/2011
Once you've sorted by Name, the Date columns are potentially out of order. Dates are guaranteed in order only for rows that have the same Name.
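To illustrate with the index from the question: when the predicate pins Name to a single value, the matching index rows are already in Date order, so a query like this sketch should generally not need a Sort operator (the literal 'A' is just the sample value from the question's data).

```sql
SELECT [Name], [Date]
FROM [dbo].[Test]
WHERE [Name] = 'A'      -- equality on the leading key, not LIKE 'A%'
ORDER BY [Date] ASC;    -- order already provided by (Name, Date)
```

With LIKE 'A%' the seek can return several distinct names, and Date is only ordered within each one, which is why the sort appears.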
Your goals are at cross-purposes to each other. You want multiple names--so the data is best ordered by name so that the seek is possible, but then you want to sort by Date. How would you propose storing the above six-row table so that it is sorted by Date for every possible range of names?
If there is some kind of regularity or pattern about the ranges of names (perhaps, for example, you always pull names by first letter only) then there is a possible workaround.
ALTER TABLE dbo.Test ADD NamePrefix AS (Left(Name, 1)) PERSISTED;
CREATE NONCLUSTERED INDEX IX_Test_NamePrefix_Date ON dbo.Test (NamePrefix, Date);
Now this query theoretically should not need to perform the sort:
SELECT Name, Date
FROM dbo.Test
WHERE NamePrefix = 'A'
ORDER BY Date;
Be aware that there are some likely gotchas with adding a persisted computed column like this: increased data size, the fact that such a design is almost certainly wrong in almost every case, that the proliferation of computed columns would be very bad, among others.
P.S. It is generally not best practice to force indexes manually--let the optimizer choose.

SQL Persisted Computed Column with Subquery

I have three tables
Table 1: Items
ItemID | DaysLastSold
Table2: Listings
ItemID | ListingID
Table3: Sales
ListingID | DateItemClosed
I got this query to work:
SELECT min(DATEDIFF(day, DateItemClosed, getdate())) as DaysLastSold
from Sales
where QtySold > 0
and ListingID in (SELECT ListingID from Listings where ItemID = 8101 )
What I'm trying to do is basically place this query into the DaysLastSold column in the Items table, so that whenever the column is selected it recalculates DaysLastSold using the ItemID in the neighboring column.
If you want to persist that information you could create an indexed view that is made up of your calculated value and an ItemID. Obviously this would not be a column in your original table though. You could then join in on this view when you need the information.
Personally I would probably just do it inline when you need it. If you are concerned about performance, post the execution plan here and we may be able to make some suggestions.
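As one way of doing it inline without repeating the query everywhere, here is a sketch of a plain view, not an indexed one, since GETDATE() is nondeterministic and so cannot be persisted. The view name and the dbo schema are assumptions; the columns come from the question's query.

```sql
CREATE VIEW dbo.ItemDaysLastSold
AS
SELECT l.ItemID,
       -- Smallest day count = most recent sale for the item.
       MIN(DATEDIFF(day, s.DateItemClosed, GETDATE())) AS DaysLastSold
FROM dbo.Listings AS l
JOIN dbo.Sales AS s ON s.ListingID = l.ListingID
WHERE s.QtySold > 0
GROUP BY l.ItemID;
```

It could then be joined to Items on ItemID, with the value recalculated each time the view is queried.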
