In PostgreSQL, there is a way to have a query return the data it modified as part of the same statement:
UPDATE products SET price = price * 1.10
WHERE price <= 99.99
RETURNING name, price AS new_price;
The query above returns the name and the new price once the update is complete.
Is there an equivalent concept available in Snowflake?
I am trying to do the following:
Update a record and get the updated record back in the same operation.
Avoid collisions on updates between processes.
Reference for PostgreSQL : https://www.postgresql.org/docs/9.5/dml-returning.html
That's a nice feature. Snowflake doesn't have it, but you could open a transaction, run the SELECT and then the UPDATE, to the same effect. However, if your objective is to log changes, you might want to check out change tracking with Table Streams -- currently in preview.
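A minimal sketch of that transaction approach, reusing the products table and predicate from the question (the WHERE clause is repeated so both statements target the same rows):
BEGIN;
-- preview the rows and the prices they will have after the update
SELECT name, price * 1.10 AS new_price
FROM products
WHERE price <= 99.99;
-- apply the increase to exactly those rows
UPDATE products
SET price = price * 1.10
WHERE price <= 99.99;
COMMIT;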
I have a Kafka stream and a Hive table that I want to use as a lookup table to enrich the data from Kafka. The Hive table points to Parquet files in S3 and is updated once a day with an INSERT OVERWRITE statement, which means the older files in that S3 path are replaced by newer files once a day.
Every time the Hive table is updated, the newer data from the Hive table is joined with the historical data from Kafka, and this results in older Kafka data getting republished. I see this is the expected behaviour from this link.
I tried to set an idle state retention of 2 days as shown below, but it looks like Flink is not honoring it and seems to be keeping all the Kafka records in the table state. I was expecting only the last 2 days of data to be republished when the Hive table is updated. My job has been running for one month, and instead I still see records as old as one month being sent in the output. I think this will make the state grow forever and might result in an out-of-memory exception at some point.
One possible reason I can think of is that Flink keeps the state of the Kafka data keyed by the sales_customer_id field, because that is the field used to join with the Hive table, so as soon as another sale arrives for that customer id, the state expiry is extended for another 2 days. I am not sure whether this is the reason, but I wanted to check with a Flink expert on what the possible problem could be.
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
TableConfig tableConfig = tableEnv.getConfig();
Configuration configuration = tableConfig.getConfiguration();
// keep state for keys that have been idle for at most 2 days
tableConfig.setIdleStateRetention(Duration.ofHours(24 * 2));
// allow the OPTIONS hint used in the lookup query below
configuration.setString("table.dynamic-table-options.enabled", "true");
DataStream<Sale> salesDataStream = ....;
Table salesTable = tableEnv.fromDataStream(salesDataStream);
// Hive lookup table, read as an unbounded streaming source
Table customerTable = tableEnv.sqlQuery("select * from my_schema.customers" +
" /*+ OPTIONS('streaming-source.enable'='true', 'streaming-source.partition-order'='create-time') */");
Table resultTable = salesTable.leftOuterJoin(customerTable, $("sales_customer_id").isEqual($("customer_id")));
DataStream<Sale> salesWithCustomerInfoDataStream = tableEnv.toRetractStream(resultTable, Row.class).map(new RowToSaleFunction());
My company is migrating from SQL Server 2017 to Snowflake, and I am looking to build historical data tables that capture delta changes. In SQL Server these live in stored procedures, where old records get expired (on a change to the data) and a new row with the updated data is inserted. This design allows dynamic retrieval of historical data at any point in time.
My question is: how would I migrate this design to Snowflake? From what I read about procedures, they're more like UDTs or scalar functions (the SQL Server equivalents), but in the JavaScript language...
Below is a brief example of how we are doing CDC for tables in SQL Server.
Would a data pipeline cover this? If anyone knows a good tutorial site for Snowflake 101 (not the official Snowflake documentation, it's terrible), that would be appreciated.
Thanks
-- expire the active history row when any tracked attribute has changed
update h
set h.expiration_date = t.effective_date
from data_table_A_history h
join data_table_A as t
on h.account_id = t.account_id
where h.expiration_date is null
and (
(isnull(t.person_name,'x') <> isnull(h.person_name,'x')) or
(isnull(t.person_age,0) <> isnull(h.person_age,0))
)
---------------------------------------------------------------------
-- insert accounts that do not yet have an active history row
insert into data_table_A_history (account_id,person_name,person_age)
select
t.account_id, t.person_name, t.person_age
from
data_table_A t
left join data_table_A_history h
on t.account_id = h.account_id
and h.expiration_date is null
where
h.account_id is null
Table streams are Snowflake's CDC solution.
You can set up multiple streams on a single table, and each will track changes to the table from a particular point in time. That point in time moves once you consume the data in the stream, with the new starting point being the time you consumed the data. Consumption here means using the data in a DML statement, for example to upsert another table or to insert the data into a log table. Plain SELECT statements do not consume the data.
A pipeline could be something like this: Snowpipe -> staging table -> stream on the staging table -> task with a stored procedure -> merge/upsert into the target table
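A rough sketch of the stream-plus-task part of that pipeline (all object names here are hypothetical, and the task runs the MERGE directly instead of calling a stored procedure, to keep the example short):
create or replace stream staging_stream on table staging_table;
-- the MERGE inside the task consumes the stream and moves its offset forward
create or replace task apply_staged_changes
  warehouse = my_wh
  schedule = '5 MINUTE'
when
  system$stream_has_data('STAGING_STREAM')
as
  merge into target_table t
  using staging_stream s
    on t.account_id = s.account_id
  when matched then update set t.person_name = s.person_name, t.person_age = s.person_age
  when not matched then insert (account_id, person_name, person_age)
    values (s.account_id, s.person_name, s.person_age);
alter task apply_staged_changes resume;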
If you wanted to keep a log of the changes, then you could set up a 2nd stream on the staging table and consume it by inserting the data into another table.
Another trick, if you didn't want to use a 2nd stream, is to amend your SP so that before you consume the data you run a SELECT on the stream and then immediately run
INSERT INTO my_table select * from table(result_scan(last_query_id()))
This does not consume the stream or change its offset, and it leaves the stream data available to be consumed by another DML operation.
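Put together, the sequence inside the SP could look like this (hypothetical names again; only the final DML against the stream advances its offset):
select * from staging_stream;
insert into change_log
select * from table(result_scan(last_query_id()));
-- a later DML statement that reads from staging_stream, such as the MERGE above, is what actually consumes it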
I went over the documentation for ClickHouse and I did not see an option to UPDATE or DELETE. It seems to me it's an append-only system.
Is there a possibility to update existing records, or is there some workaround, like truncating a partition that has records in it that have changed and then re-inserting the entire data for that partition?
Rows in a table can be deleted or updated in ClickHouse through ALTER queries.
For delete, the query is constructed as:
ALTER TABLE testing.Employee DELETE WHERE Emp_Name='user4';
For update, the query is constructed as:
ALTER TABLE testing.employee UPDATE AssignedUser='sunil' where AssignedUser='sunny';
UPDATE: This answer is no longer true, look at https://stackoverflow.com/a/55298764/3583139
ClickHouse doesn't support real UPDATE/DELETE.
But there are a few possible workarounds:
Organize the data in a way that it does not need to be updated.
You could write a log of update events to a table and then calculate reports from that log. So, instead of updating existing records, you append new records to the table.
Use a table engine that does data transformation in the background during merges. For example, the (rather specific) CollapsingMergeTree table engine:
https://clickhouse.yandex/reference_en.html#CollapsingMergeTree
There is also the ReplacingMergeTree table engine (not documented yet; you can find an example in the tests: https://github.com/yandex/ClickHouse/blob/master/dbms/tests/queries/0_stateless/00325_replacing_merge_tree.sql).
The drawback is that you don't know when the background merge will be done, or whether it will ever be done.
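For illustration, a minimal ReplacingMergeTree sketch (the table and columns are made up): a row is "updated" by inserting a newer version with the same sorting key, and duplicates are collapsed during background merges or when reading with FINAL.
CREATE TABLE testing.employee_rmt
(
    emp_id UInt64,
    emp_name String,
    assigned_user String,
    updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY emp_id;
-- "update" a row by inserting a newer version with the same emp_id
INSERT INTO testing.employee_rmt VALUES (42, 'user4', 'sunny', now());
INSERT INTO testing.employee_rmt VALUES (42, 'user4', 'sunil', now());
-- FINAL forces de-duplication at read time; otherwise it happens at an unspecified merge time
SELECT * FROM testing.employee_rmt FINAL WHERE emp_id = 42;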
Also look at samdoj's answer.
You can copy the rows you want to keep into a temporary table, empty the original table, and re-insert them, but depending on the table size this might be very time consuming.
For deletion, something like this could work:
INSERT INTO tableTemp SELECT * from table1 WHERE rowID != #targetRowID;
TRUNCATE TABLE table1;
INSERT INTO table1 SELECT * from tableTemp;
Similarly, to update a row, you could first delete it in this manner and then re-insert it with the new values.
Functionality to UPDATE or DELETE data has been added in recent ClickHouse releases, but it is an expensive batch operation that can't be performed too frequently.
See https://clickhouse.yandex/docs/en/query_language/alter/#mutations for more details.
It's an old question, but updates are now supported in ClickHouse. Note that it's not recommended to do many small updates, for performance reasons, but it is possible.
Syntax:
ALTER TABLE [db.]table UPDATE column1 = expr1 [, ...] WHERE filter_expr
Clickhouse UPDATE documentation
I work on a program that stores information about network connections across my university, and I have been asked to create a report that shows the status changes of these connections over time. I was thinking about adding another table that holds the current connection information plus the date the data was added, so that when the report is run it just grabs the data at that date. However, I'm worried the report might get slow after a couple of months, since the table would grow by about 50,000 rows every month. Is there a better way to do this? We use Microsoft SQL Server.
It depends on the reason you are holding historical data for facts.
If the reason is:
For reporting needs, then you could hold it in the same table by adding two date columns, FromDate and ToDate, which will remove the need to join the active and historical data tables later on.
Just for reference, then it makes sense to have it in a different table, as it may decrease the performance of your indexes on your active table.
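A sketch of the first option, with made-up column names for the connections table; the same table holds both the active row (ToDate is NULL) and the historical rows:
create table dbo.ConnectionStatus
(
    ConnectionId int         not null,
    Status       varchar(50) not null,
    FromDate     datetime2   not null,
    ToDate       datetime2   null,  -- NULL marks the currently active row
    constraint PK_ConnectionStatus primary key (ConnectionId, FromDate)
);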
I'll highlight the Slowly Changing Dimension (SCD) type 2 approach, which tracks data history by maintaining multiple versions of each record and uses either the EndDate or a flag to identify the active record. This method allows tracking any number of historical records, as each time a new record is inserted the older one has its EndDate populated.
Step 1: For re-loaded facts, UPDATE IsActive = 0 on the record whose history is to be preserved and populate EndDate with the current date.
merge ActiveTable as T
using DataToBeLoaded as D
  on T.ID = D.ID
  and T.IsActive = 1 -- current active entry
when matched then
  update set T.IsActive = 0,
             T.EndDate = GETDATE();
Step 2: Insert the latest data into the ActiveTable with IsActive = 1 and FromDate as the current date.
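A sketch of step 2, reusing the same hypothetical ActiveTable and DataToBeLoaded names (SomeAttribute stands in for the real fact columns):
insert into ActiveTable (ID, SomeAttribute, IsActive, FromDate, EndDate)
select D.ID, D.SomeAttribute, 1, GETDATE(), NULL
from DataToBeLoaded as D;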
Disclaimer: The SCD 2 approach above could make your data warehouse huge. However, I don't believe it would affect performance much for your scenario.
I have a table that has several fields, two of which are "startdate" and "enddate", which mark the validity of the record. If I insert a new record, it cannot overlap with other records in terms of start date and end date.
Hence, on insertion of a new record I may need to adjust the "startdate" and "enddate" values of pre-existing records so they don't overlap with the new record. Similarly, any pre-existing records that have 100% overlap with the new record will need to be deleted.
My table is an InnoDB table, which I know supports such transactions.
Are there any examples which show the use of insert / update / delete using transactions (all must succeed in order for any one of them to succeed and be committed)?
I don't know how to do this. Most examples only show the use of saveAssociated(), which I'm not sure is capable of catering for delete operations.
Thanks
Kevin
Perhaps you could use the beforeSave callback to search for the preexisting records and delete them before saving your new record.
from the docs:
Place any pre-save logic in this function. This function executes immediately after model data has been successfully validated, but just before the data is saved. This function should also return true if you want the save operation to continue.
I think you're looking to do Transactions: http://book.cakephp.org/2.0/en/models/transactions.html
That should allow you to run your queries - you start a transaction, perform any required actions, and then commit or roll back based on the outcome. Although, given your description, I'd think doing some reads and adjusting your data before committing anything might be a better approach. Either way, transactions aren't a bad idea!
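For reference, the SQL-level pattern that CakePHP's transaction methods wrap looks like this; the reservations table and the date literals are made up to mirror the question, and in CakePHP the equivalent model calls would sit between begin() and commit() on the datasource:
START TRANSACTION;
-- trim a pre-existing record that starts before the new validity window and extends into it
UPDATE reservations
   SET enddate = '2024-01-01'
 WHERE startdate < '2024-01-01'
   AND enddate > '2024-01-01';
-- remove any pre-existing record fully covered by the new window
DELETE FROM reservations
 WHERE startdate >= '2024-01-01'
   AND enddate <= '2024-06-30';
-- insert the new record
INSERT INTO reservations (startdate, enddate)
VALUES ('2024-01-01', '2024-06-30');
COMMIT; -- or ROLLBACK if any statement fails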