I'm loading data into Snowflake using an approach I found on the forums:
snowpipe -> load_table <-> staging table -> final table
I have a task tree that checks the stream on the load_table and, if it finds data, swaps the load_table with the staging_table.
further tasks process the staging_table into the final table
The staging_table is then truncated and swapped back with the load_table
This typically works fine, but the problem I am seeing is that I end up with orphan records in either the load_table or the staging_table while the load_stream is empty.
It's at the point right now where even if I manually insert data into the load_table, the stream is still marked as empty, so no tasks run.
What is the expected behaviour when swapping tables that have streams on them? Is the above behaviour supported, or do I need to look at an alternative?
The goal is to use Snowpipe to load files from S3 into a temp table and merge them into a final table, without having an ever-growing staging table to manage...
Thanks!
Edit: after doing some experimenting, it seems that when the tables are swapped, the stream still listens to the "original" table, so it will ignore any data that Snowpipe loads into the "new" table, even though that new table has been swapped with the original...
The problem is, SHOW STREAMS and DESCRIBE STREAM provide wrong information:
create or replace table test1 (v varchar);
create or replace table test2 (v1 varchar, v2 varchar);
create or replace stream test_stream_1 on table test1;
alter table test1 swap with test2;
show streams like 'test_stream_1';
+---------------+------------------------+
| name | table_name |
+---------------+------------------------+
| TEST_STREAM_1 | GOKHAN_DB.PUBLIC.TEST1 |
+---------------+------------------------+
It should point to GOKHAN_DB.PUBLIC.TEST2 after the swap operation! I suggest you submit a support case.
The good thing is, get_ddl returns the correct result:
select get_ddl('stream','test_stream_1');
+--------------------------------------------------------+
| GET_DDL('STREAM','TEST_STREAM_1') |
+--------------------------------------------------------+
| create or replace stream TEST_STREAM_1 on table TEST2; |
+--------------------------------------------------------+
Related
I have a very large DataTable object which I need to import from a client into an MS SQL Server database via ODBC.
The original DataTable has two columns:
* First column is the Office Location (quite a long string)
* Second column is a booking value (integer)
Now I am looking for the most efficient way to insert this data into the external SQL Server. My goal is to automatically replace each office location with an index instead of using the full string, because each location occurs VERY often in the initial table.
Is this possible via a trigger or via a view on the SQL Server?
In the end I want to insert the data without touching it in my script, because that is very slow for this large amount of data, and let SQL Server handle the optimization.
I expect that if I INSERT the data including the office location, SQL Server looks up the index for an already-imported location and then uses just that index. And if the location does not yet exist in the index table / view, it should create a new entry there and then use the new index.
Here is a sample of the data I need to import via ODBC into the SQL Server:
OfficeLocation | BookingValue
EU-Germany-Hamburg-Ostend1 | 12
EU-Germany-Hamburg-Ostend1 | 23
EU-Germany-Hamburg-Ostend1 | 34
EU-France-Paris-Eifeltower | 42
EU-France-Paris-Eifeltower | 53
EU-France-Paris-Eifeltower | 12
What I need on the SQL Server as a result is something like these 2 tables:
Bookings:
OId | BookingValue
1   | 12
1   | 23
1   | 34
2   | 42
2   | 53
2   | 12
Locations:
OfficeLocation             | OId
EU-Germany-Hamburg-Ostend1 | 1
EU-France-Paris-Eifeltower | 2
My initial idea was to write the data into a temp table and have something like an intelligent TRIGGER (or a VIEW?) react to any INSERT into this table to create the 2 desired (optimized) tables.
Any hints are more than welcome!
Yes, you can create a view with an INSTEAD OF INSERT trigger to handle this. Something like:
CREATE TABLE dbo.Locations (
    OId int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    OfficeLocation varchar(500) NOT NULL UNIQUE
)
GO
CREATE TABLE dbo.Bookings (
    OId int NOT NULL,
    BookingValue int NOT NULL
)
GO
CREATE VIEW dbo.CombinedBookings
WITH SCHEMABINDING
AS
SELECT l.OfficeLocation, b.BookingValue
FROM dbo.Bookings b
INNER JOIN dbo.Locations l ON b.OId = l.OId
GO
CREATE TRIGGER CombinedBookings_Insert
ON dbo.CombinedBookings
INSTEAD OF INSERT
AS
-- first add any locations we have not seen before
-- (DISTINCT guards against duplicates within the inserted batch)
INSERT INTO Locations (OfficeLocation)
SELECT DISTINCT OfficeLocation
FROM inserted
WHERE OfficeLocation NOT IN (SELECT OfficeLocation FROM Locations)

-- then record the bookings against the location ids
INSERT INTO Bookings (OId, BookingValue)
SELECT l.OId, i.BookingValue
FROM inserted i
INNER JOIN Locations l ON i.OfficeLocation = l.OfficeLocation
As you can see, we first add to the locations table any missing locations and then populate the bookings table.
A similar trigger can cope with Updates. I'd generally let the Locations table just grow and not attempt to clean it up (for no longer referenced locations) with triggers. If growth is a concern, a periodic job will usually be good enough.
Be aware that some tools (such as bulk inserts) may not invoke triggers, so those will not be usable with the above view.
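For a plain INSERT (which does fire the trigger), usage looks like this, reusing the sample rows from the question:
INSERT INTO dbo.CombinedBookings (OfficeLocation, BookingValue)
VALUES ('EU-Germany-Hamburg-Ostend1', 12),
       ('EU-France-Paris-Eifeltower', 42);

-- dbo.Locations now holds one row per distinct location,
-- and dbo.Bookings references it by OId
SELECT * FROM dbo.Locations;
SELECT * FROM dbo.Bookings;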
I need to create a service, but I need help with the choice of tools.
Imagine a service in which users create data that has value in a historical view (e.g. transactions). Other users can see this data, but they need proof that the data is real and was not falsified by users or even by the service itself.
Example:
User A creates a record with the number 42
A couple of months pass
User B sees this record and wants to be sure that the service can't update this record to any other number, e.g. 37
The service has a trust window of 24 hours: it may even change user data that was created that day.
Question: which instruments can help me achieve that?
I was thinking about doing public daily backups (or reports?) that any user can download. A hash will be calculated from each report and inserted into the next backup, so a chain of hashes is created (a minimal sketch of this follows below). If the service changes something in the past, the hashes in this chain will not converge. Of course, I'll create an open-sourced tool for easily comparing diffs between the data and checking that the chain is valid.
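A minimal sketch of that hash chain (hypothetical table and column names, MySQL syntax; MD5 is used for brevity, a stronger hash would be preferable in practice):
-- each day's report stores the previous report's hash, so rewriting
-- any past report breaks every hash that comes after it
CREATE TABLE daily_report (
    report_date DATE PRIMARY KEY,
    report_data LONGTEXT NOT NULL,
    prev_hash   CHAR(32) NOT NULL,  -- report_hash of the previous day
    report_hash CHAR(32) NOT NULL   -- MD5(CONCAT(prev_hash, report_data))
);

-- verification: list any day whose stored hash no longer matches
SELECT report_date
FROM daily_report
WHERE report_hash <> MD5(CONCAT(prev_hash, report_data));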
Point of trust: there is one thing that I'm afraid of. The service could use many databases simultaneously and update all backups with all hashes at one time (because the first backup has no hash of a previous one). So, to cover that case too, I'm thinking of storing the hashes in some place that the service can't change at all. For example, in one of the existing blockchains (BTC, ETH, ...) from the official wallet of the service. Or maybe a DAG with some blockchain like IOTA?
What do you think of this point of trust?
Can I achieve my goal in some simpler way (without a blockchain)? And which one?
What are the bottlenecks in this logic?
There are 2 participating variables here:
* the timestamp at which the record is created, and
* the data itself.
Premises of the solution:
* Tamper-proof.
* The data can be changed within the same GMT calendar day without violating the tamper-proof guarantee (this can be changed to a fixed window after creation).
* An RDBMS as the data store (this can be changed to any NoSQL store with minor mods, but the idea remains the same).
* Doesn't depend on any other mechanism which can be faulty or error-prone.
* Single-query verification.
## Proposed solution
Create the data table:
CREATE TABLE TEST(
ID INT PRIMARY KEY AUTO_INCREMENT,
DATA VARCHAR(64) NOT NULL,
CREATED_AT DATETIME DEFAULT CURRENT_TIMESTAMP()
);
Create the checksum table, which monitors tampering:
CREATE TABLE SIGN(
ID INT PRIMARY KEY AUTO_INCREMENT,
DATA_ID INT NOT NULL,
SIGNATURE VARCHAR(128) NOT NULL,
CREATED_AT DATETIME DEFAULT CURRENT_TIMESTAMP(),
UPDATED_AT TIMESTAMP
);
Create a trigger on insert of the data:
/** Trigger on insert */
DELIMITER //
CREATE TRIGGER sign_after_insert
AFTER INSERT
ON TEST FOR EACH ROW
BEGIN
-- INSERT VAL
INSERT INTO SIGN(DATA_ID, `SIGNATURE`) VALUES(
NEW.ID, MD5(CONCAT (NEW.DATA, DATE(NEW.CREATED_AT)))
);
END; //
DELIMITER ;
Create a trigger for update of data
-- UPDATE TRIGGER
DELIMITER //
CREATE TRIGGER SIGN_AFTER_UPDATE
AFTER UPDATE
ON TEST FOR EACH ROW
BEGIN
-- UPDATE VALS
IF (NEW.DATA <> OLD.DATA) AND (DATE(OLD.CREATED_AT) = CURRENT_DATE() ) THEN
UPDATE SIGN SET SIGNATURE=MD5(CONCAT(NEW.DATA, DATE(NEW.CREATED_AT))) WHERE DATA_ID=OLD.ID;
END IF;
END; //
DELIMITER ;
Test
Step 1: insert the data
INSERT INTO TEST(DATA) VALUES ('DATA2');
The signature of the data and the date at which it was created will be reflected as the signature in the SIGN table.
Step 2: update the data
The signature will be updated only if the value is changed on the SAME DAY.
UPDATE TEST SET DATA='DATA' WHERE ID =1;
Step 3: validate
You can always validate the data signature as follows:
SELECT MD5(CONCAT(T.DATA, DATE(T.CREATED_AT))) AS CHECKSUM, S.SIGNATURE
FROM TEST AS T
JOIN SIGN AS S ON S.DATA_ID = T.ID
WHERE S.DATA_ID = 1;
Output
| CHECKSUM                         | SIGNATURE                        |
| -------------------------------- | -------------------------------- |
| 2bba70178abdafc5915ba0b5061597fa | 2bba70178abdafc5915ba0b5061597fa |
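Using the same schema, tampered rows can be flagged directly, since a recomputed checksum will no longer match the stored signature:
SELECT T.ID, T.DATA
FROM TEST AS T
JOIN SIGN AS S ON S.DATA_ID = T.ID
WHERE MD5(CONCAT(T.DATA, DATE(T.CREATED_AT))) <> S.SIGNATURE;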
I have a table where one of the columns is a path to an image and I need to create a directory for the record being inserted.
Example:
Id | PicPath
1  | /Pics/1/0.jpg
2  | /Pics/2/0.jpg
This way I can be sure that the folder name is always valid and it is unique (no clash between two records).
Question is: how can I safely refer to the current id of the record being inserted? Keep in mind that this is a highly concurrent environment, and I would like to avoid multiple trips to the DB if possible.
I have tried the following:
insert into Dummy values(CONCAT('a', (select IDENT_CURRENT('Dummy'))))
and
insert into Dummy values(CONCAT('a', (select SCOPE_IDENTITY() + 1)))
The first query is not safe: when running 1000 concurrent inserts I got 58 'duplicate key' exceptions.
The second query didn't work because SCOPE_IDENTITY() returned the same value for all queries as I suspected.
What are my alternatives here?
Try a temporary table to track your inserted ids using the OUTPUT clause:
INSERT INTO Dummy (someval)
OUTPUT inserted.identity_column INTO #temp_ids
VALUES ('a');
This will collect all the inserted ids from your statements; the 'inserted' pseudo-table is scoped to the statement, so it is safe under concurrency.
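For the picture-path case in the question, a minimal end-to-end sketch might look like this (dbo.Pics, #new_ids, and the CONCAT-based path are assumed names for illustration; CONCAT requires SQL Server 2012+):
CREATE TABLE dbo.Pics (
    Id int IDENTITY(1,1) PRIMARY KEY,
    PicPath varchar(260) NULL
);

CREATE TABLE #new_ids (Id int);

-- OUTPUT captures each generated identity atomically with the insert,
-- so concurrent sessions cannot see each other's ids
INSERT INTO dbo.Pics (PicPath)
OUTPUT inserted.Id INTO #new_ids (Id)
VALUES (NULL), (NULL);

-- build the path from the id in a follow-up statement
UPDATE p
SET p.PicPath = CONCAT('/Pics/', n.Id, '/0.jpg')
FROM dbo.Pics p
JOIN #new_ids n ON n.Id = p.Id;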
It may seem like a straightforward task, but I can't really find a good approach.
I have a long list of ids with a long list of corresponding values to update for a field (a single field):
id = 1 | field = value_1
id = 2 | field = value_2
.......................
id = n | field = value_n
I can put the values in 2 lists (or arrange them any other way I choose), but I would have to loop through and update each value...
What would be the best approach for this?
To add a few more details: the values are in a big Excel file, but this is not about processing that Excel file; I will copy-paste the list of values into... text. I was thinking of 2 long lists: (id1, id2, ...) and (value_1, value_2, ...).
For a one-time job, convert the text into a CSV or another format that bcp.exe can process, then import it into a temp table, do the update via a JOIN, and drop the temp table; see the sketch below.
For a repeatable job I would use SSIS: flat-file source for the data (or even an Excel source directly), source the table, merge the two sources, and apply the result back to the table.
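A minimal sketch of the one-time route, assuming the pairs have already been bulk-loaded into a temp table (table and column names are placeholders):
CREATE TABLE #updates (id int PRIMARY KEY, newvalue varchar(100));

-- bulk-load the id/value pairs into #updates here, e.g. with bcp or BULK INSERT

UPDATE t
SET t.field1 = u.newvalue
FROM dbo.mytable t
JOIN #updates u ON u.id = t.id;

DROP TABLE #updates;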
The selected answer is a good method, but for completeness: when this is a one-time task and the updates all follow a simple pattern like that, it can also be effective to convert the input text directly into a series of UPDATE statements, using an Excel formula with fill-down or a text editor's replace function.
Example:
id newvalue
1 foo
2 grok
becomes
id newvalue generated statement
1 foo update dbo.mytable set field1 = 'foo' where id = 1
2 grok update dbo.mytable set field1 = 'grok' where id = 2
Quick and dirty, but apply with care and watch out for unexpected syntax errors.
This is what I did in the end: I created a temp table, imported all the values into it, and updated via a join.
I have a database infrastructure where we regularly (at least once a day) replicate the full content of tables from a source database to approximately 20 target databases. Due to the replication code in use (we have to use regular Oracle queries; no control over or direct access to the source database), this results in 20 full-table sorts of the source table.
Is there any way to optimize for this in the query? I'm looking for something that would basically tell Oracle "I'm going to be repeatedly sorting this entire table". MySQL had an option with myisamchk where you could tell it to sort a table and keep it in sorted order, but obviously that doesn't apply here, for multiple reasons.
Currently, there are also some intermediate tables involved (sync from A to B, then from B to C). We do have control over the intermediate tables, so if there are tuning options there, that would be useful as well.
Generally, the queries are almost all of the very simplistic form:
select a, b, c, d, e, ... z from tbl1 order by a, b, c, d, e, ... z;
I'm aware of streams, but as described above, the primary source tables are outside of our control, so we won't be able to use streams there. (Additionally, those source tables are rebuilt completely from a snapshot daily, so streams wouldn't really work anyway.)
You could look into the multi-table INSERT feature. It should perform a single FULL SCAN and insert into multiple tables. Consider (10gR2):
SQL> CREATE TABLE t1 (ID NUMBER);
Table created
SQL> CREATE TABLE t2 (ID NUMBER);
Table created
SQL> INSERT ALL
2 INTO t1 VALUES (d_id)
3 INTO t2 VALUES (d_id)
4 /* your select goes here */
5 SELECT ROWNUM d_id FROM dual d CONNECT BY LEVEL <= 5;
10 rows inserted
SQL> SELECT COUNT(*) FROM t1;
COUNT(*)
----------
5
SQL> SELECT COUNT(*) FROM t2;
COUNT(*)
----------
5
You will have to check if it works over database links.
Some things that would help the sorting issue are indexes on the columns that you are sorting on (and also joining the tables on, if they're not there already). You could also create materialized views that store the results already sorted, so the sort is paid once per refresh rather than once per extraction; a sketch follows below.
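A sketch of the materialized view idea (object names assumed; the ORDER BY only influences the initial storage order, since row order on read is never guaranteed):
CREATE MATERIALIZED VIEW mv_tbl1_sorted
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
AS
SELECT a, b, c, d FROM tbl1
ORDER BY a, b, c, d;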
You don't say exactly how the replication is done or the data volumes involved (or why you are sorting the data).
If the aim is to minimise the impact on the source database, your best bet may be to extract into an intermediate file and load the file into the destination databases. The sort could be done on the intermediate file (if plain text), or as part of either the export or import into the destination databases.
In the source database:
create table export_emp_info
organization external
( type oracle_datapump
default directory DATA_PUMP_DIR
location ('emp.dmp')
) as select emp_id, emp_name, dept_id from emp order by dept_id
/
Copy the file, then import it in the destination database:
create table import_emp_info
(EMP_ID NUMBER(12),
EMP_NAME VARCHAR2(100),
DEPT_ID NUMBER)
organization external
( type oracle_datapump
default directory DATA_PUMP_DIR
location ('emp.dmp')
)
/
insert into emp_info select * from import_emp_info;
If you don't want or can't have the external table on the source db, you can use a straight expdp of the emp table (possibly using NETWORK_LINK if you have limited access to the source database's directory structure) and QUERY to do the ordering; a rough sketch follows.
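A rough sketch of that expdp route (the QUERY parameter's clause is appended to the internal SELECT, so an ORDER BY is accepted; quoting depends on your shell, and the schema name is assumed):
expdp scott DIRECTORY=DATA_PUMP_DIR DUMPFILE=emp.dmp TABLES=emp QUERY='emp:"ORDER BY dept_id"'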
You could load data from source table A to an intermediate table B and then do a partition exchange between B and destination table C. Exact replication, no sorting involved.
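A sketch of that exchange, assuming destination table C is partitioned and intermediate table B matches the shape of the target partition (all names hypothetical):
-- load the intermediate table, then swap it into place as a partition
INSERT /*+ APPEND */ INTO b SELECT * FROM a;

ALTER TABLE c
  EXCHANGE PARTITION p_current WITH TABLE b
  INCLUDING INDEXES WITHOUT VALIDATION;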
This I/U/D form of replication is what the MERGE command is there for. It's very doubtful that an expensive sort-merge would be required, and I'd expect to see hash joins instead. As long as the hash table can be stored in memory the hash join is barely more expensive than scanning the tables.
A handy optimisation is to store a hash value based on the non-key attributes, so that you can join between source and target tables on the key column(s) and compare small hash values instead of the full set of columns - change detection made easy.
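A sketch of that pattern, reusing the emp columns from the earlier answer; STANDARD_HASH is Oracle 12c+, the row_hash column on the target is an assumed addition, and deletes are omitted for brevity:
MERGE INTO emp_target t
USING (
  SELECT emp_id, emp_name, dept_id,
         STANDARD_HASH(emp_name || '|' || dept_id) AS row_hash
  FROM emp_source
) s
ON (t.emp_id = s.emp_id)
WHEN MATCHED THEN UPDATE
  SET t.emp_name = s.emp_name,
      t.dept_id  = s.dept_id,
      t.row_hash = s.row_hash
  WHERE t.row_hash <> s.row_hash  -- only touch rows that actually changed
WHEN NOT MATCHED THEN INSERT (emp_id, emp_name, dept_id, row_hash)
  VALUES (s.emp_id, s.emp_name, s.dept_id, s.row_hash);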