Is there a way of converting the a.ROWID > b.ROWID comparison in the Oracle code below to Snowflake? I need to carry this logic over to Snowflake, but Snowflake does not maintain a ROWID. Is there any way to achieve the same result and work around the ROWID issue?
DELETE FROM user_tag.user_dim_default a
WHERE EXISTS (SELECT 1
FROM rev_tag.emp_site_weekly b
WHERE a.number = b.ID
AND a.accountno = b.account_no
AND a.ROWID > b.ROWID)
This Oracle code seems quite broken, because ROWID is a table-specific pseudo column, so comparing its value between two tables is dubious. Unless there is some aligned magic happening, like rows being written to rev_tag.emp_site_weekly at the same time user_tag.user_dim_default is inserted. But even then I can imagine data flows where this will not get what you want.
So, as with most things in Snowflake, "there is no free lunch": the part of the data life cycle that relies on ROWID needs to be implemented explicitly.
Which implies that if you want to use two sequences, you should define them explicitly on each table. And if you want them to be related to each other, it sounds like a multi-table insert or MERGE should be used, so you can access the first table's sequence value and relate it in the second.
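For example, a minimal sketch of that pattern in Snowflake, using one shared sequence in a multi-table insert (the load_key column, the staging table and the source column names are assumptions, not part of the original schema):

-- One sequence provides a surrogate key shared by both targets
CREATE OR REPLACE SEQUENCE seq_load_key;

-- Multi-table insert: each source row gets one NEXTVAL, and that same value
-- is written to both tables so the rows can later be related on load_key
INSERT ALL
    INTO user_tag.user_dim_default (load_key, number, accountno)
        VALUES (load_key, src_number, src_accountno)
    INTO rev_tag.emp_site_weekly (load_key, id, account_no)
        VALUES (load_key, src_number, src_accountno)
SELECT seq_load_key.NEXTVAL AS load_key,
       src_number,
       src_accountno
FROM my_staging_table;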
ROWID is an internal hidden column used by the database for specific DB operations. Depending on the vendor, you may have additional columns such as a transaction ID or a logical delete flag. Be very careful to understand the behavior of these columns and how they work. They may not be in order, they may not be sequential, and they may change in value as a DB maintenance job runs while your code is running, or when someone else runs an update on the table. Some of these internal columns may even have the same value for more than one row.
When joining tables, the RowID on one table has no relation to the RowID on another table. When writing dedup logic or delete-before-insert logic, you should use the primary key, plus an audit column that holds the date of insert or date of last update. Check the data model or ERD diagram for the PK/FK relationships between the tables and what audit columns are available.
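For example, in Snowflake a delete-duplicates step along those lines might look like the sketch below, where last_update_dt stands in for whatever audit column the table actually has (an assumption, not taken from the question):

-- Keep the most recent row per business key, delete the older copies
DELETE FROM user_tag.user_dim_default
USING (
    SELECT number AS k_number, accountno AS k_accountno, MAX(last_update_dt) AS max_dt
    FROM user_tag.user_dim_default
    GROUP BY number, accountno
) keep
WHERE number = keep.k_number
  AND accountno = keep.k_accountno
  AND last_update_dt < keep.max_dt;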
Related
I have a table for which I want to create SCD Type 2 using the T-SQL MERGE statement; however, it doesn't have a unique key.
RoleTaskTable:
RoleID, TaskID
1,A
1,B
1,C
2,A
2,D
2,F
3,A
3,B
3,E
3,F
Obviously I get the error
"The MERGE statement attempted to UPDATE or DELETE the same row more than once. This happens when a target row matches more than one source row. A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. Refine the ON clause to ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows."
When I use the combination of RoleID and TaskID as the unique index for the staging table as well as the SCD table, it simply recognizes each combination as a new record, so all records (even ones that have been deleted) stay flagged as Active in the SCD table.
How can you solve something like this?
I can show the entire code I have for this SCD if needed, but I think I am missing something basic here.
At the risk of sounding obvious, give the table a unique key. Without one it is not a proper relation. MERGE is designed to work with unique keys.
OTOH, since MERGE is just a one-stop shop for insert/update/delete, you could use the individual commands, wrapped in a transaction, to accomplish the same thing.
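For instance, a minimal sketch of that approach, assuming hypothetical SCD columns (IsActive, StartDate, EndDate) and table names (dbo.Staging, dbo.SCDTable) that will differ from your actual schema:

BEGIN TRANSACTION;

-- Close off active rows whose (RoleID, TaskID) combination is gone from the source
UPDATE tgt
SET IsActive = 0,
    EndDate  = GETDATE()
FROM dbo.SCDTable AS tgt
WHERE tgt.IsActive = 1
  AND NOT EXISTS (SELECT 1
                  FROM dbo.Staging AS src
                  WHERE src.RoleID = tgt.RoleID
                    AND src.TaskID = tgt.TaskID);

-- Insert combinations that are new or have no active row yet
INSERT INTO dbo.SCDTable (RoleID, TaskID, IsActive, StartDate)
SELECT src.RoleID, src.TaskID, 1, GETDATE()
FROM dbo.Staging AS src
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.SCDTable AS tgt
                  WHERE tgt.RoleID = src.RoleID
                    AND tgt.TaskID = src.TaskID
                    AND tgt.IsActive = 1);

COMMIT TRANSACTION;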
Background
This is a simplified version of the postgres database I am managing:
TableA: id,name
TableB: id,id_a,prop1,prop2
This database has a peculiarity: when I select data, I only consider rows of TableB that have the same id_a. So I am never interested in selecting data from TableB with mixed values of id_a. Therefore, queries are always of this kind:
SELECT something FROM TableB INNER JOIN TableA ON TableA.id=id_a
Some time ago, the number of rows in TableA grew up to 20000 rows and TableB up to 10^7 rows.
To speed up queries I first added a B-tree index on one of the TableB properties. Something like the following:
"my_index" btree (prop1)
The problem
Now I have to insert new data, and the database size will become more than double its current size. Inserting data into TableB has become too slow.
I understand that the slowness comes from the updating of my_index.
When I add a new row to TableB, the database has to reorder the my_index lookup table.
I feel like this could be sped up if my_index did not cover all elements.
But I do not need a new row with a given id_a value to be sorted together with rows that have a different id_a value.
The question
How can I create an index on a table where the elements are ordered only when they share a common property (e.g. a column called id_a)?
You can't.
The question that I would immediately ask you if you want such an index is: Yes, but for what values of id_a do you want the index? And your answer would be “for all of them”.
If you actually would want an index only for some values, you could use a partial index:
CREATE INDEX partidx ON tableb(prop1) WHERE id_a = 42;
But really you want an index for the whole table.
Besides, the INSERT would be just as slow unless the row inserted does not satisfy the WHERE condition of your index.
There are three things you can do to speed up INSERT:
Run as many INSERT statements as possible in a single transaction, ideally all of them.
Then you don't have to pay the price of a COMMIT after every single INSERT, and COMMITs are quite expensive: they require data to be written to the disk hardware (not just the cache), which is incredibly slow (1 ms is a decent time).
You can speed this up even more if you use prepared statements. That way the INSERT doesn't have to be parsed and planned every time; see the sketch below.
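A minimal sketch of that combination; the column types are assumptions, since the real definition of TableB wasn't shown:

BEGIN;

-- Prepare the INSERT once, so it is parsed only once per session
PREPARE ins_b (int, int, text, text) AS
    INSERT INTO TableB (id, id_a, prop1, prop2) VALUES ($1, $2, $3, $4);

-- Execute it for every row inside a single transaction
EXECUTE ins_b(1, 42, 'foo', 'bar');
EXECUTE ins_b(2, 42, 'baz', 'qux');
-- ... many more EXECUTE calls ...

COMMIT;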
Use the SQL command COPY to insert many rows. COPY is specifically designed for bulk data import and will be faster than INSERT.
If COPY is too slow, usually because you need to INSERT a lot of data, the fastest way is to drop all indexes, insert the data with COPY and then recreate the indexes. That can speed up the process by an order of magnitude, but of course the database is not fully available while the indexes are dropped.
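Roughly like this; the file path and format are placeholders (COPY ... FROM reads a file on the database server, while psql's \copy reads a client-side file):

-- Drop the index so it is not maintained row by row during the load
DROP INDEX my_index;

-- Bulk-load the data
COPY TableB (id, id_a, prop1, prop2)
FROM '/path/to/tableb.csv' WITH (FORMAT csv);

-- Rebuild the index once at the end
CREATE INDEX my_index ON TableB USING btree (prop1);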
I have to get input from users regarding their skills. I have a table for skills which has id as the primary key. In another table I am storing the user id and skill id as a many-to-many relationship. Now the problem is: how do I know that the skill entered by a user is already in my skills table? I have to put the skill's id in the many-to-many relationship table. Do I run a SELECT statement each time, or is there a more efficient solution available? Thanks,
how do I know that the skill entered by a user is already in my skills table?
In a concurrent environment, there is no way for you to know that. Even if you do the SELECT, that only tells you whether row existed at the time of the SELECT execution - it doesn't tell you whether the row exists now. For example, even if the SELECT returned an empty result, a concurrent transaction might have inserted the row within the few milliseconds that it took for you to receive the SELECT result.
So you either do a drastic reduction in concurrency (e.g. through table locks), or learn to live with it...
When just the INSERT is needed
I'd recommend you simply attempt the INSERT (without SELECT) and then ignore the possible PRIMARY KEY violation1.
If you did the separate SELECT and INSERT steps, you'd still have to be prepared for PK violations, since a concurrent transaction could perform the INSERT (and commit) after your SELECT but before your INSERT. So why bother with the SELECT in the first place?
When INSERT or UPDATE is needed
If the junction table contains other fields in addition to FK fields, then you might want to update them to new values, so you'd have to first perform the SELECT to determine if the row needs inserting or updating.
In such a case, consider locking the row early using SELECT ... FOR UPDATE (or equivalent syntax).2 Alternatively, some DBMSes offer "insert or update" (aka "UPSERT") in a single command (e.g.: MySQL INSERT ... ON DUPLICATE KEY UPDATE).
1 But be careful to only ignore PK violations - don't just blindly "swallow" FK or CHECK violations etc...
2 To avoid it being deleted before you had a chance to UPDATE it (INSERT PK violation is still possible). Even worse, a concurrent transaction could UPDATE the row, leaving your transaction to silently overwrite other transaction's values, without being aware they were ever there.
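As an illustration of the single-command route, a MySQL-flavored sketch, assuming a junction table user_skill(user_id, skill_id, level) with a composite primary key on (user_id, skill_id); the extra level column is hypothetical:

-- Insert the link, or update the extra column if the (user_id, skill_id) pair already exists
INSERT INTO user_skill (user_id, skill_id, level)
VALUES (123, 45, 'expert')
ON DUPLICATE KEY UPDATE level = VALUES(level);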
There is no one-step solution to this problem. First you need to check if the skill exists in the Skills table and get its ID, or insert it and get the ID if it did not exist. Then insert a row into your PeopleSkills table with the IDs of the person and the skill...
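A sketch of that two-step flow; the table and column names are guesses based on the question, and the concurrency caveats from the other answer still apply:

-- Step 1: insert the skill only if it is not there yet
-- (syntax shown for Postgres/SQL Server; MySQL needs FROM DUAL on the bare SELECT)
INSERT INTO skills (name)
SELECT 'sql'
WHERE NOT EXISTS (SELECT 1 FROM skills WHERE name = 'sql');

-- Step 2: link the person to the skill by looking up its id
INSERT INTO people_skills (person_id, skill_id)
SELECT 123, s.id
FROM skills s
WHERE s.name = 'sql';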
I have an employee table with an employee_id, name, and working_division, where employee_id is the primary key. I have an Excel source with these columns and more, where an employee has entered their hours, what type of work they have done, what division of the company it was for, and so forth.
So for any given day and employee, I could have multiple rows showing their type of work, what division they worked for, and the hours they charged to that division.
How do I get this into the OLE DB destination in which employee_id is the primary key?
I am trying to use the Aggregate transform to group by employee_id; however, employee_id and working_division are not one-to-one. Thus, grouping by both of those columns will try to insert the same employee_id into the employee table more than once (employee_id is the primary key!). If I do not include working_division in the Aggregate transform, then I lose that data.
How can I group my data by the employee_id, and still keep all the other columns with that row?
Thanks for all the help!
I need the employee_id to be the PK. Basically I have a very large unorganized data source, and I am breaking it apart into 4 to 5 separate tables to fit my model so I can make sense of the data with some data mining algorithms.
OK, then why don't you split employee_id and working_division into two separate tables? The second table should keep a FK to the employee table (so one-to-many).
In the SSIS package you can then add a Multicast component, right after the Aggregate on employee_id, in order to split your data source in the 2 target tables.
I think that without a modification in your target model you won't be able to achieve what you want. It basically violates the rules of RDBMS. That grouping you are talking about couldn't be done even in plain SQL and yield correct results.
Note: If you're worried about modifying your target data model, then perhaps you can normalize it as mentioned before and then denormalize it back through a view. You may even be able to create an indexed view in order to speed things up at read time (as far as I can see an indexed view should be possible, since all you have is an inner join between the two tables).
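A rough T-SQL sketch of that layout; only employee_id, name and working_division come from the question, everything else (table names, types, the view) is assumed:

CREATE TABLE dbo.employee (
    employee_id      INT           NOT NULL PRIMARY KEY,
    name             NVARCHAR(100) NOT NULL
);

CREATE TABLE dbo.employee_division (
    employee_id      INT          NOT NULL REFERENCES dbo.employee (employee_id),
    working_division NVARCHAR(50) NOT NULL,
    CONSTRAINT pk_employee_division PRIMARY KEY (employee_id, working_division)
);
GO

-- Denormalize it back through a schema-bound view...
CREATE VIEW dbo.v_employee_division
WITH SCHEMABINDING
AS
SELECT e.employee_id, e.name, d.working_division
FROM dbo.employee AS e
INNER JOIN dbo.employee_division AS d
    ON d.employee_id = e.employee_id;
GO

-- ...and index it so reads stay cheap
CREATE UNIQUE CLUSTERED INDEX ix_v_employee_division
    ON dbo.v_employee_division (employee_id, working_division);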
I'm importing more than 600.000.000 rows from an old database/table that has no primary key set; this table is in a SQL Server 2005 database. I created a tool to import this data into a new database with a very different structure. The problem is that I want to be able to resume the process from where it stopped for any reason, like an error or a network error. As this table doesn't have a primary key, I can't check if a row was already imported or not. Does anyone know how to identify each row so I can check if it was already imported? This table has duplicated rows; I already tried to compute a hash of all the columns, but it's not working due to the duplicated rows...
thanks!
I would bring the rows into a staging table if this is coming from another database -- one that has an identity column on it. Then you can identify the rows where all the other data is the same except for the id, and remove the duplicates before trying to put the data into your production table.
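Something along these lines, assuming a staging table dbo.staging_rows with an identity column staging_id and placeholder data columns col1..col3:

-- Keep the first copy of each duplicate group, delete the rest
WITH d AS (
    SELECT staging_id,
           ROW_NUMBER() OVER (PARTITION BY col1, col2, col3
                              ORDER BY staging_id) AS rn
    FROM dbo.staging_rows
)
DELETE FROM d
WHERE rn > 1;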
So: you are loading umpteen bazillion rows of data, the rows cannot be uniquely identified, the load can (and, apparently, will) be interrupted at any point at any time, and you want to be able to resume such an interrupted load from where you left off, despite the fact that for all practical purposes you cannot identify where you left off. Ok.
Loading into a table containing an additional identity column would work, assuming that however and whenever the data load is started, it always starts at the same item and loads items in the same order. Wildly inefficient, since you have to read through everything every time you launch.
Another clunky option would be to first break the data you are loading into manageably-sized chunks (perhaps 10,000,000 rows each). Load them chunk by chunk, keeping track of which chunks you have loaded. Use a staging table, so that you know and can control when a chunk has been "fully processed". If/when interrupted, you only have to toss the chunk you were working on when interrupted, and resume work with that chunk.
With duplicate rows, even row_number() is going to get you nowhere, as it can change between queries (due to the way MSSQL stores data). You need to either bring the data into a landing table with an identity column or add a new identity column onto the existing table (alter table oldTbl add NewId int identity(1,1)).
You could use row_number() and then back out the last n rows if they have more than the count in the new database for them, but it would be more straightforward to just use a landing table.
Option 1: duplicates can be dropped
Try to find a somewhat unique field combination (duplicates are allowed) and join on a hash of the rest of the fields, which you store in the destination table.
Assume these tables:
create table t_x(id int, name varchar(50), description varchar(100))
create table t_y(id int, name varchar(50), description varchar(100), hash varbinary(8000))
select * from t_x x
where not exists (select *
                  from t_y y
                  where x.id = y.id
                    and hashbytes('sha1', x.name + '~' + x.description) = y.hash)
The reason to join on as many fields as possible is to reduce the chance of hash collisions, which are a real concern on a dataset with 600.000.000 records.
Option 2: duplicates are important
If you really need the duplicate rows, you should add a unique id column to your big table. To achieve this in a performant way, follow these steps (a sketch follows the list):
Alter the table and add a uniqueidentifier or int field.
Update the table with the newsequentialid() function or a row_number().
Create an index on this field.
Add the id field to your destination table.
Once all the data is moved over, the field can be dropped.
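A minimal T-SQL sketch of the row_number() variant; big_source, import_id and the arbitrary ordering are placeholders:

-- 1. Add the new id column
ALTER TABLE dbo.big_source ADD import_id BIGINT NULL;

-- 2. Number every row; the ORDER BY is arbitrary because only uniqueness matters
WITH numbered AS (
    SELECT import_id,
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
    FROM dbo.big_source
)
UPDATE numbered SET import_id = rn;

-- 3. Index it so lookups during the resumable load are cheap
CREATE INDEX ix_big_source_import_id ON dbo.big_source (import_id);

-- 4. Carry import_id into the destination table so already-loaded rows can be skipped.
-- 5. Once all the data is moved over, drop the column:
--    ALTER TABLE dbo.big_source DROP COLUMN import_id;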