How to delete 4 layers of data in Snowflake - parent, child, grandchild, great-grandchild?
I have 4 tables where data is to be deleted whenever data is deleted in the parent table. As cascade delete is not available in Snowflake, how can we achieve this in an automated process?
I'd just use tasks and streams for this. You'd delete from the parent with a stream over it, and then a task would use the information in the stream to apply the deletes to the child, and so on down the hierarchy. You could have the tasks checking for changes on the parent every minute or so; a sketch of the pattern follows the doc links below.
https://docs.snowflake.com/en/user-guide/tasks-intro.html
https://docs.snowflake.com/en/user-guide/streams.html
https://community.snowflake.com/s/article/Using-Streams-and-Tasks-inside-Snowflake
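A minimal sketch of the pattern, assuming simple PARENT/CHILD tables with an id/parent_id pair and a warehouse named my_wh:

-- stream on the parent captures the deleted rows
create or replace stream parent_delete_stream on table parent;

-- task checks the stream every minute and propagates the delete to the child
create or replace task cascade_to_child
  warehouse = my_wh
  schedule = '1 MINUTE'
when system$stream_has_data('PARENT_DELETE_STREAM')
as
  delete from child
  where parent_id in (
    select id
    from parent_delete_stream
    where metadata$action = 'DELETE' and metadata$isupdate = false
  );

alter task cascade_to_child resume;

A stream on CHILD plus a similar task then cascades to the grandchild, and so on down to the great-grandchild.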
Related
I have a table which gets loaded from S3 every time there is a new file in the bucket.
I am using Snowpipe to do so.
However, the requirement is to refresh the table on every load.
To accomplish that, my thought process is below.
Create a pipe on table t1 to copy from S3.
Create a Stream on table t1.
Create a task to run every 5 minutes with the condition that the stream has data.
The task statement will delete records from the table where the load_date in the stream is not equal to the load_date of the table. (Using the stream in a DML operation so that the stream gets emptied.)
So basically, I am using the table's own stream to delete data from the table (a rough sketch of this setup follows below).
However, my issue is what will happen when there are multiple loads on the same day.
Also, this approach does not look very clean. Is there a better way?
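For reference, a rough sketch of the setup described above; the stage, warehouse, and load_date column are assumptions, and the delete condition is one interpretation of the task's delete statement:

-- pipe to copy new S3 files into t1
create pipe t1_pipe auto_ingest = true as
  copy into t1 from @my_s3_stage file_format = (type = 'CSV');

-- stream on t1 itself
create or replace stream t1_stream on table t1;

-- task runs every 5 minutes when the stream has data and deletes rows from
-- older loads; referencing the stream in the DML consumes it and empties it
create or replace task refresh_t1
  warehouse = my_wh
  schedule = '5 MINUTE'
when system$stream_has_data('T1_STREAM')
as
  delete from t1
  where load_date < (select max(load_date) from t1_stream);

alter task refresh_t1 resume;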
I would create a new target table for the stream data and merge into this new table on every run. If you really need to delete data from t1, then you could set up a child task that deletes data from t1 based on what you have in t2 (after you have merged).
However, the stream will record these delete operations. Depending on how your load works, you could create an append-only stream, or, when ingesting the stream, make sure to use the metadata columns to filter only the events you are interested in.
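A rough sketch of that approach, assuming t2 as the new target table and id/payload/load_date as illustrative columns:

-- append-only stream: deletes on t1 (e.g. by a child task) are not captured
create or replace stream t1_append_stream on table t1 append_only = true;

-- task merges new rows from the stream into the separate target table
create or replace task merge_t1_into_t2
  warehouse = my_wh
  schedule = '5 MINUTE'
when system$stream_has_data('T1_APPEND_STREAM')
as
  merge into t2
  using t1_append_stream s
    on t2.id = s.id
  when matched then
    update set t2.payload = s.payload, t2.load_date = s.load_date
  when not matched then
    insert (id, payload, load_date) values (s.id, s.payload, s.load_date);

alter task merge_t1_into_t2 resume;

A child task (created with "after merge_t1_into_t2") could then delete from t1 based on what is now in t2.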
I have been asked to check a production issue, for which I need help. I am trying to understand the isolation levels and different locks available in SQL Server.
I have a table JOB_STATUS with columns job_name (string, primary key), job_status (string), and is_job_locked (string).
Sample data is as below:
job_name | job_status | is_job_locked
---------+------------+--------------
JOB_A    | INACTIVE   | N
JOB_B    | RUNNING    | N
JOB_C    | SUCCEEDED  | N
JOB_D    | RUNNING    | N
JOB_E    | INACTIVE   | N
Multiple processes can update the table at the same time by calling a stored procedure and passing the job_name as an input parameter. It is fine if two different rows are being updated by separate processes at the same time.
BUT, two processes should not update the same row at the same time.
Sample update query is as follows:
update JOB_STATUS set is_job_locked='Y' where job_name='JOB_A' and is_job_locked='N';
Here if two processes are updating the same row, then one process should wait for the other one to complete. Also, if the is_job_locked column value is changed to Y by one process, then the other process should not update it again (which my update statement should handle if locking is proper).
So how can I do this row-level locking and make sure the update query reads the latest data from the row before making the update, using a stored procedure?
Also, I would like to get a return value indicating whether or not the update query updated the row, so that I can use this value in my further application flow.
RE: "Here if two processes are updating the same row, then one process should wait for the other one to complete. "
That is how locking works in SQL Server. An UPDATE takes an exclusive lock on the row -- where "exclusive" means the English meaning of the word: the UPDATE process has excluded (locked out) all other processes while it is running. The other processes now wait for the UPDATE to complete. This includes READ processes for transaction isolation levels READ COMMITTED and above. When the UPDATE lock is released, then the next statement can access the value.
If what you are looking for is that two processes cannot change the same row in a single table at the same time, then SQL Server does that for you out of the box, and you do not need to add your own "is_job_locked" column.
However, typically an is_job_locked column is used to control access beyond a single table. For example, it may be used to prevent a second process from starting a job that is already running. Process A would mark is_job_locked, then start the job. Process B would check the flag before trying to start the job.
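A minimal sketch of such a procedure, using @@ROWCOUNT to report whether the row was claimed (the procedure name, parameter size, and output parameter are illustrative):

create procedure dbo.TryLockJob
    @job_name varchar(50),
    @locked bit output
as
begin
    set nocount on;

    -- the UPDATE takes an exclusive lock on the row for the duration of the
    -- statement; a concurrent caller waits, then sees is_job_locked = 'Y'
    -- and updates nothing
    update JOB_STATUS
    set is_job_locked = 'Y'
    where job_name = @job_name
      and is_job_locked = 'N';

    -- 1 if this call claimed the job, 0 if another process already had it
    set @locked = case when @@ROWCOUNT > 0 then 1 else 0 end;
end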
I did not have to use an explicit lock or provide any isolation level, as it was a single update query in the stored procedure.
SQL Server only allows one process at a time to update a row; the committed value is then read by the second process, which does not update it again.
Also, I used @@ROWCOUNT to get the number of rows updated. My issue is solved now.
Thanks for the answers and comments.
I'm implementing a background process for moving event log records from an SQL database to mongoDB.
Event log / audit trail entries are known to change only once, at the end of the event. The process is like this (a T-SQL sketch follows the list):
1) an event log entry gets created to fixate that a business process has been initiated
2) a new transaction is started for the business process
3) audit trail entries are created
4) attempts to update the event log entry with successful status
5) the transaction completes
6) if the transaction fails - updates the event log entry as failed
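A sketch of that sequence in T-SQL; the table and column names (dbo.EventLog, Id, Status, CreatedAt) are assumptions:

declare @event_id bigint;

-- 1) fixate that the business process has been initiated
insert into dbo.EventLog (Status, CreatedAt)
values ('STARTED', sysutcdatetime());
set @event_id = scope_identity();

begin try
    begin transaction;                      -- 2) business-process transaction

    -- 3) business changes and audit trail entries go here

    -- 4) mark the event log entry as successful
    update dbo.EventLog set Status = 'SUCCEEDED' where Id = @event_id;

    commit transaction;                     -- 5) transaction completes
end try
begin catch
    if @@trancount > 0 rollback transaction;

    -- 6) the transaction failed, so mark the entry as failed
    update dbo.EventLog set Status = 'FAILED' where Id = @event_id;
end catch;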
So, theoretically, the background process could safely read all the log entries that have already been marked as failed/successful and, most probably, after a specific timeout, could also safely read the entries that for an unknown reason are stuck in the "Process was started" state.
To avoid any locks on the event log table while the background process is reading event log records in batches, I would like to use a relaxed isolation level, but I'm not sure which one would be safe to use on a table that will have lots of parallel inserts and updates occurring constantly (albeit updates only on the records that my background process will ignore, so I don't care about dirty reads).
In my case it seems to be acceptable to miss the records that are being inserted right now (I'll get them in the next background job run anyway) but it is not acceptable to get duplicate/missing records among the older records that aren't being updated right now.
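For context, the batch read would look something like the sketch below (dbo.EventLog, Id, Status, Payload, and CreatedAt are assumed names); which isolation level or table hint makes it safe is exactly the open question:

declare @last_exported_id bigint = 0;   -- highest Id already copied to mongoDB

select top (1000) Id, Status, Payload, CreatedAt
from dbo.EventLog
where Id > @last_exported_id
  and (Status in ('SUCCEEDED', 'FAILED')
       or CreatedAt < dateadd(minute, -30, sysutcdatetime()))  -- "stuck" timeout
order by Id;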
P.S. You might ask - why not log directly to mongoDB? There are two reasons: 1) the database has many triggers and stored procs that log to SQL table and the customer doesn't want to reimplement all of that; 2) the customer wants to ensure atomicity of event log/audit trail with the transaction of the business process and he's afraid that with direct journaling to mongoDB there might be cases when for some (most probably - very critical) reason event log entries go missing while the SQL transaction has succeeded and the data was changed. It's not trivially possible to include writing to mongoDB in a single atomic unit of work with an SQL transaction.
Context:
We would like to delete multiple records in the Case object and its related/child objects. The child objects have a few related objects of their own. There are 4 to 5 levels of hierarchy, as follows:
Case
--Task
-----Child1
--------Child2
-----------Child3
The related objects have a master-child relationship with cascade delete set to false.
Currently, the way we are deleting cases in a batch is as follows:
Collect all cases in the batch
Collect all the tasks for all cases in the batch
Collect all the Child1 records for all Cases in the batch
Collect all the Child2 records for all Cases in the batch
Collect all the Child3 records for all Cases in the batch
Then delete each set of records in the batch using bulk delete. The advantage is that we have only 5 deletes per batch and we don't hit governor limits.
However, the downside of this process is that when there is an error in any of the delete steps above, the whole transaction is rolled back. Though we can tell which delete caused the error, we cannot roll back only the objects related to that particular case.
Question:
Is there a better way to handle deleting records and their child records?
Is there a way to roll back only the cases and the child records which had an error?
Use a bottom-up approach along with batching.
For example, start with Child3: collect all the records that need to be purged in each related object and the Case object, then delete them in batches.
Child3
--Child2
-----Child1
--------Task
-----------Case
There are 2 ways:
Declarative method:
Use the Process Event and Process Builder/Flow if the field count in each object is less than 350. The advantage of this method is that you don't have to write code.
Using an Apex class:
Write Apex code to get all the related objects that need to be purged/deleted and execute the deletes in a batch.
a) Set the batch size to 1 so that on any error while deleting, only the related records are rolled back.
b) If the batch size is set higher than 1, the whole batch will be rolled back. In that case you need to identify the parent (Case object) ids, mark them as errored, and re-read all the other records which were rolled back as part of the batch and run them again. In this method, identifying the failed record (e.g. if the records failed in Child3) and its root parent (Case) id could be challenging.
I have a table dbo.RawMessage, which another system frequently inserts data into (about 2 records per second).
I need to process the data in RawMessage and put the processed data in dbo.ProcessedMessage.
Because the processing logic is not very complicated, my approach is to add an insert trigger on the RawMessage table, but sometimes I get deadlocks.
I am using SQL Server Express.
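For reference, the kind of trigger being described looks roughly like this; the column names and the upper() "processing" are placeholders:

create trigger dbo.trg_RawMessage_Insert
on dbo.RawMessage
after insert
as
begin
    set nocount on;

    -- "inserted" holds the newly added RawMessage rows; apply the simple
    -- processing inline and write the result to ProcessedMessage
    insert into dbo.ProcessedMessage (RawMessageId, ProcessedText, ProcessedAt)
    select i.Id, upper(i.MessageText), sysutcdatetime()
    from inserted as i;
end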
My questions:
1. Is this a stupid approach?
2. If not, how can I improve it?
3. If yes, please guide me to a more graceful way.