How to process data in a table that receives frequent inserts - sql-server

I have a table dbo.RawMessage into which another system frequently inserts data (about 2 records per second).
I need to process the data in RawMessage and put the processed data in dbo.ProcessedMessage.
Because the processing logic is not very complicated, my approach was to add an insert trigger on the RawMessage table, but sometimes I get deadlocks.
I am using SQL Server Express.
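The trigger currently looks roughly like this (the two table names are real; the columns and the processing expression are simplified placeholders):

CREATE TRIGGER trg_RawMessage_Insert
ON dbo.RawMessage
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Process every newly inserted row (the inserted pseudo-table can hold
    -- more than one row) and write the results in a single statement.
    INSERT INTO dbo.ProcessedMessage (RawMessageId, ProcessedBody, ProcessedAt)
    SELECT i.Id,
           UPPER(i.MessageBody),   -- placeholder for the real processing logic
           GETDATE()
    FROM inserted AS i;
END;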
My questions:
1. Is this a stupid approach?
2. If not, how can it be improved?
3. If yes, please point me to a more graceful way.

Related

How are concurrent queries handled in Snowflake?

For example, if I have a task that's inserting rows into a table while another task is truncating the same table, what happens?
I'm asking because I have a task that runs every minute which inserts rows into a table, and a Lambda that runs every minute which reads and truncates the same table. I know Snowflake tasks and EventBridge don't run exactly on the minute, so I haven't really run into this issue yet, but I'm thinking it'll happen eventually.
How does snowflake handle this?
It is the same concept as in other SQL engines: locks are placed on resources.
In the Snowflake world, an INSERT takes PARTITION-level locks, because most INSERT statements write only new partitions.
Please see the below doc:
https://docs.snowflake.com/en/sql-reference/transactions.html#resource-locking
If the INSERT query is submitted before the TRUNCATE, then the TRUNCATE will have to wait until the INSERT query finishes. They can't operate on the same resource at the same time.
When this happens, the first query (the INSERT) holds the PARTITION-level lock while the second query (the TRUNCATE) sits in the WAITING state.
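You can reproduce and observe this yourself with SHOW LOCKS; the table name below is just an example:

-- Session 1: open an explicit transaction and insert; the statement holds
-- a PARTITION-level lock until commit/rollback.
BEGIN;
INSERT INTO T1 VALUES (1, 'a');

-- Session 2: the TRUNCATE needs a table lock, so it waits for session 1.
TRUNCATE TABLE T1;

-- Any session: list the lock queue; the INSERT shows as HOLDING and the
-- TRUNCATE as WAITING until session 1 commits or rolls back.
SHOW LOCKS;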
The table will be locked by the first transaction that runs and subsequent transactions will be queued until the preceding transaction(s) complete.
BTW (and this may be the point of your question) having two processes like this operate independently doesn't seem like a good design, as the Lambda process seems to be logically dependent on the task.

SQL Integrity Concern for long running staged Backend Processing

I have an application which selects quite a large amount of data (pyodbc + sqlalchemy, db = SQL Server), does some operations on it (with pandas) and then inserts the results into another table.
My issue is that, at the end of my processing, I would like to mark the rows I originally selected.
What is the best way to achieve this?
I currently prevent any new inserts into the first table with a PID lock (blocking the loader), though this of course is not a constraint enforced by the DB, and then bulk update the rows in the first table that don't have a mark yet.
I could of course get a list of the IDs that were in my original data and update them in batches (which is probably really slow, since there could be millions upon millions of rows).
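The batched variant I have in mind would be something like this (the work table of processed IDs and the flag column are made-up names):

-- Assumes the processed IDs were captured into a work table during the
-- pandas step; all names here are hypothetical.
DECLARE @BatchSize int = 50000;

WHILE 1 = 1
BEGIN
    UPDATE TOP (@BatchSize) s
    SET    s.ProcessedFlag = 1
    FROM   dbo.SourceTable AS s
    JOIN   dbo.ProcessedIds AS p
        ON p.Id = s.Id
    WHERE  s.ProcessedFlag = 0;       -- only rows not marked yet

    IF @@ROWCOUNT = 0 BREAK;          -- stop when nothing is left to mark
END;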
Another option would be to lock the table at the start of my process, but is this actually a good idea? (What if my script dies for whatever reason during processing, in a way that the "finally" block for releasing the lock is not executed?)
Thankful for any ideas, thoughts etc!

Oracle database table delete best practices

Environment: Oracle 12C
Got a table with about 10 columns, including a few CLOB and date columns. This is a very busy table for an ETL process, as described below:
Flat files are loaded into the table first, then updated and processed. The insert and updates happen in batches. Millions of records are inserted and updated.
There is also a delete process to delete old data based on a date field. The delete process runs as a PL/SQL procedure and deletes from the table in a loop, fetching only the first n records at a time based on the date field.
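The loop currently looks roughly like this (table name, date column and retention window are simplified, not the real ones):

DECLARE
  c_batch_size CONSTANT PLS_INTEGER := 10000;
BEGIN
  LOOP
    -- Delete at most one batch of old rows per iteration.
    DELETE FROM etl_table
    WHERE  load_date < ADD_MONTHS(SYSDATE, -6)
    AND    ROWNUM <= c_batch_size;

    EXIT WHEN SQL%ROWCOUNT = 0;
    COMMIT;   -- keep undo/redo per batch small
  END LOOP;
  COMMIT;
END;
/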
I do not want the delete process to interfere with the regular inserts/updates. What is the best practice for coding the delete so that it has minimal impact on the regular insert/update process?
I can also partition the table and delete in parallel since each partition uses its own rollback segment but am looking for a simpler way to tune the delete process.
Any suggestions on using a special rollback segment or other tuning tips ?
The first thing you should look at is decoupling the various ETL processes so that you need not do all of them together or in a particular sequence, thereby removing the dependency between the INSERTs/UPDATEs and the DELETEs. While the insert/update can be managed in a single MERGE block in your ETL, the delete can be done later by simply marking the rows to be deleted, i.e. a soft delete. You could do this with a flag column in your table and use it in your application and queries to filter those rows out.
By doing the delete later, the critical path of your ETL should shrink. Partitioning the data by date range should definitely help you maintain the data and also make the transactions efficient if they are date driven. Also, look for any row-by-row (and therefore slow-by-slow) processing and convert it to bulk operations. Avoid context switching between SQL and PL/SQL as much as possible.
If you partition the table by date range, then you could look into DROP/TRUNCATE PARTITION, which discards the rows stored in a partition as a DDL statement. This cannot be rolled back. It executes quickly and uses few system resources (undo and redo). You can read more about it in the documentation.
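A rough sketch of both ideas (the table, flag column, date column, retention window and partition name are all examples only):

-- Soft delete now: just mark the rows; the regular inserts/updates are not
-- blocked by a long-running physical delete.
UPDATE etl_table
SET    delete_flag = 'Y'
WHERE  load_date < ADD_MONTHS(SYSDATE, -6);

-- Later, if the table is range-partitioned by load_date, old data can be
-- discarded as a quick DDL operation instead of row-by-row deletes
-- (this cannot be rolled back).
ALTER TABLE etl_table TRUNCATE PARTITION p_2015_q1 UPDATE INDEXES;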

Alternative Method to Polling/Trigger a Table in Oracle?

I have a DB on Oracle 11g with a table that is updated by external users. I want to catch inserts/updates/deletes on this table in order to bring those changes to a table in another DB, and I'm trying different methods for research. So far I have tested polling (a job that checks every minute whether there has been an update, insert or delete on the table) and a trigger (fired on update, insert or delete on the table). Are there alternative methods?
I found AQ (Oracle Advanced Queuing), DBMS_PIPE and the Oracle SNMP Agent Integrator polling activity, but I don't know whether they are right for this case...
It depends.
Polling or triggers are often all you need depending on the volume of data involved, and the frequency of inserts/updates/deletes.
For example, the polling method might be as simple as adding a column which is set to 1 by default, and updated to NULL when the row is "consumed" by the replication code. A trigger on the table would set it back to 1 if a row is updated. An index on this column would be lightweight (the index would only include entries for rows where the column is 1) and therefore fast to query. You'd need another table to handle deletes, though.
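A rough sketch of that flag-column approach (all names are examples):

-- Rows default to "pending replication".
ALTER TABLE source_table ADD (replicate_flag NUMBER(1) DEFAULT 1);

-- A single-column B-tree index holds no entries for NULL values, so it only
-- contains the rows still waiting to be replicated and stays small.
CREATE INDEX ix_source_replicate ON source_table (replicate_flag);

-- Re-flag rows on normal updates, but not when the replication job itself
-- clears the flag.
CREATE OR REPLACE TRIGGER trg_source_reflag
BEFORE UPDATE ON source_table
FOR EACH ROW
BEGIN
  IF NOT UPDATING('REPLICATE_FLAG') THEN
    :NEW.replicate_flag := 1;
  END IF;
END;
/

-- The replication job reads rows WHERE replicate_flag = 1, copies them,
-- then marks them as consumed:
UPDATE source_table
SET    replicate_flag = NULL
WHERE  replicate_flag = 1;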
The trigger method would merely write insert/update/delete rows into a log table of some sort, which would then get purged periodically by a job.
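And a rough sketch of the log-table variant (again, example names only):

CREATE TABLE source_table_log (
  change_type  VARCHAR2(1),          -- 'I', 'U' or 'D'
  changed_id   NUMBER,
  changed_at   DATE DEFAULT SYSDATE
);

CREATE OR REPLACE TRIGGER trg_source_table_log
AFTER INSERT OR UPDATE OR DELETE ON source_table
FOR EACH ROW
BEGIN
  IF INSERTING THEN
    INSERT INTO source_table_log (change_type, changed_id) VALUES ('I', :NEW.id);
  ELSIF UPDATING THEN
    INSERT INTO source_table_log (change_type, changed_id) VALUES ('U', :NEW.id);
  ELSE
    INSERT INTO source_table_log (change_type, changed_id) VALUES ('D', :OLD.id);
  END IF;
END;
/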
For heavier volumes, solutions include Oracle GoldenGate and Oracle Streams: http://www.oracle.com/technetwork/database/focus-areas/data-integration/index.html

Large Data Service Architecture

Every day a company drops a text file with potentially many records (350,000) onto our secure FTP. We've created a Windows service that runs early in the AM to read the text file into our SQL Server 2005 DB tables. We don't do a BULK INSERT because the data is relational and we need to check it against what's already in our DB to make sure the data remains normalized and consistent.
The problem with this is that the service can take a very long time (hours). This is problematic because it is inserting and updating into tables that constantly need to be queried and scanned by our application, which can affect the performance of the DB and the application.
One solution we've thought of is to run the service on a separate DB with the same tables as our live DB. When the service is finished we can do a BCP into the live DB so it mirrors all of the new records created by the service.
I've never worked with handling millions of records in a DB before and I'm not sure what a standard approach to something like this is. Is this an appropriate way of doing this sort of thing? Any suggestions?
One mechanism I've seen is to insert the values into a temporary table with the same schema as the target table. Null IDs signify new records and populated IDs signify updated records. Then use the SQL MERGE command to merge it into the main table. MERGE will perform better than individual inserts/updates.
Doing it individually, you will incur index maintenance on the table for every row, which can be costly if the table is tuned for selects. I believe with MERGE it's a bulk action.
It's touched upon here:
What's a good alternative to firing a stored procedure 368 times to update the database?
There are MSDN articles about SQL merging, so Googling will help you there.
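A rough sketch of that staging-table MERGE (this needs SQL Server 2008 or later, as the update below notes; the MyTable and #Tempo names follow the example further down, and the Id and val columns are assumptions):

MERGE MyTable AS target
USING #Tempo AS source
    ON target.Id = source.Id              -- populated Id = existing record
WHEN MATCHED THEN
    UPDATE SET target.val1 = source.val1,
               target.val2 = source.val2,
               target.val3 = source.val3
WHEN NOT MATCHED THEN                     -- NULL Id never matches, so the row is inserted as new
    INSERT (val1, val2, val3)
    VALUES (source.val1, source.val2, source.val3);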
Update: it turns out you cannot use MERGE on SQL Server 2005 (you can in 2008). Your idea of having another database is usually handled by SQL replication. Again, I've seen a copy of the current database used in production to perform a long-running action (reporting and aggregation of data in this instance); however, this wasn't merged back in. I don't know what merging capabilities are available in SQL replication, but it would be a good place to look.
Either that, or resolve the reason why you cannot bulk insert/update.
Update 2: as mentioned in the comments, you could stick with the temporary table idea to get the data into the database, and then insert/update join onto this table to populate your main table. The difference now is that SQL is working with a set, so it can tune any index rebuilds accordingly; it should be faster, even with the join.
Update 3: you could possibly remove the data checking from the insert process and move it to the service. If you can stop inserts into your table while this happens, then this will allow you to solve the issue stopping you from bulk inserting (ie, you are checking for duplicates based on column values, as you don't yet have the luxury of an ID). Alternatively with the temporary table idea, you can add a WHERE condition to first see if the row exists in the database, something like:
INSERT INTO MyTable (val1, val2, val3)
SELECT tmp.val1, tmp.val2, tmp.val3
FROM #Tempo AS tmp
WHERE NOT EXISTS
(
    -- correlate with the temp table; the original unqualified val1/val2/val3
    -- resolved to MyTable's own columns, comparing each row to itself
    SELECT *
    FROM MyTable t
    WHERE t.val1 = tmp.val1 AND t.val2 = tmp.val2 AND t.val3 = tmp.val3
)
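For rows that do already exist, the update half of the insert/update join could look like this (assuming val1 identifies the row, which is an assumption about the schema):

UPDATE t
SET    t.val2 = tmp.val2,
       t.val3 = tmp.val3
FROM   MyTable AS t
JOIN   #Tempo  AS tmp
    ON tmp.val1 = t.val1   -- assumes val1 is the matching key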
We do much larger imports than that all the time. Create an SSIS package to do the work. Personally I prefer to create a staging table, clean it up, and then do the update or import. But SSIS can do all the cleaning in memory, if you want, before inserting.
Before you start mirroring and replicating data, which is complicated and expensive, it would be worthwhile to check your existing service to make sure it is performing efficiently.
Maybe there are table scans you can get rid of by adding an index, or lookup queries you can get rid of by doing smart error handling? Analyze your execution plans for the queries that your service performs and optimize those.
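For example, one quick way to check a lookup query the service runs (the query and index below are placeholders, not your actual schema):

-- Show I/O and timing statistics for one of the service's lookup queries.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT *
FROM MyTable
WHERE val1 = 'some value';   -- placeholder; substitute a real lookup

-- If the output shows a scan with high logical reads, an index on the
-- lookup column (hypothetical) may remove it:
CREATE INDEX IX_MyTable_val1 ON MyTable (val1);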
