A client application receives a table from server and stores it in local SQLite database. The table is sent frequently but changes rarely, so most updates insert into local database data that is already there.
Will SQLite do disk writes after such updates? Data is stored on SD card and writing to it frequently is a bad idea.
I can read the data from the database compare with received table and update only the rows that have changed. Maybe SQLite is smart and will do this for me transparently?
When you run a statement like this:
UPDATE SomeTable SET ... WHERE ...
then all rows that are matched by the WHERE clause get rewritten.
If you want to avoid that, you have to compare the row values yourself.
Related
My company is migrating to snowflake from SQL Server 2017 and am looking to build historical data tables that capture delta changes. In SQL, these would be in stored procedures, where old records would get expired (change to data) and insert the new row with updated data. This design allows dynamic retrieval of historical data at any point in time.
My question is, how would i migrate this design to snowflake? From what I read about procedures, they're more like UDTs or scalar functions (SQL equiv) , but in javascript lang...
Below is brief example of how we are doing CDC for tables in SQL
Would data pipeline cover this? If anyone knows good tutorial site for snowflake 101 (not snowflake offical documentation, its terrible). would be appreciated
thanks
update h
set h.expiration_date = t.effective_date
from data_table_A_history h
join data_table_A as t
on h.account_id = t.account_id
where h.expiration_date is null
and (
(isnull(t.person_name,'x') <> isnull(h.person_name,'x')) or
(isnull(t.person_age,0) <> isnull(h.person_age,0))
)
---------------------------------------------------------------------
insert into data_table_A_history (account_id,person_name,person_age)
select
account_id,person_name,person_age
from
data_table_A t
left join data_table_A_history h
on t.account_id = h.account_id
and h.expiration_date is null
where
h.account_id is null
Table streams are Snowflake's CDC solution
You can setup multiple streams on a single table and it will track changes to the table from a particular point in time. This point in time is changed once you consume the data in the stream, with the new starting point being from the time you consumed the data. Consumption here is when you either use the data to upsert another table or perhaps insert the data into a log table for example. Simply select statements do not consume the data
A pipeline could be something like this: Snowpipe->staging table->stream on staging table->task with SP->merge/upsert target table
If you wanted to keep a log of the changes then you could setup a 2nd stream on the staging table and consume that by inserting the data into another table
Another trick, if you didn't want to use a 2nd stream is to amend your SP so that before you consume the data, run a select on the stream and then immediately run
INSERT INTO my_table select * from table(result_scan(last_query_id()))
This does not consume the stream and change the offset and leaves the stream data available to be consumed by another DML operation
I have a table named abcTbl, the data in there is populated
from other tables from a different database. Every time I am loading
data to abcTbl, I am doing a delete all to it and loading the buffer
data into it.
This package runs daily. My question is how do I avoid losing data
from the table abcTbl if we fail to load the data into it. So my
first step is deleting all the data in the abcTbl and then
selecting the data from various sources into a buffer and then
loading the buffer data into abcTbl.
Since we can encounter issues like failed connections, package
stopping prematurely, supernatural forces trying to stop/break my
package from running smoothly, etc. which will end up with the
package losing all the data in the buffer after I have already
deleted the data from abcTbl.
My first intuition was to save the data from the abcTbl into a
backup table and then deleting the data in the abcTbl but my DBAs
wouldn't be too thrilled about creating a backup table for in every
environment for the purpose of this package, and giving me juice to
create backup tables on the fly and then deleting it again is out of
the question too. This data is not business critical and can be repopulated
again if lost.
But, what is the best approach here? What are the best practices for this issue?
For backing up your table, instead of loading data from one table (Original) to another table (Backup), you can just rename your original table to something (back-up table), create original table again like the back-up table and then drop the renamed table only when your data load is successful. This may save some time to transfer data from one table to another. You may want to test which approach is faster for you depending on your data/table structure etc., But what I wanted to mention is, this is also one of the way to do it. If you have lot of data in that table below approach may be faster.
sp_rename 'abcTbl', 'abcTbl_bkp';
CREATE TABLE abcTbl ;
While creating this table, you can keep similar table structure as that of abcTbl_bkp
Load your new data to abcTbl table
DROP TABLE abcTbl_bkp;
Trying to figure this out but I think what you are asking for is a method to capture the older data before loading the new data. I would agree with your DBA's that a seperate table for every reload would be extremely messy and not very usable if you ever need it.
Instead, create a table that copies your load table but adds a single DateTime field(say history_date). Each load you would just flow all the data in your primary table to the backup table. Use a Derived Column task in the Data Flow to add the history_date value to the backup table.
Once the backup table is complete, either truncate or delete the contents of the current table. Then load the new data.
Instead of created additional tables you can set the package to execute as a single transaction. By doing this, if any component fails all the tasks that have already executed will be rolled back and subsequent ones will not run. To do this, set the TransactionOption to Required on the package. This will allow that the package will begin a transaction. After this set all this property to Supported for all components that you want to succeed or fail together. The Supported level will have these tasks join a transaction that is already in progress by the parent container, being the package in this case. If there are other components in the package that you want to commit or rollback independent of these tasks you can place the related objects in a Sequence container, and apply the Required level to the Sequence instead. An important thing to note is that if anything performs a TRUNCATE then all other components that access the truncated object will need to have the ValidateExternalMetadata option set to false to avoid the known blocking issue that is a result of this.
Here is my situation, my client wants to bulk insert 100,000+ rows into the database from a csv file which is simple enough but the values need to be checked against data that is already in the database (does this product type exist? is this product still sold? etc.). To make things worse these files will also be uploaded into the live system during the day so I need to make sure I’m not locking any tables for long. The data that is inserted will also be spread across multiple tables.
I’ve been adding the date into a staging table which takes seconds, I then tried creating a WebService to start processing the table using Linq and marking any erroneous rows with an invalid flag (this can take some time). Once the validation is done I need take the valid rows and update/add the rows to the appropriate tables.
Is there a process for this that I am unfamiliar with?
For a smaller dataset I would suggest
IF EXISTS (SELECT blah FROM blah WHERE....)
UPDATE (blah)
ELSE
INSERT (blah)
You could do this in chunks to avoid server load, but this is by no means a quick solution, so SSIS would be preferable
I've a running system where data is inserted periodically into MS SQL DB and web application is used to display this data to users.
During data insert users should be able to continue to use DB, unfortunatelly I can't redesign the whole system right now. Every 2 hours 40k-80k records are inserted.
Right now the process looks like this:
Temp table is created
Data is inserted into it using plain INSERT statements (parameterized queries or stored proceuders should improve the speed).
Data is pumped from temp table to destination table using INSERT INTO MyTable(...) SELECT ... FROM #TempTable
I think that such approach is very inefficient. I see, that insert phase can be improved (bulk insert?), but what about transfering data from temp table to destination?
This is waht we did a few times. Rename your table as TableName_A. Create a view that calls that table. Create a second table exactly like the first one (Tablename_B). Populate it with the data from the first one. Now set up your import process to populate the table that is not being called by the view. Then change the view to call that table instead. Total downtime to users, a few seconds. Then repopulate the first table. It is actually easier if you can truncate and populate the table becasue then you don't need that last step, but that may not be possible if your input data is not a complete refresh.
You cannot avoid locking when inserting into the table. Even with BULK INSERT this is not possible.
But clients that want to access this table during the concurrent INSERT operations can do so when changing the transaction isolation level to READ UNCOMMITTED or by executing the SELECT command with the WITH NOLOCK option.
The INSERT command will still lock the table/rows but the SELECT command will then ignore these locks and also read uncommitted entries.
Everyday a company drops a text file with potentially many records (350,000) onto our secure FTP. We've created a windows service that runs early in the AM to read in the text file into our SQL Server 2005 DB tables. We don't do a BULK Insert because the data is relational and we need to check it against what's already in our DB to make sure the data remains normalized and consistent.
The problem with this is that the service can take a very long time (hours). This is problematic because it is inserting and updating into tables that constantly need to be queried and scanned by our application which could affect the performance of the DB and the application.
One solution we've thought of is to run the service on a separate DB with the same tables as our live DB. When the service is finished we can do a BCP into the live DB so it mirrors all of the new records created by the service.
I've never worked with handling millions of records in a DB before and I'm not sure what a standard approach to something like this is. Is this an appropriate way of doing this sort of thing? Any suggestions?
One mechanism I've seen is to insert the values into a temporary table - with the same schema as the target table. Null IDs signify new records and populated IDs signify updated records. Then use the SQL Merge command to merge it into the main table. Merge will perform better than individual inserts/updates.
Doing it individually, you will incur maintenance of the indexes on the table - can be costly if its tuned for selects. I believe with merge its a bulk action.
It's touched upon here:
What's a good alternative to firing a stored procedure 368 times to update the database?
There are MSDN articles about SQL merging, so Googling will help you there.
Update: turns out you cannot merge (you can in 2008). Your idea of having another database is usually handled by SQL replication. Again I've seen in production a copy of the current database used to perform a long running action (reporting and aggregation of data in this instance), however this wasn't merged back in. I don't know what merging capabilities are available in SQL Replication - but it would be a good place to look.
Either that, or resolve the reason why you cannot bulk insert/update.
Update 2: as mentioned in the comments, you could stick with the temporary table idea to get the data into the database, and then insert/update join onto this table to populate your main table. The difference is now that SQL is working with a set so can tune any index rebuilds accordingly - should be faster, even with the joining.
Update 3: you could possibly remove the data checking from the insert process and move it to the service. If you can stop inserts into your table while this happens, then this will allow you to solve the issue stopping you from bulk inserting (ie, you are checking for duplicates based on column values, as you don't yet have the luxury of an ID). Alternatively with the temporary table idea, you can add a WHERE condition to first see if the row exists in the database, something like:
INSERT INTO MyTable (val1, val2, val3)
SELECT val1, val2, val3 FROM #Tempo
WHERE NOT EXISTS
(
SELECT *
FROM MyTable t
WHERE t.val1 = val1 AND t.val2 = val2 AND t.val3 = val3
)
We do much larger imports than that all the time. Create an SSIS pacakge to do the work. Personally I prefer to create a staging table, clean it up, and then do the update or import. But SSIS can do all the cleaning in memory if you want before inserting.
Before you start mirroring and replicating data, which is complicated and expensive, it would be worthwhile to check your existing service to make sure it is performing efficiently.
Maybe there are table scans you can get rid of by adding an index, or lookup queries you can get rid of by doing smart error handling? Analyze your execution plans for the queries that your service performs and optimize those.