Ways to batch update DynamoDB table - database

I'm planning to create an ETL job that pushes data from Redshift to DynamoDB. The job runs daily and is meant to refresh the values stored in DynamoDB. Since there is no way to do a batch update of values, I'm doing a delete-and-create process where I re-create the table daily with the new values.
The issue is that an API reads from the daily-refreshed DynamoDB table, so I have to ensure the table exists at all times. There is a possibility that the API will be called after the deletion has happened and while the re-creation is still in progress.
Is there a better way of doing a batch "update" in DynamoDB in this case?
DynamoDB Table schema:
ID: partition-key
version: sort-key
score
timestamp

Why don't you make a new table (with a new name), and then, when it's ready for traffic, delete the old table? You can have a Parameter Store entry that gives the latest table name to the caller.
(If you’re rock climbing, don’t detach a climbing rope until your new one is attached.)
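A minimal boto3 sketch of that pattern, assuming the key schema from the question (ID partition key, version sort key, both treated as strings here) and a made-up Parameter Store name; the ETL creates and loads the new table, flips the parameter, and only then drops the old one:
import boto3

dynamodb = boto3.client("dynamodb")
ssm = boto3.client("ssm")

def publish_new_table(new_table, old_table):
    # Create the new table with the key schema from the question.
    dynamodb.create_table(
        TableName=new_table,
        AttributeDefinitions=[
            {"AttributeName": "ID", "AttributeType": "S"},
            {"AttributeName": "version", "AttributeType": "S"},
        ],
        KeySchema=[
            {"AttributeName": "ID", "KeyType": "HASH"},
            {"AttributeName": "version", "KeyType": "RANGE"},
        ],
        BillingMode="PAY_PER_REQUEST",
    )
    dynamodb.get_waiter("table_exists").wait(TableName=new_table)

    # ... load the Redshift extract into new_table here (e.g. batch_write_item) ...

    # Flip the pointer the API reads, then drop the old table.
    ssm.put_parameter(
        Name="/etl/active-table",  # made-up parameter name
        Value=new_table,
        Type="String",
        Overwrite=True,
    )
    dynamodb.delete_table(TableName=old_table)

def active_table_name():
    # The API resolves (or briefly caches) the table name on each request.
    return ssm.get_parameter(Name="/etl/active-table")["Parameter"]["Value"]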

Related

Backing up a table before deleting all the records and reloading it in SSIS

I have a table named abcTbl whose data is populated from other tables in a different database. Every time I load data into abcTbl, I delete everything in it and then load the buffered data into it. This package runs daily. My question is: how do I avoid losing the data in abcTbl if the load fails? My first step is deleting all the data in abcTbl, then selecting the data from various sources into a buffer, and then loading the buffer data into abcTbl.
We can encounter issues like failed connections, the package stopping prematurely, supernatural forces trying to stop/break my package from running smoothly, etc., any of which could end with the package losing the data in the buffer after I have already deleted the data from abcTbl.
My first intuition was to save the data from abcTbl into a backup table and then delete the data in abcTbl, but my DBAs wouldn't be too thrilled about creating a backup table in every environment just for this package, and giving me the permissions to create backup tables on the fly and then drop them again is out of the question too. This data is not business critical and can be repopulated if lost.
So, what is the best approach here? What are the best practices for this issue?
For backing up your table, instead of loading the data from the original table into a separate backup table, you can simply rename the original table to a backup name, create the original table again with the same structure as the backup, and then drop the renamed table only once your data load has succeeded. This can save the time spent copying data from one table to another. You may want to test which approach is faster for your data and table structure, but this is another way to do it, and if there is a lot of data in the table the approach below may be faster.
EXEC sp_rename 'abcTbl', 'abcTbl_bkp';
CREATE TABLE abcTbl (...);
When creating this table, keep the same structure as abcTbl_bkp.
Load your new data into abcTbl.
DROP TABLE abcTbl_bkp;
I'm trying to figure this out, but I think what you are asking for is a method to capture the older data before loading the new data. I would agree with your DBAs that a separate table for every reload would be extremely messy and not very usable if you ever need it.
Instead, create a table that copies your load table but adds a single DateTime field (say history_date). On each load, flow all the data from your primary table into the backup table, using a Derived Column task in the Data Flow to add the history_date value.
Once the backup table is complete, either truncate or delete the contents of the current table, then load the new data.
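Outside of SSIS, the snapshot-then-truncate step amounts to two T-SQL statements; here is a rough sketch run from Python via pyodbc, where abcTbl_history is a hypothetical backup table with the same columns as abcTbl plus history_date:
import pyodbc

# Connection string is a placeholder; adjust driver/server/auth as needed.
cnxn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
)
cur = cnxn.cursor()

# Snapshot the current contents with a load timestamp
# (this is what the Derived Column task adds inside SSIS).
cur.execute("INSERT INTO abcTbl_history SELECT *, GETDATE() FROM abcTbl;")

# Clear the primary table; the new data gets loaded after this.
cur.execute("TRUNCATE TABLE abcTbl;")

cnxn.commit()
cnxn.close()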
Instead of creating additional tables, you can set the package to execute as a single transaction. By doing this, if any component fails, all the tasks that have already executed will be rolled back and subsequent ones will not run. To do this, set the TransactionOption to Required on the package, which makes the package begin a transaction. Then set this property to Supported for all components that you want to succeed or fail together. The Supported level has these tasks join a transaction that is already in progress in the parent container, in this case the package. If there are other components in the package that you want to commit or roll back independently of these tasks, place the related objects in a Sequence container and apply the Required level to the Sequence instead.
An important thing to note is that if anything performs a TRUNCATE, then all other components that access the truncated object will need the ValidateExternalMetadata option set to false to avoid the known blocking issue that results from this.

Talend job trigger

I have a question about Talend: I want to write data to a CSV file in real time after each insert into a PostgreSQL database. Do you have any ideas?
Thank you!
Unfortunately, you cannot trigger a Talend job from the database.
If your table is written to very intensively, the whole exercise does not make much sense. If the inserts are relatively rare, I would recommend creating a PostgreSQL trigger that populates a separate table with the PK of each new record. The Talend job then picks up the newly inserted records by probing that table, say every second, and deletes each PK once it has been processed.
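A rough sketch of that setup, assuming a source table my_table with an integer id primary key; the queue table, trigger, and file names are made up, and the polling loop stands in for the Talend job:
import csv
import time
import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser password=secret host=localhost")
conn.autocommit = True
cur = conn.cursor()

# One-time setup: queue table plus trigger that records the PK of each new row.
cur.execute("""
    CREATE TABLE IF NOT EXISTS pending_export (id integer PRIMARY KEY);

    CREATE OR REPLACE FUNCTION queue_new_row() RETURNS trigger AS $$
    BEGIN
        INSERT INTO pending_export (id) VALUES (NEW.id);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    DROP TRIGGER IF EXISTS queue_new_row_trg ON my_table;
    CREATE TRIGGER queue_new_row_trg
        AFTER INSERT ON my_table
        FOR EACH ROW EXECUTE PROCEDURE queue_new_row();
""")

# Poll every second, append the new rows to the CSV, then clear the queue.
while True:
    cur.execute("""
        SELECT p.id, t.*
          FROM pending_export p
          JOIN my_table t ON t.id = p.id
    """)
    rows = cur.fetchall()
    if rows:
        with open("export.csv", "a", newline="") as f:
            csv.writer(f).writerows(row[1:] for row in rows)  # drop the queue id
        cur.execute("DELETE FROM pending_export WHERE id = ANY(%s)",
                    ([row[0] for row in rows],))
    time.sleep(1)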

Alternative Method to Polling/Trigger a Table in Oracle?

I have a database on Oracle 11g with a table that is updated by external users. I want to catch the inserts/updates/deletes on this table in order to propagate the changes to a table in another database, and I'm evaluating different methods. So far I have tested polling (a job that checks every minute whether there has been an insert, update, or delete on the table) and triggers (fired on insert, update, or delete). Are there alternative methods?
I found AQ (Oracle Advanced Queuing), DBMS_PIPE, and the Oracle SNMP Agent Integrator polling activity, but I don't know if they are right for this case...
It depends.
Polling or triggers are often all you need depending on the volume of data involved, and the frequency of inserts/updates/deletes.
For example, the polling method might be as simple as adding a column which is set to 1 by default, and updated to NULL when the row is "consumed" by the replication code. A trigger on the table would set it back to 1 if a row is updated. An index on this column would be lightweight (the index would only include entries for rows where the column is 1) and therefore fast to query. You'd need another table to handle deletes, though.
The trigger method would merely write insert/update/delete rows into a log table of some sort, which would then get purged periodically by a job.
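A sketch of the flag-column variant described above, with made-up object names, using python-oracledb to run the one-time DDL and the polling step:
import oracledb

conn = oracledb.connect(user="repl_user", password="secret", dsn="dbhost/orclpdb")
cur = conn.cursor()

# One-time setup for the flag-column approach. Rows start flagged with 1;
# the replication job sets the flag to NULL once a row has been consumed.
setup = [
    "ALTER TABLE src_table ADD (needs_sync NUMBER(1) DEFAULT 1)",
    # B-tree indexes skip all-NULL keys, so this index only holds pending rows.
    "CREATE INDEX src_table_needs_sync_ix ON src_table (needs_sync)",
    # Re-flag a row whenever it is updated, except when the update is the
    # replication job clearing the flag itself.
    """CREATE OR REPLACE TRIGGER src_table_flag_trg
       BEFORE UPDATE ON src_table
       FOR EACH ROW
       BEGIN
         IF NOT UPDATING('NEEDS_SYNC') THEN
           :NEW.needs_sync := 1;
         END IF;
       END;""",
]
for stmt in setup:
    cur.execute(stmt)

# Polling job: read the pending rows, replicate them, then clear the flag.
cur.execute("SELECT id FROM src_table WHERE needs_sync = 1")
pending = [row[0] for row in cur.fetchall()]
# ... push these rows to the table in the other database here ...
if pending:
    cur.executemany("UPDATE src_table SET needs_sync = NULL WHERE id = :1",
                    [(pk,) for pk in pending])
    conn.commit()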
For heavier volumes, solutions include Oracle GoldenGate and Oracle Streams: http://www.oracle.com/technetwork/database/focus-areas/data-integration/index.html

What are good strategies for updating a live database table?

I have a db table that gets entirely re-populated with fresh data periodically. This data needs to be then pushed into a corresponding live db table, overwriting the previous live data.
As the table size increases, the time required to push the data into the live table also increases, and the app would look like it's missing data.
One solution is to push the new data into a live_temp table and then run an SQL RENAME command on this table to rename it as the live table. The rename usually runs in sub-second time. Is this the "right" way to solve this problem?
Are there other strategies or tools to tackle this problem? Thanks.
I don't like messing with schema objects in this way - it can confuse query optimizers and I have no idea what will happen to any transactions that are going on while you execute the rename.
I much prefer to add a version column to the table, and have a separate table to hold the current version.
That way, the client code becomes
select *
from myTable t,
myTable_currentVersion tcv
where t.versionID = tcv.CurrentVersion
This also keeps history around, which may or may not be useful; if it's not, delete the old records after setting the CurrentVersion column.
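The load side of that versioned approach could look roughly like this (sqlite3 used as a stand-in database; col1/col2 are placeholder data columns, while versionID and CurrentVersion match the query above):
import sqlite3

conn = sqlite3.connect("example.db")  # stand-in for the real database
cur = conn.cursor()

def publish_new_version(rows):
    # Pick the next version id and load the fresh data under it.
    cur.execute("SELECT COALESCE(MAX(versionID), 0) + 1 FROM myTable")
    new_version = cur.fetchone()[0]
    cur.executemany(
        "INSERT INTO myTable (versionID, col1, col2) VALUES (?, ?, ?)",
        [(new_version, a, b) for a, b in rows],
    )
    # Flip the pointer; readers using the join above switch over on commit.
    cur.execute("UPDATE myTable_currentVersion SET CurrentVersion = ?", (new_version,))
    conn.commit()
    # Optionally purge superseded versions once the new one is live.
    cur.execute("DELETE FROM myTable WHERE versionID < ?", (new_version,))
    conn.commit()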
Create a duplicate table - exact copy.
Create a new table that does nothing more than keep track of the "up to date" table.
MostCurrent (table)
id (column) - holds name of table holding the "up to date" data.
When repopulating, populate the older table and update MostCurrent.id to reflect this table.
Now, in your app where you bind the data to the page, bind the newest table.
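On the read side, the app resolves the live table name from MostCurrent before binding the data; a short sketch along the same lines (again with sqlite3 as a stand-in, and made-up table names):
import sqlite3

conn = sqlite3.connect("example.db")
cur = conn.cursor()

# Look up which copy (e.g. "data_a" or "data_b") is currently live ...
cur.execute("SELECT id FROM MostCurrent")
live_table = cur.fetchone()[0]

# ... and bind the page to that table. Table names can't be bound as
# parameters, so the name is interpolated; it only ever comes from
# MostCurrent, never from user input.
cur.execute(f"SELECT * FROM {live_table}")
rows = cur.fetchall()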
Would it be appropriate to only push changes to the live db table? For most applications I have worked with, changes have been minimal. You should be able to apply all the changes in a single transaction, and committing the transaction will make them visible with no outage on the table.
If the data does change entirely, then you could configure the database so that you can replace all the data in a single transaction.

How do I sync the results of a Web service with a DB?

I am looking for a way to quickly compare the state of a database table with the results of a Web service call.
I need to make sure that all records returned by the Web service call exist in the database, and any records in the database that are no longer in the Web service response are removed from the table.
I have two problems to solve:
1. How do I quickly compare a data structure with the results of a database table?
2. When I find a difference, how do I quickly add what's new and remove what's gone?
For number 1, I was thinking of doing an MD5 of a data structure and storing it in the database. If the MD5 is different, then I'd move to step 2. Are there better ways of comparing response data with the state of a database?
I need more guidance on number 2. I can easily retrieve all records from a table (SELECT * FROM users WHERE user_id = 1), then loop through an array adding what's not in the DB and building another array of items to be removed in a subsequent call, but I'm hoping for a better (faster) way of doing this. What is the best way to compare and sync a data structure with a subset of a database table?
Thanks for any insight into these issues!
I've recently been caught up in a similar problem. Our--very simple--solution was to load the web service data into a table with the same structure as the DB table. The DB table keeps a hash of its most important columns, and the same hash function is applied to the corresponding columns in the web service table.
The "sync" logic then goes like this:
Delete any rows from the web service table with hashes that do exist in the DB table. This is duplicate data that doesn't need synchronizing.
DELETE FROM ws_table WHERE hash IN (SELECT hash from db_table);
Delete any rows from the DB table with hashes not found in the web service table.
DELETE FROM db_table WHERE hash NOT IN (SELECT hash FROM ws_table);
Anything left over in the web service table is new data, and should now be inserted into the DB table.
INSERT INTO db_table SELECT ... FROM ws_table;
It's a pretty brute-force approach, and if done transactionally (even just steps 2 and 3) it locks up the DB table for the duration, but it's very simple.
One refinement would be to deal with changed records using UPDATE statements, but that adds a good deal of complexity, and may not be any faster than a DELETE followed by an INSERT.
Another possible optimization would be to set a flag instead of deleting rows. The rows could then be deleted later on. However, any logic using the DB table would have to ignore rows with a set flag.
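For the hash that the first paragraph relies on, the key point is that both sides normalize the chosen columns identically before hashing; a small Python sketch (the column choice and separator are arbitrary):
import hashlib

def row_hash(*columns):
    # Normalize each value to text and join with a separator that cannot
    # appear in the data, so ("ab", "c") and ("a", "bc") hash differently.
    canonical = "\x1f".join("" if c is None else str(c) for c in columns)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

# The same call is made for a web-service record and for the corresponding
# DB columns, so identical rows always produce identical hashes.
record = {"id": 42, "email": "a@example.com", "updated_at": "2023-01-01"}  # sample
print(row_hash(record["id"], record["email"], record["updated_at"]))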
Don't kill yourself doing premature optimization. Go with the simple approach of inserting each row one at a time. If you find you're having transactional issues, such as the table being locked for too long while looping, you could insert the rows into a temporary table first and then do a single insert into the real destination table.
If you were using SQL Server you could do bulk inserts or package the data into XML, but I'd still highly recommend implementing it the easy way first, testing it (with production data, or the same quantity of data, if you can), and optimizing only if you need to.
