Does SymmetricDS propagate updates to a Redshift database?

Hello all :) The SymmetricDS blog mentions Redshift as a target-only platform without going into details.
Updates can be problematic in Redshift, performance-wise.
The context is: a one-way replication from a MySQL instance to Redshift has already been set up and the databases are in sync. Rows in MySQL then get updated.
So, the question is:
Will the row get updated in Redshift via the JDBC path? Via the S3 bulk-load path? Does the S3 bulk load even work with updates?
Additional, non-essential questions:
Performance-wise, is JDBC still OK for a million updates a day?
Is VACUUM performed by SymmetricDS, or do I need to schedule it myself?
If an S3 bulk-update path exists, how does it work internally? Does it load into a sort of staging table that is then merged into the Redshift target table?
EDIT: I'd love to go parse the source code, but the feature is not yet available (the last update of the publicly available source files predates the feature).
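For reference, the standard Redshift bulk-upsert pattern (described in the AWS documentation) stages changed rows via COPY and then merges them with a delete-and-insert. Whether SymmetricDS does exactly this internally is unverified; all object names and the S3 path below are placeholders.

    -- Hypothetical sketch of Redshift's staged-merge (upsert) pattern;
    -- not confirmed to be SymmetricDS internals.
    BEGIN;

    CREATE TEMP TABLE stage (LIKE target_table);

    -- Bulk-load the changed rows from S3 into the staging table
    COPY stage
    FROM 's3://my-bucket/changes/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    CSV;

    -- Remove the old versions of the staged rows...
    DELETE FROM target_table
    USING stage
    WHERE target_table.id = stage.id;

    -- ...then insert the new versions
    INSERT INTO target_table
    SELECT * FROM stage;

    COMMIT;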

Related

How to ETL using stream processing

I have a SQL Server database where millions of rows are inserted, deleted, or updated every day. I'm supposed to propose an ETL solution to transfer data from this database to a data warehouse. At first I tried to work with CDC and SSIS, but the company I work for wants a more real-time solution. I've done some research and discovered stream processing. I've also looked for Spark and Flink tutorials, but I didn't find anything.
My question is: which stream processing tool should I choose, and how do I learn to work with it?
Open Source Solution
You can use Confluent's Kafka integration tooling to track insert and update operations using a load timestamp. This automatically provides you with the real-time data that gets inserted or updated in the database. If your database uses soft deletes, those can also be tracked using the load timestamp and an active/inactive flag, as in the sketch below.
If there are no such flags, then you need to provide some logic to determine which partition might have been updated that day and send that entire partition into the stream, which is definitely resource-exhaustive.
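A minimal sketch of the load-timestamp extraction described above; the table and column names (source_table, is_active, load_timestamp) are assumptions:

    -- High-water mark persisted by the loader between runs.
    DECLARE @last_extracted_at datetime2 = '2024-01-01';

    -- Pull everything touched since the last extraction, including soft deletes.
    SELECT id,
           payload,
           is_active,       -- soft-delete flag: 0 = deleted, 1 = active
           load_timestamp
    FROM   source_table
    WHERE  load_timestamp > @last_extracted_at
    ORDER BY load_timestamp;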
Paid Solution
There is a paid tool called Striim CDC that can provide real-time responses to your system.

Detect Table Changes In A Database Without Modifications

I have a database ("DatabaseA") that I cannot modify in any way, but I need to detect the addition of rows to a table in it and then add a log record to a table in a separate database ("DatabaseB") along with some info about the user who added the row to DatabaseA. (So it needs to be event-driven, not merely a periodic scan of the DatabaseA table.)
I know that normally, I could add a trigger to DatabaseA and run, say, a stored procedure to add log records to the DatabaseB table. But how can I do this without modifying DatabaseA?
I have free rein to do whatever I like in DatabaseB.
EDIT in response to questions/comments ...
Databases A and B are MS SQL 2008/R2 databases (as tagged); users interact with the DB via a proprietary Windows desktop application (not my own), and each user has a SQL login associated with their application session.
Any ideas?
OK, so I have not put together a proof of concept, but this might work.
You can configure an Extended Events session on databaseB that watches all the procedures on databaseA that can insert into the table, or any SQL statements that run against the table on databaseA (using a LIKE '%your table name here%' filter).
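A rough sketch of such a session; the session name, table name, and file path are placeholders, and the LIKE filter is expressed via the like_i_sql_unicode_string comparator:

    -- Capture completed statements whose text mentions the watched table.
    CREATE EVENT SESSION [WatchTableA] ON SERVER
    ADD EVENT sqlserver.sql_statement_completed
    (
        ACTION (sqlserver.sql_text, sqlserver.username)
        WHERE sqlserver.like_i_sql_unicode_string(sqlserver.sql_text, N'%your_table_name_here%')
    )
    ADD TARGET package0.event_file
    (
        SET filename = N'C:\XE\WatchTableA.xel'
    );

    ALTER EVENT SESSION [WatchTableA] ON SERVER STATE = START;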
This is a custom solution that writes the XE session to a table:
https://github.com/spaghettidba/XESmartTarget
You could probably mimic its functionality by writing the XE events table to a custom user table every minute or so using SQL Server Agent.
Your session would monitor databaseA and write the XE output to databaseB; you would then write a trigger so that upon each XE output write it compares the two tables and, if there are differences, writes them to your log table. This would be a nonstop running process, but it is still a kind of periodic scan: the XE session only writes when the event happens, but you are still running a check every couple of seconds.
I recommend you look at a data integration tool that can mine the transaction log for Change Data Capture events. We have recently been using StreamSets Data Collector for Oracle CDC, but it also supports SQL Server CDC. There are many other competing technologies, including Oracle GoldenGate and Informatica PowerExchange (not PowerCenter). We like StreamSets because it is open source and is designed to build real-time data pipelines between databases at the schema level. Until now we have used batch ETL tools like Informatica PowerCenter and Pentaho Data Integration. I can copy all the tables in a schema in near real time in one StreamSets pipeline, provided I have already deployed the DDL in the target. I use this approach between Oracle and Vertica. You can add additional columns to the target and populate them as part of the pipeline.
The only catch might be identifying which user made the change. I don't know whether that is in the SQL Server transaction log. It seems probable, but I am not a SQL Server DBA.
I looked at both solutions provided by the time of writing this answer (see the answers by Dan Flippo and dfundaka) but found that the first - using Change Data Capture - required modification to the database, and the second - using Extended Events - wasn't really a complete answer, though it got me thinking of other options.
The option that seems cleanest, and doesn't require any database modification, is to use SQL Server Dynamic Management Views. These system views and functions expose server process history - in this case INSERTs and UPDATEs - for example sys.dm_exec_query_stats and sys.dm_exec_sql_text, which hold records of recently executed statements (and are, in fact, what Extended Events seems to be based on).
Though it's quite an involved process initially to extract the required information, the queries can be tuned and generalized to a degree.
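For example, a starting query against the plan-cache DMVs might look like this (the table name is a placeholder, and the plan cache only holds statements that are still cached, so this is best-effort):

    -- Recent cached statements that touch the table of interest.
    SELECT TOP (50)
           qs.last_execution_time,
           qs.execution_count,
           st.text AS statement_text
    FROM   sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    WHERE  st.text LIKE '%YourTableName%'
      AND (st.text LIKE '%INSERT%' OR st.text LIKE '%UPDATE%')
    ORDER BY qs.last_execution_time DESC;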
There are restrictions on transaction history retention, etc but for the purposes of this particular exercise, this wasn't an issue.
I'm not going to select this answer as the correct one yet, partly because it's a matter of preference how you approach the problem and also because I've yet to provide a complete solution. Hopefully I'll post back with that later. But if anyone cares to comment on this approach - good or bad - I'd be interested in your views.

Sqldependency vs CDC for writing a data replicator from sql server to mongodb

I would like to write a simple replicator that replicates data from a SQL Server database to a MongoDB database.
What are my options other than:
a) Change Data Capture
b) SqlDependency
What is the impact of each of these options on SQL Server performance?
If I have 50 writes/min on a table, which option would be the better candidate, and why?
Which option would be better for recovery in case a problem occurs and the machine/server reboots?
CDC (+ Change Tracking) and SqlDependency solve different problems and address different scenarios. SqlDependency is a cache invalidation mechanism for applications; it is unsuitable for tracking changes. CDC explicitly targets applications that need to track changes, and it is pretty much the only viable mechanism for doing so. Consuming the tracked changes is up to your application, like writing them to /dev/null or MongoDB (same thing, actually).
Please read the MSDN relevant topics first, start from Track Data Changes.
I like CDC for this type of operation for a couple of different reasons
You can configure it to capture net changes. If you don't care that a given row was updated 10 times within your capture interval, this is the way to go.
It tells you what changed. You can see which columns changed in which operation.
You don't need to track all of the columns. If there's a column that you don't need to propagate to your downstream data store (in your case, MongoDB), then don't include it in the tracking.
It's asynchronous. CDC uses a log reader to determine what changes happened after the fact, so you don't need to slow down your primary application.
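A minimal sketch of the setup and a net-changes read; the schema, table, and capture-instance names (dbo, Orders, dbo_Orders) are placeholders:

    -- Enable CDC on the database and on one table, with net-change support.
    EXEC sys.sp_cdc_enable_db;

    EXEC sys.sp_cdc_enable_table
        @source_schema        = N'dbo',
        @source_name          = N'Orders',
        @role_name            = NULL,
        @supports_net_changes = 1;

    -- Read one net row per changed key between two LSNs.
    DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn(N'dbo_Orders');
    DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

    SELECT *
    FROM cdc.fn_cdc_get_net_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');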
To complement the previous answers:
CDC is not available in SQL Server Standard edition.
One option is to use table triggers + Service Broker (available in Standard edition).
Another option is to use Change Tracking (CT) to monitor table changes (available in Standard edition) plus a custom monitoring app to propagate changes to MongoDB, as sketched below. I used CT in a production environment to implement a homemade replication system, and it works really well. The resource overhead of maintaining the CT index is similar to that of a nonclustered index, according to Microsoft.
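A minimal CT sketch, assuming a dbo.Orders table with an OrderId primary key (all names are placeholders):

    -- Enable Change Tracking at the database and table level.
    ALTER DATABASE MyDb
        SET CHANGE_TRACKING = ON
        (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

    ALTER TABLE dbo.Orders
        ENABLE CHANGE_TRACKING
        WITH (TRACK_COLUMNS_UPDATED = ON);

    -- The monitoring app polls like this, persisting @last_sync_version between runs.
    DECLARE @last_sync_version bigint = 0;

    SELECT ct.SYS_CHANGE_OPERATION,  -- I, U or D
           ct.SYS_CHANGE_VERSION,
           o.*                       -- NULL for deletes
    FROM CHANGETABLE(CHANGES dbo.Orders, @last_sync_version) AS ct
    LEFT JOIN dbo.Orders AS o
           ON o.OrderId = ct.OrderId;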
Hope it helps

Replicating a SQL Server database for read access

I have an application that is in production with its own database for more than 10 years.
I'm currently developing a new application (kind of a reporting application) that only needs read access to the database.
In order not to be too tightly coupled to the database, and to be able to use a newer DAL (Entity Framework 6 Code First), I decided to start from a new empty database to which I only added the tables and columns I need (with different names than in the production one).
Now I need some way to update the new database from the production database regularly (ideally -almost- immediately).
I hesitated to ask this question on http://dba.stackexchange.com, but I'm not necessarily limited to using SQL Server for the job (I can develop and run a custom application if needed).
I already did some searching and found these (partial) solutions:
Using transactional replication to create a smaller database (with only the tables/columns I need). But as far as I can see, the fact that I have different table/column names will be problematic. So I could use it to create a smaller database that is automatically replicated by SQL Server, but I would still need to replicate that database to my new one (though it might keep my production database from being stressed too much?).
Using triggers to insert/update/delete the rows
Creating a custom job (either a SQL Agent job or a Windows service that runs every X minutes) that updates the necessary tables (I have a LastEditDate column that is updated by a trigger on my tables, so I can tell whether a row has been updated since my last replication).
Do you have any advice, or maybe other solutions that I didn't foresee?
Thanks
I think transactional replication is better than using triggers.
Too many resources would be used on the source server/database, because triggers fire for each DML transaction.
Transactional replication can be scheduled as a SQL job and run a few times a day/night, or as part of a nightly scheduled job. It really depends on how busy the source DB is...
There is one more thing you could try - DB mirroring. It depends on your SQL Server version.
If it were me, I'd use transactional replication but keep the table/column names the same. If you have some real reason why you need them to change (I honestly can't think of any good ones and a lot of bad ones), wrap each table in a view, as below. At least that way, the view is the documentation of where the data is coming from.
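For example (hypothetical names), with the replicated table keeping its production name:

    -- The view exposes the new names the reporting app wants
    -- while documenting the production source.
    CREATE VIEW dbo.ReportCustomer
    AS
    SELECT CustomerID   AS Id,
           CustomerName AS Name,
           CreatedOn    AS CreationDate
    FROM   dbo.Customer;  -- the transactionally replicated table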
I'm going to throw this out there and say that I'd use transaction log shipping. You can even set the secondary DBs to read-only. There would be some setup for full recovery mode and transaction log backups, but that way you can automatically restore the transaction logs to the secondary database and be hands-off with it, and the secondary database will be as current as your last transaction log backup.
Depending on how current the data needs to be, if you only need it refreshed daily you can set up something that takes your daily backups and restores them to the secondary.
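A sketch of the restore step (database name and paths are placeholders); WITH STANDBY is what keeps the secondary readable between restores:

    -- Apply the next transaction log backup, leaving the database read-only.
    RESTORE LOG ReportingDb
    FROM DISK = N'\\backupshare\ReportingDb\ReportingDb_20150101.trn'
    WITH STANDBY = N'D:\SQL\ReportingDb_undo.dat';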
In the end, we went for the trigger solution. We don't have that many changes a day (maybe 500, 1,000 tops), and it didn't put too much pressure on the current database. Thanks for your advice.

Log inserted/updated/deleted rows in all tables for a given database in SQL Server 2008

What's the best way to track/log inserted/updated/deleted rows in all tables of a given database in SQL Server 2008?
Or is there a better "Audit" feature in SQL Server 2008?
The short answer is that no single solution fits all. It depends on the system and the requirements, but here are a couple of different approaches.
DML Triggers
Relatively easy to implement: you write one trigger that works well for one table and then apply it to the other tables.
Downside is that it can get messy when you have a lot of tables and even more triggers. Managing 600 triggers for 200 tables (insert, update and delete trigger per table) is not an easy task.
Also, it can have a performance impact.
Creating audit triggers in SQL Server
Log changes to database table with trigger
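A minimal sketch of one such audit trigger (the table, log table, and column names are placeholders); the inserted and deleted pseudo-tables expose the changed rows:

    CREATE TRIGGER trg_Orders_Audit
    ON dbo.Orders
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Classify the operation by which pseudo-tables contain rows.
        INSERT INTO dbo.AuditLog (TableName, Operation, ChangedAt, ChangedBy)
        SELECT 'dbo.Orders',
               CASE WHEN EXISTS (SELECT 1 FROM inserted)
                     AND EXISTS (SELECT 1 FROM deleted) THEN 'UPDATE'
                    WHEN EXISTS (SELECT 1 FROM inserted) THEN 'INSERT'
                    ELSE 'DELETE'
               END,
               SYSUTCDATETIME(),
               SUSER_SNAME();
    END;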
Change Data Capture
Very easy to implement and natively supported, but only in Enterprise edition, which can cost a lot of $ ;). Another disadvantage is that CDC is still not as mature as it should be. For example, if you change your schema, history data is lost.
Transaction log analysis
The biggest advantage of this is that all you need to do is put the database in full recovery mode, and all the info will be stored in the transaction log.
However, if you want to do this correctly you'll need a third-party log reader, because this is not natively supported.
Read the log file (*.LDF) in SQL Server 2008
SQL Server Transaction Log Explorer/Analyzer
If you want to implement this, I'd recommend you try out some of the third-party tools that exist out there. I worked with a couple of tools from ApexSQL, but there are also good tools from Idera and Netwrix.
ApexSQL Log – auditing by reading transaction log
ApexSQL Comply – uses traces in the background, then parses those traces and stores the results in a central database.
Disclaimer: I’m not affiliated with any of the companies mentioned above.
Change Data Capture is designed to do what you want, but it requires each table to be set up individually, so depending on the number of tables you have, there may be some logistics to it. It will also only keep data in the capture tables for three days (4320 minutes) by default, so you may need an SSIS package to pull it out and store it for longer periods, or extend the cleanup retention, as sketched below.
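A sketch of extending the cleanup job's retention instead (the value is in minutes, and 4320 minutes is the default):

    -- Keep 7 days (10080 minutes) of change data instead of the default 3 days.
    EXEC sys.sp_cdc_change_job
        @job_type  = N'cleanup',
        @retention = 10080;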
I don't remember whether there is already a tool for this, but you could always use triggers (inside a trigger you have access to the logical inserted and deleted tables that hold the changed rows). Unfortunately, it could be quite a lot of work if you want to track all tables. I believe there should be a simpler solution, but I don't remember it, as I said.
EDIT.
Maybe this could be helpful:
Change tracking:
http://msdn.microsoft.com/en-us/library/cc280462.aspx
http://msdn.microsoft.com/en-us/library/cc280386.aspx
This allows you to do audits at the database level; it may or may not be enough to meet the business requirements, as database records usually don't make all that much sense without the logic to glue them together. For instance, knowing that user x inserted a record into the "time_booked" table with foreign keys to the "projects", "users", and "time_status" tables may not make all that much sense without the SQL query that glues those 4 tables together.
You may also need to have each database user connect with their own user ID - this is fine with integrated security and a client app, but probably won't work with a website using a connection pool.
The SQL Server logs are not possible to analyze just like that. There are some third-party tools available to read the logs, but as far as I know you can't query them for statistics and such. If you need this kind of info, you'll have to create some sort of auditing to capture all these events in separate tables. You can use DML triggers for that.
