Is there a way to export the results from Pig directly to a database like mysql?
While keeping in mind what orangeoctopus said (beware of DDOS...), have you had a look at DBStorage?
data = LOAD '...' AS (...);
...
STORE data INTO 'unused' USING org.apache.pig.piggybank.storage.DBStorage('com.mysql.jdbc.Driver', 'jdbc:mysql://host/db', 'INSERT ...');
The main problem I see is that each reducer is effectively going to insert into the database around the same time.
If you don't think this will be an issue, I suggest you write a custom storage method that uses JDBC (or something similar) to insert into the database directly, writing nothing out to HDFS.
If you are afraid of performing a DDOS attack on your own database, perhaps collecting the data on HDFS and performing a separate bulk load into mysql would be better.
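For the separate bulk-load route, something along these lines would work once the Pig output has been pulled off HDFS (file path, table and column names below are placeholders, not from the original question):

-- Hypothetical bulk load of a Pig output part file into MySQL.
-- The file would first be copied off HDFS, e.g. with `hadoop fs -get`.
LOAD DATA LOCAL INFILE '/tmp/pig_output/part-r-00000'
INTO TABLE results
FIELDS TERMINATED BY '\t'   -- PigStorage's default field delimiter
LINES TERMINATED BY '\n'
(user_id, event_count);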
I'm currently experimenting with an embedded Pig application which loads results into MySQL via PigServer.openIterator and a JDBC connection. It's worked very well in testing, but I haven't tried it at scale yet. This is similar to the custom storage method already suggested, but runs from a single point, so no accidental DDOS attack. You effectively end up paying the network transfer cost twice (cluster -> staging machine, staging machine -> DB server) if you don't run the load off the DB server (I personally prefer to run nothing except the DB itself on the DB server), but that's no different from the "write the file out and bulk load it" option.
Sqoop may be a good way to go, but it is difficult to set up (IMHO), as are all these Hadoop-related projects...
Pig's DBStorage is working fine (at least for storing).
Don't forget to register the PiggyBank and your MySQL driver:
-- Register Piggy bank
REGISTER /opt/cmr/pig/pig-0.10.0/lib/piggybank.jar;
-- Register MySQL driver
REGISTER /opt/cmr/mysql/drivers/mysql-connector-java-5.1.15-bin.jar;
Here is a sample call:
-- Store a relation into a SQL table
STORE relation INTO 'unused' USING org.apache.pig.piggybank.storage.DBStorage('com.mysql.jdbc.Driver', 'jdbc:mysql://<mysqlserver>/<database>', '<login>', '<password>', 'REPLACE INTO <table> (<column1>, <column2>) VALUES (?, ?)');
Try using Sqoop
I have a database ("DatabaseA") that I cannot modify in any way, but I need to detect the addition of rows to a table in it and then add a log record to a table in a separate database ("DatabaseB") along with some info about the user who added the row to DatabaseA. (So it needs to be event-driven, not merely a periodic scan of the DatabaseA table.)
I know that normally, I could add a trigger to DatabaseA and run, say, a stored procedure to add log records to the DatabaseB table. But how can I do this without modifying DatabaseA?
I have free-reign to do whatever I like in DatabaseB.
EDIT in response to questions/comments ...
Databases A and B are MS SQL 2008/R2 databases (as tagged), users are interacting with the DB via a proprietary Windows desktop application (not my own) and each user has a SQL login associated with their application session.
Any ideas?
Ok, so I have not put together a proof of concept, but this might work.
You can configure an Extended Events session on databaseB that watches for all the procedures on databaseA that can insert into the table, or any SQL statements that run against the table on databaseA (using a LIKE '%your table name here%').
This is a custom solution that writes the XE session to a table:
https://github.com/spaghettidba/XESmartTarget
You could probably mimic that functionality by writing the XE events to a custom user table every minute or so using a SQL Server Agent job.
Your session would monitor databaseA and write the XE output to databaseB; you would then write a trigger that, on each XE output write, compares the two tables and writes any differences to your log table. This would be a nonstop running process, but it is still kind of a periodic scan in a way. The XE only writes when the event happens, but it is still running a check every couple of seconds.
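As a rough illustration only (session name, file paths, table names and the watched table are all made up; the syntax shown is the 2012+ form, on 2008/R2 the file target is package0.asynchronous_file_target and the read function needs the metadata path), the session and the periodic pull could look something like this:

-- Hypothetical XE session capturing statements that touch the watched table
CREATE EVENT SESSION WatchDatabaseA ON SERVER
ADD EVENT sqlserver.sql_statement_completed (
    ACTION (sqlserver.sql_text, sqlserver.username, sqlserver.database_name)
    WHERE (sqlserver.like_i_sql_unicode_string(sqlserver.sql_text, N'%YourTable%'))
)
ADD TARGET package0.event_file (SET filename = N'C:\XE\WatchDatabaseA.xel')
WITH (STARTUP_STATE = ON);

ALTER EVENT SESSION WatchDatabaseA ON SERVER STATE = START;

-- A SQL Server Agent job could then periodically pull the captured events
-- into a user table in databaseB for the trigger to compare against:
INSERT INTO databaseB.dbo.XeCapture (event_data)
SELECT CAST(event_data AS xml)
FROM sys.fn_xe_file_target_read_file(N'C:\XE\WatchDatabaseA*.xel', NULL, NULL, NULL);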
I recommend you look at a data integration tool that can mine the transaction log for Change Data Capture events. We have recently been using StreamSets Data Collector for Oracle CDC, but it also supports SQL Server CDC. There are many other competing technologies, including Oracle GoldenGate and Informatica PowerExchange (not PowerCenter). We like StreamSets because it is open source and is designed to build real-time data pipelines between databases at the schema level. Until now we have used batch ETL tools like Informatica PowerCenter and Pentaho Data Integration. I can copy all the tables in a schema in near real time in one StreamSets pipeline, provided I have already deployed the DDL in the target. I use this approach between Oracle and Vertica. You can add additional columns to the target and populate them as part of the pipeline.
The only catch might be identifying which user made the change. I don't know whether that is in the SQL Server transaction log. Seems probable but I am not a SQL Server DBA.
I looked at both solutions provided by the time of writing this answer (see the answers from Dan Flippo and dfundaka) but found that the first - using Change Data Capture - required modification to the database, and the second - using Extended Events - wasn't really a complete answer, though it got me thinking of other options.
And the option that seems cleanest, and doesn't require any database modification, is to use SQL Server Dynamic Management Views. Within this library, residing in the system database, are various views and functions for inspecting server process history - in this case INSERTs and UPDATEs - such as sys.dm_exec_sql_text and sys.dm_exec_query_stats, which contain records of database transactions (and are, in fact, what Extended Events seems to be based on).
Though it's quite an involved process initially to extract the required information, the queries can be tuned and generalized to a degree.
There are restrictions on transaction history retention, etc but for the purposes of this particular exercise, this wasn't an issue.
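For illustration, a query along these lines (the table name is a placeholder) pulls recent statements against the watched table out of the plan cache via those DMVs:

-- Sketch: recent INSERT/UPDATE statements touching the table, from the plan cache
SELECT TOP (50)
       qs.last_execution_time,
       qs.execution_count,
       st.text AS sql_text
FROM   sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE  st.text LIKE '%YourTable%'
  AND (st.text LIKE '%INSERT%' OR st.text LIKE '%UPDATE%')
ORDER BY qs.last_execution_time DESC;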
I'm not going to select this answer as the correct one yet, partly because it's a matter of preference as to how you approach the problem and also because I've yet to provide a complete solution. Hopefully, I'll post back with that later. But if anyone cares to comment on this approach - good or bad - I'd be interested in your views.
What steps should I take if I'd like to synchronize two databases, e.g. every 15 minutes?
What practical advice can you give me if I'd like to sync a MySQL and an MSSQL database?
The theory behind true replication (something like MySQL to MySQL) is very complicated and difficult. I wouldn't recommend trying to implement something like that for MySQL to SQL Server.
Some things to look at:
Look at Mule ESB (http://www.mulesoft.org/). You can get off the ground pretty fast with JDBC connections to MySQL and SQL Server. Then it's just a matter of how often you want to poll one endpoint to push to another endpoint. (For example, poll MySQL every 15 minutes and take the results and write to SQL Server; a rough sketch of this polling pattern follows the list.)
You can write your own syncing program. Maybe export data from one system every 15 minutes and write to the file system. Have another program watch that directory and import anything it sees. (Disadvantage is you have to touch the disk.)
To be really creative, you can write triggers in MySQL and SQL Server that fire an external process to send data. That way when a record gets touched, it will send off a message in near real time to the other database.
Try to make the schemas the same. MySQL and SQL Server share many of the same data types, so try not to use data types that are specific to one of the two databases. (For example, I don't believe MySQL supports the "xml" data type. But maybe I'm wrong?)
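To make the polling idea concrete, here is a rough watermark-based sketch (table and column names are invented for illustration); the first statement runs against MySQL, the second against SQL Server once the sync job has staged the fetched rows:

-- MySQL side: fetch rows changed since the last successful sync.
-- @last_sync_time would be persisted by the sync job itself.
SELECT id, name, updated_at
FROM   customers
WHERE  updated_at > @last_sync_time;

-- SQL Server side: upsert the fetched rows into the mirror table.
-- #incoming_rows is a temp table the job fills with the MySQL result set.
MERGE dbo.customers AS target
USING #incoming_rows AS source
   ON target.id = source.id
WHEN MATCHED THEN
    UPDATE SET target.name = source.name,
               target.updated_at = source.updated_at
WHEN NOT MATCHED THEN
    INSERT (id, name, updated_at)
    VALUES (source.id, source.name, source.updated_at);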
So BCP for inserting data into a SQL Server DB is very, very fast. What is it doing that makes it so fast?
In SQL Server, BCP input is logged very differently than traditional INSERT statements. How SQL Server decides to handle things depends on a number of factors, and some are things most developers never even consider, like what recovery model the database is set to use.
bcp uses the same facility as BULK INSERT and the SqlBulkCopy classes.
More details here
http://msdn.microsoft.com/en-us/library/ms188365.aspx
The bottom line is this: these bulk operations log less data than normal operations and have the ability to instruct SQL Server to ignore its traditional checks and balances on the data coming in. All those things together serve to make it faster.
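To see those knobs in action, here's a sketch (database, table and file names are made up): with the database in the BULK_LOGGED or SIMPLE recovery model and a table lock taken, the load is minimally logged, and constraint checking is off unless you explicitly ask for it:

-- Sketch: minimally logged bulk load
ALTER DATABASE MyDb SET RECOVERY BULK_LOGGED;   -- one of the factors mentioned above

BULK INSERT dbo.Orders
FROM 'C:\loads\orders.dat'
WITH (
    TABLOCK,                    -- table lock enables minimal logging
    FIELDTERMINATOR = '\t',
    ROWTERMINATOR   = '\n',
    BATCHSIZE       = 100000    -- commit in batches rather than per row
    -- CHECK_CONSTRAINTS        -- uncomment to re-enable constraint checking
);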
It cheats.
It has intimate knowledge of the internals and is able to map your input data more directly to those internals. It can skip other heavyweight operations (parsing, optimization, transactions, logging, isolation) and defer index maintenance. It can make assumptions that apply to every row of data that a normal INSERT SQL statement cannot.
Basically, it's able to skip the bulk of the functionality that makes a database a database, and then clean up after itself en masse at the end.
The main difference I know between bcp and a normal insert is that bcp doesn't need to keep a separate transaction log entry for each individual transaction.
The speed is because it uses the BCP API of the SQL Server Native Client ODBC driver. According to Microsoft:
http://technet.microsoft.com/en-us/library/aa337544.aspx
The bcp utility (Bcp.exe) is a command-line tool that uses the Bulk Copy Program (BCP) API...
Bulk Copy Functions reference:
http://technet.microsoft.com/en-us/library/ms130922.aspx
I have some exported rows that I want to import into a database that is currently live. Should I disable the app that interacts with the DB before doing the bulk insert, so that the bulk insert is the only operation being performed?
I assumed that this would be best-practice but just wanted to check with the community.
Many Thanks!
I wouldn't generally disable any other app using the database. To me, the bulk load is just another client of the database and is subject to the usual concurrency/isolation mechanisms.
There are exceptions where the bulk load is part of some long-running release process or maintenance routine, in which case it'd be out of hours anyway.
I always use a staging table to load data, though. After processing, scrubbing, cleansing, key lookups etc., I'd flush the data to the live table in a single atomic operation. In other words, I wouldn't mix bulk loads with other table access: I'd buffer the bulk load via a staging table.
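A rough sketch of that pattern, assuming SQL Server here (table and file names are placeholders):

-- 1) The bulk load lands in the staging table, away from the live table.
BULK INSERT dbo.Orders_Staging
FROM 'C:\loads\orders.dat'
WITH (TABLOCK, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');

-- 2) Processing, scrubbing and key lookups run here against dbo.Orders_Staging.

-- 3) Flush to the live table in a single atomic operation.
BEGIN TRANSACTION;
    INSERT INTO dbo.Orders (OrderId, CustomerId, Amount)
    SELECT OrderId, CustomerId, Amount
    FROM   dbo.Orders_Staging;

    TRUNCATE TABLE dbo.Orders_Staging;
COMMIT TRANSACTION;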
In a Firebird database driven Delphi application we need to bring some data online, so we can add to our application online-reporting capabilities.
The current approach is: whenever data is changed or added, send it to the online server (PHP + MySQL); if that fails, add it to a queue and try again. Then the server holding the data is able to create its own reports.
So, to conclude: what is a good way to bring that data online?
At the moment I know these two different strategies:
event based: whenever changes are detected, push them to the web server / mysql db. As you wrote, this requires queueing in case the destination system does not receive the messages.
snapshot based: extract the relevant data in intervals (for example every hour) and transfer it to the web server / mysql db.
The snapshot-based strategy allows you to preprocess the data so that it fits nicely into the web / mysql db data structure, which can help to decouple the systems better and keep more business logic on the side of the sending system (Delphi). It also generates a more continuous load, as it is not sensitive to bursts of data changes.
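A minimal sketch of such an hourly snapshot transfer (table, column and parameter names are invented): an extract query on the Firebird side keyed on a modification timestamp, and a REPLACE on the MySQL side so re-sent rows simply overwrite:

-- Firebird side: extract rows touched since the last snapshot.
-- Assumes the tables carry a LAST_MODIFIED timestamp column;
-- :last_snapshot_time is a parameter supplied by the transfer job.
SELECT ID, CUSTOMER_NAME, AMOUNT, LAST_MODIFIED
FROM   SALES
WHERE  LAST_MODIFIED > :last_snapshot_time;

-- MySQL side: load the extracted rows; REPLACE makes re-runs idempotent.
REPLACE INTO sales (id, customer_name, amount, last_modified)
VALUES (?, ?, ?, ?);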
One other way would be to use replication, but I don't know of a system that replicates between Firebird and MySQL databases.
For adding online reporting capabilities: you can also check out FastReport Server.