Sqoop export to Teradata is slow

I am trying to load a ~700 GB file into Teradata using sqoop-connector-teradata-1.3c5.tar.gz (Cloudera Connector Powered by Teradata), and the performance seems to be very slow.
I have included the parameters below in the sqoop command:
sqoop export -D sqoop.export.records.per.statement=100 --connect jdbc:teradata://ip address/Database=dbname --driver com.teradata.jdbc.TeraDriver --username user --password pwd --table STG_TEST --export-dir /dirpath/ --input-fields-terminated-by "\t" --input-lines-terminated-by "\n" --connection-param-file /path/sqoop.properties --batch;
The connection parameter file includes:
jdbc.transaction.isolation=TRANSACTION_READ_UNCOMMITTED
Please suggest how to improve sqoop export performance.

Have you considered using the 'internal.fastload' method of the Cloudera Connector to leverage Teradata's FastLoad utility? This should be faster than the 'INSERT...SELECT' path that may currently be running. It does require that you load into an empty stage table and then use 'MERGE' or 'INSERT/UPDATE' statements to apply the rows to the final target table.
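A minimal sketch of what that could look like, assuming your connector version supports the --output-method option described in Cloudera's connector documentation (verify the exact flag against the 1.3c5 docs). The connection values are the placeholders from your question; --driver is omitted so that the installed Teradata connector, rather than the generic JDBC manager, can claim the job, and FastLoad expects the target table to be empty:
sqoop export \
  --connect jdbc:teradata://ip_address/Database=dbname \
  --username user --password pwd \
  --table STG_TEST \
  --export-dir /dirpath/ \
  --input-fields-terminated-by "\t" --input-lines-terminated-by "\n" \
  --output-method internal.fastload
The merge from the empty stage table into the final target table is then done in Teradata itself, as described above.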

According to Cloudera's documentation here, "the connector (1.3c5) automatically uses FastExport/FastLoad for better performance."
Looking at your command, you are not specifying any mappers. Use more than one mapper to parallelize the job and improve performance (a hedged example follows the quotation below).
From the Apache Sqoop Cookbook:
The optimal number of mappers depends on many variables: you need to take into account your database type, the hardware that is used for your database server, and the impact to other requests that your database needs to serve. There is no optimal number of mappers that works for all scenarios. Instead, you’re encouraged to experiment to find the optimal degree of parallelism for your environment and use case. It’s a good idea to start with a small number of mappers, slowly ramping up, rather than to start with a large number of mappers, working your way down.
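As an illustration only (4 is an arbitrary starting value, not a recommendation), the mapper count is set with -m / --num-mappers while the rest of your command stays the same:
sqoop export -D sqoop.export.records.per.statement=100 -m 4 \
  --connect jdbc:teradata://ip_address/Database=dbname \
  --username user --password pwd \
  --table STG_TEST --export-dir /dirpath/ \
  --input-fields-terminated-by "\t" --input-lines-terminated-by "\n" \
  --connection-param-file /path/sqoop.properties --batch
Measure the throughput, then ramp the value up or down as the Cookbook advises.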

Related

Easiest way to replicate (copy? Export and import?) a large, rarely changing postgreSQL database

I have imported about 200 GB of census data into a PostgreSQL 9.3 database on a Windows 7 box. The import process involves many files and has been complex and time-consuming. I'm just using the database as a convenient container. The existing data will rarely if ever change, and I will be updating it with external data at most once a quarter (though I'll be adding and modifying intermediate result columns on a much more frequent basis). I'll call the data in the database on my desktop the “master.” All queries will come from the same machine, not remote terminals.
I would like to put copies of all that data on three other machines: two laptops, one Windows 7 and one Windows 8, and an Ubuntu virtual machine on my Windows 7 desktop as well. I have installed copies of PostgreSQL 9.3 on each of these machines, currently empty of data. I need to be able to do both reads and writes on the copies. It is OK, and indeed I would prefer it, if changes in the daughter databases do not propagate backwards to the primary database on my desktop. I'd want to update the daughters from the master 1 to 4 times a year. If this wiped out intermediate results on the daughter databases, that would not bother me.
Most of the replication techniques I have read about seem to be concerned with transaction-by-transaction replication of a live and constantly changing server, and a perfect history of queries and changes. That is overkill for me. Is there a way to replicate by just copying certain files from one PostgreSQL instance to another? (If replication is the name of a specific form of copying, I'm trying to ask the more generic question.) Or maybe by restoring each (empty) instance from a backup file of the master? Or by asking PostgreSQL to create and export (ideally on an external hard drive) some kind of PostgreSQL binary of the data that another instance of PostgreSQL can import, without my having to define all the tables and data types and so forth again?
This question is also motivated by my desire to work around a home wifi/lan setup that is very slow – a tenth or less of the speed of file copies to an external hard drive. So if there is a straightforward way to get the imported data from one machine to another by transference of (ideally compressed) binary files, this would work best for my situation.
While you could perhaps copy the data directory directly as mentioned by Nick Barnes in the comments above, I would recommend using a combination of pg_dump and pg_restore, which will dump a self-contained file which can then be dispersed to the other copies.
You can run pg_dump on the master to get a dump of the DB. I would recommend the options -Fd -j3 to use the directory format (instead of dumping plain SQL; this format is compressed, so it should be much smaller, and it is the only format that supports parallel dumping), which will dump 3 tables at once (this can be adjusted up or down depending on the disk throughput capabilities of your machine and the number of cores that it has).
Then you run dropdb on the copies, createdb to recreate an empty DB of the same name, and then run pg_restore against that new empty DB to restore the dump. You would want to use the options -d <dbname> -j3 and pass the dump directory as the final argument (again adjusting the number for -j according to the abilities of the machine).
When you want to refresh the copies with new content from the master DB, simply repeat the above steps.
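A sketch of the round trip, assuming a database named census and a dump directory census_dump (both names are placeholders); note that pg_dump's parallel -j option requires the directory format, and that pg_restore takes the dump path as its final argument:
# on the master: parallel dump in compressed directory format
pg_dump -Fd -j 3 -f census_dump census
# on each copy: recreate an empty database, then restore in parallel
dropdb census
createdb census
pg_restore -d census -j 3 census_dump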

Import data on HDFS to SQL Server or export data on HDFS to SQL Server

I have been trying to figure out which is the best approach for porting data from HDFS to SQL Server:
Do I import data from Cloudera Hadoop using the Sqoop Hadoop Connector for SQL Server 2008 R2, or
do I export data from Cloudera Hadoop into SQL Server using Sqoop?
I am sure that both are possible, based on the links I have read through:
http://www.cloudera.com/blog/2011/10/apache-sqoop-overview/
http://www.microsoft.com/en-in/download/details.aspx?id=27584
But when I look for possible issues that could arise at the level of configuration and maintenance, I don't find proper answers.
I strongly feel that I should go for the import, but I am not comfortable troubleshooting and maintaining the issues that could come up every now and then.
Can someone share their thoughts on what could be the best?
Both of your options use the same method: Apache Sqoop's export utility. Using the licensed Microsoft connector/driver jar should be expected to yield better performance for the task than using a generic connector offered by Apache Sqoop.
In terms of maintenance, there should be none once you have it working fine. So long as the version of SQL Server in use is supported by the driver jar, it should continue to work as expected.
In terms of configuration, you may initially have to tune the -m value manually to find the best degree of parallelism for the export MapReduce job launched by the export tool. Too high a value causes problems on the DB side, while too low a value does not give you ideal performance. Some trial and error is required here to arrive at the right -m value, along with knowledge of the load periods of your DB, in order to set the parallelism right.
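As a rough, hedged illustration (the host, database, table, and path are placeholders, and 4 is just a starting value to adjust through trial and error):
sqoop export -m 4 \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=reportdb" \
  --username user --password pwd \
  --table DAILY_AGG \
  --export-dir /user/etl/daily_agg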
The Apache Sqoop (v1) documentation page for the export tool also lists a set of common reasons for export job failures. You may want to view those here.
On the MapReduce side, you may also want to dedicate a scheduler pool or queue to such externally writing jobs, since they may be business critical. Schedulers like the FairScheduler and CapacityScheduler let you define SLA guarantees per pool or queue, so that these jobs get adequate resources to run when they're launched.
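For example, if your Hadoop admin has set up a dedicated queue (hypothetically named sqoop-exports here), the export job can be pointed at it through the standard MapReduce queue property:
sqoop export -Dmapred.job.queue.name=sqoop-exports -m 4 \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=reportdb" \
  --username user --password pwd \
  --table DAILY_AGG --export-dir /user/etl/daily_agg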

Using Hadoop to perform DML operations on large fixed-format files

We have a product that uses a MySQL database as its data store. The data store holds a large amount of data. The problem we are facing is that the response time of the application is very slow. The database queries are very basic, with very simple joins, if any. According to some senior employees, the root cause of the slow response time is the database operations on the huge data store.
Another team in our company worked on a project in the past where they processed large fixed-format files using Hadoop and dumped the contents of these files into database tables. Borrowing from this project, some of the team members feel that we can migrate from a MySQL database to simple fixed-format files that hold the data instead, with one file corresponding to each table in the database. We can then build another data-interaction layer that provides interfaces for performing DML operations on the contents of these files. This layer will be developed using Hadoop and the MapReduce programming model.
At this point, several questions come to my mind.
1. Does the problem statement fit into the kind of problems that are solved using Hadoop?
2. How will the application ask the data-interaction layer to fetch/update/delete the required data? As far as my understanding goes, the files containing the data will reside on HDFS. We will spawn a Hadoop job that will process the required file (similar to a table in the db) and fetch the required data. This data will be written to an output file on HDFS. We will have to parse this file to get the required content.
3. Will the approach of using fixed-format files and processing them with Hadoop truly solve the problem?
I have managed to set up a simple cluster with two Ubuntu machines, but after playing around with Hadoop for a while, I feel that the problem statement is not a good fit for Hadoop. I could be completely wrong, and therefore want to know whether Hadoop fits this scenario, or whether it is just a waste of time because the problem statement is not in line with what Hadoop is meant for.
I would suggest going straight to Hive (http://hive.apache.org/). It is a SQL engine / data warehouse built on top of Hadoop MapReduce.
In a nutshell, it gets Hadoop's scalability, and also Hadoop's high latency.
I would consider storing the bulk of the data there, doing all the required transformations, and moving only summarized data to MySQL to serve queries. It is usually not a good idea to translate user requests into Hive queries: they are too slow, and the ability to run jobs in parallel is not trivial.
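A hedged sketch of that flow, with made-up table and column names: do the heavy aggregation in Hive, then ship only the small summary to MySQL (here via Sqoop, assuming the Hive table is stored as delimited text with the default \001 field separator):
# summarize the raw data inside Hive
hive -e "INSERT OVERWRITE TABLE daily_summary SELECT dt, COUNT(*) AS cnt FROM events GROUP BY dt;"
# push only the summarized rows to MySQL
sqoop export \
  --connect jdbc:mysql://dbhost/reports \
  --username user --password pwd \
  --table daily_summary \
  --export-dir /user/hive/warehouse/daily_summary \
  --input-fields-terminated-by '\001'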
If you are planning to update the data frequently, then storing it directly in Hadoop may not be a good option for you. To update a file in Hadoop you may have to rewrite the file, then delete the old file and copy the new file into HDFS.
However, if you are just searching and joining the data, then it is a good option. If you use Hive, you can write queries that look like SQL.
In Hadoop, your workflow could be something like the following (a rough command-line sketch is given after the list):
You run a Hadoop job for your query.
Your Hadoop program parses the query and executes a job that joins and reads files based on your query and input parameters.
Your output is generated in HDFS.
You copy the output to the local file system and then show it to your program.
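As a rough sketch of that flow (the jar, class, and path names are hypothetical):
# run the job that scans/joins the fixed-format files for a given query
hadoop jar query-engine.jar com.example.QueryJob /data/tables/orders /tmp/query-output
# inspect the result and pull it down to the local file system
hadoop fs -cat /tmp/query-output/part-r-00000 | head
hadoop fs -get /tmp/query-output ./query-output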

Why is BCP so fast?

So BCP for inserting data into a SQL Server DB is very, very fast. What is it doing that makes it so fast?
In SQL Server, BCP input is logged very differently from traditional INSERT statements. How SQL Server decides to handle things depends on a number of factors, some of which most developers never even consider, such as which recovery model the database is set to use.
bcp uses the same facility as BULK INSERT and the SqlBulkCopy classes.
More details here
http://msdn.microsoft.com/en-us/library/ms188365.aspx
The bottom line is this: these bulk operations log less data than normal operations and have the ability to instruct SQL Server to skip its traditional checks and balances on the incoming data. All of those things together serve to make it faster.
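For context, a hedged example of a bcp invocation (server, table, and file names are placeholders); the -b batch size and the TABLOCK hint are among the knobs that influence how much logging and locking is involved:
bcp SalesDB.dbo.Orders in orders.dat -S sqlhost -T -c -b 50000 -h "TABLOCK"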
It cheats.
It has intimate knowledge of the internals and is able to map your input data more directly onto those internals. It can skip other heavyweight operations (like parsing, optimization, transactions, logging, and isolation, and it can defer index maintenance). It can make assumptions that apply to every row of data that a normal INSERT statement cannot.
Basically, it's able to skip the bulk of the functionality that makes a database a database, and then clean up after itself en masse at the end.
The main difference I know of between bcp and a normal INSERT is that bcp doesn't need to keep a separate transaction log entry for each individual transaction.
The speed comes from the use of the BCP API of the SQL Server Native Client ODBC driver. According to Microsoft:
http://technet.microsoft.com/en-us/library/aa337544.aspx
The bcp utility (Bcp.exe) is a command-line tool that uses the Bulk Copy Program (BCP) API...
Bulk Copy Functions reference:
http://technet.microsoft.com/en-us/library/ms130922.aspx

A way to export the results from Pig to a database

Is there a way to export the results from Pig directly to a database like MySQL?
While keeping in mind what orangeoctopus said (beware of DDOS...), have you had a look at DBStorage?
data = LOAD '...' AS (...);
...
STORE data INTO 'unused' USING org.apache.pig.piggybank.storage.DBStorage('com.mysql.jdbc.Driver', 'jdbc:mysql://host/db', 'INSERT ...');
The main problem I see is that each reducer is effectively going to insert into the database around the same time.
If you don't think this will be an issue, I suggest you write a custom Storage method that uses JDBC (or something similar) to insert into the database directly and writing nothing out to HDFS.
If you are afraid of performing a DDOS attack on your own database, perhaps collecting the data on HDFS and performing a separate bulk load into MySQL would be better.
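A hedged sketch of that two-step route (paths, database, and table names are illustrative, and LOAD DATA LOCAL must be enabled on the MySQL server):
# pull the Pig output down from HDFS as a single TSV file
hadoop fs -getmerge /user/pig/output results.tsv
# bulk load it into MySQL
mysql --local-infile=1 -h dbhost -u user -p mydb \
  -e "LOAD DATA LOCAL INFILE 'results.tsv' INTO TABLE results FIELDS TERMINATED BY '\t';"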
I'm currently experimenting with an embedded Pig application which loads results into MySQL via PigServer.OpenIterator and a JDBC connection. It has worked very well in testing, but I haven't tried it at scale yet. This is similar to the custom storage method already suggested, but it runs from a single point, so there is no accidental DDOS attack. You effectively end up paying the network transfer cost twice (cluster -> staging machine, staging machine -> DB server) if you don't run the load on the DB server (I personally prefer to run nothing except the DB itself on the DB server), but that's no different from the "write the file out and bulk load it" option.
Sqoop may be a good way to go, but it is difficult to set up (IMHO), as are all these Hadoop-related projects...
Pig's DBStorage is working fine (at least for storing).
Don't forget to register the PiggyBank and your MySQL driver:
-- Register Piggy bank
REGISTER /opt/cmr/pig/pig-0.10.0/lib/piggybank.jar;
-- Register MySQL driver
REGISTER /opt/cmr/mysql/drivers/mysql-connector-java-5.1.15-bin.jar;
Here is a sample call:
-- Store a relation into a SQL table
STORE relation INTO 'unused' USING org.apache.pig.piggybank.storage.DBStorage('com.mysql.jdbc.Driver', 'jdbc:mysql://<mysqlserver>/<database>', '<login>', '<password>', 'REPLACE INTO <table> (<column1>, <column2>) VALUES (?, ?)');
Try using Sqoop
