It is possible to export data from HDFS to an RDBMS table using Sqoop.
But it seems like we need to have an existing table.
Is there some parameter to tell Sqoop to do the 'CREATE TABLE' thing and export data to this newly created table?
If yes, is it going to work with Oracle?
I'm afraid Sqoop does not support creating tables in the RDBMS at the moment. Sqoop uses the table in the RDBMS to get metadata (the number of columns and their data types), so I'm not sure where Sqoop could get the metadata to create the table for you.
You can actually execute arbitrary SQL queries and DDL via sqoop eval, at least with MySQL and MSSQL. I'd expect it to work with Oracle as well. MSSQL example:
sqoop eval \
    --connect 'jdbc:sqlserver://<DB SERVER>:<DB PORT>;database=<DB NAME>' \
    --query "CREATE TABLE..." \
    --username <USERNAME> -P
I noticed you use Oracle too. Certain vendor-specific Sqoop connectors support that, including the Oracle one. Sqoop's Oracle direct connect mode has an option to do it:
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_create_oracle_tables
24.8.5.4. Create Oracle Tables
-Doraoop.template.table=TemplateTableName
Creates OracleTableName by replicating the structure and data types of
TemplateTableName. TemplateTableName is a table that exists in Oracle
prior to executing the Sqoop command.
P.S. You'll have to use the --direct option of sqoop export to activate Sqoop's direct mode, i.e. the 'Data Connector for Oracle and Hadoop' (formerly known as OraOop).
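For illustration, an export that creates the target table from a template might look roughly like this; the connection string, table names and HDFS path are placeholders, not anything from your environment (note that the -D generic argument has to come right after the tool name):
sqoop export \
    -Doraoop.template.table=TEMPLATE_TABLE \
    --direct \
    --connect 'jdbc:oracle:thin:@//<DB SERVER>:<DB PORT>/<SERVICE NAME>' \
    --username <USERNAME> -P \
    --table NEW_TABLE \
    --export-dir /user/hive/warehouse/my_data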
Related
I am using a SqlServer database and need to connect to a Hive database. The end goal is to be able to push data from SqlServer to a Hive table. Connecting to SqlServer from Hive via Sqoop is not an option. How would I accomplish this?
How big is the data? If your database contains only a few tables, you can export the SQL Server database to Excel or CSV files and load them into Hive; otherwise you can write some code to accomplish this.
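As a rough sketch of the CSV route (the database, table, file and path names here are made up, and the Hive table is assumed to already exist with a matching comma delimiter), you would bcp the table out to a delimited file, copy it to HDFS, then load it:
bcp MyDb.dbo.MyTable out mytable.csv -S <DB SERVER> -U <USERNAME> -P <PASSWORD> -c -t,
hdfs dfs -put mytable.csv /user/hive/staging/
hive -e "LOAD DATA INPATH '/user/hive/staging/mytable.csv' INTO TABLE my_hive_table;"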
Which is the better option among the following, in terms of speed and performance, for exporting data from Hive/HDFS to SQL Server?
1) Using the Sqoop export facility to connect to the RDBMS (SQL Server) and export the data directly.
2) Dumping CSV files from Hive using the INSERT OVERWRITE LOCAL DIRECTORY command (roughly as sketched below) and then performing BCP (or a Bulk Insert query) on those CSV files to load the data into the SQL Server database.
Or,
Is there any other better option?
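For reference, the Hive side of option 2 would be something like this (the directory and table name are just placeholders, and ROW FORMAT DELIMITED on INSERT OVERWRITE DIRECTORY needs a reasonably recent Hive version):
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive_export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM my_hive_table;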
In my experience, I use bcp whenever I can. From what I can tell, it's the fastest way to shotgun data into a database and is configurable at a (somewhat) fine-grained level.
A couple of things to consider:
Use a staging table: no primary key, no indexes, just raw data (see the sketch below).
Have a "consolidation" proc to move the data around after loading.
Use a batch size of about 5000 rows to start, but if performance is of utmost concern, then test.
Make sure you increase your timeout.
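A minimal sketch of that pattern, assuming comma-delimited files produced by the Hive dump above; the column list, table names and file path are invented for illustration:
CREATE TABLE dbo.MyTable_staging (id INT, name VARCHAR(100), amount DECIMAL(18,2));
bcp MyDb.dbo.MyTable_staging in /tmp/hive_export/000000_0 -S <DB SERVER> -U <USERNAME> -P <PASSWORD> -c -t, -b 5000
INSERT INTO dbo.MyTable SELECT id, name, amount FROM dbo.MyTable_staging;
TRUNCATE TABLE dbo.MyTable_staging;
The -b 5000 flag is bcp's batch size; the last two statements are the "consolidation" step that moves the rows into the real (indexed) table after the load.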
On the subject of importing data with Sqoop from Microsoft SQL Server: how does Sqoop handle database locks when running import table commands?
More info:
Sqoop is using a JDBC driver.
Sqoop handles database locks by taking the locks it requires and respecting conflicting locks acquired by other processes, the same as everybody else.
What exactly are you worried about? For an import, Sqoop just runs ordinary SELECT queries over JDBC.
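If the concern is that those SELECTs take shared locks and block writers, and dirty reads are acceptable for your use case, one option is a free-form query import with a NOLOCK hint (a sketch only; the connection details, table, split column and target directory are placeholders):
sqoop import \
    --connect 'jdbc:sqlserver://<DB SERVER>:<DB PORT>;database=<DB NAME>' \
    --username <USERNAME> -P \
    --query 'SELECT * FROM MyTable WITH (NOLOCK) WHERE $CONDITIONS' \
    --split-by id \
    --target-dir /user/me/mytable_import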
I have a question regarding Apache Sqoop. I have imported data into HDFS using the Apache Sqoop import facility.
Next, I need to put the data back into another database (basically I am transferring data from one database vendor to another) using Hadoop (Sqoop).
To put the data into SQL Server, there are 2 options:
1) Using the Sqoop export facility to connect to my RDBMS (SQL Server) and export the data directly (roughly as invoked below).
2) Copying the HDFS data files (which are in CSV format) to my local machine using the copyToLocal command and then performing BCP (or a Bulk Insert query) on those CSV files to put the data into the SQL Server database.
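For reference, option 1 would be invoked roughly like this (the connection string, table name, export directory and delimiter are placeholders for my actual setup):
sqoop export \
    --connect 'jdbc:sqlserver://<DB SERVER>:<DB PORT>;database=<DB NAME>' \
    --username <USERNAME> -P \
    --table MyTable \
    --export-dir /user/me/exported_data \
    --input-fields-terminated-by ',' \
    -m 7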
I would like to understand which is the correct approach, and which of the two is faster: the bulk insert or the Apache Sqoop export from HDFS into the RDBMS?
Are there any other ways, apart from the two mentioned above, that can transfer data faster from one database vendor to another?
I am using 6-7 mappers (around 20-25 million records are to be transferred).
Please suggest, and kindly let me know if my question is unclear.
Thanks in advance.
If all you do is ETL from one vendor to another, then going through Sqoop/HDFS is a poor choice. Sqoop makes perfect sense if the data originates in HDFS or is meant to stay in HDFS. I would also consider Sqoop if the data set were so large as to warrant a large cluster for the transformation stage. But a mere 25 million records is not worth it.
With SQL Server it is imperative, on large imports, to achieve minimal logging, which requires bulk insert. Although 25 million rows is not so large as to make the bulk option imperative, AFAIK neither Sqoop nor Sqoop2 supports bulk insert for SQL Server yet.
I recommend SSIS instead. It is much more mature than Sqoop, it has a bulk insert task, and it has a rich transformation feature set. Your small import is well within the size SSIS can handle.
I have a number of tables in a database on the server side in PostgreSQL. I want to have them all in another database. Is that possible?
Use the PostgreSQL pg_dump utility with the -t table option to define the tables that should be dumped, and restore them in another database. For more information see the PostgreSQL pg_dump documentation page.
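For example (the host, user, database and table names are placeholders), you can pipe a plain-format dump of selected tables straight into the target database, or dump to a file and replay it:
pg_dump -h source_host -U source_user -t table_one -t table_two source_db \
    | psql -h target_host -U target_user target_db
pg_dump -h source_host -U source_user -t table_one -t table_two -f tables.sql source_db
psql -h target_host -U target_user -d target_db -f tables.sql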