Does Sqoop use Reducer?

Does Sqoop run a reducer if there is a join/aggregation in the SELECT query given with the --query parameter? Or is there any case in Sqoop where both mappers and reducers run?
The documentation specifies that each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop.
$ sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
--split-by a.id --target-dir /user/foo/joinresults
In the example above, how does the JOIN take place when the data is first partitioned using $CONDITIONS?

The join/computation will be executed on the RDBMS, and its result will be read by the mappers and transferred to HDFS.
No reducer is involved.
With the --query parameter you need to specify the --split-by parameter with the column that should be used for slicing your data into multiple parallel map tasks. For a plain --table import this column usually defaults to the primary key of the main table, but with a free-form query Sqoop cannot infer one, so you must supply it (unless you run with a single mapper).
Sqoop will automatically substitute the $CONDITIONS placeholder with the generated conditions specifying which slice of data each map task should transfer.
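To illustrate (a hedged sketch, assuming a.id values run from 1 to 100 and 4 map tasks): Sqoop first issues a boundary query along the lines of SELECT MIN(a.id), MAX(a.id) over your query, splits that range into 4 intervals, and each map task then executes the full query with its own interval substituted for $CONDITIONS, roughly:
SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE ( a.id >= 1 ) AND ( a.id < 26 )
SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE ( a.id >= 26 ) AND ( a.id < 51 )
...
So the JOIN itself runs inside the database once per slice; the map tasks only copy the result rows to HDFS, and no reduce phase is needed.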

In your particular command, Sqoop does not use a reducer.
However, there are cases where Sqoop does use reducers. Check the example below, taken from the Sqoop documentation's pg_bulkload connector section: each map task loads its slice into its own staging table via pg_bulkload, and the staging tables are then merged into the destination table in the reduce phase, which is why -Dmapred.reduce.tasks=2 is set.
$ sqoop export \
-Dmapred.reduce.tasks=2 \
-Dpgbulkload.bin="/usr/local/bin/pg_bulkload" \
-Dpgbulkload.input.field.delim=$'\t' \
-Dpgbulkload.check.constraints="YES" \
-Dpgbulkload.parse.errors="INFINITE" \
-Dpgbulkload.duplicate.errors="INFINITE" \
--connect jdbc:postgresql://pgsql.example.net:5432/sqooptest \
--connection-manager org.apache.sqoop.manager.PGBulkloadManager \
--table test --username sqooptest --export-dir=/test -m 2

Related

Upsert data into SQL Server from pyspark code

I have a PySpark dataframe that I want to upsert into a SQL Server table. I looked at the df.write modes and do not see any upsert option. Therefore I am trying to write the dataframe to HDFS in Parquet format and then sqoop the files using --update-mode allowinsert. However, I keep getting the following error:
Got exception in update thread: com.microsoft.sqlserver.jdbc.SQLServerException: One or more values is out of range of values for the datetime2 SQL Server data type
I tried writing the file as CSV just to check whether the contents/timestamps in the file are out of range; however, the timestamps are correct.
Has anybody been able to write the pyspark dataframe into a SQL Server table?
Here's the function to write DF to HDFS:
def write_df_to_hdfs(df, filename, hdfs_location_working):
    """
    Function to write delta records dataframe to HDFS
    """
    logging.info("Started writing delta records dataframe to hdfs")
    df.write.save(hdfs_location_working, format='parquet', mode='append', timestampFormat='YYYY-MM-dd hh:mm:ss.SSS', emptyValue="")
    logging.info("Successfully written delta records dataframe to hdfs")
Also, here's the sqoop command I'm using to write that data into SQL Server:
sqoop export \
-Dmapreduce.map.memory.mb=4096 \
-Dmapreduce.map.java.opts=-Xmx3000m \
-Dmapred.job.queuename=ici \
-Dsqoop.export.records.per.statement=30 \
-Dsqoop.export.statements.per.transaction=30 \
-libjars /opt/cloudera/parcels/CDH-7.1.6-1.cdh7.1.6.p6.12486751/lib/sqoop/lib/sqljdbc.jar \
--connect "jdbc:sqlserver://*******.hosts.cloud.ford.com;databaseName=SQTDIAPM_AM;schema=dbo;" \
--username 'user' \
--password 'pwd' \
--export-dir <HDFS Path> \
--table <tablename> \
--input-null-string '""' \
--input-null-string '\\N' \
--input-null-non-string '\\N' \
--update-key col1,col2 \
--update-mode allowinsert \
--batch \
-m 40 \
--verbose
Appreciate your help!

Importing data from SQL Server to HIVE using SQOOP

I am able to successfully import data from SQL Server to HDFS using sqoop. However, when it tries to load the data into Hive I get an error. I am not sure I understand the error correctly.
sudo -u hdfs sqoop import \
-Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect "jdbc:sqlserver://XX.XX.X.X:1433;instanceName=data-engr-sql-svr; databaseName=AdventureWorks2019" \
--username sa \
--password XXXXXXXX \
--driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
--warehouse-dir "/user/hive/warehouse/AdventureWorks2019.db" \
--hive-import \
--create-hive-table \
--fields-terminated-by ',' \
--hive-table AdventureWorks2019.Production.TransactionHistory \
--table Production.TransactionHistory \
--split-by TransactionID \
-- --schema Production
I don't know how to handle schemas; most tutorials use a dummy database without proper schemas, which is not helpful.
Error
21/03/31 08:52:47 INFO conf.HiveConf: Using the default value passed in for log id: 95e2b831-cfe5-4108-be0f-0df1d9a8797e
21/03/31 08:52:47 INFO session.SessionState: Updating thread name to 95e2b831-cfe5-4108-be0f-0df1d9a8797e main
21/03/31 08:52:47 INFO conf.HiveConf: Using the default value passed in for log id: 95e2b831-cfe5-4108-be0f-0df1d9a8797e
21/03/31 08:52:47 INFO ql.Driver: Compiling command(queryId=hdfs_20210331085247_050638e8-593a-4d01-8020-c40b7db8e66a): CREATE TABLE IF NOT EXISTS AdventureWorks2019.Production.TransactionHistory ( TransactionID INT, ProductID INT, ReferenceOrderID INT, ReferenceOrderLineID INT, TransactionDate STRING, TransactionType STRING, Quantity INT, ActualCost DOUBLE, ModifiedDate STRING) COMMENT 'Imported by sqoop on 2021/03/31 08:52:45' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054' LINES TERMINATED BY '\012' STORED AS TEXTFILE
21/03/31 08:52:49 INFO hive.metastore: HMS client filtering is enabled.
21/03/31 08:52:49 INFO hive.metastore: Trying to connect to metastore with URI thrift://cnt7-naya-cdh63:9083
21/03/31 08:52:49 INFO hive.metastore: Opened a connection to metastore, current connections: 1
21/03/31 08:52:49 INFO hive.metastore: Connected to metastore.
21/03/31 08:52:49 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
FAILED: SemanticException [Error 10255]: Invalid table name AdventureWorks2019.Production.TransactionHistory
21/03/31 08:52:49 ERROR ql.Driver: FAILED: SemanticException [Error 10255]: Invalid table name AdventureWorks2019.Production.TransactionHistory
There is no such thing as a schema inside a database in Hive. Database and schema mean the same thing and can be used interchangeably.
So the bug is in using database.schema.table. Use database.table in Hive.
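For example, a hedged sketch of the two table arguments (assuming the AdventureWorks2019 Hive database already exists; only the Hive-side name changes, while the SQL Server side stays schema-qualified):
--hive-table AdventureWorks2019.TransactionHistory \
--table Production.TransactionHistory \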
Read the documentation: Create/Drop/Alter/UseDatabase

sqoop import all to hive from db2 specific schema

I was trying to import all tables from a specific schema in DB2 using the command line below.
sqoop import-all-tables --username user --password pass \
--connect jdbc:db2://myip:50000/databs:CurrentSchema=testdb \
--driver com.ibm.db2.jcc.DB2Driver --fields-terminated-by ',' \
--lines-terminated-by '\n' --hive-database default --hive-import --hive-overwrite \
--create-hive-table -m 1;
Stuck with the following error:
2017-05-02 09:21:18,474 ERROR - [main:] ~ Error reading database metadata:
com.ibm.db2.jcc.am.SqlSyntaxErrorException: [jcc][10165][10051][4.11.77]
Invalid database URL syntax:
jdbc:db2://myip:50000/msrc:CurrentSchema=testdb. ERRORCODE=-4461,
SQLSTATE=42815 (SqlManager:43)
com.ibm.db2.jcc.am.SqlSyntaxErrorException: [jcc][10165][10051][4.11.77]
Invalid database URL syntax:
jdbc:db2://myip:50000/msrc:CurrentSchema=testdb. ERRORCODE=-4461,
SQLSTATE=42815
at com.ibm.db2.jcc.am.gd.a(gd.java:676)
at com.ibm.db2.jcc.am.gd.a(gd.java:60)
at com.ibm.db2.jcc.am.gd.a(gd.java:85)
at com.ibm.db2.jcc.DB2Driver.tokenizeURLProperties(DB2Driver.java:911)
at com.ibm.db2.jcc.DB2Driver.connect(DB2Driver.java:408)
at java.sql.DriverManager.getConnection(DriverManager.java:571)
at java.sql.DriverManager.getConnection(DriverManager.java:215)
at
org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:885)
at
org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
at org.apache.sqoop.manager.SqlManager.listTables(SqlManager.java:520)
at
org.apache.sqoop.tool.ImportAllTablesTool.run(ImportAllTablesTool.java:95)
at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
Caused by: java.util.NoSuchElementException
at java.util.StringTokenizer.nextToken(StringTokenizer.java:349)
at java.util.StringTokenizer.nextToken(StringTokenizer.java:377)
at com.ibm.db2.jcc.DB2Driver.tokenizeURLProperties(DB2Driver.java:899)
... 13 more
Could not retrieve tables list from server
2017-05-02 09:21:18,696 ERROR - [main:] ~ manager.listTables() returned null
(ImportAllTablesTool:98)
Command:
sqoop import-all-tables \
--driver com.ibm.db2.jcc.DB2Driver \
--connect jdbc:db2://myip:50000/databs \
--username username --password password \
--hive-database default --hive-import --m 1 \
--create-hive-table --hive-overwrite
The import-all-tables tool imports a set of tables from an RDBMS to HDFS. Data from each table is stored in a separate directory in HDFS.
For the import-all-tables tool to be useful, the following conditions must be met:
Each table must have a single-column primary key.
You must intend to import all columns of each table.
You must not intend to use non-default splitting column, nor impose any conditions via a WHERE clause.
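If the import really does need to be limited to one DB2 schema, note that each property in a DB2 JDBC URL has to be terminated with a semicolon; the missing semicolon is the likely cause of the NoSuchElementException above. A hedged sketch of such a connect string, assuming the same host and database:
--connect 'jdbc:db2://myip:50000/databs:currentSchema=testdb;' \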

Apache Sqoop import qualified table from SQL Server

When I try to import a table from SQL Server using
sqoop import \
-m 1 \
--connect jdbc:sqlserver://Arwen:1433 \
--username=bods \
--password=*** \
--table datamart.dbo.fct_txn \
--compression-codec=snappy \
--as-avrodatafile \
--warehouse-dir=/user/tkidb
Sqoop seems to generate the wrong query syntax. Apparently it expects an unqualified table name; then the brackets would work. How do I tackle this?
16/06/25 07:44:55 INFO tool.CodeGenTool: Beginning code generation
16/06/25 07:44:57 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM [datamart.dbo.fct_txn] AS t WHERE 1=0
16/06/25 07:44:57 ERROR manager.SqlManager: Error executing statement: com.microsoft.sqlserver.jdbc.SQLServerException: Invalid object name 'datamart.dbo.fct_txn'.
Based on the query from the error log:
SELECT t.* FROM [datamart.dbo.fct_txn] AS t WHERE 1=0
The problem is that the whole qualified name is quoted as [datamart.dbo.fct_txn]; the correct syntax would be [datamart].[dbo].[fct_txn] or datamart.dbo.fct_txn. Try changing two lines:
--connect 'jdbc:sqlserver://Arwen:1433;database=datamart' \
--table fct_txn
If datamart is the default database for the user you log in as, then change only the --table part.
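Put together, a hedged sketch of the corrected command (same options as the original; only the connect string and the table name change):
sqoop import \
-m 1 \
--connect 'jdbc:sqlserver://Arwen:1433;database=datamart' \
--username=bods \
--password=*** \
--table fct_txn \
--compression-codec=snappy \
--as-avrodatafile \
--warehouse-dir=/user/tkidb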

Sqoop export columns

Do we need to give the SQL column names in the same order as the HDFS columns?
Example:
We update the SQL table with the following command:
sqoop export \
--connect "jdbc:sqlserver://blah;database=blahblah" \
--username="user" --password="pass" \
--driver "com.microsoft.sqlserver.jdbc.SQLServerDriver" \
--table "blahblhablha" --export-dir "/blah" \
--columns "id,name,age" --update-key "id"
where my SQL table has the format:
+--+---+----+
|id|age|name|
+--+---+----+
When I execute the above sqoop command, it runs fine but freezes at 100% and never finishes the job.
Is it compulsory that the columns be in order? (I don't think so.)
It runs fine when I give them in the same order: --columns "id,age,name"
Is there something I'm missing here?
Thanks in advance
It depends on the order the data in HDFS is in. Let's use the following CSV data as an example:
1,55,"John Smith"
2,22,"Jason Jonas"
...
If the --columns argument is set to id,name,age, then Sqoop will create insert statements of the form INSERT INTO blahblhablha (id, name, age) VALUES (1, 55, "John Smith"), which binds 55 to name and "John Smith" to the integer age column. This should not work, since "John Smith" is clearly not an integer. So, order matters here.
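A minimal sketch of the matching option for the sample data above (assuming the HDFS files really are laid out as id,age,name):
--columns "id,age,name" --update-key "id"
That is, --columns must describe the order of the fields in the exported files, not the order of the columns in the target table.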
See the docs for more details and the user@sqoop.apache.org mailing list for more help.
