Sqoop export columns - sql-server

Do we need to give the SQL column names in the same order as the HDFS columns?
Example:
We update the SQL table with the following command:
sqoop export \
--connect "jdbc:sqlserver://blah;database=blahblah" \
--username="user" --password="pass" \
--driver "com.microsoft.sqlserver.jdbc.SQLServerDriver" \
--table "blahblhablha" --export-dir "/blah" \
--columns "id,name,age" --update-key "id"
where my SQL table is in format:
+--+---+----+
|id|age|name|
+--+---+----+
When I execute the above sqoop command it runs fine, but it freezes at 100% and never finishes the job.
Is it compulsory that the columns should be in order? (I don't think so.)
It runs fine when I give the columns in the same order: --columns "id,age,name"
Is there something I'm missing here?
Thanks in advance

It depends on the order the data in HDFS is in. Let's use the following CSV data as an example:
1,55,"John Smith"
2,22,"Jason Jonas"
...
If the --columns argument is set to id,name,age, then Sqoop will create insert statements of the form INSERT INTO blahblhablha (id, name, age) VALUES (1, 55, "John Smith"). This will not work, since "John Smith" would land in the integer age column. So yes, order matters here.
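Note that because the command also passes --update-key "id", Sqoop will actually generate update-style statements rather than plain inserts; a rough sketch of what each record becomes under the id,name,age mapping:
UPDATE blahblhablha SET name = 55, age = 'John Smith' WHERE id = 1;
Either way, the age column receives a non-integer value, so the --columns order has to match the order of the fields in the HDFS files.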
See the docs for more details and the user@sqoop.apache.org mailing list for more help.

Related

SF KAFKA CONNECTOR Detail: Table doesn't have a compatible schema - snowflake kafka connector

I have set up the Snowflake Kafka connector. I created a sample table (kafka_connector_test) in Snowflake with 2 fields, both of VARCHAR type.
The fields are CUSTOMER_ID and PURCHASE_ID.
Here is the configuration I created for the connector:
curl -X POST \
-H "Content-Type: application/json" \
--data '{
"name":"kafka_connector_test",
"config":{
"connector.class":"com.snowflake.kafka.connector.SnowflakeSinkConnector",
"tasks.max":"2",
"topics":"kafka-connector-test",
"snowflake.topic2table.map": "kafka-connector-test:kafka_connector_test",
"buffer.count.records":"10000",
"buffer.flush.time":"60",
"buffer.size.bytes":"5000000",
"snowflake.url.name":"XXXXXXXX.snowflakecomputing.com:443",
"snowflake.user.name":"XXXXXXXX",
"snowflake.private.key":"XXXXXXXX",
"snowflake.database.name":"XXXXXXXX",
"snowflake.schema.name":"XXXXXXXX",
"key.converter":"org.apache.kafka.connect.storage.StringConverter",
"value.converter":"com.snowflake.kafka.connector.records.SnowflakeJsonConverter"}}'\
I send data to the topic that I have configured in the connector configuration.
{"CUSTOMER_ID" : "test_id", "PURCHASE_ID" : "purchase_id_test"}
Then, when I check the Kafka Connect server, I get the error below:
[SF KAFKA CONNECTOR] Detail: Table doesn't have a compatible schema
Is there something I need to set up in either Kafka Connect or Snowflake that says which parts of the JSON go into which columns of the table? I'm not sure how to specify how it parses the JSON.
I also set up a different topic for which I didn't create a table in Snowflake. There the connector was able to populate a table, but it creates the table with 2 columns, RECORD_METADATA and RECORD_CONTENT. I don't want to write a scheduled job to parse that; I want to insert directly into a queryable table.
The Snowflake Kafka connector writes data as JSON by design. The default columns RECORD_METADATA and RECORD_CONTENT are VARIANT. If you want to query them, you can create a view on top of the table; you don't need a scheduled job.
So the table created by the connector would be something like:
RECORD_METADATA, RECORD_CONTENT
{metadata fields in json}, {"CUSTOMER_ID" : "test_id", "PURCHASE_ID" : "purchase_id_test"}
You can create a view to display your data:
create view v1 as
select RECORD_CONTENT:CUSTOMER_ID::text CUSTOMER_ID,
       RECORD_CONTENT:PURCHASE_ID::text PURCHASE_ID
from kafka_connector_test;  -- the table the connector writes to, per snowflake.topic2table.map
Your query will then be:
select CUSTOMER_ID, PURCHASE_ID from v1;
PS: if you want to create your own tables, you need to use VARIANT as the data type instead of VARCHAR.
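For example, a minimal sketch of a pre-created target table the connector will accept (the table name is taken from the snowflake.topic2table.map setting above; the connector expects exactly these two VARIANT columns):
create table kafka_connector_test (
  RECORD_METADATA variant,
  RECORD_CONTENT variant
);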
Also, it looks like mapping JSON fields directly into separate table columns is not supported at this time, in reference to this GitHub issue.

Encrypting data on PostgreSQL

I'm running some queries on PostgreSQL for a student project. Now I need to encrypt some fields using AES-256, run the same queries, and compare the times. Any idea how this can be done using UPDATE table? For example, I need to encrypt the customer address in the address table. Can I do this using UPDATE? Can anyone give me an example? I couldn't find anything online. Thanks.
CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- Note: pgp_sym_encrypt defaults to AES-128; pass 'cipher-algo=aes256' as a
-- third argument if you specifically need AES-256.
INSERT INTO table(name, age) VALUES(
  PGP_SYM_ENCRYPT('John', 'AES_KEY'),
  PGP_SYM_ENCRYPT('22', 'AES_KEY')
);

UPDATE table SET
  (name, age) = (
    PGP_SYM_ENCRYPT('Jona', 'AES_KEY'),
    PGP_SYM_ENCRYPT('15', 'AES_KEY')
  ) WHERE id = '1';

SELECT
  PGP_SYM_DECRYPT(name::bytea, 'AES_KEY') AS name,
  PGP_SYM_DECRYPT(age::bytea, 'AES_KEY') AS age
FROM table
WHERE LOWER(PGP_SYM_DECRYPT(name::bytea, 'AES_KEY')) LIKE LOWER('%John%');
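To answer the UPDATE part of the question directly, here is a minimal sketch for the address column of the address table from the question (the column name is assumed; this keeps the ciphertext in the existing text column, matching the ::bytea casts above, so the column must be wide enough, e.g. text):
-- encrypt the existing plaintext values in place
UPDATE address SET address = PGP_SYM_ENCRYPT(address, 'AES_KEY')::text;
-- read them back
SELECT PGP_SYM_DECRYPT(address::bytea, 'AES_KEY') AS address FROM address;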

Does Sqoop use Reducer?

Does Sqoop run a reducer if a join/aggregation is performed in the SELECT query given with the --query parameter? Or is there any case in Sqoop where both mappers and reducers run?
Documentation specifies that each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop.
$ sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
--split-by a.id --target-dir /user/foo/joinresults
In the example above, how does the JOIN take place where the table is first partitioned using $CONDITIONS?
The join/computation is executed on the RDBMS, and its result is used by the mappers to transfer the data to HDFS.
No reducer is involved.
With the --query parameter, you need to specify --split-by with the column that should be used for slicing your data into multiple parallel map tasks (with --table, this usually defaults to the primary key of the main table).
Sqoop automatically substitutes the $CONDITIONS placeholder with generated conditions that specify which slice of data each map task should transfer.
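For illustration, with --split-by a.id and two map tasks, each mapper ends up running a query along these lines (the boundary values here are made up; Sqoop derives the real ones from MIN(a.id) and MAX(a.id)):
SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE ( a.id >= 1 ) AND ( a.id < 5001 )
SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE ( a.id >= 5001 ) AND ( a.id <= 10000 )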
In your particular command, Sqoop does not use a reducer.
However, there are cases where Sqoop does use reducers. Check the example below, taken from the documentation.
$ sqoop export \
-Dmapred.reduce.tasks=2 \
-Dpgbulkload.bin="/usr/local/bin/pg_bulkload" \
-Dpgbulkload.input.field.delim=$'\t' \
-Dpgbulkload.check.constraints="YES" \
-Dpgbulkload.parse.errors="INFINITE" \
-Dpgbulkload.duplicate.errors="INFINITE" \
--connect jdbc:postgresql://pgsql.example.net:5432/sqooptest \
--connection-manager org.apache.sqoop.manager.PGBulkloadManager \
--table test --username sqooptest --export-dir=/test -m 2

Index is out of range: JDBC SqlServer exception

I am using Sqoop to import data from SQL Server to local HDFS. I am using a simple free-form query to pull some 10 rows from a table. Below is the sqoop command that I execute from the terminal:
sqoop import --connect 'jdbc:sqlserver://xx.xx.xx.xx;username=xx;password=xxxxx;database=DBName' --query "SELECT top 10 OrderID from DJShopcart_OrderItems where \$CONDITIONS" --split-by "OrderID" --target-dir /work/gearpurchase
When I execute this from my local machine, I get the following exception:
Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: The index 2 is out of range.
    at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDriverError(SQLServerException.java:191)
    at com.microsoft.sqlserver.jdbc.SQLServerResultSet.verifyValidColumnIndex(SQLServerResultSet.java:543)
    at com.microsoft.sqlserver.jdbc.SQLServerResultSet.getterGetColumn(SQLServerResultSet.java:2066)
    at com.microsoft.sqlserver.jdbc.SQLServerResultSet.getValue(SQLServerResultSet.java:2099)
    at com.microsoft.sqlserver.jdbc.SQLServerResultSet.getValue(SQLServerResultSet.java:2084)
    at com.microsoft.sqlserver.jdbc.SQLServerResultSet.getInt(SQLServerResultSet.java:2327)
    at org.apache.sqoop.lib.JdbcWritableBridge.readInteger(JdbcWritableBridge.java:52)
    at com.cloudera.sqoop.lib.JdbcWritableBridge.readInteger(JdbcWritableBridge.java:53)
    at QueryResult.readFields(QueryResult.java:105)
    at org.apache.sqoop.mapreduce.db.DBRecordReader.nextKeyValue(DBRecordReader.java:244)
If I import 2 columns, the exception says index 3 is out of range.
I also checked the SQLServerResultSet class documentation to understand what could cause the exception, but to no avail; I only got more confused by concepts like client-side vs. server-side cursors.
No matter what I try, I can't get this simple free-form query to import data from SQL Server.
Sqoop version : 1.4.6
Hadoop : 2.7.3
Machine : Ubuntu 16.04
Please help me out. Thanks in advance.
If I import 2 columns, the exception says index 3 is out of range.
Then the fault lies with Sqoop, at
org.apache.sqoop.lib.JdbcWritableBridge.readInteger(JdbcWritableBridge.java:52)
as it is passing 3 as an argument to
com.microsoft.sqlserver.jdbc.SQLServerResultSet.getInt(SQLServerResultSet.java:2327)
when the result set only contains 2 columns and the valid column indexes are 1 and 2.

Apache Sqoop: import a qualified table from SQL Server

When I try to import a table from SQL Server using
sqoop import \
-m 1 \
--connect jdbc:sqlserver://Arwen:1433 \
--username=bods \
--password=*** \
--table datamart.dbo.fct_txn \
--compression-codec=snappy \
--as-avrodatafile \
--warehouse-dir=/user/tkidb
Sqoop seems to generate the wrong query syntax. Apparently it expects an unqualified table name; then the brackets would work. How do I tackle this?
16/06/25 07:44:55 INFO tool.CodeGenTool: Beginning code generation
16/06/25 07:44:57 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM [datamart.dbo.fct_txn] AS t WHERE 1=0
16/06/25 07:44:57 ERROR manager.SqlManager: Error executing statement: com.microsoft.sqlserver.jdbc.SQLServerException: Invalid object name 'datamart.dbo.fct_txn'.
Based on the query from the error log:
SELECT t.* FROM [datamart.dbo.fct_txn] AS t WHERE 1=0
The problem is the bracketing of [datamart.dbo.fct_txn]; the correct syntax would be [datamart].[dbo].[fct_txn] or datamart.dbo.fct_txn. Try changing these two lines:
--connect 'jdbc:sqlserver://Arwen:1433;database=datamart' \
--table fct_txn
If datamart is the default database for the user you are logging in as, then change only the --table part.
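With the database moved into the connection string, the generated probe query should then look something like this (a sketch, following the pattern in the error log above):
SELECT t.* FROM [fct_txn] AS t WHERE 1=0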
