Sqoop NoSuchElementException while exporting data from Hive to SQL - sql-server

I'm trying to export a Hive table from an HDInsight cluster to Azure SQL. I am able to do this when both tables have a single column, but with more columns I get the exception mentioned in the title. I have already tried different combinations of delimiters; nothing helps.
I've created the hive table as following:
create table test2
(
a string,
b string
)
row format delimited fields terminated by ',';
SQL table has the following schema:
create table test1
(
a [nvarchar](100),
b [nvarchar](100)
);
create clustered index test1_clustered_index on test1(a);
And I am using the following script to export the data:
$tableName = 'test1'
$connectionString = "jdbc:sqlserver://$sqlDatabaseServerName.database.windows.net;user=$sqlDatabaseLogin@$sqlDatabaseServerName;password=$sqlDatabasePassword;database=$databaseName"
$exportDir = "/hive/warehouse/test2"
$sqoopDef = New-AzureHDInsightSqoopJobDefinition -Command "export --connect $connectionString --table $tableName --export-dir $exportDir --fields-terminated-by ','"
$sqoopJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $sqoopDef
The job log looks like this:
15/01/07 00:02:46 INFO mapreduce.Job: Task Id : attempt_1420540952382_0059_m_000000_1, Status : FAILED
Error: java.io.IOException: Can't export data, please check failed map task logs
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.RuntimeException: Can't parse input data: 'a,b'
at test1.__loadFromFields(test1.java:204)
at test1.parse(test1.java:147)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
... 10 more
Caused by: java.util.NoSuchElementException
at java.util.ArrayList$Itr.next(ArrayList.java:834)
at test1.__loadFromFields(test1.java:199)
... 12 more
What am I doing wrong? Thanks in advance
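One thing worth double-checking (an assumption on my part; the trace only shows that the whole line 'a,b' reached the parser as a single field) is Sqoop's export-side parsing option --input-fields-terminated-by, used instead of, or in addition to, --fields-terminated-by:
# Hedged variant of the job definition from above: --input-fields-terminated-by
# is Sqoop's input-parsing option for export jobs; whether it resolves this
# particular failure is an assumption, not a verified result.
$sqoopDef = New-AzureHDInsightSqoopJobDefinition -Command "export --connect $connectionString --table $tableName --export-dir $exportDir --input-fields-terminated-by ','"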

Related

SQL Data Warehouse External Table with String fields

I am unable to find a way to create an external table in Azure SQL Data Warehouse (Synapse SQL Pool) with PolyBase where some fields contain embedded commas.
For a csv file with 4 columns as below:
myresourcename,
myresourcelocation,
"""resourceVersion"": ""windows"",""deployedBy"": ""john"",""project_name"": ""test_project""",
"{ ""ResourceType"": ""Network"", ""programName"": ""v1""}"
I tried the following CREATE EXTERNAL FILE FORMAT and CREATE EXTERNAL TABLE statements:
CREATE EXTERNAL FILE FORMAT my_format
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS(
FIELD_TERMINATOR=',',
STRING_DELIMITER='"',
First_Row = 2
)
);
CREATE EXTERNAL TABLE my_external_table
(
resourceName VARCHAR,
resourceLocation VARCHAR,
resourceTags VARCHAR,
resourceDetails VARCHAR
)
WITH (
LOCATION = 'my/location/',
DATA_SOURCE = my_source,
FILE_FORMAT = my_format
)
But querying this table gives the following error:
Failed to execute query. Error: HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: HadoopExecutionException: Too many columns in the line.
Any help will be appreciated.
Currently this is not supported in PolyBase; you need to modify the input data accordingly to get it working.
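If regenerating the source extract is an option, a minimal sketch of that workaround: write the file with a field terminator that never occurs inside the values (the pipe character and the VARCHAR lengths below are illustrative choices, not from the original post), then point the external table at a matching file format:
-- File format for a pipe-delimited re-export of the same data
CREATE EXTERNAL FILE FORMAT my_pipe_format
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',
        STRING_DELIMITER = '"',
        FIRST_ROW = 2
    )
);
-- Same columns as before; lengths are illustrative
CREATE EXTERNAL TABLE my_external_table
(
    resourceName VARCHAR(200),
    resourceLocation VARCHAR(200),
    resourceTags VARCHAR(4000),
    resourceDetails VARCHAR(4000)
)
WITH (
    LOCATION = 'my/location/',
    DATA_SOURCE = my_source,
    FILE_FORMAT = my_pipe_format
);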

Can the JDBC sink for SQL Server skip a column that is not provided by my source?

I want to extract all data from my PostgreSQL database to SQL Server using Kafka Connect and the JDBC sink. I want to get rid of some queries and check whether I can stream the data using insert.mode=insert only.
This is my source config:
name=debezium_pg_connectors
connector.class=io.debezium.connector.postgresql.PostgresConnector
tasks.max=1
plugin.name=pgoutput
database.hostname=XXX.XXX.XXX.XX
database.port=5432
database.user=XXXXXX
database.password=XXXXXX
database.dbname=XXXXX
database.server.name=XXXXX
database.history.kafka.bootstrap.servers=localhost:9092
database.history.kafka.topic=XXXXXX
table.whitelist=XXXXXXX
time.precision.mode=connect
transforms=unwrap
transforms.unwrap.type= io.debezium.transforms.ExtractNewRecordState
This is my sink config:
name=jdbc-sink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics=pj_user
connection.url=<connection>
auto.create=true
auto.evolve=true
insert.mode=insert
pk.mode=record_key
table.name.format=<table>
transforms=unwrap,route
transforms.unwrap.type=io.debezium.transforms.UnwrapFromEnvelope
transforms=route
transforms.route.type=org.apache.kafka.connect.transforms.RegexRouter
transforms.route.regex=([^.]+)\\.([^.]+)\\.([^.]+)
transforms.route.replacement = $3
fields.whitelist=...
In my SQL Server table, I have an auto-generated column called key with the uniqueidentifier data type as the primary key. However, there is a failure every time I try to push my data to the sink:
[2020-03-03 12:45:11,487] ERROR WorkerSinkTask{id=jdbc-sink-0} RetriableException from SinkTask: (org.apache.kafka.connect.runtime.WorkerSinkTask:552)
org.apache.kafka.connect.errors.RetriableException: java.sql.SQLException: java.sql.BatchUpdateException: Cannot insert the value NULL into column 'key', table '<table>'; column does not allow nulls. INSERT fails.
at io.confluent.connect.jdbc.sink.JdbcSinkTask.put(JdbcSinkTask.java:93)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:539)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:322)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:224)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:192)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:177)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:227)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.sql.SQLException: java.sql.BatchUpdateException: Cannot insert the value NULL into column 'key', table '<table>'; column does not allow nulls. INSERT fails.
... 12 more
If anyone has any ideas to help me, any help and advice is appreciated. Thanks.
Make sure that your key column has a default value:
ALTER TABLE [tableName] ADD DEFAULT NEWSEQUENTIALID() FOR [key];
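An equivalent form with a named constraint (the name DF_tableName_key is illustrative) is easier to drop or alter later; note that [key] has to be bracketed because KEY is a reserved word in T-SQL:
ALTER TABLE [tableName]
ADD CONSTRAINT DF_tableName_key DEFAULT NEWSEQUENTIALID() FOR [key];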

IMDB to SQLite: How to create a table and insert IMDB data into a SQLite database?

I'm working on a project where I want to load selected IMDB files (the files are in *.list format) that I downloaded from here into a SQLite database. Unfortunately, I'm not able to solve this issue. I'm able to create a database but can't populate the table with the IMDB data.
The documentation I've been following is here. So far, I can create a SQLite table, but it does not get populated.
import sqlite3
from sqlite3 import Error


def create_connection(db_file):
    """ create a database connection to the SQLite database
        specified by db_file
    :param db_file: database file
    :return: Connection object or None
    """
    try:
        conn = sqlite3.connect(db_file)
        return conn
    except Error as e:
        print(e)
    return None


def create_table(conn, create_table_sql):
    """ create a table from the create_table_sql statement
    :param conn: Connection object
    :param create_table_sql: a CREATE TABLE statement
    :return:
    """
    try:
        c = conn.cursor()
        c.execute(create_table_sql)
    except Error as e:
        print(e)


def main():
    database = "/Users/Erudition/Desktop/imdb_database/sqldatabase.db"

    sql_create_tile_akas = """ CREATE TABLE IF NOT EXISTS title (
                                   titleid text PRIMARY KEY,
                                   ordering integer NOT NULL,
                                   title text,
                                   region text,
                                   language text NOT NULL,
                                   types text NOT NULL,
                                   attributes text NOT NULL,
                                   isOriginalTitle integer NOT NULL
                               ); """

    conn = create_connection(database)
    if conn is not None:
        # create projects table
        create_table(conn, sql_create_tile_akas)
    else:
        print("Error! cannot create the database connection.")


if __name__ == '__main__':
    main()
In the terminal, I enter
imdbpy2sql.py -d /Users/Erudition/Desktop/imdb_database/aka-titles.list/
-u sqlite:///sqldatabase.db
The output I expect is a SQLite table with all the rows filled. Instead, I get several SQLite tables with nothing filled in.
The terminal output is:
WARNING The file will be skipped, and the contained
WARNING information will NOT be stored in the database.
WARNING Complete error: [Errno 20] Not a directory:
'/Users/Erudition/Desktop/imdb_database/aka-titles.list/complete-
cast.list.gz'
WARNING WARNING WARNING
WARNING unable to read the "/Users/Erudition/Desktop/imdb_database/aka-
titles.list/complete-crew.list.gz" file.
WARNING The file will be skipped, and the contained
WARNING information will NOT be stored in the database.
WARNING Complete error: [Errno 20] Not a directory:
'/Users/Erudition/Desktop/imdb_database/aka-titles.list/complete-
crew.list.gz'
I found the solution!
pip install imdb-sqlite
Then
imdb-sqlite
Here's the link
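Independently of that, if you only need to populate the title table created by the script in the question, here is a minimal sketch using the sqlite3 module directly (the row values are illustrative placeholders, not real IMDB data):
import sqlite3

# Insert a couple of placeholder rows into the `title` table created above.
conn = sqlite3.connect("/Users/Erudition/Desktop/imdb_database/sqldatabase.db")
rows = [
    ("tt0000001", 1, "Example title", "US", "en", "imdbDisplay", "\\N", 0),
    ("tt0000002", 1, "Another title", "FR", "fr", "imdbDisplay", "\\N", 1),
]
conn.executemany(
    "INSERT OR REPLACE INTO title "
    "(titleid, ordering, title, region, language, types, attributes, isOriginalTitle) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    rows,
)
conn.commit()
conn.close()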

How to create SQL Server table from dplyr pipeline

Due to a bug in dbplyr, copy_to and compute are currently not working for SQL Server connections.
connStr <- "driver=ODBC Driver 13 for SQL Server;server=localhost;..."
db <- DBI::dbConnect(odbc::odbc(), .connection_string=connStr)
copy_to(db, mtcars)
#Error: <SQL> 'CREATE TEMPORARY TABLE "mtcars" (
# "row_names" varchar(255),
# "mpg" FLOAT,
# ...
# nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][ODBC Driver 13 for SQL Server][SQL Server]Unknown object type 'TEMPORARY' used in a CREATE, DROP, or ALTER statement.
# use raw DBI functionality to create table
DBI::dbWriteTable(db, "mtcars", mtcars)
qry <- tbl(db, "mtcars") %>% group_by(am) %>% summarise(m=mean(mpg))
compute(qry)
#Error: <SQL> 'CREATE TEMPORARY TABLE "isrxofsskr" AS SELECT "am" AS "am", "m" #AS "m"
#FROM (SELECT "am", AVG("mpg") AS "m"
#FROM "mtcars"
#GROUP BY "am") "htrkkxabrn"'
# nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][ODBC Driver 13 for SQL Server][SQL Server]Unknown object type 'TEMPORARY' used in a CREATE, DROP, or ALTER statement.
There is an active PR on the dbplyr repo that solves this problem, but no indication of when this will be merged (or when it will reach CRAN). In the meantime, how would I create a table from the query, without reading the data into R?
It turns out that the PR on the dbplyr repo is glitched anyway, and will pull the entire table into memory before writing it back.
Fixing the problem requires creating a couple of MSSQL-specific methods for dbplyr generics. These are listed below. I've also posted them to the dbplyr repo so (assuming they work) they should hopefully be merged before too long.
#' @export
`db_compute.Microsoft SQL Server` <- function(con, table, sql, temporary=TRUE,
                                              unique_indexes=list(), indexes=list(), ...)
{
    # check that name has prefixed '##' if temporary
    if(temporary && substr(table, 1, 1) != "#")
        table <- paste0("##", table)
    if(!is.list(indexes))
        indexes <- as.list(indexes)
    if(!is.list(unique_indexes))
        unique_indexes <- as.list(unique_indexes)
    db_save_query(con, sql, table, temporary=temporary)
    db_create_indexes(con, table, unique_indexes, unique=TRUE)
    db_create_indexes(con, table, indexes, unique=FALSE)
    table
}

#' @export
`db_save_query.Microsoft SQL Server` <- function(con, sql, name, temporary=TRUE, ...)
{
    # check that name has prefixed '##' if temporary
    if(temporary && substr(name, 1, 1) != "#")
        name <- paste0("##", name)
    tt_sql <- build_sql("SELECT * INTO ", ident_q(name),
                        " FROM (", sql, ") ", ident_q(name), con=con)
    dbExecute(con, tt_sql)
    name
}
Note: may not be Bobby Tables-resistant. Testing is advised.
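A quick usage sketch, assuming the two methods above have been sourced into the session and db is the odbc connection from the question:
library(dplyr)

# copy_to() still needs the plain DBI workaround from the question
DBI::dbWriteTable(db, "mtcars", mtcars)

qry <- tbl(db, "mtcars") %>% group_by(am) %>% summarise(m = mean(mpg))

# compute() should now materialise the query into a global temp table (##...)
# via SELECT ... INTO instead of failing on CREATE TEMPORARY TABLE
tmp <- compute(qry)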

Sqoop export from HDFS dir to Sybase IQ failed

I am trying to export a file from an HDFS directory to a Sybase IQ table.
I have placed the Sybase driver in the Sqoop lib path correctly.
Sqoop command:
sqoop export \
--connect jdbc:sybase:Tds:sybasehost:port/DATABASE=OMEGA \
--username dummy \
--password dummy \
--driver com.sybase.jdbc4.jdbc.SybDriver \
--table omega_sybase_table \
--export-dir /user/cloudera/omega/output_files/ \
--input-fields-terminated-by ','
I am getting the error below and the export fails.
17/04/25 16:17:07 INFO mapreduce.Job: Task Id : attempt_1489579695153_4935_m_000002_1, Status : FAILED
Error: java.io.IOException: Can't export data, please check failed map task logs
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.io.IOException: java.sql.SQLException: SQL Anywhere Error -210: User 'another user' has the row in 'omega_sybase_table' locked
at org.apache.sqoop.mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWriter.java:233)
at org.apache.sqoop.mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWriter.java:46)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:658)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:84)
... 10 more
Caused by: java.sql.SQLException: SQL Anywhere Error -210: User 'another user' has the row in 'omega_sybase_table' locked
at com.sybase.jdbc4.jdbc.SybConnection.getAllExceptions(Unknown Source)
Could someone help me fix this issue?
This is happening because multiple mapper tasks are used by the sqoop export command.
Sybase IQ only allows one connection at a time, and the mapper tasks try to insert records into the Sybase IQ table in parallel.
The solution is to use -m 1 in the sqoop export command, as shown below.
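For example, the command from the question restricted to a single mapper:
sqoop export \
--connect jdbc:sybase:Tds:sybasehost:port/DATABASE=OMEGA \
--username dummy \
--password dummy \
--driver com.sybase.jdbc4.jdbc.SybDriver \
--table omega_sybase_table \
--export-dir /user/cloudera/omega/output_files/ \
--input-fields-terminated-by ',' \
-m 1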
