COPY FROM file to Cassandra ignoring solr_query column - solr

I can't import data into Cassandra because I am using DSE Solr now, and as far as I can see it created a solr_query virtual column in my table.
So I tried COPY table FROM 'file' WITH SKIPCOLS = "solr_query";
but I am still getting the same error:
Failed to import 10 rows: ParseError - Invalid row length 9 should be 10 - given up without retries.
So how can I import the data and ignore the solr_query column?

The COPY command accepts the columns to import as an explicit list. Try listing them, leaving out the solr_query column, and it should be OK:
COPY table (colA, colB, colC,...) FROM 'file'
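For example, for a hypothetical table ks.mytable whose own columns are id, name and value (placeholder names, not from the question), the import would look like:
COPY ks.mytable (id, name, value) FROM 'file.csv';
Every column except solr_query must appear in the list, in the same order as the fields in the CSV file.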

Related

Exporting Kusto table to SQL Server GENERATED ALWAYS column

I'm setting up a data pipeline to export from a Kusto table to a SQL Server table. The only problem is that the target table has two GENERATED ALWAYS columns. I'm looking for some help implementing the solution using Kusto.
This is the export statement:
.export async to sql ['CompletionImport']
h@"connection_string_here"
with (createifnotexists="true", primarykey="CompletionSearchId")
<| set notruncation;
apiV2CompletionSearchFinal
| where hash(SourceRecordId, 1) == 0
Which gives the error:
Cannot insert an explicit value into a GENERATED ALWAYS column in table 'server.dbo.CompletionImport'.
Use INSERT with a column list to exclude the GENERATED ALWAYS column, or insert a DEFAULT into GENERATED ALWAYS column.
So I'm a little unsure how to implement this solution in Kusto. Would I just add a project pipe excluding the GENERATED ALWAYS columns? Or, ideally, how could I insert a DEFAULT value into the GENERATED ALWAYS SQL Server columns using a Kusto query?
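For context, this is the pattern the SQL Server error message is suggesting on the SQL side; the column names here are placeholders, with the GENERATED ALWAYS columns simply left out of the list:
INSERT INTO dbo.CompletionImport (CompletionSearchId, SourceRecordId)
VALUES (1, 42);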
Edit: I am trying to use materialize to create a temporary table in the cache and export this cached table. However, I can't find any documentation on this, and the operation is failing:
let dboV2CompletionSearch = apiV2CompletionSearchFinal
| project every, variable, besides, generated, always, ones;
let cachedCS = materialize(dboV2CompletionSearch);
.export async to sql ['CompletionImport']
h@"connect_string"
with (createifnotexists="true", primarykey="CompletionSearchId")
<| set notruncation;
cachedCS
| where hash(SourceRecordId, 1) == 0
with the following error message:
Semantic error: 'set notruncation;
cachedCS
| where hash(SourceRecordId, 1) == 0'
has the following semantic error: SEM0100:
'where' operator: Failed to resolve table or column expression named 'cachedCS'.
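A plausible fix, not from the original thread: let bindings declared before a control command are not part of the query text the command exports, so cachedCS is out of scope. Moving the bindings after <| keeps them visible to the query; the project-away column names below are placeholders for the GENERATED ALWAYS columns:
.export async to sql ['CompletionImport']
h@"connection_string_here"
with (createifnotexists="true", primarykey="CompletionSearchId")
<| set notruncation;
let cachedCS = materialize(apiV2CompletionSearchFinal
    | project-away GeneratedColumn1, GeneratedColumn2);
cachedCS
| where hash(SourceRecordId, 1) == 0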

Index is out of range: JDBC SqlServer exception

I am using Sqoop to import data from SQL Server to local HDFS. I am using a simple free-form query to pull some 10 rows from the table. Below is the sqoop command that I execute from the terminal:
sqoop import --connect 'jdbc:sqlserver://xx.xx.xx.xx;username=xx;password=xxxxx;database=DBName' --query "SELECT top 10 OrderID from DJShopcart_OrderItems where \$CONDITIONS" --split-by "OrderID" --target-dir /work/gearpurchase
When I execute this from my local machine, I get the following exception:
Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: The index 2 is out of range.
at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDriverError(SQLServerException.java:191)
at com.microsoft.sqlserver.jdbc.SQLServerResultSet.verifyValidColumnIndex(SQLServerResultSet.java:543)
at com.microsoft.sqlserver.jdbc.SQLServerResultSet.getterGetColumn(SQLServerResultSet.java:2066)
at com.microsoft.sqlserver.jdbc.SQLServerResultSet.getValue(SQLServerResultSet.java:2099)
at com.microsoft.sqlserver.jdbc.SQLServerResultSet.getValue(SQLServerResultSet.java:2084)
at com.microsoft.sqlserver.jdbc.SQLServerResultSet.getInt(SQLServerResultSet.java:2327)
at org.apache.sqoop.lib.JdbcWritableBridge.readInteger(JdbcWritableBridge.java:52)
at com.cloudera.sqoop.lib.JdbcWritableBridge.readInteger(JdbcWritableBridge.java:53)
at QueryResult.readFields(QueryResult.java:105)
at org.apache.sqoop.mapreduce.db.DBRecordReader.nextKeyValue(DBRecordReader.java:244)
If I import 2 columns, the exception says index 3 is out of range.
I also checked the SQLServerResultSet class documentation to understand what could be causing the exception, but to no avail; I only got more confused by concepts like client-side vs. server-side cursors.
No matter what I try, I can't get this simple free-form query to import data from SQL Server.
Sqoop version : 1.4.6
Hadoop : 2.7.3
Machine : Ubuntu 16.04
Please help me out. Thanks in advance.
If I import 2 columns, the exception says index 3 is out of range.
Then the fault lies with Sqoop:
at org.apache.sqoop.lib.JdbcWritableBridge.readInteger(JdbcWritableBridge.java:52)
is passing 3 as the column index to
at com.microsoft.sqlserver.jdbc.SQLServerResultSet.getInt(SQLServerResultSet.java:2327)
when the result contains only 2 columns, and the valid column indexes are 1 and 2.
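For reference, a minimal sketch of the 1-based indexing contract involved; the connection details are placeholders, and this illustrates the JDBC rule rather than fixing the Sqoop bug:
import java.sql.*;

public class ColumnIndexDemo {
    public static void main(String[] args) throws SQLException {
        // Placeholder connection string; JDBC result set columns are numbered from 1.
        try (Connection con = DriverManager.getConnection(
                "jdbc:sqlserver://xx.xx.xx.xx;database=DBName", "user", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT top 10 OrderID FROM DJShopcart_OrderItems")) {
            while (rs.next()) {
                int orderId = rs.getInt(1); // valid: the only column has index 1
                // rs.getInt(2) would throw "The index 2 is out of range."
            }
        }
    }
}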

Import JSON into ClickHouse

I create a table with this statement:
CREATE TABLE event(
date Date,
src UInt8,
channel UInt8,
deviceTypeId UInt8,
projectId UInt64,
shows UInt32,
clicks UInt32,
spent Float64
) ENGINE = MergeTree(date, (date, src, channel, projectId), 8192);
The raw data looks like:
{"date":"2016-03-07T10:00:00+0300","src":2,"channel":18,"deviceTypeId":101,"projectId":2363610,"shows":1232,"clicks":7,"spent":34.72,"location":"Unknown", ...}
...
The data files are loaded with the following command:
cat *.data|sed 's/T[0-9][0-9]:[0-9][0-9]:[0-9][0-9]+0300//'| clickhouse-client --query="INSERT INTO event FORMAT JSONEachRow"
clickhouse-client throws an exception:
Code: 117. DB::Exception: Unknown field found while parsing JSONEachRow format: location: (at row 1)
Is it possible to skip fields from the JSON object that are not present in the table definition?
The latest ClickHouse release (v1.1.54023) supports the input_format_skip_unknown_fields user option, which enables skipping unknown fields for the JSONEachRow and TSKV formats.
Try
clickhouse-client -n --query="SET input_format_skip_unknown_fields=1; INSERT INTO event FORMAT JSONEachRow;"
See the documentation for more details.
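The setting can also be passed as a client option rather than a SET statement; a sketch, assuming a client version new enough to know the setting:
cat *.data | sed 's/T[0-9][0-9]:[0-9][0-9]:[0-9][0-9]+0300//' | clickhouse-client --input_format_skip_unknown_fields=1 --query="INSERT INTO event FORMAT JSONEachRow"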
Before that setting existed, it was not possible to skip unknown fields. You may create a temporary table with the additional field, INSERT data into it, and then do an INSERT SELECT into the final table. The temporary table can use the Log engine; INSERTs into that "staging" table will work faster than into the final MergeTree table.
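A minimal sketch of that staging approach, assuming location is the only extra field in the JSON (the question's sample hints at more, which would need columns as well):
CREATE TABLE event_staging (
    date Date,
    src UInt8,
    channel UInt8,
    deviceTypeId UInt8,
    projectId UInt64,
    shows UInt32,
    clicks UInt32,
    spent Float64,
    location String
) ENGINE = Log;

-- load the raw JSON into event_staging first, then copy only the known columns
INSERT INTO event
SELECT date, src, channel, deviceTypeId, projectId, shows, clicks, spent
FROM event_staging;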
It is relatively easy to add the ability to skip unknown fields to the code (something like a 'format_skip_unknown_fields' setting).

Getting Source only and mismatch data with tablediff utility

I'm using the tablediff utility to transfer data from several source databases to a destination database. The result lists all the differences between the source and destination databases, something like this:
Dest. Only N'1027' N'799' N'91443' N'1'
Mismatch N'103A' N'799' N'13010' N'1' DATE_CURRENT DATE_OPERATION MATRICULE_UTILISATEUR QTE QTE_FINAL QTE_INIT QTE_OPERATION REFERENCE_DOCUMENT TYPE_DOCUMENT
Src. Only N'103A' N'310' N'30129' N'1'
The generated SQL file therefore contains statements that delete the Dest. Only rows, update the Mismatch rows, and insert the Src. Only rows.
My question is: is there any way, using tablediff, to get only the Mismatch and Src. Only rows?
At the end of your tablediff command, add the following:
-dt -et DiffResults
This will drop any existing table named DiffResults and create a new one in the destination server and database.
Then you can query the DiffResults table to get the desired rows.
In my test I ran the following:
SELECT * FROM DiffResults
WHERE MSdifftool_ErrorDescription in ('Mismatch','Src. Only')
or
SELECT * FROM DiffResults
WHERE MSdifftool_ErrorCode in (0,2) -- 0 is for 'Mismatch'; 1 is for 'Dest. Only' and 2 is for 'Src. Only'
Some more details can be found here - https://technet.microsoft.com/en-us/library/ms162843.aspx
If you want to use the results from the command line, you can filter the output with findstr:
tablediff <your parameters> | findstr /i "^Mismatch ^Src"
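Put together, a full invocation might look like this sketch; the server, database, and table names are placeholders:
tablediff -sourceserver SRV1 -sourcedatabase SrcDb -sourcetable Stock ^
  -destinationserver SRV2 -destinationdatabase DestDb -destinationtable Stock ^
  -dt -et DiffResults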

Hive Serde errors with Array<Struct<>> org.json.JSONArray cannot be cast to [Ljava.lang.Object;

I have created a table:
add jar /../xlibs/hive-json-serde-0.2.jar;
CREATE EXTERNAL TABLE SerdeTest
(Unique_ID STRING
,MemberID STRING
,Data ARRAY<STRUCT<SerialNo:INT, VariableName:STRING, VariableValue:STRING>>
)
PARTITIONED BY (Pyear INT, Pmonth INT)
ROW FORMAT SERDE "org.apache.hadoop.hive.contrib.serde2.JsonSerde";
ALTER TABLE SerdeTest ADD
PARTITION (Pyear = 2014, Pmonth =03) LOCATION '../Test2';
The data in the file:
{"Unique_ID":"ABC6800650654751","MemberID":"KHH966375835","Data":[{"SerialNo":1,"VariableName":"Var1","VariableValue":"A_49"},{"SerialNo":2,"VariableName":"Var2","VariableValue":"B_89"},{"SerialNo":3,"VariableName":"Var3","VariableValue":"A_99"}]}
The SELECT query that I am using:
select Data[0].SerialNo from SerdeTest where Unique_ID = 'ABC6800650654751';
However, when I run this query I get the following error:
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row [Error getting row data with exception java.lang.ClassCastException: org.json.JSONArray cannot be cast to [Ljava.lang.Object;
at org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector.getList(StandardListObjectInspector.java:98)
at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:330)
at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:386)
at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:237)
at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:223)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:539)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:157)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
]
Can anyone please suggest what I am doing wrong?
A few suggestions:
Make sure that all the packages of Hive and hive-json-serde-0.2.jar have execute permission for the hadoop user.
Hive creates a file called derby.log and a directory called metastore_db in the Hive directory. The user invoking the Hive query must be allowed to create files and directories there.
The location for the data should end with a /, e.g. LOCATION '../Test2/';
In short, the working JAR is json-serde-1.3-jar-with-dependencies.jar, which can be found here. It works with STRUCT and can even ignore some malformed JSON. When creating the table, include the following:
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ("ignore.malformed.json" = "true")
LOCATION ...
If needed, it is possible to recompile it from here or here. I tried the first repository, and it compiles fine for me after adding the necessary libs. The repository has also been updated recently.
Check for more details here.
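Applied to the question's schema, the complete DDL might look like this sketch; the struct fields are inferred from the sample row, and the jar path is a placeholder:
add jar /path/to/json-serde-1.3-jar-with-dependencies.jar;
CREATE EXTERNAL TABLE SerdeTest
(Unique_ID STRING
,MemberID STRING
,Data ARRAY<STRUCT<SerialNo:INT, VariableName:STRING, VariableValue:STRING>>
)
PARTITIONED BY (Pyear INT, Pmonth INT)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ("ignore.malformed.json" = "true");
ALTER TABLE SerdeTest ADD
PARTITION (Pyear = 2014, Pmonth = 03) LOCATION '../Test2/';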
