I have a table with an nvarchar column in SQL Server 2016 that I want to store in Hive. The nvarchar column can contain non-ASCII characters. The data is extracted from SQL Server into a file, with the nvarchar column converted to a base64-encoded string. I tried the following to convert the base64 back to a readable string:
select decode(unbase64(BASE64STR),'UTF-8');
It failed with the following error:
org.apache.hive.service.cli.HiveSQLException: Error while compiling
statement: FAILED: SemanticException [Error 10014]: Line 1:7 Wrong
arguments ''UTF-8'': org.apache.hadoop.hive.ql.metadata.HiveException:
java.nio.charset.MalformedInputException: Input length = 1
The following code is able to properly decode the base64-encoded string:
select decode(unbase64(BASE64STR),'UTF-16LE');
Is it safe to use UTF-16LE to decode the string from an nvarchar-type column? Will this work with any data stored in the column? Is there another way to achieve this ETL functionality from SQL Server to Hive for nvarchar-type data?
Take a look at the functions for working with base64 that we have in SDU Tools (free). They go to/from varbinary but might work ok with strings cast to/from that. At the very least, the code should give you a good start. They are at: http://sdutools.sqldownunder.com
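For what it's worth, the likely reason 'UTF-16LE' works is that nvarchar data is stored as UTF-16LE, so if your extract base64-encodes the raw bytes of the column (e.g. via a cast to varbinary(max)), those bytes are UTF-16LE and will round-trip for any value the column can hold. A minimal sketch of the export side, using made-up table and column names (the base64 trick via the XML value() method is a common workaround on SQL Server 2016, which has no built-in base64 function):

SELECT t.Id,
       CAST(N'' AS xml).value(
           'xs:base64Binary(xs:hexBinary(sql:column("bin")))',
           'varchar(max)') AS Base64Str              -- base64 of the UTF-16LE bytes
FROM (
    SELECT Id,
           CAST(SomeNvarcharCol AS varbinary(max)) AS bin   -- nvarchar -> raw UTF-16LE bytes
    FROM dbo.SomeTable
) AS t;

On the Hive side, decode(unbase64(BASE64STR), 'UTF-16LE') then recovers the original text. That would also explain the MalformedInputException from the 'UTF-8' decode: UTF-16LE bytes are simply not valid UTF-8.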
Related
I am querying a SQL Server database that has a column of type datetimeoffset. I am using 'pyodbc' and SQL Server 2017. The datetimeoffset values are being returned as strings, as follows:
"b'\xe3\x07\n\x00\x0e\x00\x12\x00\x03\x00\x05\x00#\xe1\x9d\x18\x00\x00\x00\x00'"
Pandas doesn't recognize it as a timestamp and I have tried using Python 3 'struct' module to unpack it like this:
import struct
raw = 'b\xe3\x07\n\x00\x0e\x00\x12\x00\x03\x00\x05\x00#\xe1\x9d\x18\x00\x00\x00\x00'
unpacked, = struct.unpack('<Q', raw)
That errors out because 'raw' is a string, not bytes. If I pass the string directly as an argument to 'unpack', it errors out because of the wrong number of bytes.
How do I convert the column values to pandas datetime?
Additional Note:
This site indicates that SQL Server uses a particular type that pyodbc doesn't handle natively, as suggested by mostert. That said, they seem to have no problem retrieving a human-readable value.
[SOLVED] So the solution at this site does work. TIL: when adding the converter you need to get the type as an integer, in this case '-155'. This site has the integer codes for some other types.
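For anyone landing here, a minimal sketch of that output-converter approach (the connection string is a placeholder; -155 is the ODBC code for SQL Server's datetimeoffset type, and the 20-byte layout is six little-endian int16 date/time parts, a uint32 fraction in nanoseconds, and two int16 timezone offsets):

import struct
from datetime import datetime, timedelta, timezone

import pyodbc

def handle_datetimeoffset(dto_value):
    # unpack year, month, day, hour, minute, second, fraction (ns), tz hours, tz minutes
    y, mo, d, h, mi, s, frac_ns, tz_h, tz_m = struct.unpack('<6hI2h', dto_value)
    return datetime(y, mo, d, h, mi, s, frac_ns // 1000,          # ns -> microseconds
                    timezone(timedelta(hours=tz_h, minutes=tz_m)))

conn = pyodbc.connect('DSN=my_dsn')                     # placeholder connection string
conn.add_output_converter(-155, handle_datetimeoffset)  # -155 = SQL_SS_TIMESTAMPOFFSET

With the converter registered, the column comes back as timezone-aware datetime objects instead of raw bytes, which pandas can handle directly (or via pandas.to_datetime).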
I'm trying to insert some XML into a SQL Server database table which uses column type XML.
This works fine most of the time, but one user submitted some XML with the character with hex value 3, and SQL Server gave the error "hexadecimal value 0x03, is an invalid character."
Now I want to check, and remove, any invalid XML characters before doing the insert, and there are various articles suggesting how invalid XML characters can be replaced using regex or something similar.
However, the problem for me is that the user submitted the XML document with the invalid character escaped, i.e. "&#x3;", and none of the methods I've found will detect this. This is also why the error was not detected earlier: it's only when inserting it into the SQL database that the problem occurs.
Has anyone written a function that will check for all escaped invalid XML characters? I suppose the character above could also have been written as "&#x03;" or "&#3;", or lots of other ways, so it's quite hard to catch them all.
Thanks in advance for any help you can offer.
You could try importing the XML into a temporary varchar(max) variable or table column and using REPLACE to strip out the offending characters, then inserting the cleansed string into the destination, CASTing it to XML.
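For example, something along these lines (just a sketch with made-up names, and it only catches the escape forms that are listed explicitly):

DECLARE @raw varchar(max) = '<doc>bad&#x3;value</doc>';   -- pretend this is the submitted XML text

-- strip the known escaped forms of the invalid 0x03 character
SET @raw = REPLACE(@raw, '&#x3;', '');
SET @raw = REPLACE(@raw, '&#x03;', '');
SET @raw = REPLACE(@raw, '&#3;', '');

SELECT CAST(@raw AS xml) AS CleanedXml;   -- insert this into the destination XML column

To cover the general case you would need to generate the REPLACE list (or a pattern) for every disallowed code point below 0x20 except tab, CR and LF, in both decimal and hex forms with and without leading zeros, which is exactly the "hard to catch them all" problem you describe.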
I have XML docs stored in a TEXT column (collation_name French_CI_AS, character_set_name iso_1).
I want to move them to a new table, in an XML column with the following SQL...
INSERT INTO Signature(JustifId, SignedJustif)
SELECT JustifID, CONVERT(XML, Justif.SignedJustif,2)
FROM Justif
When I do this, I get character-encoding errors that point to the high-ASCII character in this fragment: "presentación, OU=CERES, O=FNMT-RCM, C=ES" - a Spanish accented o in an X.509 certificate.
This ó started life as UTF-8, became UTF-16 as a .NET string, then became iso_1 when it was inserted into the TEXT column. I can copy and paste it into a web page with no problem. How, then, do I move it from a TEXT column to an XML column in the same database (and why is this so difficult)?
The CONVERT idea came from this post. This MS page covers creating XML from varchar and nvarchar.
This is tricky... A conversion at byte level might lead to unexpected results...
Try this
INSERT INTO Signature(JustifId, SignedJustif)
SELECT JustifID, CONVERT(XML, CONVERT(VARCHAR(MAX),Justif.SignedJustif))
FROM Justif
If you still get issues, try specifying an explicit collation together with the conversion, and/or try converting to NVARCHAR(MAX).
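For example, the NVARCHAR(MAX) variant might look like this (a sketch only, reusing your table and column names; untested against your data):

INSERT INTO Signature (JustifId, SignedJustif)
SELECT JustifID,
       CONVERT(XML, CONVERT(NVARCHAR(MAX), Justif.SignedJustif))
FROM Justif;

Going through NVARCHAR(MAX) lets SQL Server map the iso_1 bytes of the TEXT value to Unicode via the column's collation before the XML conversion, which is usually what you want for characters like the accented ó.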
If this doesn't help, please edit your question and post a (reduced) example. Best would be a test scenario with a minimal XML where one can reproduce the error.
So I decided, for the fun of it, to read a text file and store its contents in an NVARCHAR using T-SQL and Microsoft SQL Server Management Studio 2008 R2. I found an example for doing this at https://www.simple-talk.com/sql/t-sql-programming/the-tsql-of-text-files/
So I tried this with my ABC.txt file and its contents are:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
When I first tried to store the contents of this file into @myString, I used this code:
declare @myString nvarchar(max);
Select @myString = BulkColumn
from OPENROWSET(Bulk 'C:\Users\<myComputer'sNameHere>\Documents\How2\FilesForTestingStuff\ABC.txt', SINGLE_BLOB) as x
print @myString;
I got this as my output when I printed the string:
䉁䑃䙅䡇䩉䱋乍偏剑呓噕塗婙扡摣晥桧橩汫湭灯牱瑳癵硷穹
I changed nvarchar to varchar and I got the correct contents of the file.
Anyone know why this happened? I didn't think there was a conversion difference, other than that nvarchar needs more storage than varchar and is able to hold Unicode characters.
Also, how do you normally read from a file and insert the contents into an nvarchar?
I suppose it depends on the encoding of the input file.
You used SINGLE_BLOB, and according to MSDN it causes the data to be returned as varbinary(MAX). Your file was probably saved using a non-Unicode encoding, so when the data was imported into the nvarchar variable, SQL Server interpreted it incorrectly. Changing the type allowed the characters to be read correctly. Please try encoding the file as UTF-16 and then importing the data into an nvarchar(MAX) variable.
Update
I tried to recreate the issue you described. I saved a text file with ANSI encoding, ran the import script, and got output similar to the one you posted in your question. Then I converted the file to UCS-2 Little Endian encoding, and after running the script I got the correct output.
To sum up, if you want to import with the SINGLE_BLOB option, just convert the data file to UCS-2 Little Endian encoding and it should work correctly with the nvarchar SQL type.
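If converting the file is not an option, here is a sketch of the alternative (the path is just a placeholder): keep the file as ANSI/ASCII and read it with SINGLE_CLOB, which returns varchar(MAX) and converts cleanly to nvarchar; SINGLE_NCLOB is the option to use once the file really is UTF-16/UCS-2.

declare @myString nvarchar(max);

-- ANSI/ASCII file: read as varchar(max), then widen to nvarchar
select @myString = CONVERT(nvarchar(max), BulkColumn)
from OPENROWSET(Bulk 'C:\Temp\ABC.txt', SINGLE_CLOB) as x;

print @myString;

-- UTF-16 (UCS-2) file: SINGLE_NCLOB returns nvarchar(max) directly
-- select @myString = BulkColumn
-- from OPENROWSET(Bulk 'C:\Temp\ABC.txt', SINGLE_NCLOB) as x;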
Reference links:
OPENROWSET
nchar and varchar
I have a data file that contains a datetime field in (yyyy mm dd) format.
I have created a bcp format file to import the data, but when I run the statement, I get an error:
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 1, column 16 (ReleaseDate).
How can I tell the bcp utility to treat the field as being in (yyyy mm dd) format, or to convert it to a format that SQL Server expects?
I have two comments on the problem.
First, make sure you are using an ASCII code page, not Unicode, which is two bytes per character.
Second, if BCP is having issues, you can play around with the format file.
If that does not work, change from ETL to Extract, Load, Transform (ELT).
Bulk load from the file into a varchar() column in a staging table, then translate it to the right data type with a stored procedure - see the sketch below.
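A rough sketch of that ELT approach (all object names and paths are made up, and the conversion assumes the text really is 'yyyy mm dd'):

-- 1. Stage the raw text exactly as it appears in the file
CREATE TABLE dbo.ReleaseStage
(
    ReleaseDate varchar(10)        -- raw 'yyyy mm dd' string
    -- ... the other columns from the file, also as varchar ...
);

BULK INSERT dbo.ReleaseStage
FROM 'C:\data\releases.dat'
WITH (FORMATFILE = 'C:\data\releases.fmt');

-- 2. Translate to the real data type on the way into the destination table
INSERT INTO dbo.Release (ReleaseDate)
SELECT TRY_CONVERT(date, REPLACE(ReleaseDate, ' ', '-'), 23)   -- style 23 = yyyy-mm-dd
FROM dbo.ReleaseStage;

TRY_CONVERT returns NULL for anything that still will not parse, so bad rows can be reported instead of failing the whole load.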