Converting binary data containing null (0x00) characters to ASCII in SQL Server - sql-server

On SQL Server (2016+), I have data stored in a varbinary column, saved by some Java application, which contains a mixture of binary data and ASCII text. I want to search the column using a like operator or otherwise to look for certain ASCII strings, and then view the returned values as ASCII (so that I can read the surrounding text).
The data contains non-characters such as "00" (0x00), and these seem to stop SQL Server from converting the string as might otherwise be possible according to the answers at Hex to ASCII string conversion on the fly . In the example below, it can be seen that the byte "00" stops the parsing of the ASCII.
select convert(varchar(max),0x48454C4C004F205000455445,0) as v1 -- HELL
select convert(varchar(max),0x48454C4C4F205000455445,0) as v2 -- HELLO P
select convert(varchar(max),0x48454C4C4F2050455445,0) as v3 -- HELLO PETE
How can I have
select convert(varchar(max), 0x48454C4C004F205000455445, 0)
...return something like this?:
HELL?O P?ETE
(Or, less ideally, have an expression similar to
convert(varchar(max), 0x48454C4C004F205000455445, 0) like '%HE%ETE%'
...return the row?)
It works on the website https://www.rapidtables.com/convert/number/hex-to-ascii.html with 48454C4C004F205000455445 as input.
I'm not overly concerned about performance, but I want to stay within SQL Server, and ideally within the scope of T-SQL which can be copied and pasted easily.
I've tried using replace on "00", but this could causes problems with characters ending with 0, as in "5000" in the examples above. There may be bytes other than 0x00 which cause string conversion to stop as well.

To return the row (the more limited version of this question), a simple like operator on the value appears to work when run directly on the binary value, despite the intervening 0x00 values:
0x48454C4C004F205000455445 like 'HE%ETE%'
In other words, like can cope where convert can't.
To view the actual value, the best I've managed so far is this:
convert(varchar(max),convert(varbinary(max),
REPLACE(
convert(varchar(max), 0x48454C4C004F205000455445, 1)
,'00',''
)
,1),0)
This gives HELLO PETE, and works well enough on the actual data, getting to its end.
(It depends on the heuristic of not caring about converting e.g. 0x50 0x03 to 0x53 and similar, but I can live with that, as 0x0z, where z is 1 to f, represents control characters, which don't occur around the text I'm interested in).
(thanks to Panagiotis Kanavos for prodding me in a useful direction!)

Related

Encoding (?) issues fetching binary data (image type column) from SQL Server via pyodbc/Python3 [duplicate]

I'm executing this query
SELECT CMDB_ID FROM DB1.[dbo].[CDMID]
when I do this on SSMS 18 I get this:
I'm aware these are HEX values, although I'm not an expert on the topic.
I need to excute this exact query on python so I can process that information through a script, this script need as input the HEX values without any manipulation (as you see in the SSMS output).
So, through pyodbc library with a regular connection:
SQLserver_Connection("Driver={SQL Server Native Client 11.0};"
"Server=INSTANCE;"
"Database=DB1;"
"UID=USER;"
"PWD=PASS;")
I get this:
0 b'#\x12\x90\xb2\xbb\x92\xbbe\xa3\xf9:\xe2\x97#...
1 b'#"\xaf\x13\x18\xc9}\xc6\xb0\xd4\x87\xbf\x9e\...
2 b'#G\xc5rLh5\x1c\xb8h\xe0\xf0\xe4t\x08\xbb'
3 b'#\x9f\xe65\xf8tR\xda\x85S\xdcu\xd3\xf6*\xa2'
4 b'#\xa4\xcb^T\x06\xb2\xd0\x91S\x9e\xc0\xa7\xe543'
... ...
122 b'O\xa6\xe1\xd8\tA\xe9E\xa0\xf7\x96\x7f!"\xa3\...
123 b'O\xa9j,\x02\x89pF\xb9\xb4:G]y\xc4\xb6'
124 b'O\xab\xb6gy\xa2\x17\x1b\xadd\xc3\r\xa6\xee50'
125 b'O\xd7ogpWj\xee\xb0\xd8!y\xec\x08\xc7\xfa'
126 b"O\xf0u\x14\xcd\x8cT\x06\x9bm\xea\xddY\x08'\xef"
I have three questions:
How can this data be intepreted, why am I getting this?
Is there a way to manipulate this data back at the original HEX value? and if not...
What can I do to receive the original HEX value?
I've looking for a solution but haven't found anything yet, as you can see I'm not an expert on this kind of topics so if you are not able to provide a solution I will, also, really appreciate documents with some background knowledge that I need to get so I can provide a solution by myself.
I think your issue is simply due to the fact that SSMS and Python produce different hexadecimal representations of binary data. Your column is apparently a binary or varbinary column, and when you query it in SSMS you see its fairly standard hex representation of the binary values, e.g., 0x01476F726400.
When you retrieve the value using pyodbc you get a <class 'bytes'> object which is represented as b'hex_representation' with one twist: Instead of simply displaying b'\x01\x47\x6F\x72\x64\x00', Python will render any byte that corresponds to a printable ASCII character as that character, so we get b'\x01Gord\x00' instead.
That minor annoyance (IMO) aside, the good news it that you already have the correct bytes in a <class 'bytes'> object, ready to pass along to any Python function that expects to receive binary data.

Encoding error reading Greek characters string form SQL database

I have a search form (with method GET) with only one text field named “search_field”. When a user submits the form, the typed by the user characters are posted to the URL. For example if the user type "blablabla" the generated URL will be something like that:
results.asp?search_field=blablabla
In my MSSQL 2012 database I have a table named “Products” with a column named “kodikos” in it.
I want to display all the records from the column “kodikos” containing the typed characters. My SQL select statement if the following:
"SELECT * FROM dbo.Products WHERE dbo.Products.kodikos LIKE '%' + ? + '%' "
(the question mark is the “search_field” that contains the typed by the user characters.
All the above works perfect and I am getting the correct results. The problem that I am facing is with the Greek characters. For example when the user type “fff” my codes works perfect and finds all the records containing the characters “fff”. Also works perfect with numbers too. But if the user type in Greek characters “φφφ” I am not getting any results. And there are a lot of records with “φφφ”. The problem is that the Greek characters are not recognized at all.
For your information:
In my local PC with the same SQL version the Greek characters are recognized correctly with my code, because my regional settings are set in Greek. But the same code in the hosting server in US does not recognize them.
All of my pages have UTF-8 encoding.
Can someone have any idea to solve this issue???
SQL Server knows two encodings natively:
2-byte-unicode (in most cases NVARCHAR)
extended ASCII in connection with a collation (in most cases VARCHAR)
I assume, that the language you are calling this from is using 2-byte-unicode for normal strings. This is pretty usual today...
I assume, that your column Products.kodikos is of type NVARCHAR (2-byte-unicode). In this case it should help to force your search string to be 2-byte-unicode too. Try
LIKE N'%' + CAST(? AS NVARCHAR(MAX)) + N'%'
If your column is not 2-byte encoded it might help to use COLLATE to force your search string to know your special characters.
If you pass this string into a SQL-Server routine as-is, you should make sure, that the accepting parameter is 2-byte-unicode too.
You have to make sure your search string is two byte encoded using the N'' notation...
For instance, the following query uses a string that is two byte encoded:
SELECT * FROM dbo.Products WHERE dbo.Products.kodikos LIKE N'%φφφ%'
But this query uses a string that is not two byte encoded (you won't get any results):
SELECT * FROM dbo.Products WHERE dbo.Products.kodikos LIKE '%φφφ%'

How is Unicode (UTF-16) data that is out of collation stored in varchar column?

This is purely theoretical question to wrap my head around
Let's say I have Unicode cyclone (🌀 1F300) symbol. If I try to store it in varchar column that has default Latin1_General_CI_AS collation, cyclone symbol cannot not fit into one byte that is used per symbol in varchar...
The ways I can see this done:
Like javascript does for symbols out of Basic plane(BMP) where it stores them as 2 symbols (surrogate pairs), and then additional processing is needed to put them back together...
Just truncate the symbol, store first byte and drop the second.... (data is toast - you should have read the manual....)
Data is destroyed and nothing of use is saved... (data is toast - you should have read the manual....)
Some other option that is outside of my mental capacity.....
I have done some research after inserting couple of different unicode symbols
INSERT INTO [Table] (Field1)
VALUES ('👽')
INSERT INTO [Table] (Field1)
VALUES ('🌀')
and then reading them as bytes SELECT
cast (field1 as varbinary(10)) in both cases I got 0x3F3F.
3F in ascii is ? (question mark) e.g two question marks (??) that I also see when doing normal select * does that mean that data is toast and not even 1st bite is being stored?
How is Unicode data that is out of collation stored in varchar column?
The data is toast and is exactly what you see, 2 x 0x3F bytes. This happens during the type conversion prior to the insert and is effectively the same as cast('👽' as varbinary(2)) which is also 0xF3F3 (as opposed to casting N'👽').
When Unicode data must be inserted into non-Unicode columns, the columns are internally converted from Unicode by using the WideCharToMultiByte API and the code page associated with the collation. If a character cannot be represented on the given code page, the character is replaced by a question mark (?) Ref.
Yes the data has gone.
Varchar requires less space, compared to NVarchar. But that reduction comes at a cost. There is no space for a Varchar to store Unicode characters (at 1 byte per character the internal lookup just isn't big enough).
From Microsoft's Developer Network:
...consider using the Unicode nchar or nvarchar data types to minimize character conversion issues.
As you've spotted, unsupported characters are repalced with question marks.

Unable to return query Thai data

I have a table with columns that contain both thai and english text data. NVARCHAR(255).
In SSMS I can query the table and return all the rows easy enough. But if I then query specifically for one of the Thai results it returns no rows.
SELECT TOP 1000 [Province]
,[District]
,[SubDistrict]
,[Branch ]
FROM [THDocuworldRego].[dbo].[allDistricsBranches]
Returns
Province District SubDistrict Branch
อุตรดิตถ์ ลับแล ศรีพนมมาศ Northern
Bangkok Khlong Toei Khlong Tan SSS1
But this query:
SELECT [Province]
,[District]
,[SubDistrict]
,[Branch ]
FROM [THDocuworldRego].[dbo].[allDistricsBranches]
where [Province] LIKE 'อุตรดิตถ์'
Returns no rows.
What do I need o do to get the expected results.
The collation set is Latin1_General_CI_AS.
The data is displayed and inserted with no errors just can't search.
Two problems:
The string being passed into the LIKE clause is VARCHAR due to not being prefixed with a capital "N". For example:
SELECT 'อุตรดิตถ์' AS [VARCHAR], N'อุตรดิตถ์' AS [NVARCHAR]
-- ????????? อุตรดิตถ
What is happening here is that when SQL Server is parsing the query batch, it needs to determine the exact type and value of all literals / constants. So it figures out that 12 is an INT and 12.0 is a NUMERIC, etc. It knows that N'ดิ' is NVARCHAR, which is an all-inclusive character set, so it takes the value as is. BUT, as noted before, 'ดิ' is VARCHAR, which is an 8-bit encoding, which means that the character set is controlled by a Code Page. For string literals and variables / parameters, the Code Page used for VARCHAR data is the Database's default Collation. If there are characters in the string that are not available on the Code Page used by the Database's default Collation, they are either converted to a "best fit" mapping, if such a mapping exists, else they become the default replacement character: ?.
Technically speaking, since the Database's default Collation controls string literals (and variables), and since there is a Code Page for "Thai" (available in Windows Collations), then it would be possible to have a VARCHAR string containing Thai characters (meaning: 'ดิ', without the "N" prefix, would work). But that would require changing the Database's default Collation, and that is A LOT more work than simply prefixing the string literal with "N".
For an in-depth look at this behavior, please see my two-part series:
Which Collation is Used to Convert NVARCHAR to VARCHAR in a WHERE Condition? (Part A of 2: “Duck”)
Which Collation is Used to Convert NVARCHAR to VARCHAR in a WHERE Condition? (Part B of 2: “Rabbit”)
You need to add the wildcard characters to both ends:
N'%อุตรดิตถ์%'
The end result will look like:
WHERE [Province] LIKE N'%อุตรดิตถ์%'
EDIT:
I just edited the question to format the "results" to be more readable. It now appears that the following might also work (since no wildcards are being used in the LIKE predicate in the question):
WHERE [Province] = N'อุตรดิตถ์'
EDIT 2:
A string (i.e. something inside of single-quotes) is VARCHAR if there is no "N" prefixed to the string literal. It doesn't matter what the destination datatype is (e.g. an NVARCHAR(255) column). The issue here is the datatype of the source data, and that source is a string literal. And unlike a string in .NET, SQL Server handles 'string' as an 8-bit encoding (VARCHAR; ASCII values 0 - 127 same across all Code Pages, Extended ASCII values 128 - 255 determined by the Code Page, and potentially 2-byte sequences for Double-Byte Character Sets) and N'string' as UTF-16 Little Endian (NVARCHAR; Unicode character set, 2-byte sequences for BMP characters 0 - 65535, two 2-byte sequences for Code Points above 65535). Using 'string' is the same as passing in a VARCHAR variable. For example:
DECLARE #ASCII VARCHAR(20);
SET #ASCII = N'อุตรดิตถ์';
SELECT #ASCII AS [ImplicitlyConverted]
-- ?????????
Could be a number of things!
Fist of print out the value of the column and your query string in hex.
SELECT convert(varbinary(20)Province) as stored convert(varbinary(20),'อุตรดิตถ์') as query from allDistricsBranches;
This should give you some insight to the problem. I think the most likely cause is the ั, ิ, characters being typed in the wrong sequence. They are displayed as part of the main letter but are stored internally as separate characters.

Why is Turkish Lira symbol ₺ replaced with ? in SQL server 2008 database

Any idea why the Turkish Lira symbol is replaced by a question mark when I insert it in a table in the database. See the image below
This is not a font issue. This is a Unicode (UTF-16) vs 8-bit Code Page character set issue (i.e. NVARCHAR vs VARCHAR). The character you are trying to use does not exist in the particular Code Page indicated by the default Collation of the DB in which you are executing this query. The Code Page used by the DB's default Collation is relevant here since your string literal is not prefixed with an upper-case "N". If it was, then the string would be interpreted as being Unicode and no conversion would take place. But since you are passing in a non-Unicode string, it will be forced into the current DB's default Collation's Code Page as the query is parsed. Any characters not available in that Code Page, and not having a Best-fit mapping, get turned into "?".
You can run the following to see for yourself:
SELECT '₺';
PRINT '₺';
It both prints AND displays in the results grid as ?
If you want to see what character SQL Server thinks it is, run the following:
SELECT ASCII('₺');
And it will return: 63
If you want to see what character has an ASCII value of 63, run this:
SELECT CHAR(63);
And it will return: ?
Now run this:
SELECT N'₺';
PRINT N'₺';
This will both print and display in the results grid correctly.
To see what character value the symbol really is, run the following:
SELECT UNICODE(N'₺'), UNICODE('₺');
This will return: 8378 and 63
But isn't 63 the question mark? Yes. That is because not prefixing the string literal '₺' with a capital "N" tells SQL Server that it is VARCHAR and so it gets translated to the default unknown character.
Now, if you were to execute this VARCHAR version in a DB that had a Collation tied to a Code Page that had this character, then it would work even when not prefixing the string literal with an upper-case "N". However, at the moment, I cannot find any Code Page used within SQL Server that supports this character. So, it might be a Unicode-only character, at least at far as SQL Server is concerned.
The way to fix this is:
Change the datatype of the field to NVARCHAR (I see in a comment on the question that the field is currently VARCHAR). If the field is VARCHAR then even if you use the N prefix on the string, the character will still get stored as ?, unless the Code Page specified by the Collation of the column supports this character, but again, I think this might be a Unicode-only character.
Change your INSERT statement to prefix the string field with a capital "N": (73, 4, N'(3) ₺'). Even if you change the field to NVARCHAR, if you don't prefix the string with N then SQL Server will translate the character to ? first and then insert the ?. This is because the query gets parsed before it gets executed, and parsing (for non-Unicode string literals and variables) is done in the Code Page of the DB's default Collation
Probably for the same reason my browser isn't displaying it in the title for this question: It isn't in the application's character set (or maybe not supported by the font).
In this case, my browser shows some numbers in a box (denoting the character code).
SQL-server is translating it to a known character instead.
Ensure you're storing it in a field that supports the character in it's character set (I think UTF-8 is sufficient)

Resources