I have a REGEX SQL CLR function:
var rule1 = new Regex("شماره\\s?\\d{1,10}");
Calling it on SQL Server 2016, however, returns this error:
System.ArgumentException: parsing "?????\s?\d{1,10}" - Quantifier {x,y} following nothing.
at System.Text.RegularExpressions.Regex..ctor(String pattern)
It seems that my Unicode characters are being changed to question marks, which makes the whole Regex wrong.
This issue has nothing to do with datatypes, whether for input parameters or return values, as the code provided, while sparse on detail, does show enough to see that:
there is no input parameter being used (the string is hard-coded).
the error is being thrown by System.Text.RegularExpressions.Regex, so has nothing to do with T-SQL or return values / types.
Also, while the error message does mention "Quantifier {x,y}", and there is indeed a {1,10} quantifier being used in the Regular Expression, it is a false correlation (albeit a rather understandable one) that the error message is referring to that specific quantifier. If you shorten the Regular Expression down to just "شماره", you will get the same error, except it will report the Regular Expression as being just "?????". Hence, "Quantifier {x,y}" actually refers to the first "?" in the expression shown in the error message (you will get the same error even if the Regular Expression is nothing more than "ش"). I figure that "Quantifier {x,y}" is the generalized way of looking at the ?, +, and * quantifiers as they can also be expressed as {0,1}, {1,}, and {0,}, respectively (or at least they should be).
This issue has nothing to do with SQL Server, or even Regular Expressions. This is an encoding issue, and RegEx is reporting the problem because it is being given ????? instead of شماره.
<TL;DR> Check your source code file's encoding. You might need to go to "Save As...", click on the down-arrow to the right of the word "Save" on the "Save" button, select "Save with Encoding...", and then select "Unicode (UTF-8 with signature) - Codepage 65001".
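The corruption itself is easy to reproduce outside of SQL Server. The following sketch (Python here, purely to illustrate the mechanics; the same thing happens when the compiler reads a CP-1252 source file) shows how round-tripping the pattern through a code page that has no Arabic letters turns them into literal question marks, which then break the regex:

```python
import re

pattern = "شماره\\s?\\d{1,10}"  # the original pattern, as Unicode text

# Encoding to Windows-1252 (which has no Arabic letters) with the
# common "replace" fallback turns each Arabic letter into '?':
mangled = pattern.encode("cp1252", errors="replace").decode("cp1252")
print(mangled)  # ?????\s?\d{1,10}

# The mangled pattern now starts with '?', a quantifier with nothing
# to repeat, so compiling it fails -- the same class of error the
# .NET Regex constructor reports:
try:
    re.compile(mangled)
except re.error as e:
    print(e)  # nothing to repeat at position 0
```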
There is a problem with the project configuration and/or the compiler. I placed the following string in both a Console Application and a Database Project:
"-😈-ŏ-א---\U0001F608-\u014F-\u05D0-"
(The second half of that test string, after the ---, is merely the escape sequences for the same three characters as appear in the first half, and in the same order.)
I compiled both and inspected the compiled output (meaning: it hasn't been deployed to SQL Server yet). That string appears in the EXE file (Console App) as:
2D003DD808DE2D004F012D00D0052D002D002D003DD808DE2D004F012D00D0052D00
which is the UTF-16 LE encoding for: -😈-ŏ-א---😈-ŏ-א-
Yet, it appears in the DLL file (SQLCLR Assembly) as:
2D003F003F002D003F002D003F002D002D002D003DD808DE2D004F012D00D0052D00
which is the UTF-16 LE encoding for: -??-?-?---😈-ŏ-א-
I even changed the output type of the Console App project to be "Class Library" and the string still got embedded correctly in that DLL file. So, for some reason the literal characters are being turned into literal question marks when compiled into a SQLCLR Assembly. I haven't yet figured out what is causing this as a quick look at the config settings and command-line flags for csc.exe seems to show them being effectively the same.
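The hex dumps above are easy to double-check. As a quick sketch (Python used purely as a convenient hex tool): the EXE bytes are exactly the test string encoded as UTF-16 LE, and decoding the DLL bytes shows the literal half reduced to question marks, two of them for the emoji because the conversion evidently ran per UTF-16 code unit:

```python
exe_hex = ("2D003DD808DE2D004F012D00D0052D00"
           "2D002D003DD808DE2D004F012D00D0052D00")
dll_hex = ("2D003F003F002D003F002D003F002D00"
           "2D002D003DD808DE2D004F012D00D0052D00")

s = "-😈-ŏ-א---\U0001F608-\u014F-\u05D0-"

# The EXE embeds the string as-is: UTF-16 LE, surrogate pair intact.
assert bytes.fromhex(exe_hex) == s.encode("utf-16-le")

# Decoding the DLL bytes shows what the literal half became; the
# escape-sequence half (plain ASCII in the source file) survived:
print(bytes.fromhex(dll_hex).decode("utf-16-le"))  # -??-?-?---😈-ŏ-א-
```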
In either case, it should be clear that specifying the Arabic characters via escape sequences, while cumbersome, will at least work, hence providing a (hopefully short-term) work-around so that you can move forward on this. I will continue looking to see what could be causing this difference in behavior.
UPDATE
In order to determine if the string was being converted to an 8-bit encoding or something else, I added two characters to the test string (one in both Windows-1252 and ISO-8859-1, and one only in Windows-1252):
§ = 0xA7 in CP-1252, 0xA7 in ISO-8859-1, and 0x00A7 in UTF-16
œ = 0x9C in CP-1252, not in ISO-8859-1, and 0x0153 in UTF-16
The new test string is:
"-😈-ŏ-א-§-œ---\U0001F608-\u014F-\u05D0-\x00A7-\x0153-"
That string appears in the EXE file (Console App) as:
2D003DD808DE2D004F012D00D0052D00A7002D0053012D002D002D003DD808DE2D004F012D00D0052D00A7002D0053012D00
which is the UTF-16 LE encoding for: -😈-ŏ-א-§-œ---😈-ŏ-א-§-œ-
Yet, it appears in the DLL file (SQLCLR Assembly) as:
2D003F003F002D003F002D003F002D00A7002D0053012D002D002D003DD808DE2D004F012D00D0052D00A7002D0053012D00
which is the UTF-16 LE encoding for: -??-?-?-§-œ---😈-ŏ-א-§-œ-
So, because both § and œ came through correctly in the SQLCLR Assembly, it is clearly not ISO-8859-1. It is either Code Page Windows-1252 or some other code page that supports both of those characters (CP-1252 being the most likely, given that my system uses it).
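The reasoning can be checked directly against the standard code page tables, for example:

```python
# § exists in both Windows-1252 and ISO-8859-1, at the same byte:
assert "§".encode("cp1252") == b"\xA7"
assert "§".encode("latin-1") == b"\xA7"   # latin-1 == ISO-8859-1

# œ exists in Windows-1252 (0x9C) but not in ISO-8859-1:
assert "œ".encode("cp1252") == b"\x9C"
try:
    "œ".encode("latin-1")
    print("œ is in ISO-8859-1")
except UnicodeEncodeError:
    print("œ is NOT in ISO-8859-1")  # this branch is taken
```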
Still investigating the root cause...
UPDATE 2
Ok, I feel kinda silly. Sometimes it helps to close a file (or even the entire solution) and reopen it. Doing so, I noticed that my test string now appeared as:
"-??-?-?-?-?---\U0001F608-\u014F-\u05D0-\x00A7-\x0153-"
Funny, I don't remember pasting that in ;-). So, I checked the file encoding that Visual Studio was saving it as and sure enough it was "Western European (Windows) - Codepage 1252". And just to be extra special certain, I checked the file for the Console App and it was correctly set to "Unicode (UTF-8 with signature) - Codepage 65001". D'oh! Changing the file encoding under "Save As..." to "Unicode (UTF-8 with signature) - Codepage 65001", I then replaced both the test string and the O.P.'s Regular Expression. Both came through perfectly, no errors or question marks.
I'm using ESNACC for compiling multiple ASN.1 source files to C code. For ease of understanding, I will explain the scenario here as succinctly as possible:-
FileA.asn1 contains the following:-
FileA DEFINITIONS ::=
BEGIN
A ::= SEQUENCE
{
    AContent [0] OCTET STRING (CONTAINING FileB.B)
}
END
FileB.asn1 contains the following:-
FileB DEFINITIONS ::=
BEGIN
B ::= SEQUENCE
{
    BElem1 [0] INTEGER,
    BElem2 [1] INTEGER
}
END
I used ESNACC to compile both files in one command. Upon analysing the C source files generated, I observed that the AContent field will be decoded as a constructed OCTET STRING (the data being received in the application guarantees that the field will be specified as constructed) with its contents being filled into a simple string. This means that FileB does not come into the picture at all. I was hoping that AContent would be further decoded with a structure of FileB being filled, so that I can easily access the elements within. This does not seem to be the case.
I'm fairly new with ASN1, so please let me know if my understanding is wrong in any way.
Is ESNACC not capable of generating code for supporting CONTAINING keyword properly?
Are there other compilers that are able to do this?
Can this be done by using ESNACC in any way?
If this cannot be done using ESNACC, and I don't want to use any other compiler, how would I access the contents within AContent at runtime easily?
I am not sure of the capabilities of ESNACC, but there are many other compilers that support the CONTAINING keyword. An excellent list of compilers can be found at https://www.itu.int/en/ITU-T/asn1/Pages/Tools.aspx which is part of the ITU-T ASN.1 Project.
Heimdal's ASN.1 compiler (lib/asn1/) has support for the funky Information Object System syntax extensions that allow you to declare things like what all goes into Certificate Extensions (for example), and the generated code will decode everything recursively in one go.
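If the compiler you are stuck with leaves AContent as a plain OCTET STRING, one workable fallback is to decode the contained value by hand: the octets are just a BER/DER encoding of FileB.B, i.e. a SEQUENCE of two context-tagged INTEGERs. Below is a hypothetical sketch of such a decoder (Python for brevity; the same TLV walk is straightforward to transcribe into the C code ESNACC generates). It assumes IMPLICIT tagging, small tag numbers, and definite short-form lengths; with the module's default EXPLICIT tagging each element would wrap an extra inner INTEGER TLV:

```python
def read_tlv(buf, pos):
    """Read one BER TLV with a low tag number and a short-form length."""
    tag = buf[pos]
    length = buf[pos + 1]
    assert length < 0x80, "long-form lengths not handled in this sketch"
    start = pos + 2
    return tag, buf[start:start + length], start + length

def decode_b(octets):
    """Decode FileB.B: SEQUENCE { BElem1 [0] INTEGER, BElem2 [1] INTEGER }."""
    tag, body, _ = read_tlv(octets, 0)
    assert tag == 0x30, "expected a SEQUENCE"
    values, pos = {}, 0
    while pos < len(body):
        tag, content, pos = read_tlv(body, pos)
        ctx = tag & 0x1F                      # context tag number: 0 or 1
        values[ctx] = int.from_bytes(content, "big", signed=True)
    return values

# SEQUENCE { [0] INTEGER 5, [1] INTEGER 7 }, implicitly tagged, in DER:
acontent = bytes([0x30, 0x06, 0x80, 0x01, 0x05, 0x81, 0x01, 0x07])
print(decode_b(acontent))  # {0: 5, 1: 7}
```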
I have a large source code package in C and I would like to spell check all string literals and comments. After adding exceptions to a file, I would like to perform the same procedure on each release, to see if there are any spelling errors inserted.
I have checked with ispell, hunspell and aspell but, to my disappointment and surprise, although they do seem to understand HTML, TeX and a few other languages, they do not have a C feature. The closest I found was a "ccpp" filter mentioned for aspell, but when I run "aspell dump filters", the ccpp filter is not listed.
Any ideas?
You have to write a lexer first to extract string constants and comments to a text file with the associated line and column of the source file
(ply can be useful here, or lex/yacc, but some coding is needed).
Then use whatever spell checker you like, parse its report, and trace back to the original C file location.
Or connect the spell checker directly to your lexer.
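As a rough illustration of that pipeline, the extraction step can be prototyped with a few regular expressions before investing in a real lexer. This sketch deliberately ignores corner cases (trigraphs, line continuations, quotes produced by the preprocessor), so treat it as a starting point only:

```python
import re

# Order matters: string literals are matched first, so that quote
# characters inside comments (and comment markers inside strings)
# do not confuse the simpler patterns.
TOKEN = re.compile(
    r'"(?:\\.|[^"\\])*"'        # string literal (handles escaped quotes)
    r"|/\*.*?\*/"               # block comment
    r"|//[^\n]*",               # line comment
    re.DOTALL,
)

def extract(source):
    """Yield (line, column, text) for each string literal and comment."""
    for m in TOKEN.finditer(source):
        line = source.count("\n", 0, m.start()) + 1
        col = m.start() - source.rfind("\n", 0, m.start())
        yield line, col, m.group()

code = '''int main(void) {
    /* initalize the counter */
    const char *msg = "Helo, world";  // greting
    return 0;
}'''
for line, col, text in extract(code):
    print(line, col, text)
```

Each extracted token carries its line and column, so a spell-checker report on the extracted text can be traced back to the original C file.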
I am writing a File System Driver for Windows 7, using the Dokan library. In the FindFiles function I want to set the 8.3 alternate name. I am assuming that it will show up if I use dir /x, but it doesn't. I have tried passing a null-terminated string, then changed to a blank-padded (not null-terminated) string as coded below. Neither one shows the alternate name in dir /x.
See http://msdn.microsoft.com/en-us/library/windows/desktop/aa365740%28v=vs.85%29.aspx for a reference to cAlternateFileName in struct _WIN32_FIND_DATA.
Does anyone have any information on this?
Here is a clip from my code:
wsprintf(w_surfaceName, L"S%d-P%02x~1", pCartIDtable[count].dsmNumber, pCartIDtable[count].pltrNumber);  // build the 8.3 name
wp = wcschr(w_surfaceName, L'\0');  // locate the null terminator
wmemset(wp, L' ', _countof(w_surfaceName) - (wp - w_surfaceName));  // blank-pad the remainder
wmemcpy(findData.cAlternateFileName, w_surfaceName, _countof(findData.cAlternateFileName));  // copy without null termination
FillFindData(&findData, DokanFileInfo);
Either the file doesn't have an 8.3 short-name, or the field hasn't been filled in.
Some versions of Windows have short-name generation turned off by default. Some people have short-name generation turned off just to make the file system faster. Even if you have short-name generation turned on now, that doesn't retroactively generate short names across your existing file system.
And the field is not filled in anyway if the request was only for "FindExInfoBasic".
Dokan does not support 8.3 short-names at this point. Progress for implementation of this feature is tracked at: https://github.com/dokan-dev/dokany/issues/301
Eclipse for C/C++ can automatically format and wrap just about any kind of code, and the behaviour is very configurable, except for string literals. Here is a made-up example where a debug output message happens to be longer than what can fit within the printable area:
if (some_kind_of_action() == TOUGH_LUCK) {
    system_debug_print("Task name error: some_kind_of_action() failed due to your sloppy design.");
}
Using 79 character print margin the desirable result could be:
if (some_kind_of_action() == TOUGH_LUCK) {
    system_debug_print("Task name error: some_kind_of_action() failed due to yo"
                       "ur sloppy design.");
}
You can do this manually by typing your string literal, then placing the cursor at the desired wrap point and pressing the Enter key. Eclipse will automatically add the necessary quotation marks. This is all nice until something in your code changes and you have to redo the wrapping manually. I don't see why wrapping at the print margin can't be done fully automatically, like any other piece of code.
Is there any way to automate hard wrapping of string literals at print margin in Eclipse for C/C++?
Eclipse does not support this feature in any of its editors (even though it was requested nine years ago). However, you may be able to avoid breaking your lines manually by using the following plugin, which enables soft wrap:
http://ahtik.com/blog/projects/eclipse-word-wrap/
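Outside of Eclipse, the rewrapping itself is mechanical enough to script as part of a pre-commit or formatting step. Here is a sketch of the idea (a hypothetical helper, not a drop-in tool):

```python
def wrap_literal(line, margin=79):
    """Hard-wrap one C string literal into adjacent literals.

    Assumptions (this is a sketch, not a robust tool): the line holds a
    single double-quoted literal, the opening quote appears well before
    the margin, and no split point lands inside an escape sequence.
    """
    indent = " " * (line.index('"') + 1)   # align continuation quotes
    out = []
    while len(line) > margin:
        cut = margin - 1                   # leave room for the added quote
        out.append(line[:cut] + '"')
        line = indent + '"' + line[cut:]
    out.append(line)
    return "\n".join(out)

src = ('    system_debug_print("Task name error: some_kind_of_action() '
       'failed due to your sloppy design.");')
print(wrap_literal(src))
```

The adjacent string literals produced by the split are concatenated by the C compiler, so the program's behavior is unchanged.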
I am trying to use _findfirst() Windows API in C to match file name using wildcards.
If I pass ????????.txt, I expect it to match only files in a directory whose names are exactly 8 characters, but it matches more than that.
Is there anything wrong with this usage?
I would guess that it is matching on the short name. On Windows, all files have a long name and a DOS 8.3 short name. Therefore "????????.txt" is effectively the same as "*.txt".
Also, on a pedantic note, _findfirst() is not part of the Windows API. It is part of the Microsoft C run-time library.