When loading data into Snowflake using the COPY INTO command, there is a parameter called REPLACE_INVALID_CHARACTERS. According to the documentation, if this is set to TRUE, then any invalid UTF-8 characters are replaced with the Unicode replacement character, which looks like this: (�)
https://docs.snowflake.net/manuals/sql-reference/sql/copy-into-table.html#type-csv
My question is, how can I remove this character from data in my table? I have tried to use REGEXP_REPLACE but have been unable to figure out the right pattern to use.
Here is an example of what the data looks like:
Notice how the LENGTH function doesn't even register that the character is there since it says there are 7 characters when there are clearly 8.
Any advice on what Snowflake SQL function to use to remove these characters would be greatly appreciated!
The Unicode replacement character is \uFFFD, so replacing that with '' should work:
select replace('asdf�', '\uFFFD', '');
--Returns: asdf
After extensive back-and-forth with Snowflake support, we finally settled on creating our own JavaScript function to cleanse non-ASCII characters, including this Unicode replacement character.
What made this challenging is that the Unicode replacement character that Snowflake adds does not itself appear to be a valid character, which makes it hard to remove.
The function below is the only thing we found that reliably works. It is also super fast:
CREATE OR REPLACE FUNCTION ADMIN.DESIGN.REPLACE_NON_ASCII("input" varchar, "replacement" varchar)
RETURNS varchar
LANGUAGE JavaScript
AS
$$
    // This function is used to cleanse non-ASCII characters out of data, including corrupt non-Unicode characters
    var output = "";
    if (input == undefined) {
        return input;
    }
    else {
        for (var i = 0; i < input.length; i++) {
            if (input.charCodeAt(i) >= 32 && input.charCodeAt(i) <= 127) {
                output += input.charAt(i);
            }
            else {
                output += replacement;
            }
        }
        return output;
    }
$$;
How do I split a string into two strings (array name, index number), but only if the string matches the following structure: "ArrayName[index]"?
The array name can be 31 characters at most and the index 3 at most.
I found the following example, which is supposed to work with "Matrix[index1][index2]". I couldn't really understand how it works, so I couldn't take apart the piece I need to get my strings.
sscanf(inputString, "%32[^[]%*[[]%3[^]]%*[^[]%*[[]%3[^]]", matrixName, index1,index2) == 3
This attempt wasn't a success; what am I missing?
sscanf(inputString, "%32[^[]%*[[]%3[^]]", arrayName, index) == 2
How do I split a string into two strings (array name, index number), but only if the string matches the following structure: "ArrayName[index]"?
With sscanf, you don't. Not if you mean that you can rely on nothing being modified in the event that the input does not match the pattern. This is because sscanf, like the rest of the scanf family, processes its input and format linearly, without backtracking, and by design it fills input fields as they are successfully matched. Thus, if you scan with a format that assigns multiple fields or has trailing literal characters then it is possible for results to be stored for some fields despite a matching failure occurring.
But if that's OK with you, then @gsamaras's answer provides a nearly correct approach to parsing and validating a string according to your specified format, using sscanf. That answer also presents a nice explanation of the meaning of the format string. The problem with it is that it provides no way to distinguish between the input fully matching the format and the input failing to match at the final ], or including additional characters after it.
Here is a variation on that code that accounts for those tail-end issues, too:
char array_name[32] = {0}, idx[4] = {0}, c = 0;
int n;

if (sscanf(str, "%31[^[][%3[^]]%c%n", array_name, idx, &c, &n) >= 3
        && c == ']' && str[n] == '\0')
    printf("arrayName = %s\nindex = %s\n", array_name, idx);
else
    printf("Not in the expected format \"ArrayName[idx]\"\n");
The difference in the format is the replacement of the literal terminating ] with a %c directive, which matches any one character, and the addition of a %n directive, which causes the number of characters of input read so far to be stored, without itself consuming any input.
With that, if the return value is at least 3 then we know that the whole format was matched (a %n never produces a matching failure, but docs are unclear and behavior is inconsistent on whether it contributes to the returned field count). In that event, we examine variable c to determine whether there was a closing ] where we expected to find one, and we use the character count recorded in n to verify that all characters of the string were parsed (so that str[n] refers to a string terminator).
You may at this point be wondering at how complicated and cryptic that all is. And you would be right to do so. Parsing structured input is a complicated and tricky proposition, for one thing, but also the scanf family functions are pretty difficult to use. You would be better off with a regex matcher for cases like yours, or maybe with a machine-generated lexical analyzer (see lex), possibly augmented by a machine-generated parser (see yacc). Even a hand-written parser that works through the input string with string functions and character comparisons might be an improvement. It's still complicated any way around, but those tools can at least make it less cryptic.
Note: the above assumes that the index can be any string of up to three characters. If you meant that it must be numeric, perhaps specifically a decimal number, perhaps specifically non-negative, then the format can be adjusted to serve that purpose.
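For instance, if the index must be a non-negative decimal number of up to three digits, only the index conversion needs to change (this reuses the declarations from the snippet above; the %3[0-9] scan set is my adjustment, not something the question requires):
if (sscanf(str, "%31[^[][%3[0-9]%c%n", array_name, idx, &c, &n) >= 3
        && c == ']' && str[n] == '\0')
    printf("arrayName = %s\nindex = %s\n", array_name, idx);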
A naive example to get you started:
#include <stdio.h>
#include <string.h>
int main(void)
{
    char str[] = "myArray[123]";
    char array_name[32] = {0}, idx[4] = {0};

    if (sscanf(str, "%31[^[][%3[^]]]", array_name, idx) == 2)
        printf("arrayName = %s\nindex = %s\n", array_name, idx);
    else
        printf("Not in the expected format \"ArrayName[idx]\"\n");
    return 0;
}
Output:
arrayName = myArray
index = 123
This will catch easy not-in-the-expected-format cases, such as "ArrayNameidx]" and "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOP[idx]", but not "ArrayName[idx".
The essence of sscanf() is to tell it where to stop, otherwise %s would read until the next whitespace.
This negated scanset %[^[] means read until you find an opening bracket.
This negated scanset %[^]] means read until you find a closing bracket.
Note: I used 31 and 3 as the width specifiers, respectively, because we want to reserve the last slot for the null terminator: the array name is assumed to be 31 characters at most and the index 3 at most, so the size of the array holding each token is the maximum allowed length plus one.
How can I use sscanf to analyze string data?
Use "%n" to detect a completed scan.
array name can be 31 characters at most and the index 3 at most.
For illustration, let us assume the index needs to limit to a numeric value [0 - 999].
Use string literal concatenation to present the format more clearly.
char array_name[32]; // array name can be 31 characters
#define NAME_FMT "%31[^[]"
char idx[4]; // index can be 3 characters
#define IDX_FMT "%3[0-9]"
int n = 0; // be sure to initialize

sscanf(str, NAME_FMT "[" IDX_FMT "]" "%n", array_name, idx, &n);
// Did the scan complete (is `n` non-zero) with no extra text?
if (n && str[n] == '\0') {
    printf("arrayName = %s\nindex = %d\n", array_name, atoi(idx));
} else {
    printf("Not in the expected format \"ArrayName[idx]\"\n");
}
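For reference, here is one way that fragment might be wrapped into a complete test program; the test strings are made up for illustration:
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    // Made-up inputs covering the good and bad cases discussed above.
    const char *tests[] = { "myArray[123]", "myArray[123]x", "myArrayidx]", "myArray[12" };

    for (size_t i = 0; i < sizeof tests / sizeof tests[0]; i++) {
        const char *str = tests[i];
        char array_name[32]; // array name can be 31 characters
        char idx[4];         // index can be 3 characters
        int n = 0;           // be sure to initialize

        sscanf(str, "%31[^[]" "[" "%3[0-9]" "]" "%n", array_name, idx, &n);
        if (n && str[n] == '\0')
            printf("\"%s\" -> arrayName = %s, index = %d\n", str, array_name, atoi(idx));
        else
            printf("\"%s\" -> not in the expected format\n", str);
    }
    return 0;
}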
I've never really used C before but am trying to run this code: https://github.com/stanfordnlp/GloVe/blob/master/src/glove.c
Problem: when I read a UTF-8 character using this code and simply output that character, it comes out differently.
Here is an example
µl µl
。 。
ß Ã<9f>
versión versión
◘ â<97><98>
Léon Léon
Résumé Résumé
Cancún Cancún
������ ���ï¿
The left side is the original word in fid and the right side is what this code outputs.
The fprintf happens in lines 234-237:
if (fscanf(fid,format,word) == 0) return 1;
if (strcmp(word, "<unk>") == 0) return 1;
fprintf(fout, "%s",word);
The first line reads the word from fid using format. However, format is defined as sprintf(format,"%%%ds",MAX_STRING_LENGTH);, which carries no information about encoding.
My question is: how does C know which encoding to read and output? In this file, I can't find where it specifies an encoding such as UTF-8, ISO-8859, etc.
How can I make this code write the left-side characters?
Any comment (short is fine too!) or some keywords that I should look up will be highly appreciated! Thanks.
C doesn't know anything about whatever encoding you use for the input. The fscanf call will simply read space-delimited "characters", where each character is a single byte.
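To illustrate that point, here is a small made-up example (it assumes the source file is saved as UTF-8 and reuses the ß case from the table above): the program only ever sees bytes, and what appears on screen depends entirely on how your terminal or viewer interprets those bytes.
#include <stdio.h>

int main(void)
{
    /* "ß" encoded as UTF-8 is the two bytes 0xC3 0x9F; C just sees those bytes. */
    const char word[] = "\xC3\x9F";

    printf("%s\n", word);            /* shows ß on a UTF-8 terminal, Ã<9f> on a Latin-1 one */
    for (const unsigned char *p = (const unsigned char *)word; *p != '\0'; p++)
        printf("%02X ", *p);         /* prints C3 9F either way: the bytes never changed */
    printf("\n");
    return 0;
}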
I'm trying to write a program that counts all the characters in a Turkish-language string. I can't see why this doesn't work. I added the locale header and setlocale(LC_ALL,"turkish"), but it still doesn't work. Thank you. Here is my code:
My file's character encoding is UTF-8.
int main(){
    setlocale(LC_ALL,"turkish");
    char string[9000];
    int c = 0, count[30] = {0};
    int bahar = 0;
    ...
    if ( string[c] >= 'a' && string[c] <= 'z' ){
        count[string[c]-'a']++;
        bahar++;
    }
my output:
a 0.085217
b 0.015272
c 0.022602
d 0.035736
e 0.110263
f 0.029933
g 0.015272
h 0.053146
i 0.071167
k 0.010996
l 0.047954
m 0.025046
n 0.095907
o 0.069334
p 0.013745
q 0.002443
r 0.053451
s 0.073916
t 0.095296
u 0.036958
v 0.004582
w 0.019243
x 0.001527
y 0.010996
This is the English alphabet, but I need these characters to be counted too: "ğ, ü, ç, ı, ö".
setlocale(LC_ALL,"turkish");
First: "turkish" isn't a locale.
The proper name of a locale will typically look like xx_YY.CHARSET, where xx is the ISO 639-1 code for the language, YY is the ISO 3166-1 Alpha-2 code for the country, and CHARSET is an optional character set name (usually ISO8859-1, ISO8859-15, or UTF-8). Note that not all combinations are valid; the computer must have locale files generated for that specific combination of language code, country code, and character set.
What you probably want here is setlocale(LC_ALL, "tr_TR.UTF-8").
if ( string[c] >= 'a' && string[c] <= 'z' ){
Second: Comparison operators like >= and <= are not locale-sensitive. This comparison will always be performed on bytes, and will not include characters outside the ASCII a-z range.
To perform a locale-sensitive comparison, you must use a function like strcoll(). However, note additionally that some letters (including the ones you're trying to include here!) are composed of multi-byte sequences in UTF-8, so looping over bytes won't work either. You will need to use a function like mblen() or mbtowc() to separate these sequences.
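As a rough sketch of what that looks like (not the poster's full program; the sample word is made up, and it assumes a UTF-8 Turkish locale is actually available on the system):
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    if (setlocale(LC_ALL, "tr_TR.UTF-8") == NULL)
        return 1;                        /* that locale is not installed on this system */

    const char *s = "bah\xC3\xA7e";      /* "bahçe" in UTF-8: the ç is the two bytes C3 A7 */
    wchar_t wc;
    int len;

    mbtowc(NULL, NULL, 0);               /* reset the conversion state */
    while ((len = mbtowc(&wc, s, MB_CUR_MAX)) > 0) {
        wprintf(L"U+%04X\n", (unsigned)wc);  /* one code point per multi-byte sequence */
        s += len;
    }
    return 0;
}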
Since you are apparently working with a UTF-8 file, the answer will depend upon your execution platform:
If you're on Linux, setlocale(LC_CTYPE, "en_US.UTF-8") or something similar should work, but the important part is the UTF-8 at the end! The language shouldn't matter. You can verify it worked by using
if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL) {
    abort();
}
That will stop the program from executing if the locale cannot be set, so anything after that code runs only if the locale was set correctly.
If you're on Windows, you can instead open the file using fopen("myfile.txt", "rt, ccs=UTF-8"). However, this isn't entirely portable to other platforms. It is a lot cleaner than the alternatives, though, which is likely more important in this particular case.
If you're using FreeBSD or another system that doesn't allow you to use either approach (e.g. there are no UTF-8 locales), you'd need to parse the bytes manually or use a library to convert them for you. If your implementation has an iconv() function, you might be able to use it to convert from UTF-8 to ISO-8859-9 to use your special characters as single bytes.
Once you're ready to read the file, you can use fgetws with a wchar_t array.
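For example (a minimal fragment that needs <stdio.h> and <wchar.h>; fp stands for whatever FILE * you opened the file with, and the buffer size just mirrors the one in the question):
wchar_t string[9000];

/* Read one line (up to 8999 wide characters plus the terminator), decoded
   according to the locale or ccs mode set up as described above. */
if (fgetws(string, sizeof string / sizeof string[0], fp) == NULL) {
    /* EOF or read error */
}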
Another problem is checking if one of your non-ASCII characters was detected. You could do something like this:
// lower = "abcdefghijklmnopqrstuvwxyzçöüğı"
// upper = "ABCDEFGHİJKLMNOPQRSTUVWXYZÇÖÜĞI"
const wchar_t lower[] = L"abcdefghijklmnopqrstuvwxyz\u00E7\u00F6\u00FC\u011F\u0131";
const wchar_t upper[] = L"ABCDEFGH\u0130JKLMNOPQRSTUVWXYZ\u00C7\u00D6\u00DC\u011EI";
const wchar_t *lchptr = wcschr(lower, string[c]);
const wchar_t *uchptr = wcschr(upper, string[c]);
if (lchptr) {
    count[(size_t)(lchptr-lower)]++;
    bahar++;
} else if (uchptr) {
    count[(size_t)(uchptr-upper)]++;
    bahar++;
}
That code assumes you're counting characters without regard for case (case insensitive). That is, ı (\u0131) and I are considered the same character (count[8]++), just like İ (\u0130) and i are considered the same (count[29]++). I won't claim to know much about the Turkish language, but I used what little I understand about Turkish casing rules when I created the uppercase and lowercase strings.
Edit
As @JonathanLeffler mentioned in the question's comments, a better solution would be to use something like isalpha (or in this case, iswalpha) on each character in string instead of the lower and upper strings of valid characters I used. This, however, would only tell you that the character is alphabetic; it wouldn't tell you which index of your count array to use. The truth is that there is no universal answer here, because some languages use only a few characters with diacritic marks rather than an entire contiguous group where you can just do string[c] >= L'à' && string[c] <= L'ç'. In other words, even once you have read the data, you still need to convert it to fit your solution, and that requires knowing what you're working with so you can map characters to integer values. My code does that by using strings of valid characters, where the index of each character in the string is the index into the count array (i.e. lower[29] means count[29]++ is executed, and upper[18] means count[18]++ is executed).
The solution depends on the character encoding of your files.
If the file is in ISO 8859-9 (Latin-5), then each special character is still encoded in a single byte, and you can modify your code easily: you already have a distinction between upper case and lower case, so just add more branches for the special characters (see the sketch below).
If the file is in UTF-8, or some other Unicode encoding, you need a multi-byte capable string library.
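For the single-byte Latin-5 case, the extra branches might look roughly like this (a sketch only: the hex values are the ISO 8859-9 codes for lower-case ç, ö, ü, ğ, ı, ş, and count[] would need to grow to at least 32 entries):
/* ISO 8859-9 (Latin-5) codes for the extra lower-case letters: ç ö ü ğ ı ş */
const unsigned char latin5_extra[] = { 0xE7, 0xF6, 0xFC, 0xF0, 0xFD, 0xFE };
unsigned char ch = (unsigned char)string[c];

if (ch >= 'a' && ch <= 'z') {
    count[ch - 'a']++;
    bahar++;
} else {
    for (int k = 0; k < 6; k++) {
        if (ch == latin5_extra[k]) {
            count[26 + k]++;   /* the extra letters are counted after a-z */
            bahar++;
            break;
        }
    }
}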
Okay, total newbie here, but I need a little help/insight on how to start writing a specific program. I'm not asking for someone to do it for me, I'm just asking for an approach to this problem because I'm honestly not sure how to begin.
The program I am supposed to write is to detect valid integers. However, in this program, a valid integer is defined as the following:
0 or more leading white spaces followed by...
an optional '+' or '-' followed by...
1 or more digits, followed by a non-alphanumeric, but not a '.' followed by 1 or more digits.
Examples of valid integers: "1234", " 1234 ", "1234.", " +1234 ", "12+34", "1234.", "1234 x", and " -1234 " are all integers, and none of "1234e5", "e1234", "1234.56", and "1234abc" are.
So far, all I can think of is using a bunch of if statements to check for valid integers, but I can't help but think there has to be a better and more robust approach than using a lot of if statements to check each character of the string. I can't think of any functions that would be useful to me other than isdigit() and maybe strtol(). Any advice would be appreciated.
You just need to examine each character in a loop and keep a little state machine as you're going, until you decide it's not valid or you reach the end.
Edit: Nothing wrong with if statements, or you could use a switch statement.
I'd probably use sscanf (or fscanf, etc.)
Although it doesn't support full regular expressions, scanf format strings do support scan set conversions, which are about like a character set in a regular expression (including inverted ones, so for example %1[^a-zA-Z0-9] matches a single non-alphanumeric character).
A single space in a format string matches an arbitrary amount of white space in the input.
Put your words into code, one piece at a time. Pseudocode follows:
// to detect valid integers.
success_failure detect_valid_integers(const char *s) {
    // 0 or more leading white spaces followed by...
    while (test_for_whitespace(*s)) s++;

    // an optional '+' or '-' followed by...
    if (test_if_sign(*s)) s++;

    // 1 or more digits, ...
    digit_found = false;
    while (test_if_digit(*s)) { s++; digit_found = true; }
    if (!digit_found) return fail;

    // followed by a non-alphanumeric, but not a '.' followed by 1 or more digits.
    if (is_a_non_alphanumeric_non_dp_not_null(*s)) {
        s++;
        digit_found = false;
        while (test_if_digit(*s)) { s++; digit_found = true; }
        if (!digit_found) return fail;
    }
    if (is_not_a_null_character(*s)) return fail;
    return success;
}
Have a look at strtol(); it can tell you about the invalid part of the string via the end pointer it hands back.
And beware of enthusiastic example code: see the man page for comprehensive error handling.
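Under my reading of the rules in the question, a strtol-based check might look like the sketch below (the helper name and test strings are made up, and the numeric range is deliberately not checked, since the rules only describe the format):
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Returns 1 if s holds a "valid integer" as the question defines it:
   optional whitespace, optional sign, one or more digits, and the character
   after the digits must not be alphanumeric and must not be a '.' that is
   itself followed by a digit. */
static int is_valid_integer(const char *s)
{
    char *end;

    (void)strtol(s, &end, 10);               /* skips whitespace, handles sign and digits */
    if (end == s)
        return 0;                            /* no digits at all, e.g. "e1234" */
    if (isalnum((unsigned char)*end))
        return 0;                            /* e.g. "1234e5", "1234abc" */
    if (*end == '.' && isdigit((unsigned char)end[1]))
        return 0;                            /* e.g. "1234.56" */
    return 1;                                /* "1234", " +1234 ", "1234.", "12+34", ... */
}

int main(void)
{
    const char *tests[] = { "1234", " +1234 ", "1234.", "12+34", "1234e5", "1234.56" };

    for (size_t i = 0; i < sizeof tests / sizeof tests[0]; i++)
        printf("%-10s -> %s\n", tests[i], is_valid_integer(tests[i]) ? "valid" : "invalid");
    return 0;
}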
I am trying to write my own version of wc (the Unix filter), but I have a problem with non-ASCII characters. I did a hex dump of a text file and found out that these characters occupy more than one byte, so they won't fit in a char. Is there any way I can read these characters from a file and handle them as single characters (in order to count the characters in a file) in C?
I've been googling a little bit and found the wchar_t type, but there weren't any simple examples of how to use it with files.
I've been googling a little bit and found the wchar_t type, but there weren't any simple examples of how to use it with files.
Well met. There weren't any simple examples because, unfortunately, proper character set support isn't simple.
Aside: In an ideal world, everybody would use UTF-8 (a Unicode encoding that is memory-efficient, robust, and backward-compatible with ASCII), the standard C library would include UTF-8 encoding-decoding support, and the answer to this question (and dealing with text in general) would be simple and straightforward.
The answer to the question "What is the best Unicode library for C?" is to use the ICU library. You may want to look at ustdio.h, as it has a u_fgetc function, and adding Unicode support to your program will probably take little more than typing u_ a few times.
Also, if you can spare a few minutes for some light reading, you may want to read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know about Unicode and Character Sets (No Excuses!) from Joel On Software.
I, personally, have never used ICU, but I probably will from now on :-)
If you want to write a standard C version of the wc utility that respects the current language setting when it is run, then you can indeed use the wchar_t versions of the stdio functions. At program startup, you should call setlocale():
setlocale(LC_CTYPE, "");
This will cause the wide character functions to use the appropriate character set defined by the environment - eg. on Unix-like systems, the LANG environment variable. For example, this means that if your LANG variable is set to a UTF8 locale, the wide character functions will handle input and output in UTF8. (This is how the POSIX wc utility is specified to work).
You can then use the wide-character versions of all the standard functions. For example, if you have code like this:
long words = 0;
int in_word = 0;
int c;

while ((c = getchar()) != EOF)
{
    if (isspace(c))
    {
        if (in_word)
        {
            in_word = 0;
            words++;
        }
    }
    else
    {
        in_word = 1;
    }
}
...you would convert it to the wide character version by changing c to a wint_t, getchar() to getwchar(), EOF to WEOF and isspace() to iswspace():
long words = 0;
int in_word = 0;
wint_t c;

while ((c = getwchar()) != WEOF)
{
    if (iswspace(c))
    {
        if (in_word)
        {
            in_word = 0;
            words++;
        }
    }
    else
    {
        in_word = 1;
    }
}
Go have a look at ICU. That library is what you need to deal with all the issues.
Most of the answers so far have merit, but which you use depends on the semantics you want:
If you want to process text in the configured locale's encoding, and don't care about complete failure in the case of encountering invalid sequences, using getwchar() is fine.
If you want to process text in the configured locale's encoding, but need to detect and recover from invalid sequences, you need to read bytes and use mbrtowc manually (a sketch of this follows below).
If you always want to process text as UTF-8, you need to read bytes and feed them to your own decoder. If you know in advance the file will be valid UTF-8, you can just count bytes in the ranges 00-7F and C2-F4 and skip counting all other bytes, but this could give wrong results in the presence of invalid sequences. A more robust approach would be decoding the bytestream to Unicode codepoints and counting the number of successful decodes.
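As a sketch of the second option, here is a small self-contained example of decoding with mbrtowc and recovering from an invalid sequence (the sample bytes are made up, and counting an invalid sequence as one character is a policy choice, not something the standard dictates):
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_CTYPE, "");

    /* Example buffer: "naïve" in UTF-8 followed by a stray 0xFF byte and '!'. */
    const char buf[] = "na\xC3\xAFve\xFF!";
    size_t len = sizeof buf - 1;

    mbstate_t st;
    memset(&st, 0, sizeof st);

    long chars = 0;
    size_t pos = 0;
    while (pos < len) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, buf + pos, len - pos, &st);
        if (r == (size_t)-1 || r == (size_t)-2) {
            /* Invalid or truncated sequence: count it as one character,
               reset the conversion state, and resynchronise on the next byte. */
            memset(&st, 0, sizeof st);
            pos += 1;
        } else {
            pos += (r == 0) ? 1 : r;   /* r == 0 means an embedded null character */
        }
        chars++;
    }
    printf("%ld characters\n", chars); /* 7 in a UTF-8 locale: n a ï v e <bad byte> ! */
    return 0;
}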
Hope this helps.
Are you sure you really need the number of characters? wc counts the number of bytes.
~$ echo 'דניאל' > hebrew.txt
~$ wc hebrew.txt
1 1 11 hebrew.txt
(11 = 5 two-byte characters + 1 byte for '\n')
However, if you really do want to count characters rather than bytes, and can assume that your text files are encoded in UTF-8, then the easiest approach is to count all bytes that are not trail bytes (i.e., in the range 0x80 to 0xBF).
If you can't assume UTF-8 but can assume that any non-UTF-8 files are in a single-byte encoding, then perform a UTF-8 validation check on the data. If it passes, return the number of UTF-8 lead bytes. If it fails, return the total number of bytes.
(Note that the above approach is specific to wc. If you're actually doing something with the characters rather than just counting them, you'll need to know the encoding.)
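For concreteness, here is a minimal sketch of the count-everything-but-trail-bytes idea (it assumes the input on stdin really is valid UTF-8):
#include <stdio.h>

int main(void)
{
    long chars = 0;
    int c;

    /* Every UTF-8 character starts with a byte outside the 0x80-0xBF
       continuation range, so counting non-continuation bytes counts characters. */
    while ((c = getchar()) != EOF) {
        if ((c & 0xC0) != 0x80)   /* not a trail byte of the form 10xxxxxx */
            chars++;
    }
    printf("%ld\n", chars);
    return 0;
}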