unicode string in C extension - c

I am writing a C extension for Ruby, and I need to accept a string as a parameter, and iterate the characters in the string. My code below works fine for ASCII characters, but it does not handle multiple byte characters, and outputs "garbage" instead. I could not find any sample code that would iterate over unicode strings. I would appreciate any pointers.
static VALUE test_method(VALUE self, VALUE text)
{
    char *pch;
    char *pch_end = RSTRING_END(text);
    for (pch = RSTRING_PTR(text); pch < pch_end; pch++)
    {
        printf("%c\n", *pch);
    }
    ...
}

Here’s an example of one way you could iterate over the characters:
static VALUE print_single_char(VALUE s)
{
    char* pch;
    pch = StringValueCStr(s);
    // pch is now a pointer to a sequence of bytes representing the
    // character in whatever its encoding was. printf will work if the
    // console encoding is the same, otherwise you may get junk again.
    printf("%s\n", pch);
    return Qnil;
}

static VALUE test_method(VALUE self, VALUE text)
{
    rb_block_call(text, rb_intern("each_char"), 0, NULL, print_single_char, Qnil);
    return Qnil;
}
Note that once you convert any characters to C-strings you lose any associated encoding information. You might want to convert any input into a known encoding (such as UTF-8) before doing anything else:
text = rb_funcall(text, rb_intern("encode"), 1, rb_str_new_cstr("utf-8"));
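Once the text is known to be UTF-8, the bytes can also be walked directly in C. The following is a minimal sketch under that assumption; the helper names utf8_char_len and utf8_each_char are hypothetical, and the code only inspects the lead byte of each sequence (it does not validate continuation bytes).

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence that starts with `lead`,
   or 0 if `lead` cannot start a sequence (continuation or invalid byte). */
size_t utf8_char_len(unsigned char lead)
{
    if (lead < 0x80) return 1;            /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Walks the `n` bytes at `s` one character at a time. If `visit` is not
   NULL it is called with a pointer to each character's bytes and their
   count. Returns the number of characters, or -1 on malformed input. */
long utf8_each_char(const char *s, size_t n,
                    void (*visit)(const char *bytes, size_t len))
{
    assert(s != NULL);
    long count = 0;
    size_t i = 0;
    while (i < n) {
        size_t len = utf8_char_len((unsigned char)s[i]);
        if (len == 0 || len > n - i)
            return -1;                    /* bad lead byte or truncated tail */
        if (visit)
            visit(s + i, len);
        i += len;
        count++;
    }
    return count;
}
```

In the extension this would be called with RSTRING_PTR(text) and RSTRING_LEN(text), and a visitor could print each character with printf("%.*s\n", (int)len, bytes).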

A char is only one byte in size, so a multibyte character spans several char elements. If you deal with multibyte characters, one alternative is to use wchar_t instead, together with the corresponding wide functions such as wprintf.

Related

sscanf_s doesn't return first character of string

I'm trying to find the first string (max 4 characters) in a comma-separated list of strings inside a C char-array.
I'm trying to achieve this by using sscanf_s (under Windows) and the format-control string %[^,]:
char mystring[] = "STR1,STR2";
char temp[5];
if (sscanf_s(mystring, "%[^,]", temp, 5) != 0) {
    if (strcmp(temp, "STR1") == 0) { return 0; }
    else if (strcmp(temp, "STR2") == 0) { return 1; }
    else { return -1; }
}
After calling sscanf_s the content of temp is not STR1 but \0TR1 (\0 being the ASCII-interpretation of 0). And the value -1 is returned.
Why do I get that behavior and how do I fix my code to get the right result (return of 0)?
EDIT: changed char mystring to mystring[] (I should have made sure I typed it correctly here)
There are multiple problems in your code:
mystring is defined as a char, not a string pointer.
the argument 5 following temp in sscanf_s() should have type rsize_t, which is the same as size_t. You should specify it as sizeof(temp).
you should specify the maximum number of characters to store into the destination array in the format string, to avoid the counter-intuitive behavior of sscanf_s in case of overflow.
sscanf_s returns 1 if it can convert and store the string. Testing != 0 will also accept EOF which is an input failure, for which the contents of temp is indeterminate.
Here is a modified version:
const char *mystring = "STR1,STR2";
char temp[5];
if (sscanf_s(mystring, "%4[^,]", temp, sizeof temp) == 1) {
    if (strcmp(temp, "STR1") == 0) {
        return 0;
    } else if (strcmp(temp, "STR2") == 0) {
        return 1;
    } else {
        return -1;
    }
}
UPDATE: The OP uses Microsoft Visual Studio, which seems to have a non-conforming implementation of the so-called secure stream functions. Here is a citation from their documentation page:
The sscanf_s function reads data from buffer into the location that's given by each argument. The arguments after the format string specify pointers to variables that have a type that corresponds to a type specifier in format. Unlike the less secure version sscanf, a buffer size parameter is required when you use the type field characters c, C, s, S, or string control sets that are enclosed in []. The buffer size in characters must be supplied as an additional parameter immediately after each buffer parameter that requires it. For example, if you are reading into a string, the buffer size for that string is passed as follows:
wchar_t ws[10];
swscanf_s(in_str, L"%9s", ws, (unsigned)_countof(ws)); // buffer size is 10, width specification is 9
The buffer size includes the terminating null. A width specification field may be used to ensure that the token that's read in will fit into the buffer. If no width specification field is used, and the token read in is too big to fit in the buffer, nothing is written to that buffer.
In the case of characters, a single character may be read as follows:
wchar_t wc;
swscanf_s(in_str, L"%c", &wc, 1);
This example reads a single character from the input string and then stores it in a wide-character buffer. When you read multiple characters for non-null terminated strings, unsigned integers are used as the width specification and the buffer size.
char c[4];
sscanf_s(input, "%4c", &c, (unsigned)_countof(c)); // not null terminated
This specification is incompatible with the C Standard, which specifies the type of the width arguments to be rsize_t, and rsize_t to be the same type as size_t.
In conclusion, for improved portability, one should avoid these secure functions and use the standard functions correctly, with a maximum field width in the format string to prevent buffer overruns.
You can prevent the Visual Studio warning about deprecation of sscanf by inserting this definition before including <stdio.h>:
#ifdef _MSC_VER
#define _CRT_SECURE_NO_WARNINGS // let me use standard functions
#endif
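As an illustration of that advice, here is a small sketch that does the same job with the standard sscanf and a field width. The function name first_token_code is made up for this example; the return codes follow the question.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 0 if the first comma-separated token is "STR1", 1 if it is
   "STR2", and -1 otherwise (including when no token could be read). */
int first_token_code(const char *list)
{
    assert(list != NULL);
    char temp[5];                 /* 4 characters + terminating null */

    /* "%4[^,]" stores at most 4 non-comma characters plus the null,
       so temp cannot overflow. Only a return value of 1 means success. */
    if (sscanf(list, "%4[^,]", temp) != 1)
        return -1;

    if (strcmp(temp, "STR1") == 0) return 0;
    if (strcmp(temp, "STR2") == 0) return 1;
    return -1;
}
```

Comparing the result against 1 rather than 0 also rejects EOF, which sscanf returns for an empty input string.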
edited per the comment from chqrlie
regarding:
if(sscanf_s(mystring, "%[^,]",temp, 5) != 0){
The input format conversion specifier %[..] always appends a NUL byte after the characters it stores, so a 5-byte buffer holds at most 4 characters. The conversion specifier should therefore be "%4[^,]". The result after the correction is:
if(sscanf_s(mystring, "%4[^,]",temp, 5) != 0){
Also, no matter how many times this code snippet is executed, the extracted string, once the other problems are corrected, will ALWAYS be STR1.
regarding the statement:
char mystring = "STR1,STR2";
This is not a valid statement. Suggest:
char *mystring = "STR1,STR2"; // notice the '*'
--or--
char mystring[] = "STR1,STR2"; // notice the '[]'

C - Using sprintf() to put a prefix inside of a string

I'm trying to use sprintf() to put a string "inside itself", so I can change it to have an integer prefix. I was testing this on a character array of length 12 with "Hello World" inside it already.
The basic premise is that I want a prefix that denotes the amount of words within a string. So I copy 11 characters into a character array of length 12.
Then I try to put the integer followed by the string itself by using "%i%s" in the function. To get past the integer (I don't just use myStr as the argument for %s), I make sure to use myStr + snprintf(NULL, 0, "%i", wordCount), which should be myStr + characters taken up by the integer.
The problem I'm having is that it eats the 'H' when I do this and prints "2ello World" instead of having the '2' right beside the "Hello World".
So far I've tried different options for getting "past the integer" in the string when I try to copy it inside itself, but nothing really seems to be the right case, as it either comes out as an empty string or just the integer prefix itself '222222222222' copied throughout the entire array.
int main() {
    char myStr[12];
    strcpy(myStr, "Hello World");//11 Characters in length
    int wordCount = 2;
    //Put the integer wordCount followed by the string myStr (past whatever amount of characters the integer would take up) inside of myStr
    sprintf(myStr, "%i%s", wordCount, myStr + snprintf(NULL, 0, "%i", wordCount));
    printf("\nChanged myStr '%s'\n", myStr);//Prints '2ello World'
    return 0;
}
First, to insert a one-digit prefix into a string “Hello World”, you need a buffer of 13 characters—one for the prefix, eleven for the characters in “Hello World”, and one for the terminating null character.
Second, you should not pass a buffer to snprintf as both the output buffer and an input string. Its behavior is not defined by the C standard when objects passed to it overlap.
Below is a program that shows you how to insert a prefix by moving the string with memmove. This is largely tutorial, as it is not generally a good way to manipulate strings. For short strings, where space is not an issue, most programmers would simply print the desired string into a temporary buffer, avoiding overlap issues.
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

/*  Insert a decimal numeral for Prefix into the beginning of String.
    Length specifies the total number of bytes available at String.
*/
static void InsertPrefix(char *String, size_t Length, int Prefix)
{
    // Find out how many characters the numeral needs.
    int CharactersNeeded = snprintf(NULL, 0, "%i", Prefix);

    // Find the current string length.
    size_t Current = strlen(String);

    /*  Test whether there is enough space for the prefix, the current string,
        and the terminating null character.
    */
    if (Length < CharactersNeeded + Current + 1)
    {
        fprintf(stderr,
            "Error, not enough space in string to insert prefix.\n");
        exit(EXIT_FAILURE);
    }

    // Move the string to make room for the prefix.
    memmove(String + CharactersNeeded, String, Current + 1);

    /*  Remember the first character, because snprintf will overwrite it with a
        null character.
    */
    char Temporary = String[0];

    // Write the prefix, including a terminating null character.
    snprintf(String, CharactersNeeded + 1, "%i", Prefix);

    // Restore the first character of the original string.
    String[CharactersNeeded] = Temporary;
}

int main(void)
{
    char MyString[13] = "Hello World";
    InsertPrefix(MyString, sizeof MyString, 2);
    printf("Result = \"%s\".\n", MyString);
}
The best way to deal with this is to create another buffer to output to, and then if you really need to copy back to the source string then copy it back once the new copy is created.
There are other ways to "optimise" this if you really needed to, like putting your source string into the middle of the buffer so you can append and change the string pointer for the source (not recommended, unless you are running on an embedded target with limited RAM and the buffer is huge). Remember code is for people to read so best to keep it clean and easy to read.
#include <stdio.h>
#include <string.h>

#define MAX_BUFFER_SIZE 128

int main() {
    char srcString[MAX_BUFFER_SIZE];
    char destString[MAX_BUFFER_SIZE];
    strncpy(srcString, "Hello World", MAX_BUFFER_SIZE);
    int wordCount = 2;
    snprintf(destString, MAX_BUFFER_SIZE, "%i%s", wordCount, srcString);
    printf("Changed string '%s'\n", destString);
    // Or if you really want the string put back into srcString then:
    strncpy(srcString, destString, MAX_BUFFER_SIZE);
    printf("Changed string in source '%s'\n", srcString);
    return 0;
}
Notes:
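To guard against buffer overflows, prefer the bounded functions strncpy and snprintf. Be aware that strncpy does not null-terminate the destination when the source fills the whole buffer.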
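One way to make the overflow protection explicit is to check the value snprintf returns, since it reports how many characters the full result needs. This is only a sketch; format_with_prefix is a hypothetical helper name, and dst and src must not overlap.

```c
#include <assert.h>
#include <stdio.h>

/* Writes the decimal `count` followed by `src` into `dst`.
   Returns 0 on success, or -1 if the result does not fit in `dst_size`
   bytes. On -1 with dst_size > 0, `dst` holds a truncated but still
   null-terminated string. */
int format_with_prefix(char *dst, size_t dst_size, int count, const char *src)
{
    assert(dst != NULL && src != NULL);

    /* snprintf returns the length the complete output would have had,
       not counting the terminating null character. */
    int needed = snprintf(dst, dst_size, "%i%s", count, src);
    if (needed < 0 || (size_t)needed >= dst_size)
        return -1;
    return 0;
}
```

With "Hello World" and the prefix 2, a 13-byte destination succeeds and a 12-byte one reports failure, which matches the buffer size reasoning in the first answer.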

Replace "0x" in hexadecimal string to "\x" in C

I have a C library that requires hexadecimal input of the form "\xFF". I need to pass an array of hexadecimal values formatted as "0xFF". Is there a way to replace "0x" with "\x" in C?
That sounds like an easy string replacement operation, but I think that's not really what you need.
The notation "\xFF" in a C string means "this string contains the character whose encoded value is 0xFF, i.e. 255 decimal".
So if that's what you mean, then you need to do the compiler's job and replace the incoming "0xFF" text with the single character that has the code 0xFF.
There is no standard function for this, since it's typically done by the compiler.
To implement this, I would write a loop that looks for 0x, and every time it's found, use strtoul() to attempt to convert a number at that location. If the number is too long (i.e. 0xDEAD) you need to figure out how to handle that.
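The loop described above could look roughly like the sketch below. The function name hex_text_to_bytes is invented for this example, and rejecting any value above 0xFF is just one possible answer to the "too long" case.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Scans `text` for "0x" markers and stores the byte value that follows
   each one into `out`. Returns the number of bytes stored, or -1 if a
   value does not fit in one byte or `out` is too small. */
int hex_text_to_bytes(const char *text, unsigned char *out, size_t out_size)
{
    assert(text != NULL && out != NULL);
    size_t count = 0;
    const char *p = text;

    while ((p = strstr(p, "0x")) != NULL) {
        /* Skip a "0x" that is not followed by at least one hex digit. */
        if (!isxdigit((unsigned char)p[2])) {
            p += 2;
            continue;
        }
        char *end;
        unsigned long value = strtoul(p + 2, &end, 16);
        if (value > 0xFF || count == out_size)
            return -1;            /* e.g. 0xDEAD, or no room left */
        out[count++] = (unsigned char)value;
        p = end;                  /* continue after the digits just read */
    }
    return (int)count;
}
```

For the input "0x01,0x0a,0x0f" this stores the three bytes 0x01, 0x0a and 0x0f, which is the data the literal "\x01\x0a\x0f" would contain.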
You can use strstr in order to find the substring "0x" and then replace '0' with '\\':
#include <stdio.h>
#include <string.h>

int main(void)
{
    char s[] = "0x01,0x0a,0x0f";
    char *p = s;
    printf("%s\n", s);
    while (p) {
        p = strstr(p, "0x");
        if (p) *p = '\\';
    }
    printf("%s\n", s);
    return 0;
}
Output:
0x01,0x0a,0x0f
\x01,\x0a,\x0f
But as pointed out by @unwind and @Sathish, that's probably not what you need.

Parse data from input string

I have this kind of input data.
<html>......
<!-- OK -->
I only want to extract the data before the comment sign <!--.
This is my code:
char *parse_data(char *input) {
    char *parsed_data = malloc(strlen(input) * sizeof(char));
    sscanf(input, "%s<!--%*s", parsed_data);
    return parsed_data;
}
However, it doesn't seem to return the expected result, and I can't figure out why.
Could anyone explain to me the proper way to extract this kind of data, and the behavior of sscanf()?
Thank you!
The "%s" format specifier will not treat "<!--" as a single delimiter, or any of the individual characters as a delimiter (which would not be the correct behaviour anyway). Only whitespace is considered a delimiter. Scan sets are available in sscanf(), but they take a collection of individual characters rather than a sequence of characters representing a single delimiter. This means that everything in input before the first whitespace character will be assigned to parsed_data.
You could use strstr() instead:
const char* comment_start = strstr(input, "<!--");
char* result = 0;
if (comment_start)
{
    result = malloc(comment_start - input + 1);
    memcpy(result, input, comment_start - input);
    result[comment_start - input] = 0;
}
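Note that sizeof(char) is guaranteed to be 1, so it can be omitted from the malloc() argument calculation. The original malloc(strlen(input)) call also leaves no room for the terminating null character; it needs strlen(input) + 1 bytes.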
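Wrapped up as a complete function, the strstr() approach might look like this sketch. The name extract_before_comment is hypothetical, and returning NULL when there is no comment marker is an assumption carried over from the snippet above, where result stays 0 in that case.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated copy of everything in `input` that precedes
   the first "<!--", or NULL if there is no such marker or the allocation
   fails. The caller is responsible for calling free() on the result. */
char *extract_before_comment(const char *input)
{
    assert(input != NULL);

    const char *comment_start = strstr(input, "<!--");
    if (comment_start == NULL)
        return NULL;

    size_t len = (size_t)(comment_start - input);
    char *result = malloc(len + 1);   /* +1 for the terminating null */
    if (result == NULL)
        return NULL;

    memcpy(result, input, len);
    result[len] = '\0';
    return result;
}
```

Unlike "%s", this keeps any whitespace that appears before the comment, because the only delimiter it looks for is the full "<!--" sequence.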

Char to Int conversion in c

How can I convert a char to Int?
This is what I have done so far.
Thanks
scanf("%s", str );
printf("str: %s\n", str);
int i;
if(isdigit(*str))
i = (int) str;
else {
i = 3;
}
test case
7
str: 7
i: 1606415584
Edit: I could have sworn the post was tagged C++ at the start. I'll leave this up in case the OP is interested in C++ answers and the change to C tag was an edit.
A further option, which may be advanced given the question, is to use boost::lexical_cast as so:
scanf("%s", str );
printf("str: %s\n", str);
int i = boost::lexical_cast<int>( str );
I have used boost::lexical_cast a lot to convert between types, mostly strings and primitives when reading in user-defined properties. I find it an invaluable resource.
It's worth noting that boost::lexical_cast can throw exceptions, and these should be appropriately handled when you use the call. The link I posted at the start of this answer contains all the information you should need regarding that.
If you want to parse an integer from a string:
i = atoi(str);
You're mixing the character and string concepts here. str is a string, and str[0] (which is equivalent to *str) is a character, the first character of that string.
If you want to extract an integer from the string, try this
sscanf(str,"%d",&i);
Your
i = (int) str;
takes the address held in str (for completeness' sake, str is a pointer) and interprets that address as an integer, and that's why you get a result that's totally off.
You can convert strings to int by using sscanf
sscanf(str,"%d",&i);
http://www.cplusplus.com/reference/clibrary/cstdio/sscanf/
i = (int) str;
is the wrong way to convert a string to a number, because it copies an address into the variable i (the address that str points to).
You could try this:
i = atoi(str);
or
sscanf(str,"%d",&i);
to convert your string into a number.
Note that you cannot make sure the entered string is numeric with just isdigit(*str), because it only checks the first character of the string.
One possible way is this:
// Needs <ctype.h> for isdigit and <string.h> for strlen.
int isNumeric = 1;
for (size_t j = 0; j < strlen(str); j++)
    if (!isdigit((unsigned char)str[j]))
    {
        isNumeric = 0;
        break;
    }
if (isNumeric)
{
    // Code when the string is number
    // (e.g. convert the string to a number with atoi function)
}
else
{
    // Code when the string is NOT number
    // (e.g. show a error message)
}
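The validation and the conversion can also be done in one step with strtol, which reports where parsing stopped and whether the value overflowed. This is a sketch only; parse_int is a made-up name, and treating trailing characters as an error is an assumption about what "numeric" should mean here.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Returns 1 and stores the value in *out if `str` consists of a decimal
   integer that fits in an int; returns 0 otherwise. Like strtol itself,
   it accepts leading whitespace and an optional sign. */
int parse_int(const char *str, int *out)
{
    assert(str != NULL && out != NULL);

    char *end;
    errno = 0;
    long value = strtol(str, &end, 10);

    if (end == str)
        return 0;                 /* no digits were consumed */
    if (*end != '\0')
        return 0;                 /* trailing characters, e.g. "7x" */
    if (errno == ERANGE || value < INT_MIN || value > INT_MAX)
        return 0;                 /* out of range for int */

    *out = (int)value;
    return 1;
}
```

In contrast, atoi gives no way to tell a failed conversion from a genuine 0, which is why the extra checks are worth having for user input.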

Resources