Parsing ASCII in raw memory in C - c

My goal is to parse potentially garbled ASCII data containing text and numbers. So far I've been doing fine with getting a pointer to numbers within the text with strcmp and hard-coded pointer arithmetic and then converting them with strtol, but I feel there must be a better way.
Is there a function with the effect of scanf which takes a pointer to memory instead of input stream?

You'll be wanting sscanf. Note that it does expect a null-terminated string.

Related

Writing a C function to parse a Hex string and convert it to decimal?

I am just learning C and in my class this is a part of our first program.
The full description of the function I am trying to implement:
if that's tl;dr, a key point is that I am not allowed to use functions from other libraries (so something like srtol is ruled out).
int parseHexString(char *hexString, int *integerRead); The first
parameter is a null terminated C string, that represents a hexadecimal
integer. This function parses this string, accumulating the integer
value it represents. This integer value is placed at the location
pointed to by the second parameter, integerRead. If a bad hexadecimal
character, thus an invalid hex value, is encountered, this function
stops looking at further characters within the string and returns -1.
If a good hex value is parsed, it returns 0.
The correct way to implement this function looks at the first
character within the string first and does not use a stack to
accomplish the parsing. Your first assembly language program will need
to implement what this function accomplishes, so you will save
yourself time by implementing this function the correct way.
For this function, do not call any functions from any libraries; as an
exception, for debugging purposes only, you may use printf(). It will
help our grading if you remove your debugging code before you submit
your assignment.
I am NOT just looking for a full implementation of this function, just some tips or hints to get me started.
I feel as though there is some intuitive way of doing this, but right now I am blanking. I'm concerned with how I am supposed to start at the first character of the string and then go forward from there to convert it to decimal.
How about:
Get the length of the string
Walk the string from left to right
For each character:
Check if it's a valid hex character
Add its decimal value multiplied by 16^x, where x is the number of characters left on the right

How can I parse text input and convert strings to integers?

I have a file input, in which i have the following data.
1 1Apple 2Orange 10Kiwi
2 30Apple 4Orange 1Kiwi
and so on. I have to read this data from file and work on it but i dont know how to retrieve the data. I want to store 1(of 1 apple) as integer and then Apple as a string.
I thought of reading the whole 1Apple as a string. and then doing something with the stoi function.
Or I could read the whole thing character by character and then if the ascii value of that character lies b/w 48 to 57 then i will combine that as an integer and save the rest as string? Which one shall I do? Also how do I check what is the ASCII value of the char. (shall I convert the char to int and then compare, or is there any inbuilt function?)
How about using the fscanf() function if and only if your input pattern is not going to change. Otherwise you should probably use fgets() and perform checks if you want to separate the number from the string such as you suggested.
There is one easy right way to do this with standard C library facilities, one rather more difficult right way, and a whole lot of wrong ways. This is the easy right way:
Read an entire line into a char[] buffer using fgets.
Extract numbers from this line using strtol or strtoul.
It is very important to understand why the easier-looking alternatives (*scanf and atoi) should never be used. You might write less code initially, but once you start thinking about how to handle even slightly malformed input, you will discover that you should have used strtol.
The "rather more difficult right way" is to use lex and yacc. They are much more complicated but also much more powerful. You shouldn't need them for this problem.

Is sscanf considered safe to use?

I have vague memories of suggestions that sscanf was bad. I know it won't overflow buffers if I use the field width specifier, so is my memory just playing tricks with me?
I think it depends on how you're using it: If you're scanning for something like int, it's fine. If you're scanning for a string, it's not (unless there was a width field I'm forgetting?).
Edit:
It's not always safe for scanning strings.
If your buffer size is a constant, then you can certainly specify it as something like %20s. But if it's not a constant, you need to specify it in the format string, and you'd need to do:
char format[80]; //Make sure this is big enough... kinda painful
sprintf(format, "%%%ds", cchBuffer - 1); //Don't miss the percent signs and - 1!
sscanf(format, input); //Good luck
which is possible but very easy to get wrong, like I did in my previous edit (forgot to take care of the null-terminator). You might even overflow the format string buffer.
The reason why sscanf might be considered bad is because it doesnt require you to specify maximum string width for string arguments, which could result in overflows if the input read from the source string is longer. so the precise answer is: it is safe if you specify widths properly in the format string otherwise not.
Note that as long as your buffers are at least as long as strlen(input_string)+1, there is no way the %s or %[ specifiers can overflow. You can also use field widths in the specifiers if you want to enforce stricter limits, or you can use %*s and %*[ to suppress assignment and instead use %n before and after to get the offsets in the original string, and then use those to read the resulting sub-string in-place from the input string.
Yes it is..if you specify the string width so the are no buffer overflow related problems.
Anyway, like #Mehrdad showed us, there will be possible problems if the buffer size isn't established at compile-time. I suppose that put a limit to the length of a string that can be supplied to sscanf, could eliminate the problem.
All of the scanf functions have fundamental design flaws, only some of which could be fixed. They should not be used in production code.
Numeric conversion has full-on demons-fly-out-of-your-nose undefined behavior if a value overflows the representable range of the variable you're storing the value in. I am not making this up. The C library is allowed to crash your program just because somebody typed too many input digits. Even if it doesn't crash, it's not obliged to do anything sensible. There is no workaround.
As pointed out in several other answers, %s is just as dangerous as the infamous gets. It's possible to avoid this by using either the 'm' modifier, or a field width, but you have to remember to do that for every single text field you want to convert, and you have to wire the field widths into the format string -- you can't pass sizeof(buff) as an argument.
If the input does not exactly match the format string, sscanf doesn't tell you how many characters into the input buffer it got before it gave up. This means the only practical error-recovery policy is to discard the entire input buffer. This can be OK if you are processing a file that's a simple linear array of records of some sort (e.g. with a CSV file, "skip the malformed line and go on to the next one" is a sensible error recovery policy), but if the input has any more structure than that, you're hosed.
In C, parse jobs that aren't complicated enough to justify using lex and yacc are generally best done either with POSIX regexps (regex.h) or with hand-rolled string parsing. The strto* numeric conversion functions do have well-specified and useful behavior on overflow and do tell you how may characters of input they consumed, and string.h has lots of handy functions for hand-rolled parsers (strchr, strcspn, strsep, etc).
There is 2 point to take care.
The output buffer[s].
As mention by others if you specify a size smaller or equals to the output buffer size in the format string you are safe.
The input buffer.
Here you need to make sure that it is a null terminate string or that you will not read more than the input buffer size.
If the input string is not null terminated sscanf may read past the boundary of the buffer and crash if the memorie is not allocated.

Converting a UNICODE_STRING to ANSI or vice versa in C

I have a UNICODE_STRING that I would like to compare to a null-terminated ANSI string to check if they are the same. I'm using C. I would like to avoid including winternl.h for RtlInitUnicodeString.
What is the preferred method doing this?
Or, alternatively, is there any problem with me using MultiByteToWideChar() to convert the ANSI string to a wide-character representation and then comparing that to the UNICODE_STRING.buffer (with the understanding that the buffer might not be null-terminated)?
WideCharToMultiByte seems the more logical route. It can handle strings that aren't zero-terminated and produces a terminated one. And it tries to do something meaningful with codepoints that don't have a character in the system code page. Then just strcmp().
I would just convert the ANSI string using MultiByteToWideChar(). The CompareString() function takes length parameters for each string, so no worries about the missing null-terminator.
Just be careful about which parameters take or return bytes verses characters, and there should be no problems using these functions.

How can I get how many bytes sscanf_s read in its last operation?

I wrote up a quick memory reader class that emulates the same functions as fread and fscanf.
Basically, I used memcpy and increased an internal pointer to read the data like fread, but I have a fscanf_s call. I used sscanf_s, except that doesn't tell me how many bytes it read out of the data.
Is there a way to tell how many bytes sscanf_s read in the last operation in order to increase the internal pointer of the string reader? Thanks!
EDIT:
And example format I am reading is:
|172|44|40|128|32|28|
fscanf reads that fine, so does sscanf. The only reason is that, if it were to be:
|0|0|0|0|0|0|
The length would be different. What I'm wondering is how fscanf knows where to put the file pointer, but sscanf doesn't.
With scanf and family, use %n in the format string. It won't read anything in, but it will cause the number of characters read so far (by this call) to be stored in the corresponding parameter (expects an int*).
Maybe I´m silly, but I´m going to try anyway. It seems from the comment threads that there's still some misconception. You need to know the amount of bytes. But the method returns only the amount of fields read, or EOF.
To get to the amount of bytes, either use something that you can easily count, or use a size specifier in the format string. Otherwise, you won't stand a chance finding out how many bytes are read, other then going over the fields one by one. Also, what you may mean is that
sscanf_s(source, "%d%d"...)
will succeed on both inputs "123 456" and "10\t30", which has a different length. In these cases, there's no way to tell the size, unless you convert it back. So: use a fixed size field, or be left in oblivion....
Important note: remember that when using %c it's the only way to include the field separators (newline, tab and space) in the output. All others will skip the field boundaries, making it harder to find the right amount of bytes.
EDIT:
From "C++ The Complete Reference" I just read that:
%n Receives an integer value equal to
the nubmer of characters read so far
Isn't that precisely what you were after? Just add it in the format string. This is confirmed here, but I haven't tested it with sscanf_s.
From MSDN:
sscanf_s, _sscanf_s_l, swscanf_s, _swscanf_s_l
Each of these functions returns the number of fields successfully converted and assigned; the return value does not include fields that were read but not assigned. A return value of 0 indicates that no fields were assigned. The return value is EOF for an error or if the end of the string is reached before the first conversion.

Resources