Prevent a crash in string manipulation from crashing the whole application - C

I created a program which, at regular intervals, downloads a text file in CSV format from a website and parses it, extracting relevant data which is then displayed.
I have noticed that occasionally, every couple of months or so, it crashes. The crash is rare, considering that the download-and-parse cycle can run every 5 minutes or even less. I am pretty sure it crashes inside the function that parses the string and extracts the data. The crashes happen when the internet connection is congested, i.e. during heavy downloads and/or on a slow connection. Occasionally the remote site may be serving corrupt or incomplete data.
I used a test build which saves the data before processing it, and it indeed shows the data was incomplete when a crash happened.
I have adapted the function to handle a number of cases of invalid or incomplete data, and I check all return values. I also check the return values of the various functions used to connect to the remote site and download the data, and do not go further when a return value indicates failure.
The core of the function uses strsep() to walk through the data and extract information out of it:
/*
 * delimiters typically contains: <;>, <">, < >
 * strsep() is used to split part of the string using the delimiters
 * and copy the token, which is then copied into the array
 * normally the function stops well before ARRAYSIZE, which is just a safeguard;
 * it would normally stop when the end of the data is reached, i.e. '\0'
 */
for (n = 0; n < ARRAYSIZE; n++)
{
    token = strsep(&copy_of_downloaded_data, delimiters);
    if (token == NULL)
        break;
    data->array[n].example = strndup(token, strlen(token));
    if (data->array[n].example != NULL)
    {
        token = strsep(&copy_of_downloaded_data, delimiters);
        if (token == NULL)
            break;
        (..)
        copy_of_downloaded_data = strchr(copy_of_downloaded_data, '\n'); /* find newline */
        if (copy_of_downloaded_data == NULL)
            break;
        copy_of_downloaded_data = copy_of_downloaded_data + 1;
        if (*copy_of_downloaded_data == '\0') /* stop at end of text */
            break;
    }
}
Since I suspect I cannot account for every way in which the data can be corrupted, I would like to know whether there is a way to program this so that the function does not crash the whole application when it encounters corrupted data.
If that is not possible, what could I do to make it more robust?
Edit: One possible instance of a crash is when the data ends abruptly, with the middle of a field cut off, e.g.:
"test","example","this data is brok
At least, that is what I noticed by looking through the saved data, although it is not consistent. I will have to stress-test it as suggested below.

The best thing to do would be to figure out what input causes the function to crash, and fix the function so that it does not crash. Since the function is doing string processing, this should be possible to do by feeding it lots of dummy/test data (or feeding it the "right" test data if it's a particular input that causes the crash). You basically want to torture-test the function until you find out how to make it crash on demand; at that point you can start investigating exactly where and why it crashes, and once you understand that, the necessary changes to fix the crash will probably become obvious to you.
Running the program under valgrind might also point you to the bug.
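For example (assuming the binary is called parser_app here; substitute your own), a memcheck run with origin tracking will usually point straight at the invalid read or write:

valgrind --track-origins=yes ./parser_app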
If for some reason you can't fix the bug, the other option is to spawn a child process and run the buggy code inside the child process. That way if it crashes, only the child process is lost and not the parent. (You can spawn the child process under most OS's by calling fork(); you'll need to come up with some way for the child process to communicate its results back to the parent process, of course). (Note that doing it this way is a kludge and will likely not be very efficient, and could also introduce a security hole into your application if someone malicious who has the ability to send your program input can figure out how to manipulate the bug in order to take control of the child process -- so I don't recommend this approach!)
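A minimal sketch of that isolation approach, assuming a hypothetical parse_to_string() that returns the parsed result as a single malloc()ed string; the child does the risky parsing and sends the result back through a pipe, and the parent treats a crashed child as just another failed download cycle:

#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* hypothetical parser: returns a malloc()ed result string, or NULL on failure */
extern char *parse_to_string(const char *raw);

/* run the parser in a child process so a crash there cannot take down the parent;
 * returns a malloc()ed result string, or NULL if the child crashed or failed */
char *parse_isolated(const char *raw)
{
    int fd[2];
    if (pipe(fd) == -1)
        return NULL;

    pid_t pid = fork();
    if (pid == -1) {
        close(fd[0]);
        close(fd[1]);
        return NULL;
    }

    if (pid == 0) {                            /* child: do the risky parsing */
        close(fd[0]);
        char *res = parse_to_string(raw);
        if (res != NULL)
            write(fd[1], res, strlen(res));
        close(fd[1]);
        _exit(res != NULL ? 0 : 1);
    }

    close(fd[1]);                              /* parent: collect the child's output */
    char chunk[4096];
    char *out = NULL;
    size_t outlen = 0;
    ssize_t n;
    while ((n = read(fd[0], chunk, sizeof chunk)) > 0) {
        char *tmp = realloc(out, outlen + (size_t)n + 1);
        if (tmp == NULL) {
            free(out);
            out = NULL;
            break;
        }
        out = tmp;
        memcpy(out + outlen, chunk, (size_t)n);
        outlen += (size_t)n;
        out[outlen] = '\0';
    }
    close(fd[0]);

    int status;
    waitpid(pid, &status, 0);
    if (out == NULL || !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        free(out);                             /* child crashed or reported failure */
        return NULL;
    }
    return out;
}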

What does the coredump point to?
strsep() does not have any memory synchronization mechanism, so if more than one thread touches the same buffer, protect it as a critical section (lock around the strsep() calls)?
See whether strsep() can handle a big chunk of data (ARRAYSIZE is not going to help you here).
Check the stack size of the thread/program that receives copy_of_downloaded_data (I know you are only referencing it, so look at the function that receives it).

I would suggest that one should try to write code that keeps track of string lengths deliberately and doesn't care whether strings are zero-terminated or not. Even though null pointers have been termed the "billion dollar mistake"(*), I think zero-terminated strings are far worse. While there may be some situations where code using zero-terminated strings might be "simpler" than code that tracks string lengths, the extra effort required to make sure that nothing can cause string-handling code to exceed buffer boundaries exceeds the effort required when working with known-length strings.
If, for example, one wants to store the concatenation of strings of length length1 and length2 into a buffer of length BUFF_SIZE, one can easily test whether length1+length2 <= BUFF_SIZE if one isn't expecting strings to be null-terminated, or length1+length2 < BUFF_SIZE if one expects a gratuitous null byte to follow every string. When using zero-terminated strings, one would have to determine the lengths of the two strings before concatenation anyway, and having done so one could just as well use memcpy() rather than strcpy() or the useless strcat().
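As a small concrete illustration of that point (the names concat_counted and BUFF_SIZE are made up for this example), the bounds check becomes a single comparison on the known lengths, and memcpy() does the copying:

#include <string.h>

#define BUFF_SIZE 64

/* concatenate two counted strings into buf; returns 0 on success, -1 if it won't fit */
int concat_counted(char *buf,
                   const char *s1, size_t length1,
                   const char *s2, size_t length2)
{
    if (length1 + length2 + 1 > BUFF_SIZE)   /* +1 for a trailing '\0' */
        return -1;
    memcpy(buf, s1, length1);
    memcpy(buf + length1, s2, length2);
    buf[length1 + length2] = '\0';           /* terminate once, at the very end */
    return 0;
}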
(*) There are many situations where it's much better to have a recognizably-invalid pointer than to require that pointers which can't point to anything meaningful must instead point to something meaningless. Many null-pointer related problems actually stem from a failure of implementations to trap arithmetic with null pointers; it's not fair to blame null pointers for problems that could have been, but weren't, avoided.

Related

How to deal with SEGFAULT-s as a beginner? (C)

I have a problem on SPOJ and a deadline soon.
I'm tasked to write a program that counts how many identifiers are in a given line. An identifier is defined as a sequence of characters from the set 'a'-'z', 'A'-'Z', '0'-'9' or '_', starting with a letter or the underscore character ('_').
Input
Some number of data sets are given. Each data set is a line consisting of a sequence of words, separated by spaces and finished with the end-of-line character (even the last line). A word is a sequence of ASCII characters with codes from 33 to 126 (see http://www.asciitable.com for more details), e.g., aqui28$-3q or _dat_. The second word is an identifier, but the first one is not.
Output
The number of identifiers in each line.
Example
Input:
Dato25 has 2 c-ats and 3 _dogs
op8ax _yu _yu67 great-job ax~no identifier.
Output:
4
3
The code I wrote compiles, but when submitting it returns SIGSEGV (Segmentation Fault).
Your code exhibits a lamentably common anti-pattern: unnecessarily reading a large chunk of data into memory before processing it, when it could instead be processed as you go.
Consider: when you're processing one input word, do you need to refer to previous or subsequent words of the same or any other line? No. So why are you keeping all that around in memory?
In fact, you don't need to store any part of any of the words, except the single character you have just read. You simply need to:
track how many identifiers you've seen so far on the current line and what kind of thing you're parsing at any given time (possible identifier, non-identifier word, or spaces),
update that appropriately for each character read, and
emit appropriate output (based on the preceding) at the end of each line.
That is likely to be faster than your approach. It will certainly use less memory, and especially less stack memory. And it affords little to no room for any kind of bounds overrun or invalid pointer use, such as are the usual reasons for a memory error such as a segfault.
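As an illustration of that streaming approach (my own sketch, not a repair of the posted code), a small state machine reading one character at a time with getchar() is enough:

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    int c;
    int count = 0;                            /* identifiers seen on the current line */
    enum { BETWEEN, IDENT, OTHER } state = BETWEEN;

    while ((c = getchar()) != EOF) {
        if (c == '\n') {                      /* end of line: report and reset */
            if (state == IDENT)
                count++;
            printf("%d\n", count);
            count = 0;
            state = BETWEEN;
        } else if (c == ' ') {                /* end of word */
            if (state == IDENT)
                count++;
            state = BETWEEN;
        } else if (state == BETWEEN) {        /* first character of a new word */
            state = (isalpha(c) || c == '_') ? IDENT : OTHER;
        } else if (state == IDENT) {          /* inside a possible identifier */
            if (!isalnum(c) && c != '_')
                state = OTHER;
        }
    }
    return 0;
}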
As to why your original program segfaults, I ran it using valgrind, which is a popular tool for identifying memory usage problems. It can detect memory leaks, some out-of-bounds accesses, and use of uninitialized memory, among other things. It showed me that you never initialize the ident_count of any line[i]. Non-static local variables such as your line are not automatically initialized to anything in particular. Sometimes you can luck out with that, and it's not the cause of your particular issue, but cultivate good programming practices: fix it.
Valgrind did not indicate any other errors to me, however, nor did your program segfault for me with the example input. Nevertheless, I anticipate that I could wreak all kinds of havoc in your program by feeding it input with more than 100 lines and/or more than 300 words in a line and/or more than 50 characters in a word. Automated judges tend to include test cases that explore the extremes of the problem space, so you need to be sure that your program works for all valid inputs.
Alternatively, a valid point is made in comments that you are allocating a large object on the stack, and stack space may not be sufficient for it in the judge's test environment. If that's the issue, then a quick and easy way to resolve it in your current code would be to allocate only one struct WORDS and reuse it for every line. That will reduce your stack usage by about a factor of 100, and again, what purpose is served by storing all the lines in memory at the same time anyway?

fgetc vs getline or fgets - which is most flexible

I am reading data from a regular file and I was wondering which would allow for the most flexibility.
I have found that fgets and getline both read in a line (one with a maximum number of characters, the other with dynamic memory allocation). In the case of fgets, if the length of the line is bigger than the given size, the rest of the line is not read but remains buffered in the stream. With getline, I am worried that it may attempt to allocate a large block of memory for an obscenely long line.
The obvious solution for me seems to be turning to fgetc, but this comes with the problem that there will be many calls to the function, thereby resulting in the read process being slow.
Is this compromise between flexibility and efficiency unavoidable in either case, or can it be worked around?
The three functions you mention do different things:
fgetc() reads a single character from a FILE * stream. It buffers input, so you can process the file in a buffered way without the overhead of making a system call for each character. When your problem can be handled in a character-oriented way, it is the best choice.
fgets() reads a single line from a FILE * stream; it's like calling fgetc() repeatedly to fill the character array you pass to it, in order to read line by line. It has the drawback of making a partial read when the input line is longer than the buffer size you specify. This function also buffers input data, so it is very efficient. If you know your lines will be bounded, this is the best way to read data line by line. Sometimes you want to process data with unbounded line sizes and must redesign your problem around the available memory; then the one below is probably the better choice.
getline() is relatively new and is not ANSI C, so it is possible you will port your program to some platform that lacks it. It's the most flexible, at the price of being the least efficient. It requires a reference to a pointer that is realloc()ated to hold more and more data. It doesn't bound the line length, at the cost of potentially consuming all the memory available on the system. Both the buffer pointer and the buffer size are passed by reference so they can be updated, letting you know where the new string is located and its new size. The buffer must be free()d after use.
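For reference, a minimal sketch of that getline() calling convention (POSIX, not ANSI C): one buffer is reused for every line and freed once after the loop:

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char *line = NULL;        /* getline() allocates and grows this buffer */
    size_t cap = 0;           /* current capacity, updated by getline() */
    ssize_t len;              /* length of the line read, or -1 on EOF/error */

    while ((len = getline(&line, &cap, stdin)) != -1) {
        /* line holds len characters, normally including the trailing '\n' */
        printf("%zd bytes: %s", len, line);
    }
    free(line);               /* a single free() after the last call */
    return 0;
}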
The reason for having three functions and not just one is that different cases have different needs, and selecting the most efficient one for the case at hand is normally the best choice.
If you plan to use only one of them, you will probably end up in a situation where the function you selected as the most flexible is not the best choice, and it will let you down.
Much is case dependent.
getline() is not part of the standard C library. Its functionality may differ, depending on the implementation and what other standards it follows, and thus an advantage goes to the standard fgetc()/fgets().
... case between flexibility and efficiency unavoidable, ...
OP is missing the higher priorities.
Functionality - If code cannot function right with the selected function, why use it? Example: fgets() and reading null characters create issues.
Clarity - without clarity, feel the wrath of the poor soul who later has to maintain the code.
would allow for the most flexibility. (?)
fgetc() allows for the most flexibility at the low level, yet helper functions using it to read lines tend to fail on corner cases.
fgets() allows for the most flexibility at the mid level; you still have to deal with long lines and lines with embedded null characters, but at least the low-level slogging in the weeds is avoided.
getline() is useful when high portability is not needed and the risk of allowing the user to overwhelm resources is not a concern.
For robust handling of user/file input when reading a line, create a wrapping function (e.g. int my_read_line(size_t size, char *buf, FILE *f)) and call that, and only that, in user code. Then when issues arise, they can be handled locally, regardless of the low-level input function selected.
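One possible shape for such a wrapper, sketched here on top of fgets() (the name and the return-value policy are illustrative, not a fixed recipe):

#include <stdio.h>
#include <string.h>

/* Returns the line length on success, -1 on end-of-file or error,
 * and -2 if the line did not fit (the remainder is discarded). */
int my_read_line(size_t size, char *buf, FILE *f)
{
    if (size == 0 || fgets(buf, (int)size, f) == NULL)
        return -1;

    size_t len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n') {
        buf[--len] = '\0';                   /* strip the newline */
        return (int)len;
    }

    /* no newline in the buffer: the line filled it exactly, ended the file,
     * or was too long to fit */
    int c = fgetc(f);
    if (c == EOF || c == '\n')
        return (int)len;                     /* nothing was lost */
    while (c != EOF && c != '\n')
        c = fgetc(f);                        /* discard the rest of the over-long line */
    return -2;
}

Note that, like fgets() itself, this sketch is blind to embedded null characters; centralising the input in one function is precisely what makes such policy decisions easy to revisit later.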

What are the alignments referred to when discussing the strings section of a process address space

I'm trying to write a program to expose the arguments of other pids on macOS. I've made the KERN_PROCARGS2 sysctl call, but it turns out that everyone and their dog uses this wrong, including Apple's ps and Google's Chrome. The exec family of functions all allow you to pass an empty string as argv[0], which is not great, but it can happen and so must be dealt with. In that case, the standard approach of skipping forward past the NULs following exec_path in the returned buffer doesn't work, as the last NUL before the rest of the arguments is actually the terminating NUL of the empty string, so you wind up skipping an argument you didn't mean to skip, which can result in printing an env var as an argument (I've confirmed this behaviour in many programs).
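For readers who have not used it, here is a minimal sketch of fetching that buffer (error handling trimmed; the layout comment summarises the kern_sysctl.c source linked below):

#include <stdlib.h>
#include <string.h>
#include <sys/sysctl.h>
#include <sys/types.h>

/* Fetch the raw KERN_PROCARGS2 buffer for a pid; the caller free()s it.
 * Layout: a leading int argc, then the exec path, then NUL padding,
 * then the argv strings, then the environment strings. */
char *fetch_procargs(pid_t pid, size_t *size_out, int *argc_out)
{
    int mib_max[2] = { CTL_KERN, KERN_ARGMAX };
    int argmax = 0;
    size_t len = sizeof argmax;
    if (sysctl(mib_max, 2, &argmax, &len, NULL, 0) == -1 || argmax <= 0)
        return NULL;

    char *buf = malloc((size_t)argmax);
    if (buf == NULL)
        return NULL;

    int mib[3] = { CTL_KERN, KERN_PROCARGS2, (int)pid };
    size_t size = (size_t)argmax;
    if (sysctl(mib, 3, buf, &size, NULL, 0) == -1) {
        free(buf);
        return NULL;
    }

    memcpy(argc_out, buf, sizeof(int));      /* the buffer starts with argc */
    *size_out = size;
    return buf;
}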
To do this properly one must calculate how many NULs to skip, instead of skipping them all every time. There are references around the web to the different parts of the returned buffer being pointer-aligned; however, no matter which part of the buffer I check with len % 8, I don't get a correct count of the padding NULs.
https://github.com/apple/darwin-xnu/blob/main/bsd/kern/kern_sysctl.c#L1528
https://lists.apple.com/archives/darwin-kernel/2012/Mar/msg00025.html
https://chromium.googlesource.com/crashpad/crashpad/+/refs/heads/master/util/posix/process_info_mac.cc#153
I wrote a library to do this correctly: https://getargv.narzt.cam

what's more efficient: reading from a file or allocating memory

I have a text file and I need to allocate an array with as many entries as there are lines in the file. What's more efficient: to read the file twice (first to find out the number of lines) and allocate the array once, or to read the file once and use "realloc" after each line read? Thank you in advance.
Reading the file twice is a bad idea, regardless of efficiency. (It's also almost certainly less efficient.)
If your application insists on reading its input twice, that means its input must be rewindable, which excludes terminal input and pipes. That's a limitation so annoying that apps which really need to read their input more than once (like sort) generally have logic to make a temporary copy if the input is unseekable.
In this case, you are only trying to avoid the trivial overhead of a few extra malloc calls. That's not justification to limit the application's input options.
If that's not convincing enough, imagine what will happen if someone appends to the file between the first time you read it and the second time. If your implementation trusts the count it got on the first read, it will overrun the vector of line pointers on the second read, leading to Undefined Behaviour and a potential security vulnerability.
I presume you want to store the lines you read as well, and not just allocate an array with that many entries.
I also presume you don't want to change the lines and then write them back; in that case you might be better off using mmap.
Reading a file twice is always bad; even if it is cached the second time, too many system calls are needed. Also, allocating every line separately is a waste of time if you don't need to deallocate them in a random order.
Instead read the entire file at once, into an allocated area.
Find the number of lines by finding line feeds.
Allocate an array.
Put the start pointers into the array by finding the same line feeds again.
If you need the lines as strings, replace each line feed with '\0' (see the sketch below).
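A minimal sketch of those steps, assuming the whole file fits in memory (a final line with no trailing line feed is ignored here for brevity):

#include <stdio.h>
#include <stdlib.h>

/* Reads all of f into one buffer, builds an array of pointers to the start of
 * each line, and replaces each line feed with '\0'.  The caller frees both
 * *data and the returned array.  Returns NULL on error. */
char **read_lines(FILE *f, char **data, size_t *nlines)
{
    /* 1. read the entire file into one allocated block */
    size_t cap = 1 << 16, used = 0, n;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;
    while ((n = fread(buf + used, 1, cap - used, f)) > 0) {
        used += n;
        if (used == cap) {
            char *tmp = realloc(buf, cap *= 2);
            if (tmp == NULL) { free(buf); return NULL; }
            buf = tmp;
        }
    }

    /* 2. count the line feeds to size the pointer array */
    size_t count = 0;
    for (size_t i = 0; i < used; i++)
        if (buf[i] == '\n')
            count++;

    /* 3. allocate the array of line pointers */
    char **lines = malloc((count ? count : 1) * sizeof *lines);
    if (lines == NULL) { free(buf); return NULL; }

    /* 4./5. record each line start and terminate it at the line feed */
    size_t li = 0;
    char *start = buf;
    for (size_t i = 0; i < used; i++) {
        if (buf[i] == '\n') {
            buf[i] = '\0';
            lines[li++] = start;
            start = buf + i + 1;
        }
    }

    *data = buf;
    *nlines = li;
    return lines;
}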
This might also be improved upon on modern CPU architectures: instead of scanning the data twice, it might be faster to simply allocate a "large enough" array for the pointers and scan the data only once. That means one realloc at the end to shrink the array to the right size, and potentially a couple more along the way to grow it if it wasn't large enough to start with.
Why might this be faster? Because you have a lot of ifs that can take a lot of time for each line, so it's better to only have to do that pass once; the cost is the reallocation, but copying large arrays with memcpy can be comparatively cheap.
But you have to measure it; your system settings, buffer sizes etc. will influence things too.
The answer to "What's more efficient/faster/better? ..." is always:
Try each one on the system you're going to use it on, measure your results accurately, and find out.
The term is "benchmarking".
Anything else is a guess.

Unicode normalization through ICU4C

I want to normalize a string using the ICU C interface.
Looking at unorm2_normalize, I have some questions.
The UNormalizer2 instance -- how do I dispose of it after I'm done with it?
What if the buffer isn't large enough for decomposition or recomposition? Is the normal way to check if the error code is U_BUFFER_OVERFLOW_ERROR? Does U_STRING_NOT_TERMINATED_WARNING apply? Is the resulting string null-terminated? If an error is returned, do I reallocate memory and try again? It seems like a waste of time to start all over again.
See unorm2_close(). Note that you should not close instances acquired via unorm2_getInstance().
In general, most ICU APIs can be passed a NULL buffer and a capacity of 0, which should result in U_BUFFER_OVERFLOW_ERROR and the required length being returned. If you get U_STRING_NOT_TERMINATED_WARNING it means just that: the data is populated but not terminated.
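A minimal sketch of that preflight-then-retry pattern, using the NFC instance as an example (UTF-16 in and out; error handling abbreviated):

#include <stdlib.h>
#include <unicode/unorm2.h>

/* Normalize src (UTF-16, srcLen units) to NFC; returns a malloc()ed,
 * NUL-terminated UChar buffer that the caller frees, or NULL on failure. */
UChar *normalize_nfc(const UChar *src, int32_t srcLen)
{
    UErrorCode status = U_ZERO_ERROR;
    const UNormalizer2 *nfc = unorm2_getNFCInstance(&status);  /* shared instance: do not close */
    if (U_FAILURE(status))
        return NULL;

    /* first pass: NULL buffer, capacity 0 -> U_BUFFER_OVERFLOW_ERROR plus the needed length */
    int32_t needed = unorm2_normalize(nfc, src, srcLen, NULL, 0, &status);
    if (status != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(status))
        return NULL;

    UChar *dest = malloc(((size_t)needed + 1) * sizeof(UChar));  /* +1 for the terminator */
    if (dest == NULL)
        return NULL;

    /* second pass: with enough capacity the result is NUL-terminated,
     * so U_STRING_NOT_TERMINATED_WARNING will not be set */
    status = U_ZERO_ERROR;
    unorm2_normalize(nfc, src, srcLen, dest, needed + 1, &status);
    if (U_FAILURE(status)) {
        free(dest);
        return NULL;
    }
    return dest;
}

If the two-call pattern feels wasteful, one common alternative is to try a reasonably sized buffer first and only fall back to preflighting when U_BUFFER_OVERFLOW_ERROR is reported.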
