I have a problem on SPOJ and a deadline soon.
I'm tasked to write a program that counts how many identifiers are in a given line. An identifier is defined as a sequence of characters from the set 'a'-'z', 'A'-'Z', '0'-'9' or '_', starting with any letter or the underscore character ('_').
Input
There are given some number of data sets. Each data set is a line consisting from the sequence of some number of words, separated by spaces and finishing with the end of line character (even the last one line). A word is a sequence of any ASCII character of code from 33 till 126 (see http://www.asciitable.com for more details), e.g., aqui28$-3q or _dat_ The second word is an identifier, but the first one is not.
Output
The number of identifiers in each line.
Example
Input:
Dato25 has 2 c-ats and 3 _dogs
op8ax _yu _yu67 great-job ax~no identifier.
Output:
4
3
The code I wrote compiles, but when submitting it returns SIGSEGV (Segmentation Fault).
Your code exhibits a lamentably common anti-pattern: unnecessarily reading a large chunk of data into memory before processing it, when it could instead be processed as you go.
Consider: when you're processing one input word, do you need to refer to previous or subsequent words of the same or any other line? No. So why are you keeping all that around in memory?
In fact, you don't need to store any part of any of the words, except a single character you have just read. You simply need to
track how many identifiers you've seen so far on the current line and what kind of thing you're parsing at any given time (possible identifier, non-identifier word, or spaces),
update that appropriately for each character read, and
emit appropriate output (based on the preceding) at the end of each line.
That is likely to be faster than your approach. It will certainly use less memory, and especially less stack memory. And it affords little to no room for any kind of bounds overrun or invalid pointer use, such as are the usual reasons for a memory error such as a segfault.
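As a minimal sketch of that character-at-a-time approach (not the asker's code, and not necessarily what the judge expects), the state can be a three-valued enum plus a counter. The names count_identifiers, enum state and the helper predicates are made up for illustration, and the function walks a string rather than stdin only so that it is easy to try out; the same loop body works with getchar().

```c
#include <assert.h>
#include <stddef.h>

/* Parser states: between words, inside a word that is still a valid
 * identifier, or inside a word already known not to be one. */
enum state { IN_SPACE, IN_IDENT, IN_OTHER };

static int is_letter_or_underscore(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

static int is_digit(char c)
{
    return c >= '0' && c <= '9';
}

/* Count identifiers in one line, looking at one character at a time.
 * Only the current state and the running count are kept. */
int count_identifiers(const char *line)
{
    assert(line != NULL);
    enum state st = IN_SPACE;
    int count = 0;

    for (;; line++) {
        char c = *line;
        int word_ended = (c == ' ' || c == '\n' || c == '\0');

        if (word_ended) {
            if (st == IN_IDENT)
                count++;            /* a valid identifier just ended */
            st = IN_SPACE;
            if (c != ' ')
                break;              /* end of line or end of string */
        } else if (st == IN_SPACE) {
            /* first character of a word decides whether it can qualify */
            st = is_letter_or_underscore(c) ? IN_IDENT : IN_OTHER;
        } else if (st == IN_IDENT &&
                   !is_letter_or_underscore(c) && !is_digit(c)) {
            st = IN_OTHER;          /* disqualified by a later character */
        }
    }
    return count;
}
```

Nothing here depends on how long a word or a line is, so there is no array to overrun.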
As to why your original program segfaults, I ran it using valgrind, which is a popular tool for identifying memory usage problems. It can detect memory leaks, some out-of-bounds accesses, and use of uninitialized memory, among other things. It showed me that you never initialize the ident_count of any line[i]. Non-static local variables such as your line are not automatically initialized to anything in particular. Sometimes you can luck out with that, and it's not the cause of your particular issue, but cultivate good programming practices: fix it.
Valgrind did not indicate any other errors to me, however, nor did your program segfault for me with the example input. Nevertheless, I anticipate that I could wreak all kinds of havoc in your program by feeding it input with more than 100 lines and / or more than 300 words in a line, and / or more than 50 characters in a word. Automated judges tend to include test cases that explore the extremes of the problem space, so you need to be sure that your program works for all valid inputs.
Alternatively, a valid point is made in comments that you are allocating a large object on the stack, and stack space may not be sufficient for it in the judge's test environment. If that's the issue, then a quick and easy way to resolve it in your current code would be to allocate only one struct WORDS and reuse it for every line. That will reduce your stack usage by about a factor of 100, and again, what purpose is served by storing all the lines in memory at the same time anyway?
I have a text file and I should allocate an array with as many entries as the number of lines in the file. What's more efficient: to read the file twice (first to find out the number of lines) and allocate the array once, or to read the file once, and use "realloc" after each line read? thank you in advance.
Reading the file twice is a bad idea, regardless of efficiency. (It's also almost certainly less efficient.)
If your application insists on reading its input twice, that means its input must be rewindable, which excludes terminal input and pipes. That's a limitation so annoying that apps which really need to read their input more than once (like sort) generally have logic to make a temporary copy if the input is unseekable.
In this case, you are only trying to avoid the trivial overhead of a few extra malloc calls. That's not justification to limit the application's input options.
If that's not convincing enough, imagine what will happen if someone appends to the file between the first time you read it and the second time. If your implementation trusts the count it got on the first read, it will overrun the vector of line pointers on the second read, leading to Undefined Behaviour and a potential security vulnerability.
I presume you want to store the read lines also and not just allocate an array of that many entries.
Also that you don't want to change the lines and then write them back as in that case you might be better off using mmap.
Reading a file twice is always bad; even if it is cached the second time, too many system calls are needed. Also, allocating every line separately is a waste of time if you don't need to deallocate them in a random order.
Instead read the entire file at once, into an allocated area.
Find the number of lines by finding line feeds.
Alloc an array
Put the start pointers into the array by finding the same line feeds again.
If you need it as strings, then replace the line feed with \0
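The steps above could look roughly like this, assuming the whole file has already been read into buf with one spare byte at the end. The function name index_lines and its signature are made up for this sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Split a buffer holding a whole file into lines, in place.
 * Returns a malloc'd array of pointers to line starts and stores the
 * number of lines in *count; returns NULL on allocation failure.
 * buf must have room for one extra byte at buf[len]. */
char **index_lines(char *buf, size_t len, size_t *count)
{
    assert(buf != NULL && count != NULL);
    buf[len] = '\0';

    /* Pass 1: count line feeds; a final line without '\n' still counts. */
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (buf[i] == '\n')
            n++;
    if (len > 0 && buf[len - 1] != '\n')
        n++;

    /* Allocate at least one slot so an empty file does not ask for 0 bytes. */
    char **lines = malloc((n ? n : 1) * sizeof *lines);
    if (lines == NULL)
        return NULL;

    /* Pass 2: record each line start and turn '\n' into '\0'. */
    size_t k = 0;
    char *start = buf;
    while (k < n) {
        lines[k++] = start;
        char *nl = memchr(start, '\n', (size_t)(buf + len - start));
        if (nl == NULL)
            break;                  /* last line had no line feed */
        *nl = '\0';
        start = nl + 1;
    }
    *count = n;
    return lines;
}
```

There are only two allocations in total (the buffer and the pointer array), and freeing both releases every line at once.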
This might also be improved upon on modern CPU architectures: instead of scanning the buffer twice, it might be faster to simply allocate a "large enough" array for the pointers and scan the buffer once. This will cause a realloc at the end to get the right size, and potentially a couple more along the way to make the array larger if it wasn't large enough at the start.
Why is this faster? Because you have a lot of ifs that can take a lot of time for each line, so it's better to do this only once. The cost is the reallocation, but copying large arrays with memcpy can be a bit cheaper.
But you have to measure it, your system settings, buffer sizes etc. will influence things too.
The answer to "What's more efficient/faster/better? ..." is always:
Try each one on the system you're going to use it on, measure your results accurately, and find out.
The term is "benchmarking".
Anything else is a guess.
Here is my approach:
int linesize=1;
int ReadStatus;
char buff[200];
ReadStatus=read(file,buff,linesize);
while(buff[linesize-1]!='\n' && ReadStatus!=0)
{
    linesize++;
    ReadStatus=read(file,buff,linesize);
}
Is this idea right?
I think my code is a bit inefficient because the run time is O(FileWidth); however I think it can be O(log(FileWidth)) if we exponentially increase linesize to find the linefeed character.
What do you think?
....
I just saw a new problem. How do we read the second line?. Is there anyway to delimit the bytes?
Is this idea right?
No. At the heart of a comment written by Siguza, lies the summary of an issue:
1) read doesn't read lines, it just reads bytes. There's no reason buff should end with \n.
Additionally, there's no reason buff shouldn't contain multiple newline characters, and as there's no [posix] tag here there's no reason to suggest what read does, let alone whether it's a syscall. Assuming you're referring to the POSIX function, there's no error handling. Where's your logic to handle the return value/s reserved for errors?
I think my code is a bit inefficient because the run time is O(FileWidth); however I think it can be O(log(FileWidth)) if we exponentially increase linesize to find the linefeed character.
Providing you fix the issues mentioned above (more on that later), if you were to test this theory, you'd likely find, also at the heart of the comment by Siguza,
Disks usually work on a 512-byte basis and file system caches and even CPU/memory caches are a lot larger than that.
To an extent, you can expect your idea to approach O(log n), but your bottleneck will be one of those cache lines (likely the one closest to your keyboard/the filesystem/whatever is feeding the stream with information). At that point, you should stop guzzling memory which other programs might need because your optimisation becomes less and less effective.
What do you think?
I think you should just STOP! You're guessing!
Once you've written your program, decide whether or not it's too slow. If it's not too slow, it doesn't need optimisation, and you probably won't shave enough nanoseconds to make optimisation worthwhile.
If it is too slow, then you should:
Use a profiler to determine what the most significant bottleneck is,
apply optimisations based on what your profiler tells you, then
use your profiler again, with the same inputs as before, to measure the effect your optimisation had.
If you don't use a profiler, your guess-work could result in slower code, or you might miss opportunities for more significant optimisations...
How do we read the second line?
Naturally, it makes sense to read character by character, rather than two hundred characters at a time, because there's no other way to stop reading the moment you reach a line terminating character.
Is there anyway to delimit the bytes?
Yes. The most sensible tools to use are provided by the C standard, and syscalls are managed automatically to be most efficient based on configurations decided by the standard library devs (who are most likely better at this than you are). Those tools are:
fgets to attempt to read a line (by reading one character at a time), up to a threshold (the size of your buffer). You get to decide how large a line should be, because it's more often the case that you won't expect a user/program to input huge lines.
strchr or strcspn to detect newlines from within your buffer, in order to determine whether you read a complete line.
scanf("%*[^\n]"); to discard the remainder of an incomplete line, when you detect those.
realloc to reallocate your buffer, if you decide you want to resize it and call fgets a second time to retrieve more data rather than discarding the remainder. Note: this will have an effect on the runtime complexity of your code, not that I think you should care about that...
Other options are available for the first three. You could use fgetc (or even read one character at a time) like I did at the end of this answer, for example...
In fact, that answer is highly relevant to your question, as it does make an attempt to exponentially increase the size. I wrote another example of this here.
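To show how those pieces fit together, here is a minimal sketch (not the code from the linked answers) that combines fgets with a buffer that doubles via realloc whenever a line turns out to be incomplete. The name read_line and the starting capacity of 16 are arbitrary choices for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read one line of any length from fp into a malloc'd string, without
 * the trailing newline. Returns NULL at end of input or on allocation
 * failure. The caller frees the result. */
char *read_line(FILE *fp)
{
    assert(fp != NULL);
    size_t cap = 16, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    while (fgets(buf + len, (int)(cap - len), fp) != NULL) {
        len += strlen(buf + len);
        if (len > 0 && buf[len - 1] == '\n') {
            buf[len - 1] = '\0';        /* complete line: drop the newline */
            return buf;
        }
        if (len + 1 >= cap) {           /* buffer full, line incomplete */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }
    if (len == 0) {                     /* nothing read: end of input */
        free(buf);
        return NULL;
    }
    return buf;                         /* last line had no newline */
}
```

Because the capacity doubles, a line of n bytes costs about log2(n) reallocations, which is the exponential growth the question was reaching for, but applied to the buffer rather than to the size of each read.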
It should be pointed out that the reason to address these problems is not so much optimisation, but the need to read a large chunk of data whose size isn't known in advance. Remember, if you haven't yet written the code, it's likely you won't know whether it's worthwhile optimising it!
Suffice to say, it isn't the read function you should try to reduce your dependence upon, but the malloc/realloc/calloc function... That's the real kicker! If you don't absolutely need to store the entire line, then don't!
I created a program which at regular intervals downloads a text file from a website, which is in csv format, and parses it, extracting relevant data, which then is displayed.
I have noticed that occasionally, every couple of months or so, it crashes. The crash is rare, considering the cycle of data downloading and parsing can happen every 5 minutes or even less. I am pretty sure it crashes inside the function that parses the string and extracts the data. When it crashes it happens during a congested internet connection, i.e. heavy downloads and/or a slow connection. Occasionally the remote site may be handing corrupt or incomplete data.
I used a test application which saves the data to be processed before processing it and it indeed shows it was not complete when a crash happens.
I have adapted the function to accommodate for a number of cases of invalid or incomplete data, as well as checking all return values. I also check return values of the various functions used to connect to the remote site and download the data. And will not go further when a return value indicates no success.
The core of the function uses strsep() to walk through the data and extract information out of it:
/*
* delimiters typically contains: <;>, <">, < >
* strsep() is used to split part of the string using delimiter
* and copy into token which then is copied into the array
* normally the function stops way before ARRAYSIZE which is just a safeguard
* it would normally stop when the end of file is reached, i.e. \0
*/
for(n=0;n<ARRAYSIZE;n++)
{
token=strsep(&copy_of_downloaded_data, delimiters);
if (token==NULL)
break;
data->array[n].example=strndup(token, strlen(token));
if (data->array[n].example!=NULL)
{
token=strsep(&copy_of_downloaded_data, delimiters);
if (token==NULL)
break;
(..)
copy_of_downloaded_data=strchr(copy_of_downloaded_data,'\n'); /* find newline */
if (copy_of_downloaded_data==NULL)
break;
copy_of_downloaded_data=copy_of_downloaded_data+1;
if (copy_of_downloaded_data=='\0') /* find end of text */
break;
}
Since I suspect I can not account for all ways in which data can be corrupted I would like to know if there is a way to program this so the function when run does not crash the whole application in case of corrupted data.
If that is not possible what could I do to make it more robust.
Edit: One possible instance of a crash is when the data ends abruptly, where the middle of a field is cut off, i.e.
"test","example","this data is brok
At least I noticed it by looking through the saved data, however I found it not being consistent. Will have to stress test it as was suggested below.
The best thing to do would be to figure out what input causes the function to crash, and fix the function so that it does not crash. Since the function is doing string processing, this should be possible to do by feeding it lots of dummy/test data (or feeding it the "right" test data if it's a particular input that causes the crash). You basically want to torture-test the function until you find out how to make it crash on demand; at that point you can start investigating exactly where and why it crashes, and once you understand that, the necessary changes to fix the crash will probably become obvious to you.
Running the program under valgrind might also point you to the bug.
If for some reason you can't fix the bug, the other option is to spawn a child process and run the buggy code inside the child process. That way if it crashes, only the child process is lost and not the parent. (You can spawn the child process under most OS's by calling fork(); you'll need to come up with some way for the child process to communicate its results back to the parent process, of course). (Note that doing it this way is a kludge and will likely not be very efficient, and could also introduce a security hole into your application if someone malicious who has the ability to send your program input can figure out how to manipulate the bug in order to take control of the child process -- so I don't recommend this approach!)
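For completeness, the child-process kludge might be sketched as follows on a POSIX system, with a pipe carrying a single int back to the parent. Everything named here (run_isolated, demo_count_commas, demo_crash) is hypothetical; the real parser would need its own way of returning results, and the caveats above still apply.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-ins for a parser that works and one that crashes. */
int demo_count_commas(const char *data)
{
    int n = 0;
    for (; *data != '\0'; data++)
        if (*data == ',')
            n++;
    return n;
}

int demo_crash(const char *data)
{
    (void)data;
    abort();                              /* simulate a crash on bad input */
}

/* Run parse(data) in a child process. Returns 0 and stores the parser's
 * result in *result if the child exited normally and reported one;
 * returns -1 if the child crashed or anything else went wrong. */
int run_isolated(int (*parse)(const char *), const char *data, int *result)
{
    assert(parse != NULL && data != NULL && result != NULL);
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {                       /* child */
        close(fds[0]);
        int r = parse(data);              /* may crash; only the child dies */
        ssize_t w = write(fds[1], &r, sizeof r);
        _exit(w == (ssize_t)sizeof r ? 0 : 1);
    }

    close(fds[1]);                        /* parent */
    int r = 0;
    ssize_t got = read(fds[0], &r, sizeof r);
    close(fds[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0 ||
        got != (ssize_t)sizeof r)
        return -1;
    *result = r;
    return 0;
}
```

A call such as run_isolated(demo_crash, data, &r) returns -1 instead of taking the caller down with it.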
What does the coredump point to?
strsep - does not have memory synchronization mechanisms, so protect it as a critical section ( lock it when you do strsep ) ?
see if strsep can handle a big chunk ( ARRAYSIZE is not gonna help you here ).
stack size of the thread/program that receives copy_of_downloaded_data ( i know you are only referencing it so look at the function that receives it. )
I would suggest that one should try to write code that keeps track of string lengths deliberately and doesn't care whether strings are zero-terminated or not. Even though null pointers have been termed the "billion dollar mistake"(*) I think zero-terminated strings are far worse. While there may be some situations where code using zero-terminated strings might be "simpler" than code that tracks string lengths, extra effort required to make sure that nothing can cause string-handling code to exceed buffer boundaries exceeds that required when working with known-length strings.
If, for example, one wants to store the concatenation of strings of length length1 and length2 into a buffer of length BUFF_SIZE, one can test easily whether length1+length2 <= BUFF_SIZE if one isn't expecting strings to be null-terminated, or length1+length2 < BUFF_SIZE if one expects a gratuitous null byte to follow every string. When using zero-terminated strings, one would have to determine the length of the two strings before concatenation, and having done so one could just as well use memcpy() rather than strcpy() or the useless strcat().
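A small sketch of that length-tracking style, for the case where no terminating null byte is expected. The function name concat_known and its error convention are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Concatenate two known-length byte strings into dst.
 * Returns the combined length, or (size_t)-1 if it would not fit.
 * No terminating null byte is read or written. */
size_t concat_known(char *dst, size_t dst_size,
                    const char *a, size_t a_len,
                    const char *b, size_t b_len)
{
    assert(dst != NULL && a != NULL && b != NULL);
    /* Same test as a_len + b_len <= dst_size, written so that the
     * addition itself cannot wrap around. */
    if (a_len > dst_size || b_len > dst_size - a_len)
        return (size_t)-1;
    memcpy(dst, a, a_len);
    memcpy(dst + a_len, b, b_len);
    return a_len + b_len;
}
```

The bounds check happens once, before any byte is copied, and the caller gets the new length back without another scan.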
(*) There are many situations where it's much better to have a recognizably-invalid pointer than to require that pointers which can't point to anything meaningful must instead point to something meaningless. Many null-pointer related problems actually stem from a failure of implementations to trap arithmetic with null pointers; it's not fair to blame null pointers for problems that could have been, but weren't, avoided.
I want to develop an application in C where I need to check word by word from a file on disk. I've been told that reading a line from file and then splitting it into words is more efficient as less file accesses are required. Is it true?
If you know you're going to need the entire file, you may as well read it in chunks as large as you can (at the extreme end, you'll memory map the entire file into memory in one go). You are right that this is because fewer file accesses are needed.
But if your program is not slow, then write it in the way that makes it the fastest and most bug free for you to develop. Early optimization is a grievous sin.
Not really true, assuming you're going to be using scanf() and your definition of 'word' matches what scanf() treats as a word.
The standard I/O library will buffer the actual disk reads, and reading a line or a word will have essentially the same I/O cost in terms of disk accesses. If you were to read big chunks of a file using fread(), you might get some benefit — at a cost in complexity.
But for reading words, it's likely that scanf() and a protective string format specifier such as %99s if your array is char word[100]; would work fine and is probably simpler to code.
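For instance, a word-counting loop along those lines might look like this. The function name count_words is made up, and it reads from any FILE * so it can be tried on something other than stdin.

```c
#include <assert.h>
#include <stdio.h>

/* Count whitespace-separated words on fp, reading them one at a time.
 * "%99s" skips leading whitespace and stores at most 99 characters plus
 * the terminating null, so word[100] cannot overflow. A word longer
 * than 99 characters arrives in several pieces and is counted once per
 * piece. */
size_t count_words(FILE *fp)
{
    assert(fp != NULL);
    char word[100];
    size_t n = 0;
    while (fscanf(fp, "%99s", word) == 1)
        n++;
    return n;
}
```

The stdio buffer behind fp is what actually talks to the disk, so this issues no more reads than a line-based loop would.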
If your definition of word is more complex than the definition supported by scanf(), then reading lines and splitting is probably easier.
As far as splitting is concerned, there is no difference with respect to performance. You are splitting on whitespace in one case and on newlines in the other.
However, the choice does affect allocation: splitting into words means allocating M buffers, while splitting into lines means allocating N, where M > N. So if you adopt the word-split approach, try to calculate the total memory needed first, allocate one chunk of that size (so you don't end up with M fragmented chunks), and then carve the M buffers out of that chunk. Note that the same approach can be applied to line splitting, but the difference will be less visible.
This is correct, you should read them in to a buffer, and then split into whatever you define as 'words'.
The only case where this would not be true is if you can get fscanf() to correctly grab out what you consider to be words (doubtful).
The major performance bottlenecks will likely be:
Any call to a stdio file I/O function. The fewer calls, the less overhead.
Dynamic memory allocation. It should be done as sparingly as possible. Ultimately, a lot of calls to malloc will cause heap fragmentation.
So what it boils down to is a classic programming consideration: you can get either quick execution time or you can get low memory usage. You can't get both, but you can find some suitable middle-ground that is most effective both in terms of execution time and memory consumption.
At one extreme, the fastest possible execution can be obtained by reading the whole file as one big chunk and loading it into dynamically allocated memory. At the other extreme, you can read it byte by byte and evaluate it as you read, which might make the program slower but will not use dynamic memory at all.
You will need a fundamental knowledge of various CPU-specific and OS-specific features to optimize the code most effectively. Issues like alignment, cache memory layout, the effectiveness of the underlying API function calls etc etc will all matter.
Why not try a few different ways and benchmark them?
Not actually an answer to your exact question (words vs lines), but if you need all words in memory at the same time anyway, then the most efficient approach is this:
determine file size
allocate buffer for entire file plus one byte
read entire file to the buffer, and put '\0' to the extra byte.
make a pass over it and count how many words it has
allocate char* (pointers to words) or int (indexes to buffer) index array, with size matching word count
make 2nd pass over buffer, and store addresses or indexes to the first letters of words to the index array, and overwrite other bytes in buffer with '\0' (end of string marker).
If you have plenty of memory, then it's probably slightly faster to just assume the worst case for number of words: (filesize+1) / 2 (one letter words with one space in between, with odd number of bytes in file). Also taking the Java ArrayList or Qt QVector approach with the index array, and using realloc() to double its size when word count exceeds current capacity, will be quite efficient (due to doubling=exponential growth, reallocation will not happen many times).
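The doubling variant can be sketched in a single pass over a buffer that already holds the whole file and ends in '\0'. The name index_words and the initial capacity of 8 are assumptions made for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Split a null-terminated buffer into words in place, in a single pass.
 * Whitespace bytes are overwritten with '\0'. Returns a malloc'd array
 * of word starts (NULL on allocation failure) and sets *count. */
char **index_words(char *buf, size_t *count)
{
    assert(buf != NULL && count != NULL);
    size_t cap = 8, n = 0;
    char **words = malloc(cap * sizeof *words);
    if (words == NULL)
        return NULL;

    int in_word = 0;
    for (char *p = buf; *p != '\0'; p++) {
        if (isspace((unsigned char)*p)) {
            *p = '\0';                  /* terminate the previous word */
            in_word = 0;
        } else if (!in_word) {
            if (n == cap) {             /* index full: double it */
                char **tmp = realloc(words, 2 * cap * sizeof *words);
                if (tmp == NULL) {
                    free(words);
                    return NULL;
                }
                words = tmp;
                cap *= 2;
            }
            words[n++] = p;             /* first letter of a new word */
            in_word = 1;
        }
    }
    *count = n;
    return words;
}
```

Compared with the two-pass version, this trades the counting pass for a handful of realloc calls; which is faster would have to be measured.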
I need to write a program where, at run time, a set of integers of arbitrary size will be taken as input. They will be separated by white space. At the end, a new line is given, marking the end of input. How do I save them into an array of integers so that I can display them later? I think it is a little difficult because the number of values that will be entered is not known at compile time.
Sounds like homework.
Correct me if I am wrong and I will give you more than hints.
You can either declare an array of a really large size that would not possibly be filled by the user input, then use scanf or something like that to grab the integers until you hit '\n', or you can grab each integer at a time, allocating memory as you go, using a combination of malloc and memcpy calls. The first option should never be done in a real world problem, and I am certainly not advocating such practices even though your textbook probably tells you to do it this way.
There is an example just like this in K&R.
This is a typical problem you will have in C. The solution is usually one of two options.
Use a really large array that is large enough to hold the input. Sometimes this is a poor option when the data could be really large. An example of when it would be a bad idea is when you are saving a video frame or a large text file to the array. This also opens you up to a buffer overrun attack in older versions of Windows. However, this is sometimes a good quick hack solution for smaller (homework) programs where you can count on the user (i.e. your professor who is not trying to break your program) to not input 1000's of characters. Usually this is considered bad practice, please consider my 2nd option for the security reason I mentioned before.
Use dynamic arrays (i.e. malloc). This is probably what your professor wants you to do as this sounds like a typical problem to use when a student is first learning pointers and arrays. This is a great approach, just remember to call free on your memory when you are finished. The tricky part here is that you still have to know the size of the array you want ahead of time (not at compile time though of course).
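One way around having to know the size ahead of time is to let the array grow with realloc as values arrive. The following is only a sketch of that idea, not a reference solution; read_ints and its starting capacity of 4 are invented names and numbers.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read whitespace-separated integers from fp until end of line or end
 * of input. Returns a malloc'd array (NULL on allocation failure) and
 * stores the number of values in *count. The caller frees it. */
int *read_ints(FILE *fp, size_t *count)
{
    assert(fp != NULL && count != NULL);
    size_t cap = 4, n = 0;
    int *vals = malloc(cap * sizeof *vals);
    if (vals == NULL)
        return NULL;

    for (;;) {
        int c;
        /* Skip blanks by hand so that %d does not swallow the newline. */
        while ((c = getc(fp)) == ' ' || c == '\t')
            ;
        if (c == '\n' || c == EOF)
            break;
        ungetc(c, fp);

        int v;
        if (fscanf(fp, "%d", &v) != 1)
            break;                      /* not a number: stop reading */
        if (n == cap) {                 /* array full: double it */
            int *tmp = realloc(vals, 2 * cap * sizeof *vals);
            if (tmp == NULL) {
                free(vals);
                return NULL;
            }
            vals = tmp;
            cap *= 2;
        }
        vals[n++] = v;
    }
    *count = n;
    return vals;
}
```

Called as read_ints(stdin, &n), it returns the values typed on one line; remember to free the array when you are finished with it.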