I am searching for the first occurrence of a character in a string using the following code.
But it takes some time when the string is very long or the character I am
searching for is far from the start, which delays other operations. How could I tackle this problem? The code is below.
Note: attrPtr is a char * that points to a string containing a '"' character far from the start.
int position = 0;
char qolon = '"';//character to search
while (*(attrPtr + position++) != qolon);
char* attrValue = NULL;
attrValue = (char*)malloc(position * sizeof(char));
strncpy(attrValue, attrPtr, position-1);
strchr will usually be somewhat faster. Also, you need to check for the NUL terminator, which strchr will handle for you.
char *quotPtr = strchr(attrPtr, qolon);
if(quotPtr == NULL)
{
... // Handle error
}
int position = quotPtr - attrPtr;
char* attrValue = (char*) malloc((position + 1) * sizeof(char));
memcpy(attrValue, attrPtr, position);
attrValue[position] = '\0';
I haven't tested, though.
EDIT: Fix off-by-one.
C has a built-in function for searching for a character in a string - strchr(). strchr() returns a pointer to the found character, not the array position, so you have to subtract the pointer to the start of the string from the returned pointer to get that. You could rewrite your function as:
char qolon = '"';//character to search
char *found;
char *attrVal = NULL;
found = strchr(attrPtr, qolon);
if (found)
{
size_t len = found - attrPtr;
attrVal = malloc(len + 1);
memcpy(attrVal, attrPtr, len);
attrVal[len] = '\0';
}
This may be faster than your original by a small constant factor; however, you aren't going to get an order-of-magnitude speedup. Searching for a character within an un-ordered string is fundamentally O(n) in the length of the string.
Two important things:
1) Always check for a NULL terminator when searching a string this way:
while (*(attrPtr + position++) != qolon);
should be:
while (attrPtr[position] && attrPtr[position++] != qolon);
(if passed a string lacking your searched character, it could take a very long time as it scans all memory). Edit: I just noticed someone else posted this before me, but oh well. I disagree, btw: strchr() is fine, but a simple loop that also checks for the terminator is fine (and often has advantages), too.
2) BEWARE of strncpy()!
strncpy(attrValue, attrPtr, position-1);
strlen(attrPtr)>=(position-1) so this will NOT null terminate the string in attrValue, which could cause all kinds of problems (including incredible slowdown in code later on). As a related note, strncpy() is erm, uniquely designed, so if you do something like:
char buf[512];
strncpy(buf,"",4096);
You will be writing 4096 bytes of zeroes.
Personally, I use lstrcpyn() on Win32, and on other platforms I have a simple implementation of it. It is much more useful for me.
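As a rough illustration of the kind of bounded copy described above, here is a minimal sketch. The name copy_bounded and its return convention are assumptions made for this example; it is not lstrcpyn() and not a standard function. It always writes a terminator and never zero-pads the rest of the destination.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a standard function): copy at most dstsize - 1
 * bytes of src into dst and always NUL-terminate, without zero-padding. */
size_t copy_bounded(char *dst, size_t dstsize, const char *src)
{
    size_t srclen = strlen(src);
    if (dstsize == 0)
        return srclen;                 /* nothing can be stored */
    size_t n = (srclen < dstsize - 1) ? srclen : dstsize - 1;
    memcpy(dst, src, n);
    dst[n] = '\0';                     /* terminator is always written */
    return srclen;                     /* > dstsize - 1 means truncation */
}
```

Returning the source length lets the caller detect truncation by comparing it with the destination size.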
Searching for a character in a string requires an O(n) algorithm, so you can't do much better than what you are already doing. Also, note that you are missing memset(attrValue, 0, position); without it, your string attrValue will not be null terminated.
The algorithm you posted doesn't properly handle the case where the character doesn't exist in the string. If that happens, it will just merrily march through memory until it either randomly happens to find a byte that matches your char, or you blow past your allocated memory and get a segfault. I suspect that is why it seems to be "taking too long" sometimes.
In C, strings are usually terminated with a 0 (ascii nul, or '\0'). Alternatively, if you know the length of the string ahead of time, you can use that.
Of course, there is a standard C library routine that does exactly this: strchr(). A wise programmer would use that rather than risk bugs by rolling their own.
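For the case where the length is known ahead of time, a minimal sketch using memchr() might look like the following. The helper name copy_until and its NULL-on-failure convention are assumptions made for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a malloc'd copy of the bytes before the first
 * occurrence of delim in the first len bytes of s, or NULL if delim is absent
 * or allocation fails. The caller frees the result. */
char *copy_until(const char *s, size_t len, char delim)
{
    const char *found = memchr(s, delim, len);   /* never reads past s + len */
    if (found == NULL)
        return NULL;
    size_t n = (size_t)(found - s);
    char *out = malloc(n + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, s, n);
    out[n] = '\0';
    return out;
}
```

Because the search is bounded by len, a missing delimiter ends the scan at the end of the buffer instead of running on through memory.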
I am currently working on a program which involves creating a template for an exam.
In the function where I allow the user to add a question to the exam, I am required to ensure that I use only as much memory as is required to store its data. I've managed to do so after a great deal of research into the differences between various input functions (getc, scanf, etc), and my program seems to be working but I am concerned about one thing. Here is the code for my function, I've placed a comment on the line in question:
int AddQuestion(){
Question* newQ = NULL;
char tempQuestion[500];
char* newQuestion;
if(exam.phead == NULL){
exam.phead = (Question*)malloc(sizeof(Question));
}
else{
newQ = (Question*)malloc(sizeof(Question));
newQ->pNext = exam.phead;
exam.phead = newQ;
}
while(getchar() != '\n');
puts("Add a new question.\n"
"Please enter the question text below:");
fgets(tempQuestion, 500, stdin);
newQuestion = (char*)malloc(strlen(tempQuestion) + 1); /*Here is where I get confused*/
strcpy(newQuestion, tempQuestion);
fputs(newQuestion, stdout);
puts("Done!");
return 0;
}
What's confusing me is that I've tried running the same code but with small changes to test exactly what is going on behind the scenes. I tried removing the + 1 from my malloc, which I put there because strlen only counts up to but not including the terminating character and I assume that I want the terminating character included. That still ran without a hitch. So I tried running it but with - 1 instead under the impression that doing so would remove whatever is before the terminating character (newline character, correct?). Still, it displayed everything on separate lines.
So now I'm somewhat baffled and doubting my knowledge of how character arrays work. Could anybody help clear up what's going on here, or perhaps provide me with a resource which explains this all in further detail?
In C, strings are conventionally null-terminated. strlen, however, only counts the characters before the null. So you must always add one to the value of strlen to get enough space. Or call strdup.
A C string contains the characters you can see "abc" plus one you can't which marks the end of the string. You represent this as '\0'. The strlen function uses the '\0' to find the end of the string, but doesn't count it.
So
myvar = malloc(strlen(str) + 1);
is correct. However, what you tried:
myvar = malloc(strlen(str));
and
myvar = malloc(strlen(str) - 1);
while INCORRECT, MAY seem to work some of the time. This is because malloc typically allocates memory in chunks, (say maybe in units of 16 bytes) rather than the exact size you ask for. So sometimes, you may 'luck out' and end up using the 'slop' at the end of the chunk.
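To make the sizing rule concrete, here is a minimal sketch of a string copy that allocates strlen + 1 bytes. The helper name dup_string is an assumption made for this example; where strdup is available it does the same job.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: heap copy of a C string, sized for the characters
 * plus the terminating '\0'. Returns NULL if allocation fails. */
char *dup_string(const char *src)
{
    size_t len = strlen(src);          /* excludes the terminator */
    char *copy = malloc(len + 1);      /* + 1 makes room for '\0' */
    if (copy == NULL)
        return NULL;
    memcpy(copy, src, len + 1);        /* copies the terminator too */
    return copy;
}
```

Note that a newline kept by fgets is an ordinary character and is counted by strlen; only the terminator needs the extra byte.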
Below is some code which I ran through a static analyzer. It came back saying there is a stack overflow vulnerability in the function that uses strtok below, as described here:
https://cwe.mitre.org/data/definitions/121.html
If you trace the execution, the variables used by strtok ultimately derive their data from the user_input variable in somefunction coming in from the wild. But I figured I prevented problems by first checking the length of user_input as well as by explicitly using strncpy with a bound any time I copied pieces of user_input.
somefunction(user_input) {
if (strlen(user_input) != 23) {
if (user_input != NULL)
free(user_input);
exit(1);
}
Mystruct* mystruct = malloc(sizeof(Mystruct));
mystruct->foo = malloc(3 * sizeof(char));
memset(mystruct->foo, '\0', 3);
strncpy(mystruct->foo,&(user_input[0]),2);
mystruct->bar = malloc(19 * sizeof(char));
memset(mystruct->bar, '\0', 19);
/* Remove spaces from user's input. strtok is not guaranteed to
* not modify the source string so we copy it first.
*/
char *input = malloc(22 * sizeof(char));
strncpy(input,&(user_input[2]),21);
remove_spaces(input,mystruct->bar);
}
void remove_spaces(char *input, char *output) {
const char space[2] = " ";
char *token;
token = strtok(input, space);
while( token != NULL ) {
// the error is indicated on this line
strncat(output, token, strlen(token));
token = strtok(NULL, space);
}
}
I presumed that I didn't have to malloc token per this comment, and elsewhere. Is there something else I'm missing?
strncpy does not increase the safety of your code; indeed, it may well make the code less safe by introducing the possibility of an unterminated output string. But the issue being flagged by the static analyser involves neither strncpy nor strtok; it's with strncat.
Although they are frequently touted as increasing code safety, that was never the purpose of strncpy, strncat nor strncmp. The strn* alternatives to str* functions are intended for use in a context in which string data is not null-terminated. Such a context exists, although it is rare in student code: fixed-length string fields in fixed-size database records. If a field in a database record always contains 20 characters (CHAR(20) in SQL terms), there's no need to force a trailing 0-byte, which could have been used to allow 21-character names (or whatever the field is). It's a waste of space, and the only reason that those unnecessary bytes might be examined by the database code is to check database integrity. (Not that the extra byte really helps maintain integrity, either. But it must be checked for correctness.)
If you were writing code which used or created fixed-length unterminated string fields, you would certainly need a set of string functions which accept a length argument. But the string library already had those functions: memcpy and memcmp. The strn versions were added to ease the interface when both null-terminated and fixed-length strings are being used in the same application; for example, if a null-terminated string is read from user input and needs to be copied into a fixed-length database field. In that context, the interface of strncpy makes sense: the database field must be completely cleared of old data, but the input string might be too short to guarantee that. So you can't use strcpy even if you check that it won't overflow (because it doesn't necessarily erase old data) and you can't use memcpy (because the bytes following the end of the input string are indeterminate). Hence an interface like strncpy, which guarantees that the destination will be filled but doesn't guarantee that it will be null-terminated.
strncmp and strnlen do have some applications which don't necessarily have to do with fixed-length string records, but they are not safety-related either. strncmp is handy if you want to know whether a given string is a prefix of another string (although a startswith function would have more directly addressed this use case) and strnlen lets you answer the question "Are there at least four characters in this string?" without having to worry about how many cycles would be wasted if the string continued for another four million characters. But that doesn't justify using them in other, more normal, contexts.
OK, that was a bit of a detour. Let's get back to strncat, whose prototype is
char *strncat(char *dest, const char *src, size_t n);
where n is the maximum number of characters to copy. As the man page notes, you (and not the standard library) are responsible for ensuring that the destination has n+1 bytes available for the copy. The library function cannot take responsibility, because it cannot know how much space is available, and it hasn't asked you to specify that.
In my opinion, that makes strncat completely useless. In order to know how much space is available in the destination, you need to know where the concatenation's copy will start. But if you knew that, why on earth would you ask the standard library to scan over the destination looking for the concatenation point? In any case, you are not verifying how much space is available; you simply call:
strncat(output, token, strlen(token));
That does exactly the same thing as strcat(output, token) except that it scans token twice (once to count the bytes and a second time to copy them) and during the copy it does a redundant check to ensure that the count has not been exceeded while copying.
A "safe" version of strncat would require you to specify the length of the destination, but since there is no such function in the standard C library and also no consensus as to what the prototype for such a function would be, you need to guarantee safety yourself by tracking the amount of space used in output by each concatenation. As an extra benefit, if you do that, you can then make the computational complexity of a sequence of concatenations linear in the number of bytes copied, which one might intuitively expect, as opposed to quadratic, as implemented by strcat and strncat.
So a safe and efficient procedure might look like this:
void remove_spaces(char *output, size_t outmax,
char *input) {
if (outmax == 0) return;
char *token = strtok(input, " ");
char *outlimit = output + outmax;
while( token ) {
size_t tokelen = strlen(token);
if (tokelen >= outlimit - output)
tokelen = outlimit - output - 1;
memcpy(output, token, tokelen);
output += tokelen;
token = strtok(NULL, " ");
}
*output = 0;
}
The CWE warning does not mention strtok at all, so the question in the title itself is a red herring. strtok is one of the few parts of your code which is not problematic, although (as you note) it does force an otherwise unnecessary copy of the input string, in case that string is in read-only memory. (As noted above, strncpy does not guarantee that the copy is null-terminated, so it is not safe here. strdup, which needs to be paired with free, is the safest way to copy a string. Fortunately, it will finally be part of the C standard (C23) instead of just being available almost everywhere.)
That might be a good enough reason to avoid strtok. If so, it's easy to get rid of:
void remove_spaces(char *output, size_t outmax,
/* This version doesn't modify input */
const char *input) {
if (outmax == 0) return;
char *outlimit = output + outmax;
/* Skip any run of spaces; stop at the end of the input */
while ( *(input += strspn(input, " ")) ) {
/* Length of the token that starts at input */
size_t tokelen = strcspn(input, " ");
size_t avail = (size_t)(outlimit - output);
/* Always keep one byte free for the terminator */
size_t cpylen = (tokelen < avail)
? tokelen
: avail - 1;
memcpy(output, input, cpylen);
output += cpylen;
input += tokelen;
}
*output = 0;
}
A better interface would manage to indicate whether the output was truncated, and perhaps give an indication of how many bytes were necessary to accommodate the operation. See snprintf for an example.
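One possible shape for such an interface, sketched under the assumption that it follows the snprintf convention of returning the length the complete result would need. The name remove_spaces_len is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical variant: copies input into output with spaces removed,
 * writes at most outmax bytes including the terminator, and returns the
 * length the full result needs (excluding the terminator), as snprintf does.
 * A return value >= outmax means the output was truncated. */
size_t remove_spaces_len(char *output, size_t outmax, const char *input)
{
    size_t needed = 0;
    for (; *input != '\0'; input++) {
        if (*input == ' ')
            continue;
        if (needed + 1 < outmax)       /* keep one byte for the terminator */
            output[needed] = *input;
        needed++;
    }
    if (outmax > 0)
        output[needed < outmax ? needed : outmax - 1] = '\0';
    return needed;
}
```

A caller can compare the return value with outmax to detect truncation, or call once with outmax set to 0 to learn how large a buffer to allocate.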
The text I pass to the get_document function is ordinary string data.
1. " " separates words.
2. "." separates sentences.
3. "\n" separates paragraphs.
get_document is a function that allocates a separate memory block for each word, sentence and paragraph, making them easily accessible.
Here's the code snippet.
char**** get_document(char* text) {
//get_document
int l=0,k=0,j=0,i=0;
char**** document = (char****)malloc(sizeof(char***));//para
document[l] = (char***)malloc(sizeof(char**));//sen
document[l][k] = (char**)malloc(sizeof(char*));//word
document[l][k][j] = (char*)malloc(sizeof(char));//letter
for(int z = 0; z < strlen(text); z++) {
if(strcmp(&text[z]," ")==0) {
document[l][k][j][i] = '\0';
j++;
document[l][k] = realloc(document[l][k],(sizeof(char*)) * j+1);
i=0;
document[l][k][j] = (char*)malloc(sizeof(char));
}
else if(strcmp(&text[z],".")==0) {
k++;
document[l] = realloc(document[l],(sizeof(char**)) * k+1);
j=0;
i=0;
document[l][k] =(char**)malloc(sizeof(char*));
document[l][k][j] = (char*)malloc(sizeof(char));
}
else if(strcmp(&text[z],"\n")==0) {
l++;
document = realloc(document,(sizeof(char***)) * l+1);
k=0;
j=0;
i=0;
document[l] = (char***)malloc(sizeof(char**));
document[l][k] =(char**)malloc(sizeof(char*));
document[l][k][j] = (char*)malloc(sizeof(char));
}
else {
strcpy(&document[l][k][j][i],&text[z]);
i++;
document[l][k][j] = realloc(document[l][k][j],(sizeof(char)) * i+1);
}
}
return document;
}
but when I run the program , I get the error
realloc:invalid next size
Can anyone help me with this. Thanks in advance.
when I run the program , I get the error
realloc:invalid next size
It appears that one of your realloc calls is failing because the allocator's tracking data has been corrupted. This is one of the more common things that can go wrong when you overwrite the bounds of an object, especially an allocated one. Which you do, a lot:
strcpy(&document[l][k][j][i],&text[z]);
If you want to make any progress in your study of C, it is essential that you learn the difference between a char and a string. The C string functions, such as strcmp() and strcpy(), apply only to the latter. You may use them on empty strings (containing only a nul) or on single-character strings (containing one character plus a nul), among other kinds, but they are neither safe nor useful for individual chars. For individual chars you would use standard C operators instead, such as == and =.
In the case of the line quoted above, each strcpy call will attempt to copy the entire tail of the input string, including the terminator, into the one-char-big space pointed to by &document[l][k][j][i]. This will always write past the end of the allocated space, often by a lot, thus producing undefined behavior. You appear to instead want:
document[l][k][j][i] = text[z];
(well-deserved criticism of the choice of a quadruple pointer left aside). I see that you leave appending a string terminator for later, which is ok in principle, but I also see that you fail to terminate the last word of each sentence if the period ('.') immediately follows the word without any space.
Along the same lines, your several uses of strcmp() each compare the entire tail of the input string to one of several length-one string literals. Such comparisons are allowed, but they will not yield the results you appear to want. It appears you want simple equality tests against character constants, instead:
if (text[z] == ' ')
// ...
else if (text[z] == '.')
// ...
else if (text[z] == '\n')
And of course, even with those corrections, your approach is highly inefficient. Memory [re]allocation is comparatively expensive, and you are performing an allocation or reallocation for every. single. character. in the document. At least scan ahead to the end of each word so as to allocate a word at a time, though it is possible to do better even than that.
Also, do not neglect the fact that malloc() and realloc() can fail, in which case they return a null pointer. Robust code is meticulous about checking for and handling error results from its function calls, including allocation errors.
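As a sketch of the "allocate a word at a time" suggestion, the following scans ahead to the end of the current word with strcspn() and makes a single checked allocation for it. The helper name next_word and the cursor-passing convention are assumptions made for this example, not part of the original code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy the word starting at *cursor (everything up to
 * the next space, period, newline or end of string) into one allocation and
 * advance *cursor past it. Returns NULL if allocation fails. */
char *next_word(const char **cursor)
{
    size_t len = strcspn(*cursor, " .\n");   /* length of the word */
    char *word = malloc(len + 1);            /* one allocation per word */
    if (word == NULL)
        return NULL;
    memcpy(word, *cursor, len);
    word[len] = '\0';
    *cursor += len;                          /* now at the delimiter or '\0' */
    return word;
}
```

The caller would then inspect the delimiter at *cursor to decide whether a word, sentence or paragraph has just ended.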
You are mixing up characters and strings.
Your conditions to detect your elements are wrong:
if(strcmp(&text[z]," ")==0)
else if(strcmp(&text[z],".")==0)
...
Unless strlen(text) == 1 you will never enter any of your branches.
strcmp compares strings, not single characters. This means it compares the whole remaining buffer with a string of length 1 which can never be true except for the last character.
If you want to compare single characters, use if(text[z] == ' ') instead.
In your final else branch you completely smash your heap:
strcpy(&document[l][k][j][i],&text[z]);
You copy a string (again: the complete remaining buffer) into a single character.
The memory for document[l][k][j] was allocated using size=1. This cannot even hold a string of length 1 because there is no room for terminating '\0' byte.
Copying the string into memory large enough to hold exactly 1 character, causes heap corruption and in any call to memory allocation function, this will finally explode as you can see with your error message.
What you need is:
document[l][k][j][i] = text[z];
document[l][k][j][i+1] = 0;
Finally your memory size for allocation is wrong:
document = realloc(document,(sizeof(char***)) * l+1);
You want to add 1 extra element to the array but you only add 1 byte. Use this instead:
document = realloc(document,(sizeof(char***)) * (l+1));
The same applies for all other levels of your construction.
In addition your naming of counters is poor. One character variable names should only be used for loops etc. where there is no risk of confusion.
If you use them for different levels of array indexing, you should use names like wordcount, paracount etc. This would make the code much more readable.
Also I suggest you follow the hints in comments. Rethink your complete design.
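The sizing fix can be sketched as a small helper that multiplies the element size by the full element count and keeps the old pointer valid if realloc fails. The name grow_words is hypothetical.

```c
#include <stdlib.h>

/* Hypothetical helper: grow an array of word pointers to hold new_count
 * elements. On failure it returns NULL and leaves the original block valid. */
char **grow_words(char **words, size_t new_count)
{
    /* Multiply by the whole element count: sizeof(char *) * new_count.
     * Writing sizeof(char *) * old + 1 would add a single byte instead. */
    char **tmp = realloc(words, sizeof *tmp * new_count);
    return tmp;   /* caller assigns only after checking for NULL */
}
```

Assigning the result to a temporary first avoids losing the original block when realloc returns a null pointer.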
Good day everyone. I have the following c-string initialization:
*(str) = 0; where str is declared as char str[255];. The questions are:
1) is it the best way to initialize a string this way?
2) should I expect trouble with this code on 64-bit platform?
Thanks in advance.
The instruction *(str) = 0 is completely equivalent to str[0] = 0;.
The initialization simply places the integer 0 (which is equivalent to '\0') in the first position, making the string empty.
1) There's no "best way", the key is being consistent so if you choose the "*(str) = 0;" path, always do the same everywhere.
2) No trouble could ever come from that statement. Whatever the architecture, the compiler will take care of everything.
What this does is change the uninitialized char array str, of which each element has some undefined value, to have its first element set to the null terminator character.
As far as most string related functions go, they will traverse this string from left to right, and immediately see the null terminator.
There are surely functions that don't care about null terminators, and are given a maximum size. In these cases, you are reading undefined values, which is Bad™.
Use this instead:
char str[255] = { 0 }; // zero every element
or initialize the string to something useful on first use.
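A small sketch contrasting the two initializations. The function init_demo is made up for this example and only checks the properties described above.

```c
#include <string.h>

/* Hypothetical helper showing both initializations side by side.
 * Returns 1 if they behave as described, 0 otherwise. */
int init_demo(void)
{
    char a[255];
    a[0] = '\0';                 /* same as *(a) = 0: only a[0] is set */

    char b[255] = { 0 };         /* every element is zero */

    /* Both are empty strings as far as strlen is concerned */
    if (strlen(a) != 0 || strlen(b) != 0)
        return 0;
    /* Only b is safe to read beyond the first element */
    for (size_t i = 0; i < sizeof b; i++)
        if (b[i] != 0)
            return 0;
    return 1;
}
```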
I'm trying to use Mac OS X's listxattr C function and turn it into something useful in Python. The man page tells me that the function returns a string buffer, which is a "simple NULL-terminated UTF-8 strings and are returned in arbitrary order. No extra padding is provided between names in the buffer."
In my C file, I have it set up correctly it seems (I hope):
char buffer[size];
res = listxattr("/path/to/file", buffer, size, options);
But when I went to print it, I got ONLY the FIRST attribute, which was two characters long, even though its size is 25. So then I manually set buffer[3] = 'z' and, lo and behold, when I print buffer again I get the first TWO attributes.
I think I understand what is going on. The buffer is a sequence of NULL-terminated strings, and stops printing as soon as it sees a NULL character. But then how am I supposed to unpack the entire sequence into ALL of the attributes?
I'm new to C and using it to figure out the mechanics of extending Python with C, and ran into this doozy.
1. char *p = buffer;
2. Get the length with strlen(p). If the length is 0, stop.
3. Process the current chunk.
4. p = p + length + 1;
5. Go back to step 2.
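The steps above can be sketched as a loop. This version is bounded by the byte count returned by the call that filled the buffer rather than by an empty string, which is an assumption about how the caller tracks the size; the helper name count_names is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the NUL-terminated names packed into the first
 * len bytes of buf, where len is the byte count reported by the call that
 * filled the buffer. Assumes the last name is terminated inside buf. */
size_t count_names(const char *buf, size_t len)
{
    size_t count = 0;
    const char *p = buf;
    const char *end = buf + len;
    while (p < end) {
        size_t n = strlen(p);    /* length of the current name */
        /* process the name at p here, e.g. convert it to a Python string */
        count++;
        p += n + 1;              /* skip the name and its terminator */
    }
    return count;
}
```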
So you guessed pretty much right.
The listxattr function returns a bunch of null-terminated strings packed in next to each other. Since strings (and arrays) in C are just blobs of memory, they don't carry around any extra information with them (such as their length). The convention in C is to use a null character ('\0') to represent the end of a string.
Here's one way to traverse the list, in this case changing it to a comma-separated list.
int i = 0;
for (; i < res; i++)
if (buffer[i] == '\0' && i != res -1) //we're in between strings
buffer[i] = ',';
Of course, you'll want to make these into Python strings rather than just substituting in commas, but that should give you enough to get started.
It looks like listxattr returns the size of the buffer it has filled, so you can use that to help you. Here's an idea:
for(int i=0; i<res-1; i++)
{
if( buffer[i] == 0 )
buffer[i] = ',';
}
Now, instead of being separated by null characters, the attributes are separated by commas.
Actually, since I'm going to send it to Python I don't have to process it C-style after all. Just use Py_BuildValue, passing it the format string s#, which knows what to do with it. You'll also need the size.
return Py_BuildValue("s#", buffer, size);
You can process it into a list on Python's end using split('\x00'). I found this after trial and error, but I'm glad to have learned something about C.