GNU memmem vs C strstr - c

Is there any use case that can be solved by memmem but not by strstr?
I was thinking of being able to find a string of raw bytes (the needle) inside a bigger string of raw bytes (the haystack), for example finding a particular byte pattern inside a blob of raw bytes read with C's read function.

There is the case you mention: raw binary data.
This is because raw binary data may contain zero bytes, which strstr interprets as string terminators, making it ignore the rest of the haystack or needle.
Additionally, if the raw binary data contains no zero bytes, and you don't have a valid (inside the same array or buffer allocation) extra zero after the binary data, then strstr will happily go beyond the data and cause Undefined Behavior via buffer overflow.
Or, to the point: strstr can't be used if the data is not strings. memmem doesn't have this limitation.
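As a minimal sketch of the binary-data case, the wrapper below searches a byte buffer that contains zero bytes. The name find_bytes is a hypothetical helper, not part of any library. memmem itself is an extension rather than ISO C, so this assumes a libc that provides it (glibc does when _GNU_SOURCE is defined before the first include).

```c
#define _GNU_SOURCE   /* memmem is an extension, not ISO C; glibc needs this */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the offset of the first occurrence of
 * needle in haystack, or -1 if it is absent. Both are raw byte buffers
 * with explicit lengths, so zero bytes are ordinary data. */
long find_bytes(const unsigned char *haystack, size_t haystack_len,
                const unsigned char *needle, size_t needle_len)
{
    assert(haystack != NULL && needle != NULL);
    const unsigned char *hit =
        memmem(haystack, haystack_len, needle, needle_len);
    return hit ? (long)(hit - haystack) : -1;
}
```

With a haystack of 01 00 DE AD 00 BE EF and a needle of 00 BE, the match is at offset 4. strstr could not be used here at all: it would treat the zero at offset 1 as the end of the haystack, and the needle itself starts with a zero byte.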

In addition to searching in non-string data, memmem() can be used to look for substrings in just a portion of a longer string, something strstr() can't do:
char somestr[] = "a long string with the word apple";
// Look in just the 5th through 14th characters (10 characters starting at index 4)
// (The haystack must have at least 14 characters, or the search reads out of bounds)
char *loc = memmem(somestr + 4, 10, "pp", 2);
If you already know the lengths of the strings, memmem() might also be faster than strstr() when used on the entire haystack string, but that depends a lot on the implementation and should be benchmarked.

Related

what are non null terminated string?

I read this in the STR, LPSTR section: "The "sz" part of the prefix is important, because some strings in the Windows world (especially when talking about the DDK) are not zero-terminated."
Can anyone tell me what those non-null-terminated strings are?
In computer science, a string is a sequence of characters. A sequence has some length—there are some number of characters in it. To work with a string, one generally has to know the length of the string.
The length may be indicated in various ways. One way is to indicate the end of the sequence with a sentinel value, which is simply a chosen value that is not used in the sequence. With character strings, it is common to use zero as a sentinel: The string continues from its start until a zero character is found. When using a sentinel, the sentinel value cannot appear inside the string, since it marks the end.
Another way to indicate the length is to keep it separately from the string. For example, the length is passed to the C memcmp routine as a separate parameter. This allows memcmp to compare arbitrary sequences of bytes in memory, including sequences that contain zero bytes.
Sometimes the length is treated as part of the data structure for the string. It might be in the first byte or first several bytes of the string. So software using the string would get the length by reading the first byte, and the bytes after that would contain the characters of the string.
Another method, related to the sentinel method, is to use delimiters. For example, we commonly write strings such as "abc" in source code, text, and in shell commands. The quote marks are delimiters that mark the beginnings and ends of strings. Various methods are used to allow the delimiters themselves to be characters in the strings, such as “quoting” the delimiters with other special characters, as in: "This is a quote mark: \".".
In summary, the concept of a string that is not null-terminated is broad and open: any string whose length is indicated by some method other than marking the end with a null character is a string that is not null-terminated.
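One way to picture the "length kept separately" approach is a small counted-string type. This is only an illustrative sketch: struct counted_str and counted_str_equal are hypothetical names, not taken from any library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted-string type: the length travels with the pointer,
 * so the bytes need no terminator and may include zero bytes. */
struct counted_str {
    size_t len;
    const char *data;
};

/* Returns 1 if both strings hold the same bytes, 0 otherwise.
 * memcmp compares exactly len bytes and does not stop at a zero. */
int counted_str_equal(struct counted_str a, struct counted_str b)
{
    assert(a.data != NULL && b.data != NULL);
    return a.len == b.len && memcmp(a.data, b.data, a.len) == 0;
}
```

A value such as { 3, "a\0b" } is a valid three-character string under this representation, while strlen on the same bytes would report a length of 1.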
In Windows kernel programming, the most often used string type is UNICODE_STRING, a string type that is not null-terminated:
typedef struct _UNICODE_STRING {
USHORT Length;
USHORT MaximumLength;
PWSTR Buffer;
} UNICODE_STRING;
The purpose of this data structure is to process strings efficiently along the driver stack. Each driver in the stack may append text or modify the string within the range of "MaximumLength" without allocating a new buffer.
For example, below is a typical Unicode string stored in a contiguous 64-byte buffer (the offsets assume a 64-bit build, where USHORT is 2 bytes and Buffer is an 8-byte pointer):
address + 0 : 22 (Length, in bytes)
address + 2 : 48 (MaximumLength, in bytes)
address + 8 : address + 16 (Buffer)
address + 16: "Hello World" (UTF-16 string, not necessarily null-terminated)
The standard string manipulation functions cannot be used on a UNICODE_STRING; use the Rtl*UnicodeString() functions instead.
It would be easier to answer when we can use non null terminated strings.
Some API functions take only a string pointer (SetWindowText, CreateFile), and those strings have to be terminated with a null character. Other functions (ExtTextOut, WriteConsole) take a pointer and some form of length (usually a number of chars, TCHARs or wchar_ts). These strings don't have to be terminated by a null character.
// No terminating NUL character below.
TCHAR hello[] = { 'H','E','L','L','O' };
// ExtTextOut takes a clipping rectangle (here NULL) before the string pointer.
ExtTextOut( hdc, 100, 100, 0, NULL, hello, 5, NULL );
TCHAR hello2[] = _T("HELLO WORLD!");
ExtTextOut( hdc, 100, 100, 0, NULL, hello2, 5, NULL );
In the second ExtTextOut call we don't have to artificially cut the hello2 string (or copy it to a temporary buffer). The function can draw part of a string without requiring null termination.
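The same pointer-plus-length idea can be tried without the Windows API. The sketch below is a portable analogue, not ExtTextOut itself: it uses the standard "%.*s" conversion, which reads at most the given number of chars, so the source array does not need a terminator as long as the count does not exceed its size. The name format_prefix is a hypothetical helper.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: formats the first count chars of text into out.
 * "%.*s" reads at most count chars from text, so text needs no '\0'
 * provided it holds at least count chars. Returns snprintf's result. */
int format_prefix(char *out, size_t out_size, const char *text, int count)
{
    assert(out != NULL && text != NULL && count >= 0);
    return snprintf(out, out_size, "%.*s", count, text);
}
```

Called with an unterminated { 'H','E','L','L','O' } array and a count of 5, or with "HELLO WORLD!" and a count of 5, it produces the same five characters in out.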

Can I store NULL in a string?

I want to perform some lengthy operations on a large file, which will involve lots and lots of seeking. (The current version of the program takes 5 hours and uses fseek at least 15,057,456 times.) As a result, I am hoping to load the file into the ram, and use char* instead of FILE*. Can I load null characters from the file into the char* array if I:
Malloc the char array, and store its length separately, and
Only use character operations on the array (i.e. newchar = *(pointertothearray+offset) ), avoiding operations like strcpy or strstr?
You can load the whole file in a dynamic char array (malloc'ed on the heap) even if there are null characters in it : a null character is a valid char.
But you cannot call it a string. By the language specification, a C string is a null-terminated char array.
So as long as you only use offsets and mem... functions, and no str... functions, there is no problem with having null characters in a char array.
You can load the entire file's contents into memory. Essentially this buffer will be a byte stream and not a string.
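A minimal sketch of loading a file as a byte buffer with its length stored separately might look like the following. The name read_all is hypothetical, and the code assumes a seekable stream opened in binary mode on a platform where seeking to the end and calling ftell yields the file size (true on POSIX systems, not guaranteed by ISO C).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads a whole seekable binary stream into a
 * malloc'ed buffer and stores its length in *out_len. Returns NULL on
 * failure; the caller frees the buffer. The buffer is not a C string:
 * it has no terminator and may contain zero bytes anywhere. */
unsigned char *read_all(FILE *fp, size_t *out_len)
{
    assert(fp != NULL && out_len != NULL);
    if (fseek(fp, 0, SEEK_END) != 0) return NULL;
    long size = ftell(fp);
    if (size < 0 || fseek(fp, 0, SEEK_SET) != 0) return NULL;

    /* Allocate at least one byte so an empty file is not seen as failure. */
    unsigned char *buf = malloc(size > 0 ? (size_t)size : 1);
    if (buf == NULL) return NULL;

    if (fread(buf, 1, (size_t)size, fp) != (size_t)size) {
        free(buf);
        return NULL;
    }
    *out_len = (size_t)size;
    return buf;
}
```

After the call, buf[offset] replaces each fseek/fread pair, with offset checked against the returned length rather than against a terminator.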

Appending an Int to a char * in C

So I am looking to append the length of a cipher text onto the end of the char array that I am storing the cipher in. I am not a native to C and below is a test snippet of what I have devised that I think works.
...
int cipherTextLength = 0;
unsigned char *cipherText = NULL;
...
EVP_EncryptFinal_ex(&encryptCtx, cipherText + cipherTextLength, &finalBlockLength);
cipherTextLength += finalBlockLength;
EVP_CIPHER_CTX_cleanup(&encryptCtx);
// Append the length of the cipher text onto the end of the cipher text
// Note, the length stored will never be anywhere near 4294967295
char cipherLengthChar[1];
sprintf(cipherLengthChar, "%d", cipherTextLength);
strcat(cipherText, cipherLengthChar);
printf("ENC - cipherTextLength: %d\n", cipherTextLength);
...
The problem is I don't think using strcat when dealing with binary data is going to be trouble free. Could anyone suggest a better way to do this?
Thanks!
EDIT
Ok, so I'll add a little context as to why I was looking to append the length. In my encrypt function, EVP_EncryptUpdate requires the length of the plainText being encrypted. As this is easy to obtain, this part isn't a problem. However, similarly, using EVP_DecryptFinal_ex in my decrypt function requires the length of the cipherText being decrypted, so I need to store it somewhere.
In the application where I am implementing this, all I am doing is changing some poor hashing to proper encryption. To add further hassle, the way the application is structured, I first need to decrypt information read in from XML, do something with it, then encrypt it and rewrite it to XML again, so I need to have this cipher length stored in the cipher somehow. I also don't have scope to redesign this.
Instead of what you are doing now, it may be smarter to encode the ciphertext size to a location before the ciphertext itself. Once you start decrypting, it is not very useful to find the size at the end. You need to know the end to get the size to find the end, not very helpful.
Furthermore, the ciphertext is binary, so you don't need to convert anything to string. You would like to convert it to a fixed number of bytes (otherwise you don't know the size of the size :P ). So create a bigger buffer (4 bytes more than you require for the ciphertext), and start encrypting to offset 4 forwards. Then copy the size of the ciphertext in at the start of the buffer.
If you don't know how to encode an integer, take a look at, for instance, this question/answer. Note that this will only encode 32 bits, for a maximum ciphertext size of just under 2^32 bytes (about 4 GiB). Furthermore, the linked answer uses big-endian encoding. You should use either big-endian (preferred for crypto code) or little-endian encoding, but don't mix the two.
Neither the ciphertext nor the encoded size should be used as a character string. If you need a character string, my suggestion is to base 64 encode the buffer up to the end of the ciphertext.
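The fixed-width length prefix described above can be sketched as two small helpers. The names put_be32 and get_be32 are hypothetical, and big-endian order is used as suggested; this is an illustration of the encoding only, not of the OpenSSL calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes value into out[0..3], most significant byte first (big-endian). */
void put_be32(unsigned char *out, uint32_t value)
{
    assert(out != NULL);
    out[0] = (unsigned char)(value >> 24);
    out[1] = (unsigned char)(value >> 16);
    out[2] = (unsigned char)(value >> 8);
    out[3] = (unsigned char)value;
}

/* Reads a big-endian 32-bit value back from in[0..3]. */
uint32_t get_be32(const unsigned char *in)
{
    assert(in != NULL);
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}
```

On the encrypting side, the buffer would be allocated 4 bytes larger than the ciphertext needs, the ciphertext written starting at buffer + 4, and put_be32(buffer, cipherTextLength) called once the final length is known. The decrypting side reads get_be32(buffer) first and then treats the following bytes as ciphertext.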
I hope you have big enough arrays for both cipherText and cipherLengthChar to store the required text. So instead of
unsigned char *cipherText = NULL;
you can have
unsigned char cipherText[MAX_TEXT];
and similarly
char cipherLengthChar[MAX_INT];
where MAX_TEXT and MAX_INT are the maximum buffer sizes needed to store the text and the integer.
Or you can allocate them dynamically. Also, after the first call to EVP_EncryptFinal_ex, null-terminate cipherText so that your strcat works (note that strcat will still stop early if the ciphertext itself contains a zero byte).
The problem is I don't think using strcat when dealing with binary data is going to be trouble free.
Correct! That's not your only problem though:
// Note, the length stored will never be anywhere near 4294967295
char cipherLengthChar[1];
sprintf(cipherLengthChar, "%d", cipherTextLength);
Even if cipherTextLength is 0 here, you've gone out of bounds, since sprintf will add a null terminator, making a total of two chars -- but cipherLengthChar only has room for one.
If you consider, e.g. 4294967295, as a string, that's 10 chars + '\0' = 11 chars.
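If the decimal text of an integer is needed at all, one way to avoid guessing the buffer size is to ask snprintf for it first. This is a sketch with a hypothetical helper name, int_to_text; it relies on the C99 behavior that snprintf with a NULL buffer and size 0 returns the number of chars that would have been written.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: formats value as decimal text in a malloc'ed
 * buffer of exactly the right size. The caller frees the result. */
char *int_to_text(int value)
{
    /* With a NULL buffer and size 0, snprintf writes nothing and returns
     * the number of chars the text needs, excluding the terminator. */
    int needed = snprintf(NULL, 0, "%d", value);
    assert(needed > 0);

    char *text = malloc((size_t)needed + 1);   /* +1 for the '\0' */
    if (text != NULL)
        snprintf(text, (size_t)needed + 1, "%d", value);
    return text;
}
```

For a value of 0 this allocates two bytes, which is the one-char-plus-terminator case that overflowed the one-element array above.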
It would appear that finalBlockLength is the length of the data put into cipherText. However, the EVP_EncryptFinal_ex() call will probably fail in one way or another, or at least not do what you want, since cipherText == NULL. You then add 0 to that (== 0 aka. still NULL) and submit it as a parameter. Either you need a pointer to a pointer there (if EVP_EncryptFinal_ex is going to allocate space for you), or else you have to make sure there is enough room in cipherText to start with.
With regard to tacking text (or whatever) onto the end, you can just use sprintf directly, writing at the current end of the data:
sprintf((char *)cipherText + cipherTextLength, "%d", cipherTextLength);
This presumes that cipherText is non-NULL and has enough extra room in it (see the first couple of paragraphs).
However, I'm very dubious that doing that will be useful later on, but since I don't have any further context, I can't say more.

Manipulating C-strings with multiple null characters in memory

I need to search through a chunk of memory for a string of characters, but several of these strings have every character null separated, like this:
"I. .a.m. .a. .s.t.r.i.n.g"
with all of the '.'s being null characters. My problem comes from actually getting this into memory. I've tried several ways, for instance:
char* str2;
str2 = (char*)malloc(sizeof(char)*40);
memcpy((void*)str2, "123\0567\09abc", 12);
Will put the following into the memory that str2 points to: 123.7.9abc..
Something like
str2 = "123456789\0abcde\054321";
Will have str2 pointing to a block of memory that looks like 123456789.abcde,321 , wherein the '.' is a null character, and the ',' is an actual comma.
So clearly inserting null characters into cstrings doesn't work as easily as I thought it did, like inserting a newline character. I encountered similar difficulties trying this with the string library as well. I could do separate assignments, something like:
char* str;
str = (char*)malloc(sizeof(char)*40);
strcpy(str, "123");
strcpy(str+4, "abc");
strcpy(str+8, "ABC");
But that is certainly not preferable, and I believe the problem lies in my understanding of how c-style strings are stored in memory. Clearly "abc\0123" doesn't actually go into memory as 61 62 63 00 31 32 33 (in hex). How is it stored, and how can I store what I need to?
(I also apologize for not having set the code in blocks, this is my first time posting a question, and somehow "four spaced" is more difficult than I can handle apparently. Thank you, Luchian. I see more newlines were needed.)
If every other char contains a null, then almost certainly you actually have UTF-16 encoded strings. Process them accordingly and your problems will disappear.
Assuming you are on Windows, where UTF-16 is common, you would use wchar_t* rather than char* to hold such strings. And you would use wide char string processing functions to operate on such data. For example, use wcscpy rather than strcpy and so on.
\0 is the start of an octal escape sequence, not just a "null character" (even though using it on its own will result in one).
The easiest way to define a string containing a null character followed by something that could also be treated as part of an octal escape sequence (such as "\012" [1]) is to split it up using C's concatenation of adjacent string literals:
char const * p = "123456789" "\0" "abcde" "\0" "54321";
[1] "\012" results in the single character with hex value 0x0A, not the three characters 0x00, '1' and '2'.
First off, every second character being a null is a clear hallmark of a wide string: a string that's composed of two-byte characters, really an array of unsigned shorts. Depending on your compiler and settings, you might be better off using the datatype wchar_t instead of char and the wcsxxx() family of functions instead of strxxx().
On Windows, 2-byte wide strings (UTF-16, technically) are the native string format of the OS, so they're all over the place.
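That said, strxxx() functions all assume that the string is null-terminated. So plan accordingly. Sometimes memxxx() will come to the rescue.
"abc\0123" does not go into memory the way you expect because \012 is interpreted by the compiler as a single octal escape sequence: the character with octal code 12 (that's 0a hex). To avoid this, use one of the following literals:
"abc\000123"
"abc\x00""123"
"abc\0""123"
An octal escape takes at most three digits, so \000 ends before the 1. A hex escape has no length limit, so "abc\x00123" would be parsed as the single out-of-range escape \x00123; the literal has to be split after \x00.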
The snippet where you generate a string from chunks is mostly correct. It's just that I'd rather use
strcpy(str+strlen(str)+1, "123");
which guarantees that the next chunk will be written past the null character of the previous chunk.
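The effect of the octal escape can be checked directly by looking at the sizes and bytes of the arrays. The sketch below uses hypothetical names (fused, split, print_hex) and only standard C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* "\012" is one octal escape (0x0A), so this array holds
 * 'a' 'b' 'c' 0x0A '3' plus the terminator: 6 bytes. */
static const char fused[] = "abc\0123";

/* Splitting the literal ends the escape after "\0", so this array holds
 * 'a' 'b' 'c' 0x00 '1' '2' '3' plus the terminator: 8 bytes. */
static const char split[] = "abc\0" "123";

/* Prints n bytes as two-digit hex values, so embedded zeros are visible. */
void print_hex(const char *bytes, size_t n)
{
    assert(bytes != NULL);
    for (size_t i = 0; i < n; i++)
        printf("%02x ", (unsigned)(unsigned char)bytes[i]);
    putchar('\n');
}
```

Calling print_hex(fused, sizeof fused) and print_hex(split, sizeof split) shows the difference. Note that sizeof is used rather than strlen, since strlen would stop at the first zero byte in split.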
I am a bit confused by your question.
But let me guess what is going on. You are looking at a 16-bit wchar_t string and not a normal C string.
A wchar_t string holding ASCII characters may look like it has nulls between the letters, but this is normal.
Simply cast with (wchar_t *)XXX, where XXX is a pointer to that region of memory, and look up the wchar_t functions such as wcscpy. As for the nulls between strings, this may be a known method of passing multiple strings in one block. You can iterate, reading each string in turn, until you encounter two consecutive nulls.
Hope I have answered your question.
Good luck!
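The "multiple strings ended by two consecutive nulls" layout mentioned above can be walked with wcslen. This is a sketch under the assumption that the block really has that layout; count_multi_strings is a hypothetical name. Note also that wchar_t is 16 bits on Windows but commonly 32 bits elsewhere, so raw UTF-16 data read on another platform would need a 16-bit element type instead.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Counts the strings in a block laid out as s1 L'\0' s2 L'\0' ... L'\0',
 * i.e. consecutive wide strings followed by an empty string. */
size_t count_multi_strings(const wchar_t *block)
{
    assert(block != NULL);
    size_t count = 0;
    while (*block != L'\0') {
        block += wcslen(block) + 1;   /* skip this string and its terminator */
        count++;
    }
    return count;
}
```

The same loop shape works for processing each string: block points at the start of a complete, terminated wide string on every iteration.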

Scanning a file and allocating correct space to hold the file

I am currently using fscanf to get space delimited words. I establish a char[] with a fixed size to hold each of the extracted words. How would I create a char[] with the correct number of spaces to hold the correct number of characters from a word?
Thanks.
Edit: If I do a strdup on a char[1000] and the char[1000] actually only holds 3 characters, will the strdup reserve space on the heap for 1000 or 4 (for the terminating char)?
Here is a solution involving only two allocations and no realloc:
Determine the size of the file by seeking to the end and using ftell.
Allocate a block of memory this size and read the whole file into it using fread.
Count the number of words in this block.
Allocate an array of char * able to hold pointers to this many words.
Loop through the block of text again, assigning to each pointer the address of the beginning of a word, and replacing the word delimiter at the end of the word with 0 (the null character).
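The counting and splitting steps above can be sketched as follows, assuming the file has already been read into a buffer with one spare byte at the end. The name split_words is hypothetical, and whitespace as defined by isspace is assumed to be the word delimiter.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical helper: splits text (len bytes, with text[len] writable)
 * into whitespace-delimited words in place. Returns a malloc'ed array of
 * pointers into text and stores the word count in *out_count. */
char **split_words(char *text, size_t len, size_t *out_count)
{
    assert(text != NULL && out_count != NULL);
    text[len] = '\0';                     /* terminate the last word */

    /* Pass 1: count the words (a word starts at a non-space char that
     * follows a space or the start of the buffer). */
    size_t count = 0;
    int in_word = 0;
    for (size_t i = 0; i < len; i++) {
        int space = isspace((unsigned char)text[i]);
        if (!space && !in_word) count++;
        in_word = !space;
    }

    char **words = malloc((count ? count : 1) * sizeof *words);
    if (words == NULL) return NULL;

    /* Pass 2: record each word start and overwrite delimiters with '\0'. */
    size_t w = 0;
    in_word = 0;
    for (size_t i = 0; i < len; i++) {
        if (isspace((unsigned char)text[i])) {
            text[i] = '\0';
            in_word = 0;
        } else if (!in_word) {
            words[w++] = &text[i];
            in_word = 1;
        }
    }
    *out_count = count;
    return words;
}
```

Together with the single allocation for the file contents, this gives the two allocations described: one block for all the characters and one array of pointers. Freeing both releases every word at once.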
Also, a slightly philosophical matter: If you think this approach of inserting string terminators in-place and breaking up one gigantic string to use it as many small strings is ugly, hackish, etc. then you should probably forget about programming in C and use Python or some other higher-level language. The ability to do radically-more-efficient data manipulation operations like this while minimizing the potential points of failure is pretty much the only reason anyone should be using C for this kind of computation. If you want to go and allocate each word separately, you're just making life a living hell for yourself by doing it in C; other languages will happily hide this inefficiency (and abundance of possible failure points) behind friendly string operators.
There's no one-and-only way. The idea is to just allocate a string large enough to hold the largest possible string. After you've read it, you can then allocate a buffer of exactly the right size and copy it if needed.
In addition, you can also specify a width in your fscanf format string to limit the number of characters read, to ensure your buffer will never overflow.
But if you allocate a buffer of, say, 250 characters, it's hard to imagine a single word not fitting in that buffer.
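char *ptr;
ptr = malloc(size_of_string + 1);  /* +1 for the terminating null character */
/* check that ptr is not NULL and copy the word in before reading from it */
char first = ptr[0];
/* etc. */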
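Putting the width limit and the exact-size copy together, a sketch might look like this. The name read_word is hypothetical, and the 250-char scratch buffer is just the size suggested above.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: reads the next whitespace-delimited word from fp
 * and returns a heap copy sized to fit, or NULL at end of input or on
 * allocation failure. Words longer than 249 chars are split. */
char *read_word(FILE *fp)
{
    assert(fp != NULL);
    char scratch[250];                    /* 249 chars + terminator */
    if (fscanf(fp, "%249s", scratch) != 1)
        return NULL;

    size_t len = strlen(scratch);
    char *word = malloc(len + 1);         /* exact size, +1 for '\0' */
    if (word != NULL)
        memcpy(word, scratch, len + 1);
    return word;
}
```

This also addresses the strdup question in the edit: strdup sizes its allocation from strlen of the source plus one, not from the size of the array the source lives in, so duplicating a char[1000] that holds a 3-character string reserves 4 bytes.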
