Compare two binary files using strcmp() in C

I have two binary files.
I read each file into its own buffer.
Then I compared the two buffers using strcmp().
The result of strcmp() was zero.
So I assumed the two files were identical.
I then opened both files to check that there were no differences.
But I did find small differences.
What is the problem?
Is strcmp() not a proper way to compare binary data?

The C function strcmp is written to compare strings. In C, a string is a sequence of chars that ends with a null byte ('\0'). Therefore, the comparison only goes up to the first null byte.
Example:
File A: "abcd\0efg"
File B: "abcd\0xyz"
Since both files are equal up to the null byte, the "strings" at these locations compare equal, although what comes after may differ. You should use the function memcmp instead, which compares a given number of bytes regardless of their values.
EDIT:
As pointed out in a comment under this answer and as mentioned in the other answer, the man pages of strcmp and memcmp are reliable resources for learning about these functions from the standard library.

You can't compare binary data using string functions.
You need to use memcmp instead.
https://man7.org/linux/man-pages/man3/memcmp.3.html

Related

Do strcmp and strstr test binary equivalence?

https://learn.microsoft.com/en-us/windows/win32/intl/security-considerations--international-features
This webpage makes me wonder.
Apparently some Windows APIs may consider two strings equal when they are actually different byte sequences.
I want to know how the C standard library behaves in this respect.
In other words, does strcmp(a,b)==0 imply strlen(a)==strlen(b)&&memcmp(a,b,strlen(a))==0?
And what about other string functions, including the wide character versions?
edit:
for example, CompareStringW equates L"\x00C5" and L"\x212B"
printf("%d\n",CompareStringW(LOCALE_INVARIANT,0,L"\x00C5",-1,L"\x212B",-1)==CSTR_EQUAL); outputs 1
What I'm asking is whether it is guaranteed that the C library functions never behave like this.
Two strings that use different encodings can represent the same text even if their byte representations differ.
The standard library's strcmp compares plain "character" strings byte by byte, and in that case strcmp(a,b)==0 does imply strlen(a)==strlen(b)&&memcmp(a,b,strlen(a))==0.
Functions like wcscmp require both strings to be encoded the same way, so their byte representation should be the same.
The regular string functions operate byte-by-byte. The specification says:
The sign of a nonzero value returned by the comparison functions memcmp, strcmp, and strncmp is determined by the sign of the difference between the values of the first pair of characters (both interpreted as unsigned char) that differ in the objects being compared.
strcmp() and memcmp() do the same comparisons. The only difference is that strcmp() uses the null terminators in the strings as the limit, memcmp() uses a parameter for this, and strncmp() takes a limit parameter and uses whichever comes first.
The wide string function specification says:
Unless explicitly stated otherwise, the functions described in this subclause order two wide characters the same way as two integers of the underlying integer type designated by wchar_t.
wcscmp() doesn't say otherwise, so it's also comparing the wide characters numerically, not by converting their encodings to some common character representations. wcscmp() is to wmemcmp() as strcmp() is to memcmp().
On the other hand, wcscoll() compares the strings as interpreted according to the LC_COLLATE category of the current locale. So this may not be equivalent to memcmp().
For other functions you should check the documentation to see whether they reference the locale.
Apparently some Windows APIs may consider two strings equal when they are actually different byte sequences.
Depending on context and where you got those strings from, that would actually be the semantically correct behavior.
There are multiple ways to encode certain characters. Take the German 'ä', for example. In Unicode, this could be U+00E4 LATIN SMALL LETTER A WITH DIAERESIS, or it could be the sequence U+0061 LATIN SMALL LETTER A followed by U+0308 COMBINING DIAERESIS. You might want a comparison function that compares these as equal. Or you could have them not compare equal, but have a standalone function that turns one representation into the other ("normalization").
You could want a comparison function that compares '6' (six) as equal to '๖' (also six, just in Thai). ("Canonicalization")
The byte string functions (strcmp() etc.) are not capable of any of that. They only deal in byte sequences, and are unaware of anything I wrote above.
As for the wide string functions (wcscmp() etc.), well... they are not that either, really.
in other words, does strcmp(a,b)==0 imply strlen(a)==strlen(b)&&memcmp(a,b,strlen(a))==0? and what about other string functions, including wide character versions?
Either will test for binary equivalence, as there are no mechanisms in the C Standard Library to normalize or canonicalize strings.[1]
If you are actually dealing in processing strings (as opposed to just passing them through, for which C byte strings and wide strings are adequate), you should use the ICU library, the de facto standard for C/C++ Unicode handling. It looks daunting but actually needs to be to handle all these things correctly.
Basically, any C/C++ API that promises to do the same is either using the ICU library itself, or is very likely not doing what it advertises.
[1]: Strictly speaking, strcoll() / strxfrm() and wcscoll() / wcsxfrm() provide enough wiggle room to squeeze in proper Unicode mechanics for collation, but I don't know of an implementation that actually bothers to do so.

What is an effective way to compare strings?

I am writing different sets of strings generated by a piece of software into a text file. I want to write a test that compares the generated text with the written text to catch any errors.
What is an effective way to do such a test?
The standard method to compare C strings is the strcmp() function declared in <string.h>.
There are a few special cases where more efficient solutions can be sought:
if the strings have a known length: memcmp() can be used and might perform better as it does not need to test for end of strings.
if only equality is to be tested, the extra work performed by strcmp() to compute the relative lexicographical order of the strings could be avoided, but strcmp() is usually implemented very efficiently, so it is unlikely you get any improvement by handcoding an alternative in C.
To compare two strings in C, pass them to the function strcmp().
If it returns 0, the two strings are equal.
If it returns a nonzero value, the two strings are not equal.

The terminating NULL in an array in C

I have a simple question. Why is it necessary to account for the terminating null in an array of chars (or simply a string) but not in an array of integers? When I want a string to hold 20 characters, I need to declare char string[21];. When I want to declare an array of integers holding 5 digits, int digits[5]; is enough. What is the reason for this?
You don't have to terminate a char array with NULL if you don't want to, but when using them to represent a string, then you need to do it because C uses null-terminated strings to represent its strings. When you use functions that operate on strings (like strlen for string-length or using printf to output a string), then those functions will read through the data until a NULL is encountered. If one isn't present, then you would likely run into buffer overflow or similar access violation/segmentation fault problems.
In short: that's how C represents string data.
Null terminators are required at the end of strings (or character arrays) because:
Most standard library string functions expect the null character to be there. It's put there in lieu of passing an explicit string length (though some functions require that instead.)
By design, the NUL character (ASCII 0x00) is used to designate the end of strings. Note that it is not an end-of-file marker: in C, EOF is a value returned by the I/O functions, not a character stored in the file or stream.
Technically, if you're doing your own string manipulation with your own coded functions, you don't need a null terminator; you just need to keep track of how long the string is. But, if you use just about anything standardized, it will expect it.
It is only by convention that C strings end in the ascii nul character. (That's actually something different than NULL.)
If you like, you can begin your strings with a nul byte, or randomly include nul bytes in the middle of strings. You will then need your own library.
So the answer is: all arrays must allocate space for all of their elements. Your "20 character string" is simply a 21-character string, including the nul byte.
The reason is that it was a design choice of the original implementors. A null-terminated string gives you a way to pass an array into a function without passing the size. With an integer array you must always pass the size. It's a convention of the language, nothing more. You could rewrite every string function in C without using a null terminator, but you would always have to keep track of your array size.
The purpose of null termination in strings is so that the parser knows when to stop iterating through the array of characters.
So, when you use printf with the %s format character, it's essentially doing this:
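/* conceptual sketch: output() stands for whatever emits one character */
int i = 0;
while (input[i] != '\0') {   /* stop at the terminating null character */
    output(input[i]);
    i++;
}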
This concept is commonly known as a sentinel.
It's not about declaring an array that's one-bigger, it's really about how we choose to define strings in C.
C strings by convention are considered to be a series of characters terminated by a final NUL character, as you know. This is baked into the language in the form of interpreting "string literals", and is adopted by all the standard library functions like strcpy, printf, etc. Everyone agrees that this is how we'll do strings in C, and that character is there to tell those functions where the string stops.
Looking at your question the other way around, the reason you don't do something similar in your arrays of integers is because you have some other way of knowing how long the array is-- either you pass around a length with it, or it has some assumed size. Strings could work this way in C, or have some other structure to them, but they don't -- the guys at Bell Labs decided that "strings" would be a standard array of characters, but would always have the terminating NUL so you'd know where it ended. (This was a good tradeoff at that time.)
It's not absolutely necessary to have the character array be 21 elements. It's only necessary if you follow the (nearly always assumed) convention that the twenty characters be followed by a null terminator. There is usually no such convention for a terminator in integer and other arrays.
Because of the technical reasons of how C strings are implemented compared to other conventions.
Actually - you don't have to NUL-terminate your strings if you don't want to!
The only problem is you have to re-write all the string libraries because they depend on them. It's just a matter of doing it the way the library expects if you want to use their functionality.
Just like I have to bring home your daughter at midnight if I wish to date her - just an agreement with the library (or in this case, the father).

Why do strings in C need to be null terminated?

Just wondering why this is the case. I'm eager to know more about low level languages, and I'm only into the basics of C and this is already confusing me.
Do languages like PHP automatically null terminate strings as they are being interpreted and / or parsed?
From Joel's excellent article on the topic:
Remember the way strings work in C: they consist of a bunch of bytes followed by a null character, which has the value 0. This has two obvious implications:
There is no way to know where the string ends (that is, the string length) without moving through it, looking for the null character at the end.
Your string can't have any zeros in it. So you can't store an arbitrary binary blob like a JPEG picture in a C string.
Why do C strings work this way? It's because the PDP-7 microprocessor, on which UNIX and the C programming language were invented, had an ASCIZ string type. ASCIZ meant "ASCII with a Z (zero) at the end."
Is this the only way to store strings? No, in fact, it's one of the worst ways to store strings. For non-trivial programs, APIs, operating systems, class libraries, you should avoid ASCIZ strings like the plague.
Think about what memory is: a contiguous block of byte-sized units that can be filled with any bit patterns.
2a c6 90 f6
A character is simply one of those bit patterns. Its meaning as a string is determined by how you treat it. If you looked at the same part of memory, but using an integer view (or some other type), you'd get a different value.
If you have a variable which is a pointer to the start of a bunch of characters in memory, you must know when that string ends and the next piece of data (or garbage) begins.
Example
Let's look at this string in memory...
H e l l o , w o r l d ! \0
^
|
+------ Pointer to string
...we can see that the string logically ends after the ! character. If there were no \0 (or any other method to determine its end), how would we know when seeking through memory that we had finished with that string? Other languages carry the string length around with the string type to solve this.
I asked this question when my underlying knowledge of computers was limited, and this is the answer that would have helped many years ago. I hope it helps someone else too. :)
C strings are arrays of chars, and when a C array is passed around it is just a pointer to a memory location, which is the start of the array. But the length (or end) of the array must also be expressed somehow; in the case of strings, null termination is used. An alternative would be to carry the length of the string alongside the pointer, or to put the length in the first array location, or whatever. It's just a matter of convention.
Higher level languages like Java or PHP store the size information with the array automatically & transparently, so the user needn't worry about them.
C has no notion of strings by itself. Strings are simply arrays of chars (or of wchar_t for wide characters).
Because of this, C has no way to check, e.g., the length of a string: there is no "mystring->length", no length value stored anywhere. The only way to find the end of the string is to iterate over it and check for the \0.
There are string-libraries for C which use structs like
struct string {
    int length;
    char *data;
};
to remove the need for the \0-termination but this is not standard C.
Languages like C++, PHP, Perl, etc. have their own internal string libraries, which often have a separate length field that speeds up some string functions and removes the need for the \0.
Some other languages (like Pascal) use a string type that is called (surprisingly) a Pascal string. It stores the length in the first byte of the string, which is why those strings are limited to a length of 255 characters.
Because in C, strings are just a sequence of characters accessed via a pointer to the first character.
There is no space in a pointer to store the length so you need some indication of where the end of the string is.
In C it was decided that this would be indicated by a null character.
In Pascal, for example, the length of a string is recorded in the byte immediately preceding the first character, which is why Pascal strings have a maximum length of 255 characters.
It is a convention - one could have implemented it with another algorithm (e.g. length at the beginning of the buffer).
In a "low level" language such as assembler, it is easy to test for a zero byte efficiently: that might have eased the decision to go with null-terminated strings as opposed to keeping track of a length counter.
They need to be null terminated so you know how long they are. And yes, they are simply arrays of char.
Higher level languages like PHP may choose to hide the null termination from you or not use it at all - they may maintain a length, for example. C doesn't do it that way because of the overhead involved. High level languages may also not implement strings as an array of char - they could (and some do) implement them as lists of arrays of char, for example.
In C, strings are represented by an array of characters allocated in a contiguous block of memory, and thus there must either be an indicator marking the end of the block (i.e. the null character), or a way of storing the length (like Pascal strings, which are prefixed by a length).
In languages like PHP, Perl, C#, etc., strings may or may not be complex data structures, so you cannot assume they have a null character. As a contrived example, you could have a language that represents a string like so:
class string
{
    int length;
    char[] data;
}
but you only see it as a regular string with no length field, as this can be calculated by the runtime environment of the language and is only used internally by it to allocate and access memory correctly.
They are null-terminated because a whole lot of standard library functions expect them to be.

Resources