Handle a string with length in C

In C (not C++), we can think of several ways of handling a string together with its length:
Just rely on the null terminating character (\0): We assume that the string doesn't contain \0. Store a string to a char array and append \0 at the end. Use the functions like strlen() when we need its size.
Store the characters and the length into a struct:
typedef struct _String {
    char* data;
    int size;
} String;
Use another variable for storing the length: For example,
char name[] = "hello";
int name_size = 5;
some_func(name, name_size, ...);
Personally, I prefer to use the second approach, since
It can cover some 'weird' strings which contain \0 in the middle.
We may implement some functions like string_new(), string_del(), string_getitem(), etc. to write some 'OOP-like' code.
We don't have to use two (or more) variables to handle the string and its length together.
My question is: What is the most-used way to handle strings in C? (especially when we have to use a lot of strings, e.g. when writing an interpreter)
Thanks.

What is the most-used way to handle strings in C?
No doubt the most common way by far is to simply rely on the null termination.
Is it the "best" way? Probably not. Using a custom string library may be the "best" way as far as execution speed and program design are concerned. The downside is that you would have to drag that library around, since there are no standard or even de facto standard string libraries for C.

Most C programmers simply use ASCIIZ strings and accept the inefficiency. C is still a very fast language.
However, if you are doing a lot of string processing, it may be worthwhile to write a dedicated string library or suite. A struct with a length member and a pointer is an obvious choice. If you get really advanced, for example for genetic data processing, you will find that you need structures such as suffix trees, which allow a substring search in time proportional to the length of the pattern, independent of the length of the text.
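A minimal sketch of the struct-with-length idea follows. The function names string_new(), string_del() and string_getitem() come from the question; their signatures and the error conventions are assumptions made here for illustration, not an established library.

```c
#include <stdlib.h>
#include <string.h>

/* Length-tracked string: the bytes may contain '\0' anywhere. */
typedef struct {
    char  *data;  /* owned buffer, not necessarily null-terminated */
    size_t size;  /* number of bytes stored in data */
} String;

/* Copy `size` bytes from `src` into a new String; returns NULL on failure. */
String *string_new(const char *src, size_t size)
{
    String *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    s->data = malloc(size ? size : 1);   /* avoid a zero-byte allocation */
    if (s->data == NULL) {
        free(s);
        return NULL;
    }
    if (size)
        memcpy(s->data, src, size);
    s->size = size;
    return s;
}

/* Release a String created by string_new(); NULL is accepted. */
void string_del(String *s)
{
    if (s != NULL) {
        free(s->data);
        free(s);
    }
}

/* Bounds-checked read: returns the byte at `index`, or -1 if out of range. */
int string_getitem(const String *s, size_t index)
{
    if (index >= s->size)
        return -1;
    return (unsigned char)s->data[index];
}
```

For example, string_new("ab\0cd", 5) keeps all five bytes, whereas strlen("ab\0cd") stops at the embedded zero and reports 2.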

In the C language, a string is by definition null terminated. That is why string literals are null terminated, and why the strxxx functions of the Standard Library operate on null-terminated strings.
On the other hand, character arrays can contain whatever you want, including nulls, and you have to pass their length in another way, as for any other array.
Because of the way C handles string literals, and because of the C standard library, C programmers ordinarily use null-terminated strings. But it is worth noting that in C++ a std::string is close(*) to a character array plus a length, and even though C++ is a different language, the introduction of the C++ standard says (emphasis mine):
C++ is a general purpose programming language based on the C programming language...
Another example is the way the Windows API internally manages Unicode strings as BSTR. A BSTR is a special array of uint16_t in which the length is stored just before the first character (at a negative offset). This was chosen for compatibility with Visual Basic.
So if you need it, it is perfectly fine to build a library using strings defined as a struct array + length... or use the WINAPI implementation if appropriate or migrate to C++.
(*) In fact, in some older (pre-C++11) implementations a C++ string is a smart pointer counting references to a character array and its length

Obviously the most used way is the null-terminated way, since that is supported by the standard libraries.
Writing your own structures for strings may make sense for your purpose, but it will never become "the most used way", because it is not a standard way.


Can a C implementation use length-prefixed-strings "under the hood"?

After reading this question: What are the problems of a zero-terminated string that length-prefixed strings overcome? I started to wonder, what exactly is stopping a C implementation from allocating a few extra bytes for any char or wchar_t array allocated on the stack or heap and using them as a "string prefix" to store the number N of its elements?
Then, if the N-th character is '\0', N - 1 would signify the string length.
I believe this could mightily boost performance of functions such as strlen or strcat.
This could potentially turn to extra memory consumption if a program uses non-0-terminated char arrays extensively, but that could be remedied by a compiler flag turning on or off the regular "count-until-you-reach-'\0'" routine for the compiled code.
What are possible obstacles for such an implementation? Does the C Standard allow for this? What problems can this technique cause that I haven't accounted for?
And... has this actually ever been done?
You can store the length of the allocation. And malloc implementations really do do that (or some do, at least).
You can't reasonably store the length of whatever string is stored in the allocation, though, because the user can change the contents to their whim; it would be unreasonable to keep the length up to date. Furthermore, users might start strings somewhere in the middle of the character array, or might not even be using the array to hold a string!
Then, if the N-th character is '\0', N - 1 would signify the string length.
Actually, no, and that's why this suggestion cannot work.
If I overwrite a character in a string with a 0, I have effectively truncated the string, and a subsequent call of strlen on the string must return the truncated length. (This is commonly done by application programs, including every scanner generated by (f)lex, as well as the strtok standard library function. Amongst others.)
Moreover, it is entirely legal to call strlen on an interior byte of the string.
For example (just for demonstration purposes, although I'll bet you can find code almost identical to this in common use.)
#include <string.h>

/* Split a string like 'key=value...' into key and value parts, and
 * return the value, and optionally its length (if the second argument
 * is not a NULL pointer).
 * On success, returns the value part and modifies the original string
 * so that it is the key.
 * If there is no '=' in the supplied string, neither it nor the value
 * pointed to by plen are modified, and NULL is returned.
 */
char* keyval_split(char* keyval, int* plen) {
    char* delim = strchr(keyval, '=');
    if (delim) {
        if (plen) *plen = (int)strlen(delim + 1);
        *delim = 0;
        return delim + 1;
    } else {
        return NULL;
    }
}
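The two objections in this answer (truncating by writing a 0, and calling strlen on an interior byte) can be checked in isolation. The helper names truncate_at() and tail_length() below are hypothetical and exist only for this sketch; a length stored in a hidden prefix would be out of date after either operation.

```c
#include <string.h>

/* Truncate `s` in place by writing a terminator at index `at`
 * (the caller guarantees that `at` is within the current string),
 * then return the new length as seen by strlen(). */
size_t truncate_at(char *s, size_t at)
{
    s[at] = '\0';
    return strlen(s);
}

/* Length of the tail that starts `offset` bytes into `s`
 * (the caller guarantees that `offset` is within the string). */
size_t tail_length(const char *s, size_t offset)
{
    return strlen(s + offset);   /* legal: any interior byte starts a string */
}
```

With char buf[] = "key=value", tail_length(buf, 4) measures the "value" part, and truncate_at(buf, 3) leaves buf holding just "key".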
There's nothing fundamentally stopping you from doing this in your application, if that was useful (one of the comments noted this). There are two problems that would emerge, however:
You'd have to reimplement all the string-handling functions, and have my_strlen, my_strcpy, and so on, and add string-creating functions. That might be annoying, but it's a bounded problem.
You'd have to stop people, when writing for the system, deliberately or automatically treating the associated character arrays as ‘ordinary’ C strings, and using the usual functions on them. You might have to make sure that such usages broke promptly.
This means that it would, I think, be infeasible to smuggle a reimplemented ‘C string’ into an existing program.
Something like
typedef struct {
    size_t len;
    char* buf;
} String;
size_t my_strlen(String*);
...
might work, since type-checking would prevent the second problem (unless someone decided to hack things ‘for efficiency’, in which case there's not much you can do).
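As a rough sketch of what such reimplemented functions could look like: my_strlen() is declared in the snippet above, while my_strcat() and the exact semantics below (heap-owned destination buffer, no terminator kept) are assumptions made for this example.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    size_t len;   /* number of bytes in buf */
    char  *buf;   /* for a destination: NULL or a heap buffer */
} String;

/* O(1): the length is stored, not searched for. */
size_t my_strlen(const String *s)
{
    return s->len;
}

/* Append src to dst, growing dst's heap buffer as needed.
 * dst and src must be distinct objects.
 * Returns 0 on success, -1 on allocation failure (dst is unchanged). */
int my_strcat(String *dst, const String *src)
{
    size_t new_len = dst->len + src->len;
    char *p = realloc(dst->buf, new_len ? new_len : 1);
    if (p == NULL)
        return -1;
    if (src->len)
        memcpy(p + dst->len, src->buf, src->len);  /* no scan for '\0' */
    dst->buf = p;
    dst->len = new_len;
    return 0;
}
```

Because String is a distinct type, passing a String* to the standard strlen() is a constraint violation that compilers diagnose, which is the type-checking benefit mentioned above.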
Of course, you wouldn't do this unless and until you'd proven that string management was the bottleneck in your code and that this approach provably improved things....
There are a couple of issues with this approach. First of all, you wouldn't be able to create arbitrarily long strings. If you only reserve 1 byte for length, then your string can only go up to 255 characters. You can certainly use more bytes to store the length, but how many? 2? 4?
What if you try to concatenate two strings that are both at the edge of their size limits (i.e., if you use 1 byte for length and try to concatenate two 250-character strings to each other, what happens)? Do you simply add more bytes to the length as necessary?
Secondly, where do you store this metadata? It somehow has to be associated with the string. This is similar to the problem Dennis Ritchie ran into when he was implementing arrays in C. Originally, array objects stored an explicit pointer to the first element of the array, but as he added struct types to the language, he realized that he didn't want that metadata cluttering up the representation of the struct object in memory, so he got rid of it and introduced the rule that array expressions get converted to pointer expressions in most circumstances.
You could create a new aggregate type like
struct string
{
    char *data;
    size_t len;
};
but then you wouldn't be able to use the C string library to manipulate objects of that type; an implementation would still have to support the existing interface.
You could store the length in the leading byte or bytes of the string, but how many do you reserve? You could use a variable number of bytes to store the length, but now you need a way to distinguish length bytes from content bytes, and you can't read the first character by simply dereferencing the pointer. Functions like strcat would have to know how to step around the length bytes, how to adjust the contents if the number of length bytes changes, etc.
The 0-terminated approach has its disadvantages, but it's also a helluva lot easier to implement and makes manipulating strings a lot easier.
The string methods in the standard library have defined semantics. If one generates an array of char that contains various values, and passes a pointer to the array or a portion thereof, the methods whose behavior is defined in terms of NUL bytes must search for NUL bytes in the same fashion as defined by the standard.
One could define one's own methods for string handling which use a better form of string storage, and simply pretend that the standard library string-related functions don't exist unless one must pass strings to things like fopen. The biggest difficulty with such an approach is that unless one uses non-portable compiler features it would not be possible to use in-line string literals. Instead of saying:
ns_output(my_file, "This is a test"); // ns -- new string
one would have to say something more like:
MAKE_NEW_STRING(this_is_a_test, "This is a test");
ns_output(my_file, this_is_a_test);
where the macro MAKE_NEW_STRING would create a union of an anonymous type, define an instance called this_is_a_test, and suitably initialize it. Since a lot of strings would be of different anonymous types, type-checking would require that strings be unions that include a member of a known array type, and code expecting strings should be given a pointer to that member, likely using something like:
#define ns_output(f,s) (ns_output_func((f),(s).stringref))
It would be possible to define the types in such a way as to avoid the need for the stringref member and have code just accept void*, but the stringref member would essentially perform static duck-typing (only things with a stringref member could be given to such a macro) and could also allow type-checking on the type of stringref itself.
If one could accept those constraints, I think one could probably write code that was more efficient in almost every way than zero-terminated strings; the question would be whether the advantages would be worth the hassle.

Why are strcpy() and strcat() not good in the embedded domain?

I want to know about the disadvantages of strcpy() and strcat(), and in particular where these functions are dangerous in an embedded environment.
Somebody told me that we should never use strcpy, strcat and strlen in the embedded domain because they stop at a null character: we sometimes work on encrypted data in which null characters occur, so we cannot get the correct result because these functions stop at the first null character.
So I want to know the whole picture, which alternatives to these functions exist, and how to use them.
The str* functions work with strings. If you are dealing with strings, they're fine to use as long as you use them correctly - it's easy to create a buffer overflow if you use them incorrectly.
If you are dealing with binary data, which it sounds like you are, string handling functions are unsuitable (They're meant for strings after all, not binary data). Use mem* functions for dealing with binary data.
In C, a string is a sequence of chars that ends with a nul byte. If you're dealing with binary data, there might very well be a char with the value 0 in that data, which string handling functions take to be the end of the string. Or the data might contain no nul bytes at all and not be nul terminated, which will cause the string functions to run past the end of your buffer.
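A small sketch of the mem* approach for binary data, where the length travels alongside the pointer. The helper names copy_record() and records_equal() are hypothetical wrappers invented for this example; memcpy() and memcmp() are the standard functions doing the work.

```c
#include <string.h>

/* Copy a binary record of known length; zero bytes are ordinary data here. */
void copy_record(unsigned char *dst, const unsigned char *src, size_t len)
{
    memcpy(dst, src, len);
}

/* Compare two binary records of the same length; returns 1 if equal. */
int records_equal(const unsigned char *a, const unsigned char *b, size_t len)
{
    return memcmp(a, b, len) == 0;
}
```

For a six-byte buffer such as {0x9f, 0x00, 0x41, 0x00, 0x7c, 0x13}, these helpers handle all six bytes, while strlen() on the same buffer stops at the second byte and reports 1.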
Well, these functions indeed copy null-terminated strings and not only in embedded domain. Depending on your need you may want to use mem* functions instead.
As others have already answered, they work fine for strings. Encrypted data can't be regarded as strings.
There is however the aspect of using any C library function in embedded systems, particularly in high-integrity real-time embedded systems, such as automotive/medical/avionics etc. On such projects, a coding standard will be used, such as MISRA-C.
The vast majority of C libraries are likely not compatible with your coding standard. And even if you have the option (at least in MISRA-C) to make deviations, you would still have to verify the whole library. For example you will have to verify the whole string.h, just because you used strlen(). Common practice in such systems is to write all functions yourself, particularly simple ones like strlen() which you can write yourself in a minute.
But most embedded systems don't have such high requirements for quality and safety, and then the library functions are to be preferred, particularly memcpy() and similar search/sort/move functions, which will likely be heavily optimized by the compiler.
If you are worried about overwriting buffers (which everybody really should be), use strncpy or strncat instead. I see no problem with strlen.
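One caveat when following this advice: strncpy() does not write a terminating null character when the source has as many characters as the limit or more, so the terminator usually has to be added by hand. A minimal sketch, with bounded_copy() being a hypothetical helper name:

```c
#include <string.h>

/* Copy src into dst (capacity dst_size bytes), truncating if necessary.
 * strncpy() leaves dst unterminated when src has dst_size - 1 or more
 * characters, so the terminator is written explicitly. */
void bounded_copy(char *dst, size_t dst_size, const char *src)
{
    if (dst_size == 0)
        return;
    strncpy(dst, src, dst_size - 1);
    dst[dst_size - 1] = '\0';
}
```

With a four-byte destination, copying "abcdefgh" this way yields the valid string "abc" rather than four unterminated characters.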
This issue is specific to the system you describe, not to embedded systems per se. Either way, the string functions are simply not suited to the application you describe. I think you should simply have been told that you can't use string functions on the encrypted data in your particular application. That is not an issue with embedded systems, or even with the string library. It is entirely about the nature of your encrypted strings: they are no longer C strings once encrypted, so any string library operation is no longer valid. It becomes just data, and it is your responsibility to retain any necessary metadata regarding length etc. You could use Pascal-style strings to do that, for example (with a suitable accompanying library).
Now in general the C string library, and C-strings themselves present a number of issues for all systems, not just embedded. See this article by Joel Spolsky to see why caution should be used when using C strings functions, especially strcat().
The reason is just what you said:
because they stop at a null character: we sometimes work on encrypted data in which null characters occur, so we cannot get the correct result because these functions stop at the first null character.
As for alternatives, I recommend the strn* family, such as strncpy and strnlen. The n here means the maximum possible length of the string.
You may want to find a C standard library reference and look up the details of those strn* functions.
As others have said str* functions are for strings, not binary data.
However, I suggest that when you do come to use strings, you should consider functions such as strlcpy() instead of strcpy(), and strlcat() instead of strcat().
They're not standard functions, but you'll be able to find copies of them readily enough (or really just write your own). They take the size of the destination buffer as an extra parameter to their standard cousins and are designed to avoid buffer overflows.
It probably seems like an imposition to have to pass around the size of a pointer's block wherever you use it, but I'm afraid that's what programming in C is about.
At least until we get smarter pointers that is.
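Since strlcpy() is not part of standard C, here is a sketch of writing your own. The name my_strlcpy() is hypothetical, and the behaviour is modelled on the BSD function: always terminate when the destination size is non-zero, and return the length of the source so that truncation can be detected.

```c
#include <string.h>

/* Copy src into dst, which has room for dst_size bytes in total.
 * Returns strlen(src); a result >= dst_size means the copy was truncated. */
size_t my_strlcpy(char *dst, const char *src, size_t dst_size)
{
    size_t src_len = strlen(src);
    if (dst_size > 0) {
        size_t n = src_len < dst_size - 1 ? src_len : dst_size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return src_len;
}
```

A caller can then write if (my_strlcpy(buf, input, sizeof buf) >= sizeof buf) to detect that the input did not fit.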

What is the difference between the C string and C++ string?

I mean, what is the difference between a string in C and a string in C++?
C does not define string: it only has "perfectly ordinary arrays of characters" and pointers to those arrays;
C++ defines it, as a class type, with several properties and methods.
In C there is no such thing/type as "string". It is represented as a null-terminated array of characters like char str[256];. C++ has a string class in the standard library that internally maintains it as an array of characters and has many methods and properties to manipulate it.
I fully agree with @pmg's answer, but a few things need mentioning. In C the programmer must be very careful when working with C strings because a) every C string must end with a zero character; b) it is very easy to overrun a buffer if it is too small for the string. Also, in C all work with strings goes through functions, which can be a programmer's nightmare. In C++ things are much simpler. Firstly, you don't need to care about memory management: the string class allocates additional memory when its internal buffer becomes too small. Secondly, you don't need to care about the terminating zero character: you work with a container. Thirdly, there are simple methods for working with the string class, for example the overloaded operator + for string concatenation. No more awful strcat() calls. Let working with strings be simple!
In C++, string objects are a special type of container, specifically designed to operate on sequences of characters. The string class is defined in the <string> header.
In C, a string is a character sequence terminated with a null character ('\0'), and the functions related to strings are declared in string.h.

Why do strings in C need to be null terminated?

Just wondering why this is the case. I'm eager to know more about low level languages, and I'm only into the basics of C and this is already confusing me.
Do languages like PHP automatically null terminate strings as they are being interpreted and / or parsed?
From Joel's excellent article on the topic:
Remember the way strings work in C: they consist of a bunch of bytes followed by a null character, which has the value 0. This has two obvious implications:
There is no way to know where the string ends (that is, the string length) without moving through it, looking for the null character at the end.
Your string can't have any zeros in it. So you can't store an arbitrary binary blob like a JPEG picture in a C string.
Why do C strings work this way? It's because the PDP-7 microprocessor, on which UNIX and the C programming language were invented, had an ASCIZ string type. ASCIZ meant "ASCII with a Z (zero) at the end."
Is this the only way to store strings? No, in fact, it's one of the worst ways to store strings. For non-trivial programs, APIs, operating systems, class libraries, you should avoid ASCIZ strings like the plague.
Think about what memory is: a contiguous block of byte-sized units that can be filled with any bit patterns.
2a c6 90 f6
A character is simply one of those bit patterns. Its meaning as a string is determined by how you treat it. If you looked at the same part of memory, but using an integer view (or some other type), you'd get a different value.
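To make the "same bytes, different view" point concrete, here is a small sketch that reads the four bytes shown above as one 32-bit integer. The function name bytes_as_u32() is hypothetical, and the resulting value depends on the byte order of the machine.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret four bytes as one 32-bit unsigned integer.
 * memcpy avoids alignment and aliasing problems; the value obtained
 * depends on the byte order (endianness) of the machine. */
uint32_t bytes_as_u32(const unsigned char bytes[4])
{
    uint32_t value;
    memcpy(&value, bytes, sizeof value);
    return value;
}
```

For the bytes 2a c6 90 f6, a little-endian machine yields 0xf690c62a and a big-endian one 0x2ac690f6, while viewed as characters the first byte 0x2a is '*' in ASCII.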
If you have a variable which is a pointer to the start of a bunch of characters in memory, you must know when that string ends and the next piece of data (or garbage) begins.
Example
Let's look at this string in memory...
H e l l o , w o r l d ! \0
^
|
+------ Pointer to string
...we can see that the string logically ends after the ! character. If there were no \0 (or any other method to determine its end), how would we know when seeking through memory that we had finished with that string? Other languages carry the string length around with the string type to solve this.
I asked this question when my underlying knowledge of computers was limited, and this is the answer that would have helped many years ago. I hope it helps someone else too. :)
C strings are arrays of chars, and in most expressions a C array is converted to a pointer to its first element, that is, to the start location of the array. The length (or end) of the array must therefore be expressed somehow; in the case of strings, a null terminator is used. Another alternative would be to carry the length of the string alongside the memory pointer, or to put the length in the first array location, or whatever. It's just a matter of convention.
Higher-level languages like Java or PHP store the size information with the array automatically and transparently, so the user needn't worry about it.
C has no notion of strings by itself. Strings are simply arrays of chars (or wchars for unicode and such).
Because of this, C has no built-in way to check, for example, the length of a string: there is no "mystring->length", no length value stored anywhere. The only way to find the end of the string is to iterate over it and check for the \0.
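The iteration described here can be sketched as follows. count_to_nul() is a hypothetical name for what is essentially a hand-written strlen(); real library versions are typically more optimized but have to do the same scan.

```c
#include <stddef.h>

/* Count bytes up to, but not including, the first '\0'.
 * The cost grows with the length of the string. */
size_t count_to_nul(const char *s)
{
    const char *p = s;
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

Note that count_to_nul("ab\0cd") returns 2: everything after the first zero byte is invisible to this kind of scan.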
There are string-libraries for C which use structs like
struct string {
    int length;
    char *data;
};
to remove the need for the \0-termination but this is not standard C.
Languages like C++, PHP, Perl, etc. have their own internal string libraries, which often have a separate length field that speeds up some string functions and removes the need for the \0.
Some other languages (like Pascal) use a string type that is called (surprisingly) a Pascal string. It stores the length in the first byte of the string, which is why those strings are limited to a length of 255 characters.
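A rough C sketch of such a length-prefixed layout shows where the 255-character limit comes from. The type PString and the function pstring_from_cstr() are hypothetical names for this illustration, not the layout of any particular Pascal compiler.

```c
#include <string.h>

/* Pascal-style short string: one length byte followed by the characters.
 * A single byte can count at most 255, hence the length limit. */
typedef struct {
    unsigned char len;
    char          chars[255];
} PString;

/* Store a C string in `out`; returns 0 on success, -1 if it is too long. */
int pstring_from_cstr(PString *out, const char *src)
{
    size_t n = strlen(src);
    if (n > 255)
        return -1;
    out->len = (unsigned char)n;
    memcpy(out->chars, src, n);   /* no terminator is stored */
    return 0;
}
```

Reading the length is then a single byte load (p.len) instead of a scan, at the price of the fixed upper bound.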
Because in C, strings are just a sequence of characters accessed via a pointer to the first character.
There is no space in a pointer to store the length so you need some indication of where the end of the string is.
In C it was decided that this would be indicated by a null character.
In Pascal, for example, the length of a string is recorded in the byte immediately preceding the pointer, which is why Pascal strings have a maximum length of 255 characters.
It is a convention - one could have implemented it with another algorithm (e.g. length at the beginning of the buffer).
In a "low level" language such as assembler, it is easy to test for a null byte efficiently; that might have eased the decision to go with null-terminated strings as opposed to keeping track of a length counter.
They need to be null terminated so you know how long they are. And yes, they are simply arrays of char.
Higher level languages like PHP may choose to hide the null termination from you or not use it at all - they may maintain a length, for example. C doesn't do it that way because of the overhead involved. High level languages may also not implement strings as an array of char - they could (and some do) implement them as lists of arrays of char, for example.
In C strings are represented by an array of characters allocated in a contiguous block of memory and thus there must either be an indicator stating the end of the block (ie. the null character), or a way of storing the length (like Pascal strings which are prefixed by a length).
In languages like PHP, Perl, C#, etc., strings may or may not be backed by complex data structures, so you cannot assume they have a null character. As a contrived example, you could have a language that represents a string like so:
class string
{
int length;
char[] data;
}
but you only see it as a regular string with no length field, as this can be calculated by the runtime environment of the language and is only used internally by it to allocate and access memory correctly.
They are null-terminated because plenty of Standard Library functions expect them to be.
