C language. Length of the string without using null-termination - c

strlen() assumes that '\00' marks the end of the string. How do I calculate the real length? E.g. there is the AnsiString type in Pascal. It can contain many #$00 characters, but Length(s) still gives the correct result. Debugging compiled Pascal code shows that Pascal stores the length in a dword before the first element (#s[0] - 4) and recalculates it for me when needed. Is there something similar in the C language (or must I allocate the memory manually and take care of the -1 element myself)? If not, the C language is much worse than Pascal.

The C standard says that strings end with a NUL character. The string may be stored in an array that is larger than that, but there is no way to get the size of an array if you are only given a pointer to the array.
#include <stdio.h>
void f(char *s)
{
printf("%s\n", s);
// you can't get the size of array s here
}
int main(void)
{
char s[100] = "hi";
printf("size of s = %zu\n", sizeof(s)); // this works
f(s);
return 0;
}

Question
Is there something the same in C language?
No, there is nothing like that in C or the standard C library. However, the language provides the building blocks to define such a type and create API functions to work on the type.
Something like:
typedef struct AnsiString
{
size_t len;
char* data;
} AnsiString;
AnsiString createAnsiString(size_t len)
{
AnsiString s;
s.len = len;
s.data = malloc(len);
return s;
}
void deleteAnsiString(AnsiString s)
{
free(s.data);
}
Then you can use
AnsiString s = createAnsiString(10);
// Use s as you please
deleteAnsiString(s);
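As a slightly fuller sketch of the same idea, the type below stores the length next to the data, so embedded zero bytes do not shorten the string. The names ByteString, bs_from_bytes and bs_free are made up for this illustration and are not part of any standard library.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical length-carrying string; all names are illustrative only. */
typedef struct ByteString {
    size_t len;   /* number of bytes in use, zero bytes included */
    char *data;   /* not necessarily NUL-terminated */
} ByteString;

/* Copy len bytes from src. On allocation failure data is NULL and len is 0. */
ByteString bs_from_bytes(const char *src, size_t len)
{
    ByteString s = { 0, NULL };
    /* Avoid malloc(0), whose result is implementation-defined. */
    s.data = malloc(len ? len : 1);
    if (s.data != NULL) {
        memcpy(s.data, src, len);
        s.len = len;
    }
    return s;
}

/* Release the buffer and reset the fields so the object can be reused. */
void bs_free(ByteString *s)
{
    free(s->data);
    s->data = NULL;
    s->len = 0;
}
```

With bs_from_bytes("ab\0cd", 5) the len field is 5, whereas strlen() on the same bytes would stop at the first zero byte.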

The biggest problem with this question is that a string isn't considered a 'type' in C. It's a pattern of values... Think about integers that are multiples of ten, for example. They all end in 0, yet you can store them in any type of integer providing they're in range. Strings are similar; you can store them in arrays of char, signed char, wchar_t, etc. They're not types, though... They're values that fit a specific pattern.
That pattern is similar to the "multiples of ten" I described above: A string is a sequence of characters that ends at the first '\0'. If there is no NUL character, it's not a string. It's just a sequence of characters.
If you want the size of the array, and the array hasn't yet been converted to a pointer type, you can use sizeof array because the array still carries the size information. However, once the array is converted to a pointer (when calling a function with the array identifier as an argument, as a common example) you need to manually carry the size information. Standard functions commonly use separate size arguments as a result, for example fgets(array, sizeof array, stdin);.
Whether you consider C to be better or worse than Pascal is a matter of opinion, and you have one good point. The main point of Pascal vs. C is ease of developing a compiler, which you'll discover if you ever try to write one (a great way of learning everything about the language, btw).
However, in the grand scheme of things this is only a small part of a larger issue, which Pascal also mostly suffers from. If you're going into the business of comparing programming languages based on ability to write expressive code, you might find this article ranking programming languages on expressive power to be useful.
Best of luck with your language studies :) It's nice to see people developing code that uses the maximum expressive potential of a language.

Related

Is it good programming practice in C to use first array element as array length?

Because in C the array length has to be stated when the array is defined, would it be acceptable practice to use the first element as the length, e.g.
int arr[9]={9,0,1,2,3,4,5,6,7};
Then use a function such as this to process the array:
int printarr(int *ARR) {
for (int i=1; i<ARR[0]; i++) {
printf("%d ", ARR[i]);
}
}
I can see no problem with this but would prefer to check with experienced C programmers first. I would be the only one using the code.
Well, it's bad in the sense that you have an array where the elements do not all mean the same thing. Storing metadata with the data is not a good thing. Just to extrapolate your idea a little bit: we could use the first element to denote the element size and then the second for the length. Try writing a function utilizing both ;)
It's also worth noting that with this method, you will have problems if the array is bigger than the maximum value an element can hold, which for char arrays is a very significant limitation. Sure, you can solve it by using the first two elements. And you can also use casts if you have floating point arrays. But I can guarantee you that you will run into hard-to-trace bugs because of this. Among other things, endianness could cause a lot of issues.
And it would certainly confuse virtually every seasoned C programmer. This is not really a logical argument against the idea as such, but rather a pragmatic one. Even if this was a good idea (which it is not) you would have to have a long conversation with EVERY programmer who will have anything to do with your code.
A reasonable way of achieving the same thing is using a struct.
struct container {
int *arr;
size_t size;
};
int arr[10];
struct container c = { .arr = arr, .size = sizeof arr/sizeof *arr };
But in any situation where I would use something like above, I would probably NOT use arrays. I would use dynamic allocation instead:
const size_t size = 10;
int *arr = malloc(sizeof *arr * size);
if(!arr) { /* Error handling */ }
struct container c = { .arr = arr, .size = size };
However, do be aware that if you compute the size with sizeof arr/sizeof *arr when arr is a pointer instead of an array, you're in for "interesting" results.
You can also use flexible arrays, as Andreas wrote in his answer
In C you can use flexible array members. That is you can write
struct intarray {
size_t count;
int data[]; // flexible array member needs to be last
};
You allocate with
size_t count = 100;
struct intarray *arr = malloc( sizeof(struct intarray) + sizeof(int)*count );
arr->count = count;
That can be done for all types of data.
It makes the use of C-arrays a bit safer (not as safe as the C++ containers, but safer than plain C arrays).
Unfortunately, C++ does not support this idiom in the standard.
Many C++ compilers provide it as an extension, but it is not guaranteed.
On the other hand this C FLA idiom may be more explicit and perhaps more efficient than C++ containers as it does not use an extra indirection and/or need two allocations (think of new vector<int>).
If you stick to C, I think this is a very explicit and readable way of handling variable length arrays with an integrated size.
The only drawback is that the C++ guys do not like it and prefer C++ containers.
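A minimal sketch of how the flexible array member idiom might be wrapped in helpers. The function names intarray_new and intarray_sum are assumptions made for this example; the struct is the one shown above.

```c
#include <stdint.h>
#include <stdlib.h>

struct intarray {
    size_t count;
    int data[];   /* flexible array member, must be last */
};

/* Hypothetical helper: allocate header and elements in one zeroed block.
   Returns NULL on size overflow or allocation failure. */
struct intarray *intarray_new(size_t count)
{
    if (count > (SIZE_MAX - sizeof(struct intarray)) / sizeof(int))
        return NULL;
    struct intarray *arr = calloc(1, sizeof *arr + sizeof(int) * count);
    if (arr != NULL)
        arr->count = count;
    return arr;
}

/* Sum all elements using the count stored in the header. */
long intarray_sum(const struct intarray *arr)
{
    long total = 0;
    for (size_t i = 0; i < arr->count; i++)
        total += arr->data[i];
    return total;
}
```

Because header and elements live in one allocation, a single free(arr) releases everything.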
It is not bad (I mean it will not invoke undefined behavior or cause other portability issues) when the elements of array are integers, but instead of writing magic number 9 directly you should have it calculate the length of array to avoid typo.
#include <stdio.h>
int main(void) {
int arr[9]={sizeof(arr)/sizeof(*arr),0,1,2,3,4,5,6,7};
for (int i=1; i<arr[0]; i++) {
printf("%d ", arr[i]);
}
return 0;
}
Only a few datatypes are suitable for that kind of hack. Therefore, I would advise against it, as this will lead to inconsistent implementation styles across different types of arrays.
A similar approach is used very often with character buffers where in the beginning of the buffer there is stored its actual length.
Dynamic memory allocation in C often uses this approach too: many implementations prefix the allocated block with an integer that records the size of the allocation.
However, in general this approach is not suitable for arrays. For example, a character array can be much larger than the maximum positive value that can be stored in an object of type char (127 where char is signed and 8 bits wide). Moreover, it is difficult to pass a sub-array of such an array to a function. Most functions that are designed to deal with arrays will not work in such a case.
A general approach to declare a function that deals with an array is to declare two parameters. The first one has a pointer type that specifies the initial element of an array or sub-array and the second one specifies the number of elements in the array or sub-array.
C also allows you to declare functions that accept variable length arrays, whose sizes are specified at run time.
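The two-parameter convention described above could look like the sketch below. The function names sum_ints and sum_vla are hypothetical, and the second form relies on variable length array parameter syntax from C99, which some compilers do not support.

```c
#include <stddef.h>

/* Pointer-plus-count convention: works for whole arrays and sub-arrays alike. */
int sum_ints(const int *first, size_t count)
{
    int total = 0;
    for (size_t i = 0; i < count; i++)
        total += first[i];
    return total;
}

/* Same idea with a variable length array parameter (C99 syntax).
   The size parameter must come before the array in the parameter list. */
int sum_vla(size_t count, const int values[count])
{
    int total = 0;
    for (size_t i = 0; i < count; i++)
        total += values[i];
    return total;
}
```

Given int a[] = {1, 2, 3, 4, 5}, the call sum_ints(a + 1, 3) processes only the middle three elements, which is exactly what a length stored in a[0] makes awkward.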
It is suitable in rather limited circumstances. There are better solutions to the problem it solves.
One problem with it is that if it is not universally applied, then you would have a mix of arrays that used the convention and those that didn't - you have no way of telling if an array uses the convention or not. For arrays used to carry strings for example you have to continually pass &arr[1] in calls to the standard string library, or define a new string library that uses "Pascal strings" rather than "ASCIZ string" conventions (such a library would be more efficient as it happens).
In the case of a true array rather than simply a pointer to memory, sizeof(arr) / sizeof(*arr) will yield the number of elements without having to store it in the array in any case.
It only really works for integer type arrays and for char arrays would limit the length to rather short. It is not practical for arrays of other object types or data structures.
A better solution would be to use a structure:
typedef struct
{
size_t length ;
int* data ;
} intarray_t ;
Then:
int data[9] ;
intarray_t array = { sizeof(data) / sizeof(*data), data } ;
Now you have an array object that can be passed to functions and retain the size information, and the data member can be accessed directly for use in third-party or standard library interfaces that do not accept the intarray_t. Moreover the type of the data member can be anything.
Obviously NO is the answer.
All programming languages have predefined functions that come with the variable type. Why not use them?
In your case it is more suitable to use a count/length method instead of testing the first value.
An if clause sometimes takes more time than a predefined function.
At first glance it seems OK to store the counter, but imagine you have to update the array. You will have to do two operations, one to insert and another to update the counter. So two operations mean two variables to be changed.
For static arrays it might be OK to have the counter and then the list, but for dynamic ones NO NO NO.
On the other hand, please read up on basic programming concepts and you will find that your idea is a bad one that does not comply with programming principles.

C - is there a way to work with strings which have NULL character in the middle

Is it possible to have strings with NULL character somewhere except the end and work with them? Like get their size, use strcat, etc?
I have some ideas:
1) Write your own function for getting length (or something else), which is going to iterate over a string. If it meets a NULL char, it is going to check the next char of the string. If it is not NULL - continue counting chars. But it may (and WILL!) eventually lead to situation when you are reading memory OUTSIDE of the char array. So it is a bad idea.
2) Use sizeof(array)/sizeof(type), eg sizeof(input)/sizeof(char). That is going to work pretty good I think.
Do you have any other ideas on how this can be done? Maybe there are some function which I am not aware of (C newbie alert :))?
The only really safe method I can think of is to use "Pascal"-type strings (that is, something that has a string header and assorted other data associated with it).
Something like this:
typedef struct {
int len, allocated;
char *data;
} my_string;
You would then have to implement pretty much every string manipulation function yourself. Keeping both the "length of the string" and "the size of the allocation" allows you to have an allocation that's larger than the current contents, this may make repeated string concatenation cheaper (allows an amortized O(1) append).
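To illustrate why tracking both values pays off, here is a rough sketch of an append that doubles the allocation when it runs out of room. The names counted_string and cs_append are invented for this example, and size_t is used for the counters instead of int as an assumption of the sketch.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counterpart of my_string with unsigned sizes. */
typedef struct {
    size_t len, allocated;
    char *data;
} counted_string;

/* Append n bytes (zero bytes allowed). Returns 0 on success, -1 on failure. */
int cs_append(counted_string *s, const char *bytes, size_t n)
{
    if (n == 0)
        return 0;
    if (n > SIZE_MAX - s->len)
        return -1;
    size_t needed = s->len + n;
    if (needed > s->allocated) {
        /* Grow geometrically so repeated appends stay cheap on average. */
        size_t new_cap = s->allocated ? s->allocated : 16;
        while (new_cap < needed) {
            if (new_cap > SIZE_MAX / 2) { new_cap = needed; break; }
            new_cap *= 2;
        }
        char *p = realloc(s->data, new_cap);
        if (p == NULL)
            return -1;
        s->data = p;
        s->allocated = new_cap;
    }
    memcpy(s->data + s->len, bytes, n);
    s->len = needed;
    return 0;
}
```

An object starts out as counted_string s = {0, 0, NULL}; and is released with free(s.data).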
You can have an array of char, either statically or dynamically allocated, that contains a zero byte in the middle, but only the part up to and including the zero can be considered a "string" in the standard C sense. Only that part will be recognized or considered by the standard library's string functions.
You can use a different terminator -- say two zeroes in a row -- and write your own string functions, but that just pushes off the problem. What happens when you need two zeroes in the middle of your string? In any case, you need to exercise even more care in this case than in the ordinary string case to ensure that your custom strings are properly terminated. You also have to be certain to avoid using them with the standard string functions.
If your special strings are stored in char array of known size then you can get the length of the overall array via sizeof, but that doesn't tell you what portion of the array contains meaningful data. It also doesn't help with any of the other string functions you might want to perform, and it does nothing for you if your handle on the pseudo-strings is a char *.
If you are contemplating custom string functions anyway, then you should consider string objects that have an explicit length stored with them. For example:
struct my_string {
unsigned allocated, length;
char *contents;
};
Your custom functions then handle objects of that type, being certain to do the right thing with the length member. There is no explicit terminator, so these strings can contain any char value. Also, you can be certain not to mix these up with standard strings.
As long as you store the length of the array of chars then you can have strings with nul characters or even without a terminating nul.
struct MyString
{
int length;
char* buffer;
};
And then you would have to write all your equivalent functions for managing the string.
The bstring library http://bstring.sourceforge.net and Microsoft's BSTR (uses wide chars) are existing libraries that work in this way and also offer some compatibility with C-style strings.
pros - getting the length of the string is quick
cons - the strings need to be dynamically allocated.

Why C doesn't have a function which is used like strcpy() and check buffer size automatically to prevent buffer overflow bug?

I'm really wondering why there's no function in C like strcpy(), memcpy(), etc. that automatically checks the size of the buffer. Something that behaves like this:
#define strcpy2(X, Y) strncpy(X, Y, sizeof(X))
Some people tell me: "Because it's an old language." But C is not a dead language. ISO can fix the standard, and new functions like strncpy have been added.
Others tell me: "It causes performance issues." But, I argue "if a function like that existed, you can still use the old function in situations where performance is important. In all situation, you can use that function and you can expect security improvement."
Still others tell me: "So, there's a function like strncpy()", or "C is designed for professional developer who consider this problem", but strncpy() does not do the check automatically - developers must determine the size of the buffer, and still large programs like Chrome, which are made by professional developers, have buffer overflow vulnerabilities.
I want to know a technical reason why such a function cannot be made.
*English is not my native language. so I guess there are some mistakes... sorry about this. (Edit (cmaster): Should be fixed now. Hope you like the new wording.)
If X is a pointer, and it usually is, then sizeof X tells you nothing about the size of the array to which X points. The size must be passed as a parameter.
To really understand the reason why C functions cannot do what you want, you need to understand about the difference between arrays and pointers, and what it means that an array decays to a pointer. Just to give you an idea what I'm talking about:
int array[7]; //define an array
int* pointer = array; //define a pointer that points to the same memory, array decays into a pointer to the first int
//Now the following two expressions are precisely equivalent, since array decays to a pointer again:
pointer[3];
array[3];
//However, the sizeof of the two is not the same:
assert(sizeof(array) == 7*sizeof(int)); //this is what you used in your define
assert(sizeof(pointer) == sizeof(int*)); //probably not what you expected
//Now the thing gets nasty: Array declarations in function arguments truly decay into pointers!
void foo(int bar[9]) {
assert(sizeof(bar) == sizeof(int*)); //I bet, you didn't expect this!
}
//This is, because the definition of foo() is truly equivalent to this definition:
void foo(int* bar) {
assert(sizeof(bar) == sizeof(int*));
}
//Transferring this to your #define, this will definitely not do what you want:
void baz(char aBuffer[BUFFER_SIZE], const char* source) {
strcpy2(aBuffer, source); //This will copy only the first four or eight bytes (depending on the size of a pointer on your system), no matter how big you make BUFFER_SIZE!
}
I hope, I enticed you to google for array-pointer-decay now...
The truth is, that the C language relies heavily on the fact that no array size is required to correctly access an array element, only the surrounding loops need to know the size. As such, arrays decay to pure pointers in many places, and once they are decayed, there is no bringing back the size of the array. This brings a great deal of flexibility and simplicity to the language (very easy handling of subarrays!), but it also makes a function that behaves like your #define impossible.
Technical reason is: in C the buffer size cannot be checked automatically, because it is not managed by the language. Functions like strcpy operate on pointers, and though pointers point to buffers, there is no way for strcpy implementation to know how long a buffer is. Your suggestion of using sizeof does not work, since sizeof returns the object size, not the size of the buffer a pointer points to. (In your example it would return always the same number, most probably 4 or 8).
C language makes programmer responsible for managing buffer sizes, so one can use functions like strncpy and pass the buffer size explicitly. But it will never be possible to implement safe version of strcpy in C, since it would require fundamental changes in the way the language treats pointers.
All of it applies to C descendants like C++ or Objective C too.
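Passing the buffer size explicitly can be sketched as follows. The helper copy_bounded is hypothetical and not a standard function; unlike strncpy it always terminates the destination and reports truncation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy src into dst, never writing more than dst_size bytes.
   Always NUL-terminates when dst_size > 0.
   Returns 1 if src fit completely, 0 if it was truncated. */
int copy_bounded(char *dst, size_t dst_size, const char *src)
{
    if (dst_size == 0)
        return 0;
    size_t src_len = strlen(src);
    size_t n = src_len < dst_size - 1 ? src_len : dst_size - 1;
    memcpy(dst, src, n);
    dst[n] = '\0';
    return src_len < dst_size;
}
```

The caller writes copy_bounded(buf, sizeof buf, text) at a point where buf is still an array, so sizeof yields the real buffer size.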
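#define _GNU_SOURCE // asprintf() is a GNU/BSD extension, not standard C
#include <stdio.h>
#include <stdlib.h>
char* x;
if (asprintf(&x, "%s", y) < 0) { // asprintf() returns -1 on failure
perror("asprintf");
exit(1);
}
// from here, x will contain the content of y; free(x) when done
Under the assumption that y is NUL-terminated, this works safely.
(Written on a tablet, so forgive any silly errors, please.)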

Is there a standard function in C that would return the length of an array?

Is there a standard function in C that would return the length of an array?
Often the technique described in other answers is encapsulated in a macro to make it easier on the eyes. Something like:
#define COUNT_OF( arr) (sizeof(arr)/sizeof(0[arr]))
Note that the macro above uses a small trick of putting the array name in the index operator ('[]') instead of the 0 - this is done in case the macro is mistakenly used in C++ code with an item that overloads operator[](). The compiler will complain instead of giving a bad result.
However, also note that if you happen to pass a pointer instead of an array, the macro will silently give a bad result - this is one of the major problems with using this technique.
I have recently started to use a more complex version that I stole from Google Chromium's codebase:
#define COUNT_OF(x) ((sizeof(x)/sizeof(0[x])) / ((size_t)(!(sizeof(x) % sizeof(0[x])))))
In this version if a pointer is mistakenly passed as the argument, the compiler will complain in some cases - specifically if the pointer's size isn't evenly divisible by the size of the object the pointer points to. In that situation a divide-by-zero will cause the compiler to error out. Actually at least one compiler I've used gives a warning instead of an error - I'm not sure what it generates for the expression that has a divide by zero in it.
That macro doesn't close the door on using it erroneously, but it comes as close as I've ever seen in straight C.
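A small sketch of how the simple macro might be used in practice: apply it where the object is still an array and pass the result along. The array primes and the two functions are made up for this example.

```c
#include <stddef.h>

#define COUNT_OF(arr) (sizeof(arr) / sizeof(0[arr]))

static const int primes[] = {2, 3, 5, 7, 11};

/* primes is a real array here, so the macro yields the element count. */
size_t primes_count(void)
{
    return COUNT_OF(primes);
}

/* Inside a function the parameter is only a pointer, so the count
   has to be passed in by the caller. count must be at least 1. */
int last_element(const int *values, size_t count)
{
    return values[count - 1];
}
```

Calling COUNT_OF(values) inside last_element would compile but give a meaningless result, which is the pitfall described above.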
If you want an even safer solution for when you're working in C++, take a look at Compile time sizeof_array without using a macro which describes a rather complex template-based method Microsoft uses in winnt.h.
No, there is not.
For constant size arrays you can use the common trick Andrew mentioned, sizeof(array) / sizeof(array[0]) - but this works only in the scope the array was declared in.
sizeof(array) gives you the size of the whole array, while sizeof(array[0]) gives you the size of the first element.
See Michael's answer on how to wrap that in a macro.
For dynamically allocated arrays you either keep track of the size in an integral type or make it 0-terminated if possible (i.e. allocate 1 more element and set the last element to 0).
sizeof array / sizeof array[0]
The number of elements in an array x can be obtained by:
sizeof(x)/sizeof(x[0])
You need to be aware that arrays, when passed to functions, are degraded into pointers which do not carry the size information. In reality, the size information is never available to the runtime since it's calculated at compile time, but you can act as if it is available where the array is visible (i.e., where it hasn't been degraded).
When I pass arrays to a function that I need to treat as arrays, I always ensure two arguments are passed:
the length of the array; and
the pointer to the array.
So, whilst the array can be treated as an array where it's declared, it's treated as a size and pointer everywhere else.
I tend to have code like:
#define countof(x) (sizeof(x)/sizeof(x[0]))
: : :
int numbers[10];
a = fn (countof(numbers),numbers);
then fn() will have the size information available to it.
Another trick I've used in the past (a bit messier in my opinion but I'll give it here for completeness) is to have an array of a union and make the first element the length, something like:
typedef union {
int len;
float number;
} tNumber;
tNumber number[10];
: : :
number[0].len = 5;
a = fn (number);
then fn() can access the length and all the elements and you don't have to worry about the array/pointer dichotomy.
This has the added advantage of allowing the length to vary (i.e., the number of elements in use, not the number of units allocated). But I tend not to use this anymore since I consider the two-argument array version (size and data) better.
I created a macro that returns the size of an array, but yields a compiler error if used on a pointer. Do however note that it relies on gcc extensions. Because of this, it's not a portable solution.
#define COUNT(a) (__builtin_choose_expr( \
__builtin_types_compatible_p(typeof(a), typeof(&(a)[0])), \
(void)0, \
(sizeof(a)/sizeof((a)[0]))))
int main(void)
{
int arr[5];
int *p;
int x = COUNT(arr);
// int y = COUNT(p);
}
If you remove the comment, this will yield: error: void value not ignored as it ought to be
The simple answer, of course, is no. But the practical answer is "I need to know anyway," so let's discuss methods for working around this.
One way to get away with it for a while, as mentioned about a million times already, is with sizeof():
int i[] = {0, 1, 2};
...
size_t i_len = sizeof(i) / sizeof(i[0]);
This works, until we try to pass i to a function, or take a pointer to i. So what about more general solutions?
The accepted general solution is to pass the array length to a function along with the array. We see this a lot in the standard library:
void *memcpy(void *restrict s1, const void *restrict s2, size_t n);
This will copy n bytes from s2 to s1, allowing us to use n to ensure that our buffers never overflow. This is a good strategy - it has low overhead, and it actually generates some efficient code (compare to strcpy(), which has to check for the end of the string and has no way of "knowing" how many iterations it must make, and poor confused strncpy(), which has to check both - both can be slower, and either could be sped up by using memcpy() if you happen to have already calculated the string's length for some reason).
Another approach is to encapsulate your code in a struct. The common hack is this:
typedef struct _arr {
size_t len;
int arr[0];
} arr;
If we want an array of length 5, we do this:
arr *a = malloc(sizeof(*a) + sizeof(int) * 5);
a->len = 5;
However, this is a hack that is only moderately well-defined (C99 lets you use int arr[]) and is rather labor-intensive. A "better-defined" way to do this is:
typedef struct _arr {
size_t len;
int *arr;
} arr;
But then our allocations (and deallocations) become much more complicated. The benefit of either of these approaches is, of course, that now arrays you make will carry around their lengths with them. It's slightly less memory-efficient, but it's quite safe. If you chose one of these paths, be sure to write helper functions so that you don't have to manually allocate and deallocate (and work with) these structures.
If you have an object a of array type, the number of elements in the array can be expressed as sizeof a / sizeof *a. If you allowed your array object to decay to pointer type (or had only a pointer object to begin with), then in general case there's no way to determine the number of elements in the array.

What are convincing examples where pointer arithmetic is preferable to array subscripting?

I'm preparing some slides for an introductory C class, and I'm trying to present good examples (and motivation) for using pointer arithmetic over array subscripting.
A lot of the examples I see in books are fairly equivalent. For example, many books show how to reverse the case of all values in a string, but with the exception of replacing an a[i] with a *p the code is identical.
I am looking for a good (and short) example with single-dimensional arrays where pointer arithmetic can produce significantly more elegant code. Any ideas?
Getting a pointer again instead of a value:
One usually uses pointer arithmetic when they want to get a pointer again. To get a pointer while using an array index: you are 1) calculating the pointer offset, then 2) getting the value at that memory location, then 3) you have to use & to get the address again. That's more typing and less clean syntax.
Example 1: Let's say you need a pointer to the 512th byte in a buffer
char buffer[1024];
char *p = buffer + 512;
Is cleaner than:
char buffer[1024];
char *p = &buffer[512];
Example 2: More efficient strcat
char buffer[1024];
strcpy(buffer, "hello ");
strcpy(buffer + 6, "world!");
This is cleaner than:
char buffer[1024];
strcpy(buffer, "hello ");
strcpy(&buffer[6], "world!");
Using pointer arithmetic ++ as an iterator:
Incrementing pointers with ++, and decrementing with -- is useful when iterating over each element in an array of elements. It is cleaner than using a separate variable used to keep track of the offset.
Pointer subtraction:
You can use pointer subtraction with pointer arithmetic. This can be useful in some cases to get the element before the one you are pointing to. It can be done with array subscripts too, but it looks really bad and confusing. Especially to a python programmer where a negative subscript is given to index something from the end of the list.
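A short sketch combining the last two points, pointer iteration with ++ and pointer subtraction. The helpers index_of and length_of are hypothetical names chosen for this illustration.

```c
#include <stddef.h>

/* Walk the string with a pointer; the difference p - s is the index of the match.
   Returns -1 if c does not occur before the terminating NUL. */
ptrdiff_t index_of(const char *s, char c)
{
    for (const char *p = s; *p != '\0'; ++p) {
        if (*p == c)
            return p - s;
    }
    return -1;
}

/* Length computed as end pointer minus start pointer. */
size_t length_of(const char *s)
{
    const char *end = s;
    while (*end != '\0')
        ++end;
    return (size_t)(end - s);
}
```

Subtracting two pointers into the same array gives the number of elements between them, so no separate index variable is needed.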
char *my_strcpy(const char *s, char *t) {
char *u = t;
while (*t++ = *s++);
return u;
}
Why would you want to spoil such a beauty with an index? (See K&R, and how they build up to this style.) There is a reason I used the above signature the way it is. Stop editing without asking for a clarification first. For those who think they know, look up the present signature -- you missed a few restrict qualifications.
Structure alignment testing and the offsetof macro implementation.
Pointer arithmetic may look fancy and "hackerish", but I have never encountered a case it was FASTER than the standard indexing. Just the opposite, I often encountered cases when it slowed the code down by a large factor.
For example, typical sequential looping through an array with a pointer may be less efficient than looping with a classic index on modern processors that support SSE extensions. Pointer arithmetic in a loop can prevent compilers from performing loop vectorization, which can typically yield a 2x-4x performance boost. Additionally, using pointers instead of simple integer variables may result in needless memory store operations due to pointer aliasing.
So, generally pointer arithmetic instead of standard indexed access should NEVER be recommended.
Iterating through a 2-dimensional array where the position of a datum does not really matter.
If you don't use pointers, you would have to keep track of two subscripts.
With pointers, you could point to the top of your array and, with a single loop, zip through the whole thing.
If you were using an old compiler, or some kind of specialist embedded systems compiler, there might be slight performance differences, but most modern compilers would probably optimize these (tiny) differences out.
The following article might be something you could draw on - depends on the level of your students:
http://geeks.netindonesia.net/blogs/risman/archive/2007/06/25/Pointer-Arithmetic-and-Array-Indexing.aspx
You're asking about C specifically, but C++ builds upon this as well:
Most pointer arithmetic naturally generalizes to the Forward Iterator concept. Walking through memory with *p++ can be used for any sequenced container (linked list, skip list, vector, binary tree, B tree, etc), thanks to operator overloading.
Something fun I hope you never have to deal with: pointers can alias, whereas arrays cannot. Aliasing can cause all sorts of non-ideal code generation, the most common of which is using a pointer as an out parameter to another function. Basically, the compiler cannot assume that the pointer used by the function doesn't alias itself or anything else in that stack frame, so it has to reload the value from the pointer every time it's used. Or rather, to be safe it does.
Often the choice is just one of style - one looks or feels more natural than the other for a particular case.
There is also the argument that using indexes can cause the compiler to have to repeatedly recalculate offsets inside a loop - I'm not sure how often this is the case (other than in non-optimized builds), but I imagine it happens, but it's probably rarely a problem.
One area that I think is important in the long run (which might not apply to an introductory C class - but learn 'em early, I say) is that using pointer arithmetic applies to the idioms used in the C++ STL. If you get them to understand pointer arithmetic and use it, then when they move on to the STL, they'll have a leg up on how to properly use iterators.
#include <ctype.h>
#include <stdio.h>
void skip_spaces( const char **ppsz )
{
const char *psz = *ppsz;
while( isspace((unsigned char)*psz) ) // cast avoids undefined behavior for negative char values
psz++;
*ppsz = psz;
}
void fn(void)
{
char a[]=" Hello World!";
const char *psz = a;
skip_spaces( &psz );
printf("\n%s", psz);
}

Resources