What is the most efficient way to implement mutable strings in C? - c

I'm currently implementing a very simple JSON parser in C and I would like to be able to use mutable strings (I can do it without mutable strings, however I would like to learn the best way of doing them anyway). My current method is as follows:
char * str = calloc(1, sizeof(char)); //One zeroed byte, so str starts out as a valid empty string
//The following is executed in a loop
int length = strlen(str);
str = realloc(str, sizeof(char) * (length + 2));
//I had to reallocate with +2 in order to ensure I still had a zero value at the end
str[length] = newChar;
str[length + 1] = 0;
I am comfortable with this approach, however it strikes me as a little inefficient given that I am always only appending one character each time (and for the sake of argument, I'm not doing any lookaheads to find the final length of my string). The alternative would be to use a linked list:
struct linked_string
{
char character;
struct linked_string * next;
};
Then, once I've finished processing I can find the length, allocate a char * of the appropriate length, and iterate through the linked list to create my string.
However, this approach seems memory inefficient, because I have to allocate memory for both each character and the pointer to the following character. Therefore my question is two-fold:
Is creating a linked list and then a C-string faster than reallocing the C-string each time?
If so, is the gained speed worth the greater memory overhead?

The standard way for dynamic arrays, regardless of whether you store chars or something else, is to double the capacity when you grow it. (Technically, any multiple works, but doubling is easy and strikes a good balance between speed and memory.) You should also ditch the 0 terminator (add one at the end if you need to return a 0 terminated string) and keep track of the allocated size (also known as capacity) and the number of characters actually stored. Otherwise, your loop has quadratic time complexity by virtue of using strlen repeatedly (Shlemiel the painter's algorithm).
With these changes, time complexity is linear (amortized constant time per append operation) and practical performance is quite good for a variety of ugly low-level reasons.
The theoretical downside is that you use up to twice as much memory as strictly necessary, but the linked list needs at least five times as much memory for the same amount of characters (with 64 bit pointers, padding and typical malloc overhead, more like 24 or 32 times). It's not usually a problem in practice.
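To make the size/capacity idea concrete, here is a minimal sketch, not a definitive implementation. The names strbuf, strbuf_push and strbuf_finish are made up for illustration, and the starting capacity of 16 is an arbitrary assumption.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer: `len` bytes in use out of `cap` allocated. */
typedef struct {
    char  *data;
    size_t len;
    size_t cap;
} strbuf;

/* Append one character; returns 1 on success, 0 if allocation fails. */
static int strbuf_push(strbuf *sb, char c)
{
    if (sb->len == sb->cap) {
        size_t new_cap = sb->cap ? sb->cap * 2 : 16;  /* double when full */
        char *p = realloc(sb->data, new_cap);
        if (!p) return 0;               /* the old buffer is still valid */
        sb->data = p;
        sb->cap = new_cap;
    }
    sb->data[sb->len++] = c;            /* no terminator is stored here */
    return 1;
}

/* Add the terminator once, at the end, and hand the buffer to the caller. */
static char *strbuf_finish(strbuf *sb)
{
    if (!strbuf_push(sb, '\0')) return NULL;
    char *s = sb->data;
    sb->data = NULL;
    sb->len = sb->cap = 0;
    return s;                           /* caller releases it with free() */
}
```

Start from a zeroed struct (strbuf sb = {0};), call strbuf_push in the parsing loop, and call strbuf_finish once when the token is complete. No strlen is needed anywhere because the length is tracked in the struct.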

No, linked lists are most certainly not "faster" (how and wherever you measure such a thing). This is a terrible overhead.
If you really find that your current approach is a bottleneck, you could always allocate or reallocate your strings in sizes of powers of 2. Then you would only have to do realloc when you cross such a boundary for the total size of the char array.

I would suggest reading the entire text into one memory allocation, then going through it and NUL-terminating each string. Then count the number of strings, and make an array of pointers to each of the strings. That way you have one memory allocation for the text area, and one for the array of pointers.
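A rough sketch of that idea, assuming the strings are separated by a single known character (a real JSON parser would have to find string boundaries differently). The function name split_in_place is hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Split `text` in place on `sep`: every separator becomes '\0' and the
 * returned array points at the start of each piece. Returns NULL on
 * allocation failure. The caller frees only the returned array, because
 * the pieces themselves live inside `text`. */
static char **split_in_place(char *text, char sep, size_t *count)
{
    size_t n = 1;
    for (const char *p = text; *p; p++)
        if (*p == sep) n++;             /* first pass: count the pieces */

    char **parts = malloc(n * sizeof *parts);
    if (!parts) return NULL;

    size_t i = 0;
    parts[i++] = text;
    for (char *p = text; *p; p++) {
        if (*p == sep) {
            *p = '\0';                  /* terminate the previous piece */
            parts[i++] = p + 1;         /* the next piece starts after it */
        }
    }
    *count = n;
    return parts;
}
```

This costs two allocations in total (the text and the pointer array), however many strings there are.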

Most implementations of variable-length arrays/strings/whatever have a size and a capacity. The capacity is the allocated size, and the size is what's actually used.
typedef struct mutable_string {
char* data;
size_t capacity;
} mutable_string;
Allocating a new string looks like this:
#define INITIAL_CAPACITY 10
mutable_string* mutable_string_create_empty() {
mutable_string* str = malloc(sizeof(mutable_string));
if (!str) return NULL;
str->data = calloc(INITIAL_CAPACITY, 1);
if (!str->data) { free(str); return NULL; }
str->capacity = INITIAL_CAPACITY;
return str;
}
Now, any time you need to add a character to the string, you'd do this:
int mutable_string_concat_char(mutable_string* str, char chr) {
size_t len = strlen(str->data);
if (len + 1 >= str->capacity) { //No room for chr plus the terminator, so grow first
size_t new_capacity = str->capacity * 2;
char* new_data = realloc(str->data, new_capacity);
if (!new_data) return 0; //Failure; the original buffer is left untouched
str->data = new_data;
str->capacity = new_capacity;
}
str->data[len] = chr;
str->data[len + 1] = '\0'; //realloc does not zero the new bytes, so always terminate
return 1; //Success
}
The linked list approach is worse because:
You still need to do an allocation call every time you add a character;
It consumes a LOT more memory. The array approach consumes up to sizeof(size_t) + (string_length + 1) * 2 bytes. The linked list approach consumes string_length * sizeof(struct linked_string) bytes, before counting any per-allocation overhead.
Generally, linked lists are less cache-friendly than arrays.
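The memory comparison can be written out as arithmetic. This is only a sketch of the two formulas above. The helper names are made up, and allocator bookkeeping per malloc call is deliberately left out, which flatters the linked list.

```c
#include <stddef.h>

struct linked_string {
    char character;
    struct linked_string *next;
};

/* Payload bytes for n characters stored one node per character. */
static size_t linked_bytes(size_t n)
{
    return n * sizeof(struct linked_string);
}

/* Worst case for a doubling buffer holding n characters plus a
 * terminator: the capacity is at most twice what is in use. */
static size_t array_bytes_worst(size_t n)
{
    return sizeof(size_t) + (n + 1) * 2;
}
```

On a typical 64-bit platform sizeof(struct linked_string) is 16 because of padding after the char, so 1000 characters cost 16000 bytes as a list against at most 2010 bytes as an array.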

Related

What is the less expensive way to remove a char from the start of a string in C?

I have to create a very inexpensive algorithm (processor and memory) to remove the first char from a string (char array) in C.
I'm currently using:
char *newvalue = strdup(value+1);
free(value);
value = newvalue;
But I want to know if there is some less expensive way to do that. The string value is dynamically allocated.
value+1 is a char* that represents the string with the first character removed. It is the least expensive way to obtain such a string.
You'll have to be careful when freeing memory though to make sure to free the original pointer and not the shifted one.
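A small sketch of that pattern. Both helper names are hypothetical; dup_string just stands in for whatever produced the dynamically allocated string.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a heap copy of `src`, or NULL if allocation fails. */
static char *dup_string(const char *src)
{
    size_t n = strlen(src) + 1;
    char *p = malloc(n);
    if (p) memcpy(p, src, n);
    return p;
}

/* View of `s` without its first character: no copy, no allocation.
 * Returns `s` itself when the string is already empty. */
static const char *skip_first(const char *s)
{
    return *s ? s + 1 : s;
}
```

Keep the pointer returned by dup_string around and pass that one to free(). Passing the shifted pointer from skip_first to free() is undefined behaviour.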
Reuse the original array. This may or may not be faster, depending on the relative speed of memory (de)allocation and copying.
int size = strlen(value);
if (size > 0) memmove(value, value+1, size);
Since heap calls will be quite expensive, the obvious optimization is to avoid them.
If you need to do this often, you could probably come up with some simple wrapper around the bare pointer that can express this.
Something like:
typedef struct {
const char *chars;
size_t offset;
} mystring;
Then you'd need to devise an API to convert a mystring * into a character pointer, by adding the offset:
const char * mystring_get(const mystring *ms)
{
return ms->chars + ms->offset;
}
and of course a function to create a suffix where the 1st character is removed:
mystring mystring_tail(const mystring *ms)
{
const mystring suffix = { ms->chars, ms->offset + 1};
return suffix;
}
Note that mystring_tail() returns the new string structure by value, to avoid heap allocations.

An if-else block in Redis source (Simple Dynamic Strings) which i couldn't understand

Firstly, I'm really sorry for the title, but I have no idea how else to phrase it.
I'm trying to understand Simple Dynamic Strings, and between lines 138-141 in sds.c there is an if-else block which I couldn't understand. I don't know why it is there or what it does.
The relevant function is:
/* Enlarge the free space at the end of the sds string so that the caller
* is sure that after calling this function can overwrite up to addlen
* bytes after the end of the string, plus one more byte for nul term.
*
* Note: this does not change the *length* of the sds string as returned
* by sdslen(), but only the free buffer space we have. */
sds sdsMakeRoomFor(sds s, size_t addlen) {
struct sdshdr *sh, *newsh;
size_t free = sdsavail(s);
size_t len, newlen;
if (free >= addlen) return s;
len = sdslen(s);
sh = (void*) (s-(sizeof(struct sdshdr)));
newlen = (len+addlen);
if (newlen < SDS_MAX_PREALLOC) /* unwind: line 138 */
newlen *= 2;
else
newlen += SDS_MAX_PREALLOC;
newsh = zrealloc(sh, sizeof(struct sdshdr)+newlen+1);
if (newsh == NULL) return NULL;
newsh->free = newlen - len;
return newsh->buf;
}
Sorry for such a noob question but any help will be appreciated.
I assume you understand what it does, but not why.
The what is that it doubles the size of the buffer being allocated to hold the string when the computed new length is still below SDS_MAX_PREALLOC; above that threshold it adds a fixed SDS_MAX_PREALLOC bytes instead.
The why is to increase performance: if the string continues to grow (as dynamic strings are able to do), Redis won't need to reallocate a new buffer quite as soon as it would otherwise have had to. This is good, since realloc() is costly.
Basically, it's buying performance by spending memory, a very common trade-off.
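The policy in that if-else can be isolated into a tiny function. This is a sketch only: grow_target is a made-up name, and the 1 MB value for MAX_PREALLOC is an assumption for illustration, so check the Redis sources for the real SDS_MAX_PREALLOC.

```c
#include <stddef.h>

/* Assumed cap for this sketch; not taken from the Redis sources. */
#define MAX_PREALLOC ((size_t)1024 * 1024)

/* Given the length the string needs, return how much to allocate:
 * double small strings, but add only a fixed chunk to large ones. */
static size_t grow_target(size_t needed)
{
    if (needed < MAX_PREALLOC)
        return needed * 2;
    return needed + MAX_PREALLOC;
}
```

Doubling keeps appends cheap while the string is small, and the fixed chunk keeps a very large string from reserving as much spare room again as it already uses.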

C - resizing an array of pointers

I more or less have an idea, but I'm not sure if I've even got the right idea and I was hoping maybe I was just missing something obvious. Basically, I have an array of strings (C strings, so basically an array of pointers to character arrays) like so:
char **words;
I don't know how many words I'll have in the end. As I parse the string, I want to be able to resize the array, add a pointer to the word, move on to the next word, and repeat.
The only way I can think of is to maybe start with a reasonable number and realloc every time I hit the end of the array, but I'm not entirely sure that works. Like I want to be able to access words[0], words[1], etc. If I had char **words[10] and called
realloc(words, n+4) //assuming this is correct since pointers are 4 bytes
once I hit the end of the array, if I did words[11] = new word, is that even valid?
Keep track of your array size:
size_t arr_size = 10;
And give it an initial chunk of memory:
char **words = malloc( arr_size * sizeof(char*) );
Once you have filled all positions, you may want to double the array size:
size_t tailIdx = 0;
while( ... ) {
if( tailIdx >= arr_size ) {
char **newWords;
arr_size *= 2;
newWords = realloc(words, arr_size * sizeof(char*) );
if( newWords == NULL ) { some_error(); }
words = newWords;
}
words[tailIdx++] = get_next_word();
}
...
free(words);
That approach is fine, although you may want to do realloc(words, n * 2) instead. Calling realloc and malloc is expensive, so you want to reallocate as little as possible, and doubling means you can go for longer without reallocating (and possibly copying data). This is how most buffers are implemented to amortize allocation and copy costs. So just double the size of your buffer every time you run out of space.
You are probably going to want to allocate multiple blocks of memory. One for words, which will contain the array of pointers. And then another block for each word, which will be pointed to by elements in the words array.
Adding elements then involves realloc()ing the words array and then allocating new memory blocks for each new word.
Be careful how you write your clean up code. You'll need to be sure to free up all those blocks.
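Here is one way the two kinds of block and the clean-up could fit together. Treat it as a sketch: word_list and its functions are hypothetical names, and the starting capacity of 8 is arbitrary.

```c
#include <stdlib.h>
#include <string.h>

/* One block for the pointer array, plus one block per word. */
typedef struct {
    char **items;
    size_t count;
    size_t cap;
} word_list;

/* Copy `word` into the list; returns 1 on success, 0 on allocation failure. */
static int word_list_add(word_list *wl, const char *word)
{
    if (wl->count == wl->cap) {
        size_t new_cap = wl->cap ? wl->cap * 2 : 8;
        char **p = realloc(wl->items, new_cap * sizeof *p);
        if (!p) return 0;               /* the old array is still intact */
        wl->items = p;
        wl->cap = new_cap;
    }
    size_t n = strlen(word) + 1;
    char *copy = malloc(n);             /* separate block for this word */
    if (!copy) return 0;
    memcpy(copy, word, n);
    wl->items[wl->count++] = copy;
    return 1;
}

/* Free every word first, then the pointer array itself. */
static void word_list_free(word_list *wl)
{
    for (size_t i = 0; i < wl->count; i++)
        free(wl->items[i]);
    free(wl->items);
    wl->items = NULL;
    wl->count = wl->cap = 0;
}
```

Note that the realloc size is in bytes (new_cap * sizeof *p), not in elements, which is the mistake in the realloc(words, n+4) attempt from the question.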

How to allocate memory for an array of strings of unknown length in C

I have an array, say, text, that contains strings read in by another function. The length of the strings is unknown and the amount of them is unknown as well. How should I try to allocate memory to an array of strings (and not to the strings themselves, which already exist as separate arrays)?
What I have set up right now seems to read the strings just fine, and seems to do the post-processing I want done correctly (I tried this with a static array). However, when I try to printf the elements of text, I get a segmentation fault. To be more precise, I get a segmentation fault when I try to print out specific elements of text, such as text[3] or text[5]. I assume this means that I'm allocating memory to text incorrectly and all the strings read are not saved to text correctly?
So far I've tried different approaches, such as allocating a set amount of some size_t=k , k*sizeof(char) at first, and then reallocating more memory (with realloc k*sizeof(char)) if cnt == (k-2), where cnt is the index of **text.
I tried to search for this, but the only similar problem I found was with a set amount of strings of unknown length.
I'd like to figure out as much as I can on my own, and didn't post the actual code because of that. However, if none of this makes any sense, I'll post it.
EDIT: Here's the code
int main(void){
char **text;
size_t k=100;
size_t cnt=1;
int ch;
size_t lng;
text=malloc(k*sizeof(char));
printf("Input:\n");
while(1) {
ch = getchar();
if (ch == EOF) {
text[cnt++]='\0';
break;
}
if (cnt == k - 2) {
k *= 2;
text = realloc(text, (k * sizeof(char))); /* I guess at least this is incorrect?*/
}
text[cnt]=readInput(ch); /* read(ch) just reads the line*/
lng=strlen(text[cnt]);
printf("%d,%d\n",lng,cnt);
cnt++;
}
text=realloc(text,cnt*sizeof(char));
print(text); /*prints all the lines*/
return 0;
}
The short answer is you can't directly allocate the memory unless you know how much to allocate.
However, there are various ways of determining how much you need to allocate.
There are two aspects to this. One is knowing how many strings you need to handle. There must be some defined way of knowing; either you're given a count, or there is some specific pointer value (usually NULL) that tells you when you've reached the end.
To allocate the array of pointers to pointers, it is probably simplest to count the number of necessary pointers, and then allocate the space. Assuming a null terminated list:
size_t i;
for (i = 0; list[i] != NULL; i++)
;
char **space = malloc(i * sizeof(*space));
...error check allocation...
For each string, you can use strdup(); you assume that the strings are well-formed and hence null terminated. Or you can write your own analogue of strdup().
for (i = 0; list[i] != NULL; i++)
{
space[i] = strdup(list[i]);
...error check allocation...
}
An alternative approach scans the list of pointers once, but uses malloc() and realloc() multiple times. This is probably slower overall.
If you can't reliably tell when the list of strings ends or when the strings themselves end, you are hosed. Completely and utterly hosed.
C doesn't have strings. It just has pointers to (conventionally null-terminated) sequences of characters, and calls them strings.
So just allocate first an array of pointers:
size_t nbelem= 10; /// number of elements
char **arr = calloc(nbelem, sizeof(char*));
You really want calloc because you really want that array to be cleared, so each pointer there is NULL. Of course, you test that calloc succeeded:
if (!arr) perror("calloc failed"), exit(EXIT_FAILURE);
At last, you fill some of the elements of the array:
arr[0] = "hello";
arr[1] = strdup("world");
(Don't forget to free the result of strdup and the result of calloc).
You could grow your array with realloc (but I don't advise doing that, because if you assign the result straight back to the same pointer and realloc fails, you have lost the only pointer to your data). You could simply grow it by allocating a bigger array, copying the old contents into it, and redirecting the pointer, e.g.
{ size_t newnbelem = 3*nbelem/2+10;
char**oldarr = arr;
char**newarr = calloc(newnbelem, sizeof(char*));
if (!newarr) perror("bigger calloc"), exit(EXIT_FAILURE);
memcpy (newarr, oldarr, sizeof(char*)*nbelem);
free (oldarr);
arr = newarr;
}
Don't forget to compile with gcc -Wall -g on Linux (improve your code till no warnings are given), and learn how to use the gdb debugger and the valgrind memory leak detector.
In C you cannot allocate an array of strings directly. You should stick with an array of pointers to char and use it as an array of strings. So use
char* strarr[length];
And to maintain the arrays of characters, you may take an approach somewhat like this:
Allocate a block of memory through a call to malloc()
Keep track of the size of the input
Whenever you need an increase in buffer size, call realloc(ptr, size)

How do I declare an array of undefined or no initial size?

I know it could be done using malloc, but I do not know how to use it yet.
For example, I wanted the user to input several numbers using an infinite loop with a sentinel to put a stop into it (i.e. -1), but since I do not know yet how many he/she will input, I have to declare an array with no initial size, but I'm also aware that it won't work like this int arr[]; at compile time since it has to have a definite number of elements.
Declaring it with an exaggerated size like int arr[1000]; would work but it feels dumb (and wastes memory, since it would reserve space for all 1000 integers) and I would like to know a more elegant way to do this.
This can be done by using a pointer, and allocating memory on the heap using malloc.
Note that there is no way to later ask how big that memory block is. You have to keep track of the array size yourself.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(int argc, char** argv)
{
/* declare a pointer to an integer */
int *data;
/* we also have to keep track of how big our array is - I use 50 as an example*/
const int datacount = 50;
data = malloc(sizeof(int) * datacount); /* allocate memory for 50 int's */
if (!data) { /* If data == 0 after the call to malloc, allocation failed for some reason */
perror("Error allocating memory");
abort();
}
/* at this point, we know that data points to a valid block of memory.
Remember, however, that this memory is not initialized in any way -- it contains garbage.
Let's start by clearing it. */
memset(data, 0, sizeof(int)*datacount);
/* now our array contains all zeroes. */
data[0] = 1;
data[2] = 15;
data[49] = 66; /* the last element in our array, since we start counting from 0 */
/* Loop through the array, printing out the values (mostly zeroes, but even so) */
for(int i = 0; i < datacount; ++i) {
printf("Element %d: %d\n", i, data[i]);
}
free(data); /* release the block once we are done with it */
return 0;
}
That's it. What follows is a more involved explanation of why this works :)
I don't know how well you know C pointers, but array access in C (like array[2]) is actually a shorthand for accessing memory via a pointer. To access the memory pointed to by data, you write *data. This is known as dereferencing the pointer. Since data is of type int *, then *data is of type int. Now to an important piece of information: (data + 2) means "add the byte size of 2 ints to the address pointed to by data".
An array in C is just a sequence of values in adjacent memory. array[1] is just next to array[0]. So when we allocate a big block of memory and want to use it as an array, we need an easy way of getting the direct address of every element inside. Luckily, C lets us use the array notation on pointers as well. data[0] means the same thing as *(data+0), namely "access the memory pointed to by data". data[2] means *(data+2), and accesses the third int in the memory block.
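The equivalence between data[i] and *(data + i) can be checked with a couple of tiny helpers. The function names here are made up for illustration.

```c
#include <stddef.h>

/* Sum n ints using pointer arithmetic only. */
static int sum_with_pointers(const int *data, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += *(data + i);           /* same element as data[i] */
    return total;
}

/* Byte distance between data and data + k, which is k * sizeof(int). */
static size_t byte_offset(const int *data, size_t k)
{
    return (size_t)((const char *)(data + k) - (const char *)data);
}
```

byte_offset shows the scaling: adding 2 to an int pointer moves it by 2 * sizeof(int) bytes, not by 2 bytes.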
The way it's often done is as follows:
allocate an array of some initial (fairly small) size;
read into this array, keeping track of how many elements you've read;
once the array is full, reallocate it, doubling the size and preserving (i.e. copying) the contents;
repeat until done.
I find that this pattern comes up pretty frequently.
What's interesting about this method is that it allows one to insert N elements into an empty array one-by-one in amortized O(N) time without knowing N in advance.
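Applied to the question's case (numbers until a sentinel such as -1), the steps above might look like the sketch below. It parses from a string rather than from the keyboard so it is easy to try out. read_until is a hypothetical name and the starting capacity of 4 is chosen small on purpose so the growth path is exercised.

```c
#include <stdlib.h>

/* Parse whitespace-separated integers from `input` until `sentinel` or
 * the end of the text, storing them in a buffer that doubles when full.
 * Returns NULL on allocation failure; otherwise the caller frees the
 * result and finds the element count in *count. */
static int *read_until(const char *input, int sentinel, size_t *count)
{
    size_t used = 0, cap = 4;
    int *arr = malloc(cap * sizeof *arr);
    if (!arr) return NULL;

    for (;;) {
        char *end;
        long v = strtol(input, &end, 10);
        if (end == input || v == sentinel) break;  /* no number, or stop value */
        input = end;

        if (used == cap) {                         /* full: double the capacity */
            int *tmp = realloc(arr, cap * 2 * sizeof *arr);
            if (!tmp) { free(arr); return NULL; }
            arr = tmp;
            cap *= 2;
        }
        arr[used++] = (int)v;
    }
    *count = used;
    return arr;
}
```

Because the capacity doubles, inserting N numbers triggers only about log2(N) reallocations, which is where the amortized O(N) total comes from.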
Modern C, aka C99, has variable length arrays, VLA. Unfortunately, not all compilers support this but if yours does this would be an alternative.
Try to implement a dynamic data structure such as a linked list.
Here's a sample program that reads stdin into a memory buffer that grows as needed. It's simple enough that it should give some insight into how you might handle this kind of thing. One thing that would probably be done differently in a real program is how much the array grows in each allocation. I kept it small here to help keep things simpler if you wanted to step through in a debugger. A real program would probably use a much larger allocation increment (often, the allocation size is doubled, but if you're going to do that you should probably 'cap' the increment at some reasonable size, since it might not make sense to double the allocation when you get into the hundreds of megabytes).
Also, I used indexed access to the buffer here as an example, but in a real program I probably wouldn't do that.
#include <stdlib.h>
#include <stdio.h>
void fatal_error(void);
int main( int argc, char** argv)
{
int buf_size = 0;
int buf_used = 0;
char* buf = NULL;
char* tmp = NULL;
int c; /* int, not char, so that EOF can be told apart from every valid character */
int i = 0;
while ((c = getchar()) != EOF) {
if (buf_used == buf_size) {
//need more space in the array
buf_size += 20;
tmp = realloc(buf, buf_size); // get a new larger array
if (!tmp) fatal_error();
buf = tmp;
}
buf[buf_used] = c; // pointer can be indexed like an array
++buf_used;
}
puts("\n\n*** Dump of stdin ***\n");
for (i = 0; i < buf_used; ++i) {
putchar(buf[i]);
}
free(buf);
return 0;
}
void fatal_error(void)
{
fputs("fatal error - out of memory\n", stderr);
exit(1);
}
This example combined with examples in other answers should give you an idea of how this kind of thing is handled at a low level.
One way I can imagine is to use a linked list, if you need all the numbers entered before the user enters something which indicates the loop termination. (Posting this as the first option because I have never done this for user input; it just seemed interesting. Wasteful but artistic.)
Another way is to do buffered input: allocate a buffer, fill it, and re-allocate if the loop continues (not elegant, but the most rational for the given use case).
I don't consider either of these elegant, though. I would probably change the use case instead (the most rational option).

Resources