How to handle a huge string correctly? - c

This may be a newbie question, but I want to avoid buffer overflows. I read a great deal of data from the registry which will be uploaded to an SQL database. I read the data in a loop, and insert it after each iteration. My problem is that this way, if I read 20 keys and the values under them (the number of keys differs on every computer), I have to connect to the SQL database 20 times.
However, I found out that there is a way to create a stored procedure and pass the whole data to it, so that the SQL server deals with the data and I only have to connect once.
Unfortunately I don't know how to handle such a big string safely, avoiding unexpected errors like buffer overflow. So my question is: how should I declare this string?
Should I just declare something like char string[15000]; and concatenate the values? Or is there a simpler way of doing this?
Thanks!

STL strings should do a much better job than the approach you have described.
You'll also want to build in some thresholds. For example, if your string grows beyond a megabyte, it is worth considering splitting the work across several SQL calls, since your transaction will otherwise become too long.

You may read (key, value) pairs from the registry and store them in a preallocated buffer while there is sufficient space.
Maintain a "write" position within the buffer. You can use it to check whether there is enough space for a new (key, value) pair.
When there is no space left for a new (key, value) pair, execute the stored procedure and reset the "write" position within the buffer.
At the end of the "read key/value pairs" loop, check the buffer's "write" position and execute the stored procedure if it is greater than 0.
This way you minimize the number of times you execute the stored procedure on the server.
#define MAX_BUFFER_SIZE 15000   /* a macro, so it can size a file-scope array */

char buffer[MAX_BUFFER_SIZE];
int buffer_pos = 0; /* "write" position within the buffer; int, not char, so it can hold values up to 15000 */
...
// Retrieve key/value pairs and push them into the buffer.
while (get_next_key_value(key, value)) {
    post(key, value);
}
// Execute the stored procedure if the buffer is not empty.
if (buffer_pos > 0) {
    exec_stored_procedure(buffer);
}
...
bool post(const char *key, const char *value)
{
    int len = strlen(key) + strlen(value) + <length of separators>;
    // Execute the stored procedure if there is no space for the new key/value pair.
    if (len + buffer_pos >= MAX_BUFFER_SIZE) {
        exec_stored_procedure(buffer);
        buffer_pos = 0; // Reset "write" position.
    }
    // Copy the key/value pair into the buffer if there is sufficient space.
    if (len + buffer_pos < MAX_BUFFER_SIZE) {
        <copy key and value into the buffer, starting at the "write" position>
        buffer_pos += len; // Advance "write" position.
        return true;
    } else {
        return false;
    }
}
bool exec_stored_procedure(const char *buf)
{
    <connect to the SQL database and execute the stored procedure>
}

To do this properly in C you need to allocate the memory dynamically, using malloc or one of the operating system equivalents. The idea here is to figure out how much memory you actually need and then allocate the correct amount. The registry functions provide various ways you can determine how much memory you need for each read.
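For example, with the Win32 registry API you can ask for the required size by passing a NULL data pointer, then allocate exactly that much (a minimal sketch; hKey and valueName are assumed to be set up elsewhere):
#include <windows.h>
#include <stdlib.h>

/* Sketch: query the size first, then allocate and read for real. */
DWORD size = 0;
if (RegQueryValueExA(hKey, valueName, NULL, NULL, NULL, &size) == ERROR_SUCCESS)
{
    char *data = malloc(size);
    if (data != NULL &&
        RegQueryValueExA(hKey, valueName, NULL, NULL, (LPBYTE)data, &size) == ERROR_SUCCESS)
    {
        /* use data; note REG_SZ data is not guaranteed to be NUL-terminated */
    }
    free(data);
}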
It gets a bit trickier if you're reading multiple values and concatenating them. One approach would be to read each value into a separately allocated memory block, then concatenate them to a new memory block once you've got them all.
However, it may not be necessary to go to this much trouble. If you can say "if the data is more than X bytes the program will fail" then you can just create a static buffer as you suggest. Just make sure that you provide the registry and/or string concatenation functions with the correct size for the remaining part of the buffer, and check for errors, so that if it does fail it fails properly rather than crashing.
One more note: char buf[15000]; is OK provided the declaration is at file scope, but if it appears inside a function you should add the static specifier. Local variables are by default allocated on the stack, so a large allocation is liable to overflow it and crash your program. (Fifteen thousand bytes should be OK, but it's not a good habit to get into.)
Also, it is preferable to define a macro for the size of your buffer, and use it consistently:
#define BUFFER_SIZE 15000
char buf[BUFFER_SIZE];
so that you can easily increase the size of the buffer later on by modifying a single line.

Related

(C) Recursive strcpy() that takes only 1 parameter

Let me be clear from the get-go: this is not a dupe, and I'll explain how.
So, I tasked myself to write a function that imitates strcpy but with 2 conditions:
it needs to be recursive
it must take a single parameter (which is the original string)
The function should return a pointer to the newly copied string. So this is what I've tried so far:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

char *my_strcpy(char *original);

int main(void) {
    char *string = my_strcpy("alpine");
    printf("string = <%s>\n", string);
    return 0;
}

char *my_strcpy(char *original) {
    char *string = (char *)malloc(10);
    if (*original == '\0') {
        return string;
    }
    *string++ = *original;
    my_strcpy(original + 1);
}
The problem is somewhat obvious: string gets malloc-ed every time my_strcpy() is called. One solution I could think of would be to allocate memory for string only the first time the function is called. Since I'm allowed to have only 1 parameter, the only thing I could think of was to inspect the call stack, but I don't know whether that's allowed, and it does feel like cheating.
Is there a logical solution to this problem?
You wrote it as tail-recursive, but without making the function non-reentrant I think your only option is to make it head-recursive and repeatedly call realloc on the return value of the recursive call to expand it, then add in one character. This has the same problem as just calling strlen up front to do the allocation: it does work linear in the length of the input string in every recursive call, and so is implicitly an n-squared algorithm (0.5*n*(n+1)). You can improve the amortized time complexity by expanding the string by a factor and only growing it when the existing buffer is full, but it's still not great.
There's a reason you wouldn't use recursion for this task (which you probably know): stack depth will be equal to input string length, and a whole stack frame pushed and a call instruction for every character copied is a lot of overhead. Even so, you wouldn't do it recursively with a single argument if you were really going to do it recursively: you'd make a single-argument function that declares some locals and calls a recursive function with multiple arguments.
Even with the realloc trick, it'll be difficult or impossible to count the characters in the original as you go so that you can call realloc appropriately, remembering that other stdlib "str*" functions are off limits because they'll likely make your whole function n-squared, which I assumed we were trying to avoid.
Ugly tricks like verifying that the string is as long as a pointer and replacing the first few characters with a pointer by memcpy could be used, making the base case for the recursion more complicated, but, um, yuck.
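For concreteness, here is a minimal sketch of the head-recursive realloc approach described above. It deliberately exhibits the n-squared strlen/memmove cost already discussed, and error handling is omitted for brevity:
#include <stdlib.h>
#include <string.h>

char *my_strcpy(char *original)
{
    if (*original == '\0') {
        char *s = malloc(1);   /* allocation checks omitted */
        *s = '\0';
        return s;
    }
    char *tail = my_strcpy(original + 1);  /* copy the rest first */
    size_t len = strlen(tail);
    tail = realloc(tail, len + 2);         /* grow by one character */
    memmove(tail + 1, tail, len + 1);      /* shift right, including the NUL */
    tail[0] = *original;                   /* prepend the current character */
    return tail;
}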
Recursion is a technique for analysing problems. That is, you start with the problem and think about what the recursive structure of a solution might be. You don't start with a recursive structure and then attempt to shoe-horn your problem willy-nilly into it.
In other words, it's good to practice recursive analysis, but the task you have set yourself -- to force the solution to have the form of a one-parameter function -- is not the way to do that. If you start contemplating global or static variables, or extracting runtime context by breaking into the call stack, you have a pretty good hint that you have not yet found the appropriate recursive analysis.
That's not to say that there is not an elegant recursive solution to your problem. There is one, but before we get to it, we might want to abstract away a detail of the problem in order to provide some motivation.
Clearly, if we have a contiguous data structure already in memory, making a contiguous copy is not challenging. If we don't know how big it is, we can do two traversals: one to find its size, after which we can allocate the needed memory, and another to do the copy. Both of those tasks are simple loops, which is one form of recursion.
The essential nature of a recursive solution is to think about how to step from a problem to a (slightly) simpler or smaller problem. Or, more commonly, a small number of smaller or simpler problems.
That's the nature of one of the most classic recursive problems: sorting a sequence of numbers. The basic structure: divide the sequence into two roughly equal parts; sort each part (the recursive step) and put the results back together so that the combination is sorted. That basic outline has at least two interesting (and very different) manifestations:
Divide the sequence arbitrarily into two almost-equal parts, either by putting alternate elements in alternate parts or by putting the first half in one part and the rest in the other. (The first option works nicely if we don't know in advance how big the sequence is.) To put the sorted parts together, we have to interleave ("merge") them. (This is mergesort.)
Divide the sequence into two ranges by estimating the middle value and putting all smaller values into one part and all larger values into the other part. To put the sorted parts together, we just concatenate them. (This is quicksort.)
In both cases, we also need to use the fact that a single-element sequence is (trivially) sorted, so no more processing needs to be done on it. If we divide a sequence into two parts often enough, ensuring that neither part is empty, we must eventually reach a part containing one element. (If we manage to do the division accurately, that happens quite soon.)
It's interesting to implement these two strategies using singly-linked lists, so that the lengths really are not easily known. Both can be implemented this way, and the implementations reveal something important about the nature of sorting.
But let's get back to the much simpler problem at hand, copying a sequence into a newly-allocated contiguous array. To make the problem more interesting, we won't assume that the sequence is already stored contiguously, nor that we can traverse it twice.
To start, we need to find the length of the sequence, which we can do by observing that an empty sequence has length zero and any other sequence has one more element than the subsequence starting after the first element (the "tail" of the sequence.)
Length(seq):
    If seq is empty, return 0.
    Else, return 1 + Length(Tail(seq))
Now, suppose we have allocated storage for the copy. We can copy by observing that an empty sequence is fully copied, and any other sequence can be copied by placing its first element into the allocated storage and then copying the tail of the sequence into the storage starting at the second position (this procedure logically takes two arguments):
Copy(destination, seq):
    If seq is not empty:
        Put Head(seq) into the location destination
        Call Copy(destination + 1, Tail(seq))
But we can't just put those two procedures together, because that would traverse the sequence twice, which we said we couldn't do. So we need to somehow nest these algorithms.
To do that, we have to start by passing the accumulated length down through the recursion so that we can use it to allocate the storage once we know how big the object is. Then, on the way back, we copy each element we counted on the way down:
Copy(seq, length):
    If seq is not empty:
        Set item to its first element (that is, Head(seq))
        Set destination to Copy(Tail(seq), length + 1)
        Store item at location destination - 1
        Return destination - 1
    Otherwise: (seq is empty)
        Set destination to Allocate(length)
        # (see important note below)
        Return destination + length
To correctly start the recursion, we need to pass in 0 as the initial length. It's bad style to force the user to insert "magic numbers", so we would normally wrap the function with a single-argument driver:
Strdup(seq):
    Return Copy(seq, 0)
Important Note: if this were written in C using strings, we would need to NUL-terminate the copy. That means allocating length+1 bytes, rather than length, and then storing 0 at destination+length.
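Rendered in C, this might look like the following (a sketch that follows the pseudocode above, including the NUL adjustment from the note; my_strdup and copy_helper are names invented here):
#include <stdlib.h>

/* Walks to the end of the string counting characters, allocates once
   in the base case, then fills the buffer back-to-front on the way out. */
static char *copy_helper(const char *seq, size_t length)
{
    if (*seq != '\0') {
        char item = *seq;                             /* Head(seq) */
        char *destination = copy_helper(seq + 1, length + 1);
        *--destination = item;                        /* store at destination - 1 */
        return destination;
    } else {
        char *destination = malloc(length + 1);       /* +1 for the NUL; check omitted */
        destination[length] = '\0';
        return destination + length;
    }
}

char *my_strdup(const char *seq)
{
    return copy_helper(seq, 0);                       /* hide the magic 0 */
}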
You didn't say we couldn't use strcat.
So here is a logical (although somewhat useless) answer, which uses recursion to do nothing other than chop off the last character and add it back on again.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

char *my_strcpy(char *original);

int main(void) {
    char *string = my_strcpy("alpine");
    printf("string = <%s>\n", string);
    return 0;
}

char *my_strcpy(char *original) {
    if (*original == '\0') {
        return original;
    }
    int len = strlen(original);
    char *string = (char *)malloc(len + 1);
    char *result = (char *)malloc(len + 1);
    string[0] = result[0] = '\0';
    strcat(string, original);
    len--;
    char store[2] = { string[len], '\0' }; // save last char
    string[len] = '\0';                    // cut it off
    strcat(result, my_strcpy(string));
    strcat(result, store);                 // add it back
    return result;
}

How to save memory in an array of which many elements are always 0?

I have a tensor with four indices in C that looks like:
int n = 4;
int l = 5;
int p = 6;
int q = 2;
I then initialize each element of T
//loop over each of the above indices
T[n][l][p][q]=...
However, many of them are zero, and there are symmetries such as:
T[4][3][2][1]=-T[3][4][2][1]
How can I save memory on the elements of T which are zero? Ideally I would like to place something like NULL in those positions so they use 0 bytes instead of 8. Later in the calculation I could then check whether an element is zero by testing whether it equals NULL.
How do I implicitly include those symmetries in T without using excess memory?
Edit: the symmetry can perhaps be fixed with a different implementation. But what about the zeros? Is there any implementation to not have them waste memory?
You cannot influence the size of any variable by a value you write to it.
If you want to save memory, you not only have to not use it, you have to not define a variable that uses it.
And if you do not define a variable, then you can never use it.
Then you have saved memory.
This is of course obvious.
Now, how to apply that to your problem.
Allow me to simplify, for one thing because you did not give enough information and explanation (at least not for me to understand every detail), and for another to keep the explanation simple.
So I hope that it suffices if I solve the following problem for you, which I think is kind of the little brother of your problem.
I have a large array in C (not really large; let's say N entries, with N == 20).
But for special reasons, I will never need to actually read or write any even indices; they should act as if they contain 0, but I want to save the memory used by them.
So actually I want to use only M of the entries, with M*2 == N.
So instead of
int Array[N]; /* all the theoretical elements */
I define
int Array[M]; /* only the actually used elements */
Of course I cannot access any of the elements which are not there, and it will never really be necessary to.
But for the logic of my program, I want to be able to code as if I could access them, while being sure that they will only ever read 0 and ignore any written value.
So what I do is wrapping all accesses to the array.
int GetArray(int index)
{
    if (index & 1)
    {
        /* odd: really access the array,
           but at a calculated index */
        return Array[index / 2];
    }
    else
    {
        /* even: always 0 */
        return 0;
    }
}

void SetArray(int index, int value)
{
    if (index & 1)
    {
        /* odd: really access the array,
           but at a calculated index */
        Array[index / 2] = value;
    }
    else
    {
        /* even: no need to store anything, stays always "0" */
    }
}
So I can read and write as if the array were twice as large, while guaranteeing that the faked elements are never actually used.
And by mapping the indices as
actualindex = wantindex / 2
I ensure that I do not access beyond the size of the actually existing array.
Porting this concept to the more complicated setup you have described is your job. You know all the details, and you can test whether everything works.
I recommend to extend GetArray() and SetArray() by checks on the resulting index, to make sure that it is never outside of the actual array.
You can also add all kinds of self checks to verify that all your rules and expectations are met.
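For instance, a checked getter could look like this (a sketch; M and Array are assumed from above):
#include <assert.h>

int GetArrayChecked(int index)
{
    if ((index & 1) == 0)
    {
        return 0; /* faked element, always 0 */
    }
    int actual = index / 2;
    /* self-check: never read outside the real array */
    assert(actual >= 0 && actual < M);
    return Array[actual];
}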

Most efficient way to check if elements in an array have changed

I have an array in C and I need to perform some operation only if the elements in the array have changed. However, the time and memory this takes are very important. I realized that an efficient way to do it would probably be to hash all the elements of the array and compare the result with the previous result; if they match, the elements didn't change. I would, however, like to know whether this is the most efficient approach. Also, since the array is only 8 bytes long (1 byte per element), which hashing function would be the least time-consuming?
The elements of the array are actually received from another microcontroller, so they may or may not change depending on whether what the other microcontroller measured is the same or not.
If you weren't tied to a simple array, you could create an "MRU" list of structures, where each structure contains a flag that indicates whether the item has changed since it was last inspected.
Every time an item changes, set its "changed" flag and move it to the head of the list. When you need to check for changed items, traverse the list from the head, unsetting the changed flags and stopping at the first element whose flag is not set.
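A node for such a list might look like this (a sketch with invented names):
/* Hypothetical node for the MRU list described above. */
struct MruItem {
    unsigned char value;
    int changed;          /* set when value is updated */
    struct MruItem *next; /* list kept most-recently-changed first */
};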
Sorry, I missed the part about the array being only 8 bytes long. With that info, and with the new info from your edit, I'm thinking the previous suggestion is not ideal.
If the array is only 8 bytes long, why not just cache a copy of the previous array and compare it to the newly received one?
Below is a clarification of my comment about "shortcutting" the compares. How you implement this would depend on what the sizeof(int) is on the platform used.
Using a 64-bit integer you could get away with one compare to determine if the array has changed. For example:
#include <assert.h>
#include <string.h>

#define ARR_SIZE 8

unsigned char cachedArr[ARR_SIZE];
unsigned char targetArr[ARR_SIZE];

unsigned int *ic = (unsigned int *)cachedArr;
unsigned int *it = (unsigned int *)targetArr;

// This assertion needs to be true for this implementation to work
// correctly (i.e. int must be as wide as the whole 8-byte array).
assert(sizeof(int) == sizeof(cachedArr));

/*
** ...
** assume initialization and other stuff here
** leading into the main loop that is receiving the target array data.
** ...
*/

if (*ic != *it)
{
    // Target array has changed; find out which element(s) changed.
    // If you only cared that there was a change and did not care
    // to know which specific element(s) had changed you could forego
    // this loop altogether.
    for (int i = 0; i < ARR_SIZE; i++)
    {
        if (cachedArr[i] != targetArr[i])
        {
            // Do whatever needs to be done based on the i'th element
            // changing
        }
    }
    // Cache the array again since it has changed.
    memcpy(cachedArr, targetArr, sizeof(cachedArr));
}
// else no change to the array
// else no change to the array
If the native integer size is smaller than 64 bits you can use the same idea, but you'd have to loop over the array sizeof(cachedArr) / sizeof(unsigned int) times; and there is a worst-case scenario involved (but isn't there always) when the change is in the last chunk tested.
It should be noted that with doing any char to integer type casting you may need to take into consideration alignment (if the char data is aligned to the appropriate word-size boundary).
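A sketch of that chunked loop, reusing the ic/it pointers from above (the worst case is a difference in the last chunk):
int changed = 0;
for (size_t i = 0; i < sizeof(cachedArr) / sizeof(unsigned int); i++)
{
    if (ic[i] != it[i])
    {
        changed = 1;
        break;
    }
}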
Thinking further upon this however, it might be better altogether to just unroll the loop yourself and do:
if (cachedArr[0] != targetArr[0])
{
    doElement0ChangedWork();
}
if (cachedArr[1] != targetArr[1])
{
    doElement1ChangedWork();
}
if (cachedArr[2] != targetArr[2])
{
    doElement2ChangedWork();
}
if (cachedArr[3] != targetArr[3])
{
    doElement3ChangedWork();
}
if (cachedArr[4] != targetArr[4])
{
    doElement4ChangedWork();
}
if (cachedArr[5] != targetArr[5])
{
    doElement5ChangedWork();
}
if (cachedArr[6] != targetArr[6])
{
    doElement6ChangedWork();
}
if (cachedArr[7] != targetArr[7])
{
    doElement7ChangedWork();
}
Again, depending on whether or not you need to know which specific element(s) changed, that could be tightened up. This results in more instruction memory being used but eliminates the loop overhead (the good old memory-versus-speed trade-off).
As with anything time/memory related test, measure, compare, tweak and repeat until desired results are achieved.
only if the elements in an array have changed
Who else but you is going to change them? You can just keep track of whether you've made a change since the last time you did the operation.
If you don't want to do that (perhaps because it'd require recording changes in too many places, or because the record-keeping would take too much time, or because another thread or other hardware is messing with the array), just save the old contents of the array in a separate array. It's only 8 bytes. When you want to see whether anything has changed, compare the current array to the copy element-by-element.
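Since it's only 8 bytes, that comparison can also be a single memcmp (a sketch; the array names are invented):
#include <string.h>

unsigned char current[8];  /* the live array */
unsigned char previous[8]; /* saved copy from last time */

if (memcmp(previous, current, sizeof(current)) != 0)
{
    /* something changed: do the operation, then refresh the copy */
    memcpy(previous, current, sizeof(current));
}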
As others have said, the elements will only change if the code changed them.
Maybe this data can be changed by another user? Otherwise you would know that you had changed an entry.
As far as the hash function goes: an 8-byte array can take 2^64 different values, and a hash has to be computed, which costs time and memory, so I don't think that will work well for your application.
I would just compare bytes until you find one that has changed. If something has changed, you will check 4 bytes on average before you know that your array has changed (assuming each byte is equally likely to change).
If nothing has changed, that is the worst-case scenario and you will have to check all eight bytes to conclude that none have changed.
If the array is only 8 bytes long, you can treat it as a single long long value. Suppose the original array is char data[8].
long long *pData = (long long *)data;
long long olddata = *pData;

/* ... data[] is overwritten with the newly received values ... */

if (olddata != *pData)
{
    // detect which one changed
}
I mean, this way you operate on all the data in one shot, which is much faster than accessing each element by index. A hash is slower in this case.
If it is byte-oriented with only eight elements, XORing the pairs and ORing the results together would be an efficient comparison (note it must be OR, not AND: any nonzero XOR means something changed):
if ((LocalArray[0] ^ ReceivedArray[0]) | (LocalArray[1] ^ ReceivedArray[1]) | /* ... */ | (LocalArray[7] ^ ReceivedArray[7]))
{
    // Yes, it has changed
}

C: Is there an advantage to allocating more memory than is needed?

I am working on a Windows C project which is string-intensive: I need to convert a marked-up string from one form to another. The basic flow is something like:
DWORD convert(char *point, DWORD extent)
{
    char *point_end = point + extent;
    char *result = memory_alloc(1);
    char *p_result = result;
    DWORD result_extent = 0;

    while (point < point_end)
    {
        switch (*point)
        {
        case FOO:
            result_extent = p_result - result;
            result = memory_realloc(12);       /* fake: grow result by 12 bytes */
            p_result = result + result_extent; /* re-anchor after realloc */
            *p_result++ = '\n';
            *p_result++ = '\t';
            memcpy(p_result, point, 10);
            point += 10;
            p_result += 10;
            break;
        case BAR:
            result_extent = p_result - result;
            result = memory_realloc(1);        /* fake: grow result by 1 byte */
            p_result = result + result_extent;
            *p_result++ = *point++;
            break;
        default:
            point++;
            break;
        }
    }
    result_extent = p_result - result;
    // assume point is big enough to take anything I would copy to it
    memcpy(point, result, result_extent);
    return result_extent;
}
memory_alloc() and memory_realloc() are fake functions to highlight the purpose of my question. I do not know beforehand how big the result 'string' will be (technically it's not a C-style/null-terminated string I'm working with, just a pointer to a memory address and a length/extent), so I'll need to dynamically size the result string (it might be bigger than the input, or smaller).
In my initial pass, I used malloc() to create room for the first byte/bytes and then subsequently called realloc() whenever I needed to append another byte/handful of bytes. It works, but it feels like this approach will needlessly hammer away at the allocator and likely result in shifting bytes around in memory over and over.
So I made a second pass, which determines how long the result string will be after each individual unit of the transformation (illustrated above with the FOO and BAR cases) and picks a 'preferred allocation size', e.g. 256 bytes. For example, if result_extent is 250 bytes and I'm in the FOO case, I know I need to grow the memory by 12 bytes (newline, tab and 10 bytes from the input string) -- rather than reallocating to 262 bytes, I'd reach for 512, hedging my bet that I'm likely going to continue adding more data (and thus saving myself a few calls into realloc).
On to my question: is this latter thinking sound or is it premature optimization that the compiler/OS is probably already taking care of for me? Other than not wasting memory space, is there an advantage to reallocating memory by a couple bytes, as needed?
I have some rough ideas of what I might expect during a single conversion instance, e.g. a worst-case scenario might be a 2MB input string with a couple hundred bytes of markup, where each markup instance results in 50-100 bytes of data being added to the result string (so, say, 200 reallocs stretching the string by 50-100 bytes, with another 100 reallocations caused by simply copying data from the input string into the result string, aside from the markup).
Any thoughts on the subject would be appreciated. thanks
As you may know, realloc can move your data on each call, which results in an additional copy. In cases like this, I think it is much better to allocate one large buffer that will most probably be sufficient for the whole operation (an upper bound). At the end, you can allocate the exact amount for the result and do a final copy/free. This is better and is not premature optimization at all; IMO, using realloc in this way is what might be considered premature optimization here.
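A minimal sketch of that pattern, reusing names from the question (the "* 3" upper bound is an assumption you would replace with your real worst case):
#include <stdlib.h>
#include <string.h>

/* Allocate a generous upper bound once, convert into it,
   then trim to the exact size with one final copy. */
char *scratch = malloc(extent * 3 + 64); /* assumed worst-case bound */
/* ... run the conversion, writing into scratch and tracking result_extent ... */
char *result = malloc(result_extent);
memcpy(result, scratch, result_extent);
free(scratch);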

Ideal data structure for mapping integers to integers?

I won't go into details, but I'm attempting to implement an algorithm similar to the Boyer-Moore-Horspool algorithm, only using hex color values instead of characters (i.e., there is a much greater range).
Following the example on Wikipedia, I originally had this:
size_t jump_table[0xFFFFFF + 1];
memset(jump_table, default_value, sizeof(jump_table));
However, 0xFFFFFF is obviously a huge number and this quickly causes C to seg-fault (but not stack-overflow, disappointingly).
Basically, what I need is an efficient associative array mapping integers to integers. I was considering using a hash table, but having a malloc'd struct for each entry just seems overkill to me (I also do not need hashes generated, as each key is a unique integer and there can be no duplicate entries).
Does anyone have any alternatives to suggest? Am I being overly pragmatic about this?
Update
For those interested, I ended up using a hash table via the uthash library.
0xffffff is rather too large to put on the stack on most systems, but you absolutely can malloc a buffer of that size (at least on current computers; not so much on a smartphone). Whether or not you should do it for this task is a separate issue.
Edit: Based on the comment, if you expect the common case to have a relatively small number of entries other than the "this color doesn't appear in the input" skip value, you should probably just go ahead and use a hash map (obviously only storing values that actually appear in the input).
(ignore earlier discussion of other data structures, which was based on an incorrect recollection of the algorithm under discussion -- you want to use a hash table)
If the array you were going to make (of size 0xFFFFFF) was going to be sparse you could try making a smaller array to act as a simple hash table, with the size being 0xFFFFFF / N and the hash function being hexValue / N (or hexValue % (0xFFFFFF / N)). You'll have to be creative to handle collisions though.
This is the only way I can foresee getting out of mallocing structs.
You can malloc(3) a block of 0xFFFFFF + 1 size_t elements on the heap (for simplicity) and address it as you would an array.
As for the stack overflow: the program receives a SIGSEGV, which can be the result of a stack overflow, of accessing illegal memory, of writing to a read-only segment, etc. They are all abstracted under the same error message, "Segmentation fault".
But why don't you use a higher-level language like Python that supports associative arrays?
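A sketch of that heap allocation (note that memset only works for byte patterns, so the default is assigned in a loop here; default_skip is a stand-in name):
#include <stdlib.h>

/* The full 2^24-entry jump table on the heap: 128 MiB with 8-byte size_t. */
size_t *jump_table = malloc((0xFFFFFF + 1) * sizeof(size_t));
if (jump_table != NULL)
{
    for (size_t i = 0; i <= 0xFFFFFF; i++)
        jump_table[i] = default_skip;
}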
At possibly the cost of some speed, you could try modifying the algorithm to find only matches that are aligned to some boundary (every three or four symbols), then perform the search at byte level.
You could create a sparse array of sorts which has "pages" like this (this example uses 256 "pages", so the uppermost byte is the page number):
#include <stdlib.h>

int *pages[256];

/* call this first to make sure all of the pages start out NULL! */
void init_pages(void) {
    for (int i = 0; i < 256; ++i) {
        pages[i] = NULL;
    }
}

int get_value(int index) {
    if (pages[index / 0x10000] == NULL) {
        /* calloc zeroes the new page; note sizeof(int), not 1 */
        pages[index / 0x10000] = calloc(0x10000, sizeof(int));
    }
    return pages[index / 0x10000][index % 0x10000];
}

void set_value(int index, int value) {
    if (pages[index / 0x10000] == NULL) {
        /* calloc zeroes the new page; note sizeof(int), not 1 */
        pages[index / 0x10000] = calloc(0x10000, sizeof(int));
    }
    pages[index / 0x10000][index % 0x10000] = value;
}
This allocates a page the first time it is touched, whether by a read or a write.
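Hypothetical usage, showing that only touched pages are ever allocated:
#include <stdio.h>

init_pages();
set_value(0xABCDEF, 42);
printf("%d\n", get_value(0xABCDEF)); /* 42 (page 0xAB was allocated) */
printf("%d\n", get_value(0x000001)); /* 0  (allocates page 0x00 on first read) */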
To avoid a malloc per entry, you can use a hash table where the entries in the table are your structs themselves, assuming they are small. In your case a pair of integers should suffice, with a special value to indicate that a slot in the table is empty.
How many values are there in your output space, i.e. how many different values do you map to for keys in the range 0-0xFFFFFF?
Using randomized universal hashing you can come up with a collision-free hash function whose table is no bigger than twice the number of values in your output space (for a static table).
