How to read a file and simultaneously fill an array - c

In a C program I'm writing I have to read values from a text file and put them in an array for later use.
I don't like my code (snippet shown below) because it uses two while loops: the first counts the number of values, then I create an array that big, and finally I read the file again, filling the array.
Also, in the first loop I use a variable x only because fscanf() requires one; I never use it later in the code, and I'd like to avoid it altogether if possible.
int x, n = 0, sum = 0, i = 0;
FILE *fp = fopen("data.txt", "r");
while (fscanf(fp, "%d\n", &x) != EOF) {
    n++;
}
rewind(fp);
int v[n];
while (fscanf(fp, "%d\n", &v[i]) != EOF) {
    sum += v[i];
    i++;
}
So, any advice on how I can improve this code? I figured I could kinda "fix" it by declaring an array "big enough" at the beginning and filling it as needed, but I don't know in advance how many values I have to work with, so I dropped that idea.

This is one scenario where dynamic memory allocation comes in handy. You can follow the general procedure described below:
1. Define a pointer.
2. Open the file with fopen() and read the first element with fscanf(). Take care of error checking, too.
3. If the read is successful, allocate memory dynamically with malloc() and copy the value.
4. Read the next element.
4.1. If the read is successful, re-allocate the memory with realloc(), one element larger, and copy the last read value into the newly allocated element.
4.2. If the read fails, check for EOF and stop reading.
5. Continue with step 4.
Also, please keep in mind that the memory you allocate dynamically needs to be free()d, too.
As a note, referring to the comment by szczurcio, this is not an optimized approach, because you have to re-allocate memory on every successful read. To minimize the impact of dynamic memory allocation, you can decide on a threshold capacity to allocate up front and then, whenever it is exhausted, double the previous capacity. That way allocation happens in chunks, and the allocation overhead in each read cycle is avoided.
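A minimal sketch of that doubling strategy, assuming the same data.txt as in the question (the starting capacity of 16 is an arbitrary choice):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *fp = fopen("data.txt", "r");
    if (fp == NULL)
        return 1;

    size_t capacity = 16;                /* arbitrary starting threshold */
    size_t n = 0;
    int *v = malloc(capacity * sizeof *v);
    if (v == NULL) { fclose(fp); return 1; }

    int value, sum = 0;
    while (fscanf(fp, "%d", &value) == 1) {
        if (n == capacity) {             /* exhausted: double the capacity */
            int *tmp = realloc(v, 2 * capacity * sizeof *v);
            if (tmp == NULL) { free(v); fclose(fp); return 1; }
            v = tmp;
            capacity *= 2;
        }
        v[n++] = value;
        sum += value;
    }

    printf("read %zu values, sum = %d\n", n, sum);
    free(v);                             /* dynamically allocated memory must be freed */
    fclose(fp);
    return 0;
}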

Minor changes to your code: note that I've changed v to an int* and first count the newlines in the file. I then allocate the right amount of memory for the array, rewind the file, and let your loop read through it again...
int x, n = 0, sum = 0;
int c;
int *v;
int i = 0;
FILE *fp = fopen("data.txt", "r");
while ((c = fgetc(fp)) != EOF)   /* c must be an int, not a char, so EOF is detected */
    if (c == '\n')
        ++i;
rewind(fp);
v = malloc(i * sizeof(int));
i = 0;
while (fscanf(fp, "%d\n", &v[i]) != EOF)
{
    sum += v[i];
    i++;
}

As said by Sourav, dynamic memory allocation is definitely the way to go.
That said, you can also change the data structure to one that doesn't require a priori knowledge of N. If you only need sequential access to the values and don't really need random access, a linked list is an option (a minimal sketch follows below). You could also use binary trees, hash tables and so on; it depends on what you want to do with the data.
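For instance, a sketch of the linked-list option (the names are illustrative; error handling kept simple):

#include <stdlib.h>

struct node {
    int value;
    struct node *next;
};

/* Prepend a value; returns the new head. The list ends up in
   reverse order, which doesn't matter if you only need the sum. */
struct node *push(struct node *head, int value)
{
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return head;   /* real code should report the failure */
    n->value = value;
    n->next = head;
    return n;
}

You would then call head = push(head, num) for each value read, instead of indexing into an array.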
P.S.: Sorry, I'd have posted this as a comment, but I don't have the reputation.

This is the typical scenario in which you would like to know the size of the file before creating the array; or, better said, the number of lines in the file.
I'm going to suggest something radically different. Since this is a text file, the smallest number will occupy two chars (smallest in a "text" sense): one for the digit and one for the \n (though the \n can be one or two bytes; that's OS-dependent).
We can know the size of the file: after fopen'ing it, you can, through ftell, find out how many bytes it holds. If you divide that number by 2, you get an upper bound on the number of lines in the file. So you can create the array with that size and then record the number of positions actually occupied.
FILE * fp = fopen( "data.txt", "rt" );

/* Get file size */
fseek( fp, 0, SEEK_END );
long size = ftell( fp );
fseek( fp, 0, SEEK_SET );

/* Create data */
long MaxNumbers = size / 2;
int * data = (int *) malloc( sizeof( int ) * MaxNumbers );
long lastPos = 0;

/* Read file */
int * next = data;
while( fscanf(fp, "%d\n", next) != EOF ) {
    ++next;
}
lastPos = next - data;   /* pointer subtraction already yields a count of ints */

/* Close the file */
fclose( fp );
Once you have the data loaded in data, you know the real number of items, so you can copy it to another array of the exact size (maybe through memcpy()), or stay with this array. If you want to change the array:
int * v = (int *) malloc( sizeof( int ) * lastPos );
memcpy( v, data, sizeof( int ) * lastPos );
free( data );
Note: this code is a simple demo, and it does not check for NULL's after calling malloc(), while a real program should.
This code does not waste memory or computation time expanding the array when the data does not fit. However, it a) creates an array at the beginning that is potentially bigger than needed, and b) if you want an array of the exact size, you will temporarily have twice the needed space allocated. We are trading memory for performance, and sometimes that is not a good idea for our environment (e.g., an embedded system).
A big improvement to this strategy would be to control the format of the input file. If you dedicate the same space to each number (say there are always three positions, so a 3 is stored as 003), and you know the maximum number (in order to know how many positions each number needs), then the count becomes completely accurate, and you don't need to copy the data to another array afterwards. With this change, this strategy is simply the best one I can imagine.
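As a sketch of that idea, assuming a hypothetical fixed width of three digits plus a one-byte \n (so every line is exactly four bytes):

#define FIELD_WIDTH 3

long exact_count(FILE *fp)
{
    fseek( fp, 0, SEEK_END );
    long size = ftell( fp );
    fseek( fp, 0, SEEK_SET );
    return size / (FIELD_WIDTH + 1);   /* digits plus '\n' per line */
}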
Hope this helps.

Related

Can I avoid a loop for writing the same value in a continuous subset of an array?

I have a program where I repeat a succession of methods to reproduce time evolution. One of the things I have to do is to write the same value to a long contiguous subset of elements of a very large array. Knowing which elements they are and which value I want, is there any way other than a loop that sets these values one by one?
EDIT: To be clear, I want to avoid this:
double arr[10000000];
int i;
for (i = 0; i < 100000; ++i)
    arr[i] = 1;
by just one single call, if that is possible. Can you assign to a part of an array the values from another array of the same size? Maybe I could keep in memory a second array arr2[1000000] with all elements set to 1 and then do something like copying the memory of arr2 onto the first 100,000 elements of arr?
I have a somewhat tongue-in-cheek and non-portable possibility for you to consider. If you tailored your buffer to a size that is a power of 2, you could seed the buffer with a single double, then use memcpy to copy successively larger chunks of the buffer until the buffer is full.
So first you copy the first 8 bytes over the next 8 bytes...(so now you have 2 doubles)
...then you copy the first 16 bytes over the next 16 bytes...(so now you have 4 doubles)
...then you copy the first 32 bytes over the next 32 bytes...(so now you have 8 doubles)
...and so on.
It's plain to see that we won't actually call memcpy all that many times, and if the implementation of memcpy is sufficiently faster than a simple loop we'll see a benefit.
Try building and running this and tell me how it performs on your machine. It's a very scrappy proof of concept...
#include <string.h>
#include <time.h>
#include <stdio.h>

/* Sizes as macros so the static buffer below has a constant size in C
   (a const int is not a constant expression in C, unlike C++). */
#define TEST_REPS   500
#define BUFFER_SIZE 16777216   /* 2**24 doubles, 128MB */

void loop_buffer_init(double* buffer, int buflen, double val)
{
    for (int i = 0; i < buflen; i++)
    {
        buffer[i] = val;
    }
}

void memcpy_buffer_init(double* buffer, int buflen, double val)
{
    buffer[0] = val;
    int half_buf_size = buflen * sizeof(double) / 2;
    for (int i = sizeof(double); i <= half_buf_size; i += i)
    {
        memcpy((unsigned char *)buffer + i, buffer, i);
    }
}

void check_success(double* buffer, int buflen, double expected_val)
{
    for (int i = 0; i < buflen; i++)
    {
        if (buffer[i] != expected_val)
        {
            printf("But your whacky loop failed horribly.\n");
            break;
        }
    }
}

int main(void)
{
    static double buffer[BUFFER_SIZE];
    time_t start_time;
    time(&start_time);
    printf("Normal loop starting...\n");
    for (int reps = 0; reps < TEST_REPS; reps++)
    {
        loop_buffer_init(buffer, BUFFER_SIZE, 1.0);
    }
    time_t end_time;
    time(&end_time);
    printf("Normal loop finishing after %.f seconds\n",
           difftime(end_time, start_time));

    time(&start_time);
    printf("Whacky loop starting...\n");
    for (int reps = 0; reps < TEST_REPS; reps++)
    {
        memcpy_buffer_init(buffer, BUFFER_SIZE, 2.5);
    }
    time(&end_time);
    printf("Whacky loop finishing after %.f seconds\n",
           difftime(end_time, start_time));

    check_success(buffer, BUFFER_SIZE, 2.5);
}
On my machine, the results were:
Normal loop starting...
Normal loop finishing after 21 seconds
Whacky loop starting...
Whacky loop finishing after 9 seconds
To work with a buffer that was less than a perfect power of 2 in size, just go as far as you can with the increasing powers of 2 and then fill out the remainder in one final memcpy.
(Edit: before anyone mentions it, of course this is pointless with a static double (might as well initialize it at compile time) but it'll work just as well with a nice fresh stretch of memory requested at runtime.)
It looks like this solution is very sensitive to your cache size or other hardware optimizations. On my old (circa 2009) laptop the memcpy solution is as slow or slower than the simple loop, until the buffer size drops below 1MB. Below 1MB or so the memcpy solution returns to being twice as fast.
I have a program where I repeat a succession of methods to reproduce time evolution. One of the things I have to do is to write the same value to a long contiguous subset of elements of a very large array. Knowing which elements they are and which value I want, is there any way other than a loop that sets these values one by one?
In principle, you can initialize an array however you like without using a loop. If that array has static duration then that initialization might in fact be extremely efficient, as the initial value is stored in the executable image in one way or another.
Otherwise, you have a few options:
if the array elements are of a character type then you can use memset(). Very likely this involves a loop internally, but you won't have one literally in your own code.
if the representation of the value you want to set has all bytes equal, such as is the case for typical representations of 0 in any arithmetic type, then memset() is again a possibility.
as you suggested, if you have another array with suitable contents then you can copy some or all of it into the target array. For this you would use memcpy(), unless there is a chance that the source and destination could overlap, in which case you would want memmove().
more generally, you may be able to read in the data from some external source, such as a file (e.g. via fread()). Don't count on any I/O-based solution to be performant, however.
you can write an analog of memset() that is specific to the data type of the array. Such a function would likely need to use a loop of some form internally, but you could avoid such a loop in the caller.
you can write a macro that expands to the needed loop. This can be type-generic, so you don't need different versions for different data types. It uses a loop, but the loop would not appear literally in your source code at the point of use (a sketch appears below).
If you know in advance how many elements you want to set, then in principle, you could write that many assignment statements without looping. But I cannot imagine why you would want so badly to avoid looping that you would resort to this for a large number of elements.
All of those except the last actually do loop, however -- they just avoid cluttering your code with a loop construct at the point where you want to set the array elements. Some of them may also be clearer and more immediately understandable to human readers.
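As a sketch of the macro option mentioned in the list above (ARRAY_FILL is a made-up name; the loop exists only inside the macro expansion, not at the call site):

#include <stddef.h>

/* Type-generic fill: the macro expansion supplies the loop,
   so the caller never writes one. */
#define ARRAY_FILL(arr, count, val)                                 \
    do {                                                            \
        size_t fill_i_;                                             \
        for (fill_i_ = 0; fill_i_ < (size_t)(count); fill_i_++)     \
            (arr)[fill_i_] = (val);                                 \
    } while (0)

/* usage: double arr[10000000]; ARRAY_FILL(arr, 100000, 1.0); */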

How to make this sorting program in C much faster for large input sets

This sort code fails for very large input files because it takes too long to finish.
rewind(ptr);
j = 0;
while ((fread(&temp, sizeof(temp), 1, ptr) == 1) && (j != lines - 1))  /* read object by object */
{
    i = j + 1;
    while (fread(&temp1, sizeof(temp), 1, ptr) == 1)  /* read next object, to compare previous object with next object */
    {
        if (temp.key > temp1.key)  /* compare key value of object */
        {
            temp2 = temp;  /* if you don't want to change records and just want to change keys, use three statements: temp2.key = temp.key; ... */
            temp = temp1;
            temp1 = temp2;
            fseek(ptr, j * sizeof(temp), 0);       /* move stream to overwrite */
            fwrite(&temp, sizeof(temp), 1, ptr);   /* you can avoid the swap above by changing &temp to &temp1 */
            fseek(ptr, i * sizeof(temp), 0);       /* move stream to overwrite */
            fwrite(&temp1, sizeof(temp), 1, ptr);  /* you can avoid the swap above by changing &temp1 to &temp */
        }
        i++;
    }
    j++;
    fseek(ptr, j * sizeof(temp), 0);
}
Any idea on how to make this C code much faster? Also, would using qsort() (predefined in C) be much faster, and how should it be applied to the above code?
You asked the question Sorting based on key from a file and were given various answers about how to sort in memory. You added a supplemental question as an answer, and then created this question instead (which was correct).
Your code here is basically a disk-based bubble sort, with O(N²) complexity, and poor time performance because it is manipulating file buffers and disk. A bubble sort is a bad choice at the best of times — simple, yes, but slow.
The basic ways to speed up sorting programs are:
If possible, read all the data into memory, sort in memory, and write the result out.
If it won't all fit into memory, read as much into memory as possible, sort it, and write the sorted data to a temporary file. Repeat as often as necessary to sort all the data. Then merge the temporary files into one file. If the data set is truly astronomical (or the memory truly minuscule), you may have to create intermediate merge files. These days, though, you have to be sorting many hundreds of gigabytes for that to be an issue at all, even on a 32-bit computer.
Make sure you choose a good sorting algorithm. Quick sort with appropriate pivot selection is very good. You could look up 'introsort' too.
You'll find example in-memory sorting code in the answers to the cross-referenced question (your original question). If you choose to write your own sort, you can consider whether to base the interface on the standard C qsort() function. If you write a Quick Sort, you should look at Quicksort — Choosing the pivot where the answers have copious references.
You'll find example merging code in the answer to Merging multiple sorted files into one file. The merging code out-performs the system sort program in its merge mode, which is intriguing since it is not highly polished code (but it is reasonably workmanlike).
You could look at the external sort program described in Software Tools, though it is a bit esoteric in that it is written in 'RatFor' or Rational Fortran. The design, though, is readily transferable to other languages.
Yes, by all means, use qsort(). Use it either as SpiderPig suggests, by reading the whole file into memory, or as the in-memory sort for runs that do fit into memory in preparation for a merge sort. Don't worry about the worst-case performance. A decent implementation will take the median of (first, last, middle) to get fast sorting for the already-sorted and reverse-order pathological cases, plus better average performance in the random case.
This all-in-memory example shows you how to use qsort:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef struct record_tag
{
    int key;
    char data[12];
} record_type, *record_ptr;

void create_file(const char *filename, int n)
{
    record_type buf;
    int i;
    FILE *fptr = fopen(filename, "wb");
    for (i = 0; i < n; ++i)
    {
        buf.key = rand();
        snprintf(buf.data, sizeof buf.data, "%d", buf.key);
        fwrite(&buf, sizeof buf, 1, fptr);
    }
    fclose(fptr);
}

/* Key comparison function used by qsort(): */
int compare_records(const void *x, const void *y)
{
    const record_ptr a = (const record_ptr)x;
    const record_ptr b = (const record_ptr)y;
    return (a->key > b->key) - (a->key < b->key);
}

/* Read an input file of (record_type) records, sort by key field,
   and write to the output file */
void sort_file(const char *ifname, const char *ofname)
{
    const size_t MAXREC = 10000;
    int n;
    FILE *ifile, *ofile;
    record_ptr buffer;

    ifile = fopen(ifname, "rb");
    buffer = (record_ptr) malloc(MAXREC * sizeof *buffer);
    n = fread(buffer, sizeof *buffer, MAXREC, ifile);
    fclose(ifile);

    qsort(buffer, n, sizeof *buffer, compare_records);

    ofile = fopen(ofname, "wb");
    fwrite(buffer, sizeof *buffer, n, ofile);
    fclose(ofile);
}

void show_file(const char *fname)
{
    record_type buf;
    int n = 0;
    FILE *fptr = fopen(fname, "rb");
    while (1 == fread(&buf, sizeof buf, 1, fptr))
    {
        printf("%9d : %-12s\n", buf.key, buf.data);
        ++n;
    }
    printf("%d records read", n);
}

int main(void)
{
    srand(time(NULL));
    create_file("test.dat", 99);
    sort_file("test.dat", "test.out");
    show_file("test.out");
    return 0;
}
Notice the compare_records function. The qsort() function needs a comparison function that accepts void pointers, so those pointers must be cast to the correct type. Then the pattern:
(left > right) - (left < right)
...will return 1 if the left argument is greater, 0 if they are equal or -1 if the right argument is greater.
This could be improved. First, there is absolutely no error checking. That's not sensible in production code. Second, you could examine the input file to get the file size instead of guessing that it's less than some MAXxxx value. One way to do that is to use ftell. (Follow the link for a file size example.) Then, use that value to allocate a single buffer just big enough to qsort the data.
If there is not enough room (if the malloc returns NULL) then you can fall back on sorting chunks (with qsort, as in the snippet) that do fit into memory, writing them to separate temporary files, and then merging them into a single output file. That's more complicated, and rarely done since there are sort/merge utility programs designed specifically for sorting large files.

how to determine the number of numbers in a text file in C

I'm reading a text file of numbers and I want to compute their sum. How can I determine the number of numbers in the text file? (My text file consists of one line.)
This is the code I have written. How do I determine the number of numbers in the text file, so I can put it in place of "number of numbers" in the second line of the code?
int main()
{
    FILE *file = fopen("numbers.txt", "r");
    int integers[number of numbers];
    int i = 0;
    int j = 0;
    int num;
    while (fscanf(file, "%d", &num) > 0) {
        integers[i] = num;
        printf("%d", integers[i]);
        printf("\n");
        i++;
    }
    int sum = 0;
    for (j = 0; j < sizeof(integers)/sizeof(int); j++)
    {
        sum = sum + integers[j];
    }
    printf("%d", sum);
    printf("\n");
    fclose(file);
    return 0;
}
If you want to do this, there are three possible solutions:
make integers quite large (say 10000 elements) and say "Too many numbers" if there are more than the "quite large" number.
Read the file twice, count the number of numbers the first time, second time store them.
Use dynamic allocation: start with a small size and, when it is reached, use realloc to allocate a larger array, until all numbers have been read.
However, in your particular case, what you are doing can be done without at all storing the numbers. So, the whole integers array is completely unnecessary.
Just do:
sum += num;
in the first loop.
First, figure out if you actually need to save every number. Quite frequently it is possible to do simple data processing by computing some intermediate result without needing to keep every input. For example, it is possible to compute the mean and standard deviation of an input set without keeping the input dataset.
In your specific example, you can print every number as it is read, then accumulate them into sum, without having to keep all of them.
If you decide you really need to keep every number, then you have two options:
Read through the file once to count the number of numbers, then allocate the array, then fseek back to the beginning to read all of them.
Allocate an initial array, then use realloc to progressively increase its size (in this case, make sure to increase the size by a fixed factor when needed, rather than just increasing the size by one).
If you don't need individual numbers but only the sum of all of them, what you should do is just add them together at the same time as you read them:
int sum = 0;
int num;
while(fscanf(file, "%d", &num) > 0) {
sum += num;
printf("%d",num);
printf("\n");
}
On the other hand, if you really need to keep every single number, you can do it different ways.
You could first read the file while counting the numbers, then seek to the beginning, allocate the needed memory and read again saving each number.
You can ask for some memory at first, and when you run out (you are going to have to keep track of how many free spaces you have) ask for more memory (realloc), and keep doing that until you are finished.
You can use a linked list instead of an array, if you didn't need random access.
Edit:
If you also need the average, and thus the total count of numbers read, just declare an int n = 0; and inside the loop do ++n; so you have it at the end.
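A minimal sketch, reusing the file handle from the question's code:

int n = 0, num, sum = 0;
while (fscanf(file, "%d", &num) > 0) {
    sum += num;
    ++n;
}
if (n > 0)
    printf("average = %f\n", (double)sum / n);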

Best approach to continuously scan for a string in a streaming buffer

I have a situation where my function continuously receives data of various lengths. The data can be anything. I want to find the best way to hunt for a particular string in this data. The solution will require buffering previous data somehow, but I cannot wrap my head around the problem.
Here is an example of the problem:
DATA IN -> [\x00\x00\x01\x23B][][LABLABLABLABLA\x01TO][KEN][BLA\x01]...
if every [...] represents a data chunk and [] represents a data chunk with no items, what is the best way to scan for the string TOKEN?
UPDATE:
I realised the question is a bit more complex: the [] are not separators, I just use them to describe the chunk structure in the example above. Also, TOKEN is not a static string per se; it is of variable length. I think the best way is to read line by line, but then the question is how to split a streaming buffer of variable length into lines.
The simplest way to search for TOKEN is:
try to match "TOKEN" starting from position 0 in the stream
try to match "TOKEN" starting from position 1 in the stream
etc
So all you need to buffer is a number of bytes from the stream equal to the length of "TOKEN" (5 bytes, or actually 4 will do). At each position try to match "TOKEN", which might require waiting until you have at least 5 bytes read into your buffer. If the match fails, rewind to where you started matching, plus one. Since you never rewind more than the length of the string you're searching for (minus one) that's all the buffer you really need.
The technical issue then is how to maintain your 5 bytes of buffered data as you read continuously from the stream. One way is a so-called "circular buffer". Another way, especially if the token is small, is to use a larger buffer and, whenever you get too near the end, copy the bytes you need back to the beginning and start again (a sketch of this variant follows below).
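A sketch of that second variant, assuming a fixed token and an illustrative buffer size (matches that span chunk boundaries are caught by the bytes kept at the front):

#include <stdio.h>
#include <string.h>

#define TOKEN     "TOKEN"
#define TOKEN_LEN 5
#define BUF_SIZE  4096    /* illustrative; anything >= TOKEN_LEN works */

static char window[BUF_SIZE];
static size_t used = 0;

/* Call once per received chunk; prints a line for every match. */
void feed(const char *chunk, size_t len)
{
    while (len > 0) {
        /* append as much of the chunk as fits */
        size_t space = BUF_SIZE - used;
        size_t take = len < space ? len : space;
        memcpy(window + used, chunk, take);
        used += take;
        chunk += take;
        len -= take;

        /* scan every complete position in the window */
        size_t i;
        for (i = 0; i + TOKEN_LEN <= used; i++) {
            if (memcmp(window + i, TOKEN, TOKEN_LEN) == 0)
                puts("match");
        }

        /* keep the last TOKEN_LEN - 1 bytes and continue */
        if (used >= TOKEN_LEN) {
            memmove(window, window + used - (TOKEN_LEN - 1), TOKEN_LEN - 1);
            used = TOKEN_LEN - 1;
        }
    }
}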
If your function is a callback, called once for each new chunk of data, then you will need to maintain some state from one call to the next to allow for a match that spans two chunks. If you're lucky then your callback API includes a "user data pointer", and you can set that to point to whatever struct you like that includes the buffer. If not, you'll need global or thread-local variables.
If the stream has a high data rate then you might want to think about speeding things up, with the KMP algorithm or otherwise.
Sorry, I voted to delete my previous answer, as my understanding of the question was not correct. I didn't read carefully enough and thought that the [] were token delimiters.
For your problem I'd recommend building a small state machine based on a simple counter:
For every character you do something like the following pseudo code:
if (received_character == token[pos]) {
    ++pos;
    if (pos >= token_length) {
        token_received = 1;
    }
}
else {
    pos = 0;   /* start over; note that a plain reset can miss matches whose
                  prefix overlaps a failed attempt (e.g. finding "aab" in
                  "aaab") -- the KMP algorithm mentioned above handles that */
}
This takes a minimum of processor cycles and also a minimum of memory, as you don't need to buffer anything except the chunk just received.
If the needle is contained within memory, it can be assumed that you can allocate an equally-sized object to read into (e.g. char input_array[needle_size];).
To start the search, fill that object with bytes from your file (e.g. size_t sz = fread(input_array, 1, needle_size, input_file);) and attempt a match (e.g. if (sz == needle_size && memcmp(input_array, needle, needle_size) == 0) { /* matched */ }).
If the match fails, or you want to continue searching after a successful match, advance the position forward by one byte (e.g. memmove(input_array, input_array + 1, needle_size - 1); input_array[needle_size - 1] = fgetc(input_file);) and try again.
A concern was raised in the comments that this idea copies too many bytes around. While I don't believe that concern has significant merit (no evidence of a significant cost was offered), the copies can be avoided by using a circular array: we insert new characters at pos % needle_size and compare the regions before and after that boundary as though they were the tail and head of the needle, respectively. For example:
void find_match(FILE *input_file, char const *needle, size_t needle_size) {
    char input_array[needle_size];
    setvbuf(input_file, NULL, _IOFBF, BUFSIZ);  /* must precede any other I/O on the stream */
    size_t sz = fread(input_array, 1, needle_size, input_file);
    if (sz != needle_size) {
        // No matches possible
        return;
    }
    unsigned long long pos = 0;
    for (;;) {
        size_t cursor = pos % needle_size;
        int tail_compare = memcmp(input_array, needle + needle_size - cursor, cursor),
            head_compare = memcmp(input_array + cursor, needle, needle_size - cursor);
        if (head_compare == 0 && tail_compare == 0) {
            printf("Match found at offset %llu\n", pos);
        }
        int c = fgetc(input_file);
        if (c == EOF) {
            break;
        }
        input_array[cursor] = c;
        pos++;
    }
}
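A hypothetical driver (the file name is a placeholder):

#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("stream.dat", "rb");   /* "stream.dat" is an invented name */
    if (fp == NULL)
        return 1;
    find_match(fp, "TOKEN", 5);
    fclose(fp);
    return 0;
}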

Measuring cache size in C

I have a function as follow:
int doSomething(long numLoop, long arraySize){
    int * buffer;
    buffer = (int*) malloc (arraySize * sizeof(int));
    long k;
    int i;
    for (i = 0; i < arraySize; i++)
        buffer[i] = 2;   /* write to make sure memory is allocated */
    /* start reading from cache */
    for (k = 0; k < numLoop; k++){
        int i;
        int temp;
        for (i = 0; i < arraySize; i++)
            temp = buffer[i];
    }
}
What it does is declare an array and read it from beginning to end. The purpose is to see the effect of the cache.
What I expect to see is: when I call doSomething(10000,1000), arraySize is small, so the whole array is stored in the cache. After that I call doSomething(100,100000), whose arraySize is bigger than the cache. As a result, the 2nd function call should take longer than the 1st one, since it involves actual memory accesses because the whole array cannot be stored in the cache.
However, it seems that the 2nd operation takes approximately the same time as the 1st one. So what's wrong here? I tried compiling with -O0 and it doesn't solve the problem.
Thank you.
Update 1: this is the code with random access, and it seems to work; the access time with the large array is ~15 s while the small array takes ~3 s:
int doSomething(long numLoop, int a, long arraySize){
    int * buffer;
    buffer = (int*) malloc (arraySize * sizeof(int));
    long k;
    int i;
    for (i = 0; i < arraySize; i++)
        buffer[i] = 2;   /* write to make sure memory is allocated */
    /* start reading from cache */
    for (k = 0; k < numLoop; k++){
        int temp;
        for (i = 0; i < arraySize; i++){
            long randnum = rand();   /* max is 32767 */
            randnum = (randnum << 16) | rand();
            if (randnum < 0) randnum = -randnum;
            randnum %= arraySize;
            temp = buffer[randnum];
        }
    }
}
You are accessing the array in sequence,
    for (i = 0; i < arraySize; i++)
        temp = buffer[i];
so the part you are accessing will always be in the cache, since that pattern is trivial to predict. To see a cache effect, you must access the array in a less predictable order, for example by generating (pseudo)random indices, so that you jump between the front and the back of the array.
In addition to the other answers: your code accesses the memory sequentially. Let's assume that the cache line is 32 bytes. That means you probably get a cache miss only on every 8th access. So, when picking a random index, you should make it at least 32 bytes away from the previous one.
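For instance, in the question's inner loop the random index could be rounded down to a cache-line boundary (8 ints per 32-byte line, per the assumption above), so that two different indices never land on the same line:

long randnum = rand();               /* max is 32767 */
randnum = (randnum << 16) | rand();
if (randnum < 0) randnum = -randnum;
randnum %= arraySize;
randnum &= ~7L;                      /* round down to a multiple of 8 ints = 32 bytes */
temp = buffer[randnum];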
In order to measure the effect across multiple calls, you must use the same buffer (with the expectation that the first time through you are loading the cache, and the next time you are using it). In your case, you are allocating a new buffer for every call. (Additionally, you are never freeing your allocation.)
