Read large text files in C efficiently

I need to read large text files containing either real or complex data in C. As of right now, I am using the code shown below. These functions are simple to understand and work fine for small sizes. However, I need to read files that are a few GBs in size which is taking a lot of time in comparison to binary files with the same data. (The project switched from text to binary files at some point because of the larger sizes. I have to clean up the mess.)
int io_read_array_real(char ordering, DTYPE *array,
int m, int n, FILE *ifile)
{
int i, j, match;
DTYPE elem;
for (i = 0; i < m; i++)
{
for (j = 0; j < n; j++)
{
match = fscanf(ifile, "%e", &elem);
if (match == 0)
{
printf("An error occurred while parsing the file!\n");
return (-1);
}
if (ordering == 'C')
*(array + RTC(i, j, m)) = elem;
else if (ordering == 'R')
*(array + i * n + j) = elem;
else
return (-1);
}
}
return (0);
}
int io_read_array_complex(char ordering, CDTYPE *array,
int m, int n, FILE *ifile)
{
int i, j, info;
DTYPE zreal, zimag;
for (i = 0; i < m; i++)
{
for (j = 0; j < n; j++)
{
info = fscanf(ifile, " (%e%ej) ", &zreal, &zimag);
if (info != 2)
{
fprintf(stderr, "Input file in wrong format at (%d,%d) info = %d!\n"
"strerror: %s\n",
i, j, info, strerror(errno));
return (-1);
}
if (ordering == 'C')
*(array + RTC(i, j, m)) = zreal + I * zimag;
else if (ordering == 'R')
*(array + i * n + j) = zreal + I * zimag;
else
return (-1);
}
}
return (0);
}
Now I'd like to know if there's any faster way of reading these files. They are guaranteed to be of the form:
real:
1.233e-3 2.231e-1 ...
2.335e-4 8.241e-2 ...
.
.
complex:
(1.233e-3+3.239e-4j) (1.233e-3+3.239e-4j) ...
(7.684e-2+8.269e-5j) (1.233e-3+3.239e-4j) ...
.
.

You might memory-map the entire file. The file content is then directly addressable in memory, and the operating system's virtual memory management handles the memory and paging for you regardless of the size of the file.
You then operate on the file content directly as if it were memory, with no explicit allocation or reallocation.
Windows and POSIX APIs for memory-mapped files differ, but you will find plenty of examples for whatever system you are using.
The advantage here is that the OS manages loading data into the virtual address space in the background, and performance is likely to be determined by the amount of physical memory available. Moreover, it loads and pages large chunks of the file at once, which is far more efficient than reading the file stream 23 bytes at a time.
If you do persist with stream I/O, you would do well at least to reduce the file I/O overhead by reading in larger power-of-two sized blocks, such as 1024 or 4096 bytes.
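As a minimal sketch of the POSIX side of the mapping approach, assuming a Unix-like system: the helper name `count_lines_mapped` is hypothetical, and it only counts newline bytes (that is, rows), which is enough to show the open/fstat/mmap/munmap sequence. A real reader would parse numbers from the same pointer range. Note that the mapped region is not guaranteed to end in a NUL byte, so string functions that expect a terminator need care near the end of the file.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: returns the number of '\n' bytes in the file,
 * or -1 if the file cannot be opened or mapped. */
long count_lines_mapped(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror(path);
        return -1;
    }

    struct stat st;
    if (fstat(fd, &st) != 0) {
        perror(path);
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {          /* mmap rejects a zero-length mapping */
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;
    char *data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (data == MAP_FAILED) {
        perror("mmap");
        return -1;
    }

    /* Walk the mapping as ordinary memory; the OS pages it in on demand. */
    long lines = 0;
    const char *p = data;
    const char *end = data + len;
    while (p < end && (p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
        lines++;
        p++;
    }

    munmap(data, len);
    return lines;
}
```

Calling `count_lines_mapped("data.txt")` on a file with one row per line gives the row count, which can be used to size the destination array before parsing.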

There are small improvements possible:
Move the repetitive if (ordering == 'C') outside the loops and have it select one of two similar blocks to iterate.
Move most of the destination address calculation outside the innermost loop so fscanf() can save directly into the target destination.
But for significant improvements, OP is limited unless code can make assumptions about the format of the text representing floating point values and replace fscanf(). @Eric Postpischil
Example:
If complex data is always (d.dddesd+d.dddesdj) (d: digit, s: sign), it may be faster to craft code.
The following is illustrative. The exact details depend on the precise definition of OP's data format. OP should profile various approaches.
// fscanf(ifile, " (%e%ej) ", &zreal, &zimag);
// (d.dddesdsd.dddesdj)
// 123456789 123456789 1
char buf[21+1+1]; // Input length + \n + \0
buf[19] = 0;
if (fgets(buf, sizeof buf, ifile)) {
    if (buf[19]!=')') Error_out; // trivial error check
    static const double expos[19] = { 1.0e-12, 1.0e-11, ..., 1.0e+6}; // Offset by 3: mantissa digits are read as an integer
    int e = buf[7]=='-' ? -(buf[8]-'0') : buf[8]-'0'; // signed exponent: negative only after '-'
    *dest++ = (buf[1]*1000 + buf[3]*100 + buf[4]*10 + buf[5] - 1111*'0') * expos[e+9];
    ...
Multiple threads? @Martin James's suggestion is a worthy idea to test.
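A less format-dependent middle ground is to keep the text in a buffer and convert with strtod() instead of fscanf(). The following is only a sketch under that assumption: the helper name `parse_complex_token` is hypothetical, it expects tokens of the form (a+bj) or (a-bj), and whether it actually beats fscanf() on a given system should be measured.

```c
#include <stdlib.h>

/* Hypothetical helper: parses one "(a+bj)" token starting at s.
 * On success stores both parts, sets *end just past ')' and returns 1.
 * Returns 0 if the text at s does not match the expected shape. */
int parse_complex_token(const char *s, double *re, double *im, const char **end)
{
    char *stop;

    /* skip separators between tokens */
    while (*s == ' ' || *s == '\t' || *s == '\n' || *s == '\r')
        s++;
    if (*s != '(')
        return 0;
    s++;

    *re = strtod(s, &stop);         /* stops at the sign of the imaginary part */
    if (stop == s)
        return 0;
    s = stop;

    if (*s != '+' && *s != '-')     /* imaginary part must carry an explicit sign */
        return 0;
    *im = strtod(s, &stop);         /* stops at 'j' */
    if (stop == s || stop[0] != 'j' || stop[1] != ')')
        return 0;

    *end = stop + 2;
    return 1;
}
```

Called in a loop over a line read with fgets(), with `*end` fed back in as the next `s`, this consumes one complex value per call and returns 0 when the line is exhausted or malformed.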

Related

decreasing time it takes to run my program in c

I was writing a program that is reading from a file and then storing the data in two tables that are in a table of structure. I am expanding the tables with realloc and the time my program takes to run is ~ 0.7 s.
Can I somehow decrease this time?
typedef struct {
int *node;
int l;
int *waga;
} przejscie_t;
void czytaj(przejscie_t **graf, int vp, int vk, int waga) {
(*graf)[vp].node[(*graf)[vp].l - 1] = vk;
(*graf)[vp].waga[(*graf)[vp].l - 1] = waga;
(*graf)[vp].l++;
}
void wypisz(przejscie_t *graf, int i) {
printf("i=%d l=%d ", i, graf[i].l);
for (int j = 0; j < (graf[i].l - 1); j++) {
printf("vk=%d waga=%d ", graf[i].node[j], graf[i].waga[j]);
}
printf("\n");
}
void init(przejscie_t **graf, int vp, int n) {
*graf = realloc(*graf, (vp + 1) * sizeof(przejscie_t));
if (n == vp || n == -1){
(*graf)[vp].l = 1;
(*graf)[vp].node = malloc((*graf)[vp].l * sizeof(int));
(*graf)[vp].waga = malloc((*graf)[vp].l * sizeof(int));
}
else {
for (int i = n; i <= vp; i++) {
(*graf)[i].l = 1;
(*graf)[i].node = malloc((*graf)[i].l * sizeof(int));
(*graf)[i].waga = malloc((*graf)[i].l * sizeof(int));
}
}
}

Here are some suggestions:
I think you should pre-calculate the required size of your *graf memory instead of reallocating it again and again, for example with a prealloc_graf function.
You will get a substantial time improvement, since reallocating is time-consuming, especially when it must actually move the memory.
This matters most if you are working with big files.
And since you're working with files, pre-calculating should be easy to do.
If your input files range from small to large, you have two choices:
Accept your fate and allow your code to be a little bit less optimized on small files.
Create two init functions: the first one optimized for small files, the other one for bigger files. You will have to run some benchmarks to determine which algorithm is best for each case before implementing it. You could automate that if you have the time and the will to do so.
It is important to check for successful memory allocation before trying to use the memory, because an allocation function can fail.
Finally, some changes for the init function:
void init(przejscie_t **graf, int vp, int n) {
    *graf = realloc(*graf, (vp + 1) * sizeof(przejscie_t));
    // The `if` statement was redundant.
    // Added a ternary operator for ``n == -1``.
    // Alternatively, you could use ``n = (n == -1 ? vp : n)`` right before the loop.
    for (int i = (n == -1 ? vp : n); i <= vp; i++) {
        (*graf)[i].l = 1;
        // (*graf)[X].l is always 1.
        // There is no reason to use (*graf)[X].l * sizeof(int) for malloc.
        (*graf)[i].node = malloc(sizeof(int));
        (*graf)[i].waga = malloc(sizeof(int));
    }
}
I've commented everything that I've changed, but here is a summary:
The if statement was redundant.
The for loop covers all cases, with a ternary operator for n equal to -1.
The code should be easier to understand this way.
The node and waga allocations multiplied by l unnecessarily.
Since l is always equal to 1, there was no need for the additional operation.
This doesn't really change execution time, though, since it is constant.
I would also suggest that your functions that call allocation functions return a boolean indicating whether they succeeded. If an allocation fails, return false to signal that the function failed.
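When the final size cannot be computed up front, a common alternative to growing one element at a time is to grow the capacity geometrically, so that realloc is called only a logarithmic number of times. The sketch below is an assumption-laden illustration, not the asker's code: `edge_list_t`, `edge_list_push` and `edge_list_free` are hypothetical names, and the starting capacity of 8 is arbitrary. It also shows the boolean return suggested above.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical per-vertex edge list with separate length and capacity. */
typedef struct {
    int *node;   /* destination vertices */
    int *waga;   /* edge weights */
    size_t len;  /* entries in use */
    size_t cap;  /* entries allocated */
} edge_list_t;

/* Appends one edge; returns false if memory could not be obtained. */
bool edge_list_push(edge_list_t *e, int vk, int waga)
{
    if (e->len == e->cap) {
        /* double the capacity instead of growing by one element */
        size_t new_cap = e->cap ? e->cap * 2 : 8;

        int *node = realloc(e->node, new_cap * sizeof *node);
        if (node == NULL)
            return false;           /* old buffer is still valid */
        e->node = node;

        int *w = realloc(e->waga, new_cap * sizeof *w);
        if (w == NULL)
            return false;
        e->waga = w;

        e->cap = new_cap;
    }
    e->node[e->len] = vk;
    e->waga[e->len] = waga;
    e->len++;
    return true;
}

/* Releases both arrays and resets the list to empty. */
void edge_list_free(edge_list_t *e)
{
    free(e->node);
    free(e->waga);
    e->node = NULL;
    e->waga = NULL;
    e->len = 0;
    e->cap = 0;
}
```

A list starts as `edge_list_t e = {0};`, and pushing 1000 edges triggers only 8 rounds of reallocation (capacities 8, 16, ..., 1024) rather than 1000.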

Access Binary form of Text saved in memory using an Array

The storage representation of the string or equivalently text from a file, is the ASCII code for each character of the string or text from a file, I have been told that I/O functions like fread and fgets will read a string from disk into memory without conversion. The C compiler always works with the storage representation, so when we "retrieve" a string in C, it's always in binary form.
I need to access this binary form to use in my code (without saving this as a binary file, also not asking to print in binary format).
For example, the text string "AA" is saved in memory as "0100000101000001", I need to access directly, without any conversion (like we do when we print, integer using %s, %d) this binary form "0100000101000001" of "AA" using an integer array, say, D[16] which has elements 0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1. So if I use an index int i, I will get 0 from D[4] for i=0.
Array-index operations like buffer[i] (for example, in the sample code in the below) will extract one character from a string:
FILE *fp = fopen("a.txt", "r");
if (fp == NULL)
return 1;
char buffer[100];
int r = fread(buf, 1, sizeof(buffer), fp);
if (r <= 0)
return 1;
printf("As string: %.*s", r, buffer);
printf("As integers:");
for (i = 0; i < r; i++)
printf(" %d", buffer[i]);
But I would like to have the complete text as an array of 0 and 1, whereas here, buffer[i] contains 8 bits which I cannot access individually each bit, how can I do that?

I have been told that I/O functions like fread and fgets will read a string from disk into memory without conversion.
This is true if the file has been opened as binary, i.e. with "rb". Such streams do not undergo any translation when read into memory, and all stream functions, getc() included, read the contents as they are stored on disk. If your system is Unix-based, there is no difference from "r", but on legacy systems there can be substantial differences: text mode, which is the default, may imply end-of-line conversion, code page translation, end-of-file mitigation... If you want the actual file contents, always use binary mode ("rb").
You should also avoid the char type when dealing with binary representation, because char is signed by default on many architectures, hence inappropriate for byte values, which are usually considered non-negative. Use unsigned char to prevent this issue.(*)
The most common way to display binary contents is using hexadecimal representation, where each byte is output as exactly 2 hex digits.
If you want to output binary representation, there is no standard printf conversion to output base-2 numbers, but you can write a loop to convert the byte to its bit values.
(*) among other historical issues such as non two's complement signed value representations
Here is a modified version:
#include <stdio.h>
int main(void) {
    /* binary mode, as recommended above, so the bytes are read untranslated */
    FILE *fp = fopen("a.txt", "rb");
    if (fp == NULL) {
        perror("a.txt");
        return 1;
    }
    unsigned char buffer[100];
    unsigned char bits[100 * 8];
    int r = fread(buffer, 1, sizeof(buffer), fp);
    if (r <= 0) {
        fprintf(stderr, "empty file\n");
        fclose(fp);
        return 1;
    }
    printf("As a string: %.*s\n\n", r, (char *)buffer);
    int pos;
    pos = printf("As 8-bit integers:");
    for (int i = 0; i < r; i++) {
        if (pos > 72) {
            printf("\n");
            pos = 0;
        }
        pos += printf(" %d", buffer[i]);
    }
    printf("\n\n");
    pos = printf("As hex bytes:");
    for (int i = 0; i < r; i++) {
        if (pos > 72) {
            printf("\n");
            pos = 0;
        }
        pos += printf(" %02X", buffer[i]);
    }
    printf("\n\n");
    pos = printf("Converting to a bit array:");
    /* most significant bit of each byte first */
    for (int i = 0; i < r; i++) {
        for (int j = 8; j-- > 0;) {
            bits[i * 8 + 7 - j] = (buffer[i] >> j) & 1;
        }
    }
    /* output the bit array */
    for (int i = 0; i < r * 8; i++) {
        if (pos > 72) {
            printf("\n    ");
            pos = 4;
        }
        pos += printf("%d", bits[i]);
    }
    printf("\n");
    fclose(fp);
    return 0;
}

Use bit masking to check the value of individual bits. Check out a brief description here: https://www.learn-c.org/en/Bitmasks
Then you can write the result to your array for the corresponding bit.
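A minimal sketch of that masking idea, assuming 8-bit bytes and an ASCII execution character set (the helper name `bytes_to_bits` is hypothetical): each byte is tested against a mask that starts at the most significant bit and shifts right, and the result is stored as 0 or 1 in an int array like the D[16] from the question.

```c
#include <stddef.h>

/* Hypothetical helper: expands n bytes into 8 * n ints, each 0 or 1,
 * most significant bit of each byte first. */
void bytes_to_bits(const unsigned char *bytes, size_t n, int *bits)
{
    for (size_t i = 0; i < n; i++) {
        for (int b = 0; b < 8; b++) {
            unsigned mask = 0x80u >> b;          /* 10000000, 01000000, ... */
            bits[i * 8 + b] = (bytes[i] & mask) != 0;
        }
    }
}
```

For the string "AA" (two bytes of value 0x41), the array is filled with 0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1.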

Transfer results to txt file C

So I'm completely new to programming (I've been learning for 3 days) and I find myself in front of a problem I simply don't know how to resolve.
I want this program to give me every single combination from 0 to a specific number in base 36. That is easy enough when the number is only about 50000 or so. But my goal is to extract actual words (with numbers too), and if I try to get words with 5 characters, the terminal starts overwriting the previous words (not helpful, I want ALL of them).
So I thought I should look for a way to transfer everything into a txt file, and there resides my problem: I don't know how... Sorry for the long text, but I wanted to explain precisely what I'm trying to get. Thanks for the help.
int main() {
int dec, j, i, q, r, k;
char val[80];
printf("Enter a decimal number: ");
scanf("%d", &dec);
for (k = 0; k <= dec; k++) { /*repeat for all possible combinations*/
q = k;
for (i = 1; q != 0; i++) { /*convert decimal number to value for base 36*/
r = q % 36;
if (r < 10)
r = r + 48;
else
r = r + 55;
val[i] = r;
q = q / 36;
}
for (j = i - 1; j > 0; j--) { /*print every single value*/
printf("%c", val[j]);
}
printf(" "); /*add spaces because why not*/
}
return (0);
}

A few observations that might help:
First is type related:
In your declarations you create the following:
int dec, j, i, q, r, k;
char val[80];
Then later you make the assignment:
val[i] = r; // assigning an int to a char, dangerous
While r is of type int, with a range (typically) of -2,147,483,648 to 2,147,483,647,
val[i] is of type char, with a range (typically) of only -128 to 127.
In this program r only ever holds values from 48 to 90, so nothing is lost, but in general assigning an int to a char can overflow and lead to unexpected results.
The most immediate solution is to use the same type for both variables. Pick either int or char, but not both.
The other has already been addressed correctly by @Nasim: use the file version of printf(), fprintf(). The prototype for fprintf() is:
int fprintf( FILE *stream, const char *format [, argument ]...);
Usage example:
FILE *fp = fopen("./somefile.txt", "w"); // open (create) the file for writing
if (fp) // if the FILE was successfully opened, write to it...
{
    // some of your previous code...
    for (j = i - 1; j > 0; j--)
    { /* print every single value */
        fprintf(fp, "%c", val[j]); // if val is typed as char
        // OR
        // fprintf(fp, "%d", val[j]); // if val is typed as int
    }
    fclose(fp);
}
Lastly, there is a wide range of methods to perform base conversion, some more complicated than others.

Create a file; then you can use fprintf() instead of printf(). The only difference between the two is that you need to specify the file as the first argument.
FILE *myFile = fopen("file.txt", "w"); // "w" erases previous content, "a" appends
if (myFile == NULL) { printf("Error in opening file\n"); exit(1); }
fprintf(myFile, "some integer : %d\n", myInteger); // same as printf, but specify the file pointer as the first argument
fclose(myFile); // don't forget to close the file
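Putting the pieces together, a sketch of writing every base-36 value from 0 up to a limit into a file could look like the following. The names `to_base36` and `write_base36_range` are hypothetical, the digit set 0-9 then A-Z matches the question's encoding, and unlike the question's loop this version prints "0" for zero. It assumes unsigned long is at most 64 bits wide, so 13 digits always suffice.

```c
#include <stdio.h>

/* Hypothetical helper: writes n in base 36 into out as a NUL-terminated
 * string. out must hold at least 14 bytes (13 digits plus terminator). */
void to_base36(unsigned long n, char *out)
{
    static const char digits[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    char tmp[14];
    int len = 0;

    /* produce digits least significant first */
    do {
        tmp[len++] = digits[n % 36];
        n /= 36;
    } while (n != 0);

    /* reverse into the output buffer */
    for (int i = 0; i < len; i++)
        out[i] = tmp[len - 1 - i];
    out[len] = '\0';
}

/* Writes 0..limit in base 36, space separated, to the file at path.
 * Returns 0 on success, -1 if the file cannot be opened or closed. */
int write_base36_range(const char *path, unsigned long limit)
{
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;

    char text[14];
    for (unsigned long k = 0; k <= limit; k++) {
        to_base36(k, text);
        fprintf(fp, "%s ", text);
    }
    return fclose(fp) == 0 ? 0 : -1;
}
```

Because the output goes to a file rather than the terminal, nothing is lost to scrollback limits, however many values are generated.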

How to read a 10 GB txt file consisting of tab-separated double data line by line in C

I have a txt file consisting of tab-separated data with type double. The data file is over 10 GB, so I just wish to read the data line by line and then do some processing. In particular, the data is laid out as a matrix with, say, 1001 columns and millions of rows. Below is just a fake sample to show the layout.
10.2 30.4 42.9 ... 3232.000 23232.45
...
...
7.234 824.23232 ... 4009.23 230.01
...
For each line I'd like to store the first 1000 values in an array, and the last value in a separate variable. I am new to C, so it would be nice if you could kindly point out major steps.
Update:
Thanks for all valuable suggestions and solutions. I just figured out one simple example where I just read a 3-by-4 matrix row by row from a txt file. For each row, the first 3 elements are stored in x, and the last element is stored in vector y. So x is a n-by-p matrix with n=p=3, y is a 1-by-3 vector.
Below is my data file and my code.
Data file:
1.112272 -0.345324 0.608056 0.641006
-0.358203 0.300349 -1.113812 -0.321359
0.155588 2.081781 0.038588 -0.562489
My code:
#include<math.h>
#include <stdlib.h>
#include<stdio.h>
#include <string.h>
#define n 3
#define p 3
void main() {
FILE *fpt;
fpt = fopen("./data_temp.txt", "r");
char line[n*(p+1)*sizeof(double)];
char *token;
double *x;
x = malloc(n*p*sizeof(double));
double y[n];
int index = 0;
int xind = 0;
int yind = 0;
while(fgets(line, sizeof(line), fpt)) {
//printf("%d\n", sizeof(line));
//printf("%s\n", line);
token = strtok(line, "\t");
while(token != NULL) {
printf("%s\n", token);
if((index+1) % (p+1) == 0) { // the last element in each line;
yind = (index + 1) / (p+1) - 1; // get index for y vector;
sscanf(token, "%lf", &(y[yind]));
} else {
sscanf(token, "%lf", &(x[xind]));
xind++;
}
//sscanf(token, "%lf", &(x[index]));
index++;
token = strtok(NULL, "\t");
}
}
int i = 0;
int j = 0;
puts("Print x matrix:");
for(i = 0; i < n*p; i++) {
printf("%f\n", x[i]);
}
printf("\n");
puts("Print y vector:");
for(j = 0; j < n; j++) {
printf("%f\t", y[j]);
}
printf("\n");
free(x);
fclose(fpt);
}
With above, hopefully things will work if I replace data_temp.txt with my raw 10 GB data file (of course change values of n,p, and some other code wherever necessary.)
I have additional questions that I wish if you could help me.
I first initialized char line[] as char line[(p+1)*sizeof(double)] (note: not multiplying by n). But the line cannot be read completely. How could I assign memory JUST for one single line? What's the length? I assume it's (p+1)*sizeof(double) since there are (p+1) doubles in each line. Should I also assign memory for \t and \n? If so, how?
Does the code look reasonable to you? How could I make it more efficient since this code will be executed over millions of rows?
If I don't know the number of columns or rows in the raw 10 GB file, how could I quickly count rows and columns?
Again I am new to C, any comments are very appreciated. Thanks a lot!

1st way
Read file in chunks into preallocated buffer using fread.
2nd way
Map the file into your process memory space using mmap, then move a pointer over the file.

3rd way
Since your file is delimited by lines, open the file with fopen, use setvbuf or similar to set a buffer size greater than about 10 lines or so, then read the file line-by-line using fgets.
To potentially read the file even faster, use open with O_DIRECT (assuming Linux), then use fdopen to get a FILE * for the open file, then use setvbuf to set a page-aligned buffer. Doing that will allow you to bypass the kernel page cache - if your system's implementation works successfully using direct IO that way. (There can be many restrictions to direct IO)
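A sketch of the fopen/setvbuf/fgets route, under stated assumptions: `open_buffered` and `read_row` are hypothetical names, the line buffer is sized in characters of text (not sizeof(double) per value, since the text of a number has no fixed relation to 8 bytes), and the caller chooses a size large enough for the longest line. Each value is converted with strtod(), which skips the leading tab itself.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: opens a text file with a large stdio buffer. */
FILE *open_buffered(const char *path, size_t buf_size)
{
    FILE *in = fopen(path, "r");
    /* setvbuf must be called before any other operation on the stream */
    if (in != NULL && setvbuf(in, NULL, _IOFBF, buf_size) != 0) {
        fclose(in);
        return NULL;
    }
    return in;
}

/* Hypothetical helper: reads one line, stores the first p values in x and
 * the last value in *y. Returns 1 on success, 0 at end of file,
 * -1 on a malformed line or a line longer than the buffer. */
int read_row(FILE *in, char *line, size_t line_size,
             double *x, size_t p, double *y)
{
    if (fgets(line, (int)line_size, in) == NULL)
        return 0;

    char *cursor = line;
    char *stop;
    for (size_t i = 0; i <= p; i++) {
        double v = strtod(cursor, &stop);
        if (stop == cursor)
            return -1;                  /* fewer than p + 1 values */
        if (i < p)
            x[i] = v;
        else
            *y = v;
        cursor = stop;
    }

    while (*cursor == ' ' || *cursor == '\t' || *cursor == '\r')
        cursor++;
    if (*cursor == '\n')
        return 1;
    /* no newline: acceptable only for a final line that ends at EOF */
    if (*cursor == '\0' && feof(in))
        return 1;
    return -1;                          /* extra data or truncated line */
}
```

For 1001 columns, a line buffer of, say, 1001 * 32 characters would be a generous guess for typical decimal output; that figure is an assumption for illustration and should be checked against the real data. Only one row is held in memory at a time, so the 10 GB file size does not matter.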

Something to get you started: Reading 1 line
#define COLUMN (1000+1)
double data[COLUMN];
for (int i = 0; i < COLUMN; i++) {
    char delim = '\n';
    int cnt = fscanf(in_stream, "%lf%c", &data[i], &delim);
    if (cnt < 1) {
        if (cnt == EOF && i == 0) return 0; // None read, OK as end of file
        puts("Missing or bad data");
        return -1; // problem
    }
    if (delim != '\t') {
        // If tab not found, should be at end of line
        if (delim == '\n' && i == COLUMN-1) {
            return COLUMN; // Success
        }
        puts("Bad delimiter");
        return -1;
    }
}
puts("Extra data");
return -1;

longest common subsequence: why is this wrong?

int lcs(char * A, char * B)
{
int m = strlen(A);
int n = strlen(B);
int *X = malloc(m * sizeof(int));
int *Y = malloc(n * sizeof(int));
int i;
int j;
for (i = m; i >= 0; i--)
{
for (j = n; j >= 0; j--)
{
if (A[i] == '\0' || B[j] == '\0')
X[j] = 0;
else if (A[i] == B[j])
X[j] = 1 + Y[j+1];
else
X[j] = max(Y[j], X[j+1]);
}
Y = X;
}
return X[0];
}
This works, but valgrind complains loudly about invalid reads. How was I messing up the memory? Sorry, I always fail at C memory allocation.

The issue here is with the size of your table. Note that you're allocating space as
int *X = malloc(m * sizeof(int));
int *Y = malloc(n * sizeof(int));
However, you are using indices 0 ... m and 0 ... n, which means that there are m + 1 slots necessary in X and n + 1 slots necessary in Y.
Try changing this to read
int *X = malloc((m + 1) * sizeof(int));
int *Y = malloc((n + 1) * sizeof(int));
Hope this helps!

There is a series of issues. First, as templatetypedef says, you're under-allocating.
Then, as paddy says, you're not freeing your malloc'd memory. If you need the Y = X line, you'll need to store the original malloc'd addresses in another set of variables so you can call free on them, and you must read the result before freeing.
...mallocs...
int *original_y = Y;
int *original_x = X;
...body of code...
int result = X[0]; // read the result before the memory is released
free(original_y);
free(original_x);
return result;
But this doesn't address your new question, which is: why doesn't the code actually work?
I admit I can't follow your code (without a lot more study), but I can propose an algorithm that will work and be far more understandable. Note that it finds the longest common substring (a contiguous run), which is not the same thing as the longest common subsequence. This may be somewhat pseudocode and not particularly efficient, but getting it correct is the first step. I've listed some optimizations later.
int lcs(char * A, char * B)
{
    int length_a = strlen(A);
    int length_b = strlen(B);
    // length of the longest common substring found so far
    int longest_found_length = 0;
    // go through each substring of one of the strings (doesn't matter which, you could pick the shorter one if you want)
    char * candidate_substring = malloc(sizeof(char) * length_a + 1);
    for (int start_position = 0; start_position < length_a; start_position++) {
        for (int end_position = start_position; end_position < length_a; end_position++) {
            int substring_length = end_position - start_position + 1;
            // make a null-terminated copy of the substring to look for in the other string
            strncpy(candidate_substring, &(A[start_position]), substring_length);
            candidate_substring[substring_length] = '\0'; // strncpy does not add the terminator here
            // only a longer match may replace the current best
            if (substring_length > longest_found_length && strstr(B, candidate_substring) != NULL) {
                longest_found_length = substring_length;
            }
        }
    }
    free(candidate_substring);
    return longest_found_length;
}
Some different optimizations you could do:
// if this can't be longer, then don't bother checking it. You can play games with the for loop to not have this happen, but it's more complicated.
if (substring_length <= longest_found_length) {
    continue;
}
and
// there are more optimizations you could do to this, but don't check
// the substring if it's longer than b, since b can't contain it.
if (substring_length > length_b) {
    continue;
}
and
if (strstr(B, candidate_substring) != NULL) {
    if (substring_length > longest_found_length) {
        longest_found_length = substring_length;
    }
} else {
    // if nothing contains the shorter string, then nothing can contain the longer one, so skip checking longer strings with the same starting character
    break; // skip out of inner loop to next iteration of start_position
}
Instead of copying each candidate substring to a new string, you could do a character swap with the end_position + 1 and a NUL character. Then, after looking for that substring in b, swap the original character at end_position+1 back in. This would be much faster, but complicates the implementation a little.

NOTE: I don't normally write two answers, and if you feel that it is tacky, feel free to comment on this one and not vote it up. This answer is a more optimized solution, but I wanted to give the most straightforward one I could think of first and then put this in another answer to not confuse the two. Basically they are for different audiences.
The key to solving this problem efficiently is to not throw away information you have about shorter common substrings when looking for longer ones. Naively, you check each substring against the other one, but if you know that "AB" matches in "ABC", and your next character is C, don't check to see if "ABC" is in "ABC", just check that the spot after "AB" is a "C".
For each character in A, you have to check up to all the letters in B, but because we stop looking through B once a longer substring is no longer possible, it greatly limits the number of checks. Each time you get a longer match up front, you eliminate checks on the back-end, because it will no longer be a longer substring.
For example, if A and B are both long, but contain no common letters, each letter in A will be compared against each letter in B for a runtime of A*B.
For a sequence where there are a lot of matches, but the match length isn't a large fraction of the length of the shorter string, you have A * B combinations to check against the shorter of the two strings (A or B), leading to either A*B*A or A*B*B, which is basically O(n^3) time for similar-length strings. I really thought the optimizations in this solution would do better than n^3 even though there are triple-nested for loops, but as best I can tell they do not.
I'm thinking about this some more, though. Either the substrings being found are NOT a significant fraction of the length of the strings, in which case the optimizations don't do much, but the comparisons for each combination of A*B don't scale with A or B and drop out to be constants -- OR -- they are a significant fraction of A and B and it directly divides against the A*B combinations that have to be compared.
I just may ask this in a question.
int lcs(char * A, char * B)
{
    int length_a = strlen(A);
    int length_b = strlen(B);
    // length of the longest common substring found so far
    int longest_length_found = 0;
    // for each character in one string (doesn't matter which), look for incrementally larger strings in the other
    for (int a_index = 0; a_index < length_a - longest_length_found; a_index++) {
        for (int b_index = 0; b_index < length_b - longest_length_found; b_index++) {
            // offset into each string until end of string or non-matching character is found
            for (int offset = 0; A[a_index+offset] != '\0' && B[b_index+offset] != '\0' && A[a_index+offset] == B[b_index+offset]; offset++) {
                // offset is zero-based, so the matched length is offset + 1
                longest_length_found = longest_length_found > offset + 1 ? longest_length_found : offset + 1;
            }
        }
    }
    return longest_length_found;
}

In addition to what templatetypedef said, some things to think about:
Why aren't X and Y the same size?
Why are you doing Y = X? That's an assignment of pointers. Did you perhaps mean memcpy(Y, X, (n+1)*sizeof(int))?
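To make both points concrete, here is a sketch of the same two-row dynamic program with both rows sized n + 1 and the rows swapped rather than aliased. It is one possible correction under those assumptions, not the asker's final code: the name `lcs_length` and the -1 error return are choices made for this illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Returns the length of the longest common subsequence of A and B,
 * or -1 if memory allocation fails. */
int lcs_length(const char *A, const char *B)
{
    size_t m = strlen(A);
    size_t n = strlen(B);

    /* Both rows index B, so both need n + 1 entries.
     * calloc zeroes them, which covers the end-of-string base cases. */
    int *next = calloc(n + 1, sizeof *next);   /* row i + 1 */
    int *cur = calloc(n + 1, sizeof *cur);     /* row i */
    if (next == NULL || cur == NULL) {
        free(next);
        free(cur);
        return -1;
    }

    for (size_t i = m; i-- > 0;) {
        cur[n] = 0;
        for (size_t j = n; j-- > 0;) {
            if (A[i] == B[j])
                cur[j] = 1 + next[j + 1];
            else
                cur[j] = next[j] > cur[j + 1] ? next[j] : cur[j + 1];
        }
        /* swap the rows; assigning Y = X would make both names
         * refer to the same memory and leak the other block */
        int *tmp = next;
        next = cur;
        cur = tmp;
    }

    int result = next[0];   /* after the last swap, next holds row 0 */
    free(next);
    free(cur);
    return result;
}
```

Swapping pointers costs nothing per row, whereas the memcpy alternative copies n + 1 ints each time; both keep the two rows distinct, which is what the original Y = X assignment failed to do.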
