I have over 100,000 csv files in the below format:
1,1,5,1,1,1,0,0,6,6,1,1,1,0,1,0,13,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,1,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,2,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,3,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,4,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,5,6,4,1,0,1,0,1,0,4,8,18,20,,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,6,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,7,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,8,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,2,0,0,12,12,1,2,4,1,1,0,13,4,7,8,18,20,21,25,27,29,31,32,,,,,,,,,,,,,,,,
All I need is field 10 and fields 17 onward. Field 10 is a counter that indicates how many
integers are stored starting from field 17, i.e. what I need is:
6,13,4,7,8,18,20
5,4,7,8,18,20
5,4,7,8,18,20
5,13,4,7,8,20
5,13,4,7,8,20
4,4,8,18,20
5,4,7,8,18,20
5,13,4,7,8,20
5,13,4,7,8,20
12,13,4,7,8,18,20,21,25,27,29,31,32
The maximum number of integers I need to read per line is 28. I can easily achieve this with getline in C++. However,
I need to handle over 100,000 such files, and each file may have 300,000~400,000 such lines.
From my previous experience, using getline to read in the data and build a vector<vector<int>> may cause serious performance issues
for me. I tried to use fscanf to achieve this:
while (fscanf(stream,"%*d,%*d,%*d,%*d,%*d,%*d,%*d,%*d,%*d,%d",&MyCounter) == 1){
fscanf(stream,",%*d,%*d,%*d,%*d,%*d,%*d,"); // skip to column 17
for (int i=0;i<MyCounter;i++){
fscanf(stream,"%d,",&MyIntArr[i]); // the trailing comma is consumed if present
}
fscanf(stream,"%*[^\n]"); // discard the rest of the line
}
However, this calls fscanf multiple times per line and may also create a performance issue.
Is there any way to read a variable number of integers in one call to fscanf?
Or do I need to read the line into a string and then strsep/stoi it? Compared to fscanf, which
is better from a performance point of view?
So, there are at most 43 numbers per line. Even at 64 bits, each number is limited to 21 digits, so 1024 bytes is plenty for the max 946 bytes that a line could be (so long as there is no whitespace).
char line[1024];
while (fgets(line, sizeof(line), stdin) != NULL) {
//...
}
A helper function to skip to the desired column.
const char *find_nth_comma(const char *s, int n) {
const char *p = s;
if (p && n) while (*p) {
if (*p == ',') {
if (--n == 0) break;
}
++p;
}
return p;
}
So, inside your loop, skip to column 10 to find the first number of interest, and then skip to column 17 to start reading in the rest of the numbers. The completed loop looks like:
while (fgets(line, sizeof(line), stdin) != NULL) {
const char *p = find_nth_comma(line, 9);
char *end;
assert(p && *p);
MyCounter = strtol(p+1, &end, 10);
assert(*end == ',');
p = find_nth_comma(end+1, 6);
assert(p && *p);
for (int i = 0; i < MyCounter; ++i, p = end) {
MyIntArray[i] = strtol(p+1, &end, 10);
assert((*end == ',') ||
(i == MyCounter-1) &&
(*end == '\0' || isspace(*end & 0xFF)));
}
}
This approach will work with an mmap solution as well. The fgets would be replaced with a function that points to the next line to be processed in the file. find_nth_comma would need to be modified to detect end of line/end of file rather than rely on a NUL-terminated string. strtol would be replaced with a custom function that likewise detects end of line or end of file. (The purpose of such changes is to remove any code that would require copying the data, which is the motivation for an mmap approach.)
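As a rough sketch of what those bounded variants might look like, the code below parses one record from a buffer that is not NUL terminated. The names parse_int_bounded, skip_commas_bounded and parse_record are hypothetical, and the sketch assumes non-negative integers and '\n' line endings.

```c
#include <stddef.h>

/* Hypothetical helper: parse a non-negative decimal integer from [*pp, end).
 * Advances *pp past the digits; returns -1 if no digit is present. */
static long parse_int_bounded(const char **pp, const char *end)
{
    const char *p = *pp;
    long v = 0;
    if (p >= end || *p < '0' || *p > '9')
        return -1;
    while (p < end && *p >= '0' && *p <= '9')
        v = v * 10 + (*p++ - '0');
    *pp = p;
    return v;
}

/* Hypothetical helper: advance past n commas without leaving the line.
 * Returns a pointer just after the n-th comma, or NULL if the line is too short. */
static const char *skip_commas_bounded(const char *p, const char *end, int n)
{
    for (; p < end && *p != '\n'; ++p)
        if (*p == ',' && --n == 0)
            return p + 1;
    return NULL;
}

/* Parse one record starting at line. Stores the values from field 17 onward
 * in out[] and returns the counter from field 10, or -1 on malformed input.
 * On success *next points at the start of the following line (or at end). */
int parse_record(const char *line, const char *end,
                 int *out, int max_out, const char **next)
{
    const char *p = skip_commas_bounded(line, end, 9);  /* start of field 10 */
    long count;
    if (!p) return -1;
    count = parse_int_bounded(&p, end);
    if (count < 0 || count > max_out) return -1;
    p = skip_commas_bounded(p, end, 7);                  /* start of field 17 */
    if (!p) return -1;
    for (long i = 0; i < count; ++i) {
        long v = parse_int_bounded(&p, end);
        if (v < 0) return -1;
        out[i] = (int)v;
        if (p < end && *p == ',') ++p;                   /* step over separator */
    }
    while (p < end && *p != '\n') ++p;                   /* skip padding commas */
    *next = (p < end) ? p + 1 : end;
    return (int)count;
}
```

Because nothing is copied and nothing relies on a terminator, the same function can be pointed at a mapped region or at a buffer filled by fread.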
With parallel processing, it is possible to parse multiple parts of the file simultaneously. But, it is probably sufficient to have different threads process different files, and then collate the results after all files have been processed.
Eventually I used a memory-mapped file to solve my problem (this solution is a
by-product of my previous question, "read in large CSV file performance issue in C++").
Since I work on MS Windows, I use Stephan Brumme's "Portable Memory Mapping C++ Class":
http://create.stephan-brumme.com/portable-memory-mapping/
Since I don't need to deal with files > 2 GB, my implementation is simpler.
For files over 2 GB, visit that website to see how to handle them.
My code is below:
// may tried RandomAccess/SequentialScan
MemoryMapped MemFile(FilterBase.BaseFileName, MemoryMapped::WholeFile, MemoryMapped::RandomAccess);
// point to start of memory file
char* start = (char*)MemFile.getData();
// dummy in my case
char* tmpBuffer = start;
// looping counter
uint64_t i = 0;
// pre-allocate result vector
MyVector.resize(300000);
// Line counter
int LnCnt = 0;
// no. of fields
const int NumOfField=43;
// delimiter count: number of fields + 1, since the leading and trailing delimiters are virtual
const int DelimCnt=NumOfField+1;
//Delimiter position. May use new to allocate at run time
// or even use vector of integer
// This is to store the delimiter position in each line
// since the position is relative to start of file. if file is extremely
// large, may need to change from int to unsigned, long or even unsigned long long
static int DelimPos[DelimCnt];
// Max number of field need to read usually equal to NumOfField, can be smaller, eg in my case, I only need 4 fields
// from first 15 field, in this case, can assign 15 to MaxFieldNeed
int MaxFieldNeed=NumOfField;
// keep track how many comma read each line
int DelimCounter=0;
// define field and line seperator
char FieldDelim=',';
char LineSep='\n';
// 1st field, "virtual Delimiter" position
DelimPos[DelimCounter]=-1;
DelimCounter++;
// loop through the whole memory field, 1 and only once
for (i = 0; i < MemFile.size();i++)
{
// grab all position of delimiter in each line
if ((MemFile[i] == FieldDelim) && (DelimCounter<=MaxFieldNeed)){
DelimPos[DelimCounter] = i;
DelimCounter++;
};
// grab all values when end of line hit
if (MemFile[i] == LineSep) {
// no need to use if (DelimCounter==NumOfField) just assign anyway, waste a little bit
// memory in integer array but gain performance
DelimPos[DelimCounter] = i;
// I know exactly what the format is and what field(s) I want
// a more general approach (as a CSV reader) may put all fields
// into vector of vector of string
// With *EFFORT* one may modify this piece of code so that it can parse
// different format at run time eg similar to:
// fscanf(fstream,"%d,%f....
// also, this piece of code cannot handle complex CSV e.g.
// Peter,28,157CM
// John,26,167CM
// "Mary,Brown",25,150CM
MyVector[LnCnt].StrField = string(start+DelimPos[0] + 1, start+DelimPos[1] - 1);
MyVector[LnCnt].IntField = strtol(start+DelimPos[3] + 1,&tmpBuffer,10);
MyVector[LnCnt].IntField2 = strtol(start+DelimPos[8] + 1,&tmpBuffer,10);
MyVector[LnCnt].FloatField = strtof(start + DelimPos[14] + 1,&tmpBuffer);
// reset Delim counter each line
DelimCounter=0;
// previous line separator is treated as the first delimiter of the next line
DelimPos[DelimCounter] = i;
DelimCounter++;
LnCnt++;
}
}
MyVector.resize(LnCnt);
MyVector.shrink_to_fit();
MemFile.close();
};
I can code whatever I want inside:
if (MemFile[i] == LineSep) {
}
e.g. handle empty fields, perform calculations, etc.
With this piece of code, I handled 2100 files (6.3 GB) in 57 seconds!
(I hard-coded the CSV format in it and only grab 4 values, as in my previous case.)
I will adapt this code to the present problem later.
Thanks to all who helped me with this issue.
In order to maximize performance, you should map the files in memory with mmap or equivalent and parse the file with ad hoc code, typically scanning one character at a time with a pointer, checking for '\n' and/or '\r' for end of record and converting the numbers on the fly for storage to your arrays. The tricky parts are:
How do you allocate or otherwise handle the destination arrays?
Are the fields all numeric? Integral?
Is the last record terminated by a newline? You can easily check this condition after the mmap call. The advantage is that you only need to check for end of file when you encounter a newline sequence.
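A minimal sketch of the mapping step, assuming a POSIX system (on Windows the equivalent calls are CreateFileMapping and MapViewOfFile, not shown here). The function names are hypothetical; the sketch only counts records, and shows the single check needed for a last record that lacks a newline.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count records in a buffer; a final record without '\n' still counts. */
static long count_records(const char *data, size_t len)
{
    long n = 0;
    for (size_t i = 0; i < len; ++i)
        if (data[i] == '\n')
            ++n;
    if (len > 0 && data[len - 1] != '\n')
        ++n;                                   /* unterminated last record */
    return n;
}

/* Map a file read-only and count its records. Returns -1 on error. */
long count_records_mmap(const char *path)
{
    struct stat st;
    void *map;
    long n;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }  /* mmap rejects length 0 */
    map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                                 /* mapping stays valid after close */
    if (map == MAP_FAILED) return -1;
    n = count_records((const char *)map, (size_t)st.st_size);
    munmap(map, (size_t)st.st_size);
    return n;
}
```

In a real parser the loop in count_records would be replaced by the field extraction, converting digits on the fly as the pointer advances.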
Probably the easiest way to read a run-time determined number of integers is to point into the right part of a longer format string. In other words, we can have a format string with 28 %d, specifiers, but point to the nth one before the end of string and pass that pointer as the format string for scanf().
As a simple example, consider accepting 3 integers from a maximum of 6:
"%d,%d,%d,%d,%d,%d,"
^
The arrow shows the string pointer to use as the pattern argument.
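The 3-of-6 case can be sketched on its own as follows. The function name read_n_ints is hypothetical; the sketch relies on the fact that each "%d," specifier is exactly three characters long, and that arguments left over after the format is exhausted are ignored by sscanf.

```c
#include <stdio.h>

/* Read the first n (0..6) comma-terminated integers from s into out[].
 * Returns the number of integers converted, or -1 if n is out of range. */
int read_n_ints(const char *s, int n, int out[6])
{
    static const char fmt[] = "%d,%d,%d,%d,%d,%d,";
    const size_t field = 3;                 /* length of one "%d," specifier */
    if (n < 0 || n > 6) return -1;
    if (n == 0) return 0;
    /* Skip the first (6 - n) specifiers so that only n remain. */
    return sscanf(s, fmt + (6 - n) * field,
                  &out[0], &out[1], &out[2], &out[3], &out[4], &out[5]);
}
```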
Here's a full worked example; its runtime is about 8 seconds for 1 million iterations (10 million lines) when built with gcc -O3. It's slightly complicated by the mechanics to update the input string pointer, which is obviously not necessary when reading from a file stream. I've skipped the checking that nfields <= 28, but that's easily added.
char const *const input =
"1,1,5,1,1,1,0,0,6,6,1,1,1,0,1,0,13,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,\n"
"1,1,5,1,1,1,0,1,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,\n"
"1,1,5,1,1,1,0,2,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,\n"
"1,1,5,1,1,1,0,3,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,\n"
"1,1,5,1,1,1,0,4,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,\n"
"1,1,5,1,1,1,0,5,6,4,1,0,1,0,1,0,4,8,18,20,,,,,,,,,,,,,,,,,,,,,,,,\n"
"1,1,5,1,1,1,0,6,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,\n"
"1,1,5,1,1,1,0,7,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,\n"
"1,1,5,1,1,1,0,8,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,\n"
"1,1,5,1,1,2,0,0,12,12,1,2,4,1,1,0,13,4,7,8,18,20,21,25,27,29,31,32,,,,,,,,,,,,,,,,\n";
#include <stdio.h>
#define SKIP_FIELD "%*[^,],"
#define DECIMAL_FIELD "%d,"
int read()
{
int n; /* bytes read - not needed for file or stdin */
int sum = 0; /* just to make sure results are used */
for (char const *s = input; *s; ) {
int nfields;
int array[28];
int m = sscanf(s,
/* field 0 is missing */
SKIP_FIELD SKIP_FIELD SKIP_FIELD
SKIP_FIELD SKIP_FIELD SKIP_FIELD
SKIP_FIELD SKIP_FIELD SKIP_FIELD
DECIMAL_FIELD /* field 10 */
SKIP_FIELD SKIP_FIELD SKIP_FIELD
SKIP_FIELD SKIP_FIELD SKIP_FIELD
"%n",
&nfields,
&n);
if (m != 1) {
return -1;
}
s += n;
static const char fieldchars[] = DECIMAL_FIELD;
static const size_t fieldsize = sizeof fieldchars - 1; /* ignore terminating null */
static const char *const parse_entries =
DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD
DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD
DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD
DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD
DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD
DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD
DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD
"[^\n] ";
const char *const line_parse = parse_entries + (28-nfields) * fieldsize;
/* now read nfields (max 28) */
m = sscanf(s,
line_parse,
&array[0], &array[1], &array[2], &array[3],
&array[4], &array[5], &array[6], &array[7],
&array[8], &array[9], &array[10], &array[11],
&array[12], &array[13], &array[14], &array[15],
&array[16], &array[17], &array[18], &array[19],
&array[20], &array[21], &array[22], &array[23],
&array[24], &array[25], &array[26], &array[27]);
if (m != nfields) {
return -1;
}
/* advance stream position */
sscanf(s, "%*[^\n] %n", &n); s += n;
/* use the results */
for (int i = 0; i < nfields; ++i) {
sum += array[i];
}
}
return sum;
}
#undef SKIP_FIELD
#undef DECIMAL_FIELD
int main()
{
int sum = 0;
for (int i = 0; i < 1000000; ++i) {
sum += read() * (i&1 ? 1 : - 1); /* alternate add and subtract */
}
return sum != 0;
}
Related
I'm pretty new to C, and I'm trying to write a function that takes a user input RAM size in B, kB, mB, or gB, and determines the address length. My test program is as follows:
int bitLength(char input[6]) {
char nums[4];
char letters[2];
for(int i = 0; i < (strlen(input)-1); i++){
if(isdigit(input[i])){
memmove(&nums[i], &input[i], 1);
} else {
//memmove(&letters[i], &input[i], 1);
}
}
int numsInt = atoi(nums);
int numExponent = log10(numsInt)/log10(2);
printf("%s\n", nums);
printf("%s\n", letters);
printf("%d", numExponent);
return numExponent;
}
This works correctly as it is, but only because I have that one line commented out. When I try to alter the 'letters' character array with that line, it changes the 'nums' character array to '5m2'
My string input is '512mB'
I need the letters to be able to tell if the user input is in B, kB, mB, or gB.
I am confused as to why the commented out line alters the 'nums' array.
Thank you.
In your input 512mB, "mB" is not a digit, so it is supposed to be handled by the commented-out code. When those characters are handled, i is 3 and 4. But because the length of letters is only 2, executing memmove(&letters[i], &input[i], 1); accesses letters[i] out of bounds, which is undefined behaviour; in this case it writes into the memory of the nums array.
To fix it, you have to keep a separate index for letters. Or better, for both nums and letters, since i is the index into input.
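One way this could look, as a sketch rather than a drop-in replacement: each destination array gets its own index, and both are bounds-checked and terminated. The name split_size and the limits MAX_DIGITS and MAX_UNIT are assumptions made for this example.

```c
#include <ctype.h>
#include <stddef.h>

#define MAX_DIGITS 4   /* hypothetical limits chosen for this sketch */
#define MAX_UNIT   2

/* Split e.g. "512mB" into nums = "512" and letters = "mB".
 * Returns 0 on success, -1 if either part would overflow its buffer. */
int split_size(const char *input,
               char nums[MAX_DIGITS + 1], char letters[MAX_UNIT + 1])
{
    size_t n = 0;   /* next free slot in nums */
    size_t l = 0;   /* next free slot in letters */
    for (size_t i = 0; input[i] != '\0'; ++i) {
        if (isdigit((unsigned char)input[i])) {
            if (n == MAX_DIGITS) return -1;
            nums[n++] = input[i];
        } else {
            if (l == MAX_UNIT) return -1;
            letters[l++] = input[i];
        }
    }
    nums[n] = '\0';      /* both results are valid C strings */
    letters[l] = '\0';
    return 0;
}
```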
There are several problems in your code. @MarkSolus has already pointed out that you access letters out of bounds because you are using i as the index, and i can be more than 1 when you do the memmove.
In this answer I'll address some of the other problems.
String size and termination
Strings in C need a zero termination. Therefore an array must be 1 larger than the longest string you expect to store in it. So
char nums[4]; // Can only hold a 3 char string
char letters[2]; // Can only hold a 1 char string
Most likely you want to increase both arrays by 1.
Further, your code never adds the zero-termination. So your strings are invalid.
You need code like:
nums[some_index] = '\0'; // Add zero-termination
Alternatively you can start by initializing the whole array to zero. Like:
char nums[5] = {0};
char letters[3] = {0};
Missing bounds checks
Your loop is a for-loop using strlen as stop-condition. Now what would happen if I gave the input "123456789BBBBBBBB" ? Well, the loop would go on and i would increment to values ..., 5, 6, 7, ... Then you would index the arrays with a value bigger than the array size, i.e. out-of-bounds access (which is real bad).
You need to make sure you never access the array out-of-bounds.
No format check
Now what if I gave an input without any digits, e.g. "HelloWorld"? In this case nothing would be written to nums, so it would be uninitialized when used in atoi(nums). Again - real bad.
Further, there should be a check to make sure that the non-digit input is one of B, kB, mB, or gB.
Performance
This is not that important but... using memmove for copy of a single character is slow. Just assign directly.
memmove(&nums[i], &input[i], 1); ---> nums[i] = input[i];
How to fix
There are many, many different ways to fix the code. Below is a simple solution. It's not the best way but it's done like this to keep the code simple:
#define DIGIT_LEN 4
#define FORMAT_LEN 2
int bitLength(char *input)
{
char nums[DIGIT_LEN + 1] = {0}; // Max allowed number is 9999
char letters[FORMAT_LEN + 1] = {0}; // Allow at max two non-digit chars
if (input == NULL) exit(1); // error - illegal input
if (!isdigit(input[0])) exit(1); // error - input must start with a digit
// parse digits (at max 4 digits)
int i = 0;
while(i < DIGIT_LEN && isdigit(input[i]))
{
nums[i] = input[i];
++i;
}
// parse memory format, i.e. rest of string must be one of B, kB, mB, gB
if ((strcmp(&input[i], "B") != 0) &&
(strcmp(&input[i], "kB") != 0) &&
(strcmp(&input[i], "mB") != 0) &&
(strcmp(&input[i], "gB") != 0))
{
// error - illegal input
exit(1);
}
strcpy(letters, &input[i]);
// Now nums and letter are ready for further processing
...
...
}
Sorry if the question title is a little bit off, I had no idea what to call it just because it is such a peculiar question. What I am aiming to do is decode an input string encoded using a method I will explain in a bit, into a plain English text.
Encoding is done by choosing an integer nRows between 2 and half the length of the message, e.g. a message of length 11 would allow values of nRows in the range 2 to 5. The message is then written down the columns of a grid, one character in each grid cell, nRows in each column, until all message characters have been used. This may result in the last column being only partially filled. The message is then read out row-wise.
For example if the input message was ALL HAIL CAESAR, and the nRows value was 2, encoding would look like this:
A L H I A S R
L A L C E A #
Where # symbolizes a blank cell in the table that doesn't actually exist - I have simply added it to explain the next part :)
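For reference, the encoding step described above can be sketched in a few lines. This is an illustration of the scheme as I understand it, not code from the assignment; the name confab is an assumption. Writing the message down columns of height nRows means row r holds characters r, r + nRows, r + 2*nRows, and so on.

```c
#include <string.h>

/* Encode msg: write it down columns of height nRows, then read row-wise.
 * out must have room for strlen(msg) + 1 characters. */
void confab(const char *msg, int nRows, char *out)
{
    size_t len = strlen(msg);
    size_t n = 0;
    for (size_t r = 0; r < (size_t)nRows; ++r)          /* each grid row */
        for (size_t i = r; i < len; i += (size_t)nRows)  /* every nRows-th char */
            out[n++] = msg[i];
    out[n] = '\0';
}
```

Note that the partially filled last column simply makes the later rows one character shorter; no blank is ever emitted.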
The actual question I have is decoding these phrases. The code I have written thus far works for a few problems, but once the blank characters (#) become many the code begins to break down, as the code obviously does not register them and the algorithm skips past them.
My code is:
/*
* DeConfabulons.c
* A program to Decode for the Confabulons
*
* August 9th 2015
*/
#include <stdio.h>
#include <string.h>
#include <math.h>
//A simple function confab which given input text encoded using
//the Confabulons encoding scheme, and a number of rows, returns
//the originally encoded phrase.
void deconfab(const char inText[], int nRows, char outText[])
{
int count = 0;
int i = 0;
int len = strlen(inText);
float help = ((float)len/(float)nRows);
int z = 0;
while (z < round(help))
{
while (((int)inText[count] > 0) && (count <= len))
{
outText[i] = inText[count];
i ++;
if (count < (int)help)
{
count = count + round((int)help+0.5);
}
else
{
float helper = count + help;
count = round(helper);
}
}
z ++;
count = z;
}
outText[i] = '\0';
}
Which thus far works for the Caesar example I gave earlier. The encoded form of it was ALHI ASRL ALCEA. The main(void) input I have been provided for that problem was:
char buffer[40] = {'\0'};
deconfab("ALHI ASRL ALCEA", 2, buffer);
printf("%s\n", buffer);
Which correctly outputs:
ALL HAIL CAESAR
However when working with cases with extra "blank" characters such as:
char buffer[60] = {0};
char* s = "Two hnvde eo frgqo .uxti hcjeku  mlbparszo y";
deconfab(s, 13, buffer);
printf("%s\n", buffer);
The output should be:
The quick brown fox jumps over the lazy dog.
However my code will return:
Thdefq.the browneorouickmps ov g x julazy
I have concluded that this caused by the blank characters at the end in the last column by running through multiple tests by hand, however no matter what I try the code will not work for every test case. I am allowed to edit the bulk of the function in nearly any way, however any inputs or anything in int main(void) is not allowed to be edited.
I am simply looking for a way to have these blank characters recognized as characters without actually being there (as such) :)
First of all, as far as I see, you don't include those "null" characters in your input - if you did that (I guess) by adding any "dummy" characters, the algorithm would work. The reason it does in the first case is that the 'blank' character is missing at the end of the input - the same place as it's missing in the sentence.
You can try to make a workaround by guessing the length of a message with those dummy characters (I'm not sure how to formulate this) like:
ALHI ASRL ALCEA has 15 characters (15 mod 2 = 1) but ALHI ASRL ALCEA# has 16 characters. Similarly, Two hnvde eo frgqo .uxti hcjeku mlbparszo y has 44 characters (44 mod 13 = 5) so you need quite a lot of the dummy chars to make this work (13-5=8).
There are several ways at this point - you can for instance try to insert the missing blank spaces to align the columns, copy everything into a 2-dimensional array char by char, and then read it line by line, or just determine the (len mod rows) characters from the last column, remove them from the input (requires some fiddling with the classic C string functions so I won't give you the full answer here), read the rest and then append the characters from the last column.
I hope this helps.
The index calculation is muddled.
First of all, this is a purely discrete transformation, so it should be implemented using only integer arithmetic.
The function below does what you need.
void deconfab(const char inText[], int nRows, char outText[])
{
int len = strlen(inText);
int cols = len / nRows;
int rows_with_large_cols = len % nRows;
int count = 0;
int col = 0;
int row = 0;
while (count < len)
{
int idx;
if (row < rows_with_large_cols)
idx = row * (cols + 1) + col;
else
idx = rows_with_large_cols * (cols + 1) +
(row - rows_with_large_cols) * cols + col;
if (idx > len - 1) {
++col;
row = 0;
idx = col;
}
outText[count] = inText[idx];
++row;
++count;
}
outText[count] = '\0';
}
It could be written more elegantly; as it stands it is closer to pseudocode that explains the algorithm.
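One possible tidier formulation, offered as a sketch: compute each row's length and starting offset directly, then walk the grid column by column. The name deconfab_simple is hypothetical; the idea is the same as above, namely that the first len % nRows rows are one character longer than the rest.

```c
#include <string.h>

/* Decode text produced by the column-wise scheme with nRows rows.
 * out must have room for strlen(in) + 1 characters. */
void deconfab_simple(const char *in, int nRows, char *out)
{
    size_t len = strlen(in);
    size_t rows = (size_t)nRows;
    size_t full = len / rows;        /* completely filled columns */
    size_t extra = len % rows;       /* rows holding one extra character */
    size_t total_cols = full + (extra ? 1 : 0);
    size_t n = 0;

    for (size_t c = 0; c < total_cols; ++c) {
        for (size_t r = 0; r < rows; ++r) {
            size_t row_len = full + (r < extra ? 1 : 0);
            /* rows before r: min(r, extra) of them are one char longer */
            size_t row_start = r * full + (r < extra ? r : extra);
            if (c < row_len)
                out[n++] = in[row_start + c];
        }
    }
    out[n] = '\0';
}
```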
You cannot use the standard str* functions if you are going to handle nulls. You must, instead, work with the data directly and use the *read family of functions to get your data.
Final update: my classmate uses fread() to read about one third of the whole file into a string at a time, which avoids running out of memory. He then processes this string and separates it into his data structure. Note that you need to take care of one problem: the last few characters of the string may not form a complete number. Think of a way to detect this situation so that you can join these characters with the first few characters of the next string.
Each number corresponds to a different variable in your data structure. Your data structure should be very simple, because inserting each value into a complex data structure is very slow; most of the time is spent on inserting data into the data structure. Therefore, the fastest way to process the data is to use fread() to read the file into a string and separate this string into different one-dimensional arrays.
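One way to deal with a number that straddles two chunks, sketched under the assumption that the file holds only non-negative integers and separators: keep the partially built value and an "inside a number" flag alive between fread() calls, so nothing has to be copied or rejoined. The function name sum_numbers_chunked is hypothetical, and it merely sums the values to stay short.

```c
#include <stdio.h>
#include <stdlib.h>

/* Sum all non-negative integers in fp, reading chunk (> 0) bytes at a time.
 * Stores how many numbers were seen in *count.
 * Returns the sum, or -1 if the buffer cannot be allocated. */
long long sum_numbers_chunked(FILE *fp, size_t chunk, size_t *count)
{
    char *buf = malloc(chunk);
    long long sum = 0, cur = 0;
    int in_number = 0;            /* state carried across chunk boundaries */
    size_t got;

    *count = 0;
    if (!buf) return -1;
    while ((got = fread(buf, 1, chunk, fp)) > 0) {
        for (size_t i = 0; i < got; ++i) {
            unsigned d = (unsigned)(buf[i] - '0');
            if (d <= 9) {                 /* digit: extend current number */
                cur = cur * 10 + d;
                in_number = 1;
            } else if (in_number) {       /* separator ends the number */
                sum += cur;
                ++*count;
                cur = 0;
                in_number = 0;
            }
        }
    }
    if (in_number) {                      /* file ended without a separator */
        sum += cur;
        ++*count;
    }
    free(buf);
    return sum;
}
```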
For example (just an illustration, not taken from my project), suppose I have a text file like:
72 24 20
22 14 30
23 35 40
42 29 50
19 22 60
18 64 70
.
.
.
Each row is one person's information. The first column is the person's age, the second column is his deposit, and the third is his wife's age.
Then we use fread() to read this text file into a string, and I use strtok() to separate it (you can use a faster way to separate it).
Don't use a data structure (an array of structs) to store the separated data!
I mean, don't do this:
struct person
{
int age;
int deposit;
int wife_age;
};
struct person *my_data_store;
my_data_store=malloc(sizeof(struct person)*length_of_this_array);
//then insert separated data into my_data_store
Don't use data structure to store data!
The fastest way to store your data is like this:
int *age;
int *deposit;
int *wife_age;
age=(int*)malloc(sizeof(int)*age_array_length);
deposit=(int*)malloc(sizeof(int)*deposit_array_length);
wife_age=(int*)malloc(sizeof(int)*wife_array_length);
// the values of age_array_length, deposit_array_length and wife_array_length can be found with `wc -l`; you can run wc -l from your C program
// then you can insert the separated data into these arrays when you use `strtok()` to separate them.
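Filling the three arrays could be sketched like this. It assumes the buffer contains only non-negative integers and separators, and uses a hand-written digit loop instead of strtok(); the name parse_columns is hypothetical.

```c
#include <stddef.h>

/* Parse separator-delimited non-negative integers from buf[0..len) into
 * three parallel arrays, one per column.
 * Returns the number of complete rows stored (at most max_rows). */
size_t parse_columns(const char *buf, size_t len,
                     int *age, int *deposit, int *wife_age, size_t max_rows)
{
    int *cols[3] = { age, deposit, wife_age };
    size_t row = 0, col = 0, i = 0;

    while (i < len && row < max_rows) {
        int v = 0;
        if (buf[i] < '0' || buf[i] > '9') {   /* skip spaces and newlines */
            ++i;
            continue;
        }
        while (i < len && buf[i] >= '0' && buf[i] <= '9')
            v = v * 10 + (buf[i++] - '0');
        cols[col][row] = v;                   /* column-major storage */
        if (++col == 3) {
            col = 0;
            ++row;
        }
    }
    return row;
}
```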
Second update: the best way is to use fread() to read part of the file into a string, then separate this string into your data structure. By the way, don't use any standard library function that converts a string into an integer, such as fscanf() or atoi(); they are too slow. We should write our own function to convert a string into an integer. Not only that, we should also design a simpler data structure to store the data. By the way, my classmate can read this 1.7G file within 7 seconds, so there is a way to do it, and that way is much better than using multithreading. I haven't seen his code; after I see it, I will post a third update to explain how he does it. That will be in two months, after our course has finished.
Update: I used multithreading to solve this problem, and it works! Note: don't use clock() to measure the time when using multiple threads; that is why I thought the execution time had increased.
One thing I want to clarify: scanning the file without storing the values into my structure takes about 20 seconds, and storing the values into my structure takes about 60 seconds. By "time of reading the file" I mean the total: time of reading the file = scanning the file + storing the values into my structure. So, are there any suggestions for storing the values faster? (By the way, I don't have control over the input file; it is generated by our professor. I am trying to use multithreading to solve this problem; if it works, I will tell you the result.)
I have a file, its size is 1.7G.
It looks like:
1 1427826
1 1427827
1 1750238
1 2
2 3
2 4
3 5
3 6
10 7
11 794106
.
.
and so on.
The file has about ten million lines. Now I need to read this file and store these numbers in my data structure within 15 seconds.
I have tried using fread() to read the whole file and then strtok() to separate each number, but it still needs 80 seconds. If I use fscanf(), it is even slower. How do I speed it up? Maybe we cannot get it under 15 seconds, but 80 seconds to read it is too long. How can I read it as fast as possible?
Here is part of my reading code:
int Read_File(FILE *fd,int round)
{
clock_t start_read = clock();
int first,second;
first=0;
second=0;
fseek(fd,0,SEEK_END);
long int fileSize=ftell(fd);
fseek(fd,0,SEEK_SET);
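char * buffer=(char *)malloc(sizeof(char)*(fileSize+1)); // +1 for the terminating NUL that strtok() needs
char *string_first;
long int newFileSize=fread(buffer,1,fileSize,fd);
buffer[newFileSize]='\0';
char *string_second;
string_first=strtok(buffer," \t\n"); // first token; later calls pass NULL
while(string_first!=NULL)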
{
first=atoi(string_first);
string_second=strtok(NULL," \t\n");
second=atoi(string_second);
string_first=strtok(NULL," \t\n");
max_num= first > max_num ? first : max_num ;
max_num= second > max_num ? second : max_num ;
root_level=first/NUM_OF_EACH_LEVEL;
leaf_addr=first%NUM_OF_EACH_LEVEL;
if(root_addr[root_level][leaf_addr].node_value!=first)
{
root_addr[root_level][leaf_addr].node_value=first;
root_addr[root_level][leaf_addr].head=(Neighbor *)malloc(sizeof(Neighbor));
root_addr[root_level][leaf_addr].tail=(Neighbor *)malloc(sizeof(Neighbor));
root_addr[root_level][leaf_addr].g_credit[0]=1;
root_addr[root_level][leaf_addr].head->neighbor_value=second;
root_addr[root_level][leaf_addr].head->next=NULL;
root_addr[root_level][leaf_addr].tail=root_addr[root_level][leaf_addr].head;
root_addr[root_level][leaf_addr].degree=1;
}
else
{
//insert its new neighbor
Neighbor *newNeighbor;
newNeighbor=(Neighbor*)malloc(sizeof(Neighbor));
newNeighbor->neighbor_value=second;
root_addr[root_level][leaf_addr].tail->next=newNeighbor;
root_addr[root_level][leaf_addr].tail=newNeighbor;
root_addr[root_level][leaf_addr].degree++;
}
root_level=second/NUM_OF_EACH_LEVEL;
leaf_addr=second%NUM_OF_EACH_LEVEL;
if(root_addr[root_level][leaf_addr].node_value!=second)
{
root_addr[root_level][leaf_addr].node_value=second;
root_addr[root_level][leaf_addr].head=(Neighbor *)malloc(sizeof(Neighbor));
root_addr[root_level][leaf_addr].tail=(Neighbor *)malloc(sizeof(Neighbor));
root_addr[root_level][leaf_addr].head->neighbor_value=first;
root_addr[root_level][leaf_addr].head->next=NULL;
root_addr[root_level][leaf_addr].tail=root_addr[root_level][leaf_addr].head;
root_addr[root_level][leaf_addr].degree=1;
root_addr[root_level][leaf_addr].g_credit[0]=1;
}
else
{
//insert its new neighbor
Neighbor *newNeighbor;
newNeighbor=(Neighbor*)malloc(sizeof(Neighbor));
newNeighbor->neighbor_value=first;
root_addr[root_level][leaf_addr].tail->next=newNeighbor;
root_addr[root_level][leaf_addr].tail=newNeighbor;
root_addr[root_level][leaf_addr].degree++;
}
}
Some suggestions:
a) Consider converting (or pre-processing) the file into a binary format; with the aim to minimise the file size and also drastically reduce the cost of parsing. I don't know the ranges for your values, but various techniques (e.g. using one bit to tell if the number is small or large and storing the number as either a 7-bit integer or a 31-bit integer) could halve the file IO (and double the speed of reading the file from disk) and slash parsing costs down to almost nothing. Note: For maximum effect you'd modify whatever software created the file in the first place.
b) Reading the entire file into memory before you parse it is a mistake. It doubles the amount of RAM required (and the cost of allocating/freeing) and has disadvantages for CPU caches. Instead read a small amount of the file (e.g. 16 KiB) and process it, then read the next piece and process it, and so on; so that you're constantly reusing the same small buffer memory.
c) Use parallelism for file IO. It shouldn't be hard to read the next piece of the file while you're processing the previous piece of the file (either by using 2 threads or by using asynchronous IO).
d) Pre-allocate memory for the "neighbour" structures and remove most/all malloc() calls from your loop. The best possible case is to use a statically allocated array as a pool - e.g. Neighbor myPool[MAX_NEIGHBORS]; where malloc() can be replaced with &myPool[nextEntry++];. This reduces/removes the overhead of malloc() while also improving cache locality for the data itself.
e) Use parallelism for storing values. For example, you could have multiple threads where the first thread handles all the cases where root_level % NUM_THREADS == 0, the second thread handles all cases where root_level % NUM_THREADS == 1, etc.
With all of the above (assuming a modern 4-core CPU), I think you can get the total time (for reading and storing) down to less than 15 seconds.
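To make suggestion (a) concrete, here is a sketch of the one-flag-bit idea: values below 128 are stored in a single byte, everything else in four bytes with the top bit set. The byte order and the function names encode_u31 and decode_u31 are assumptions for this example, not a standard format.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode v (must be < 2^31) into out. Values below 128 take 1 byte with the
 * top bit clear; larger values take 4 big-endian bytes with the top bit set.
 * Returns the number of bytes written (1 or 4). */
size_t encode_u31(uint32_t v, unsigned char *out)
{
    if (v < 128u) {
        out[0] = (unsigned char)v;
        return 1;
    }
    out[0] = (unsigned char)(0x80u | (v >> 24));
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
    return 4;
}

/* Decode one value from in; returns the number of bytes consumed (1 or 4). */
size_t decode_u31(const unsigned char *in, uint32_t *v)
{
    if ((in[0] & 0x80u) == 0) {          /* flag bit clear: 7-bit value */
        *v = in[0];
        return 1;
    }
    *v = ((uint32_t)(in[0] & 0x7Fu) << 24) | ((uint32_t)in[1] << 16)
       | ((uint32_t)in[2] << 8) | (uint32_t)in[3];
    return 4;
}
```

Decoding needs only a test of one bit and a few shifts per value, which is where the parsing cost savings would come from.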
My suggestion would be to form a processing pipeline and thread it. Reading the file is an I/O bound task and parsing it is CPU bound. They can be done at the same time in parallel.
There are several possibilities. You'll have to experiment.
Exploit what your OS gives you. If Windows, check out overlapped io. This lets your computation proceed with parsing one buffer full of data while the Windows kernel fills another. Then switch buffers and continue. This is related to what @Neal suggested, but has less overhead for buffering. Windows is depositing data directly in your buffer through the DMA channel. No copying. If Linux, check out memory mapped files. Here the OS is using the virtual memory hardware to do more-or-less what Windows does with overlapping.
Code your own integer conversion. This is likely to be a bit faster than making a clib call per integer.
Here's example code. You want to absolutely limit the number of comparisons.
// Process one input buffer.
*end_buf = ' '; // add a sentinel at the end of the buffer
for (char *p = buf; p < end_buf; p++) {
// somewhat unsafe (but fast) reliance on unsigned wrapping
unsigned val = *p - '0';
if (val <= 9) {
// Found start of integer; val already holds its first digit.
for (;;) {
unsigned digit_val = *++p - '0';
if (digit_val > 9) break;
val = 10 * val + digit_val;
}
... do something with val
}
}
Don't call malloc once per record. You should allocate blocks of many structs at a time.
Experiment with buffer sizes.
Crank up compiler optimizations. This is the kind of code that benefits greatly from excellent code generation.
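The advice about allocating many structs at a time could be sketched as a simple block pool. The Neighbor struct here is a simplified stand-in for the one in the question, and POOL_BLOCK, Pool, pool_alloc and pool_free are names invented for this example.

```c
#include <stdlib.h>

typedef struct Neighbor {        /* simplified stand-in for the question's struct */
    int neighbor_value;
    struct Neighbor *next;
} Neighbor;

#define POOL_BLOCK 4096          /* hypothetical: nodes obtained per malloc() call */

typedef struct Block {
    struct Block *prev;          /* previously filled block, or NULL */
    size_t used;                 /* nodes handed out from this block */
    Neighbor nodes[POOL_BLOCK];
} Block;

typedef struct { Block *head; } Pool;

/* Hand out one node, calling malloc() only when the current block is full.
 * Returns NULL if a new block cannot be allocated. */
Neighbor *pool_alloc(Pool *pool)
{
    if (pool->head == NULL || pool->head->used == POOL_BLOCK) {
        Block *b = malloc(sizeof *b);
        if (b == NULL) return NULL;
        b->prev = pool->head;
        b->used = 0;
        pool->head = b;
    }
    return &pool->head->nodes[pool->head->used++];
}

/* Release every block at once; all nodes become invalid. */
void pool_free(Pool *pool)
{
    while (pool->head) {
        Block *prev = pool->head->prev;
        free(pool->head);
        pool->head = prev;
    }
}
```

Consecutive allocations land next to each other in memory, so besides saving malloc() calls this also tends to help cache locality when walking a neighbor list.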
Yes, standard library conversion functions are surprisingly slow.
If portability is not a problem, I'd memory-map the file. Then, something like the following C99 code (untested) could be used to parse the entire memory map:
#include <stdlib.h>
#include <errno.h>
struct pair {
unsigned long key;
unsigned long value;
};
typedef struct {
size_t size; /* Maximum number of items */
size_t used; /* Number of items used */
struct pair item[];
} items;
/* Initial number of items to allocate for */
#ifndef ITEM_ALLOC_SIZE
#define ITEM_ALLOC_SIZE 8388608
#endif
/* Adjustment to new size (parameter is old number of items) */
#ifndef ITEM_REALLOC_SIZE
#define ITEM_REALLOC_SIZE(from) (((from) | 1048575) + 1048577)
#endif
items *parse_items(const void *const data, const size_t length)
{
const unsigned char *ptr = (const unsigned char *)data;
const unsigned char *const end = (const unsigned char *)data + length;
items *result;
size_t size = ITEM_ALLOC_SIZE;
size_t used = 0;
unsigned long val1, val2;
result = malloc(sizeof (items) + size * sizeof (struct pair));
if (!result) {
errno = ENOMEM;
return NULL;
}
while (ptr < end) {
/* Skip newlines and whitespace. */
while (ptr < end && (*ptr == '\0' || *ptr == '\t' ||
*ptr == '\n' || *ptr == '\v' ||
*ptr == '\f' || *ptr == '\r' ||
*ptr == ' '))
ptr++;
/* End of data? */
if (ptr >= end)
break;
/* Parse first number. */
if (*ptr >= '0' && *ptr <= '9')
val1 = *(ptr++) - '0';
else {
free(result);
errno = ECOMM; /* Bad data! */
return NULL;
}
while (ptr < end && *ptr >= '0' && *ptr <= '9') {
const unsigned long old = val1;
val1 = 10UL * val1 + (*(ptr++) - '0');
if (val1 < old) {
free(result);
errno = EDOM; /* Overflow! */
return NULL;
}
}
/* Skip whitespace. */
while (ptr < end && (*ptr == '\t' || *ptr == '\v' ||
*ptr == '\f' || *ptr == ' '))
ptr++;
if (ptr >= end) {
free(result);
errno = ECOMM; /* Bad data! */
return NULL;
}
/* Parse second number. */
if (*ptr >= '0' && *ptr <= '9')
val2 = *(ptr++) - '0';
else {
free(result);
errno = ECOMM; /* Bad data! */
return NULL;
}
while (ptr < end && *ptr >= '0' && *ptr <= '9') {
const unsigned long old = val2;
val2 = 10UL * val2 + (*(ptr++) - '0');
if (val2 < old) {
free(result);
errno = EDOM; /* Overflow! */
return NULL;
}
}
if (ptr < end) {
/* Error unless whitespace or newline. */
if (*ptr != '\0' && *ptr != '\t' && *ptr != '\n' &&
*ptr != '\v' && *ptr != '\f' && *ptr != '\r' &&
*ptr != ' ') {
free(result);
errno = ECOMM; /* Bad data! */
return NULL;
}
/* Skip the rest of this line. */
while (ptr < end && *ptr != '\n' && *ptr != '\r')
ptr++;
}
/* Need to grow result? */
if (used >= size) {
items *const old = result;
size = ITEM_REALLOC_SIZE(used);
result = realloc(result, sizeof (items) + size * sizeof (struct pair));
if (!result) {
free(old);
errno = ENOMEM;
return NULL;
}
}
result->item[used].key = val1;
result->item[used].value = val2;
used++;
}
/* Note: we could reallocate result here,
* if memory use is an issue.
*/
result->size = size;
result->used = used;
errno = 0;
return result;
}
I've used a similar approach to load molecular data for visualization. Such data contains floating-point values, but precision is typically only about seven significant digits, no multiprecision math needed. A custom routine to parse such data beats the standard functions by at least an order of magnitude in speed.
At least the Linux kernel is pretty good at observing memory/file access patterns; using madvise() also helps.
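A minimal sketch of setting up such a map on a POSIX system follows. map_file is a made-up helper name, and it uses posix_madvise() rather than madvise() only because the former is the POSIX-specified variant; the sequential-access hint is an assumption about how the parser walks the data.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only and hint that it will be read front to back.
 * On success returns the mapping and stores its size in *length;
 * returns NULL on failure or for an empty file (hypothetical helper). */
static const void *map_file(const char *path, size_t *length)
{
    struct stat st;
    void *data;
    int fd = open(path, O_RDONLY);

    if (fd == -1)
        return NULL;
    if (fstat(fd, &st) == -1 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }
    data = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (data == MAP_FAILED)
        return NULL;

    /* Advisory only; the return value is deliberately ignored. */
    posix_madvise(data, (size_t)st.st_size, POSIX_MADV_SEQUENTIAL);

    *length = (size_t)st.st_size;
    return data;
}
```

The returned pointer and length can be passed straight to a parser such as parse_items(); release the mapping with munmap() when done.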
If you cannot use a memory map, then the parsing function would be a bit different: it would append to an existing result, and if the final line in the buffer is partial, it would indicate so (and the number of chars not parsed), so that the caller can memmove() the buffer, read more data, and continue parsing. (Use 16-byte aligned addresses for reading new data, to maximize copy speeds. You don't necessarily need to move the unread data to the exact beginning of the buffer, you see; just keep the current position in the buffered data.)
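The buffer-and-carry-over idea could be sketched roughly like this. The names for_each_block, lines_fn, count_newlines and CHUNK are illustrative, the 64 KiB buffer size is an arbitrary starting point, and for simplicity the sketch always moves the leftover bytes to the start of the buffer and ignores the alignment refinement mentioned above.

```c
#include <stdio.h>
#include <string.h>

enum { CHUNK = 1 << 16 };   /* buffer size: an assumption, tune by experiment */

/* Hypothetical callback type: receives a run of complete lines. */
typedef void (*lines_fn)(const char *begin, const char *end, void *ctx);

/* Read `in` chunk by chunk and pass only complete lines to `fn`.
 * A partial line at the end of a chunk is moved to the front of the
 * buffer and completed by the next read.
 * Returns 0 on success, -1 on read error or if one line exceeds CHUNK.
 * Uses a static buffer, so it is not reentrant. */
static int for_each_block(FILE *in, lines_fn fn, void *ctx)
{
    static char buf[CHUNK];
    size_t have = 0;                 /* unparsed bytes at the start of buf */

    for (;;) {
        size_t got = fread(buf + have, 1, sizeof buf - have, in);
        size_t total = have + got;
        size_t done = total;

        if (got == 0) {              /* end of file or error */
            if (ferror(in))
                return -1;
            if (total)
                fn(buf, buf + total, ctx);   /* last line without '\n' */
            return 0;
        }
        /* Back up to just past the last newline in the buffer. */
        while (done > 0 && buf[done - 1] != '\n')
            done--;
        if (done == 0 && total == sizeof buf)
            return -1;               /* a single line does not fit */
        if (done)
            fn(buf, buf + done, ctx);
        have = total - done;
        memmove(buf, buf + done, have);
    }
}

/* Example callback: counts the newline characters it is given. */
static void count_newlines(const char *begin, const char *end, void *ctx)
{
    size_t *n = ctx;
    for (; begin < end; begin++)
        *n += (*begin == '\n');
}
```

A real callback would run the integer parser over [begin, end) instead of counting newlines.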
Questions?
First, what's your disk hardware? A single SATA drive is likely to top out at 100 MB/sec, and probably more like 50-70 MB/sec. If you're already moving data off the drive(s) as fast as you can, all the software tuning you do is going to be wasted.
If your hardware can support faster reads: first, your read pattern (read the whole file into memory once) is the perfect use case for direct IO. Open your file using open( "/file/name", O_RDONLY | O_DIRECT );. Read into page-aligned buffers (see the man page for posix_memalign(); the older valloc() is obsolete) in page-sized chunks. Direct IO makes your data bypass the kernel page cache, whose double buffering is useless when you're reading that much data that fast and not re-reading the same data pages over and over.
If you're running on a true high-performance file system, you can read asynchronously, and likely faster, with lio_listio() or aio_read(). Or you can just use multiple threads to read. In that case use pread(), so you don't have to waste time seeking, and because a seek on an open file affects every thread reading through that file descriptor.
And do not try to read fast into a newly malloc'd chunk of memory: memset() it first, because truly fast disk systems can pump data into the CPU faster than the virtual memory manager can create virtual pages for a process.
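Put together, the buffer setup and positioned reads might look like the following sketch. alloc_read_buffer and read_at are made-up names. The file descriptor is supplied by the caller, so whether it was opened with O_DIRECT is left open here; with O_DIRECT the size and offset would normally also have to be multiples of the device block size, which this sketch does not enforce.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>      /* open(), for callers of these helpers */
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Allocate a page-aligned buffer and touch every page up front, so the
 * page faults happen before the reads start (hypothetical helper). */
static void *alloc_read_buffer(size_t size)
{
    void *buf = NULL;
    long page = sysconf(_SC_PAGESIZE);

    if (page <= 0 || posix_memalign(&buf, (size_t)page, size) != 0)
        return NULL;
    memset(buf, 0, size);
    return buf;
}

/* Read up to `size` bytes at `offset` without touching the shared file
 * position, so several threads can use the same descriptor.
 * Returns the number of bytes read (short at end of file), or -1 on error. */
static ssize_t read_at(int fd, void *buf, size_t size, off_t offset)
{
    size_t total = 0;

    while (total < size) {
        ssize_t n = pread(fd, (char *)buf + total, size - total,
                          offset + (off_t)total);
        if (n < 0)
            return -1;
        if (n == 0)
            break;                  /* end of file */
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

Each reader thread would get its own buffer from alloc_read_buffer() and its own offset range to pass to read_at().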
I've searched around for quite some time, but surprisingly I couldn't find an answer to this:
I want to rewrite a char array starting from [0], but all that happens is that it always appends. Here's my code:
The algorithm is: I have a very long string which I'd like to break into several lines (wherever there is a blank space at the end of a line). Each line shall be saved at its own index of an array (lineContent);
void print_text(char* content, int menu_width, int which_selected, int menu_height, int scroll_pos)
{
int posCounter = 0;
int charCounter = menu_width-10;
int printOutCounter;
char* lineContent[400]; // 400 lines max
short spaceFound;
while (strlen(content) > menu_width) // If string is longer than 1 line
{
//Interesting Part ---------- START
char changeString [strlen(content)];
char printString [menu_width-10];
spaceFound = 0;
charCounter = menu_width-10;
lineContent[posCounter] = malloc(MAXITEMSTR);
while (spaceFound == 0)
{
if (content[charCounter] == ' ')
{
// I guess the error goes between here ...
strncpy(changeString,content,strlen(content));
strncpy(printString,content,menu_width-10);
// ...and here
memmove(&changeString[0], &changeString[charCounter], strlen(content));
content=changeString;
lineContent[posCounter]=printString;
strcat(lineContent[posCounter],"\0");
posCounter++;
spaceFound = 1;
//Interesting Part ---------- END
}
charCounter--;
if (charCounter <= 0)
spaceFound = 1;
}
}
}
As I said, in the end, when checking the contents of lineContent, every entry is the same (the one from the last line).
I think this is because strcpy just appends to the end; therefore I have to clear the array to erase the former line, so that it starts from [0] and not from the last printed letter.
Does anybody have an idea how to do this? Is there a function that overwrites a char array instead of appending to it?
Kind Regards
strcat appends to the end; strcpy overwrites the value stored in the string.
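A small sketch of the difference, with made-up function names. The second helper is relevant because the posted code assigns lineContent[posCounter]=printString, which stores a pointer to the same local array on every pass rather than copying the characters; that would appear to explain why every entry ends up identical. copy_prefix assumes src holds at least n characters.

```c
#include <stdlib.h>
#include <string.h>

/* Show that strcpy restarts at dst[0] while strcat continues at the
 * terminating NUL. dst must have room for at least 16 bytes. */
static void overwrite_then_append(char *dst, size_t cap)
{
    if (cap < 16) {
        if (cap)
            dst[0] = '\0';
        return;
    }
    strcpy(dst, "first line");   /* dst is now "first line"  */
    strcpy(dst, "second");       /* overwrites: "second"      */
    strcat(dst, " line");        /* appends:    "second line" */
}

/* Copy the first n characters of src into its own allocation, so that
 * every stored line has separate storage. Caller frees the result. */
static char *copy_prefix(const char *src, size_t n)
{
    char *line = malloc(n + 1);

    if (!line)
        return NULL;
    memcpy(line, src, n);
    line[n] = '\0';              /* memcpy does not terminate the string */
    return line;
}
```

In the question's loop, something like lineContent[posCounter] = copy_prefix(content, charCounter); would give each line its own storage.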
I have to print 1,000,000 four digit numbers. I used printf for this purpose
for(i=0;i<1000000;i++)
{
printf("%d\n", students[i]);
}
and it turns out to be too slow. Is there a faster way to print them?
You could create an array, fill it with the output data, and then print that array all at once. Or, if memory is a problem, break that array into smaller chunks and print them one by one.
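As a rough sketch of that idea: format everything into one buffer, then hand it to fwrite in a single call. print_all is a made-up name, and the buffer sizing assumes every value is in the range 0..9999, as the question implies; anything else is rejected.

```c
#include <stdio.h>
#include <stdlib.h>

/* Format len numbers, one per line, into a single buffer and write it
 * with one fwrite. Returns 0 on success, -1 on failure.
 * Assumes each value needs at most 4 digits (0..9999). */
static int print_all(FILE *out, const int *students, size_t len)
{
    /* 4 digits + '\n' per number, plus 1 for sprintf's trailing NUL. */
    char *buf = malloc(len * 5 + 1);
    char *p = buf;
    size_t i, bytes;
    int ok;

    if (!buf)
        return -1;
    for (i = 0; i < len; i++) {
        if (students[i] < 0 || students[i] > 9999) {
            free(buf);
            return -1;
        }
        p += sprintf(p, "%d\n", students[i]);
    }
    bytes = (size_t)(p - buf);
    ok = fwrite(buf, 1, bytes, out) == bytes;
    free(buf);
    return ok ? 0 : -1;
}
```

For the question's array this would be called as print_all(stdout, students, 1000000); to limit memory use, call it on slices of the array instead.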
Here is my attempt replacing printf and stdio stream buffering with straightforward special-case code:
#include <limits.h> // CHAR_BIT
#include <stdio.h>
int print_numbers(const char *filename, const unsigned int *input, size_t len) {
enum {
// Maximum digits per number. The input numbers must not be greater
// than this!
# if 1
DIGITS = 4,
# else
// Alternative safe upper bound on the digits per integer
// (log10(2) < 28/93)
DIGITS = (sizeof *input * CHAR_BIT * 28UL + 92) / 93,
# endif
// Maximum lines to be held in the buffer. Tune this to your system,
// though something on the order of 32 kB should be reasonable
LINES = 5000
};
// Write the output in binary to avoid extra processing by the CRT. If necessary
// add the expected "\r\n" line endings or whatever else is required for the
// platform manually.
FILE *file = fopen(filename, "wb");
if(!file)
return EOF;
// Disable automatic file buffering in favor of our own
setbuf(file, NULL);
while(len) {
// Set up a write pointer for a buffer going back-to-front. This
// simplifies the reverse order of digit extraction
char buffer[(DIGITS + 1 /* for the newline */) * LINES];
char *tail = &buffer[sizeof buffer];
char *head = tail;
// Grab the largest set of lines still remaining to be printed which
// will safely fit in our buffer
size_t chunk = len > LINES ? LINES : len;
const unsigned int *input_chunk;
len -= chunk;
input += chunk;
input_chunk = input;
do {
// Convert each number by extracting least-significant digits
// until all have been printed.
unsigned int number = *--input_chunk;
*--head = '\n';
do {
# if 1
char digit = '0' + number % 10;
number /= 10;
# else
// Alternative in case the compiler is unable to merge the
// division/modulo and perform reciprocal multiplication
char digit = '0' + number;
number = number * 0xCCCDUL >> 19;
digit -= number * 10;
# endif
*--head = digit;
} while(number);
} while(--chunk);
// Dump everything written to the present buffer
fwrite(head, tail - head, 1, file);
}
return fclose(file);
}
I fear this won't buy you much more than a fairly small constant factor over your original (by avoiding some printf format parsing, per-character buffering, locale handling, multithreading locks, etc.)
Beyond this you may want to consider processing the input and writing the output on-the-fly instead of reading/processing/writing as separate stages. Of course whether or not this is possible depends entirely on the operation to be performed.
Oh, and don't forget to enable compiler optimizations when building the application. A run through with a profiler couldn't hurt either.