Search Binary File for a Pattern - c

I need to search for a binary pattern in a binary file. How can I do it?
I tried converting the file and the pattern to strings and using strstr(), but it's not working.
(The pattern is also a binary file.)
This is what I tried:
void isinfected(FILE *file, FILE *sign, char filename[], char filepath[])
{
char* fil,* vir;
int filelen, signlen;
fseek(file, 0, SEEK_END);
fseek(sign, 0, SEEK_END);
filelen = ftell(file);
signlen = ftell(sign);
fil = (char *)malloc(sizeof(char) * filelen);
if (!fil)
{
printf("unseccesful malloc!\n");
}
vir = (char *)malloc(sizeof(char) * signlen);
if (!vir)
{
printf("unseccesful malloc!\n");
}
fseek(file, 0, SEEK_CUR);
fseek(sign, 0, SEEK_CUR);
fread(fil, 1, filelen, file);
fread(vir, 1, signlen, sign);
if (strstr(vir, fil) != NULL)
log(filename, "infected",filepath );
else
log(filename, "not infected", filepath);
free(vir);
free(fil);
}

For any binary handling you should never use one of the strXX functions, because these only (and exclusively) work on C-style zero terminated strings. Your code is failing because the strXX functions cannot look beyond the first binary 0 they encounter.
As your basic idea with strstr appears correct (and only fails because it works on zero terminated strings only), you can replace it with memmem, which does the same on arbitrary data. Since memmem is a GNU C extension (see also Is there a particular reason for memmem being a GNU extension?), it may not be available on your system and you need to write code that does the same thing.
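To see the zero-byte problem in isolation, here is a minimal sketch. The array demo_data and both function names are made up for illustration; they are not part of the question's code.

```c
#include <string.h>

/* Six bytes of "binary" data with a zero byte in the middle: A B 00 C D 00 */
static const char demo_data[] = { 'A', 'B', '\0', 'C', 'D', '\0' };

/* strstr() treats the first zero byte as the end of the haystack,
   so it only ever searches the two bytes "AB". */
int found_by_strstr(void)
{
    return strstr(demo_data, "CD") != NULL;
}

/* A length-based comparison looks at all six bytes. */
int found_by_memcmp(void)
{
    for (size_t i = 0; i + 2 <= sizeof demo_data; i++)
        if (memcmp(demo_data + i, "CD", 2) == 0)
            return 1;
    return 0;
}
```

Here found_by_strstr() reports no match although the bytes C D are present at offset 3, while found_by_memcmp() finds them.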
For a very basic implementation of memmem you can use memchr to scan for the first binary character, followed by memcmp if it found something:
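void *my_memmem(const void *big, size_t big_len, const void *little, size_t little_len)
{
    const unsigned char *haystack = big;
    const unsigned char *needle = little;
    const unsigned char *iterator = haystack;
    const unsigned char *last;
    if (little_len == 0)
        return (void *)big; /* an empty pattern matches at the start */
    if (big_len < little_len)
        return NULL;
    /* last position at which a complete match can still start */
    last = haystack + (big_len - little_len);
    while (iterator <= last)
    {
        /* find the next occurrence of the pattern's first byte */
        iterator = memchr(iterator, needle[0], (size_t)(last - iterator) + 1);
        if (iterator == NULL)
            return NULL;
        /* safe: at least little_len bytes remain from this position */
        if (!memcmp(iterator, needle, little_len))
            return (void *)iterator;
        iterator++;
    }
    return NULL;
}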
There are better implementations possible, but unless memmem is an important function in your program, it'll do the job just fine.

The basic idea is to check if vir matches the beginning of fil. If it doesn't, then you check again, starting at the second byte of fil, and repeating until you find a match or until you've reached the end of fil. (This is essentially what a simple implementation of strstr does, except that strstr treats 0 bytes as a special case.)
int i;
for (i = 0; i <= filelen - signlen; ++i) { // <= so a match ending at the last byte is found
    if (memcmp(vir, fil + i, signlen) == 0) {
        return true; // vir was found in fil
    }
}
return false; // vir is not in fil
This is the "brute force" approach. It can get very slow if your files are long. There are advanced searching algorithms that can potentially make this much faster, but this is a good starting point.
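Two further details in the question's code would defeat any search: fseek(file, 0, SEEK_CUR) does not move the position back from the end of the file, so the fread calls read nothing, and strstr(vir, fil) looks for the file inside the signature rather than the other way round. A minimal sketch of both steps, loading and searching, might look like this. The names read_all and contains_bytes are hypothetical helpers, and the streams are assumed to be opened in binary mode ("rb").

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: read a whole open binary stream into a malloc'd
   buffer. Returns NULL on failure; on success *len holds the byte count. */
unsigned char *read_all(FILE *fp, size_t *len)
{
    if (fseek(fp, 0, SEEK_END) != 0)
        return NULL;
    long size = ftell(fp);
    if (size < 0)
        return NULL;
    rewind(fp);                               /* back to the start before reading */
    unsigned char *buf = malloc(size > 0 ? (size_t)size : 1);
    if (buf == NULL)
        return NULL;
    if (fread(buf, 1, (size_t)size, fp) != (size_t)size) {
        free(buf);
        return NULL;
    }
    *len = (size_t)size;
    return buf;
}

/* Brute-force search: returns 1 if pat occurs anywhere in data, else 0. */
int contains_bytes(const unsigned char *data, size_t data_len,
                   const unsigned char *pat, size_t pat_len)
{
    if (pat_len == 0)
        return 1;
    if (pat_len > data_len)
        return 0;
    for (size_t i = 0; i + pat_len <= data_len; i++)
        if (memcmp(data + i, pat, pat_len) == 0)
            return 1;
    return 0;
}
```

With these, the check becomes contains_bytes(fil, filelen, vir, signlen): the file is the haystack and the signature is the pattern.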


How to read imaginary data from text file, with C

I am unable to read imaginary data from text file.
Here is my .txt file
abc.txt
0.2e-3+0.3*I 0.1+0.1*I
0.3+0.1*I 0.1+0.4*I
I want to read this data into a matrix and print it.
I found the solutions using C++ here and here. I don't know how to do the same in C.
I am able to read decimal and integer data in .txt and print them.
I am also able to print imaginary data initialized at the declaration, using the complex.h header. This is the program I have written:
#include<stdio.h>
#include<stdlib.h>
#include<complex.h>
#include<math.h>
int M,N,i,j,k,l,p,q;
int b[2];
int main(void)
{
FILE* ptr = fopen("abc.txt", "r");
if (ptr == NULL) {
printf("no such file.");
return 0;
}
long double d=0.2e-3+0.3*I;
long double c=0.0000000600415046630252;
double matrixA[2][2];
for(i=0;i<2; i++)
for(j=0;j<2; j++)
fscanf(ptr,"%lf+i%lf\n", creal(&matrixA[i][j]), cimag(&matrixA[i][j]));
//fscanf(ptr, "%lf", &matrixA[i][j]) for reading non-imainary data, It worked.
for(i=0;i<2; i++)
for(j=0;j<2; j++)
printf("%f+i%f\n", creal(matrixA[i][j]), cimag(matrixA[i][j]));
//printf("%lf\n", matrixA[i][j]); for printing non-imainary data, It worked.
printf("%f+i%f\n", creal(d), cimag(d));
printf("%Lg\n",c);
fclose(ptr);
return 0;
}
But I want to read it from the text file, because I have a much larger array, which I can't initialize at declaration because of its size.
There are two main issues with your code:
You need to add complex to the variables that hold complex values.
scanf() needs pointers to objects to store scanned values in them. But creal() returns a value, copied from its argument's contents. It is neither a pointer, nor could you get the address of the corresponding part of the complex argument.
Therefore, you need to provide temporary objects to scanf() which receive the scanned values. After successfully scanning, these values are combined to a complex value and assigned to the indexed matrix cell.
Minor issues not contributing to the core problem are:
The given source is "augmented" with unneeded #includes, unused variables, global variables, and experiments with constants. I removed them all to see the real thing.
The specifier "%f" (as many others) lets scanf() skip whitespace like blanks, tabs, newlines, and so on. Providing a "\n" mostly does more harm than one would expect.
I kept the "*I" to check the correct format. However, an error will only be found on the next call of scanf(), when it cannot scan the next number.
You need to check the return value of scanf(), always! It returns the number of conversions that were successful.
It is a common and good habit to let the compiler calculate the number of elements in an array. Divide the total size by an element's size.
Oh, and sizeof is an operator, not a function.
It is also best to return symbolic values to the caller, instead of magic numbers. Fortunately, the standard library defines these EXIT_... macros.
The signs are correctly handled by scanf() already. There is no need to tell it more. But for a nice output with printf(), you use the "+" as a flag to always output a sign.
Since the sign is now placed directly before the number, I moved the multiplication by I (you can change it to lower case, if you want) to the back of the imaginary part. This also matches the input format.
Error output is done via stderr instead of stdout. For example, this enables you to redirect the standard output to a pipe or file, without missing potential errors. You can also redirect errors somewhere else. And it is a well-known and appreciated standard.
This is a possible solution:
#include <stdio.h>
#include <stdlib.h>
#include <complex.h>
int main(void)
{
FILE* ptr = fopen("abc.txt", "r");
if (ptr == NULL) {
perror("\"abc.txt\"");
return EXIT_FAILURE;
}
double complex matrixA[2][2];
for (size_t i = 0; i < sizeof matrixA / sizeof matrixA[0]; i++)
for (size_t j = 0; j < sizeof matrixA[0] / sizeof matrixA[0][0]; j++) {
double real;
double imag;
if (fscanf(ptr, "%lf%lf*I", &real, &imag) != 2) {
fclose(ptr);
fprintf(stderr, "Wrong input format\n");
return EXIT_FAILURE;
}
matrixA[i][j] = real + imag * I;
}
fclose(ptr);
for (size_t i = 0; i < sizeof matrixA / sizeof matrixA[0]; i++)
for (size_t j = 0; j < sizeof matrixA[0] / sizeof matrixA[0][0]; j++)
printf("%+f%+f*I\n", creal(matrixA[i][j]), cimag(matrixA[i][j]));
return EXIT_SUCCESS;
}
Here's a simple solution using scanf() and the format shown in the examples.
It writes the values in the same format that it reads them, so the output can be scanned by the program as input.
/* SO 7438-4793 */
#include <stdio.h>
static int read_complex(FILE *fp, double *r, double *i)
{
int offset = 0;
char sign[2];
if (fscanf(fp, "%lg%1[-+]%lg*%*[iI]%n", r, sign, i, &offset) != 3 || offset == 0) /* %1[-+]: at most one sign character fits in sign[2] */
return EOF;
if (sign[0] == '-')
*i = -*i;
return 0;
}
int main(void)
{
double r;
double i;
while (read_complex(stdin, &r, &i) == 0)
printf("%g%+g*I\n", r, i);
return 0;
}
Sample input:
0.2e-3+0.3*I 0.1+0.1*I
0.3+0.1*I 0.1+0.4*I
-1.2-3.6*I -6.02214076e23-6.62607015E-34*I
Output from sample input:
0.0002+0.3*I
0.1+0.1*I
0.3+0.1*I
0.1+0.4*I
-1.2-3.6*I
-6.02214e+23-6.62607e-34*I
The numbers at the end with large exponents are Avogadro's Number and the Planck Constant.
The format is about as stringent as you can make it with scanf(). It requires a sign (+ or -) between the real and imaginary parts, requires the * and I to be immediately after the imaginary part (the conversion will fail if the *I is missing), and accepts either i or I to indicate the imaginary value. However:
It doesn't stop the imaginary number having a second sign (so it will read a value such as "-6+-4*I").
It doesn't stop there being white space after the mandatory sign (so it will read a value such as "-6+ 24*I").
It doesn't stop the real part being on one line and the imaginary part on the next line.
It won't handle either a pure-real number or a pure-imaginary number properly.
The scanf() functions are very flexible about white space, and it is very hard to prevent them from accepting white space. It would require a custom parser to prevent unwanted spaces. You could do that by reading the numbers and the markers separately, as strings, and then verifying that there's no space and so on. That might be the best way to handle it. You'd use sscanf() to convert the string read after ensuring there's no embedded white space yet the format is correct.
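As a rough sketch of that stricter approach, the function below validates one whitespace-delimited token (for example, one read with a %s conversion that has a field width). It uses strtod() rather than sscanf() for the conversions, which is a swapped-in technique, and the name parse_complex_strict is hypothetical. It still accepts anything strtod() accepts as a number, such as hexadecimal floats.

```c
#include <ctype.h>
#include <stdlib.h>

/* Parse a token of the exact form <real><sign><imag>*I (or *i), with no
   white space anywhere. Returns 1 and stores the parts on success, 0 otherwise. */
int parse_complex_strict(const char *tok, double *re, double *im)
{
    char *end;

    /* strtod() would skip leading white space, so reject it up front. */
    if (*tok == '\0' || isspace((unsigned char)*tok))
        return 0;
    double r = strtod(tok, &end);
    if (end == tok)
        return 0;                        /* no real part */
    if (*end != '+' && *end != '-')
        return 0;                        /* mandatory sign is missing */

    /* The sign must be followed directly by a digit or a decimal point;
       this rejects a second sign ("+-4") and a space ("+ 24"). */
    const char *p = end;
    if (!isdigit((unsigned char)p[1]) && p[1] != '.')
        return 0;
    double i = strtod(p, &end);          /* the sign is parsed with the number */
    if (end == p)
        return 0;                        /* no imaginary part */

    /* Exactly "*I" or "*i" must follow, and then the token must end. */
    if (end[0] != '*' || (end[1] != 'I' && end[1] != 'i') || end[2] != '\0')
        return 0;

    *re = r;
    *im = i;
    return 1;
}
```

Unlike the scanf() format, this rejects "-6+-4*I" and "-6+ 24*I", and it also reports pure-real or pure-imaginary tokens as errors instead of misreading them.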
I do not know which IDE you are using for C, so I do not understand this ./testprog <test.data.
I have yet to find an IDE that does not drive me bonkers. I use a Unix shell running in a terminal window. Assuming that your program name is testprog and the data file is test.data, typing ./testprog < test.data runs the program and feeds the contents of test.data as its standard input. On Windows, this would be a command window (and I think PowerShell would work much the same way).
I used fgets to read each line of the text file. Though I know the functionality of sscanf, I do not know how to parse an entire line, which has about 23 elements per line. If the number of elements in a line are few, I know how to parse it. Could you help me about it?
As I noted in a comment, the SO Q&A How to use sscanf() in loops? explains how to use sscanf() to read multiple entries from a line. In this case, you will need to read multiple complex numbers from a single line. Here is some code that shows it at work. It uses the POSIX getline() function to read arbitrarily long lines. If it isn't available to you, you can use fgets() instead, but you'll need to preallocate a big enough line buffer.
#include <stdio.h>
#include <stdlib.h>
#include <complex.h>
#ifndef CMPLX
#define CMPLX(r, i) ((double complex)((double)(r) + I * (double)(i)))
#endif
static size_t scan_multi_complex(const char *string, size_t nvalues,
complex double *v, const char **eoc)
{
size_t nread = 0;
const char *buffer = string;
while (nread < nvalues)
{
int offset = 0;
char sign[2];
double r, i;
if (sscanf(buffer, "%lg%1[-+]%lg*%*[iI]%n", &r, sign, &i, &offset) != 3 || offset == 0) /* %1[-+]: at most one sign character fits in sign[2] */
break;
if (sign[0] == '-')
i = -i;
v[nread++] = CMPLX(r, i);
buffer += offset;
}
*eoc = buffer;
return nread;
}
static void dump_complex(size_t nvalues, complex double values[nvalues])
{
for (size_t i = 0; i < nvalues; i++)
printf("%g%+g*I\n", creal(values[i]), cimag(values[i]));
}
enum { NUM_VALUES = 128 };
int main(void)
{
double complex values[NUM_VALUES];
size_t nvalues = 0;
char *buffer = 0;
size_t buflen = 0;
int length;
size_t lineno = 0;
while ((length = getline(&buffer, &buflen, stdin)) > 0 && nvalues < NUM_VALUES)
{
const char *eoc;
printf("Line: %zu [[%.*s]]\n", ++lineno, length - 1, buffer);
size_t nread = scan_multi_complex(buffer, NUM_VALUES - nvalues, &values[nvalues], &eoc);
if (*eoc != '\0' && *eoc != '\n')
printf("EOC: [[%s]]\n", eoc);
if (nread == 0)
break;
dump_complex(nread, &values[nvalues]);
nvalues += nread;
}
free(buffer);
printf("All done:\n");
dump_complex(nvalues, values);
return 0;
}
Here is a data file with 8 lines, each holding 10 complex numbers:
-1.95+11.00*I +21.72+64.12*I -95.16-1.81*I +64.23+64.55*I +28.42-29.29*I -49.25+7.87*I +44.98+79.62*I +69.80-1.24*I +61.99+37.01*I +72.43+56.88*I
-9.15+31.41*I +63.84-15.82*I -0.77-76.80*I -85.59+74.86*I +93.00-35.10*I -93.82+52.80*I +85.45+82.42*I +0.67-55.77*I -58.32+72.63*I -27.66-81.15*I
+87.97+9.03*I +7.05-74.91*I +27.60+65.89*I +49.81+25.08*I +44.33+77.00*I +93.27-7.74*I +61.62-5.01*I +99.33-82.80*I +8.83+62.96*I +7.45+73.70*I
+40.99-12.44*I +53.34+21.74*I +75.77-62.56*I +54.16-26.97*I -37.02-31.93*I +78.20-20.91*I +79.64+74.71*I +67.95-40.73*I +58.19+61.25*I +62.29-22.43*I
+47.36-16.19*I +68.48-15.00*I +6.85+61.50*I -6.62+55.18*I +34.95-69.81*I -88.62-81.15*I +75.92-74.65*I +85.17-3.84*I -37.20-96.98*I +74.97+78.88*I
+56.80+63.63*I +92.83-16.18*I -11.47+8.81*I +90.74+42.86*I +19.11-56.70*I -77.93-70.47*I +6.73+86.12*I +2.70-57.93*I +57.87+29.44*I +6.65-63.09*I
-35.35-70.67*I +8.08-21.82*I +86.72-93.82*I -28.96-24.69*I +68.73-15.36*I +52.85+94.65*I +85.07-84.04*I +9.98+29.56*I -78.01-81.23*I -10.67+13.68*I
+83.10-33.86*I +56.87+30.23*I -78.56+3.73*I +31.41+10.30*I +91.98+29.04*I -9.20+24.59*I +70.82-19.41*I +29.21+84.74*I +56.62+92.29*I +70.66-48.35*I
The output of the program is:
Line: 1 [[-1.95+11.00*I +21.72+64.12*I -95.16-1.81*I +64.23+64.55*I +28.42-29.29*I -49.25+7.87*I +44.98+79.62*I +69.80-1.24*I +61.99+37.01*I +72.43+56.88*I]]
-1.95+11*I
21.72+64.12*I
-95.16-1.81*I
64.23+64.55*I
28.42-29.29*I
-49.25+7.87*I
44.98+79.62*I
69.8-1.24*I
61.99+37.01*I
72.43+56.88*I
Line: 2 [[-9.15+31.41*I +63.84-15.82*I -0.77-76.80*I -85.59+74.86*I +93.00-35.10*I -93.82+52.80*I +85.45+82.42*I +0.67-55.77*I -58.32+72.63*I -27.66-81.15*I]]
-9.15+31.41*I
63.84-15.82*I
-0.77-76.8*I
-85.59+74.86*I
93-35.1*I
-93.82+52.8*I
85.45+82.42*I
0.67-55.77*I
-58.32+72.63*I
-27.66-81.15*I
Line: 3 [[+87.97+9.03*I +7.05-74.91*I +27.60+65.89*I +49.81+25.08*I +44.33+77.00*I +93.27-7.74*I +61.62-5.01*I +99.33-82.80*I +8.83+62.96*I +7.45+73.70*I]]
87.97+9.03*I
7.05-74.91*I
27.6+65.89*I
49.81+25.08*I
44.33+77*I
93.27-7.74*I
61.62-5.01*I
99.33-82.8*I
8.83+62.96*I
7.45+73.7*I
Line: 4 [[+40.99-12.44*I +53.34+21.74*I +75.77-62.56*I +54.16-26.97*I -37.02-31.93*I +78.20-20.91*I +79.64+74.71*I +67.95-40.73*I +58.19+61.25*I +62.29-22.43*I]]
40.99-12.44*I
53.34+21.74*I
75.77-62.56*I
54.16-26.97*I
-37.02-31.93*I
78.2-20.91*I
79.64+74.71*I
67.95-40.73*I
58.19+61.25*I
62.29-22.43*I
Line: 5 [[+47.36-16.19*I +68.48-15.00*I +6.85+61.50*I -6.62+55.18*I +34.95-69.81*I -88.62-81.15*I +75.92-74.65*I +85.17-3.84*I -37.20-96.98*I +74.97+78.88*I]]
47.36-16.19*I
68.48-15*I
6.85+61.5*I
-6.62+55.18*I
34.95-69.81*I
-88.62-81.15*I
75.92-74.65*I
85.17-3.84*I
-37.2-96.98*I
74.97+78.88*I
Line: 6 [[+56.80+63.63*I +92.83-16.18*I -11.47+8.81*I +90.74+42.86*I +19.11-56.70*I -77.93-70.47*I +6.73+86.12*I +2.70-57.93*I +57.87+29.44*I +6.65-63.09*I]]
56.8+63.63*I
92.83-16.18*I
-11.47+8.81*I
90.74+42.86*I
19.11-56.7*I
-77.93-70.47*I
6.73+86.12*I
2.7-57.93*I
57.87+29.44*I
6.65-63.09*I
Line: 7 [[-35.35-70.67*I +8.08-21.82*I +86.72-93.82*I -28.96-24.69*I +68.73-15.36*I +52.85+94.65*I +85.07-84.04*I +9.98+29.56*I -78.01-81.23*I -10.67+13.68*I]]
-35.35-70.67*I
8.08-21.82*I
86.72-93.82*I
-28.96-24.69*I
68.73-15.36*I
52.85+94.65*I
85.07-84.04*I
9.98+29.56*I
-78.01-81.23*I
-10.67+13.68*I
Line: 8 [[+83.10-33.86*I +56.87+30.23*I -78.56+3.73*I +31.41+10.30*I +91.98+29.04*I -9.20+24.59*I +70.82-19.41*I +29.21+84.74*I +56.62+92.29*I +70.66-48.35*I]]
83.1-33.86*I
56.87+30.23*I
-78.56+3.73*I
31.41+10.3*I
91.98+29.04*I
-9.2+24.59*I
70.82-19.41*I
29.21+84.74*I
56.62+92.29*I
70.66-48.35*I
All done:
-1.95+11*I
21.72+64.12*I
-95.16-1.81*I
64.23+64.55*I
28.42-29.29*I
-49.25+7.87*I
44.98+79.62*I
69.8-1.24*I
61.99+37.01*I
72.43+56.88*I
-9.15+31.41*I
63.84-15.82*I
-0.77-76.8*I
-85.59+74.86*I
93-35.1*I
-93.82+52.8*I
85.45+82.42*I
0.67-55.77*I
-58.32+72.63*I
-27.66-81.15*I
87.97+9.03*I
7.05-74.91*I
27.6+65.89*I
49.81+25.08*I
44.33+77*I
93.27-7.74*I
61.62-5.01*I
99.33-82.8*I
8.83+62.96*I
7.45+73.7*I
40.99-12.44*I
53.34+21.74*I
75.77-62.56*I
54.16-26.97*I
-37.02-31.93*I
78.2-20.91*I
79.64+74.71*I
67.95-40.73*I
58.19+61.25*I
62.29-22.43*I
47.36-16.19*I
68.48-15*I
6.85+61.5*I
-6.62+55.18*I
34.95-69.81*I
-88.62-81.15*I
75.92-74.65*I
85.17-3.84*I
-37.2-96.98*I
74.97+78.88*I
56.8+63.63*I
92.83-16.18*I
-11.47+8.81*I
90.74+42.86*I
19.11-56.7*I
-77.93-70.47*I
6.73+86.12*I
2.7-57.93*I
57.87+29.44*I
6.65-63.09*I
-35.35-70.67*I
8.08-21.82*I
86.72-93.82*I
-28.96-24.69*I
68.73-15.36*I
52.85+94.65*I
85.07-84.04*I
9.98+29.56*I
-78.01-81.23*I
-10.67+13.68*I
83.1-33.86*I
56.87+30.23*I
-78.56+3.73*I
31.41+10.3*I
91.98+29.04*I
-9.2+24.59*I
70.82-19.41*I
29.21+84.74*I
56.62+92.29*I
70.66-48.35*I
The code handles any number of entries on a line (up to 128 in total because of the limit on the size of the array of complex numbers, but that can be fixed too).

Is concatenating arbitrary number of strings with nested function calls in C undefined behavior?

I have an application that builds file path names through a series of string concatenations using pieces of text to create a complete file path name.
The question is whether an approach to handle concatenating a small but arbitrary number of strings of text together depends on Undefined Behavior for success.
Is the order of evaluation of a series of nested functions guaranteed or not?
I found this question Nested function calls order of evaluation however it seems to be more about multiple functions in the argument list rather than a sequence of nesting functions.
Please excuse the names in the following code samples. It is congruent with the rest of the source code and I am testing things out a bit first.
My first cut on the need to concatenate several strings was a function that looked like the following which would concatenate up to three text strings into a single string.
typedef wchar_t TCHAR;
TCHAR *RflCatFilePath(TCHAR *tszDest, int nDestLen, TCHAR *tszPath, TCHAR *tszPath2, TCHAR *tszFileName)
{
if (tszDest && nDestLen > 0) {
TCHAR *pDest = tszDest;
TCHAR *pLast = tszDest;
*pDest = 0; // ensure empty string if no path data provided.
if (tszPath) for (pDest = pLast; nDestLen > 0 && (*pDest++ = *tszPath++); nDestLen--) pLast = pDest;
if (tszPath2) for (pDest = pLast; nDestLen > 0 && (*pDest++ = *tszPath2++); nDestLen--) pLast = pDest;
if (tszFileName) for (pDest = pLast; nDestLen > 0 && (*pDest++ = *tszFileName++); nDestLen--) pLast = pDest;
}
return tszDest;
}
Then I ran into a case where I had four pieces of text to put together.
Thinking through this it seemed that most probably there would also be a case for five that would be uncovered shortly so I wondered if there was a different way for an arbitrary number of strings.
What I came up with is two functions as follows.
typedef wchar_t TCHAR;
typedef struct {
TCHAR *pDest;
TCHAR *pLast;
int destLen;
} RflCatStruct;
RflCatStruct RflCatFilePathX(const TCHAR *pPath, RflCatStruct x)
{
TCHAR *pDest = x.pLast;
if (pDest && pPath) for ( ; x.destLen > 0 && (*pDest++ = *pPath++); x.destLen--) x.pLast = pDest;
return x;
}
RflCatStruct RflCatFilePathY(TCHAR *buffDest, int nLen, const TCHAR *pPath)
{
RflCatStruct x = { 0 };
TCHAR *pDest = x.pDest = buffDest;
x.pLast = buffDest;
x.destLen = nLen;
if (buffDest && nLen > 0) { // ensure there is room for at least one character.
*pDest = 0; // ensure empty string if no path data provided.
if (pPath) for (pDest = x.pLast; x.destLen > 0 && (*pDest++ = *pPath++); x.destLen--) x.pLast = pDest;
}
return x;
}
Examples of using these two functions is as follows. This code with the two functions appears to work fine with Visual Studio 2013.
TCHAR buffDest[512] = { 0 };
TCHAR *pPath = L"C:\\flashdisk\\ncr\\database";
TCHAR *pPath2 = L"\\";
TCHAR *pFilename = L"filename.ext";
RflCatFilePathX(pFilename, RflCatFilePathX(pPath2, RflCatFilePathY(buffDest, 512, pPath)));
printf("dest t = \"%S\"\n", buffDest);
printf("dest t = \"%S\"\n", RflCatFilePathX(pFilename, RflCatFilePathX(pPath2, RflCatFilePathY(buffDest, 512, pFilename))).pDest);
RflCatStruct dStr = RflCatFilePathX(pPath2, RflCatFilePathY(buffDest, 512, pPath));
// other stuff then
printf("dest t = \"%S\"\n", RflCatFilePathX(pFilename, dStr).pDest);
Arguments to a function call are completely evaluated before the function is invoked. So the calls to RflCatFilePath* will be evaluated in the expected order. (This is guaranteed by §6.5.2.2/10: "There is a sequence point after the evaluations of the function designator and the actual arguments but before the actual call.")
As indicated in a comment, the snprintf function is likely to be a better choice for this problem. (asprintf would be even better, and there is a freely available shim for it which works on Windows.) The only problem with snprintf is that you may have to call it twice. It always returns the number of bytes which would have been stored in the buffer had there been enough space, so if the return value is not less than the size of the buffer, you will need to allocate a larger buffer (whose size you now know) and call snprintf again.
asprintf does that for you, but it is a BSD/Gnu extension to the standard library.
In the case of concatenating filepaths, there is a maximum string length supported by the operating system/file system, and you should be able to find out what it is (although it might require OS-specific calls on non-Posix systems). So it might well be reasonable to simply return an error indication if the concatenation does not fit into a 512-byte buffer.
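The two-call snprintf pattern described above can be sketched as follows. This assumes narrow (char) strings and a C99-conforming snprintf; the question's wchar_t strings would need a different function, since swprintf does not report the required length in the same way. The name join_path is a hypothetical helper.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: join three pieces into a newly allocated string.
   The caller must free() the result. Returns NULL on failure. */
char *join_path(const char *dir, const char *sep, const char *name)
{
    /* First call: no buffer, just ask how many characters are needed. */
    int needed = snprintf(NULL, 0, "%s%s%s", dir, sep, name);
    if (needed < 0)
        return NULL;

    char *out = malloc((size_t)needed + 1);   /* +1 for the terminating '\0' */
    if (out == NULL)
        return NULL;

    /* Second call: the buffer is now known to be large enough. */
    snprintf(out, (size_t)needed + 1, "%s%s%s", dir, sep, name);
    return out;
}
```

A fourth or fifth piece only needs another %s in the format string, which avoids the nesting in the question altogether.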
Just for fun, I include a recursive varargs concatenator:
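#include <stdarg.h>
#include <stdlib.h>
#include <string.h>
/* Recurse to the end of the argument list to learn the total length,
   allocate there, and copy each chunk into place on the way back. */
static char* concat_helper(size_t accum, char* chunk, va_list ap) {
    if (chunk) {
        size_t chunklen = strlen(chunk);
        char* next_chunk = va_arg(ap, char*);
        char* retval = concat_helper(accum + chunklen, next_chunk, ap);
        if (retval) /* NULL means the allocation failed */
            memcpy(retval + accum, chunk, chunklen);
        return retval;
    } else {
        char* retval = malloc(accum + 1);
        if (retval)
            retval[accum] = 0;
        return retval;
    }
}
/* Returns a malloc'd string (free it when done), or NULL if malloc failed. */
char* concat_list(char* chunk, ...) {
    va_list ap;
    va_start(ap, chunk);
    char* retval = concat_helper(0, chunk, ap);
    va_end(ap);
    return retval;
}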
Since concat_list is a varargs function, you need to supply (char*)NULL at the end of the arguments. On the other hand, you don't need to repeat the function name for each new argument. So an example call might be:
concat_list(pPath, pPath2, pFilename, (char*)0);
(I suppose you need a wchar_t* version but the changes should be obvious. Watch out for the malloc.) For production purposes, the recursion should probably be replaced by an iterative version which traverses the argument list twice (see va_copy) but I've always been fond of the "there-and-back" recursion pattern.
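For comparison, a two-pass iterative version using va_copy might look like the sketch below. The name concat_list_iter is made up for this example; like the recursive version, it expects (char*)NULL as the last argument and returns memory the caller must free.

```c
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>

/* Iterative concatenation: pass 1 measures, pass 2 copies. */
char *concat_list_iter(const char *first, ...)
{
    va_list ap, ap2;
    size_t total = 0;

    va_start(ap, first);
    va_copy(ap2, ap);                 /* second cursor over the same arguments */

    /* Pass 1: add up the lengths of all chunks. */
    for (const char *s = first; s != NULL; s = va_arg(ap, char *))
        total += strlen(s);
    va_end(ap);

    char *out = malloc(total + 1);
    if (out != NULL) {
        /* Pass 2: copy each chunk to its place. */
        char *p = out;
        for (const char *s = first; s != NULL; s = va_arg(ap2, char *)) {
            size_t n = strlen(s);
            memcpy(p, s, n);
            p += n;
        }
        *p = '\0';
    }
    va_end(ap2);
    return out;
}
```

A call has the same shape as before, for example concat_list_iter(pPath, pPath2, pFilename, (char*)0) with char strings.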

Two-dimensional char array too large exit code 139

Hey guys, I'm attempting to read in workersinfo.txt and store it in a two-dimensional char array. The file is around 4,000,000 lines with around 100 characters per line. I want to store each line of the file in the array. Unfortunately, I get exit code 139 (not enough memory). I'm aware I have to use malloc() and free(), but I've tried a couple of things and I haven't been able to make them work. Eventually I have to sort the array by ID number, but I'm stuck on declaring the array.
The file looks something like this:
First Name, Last Name,Age, ID
Carlos,Lopez,,10568
Brad, Patterson,,20586
Zack, Morris,42,05689
This is my code so far:
#include <stdio.h>
#include <stdlib.h>
int main(void) {
FILE *ptr_file;
char workers[4000000][1000];
ptr_file =fopen("workersinfo.txt","r");
if (!ptr_file)
perror("Error");
int i = 0;
while (fgets(workers[i],1000, ptr_file)!=NULL){
i++;
}
int n;
for(n = 0; n < 4000000; n++)
{
printf("%s", workers[n]);
}
fclose(ptr_file);
return 0;
}
Stack memory is limited. As you pointed out in your question, you MUST use malloc to allocate such a big (need I say HUGE) chunk of memory, as the stack cannot contain it. (Exit code 139 is 128 + 11, which is how a shell reports a process killed by SIGSEGV: the array overflows the stack and the program crashes.)
You can use ulimit to review the limits of your system (usually including the stack size limit).
On my Mac, the limit is 8MB. After running ulimit -a I get:
...
stack size (kbytes, -s) 8192
...
Or, test the limit using:
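#include <stdio.h>
#include <sys/resource.h>

struct rlimit rlim;
if (getrlimit(RLIMIT_STACK, &rlim) == 0)
    printf("%llu\n", (unsigned long long)rlim.rlim_cur); // the (soft) stack limit, in bytes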
I truly recommend you process each database entry separately.
As mentioned in the comments, assigning the memory as static memory would, in most implementations, circumvent the stack.
Still, IMHO, allocating 400MB of memory (or 4GB, depending which part of your question I look at), is bad form unless totally required - especially for a single function.
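If the whole file really must be held in memory, a heap-based version could be sketched like this. It allocates each line at its actual length, so roughly 100 bytes per line are used instead of a fixed 1000. The names read_lines and free_lines are hypothetical, and the 1024-byte line buffer is an assumption carried over from the sizes discussed here.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Free an array produced by read_lines(). */
void free_lines(char **lines, size_t count)
{
    for (size_t i = 0; i < count; i++)
        free(lines[i]);
    free(lines);
}

/* Read every line of fp into heap memory.
   Returns 0 on success (*out and *count are set), -1 on allocation failure.
   Lines longer than the local buffer are split across several entries. */
int read_lines(FILE *fp, char ***out, size_t *count)
{
    char buf[1024];
    char **lines = NULL;
    size_t n = 0, cap = 0;

    while (fgets(buf, sizeof buf, fp) != NULL) {
        if (n == cap) {                       /* grow the pointer array */
            size_t new_cap = cap ? cap * 2 : 1024;
            char **tmp = realloc(lines, new_cap * sizeof *tmp);
            if (tmp == NULL) {
                free_lines(lines, n);
                return -1;
            }
            lines = tmp;
            cap = new_cap;
        }
        size_t len = strlen(buf) + 1;         /* include the terminating '\0' */
        lines[n] = malloc(len);
        if (lines[n] == NULL) {
            free_lines(lines, n);
            return -1;
        }
        memcpy(lines[n], buf, len);
        n++;
    }
    *out = lines;
    *count = n;
    return 0;
}
```

The printing loop would then run up to the returned count rather than a fixed 4000000, which also avoids printing rows that were never filled.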
Follow-up Q1: How to deal with each DB entry separately
I hope I'm not doing your homework or anything... but I doubt your homework would include an assignment to load 400MB of data into the computer's memory... so... to answer the question in your comment:
The following sketch of single-entry processing isn't perfect: it's limited to 1KB of data per entry (which I thought to be more than enough for such simple data).
Also, I didn't allow for UTF-8 encoding or anything like that (I followed the assumption that English would be used).
As you can see from the code, we read each line separately and perform error checks to verify that the data is valid.
To sort the file by ID, you might consider either comparing and sorting two lines at a time (this would be a slow sort), or creating a sorted node tree holding the ID and the position of the line in the file (get the position before reading the line). Once the tree is sorted, you can use it to write out the data in order...
... The binary tree might get a bit big. Did you look up sorting algorithms?
#include <stdio.h>
// assuming this is the file structure:
//
// First Name, Last Name,Age, ID
// Carlos,Lopez,,10568
// Brad, Patterson,,20586
// Zack, Morris,42,05689
//
// Then this might be your data structure per line:
struct DBEntry {
char* last_name; // a pointer to the last name
char* age; // a pointer to the age - could probably be an int
char* id; // a pointer to the ID
char first_name[1024]; // the actual buffer...
// I unified the first name and the buffer since the first name is first.
};
// each time you read only a single line, perform an error check for overflow
// and return the parsed data.
//
// return 1 on success or 0 on failure.
int read_db_line(FILE* fp, struct DBEntry* line) {
if (!fgets(line->first_name, 1024, fp))
return 0;
// parse data and review for possible overflow.
// first, zero out data
int pos = 0;
line->age = NULL;
line->id = NULL;
line->last_name = NULL;
// read each byte, looking for the EOL marker and the ',' separators
// (stop at the string terminator so we never scan bytes fgets did not write)
while (pos < 1024 && line->first_name[pos] != '\0') {
if (line->first_name[pos] == ',') {
// we encountered a divider. we should handle it.
// if the ID field's location is already known, we have an excess comma.
if (line->id) {
fprintf(stderr, "Parsing error, invalid data - too many fields.\n");
return 0;
}
// replace the comma with 0 (separate the strings)
line->first_name[pos] = 0;
if (line->age)
line->id = line->first_name + pos + 1;
else if (line->last_name)
line->age = line->first_name + pos + 1;
else
line->last_name = line->first_name + pos + 1;
} else if (line->first_name[pos] == '\n') {
// we encountered a terminator. we should handle it.
if (line->id) {
// if we have the id string's position (the start marker), this is a
// valid entry and we should process the data.
line->first_name[pos] = 0;
return 1;
} else {
// we reached an EOL without enough ',' separators, this is an invalid
// line.
fprintf(stderr, "Parsing error, invalid data - not enough fields.\n");
return 0;
}
}
pos++;
}
// we ran through all the data but there was no EOL marker...
fprintf(stderr,
"Parsing error, invalid data (data overflow or data too large).\n");
return 0;
}
// the main program
int main(int argc, char const* argv[]) {
// open file
FILE* ptr_file;
ptr_file = fopen("workersinfo.txt", "r");
if (!ptr_file) {
perror("File Error");
return 1; // nothing to read; don't pass a NULL stream to fgets
}
struct DBEntry line;
while (read_db_line(ptr_file, &line)) {
// do what you want with the data... print it?
printf(
"First name:\t%s\n"
"Last name:\t%s\n"
"Age:\t\t%s\n"
"ID:\t\t%s\n"
"--------\n",
line.first_name, line.last_name, line.age, line.id);
}
// close file
fclose(ptr_file);
return 0;
}
Followup Q2: Sorting array for 400MB-4GB of data
IMHO, 400MB is already touching on the issues related to big data. For example, implementing a bubble sort on your database would be agonizing as far as performance goes (unless it's a one-time task, where performance might not matter).
Creating an array of DBEntry objects will eventually give you a larger memory footprint than the actual data.
This will not be the optimal way to sort large data.
The correct approach will depend on your sorting algorithm. Wikipedia has a decent primer on sorting algorithms.
Since we are handling a large amount of data, there are a few things to consider:
It would make sense to partition the work, so different threads/processes sort a different section of the data.
We will need to minimize IO to the hard drive (as it will slow the sorting significantly and prevent parallel processing on the same machine/disk).
One possible approach is to create a heap for a heap sort, but only storing a priority value and storing the original position in the file.
Another option would probably be to employ a divide and conquer algorithm, such as quicksort, again, only sorting a computed sort value and the entry's position in the original file.
Either way, writing a decent sorting method will be a complicated task, probably involving threading, forking, tempfiles or other techniques.
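As a single-threaded baseline for the "sort only a value and a file position" idea, the sketch below builds an in-memory index and sorts it with the standard library's qsort. This is a swapped-in, simpler technique than the heap or threaded approaches mentioned above. The names IndexEntry, compare_by_id and build_sorted_index are hypothetical, and it assumes the ID is the last comma-separated field, is numeric, and fits in a long.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct IndexEntry {
    long id;      /* numeric ID parsed from the last field */
    long offset;  /* where the line starts, as reported by ftell() */
};

static int compare_by_id(const void *a, const void *b)
{
    const struct IndexEntry *x = a;
    const struct IndexEntry *y = b;
    return (x->id > y->id) - (x->id < y->id);
}

/* Build an index of (id, offset) pairs sorted by id.
   Returns the number of entries, or -1 if memory runs out.
   Lines without a numeric last field (such as the header) are skipped. */
long build_sorted_index(FILE *fp, struct IndexEntry **out)
{
    char line[1024];
    struct IndexEntry *idx = NULL;
    size_t n = 0, cap = 0;

    for (;;) {
        long pos = ftell(fp);                 /* position before reading the line */
        if (pos < 0 || fgets(line, sizeof line, fp) == NULL)
            break;
        const char *comma = strrchr(line, ',');
        if (comma == NULL)
            continue;
        char *end;
        long id = strtol(comma + 1, &end, 10);
        if (end == comma + 1)
            continue;                         /* no digits after the last comma */
        if (n == cap) {
            size_t new_cap = cap ? cap * 2 : 1024;
            struct IndexEntry *tmp = realloc(idx, new_cap * sizeof *tmp);
            if (tmp == NULL) {
                free(idx);
                return -1;
            }
            idx = tmp;
            cap = new_cap;
        }
        idx[n].id = id;
        idx[n].offset = pos;
        n++;
    }
    if (n > 1)
        qsort(idx, n, sizeof *idx, compare_by_id);
    *out = idx;
    return (long)n;
}
```

To produce the sorted output, walk the index in order, fseek() to each offset and copy that line out. Note that converting the ID to a number drops leading zeros such as the one in 05689, which matters only if the IDs need to be compared as text.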
Here's some simplified demo code... it is far from optimized, but it demonstrates the idea of a sorted structure that holds the sorting value and the position of the data in the file. (As written, each node has only a next pointer, so it is really a sorted linked list filled by insertion rather than a true binary tree.)
Be aware that using this code will be both relatively slow (although not that slow) and memory intensive...
On the other hand, it will require about 24 bytes per entry. For 4 million entries, that's 96MB, somewhat better than 400MB and definitely better than the 4GB.
#include <stdlib.h>
#include <stdio.h>
// assuming this is the file structure:
//
// First Name, Last Name,Age, ID
// Carlos,Lopez,,10568
// Brad, Patterson,,20586
// Zack, Morris,42,05689
//
// Then this might be your data structure per line:
struct DBEntry {
char* last_name; // a pointer to the last name
char* age; // a pointer to the age - could probably be an int
char* id; // a pointer to the ID
char first_name[1024]; // the actual buffer...
// I unified the first name and the buffer since the first name is first.
};
// this might be a sorting node (as written, a node of a sorted singly linked list):
struct SortNode {
struct SortNode* next; // a pointer to the next node
fpos_t position; // the DB entry's position in the file
long value; // The computed sorting value
}* top_sorting_node = NULL;
// this function will free all the memory used by the global Sorting tree
void clear_sort_heap(void) {
struct SortNode* node;
// as long as there is a first node...
while ((node = top_sorting_node)) {
// step forward.
top_sorting_node = top_sorting_node->next;
// free the original first node's memory
free(node);
}
}
// each time you read only a single line, perform an error check for overflow
// and return the parsed data.
//
// return 0 on success or -1 on failure (malformed lines end the program).
int read_db_line(FILE* fp, struct DBEntry* line) {
if (!fgets(line->first_name, 1024, fp))
return -1;
// parse data and review for possible overflow.
// first, zero out data
int pos = 0;
line->age = NULL;
line->id = NULL;
line->last_name = NULL;
// read each byte, looking for the EOL marker and the ',' separators
// (stop at the string terminator so we never scan bytes fgets did not write)
while (pos < 1024 && line->first_name[pos] != '\0') {
if (line->first_name[pos] == ',') {
// we encountered a divider. we should handle it.
// if the ID field's location is already known, we have an excess comma.
if (line->id) {
fprintf(stderr, "Parsing error, invalid data - too many fields.\n");
clear_sort_heap();
exit(2);
}
// replace the comma with 0 (separate the strings)
line->first_name[pos] = 0;
if (line->age)
line->id = line->first_name + pos + 1;
else if (line->last_name)
line->age = line->first_name + pos + 1;
else
line->last_name = line->first_name + pos + 1;
} else if (line->first_name[pos] == '\n') {
// we encountered a terminator. we should handle it.
if (line->id) {
// if we have the id string's position (the start marker), this is a
// valid entry and we should process the data.
line->first_name[pos] = 0;
return 0;
} else {
// we reached an EOL without enough ',' separators, this is an invalid
// line.
fprintf(stderr, "Parsing error, invalid data - not enough fields.\n");
clear_sort_heap();
exit(1);
}
}
pos++;
}
// we ran through all the data but there was no EOL marker...
fprintf(stderr,
"Parsing error, invalid data (data overflow or data too large).\n");
// report failure so the caller doesn't use a partially parsed entry
return -1;
}
// read and sort a single line from the database.
// return 0 if there was no data to sort. return 1 if data was read and sorted.
int sort_line(FILE* fp) {
// allocate the memory for the node - use calloc for zero-out data
struct SortNode* node = calloc(1, sizeof(*node));
if (!node) {
perror("Memory Error");
return 0;
}
// store the position on file
fgetpos(fp, &node->position);
// use a stack allocated DBEntry for processing
struct DBEntry line;
// check that the read succeeded (read_db_line will return -1 on error)
if (read_db_line(fp, &line)) {
// free the node's memory
free(node);
// return no data (0)
return 0;
}
// compute sorting value - I'll assume all IDs are numbers up to long size.
sscanf(line.id, "%ld", &node->value);
// insertion sort into a singly linked list.
// Each insertion walks the list, so sorting n lines costs O(n^2) overall,
// which is a questionable choice for very large files.
// Also, I'll be using pointers to pointers, so it might be a headache to read
// (it's a headache to write, too...) ;-)
struct SortNode** tmp = &top_sorting_node;
// move along the list until we encounter a value larger than ours,
// OR until the list is finished.
while (*tmp && (*tmp)->value <= node->value)
tmp = &((*tmp)->next);
// update the node's `next` value.
node->next = *tmp;
// inject the new node into the list at the position we found
*tmp = node;
// return 1 (data was read and sorted)
return 1;
}
// writes the next line in the sorting
int write_line(FILE* to, FILE* from) {
struct SortNode* node = top_sorting_node;
if (!node) // are we done? top_sorting_node == NULL ?
return 0; // return 0 - no data to write
// step top_sorting_node forward
top_sorting_node = top_sorting_node->next;
// read data from one file to the other
fsetpos(from, &node->position);
char* buffer = NULL;
ssize_t length;
size_t buff_size = 0;
length = getline(&buffer, &buff_size, from);
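if (length <= 0) {
perror("Line Copy Error - Couldn't read data");
free(buffer); // getline may have allocated the buffer even on failure
free(node); // the node was already unlinked from the list
return 0;
}
fwrite(buffer, 1, length, to);
free(buffer); // getline allocates memory that we're in charge of freeing.
free(node); // the node was unlinked from the list, so release it here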
return 1;
}
// the main program
int main(int argc, char const* argv[]) {
// open file
FILE *fp_read, *fp_write;
fp_read = fopen("workersinfo.txt", "r");
fp_write = fopen("sorted_workersinfo.txt", "w+");
if (!fp_read) {
perror("File Error");
goto cleanup;
}
if (!fp_write) {
perror("File Error");
goto cleanup;
}
printf("\nSorting");
while (sort_line(fp_read))
printf(".");
// write all sorted data to a new file
printf("\n\nWriting sorted data");
while (write_line(fp_write, fp_read))
printf(".");
// clean up - close files and make sure the sorting tree is cleared
cleanup:
printf("\n");
if (fp_read)
fclose(fp_read);
if (fp_write)
fclose(fp_write);
clear_sort_heap();
return 0;
}
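The linked-list insertion above walks the list for every line, so the total cost grows roughly quadratically with the number of lines. One possible alternative, sketched here under assumptions, is to collect the (value, position) pairs in an array and sort them once with qsort. The names IndexEntry, compare_entries and sort_index are hypothetical and not part of the code above, and positions are stored as long offsets (as returned by ftell) instead of fpos_t to keep the sketch short.

```c
#include <stdlib.h>

/* Hypothetical index record: one per line of the input file. */
struct IndexEntry {
    long value;    /* the computed sorting value (e.g. the numeric ID) */
    long position; /* the line's offset in the file, e.g. from ftell() */
};

/* qsort comparator: ascending value, ties broken by file position
 * so that equal IDs keep their original order. */
static int compare_entries(const void *a, const void *b) {
    const struct IndexEntry *x = a;
    const struct IndexEntry *y = b;
    /* comparisons instead of subtraction, which could overflow */
    if (x->value != y->value)
        return (x->value > y->value) - (x->value < y->value);
    return (x->position > y->position) - (x->position < y->position);
}

/* Sorts the whole index with a single library call. */
static void sort_index(struct IndexEntry *entries, size_t count) {
    qsort(entries, count, sizeof(*entries), compare_entries);
}
```

After sorting, the writer would seek to each entry's position in order and copy one line, much like write_line does with the list nodes.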

syscall read acting weird

c lang, ubuntu
So I have a task: write a menu with these 3 options:
1. close program
2. show user id
3. show current working directory
I can only use 3 libraries: unistd.h, sys/syscall.h and sys/sysinfo.h,
so no printf/scanf.
I need to use an array of a struct I'm given, which holds a function pointer,
to call the function the user wants to use.
The problem is with options 2 and 3.
When I pick 2 the first time, it works fine (I think).
The second time I pick 2, it works, but on the third iteration
it doesn't wait for input: it takes '\n' as the input for some reason, then it says invalid input. (I checked what it takes as input with printf: I printed index after recalculating it and it was -39, which means selection[0] = 10 = '\n'.)
That's the first problem, and I just can't find the solution for it.
The second problem is in the current working directory function:
SYS_getcwd returns -1 for some reason, which means there's an error, but I can't figure it out.
Any explanations for these things?
(Also: slen and __itoa are functions I am given. slen returns the length of a string;
__itoa returns a char* holding the string representation of an integer.)
helper.h:
typedef struct func_desc {
char *name;
void (*func)(void);
} fun_desc;
main.c:
#include <unistd.h>
#include "helper.h"
#include <sys/syscall.h>
#include <sys/sysinfo.h>
void exitProgram();
void printID();
void currDir();
int main() {
fun_desc arrFuncs[3];
arrFuncs[0].name = "exitProgram";
arrFuncs[0].func = &exitProgram;
arrFuncs[1].name = "printID";
arrFuncs[1].func = &printID;
arrFuncs[2].name = "currDir";
arrFuncs[2].func = &currDir;
char selection[2];
int index;
const char* menu = "Welcome to the menu. Please pick one of the following actions:\n1.Close the program\n2.Print the current user's id\n3.Print the current directory's id\n";
while(1 == 1) {
syscall(SYS_write, 0, menu, slen(menu));
syscall(SYS_write, 0, "Your selection: ", slen("Your selection: "));
syscall(SYS_read, 1, selection, slen(selection)); //might be a problem
selection[1] = '\0';
index = selection[0] - '0' - 1;
if(index > 2)
syscall(SYS_write, 0, "Invalid input\n", slen("Invalid input\n"));
else
arrFuncs[index].func();
}
return(0);
}
void exitProgram() {
syscall(SYS_write, 0, "The program will close\n", slen("The program will close\n"));
syscall(SYS_exit);
}
void printID() { //problem
char* uid = __itoa(syscall(SYS_getuid));
syscall(SYS_write, 0, uid, slen(uid));
syscall(SYS_write, 0, "\n", slen("\n"));
}
void currDir() { //????
char* buf = __itoa(syscall(SYS_getcwd));
syscall(SYS_write, 0, buf, slen(buf));
syscall(SYS_write, 0, "\n", slen("\n"));
}
You're passing the wrong number of arguments to some of these system calls. In particular:
syscall(SYS_exit);
_exit() takes one argument: the exit code.
char* buf = __itoa(syscall(SYS_getcwd));
getcwd() takes two arguments: a pointer to a buffer to write the string to, and the length of that buffer. In practice, this probably looks something like:
char pathbuf[PATH_MAX];
syscall(SYS_getcwd, pathbuf, sizeof(pathbuf));
If you don't have the header which defines PATH_MAX, define it yourself. 4096 is an appropriate value.
Note that getcwd() writes a string into the buffer passed to it; it does not return a numeric identifier.
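Putting those pieces together, a currDir replacement might look like the sketch below. It assumes Linux, and the names PATH_BUF_SIZE, get_current_dir and print_current_dir are hypothetical. The string length is measured by hand because the assignment rules out string.h; the given slen could be used instead.

```c
#include <unistd.h>
#include <sys/syscall.h>

#define PATH_BUF_SIZE 4096 /* assumed stand-in for PATH_MAX */

/* Hypothetical helper: fills buf with the current directory.
 * Returns a positive value on success and -1 on failure. */
static long get_current_dir(char *buf, unsigned long size) {
    return syscall(SYS_getcwd, buf, size);
}

/* Hypothetical helper: writes the current directory and a newline to stdout. */
static void print_current_dir(void) {
    char pathbuf[PATH_BUF_SIZE];
    unsigned long len = 0;
    if (get_current_dir(pathbuf, sizeof(pathbuf)) < 0) {
        syscall(SYS_write, 2, "getcwd failed\n", 14);
        return;
    }
    while (pathbuf[len] != '\0') /* measure the string by hand */
        len++;
    syscall(SYS_write, 1, pathbuf, len);
    syscall(SYS_write, 1, "\n", 1);
}
```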
As an aside, you may want to save yourself some time by implementing a wrapper to write a string, e.g.
void putstring(const char *str) {
// file descriptor 1 is stdout; the question's code writes to 0, which is stdin
syscall(SYS_write, 1, str, slen(str));
}
since you seem to be doing that a lot.
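The skipped prompt from the question is not covered above. A likely explanation, offered as an assumption because slen(selection) is evaluated on an uninitialized array, is that the terminal hands over the digit and the newline together, the read consumes only part of that, and the leftover '\n' is returned by the next read. The sketch below reads from the descriptor one byte at a time until the end of the line, so nothing is left behind. read_selection and its return codes are hypothetical.

```c
#include <unistd.h>
#include <sys/syscall.h>

/* Hypothetical helper: reads one whole line from fd.
 * Returns a zero-based menu index (0 .. option_count-1),
 * -1 for an invalid line, or -2 on EOF/error before any input. */
static int read_selection(int fd, int option_count) {
    char c = 0, first = 0;
    int count = 0; /* characters seen on this line, capped at 2 */
    long n;
    /* consume everything up to and including the newline */
    while ((n = syscall(SYS_read, fd, &c, 1)) == 1 && c != '\n') {
        if (count == 0)
            first = c;
        if (count < 2)
            count++;
    }
    if (n != 1 && count == 0)
        return -2; /* nothing more to read */
    if (count != 1)
        return -1; /* empty line or more than one character */
    if (first < '1' || first >= '1' + option_count)
        return -1; /* not one of the menu digits */
    return first - '1';
}
```

In the menu loop this would be called as read_selection(0, 3), since standard input is file descriptor 0, and a negative result would be reported as invalid input instead of being used as an array index.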

LZW encoding for large file

I am building an LZW encoding algorithm that uses a dictionary with hashing, so that words already stored in the dictionary can be looked up quickly.
The algorithm gives proper results on smaller files (circa a few hundred symbols), but it crashes on larger files, especially those that contain few distinct symbols. The worst case is a file that consists of only 1 symbol, 'y' let's say: it crashes when the dictionary is not even close to being full. When the large input file consists of more than 1 symbol, the dictionary gets to approximately 90% full, but then it crashes as well.
Considering the structure of my algorithm, I am not quite sure what is causing it to crash in general, or to crash so soon when given a large file of just 1 symbol.
It must be something about the hashing (this is my first time doing it, so it might have some bugs).
The hash function I am using can be found here, and from what I have tested, it gives good results: oat_hash
The LZW encoding algorithm is based on this link, with a slight change: it runs only until the dictionary is full: LZW encoder
Let's get into code:
Note: oat_hash is changed so that it returns value % CAPACITY, so every index is within DICTIONARY
// Globals
#define CAPACITY 100000
char *DICTIONARY[CAPACITY];
unsigned short CODES[CAPACITY]; // CODES and DICTIONARY are linked via index: word from dictionary on index i, has its code in CODES on index i
int position = 0;
int code_counter = 0;
void encode(FILE *input, FILE *output){
int succ1 = fseek(input, 0, SEEK_SET);
if(succ1 != 0) printf("Error: file not open!");
int succ2 = fseek(output, 0, SEEK_SET);
if(succ2 != 0) printf("Error: file not open!");
//1. Working word = next symbol from the input
char *working_word = malloc(2048*sizeof(char));
char new_symbol = getc(input);
working_word[0] = new_symbol;
working_word[1] = '\0';
//2. WHILE(there are more symbols on the input) DO
//3. NewSymbol = next symbol from the input
while((new_symbol = getc(input)) != EOF){
char *workingWord_and_newSymbol= NULL;
char newSymbol[2];
newSymbol[0] = new_symbol;
newSymbol[1] = '\0';
workingWord_and_newSymbol = working_word_and_new_symbol(working_word, newSymbol);
int index = oat_hash(workingWord_and_newSymbol, strlen(workingWord_and_newSymbol));
//4. IF(WorkingWord + NewSymbol) is already in the dictionary THEN
if(DICTIONARY[index] != NULL){
// 5. WorkingWord += NewSymbol
working_word = working_word_and_new_symbol(working_word, newSymbol);
}
//6. ELSE
else{
//7. OUTPUT: code for WorkingWord
int idx = oat_hash(working_word, strlen(working_word));
fprintf(output, "%u", CODES[idx]);
//8. Add (WorkingWord + NewSymbol) into a dictionary and assign it a new code
if(!dictionary_full()){
DICTIONARY[index] = workingWord_and_newSymbol;
CODES[index] = code_counter + 1;
code_counter += 1;
working_word = strdup(newSymbol);
}else break;
}
//10. END IF
}
//11. END WHILE
//12. OUTPUT: code for WorkingWord
int index = oat_hash(working_word, strlen(working_word));
fprintf(output, "%u", CODES[index]);
free(working_word);
}
int index = oat_hash(workingWord_and_newSymbol, strlen(workingWord_and_newSymbol));
And later
int idx = oat_hash(working_word, strlen(working_word));
fprintf(output, "%u", CODES[idx]);
//8. Add (WorkingWord + NewSymbol) into a dictionary and assign it a new code
if(!dictionary_full()){
DICTIONARY[index] = workingWord_and_newSymbol;
CODES[index] = code_counter + 1;
code_counter += 1;
working_word = strdup(newSymbol);
}else break;
idx and index are unbounded and you use them to access a bounded array. You're accessing memory out of range. Here's a suggestion, but it may skew the distribution. If your hash range is much larger than CAPACITY it shouldn't be a problem. But you also have another problem which was mentioned, collisions, you need to handle them. But that's a different problem.
int index = oat_hash(workingWord_and_newSymbol, strlen(workingWord_and_newSymbol)) % CAPACITY;
// and
int idx = oat_hash(working_word, strlen(working_word)) % CAPACITY;
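As for the collisions mentioned above, one common way to handle them is open addressing with linear probing: store the key in each slot and, on lookup, compare keys and step to the next slot until a match or an empty slot is found. The sketch below is only an illustration under assumptions. dict_entry, hash_bytes and dict_find_slot are hypothetical names, FNV-1a stands in for oat_hash, and keys carry an explicit length so that zero bytes are allowed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical dictionary slot; key == NULL marks an empty slot. */
struct dict_entry {
    const unsigned char *key;
    size_t len;
    unsigned code;
};

/* FNV-1a, used here only as a stand-in for the question's oat_hash. */
static uint32_t hash_bytes(const unsigned char *p, size_t len) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Returns the slot that already holds key, or the first empty slot
 * where key could be stored. Returns -1 if the table is full and
 * key is not present. */
static long dict_find_slot(const struct dict_entry *table, size_t capacity,
                           const unsigned char *key, size_t len) {
    if (capacity == 0)
        return -1;
    size_t start = hash_bytes(key, len) % capacity;
    for (size_t i = 0; i < capacity; i++) {
        size_t slot = (start + i) % capacity; /* wrap around the table */
        if (table[slot].key == NULL)
            return (long)slot; /* empty: key is not in the table */
        if (table[slot].len == len && memcmp(table[slot].key, key, len) == 0)
            return (long)slot; /* same bytes: key found */
    }
    return -1;
}
```

With this shape, the "is it already in the dictionary" test becomes a check of whether the returned slot has a non-NULL key, not whether the hashed index happens to be occupied, which a colliding different word would also satisfy.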
LZW compression is certainly used to construct binary files and normally is capable of reading binary files.
The following code is problematic as it relies on new_symbol never being a \0.
newSymbol[0] = new_symbol; newSymbol[1] = '\0';
strlen(workingWord_and_newSymbol)
strdup(newSymbol)
This needs a rewrite to work with arrays of bytes rather than strings.
fopen() was not shown. Ensure the file is opened in binary mode: input = fopen(..., "rb");
@Wumpus Q. Wumbley is correct: use int newSymbol.
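To illustrate the int point: getc returns an int so that EOF can be told apart from every possible byte value. If the result is stored in a char first, a 0xFF byte can compare equal to EOF, or EOF may never match at all where char is unsigned. A minimal sketch follows; count_bytes is a hypothetical helper, and the stream is assumed to be opened in binary mode.

```c
#include <stdio.h>

/* Hypothetical helper: counts the bytes remaining in a binary stream. */
static long count_bytes(FILE *in) {
    int c; /* int, not char: holds every byte value 0..255 plus EOF */
    long count = 0;
    while ((c = getc(in)) != EOF)
        count++; /* zero bytes and 0xFF bytes are counted like any other */
    return count;
}
```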
Minor:
new_symbol and newSymbol are confusing.
Consider:
// char *working_word = malloc(2048*sizeof(char));
#define WORKING_WORD_N (2048)
char *working_word = malloc(WORKING_WORD_N*sizeof(*working_word));
// or
char *working_word = malloc(WORKING_WORD_N);

Resources