Substring Search with Syscall - c

I'm trying to do a substring search using system calls: I open a file named on the command line and compare the remaining command-line arguments against its contents. I want to output the number of occurrences of each substring. For example, if I run ./a.out filename aa b, I am looking for the number of times aa and b occur in filename.
My code so far:
for(int num = 4; num < argc; num++)
{
int fp = open (argv[1], O_RDONLY);
int sizeofbar = strlen(argv[1]);
char *buf = (char*)malloc(sizeofbar+1);
int count = 0; //counter for output
char* string2 = argv[num];
int sizeofcompare = strlen(string2);
read(fp, buf, sizeofcompare);
while (strstr(buf, string2) != NULL)
{
count++;
buf++;
}

I think you need to do some initialization before entering your loop. Perhaps you want to take the first argument off first:
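filename = argv[1]; /* argv[0] is the program name */
argv += 2; /* argv now points at the first search word */
argc -= 2; /* argc is now the number of search words */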
int fp = open(...
Next, I'd pre-process the rest of the arguments to build a data structure to store it. You can use the argc value to determine how many words you'll need to track.
counts = (int *)calloc(argc, sizeof(int));
Note that this will initialize the values to zero for you too.
With everything set up, I'd then read through the entire file contents and compare them to the strings. The trick is in comparing multiple strings of different lengths at once. A simple yet inefficient method is to read the entire file contents in and then loop using strstr for each word that follows the filename. Another method would be mapping the file's contents into memory and scanning it directly (letting the operating system do the heavy lifting).
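The simple read-everything-then-strstr method can be sketched as follows. This is only a minimal sketch, not the original poster's code: the helper names count_occurrences and read_whole_file are made up for illustration, overlapping matches are counted (so "aa" occurs twice in "aaa"), and the file is assumed to contain no embedded NUL bytes, since strstr stops at the first one.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count (possibly overlapping) occurrences of needle in haystack. */
size_t count_occurrences(const char *haystack, const char *needle)
{
    size_t count = 0;
    if (*needle == '\0')
        return 0;                 /* an empty pattern would match everywhere */
    for (const char *p = strstr(haystack, needle); p != NULL;
         p = strstr(p + 1, needle))
        count++;
    return count;
}

/* Read a whole file with open/read into a NUL-terminated heap buffer.
   Returns NULL on failure; the caller frees the result. */
char *read_whole_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }
    char *buf = malloc((size_t)st.st_size + 1);
    if (buf == NULL) { close(fd); return NULL; }
    size_t used = 0;
    ssize_t n;
    /* read() may return fewer bytes than requested, so loop until done */
    while (used < (size_t)st.st_size &&
           (n = read(fd, buf + used, (size_t)st.st_size - used)) > 0)
        used += (size_t)n;
    close(fd);
    buf[used] = '\0';
    return buf;
}
```

In main, one would call read_whole_file(argv[1]) once and then count_occurrences(buf, argv[i]) for each i from 2 to argc - 1, printing one count per word.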

Related

How to read a .xyz file into an array of doubles?

I am new to C, coming from Python. I want to read a .xyz file into a dynamically sized array, to use for various calculations later on in the program. The file is formatted as follows:
Title
Comment
Symbol 0.000 0.000 0.000
Symbol 0.000 0.000 0.000
....
The two first lines are not needed, and should just be skipped. The "Symbol" part of the file are chemical symbols--e.g. H, Au, C, Mn--as the .xyz file format is used for storing 3D coordinates of atoms. They need to be ignored as well. I'm interested in the space separated decimal numbers. I therefore want to:
Skip the first two lines, or just ignore them in some way.
Skip the first part of each line until the first space.
Store the three columns of numbers (coordinates) in an array.
So far I have been able to open a file for reading, and then I've attempted to check how long the file is, so that the size of the array can change depending on how many coordinate sets need to be stored.
// Variable declaration
FILE *fp;
long file_size;
// Open file and error checking
fp = fopen ("file_name" , "r");
if(!fp) perror("file_name"), exit(1);
// Check file size
fseek(fp, 0, SEEK_END);
file_size = ftell(fp);
rewind(fp);
// Close file
fclose(fp);
I've been able to skip the first two lines using fscanf(fp, "%*[^\n]"), to skip to the end of the line. But, I haven't been able to figure out how to loop through the rest of the file, while storing only the decimal numbers in an array.
If I understand correctly, I need to allocate memory for the array, using something like malloc() in combination with my file_size and then copy the data into the array using fread().
Here is an example of the contents of an actual .xyz file:
10 atom system
Energy: -914941.6614699
Ag 0.96834 1.51757 0.02281
Ag 0.96758 -1.51824 -0.02206
Ag -1.80329 2.27401 0.03179
Ag -3.58033 0.00046 0.00126
Ag -1.80447 -2.27338 -0.03537
Ag -0.96581 0.02246 -1.51755
Ag -0.96929 -0.02231 1.51463
Ag 1.80613 0.03321 -2.27213
Ag 3.58027 0.00028 0.00206
Ag 1.80086 -0.03407 2.27455
Here is a general approach in C for reading a file into an array of cstrings (pointers to cstrings, so the rough equivalent of a Python list of strings).
int count = 0; // line counter;
int char_count = 0; // char counter;
int max_len = 0; // for storing the longest line length
int c; // for measuring each line length
char **str_ptr_arr; // array of pointers to c-string
//extract characters from the file, looking for endlines; note that
//the EOF check has to come AFTER the getc(fp) to work properly
for (c = getc(fp); c != EOF; c = getc(fp)) { //edit see comments
char_count += 1;
if (c == '\n') { //safe comparison see comments
count += 1;
if (max_len < char_count) {
max_len = char_count; //gets longest line
}
char_count = 0;
}
}
//should probably do an feof check here
rewind(fp);
So now you have the number of lines and the length of the longest line. (You can try using the above loop to exclude lines if you want, but it might be easier to read the whole thing into an array of c-strings and then process that into an array of doubles.) Now allocate the memory for the array of pointers to c-strings and for the c-strings themselves:
//allocate enough memory to hold all the strings in the file, by first
//allocating the arr of ptrs then a slot for each c-string pointed to:
str_ptr_arr = malloc(count * sizeof(char*)); //size of pointer
for (int i = 0; i < count; i++) {
str_ptr_arr[i] = malloc ((max_len + 1) * sizeof(char)); // +1 for '\0' terminate
}
rewind(fp); //rewind again;
Now we have a problem: how to populate these cstrings (Python is so much easier!). This works, though I'm not sure it's the expert approach: we read into a temporary buffer, then use strcpy to move the contents of the buffer into our allocated array slots:
for (int i = 0; i < count; i++) {
char buff[max_len + 1]; //local temporary buffer that can store any line in file
//read one whole line into the buffer; fscanf with "%s" would stop at the first space
if (fgets(buff, sizeof buff, fp) == NULL) buff[0] = '\0';
strcpy(str_ptr_arr[i], buff);
}
Note: this is a decent point at which to start excluding lines or removing various substrings from lines, as you can make strcpy conditional on the contents of the buffer by using other cstring functions. I'm fairly new at this myself (learning to write C functions for use in Python programs), but this seems to be a workable approach.
It might also be possible to go directly to a dynamically allocated array of doubles for storing your numerical data without bothering with the cstring array; that could be done in the last loop above. You could split the strings at whitespace, exclude the alphabetical parts, and use the standard library function atof (declared in stdlib.h), which converts a string to a double.
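As a rough illustration of pulling the numbers out of one line, here is a minimal sketch. It uses sscanf rather than splitting the string and calling atof, and the function name parse_xyz_line is hypothetical. The "%*s" conversion reads and discards the chemical symbol.

```c
#include <stdio.h>

/* Parse one .xyz data line such as "Ag 0.96834 1.51757 0.02281".
   The leading symbol is skipped with %*s.
   Returns 1 if three numbers were read, 0 otherwise. */
int parse_xyz_line(const char *line, double out[3])
{
    return sscanf(line, "%*s %lf %lf %lf", &out[0], &out[1], &out[2]) == 3;
}
```

With the sample file shown in the question, the title and comment lines fail to parse as three numbers, so they are rejected without being skipped explicitly. A title line that happens to look like a symbol followed by three numbers would be accepted, so skipping the first two lines first is still the safer choice.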
Edit: I should mention all these memory allocations must be manually freed when you are done with them, and this is the approach:
for(int i = 0; i < count; i++) { // free each allocated cstring space
free(str_ptr_arr[i]);
}
free(str_ptr_arr); // free the cstring pointer space
str_ptr_arr = NULL;
Given, for example:
#define STORAGE_INCREMENT 128
typedef struct
{
double x, y, z ;
} sXYZ ;
Then:
int atom_count = 0 ;
int atom_capacity = STORAGE_INCREMENT ;
sXYZ* atoms = malloc( atom_capacity * sizeof(*atoms) ) ;
// While valid triplet, discard symbol, get x,y,z
while( fscanf( fp, "%*s%lf%lf%lf", &atoms[atom_count].x,
&atoms[atom_count].y,
&atoms[atom_count].z ) == 3 )
{
// Increment count
atom_count++ ;
// If capacity exhausted, expand allocation
if( atom_count == atom_capacity )
{
atom_capacity += STORAGE_INCREMENT ;
sXYZ* bigger = realloc( atoms, atom_capacity * sizeof(*atoms) ) ;
if( bigger == NULL )
{
break ;
}
atoms = bigger ;
}
}
This allocates enough space for 128 atoms initially, and if the space is exhausted, it is expanded by a further 128 atoms - indefinitely. A smaller value can be used if the files typically have fewer atoms to be a little more memory efficient. This approach saves you having to first count the number of triplets in the file.

How to reverse text in a file in C?

I'm trying to get my text to be read back to front and printed in reverse order in that file, but my for loop doesn't seem to working. Also, my while loop is counting 999 characters even though it should be 800 and something (I can't remember exactly). I think it might be because there is an empty line between the two paragraphs, but then again there are no characters there.
Here is my code for the two loops -:
/*Reversing the file*/
char please;
char work[800];
int r, count, characters3;
characters3 = 0;
count = 0;
r = 0;
fgets(work, 800, outputfile);
while (work[count] != NULL)
{
characters3++;
count++;
}
printf("The number of characters to be copied is-: %d", characters3);
for (characters3; characters3 >= 0; characters3--)
{
please = work[characters3];
work[r] = please;
r++;
}
fprintf(outputfile, "%s", work);
/*Closing all the file streams*/
fclose(firstfile);
fclose(secondfile);
fclose(outputfile);
/*Message to direct the user to where the files are*/
printf("\n Merged the first and second files into the output file
and reversed it! \n Check the outputfile text inside the Debug folder!");
There are a couple of huge conceptual flaws in your code.
The very first one is that you state that it "doesn't seem to [be] working" without saying why you think so. Just running your code reveals what the problem is: you do not get any output at all.
Here is why. You reverse your string, and so the terminating zero comes at the start of the new string. You then print that string – and it ends immediately at the first character.
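Fix this by starting the loop one position earlier: decrement characters3 before the loop so that it indexes the last real character rather than the terminating zero.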
Next, why not print a few intermediate results? That way you can see what's happening.
string: [This is a test.
]
The number of characters to be copied is-: 15
result: [
.tset aa test.
]
Hey look, there seems to be a problem with the newline (it ends up at the start of the line). That is exactly what should happen, since it is part of the string, but it is probably not what you intended.
Apart from that, you can clearly see that the reversing itself is not correct!
The problem now is that you are reading and writing from the same string:
please = work[characters3];
work[r] = please;
You write the character at the end into position #0, decrease the end and increase the start, and repeat until done. So, the second half of reading/writing starts copying the end characters back from the start into the end half again!
Two possible fixes: 1. read from one string and write to a new one, or 2. adjust the loop so it stops copying after 'half' is done (since each iteration then swaps two characters, you only need to loop half the number of characters).
You also need to think more about what swapping means. As it is, your code overwrites a character in the string. To correctly swap two characters, you need to save one first in a temporary variable.
void reverse (FILE *f)
{
char please, why;
char work[800];
int r, count, characters3;
characters3 = 0;
count = 0;
r = 0;
fgets(work, 800, f);
printf ("string: [%s]\n", work);
while (work[count] != 0)
{
characters3++;
count++;
}
characters3--; /* do not count last zero */
characters3--; /* do not count the return */
printf("The number of characters to be copied is-: %d\n", characters3);
for (characters3; characters3 >= (count>>1); characters3--)
{
please = work[characters3];
why = work[r];
work[r] = please;
work[characters3] = why;
r++;
}
printf ("result: [%s]\n", work);
}
As a final note: you do not need to count the number of characters 'manually'; there is a function for that, strlen (declared in string.h). All that's needed instead of the count loop is this:
characters3 = strlen(work);
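Putting the pieces together (strlen, a temporary variable for the swap, and stopping at the middle), an in-place reversal might look like the sketch below. The function name reverse_line is made up here, and leaving a trailing newline in place is an assumption about what the asker wants.

```c
#include <string.h>

/* Reverse s in place, leaving a trailing newline (if any) where it is. */
void reverse_line(char *s)
{
    size_t len = strlen(s);
    if (len > 0 && s[len - 1] == '\n')
        len--;                       /* keep the newline at the end */
    if (len < 2)
        return;                      /* nothing to swap */
    for (size_t i = 0, j = len - 1; i < j; i++, j--) {
        char tmp = s[i];             /* save one side before overwriting it */
        s[i] = s[j];
        s[j] = tmp;
    }
}
```

The two indices walk toward each other and stop when they meet, so each pair of characters is swapped exactly once.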
Here's a complete and heavily commented function that will take in a filename to an existing file, open it, then reverse the file character-by-character. Several improvements/extensions could include:
Add an argument to adjust the maximum buffer size allowed.
Dynamically increase the buffer size as the input file exceeds the original memory.
Add a strategy for recovering the original contents if something goes wrong when writing the reversed characters back to the file.
#include <stdint.h> // uint32_t, int64_t
#include <stdio.h>
// naming convention of l_ for local variable and p_ for pointers
// Returns 1 on success and 0 on failure
int reverse_file(char *filename) {
FILE *p_file = NULL;
// r+ enables read & write, preserves contents, starts pointer p_file at beginning of file, and will not create a
// new file if one doesn't exist. Consider a nested fopen(filename, "w+") if creation of a new file is desired.
p_file = fopen(filename, "r+");
// Exit with failure value if file was not opened successfully
if(p_file == NULL) {
perror("reverse_file() failed to open file.");
return 0; // nothing to close here: passing NULL to fclose() is undefined behavior
}
// Assumes entire file contents can be held in volatile memory using a buffer of size l_buffer_size * sizeof(char)
uint32_t l_buffer_size = 1024;
char l_buffer[l_buffer_size]; // buffer of char; fgetc() itself returns an int so that EOF can be told apart from data
// Cursor for moving within the l_buffer
int64_t l_buffer_cursor = 0;
// Temporary storage for current char from file
// fgetc() returns the character read as an unsigned char cast to an int or EOF on end of file or error.
int l_temp;
for (l_buffer_cursor = 0; (l_temp = fgetc(p_file)) != EOF; ++l_buffer_cursor) {
// Verify our assumption that the file can completely fit in volatile memory <= l_buffer_size * sizeof(char)
// is still valid BEFORE storing, so we never write past the end of l_buffer. Return an error otherwise.
if (l_buffer_cursor >= l_buffer_size) {
fprintf(stderr, "reverse_file() in memory buffer size of %u char exceeded. %s is too large.\n",
l_buffer_size, filename);
fclose(p_file);
return 0;
}
// Store the current char into our buffer in the original order from the file
l_buffer[l_buffer_cursor] = (char)l_temp; // convert the int returned by fgetc() back to char
}
// At the conclusion of the for loop, l_buffer contains a copy of the file in memory and l_buffer_cursor points
// to the index 1 past the final char read in from the file. Decrement l_buffer_cursor by 1 so that it indexes
// the final char before proceeding. Nothing is written here: appending '\0' would leave a stray byte in the file.
--l_buffer_cursor;
// To reverse the file contents, reset the p_file cursor to the beginning of the file then write data to the file by
// reading from l_buffer in reverse order by decrementing l_buffer_cursor.
// NOTE: A less verbose/safe alternative to fseek is: rewind(p_file);
if ( fseek(p_file, 0, SEEK_SET) != 0 ) {
fclose(p_file);
return 0;
}
for (l_temp = 0; l_buffer_cursor >= 0; --l_buffer_cursor) {
l_temp = fputc(l_buffer[l_buffer_cursor], p_file); // write buffered char to the file, advance f_open pointer
if (l_temp == EOF) {
fprintf(stderr, "reverse_file() failed to write %c at index %lld back to the file %s.\n",
l_buffer[l_buffer_cursor], (long long)l_buffer_cursor, filename);
}
}
fclose(p_file);
return 1;
}

Cannot get Call to Function Working

I found this piece of code at Reading a file character by character in C; it compiles and is what I wish to use. My problem is that I cannot get the call to it to work properly. The code is as follows:
char *readFile(char *fileName)
{
FILE *file = fopen(fileName, "r");
char *code;
size_t n = 0;
int c;
if (file == NULL)
return NULL; //could not open file
code = malloc(1500);
while ((c = fgetc(file)) != EOF)
{
code[n++] = (char) c;
}
code[n] = '\0';
return code;
}
I am not sure of how to call it. Currently I am using the following code to call it:
.....
char * rly1f[1500];
char * RLY1F; // This is the Input File Name
rly1f[0] = readFile(RLY1F);
if (rly1f[0] == NULL) {
printf ("NULL array); exit;
}
int n = 0;
while (n++ < 1000) {
printf ("%c", rly1f[n]);
}
.....
How do I call the readFile function such that I have an array (rly1f) which is not NULL? The file RLY1F exists and has data in it. I have successfully opened it previously using 'in line code' not a function.
Thanks
The error you're experiencing is that you forgot to pass a valid filename. So either the program crashes, or fopen tries to open a garbage name and returns NULL.
char * RLY1F; // This is not initialized!
RLY1F = "my_file.txt"; // initialize it!
The next problem you'll have will be in your loop to print the characters.
You have defined an array of pointers char * rly1f[1500];
You read 1 file and store it in the first pointer of the array rly1f[0]
But when you display it you display the pointer values as characters which is not what you want. You should just do:
while (n < 1000) {
printf ("%c", rly1f[0][n]);
n++;
}
note: that would not crash but would print trash if the file read is shorter than 1000.
(BLUEPIXY suggested the post-incrementation fix for n BTW or first character is skipped)
So do it more simply since your string is nul-terminated, pass the array to puts:
puts(rly1f[0]);
EDIT: you have a problem when reading your file too. You malloc 1500 bytes, but you read the file fully. If the file is bigger than 1500 bytes, you get buffer overflow.
You have to compute the length of the file before allocating the memory. For instance like this (using stat would be a better alternative maybe):
char *readFile(char *fileName, unsigned int *size) {
...
fseek(file,0,SEEK_END); // set pos to end of file
*size = ftell(file); // get pos, i.e. size
rewind(file); // set pos to 0
code = malloc(*size+1); // allocate the proper size plus one
notice the extra parameter which allows you to return the size as well as the file data.
Note: on Windows systems, text files use \r\n (CRLF) to delimit lines, so the allocated size will be higher than the number of characters read if you use text mode (each \r\n is converted to \n, so there are fewer chars in your buffer; you could consider a realloc once you know the exact size to shave off the unused allocated space).
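A complete version of the size-first approach could look like the following sketch. The name read_file_sized is hypothetical. Unlike the fragment above, it opens the file in binary mode ("rb") so that the size reported by ftell matches the number of bytes fread returns, and it uses size_t for the size; both are assumptions of this sketch and not part of the original answer.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read an entire file into a NUL-terminated heap buffer.
   On success returns the buffer and stores the number of bytes read in *size;
   on failure returns NULL. The caller frees the buffer. */
char *read_file_sized(const char *file_name, size_t *size)
{
    FILE *file = fopen(file_name, "rb");
    if (file == NULL)
        return NULL;                          /* could not open file */

    char *code = NULL;
    long length;
    if (fseek(file, 0, SEEK_END) == 0 && (length = ftell(file)) >= 0) {
        rewind(file);
        code = malloc((size_t)length + 1);    /* +1 for the terminator */
        if (code != NULL) {
            *size = fread(code, 1, (size_t)length, file);
            code[*size] = '\0';
        }
    }
    fclose(file);
    return code;
}
```

A caller would then check the result for NULL, print it with puts, and free it when done.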

Need help reading from a file in C

I have been looking around but cannot find a solution to my question, so I will ask it here. I am working in C, reading in a .txt file, storing all of its values in an array, and then doing various tasks with them. My problem is that no matter what I do, I cannot get the file pointer I create to point to the file. I have done this for projects in the past, and I have compared that code with the current code and cannot see the issue. The filename needs to be read in from the command line as well. I think there is something wrong with what I'm passing through the command line, but I am not sure. I have stepped through the program and the filename is being passed correctly, but when it tries to open the file I get a null pointer, so there is something I'm missing.
The text file will contain a series of numbers, the first number will be the number of numbers in the file after that first number. (So if the number is 10 then there will be ten numbers after 10 is read in) after that first number the remaining numbers will be 0-9 in a random order.
Below is my current chunk of code only involving reading of the file and storing its data. (I already know the array will be of size 10 which is why the array is declared with that size.)
int main(int argc, char *argv[])
{
char* filename = "numbers.txt";
int arr[10];
int numElem;
int indexDesired = 0;
FILE *fp;
fp = fopen(filename, "r"); // open file begin reading
if (!fp)
{
printf("The required file parameter name is missing\n");
system("pause");
exit(EXIT_FAILURE);
}
else
{
fscanf(fp, "%d", &numElem); //scans for the first value which will tell the number of values to be stored in the array
int i = 0;
int num;
while (i <= numElem) //scans through and gets the all the values and stores them in the array.
{
fscanf(fp, "%d", &num);
arr[i] = num;
i++;
}
fclose(fp);
}
}
***note: My sort and swap method work perfectly so I have omitted them from the code as the error happens before they are even called.
You said:
The filename needs to be read in from the command line as well.
However, you are using:
char* filename = "numbers.txt";
and
fp = fopen(filename, "r"); // open file begin reading
No matter what you are passing in the command line, the file you are trying to open is "numbers.txt".
Things to try:
Use the full path name of "numbers.txt" instead of just the name of the file.
char* filename = "C:\\My\\Full\\Path\\numbers.txt";
If that doesn't work, you will probably have to deal with permissions issues.
Pass the file name from the command line, using the full path. That should work if there are no permissions issues.
if ( argc < 2 )
{
// Deal with unspecified file name.
}
char* filename = argv[1];
Pass the relative path of the file name. If you are testing your program from Visual Studio, you have to make sure that you use the path relative to the directory from where Visual Studio launches your program.
while (i <= numElem)
should be
while (i < numElem)
because the value scanned by fscanf(fp, "%d", &numElem); is the number of elements that follow.
Note that arrays in C are indexed from 0, so if numElem is 10, arr goes from arr[0] to arr[9] and arr[10] does not exist; writing to it is undefined behavior.
Also, you should check that numElem is no greater than 10 (the size of arr) before the while (i < numElem) loop.
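Both points (the loop bound and the capacity check) can be combined in a small helper. This is a sketch under the assumption that the file has the layout described in the question; the name read_numbers and the -1 error convention are made up for illustration.

```c
#include <stdio.h>

/* Read a count followed by that many integers from fp into arr.
   Returns the number of values stored, or -1 if the count is missing,
   negative, larger than capacity, or the file runs out of numbers. */
int read_numbers(FILE *fp, int *arr, int capacity)
{
    int num_elem;
    if (fscanf(fp, "%d", &num_elem) != 1 || num_elem < 0 || num_elem > capacity)
        return -1;
    for (int i = 0; i < num_elem; i++) {      /* i < num_elem, not <= */
        if (fscanf(fp, "%d", &arr[i]) != 1)
            return -1;
    }
    return num_elem;
}
```

Checking the return value of every fscanf call also catches a truncated or malformed file instead of silently storing stale values.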

How would I compare a string (entered by the user) to the first word of a line in a file?

I am really struggling to understand how character arrays work in C. This seems like something that should be really simple, but I do not know what function to use, or how to use it.
I want the user to enter a string, and I want to iterate through a text file, comparing this string to the first word of each line in the file.
By "word" here, I mean substring that consists of characters that aren't blanks.
Help is greatly appreciated!
Edit:
To be more clear, I want to take a single input and search for it in a database of the form of a text file. I know that if it is in the database, it will be the first word of a line, since that is how to database is formatted. I suppose I COULD iterate through every single word of the database, but this seems less efficient.
After finding the input in the database, I need to access the two words that follow it (on the same line) to achieve the program's ultimate goal (which is computational in nature)
Here is some code that will do what you are asking. I think it will help you understand how string functions work a little better. Note - I did not make many assumptions about how well conditioned the input and text file are, so there is a fair bit of code for removing whitespace from the input, and for checking that the match is truly "the first word", and not "the first part of the first word". So this code will not match the input "hello" to the line "helloworld 123 234" but it will match to "hello world 123 234". Note also that it is currently case sensitive.
#include <ctype.h> // isspace
#include <stdio.h>
#include <string.h>
int main(void) {
char buf[100]; // declare space for the input string
FILE *fp; // pointer to the text file
char fileBuf[256]; // space to keep a line from the file
int ii, ll;
printf("give a word to check:\n");
fgets(buf, 100, stdin); // fgets prevents you reading in a string longer than buffer
printf("you entered: %s\n", buf); // check we read correctly
// see (for debug) if there are any odd characters:
printf("In hex, that is ");
ll = strlen(buf);
for(ii = 0; ii < ll; ii++) printf("%2X ", (unsigned char)buf[ii]);
printf("\n");
// probably see a newline at the end (fgets keeps it). Get rid of it!
// note I could have used the result that ii is strlen(buf) but
// that makes the code harder to understand
for(ii = strlen(buf) - 1; ii >=0; ii--) {
if (isspace((unsigned char)buf[ii])) buf[ii]='\0';
}
// open the file:
if((fp=fopen("myFile.txt", "r"))==NULL) {
printf("cannot open file!\n");
return 0;
}
while( fgets(fileBuf, 256, fp) ) { // read in one line at a time until eof
printf("line read: %s", fileBuf); // show we read it correctly
// find whitespace: we need to keep only the first word.
ii = 0;
while(ii < 255 && fileBuf[ii] != '\0' && !isspace((unsigned char)fileBuf[ii])) ii++; // also stop at the terminator
// now compare input string with first word from input file:
if (strlen(buf)==ii && strstr(fileBuf, buf) == fileBuf) {
printf("found a matching line: %s\n", fileBuf);
break;
}
}
// when you get here, fileBuf will contain the line you are interested in
// the second and third word of the line are what you are really after.
}
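The first-word test at the heart of the program above can also be isolated into a small function, which makes it easier to check on its own. This is a minimal sketch; the name first_word_is is hypothetical, and it uses strncmp instead of the strstr comparison in the answer.

```c
#include <ctype.h>
#include <string.h>

/* Return 1 if the first whitespace-delimited word of line equals word. */
int first_word_is(const char *line, const char *word)
{
    size_t n = 0;
    /* measure the first word: stop at whitespace or at the terminator */
    while (line[n] != '\0' && !isspace((unsigned char)line[n]))
        n++;
    return strlen(word) == n && strncmp(line, word, n) == 0;
}
```

As in the answer, the comparison is case sensitive, and "hello" matches the line "hello world 123 234" but not "helloworld 123 234".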
Your recent update states that the file is really a database, in which you are looking for a word. This is very important.
If you have enough memory to hold the whole database, you should do just that (read the whole database and arrange it for efficient searching), so you should probably not ask about searching in a file.
Good database designs involve data structures like tries and hash tables. But for a start, you could use the most basic improvement: holding the words in alphabetical order (use the somewhat tricky qsort function to achieve that).
struct Database
{
size_t count;
struct Entry // not sure about C syntax here; I usually code in C++; sorry
{
char *word;
char *explanation;
} *entries;
};
char *find_explanation_of_word(struct Database* db, char *word)
{
for (size_t i = 0; i < db->count; i++)
{
int result = strcmp(db->entries[i].word, word);
if (result == 0)
return db->entries[i].explanation;
else if (result > 0)
break; // if the database is sorted, this means word is not found
}
return NULL; // not found
}
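Since qsort is described as somewhat tricky, here is a hedged sketch of how the sorting step might look. It uses a standalone struct Entry with const pointers instead of the nested definition above, and the helper names compare_entries, sort_entries and lookup are made up. It also swaps the linear scan for the standard bsearch function, which is possible once the entries are sorted.

```c
#include <stdlib.h>
#include <string.h>

struct Entry {
    const char *word;
    const char *explanation;
};

/* Comparison callback for qsort/bsearch: order entries by their word. */
static int compare_entries(const void *a, const void *b)
{
    const struct Entry *ea = a;
    const struct Entry *eb = b;
    return strcmp(ea->word, eb->word);
}

/* Sort the entries alphabetically by word. */
void sort_entries(struct Entry *entries, size_t count)
{
    qsort(entries, count, sizeof entries[0], compare_entries);
}

/* Binary search in a sorted array; returns NULL if word is not present. */
const char *lookup(const struct Entry *entries, size_t count, const char *word)
{
    struct Entry key = { word, NULL };
    const struct Entry *hit =
        bsearch(&key, entries, count, sizeof entries[0], compare_entries);
    return hit ? hit->explanation : NULL;
}
```

The tricky part is that the callback receives pointers to the array elements as const void *, so they have to be converted back to pointers to struct Entry before the words can be compared.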
If your database is too big to hold in memory, you should use a trie that holds just the beginnings of the words in the database; for each beginning of a word, have a file offset at which to start scanning the file.
char* find_explanation_in_file(FILE *f, long offset, char *word)
{
fseek(f, offset, SEEK_SET);
static char line[100]; // 100 should be greater than max line in file; static so the returned explanation outlives the call
while (fgets(line, sizeof(line), f))
{
char *word_in_file = strtok(line, " ");
char *explanation = strtok(NULL, "");
int result = strcmp(word_in_file, word);
if (result == 0)
return explanation;
else if (result > 0)
break;
}
return NULL; // not found
}
I think what you need is fseek().
1) Pre-process the database file as follows: find the positions of all the '\n' characters (newlines) and store the offset that follows each one in an array, say a, so that you know the ith line starts at offset a[i] from the beginning of the file.
2) fseek() is a library function declared in stdio.h. So, when you need to process an input string, start from the beginning of the file and check the first word only at the positions stored in the array a. To do that:
fseek(inFile , a[i] , SEEK_SET);
and then
fscanf(inFile, "%s %s %s", yourFirstWordHere, secondWord, thirdWord);
for checking the ith line.
Or, seeking relative to the current position instead of the start of the file, you could use:
fseek(inFile, a[i] - ftell(inFile), SEEK_CUR);
Explanation: What fseek() does is, it sets the read/write position indicator associated with the file at the desired position. So, if you know at which point you need to read or write, you can just go there and read directly or write directly. This way, you won't need to read whole lines just to get first three words.
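The pre-processing step could be sketched like this. The function name index_lines is hypothetical, and the sketch records offsets with ftell so that the values are valid arguments for a later fseek with SEEK_SET on the same stream.

```c
#include <stdio.h>
#include <stdlib.h>

/* Build an array holding the file offset at which each line starts.
   Stores the number of lines in *count; returns NULL on allocation failure.
   The caller frees the array. */
long *index_lines(FILE *fp, size_t *count)
{
    size_t capacity = 64, n = 0;
    long *offsets = malloc(capacity * sizeof *offsets);
    if (offsets == NULL)
        return NULL;

    rewind(fp);
    long pos = ftell(fp);            /* offset of the line about to be read */
    int at_line_start = 1;
    int c;
    while ((c = getc(fp)) != EOF) {
        if (at_line_start) {         /* first character of a new line */
            if (n == capacity) {
                capacity *= 2;
                long *bigger = realloc(offsets, capacity * sizeof *offsets);
                if (bigger == NULL) { free(offsets); return NULL; }
                offsets = bigger;
            }
            offsets[n++] = pos;
            at_line_start = 0;
        }
        if (c == '\n') {
            at_line_start = 1;
            pos = ftell(fp);         /* the next line starts right here */
        }
    }
    *count = n;
    return offsets;
}
```

After the index is built, fseek(fp, a[i], SEEK_SET) followed by a bounded fscanf such as "%15s" into a 16-byte buffer reads the first word of line i without scanning the lines before it.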
