Searching for a combination of characters in a file - C

I am trying to create a program that reads a file and searches for a specific combination of characters.
For example: "/start/ 4jy42jygsfsf /end/".
So I want to find all the "strings" starting with /start/ and ending with /end/.
In order to do that, I use the read() function, because the file might be binary (it does not have to contain only text characters).
I call the read() function like this:
#define BUFFSIZE 4000
// more declarations
while (read(file_descriptor, buffer, BUFFSIZE) > 0)
{
    // search for /start/
    // then search for /end/
    // build a string with all the chars between these two
    // keep searching till you reach the end of buffer
}
Assume that every /start/ is followed by an /end/.
The question is:
How do I deal with cases where this combination of characters is cut in half?
For example, let's say that the first time read() is called I spot /star at the end of the buffer, and the next time read() is called the second buffer starts with t/ 4jy42jygsfsf /end/.
This combination might get cut anywhere. The solutions I have thought of would result in many, many lines of code. Is there a smart way to deal with all these cases?

When you reach the end of the buffer, record the state of the current partial match, if any. Then when you get the next buffer, you have 4 general cases:
Not inside any text to be matched.
Saw just a beginning / at the end of the last buffer.
Currently inside /start/. Another variable records how far you have matched.
Currently inside /end/. Same variable as for /start/ records how far you have matched.
Your states inside the matcher are generally:
Currently not matching anything.
Just saw a /; next you are looking for an 's' or an 'e'.
Matching either start/ or end/.
Matched either /start/ or /end/ in full.
Based on the partial match, just jump to the right state in the matcher; a sketch of such a matcher is below.
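For illustration, here is a minimal sketch of such a streaming matcher, assuming the markers are the literal strings /start/ and /end/ and that a capture fits in a fixed-size buffer. The struct plays the role of the saved state: which marker we are currently looking for, how far into it we have matched, and the text gathered so far. Because it survives across read() calls, a marker cut in half by a buffer boundary needs no special handling. The file name, sizes and function names are placeholders, and error handling is kept to a minimum.

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define BUFFSIZE 4000
#define MAXCAP   4096                    /* assumed upper bound on one capture */

static const char START[] = "/start/";
static const char END[]   = "/end/";

/* Matcher state that survives across read() calls. */
struct matcher {
    int    capturing;                    /* 0: looking for /start/, 1: looking for /end/ */
    size_t midx;                         /* how many characters of the marker matched so far */
    char   cap[MAXCAP];                  /* text gathered between the two markers */
    size_t caplen;
};

static void feed(struct matcher *m, const char *buf, ssize_t n)
{
    for (ssize_t i = 0; i < n; i++) {
        char c = buf[i];
        const char *marker = m->capturing ? END : START;

        if (c == marker[m->midx]) {
            if (marker[++m->midx] == '\0') {         /* full marker seen */
                if (m->capturing) {
                    m->cap[m->caplen] = '\0';
                    printf("captured: %s\n", m->cap);
                    m->caplen = 0;
                }
                m->capturing = !m->capturing;
                m->midx = 0;
            }
        } else {
            /* The partially matched marker characters were ordinary text.
             * Re-testing only the current character is enough here because
             * '/' never occurs in the interior of either marker; a general
             * pattern would need KMP-style failure links. */
            if (m->capturing && m->caplen + m->midx < MAXCAP - 1) {
                memcpy(m->cap + m->caplen, marker, m->midx);
                m->caplen += m->midx;
            }
            m->midx = 0;
            if (c == marker[0]) {
                m->midx = 1;
            } else if (m->capturing && m->caplen < MAXCAP - 1) {
                m->cap[m->caplen++] = c;
            }
        }
    }
}

int main(void)
{
    int fd = open("input.bin", O_RDONLY);            /* hypothetical input file */
    if (fd < 0)
        return 1;

    char buffer[BUFFSIZE];
    struct matcher m = {0};
    ssize_t n;

    while ((n = read(fd, buffer, BUFFSIZE)) > 0)
        feed(&m, buffer, n);

    close(fd);
    return 0;
}

Note that the buffer contents are treated purely as bytes, so this also works on binary files; only the printout of the capture assumes printable text.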
OR
You can use the PCRE library. It supports partial matching, but it is probably overkill for your purposes.

Related

Is it possible to count the frequency of a word in a file precisely using two buffers in C?

I have a file of size 1 GB. I want to find out how many times the word "sosowhat" occurs in the file. I wrote code using fgetc(), which reads one character at a time from the file, and that is far too slow for a 1 GB file. So I made a buffer of size 1000 (using malloc) to hold 1000 characters at a time from the file, and I used the strstr() function to count the occurrences of the word "sosowhat". The logic is fine. But the problem is that if the "so" part of "sosowhat" is located at the end of one buffer and the "sowhat" part at the start of the new buffer, the word will not be counted. So I used two buffers, old_buffer and current_buffer. At the beginning of each new buffer I want to check against the last few characters of the old buffer. Is this possible? How can I go back to the old buffer? Is it possible without memmove()? As a beginner, I will be more than happy for your help.
Yes, it can be done. There are several possible approaches.
The first one, which is the cleanest, is to keep a second buffer, as suggested, of the length of the searched word, where you keep the last chunk of the old buffer. (It needs to be exactly the length of the searched word, because you store wordLength - 1 characters plus the NUL terminator.) Then the quickest way is to append to this stored chunk the first wordLength - 1 characters of the new buffer and search for your word there, and then continue your search in the new buffer normally. Of course, you can also create a single buffer that holds both chunks (the last bytes of the old buffer and the first bytes of the new one).
Another approach (which I don't recommend, but which can turn out to be a bit easier in terms of code) would be to fseek wordLength - 1 bytes backwards in the file after each read. This effectively moves the chunk stored in the previous approach into the next buffer. It is a bit dirtier, since you read some of the file's contents twice. Although that is not noticeable in terms of performance, I would still recommend something like the first approach.
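A minimal sketch of the first approach, using fread() and strstr() and carrying the last wordLength - 1 bytes of each chunk over to the next one. It does use one small memmove() of at most wordLength - 1 bytes per chunk, which is cheap; the chunk size is arbitrary, and embedded NUL bytes are assumed not to occur, since strstr() works on terminated strings.

#include <stdio.h>
#include <string.h>

#define CHUNK 4096                       /* arbitrary chunk size for this sketch */

/* Count occurrences of word in the file at path. The last wlen - 1 bytes of
 * each chunk are carried over, so a match split across a chunk boundary is
 * still seen; no full match fits inside the carried part, so nothing is
 * counted twice. */
static long count_word(const char *path, const char *word)
{
    size_t wlen = strlen(word);
    if (wlen == 0 || wlen > CHUNK)
        return -1;

    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    char buf[CHUNK + 1];
    size_t carry = 0;                    /* bytes kept from the previous chunk */
    long count = 0;
    size_t n;

    while ((n = fread(buf + carry, 1, CHUNK - carry, fp)) > 0) {
        size_t total = carry + n;
        buf[total] = '\0';               /* strstr needs a terminated string */

        for (char *p = buf; (p = strstr(p, word)) != NULL; p++)
            count++;

        /* keep the last wlen - 1 bytes for the next pass */
        carry = (total >= wlen) ? wlen - 1 : total;
        memmove(buf, buf + total - carry, carry);
    }

    fclose(fp);
    return count;
}

Called as count_word("file.txt", "sosowhat"), it also counts overlapping occurrences, because the scan resumes one character past each hit.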
Use the same algorithm as with fgetc(), only reading from the buffers you created. It will be just as efficient, since strstr() iterates through the string character by character anyway.

How to wildcard search with capture in C?

I'm trying to write a routine in C to capture sequences of characters in a string argument. The pattern, in addition to literal characters, can contain ? meaning exactly one character and * meaning zero or more characters (lazy).
e.g.
string: ok1ok1234567890
match: *(ok?2*)4*
The result should be the position of the match = 3 and the length of the match = 5.
I have tried numerous ways of doing this, have put it aside, come back to it, put it aside again, and so on. I cannot crack it. It needs to be a purely C solution and it must be able to handle multiple captures.
e.g. (*)(ok??)3(4*)8*
Every solution I come up with works in many cases but not all. I'm hoping someone somewhere might have done this already or has an insight into how it can be done.
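Not a complete answer, but here is a small recursive sketch of the matching core with lazy semantics, where ( and ) in the pattern record capture positions. It assumes the pattern is well formed, covers the whole subject (use a leading and trailing * for an unanchored search), and that groups are not nested; the names and the group limit are made up for the example.

#include <stdio.h>

#define MAXGROUPS 8                      /* assumed limit on capture groups */

static int cap_start[MAXGROUPS];
static int cap_end[MAXGROUPS];

/* Lazy wildcard match: ? = exactly one character, * = zero or more (shortest
 * expansion tried first), ( ) = record a capture. Returns 1 if p matches all
 * of s; capture offsets are relative to base. */
static int match_here(const char *p, const char *s, const char *base, int gidx)
{
    if (*p == '\0')
        return *s == '\0';
    if (*p == '(') {                     /* group opens at the current position */
        cap_start[gidx] = (int)(s - base);
        return match_here(p + 1, s, base, gidx);
    }
    if (*p == ')') {                     /* group closes; move on to the next index */
        cap_end[gidx] = (int)(s - base);
        return match_here(p + 1, s, base, gidx + 1);
    }
    if (*p == '*') {
        for (;;) {                       /* lazy: consume as little as possible */
            if (match_here(p + 1, s, base, gidx))
                return 1;
            if (*s == '\0')
                return 0;
            s++;
        }
    }
    if (*s == '\0')
        return 0;
    if (*p == '?' || *p == *s)
        return match_here(p + 1, s + 1, base, gidx);
    return 0;
}

int main(void)
{
    const char *s = "ok1ok1234567890";

    if (match_here("*(ok?2*)4*", s, s, 0))
        printf("pos = %d, len = %d\n", cap_start[0], cap_end[0] - cap_start[0]);
    return 0;
}

On the example above this prints pos = 3, len = 5, matching the expected result. The matcher backtracks, so a failed attempt may write into the capture arrays, but the values written along the finally successful path always overwrite them.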

Getting and processing user input across multiple lines

I'm trying to get multiple lines of input from a user via stdin (although eventually, I'd like to be able to specify a file). The idea is that the user specifies inputs within matching "<" and ">". I'd like them to be able to invoke the program and then type as many of these inputs as they'd like across multiple lines until they terminate the input with Control-D.
So they could do:
"[this is a valid
input even thought it spans
three lines]" (using brackets instead of < so that my text doesn't disappear!)
I discovered I could use fscanf to easily process these kinds of fragments by using something like:
fscanf(fp, "<" "%999[^>]" ">", buffer);
But I'm having a lot of difficulty getting this to work across multiple lines, and I'm not entirely sure about the best way to loop through these inputs and put the strings between < and > into an array containing just the relevant strings.
I've done a bit of research, and people seem to have differing opinions on the use of fgets versus sscanf versus fscanf, and I'm not really sure what the merits of each are as they relate to my particular problem. I'm also not sure how to stop a newline from terminating the input (as it currently does). Should I be checking the number of matches that fscanf returns, or should I be looking for an EOF terminator?
Currently, my code looks like this (I was using matches earlier to check the number of items read, but have since removed that). Obviously it doesn't yet move anything into an array, for the sake of simplicity. Additionally, fp is currently stdin, but I'd like to keep it robust enough that I could simply point that pointer at a file in case I wanted to read from a file.
char buffer[1000];
while (true) {
    int matches = fscanf(fp, "<" "%999[^>]" ">", buffer);
    if (feof(fp)) break;
    printf("%s\n", buffer);
}
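For what it's worth, here is a sketch of the loop with the return value of fscanf() checked instead of feof(): a leading space in the format skips whitespace, newlines included, so an input that spans several lines is collected into one string. The 999 limit and the buffer size are carried over from the question.

#include <stdio.h>

int main(void)
{
    FILE *fp = stdin;                    /* could later be fopen("input.txt", "r") */
    char buffer[1000];

    /* " <"      skip whitespace (including newlines), then require '<'
     * %999[^>]  collect everything up to, but not including, '>'
     * ">"       consume the closing '>'                               */
    while (fscanf(fp, " <%999[^>]>", buffer) == 1)
        printf("read: %s\n", buffer);

    if (!feof(fp))
        fprintf(stderr, "input did not match the <...> form\n");

    return 0;
}

One limitation: an empty <> pair makes the %[ conversion fail, so the loop treats it as malformed input rather than as an empty string.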

How to read the last n lines from a file in C

It's a Microsoft interview question:
Read the last n lines of a file using C (precisely).
Well, there could be many ways to achieve this; a few of them could be:
-> Simplest of all: in a first pass, count the number of lines in the file, and in a second pass display the last n lines.
-> Or maybe maintain a doubly linked list of lines and display the last n lines by traversing backwards from the tail to the nth-last node.
-> Implement something along the lines of tail -n fname.
-> To optimize it more, we could have an array of n pointers, with every line stored dynamically in round-robin fashion until we reach the end of the file.
For example, if there are 10 lines in the file and we want to read the last 3 lines, we could create an array of buffers buf[3] and, at run time, keep mallocing and freeing the buffers in a circular way until we reach the last line, keeping a counter to track the current index into the array.
Can anyone help me with a more optimized solution, or at least tell me whether any of the above approaches would give the correct answer, or point me to any other popular approach/method for this kind of question?
You can use a queue to store the last n lines seen. When you reach EOF, just print the queue.
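A sketch of that idea with a fixed ring of n slots, so only the last n lines are ever held in memory. strdup() is POSIX, MAXLINE is an assumed upper bound on line length (longer lines would be split by fgets), and strdup failures are not handled, to keep it short.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXLINE 4096                     /* assumed maximum line length */

/* Print the last n lines of the file at path using a ring of n line slots. */
static int tail_lines(const char *path, size_t n)
{
    if (n == 0)
        return -1;

    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    char **ring = calloc(n, sizeof *ring);
    if (ring == NULL) {
        fclose(fp);
        return -1;
    }

    char line[MAXLINE];
    size_t count = 0;                    /* total number of lines read */

    while (fgets(line, sizeof line, fp) != NULL) {
        size_t slot = count % n;         /* overwrite the oldest stored line */
        free(ring[slot]);
        ring[slot] = strdup(line);
        count++;
    }

    size_t first = (count > n) ? count - n : 0;
    for (size_t i = first; i < count; i++)
        fputs(ring[i % n], stdout);

    for (size_t i = 0; i < n; i++)
        free(ring[i]);
    free(ring);
    fclose(fp);
    return 0;
}

For example, tail_lines("big.log", 10) prints the last 10 lines while never holding more than 10 of them in memory.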
Another way is to read blocks of 1024 bytes from the end of the file towards the beginning. Stop when you have found n '\n' characters and print out the last n lines.
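A sketch of that backwards scan: seek to the end, read fixed-size blocks moving towards the beginning, and stop once n + 1 newlines have been seen, which marks the byte where the last n lines begin (this assumes the file ends with a newline). It uses long offsets, so very large files would need fseeko() and off_t.

#include <stdio.h>

#define BLOCK 1024

/* Return the offset at which the last n lines begin, 0 if the file has fewer
 * than n lines, or -1 on a read error. Assumes the file ends with '\n'. */
static long tail_offset(FILE *fp, int n)
{
    char buf[BLOCK];
    int newlines = 0;

    fseek(fp, 0, SEEK_END);
    long pos = ftell(fp);

    while (pos > 0) {
        long chunk = (pos < BLOCK) ? pos : BLOCK;
        pos -= chunk;
        fseek(fp, pos, SEEK_SET);
        if (fread(buf, 1, (size_t)chunk, fp) != (size_t)chunk)
            return -1;

        for (long i = chunk - 1; i >= 0; i--)            /* walk the block backwards */
            if (buf[i] == '\n' && ++newlines > n)
                return pos + i + 1;                      /* byte after the (n+1)-th '\n' from the end */
    }
    return 0;                                            /* fewer than n lines: whole file */
}

After that, fseek() to the returned offset and copy everything up to EOF to the output.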
You can have two file pointers, both initially pointing to the beginning of the file.
Keep advancing the first pointer until it finds a '\n' character, storing the file position each time a '\n' is found.
Once the first pointer has found the (n+1)th '\n', move the second file pointer to the earliest stored position, and keep doing the same until EOF.
So when the first file pointer reaches EOF, the second one is n '\n' characters behind it. Then print all characters from the second file pointer to EOF.
This is a solution that can print the last n lines of the file in a single pass.
How about using a memory-mapped file and scanning the file backwards? This eliminates the hard work of updating the buffer window every time a line happens to be longer than your buffer space. Then, when you find a '\n', push its position onto a stack. This works in O(L), where L is the number of characters to output, so there is not really anything better than that, is there?

Reading parts of a file after a specific tag is found in C using fgets

I would like some suggestions on how to read an XML-like file in such a way that the program only reads/stores elements found in a node that meets some requirements. I was thinking about using two fgets calls in the following way:
while (fgets(file_buffer, line_buffer, fp) != NULL)
{
    if ((p_str = strstr(file_buffer, "<element of interest opening")) != NULL)
    {
        // new fgets loop that starts at fp and runs only until the end of the node
        {
            // read and process
        }
    }
}
Does this make sense or are there smarter ways of doing this?
Secondly (in my idea), would I have to define a new FILE* (say fr) and set fr to fp at the start of the second fgets, or can I somehow reuse the original file pointer for that?
Use an XML parser like libxml2: http://xmlsoft.org/xml.html
Your approach isn't bad for the job.
You could read a whole line from the file, then process it using sscanf, strstr, or whatever functions you like. This will save you time and unnecessary FILE I/O overhead.
As for your second idea, you can use fseek() (refer to man fseek) or rewind() (refer to man rewind) with the same file pointer fp; you do not need an extra file pointer. A sketch of the line-by-line approach follows.
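Along those lines, a sketch of the single-loop version: a flag is set when the opening tag is seen and cleared at the closing tag, so no second file pointer (and no fseek) is needed. The tag names, the file name and the buffer size are placeholders, and each tag is assumed to fit on one line.

#include <stdio.h>
#include <string.h>

#define LINE_LEN 1024                    /* assumed maximum line length */

int main(void)
{
    FILE *fp = fopen("data.xml", "r");   /* hypothetical input file */
    if (fp == NULL)
        return 1;

    char line[LINE_LEN];
    int inside = 0;                      /* are we inside the node of interest? */

    while (fgets(line, sizeof line, fp) != NULL) {
        if (!inside && strstr(line, "<element_of_interest") != NULL)
            inside = 1;                  /* opening tag found: start processing */

        if (inside)
            fputs(line, stdout);         /* "read and process" would go here */

        if (inside && strstr(line, "</element_of_interest>") != NULL)
            inside = 0;                  /* closing tag found: stop processing */
    }

    fclose(fp);
    return 0;
}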
EDIT:
If you could change the tag format to adhere to proper XML structure, you would be able to use libxml2 and similar libraries.
If that's not possible, then you have to write your own parser.
A few pointers:
1. First, extract data from the file into a buffer. The size of the buffer, and whether it is dynamically or statically allocated, will depend on your specs.
2. Check whether the first non-whitespace character in the buffer is < (or whatever character your tags usually begin with). If not, you can just report an error and exit.
3. The tag name follows, up to the first whitespace, / or > character. Store it. Process the =, strings and so on as you wish.
4. If the next non-whitespace character is /, check that it is followed by > (or a similar pattern in your spec that marks the end of a tag). If so, you've finished parsing and can return your result. Otherwise, you've got a malformed tag and should exit with an error.
5. If the character is >, then you've found the end of the begin tag. The content follows.
6. Otherwise, what follows is an argument. Parse that, store the result, and continue at step 4.
7. Read the content until you find a < character.
8. If that character is followed by /, it's the end tag. Check that it is followed by the tag name and >. If yes, return the result; otherwise, report an error.
9. If you get here, you've found the beginning of a nested XML element. Parse it with this same algorithm and then continue at step 4 again.
Although it's quite a basic idea, I hope it helps you get started; a rough sketch along these lines is shown below.
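As a purely illustrative sketch of those steps, here is a tiny recursive parser over an in-memory buffer. It handles nesting and empty <tag/> elements, skips attributes rather than interpreting them, breaks on a '>' inside a quoted attribute value, and is in no way a real XML parser; names and limits are invented.

#include <stdio.h>
#include <string.h>
#include <ctype.h>

static const char *skip_ws(const char *p)
{
    while (isspace((unsigned char)*p))
        p++;
    return p;
}

/* Parse one <tag ...> content </tag> element starting at p.
 * Returns the position just after the element, or NULL on malformed input. */
static const char *parse_element(const char *p)
{
    char name[64];
    size_t n = 0;

    p = skip_ws(p);
    if (*p != '<')
        return NULL;
    p++;

    while (*p && *p != '>' && *p != '/' && !isspace((unsigned char)*p) && n < sizeof name - 1)
        name[n++] = *p++;                 /* collect the tag name */
    name[n] = '\0';

    while (*p && *p != '>') {             /* skip attributes */
        if (*p == '/' && p[1] == '>') {   /* <tag ... /> : empty element */
            printf("empty element: %s\n", name);
            return p + 2;
        }
        p++;
    }
    if (*p != '>')
        return NULL;
    p++;
    printf("enter: %s\n", name);

    for (;;) {
        p = skip_ws(p);
        if (*p == '\0')
            return NULL;                  /* ran out of input inside the element */
        if (*p == '<' && p[1] == '/') {   /* closing tag: must match the name */
            p += 2;
            if (strncmp(p, name, n) != 0 || p[n] != '>')
                return NULL;
            printf("leave: %s\n", name);
            return p + n + 1;
        }
        if (*p == '<') {                  /* a nested element: recurse */
            p = parse_element(p);
            if (p == NULL)
                return NULL;
        } else {                          /* plain text content: skip it */
            while (*p && *p != '<')
                p++;
        }
    }
}

int main(void)
{
    const char *doc = "<root a=\"1\"><item>hello</item><item/></root>";
    if (parse_element(doc) == NULL)
        fprintf(stderr, "malformed input\n");
    return 0;
}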
EDIT:
If you still want to reference the file contents through a pointer, consider using mmap().
If you combine mmap() with a bit of shared-memory IPC and adequate memory locking, you could write a parallel program that processes most of your files faster.
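A small sketch of the mmap() route (without the shared-memory or parallel part): map the whole file read-only and search the mapping directly. Since the mapping is not NUL-terminated, a bounded search is used instead of strstr(); the file name and the tag are placeholders.

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

/* Bounded substring search over a non-terminated memory area. */
static const char *find(const char *hay, size_t n, const char *needle)
{
    size_t nl = strlen(needle);
    if (nl == 0)
        return hay;
    while (n >= nl) {
        const char *hit = memchr(hay, needle[0], n - nl + 1);
        if (hit == NULL)
            return NULL;
        if (memcmp(hit, needle, nl) == 0)
            return hit;
        n -= (size_t)(hit - hay) + 1;
        hay = hit + 1;
    }
    return NULL;
}

int main(void)
{
    int fd = open("data.xml", O_RDONLY);             /* hypothetical input file */
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0)
        return 1;

    char *map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        return 1;

    const char *tag = find(map, (size_t)st.st_size, "<element_of_interest");
    if (tag != NULL)
        printf("tag found at offset %ld\n", (long)(tag - map));

    munmap(map, (size_t)st.st_size);
    close(fd);
    return 0;
}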
