Reading a text line using a given pattern - c

The question is quite simple, and I hope the answer it's simple too :)
Let's say you are given a pattern, and you have to read a text file, and every line is composed by a first digit, that indicates the number of "patterns" in the string, and the the patterns element separed by one space.
For example, given the pattern
key value
a valid line of the text file could be
3 10 "apple" 15 "orange" 17 "melon"
If the number N of repetitions is fixed, I'd use something like
fscanf(inFile,"%d %s",&n,str);
but is there a function that allows me to give the number of repetitions as a parameter, or I should scan each line and extract values I'm interested in, using substr and atoi?
The "trivial" way is obvious, I'm looking for something more "professional" and effective.

Use fscanf() in a loop: first extract the number of repetitions N, then loop N times extracting your pattern.
If you're looking for something more professional or sophisticated, you might want to move away from the standard C library and towards a regex or parsing library, or something mentioned heere: http://www.and.org/vstr/comparison. While I won't go so far as to say you can't do string processing easily or well in C, it's not a strong point of the core language.

Related

Best Search-Algorithm to search an Input String

I have a theoretical question.
I'm trying to find the best search algorithm to deal with the following problem: I want to process a input String, the input String can have up to 7 independent and valid arguments (parameters), which start different segments in my code. All parameters can appear at any position inside the input, which means the input does not have a order.
My first idea was to look at the input string and do a strcmp for all individual valid inputs at every position inside the string, which would mean that I would basically search the string until I find a match or reach the end of the input. However, this would have very bad runtime latency as I would iterate the input with n!.
I'm wondering if there is a better way to search the input, maybe something where I can decrease the sample size after a valid input has been found, so I dont have to look at this position again. I would be glad if someone could help me to find a search-algorithm with a better runtime efficiency.
As far as I know, its not possible to decrease a stack size. I guess the only way to improve the efficiency is to skip positions inside the stack, based on a bool value? Every position that I can skip drastically improves the runtime, as it would have to be compared with all 7 inputs.
Thanks for reading :)
I'd bet that your libc implementation does this quite well ;)
strstr() is the function you are looking for, go check this man : http://manpagesfr.free.fr/man/man3/strstr.3.html

How to wildcard search with capture in C?

I'm trying to write a routine in C to capture sequences of characters in a string argument. The matching criteria in addition to characters can have ? meaning exactly one character and * meaning zero or more characters. (lazy).
e.g.
string: ok1ok1234567890
match: *(ok?2*)4*
The result should be the position of the match = 3 and the length of the match = 5
I have tried numerous ways of doing this, have put it aside, come back to it, put it aside again etc. I cannot crack it. It needs to be a purely C solution and able to capture multiple captures.
e.g. (*)(ok??)3(4*)8*
Every solution I come up with works in many cases but not all. I'm hoping someone somewhere might have done this already or have an insight to how it can be done.

How to save a 2-Dimensional array to a file in C?

I am an beginner C programmer and I am currently working on a project to implement viola jones object detection algorithm using C. I would like to know how I would be able to store data in a 2-Dimensional array to a file that can be easily ported and accessed by different program files(e.g. main.c, header_file.h etc.)
Thank you in advance.
There's not quite enough detail to be sure what you're looking for, but the basic structure of what you want to do is going to look something like this:
open file.csv for writing
for(iterate through one dimension of the array using i)
{
for(iterate through the other dimension of the array using j)
{
fprintf(yourfilehandle,"%d,",yourvalue[i][j]);
}
fprintf(yourfilehandle,"\n");
}
close your file
As has been suggested by others, this will leave you with a .CSV file, which is a pretty good choice, as it's easy to read in and parse, and you can open your file in Notepad or Excel and view it no problems.
This is assuming you really meant to do this with C file I/O, which is a perfectly valid way of doing things, some just feel it's a bit dated.
Note this leaves an extraneous comma at the end of the line. If that bugs you it's easy enough to do the pre and post conditions to only get commas where you want. Hint: it involves printing the comma before the entry inside the second for loop, reducing the number of entries you iterate over for the interior for loop, and printing out the first and last case of each row special, immediately before and after the inner for loop, respectively. Harder to explain that to do, probably.
Here is a reference for C-style file I/O, and here is a tutorial.
Without knowing anything about what type of data you're storing, I would say to store this as a matrix. You'll need to choose a delimiter to separate your elements (tab or space are common choices, aka 'tsv' and 'csv', respectively) and then something to mark the end of a row (new line is a good choice here).
So your saved file might look something like:
10 162 1 5
7 1 4 12
9 2 2 0
You can also define your format as having some metadata in the first line -- the number of rows and columns may be useful if you want to pre-allocate memory, along with other information like character encoding. Start simple and add as necessary!

storing strings in an array in a compact way [duplicate]

I bet somebody has solved this before, but my searches have come up empty.
I want to pack a list of words into a buffer, keeping track of the starting position and length of each word. The trick is that I'd like to pack the buffer efficiently by eliminating the redundancy.
Example: doll dollhouse house
These can be packed into the buffer simply as dollhouse, remembering that doll is four letters starting at position 0, dollhouse is nine letters at 0, and house is five letters at 3.
What I've come up with so far is:
Sort the words longest to shortest: (dollhouse, house, doll)
Scan the buffer to see if the string already exists as a substring, if so note the location.
If it doesn't already exist, add it to the end of the buffer.
Since long words often contain shorter words, this works pretty well, but it should be possible to do significantly better. For example, if I extend the word list to include ragdoll, then my algorithm comes up with dollhouseragdoll which is less efficient than ragdollhouse.
This is a preprocessing step, so I'm not terribly worried about speed. O(n^2) is fine. On the other hand, my actual list has tens of thousands of words, so O(n!) is probably out of the question.
As a side note, this storage scheme is used for the data in the `name' table of a TrueType font, cf. http://www.microsoft.com/typography/otspec/name.htm
This is the shortest superstring problem: find the shortest string that contains a set of given strings as substrings. According to this IEEE paper (which you may not have access to unfortunately), solving this problem exactly is NP-complete. However, heuristic solutions are available.
As a first step, you should find all strings that are substrings of other strings and delete them (of course you still need to record their positions relative to the containing strings somehow). These fully-contained strings can be found efficiently using a generalised suffix tree.
Then, by repeatedly merging the two strings having longest overlap, you are guaranteed to produce a solution whose length is not worse than 4 times the minimum possible length. It should be possible to find overlap sizes quickly by using two radix trees as suggested by a comment by Zifre on Konrad Rudolph's answer. Or, you might be able to use the generalised suffix tree somehow.
I'm sorry I can't dig up a decent link for you -- there doesn't seem to be a Wikipedia page, or any publicly accessible information on this particular problem. It is briefly mentioned here, though no suggested solutions are provided.
I think you can use a Radix Tree. It costs some memory because of pointers to leafs and parents, but it is easy to match up strings (O(k) (where k is the longest string size).
My first thought here is: use a data structure to determine common prefixes and suffixes of your strings. Then sort the words under consideration of these prefixes and postfixes. This would result in your desired ragdollhouse.
Looks similar to the Knapsack problem, which is NP-complete, so there is not a "definitive" algorithm.
I did a lab back in college where we tasked with implementing a simple compression program.
What we did was sequentially apply these techniques to text:
BWT (Burrows-Wheeler transform): helps reorder letters into sequences of identical letters (hint* there are mathematical substitutions for getting the letters instead of actually doing the rotations)
MTF (Move to front transform): Rewrites the sequence of letters as a sequence of indices of a dynamic list.
Huffman encoding: A form of entropy encoding that constructs a variable-length code table in which shorter codes are given to frequently encountered symbols and longer codes are given to infrequently encountered symbols
Here, I found the assignment page.
To get back your original text, you do (1) Huffman decoding, (2) inverse MTF, and then (3) inverse BWT. There are several good resources on all of this on the Interwebs.
Refine step 3.
Look through current list and see whether any word in the list starts with a suffix of the current word. (You might want to keep the suffix longer than some length - longer than 1, for example).
If yes, then add the distinct prefix to this word as a prefix to the existing word, and adjust all existing references appropriately (slow!)
If no, add word to end of list as in current step 3.
This would give you 'ragdollhouse' as the stored data in your example. It is not clear whether it would always work optimally (if you also had 'barbiedoll' and 'dollar' in the word list, for example).
I would not reinvent this wheel yet another time. There has already gone an enormous amount of manpower into compression algorithms, why not take one of the already available ones?
Here are a few good choices:
gzip for fast compression / decompression speed
bzip2 for a bit bitter compression but much slower decompression
LZMA for very high compression ratio and fast decompression (faster than bzip2 but slower than gzip)
lzop for very fast compression / decompression
If you use Java, gzip is already integrated.
It's not clear what do you want to do.
Do you want a data structure that lets to you store in a memory-conscious manner the strings while letting operations like search possible in a reasonable amount of time?
Do you just want an array of words, compressed?
In the first case, you can go for a patricia trie or a String B-Tree.
For the second case, you can just adopt some index compression techinique, like that:
If you have something like:
aaa
aaab
aasd
abaco
abad
You can compress like that:
0aaa
3b
2sd
1baco
2ad
The number is the length of the largest common prefix with the preceding string.
You can tweak that schema, for ex. planning a "restart" of the common prefix after just K words, for a fast reconstruction

Parsing a stream of data for control strings

I feel like this is a pretty common problem but I wasn't really sure what to search for.
I have a large file (so I don't want to load it all into memory) that I need to parse control strings out of and then stream that data to another computer. I'm currently reading in the file in 1000 byte chunks.
So for example if I have a string that contains ASCII codes escaped with ('$' some number of digits ';') and the data looked like this... "quick $33;brown $126;fox $a $12a". The string going to the other computer would be "quick brown! ~fox $a $12a".
In my current approach I have the following problems:
What happens when the control strings falls on a buffer boundary?
If the string is '$' followed by anything but digits and a ';' I want to ignore it. So I need to read ahead until the full control string is found.
I'm writing this in straight C so I don't have streams to help me.
Would an alternating double buffer approach work and if so how does one manage the current locations etc.
If I've followed what you are asking about it is called lexical analysis or tokenization or regular expressions. For regular languages you can construct a finite state machine which will recognize your input. In practice you can use a tool that understands regular expressions to recognize and perform different actions for the input.
Depending on different requirements you might go about this differently. For more complicated languages you might want to use a tool like lex to help you generate an input processor, but for this, as I understand it, you can use a much more simple approach, after we fix your buffer problem.
You should use a circular buffer for your input, so that indexing off the end wraps around to the front again. Whenever half of the data that the buffer can hold has been processed you should do another read to refill that. Your buffer size should be at least twice as large as the largest "word" you need to recognize. The indexing into this buffer will use the modulus (remainder) operator % to perform the wrapping (if you choose a buffer size that is a power of 2, such as 4096, then you can use bitwise & instead).
Now you just look at the characters until you read a $, output what you've looked at up until that point, and then knowing that you are in a different state because you saw a $ you look at more characters until you see another character that ends the current state (the ;) and perform some other action on the data that you had read in. How to handle the case where the $ is seen without a well formatted number followed by an ; wasn't entirely clear in your question -- what to do if there are a million numbers before you see ;, for instance.
The regular expressions would be:
[^$]
Any non-dollar sign character. This could be augmented with a closure ([^$]* or [^$]+) to recognize a string of non$ characters at a time, but that could get very long.
$[0-9]{1,3};
This would recognize a dollar sign followed by up 1 to 3 digits followed by a semicolon.
[$]
This would recognize just a dollar sign. It is in the brackets because $ is special in many regular expression representations when it is at the end of a symbol (which it is in this case) and means "match only if at the end of line".
Anyway, in this case it would recognize a dollar sign in the case where it is not recognized by the other, longer, pattern that recognizes dollar signs.
In lex you might have
[^$]{1,1024} { write_string(yytext); }
$[0-9]{1,3}; { write_char(atoi(yytext)); }
[$] { write_char(*yytext); }
and it would generate a .c file that will function as a filter similar to what you are asking for. You will need to read up a little more on how to use lex though.
The "f" family of functions in <stdio.h> can take care of the streaming for you. Specifically, you're looking for fopen(), fgets(), fread(), etc.
Nategoose's answer about using lex (and I'll add yacc, depending on the complexity of your input) is also worth considering. They generate lexers and parsers that work, and after you've used them you'll never write one by hand again.

Resources