Parsing a stream of data for control strings - c

I feel like this is a pretty common problem but I wasn't really sure what to search for.
I have a large file (so I don't want to load it all into memory) that I need to parse control strings out of and then stream that data to another computer. I'm currently reading in the file in 1000 byte chunks.
So for example if I have a string that contains ASCII codes escaped with ('$' some number of digits ';') and the data looked like this... "quick $33;brown $126;fox $a $12a". The string going to the other computer would be "quick brown! ~fox $a $12a".
In my current approach I have the following problems:
What happens when the control strings falls on a buffer boundary?
If the string is '$' followed by anything but digits and a ';' I want to ignore it. So I need to read ahead until the full control string is found.
I'm writing this in straight C so I don't have streams to help me.
Would an alternating double buffer approach work and if so how does one manage the current locations etc.

If I've followed what you are asking about it is called lexical analysis or tokenization or regular expressions. For regular languages you can construct a finite state machine which will recognize your input. In practice you can use a tool that understands regular expressions to recognize and perform different actions for the input.
Depending on different requirements you might go about this differently. For more complicated languages you might want to use a tool like lex to help you generate an input processor, but for this, as I understand it, you can use a much more simple approach, after we fix your buffer problem.
You should use a circular buffer for your input, so that indexing off the end wraps around to the front again. Whenever half of the data that the buffer can hold has been processed you should do another read to refill that. Your buffer size should be at least twice as large as the largest "word" you need to recognize. The indexing into this buffer will use the modulus (remainder) operator % to perform the wrapping (if you choose a buffer size that is a power of 2, such as 4096, then you can use bitwise & instead).
Now you just look at the characters until you read a $, output what you've looked at up until that point, and then knowing that you are in a different state because you saw a $ you look at more characters until you see another character that ends the current state (the ;) and perform some other action on the data that you had read in. How to handle the case where the $ is seen without a well formatted number followed by an ; wasn't entirely clear in your question -- what to do if there are a million numbers before you see ;, for instance.
The regular expressions would be:
[^$]
Any non-dollar sign character. This could be augmented with a closure ([^$]* or [^$]+) to recognize a string of non$ characters at a time, but that could get very long.
$[0-9]{1,3};
This would recognize a dollar sign followed by up 1 to 3 digits followed by a semicolon.
[$]
This would recognize just a dollar sign. It is in the brackets because $ is special in many regular expression representations when it is at the end of a symbol (which it is in this case) and means "match only if at the end of line".
Anyway, in this case it would recognize a dollar sign in the case where it is not recognized by the other, longer, pattern that recognizes dollar signs.
In lex you might have
[^$]{1,1024} { write_string(yytext); }
$[0-9]{1,3}; { write_char(atoi(yytext)); }
[$] { write_char(*yytext); }
and it would generate a .c file that will function as a filter similar to what you are asking for. You will need to read up a little more on how to use lex though.

The "f" family of functions in <stdio.h> can take care of the streaming for you. Specifically, you're looking for fopen(), fgets(), fread(), etc.
Nategoose's answer about using lex (and I'll add yacc, depending on the complexity of your input) is also worth considering. They generate lexers and parsers that work, and after you've used them you'll never write one by hand again.

Related

Why does this scanf() conversion actually work?

Ah, the age old tale of a programmer incrementally writing some code that they aren't expecting to do anything more than expected, but the code unexpectedly does everything, and correctly, too.
I'm working on some C programming practice problems, and one was to redirect stdin to a text file that had some lines of code in it, then print it to the console with scanf() and printf(). I was having trouble getting the newline characters to print as well (since scanf typically eats up whitespace characters) and had typed up a jumbled mess of code involving multiple conditionals and flags when I decided to start over and ended up typing this:
(where c is a character buffer large enough to hold the entirety of the text file's contents)
scanf("%[a-zA-Z -[\n]]", c);
printf("%s", c);
And, voila, this worked perfectly. I tried to figure out why by creating variations on the character class (between the outside brackets), such as:
[\w\W -[\n]]
[\w\d -[\n]]
[. -[\n]]
[.* -[\n]]
[^\n]
but none of those worked. They all ended up reading either just one character or producing a jumbled mess of random characters. '[^\n]' doesn't work because the text file contains newline characters, so it only prints out a single line.
Since I still haven't figured it out, I'm hoping someone out there would know the answer to these two questions:
Why does "[a-zA-Z -[\nn]]" work as expected?
The text file contains letters, numbers, and symbols (':', '-', '>', maybe some others); if 'a-z' is supposed to mean "all characters from unicode 'a' to unicode 'z'", how does 'a-zA-Z' also include numbers?
It seems like the syntax for what you can enter inside the brackets is a lot like regex (which I'm familiar with from Python), but not exactly. I've read up on what can be used from trying to figure out this problem, but I haven't been able to find any info comparing whatever this syntax is to regex. So: how are they similar and different?
I know this probably isn't a good usage for scanf, but since it comes from a practice problem, real world convention has to be temporarily ignored for this usage.
Thanks!
You are picking up numbers because you have " -[" in your character set. This means all characters from space (32) to open-bracket (91), which includes numbers in ASCII (48-57).
Your other examples include this as well, but they are missing the "a-zA-Z", which lets you pick up the lower-case letters (97-122). Sequences like '\w' are treated as unknown escape sequences in the string itself, so \w just becomes a single w. . and * are taken literally. They don't have a special meaning like in a regular expression.
If you include - inside the [ (other than at the beginning or end) then the behaviour is implementation-defined.
This means that your compiler documentation must describe the behaviour, so you should consult that documentation to see what the defined behaviour is, which would explain why some of your code worked and some didn't.
If you want to write portable code then you can't use - as anything other than matching a hyphen.

Parsing shell commands in c: string cutting with respect to its contents

I'm currently creating Linux shell to learn more about system calls.
I've already figured out most of the things. Parser, token generation, passing appropriate things to appropriate system calls - works.
The thing is, that even before I start making tokens, I split whole command string into separate words. It's based on array of separators, and it works surprisingly good. Except that I'm struggling with adding additional functionality to it, like escape sequences or quotes. I can't really live without it, since even people using basic grep commands use arguments with quotes. I'll need to add functionality for:
' ' - ignore every other separator, operator or double quotes found between those two, pass this as one string, don't include these quotation marks into resulting word,
" "- same as above, but ignore single quotes,
\\ - escape this into single backslash,
\(space) - escape this into space, do not parse resulting space as separator
\", \' - analogously to the above.
Many other things that I haven't figured out I need yet
and every single one of them seems like an exception on its own. Each of them must operate on diversity of possible positions in commands, being included into result or not, having influence on the rest of the parsing. It makes my code look like big ball of mud.
Is there a better approach to do this? Is there a more general algorithm for that purpose?
You are trying to solve a classic problem in program analysis (of lexing and parsing) using a nontraditional structure for lexer ( I split whole command string into separate words... ). OK, then you will have non-traditional troubles with getting the lexer "right".
That doesn't mean that way is doomed to failure, and without seeing specific instances of your problem, (you list a set of constructs you want to handle, but don't say why these are hard to process), it is hard to provide any specific advice. It also doesn't mean that way will lead to success; splitting the line may break tokens that shouldn't be broken (usually by getting confused about what has been escaped).
The point of using a standard lexer (such as Flex or any of the 1000 variants you can get) is that they provide a proven approach to complex lexing problems, based generally on the idea that one can use regular expressions to describe the shape of individual lexemes. Thus, you get one regexp per lexeme type, thus an ocean of them but each one is pretty easy to specify by itself.
I've done ~~40 languages using strong lexers and parsers (using one of the ones in that list). I assure you the standard approach is empirically pretty effective. The types of surprises are well understood and manageable. A nonstandard approach always has the risk that it will surprise you in a bad way.
Last remark: shell languages for Unix have had people adding crazy stuff for 40 years. Expect the job to be at least medium hard, and don't expect it to be pretty like Wirth's original Pascal.

How can I parse text input and convert strings to integers?

I have a file input, in which i have the following data.
1 1Apple 2Orange 10Kiwi
2 30Apple 4Orange 1Kiwi
and so on. I have to read this data from file and work on it but i dont know how to retrieve the data. I want to store 1(of 1 apple) as integer and then Apple as a string.
I thought of reading the whole 1Apple as a string. and then doing something with the stoi function.
Or I could read the whole thing character by character and then if the ascii value of that character lies b/w 48 to 57 then i will combine that as an integer and save the rest as string? Which one shall I do? Also how do I check what is the ASCII value of the char. (shall I convert the char to int and then compare, or is there any inbuilt function?)
How about using the fscanf() function if and only if your input pattern is not going to change. Otherwise you should probably use fgets() and perform checks if you want to separate the number from the string such as you suggested.
There is one easy right way to do this with standard C library facilities, one rather more difficult right way, and a whole lot of wrong ways. This is the easy right way:
Read an entire line into a char[] buffer using fgets.
Extract numbers from this line using strtol or strtoul.
It is very important to understand why the easier-looking alternatives (*scanf and atoi) should never be used. You might write less code initially, but once you start thinking about how to handle even slightly malformed input, you will discover that you should have used strtol.
The "rather more difficult right way" is to use lex and yacc. They are much more complicated but also much more powerful. You shouldn't need them for this problem.

How do I check if a file is text-based?

I am working on a small text replacement application that basically lets the user select a file and replace text in it without ever having to open the file itself. However, I want to make sure that the function only runs for files that are text-based. I thought I could accomplish this by checking the encoding of the file, but I've found that Notepad .txt files use Unicode UTF-8 encoding, and so do MS Paint .bmp files. Is there an easy way to check this without placing restrictions on the file extensions themselves?
Unless you get a huge hint from somewhere, you're stuck. Purely by examining the bytes there's a non-zero probability you'll guess wrong given the plethora of encodings ("ASCII", Unicode, UTF-8, DBCS, MBCS, etc). Oh, and what if the first page happens to look like ASCII but the next page is a btree node that points to the first page...
Hints can be:
extension (not likely that foo.exe is editable)
something in the stream itself (like BOM [byte-order-marker])
user direction (just edit the file, goshdarnit)
Windows used to provide an API IsTextUnicode that would do a probabilistic examination, but there were well-known false-positives.
My take is that trying to be smarter than the user has some issues...
Honestly, given the Windows environment that you're working with, I'd consider a whitelist of known text formats. Windows users are typically trained to stick with extensions. However, I would personally relax the requirement that it not function on non-text files, instead checking with the user for goahead if the file does not match the internal whitelist. The risk of changing a binary file would be mitigated if your search string is long - that is assuming you're not performing Y2K conversion (a la sed 's/y/k/g').
It's pretty costly to determine if a file is text-based or not (i.e. a binary file). You would have to examine each byte in the file to determine if it is a valid character, irrespective of the file encoding.
Others have said to look at all the bytes in the file and see if they're alphanumeric. Some UNIX/Linux utils do this, but just check the first 1K or 2K of the file as an "optimistic optimization".
well a text file contains text, right ? so a really easy way to check a file if it does contain only text is to read it and check if it does contains alphanumeric characters.
So basically the first thing you have to do is to check the file encoding if its pure ASCII you have an easy task just read the whole file in to a char array (I'm assuming you are doing it in C/C++ or similar) and check every char in that array with functions isalpha and isdigit ...of course you have to take care about special exceptions like tabulators '\t' space ' ' or the newline ('\n' in linux , '\r'\'n' in windows)
In case of a different encoding the process is the same except the fact that you have to use different functions for checking if the current character is an alphanumeric character... also note that in case of UTF-16 or greater a simple char array is simply to small...but if you are doing it for example in C# you dont have to worry about the size :)
You can write a function that will try to determine if a file is text based. While this will not be 100% accurate, it may be just enough for you. Such a function does not need to go through the whole file, about a kilobyte should be enough (or even less). One thing to do is to count how many whitespaces and newlines are there. Another thing would be to consider individual bytes and check if they are alphanumeric or not. With some experiments you should be able to come up with a decent function. Note that this is just a basic approach and text encodings might complicate things.

UTF-8 tuple storage using lowest common technological denominator, append-only

EDIT: Note that due to the way hard drives actually write data, none of the schemes in this list work reliably. Do not use them. Just use a database. SQLite is a good simple one.
What's the most low-tech but reliable way of storing tuples of UTF-8 strings on disk? Storage should be append-only for reliability.
As part of a document storage system I'm experimenting with I have to store UTF-8 tuple data on disk. Obviously, for a full-blown implementation, I want to use something like Amazon S3, Project Voldemort, or CouchDB.
However, at the moment, I'm experimenting and haven't even firmly settled on a programming language yet. I have been using CSV, but CSV tend to become brittle when you try to store outlandish unicode and unexpected whitespace (eg vertical tabs).
I could use XML or JSON for storage, but they don't play nice with append-only files. My best guess so far is a rather idiosyncratic format where each string is preceded by a 4-byte signed integer indicating the number of bytes it contains, and an integer value of -1 indicates that this tuple is complete - the equivalent of a CSV newline. The main source of headaches there is having to decide on the endianness of the integer on disk.
Edit: actually, this won't work. If the program exits while writing a string, the data becomes irrevocably misaligned. Some sort of out-of-band signalling is needed to ensure alignment can be regained after an aborted tuple.
Edit 2: Turns out that guaranteeing atomicity when appending to text files is possible, but the parser is quite non-trivial. Writing said parser now.
Edit 3: You can view the end result at http://github.com/MetalBeetle/Fruitbat/tree/master/src/com/metalbeetle/fruitbat/atrio/ .
I would recommend tab delimiting each field and carriage-return delimiting each record.
Within each string, Replace all characters that would affect the field and record interpretation and rendering. This would include control characters (U+0000–U+001F, U+007F–U+009F), non-graphical line and paragraph separators (U+2028, U=2029), directional control characters (U+202A–U+202E), and the byte order mark (U+FEFF).
They should be replaced with escape sequences of constant length. The escape sequences should begin with a rare (for your application) character. The escape character itself should also be escaped.
This would allow you to append new records easily. It has the additional advantage of being able to load the file for visual inspection and modification into any spreadsheet or word processing program, which could be useful for debugging purposes.
This would also be easy to code, since the file will be a valid UTF-8 document, so standard text reading and writing routines may be used. This also allows you to convert easily to UTF-16BE or UTF-16LE if desired, without complications.
Example:
U+0009 CHARACTER TABULATION becomes ~TB
U+000A LINE FEED becomes ~LF
U+000D CARRIAGE RETURN becomes ~CR
U+007E TILDE becomes ~~~
etc.
There are a couple of reasons why tabs would be better than commas as field delimiters. Commas appear more commonly within normal text strings (such as English text), and would have to be replaced more frequently. And spreadsheet programs (such as Microsoft Excel) tend to handle tab-delimited files much more naturally.
Mostly thinking out loud here...
Really low tech would be to use (for example) null bytes as separators, and just "quote" all null bytes appearing in the output with an additional null.
Perhaps one could use SCSU along with that.
Or it might be worth to look at the gzip format, and maybe ape it, if not using it:
A gzip file consists of a series of "members" (compressed data sets).
[...]
The members simply appear one after another in the file, with no additional information before, between, or after them.
Each of these members can have an optional "filename", comment, or the like, and i believe you can just keep appending members.
Or you could use bencode, used in torrent-files. Or BSON.
See also Wikipedia's Comparison of data serialization formats.
Otherwise i think your idea of preceding each string with its length is probably the simplest one.

Resources