Convert escape sequences from user input into their real representation - c

I'm trying to write an interpreter for LOLCODE that reads escaped strings from a file in the form:
VISIBLE "HAI \" WORLD!"
For which I wish to show an output of:
HAI " WORLD!
I have tried to dynamically generate a format string for printf in order to do this, but it seems that escape sequences are only translated when a string literal is compiled, not at run time.
In essence, what I am looking for is exactly the opposite of this question:
Convert characters in a c string to their escape sequences
Is there any way to go about this?

It's a pretty standard scanning exercise. Depending on how close you intend to be to the LOLCODE specification (which I can't seem to reach right now, so this is from memory), you've got a few ways to go.
Write a lexer by hand
It's not as hard as it sounds. You just want to analyze your input one character at a time, while maintaining a bit of context information. In your case, the important context consists of two flags:
one to remember you're currently lexing a string. It'll be set when reading the opening " and cleared when reading the closing ".
one to remember the previous character was an escape. It'll be set when reading \ and cleared when reading the character after that, no matter what it is.
Then the general algorithm looks like: (pseudocode)
loop on: c ← read next character
  if not inString
    if c is '"' then clear buf; set inString
    else [out of scope here]
  else
    if inEscape then append c to buf; clear inEscape
    else if c is '"' then clear inString; return buf as result
    else if c is '\' then set inEscape
    else append c to buf
You might want to refine the inEscape case should you want to implement \r, \n and the like.
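A minimal C sketch of that loop might look like the following; read_string, the fixed-size buffer and the bare "append the escaped character unchanged" policy are illustrative choices, not part of any LOLCODE spec:

#include <stdio.h>
#include <stdbool.h>

/* Sketch of the two-flag lexer described above. Reads one double-quoted
 * string from `in` (the opening quote must already have been consumed) and
 * stores the unescaped contents in `buf`. Returns true on success. */
static bool read_string(FILE *in, char *buf, size_t bufsize)
{
    size_t len = 0;
    bool inEscape = false;
    int c;

    while ((c = fgetc(in)) != EOF && len + 1 < bufsize) {
        if (inEscape) {            /* previous character was a backslash */
            buf[len++] = (char)c;  /* refine here to handle \n, \r and the like */
            inEscape = false;
        } else if (c == '"') {     /* closing quote: the string is complete */
            buf[len] = '\0';
            return true;
        } else if (c == '\\') {    /* escape: remember it, emit nothing yet */
            inEscape = true;
        } else {
            buf[len++] = (char)c;
        }
    }
    return false;                  /* EOF or overflow before the closing quote */
}

int main(void)
{
    char buf[256];
    int c;

    /* Echo the unescaped contents of every quoted string found on stdin,
     * e.g.  VISIBLE "HAI \" WORLD!"  prints  HAI " WORLD!  */
    while ((c = getchar()) != EOF)
        if (c == '"' && read_string(stdin, buf, sizeof buf))
            printf("%s\n", buf);
    return 0;
}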
Use a lexer generator
The traditional tools here are lex and flex.
Get inspiration
You're not the first one to write a LOLCODE interpreter. There's nothing wrong with peeking at how the others did it. For example, here's the string parsing code from lci.

Related

Use special characters other than "\n" and "\0" in C

I have one question.
I'm writing some code in C, on UNIX.
I need to write a special character in a file, because I need to divide my file in small sections.
Example:
'SPECIAL_CHARACTER'
section 1 with some text
'SPECIAL_CHARACTER'
section 2 with some text
etc..
I was thinking of using the character '\1'. It seems to work, but is it OK? Or is it wrong?
To do these things without using characters like "\0" or "\n", what should I do?
I hear two different questions where you ask "Or is it wrong?"
I hear you asking "how can I designate a separator byte in my code?", and I hear you asking "what is a good choice for a separator byte?"
First, fundamentally, what you are asking about is covered in section 6.4.4.4 of the C language specification, which covers "C Character Constants". There are various places you can look up the formal C language spec, or you can search for "C Character Constants" for perhaps a friendlier description, etc.
In detail, a handful of letters can be used in escape sequences to stand in for single bytes of specific values; e.g., \n is one of those, as a stand-in for 0x0a (decimal 10), a byte designated (in ASCII) as a newline. Here are the legal ones:
\a \b \f \n \r \t \v
The escape sequences \0 and \1 work because C supports using \ followed by digits as an octal value. So, that'll also work with, say, \3 and \35, but not \9, and note that \35 has a decimal value of 29. (Google "octal values" if you don't immediately see why that's the case.)
There are other legal escape sequences:
\' \" \\ \? : ' " \ and ?, respectively
\xNNNN... : each 'N' can be a hexadecimal digit
And, of course, escape sequences are just one aspect of C character constants.
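Just to make those byte values concrete, here is a tiny sketch (the value printed for '\n' assumes an ASCII execution character set; the octal and hex values are fixed by the escape itself):

#include <stdio.h>

int main(void)
{
    /* Each character constant is a single byte; the comments give the value
     * the escape sequence stands for. */
    printf("'\\n'   = %d\n", '\n');    /* 10, newline on an ASCII system   */
    printf("'\\1'   = %d\n", '\1');    /* 1, octal escape                  */
    printf("'\\35'  = %d\n", '\35');   /* 29, because 35 is read as octal  */
    printf("'\\x1e' = %d\n", '\x1e');  /* 30, the ASCII record separator   */
    return 0;
}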
Second, whether or not you should use a given byte value as your file's section separator depends entirely on how your program will be used. As others have pointed out in the comments, there are commonplace prevailing practices on what sort of byte value to use for this sort of thing.
I personally agree that 0x1e makes perhaps the most sense since in ASCII it is the "record separator". Conforming to ASCII can matter if the data will need to be understood by other programs, or if your program will need to be understood by other people.
On the other hand, a simple code comment can make it clear to anyone reading your code what byte value you are using for separating sections of your data file, and any program that needs to understand your data files needs to 'know' a lot more about the file format than just what the record separator is. There is nothing magical about 0x1e : it is merely a convention, and a reserved spot on the ASCII table to facilitate a common need -- that is, record separation of text that could contain normal text separators like space, newline, and null.
Broadly, any byte value that won't show up in the contents of your sections would make a fine section separator. Since you say those contents will be text, there are well over 100 choices, even if you exclude \0 (0x00) and \n (0x0a). In ASCII, a handful of byte values have been set aside for this sort of purpose, so that helps reduce the choice from several dozen to just several. Even among those several, there are only a few commonly used as separators.
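For completeness, a minimal sketch of the idea, assuming 0x1e as the separator and section text that never contains that byte; the helper names and the tmpfile() round trip are purely illustrative:

#include <stdio.h>

#define SECTION_SEP '\x1e'   /* ASCII record separator */

/* Write the sections back to back, with a single separator byte between them. */
static void write_sections(FILE *out, const char **sections, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i > 0)
            fputc(SECTION_SEP, out);
        fputs(sections[i], out);
    }
}

/* Read the file byte by byte and print each section under its own header. */
static void print_sections(FILE *in)
{
    int c, section = 1;

    printf("--- section %d ---\n", section);
    while ((c = fgetc(in)) != EOF) {
        if (c == SECTION_SEP)
            printf("\n--- section %d ---\n", ++section);
        else
            putchar(c);
    }
    putchar('\n');
}

int main(void)
{
    const char *sections[] = { "section 1 with some text",
                               "section 2 with some text" };
    FILE *f = tmpfile();

    if (f == NULL)
        return 1;
    write_sections(f, sections, 2);
    rewind(f);
    print_sections(f);
    fclose(f);
    return 0;
}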

Why does this scanf() conversion actually work?

Ah, the age-old tale of a programmer incrementally writing some code that they aren't expecting to do anything more than expected, only for the code to unexpectedly do everything, and correctly, too.
I'm working on some C programming practice problems, and one was to redirect stdin to a text file that had some lines of code in it, then print it to the console with scanf() and printf(). I was having trouble getting the newline characters to print as well (since scanf typically eats up whitespace characters) and had typed up a jumbled mess of code involving multiple conditionals and flags when I decided to start over and ended up typing this:
(where c is a character buffer large enough to hold the entirety of the text file's contents)
scanf("%[a-zA-Z -[\n]]", c);
printf("%s", c);
And, voila, this worked perfectly. I tried to figure out why by creating variations on the character class (between the outside brackets), such as:
[\w\W -[\n]]
[\w\d -[\n]]
[. -[\n]]
[.* -[\n]]
[^\n]
but none of those worked. They all ended up reading either just one character or producing a jumbled mess of random characters. '[^\n]' doesn't work because the text file contains newline characters, so it only prints out a single line.
Since I still haven't figured it out, I'm hoping someone out there would know the answer to these two questions:
Why does "[a-zA-Z -[\nn]]" work as expected?
The text file contains letters, numbers, and symbols (':', '-', '>', maybe some others); if 'a-z' is supposed to mean "all characters from unicode 'a' to unicode 'z'", how does 'a-zA-Z' also include numbers?
It seems like the syntax for what you can enter inside the brackets is a lot like regex (which I'm familiar with from Python), but not exactly. I've read up on what can be used from trying to figure out this problem, but I haven't been able to find any info comparing whatever this syntax is to regex. So: how are they similar and different?
I know this probably isn't a good usage for scanf, but since it comes from a practice problem, real world convention has to be temporarily ignored for this usage.
Thanks!
You are picking up numbers because you have " -[" in your character set. This means all characters from space (32) to open-bracket (91), which includes numbers in ASCII (48-57).
Your other examples include this as well, but they are missing the "a-zA-Z", which lets you pick up the lower-case letters (97-122). Sequences like '\w' are treated as unknown escape sequences in the string itself, so \w just becomes a single w. The characters . and * are taken literally; they don't have a special meaning like they do in a regular expression.
If you include - inside the [ (other than at the beginning or end) then the behaviour is implementation-defined.
This means that your compiler documentation must describe the behaviour, so you should consult that documentation to see what the defined behaviour is, which would explain why some of your code worked and some didn't.
If you want to write portable code then you can't use - as anything other than matching a hyphen.
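As a sketch of one portable alternative for the original exercise, you can pair a negated scanset with explicit newline handling so that no ranges or mid-set hyphens are needed at all (the 1024-byte line buffer is an arbitrary choice):

#include <stdio.h>

int main(void)
{
    char line[1024];
    int ch;

    /* Echo all of stdin, newlines included. %[^\n] never matches a newline
     * (and fails to match on an empty line), so the newline itself is
     * handled with getchar()/ungetc(). */
    while ((ch = getchar()) != EOF) {
        if (ch == '\n') {
            putchar('\n');
            continue;
        }
        ungetc(ch, stdin);
        if (scanf("%1023[^\n]", line) == 1)
            printf("%s", line);
    }
    return 0;
}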

C - clarifying delimiters in strtok

I'm trying to break up a shell command, stored as a character array, that contains both pipes (|) and OR operators (||) using strtok, except, well, the OR operator is itself just two pipes next to each other. Specifically, I need to know when |, ;, &&, or || show up in the command.
Is there a way to specify where one delimiter ends and another begins in strtok? As far as I know, the delimiters are usually one character long and you just list them all out with no spaces or anything in between.
Oh and, is a newline a valid delimiter? Or does strtok only do spaces?
Starting from your last question: yes, strtok can use new-line as a delimiter without any problems.
Unfortunately, the answer to your first question isn't nearly so positive. strtok treats all delimiter characters as equal, and does nothing to differentiate between a single delimiter and an arbitrary number of consecutive delimiters. In other words, if you give |&; as the delimiter, it'll treat ||||||||| or &&& or &|&|; all exactly the same way.
I'll go a little further: I'll go out on a limb and state as a fact that strtok simply isn't suitable for breaking a shell command into constituent pieces -- I'm pretty sure there's just no way to use it for this job that will produce usable results.
In particular, you don't have anything that just acts as a delimiter. For your purposes, the |, ||, ; and && are tokens in their own right. In a string being supplied to the shell, you don't necessarily have anything that qualifies as a delimiter the way strtok "thinks" of them.
strtok is oriented toward tokens that are separated by delimiters that are nothing except delimiters. As strtok reads the tokens, the delimiters between them are completely ignored (and, destroyed, for that matter). For the shell, a string like a|b is really three tokens -- you need the a, the | and the b -- there's nothing between them that strtok can safely overwrite and/or ignore -- but that's a requirement for how strtok works. For it to deliver you the first a, it overwrites the next character (the | in this case) with a '\0'. Then it has no way of recovering that pipe to tell you what the next token should be.
I think you probably need a greedy tokenizer instead -- i.e., one that builds the longest string of characters that can be a token, and stops when it encounters a character that can't be part of the current token. When you ask for the next token, it starts from the first character after the end of the previous token, without (necessarily) skipping/ignoring anything (though, of course, if it encounters something like white-space that hasn't been quoted somehow, it'll probably skip over it).
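A rough sketch of such a greedy tokenizer, under the assumption of a deliberately tiny grammar (the operators |, ||, ; and && plus whitespace-separated words); next_token is a made-up helper, not a library function:

#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* Return the next token from *p (advancing *p past it), copied into tok.
 * Two-character operators are tried before single-character ones, so ||
 * and && stay intact. Returns NULL at the end of the input. */
static const char *next_token(const char **p, char *tok, size_t toksize)
{
    const char *s = *p;
    size_t len;

    while (isspace((unsigned char)*s))        /* skip whitespace between tokens */
        s++;
    if (*s == '\0')
        return NULL;

    if (strncmp(s, "||", 2) == 0 || strncmp(s, "&&", 2) == 0)
        len = 2;                              /* two-character operators first   */
    else if (*s == '|' || *s == ';' || *s == '&')
        len = 1;                              /* then single-character operators */
    else
        len = strcspn(s, "|;& \t\n");         /* a word: up to the next operator or space */

    if (len >= toksize)                       /* truncate rather than overflow   */
        len = toksize - 1;
    memcpy(tok, s, len);
    tok[len] = '\0';
    *p = s + len;
    return tok;
}

int main(void)
{
    const char *cmd = "ls -l | wc || echo failed; date && true";
    char tok[128];

    while (next_token(&cmd, tok, sizeof tok) != NULL)
        printf("[%s]\n", tok);
    return 0;
}

Run on the sample command in main, it prints each token on its own line, with || and && preserved because the two-character operators are matched greedily before the single | or &.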
For your purpose, strtok() is not the correct tool to use; it destroys the delimiter, so you can't tell what was at the end of a token if someone types ls|wc. It could have been a pipe, a semicolon, an ampersand, or a space. Also, it treats multiple adjacent delimiters as part of a single delimiter.
Look at strspn() and strcspn(); both are in standard C and are non-destructive relatives of strtok().
strtok() is quite happy to use newline as a delimiter; in fact, any character except '\0' can be used as one of the delimiters.
There are other reasons for being extremely cautious about using strtok(), such as thread safety and the fact that it is highly unwise to use it in library code.
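For illustration, a minimal sketch of the non-destructive strcspn()/strspn() approach (the delimiter set and the input string are just examples): strcspn() measures the run of non-delimiter characters and strspn() the run of delimiters that follows, so the original string is never modified and you can still see whether the separator was | or ||.

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *cmd = "ls|wc||cat;echo&&true";
    const char *delims = "|;&";
    const char *p = cmd;

    while (*p != '\0') {
        size_t word = strcspn(p, delims);       /* length of text before delimiters */
        size_t sep  = strspn(p + word, delims); /* length of the delimiter run      */

        if (word > 0)
            printf("word:      %.*s\n", (int)word, p);
        if (sep > 0)
            printf("delimiter: %.*s\n", (int)sep, p + word);
        p += word + sep;
    }
    return 0;
}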
strtok() is a basic, all-purpose parsing function. For more advanced parsing, I don't recommend its use.
For example, in the case of '|', you really need to inspect the next character to determine if you've found '|' or '||'.
I've done a huge amount of parsing of this nature, including writing a small language interpreter. It's not that hard if you break it up into smaller tasks. But my advice is to write your own parsing routine in this case.
And, yes, a newline character is a valid delimiter.

Parsing a stream of data for control strings

I feel like this is a pretty common problem but I wasn't really sure what to search for.
I have a large file (so I don't want to load it all into memory) that I need to parse control strings out of and then stream that data to another computer. I'm currently reading in the file in 1000 byte chunks.
So for example, if I have a string that contains ASCII codes escaped with ('$' some number of digits ';') and the data looked like this... "quick $33;brown $126;fox $a $12a". The string going to the other computer would be "quick !brown ~fox $a $12a".
In my current approach I have the following problems:
What happens when a control string falls on a buffer boundary?
If the string is '$' followed by anything but digits and a ';' I want to ignore it. So I need to read ahead until the full control string is found.
I'm writing this in straight C so I don't have streams to help me.
Would an alternating double-buffer approach work, and if so, how does one manage the current locations, etc.?
If I've followed what you are asking about, it is called lexical analysis or tokenization, often done with regular expressions. For regular languages you can construct a finite state machine which will recognize your input. In practice you can use a tool that understands regular expressions to recognize and perform different actions for the input.
Depending on different requirements you might go about this differently. For more complicated languages you might want to use a tool like lex to help you generate an input processor, but for this, as I understand it, you can use a much simpler approach, after we fix your buffer problem.
You should use a circular buffer for your input, so that indexing off the end wraps around to the front again. Whenever half of the data that the buffer can hold has been processed, you should do another read to refill that half. Your buffer size should be at least twice as large as the largest "word" you need to recognize. The indexing into this buffer will use the modulus (remainder) operator % to perform the wrapping (if you choose a buffer size that is a power of 2, such as 4096, then you can use bitwise & instead).
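Just to make the wrapping arithmetic concrete, a tiny illustrative sketch (the 4096 size is arbitrary, but because it is a power of two both forms land on the same slot):

#include <stdio.h>

#define BUFSIZE 4096

int main(void)
{
    /* A steadily increasing logical index maps back into the buffer either
     * with % or, for power-of-two sizes, with a bitwise AND. */
    for (unsigned long i = 4090; i < 4100; i++)
        printf("logical index %lu -> slot %lu (%%), slot %lu (&)\n",
               i, i % BUFSIZE, i & (BUFSIZE - 1UL));
    return 0;
}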
Now you just look at the characters until you read a $, output what you've looked at up until that point, and then, knowing that you are in a different state because you saw a $, you look at more characters until you see another character that ends the current state (the ;) and perform some other action on the data that you had read in. How to handle the case where the $ is seen without a well-formed number followed by a ; wasn't entirely clear in your question -- what to do if there are a million numbers before you see ;, for instance.
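A rough sketch of that scanner as a plain C state machine follows. It sidesteps the circular buffer by carrying the partially seen control string in a few state variables across chunk boundaries, which is another way of making the boundaries a non-issue; the 1000-byte chunk size, the 3-digit limit and the choice to echo malformed sequences unchanged are assumptions, not requirements:

#include <stdio.h>
#include <stdlib.h>

enum state { TEXT, DOLLAR };    /* DOLLAR: a '$' was seen, digits are being collected */

int main(void)
{
    char chunk[1000];
    char digits[4];             /* up to 3 digits plus terminator */
    size_t ndigits = 0;
    enum state st = TEXT;
    size_t n;

    while ((n = fread(chunk, 1, sizeof chunk, stdin)) > 0) {
        for (size_t i = 0; i < n; i++) {
            char c = chunk[i];

            if (st == TEXT) {
                if (c == '$') { st = DOLLAR; ndigits = 0; }
                else putchar(c);
            } else if (c >= '0' && c <= '9' && ndigits < 3) {
                digits[ndigits++] = c;          /* still inside $...;             */
            } else if (c == ';' && ndigits > 0) {
                digits[ndigits] = '\0';
                putchar(atoi(digits));          /* emit the escaped byte          */
                st = TEXT;
            } else {
                putchar('$');                   /* malformed: echo what was consumed */
                fwrite(digits, 1, ndigits, stdout);
                st = TEXT;
                if (c == '$') { st = DOLLAR; ndigits = 0; }  /* may start a new sequence */
                else putchar(c);
            }
        }
    }
    if (st == DOLLAR) {          /* input ended inside a half-seen control string */
        putchar('$');
        fwrite(digits, 1, ndigits, stdout);
    }
    return 0;
}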
The regular expressions would be:
[^$]
Any non-dollar-sign character. This could be augmented with a closure ([^$]* or [^$]+) to recognize a string of non-$ characters at a time, but that could get very long.
$[0-9]{1,3};
This would recognize a dollar sign followed by 1 to 3 digits, followed by a semicolon.
[$]
This would recognize just a dollar sign. It is in brackets because $ is special in many regular expression dialects when it appears at the end of a pattern (which it does here, since it is the entire pattern) and would otherwise mean "match only if at the end of a line".
Anyway, in this case it matches a dollar sign that was not already consumed by the other, longer pattern that recognizes dollar signs.
In lex you might have
[^$]{1,1024} { write_string(yytext); }
$[0-9]{1,3}; { write_char(atoi(yytext + 1)); /* skip the leading '$' */ }
[$] { write_char(*yytext); }
and it would generate a .c file that will function as a filter similar to what you are asking for. You will need to read up a little more on how to use lex though.
The "f" family of functions in <stdio.h> can take care of the streaming for you. Specifically, you're looking for fopen(), fgets(), fread(), etc.
Nategoose's answer about using lex (and I'll add yacc, depending on the complexity of your input) is also worth considering. They generate lexers and parsers that work, and after you've used them you'll never write one by hand again.

getchar() and counting sentences and words in C

I'm creating a program which follows certain rules to result in a count of the words, syllables, and sentences in a given text file.
A sentence is a collection of words separated by whitespace that ends in a . or ! or ?
However, this is also a sentence:
Greetings, earthlings..
The way I've approached this program is to scan through the text file one character at a time using getchar(). I am prohibited from working with the entire text file in memory; it must be one character or word at a time.
Here is my dilemma: using getchar() I can find out what the current character is. I just keep using getchar() in a loop until it finds the EOF character. But, if the sentence has multiple periods at the end, it is still a single sentence. Which means I need to know what the last character was before the one I'm analyzing, and the one after it. To my thinking, this would mean another getchar() call, but that would create problems when I go to scan in the next character (it's now skipped a character).
Does anyone have a suggestion as to how I could determine that the above sentence is indeed a sentence?
Thanks, and if you need clarification or anything else, let me know.
You just need to implement a very simple state machine. Once you've found the end of a sentence you remain in that state until you find the start of a new sentence (normally this would be a non-white space character other than a terminator such as . ! or ?).
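A minimal sketch of that state machine built on getchar(); syllable counting is left out, and treating punctuation such as commas as part of a word is an assumption of the sketch, not a rule from the assignment:

#include <stdio.h>
#include <ctype.h>
#include <stdbool.h>

int main(void)
{
    int c;
    bool inWord = false;        /* currently inside a word                         */
    bool sentenceEnded = true;  /* a terminator was seen and no new content yet    */
    long words = 0, sentences = 0;

    /* A run of terminators (. ! ?) counts as a single sentence end, so
     * "Greetings, earthlings.." is counted as one sentence. */
    while ((c = getchar()) != EOF) {
        if (isspace(c)) {
            inWord = false;
        } else if (c == '.' || c == '!' || c == '?') {
            inWord = false;
            if (!sentenceEnded) {    /* first terminator of this run */
                sentences++;
                sentenceEnded = true;
            }
        } else {
            if (!inWord) {           /* first character of a new word */
                words++;
                inWord = true;
            }
            sentenceEnded = false;   /* new sentence content has started */
        }
    }

    printf("words: %ld  sentences: %ld\n", words, sentences);
    return 0;
}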
You need an extensible grammar. Look, for example, at regular expressions and try to build one.
Generally, human language is diverse and not easily parsed, especially if you have colloquial speech or multiple languages to analyze. In some languages it may not even be clear what the distinction between a word and a sentence is.
