Why does this scanf() conversion actually work? - c

Ah, the age old tale of a programmer incrementally writing some code that they aren't expecting to do anything more than expected, but the code unexpectedly does everything, and correctly, too.
I'm working on some C programming practice problems, and one was to redirect stdin to a text file that had some lines of code in it, then print it to the console with scanf() and printf(). I was having trouble getting the newline characters to print as well (since scanf typically eats up whitespace characters) and had typed up a jumbled mess of code involving multiple conditionals and flags when I decided to start over and ended up typing this:
(where c is a character buffer large enough to hold the entirety of the text file's contents)
scanf("%[a-zA-Z -[\n]]", c);
printf("%s", c);
And, voila, this worked perfectly. I tried to figure out why by creating variations on the character class (between the outside brackets), such as:
[\w\W -[\n]]
[\w\d -[\n]]
[. -[\n]]
[.* -[\n]]
[^\n]
but none of those worked. They all ended up reading either just one character or producing a jumbled mess of random characters. '[^\n]' doesn't work because the text file contains newline characters, so it only prints out a single line.
Since I still haven't figured it out, I'm hoping someone out there would know the answer to these two questions:
Why does "[a-zA-Z -[\nn]]" work as expected?
The text file contains letters, numbers, and symbols (':', '-', '>', maybe some others); if 'a-z' is supposed to mean "all characters from unicode 'a' to unicode 'z'", how does 'a-zA-Z' also include numbers?
It seems like the syntax for what you can enter inside the brackets is a lot like regex (which I'm familiar with from Python), but not exactly. I've read up on what can be used from trying to figure out this problem, but I haven't been able to find any info comparing whatever this syntax is to regex. So: how are they similar and different?
I know this probably isn't a good usage for scanf, but since it comes from a practice problem, real world convention has to be temporarily ignored for this usage.
Thanks!

You are picking up numbers because you have " -[" in your character set. This means all characters from space (32) to open-bracket (91), which includes numbers in ASCII (48-57).
Your other examples include this as well, but they are missing the "a-zA-Z", which lets you pick up the lower-case letters (97-122). Sequences like '\w' are treated as unknown escape sequences in the string itself, so \w just becomes a single w. . and * are taken literally. They don't have a special meaning like in a regular expression.

If you include - inside the [ (other than at the beginning or end) then the behaviour is implementation-defined.
This means that your compiler documentation must describe the behaviour, so you should consult that documentation to see what the defined behaviour is, which would explain why some of your code worked and some didn't.
If you want to write portable code then you can't use - as anything other than matching a hyphen.

Related

Using C I would like to format my output such that the output in the terminal stops once it hits the edge of the window

If you type ps aux into your terminal and make the window really small, the output of the command will not wrap and the format is still very clear.
When I use printf and output my 5 or 6 strings, sometimes the length of my output exceeds that of the terminal window and the strings wrap to the next line which totally screws up the format. How can I write my program such that the output continues to the edge of the window but no further?
I've tried searching for an answer to this question but I'm having trouble narrowing it down and thus my search results never have anything to do with it so it seems.
Thanks!
There are functions that can let you know information about the terminal window, and some others that will allow you to manipulate it. Look up the "ncurses" or the "termcap" library.
A simple approach for solving your problem will be to get the terminal window size (specially the width), and then format your output accordingly.
There are two possible answers to fix your problem.
Turn off line wrapping in your terminal emulator(if it supports it).
Look into the Curses library. Applications like top or vim use the Curses library for screen formatting.
You can find, or at least guess, the width of the terminal using methods that other answers describe. That's only part of the problem however -- the tricky bit is formatting the output to fit the console. I don't believe there's any alternative to reading the text word by word, and moving the output to the next line when a word would overflow the width. You'll need to implement a method to detect where the white-space is, allowing for the fact that there could be multiple white spaces in a row. You'll need to decide how to handle line-breaking white-space, like CR/LF, if you have any. You'll need to decide whether you can break a word on punctuation (e.g, a hyphen). My approach is to use a simple finite-state machine, where the states are "At start of line", "in a word", "in whitespace", etc., and the characters (or, rather character classes) encountered are the events that change the state.
A particular complication when working in C is that there is little-to-no built-in support for multi-byte characters. That's fine for text which you are certain will only ever be in English, and use only the ASCII punctuation symbols, but with any kind of internationalization you need to be more careful. I've found that it's easiest to convert the text into some wide format, perhaps UTF-32, and then work with arrays of 32-bit integers to represent the characters. If your text is UTF-8, there are various tricks you can use to avoid having to do this conversion, but they are a bit ugly.
I have some code I could share, but I don't claim it is production quality, or even comprehensible. This simple-seeming problem is actually far more complicated than first impressions suggest. It's easy to do badly, but difficult to do well.

Will program crash if Unicode data is parsed as Multibyte?

Okay basically what I'm asking is:
Let's say I use PathFindFileNameA on a unicode enabled path. I obtain this path via GetModuleFileNameA, but since this api doesn't support unicode characters (italian characters for example) it will output junk characters in that part of the system path.
Let's assume x represents a junk character in the file path, such as:
C:\Users\xxxxxxx\Desktop\myfile.sys
I assume that PathFindFileNameA just parses the string with strtok till it encounters the last \\, and outputs the remainder in a preallocated buffer given str_length - pos_of_last \\.
The question is, will the PathFindFileNameA parse the string correctly even if it encounters the junk characters from a failed unicode conversion (since the multi-byte API reciprocal is called), or will it crash the program?
Don't answer something like "Well just use MultiByteToWideChar", or "Just use a wide-version of the API". I am asking a specific question, and a specific answer would be appreciated.
Thanks!
Why you think that Windows API only do strtok? I used to hear that all xxA APIs are redirected to xxW APIs before win10 are released.
And I think the answer to this question is quite simple. Just write a easy program and then set the code page to what you want.Running that program and the answer goes out.
P.S.:personally I think that GetModuleFileNameA will work correctly even if there are junk characters because Windows will store the Image Name as an UNICODE_STRING internally. And even if you uses MBCS, the junk code does not contain zero bytes, and it will work as usual since it's just using strncpy.
Sorry for my last answer :)

C Removing Newlines in a Portable and International Friendly Way

Simple question here with a potentially tricky answer: I am looking for a portable and localization friendly way to remove trailing newlines in C, preferably something standards-based.
I am already aware of the following solutions:
Parsing for some combination of \r and \n. Really not pretty when dealing with Windows, *nix and Mac, all which use different sequences to represent a new line. Also, do other languages even use the same escape sequence for a new line? I expect this will blow up in languages that use different glyphs from English (say, Japanese or the like).
Removing trailing n bytes and replacing final \0. Seems like a more brittle way of doing the above.
isspace looks tempting but I need to only match newlines. Other whitespace is considered valid token text.
C++ has a class to do this but it is of little help to me in a pure-C world.
locale.h seems like what I am after but I cannot see anything pertinent to extracting newline tokens.
So, with that, is this an instance that I will have to "roll my own" functionality or is there something that I have missed? Thanks!
Solution
I ended up combining both answers from Weather Vane and Loic, respectively, for my final solution. What worked was to use the handy strcspn function to break on the first newline character as selected from Loic's provided links. Thus, I can select delimiters based on a number of supported locales. Is a good point that there are too many to support generically at this level; I didn't even know that there were several competing encodings for the Cyrillic.
In this way, I can achieve "good enough" multinational support while still using standard library functions.
Since I can only accept one answer, I am selecting Weather Vane's as his was the final invocation I used. That being said, it was really the two answers together that worked for me.
The best one I know is
buffer [ strcspn(buffer, "\r\n") ] = 0;
which is a safe way of dealing with all the combinations of \r and \n - both, one or none.
I suggest to replace one or more whitespace characters with one standard space (US-ASCII 0x20). Considering only ISO-8859-1 characters (https://en.wikipedia.org/wiki/ISO/IEC_8859-1), whitespace consists of any byte in 0x00..0x20 (C0 control characters and space) and 0x7F..0xA0 (delete, C1 control characters and no-break space). Notice that US-ASCII is subset of ISO-8859-1.
But take into account that Windows 1251 (https://en.wikipedia.org/wiki/Windows-1251) assign different, visible (non-control) characters to the range 0x80..0x9F. In this case, those bytes cannot be replaced by spaces without lost of textual information.
Resources for an extensive definition of whitespace characters:
https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace
http://unicode.org/reports/tr23/
http://www.unicode.org/Public/8.0.0/charts/CodeCharts.pdf
Take also onto account that different encodings may be used, most commonly:
ISO-8859-1 (https://en.wikipedia.org/wiki/ISO/IEC_8859-1)
UTF-8 (https://en.wikipedia.org/wiki/UTF-8)
Windows 1251 (https://en.wikipedia.org/wiki/Windows-1251)
But in non-western countries (for instance Russia, Japan), further character encodings are also usual. Numerous encodings exist, but it probably does not make sense to try to support each and every known encoding.
Thus try to define and restrict your use-cases, because implementing it in full generality means a lot of work.
This answer is for C++ users with the same problem.
Matching a newline character for any locale and character type can be done like this:
#include <locale>
template<class Char>
bool is_newline(Char c, std::locale const & loc = std::locale())
{
// Translate character into default locale and character type.
// Then, test against '\n', which is the only newline character there.
return std::use_facet< std::ctype<Char>>(loc).narrow(c, ' ') == '\n';
}
Now, removing all trailing newlines can be done like this:
void remove_trailing_newlines(std::string & str) {
while (!str.empty() && is_newline(*str.rbegin())
str.pop_back();
}
This should be absolutely portable, as it relies only on standard C++ functions.

Parsing a stream of data for control strings

I feel like this is a pretty common problem but I wasn't really sure what to search for.
I have a large file (so I don't want to load it all into memory) that I need to parse control strings out of and then stream that data to another computer. I'm currently reading in the file in 1000 byte chunks.
So for example if I have a string that contains ASCII codes escaped with ('$' some number of digits ';') and the data looked like this... "quick $33;brown $126;fox $a $12a". The string going to the other computer would be "quick brown! ~fox $a $12a".
In my current approach I have the following problems:
What happens when the control strings falls on a buffer boundary?
If the string is '$' followed by anything but digits and a ';' I want to ignore it. So I need to read ahead until the full control string is found.
I'm writing this in straight C so I don't have streams to help me.
Would an alternating double buffer approach work and if so how does one manage the current locations etc.
If I've followed what you are asking about it is called lexical analysis or tokenization or regular expressions. For regular languages you can construct a finite state machine which will recognize your input. In practice you can use a tool that understands regular expressions to recognize and perform different actions for the input.
Depending on different requirements you might go about this differently. For more complicated languages you might want to use a tool like lex to help you generate an input processor, but for this, as I understand it, you can use a much more simple approach, after we fix your buffer problem.
You should use a circular buffer for your input, so that indexing off the end wraps around to the front again. Whenever half of the data that the buffer can hold has been processed you should do another read to refill that. Your buffer size should be at least twice as large as the largest "word" you need to recognize. The indexing into this buffer will use the modulus (remainder) operator % to perform the wrapping (if you choose a buffer size that is a power of 2, such as 4096, then you can use bitwise & instead).
Now you just look at the characters until you read a $, output what you've looked at up until that point, and then knowing that you are in a different state because you saw a $ you look at more characters until you see another character that ends the current state (the ;) and perform some other action on the data that you had read in. How to handle the case where the $ is seen without a well formatted number followed by an ; wasn't entirely clear in your question -- what to do if there are a million numbers before you see ;, for instance.
The regular expressions would be:
[^$]
Any non-dollar sign character. This could be augmented with a closure ([^$]* or [^$]+) to recognize a string of non$ characters at a time, but that could get very long.
$[0-9]{1,3};
This would recognize a dollar sign followed by up 1 to 3 digits followed by a semicolon.
[$]
This would recognize just a dollar sign. It is in the brackets because $ is special in many regular expression representations when it is at the end of a symbol (which it is in this case) and means "match only if at the end of line".
Anyway, in this case it would recognize a dollar sign in the case where it is not recognized by the other, longer, pattern that recognizes dollar signs.
In lex you might have
[^$]{1,1024} { write_string(yytext); }
$[0-9]{1,3}; { write_char(atoi(yytext)); }
[$] { write_char(*yytext); }
and it would generate a .c file that will function as a filter similar to what you are asking for. You will need to read up a little more on how to use lex though.
The "f" family of functions in <stdio.h> can take care of the streaming for you. Specifically, you're looking for fopen(), fgets(), fread(), etc.
Nategoose's answer about using lex (and I'll add yacc, depending on the complexity of your input) is also worth considering. They generate lexers and parsers that work, and after you've used them you'll never write one by hand again.

getchar() and counting sentences and words in C

I'm creating a program which follows certain rules to result in a count of the words, syllables, and sentences in a given text file.
A sentence is a collection of words separated by whitespace that ends in a . or ! or ?
However, this is also a sentence:
Greetings, earthlings..
The way I've approached this program is to scan through the text file one character at a time using getchar(). I am prohibited from working with the the entire text file in memory, it must be one character or word at a time.
Here is my dilemma: using getchar() i can find out what the current character is. I just keep using getchar() in a loop until it finds the EOF character. But, if the sentence has multiple periods at the end, it is still a single sentence. Which means I need to know what the last character was before the one I'm analyzing, and the one after it. Through my thinking, this would mean another getchar() call, but that would create problems when i go to scan in the next character (its now skipped a character).
Does anyone have a suggestion as to how i could determine that the above sentence, is indeed a sentence?
Thanks, and if you need clarification or anything else, let me know.
You just need to implement a very simple state machine. Once you've found the end of a sentence you remain in that state until you find the start of a new sentence (normally this would be a non-white space character other than a terminator such as . ! or ?).
You need an extensible grammar. Look for example at regular expressions and try to build one.
Generally human language is diverse and not easily parseable especially if you have colloquial speech to analyze or different languages. In some languages it may not even be clear what the distinction between a word and a sentence is.

Resources