C style printf/scanf - c

I've seen, here and elsewhere, many questions that, to get input data, use something like this:
...
printf("What's your name? ");
scanf("%s",name);
...
This is very reminiscent of the old BASIC days (INPUT for those who remember it).
The majority of those questions, if not all, are from people just learning C and are homeworks or example taken from their book.
I clearly remember that when I learned C I was told that this type of question/answer style was not a good practice for getting user input. The "Right Way" was either to get parameters on the command line (argv[...]) or reading from a data file to be parsed with fgets(). When user friendliness was a must, termio and friends had to be used.
Now, I wonder if anything changed in the past years. Are people thaught to structure user interaction as a set question/answer now?
I can only see disadvantages in using the printf()/scanf() approach, the main one being the diversity of terminals (^H anyone?) that could make difficult for the user to correct mistakes.
Could anyone point me to concrete advantages of this approach?

This structure is easy to explain and easy to learn, which is why it appears in so many introductory materials. Doing user input "the right way" in C can appear fairly daunting to a neophyte, especially when you have to deal with tokenizing and conversions.
However, I agree that it would be valuable for introductory materials to demonstrate more robust methods for handling user input.

I always thought the Unix way was to accept input from stdin. That way the caller of the command can pipe input in from another command, from a file or manually.

fgets() / sscanf() or similar is the right way to accept user input.
Take what you read here and elsewhere with a (large) pinch of salt.

The GNU readline library is really an excellent resource for this. Its main advantage is that it handles all the intricacies of editing, as well as letting users have their own input settings, eg Vi or Emacs mode.
This is the library bash and many other programs that accept interactive line-based data use.
By using the library, you get an interface that your users will have some knowledge of how to use, plus you get all sorts of nice features without having to explicitly code support for line editing.

I agree with CTFord, that IF you're doing a command line program, then stdin is a perfectly reasonable way to deal with input.
However, now adays, the correct way is to make a window that has a text box a label and a button and blah blah blah.
I know that doesn't really answer your question, but I think my point is that that style of programming has become MUCH less prevalent, so that it doesn't really matter what is "correct".

In many cases you are probably missing the point of the exercise if you think this is important. There's a world of a difference between a homework exercise and real world programs. The purpose of such exercises is seldom, if ever, to teach user interface design; the technique you describe is generally just a simple way to obtain test input for the real exercise, and is also probably encouraged by tutors who, pressed for time require submitted code to adhere to some 'course-style' to make marking easier. 'Course-style' is seldom the same thing as 'good-practice' or 'practical style', but that is not to say that it does not serve some purpose, or even that that purpose is beneficial to the students education.
Unfortunately novices often get hung up on this stuff when it is generally irrelevant to the actual purpose of the exercise. The problem is that user input using standard library primitives is not as foolproof and some tutors seem to think.

Related

How to definitely solve the scanf input stream problem

Suppose I want to run the following C snippet:
scanf("%d" , &some_variable);
printf("something something\n\n");
printf("Press [enter] to continue...")
getchar(); //placed to give the user some time to read the something something
This snippet will not pause! The problem is that the scanf will leave the "enter" (\n)character in the input stream1, messing up all that comes after it; in this context the getchar() will eat the \n and not wait for an actual new character.
Since I was told not to use fflush(stdin) (I don't really get why tho) the best solution I have been able to come up with is simply to redefine the scan function at the start of my code:
void nsis(int *pointer){ //nsis arconim of: no shenanigans integer scanf
scanf("%d" , pointer);
getchar(); //this will clean the inputstream every time the scan function is called
}
And then we simply use nsis in place of scanf. This should fly. However it seems like a really homebrew, put-together-with-duct-tape, solution. How do professional C developers handle this mess? Do they not use scanf at all? Do they simply accept to work with a dirty input stream? What is the standard here?
I wasn't able to find a definite answer on this anywhere! Every source I could find mentioned a different (and sketchy) solution...
EDIT: In response to all commenting some version of "just don't use scanf": ok, I can do that, but what is the purpose of scanf then? Is it simply an useless broken function that should never be used? Why is it in the libraries to begin with then?
This seems really absurd, especially considering all beginners are taught to use scanf...
[1]: The \n left behind is the one that the user typed when inputting the value of the variable some_variable, and not the one present into the printf.
but what is the purpose of scanf then?
An excellent question.
Is it simply a useless broken function that should never be used?
It is almost useless. It is, arguably, quite broken. It should almost never be used.
Why is it in the libraries to begin with then?
My personal belief is that it was an experiment. It tries to be the opposite of printf. But that turned out not to be such a good idea in practice, and the function never got used very much, and pretty much fell out of favor, except for one particular use case...
This seems really absurd, especially considering all beginners are taught to use scanf...
You're absolutely right. It is really quite absurd.
There's a decent reason why all beginners are taught to use scanf, though. During week 1 of your first C programming class, you might write the little program
#include <stdio.h>
int main()
{
int size = 5;
for(int i = 0; i < size; i++) {
for(int j = 0; j < size; j++)
putchar('*');
putchar('\n');
}
}
to print a square. And during that first week, to make a square of a different size, you just edit the line int size = 5; and recompile.
But pretty soon — say, during week 2 — you want a way for the user to enter the size of the square, without having to recompile. You're probably not ready to muck around with argv. You're probably not ready to read a line of text using fgets and convert it back to an integer using atoi. (You're probably not even ready to seriously contemplate the vast differences between the integer 5 and the string "5" at all.) So — during week 2 of your first C programming class — scanf seems like just the ticket.
That's the "one particular use case" I was talking about. And if you only used scanf to read small integers into simple C programs during the second week of your first C programming class, things wouldn't be so bad. (You'd still have problems forgetting the &, but that would be more or less manageable.)
The problem (though this is again my personal belief) is that it doesn't stop there. Virtually every instructor of beginning C classes teaches students to use scanf. Unfortunately, few or none of those instructors ever explicitly tell students that scanf is a stopgap, to be used temporarily during that second week, and to be emphatically graduated beyond in later weeks. And, even worse, many instructors go on to assign more advanced problems, involving scanf, for which it is absolutely not a good solution, such as trying to do robust or "user friendly" input validation.
scanf's only virtue is that it seems like a nice, simple way to get small integers and other simple input from the user into your early programs. But the problem — actually a big, shuddering pile of 17 separate problems — is that scanf turns out to be vastly complicated and full of exceptions and hard to use, precisely the opposite of what you'd want in order to make things easy for beginners. scanf is only useful for beginners, and it's almost perfectly useless for beginners. It has been described as being like square training wheels on a child's bicycle.
How do professional C developers handle this mess?
Quite simply: by not using scanf at all. For one thing, very few production C programs print prompts to a line-based screen and ask users to type something followed by Return. And for those programs that do work that way, professional C developers unhesitatingly use fgets or the like to read a full line of input as text, then use other techniques to break down the line to extract the necessary information.
In answer to your initial question, there's no good answer. One of the fundamental rules of scanf usage (a set of rules, by the way, that no instructor ever teaches) is that you should never try to mix scanf and getchar (or fgets) in the same program. If there were a good way to make your "Press [enter] to continue..." code work after having called scanf, we wouldn't need that rule.
If you do want to try to flush the extra newline, so that a later call to getchar might work, there are several questions here with a bunch of good answers:
scanf() leaves the newline character in the buffer
Using fflush(stdin)
How to properly flush stdin in fgets loop
There's one more unrelated point that ends up being pretty significant to your question. When C was invented, there was no such thing as a GUI with multiple windows. Therefore no C programmer ever had the problem of having their output disappear before they could read it. Therefore no C programmer ever felt the need to write printf("Press [enter] to continue..."); followed by getchar(). I believe (another personal belief) that it is egregiously bad behavior for any vendor of a GUI-based C compiler to rig things up so that the output disappears upon program exit. Persistent output windows ought to be the default, for the benefit of beginning C programmers, with some kind of non-default option to turn that behavior off for those who don't want it.
Is scanf broken? No it is not. It is an excellent input function when you want to parse free form input data where few errors are to be expected. Free form means here that new lines are not relevant exactly as when you read/write very long paragraphs on a normal screen. And few errors expected is common when you read from files.
The scanf family function has another nice point: you have the same syntax when reading from the standard input stream, a file stream or a character string. It can easily parse simple common types and provide a minimal return value to allow cautious programmers to know whether all or part of all the expected data could be decoded.
That being said, it has major drawbacks: first being a C function, it cannot directly control whether the programmer has passed types meeting the format specifications, and second, as beginners are not consistenly hit on their head when they forget to control its return value, it is really too easy to make fully broken programs using it.
But the rule is:
if input is expected to be line oriented, first use fgets to get lines and then sscanf testing return values of both
only if input is expect to be free form (irrelevant newlines), scanf should be used directly. But never without testing its return value except for trivial tests.
Another drawback is that beginners hope it to be clever. It can indeed parse simple input formats, but is only a poor man's parser: do not use it as a generic parser because that is not what it is intended for.
Provided those rules are observed, it is a nice tool consistent with most of C language and its standard library: a simple tool to do simple things. It is up to programmers or library implementers to build richer tools.
I have only be using C language for more than 30 years, and was never bitten by scanf (well I was when I was a beginner, but I now know that I was to blame). Simply I have just tried for decades to only use it for what it can do...

Jest -- Proper way to write test descriptions

This may seem pedantic but I'm asking from a learner's perspective.
The syntax of the it helper lends itself to the syntax
it('should behave in this certain way...')
and most all the coders I know insist on writing 'should' for every single test.
To me this is a subtle but annoying pet peeve because the whole point of it is to save me keystrokes -- if it comes with the requirement to type should for every test it seems like an utter waste of a shortcut.
Is it a best practice to write it('should...') or can we simply write it('behaves in this expected way')?
Serious question -- I'm losing my mind over this little detail.
This might be an opinion based question, but I found myself writing a lot of shoulds in the past, until I found a spotify repository that changed my testing life
Should up is a CLI that removes shoulds from your tests and they explain in the README.md why and how you should write tests.
Basically it just removes a lot of text that is just redundant and makes everything easier to read.

Is it possible to capture Enter keypresses in C?

I want to write a C program that tokenizes a string and prints it out word by word, slowly, and I want it to simultaneously listen for the user pressing Enter, at which point they will be able to input data. Is this possible?
Update: see #modifiable-lvalue 's comment below for additional helpful information, e.g., using getchar() instead of getch(), if you go that route.
Definitely possible. Just be aware that gets() may not be entirely helpful for this purpose, since gets() interprets enter not as enter per se, but as "now, I the user, have entered as much string as I want to". So the input gathered by gets() from pressing just an enter will appear as an empty string (which might be workable for you).
See: http://www.cplusplus.com/reference/cstdio/gets/.
But there are other reasons not to use gets()--it does not let you specify a maximum to read in, so it's easy to overflow whatever buffer you are using. A security and bug nightmare waiting to happen. So you want fgets(), which allows you to specify a maximum size to read in. fgets() will place a newline in the string when an enter is pressed. (BTW, props to #jazzbassrob on fgets()).
You could also consider something like getch()--which really deals with individual key-presses (but it gets a bit complicated handling keys that have non-straightforward scan-codes). You may find this example helpful: http://www.daniweb.com/software-development/cpp/code/216732/reading-scan-codes-from-the-keyboard. But because of the scancodes issues, getch() is subject to platform details.
So if you want a more portable approach, you may need to use something heavier weight, but fairly portable, such as ncurses.
I suspect you can do what you want with either fgets(), keeping in mind that enter will give you a string with just a newline in it, or getch().
I just wanted you to be aware of some of the implementation/platform issues that can arise.
C can absolutely do this, but it's a little more complicated than one might guess at first attempt. That's because terminal input is, historically, very platform dependent.
Check out the conio.h library and its use in game making. You can compile with Borland. That was my first experience in unbuffered input and listening for keypresses in real time. There are certainly other ways though.

Parsing: library functions, FSM, explode() or lex/yacc?

When I have to parse text (e.g. config files or other rather simple/descriptive languages), there are several solutions that come to my mind:
using library functions, e.g. strtok(), sscanf()
a finite state machine which processes one char at a time, tokenizing and parsing
using the explode() function I once wrote out of pure boredom
using lex/yacc (read: flex/bison) to generate an appropriate parser
I don't like the "library functions" approach. It feels clumsy and awkward. explode(), while it doesn't take much new code, feels even more blown up. And flex/bison often seems like sheer overkill.
I usually implement a FSM, but at the same time I already feel sorry for the poor guy that may have to maintain my code at a later point.
Hence my question:
What is the best way to parse relatively simple text files?
Does it matter at all?
Is there a commonly agreed-upon approach?
I'm going to break the rules a bit and answer your questions out of order.
Is there a commonly agreed-upon approach?
Absolutely not. IMHO the solution you choose should depend on (to name a few) your text, your timeframe, your experience, even your personality. If the text is simple enough to make flex and bison overkill, maybe C is itself overkill. Is it more important to be fast, or robust? Does it need to be maintained, or can it start quick and dirty? Are you a passionate C user, or can you be enticed away with the right language features? &c., &c.
Does it matter at all?
Again, this is something only you can answer. If you're working closely with a team of people, with particular skills and abilities, and the parser is important and needs to be maintained, it sure does matter! If you're writing something "out of pure boredom," I would suggest that it doesn't matter at all, no. :-)
What is the best way to parse relatively simple text files?
Well, I don't know that you're going to like my answer. Maybe first read some of the other fine answers here.
No, really, go ahead. I'll wait.
Ah, you're back and relaxed. Let's ease into things, shall we?
Never write it in 'C' if you can do it in 'awk';
Never do it in 'awk' if 'sed' can handle it;
Never use 'sed' when 'tr' can do the job;
Never invoke 'tr' when 'cat' is sufficient;
Avoid using 'cat' whenever possible.
-- Taylor's Laws of Programming
If you're writing it in C, but C feels like the wrong tool...it really might be the wrong tool. awk or perl will likely do what you're trying to do without all the aggravation. You may even be able to do it with cut or something similar.
On the other hand, if you're writing it in C, you probably have a good reason to write it in C. Maybe your parser is a tiny part of a much larger system, which, for the sake of argument, is embedded, in a refrigerator, on the moon. Or maybe you loooove C. You may even hate awk and perl, heaven forfend.
If you don't hate awk and perl, you may want to embed them into your C program. This is doable, in principle--I've never done it myself. For awk, try libmawk. For perl, there are proably a few ways (TMTOWTDI). You can run perl separately using popen to start it, or you can actually embed a Perl interpreter into your C program--see man perlembed.
Anyhow, as I've said, "the best way to parse" entirely depends on you and your team, the problem space, and your approach to the issue. What I can offer is my opinion.
I'm going to assume that in your C-only solutions (library functions and FSM (considering your explode to essentially be a library function)) you've already done your best at isolating the relevant code, designing the code and files well, and so forth.
Even so, I'm going to recommend lex and yacc.
Library functions feel "clumsy and awkward." A state machine seems unmaintainable. But you say that lex and yacc feel like overkill.
I think you should approach your complaints differently. What you're really doing is specifying a FSM. However, you're also hiring someone to write and maintain it for you, thereby solving most of the maintainability problem. Overkill? Did I mention they'll work for free?
I suspect, but do not know, that the reason lex and yacc originally felt like overkill was that your config / simple files just felt too, well, simple. If I'm right (a big if), you may be able to do most of your work in the lexer. (It's even conceivable that you can do all of your work in the lexer, but I know nothing about your input.) If your input is not only simple but widespread, you may be able to find a lexer/parser combination freely available for what you need.
In short: if you can do this not in C, try something else. If you want C, use lex and yacc--they have a little overhead, but they're a very good solution.
If you can get it to work, I'd go with an FSM, but with a huge assist from Perl-compatible regular expressions. This library is easy to understand, and you ought to be able to trim back sufficient extraneous spaghetti to give your monster that aerodynamic flair to which all flying monsters aspire. That, and plenty of comments in well-structured spaghetti, ought to make your code-maintaining successor comfortable. (And, as I'm sure you know, that code-maintaining successor is you after six months, when you've moved on to something else and the details of this code have slipped your mind.)
My short answer is to use the right too for the problem. If you have configuration files use existing standards and formats e.g. ini Files and parse them using Boost program_options.
If you enter the world of "own" languages use lex/yacc, since they provide you with the required features, but you have to consider the cost of maintaining the grammar and language implementation.
As a result I would recommend to further narrow you problem scope to find the right tool.

Why should I use a human readable file format?

Why should I use a human readable file format in preference to a binary one? Is there ever a situation when this isn't the case?
EDIT:
I did have this as an explanation when initially posting the question, but it's not so relevant now:
When answering this question I wanted to refer the asker to a standard SO answer on why using a human readable file format is a good idea. Then I searched for one and couldn't find one. So here's the question
It depends
The right answer is it depends. If you are writing audio/video data for instance, if you crowbar it into a human readable format, it won't be very readable! And word documents are the classic example where people have wished they were human readable, so more flexible, and by moving to XML MS are going that way.
Much more important than binary or text is a standard or not a standard. If you use a standard format, then chances are you and the next guy won't have to write a parser, and that's a win for everyone.
Following this are some opinionated reasons why you might want to choose one over the other, if you have to write your own format (and parser).
Why use human readable?
The next guy. Consider the maintaining developer looking at your code 30 years or six months from now. Yes, he should have the source code. Yes he should have the documents and the comments. But he quite likely won't. And having been that guy, and had to rescue or convert old, extremely, valuable data, I'll thank you for for making it something I can just look at and understand.
Let me read AND WRITE it with my own tools. If I'm an emacs user I can use that. Or Vim, or notepad or ... Even if you've created great tools or libraries, they might not run on my platform, or even run at all any more. Also, I can then create new data with my tools.
The tax isn't that big - storage is free. Nearly always disc space is free. And if it isn't you'll know. Don't worry about a few angle brackets or commas, usually it won't make that much difference. Premature optimisation is the root of all evil. And if you are really worried just use a standard compression tool, and then you have a small human readable format - anyone can run unzip.
The tax isn't that big - computers are quick. It might be a faster to parse binary. Until you need to add an extra column, or data type, or support both legacy and new files. (though this is mitigated with Protocol Buffers)
There are a lot of good formats out there. Even if you don't like XML. Try CSV. Or JSON. Or .properties. Or even XML. Lots of tools exist for parsing these already in lots of languages. And it only takes 5mins to write them again if mysteriously all the source code gets lost.
Diffs become easy. When you check in to version control it is much easier to see what has changed. And view it on the Web. Or your iPhone. Binary, you know something has changed, but you rely on the comments to tell you what.
Merges become easy. You still get questions on the web asking how to append one PDF to another. This doesn't happen with Text.
Easier to repair if corrupted. Try and repair a corrupt text document vs. a corrupt zip archive. Enough said.
Every language (and platform) can read or write it. Of course, binary is the native language for computers, so every language will support binary too. But a lot of the classic little tool scripting languages work a lot better with text data. I can't think of a language that works well with binary and not with text (assembler maybe) but not the other way round. And that means your programs can interact with other programs you haven't even thought of, or that were written 30 years before yours. There are reasons Unix was successful.
Why not, and use binary instead?
You might have a lot of data - terabytes maybe. And then a factor of 2 could really matter. But premature optimization is still the root of all evil. How about use a human one now, and convert later? It won't take much time.
Storage might be free but bandwidth isn't (Jon Skeet in comments). If you are throwing files around the network then size can really make a difference. Even bandwidth to and from disc can be a limiting factor.
Really performance intensive code. Binary can be seriously optimised. There is a reason databases don't normally have their own plain text format.
A binary format might be the standard. So use PNG, MP3 or MPEG. It makes the next guys job easier (for at least the next 10 years).
There are lots of good binary formats out there. Some are global standards for that type of data. Or might be a standard for hardware devices. Some are standard serialization frameworks. A great example is Google Protocol Buffers. Another example: Bencode
Easier to embed binary. Some data already is binary and you need to embed it. This works naturally in binary file formats, but looks ugly and is very inefficient in human readable ones, and usually stops them being human readable.
Deliberate obscurity. Sometimes you don't want it obvious what your data is doing. Encryption is better than accidental security through obscurity, but if you are encrypting you might as well make it binary and be done with it.
Debatable
Easier to parse. People have claimed that both text and binary are easier to parse. Now clearly the easiest to parse is when your language or library supports parsing, and this is true for some binary and some human readable formats, so doesn't really support either. Binary formats can clearly be chosen so they are easy to parse, but so can human readable (think CSV or fixed width) so I think this point is moot. Some binary formats can just be dumped into memory and used as is, so this could be said to be the easiest to parse, especially if numbers (not just strings are involved. However I think most people would argue human readable parsing is easier to debug, as it is easier to see what is going on in the debugger (slightly).
Easier to control. Yes, it is more likely someone will mangle text data in their editor, or will moan when one Unicode format works and another doesn't. With binary data that is less likely. However, people and hardware can still mangle binary data. And you can (and should) specify a text encoding for human-readable data, either flexible or fixed.
At the end of the day, I don't think either can really claim an advantage here.
Anything else
Are you sure you really want a file? Have you considered a database? :-)
Credits
A lot of this answer is merging together stuff other people wrote in other answers (you can see them there). And especially big thanks to Jon Skeet for his comments (both here and offline) for suggesting ways it could be improved.
It entirely depends on the situation.
Benefits of a human readable format:
You can read it in its "native" format
You can write it yourself, e.g. for unit tests - or even for real content, depending on what it's for
Probable benefits of a binary format:
Easier to parse (in terms of code)
Faster to parse
More efficient in terms of space
Easier to control (any time you need text in there, you can ensure it's UTF-8 encoded, and length prefixed etc)
Easier to include opaque binary data efficiently (images, etc - with a text format you'd be getting into base64)
Don't forget that you can always implement a binary format but produce tools to convert to/from a human-readable format as well. That's what the Protocol Buffers framework does - it's actually pretty rare IME to need to parse a text version of a protocol buffer, but it's really handy to be able to write it out as text.
EDIT: Just in case this ends up being an accepted answer, you should also bear in mind the point made by starblue: Human readable forms are much better for diffing. I suspect it would be feasible to design a binary format which is appropriate for diffing (and where a human-readable diff could be generated) but out-of-the-box support from existing diff tools will be better for text.
Version control is easier with text formats, because changes can easily be viewed and merged.
Especially MS-Word is giving us grief in this respect.
Open format -- no binary bit juggling
Readability :)
Interchange across platforms
Debugging aid
Easily parsed (and easily converted to any format)
One important point: you write a parser once, but read the output many times. That kind of tilts the balance in favor of HRF.
A major reason is that if someone needs to read the data say, 30 years from now, human readable format can be figured out. Binary is much more difficult.
If your have large data sets that are binary by nature (e.g. images), they obviously can't be stored in any other than binary form. But even then, the metadata could (and should!) be human-readable.
There's something called The Art of Unix Programming.
I won't say it's good or bad, but it's fairly famous. It has a whole chapter called Textuality in which the author asserts that human readable file format are an important part of the Unix way of programming.
They open the possibility to be created/edited with tools other than the original ones. New and better tools can be developed by others, integration into third party applications becomes possible. Think about binary iCal files, for example - would the format have been a success?
Apart from that: Human readable files improve the ability to debug or, for the savvy user, at least find the reason an error.
Pros for binary:
fast to parse
generally smaller data
easy to write a parser for
Pros for human readable:
easier to understand while reading - no "field X is set to 4 487 which means that the reactor should be shut down NOW"
if using something like XML easy to write a tool that will parse any file
I have had to deal with both types. If you are sending data and you want to keep it small binary is good. If you expect people to read it then human readable is good.
Human readable generally somewhat self documenting as well. And with binary it is bery easy to make mistakes - and hard to spot them.
Editable
Readable (duh!)
Printable
Notepad and vi enabled
Most importantly , their function can be decuded from the content (well mostly)
Because you are a human, and sooner or later you (or one of your customers) will be able to read the data.
We only use binary format if speed is an issue. And even then debugging is troublesome so we added a human readable equivalent.
Interoperability is the standard argument, i.e. a human-readable form is easier for developers of disparate systems to deal with so therefore confers some advantage.
Personally I think that is not true, and the performance benfits of binary files ought to beat that argument, especially if you publish your protocol. However the ubiquity of XML/HTTP based frameworks for machine interactions means that it is easier to adopt.
XML is way over-used.
Just a quick illustration where human-readable document format can be a better choice:
documents used for deploying application in production
We used to have our release notes in word format, but that release notes document had to be opened on various environment (Linux, Solaris) in pre-production and production plateform.
It also had to be parsed in order to extract various data.
In the end, we switched to a wiki-based syntax, still displayed nicely in HTML through a wiki, but still used as a simple text file in other situation.
As an adjuct to this, there are differing levels of human readability, and all are enhanced by using a good editor or viewer with code coloring, folding or navigation.
For example,
JSON is quite readable even in plaintext
XML has the angle bracket tax but is usable when using a good editor
INI is mostly human readable
CSV can be readable, but is best when loaded into a spreadsheet.
No one said, so I will: human-readability is not really a property of a file format (all files are binary after all), but rather of a file format and viewer app combination.
So called human readable formats are all based on top of additional abstraction layer of one of existing text encodings. And viewer programs (often also serving as an editor) that are capable of rendering these encodings in a form readable by humans are very common.
Text encoding standards are widespread and fairly mature, which means they're unlikely to evolve much in the foreseeable future.
Usually on top of the text encoding layer of the format we find a syntax layer that is reasonably intuitive given target user knowledge and cultural background.
Hence the benefits of "human-readable" formats:
Ubiquity of suitable viewers and editors.
Timelessness (given that cultural conventions won't change much).
Easiness-to-learn, read and modify.
Reliance on the extra abstraction layer makes text encoded files:
Space hungry.
Slower to process.
"Binary" files do not resort to text encoding abstraction layer as a base (or a common denominator), but they might or might not use some sort of an extra abstraction more suitable for their purpose and hence, they can be much better optimised for a specific task at hand meaning:
Faster processing.
Smaller footprint.
On the other hand:
Viewers and editors are specific for a particular binary format and make interoperability harder.
Viewers for any given format are less wide spread, because they are more specialised.
Formats might evolve significantly or go out of use over time: their main benefit in being very well suited for a particular task and as the task or task requirements evolve, so does the format.
Take a moment and think about application OTHER than web development.
The assumption that:
A) It has a meaning that is "obvious" in text format is false.
Things like control systems for a steel mill, or manufacturing plant don't typically have any advantage in being human readable. The software for those types of environments will typically have routines to display data in a graphically meaningful manner.
B) Outputting it in text is easier. Unnecessary conversions that actually require more code make a system LESS robust. The fact of the matter if you are NOT using a language which treats all variables as strings then human readable text is an extra conversion. I.E. Extra code means more code to be verified, tested and more opportunities to intro errors in the application.
C) You have to parse it anyway. It many cases for DSP systems I've worked on (I.E. NO Human readable interface to start with.) Data is streamed out of the system in uniformly sized packets. Logging the data for analysis and later processing is simply a matter of pointing to the beginning of a buffer and writing a multiple of the block size to the data logger system. This allows me to analysis the data "untouched" as the customer's system would see it where, once again, converting it to a different format would result in possibly introducing errors. Not only that, if you only save the "converted data" you may lose information in the translation that may help you diagnose a problem.
D) Text is a Natural format for the data. No hardware I've ever seen uses a "TEXT" interface. (My first job out of college was writing a device driver for a camera line scan camera.) The system build on top of it does MIGHT, but for every "PC".
For web pages where the information has a "natural" meaning in text format, so sure knock yourself out. For processing source code it’s a no brainer, of course. But the pervasive computing environments where even you refrigerator and TOOTHBRUSH are going to have a processor built in, not so much. Simply burdening these type of systems with the overhead of adding the ability to process text introduces unnessary complexity. You're not going to link "printf" into the software for an 8-bit micro that controls a mouse. (And yeah, somebody has to write that software too.)
The world is not a black and white place where the only forms of computing that need to be consider are PCs and Web servers.
Even on a PC, if I can directly load the data directly into a datastructure using a single OS read call and be done with it without writing serialize and deserializing routines, that's fantastic, check a blocks CRC job -- done on to the next problem.
Uhm… because human-readable file formats can be read by humans? Seems like a pretty good reason to me.
(Well, for configuration files it’s inevitable that they are read (and edited!) by humans. Files for persistent storage of some sort or the other don’t really need to be read or edited by humans.)
Why should I use a human readable file
format in preference to a binary one?
Is there ever a situation when this
isn't the case?
Yes, compressed volumes (zip, jpeg, mp3, etc) would be suboptimal if they were human readable.
I guess its not good in most situations probably. I think the main reason for these formats such as JSON and XML is because of web development, and general use over the web where you need to be able to process data on the user-side and you cant necessarily read binary. A good example of a bad case to use a human readable format would be any thing non textual such as images, video, audio. Ive noticed the use of non-binary formats being used in web development where it does not make sense, I feel guilty!
Often files become part of your human interface thus they should be human friendly (not programmer only)
The only time that I use a binary stream for files that aren't archives is when I want to conceal things from the casual observer. For instance, if I'm making temporary files that only my application should be editing, I'll use binary.
Its not an attempt to obfuscate, rather its just discouraging the user from editing the file by hand (which could break the application).
One instance where this would be a good idea is storing / saving running data about some game.. i.e. to save your game and continue later. Other scenarios would describe intermediate files, but those are typically binary / byte compiled anyway.
Why should I use a human readable file
format in preference to a binary one?
Depends on the content and context, i.e. where is the data coming from and going. If the data is typically directly written by a human, storing it in an format that can be manipulated through a text editor is a good idea. For example, program source code will normally be stored as human readable with good reason. However, if we are archiving it, or sharing it using a version control system, our storage strategy will change.
The human format is simplier to parsing and debugging if you have a problem with a field (example: a field contains a number where the spec says the this field must be a string), also the human format is closier to domain of problem.
I prefer the binary format with a lot of data AND i'm sure that I have the software for parsing him :)
When reading Fielding's dissertation about REST, I really liked the concept of "Architectural Properties"; one that sticked was "Visibility". That's what we're talking about here: being able to 'see' the data. Huge benefits when debugging the system.
One aspect that I find missing in the other answers: enforcing semantics.
From the moment you go for human readable, you allow the silly notepad user to create data to be fed into the system. No way to guarantee this data makes sense. No way to guarantee the system will respond in a sensible way.
So in the case you don't need to notepad-inspect your data, and you want to enforce valid data (by e.g. usage of an API) rather than first validating it, you better avoid human readable data. If debuggeability is an issue (it most often is), inspection of the data can be done by using the API, too.
Human readable is not equal to easier to be parsed by machine code.
Take human natural language as an example. :) Machine parsing of human language is still a pending problem to be fully solved.
So I agree with https://stackoverflow.com/a/714111/2727173 which has much deeper insight on this question.

Resources