What does the 9th commandment mean?

In The Ten Commandments for C Programmers, what is your interpretation of the 9th commandment?
The 9th commandment:
Thy external identifiers shall be unique in the first six characters, though this harsh discipline be irksome and the years of its necessity stretch before thee seemingly without end, lest thou tear thy hair out and go mad on that fateful day when thou desirest to make thy program run on an old system.
Exactly what is this all about?

Old linkers only used a limited number of characters of the symbol - I seem to recall that the old IBM mainframes I started programming on used only 8 characters. The C standards people settled on 6 characters as a "lowest common denominator", but allowed a linker to resolve longer names if it wanted to.
If you really hit one of these lowest-common-denominator linkers, the external symbols (function names, external variables, etc.) ABCDEFG and ABCDEFH would appear the same to it, since both truncate to ABCDEF. Unless you're programming on really old hardware, you can safely ignore this "commandment".
Note that any linker that can't handle more than 6 characters can't do C++ either because of the name mangling.
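To make the collision concrete, here is a minimal sketch (the function names are invented for illustration):

/* Both names are identical in their first six characters. */
int getcustomer_name(void) { return 1; }   /* a six-character linker sees "getcus" */
int getcustomer_id(void)   { return 2; }   /* also "getcus": duplicate-symbol error at link time */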

External identifier = something that might have to be called from another system
The reason for the first six characters being unique is that an ancient system may have a six-character limit on its identifiers. If, one day, such a system tries to call your code, it needs to be able to tell the difference between all of your functions.
These days, this seems overly conservative to me, unless you are working with a lot of legacy systems.

Here are the minimum number of significant characters in an external identifier that must be supported by C/C++ compilers under various standards:
C90 - 6 characters
C99 - 31 characters
C++98 - 1024 characters
Here's an example of the kinds of problems that you can run into if your toolset skimps on these limits (from http://www.comeaucomputing.com/libcomo/):
Note to users with Borland and
MetroWerks CodeWarrior as backend C:
==================================================================
Note that the Borland compiler and linker, and the Metrowerks compiler,
seem to have a maximum external id length of 250 characters. It turns
out that some of the generated mangled template names are unable to
fit within that space. Therefore, when Borland or Metrowerks is used
as the backend C compiler, we have remapped some of the names libcomo
uses to shorter names. So short in fact we could not get away with
names beginning with underscores. In fact, it was necessary to map most
to 2 character id names.

In response to:
C++98 - 1024 characters
'begin humor'
Addendum to 9th commandment:
If thy external identifiers approach'th to be anywhere near as long as one-thousand-and-twenty-four, thou shouldst surely be quickly brought outside and shot.
'/end humor'

A lot of old compilers and linkers had limitations on how long an identifier could be. Six characters was a common limit. Actually, they could be longer than that, but the compiler or linker would throw away everything after the sixth character.
This was usually done to conserve symbol table memory.

It means you're looking at a piece of ancient history. Those commandments are mostly true, but that 9th one may as well actually be carved into a stone tablet, it's so old.
The remaining mystery is: creat. What was wrong with create? It's only six letters!

According to this site:
What does this mean? Your globals should be "Unique to the first six letters", not "limited to six letters". This is in ANSI, I hear, because of utterly painful "obsolescence" of some linkers. Hopefully ANSI will some day say "linkers will have to do longer names, or they'll cry and do longer names". (And all rational people will just ignore this commandment and tell people to upgrade their 2-penny linker - this may not get you friends, or make warnings happy...)

Related

Display-width of multibyte character in C standard library – how accurate is the database?

The wcwidth call of the Standard C Library returns 2 for Asian characters. Then there are Unicode symbols, like arrows, for which it returns 1. It is often the case that a character is wider than a single column, yet the library isn't wrong, because terminals print them at a single column and allow visual overlap, sometimes with passable results, as for the ndash "–".
Are there characters that plainly suffer? I wonder how Asian people and people from other regions use terminals, and what solutions they have developed. For example, displaying a shell prompt that spans the whole line and contains the current directory name can be a serious problem. Can wcwidth be patched to obtain better results? Using github/wcwidth.c as a starting point, for example.
There are differences with the ambiguous-width characters. xterm has both Markus Kuhn's original (the link you show appears to be his, with the comment-header removed), as well as an alternate version with adjustments to accommodate CJK (East Asian). Besides that, it checks at startup for usable system locale tables. Some are good enough; others are not. No one's done a systematic (unbiased) survey of what's actually good (you may see some opinions on that aspect, offered as answers).
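As a concrete illustration (a minimal sketch; note that wcwidth is POSIX rather than ISO C, and the results assume a UTF-8 locale with usable width tables):

#define _XOPEN_SOURCE 700   /* wcwidth() is POSIX, not ISO C */
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");   /* use the environment's (UTF-8) locale */

    /* 'a', a CJK ideograph, a rightwards arrow, and an ndash */
    wchar_t samples[] = { L'a', L'\u4E2D', L'\u2192', L'\u2013' };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        printf("U+%04X -> %d column(s)\n",
               (unsigned)samples[i], wcwidth(samples[i]));
    /* typically: the CJK character reports 2 columns, the arrow and ndash 1 */
    return 0;
}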

Standard C function names

Is there any rationale for the abbreviated way standard C functions are named? For example, malloc() is short for 'memory allocation'. sprintf() is 'string print formatted'. Neither of these names is very good at telling you what the function actually does. It never occurred to me how terrible some of these abbreviated function names are until recently, when I had to teach a new intern many of these functions.
When the language was being developed, was there any reason malloc() was chosen over memAllocate() or something similar? My best guess would be that they more closely resemble UNIX commands, but that doesn't feel like the right answer.
Check out http://publications.gbdirect.co.uk/c_book/chapter2/keywords_and_identifiers.html -
The problem is that there was never any guarantee that more than a
certain number of characters would be checked when names were compared
for equality—in Old C this was eight characters, in Standard C this
has changed to 31.
Basically, in the past (a long while back) you could only count on the first eight characters for uniqueness in a function name. So you end up with a bunch of short names for the core functions.
As Neal Stephenson wrote about Unix in In the Beginning Was the Command Line,
Note the obsessive use of abbreviations and avoidance of capital letters; this is a system invented by people to whom repetitive stress disorder is what black lung is to miners. Long names get worn down to three-letter nubbins, like stones smoothed by a river.
The first version of Unix and the first C compiler were written using versions of ed. Not vi, not emacs, not anything resembling an IDE, but the line-based ed. There comes a point where reducing the number of keystrokes really does increase the number of SLOC you can write per day, when you're inventing something brand-new and writing it for the first time.
The historical justification is, of course, that the C standard originally required implementations to distinguish only the initial 6 characters of external identifier names. This allowance was removed in C99. However, users of the C language generally:
Aim to write source code in such a way that it fits in a reasonable number of columns, usually 80 or fewer, which is difficult with long identifier names.
Type identifier names with a keyboard, which is difficult and a waste of time when the identifiers are long.
Tend to prefer high information density and signal-to-noise ratio in source code.

Internationalize C program

I have a C program written for some embedded device in English.
So there are codes like:
SomeMethod("Please select menu");
OtherMethod("Choice 1");
Say I want to support other languages, but I don't know how much memory I have on this device. I don't want to store strings in other memory areas where I might have less space and crash the program. I want to store the strings in the same memory area, taking the same space. So I thought of this:
SomeMethod(SELECT_MENU);
OtherMethod(CHOICE_1);
And a separate header file:
English.h
#define SELECT_MENU "Please select menu"
#define CHOICE_1 "Choice 1"
For other languages:
French.h
#define SELECT_MENU "Text in french"
#define CHOICE_1 "same here"
Now depending which language I want I would include that header file only.
Does this satisfy the requirement that, if I select the English version, my internationalized program's strings will be stored in the same memory region and take the same space as before? (I know the French might take more - but that is another issue, related to French letters taking more bytes.)
I thought that since I will use defines, the strings will be placed at the same place in memory they were before.
At least on Linux and many other POSIX systems, you should be interested in gettext(3) (and in the positional arguments to printf(3), e.g. %3$d instead of %d in the format string).
Then you'll code
printf(gettext("here x is %d and y is %d"), x, y);
and that is common enough that the usual habit is to
#define _(X) gettext(X)
and code later
printf(_("here x is %d and y is %d"), x, y);
You'll also want to process message catalogs with msgfmt(1).
You'll find several documents on internationalization (i18n) and localization, e.g. the Debian Introduction to i18n. Read also locale(7). And you probably should always use UTF-8 today.
The advantage of such message catalogs (all of this is already available by default on Linux systems!) is that the internationalization happens at runtime. There is no reason to restrict it to compile time. Message catalogs can be (and often are) translated by people other than the developers. You'll have directories in your file system (e.g. on some cheap flash memory, like an SD card) containing these.
Notice that internationalization & localization is a difficult subject (read more documentation to understand how difficult it can be once you want to handle non-European languages), and the Linux infrastructure has designed it quite well (probably better, and more efficiently, than what you are suggesting with your macros). Qt and Gtk also have extensive support for internationalization (based upon gettext etc.).
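Pulling the pieces above together, a minimal self-contained sketch (the domain name "myprog" and the catalog path are assumptions; on glibc the gettext functions live in libc, elsewhere link with -lintl):

#include <libintl.h>
#include <locale.h>
#include <stdio.h>

#define _(X) gettext(X)

int main(void)
{
    setlocale(LC_ALL, "");                          /* honour LANG / LC_MESSAGES */
    bindtextdomain("myprog", "/usr/share/locale");  /* where the .mo catalogs are installed */
    textdomain("myprog");                           /* select our catalog */

    int x = 3, y = 4;
    printf(_("here x is %d and y is %d\n"), x, y);  /* translated at runtime if a catalog exists */
    return 0;
}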
Let me get this straight: you want to know whether, if preprocessor-defined constants (in your case, related to i18n) are swapped out before compilation, they will (a) take the same amount of memory (between the macro and non-macro version) and (b) be stored in the same program segment?
The short answer is (a) yes and (b) yes-ish.
For the first part, this is easy. Preprocessor-defined constants are whole-text replaced with their #define'd values by the preprocessor before being passed into the compiler. So, to the compiler,
#define SELECT_MENU "Please select menu"
// ...
SomeMethod(SELECT_MENU);
is read in as
SomeMethod("Please select menu");
and therefore the two will be identical for all intents and purposes, except for how they appear to the programmer.
For the second part, this is a bit more complex. Constant string literals in a C program are normally placed in the program's read-only data segment. If a literal is used as the initializer of a char array with automatic storage duration, its bytes are instead copied into that array on the stack at runtime (as discussed in the answers to this question). So where the bytes end up depends on how the preprocessor-defined constant is used in the program.
Considering what I said in the first part, if you write char buffer[] = MY_CONSTANT; inside a function, the literal's bytes live in the data segment and are copied into a stack array each time the function runs. If you write someFunction(MY_CONSTANT); or char *c_str = MY_CONSTANT;, then the literal is stored in the (read-only) data segment and you receive a pointer to that area at runtime. There are many ways this may manifest in your actual program; having the values #define'd does not by itself determine how they will be stored in your compiled program, although if they are used in certain ways only, you can be reasonably certain where the bytes will live.
EDIT Modified first half of answer to accurately address what is being asked, thanks to @esm's comment.
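To illustrate the distinction, a minimal sketch reusing the question's SELECT_MENU macro:

#include <stdio.h>

#define SELECT_MENU "Please select menu"

const char *menu_ptr = SELECT_MENU;     /* pointer to the literal in read-only data */

int main(void)
{
    char menu_buf[] = SELECT_MENU;      /* automatic array: the bytes are copied onto the stack */
    menu_buf[0] = 'p';                  /* fine: we own this copy */
    /* menu_ptr[0] = 'p'; */            /* undefined behaviour: the literal is read-only */
    printf("%s\n%s\n", menu_ptr, menu_buf);
    return 0;
}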
The pre-processor use here is simple substitution: there is no difference in the executable code between
SomeMethod("Please select menu");
and
#define SELECT_MENU "Please select menu"
...
SomeMethod(SELECT_MENU);
But the memory usage is unlikely to be exactly the same for each language.
In practice, messages are often more complicated than a simple translation. For example in the message
Input #4 is dangerous
Would you have
#define DANGER "Input #%d is dangerous"
...
printf(DANGER, inpnum);
Or would you do
#define DANGER "Dangerous input #"
...
printf(DANGER);
printf("%d", inpnum);
I use these examples to show that you must consider language versions from the outset, not as an easy fix after the fact.
Since you mention "a device" and are concerned with memory usage, I guess you are working on an embedded system. My own preferred method is to provide language modules containing an array of words or phrases, with #defines referencing which array element to use to piece together a message. That could also be done with an enum.
For example (in practice the English-language strings would be in a separately included source file):
#include <stdio.h>

char *message[] = { "Input",
                    "is dangerous" };

#define M_INPUT  0
#define M_DANGER 1

int main(void)
{
    int input = 4;
    printf("%s #%d %s\n", message[M_INPUT], input, message[M_DANGER]);
    return 0;
}
Program output:
Input #4 is dangerous
The way you are doing it, if you compile the program as English, the French words will not be stored in the English version of the program.
The compiler will not even see the French words; they will not be in the final executable.
In some cases the compiler may see some data, but it ignores that data if it is not used by the program.
For example, consider this function:
void foo(void) {
    printf("qwerty\n");
}
If you define this function but don't use it in the program, the function foo and the string "qwerty" may well not find their way into the final executable (compilers and linkers routinely discard unused code and data).
Using a macro doesn't make any difference. For example, foo and foo2 are identical:
#define SOME_TEXT "qwerty\n"
void foo2(void) {
    printf(SOME_TEXT);
}
The string data is stored in the program's static (usually read-only) data area, not on the heap; its size is limited only by the size of the executable. Stack space only becomes a concern if SOME_TEXT is very large and its bytes are copied into an automatic array (default stack limits are typically a few megabytes).
So basically you don't have anything to worry about except the final size of the program.
To answer the question of whether this will take the same amount of memory, and whether the strings will be placed in the same section of the program, for the English macro version as for the English non-macro version: the answer is yes.
The C preprocessor (CPP) replaces every instance of the macro with the correct language string, and after the CPP run it is as if the macros were never there. The strings will still be placed in the read-only data section of the binary, assuming that is supported, just as if you didn't use macros.
So, to summarize: the English version with macros and the English version without macros are the same as far as the C compiler is concerned.
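A minimal sketch of how that selection might be wired up at build time, following the question's English.h/French.h scheme (the wrapper header and the LANG_FRENCH flag are assumptions):

/* strings.h - wrapper that picks the language at compile time */
#ifdef LANG_FRENCH
#include "French.h"
#else
#include "English.h"    /* default language */
#endif

Building the French variant is then just a compiler flag, e.g. cc -DLANG_FRENCH -c main.c, with no change to the sources.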

What must I know to handle UTF-8 in my C program?

I have a C program to which I now need to add support for UTF-8 characters. What must I know in order to do that? I've always heard how problematic it is to handle UTF-8 in a C/C++ environment. Why exactly is it problematic? How does a UTF-8 character differ from a usual C character, including in size? Can I do it without any operating-system help, in pure C, and still make it portable? What else should I have asked but didn't? What I'm looking to implement is this: the characters are names with accents (like the French word résumé) that I need to read, put into a symbol table, and then search for and print from a file. It's part of my configuration-file parsing (very much .ini-like).
There's an awesome article written by Joel Spolsky, one of the Stack Overflow creators.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Apart from that, you might want to query some other Q&A's regarding this subject, like Handling special characters in C (UTF-8 encoding).
As cited in the aforementioned Q&A, Tips on Using Unicode with C/C++ might give you the basics.
Two good links that I have used in the past:
The-Basics-of-UTF8
reading-unicode-utf-8-by-hand-in-c
valter
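For the use case in the question (accented names read from an .ini-like file and used as symbol-table keys), it often suffices to treat UTF-8 strings as opaque byte strings: byte-wise strcmp and hashing work unchanged. A minimal sketch of inspecting such a string, assuming the program runs under a UTF-8 locale:

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    setlocale(LC_ALL, "");                      /* needed for mbstowcs() to decode UTF-8 */

    const char *name = "r\xC3\xA9sum\xC3\xA9"; /* "résumé" encoded as UTF-8 bytes */

    printf("bytes: %zu\n", strlen(name));       /* 8 bytes: each accented letter takes 2 */

    wchar_t wide[32];
    size_t nchars = mbstowcs(wide, name, 32);   /* decode the multibyte string to wide chars */
    if (nchars != (size_t)-1)
        printf("characters: %zu\n", nchars);    /* 6 characters */

    return 0;
}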

How do I explain my colleagues that filenames should not contain uppercase characters or special characters?

From what I know, it's best practice to name files like this: file_name.txt - or, if you prefer, file-name.txt.
Now some people seem to like to name their files fileName.txt or FILENAME.TXT or "File Name.txt" - how do I explain to them that it's not a good idea? Why exactly is the aforementioned file naming best practice?
I only vaguely know that some file systems have trouble with uppercase, and that URIs should be lowercase-only to avoid confusion (though Wikipedia does have uppercase characters in its URLs, e.g. http://en.wikipedia.org/wiki/Sinusitis).
W.
Well, a problem with uppercase letters is that some filesystems (like NTFS as used by Windows) are case-insensitive and treat filename.txt and FILENAME.TXT as the same file, whereas other filesystems (ext, for example) treat these as two different files.
So, if you have some reference to a file that you called file.txt, and the reference points to the file File.txt, then on NTFS this would be no problem, but if you copy the files to a filesystem like ext, the reference would fail because that filesystem thinks there is no such file as File.txt.
Because of this, it's best practice to always use lowercase letters.
If your colleagues are clueless, then you might be able to convince them that ALL CAPS takes more storage, uses more power, and is more likely to get deleted.
However, if they are as knowledgeable about filenames as you are, there's little you can do to get them to side with your preference.
In this situation, I like to take the absurdist approach to help my colleagues want a reasonable approach. I suggest you start naming files with CrAzY cAsE. After a few directories of CrAzY cAsE, your ALL CAPS colleagues and your lowercase colleagues will come to you and ask you to stop. You then say: well, we should have a standard naming convention; I'm impartial as to the result if we can agree on a standard. Then nudge the discussion toward lower-case names, and declare that the binding compromise.
Maximilian has a good point!
It's best practice to avoid the possibility of confusion (dissimilar names treated as identical) but I work in a place where various systems are used, from DOS to Windows to Unix, and I have never been able to convince those users that the CAPS LOCK should be avoided.
Since I mostly deal with Unix-like systems, I would dearly love to legislate for lower-case everywhere, but I'm beating my head against a brick wall.
Best Practice is an alien concept to most computer users.
If your colleagues are programmers you might stand a chance.
The argument that all lower case is 'best practice' could just as easily be used to vindicate all CAPS as best practice.
I think it's fair to say that the vast majority of users don't operate in multi-platform environments, or at least not in a manner that's likely to cause them to encounter the issue raised here.
The issue is really only a problem when copying from a case-sensitive environment to a case-insensitive one where you have multiple case-variants of a filename within a single directory (somewhat unlikely). The broken-file-reference argument falls down, for me, when you consider that variation in directory structure is likely to be an equal issue in such situations.
At the end of the day, in a corporate environment there should be a published standard for such things that everyone is at least encouraged to follow; that, for me, is best practice. Those who don't follow the standard have only themselves to blame.
The POSIX standard (IEEE Std 1003.1) defines a character set for portable filenames (it does, however, indicate that case should be preserved). At least it removes spaces and other "special" characters from the set.
The set is [A-Za-z0-9._-], with the additional rule that a hyphen must not be the first character of the name.
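The set is easy to test against in C; a small sketch (the function name is invented):

#include <stdbool.h>
#include <string.h>

/* Check a name against the POSIX portable filename character set
   [A-Za-z0-9._-]; a portable name must also not start with a hyphen. */
static bool is_portable_filename(const char *name)
{
    static const char allowed[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz"
        "0123456789._-";
    if (name[0] == '\0' || name[0] == '-')
        return false;
    /* strspn() measures the initial run of allowed characters;
       the name is portable only if that run covers the whole string. */
    return strspn(name, allowed) == strlen(name);
}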
