Introduction
I am running out of Flash on my Cortex-M4 device. I analysed the code, and the biggest opportunity to reduce code size is simply in predefined constants.
- Example
typedef struct {
    const char *s1;
    uint16_t    value;
    const char *s2;
} Option;
const Option option364[] = {
    { "String1", 0x4523, "String2" },
    { "Str3", 0x1123, "S4" },
    { "String 5", 0xAAFC, "S6" }
};
Problem
The problem is that I have a (large) number of (short) strings to store, but most of them are used in tables - arrays of const structs that have pointers to the const strings mixed with the numerical data. Each string is variable in size; even so, I looked at changing the struct pointer to hold a simple (max-length) char array instead of a pointer - and there wasn't much difference. It didn't help that the compiler wanted to start each new string on a 4-byte boundary, which got me thinking...
Idea
If I could replace the 4-byte char pointer with a 2-byte index into a string table - a predefined linker section to which index was an offset - I would save 2 bytes per record right there, at the expense of a minor code bump. I'd also avoid the interior padding, since each string could start immediately after the previous string's NUL byte. And if I could be clever, I could re-use strings - or even part-strings - for the indexes.
Moreover, I'd change the 4 + 2 + 4 (+ 2 padding) layout to 2 + 2 + 2 - saving even more space!
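Something like this, as a sketch (string_table, the section it lives in, and PackedOption are placeholder names of mine):

#include <stdint.h>

extern const char string_table[];   /* lives in the dedicated linker section */

typedef struct {
    uint16_t s1;     /* 2-byte offset into string_table, not a 4-byte pointer */
    uint16_t value;
    uint16_t s2;
} PackedOption;      /* 6 bytes per record instead of 12 */

static inline const char *str(uint16_t offset)
{
    return &string_table[offset];   /* table base + index, then use as normal */
}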
- Consideration
Of course, inside the source code the housekeeping on all those strings, and the string table itself, would be a nightmare... unless I could get the compiler to help? I thought of changing the syntax of the actual source code: if I wanted a string to be in the string table, I would write it as #"String", where the # prefix would flag it as a string table candidate. A normal string wouldn't have that prefix, and the compiler would treat it as normal.
Implementation
So to implement this I'd have to write a pre-pre-compiler: something that would process just the #"" strings, replacing them with "magic" 16-bit offsets, and then pass everything else through to the real (pre)compiler for the actual compilation. The pre-pre-compiler would also have to write a new C file with the complete string table inside (although with a trick - see below), for the compiler to parse and hand to the linker for its dedicated section. Invoking this would be easy with the -no-integrated-cpp switch, to invoke my own pre-pre-processor that would in turn invoke the real one.
- Issues
Don't get me wrong; I know there are issues. For example, it would have to be able to handle partial builds. My solution there is that for every modified C file, it would write (if necessary) a parallel string table file. The "master" C string table file would be nothing more than a series of #includes, that the build would realise needed recompiling if one of its #includes had changed - or indeed, if a new #include was added.
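For illustration, the generated master file might look something like this (the file names, section name and fragment contents are hypothetical; the trick is that adjacent string literals concatenate, so each fragment just holds that source file's pooled literals):

/* string_table.c - generated by the pre-pre-compiler; do not edit.
   Adjacent string literals concatenate, so each included fragment
   simply contains that source file's pooled literals. */
__attribute__((section(".strtab")))
const char string_table[] =
    #include "main.strs"    /* e.g. "String1\0String2\0" */
    #include "ui.strs"      /* e.g. "Str3\0S4\0" */
;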
Result
The upshot would be an executable with all the (constant) strings packed into a memory blob no larger than 64K (not a problem!). The code would know that each index is an offset into that blob, so it would add the index to the string table's base pointer before using the result as normal.
Question
My question is: is it worth it?
- Pros:
It would save a tonne of space. I didn't quantify it above, but assume a saving of 5%(!) of total Flash.
- Cons:
It would require the build process to be modified to include a bespoke preprocessor;
That preprocessor would have to be built as part of the toolchain rather than the project;
The preprocessor could have bugs or limitations;
The real source code wouldn't compile "out of the box".
Now...
I have donned my asbestos suit, so... GO!
This kind of "project custom preprocessor" used to be fairly common back in the days when memory was pretty constrained. It's pretty easy to do if you use make as your build system -- just a custom pattern or suffix rule to run your preprocessor.
The main question is whether you want to run it on all source files or just some. If only a couple need it, you define a new file extension for source files that need preprocessing (e.g., .cx and a .cx.c: rule to run the preprocessor, as sketched below). If all need it, you redefine the implicit .c.o: rule.
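For example (a sketch; strpre is a stand-in name for your custom pre-pre-processor):

# Pattern rule: generate each .c from its .cx by running the preprocessor.
%.c: %.cx
	strpre $< > $@

# Or, as a traditional suffix rule:
.SUFFIXES: .cx .c
.cx.c:
	strpre $< > $@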
The main drawback, as you noted, is that if there's any sort of global coordination (such as pooling all the strings like you are trying to do), changing any source file needing the preprocessor will likely require rebuilding all of them, which is potentially quite slow.
Related
I have defined a custom Tcl type using the Tcl library in C/C++. Basically, I make Tcl_Obj.internalRep.otherValuePtr point to my own data structure. The problem happens when calling [string length myVar] or similar string functions, which perform the so-called shimmering behaviour that replaces my internalRep with their own string structure. After such a string function runs, myVar cannot be converted back, because my complicated data structure cannot be reconstructed from the Tcl_Obj.bytes representation, and the type is no longer my custom type. How can I avoid that?
The string length command converts the internal representation of the values it is given to the special string type, which records information to allow many string operations to be performed rapidly. Apart from most of the string command's various subcommands, the regexp and regsub commands are the main ones that do this (for their string-to-match-the-RE-against argument). If you have a precious internal representation of your own and do not wish to lose it, you should avoid those commands; there are some operations that avoid the trouble. (Tcl mostly assumes that internal representations are not fragile, and therefore that they can be regenerated on demand. Beware when using fragile ones!)
The key operations that are mostly safe (as in they generate the bytes/length rep through calling the updateStringProc if needed, but don't clear the internal rep) are:
substitution into a string; the substituted value won't have the internal rep, but it will still be in the original object.
comparison with the eq and ne expression operators. This is particularly relevant for checks to see if the value is the empty string.
Be aware that there are many other operations that spoil the internal representation in other ways, but most don't catch people out so much.
[EDIT — far too long for a comment]: There are a number of relatively well-known extensions that work this way (e.g., TCOM and Tcl/Java both do this). The only thing you can really do is “be careful” as the values really are fragile. For example, put them in an array and then pass the indexes into the array around instead, as those need not be fragile. Or keep things as elements in a list (probably in a global variable) and pass around the list indices; those are just plain old numbers.
The traditional, robust approach is to put a map (e.g., a Tcl_HashTable or std::map) in your C or C++ code and have the indices into that be short strings with not too much meaning (I like to use the name of the type of value followed by either a sequence number or a serialisation of the pointer, such as you might get with the %p conversion in sprintf(); the printed pointer reveals more of the implementation details, is a little more helpful if you're debugging, and generally doesn't actually make that much difference in practice). You then have the removal of things from the map be an explicit deletion operation, and it is also easy to provide operations like listing all the known current values. This is safe, but prone to “leaking” (though it's not formally a memory leak if you provide the listing operation). It can be accelerated by caching the lookup in a Tcl_Obj*'s internal representation (a cheap way to handle deletion is to use a sequence number that you increment when you delete something, and only bypass the map lookup if the sequence number that you cache in the intrep is equal to the main sequence number) but it's not usually a big deal; only go to that sort of thing if you've measured a bottleneck in the lookups.
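A sketch of that map-based approach in C (names like MyThing, the handle format and the helper functions are illustrative, and error handling is omitted):

#include <tcl.h>
#include <stdio.h>

typedef struct MyThing MyThing;        /* your complicated structure */

static Tcl_HashTable thingMap;         /* maps handle string -> MyThing* */
static int thingSeq = 0;               /* sequence number for fresh handles */

/* Call once at startup. */
static void InitThingMap(void)
{
    Tcl_InitHashTable(&thingMap, TCL_STRING_KEYS);
}

/* Register a value; writes a fresh handle like "mything42" into buf. */
static void RegisterThing(MyThing *thing, char buf[32])
{
    int isNew;
    snprintf(buf, 32, "mything%d", ++thingSeq);
    Tcl_HashEntry *e = Tcl_CreateHashEntry(&thingMap, buf, &isNew);
    Tcl_SetHashValue(e, thing);
}

/* Look a handle up again; NULL if it has been explicitly deleted. */
static MyThing *LookupThing(const char *handle)
{
    Tcl_HashEntry *e = Tcl_FindHashEntry(&thingMap, handle);
    return e ? (MyThing *) Tcl_GetHashValue(e) : NULL;
}

/* Explicit deletion, typically exposed to script as a command. */
static void DeleteThing(const char *handle)
{
    Tcl_HashEntry *e = Tcl_FindHashEntry(&thingMap, handle);
    if (e) Tcl_DeleteHashEntry(e);
}

The Tcl script only ever sees the handle strings, which are plain values and perfectly safe to shimmer.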
But I'd probably just live with fragility in my own code, and would just take care to ensure that I never bust the assumptions. The problem is really that you're being incautious about how you use the values; the Tcl code should just pass them around and nothing else really. Also, I've experimented a fair bit with wrapping such things up inside a TclOO object; it's far too heavyweight (by the design of TclOO) for values that you're making a lot of, but if you've only got a few of them and you're wanting to treat them as objects with methods, this can work very well indeed (and gives many more options for automatic cleanup).
I wish to encode the location -- say, __FILE__/__LINE__ -- each time I do a memory allocation. There are over 3,000 allocation sites in my codebase, so I don't really want to hard-code it.
I have used a macro which just passes in __FILE__, __LINE__, and that works great.
Now I want to store this with each allocation as well, so it needs to be compressed. I have used a minimal perfect hash of __FILE__, which makes the (__FILE__, __LINE__) pair fit within a 32-bit integer.
However, computing the MPH on each allocation is just too expensive (mostly because it loops through the string computing a primary hash first).
Since all the strings are constant, the MPH is constant and everything is constant, there should be a faster way to compute this.
Alternatively, does anyone know a better way to compute code locations so they can be looked up and stored in an efficient manner (I've looked at the boost library PP_COUNTER macro as well) ?
Thanks!
A code location is already efficiently encoded by { __FILE__, __LINE__ }.
The macro __FILE__ expands to a string literal, which (in C99, and likely earlier) is "used to initialize an array of static storage duration"; you pass in its address, which is all you need, and it need not be compressed. I have done it like this (optionally including the current function name), with no problems, in VMS C, AIX C and MSVS C, and it has been very helpful.
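A sketch of that scheme (my_malloc_at, MY_MALLOC and SourceLoc are illustrative names):

#include <stdlib.h>

/* Each allocation site records the *address* of the pooled __FILE__
   literal plus the line number - no hashing required at run time. */
typedef struct {
    const char *file;   /* address of the string literal, unique per file */
    int         line;
} SourceLoc;

void *my_malloc_at(size_t size, const char *file, int line);  /* your wrapper */

#define MY_MALLOC(size) my_malloc_at((size), __FILE__, __LINE__)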
N.B.
In theory, a really poor compiler may not pool string literals, not even __FILE__, resulting in bloated object code, but that seems unlikely in the extreme!
As long as your compiler does pool string literals, you can calculate a hash on the address, if you need one.
I have seen that C++ function name macros may be function calls, so this technique may be inapplicable there.
I have a homework assignment for my school. The goal is to create a really basic virtual machine as well as a simple assembler. I had no problem creating the virtual machine, but I can't think of a 'nice' way to create the assembler.
The grammar of this assembler is really basic: an optional label followed by a colon, then a mnemonic followed by 1, 2 or 3 operands. If there is more than one operand they shall be separated by commas. Also, whitespaces are ignored as long as they don't occur in the middle of a word.
I'm sure I can do this with strtok() and some black magic, but I'd prefer to do it in a 'clean' way. I've heard about Parse Trees/AST, but I don't know how to translate my assembly code into these kinds of structures.
I wrote an assembler like this when I was a teenager. You don't need a complicated parser at all.
All you need to do is perform five steps for each line:
Tokenize (i.e. split the line into tokens). This will give you an array of tokens and then you don't need to worry about the whitespace, because you will have removed it during tokenization.
Initialize some variables representing parts of the line to NULL.
A sequence of if statements to walk over the token array and check which parts of the line are present. If they are present, put the token (or a processed version of it) in the corresponding variable; otherwise, leave that variable as NULL (i.e. do nothing).
Report any syntax errors (i.e. combinations of types of tokens that are not allowed).
Code generation - I guess you know how to do this part!
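A sketch of steps 1-3 (the struct layout and names are mine, error reporting and code generation are left out, and it assumes whitespace after a label's colon):

#include <string.h>

typedef struct {
    char *label;        /* NULL if the line has no label */
    char *mnemonic;     /* NULL if the line is blank */
    char *operand[3];
    int   noperands;
} Line;

static void parse_line(char *text, Line *out)   /* text is modified in place */
{
    memset(out, 0, sizeof *out);

    char *tok = strtok(text, " \t\n");
    if (tok == NULL)
        return;                                 /* blank line */

    size_t len = strlen(tok);
    if (tok[len - 1] == ':') {                  /* optional "label:" */
        tok[len - 1] = '\0';
        out->label = tok;
        tok = strtok(NULL, " \t\n");
        if (tok == NULL)
            return;                             /* label-only line */
    }
    out->mnemonic = tok;

    /* 1-3 operands, comma-separated; surrounding whitespace is ignored */
    while (out->noperands < 3 && (tok = strtok(NULL, ", \t\n")) != NULL)
        out->operand[out->noperands++] = tok;
}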
What you're looking for is actually lexical analysis, parsing and, finally, generation of the compiled code. There are a lot of frameworks out there which help with creating/generating a parser, like Gold Parser or ANTLR. Creating a language definition (and learning how to, depending on the framework you use) is most often quite a lot of work.
I think you're best off implementing the shunting-yard algorithm, which converts your source into a representation computers understand and which is easy for your virtual machine to work with.
I also want to say that diving into parsers, abstract syntax trees, all the tools available on the web and reading a lot of papers about this subject is a really good learning experience!
You can take a look at some already-made assemblers, like PASMO, an assembler for the Z80 CPU, and get ideas from it. Here it is:
http://pasmo.speccy.org/
I've written a couple of very simple assemblers, both of them using string manipulation with strtok() and the like. For a simple grammar like the assembly language is, it's enough. Key pieces of my assemblers are:
A symbol table: just an array of structs, with the name of a symbol and its value.
typedef struct
{
    char nombre[256];   /* symbol name */
    u8 valor;           /* symbol value */
} TSymbol;

TSymbol tablasim[MAXTABLA];   /* the symbol table */
int maxsim = 0;               /* number of entries in use */
A symbol is just a name that has a value associated with it. This value can be the current position (the address where the next instruction will be assembled), or it can be an explicit value assigned by the EQU pseudoinstruction.
Symbol names in this implementation are limited to 255 characters each, and one source file is limited to MAXTABLA symbols.
I perform two passes over the source code:
The first one is to identify symbols and store them in the symbol table, detecting whether they are followed by an EQU instruction or not. If they are, the value next to EQU is parsed and assigned to the symbol. Otherwise, the value of the current position is assigned. To update the current position I have to detect whether there is a valid instruction (although I do not assemble it yet) and update it accordingly (this is easy for me because my CPU has a fixed instruction size).
Here you have a sample of my code that is in charge of updating the symbol table with a value from EQU or from the current position, and advancing the current position if needed.
case 1:
    if (es_equ (token))
    {
        /* EQU: parse the value that follows and assign it to the symbol */
        token = strtok (NULL, "\n");
        tablasim[maxsim].valor = parse_numero (token, &err);
        if (err)
        {
            if (err == 1)
                fprintf (stderr, "Syntax error on line %d\n", nlinea);
            else if (err == 2)
                fprintf (stderr, "Symbol [%s] not found on line %d\n", token, nlinea);
            estado = 2;
        }
        else
        {
            maxsim++;
            token = NULL;
            estado = 0;
        }
    }
    else
    {
        /* no EQU: the symbol takes the current position as its value */
        tablasim[maxsim].valor = pcounter;
        maxsim++;
        if (es_instruccion (token))
            pcounter++;
        token = NULL;
        estado = 0;
    }
    break;
The second pass is where I actually assemble instructions, replacing a symbol with its value when I find one. It's rather simple, using strtok() to split a line into its components, and strncasecmp() to compare what I find with the instruction mnemonics.
If the operands can be expressions, like "1 << (x + 5)", you will need to write a parser. If not, the parser is so simple that you do not need to think in those terms. For each line, get the first string (skipping whitespace). Does the string end with a colon? Then it is a label; otherwise it is the mnemonic, etc.
For an assembler there's little need to build an explicit parse tree. Some assemblers do have fancy linkers capable of resolving complicated expressions at link time, but for a basic assembler an ad-hoc lexer and parser should do fine.
In essence you write a little lexer which consumes the input file character-by-character and classifies everything into simple tokens, e.g. numbers, labels, opcodes and special characters.
I'd suggest writing a BNF grammar even if you're not using a code generator. Such a specification can then be translated into a recursive-descent parser almost by rote. The parser simply walks through the whole code and emits assembled binary code along the way.
A symbol table registering every label and its value is also needed, traditionally implemented as a hash table. Initially, when encountering an unknown label (say, for a forward branch), you may not yet know its value; it is simply filed away for future reference.
The trick is then to spit out dummy values for labels and expressions the first time around but compute the label addresses as the program counter is incremented, then take a second pass through the entire file to fill in the real values.
For a simple assembler, e.g. no linker or macro facilities and a simple instruction set, you can get by with perhaps a thousand or so lines of code. Much of it is brainless, thought-free hand translation from syntax descriptions and opcode tables.
Oh, and I strongly recommend that you check out the dragon book from your local university library as soon as possible.
At least in my experience, normal lexer/parser generators (e.g., flex, bison/byacc) are all but useless for this task.
When I've done it, nearly the entire thing has been heavily table driven -- typically one table of mnemonics, and for each of those a set of indices into a table of instruction formats, specifying which formats are possible for that instruction. Depending on the situation, it can make sense to do that on a per-operand rather than a per-instruction basis (e.g., for mov instructions that have a fairly large set of possible formats for both the source and the destination).
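In outline, the tables might look like this (a sketch with made-up instruction names, encodings and format sets):

#include <stddef.h>
#include <string.h>

/* One row per mnemonic; each row lists the operand formats it accepts. */
typedef enum { FMT_NONE, FMT_IMM, FMT_DIRECT, FMT_INDIRECT } Format;

typedef struct {
    const char *mnemonic;
    unsigned char baseOpcode;
    Format formats[3];          /* accepted formats; unused slots stay FMT_NONE */
} OpInfo;

static const OpInfo ops[] = {
    { "nop", 0x00, { FMT_NONE } },
    { "lda", 0x10, { FMT_IMM, FMT_DIRECT, FMT_INDIRECT } },
    { "sta", 0x20, { FMT_DIRECT, FMT_INDIRECT } },
};

static const OpInfo *find_op(const char *name)
{
    for (size_t i = 0; i < sizeof ops / sizeof ops[0]; i++)
        if (strcmp(name, ops[i].mnemonic) == 0)
            return &ops[i];
    return NULL;                /* unknown mnemonic */
}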
In a typical case, you'll have to look at the format(s) of the operand(s) to determine the instruction format for a particular instruction. For a fairly typical example, a format of #x might indicate an immediate value, x a direct address, and @x an indirect address. Another common form for an indirect address is (x) or [x], but for your first assembler I'd try to stick to a format that specifies the instruction format/addressing mode based only on the first character of the operand, if possible.
Parsing labels is simpler, and (mostly) separate. Basically, each label is just a name with an address.
As an aside, if possible I'd probably follow the typical format of a label ending with a colon (":") instead of a semicolon (";"). Much more often, a semicolon will mark the beginning of a comment.
I'm working on a project in which I need to read a text (source) file into memory and be able to perform random access into it (for instance, retrieve the address corresponding to line 3, column 15).
I would like to know if there is an established way to do this, or data structures that are particularly good for the job. I need to be able to perform a (probably amortized) constant time access. I'm working in C, but am willing to implement higher level data structures if it is worth it.
My first idea was to go with a linked list of large buffers that would hold the character data of the file. I would also make an array whose indices are line numbers and whose contents are the addresses corresponding to the beginnings of the lines. This array would be reallocated as needed.
Subsidiary question: does anyone have an idea of the average size of a source file? I was surprised not to find this on Google.
To clarify:
The files I'm concerned about are source files, so their size should be manageable, they should not be modified, and the lines have variable length (though hopefully capped at some maximum).
The problem I'm working on needs mostly a read-only file representation, but I'm very interested in digging around the problem.
Conclusion:
There is a very interesting discussion of the data structures used to maintain a file (with read/insert/delete support) in the paper Data Structures for Text Sequences.
If you just need read-only access, just get the file size, read it into memory with fread(), and then maintain a dynamic array which maps line numbers (the index) to pointers to the first character of each line. Someone below suggested building this array lazily, which seems a good idea in many cases.
I'm not quite sure what the question is here, but there seems to be a bit of both "how do I keep the file in memory" and "how do I index it". Since you need random access to the file's contents, you're probably well advised to memory-map the file, unless you're tight on address space.
I don't think you'll be able to avoid a linear pass through the file once to find the line endings. As you said, you can create an index of the pointers to the beginning of each line. If you're not sure how much of the index you'll need, create it lazily (on demand). You can also store this index to disk (as offsets, not pointers) if you will need it on subsequent runs. You can estimate the size of the index based on the file size and the expected line length.
1) Read (or mmap) the entire file into one chunk of memory.
2) In a second pass, create an array of pointers or offsets into that memory, pointing to the beginnings of the lines (hint: a line begins one character after a '\n').
Now you can index the array to access a specific line.
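A sketch of those two steps (names are illustrative and error handling is kept minimal):

#include <stdio.h>
#include <stdlib.h>

static char  *buf;        /* the whole file, NUL-terminated */
static char **line;       /* line[i] points at the first char of line i */
static size_t nlines;

static void index_file(FILE *f)
{
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);

    buf = malloc(size + 1);
    fread(buf, 1, size, f);
    buf[size] = '\0';

    line = malloc((size + 1) * sizeof *line);  /* worst case: all newlines */
    line[nlines++] = buf;
    for (long i = 0; i < size; i++)
        if (buf[i] == '\n')
            line[nlines++] = &buf[i + 1];      /* next line starts after '\n' */
}

Line 3, column 15 is then just line[2] + 14.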
It's impossible to make insertion, deletion, and reading at a particular line/column/character address all simultaneously O(1). The best you can get is simultaneous O(log n) for all of these operations, and it can be achieved using various sorts of balanced binary trees for storing the file in memory.
Of course, unless your files will be larger than 100 kB or so, you're probably best off not bothering with anything fancy and just using a flat linear buffer...
Solution: if the lines are about the same size, make all lines equally long by appending the needed number of padding characters to each line. Then you can simply calculate the fseek() position from the line number, making your search O(1) - see the sketch below.
If the lines are sorted, then you can perform binary search, making your search O(log(nLines)).
If neither, you can store the indexes of line beginnings. But then you have a problem if you modify the file a lot, because if you insert, say, X characters somewhere, you have to calculate which line that is and then add X to the indexes of all the following lines. Similarly with deletion. You essentially get O(nLines), and the code gets ugly.
If you want to store the whole file in memory, just create an array of lines, char *[]. You then get a line with the first dereference and a character with the second.
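A sketch of the fixed-length case (RECLEN is an assumed padded record length):

#include <stdio.h>

enum { RECLEN = 80 };   /* assumed fixed (padded) line length */

/* Fetch line n in O(1): its file position is simply n * RECLEN. */
static int read_line(FILE *f, long n, char buf[RECLEN + 1])
{
    if (fseek(f, n * (long) RECLEN, SEEK_SET) != 0)
        return -1;
    if (fread(buf, 1, RECLEN, f) != RECLEN)
        return -1;
    buf[RECLEN] = '\0';
    return 0;
}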
As an alternate suggestion (although I do not fully understand the question), you might want to consider a struct-based, dynamically linked list of dynamic strings. If you want to be astutely clever, you could build a dynamically linked list of chars which you then export as strings.
You'd have to use OO-style design for this to be manageable.
So structs you'd likely want to build are:
DynamicArray;
DynamicListOfArrays;
CharList;
So it goes:
CharList(Gets Chars/Size) -> (SetSize)DynamicArray -> (AddArray)DynamicListOfArrays
If you build suitable helper functions for malloc and delete, and make it so the structs can either delete themselves automatically or be deleted manually, the combination above will serve you well. It won't get you O(1) reads (which isn't possible unless the file has a static format), but it will get you good times.
If you know the file has static line lengths (i.e., no bigger than 256 chars per line), then all you need is the DynamicListOfArrays: write directly to the array (preset to 256), create a new one, repeat. The downside is that it wastes memory.
Note: You'd have to convert the DynamicListOfArrays into a 'static' ArrayOfArrays before you could get direct point-to-point access.
If you need source code to give you an idea (although mine is built towards C++ it wouldn't take long to rewrite), leave a comment about it. As with any other code I offer on stackoverflow, it can be used for any purpose, even commercially.
Average size of a source file? Does such a thing exist? A source file can go from 0 bytes to thousands of bytes; like any text file, it depends on the number of characters it contains.
Long ago, out of curiosity, I tried hex-editing the executable file of the game "Dangerous Dave".
I looked around the file for any strings I could find, and made some random edits to see if they would actually change the text displayed within the game.
I was surprised to see the result, which I have now recreated using a hex-editor and DOSBox:
As can be seen, editing the two characters "RO" in the string "ROMERO" resulted in 4 characters being changed, with the result becoming "ZUMEZU". It seems as if the program reuses the two characters, printing them at both the start and the end of that string.
What is the cause of this? My first guess would be an attempt to make the executable smaller, but the code to reuse those characters would probably require more space than the 2 bytes saved.
Is it just a trick done by the author, or just some compiler voodoo?
Tricky to say for sure without reverse-engineering, but my guess would be that a lot of the constant data in the program is compressed using an algorithm from the LZ family. These compression schemes work essentially in the way that you've observed: they encode repeated substrings as references to text that has previously been decoded.
These compression algorithms were probably used for more than just this one string, and not just for text either; it's quite possible that they were also used to compress other data, such as graphics or level layouts. In short, there were probably significant savings made by using this algorithm!
The use of these compression algorithms is common in older games as a way of saving disk space, but was not automatic - the implementation of this algorithm would likely have been something Romero added himself.
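To illustrate the general mechanism (this is a toy decoder, not Dangerous Dave's actual format):

#include <stdio.h>

/* Toy LZ-style decoder: each token is either a literal byte or a
   (distance, length) reference back into already-decoded output. */
typedef struct {
    int  is_ref;
    char lit;        /* literal byte, if is_ref == 0 */
    int  dist, len;  /* back-reference, if is_ref == 1 */
} Token;

static void decode(const Token *t, int n, char *out)
{
    int pos = 0;
    for (int i = 0; i < n; i++) {
        if (t[i].is_ref)
            for (int j = 0; j < t[i].len; j++, pos++)
                out[pos] = out[pos - t[i].dist];   /* copy earlier output */
        else
            out[pos++] = t[i].lit;
    }
    out[pos] = '\0';
}

int main(void)
{
    /* "ROMERO" stored as the literals "ROME" plus a reference copying
       "RO" from the start; patch those two literal bytes and both
       occurrences change - just as "RO" -> "ZU" gave "ZUMEZU". */
    Token rom[] = {
        { 0, 'R', 0, 0 }, { 0, 'O', 0, 0 }, { 0, 'M', 0, 0 }, { 0, 'E', 0, 0 },
        { 1, 0, 4, 2 },    /* copy 2 bytes from 4 back: "RO" */
    };
    char s[16];
    decode(rom, 5, s);
    printf("%s\n", s);     /* prints ROMERO */
    return 0;
}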