I am currently working on a command line interface for a particle simulator. Its parser reads input in the following format:
[command] [argument]* (-[flag] [flag argument])
Currently, the command is sent through a conditional block, compared to various known commands and its corresponding data packet is sent to the matching function. This, however, seems clunky, inefficient and inelegant.
I am thinking about using a hashmap instead, with a string representation of a command as the key and a function pointer as the value. The function referenced would then be sent a data packet containing arguments, flags, etc.
Is a hash map overkill in this situation? Does the extra infrastructure required to implement one outweigh the potential benefits? I am aiming for speed, elegance, function, and, since this is an open-source project, extensibility.
Thanks for the help.
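To give a sense of how little infrastructure the function-pointer idea needs, here is a minimal sketch that uses a sorted table and the standard bsearch instead of a full hash map. The packet struct and the command names (add, remove, step) are hypothetical placeholders, not taken from the simulator described above.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical data packet: a real one would carry parsed arguments and flags. */
struct packet {
    int argc;
    char **argv;
};

typedef int (*command_fn)(const struct packet *);

/* Placeholder handlers; each returns a distinct value so dispatch is observable. */
static int cmd_add(const struct packet *p)    { (void)p; return 1; }
static int cmd_remove(const struct packet *p) { (void)p; return 2; }
static int cmd_step(const struct packet *p)   { (void)p; return 3; }

struct command {
    const char *name;
    command_fn  fn;
};

/* Must stay sorted by name (strcmp order) because bsearch is used below. */
static const struct command commands[] = {
    { "add",    cmd_add    },
    { "remove", cmd_remove },
    { "step",   cmd_step   },
};

static int compare_command(const void *key, const void *elem)
{
    return strcmp((const char *)key, ((const struct command *)elem)->name);
}

/* Returns the handler's result, or -1 if the command is unknown. */
int dispatch(const char *name, const struct packet *p)
{
    const struct command *c = bsearch(name, commands,
                                      sizeof commands / sizeof commands[0],
                                      sizeof commands[0], compare_command);
    return c ? c->fn(p) : -1;
}
```

Adding a command then means writing one handler and adding one table row, which may be all the extensibility a small CLI needs.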
You might want to consider the Ternary Search Tree. It has good performance and uses storage efficiently, and you don't need a hash function or a collision strategy.
The linked Bentley/Sedgwick article is a very thorough-yet-readable explanation of the accompanying C source.
I've been using a TST for name-lookup in the past 3 versions of my postscript interpreter. The only changes that have been needed have been due to changes in memory management. Here's a version I modified (lightly) to use explicit pointers. I use yet another version in my postscript interpreter, any of the xpost2*.zip versions, in the file core.c, which uses byte-offsets for pointers (have to be added to the user-memory byte-pointer to yield a real pointer).
The speed gained will probably be minimal, but you could hash the command to convert it to a number and then use a switch statement. That is faster than a hash map.
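A minimal sketch of the hash-then-switch idea. The hash function (h = h * 31 + c), the command names and the return codes are assumptions chosen for illustration. Because case labels must be constants, the hash values are precomputed, and a strcmp guards against an unrelated string producing the same hash.

```c
#include <stdint.h>
#include <string.h>

/* Simple multiplicative string hash; picked only because its values are
   easy to precompute. Any hash function would do. */
static uint32_t hash_name(const char *s)
{
    uint32_t h = 0;
    while (*s)
        h = h * 31u + (unsigned char)*s++;
    return h;
}

/* Precomputed with hash_name(); regenerate if the hash function changes. */
enum {
    HASH_ADD  = 96417,    /* "add"  */
    HASH_QUIT = 3482191,  /* "quit" */
    HASH_STEP = 3540684   /* "step" */
};

/* Returns a command code (hypothetical values), or -1 for an unknown command. */
int lookup_command(const char *name)
{
    switch (hash_name(name)) {
    /* strcmp rejects a different string that happens to hash to the same value. */
    case HASH_ADD:  return strcmp(name, "add")  == 0 ? 1 : -1;
    case HASH_QUIT: return strcmp(name, "quit") == 0 ? 2 : -1;
    case HASH_STEP: return strcmp(name, "step") == 0 ? 3 : -1;
    default:        return -1;
    }
}
```

The compiler will reject the switch if two command names ever hash to the same constant, which is a cheap way to detect collisions among the known commands.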
For some time now, I've been looking to build a logging framework in C (not C++!), intended for small microcontrollers or other devices with a small footprint. For this, I've had the idea of hashing the strings that are being logged to a certain value and just saving the hashed value with the timestamp instead of the complete ASCII string. The hash can then be correlated with a 'database' file that would be generated by an external process that parses the strings out of the C source files and saves the logged strings along with the hash value.
After doing a little bit of research, this idea is not new, but I do not find an implementation of this idea in C. In other languages, this idea has been worked out, but that is not the goal of my exercise. An example may be this talk where the same concept has been worked out in C++: youtube.com/watch?v=Dt0vx-7e_B0
Some of the requirements that I've set myself for this library are the following:
as portable C code as possible
COMPILE TIME optimization/hashing for the string hash conversion, it should be equivalent to just printf("%d\n", hashed_value) for a single log statement. (Assuming no parameters/arguments for this particular logging statement).
arguments can be passed to the logging statement similar to the printf function.
user can define their own output function (being console, file descriptor, sending the data directly over an UART connection,...)
fast to run!! fast to compile is nice to have, but it should not be terribly slow.
very easy to use, no very complicated API to use the library.
But to achieve this in C, what is a good approach? I've tried several things now, but do not seem to have found a good method of achieving this.
An overview of things I've tried so far, along with the drawbacks are:
Full pre-processor string hashing: did get it working, but the compile time is terribly slow. Also, this code does not feel very portable across multiple C compilers.
Semi pre-processor string hashing: The idea was to generate a hash for each string and make an external header file containing a define for each string with its hash value. The problem here is that I cannot figure out a way of converting the string to the correct define preprocessor value.
Letting go of the default logging macro with a string pointer: Instead of working with the most used method of LOG_DEBUG("Some logging statement"), converting it with an external parser to /*LOG_DEBUG("Some logging statement") */ LOG_RAW(45). This solves the problem of hashing the string since the hash will be replaced by the external parser with the correct hash, but is not the cleanest to read since the original statement will be a comment.
Also expanding this idea to take care of arguments proved to be tricky. How to take care of multiple types of variables as efficiently as possible?
I've tried some other methods but all without success. Especially when I want to add arguments to log the value of a variable, for example, it gets very complicated, and I do not get the required result...
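For reference, the runtime side of the LOG_RAW(45) idea with a user-defined output function could look roughly like the sketch below. It does not solve the compile-time hashing problem. The record layout (a 2-byte ID followed by 4-byte arguments in host byte order), the LOG_MAX_ARGS limit, and every function name are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LOG_MAX_ARGS 4  /* hypothetical upper bound on arguments per record */

typedef void (*log_output_fn)(const uint8_t *data, size_t len);

static log_output_fn log_output;  /* NULL until the user installs one */

void log_set_output(log_output_fn fn) { log_output = fn; }

/* Emits one record: the 2-byte ID followed by nargs 4-byte arguments.
   Does nothing if no output is installed or there are too many arguments. */
void log_raw(uint16_t id, const uint32_t *args, size_t nargs)
{
    uint8_t rec[2 + 4 * LOG_MAX_ARGS];

    if (log_output == NULL || nargs > LOG_MAX_ARGS)
        return;
    memcpy(rec, &id, 2);
    if (nargs > 0)
        memcpy(rec + 2, args, 4 * nargs);
    log_output(rec, 2 + 4 * nargs);
}

#define LOG_RAW(id) log_raw((id), NULL, 0)

/* Example user-defined output: keeps the last record in memory.
   A real one might push the bytes to a UART or a file descriptor. */
static uint8_t last_record[2 + 4 * LOG_MAX_ARGS];
static size_t  last_len;

void capture_output(const uint8_t *data, size_t len)
{
    memcpy(last_record, data, len);
    last_len = len;
}
```

Forcing every argument into a 32-bit slot is the simplest way to handle mixed types, at the cost of wasting space for small values and not covering 64-bit or floating-point ones.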
I have a C program that currently edits a single value in a parameter file by using sed through a system call. I'd like to change the program to use the C file libraries to edit this value, but the only way I know how to do this is by reading in the entire file, changing the value, and rewriting the file. Is there a more efficient way to do this? The program is intended for use on an embedded device so I'd like to use the most efficient solution possible.
Working with files is like working with arrays in the sense that one can't truly perform insertions and deletions. Insertions and deletions require shifting (copying) the rest of the file/array. Only replacing elements is possible (by opening the file for reading and writing, and using seek).
Reading and writing the entire file is quite efficient, especially for tiny files. If the memory usage isn't an issue, that's the approach I would take.
Other solutions might be better in specific circumstances, but the approach you describe is generally the best.
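The in-place replacement mentioned above only works when the new value occupies exactly as many bytes as the old one, for example a zero-padded fixed-width field. A minimal sketch under that assumption follows; overwrite_at and the fixed-width layout are hypothetical and not something the parameter file in the question is known to use.

```c
#include <stdio.h>
#include <string.h>

/* Overwrites len bytes at byte offset `offset` in the file at `path`.
   The file's size does not change. Returns 0 on success, -1 on failure. */
int overwrite_at(const char *path, long offset, const char *bytes, size_t len)
{
    FILE *f = fopen(path, "r+b");  /* update mode: existing content is kept */
    if (f == NULL)
        return -1;

    int ok = fseek(f, offset, SEEK_SET) == 0
          && fwrite(bytes, 1, len, f) == len;

    if (fclose(f) != 0)            /* fclose flushes; a failure means lost data */
        ok = 0;
    return ok ? 0 : -1;
}
```

With a file containing rate=0100, calling overwrite_at(path, 5, "0250", 4) would change the value to 0250 without touching the rest of the file. If the new value can be longer or shorter than the old one, rewriting the whole file remains the straightforward option.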
I have defined a custom Tcl type using the Tcl library in C/C++. I basically make Tcl_Obj.internalRep.otherValuePtr point to my own data structure. The problem arises when calling [string length myVar] or similar string functions, which trigger so-called shimmering: they replace my internalRep with their own string structure. After such a string command, myVar cannot be converted back, because it is a complicated data structure that cannot be reconstructed from the Tcl_Obj.bytes representation, and the type is no longer my custom type. How can I avoid that?
The string length command converts the internal representation of the values it is given to the special string type, which records information to allow many string operations to be performed rapidly. Apart from most of the string command's various subcommands, the regexp and regsub commands are the main ones that do this (for their string-to-match-the-RE-against argument). If you have a precious internal representation of your own and do not wish to lose it, you should avoid those commands; there are some operations that avoid the trouble. (Tcl mostly assumes that internal representations are not fragile, and therefore that they can be regenerated on demand. Beware when using fragile ones!)
The key operations that are mostly safe (as in they generate the bytes/length rep through calling the updateStringProc if needed, but don't clear the internal rep) are:
substitution into a string; the substituted value won't have the internal rep, but it will still be in the original object.
comparison with the eq and ne expression operators. This is particularly relevant for checks to see if the value is the empty string.
Be aware that there are many other operations that spoil the internal representation in other ways, but most don't catch people out so much.
[EDIT — far too long for a comment]: There are a number of relatively well-known extensions that work this way (e.g., TCOM and Tcl/Java both do this). The only thing you can really do is “be careful” as the values really are fragile. For example, put them in an array and then pass the indexes into the array around instead, as those need not be fragile. Or keep things as elements in a list (probably in a global variable) and pass around the list indices; those are just plain old numbers.
The traditional, robust approach is to put a map (e.g., a Tcl_HashTable or std::map) in your C or C++ code and have the indices into that be short strings with not too much meaning (I like to use the name of the type of value followed by either a sequence number or a serialisation of the pointer, such as you might get with the %p conversion in sprintf(); the printed pointer reveals more of the implementation details, is a little more helpful if you're debugging, and generally doesn't actually make that much difference in practice). You then have the removal of things from the map be an explicit deletion operation, and it is also easy to provide operations like listing all the known current values. This is safe, but prone to “leaking” (though it's not formally a memory leak if you provide the listing operation). It can be accelerated by caching the lookup in a Tcl_Obj*'s internal representation (a cheap way to handle deletion is to use a sequence number that you increment when you delete something, and only bypass the map lookup if the sequence number that you cache in the intrep is equal to the main sequence number) but it's not usually a big deal; only go to that sort of thing if you've measured a bottleneck in the lookups.
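A minimal sketch of that handle-map idea in plain C, with no Tcl calls so that it stands alone. In a real extension the fixed-size array would more likely be a Tcl_HashTable, and the handle string would be what the Tcl script passes around. All names here (handle_create, handle_lookup, handle_delete) and the table size are hypothetical.

```c
#include <stdio.h>
#include <string.h>

#define MAX_HANDLES 64
#define NAME_LEN    32

struct handle {
    char  name[NAME_LEN];  /* e.g. "particle3"; empty string marks a free slot */
    void *value;
};

static struct handle table[MAX_HANDLES];
static unsigned next_seq = 1;

/* Registers value and returns its handle name (type name plus sequence
   number), or NULL if the table is full. The returned string stays valid
   until the handle is deleted. */
const char *handle_create(const char *type, void *value)
{
    for (int i = 0; i < MAX_HANDLES; i++) {
        if (table[i].name[0] == '\0') {
            snprintf(table[i].name, NAME_LEN, "%s%u", type, next_seq++);
            table[i].value = value;
            return table[i].name;
        }
    }
    return NULL;
}

static struct handle *find(const char *name)
{
    for (int i = 0; i < MAX_HANDLES; i++)
        if (table[i].name[0] != '\0' && strcmp(table[i].name, name) == 0)
            return &table[i];
    return NULL;
}

/* Returns the registered value, or NULL for an unknown or deleted handle. */
void *handle_lookup(const char *name)
{
    struct handle *h = find(name);
    return h ? h->value : NULL;
}

/* Explicit deletion; returns 0 on success, -1 if the handle is unknown. */
int handle_delete(const char *name)
{
    struct handle *h = find(name);
    if (h == NULL)
        return -1;
    h->name[0] = '\0';
    h->value = NULL;
    return 0;
}
```

Because the script only ever holds a short string such as particle1, shimmering cannot damage anything: the string is always enough to find the value again.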
But I'd probably just live with fragility in my own code, and would just take care to ensure that I never bust the assumptions. The problem is really that you're being incautious about how you use the values; the Tcl code should just pass them around and nothing else really. Also, I've experimented a fair bit with wrapping such things up inside a TclOO object; it's far too heavyweight (by the design of TclOO) for values that you're making a lot of, but if you've only got a few of them and you're wanting to treat them as objects with methods, this can work very well indeed (and gives many more options for automatic cleanup).
So I've run into an interesting design pattern and I wanted to know if you guys had an opinion on it.
Basically, the design is passing everything around as a pre-serialized type. There are no "types" for the returns, for example. Everything is passed as a simple uint8_t*. There is a defined header that "tells" you what is in the buffer, how big it is, what the version of the buffer is, etc. I call it "pre-serialized" because it forces flattening of all structures.
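To make the description concrete, here is a minimal sketch of such a buffer. The header fields (type, version, size) follow the description above, but their widths, their order, and the function names buf_pack and buf_unpack are assumptions. The header is stored in host byte order, so this sketch is not portable between machines.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical header layout: the actual one is not specified. */
struct buf_header {
    uint16_t type;     /* what the payload contains */
    uint16_t version;  /* layout version of the payload */
    uint32_t size;     /* payload size in bytes, header excluded */
};

/* Writes header + payload into out. Returns the total bytes written,
   or 0 if out (capacity cap) is too small. */
size_t buf_pack(uint8_t *out, size_t cap, uint16_t type, uint16_t version,
                const void *payload, uint32_t size)
{
    struct buf_header h = { type, version, size };
    if (cap < sizeof h || cap - sizeof h < size)
        return 0;
    memcpy(out, &h, sizeof h);            /* memcpy avoids alignment problems */
    memcpy(out + sizeof h, payload, size);
    return sizeof h + size;
}

/* Reads the header back. Returns a pointer to the payload inside in,
   or NULL if the buffer is shorter than the header claims. */
const uint8_t *buf_unpack(const uint8_t *in, size_t len, struct buf_header *h)
{
    if (len < sizeof *h)
        return NULL;
    memcpy(h, in, sizeof *h);
    if (len - sizeof *h < h->size)
        return NULL;
    return in + sizeof *h;
}
```

Note that buf_unpack hands back raw bytes: the caller still has to know what a given type and version mean.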
The pros:
You can easily write it (or even a set of it) to what ever you want. Files, IO, whatever.
Can store arbitrary data.
The Cons: IMHO:
No type safety is going to be a nightmare
The programmer has to parse the code. Even if there is an enumerated type, the user would have to know what that type means. Even if there are functions to parse the type, the programmer has to know which function to call.
Version hell: changing code will cause a ripple effect of errors. Because everywhere is parsing it differently, you have no idea where the code works or where it is broken.
It is viral: because it is flat, you can't "insert" the header on the end of outside data. You could wrap the call if you copy your "data", but this could cause an unnecessary copy that would be SLOW. So either your code is slower than it needs to be, or you conform to this data structure.
It isn't human readable OR debug-able.
Have you seen this design pattern before? Is there a name for this design pattern? Things I missed?
Is there a name for this design pattern?
Well, Legacy Code? :) I have seen such a design in 30-year-old Cobol systems...
The pros you have stated are easily reachable also by using XML format (or JSON):
You can easily write it (or even a set of it) to what ever you want. Files, IO, whatever - most of all, web services!
Can store arbitrary data.
Furthermore, all your cons are eliminated.
The only pro I can see in your solution is conciseness - when every byte counts and you need to avoid any overhead as too expensive, then this is nice.
Added: Cobol has a feature to easily define the structure of such serialized data, see PICTURE clause. Reading the data is very easy then, you read them as variables. (Like if you have a binary data and define a struct in the C language and typecast the binary to the struct.)
As Honza said this would be normal in Legacy Cobol/PL1 (was there a Cobol/PL1 conversion or interface to COBOL programs ???).
In COBOL this design pattern would make sense, not sure about C though (one of the binary serialization packages or JSON etc might be more sensible).
In Cobol, you would have a Cobol copybook which all programs would use and could edit the data using the Cobol Copybook (with something like file-aid or Microfocus Data Editor).
Why use this "design pattern" in Cobol:
Regression testing of Modules; you can write a driver module like
Read Test-data-file
while more-data
Call Module
write Result to output-file
Read Test-data-file
end
You can then compare the output of the pre-change program with the output of the changed program.
Testing - some times you can use a "production file" in testing
A file provides trace or snapshot of what is going on, this can be very useful.
Easy to reorganize Batch streams:
Split a program up (and pass the data via file). There are a variety of reasons for doing this, including:
the program has gotten too big and is hard to maintain.
Sorting the data
Performance (use a file rather than hitting the DB multiple times)
new uses for extracted data
While your cons are valid for C, they will be less of an issue in Cobol.
The key to using this "design pattern" is being able to edit/view/compare the format. If you cannot edit/view/compare a file, I do not see the point.
I am writing a simulation of a dead-code remover in a very simple manner.
For that my Idea is to,
Step 1: Read the input C-Program line by line and store it in a doubly linked-list or Array.(Since deletion and insertion will be easier than in file operations).
Doubt:Is my approach correct? If so, How to minimize traversing a Linked-List each time.
Step 2: Analyzing of the read strings will be done in parallel, and tables are created to maintain variables names and their details, functions and their calls,etc.,
Step 3: Each entry in the variable table will be searched for, and each variable will be replaced by its value at that point (if it has one).
(E.g.)
i=0;
if(i==3) will be replaced by if(0==3).
But on situation like..
get(a);
i=a;
if(i){}
here,'i' will not be replaced since it depends on another variable. 'a' will not be replaced since it depends on user input.
Doubt: if user input is,
if(5*5+6){print hello;} ,
it surely will be unnecessary check. How can i solve this expression to simplify the code as
{
print hello;
}
Step 4: Strings will be searched for if(0),while(0) etc., and using stack, the action block is removed. if(0){//this will be removed*/}
Step 5:(E.g) function foo(){/**/} ... if(0) foo(); ..., Once all the dead codes are removed, foo()'s entry in the function table is checked to get no.of.times it gets referred in the code. If it is 0, that function has to be removed using the same stack method.
Step 6: In the remaining functions, the lines below the return statements (if any) are removed except the '}'. This removal is done till the end of the function. The end of the function is identified using stack.
Step 7: And I will assume that my dead-free code is ready now. Store the linked-list or array in an output file.
My Questions are..
1. Is my idea meaningful, and is it implementable? How can I improve this algorithm?
2. While trying to implement this idea, I have to deal more with string manipulation than with removing dead code. Is there any way to reduce the string manipulation in this algorithm?
Do not do it this way. C is a free-form language, and trying to process it line-by-line will result in supporting a subset of C that is so ridiculously restricted that it doesn't deserve the name.
What you need to do is to write a proper parser. There is copious literature about that out there. Find out which textbook your school uses for its compiler-construction course, and work through that -- or just take the course! Only when you've got the parser down should you even begin to consider semantics. Then do your work on abstract syntax trees instead of strings. Alternatively, find an already written and tested parser for C that you can reuse (but you'll still need to learn quite a bit in order to integrate it with your own processing).
If you end up writing the parser yourself, and it's only for your own edification, consider using a simpler language than C as your subject. Even though C at its core is fairly compact as languages go, getting all details of the declaration syntax right is surprisingly tricky, and will probably distract you from what you're actually interested in. And the presence of the preprocessor is an issue in itself which can make it very difficult to design meaningful source-to-source transformations.
By the way, the transformations you sketch are known in the trade as "constant propagation", or (in more ambitious variants that will clone functions and loop bodies when they have differing constant inputs) "partial evaluation". Googling those terms may be interesting.
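To show why working on a syntax tree is easier than working on strings, here is a minimal sketch of constant folding over a toy expression tree. The node kinds (NUM, VAR, ADD, MUL) and helper names are invented for this illustration; a real C front end would have many more node types and would need to handle overflow and side effects.

```c
#include <stdlib.h>

enum kind { NUM, VAR, ADD, MUL };

struct expr {
    enum kind    kind;
    int          value;      /* used when kind == NUM */
    struct expr *lhs, *rhs;  /* used when kind == ADD or MUL */
};

static struct expr *node(enum kind k, int v, struct expr *l, struct expr *r)
{
    struct expr *e = malloc(sizeof *e);
    if (e == NULL)
        abort();
    e->kind = k; e->value = v; e->lhs = l; e->rhs = r;
    return e;
}

struct expr *num(int v) { return node(NUM, v, NULL, NULL); }
struct expr *var(void)  { return node(VAR, 0, NULL, NULL); }
struct expr *add(struct expr *l, struct expr *r) { return node(ADD, 0, l, r); }
struct expr *mul(struct expr *l, struct expr *r) { return node(MUL, 0, l, r); }

/* Folds constant subtrees bottom-up, in place. A node whose operands both
   fold to NUM becomes a NUM itself; anything that depends on a VAR is kept. */
struct expr *fold(struct expr *e)
{
    if (e->kind != ADD && e->kind != MUL)
        return e;
    fold(e->lhs);
    fold(e->rhs);
    if (e->lhs->kind == NUM && e->rhs->kind == NUM) {
        int l = e->lhs->value, r = e->rhs->value;
        e->value = (e->kind == ADD) ? l + r : l * r;
        e->kind = NUM;
        free(e->lhs);
        free(e->rhs);
        e->lhs = e->rhs = NULL;
    }
    return e;
}
```

Folding the tree for 5*5+6 yields the single constant 31. Since that is nonzero, a later pass could replace if(5*5+6){...} with its body, which is the simplification asked about in the question.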