I need to use an array of graph data, i.e. a struct with x and y integers. This array will be passed through many functions, and I need to decide on the API.
typedef struct {
int x;
int y;
} GraphData_t;
How should I choose between NULL-terminating the array and supplying a count variable?
I have three approaches for my API:
1: loadGraph(GraphData_t *data, int count); //use count variable
2: loadGraph(GraphData_t *data); // use null-termination (or any other termination value)
typedef struct {
GraphData_t *data;
int count;
} GraphArray_t;
3: loadGraph(GraphArray_t *data); //use a struct which has integrated count variable
So far these seem equal to me. Which one would be the preferable method, and why?
As a rather old dinosaur, I will use history here.
Anyway, the size + pointer idiom is the multi-purpose and bulletproof way. If in doubt, use it.
The delimited way is just more convenient for human beings, especially when you want to initialize an array: no need to manually count the items (with the risk of an off-by-one mistake, especially if you later add or remove elements from the initialization list); you just add the delimiter as the last element. BTW, it is the way we use lines in text files... But anyway, the sizeof(array)/sizeof(array[0]) idiom lets you easily and automatically get the size...
The NULL-terminated idiom comes from the beginning of microprocessors, when code was close to the hardware for performance reasons: comparison to 0 was the fastest test, and memory was expensive. Programmers began to end their constant strings with a NULL character for that reason: only one byte of overhead, even if the string was longer than 256 characters. You can find references to this ASCIIZ idiom in MS-DOS 2 manuals, but it had already been popularized by the Unix and K&R C pairing since the '70s.
It is still convenient, and still used in C strings, but many higher-level tools like C++ std::string now prefer the counted idiom, which does not require reserving a forbidden value.
For daily programming, the (null-)terminated idiom should only be used when an array will only ever be traversed forward and you have no particular need for the size. But beware: if you simply want to copy a null-terminated array, you have to scan it twice, once for its size and once for its data.
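For instance, copying a NUL-terminated C string takes one scan to find the size and a second one for the data (a minimal sketch):

#include <stdlib.h>
#include <string.h>

char *copy_terminated(const char *src)
{
    size_t n = strlen(src) + 1;     // first scan: find the terminator
    char *dst = malloc(n);
    if (dst != NULL)
        memcpy(dst, src, n);        // second scan: copy the bytes
    return dst;
}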
Null termination is a convention that can only be used if the null value is excluded from the set of legal values for the entries. For example the string array argv in main has a null pointer at the end because a null pointer cannot be a legal string.
In your case, the array elements are structures with two int coordinates. You would need to decide which values to consider invalid for these coordinates. If all values are OK, then you must pass the number of elements explicitly. Passing the array length explicitly is preferred in all cases, as it avoids unnecessary scans. Indeed, main also gets the length of the argv array as a separate int argument, argc.
Whether to encapsulate the array pointer and the length in a structure is a matter of style and convenience. For complex structures, it is preferable to group all characteristics in a structure, but for a simple array, it may be more convenient to pass a pointer and a size explicitly as it allows you to apply the function to a subset of the array with loadGraph(data + i, j).
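For illustration, a minimal sketch of the count-based variant; this loadGraph is just a stub standing in for the real function:

#include <stdio.h>

typedef struct {
    int x;
    int y;
} GraphData_t;

// Stub: processes `count` points starting at `data`.
void loadGraph(GraphData_t *data, int count)
{
    for (int i = 0; i < count; i++)
        printf("(%d, %d)\n", data[i].x, data[i].y);
}

int main(void)
{
    GraphData_t points[] = { {0, 0}, {1, 2}, {2, 4}, {3, 6} };
    int n = sizeof points / sizeof points[0];   // the sizeof idiom mentioned above

    loadGraph(points, n);        // whole array
    loadGraph(points + 1, 2);    // subset: elements 1 and 2
    return 0;
}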
While all approaches can of course get the job done, there are some differences which may or may not be relevant for your use-case.
Null-termination is very convenient if the user needs or wants to use hard-coded arrays, because then they can just add or delete entries, without needing to worry about possibly breaking the application (unless they remove the terminator of course).
Since the size is unknown, almost every function working with a null-terminated array needs to iterate over the whole thing. This might be a problem if the array is large, and many functions usually wouldn't actually need to access all entries (or not in the order they are stored).
The terminator itself obviously needs to be a value that can never occur in your actual data. So, depending on your data, there might not be an obvious candidate to use as terminator (or even none at all).
There are probably more subtle differences which might influence your decision, but these are the first ones that came to my mind.
So, I'm making a Unix minishell, and have come to a roadblock. I need to be able to execute built-in functions, so I made a function:
int exec_if_built_in(char **args)
It takes an array of strings (the first being the command, the rest being arguments). For non-built-in commands I simply use something like execvp. However, I need to find a way to map the first string to a function. I was thinking of making two arrays, one of strings and another with their corresponding function pointers. However, since many of these functions are different (they return and accept different things), this approach won't work. I also thought of making an array of structs with a name property and a function-pointer property, but once again, due to the varied nature of the functions I'll be using, this won't work.
So, what's the best way to execute a function based on the input of a string? How do I map a string to a certain function? I'm not very familiar with function pointers so I may be missing something.
Thank you guys for the help :)
I was thinking of making two arrays, one of strings, and another with their corresponding function pointers.
That is the right approach. Make wrapper functions that take arguments of an identical type*, then call the "real" functions from inside your wrappers.
If you would like to "earn additional points for style", sort the array of strings alphabetically, and use binary search on the array of strings. This will let you save a few CPU cycles when the two arrays get bigger.
* Perhaps you could use char **, because that's what your exec_if_built_in takes.
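A minimal sketch of that table; the builtin names and bodies here are only placeholders:

#include <string.h>

// Wrappers with identical signatures; a real cd would call chdir(args[1]).
static int builtin_cd(char **args)   { (void)args; return 1; }
static int builtin_exit(char **args) { (void)args; return 0; }

typedef struct {
    const char *name;
    int (*fn)(char **args);
} Builtin;

// Keep this sorted by name if you later switch to bsearch().
static const Builtin builtins[] = {
    { "cd",   builtin_cd },
    { "exit", builtin_exit },
};

int exec_if_built_in(char **args)
{
    for (size_t i = 0; i < sizeof builtins / sizeof builtins[0]; i++)
        if (strcmp(args[0], builtins[i].name) == 0)
            return builtins[i].fn(args);
    return -1;  // not a builtin; caller falls back to execvp
}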
If your cases are so varied that a table-driven approach is inappropriate, then maybe an if/else-if cascade is the best solution.
if(!strcmp(args[0], "cat")) {
/* cat a file */
} else if(!strcmp(args[0], "dog")) {
/* dog a file */
} else if(!strcmp(args[0], "echo")) {
/* create an echo chamber */
} ...
I'm working on a project in which I need to read a text (source) file into memory and be able to perform random access into it (say, for instance, retrieve the address corresponding to line 3, column 15).
I would like to know if there is an established way to do this, or data structures that are particularly good for the job. I need (probably amortized) constant-time access. I'm working in C, but am willing to implement higher-level data structures if it is worth it.
My first idea was to go with a linked list of large buffers that hold the character data of the file. I would also make an array whose indices are line numbers and whose contents are the addresses corresponding to the beginnings of the lines. This array would be reallocated as needed.
Subsidiary question: does anyone have an idea of the average size of a source file? I was surprised not to find this on Google.
To clarify:
The files I'm concerned about are source files, so their size should be manageable, they should not be modified, and the lines have variable length (though hopefully capped at some maximum).
The problem I'm working on needs mostly a read-only file representation, but I'm very interested in digging around the problem.
Conclusion:
There is a very interesting discussion of the data structures used to maintain a file (with read/insert/delete support) in the paper Data Structures for Text Sequences.
If you just need read-only access, get the file size, read the file into memory with fread(), then maintain a dynamic array that maps line numbers (indices) to pointers to the first character of each line. Someone below suggested building this array lazily, which seems a good idea in many cases.
I'm not quite sure what the question is here, but there seems to be a bit of both "how do I keep the file in memory" and "how do I index it". Since you need random access to the file's contents, you're probably well advised to memory-map the file, unless you're tight on address space.
I don't think you'll be able to avoid a linear pass through the file once to find the line endings. As you said, you can create an index of the pointers to the beginning of each line. If you're not sure how much of the index you'll need, create it lazily (on demand). You can also store this index to disk (as offsets, not pointers) if you will need it on subsequent runs. You can estimate the size of the index based on the file size and the expected line length.
1) Read (or mmap) the entire file into one chunk of memory.
2) In a second pass, create an array of pointers or offsets into that memory, pointing to the beginnings of the lines (hint: one past each '\n').
Now you can index the array to access a specific line.
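A minimal sketch of that two-pass scheme (error handling trimmed, names illustrative):

#include <stdio.h>
#include <stdlib.h>

typedef struct {
    char   *text;    // entire file contents
    char  **lines;   // lines[i] points at the first char of line i
    size_t  nlines;
} FileIndex;

int file_index_load(FileIndex *fi, const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);
    fi->text = malloc(size + 1);
    fread(fi->text, 1, size, f);
    fi->text[size] = '\0';
    fclose(f);

    // Second pass: record where every line begins.
    fi->nlines = 0;
    fi->lines = NULL;
    size_t cap = 0;
    for (char *p = fi->text; *p; ) {
        if (fi->nlines == cap) {
            cap = cap ? cap * 2 : 64;
            fi->lines = realloc(fi->lines, cap * sizeof *fi->lines);
        }
        fi->lines[fi->nlines++] = p;
        while (*p && *p != '\n') p++;   // skip to end of line
        if (*p == '\n') p++;            // next line starts one past '\n'
    }
    return 0;
}

// O(1) once the index exists (line and column are 1-based).
char *file_index_at(const FileIndex *fi, size_t line, size_t col)
{
    return fi->lines[line - 1] + (col - 1);
}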
It's impossible to make insertion, deletion, and reading at a particular line/column/character address all simultaneously O(1). The best you can get is simultaneous O(log n) for all of these operations, and it can be achieved using various sorts of balanced binary trees for storing the file in memory.
Of course, unless your files will be larger than 100 kB or so, you're probably best off not bothering with anything fancy and just using a flat linear buffer...
Solution: if lines are about the same size, make all lines equally long by appending the needed number of padding characters to each line. Then you can simply calculate the fseek() position from the line number, making your lookup O(1).
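In that fixed-width case, the seek position is plain arithmetic (a sketch; LINE_LEN is the assumed padded width):

#include <stdio.h>

#define LINE_LEN 80   // padded width of every line, excluding the '\n'

// Seek to the start of 1-based line `lineno` in a fixed-width file.
int seek_to_line(FILE *f, long lineno)
{
    return fseek(f, (lineno - 1) * (LINE_LEN + 1), SEEK_SET);
}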
If lines are sorted, then you can perform binary search, making your search O(log(numLines)).
If neither, you can store the indexes of the line beginnings. But then you have a problem if you modify the file a lot: if you insert, say, X characters somewhere, you have to calculate which line that is and then add X to the offsets of all following lines. Similarly with deletion. You essentially get O(numLines), and the code gets ugly.
If you want to store the whole file in memory, just create an array of lines, char *lines[]. You then get a line with the first dereference and a character with the second dereference.
As an alternate suggestion (although I do not fully understand the question), you might want to consider a struct based, dynamically linked list of dynamic strings. If you want to be astutely clever, you could build a dynamically linked list of chars which you then export as strings.
You'd have to use OO type design for this to be manageable.
So structs you'd likely want to build are:
DynamicArray;
DynamicListOfArrays;
CharList;
So it goes:
CharList(Gets Chars/Size) -> (SetSize)DynamicArray -> (AddArray)DynamicListOfArrays
If you build suitable helper functions for allocation and deletion, and make it so the structs can be freed either automatically or manually, the above combination won't get you O(1) reads (which isn't possible unless the file has a static format), but it will get you good times.
If you know the file's line length is statically capped (i.e., no line bigger than 256 chars), then all you need is the DynamicListOfArrays: write directly to the array (preset to 256), create a new one, repeat. The downside is that it wastes memory.
Note: You'd have to convert the DynamicListOfArrays into a 'static' ArrayOfArrays before you could get direct point-to-point access.
If you need source code to give you an idea (although mine is built towards C++ it wouldn't take long to rewrite), leave a comment about it. As with any other code I offer on stackoverflow, it can be used for any purpose, even commercially.
Average size of a source file? Does such a thing exist? A source file could go from 0 bytes to thousands of bytes; like any text file, it depends on the number of characters it contains.
I'm using a naive approach to this problem: I'm putting the words in a linked list and just doing a linear search through it. But it's taking too much time on large files.
I was thinking of using a Binary Search Tree, but I don't know whether it works well with strings. I've also heard of skip lists, but haven't really learned them yet.
And also I have to use the C language...
You can put all of the words into a trie and then count the number of words after you have processed the whole file.
Binary Search Trees work fine for strings.
If you don't care about having the words in sorted order, you can just use a hash table.
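A minimal trie sketch for the first suggestion (assumes lowercase ASCII words; names are illustrative):

#include <stdlib.h>

#define ALPHA 26

typedef struct Trie {
    struct Trie *child[ALPHA];
    int is_word;
} Trie;

// Insert a lowercase word; returns 1 if it was new, 0 if already seen.
int trie_insert(Trie *root, const char *w)
{
    for (; *w; w++) {
        int i = *w - 'a';
        if (root->child[i] == NULL)
            root->child[i] = calloc(1, sizeof *root);
        root = root->child[i];
    }
    if (root->is_word)
        return 0;       // duplicate
    root->is_word = 1;
    return 1;           // new word: bump your counter
}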
You're counting the number of unique words in the file?
Why don't you construct a simple hash table? This way, for each word in your list, you add it into the hash table. Any duplicates will be discarded, since they're already in the table. Finally, you can just count the number of elements in the data structure (by storing a counter and incrementing it each time you add to the table).
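A minimal chaining hash table along those lines (fixed bucket count, no resizing; names are illustrative, and strdup is POSIX):

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 4096

typedef struct Entry {
    char         *word;
    struct Entry *next;
} Entry;

static Entry *buckets[NBUCKETS];
static size_t nunique;   // the unique-word counter

static unsigned long hash(const char *s)
{
    unsigned long h = 5381;                  // djb2
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

// Add a word if unseen; duplicates are discarded.
void add_word(const char *word)
{
    unsigned long h = hash(word);
    for (Entry *e = buckets[h]; e; e = e->next)
        if (strcmp(e->word, word) == 0)
            return;                          // already in the table
    Entry *e = malloc(sizeof *e);
    e->word = strdup(word);
    e->next = buckets[h];
    buckets[h] = e;
    nunique++;
}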
The first upgrade to your algorithm could be keeping the list sorted, so your linear search can stop early (you only search until you find an element greater than yours), but this is still a naive solution.
The best approaches are Binary Search Trees and, even better, a prefix tree (or trie, already mentioned in another answer).
In "The C Programming Language" From K&R you have the exact example of what you are looking for.
The first example of "self-referential structures" (6.5) is a binary search tree used for counting the occurrences of every word in a string. (You don't need to count :P)
The structure is something like this:
struct tnode {
char *word;
struct tnode *left;
struct tnode *right;
};
In the book you can see the whole example of what you want to do.
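Using the struct above, the insertion is a short recursive function along the lines of the book's addtree (simplified here: the occurrence count is dropped, and strdup is POSIX):

#include <stdlib.h>
#include <string.h>

// Insert word into the tree rooted at p; duplicates are ignored.
struct tnode *addtree(struct tnode *p, const char *word)
{
    if (p == NULL) {                 // new word: make a node
        p = malloc(sizeof *p);
        p->word = strdup(word);
        p->left = p->right = NULL;
    } else {
        int cond = strcmp(word, p->word);
        if (cond < 0)
            p->left = addtree(p->left, word);
        else if (cond > 0)
            p->right = addtree(p->right, word);
        // cond == 0: already present, nothing to do
    }
    return p;
}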
Binary Search Trees work well with any type of data that can be ordered, and will be better than a linear search in a list.
Sorry for my poor English, and correct me if I'm wrong with something I've said; I'm very much a noob with C :p
EDIT: I can't add comments to other answers, but I have read a comment from the OP saying "The list isn't sorted so I can't use binary search". It makes no sense to use binary search on a linked list. Why? Binary search is efficient when access to a random element is fast, like in an array. In a doubly linked list, your worst-case access will be n/2. However, you could keep extra pointers into the list (to key elements), but that is a bad solution.
I'm puting the words in a linked list and just making a linear search into it.
If, to check whether word W is present, you go through the whole list, then it's surely slow: O(n^2), where n is the size of the list.
The simplest way is probably a hash table. It's easy to implement yourself (unlike some tree structures), and even C should have some libraries for that. You'll get O(n) complexity.
edit: Some C hash table implementations:
http://en.wikipedia.org/wiki/Hash_table#Independent_packages
If you're on a UNIX system, then you could use the bsearch() or hsearch() family of functions instead of a linear search.
If you need something simple and easily available, then man tsearch for a simple binary search tree. But note this is a plain binary search tree, not a balanced one.
Depending on number of unique words, plain C array + realloc() + qsort() + bsearch() might be an option too. That's what I use when I need no-frills faster-than-linear search in plain portable C. (Otherwise, if possible, I opt for C++ and std::map/std::set.)
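A minimal sketch of that combination, sorting once and then searching (in real use you would collect all words first, or re-sort as needed):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// qsort/bsearch comparator for an array of char *.
static int cmpstr(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

int main(void)
{
    const char *words[] = { "pear", "apple", "cherry", "banana" };
    size_t n = sizeof words / sizeof words[0];

    qsort(words, n, sizeof words[0], cmpstr);

    const char *key = "cherry";
    const char **found = bsearch(&key, words, n, sizeof words[0], cmpstr);
    printf("%s\n", found ? "found" : "not found");
    return 0;
}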
More advanced options are often platform-specific (e.g. glib on Linux).
P.S. Another very easy-to-implement structure is a hash table. It's less efficient for strings but very easy to implement, and it can very quickly be made blazing fast by throwing memory at the problem.
This is not exactly a technical question, since I know C well enough to do the things I need to (I mean, in terms of not 'letting the language get in your way'), so this question is basically a 'what direction to take' one.
Situation is: I am currently taking an advanced algorithms course, and for the sake of 'growing up as programmers', I am required to use pure C to implement the practical assignments (it works well: pretty much any small mistake you make actually forces you to understand completely what you're doing in order to fix it). In the course of implementing, I obviously run into the problem of having to implement the 'basic' data structures from the ground up: actually not only linked lists, but also stacks, trees, et cetera.
I am focusing on lists in this topic because it's typically a structure I end up using a lot in a program, either as a 'main' structure or as a 'helper' structure for bigger ones (for example, a hash table that resolves collisions by using a linked list).
This requires that the list stores elements of lots of different types. I am assuming here as a premise that I don't want to re-code the list for every type. So, I can come up with these alternatives:
Making a list of void pointers (kinda inelegant; harder to debug)
Making only one list, but having a union as 'element type', containing all element types I will use in the program (easier to debug; wastes space if elements are not all the same size)
Using a preprocessor macro to regenerate the code for every type, in the style of SGLIB, 'imitating' C++'s STL (creative solution; doesn't waste space; elements have the explicit type they actually are when they are returned; any change in list code can be really dramatic)
Your idea/solution
To make the question clear: which one of the above is best?
PS: Since I am basically in an academic context, I am also very interested in the view of people working with pure C out there in the industry. I understand that most pure C programmers are in the embedded devices area, where I don't think this kind of problem I am facing is common. However, if anyone out there knows how it's done 'in the real world', I would be very interested in your opinion.
A void * is a bit of a pain in a linked list since you have to manage its allocation separately from the list itself. One approach I've used in the past is to have a 'variable-sized' structure like:
typedef struct _tNode {
struct _tNode *prev;
struct _tNode *next;
int payloadType;
char payload[1]; // or use different type for alignment.
} tNode;
Now I realize that doesn't look variable-sized, but let's allocate a structure thus:
typedef struct {
char Name[30];
char Addr[50];
} tPerson;
tNode *node = malloc (sizeof (tNode) - 1 + sizeof (tPerson));
Now you have a node that, for all intents and purposes, looks like this:
typedef struct _tNode {
struct _tNode *prev;
struct _tNode *next;
int payloadType;
char Name[30];
char Addr[50];
} tNode;
or, in graphical form (where [n] means n bytes):
+----------------+
| prev[4] |
+----------------+
| next[4] |
+----------------+
| payloadType[4] |
+----------------+ +----------+
| payload[1] | <- overlap -> | Name[30] |
+----------------+ +----------+
| Addr[50] |
+----------+
That is, assuming you know how to address the payload correctly. This can be done as follows:
node->prev = NULL;
node->next = NULL;
node->payloadType = PLTYP_PERSON;
tPerson *person = (tPerson *) node->payload; // cast for easy access to the payload.
strcpy (person->Name, "Bob Smith");
strcpy (person->Addr, "7 Station St");
That cast line simply casts the address of the payload array (in the tNode type) to a pointer to the actual tPerson payload type.
Using this method, you can carry any payload type you want in a node, even different payload types in each node, without the wasted space of a union. This wastage can be seen with the following:
union {
int x;
char y[100];
} u;
where 96 bytes are wasted every time you store an integer type in the list (for a 4-byte integer).
The payload type in the tNode allows you to easily detect what type of payload this node is carrying, so your code can decide how to process it. You can use something along the lines of:
#define PAYLOAD_UNKNOWN 0
#define PAYLOAD_MANAGER 1
#define PAYLOAD_EMPLOYEE 2
#define PAYLOAD_CONTRACTOR 3
or (probably better):
typedef enum {
PAYLOAD_UNKNOWN,
PAYLOAD_MANAGER,
PAYLOAD_EMPLOYEE,
PAYLOAD_CONTRACTOR
} tPayLoad;
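A consumer can then switch on the tag; here is a sketch building on the tNode and tPerson definitions above, using the enum names just shown:

#include <stdio.h>

// Process one node according to the payload it carries.
void process_node(tNode *node)
{
    switch (node->payloadType) {
    case PAYLOAD_MANAGER:
    case PAYLOAD_EMPLOYEE: {
        tPerson *person = (tPerson *) node->payload;
        printf("%s at %s\n", person->Name, person->Addr);
        break;
    }
    default:
        break;   // unknown payload: skip or report an error
    }
}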
My $.002:
Making a list of void pointers (kinda inelegant; harder to debug)
This isn't such a bad choice, IMHO, if you must write in C. You might add API methods to allow the application to supply a print() method for ease of debugging. Similar methods could be invoked when (e.g.) items get added to or removed from the list. (For linked lists this is usually not necessary, but for more complex data structures -- hash tables, for example -- it can sometimes be a lifesaver.)
Making only one list, but having a union as 'element type', containing all element types I will use in the program (easier to debug; wastes space if elements are not all the same size)
I would avoid this like the plague. (Well, you did ask.) Having a manually-configured, compile-time dependency from the data structure to its contained types is the worst of all worlds. Again, IMHO.
Using a preprocessor macro to regenerate the code for every type, in the style of SGLIB (sglib.sourceforge.net), 'imitating' C++'s STL (creative solution; doesn't waste space; elements have the explicit type they actually are when they are returned; any change in list code can be really dramatic)
Intriguing idea, but since I don't know SGLIB, I can't say much more than that.
Your idea/solution
I'd go with the first choice.
I've done this in the past, in our code (which has since been converted to C++), and at the time, decided on the void* approach. I just did this for flexibility - we were almost always storing a pointer in the list anyways, and the simplicity of the solution, and usability of it outweighed (for me) the downsides to the other approaches.
That being said, there was one time where it caused some nasty bug that was difficult to debug, so it's definitely not a perfect solution. I think it's still the one I'd take, though, if I was doing this again now.
Using a preprocessor macro is the best option. The Linux kernel linked list is an excellent and efficient implementation of a circularly linked list in C. It is very portable and easy to use. Here is a standalone version of the Linux kernel 2.6.29 list.h header.
The FreeBSD/OpenBSD sys/queue is another good option for a generic, macro-based linked list.
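To give a flavor of the macro-generation idea, here is a toy sketch (not SGLIB's or the kernel's actual API):

#include <stdlib.h>

// Generates a node type and a push function for any element type T.
#define DEFINE_LIST(T)                                      \
    typedef struct T##_node {                               \
        T value;                                            \
        struct T##_node *next;                              \
    } T##_node;                                             \
                                                            \
    static T##_node *T##_push(T##_node *head, T value)      \
    {                                                       \
        T##_node *n = malloc(sizeof *n);                    \
        n->value = value;                                   \
        n->next = head;                                     \
        return n;                                           \
    }

DEFINE_LIST(int)      // expands to int_node and int_push()
DEFINE_LIST(double)   // expands to double_node and double_push()

int main(void)
{
    int_node *xs = NULL;
    xs = int_push(xs, 1);
    xs = int_push(xs, 2);

    double_node *ys = NULL;
    ys = double_push(ys, 3.14);
    return 0;
}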
I haven't coded C in years but GLib claims to provide "a large set of utility functions for strings and common data structures", among which are linked lists.
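For instance, with GLib's singly linked lists (a tiny sketch; check the GLib docs for the exact API of your version):

#include <glib.h>

int main(void)
{
    GSList *list = NULL;

    // GLib lists store gpointer (void *) payloads.
    list = g_slist_prepend(list, "world");
    list = g_slist_prepend(list, "hello");

    for (GSList *l = list; l != NULL; l = l->next)
        g_print("%s\n", (char *) l->data);

    g_slist_free(list);
    return 0;
}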
Although it's tempting to think about solving this kind of problem using the techniques of another language -- say, generics -- in practice it's rarely a win. There are probably some canned solutions that get it right most of the time (and tell you in their documentation when they get it wrong), but using one might miss the point of the assignment, so I'd think twice about it. In a very few cases it might be feasible to roll your own, but for a project of any reasonable size, it's not likely to be worth the debugging effort.
Rather, when programming in language X, you should use the idioms of language X. Don't write Java when you're using Python. Don't write C when you're using Scheme. Don't write C++ when you're using C99.
Myself, I'd probably end up using something like Pax's suggestion, but actually use a union of char[1], void *, and int to make the common cases convenient (plus an enum type flag).
(I'd also probably end up implementing a Fibonacci tree, just 'cause that sounds neat, and you can only implement red-black trees so many times before they lose their flavor, even if they're better for the common cases they'd be used for.)
edit: based on your comment, it looks like you've got a pretty good case for using a canned solution. If your instructor allows it, and the syntax it offers feels comfortable, give it a whirl.
This is a good problem. There are two solutions I like:
Dave Hanson's C Interfaces and Implementations uses a list of void * pointers, which is good enough for me.
For my students, I wrote an awk script to generate type-specific list functions. Compared to preprocessor macros, it requires an extra build step, but the operation of the system is much more transparent to programmers without a lot of experience. And it really helps make the case for parametric polymorphism, which they see later in their curriculum.
Here's what one set of functions looks like:
int lengthEL (Explist *l);
Exp* nthEL (Explist *l, unsigned n);
Explist *mkEL (Exp *hd, Explist *tl);
The awk script is a 150-line horror; it searches C code for typedefs and generates a set of list functions for each one. It's very old; I could probably do better now :-)
I wouldn't give a list of unions the time of day (or space on my hard drive). It's not safe, and it's not extensible, so you may as well just use void * and be done with it.
One improvement over making it a list of void* would be making it a list of structs that contain a void* and some meta-data about what the void* points to, including its type, size, etc.
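A sketch of such a tagged node (names are illustrative):

#include <stddef.h>

typedef enum { T_INT, T_STRING, T_PERSON } ElemType;

typedef struct ListNode {
    void            *data;   // points at the actual element
    ElemType         type;   // what `data` points to
    size_t           size;   // size of the pointed-to object
    struct ListNode *next;
} ListNode;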
Other ideas: embed a Perl or Lisp interpreter.
Or go halfway: link with the Perl library and make it a list of Perl SVs or something.
I'd probably go with the void* approach myself, but it occurred to me that you could store your data as XML. Then the list can just have a char * for the data (which you would parse on demand for whatever sub-elements you need)...