Multiple Keys for exact and approximate lookups in C

I am trying to develop a network resource manager component in C which keeps track of various network elements over TCP/UDP sockets. For this, I use three values:
Hardware Location Number
Service Group Number
Node Number
The rule is that no two elements on a network may have the same set of these three numbers. Thus, each location's identity will be unique on the network. This information needs to be saved in the program (non-persistently) in a way so that given any of the parameters (could be just a single number, or a combination of any two, or all three) the program returns the eligible candidates by performing a quick search.
The addition and deletion should also be efficient, but given that there will be few insertions or deletions after the initial transient phase if they are a bit slower than search, it should be OK. Using trees is one option, but the answer of 'Which one to use?' still eludes me (Not that I know of many, but I look forward to learning newer ones if they serve my purpose).
To do this, I could have three different trees maintained separately with similar nodes pointing to a same structure in memory, but I feel that is inefficient and not compact. I am looking for a unified data set which can handle these variations like multiple keys.
Or I could have a single AVL tree with multiple keys (if that is allowed).
The number of elements in the network is dynamic, so using a 3D array is not an option.
A friend also suggested hashing, but I am not too sure.
Please help.

Hashing seems like a silly choice for this. Perhaps the most significant reason is that you seem interested in approximate lookups. Hashing your values will likely mean iterating through the entire collection to find a group of nodes that have a common prefix, or a similar prefix.
PATRICIA is commonly used in routing tables, and makes itself quite amenable to searching for items that have similar keys. Note that I have found much misleading information about PATRICIA tries, which I've written about here. I found this resource to be particularly helpful.
Similarly to an AVL tree, you'll need to combine the three keys to form one (without hashing, preferably).
unsigned int key[3] = { hardware_location_number, service_group_number, node_number };
/* ^------- Use something like this as your key */
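To make the combined key concrete, here is a minimal sketch (not a PATRICIA trie, just the key handling any tree or sorted array would need). It orders the three-part key lexicographically and checks a key against a partial pattern. KEY_ANY, key_compare and key_matches are hypothetical names, and the sketch assumes the value 0xFFFFFFFF never occurs as a real location, group or node number.

```c
#include <assert.h>
#include <stddef.h>

#define KEY_PARTS 3
#define KEY_ANY   0xFFFFFFFFu  /* hypothetical wildcard: "match any value" */

/* Lexicographic order over the three parts, usable as a tree or sort comparator. */
int key_compare(const unsigned int *a, const unsigned int *b)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < KEY_PARTS; i++) {
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    }
    return 0;
}

/* Returns 1 if every non-wildcard part of the pattern equals the key, else 0. */
int key_matches(const unsigned int *key, const unsigned int *pattern)
{
    assert(key != NULL && pattern != NULL);
    for (size_t i = 0; i < KEY_PARTS; i++) {
        if (pattern[i] != KEY_ANY && pattern[i] != key[i])
            return 0;
    }
    return 1;
}
```

With this ordering, a lookup that fixes the leading parts (hardware location, or hardware location plus service group) covers a contiguous range of keys. A lookup that only fixes the node number does not, so it would still need a scan or a second index.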

Related

Can a B tree have more solutions?

I have these values
10,15,20,25,30,33,38,40,43,45,50
and then I insert 34
I tried 2 generators
https://s3.amazonaws.com/learneroo/visual-algorithms/BTree.html
http://ysangkok.github.io/js-clrs-btree/btree.html
and they gave me different results
On paper I tried to create the tree by inserting those consecutive values one by one and got a totally different result.
If the elements were in random order would the result be the same?
My result is this
The problem is when on the right I have 38|40|45 and I add 50 I have to put 40 a level higher but in the internet generators they also put 33 a level down and I don't see why

Can a B tree have more solutions?
I think you're asking whether there can be more than one way to store a given set of keys in a b-tree, but you have already answered that yourself. Both of the generated examples you present contain the same keys and are valid 1-3 b-trees. The first is also a valid 1-2 b-tree. With the correction, your attempt is also a valid 1-3 b-tree.
Note well that there are different flavors of b-tree, based on how many keys the internal nodes are permitted to contain, and also that even binary trees, with which you may be more familiar, afford many different structures for the same set of two or more keys.
If the elements were in random order would the result be the same?
Very likely so, yes, but that's not a question of the b-tree form and structure, but rather about the implementation of the software used to construct and maintain it.
You seem confused that "in the internet generators they also put 33 a level down and I don't see why", but we can only speculate about the implementation of the software supporting those trees. It's unlikely that anyone here can tell you with certainty why they produce the particular b-tree forms they do, but those forms are valid, and so, now, is yours.

how to remember multiple indexes in a buffer to later access them for modification one by one...keeping optimization in mind

I have a scenario where I have to set a few records' field values to a constant and then later access them one by one, sequentially.
The records can be any records in the buffer.
I don't want to use a linked list, as it will be costly, and I don't want to traverse the whole buffer.
Please give me some ideas on how to do that.

When you say "set few records with field values to a constant" is this like a key to the record? And then "later access them one by one" - is this to recall them with some key? "one-by-one sequentially" and "don't want to traverse the whole buffer" seems to conflict, as sequential access sounds a lot like traversal.
But I digress. If you in fact do have a key (and it's a number), you could use some sort of Hash Table to organize your records. One basic implementation might be an array of linked lists, where you mod the key down into the array's range, then add it to the list there. This might increase performance assuming you have a good distribution of keys (your records spread across the array well).
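A minimal sketch of that "array of linked lists" idea, assuming the key is an unsigned integer. The record layout, BUCKETS, table_put and table_get are all hypothetical names for illustration, and freeing the records is left out to keep it short.

```c
#include <assert.h>
#include <stdlib.h>

#define BUCKETS 64  /* hypothetical table size */

struct record {            /* hypothetical record layout */
    unsigned int key;
    int value;
    struct record *next;   /* chain of records sharing a bucket */
};

struct table {
    struct record *bucket[BUCKETS];
};

/* Insert at the head of the chain; returns 0 on success, -1 if out of memory. */
int table_put(struct table *t, unsigned int key, int value)
{
    assert(t != NULL);
    struct record *r = malloc(sizeof *r);
    if (r == NULL)
        return -1;
    r->key = key;
    r->value = value;
    r->next = t->bucket[key % BUCKETS];   /* mod the key into the array's range */
    t->bucket[key % BUCKETS] = r;
    return 0;
}

/* Walk only the one chain the key maps to; NULL if not found. */
struct record *table_get(const struct table *t, unsigned int key)
{
    assert(t != NULL);
    for (struct record *r = t->bucket[key % BUCKETS]; r != NULL; r = r->next) {
        if (r->key == key)
            return r;
    }
    return NULL;
}
```

A lookup touches one chain rather than the whole buffer, which is where the speedup comes from when the keys spread evenly over the buckets.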
Another data structure to look into might be a B-Tree or a binary search tree, which can access nodes in logarithmic time.
However, overall I agree with the commenters that over-optimizing is usually not a good idea.

Two hash tables, hash table with double key or different solution?

Once again, talking about my upcoming university project... I had a class today, where we could ask stuff about the project but I still haven't decided on the best way to do this.
Basically, I have a bunch of users (some struct with a couple of members) which must be quickly searched by name and by SSN. Since I will also need to use these users on a Graph (for other operations), I will be working with pointers.
Now, I thought about using two hash tables. One where the key is the name and another where the key is the SSN. But I don't like the idea of having two Hash Tables, simply with different keys and pointing to the same place.
It crossed my mind using a Hash Table with two keys but I don't even know if that is possible and I believe it's not. I just can't think of a way to do it, maybe there is one, or maybe not.
Besides these two solutions, I can't think of any other alternative... I may have to go with the two Hash Tables.
Do you guys suggest any other alternative?

I'd go with two hash tables. Think of it as two indexes into a database. The database is your users, and you provide two indexes: one ssn index and one name index.

I think that two Hashtables are ok. Consider binary search trees also, they can be more compact but O(log n) search and harder to implement.
"Hash Table with two keys" never heard of it...

I don't think there is a way to build a single hash table which supports two keys.
If you want both SSN-lookup and name-lookup to be really fast, then you need two hash tables. You have to remember to add to both of them, or remove from both of them.
Otherwise, you can make the more frequent one (e.g. SSN-lookup) as a hash-based lookup, and the other one as brute-force lookup from the hash table.
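One way to make the "remember to add to both" part harder to get wrong is to put both indexes behind a single add function. The sketch below is only an illustration under assumptions: the user layout, BUCKETS, hash_name, index_add and the find_* functions are hypothetical, the SSN is assumed to fit in an unsigned long, and the chain links live inside the user struct instead of in separate table nodes.

```c
#include <assert.h>
#include <string.h>

#define BUCKETS 128  /* hypothetical table size */

struct user {                    /* hypothetical user record */
    unsigned long ssn;
    char name[32];
    struct user *next_by_ssn;    /* chain link in the SSN index */
    struct user *next_by_name;   /* chain link in the name index */
};

struct user_index {
    struct user *by_ssn[BUCKETS];
    struct user *by_name[BUCKETS];
};

/* Simple multiplicative string hash; any reasonable hash would do. */
static unsigned long hash_name(const char *s)
{
    unsigned long h = 5381;
    while (*s != '\0')
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Link one caller-owned user into both indexes in a single call. */
void index_add(struct user_index *ix, struct user *u)
{
    assert(ix != NULL && u != NULL);
    struct user **s = &ix->by_ssn[u->ssn % BUCKETS];
    struct user **n = &ix->by_name[hash_name(u->name) % BUCKETS];
    u->next_by_ssn = *s;
    *s = u;
    u->next_by_name = *n;
    *n = u;
}

struct user *find_by_ssn(const struct user_index *ix, unsigned long ssn)
{
    assert(ix != NULL);
    for (struct user *u = ix->by_ssn[ssn % BUCKETS]; u != NULL; u = u->next_by_ssn)
        if (u->ssn == ssn)
            return u;
    return NULL;
}

/* Names need not be unique; this returns the most recently added match. */
struct user *find_by_name(const struct user_index *ix, const char *name)
{
    assert(ix != NULL && name != NULL);
    struct user *u = ix->by_name[hash_name(name) % BUCKETS];
    for (; u != NULL; u = u->next_by_name)
        if (strcmp(u->name, name) == 0)
            return u;
    return NULL;
}
```

Both lookups return a pointer to the same struct, so the graph code can hold those pointers directly. Removal would have to unlink the user from both chains in the same way.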

Two hash tables like you said. The advantage is that lookup will be very fast for RANDOM data or even real-life data. The disadvantage is that you don't know what your professors will throw at it (or do you?) and they might force the worst case.
Balanced search trees. I recommend treaps: http://en.wikipedia.org/wiki/Treap - they are, in my opinion, the easiest to implement.
Sort your users and binary search. Also O(log N) per search, and even easier to implement than a treap.
A combination of hashes + sorted users / search trees, if you can afford the memory. This will make it O(1) best case and O(log N) worst case. If H[i] = list of objects that hashed to i, keep a count for each i that tells you how many objects are in that list. If that count is too big, use the sorted users list / search tree instead.
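For the "sort your users and binary search" option, the standard library already does most of the work. A rough sketch using qsort and bsearch on an array of pointers, assuming a hypothetical user struct keyed by a numeric SSN (sort_by_ssn and search_by_ssn are made-up names):

```c
#include <assert.h>
#include <stdlib.h>

struct user {              /* hypothetical user record */
    unsigned long ssn;
    const char *name;
};

/* Compare two array slots, each holding a pointer to a user, by SSN. */
static int cmp_user_ptr(const void *a, const void *b)
{
    const struct user *ua = *(const struct user *const *)a;
    const struct user *ub = *(const struct user *const *)b;
    return (ua->ssn > ub->ssn) - (ua->ssn < ub->ssn);
}

/* Sort the pointer array once after loading the users. */
void sort_by_ssn(struct user **users, size_t n)
{
    assert(users != NULL);
    qsort(users, n, sizeof *users, cmp_user_ptr);
}

/* O(log n) lookup in the sorted pointer array; NULL if absent. */
struct user *search_by_ssn(struct user **users, size_t n, unsigned long ssn)
{
    assert(users != NULL);
    struct user probe = { ssn, NULL };
    const struct user *key = &probe;   /* same shape as an array slot */
    struct user **hit = bsearch(&key, users, n, sizeof *users, cmp_user_ptr);
    return hit != NULL ? *hit : NULL;
}
```

A second pointer array sorted by name, with its own comparator, would give the name lookup. The catch is that every insertion means re-sorting or shifting elements, so this fits best when the users are loaded once and then mostly searched.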

What about concatenating the two keys and using the result as the key?
For example, say I had x, y, z.
Concatenate x and y, using a string or char as a separator. This is a simple way to do it.
At this post I see what could be more interesting to this solution:
Multi-dimensional associative arrays in javascript

Linked lists or hash tables?

I have a linked list of around 5000 entries ("NOT" inserted simultaneously), and I am traversing the list, looking for a particular entry on occasions (though this is not very often), should I consider Hash Table as a more optimum choice for this case, replacing the linked list (which is doubly-linked & linear) ?? Using C in Linux.

If you have not found the code to be the slow part of the application via a profiler then you shouldn't do anything about it yet.
If it is slow, but the code is tested, works, and is clear, and there are other slower areas that you can work on speeding up do those first.
If it is buggy then you need to fix it anyway, so go for the hash table, as it will be faster than the list. This assumes that the order in which the data is traversed does not matter; if you care about the insertion order then stick with the list (you can do things with a hash table and keep the order, but that will make the code much trickier).
Given that you need to search the list only on occasion, the odds of this being a significant bottleneck in your code are small.
Another data structure to look at is a "skip list" which basically lets you skip over a large portion of the list. This requires that the list be sorted however, which, depending on what you are doing, may make the code slower overall.

Whether using hash table is more optimum or not depends on the use case, which you have not described in detail. But more importantly, make sure the bottleneck of performance is in this part of the code. If this code is called only once in a while and not in a critical path, no use bothering to change the code.

Have you measured and found a performance hit with the lookup? A hash_map or hash table should be good.

If you need to traverse the list in order (not as a part of searching for elements, but say for displaying them) then a linked list is a good choice. If you're only storing them so that you can look up elements then a hash table will greatly outperform a linked list (for all but the worst possible hash function).
If your application calls for both types of operations, you might consider keeping both, and using whichever one is appropriate for a particular task. The memory overhead would be small, since you'd only need to keep one copy of each element in memory and have the data structures store pointers to these objects.
As with any optimization step that you take, make sure you measure your code to find the real bottleneck before you make any changes.

If you care about performance, you definitely should. If you're iterating through the thing to find a certain element with any regularity, it's going to be worth it to use a hash table. If it's a rare case, though, and the ordinary use of the list is not a search, then there's no reason to worry about it.

If you only traverse the collection I don't see any advantages of using a hashmap.

I advise against hashes in almost all cases.
There are two reasons; firstly, the size of the hash is fixed.
Second and much more importantly; the hashing algorithm. How do you know you've got it right? how will it behave with real data rather than test data?
I suggest a balanced b-tree. Always O(log n), no uncertainty with regard to a hash algorithm and no size limits.

Best data structure in C for these two situations?

I kinda need to decide on this to see if I can achieve it in a couple of hours before the deadline for my school project is due but I don't understand much about data structures and I need suggestions...
There's 2 things I need to do, they will probably use different data structures.
1. I need a data structure to hold profile records. The profiles must be searchable by name and social security number. The SSN is unique, so I probably can use that to my advantage? I suppose a hash map is the best bet here? But how do I use the SSN in a hash map to use that as an advantage in looking for a specific profile? A basic and easy to understand explanation would be much appreciated.
2. I need a data structure to hold records about cities. I need to know which are cities with most visitors, cities less visited and the clients (the profile is pulled from the data structure in #1 for data about the clients) that visit a specific city.
This is the third data structure I need for my project and it's the data structure that I have no idea where to begin. Suggestions as for which type of data structure to use are appreciated, if possible, with examples on how to hold the data above in bold.
As a note:
The first data structure is already done (I talked about it in a previous question). The second one is posted here on #1 and although the other group members are taking care of that I just need to know if what we are trying to do is the "best" approach. The third one is #2, the one I need most help.

The right answer lies anywhere between a balanced search tree and an array.
The situation you have mentioned here and else-thread misses out on a very important point: The size of the data you are handling. You choose your data structure and algorithm(s) depending on the amount of data you have to handle. It is important that you are able to justify your choice(s). Using a less efficient general algorithm is not always bad. Being able to back up your choices (e.g: choosing bubble-sort since data size < 10 always) shows a) greater command of the field and b) pragmatism -- both of which are in short supply.

For searchability across multiple keys, store the data in any convenient form, and provide fast lookup indexes on the key(s).
This could be as simple as keeping the data in an array (or linked list, or ...) in the order of creation, and keeping a bunch of {hashtables|sorted arrays|btrees} of maps (key, data*) for all the interesting keys (SSN, name, ...).
If you had more time, you could even work out how to not have a different struct for each different map...
I think this solution probably applies to both your problems.
Good luck.
For clarity:
First we have a simple array of student records
typedef struct student_s {
    char ssn[10];    /* nul-terminated so we can use str* functions */
    char name[100];
    float GPA;
    /* ... */
} student;

student slist[MAX_STUDENTS];
which is filled in as you go. It has no order, so search on any key is a linear time operation. Not a problem for 1,000 entries, but maybe a problem for 10,000, and certainly a problem for 1 million. See dirkgently's comments.
If we want to be able to search fast we need another layer of structure. I build a map between a key and the main data structure like this:
typedef struct str_map {
    char *key;       /* points at the key field inside the record */
    student *data;   /* points at the record in slist */
} smap;

smap skey[MAX_STUDENTS];
and maintain skey sorted on the key, so that I can do fast lookups. (Only an array is a hassle to keep sorted, so we probably prefer a tree, or a hashmap.)
This complexity isn't needed (and should certainly be avoided) if you will only want fast searches on a single field.

Outside of a homework question, you'd use a relational database for
this. But that probably doesn't help you…
The first thing you need to figure out, as others have already pointed
out, is how much data you're handling. An O(n) brute-force search is
plenty fast as long as n is small. Since a trivial amount of data would
make this a trivial problem (put it in an array, and just brute-force
search it), I'm going to assume the amount of data is large.
Storing Cities
First, your search requirements appear to require the data sorted in
multiple ways:
1. Some city unique identifier (name?)
2. Number of visitors
This actually isn't too hard to satisfy. (1) is easiest. Store the
cities in some array. The array index becomes the unique identifier
(assumption: we aren't deleting cities, or if we do delete cities we can
just leave that array spot unused, wasting some memory. Adding is OK).
Now, we also need to be able to find most & fewest visits. Assuming
modifications may happen (e.g., adding cities, changing number of
visitors, etc.) and borrowing from relational databases, I'd suggest
creating an index using some form of balanced tree. Databases would
commonly use a B-Tree, but different ones may work for you: check Wikipedia's
article on trees. In each tree node, I'd just keep a pointer (or
array index) of the city data. No reason to make another copy!
I recommend a tree over a hash for one simple reason: you can very
easily do an in-order or reverse in-order traversal to find the bottom or
top N items. A hash can't do that.
Of course, if modifications may not happen, just use another array (of
pointers to the items, once again, don't duplicate them).
Linking Cities to Profiles
How to do this depends on how you have to query the data, and what form
it can take. The most general is that each profile can be associated
with multiple cities and each city can be associated with multiple
profiles. Further, we want to be able to efficiently query from either
direction — ask both "who visits Phoenix?" and "which cities does Bob
visit?".
Shamelessly lifting from databases again, I'd create another data
structure, a fairly simple one along the lines of:
struct profile_city {
/* btree pointers here */
size_t profile_idx; /* or use a pointer */
size_t city_idx; /* for both indices */
};
So, to say Bob (profile 4) has visited Phoenix (city 2) you'd have
profile_idx = 4 and city_idx = 2. To say Bob has visited Vegas (city
1) as well, you'd add another one, so you'd have two of them for Bob.
Now, you have a choice: you can store these either in a tree or a
hash. Personally, I'd go with the tree, since that code is already
written. But a hash would be O(1) on average instead of O(log n) for lookups.
Also, just like we did for the city visit count, create an index for
city_idx so the lookup can be done from that side too.
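As a rough illustration of the city-side lookup, the sketch below swaps the tree for a plain array kept sorted by city_idx and uses a binary search to find where one city's entries start. That substitution is only to keep the example short; sort_by_city, first_for_city and visitors_of are hypothetical names, and the btree pointers from the struct above are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct profile_city {
    size_t profile_idx;
    size_t city_idx;
};

/* Order by city first, then profile, so one city's visits are contiguous. */
static int cmp_by_city(const void *a, const void *b)
{
    const struct profile_city *x = a, *y = b;
    if (x->city_idx != y->city_idx)
        return x->city_idx < y->city_idx ? -1 : 1;
    return (x->profile_idx > y->profile_idx) - (x->profile_idx < y->profile_idx);
}

void sort_by_city(struct profile_city *v, size_t n)
{
    assert(v != NULL);
    qsort(v, n, sizeof *v, cmp_by_city);
}

/* Binary search for the first entry with city_idx >= city (a lower bound). */
static size_t first_for_city(const struct profile_city *v, size_t n, size_t city)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (v[mid].city_idx < city)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Copies up to max visitor profile indices for one city; returns how many. */
size_t visitors_of(const struct profile_city *v, size_t n, size_t city,
                   size_t *out, size_t max)
{
    assert(v != NULL && out != NULL);
    size_t count = 0;
    for (size_t i = first_for_city(v, n, city);
         i < n && v[i].city_idx == city && count < max; i++)
        out[count++] = v[i].profile_idx;
    return count;
}
```

Answering "which cities does Bob visit?" works the same way with a second copy (or an array of pointers) sorted by profile_idx first.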
Conclusion
You now have a way to look up the 5 most-visited cities (via an in-order
traversal on the city visit count index), and find out who visits those
cities, by searching for each city in the city_idx index to get the
profile_idx. Grab only unique items, and you have your answer.
Oh, and something seems wrong here: This seems like an awful lot of code for your instructor to want written in several hours!
