C Database Design, Sortable by Multiple Fields

If memory is not an issue for my particular application (entry, lookup, and sort speed being the priorities), what kind of data structure/concept would be the best option for a multi-field rankings table?
For example, say I want to create a Hall of Fame for a game, sortable by top score (independent of username), by username (with all scores by the same user grouped together before users are ranked by their highest scores), or by level reached (independent of score or name). In this example, if I order a linked list, vector, or any other sequential data structure by each player's top score, searching on the other fields -- like level and non-highest scores -- becomes iterative (i.e. walking the whole structure looking for a given stage or score range), unless I devise some other way to keep the information sorted as I enter new data.
The question is whether there is a more efficient (albeit more complicated and memory-hungry) method or database-like structure in C/C++ that is suited to this kind of multi-field sorting. Linked lists seem fine for simple score rankings, and I could even organize a hash table by hashing on a single field (player name, or level reached) to order by that field, but then the other fields take O(N) to find, and worse to sort. With just three fields, I wonder if there is a way (such as sets or secondary lists) to avoid iterating for the handful of sort orders we know we'll want beforehand.

Do it the same way databases do it: using index structures. You have your main data as a number of records (structs), perhaps ordered according to one of your sorting criteria. Then you have index structures, each one ordered according to one of your other sorting criteria, but these index structures don't contain copies of all the data, just pointers to the main data records. (Think "index" like the index in a book, with page numbers "pointing" into the main data body.)
Using an ordered linked list for your index structures will give you a fast and simple way to go through the records in order, but it will be slow when you need to search for a given value, and similarly slow when inserting new data.
Hash tables will have fast search and insertion, but (with normal hash tables) won't help you with ordering at all.
So I suggest some sort of tree structure. Balanced binary trees (look for AVL trees) work well in main memory.
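To make the index idea concrete, here is a minimal sketch (the struct fields, array sizes, and comparators are invented for illustration): one array owns the records, and each extra sort order is just an array of pointers sorted with its own comparison function.

#include <stdlib.h>

struct entry {
    char name[32];
    int score;
    int level;
};

struct entry records[1000];      /* the data itself, in insertion order */
struct entry *by_score[1000];    /* index: pointers ordered by score    */
struct entry *by_level[1000];    /* index: pointers ordered by level    */
size_t nrecords = 0;

static int cmp_score(const void *a, const void *b) {
    const struct entry *x = *(struct entry * const *)a;
    const struct entry *y = *(struct entry * const *)b;
    return (y->score > x->score) - (y->score < x->score);   /* descending */
}

static int cmp_level(const void *a, const void *b) {
    const struct entry *x = *(struct entry * const *)a;
    const struct entry *y = *(struct entry * const *)b;
    return (y->level > x->level) - (y->level < x->level);   /* descending */
}

/* rebuild both indexes; each slot points into records[], nothing is copied */
void reindex(void) {
    for (size_t i = 0; i < nrecords; i++)
        by_score[i] = by_level[i] = &records[i];
    qsort(by_score, nrecords, sizeof *by_score, cmp_score);
    qsort(by_level, nrecords, sizeof *by_level, cmp_level);
}

Re-sorting a whole index on every insert is wasteful; in practice you'd insert into position, or use the tree structures suggested above, but the pointer-index principle is the same.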
But don't forget the option to use an actual database! Database managers such as MySQL and SQLite can be linked with your program, without a separate server, and let you do all your sorting and indexing very easily, using SQL embedded in your program. It will probably execute a bit slower than if you hand-craft your own main-memory data structures, or if you use main-memory data structures from a library, but it might be easier to code, and you won't need to write separate code to save the data on disk.
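For instance, with SQLite embedded in a C program (a hedged sketch; the table and column names are invented), the whole multi-index problem collapses into a few SQL statements:

#include <stdio.h>
#include <sqlite3.h>

static int print_row(void *unused, int ncols, char **vals, char **names) {
    for (int i = 0; i < ncols; i++)
        printf("%s=%s ", names[i], vals[i] ? vals[i] : "NULL");
    printf("\n");
    return 0;
}

int main(void) {
    sqlite3 *db;
    char *err = NULL;
    if (sqlite3_open(":memory:", &db) != SQLITE_OK) return 1;   /* or a file path */

    sqlite3_exec(db,
        "CREATE TABLE hof (name TEXT, score INT, level INT);"
        "CREATE INDEX idx_score ON hof(score);"
        "CREATE INDEX idx_level ON hof(level);"
        "INSERT INTO hof VALUES ('alice', 9000, 12), ('bob', 7500, 15);",
        NULL, NULL, &err);

    /* any of the three sort orders is just an ORDER BY away */
    sqlite3_exec(db, "SELECT * FROM hof ORDER BY score DESC;", print_row, NULL, &err);

    sqlite3_close(db);
    return 0;
}

SQLite maintains the secondary indexes for you on every INSERT, which is exactly the bookkeeping you would otherwise hand-write; compile with -lsqlite3.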

So, you already know how to store your data and keep it sorted with respect to a single field. Assuming the field values of a single entry are independent, the only way you'll get what you want is to keep three different lists (using the data structure of your choice), each sorted on a different field. You'll pay three lists' worth of pointer overhead, but the records themselves need not be duplicated.
As for what data structure each of the lists should be, a binary max-heap is effective. Insertion is O(lg N), peeking at the top entry is O(1), and extracting all entries in order costs O(N lg N), since each extraction re-heapifies in O(lg N). If in some of these list copies the entries need to be sub-sorted by another field, just account for that in the comparison function.
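For example, a comparison function for the "grouped by user" copy might compare usernames first and fall back to score (a sketch reusing the hypothetical struct entry from the earlier example; here users are grouped alphabetically for simplicity, and ranking the groups by each user's best score would need an extra pass):

#include <string.h>

/* assumes the hypothetical struct entry { char name[32]; int score; int level; }
   and the pointer-array indexes from the earlier sketch */
static int cmp_user_then_score(const void *a, const void *b) {
    const struct entry *x = *(struct entry * const *)a;
    const struct entry *y = *(struct entry * const *)b;
    int byname = strcmp(x->name, y->name);                  /* group same user together */
    if (byname != 0) return byname;
    return (y->score > x->score) - (y->score < x->score);   /* then score, descending */
}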

Related

Which data structure I should use if the data is mostly sorted?

I have a huge amount of data (mainly of type long long) which is mostly sorted (the data is spread across different files, and within each file it is sorted). I need to dump this data into a single file in sorted order. Which data structure should I use? I am thinking about a BST.
Is there any other data structure I should use that can give me optimum performance?
Using any additional data structure won't help. Since most of your data is already sorted and you just need to fix the occasional value, read it into a simple array and use insertion sort.
Insertion sort runs in O(n) on mostly presorted data.
However, this depends on whether you can hold a large enough array in memory, given your input size.
Update:
I wasn't very clear on your definition of "mostly sorted". Generally it means only a few elements are out of their precise sorted positions.
However, as you state further that the data is in different files, where each file is individually sorted, it may be a good candidate for the merge subroutine of merge sort.
Note that the merge routine merges two already-sorted arrays. If you have, say, 10 files, each of which is individually sorted for sure, then using the merge routine takes only O(n).
However, if there are even a few instances where a single file is not perfectly sorted (on its own), you need to use insertion sort instead.
Update 2:
The OP says he cannot use an array because he cannot know the number of records in advance. A plain linked list is out of the question, since it never competes with arrays (sequential versus random access time) in time complexity.
As pointed out in the comments, a linked list IS a good idea if the files are individually sorted and all you need to run over them is the merge procedure.
Dynamically allocated arrays are best if he can predict the size at some point. Since the c++ tag was used (and only removed later), going for a vector would be a good idea, since it can resize comfortably.
Otherwise, one option might be heap sort, since it calls heapify first, i.e. builds a heap (which can dynamically accommodate as many elements as needed), and still produces O(n log n) complexity. This is still better than trying to use a linked list.
Perhaps you don't need a data structure at all.
If the files are already sorted, you can use the merge part of merge sort, which is O(n), or more generally O(n*log k), where k is the number of files.
How many files do you have to merge?
If it's only a few (on the order of a dozen or so) and each individual file is fully sorted, then you shouldn't need to build any sort of complex data structure at all: just open all the input files, read the next record from each file, compare, write the smallest to the destination, then replace that record from the appropriate file.
If each file is not fully sorted or if there are too many files to open at once, then yes, you will need to build an intermediate data structure in memory. I'd recommend a self-balancing tree, but since the data are already mostly sorted, you'll be re-balancing on almost every insert. A heap may work better for your purposes.
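Here is a minimal sketch of that "open them all, repeatedly emit the smallest head" loop, assuming each input file holds whitespace-separated long long values already in ascending order (the linear scan over the k heads is fine for a dozen files; switch to a min-heap once k grows):

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int k = argc - 1;                            /* one sorted file per argument */
    FILE **in    = malloc(k * sizeof *in);
    long long *head = malloc(k * sizeof *head);  /* current front of each file   */
    int *live    = malloc(k * sizeof *live);     /* file still has data?         */

    for (int i = 0; i < k; i++) {
        in[i] = fopen(argv[i + 1], "r");
        live[i] = in[i] && fscanf(in[i], "%lld", &head[i]) == 1;
    }
    for (;;) {
        int min = -1;
        for (int i = 0; i < k; i++)              /* find the smallest live head */
            if (live[i] && (min < 0 || head[i] < head[min])) min = i;
        if (min < 0) break;                      /* all files exhausted */
        printf("%lld\n", head[min]);
        live[min] = fscanf(in[min], "%lld", &head[min]) == 1;
    }
    for (int i = 0; i < k; i++) if (in[i]) fclose(in[i]);
    return 0;
}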
Best Sorting Algorithm:
Insertion sort can be used efficiently for nearly sorted data (O(n) time complexity).
Best data structure:
A linked list is a good choice of data structure if you are sorting with insertion sort.
Reason for using a linked list:
Removing and inserting elements is faster when the elements are stored as a linked list, since no neighboring elements need to be shifted.
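An array version for concreteness (a sketch; the linked-list variant trades the element shifting below for an O(n) walk to find each insertion point):

#include <stddef.h>

/* insertion sort: O(n) when the input is already nearly sorted */
void insertion_sort(long long *a, size_t n) {
    for (size_t i = 1; i < n; i++) {
        long long key = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > key) {   /* shift larger elements right */
            a[j] = a[j - 1];
            j--;
        }
        a[j] = key;
    }
}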

How to remember multiple indexes into a buffer, to later access them for modification one by one, keeping optimization in mind

I have a scenario where I have to set a few records' field values to a constant, and then later access those records one by one, sequentially.
The records can be random records.
I don't want to use a linked list, as it will be costly, and I don't want to traverse the whole buffer.
Please give me some ideas on how to do that.
When you say "set a few records' field values to a constant", is this like a key to the record? And then "later access them one by one" - is this to recall them with some key? "One by one, sequentially" and "don't want to traverse the whole buffer" seem to conflict, as sequential access sounds a lot like traversal.
But I digress. If you in fact do have a key (and it's a number), you could use some sort of Hash Table to organize your records. One basic implementation might be an array of linked lists, where you mod the key down into the array's range, then add it to the list there. This might increase performance assuming you have a good distribution of keys (your records spread across the array well).
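A bare-bones sketch of that array-of-linked-lists idea (the bucket count and record payload are placeholders):

#include <stdlib.h>

#define NBUCKETS 1024                 /* illustrative size */

struct node {                         /* one entry per keyed record */
    unsigned key;
    /* ... record fields, or a pointer into the buffer ... */
    struct node *next;                /* chain for colliding keys */
};

struct node *table[NBUCKETS];         /* the hash table itself */

void put(unsigned key) {
    struct node *n = malloc(sizeof *n);
    n->key = key;
    n->next = table[key % NBUCKETS];  /* mod the key into range, push onto chain */
    table[key % NBUCKETS] = n;
}

struct node *get(unsigned key) {
    for (struct node *n = table[key % NBUCKETS]; n; n = n->next)
        if (n->key == key) return n;
    return NULL;                      /* not found */
}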
Another data structure to look into might be a B-Tree or a binary search tree, which can access nodes in logarithmic time.
However, overall I agree with the commenters that over-optimizing is usually not a good idea.

File format needed for inverted indexing

I have been working on inverted indexing, which indexes a document collection, stores each term with its information, and also stores its references in a postings file (document id, location, etc.).
Currently I store this in .txt format, which requires string matching against that .txt file for each and every query; this takes more time and is also more complex.
Now I want to store that information in a file with a linked-list-style data structure. Is that possible for this type of scenario? (I am using PHP for the indexing.)
The point of an inverted index is to allow for extremely fast access to the list of occurrences (the postings list) for any given term. If you want to implement it using simple, readily-available data structures, then the best you can probably do is
Use a hash to store the mapping from terms to postings lists
Store each postings list as one contiguous block of sorted integers (i.e. something like ArrayList in Java or std::vector in C++). Do not use a linked list, because it wastes a huge amount of space on pointers.
A more proper (and more sophisticated) implementation would take into account:
That postings lists can get very large, so you would have to break it up into multiple chunks, each stored as one continuous block
That postings lists can and should be compressed
Detailed descriptions of these techniques are found in the classical book Managing Gigabytes.
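Even though the question mentions PHP, the layout is easiest to see in a C sketch (the growable sorted block below stands in for ArrayList/std::vector; the names are invented):

#include <stdlib.h>

/* one postings list: a contiguous, growable block of sorted doc ids */
struct postings {
    int *doc_ids;        /* contiguous block, kept in ascending order */
    size_t len, cap;
};

/* append a doc id; ids arrive in increasing order during indexing,
   so appending keeps the block sorted */
void postings_add(struct postings *p, int doc_id) {
    if (p->len == p->cap) {
        p->cap = p->cap ? p->cap * 2 : 16;                   /* amortized O(1) growth */
        p->doc_ids = realloc(p->doc_ids, p->cap * sizeof *p->doc_ids);
    }
    p->doc_ids[p->len++] = doc_id;
}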

How to maintain an ordered table with Core Data (or SQL) with insertions/deletions?

This question is in the context of Core Data, but if I am not mistaken, it applies equally well to a more general SQL case.
I want to maintain an ordered table using Core Data, with the possibility for the user to:
reorder rows
insert new lines anywhere
delete any existing line
What's the best data model to do that? I can see two ways:
1) Model it as an array: I add an int position property to my entity
2) Model it as a linked list: I add two one-to-one relations, next and previous from my entity to itself
1) makes it easy to sort, but painful to insert or delete as you then have to update the position of all objects that come after
2) makes it easy to insert or delete, but very difficult to sort. In fact, I don't think I know how to express a Sort Descriptor (SQL ORDER BY clause) for that case.
Now I can imagine a variation on 1):
3) add an int ordering property to the entity, but instead of having it count one-by-one, have it count 100 by 100 (for example). Then inserting is as simple as finding any number between the ordering of the previous and next existing objects. The expensive renumbering only has to occur when the 100 holes have been filled. Making that property a float rather than an int makes it even better: it's almost always possible to find a new float midway between two floats.
Am I on the right track with solution 3), or is there something smarter?
If the ordering is arbitrary i.e. not inherent in the data being modeled, then you have no choice but to add an attribute or relationship to maintain the order.
I would advise the linked list, since it is the easiest to maintain. I'm not sure what you mean by a linked list being difficult to sort, since you most likely won't be sorting on an arbitrary order anyway. Instead, you will just fetch the topmost instance and walk your way down.
Ordering by a divisible float attribute is a good idea. You can create a near-infinite number of intermediate indexes just by subtracting the lower existing index from the higher existing index, dividing the result by two, and then adding that result to the lower index.
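In code, that midpoint rule is just the following (a sketch; repeated halving eventually exhausts floating-point precision, at which point you renumber):

/* order key halfway between two neighboring rows' keys */
double order_between(double lower, double upper) {
    return lower + (upper - lower) / 2.0;
}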
You can also combine a divisible index with a linked list if you need an ordering for tables or the like. The linked list would make it easy to find existing indexes, and the divisible index would make it easy to sort if you needed to.
Core Data resists this kind of ordering because it is usually unnecessary. You don't want to add something to the data model unless it is necessary to simulate the real-world object, event or condition that the model describes. Usually, ordering/sorting is not inherent to the model but merely needed by the UI/view. In that case, you should put the sorting logic in the controller between the model and the view.
Think carefully before adding ordering to model when you may not need it.
Since iOS 5 you can (and should) use NSOrderedSet and its mutable subclass; see the Core Data Release Notes for OS X v10.7 and iOS 5.0.
See the accepted answer at How can I preserve an ordered list in Core Data.

Best data structure in C for these two situations?

I kind of need to decide on this to see if I can achieve it in the couple of hours before my school project's deadline, but I don't understand much about data structures and I need suggestions...
There are two things I need to do; they will probably use different data structures.
1. I need a data structure to hold profile records. The profiles must be searchable by name and by social security number. The SSN is unique, so I can probably use that to my advantage? I suppose a hash map is the best bet here? But how do I use the SSN in a hash map to my advantage when looking for a specific profile? A basic, easy-to-understand explanation would be much appreciated.
2. I need a data structure to hold records about cities. I need to know which cities have the most visitors, which have the fewest, and which clients (whose profiles are pulled from the data structure in #1) visit a specific city.
This is the third data structure I need for my project, and it's the one where I have no idea where to begin. Suggestions as to which type of data structure to use are appreciated, if possible with examples of how to hold the data described above.
As a note:
The first data structure is already done (I talked about it in a previous question). The second one is #1 above, and although the other group members are taking care of it, I just need to know whether our approach is the "best" one. The third one is #2, the one I need the most help with.
The right answer lies anywhere between a balanced search tree and an array.
The situation you have mentioned here and else-thread misses out on a very important point: the size of the data you are handling. You choose your data structure and algorithm(s) depending on the amount of data you have to handle. It is important that you are able to justify your choice(s). Using a less efficient general algorithm is not always bad. Being able to back up your choices (e.g. choosing bubble sort because the data size is always < 10) shows a) greater command of the field and b) pragmatism -- both of which are in short supply.
For searchability across multiple keys, store the data in any convenient form, and provide fast lookup indexes on the key(s).
This could be as simple as keeping the data in an array (or linked list, or ...) in the order of creation, and keeping a bunch of {hashtables|sorted arrays|btrees} of maps (key, data*) for all the interesting keys (SSN, name, ...).
If you had more time, you could even work out how to not have a different struct for each different map...
I think this solution probably applies to both your problems.
Good luck.
For clarity:
First we have a simple array of student records
typedef struct student_s {
    char ssn[10];     // nul terminated so we can use str* functions
    char name[100];
    float GPA;
    ...
} student;

student slist[MAX_STUDENTS];
which is filled in as you go. It has no order, so search on any key is a linear time operation. Not a problem for 1,000 entries, but maybe a problem for 10,000, and certainly a problem for 1 million. See dirkgently's comments.
If we want to be able to search fast we need another layer of structure. I build a map between a key and the main data structure like this:
typedef struct str_map {
    char *key;        // points at the field we index on (ssn or name)
    student *data;    // back-pointer into the main slist array
} smap;

smap skey[MAX_STUDENTS];
and maintain skey sorted on the key, so that I can do fast lookups. (The catch is that an array is a hassle to keep sorted, so in practice we would probably prefer a tree or a hashmap.)
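With the plain-array form, the C standard library already provides the machinery (a sketch reusing the smap and student definitions above; the element count and probe key are hypothetical):

#include <stdlib.h>
#include <string.h>

static int smap_cmp(const void *a, const void *b) {
    return strcmp(((const smap *)a)->key, ((const smap *)b)->key);
}

/* sort once after loading (or re-sort after a batch of inserts) */
void build_index(smap *skey, size_t n) {
    qsort(skey, n, sizeof skey[0], smap_cmp);
}

/* O(log N) lookup on the sorted index, e.g. by SSN or by name */
student *find_by_key(smap *skey, size_t n, const char *wanted) {
    smap probe = { (char *)wanted, NULL };
    smap *hit = bsearch(&probe, skey, n, sizeof skey[0], smap_cmp);
    return hit ? hit->data : NULL;
}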
This complexity isn't needed (and should certainly be avoided) if you will only want fast searches on a single field.
Outside of a homework question, you'd use a relational database for this. But that probably doesn't help you…

The first thing you need to figure out, as others have already pointed out, is how much data you're handling. An O(n) brute-force search is plenty fast as long as n is small. Since a trivial amount of data would make this a trivial problem (put it in an array and just brute-force search it), I'm going to assume the amount of data is large.
Storing Cities
First, your search requirements appear to require the data sorted in multiple ways:
Some city unique identifier (name?)
Number of visitors
This actually isn't too hard to satisfy. (1) is easiest: store the cities in some array. The array index becomes the unique identifier (assumption: we aren't deleting cities, or if we do delete cities we can just leave that array spot unused, wasting some memory; adding is OK).
Now, we also need to be able to find the most & fewest visits. Assuming modifications may happen (e.g., adding cities, changing the number of visitors, etc.) and borrowing from relational databases, I'd suggest creating an index using some form of balanced tree. Databases would commonly use a B-tree, but different ones may work for you: check Wikipedia's article on trees. In each tree node, I'd just keep a pointer (or array index) to the city data. No reason to make another copy!
I recommend a tree over a hash for one simple reason: you can very easily do an in-order or reverse-order traversal to find the top or bottom N items. A hash can't do that.
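A sketch of that traversal (the city struct and field names are invented): a reverse in-order walk of a BST keyed on visitor count yields the top N directly.

#include <stdio.h>

struct city { char name[64]; int visitors; };

struct tnode {                        /* index node: no copy, just a pointer */
    struct city *c;
    struct tnode *left, *right;       /* BST keyed on c->visitors */
};

/* reverse in-order: largest visitor counts first; stop after n cities */
void print_top_n(struct tnode *t, int *n) {
    if (!t || *n <= 0) return;
    print_top_n(t->right, n);
    if (*n > 0) {
        printf("%s: %d\n", t->c->name, t->c->visitors);
        (*n)--;
    }
    print_top_n(t->left, n);
}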
Of course, if modifications may not happen, just use another array (of pointers to the items; once again, don't duplicate them).
Linking Cities to Profiles
How to do this depends on how you have to query the data, and what form it can take. The most general case is that each profile can be associated with multiple cities and each city can be associated with multiple profiles. Further, we want to be able to query efficiently from either direction: ask both "who visits Phoenix?" and "which cities does Bob visit?".

Shamelessly lifting from databases again, I'd create another data structure, a fairly simple one along the lines of:
struct profile_city {
    /* btree pointers here */
    size_t profile_idx;   /* or use a pointer */
    size_t city_idx;      /* for both indices */
};
So, to say Bob (profile 4) has visited Phoenix (city 2) you'd have profile_idx = 4 and city_idx = 2. To say Bob has visited Vegas (city 1) as well, you'd add another one, so you'd have two of them for Bob.

Now, you have a choice: you can store these either in a tree or a hash. Personally, I'd go with the tree, since that code is already written. But a hash would be O(1) instead of O(log n) for lookups.

Also, just like we did for the city visit count, create an index on city_idx so the lookup can be done from that side too.
Conclusion
You now have a way to look up the 5 most-visited cities (via an in-order traversal of the city visit count index), and to find out who visits those cities, by searching for each city in the city_idx index to get the profile_idx. Grab only the unique items, and you have your answer.
Oh, and something seems wrong here: this seems like an awful lot of code for your instructor to want written in several hours!
