Find a subset of strings from a list in C

I have a list of 1M entries and I want to exclude a subset of 20,000 of these entries (the two lists are in different order but have the same key (string)). Can anyone suggest a quick search algorithm in C to do this?
I don't want to have to read each of the 20K IDs and scan through the list of 1M every time. Any suggestions would be most helpful.
Thank you.

What you want to use is a hash set. A hash set is a special case of a hash table that basically records if an element exists in the set or not, in constant time. So, what you would do is insert your 20k IDs into the hash set, and then run through the 1 million strings and see if they exist in the hash set.
For your reference, here's an implementation of a hash set in C: https://github.com/avsej/hashset.c
Your running time would be O(n), since each membership check against the 1M strings takes constant time.
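For illustration, here is a minimal sketch of that approach in plain C (not the library linked above): a small chained hash set holding the 20K keys, followed by a single pass over the 1M entries. The names (hs_add, hs_contains, BUCKETS) are made up for the example.

#include <stdlib.h>
#include <string.h>

#define BUCKETS 32768                  /* power of two, comfortably > 20K */

struct hs_node {
    const char *key;                   /* assumes the key strings stay alive */
    struct hs_node *next;
};

static struct hs_node *table[BUCKETS];

static unsigned long hs_hash(const char *s)        /* djb2-style string hash */
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

static void hs_add(const char *key)
{
    unsigned long i = hs_hash(key) & (BUCKETS - 1);
    struct hs_node *n = malloc(sizeof *n);
    n->key = key;
    n->next = table[i];
    table[i] = n;
}

static int hs_contains(const char *key)
{
    const struct hs_node *n = table[hs_hash(key) & (BUCKETS - 1)];
    for (; n != NULL; n = n->next)
        if (strcmp(n->key, key) == 0)
            return 1;
    return 0;
}

/* Usage: hs_add() each of the 20K IDs once, then keep an entry from the
 * 1M list only if !hs_contains(its_key). */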

Sort both lists first. Then you can traverse them together, advancing the pointer in the list that is behind the pointer in the other.
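A rough sketch of that two-pointer walk, assuming both key arrays have already been sorted (e.g. with qsort and strcmp ordering):

#include <string.h>

/* Marks every entry of big[] (size nbig) whose key also appears in the
 * sorted exclusion list small[] (size nsmall); keep[i] is set to 0 for
 * excluded entries. */
static void mark_excluded(char **big, size_t nbig,
                          char **small, size_t nsmall, int *keep)
{
    size_t i = 0, j = 0;
    while (i < nbig) {
        int cmp = (j < nsmall) ? strcmp(big[i], small[j]) : -1;
        if (cmp < 0)
            keep[i++] = 1;            /* key not in the exclusion list */
        else if (cmp == 0)
            keep[i++] = 0;            /* key matches: exclude this entry */
        else
            j++;                      /* exclusion pointer is behind: advance it */
    }
}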
Do you have to use C? This sounds like a job for Perl.

Put the 20,000 keys into a hash table before you start searching the list. Then, for each item's key in the list, if that key is found in the hash table, exclude the item from the list.

Related

Which data structure should I use if the data is mostly sorted?

I have a huge amount of data (mainly of type long long) which is mostly sorted (the data is spread across different files, and within each file the data is in sorted order). I need to dump this data into a single file in sorted order. Which data structure should I use? I am thinking about a BST.
Is there any other data structure I should use that can give me optimum performance?
Thanks
Arpit
Using any additional data structure won't help. Since most of your data is already sorted and you just need to fix the occasional value, use a simple array to extract data, then use Insertion Sort.
Insertion sort runs in O(n) for mostly presorted data.
However, this depends on whether you can hold a large enough array in memory for your input size.
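A minimal sketch of that insertion sort over an array of long long values; on mostly sorted input the inner loop rarely runs, so a pass is close to linear:

#include <stddef.h>

static void insertion_sort(long long *a, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        long long key = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > key) {   /* shift larger elements right */
            a[j] = a[j - 1];
            j--;
        }
        a[j] = key;
    }
}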
Update:
I wasn't very clear on your definition of "mostly sorted". Generally it means only a few elements are not in their precise sorted position.
However, as you state further, 'data is in different files where each file is individually sorted', so it may be a good candidate for the merge subroutine of merge sort.
Note that the Merge routine merges two already-sorted arrays. If you have, say, 10 files, each of them individually sorted for sure, then using the Merge routine would only take O(n).
However, if there are even a few instances where a single file is not perfectly sorted (on its own), you need to use Insertion Sort.
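For reference, the Merge routine mentioned above is a single O(n) pass over two individually sorted arrays:

#include <stddef.h>

static void merge(const long long *a, size_t na,
                  const long long *b, size_t nb, long long *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];      /* copy whatever remains */
    while (j < nb) out[k++] = b[j++];
}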
Update 2:
The OP says he cannot use an array because he cannot know the number of records in advance. Using a simple linked list is out of the question, since it never competes with arrays (sequential vs. random access time) in time complexity.
As pointed out in the comments, using a linked list is a good idea IF the files are individually sorted and all you need to run on them is the merge procedure.
Dynamically allocated arrays are best if he can predict the size at some point. Since the c++ tag was used (it was only removed later), going for a vector would be a good idea, since it can resize comfortably.
Otherwise, one option might be Heap Sort, since it would call heapify first, i.e. build a heap (so it can dynamically accommodate as many elements as needed), and it still produces O(n log n) complexity. This is still better than trying to use a linked list.
Perhaps you don't need a data structure at all.
If the files are already sorted, you can use the merge part of merge sort, which is O(n), or more generally O(n*log k), where k is the number of files.
How many files do you have to merge?
If it's only a few (on the order of a dozen or so) and each individual file is fully sorted, then you shouldn't need to build any sort of complex data structure at all: just open all the input files, read the next record from each file, compare, write the smallest to the destination, then replace that record from the appropriate file.
If each file is not fully sorted or if there are too many files to open at once, then yes, you will need to build an intermediate data structure in memory. I'd recommend a self-balancing tree, but since the data are already mostly sorted, you'll be re-balancing on almost every insert. A heap may work better for your purposes.
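A sketch of the simple multi-file merge from the first case above, assuming a handful of files with one long long record per line (the text record format and the cap of 64 files are just for the example):

#include <stdio.h>

static void merge_files(FILE **in, int nfiles, FILE *out)
{
    long long cur[64];                     /* current record from each file */
    int live[64];                          /* 1 while a file still has records */

    for (int i = 0; i < nfiles; i++)
        live[i] = (fscanf(in[i], "%lld", &cur[i]) == 1);

    for (;;) {
        int best = -1;
        for (int i = 0; i < nfiles; i++)   /* pick the smallest live record */
            if (live[i] && (best < 0 || cur[i] < cur[best]))
                best = i;
        if (best < 0)
            break;                         /* every file is exhausted */
        fprintf(out, "%lld\n", cur[best]);
        live[best] = (fscanf(in[best], "%lld", &cur[best]) == 1);
    }
}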
Best Sorting Algorithm:
Insertion sort can be used efficiently for nearly sorted data (O(n) time complexity).
Best data structure:
A linked list is the best choice of data structure if you are sorting it using insertion sort.
Reason for using linked list:
Removing and inserting elements can be done faster when elements are stored as a linked list.

How to remember multiple indexes in a buffer to later access them for modification one by one, keeping optimization in mind

I have a scenario where I have to set a few records' field values to a constant and then later access them one by one, sequentially.
The records can be random records.
I don't want to use a linked list, as it will be costly, and I don't want to traverse the whole buffer.
Please give me some ideas on how to do that.
When you say "set few records with field values to a constant" is this like a key to the record? And then "later access them one by one" - is this to recall them with some key? "one-by-one sequentially" and "don't want to traverse the whole buffer" seems to conflict, as sequential access sounds a lot like traversal.
But I digress. If you in fact do have a key (and it's a number), you could use some sort of Hash Table to organize your records. One basic implementation might be an array of linked lists, where you mod the key down into the array's range, then add it to the list there. This might increase performance assuming you have a good distribution of keys (your records spread across the array well).
Another data structure to look into might be a B-Tree or a binary search tree, which can access nodes in logarithmic time.
However, overall I agree with the commenters that over-optimizing is usually not a good idea.

How do databases implement skipping?

I'm writing my own little database engine. Is there any efficient way to implement a skipping function other than inspecting every leaf node of the B+tree, which would be slow with a large number of entries?
If you are using a B+tree for your indices, all the values are stored in the leaves and thus can be linked together to form an (ordered) linked list, or rather an unrolled linked list. That's the main advantage of B+ trees over plain B-trees.
That said, even if unrolled lists let you do some form of skipping, nothing prevents you from implementing skip lists on your records and using the nodes of these lists as your B-tree values.
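A minimal sketch of skipping along that linked leaf chain; the leaf layout and field names here are illustrative, not taken from any particular B+tree implementation:

#define LEAF_CAP 32

struct leaf {
    int nkeys;                    /* entries actually stored in this leaf */
    long keys[LEAF_CAP];
    void *values[LEAF_CAP];
    struct leaf *next;            /* next leaf in key order */
};

/* Skip `offset` records starting from `start`; returns the leaf holding
 * the target record and sets *idx to its slot, or returns NULL if the
 * offset runs past the end. Whole leaves are stepped over without
 * touching their individual entries. */
static struct leaf *skip(struct leaf *start, long offset, int *idx)
{
    struct leaf *l = start;
    while (l != NULL && offset >= l->nkeys) {
        offset -= l->nkeys;
        l = l->next;
    }
    if (l != NULL)
        *idx = (int)offset;
    return l;
}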
Two years later, but anyway.
You can do it Cassandra's way too: there is no built-in skipping, but you can specify the last key from the previous query, e.g.
select * from abc where key > 123 limit 100
where 123 is the last key from the previous query.

Associative array - Tree Vs HashTable

Associative arrays are usually implemented with hash tables. But recently I learned that they can also be implemented using trees. Can someone explain how to implement one using a tree?
Should we use a simple binary tree or BST?
How do we represent the keys in the tree? Do we calculate a hash function on the key and insert the (key, value) pair based on the integer hash value?
If we assume that we calculate a hash value and insert into the tree, why do people say trees are ordered? What order does it preserve, and how? What does this ordering buy us?
Lastly, one general question about the hash table. People say the search time is O(1) in a hash table. But when we say O(1), do we take into account calculating the hash value before we look up using it? If our key is a string, then we need to visit all the characters of the string to find the hash value. So, to search for an element, won't it take O(n)+O(1) time?
Hash table solutions will take the hash of the objects stored in them (it's important to note that the hash is a primitive, often an integer or long) and use that hash as an array index, storing a reference to the object at that index in the array. To solve the collision problem, indices in this array will often contain linked-list nodes that hold the actual references.
Checking if an object is in the array is as simple as hashing it, looking in the index referred to by the hash, and comparing equality with each object that's there, an operation that runs in amortized constant time, because the hash table can grow larger if collisions are beginning to accumulate.
Now for your questions.
Question 1
Should we use a simple binary tree or a BST?
BST stands for Binary Search Tree. The difference between it and a "simple" binary tree is that it is strictly ordered. Optimal implementations will also attempt to keep the tree balanced. Searching through a balanced, ordered tree is much easier than searching an unordered one, because you can make assumptions about the location of the target element that cannot be made in an unordered tree. You should use a BST if practical.
Question 2
How do we represent the keys in the tree? Do we calculate a hashfunction on the key and insert the (key,value) based on the integer hash value?
That would be what a hash table does. Even hash tables must store the keys by value for equality checking, due to the possibility of collisions. The BST would not store hashes at all, because in anything but the simplest circumstances, determining sort order from the hash values would be very difficult. You would use (key, value) pairs without any hashing.
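A minimal sketch of such a tree-based associative array in C: an unbalanced BST keyed by the string itself (ordered with strcmp, no hashing). A production version would use a self-balancing tree such as an AVL or red-black tree.

#include <stdlib.h>
#include <string.h>

struct tnode {
    char *key;
    int value;
    struct tnode *left, *right;
};

static struct tnode *bst_insert(struct tnode *t, const char *key, int value)
{
    if (t == NULL) {
        t = malloc(sizeof *t);
        t->key = strdup(key);              /* store the key itself, not a hash */
        t->value = value;
        t->left = t->right = NULL;
        return t;
    }
    int cmp = strcmp(key, t->key);
    if (cmp < 0)
        t->left = bst_insert(t->left, key, value);
    else if (cmp > 0)
        t->right = bst_insert(t->right, key, value);
    else
        t->value = value;                  /* existing key: overwrite */
    return t;
}

static int *bst_find(struct tnode *t, const char *key)
{
    while (t != NULL) {
        int cmp = strcmp(key, t->key);
        if (cmp == 0)
            return &t->value;
        t = (cmp < 0) ? t->left : t->right;
    }
    return NULL;
}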
Question 3
If we assume that we calculate a hash value and insert into the tree, why do people say trees are ordered? What order does it preserve, and how? What does this ordering buy us?
As you noticed, it doesn't work that way. So we store the value of the key instead of the hash. Ordering the tree (as opposed to leaving it unordered) gives us a huge advantage when searching (O(log N) instead of O(N)). In an unordered structure, a specific element must be searched for exhaustively, because it could reside anywhere in the structure. In an ordered structure, it will only exist above keys whose values are less than its own, and below keys whose values are greater. Consider a textbook: would you have an easier or harder time jumping to a specific page if the pages were in a random order?
Question 4
If our key is a string, then we need to visit all the characters of the string to find the hash value. So, to search for an element, won't it take O(n)+O(1) time?
I've asked myself the same question before, and it actually depends on the hash function. The worst-case time for an "amortized constant" lookup is:
O(H) + O(C)
O(H) is the worst-case complexity of the hash function. A smart hash function might look at only the first few dozen characters of a string, or the last few, or some in the middle, or something. It's not necessary for the hash function to be secure, it just has to be deterministic (i.e. return the same value for identical objects). Regardless of how good your function is, you will get collisions anyway, so if accepting a slightly more collision-prone function saves you a huge amount of extra work, it's often worth it.
O(C) is the worst-case complexity of comparing the keys. An unsuccessful lookup knows that there are no matches because no entry in the table exists for its hash value. A successful lookup, however, must always compare the provided key with the ones in the table, otherwise the possibility exists that the key is not actually a match but merely a collision. The worst possible case is when there are multiple collisions for one hash value: the provided key must be compared with all of the stored colliding keys one after another until a match is found or all comparisons fail. This gotcha is why it's only amortized constant time instead of just constant time: as the table grows, the chances of having collisions decrease, so this happens less frequently, and the average time required to search some collection of elements will always tend toward a constant.
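As a small illustration of the O(H) point, a hash function can cap the number of characters it examines, trading a slightly higher collision rate for constant-time hashing of very long keys (the cap of 32 is arbitrary):

static unsigned long bounded_hash(const char *s)
{
    unsigned long h = 5381;                       /* djb2-style accumulator */
    for (int i = 0; i < 32 && s[i] != '\0'; i++)  /* examine at most 32 chars */
        h = h * 33 + (unsigned char)s[i];
    return h;
}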
When people say a hash is O(1) (or, if you like, O(1 + 0·n)) and a tree is O(log n), n is the size of the collection.
You are right that it takes O(m) additional work (where m is the length of the currently examined string), but usually m has some upper bound and n tends to get large, so m can be considered constant for both implementations.
And m is, in absolute terms, more influential in the tree implementation, because you compare against the key in every node you visit, whereas with a hash you only compute the hash and compare the whole key with the values in the one bucket determined by the hash (with a good hash function and a big enough table, there should usually be only one value there).

C Database Design, Sortable by Multiple Fields

If memory is not an issue for my particular application (entry, lookup, and sort speed being the priorities), what kind of data structure/concept would be the best option for a multi-field rankings table?
For example, let's say I want to create a Hall of Fame for a game, sortable by top score (independent of username), username (with all scores by the same user placed together before ranking users by their highest scores), or level reached (independent of score or name). In this example, if I order a linked list, vector, or any other sequential data structure by the top score of each player, it makes searching for the other fields -- like level and non-highest scores -- more iterative (i.e. iterate across all looking for the stage, or looking for a specific score-range), unless I conceive some other way to store the information sorted when I enter new data.
The question is whether there is a more efficient (albeit complicated and memory-consumptive) method or database structure in C/C++ that might be primed for this kind of multi-field sort. Linked lists seem fine for simple score rankings, and I could even organize a hashtable by hashing on a single field (player name, or level reached) to sort by a single field, but then the other fields take O(N) to find, worse to sort. With just three fields, I wonder if there is a way (like sets or secondary lists) to prevent iterating in certain pre-desired sorts that we know beforehand.
Do it the same way databases do it: using index structures. You have your main data as a number of records (structs), perhaps ordered according to one of your sorting criteria. Then you have index structures, each one ordered according to one of your other sorting criteria, but these index structures don't contain copies of all the data, just pointers to the main data records. (Think "index" like the index in a book, with page numbers "pointing" into the main data body.)
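A bare-bones sketch of that idea, using sorted arrays of pointers as the index structures (the hall_entry fields come from the example in the question; qsort with a different comparator per index stands in for whatever index structure you end up choosing):

#include <stdlib.h>
#include <string.h>

struct hall_entry {
    char name[32];
    int  score;
    int  level;
};

static int by_score(const void *a, const void *b)
{
    const struct hall_entry *x = *(const struct hall_entry *const *)a;
    const struct hall_entry *y = *(const struct hall_entry *const *)b;
    return (y->score > x->score) - (y->score < x->score);   /* highest first */
}

static int by_name(const void *a, const void *b)
{
    const struct hall_entry *x = *(const struct hall_entry *const *)a;
    const struct hall_entry *y = *(const struct hall_entry *const *)b;
    return strcmp(x->name, y->name);
}

/* The records live once in `records`; each index is just an array of
 * pointers into it, sorted by its own field. */
static void build_indexes(struct hall_entry *records, size_t n,
                          struct hall_entry **score_idx,
                          struct hall_entry **name_idx)
{
    for (size_t i = 0; i < n; i++)
        score_idx[i] = name_idx[i] = &records[i];
    qsort(score_idx, n, sizeof *score_idx, by_score);
    qsort(name_idx,  n, sizeof *name_idx,  by_name);
}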
Using ordered linked lists for your index structures will give you a fast and simple way to go through the records in order, but it will be slow if you need to search for a given value, and similarly slow when inserting new data.
Hash tables will have fast search and insertion, but (with normal hash tables) won't help you with ordering at all.
So I suggest some sort of tree structure. Balanced binary trees (look for AVL trees) work well in main memory.
But don't forget the option to use an actual database! Database managers such as MySQL and SQLite can be linked with your program, without a separate server, and let you do all your sorting and indexing very easily, using SQL embedded in your program. It will probably execute a bit slower than if you hand-craft your own main-memory data structures, or if you use main-memory data structures from a library, but it might be easier to code, and you won't need to write separate code to save the data on disk.
So, you already know how to store your data and keep it sorted with respect to a single field. Assuming the values of the fields for a single entry are independent, the only way you'll be able to get what you want is to keep three different lists (using the data structure of your choice), each of which is sorted on a different field. You'll use three times the pointer memory of a single list.
As for what data structure each of the lists should be, a binary max-heap is effective: insertion is O(lg N), and the top entry can be read in O(1) (extracting all entries in order costs O(N lg N)). If in some of these list copies the entries need to be sub-sorted by another field, just account for that in the comparison function.
