What is the most efficient way to represent two-dimensional arrays in Prolog? I thought of one long list or a list of lists, but those have linear access time, which seems to be too slow for my problem. I'm not necessarily looking for a ready-made solution, but rather a concept of how it should be implemented.
You can get logarithmic-time access with AVL trees or red-black trees; see library(assoc) and library(rbtrees) in SWI and YAP.

For constant-time access, make a term with N arguments and use arg/3 for efficient access. Each of these arguments can again be a term with arity N, so you have an array with efficient read access. Using setarg/3, you can even destructively modify elements, at the cost of losing nice logical properties and much more painful debugging and testing.

In many cases, you can reformulate your algorithms to not require random access and work with lists of lists. If this is not possible, AVL or other balanced trees are often a very good option.
Related
This might be a silly question to those of you who have been in this field for a while; nevertheless, I would still appreciate your insight on the matter: why does an array need to be sorted, and in what scenario would we need to sort an array?
So far it is clear to me that the whole purpose of sorting is to organize the data in a way that minimizes search complexity and improves the overall efficiency of our program, though I would appreciate it if someone could describe a scenario in which sorting an array would be most useful. If we are searching for something specific, like a number, wouldn't sorting the array be just as demanding as simply iterating through it until we find what we are looking for? I hope that made sense.
Thanks.
This is just a general question for my coursework.
A lot of different algorithms work much faster with a sorted array, including searching, comparing, and merging arrays.
For a one-time operation, you're right: it is easier and faster to use an unsorted array. But as soon as you need to repeat the operation multiple times on the same array, it is much faster to sort the array once and then take advantage of the ordering.
Even if you are going to modify the array, you can keep it sorted; again, this improves the performance of all the other operations.
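To make the "merging" point concrete, here is a minimal sketch in C (my own illustrative code, not from any particular library) that merges two already-sorted arrays in a single linear pass, something that is only possible because of the ordering:

    #include <stdio.h>

    /* Merge two sorted int arrays a[0..na) and b[0..nb) into out[0..na+nb).
       Because both inputs are sorted, one linear pass suffices. */
    static void merge_sorted(const int *a, int na, const int *b, int nb, int *out)
    {
        int i = 0, j = 0, k = 0;
        while (i < na && j < nb)
            out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        while (i < na) out[k++] = a[i++];
        while (j < nb) out[k++] = b[j++];
    }

    int main(void)
    {
        int a[] = {1, 4, 7}, b[] = {2, 3, 9}, out[6];
        merge_sorted(a, 3, b, 3, out);
        for (int k = 0; k < 6; k++) printf("%d ", out[k]);   /* 1 2 3 4 7 9 */
        return 0;
    }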
Sorting brings useful structure to a list of values.
In raw data, reading a value tells you absolutely nothing about the other values in the list. In a sorted list, when you read a value, you know that all preceding elements are no larger and all following elements are no smaller.
So to search a raw list, you have no choice other than exhaustive comparison, whereas when searching a sorted list, comparing to the middle element tells you in which half the searched value can be found, and this drastically reduces the number of tests to be performed.
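For illustration, here is a minimal binary search sketch in C showing the "compare to the middle element" idea (my own code, assuming a sorted int array):

    #include <stddef.h>

    /* Return the index of key in the sorted array a[0..n), or -1 if absent.
       Each comparison halves the remaining range, so this is O(log n). */
    static int binary_search(const int *a, size_t n, int key)
    {
        size_t lo = 0, hi = n;          /* search the half-open range [lo, hi) */
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (a[mid] == key)
                return (int)mid;
            else if (a[mid] < key)
                lo = mid + 1;           /* key can only be in the upper half */
            else
                hi = mid;               /* key can only be in the lower half */
        }
        return -1;
    }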
When the list is given in sorted order, you can benefit from this. When it is given in no known order, you have to weigh whether the cost of the sort is worth the faster searches.
Sorting has algorithmic uses other than searching a list, but it is always the ordering property that is exploited.
I would like to implement a data structure that supports fast insertion and keeps the data sorted, without duplicates, after every insert.
I thought about a binomial heap, but as I understand that structure, it cannot tell during insertion whether a particular element is already in the heap. On the other hand, there is the AVL tree, which fits my case perfectly, but honestly it is rather too hard for me to implement at the moment.
So my question is: is there any way to modify the binomial heap insertion algorithm to skip duplicates? Or maybe someone could suggest another structure?
Greetings :)
In C++, there is std::set. It is internally implemented as a red-black tree, so the data is kept sorted as you insert it. You can have a look at that for reference.
A good data structure for this is the red-black tree, which offers O(log n) insertion. You said you would like to implement such a data structure yourself. A good explanation of how to implement one is given here, along with a usable open-source library.
If you're okay with using a library, you may take a look at libavl here.
The library implements some other varieties of binary trees as well.
Skip lists are also a possibility if you are concerned with thread safety. Balanced binary search trees perform more poorly than a skip list in that case, since skip lists require no rebalancing, and skip lists are also inherently sorted like a BST. There is a disadvantage in the amount of memory required (since multiple linked lists are technically used), but theoretically speaking it's a good fit.
You can read more about skip lists in this tutorial.
If you have a truly large number of elements, you might also consider just using a doubly linked list and sorting the list after all items are inserted. This has the benefits of easy implementation and fast insertion.
You would then need to implement a sorting algorithm. A selection sort or insertion sort would be slower but easier to implement than a mergesort, heapsort, or quicksort algorithm. On the other hand, the latter three are not terribly difficult to implement either. The only thing to be careful about is that you don't overflow the stack since those algorithms are typically implemented using recursion. You could create your own stack implementation (not difficult) and implement them iteratively, pushing and popping values onto your stack as necessary. See Iterative quicksort for an example of what I'm referring to.
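To illustrate the explicit-stack idea just mentioned, here is a rough sketch of an iterative quicksort in C. It is my own simplified version (Lomuto partition, last element as pivot), not production code; it just shows how subarray bounds can be pushed onto a small fixed stack instead of the call stack:

    #include <stddef.h>

    static void swapi(int *a, int *b) { int t = *a; *a = *b; *b = t; }

    /* Iterative quicksort using an explicit stack of subarray bounds.
       We always push the larger half and keep working on the smaller one,
       so the stack depth stays around log2(n); 64 entries is plenty. */
    static void quicksort_iter(int *a, size_t n)
    {
        struct range { size_t lo, hi; } stack[64];    /* [lo, hi) half-open */
        size_t top = 0;
        size_t lo = 0, hi = n;

        for (;;) {
            while (hi - lo > 1) {
                /* Lomuto partition around the last element as pivot. */
                int pivot = a[hi - 1];
                size_t i = lo;
                for (size_t j = lo; j < hi - 1; j++)
                    if (a[j] < pivot)
                        swapi(&a[i++], &a[j]);
                swapi(&a[i], &a[hi - 1]);             /* pivot now at index i */

                /* Push the larger side, continue with the smaller side. */
                if (i - lo > hi - (i + 1)) {
                    stack[top].lo = lo; stack[top].hi = i; top++;
                    lo = i + 1;
                } else {
                    stack[top].lo = i + 1; stack[top].hi = hi; top++;
                    hi = i;
                }
            }
            if (top == 0) break;                      /* nothing left to sort */
            top--;
            lo = stack[top].lo; hi = stack[top].hi;
        }
    }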
If you are looking for fast insertion and an easy implementation, why not a linked list (singly or doubly linked)? See the sketch below.
Insertion: push head / push tail, O(1).
Removal: pop head / pop tail, O(1).
The only "but" is that find will be O(n).
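For example, a minimal doubly linked list sketch in C (names are illustrative, error handling omitted):

    #include <stdlib.h>

    typedef struct Node {
        int value;
        struct Node *prev, *next;
    } Node;

    typedef struct List {
        Node *head, *tail;
    } List;

    /* O(1): link a new node in front of the current head. */
    static void push_head(List *l, int value)
    {
        Node *n = malloc(sizeof *n);
        n->value = value;
        n->prev = NULL;
        n->next = l->head;
        if (l->head) l->head->prev = n; else l->tail = n;
        l->head = n;
    }

    /* O(1): unlink and free the current head, returning its value. */
    static int pop_head(List *l)
    {
        Node *n = l->head;              /* caller must ensure the list is non-empty */
        int v = n->value;
        l->head = n->next;
        if (l->head) l->head->prev = NULL; else l->tail = NULL;
        free(n);
        return v;
    }

    /* O(n): a linear scan is the price of skipping any ordering structure. */
    static Node *find(const List *l, int value)
    {
        for (Node *n = l->head; n; n = n->next)
            if (n->value == value) return n;
        return NULL;
    }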
An interview question: how do you use an array to design a data structure that has O(1) get, insert, and delete?
My thought is to first pick a prime number as the size of the array and to use "mod by the array size" as a simple hash function. For each slot in the array we store a linked list to deal with collisions.
Are there any better solutions?
The approach you have mentioned is the chaining technique for collision resolution, which is the most widely used and is found in most implementations. And as others have commented, it is almost impossible to implement a hash map with guaranteed O(1) get, insertion, and deletion.
Moreover, to implement a map with just an array, you can use an open-addressing approach to collision resolution. It is a different collision resolution technique in which no pointers are used, just an array that is larger than the maximum number of elements that will ever be stored in the map. There are several ways of implementing open-addressing-based collision resolution, but the simplest of them all is the linear probing technique, sketched below. You can read more about its implementation here.
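A rough sketch of linear probing in C, under simplifying assumptions of my own (non-negative int keys, fixed prime table size, no deletion or resizing):

    #include <stdbool.h>

    #define TABLE_SIZE 101              /* a prime, larger than the max item count */
    #define EMPTY (-1)                  /* assumes keys are non-negative ints */

    typedef struct {
        int keys[TABLE_SIZE];
        int vals[TABLE_SIZE];
    } Table;

    static void table_init(Table *t)
    {
        for (int i = 0; i < TABLE_SIZE; i++) t->keys[i] = EMPTY;
    }

    /* Linear probing: if the home slot is taken, step forward one slot at a
       time (wrapping around) until we find the key or an empty slot. */
    static bool table_put(Table *t, int key, int val)
    {
        int i = key % TABLE_SIZE;
        for (int probes = 0; probes < TABLE_SIZE; probes++) {
            if (t->keys[i] == EMPTY || t->keys[i] == key) {
                t->keys[i] = key;
                t->vals[i] = val;
                return true;
            }
            i = (i + 1) % TABLE_SIZE;
        }
        return false;                   /* table full */
    }

    static bool table_get(const Table *t, int key, int *val_out)
    {
        int i = key % TABLE_SIZE;
        for (int probes = 0; probes < TABLE_SIZE; probes++) {
            if (t->keys[i] == EMPTY) return false;   /* key can't be further on */
            if (t->keys[i] == key) { *val_out = t->vals[i]; return true; }
            i = (i + 1) % TABLE_SIZE;
        }
        return false;
    }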
I'm scanning a large data source, currently about 8 million entries, extracting one string per entry, which I want in alphabetical order.
Currently I put them in an array and then sort an index to them using qsort(), which works fine.
But out of curiosity I'm thinking of instead inserting each string into a data structure that maintains them in alphabetical order as I scan them from the data source, partly for the experience of implementing one, partly because it will feel faster without the wait for the sort to complete after the scan has finished (-:
What data structure would be the most straightforward to implement in C?
UPDATE
To clarify, the only operations I need to perform are inserting an item and dumping the index when it's done, by which I mean: for each item in the original order, dump an integer representing its position after sorting.
SUMMARY
The easiest to implement are binary search trees.
Self-balancing binary trees are much better but nontrivial to implement.
Insertion can be done iteratively, but in-order traversal for dumping the results and post-order traversal for deleting the tree when done both require either recursion or an explicit stack.
Without balancing, runs of ordered input result in the degenerate worst case, which is effectively a linked list. This means deep trees, which severely impact the speed of the insert operation.
Shuffling the input slightly can break up ordered input significantly and is easier to implement than balancing.
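As a rough sketch of the plain (unbalanced) BST approach summarised above, assuming the key strings outlive the tree and using illustrative names of my own:

    #include <stdlib.h>
    #include <string.h>

    typedef struct Node {
        const char *key;                 /* assumed to outlive the tree */
        long seq;                        /* position of this item in the input order */
        struct Node *left, *right;
    } Node;

    /* Iterative insertion: walk down from the root to the NULL link where
       the new key belongs.  No balancing, so ordered input degenerates
       into a linked list (the worst case mentioned in the summary). */
    static Node *bst_insert(Node *root, const char *key, long seq)
    {
        Node *n = malloc(sizeof *n);
        n->key = key; n->seq = seq; n->left = n->right = NULL;

        if (!root) return n;
        for (Node *cur = root; ; ) {
            if (strcmp(key, cur->key) < 0) {
                if (!cur->left) { cur->left = n; break; }
                cur = cur->left;
            } else {
                if (!cur->right) { cur->right = n; break; }
                cur = cur->right;
            }
        }
        return root;
    }

    /* In-order traversal visits keys alphabetically; record, for each item's
       original position, the rank it ends up with after sorting. */
    static void bst_dump(const Node *n, long *rank_of, long *next_rank)
    {
        if (!n) return;
        bst_dump(n->left, rank_of, next_rank);
        rank_of[n->seq] = (*next_rank)++;
        bst_dump(n->right, rank_of, next_rank);
    }

    /* Post-order traversal frees children before their parent. */
    static void bst_free(Node *n)
    {
        if (!n) return;
        bst_free(n->left);
        bst_free(n->right);
        free(n);
    }

Printing rank_of[0..n) in order then gives, for each item in the original order, an integer representing where it falls after sorting.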
Binary search trees. Or self-balancing search trees. But don't expect those to be faster than a properly implemented dynamic array, since arrays have much better locality of reference than pointer structures. Also, unbalanced BSTs may "go linear", so your entire algorithm becomes O(n²), just like quicksort.
You are already using the optimal approach. Sorting at the end will be much cheaper than maintaining an online sorted data structure. You can get the same O(log N) with a red-black tree, but the constant factor will be much worse, not to mention the significant space overhead.
That said, AVL trees and red-black trees are much simpler to implement if you don't need to support deletion. A left-leaning red-black tree can fit in 50 or so lines of code. See http://www.cs.princeton.edu/~rs/talks/LLRB/ (by Sedgewick).
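To give a feel for the "50 or so lines" claim, here is a hedged insert-only sketch of a left-leaning red-black tree in C, adapted from Sedgewick's description (no deletion, duplicates simply ignored; names are my own):

    #include <stdlib.h>
    #include <string.h>
    #include <stdbool.h>

    typedef struct RBNode {
        const char *key;
        struct RBNode *left, *right;
        bool red;
    } RBNode;

    static bool is_red(const RBNode *n) { return n && n->red; }

    static RBNode *rotate_left(RBNode *h)
    {
        RBNode *x = h->right;
        h->right = x->left;
        x->left = h;
        x->red = h->red;
        h->red = true;
        return x;
    }

    static RBNode *rotate_right(RBNode *h)
    {
        RBNode *x = h->left;
        h->left = x->right;
        x->right = h;
        x->red = h->red;
        h->red = true;
        return x;
    }

    static void flip_colors(RBNode *h)
    {
        h->red = !h->red;
        h->left->red = !h->left->red;
        h->right->red = !h->right->red;
    }

    static RBNode *rb_insert(RBNode *h, const char *key)
    {
        if (!h) {
            RBNode *n = malloc(sizeof *n);
            n->key = key; n->left = n->right = NULL; n->red = true;
            return n;
        }

        int cmp = strcmp(key, h->key);
        if (cmp < 0)      h->left  = rb_insert(h->left,  key);
        else if (cmp > 0) h->right = rb_insert(h->right, key);
        /* cmp == 0: duplicate key, ignore it */

        /* Restore the left-leaning red-black invariants on the way back up. */
        if (is_red(h->right) && !is_red(h->left))      h = rotate_left(h);
        if (is_red(h->left)  && is_red(h->left->left)) h = rotate_right(h);
        if (is_red(h->left)  && is_red(h->right))      flip_colors(h);

        return h;
    }

    /* Callers blacken the root after each insertion:
       root = rb_insert(root, key); root->red = false;  */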
You could implement a faster sorting algorithm such as Timsort, or another sorting algorithm with an O(n log n) worst case, and then search it using binary search, since that is faster once the list is sorted.
You should take a look at the trie data structure (wiki link).
I think this will serve what you want; a rough sketch follows.
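In case it helps, here is a very rough trie sketch in C, assuming lowercase-ASCII keys just to keep the node small (an assumption of mine; a real implementation would cover the full byte range). Walking the children in index order dumps the strings alphabetically:

    #include <stdlib.h>
    #include <stdio.h>

    #define ALPHA 26                     /* assumes keys contain only 'a'..'z' */

    typedef struct TrieNode {
        struct TrieNode *child[ALPHA];
        int terminal;                    /* nonzero if a string ends here */
    } TrieNode;

    static TrieNode *node_new(void)
    {
        return calloc(1, sizeof(TrieNode));
    }

    static void trie_insert(TrieNode *root, const char *s)
    {
        for (; *s; s++) {
            int i = *s - 'a';
            if (!root->child[i])
                root->child[i] = node_new();
            root = root->child[i];
        }
        root->terminal = 1;              /* duplicates just re-mark the same node */
    }

    /* Visiting children in index order prints the stored strings in
       alphabetical order; buf must be large enough for the longest key. */
    static void trie_dump(const TrieNode *n, char *buf, int depth)
    {
        if (n->terminal) { buf[depth] = '\0'; puts(buf); }
        for (int i = 0; i < ALPHA; i++)
            if (n->child[i]) { buf[depth] = 'a' + i; trie_dump(n->child[i], buf, depth + 1); }
    }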
Can someone please shed some light on how popular languages like Python and Ruby implement hash tables internally for symbol lookup? Do they use the classic "array with linked lists" method, or a balanced tree?
I need a simple (few LOC) and fast method for indexing the symbols in a DSL written in C. I was wondering what others have found most efficient and practical.
The classic "array of hash buckets" you mention is used in every implementation I've seen.
One of the most educational versions is the hash implementation in the Tcl language, in the file tcl/generic/tclHash.c. More than half of the lines in the file are comments explaining everything in detail: allocation, search, different hash table types, strategies, etc. Side note: the code implementing the Tcl language is really readable.
Perl uses an array with linked lists to hold collisions. It has a simple heuristic to automatically double the size of the array as necessary. There's also code to share keys between hashes to save a little memory. You can read about it in the dated but still relevant Perl Illustrated Guts under "HV". If you're truly adventurous you can dig into hv.c.
The hashing algorithm used to be pretty simple, but it's probably a lot more complicated now with Unicode. Because the algorithm was predictable, there was a DoS attack whereby the attacker generated data that would cause hash collisions. For example, a huge list of keys sent to a web site as POST data. The Perl program would likely split it and dump it into a hash, which then shoved it all into one bucket. The resulting hash was O(n) rather than O(1). Throw a whole lot of POST requests at a server and you might clog the CPU. As a result, Perl now perturbs the hash function with a bit of random data.
You also might want to look at how Parrot implements basic hashes which is significantly less terrifying than the Perl 5 implementation.
As for "most efficient and practical", use someone else's hash library. For god's sake don't write one yourself for production use. There's a hojillion robust and efficient ones out there already.
Lua tables use an utterly ingenious implementation which for arbitrary keys behaves like an "array of buckets", but if you use consecutive integers as keys, it has the same representation and space overhead as an array. In the implementation, each table has a hash part and an array part.
I think this is way cool :-)
Attractive Chaos has a comparison of hash table libraries and an update.
The source code is available, and it is in C and C++.
Balanced trees sort of defeat the purpose of hash tables, since a hash table can provide lookup in (amortized) constant time, whereas the average lookup in a balanced tree is O(log n).
Separate chaining (an array with linked lists) really works quite well if you have enough buckets and your linked-list implementation uses a pooling allocator rather than malloc()ing each node from the heap individually. I've found it to be just about as performant as any other technique when properly tuned, and it is very easy and quick to write. Try starting with 1/8 as many buckets as source data.
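As a sketch of what chaining with a pooling allocator can look like in C (illustrative names of my own, fixed-size pool, no resizing or deletion):

    #include <stdint.h>
    #include <string.h>

    #define NBUCKETS 4096                /* roughly 1/8 of the expected item count */
    #define POOL_MAX 32768

    typedef struct Entry {
        const char *key;
        int value;
        struct Entry *next;              /* next entry in the same bucket's chain */
    } Entry;

    static Entry pool[POOL_MAX];         /* all nodes come from this pool...      */
    static size_t pool_used;             /* ...so there is no per-node malloc()   */
    static Entry *buckets[NBUCKETS];

    /* FNV-1a: a simple, decent string hash (any reasonable hash would do). */
    static uint32_t hash_str(const char *s)
    {
        uint32_t h = 2166136261u;
        while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
        return h;
    }

    static Entry *map_find(const char *key)
    {
        for (Entry *e = buckets[hash_str(key) % NBUCKETS]; e; e = e->next)
            if (strcmp(e->key, key) == 0) return e;
        return NULL;
    }

    static Entry *map_insert(const char *key, int value)
    {
        Entry *e = map_find(key);
        if (e) { e->value = value; return e; }
        if (pool_used == POOL_MAX) return NULL;    /* pool exhausted */

        e = &pool[pool_used++];                    /* allocate from the pool */
        e->key = key;
        e->value = value;
        uint32_t b = hash_str(key) % NBUCKETS;
        e->next = buckets[b];                      /* push onto the chain head */
        buckets[b] = e;
        return e;
    }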
You can also use open addressing with quadratic or polynomial probing, as Python does.
If you can read Java, you might want to check out the source code for its various map implementations, in particular HashMap, TreeMap and ConcurrentSkipListMap. The latter two keep the keys ordered.
Java's HashMap uses the standard technique you mention of chaining at each bucket position. It uses fairly weak 32-bit hash codes and stores the keys in the table. The Numerical Recipes authors also give an example (in C) of a hash table essentially structured like Java's but in which (a) you allocate the nodes of the bucket lists from an array, and (b) you use a stronger 64-bit hash code and dispense with storing keys in the table.
What Crashworks meant to say was...
The purpose of hash tables is constant-time lookup, insertion, and deletion. In algorithmic terms, all of these operations are O(1) amortized.
Whereas if you use a tree, the worst-case time for an operation on a balanced tree will be O(log n), where n is the number of nodes. But do we really have hashes implemented as trees?