I am writing a C program which calculates the total size of files in a given directory. I know that each file points to an inode, so I am planning to use stat to find the inode value and file size. Since I want to avoid erroneous calculation when there are multiple hard links and/or symlinks to an inode, I want to store the inodes in an array. The problem is that, to check whether the inode is unique for a given file, I would have to iterate through the inode array again, giving a runtime of approximately n^2. I want to avoid overly complex structures such as RB trees. Is there a faster, more clever way to implement this? I know there are system tools which do this, and I want to know how they implement something like this.
Even plain binary trees are a good choice, since under random data they stay relatively balanced. They are also a very simple structure to implement.
In general, the structure of choice is the hash table, with constant average search time. The challenge here is to find a good hash function for your data. Implementing a hash table is not difficult, and there are plenty of good libraries that implement them.
But if you are willing to wait until all inodes are stored in the array, then you can sort the array and traverse it to find duplicates.
EDIT:
Inodes contain a reference count, which counts the number of hard links. So you only need to check for duplicates among inodes whose reference count (st_nlink) is greater than 1.
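For concreteness, here is a minimal sketch along those lines (my own illustration, not tested production code: it scans a single directory non-recursively, ignores st_dev, and omits most error handling). Files whose inode has st_nlink == 1 cannot be reached through another hard link, so only the others need to be remembered, sorted and de-duplicated:

    #include <dirent.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    struct entry { ino_t ino; off_t size; };

    static int cmp_entry(const void *a, const void *b)
    {
        const struct entry *x = a, *y = b;
        return (x->ino > y->ino) - (x->ino < y->ino);
    }

    int main(int argc, char **argv)
    {
        const char *dir = argc > 1 ? argv[1] : ".";
        DIR *d = opendir(dir);
        if (!d) { perror("opendir"); return 1; }

        struct entry *multi = NULL;     /* files whose inode has st_nlink > 1 */
        size_t n = 0, cap = 0;
        long long total = 0;
        struct dirent *e;
        char path[4096];

        while ((e = readdir(d)) != NULL) {
            struct stat st;
            snprintf(path, sizeof path, "%s/%s", dir, e->d_name);
            if (lstat(path, &st) != 0 || !S_ISREG(st.st_mode))
                continue;                      /* skip subdirectories, symlinks, errors */
            if (st.st_nlink > 1) {             /* could be a duplicate: remember it */
                if (n == cap) {
                    cap = cap ? cap * 2 : 64;
                    multi = realloc(multi, cap * sizeof *multi);
                }
                multi[n].ino = st.st_ino;
                multi[n].size = st.st_size;
                n++;
            } else {
                total += st.st_size;           /* certainly counted only once */
            }
        }
        closedir(d);

        /* Sort by inode number, then count each inode's size exactly once. */
        if (n > 0)
            qsort(multi, n, sizeof *multi, cmp_entry);
        for (size_t i = 0; i < n; i++)
            if (i == 0 || multi[i].ino != multi[i - 1].ino)
                total += multi[i].size;

        printf("total: %lld bytes\n", total);
        free(multi);
        return 0;
    }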
Use a hash table. Lookup is O(1) on average (though somewhat expensive for tiny sets). Of course you may find this "overly complex", as you said about red-black trees, but if you want good performance on large inputs you'll need something a little more complex than a plain array (which, by the way, would be fastest for small sets, despite its worse theoretical time complexity).
If you don't have a hash table implementation already available (this is C after all), there is an overview of several here: https://stackoverflow.com/a/8470745/4323
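If you do end up rolling your own, a minimal open-addressing set keyed on inode numbers is only a few dozen lines. The sketch below (my own, with an arbitrary fixed capacity, no resizing and no deletion) reports whether an inode has been seen before, which is exactly the check needed here:

    #include <stdlib.h>
    #include <sys/types.h>                 /* ino_t */

    #define TABLE_SIZE (1u << 20)          /* fixed capacity for the sketch; keep it well under full */

    struct ino_set {
        unsigned char used[TABLE_SIZE];
        ino_t         slot[TABLE_SIZE];
    };

    /* Returns 1 if ino was newly inserted, 0 if it was already present. */
    static int ino_set_insert(struct ino_set *s, ino_t ino)
    {
        size_t i = (size_t)(ino * 11400714819323198485ull) % TABLE_SIZE;  /* multiplicative hash */
        while (s->used[i]) {
            if (s->slot[i] == ino)
                return 0;                  /* already seen this inode */
            i = (i + 1) % TABLE_SIZE;      /* linear probing */
        }
        s->used[i] = 1;
        s->slot[i] = ino;
        return 1;
    }

The structure is several megabytes, so allocate it with calloc(1, sizeof(struct ino_set)) rather than on the stack; each new file then costs one ino_set_insert() call on average.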
Please forgive this stupid question, but I didn't find any hint by googling it.
If I have an array (contiguous memory), and I search sequentially for a given pattern (for example build the list of all even numbers), am I using a cache-oblivious algorithm? Yes it's quite stupid as an algorithm, but I'm trying to understand here :)
Yes, you are using a cache-oblivious algorithm: the number of block transfers is O(N/B), where B is the block size, but the algorithm itself does not depend on any particular value of B. This also means you are both cache-oblivious and cache-efficient.
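For concreteness, the scan under discussion is just a single pass over the array, e.g. (sketch):

    #include <stddef.h>

    /* Copy the even elements of in[] into out[]; returns the number copied.
     * One sequential pass costs about N/B block transfers regardless of the
     * (unknown) block size B, which is what makes it cache-oblivious. */
    size_t collect_evens(const int *in, size_t n, int *out)
    {
        size_t k = 0;
        for (size_t i = 0; i < n; i++)
            if (in[i] % 2 == 0)
                out[k++] = in[i];
        return k;
    }

Nothing in the loop refers to B; the O(N/B) transfer count falls out of the access pattern alone.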
I don't really know how to put this, but aside from implementing it in primary memory (i.e. on the heap), how can I implement variation 1, 2 or 3, or any of the variations, in secondary memory, which is where we manipulate files, right?
Assuming your secondary memory is something with relatively slow seek times, like hard disk drives, you typically want to implement a closed hashing scheme based on "buckets", where a bucket can be paged into main memory in full relatively quickly. That way you usually don't have to perform expensive disk seeks for collisions or unstored keys. This isn't a particularly trivial undertaking, and often one ends up using a library such as the classic gdbm or others (also see Wikipedia).
Most of the bucket schemes are based on extensible hashing with a special case for trying to store large keys or data that don't fit nicely into a bucket. CiteSeer is also a good place to look for papers related to extensible hashing. (See the references of the linked paper, for example.)
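As a rough illustration (my own sketch, not gdbm's actual format), an on-disk bucket in such a scheme is typically a page-sized, self-contained record, so that a single read brings in everything needed to resolve a key:

    #include <stdint.h>

    #define BUCKET_BYTES 4096                       /* one disk page per bucket */
    #define MAX_RECORDS  ((BUCKET_BYTES - 16) / 16) /* 16-byte header, 16-byte records */

    /* Hypothetical on-disk layout: the whole bucket is read or written in one
     * page-sized I/O, so probing a bucket costs a single seek. */
    struct bucket_record {
        uint64_t key;      /* full hash or key */
        uint64_t value;    /* payload, or file offset of an overflow area for large data */
    };

    struct bucket {
        uint32_t local_depth;   /* how many hash bits this bucket is responsible for */
        uint32_t count;         /* records currently stored */
        uint64_t overflow_off;  /* 0, or file offset of an overflow bucket */
        struct bucket_record rec[MAX_RECORDS];
    };

Lookup is then: hash the key, use the extendible-hashing directory to map the relevant hash bits to a bucket's file offset, read that one page, and scan it in memory; overflow_off handles the occasional oversized bucket.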
I am developing an IP filter and was wondering how I could, using any kind of data structure, build a very efficient and fast blacklist filter.
What I want to do is simple: for every incoming/outgoing connection I have to check against a list of blocked IPs.
The IPs are scattered, and memory use should stay small (roughly linear in the number of blocked IPs), because I want to use this on limited systems (homebrew routers).
I have time and could build anything from scratch; implementation difficulty is not a concern to me.
If you could use anything, what would you do?
Hashtables are the way to go.
They have average O(1) complexity for lookup, insertion and deletion!
They tend to occupy more memory than trees but are much faster.
Since you are just working with 32-bit integers (you can of course convert an IP to a 32-bit integer), things will be amazingly simple and fast.
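Turning the textual address into such a key is a one-liner with inet_pton (sketch, IPv4 only):

    #include <arpa/inet.h>
    #include <stdint.h>

    /* Convert "a.b.c.d" to a host-order 32-bit key; returns 0 on parse error. */
    static int ip_to_key(const char *s, uint32_t *key)
    {
        struct in_addr a;
        if (inet_pton(AF_INET, s, &a) != 1)
            return 0;
        *key = ntohl(a.s_addr);   /* s_addr is network byte order */
        return 1;
    }

The resulting key can then go into whichever structure you pick below (hash table, sorted array, tree).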
You could also just use a sorted array. Insertion and removal cost O(n), but lookup is O(log n), and memory is just 4 bytes per IP.
The implementation is very simple, perhaps too simple :D
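Roughly like this, assuming the list is loaded and sorted once at startup (a sketch):

    #include <stdint.h>
    #include <stdlib.h>

    static int cmp_u32(const void *a, const void *b)
    {
        uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
        return (x > y) - (x < y);
    }

    /* blocklist must first be sorted once with:
     *     qsort(blocklist, n, sizeof(uint32_t), cmp_u32);          */
    static int is_blocked(const uint32_t *blocklist, size_t n, uint32_t ip)
    {
        return bsearch(&ip, blocklist, n, sizeof *blocklist, cmp_u32) != NULL;
    }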
Binary trees have complexity of O(log n) for lookup, insertion and deletion.
A simple binary tree would not be sufficient, however; you need an AVL tree or a red-black tree, which can be quite annoying and complicated to implement.
AVL and red-black trees are able to balance themselves, and we need that because an unbalanced tree has a worst-case lookup complexity of O(n), which is the same as a simple linked list!
If, instead of single IPs, you need to ban IP ranges, you probably want a Patricia trie, also called a radix tree; they were invented for word dictionaries and for IP dictionaries.
However, these trees can be slower if not well written/balanced.
Hashtables are always better for simple lookups! They are almost too fast to be true :)
Now about synchronization:
If you are filling the blacklist only once at application startup, you can use a plain read-only hashtable (or radix tree), which has no multithreading or locking issues.
If you need to update it only occasionally, I would suggest using reader-writer locks.
If you need very frequent updates I would suggest you to use a concurrent hashtable.
Warning: don't write your own, they are very complicated and bug prone, find an implementation on the web!
They make heavy use of the (relatively) new atomic CAS operations of modern processors (CAS means compare-and-swap). These are special instructions, or sequences of instructions, that allow a 32-bit or 64-bit field in memory to be compared and swapped in a single atomic operation, without locking.
Using them can be complicated because you have to know your processor, your operating system and your compiler very well, and the algorithms themselves are counterintuitive.
See http://en.wikipedia.org/wiki/Compare-and-swap for more information about CAS.
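To make the primitive concrete, here is the shape of a CAS retry loop using C11 <stdatomic.h> (a trivial lock-free counter, only to show the pattern; real concurrent hashtables and trees build far more machinery on top of it):

    #include <stdatomic.h>
    #include <stdint.h>

    static _Atomic uint32_t counter;

    /* Classic CAS retry loop: read, compute, attempt to publish, retry on conflict. */
    static void increment(void)
    {
        uint32_t old = atomic_load(&counter);
        while (!atomic_compare_exchange_weak(&counter, &old, old + 1)) {
            /* on failure, 'old' has been reloaded with the current value; just retry */
        }
    }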
Concurrent AVL trees have been invented, but they are so complicated that I really don't know what to say about them :) For example, http://hal.inria.fr/docs/00/07/39/31/PDF/RR-2761.pdf
I just found that concurrent radix trees exist:
ftp://82.96.64.7/pub/linux/kernel/people/npiggin/patches/lockless/2.6.16-rc5/radix-intro.pdf but they are quite complicated too.
Concurrent sorted arrays don't exist, of course; you need a reader-writer lock for updates.
Consider also that the amount of memory required for a non-concurrent hashtable can be quite small: for each IP you need 4 bytes for the IP plus a pointer.
You also need a big array of pointers (or of 32-bit integers, with some tricks) whose size should be a prime number greater than the number of items to be stored.
Hashtables can of course also resize themselves when required if you want, and they can store more items than that prime size, at the cost of slower lookups.
For both trees and hashtable, the space complexity is linear.
I hope this is a multithreaded application and not a multiprocess one (fork).
If it is not multithreaded, you cannot share a region of memory in a fast and reliable way.
One way to improve the performance of such a system is to use a Bloom Filter. This is a probabilistic data structure, taking up very little memory, in which false positives are possible but false negatives are not.
When you want to look up an IP address, you first check in the Bloom Filter. If there's a miss, you can allow the traffic right away. If there's a hit, you need to check your authoritative data structure (eg a hash table or prefix tree).
You could also create a small cache of "hits in the Bloom Filter but actually allowed" addresses, that is checked after the Bloom Filter but before the authoritative data structure.
Basically the idea is to speed up the fast path (IP address allowed) at the expense of the slow path (IP address denied).
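A minimal Bloom filter over 32-bit IP keys can be this small (my own sketch; the filter size and the two hash mixes are arbitrary choices, and k = 2 hash functions is on the low side for a real deployment):

    #include <stdint.h>

    #define BLOOM_BITS (1u << 20)          /* 1 Mbit = 128 KiB of filter */

    struct bloom {
        uint8_t bits[BLOOM_BITS / 8];
    };

    /* Two cheap multiplicative hashes; k = 2 for the sketch. */
    static uint32_t h1(uint32_t x) { return (x * 2654435761u) % BLOOM_BITS; }
    static uint32_t h2(uint32_t x) { return (x * 40503u + 2166136261u) % BLOOM_BITS; }

    static void bloom_add(struct bloom *b, uint32_t ip)
    {
        uint32_t i = h1(ip), j = h2(ip);
        b->bits[i / 8] |= 1u << (i % 8);
        b->bits[j / 8] |= 1u << (j % 8);
    }

    /* 0 = definitely not blocked; 1 = maybe blocked, consult the real structure. */
    static int bloom_maybe_contains(const struct bloom *b, uint32_t ip)
    {
        uint32_t i = h1(ip), j = h2(ip);
        return (b->bits[i / 8] >> (i % 8) & 1) && (b->bits[j / 8] >> (j % 8) & 1);
    }

bloom_maybe_contains() returning 0 means the address is definitely not on the blacklist, so most allowed traffic never touches the authoritative structure.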
The "most efficient" is a hard term to quantify. Clearly, if you had unlimited memory, you would have a bin for every IP address and could immediately index into it.
A common tradeoff is a B-tree-like data structure. First-level bins could be preallocated for the first 8 bits of the IP address, each storing a pointer to, and the size of, a list containing all currently blocked IP addresses with that prefix. This second list would be padded to prevent unnecessary memmove() calls and possibly kept sorted. (Keeping the capacity and the length of the list in memory allows an in-place binary search on the list, at a slight expense in insertion time.)
For example:
127.0.0.1 =insert=> { 127 :: 1 }
127.0.1.0 =insert=> { 127 :: 1, 256 }
12.0.2.30 =insert=> { 12 : 542; 127 :: 1, 256 }
The overhead of such a data structure is minimal, and the total storage size is fixed. The worst case, clearly, would be a large number of IP addresses sharing the same high-order bits.
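In code, the skeleton of that structure might look like this (a sketch under the same assumptions: 256 first-octet bins, each holding a sorted, over-allocated array of the remaining 24 bits):

    #include <stddef.h>
    #include <stdint.h>

    /* One bin per possible first octet.  Each bin owns a sorted array of the
     * low 24 bits of every blocked address starting with that octet; the array
     * is over-allocated (capacity > count) so insertions rarely reallocate. */
    struct bin {
        uint32_t *low24;   /* sorted */
        size_t    count;
        size_t    capacity;
    };

    struct blacklist {
        struct bin bins[256];
    };

    /* Binary search inside one bin. */
    static int bin_contains(const struct bin *b, uint32_t low24)
    {
        size_t lo = 0, hi = b->count;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (b->low24[mid] < low24)       lo = mid + 1;
            else if (b->low24[mid] > low24)  hi = mid;
            else                             return 1;
        }
        return 0;
    }

    static int blacklist_contains(const struct blacklist *bl, uint32_t ip)
    {
        return bin_contains(&bl->bins[ip >> 24], ip & 0xFFFFFFu);
    }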
I want to sort on the order of four million long longs in C. Normally I would just malloc() a buffer to use as an array and call qsort() but four million * 8 bytes is one huge chunk of contiguous memory.
What's the easiest way to do this? I rate ease over pure speed for this. I'd prefer not to use any libraries and the result will need to run on a modest netbook under both Windows and Linux.
Just allocate a buffer and call qsort. 32MB isn't so very big these days even on a modest netbook.
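Concretely, that is the whole program apart from the I/O; the only subtlety is writing the comparator so it cannot overflow (sketch):

    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_ll(const void *a, const void *b)
    {
        long long x = *(const long long *)a, y = *(const long long *)b;
        return (x > y) - (x < y);   /* never return x - y: the subtraction can overflow */
    }

    int main(void)
    {
        size_t n = 4000000;
        long long *buf = malloc(n * sizeof *buf);   /* roughly 32 MB */
        if (!buf) { fputs("out of memory\n", stderr); return 1; }

        /* ... fill buf[0..n-1] from the data source ... */

        qsort(buf, n, sizeof *buf, cmp_ll);
        free(buf);
        return 0;
    }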
If you really must split it up: sort smaller chunks, write them to files, and merge them (a merge takes a single linear pass over each of the things being merged). But, really, don't. Just sort it.
(There's a good discussion of the sort-and-merge approach in volume 2 of Knuth, where it's called "external sorting". When Knuth was writing that, the external data would have been on magnetic tape, but the principles aren't very different with discs: you still want your I/O to be as sequential as possible. The tradeoffs are a bit different with SSDs.)
32 MB? That's not too big... quicksort should do the trick.
Your best option would be to prevent having the data unordered in the first place, if possible. As has been mentioned, you'd be better off reading the data from disk (or the network, or whatever the source) directly into a self-organizing container (a tree; perhaps std::set will do).
That way, you'll never have to sort the whole lot or worry about memory management. If you know the required capacity of the container, you might squeeze out additional performance by constructing the std::vector with an initial capacity or calling vector::reserve up front.
You'd then be best advised to use std::make_heap to heapify any existing elements, and then add element by element using push_heap (see also pop_heap). This is essentially the same paradigm as the self-ordering set, except that
- duplicates are OK
- the storage is 'optimized' as a flat array (which is perfect for e.g. shared memory maps or memory-mapped files)
(Oh, minor detail: note that sort_heap on the heap takes at most N log N comparisons, where N is the number of elements.)
Let me know if you think this is an interesting approach. I'd really need a bit more info on the use case.
This is an interview question: "Given a directory with lots of files, find the files that have the same content". I would propose to use a hash function to generate hash values of the file contents and compare only the files with the same hash values. Does that make sense?
The next question is how to choose the hash function. Would you use SHA-1 for that purpose ?
I'd rather use the hash as a second step. Sorting the directory by file size first, and hashing and comparing only when there are duplicate sizes, may narrow your search universe a lot in the general case.
Like most interview questions, it's more meant to spark a conversation than to have a single answer.
If there are very few files, it may be faster to simply do a byte-by-byte comparison until you reach bytes which do not match (assuming you do). If there are many files, it may be faster to compute hashes, as you won't have to seek around the disk reading chunks from multiple files. This process may be sped up by grabbing increasingly large chunks of each file as you progress through the files eliminating potential matches. It may also be necessary to distribute the problem among multiple servers, if there are enough files.
I would begin with a much faster and simpler hash function than SHA-1. SHA-1 is cryptographically secure, which is not necessarily required in this case. In my informal tests, Adler-32, for example, is 2-3 times faster. You could also use an even weaker presumptive test, then retest any files which match. This decision also depends on the relation between I/O bandwidth and CPU power: if you have a more powerful CPU, use a more specific hash to save having to reread files in subsequent tests; if you have faster I/O, the rereads may be cheaper than doing expensive hashes unnecessarily.
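For reference, the textbook Adler-32 is short enough to drop in directly (a sketch; zlib ships a faster adler32() if you already link against it):

    #include <stddef.h>
    #include <stdint.h>

    #define ADLER_MOD 65521u   /* largest prime below 2^16 */

    /* Straightforward Adler-32 over a buffer.  (The zlib version processes
     * blocks and defers the modulo for speed; this is the textbook form.) */
    static uint32_t adler32_simple(const unsigned char *buf, size_t len)
    {
        uint32_t a = 1, b = 0;
        for (size_t i = 0; i < len; i++) {
            a = (a + buf[i]) % ADLER_MOD;
            b = (b + a) % ADLER_MOD;
        }
        return (b << 16) | a;
    }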
Another interesting idea would be to use heuristics on the files as you process them to determine the optimal method, based on the file sizes, the computer's speed, and the files' entropy.
Yes, the proposed approach is reasonable, and SHA-1 or MD5 will be enough for that task. Here's a detailed analysis for the very same scenario, and here's a question specifically on using MD5. Don't forget that you want the hash function to be as fast as possible.
Yes, hashing is the first thing that comes to mind. For your particular task you need to take the fastest hash function available. Adler-32 would work. Collisions are not a problem in your case, so you don't need a cryptographically strong function.