Most Efficient way of implementing a BlackList

Most Efficient way of implementing a BlackList - c

I developing a Ip filter and was guessing how i could, using any type of esque data structure, develop a VERY efficient and fast BlackList filter.
What i want to do is simple, every incoming/outcoming connection i have to check in a list of blocked IP´s.
The IPs are scattered, and the memory use should be linear(not dependent of the number of blocked list, because i want to use on limited systems(homebrew routers)).
I have time and could create anything from zero. The difficulty is not important to me.
If you can use anything, what you should do ?

Hashtables are the way to go.
They have averaged O(1) complexity for lookup, insertion and deletion!
They tend to occupy more memory than trees but are much faster.
Since you are just working with 32 bit integer (you can of course convert an IP to a 32 bit integer) things will be amazingly simple and fast.
You can just use a sorted array. Insertion and removal cost is O(n) but lookup is O(log n) and especially memory is just 4 byte for each ip.
The implementation is very simple, perhaps too much :D
Binary trees have complexity of O(log n) for lookup, insertion and deletion.
A simple binary tree would not be sufficient however, you need an AVL tree or a Red Black Tree, that can be very annoying and complicated to implement.
AVL and RBT trees are able to balance themselves, and we need that because an unbalanced tree will have a worst time complexity of O(n) for lookup, that is the same of a simple linked list!
If instead of single and unique ip u need to ban ip ranges, probably you need a Patricia Trie, also called Radix Tree, they were invented for word dictionaries and for ip dictionaries.
However these trees can be slower if not well written\balanced.
Hashtable are always better for simple lookups! They are too fast to be real :)
Now about synchronization:
If you are filling the black list only once at application startup, you can use a plain read only hashtable (or radix tree) that don't have problems about multithreading and locking.
If you need to update it not very often, I would suggest you the use reader-writer locks.
If you need very frequent updates I would suggest you to use a concurrent hashtable.
Warning: don't write your own, they are very complicated and bug prone, find an implementation on the web!
They use a lot the (relatively) new atomic CAS operations of new processors (CAS means Compare and Swap). These are a special set of instructions or sequence of instructions that allow 32 bit or 64 bit fields on memory to be compared and swapped in a single atomic operation without the need of locking.
Using them can be complicated because you have to know very well your processor, your operative system, your compiler and the algorithm itself is counterintuitive.
See http://en.wikipedia.org/wiki/Compare-and-swap for more informations about CAS.
Concurrent AVL tree was invented, but it is so complicated that I really don't know what to say about these :) for example, http://hal.inria.fr/docs/00/07/39/31/PDF/RR-2761.pdf
I just found that concurrent radix tree exists:
ftp://82.96.64.7/pub/linux/kernel/people/npiggin/patches/lockless/2.6.16-rc5/radix-intro.pdf but it is quite complicated too.
Concurrent sorted arrays doesn't exists of course, you need a reader-writer lock for update.
Consider also that the amount of memory required to handle a non-concurrent hashtable can be quite little: For each IP you need 4 byte for the IP and a pointer.
You need also a big array of pointers (or 32 bit integers with some tricks) which size should be a prime number greater than the number of items that should be stored.
Hashtables can of course also resize themselves when required if you want, but they can store also more item than that prime numbers, at the cost of slower lookup time.
For both trees and hashtable, the space complexity is linear.
I hope this is a multithreading application and not a multiprocess application (fork).
If it is not multithreading you cannot share a portion of memory in a fast and reliable way.

One way to improve the performance of such a system is to use a Bloom Filter. This is a probabilistic data structure, taking up very little memory, in which false positives are possible but false negatives are not.
When you want to look up an IP address, you first check in the Bloom Filter. If there's a miss, you can allow the traffic right away. If there's a hit, you need to check your authoritative data structure (eg a hash table or prefix tree).
You could also create a small cache of "hits in the Bloom Filter but actually allowed" addresses, that is checked after the Bloom Filter but before the authoritative data structure.
Basically the idea is to speed up the fast path (IP address allowed) at the expense of the slow path (IP address denied).

The "most efficient" is a hard term to quantify. Clearly, if you had unlimited memory, you would have a bin for every IP address and could immediately index into it.
A common tradeoff is using a B-tree type data structure. First level bins could be preset for the first 8 bits of the IP address, which could store a pointer to and the size of a list containing all currently blocked IP addresses. This second list would be padded to prevent unnecessary memmove() calls and possibly sorted. (Having the size and the length of the list in memory allows an in-place binary search on the list at the slight expensive of insertion time.)
For example:
127.0.0.1 =insert=> { 127 :: 1 }
127.0.1.0 =insert=> { 127 :: 1, 256 }
12.0.2.30 =insert=> { 12 : 542; 127 :: 1, 256 }
The overhead on such a data structure is minimal, and the total storage size is fixed. The worse case, clearly, would be a large number of IP addresses with the same highest order bits.

Related

Temporal complexity of primary instructions in C

I have a question about algorithmic complexity.
Do the basic instructions in C have an equivalent complexity, if not, in what order are they:
if, write/read a single cell of a matrix, a+b, a*b, a = b ...
Thanks

No. The basic instructions in C cannot be ordered by any kind of wall-time or theoretic complexity. This is not specified and probably cannot be specified by the Standard; rather, these properties arise from the interaction of the code, the OS, and the underlying architecture.
I think you're looking for information on cycles per instruction.
However, even this is not the whole story. Modern CPUs have hierarchical caches. If your algorithm operates on data which is primarily in a fast cache, then it will run much faster than a program which operates on data that must be repeatedly accessed from RAM, the hard drive, or over a network. The amount of calculation done per load is an application's arithmetic intensity. Roofline models provide a tool for thinking about this. You can achieve better cache utilization via blocking and other techniques, though the subfield of communication avoiding algorithms explores this in-depth.
Ultimately, the C language is a high-level abstraction of what a processor actually does. In standard cost models we think of all instructions as taking the same amount of time. In more accurate, but potentially more difficult to use, cache-aware cost models, data movement is treated as being more expensive.

Complexity is not about the time it takes to execute "basic" code lines like addition, multiplication, division and so on.
Even if these expressions have different execution time they all have complexity O(1).
Complexity is about what happens when some variable figure changes. That variable figure can be many different things. Some examples could be "the number of element in an array", "the number of elements in a linked list", "the size of a file", "the size of a matrix".
For instance - if you write code that has to find the largest value in an array of integers, the execution time depends on the number of elements in the array. The code will have to visit every array element to check if it's larger than the previous elements. Consequently, the complexity is O(N), where N is the number of elements. From that we can't say how much time it will take to find the largest element but we can say that it will take 10 times longer to execute on a 1000 element array than on a 100 element array.
Now if you did the same with a linked list (i.e. find largest element) the complexity would again be O(N). However, this does not say that a linked list perform just the same as an array. It only says that it scales in the same way as an array.
A simplified way to say it - if there is no loops involved the complexity is always
O(1).

space optimize a large array with many duplicates

I have an array where the index doubles as 'identifier for a collection of items' and the content of the array is a group-number. The group numbers fall into a finite range from 0..N, where N << length_of_the_array. Hence every is entry will be duplicated large number of times. Currently I have to use 2 bytes to represent group number (which can be > 1000 but < 6500 ), which due to the duplicated nature ends up consuming a lot of memory.
Are there ways to space optimize this array as the complete array can get into multiple MBs in size. Appreciate any pointers toward relevant optimization algorithm/technique. FYI: The programming language im using is cpp.

Do you still want efficient random-access to arbitrary elements? Or are you thinking about space-efficient serialization of the index->group map?
If you still want efficient random access, a single array lookup is not bad. It's at worst a single cache miss. Well really, at worst a page fault, or more likely a TLB miss, but that's unlikely if it's only a couple MB).
A sorted and run-length encoded list could be binary-searched (by searching an array of prefix-sums of the repeat-counts), but that only works if you can occasionally sort the list to keep duplicates together.
If the duplicates can't be at least somewhat grouped together, there's not much you can do that allows random access.
Packed 12-bit entries are probably not worth the trouble, unless that was enough to significantly reduce cache misses. A couple multiply instructions to generate the right address, and a shift and mask instruction on the 16b load containing the desired value, is not much overhead compared to a cache miss. Write access to packed bitfields is slower, and isn't atomic, so that's a serious downside. Getting a compiler to pack bitfields using structs can be compiler-specific. Maybe just using a char array would be best.

Is array access always constant-time / O(1)?

From Richard Bird, Pearls of Functional Algorithm Design (2010), page 6:
For a pure functional programmer, an update operation takes logarithmic time in the size of the array. To be fair, procedural programmers also appreciate that constant-time indexing and updating
are only possible when the arrays are small.
Under what conditions do arrays have non-constant-time access? Is this related to CPU cache?

Most modern machine architectures try to offer small unit time access to memory.
They fail. Instead we see layers of caches with differing speeds.
The problem is simple: the speed of light. If you need an enormous memory [array] (in the extreme, imagine a memory the size of the Andromeda galaxy), it will take enormous space, and light cannot cross enormous space in short periods of time. Information cannot travel faster than the speed of light. You are screwed by physics from the start.
So the best you can do is build part of memory "nearby" where light takes only fractions of nanosecond to traverse (thus registers and L1 cache), and part of the memory far away (the disk drive). Other practical complications ensue, such as capacitance (think inertia) to slow down access to things further away.
Now, if you are willing to take the access time of your farthest memory element as "unit" time, yes, access to everything takes the same amount of time, e.g., O(1). In practical computing, we treat RAM memory this way most of the time, and we leave out other, slower devices to avoid screwing up our simple model.
Then you discover people that aren't satisfied with that, and voila, you have people optimizing for cache line access. So it may be O(1) in theory, and acts like O(1) for small arrays (that fit in the first level of cache), but it often is not in practice.
An extreme practical case is an array that doesn't fit in main memory; now an array access may cause paging from the disk.
Sometimes we don't care even in that case. Google is essentially a giant cache. We tend to think of Google searches as O(1).

Can oranges be red?
Yes, they can be red due to a number of reasons -
You color them red.
You grow a genetically modified variety.
You grow them on Mars, the red planet, where every thing is supposed to look red.
The (theoretical) list of some practical (given todays technology) and impractical (fiction / or future reality) goes on...
Point is, that I think the question you are asking, is really about two orthogonal concepts. Namely -
Big O Notation - "In mathematics, big O notation describes the limiting behavior of a function when the argument tends towards a particular value or infinity, usually in terms of simpler functions."
vs
Practicalities (hardware and software) a good software engineer should be aware of, while architecting / designing their app and writing code.
In other words, while the concept of Big O Notation can be called academic, but it is most appropriate way of classifying algorithms complexity (Time / Space).. and that's where it ends. There is no need to muddy the waters with orthogonal concerns.
To be clear, I am not saying that one should not be aware of the under the hood implementation details and workings of things, which affect the performance of the software you write.. but there is no point of mixing the two together. For example, does it make sense to say -
Arrays do not have constant time access (with indexes) because -
Large arrays do not fit in CPU cache, and hence incur high cost of cache misses.
On a system under memory pressure, the array, big or small, has been swapped out from Physical Memory to Hard Disk, and not only is impacted by a cache miss, but also a hard page fault.
On a system under extreme CPU load, the thread which read the supposed array can be pre-empted, and may not get a chance to execute for several seconds.
On a hypothetical OS, which backs its memory not just with disk, but with additional memory on another computer on the other corner of the world, will make array access un-imaginably slow.
Like my apple and orange example, as you read through my increasingly absurd examples, hope the point I am trying to make is clear.
Conclusion - Any day, I'd answer the question "Do Arrays have constant time O(1) access (with indexes)", as yes.. without any doubt or ifs and buts, they do.
EDIT:
Put it another way - If O(1) is not the answer.. then neither is O(log n), or O(n log n), or O(n^2) or O(n ^ 3)..... and certainly not 42.

He is talking about Computation Models, and in particular the word-based RAM machine
A RAM machine is a formalization of something of very similar to an actual computer: we model the computer memory as a big array of memory words of w bits each, and we can read/write any words in O(1) time
But we have yet something important to define: how large should a word be?
We need a word size w ≥ Ω(log n) to be able at least to address the n parts of our input.
For this reason, word-based RAMs normally assume a word length of O(log n)
But having the word length of your machine depends on the size of the input appears strange and unrealistic
What if we keep the word length fixed? Then even following a pointer needs Ω(log n) time just to read the entire pointer
We need Ω(log n) words to store pointer and Ω(log n) time to access input element

if a language supports sparse arrays, access to the array would have to go through a directory, and a tree-structured directory would have non-linear access time. Or did you mean realistic conditions? ;-)

Transferring large variable amount of memory from Cuda

Cuda is awesome and I'm using it like crazy but I am not using her full potential because I'm having a issue transferring memory and was wondering if there was a better way to get a variable amount of memory out. Basically I send 65535 item array into Cuda and Cuda analyzes each data item around 20,000 different ways and if there's a match in my programs logic then it saves a 30 int list as a result. Think of my logic of analyzing each different combination and then looking at the total and if the total is equal to a number I'm looking for then it saves the results (which is a 30 int list for each analyzed item).
The problem is 65535 (blocks/items in data array) * 20000 (total combinations tested per item) = 1,310,700,000. This means I need to create a array of that size to deal with the chance that all the data will be a positive match (which is extremely unlikely and creating int output[1310700000][30] seems crazy for memory). I've been forced to make it smaller and send less blocks to process because I don't know how if Cuda can write efficiently to a linked list or a dynamically sized list (with this approach the it writes the output to host memory using block * number_of_different_way_tests).
Is there a better way to do this? Can Cuda somehow write to free memory that is not derived from the blockid? When I test this process on the CPU, less then 10% of the item array have a positive match so its extremely unlikely I'll use so much memory each time I send work to the kernel.
p.s. I'm looking above and although it's exactly what I'm doing, if it's confusing then another way of thinking about it (not exactly what I'm doing but good enough to understand the problem) is I am sending 20,000 arrays (that each contain 65,535 items) and adding each item with its peer in the other arrays and if the total equals a number (say 200-210) then I want to know the numbers it added to get that matching result.
If the numbers are very widely range then not all will match but using my approach I'm forced to malloc that huge amount of memory. Can I capture the results with mallocing less memory? My current approach to is malloc as much as I have free but I'm forced to run less blocks which isn't efficient (I want to run as many blocks and threads a time because I like the way Cuda organizes and runs the blocks). Is there any Cuda or C tricks I can use for this or I'm a stuck with mallocing the max possible results (and buying a lot more memory)?

As Per Roger Dahl's great answer:
The functionality you're looking for is called stream compaction.
You probably do need to provide an array that contains room for 4 solutions per thread because attempting to directly store the results in a compact form is likely to create so many dependencies between the threads that the performance gained in being able to copy less data back to the host is lost by a longer kernel execution time. The exception to this is if almost all of the threads find no solutions. In that case, you might be able to use an atomic operation to maintain an index into an array. So, for each solution that is found, you would store it in an array at an index and then use an atomic operation to increase the index. I think it would be safe to use atomicAdd() for this. Before storing a result, the thread would use atomicAdd() to increase the index by one. atomicAdd() returns the old value, and the thread can store the result using the old value as the index.
However, given a more common situation, where there's a fair number of results, the best solution will be to perform a compacting operation as a separate step. One way to do this is with thrust::copy_if. See this question for some more background.

How to sort a very large array in C

I want to sort on the order of four million long longs in C. Normally I would just malloc() a buffer to use as an array and call qsort() but four million * 8 bytes is one huge chunk of contiguous memory.
What's the easiest way to do this? I rate ease over pure speed for this. I'd prefer not to use any libraries and the result will need to run on a modest netbook under both Windows and Linux.

Just allocate a buffer and call qsort. 32MB isn't so very big these days even on a modest netbook.
If you really must split it up: sort smaller chunks, write them to files, and merge them (a merge takes a single linear pass over each of the things being merged). But, really, don't. Just sort it.
(There's a good discussion of the sort-and-merge approach in volume 2 of Knuth, where it's called "external sorting". When Knuth was writing that, the external data would have been on magnetic tape, but the principles aren't very different with discs: you still want your I/O to be as sequential as possible. The tradeoffs are a bit different with SSDs.)

32 MB? thats not too big.... quicksort should do the trick.

Your best option would be to prevent having the data unordered if possible. Like it has been mentioned, you'd be better of reading the data from disk (or network or whatever the source) directly into a selforganizing container (a tree, perhaps std::set will do).
That way, you'll never have to sort through the lot, or have to worry about memory management. If you know the required capacity of the container, you might squeeze out additional performance by using std::vector(initialcapacity) or call vector::reserve up front.
You'd then best be advised to use std::make_heap to heapify any existing elements, and then add element by element using push_heap (see also pop_heap). This essentially is the same paradigm as the self-ordering set but
duplicates are ok
the storage is 'optimized' as a flat array (which is perfect for e.g. shared memory maps or memory mapped files)
(Oh, minor detail, note that sort_heap on the heap takes at most N log N comparisons, where N is the number of elements)
Let me know if you think this is an interesting approach. I'd really need a bit more info on the use case