I want to sort on the order of four million long longs in C. Normally I would just malloc() a buffer to use as an array and call qsort() but four million * 8 bytes is one huge chunk of contiguous memory.
What's the easiest way to do this? I rate ease over pure speed for this. I'd prefer not to use any libraries and the result will need to run on a modest netbook under both Windows and Linux.
Just allocate a buffer and call qsort. 32MB isn't so very big these days even on a modest netbook.
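For what it's worth, a minimal sketch of that approach might look like the following (the comparator avoids the classic overflow bug of returning x - y):

#include <stdio.h>
#include <stdlib.h>

/* Compare two long longs without the overflow risk of returning x - y. */
static int cmp_ll(const void *a, const void *b)
{
    long long x = *(const long long *)a;
    long long y = *(const long long *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    size_t n = 4000000;                    /* ~32 MB of long longs */
    long long *buf = malloc(n * sizeof *buf);
    if (!buf) {
        perror("malloc");
        return 1;
    }
    /* ... fill buf with your data here ... */
    qsort(buf, n, sizeof *buf, cmp_ll);
    free(buf);
    return 0;
}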
If you really must split it up: sort smaller chunks, write them to files, and merge them (a merge takes a single linear pass over each of the things being merged). But, really, don't. Just sort it.
(There's a good discussion of the sort-and-merge approach in volume 2 of Knuth, where it's called "external sorting". When Knuth was writing that, the external data would have been on magnetic tape, but the principles aren't very different with discs: you still want your I/O to be as sequential as possible. The tradeoffs are a bit different with SSDs.)
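If you do end up splitting the data, the merge step for two sorted runs of long longs stored in binary files can look roughly like this (a sketch only; error handling is deliberately minimal):

#include <stdio.h>

/* Merge two binary files of sorted long longs into one sorted output file. */
static void merge_runs(const char *in1, const char *in2, const char *out)
{
    FILE *a = fopen(in1, "rb"), *b = fopen(in2, "rb"), *o = fopen(out, "wb");
    if (!a || !b || !o)
        return;                                   /* real code would report this */

    long long x, y;
    int have_x = fread(&x, sizeof x, 1, a) == 1;
    int have_y = fread(&y, sizeof y, 1, b) == 1;

    while (have_x && have_y) {
        if (x <= y) {
            fwrite(&x, sizeof x, 1, o);
            have_x = fread(&x, sizeof x, 1, a) == 1;
        } else {
            fwrite(&y, sizeof y, 1, o);
            have_y = fread(&y, sizeof y, 1, b) == 1;
        }
    }
    while (have_x) { fwrite(&x, sizeof x, 1, o); have_x = fread(&x, sizeof x, 1, a) == 1; }
    while (have_y) { fwrite(&y, sizeof y, 1, o); have_y = fread(&y, sizeof y, 1, b) == 1; }

    fclose(a); fclose(b); fclose(o);
}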
32 MB? That's not too big... quicksort should do the trick.
Your best option would be to prevent having the data unordered in the first place, if possible. As has been mentioned, you'd be better off reading the data from disk (or the network, or whatever the source is) directly into a self-organizing container (a tree; perhaps std::set will do).
That way, you'll never have to sort the whole lot, or worry about memory management. If you know the required capacity of the container, you might squeeze out additional performance by using std::vector(initialcapacity) or calling vector::reserve up front.
You'd then be best advised to use std::make_heap to heapify any existing elements, and then add element by element using push_heap (see also pop_heap). This is essentially the same paradigm as the self-ordering set, but
duplicates are ok
the storage is 'optimized' as a flat array (which is perfect for e.g. shared memory maps or memory mapped files)
(Oh, minor detail: note that sort_heap on the heap takes O(N log N) comparisons, where N is the number of elements.)
Let me know if you think this is an interesting approach. I'd really need a bit more info on the use case.
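Since the original question is about C rather than C++, here is roughly what the push_heap step looks like when hand-rolled over a flat array of long longs; this is a minimal sketch only, and the std:: algorithms above do the equivalent work for you:

#include <stddef.h>

/* Plain-C analogue of push_heap: append 'value' at heap[n] and sift it up,
   maintaining a max-heap in the flat array heap[0..n]. */
static void heap_push(long long *heap, size_t n, long long value)
{
    size_t i = n;
    heap[i] = value;
    while (i > 0) {
        size_t parent = (i - 1) / 2;
        if (heap[parent] >= heap[i])
            break;
        long long tmp = heap[parent];
        heap[parent] = heap[i];
        heap[i] = tmp;
        i = parent;
    }
}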
I need to keep track of a lot of boolean-esque data in C. I'm writing a toy kernel and need to store data on whether a certain memory address is used or free. Because of this, I need to store and traverse through this data in the fastest, most efficient way possible. Since I am writing the kernel from scratch I cannot use the C Standard Library. What's the best, fastest, most efficient way to organize, traverse through and modify a large series of boolean-esque data without using the C Standard Library? E.g. would a bitmap or an array or linked list take the least amount of resources to traverse and modify?
Many filesystems have the same problem: indicating whether an allocation unit (a group of disk sectors) is available or not. Except for MSDOS's FAT, I think all use a bitmap. Definitely NTFS and Linux's ext/ext2/ext3/ext4 use bitmaps.
There are several easy optimizations to make. If more than 8/16/32/64 sequential units are needed for an allocation, checking that many bits at once is simple using the corresponding integer size. If a bit being zero means "available", then testing for a zero integer tells whether the whole allocation is available. However, boundary optimizations might need to be considered.
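A freestanding-C sketch of the bitmap approach (only <stdint.h> and <stddef.h> are needed, both of which are available without the standard library; the zero-means-free convention here is an assumption):

#include <stddef.h>
#include <stdint.h>

#define BITS_PER_WORD 64

/* Mark allocation unit 'idx' as used. */
static void bitmap_set(uint64_t *map, size_t idx)
{
    map[idx / BITS_PER_WORD] |= (uint64_t)1 << (idx % BITS_PER_WORD);
}

/* Mark allocation unit 'idx' as free. */
static void bitmap_clear(uint64_t *map, size_t idx)
{
    map[idx / BITS_PER_WORD] &= ~((uint64_t)1 << (idx % BITS_PER_WORD));
}

/* Nonzero if allocation unit 'idx' is used. */
static int bitmap_test(const uint64_t *map, size_t idx)
{
    return (map[idx / BITS_PER_WORD] >> (idx % BITS_PER_WORD)) & 1;
}

/* Quick check: are all 64 units covered by word 'w' free? */
static int word_all_free(const uint64_t *map, size_t w)
{
    return map[w] == 0;
}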
I have 200 million records, some of which have variable sized fields (string, variable length array etc.). I need to perform some filters, aggregations etc. on them (analytics oriented queries).
I want to just lay them all out in memory (enough to fit in a big box) and then do linear scans on them. There are two approaches I can take, and I want to hear your opinions on which is better for maximizing speed:
Using an array of structs with char* and int* etc. to deal with variable length fields
Use a large byte array, scan the byte array like a binary stream, and then parse the records
Which approach would you recommend?
Update: Using C.
The unfortunate answer is that "it depends on the details which you haven't provided" which, while true, is not particularly useful. The general advice to approaching a problem like this is to start with the simplest/most obvious design and then profile and optimize it as needed. If it really matters you can start with a few very basic benchmark tests of a few designs using your exact data and use-cases to get a more accurate idea of what direction you should take.
Looking in general at a few specific designs and their general pros/cons:
One Big Buffer
char* pBuffer = malloc(200000000);
Assumes your data can fit into memory all at once.
Would work better for all text (or mostly text) data.
Wouldn't be my first choice for large data as it just mirrors the data on the disk. Better just to use the hardware/software file cache/read ahead and read data directly from the drive, or map it if needed.
For linear scans this is a good format but you lose a bit if it requires complex parsing (especially if you have to do multiple scans).
Potential for the least overhead assuming you can pack the structures one after the other.
Static Structure
typedef struct {
    char Data1[32];
    int Data2[10];
} myStruct;
myStruct *pData = malloc(sizeof(myStruct)*200000000);
Simplest design and likely the best potential for speed at a cost of memory (without actual profiling).
If your variable length arrays have a wide range of sizes you are going to waste a lot of memory. Since you have 200 million records you may not have enough memory to use this method.
For a linear scan this is likely the best memory structure due to memory cache/prefetching.
Dynamic Structure
typedef struct {
    char* pData1;
    int* pData2;
} myStruct2;
myStruct2 *pData = malloc(sizeof(myStruct2)*200000000);
With 200 million records, this is going to require a lot of dynamic memory allocations, which is very likely to have a significant impact on speed.
Has the potential to be more memory efficient if your dynamic arrays have a wide range of sizes (though see next point).
Be aware of the overhead of the pointer sizes. On a 32-bit system this structure needs 8 bytes (ignoring padding) to store the pointers which is 1.6 GB alone for 200 million records! If your dynamic arrays are generally small (or empty) you may be spending more memory on the overhead than the actual data.
For a linear scan of data this type of structure will probably perform poorly as you are accessing memory in a non-linear manner which cannot be predicted by the prefetcher.
Streaming
If you only need to do one scan of the data then I would look at a streaming solution where you read a small bit of the data at a time from the file.
Works well with very large data sets that wouldn't fit into memory.
Main limitation here is the disk read speed and how complex your parsing is.
Even if you have to do multiple passes with file caching this might be comparable in speed to the other methods.
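A rough sketch of the streaming approach, assuming a hypothetical length-prefixed record format (your real on-disk layout will differ):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Hypothetical layout: a 4-byte payload length (native byte order),
   followed by that many bytes of record data. */
static int process_stream(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    uint32_t len;
    unsigned char *buf = NULL;
    size_t cap = 0;

    while (fread(&len, sizeof len, 1, f) == 1) {
        if (len == 0)
            continue;                          /* empty record, nothing to read */
        if (len > cap) {                       /* grow the scratch buffer as needed */
            unsigned char *p = realloc(buf, len);
            if (!p)
                break;
            buf = p;
            cap = len;
        }
        if (fread(buf, 1, len, f) != len)
            break;
        /* ... parse buf[0..len-1] and update filters/aggregates here ... */
    }

    free(buf);
    fclose(f);
    return 0;
}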
Which of these is "best" really depends on your specific case...I can think of situations where each one would be the preferred method.
You could use structs, indeed, but you'd have to be very careful about alignments and aliasing, and it would all need patching up when there's a variable length section. In particular, you could not use an array of such structs because all entries in an array must be constant size.
I suggest the flat array approach. Then add a healthy dose of abstraction; you don't want your "business logic" doing bit-twiddling.
Better still, if you need to do a single linear scan over the whole data set then you should treat it like a data stream, and de-serialize (copy) the records into proper, native structs, one at a time.
"Which approach would you recommend?" Neither actually. With this amount of data my recommendation would be something like a linked list of your structs. However, if you are 100% sure that you will be able to allocate the required amount of memory (with 1 malloc call) for all your data, then use array of structs.
I need to allocate memory of order of 10^15 to store integers which can be of long long type.
If I use an array and declare something like
long long a[1000000000000000];
that's never going to work. So how can I allocate such a huge amount of memory?
Really large arrays generally aren't a job for memory, more one for disk. 10^15 array elements at 64 bits apiece is (I think) 8 petabytes. You can pick up 8 GB memory slices for about $15 at the moment so, even if your machine could handle that much memory or address space, you'd be outlaying about $15 million.
In addition, with upcoming DDR4 being clocked at up to about 4 GT/s (giga-transfers per second), even if each transfer was a 64-bit value, it would still take about 250,000 seconds just to initialise that array to zero. Do you really want to be waiting around for the better part of three days before your code even starts doing anything useful?
And, even if you go the disk route, that's quite a bit. At (roughly) $50 per TB, you're still looking at $400,000 and you'll possibly have to provide your own software for managing those 8,000 disks somehow. And I'm not even going to contemplate figuring out how long it would take to initialise the array on disk.
You may want to think about rephrasing your question to indicate the actual problem rather than what you currently have, a proposed solution. It may be that you don't need that much storage at all.
For example, if you're talking about an array where many of the values are left at zero, a sparse array is one way to go.
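As a sketch of what a sparse array can look like in C, here is a minimal chained hash map from a 64-bit index to a long long value, storing only the entries that differ from zero (the bucket count is an arbitrary choice for illustration):

#include <stdlib.h>
#include <stdint.h>

typedef struct node {
    uint64_t index;
    long long value;
    struct node *next;
} node;

#define NBUCKETS (1u << 20)              /* ~1M buckets; tune to your data */

static node *buckets[NBUCKETS];

static void sparse_set(uint64_t index, long long value)
{
    size_t b = index % NBUCKETS;
    for (node *n = buckets[b]; n; n = n->next)
        if (n->index == index) { n->value = value; return; }

    node *n = malloc(sizeof *n);
    if (!n)
        return;                          /* real code would report this */
    n->index = index;
    n->value = value;
    n->next = buckets[b];
    buckets[b] = n;
}

static long long sparse_get(uint64_t index)
{
    size_t b = index % NBUCKETS;
    for (node *n = buckets[b]; n; n = n->next)
        if (n->index == index)
            return n->value;
    return 0;                            /* absent entries read as zero */
}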
You can't. You don't have all this memory, and you won't have it for a while. Simple.
EDIT: If you really want to work with data that does not fit into your RAM, you can use a library that works with mass-storage data, like stxxl, but it will be a lot slower, and you are still limited by disk size.
MPI is what you need; that's actually a small size for parallel computing problems. The Blue Gene/Q monster at Lawrence Livermore National Laboratory holds around 1.5 PB of RAM. You need to use block decomposition to divide up your problem, and voilà!
The basic approach is dividing the array into equal blocks or chunks among many processors.
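That said, MPI alone doesn't conjure up the RAM; each rank still needs enough local memory for its own block. A minimal sketch of the block decomposition (the 10^15 total is taken from the question; everything else is illustrative):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long long total = 1000000000000000LL;      /* 10^15 elements overall */
    long long chunk = total / size;
    long long start = (long long)rank * chunk;
    long long end   = (rank == size - 1) ? total : start + chunk;

    /* Each rank allocates and works on only its own [start, end) block. */
    printf("rank %d handles elements [%lld, %lld)\n", rank, start, end);

    MPI_Finalize();
    return 0;
}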
You need to upgrade to a 64-bit system, get a 64-bit-capable compiler, and then put an LL at the end of 1000000000000000.
Have you heard of sparse matrix implementations? With a sparse matrix, you only use a very small part of the matrix, despite the matrix itself being huge.
There are several libraries for this, and any basic introduction to sparse matrices will cover the idea. You don't actually store all of it, just the few points you need.
CUDA is awesome and I'm using it like crazy, but I'm not using its full potential because I'm having an issue transferring memory, and I was wondering if there was a better way to get a variable amount of memory out. Basically I send a 65,535-item array into CUDA, and CUDA analyzes each data item in around 20,000 different ways; if there's a match in my program's logic, then it saves a 30-int list as a result. Think of my logic as analyzing each different combination and then looking at the total; if the total is equal to a number I'm looking for, then it saves the results (which is a 30-int list for each analyzed item).
The problem is 65,535 (blocks/items in the data array) * 20,000 (total combinations tested per item) = 1,310,700,000. This means I need to create an array of that size to cover the chance that all the data will be a positive match (which is extremely unlikely, and creating int output[1310700000][30] seems crazy for memory). I've been forced to make it smaller and send fewer blocks to process, because I don't know whether CUDA can write efficiently to a linked list or a dynamically sized list (with this approach it writes the output to host memory using block * number_of_different_way_tests).
Is there a better way to do this? Can CUDA somehow write to free memory that is not derived from the block ID? When I test this process on the CPU, less than 10% of the item array has a positive match, so it's extremely unlikely I'll use that much memory each time I send work to the kernel.
P.S. Looking at the above, it's exactly what I'm doing, but if it's confusing, another way of thinking about it (not exactly what I'm doing, but good enough to understand the problem) is that I'm sending 20,000 arrays (each containing 65,535 items), adding each item to its peer in the other arrays, and if the total equals a number (say 200-210), then I want to know the numbers that were added to get that matching result.
If the numbers range very widely, then not all will match, but with my approach I'm forced to malloc that huge amount of memory. Can I capture the results while mallocing less memory? My current approach is to malloc as much as I have free, but then I'm forced to run fewer blocks, which isn't efficient (I want to run as many blocks and threads at a time as possible, because I like the way CUDA organizes and runs the blocks). Are there any CUDA or C tricks I can use for this, or am I stuck with mallocing the maximum possible number of results (and buying a lot more memory)?
As per Roger Dahl's great answer:
The functionality you're looking for is called stream compaction.
You probably do need to provide an array that contains room for 4 solutions per thread because attempting to directly store the results in a compact form is likely to create so many dependencies between the threads that the performance gained in being able to copy less data back to the host is lost by a longer kernel execution time. The exception to this is if almost all of the threads find no solutions. In that case, you might be able to use an atomic operation to maintain an index into an array. So, for each solution that is found, you would store it in an array at an index and then use an atomic operation to increase the index. I think it would be safe to use atomicAdd() for this. Before storing a result, the thread would use atomicAdd() to increase the index by one. atomicAdd() returns the old value, and the thread can store the result using the old value as the index.
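The same index-reservation pattern, sketched with C11 atomics on the CPU rather than CUDA's atomicAdd(), purely to show the idea (atomic_fetch_add likewise returns the old value, which becomes the slot index; MAX_RESULTS is a hypothetical cap):

#include <stdatomic.h>
#include <string.h>

#define MAX_RESULTS 1000000       /* hypothetical upper bound on stored matches */
#define RESULT_INTS 30

static int results[MAX_RESULTS][RESULT_INTS];
static atomic_int result_count;   /* statically zero-initialized */

/* Called by any worker that finds a solution. */
static void store_result(const int solution[RESULT_INTS])
{
    int idx = atomic_fetch_add(&result_count, 1);   /* reserve a slot; returns old count */
    if (idx < MAX_RESULTS)
        memcpy(results[idx], solution, RESULT_INTS * sizeof(int));
}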
However, given a more common situation, where there's a fair number of results, the best solution will be to perform a compacting operation as a separate step. One way to do this is with thrust::copy_if. See this question for some more background.
I am writing a program in C to solve an optimisation problem, for which I need to create an array of type float with on the order of 10^13 elements. Is it practically possible to do so on a machine with 20 GB of memory?
A float in C occupies 4 bytes (assuming IEEE floating point arithmetic, which is pretty close to universal nowadays). That means 10^13 elements are naïvely going to require 4×10^13 bytes of space. That's quite a bit (40 TB, a.k.a. quite a lot of disk for a desktop system, and rather more than most people can afford when it comes to RAM) so you need to find another approach.
Is the data sparse (i.e., mostly zeroes)? If it is, you can try using a hash table or tree to store only the values which are anything else; if your data is sufficiently sparse, that'll let you fit everything in. Also be aware that processing 10^13 elements will take a very long time. Even if you could process a billion items a second (very fast, even now) it would still take 10^4 seconds (several hours) and I'd be willing to bet that in any non-trivial situation you'll not be able to get anything near that speed. Can you find some way to make not just the data storage sparse but also the processing, so that you can leave that massive bulk of zeroes alone?
Of course, if the data is non-sparse then you're doomed. In that case, you might need to find a smaller, more tractable problem instead.
I suppose if you had a 64 bit machine with a lot of swap space, you could just declare an array of size 10^13 and it may work.
But for a data set of this size it becomes important to consider carefully the nature of the problem. Do you really need random access read and write operations for all 10^13 elements? Is the array at all sparse? Could you express this as a map/reduce problem? If so, sequential access to 10^13 elements is much more practical than random access.