How to manipulate *huge* amounts of data - arrays

I'm having the following problem. I need to store a huge amount of information (~32 GB) and be able to manipulate it as fast as possible. I'm wondering what the best way to do it is (combinations of programming language + OS + whatever you think is important).
The structure of the information I'm using is a 4D array (NxNxNxN) of double-precision floats (8 bytes). Right now my solution is to slice the 4D array into 2D arrays and store them in separate files on the HDD of my computer. This is really slow and the manipulation of the data is unbearable, so this is no solution at all!
I'm thinking of moving to a supercomputing facility in my country and storing all the information in RAM, but I'm not sure how to implement an application to take advantage of it (I'm not a professional programmer, so any book/reference will help me a lot).
An alternative solution I'm considering is to buy a dedicated server with lots of RAM, but I don't know for sure if that will solve the problem. So right now my ignorance doesn't let me choose the best way to proceed.
What would you do if you were in this situation? I'm open to any idea.
Thanks in advance!
EDIT: Sorry for not providing enough information, I'll try to be more specific.
I'm storing a discretized 4D mathematical function. The operations I would like to perform include transposition of the array (changing b[i,j,k,l] = a[j,i,k,l] and the like), array multiplication, etc.
As this is a simulation of a proposed experiment, the operations will be applied only once. Once the result is obtained it won't be necessary to perform more operations on the data.
EDIT (2):
I would also like to be able to store more information in the future, so the solution should be scalable. The current 32 GB goal is because I want to have the array with N=256 points, but it would be better if I could use N=512 (which means 512 GB to store it!!).

Amazon's "High Memory Extra Large Instance" is only $1.20/hr and has 34 GB of memory. You might find it useful, assuming you're not running this program constantly.

Any decent answer will depend on how you need to access the data. Random access? Sequential access?
32GB is not really that huge.
How often do you need to process your data? Once per (lifetime | year | day | hour | nanosecond)? Often, stuff only needs to be done once. This has a profound effect on how much you need to optimize your solution.
What kind of operations will you be performing (you mention multiplication)? Can the data be split up into chunks, such that all necessary data for a set of operations is contained in a chunk? This will make splitting it up for parallel execution easier.
Many machines you can buy these days have enough RAM to hold your 32 GB in memory. You won't need a supercomputer just for that.

As Chris pointed out, it comes down to what you are going to do with the data.
Besides, I think storing it in a (relational) database will be faster than reading it from the hard drive, since the RDBMS will perform some optimizations for you, like caching.

If you can represent your problem as MapReduce, consider a clustering system optimized for disk access, such as Hadoop.
Your description sounds more math-intensive, in which case you probably want to have all your data in memory at once. 32 GB of RAM in a single machine is not unreasonable; Amazon EC2 offers virtual servers with up to 68 GB.

Without more information: if you need the quickest possible access to all the data, I would go with C for the programming language, some flavor of *nix as the OS, and buying RAM; it's relatively cheap now. This also depends on what you are familiar with; you can go the Windows route as well. But as others have mentioned, it will depend on how you are using this data.

So far, there are a lot of very different answers. There are two good starting points mentioned above: David suggests some hardware and someone mentions learning C. Both of these are good points.
C is going to get you what you need in terms of speed and direct memory paging. The last thing you want to do is perform linear searches on the data. That would be slow - slow - slow.
Determine your workflow. If your workflow is linear, that is one thing. If it is not, I would design a binary tree referencing pages in memory. There is plenty of information on B-trees on the Internet. In addition, B-trees will be much easier to work with in C, since you will also be able to set up and manipulate your memory paging.

Depending on your use, some mathematical and physical problems tend to be mostly zeros (for example, Finite Element models). If you expect that to be true for your data, you can get serious space savings by using a sparse matrix instead of actually storing all those zeros in memory or on disk.
Check out wikipedia for a description, and to decide if this might meet your needs:
http://en.wikipedia.org/wiki/Sparse_matrix
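As a rough illustration, here is a minimal coordinate-list (COO) sketch in C; the Entry/Sparse4D names and the growth policy are made up, but the idea is simply to store only the nonzeros together with their indices:

#include <stdlib.h>

/* One nonzero entry of the 4D array, in coordinate-list (COO) form. */
typedef struct {
    unsigned short i, j, k, l;   /* 16-bit indices suffice for N <= 65535 */
    double value;
} Entry;

typedef struct {
    Entry *entries;   /* only the nonzeros; everything else is implicitly 0.0 */
    size_t count;
    size_t capacity;
} Sparse4D;

/* Record a nonzero, growing the storage geometrically as needed. */
static void sparse_set(Sparse4D *m, unsigned short i, unsigned short j,
                       unsigned short k, unsigned short l, double v)
{
    if (m->count == m->capacity) {
        m->capacity = m->capacity ? 2 * m->capacity : 1024;
        m->entries = realloc(m->entries, m->capacity * sizeof(Entry));
    }
    m->entries[m->count++] = (Entry){ i, j, k, l, v };
}

If only, say, 1% of the 256^4 entries were nonzero, this would hold roughly 43 million 16-byte entries (about 0.7 GB) instead of 32 GB.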

Here's another idea:
Try using an SSD to store your data. Since you're grabbing very small amounts of random data, an SSD would probably be much, much faster.

You may want to try using mmap instead of reading the data into memory, though a 32 GB mapping requires a 64-bit address space.
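For what it's worth, a minimal C sketch of the mmap approach (the file name is hypothetical). The mapping only reserves address space; the kernel pages data in from disk on demand, which is why the 64-bit requirement matters:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);    /* hypothetical data file */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    /* Map the whole file read-only.  Nothing is read yet: the kernel
       pages pieces in on demand, so a 32 GB file needs a 64-bit
       address space, not 32 GB of RAM. */
    double *a = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (a == MAP_FAILED) { perror("mmap"); return 1; }

    printf("first element: %f\n", a[0]);

    munmap(a, st.st_size);
    close(fd);
    return 0;
}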

The whole field of database technology is about manipulating huge amounts of data that can't fit in RAM, so that might be your starting point (i.e., get a good book on DBMS principles and read about indexing, query execution, etc.).
A lot depends on how you need to access the data. If you absolutely need to jump around and access random bits of information, you're in trouble, but perhaps you can structure your processing of the data such that you scan it along one axis (dimension). Then you can use a smaller buffer, continuously dumping already-processed data and reading in new data, as in the sketch below.
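A rough C sketch of that streaming pattern, assuming N=256 from the question and placeholder file names; only one NxN slab (512 KB) is ever in memory:

#include <stdio.h>
#include <stdlib.h>

#define N 256   /* assumed grid size from the question */

/* Placeholder for whatever per-slab work is needed. */
static void process_slab(double *slab) { (void)slab; }

int main(void)
{
    FILE *in  = fopen("a.bin", "rb");   /* hypothetical input file  */
    FILE *out = fopen("b.bin", "wb");   /* hypothetical output file */
    double *slab = malloc((size_t)N * N * sizeof(double));

    /* The NxNxNxN file is N*N consecutive NxN slabs; stream them one
       at a time so only one slab is held in memory at once. */
    for (size_t s = 0; s < (size_t)N * N; s++) {
        if (fread(slab, sizeof(double), (size_t)N * N, in) != (size_t)N * N)
            break;                      /* short read: stop */
        process_slab(slab);
        fwrite(slab, sizeof(double), (size_t)N * N, out);
    }

    free(slab);
    fclose(in);
    fclose(out);
    return 0;
}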

For transpositions, it's faster to just change your understanding of which index is which. By that I mean you leave the data where it is and instead wrap it in an accessor delegate that turns a request for b[i][j][k][l] into a fetch (or update) of a[j][i][k][l].
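A sketch of that in C, assuming a row-major flat layout (idx and b_get are made-up names):

#include <stddef.h>

#define N 256   /* assumed grid size */

/* Row-major flat index into an NxNxNxN array. */
static inline size_t idx(size_t i, size_t j, size_t k, size_t l)
{
    return ((i * N + j) * N + k) * N + l;
}

/* "Transposed" read: b[i][j][k][l] is served from a[j][i][k][l]
   without moving a single byte of the underlying data. */
static inline double b_get(const double *a, size_t i, size_t j,
                           size_t k, size_t l)
{
    return a[idx(j, i, k, l)];
}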

Would it be possible to solve it by this procedure?
First create M child processes and execute them in parallel. Each process will run on a dedicated core of a cluster and will load some portion of the array into the RAM of that core.
A parent process will be the manager of the array, calling (or connecting to) the appropriate child process to obtain certain chunks of data.
Will this be faster than the HDD storage approach? Or am I cracking a nut with a sledgehammer?

The first thing that I'd recommend is picking an object-oriented language, and develop or find a class that lets you manipulate a 4-D array without concern for how it's actually implemented.
The actual implementation of this class would probably use memory-mapped files, simply because that can scale from low-power development machines up to the actual machine where you want to run production code (I'm assuming that you'll want to run this many times, so that performance is important -- if you can let it run overnight, then a consumer PC may be sufficient).
Finally, once I had my algorithms and data debugged, I would look into buying time on a machine that could hold all the data in memory. Amazon EC2, for instance, will provide you with a machine that has 68 GB of memory for $US 2.40 an hour (less if you play with spot instances).

How to handle processing large amounts of data typically revolves around the following factors:
Data access order / locality of reference: Can the data be separated into independent chunks that are then processed either independently or in a serial/sequential fashion, versus random access to the data with little or no order?
CPU vs I/O bound: Is the processing time spent more on computation with the data or reading/writing it from/to storage?
Processing frequency: Will the data be processed only once, every few weeks, daily, etc?
If the data access order is essentially random, you will need either to get access to as much RAM as possible and/or find a way to at least partially organize the order so that not as much of the data needs to be in memory at the same time. Virtual memory systems slow down very quickly once physical RAM limits are exceeded and significant swapping occurs. Resolving this aspect of your problem is probably the most critical issue.
Other than the data access order issue above, I don't think your problem has significant I/O concerns. Reading/writing 32 GB is usually measured in minutes on current computer systems, and even data sizes up to a terabyte should not take more than a few hours.
Programming language choice is actually not critical so long as it is a compiled language with a good optimizing compiler and decent native libraries: C++, C, C#, or Java are all reasonable choices. The most computationally and I/O-intensive software I've worked on has actually been in Java and deployed on high-performance supercomputing clusters with a few thousand CPU cores.

Related

Reading quickly from many random points of a file

I am in the process of developing a performance-critical network service in Rust. A request to my service looks like a vector ids: Vec<u64> of numerical ids. For each id in ids, my service must read the id-th record from a long sequence of records stored contiguously on an SSD. Because all records have the same size RECORD_SIZE (in practice, around 6 KB), the position of every record is entirely predictable, so a trivial solution reduces to
// Assumes `file` is a mutable, already-opened std::fs::File
// and RECORD_SIZE is a usize constant.
let mut records = Vec::with_capacity(ids.len());
for id in ids {
    file.seek(SeekFrom::Start(id * RECORD_SIZE as u64)).unwrap();
    let mut record = vec![0u8; RECORD_SIZE];
    file.read_exact(&mut record).unwrap();
    records.push(record);
}
// Do something with `records`
Now, sadly, the following apply:
The elements of ids are non-contiguous, unpredictable, and unstructured; effectively, they are distributed uniformly at random in the range [0, N].
N is way too large for me to store the entire file in memory.
ids.len() is much smaller than N, so I cannot efficiently cycle through the file linearly without having 99% of my reads be for records that have nothing to do with ids.
Now, reading the specs, the raw QD32 IOPS of my SSD should allow me to collect all records in time (i.e., before the next request comes). But what I observe with my trivial implementation is much, much worse. I suspect that this is because it is effectively a QD1 implementation:
Read something from disk at a random location.
Wait for the data to arrive, store it in RAM.
Read the next thing from disk at another, independent location.
Now, the thing is I know all ids at the very beginning, and I would love it if there was a way to specify:
As much in parallel as possible, read all the locations relevant to each element of ids.
When that is done, carry on doing something on everything.
I am wondering if there is an easy way to get this done in Rust. I scouted for file.parallel_read-like functions in the standard library, for useful crates on crates.io, but to no avail. Which puzzles me because this should be a relatively common problem in a server / database setting. Am I missing something?
Depending on the architecture you're targeting, there is the posix_fadvise syscall:
Programs can use posix_fadvise() to announce an intention to access file data in a specific pattern in the future, thus allowing the kernel to perform appropriate optimizations.
You would pass the offset, RECORD_SIZE, and probably the POSIX_FADV_WILLNEED advice. Both the function and the constant are available in the libc crate. The same idea works with memory-mapped files using posix_madvise() and POSIX_MADV_WILLNEED, as hinted in the comments.
You will then need to do some performance tuning to determine how far ahead to make these calls: not early enough and the data isn't there when you want it; too early and you're needlessly adding pressure on your system memory.
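The call itself is plain POSIX C (the libc crate simply re-exports it), so here is the idea sketched in C, with RECORD_SIZE assumed to be the question's ~6 KB:

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stddef.h>

#define RECORD_SIZE 6144   /* assumed: the question's ~6 KB record size */

/* Hint the kernel to start fetching every record we will need; the
   readahead then overlaps with our own processing instead of the
   one-request-at-a-time (QD1) pattern of plain seek + read. */
static void prefetch_records(int fd, const unsigned long long *ids, size_t n)
{
    for (size_t i = 0; i < n; i++)
        posix_fadvise(fd, (off_t)ids[i] * RECORD_SIZE,
                      RECORD_SIZE, POSIX_FADV_WILLNEED);
}

After issuing the hints you read the records as before; by then the kernel has (ideally) already pulled them into the page cache.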

Decompression algorithms that can work virtually without RAM (LZ-like if possible)

edit: I'll try to rephrase this to make it as clear as I can :)
I need to find a suitable way / choose a suitable compression method to store a blob of data (approx. 900 KB) in a ROM where the available amount of free space is only about 700 KB. If I compress the blob with some modern compression tool (e.g., WinZIP/WinRAR) I can achieve the required compression easily.
The matter here is that the decompression will take place on a very VERY VERY limited hardware where I can't afford to have more than few bytes of RAM available (say no more than 100 bytes, for the sake of it).
I already tried RLE'ing the data... the data hardly compresses.
While I'm working trying to change the data blob format so that it could potentially have more redundancy and achieve better compression ratio, I'm at the same time seeking a compression method that will enable me to decompress on my limited hardware. I have a limited knowledge of compression algorithms so I'm seeking suggestions/pointers to continue with my hunt.
Thanks!
Original question was "I need info/pointers on decompression algorithms that can work without using the uncompressed data, as this will be unavailable right after decompression. LZ like approaches would still be preferred."
I'm afraid this is off topic because it is too broad.
LZW uses a sizable state that is not very different from keeping a slice of uncompressed data. Even if the state is constant and read from ROM, handling it with just registers seems difficult. There are many different algorithms that can use a constant state, but if you really have NO RAM, then only the most basic algorithms can be used.
Look up RLE, run length encoding.
EDIT: OK, no sliding window, but if you can access ROM, 100 bytes of RAM gives you quite a few possibilities. You want to implement this in assembly, so stick with very simple algorithms: RLE plus a dictionary. Given your requirements, the choice of algorithm should be based on the type of data you need to decompress.
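To show how small the decoder state can get, here is an untested C sketch assuming a trivial (count, value) pair format with a zero count as terminator; the only state is a cursor and two bytes, which fit in registers:

/* Untested sketch: the decoder's entire state is the input cursor and
   one (count, value) pair, so it fits in CPU registers -- no RAM buffer.
   Assumed format: (count, value) byte pairs; a count of 0 terminates. */
static void rle_decode(const unsigned char *rom,
                       void (*emit)(unsigned char))
{
    for (;;) {
        unsigned char count = *rom++;
        if (count == 0)
            return;                /* end of stream */
        unsigned char value = *rom++;
        while (count--)
            emit(value);           /* hand each byte straight to the consumer */
    }
}

The emit callback matters for your case: each decoded byte goes straight to its consumer, so the uncompressed data never has to exist in RAM at all.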

Conceptual Ideas - Memory is limited for an application but need to pass more data

I have a situation as follows (because of IP rights I cannot share technical details):
There are a few individual embedded applications running as parts of a whole project.
Any of these applications can occupy a maximum of 9000 MB (9 GB) of memory.
I am upgrading some applications as per a new requirement.
There are a few tables with buffer length 32767 in each application, which are passed to a network server for calculation at a 15 kHz frequency.
I need to double that, i.e., to 65534, passed to the network at a 30 kHz frequency.
The problem arises here -
One of these applications occupies 8094 MB (8 GB+), so doubling the table buffer length goes beyond the maximum size of the application.
As a result the application output does not appear (but there is no crash).
My question is: have you ever overcome such a problem, and could you share some ideas on how I can do memory management in this particular case? All these programs are written in C++, Perl, C, and Python (VxWorks, Linux, and Sun Solaris are the OSes used).
A quick reply is highly appreciated.
Thanks
It is very vague, but I'll try to answer to the point:
If your program needs larger tables due to whatever reasons, but cannot occupy more memory, you have to change something to compensate that.
You don't mention why you need larger tables:
If the length of the records has increased, try to reduce their number.
If you can then store only a smaller number of entries, you'll have to send them more frequently so that you don't have to hold as many of them.
You could also do some compression in RAM. That depends on the nature of the data, but in general it might help you.

How to sort a very large array in C

I want to sort on the order of four million long longs in C. Normally I would just malloc() a buffer to use as an array and call qsort() but four million * 8 bytes is one huge chunk of contiguous memory.
What's the easiest way to do this? I rate ease over pure speed for this. I'd prefer not to use any libraries and the result will need to run on a modest netbook under both Windows and Linux.
Just allocate a buffer and call qsort. 32MB isn't so very big these days even on a modest netbook.
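Something like this; note the comparator, since the tempting return *a - *b can overflow for long long:

#include <stdio.h>
#include <stdlib.h>

/* Comparator for long long.  The tempting `return (int)(*a - *b)`
   overflows, so compare explicitly instead. */
static int cmp_ll(const void *pa, const void *pb)
{
    long long a = *(const long long *)pa;
    long long b = *(const long long *)pb;
    return (a > b) - (a < b);
}

int main(void)
{
    size_t n = 4000000;                 /* ~32 MB of long longs */
    long long *buf = malloc(n * sizeof *buf);
    if (!buf) { perror("malloc"); return 1; }

    /* ... fill buf from the input ... */

    qsort(buf, n, sizeof *buf, cmp_ll);

    free(buf);
    return 0;
}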
If you really must split it up: sort smaller chunks, write them to files, and merge them (a merge takes a single linear pass over each of the things being merged). But, really, don't. Just sort it.
(There's a good discussion of the sort-and-merge approach in volume 2 of Knuth, where it's called "external sorting". When Knuth was writing that, the external data would have been on magnetic tape, but the principles aren't very different with discs: you still want your I/O to be as sequential as possible. The tradeoffs are a bit different with SSDs.)
32 MB? That's not too big... quicksort should do the trick.
Your best option would be to prevent having the data unordered in the first place, if possible. As has been mentioned, you'd be better off reading the data from disk (or the network or whatever the source) directly into a self-organizing container (a tree; perhaps std::set will do).
That way, you'll never have to sort through the lot, or worry about memory management. If you know the required capacity of the container, you might squeeze out additional performance by using std::vector(initialcapacity) or calling vector::reserve up front.
You'd then best be advised to use std::make_heap to heapify any existing elements, and then add element by element using push_heap (see also pop_heap). This essentially is the same paradigm as the self-ordering set but
duplicates are ok
the storage is 'optimized' as a flat array (which is perfect for e.g. shared memory maps or memory mapped files)
(Oh, minor detail: note that sort_heap on the heap takes at most O(N log N) comparisons, where N is the number of elements.)
Let me know if you think this is an interesting approach; I'd really need a bit more info on the use case.

Efficient cache usage in C

I want to write code that makes the best use of my system's cache memory. For example, I have one large array (2 KB) that is frequently used in operations. For better execution speed, I want it to be loaded into cache memory so that it takes less time for the processor to fetch it. How can I ensure this in C? Any help would be appreciated.
First, ask what processor you are running your code on. Then read its tech specs to find out how big the cache is and how the cache is arranged.
CPU caches are pretty big these days so if your array is only 2kB then it's almost certainly going to be held entirely in cache unless you read megabytes of data between accesses to the array.
In short: don't worry about it. Your array is tiny so it's unlikely you can do much to "optimise" cache usage for it. Instead, look at the algorithm you are using to see if there is a more efficient approach that can be used, and run a profiler on your code to see where the bottlenecks are.
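To illustrate what a more efficient approach typically looks like for caches, here is the classic loop-order example in C: both functions compute the same sum, but the first walks memory contiguously while the second strides by a whole row per access:

#define ROWS 1024
#define COLS 1024

/* Same arithmetic, very different cache behaviour: the first loop walks
   memory contiguously (C arrays are row-major), the second jumps a whole
   row per access and misses the cache far more often on large arrays. */
static double sum_row_major(double m[ROWS][COLS])
{
    double s = 0.0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += m[i][j];          /* consecutive addresses */
    return s;
}

static double sum_col_major(double m[ROWS][COLS])
{
    double s = 0.0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            s += m[i][j];          /* stride of COLS doubles */
    return s;
}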
For GCC you could use the -fprefetch-loop-arrays option.
Anyway, if you're using that memory block frequently, it will most probably reside in cache.
If you want to make it faster, read this: http://www.agner.org/optimize/optimizing_cpp.pdf , and try to use it. And don't forget, the best optimizations can be done by changing the algorithm.
