Fastest disk-based solution to cache trillions of unique MD5 hashes

Is there a very low-latency, disk-based caching solution that I can use to store only unique values (NOT key+value)?
My script needs to keep track of which files it has processed so it doesn't redo any work. I need to check the cache for the MD5 hash of the file; if it doesn't exist, I process the file and add the hash to the cache.
Is there a faster disk-based caching solution than a key-value store?

Try LevelDB.
It's a key-value store, but it is very compact thanks to its log-structured merge-tree design and block compression.
Less space usage => less I/O => better performance.
Not sure about "trillions" (a trillion MD5 hashes would be 16 TB), but Bitcoin Core as well as Ethereum implementations use LevelDB.

In your case, there is no need for an "Ordered Key-Value Store"; that is, you can rely on plain key-value stores (direct dbm successors).
Good candidates are:
Tokyo Cabinet: it has a hash-based format that might be faster in your case.
gdbm
In the case where the dataset fits into memory, you might want to try LMDB.
I do not recommend LevelDB because it is slow.
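As a concrete sketch of the membership check with one of these dbm-style stores, here is a minimal Python example using the standard-library dbm module (which picks gdbm when it is installed); the input file names and the seen.db path are placeholders, not anything from the question:

    import dbm
    import hashlib

    def file_md5(path, chunk_size=1 << 20):
        # Hash the file in chunks so large files never have to fit in RAM.
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.digest()  # 16 raw bytes per hash

    with dbm.open("seen.db", "c") as seen:          # "c" = create the store if missing
        for path in ["a.bin", "b.bin"]:             # placeholder input files
            digest = file_md5(path)
            if seen.get(digest) is not None:        # membership test only
                continue                            # already processed
            # ... process the file here ...
            seen[digest] = b"1"                     # the key is all we care about

Because only the 16-byte digest is stored as the key (with a dummy value), the store stays about as close to "values only" as a key-value engine allows.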

Do the math. 1 trillion MD5s, without any tricks, would take 16TB of disk space. This is, I assume, far more than your RAM size.
Since each MD5 lookup is essentially a 'random' probe into the disk, there will necessarily be about 1 disk hit per check.
If, say, an SSD read is 1ms, that is 1e9 seconds to insert (or check) a trillion hashes. That's 30 years.
There are a lot of flaws in my math, but I think this says that it is not practical today to store and check a trillion of anything random.
If you want to crank it down to a billion MD5s, now we are getting in the range of RAM sizes. But you probably want to have the data persisted? So you really need some database-like tool that will do the persisting for you, while making the checks purely in RAM (CPU-speed).
In any case, I would consider writing code that breaks the MD5 into 2 or 3 chunks, then use the chunks like a directory structure. At the bottom level, you have a variable-length bunch of values for the last chunk. Each is perhaps 8 bytes long. That would need a linear or binary search into a bunch of numbers that are half the size of an MD5. The savings here helps compensate for the various overheads in the rest of the structure, plus the need for writing blocks to disk. Hence, I would still expect needing about 16GB of RAM to house a billion MD5s.
Given that approach, virtually any database engine is already geared up to do most of the work reasonably efficiently. The lowest level would be some type of BLOB containing multiple 8-byte chunks.
Another trick to use... Let's look at just the first 5 bytes of an MD5. There are about a trillion different values in 5 bytes. If you have only a billion entries in your dataset, then checking the 5 bytes has a 99.9% chance of correctly saying "the MD5 is not in the dataset" versus roughly a 0.1% chance of saying "the MD5 might be in the dataset". In the former case, you get a quick answer with only 5 GB for a billion items. In the latter case, you may have to go to disk and be slower. Still, the average time is better. This helps with the speed of checking. (But it does not address the speed of loading.)
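A hedged sketch of that prefix trick in Python; the full_store_contains / full_store_add callables are stand-ins for whatever on-disk store actually holds the complete hashes:

    prefixes = set()                          # holds only the first 5 bytes of each MD5

    def probably_new(digest):
        # True means "definitely not in the dataset"; no disk access needed.
        return digest[:5] not in prefixes

    def check(digest, full_store_contains):
        if probably_new(digest):
            return False
        return full_store_contains(digest)    # rare case: confirm against the full store

    def remember(digest, full_store_add):
        prefixes.add(digest[:5])
        full_store_add(digest)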

Related

How to choose the best buffer size when you need to read large data

Let's assume a scenario where I have a lot of log files for a given system; imagine it's petabytes of data. This is my scenario.
Used Technology
For my purposes, I'm going to use C/C++ to do this.
My Problem
I have the need to read these files, which are on disk, and do some processing later, whether sending them to a topic on some pub/sub system or simply displaying these logs on screen.
Questions
What is the best buffer size for reading this data with the best performance while saving hardware resources such as disk and RAM?
I just don't know if I should choose 64 Kilobytes, 128 Kilobytes, 5 Megabytes, or 10 Megabytes; how do I calculate this?
And if this calculation depends on how much resource I have available, then how do I calculate it from those resources?
The optimal buffer size depends on many factors, most notably the hardware. You can find out which size is optimal by picking one size, measuring how long the operation takes, then picking another size, measuring, and comparing. Repeat until you find the optimal size (a small measurement sketch follows the list of typical sizes below).
Caveats:
You need to measure with the hardware matching the target system to have meaningful measurements.
You also need to measure with inputs comparable to the target task. You may reduce the size of input by using subset of real data to make measuring faster, but at some size it may affect the quality of measurement.
It's possible to encounter a local maximum: a buffer size that is faster than either a slightly larger or a slightly smaller buffer, but not as fast as some other buffer size that is much larger or smaller. General global optimisation techniques, such as simulated annealing, may be used to avoid getting stuck in the search for the optimal value.
Although benchmarking is a simple concept, it's actually quite difficult to do correctly. It's possible and likely that your measurements are biased by incidental factors that may cause differences in performance of the target system. Environment randomisation may help reduce this.
Typical sizes that may be a good starting point to measure are the size of the caches on the system:
Cache line size
L1 cache size
L2 cache size
L3 cache size
Memory page size
SSD DRAM cache size
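To make that measure-and-compare loop concrete, here is a rough sketch in Python (the same approach carries over directly to C/C++ with read()); the file path is a placeholder, and repeated runs will be skewed by the OS page cache unless you drop it between runs:

    import time

    def time_read(path, buffer_size):
        start = time.perf_counter()
        with open(path, "rb", buffering=0) as f:    # unbuffered, so only our buffer matters
            while f.read(buffer_size):
                pass
        return time.perf_counter() - start

    for size in (64 * 1024, 128 * 1024, 1024 * 1024, 8 * 1024 * 1024):
        print(f"{size // 1024:>6} KiB: {time_read('big.log', size):.3f} s")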
I saw this answer regarding the same question in C#; basically, buffer size doesn't really matter performance-wise (as long as it's a reasonable value). Regarding RAM and disk usage, you will have the same quantity of data to read/write whatever your buffer size might be. Again, as long as you stay within reasonable values you shouldn't have a problem.
Actually, you don't have to load all your data into memory to do anything with it. You only have to read the parts that are concerned.
I have the need to read these files, which are on disk, and do some processing later
Just load them later and pass them to the subsystem at that moment. If you want to display them, simply read, process, and display.
What is the best buffer size for reading this data with the best performance while saving hardware resources such as disk and RAM?
Why do you want to save disk resources? Isn't that where your files are? You have to load data from there into RAM in small quantities, such as one particular log file at a time, do whatever you want with it, and finally flush it all. Repeat.
I just don't know if I should choose 64 Kilobytes, 128 Kilobytes, 5 Megabytes, or 10 Megabytes; how do I calculate this?
Again, load the files one by one rather than fixed-size amounts of their data.
And if this calculation depends on how much resource I have available, then how do I calculate it from those resources?
No calculation needed. Just handle RAM smartly by focusing on one, or maybe two, files at a time. Don't worry about disk resources.

Understanding O(1) vs O(n) Time Complexity Intuitively

I understand that O(1) is constant-time, which means that the operation does not depend on the input size, and O(n) is linear time, which means that the operation changes linearly with input size.
If I had an algorithm that could simply go directly to an array index rather than going through each index one-by-one to find the required one, that would be considered constant-time rather than linear-time, right? This is what the textbooks say. But, intuitively, I don't understand how a computer could work this way: Wouldn't the computer still need to go through each index one-by-one, from 0 to (potentially) n, in order to find the specific index given to it? But then, is this not the same as what a linear-time algorithm does?
Edit
My response to ElKamina's answer elaborates on how my confusion extends to hardware:
But wouldn't the computer have to check where it is on its journey to the index? For instance, if it's trying to find index 3: "I'm at index 0, so I need address 0 + 3", "OK, now I'm at address 1, so I have to move forward 2", "OK, now I'm at address 2, so I have to move forward 1", "OK, now I'm at index 3". Isn't that the same thing as what linear-time algorithms do? How can the computer not do it sequentially?
Theory
Imagine you have an array which stores events in the order they happened. If each event takes the same amount of space in a computer's memory, you know where that array begins, and you know what number event you're interested in, then you can precalculate the location of each event.
Imagine you want to store records and key them by telephone numbers. Since there are many numbers, you can calculate a hash of each one. The simplest hash you might apply is to treat the telephone number like a regular number and take it modulus the length of the array you'd like to store the number in. Again, you can assume each record takes the same amount of space, you know the number of records, you know where the array begins, and you know the offset of the event of interest. From these, you can precalculate the location of each event.
If array items have different sizes, then instead fill the array with pointers to the actual items. Your lookup then has two stages: find the appropriate array element and then follow it to the item in question.
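As a toy illustration of that precalculation (the base address and element size are invented; the real arithmetic is done by the hardware, not by user code):

    base_address = 0x1000     # where the array starts (made up)
    item_size = 8             # bytes per element, e.g. a 64-bit float

    def element_address(index):
        # One multiply and one add, no matter how large the index is.
        return base_address + index * item_size

    print(hex(element_address(0)))    # 0x1000
    print(hex(element_address(42)))   # 0x1150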
Much like we can use shmancy GPS systems to tell us where an address is, but we still need to do the work of driving there, the problem with accessing memory is not knowing where an item is, it's getting there.
Answer to your question
With this in mind, the answer to your question is that look-up is almost never free, but it also is rarely O(N).
Tape memory: O(N)
Tape memory requires O(N) seeks, for obvious reasons: you have to spool and unspool the tape to position it to the needed location. It's slow. It's also cheap and reliable, so it's still in use today in long-term back-up systems. Special algorithms which account for the physical nature of the tape can speed up operations on it slightly.
Notice that, per the foregoing, the problem with tape is not that we don't know where the thing is we're trying to find. The problem is getting the physical medium to get there. The nature of a good tape algorithm is to try to minimize the total amount of tape spooled and unspooled over a grouping of operations.
Speaking of which, what if, instead of having one long tape, we had two shorter tapes: this would reduce the point-to-point travel time. What if we had four tapes?
Disk memory: O(N), but smaller
Hard drives make a huge reduction in seek time by turning the tape into a series of rings. Now, even though there are N memory spaces on a disk, any one can be accessed in short order by moving the drive head and the disk to the appropriate point. (Figuring out how to express this in big-oh notation is a challenge.)
Again, if you use faster disks or smaller disks, you can optimize performance.
RAM: O(1), but with caveats
Pretty much everyone who answers this question is going to fixate on RAM, since that's what programmers work with most frequently. Look to their answers for fuller explanations.
But, just briefly, RAM is a natural extension of the ideas developed above. The RAM holds N items and we know where the item we want is. However, this time there's nothing that needs to mechanically move in order for us to get to that item. In addition, we saw that by having more short tapes or smaller, faster drives, we could get to the memory we wanted faster. RAM takes this idea to its extreme.
For practical purposes, you can think of RAM as being a collection of little memory stores, all strung together. Your computer doesn't know exactly where in RAM a particular item is, just the collection it belongs to. So it grabs the whole collection, consisting of thousands or millions of bytes. It stashes this in something like an L3 cache.
But where is a particular item in that cache? Again, you can think of the computer as not really knowing; it just grabs a subset which is guaranteed to include the item and passes it to the L2 cache.
And again, for the L1 cache.
And, at this point, we've gone from gigabytes (or terabytes) of RAM to something like 3-30 kilobytes. And, at this level, your computer (finally) knows exactly where the item is and grabs it for processing.
This kind of hierarchical behavior means that accessing adjacent items in RAM is much faster than randomly accessing different points all across RAM. That was also true of tape drives and hard disks.
However, unlike tape drives and hard disks, the worst-case time where all the caches are missed is not dependent on the amount of memory (or, at least, is very weakly dependent: path lengths, speed of light, &c)! For this reason, you can treat it as an O(1) operation in the size of the memory.
Comparing speeds
Knowing this, we can talk about access speed by looking at Latency Numbers Every Programmer Should Know:
Latency Comparison Numbers
--------------------------
L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns 14x L1 cache
Mutex lock/unlock 25 ns
Main memory reference 100 ns 20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy 3,000 ns 3 us
Send 1K bytes over 1 Gbps network 10,000 ns 10 us
Read 4K randomly from SSD* 150,000 ns 150 us ~1GB/sec SSD
Read 1 MB sequentially from memory 250,000 ns 250 us
Round trip within same datacenter 500,000 ns 500 us
Read 1 MB sequentially from SSD* 1,000,000 ns 1,000 us 1 ms ~1GB/sec SSD, 4X memory
Disk seek 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip
Read 1 MB sequentially from disk 20,000,000 ns 20,000 us 20 ms 80x memory, 20X SSD
Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms
In more human terms, these look like:
Minute:
L1 cache reference 0.5 s One heart beat (0.5 s)
Branch mispredict 5 s Yawn
L2 cache reference 7 s Long yawn
Mutex lock/unlock 25 s Making a coffee
Hour:
Main memory reference 100 s Brushing your teeth
Compress 1K bytes with Zippy 50 min One episode of a TV show (including ad breaks)
Day:
Send 2K bytes over 1 Gbps network 5.5 hr From lunch to end of work day
Week:
SSD random read 1.7 days A normal weekend
Read 1 MB sequentially from memory 2.9 days A long weekend
Round trip within same datacenter 5.8 days A medium vacation
Read 1 MB sequentially from SSD 11.6 days Waiting for almost 2 weeks for a delivery
Year:
Disk seek 16.5 weeks A semester in university
Read 1 MB sequentially from disk 7.8 months Almost producing a new human being
The above 2 together 1 year
Decade:
Send packet CA->Netherlands->CA 4.8 years Average time it takes to complete a bachelor's degree
Underlying any calculation of time complexity is a cost model. Cost models tend to be oversimplified; for example, we generally talk about the time complexity of sort algorithms in terms of how many elements do we have to compare to each other.
The assumption underlying concluding that indexing into an array is O(1) is that of random access memory; that we can access location N by encoding N on the address lines of the memory bus, and the contents of that location come back on the data bus. If memory were sequential access (e.g., accessing off of a magnetic tape), we'd assume a different cost model.
Imagine computer memory as buckets, and say you have 10 buckets in front of you.
If someone tells you to pick something up from bucket number 8, you will not first stick your hand into buckets 1 to 7; you would simply put your hand directly into bucket 8.
Arrays work the same way: in most languages they map to some form of contiguous memory layout, so, for example, a byte array of 10 elements would be 10 sequential bytes.
Other types can vary in size, depending on whether the content is a value type/struct, or a reference type, in which case the array would consist of pointers.
We assume that the memory is "Random Access Memory" (also known as RAM), not the tape or disk memory. In RAM you can access any address in constant time. See the corresponding wiki article for more information on how it works.
Also, elements of the array are stored sequentially. Say we want to store integers in Java which take up 4 bytes. If we wanted to look for kth element, we would directly look at start + 4 * k location in the memory.
You could implement an array in other ways as well. For example, you could implement the array with a linked list, in which case it would take O(n) time to access an element. But this is not how arrays are implemented typically.
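A small sketch of that difference, with a Python list standing in for a contiguous array and a hand-rolled linked list for comparison (both hypothetical, just to contrast the access patterns):

    class Node:
        def __init__(self, value, next_node=None):
            self.value = value
            self.next = next_node

    # Array-like: one arithmetic step, regardless of the index.
    array = [10, 20, 30, 40]
    print(array[3])                     # direct jump to start + element_size * 3

    # Linked list: k pointer hops to reach index k.
    head = Node(10, Node(20, Node(30, Node(40))))

    def nth(node, k):
        for _ in range(k):              # O(n) in the worst case
            node = node.next
        return node.value

    print(nth(head, 3))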
No one here has explained (IMO) in sufficient detail why you can access it in O(1) time, so I will try to:
As a note before I do, this is probably trivializing how complex the hardware in the computer has become, but hopefully it's something along the right path. You would cover this in a Computer Organization course that goes into the guts of the hardware.
When you have circuits, the voltage passed through the computer propagates very fast, and the results that come back depends on the pulse of the clock. Take this diagram for example:
https://upload.wikimedia.org/wikipedia/commons/3/3d/Square_array_of_mosfet_cells_read.png
The following is missing parts that you would learn properly from a textbook or course (or online), but omitting those details should still leave you with a high-level overview good enough for a rough idea of how this works:
The address you send as bits goes up the left side of the image, and based on the address you send, voltage is routed to the proper memory cell holding the data you want. When the cell receives voltage, it emits its value back down to the bottom (which is also essentially instant), and now you've read the 'value stored in memory' since the data you want has arrived. Because of how fast voltage propagates through circuits, you get the result almost instantly. This means the access does not depend on traversing the elements before it, since you can go straight to the cell you want, which is the idea behind RAM. The bottleneck comes from the clock pulse and the latches, which a computer organization course will explain, including what we do about them and why.
This is why we consider it doable in O(1) time.
Now an Operating Systems and Computer Organization course would show you how all of this is connected under the hood, and why it's far more complex than what I've written (and what might not even be that accurate anymore), but hopefully this gives you an intuition as to why we can do it in constant time.
Since complexity notation hides the constants under the hood (and from the above, we can assume it's constant time to reach any offset in memory), it makes sense that we can jump to any array offset in O(1) time from a high-level point of view, which is what complexity analysis aims to give us. This is also why we don't need to traverse every element in memory to get where we want, which, as you said, would be O(n).
Assuming the data structure you are talking about is a vector/array, you can easily reach index 'x' because the element's location can be computed directly from the index.
Say you have a vector of struct "A" where A occupies 20 bytes, you want to get to index 28, and you know the vector starts at memory location 'x'. Then you simply need to go to x + 28 * 20 = x + 560 bytes, and that is your element.
With a data structure like a list, the lookup time will be O(n), since it's not contiguously allocated and you have to jump from pointer to pointer.
With a binary tree it's O(log2(n))... etc.
So the answer here is that it depends on your structure. I would recommend reading some books about fundamental data structures; those might help you greatly in gaining a more theoretical understanding of the various concepts you are using.

Using a filesystem in place of a Database to perform quick lookups

I was recently posed a simple question to which I responded with an unusual answer. I suspect my answer was particularly bad but am not certain what the performance characteristics truly would be.
Suppose you are given inputs each in the form of a hash code (just a bunch of bits.) Uniquely corresponding to each hash code is an integer value which you would like to return. Your system knows the most likely queries and caches them in memory. For the remaining, less frequent lookups you will have to access a hard drive (disk I/O.) There exists a least recently used policy to replace the cache in memory but that shouldn't be terribly important here.
For the hashes on disk, the conventional way to store them would be in a Database (keyed on the hashes) in a tree shape. This would grant you O(log(n)) lookup time once at the database stage.
My answer seemed odd to the asker and a little odd to me. Suppose instead of a database, you simply kept the values on disk in a file system with a directory structure that exactly mirrored the bits of the hash values. For instance, if we had three-bit hashes (and only had entries for 100 => 42 and 010 => 314159), your file system would look like:
\0
.\00
.\000
.\100
42.justanumber
.\10
.\010
314159.justanumber
.\110
\1
.\01
.\001
.\101
.\11
.\011
.\111
The x.justanumber files are empty. The filenames themselves contain the information you're looking for.
Further assume that updates never occur (the entire DB/file system is re-written weekly.) I'd think that a filesystem set up this way would give you O(1) lookup time instead of the O(log(n)) lookup time of a tree-based DB. Am I missing something? Why would this not be preferable?
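For what it's worth, a rough Python sketch of how such a lookup could map hash bits onto a directory path; the one-directory-per-bit layout, the hashfs root, and the .justanumber suffix are my reading of the example above (the listing in the question nests the bits slightly differently), not any real system:

    import os

    ROOT = "hashfs"                                    # hypothetical root directory

    def path_for(hash_bits):
        # One directory level per bit, e.g. "010" -> hashfs/0/1/0
        return os.path.join(ROOT, *hash_bits)

    def store(hash_bits, value):
        directory = path_for(hash_bits)
        os.makedirs(directory, exist_ok=True)
        open(os.path.join(directory, f"{value}.justanumber"), "w").close()

    def lookup(hash_bits):
        directory = path_for(hash_bits)
        if not os.path.isdir(directory):
            return None
        for name in os.listdir(directory):             # the filename itself is the value
            if name.endswith(".justanumber"):
                return int(name[: -len(".justanumber")])
        return None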
I believe I've come to an answer.
If the system always uses a folder for every possible bit combination and you have a 32-bit hash, you will have 2^32 folders at the outer layer and roughly the same amount across all inner layers, so you'll have about 2^33 folders, each of size 4 kB due to page-size limitations. Additionally, each "empty" hash file will occupy 4 kB on disk for the same reason. So assuming, let's say, 5% occupancy, you would have:
2^33 * 4kB + 2^32 * 4kB * 0.05 = 35.22 TB of storage needed.
The savings are O(1) lookup time instead of O(log(# hashes)) lookup time which almost certainly doesn't justify the storage space requirements. (Also remember this is for the uncommon case where you have a memory/cache miss.)
Admittedly, if you only created the minimum number of folders needed to support the 5% occupancy you'd end up with less space needed. The exact number of folders needed would depend on the distribution of the hashes. Assuming a good hash function, this should be close to random, which I believe means we'll end up needing the majority of the inner layers and definitely 5% of the outermost layers of directories. This instead gives something like:
2^32 * 4 kB * 0.05 + 2^32 * 4kB * 0.40 + 2^32 * 4kB * 0.05 = 8.59 TB
...which is still way too much space for such a simple thing. (I made up the 40% figure for the inner folders. If anyone can come up with a rigorous figure there, please comment/answer.)

Weird use of MD5 for hashing a file - does it do anything?

For uploading a file to a service, I was calculating the MD5 based on the whole content of the file.
I was asked to do it in a different way: the MD5 of the whole file, and then also of several more parts: 2% from the start of the file, 2% from 1/3 of the way in, 2% from 2/3 of the way in, and 2% from the end of the file, and then append the file size in bytes at the end.
Apparently this solves hash collisions between files. To me it seems like a waste of time, since you're not increasing the size of the MD5. So for a very large number of files you're still going to have, statistically, the same number of collisions.
Please help me understand the reasoning behind this.
EDIT: we are then hashing the resulting hashes.
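For reference, a hedged Python sketch of the scheme as I read the description; the exact sample positions, the 2% sample size, and the final hash-of-hashes step are my interpretation, not a known specification:

    import hashlib
    import os

    def md5_hex(data):
        return hashlib.md5(data).hexdigest()

    def multi_part_signature(path):
        size = os.path.getsize(path)
        sample = max(1, size // 50)                    # 2% of the file
        offsets = [0, size // 3, 2 * size // 3, max(0, size - sample)]
        with open(path, "rb") as f:
            parts = [md5_hex(f.read())]                # md5 of the whole content
            for off in offsets:                        # md5 of each 2% sample
                f.seek(off)
                parts.append(md5_hex(f.read(sample)))
        combined = "".join(parts) + str(size)          # append the file size in bytes
        return md5_hex(combined.encode())              # "hashing the resulting hashes"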
A good cryptographically strong hashing algorithm is already designed with the goal to make it infeasible to intentionally find two different pieces of data with the same hash, let alone by accident. Therefore, just hashing the file is sufficient. Extra hashing of parts of the file is pointless.
This may seem unintuitive because obviously there must exist collisions if the length of the hash is shorter than the length of the data. However, it is not feasible to find these collisions because an MD5 hash is an unpredictable 128-bit number and the amount of possible 128-bit numbers (2^128) is mind boggling. If you could count at a rate of a trillion trillion per second, counting through all 128-bit numbers would still take (2^128 / 1e24) seconds ~ about 10 million years. This is probably a good lower limit to the amount of time that it would take to find a hash collision the brute force way without custom hardware.
That said, this is all assuming that there are no weaknesses in the hashing algorithm that allow you to do better than brute force. MD5 is broken in this regard, so you should not use it if you need to defend against attackers that would try to create collisions. It would be better to use a newer hashing algorithm like SHA-2 or SHA-3. (These also support even larger outputs such as 256 bits.)
Sounds like a dangerous practice, because you're re-hashing without factoring in a lot of the data. The advantage, however, is that by running other hashes you effectively wind up with a hash signature consisting of "more bits" (i.e., you are getting several MD5 hashes as a result).
If you want to do this, and are in fact okay with having more (larger) hash data to store/compare, you would be MUCH better advised to simply run a different hash function (other than MD5) that is more secure and/or uses a larger number of bits.
MD5 is an "older" algorithm and is known to have cryptographic weaknesses. I'd recommend one of the "SHA" algorithms, like SHA-256 or SHA-512. The advantages are that it is a stronger algorithm, you'd only have to hash the data ONCE, and you'd get more bits than an MD5; and since you're running it once, it would be faster.
Note, that the possibility of hash collisions always exists. Even "high end" storage products which use hashes for detection will compare buffers to verify an exact match even if the comparison of two hashes matches.

How to manipulate *huge* amounts of data

I'm having the following problem. I need to store huge amounts of information (~32 GB) and be able to manipulate it as fast as possible. I'm wondering what's the best way to do it (combinations of programming language + OS + whatever you think its important).
The structure of the information I'm using is a 4D array (NxNxNxN) of double-precision floats (8 bytes). Right now my solution is to slice the 4D array into 2D arrays and store them in separate files on the HDD of my computer. This is really slow and the manipulation of the data is unbearable, so this is no solution at all!
I'm thinking of moving to a supercomputing facility in my country and storing all the information in RAM, but I'm not sure how to implement an application to take advantage of it (I'm not a professional programmer, so any book/reference will help me a lot).
An alternative solution I'm thinking of is to buy a dedicated server with lots of RAM, but I don't know for sure if that will solve the problem. So right now my ignorance doesn't let me choose the best way to proceed.
What would you do if you were in this situation? I'm open to any idea.
Thanks in advance!
EDIT: Sorry for not providing enough information, I'll try to be more specific.
I'm storing a discretized 4D mathematical function. The operations that I would like to perform include transposition of the array (change b[i,j,k,l] = a[j,i,k,l] and the like), array multiplication, etc.
As this is a simulation of a proposed experiment, the operations will be applied only once. Once the result is obtained it won't be necessary to perform more operations on the data.
EDIT (2):
I also would like to be able to store more information in the future, so the solution should be somehow scalable. The current 32 GB goal is because I want to have the array with N=256 points, but it'll be better if I can use N=512 (which means 512 GB to store it!!).
Amazon's "High Memory Extra Large Instance" is only $1.20/hr and has 34 GB of memory. You might find it useful, assuming you're not running this program constantly..
Any decent answer will depend on how you need to access the data. Randomly access? Sequential access?
32GB is not really that huge.
How often do you need to process your data? Once per (lifetime | year | day | hour | nanosecond)? Often, stuff only needs to be done once. This has a profound effect on how much you need to optimize your solution.
What kind of operations will you be performing (you mention multiplication)? Can the data be split up into chunks, such that all necessary data for a set of operations is contained in a chunk? This will make splitting it up for parallel execution easier.
Most computers you buy these days have enough RAM to hold your 32GB in memory. You won't need a supercomputer just for that.
As Chris pointed out, what are you going to do with the data.
Besides, I think storing it in a (relational) database will be faster than reading it from the hard drive, since the RDBMS will perform some optimizations for you, like caching.
If you can represent your problem as MapReduce, consider a clustering system optimized for disk access, such as Hadoop.
Your description sounds more math-intensive, in which case you probably want to have all your data in memory at once. 32 GB of RAM in a single machine is not unreasonable; Amazon EC2 offers virtual servers with up to 68 GB.
Without more information, if you need the quickest possible access to all the data, I would go with using C for your programming language, using some flavor of *nix as the OS, and buying RAM; it's relatively cheap now. This also depends on what you are familiar with; you can go the Windows route as well. But as others have mentioned, it will depend on how you are using this data.
So far, there are a lot of very different answers. There are two good starting points mentioned above. David suggests some hardware and someone mentioned learning C. Both of these are good points.
C is going to get you what you need in terms of speed and direct memory paging. The last thing you want to do is perform linear searches on the data. That would be slow - slow - slow.
Determine your workflow. If your workflow is linear, that is one thing. If the workflow is not linear, I would design a binary tree referencing pages in memory. There is a ton of information on B-trees on the Internet. In addition, these B-trees will be much easier to work with in C, since you will also be able to set up and manipulate your memory paging.
Depending on your use, some mathematical and physical problems tend to be mostly zeros (for example, Finite Element models). If you expect that to be true for your data, you can get serious space savings by using a sparse matrix instead of actually storing all those zeros in memory or on disk.
Check out wikipedia for a description, and to decide if this might meet your needs:
http://en.wikipedia.org/wiki/Sparse_matrix
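If your data really is mostly zeros, a small sketch with scipy.sparse shows the space saving (scipy is an extra dependency, its sparse formats are 2-D so a 4D array would have to be reshaped or sliced first, and the shape and values here are invented):

    import numpy as np
    from scipy import sparse

    dense = np.zeros((1000, 1000))
    dense[3, 7] = 1.5                        # only a handful of non-zeros
    dense[42, 99] = -2.0

    compact = sparse.csr_matrix(dense)        # stores only the non-zero entries
    print(compact.nnz, "non-zeros instead of", dense.size, "stored values")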
Here's another idea:
Try using an SSD to store your data. Since you're grabbing very small amounts of random data, an SSD would probably be much, much faster.
You may want to try using mmap instead of reading the data into memory, but I'm not sure it'll work with 32 GB files.
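If Python is an option, numpy.memmap wraps exactly this idea and does cope with files larger than RAM; a minimal sketch with the N=256 shape from the question (the file name is a placeholder):

    import numpy as np

    N = 256
    # Back a ~32 GB array of doubles with a file instead of RAM.
    a = np.memmap("array4d.dat", dtype=np.float64, mode="w+", shape=(N, N, N, N))

    a[0, 1, 2, 3] = 42.0      # pages are pulled in / written back on demand
    a.flush()                 # push dirty pages out to disk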
The whole database technology is about manipulating huge amounts of data that can't fit in RAM, so that might be your starting point (i.e. get a good dbms principles book and read about indexing, query execution, etc.).
A lot depends on how you need to access the data - if you absolutely need to jump around and access random bits of information, you're in trouble, but perhaps you can structure your processing of the data such that you will scan it along one axis (dimension). Then you can use a smaller buffer and continuously dump already processed data and read new data.
For transpositions, it's faster to actually just change your understanding of what index is what. By that, I mean you leave the data where it is and instead wrap an accessor delegate that changes b[i][j][k][l] into a request to fetch (or update) a[j][i][k][l].
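In NumPy terms this is exactly what a transpose view does: it swaps the array's strides instead of moving any data. A small sketch (the tiny shape is just for illustration; the same call works on a memory-mapped array):

    import numpy as np

    a = np.arange(2 * 3 * 4 * 5, dtype=np.float64).reshape(2, 3, 4, 5)
    b = a.transpose(1, 0, 2, 3)        # b[i, j, k, l] is a[j, i, k, l]; no copy is made

    assert b[2, 1, 3, 4] == a[1, 2, 3, 4]
    print(b.base is a)                  # True: b is just another view of a's memory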
Could it be possible to solve it with this procedure?
First, create M child processes and execute them in parallel. Each process will run on a dedicated core of a cluster and load some portion of the array into the RAM of that core.
A parent process will be the manager of the array, calling (or connecting to) the appropriate child process to obtain certain chunks of data.
Will this be faster than the HDD storage approach? Or am I cracking nuts with a sledgehammer?
The first thing that I'd recommend is picking an object-oriented language, and develop or find a class that lets you manipulate a 4-D array without concern for how it's actually implemented.
The actual implementation of this class would probably use memory-mapped files, simply because that can scale from low-power development machines up to the actual machine where you want to run production code (I'm assuming that you'll want to run this many times, so that performance is important -- if you can let it run overnight, then a consumer PC may be sufficient).
Finally, once I had my algorithms and data debugged, I would look into buying time on a machine that could hold all the data in memory. Amazon EC2, for instance, will provide you with a machine that has 68 GB of memory for $US 2.40 an hour (less if you play with spot instances).
How to handle processing large amounts of data typically revolves around the following factors:
Data access order / locality of reference: Can the data be separated out into independent chunks that are then processed either independently or in a serial/sequential fashion, versus random access to the data with little or no order?
CPU vs I/O bound: Is the processing time spent more on computation with the data or reading/writing it from/to storage?
Processing frequency: Will the data be processed only once, every few weeks, daily, etc?
If the data access order is essentially random, you will need either to get access to as much RAM as possible and/or find a way to at least partially organize the order so that not as much of the data needs to be in memory at the same time. Virtual memory systems slow down very quickly once physical RAM limits are exceeded and significant swapping occurs. Resolving this aspect of your problem is probably the most critical issue.
Other than the data access order issue above, I don't think your problem has significant I/O concerns. Reading/writing 32 GB is usually measured in minutes on current computer systems, and even data sizes up to a terabyte should not take more than a few hours.
Programming language choice is actually not critical so long as it is a compiled language with a good optimizing compiler and decent native libraries: C++, C, C#, or Java are all reasonable choices. The most computationally and I/O-intensive software I've worked on has actually been in Java and deployed on high-performance supercomputing clusters with a few thousand CPU cores.
