Reading quickly from many random points of a file

I am developing a performance-critical network service in Rust. A request to my service is a vector ids: Vec<u64> of numerical ids. For each id in ids, my service must read the id-th record from a long sequence of records stored contiguously on an SSD. Because all records have the same size RECORD_SIZE (in practice, around 6 KB), the position of every record is entirely predictable, so a trivial solution reduces to:
let mut records = Vec::with_capacity(ids.len());
for id in ids {
    // One blocking read per record: seek to the record's offset, then read it.
    file.seek(SeekFrom::Start(id * RECORD_SIZE as u64)).unwrap();
    let mut record = vec![0u8; RECORD_SIZE];
    file.read_exact(&mut record).unwrap();
    records.push(record);
}
// Do something with `records`
Now, sadly, the following apply:
The elements of ids are non-contiguous, unpredictable, unstructured, and effectively distributed uniformly at random in the range [0, N].
N is way too large for me to store the entire file in memory.
ids.len() is much smaller than N, so I cannot efficiently scan the file linearly: 99% of my reads would be for records that have nothing to do with ids.
Now, reading the specs, the raw QD32 IOPS of my SSD should allow me to collect all records in time (i.e., before the next request comes). But what I observe with my trivial implementation is much, much worse. I suspect this is because the loop above is effectively a QD1 (queue depth 1) implementation:
Read something from disk at a random location.
Wait for the data to arrive, store it in RAM.
Read the next thing from disk at another, independent location.
Now, the thing is, I know all the ids at the very beginning, and I would love it if there were a way to specify:
As much in parallel as possible, read all the locations relevant to each element of ids.
When that is done, carry on doing something on everything.
I am wondering if there is an easy way to get this done in Rust. I looked for file.parallel_read-like functions in the standard library and for useful crates on crates.io, but to no avail. This puzzles me, because it should be a relatively common problem in a server / database setting. Am I missing something?

Depending on the architecture you're targeting, there is the posix_fadvise syscall:
Programs can use posix_fadvise() to announce an intention to access file data in a specific pattern in the future, thus allowing the kernel to perform appropriate optimizations.
You would pass the offset, RECORD_SIZE, and probably the POSIX_FADV_WILLNEED advice. Both the function and the constant are available in the libc crate. The same idea works with memory-mapped files using posix_madvise() and POSIX_MADV_WILLNEED, as hinted in the comments.
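As a minimal sketch of the call shape, shown here in C (the libc crate bindings mentioned above mirror the same function name, constant, and argument order on Linux targets), the hypothetical helper prefetch_records loops over the ids and issues one hint per record before anything is read:

/* Sketch: hint the kernel about every record we are about to read,
   so it can start fetching them while we do other work. */
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define RECORD_SIZE 6144  /* example value; use your actual record size */

static void prefetch_records(int fd, const uint64_t *ids, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        off_t offset = (off_t)(ids[i] * RECORD_SIZE);
        /* posix_fadvise returns an error number directly; it does not set errno. */
        int err = posix_fadvise(fd, offset, RECORD_SIZE, POSIX_FADV_WILLNEED);
        if (err != 0)
            fprintf(stderr, "posix_fadvise: %s\n", strerror(err));
    }
}

The idea would be to call something like prefetch_records once with the request's full id list, then run the plain seek-and-read loop from the question; with luck, most of the records are already in (or on their way into) the page cache by the time the blocking reads happen. In Rust, the raw descriptor can be obtained via std::os::unix::io::AsRawFd and passed to the libc bindings directly.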
You will then need to do some performance tuning to determine how far ahead to make these calls: not early enough and the data isn't there when you want it; too early and you needlessly add pressure on your system memory.

Related

efficiently using large arrays in C

I'm working on a program in C that uses 4D arrays with 2 million+ points. I currently have it implemented like this:
main.h
extern float data[31][31][25][100];
main.c
float data[31][31][25][100] = {{.....},{......},.....};
int main(void)
{
    double sum = 0.0;
    for (int i = 0; i < 31; i++)
        for (int j = 0; j < 31; j++)
            for (int k = 0; k < 25; k++)
                for (int l = 0; l < 100; l++)
                    sum += data[i][j][k][l];
    return 0;
}
The sum is there as a placeholder... in its place will be something that does a 4D table lookup. I implemented it like this because I don't want to load this data from disk. In the future I might use a database or something to load just a portion of the data, but for now I need to use the whole dataset.
So my question is... is there any way I can do this more efficiently, and/or can I make the executable smaller? (The executable is currently ~5 MB.) This will eventually run on a quad-core ARM board.
The only other thing I've tried is optimization with gcc. I've tried -O2 and -O4 and I've gotten the following error. Without -O2, it compiles and runs fine. Any ideas? I haven't really looked up what all the optimization options are... just tried stuff I've seen online.
ld: can't link with a main executable file 'test' for architecture x86_64
Thanks for your help!
Answers to comments:
The data cannot be generated programmatically; it is generated by offline simulation.
I've updated the code to show that the array is outside of main and is global.
#js1: you're right, it's closer to 9 MB; I was working with a couple of versions of the code, and the 5 MB executable was one with 1 million+ elements.
#pm100: I guess that's a good question... I'm prototyping this code on a PC and it works fine, but where it will actually end up running is on an embedded platform. We are currently planning on testing with a Pixhawk (which is an ARM board for RC vehicles; our production board will be faster and have more memory). I am trying to be as efficient and optimal as possible to mitigate any potential issues running on an ARM platform.
#user3629249: yes, we are prototyping our code on Mac OS X and then compiling for ARM once we are finished.
#mcleod_ideafix: would loading a binary file that contains 2 million points be faster? Honestly, I hadn't even considered a binary file... I will try implementing it like this. Every time the program is called, it will need access to some portion of this data. It won't need all 2 million points, but the input deciding which part of the array it needs is variable. Ideally I'd like to load just the portion of the arrays that I need. However, when I tried it out, loading the file and searching for the right arrays took 2-3x longer than this current approach. I wonder if I messed something up when I was loading/searching the file.
Answers to comments, pt. 2:
The data isn't sparse... I can't think of any easy way to reduce the number of points without reducing the fidelity of the model. The data is fixed and won't ever change. What will change are the inputs that use the data, which will lead to different portions of the 4D data being used.
As far as what the data is: it is essentially trajectory prediction data for a flying vehicle. The 4D data is generated offline using a nonlinear simulation running on a cluster.
So what my embedded program has to do is take the current vehicle state (location, orientation, etc.) along with the 4D data to generate an estimated trajectory. I can't really provide the dataset due to proprietary reasons. I hope this answers some questions... sorry for being vague.
I will work on a binary implementation and try loading a subset of the array. I might have done something dumb that made it really slow. Thank you all for the comments; they gave me some new ideas to try out.
If your data cannot be generated programmatically, then it has to be somewhere on your hard disk when your program starts, and your program must somehow load that data into the 4D array.
So if your executable is 5 MB in size, that is normal, considering that the initialization data is included. This approach has the benefit that it is the OS that loads and initializes your array: when your program executes the first instruction of main(), the data is already there, and you just have to use it. The disadvantage is that if your program never needs to use the data, the memory it occupies will still be there, wasting address space.
On the other hand, you can keep your data in a separate file: be it a data file you load as part of your processing, a dynamic library your program loads when needed, or a binary file you map into memory. The advantage of this is that you load the data into memory only if and when it's needed, using the extra address space only while you actually access your data and freeing it when it's no longer needed. Besides, your executable will load faster, as no prior load and initialization will be required. The disadvantage is that your program will have to include some procedure to load and initialize the 4D array before using it, and another to dispose of it when it's no longer needed.
That said, for a static array of values that cannot be computed procedurally and that needs to live for the entire program, the most efficient way is to declare the array as global and initialize it in the same declaration. This adds a memory block with the initialized data, already in the format the array needs, to your .data section. The beginning of that memory block is assigned to the name of your array during the relocation operation.
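To make the "separate binary file mapped into memory" option concrete, here is a hedged sketch (the file name data.bin and the check details are assumptions): dump the array to disk once, then map it read-only at runtime and index it through a typed pointer.

/* Sketch: map a pre-generated binary dump of the 4D array at runtime
   instead of compiling the data into the executable. Assumes data.bin
   holds the floats in [31][31][25][100] order, written on a platform
   with the same float format and endianness. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef float data_array[31][31][25][100];

int main(void)
{
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size < (off_t)sizeof(data_array)) {
        fprintf(stderr, "data.bin is missing or too small\n");
        return 1;
    }

    /* Pages are faulted in lazily, so only the parts of the table you
       actually touch are read from disk. */
    data_array *data = mmap(NULL, sizeof(data_array), PROT_READ,
                            MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    printf("sample value: %f\n", (*data)[0][0][0][0]);

    munmap(data, sizeof(data_array));
    close(fd);
    return 0;
}

Producing data.bin in the first place can be a one-off helper program that still contains the initialized global array and simply fwrite()s it to a file.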
Do you need 32-bit floating-point precision?
16-bit fixed-point values would, for example, cut the size of your binary in half.
If the characteristics of the values stored in the table are linear rather than exponential, then fixed point is the most efficient way to store them in terms of precision per stored bit of information.
Uneven fixed-point representations of 24 bits or 12 bits per value are also a possibility.
You could also consider using different levels of precision for different parts of the table.
Is every single value of the lookup table actually used? Perhaps certain subsections of it can be omitted. It would reduce size at the cost of a more complicated data structure and lookup function.
On a side note, you may want to declare your lookup table to be "const".
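To illustrate the 16-bit fixed-point idea above, a minimal sketch (the scale factor is an assumption; choose it so the full range of your values fits in int16_t, and add rounding if you need it):

/* Sketch: store table entries as 16-bit fixed point instead of 32-bit float. */
#include <stdint.h>
#include <stdio.h>

#define SCALE 256.0f   /* 8 fractional bits: resolution of 1/256 */

static int16_t to_fixed(float x)   { return (int16_t)(x * SCALE); }  /* truncates */
static float   to_float(int16_t q) { return (float)q / SCALE; }

int main(void)
{
    float original = 12.345f;
    int16_t stored = to_fixed(original);   /* 2 bytes instead of 4 */
    printf("%f -> %d -> %f\n", original, stored, to_float(stored));
    return 0;
}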
As far as your lookups go, I would recommend using trees for each of your lists. This will greatly reduce lookup time to O(log n), and insertion time to O(log n) per element (O(n log n) in total) at most.
That will at least help your application move much faster when operating.
You'll want to use a decent data structure, such as a heap or a generic B-tree.

Conceptual Ideas - Memory is limited for an application but need to pass more data

I have a situation as follows (because of IP rights I cannot share technical details):
There are a few individual embedded applications running as part of a whole project.
Each of these applications can occupy a maximum of 9000 MB (9 GB) of memory.
I am upgrading an application as per a new requirement.
There are a few tables with buffer length 32767 in each application, which are passed to a network server for calculation at a 15 kHz frequency.
I need to double that, i.e. to 65534, to be passed to the network at a 30 kHz rate.
The problem arises here:
One of these applications already occupies 8094 MB (8 GB+), so doubling the table buffer length goes beyond the application's maximum size.
As a result the application output does not appear (but there is no crash).
My question is: have you ever overcome such a problem? Could you share some ideas on how I can do memory management in this particular case? All these programs are written in C++, Perl, C and Python (VxWorks, Linux and Sun Solaris are the operating systems used).
A quick reply is highly appreciated.
Thanks
It is all very vague, but I'll try to answer to the point:
If your program needs larger tables for whatever reason, but cannot occupy more memory, you have to change something else to compensate.
You don't mention why you need larger tables:
If the length of the records has increased, try to reduce their number.
If you can then store only a smaller number of entries, you'll have to send them more quickly, so that you don't have to hold as many of them at once.
You can also do some compression in RAM. That depends on the nature of the data, but in general it might help you.

Transferring large variable amount of memory from Cuda

CUDA is awesome and I'm using it like crazy, but I am not using its full potential because I'm having an issue transferring memory, and I was wondering if there was a better way to get a variable amount of memory out. Basically I send a 65535-item array into CUDA, and CUDA analyzes each data item in around 20,000 different ways; if there's a match in my program's logic, then it saves a 30-int list as a result. Think of my logic as analyzing each different combination and then looking at the total, and if the total is equal to a number I'm looking for, then it saves the results (which is a 30-int list for each analyzed item).
The problem is 65535 (blocks/items in the data array) * 20000 (total combinations tested per item) = 1,310,700,000. This means I need to create an array of that size to deal with the chance that all the data will be a positive match (which is extremely unlikely, and creating int output[1310700000][30] seems crazy for memory). I've been forced to make it smaller and send fewer blocks to process because I don't know if CUDA can write efficiently to a linked list or a dynamically sized list (with this approach it writes the output to host memory using block * number_of_different_way_tests).
Is there a better way to do this? Can CUDA somehow write to free memory that is not derived from the block id? When I test this process on the CPU, less than 10% of the item array has a positive match, so it's extremely unlikely I'll use so much memory each time I send work to the kernel.
P.S. I'm looking above and although it's exactly what I'm doing, if it's confusing then another way of thinking about it (not exactly what I'm doing, but good enough to understand the problem) is: I am sending 20,000 arrays (that each contain 65,535 items) and adding each item to its peer in the other arrays, and if the total equals a number (say 200-210) then I want to know the numbers it added to get that matching result.
If the numbers range very widely then not all will match, but with my approach I'm forced to malloc that huge amount of memory. Can I capture the results while mallocing less memory? My current approach is to malloc as much as I have free, but then I'm forced to run fewer blocks, which isn't efficient (I want to run as many blocks and threads at a time as possible, because I like the way CUDA organizes and runs the blocks). Are there any CUDA or C tricks I can use for this, or am I stuck with mallocing the maximum possible results (and buying a lot more memory)?
As per Roger Dahl's great answer:
The functionality you're looking for is called stream compaction.
You probably do need to provide an array that contains room for 4 solutions per thread because attempting to directly store the results in a compact form is likely to create so many dependencies between the threads that the performance gained in being able to copy less data back to the host is lost by a longer kernel execution time. The exception to this is if almost all of the threads find no solutions. In that case, you might be able to use an atomic operation to maintain an index into an array. So, for each solution that is found, you would store it in an array at an index and then use an atomic operation to increase the index. I think it would be safe to use atomicAdd() for this. Before storing a result, the thread would use atomicAdd() to increase the index by one. atomicAdd() returns the old value, and the thread can store the result using the old value as the index.
However, given a more common situation, where there's a fair number of results, the best solution will be to perform a compacting operation as a separate step. One way to do this is with thrust::copy_if. See this question for some more background.
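For reference, the "reserve a slot with an atomic counter" pattern described above, sketched in plain C with C11 atomics (in a CUDA kernel the counter would live in device memory and be updated with atomicAdd(), but the logic is the same; the capacity here is a placeholder):

/* Sketch: threads that find a solution reserve a compact slot atomically. */
#include <stdatomic.h>
#include <stdio.h>

#define MAX_RESULTS 1024   /* placeholder capacity */
#define RESULT_LEN  30

static int results[MAX_RESULTS][RESULT_LEN];
static atomic_int result_count;

/* Called by any thread that found a solution: reserve a slot, then fill it. */
static void store_result(const int *solution)
{
    /* fetch_add returns the previous value, which becomes this thread's slot. */
    int slot = atomic_fetch_add(&result_count, 1);
    if (slot < MAX_RESULTS)
        for (int i = 0; i < RESULT_LEN; i++)
            results[slot][i] = solution[i];
}

int main(void)
{
    int solution[RESULT_LEN] = {0};
    store_result(solution);   /* lands in results[0] */
    store_result(solution);   /* lands in results[1] */
    printf("stored %d results\n", (int)atomic_load(&result_count));
    return 0;
}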

How to sort a very large array in C

I want to sort on the order of four million long longs in C. Normally I would just malloc() a buffer to use as an array and call qsort() but four million * 8 bytes is one huge chunk of contiguous memory.
What's the easiest way to do this? I rate ease over pure speed for this. I'd prefer not to use any libraries and the result will need to run on a modest netbook under both Windows and Linux.
Just allocate a buffer and call qsort. 32 MB isn't so very big these days, even on a modest netbook.
If you really must split it up: sort smaller chunks, write them to files, and merge them (a merge takes a single linear pass over each of the things being merged). But, really, don't. Just sort it.
(There's a good discussion of the sort-and-merge approach in volume 2 of Knuth, where it's called "external sorting". When Knuth was writing that, the external data would have been on magnetic tape, but the principles aren't very different with discs: you still want your I/O to be as sequential as possible. The tradeoffs are a bit different with SSDs.)
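A minimal sketch of the "just allocate and qsort" route (the random fill is only there to have something to sort; the comparator compares instead of subtracting so it cannot overflow):

/* Sketch: sort four million long longs with malloc + qsort. */
#include <stdio.h>
#include <stdlib.h>

#define COUNT 4000000

/* Compare by ordering rather than subtracting, so large values can't overflow. */
static int cmp_ll(const void *pa, const void *pb)
{
    long long a = *(const long long *)pa;
    long long b = *(const long long *)pb;
    return (a > b) - (a < b);
}

int main(void)
{
    long long *buf = malloc(COUNT * sizeof *buf);   /* ~32 MB */
    if (!buf) { perror("malloc"); return 1; }

    /* Fill with arbitrary data; in practice, read your real values here. */
    for (size_t i = 0; i < COUNT; i++)
        buf[i] = (long long)rand() * rand();

    qsort(buf, COUNT, sizeof *buf, cmp_ll);

    printf("min = %lld, max = %lld\n", buf[0], buf[COUNT - 1]);
    free(buf);
    return 0;
}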
32 MB? That's not too big... quicksort should do the trick.
Your best option would be to avoid having the data unordered in the first place, if possible. As has been mentioned, you'd be better off reading the data from disk (or the network, or whatever the source) directly into a self-organizing container (a tree; perhaps std::set will do).
That way, you'll never have to sort through the lot, or worry about memory management. If you know the required capacity of the container, you might squeeze out additional performance by using std::vector(initialcapacity) or calling vector::reserve up front.
You'd then be best advised to use std::make_heap to heapify any existing elements, and then add element by element using push_heap (see also pop_heap). This is essentially the same paradigm as the self-ordering set, but
duplicates are ok
the storage is 'optimized' as a flat array (which is perfect for e.g. shared memory maps or memory mapped files)
(Oh, minor detail, note that sort_heap on the heap takes at most N log N comparisons, where N is the number of elements)
Let me know if you think this is an interesting approach. I'd really need a bit more info on the use case

How to manipulate *huge* amounts of data

I'm having the following problem. I need to store huge amounts of information (~32 GB) and be able to manipulate it as fast as possible. I'm wondering what's the best way to do it (combinations of programming language + OS + whatever you think is important).
The structure of the information I'm using is a 4D array (NxNxNxN) of double-precision floats (8 bytes). Right now my solution is to slice the 4D array into 2D arrays and store them in separate files on the HDD of my computer. This is really slow and the manipulation of the data is unbearable, so this is no solution at all!
I'm thinking of moving to a supercomputing facility in my country and storing all the information in RAM, but I'm not sure how to implement an application to take advantage of it (I'm not a professional programmer, so any book/reference will help me a lot).
An alternative solution I'm thinking of is to buy a dedicated server with lots of RAM, but I don't know for sure if that will solve the problem. So right now my ignorance doesn't let me choose the best way to proceed.
What would you do if you were in this situation? I'm open to any idea.
Thanks in advance!
EDIT: Sorry for not providing enough information, I'll try to be more specific.
I'm storing a discretized 4D mathematical function. The operations that I would like to perform include transposition of the array (changing b[i,j,k,l] = a[j,i,k,l] and the like), array multiplication, etc.
As this is a simulation of a proposed experiment, the operations will be applied only once. Once the result is obtained, it won't be necessary to perform more operations on the data.
EDIT (2):
I also would like to be able to store more information in the future, so the solution should be somehow scalable. The current 32 GB goal is because I want to have the array with N=256 points, but it'd be better if I could use N=512 (which means 512 GB to store it!).
Amazon's "High Memory Extra Large Instance" is only $1.20/hr and has 34 GB of memory. You might find it useful, assuming you're not running this program constantly.
Any decent answer will depend on how you need to access the data. Random access? Sequential access?
32GB is not really that huge.
How often do you need to process your data? Once per (lifetime | year | day | hour | nanosecond)? Often, stuff only needs to be done once. This has a profound effect on how much you need to optimize your solution.
What kind of operations will you be performing (you mention multiplication)? Can the data be split up into chunks, such that all necessary data for a set of operations is contained in a chunk? This will make splitting it up for parallel execution easier.
Most computers you buy these days have enough RAM to hold your 32GB in memory. You won't need a supercomputer just for that.
As Chris pointed out, it depends on what you are going to do with the data.
Besides, I think storing it in a (relational) database will be faster than reading it from the hard drive, since the RDBMS will perform some optimizations for you, like caching.
If you can represent your problem as MapReduce, consider a clustering system optimized for disk access, such as Hadoop.
Your description sounds more math-intensive, in which case you probably want to have all your data in memory at once. 32 GB of RAM in a single machine is not unreasonable; Amazon EC2 offers virtual servers with up to 68 GB.
Without more information: if you need the quickest possible access to all the data, I would go with using C for your programming language, some flavor of *nix as the OS, and buying RAM; it's relatively cheap now. This also depends on what you are familiar with; you can go the Windows route as well. But as others have mentioned, it will depend on how you are using this data.
So far, there are a lot of very different answers. There are two good starting points mentioned above. David suggests some hardware and someone mentioned learning C. Both of these are good points.
C is going to get you what you need in terms of speed and direct memory paging. The last thing you want to do is perform linear searches on the data. That would be slow - slow - slow.
Determine your workflow: if your workflow is linear, that is one thing. If the workflow is not linear, I would design a binary tree referencing pages in memory. There is a ton of information on B-trees on the Internet. In addition, these B-trees will be much easier to work with in C, since you will also be able to set up and manipulate your memory paging.
Depending on your use, some mathematical and physical problems tend to be mostly zeros (for example, Finite Element models). If you expect that to be true for your data, you can get serious space savings by using a sparse matrix instead of actually storing all those zeros in memory or on disk.
Check out wikipedia for a description, and to decide if this might meet your needs:
http://en.wikipedia.org/wiki/Sparse_matrix
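For illustration, a minimal coordinate-list (COO) sketch of the idea in C, assuming the array really is mostly zeros (the dimensions, sample values, and linear-scan lookup are placeholders; a real implementation would sort the entries and binary-search, or hash the index tuple):

/* Sketch: coordinate-list storage for a mostly-zero 4D array.
   Only nonzero entries are kept, as (indices, value) pairs. */
#include <stdio.h>

struct entry {
    unsigned short i, j, k, l;  /* indices into the conceptual 4D array */
    double value;               /* the nonzero value stored there */
};

static double lookup(const struct entry *e, size_t n,
                     unsigned i, unsigned j, unsigned k, unsigned l)
{
    for (size_t idx = 0; idx < n; idx++)
        if (e[idx].i == i && e[idx].j == j && e[idx].k == k && e[idx].l == l)
            return e[idx].value;
    return 0.0;  /* entries not listed are implicitly zero */
}

int main(void)
{
    struct entry data[] = { {1, 2, 3, 4, 42.0}, {0, 0, 0, 1, -7.5} };
    size_t n = sizeof data / sizeof data[0];
    printf("%f %f\n", lookup(data, n, 1, 2, 3, 4), lookup(data, n, 5, 5, 5, 5));
    return 0;
}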
Here's another idea:
Try using an SSD to store your data. Since you're grabbing very small amounts of random data, an SSD would probably be much, much faster.
You may want to try using mmap instead of reading the data into memory, but I'm not sure it'll work with 32 GB files.
The whole database technology is about manipulating huge amounts of data that can't fit in RAM, so that might be your starting point (i.e. get a good dbms principles book and read about indexing, query execution, etc.).
A lot depends on how you need to access the data - if you absolutely need to jump around and access random bits of information, you're in trouble, but perhaps you can structure your processing of the data such that you will scan it along one axis (dimension). Then you can use a smaller buffer and continuously dump already processed data and read new data.
For transpositions, it's faster to actually just change your understanding of what index is what. By that, I mean you leave the data where it is and instead wrap an accessor delegate that changes b[i][j][k][l] into a request to fetch (or update) a[j][i][k][l].
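A tiny sketch of that accessor idea in C (N is a placeholder dimension): the "transposed" array b is never materialized; its accessor just swaps the first two indices when reading from a.

/* Sketch: a lazy transpose that swaps indices instead of moving data. */
#include <stdio.h>

#define N 4

static double a[N][N][N][N];

/* Read b[i][j][k][l], where b is the transpose of a in the first two
   dimensions; no element of a is ever copied. */
static double b_get(int i, int j, int k, int l)
{
    return a[j][i][k][l];
}

int main(void)
{
    a[2][1][0][3] = 3.14;
    printf("b[1][2][0][3] = %f\n", b_get(1, 2, 0, 3));
    return 0;
}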
Could it be possible to solve it with this procedure?
First create M child processes and execute them in parallel. Each process will run on a dedicated core of a cluster and will load some portion of the array into the RAM of that core.
A parent process will be the manager of the array, calling (or connecting to) the appropriate child process to obtain certain chunks of data.
Will this be faster than the HDD storage approach? Or am I cracking nuts with a sledgehammer?
The first thing that I'd recommend is picking an object-oriented language, and develop or find a class that lets you manipulate a 4-D array without concern for how it's actually implemented.
The actual implementation of this class would probably use memory-mapped files, simply because that can scale from low-power development machines up to the actual machine where you want to run production code (I'm assuming that you'll want to run this many times, so that performance is important -- if you can let it run overnight, then a consumer PC may be sufficient).
Finally, once I had my algorithms and data debugged, I would look into buying time on a machine that could hold all the data in memory. Amazon EC2, for instance, will provide you with a machine that has 68 GB of memory for $US 2.40 an hour (less if you play with spot instances).
How to handle processing large amounts of data typically revolves around the following factors:
Data access order / locality of reference: Can the data be separated into independent chunks that are then processed either independently or in a serial/sequential fashion, vs. random access to the data with little or no order?
CPU vs I/O bound: Is the processing time spent more on computation with the data or reading/writing it from/to storage?
Processing frequency: Will the data be processed only once, every few weeks, daily, etc?
If the data access order is essentially random, you will need either to get access to as much RAM as possible and/or find a way to at least partially organize the order so that not as much of the data needs to be in memory at the same time. Virtual memory systems slow down very quickly once physical RAM limits are exceeded and significant swapping occurs. Resolving this aspect of your problem is probably the most critical issue.
Other than the data access order issue above, I don't think your problem has significant I/O concerns. Reading/writing 32 GB is usually measured in minutes on current computer systems, and even data sizes up to a terabyte should not take more than a few hours.
Programming language choice is actually not critical so long as it is a compiled language with a good optimizing compiler and decent native libraries: C++, C, C#, or Java are all reasonable choices. The most computationally and I/O-intensive software I've worked on has actually been in Java and deployed on high-performance supercomputing clusters with a few thousand CPU cores.
