The FFTW manual states (page 4):
The data is an array of type fftw_complex, which is by default a double[2] composed of the real (in[i][0]) and imaginary (in[i][1]) parts of a complex number.
To compute the FFT of a time series of n samples, the array is effectively a matrix of n rows and 2 columns. If we wish to do element-by-element manipulation, isn't accessing in[i][0] for different values of i slow for large n, since C stores 2D arrays in row-major order?
The real and imaginary parts are stored consecutively in memory (assuming a little-endian layout where byte 0 of R0 is at the smallest address):
I(n-1),R(n-1) | ... | I1,R1 | I0,R0    (addresses increase from right to left)
That means it's possible to copy an element i into place while accessing a single cache line (usually 64 bytes today), as the real and imaginary parts are adjacent. If you stored the 2D array in Fortran (column-major) order and wanted to assign to one element, you would immediately access memory on two different cache lines, as the two parts would be stored N*sizeof(double) locations apart in memory.
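To make the layout concrete, here is a minimal sketch of element-wise access, using the manual's default definition of fftw_complex so no FFTW headers are needed; each iteration touches one 16-byte pair, which almost always sits in a single cache line:

#include <cstddef>

// fftw_complex is by default double[2]: [0] = real, [1] = imaginary,
// per the manual's description quoted above.
typedef double fftw_complex[2];

void scale(fftw_complex* in, std::size_t n, double k) {
    for (std::size_t i = 0; i < n; ++i) {
        in[i][0] *= k;  // real part
        in[i][1] *= k;  // imaginary part, 8 bytes after the real part
    }
}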
Now if, for some reason, your processing operated on the real parts in one thread and the imaginary parts separately in another, then yes, it would be more efficient to store them in column-major order, or even as separate parallel arrays. In general, though, data is stored close together because it is used together.
All arrays in C are really one-dimensional byte arrays, unless you store an array of pointers to arrays, as is usually done for things like strings of varying lengths.
Sometimes in matrix calculations it's actually faster to transpose one array first, because of the access pattern that matrix multiplication imposes. If you want the real nitty-gritty details, search for Ulrich Drepper's article about memory at LWN.net, which shows an example that benefits from this technique (section 5, IIRC).
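A minimal sketch of that transpose trick, assuming square N x N row-major matrices (Drepper's article develops a more elaborate blocked version):

#include <cstddef>
#include <vector>

// C = A * B, all N x N, row-major. Transposing B once up front makes
// the hot inner loop walk both operands sequentially in memory.
void matmul_transposed(const std::vector<double>& A,
                       const std::vector<double>& B,
                       std::vector<double>& C, std::size_t N) {
    std::vector<double> Bt(N * N);
    for (std::size_t i = 0; i < N; ++i)       // transpose B once, O(N^2)
        for (std::size_t j = 0; j < N; ++j)
            Bt[j * N + i] = B[i * N + j];

    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < N; ++k)
                sum += A[i * N + k] * Bt[j * N + k];  // both sequential
            C[i * N + j] = sum;
        }
}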
Very often, scientific numeric libraries have worked in column-major order, because Fortran compatibility was more important than using the array in a natural way. Most languages prefer row-major, as it's generally more desirable, for instance when you store fixed-length strings in a table.
I know that traditional "lists" in Perl are implemented internally as real doubly-linked lists, so indexed access to list elements is slow. This is the cost of the dynamic nature of lists, which can be sliced, expanded, and shrunk.
But for performance reasons it would be very good to have the possibility to malloc() some memory chunk and create a vector of static size with a predefined element size: for example, a fixed-size doubly-linked list could be represented as a sequence of elements of size 4 (prev_v_index) + 4 (next_v_index) + 8 (data_ptr, a.k.a. REF) = 16 bytes. Then we could access every element of this vector as we usually do in compiled languages like C: elem_ptr = vector_ptr + (index * elem_size). Access to elements would be very fast, with some architecture-specific alignment (8 bytes for x86_64).
Maybe there is already some XS module for manipulating fixed-size vectors in Perl 5?
Perl's arrays (@array variables or [...] references) do use a contiguous memory region. They are not linked lists. However, these arrays hold only pointers to the scalar values, not the values themselves. This is a necessary restriction of the Perl data model.
If you know C++, a Perl array can be thought of as similar to a std::vector<Scalar*>, except that Perl's arrays can push and pop at the front and the back.
To resize a Perl array, you can assign to the last index. E.g. to pre-allocate 50 elements:
my @array;
$#array = 50 - 1;
If you need compact data storage within Perl, you will have to use strings. Given a fixed-size record, you can get and set one record with substr, and pack/unpack the data from and to Perl data structures.
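For comparison, here is the same idea sketched in C++ terms, with a hypothetical 16-byte record mirroring the layout from the question; substr plus pack/unpack does the equivalent inside a Perl string:

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical fixed-size record: 4 + 4 + 8 = 16 bytes on typical platforms.
struct Record {
    std::uint32_t prev;
    std::uint32_t next;
    double        data;
};
static_assert(sizeof(Record) == 16, "packed record size");

// Records live in one contiguous byte buffer, addressed purely by offset.
void set_record(std::vector<char>& buf, std::size_t i, const Record& r) {
    std::memcpy(buf.data() + i * sizeof(Record), &r, sizeof(Record));
}

Record get_record(const std::vector<char>& buf, std::size_t i) {
    Record r;
    std::memcpy(&r, buf.data() + i * sizeof(Record), sizeof(Record));
    return r;
}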
You can use the vec function to treat a string as a vector. For example, you could pack Boolean values into individual bits.
vec EXPR,OFFSET,BITS
Treats the string in EXPR as a bit vector made up of elements of
width BITS and returns the value of the element specified by
OFFSET as an unsigned integer. BITS therefore specifies the
number of bits that are reserved for each element in the bit
vector. This must be a power of two from 1 to 32 (or 64, if your
platform supports that).
That said, your concern about array access being "slow" is unwarranted, and your beliefs about Perl's internals are incorrect. Array performance is likely to be fast enough. Don't try to "optimize" around it until you've profiled your code and proven that it's a bottleneck.
I'm using arrays of elements, many of which reference each other, and I assumed that in such a case it's more efficient to use pointers.
But in some cases I need to know the index of an element I have a pointer to. For example, I have p = &a[i] and I need to know the value of i. As I understand it, i can be computed as p - a. But this operation inherently involves a division, which is expensive, whereas computing an address from an array index involves a multiplication and is faster.
So my question is: is cross-referencing with pointers worth it in a case where you need the indexes as well?
But this operation inherently involves a division, which is expensive, whereas computing an address from an array index involves a multiplication and is faster.
This operation requires a division only when the size of the element is not a power of two, i.e. when it is not a pointer or some other standard type on most systems. Dividing by a power of two is done using a bit shift, which is extremely cheap.
computing an address from an array index involves a multiplication and is faster.
The same logic applies here, except the compiler shifts left instead of shifting right.
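A minimal sketch of both directions, assuming an 8-byte element type, with the shifts the compiler typically emits noted in comments:

#include <cstddef>
#include <cstdio>

int main() {
    double a[100] = {};
    double* p = &a[42];

    // Pointer difference: the compiler divides the byte distance by
    // sizeof(double) == 8, which it lowers to a right shift by 3.
    std::ptrdiff_t i = p - a;   // i == 42

    // Index to address: base + i * 8, a left shift by 3 that is usually
    // folded straight into the addressing mode.
    double* q = &a[i];          // q == p again

    std::printf("i = %td, same = %d\n", i, q == p);
}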
is cross-referencing with pointers worth it in a case where you need the indexes as well?
Counting CPU cycles without profiling is a case of premature optimization, and a bad habit to bring into the early stages of a design.
A more important consideration is that indexes are more robust, because they often survive array reallocation.
Consider an example: let's say you have an array that grows dynamically as you add elements to its back, an index into that array, and a pointer into that array. You add an element to the array, exhausting its capacity, so now it must grow. You call realloc, and get a new array (or an old array if there was enough extra memory after the "official" end). The pointer that you held is now invalid; the index, however, is still valid.
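The same hazard sketched in C++ terms, with std::vector growth standing in for realloc:

#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> v = {10, 20, 30};
    std::size_t idx = 1;
    int* ptr = &v[idx];

    // Grow past the current capacity: the vector reallocates and moves
    // its storage, so ptr now dangles. The index is still valid.
    v.reserve(v.capacity() + 1);

    std::printf("%d\n", v[idx]);   // fine: prints 20, the index survived
    // std::printf("%d\n", *ptr);  // undefined behavior: dangling pointer
    (void)ptr;
}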
Indexing an array is dirt cheap, in ways where I've never found any kind of performance boost by directly using pointers instead. That includes some very performance-critical areas, like looping through each pixel of an image containing millions of them; there was still no measurable performance difference between indices and pointers (though it does make a difference if you can access an image using one sequential loop instead of two).
I've actually found many opposite cases where turning pointers into 32-bit indices boosted performance, once 64-bit hardware became available and there was a need to store a boatload of them.
One of the reasons is obvious: 32-bit indices take half the space of 64-bit pointers (assuming you don't need more than ~4.3 billion elements). If you're storing a boatload of them, as in a graph data structure like an indexed mesh, then taking half the memory typically means fewer cache misses, since your links/adjacency data fit in half the memory space.
But on a deeper level, using indices allows a lot more options. You can use purely contiguous structures that realloc to new sizes without worrying about invalidation, as dasblinkenlight points out. The indices will also tend to be more dense (as opposed to sparsely fragmented across the entire 64-bit address space), even if you leave holes in the array, allowing for effective compression (delta, frame of reference, etc.) if you want to squash down memory usage. You can then also use parallel arrays to associate data with elements without using something much more expensive like a hash table. That includes parallel bitsets, which let you do things like set intersections in linear time. It also allows for structure-of-arrays (SoA) representations (also parallel arrays), which tend to be optimal for sequential access patterns using SIMD.
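A minimal sketch of the space argument, with hypothetical link records for a node pool:

#include <cstdint>
#include <vector>

// Pointer-based links: 16 bytes of link data per node on 64-bit hardware.
struct NodePtrLinks {
    NodePtrLinks* next;   // 8 bytes
    NodePtrLinks* prev;   // 8 bytes
};

// Index-based links: 8 bytes per node, half the cache footprint.
struct NodeIdxLinks {
    std::uint32_t next;   // 4 bytes, index into the pool below
    std::uint32_t prev;   // 4 bytes
};

// The pool is a plain contiguous array; the indices stay valid even
// when the pool reallocates as it grows.
using NodePool = std::vector<NodeIdxLinks>;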
You get a lot more room to optimize with indices, and I'd consider it mostly a waste of memory to keep pointers around on top of indices. The downside of indices, for me, is mostly convenience: you have to have access to the array you're indexing, on top of the index itself, while a pointer lets you access the element without access to its container. It's often more difficult and error-prone to write code and data structures revolving around indices, and also harder to debug, since you can't see the value of an element through an index. That said, if you accept the extra burden, you often get more room to optimize with indices, not less.
tl;dr: What is the fastest way to sort an uint8x16_t?
I need to sort many arrays of exactly 16 unsigned bytes (in descending order, which doesn't matter, of course), and I'm trying to optimize the sorting by means of ARM NEON vectorization.
And I find it to be quite a fancy puzzle, as it seems that there "must" exist a short combination of NEON instructions (such as vmax/vpmax/vmin/vpmin, vzip/vuzp) that reliably results in a sorted array.
For example, if we transform a pair (A, B) of two 8-byte arrays into (vpmax(A,B), vpmin(A,B)), we obtain the same 16 values, just in a different order. If we repeat this operation four times, we reliably have the array maximum in the first cell and the array minimum in the last cell; we cannot be sure about the middle elements, though.
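For reference, a minimal sketch of that (vpmax, vpmin) step as intrinsics (NEON required; pmax_pmin_step is just a hypothetical helper name). One step only rearranges the 16 values; after four steps the maximum has migrated to lane 0 of the first half and the minimum to lane 7 of the second:

#include <arm_neon.h>

static inline void pmax_pmin_step(uint8x8_t& a, uint8x8_t& b) {
    uint8x8_t hi = vpmax_u8(a, b);  // lanes 0-3: pairwise max of a; lanes 4-7: of b
    uint8x8_t lo = vpmin_u8(a, b);  // same lane arrangement, with pairwise min
    a = hi;                         // each (max, min) pair preserves both inputs,
    b = lo;                         // so the 16-value multiset is unchanged
}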
Another example: if we first do (C,D) = (vmax(A,B), vmin(A,B)), then (E,F) = (vpmax(C,D), vpmin(C,D)), then (G,H) = vzip(E,F), we get our array split into four parts of four bytes each, and in each part we already know the largest and the smallest element. Probably the next naive step would be to deinterleave this array so that the top byte of each group sits at the start of the array (these won't necessarily be the top 4 elements of the array, just the top bytes of their respective groups) and repeat; I'm not yet sure where that leads in the end.
Is there any known method for this particular problem or for other similar problems (for different array sizes or whatever)? Any ideas are appreciated :)
I know that Intel Fortran has libraries with functions and subroutines for working with sparse matrices, but I'm wondering if there is also some sort of data type or automated method for creating the sparse matrices in the first place.
BACKGROUND: I have a program that uses some 3- and 4-dimensional arrays that can be very large in the first two dimensions (~10k to ~100k elements in each dimension, maybe more). In the first two dimensions, each array is mostly (95% or so) populated with zeroes. To make the program friendly to machines with a "normal" amount of RAM available, I'd like to convert to sparse matrices. The manner in which the current conventional arrays are handled and updated throughout the code is pretty dependent on the application, so I'm looking for a way to convert to sparse matrix storage without significant modification to the code. Basically, I'm lazy, and I don't want to revise the entire memory management implementation or write an entire new module where my arrays live and are managed. Is there a library or something else for Fortran that would implement a data type or something, so that I can use sparse matrix storage without re-engineering each array and how it is handled? Thanks for the help. Cheers.
There are many different sparse formats and many different libraries for handling sparse matrices in Fortran (e.g. SPARSKIT, PETSc, ...). However, none of them can offer the compact array-handling formalism that is available in Fortran for intrinsic dense arrays (especially the subarray notation). So you'll have to touch your code in several places when you change it to use sparse matrices.
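To see why the dense notation can't carry over, here is a sketch of the widely used compressed sparse row (CSR) layout, written in C++ for brevity (the Fortran libraries keep equivalent arrays; CsrMatrix is a hypothetical type). Only the nonzeros are stored, so even reading a single element becomes a search:

#include <cstddef>
#include <vector>

struct CsrMatrix {
    std::size_t rows = 0, cols = 0;
    std::vector<double>      val;      // nonzero values, stored row by row
    std::vector<std::size_t> col_idx;  // column of each stored value
    std::vector<std::size_t> row_ptr;  // row i occupies [row_ptr[i], row_ptr[i+1])

    // A(i, j): scan row i for column j; anything not stored is zero.
    double at(std::size_t i, std::size_t j) const {
        for (std::size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            if (col_idx[k] == j) return val[k];
        return 0.0;  // structural zero
    }
};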
Many algorithms work by using the merge algorithm to merge two different sorted arrays into a single sorted array. For example, given as input the arrays
1 3 4 5 8
and
2 6 7 9
The merge of these arrays would be the array
1 2 3 4 5 6 7 8 9
Traditionally, there seem to be two different approaches to merging sorted arrays (note that the case for merging linked lists is quite different). First, there are out-of-place merge algorithms that work by allocating a temporary buffer for storage, then storing the result of the merge in the temporary buffer. Second, if the two arrays happen to be part of the same input array, there are in-place merge algorithms that use only O(1) auxiliary storage space and rearrange the two contiguous sequences into one sorted sequence. These two classes of algorithms both run in O(n) time, but the out-of-place merge algorithm tends to have a much lower constant factor because it does not have such stringent memory requirements.
My question is whether there is a known merging algorithm that can "interpolate" between these two approaches. That is, the algorithm would use somewhere between O(1) and O(n) memory, but the more memory it has available to it, the faster it runs. For example, if we were to measure the absolute number of array reads/writes performed by the algorithm, it might have a runtime of the form n g(s) + f(s), where s is the amount of space available to it and g(s) and f(s) are functions derivable from that amount of space available. The advantage of this function is that it could try to merge together two arrays in the most efficient way possible given memory constraints - the more memory available on the system, the more memory it would use and (ideally) the better the performance it would have.
More formally, the algorithm should work as follows. Given as input an array A consisting of two adjacent, sorted ranges, rearrange the elements in the array so that the elements are completely in sorted order. The algorithm is allowed to use external space, and its performance should be worst-case O(n) in all cases, but should run progressively more quickly given a greater amount of auxiliary space to use.
Is anyone familiar with an algorithm of this sort (or know where to look to find a description of one)?
At least according to the documentation, the in-place merge function in the SGI STL is adaptive, and "its run-time complexity depends on how much memory is available". The source code is available, of course, so you could at least check this one.
EDIT: The STL has inplace_merge, which will adapt to the size of the temporary buffer available. If the temporary buffer is at least as big as one of the sub-arrays, it's O(N). Otherwise, it splits the merge into two sub-merges and recurses. The split takes O(log N) to find the right part of the other sub-array to rotate in (binary search).
So it goes from O(N) to O(N log N) depending on how much memory you have available.
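A short usage sketch with the arrays from the question: std::inplace_merge merges the two adjacent sorted ranges [begin, mid) and [mid, end), grabbing a temporary buffer when it can and falling back to the slower rotation-based scheme when it can't:

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    // The two adjacent sorted ranges from the question, in one array.
    std::vector<int> a = {1, 3, 4, 5, 8, 2, 6, 7, 9};

    // Merge [a.begin(), a.begin() + 5) with [a.begin() + 5, a.end()).
    std::inplace_merge(a.begin(), a.begin() + 5, a.end());

    for (int x : a) std::printf("%d ", x);  // 1 2 3 4 5 6 7 8 9
    std::printf("\n");
}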