I know that traditional "lists" in Perl are implemented internally as real doubly-linked lists, so indexed access to list elements is slow. This is the cost of the dynamic nature of lists, which can be sliced, expanded, and shrunk.
But for performance reasons it would be very good to be able to malloc() a chunk of memory and create a vector of static size with a predefined element size. For example, a fixed-size doubly-linked list could be represented as a sequence of elements of size 4 (prev_v_index) + 4 (next_v_index) + 8 (data_ptr, i.e. a REF) = 16 bytes. Then we could access every element of this vector as we usually do in compiled languages like C: elem_ptr = vector_ptr + (index * elem_size). Access to elements would be very fast, with some architecture-specific alignment (8 bytes for x86_64).
Maybe there is already an XS module for manipulating fixed-size vectors in Perl 5?
Perl's arrays (@array variables or [...] references) do use a contiguous memory region. They are not linked lists. However, these arrays only hold pointers to the scalar values, not the values themselves. This is a necessary restriction of the Perl data model.
If you know C++, a Perl array can be thought of as similar to a std::vector<Scalar*>, except that Perl's arrays can push and pop at the front and the back.
To resize a Perl array, you can assign to the last index, $#array. E.g. to pre-allocate 50 elements:
my @array;
$#array = 50 - 1;
If you need compact data storage within Perl, then you will have to use strings. Given a fixed-size record, you can get and set one record with substr, and pack/unpack the data from and to Perl data structures.
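A minimal sketch of that technique, with a made-up 16-byte record layout (two 32-bit indices plus a 64-bit value; the 'q' pack template requires a perl built with 64-bit integer support):

use strict;
use warnings;

my $RECORD_SIZE = 16;
my $buffer = "\0" x ($RECORD_SIZE * 50);   # pre-allocate 50 zeroed records

sub set_record {
    my ($index, $prev, $next, $data) = @_;
    substr($buffer, $index * $RECORD_SIZE, $RECORD_SIZE)
        = pack('llq', $prev, $next, $data);
}

sub get_record {
    my ($index) = @_;
    return unpack('llq',
        substr($buffer, $index * $RECORD_SIZE, $RECORD_SIZE));
}

set_record(0, -1, 1, 42);
my ($prev, $next, $data) = get_record(0);   # (-1, 1, 42)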
You can use the vec function to treat a string as a vector. For example, you could pack Boolean values into individual bits.
vec EXPR,OFFSET,BITS
Treats the string in EXPR as a bit vector made up of elements of width BITS and returns the value of the element specified by OFFSET as an unsigned integer. BITS therefore specifies the number of bits that are reserved for each element in the bit vector. This must be a power of two from 1 to 32 (or 64, if your platform supports that).
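A small sketch of one-bit-per-flag usage (vec also works as an lvalue, so you can assign to it):

my $flags = '';             # the string grows as bits are set
vec($flags, 3, 1) = 1;      # set bit 3
vec($flags, 7, 1) = 1;      # set bit 7
print vec($flags, 3, 1);    # 1
print vec($flags, 4, 1);    # 0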
That said, your concern about array access being "slow" is unwarranted, and your beliefs about Perl's internals are incorrect. Array performance is likely to be fast enough. Don't try to "optimize" around it until you've profiled your code and proven that it's a bottleneck.
Related
I'm using arrays of elements, many of which reference each other, and I assumed that in this case it's more efficient to use pointers.
But in some cases I need to know the index of an element I have the pointer to. For example I have p = &a[i] and I need to know the value of i. As I understand it, i can be computed through p - a. But this operation inherently involves division, which is expensive, whereas computing an address from an array index involves a multiplication and is faster.
So my question is, is cross referencing with pointers in a case where you need the indexes as well even worth it?
But this operation inherently involves division, which is expensive, whereas computing an address from an array index involves a multiplication and is faster.
This operation requires a division only when the size of the element is not a power of two, i.e. when the element is not a pointer or one of the standard types found on most systems. Dividing by a power of two is done with a bit shift, which is extremely cheap.
computing an address from an array index involves a multiplication and is faster.
Same logic applies here, except the compiler shifts left instead of shifting right.
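A tiny sketch of what the compiler does for an 8-byte element type:

#include <stddef.h>
#include <stdio.h>

int main(void) {
    double a[100];              /* sizeof(double) == 8, a power of two */
    double *p = &a[42];

    ptrdiff_t i = p - a;        /* byte difference >> 3, not a division */
    double *q = &a[i];          /* a + (i << 3) bytes, not a multiply   */

    printf("%td %d\n", i, p == q);   /* 42 1 */
    return 0;
}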
is cross referencing with pointers in a case where you need the indexes as well even worth it?
Counting CPU cycles without profiling is a case of premature optimization - a bad thing to consider when you are starting your design.
A more important consideration is that indexes are more robust, because they often survive array reallocation.
Consider an example: let's say you have an array that grows dynamically as you add elements to its back, an index into that array, and a pointer into that array. You add an element to the array, exhausting its capacity, so now it must grow. You call realloc, and get a new array (or an old array if there was enough extra memory after the "official" end). The pointer that you held is now invalid; the index, however, is still valid.
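A minimal sketch of that scenario:

#include <stdlib.h>

int main(void) {
    size_t cap = 1;
    int *arr = malloc(cap * sizeof *arr);
    if (!arr) return 1;

    size_t idx = 0;             /* index of the element we care about */
    int   *ptr = &arr[idx];     /* pointer to that same element       */
    arr[idx] = 7;

    cap = 1024;                 /* grow: the block may move           */
    int *tmp = realloc(arr, cap * sizeof *tmp);
    if (!tmp) { free(arr); return 1; }
    arr = tmp;

    int ok = arr[idx];          /* still valid: ok == 7                   */
    (void)ok; (void)ptr;        /* dereferencing ptr now may be undefined */
    free(arr);
    return 0;
}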
Indexing an array is dirt cheap, in ways where I've never found any kind of performance boost by directly using pointers instead. That includes some very performance-critical areas like looping through each pixel of an image containing millions of them -- still no measurable performance difference between indices and pointers (though it does make a difference if you can access an image using one sequential loop instead of two).
I've actually found many opposite cases, after 64-bit hardware started becoming available, where turning pointers into 32-bit indices boosted performance when there was a need to store a boatload of them.
One of the reasons is obvious: you can take half the space now with 32-bit indices (assuming you don't need more than ~4.3 billion elements). If you're storing a boatload of them and taking half the memory as in the case of a graph data structure like indexed meshes, then typically you end up with fewer cache misses when your links/adjacency data can be stored in half the memory space.
But on a deeper level, using indices allows a lot more options. You can use purely contiguous structures that realloc to new sizes without worrying about invalidation, as dasblinkenlight points out. The indices will also tend to be more dense (as opposed to sparsely fragmented across the entire 64-bit addressing space), even if you leave holes in the array, allowing for effective compression (delta, frame-of-reference, etc.) if you want to squash down memory usage. You can then also use parallel arrays to associate data with elements without resorting to something much more expensive like a hash table. That includes parallel bitsets, which allow you to do things like set intersections in linear time. It also allows for structure-of-arrays (SoA) representations (also parallel arrays), which tend to be optimal for sequential access patterns using SIMD.
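A minimal sketch of the idea (names here are illustrative): a linked structure threaded through a contiguous pool with 32-bit indices instead of 64-bit pointers, which halves the link storage and survives reallocation of the pool:

#include <stdint.h>
#include <stdio.h>

#define NIL UINT32_MAX          /* sentinel: no next element */

struct node { float x, y; uint32_t next; };   /* 4-byte link, not 8 */

int main(void) {
    struct node pool[4];
    for (uint32_t i = 0; i < 4; ++i) {        /* thread 0 -> 1 -> 2 -> 3 */
        pool[i].x = (float)i;
        pool[i].y = 0.0f;
        pool[i].next = (i + 1 < 4) ? i + 1 : NIL;
    }
    for (uint32_t i = 0; i != NIL; i = pool[i].next)
        printf("node %u: x=%g\n", i, pool[i].x);
    return 0;
}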
You get a lot more room to optimize with indices, and I'd consider it mostly just a waste of memory to keep pointers around on top of indices. The downside to indices for me is mostly convenience: we have to have access to the array we're indexing on top of the index itself, while a pointer lets you access the element without access to its container. It's often more difficult and error-prone to write code and data structures revolving around indices, and also harder to debug, since we can't see the value of an element through an index. That said, if you accept the extra burden, then often you get more room to optimize with indices, not less.
I'm porting some imperative code to Haskell. My goal is to analyze an executable: each byte of the text section gets assigned a number of flags, which would all fit in a byte (6 bits, to be precise).
In a language like C, I would just allocate an array of bytes, zero them and update them as I go. How would I do this efficiently in Haskell?
In other words: I'm looking for a ByteString with bit-wise access and constant time updates as I disassemble more of the text section.
Edit: Of course, any other kind of data structure would do if it were similarly efficient.
The implementation of unboxed arrays of Bools in the array package is a packed bit array. You can do mutable updates on such arrays in the ST monad (this gives essentially the same runtime behaviour as in C).
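A minimal sketch (the flag positions are made up):

import Data.Array.ST (newArray, writeArray, runSTUArray)
import Data.Array.Unboxed (UArray, (!))

-- Build a packed bit array of n Bool flags: mutable updates run in
-- the ST monad, then the array is frozen to an immutable UArray.
mkFlags :: Int -> UArray Int Bool
mkFlags n = runSTUArray $ do
    arr <- newArray (0, n - 1) False   -- all flags start cleared
    writeArray arr 3 True              -- flag a byte as we disassemble
    writeArray arr 7 True
    return arr

main :: IO ()
main = print (mkFlags 16 ! 3)          -- True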
You can use a vector with any element type that's an instance of the Bits typeclass, for example Word64 for 64 bits per element of the vector. Use an unboxed vector if you want the array to be contiguous in memory.
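A sketch of that approach, assuming the vector package (the helper names setFlag/testFlag are made up here):

import Data.Bits (setBit, testBit)
import Data.Word (Word64)
import qualified Data.Vector.Unboxed.Mutable as VUM

-- A flat bit set over a mutable unboxed vector of Word64,
-- 64 flags per element.
setFlag :: VUM.IOVector Word64 -> Int -> IO ()
setFlag v i = VUM.modify v (`setBit` (i `mod` 64)) (i `div` 64)

testFlag :: VUM.IOVector Word64 -> Int -> IO Bool
testFlag v i = do
    w <- VUM.read v (i `div` 64)
    return (testBit w (i `mod` 64))

main :: IO ()
main = do
    v <- VUM.replicate 4 0       -- 4 * 64 = 256 flags, all cleared
    setFlag v 100
    print =<< testFlag v 100     -- True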
The FFTW manual states (page 4):
The data is an array of type fftw_complex, which is by default a double[2] composed of the real (in[i][0]) and imaginary (in[i][1]) parts of a complex number.
For doing an FFT of a time series of n samples, the size of the matrix becomes n rows and 2 columns. If we wish to do element-by-element manipulation, isn't accessing in[i][0] for different values of i slow for large values of n, since C stores 2D arrays in row-major format?
The real and imaginary parts are stored consecutively in memory (assuming a little-endian layout where byte 0 of R0 is at the smallest address):

I(n-1),R(n-1) | ... | I1,R1 | I0,R0

That means it's possible to copy element i into place while accessing only one cache line (usually 64 bytes today), as the real and imaginary parts are adjacent. If you stored the 2D array in Fortran (column-major) order and wanted to assign to one element, you would immediately touch memory on two different cache lines, as the two parts would be stored N * sizeof(double) locations apart in memory.
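For reference, FFTW's default complex type really is a two-element double array, so a sweep over i is a sequential walk through memory. A small sketch (fftw_complex is typedef'd here rather than pulled in from fftw3.h):

#include <stdio.h>

typedef double fftw_complex[2];   /* FFTW's default: double[2] */

int main(void) {
    fftw_complex in[8];
    for (int i = 0; i < 8; ++i) {
        in[i][0] = (double)i;     /* real part      */
        in[i][1] = 0.0;           /* imaginary part */
    }
    /* consecutive elements are sizeof(fftw_complex) == 16 bytes apart,
       and in[i][0] / in[i][1] share a cache line */
    printf("%zu\n", sizeof(fftw_complex));
    return 0;
}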
Now if your processing operated on just the real parts in one thread and the imaginary parts separately in another, for some reason, then yes, it would be more efficient to store them in column-major order, or even as separate parallel arrays. In general, though, data is stored close together because it is used together.
All arrays in C are really single-dimensional contiguous blocks of memory, unless you store an array of pointers to arrays, as is usually done for things like strings of varying lengths.
Sometimes in matrix calculations it's actually faster to transpose one array first, because of the rules of matrix multiplication. It's complex, but if you want the real nitty-gritty details, search for Ulrich Drepper's article about memory at LWN.net, which shows an example that benefits from this technique (section 5, IIRC).
Very often scientific numeric libraries have worked in column-major order because Fortran compatibility was more important than using the array in a natural way. Most languages prefer row-major, as it's generally more desirable when you store fixed-length strings in a table, for instance.
tl;dr: What is the fastest way to sort an uint8x16_t?
I need to sort many arrays of exactly 16 unsigned bytes (in descending order, which doesn't matter, of course), and I'm trying to optimize the sorting by means of ARM NEON vectorization.
And I find it to be quite a fancy puzzle, as it seems that there "must" exist a short combination of NEON instructions (such as vmax/vpmax/vmin/vpmin, vzip/vuzp) that reliably results in a sorted array.
For example, if we transform a pair (A, B) of two 8-byte arrays into (vpmax(A,B), vpmin(A,B)), we obtain the same 16 values, just in a different order. If we repeat this operation four times, we reliably have the array maximum in the first cell and the array minimum in the last cell; we cannot be sure about the middle elements, though.
Another example: if we first do (C,D) = (vmax(A,B), vmin(A,B)), then (E,F) = (vpmax(C,D), vpmin(C,D)), then (G,H) = vzip(E,F), we get our array split into four parts of four bytes, and in each part we already know the largest and the smallest element. Probably the next naive step would be to deinterleave this array to put the top four bytes at the start (they won't necessarily be the top 4 elements of the array, just the top bytes of their respective groups) and repeat; I'm not yet sure where that leads in the end. In intrinsics, the basic max/min step looks roughly like the sketch below.
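A sketch of the (A, B) -> (vpmax(A,B), vpmin(A,B)) step described above, untested:

#include <arm_neon.h>

/* One round of the transform: split the 16 bytes into two halves,
   take pairwise maxima and minima, and recombine. The result holds
   the same 16 values, reordered with the maxima in the low half. */
static inline uint8x16_t pmax_pmin_step(uint8x16_t v)
{
    uint8x8_t a  = vget_low_u8(v);
    uint8x8_t b  = vget_high_u8(v);
    uint8x8_t hi = vpmax_u8(a, b);   /* 8 pairwise maxima */
    uint8x8_t lo = vpmin_u8(a, b);   /* 8 pairwise minima */
    return vcombine_u8(hi, lo);
}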
Is there any known method for this particular problem or for other similar problems (for different array sizes or whatever)? Any ideas are appreciated :)
According to this question and this answer, lists are implemented as arrays:
Perl implements lists with an array and first/last element offsets. The array is allocated larger than needed with the offsets originally pointing in the middle of the array so there is room to grow in both directions (unshifts and pushes/inserts) before a re-allocation of the underlying array is necessary. The consequence of this implementation is that all of perl's primitive list operators (insertion, fetching, determining array size, push, pop, shift, unshift, etc.) perform in O(1) time.
So you would expect accessing an element by a numeric offset to be just as fast, because they're arrays in the implementation, which provides very fast constant-time indexing. However, in a footnote in Learning Perl, the author says:
Indexing into arrays is not using Perl’s strengths. If you use the pop, push, and similar operators that avoid using indexing, your code will generally be faster than if you use many indices, as well as avoiding “off-by-one” errors, often called “fencepost” errors. Occasionally, a beginning Perl programmer (wanting to see how Perl’s speed compares to C’s) will take, say, a sorting algorithm optimized for C (with many array index operations), rewrite it straightforwardly in Perl (again, with many index operations) and wonder why it’s so slow. The answer is that using a Stradivarius violin to pound nails should not be considered a sound construction technique.
How can this be true when a list is really an array under the hood? I know it's simply ignorant to try to compare the speed of Perl to C, but wouldn't indexing a list by offset be just as fast as pop or push or whatever? These seem to contradict each other.
It's to do with the implementation of Perl as a series of opcodes. push, pop, shift and unshift are all opcodes themselves, so they can index into the array they're manipulating from C, where the accesses are very fast. If you do this from Perl with indices you'll make Perl perform extra opcodes to get the index from the scalar, get the slot from the array, then put something into it.
You can see this by using the -MO=Terse switch to see what Perl is really (in some sense) doing:
$foo[$i] = 1
BINOP (0x18beae0) sassign
SVOP (0x18bd850) const IV (0x18b60b0) 1
BINOP (0x18beb60) aelem
UNOP (0x18bedb0) rv2av
SVOP (0x18bef30) gv GV (0x18b60c8) *foo
UNOP (0x18beba0) null [15]
SVOP (0x18bec70) gvsv GV (0x18b60f8) *i
push @foo, 1
LISTOP (0x18bd7b0) push [2]
OP (0x18aff70) pushmark
UNOP (0x18beb20) rv2av [1]
SVOP (0x18bd8f0) gv GV (0x18b60c8) *foo
SVOP (0x18bed10) const IV (0x18b61b8) 1
You can see that the push requires fewer steps for Perl to perform, so it can be expected to be faster.
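If you want to check this yourself, here is a quick (unscientific) comparison using the core Benchmark module; exact numbers will vary by machine and perl version:

use strict;
use warnings;
use Benchmark qw(cmpthese);

cmpthese(-2, {
    index => sub {
        my @a;
        $a[$_] = $_ for 0 .. 9_999;   # assign via explicit indices
    },
    push  => sub {
        my @a;
        push @a, $_ for 0 .. 9_999;   # let the push opcode do the work
    },
});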
The trick with any interpreted language is to let it do all the work.