According to this question and this answer, lists are implemented as arrays:
Perl implements lists with an array and first/last element offsets. The array is allocated larger than needed, with the offsets originally pointing into the middle of the array so there is room to grow in both directions (unshifts and pushes/inserts) before a re-allocation of the underlying array is necessary. The consequence of this implementation is that all of perl's primitive list operators (insertion, fetching, determining array size, push, pop, shift, unshift, etc.) perform in O(1) time.
So you would expect accessing an element by a numeric offset would be just as fast because they're arrays in the implementation, which provide very fast constant-time indexing. However, in a footnote in Learning Perl, the author says
Indexing into arrays is not using Perl’s strengths. If you use the pop, push, and similar operators that avoid using indexing, your code will generally be faster than if you use many indices, as well as avoiding “off-by-one” errors, often called “fencepost” errors. Occasionally, a beginning Perl programmer (wanting to see how Perl’s speed compares to C’s) will take, say, a sorting algorithm optimized for C (with many array index operations), rewrite it straightforwardly in Perl (again, with many index operations) and wonder why it’s so slow. The answer is that using a Stradivarius violin to pound nails should not be considered a sound construction technique.
How can this be true when a list is really an array under the hood? I know it's simply ignorant to try to compare the speed of Perl to C, but wouldn't indexing a list by offset be just as fast as pop or push or whatever? These seem to contradict each other.
It's to do with the implementation of Perl as a series of opcodes. push, pop, shift and unshift are all opcodes themselves, so they can index into the array they're manipulating from C, where the accesses are very fast. If you do this from Perl with indices you'll make Perl perform extra opcodes to get the index from the scalar, get the slot from the array, then put something into it.
You can see this by using the -MO=Terse switch to see what Perl is really (in some sense) doing:
$foo[$i] = 1

BINOP (0x18beae0) sassign
    SVOP (0x18bd850) const IV (0x18b60b0) 1
    BINOP (0x18beb60) aelem
        UNOP (0x18bedb0) rv2av
            SVOP (0x18bef30) gv GV (0x18b60c8) *foo
        UNOP (0x18beba0) null [15]
            SVOP (0x18bec70) gvsv GV (0x18b60f8) *i

push @foo, 1

LISTOP (0x18bd7b0) push [2]
    OP (0x18aff70) pushmark
    UNOP (0x18beb20) rv2av [1]
        SVOP (0x18bd8f0) gv GV (0x18b60c8) *foo
    SVOP (0x18bed10) const IV (0x18b61b8) 1
You can see that Perl has to perform fewer steps for the push, so it can be expected to be faster.
The trick with any interpreted language is to let it do all the work.
Related
I know that traditional "lists" in Perl are implemented internally exactly as real doubly-linked lists. So indexed access to list elements is slow. This is the cost of the dynamic nature of lists, which can be sliced, expanded, and shrunk.
But for performance reasons it would be very good to have the possibility to malloc() some memory chunk and create a vector of static size with a predefined element size: for example, a fixed-size doubly-linked list may be represented as a sequence of elements whose size would be 4 (prev_v_index) + 4 (next_v_index) + 8 (data_ptr aka REF) = 16 bytes. Then we could access every element of this vector as we usually do in compiled languages like C: elem_ptr = vector_ptr + (index * elem_size) - access to elements would be very fast, with some architecture-specific alignment (8 bytes for x86_64).
Maybe there is already some XS module for manipulating fixed-size vectors in Perl 5?
Perl's arrays (@array variables or [...] references) do use a contiguous memory region. They are not linked lists. However, these arrays only hold pointers to the scalar values, not the values themselves. This is a necessary restriction of the Perl data model.
If you know C++, a Perl array can be thought of as similar to a std::vector<Scalar*>, except that Perl's arrays can push and pop at the front and the back.
To resize a Perl array, you can assign the last index. E.g. to pre-allocate 50 elements:
my @array;
$#array = 50 - 1;
If you need compact data storage within Perl, then you will have to use strings. Given a fixed-size record, you can get and set one record with substr, and pack/unpack the data from and to Perl data structures.
You can use the vec function to use a string as a vector. For example, you could pack Boolean values into individual bits.
vec EXPR,OFFSET,BITS
Treats the string in EXPR as a bit vector made up of elements of
width BITS and returns the value of the element specified by
OFFSET as an unsigned integer. BITS therefore specifies the
number of bits that are reserved for each element in the bit
vector. This must be a power of two from 1 to 32 (or 64, if your
platform supports that).
That said, your concern about array access being "slow" is unwarranted, and your beliefs about perl's internals are incorrect. Array performance is likely to be fast enough. Don't try to "optimize" around it until you've profiled your code and proven that it's a bottleneck.
I'm using arrays of elements, many of which reference each other, and I assumed that in that case it's more efficient to use pointers.
But in some cases I need to know the index of an element I have the pointer to. For example I have p = &a[i] and I need to know the value of i. As I understand it, i can be computed through p - a. But this operation inherently involves division, which is expensive, whereas computing an address from an array index involves a multiplication and is faster.
So my question is, is cross referencing with pointers in a case where you need the indexes as well even worth it?
But this operation inherently involves division, which is expensive, whereas computing an address from an array index involves a multiplication and is faster.
This operation requires a division only when the size of the element is not a power of two, i.e. when it is not a pointer, or some standard type on most systems. Dividing by a power of two is done using bit shifting, which is extremely cheap.
computing an address from an array index involves a multiplication and is faster.
Same logic applies here, except the compiler shifts left instead of shifting right.
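To make that concrete, here is a small sketch (the function names are made up); with 8-byte elements the compiler turns the division into a right shift and the multiplication into a left shift:

#include <stddef.h>

/* With 8-byte elements, p - a compiles to a subtraction followed by a
 * shift right by 3; &a[i] compiles to a shift left by 3 plus an add.
 * Neither needs a hardware divide. */
ptrdiff_t index_of(const double *a, const double *p) {
    return p - a;            /* (p_bytes - a_bytes) >> 3 */
}

const double *addr_of(const double *a, ptrdiff_t i) {
    return &a[i];            /* a_bytes + (i << 3) */
}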
is cross referencing with pointers in a case where you need the indexes as well even worth it?
Counting CPU cycles without profiling is a case of premature optimization - a bad thing to consider when you are starting your design.
A more important consideration is that indexes are more robust, because they often survive array reallocation.
Consider an example: let's say you have an array that grows dynamically as you add elements to its back, an index into that array, and a pointer into that array. You add an element to the array, exhausting its capacity, so now it must grow. You call realloc, and get a new array (or an old array if there was enough extra memory after the "official" end). The pointer that you held is now invalid; the index, however, is still valid.
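A minimal sketch of that failure mode (assuming a plain int array; the variable names are made up):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t cap = 4;
    int *arr = malloc(cap * sizeof *arr);
    if (arr == NULL) return 1;

    size_t idx = 2;            /* index of an element         */
    int   *ptr = &arr[2];      /* pointer to the same element */
    arr[2] = 42;

    /* Growing the array may move it; the saved pointer then dangles. */
    int *grown = realloc(arr, 1024 * sizeof *grown);
    if (grown == NULL) { free(arr); return 1; }
    arr = grown;

    printf("%d\n", arr[idx]);  /* the index still works */
    /* printf("%d\n", *ptr);      undefined behaviour if the block moved */
    (void)ptr;

    free(arr);
    return 0;
}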
Indexing an array is dirt cheap in ways where I've never found any kind of performance boost by directly using pointers instead. That includes some very performance-critical areas like looping through each pixel of an image containing millions of them -- still no measurable performance difference between indices and pointers (though it does make a difference if you can access an image with one sequential loop instead of two).
I've actually found many opposite cases where, once 64-bit hardware started becoming available, turning pointers into 32-bit indices boosted performance when there was a need to store a boatload of them.
One of the reasons is obvious: you can take half the space now with 32-bit indices (assuming you don't need more than ~4.3 billion elements). If you're storing a boatload of them and taking half the memory as in the case of a graph data structure like indexed meshes, then typically you end up with fewer cache misses when your links/adjacency data can be stored in half the memory space.
But on a deeper level, using indices allows a lot more options. You can use purely contiguous structures that realloc to new sizes without worrying about invalidation as dasblinkenlight points out. The indices will also tend to be more dense (as opposed to sparsely fragmented across the entire 64-bit addressing space), even if you leave holes in the array, allowing for effective compression (delta, frame of reference, etc) if you want to squash down memory usage. You can then also use parallel arrays to associate data to something in parallel without using something much more expensive like a hash table. That includes parallel bitsets which allow you to do things like set intersections in linear time. It also allows for SoA reps (also parallel arrays) which tend to be optimal for sequential access patterns using SIMD.
You get a lot more room to optimize with indices, and I'd consider it mostly just a waste of memory if you keep pointers around on top of indices. The downside to indices for me is mostly just convenience. We have to have access to the array we're indexing on top of the index itself, while the pointer allows you to access the element without having access to its container. It's often more difficult and error-prone to write code and data structures revolving around indices and also harder to debug since we can't see the value of an element through an index. That said, if you accept the extra burden, then often you get more room to optimize with indices, not less.
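For illustration, a sketch of what that looks like (the struct layout and names are made up): nodes link to each other through 32-bit indices into one contiguous pool instead of 64-bit pointers, which halves the link storage and keeps the links valid across reallocations.

#include <stdint.h>

#define NIL UINT32_MAX              /* "no neighbour" sentinel */

typedef struct {
    uint32_t next;                  /* index of the next node, or NIL     */
    uint32_t prev;                  /* index of the previous node, or NIL */
    float    payload;
} Node;

typedef struct {
    Node    *nodes;                 /* contiguous pool, indexable like an array */
    uint32_t count;
    uint32_t capacity;
} NodePool;

/* Follow a link: an index plus the pool replaces a raw pointer. */
Node *pool_at(NodePool *p, uint32_t i) {
    return &p->nodes[i];
}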
Basically I saw a video on YouTube that visualized sorting algorithms, and they provided the program so that we can play with it. The program counts two main things (comparisons, array accesses). I wanted to see which of merge sort and quick sort is the fastest.
for 100 random numbers
quick sort:
comparisons 1000
array accesses 1400
merge sort:
comparisons 540
array accesses 1900
So quick sort uses fewer array accesses while merge sort uses fewer comparisons, and the difference increases with the number of elements. Which one of those is harder for the computer to do?
The numbers are off. Results from actual runs with 100 random numbers are shown below. Note that the quick sort compare count is affected by the implementation; Hoare uses fewer compares than Lomuto.
quick sort (Hoare partition scheme)
pivot reads 87 (average)
compares 401 (average)
array accesses 854 (average)
merge sort:
compares 307 (average)
array accesses 1400 (best, average, worst)
Since numbers are being sorted, I'm assuming they fit in registers, which reduces the array accesses.
For quick sort, the compares are done versus a pivot value, which should be read just once per recursive instance of quick sort and placed in a register, then one read for each value compared. An optimizing compiler may keep the values used for compare in registers so that swaps already have the two values in registers and only need to do two writes.
For merge sort, the compares add almost zero overhead to the array accesses, since the compared values will be read into registers, compared, then written from the registers instead of reading from memory again.
Sorting performance depends on many conditions, I think answering your exact question won't lead to a helpful answer (you can benchmark it easily yourself).
Sorting a small number of elements is usually not time critical, benchmarking makes sense for larger lists.
Also it is a rare case to sort an array of integers, it is much more common to sort a list of objects, comparing one or more of their properties.
If you head for performance, think about multi threading.
MergeSort is stable (equal elements keep their relative position), QuickSort is not, so you are comparing different results.
In your example, the quicksort algorithm is probably faster most of the time. If the comparison is more complex, e.g. strings instead of ints or multiple fields, MergeSort will become more and more effective because it needs fewer (expensive) comparisons. If you want to parallelize the sorting, MergeSort is well suited because of the algorithm itself.
I have a loop with the following structure :
1. Calculate a byte array with length k (somewhere slow)
2. Find if the calculated byte array matches any in a list of N byte arrays I have.
3. Repeat
My loop is to be called many many times (it's the main loop of my program), and I want the second step to be as fast as possible.
The naive implementation for the second step would be using memcmp:
char* calc;
char** list;
int k, n, i;

for (i = 0; i < n; i++) {
    if (!memcmp(calc, list[i], k)) {
        printf("Matches array %d", i);
    }
}
Can you think of any faster way ? A few things :
My list is fixed at the start of my program, any precomputation on it is fine.
Let's assume that k is small (<= 64), N is moderate (around 100-1000).
Performance is the goal here, and portability is a non issue. Intrinsics/inline assembly is fine, as long as it's faster.
Here are a few thoughts that I had :
Given k<64 and I'm on x86_64, I could sort my lookup array as a long array, and do a binary search on it. O(log(n)). Even if k was big, I could sort my lookup array and do this binary search using memcmp.
Given k is small, again, I could compute an 8/16/32-bit checksum (the simplest being folding my arrays over themselves using xor) of all my lookup arrays and use a built-in PCMPGT as in How to compare more than two numbers in parallel?. I know SSE4.2 is available here.
Do you think going for vectorization/sse is a good idea here ? If yes, what do you think is the best approach.
I'd like to say that this isn't early optimization, but performance is crucial here, I need the outer loop to be as fast as possible.
Thanks
EDIT1: It looks like http://schani.wordpress.com/tag/c-optimization-linear-binary-search-sse2-simd/ provides some interesting thoughts about it. Binary search on a list of long seems the way to go..
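To illustrate that direction, here is a sketch of packing each k-byte array into a 64-bit key and binary-searching a sorted key list (bytes_to_key, find_key, and the pre-sorted keys[] array are made-up names, not from the linked post):

#include <stdint.h>
#include <stddef.h>

/* Pack a k-byte array (k <= 8) into one integer so that comparing two
 * keys gives the same ordering as memcmp on the original bytes. */
static uint64_t bytes_to_key(const unsigned char *p, size_t k) {
    uint64_t key = 0;
    for (size_t i = 0; i < k; i++)
        key = (key << 8) | p[i];
    return key;
}

/* keys[] holds the N precomputed values, sorted ascending.
 * Returns the matching index, or -1.  O(log N) per lookup. */
long find_key(const uint64_t *keys, size_t n, uint64_t needle) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (keys[mid] < needle)      lo = mid + 1;
        else if (keys[mid] > needle) hi = mid;
        else                         return (long)mid;
    }
    return -1;
}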
The optimum solution is going to depend on how many arrays there are to match, the size of the arrays, and how often they change. I would look at avoiding doing the comparisons at all.
Assuming the list of arrays to compare it to does not change frequently and you have many such arrays, I would create a hash of each array, then when you come to compare, hash the thing you are testing. Then you only need compare the hash values. With a hash like SHA256, you can rely on this both as a positive and negative indicator (i.e. the hashes matching is sufficient to say the arrays match as well as the hashes not matching being sufficient to say the arrays differ). This would work very well if you had (say) 1,000,000 arrays to compare against which hardly ever change, as calculating the hash would be faster than 1,000,000 array comparisons.
If your number of arrays is a bit smaller, you might consider a faster non-cryptographic hash. For instance, a 'hash' which simply summed the bytes in an array modulo 256 (this is a terrible hash and you can do much better) would eliminate the need to compare (say) 255/256ths of the target array space. You could then compare only those where the so-called 'hash' matches. There are well known hash-like things such as CRC-32 which are quick to calculate.
In either case you can then have a look up by hash (modulo X) to determine which arrays to actually compare.
You suggest k is small, N is moderate (i.e. about 1000). I'm guessing speed will revolve around memory cache. Not accessing 1,000 small arrays here is going to be pretty helpful.
All the above will be useless if the arrays change with a frequency similar to the comparison.
Addition (assuming you are looking at 64 bytes or similar). I'd look into a very fast non-cryptographic hash function. For instance look at: https://code.google.com/p/smhasher/wiki/MurmurHash3
It looks like 3-4 instructions per 32 bit word to generate the hash. You could then truncate the result to (say) 12 bits for a 4096 entry hash table with very few collisions (each bucket being linked list to the target arrays). This means you would look at something like about 30 instructions to calculate the hash, then one instruction per bucket entry (expected value 1) to find the list item, then one manual compare per expected hit (that would be between 0 and 1). So rather than comparing 1000 arrays, you would compare between 0 and 1 arrays, and generate one hash. If you can't compare 999 arrays in 30-ish instructions (I'm guessing not!) this is obviously a win.
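A sketch of that layout (using FNV-1a as a stand-in for MurmurHash3; the table size, struct, and function names are made up):

#include <stdint.h>
#include <string.h>

#define TABLE_BITS 12
#define TABLE_SIZE (1u << TABLE_BITS)

/* FNV-1a stands in here for MurmurHash3; any fast non-cryptographic
 * hash works the same way. */
static uint32_t hash_bytes(const unsigned char *p, size_t k) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < k; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* One bucket entry per precomputed array; buckets are singly linked. */
typedef struct Entry {
    const unsigned char *data;     /* points into the fixed list */
    struct Entry *next;
} Entry;

static Entry *table[TABLE_SIZE];

/* Returns the matching list entry, or NULL.  One hash, then on average
 * at most one memcmp, instead of up to N memcmps. */
const unsigned char *lookup(const unsigned char *calc, size_t k) {
    uint32_t h = hash_bytes(calc, k) & (TABLE_SIZE - 1);
    for (Entry *e = table[h]; e != NULL; e = e->next)
        if (memcmp(calc, e->data, k) == 0)
            return e->data;
    return NULL;
}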
We can assume that my stuff fits in 64 bits, or even 32 bits. If it wasn't, I could hash it so it could. But now, what's the fastest way to find whether my hash exists in the list of precomputed hashes?
This is sort of a meta-answer, but... if your question boils down to: how can I efficiently find whether a certain 32-bit number exists in a list of other 32-bit numbers, this is a problem IP routers deal with all the time, so it might be worth looking into the networking literature to see if there's something you can adapt from their algorithms. e.g. see http://cit.mak.ac.ug/iccir/downloads/SREC_07/K.J.Poornaselvan1,S.Suresh,%20C.Divya%20Preya%20and%20C.G.Gayathri_07.pdf
(Although, I suspect they are optimized for searching through larger numbers of items than your use case..)
Can you do an XOR instead of memcmp?
Or calculate a hash of each element in the list, sort them, and search for the hash.
But hashing will take more time, unless you can come up with a faster hash.
Another way is to pre-build a tree from your list and use tree search.
For example, with the list:
aaaa
aaca
acbc
acca
bcaa
bcca
caca
we can get a tree like this
root
-a
--a
---a
----a
---c
----a
--c
---b
----c
---c
----a
-b
--c
---a
----a
---c
----a
-c
--a
---c
----a
Then do binary search on each level of the tree
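A rough sketch of one such tree node in C (layout and names are made up; children at each level are kept sorted so each level can be binary-searched):

#include <stddef.h>

typedef struct TrieNode {
    unsigned char     byte;        /* value at this depth                 */
    struct TrieNode  *children;    /* child nodes, sorted by 'byte'       */
    size_t            nchildren;
    int               list_index;  /* index into the original list, or -1 */
} TrieNode;

static const TrieNode *find_child(const TrieNode *children, size_t n, unsigned char b) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (children[mid].byte < b)      lo = mid + 1;
        else if (children[mid].byte > b) hi = mid;
        else                             return &children[mid];
    }
    return NULL;
}

/* Walk one level per input byte, binary-searching the children each time. */
int trie_lookup(const TrieNode *root, const unsigned char *key, size_t k) {
    const TrieNode *node = root;
    for (size_t i = 0; i < k; i++) {
        node = find_child(node->children, node->nchildren, key[i]);
        if (node == NULL) return -1;
    }
    return node->list_index;
}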
I have a circular, statically allocated buffer in C, which I'm using as a queue for a breadth-first search. I'd like to have the top N elements in the queue sorted. It would be easy to just use a regular qsort() - except it's a circular buffer, and the top N elements might wrap around. I could, of course, write my own sorting implementation that uses modular arithmetic and knows how to wrap around the array, but I've always thought that writing sorting functions is a good exercise, but something better left to libraries.
I thought of several approaches:
1. Use a separate linear buffer - first copy the elements from the circular buffer, then apply qsort, then copy them back. Using an additional buffer means an additional O(N) space requirement, which brings me to
2. Sort the "top" and "bottom" halves using qsort, and then merge them using the additional buffer
3. Same as 2, but do the final merge in-place (I haven't found much on in-place merging, but the implementations I've seen don't seem worth the reduced space complexity)
On the other hand, spending an hour contemplating how to elegantly avoid writing my own quicksort, instead of adding those 25 (or so) lines might not be the most productive either...
Correction: Made a stupid mistake of switching DFS and BFS (I prefer writing a DFS, but in this particular case I have to use a BFS), sorry for the confusion.
Further description of the original problem:
I'm implementing a breadth first search (for something not unlike the fifteen puzzle, just more complicated, with about O(n^2) possible expansions in each state, instead of 4). The "bruteforce" algorithm is done, but it's "stupid" - at each point, it expands all valid states, in a hard-coded order. The queue is implemented as a circular buffer (unsigned queue[MAXLENGTH]), and it stores integer indices into a table of states. Apart from two simple functions to queue and dequeue an index, it has no encapsulation - it's just a simple, statically allocated array of unsigned's.
Now I want to add some heuristics. The first thing I want to try is to sort the expanded child states after expansion ("expand them in a better order") - just like I would if I were programming a simple best-first DFS. For this, I want to take part of the queue (representing the most recent expanded states), and sort them using some kind of heuristic. I could also expand the states in a different order (so in this case, it's not really important if I break the FIFO properties of the queue).
My goal is not to implement A*, or a depth first search based algorithm (I can't afford to expand all states, but if I don't, I'll start having problems with infinite cycles in the state space, so I'd have to use something like iterative deepening).
I think you need to take a big step back from the problem and try to solve it as a whole - chances are good that the semi-sorted circular buffer is not the best way to store your data. If it is, then you're already committed and you will have to write code to sort the elements in the buffer - whether that means performing an occasional sort with an outside library, or doing it when elements are inserted, I don't know. But at the end of the day it's going to be ugly, because a FIFO and a sorted buffer are fundamentally different.
Previous answer, which assumes your sort library has a robust and feature-filled API (as requested in your question, this does not require you to write your own mod sort or anything - it depends on the library supporting arbitrarily located data, usually through a callback function; if your sort doesn't support linked lists, it can't handle this):
The circular buffer has already solved this problem using % (mod) arithmetic. QSort, etc don't care about the locations in memory - they just need a scheme to address the data in a linear manner.
They work as well for linked lists (which are not linear in memory) as they do for 'real' linear non circular arrays.
So if you have a circular array with 100 entries, and you find you need to sort the top 10, and the top ten happen to wrap in half at the top, then you feed the sort the following two bits of information:
The function to locate an array item is (x % 100)
The items to be sorted are at locations 95 to 105
The function will convert the addresses the sort uses into an index used in the real array, and the fact that the array wraps around is hidden, although it may look weird to sort an array past its bounds, a circular array, by definition, has no bounds. The % operator handles that for you, and you might as well be referring to the part of the array as 1295 to 1305 for all it cares.
Bonus points for having an array with 2^n elements.
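A minimal sketch of that addressing trick (an insertion sort is used here only because C's qsort itself insists on a flat array; QSIZE, slot(), and sort_range() are made-up names):

#define QSIZE 128u                      /* power of two: % becomes & */

static unsigned buf[QSIZE];

/* Logical position -> physical slot; wrap-around is invisible to the sort. */
static unsigned *slot(size_t pos) {
    return &buf[pos & (QSIZE - 1)];     /* pos % QSIZE for non-power-of-two sizes */
}

/* Insertion sort over logical positions [first, first + n), which may
 * wrap past the end of the physical array. */
static void sort_range(size_t first, size_t n) {
    for (size_t i = 1; i < n; i++) {
        unsigned key = *slot(first + i);
        size_t j = i;
        while (j > 0 && *slot(first + j - 1) > key) {
            *slot(first + j) = *slot(first + j - 1);
            j--;
        }
        *slot(first + j) = key;
    }
}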
Additional points of consideration:
It sounds to me that you're using a sorting library which is incapable of sorting anything other than a linear array - so it can't sort linked lists, or arrays with anything other than simple ordering. You really only have three choices:
1. You can re-write the library to be more flexible (i.e., when you call it you give it a set of function pointers for comparison operations and data access operations)
2. You can re-write your array so it somehow fits your existing libraries
3. You can write custom sorts for your particular solution.
Now, for my part I'd re-write the sort code so it was more flexible (or duplicate it and edit the new copy so you have sorts which are fast for linear arrays, and sorts which are flexible for non-linear arrays)
But the reality is that right now your sort library is so simple you can't even tell it how to access data that is non linearly stored.
If it's that simple, there should be no hesitation to adapting the library itself to your particular needs, or adapting your buffer to the library.
Trying an ugly kludge, like somehow turning your buffer into a linear array, sorting it, and then putting it back in is just that - an ugly kludge that you're going to have to understand and maintain later. You're going to 'break' into your FIFO and fiddle with the innards.
-Adam
I'm not seeing exactly the solution you asked for in C. You might consider one of these ideas:
If you have access to the source for your libc's qsort(), you might copy it and simply replace all the array access and indexing code with appropriately generalized equivalents. This gives you some modest assurance that the underlying sort is efficient and has few bugs. No help with the risk of introducing your own bugs, of course. Big O like the system qsort, but possibly with a worse multiplier.
If the region to be sorted is small compared to the size of the buffer, you could use the straight ahead linear sort, guarding the call with a test-for-wrap and doing the copy-to-linear-buffer-sort-then-copy-back routine only if needed. Introduces an extra O(n) operation in the cases that trip the guard (for n the size of the region to be sorted), which makes the average O(n^2/N) < O(n).
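A sketch of that guard (BUFLEN, the element type, and the comparison function are assumptions):

#include <stdlib.h>
#include <string.h>

#define BUFLEN 1024

static int cmp_unsigned(const void *a, const void *b) {
    unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
    return (x > y) - (x < y);
}

/* Sort n items starting at logical index 'start' in a circular buffer of
 * BUFLEN unsigneds.  When the region doesn't wrap, qsort it in place;
 * otherwise copy it out to a scratch buffer, sort, and copy back. */
void sort_top(unsigned *queue, size_t start, size_t n) {
    if (start + n <= BUFLEN) {
        qsort(queue + start, n, sizeof *queue, cmp_unsigned);   /* no wrap */
    } else {
        unsigned tmp[BUFLEN];
        size_t head = BUFLEN - start;                 /* items before the wrap */
        memcpy(tmp, queue + start, head * sizeof *queue);
        memcpy(tmp + head, queue, (n - head) * sizeof *queue);
        qsort(tmp, n, sizeof *tmp, cmp_unsigned);
        memcpy(queue + start, tmp, head * sizeof *queue);
        memcpy(queue, tmp + head, (n - head) * sizeof *queue);
    }
}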
I see that C++ is not an option for you. ::sigh:: I will leave this here in case someone else can use it.
If C++ is an option you could (subclass the buffer if needed and) overload the [] operator to make the standard sort algorithms work. Again, should work like the standard sort with a multiplier penalty.
Perhaps a priority queue could be adapted to solve your issue.
You could rotate the circular queue until the subset in question no longer wraps around. Then just pass that subset to qsort like normal. This might be expensive if you need to sort frequently or if the array element size is very large. But if your array elements are just pointers to other objects then rotating the queue may be fast enough. And in fact if they are just pointers then your first approach might also be fast enough: making a separate linear copy of a subset, sorting it, and writing the results back.
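One way to do that rotation cheaply in place is the classic three-reversal trick (a sketch; rotate_left() and the unsigned element type are assumptions):

/* Rotate the whole buffer left by 'shift' slots so the region of
 * interest becomes contiguous and can be handed to qsort directly. */
static void reverse(unsigned *a, size_t lo, size_t hi) {   /* reverses [lo, hi) */
    while (lo + 1 < hi) {
        unsigned t = a[lo];
        a[lo] = a[hi - 1];
        a[hi - 1] = t;
        lo++;
        hi--;
    }
}

void rotate_left(unsigned *queue, size_t len, size_t shift) {
    reverse(queue, 0, shift);
    reverse(queue, shift, len);
    reverse(queue, 0, len);
}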
Do you know about the rules regarding optimization? You can google them (you'll find a few versions, but they all say pretty much the same thing, DON'T).
It sounds like you are optimizing without testing. That's a huge no-no. On the other hand, you're using straight C, so you are probably on a restricted platform that requires some level of attention to speed, so I expect you need to skip the first two rules because I assume you have no choice:
Rules of optimization:
1. Don't optimize.
2. If you know what you are doing, see rule #1.
You can go to the more advanced rules:
Rules of optimization (cont):
3. If you have a spec that requires a certain level of performance, write the code unoptimized and write a test to see if it meets that spec. If it meets it, you're done. NEVER write code taking performance into consideration until you have reached this point.
4. If you complete step 3 and your code does not meet the specs, recode it leaving your original "most obvious" code in there as comments and retest. If it does not meet the requirements, throw it away and use the unoptimized code.
5. If your improvements made the tests pass, ensure that the tests remain in the codebase and are re-run, and that your original code remains in there as comments.
Okay, so finally--I'm not saying this because I read it somewhere. I've spent DAYS trying to untangle some god-awful messes that other people coded because it was "Optimized"--and the really funny part is that 9 times out of 10, the compiler could have optimized it better than they did.
I realize that there are times when you will NEED to optimize, all I'm saying is write it unoptimized, test and recode it. It really won't take you much longer--might even make writing the optimized code easier.
The only reason I'm posting this is because almost every line you've written concerns performance, and I'm worried that the next person to see your code is going to be some poor sap like me.
How about something like this example here? It easily sorts a part, or whatever you want, without having to set aside a lot of extra memory.
It takes only two pointers, a status bit, and a counter for the for loop.
#include <stdio.h>
#include <stdlib.h>

#define _PRINT_PROGRESS
#define N 10

typedef unsigned char BYTE;

BYTE buff[N] = {4,5,2,1,3,5,8,6,4,3};
BYTE *a = buff;
BYTE *b = buff;
BYTE changed = 0;

int main(void)
{
    BYTE n = 0;

    do
    {
        b++;
        changed = 0;
        for (n = 0; n < (N - 1); n++)
        {
            if (*a > *b)
            {
                /* XOR swap of the two adjacent elements */
                *a ^= *b;
                *b ^= *a;
                *a ^= *b;
                changed = 1;
            }
            a++;
            b++;
        }
        a = buff;
        b = buff;
#ifdef _PRINT_PROGRESS
        for (n = 0; n < N; n++)
            printf("%d", buff[n]);
        printf("\n");
#endif
    } while (changed);

    system("pause");
    return 0;
}