Leading zeros calculation with intrinsic function - arm

I'm trying to optimize some code working in an embedded system (FLAC decoding, Windows CE, ARM 926 MCU).
The default implementation uses a macro and a lookup table:
/* counts the # of zero MSBs in a word */
#define COUNT_ZERO_MSBS(word) ( \
(word) <= 0xffff ? \
( (word) <= 0xff? byte_to_unary_table[word] + 24 : \
byte_to_unary_table[(word) >> 8] + 16 ) : \
( (word) <= 0xffffff? byte_to_unary_table[word >> 16] + 8 : \
byte_to_unary_table[(word) >> 24] ) \
)
static const unsigned char byte_to_unary_table[] = {
8, 7, 6, 6, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
};
However most CPU already have a dedicated instruction, bsr on x86 and clz on ARM (http://www.devmaster.net/articles/fixed-point-optimizations/), that should be more efficient.
On Windows CE we have the intrinsic function _CountLeadingZeros, that should just call that value. However it is 4 times slower than the macro (measured on 10 million of iterations).
How is possible that an intrinsic function, that (should) rely on a dedicated ASM instruction, is 4 times slower?

Check the disassembly. Are you sure that the compiler inserted the instruction? In the Remarks section there is this text:
This function can be implemented by
calling a runtime function.
I suspect that's what's happening in your case.
Note that the CLZ instruction is only available in ARMv5 and later. You need to tell the compiler if you want ARMv5 code:
/QRarch5 ARM5 Architecture
/QRarch5T ARM5T Architecture
(Microsoft incorrectly uses "ARM5" instead of "ARMv5")

Related

Design a specific algorithm for a nxn array implemented on O(nlogn)

The problem:
Suppose that each row of an n×n array A consists of 1’s and 0’s such that, in any row of A, all the 1’s come before any 0’s in that row. Assuming A is already in memory, describe a method running in O(nlogn) time (not O(n2) time!) for counting the number of 1’s in A.
My experience: I have done it for O(n) but I dont know how can I achieve it with O(nlogN)
I would appreciate any help !
Consider that each individual row consists of all 1s followed by all 0s:
1111111000
You can use a binary search to find the transition point (the last 1 in the row). The way this works is to set low and high to the ends and check the middle.
If you are at the transition point, you're done. Otherwise, if you're in the 1s, set low to one after the midpoint. Otherwise, you're in the 0s, so set high to one before the midpoint.
That would go something like (pseudo-code, with some optimisations):
def countOnes(row):
# Special cases first, , empty, all 0s, or all 1s.
if row.length == 0: return 0
if row[0] == "0": return 0
if row[row.length - 1] == 1: return row.length
# At this point, there must be at least one of each value,
# so length >= 2. That means you're guaranteed to find a
# transition point.
lo = 0
hi = row.length - 1
while true:
mid = (lo + hi) / 2
if row[mid] == 1 and row[mid+1] == 0:
return mid + 1
if row[mid] == 1:
lo = mid + 1
else:
hi = mid - 1
Since a binary search for a single row is O(logN) and you need to do that for N rows, the resultant algorithm is O(NlogN).
For a more concrete example, see the following complete Python program, which generates a mostly random matrix then uses the O(N) method and the O(logN) method (the former as confirmation) of counting the ones in each row:
import random
def slow_count(items):
count = 0
for item in items:
if item == 0:
break
count += 1
return count
def fast_count(items):
# Special cases first, no 1s or all 1s.
if len(items) == 0: return 0
if items[0] == 0: return 0
if items[len(items) - 1] == 1: return len(items)
# At this point, there must be at least one of each value,
# so length >= 2. That means you're guaranteed to find a
# transition point.
lo = 0
hi = len(items) - 1
while True:
mid = (lo + hi) // 2
if items[mid] == 1 and items[mid+1] == 0:
return mid + 1
if items[mid] == 1:
lo = mid + 1
else:
hi = mid - 1
# Ensure test data has rows with all zeros and all ones.
N = 20
matrix = [[1] * N, [0] * N]
# Populate other rows randomly.
random.seed()
for _ in range(N - 2):
numOnes = random.randint(0, N)
matrix.append([1] * numOnes + [0] * (N - numOnes))
# Print rows and counts using slow-proven and fast method.
for row in matrix:
print(row, slow_count(row), fast_count(row))
The fast_count function is the equivalent of what I've provided in this answer.
A sample run is:
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] 20 20
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 0 0
[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 5 5
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0] 15 15
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 10 10
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 1 1
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0] 11 11
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0] 12 12
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0] 11 11
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 1 1
[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 6 6
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0] 16 16
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0] 14 14
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0] 11 11
[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 9 9
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0] 13 13
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 1 1
[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 4 4
[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 6 6
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0] 19 19

resize array while keeping mask

I'm trying to figure out how to effectively resize an 1-d array while keeping the mask it represents. Using this array i do draw simple sprites while one value in the array represents a specific color.
Anyway my goal is as follows, having the following "small" array with values:
0, 1, 2, 3,
0, 1, 2, 2,
0, 1, 1, 1,
0, 0, 1, 1,
0, 0, 0, 0
This obviously is going to be a sprite of size 4x5.
Now i want to resize it keeping the values so getting the same sprite/shape but in higher resolution.
Now by saying "scale-by-2" i would get a 8x10 sized sprite, the 1-d array then should look as follows:
0, 0, 1, 1, 2, 2, 3, 3,
0, 0, 1, 1, 2, 2, 3, 3,
0, 0, 1, 1, 2, 2, 2, 2,
0, 0, 1, 1, 2, 2, 2, 2,
0, 0, 1, 1, 1, 1, 1, 1,
0, 0, 1, 1, 1, 1, 1, 1,
0, 0, 0, 0, 1, 1, 1, 1,
0, 0, 0, 0, 1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0
My idea is to group the numbers row by row, take the scale factor (2) and add as many of the digits (from one group) as we have to scale (2) in one row. Then duplicate each row by the scale factor as well. But still i am not sure if this covers all cases.
Any other (more effective) way to handle this?

python: vectorized cumulative counting

I have a numpy array and would like to count the number of occurences for each value, however, in a cumulative way
in = [0, 1, 0, 1, 2, 3, 0, 0, 2, 1, 1, 3, 3, 0, ...]
out = [0, 0, 1, 1, 0, 0, 2, 3, 1, 2, 3, 1, 2, 4, ...]
I'm wondering if it is best to create a (sparse) matrix with ones at col = i and row = in[i]
1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0
0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0
0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0
Then we could compute the cumsums along the rows and extract the numbers from the locations where the cumsums increment.
However, if we cumsum a sparse matrix, doesn't become dense? Is there an efficient way of doing it?
Here's one vectorized approach using sorting -
def cumcount(a):
# Store length of array
n = len(a)
# Get sorted indices (use later on too) and store the sorted array
sidx = a.argsort()
b = a[sidx]
# Mask of shifts/groups
m = b[1:] != b[:-1]
# Get indices of those shifts
idx = np.flatnonzero(m)
# ID array that will store the cumulative nature at the very end
id_arr = np.ones(n,dtype=int)
id_arr[idx[1:]+1] = -np.diff(idx)+1
id_arr[idx[0]+1] = -idx[0]
id_arr[0] = 0
c = id_arr.cumsum()
# Finally re-arrange those cumulative values back to original order
out = np.empty(n, dtype=int)
out[sidx] = c
return out
Sample run -
In [66]: a
Out[66]: array([0, 1, 0, 1, 2, 3, 0, 0, 2, 1, 1, 3, 3, 0])
In [67]: cumcount(a)
Out[67]: array([0, 0, 1, 1, 0, 0, 2, 3, 1, 2, 3, 1, 2, 4])

How to create a random mask array?

I've an array with 128 values, each value is 1:
length = 128
partials = Array.new length
partials.each_index do |i|
partials[i] = 1
end
I want to set value 0 on some (random) position (for example, on pos 1,6,50,70,100,112,120).
Of course, the number of position could be different every time, and if I choose 7 different position, I want to end with 7 different pos changed.
What's the faster way to do this in Ruby?
Assuming you want to have n elements with value 0, you can do the below:
n = 5
partials[0,n] = [0]*n
partials.shuffle
Alternatively, can also be written as:
partials.tap{|p| p[0,n] = [0]*n}.shuffle
You can incorporate the zeros into the array creation:
length = 128
zeros = 7
partials = Array.new(length) { |i| i < zeros ? 0 : 1 }.shuffle
#=> [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
# 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
# 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
# 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
# 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
# 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
# 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
# 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
A way:
array = 128.times.map{1}
Or with randomly sprayed 0s:
array = 128.times.map{rand(2)}
or put a number of 0s later:
10.times{array[rand(128)]=0}
etc... Play with it and see what you need
Another alternative:
length = 10
zeros = 2
([0]*(length-zeros)+[1]*zeros).shuffle

I want to use Bilinear interpolation to calculate the summation of vectors

I have individual vectors from my last stage of code which i implemented it
The next stage of the algorithm is to calculate the summation of these vectors
As mentioned in the paper
"The vectors from the previous stage were summed together spatially by bilinearly weighting"
I think The bilinear weighting means bilinear interpolation
can any one tell or give me an example how can i use bilinear interpolation
to calculate the Summation of this vectors
V1 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2]
V2 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 11, 11]
V3 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 23, 0, 0]
V4 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 19, 19, 0, 0]
V5 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0]
V6 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 0, 0]
V7 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 18, 18, 0, 0]
V8 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 23, 23, 0, 0, 0]
V9= [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0]
V10 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 0, 0]
V11 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 0, 0]
V12 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 11, 11, 0, 0, 0]
V13 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
V14 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0]
V15 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0]
V16 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0]
I googled it but didn't understand the Equations
Regards and thanks in advance !
Sadly I'm having trouble understanding the paper as well. The idea, as you've said, is to weight the vectors based on their distance from the pooling centres, so that vectors farther from the pooling centres have less of an impact. The paper compares this to what is done in the famous SIFT feature, which you can read about in this tutorial.
Below is by best guess as to what the meaning is. Since this is related to machine learning, could also ask people over at cross-validated to get their opinion, or consider contacting the author of the paper.
If I understand correctly, this is amounts to a process similar to bilinear interpolation, except in reverse.
With bilinear interpolation, we are given a set of function values arranged in a grid, and we want to find a good guess for what the function values are between the gridpoints. We do this by taking a weighted average of the four surrounding function values, with the weights being the relative area of the opposite rectangle in the image below. (By "relative" I mean the area is normalized by the area of the whole grid rectangle, so the weights sum to 1.) Note how the point to be interpolated is the closest to the (x1,y2) gridpoint, so we weight it with the largest weight (the relative area of the yellow rectangle).
f(x,y) = w_11*f(x1,y1) + w_21*f(x2,y1) + w_12*f(x1,y2) + w_22*f(x2,y2)
w_ij = area of rectangle opposite (xi,yj) / total area of grid square
The "bilinear weighing" described in the paper seems to be doing the opposite: we have values (or vectors in this case) scattered throughout 2D space, and we want to "pool" their values at a set of gridpoints that we choose.
We do this by adding a fraction of each vector to the four surrounding pooling gridpoints. This fraction would again be the relative area of the opposite rectangle.
In the above image... pooling point (xi,yj) would get w_ij * f(x,y) summed along with the appropriate fraction of any other points we have in the region.
As the paper states, the spacing of the grid points is up to you. I assume it would need to be big enough to allow most polling points have at least one vector in its neighbourhood.
EDIT: Here is an example of what I mean.
(0,1) . _ _ _ _ _ . (1,1)
| |
| v |
| |
| |
(0,0) . _ _ _ _ _ . (1,0)
Let's say the vector v=[10,5] is at point (0.2,0.8)
point (0,0) gets weight 0.8*0.2=0.16, so we add 0.16*v = [1.6,0.8] to that pool
point (1,0) gets weight 0.2*0.2=0.04, so we add 0.04*v = [0.4,0.2] to that pool
point (0,1) gets weight 0.8*0.8=0.64, so we add 0.64*v = [6.4,3.2] to that pool
point (1,1) gets weight 0.2*0.8=0.16, so we add 0.16*v = [1.6,0.8] to that pool

Resources