python: vectorized cumulative counting - arrays

I have a numpy array and would like to count the number of occurrences of each value, but in a cumulative way:
in = [0, 1, 0, 1, 2, 3, 0, 0, 2, 1, 1, 3, 3, 0, ...]
out = [0, 0, 1, 1, 0, 0, 2, 3, 1, 2, 3, 1, 2, 4, ...]
I'm wondering if it is best to create a (sparse) matrix with ones at col = i and row = in[i]
1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0
0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0
0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0
Then we could compute the cumsums along the rows and extract the numbers from the locations where the cumsums increment.
However, if we cumsum a sparse matrix, doesn't it become dense? Is there an efficient way of doing this?
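For reference, here is a minimal non-vectorized sketch (my own, not part of the question) that makes the desired output concrete and can be used to verify any faster version:
from collections import defaultdict

def cumcount_loop(values):
    # Running per-value counter: emit the count seen so far, then increment
    seen = defaultdict(int)
    out = []
    for v in values:
        out.append(seen[v])
        seen[v] += 1
    return out

print(cumcount_loop([0, 1, 0, 1, 2, 3, 0, 0, 2, 1, 1, 3, 3, 0]))
# [0, 0, 1, 1, 0, 0, 2, 3, 1, 2, 3, 1, 2, 4]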

Here's one vectorized approach using sorting -
import numpy as np

def cumcount(a):
    # Store length of array
    n = len(a)

    # Get sorted indices (used later on too) and store the sorted array.
    # A stable sort keeps equal values in their original order, which the
    # per-group counts rely on.
    sidx = a.argsort(kind='stable')
    b = a[sidx]

    # Mask of shifts/groups
    m = b[1:] != b[:-1]

    # Get indices of those shifts
    idx = np.flatnonzero(m)

    # ID array that will store the cumulative nature at the very end
    id_arr = np.ones(n, dtype=int)
    id_arr[idx[1:]+1] = -np.diff(idx)+1
    id_arr[idx[0]+1] = -idx[0]
    id_arr[0] = 0
    c = id_arr.cumsum()

    # Finally re-arrange those cumulative values back to original order
    out = np.empty(n, dtype=int)
    out[sidx] = c
    return out
Sample run -
In [66]: a
Out[66]: array([0, 1, 0, 1, 2, 3, 0, 0, 2, 1, 1, 3, 3, 0])
In [67]: cumcount(a)
Out[67]: array([0, 0, 1, 1, 0, 0, 2, 3, 1, 2, 3, 1, 2, 4])
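If pandas happens to be available (an assumption; the question only mentions numpy), its groupby machinery gives the same result directly:
import numpy as np
import pandas as pd

a = np.array([0, 1, 0, 1, 2, 3, 0, 0, 2, 1, 1, 3, 3, 0])
out = pd.Series(a).groupby(a).cumcount().to_numpy()
# array([0, 0, 1, 1, 0, 0, 2, 3, 1, 2, 3, 1, 2, 4])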

Design a specific algorithm for an n×n array running in O(n log n)

The problem:
Suppose that each row of an n×n array A consists of 1's and 0's such that, in any row of A, all the 1's come before any 0's in that row. Assuming A is already in memory, describe a method running in O(n log n) time (not O(n²) time!) for counting the number of 1's in A.
My experience: I have done it in O(n), but I don't know how to achieve O(n log n).
I would appreciate any help!
Consider that each individual row consists of all 1s followed by all 0s:
1111111000
You can use a binary search to find the transition point (the last 1 in the row). The way this works is to set low and high to the ends and check the middle.
If you are at the transition point, you're done. Otherwise, if you're in the 1s, set low to one after the midpoint. Otherwise, you're in the 0s, so set high to one before the midpoint.
That would go something like (pseudo-code, with some optimisations):
def countOnes(row):
    # Special cases first: empty, all 0s, or all 1s.
    if row.length == 0: return 0
    if row[0] == 0: return 0
    if row[row.length - 1] == 1: return row.length

    # At this point, there must be at least one of each value,
    # so length >= 2. That means you're guaranteed to find a
    # transition point.
    lo = 0
    hi = row.length - 1
    while true:
        mid = (lo + hi) / 2
        if row[mid] == 1 and row[mid+1] == 0:
            return mid + 1
        if row[mid] == 1:
            lo = mid + 1
        else:
            hi = mid - 1
Since a binary search for a single row is O(log N) and you need to do that for N rows, the resultant algorithm is O(N log N).
For a more concrete example, see the following complete Python program, which generates a mostly random matrix then uses the O(N) method and the O(log N) method (the former as confirmation) to count the ones in each row:
import random

def slow_count(items):
    count = 0
    for item in items:
        if item == 0:
            break
        count += 1
    return count

def fast_count(items):
    # Special cases first, no 1s or all 1s.
    if len(items) == 0: return 0
    if items[0] == 0: return 0
    if items[len(items) - 1] == 1: return len(items)

    # At this point, there must be at least one of each value,
    # so length >= 2. That means you're guaranteed to find a
    # transition point.
    lo = 0
    hi = len(items) - 1
    while True:
        mid = (lo + hi) // 2
        if items[mid] == 1 and items[mid+1] == 0:
            return mid + 1
        if items[mid] == 1:
            lo = mid + 1
        else:
            hi = mid - 1

# Ensure test data has rows with all zeros and all ones.
N = 20
matrix = [[1] * N, [0] * N]

# Populate other rows randomly.
random.seed()
for _ in range(N - 2):
    numOnes = random.randint(0, N)
    matrix.append([1] * numOnes + [0] * (N - numOnes))

# Print rows and counts using slow-proven and fast method.
for row in matrix:
    print(row, slow_count(row), fast_count(row))
The fast_count function is the equivalent of what I've provided in this answer.
A sample run is:
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] 20 20
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 0 0
[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 5 5
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0] 15 15
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 10 10
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 1 1
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0] 11 11
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0] 12 12
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0] 11 11
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 1 1
[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 6 6
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0] 16 16
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0] 14 14
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0] 11 11
[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 9 9
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0] 13 13
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 1 1
[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 4 4
[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 6 6
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0] 19 19
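To answer the exercise as stated (the total number of 1's in A), it is then just a matter of summing the per-row counts; a minimal follow-up sketch using the fast_count and matrix defined above:
def count_ones_in_matrix(matrix):
    # One O(log N) binary search per row, N rows in total: O(N log N) overall
    return sum(fast_count(row) for row in matrix)

print(count_ones_in_matrix(matrix))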

resize array while keeping mask

I'm trying to figure out how to efficiently resize a 1-D array while keeping the mask it represents. I use this array to draw simple sprites, where each value in the array represents a specific color.
Anyway, my goal is as follows. I have the following "small" array of values:
0, 1, 2, 3,
0, 1, 2, 2,
0, 1, 1, 1,
0, 0, 1, 1,
0, 0, 0, 0
This is obviously going to be a sprite of size 4x5.
Now I want to resize it, keeping the values, so I get the same sprite/shape but at a higher resolution.
Scaling by 2 would give an 8x10 sprite, whose 1-d array should then look as follows:
0, 0, 1, 1, 2, 2, 3, 3,
0, 0, 1, 1, 2, 2, 3, 3,
0, 0, 1, 1, 2, 2, 2, 2,
0, 0, 1, 1, 2, 2, 2, 2,
0, 0, 1, 1, 1, 1, 1, 1,
0, 0, 1, 1, 1, 1, 1, 1,
0, 0, 0, 0, 1, 1, 1, 1,
0, 0, 0, 0, 1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0
My idea is to group the numbers row by row, take the scale factor (2) and repeat each digit within a group as many times as the scale factor in one row, then duplicate each row by the scale factor as well. But I am still not sure whether this covers all cases.
Any other (more effective) way to handle this?
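One compact possibility, assuming numpy is acceptable here (the question does not say), is to reshape to 2-D and let np.repeat duplicate rows and columns; the names sprite, width, height and scale below are only illustrative:
import numpy as np

sprite = np.array([0, 1, 2, 3,
                   0, 1, 2, 2,
                   0, 1, 1, 1,
                   0, 0, 1, 1,
                   0, 0, 0, 0])
width, height, scale = 4, 5, 2

scaled = (sprite.reshape(height, width)   # back to 2-D
                .repeat(scale, axis=0)    # duplicate each row
                .repeat(scale, axis=1)    # duplicate each column
                .ravel())                 # flatten to 1-D again
print(scaled.reshape(height * scale, width * scale))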

Numpy Number Patterns

Is there a function in Numpy that allows you to take 4 records at a time and see where they match a second dataset? Once there is a match, move on to the next 4 records of the first dataset. It won't always be every 4 records, but I am using this as an example.
So if dataset one had - 1,5,7,8,10,12,6,1,3,6,8,9
And the second dataset had - 1,5,7,8,11,15,6,1,3,6,10,6
My result will be: 1,5,7,8, 6,1,3,6
POST EDIT:
My second example datasets:
import numpy as np
a =np.array([15,15,0,0,10,10,0,0,2,1,8,8,42,2,4,4,3,1,1,3,5,6,0,9,47,1,1,7,7,0,0,45,12,17,45])
b = np.array ([6,0,0,15,15,0,0,10,10,0,0,2,1,8,8,42,2,4,4,3,3,4,6,0,9,47,1,1,7,7,0,0,45,12,16,1,9,3,30])
Thank you in advance for looking at my question!!
Update: for the more difficult and more interesting alignment problem, it is probably best not to reinvent the wheel but to rely on Python's difflib:
from difflib import SequenceMatcher
import numpy as np
k=4
a = np.array([15,15,0,0,10,10,0,0,2,1,8,8,42,2,4,4,3,1,1,3,5,6,0,9,47,1,1,7,7,0,0,45,12,17,45])
b = np.array ([6,0,0,15,15,0,0,10,10,0,0,2,1,8,8,42,2,4,4,3,3,4,6,0,9,47,1,1,7,7,0,0,45,12,16,1,9,3,30])
sm = SequenceMatcher(a=a, b=b)
matches = sm.get_matching_blocks()
matches = [m for m in matches if m.size >= k]
# [Match(a=0, b=3, size=17), Match(a=21, b=22, size=12)]
consensus = [a[m.a:m.a+m.size] for m in matches]
# [array([15, 15, 0, 0, 10, 10, 0, 0, 2, 1, 8, 8, 42, 2, 4, 4, 3]), array([ 6, 0, 9, 47, 1, 1, 7, 7, 0, 0, 45, 12])]
consfour = [a[m.a:m.a + m.size // k * k] for m in matches]
# [array([15, 15, 0, 0, 10, 10, 0, 0, 2, 1, 8, 8, 42, 2, 4, 4]), array([ 6, 0, 9, 47, 1, 1, 7, 7, 0, 0, 45, 12])]
# Each summary row holds (position in a, position in b, matched value),
# keeping only full multiples of k from each matching block
summary = [np.c_[np.add.outer(np.arange(m.size // k * k), (m.a, m.b)), c]
           for m, c in zip(matches, consfour)]
merge = np.concatenate(summary, axis=0)
Below is my original solution assuming already aligned and same-length arrays:
Here is a hybrid solution: numpy to find consecutive matches and cut them out, then a list comprehension to apply the length constraints:
import numpy as np
d1 = np.array([7,1,5,7,8,0,6,9,0,10,12,6,1,3,6,8,9])
d2 = np.array([8,1,5,7,8,0,6,9,0,11,15,6,1,3,6,10,6])
k = 4
# find matches
m = d1 == d2
# find switches between match, no match
sw = np.where(m[:-1] != m[1:])[0] + 1
# split
mnm = np.split(d1, sw)
# select matches
ones_ = mnm[1-m[0]::2]
# apply length constraint
res = [blck[i:i+k] for blck in ones_ for i in range(len(blck)-k+1)]
# [array([1, 5, 7, 8]), array([5, 7, 8, 0]), array([7, 8, 0, 6]), array([8, 0, 6, 9]), array([0, 6, 9, 0]), array([6, 1, 3, 6])]
res_no_ovlp = [blck[k*i:k*i+k] for blck in ones_ for i in range(len(blck)//k)]
# [array([1, 5, 7, 8]), array([0, 6, 9, 0]), array([6, 1, 3, 6])]
You can use matrix masking, like this:
import numpy as np
from scipy.sparse import dia_matrix

a = np.array([1,5,7,8,10,12,6,1,3,6,8,9])
b = np.array([1,5,7,8,11,15,6,1,3,6,10,6])

# Banded matrix with ones on diagonals 0..3: row i covers positions i..i+3
mask = dia_matrix((np.ones((1, a.size)).repeat(4, axis=0), np.arange(4)),
                  shape=(a.size, b.size), dtype=int)
print(mask.toarray())
matches = a[mask.T.dot(mask.dot(a == b) == 4).astype(bool)]
print(matches)
This will output,
array([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
[1 5 7 8 6 1 3 6]
You can think about how the matrix multiplication works to get this result.
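To make that concrete, here is a short sketch of the intermediate steps, reusing the a, b and mask defined above (the names eq, window_sums, starts and members are mine, added for illustration):
eq = (a == b).astype(int)         # elementwise matches: [1 1 1 1 0 0 1 1 1 1 0 0]
window_sums = mask.dot(eq)        # entry i sums eq over the 4-wide window starting at i
starts = window_sums == 4         # True where a full window of 4 matches begins
members = mask.T.dot(starts) > 0  # spread each start back over its 4 positions
print(a[members])                 # [1 5 7 8 6 1 3 6]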
Scaling
For scaling, I tested with 1e3, 1e5, and 1e7 elements and got,
1e3 - 0.019184964010491967
1e5 - 0.4330314120161347
1e7 - 144.54082221200224
See the gist. Not sure why such a hard jump at 1e7 elements.
This is an exercise in list comprehension. We have the data
data = [1,5,7,8,10,12,6,1,3,6,8,9]
search_data = [1,5,7,8,11,15,6,1,3,6,10,6]
First we can chunk the original data into blocks of length n
n = 4
chunks = [data[i:i + n] for i in range(len(data) - n + 1)]
search_chunks = [search_data[i:i + n] for i in range(len(search_data) - n + 1)]
Now we must select chunks from the first list that appear in the second list
hits = [c for c in chunks if c in search_chunks]
print(hits)
# [[1, 5, 7, 8], [6, 1, 3, 6]]
This may not be the optimal solution for long lists. It may improve performance to use sets instead, if there are likely to be repeated chunks:
chunks = set(tuple(data[i:i + n]) for i in range(len(data) - n + 1))
search_chunks = set(tuple(search_data[i:i + n]) for i in range(len(search_data) - n + 1))
This can be quite competitive with the above numpy solution, e.g.
import numpy as np
import time

# Generate data
len_ = 10000
max_ = 10
data = list(map(int, np.random.rand(len_) * max_))
search_data = list(map(int, np.random.rand(len_) * max_))

# Time list comprehension
start = time.time()
n = 4
chunks = set(tuple(data[i:i + n]) for i in range(len(data) - n + 1))
search_chunks = set(tuple(search_data[i:i + n]) for i in range(len(search_data) - n + 1))
hits = [c for c in chunks if c in search_chunks]
print(time.time() - start)

# Time numpy
a = np.array(data)
b = np.array(search_data)
mask = 1 * (np.abs(np.arange(a.size).reshape((-1, 1)) - np.arange(a.size) - 0.5) < 2)
start = time.time()
matches = a[mask.T.dot(mask.dot(a == b) == 4).astype(bool)]
print(time.time() - start)
It's typically faster here, but it depends on number of repeated chunks etc.

I want to use Bilinear interpolation to calculate the summation of vectors

I have individual vectors from the last stage of code that I implemented.
The next stage of the algorithm is to calculate the summation of these vectors.
As mentioned in the paper:
"The vectors from the previous stage were summed together spatially by bilinearly weighting"
I think the bilinear weighting means bilinear interpolation.
Can anyone tell me, or give me an example of, how I can use bilinear interpolation to calculate the summation of these vectors?
V1 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2]
V2 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 11, 11]
V3 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 23, 0, 0]
V4 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 19, 19, 0, 0]
V5 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0]
V6 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 0, 0]
V7 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 18, 18, 0, 0]
V8 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 23, 23, 0, 0, 0]
V9 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0]
V10 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 0, 0]
V11 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 0, 0]
V12 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 11, 11, 0, 0, 0]
V13 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
V14 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0]
V15 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0]
V16 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0]
I googled it but didn't understand the equations.
Regards and thanks in advance!
Sadly I'm having trouble understanding the paper as well. The idea, as you've said, is to weight the vectors based on their distance from the pooling centres, so that vectors farther from the pooling centres have less of an impact. The paper compares this to what is done in the famous SIFT feature, which you can read about in this tutorial.
Below is my best guess as to what the meaning is. Since this is related to machine learning, you could also ask people over at Cross Validated to get their opinion, or consider contacting the author of the paper.
If I understand correctly, this amounts to a process similar to bilinear interpolation, except in reverse.
With bilinear interpolation, we are given a set of function values arranged in a grid, and we want to find a good guess for what the function values are between the gridpoints. We do this by taking a weighted average of the four surrounding function values, with the weights being the relative area of the opposite rectangle in the image below. (By "relative" I mean the area is normalized by the area of the whole grid rectangle, so the weights sum to 1.) Note how the point to be interpolated is the closest to the (x1,y2) gridpoint, so we weight it with the largest weight (the relative area of the yellow rectangle).
f(x,y) = w_11*f(x1,y1) + w_21*f(x2,y1) + w_12*f(x1,y2) + w_22*f(x2,y2)
w_ij = area of rectangle opposite (xi,yj) / total area of grid square
The "bilinear weighing" described in the paper seems to be doing the opposite: we have values (or vectors in this case) scattered throughout 2D space, and we want to "pool" their values at a set of gridpoints that we choose.
We do this by adding a fraction of each vector to the four surrounding pooling gridpoints. This fraction would again be the relative area of the opposite rectangle.
In the above image... pooling point (xi,yj) would get w_ij * f(x,y) summed along with the appropriate fraction of any other points we have in the region.
As the paper states, the spacing of the grid points is up to you. I assume it would need to be large enough to allow most pooling points to have at least one vector in their neighbourhood.
EDIT: Here is an example of what I mean.
(0,1) . _ _ _ _ _ . (1,1)
      |           |
      |    v      |
      |           |
      |           |
(0,0) . _ _ _ _ _ . (1,0)
Let's say the vector v=[10,5] is at point (0.2,0.8)
point (0,0) gets weight 0.8*0.2=0.16, so we add 0.16*v = [1.6,0.8] to that pool
point (1,0) gets weight 0.2*0.2=0.04, so we add 0.04*v = [0.4,0.2] to that pool
point (0,1) gets weight 0.8*0.8=0.64, so we add 0.64*v = [6.4,3.2] to that pool
point (1,1) gets weight 0.2*0.8=0.16, so we add 0.16*v = [1.6,0.8] to that pool
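A minimal Python sketch of this accumulation, assuming each vector sits somewhere inside a unit grid cell with corners (0,0), (1,0), (0,1), (1,1) (the function name and variables are mine, just for illustration):
import numpy as np

def pool_bilinear(v, x, y):
    # Split vector v at position (x, y) among the four surrounding corners,
    # weighting each corner by the relative area of the opposite rectangle
    v = np.asarray(v, dtype=float)
    weights = {
        (0, 0): (1 - x) * (1 - y),
        (1, 0): x * (1 - y),
        (0, 1): (1 - x) * y,
        (1, 1): x * y,
    }
    return {corner: w * v for corner, w in weights.items()}

shares = pool_bilinear([10, 5], x=0.2, y=0.8)
# shares[(0, 0)] -> [1.6, 0.8], shares[(0, 1)] -> [6.4, 3.2], etc.
# Summing these shares into per-corner accumulators over all vectors
# gives the pooled result described above.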

Leading zeros calculation with intrinsic function

I'm trying to optimize some code working in an embedded system (FLAC decoding, Windows CE, ARM 926 MCU).
The default implementation uses a macro and a lookup table:
/* counts the # of zero MSBs in a word */
#define COUNT_ZERO_MSBS(word) ( \
    (word) <= 0xffff ? \
        ( (word) <= 0xff ? byte_to_unary_table[word] + 24 : \
            byte_to_unary_table[(word) >> 8] + 16 ) : \
        ( (word) <= 0xffffff ? byte_to_unary_table[word >> 16] + 8 : \
            byte_to_unary_table[(word) >> 24] ) \
)
static const unsigned char byte_to_unary_table[] = {
    8, 7, 6, 6, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
};
However, most CPUs already have a dedicated instruction for this, bsr on x86 and clz on ARM (http://www.devmaster.net/articles/fixed-point-optimizations/), which should be more efficient.
On Windows CE we have the intrinsic function _CountLeadingZeros, which should simply emit that instruction. However, it is 4 times slower than the macro (measured over 10 million iterations).
How is it possible that an intrinsic function, which (supposedly) relies on a dedicated ASM instruction, is 4 times slower?
Check the disassembly. Are you sure that the compiler inserted the instruction? In the Remarks section there is this text:
This function can be implemented by calling a runtime function.
I suspect that's what's happening in your case.
Note that the CLZ instruction is only available in ARMv5 and later. You need to tell the compiler if you want ARMv5 code:
/QRarch5 ARM5 Architecture
/QRarch5T ARM5T Architecture
(Microsoft incorrectly uses "ARM5" instead of "ARMv5")
