I have a formula that I use multiple times in my subroutine, but my processor (an M0) does not have a division instruction, so division is handled by the software library. To speed up this operation, I am considering using a lookup table to store the result of the inverse. However, that would still take up 2 KB of space (1024 values at 2 bytes per value). How can I optimize it further?
The formula is as follows, where k is a constant known at compile time, k in [10, 100], and x in [0, 1023]:
(1000 * k) * ((1023/x) - 1)
EDIT: Clarification about precision. Since I already have the "1000" factor, I am considering using the result of the multiplication by 1000 to increase precision.
Assuming / is integer division
You don't need to store 1024 values, because many values of x result in the same value of 1023/x.
Specifically:
x: [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 33, 34, 35, 36, 37, 39, 40, 42, 44, 46, 48, 51, 53, 56, 60, 63, 68, 73, 78, 85, 93, 102, 113, 127, 146, 170, 204, 255, 341, 511, 1023]
1023/x: [1023, 511, 341, 255, 204, 170, 146, 127, 113, 102, 93, 85, 78, 73, 68, 63, 60, 56, 53, 51, 48, 46, 44, 42, 40, 39, 37, 36, 35, 34, 33, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
You need only to store these 62 values of x and the 62 results of 1023/x.
As a bonus: if you look carefully, you'll notice those values are symmetric. The values for x are the exact mirror of the values for 1023/x. So you only need to store one of these two arrays.
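A minimal sketch of that single-table approach (assuming the 62 values above, stored in ascending order, and x in [1, 1023]; the function name is only for illustration):

#include <stdint.h>

static const uint16_t q[62] = {
      1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,  14,
     15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,  28,
     29,  30,  31,  33,  34,  35,  36,  37,  39,  40,  42,  44,  46,  48,
     51,  53,  56,  60,  63,  68,  73,  78,  85,  93, 102, 113, 127, 146,
    170, 204, 255, 341, 511, 1023
};

/* Returns 1023 / x for x in [1, 1023], with no division. */
static uint16_t div1023(uint16_t x)
{
    unsigned lo = 0, hi = 61;
    while (lo < hi) {                    /* binary search for the smallest q[i] >= x */
        unsigned mid = (lo + hi) / 2;
        if (q[mid] < x)
            lo = mid + 1;
        else
            hi = mid;
    }
    return q[61 - lo];                   /* the mirrored entry is the quotient */
}

The table is 124 bytes, and the search needs at most 6 compare-and-halve steps.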
You can easily shrink the lookup table to 256*2 bytes:
static inline uint16_t get1023divxminus1(uint16_t x)
{
    static const uint16_t table[256] = {0, 1022, 510, ....., 3};
    if (x >= 512) return 0;
    if (x >= 342) return 1;
    if (x >= 256) return 2;
    return table[x];
}
You could shrink the table even further, but I think it isn't worth the additional ifs.
You could compress the data in the table.
For example, by storing full 2-byte values for every N-th value of x and storing 1-byte difference values for the xs in between. The differences should fit in 1 byte in many cases.
If N were 4, you'd store full values for x = 0, 4, 8, ... and difference values for x = 1, 2, 3, 5, 6, 7, 9, ...
To get the result for, say, x == 3, start with the full 2-byte value stored for x == 0 and add the 1-byte difference values for x == 1, 2 and 3.
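A sketch of how that could look for the 256-entry table above, with N == 4. The first few entries change by more than a byte per step, so they get special-cased here; the array names are illustrative, and in practice you would generate the two tables offline rather than at runtime:

#include <stdint.h>

#define N 4
static uint16_t anchors[256 / N];        /* full 2-byte value for every N-th x */
static int8_t   deltas[256 - 256 / N];   /* 1-byte differences in between      */

/* Build the compressed form of table[x] = 1023/x - 1 (with table[0] == 0). */
static void build_tables(void)
{
    uint16_t prev = 0;
    for (unsigned x = 0; x < 256; x++) {
        uint16_t v = (x == 0) ? 0 : (uint16_t)(1023 / x - 1);
        if (x % N == 0)
            anchors[x / N] = v;
        else if (x >= N)                 /* x = 1..3: differences too big for a byte */
            deltas[x - x / N - 1] = (int8_t)(v - prev);
        prev = v;
    }
}

static uint16_t lookup(unsigned x)       /* x in [0, 255] */
{
    static const uint16_t small[N] = { 0, 1022, 510, 340 };
    if (x < N)
        return small[x];                 /* entries whose differences don't fit in a byte */
    uint16_t v = anchors[x / N];
    for (unsigned i = x - x % N + 1; i <= x; i++)   /* add the differences up to x */
        v = (uint16_t)(v + deltas[i - i / N - 1]);
    return v;
}

That replaces the 512-byte table with roughly 128 + 192 = 320 bytes, at the cost of up to three extra additions per lookup.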
There will surely be other 'tricks' to play if you have a close look at the data and think in the direction of data compression.
Accessing RAM is probably going to be slower than calculating long division, as long as your values fit within a register. In principle, calculating long division should be linear in the number of bits. Implement both and profile, but I am highly convinced that long division will be faster:
The algorithm is:
Left-shift the divisor until its most significant set bit lines up with the most significant set bit of the dividend.
If the shifted divisor is less than or equal to what remains of the dividend, subtract it and write a 1 bit, else write a 0 bit. Right-shift the divisor by one. Repeat until the divisor is back at its original position.
Here is an explicit implementation:
https://codegolf.stackexchange.com/questions/24541/divide-two-numbers-using-long-division
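A bare-bones shift-and-subtract sketch of the algorithm described above might look like this (this is not the code behind the link, and it assumes the divisor is non-zero):

#include <stdint.h>

static uint32_t udiv32(uint32_t dividend, uint32_t divisor)   /* divisor must be non-zero */
{
    uint32_t quotient = 0;
    int shift = 0;

    /* left-shift the divisor until its top set bit lines up with the dividend's */
    while (!(divisor & 0x80000000u) && (divisor << 1) <= dividend) {
        divisor <<= 1;
        shift++;
    }

    /* one quotient bit per step: subtract whenever the shifted divisor still fits */
    for (; shift >= 0; shift--) {
        quotient <<= 1;
        if (divisor <= dividend) {
            dividend -= divisor;
            quotient |= 1u;
        }
        divisor >>= 1;
    }
    return quotient;               /* 'dividend' now holds the remainder */
}

The loop body runs once per quotient bit, which matches the linear-in-bits claim above; profiling it against the table lookup on the actual target is still the only way to know which one wins.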
With the following simple code snippet:
struct timespec ts;
for (int i = 0; i < 100; i++) {
    timespec_get(&ts, TIME_UTC);
    printf("%ld, ", ts.tv_nsec % 100);
}
I get output like this:
58, 1, 74, 49, 5, 59, 89, 20, 52, 86, 17, 48, 79, 10, 41, 73, 3, 40, 72, 3, 36, 67, 98, 30, 61, 92, 24, 55, 86, 17, 49, 82, 14, 45, 76, 7, 40, 72, 3, 36, 71, 2, 35, 66, 97, 28, 66, 97, 28, 60, 90, 22, 52, 83, 15, 46, 77, 7, 41, 72, 3, 36, 67, 0, 44, 17, 82, 13, 45, 77, 8, 59, 90, 22, 54, 85, 17, 48, 80, 12, 43, 75, 6, 57, 89, 20, 52, 84, 15, 47, 79, 14, 50, 82, 16, 47, 79, 11, 43, 74,
I haven't studied the statistical distribution of the numbers and my searches have turned up blank, but the output does at first glance look similar to the output of rand() or random(). Has anyone studied this or is able to offer an opinion: could timespec_get() be used as a random number generator? Would it be pseudo random or not? Why?
Could timespec_get() be used as a random number generator?
Of course. But that doesn't mean the output of such an RNG would have desirable or even acceptable statistical properties.
In particular, successive outputs are strongly correlated with each other. Your example hides that, somewhat, by discarding all the most-significant decimal digits. Additionally, the system clock is not required to have single-nanosecond resolution, though yours apparently does. In a system that didn't have such resolution, the least-significant digits of all results would likely be correlated, and their distribution non-uniform.
Would it be pseudo random or not? Why?
No, actually. The output of a PRNG is deterministic with respect to the runtime state of the calling program at the time of the call. timespec_get(), on the other hand, depends on the program's execution context (the system clock), not on any state the program itself maintains.
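For contrast, a toy PRNG along the lines of the C standard's example rand() implementation shows what "deterministic with respect to program state" means: given the same seed, the sequence is always the same.

#include <stdint.h>

static uint32_t lcg_state = 12345u;                   /* the seed: the only state */

static uint32_t lcg_next(void)
{
    lcg_state = lcg_state * 1103515245u + 12345u;     /* next state from old state only */
    return (lcg_state >> 16) & 0x7fffu;               /* output derived from that state */
}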
The code you have provided is almost guaranteed not to produce (pseudo-)random numbers!
Why?
Consider running this on an efficient CPU that can dedicate 100% of its time to your code (and with nothing else of 'significant impact' going on in the OS background): each run of the for loop executes an identical instruction sequence, so the intervals between successive calls to timespec_get will all be very similar, and a list of numbers with consistently similar intervals is certainly not random.
Even a fairly cursory glance through your generated number list shows that the only time a number is less than its predecessor is when the value "rolls over" the 100 mark (this effect will be more noticeable if you increase your modulus from 100 to, say, 500 and run the test again).
Could timespec_get() be used as a random number generator?
I tried calling timespec_get(&ts, TIME_UTC); multiple times and received delta values of about 14 +/- 1 ns. To me this implies, at best, about 1 bit of unpredictability (randomness) per call, given the variability in the delta, and not the 7 to 8 bits hoped for from ts.tv_nsec % 100. At worst, there are nearly zero bits of randomness.
.tv_nsec and .tv_sec could be used to initialize a random engine, but as a source of randomness they are very weak.
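A minimal sketch of how such deltas can be measured (the exact numbers will depend entirely on the platform and its clock resolution):

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec prev, cur;
    timespec_get(&prev, TIME_UTC);
    for (int i = 0; i < 20; i++) {
        timespec_get(&cur, TIME_UTC);
        /* delta between consecutive reads; ignores the rare tv_sec rollover */
        printf("%ld\n", cur.tv_nsec - prev.tv_nsec);
        prev = cur;
    }
    return 0;
}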
Would it be pseudo random or not? Why?
No. A PRNG is deterministic. Reading the time is not deterministic enough.
I have a 3-dimensional numpy array (temp_X) like:
[ [[23,34,45,56],[34,45,67,78],[23,45,67,78]],
[[12,43,65,43],[23,54,67,87],[12,32,34,43]],
[[43,45,86,23],[23,45,56,23],[12,23,65,34]] ]
I want to remove the 1st element of the 3rd sub-array in each block (the values 23, 12 and 12 above).
Shown below is the code that I tried:
for i in range(len(temp_X)):
    temp_X = np.delete(temp_X[i][(len(temp_X[i]) - 1)], [0])
Somehow when I run the code the whole array gets deleted except for 3 values. Any help is much appreciated. Thank you in advance.
With a as the 3D input array, here's one way -
m = np.prod(a.shape[1:])
n = m-a.shape[-1]
out = a.reshape(a.shape[0],-1)[:,np.r_[:n,n+1:m]]
Alternative to last step with boolean-indexing -
out = a.reshape(a.shape[0],-1)[:,np.arange(m)!=n]
Sample input, output -
In [285]: a
Out[285]:
array([[[23, 34, 45, 56],
[34, 45, 67, 78],
[23, 45, 67, 78]],
[[12, 43, 65, 43],
[23, 54, 67, 87],
[12, 32, 34, 43]],
[[43, 45, 86, 23],
[23, 45, 56, 23],
[12, 23, 65, 34]]])
In [286]: out
Out[286]:
array([[23, 34, 45, 56, 34, 45, 67, 78, 45, 67, 78],
[12, 43, 65, 43, 23, 54, 67, 87, 32, 34, 43],
[43, 45, 86, 23, 23, 45, 56, 23, 23, 65, 34]])
Here's another with mask creation to mask along the last two axes -
mask = np.ones(a.shape[-2:],dtype=bool)
mask[-1,0] = 0
out = np.moveaxis(a,0,-1)[mask].T
I need to create and work with lists of 2**30 elements, but it's too slow. Is there any way to increase the speed?
My code:
sup = []
for i in range(2**30):
    sup.append([i,pow(y,i,N)])
pow(y, i, N) == y**i % N, i.e. modular exponentiation
I tried to use list comprehensions, but that isn't enough.
Different approach: why do you want to store those numbers in a list?
You have your formula right there; whenever some piece of code needs sup[i], you just compute pow(y,i,N).
In other words: instead of storing the values in a list, just compute them when you need them.
Edit: as it seems that you have good reasons to store that data in an array, I would say: use the appropriate tool.
Meaning: instead of doing compute-intensive things directly in Python, you should rather look into the numpy framework. That framework is designed for exactly such purposes. Beyond that, I would also look at the way you are storing/preparing your data. Example: you mention that you later look for identical entries in that array. I wonder whether that means you should use a dictionary instead of a list; or did you really intend to check 2**30 entries each time you look for equal pow values?
Going by your comment and complementing the answer of GhostCat, go directly for the data you are looking for, for example like this
>>> from collections import defaultdict
>>> y = 2
>>> N = 10
>>> data = defaultdict(list)
>>> for i in range(100):
        data[pow(y,i,N)].append(i)
>>> for x in data.items():
        x
(8, [3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63, 67, 71, 75, 79, 83, 87, 91, 95, 99])
(1, [0])
(2, [1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49, 53, 57, 61, 65, 69, 73, 77, 81, 85, 89, 93, 97])
(4, [2, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, 46, 50, 54, 58, 62, 66, 70, 74, 78, 82, 86, 90, 94, 98])
(6, [4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64, 68, 72, 76, 80, 84, 88, 92, 96])
>>>
Or, more specifically, since you need a random sample, go for it from the start and don't waste time producing a gazillion things you will not need, for example:
>>> import random
>>> random_data = defaultdict(list)
>>> for i in random.sample(range(2**30), 20):
        random_data[pow(2,i,10)].append(i)
>>> for x in random_data.items():
        x
(8, [633728687, 357300263, 208747091, 456291987, 1028949643, 23961003, 750842555])
(2, [602395153, 215460881, 144481457, 829193705])
(4, [752840814, 26689262])
(6, [423520476, 969809132, 326786996, 736424520, 929123176, 865279408, 338237708])
>>>
And depending on what you do with those i later on, you can instead try a more mathematical approach and uncover the underlying pattern that produces the i for which y^i mod N is the same; that way you can produce as many i as you need for a particular modular class.
For this example that is easy:
2^i = 8 (mod 10) for all i = 3 (mod 4) -> range(3,2**30,4)
2^i = 2 (mod 10) for all i = 1 (mod 4) -> range(1,2**30,4)
2^i = 4 (mod 10) for all i = 2 (mod 4) -> range(2,2**30,4)
2^i = 6 (mod 10) for all i = 0 (mod 4) -> range(4,2**30,4)
2^i = 1 (mod 10) for i = 0
I am trying to optimize the Kasumi crypto algorithm written in C.
There are S-boxes that are used to encrypt the data, which I am representing as arrays, and they are huge:
int S7[128] = {
54, 50, 62, 56, 22, 34, 94, 96, 38, 6, 63, 93, 2, 18,123, 33,
55,113, 39,114, 21, 67, 65, 12, 47, 73, 46, 27, 25,111,124, 81,
53, 9,121, 79, 52, 60, 58, 48,101,127, 40,120,104, 70, 71, 43,
20,122, 72, 61, 23,109, 13,100, 77, 1, 16, 7, 82, 10,105, 98,
117,116, 76, 11, 89,106, 0,125,118, 99, 86, 69, 30, 57,126, 87,
112, 51, 17, 5, 95, 14, 90, 84, 91, 8, 35,103, 32, 97, 28, 66,
102, 31, 26, 45, 75, 4, 85, 92, 37, 74, 80, 49, 68, 29,115, 44,
64,107,108, 24,110, 83, 36, 78, 42, 19, 15, 41, 88,119, 59, 3
};
int S9[512] = {
167,239,161,379,391,334, 9,338, 38,226, 48,358,452,385, 90,397,
183,253,147,331,415,340, 51,362,306,500,262, 82,216,159,356,177,
175,241,489, 37,206, 17, 0,333, 44,254,378, 58,143,220, 81,400,
95, 3,315,245, 54,235,218,405,472,264,172,494,371,290,399, 76,
165,197,395,121,257,480,423,212,240, 28,462,176,406,507,288,223,
501,407,249,265, 89,186,221,428,164, 74,440,196,458,421,350,163,
232,158,134,354, 13,250,491,142,191, 69,193,425,152,227,366,135,
344,300,276,242,437,320,113,278, 11,243, 87,317, 36, 93,496, 27,
487,446,482, 41, 68,156,457,131,326,403,339, 20, 39,115,442,124,
475,384,508, 53,112,170,479,151,126,169, 73,268,279,321,168,364,
363,292, 46,499,393,327,324, 24,456,267,157,460,488,426,309,229,
439,506,208,271,349,401,434,236, 16,209,359, 52, 56,120,199,277,
465,416,252,287,246, 6, 83,305,420,345,153,502, 65, 61,244,282,
173,222,418, 67,386,368,261,101,476,291,195,430, 49, 79,166,330,
280,383,373,128,382,408,155,495,367,388,274,107,459,417, 62,454,
132,225,203,316,234, 14,301, 91,503,286,424,211,347,307,140,374,
35,103,125,427, 19,214,453,146,498,314,444,230,256,329,198,285,
50,116, 78,410, 10,205,510,171,231, 45,139,467, 29, 86,505, 32,
72, 26,342,150,313,490,431,238,411,325,149,473, 40,119,174,355,
185,233,389, 71,448,273,372, 55,110,178,322, 12,469,392,369,190,
1,109,375,137,181, 88, 75,308,260,484, 98,272,370,275,412,111,
336,318, 4,504,492,259,304, 77,337,435, 21,357,303,332,483, 18,
47, 85, 25,497,474,289,100,269,296,478,270,106, 31,104,433, 84,
414,486,394, 96, 99,154,511,148,413,361,409,255,162,215,302,201,
266,351,343,144,441,365,108,298,251, 34,182,509,138,210,335,133,
311,352,328,141,396,346,123,319,450,281,429,228,443,481, 92,404,
485,422,248,297, 23,213,130,466, 22,217,283, 70,294,360,419,127,
312,377, 7,468,194, 2,117,295,463,258,224,447,247,187, 80,398,
284,353,105,390,299,471,470,184, 57,200,348, 63,204,188, 33,451,
97, 30,310,219, 94,160,129,493, 64,179,263,102,189,207,114,402,
438,477,387,122,192, 42,381, 5,145,118,180,449,293,323,136,380,
43, 66, 60,455,341,445,202,432, 8,237, 15,376,436,464, 59,461
};
During the encryption we access these arrays very frequently.
One optimization I have already done is moving the arrays from a header file into the local function, hoping to avoid some cache misses.
Any suggestions on how to optimize this further, for example by changing the arrays to some other data structure?
That array is not huge. A typical L1 cache is at least tens of kB (that's the total memory of, say, an Apple II), and moving the array from a header to a function is not going to change cache locality.
Storing it in the appropriate form (as suggested in the comments) may make sense: it's going to fit in the L1 cache either way, but if you have other data, perhaps used by another thread, there's more chance of it staying there. There's no need for more than 2 bytes per value (though I have no idea whether that introduces extra cost compared to using native-size ints).
If this is really critical, you should look at the generated code and optimize that.
First of all, make sure you declare those arrays as const, so that the compiler knows they'll never change.
Second, as Oli Charlesworth suggests in the comments, you don't really need a full int to store each value. The elements of the S7 and S9 arrays are 7-bit and 9-bit unsigned integers, so either of int8_t or uint8_t should be enough for S7, and either of int16_t or uint16_t for S9. (You may want to benchmark whether there's any difference between using signed or unsigned types, although I wouldn't really expect any.)
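For example (a sketch; the values are the ones from the question, truncated here):

#include <stdint.h>

const uint8_t S7[128] = {
    54, 50, 62, 56, 22, 34, 94, 96, /* ... the remaining 7-bit values ... */
};

const uint16_t S9[512] = {
    167, 239, 161, 379, 391, 334, 9, 338, /* ... the remaining 9-bit values ... */
};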
Finally, if you really want to get rid of the arrays entirely, it's also possible to implement the KASUMI S-boxes directly without any lookup tables, using bit operations (specifically, AND and XOR). For details, see pages 13–16 of the KASUMI specification. However, I strongly suspect that this will not be useful for a software implementation, unless you're using bit-slicing to encrypt many blocks in parallel.