How can I modify a C struct by an array index?

I am working with structs in a game I am making, and the struct holds information about the humanoid (position, speed, etc). I have a struct for the player and one for a zombie. I'm adding the zombie struct onto an array (and plan on doing the same for other enemies) in order to iterate over them every frame loop and advance them one step forward or work game logic out. (For the record, the game runs on the Nintendo DS). Unfortunately, I can't seem to modify values of the struct via an array index, as the values don't change. Modifying the struct directly works, but it would be a huge pain to update every struct independently (reason for looping through an array of the structs). Here is the code that goes in more depth:
typedef struct{
int x;
float y;
float speed;
bool isSwordActive;
bool facingRight;
} humanoid;
void game(){
humanoid player = {12, 12, 0, false, false};
humanoid zombie = {226, 12, 0, false, false};
humanoid objects[1] = {zombie};
while(1){
// This works
zombie.y += zombie.speed;
zombie.speed += 0.125;
// This does not work
for(int i = 0; i < sizeof(objects)/sizeof(objects[0]); i++){
objects[i].y += objects[i].speed;
objects[i].speed += 0.125;
}
}
}
Could someone please explain why this doesn't work, and a possible solution? Thank you!

objects should be a pointer array. You were copying the data into the array, so the original objects were untouched.
Here is the refactored code. It is annotated:
#define bool int
#define false 0
#define true 1
typedef struct {
int x;
float y;
float speed;
bool isSwordActive;
bool facingRight;
} humanoid;
void
game()
{
humanoid player = { 12, 12, 0, false, false };
humanoid zombie = { 226, 12, 0, false, false };
// ORIGINAL
#if 0
humanoid objects[1] = { zombie };
// FIXED
#else
humanoid *objects[2] = { &zombie, &player };
#endif
// ORIGINAL
#if 0
while (1) {
// This works
zombie.y += zombie.speed;
zombie.speed += 0.125;
// This does not work
for (int i = 0; i < sizeof(objects) / sizeof(objects[0]); i++) {
objects[i].y += objects[i].speed;
objects[i].speed += 0.125;
}
}
#endif
// BETTER
#if 0
while (1) {
// This works
zombie.y += zombie.speed;
zombie.speed += 0.125;
// This now works: objects[i] is a pointer to the original struct
for (int i = 0; i < sizeof(objects) / sizeof(objects[0]); i++) {
objects[i]->y += objects[i]->speed;
objects[i]->speed += 0.125;
}
}
#endif
// BEST
#if 1
while (1) {
// This works
zombie.y += zombie.speed;
zombie.speed += 0.125;
// This now works: obj points at the original struct
for (int i = 0; i < sizeof(objects) / sizeof(objects[0]); i++) {
humanoid *obj = objects[i];
obj->y += obj->speed;
obj->speed += 0.125;
}
}
#endif
}
In the above code, I've used cpp conditionals to denote old vs. new code:
#if 0
// old code
#else
// new code
#endif
#if 1
// new code
#endif
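The copy-versus-pointer distinction behind this fix can also be seen in isolation. The following is a minimal sketch, not taken from the game code: the `thing` struct and both helper functions are hypothetical names chosen for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for the humanoid struct. */
typedef struct {
    float y;
    float speed;
} thing;

/* Advances every element of an array that holds struct COPIES.
 * Only the copies inside the array change. */
static void step_copies(thing *copies, size_t n)
{
    for (size_t i = 0; i < n; i++)
        copies[i].y += copies[i].speed;
}

/* Advances every struct that an array of POINTERS refers to.
 * The originals change, because the array only stores their addresses. */
static void step_pointers(thing **ptrs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        ptrs[i]->y += ptrs[i]->speed;
}
```

Calling `step_copies` on an array initialized as `{ zombie }` leaves `zombie` itself untouched, while `step_pointers` on an array initialized as `{ &zombie }` updates it.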
This works, thank you! Pointers are always screwing me over. If you don't mind, could you explain why the best solution is better than the better solution? Is it performance-wise? Wouldn't performance slow down a bit if there are too many objects and making new variables for each object? Thank you! –
JuanR4140
We're not creating/copying the contents of the array element (e.g. 20 bytes). We're just calculating the address of the array element. This must be done anyway, so no extra code.
This is a fundamental difference and I think this is still a point of confusion for you.
If we compile without optimization, the "BETTER" solution would reevaluate objects[i] three times. Before it can "use" objects[i], the pointer stored there must be fetched and placed into a CPU register. Because objects is an array of pointers, the element size is sizeof(humanoid *). That is, it would do:
regA = &objects[0];
regA += i * sizeof(humanoid *);
regA = *regA;
regA->something
Again, this is repeated 3 times:
regA = &objects[0];
regA += i * sizeof(humanoid *);
regA = *regA;
regA->something
regA = &objects[0];
regA += i * sizeof(humanoid *);
regA = *regA;
regA->something
regA = &objects[0];
regA += i * sizeof(humanoid *);
regA = *regA;
regA->something
If we compile with optimization, the generated code would be similar to the "BEST" solution. That is, the compiler realizes that objects[i] is invariant within the given loop iteration, so it only needs to calculate it once:
regA = &objects[0];
regA += i * sizeof(humanoid *);
regA = *regA;
regA->something
regA->something
regA->something
Note that the compiler must always put the calculated address value into a register before it can use it. The compiler will assume that obj should be in a register. So, no extra code generated.
In the "BEST" version, I "told" the compiler how to generate the efficient code. And, it would generate the efficient code without optimization.
And, doing obj->something in several different places is simpler to read [by "humanoid" programmers ;-)] than objects[i]->something in multiple places.
This would show its value even more with a 2D array.
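As a rough illustration of the 2D case, here is a sketch that assumes a hypothetical grid of pointers; the `actor` struct, the `ROWS`/`COLS` sizes, and `step_grid` are made up for this example. Each cell is indexed once and then reused through a local pointer.

```c
#include <stddef.h>

#define ROWS 2
#define COLS 3

/* Hypothetical stand-in for the humanoid struct. */
typedef struct {
    float y;
    float speed;
} actor;

/* Advances every actor referenced by the grid; empty cells hold NULL. */
static void step_grid(actor *grid[ROWS][COLS])
{
    for (size_t r = 0; r < ROWS; r++) {
        for (size_t c = 0; c < COLS; c++) {
            actor *obj = grid[r][c];   /* evaluate grid[r][c] once */
            if (obj == NULL)
                continue;              /* nothing stored in this cell */
            obj->y += obj->speed;
            obj->speed += 0.125f;
        }
    }
}
```

Without the local `obj`, every access would have to be spelled `grid[r][c]->...`, which is both longer to read and, in an unoptimized build, recomputed each time.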


Solving a large system of linear equations over the finite field F2

I have 10163 equations and 9000 unknowns, all over finite fields, like this style:
Of course my equation will be much larger than this, I have 10163 rows and 9000 different x.
Presented in the form of a matrix is AX=B. A is a 10163x9000 coefficient matrix and it may be sparse, X is a 9000x1 unknown vector, B is the result of their multiplication and mod 2.
Because of the large number of unknowns that need to be solved for, it can be time consuming. I'm looking for a faster way to solve this system of equations using C language.
I tried to solve this system with Gaussian elimination. To make the elimination between rows more efficient, I store the matrix A in a two-dimensional array of 64-bit words and let the last column of the array store the value of B, so that rows can be combined with XOR operations, which should reduce the computation time.
The code I am using is as follows:
uint8_t guss_x_main[R_BITS] = {0};
uint64_t tmp_guss[guss_j_num];
for(uint16_t guss_j = 0; guss_j < x_weight; guss_j++)
{
uint64_t mask_1 = 1;
uint64_t mask_guss = (mask_1 << (guss_j % GUSS_BLOCK));
uint16_t eq_j = guss_j / GUSS_BLOCK;
for(uint16_t guss_i = guss_j; guss_i < R_BITS; guss_i++)
{
if((mask_guss & equations_guss_byte[guss_i][eq_j]) != 0)
{
if(guss_x_main[guss_j] == 0)
{
guss_x_main[guss_j] = 1;
for(uint16_t change_i = 0; change_i < guss_j_num; change_i++)
{
tmp_guss[change_i] = equations_guss_byte[guss_j][change_i];
equations_guss_byte[guss_j][change_i] =
equations_guss_byte[guss_i][change_i];
equations_guss_byte[guss_i][change_i] = tmp_guss[change_i];
}
}
else
{
GUARD(xor_64(equations_guss_byte[guss_i], equations_guss_byte[guss_i],
equations_guss_byte[guss_j], guss_j_num));
}
}
}
for(uint16_t guss_i = 0; guss_i < guss_j; guss_i++)
{
if((mask_guss & equations_guss_byte[guss_i][eq_j]) != 0)
{
GUARD(xor_64(equations_guss_byte[guss_i], equations_guss_byte[guss_i],
equations_guss_byte[guss_j], guss_j_num));
}
}
}
R_BITS = 10163, x_weight = 9000, GUSS_BLOCK = 64, guss_j_num = x_weight / GUSS_BLOCK + 1. equations_guss_byte is a two-dimensional array of uint64_t, where x_weight / GUSS_BLOCK columns store the matrix A and the last column stores the vector B. xor_64() is used to XOR two arrays, and GUARD() is used to check that the function call succeeded.
Using this method takes about 8 seconds to run on my machine. Is there a better way to speed up the calculation?
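The question does not show xor_64(), so the following is only a sketch of what such a row operation could look like, under the assumption that each row is a plain array of 64-bit words. The names `xor_words` and `get_bit` are hypothetical and are not the asker's functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dst[k] ^= src[k] for k in [0, n).
 * Over GF(2), adding two equations is a bitwise XOR of their packed rows,
 * so each 64-bit word processes 64 coefficients at once. */
static void xor_words(uint64_t *dst, const uint64_t *src, size_t n)
{
    for (size_t k = 0; k < n; k++)
        dst[k] ^= src[k];
}

/* Returns coefficient number col (0 or 1) of a packed row. */
static int get_bit(const uint64_t *row, size_t col)
{
    return (int)((row[col / 64] >> (col % 64)) & 1u);
}
```

In an elimination step, a row whose pivot bit is set would be updated with something like `xor_words(row_i, pivot_row, words_per_row)`, where `words_per_row` is likewise an assumed name for the row length in 64-bit words.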

Use Arduino to generate sin wave modulated by gold code

I am trying to use an Arduino to generate a sine wave, with a gold code determining when the wave has a phase shift. However, the output is not what I expected. Sometimes no phase shift occurs for ten consecutive cycles, which should not happen according to our definition of the gold code array. Which part of the code should I fix to solve the problem?
int gold_code[]={1,-1,-1,-1,-1,-1,-1,1,-1,-1,-1,1,-1,-1,1,1,-1,-1, 1,1,-1,1,1,1,-1,1,-1,1,1,-1,1,1,-1,-1,-1,-1,-1,1,1,-1,-1,1,1,-1,1,-1,1,-1,-1,1,1,1,-1,-1,1,1,1,1,-1,1,1,-1, -1, 1, 1,1,-1,-1,1,-1,-1,1,1,1};
void loop()
{
int n = sizeof(gold_code)/sizeof(gold_code[0]);
byte bsin[128];
int it;
unsigned long tm0;
unsigned int tm;
for(int i=0;i<128;i++)
{
bsin[i] = 8 + (int)(0.5 + 7.*sin( (double)i*3.14159265/64.));
}
int count=0;
int count1=0;
Serial.println(n);
tm0 = micros();
while(true)
{
tm = micros() - tm0;
if(tm > 511)
{
tm0 = tm0+512;
tm -= 512;
count++;
//Serial.println(gold_code[count%n]);
}
tm = (tm >> 2) ;
if(gold_code[count%n]==0){
PORTB = bsin[tm];
}
else{
PORTB = 16-bsin[tm];
}
}
}
The variable count eventually overflows and becomes negative. This, in conjunction with the modulo operation is a sign (pun intended) of a disaster waiting to happen.
Use a different method for limiting the value of count to the bounds of your gold_code array.
You should expect a significant increase in frequency after removing the modulo operation, so you may need to add some pacing to your loop.
The pacing in your loop is wrong. Variable count increments 4 times as fast as your phase counter.
Also, @Edward Karak raises a valid point. To do a proper phase shift, you should add (or subtract) from tm, not from the sin value.
[EDIT] I was not quite happy with the way the phase shift is handled. It just doesn't feel right to advance the gold counter at the same pace as the phase counter. So I added a separate timer for that. It advances through the gold_code array every 8 microseconds for now, but you can change it to whatever you're supposed to have.
as in:
unsigned char tm0 = 0;
unsigned char tm0_gold = 0;
const unsigned char N = sizeof(gold_code) / sizeof(gold_code[0]);
unsigned char count = 0; // index into gold_code, always kept below N
unsigned char phase = 0;
for(;;)
{
// pacing for a stable frequency
unsigned char mic = micros() & 0xFF;
// the cast keeps the difference modulo 256, so the test survives wrap-around of mic
if ((unsigned char)(mic - tm0_gold) >= 8)
{
tm0_gold = mic;
// compute and do the phase shift
if (++count >= N)
count -= N;
if (gold_code[count] > 0) // you have == 0 in your code, but that doesn't make sense.
phase += 16; // I can't make any sense of what you are trying to do,
// so I'll just add 45° of phase for each positive value
// you'll probably want to make your own test here
}
if ((unsigned char)(mic - tm0) >= 4)
{
tm0 = mic;
// advance the phase. keep within the LUT bounds
if (++phase >= 128)
phase -= 128;
// output
PORTB = bsin[phase];
}
}
For frequency stability, you will want to move the sine generator to a timer interrupt, after debugging. This will free up your loop() to do some extra control.
I don't quite understand why count increments as fast as the phase counter.
You may want to increment count at a slower pace to reach your goal.
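The overflow problem described in this answer can be reproduced on a desktop compiler. In C99 and later, `%` truncates toward zero, so a negative left operand yields a negative (or zero) remainder, and that remainder then indexes before the start of the array. The sketch below uses hypothetical helper names and a 16-bit counter to mirror the width of int on AVR.

```c
#include <stdint.h>

/* What count % n evaluates to once a 16-bit signed counter is negative. */
static int index_by_modulo(int16_t count, int n)
{
    return count % n;   /* negative for a negative count */
}

/* Bounded increment: the counter never leaves [0, n), so it cannot
 * overflow and is always a valid array index. */
static uint8_t next_index(uint8_t count, uint8_t n)
{
    count++;
    if (count >= n)
        count = 0;
    return count;
}
```

The second helper is the same idea as the `if (++count >= N) count -= N;` test in the answer's loop, written as a function.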

Passing PIN status as a function parameter

I want to write a function for my AVR ATmega328 that debounces switches using state space to confirm a switch press. After finishing it I wanted to generalize my function so that I may reuse it in the future with little work, but that involves passing the pin I want to use as a function parameter, and I just can't get that to work.
This is what I have now:
int debounceSwitch(unsigned char *port, uint8_t mask)
{
int n = 0;
while (1)
{
switch (n)
{
case 0: //NoPush State
_delay_ms(30);
if(!(*port & (1<<mask))){n = n + 1;}
else {return 0;}
break;
case 1: //MaybePush State
_delay_ms(30);
if(!(*port & (1<<mask))){n = n + 1;}
else {n = n - 1;}
break;
case 2: //YesPush State
_delay_ms(30);
if(!(*port & (1<<mask))){return 1;}
else {n = n - 1;}
break;
}
}
}
I have a hunch my issue is with the data type I'm using as the parameter, and I seem to have gotten different answers online.
Any help would be appreciated!
Well, on AVR the ports are special I/O registers that can be accessed using the IN and OUT instructions, unlike ordinary memory, which is accessed with load/store instructions such as LD and ST.
From the port definition you can see that you need to make the port pointer volatile, which the compiler would also have told you with a warning when you tried to pass PORT to the function.
#define PORTB _SFR_IO8(0x05)
which maps to
#define _SFR_IO8(io_addr) _MMIO_BYTE((io_addr) + __SFR_OFFSET)
#define _MMIO_BYTE(mem_addr) (*(volatile uint8_t *)(mem_addr))
Various issues:
The function should be void debounceSwitch(volatile uint8_t* port, uint8_t pin). Pointers to hardware registers must always be volatile. It doesn't make sense to return anything.
Never use signed int literals such as 1 when bit-shifting. It should be 1u << n; with a signed 1, shifting into the sign bit is undefined behavior, which with AVR's 16-bit int happens at n = 15.
Burning away 30ms several times over in a busy-delay is horrible practice. It will lock your CPU at 100% doing nothing meaningful, for an eternity.
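To make the first two points concrete, here is a minimal sketch of a pin test that takes the register by address. The function name `pin_is_low` is hypothetical, and the code only assumes an 8-bit register that reads 0 in the given bit while an active-low switch is pressed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when bit number pin of the register at port reads 0.
 * The pointer is volatile so that every call re-reads the hardware
 * register instead of reusing a cached value. */
static bool pin_is_low(volatile uint8_t *port, uint8_t pin)
{
    return (*port & (1u << pin)) == 0;
}
```

With avr-libc this would presumably be called as `pin_is_low(&PINB, PB0)`; note that the input state of a pin is read from the PINx register, not from PORTx. On a desktop machine the same function can be exercised against an ordinary `volatile uint8_t` variable.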
There are many ways to debounce buttons. The simplest professional form is probably to have a periodic timer running with interrupt every 10ms (should be enough, if in doubt measure debounce spikes of your button with a scope). It will look something like the following pseudo code:
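volatile bool button_pressed = false;
void timer_interrupt (void)
{
static uint8_t prev = 0; // sample taken at the previous interrupt
uint8_t button = port & mask; // port and mask stand for the button's input register and bit mask
button_pressed = button && prev; // pressed only if two consecutive samples agree
prev = button;
}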
This assumes that the buttons use active-high logic.
What I dislike about your implementation is the tight coupling between the PORT/IO handling and the actual filter/debouncing logic. What would you do if the switch input came in as a signal from somewhere else, e.g. over CAN?
Also, it can be handled much easier, if you think in configurable/parameterizable filters. You implement the logic once, and then just create proper configs and pass separate state variables into the filter.
// Structure to keep state
typedef struct {
boolean state;
uint8 cnt;
} deb_state_t;
// Structure to configure the filters debounce values
typedef struct {
uint8 cnt[2]; // [0] = H->L transition, [1] = L->H transition
} deb_config_t;
boolean debounce(boolean in, deb_state_t *state, const deb_config_t *cfg)
{
if (state->state != in) {
state->cnt++;
if (state->cnt >= cfg->cnt[in]) {
state->state = in;
state->cnt = 0;
}
} else {
state->cnt = 0;
}
return state->state;
}
static const deb_config_t debcfg_pin = { {3,4} };
static const deb_config_t debcfg_can = { {2,1} };
int main(void)
{
boolean in1, in2, out1, out2;
deb_state_t debstate_pin = {0, 0};
deb_state_t debstate_can = {0, 0};
while(1) {
// read pin and convert to 0/1
in1 = READ_PORT(PORTx, PINxy); // however this is defined on this architecture
out1 = debounce(in1, &debstate_pin, &debcfg_pin);
// same handling, but input from CAN
in2 = READ_CAN(MSGx, SIGxy); // however this is defined on this architecture
out2 = debounce(in2, &debstate_can, &debcfg_can);
// out1 & out2 are now debounced
}
}

Fisher Yates algorithm gives back same order of numbers in parallel started programs when seeded over the system time

I start several C / C++ programs in parallel, which rely on random numbers. Being fairly new to this topic, I heard that the generator should be seeded from the current time.
Furthermore, I use the Fisher-Yates algorithm to get a list of unique, randomly shuffled values. However, starting the program twice in parallel gives back the same results for both lists.
How can I fix this? Can I use a different, but still reliable seed?
My simple test code for this looks like this:
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>
static int rand_int(int n) {
int limit = RAND_MAX - RAND_MAX % n;
int rnd;
do {
rnd = rand();
}
while (rnd >= limit);
return rnd % n;
}
void shuffle(int *array, int n) {
int i, j, tmp;
for (i = n - 1; i > 0; i--) {
j = rand_int(i + 1);
tmp = array[j];
array[j] = array[i];
array[i] = tmp;
}
}
int main(int argc,char* argv[]){
srand(time(NULL));
int x = 100;
int randvals[100];
for(int i =0; i < x;i++)
randvals[i] = i;
shuffle(randvals,x);
for(int i=0;i < x;i++)
printf("%d %d \n",i,randvals[i]);
}
I used the implementation for the fisher yates algorithm from here:
http://www.sanfoundry.com/c-program-implement-fisher-yates-algorithm-array-shuffling/
I started the programs in parallel like this:
./randomprogram >> a.txt & ./randomprogram >> b.txt
and then compared both text files, which had the same content.
The end application is for data augmentation in the deep learning field. The machine runs Ubuntu 16.04 with C++11.
You're getting the same results due to how you're seeding the RNG:
srand(time(NULL));
The time function returns the time in seconds since the epoch. If two instances of the program start during the same second (which is likely if you start them in quick succession) then both will use the same seed and get the same set of random values.
You need to add more entropy to your seed. A simple way of doing this is to bitwise-XOR the process ID with the time:
srand(time(NULL) ^ getpid());
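Spelled out with the headers it needs, that one-liner might look like the sketch below. The helper names `make_seed` and `seed_rand_per_process` are hypothetical, and `getpid()` is a POSIX call, so this assumes a Unix-like system such as the Ubuntu machine mentioned in the question.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>   /* getpid() (POSIX) */

/* Combines the start time with the process ID. Two processes started
 * in the same second have different PIDs, so their seeds differ. */
static unsigned int make_seed(time_t now, pid_t pid)
{
    return (unsigned int)now ^ (unsigned int)pid;
}

/* Call once at program start, in place of srand(time(NULL)). */
static void seed_rand_per_process(void)
{
    srand(make_seed(time(NULL), getpid()));
}
```

Separating `make_seed` from the call to `srand` also makes it easy to print or log the seed when a run needs to be reproduced.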
As I mentioned in a comment, I like to use a Xorshift* pseudo-random number generator, seeded from /dev/urandom if present, otherwise using POSIX.1 clock_gettime() and getpid() to seed the generator.
It is good enough for most statistical work, but obviously not for any kind of security or cryptographic purposes.
Consider the following xorshift64.h inline implementation:
#ifndef XORSHIFT64_H
#define XORSHIFT64_H
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdint.h>
#include <time.h>
#ifndef SEED_SOURCE
#define SEED_SOURCE "/dev/urandom"
#endif
typedef struct {
uint64_t state[1];
} prng_state;
/* Mixes state by generating 'rounds' pseudorandom numbers,
but does not store them anywhere. This is often done
to ensure a well-mixed state after seeding the generator.
*/
static inline void prng_skip(prng_state *prng, size_t rounds)
{
uint64_t state = prng->state[0];
while (rounds-->0) {
state ^= state >> 12;
state ^= state << 25;
state ^= state >> 27;
}
prng->state[0] = state;
}
/* Returns an uniform pseudorandom number between 0 and 2**64-1, inclusive.
*/
static inline uint64_t prng_u64(prng_state *prng)
{
uint64_t state = prng->state[0];
state ^= state >> 12;
state ^= state << 25;
state ^= state >> 27;
prng->state[0] = state;
return state * UINT64_C(2685821657736338717);
}
/* Returns an uniform pseudorandom number [0, 1), excluding 1.
This carefully avoids the (2**64-1)/2**64 bias on 0,
but assumes that the double type has at most 63 bits of
precision in the mantissa.
*/
static inline double prng_one(prng_state *prng)
{
uint64_t u;
double d;
do {
do {
u = prng_u64(prng);
} while (!u);
d = (double)(u - 1u) / 18446744073709551616.0;
} while (d == 1.0);
return d;
}
/* Returns an uniform pseudorandom number (-1, 1), excluding -1 and +1.
This carefully avoids the (2**64-1)/2**64 bias on 0,
but assumes that the double type has at most 63 bits of
precision in the mantissa.
*/
static inline double prng_delta(prng_state *prng)
{
uint64_t u;
double d;
do {
do {
u = prng_u64(prng);
} while (!u);
d = ((double)(u - 1u) - 9223372036854775808.0) / 9223372036854775808.0;
} while (d == -1.0 || d == 1.0);
return d;
}
/* Returns an uniform pseudorandom integer between min and max, inclusive.
Uses the exclusion method to ensure uniform distribution.
*/
static inline uint64_t prng_range(prng_state *prng, const uint64_t min, const uint64_t max)
{
if (min != max) {
const uint64_t basis = (min < max) ? min : max;
const uint64_t range = (min < max) ? max-min : min-max;
uint64_t mask = range;
uint64_t u;
/* In mask, all bits up to the highest bit set in range must be set. */
mask |= mask >> 1;
mask |= mask >> 2;
mask |= mask >> 4;
mask |= mask >> 8;
mask |= mask >> 16;
mask |= mask >> 32;
/* In all cases, range <= mask < 2*range, so at worst case,
(mask = 2*range-1), this excludes at most 50% of generated values,
on average. */
do {
u = prng_u64(prng) & mask;
} while (u > range);
return u + basis;
} else
return min;
}
static inline void prng_seed(prng_state *prng)
{
#if _POSIX_TIMERS-0 > 0
struct timespec now;
#endif
FILE *src;
/* Try /dev/urandom. */
src = fopen(SEED_SOURCE, "r");
if (src) {
int tries = 16;
while (tries-->0) {
if (fread(prng->state, sizeof prng->state, 1, src) != 1)
break;
if (prng->state[0]) {
fclose(src);
return;
}
}
fclose(src);
}
#if _POSIX_TIMERS-0 > 0
#if _POSIX_MONOTONIC_CLOCK-0 > 0
if (clock_gettime(CLOCK_MONOTONIC, &now) == 0) {
prng->state[0] = (uint64_t)((uint64_t)now.tv_sec * UINT64_C(60834327289))
^ (uint64_t)((uint64_t)now.tv_nsec * UINT64_C(34958268769))
^ (uint64_t)((uint64_t)getpid() * UINT64_C(2772668794075091))
^ (uint64_t)((uint64_t)getppid() * UINT64_C(19455108437));
if (prng->state[0])
return;
} else
#endif
if (clock_gettime(CLOCK_REALTIME, &now) == 0) {
prng->state[0] = (uint64_t)((uint64_t)now.tv_sec * UINT64_C(60834327289))
^ (uint64_t)((uint64_t)now.tv_nsec * UINT64_C(34958268769))
^ (uint64_t)((uint64_t)getpid() * UINT64_C(2772668794075091))
^ (uint64_t)((uint64_t)getppid() * UINT64_C(19455108437));
if (prng->state[0])
return;
}
#endif
prng->state[0] = (uint64_t)((uint64_t)time(NULL) * UINT64_C(60834327289))
^ (uint64_t)((uint64_t)clock() * UINT64_C(34958268769))
^ (uint64_t)((uint64_t)getpid() * UINT64_C(2772668794075091))
^ (uint64_t)((uint64_t)getppid() * UINT64_C(19455108437));
if (!prng->state[0])
prng->state[0] = (uint64_t)UINT64_C(16233055073);
}
#endif /* XORSHIFT64_H */
If it can seed the state from SEED_SOURCE, it is used as-is. Otherwise, if POSIX.1 clock_gettime() is available, it is used (CLOCK_MONOTONIC, if possible; otherwise CLOCK_REALTIME). Otherwise, time (time(NULL)), CPU time spent thus far (clock()), process ID (getpid()), and parent process ID (getppid()) are used to seed the state.
If you wanted the above to also run on Windows, you'd need to add a few #ifndef _WIN32 guards, and either omit the process ID parts, or replace them with something else. (I don't use Windows myself, and cannot test such code, so I omitted such from above.)
The idea is that you can include the above file, and implement other pseudo-random number generators in the same format, and choose between them by simply including different files. (You can include multiple files, but you'll need to do some ugly #define prng_state prng_somename_state, #include "somename.h", #undef prng_state hacking to ensure unique names for each.)
Here is an example of how to use the above:
#include <stdlib.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include "xorshift64.h"
int main(void)
{
prng_state prng1, prng2;
prng_seed(&prng1);
prng_seed(&prng2);
printf("Seed 1 = 0x%016" PRIx64 "\n", prng1.state[0]);
printf("Seed 2 = 0x%016" PRIx64 "\n", prng2.state[0]);
printf("After skipping 16 rounds:\n");
prng_skip(&prng1, 16);
prng_skip(&prng2, 16);
printf("Seed 1 = 0x%016" PRIx64 "\n", prng1.state[0]);
printf("Seed 2 = 0x%016" PRIx64 "\n", prng2.state[0]);
return EXIT_SUCCESS;
}
Obviously, initializing two PRNGs like this is problematic in the fallback case, because it basically relies on clock() yielding different values for consecutive calls (so expects each call to take at least 1 millisecond of CPU time).
However, even a small change in the seeds thus generated is sufficient to yield very different sequences. I like to generate and discard (skip) a number of initial values to ensure the generator state is well mixed:
Seed 1 = 0x8a62585b6e71f915
Seed 2 = 0x8a6259a84464e15f
After skipping 16 rounds:
Seed 1 = 0x9895f664c83ad25e
Seed 2 = 0xa3fd7359dd150e83
The header also implements 0 <= prng_u64() < 2**64, 0 <= prng_one() < 1, -1 < prng_delta() < +1, and min <= prng_range(,min,max) <= max, which should be uniform.
I use the above Xorshift64* variant for tasks where a lot of quite uniform pseudorandom numbers are needed, so the functions also tend to use the fastest methods I know of (such as a maximum 50% average exclusion rate rather than a 64-bit modulus operation).
Additionally, if you require repeatability, you can simply save a randomly-seeded prng_state structure (a single uint64_t), and load it later, to reproduce the exact same sequence. Just remember to do the skipping (generate-and-discard) only after randomly seeding, not after loading a seed from a file.
Converting rather copious comments into an answer.
If two programs are started in the same second, they'll both have the same sequence of random numbers.
Consider whether you need to use a better random number generator than the rand()/srand() duo — that is usually only barely random (better than nothing, but not by a large margin). Do NOT use them for cryptography.
I asked about platform; you responded Ubuntu 16.04 LTS.
Use /dev/urandom or /dev/random to get some random bytes for the seed.
On many Unix-like platforms, there's a device /dev/random — on Linux, there's also a slightly lower-quality device /dev/urandom which won't block whereas /dev/random might. Systems such as macOS (BSD) have /dev/urandom as a synonym for /dev/random for Linux compatibility. You can open it and read 4 bytes (or the relevant number of bytes) of random data, and use that as a seed for the PRNG of your choice.
I often use the drand48() set of functions because they are in POSIX and were in System V Unix. They're usually adequate for my needs.
Look at the manuals across platforms; there are often other random number generators. C++11 provides high-quality PRNG — the header <random> has a number of different ones, such as the MT 19937 (Mersenne Twister). MacOS Sierra (BSD) has random(3) and arc4random(3) as alternatives to rand() – as well as drand48() et al.
Another possibility on Linux is simply to keep a connection to /dev/urandom open, reading more bytes when you need them. However, that gives up any chance of replaying a random sequence. The PRNG systems have the merit of allowing you to replay the same sequence again by recording and setting the random seed that you use. By default, grab a seed from /dev/urandom, but if the user requests it, take a seed from the command line, and report the seed used (at least on request).
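A minimal sketch of the "read a few bytes and use them as the seed" approach could look as follows. The function name `read_seed` and its return convention are assumptions made for this example; only standard `<stdio.h>` calls are used.

```c
#include <stdio.h>

/* Reads sizeof(unsigned int) bytes from path into *seed.
 * Returns 0 on success, -1 if the source cannot be opened or read. */
static int read_seed(const char *path, unsigned int *seed)
{
    FILE *src = fopen(path, "rb");
    if (src == NULL)
        return -1;
    size_t got = fread(seed, sizeof *seed, 1, src);
    fclose(src);
    return (got == 1) ? 0 : -1;
}
```

A program would typically try `read_seed("/dev/urandom", &seed)` first, fall back to a time-based seed if that fails, and let a seed given on the command line override both so that a run can be replayed.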

C: fastest way to evaluate a function on a finite set of small integer values by using a lookup table?

I am currently working on a project where I would like to optimize some numerical computation in Python by calling C.
In short, I need to compute the value of y[i] = f(x[i]) for each element in a huge array x (typically 10^9 entries or more). Here, x[i] is an integer between -10 and 10 and f is a function that takes x[i] and returns a double. My issue is that f takes a very long time to evaluate in a way that is numerically stable.
To speed things up, I would like to just hard code all 2*10 + 1 possible values of f(x[i]) into constant array such as:
double table_of_values[] = {f(-10), ...., f(10)};
And then just evaluate f using a "lookup table" approach as follows:
for (i = 0; i < N; i++) {
y[i] = table_of_values[x[i] + 10]; //instead of y[i] = f(x[i]); x[i] = -10 maps to index 0
}
Since I am not really well-versed at writing optimized code in C, I am wondering:
Specifically, since x is really large, I'm wondering whether it's worth doing second-order optimizations when evaluating the loop (e.g. by sorting x beforehand, or by finding a smart way to deal with the negative indices, aside from just doing [x[i] + 10])?
Say x[i] were not between -10 and 10, but between -20 and 20. In this case, I could still use the same approach, but would need to hard code the lookup table manually. Is there a way to generate the look-up table dynamically in the code so that I make use of the same approach and allow for x[i] to belong to a variable range?
It's fairly easy to generate such a table with dynamic range values.
Here's a simple, single table method:
#include <stdlib.h>
#define VARIABLE_USED(_sym) \
do { \
if (1) \
break; \
if (!! _sym) \
break; \
} while (0)
double *table_of_values;
int table_bias;
// use the smallest of these that can contain the values the x array may have
#if 0
typedef int xval_t;
#endif
#if 0
typedef short xval_t;
#endif
#if 1
typedef signed char xval_t; // plain char may be unsigned, which would break negative x values
#endif
#define XLEN (1 << 9)
xval_t *x;
// fslow -- your original function
double
fslow(int i)
{
return 1; // whatever
}
// ftablegen -- generate variable table
void
ftablegen(double (*f)(int),int lo,int hi)
{
int len;
table_bias = -lo;
len = hi - lo;
len += 1;
// NOTE: you can do free(table_of_values) when no longer needed
table_of_values = malloc(sizeof(double) * len);
for (int i = lo; i <= hi; ++i)
table_of_values[i + table_bias] = f(i);
}
// fcached -- retrieve cached table data
double
fcached(int i)
{
return table_of_values[i + table_bias];
}
// fripper -- access x and table arrays
void
fripper(xval_t *x)
{
double *tptr;
int bias;
double val;
// ensure these go into registers to prevent needless extra memory fetches
tptr = table_of_values;
bias = table_bias;
for (int i = 0; i < XLEN; ++i) {
val = tptr[x[i] + bias];
// do stuff with val
VARIABLE_USED(val);
}
}
int
main(void)
{
ftablegen(fslow,-10,10);
x = calloc(XLEN, sizeof(xval_t)); // zero-filled, so every x[i] is a valid table index
fripper(x);
return 0;
}
Here's a slightly more complex way that allows many similar tables to be generated:
#include <stdlib.h>
#define VARIABLE_USED(_sym) \
do { \
if (1) \
break; \
if (!! _sym) \
break; \
} while (0)
// use the smallest of these that can contain the values the x array may have
#if 0
typedef int xval_t;
#endif
#if 1
typedef short xval_t;
#endif
#if 0
typedef signed char xval_t; // plain char may be unsigned, which would break negative x values
#endif
#define XLEN (1 << 9)
xval_t *x;
struct table {
int tbl_lo; // lowest index
int tbl_hi; // highest index
int tbl_bias; // bias for index
double *tbl_data; // cached data
};
struct table ftable1;
struct table ftable2;
double
fslow(int i)
{
return 1; // whatever
}
double
f2(int i)
{
return 2; // whatever
}
// ftablegen -- generate variable table
void
ftablegen(double (*f)(int),int lo,int hi,struct table *tbl)
{
int len;
tbl->tbl_lo = lo;
tbl->tbl_hi = hi;
tbl->tbl_bias = -lo;
len = hi - lo;
len += 1;
// NOTE: you can free tbl_data when no longer needed
tbl->tbl_data = malloc(sizeof(double) * len);
for (int i = lo; i <= hi; ++i)
tbl->tbl_data[i + tbl->tbl_bias] = f(i); // call the function passed in, not fslow
}
// fcached -- retrieve cached table data
double
fcached(struct table *tbl,int i)
{
return tbl->tbl_data[i + tbl->tbl_bias];
}
// fripper -- access x and table arrays
void
fripper(xval_t *x,struct table *tbl)
{
double *tptr;
int bias;
double val;
// ensure these go into registers to prevent needless extra memory fetches
tptr = tbl->tbl_data;
bias = tbl->tbl_bias;
for (int i = 0; i < XLEN; ++i) {
val = tptr[x[i] + bias];
// do stuff with val
VARIABLE_USED(val);
}
}
int
main(void)
{
x = calloc(XLEN, sizeof(xval_t)); // zero-filled, so every x[i] is a valid table index
// NOTE: we could use 'char' for xval_t ...
ftablegen(fslow,-37,62,&ftable1);
fripper(x,&ftable1);
// ... but, this forces us to use a 'short' for xval_t
ftablegen(f2,-99,307,&ftable2);
return 0;
}
Notes:
fcached could/should be an inline function for speed. Notice that once the table is calculated once, fcached(x[i]) is quite fast. The index offset issue you mentioned [solved by the "bias"] is trivially small in calculation time.
While x may be a large array, the cached array for f() values is fairly small (e.g. -10 to 10). Even if it were (e.g.) -100 to 100, this is still about 200 elements. This small cached array will [probably] stay in the hardware memory cache, so access will remain quite fast.
Thus, sorting x to optimize H/W cache performance of the lookup table will have little to no [measurable] effect.
The access pattern to x is independent. You'll get best performance if you access x in a linear manner (e.g. for (i = 0; i < 999999999; ++i) x[i]). If you access it in a semi-random fashion, it will put a strain on the H/W cache logic and its ability to keep the needed/wanted x values "cache hot"
Even with linear access, because x is so large, by the time you get to the end, the first elements will have been evicted from the H/W cache (e.g. most CPU caches are on the order of a few megabytes)
However, if x only has values in a limited range, changing the type from int x[...] to short x[...] or even char x[...] cuts the size by a factor of 2x [or 4x]. And, that can have a measurable improvement on the performance.
Update: I've added an fripper function to show the fastest way [that I know of] to access the table and x arrays in a loop. I've also added a typedef named xval_t to allow the x array to consume less space (i.e. will have better H/W cache performance).
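The notes above can be condensed into a small sketch that avoids globals. It is an illustration under assumptions rather than a replacement for the code in this answer: `slow_square` is a hypothetical stand-in for f, and `build_table` and `apply_table` are names made up for this example. The input array uses `signed char` so that each x value takes one byte and negative values are represented on every platform.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical slow function standing in for f. */
static double slow_square(int v)
{
    return (double)v * v;
}

/* Allocates and fills a table holding f(lo)..f(hi); the entry for a
 * value v lives at table[v - lo]. Returns NULL if hi < lo or if the
 * allocation fails. The caller frees the table. */
static double *build_table(double (*f)(int), int lo, int hi)
{
    if (hi < lo)
        return NULL;
    size_t len = (size_t)(hi - lo) + 1;
    double *table = malloc(len * sizeof *table);
    if (table == NULL)
        return NULL;
    for (int v = lo; v <= hi; v++)
        table[v - lo] = f(v);
    return table;
}

/* y[i] = table[x[i] - lo] for all i, walking x linearly.
 * Every x[i] must lie in the range the table was built for. */
static void apply_table(const double *table, int lo,
                        const signed char *x, double *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = table[x[i] - lo];
}
```

Changing the range from [-10, 10] to [-20, 20] then only means passing different `lo` and `hi` arguments to `build_table`.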
UPDATE #2:
Per your comments ...
fcached was coded [mostly] to illustrate simple/single access. But, it was not used in the final example.
The exact requirements for inline have varied over the years (e.g. it was once extern inline). Best use now: static inline. However, if using C++, it may be different yet again. There are entire pages devoted to this. The reason is that the behavior depends on compilation across different .c files and on whether optimization is on or off. Also, consider using a gcc extension. So, to force inline all the time:
__attribute__((__always_inline__)) static inline
fripper is the fastest because it avoids refetching the globals table_of_values and table_bias on each loop iteration. In fripper, they are copied into locals, so the compiler's optimizer can keep them in registers. See my answer: Is accessing statically or dynamically allocated memory faster? as to why.
However, I coded an fripper variant that uses fcached and the disassembled code was the same [and optimal]. So, we can disregard that ... Or, can we? Sometimes, disassembling the code is a good cross check and the only way to know for sure. Just an extra item when creating fully optimized C code. There are many options one can give to the compiler regarding code generation, so sometimes it's just trial and error.
Because benchmarking is important, I threw in my routines for timestamping (FYI, [AFAIK] the underlying clock_gettime call is the basis for python's time.clock()).
So, here's the updated version:
#include <stdlib.h>
#include <time.h>
typedef long long s64;
#define SUPER_INLINE \
__attribute__((__always_inline__)) static inline
#define VARIABLE_USED(_sym) \
do { \
if (1) \
break; \
if (!! _sym) \
break; \
} while (0)
#define TVSEC 1000000000LL // nanoseconds in a second
#define TVSECF 1e9 // nanoseconds in a second
// tvget -- get high resolution time of day
// RETURNS: absolute nanoseconds
s64
tvget(void)
{
struct timespec ts;
s64 nsec;
clock_gettime(CLOCK_REALTIME,&ts);
nsec = ts.tv_sec;
nsec *= TVSEC;
nsec += ts.tv_nsec;
return nsec;
}
// tvgetf -- get high resolution time of day
// RETURNS: fractional seconds
double
tvgetf(void)
{
struct timespec ts;
double sec;
clock_gettime(CLOCK_REALTIME,&ts);
sec = ts.tv_nsec;
sec /= TVSECF;
sec += ts.tv_sec;
return sec;
}
double *table_of_values;
int table_bias;
double *dummyptr;
// use the smallest of these that can contain the values the x array may have
#if 0
typedef int xval_t;
#endif
#if 0
typedef short xval_t;
#endif
#if 1
typedef signed char xval_t; // plain char may be unsigned on some targets
#endif
#define XLEN (1 << 9)
xval_t *x;
// fslow -- your original function
double
fslow(int i)
{
return 1; // whatever
}
// ftablegen -- generate variable table
void
ftablegen(double (*f)(int),int lo,int hi)
{
int len;
table_bias = -lo;
len = hi - lo;
len += 1;
// NOTE: you can do free(table_of_values) when no longer needed
table_of_values = malloc(sizeof(double) * len);
for (int i = lo; i <= hi; ++i)
table_of_values[i + table_bias] = f(i);
}
// fcached -- retrieve cached table data
SUPER_INLINE double
fcached(int i)
{
return table_of_values[i + table_bias];
}
// fripper_fcached -- access x and table arrays
void
fripper_fcached(xval_t *x)
{
double val;
double *dptr;
dptr = dummyptr;
for (int i = 0; i < XLEN; ++i) {
val = fcached(x[i]);
// do stuff with val
dptr[i] = val;
}
}
// fripper -- access x and table arrays
void
fripper(xval_t *x)
{
double *tptr;
int bias;
double val;
double *dptr;
// ensure these go into registers to prevent needless extra memory fetches
tptr = table_of_values;
bias = table_bias;
dptr = dummyptr;
for (int i = 0; i < XLEN; ++i) {
val = tptr[x[i] + bias];
// do stuff with val
dptr[i] = val;
}
}
int
main(void)
{
ftablegen(fslow,-10,10);
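x = malloc(sizeof(xval_t) * XLEN);
// fill x with sample values inside the table range (-10 to 10)
for (int i = 0; i < XLEN; ++i)
x[i] = (xval_t) ((i % 21) - 10);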
dummyptr = malloc(sizeof(double) * XLEN);
fripper(x);
fripper_fcached(x);
return 0;
}
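The timestamp routines in the listing are defined but never called from main. Here is a minimal sketch of how one pass over an array might be timed. The names now_sec, sum_bytes and time_sum are hypothetical, the summing loop is only a stand-in workload, and CLOCK_MONOTONIC is swapped in for CLOCK_REALTIME because it is not affected by wall-clock adjustments.

```c
#define _POSIX_C_SOURCE 199309L         // expose clock_gettime in strict -std= modes
#include <assert.h>
#include <stddef.h>
#include <time.h>

// now_sec -- monotonic time in fractional seconds (hypothetical helper)
static double
now_sec(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);

    return (double) ts.tv_sec + (double) ts.tv_nsec / 1e9;
}

// sum_bytes -- stand-in workload: one linear pass over the array
static long
sum_bytes(const signed char *x, size_t n)
{
    long sum = 0;

    for (size_t i = 0; i < n; ++i)
        sum += x[i];

    return sum;
}

// time_sum -- elapsed seconds for one pass over x
// the sum is stored through out so the compiler cannot discard the loop
static double
time_sum(const signed char *x, size_t n, long *out)
{
    double t0;
    double t1;

    assert(x != NULL && out != NULL);

    t0 = now_sec();
    *out = sum_bytes(x, n);
    t1 = now_sec();

    return t1 - t0;
}
```

A single short pass is usually too quick to measure reliably, so in practice one would repeat the call many times and take the minimum or the average of the elapsed times.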
You can use negative indices with a pointer into an array. The C standard defines p[i] as *(p + i), so this is well defined as long as the resulting element still lies inside the original array. If you have the following code:
int arr[] = {1, 2 ,3, 4, 5};
int* lookupTable = arr + 3;
printf("%i", lookupTable[-2]);
it will print out 2.
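This works because an array expression decays to a pointer to its first element, and indexing is just pointer arithmetic. If the pointer does not point to the beginning of the array, a negative index reaches the items before the pointer.
Keep in mind though that if you have to malloc() the memory for arr, you must not call free(lookupTable). free() needs the exact pointer that malloc() returned, so keep arr around and call free(arr).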
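A minimal sketch of the malloc() case, assuming a table that covers the indices lo to hi with lo <= 0 <= hi (so the shifted pointer stays inside the allocation). The type biased_table and its two functions are hypothetical names, and the fill values are placeholders for the real function.

```c
#include <assert.h>
#include <stdlib.h>

// biased_table -- hypothetical holder for a table indexed from lo to hi
typedef struct {
    double *base;                       // pointer returned by malloc, used for free
    double *mid;                        // base shifted so that mid[lo..hi] is valid
    int lo;
    int hi;
} biased_table;

// biased_table_init -- allocate and fill the table
// RETURNS: 0 on success, -1 if the allocation failed
static int
biased_table_init(biased_table *t, int lo, int hi)
{
    // mid must point inside the allocation, so the range has to include 0
    assert(t != NULL && lo <= 0 && hi >= 0);

    t->base = malloc(sizeof(double) * (size_t) (hi - lo + 1));
    if (t->base == NULL)
        return -1;

    t->mid = t->base - lo;              // lo <= 0, so this moves forward
    t->lo = lo;
    t->hi = hi;

    for (int i = lo; i <= hi; ++i)
        t->mid[i] = (double) i;         // placeholder for the real f(i)

    return 0;
}

// biased_table_free -- release the table through the original pointer
static void
biased_table_free(biased_table *t)
{
    free(t->base);                      // never free(t->mid)
    t->base = NULL;
    t->mid = NULL;
}
```

After biased_table_init(&t, -10, 10), lookups read t.mid[-10] through t.mid[10] directly, with no bias added at the call site.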
I really think Craig Estey is on the right track for building your table in an automatic way. I just want to add a note for looking up the table.
If you know that you will run the code on a Haswell machine (with AVX2), you should make sure your code uses VGATHERDPD, which is available through the _mm256_i32gather_pd intrinsic. If you do that, your table lookups will fly! (You can even detect AVX2 at run time with cpuid, but that's another story.)
EDIT:
Let me elaborate with some code:
#include <stdint.h>
#include <stdio.h>
#include <immintrin.h>
/* the gather itself does not require this alignment; it is optional */
double table[8] __attribute__((aligned(16)))= { 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 };
int main()
{
int32_t i[4] = { 0,2,4,6 };
/* unaligned load: i has no guaranteed 16-byte alignment */
__m128i index = _mm_loadu_si128( (const __m128i*) i );
__m256d result = _mm256_i32gather_pd( table, index, 8 );
double* f = (double*)&result;
printf("%f %f %f %f\n", f[0], f[1], f[2], f[3]);
return 0;
}
Compile and run:
$ gcc --std=gnu99 -mavx2 gathertest.c -o gathertest && ./gathertest
0.100000 0.300000 0.500000 0.700000
This is fast!
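As for detecting AVX2 at run time, here is a minimal sketch that uses the GCC/Clang builtins __builtin_cpu_init and __builtin_cpu_supports rather than issuing cpuid by hand. The helper names have_avx2 and pick_lookup_path are hypothetical, and on compilers or targets without these builtins the sketch simply reports no AVX2.

```c
#include <stdio.h>

// have_avx2 -- RETURNS: 1 if the running CPU reports AVX2, else 0
static int
have_avx2(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? 1 : 0;
#else
    return 0;                           // builtins unavailable: assume no AVX2
#endif
}

// pick_lookup_path -- name of the lookup variant a program might dispatch to
static const char *
pick_lookup_path(void)
{
    return have_avx2() ? "gather" : "scalar";
}
```

A program would call have_avx2() once at startup and store a function pointer to either the gather-based or the plain scalar lookup loop, so the check is not repeated per lookup.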
