Why is my transposition table so slow? (4x slowdown) - artificial-intelligence

I am using a transposition table (TT) to try to speed up an alpha-beta minimax search for a board game. To my surprise, when I turn the transposition table off I get a 4x speedup! Transposition tables are supposed to speed things up, right?
One thing I noticed is that I get a cache hit only 10% of the time. That is not great, but I shouldn't need 75% for the TT to be useful.
The idea is that the TT is a very fast cache lookup. I am using the 2Big replacement scheme. Each TT entry stores the lower and upper bounds, the search depth, the move, and the amount of effort (size) needed to calculate the best response. LockEntry is just an alias for unsigned long long.
My code is below:
ctypedef struct CacheEntry:
    LockEntry lock
    minimax_score lower_bound
    minimax_score upper_bound
    int depth
    Move move
    long long size

ctypedef struct BigEntry:
    CacheEntry small
    CacheEntry large

cdef void set_entry(CacheEntry entry):
    """set an entry into the cache"""
    global cache
    cdef int key
    key = entry.lock % PRIME
    if entry.size > cache[key].large.size or entry.lock == cache[key].large.lock:
        cache[key].large = entry
    else:
        cache[key].small = entry
    return

cdef int get_entry(CacheEntry *entry, LockEntry lock):
    """retrieve an entry from cache"""
    global cache
    global cache_queries
    global large_hits
    global small_hits
    cdef int key
    cache_queries += 1
    key = lock % PRIME
    if cache[key].large.lock == lock:
        large_hits += 1
        entry[0] = cache[key].large
        return 1  # success
    elif cache[key].small.lock == lock:
        small_hits += 1
        entry[0] = cache[key].small
        return 1  # success
    return 0  # failure
The set_entry function takes a struct CacheEntry type and puts it into the TT (uncreatively called cache), while the get_entry function takes a pointer to a CacheEntry and updates the value for it (I've tried returning the CacheEntry, but no difference in speed occurs.). It also returns a 1 for a successful cache-hit, and 0 for unsuccessful.
So, basically, I am out of ideas on how to use this properly. I am surprised it is slow, and when I compile with cython -a the annotated output is all white, so it seems that I am cythonizing the code properly.
Help appreciated.

Related

Program is taking way longer than expected, is it running properly?

not sure this is the right place...
I am running brute-force code to solve an asymmetric travelling salesman problem.
It has 17 cities, one of which is fixed, so there are 16! (more than 20 trillion) permutations to check.
unsigned long TotalCost(unsigned long *Matrix, short *Path, short Dimention)
{
    unsigned long result = 0;
    unsigned long Cost;
    int iD;
    for (iD = 1; iD <= Dimention; iD++)
    {
        Cost = Matrix[Dimention*Path[iD - 1] + Path[iD]];
        if (Cost > 0)
        {
            result = result + Cost;
        }
        else
        {
            return 4099999999;
        }
    }
    return result;
}

void swapP(short *x, short *y)
{
    short temp;
    temp = *x;
    *x = *y;
    *y = temp;
}

void permute(unsigned long *Matrix, short Dimention, unsigned long *CurrentMin, short *PerPath, short **MinPath, short l, short r)
{
    short i;
    unsigned long CCost;
    if (l == r)
    {
        CCost = TotalCost(Matrix, PerPath, Dimention);
        if (CCost < (*CurrentMin))
        {
            for (i = 0; i <= Dimention; i++)
            {
                (*MinPath)[i] = PerPath[i];
            }
            (*CurrentMin) = CCost;
            PrintResults(Matrix, PerPath, Dimention, 2);
        }
    }
    else
    {
        for (i = l; i <= r; i++)
        {
            swapP((PerPath+l), (PerPath+i));
            permute(Matrix, Dimention, CurrentMin, PerPath, MinPath, l+1, r);
            swapP((PerPath+l), (PerPath+i)); //backtrack
        }
    }
}

int main (void)
{
    // The omitted code here allocates memory for the matrix, HcG and HrGR arrays
    // and also initializes them
    permute(Matrix, Dimention, &TotalMin, HcG, &HrGR, 1, Dimention - 1);
}
I tested the above code on an instance of five cities and it returned successfully, as expected, in a few milliseconds.
For the 17 cities, I initially thought it would take a few hours to solve, and then a couple of days. It has been running for 4 days now and I'm beginning to suspect that the program, for some reason, is no longer running, as if it were frozen.
I'm not getting any errors, but it's taking far longer than I expected. The program prints the total cost and the path every time it finds a path with a lower cost, but it stopped printing about half an hour after it started.
I am using Ubuntu 18.04 and the program is "running" in a terminal. The system monitor reports Memory: N/A; does that mean it's not using memory?
It also reports CPU: 6%; can I increase it?
Is there a way to check if it is running properly? Or estimate how long it will take to finish?
I'm so unsure about its integrity that I think I should stop the process, but at the same time I really want to see the results.
I only glanced through your code, but I have done things like this many times in the past. My general approach for this is as follows (although it adds a small cost) ...
Add a print statement (perhaps with a mod counter) such that you would expect a line to come out approximately once every 2 to 3 minutes. Include some information in the print so that you can tell how far along your simulation has progressed. (Note: among that information you probably want to print any variables that, if they get trashed, could cause infinite looping, for example "Dimention", which you have misspelled, by the way.)
I would personally not have jumped from 5 cities to 17. Rather 5 to 7, then maybe 9 or 10 ... just to confirm all is working and to get an idea how much time increase to expect with your particular CPU.
Finally, in the situation you are in now, is it possible to get another window and run "ps" to see if your job is getting any CPU time? If not, my approach would be to kill it and implement as I described above. HTH.
Note also, the code you have omitted (memory allocation, etc) is critical: the code as written has the potential to go out of bounds, and possibly not crash (if only slightly out of bounds) but rather end up trashing variables (depending on memory layout) that could (as mentioned above) create an infinite or near-infinite loop.

Need an algorithm to detect large spikes in oscillating data

I am parsing through data on an sd card one large chunk at a time on a microcontroller. It is accelerometer data so it is oscillating constantly. At certain points huge oscillations occur (shown in graph). I need an algorithm to be able to detect these large oscillations, more so, determine the range of data that contains this spike.
I have some example data:
This is the overall graph, there is only one spike of interest, the first one.
Here it is zoomed in a bit
As you can see it is a large oscillation that produces the spike.
So any algorithm that can scan through a data set and determine the portion of data that contains a spike relative to some threshold would be great. This data set is about 50,000 samples, each sample is 32 bits long. I have ample RAM to be able to hold this much data.
Thanks!
For the following signal:
If you take the absolute value of the differential between two consecutive samples, you get:
That is not quite good enough to unambiguously distinguish the spike from the minor "unsustained" disturbances. But if you then take a simple moving sum (a leaky integrator) of the abs-differentials, the sustained disturbance stands out clearly. Here a window width of 4 diff-samples was used:
The moving average introduces a lag or phase shift, which in cases where the data is stored and processing is not real-time can easily be compensated for by subtracting half the window width from the timing:
For real-time processing if the lag is critical a more sophisticated IIR filter might be appropriate. Anyhow a clear threshold can be selected from this data.
In code for the above data set:
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <stdlib.h>

static int32_t dataset[] = { 0,0,0,0,0,3,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,
                             0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,3,0,0,0,0,0,
                             0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,
                             0,-10,-15,-5,20,25,50,-10,-20,-30,0,30,5,-5,
                             0,0,5,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,
                             0,0,0,0,0,0,0,0,0,1,0,0,0,0,6,0,0,0,0,0,0,0} ;

#define DATA_LEN (sizeof(dataset)/sizeof(*dataset))
#define WINDOW_WIDTH 4
#define THRESHOLD 15

int main( void )
{
    uint32_t window[WINDOW_WIDTH] = {0} ;  // circular buffer of the last abs-differentials
    int window_index = 0 ;
    uint32_t window_sum = 0 ;              // same type as the window entries
    bool spike = false ;

    for( int s = 1; s < (int)DATA_LEN ; s++ )
    {
        uint32_t diff = (uint32_t)abs( dataset[s] - dataset[s-1] ) ;

        // Replace the oldest abs-differential and update the moving sum
        window_sum -= window[window_index] ;
        window[window_index] = diff ;
        window_index++ ;
        window_index %= WINDOW_WIDTH ;
        window_sum += diff ;

        if( !spike && window_sum >= THRESHOLD )
        {
            spike = true ;
            printf( "Spike START # %d\n", s - WINDOW_WIDTH / 2 ) ;
        }
        else if( spike && window_sum < THRESHOLD )
        {
            spike = false ;
            printf( "Spike END # %d\n", s - WINDOW_WIDTH / 2 ) ;
        }
    }
    return 0;
}
The output is:
Spike START # 66
Spike END # 82
https://onlinegdb.com/ryEw69jJH
Comparing the original data with the detection threshold gives:
For your real data, you will need to select a suitable window width and threshold to get the desired result, both of which will depend on the bandwidth and amplitude of the disturbances you wish to detect.
Also you may need to guard against arithmetic overflow if your samples are of sufficient magnitude. They need to be less than 2^32 / window-width to guarantee no overflow in the integrator. Alternatively you could use floating-point or uint64_t for the window type, or add code to deal with saturation.
You could look at statistical analysis. Calculating the standard deviation over the data set and then checking when your data goes out of bound.
You can choose to do this in two ways: either you use a running average over a fixed (relatively small) number of samples, or you take the average over the whole data set. As I see multiple spikes in your set, I would suggest the first. This way you can stop processing (and later continue) every time you find a spike.
For your purpose you do not really need to calculate the standard deviation sigma. You could leave it at sigma squared (the variance). This gives you a slight performance gain because you do not have to calculate the square root.
Some pseudo code:
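// The data set.
int x[N];
// The number of samples in your mean and std calculation (M <= N).
int M;
// Mean over the M samples ending at index i.
mean_i = sum( x[j] for j in i-M+1 .. i ) / M;
// Sigma at index i over the previous M samples.
sigma_i = sqrt( sum( pow(x[j] - mean_i, 2) for j in i-M+1 .. i ) / M );
// Or sigma squared, which avoids the square root.
sigma_squared_i = sum( pow(x[j] - mean_i, 2) for j in i-M+1 .. i ) / M;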
The disadvantage of this method is that you need to set a threshold for the value of sigma at which you trigger. However, it is fairly safe to say that setting the threshold at 4 or 5 times your average sigma will give you a usable system.
Managed to get a working algorithm. Basically, determine the average difference between data points. If my data starts to exceed some multiple of that value consecutively then most likely there is a spike occurring.

Matlab: filtering large array elements, quicker alternative to logical indexing?

I have a large, three-dimensional dataset of floats, roughly 500 million elements (3000 x 300 x 600).
I want to make elements that are below or above certain thresholds zero. Logical indexing can do this, e.g.
cut_in = 0.5
cut_out = 6
Hs(Hs<cut_in) = 0 ;
Hs(Hs>cut_out) = 0 ;
The problem is that this is painfully slow for me, what with the large data size. The above code takes 240 seconds to run on my computer. Is there a quicker way I can do this?
Many thanks
As @rayryeng and @AndrasDeak point out in comments to your question, logical indexing is usually fastest, though your runtimes suggest that you are probably limited by memory (and being forced to swap onto disk) rather than by the actual speed of the indexing.
One surprising alternative that can win out in this case is for loops. This is because logical indexing requires three passes through the array (once for each inequality test, and once to change the data), whereas a for loop only requires one pass through the array.
Benchmarks
So I ran these tests (and accidentally doubled the array size) on a machine with 8 GB memory:
>> A = randn(6000,300,600);
>> cut_in = -1;
>> cut_out = 1;
Using a for loop:
>> tic; for i=1:numel(A), if A(i)<cut_in || A(i)>cut_out, A(i)=0; end; end; toc
Elapsed time is 597.384884 seconds.
Using logical indexing:
>> tic; A(A<cut_in | A>cut_out) = 0; toc
Elapsed time is 1619.105332 seconds.
And just for laughs (had some time on my hands waiting for the benchmarks to run), here is a compiled for loop:
#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    double *A = mxGetPr(prhs[0]);
    size_t N = mxGetNumberOfElements(prhs[0]);
    double cut_in = *mxGetPr(prhs[1]);
    double cut_out = *mxGetPr(prhs[2]);

    // You're not supposed to do in-place operations! Don't do this!
    // size_t index avoids a signed/unsigned comparison against N.
    for (size_t ii = 0; ii < N; ii++) {
        if ((A[ii] < cut_in) || (A[ii] > cut_out))
            A[ii] = 0;
    }
}
And benchmarked:
>> mex -v CXXOPTIMFLAGS="-O3 -DNDEBUG" -largeArrayDims apply_threshold.cpp
>> tic; apply_threshold(A,cut_in,cut_out); toc
Elapsed time is 529.994643 seconds
One thing to keep in mind is that we are operating in a regime where accessing the swap space is the main bottleneck in performance, and so benchmark results can vary, even within the same machine, depending on what's currently in main memory (vs. what needs to be swapped in) and what kind of background processes are running.

Hash table implementation

I just bought a book "C Interfaces and Implementations".
In chapter one, it implements an "Atom" structure. The sample code is as follows:
#define NELEMS(x) ((sizeof (x))/(sizeof ((x)[0])))
static struct atom {
struct atom *link;
int len;
char *str;
} *buckets[2048];
static unsigned long scatter[] = {
2078917053, 143302914, 1027100827, 1953210302, 755253631, 2002600785,
1405390230, 45248011, 1099951567, 433832350, 2018585307, 438263339,
813528929, 1703199216, 618906479, 573714703, 766270699, 275680090,
1510320440, 1583583926, 1723401032, 1965443329, 1098183682, 1636505764,
980071615, 1011597961, 643279273, 1315461275, 157584038, 1069844923,
471560540, 89017443, 1213147837, 1498661368, 2042227746, 1968401469,
1353778505, 1300134328, 2013649480, 306246424, 1733966678, 1884751139,
744509763, 400011959, 1440466707, 1363416242, 973726663, 59253759,
1639096332, 336563455, 1642837685, 1215013716, 154523136, 593537720,
704035832, 1134594751, 1605135681, 1347315106, 302572379, 1762719719,
269676381, 774132919, 1851737163, 1482824219, 125310639, 1746481261,
1303742040, 1479089144, 899131941, 1169907872, 1785335569, 485614972,
907175364, 382361684, 885626931, 200158423, 1745777927, 1859353594,
259412182, 1237390611, 48433401, 1902249868, 304920680, 202956538,
348303940, 1008956512, 1337551289, 1953439621, 208787970, 1640123668,
1568675693, 478464352, 266772940, 1272929208, 1961288571, 392083579,
871926821, 1117546963, 1871172724, 1771058762, 139971187, 1509024645,
109190086, 1047146551, 1891386329, 994817018, 1247304975, 1489680608,
706686964, 1506717157, 579587572, 755120366, 1261483377, 884508252,
958076904, 1609787317, 1893464764, 148144545, 1415743291, 2102252735,
1788268214, 836935336, 433233439, 2055041154, 2109864544, 247038362,
299641085, 834307717, 1364585325, 23330161, 457882831, 1504556512,
1532354806, 567072918, 404219416, 1276257488, 1561889936, 1651524391,
618454448, 121093252, 1010757900, 1198042020, 876213618, 124757630,
2082550272, 1834290522, 1734544947, 1828531389, 1982435068, 1002804590,
1783300476, 1623219634, 1839739926, 69050267, 1530777140, 1802120822,
316088629, 1830418225, 488944891, 1680673954, 1853748387, 946827723,
1037746818, 1238619545, 1513900641, 1441966234, 367393385, 928306929,
946006977, 985847834, 1049400181, 1956764878, 36406206, 1925613800,
2081522508, 2118956479, 1612420674, 1668583807, 1800004220, 1447372094,
523904750, 1435821048, 923108080, 216161028, 1504871315, 306401572,
2018281851, 1820959944, 2136819798, 359743094, 1354150250, 1843084537,
1306570817, 244413420, 934220434, 672987810, 1686379655, 1301613820,
1601294739, 484902984, 139978006, 503211273, 294184214, 176384212,
281341425, 228223074, 147857043, 1893762099, 1896806882, 1947861263,
1193650546, 273227984, 1236198663, 2116758626, 489389012, 593586330,
275676551, 360187215, 267062626, 265012701, 719930310, 1621212876,
2108097238, 2026501127, 1865626297, 894834024, 552005290, 1404522304,
48964196, 5816381, 1889425288, 188942202, 509027654, 36125855,
365326415, 790369079, 264348929, 513183458, 536647531, 13672163,
313561074, 1730298077, 286900147, 1549759737, 1699573055, 776289160,
2143346068, 1975249606, 1136476375, 262925046, 92778659, 1856406685,
1884137923, 53392249, 1735424165, 1602280572
};
const char *Atom_new(const char *str, int len) {
    unsigned long h;
    int i;
    struct atom *p;

    assert(str);
    assert(len >= 0);
    for (h = 0, i = 0; i < len; i++)
        h = (h<<1) + scatter[(unsigned char)str[i]];
    h &= NELEMS(buckets)-1;
    for (p = buckets[h]; p; p = p->link)
        if (len == p->len) {
            for (i = 0; i < len && p->str[i] == str[i]; )
                i++;
            if (i == len)
                return p->str;
        }
    p = ALLOC(sizeof (*p) + len + 1);
    p->len = len;
    p->str = (char *)(p + 1);
    if (len > 0)
        memcpy(p->str, str, len);
    p->str[len] = '\0';
    p->link = buckets[h];
    buckets[h] = p; // insert atom at the front of the list
    return p->str;
}
At the end of the chapter, in exercise 3.1, the book's author says:
"Most texts recommend using a prime number for the size of
buckets. Using a prime and a good hash function usually gives a
better distribution of the lengths of the lists hanging off of buckets.
Atom uses a power of two, which is sometimes explicitly cited
as a bad choice. Write a program to generate or read, say, 10,000
typical strings and measure Atom_new’s speed and the distribution
of the lengths of the lists. Then change buckets so that it has
2,039 entries (the largest prime less than 2,048), and repeat the
measurements. Does using a prime help? How much does your
conclusion depend on your specific machine?"
So I changed the hash table size to 2039, but the prime number actually made the distribution of the list lengths worse. I also tried 64 and 61, and 61 gave a bad distribution too.
I just want to know why a prime table size gives a bad distribution. Is it because the hash function used in Atom_new is a bad hash function?
I am using this function to print out the lengths of the atom lists:
#define B_SIZE 2048

void Atom_print(void)
{
    int i,t;
    struct atom *atom;
    for(i= 0;i<B_SIZE;i++) {
        t = 0;
        for(atom=buckets[i];atom;atom=atom->link) {
            ++t;
        }
        printf("%d ",t);
    }
}
Well, a long time ago I had to implement a hash table (in driver development), and I wondered about the same thing. Why the heck should I use a prime number? OTOH a power of 2 is even better: instead of calculating the modulus you can use a bitwise AND.
So I implemented such a hash table. The key was a pointer (returned by some 3rd-party function). Eventually I noticed that only 1/4 of the entries in my hash table were ever filled, because the hash function I used was the identity function and it turned out that all the returned pointers were multiples of 4.
The idea behind using prime numbers for the hash table size is the following: real-world hash functions do not produce uniformly distributed values. Usually there is (or at least there may be) some regularity in them. To diffuse it, a prime table size is recommended.
BTW, in theory the hash function may still happen to produce values that are all multiples of your chosen prime. But the probability of that is lower than for a non-prime size.
I think it's the code to select the bucket. In the code you pasted it says:
h &= NELEMS(buckets)-1;
That works fine for sizes which are powers of two, since its net effect is to select the low bits of h. For other sizes, NELEMS(buckets)-1 has some bits set to 0, and the bitwise & operator always clears those bits, effectively leaving "holes" in the bucket list.
The general formula for bucket selection is:
h = h % NELEMS(buckets);
This is what Julienne Walker from Eternally Confuzzled has to say about hash table sizes:
When it comes to hash tables, the most
recommended table size is any prime
number. This recommendation is made
because hashing in general is
misunderstood, and poor hash functions
require an extra mixing step of
division by a prime to resemble a
uniform distribution. Another reason
that a prime table size is recommended
is because several of the collision
resolution methods require it to work.
In reality, this is a generalization
and is actually false (a power of two
with odd step sizes will typically
work just as well for most collision
resolution strategies), but not many
people consider the alternatives and
in the world of hash tables, prime
rules.
There's another factor at work here and that is that the constant hashing values should all be odd/prime and widely dispersed. If you have an even number of units (characters for instance) in the key to be hashed then having all odd constants will give you an even initial hash value. For an odd number of units you'd get an odd number. I've done some experimenting with this and just the 50/50% split was worth a lot in evening the distribution. Of course if all keys are equally long this doesn't matter.
The hashing also needs to ensure that you won't get the same initial hash value for "AAB" as for "ABA" or "BAA".

Intermittent bugs - sometimes this code works and sometimes it doesn't!

This code intermittently works. It's running on a small microcontroller. It will work fine even after restarting the processor, but if I change some part of the code, it breaks. This makes me think that it's some kind of pointer bug or memory corruption. What's happening is the coordinate, p_res.pos.x is sometimes read as 0 (the incorrect value) and 96 (the correct value) when it is passed to write_circle_outlined. y seems to be correct most of the time. If anyone can spot anything obviously wrong please point it out!
int demo_game()
{
    long int d;
    int x, y;
    struct WorldCamera p_viewer;
    struct Point3D_LLA p_subj;
    struct Point2D_CalcRes p_res;
    p_viewer.hfov = 27;
    p_viewer.vfov = 32;
    p_viewer.width = 192;
    p_viewer.height = 128;
    p_viewer.p.lat = 51.26f;
    p_viewer.p.lon = -1.0862f;
    p_viewer.p.alt = 100.0f;
    p_subj.lat = 51.20f;
    p_subj.lon = -1.0862f;
    p_subj.alt = 100.0f;
    while(1)
    {
        fill_buffer(draw_buffer_mask, 0x0000);
        fill_buffer(draw_buffer_level, 0xffff);
        compute_3d_transform(&p_viewer, &p_subj, &p_res, 10000.0f);
        x = p_res.pos.x;
        y = p_res.pos.y;
        write_circle_outlined(x, y, 1.0f / p_res.est_dist, 0, 0, 0, 1);
        p_viewer.p.lat -= 0.0001f;
        //p_viewer.p.alt -= 0.00001f;
        d = 20000;
        while(d--);
    }
    return 1;
}
The code for compute_3d_transform is:
void compute_3d_transform(struct WorldCamera *p_viewer, struct Point3D_LLA *p_subj, struct Point2D_CalcRes *res, float cliph)
{
    // Estimate the distance to the waypoint. This isn't intended to replace
    // proper lat/lon distance algorithms, but provides a general indication
    // of how far away our subject is from the camera. It works accurately for
    // short distances of less than 1km, but doesn't give distances in any
    // meaningful unit (lat/lon distance?)
    res->est_dist = hypot2(p_viewer->p.lat - p_subj->lat, p_viewer->p.lon - p_subj->lon);
    // Save precious cycles if outside of visible world.
    if(res->est_dist > cliph)
        goto quick_exit;
    // Compute the horizontal angle to the point.
    // atan2(y,x) so atan2(lon,lat) and not atan2(lat,lon)!
    res->h_angle = RAD2DEG(angle_dist(atan2(p_viewer->p.lon - p_subj->lon, p_viewer->p.lat - p_subj->lat), p_viewer->yaw));
    res->small_dist = res->est_dist * 0.0025f; // by trial and error this works well.
    // Using the estimated distance and altitude delta we can calculate
    // the vertical angle.
    res->v_angle = RAD2DEG(atan2(p_viewer->p.alt - p_subj->alt, res->est_dist));
    // Normalize the results to fit in the field of view of the camera if
    // the point is visible. If they are outside of (0,hfov] or (0,vfov]
    // then the point is not visible.
    res->h_angle += p_viewer->hfov / 2;
    res->v_angle += p_viewer->vfov / 2;
    // Set flags.
    if(res->h_angle < 0 || res->h_angle > p_viewer->hfov)
        res->flags |= X_OVER;
    if(res->v_angle < 0 || res->v_angle > p_viewer->vfov)
        res->flags |= Y_OVER;
    res->pos.x = (res->h_angle / p_viewer->hfov) * p_viewer->width;
    res->pos.y = (res->v_angle / p_viewer->vfov) * p_viewer->height;
    return;
quick_exit:
    res->flags |= X_OVER | Y_OVER;
    return;
}
Structure for the results:
typedef struct Point2D_Pixel { unsigned int x, y; };

// Structure for storing calculated results (from camera transforms.)
typedef struct Point2D_CalcRes
{
    struct Point2D_Pixel pos;
    float h_angle, v_angle, est_dist, small_dist;
    int flags;
};
The code is part of an open source project of mine so it's okay to post a lot of code here.
I see that some of your calculations depend on p_viewer->yaw, but I do not see any initialization of p_viewer->yaw. Is this your problem?
A couple of things that seem sketchy:
You can return from compute_3d_transform without setting many of the fields in p_res/res but the caller never checks for this situation.
You consistently read from res->flags without initializing it first.
Whenever the output differs from run to run, it usually means some value is not initialized and the outcome depends on whatever garbage is in a variable. Keeping that in mind, I looked for uninitialized variables. The structure p_res is not initialized.
    if(res->est_dist > cliph)
        goto quick_exit;
That means that whenever this condition is true, control goes straight to the quick_exit label and p_res.pos.x is never updated, so the caller reads whatever garbage the uninitialized structure held. When the condition is false, it is updated.
When I used to program C, I would use a divide and conquer debugging technique for this kind of problem to try to isolate the offending operation (paying attention to whether the symptoms change as debugging code is added, which is indicative of dangling pointer type bugs).
Essentially, start with the first line where the value is known to be good (and prove that it is consistently good at that line). Then identify where is it known to be bad. Then approx. halfway between the two points insert a test to see if it's bad. If not, then insert a test halfway between the mid-point and the known bad location, if it is bad then insert a test halfway between the mid-point and the known good location, and so on.
If the line identified is itself a function call, this process can be repeated in that called function, and so on.
When using this kind of approach, it's important to minimize the amount of added code and the artificial "noise", which can create timing changes.
Use this if you don't have (or can't use) an interactive debugger, or if the problem does not manifest when using one.
