I have an audio application that uses ALSA to play back audio samples.
The "hw:0" device has been setup as:
Samples: 48kHz, 16-bit LE
Buffersize: 1920 frames (=20 ms)
Periodsize: 960 frames (=10 ms)
This is the pseudo-code:
snd_pcm_sframes_t delayp = 0;
snd_pcm_sframes_t availp = 0;
while (true)
{
    snd_pcm_delay(m_pHandle, &delayp);
    availp = snd_pcm_avail(m_pHandle);
    print "Delay: " + delayp + " Available: " + availp

    err = snd_pcm_writei(m_pHandle, data, periodSize);
    availp = snd_pcm_avail_update(m_pHandle);
    print "Wrote " + err + " frames - frames available: " + availp
}
The log looks like this:
Periodsize 960 frames for a periodtime of 20 ms
Buffersize 1920 frames for a buffertime of 40 ms
Delay: 0 frames/ 0 ms; available 1920 frames/ 40
Wrote 960 frames; available after write: 960 frames/ 20 ms
xrunRecovery: Underrun!!! (at least 0.669 ms/ 32.112 frames long
Delay: 0 frames/ 0 ms; available 1920 frames/ 40
Wrote 960 frames; available after write: 960 frames/ 20 ms
Delay: 955 frames/ 19.8958 ms; available 965 frames/ 20.1042
Wrote 960 frames; available after write: 9 frames/ 0.1875 ms
Delay: 906 frames/ 18.875 ms; available 1014 frames/ 21.125
...
Delay: 952 frames/ 19.8333 ms; available 968 frames/ 20.1667
Wrote 960 frames; available after write: 18 frames/ 0.375 ms
xrunRecovery: Underrun!!! (at least 234.825 ms/ 11271.6 frames long
Delay: 0 frames/ 0 ms; available 1920 frames/ 40
Wrote 960 frames; available after write: 960 frames/ 20 ms
xrunRecovery: Underrun!!! (at least 0.869 ms/ 41.712 frames long
Delay: 0 frames/ 0 ms; available 1920 frames/ 40
There are two strange things happening:
1. Although I write 960 frames every time, snd_pcm_avail_update does not always reflect this.
2. Out of the blue, an xrun suddenly happens. For instance, in the case where only 18 frames are available, the next line shows an xrun when trying to write a new period to the buffer.
Can someone explain to me what is going on here?
When you write 960 frames, the number of available frames decreases by 960.
At the same time, the frames being played back increase the number of available frames.
You get an underrun when the buffer becomes empty.
According to your log, your program didn't run for about 234 ms.
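For reference, a minimal sketch of the usual way to handle that (not your code; the helper name is made up): write one period, and if snd_pcm_writei() returns -EPIPE the buffer ran empty, so re-prepare the device and try again.

#include <errno.h>
#include <stdint.h>
#include <alsa/asoundlib.h>

/* Hypothetical helper: write one period and recover from an underrun. */
static int write_period(snd_pcm_t *pcm, const int16_t *data,
                        snd_pcm_uframes_t period_size)
{
    snd_pcm_sframes_t n = snd_pcm_writei(pcm, data, period_size);
    if (n == -EPIPE) {            /* underrun: the buffer became empty */
        snd_pcm_prepare(pcm);     /* bring the PCM back to a writable state */
        n = snd_pcm_writei(pcm, data, period_size);
    }
    return (n < 0) ? (int)n : 0;
}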
My program calls malloc 10'000 times a second. I have absolutely no idea how long a malloc call takes.
As long as an uncontested mutex lock? (10-100 ns)
As long as compressing 1kb of data? (1-10 us)
As long as an SSD random read? (100-1000 us)
As long as a network transfer to Amsterdam? (10-100ms)
Instead of spending two hours investigating this, only to find out that it is absolutely dwarfed by something else my program does, I would like to get a rough idea of what to expect. Ballpark. Not precise. Off by a factor of 10 does not matter at all.
The following picture was upvoted 200 times here:
To state the obvious first: profiling for specific use cases is always required. However, this question asked for a rough general ballpark approximation guesstimate of the order of magnitude. That's something we do when we don't know if we should even think about a problem. Do I need to worry about my data being in cache when it is then sent to Amsterdam? Looking at the picture in the question, the answer is a resounding No. Yes, it could be a problem, but only if you messed up big. We assume that case to be ruled out and instead discuss the problem in probabilistic generality.
It may be ironic that the question arose when I was working on a program that cares very much about small details, where a performance difference of a few percent translates into millions of CPU hours. Profiling suggested malloc was not an issue, but before dismissing it outright, I wanted to sanity check: Is it theoretically plausible that malloc is a bottleneck?
As repeatedly suggested in a closed, earlier version of the question, there are large differences between environments.
I tried various machines (Intel: i7 8700K, i5 5670, some early-gen mobile i7 in a laptop; AMD: Ryzen 4300G, Ryzen 3900X), various OSes (Windows 10, Debian, Ubuntu) and compilers (gcc, clang-14, cygwin-g++, msvc; no debug builds).
I've used this to get an idea about the characteristics(*), using just 1 thread:
#include <stddef.h>
#include <stdlib.h>
#include <time.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
    const size_t allocs = 10;
    const size_t repeats = 10000;
    printf("chunk\tms\tM1/s\tGB/s\tcheck\n");
    for (size_t size = 16; size < 10 * 1000 * 1000; size *= 2) {
        float t0 = (float)clock() / CLOCKS_PER_SEC;
        size_t check = 0;
        for (size_t repeat = 0; repeat < repeats; ++repeat) {
            char* ps[allocs];
            for (size_t i = 0; i < allocs; i++) {
                ps[i] = malloc(size);
                if (!ps[i]) {
                    exit(1);
                }
                /* touch one byte per 512 so the memory is actually used */
                for (size_t touch = 0; touch < size; touch += 512) {
                    ps[i][touch] = 1;
                }
            }
            for (size_t i = 0; i < allocs; i++) {
                check += ps[i][0];
                free(ps[i]);
            }
        }
        float dt = (float)clock() / CLOCKS_PER_SEC - t0;
        printf("%zu\t%1.5f\t%7.3f\t%7.1f\t%zu\n",
               size,
               dt / allocs / repeats * 1000,
               allocs / dt * repeats / 1000 / 1000,
               allocs / dt * repeats * size / 1024 / 1024 / 1024,
               check);
    }
}
The variance is stark, but, as expected, the values still fall in the same ballpark. The following table is representative; the others were off by less than a factor of 10.
chunk ms M1/s GB/s check
16 0.00003 38.052 0.6 100000
32 0.00003 37.736 1.1 100000
64 0.00003 37.651 2.2 100000
128 0.00004 24.931 3.0 100000
256 0.00004 26.991 6.4 100000
512 0.00004 26.427 12.6 100000
1024 0.00004 24.814 23.7 100000
2048 0.00007 15.256 29.1 100000
4096 0.00007 14.633 55.8 100000
8192 0.00008 12.940 98.7 100000
16384 0.00066 1.511 23.1 100000
32768 0.00271 0.369 11.3 100000
65536 0.00707 0.141 8.6 100000
131072 0.01594 0.063 7.7 100000
262144 0.04401 0.023 5.5 100000
524288 0.11226 0.009 4.3 100000
1048576 0.25546 0.004 3.8 100000
2097152 0.52395 0.002 3.7 100000
4194304 0.80179 0.001 4.9 100000
8388608 1.78242 0.001 4.4 100000
Here's one from a 3900X on cygwin-g++. You can clearly see the larger CPU cache, and after that, the higher memory throughput.
chunk ms M1/s GB/s check
16 0.00004 25.000 0.4 100000
32 0.00005 20.000 0.6 100000
64 0.00004 25.000 1.5 100000
128 0.00004 25.000 3.0 100000
256 0.00004 25.000 6.0 100000
512 0.00005 20.000 9.5 100000
1024 0.00004 25.000 23.8 100000
2048 0.00005 20.000 38.1 100000
4096 0.00005 20.000 76.3 100000
8192 0.00010 10.000 76.3 100000
16384 0.00015 6.667 101.7 100000
32768 0.00077 1.299 39.6 100000
65536 0.00039 2.564 156.5 100000
131072 0.00067 1.493 182.2 100000
262144 0.00093 1.075 262.5 100000
524288 0.02679 0.037 18.2 100000
1048576 0.14183 0.007 6.9 100000
2097152 0.26805 0.004 7.3 100000
4194304 0.51644 0.002 7.6 100000
8388608 1.01604 0.001 7.7 100000
So what gives?
With small chunk sizes, >= 10 million calls per second are possible even on old commodity hardware.
Once sizes go beyond the CPU cache, i.e. 1 to 100-ish MB, RAM access quickly dominates (I did not test malloc without actually using the chunks).
Depending on what sizes you malloc, one or the other will be the (ballpark) limit.
However, with something like 10k allocs per second, this is something you can likely ignore for the time being.
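As a rough cross-check against the tables above: small allocations come in at roughly 30-40 ns per malloc/free pair (including touching the memory), so 10,000 calls per second cost on the order of 0.3-0.4 ms of CPU time per second, which is well below 0.1% of one core.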
Hi, I need to put the PID number into /proc/%d/stat. How can I do that?
This is my full code; with this I get the total CPU usage:
#include <stdio.h>

unsigned sleep(unsigned sec);

struct cpustat {
    unsigned long t_user;
    unsigned long t_nice;
    unsigned long t_system;
    unsigned long t_idle;
    unsigned long t_iowait;
    unsigned long t_irq;
    unsigned long t_softirq;
};

void skip_lines(FILE *fp, int numlines)
{
    int cnt = 0;
    int ch;    /* int, not char, so the EOF comparison works */
    while ((cnt < numlines) && ((ch = getc(fp)) != EOF))
    {
        if (ch == '\n')
            cnt++;
    }
    return;
}

void get_stats(struct cpustat *st, int cpunum)
{
    FILE *fp = fopen("/proc/stat", "r");
    int lskip = cpunum + 1;
    skip_lines(fp, lskip);
    char cpun[255];
Obviously, to replace the %d with an integer, you'll need to use sprintf into a buffer as you do in your second example. You could also just use /proc/self/stat to get the stats of the current process, rather than getpid+sprintf.
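A minimal sketch of that (the helper name is made up for illustration):

#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: open /proc/<pid>/stat for reading. */
FILE *open_proc_stat(pid_t pid)
{
    char path[64];
    snprintf(path, sizeof path, "/proc/%d/stat", (int)pid);
    return fopen(path, "r");
}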
Your main problem seems to be with the contents/format you're expecting to see. stat contains a single line of info about the process, as described in proc(5). For example:
$ cat /proc/self/stat
27646 (cat) R 3284 27646 3284 34835 27646 4194304 86 0 1 0 0 0 0 0 20 0 1 0 163223159 7618560 210 18446744073709551615 4194304 4240236 140730092885472 0 0 0 0 0 0 0 0 0 17 1 0 0 2 0 0 6340112 6341364 37523456 140730092888335 140730092888355 140730092888355 140730092892143 0
You seem to be skipping some initial lines, and then trying to read something with a different format.
From the proc(5) man page, some of those numbers in /proc/self/stat are related to CPU time:
(14) utime %lu
     Amount of time that this process has been scheduled in user mode,
     measured in clock ticks (divide by sysconf(_SC_CLK_TCK)). This
     includes guest time, guest_time (time spent running a virtual CPU,
     see below), so that applications that are not aware of the guest
     time field do not lose that time from their calculations.

(15) stime %lu
     Amount of time that this process has been scheduled in kernel mode,
     measured in clock ticks (divide by sysconf(_SC_CLK_TCK)).
That gives you the total CPU time used by this process since it started. With the above cat program, those numbers are both 0 (it runs too fast to accumulate any ticks), but if I do
$ cat /proc/$$/stat
3284 (bash) S 2979 3284 3284 34835 12764 4194304 71122 947545 36 4525 104 66 1930 916 20 0 1 0 6160 24752128 1448 18446744073709551615 4194304 5192964 140726761267456 0 0 0 65536 3670020 1266777851 1 0 0 17 1 0 0 68 0 0 7290352 7326856 30253056 140726761273517 140726761273522 140726761273522 140726761275374 0
you can see that my shell has 104 ticks of user time and 66 ticks of system time.
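To put that together, here is a rough sketch of pulling utime and stime (fields 14 and 15) out of that line; the helper name is made up and error handling is minimal:

#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: read utime and stime (clock ticks) for a pid. */
int read_cpu_ticks(pid_t pid, unsigned long *utime, unsigned long *stime)
{
    char path[64];
    char line[1024];
    snprintf(path, sizeof path, "/proc/%d/stat", (int)pid);

    FILE *fp = fopen(path, "r");
    if (!fp)
        return -1;
    if (!fgets(line, sizeof line, fp)) {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    /* The comm field is in parentheses and may contain spaces, so scan
       from the last ')'. After it come fields 3..13 (state, ppid, pgrp,
       session, tty_nr, tpgid, flags, minflt, cminflt, majflt, cmajflt),
       then field 14 (utime) and field 15 (stime). */
    char *p = strrchr(line, ')');
    if (!p)
        return -1;
    if (sscanf(p + 1, " %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
               utime, stime) != 2)
        return -1;
    return 0;
}

Calling read_cpu_ticks(getpid(), &utime, &stime) then gives the tick counts for the running process, matching the fields shown above.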
I used ImageMagick in my project. I implemented a sub-image detection system using ImageMagick's compare command, and it is working well and giving good results. From reading articles I learned that ImageMagick compares the pixels of the small image at every possible position within the larger image, and that it detects rotated and scaled images using a fuzz factor. I was able to find the source code related to the compare command.
const Image *reconstruct_image,double *distortion,ExceptionInfo *exception)
{
  CacheView
    *image_view,
    *reconstruct_view;

  double
    area;

  MagickBooleanType
    status;

  register ssize_t
    j;

  size_t
    columns,
    rows;

  ssize_t
    y;

  status=MagickTrue;
  rows=MagickMax(image->rows,reconstruct_image->rows);
  columns=MagickMax(image->columns,reconstruct_image->columns);
  area=0.0;
  image_view=AcquireVirtualCacheView(image,exception);
  reconstruct_view=AcquireVirtualCacheView(reconstruct_image,exception);
#if defined(MAGICKCORE_OPENMP_SUPPORT)
  #pragma omp parallel for schedule(static,4) shared(status) \
    magick_threads(image,image,rows,1) reduction(+:area)
#endif
  for (y=0; y < (ssize_t) rows; y++)
  {
    double
      channel_distortion[MaxPixelChannels+1];

    register const Quantum
      *magick_restrict p,
      *magick_restrict q;

    register ssize_t
      x;

    if (status == MagickFalse)
      continue;
    p=GetCacheViewVirtualPixels(image_view,0,y,columns,1,exception);
    q=GetCacheViewVirtualPixels(reconstruct_view,0,y,columns,1,exception);
    if ((p == (const Quantum *) NULL) || (q == (const Quantum *) NULL))
      {
        status=MagickFalse;
        continue;
      }
    (void) ResetMagickMemory(channel_distortion,0,sizeof(channel_distortion));
    for (x=0; x < (ssize_t) columns; x++)
    {
      double
        Da,
        Sa;

      register ssize_t
        i;

      if ((GetPixelReadMask(image,p) == 0) ||
          (GetPixelReadMask(reconstruct_image,q) == 0))
        {
          p+=GetPixelChannels(image);
          q+=GetPixelChannels(reconstruct_image);
          continue;
        }
      Sa=QuantumScale*GetPixelAlpha(image,p);
      Da=QuantumScale*GetPixelAlpha(reconstruct_image,q);
      for (i=0; i < (ssize_t) GetPixelChannels(image); i++)
      {
        double
          distance;

        PixelChannel channel=GetPixelChannelChannel(image,i);
        PixelTrait traits=GetPixelChannelTraits(image,channel);
        PixelTrait reconstruct_traits=GetPixelChannelTraits(reconstruct_image,
          channel);
        if ((traits == UndefinedPixelTrait) ||
            (reconstruct_traits == UndefinedPixelTrait) ||
            ((reconstruct_traits & UpdatePixelTrait) == 0))
          continue;
        distance=QuantumScale*fabs(Sa*p[i]-Da*GetPixelChannel(reconstruct_image,
          channel,q));
        channel_distortion[i]+=distance;
        channel_distortion[CompositePixelChannel]+=distance;
      }
      area++;
      p+=GetPixelChannels(image);
      q+=GetPixelChannels(reconstruct_image);
    }
#if defined(MAGICKCORE_OPENMP_SUPPORT)
    #pragma omp critical (MagickCore_GetMeanAbsoluteError)
#endif
    for (j=0; j <= MaxPixelChannels; j++)
      distortion[j]+=channel_distortion[j];
  }
  reconstruct_view=DestroyCacheView(reconstruct_view);
  image_view=DestroyCacheView(image_view);
  area=PerceptibleReciprocal(area);
  for (j=0; j <= MaxPixelChannels; j++)
    distortion[j]*=area;
  distortion[CompositePixelChannel]/=(double) GetImageChannels(image);
  return(status);
}
Does anyone have an idea what conditions they are checking for in the following code snippet?
if ((traits == UndefinedPixelTrait) ||
    (reconstruct_traits == UndefinedPixelTrait) ||
    ((reconstruct_traits & UpdatePixelTrait) == 0))
In case you're wondering how these values are used:
http://www.learncpp.com/cpp-tutorial/3-8a-bit-flags-and-bit-masks/
Those values are nothing but different bits. They are distinct so you can combine them and test for them unambiguously.
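For illustration, this is how such bit flags are typically combined and tested; the trait names below mirror ImageMagick's, but the values are made up for the example:

#include <stdio.h>

enum {
    UndefinedTrait = 0x0,   /* no bits set */
    CopyTrait      = 0x1,
    UpdateTrait    = 0x2,
    BlendTrait     = 0x4
};

int main(void) {
    unsigned traits = UpdateTrait | BlendTrait;   /* combine two flags */
    if (traits == UndefinedTrait)
        printf("no traits at all\n");
    if ((traits & UpdateTrait) != 0)
        printf("this channel should be updated\n");
    return 0;
}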
In case you don't know the meaning of UndefinedPixelTrait and so on, just google the word and you'll end up in the ImageMagick documentation:
https://www.imagemagick.org/include/porting.php
Pixel Traits
Each pixel channel includes one or more of these traits:
Undefined   no traits associated with this pixel channel
Copy        do not update this pixel channel, just copy it
Update      update this pixel channel
Blend       blend this pixel channel with the alpha mask if it's enabled

We provide these methods to set and get pixel traits:

GetPixelAlphaTraits() SetPixelAlphaTraits()
GetPixelBlackTraits() SetPixelBlackTraits()
GetPixelBlueTraits() SetPixelBlueTraits()
GetPixelCbTraits() SetPixelCbTraits()
GetPixelChannelTraits() SetPixelChannelTraits()
GetPixelCrTraits() SetPixelCrTraits()
GetPixelGrayTraits() SetPixelGrayTraits()
GetPixelGreenTraits() SetPixelGreenTraits()
GetPixelIndexTraits() SetPixelIndexTraits()
GetPixelMagentaTraits() SetPixelMagentaTraits()
GetPixelRedTraits() SetPixelRedTraits()
GetPixelYellowTraits() SetPixelYellowTraits()
GetPixelYTraits() SetPixelYTraits()

For convenience you can set the active trait for a set of pixel channels with a channel mask and this method:

SetImageChannelMask()

Previously MagickCore methods had channel analogs, for example, NegateImage() and NegateImageChannels(). The channel analog methods are no longer necessary because the pixel channel traits specify whether to act on a particular pixel channel or whether to blend with the alpha mask. For example, instead of NegateImageChannel(image,channel); we use:

channel_mask=SetImageChannelMask(image,channel);
NegateImage(image,exception);
(void) SetImageChannelMask(image,channel_mask);
If you want to know how and why each method handles those flags, read the respective documentation or the code itself.
I'm working with the example recording program that's available here on the PortAudio website, and I'm confused about how the frame index is being incremented.
Given that the buffer size is 512 frames, and the sample rate is set to 44100 Hz, you would assume that the program works through the buffer, invokes the callback, and increases the frame index by 512 roughly every 11.6 ms. In main(), I have the program outputting the current frame index every 12 ms (as opposed to every 1000 ms like in the example, but the rest of my code is identical to theirs), and I assumed that it would increment by 512 with each line of output, but that's not the case. This is a chunk of the output:
index at 12 ms = 0
index at 24 ms = 0
index at 36 ms = 1024
index at 48 ms = 1024
index at 60 ms = 2048
index at 72 ms = 2048
index at 84 ms = 2048
index at 96 ms = 3072
index at 108 ms = 3072
index at 120 ms = 3072
index at 132 ms = 4096
index at 144 ms = 4096
index at 156 ms = 4096
index at 168 ms = 5120
index at 180 ms = 5120
index at 192 ms = 5120
index at 204 ms = 6144
index at 216 ms = 7680
As you can see, this is incrementing in a strange fashion. The index stays at 0 until 36 ms, where it then jumps up to 1024, and then suddenly increases from 1024 to 2048 at 60 ms. The index does not increase the way one would expect it to, and it's also inconsistent. You'll notice that the index takes 24 ms to increment from 1024 to 2048, and then 36 ms to increment from 2048 to 3072, but then only 12 ms to later increment from 6144 to 7680.
My question is, what is happening here and what can I do to get the output happening at a more consistent rate? Is it something to do with the ALSA audio buffer size perhaps?
The PortAudio callback buffer size is not necessarily the host buffer size. The relationship between the two is complex, because not all host APIs support all buffer sizes, but PortAudio will "adapt" to support your requested callback buffer size.
PortAudio computes the host buffer size using an algorithm that incorporates multiple constraints including: the user buffer size, the requested latency, supported host buffer sizes, and heuristics.
For example, if you request a very small callback buffer size, but a large latency, it is likely that PortAudio will use a larger host buffer size, and you may observe bursty callbacks. That looks like what's happening for you: 1024 frame host buffer size, but 512 frame PA callback buffer size.
There is more information in my answer here:
Time between callback calls?
what can I do to get the output happening at a more consistent rate?
Request a smaller latency in the appropriate parameters to Pa_OpenStream().
For some PortAudio host APIs there is also a host-API specific mechanism to force the host buffer size. I'm not sure whether ALSA provides that option.
Alternatively, you can use the paFramesPerBufferUnspecified option, which should result in more regular callbacks at whatever buffer size PortAudio and ALSA negotiate based on your requested latency.
Note that there might be some ALSA-specific or device-specific reason why you're getting 1024 frame host buffers. But I'd try the above first.
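For illustration, a rough sketch of the stream-opening part adjusted as suggested; recordCallback and data stand in for the callback function and user data from the example:

#include "portaudio.h"

/* Fragment of main() from the recording example, adjusted to request the
   device's low-latency figure and to let PortAudio pick the buffer size. */
PaStreamParameters inputParameters;
PaStream *stream;

inputParameters.device = Pa_GetDefaultInputDevice();
inputParameters.channelCount = 2;
inputParameters.sampleFormat = paInt16;
inputParameters.suggestedLatency =
    Pa_GetDeviceInfo(inputParameters.device)->defaultLowInputLatency;
inputParameters.hostApiSpecificStreamInfo = NULL;

PaError err = Pa_OpenStream(&stream,
                            &inputParameters,
                            NULL,                         /* no output */
                            44100,
                            paFramesPerBufferUnspecified, /* or 512 */
                            paClipOff,
                            recordCallback,               /* from the example */
                            &data);                       /* from the example */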
I am trying to port some code from IDL to C, and I find myself having to replicate the ROT function. The goal is to rotate a 1024x1024 array of unsigned shorts by an angle in degrees. For the purposes of this project, the angle is very small, less than one degree. The function uses bilinear interpolation.
I tried a backwards approach. For each pixel in the output array, I did a reverse rotation to figure out what coordinate in the input array it would belong to, then used interpolation to figure out what that value would be. I wasn't sure how to go about doing bilinear interpolation if the input grid was skewed; every example of it I've seen assumes that it's orthogonal.
For the rotation, I referred to this:
x' = x * cos(a) + y * sin(a)
y' = y * cos(a) - x * sin(a)
from this: Image scaling and rotating in C/C++
And for the interpolation, I referred to this: http://en.wikipedia.org/wiki/Bilinear_interpolation
Anyway, here's my code:
#include <stdlib.h>
#include <math.h>

#define DTOR 0.0174532925

void rotatearray(unsigned short int *inarray, unsigned short int *outarray,
                 int xsize, int ysize, double angle)
{
    //allocate temparray, set to 0
    unsigned short int *temparray;
    temparray = calloc(xsize*ysize, sizeof(unsigned short int));

    int xf, yf;
    int xi1, xi2, yi1, yi2;
    double xi, yi;
    double x, y;
    double minusangle = (360 - angle)*DTOR;
    unsigned short int v11, v12, v21, v22;
    int goodpixels = 0;
    int badpixels = 0;

    for (yf = 0; yf < ysize; yf++)
    {
        for (xf = 0; xf < xsize; xf++)
        {
            //what point in the input grid would map to this output pixel?
            //(inverse of rotation)
            xi = (xf+0.5)*cos(minusangle) + (yf+0.5)*sin(minusangle);
            yi = (yf+0.5)*cos(minusangle) - (xf+0.5)*sin(minusangle);
            //Is it within bounds?
            if ((xi > (0+0.5)) && (xi < xsize-0.5) &&
                (yi > (0+0.5)) && (yi < ysize-0.5))
            {
                //what are the indices of the bounding input pixels?
                xi1 = (int)(xi - 0.5);
                xi2 = (int)(xi + 0.5);
                yi1 = (int)(yi - 0.5);
                yi2 = (int)(yi + 0.5);
                //What position is (x,y) in the bounding unit square?
                x = xi - xi1;
                y = yi - yi1;
                //what are the values of the bounding input pixels?
                v11 = inarray[yi1*xsize + xi1];
                v12 = inarray[yi2*xsize + xi1];
                v21 = inarray[yi1*xsize + xi2];
                v22 = inarray[yi2*xsize + xi2];
                //Do bilinear interpolation
                temparray[yf*xsize + xf] = (unsigned short int)
                    (v11*(1-x)*(1-y) + v21*x*(1-y) + v12*(1-x)*y + v22*x*y);
                goodpixels++;
            }
            else
            {
                temparray[yf*xsize + xf] = 0;
                badpixels++;
            }
        }
    }
    //copy to outarray
    for (yf = 0; yf < ysize; yf++)
    {
        for (xf = 0; xf < xsize; xf++)
        {
            outarray[yf*xsize + xf] = temparray[yf*xsize + xf];
        }
    }
    free(temparray);
    return;
}
I tested it by printing several dozen numbers and comparing them to the same indices in the IDL output, and the results are not at all the same. I'm not sure what more information I can give on that, as I'm not currently able to produce a working image of the array. Do you see any errors in my implementation? Is my reasoning behind the algorithm sound?
EDIT: Here are some selected numbers from the input array; they are identical in the C and IDL programs. What's printed is the x index, followed by the y index, followed by the value at that point.
0 0 24.0000
256 0 17.0000
512 0 23.0000
768 0 21.0000
1023 0 0.00000
0 256 19.0000
256 256 459.000
512 256 379.000
768 256 191.000
1023 256 0.00000
0 512 447.000
256 512 388.000
512 512 231.000
768 512 231.000
1023 512 0.00000
0 768 286.000
256 768 378.000
512 768 249.000
768 768 205.000
1023 768 0.00000
0 1023 6.00000
256 1023 10.0000
512 1023 11.0000
768 1023 12.0000
1023 1023 0.00000
This is what the IDL program outputs after rotation:
0 0 31.0000
256 0 20.4179
512 0 20.3183
768 0 20.0000
1023 0 0.00000
0 256 63.0000
256 256 457.689
512 256 392.406
768 256 354.140
1023 256 0.00000
0 512 511.116
256 512 402.241
512 512 230.939
768 512 240.861
1023 512 0.00000
0 768 296.826
256 768 377.217
512 768 218.039
768 768 277.194
1023 768 0.00000
0 1023 14.0000
256 1023 8.00000
512 1023 9.34906
768 1023 23.7820
1023 1023 0.00000
And here is the data after rotation using my function:
[0,0]: 0
[256,0]: 44
[512,0]: 276
[768,0]: 299
[1023,0]: 0
[0,256]: 0
[256,256]: 461
[512,256]: 439
[768,256]: 253
[1023,256]: 0
[0,512]: 0
[256,512]: 377
[512,512]: 262
[768,512]: 379
[1023,512]: 0
[0,768]: 0
[256,768]: 340
[512,768]: 340
[768,768]: 198
[1023,768]: 18
[0,1023]: 0
[256,1023]: 0
[512,1023]: 0
[768,1023]: 0
[1023,1023]: 0
I didn't see an immediately useful pattern emerging here to indicate what's going on, which is why I didn't originally include it.
EDIT EDIT EDIT: I believe I just stumbled across the problem! I noticed that the 0,0 pixel never seems to change, and that the 1023,1023 pixel changes the most. Of course this means the algorithm rotates around the origin, while the function I seek to imitate presumably rotates around the center of the image. The answer is still not the same, but it is much closer. All I did was change the lines
xi = (xf+0.5)*cos(minusangle) + (yf+0.5)*sin(minusangle);
yi = (yf+0.5)*cos(minusangle) - (xf+0.5)*sin(minusangle);
to
xi = (xf-512+0.5)*cos(minusangle) + (yf-512+0.5)*sin(minusangle) + 512;
yi = (yf-512+0.5)*cos(minusangle) + (xf-512+0.5)*sin(minusangle) + 512;
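For reference, the centre-relative inverse mapping that keeps the same sign convention as the original formulas would look like this (a sketch only; the exact pivot point IDL's ROT uses is an assumption here):

/* Sketch: rotate about the image centre, keeping the -sin term from the
   original y formula. Whether IDL's ROT pivots at exactly this point is
   an assumption. */
double cx = xsize / 2.0;
double cy = ysize / 2.0;

xi = (xf + 0.5 - cx) * cos(minusangle) + (yf + 0.5 - cy) * sin(minusangle) + cx;
yi = (yf + 0.5 - cy) * cos(minusangle) - (xf + 0.5 - cx) * sin(minusangle) + cy;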