Rotation algorithm not producing expected results - c

I am trying to port some code from IDL to C, and I find myself having to replicate the ROT function. The goal is to rotate a 1024x1024 array of unsigned shorts by an angle in degrees. For the purposes of this project, the angle is very small, less than one degree. The function uses bilinear interpolation.
I tried a backwards approach. For each pixel in the output array, I did a reverse rotation to figure out what coordinate in the input array it would belong to, then used interpolation to figure out what that value would be. I wasn't sure how to go about doing bilinear interpolation if the input grid was skewed; every example of it I've seen assumes that it's orthogonal.
For the rotation, I referred to this:
x' = x * cos(a) + y * sin(a)
y' = y * cos(a) - x * sin(a)
from this: Image scaling and rotating in C/C++
And for the interpolation, I referred to this: http://en.wikipedia.org/wiki/Bilinear_interpolation
Anyway, here's my code:
#include <math.h>
#include <stdlib.h>

#define DTOR 0.0174532925
void rotatearray(unsigned short int *inarray, unsigned short int *outarray, int xsize,
                 int ysize, double angle)
{
    //allocate temparray, set to 0
    unsigned short int *temparray;
    temparray = calloc(xsize*ysize, sizeof(unsigned short int));
    int xf, yf;
    int xi1, xi2, yi1, yi2;
    double xi, yi;
    double x, y;
    double minusangle = (360 - angle)*DTOR;
    unsigned short int v11, v12, v21, v22;
    int goodpixels=0;
    int badpixels=0;
    for(yf=0;yf<ysize;yf++)
    {
        for(xf=0;xf<xsize;xf++)
        {
            //what point in the input grid would map to this output pixel?
            //(inverse of rotation)
            xi = (xf+0.5)*cos(minusangle) + (yf+0.5)*sin(minusangle);
            yi = (yf+0.5)*cos(minusangle) - (xf+0.5)*sin(minusangle);
            //Is it within bounds?
            if ((xi>(0+0.5))&&(xi<xsize-0.5)&&
                (yi>(0+0.5))&&(yi<ysize-0.5))
            {
                //what are the indices of the bounding input pixels?
                xi1 = (int)(xi - 0.5);
                xi2 = (int)(xi + 0.5);
                yi1 = (int)(yi - 0.5);
                yi2 = (int)(yi + 0.5);
                //What position is (x,y) in the bound unit square?
                x = xi - xi1;
                y = yi - yi1;
                //what are the values of the bounding input pixels?
                v11 = inarray[yi1*xsize + xi1];//What are the values of
                v12 = inarray[yi2*xsize + xi1];//the bounding input pixels?
                v21 = inarray[yi1*xsize + xi2];
                v22 = inarray[yi2*xsize + xi2];
                //Do bilinear interpolation
                temparray[yf*xsize + xf] = (unsigned short int)
                    (v11*(1-x)*(1-y) + v21*x*(1-y) + v12*(1-x)*y + v22*x*y);
                goodpixels++;
            }
            else{temparray[yf*xsize + xf]=0; badpixels++;}
        }
    }
    //copy to outarray
    for(yf=0;yf<ysize;yf++)
    {
        for(xf=0;xf<xsize;xf++)
        {
            outarray[yf*xsize + xf] = temparray[yf*xsize+xf];
        }
    }
    free(temparray);
    return;
}
I tested it by printing several dozen numbers and comparing them with the values at the same indices in the IDL output, and the results are not at all the same. I'm not sure what more information I can give on that, as I'm not currently able to produce a working image of the array. Do you see any errors in my implementation? Is my reasoning behind the algorithm sound?
EDIT: Here are some selected numbers from the input array; they are identical in the C and IDL programs. What's printed is the x index, followed by the y index, followed by the value at that point.
0 0 24.0000
256 0 17.0000
512 0 23.0000
768 0 21.0000
1023 0 0.00000
0 256 19.0000
256 256 459.000
512 256 379.000
768 256 191.000
1023 256 0.00000
0 512 447.000
256 512 388.000
512 512 231.000
768 512 231.000
1023 512 0.00000
0 768 286.000
256 768 378.000
512 768 249.000
768 768 205.000
1023 768 0.00000
0 1023 6.00000
256 1023 10.0000
512 1023 11.0000
768 1023 12.0000
1023 1023 0.00000
This is what the IDL program outputs after rotation:
0 0 31.0000
256 0 20.4179
512 0 20.3183
768 0 20.0000
1023 0 0.00000
0 256 63.0000
256 256 457.689
512 256 392.406
768 256 354.140
1023 256 0.00000
0 512 511.116
256 512 402.241
512 512 230.939
768 512 240.861
1023 512 0.00000
0 768 296.826
256 768 377.217
512 768 218.039
768 768 277.194
1023 768 0.00000
0 1023 14.0000
256 1023 8.00000
512 1023 9.34906
768 1023 23.7820
1023 1023 0.00000
And here is the data after rotation using my function:
[0,0]: 0
[256,0]: 44
[512,0]: 276
[768,0]: 299
[1023,0]: 0
[0,256]: 0
[256,256]: 461
[512,256]: 439
[768,256]: 253
[1023,256]: 0
[0,512]: 0
[256,512]: 377
[512,512]: 262
[768,512]: 379
[1023,512]: 0
[0,768]: 0
[256,768]: 340
[512,768]: 340
[768,768]: 198
[1023,768]: 18
[0,1023]: 0
[256,1023]: 0
[512,1023]: 0
[768,1023]: 0
[1023,1023]: 0
I didn't see an immediately useful pattern emerging here to indicate what's going on, which is why I didn't originally include it.
EDIT EDIT EDIT: I believe my mind just suddenly stumbled across the problem! I noticed that the 0,0 pixel never seems to change, and that the 1023,1023 pixel changes the most. Of course this means the algorithm is designed to rotate around the origin, while I'm assuming that the function I seek to imitate is designed to rotate around the center of the image. The answer is still not the same but it is much closer. All I did was change the line
xi = (xf+0.5)*cos(minusangle) + (yf+0.5)*sin(minusangle);
yi = (yf+0.5)*cos(minusangle) - (xf+0.5)*sin(minusangle);
to
xi = (xf-512+0.5)*cos(minusangle) + (yf-512+0.5)*sin(minusangle) + 512;
yi = (yf-512+0.5)*cos(minusangle) + (xf-512+0.5)*sin(minusangle) + 512;


Is 'malloc' possibly a bottleneck in my program?

My program calls malloc 10'000 times a second. I have absolutely no idea how long a malloc call takes.
As long as an uncontested mutex lock? (10-100 ns)
As long as compressing 1kb of data? (1-10 us)
As long as an SSD random read? (100-1000 us)
As long as a network transfer to Amsterdam? (10-100ms)
Instead of spending two hours to investigate this, only to find out that it is absolutely dwarfed by some other thing my program does, I would like to get a rough idea of what to expect. Ballpark. Not precise. Off by factor 10 does not matter at all.
The following picture was upvoted 200 times here: [image not included]
To state the obvious first: profiling for specific use cases is always required. However, this question asked for a rough general ballpark approximation guesstimate of the order of magnitude. That's something we do when we don't know if we should even think about a problem. Do I need to worry about my data being in cache when it is then sent to Amsterdam? Looking at the picture in the question, the answer is a resounding No. Yes, it could be a problem, but only if you messed up big. We assume that case to be ruled out and instead discuss the problem in probabilistic generality.
It may be ironic that the question arose when I was working on a program that cares very much about small details, where a performance difference of a few percent translates into millions of CPU hours. Profiling suggested malloc was not an issue, but before dismissing it outright, I wanted to sanity check: Is it theoretically plausible that malloc is a bottleneck?
As repeatedly suggested in a closed, earlier version of the question, there are large differences between environments.
I tried various machines (intel: i7 8700K, i5 5670, some early gen mobile i7 in a laptop; AMD: Ryzen 4300G, Ryzen 3900X), various OS (windows 10, debian, ubuntu) and compilers (gcc, clang-14, cygwin-g++, msvc; no debug builds).
I've used this to get an idea about the characteristics(*), using just 1 thread:
#include <stddef.h>
#include <stdlib.h>
#include <time.h>
#include <stdio.h>
int main(void) {
    const size_t allocs = 10;
    const size_t repeats = 10000;
    printf("chunk\tms\tM1/s\tGB/s\tcheck\n");
    for (size_t size = 16; size < 10 * 1000 * 1000; size *= 2) {
        float t0 = (float)clock() / CLOCKS_PER_SEC;
        size_t check = 0;
        for (size_t repeat = 0; repeat < repeats; ++repeat) {
            char* ps[allocs];
            for (size_t i = 0; i < allocs; i++) {
                ps[i] = malloc(size);
                if (!ps[i]) {
                    exit(1);
                }
                // touch every 512th byte so the pages are actually used
                for (size_t touch = 0; touch < size; touch += 512) {
                    ps[i][touch] = 1;
                }
            }
            for (size_t i = 0; i < allocs; i++) {
                check += ps[i][0];
                free(ps[i]);
            }
        }
        float dt = (float)clock() / CLOCKS_PER_SEC - t0;
        // %zu because size and check are size_t
        printf("%zu\t%1.5f\t%7.3f\t%7.1f\t%zu\n",
               size,
               dt / allocs / repeats * 1000,
               allocs / dt * repeats / 1000 / 1000,
               allocs / dt * repeats * size / 1024 / 1024 / 1024,
               check);
    }
    return 0;
}
The variance is stark, but, as expected, the values still belong to the same ballpark.
The following table is representative; the others were off by less than a factor of 10.
chunk ms M1/s GB/s check
16 0.00003 38.052 0.6 100000
32 0.00003 37.736 1.1 100000
64 0.00003 37.651 2.2 100000
128 0.00004 24.931 3.0 100000
256 0.00004 26.991 6.4 100000
512 0.00004 26.427 12.6 100000
1024 0.00004 24.814 23.7 100000
2048 0.00007 15.256 29.1 100000
4096 0.00007 14.633 55.8 100000
8192 0.00008 12.940 98.7 100000
16384 0.00066 1.511 23.1 100000
32768 0.00271 0.369 11.3 100000
65536 0.00707 0.141 8.6 100000
131072 0.01594 0.063 7.7 100000
262144 0.04401 0.023 5.5 100000
524288 0.11226 0.009 4.3 100000
1048576 0.25546 0.004 3.8 100000
2097152 0.52395 0.002 3.7 100000
4194304 0.80179 0.001 4.9 100000
8388608 1.78242 0.001 4.4 100000
Here's one from a 3900X on cygwin-g++. You can clearly see the larger CPU cache, and after that, the higher memory throughput.
chunk ms M1/s GB/s check
16 0.00004 25.000 0.4 100000
32 0.00005 20.000 0.6 100000
64 0.00004 25.000 1.5 100000
128 0.00004 25.000 3.0 100000
256 0.00004 25.000 6.0 100000
512 0.00005 20.000 9.5 100000
1024 0.00004 25.000 23.8 100000
2048 0.00005 20.000 38.1 100000
4096 0.00005 20.000 76.3 100000
8192 0.00010 10.000 76.3 100000
16384 0.00015 6.667 101.7 100000
32768 0.00077 1.299 39.6 100000
65536 0.00039 2.564 156.5 100000
131072 0.00067 1.493 182.2 100000
262144 0.00093 1.075 262.5 100000
524288 0.02679 0.037 18.2 100000
1048576 0.14183 0.007 6.9 100000
2097152 0.26805 0.004 7.3 100000
4194304 0.51644 0.002 7.6 100000
8388608 1.01604 0.001 7.7 100000
So what gives?
With small chunk sizes, 10 million or more calls per second are possible even on old commodity hardware.
Once sizes go beyond the CPU cache, i.e. 1 to 100-ish MB, RAM access quickly dominates the cost (I did not test malloc without actually using the chunks).
Depending on what sizes you malloc, one or the other will be the (ballpark) limit.
However, at something like 10k allocs per second, this is something you can likely ignore for the time being.
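As a rough sanity check of that last point, here is a back-of-the-envelope estimate, not an extra measurement. It takes about 40 ns per allocation for chunks up to 1 KB (0.00004 ms in the first table, which includes the matching free and the touching loop) and the 10,000 calls per second from the question:

```latex
10^{4}\,\frac{\text{calls}}{\text{s}} \times 4\times10^{-8}\,\frac{\text{s}}{\text{call}}
  = 4\times10^{-4}\,\frac{\text{s}}{\text{s}} \approx 0.04\,\%\ \text{of one core}
```

The same estimate with the 8 MB row (about 1.78 ms per call, dominated by touching the memory) gives about 17.8 s of CPU time per second. That is why the chunk size matters.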

Processing YUV I420 from framebuffer?

I have a byte array named buf, that contains a single video frame in YUV I420 format obtained from a framebuffer. For every video frame I also have the following information:
Size (e.g. 320x180)
Stride Y (e.g. 384)
Stride U (e.g. 384)
Stride V (e.g. 384)
Plane offset Y (e.g. 0)
Plane offset U (e.g. 69120)
Plane offset V (e.g. 69312)
Concatenating multiple video frames in a file, and passing that with size information to a raw video decoder in VLC or FFmpeg just produces garbled colors, so I think the bytes in buf should be reordered using the information above to produce playable output, but I'm completely new to working with video so this may be wrong.
In which order should size, stride and offset information be combined with bytes in buf to produce a byte stream that could be played raw in a video player?
Example:
https://transfer.sh/E8LNy5/69518644-example-01.yuv
The layout of the data seems odd but using the given offsets and strides, this is decodable as YUV.
First there are 384 * 180 bytes of luma.
Following are the chroma lines, each being 192 bytes long... but U and V lines take turns! This is accounted for by the strange offsets. U offset points exactly to after luma. V offset is 192 bytes further... and reading would leapfrog by 384 bytes.
Here's code that extracts those planes and assembles them as I420, for decoding with cvtColor:
#!/usr/bin/env python3
import numpy as np
import cv2 as cv

def extract(data, offset, stride, width, height):
    data = data[offset:] # skip to...
    data = data[:height * stride] # get `height` lines
    data.shape = (height, stride)
    return data[:, :width] # drop overscan/padding

width, height = 320, 180
Yoffset = 0
Uoffset = 69120 # 384*180
Voffset = 69312 # 384*180 + 192
Ystride = 384
Ustride = 384
Vstride = 384

data = np.fromfile("69518644-example-01.yuv", dtype=np.uint8)
Y = extract(data, Yoffset, Ystride, width, height)
U = extract(data, Uoffset, Ustride, width // 2, height // 2)
V = extract(data, Voffset, Vstride, width // 2, height // 2)

# construct I420: Y,U,V planes in order
i420 = np.concatenate([Y.flat, U.flat, V.flat])
i420.shape = (height * 3 // 2, width)
result = cv.cvtColor(i420, cv.COLOR_YUV2BGR_I420)

cv.namedWindow("result", cv.WINDOW_NORMAL)
cv.resizeWindow("result", width * 4, height * 4)
cv.imshow("result", result)
cv.waitKey()
cv.destroyAllWindows()
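The same repacking can be done in C before the frames are concatenated into a file. This is a minimal sketch with the following assumptions: 8-bit samples, even width and height, and offsets and strides as reported by the framebuffer. It keeps only `width` (or `width / 2`) bytes of each row, which drops the padding, and writes the Y, U and V planes one after another. The names `copy_plane` and `pack_i420` are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Copy `rows` rows of `row_bytes` bytes from a strided plane that starts at
 * buf + offset into dst, dropping any per-row padding.
 * Returns the advanced destination pointer. */
static unsigned char *copy_plane(unsigned char *dst, const unsigned char *buf,
                                 size_t offset, size_t stride,
                                 size_t row_bytes, size_t rows)
{
    for (size_t r = 0; r < rows; r++) {
        memcpy(dst, buf + offset + r * stride, row_bytes);
        dst += row_bytes;
    }
    return dst;
}

/* Hypothetical helper: repack one framebuffer frame into tightly packed I420
 * (full Y plane, then quarter-size U, then quarter-size V).
 * dst must hold width * height * 3 / 2 bytes. */
void pack_i420(unsigned char *dst, const unsigned char *buf,
               size_t width, size_t height,
               size_t y_off, size_t y_stride,
               size_t u_off, size_t u_stride,
               size_t v_off, size_t v_stride)
{
    dst = copy_plane(dst, buf, y_off, y_stride, width, height);          /* Y */
    dst = copy_plane(dst, buf, u_off, u_stride, width / 2, height / 2);  /* U */
    copy_plane(dst, buf, v_off, v_stride, width / 2, height / 2);        /* V */
}
```

For the example frame, the call would be `pack_i420(dst, buf, 320, 180, 0, 384, 69120, 384, 69312, 384)` with an 86400-byte (320 * 180 * 3 / 2) `dst`. Frames packed this way can be concatenated and given to a raw decoder as 320x180 planar YUV 4:2:0 (FFmpeg's `yuv420p`).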

add pid value to /proc/"pid"/stat

Hi, I need to add the PID number to /proc/%d/stat.
How can I do that?
This is my full code; with it I get the total CPU usage:
unsigned sleep(unsigned sec);
struct cpustat {
    unsigned long t_user;
    unsigned long t_nice;
    unsigned long t_system;
    unsigned long t_idle;
    unsigned long t_iowait;
    unsigned long t_irq;
    unsigned long t_softirq;
};
void skip_lines(FILE *fp, int numlines)
{
    int cnt = 0;
    int ch; // int, not char, so that EOF can be detected reliably
    while((cnt < numlines) && ((ch = getc(fp)) != EOF))
    {
        if (ch == '\n')
            cnt++;
    }
    return;
}
void get_stats(struct cpustat *st, int cpunum)
{
    FILE *fp = fopen("/proc/stat", "r");
    int lskip = cpunum+1;
    skip_lines(fp, lskip);
    char cpun[255];
Obviously, to replace the %d with an integer, you'll need to use sprintf into a buffer as you do in your second example. You could also just use /proc/self/stat to get the stats of the current process, rather than getpid+sprintf.
Your main problem seems to be with the contents/format you're expecting to see. stat contains a single line of info about the process, as described in proc(5). For example:
$ cat /proc/self/stat
27646 (cat) R 3284 27646 3284 34835 27646 4194304 86 0 1 0 0 0 0 0 20 0 1 0 163223159 7618560 210 18446744073709551615 4194304 4240236 140730092885472 0 0 0 0 0 0 0 0 0 17 1 0 0 2 0 0 6340112 6341364 37523456 140730092888335 140730092888355 140730092888355 140730092892143 0
You seem to be skipping some initial lines, and then trying to read something with a different format.
From the proc(5) man page, some of those numbers from /proc/self/stat are related to CPU time:
(14) utime %lu
Amount of time that this process has been scheduled in user mode, measured in clock ticks (divide by sysconf(_SC_CLK_TCK)). This includes guest time, guest_time (time spent running a virtual CPU, see below), so that applications that are not aware of the guest time field do not lose that time from their calculations.
(15) stime %lu
Amount of time that this process has been scheduled in kernel mode, measured in clock ticks (divide by sysconf(_SC_CLK_TCK)).
Together, these give you the total CPU time used by this process since it started. With the above cat program, both numbers are 0 (it runs too fast to accumulate any ticks), but if I do
$ cat /proc/$$/stat
3284 (bash) S 2979 3284 3284 34835 12764 4194304 71122 947545 36 4525 104 66 1930 916 20 0 1 0 6160 24752128 1448 18446744073709551615 4194304 5192964 140726761267456 0 0 0 65536 3670020 1266777851 1 0 0 17 1 0 0 68 0 0 7290352 7326856 30253056 140726761273517 140726761273522 140726761273522 140726761275374 0
you can see that my shell has 104 ticks of user time and 66 ticks of system time.
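Putting the two parts of this answer together, here is a minimal sketch. It builds the path with snprintf and extracts fields 14 and 15 from the single stat line. It assumes Linux. It also jumps to the last ')' first, because the comm field (2) can itself contain spaces or parentheses. The names `read_proc_stat` and `parse_cpu_times` are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read the single line of /proc/<pid>/stat into buf.
 * Returns 0 on success, -1 on failure. */
int read_proc_stat(int pid, char *buf, size_t len)
{
    char path[64];
    snprintf(path, sizeof path, "/proc/%d/stat", pid);
    FILE *fp = fopen(path, "r");
    if (!fp)
        return -1;
    int ok = fgets(buf, (int)len, fp) != NULL;
    fclose(fp);
    return ok ? 0 : -1;
}

/* Hypothetical helper: extract utime (field 14) and stime (field 15), in
 * clock ticks, from a stat line. Returns 0 on success, -1 on failure. */
int parse_cpu_times(const char *line, unsigned long *utime, unsigned long *stime)
{
    const char *p = strrchr(line, ')');   /* end of the comm field (2) */
    if (!p)
        return -1;
    char state;
    /* Fields 3..15: state ppid pgrp session tty_nr tpgid flags
     * minflt cminflt majflt cmajflt utime stime */
    int n = sscanf(p + 1, " %c %*d %*d %*d %*d %*d %*u %*lu %*lu %*lu %*lu %lu %lu",
                   &state, utime, stime);
    return n == 3 ? 0 : -1;
}
```

For the current process, /proc/self/stat works as well. Dividing the tick counts by sysconf(_SC_CLK_TCK) converts them to seconds, as the man page says.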

Solving code-forces "1A Theatre Square" in C

Novice programmer here, trying to get better at C, so I began doing problems on a website called Codeforces. However, I seem to be stuck: I have written code that appears to work in practice, but the website does not accept it as correct.
the problem :
Theatre Square in the capital city of Berland has a rectangular shape with the size n × m meters. On the occasion of the city's anniversary, a decision was taken to pave the Square with square granite flagstones. Each flagstone is of the size a × a. What is the least number of flagstones needed to pave the Square? It's allowed to cover the surface larger than the Theatre Square, but the Square has to be covered. It's not allowed to break the flagstones. The sides of flagstones should be parallel to the sides of the Square.
Source :
https://codeforces.com/problemset/problem/1/A
I did have a hard time completely understanding the math behind the problem and used this source's answer from a user named "Joshua Pan" to better understand the problem
Source :
https://www.quora.com/How-do-I-solve-the-problem-Theatre-Square-on-Codeforces
This is my code :
#include<stdio.h>
#include<math.h>
int main(void)
{
    double n,m,a;
    scanf("%lf %lf %lf", &n,&m,&a);
    printf("%1.lf\n", ceil(n/a)*ceil(m/a));
    return 0;
}
I compiled it using "gcc TheatreSquare.c -lm"
When given the sample input 6 6 4, my code produces the correct output 4; however, the website does not accept this code as correct. I could be wrong, but maybe I'm using format specifiers incorrectly?
Thanks in advance.
Typical double (IEEE754 64-bit floating point) doesn't have enough accuracy for the problem.
For example, for input
999999999 999999999 1
Your program may give output
999999998000000000
While the actual answer is
999999998000000001
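The reason is that a double has a 53-bit significand. Above 2^53, consecutive doubles are at least 2 apart, so an odd integer such as 999999998000000001 cannot be stored exactly. A small sketch that checks whether an integer survives a round trip through double (the name `survives_double` is hypothetical):

```c
#include <stdint.h>

/* Returns 1 if the integer x comes back unchanged after a round trip
 * through double, 0 if the conversion had to round it. */
int survives_double(int64_t x)
{
    double d = (double)x;   /* a double has a 53-bit significand */
    return (int64_t)d == x;
}
```

`survives_double(999999998000000001)` returns 0. The nearest double to that value is exactly 999999998000000000, which is the wrong output quoted above. `survives_double(1000000000000000000)` returns 1, because 10^18 happens to be exactly representable.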
To avoid this, you shouldn't use a floating-point type.
You can add #include <inttypes.h> and use the 64-bit integer type int64_t for this calculation.
"%" SCNd64 is for reading and "%" PRId64 is for writing int64_t.
ceil(n/a) on integers can be done by (n + a - 1) / a.
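Here is a minimal sketch that combines those three suggestions. The function names `flagstones` and `solve_from_stdin` are hypothetical. The bound 1 <= n, m, a <= 10^9 is assumed from the problem's constraints (it matches the largest input in the test log further down). Under that bound the product is at most 10^18, which fits in int64_t.

```c
#include <inttypes.h>
#include <stdio.h>

/* Number of a x a flagstones needed to cover an n x m rectangle.
 * Assumes 1 <= n, m, a <= 1e9, so nothing overflows. */
int64_t flagstones(int64_t n, int64_t m, int64_t a)
{
    int64_t across = (n + a - 1) / a;   /* ceil(n / a) without floating point */
    int64_t down   = (m + a - 1) / a;   /* ceil(m / a) */
    return across * down;               /* at most 1e18 < INT64_MAX */
}

/* Reads "n m a" from stdin and prints the answer; returns 1 on bad input. */
int solve_from_stdin(void)
{
    int64_t n, m, a;
    if (scanf("%" SCNd64 " %" SCNd64 " %" SCNd64, &n, &m, &a) != 3)
        return 1;
    printf("%" PRId64 "\n", flagstones(n, m, a));
    return 0;
}
```

In a submission, `main` would simply `return solve_from_stdin();`.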
You can solve this using integers.
#include <stdio.h>
int main()
{
    unsigned long n, m, a = 1;
    unsigned long na, ma, res = 0;
    scanf("%lu %lu %lu", &n, &m, &a);
    na = n/a;
    if (n%a != 0)
        na++;
    ma = m/a;
    if (m%a != 0)
        ma++;
    res = na * ma;
    printf("%lu", res);
    return 0;
}
This code will fail on the Codeforces platform on test 9 (see below). But if you compile it and run it locally with the same input, the result is correct.
> Test: #9, time: 15 ms., memory: 3608 KB, exit code: 0, checker exit code: 1, verdict: WRONG_ANSWER
> Input 1000000000 1000000000 1
> Output 2808348672 Answer 1000000000000000000
> Checker Log wrong answer 1st numbers differ - expected: '1000000000000000000', found: '2808348672'
EDIT:
The problem described above is most likely because unsigned long is 64 bits on my machine but only 32 bits with the online compiler (as it is on 32-bit builds, and also on 64-bit Windows), so the unsigned long variables overflow.
The following code will pass all the tests.
#include <stdio.h>
int main()
{
    unsigned long long n, m, a = 1;
    unsigned long long na, ma, res = 0;
    scanf("%llu %llu %llu", &n, &m, &a);
    na = n/a;
    if (n%a != 0)
        na++;
    ma = m/a;
    if (m%a != 0)
        ma++;
    res = na * ma;
    printf("%llu", res);
    return 0;
}
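The number in the test log fits that explanation. 10^18 reduced modulo 2^32 is exactly 2808348672, which is what a 32-bit unsigned multiplication of 1000000000 by 1000000000 produces. A small sketch that simulates such a multiplication with uint32_t (the name `wrap32` is hypothetical):

```c
#include <stdint.h>

/* Simulate an unsigned 32-bit multiplication: compute the exact product in
 * 64 bits, then keep only the low 32 bits, as a 32-bit unsigned long would. */
uint32_t wrap32(uint32_t x, uint32_t y)
{
    return (uint32_t)((uint64_t)x * y);
}
```

`wrap32(1000000000, 1000000000)` returns 2808348672, the value in the checker log.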
Use the code below; it will pass all the test cases. We need to use long long for all the variable declarations to get the correct output.
#include <stdio.h>
#include <math.h>
int main(){
    long long n,m,a,l,b;
    scanf("%lld%lld%lld",&n,&m,&a);
    l= n/a;
    if(n%a != 0)
        l++;
    b= m/a;
    if(m%a != 0)
        b++;
    printf("%lld",l*b);
    return 0;
}
import java.util.Scanner;
public class theatre_square {
    public static void main(String[] args) {
        long a,b,c;
        Scanner s = new Scanner(System.in);
        a = s.nextLong();
        b = s.nextLong();
        c = s.nextLong();
        long result = 0;
        if(a>=c){
            if(a%c==0)
                result = a/c;
            else
                result = a/c + 1; // some part is left
        }else{ // area of rectangle < area of square then 1 square is required
            result = 1;
        }
        if(b>=c){
            if(b%c==0)
                result *= b/c;
            else
                result *= b/c + 1;
        }
        System.out.println(result);
    }
}
case 1. 2 2 3 => 1
length = 2, and 2 < 3, so only 1 square is required
breadth = 2, and 2 < 3, so it is already covered by the same square; output 1
initial view
0 0
0 0
after adding 1 square (r = part of a flagstone that sticks out beyond the Square)
1 1 r
1 1 r
r r r
case 2. 6 6 4 => 4
length = 6, and 6 > 4, so 2 squares are required
breadth = 6, and 6 > 4, so 2 squares are required
initial view
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
after adding 4 squares (r = part of a flagstone that sticks out beyond the Square)
1 1 1 1 2 2 r r
1 1 1 1 2 2 r r
1 1 1 1 2 2 r r
1 1 1 1 2 2 r r
3 3 3 3 4 4 r r
3 3 3 3 4 4 r r
r r r r r r r r
r r r r r r r r
You can try the following:
import math
x,y,z=list(map(float, input().split()))
print(math.ceil(x/z)*math.ceil(y/z))
Here is the code for the above problem in CPP. We need a long long variable to store the value as we may have a very large value.
GUIDANCE ABOUT THE QUESTION:
We are told that the flagstone edges must be parallel to the Square's edges, so the length and the height can be covered independently. The rectangle is n × m and each flagstone is a × a (called k in the code below). First we cover the length and count its squares:
we divide n by k, and if there is any remainder, we add one more flagstone. We do the same for the height and multiply the two counts.
I hope it helps.
HERE IS THE CODE
#include<iostream>
using namespace std;
int main()
{
    long long n,m,k,l=0,o=0;
    cin>>n>>m>>k;
    l=n/k;
    if(n%k!=0)
    {
        l++;
    }
    o=m/k;
    if(m%k!=0)
    {
        o++;
    }
    cout<<l*o;
}

Use of '%06.3f' in a C program

#include<stdio.h>
int main(void)
{
    int Fahrenheit;
    for (Fahrenheit = 0; Fahrenheit <= 300; Fahrenheit = Fahrenheit + 20)
        printf("%3d %06.3f\n", Fahrenheit, (5.0/9.0)*(Fahrenheit-32));
    return 0;
}
Output of the source above:
0 -17.778
20 -6.667
40 04.444
60 15.556
80 26.667
100 37.778
120 48.889
140 60.000
160 71.111
180 82.222
200 93.333
220 104.444
240 115.556
260 126.667
280 137.778
300 148.889
Please explain to me the function of '06.3f' in the 'printf' function in the program above.
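0 pad with zeros on the left (after the sign, if any) instead of spaces
6 the whole field, including the sign and the decimal point, should be at least 6 characters long
.3 precision is 3 digits after the decimal point
f the argument is a double (a float argument is promoted to double when passed to printf)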
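Each part can be checked with snprintf. In the small sketch below, the first three values come from the table above. The extra %08.3f case with a negative number shows that the zero padding goes after the sign. The function names are hypothetical.

```c
#include <stdio.h>

/* Format with the same conversion as the question: zero-padded, at least
 * 6 characters wide, 3 digits after the decimal point. */
int format_06_3f(char *buf, size_t len, double value)
{
    return snprintf(buf, len, "%06.3f", value);
}

/* Same flags with a wider field, to show where the zeros go for negatives. */
int format_08_3f(char *buf, size_t len, double value)
{
    return snprintf(buf, len, "%08.3f", value);
}
```

With these helpers, the results are as follows:

- (5.0/9.0)*(40-32) formats as "04.444".
- (5.0/9.0)*(20-32) formats as "-6.667". It is already 6 characters, so no padding is added.
- (5.0/9.0)*(0-32) formats as "-17.778". The width is a minimum, not a maximum.
- -1.0 with %08.3f formats as "-001.000".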
