Converting arm code to use NEON intrinsics - arm

I have been trying to modify the code beneath to work with NEON Intrinsics, thereby creating a speedup. Unfortunately nothing seems to work correctly. Does anyone have any idea what is going wrong? I updated the doubles to single floating point elements.
typedef float REAL;
typedef REAL VEC3[3];
typedef struct driehoek
{
VEC3 norm; /* Face normal. */
REAL d; /* Plane equation D. */
VEC3 *vptr; /* Global vertex list pointer. */
VEC3 *nptr; /* Global normal list pointer. */
INT vindex[3]; /* Index of vertices. */
INT indx; /* Normal component max flag. */
BOOL norminterp; /* Do normal interpolation? */
BOOL vorder; /* Vertex order orientation. */
}driehoek;
typedef struct element
{
INT index;
struct object *parent; /* Ptr back to parent object. */
CHAR *data; /* Pointer to data info. */
BBOX bv; /* Element bounding volume. */
}ELEMENT;
INT TriangleIntersection(RAY *pr, ELEMENT *pe, IRECORD *hit)
{
FLOAT Rd_dot_Pn; /* Polygon normal dot ray direction. */
FLOAT Ro_dot_Pn; /* Polygon normal dot ray origin. */
FLOAT q1, q2;
FLOAT tval; /* Intersection t distance value. */
VEC3 *v1, *v2, *v3; /* Vertex list pointers. */
VEC3 e1, e2, e3; /* Edge vectors. */
driehoek *pt; /* Ptr to triangle data. */
pt = (driehoek *)pe->data;
Rd_dot_Pn = VecDot(pt->norm, pr->D);
if (ABS(Rd_dot_Pn) < RAYEPS) /* Ray is parallel. */
return (0);
hit->b3 = e1[0] * (q2 - (*v1)[1]) - e1[1] * (q1 - (*v1)[0]);
if (!INSIDE(hit->b3, pt->norm[2]))
return (0);
break;
}
return (1);
}

An array declared as float vec[3] is not enough of a hint to the compiler that NEON can be used. The issue is that each element of float vec[3] is individually addressable, so the compiler must keep each one in its own floating-point register. See the gcc NEON intrinsics documentation.
Although 3 dimensions is very common in this Universe, our friends the computers like binary. So you have two data types that can be used for NEON intrinsics: float32x4_t and float32x2_t. You need to use intrinsics such as vfmaq_f32, vsubq_f32, etc. These intrinsics are different for each compiler; I guess you are using gcc. You should use only the intrinsic data types, as combining float32x2_t with a single float can result in movement between register types, which is expensive. If your algorithm can treat each dimension separately, then you might be able to combine types. However, I don't think you will have register pressure, and the SIMD speed-up should be beneficial. I would keep everything in float32x4_t to begin with. You may be able to use the extra dimension for 3D projection when it comes to the rendering phase.
Here is the source to a cmath library called math-neon under LGPL. Instead of using intrinsics with gcc, it uses inline assembler. See also: Neon intrinsics vs assembly.
See also: armcc NEON intrinsics, if you are using the ARM compiler.
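As a minimal sketch of the advice above (not the asker's full routine), the dot product can be written against a vector padded to four floats so that it fits one float32x4_t. The name vec_dot4 and the zero-padded fourth lane are assumptions made for illustration; the scalar fallback only exists so the snippet also builds on non-ARM hosts.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dot product of two 4-float vectors.
   A 3D vector is stored with its fourth lane set to 0.0f. */
static float vec_dot4(const float *a, const float *b)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t va   = vld1q_f32(a);        /* load 4 floats */
    float32x4_t vb   = vld1q_f32(b);
    float32x4_t prod = vmulq_f32(va, vb);   /* lane-wise multiply */
    /* Horizontal add: (p0+p2, p1+p3), then pairwise add. */
    float32x2_t sum2 = vadd_f32(vget_low_f32(prod), vget_high_f32(prod));
    sum2 = vpadd_f32(sum2, sum2);
    return vget_lane_f32(sum2, 0);
#else
    /* Scalar fallback for targets without NEON. */
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
#endif
}
```

Calling vec_dot4 on {1, 2, 3, 0} and {4, 5, 6, 0} gives 32. Keeping the data in four-float arrays from the start avoids repacking three-element vectors on every call.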

Related

Doxygen: Documenting struct as a member of a struct

I would like to document a struct that is a member of another struct so that the generated page has a clickable link for the precalculated structure, the same as for calibrator_calibration_t, and clicking it shows all members of precalculated.
I've tried many different approaches but none of them worked as I needed. Any tip?
/**
* #struct filter_t
* #brief Filter structure
*/
typedef struct
{
calibrator_calibration_t calibration; ///< Copied calibration
blackbox_weight_id_e weight_id;
struct
{
float slope;
float above_mixed;
float under_mixed;
float above_male;
float under_male;
float above_female;
float under_female;
uint32_t stable_counter_minimum;
} precalculated; ///< Precalculated values (for faster calculation) based on settings
} filter_t;
If you want the type of structure member precalculated to be documented with a name and a link to separate documentation of that type, then you must give that type a name or tag. You have not done that. C does not allow you to name it (via typedef) when its definition is inside a struct definition, however, and it is poor style to tag it in that context.
If you can get over your apparent aversion to structure tags and are also unconcerned with the stylistic problems involved, then I anticipate that adding a tag would induce Doxygen to do what you want:
typedef struct
{
calibrator_calibration_t calibration; ///< Copied calibration
blackbox_weight_id_e weight_id;
struct precalc // Note the structure tag here
{
float slope;
float above_mixed;
float under_mixed;
float above_male;
float under_male;
float above_female;
float under_female;
uint32_t stable_counter_minimum;
} precalculated; ///< Precalculated values (for faster calculation) based on settings
} filter_t;
But if you are going to tag the structure type then it would be better form to move it out of the host structure definition, and if you're going to do that, then it appears that your standard convention would be to name it instead of tagging it:
typedef struct {
float slope;
float above_mixed;
float under_mixed;
float above_male;
float under_male;
float above_female;
float under_female;
uint32_t stable_counter_minimum;
} precalculated_t;
typedef struct
{
calibrator_calibration_t calibration; ///< Copied calibration
blackbox_weight_id_e weight_id;
precalculated_t precalculated; ///< Precalculated values (for faster calculation) based on settings
} filter_t;

Using malloc on structs / arrays in other files

UPDATE: The problem with the segmentation fault is not within this function as described below, it is within another function of the same program.
I'm trying to make a program that animates bouncing balls, but I am stuck and can't figure out what I am doing wrong. I believe I have isolated the problem to the function below, and I suspect it has something to do with the new->model statements.
Anyway, upon running the code I get segmentation fault and the values drawn by the function (in terms of triangles) are way out of where they should be. I should be getting values between 0 and 1600 but I end up with 94 million sometimes.
Any help is greatly appreciated!
object_t *create_object(SDL_Surface *surface, triangle_t *model, int numtriangles){
object_t *new=malloc(sizeof(object_t));
new->surface = surface;
new->model = malloc(sizeof(triangle_t)*numtriangles);
*new->model= *model;
new->numtriangles = numtriangles;
new->tx = surface->w/2;
new->ty = surface->h/2;
new->scale = 0.1;
new->rotation = 0.0;
return new;
}
NB! The triangle_t *model pointer points to an array which describes multiple triangles.
EDIT:
Including struct of object:
typedef struct object object_t;
struct object {
float scale; /* Object scale */
float rotation; /* Object rotation */
float tx, ty; /* Position on screen */
float speedx, speedy; /* Object speed in x and y direction */
unsigned int ttl; /* Time till object should be removed from screen */
int numtriangles; /* Number of triangles in model */
triangle_t *model; /* Model triangle array */
SDL_Surface *surface; /* SDL screen */
};
And struct of triangles:
typedef struct triangle triangle_t;
struct triangle {
/* Model coordinates, where each pair resemble a corner */
int x1, y1;
int x2, y2;
int x3, y3;
/* The color the triangle is to be filled with */
unsigned int fillcolor;
/* Scale factor, meaning 0.5 should half the size, 1 keep, and 2.0 double */
float scale;
/* The point (tx, ty) where the center of the teapot should be placed on-screen */
int tx, ty;
/* The degrees the triangle is supposed to be rotated at the current frame */
float rotation;
/*
* Bounding box of on-screen coordinates:
* rect.x - x-coordinate of the bounding box' top left corner
* rect.y - y-coordinate of the bounding box' top left corner
* rect.w - width of the bounding box
* rect.h - height of the bounding box
*/
SDL_Rect rect;
/* On-screen coordinates, where each pair resemble a corner */
int sx1, sy1;
int sx2, sy2;
int sx3, sy3;
};
This line is copying only the first triangle:
*new->model = *model;
From the point of view of your function model is only a pointer to an object. The compiler doesn't know it points to an array of triangles, hence we need to pass the number of triangles in there as an argument.
Replace it with:
memcpy( new->model, model, sizeof(triangle_t)*numtriangles);
Additional comments:
Remember to free the model when freeing the object.
Rename new to something else, such as newObj, if you ever plan to compile this with a C++ compiler, where new is a keyword.
More info:
https://linux.die.net/man/3/memcpy
https://en.cppreference.com/w/c/string/byte/memcpy
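The copy fix and the cleanup advice above can be sketched together as follows. The types tri_t and obj_t and the functions obj_create and obj_destroy are simplified stand-ins invented for this example (the real structs carry more fields, including the SDL_Surface pointer); the allocation checks are an addition, not part of the original function.

```c
#include <stdlib.h>
#include <string.h>

/* Simplified stand-ins for triangle_t and object_t (assumption). */
typedef struct { int x1, y1, x2, y2, x3, y3; } tri_t;
typedef struct { int numtriangles; tri_t *model; } obj_t;

/* Allocate an object that owns a private copy of the whole model array. */
obj_t *obj_create(const tri_t *model, int numtriangles)
{
    if (model == NULL || numtriangles <= 0)
        return NULL;

    obj_t *obj = malloc(sizeof *obj);
    if (obj == NULL)
        return NULL;

    size_t bytes = sizeof *obj->model * (size_t)numtriangles;
    obj->model = malloc(bytes);
    if (obj->model == NULL) {
        free(obj);
        return NULL;
    }

    memcpy(obj->model, model, bytes);   /* copies every triangle */
    obj->numtriangles = numtriangles;
    return obj;
}

/* Release the model first, then the object that owns it. */
void obj_destroy(obj_t *obj)
{
    if (obj == NULL)
        return;
    free(obj->model);
    free(obj);
}
```

Computing the byte count once and reusing it for both malloc and memcpy keeps the two sizes from drifting apart.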
[EDIT]
Regarding the segmentation fault: your function is correct now and will not cause a SEGFAULT unless you are running out of memory, which is very unlikely. If you are running out of memory and getting a SEGFAULT in that function, then either:
you are not deallocating memory correctly somewhere else, and the resulting memory leak is exhausting memory, or
your platform needs more memory, which, although unlikely, is possible, especially on a limited embedded platform.
Post another question with the backtrace of the segfault.

Is there a C language technique that can occupy a bit in a byte?

Given a complex data structure where each sub-structure has a variable that has a domain of {true or false},
(e.g.)
struct dataBlock{
struct {
/* more members */
char status;
} node1;
struct {
/* more members */
char status;
} node2;
/* More nodes */
};
It would be a waste to have 1 byte just for a value of 1 or 0. Is there a C language technique that status in each node will only occupy a bit in a byte? What I can think of is by using MACROS but macros cannot be contained in a local scope right? So having macro status will mean only one macro status in the program. Hence, calling node1.status and node2.status uses the same macro.
You can use a bitfield - this syntax allows you to define how many bits each int in a struct should occupy.
Note, however, that C can only allocate full bytes, so the size of the struct would be rounded up to a multiple of 8 bits in any case.
E.g.:
struct {
int whole_int; /* a whole int, let's assume it's 16 bits. */
int half_int : 8; /* only half an int */
int another_half_int : 8;
} some_struct; /* Total size is 4 bytes: 2 for whole_int, 2 for the two bit-fields */
Having said that, I sincerely doubt you'll notice any performance gain from using this technique, and as Fredrick Gauss commented, it's probably not worth the hassle.
C has a built in feature called bit fields that will get the job done.
Basically, a bit field lets you state how many bits a given member needs, and the compiler packs adjacent bit fields into the same storage unit. In your case, you would do something like this.
struct statusNode {
/* ... */
/* only use 1 bit for this member */
unsigned int status : 1;
/* for example, test only needs 4 bits (range of 0 to 15) */
unsigned int test : 4;
};
struct dataBlock {
struct statusNode node1;
/* ... */
struct statusNode node2;
};
You can assign each member a certain number of bits based on the highest value that you'll ever come across.
You can find more information about bit fields here.
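A short sketch of both options follows: the bit-field struct from the answer above, and a hand-rolled alternative that keeps one status bit per node in a shared byte. The names node_flags, NODE1_STATUS, NODE2_STATUS, flag_set and flag_get are made up for this example, and the exact layout of the bit-field struct is implementation-defined.

```c
#include <stdbool.h>

/* Variant 1: bit-fields. status holds 0..1, test holds 0..15. */
struct node_flags {
    unsigned int status : 1;
    unsigned int test   : 4;
};

/* Variant 2: one shared byte, one mask per node (hypothetical names). */
enum { NODE1_STATUS = 1u << 0, NODE2_STATUS = 1u << 1 };

/* Set or clear the bit selected by mask. */
static void flag_set(unsigned char *flags, unsigned int mask, bool on)
{
    if (on)
        *flags |= (unsigned char)mask;
    else
        *flags &= (unsigned char)~mask;
}

/* Report whether the bit selected by mask is set. */
static bool flag_get(unsigned char flags, unsigned int mask)
{
    return (flags & mask) != 0;
}
```

An unsigned bit-field wraps modulo its width, so storing 17 in the 4-bit test member leaves 1. The mask variant gives full control over which bit belongs to which node, at the cost of moving the flag out of the node struct itself.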

Xcode generates seemingly invalid code from OpenCL kernel

Xcode automatically generates code for submitting OpenCL kernels to dispatch queues but it seems that it generates invalid code which can't be compiled for one of my kernels. The problem is with a struct definition:
typedef struct {
float4 position; ///< The point position.
float4 velocity; ///< The point velocity.
float intensity; ///< The point intensity.
int links[6]; ///< The point links specified as indexes for other points in the array.
float align; ///< Dummy for alignment.
} PointElement;
This code is generated by Xcode:
typedef struct {
cl_float4 position;
cl_float4 velocity;
cl_float intensity;
int [6] links;
cl_float align;
} _PointElement_unalign;
I'm no expert on obscure C syntax variants, but surely int [6] links; is not valid C, and therefore it does not compile.
Why does Xcode do this? Have I done something wrong myself or is this a bug?

C struct size alignment

I want the size of a C struct to be multiple of 16 bytes (16B/32B/48B/..).
It does not matter which size it gets to; it only needs to be multiple of 16 bytes.
How could I enforce the compiler to do that?
For Microsoft Visual C++:
#pragma pack(push, 16)
struct _some_struct
{
...
};
#pragma pack(pop)
For GCC:
struct _some_struct { ... } __attribute__ ((aligned (16)));
Example:
#include <stdio.h>
struct test_t {
int x;
int y;
} __attribute__((aligned(16)));
int main()
{
printf("%zu\n", sizeof(struct test_t)); /* %zu is the conversion for size_t */
return 0;
}
Compiled with gcc -o main main.c, this outputs 16. __attribute__ is a GCC extension (Clang accepts it as well); other compilers need their own syntax.
The size of a C struct depends on its members: their types and how many of them there are. There is no standard way to force the compiler to make a struct a multiple of some size. Some compilers provide a pragma that lets you set the alignment boundary, but that is really a different thing. Some compilers may also offer a setting or pragma that does control the size.
However if you insist on this one method would be to do memory allocation of the struct and to force the memory allocation to round up to the next 16 byte size.
So if you had a struct like this.
struct _simpleStruct {
int iValueA;
int iValueB;
};
Then you could do something like the following.
{
struct _simpleStruct *pStruct = 0;
pStruct = malloc((sizeof(*pStruct) + 15) / 16 * 16); /* round up to a multiple of 16 */
// use the pStruct for whatever
free(pStruct);
}
What this would do is to push the size up to the next 16 byte size so far as you were concerned. However what the memory allocator does may or may not be to give you a block that is actually that size. The block of memory may actually be larger than your request.
If you are going to do something special with this, for instance lets say that you are going to write this struct to a file and you want to know the block size then you would have to do the same calculation used in the malloc() rather than using the sizeof() operator to calculate the size of the struct.
So the next thing would be to write your own sizeof() operator using a macro such as.
#define SIZEOF16(x) ((sizeof(x) + 15) / 16 * 16)
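If the block should also start on a 16-byte boundary, the rounded size can be handed to C11's aligned_alloc, which expects the size to be a multiple of the alignment. This is only a sketch: round_up16 and alloc16 are hypothetical helper names, and it assumes a C library that provides aligned_alloc (not every platform does).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Round n up to a multiple of 16; exact multiples are left unchanged. */
static size_t round_up16(size_t n)
{
    return (n + 15u) / 16u * 16u;
}

/* Allocate at least n bytes, 16-byte aligned and sized in 16-byte units.
   Release the result with free(). n == 0 is not handled specially. */
static void *alloc16(size_t n)
{
    return aligned_alloc(16, round_up16(n));
}
```

With this helper, sizes 1, 16 and 17 round to 16, 16 and 32 bytes respectively.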
As far as I know there is no dependable method for pulling the size of an allocated block from a pointer. Normally a pointer will have a memory allocation block that is used by the memory heap management functions that will contain various memory management information such as the allocated block size which may actually be larger than the requested amount of memory. However the format for this block and where it is located relative to the actual memory address provided will depend on the C compiler's run time.
This depends entirely on the compiler and other tools, since alignment is not specified that deeply in the ISO C standard (it specifies that alignment may happen at the compiler's behest but does not go into detail as to how to enforce it).
You'll need to look into the implementation-specific stuff for your compiler toolchain. It may provide a #pragma pack (or align or some other thing) that you can add to your structure definition.
It may also provide this as a language extension. For example, gcc allows you to add attributes to a definition, one of which controls alignment:
struct mystruct { int val[7]; } __attribute__ ((aligned (16)));
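Since C11 there is also a standard spelling for this. The sketch below assumes a C11 compiler; mystruct16 is a made-up name. Aligning one member to 16 raises the alignment of the whole struct, and because a struct's size is always a multiple of its alignment, the size becomes a multiple of 16 as well. The static assertions make the build fail if that ever stops being true.

```c
/* C11: _Alignas on a member raises the alignment of the whole struct. */
struct mystruct16 {
    _Alignas(16) int val[7];
};

/* Compile-time checks of the property the question asks for. */
_Static_assert(_Alignof(struct mystruct16) == 16,
               "struct must be 16-byte aligned");
_Static_assert(sizeof(struct mystruct16) % 16 == 0,
               "struct size must be a multiple of 16 bytes");
```

With a 4-byte int the seven-element array occupies 28 bytes, so the compiler adds 4 bytes of tail padding and sizeof reports 32.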
You could perhaps do a double struct, wrapping your actual struct in a second one that can add padding:
struct payload {
int a; /*Your actual fields. */
float b;
char c;
double d;
};
struct payload_padded {
    union { /* C11 anonymous union: p and padding share the same storage */
        struct payload p;
        char padding[16 * ((sizeof (struct payload) + 15) / 16)];
    };
};
Then you can work with the padded struct:
struct payload_padded a;
a.p.d = 43.3;
Of course, you can make use of the fact that the first member of a structure starts 0 bytes from where the structure starts, and treat a pointer to struct payload_padded as if it's a pointer to a struct payload (because it is):
double d_plus_2(const struct payload *p)
{
return p->d + 2;
}
/* ... */
struct payload_padded b;
const double dp2 = d_plus_2((struct payload *) &b);
