NumPy has an amazing CPU dispatch mechanism that allows it to work out at runtime which instruction sets a CPU supports. It uses that information to "slot"/choose an optimized kernel for the requested call, i.e., machines with AVX support run an AVX-optimized loop, machines with just SSE2 use an SSE2 one, etc.
I'm trying to work out how this works under the hood, so I am reading NumPy's source code (which is Python, C, and C++), and I have encountered something I can't make sense of:
// copied from numpy/core/src/umath/loops_minmax.dispatch.c.src
// contiguous input.
static inline void
simd_reduce_c_#intrin#_#sfx#(const npyv_lanetype_#sfx# *ip, npyv_lanetype_#sfx# *op1, npy_intp len)
{
    if (len < 1) {
        return;
    }
    const int vstep = npyv_nlanes_#sfx#;
    const int wstep = vstep*8;
    npyv_#sfx# acc = npyv_setall_#sfx#(op1[0]);
    for (; len >= wstep; len -= wstep, ip += wstep) {
    #ifdef NPY_HAVE_SSE2
        NPY_PREFETCH(ip + wstep, 0, 3);
    #endif
        npyv_#sfx# v0 = npyv_load_#sfx#(ip + vstep * 0);
        npyv_#sfx# v1 = npyv_load_#sfx#(ip + vstep * 1);
        npyv_#sfx# v2 = npyv_load_#sfx#(ip + vstep * 2);
        npyv_#sfx# v3 = npyv_load_#sfx#(ip + vstep * 3);
        npyv_#sfx# v4 = npyv_load_#sfx#(ip + vstep * 4);
        npyv_#sfx# v5 = npyv_load_#sfx#(ip + vstep * 5);
        npyv_#sfx# v6 = npyv_load_#sfx#(ip + vstep * 6);
        npyv_#sfx# v7 = npyv_load_#sfx#(ip + vstep * 7);
        npyv_#sfx# r01 = V_INTRIN(v0, v1);
        npyv_#sfx# r23 = V_INTRIN(v2, v3);
        npyv_#sfx# r45 = V_INTRIN(v4, v5);
        npyv_#sfx# r67 = V_INTRIN(v6, v7);
        acc = V_INTRIN(acc, V_INTRIN(V_INTRIN(r01, r23), V_INTRIN(r45, r67)));
    }
    for (; len >= vstep; len -= vstep, ip += vstep) {
        acc = V_INTRIN(acc, npyv_load_#sfx#(ip));
    }
    npyv_lanetype_#sfx# r = V_REDUCE_INTRIN(acc);
    // Scalar - finish up any remaining iterations
    for (; len > 0; --len, ++ip) {
        const npyv_lanetype_#sfx# in2 = *ip;
        r = SCALAR_OP(r, in2);
    }
    op1[0] = r;
}
What does the # symbol do here? As far as I know, # is not a valid character in C outside of preprocessor directives, so I am a bit lost. I assume it is templating magic that replaces, e.g., #sfx# with a suffix for a vector extension, but how does this work?
As the comments already suspected, this is indeed a substitution/templating mechanism. This time, however, it is a homebrew one that lives inside numpy's distutils module (the module responsible for building NumPy on Python < 3.12).
For the most part, it allows looping and variable substitution. A repeated block is declared using /**begin repeat ... */ and /**end repeat**/. A variable is declared within the block comment that opens the loop using * #var = 1, 2, 3, ...#. Nested loops are also supported and are identified by a number after repeat, e.g. /**begin repeat1 ... for the first nesting level. Inside a loop, the phrase #var# is then substituted with the respective value of the variable.
An example source file like the following
/**begin repeat
* #a = 1,2,3#
* #b = 1,2,3#
*/
/**begin repeat1
* #c = ted, jim#
*/
#a#, #b#, #c#
/**end repeat1**/
/**end repeat**/
processed with the following snippet
from numpy.distutils.conv_template import process_file
from pathlib import Path

generated = process_file("test.c.src")
# write somewhere
Path("test.c").write_text(generated)
produces:
#line 1 "test.c.src"
/*
*****************************************************************************
** This file was autogenerated from a template DO NOT EDIT!!!! **
** Changes should be made to the original source (.src) file **
*****************************************************************************
*/
#line 1
#line 5
#line 8
1, 1, ted
#line 8
1, 1, jim
#line 5
#line 8
2, 2, ted
#line 8
2, 2, jim
#line 5
#line 8
3, 3, ted
#line 8
3, 3, jim
Going back to the snippet I shared in my question, the relevant bit to disentangle the code is:
/**begin repeat
* #sfx = s8, u8, s16, u16, s32, u32, s64, u64, f32, f64#
* #simd_chk = NPY_SIMD*8, NPY_SIMD_F32, NPY_SIMD_F64#
* #is_fp = 0*8, 1, 1#
* #scalar_sfx = i*8, f, d#
*/
/**begin repeat1
* # intrin = max, min, maxp, minp#
* # fp_only = 0, 0, 1, 1#
*/
a bit higher up in the same file.
Essentially, this means that the template will generate a whopping 40 reduction loops for the various combinations of dtypes (#sfx#) and reduction operators (#intrin#). (Note: the result is further substituted during the preprocessor stage, where macros like npyv_load_f64 are replaced by snippets that perform the requested operation for the instruction set being compiled for.)
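To make this concrete, here is a hand-written sketch (not code taken from NumPy) of what the core of one generated instance, say simd_reduce_c_max_f64, effectively becomes on an AVX2 build, where npyv_f64 maps to __m256d, npyv_nlanes_f64 is 4, npyv_load_f64 becomes an unaligned load, and V_INTRIN resolves to _mm256_max_pd; the 8x-unrolled block is omitted for brevity:

#include <immintrin.h>
#include <stddef.h>

/* Sketch only: the real generated code also contains the 8x-unrolled
   block and the prefetching shown in the question. */
static double reduce_max_f64_avx2(const double *ip, double seed, ptrdiff_t len)
{
    __m256d acc = _mm256_set1_pd(seed);                /* npyv_setall_f64(op1[0]) */
    const ptrdiff_t vstep = 4;                         /* npyv_nlanes_f64 */
    for (; len >= vstep; len -= vstep, ip += vstep) {
        acc = _mm256_max_pd(acc, _mm256_loadu_pd(ip)); /* V_INTRIN + npyv_load_f64 */
    }
    /* horizontal reduction: the V_REDUCE_INTRIN step */
    __m128d lo = _mm256_castpd256_pd128(acc);
    __m128d hi = _mm256_extractf128_pd(acc, 1);
    __m128d m2 = _mm_max_pd(lo, hi);
    m2 = _mm_max_sd(m2, _mm_unpackhi_pd(m2, m2));
    double r = _mm_cvtsd_f64(m2);
    /* scalar tail: the SCALAR_OP loop */
    for (; len > 0; --len, ++ip) {
        r = (*ip > r) ? *ip : r;
    }
    return r;
}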
I have a function in my library which computes N (N = 500 to 2000) explicit, rather simple operations, but it is called hundreds of thousands of times by the main software. Each small computation is independent of the others, and each one is slightly different (the polynomial coefficients and sometimes other features vary), so there is no loop; the cases are hard-coded into the function.
Unfortunately, the calls (the loop) in the main software cannot be threaded, because the code that runs before the actual call to this particular function is not thread safe. (It's a bigger software package I have to deal with here...)
I already tried creating a team of OpenMP threads at the beginning of this function and executing the computations in, e.g., 4 blocks via OpenMP's sections functionality (a rough sketch of that attempt follows the code below), but it seems that the overhead of the thread creation, #pragma omp parallel, was too high. (Can that be?)
Any nice ideas on how to speed up this kind of situation? Perhaps applying SIMD features, but how would that work when I don't have an explicit for loop to deal with?
#include "needed.h"
void eval_func (const double x, const double y, const double * __restrict__ z, double * __restrict__ out1, double * __restrict__ out2) {
double logx = log(x);
double tmp1;
double tmp2;
//calculation 1
tmp1 = exp(3.6 + 2.7 * logx - (3.1e+03 / x));
out1[0] = z[6] * z[5] * tmp1;
if (x <= 1.0) {
tmp2 = (-4.1 + 9.2e-01 * logx + x * (-3.3e-03 + x * (2.95e-06 + x * (-1.4e-09 + 3.2e-13 * x))) - 8.8e+02 / x);
} else {
tmp2 = (2.71e+00 + -3.3e-01 * logx + x * (3.4e-04 + x * (-6.8e-08 + x * (8.7e-12 + -4.2e-16 * x))) - 1.0e+03 / x);
}
tmp2 = 1.3 * exp(tmp2);
out2[0] = z[3] * z[7] * tmp1 / tmp2;
//calculation 2
.
.
out1[1] = ...
out2[1] = ...
//calculation N
.
.
out1[N-1] = ...
out2[N-1] = ...
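For reference, a rough sketch of the OpenMP sections attempt described above (the block boundaries are hypothetical; each section would contain a quarter of the hard-coded calculations):

// Hypothetical sketch of the attempted approach, not the actual code:
// one parallel region per call, work split into fixed blocks via sections.
// The per-call cost of spinning up this region is the overhead in question.
void eval_func (const double x, const double y, const double * __restrict__ z,
                double * __restrict__ out1, double * __restrict__ out2) {
    #pragma omp parallel sections num_threads(4)
    {
        #pragma omp section
        { /* calculations 1 .. N/4 */ }
        #pragma omp section
        { /* calculations N/4+1 .. N/2 */ }
        #pragma omp section
        { /* calculations N/2+1 .. 3N/4 */ }
        #pragma omp section
        { /* calculations 3N/4+1 .. N */ }
    }
}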
I need to do a color space conversion from RGB to YCbCr in C for my homework. First, I get the R, G, B values for each pixel of a BMP file. Then I use the code shown below. But I cannot get the R, G, B values of the pixels. How can I do that?
struct YCbCr ycbcr;
ycbcr.Y = (float)(0.2989 * fr + 0.5866 * fg + 0.1145 * fb);
ycbcr.Cb = (float)(-0.1687 * fr - 0.3313 * fg + 0.5000 * fb);
ycbcr.Cr = (float)(0.5000 * fr - 0.4184 * fg - 0.0816 * fb);
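For reference, here is a minimal sketch of pulling per-pixel R, G, B values out of a BMP, assuming an uncompressed 24-bit file (the name image.bmp is hypothetical, and real code should also validate the header and handle top-down images where the height is negative):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

int main(void) {
    FILE *f = fopen("image.bmp", "rb");   /* hypothetical input file */
    if (!f) return 1;

    /* BITMAPFILEHEADER (14 bytes) + BITMAPINFOHEADER (40 bytes) */
    uint8_t h[54];
    if (fread(h, 1, 54, f) != 54) return 1;
    uint32_t offset = h[10] | (uint32_t)h[11] << 8 | (uint32_t)h[12] << 16 | (uint32_t)h[13] << 24;
    int32_t width   = h[18] | (int32_t)h[19] << 8 | (int32_t)h[20] << 16 | (int32_t)h[21] << 24;
    int32_t height  = h[22] | (int32_t)h[23] << 8 | (int32_t)h[24] << 16 | (int32_t)h[25] << 24;

    int row_size = (3 * width + 3) & ~3;  /* each row is padded to 4 bytes */
    uint8_t *row = malloc(row_size);
    fseek(f, offset, SEEK_SET);
    for (int y = 0; y < height; y++) {    /* rows are stored bottom-up */
        fread(row, 1, row_size, f);
        for (int x = 0; x < width; x++) {
            float fb = row[3 * x + 0];    /* BMP stores channels as B, G, R */
            float fg = row[3 * x + 1];
            float fr = row[3 * x + 2];
            /* feed fr, fg, fb into the YCbCr formulas above */
        }
    }
    free(row);
    fclose(f);
    return 0;
}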
I'm trying to add parallelism to this function. I want it to use as many threads as possible and write the results to a file.
The results need to be written to the file in increasing order, so the first result must be written first, the second one second, and so on.
The keyGen function is simply an MD5 of the integer m, which is used as the starting point for each chain. reduction32 is a reduction function: it takes the first 8 bytes, adds t, and returns that value. When a chain reaches its endpoint, the endpoint is stored in the binary file.
Is there a smart way to make this parallel without messing up the order in which the endpoints are stored?
void tableGenerator32(uint32_t * text){
    int mMax = 33554432, lMax = 236;
    int m, t, i;
    uint16_t * temp;
    uint16_t * key, ep[2];
    uint32_t tp;
    FILE * write_ptr;

    write_ptr = fopen("table32bits.bin", "wb");
    for(m = 0; m < mMax; m++){
        key = keyGen(m);
        for (t = 0; t < lMax; t++){
            keyschedule(key);
            temp = kasumi_enc(text);
            tp = reduction32(t, temp);
            temp[0] = tp >> 16;
            temp[1] = tp;
            for(i = 0; i < 8; i++){
                key[i] = temp[i % 2];
            }
        }
        for(i = 0; i < 2; i++)
            ep[i] = key[i];
        fwrite(ep, sizeof(ep), 1, write_ptr);
    }
    fclose(write_ptr);
}
The best way to parallelize the above function without running into concurrency issues is to create as many memory buffers as threads you wish to use and then divide the task into equal fractions. For example, with 4 threads:
one thread performs the task from 0 to mMax / 4
one thread performs the task from mMax / 4 to (mMax / 4) * 2
one thread performs the task from (mMax / 4) * 2 to (mMax / 4) * 3
one thread performs the task from (mMax / 4) * 3 to (mMax / 4) * 4
Then you concatenate the result buffers in thread order and write them to the file, which preserves the original ordering; see the sketch below.
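A minimal sketch of that idea, assuming OpenMP is available (the question doesn't name a threading library) and assuming keyGen, keyschedule, kasumi_enc, and reduction32 from the question are thread-safe or use thread-local state:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* keyGen, keyschedule, kasumi_enc, reduction32: declared in the
   question's headers; their thread-safety is an assumption here. */
void tableGenerator32_parallel(uint32_t * text){
    const int mMax = 33554432, lMax = 236;
    /* One slot per chain, indexed by m: each endpoint lands at a fixed
       position, so the order is preserved no matter which thread runs it.
       2 * mMax uint16_t values is roughly 128 MB. */
    uint16_t *eps = malloc((size_t)mMax * 2 * sizeof(uint16_t));

    #pragma omp parallel for schedule(static)
    for (int m = 0; m < mMax; m++){
        uint16_t *key = keyGen(m);
        for (int t = 0; t < lMax; t++){
            keyschedule(key);
            uint16_t *temp = kasumi_enc(text);
            uint32_t tp = reduction32(t, temp);
            temp[0] = tp >> 16;
            temp[1] = tp;
            for (int i = 0; i < 8; i++)
                key[i] = temp[i % 2];
        }
        eps[2 * m]     = key[0];   /* endpoint for chain m, slot fixed by m */
        eps[2 * m + 1] = key[1];
    }

    /* A single sequential write afterwards preserves the ordering by m. */
    FILE *write_ptr = fopen("table32bits.bin", "wb");
    fwrite(eps, sizeof(uint16_t), (size_t)mMax * 2, write_ptr);
    fclose(write_ptr);
    free(eps);
}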
I have a problem running example code from HDF5 called h5_rdwt.c (in Eclipse). You can find it here:
http://www.hdfgroup.org/HDF5/Tutor/rdwt.html#rdwr
The code is:
/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* Copyright by The HDF Group. *
* Copyright by the Board of Trustees of the University of Illinois. *
* All rights reserved. *
* *
* This file is part of HDF5. The full HDF5 copyright notice, including *
* terms governing use, modification, and redistribution, is contained in *
* the files COPYING and Copyright.html. COPYING can be found at the root *
* of the source code distribution tree; Copyright.html can be found at the *
* root level of an installed copy of the electronic HDF5 document set and *
* is linked from the top-level documents page. It can also be found at *
* http://hdfgroup.org/HDF5/doc/Copyright.html. If you do not have *
* access to either file, you may request a copy from help@hdfgroup.org. *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
/*
* This example illustrates how to write and read data in an existing
* dataset. It is used in the HDF5 Tutorial.
*/
#include "hdf5.h"
#define FILE "dset.h5"
int main() {
    hid_t file_id, dataset_id; /* identifiers */
    herr_t status;
    int i, j, dset_data[4][6];

    /* Initialize the dataset. */
    for (i = 0; i < 4; i++)
        for (j = 0; j < 6; j++)
            dset_data[i][j] = i * 6 + j + 1;

    /* Open an existing file. */
    file_id = H5Fopen(FILE, H5F_ACC_RDWR, H5P_DEFAULT);

    /* Open an existing dataset. */
    dataset_id = H5Dopen2(file_id, "/dset", H5P_DEFAULT);

    /* Write the dataset. */
    status = H5Dwrite(dataset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT,
                      dset_data);

    status = H5Dread(dataset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT,
                     dset_data);

    /* Close the dataset. */
    status = H5Dclose(dataset_id);

    /* Close the file. */
    status = H5Fclose(file_id);
}
Before I did so, I created a file named "dset.h5" with:
#include "hdf5.h"
#define FILE "dset.h5"
int main() {
    hid_t file_id; /* file identifier */
    herr_t status;

    /* Create a new file using default properties. */
    file_id = H5Fcreate(FILE, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* Terminate access to the file. */
    status = H5Fclose(file_id);
}
Building is no problem, but when I try to run this I get the message:
HDF5-DIAG: Error detected in HDF5 (1.8.11) thread 0:
  #000: ../../src/H5D.c line 334 in H5Dopen2(): not found
    major: Dataset
    minor: Object not found
  #001: ../../src/H5Gloc.c line 430 in H5G_loc_find(): can't find object
    major: Symbol table
    minor: Object not found
  #002: ../../src/H5Gtraverse.c line 861 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #003: ../../src/H5Gtraverse.c line 641 in H5G_traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #004: ../../src/H5Gloc.c line 385 in H5G_loc_find_cb(): object 'dset' doesn't exist
    major: Symbol table
    minor: Object not found
HDF5-DIAG: Error detected in HDF5 (1.8.11) thread 0:
  #000: ../../src/H5Dio.c line 234 in H5Dwrite(): can't prepare for writing data
    major: Dataset
    minor: Write failed
  #001: ../../src/H5Dio.c line 266 in H5D__pre_write(): not a dataset
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.8.11) thread 0:
  #000: ../../src/H5Dio.c line 140 in H5Dread(): not a dataset
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.8.11) thread 0:
  #000: ../../src/H5D.c line 391 in H5Dclose(): not a dataset
    major: Invalid arguments to routine
    minor: Inappropriate type
Does someone know what went wrong?
Thank you!
The main routine in your program includes the lines
/* Open an existing dataset. */
dataset_id = H5Dopen2(file_id, "/dset", H5P_DEFAULT);
but there is no evidence from what you have posted that the file in question contains any dataset to open. You seem to have created a file called dset.h5, but a file is not the same thing as a dataset. From what you've posted, your file is empty.
The clue to this is given in the error report, which states, inter alia:
HDF5-DIAG: Error detected in HDF5 (1.8.11) thread 0:
  #000: ../../src/H5D.c line 334 in H5Dopen2(): not found
    major: Dataset
    minor: Object not found
You have to create the dataset before you can access it. There is an example for that on the HDF5 page as well:
ftp://www.hdfgroup.org/HDF5/examples/introductory/C/h5_crtdat.c
In short: use H5Screate_simple to create a dataspace and H5Dcreate2 to create the dataset, as in the sketch below.
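A minimal sketch, modeled on that h5_crtdat.c example, that creates a 4x6 dataset "/dset" in dset.h5 so your read/write program can open it (the dimensions match the 4x6 array in your code; the on-disk type H5T_STD_I32BE follows the tutorial, though another integer type would also work):

#include "hdf5.h"
#define FILE "dset.h5"

int main() {
    hid_t file_id, dataspace_id, dataset_id;
    herr_t status;
    hsize_t dims[2] = {4, 6};   /* same shape as dset_data[4][6] above */

    /* Create the file (truncating the previous, empty one). */
    file_id = H5Fcreate(FILE, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* Create the dataspace, then the dataset "/dset" inside the file. */
    dataspace_id = H5Screate_simple(2, dims, NULL);
    dataset_id = H5Dcreate2(file_id, "/dset", H5T_STD_I32BE, dataspace_id,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    status = H5Dclose(dataset_id);
    status = H5Sclose(dataspace_id);
    status = H5Fclose(file_id);
    return 0;
}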