Storing input values in structs for fastest comparison later (C)

I'm sampling eight input ports and comparing the values up to ten times a second.
These inputs will be XORed against a similar field that indicates which signals are set to "Active Low", and the result is then ANDed with a mask to remove the input signals that are not being compared (though all signals are sampled, whether compared or not).
So this is an example of the sampling. I've created a struct where the signals will be stored and then saved in memory. This struct contains a lot of other values, so replacing the whole struct is not an option. In any case, these input values need to be saved in an efficient way so that I can later perform fast XOR and AND operations with my masks.
void SampleData(){
// These are not all the values to be sampled, only the inputs
currentSample.i0 = RD13_bit;
currentSample.i1 = RD12;
currentSample.i2 = RD11;
currentSample.i3 = RD10;
currentSample.i4 = RE12;
currentSample.i5 = RE13;
currentSample.i6 = RF8;
currentSample.i7 = RF9;
}
This is an example of the comparison I need
void checkInputSignals() {
activated = ((inputValues ^ activeLowInputs) & activeInputsMask);
if (activated) {
importantMethod();
}
}
I've tried a bit-field, but I couldn't get the operators to work, and I have no idea how efficient bit-fields are. Efficiency in this project means speed and convenience rather than memory. How should I store my three fields? If it helps, I am using a dsPIC33EP microprocessor.
If I use a 'char' or 'uint8_t', my sample method would look like this, right? And this does not seem to be the most elegant solution.
unsigned char inputValues;
void SampleData(){
currentSample.i0 = RD13_bit;
currentSample.i1 = RD12;
currentSample.i2 = RD11;
currentSample.i3 = RD10;
currentSample.i4 = RE12;
currentSample.i5 = RE13;
currentSample.i6 = RF8;
currentSample.i7 = RF9;
// For the masking
inputValues += currentSample.i7;
inputValues = (inputValues << 1) + currentSample.i6;
inputValues = (inputValues << 1) + currentSample.i5;
inputValues = (inputValues << 1) + currentSample.i4;
inputValues = (inputValues << 1) + currentSample.i3;
inputValues = (inputValues << 1) + currentSample.i2;
inputValues = (inputValues << 1) + currentSample.i1;
inputValues = (inputValues << 1) + currentSample.i0;
}
And I would have to do the same for my masks, for example.
void ConfigureActiveLowInputs(){
activeLowInputs += currentCalibration->I0_activeLow;
activeLowInputs = (activeLowInputs << 1) + currentCalibration->I1_activeLow;
activeLowInputs = (activeLowInputs << 1) + currentCalibration->I2_activeLow;
activeLowInputs = (activeLowInputs << 1) + currentCalibration->I3_activeLow;
activeLowInputs = (activeLowInputs << 1) + currentCalibration->I4_activeLow;
activeLowInputs = (activeLowInputs << 1) + currentCalibration->I5_activeLow;
activeLowInputs = (activeLowInputs << 1) + currentCalibration->I6_activeLow;
activeLowInputs = (activeLowInputs << 1) + currentCalibration->I7_activeLow;
}
There must be a better solution than bit shifting?

Some things I think you need to know.
Don't use bit fields. Apart from being non-portable, they make this kind of bit-twiddling harder, not easier.
Don't use run-time shifts. Get the compiler to do your work.
Do read code, study and practice. Learning bit-twiddling can be hard, and from your code I don't think you're quite there yet.
If we're going to help there are some things we need to know.
You mention 8 ports. Are they single bit ports, or single ports with multiple bits?
You mention 3 fields. What are they?
Your sample code uses + operators, which are rarely used in bit operations. Why?
In C the code usually ends up with a set of macros and defines, plus a few small functions. It's all quite simple, generates good code, and runs fast without too much effort. If we only knew what you were trying to do.
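For illustration only (the mask names and the choice of which inputs are active low below are made up), the sort of thing meant here is a handful of compile-time masks plus a tiny helper, so every shift is folded by the compiler:
#define IN_BIT(n)        (1u << (n))
#define MASK_I0          IN_BIT(0)
#define MASK_I1          IN_BIT(1)
/* ... one mask per input ... */
#define ACTIVE_LOW_MASK  (MASK_I0 | MASK_I1)            /* example: I0 and I1 are active low */
#define COMPARE_MASK     (MASK_I0 | MASK_I1 | IN_BIT(5)) /* example: signals actually compared */

static inline unsigned char decode_inputs(unsigned char raw)
{
    /* Flip the active-low bits, then keep only the signals being compared. */
    return (unsigned char)((raw ^ ACTIVE_LOW_MASK) & COMPARE_MASK);
}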

You seem to be storing individual bits in separate structure members and then packing them into a word on the fly to be able to apply masks; but it is probably more efficient to pack them into a word in the first place and use a mask to access the individual bits when necessary.
The members i0, i1 etc. are probably unnecessary. It would be simpler to pack the bits directly into a uint8_t member, then write functions or macros to return individual bits where necessary.
uint8_t SampleData()
{
return (RD13_bit << 7 ) |
(RD12 << 6) |
(RD11 << 5) |
(RD10 << 4) |
(RE12 << 3) |
(RE13 << 2) |
(RF8 << 1) |
RF9 ;
}
Then:
currentSample.i = SampleData() ;
Then you can apply masks to that directly. If you need to access individual bits (and if you don't, why make them separate members in the first place?) then for example:
#include <stdbool.h>
#define GETBIT( word, bit ) (((word) & (1u << (bit))) != 0)
bool i6 = GETBIT( currentSample.i, 6 ) ;
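To tie this back to the checkInputSignals() from the question, a sketch (it assumes activeLowInputs and activeInputsMask are packed in the same bit order as SampleData() packs the sample):
currentSample.i = SampleData();
uint8_t activated = (uint8_t)((currentSample.i ^ activeLowInputs) & activeInputsMask);
if (activated) {
    importantMethod();
}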

Related

Difference between two buffers in C/C++

This development is being done on Windows in usermode.
I have two (potentially quite large) buffers, and I would like to know the number of bytes different between the two of them.
I wrote this myself just checking byte by byte, but this resulted in a quite slow implementation. As I'm comparing on the order of hundreds of megabytes, this is undesirable. I'm aware that I could optimize this through many different means, but this seems like a common problem that's probably got optimized solutions already out there, and there's no way I'm going to optimize this as effectively as if it was written by optimization experts.
Perhaps my Googling is inadequate, but I'm unable to find any other C or C++ functions that can count the number of different bytes between two buffers. Is there such a built in function to the C standard library, WinAPI, or C++ standard library that I just don't know of? Or do I need to manually optimize this?
I ended up writing this (perhaps somewhat poorly) optimized code to do the job for me. I was hoping the compiler would vectorize it under the hood, but unfortunately that doesn't appear to be happening, and I didn't feel like digging around the SIMD intrinsics to do it manually. As a result, my bit-fiddling tricks may end up making it slower, but it's still fast enough that it's no more than about 4% of my code's runtime (and almost all of that was memcmp). Whether or not it could be better, it's good enough for me.
I'll note that this is designed to be fast for my use case, where I'm expecting only rare differences.
inline size_t ComputeDifferenceSmall(
_In_reads_bytes_(size) char* buf1,
_In_reads_bytes_(size) char* buf2,
size_t size) {
/* size should be <= 0x1000 bytes */
/* In my case, I expect frequent differences if any at all are present. */
size_t res = 0;
for (size_t i = 0; i < (size & ~0x7); i += 0x8) {
uint64_t diff1 = *reinterpret_cast<uint64_t*>(buf1 + i) ^
*reinterpret_cast<uint64_t*>(buf2 + i);
if (!diff1) continue;
/* Bit fiddle to make each byte 1 if they're different and 0 if the same */
diff1 = ((diff1 & 0xF0F0F0F0F0F0F0F0ULL) >> 4) | (diff1 & 0x0F0F0F0F0F0F0F0FULL);
diff1 = ((diff1 & 0x0C0C0C0C0C0C0C0CULL) >> 2) | (diff1 & 0x0303030303030303ULL);
diff1 = ((diff1 & 0x0202020202020202ULL) >> 1) | (diff1 & 0x0101010101010101ULL);
/* Sum the bytes */
diff1 = (diff1 >> 32) + (diff1 & 0xFFFFFFFFULL);
diff1 = (diff1 >> 16) + (diff1 & 0xFFFFULL);
diff1 = (diff1 >> 8) + (diff1 & 0xFFULL);
diff1 = (diff1 >> 4) + (diff1 & 0xFULL);
res += diff1;
}
for (size_t i = (size & ~0x7); i < size; i++) {
res += (buf1[i] != buf2[i]);
}
return res;
}
size_t ComputeDifference(
_In_reads_bytes_(size) char* buf1,
_In_reads_bytes_(size) char* buf2,
size_t size) {
size_t res = 0;
/* I expect most pages to be identical, and both buffers should be page aligned if
* larger than a page. memcmp has more optimizations than I'll ever come up with,
* so I can just use that to determine if I need to check for differences
* in the page. */
for (size_t pn = 0; pn < (size & ~0xFFF); pn += 0x1000) {
if (memcmp(&buf1[pn], &buf2[pn], 0x1000)) {
res += ComputeDifferenceSmall(&buf1[pn], &buf2[pn], 0x1000);
}
}
return res + ComputeDifferenceSmall(
&buf1[size & ~0xFFF], &buf2[size & ~0xFFF], size & 0xFFF);
}

How can I implement paging, and find the physical memory address knowing the virtual address

I want to implement the initialisation of paging.
Referring to some pages of the OSDev wiki (https://wiki.osdev.org/Paging, https://wiki.osdev.org/Setting_Up_Paging), my own version is very different.
When we look at the page directory, they say that 12 bits are for the flags and the rest is for the address of the page table, so I tried something like this:
void init_paging() {
unsigned int i = 0;
unsigned int __FIRST_PAGE_TABLE__[0x400] __attribute__((aligned(0x1000)));
for (i = 0; i < 0x400; i++) __PAGE_DIRECTORY__[i] = PAGE_PRESENT(0) | PAGE_READ_WRITE;
for (i = 0; i < 0x400; i++) __FIRST_PAGE_TABLE__[i] = ((i * 0x1000) << 12) | PAGE_PRESENT(1) | PAGE_READ_WRITE;
__PAGE_DIRECTORY__[0] = ((unsigned int)__FIRST_PAGE_TABLE__ << 12) | PAGE_PRESENT(1) | PAGE_READ_WRITE;
_EnablingPaging_();
}
This function is meant to give me the physical address for a given virtual address:
void *get_phyaddr(void *virtualaddr) {
unsigned long pdindex = (unsigned long)virtualaddr >> 22;
unsigned long ptindex = (unsigned long)virtualaddr >> 12 & 0x03FF;
unsigned long *pd = (unsigned long *)__PAGE_DIRECTORY__[pdindex];
unsigned long *pt = (unsigned long *)pd[ptindex];
return (void *)(pt + ((unsigned int)virtualaddr & 0xFFF));
}
Am I going in the wrong direction?
Or is it still the same?
Assuming you're trying to identity map the first 4 MiB of the physical address space:
a) for unsigned int __FIRST_PAGE_TABLE__[0x400] __attribute__((aligned(0x1000))); it's a local variable (e.g. likely put on the stack); and it will not survive after the function returns (e.g. the stack space it was using will be overwritten by other functions later), causing the page table to become corrupted. That isn't likely to end well.
b) For __FIRST_PAGE_TABLE__[i] = ((i * 0x1000) << 12) | PAGE_PRESENT(1) | PAGE_READ_WRITE;, you're shifting i twice, once with * 0x1000 (which is the same as << 12) and again with the << 12. This is too much, and it needs to be more like __FIRST_PAGE_TABLE__[i] = (i << 12) | PAGE_PRESENT(1) | PAGE_READ_WRITE;.
c) For __PAGE_DIRECTORY__[0] = ((unsigned int)__FIRST_PAGE_TABLE__ << 12) | PAGE_PRESENT(1) | PAGE_READ_WRITE;, the address is already an address (and not a "page number" that needs to be shifted), so it needs to be more like __PAGE_DIRECTORY__[0] = ((unsigned int)__FIRST_PAGE_TABLE__) | PAGE_PRESENT(1) | PAGE_READ_WRITE;.
Beyond that, I'd very much prefer better use of types. Specifically, you should probably get in the habit of using uint32_t (or uint64_t, or a typedef of your own) for physical addresses to make sure you don't accidentally confuse a virtual address with a physical address (and make sure the compiler complains about the wrong type when you make a mistake); because (even though it's not very important now because you're identity mapping) it will become important "soon". I'd also recommend using uint32_t for page table entries and page directory entries, because they must be 32 bits and not "whatever size the compiler felt like int should be" (note that this is a difference in how you think about the code, which is more important than what the compiler actually does or whether int happens to be 32 bits anyway).
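Pulling those fixes together, a minimal sketch (not drop-in code: the flag macros here are simplified stand-ins for the poster's PAGE_PRESENT()/PAGE_READ_WRITE, and _EnablingPaging_() is assumed to load the directory into CR3 and set CR0.PG as in the original):
#include <stdint.h>

#define PG_PRESENT     0x1u
#define PG_READ_WRITE  0x2u

extern void _EnablingPaging_(void);   /* poster's routine: load CR3, set the PG bit */

/* static storage, so the tables survive after init_paging() returns */
static uint32_t page_directory[1024]   __attribute__((aligned(4096)));
static uint32_t first_page_table[1024] __attribute__((aligned(4096)));

void init_paging(void) {
    for (uint32_t i = 0; i < 1024; i++)
        page_directory[i] = PG_READ_WRITE;                 /* not present yet */
    for (uint32_t i = 0; i < 1024; i++)                    /* identity map the first 4 MiB */
        first_page_table[i] = (i << 12) | PG_PRESENT | PG_READ_WRITE;
    page_directory[0] = ((uint32_t)first_page_table) | PG_PRESENT | PG_READ_WRITE;
    _EnablingPaging_();
}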
When we ask for a page that is not present, we get a page-fault interrupt.
So to avoid that, we can check whether the page is there; otherwise I chose to return 0x0:
physaddr_t *get_phyaddr(void *virtualaddr) {
uint32_t pdindex = (uint32_t)virtualaddr >> 22;
uint32_t ptindex = (uint32_t)virtualaddr >> 12 & 0x03FF;
uint32_t *pd, *pt, ptable;
if ((page_directory[pdindex] & 0x3) == 0x3) {
pd = (uint32_t *)(page_directory[pdindex] & 0xFFFFF000);
if ((pd[ptindex] & 0x3) == 0x3) {
ptable = pd[ptindex] & 0xFFFFF000;
pt = (uint32_t *)ptable;
return (physaddr_t *)((uint32_t)pt + ((uint32_t)virtualaddr & 0xFFF));
} else
return 0x0;
} else
return 0x0;
}

Flipping Pebble Screen Issue

I'm writing a Pebble Time watch app using Pebble SDK 3.0 on the basalt platform that requires text to be displayed upside down.
The logic is:-
Write to screen
Capture screen buffer
Flip screen buffer (using flipHV routine, see below)
Release buffer.
After a fair amount of experimentation I've got it working after a fashion but the (black) text has what seems to be random vertical white lines through it (see image below) which I suspect is something to do with shifting bits.
The subroutine I'm using is:-
void flipHV(GBitmap *bitMap) {
GRect fbb = gbitmap_get_bounds(bitMap);
int Width = 72; // fbb.size.w;
int Height = 84; // fbb.size.h;
uint32_t *pBase = (uint32_t *)gbitmap_get_data(bitMap);
uint32_t *pTopRemainingPixel = pBase;
uint32_t *pBottomRemainingPixel = pBase + (Height * Width);
while (pTopRemainingPixel < pBottomRemainingPixel) {
uint32_t TopPixel = *pTopRemainingPixel;
uint32_t BottomPixel = *pBottomRemainingPixel;
TopPixel = (TopPixel << 16) | (TopPixel >> 16);
*pBottomRemainingPixel = TopPixel;
BottomPixel = (BottomPixel << 16) | (BottomPixel >> 16);
*pTopRemainingPixel = BottomPixel;
pTopRemainingPixel++;
pBottomRemainingPixel--;
}
}
and its purpose is to work through the screen buffer, taking the first pixel and swapping it with the last one, the second one and swapping it with the second-last one, and so on.
Because each 32 bit 'byte' holds 2 pixels I also need to rotate it through 16 bits.
I suspect that that is where the problem lies.
Can someone have a look at my code and see if they can see what is going wrong and put me right. I should say that I'm both a C and Pebble SDK newbie so please explain everything as if to a child!
Your assignments like
TopPixel = (TopPixel << 16) | (TopPixel >> 16)
swap pixels pair-wise
+--+--+ +--+--+
|ab|cd| => |cd|ab|
+--+--+ +--+--+
What you want instead is a full swap:
+--+--+ +--+--+
|ab|cd| => |dc|ba|
+--+--+ +--+--+
That can be done with even more bit-fiddling, e.g.
TopPixel = ((TopPixel << 24) & 0xff000000) | // move d from 0..7 to 24..31
((TopPixel << 8) & 0x00ff0000) | // move c from 8..15 to 16..23
((TopPixel >> 8) & 0x0000ff00) | // move b from 16..23 to 8..15
((TopPixel >> 24) & 0x000000ff); // move a from 24..31 to 0..7
or - way more readable(!) - by using GColor8 instead of uint32_t and a loop on a per-pixel basis:
// only loop to half of the distance to avoid swapping twice
for (int16_t y = 0; y <= max_y / 2; y++) {
for (int16_t x = 0; x <= max_x / 2; x++) {
GColor8 *value_1 = (GColor8 *)(gbitmap_get_data(bmp) + gbitmap_get_bytes_per_row(bmp) * y + x);
GColor8 *value_2 = (GColor8 *)(gbitmap_get_data(bmp) + gbitmap_get_bytes_per_row(bmp) * (max_y - y) + (max_x - x));
// swapping the two pixel values, could be simplified with a SWAP(a,b) macro
GColor8 tmp = *value_1;
*value_1 = *value_2;
*value_2 = tmp;
}
}
Disclaimer: I haven't compiled this code, and the whole pointer arithmetic can be tuned if you see that this is a performance bottleneck.
It turns out that I needed to replace all of the uint32_t with uint8_t and do away with the shifting.
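For anyone else hitting this, a sketch of that byte-per-pixel version (it assumes, as on the full basalt framebuffer, that bytes-per-row equals the pixel width, so there is no row padding):
#include <pebble.h>

static void flip180(GBitmap *bitmap) {
    GRect bounds = gbitmap_get_bounds(bitmap);
    uint16_t row = gbitmap_get_bytes_per_row(bitmap);
    uint8_t *front = gbitmap_get_data(bitmap);
    uint8_t *back  = front + (bounds.size.h * row) - 1;
    while (front < back) {              /* reverse the buffer byte by byte */
        uint8_t tmp = *front;
        *front++ = *back;
        *back--  = tmp;
    }
}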

I need a spatial index in C

I'm working on my gEDA fork and want to get rid of the existing simple tile-based system1 in favour of a real spatial index2.
An algorithm that efficiently finds points is not enough: I need to find objects with non-zero extent. Think in terms of objects having bounding rectangles, that pretty much captures the level of detail I need in the index. Given a search rectangle, I need to be able to efficiently find all objects whose bounding rectangles are inside, or that intersect, the search rectangle.
The index can't be read-only: gschem is a schematic capture program, and the whole point of it is to move things around the schematic diagram. So things are going to be a'changing. So while I can afford insertion to be a bit more expensive than searching, it can't be too much more expensive, and deleting must also be both possible and reasonably cheap. But the most important requirement is the asymptotic behaviour: searching should be O(log n) if it can't be O(1). Insertion / deletion should preferably be O(log n), but O(n) would be okay. I definitely don't want anything > O(n) (per action; obviously O(n log n) is expected for an all-objects operation).
What are my options? I don't feel clever enough to evaluate the various options. Ideally there'd be some C library that will do all the clever stuff for me, but I'll mechanically implement an algorithm I may or may not fully understand if I have to. gEDA uses glib by the way, if that helps to make a recommendation.
Footnotes:
1 Standard gEDA divides a schematic diagram into a fixed number (currently 100) of "tiles" which serve to speed up searches for objects in a bounding rectangle. This is obviously good enough to make most schematics fast enough to search, but the way it's done causes other problems: far too many functions require a pointer to a de-facto global object. The tile geometry is also fixed: it would be possible to defeat this tiling system completely simply by panning (and possibly zooming) to an area covered by only one tile.
2 A legitimate answer would be to keep elements of the tiling system, but to fix its weaknesses: teaching it to span the entire space, and to sub-divide when necessary. But I'd like others to add their two cents before I autocratically decide that this is the best way.
A nice data structure for a mix of points and lines would be an R-tree or one of its derivatives (e.g. R*-Tree or a Hilbert R-Tree). Given you want this index to be dynamic and serializable, I think using SQLite's R*-Tree module would be a reasonable approach.
If you can tolerate C++, libspatialindex has a mature and flexible R-tree implementation which supports dynamic inserts/deletes and serialization.
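For illustration, a minimal sketch of the SQLite route (it assumes a build with the R*-Tree module enabled, i.e. SQLITE_ENABLE_RTREE; the table name and coordinates are made up):
#include <stdio.h>
#include <sqlite3.h>

int main(void) {
    sqlite3 *db;
    sqlite3_stmt *stmt;
    if (sqlite3_open(":memory:", &db) != SQLITE_OK) return 1;

    /* One row per object: its id and the corners of its bounding rectangle. */
    sqlite3_exec(db,
        "CREATE VIRTUAL TABLE objects USING rtree(id, minX, maxX, minY, maxY);"
        "INSERT INTO objects VALUES (1, 10, 20, 10, 20);"
        "INSERT INTO objects VALUES (2, 15, 30,  5, 12);",
        NULL, NULL, NULL);

    /* Everything whose box intersects the search rectangle (0..18) x (0..11). */
    sqlite3_prepare_v2(db,
        "SELECT id FROM objects "
        "WHERE minX <= 18 AND maxX >= 0 AND minY <= 11 AND maxY >= 0;",
        -1, &stmt, NULL);
    while (sqlite3_step(stmt) == SQLITE_ROW)
        printf("hit: %d\n", sqlite3_column_int(stmt, 0));
    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}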
Your needs sound very similar to what is used in collision detection algorithms for games and physics simulations. There are several open source C++ libraries that handle this in 2-D (Box2D) or 3-D (Bullet physics). Although your question is for C, you may find their documentation and implementations useful.
Usually this is split into a two phases:
A fast broad phase that approximates objects by their axis-aligned bounding box (AABB), and determines pairs of AABBs that touch or overlap.
A slower narrow phase that calculates the points of geometric overlap for pairs of objects whose AABBs touch or overlap.
Physics engines also use spatial coherence to further reduce the pairs of objects that are compared, but this optimization probably won't help your application.
The broad phase is usually implemented with an O(N log N) algorithm like sweep and prune. You may be able to accelerate this by using it in conjunction with the current tile approach (one of Nvidia's GPU Gems chapters describes this hybrid approach). The narrow phase is quite costly for each pair, and may be overkill for your needs. The GJK algorithm is often used for convex objects in this step, although faster algorithms exist for more specialized cases (e.g. box/circle and box/sphere collisions).
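The broad-phase test itself is tiny; a sketch of the AABB overlap check, using made-up field names:
typedef struct {
    int min_x, min_y, max_x, max_y;
} AABB;

/* Two boxes touch or overlap iff their intervals overlap on both axes. */
static int aabb_overlaps(const AABB *a, const AABB *b) {
    return a->min_x <= b->max_x && b->min_x <= a->max_x &&
           a->min_y <= b->max_y && b->min_y <= a->max_y;
}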
This sounds to me like an application well-suited to a quadtree (assuming you are interested only in 2D). The quadtree is hierarchical (good for searching) and its spatial resolution is dynamic (allowing higher resolution in areas that need it).
I've always rolled my own quadtrees, but here is a library that appears reasonable: http://www.codeproject.com/Articles/30535/A-Simple-QuadTree-Implementation-in-C
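For a sense of what rolling your own involves, a sketch of the usual node layout (names are illustrative, not from the linked library):
typedef struct Rect { int x, y, w, h; } Rect;

typedef struct QuadObject {
    Rect bbox;                   /* bounding rectangle of the schematic object */
    void *payload;               /* e.g. a pointer back to the gEDA object     */
    struct QuadObject *next;
} QuadObject;

typedef struct QuadNode {
    Rect bounds;                 /* region this node covers                    */
    struct QuadNode *child[4];   /* NW, NE, SW, SE; NULL until subdivided      */
    QuadObject *objects;         /* objects that straddle the split lines      */
} QuadNode;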
It is easy to do. It's hard to do fast. It sounds like a problem I worked on, where there was a vast list of min/max values and, given a value, it had to return how many min/max pairs overlapped that value. You just have it in two dimensions. So you do it with two trees, one for each direction, then do an intersection on the results. This is really fast.
#include <iostream>
#include <fstream>
#include <map>
using namespace std;
typedef unsigned int UInt;
class payLoad {
public:
UInt starts;
UInt finishes;
bool isStart;
bool isFinish;
payLoad ()
{
starts = 0;
finishes = 0;
isStart = false;
isFinish = false;
}
};
typedef map<UInt,payLoad> ExtentMap;
//==============================================================================
class Extents
{
ExtentMap myExtentMap;
public:
void ReadAndInsertExtents ( const char* fileName )
{
UInt start, finish;
ExtentMap::iterator EMStart;
ExtentMap::iterator EMFinish;
ifstream efile ( fileName);
cout << fileName << " filename" << endl;
while (!efile.eof()) {
efile >> start >> finish;
//cout << start << " start " << finish << " finish" << endl;
EMStart = myExtentMap.find(start);
if (EMStart==myExtentMap.end()) {
payLoad pay;
pay.isStart = true;
myExtentMap[start] = pay;
EMStart = myExtentMap.find(start);
}
EMFinish = myExtentMap.find(finish);
if (EMFinish==myExtentMap.end()) {
payLoad pay;
pay.isFinish = true;
myExtentMap[finish] = pay;
EMFinish = myExtentMap.find(finish);
}
EMStart->second.starts++;
EMFinish->second.finishes++;
EMStart->second.isStart = true;
EMFinish->second.isFinish = true;
// for (EMStart=myExtentMap.begin(); EMStart!=myExtentMap.end(); EMStart++)
// cout << "| key " << EMStart->first << " count " << EMStart->second.value << " S " << EMStart->second.isStart << " F " << EMStart->second.isFinish << endl;
}
efile.close();
UInt count = 0;
for (EMStart=myExtentMap.begin(); EMStart!=myExtentMap.end(); EMStart++)
{
count += EMStart->second.starts - EMStart->second.finishes;
EMStart->second.starts = count + EMStart->second.finishes;
}
// for (EMStart=myExtentMap.begin(); EMStart!=myExtentMap.end(); EMStart++)
// cout << "||| key " << EMStart->first << " count " << EMStart->second.starts << " S " << EMStart->second.isStart << " F " << EMStart->second.isFinish << endl;
}
void ReadAndCountNumbers ( const char* fileName )
{
UInt number, count;
ExtentMap::iterator EMStart;
ExtentMap::iterator EMTemp;
if (myExtentMap.empty()) return;
ifstream nfile ( fileName);
cout << fileName << " filename" << endl;
while (!nfile.eof())
{
count = 0;
nfile >> number;
//cout << number << " number ";
EMStart = myExtentMap.find(number);
EMTemp = myExtentMap.end();
if (EMStart==myExtentMap.end()) { // if we don't find the number then create one so we can find the nearest number.
payLoad pay;
myExtentMap[ number ] = pay;
EMStart = EMTemp = myExtentMap.find(number);
if ((EMStart!=myExtentMap.begin()) && (!EMStart->second.isStart))
{
EMStart--;
}
}
if (EMStart->first < number) {
while (!EMStart->second.isFinish) {
//cout << "stepped through looking for end - key" << EMStart->first << endl;
EMStart++;
}
if (EMStart->first >= number) {
count = EMStart->second.starts;
//cout << "found " << count << endl;
}
}
else if (EMStart->first==number) {
count = EMStart->second.starts;
}
cout << count << endl;
//cout << "| count " << count << " key " << EMStart->first << " S " << EMStart->second.isStart << " F " << EMStart->second.isFinish<< " V " << EMStart->second.value << endl;
if (EMTemp != myExtentMap.end())
{
myExtentMap.erase(EMTemp->first);
}
}
nfile.close();
}
};
//==============================================================================
int main (int argc, char* argv[]) {
Extents exts;
exts.ReadAndInsertExtents ( "..//..//extents.txt" );
exts.ReadAndCountNumbers ( "..//../numbers.txt" );
return 0;
}
The extents test file was 1.5 MB of:
0 200000
1 199999
2 199998
3 199997
4 199996
5 199995
....
99995 100005
99996 100004
99997 100003
99998 100002
99999 100001
The numbers file was like:
102731
104279
109316
104859
102165
105762
101464
100755
101068
108442
107777
101193
104299
107080
100958
.....
Even reading the two files from disk (the extents file was 1.5 MB and the numbers file 780 KB), and with that really large number of values and lookups, this runs in a fraction of a second. If it were all in memory it would be lightning quick.

Can I call a "function-like macro" in a header file from a CUDA __global__ function?

This is part of my header file aes_locl.h:
.
.
# define SWAP(x) (_lrotl(x, 8) & 0x00ff00ff | _lrotr(x, 8) & 0xff00ff00)
# define GETU32(p) SWAP(*((u32 *)(p)))
# define PUTU32(ct, st) { *((u32 *)(ct)) = SWAP((st)); }
.
.
Now from the .cu file I have declared a __global__ function and included the header file like this:
#include "aes_locl.h"
.....
__global__ void cudaEncryptKern(u32* _Te0, u32* _Te1, u32* _Te2, u32* _Te3, unsigned char* in, u32* rdk, unsigned long* length)
{
u32 *rk = rdk;
u32 s0, s1, s2, s3, t0, t1, t2, t3;
s0 = GETU32(in + threadIdx.x*(i) ) ^ rk[0];
}
This leads me to the following error message:
error: calling a host function from a __device__/__global__ function is only allowed in device emulation mode
I have sample code where the programmer calls the macro exactly in that way.
Can I call it in this way, or is this not possible at all? If it is not, I will appreciate some hints of what would be the best approach to rewrite the macros and assign the desired value to S0.
thank you very much in advance!!!
I think the problem is not the macros themselves - the compilation process used by nvcc for CUDA code runs the C preprocessor in the usual way and so using header files in this way should be fine. I believe the problem is in your calls to _lrotl and _lrotr.
You ought to be able to check that that is indeed the problem by temporarily removing those calls.
You should check the CUDA programming guide to see what functionality you need to replace those calls to run on the GPU.
The hardware doesn't have a built-in rotate instruction, and so there is no intrinsic to expose it (you can't expose something that doesn't exist!).
It's fairly simple to implement with shifts and masks though, for example if x is 32-bits then to rotate left eight bits you can do:
((x << 8) | (x >> 24))
Where x << 8 will push everything left eight bits (i.e. discarding the leftmost eight bits), x >> 24 will push everything right twenty-four bits (i.e. discarding all but the leftmost eight bits), and bitwise ORing them together gives the result you need.
// # define SWAP(x) (_lrotl(x, 8) & 0x00ff00ff | _lrotr(x, 8) & 0xff00ff00)
# define SWAP(x) (((x << 8) | (x >> 24)) & 0x00ff00ff | ((x >> 8) | (x << 24)) & 0xff00ff00)
You could of course tidy this up by writing the byte swap directly with shifts and masks:
# define SWAP(x) (((x) >> 24) | (((x) >> 8) & 0x0000ff00) | (((x) << 8) & 0x00ff0000) | ((x) << 24))
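If you would rather avoid the macro entirely, a sketch of device-side helpers (names made up) that replace the host-only _lrotl/_lrotr calls:
/* Sketch only: rotate and byte-swap done with shifts, callable from device code. */
__device__ __forceinline__ unsigned int rotl32(unsigned int x, unsigned int n) {
    return (x << n) | (x >> (32u - n));   /* n must be in 1..31 */
}

__device__ __forceinline__ unsigned int bswap32(unsigned int x) {
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
}
/* GETU32(p) then becomes bswap32(*((u32 *)(p))) inside the kernel. */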
The error says what the problem really is. You are calling a function/macro defined in another file (which belongs to the CPU code), from inside the CUDA function. This is impossible!
You cannot call CPU functions/macros/code from a GPU function.
You should put your definitions (does _lrotl() exist in CUDA?) inside the same file that will be compiled by nvcc.
