help to understand macro - c

I'm having trouble understanding a piece of code in an MTD driver:
#define ROUNDUP(x, y) ((((x)+((y)-1))/(y))*(y))
...
static struct mtd_partition bcm947xx_parts[] =
{
    {
        .name = "boot",
        .size = 0,
        .offset = 0,
        .mask_flags = MTD_WRITEABLE
    },
    {
        .name = "linux",
        .size = 0,
        .offset = 0
    },
    {
        .name = "rootfs",
        .size = 0,
        .offset = 0,
        .mask_flags = MTD_WRITEABLE
    },
    {
        .name = "nvram",
        .size = 0,
        .offset = 0
    },
    {
        .name = 0,
        .size = 0,
        .offset = 0
    }
};
...
i = (sizeof(bcm947xx_parts)/sizeof(struct mtd_partition)) - 2;
bcm947xx_parts[i].size = ROUNDUP(NVRAM_SPACE, mtd->erasesize);
bcm947xx_parts[i].offset = size - bcm947xx_parts[i].size;
So here are my questions:
1) Why is it necessary to round up the size of the partition?
2) Could you help me understand how the rounding works?
3) The flash driver in the boot loader on the same platform doesn't do the rounding for this specific partition, so the flash layout has different offsets on the kernel side and in the bootloader. What is the reason for this?
Thanks in advance for any valuable comments !

(1) Flash memory comes in multiples of its erase size. (Apparently. At least, this is what the quoted code tells me.) This means there is a gap between the end of the NVRAM and whatever comes next, and this gap is smaller than one erase block. In flash, it's convenient not to put two objects with different rewrite schedules in a single erase block: changing either object requires the flash storage controller to copy the block to a temporary store, apply the partial update to that copy, erase the block (slow-ish), and write the updated block back to the main store. (It can instead reuse a different, previously erased block and thread it in place of the original block. But this is considered a high-tech optimization.)
(2) How to parse macros:
((((x)+((y)-1))/(y))*(y))
Step 1, remove the parens around the arguments that make sure that complicated expressions passed as arguments don't suddenly rebind in unexpected ways due to operator precedence.
(((x+(y-1))/y)*y)
Step 2, remove paranoid parens for operations that clearly have the indicated precedence.
(x+y-1)/y*y
Step 3, use your C parsing rules, not your algebra rules. If x and y are integral types (not enough information in your code to be certain of this), then the division is integer division, so translate from C to math.
floor((x+y-1)/y)*y
Step 4, read. If x is a multiple of y, then since y-1 is too small to be a multiple of y, the operation just gives back x. If x is 1 more than a multiple of y, then the +y-1 pushes the numerator over the next multiple of y and the result is the smallest multiple of y that happens to be larger than x. In fact, if x is between 1 more and y-1 more than a multiple of y, the "+y-1" bumps the numerator up over the next multiple of y and the result of rounding up is the smallest multiple of y larger than x.
What we find, therefore is that ROUNDUP(x,y) rounds x up to the smallest multiple of y that happens to be greater than or equal to x. Additionally, this macro evaluates its second argument more than once: don't put expressions with side effects in the second slot unless you want those side effects to happen three times per call. (Consider int i = 3; ROUNDUP(6,i++) and wonder which subexpressions are evaluated before and which after each of the three increments of i.)
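To make the rounding concrete, here is a small sketch that wraps the macro from the question in a helper function. The helper name roundup_u32 and the sizes used in the note below (an erase size of 0x10000 and an NVRAM size of 0x8000) are made up for illustration and are not taken from the driver.

```c
#include <stdint.h>

/* The macro exactly as quoted in the question. */
#define ROUNDUP(x, y) ((((x)+((y)-1))/(y))*(y))

/* Hypothetical helper (not part of the driver): wrapping the macro in a
 * function evaluates each argument exactly once, which sidesteps the
 * side-effect trap described above. */
static uint32_t roundup_u32(uint32_t x, uint32_t y)
{
    return ROUNDUP(x, y);
}
```

With these assumed values, roundup_u32(0x8000, 0x10000) gives 0x10000, roundup_u32(0x10000, 0x10000) stays at 0x10000, and roundup_u32(0x10001, 0x10000) gives 0x20000.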
(3) No idea. No one told the bootloader writer that NVRAMs only come in multiples of erasesize?

Related

How to implement input independent logical shift in software?

I'm trying to implement AES/DES/.. encryption/decryption in software without using any input dependent operations (specifically only using constant time not, and, or, xor operations and input independent array indexing/loops).
Is there any way to implement input independent logical shift (someconst << key[3] & 5 etc.)?
Array indexing with an input-dependent variable, hardware shifts with an input-dependent shift count, and input-dependent conditional jumps must all be avoided. I don't care about code size or speed.
Depending on your requirements and which operations you can assume to be constant time, this code needs some additional modifications.
However, it might point you in the right direction (as the SELECT primitive is quite powerful for side-channel free code):
#include <stdint.h>
#define MAX_SHIFT 32 // maximum amount to be shifted
// this may not be constant time.
// However, you can find different (more ugly) ways to achieve the same thing.
// 1 -> 0
// 0 -> 0xff...
#define MASK(cond) ((uint32_t)(cond) - 1u)
// again, make sure everything here is constant time according to your threat model
// (0, x, y) -> y
// (i, x, y) -> x (i != 0)
#define SELECT(cond, A, B) ((MASK(!(cond)) & (A)) | (MASK(!!(cond)) & (B)))
// unsigned types are used so that shifting bits out of the top is well defined
uint32_t shift(uint32_t value, uint32_t amount){
    uint32_t result = value;
    for(uint32_t i = 0; i <= MAX_SHIFT; i++){
        // keeps the old result unless i == amount
        result = SELECT(i ^ amount, result, value);
        // this may not be constant time. If it is not, implement it yourself ;)
        value <<= 1;
    }
    return result;
}
Note, however, that you have to make sure the compiler does not optimize this.
CPUs may also employ operand-dependent performance optimizations that can lead to timing differences.
In addition to this, transient execution attacks like Spectre may also be a possible threat.
In conclusion: It is almost impossible to write side-channel free code.
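One of the "more ugly" ways to build the mask without relying on the ! operator is sketched below. It assumes 32-bit unsigned arithmetic and that subtraction, OR, and shifts by a constant are constant time on the target. The names ct_nonzero_mask, ct_select and ct_shl are made up for this illustration, and the earlier caveats about compilers and CPUs apply here as well.

```c
#include <stdint.h>

#define MAX_SHIFT 32u /* same bound as in the answer above */

/* All ones if x != 0, zero if x == 0. For any non-zero x, either x or its
 * two's complement negation has the top bit set. */
static uint32_t ct_nonzero_mask(uint32_t x)
{
    return 0u - ((x | (0u - x)) >> 31);
}

/* Returns a when mask is all ones, b when mask is zero. */
static uint32_t ct_select(uint32_t mask, uint32_t a, uint32_t b)
{
    return (mask & a) | (~mask & b);
}

/* Left shift whose loop count and memory accesses do not depend on amount. */
static uint32_t ct_shl(uint32_t value, uint32_t amount)
{
    uint32_t result = 0; /* amounts above MAX_SHIFT yield 0 in this sketch */
    for (uint32_t i = 0; i <= MAX_SHIFT; i++) {
        /* keep the old result unless i == amount */
        result = ct_select(ct_nonzero_mask(i ^ amount), result, value);
        value <<= 1; /* shift by a constant 1 in every round */
    }
    return result;
}
```

For example, ct_shl(1, 3) yields 8, and ct_shl(1, 32) yields 0 because every bit has been shifted out by the time i reaches 32.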

FIFO implementation in C

I am analysing an Internet guide, where I found code like the following. Can somebody explain the usage of the ~ and & operators to me?
Thanks in advance
uint8_t tx_fifo_put(tx_dataType data)
{
    /*Check if FIFO is full*/
    if((tx_put_itr - tx_get_itr) & ~(TXFIFOSIZE-1))
    {
        /*FIFO full - return TXFAIL*/
        return (TXFAIL);
    }
    /*Put data into fifo*/
    TX_FIFO[tx_put_itr & (TXFIFOSIZE - 1)] = data;
    /*Increment itr*/
    tx_put_itr++;
    return(TXSUCCESS);
}
The code is an obfuscated replacement for something more human readable.
As a commenter wrote before me, the statement TX_FIFO[tx_put_itr & (TXFIFOSIZE - 1)] = data; wraps the index around the buffer. As was also mentioned in the comments, the code relies on the size being a power of two.
I do not know why it is written this way; to me TX_FIFO[tx_put_itr % TXFIFOSIZE] = data does the same and is more readable. Also, one expects predicate checks to come before data access. At least that is my habit.
The (w - r) &~ size part is a way to check for (1) w < r and, (2) as an edge case, w being equal to FIFOSIZE and r being zero. Semantically it should mean: "if the write pointer points to the boundary and the read pointer points to the start of the buffer, we assume that, for our data structure, the next write could be an overflow."
Let us see some code, numbers and their binary representation.
let s = 8 - 1, in binary is 00000111 and negated is 11111000.
let w = 0, let r = 1.
now in binary w = 00000000, r = 00000001.
w - r = 11111111; a bitwise AND of that with ~(8 - 1) gives a value other than zero.
Continuing the logic for the w < r case, any negative difference leaves some of the upper bits set, so this definitely makes the if condition in the OP's code true.
In the w = r case the difference is zero, so no bits survive the mask and the test is false.
And last case,
let s = 8,
let w = 8
let r = 0
w - r = 00001000
~(8 - 1) = 11111000
(w - r) &~ 7 = 00001000
All other cases where w > r give zero.
Update
To my great grief, #UkropUkraine has deleted all the comments and his answer. There was some discussion there about the fact that one can use (w - r) >= mask in place of (w - r) & mask.
Here I present some code and an explanation that this is not an optimization, or just syntax, or whatever else came to the mind of the person who wrote the OP's code. It is intended code. And it fails at its purpose: to run as a FIFO or circular queue, or whatever that piece of code was meant to do.
First, take an example of usage, the part where the Ukrop user had difficulties. The w pointer can be less than the r pointer, and the result of w - r will then be negative.
The common usage is to add a byte to the buffer and wrap the write pointer as soon as it reaches the end. Imagine a situation where the w pointer has already wrapped.
#include <stdio.h>
int main()
{
    unsigned char w = 0, r = 1;
    int result;
    /* w and r are promoted to int, so w - r is -1 */
    result = (w - r) & 0xffffffff;
    printf("%d\n", result);
    return 0;
}
-1
I do not know what the common boolean result type is on microcontrollers. On a common x86 C machine, it is int. So I expect the if((w - r) &~ size) expression to be evaluated as an int, and the result is negative. You cannot just rewrite the above with >=, >, or ==, as was stated in the comments and in the other answer here.
More than that, the code fails its semantics. It is meant to be a FIFO, or something similar, I do not know. But in the above situation, the read pointer still has some sensible data to read. And it can be read, because the write pointer, even though it has wrapped, does not yet overwrite the unread portion of the buffer. But the code reports the buffer as full (TXFAIL).
I thought about read and write running in different directions, but it does not change anything. The code the OP gave fails to do what one would expect.
Maybe I am missing some insight here, as the Ukrop user and the OP point out that they know the code's semantics and that the OP just did not understand the usage of ~ and &. Well, this is the answer: the & ~ combination is used to test for a negative value and for the edge cases.
The two operators:
& is a bitwise and operator
~ is a bitwise complement operator
Now for the posted code it's important to notice that TXFIFOSIZE must have a value which is a power of 2, i.e. values like 2, 4, 8, 16, 32, ...
When that is true, the code:
TX_FIFO[tx_put_itr & (TXFIFOSIZE - 1)] = data;
is equivalent to:
TX_FIFO[tx_put_itr % TXFIFOSIZE] = data;
Notice that tx_put_itr is incremented in such a way that it will take values higher than TXFIFOSIZE. So in order to get a valid array index, the code must find the remainder of tx_put_itr with respect to TXFIFOSIZE.
So how does this work? Why are the above lines equivalent?
Let's take a value as example.
Assume TXFIFOSIZE is 8 (2 to the power of 3)
So TXFIFOSIZE-1 is 7
7 is bitwise 00....00111
And when you do:
SOME_NUMBER & 00....00111
You keep the 3 least significant bits of SOME_NUMBER
And that is exactly the remainder when dividing by 8
So let's look at
if((tx_put_itr - tx_get_itr) & ~(TXFIFOSIZE-1))
It is equivalent to
if((tx_put_itr - tx_get_itr) >= TXFIFOSIZE)
So it checks for "FIFO full"
Again using an example it works like this:
Assume TXFIFOSIZE is 8 (2 to the power of 3)
So TXFIFOSIZE-1 is 7
7 is bitwise 00....00111
~7 is bitwise 11....11000
And when you do:
SOME_NUMBER & 11....11000
You clear the 3 least significant bits of SOME_NUMBER and keep the rest unchanged
So if the result is non-zero it means that the difference between
tx_put_itr and tx_get_itr is 8 (or more).
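The two masks can be tried out in isolation with a sketch like the one below. The question does not show how the counters are declared, so the uint8_t counter type, the TXFIFOSIZE value of 8, and the helper names fifo_index and fifo_is_full are all assumptions made for this illustration.

```c
#include <stdint.h>

#define TXFIFOSIZE 8u /* assumed value; must be a power of two */

/* Buffer index: same result as itr % TXFIFOSIZE for power-of-two sizes. */
static uint8_t fifo_index(uint8_t itr)
{
    return itr & (TXFIFOSIZE - 1u);
}

/* Non-zero when the FIFO holds TXFIFOSIZE elements. The difference is cast
 * back to uint8_t so that it wraps modulo 256 like the counters themselves. */
static int fifo_is_full(uint8_t put_itr, uint8_t get_itr)
{
    return ((uint8_t)(put_itr - get_itr) & ~(TXFIFOSIZE - 1u)) != 0;
}
```

With unsigned counters the check keeps working after the put counter has wrapped around: fifo_is_full(2, 250) is true because the wrapped difference is 8, while fifo_is_full(1, 250) is false because the difference is 7.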

At a sequence point all previous accesses to volatile objects have stabilized

From the GNU documentation about volatile:
The minimum requirement is that at a sequence point all previous
accesses to volatile objects have stabilized and no subsequent
accesses have occurred
Ok, so we know what sequence points are, and we now know how volatile behaves with respect to them in gcc.
So, naively I would look at the following program:
volatile int x = 0;
int y = 0;
x = 1; /* sequence point at the end of the assignment */
y = 1; /* sequence point at the end of the assignment */
x = 2; /* sequence point at the end of the assignment */
And will apply the GNU requirement the following way:
At a sequence point (end of y=1) the access to volatile x = 1 has stabilized and no subsequent access x = 2 has occurred.
But that is just wrong, because the non-volatile y = 1 can be reordered across sequence points; for example, y = 1 can actually be performed before x = 1 and x = 2, and furthermore it can be optimised away (without violating the as-if rule).
So I am very eager to know how I can apply the GNU requirement properly. Is there a problem with my understanding? Is the requirement written in a wrong way?
Maybe the requirement should be written as something like:
The minimum
requirement is that at a sequence point WHICH HAS A SIDE EFFECT all previous accesses to volatile objects have stabilized
and no subsequent accesses have occurred
Or as pmg elegantly suggested in the comment:
The minimum requirement is that at a sequence point all UNSEQUENCED previous accesses to volatile objects have
stabilized and no subsequent accesses have occurred
so we could only apply it on the sequence points of end of x = 1; and end of x = 2; on which is definitely true that previous accesses to volatile objects have stabilized and no subsequent accesses have occurred?

PID implementation in arduino

I came across some code online in which a PID controller is implemented for Arduino, and I am confused by the implementation. I have a basic understanding of how PID works; my source of confusion is why hexadecimal is used for m_prevError. What does the value 0x80000000L represent, and why is there a right shift by 10 when calculating the velocity?
// ServoLoop Constructor
ServoLoop::ServoLoop(int32_t proportionalGain, int32_t derivativeGain)
{
    m_pos = RCS_CENTER_POS;
    m_proportionalGain = proportionalGain;
    m_derivativeGain = derivativeGain;
    m_prevError = 0x80000000L;
}
// ServoLoop Update
// Calculates new output based on the measured
// error and the current state.
void ServoLoop::update(int32_t error)
{
    long int velocity;
    char buf[32];
    if (m_prevError!=0x80000000)
    {
        velocity = (error*m_proportionalGain + (error - m_prevError)*m_derivativeGain)>>10;
        m_pos += velocity;
        if (m_pos>RCS_MAX_POS)
        {
            m_pos = RCS_MAX_POS;
        }
        else if (m_pos<RCS_MIN_POS)
        {
            m_pos = RCS_MIN_POS;
        }
    }
    m_prevError = error;
}
Shifting a binary number to right by 1 means multiplying its corresponding decimal value by 2. Here shifting by 10 means multiplying by 2^10 which is 1024. As any basic control loop, it could be a gain of the velocity where the returned-back value is converted to be suitable to re-use by any other method.
The L in 0x80000000L declares the value as a long. So the value 0x80000000 may be an initial value of the error or something similar. You also need to review the full program to see how things work and what value is assigned to something like error.
Contrary to the other answer, shifting to the right has the effect of dividing by a power of two; in this case >> 10 divides by 1024. But a real division would be better and clearer, and would be optimized by the compiler into a shift anyway. So I find this shift ugly.
The intent is to implement some fractional math without actually using floating-point numbers: it is a kind of fixed-point calculation, where the fractional part is 10 bits. To understand it, assume for simplicity that the derivative coefficient is 0; then an m_proportionalGain set to 1024 would mean 1, while a value of 512 would mean 0.5. Indeed, with proportional=1024 and error=100, the formula would give
100*1024 / 1024 = 100
(gain=1), while proportional=512 would give
100*512 / 1024 = 50
(gain=0.5).
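The same arithmetic can be written as a tiny sketch. It keeps only the proportional term, as in the simplification above; the names FRAC_BITS and scale_by_gain are invented for this illustration and do not appear in the original code.

```c
#include <stdint.h>

#define FRAC_BITS 10 /* the >>10 from the question: gains are scaled by 1024 */

/* Proportional-only version of the velocity formula, for illustration.
 * gain_q10 is the real gain multiplied by 1024 (1024 = 1.0, 512 = 0.5). */
static int32_t scale_by_gain(int32_t error, int32_t gain_q10)
{
    return (error * gain_q10) >> FRAC_BITS;
}
```

Here scale_by_gain(100, 1024) gives 100 and scale_by_gain(100, 512) gives 50, matching the two calculations above. For negative products, the result of >> on a signed value is implementation-defined in C, which is one more argument for an explicit division.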
As for the previous error, m_prevError is set to 0x80000000 simply as a special value, which is checked in update() to see whether there already is a previous error. If not, i.e. if m_prevError has the special value, the entire calculation is skipped once; in other words, it serves to skip the first update after creation of the object. Not very clever, I suppose; I would prefer to simply set the previous error to 0 and drop the check in ::update() completely. Using special values as flags has the problem that sometimes a calculation results in the special value itself, which would be a big bug. If absolutely needed, it is better to use a true flag.
All in all, I think this is a poor PID algorithm, as it completely lacks the integral part; it seems that the variable m_pos is intended for this integral purpose, since it is managed quite that way, but it is never used, only set. Nevertheless this algorithm can work, but it all depends on the target system and the desired performance: in most situations, this algorithm leaves a residual error.

How to optimize C code : looking for the next set bit and finding sum of corresponding array elements

EDIT: Now I realize I didn't explain my algorithm well enough. I'll try again.
What I'm doing is something very similar to dot product of two vectors, but there is a difference. I've got two vectors: one vector of bits and one vector of floats of the same length. So I need to calculate sum:
float[0]*bit[0]+float[1]*bit[1]+..+float[N-1]*bit[N-1], BUT the difference from a classic dot product is that I need to skip some fixed number of elements after each set bit.
Example:
vector of floats = {1.5, 2.0, 3.0, 4.5, 1.0}
vector of bits = {1, 0, 1, 0, 1 }
nSkip = 2
in this case sum is calculated as follows:
sum = floats[0]*bits[0]
bits[0] == 1, so skipping 2 elements (at positions 1 and 2)
sum = sum + floats[3]*bits[3]
bits[3] == 0, so no skipping
sum = sum + floats[4]*bits[4]
result = 1.5*1+4.5*0+1.0*1 = 2.5
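The worked example can be checked against a naive reference loop such as the sketch below. It follows the example literally (the nSkip elements after a set bit are not examined at all) and stores one 0/1 value per byte instead of a packed bit array; the function name skip_sum and its parameter names are made up for this illustration.

```c
#include <stddef.h>

/* Naive reference: bits holds one 0/1 value per element (not packed). */
static float skip_sum(const float *floats, const unsigned char *bits,
                      size_t n, size_t n_skip)
{
    float sum = 0.0f;
    size_t i = 0;
    while (i < n) {
        if (bits[i]) {
            sum += floats[i];
            i += n_skip; /* jump over the next n_skip elements */
        }
        i++;
    }
    return sum;
}
```

Called with the five floats and five bits from the example and n_skip = 2, it returns 2.5. A slow but obviously correct version like this is handy for validating any optimized variant.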
The following code is called many times with different data, so I need to optimize it to run as fast as possible on my Core i7 (I don't care much about compatibility with anything else). It is optimized to some extent but is still slow, and I don't know how to improve it further.
Bit array is implemented as an array of 64 bit unsigned ints, it allows me to use bitscanforward to find the next set bit.
code:
unsigned int i = 0;
float fSum = 0;
do
{
    unsigned int nAddr = i / 64;
    unsigned int nShift = i & 63;
    unsigned __int64 v = bitarray[nAddr] >> nShift;
    unsigned long idx;
    if (!_BitScanForward64(&idx, v))
    {
        i+=64-nShift;
        continue;
    }
    i+= idx;
    fSum += floatarray[i];
    i+= nSkip;
} while(i<nEnd);
Profiler shows 3 slowest hotspots :
1. v = bitarray[nAddr] >> nShift (memory access with shift)
2. _BitScanForward64(&idx, v)
3. fSum += floatarray[i]; (memory access)
But there is probably a different way of doing this. I was thinking about just resetting nSkip bits after each set bit in the bit vector and then calculating a classical dot product. I haven't tried it yet, but honestly I don't believe it will be faster with more memory accesses.
You have too many of your operations inside of the loop. You also only have one loop, so many of the operations that do need to happen for each flag word (the 64 bit unsigned integer) are happening 63 extra times.
Consider division an expensive operation and try to not do that too often when optimizing code for performance.
Memory access is also considered expensive in terms of how long it takes, so this should also be limited to required accesses only.
Tests that allow you to exit early are often useful (though sometimes the test itself is expensive relative to the operations you'd be avoiding, but that's probably not the case here).
Using nested loops should simplify this a lot. The outer loop should work at the 64 bit word level, and the inner loop should work at the bit level.
I have noticed a mistake in my earlier recommendations. Since the division here is by 64, which is a power of 2, this is not actually an expensive operation, but we still need to get as many operations as far out of the loops as we can.
/* this is completely untested, but incorporates the optimizations
   that I outlined as well as a few others.
   I process the arrays backwards, which allows for elimination of
   comparisons of variables against other variables, which is much
   slower than comparisons of variables against 0, which is essentially
   free on many processors when you have just operated or loaded the
   value to a register.
   Going backwards at the bit level also allows for the possibility that
   the compiler will take advantage of the comparison of the top bit
   being the same as test for negative, which is cheap and mostly free
   for all but the first time through the inner loop (for each time
   through the outer loop).
   Note that skipping nSkip elements after a set bit is not handled here.
*/
double acc = 0.0;
unsigned i_end = nEnd-1;   /* index of the last valid bit */
int i_bit_end;
int i_word_end;
if (nEnd == 0)
{
    return acc;
}
i_bit_end = i_end % 64;
i_word_end = i_end / 64;
do
{
    /* move the highest valid bit of this word up into bit 63 */
    unsigned __int64 v = bitarray[i_word_end] << (63 - i_bit_end);
    unsigned i_upper = i_word_end * 64;
    unsigned i_bit = i_bit_end;
    while (v)
    {
        if (v & 0x8000000000000000ULL)
        {
            // The following code is semantically the same as
            // unsigned i = i_bit + (i_word_end * 64);
            unsigned i = i_bit | i_upper;
            acc += floatarray[i];
        }
        v <<= 1;
        i_bit--;
    }
    i_bit_end = 63;
    i_word_end--;
} while (i_word_end >= 0);
I think you should check "how to ask questions" first. You will not gain many upvotes for this, since you are asking us to do the work for you instead of introducing a particular problem.
I cannot see why you are incrementing the same variable in two places instead of one (i).
I also think you should declare variables only once, not in every iteration.

Resources