I am working on an application where I have to keep data in sequence. Every unit of data comes with a sequence number; I check whether the sequence number is 1 greater than the previous one, and if it is, I increase my received count by 1. My question is: is there a difference between:
1. increasing my received count by one.
AND
2. assigning the last received sequence number to received count.
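For concreteness, a minimal sketch of the two options (the names on_data, prev_seq, and received_count are illustrative, not from the actual application):

#include <stdint.h>

static uint32_t prev_seq = 0;
static uint32_t received_count = 0;

void on_data(uint32_t seq)
{
    if (seq == prev_seq + 1) {
        /* Option 1: increment the running count */
        received_count += 1;

        /* Option 2: assign the sequence number just received
           (equivalent only if the count started in step with the
           sequence numbers and no gaps have occurred) */
        /* received_count = seq; */
    }
    prev_seq = seq;
}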
Thanks.
It sounds like a classic premature-optimization question to me. Generally, incrementing a value means "fetch original -> change -> store", while assigning means "fetch other -> store new". The "other" value would probably have been fetched already, saving even more clock cycles, so assigning would probably be faster.
BUT incrementing by 1 is usually very well optimized by compilers and CPUs, so it may not require any separate fetching or storing. It can often be done in a single CPU instruction, eliminating any difference and in fact probably making the increment the better option performance-wise.
Confused? Good.
The point is that this is the kind of optimization you should not be doing unless you have benchmarked a bottleneck. Then you benchmark the options and choose the best.
From what I'm reading, building the constant-1 and constant-0 operations in a quantum computer involves building something like this, where two qubits are used. Why do we need two?
The bottom qubit in both examples is not being used at all, so it has no impact on the operation. Both operations seemingly only work if the top qubit's initial value is 0, so surely all this says is that this is an operation which either flips a 0 or leaves it alone - in which case, what is the second qubit needed for? Wouldn't a set-to-0 function set the input to 0 whatever it is, and not need one of its inputs to be predetermined?
Granted, the 'output' qubit is for output, but its value still needs to be predetermined going into the operation?
Update: I've posted this on the Quantum Computing Stack Exchange with links to a couple of blogs/videos where you can see the below being brought up.
With regard to coding in C, which would be faster: to guard the assignment with an extra check in the if, or to just run it anyway? For example, say output is already 1.
if (a == b && output != 1)
{
    output = 1;
}
Or
if (a == b)
{
    output = 1;
}
In the first version, an extra check has to be run every time the code runs.
In the second, you may be performing the assignment repeatedly and unnecessarily.
Which is more efficient?
The question basically boils down to whether a compare is less expensive than a variable assignment. For integers, the answer is no. I am assuming this will be in a tight loop where the variables are already in the CPU's level 1 cache. The compare will compile down to opcodes like:
1) Move "output" memory locations data into Register A
2) Put 1 into Register B
3) Jump <somewhere> if Register A == Register B.
You might get an optimization where 2) is skipped when comparing to 0, because most CPUs have special opcodes for comparing against 0.
The assignment will compile to opcodes like:
1) Put 1 into Register A
2) Store Register A to the memory location of output
The question comes down to the clock cycles spent on each of these opcodes. I think they are all likely to cost exactly the same number of clock cycles.
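If it really matters, measuring is the only way to settle it. A minimal benchmark sketch (the variable names and iteration count are made up for illustration, and a real compiler may still transform or remove either loop unless you inspect the generated assembly):

#include <stdio.h>
#include <time.h>

int main(void)
{
    /* volatile discourages the compiler from deleting the loops outright */
    volatile int a = 1, b = 1;
    volatile int output = 0;
    const long iterations = 100000000L;

    clock_t t0 = clock();
    for (long i = 0; i < iterations; i++) {
        if (a == b && output != 1)
            output = 1;
    }
    clock_t t1 = clock();
    for (long i = 0; i < iterations; i++) {
        if (a == b)
            output = 1;
    }
    clock_t t2 = clock();

    printf("with extra check: %f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("plain assignment: %f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}

On most hardware the difference, if any, is lost in the noise, which is rather the point.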
Regardless of any possible optimization, as shown in the comments, the first version is less efficient than the second due to the extra check.
Beware of the meaning of your data: that check may be mandatory.
If not, you should optimize your code as suggested.
Edit
I'm assuming your question is more theoretical than practical. In any real scenario, the data context plays a huge role when we want to optimize some code.
The code doesn't need to be fast in itself; it needs to be fast at processing its data.
Context
The function BN_consttime_swap in OpenSSL is a thing of beauty. In this snippet, condition has been computed as 0 or (BN_ULONG)-1:
#define BN_CONSTTIME_SWAP(ind) \
    do { \
        t = (a->d[ind] ^ b->d[ind]) & condition; \
        a->d[ind] ^= t; \
        b->d[ind] ^= t; \
    } while (0)
…
BN_CONSTTIME_SWAP(9);
…
BN_CONSTTIME_SWAP(8);
…
BN_CONSTTIME_SWAP(7);
The intention is that, in order to ensure that higher-level bignum operations take constant time, this function either swaps two bignums or leaves them in place, in constant time. When it leaves them in place, it actually reads each word of each bignum, computes a new word that is identical to the old word, and writes that result back to the original location.
The intention is that this will take the same time as if the bignums had effectively been swapped.
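For readers unfamiliar with the masked XOR-swap idiom, here is a minimal standalone sketch of the same technique on plain integers (illustrative code, not OpenSSL's): when the mask is all ones the two words are exchanged, and when it is zero the same values are recomputed and written back.

#include <stdint.h>
#include <stdio.h>

/* Swap *x and *y when swap_flag is 1, leave them when it is 0,
   performing the same reads, XORs, and writes in both cases. */
static void consttime_swap(uint64_t *x, uint64_t *y, uint64_t swap_flag)
{
    uint64_t mask = (uint64_t)0 - swap_flag;   /* 0 or all ones */
    uint64_t t = (*x ^ *y) & mask;
    *x ^= t;
    *y ^= t;
}

int main(void)
{
    uint64_t a = 0x1111, b = 0x2222;
    consttime_swap(&a, &b, 1);   /* exchanged */
    consttime_swap(&a, &b, 0);   /* written back unchanged */
    printf("%llx %llx\n", (unsigned long long)a, (unsigned long long)b);
    return 0;
}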
In this question, I assume a modern, widespread architecture such as those described by Agner Fog in his optimization manuals. Straightforward translation of the C code to assembly (without the C compiler undoing the efforts of the programmer) is also assumed.
Question
I am trying to understand whether the construct above qualifies as a “best effort” sort of constant-time execution, or as perfect constant-time execution.
In particular, I am concerned about the scenario where bignum a is already in the L1 data cache when the function BN_consttime_swap is called, and the code just after the function returns starts working on the bignum a right away. On a modern processor, enough instructions can be in flight at the same time for the copy not to be technically finished when the bignum a is used. The mechanism allowing the instructions after the call to BN_consttime_swap to work on a is memory dependence speculation. Let us assume naive memory dependence speculation for the sake of the argument.
What the question seems to boil down to is this:
When the processor finally detects that the code after BN_consttime_swap read from memory that had, contrary to speculation, been written to inside the function, does it cancel the speculative execution as soon as it detects that the address had been written to, or does it allow itself to keep it when it detects that the value that has been written is the same as the value that was already there?
In the first case, BN_consttime_swap looks like it implements perfect constant-time. In the second case, it is only best-effort constant-time: if the bignums were not swapped, execution of the code that comes after the call to BN_consttime_swap will be measurably faster than if they had been swapped.
Even in the second case, this is something that looks like it could be fixed for the foreseeable future (as long as processors remain naive enough) by, for each word of each of the two bignums, writing a value different from the two possible final values before writing either the old value again or the new value. The volatile type qualifier may need to be involved at some point to prevent an ordinary compiler from over-optimizing the sequence, but it still sounds possible.
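A rough sketch of that workaround for a single word, under the same mask convention (this is only the idea described above, not anything OpenSSL does, and whether the == below compiles to branch-free code is itself compiler-dependent):

#include <stdint.h>

/* Store an intermediate value that differs from both the old and the new
   word, then store the selected final value, so every store changes the
   memory content whether or not the swap happens. `mask` is 0 or all ones. */
static void write_word(volatile uint64_t *dst, uint64_t new_val, uint64_t mask)
{
    uint64_t old_val = *dst;
    uint64_t final   = (old_val & ~mask) | (new_val & mask);

    /* new_val can collide with at most one of old_val^1 and old_val^2,
       so pick whichever of the two it does not equal. */
    uint64_t collide = (uint64_t)0 - (uint64_t)((old_val ^ 1) == new_val);
    uint64_t scratch = (old_val ^ 1) ^ (collide & 3);   /* old^1 or old^2 */

    *dst = scratch;   /* differs from what is currently stored */
    *dst = final;     /* differs from scratch */
}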
NOTE: I know about store forwarding, but store forwarding is only a shortcut. It does not prevent a read being executed before the write it is supposed to come after. And in some circumstances it fails, although one would not expect it to in this case.
Straightforward translation of the C code to assembly (without the C compiler undoing the efforts of the programmer) is also assumed.
I know it's not the thrust of your question, and I know that you know this, but I need to rant for a minute. This does not even qualify as a "best effort" attempt to provide constant-time execution. A compiler is licensed to check the value of condition, and skip the whole thing if condition is zero. Obfuscating the setting of condition makes this less likely to happen, but is no guarantee.
Purportedly "constant-time" code should not be written in C, full stop. Even if it is constant time today, on the compilers that you test, a smarter compiler will come along and defeat you. One of your users will use this compiler before you do, and they will not be aware of the risk to which you have exposed them. There are exactly three ways to achieve constant time that I am aware of: dedicated hardware, assembly, or a DSL that generates machine code plus a proof of constant-time execution.
Rant aside, on to the actual architecture question at hand: assuming a stupidly naive compiler, this code is constant time on the µarches with which I am familiar enough to evaluate the question, and I expect it to broadly be true for one simple reason: power. I expect that checking in a store queue or cache if a value being stored matches the value already present and conditionally short-circuiting the store or avoiding dirtying the cache line on every store consumes more energy than would be saved in the rare occasion that you get to avoid some work. However, I am not a CPU designer, and do not presume to speak on their behalf, so take this with several tablespoons of salt, and please consult one before assuming this to be true.
This blog post, and the comments made by its author, Henry, on the subject of this question, should be considered as authoritative as anyone should be allowed to expect. I will reproduce the latter here for archival:
I didn’t think the case of overwriting a memory location with the same value had a practical use. I think the answer is that in current processors, the value of the store is irrelevant, only the address is important.
Out here in academia, I’ve heard of two approaches to doing memory disambiguation: Address-based, or value-based. As far as I know, current processors all do address-based disambiguation.
I think the current microbenchmark has some evidence that the value isn’t relevant. Many of the cases involve repeatedly storing the same value into the same location (particularly those with offset = 0). These were not abnormally fast.
Address-based schemes use a store queue and a load queue to track outstanding memory operations. Loads check the store queue for an address match (Should this load do store-to-load forwarding instead of reading from cache?), while stores check the load queue (Did this store clobber the location of a later load I allowed to execute early?). These checks are based entirely on addresses (where a store and load collided). One advantage of this scheme is that it’s a fairly straightforward extension on top of store-to-load forwarding, since the store queue search is also used there.
Value-based schemes get rid of the associative search (i.e., faster, lower power, etc.), but require a better predictor to do store-to-load forwarding (now you have to guess whether and where to forward, rather than searching the SQ). These schemes check for ordering violations (and incorrect forwarding) by re-executing loads at commit time and checking whether their values are correct. In these schemes, if you have a conflicting store (or made some other mistake) that still resulted in the correct result value, it would not be detected as an ordering violation.
Could future processors move to value-based schemes? I suspect they might. They were proposed in the mid-2000s(?) to reduce the complexity of the memory execution hardware.
The idea behind constant-time implementation is not to actually perform everything in constant time. That will never happen on an out-of-order architecture.
The requirement is that no secret information can be revealed by timing analysis.
To prevent this there are basically two requirements:
a) Do not use anything secret as a stop condition for a loop, or as a predicate to a branch. Failing to do so will open you to a branch prediction attack https://eprint.iacr.org/2006/351.pdf
b) Do not use anything secret as an index to memory access. This leads to cache timing attacks http://www.daemonology.net/papers/htt.pdf
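As a concrete illustration of (a), a secret-dependent branch can often be replaced by arithmetic on a mask. A minimal sketch (not taken from the question's code):

#include <stdint.h>

/* Branching version: leaks secret_bit through the branch predictor. */
static uint32_t select_branchy(uint32_t secret_bit, uint32_t x, uint32_t y)
{
    return secret_bit ? x : y;
}

/* Branch-free version: the same selection computed with a mask,
   so the executed instruction stream does not depend on the secret. */
static uint32_t select_consttime(uint32_t secret_bit, uint32_t x, uint32_t y)
{
    uint32_t mask = (uint32_t)0 - (secret_bit & 1);   /* 0 or all ones */
    return (x & mask) | (y & ~mask);
}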
As for your code: assuming that your secret is "condition", and possibly the contents of a and b, the code is perfectly constant-time in the sense that its execution does not depend on the actual contents of a, b, and condition. Of course, the locality of a and b in memory will affect the execution time of the loop, but the CONTENTS, which are the secret, will not.
That is assuming, of course, that condition was computed in a constant-time manner.
As for C optimizations: the compiler can only optimize code based on information it knows. If "condition" is truly secret, the compiler should not be able to discern its contents and optimize. If it can be deduced from your code, then the compiler will most likely optimize for the 0 case.
I am going to analyse and optimize some C code, and therefore I first have to check whether the functions I want to optimize are memory-bound or CPU-bound. In general I know how to do this, but I have some questions about counting floating-point operations and analysing the size of the data that is used. Look at the following for-loop, which I want to analyse. The values of the array are doubles (that is, 8 bytes each):
for (int j = 0; j < N; j++) {
    for (int i = 1; i < Nt; i++) {
        matrix[j*Nt + i] = matrix[j*Nt + i-1] * mu + matrix[j*Nt + i] * sigma;
    }
}
1) How many floating-point operations do you count? I thought 3*(Nt-1)*N... but do I have to count the operations within the array indices too (matrix[j*Nt+i], which would be 2 more FLOPs per access)?
2) How much data is transferred? 2*((Nt-1)*N)*8 bytes or 3*((Nt-1)*N)*8 bytes? I mean, every entry of the matrix has to be loaded. After the calculation, the new value is stored to that index of the array (so far that is 1 load and 1 store). But this value is then used for the next calculation. Is another load operation needed for that, or is this value (matrix[j*Nt+i-1]) already available without a load?
Thanks a lot!
With this type of code, the direct sort of analysis you are proposing to do can be almost completely misleading. The only meaningful information about the performance of the code comes from actually measuring how fast it runs in practice (benchmarking).
This is because modern compilers and processors are very clever about optimizing code like this, and it will end up executing in a way which is nothing like your straightforward analysis. The compiler will optimize the code, rearranging the individual operations. The processor will itself try to execute the individual sub-operations in parallel and/or in a pipelined fashion, so that, for example, computation occurs while data is being fetched from memory.
It's useful to think about algorithmic complexity, to distinguish between O(n) and O(n²) and so on, but constant factors (like the 2*... or 3*... you ask about) are completely moot, because they vary in practice depending on lots of details.
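A minimal sketch of such a measurement, assuming a POSIX system for clock_gettime and made-up values for N, Nt, mu, and sigma (none of these come from the question):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const int N = 1000, Nt = 1000;
    const double mu = 0.5, sigma = 0.25;
    double *matrix = malloc((size_t)N * Nt * sizeof *matrix);
    if (!matrix)
        return 1;
    for (long k = 0; k < (long)N * Nt; k++)
        matrix[k] = (double)k;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int j = 0; j < N; j++)
        for (int i = 1; i < Nt; i++)
            matrix[j*Nt + i] = matrix[j*Nt + i-1] * mu + matrix[j*Nt + i] * sigma;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double flops = 3.0 * (Nt - 1) * (double)N;   /* ~3 FLOPs per inner iteration */
    printf("time: %.6f s, ~%.0f MFLOP/s\n", secs, flops / secs / 1e6);

    free(matrix);
    return 0;
}

Comparing the measured rate against the machine's peak FLOP rate and memory bandwidth says more about memory-bound versus CPU-bound behaviour than counting operations on paper.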
Which is more taxing? Is enclosing an array element exchange in a conditional if statement to prevent redundant exchanges, say exchanging an element with itself, more efficient?
Or is having to check all the time for a condition that is only probabilistic more inefficient? Say the chance of the special condition increases with every invocation.
Say you're developing an algorithm and are trying to evaluate its efficiency in terms of compares and exchanges (as in insertion sort).
if(condition)
exchange two elements
This very much depends on your processor architecture, how often this would be done, the throughput it's required to handle, and the cost of doing said exchanges; in which case, the only viable, real-world answer is: "profile, profile and profile some more".
Basically, if your CPU suffers badly from branch misprediction, and the swapping of elements is trivial, then it makes sense to leave out the conditional.
However, if your target CPU architecture can absorb a fair amount of branch mispredictions without too much stalling, or the cost of swapping elements is not trivial, then you might gain performance from keeping the check, depending on the size of said array. You may also benefit from instructions like CMOVcc/CMPXCHG, or their non-x86 counterparts (though in this situation you'd still need a read + compare, but it removes the branching).
With so many variable inputs, it makes sense to profile your code and find where it's really bottlenecked. Tools like VTune or CodeAnalyst will also give you stats on branch misprediction, so you can see how much it affects your algorithm as a whole.
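For concreteness, the two alternatives under discussion might look like this for integer array elements (names are illustrative, not from the question); which one wins is exactly what profiling has to decide:

#include <stddef.h>

/* Guarded swap: skips the redundant self-exchange, at the cost of a branch. */
static void swap_guarded(int *a, size_t i, size_t j)
{
    if (i != j) {
        int t = a[i];
        a[i] = a[j];
        a[j] = t;
    }
}

/* Unconditional swap: always does the work; exchanging an element
   with itself is harmless, and there is no branch to mispredict. */
static void swap_always(int *a, size_t i, size_t j)
{
    int t = a[i];
    a[i] = a[j];
    a[j] = t;
}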
A useful way to look at any condition-evaluation code is to ask, "What is the probability of each outcome?"
For example, if there's a test expression test whose probability of being true is 1/100, then on average it is telling you very little for your investment in processor cycles.
In fact you can quantify that.
If it's true, then the amount of information you've learned is pretty good.
It is log2(100/1) = 6.6 bits, roughly, but that only happens 1 out of 100 times.
The other 99 times, the amount of information you learn is log2(100/99) = .014 bits.
Practically nothing.
So a condition like that is telling you very little, on average. It's not "working" very hard.
A good way to finish quantifying it is to multiply what you learn from each outcome by the probability of that outcome, and add those up.
That tells you what you learn on average.
That is 6.6 * 1/100 + .014 * 99/100 = .066 + .014 = .08 bits, which is very poor.
(This number is called the entropy of the decision.)
On the other hand, if you have a decision point where each outcome is equally likely, you learn a full 1 bit on average.
In fact that's the most work a binary decision can possibly do.
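A small sketch of that calculation, for anyone who wants to plug in their own branch probability (the function name is made up for illustration; link with -lm):

#include <math.h>
#include <stdio.h>

/* Entropy, in bits, of a binary decision that is true with probability p. */
static double branch_entropy(double p)
{
    if (p <= 0.0 || p >= 1.0)
        return 0.0;                 /* a certain outcome teaches nothing */
    return p * log2(1.0 / p) + (1.0 - p) * log2(1.0 / (1.0 - p));
}

int main(void)
{
    printf("p = 0.01: %.3f bits\n", branch_entropy(0.01));   /* ~0.08 */
    printf("p = 0.50: %.3f bits\n", branch_entropy(0.50));   /* exactly 1 */
    return 0;
}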
So if you're worried about the performance of a conditional test (you may not be), try to make it earn its cycles.