Multiplication algorithm for arbitrary precision (bignum) integers - c

I'm writing a small bignum library for a homework project. I am to implement Karatsuba multiplication, but before that I would like to write a naive multiplication routine.
I'm following a guide written by Paul Zimmermann titled "Modern Computer Arithmetic" which is freely available online.
On page 4, there is a description of an algorithm titled BasecaseMultiply which performs gradeschool multiplication.
I understand step 2, 3, where B^j is a digit shift of 1, j times.
But I don't understand step 1 and 3, where we have A*b_j. How is this multiplication meant to be carried out if the bignum multiplication hasn't been defined yet?
Would the operation "*" in this algorithm just be the repeated addition method?
Here are the parts I have written so far. I have unit tested them, so they appear to be correct for the most part:
The structure I use for my bignum is as follows:
#define BIGNUM_DIGITS 2048
typedef uint32_t u_hw; // halfword
typedef uint64_t u_w; // word
typedef struct {
unsigned int sign; // 0 or 1
unsigned int n_digits;
u_hw digits[BIGNUM_DIGITS];
} bn;
Currently available routines:
bn *bn_add(bn *a, bn *b); // returns a+b as a newly allocated bn
void bn_lshift(bn *b, int d); // shifts d digits to the left, retains sign
int bn_cmp(bn *a, bn *b); // returns 1 if a>b, 0 if a=b, -1 if a<b

I wrote a multiplication algorithm a while ago, and I have this comment at the top. If you have two numbers x and y of the same size (same n_digits) then you would multiply like this to get n, which would have twice the digits. Part of the complexity of the algorithm comes from working out which bits not to multiply if n_digits is not the same for both inputs.
Starting from the right, n0 is x0*y0 and you save off the overflow. Now n1 is the sum of x1*y0 and y1*x0 and the previous overflow shifted by your digit size. If you are using 32 bit digits in 64 bit math, that means n0 = low32(x0*y0) and you carry high32(x0*y0) as the overflow. You can see that if you actually used 32 bit digits you could not add the center columns up without exceeding 64 bits, so you probably use 30 or 31 bit digits.
If you have 30 bits per digit, that means you can multiply two 8-digit numbers together. First write this algorithm to accept two small buffers with n_digits up to 8 and use native math for the arithmetic. Then implement it again, taking arbitrary sized n_digits and using the first version, along with your shift and add method, to multiply 8x8 chunks of digits at a time.
/*
X*Y = N
x0 y3
\ /
\ /
X
x1 /|\ y2
\ / | \ /
\ / | \ /
X | X
x2 /|\ | /|\ y1
\ / | \ | / | \ /
\ / | \|/ | \ /
X | X | X
x3 /|\ | /|\ | /|\ y0
\ / | \ | / | \ | / | \ /
\ / | \|/ | \|/ | \ /
V | X | X | V
|\ | /|\ | /|\ | /|
| \ | / | \ | / | \ | / |
| \|/ | \|/ | \|/ |
| V | X | V |
| |\ | /|\ | /| |
| | \ | / | \ | / | |
| | \|/ | \|/ | |
| | V | V | |
| | |\ | /| | |
| | | \ | / | | |
| | | \|/ | | |
| | | V | | |
| | | | | | |
n7 n6 n5 n4 n3 n2 n1 n0
*/

To do A*b_j, you need to do the grade school multiplication of a bignum with a single digit. You end up having to add a bunch of two-digit products together:
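bn *R = ZERO; // assumed: a bn holding the value 0
for (int i = 0; i < n; i++) {
    u_w p = (u_w)a[i] * b_j;       // full 64-bit product of two 32-bit digits
    bn S = {0, 2};                 // positive, two digits
    S.digits[0] = (u_hw)p;         // low half; order depends on how digits are stored
    S.digits[1] = (u_hw)(p >> 32); // high half
    bn_lshift(&S, i);
    R = bn_add(R, &S);             // bn_add allocates, so the previous R leaks unless freed
}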
Of course, this is very inefficient.
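A more direct form keeps a running carry instead of building a temporary bignum per digit. The following is only a sketch under stated assumptions: digits are stored least significant first, the arrays are plain u_hw buffers rather than the bn struct, and mul_basecase is a name made up for this example. It shows that A*b_j needs nothing more than the native 32x32 -> 64 bit multiply.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t u_hw; /* one digit, base 2^32 */
typedef uint64_t u_w;  /* holds digit * digit + digit + carry without overflow */

/* Schoolbook multiplication: r[0 .. na+nb-1] = a[0 .. na-1] * b[0 .. nb-1].
   Digits are assumed to be stored least significant first.
   r must not overlap a or b. */
void mul_basecase(u_hw *r, const u_hw *a, size_t na, const u_hw *b, size_t nb)
{
    for (size_t k = 0; k < na + nb; k++)
        r[k] = 0;

    for (size_t j = 0; j < nb; j++) {
        u_hw carry = 0;
        /* Add A * b_j into r, shifted left by j digits. */
        for (size_t i = 0; i < na; i++) {
            u_w t = (u_w)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (u_hw)t;       /* low 32 bits stay in this column */
            carry = (u_hw)(t >> 32);  /* high 32 bits move to the next column */
        }
        r[j + na] = carry;
    }
}
```

The inner loop is the A*b_j step and the offset i + j plays the role of the B^j shift, so no separate shift or add routine is called. The sum in t cannot overflow 64 bits because (2^32 - 1)^2 + 2*(2^32 - 1) = 2^64 - 1.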

Related

Extracting particular bits and packing them into a payload

I need to write a function that uses interface functions from other components to get individual values for year, month, day, hour, minute, second, and then pack those values into a 5-byte message payload and provide it to another component by using an unsigned char pointer as function parameter. The payload structure is strictly defined and must look like this:
| | bit 7 | bit 6 | bit 5 | bit 4 | bit 3 | bit 2 | bit 1 | bit 0 |
| -------------- | -------------- | -------------- | -------------- | -------------- | -------------- | -------------- | -------------- | -------------- |
| byte 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | year |
| byte 1 | year | year | year | year | year | year | month | month |
| byte 2 | month | month | day | day | day | day | day | hour |
| byte 3 | hour | hour | hour | hour | minute | minute | minute | minute |
| byte 4 | minute | minute | second | second | second | second | second | second |
My current approach is:
void prepareDateAndTimePayload(unsigned char * message)
{
unsigned char payload[5] = {0};
unsigned char year = getYear();
unsigned char month = getMonth();
unsigned char day = getDay();
unsigned char hour = getHour();
unsigned char minute = getMinute();
unsigned char second = getSecond();
payload[0] = (year & 0x40) >> 6u; //get 7th bit of year and shift it
payload[1] = ((year & 0x3F) << 2u) | ((month & 0xC) >> 2u); //remaining 6 bits of year and starting with month
payload[2] = ((month & 0x3) << 6u) | ((day & 0x1F) << 1u) | ((hour & 0x10) >> 4u); //...
payload[3] = ((hour & 0xF) << 4u) | ((minute & 0x3C) >> 2u);
payload[4] = ((minute & 0x3) << 6u) | (second & 0x3F); //...
memcpy(message, payload, sizeof(payload));
}
I'm wondering how I should approach extracting the particular bits and packing them into a payload, so they match the required message structure. I find my version with bit masks and bit shifting to be messy and not elegant. Is there any better way to do it?
Look at your code with its various magic numbers. Now look at the code below. The compiler will optimise it to use registers, so the extra clarity costs nothing and lets human readers see and check that everything makes sense.
void prepareDateAndTimePayload(unsigned char msg[5])
{
unsigned char yr = 0x7F & (getYear() - bias); // 7 bits (year offset from bias)
unsigned char mn = 0x0F & getMonth(); // 4 bits
unsigned char dy = 0x1F & getDay(); // 5 bits
unsigned char hr = 0x1F & getHour(); // 5 bits
unsigned char mi = 0x3F & getMinute(); // 6 bits
unsigned char sc = 0x3F & getSecond(); // 6 bits
// [4] mmss ssss
msg[4] = sc; // 6/6 bit sc (0-59)
msg[4] |= mi << 6; // lo 2/6 bit mi (0-59)
// [3] hhhh mmmm
msg[3] = mi >> 2; // hi 4/6 bit mi (0-59)
msg[3] |= hr << 4; // lo 4/5 bit hr (0-23)
// [2] MMDD DDDh
msg[2] = hr >> 4; // hi 1/5 bit hr (0-23)
msg[2] |= dy << 1; // 5/5 bit dy (0-31)
msg[2] |= mn << 6; // lo 2/4 bit mn (1-12)
// [1] YYYY YYMM
msg[1] = mn >> 2; // hi 2/4 bit mn (1-12)
msg[1] |= yr << 2; // lo 6/7 bit yr (0-127)
// [0] 0000 000Y
msg[0] = yr >> 6; // hi 1/7 bit yr (0-127)
}
The OP refers to one side of a send/receive operation.
This proposal is based on the idea that both sides of that translation are amenable to revision.
It does not address the OP directly, but provides an alternative if both sides are still under development.
This requires only single byte data (suitable for narrow processors.)
Observation:
Oddly packing 6 values into 5 bytes with error prone jiggery-pokery.
Hallmark of a badly thought-out design.
Here is a reasonable (and cleaner) alternative.
| | b7 | b6 | b5 | b4 | b3 | b2 | b1 | b0 | range |
| ------ | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| byte 0 | yr3 | yr2 | yr1 | yr0 | mo | mo | mo | mo | mo max 12; yr max 127 from bias |
| byte 1 | yr6 | yr5 | yr4 | dy | dy | dy | dy | dy | dy max 31 |
| byte 2 | | | | hr | hr | hr | hr | hr | hr max 23 |
| byte 3 | | | mi | mi | mi | mi | mi | mi | mi max 59 |
| byte 4 | | | sc | sc | sc | sc | sc | sc | sc max 59 |
And, in code:
typedef struct {
    uint16_t yr; // full year: bias + [0-127]. 7 bits are packed
    uint8_t mo; // 1-12. 4 bits
    uint8_t dy; // 1-31. 5 bits
    uint8_t hr; // 0-23. 5 bits
    uint8_t mn; // 0-59. 6 bits
    uint8_t sc; // 0-59. 6 bits
} ymdhms_t;
void pack(unsigned char pyld[5], const ymdhms_t *dt)
{
    // year biased into range 0-127 (eg 2022 - 1980 = 42 )
    // bias is the base year and is assumed to be defined elsewhere
    uint8_t yr = (uint8_t)(dt->yr - bias);
    pyld[0] = dt->mo | (yr & 0x0F) << 4; // mask unnecessary
    pyld[1] = dt->dy | (yr & 0x70) << 1;
    pyld[2] = dt->hr;
    pyld[3] = dt->mn;
    pyld[4] = dt->sc;
}
void unpack(const unsigned char pyld[5], ymdhms_t *dt)
{
    dt->mo = pyld[0] & 0x0F;
    dt->dy = pyld[1] & 0x1F;
    dt->hr = pyld[2];
    dt->mn = pyld[3];
    dt->sc = pyld[4];
    // parentheses are required: + binds tighter than >>
    dt->yr = bias + ((pyld[0] >> 4) | ((pyld[1] >> 1) & 0x70));
}
One might even ask if "shrinking 6 bytes to 5" is worth the effort when the useful bit density only rises from 69% to 82%.
If you want to write a portable program, then your current implementation is pretty much the way you have to do it. You can make it a bit easier on the eyes by defining some macros to handle the shifting and masking for you, but that's about it.
Using bitfields, you can offload most of that work to the compiler. Beware though that the compiler is free to implement bitfields in pretty much any way it wants. The resulting memory layout is not always portable between compilers - or even between ISAs using the same compiler.
As an example from the Linux kernel, here you can see how care must be taken to make sure that the CPU's endianness is taken into consideration, for example:
struct iphdr {
#if defined(__LITTLE_ENDIAN_BITFIELD)
__u8 ihl:4,
version:4;
#elif defined (__BIG_ENDIAN_BITFIELD)
__u8 version:4,
ihl:4;
#else
#error "Please fix <asm/byteorder.h>"
#endif
__u8 tos;
__be16 tot_len;
__be16 id;
__be16 frag_off;
__u8 ttl;
__u8 protocol;
__sum16 check;
__be32 saddr;
__be32 daddr;
/*The options start here. */
};
Whatever is chosen, code for 1) portable correctness, 2) clarity and 3) ease of maintenance.
Avoid premature optimization.
OP's code looks like a reasonable way to solve the issue - especially for narrow processors, yet it is challenging to review.
Perhaps use a wider type for the in-between part and let the compiler emit efficient code:
uint64_t datetime =
((uint64_t) getYear() << 26) |
((uint32_t) getMonth() << 22) |
((uint32_t) getDay() << 17) |
((uint32_t) getHour() << 12) |
( getMinute() << 6) |
( getSecond() << 0);
payload[0] = datetime >> 32;
payload[1] = datetime >> 24;
payload[2] = datetime >> 16;
payload[3] = datetime >> 8;
payload[4] = datetime >> 0;
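or maybe end with (htobe64() comes from the non-standard <endian.h>)
datetime = htobe64(datetime);
memcpy(message, (unsigned char *)&datetime + 3, 5); // the 5 low-order bytes sit at offsets 3..7 once big-endian
Could add a mask, like getMonth() --> (getMonth() & 0x0Fu), if concerned about out-of-range values from the get functions.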
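The wide-accumulator idea can be sketched as a pack/unpack pair for the 5-byte layout in the question. This is only an illustration: dt_fields, pack_datetime and unpack_datetime are names invented here, the values are passed in rather than read from getYear() and friends, and the year is assumed to already fit in 7 bits.

```c
#include <stdint.h>

/* Hypothetical container for the six fields (not part of the question). */
typedef struct {
    unsigned year, month, day, hour, minute, second;
} dt_fields;

/* Build the 33-bit value right-aligned in 40 bits, then emit it
   most significant byte first, matching byte 0 .. byte 4 of the table. */
void pack_datetime(unsigned char out[5], const dt_fields *f)
{
    uint64_t v = ((uint64_t)(f->year   & 0x7Fu) << 26) |  /* 7 bits */
                 ((uint64_t)(f->month  & 0x0Fu) << 22) |  /* 4 bits */
                 ((uint64_t)(f->day    & 0x1Fu) << 17) |  /* 5 bits */
                 ((uint64_t)(f->hour   & 0x1Fu) << 12) |  /* 5 bits */
                 ((uint64_t)(f->minute & 0x3Fu) <<  6) |  /* 6 bits */
                  (uint64_t)(f->second & 0x3Fu);          /* 6 bits */
    for (int i = 0; i < 5; i++)
        out[i] = (unsigned char)((v >> (8 * (4 - i))) & 0xFFu);
}

/* Reverse of pack_datetime: reassemble the value, then slice out fields. */
void unpack_datetime(const unsigned char in[5], dt_fields *f)
{
    uint64_t v = 0;
    for (int i = 0; i < 5; i++)
        v = (v << 8) | in[i];
    f->year   = (unsigned)((v >> 26) & 0x7Fu);
    f->month  = (unsigned)((v >> 22) & 0x0Fu);
    f->day    = (unsigned)((v >> 17) & 0x1Fu);
    f->hour   = (unsigned)((v >> 12) & 0x1Fu);
    f->minute = (unsigned)((v >>  6) & 0x3Fu);
    f->second = (unsigned)( v        & 0x3Fu);
}
```

Each field appears once with its width and its shift, so the layout can be checked against the table line by line, and the byte loop avoids any dependence on host endianness.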

How to create a responsive table using C?

I want to create a responsive table using C, not C++ or C#, only plain old C.
Basically, I print a multiplication table and draw lines and borders with the symbols +, - and |, but when a number is more than one digit wide, the borders become misaligned. I would like to know a way to make the lines follow the width of the numbers. My code, the actual output and the desired output:
int endTab, selecNum, CurrentRes;
printf("\n\t+----------------------+");
printf("\n\t| multiplication table |");
printf("\n\t+----------------------+\n\n\n");
printf("Enter the table number:");
scanf("%d", &selecNum);
printf("Enter which number will end in:");
scanf("%d", &endTab);
printf("\n\t+-------+---+\n");
// | 1 x 2 | 2 |
for (int i = 1; i <= endTab; i++){
CurrentRes = i*selecNum;
printf("\t| %d x %d | %d |\n", i, selecNum, CurrentRes);
printf("\t+-------+---+\n");
}
return 0;
current output
+----------------------+
| multiplication table |
+----------------------+
Enter the table number:1
Enter which number will end in:10
+-------+---+
| 1 x 1 | 1 |
+-------+---+
| 2 x 1 | 2 |
+-------+---+
| 3 x 1 | 3 |
+-------+---+
| 4 x 1 | 4 |
+-------+---+
| 5 x 1 | 5 |
+-------+---+
| 6 x 1 | 6 |
+-------+---+
| 7 x 1 | 7 |
+-------+---+
| 8 x 1 | 8 |
+-------+---+
| 9 x 1 | 9 |
+-------+---+
| 10 x 1 | 10 |
+-------+---+
expected output
+----------------------+
| multiplication table |
+----------------------+
Enter the table number:1
Enter which number will end in:10
+--------+----+
| 1 x 1 | 1 |
+--------+----+
| 2 x 1 | 2 |
+--------+----+
| 3 x 1 | 3 |
+--------+----+
| 4 x 1 | 4 |
+--------+----+
| 5 x 1 | 5 |
+--------+----+
| 6 x 1 | 6 |
+--------+----+
| 7 x 1 | 7 |
+--------+----+
| 8 x 1 | 8 |
+--------+----+
| 9 x 1 | 9 |
+--------+----+
| 10 x 1 | 10 |
+--------+----+
Things to note:
The output has two columns and you have to maintain width of both the columns for each row.
The maximum width of column 1 is the width of selecNum x endTab including the leading and trailing space characters.
The maximum width of column 2 is the width of the result of selecNum x endTab including the leading and trailing spaces.
The length of separator after every row will be based on the maximum width of both the columns.
+---------------+-------+
\ / \ /
+-----------+ +---+
| |
max width max width
of col 1 of col 2
You can do:
#include <stdio.h>
#define SPC_CHR ' '
#define BIND_CHR '+'
#define HORZ_SEP_CH '-'
#define VERT_SEP_CH '|'
#define MULT_OP_SIGN 'x'
void print_label (void) {
printf("\n\t+----------------------+");
printf("\n\t| multiplication table |");
printf("\n\t+----------------------+\n\n\n");
}
void print_char_n_times (char ch, int n){
for (int i = 0; i < n; ++i) {
printf ("%c", ch);
}
}
void print_row_sep (int max_w1, int max_w2) {
printf ("\t%c", BIND_CHR);
print_char_n_times (HORZ_SEP_CH, max_w1);
printf ("%c", BIND_CHR);
print_char_n_times (HORZ_SEP_CH, max_w2);
printf ("%c\n", BIND_CHR);
}
void print_multiplication_row (int m1, int m2, int max_w1, int max_w2) {
printf ("\t%c", VERT_SEP_CH);
int nc = printf ("%c%d%c%c%c%d%c", SPC_CHR, m1, SPC_CHR, MULT_OP_SIGN, SPC_CHR, m2, SPC_CHR);
if (nc < max_w1) {
print_char_n_times (SPC_CHR, max_w1 - nc);
}
printf ("%c", VERT_SEP_CH);
nc = printf ("%c%d%c", SPC_CHR, m1 * m2, SPC_CHR);
if (nc < max_w2) {
print_char_n_times (SPC_CHR, max_w2 - nc);
}
printf ("%c\n", VERT_SEP_CH);
}
void print_multiplication_table (int m1, int m2) {
int col1_max_width = snprintf (NULL, 0, "%c%d%c%c%c%d%c", SPC_CHR, m1, SPC_CHR, MULT_OP_SIGN, SPC_CHR, m2, SPC_CHR);
int col2_max_width = snprintf (NULL, 0, "%c%d%c", SPC_CHR, m1 * m2, SPC_CHR);
for (int i = 0; i < m2; ++i) {
print_row_sep (col1_max_width, col2_max_width);
print_multiplication_row(m1, i + 1, col1_max_width, col2_max_width);
}
print_row_sep (col1_max_width, col2_max_width);
}
int main (void) {
int endTab, selecNum;
print_label();
printf("Enter the table number: ");
scanf("%d", &selecNum);
printf("Enter which number will end in: ");
scanf("%d", &endTab);
print_multiplication_table (selecNum, endTab);
return 0;
}
Output:
% ./a.out
+----------------------+
| multiplication table |
+----------------------+
Enter the table number: 1
Enter which number will end in: 10
+--------+----+
| 1 x 1 | 1 |
+--------+----+
| 1 x 2 | 2 |
+--------+----+
| 1 x 3 | 3 |
+--------+----+
| 1 x 4 | 4 |
+--------+----+
| 1 x 5 | 5 |
+--------+----+
| 1 x 6 | 6 |
+--------+----+
| 1 x 7 | 7 |
+--------+----+
| 1 x 8 | 8 |
+--------+----+
| 1 x 9 | 9 |
+--------+----+
| 1 x 10 | 10 |
+--------+----+
Note that if you want output in the way you have shown, i.e. like this -
+--------+----+
| 1 x 1 | 1 |
+--------+----+
| 2 x 1 | 2 |
+--------+----+
| 3 x 1 | 3 |
+--------+----+
....
....
+--------+----+
| 10 x 1 | 10 |
+--------+----+
then make following change in the statement of for loop of function print_multiplication_table():
print_multiplication_row(i + 1, m1, col1_max_width, col2_max_width);
^^^^^^^^^
arguments swapped
A couple of points:
If you want the numbers in the first column to line up one below the other, calculate the width of the largest number entered by the user and use it as the field width when printing each multiplication row.
Above program is just to show you the way to get the output in desired form. Leaving it up to you to do all sort of optimisations that you can do.
Read about printf() family functions. Read about sprintf(), snprintf() and their return type etc.
This is not the complete solution, but you may be able to work out exactly what you want/need based on the ideas here. The key ingredient is that log10(), a math library function, tells you how much horizontal space a number needs. Feed it the largest value in each of the three number columns and you can determine the required widths from that.
#include <stdio.h>
#include <math.h> // for log10()
#include <string.h> // for strlen(), memset()
int demo( int m0, int m1 ) {
char buf[ 100 ]; // adequate
int wid0 = (int)( log10( m0 ) + 1);
int wid1 = (int)( log10( m1 ) + 1);
int widR = (int)( log10( m0 * m1 ) + 1);
int need = 0;
need++; // left 'box'
need++; // space
need += wid0; // first number
need += strlen( " x " ); // mult
need += wid1; // second number
need += strlen( " | " ); // middle box
need += widR; // result
need++; // space
need++; // right 'box'
memset( buf, '\0', sizeof buf ); // start null
memset( buf, '-', need );
puts( buf );
printf( "| %*d x %*d | %*d |\n\n", wid0, m0, wid1, m1, widR, m0 * m1 );
return 0;
}
int main() {
demo( 24, 25 );
demo( 15, 456 );
return 0;
}
Output:
-----------------
| 24 x 25 | 600 |
-------------------
| 15 x 456 | 6840 |
Use the %n directive to gather how many bytes have been printed up to a point and work from there to write your '-' printing loop, for example:
int field_width;
snprintf(NULL, 0, "%d%n\n", 420, &field_width);
char horizontal[field_width + 1];
memset(horizontal, '-', field_width);
horizontal[field_width] = '\0';
Now you can print a horizontal string that's the same width as 420 when printed. Part of your problem is solved by this.
I've adapted my initial example to use snprintf because it occurs to me that you need to work out the column widths from the largest numbers first. In your print loop you'll want to pad out each value to field_width wide; you could use %*d (right justified, space padded) or %-*d (left justified, space padded) or %.*d (zero prefix padded), depending on your choice, for example:
printf("%*d\n", field_width, 1);
... and there's the rest of your problem solved, if I am correct.
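Putting the two ideas from the answers together (measure the widest value with snprintf, then pad with %*d), a row and its separator can be produced from the same three widths. This is a rough sketch rather than a drop-in replacement: int_width, format_row and format_sep are hypothetical helper names, and the output goes into a caller-supplied buffer so that it is easy to check.

```c
#include <stdio.h>
#include <string.h>

/* Number of characters "%d" would print for v (sign included). */
int int_width(int v)
{
    return snprintf(NULL, 0, "%d", v);
}

/* Writes "| a x b | a*b |" with each number right-aligned in its width.
   Returns the length of the row, as snprintf does. */
int format_row(char *out, size_t cap, int a, int b, int wa, int wb, int wr)
{
    return snprintf(out, cap, "| %*d x %*d | %*d |", wa, a, wb, b, wr, a * b);
}

/* Writes the matching "+-----+---+" separator, or returns -1 if it does not fit. */
int format_sep(char *out, size_t cap, int wa, int wb, int wr)
{
    int left  = wa + wb + 5;       /* " a x b " : space, a, " x ", b, space */
    int right = wr + 2;            /* " r "     : space, result, space */
    int total = left + right + 3;  /* plus the three '+' characters */
    if ((size_t)total + 1 > cap)
        return -1;
    out[0] = '+';
    memset(out + 1, '-', (size_t)left);
    out[1 + left] = '+';
    memset(out + 2 + left, '-', (size_t)right);
    out[2 + left + right] = '+';
    out[total] = '\0';
    return total;
}
```

For a table of selecNum up to endTab, the widths would come from the largest values, for example int_width(endTab), int_width(selecNum) and int_width(endTab * selecNum), computed once before the print loop.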

Why does this type of power function work?

res = 1;
for ( i = 1; i <= n; i <<= 1 ) // n = exponent
{
if ( n & i )
res *= a; // a = base
a *= a;
}
This should be more efficient code for computing a power, and I don't know why it works.
The first line of the for() is fine; I know why there is i <<= 1. But I don't understand the line if ( n & i ). I know how it works, but I don't know why...
Let us say you have a binary representation of an unsigned number. How do you find the decimal representation?
Let us take a simple four bit example:
N = | 0 | 1 | 0 | 1 |
-----------------------------------------
| 2^3 = 8 | 2^2 = 4 | 2^1 = 2 | 2^0 = 1 |
-----------------------------------------
| 0 | 4 | 0 | 1 | N = 4 + 1 = 5
Now what would happen if the base wasn't fixed at 2 for each bit but instead was the square of the previous bit and you multiply the contribution from each bit instead of adding:
N = | 0 | 1 | 0 | 1 |
----------------------------
| a^8 | a^4 | a^2 | a^1 |
----------------------------
| 0 | a^4 | 0 | a^1 | N = a^4 * a^1 = a^(4+1) = a^5
As you can see, the code calculates a^N.
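The same idea can be written as a small function with the reasoning in comments. This is a sketch, not the code from the question: the name ipow is made up here, and the loop shifts the exponent right instead of shifting a mask left, which sidesteps the mask overflowing when the top bit of n is set.

```c
#include <stdint.h>

/* Computes a^n by walking the bits of n from least to most significant.
   Overflow wraps modulo 2^64; no attempt is made to detect it. */
uint64_t ipow(uint64_t a, unsigned n)
{
    uint64_t res = 1;
    while (n != 0) {
        if (n & 1u)   /* this bit of the exponent is set ... */
            res *= a; /* ... so the current power of the base contributes */
        a *= a;       /* a, a^2, a^4, a^8, ... one squaring per bit */
        n >>= 1;      /* move on to the next bit */
    }
    return res;
}
```

For n = 5 (binary 101) the loop multiplies res by a^1 and by a^4 and skips a^2, giving a^5 as in the table above.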

Weighted random integers

I want to assign weightings to a randomly generated number, with the weightings represented below.
0 | 1 | 2 | 3 | 4 | 5 | 6
─────────────────────────────────────────
X | X | X | X | X | X | X
X | X | X | X | X | X |
X | X | X | X | X | |
X | X | X | X | | |
X | X | X | | | |
X | X | | | | |
X | | | | | |
What's the most efficient way to do it?
@Kerrek's answer is good.
But if the histogram of weights is not all small integers, you need something more powerful:
Divide [0..1] into intervals sized with the weights. Here you need segments with relative size ratios 7:6:5:4:3:2:1. So the size of one interval unit is 1/(7+6+5+4+3+2+1)=1/28, and the sizes of the intervals are 7/28, 6/28, ... 1/28.
These comprise a probability distribution because they sum to 1.
Now find the cumulative distribution:
P x
7/28 => 0
13/28 => 1
18/28 => 2
22/28 => 3
25/28 => 4
27/28 => 5
28/28 => 6
Now generate a random number r in [0..1] and look it up in this table by finding the smallest x such that r <= P(x). This is the random value you want.
The table lookup can be done with binary search, which is a good idea when the histogram has many bins.
Note you are effectively constructing the inverse of the cumulative distribution function, so this is sometimes called the method of inverse transforms.
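An integer variant of the lookup described above might look as follows. It is a sketch under assumptions: the weights are positive integers, r is drawn from [0, total) instead of [0..1] (so the comparison becomes r < cum[x]), and build_cumulative, lookup and weighted_random are names invented for this example. rand() % total has a slight modulo bias, which is ignored here.

```c
#include <stdlib.h>

/* Fills cum[i] = weights[0] + ... + weights[i] and returns the total. */
int build_cumulative(const int *weights, int n, int *cum)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += weights[i];
        cum[i] = sum;
    }
    return sum;
}

/* Binary search for the smallest index x with r < cum[x].
   Requires 0 <= r < cum[n - 1]. */
int lookup(const int *cum, int n, int r)
{
    int lo = 0, hi = n - 1;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (r < cum[mid])
            hi = mid;      /* answer is mid or to its left */
        else
            lo = mid + 1;  /* answer is strictly to the right */
    }
    return lo;
}

/* Draws one weighted random index using the C library's rand(). */
int weighted_random(const int *cum, int n)
{
    int r = rand() % cum[n - 1];
    return lookup(cum, n, r);
}
```

With weights 7, 6, 5, 4, 3, 2, 1 the cumulative array is 7, 13, 18, 22, 25, 27, 28, the same numerators as in the table above.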
If your array is small, just pick a uniform random index into the following array:
int a[] = {0,0,0,0,0,0,0, 1,1,1,1,1,1, 2,2,2,2,2, 3,3,3,3, 4,4,4, 5,5, 6};
If you want to generate the distribution at runtime, use std::discrete_distribution.
To get the distribution you want, first you basically add up the count of X's you wrote in there. You can do it like this (my C is super rusty, so treat this as pseudocode)
int num_cols = 7; // for your example
int max;
if (num_cols % 2 == 0) // even
{
max = (num_cols+1) * (num_cols/2);
}
else // odd
{
max = (num_cols+1) * (num_cols/2) + ((num_cols+1)/2);
}
Then you need to randomly select an integer between 1 and max inclusive.
So if your random integer is r the last step is to find which column holds the r'th X. Something like this should work:
for(int i=0;i<num_cols;i++)
{
r -= (num_cols-i);
if (r < 1) return i;
}

Loop unrolling and its effects on pipelining and CPE (have the solution, but don't understand it)

Below the line is a question on a practice test. The table actually has all the solutions filled in. However, I need clarification upon why the solutions are what they are. (Read the question below the horizontal line).
For example, I would really like to understand the solution row for A2 and A3.
As I see it, you have the following situation going on in A2:
x * y
xy * r
xyr * z
Now, let's look at how that'd be in the pipeline:
|1|2|3|4|5|6|7|8 |9|10|11|12|13|14|15|16|17|18|19|20|21|
| | | | | | | | | | | | | | | | | | | | | |
{ x * y } | | | | | | | | | | | | | | | | |
{ xy * r } | | | | | | | | | | | | |
{ xyr * z } | | | | | | | | |
//next iteration, which means different x, y and z's| |
{x2 * y2 } | | | | | | | |
{x2y2 * r } // this is dependent on both previous r and x2y2
{x2y2r * z }
So we are able to overlap xyr * z and x2 * y2, because there are no dependency conflicts. However, that is only getting rid of 3 cycles right?
So it would still be (12 - 3) / 3 = 9 / 3 = 3 Cycles Per Element (three elements). So how are they getting 8/3 CPE for A2?
Any help understanding this concept will be greatly appreciated! There's not a big rush, as the test isn't til next week. If there is any other information you need, please let me know!
(Below is the full test question text, along with the table completely filled in with the solutions)
Consider the following function for computing the product of an array of n integers.
We have unrolled the loop by a factor of 3.
int prod(int a[], int n) {
int i, x, y, z;
int r = 1;
for(i = 0; i < n-2; i += 3) {
x = a[i]; y = a[i+1]; z = a[i+2];
r = r * x * y * z; // Product computation
}
for (; i < n; i++)
r *= a[i];
return r;
}
For the line labeled Product computation, we can use parentheses to create five different
associations of the computation, as follows:
r = ((r * x) * y) * z; // A1
r = (r * (x * y)) * z; // A2
r = r * ((x * y) * z); // A3
r = r * (x * (y * z)); // A4
r = (r * x) * (y * z); // A5
We express the performance of the function in terms of the number of cycles per element
(CPE). As described in the book, this measure assumes the run time, measured in clock
cycles, for an array of length n is a function of the form Cn + K, where C is the CPE.
We measured the five versions of the function on an Intel Pentium III. Recall that the integer multiplication operation on this machine has a latency of 4 cycles and an issue time of 1 cycle.
The following table shows some values of the CPE, and other values missing. The measured
CPE values are those that were actually observed. “Theoretical CPE” means that performance
that would be achieved if the only limiting factor were the latency and issue time of
the integer multiplier.
Fill in the missing entries. For the missing values of the measured CPE, you can use the
values from other versions that would have the same computational behavior. For the values
of the theoretical CPE, you can determine the number of cycles that would be required for
an iteration considering only the latency and issue time of the multiplier, and then divide by 3.
Without knowing the CPU architecture, we can only guess.
My interpretation would be that the timing diagram only shows part of the pipeline, from gathering the operands to writing the result, because this is what is relevant to dependency resolution.
Now, the big if: If there is a buffer stage between the dependency resolver and the execution units, it would be possible to start the third multiplication of the first group (3) and the first multiplication of the second group (4) both at offset 8.
As 3 is dependent on 2, it does not make sense to use a different unit here, so 3 is queued to unit 1 right after 2. The following instruction, 4 is not dependent on a previous result, so it can be queued to unit 2, and started in parallel.
In theory, this could happen as early as cycle 6, giving a CPE of 6/3. In practice, that is dependent on the CPU design.
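One way to read the 8/3 figure quoted for A2 is to count only the multiplications that depend on r. The sketch below rewrites A2 with an explicit temporary (t and the function name prod_a2 are introduced here for illustration). It assumes, as the question's definition of theoretical CPE does, that only the multiplier's latency and issue time matter, so a multiplication that does not read r can overlap with earlier work.

```c
/* Association A2, r = (r * (x * y)) * z, with the temporary made explicit. */
int prod_a2(const int a[], int n)
{
    int i;
    int r = 1;
    for (i = 0; i < n - 2; i += 3) {
        int t = a[i] * a[i + 1]; /* does not read r: may start before the previous r is ready */
        r = r * t;               /* on the r chain: one 4-cycle latency */
        r = r * a[i + 2];        /* on the r chain: a second 4-cycle latency */
    }
    for (; i < n; i++)           /* leftover elements */
        r *= a[i];
    return r;
}
```

Under those assumptions each iteration adds two dependent multiplications to the chain through r, which is 2 * 4 = 8 cycles for 3 elements. The same counting would put three multiplications on the chain for A1 and one for A3.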

Resources