The max number of digits in an int based on number of bits - c

So, I needed a constant value to represent the max number of digits in an int, and it needed to be calculated at compile time to pass into the size of a char array.
To add some more detail: the compiler/machine I'm working with supports only a very limited subset of the C language, so none of the standard library headers work, as they rely on unsupported features. As such I cannot use INT_MIN/INT_MAX: I can neither include the header that defines them, nor are they otherwise defined.
I need a compile time expression that calculates the size. The formula I came up with is:
((sizeof(int) / 2) * 3 + sizeof(int)) + 2
Based on hand calculation, it is only marginally successful for n-byte integers:
sizeof(int)   INT_MAX                characters   formula
2             32767                  5            7
4             2147483647             10           12
8             9223372036854775807    19           22

You're looking for a result related to a logarithm of the maximum value of the integer type in question (which logarithm depends on the radix of the representation whose digits you want to count). You cannot compute exact logarithms at compile time, but you can write macros that estimate them, or that compute a close enough upper bound, for your purposes. For example, see How to compute log with the preprocessor.
It is useful also to know that you can convert between logarithms in different bases by multiplying by appropriate constants. In particular, if you know the base-a logarithm of a number and you want the base-b logarithm, you can compute it as
logb(x) = loga(x) / loga(b)
Your case is a bit easier than the general one, though. For the dimension of an array that is not a variable-length array, you need an "integer constant expression". Furthermore, your result does not need more than two digits of precision (three if you wanted the number of binary digits) for any built-in integer type you'll find in a C implementation, and it seems like you need only a close enough upper bound.
Moreover, you get a head start from the sizeof operator, which can appear in integer constant expressions and which, when applied to an integer type, gives you an upper bound on the base-256 logarithm of values of that type (supposing that CHAR_BIT is 8). This estimate is very tight if every bit is a value bit, but signed integers have a sign bit, and they may have padding bits as well, so this bound is a bit loose for them.
If you want a bound on the number of digits in a power-of-two radix then you can use sizeof pretty directly. Let's suppose, though, that you're looking for the number of decimal digits. Mathematically, the maximum number of digits in the decimal representation of an int is
N = ceil(log10(INT_MAX))
or
N = floor(log10(INT_MAX)) + 1
provided that INT_MAX is not a power of 10. Let's express that in terms of the base-256 logarithm:
N = floor( log256(INT_MAX) / log256(10) ) + 1
Now, log256(10) cannot be part of an integer constant expression, but it or its reciprocal can be pre-computed: 1 / log256(10) = 2.40824 (to a pretty good approximation; the actual value is slightly less). Now, let's use that to rewrite our expression:
N <= floor( sizeof(int) * 2.40824 ) + 1
That's not yet an integer constant expression, but it's close. This expression is an integer constant expression, and a good enough approximation to serve your purpose:
N = 241 * sizeof(int) / 100 + 1
Here are the results for various integer sizes:
sizeof(int)   INT_MAX                True N   Computed N
1             127                    3        3
2             32767                  5        5
4             2147483647             10       10
8             9223372036854775807    19       20
(The values in the INT_MAX and True N columns suppose one of the allowed forms of signed representation, and no padding bits; the former and maybe both will be smaller if the representation contains padding bits.)
I presume that, in the unlikely event that you encounter a system with 8-byte ints, the one extra byte you provide for your digit array will not break you. The discrepancy arises because a signed 64-bit integer has (at most) 63 value bits while the formula accounts for 64, so in that case sizeof(int) overestimates the base-256 log of INT_MAX by a bit too much. The formula gives exact results for unsigned int up to at least size 8, provided there are no padding bits.
As a macro, then:
// Expands to an integer constant expression evaluating to a close upper bound
// on the number of decimal digits in a value expressible in the integer type
// given by the argument (if it is a type name) or the integer type of the
// argument (if it is an expression). The meaning of the resulting expression
// is unspecified for other arguments.
#define DECIMAL_DIGITS_BOUND(t) (241 * sizeof(t) / 100 + 1)
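For instance, a possible use (illustrative only; the array name is made up, and the +2 leaves room for a sign and a terminating NUL, matching the question's own allowance):
// Hypothetical usage of the macro above: digits, a possible '-', and the NUL.
char digits[DECIMAL_DIGITS_BOUND(int) + 2];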

An upper bound on the number of decimal digits an int may produce depends on INT_MIN.
// Mathematically
max_digits = ceil(log10(-INT_MIN))
It is easier to use the bit width of the int, since that bounds log2(-INT_MIN): sizeof(int)*CHAR_BIT - 1 is the maximum number of value bits in an int.
// Mathematically
max_digits = ceil((sizeof(int)*CHAR_BIT - 1)* log10(2))
// log10(2) --> ~ 0.30103
On rare machines, int has padding bits, so the above will overestimate.
In integer math we can stand in for log10(2), which is about 0.30103, with the slightly larger fraction 1/3; being larger keeps the result an upper bound.
As a macro, perform integer math and add 1 for the ceiling
#include <limits.h>
#define INT_DIGIT10_WIDTH ((sizeof(int)*CHAR_BIT - 1)/3 + 1)
To account for a sign and a null character, add 2. The version below uses a very tight fraction for log10(2) so as not to over-allocate the buffer:
#define INT_STRING_SIZE ((sizeof(int)*CHAR_BIT - 1)*28/93 + 3)
Note 28/93 = 0.3010752... > log10(2)
The size needed for any base down to base 2 follows below. It is interesting that +2 is needed and not +1: consider that a 2-bit signed number in base 2 could be "-10", which needs a size of 4 (sign, two digits, and the null terminator).
#define INT_STRING2_SIZE (sizeof(int)*CHAR_BIT + 2)
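To illustrate that the base-2 case really does fill every byte of INT_STRING2_SIZE, here is a small sketch (my own example; the helper name int_to_base2 is made up, and the standard headers are used only for the demonstration):
#include <limits.h>
#include <stdio.h>

#define INT_STRING2_SIZE (sizeof(int)*CHAR_BIT + 2)

// Hypothetical helper: writes v in base 2 (sign, digits, NUL) into buf.
static void int_to_base2(int v, char *buf) {
    char *p = buf;
    unsigned u = (unsigned)v;
    if (v < 0) {
        *p++ = '-';
        u = 0u - u;                       // magnitude, computed in unsigned math
    }
    int bit = (int)(sizeof u * CHAR_BIT) - 1;
    while (bit > 0 && !((u >> bit) & 1u)) // skip leading zero bits
        --bit;
    while (bit >= 0)
        *p++ = (char)('0' + ((u >> bit--) & 1u));
    *p = '\0';
}

int main(void) {
    char buf[INT_STRING2_SIZE];
    int_to_base2(INT_MIN, buf);           // worst case: sign + every value bit
    puts(buf);
    return 0;
}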

Boringly, I think you need to hardcode this, centred around inspecting sizeof(int) and consulting your compiler documentation to see what kind of int you actually have. (All the C standard specifies is that it can't be smaller than a short, that it must have a range of at least -32767 to +32767, and that one's complement, two's complement, or sign-magnitude representation may be chosen. The manner of storage is otherwise arbitrary, although big and little endianness are common.) Note that an arbitrary number of padding bits is allowed, so you can't, in full generality, deduce the number of decimal digits from sizeof.
C doesn't support the level of compile time evaluable constant expressions you'd need for this.
So hardcode it and make your code intentionally brittle so that compilation fails if a compiler encounters a case that you have not thought of.
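For example, a sketch of that approach (my own illustration; it uses the old negative-array-size trick rather than a standard header, since the question rules those out):
// Hardcoded for the platform actually targeted: 4-byte int, at most 10 digits,
// plus a sign and a terminator.
#define INT_DIGITS   10
#define INT_STR_SIZE (INT_DIGITS + 2)

// Intentionally brittle: if this is ever compiled where int is not 4 bytes,
// the array below has a negative size and compilation fails.
typedef char assert_int_is_4_bytes[(sizeof(int) == 4) ? 1 : -1];

char buffer[INT_STR_SIZE];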
You could solve this in C++ using constexpr and metaprogramming techniques.

((sizeof(int) / 2) * 3 + sizeof(int)) + 2
is the formula I came up with.
The +2 is for the negative sign and the null terminator.

If we suppose that ints are either 2, 4, or 8 bytes wide, and we take the respective maximum digit counts to be 5, 10, and 20, then an integer constant expression yielding the exact values could be written as follows:
const int digits = (sizeof(int)==8) ? 20 : ((sizeof(int)==4) ? 10 : 5);
int testArray[digits];
I hope that I did not miss something essential. I've tested this at file scope.

Related

Multiplication of 2 numbers with a maximum of 2000 digits [duplicate]

Implement a program to multiply two numbers, with the mention that the first can have a maximum of 2048 digits, and the second number is less than 100. HINT: multiplication can be done using repeated additions.
Up to a certain point, the program works using long double, but when working with larger numbers, only INF is displayed. Any ideas?
Implement a program to multiply two numbers, with the mention that the first can have a maximum of 2048 digits, and the second number is less than 100.
OK. The nature of multiplication is that if a number with N digits is multiplied by a number with M digits, then the result will have up to N+M digits. In other words, you need to handle a result with up to 2148 decimal digits.
A long double could be anything (it's implementation dependent). Most likely (on Windows, or on hardware other than 80x86) it's a synonym for double, but sometimes it might be larger (e.g. the 80-bit format described on this Wikipedia page). The best you can realistically hope for is a dodgy estimate with lots of precision loss, not a correct result.
The worst case (and the most likely case) is that the exponent isn't big enough either. E.g. for double the (unbiased) exponent has to be in the range −1022 to +1023, so attempting to shove a 2048-digit number in there will cause an overflow (an infinity).
What you're actually being asked to do is implement a program that uses "big integers". The idea would be to store the numbers as arrays of integers, like uint32_t result[2148 * 4 / 32 + 1]; (four bits per decimal digit is a comfortable upper bound), so that you actually do have enough bits to get a correct result without precision loss or overflow problems.
With this in mind, you want a multiplication algorithm that can work with big integers. Note: I'd recommend something from the "Algorithms for multiplying by hand" section of the Wikipedia page on multiplication algorithms - there are faster/more advanced algorithms that are way too complicated for (what I assume is) a university assignment.
Also, the "HINT: multiplication can be done using repeated additions" is a red herring to distract you. It'd take literally days for a computer do the equivalent of a while(source2 != 0) { result += source1; source2--; } with large numbers.
Here's a few hints.
Multiplying a 2048-digit string by a 100-digit string might yield a string with as many as 2148 digits. That's far too large for any primitive C type, so you'll have to do all the math the hard way, against "strings". Stay in string space, since your input will most likely be read in as a string anyway.
Let's say you are trying to multiple "123456" x "789".
That's equivalent to 123456 * (700 + 80 + 9)
Which is equivalent to 123456 * 700 + 123456 * 80 + 123456 * 9
Which is equivalent to doing these steps:
result1 = Multiply 123456 by 7 and add two zeros at the end
result2 = Multiply 123456 by 8 and add one zero at the end
result3 = Multiply 123456 by 9
final result = result1+result2+result3
So all you need is a handful of primitives that can take a digit string of arbitrary length and do some math operations on it.
You just need these three functions:
// Returns a new string that is identical to s but with a specific number of
// zeros added to the end.
// e.g. MultiplyByPowerOfTen("123", 3) returns "123000"
char* MultiplyByPowerOfTen(char* s, size_t zerosToAdd)
{
    /* ... */
}
// Performs multiplication on the big integer represented by s
// by the specified digit
// e.g. Multiply("12345", 2) returns "24690"
char* Multiply(char* s, int digit) // where digit is between 0 and 9
{
    /* ... */
}
// Performs addition on the big integers represented by s1 and s2
// e.g. Add("12345", "678") returns "13023"
char* Add(char* s1, char* s2)
{
    /* ... */
}
Final hint. Any character at position i in your string can be converted to its integer equivalent like this:
int digit = s[i] - '0';
And any digit can be converted back to a printable char:
char c = '0' + digit;
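As a rough sketch of the Multiply primitive (my own illustration, not part of the original hints; it assumes most-significant-digit-first strings and returns a malloc'd result the caller must free):
#include <stdlib.h>
#include <string.h>

// Multiplies the decimal string s ("12345") by a single digit 0..9 and
// returns a newly allocated decimal string ("24690" for digit 2).
char* Multiply(char* s, int digit)
{
    size_t len = strlen(s);
    char *out = malloc(len + 2);          // one possible carry digit + NUL
    if (out == NULL)
        return NULL;

    int carry = 0;
    size_t w = len + 1;                   // write right to left
    out[w] = '\0';
    for (size_t i = len; i-- > 0; ) {
        int prod = (s[i] - '0') * digit + carry;
        out[--w] = (char)('0' + prod % 10);
        carry = prod / 10;
    }
    if (carry != 0)
        out[--w] = (char)('0' + carry);
    if (w > 0)                            // no carry digit: close the gap
        memmove(out, out + w, len + 2 - w);
    return out;
}
MultiplyByPowerOfTen and Add follow the same pattern of walking the strings digit by digit while propagating a carry.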

32-bit fixed point arithmetic 1x1 does not equal 1

I'm implementing 32-bit signed integer fixed-point arithmetic. The scale is from 1 to -1, with INT32_MAX corresponding to 1. I'm not sure whether to make INT32_MIN or -INT32_MAX correspond to -1, but that's an aside for now.
I've made some operations to multiply and round, as follows:
#define mul(a, b) ((int64_t)(a) * (b))
#define round(x) (int32_t)((x + (1 << 30)) >> 31)
The product of two numbers can then be found using round(mul(a, b)).
The issue comes up when I check the identity.
The main problem is that 1x1 is not 1. It's INT32_MAX-1. That's obviously not desired as I would like bit-accuracy. I suppose this would affect other nearby numbers so the fix isn't a case of just adding 1 if the operands are both INT32_MAX.
Additionally, -1 x -1 is not 1, 1 x -1 is not -1, and -1 x 1 is not -1. So none of the identities hold up.
Is there a simple fix to this, or is this just a symptom of using fixed point arithmetic?
In its general form, a fixed-point format represents a number x as an integer x•s. Commonly, s is a power of some base b, s = b^p. For example, we might store a number of dollars x as x•100, so $3.45 might be stored as 345. Here we can easily see why this is called a “fixed-point” format: the stored number conceptually has a decimal point inserted at a fixed position, in this case two digits to the left of the rightmost digit: “345” is conceptually “3.45”. (This may also be called a radix point rather than a decimal point, allowing for cases when the base b is not ten. And p specifies where the radix point is inserted, p base-b digits from the right.)
If you make INT_MAX represent 1, then you are implicitly saying s = INT_MAX. (And, since INT_MAX is not a power of any other integer, we have b = INT_MAX and p = 1.) Then −1 must be represented by −1•INT_MAX = -INT_MAX. It would not be represented by INT_MIN (except in archaic C implementations where INT_MIN = -INT_MAX).
Given s = INT_MAX, shifting by 31 bits is not a correct way to implement multiplication. Given two numbers x and y with representations a and b, the representation of xy is computed by multiplying the representations a and b and dividing by s:
a represents x, so a = xs.
b represents y, so b = ys.
Then ab/s = (xs)(ys)/s = xys, and xys represents xy.
Shifting by 31 divides by 2^31, so that is not the same as dividing by INT_MAX. Also, division is generally slow in hardware. You may be better off choosing s = 2^30 instead of INT_MAX. Then you could shift by 30 bits.
When calculating ab/s, we often want to round. Adding ½s to the product before dividing is one method of rounding, but it is likely not what you want for negative products. You may want to consider adding −½s if the product is negative.
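As a rough sketch of that suggestion (my own example, assuming a Q1.30 format in which 1.0 is represented by 1 << 30):
#include <stdint.h>

// Fixed-point multiply for s = 2^30 (so 1.0 is 1 << 30). Rounds to nearest by
// biasing with +s/2 or -s/2 depending on the product's sign, then divides by s.
// The mathematical result must itself fit the format.
static int32_t q30_mul(int32_t a, int32_t b) {
    const int64_t s = (int64_t)1 << 30;
    const int64_t half = s / 2;
    int64_t p = (int64_t)a * b;           // exact product, at scale s*s
    return (int32_t)((p >= 0 ? p + half : p - half) / s);
}
With this choice of s, the identities from the question hold exactly: q30_mul(1 << 30, 1 << 30) is 1 << 30, and multiplying -(1 << 30) by itself also gives 1 << 30.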

Why is log base 10 used in this code to convert int to string?

I saw a post explaining how to convert an int to a string. In the explanation there is a line of code to get the number of chars in a string:
(int)((ceil(log10(num))+1)*sizeof(char))
I’m wondering why log base 10 is used?
ceil(log10(num))+1 is incorrectly being used instead of floor(log10(num))+2.
The code is attempting to determine the amount of memory needed to store the decimal representation of the positive integer num as a string.
The two formulas presented above are equal except for numbers which are exact powers of 10, in which case the former version returns one less than the desired number.
For example, 10,000 requires 6 bytes, yet ceil(log10(10000))+1 returns 5. floor(log10(10000))+2 correctly returns 6.
How was floor(log10(num))+2 obtained?
A 4-digit number such as 4567 will be between 1,000 (inclusive) and 10,000 (exclusive), so it will be between 10^3 (inclusive) and 10^4 (exclusive), so log10(4567) will be between 3 (inclusive) and 4 (exclusive).
As such, floor(log10(num))+1 will return number of digits needed to represent the positive value num in decimal.
As such, floor(log10(num))+2 will return the amount of memory needed to store the decimal representation of the positive integer num as a string. (The extra char is for the NUL that terminates the string.)
I’m wondering why log base 10 is used?
I'm wondering the same thing. It uses a very complex calculation that happens at runtime, to save a couple bytes of temporary storage. And it does it wrong.
In principle, you get the number of digits in base 10 by taking the base-10 logarithm and flooring and adding 1. It comes exactly from the fact that
log10(1) = log10(10⁰) = 0
log10(10) = log10(10¹) = 1
log10(100) = log10(10²) = 2
and all numbers between 10 and 100 have their logarithms between 1 and 2 so if you floor the logarithm for any two digit number you get 1... add 1 and you get the number of digits.
But you do not need to do this at runtime. The maximum number of bytes needed for a 32-bit int in base 10 is 12: 10 digits, a negative sign, and the null terminator. The most you can save with the runtime calculation is 10 bytes of RAM, and since the buffer is usually temporary it is not worth it. If it is stack memory, well, the calls to log10, ceil and so forth might require far more.
In fact, we know the maximum number of bits needed to represent an integer: sizeof (int) * CHAR_BIT. This is greater than or equal to log2(INT_MAX + 1). And we know that log2(x) ≈ 3.32192809489 * log10(x), i.e. log10(x) ≈ log2(x) / 3.32..., so we get a good (slightly over-estimated) approximation of log10(INT_MAX) by just dividing sizeof (int) * CHAR_BIT by 3. Then add 1, because we were supposed to add 1 to the floored logarithm to get the number of digits, then 1 for a possible sign, and 1 for the null terminator, and we get
sizeof (int) * CHAR_BIT / 3 + 3
Unlike the one from your question, this is an integer constant expression, i.e. the compiler can easily fold it at compile time, and it can be used to set the size of a statically-sized array. For 32 bits it gives 13, which is only one more than the 12 actually required; for 16 bits it gives 8, again only one more than the required maximum of 7; and for 8 bits it gives 5, which is the exact maximum.
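A small usage sketch of that constant (my own example; the macro and buffer names are made up):
#include <limits.h>
#include <stdio.h>

#define INT_DEC_SIZE (sizeof (int) * CHAR_BIT / 3 + 3)   // digits + sign + NUL

int main(void) {
    char buf[INT_DEC_SIZE];               // 13 bytes when int is 32 bits
    snprintf(buf, sizeof buf, "%d", INT_MIN);
    puts(buf);                            // prints -2147483648 on such a system
    return 0;
}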
ceil(log10(num)) + 1 is intended to provide the number of characters needed for the output string.
For example, if num=101, the expression's value is 4, the correct length of '101' plus the null terminator.
But if num=100, the value is 3. This behavior is incorrect.
This is because it's allocating enough space for the number to fit in the string.
If, for example, you had the number 1034, log10(1034) = 3.0145.... ceil(3.0145) is 4, which is the number of digits in the number. The + 1 is for the null-terminator.
This isn't perfect though: take 1000, for example. Despite having four digits, log10(1000) = 3, and ceil(3) = 3, so this will allocate space for too few digits. Plus, as @phuclv mentions below, the log() function is very time-consuming for this purpose, especially since the length of a number has a (relatively low) upper bound.
The reason it's log base 10 is because, presumably, this function represents the number in decimal form. If, for example, it were hexadecimal, log base 16 would be used.
A number N has n decimal digits iff 10^(n-1) <= N < 10^n which is equivalent to n-1 <= log(N) < n or n = floor(log(N)) + 1.
Since the double representation has only limited precision, floor(log(N)) may be off by 1 for certain values, so it is safer to allow for an extra digit, i.e. allocate floor(log(N)) + 2 characters, and then another char for the nul terminator, for a total of floor(log(N)) + 3.
The expression in the original question ceil(log(N)) + 1 appears to not count the nul terminator, and neither allow for the chance of rounding errors, so it is one shorter in general, and two shorter for powers of 10.

Subtracting 1 from 0 in 8 bit binary

I have 8 bit int zero = 0b00000000; and 8 bit int one = 0b00000001;
according to binary arithmetic rule,
0 - 1 = 1 (borrow 1 from next significant bit).
So if I have:
int s = zero - one;
s = -1;
-1 = 0b11111111;
Where are all those 1s coming from? There is nothing to borrow, since all bits are 0 in the zero variable.
This is a great question and has to do with how computers represent integer values.
If you’re writing out a negative number in base ten, you just write out the regular number and then prefix it with a minus sign. But if you’re working inside a computer where everything needs to either be a zero or a one, you don’t have any minus signs. The question then comes up of how you then choose to represent negative values.
One popular way of doing this is to use signed two’s complement form. The way this works is that you write the number using ones and zeros, except that the meaning of those ones and zeros differs from “standard” binary in how they’re interpreted. Specifically, if you have a signed 8-bit number, the lower seven bits have their standard meanings as 2^0, 2^1, 2^2, etc. However, the meaning of the most significant bit is changed: instead of representing 2^7, it represents the value -2^7.
So let’s look at the number 0b11111111. This would be interpreted as
-2^7 + 2^6 + 2^5 + 2^4 + 2^3 + 2^2 + 2^1 + 2^0
= -128 + 64 + 32 + 16 + 8 + 4 + 2 + 1
= -1
which is why this collection of bits represents -1.
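A small check in C (my own illustration) that the all-ones pattern in an 8-bit signed type really does compare equal to -1:
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int8_t s = (int8_t)0 - (int8_t)1;     // 0 - 1 in an 8-bit signed type
    uint8_t bits = (uint8_t)s;            // the same 8 bits, viewed unsigned

    printf("value = %d, bits = ", s);
    for (int i = 7; i >= 0; --i)
        putchar(((bits >> i) & 1u) ? '1' : '0');
    putchar('\n');                        // prints: value = -1, bits = 11111111
    return 0;
}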
There’s another way to interpret what’s going on here. Given that our integer only has eight bits to work with, we know that there’s no way to represent all possible integers. If you pick any 257 integer values, given that there are only 256 possible bit patterns, there’s no way to uniquely represent all these numbers.
To address this, we could alternatively say that we’re going to have our integer values represent not the true value of the integer, but the value of that integer modulo 256. All of the values we’ll store will be between 0 and 255, inclusive.
In that case, what is 0 - 1? It’s -1, but if we take that value mod 256 and force it to be nonnegative, then we get back that -1 = 255 (mod 256). And how would you write 255 in binary? It’s 0b11111111.
There’s a ton of other cool stuff to learn here if you’re interested, so I’d recommend reading up on signed and unsigned two’s-complement numbers.
As some exercises: what would -4 look like in this format? How about -9?
These aren't the only ways you can represent numbers in a computer, but they're probably the most popular. Some older computers used the balanced ternary number system (notably the Setun machine). There's also the one's complement format, which isn't super popular these days.
Zero minus one must give some number such that if you add one to it, you get zero. The only number you can add one to and get zero is the one represented in binary as all 1's. So that's what you get.
So long as you use any valid form of arithmetic, you get the same results. If there are eight cars and someone takes away three cars, the value you get for how many cars are left should be five, regardless of whether you do the math with binary, decimal, or any other kind of representation.
So any valid system of representation that supports the operations you are using with their normal meanings must produce the same result. When you take the representation for zero and perform the subtraction operation using the representation for one, you must get the representation such that when you add one to it, you get the representation for zero. Otherwise, the result is just wrong based on the definitions of addition, subtraction, zero, one, and so on.

understanding Fixed point arithmetic

I am struggling with how to implement arithmetic on fixed-point numbers of different precision. I have read the paper by R. Yates, but I'm still lost. In what follows, I use Yates's notation, in which A(n,m) designates a signed fixed-point format with n integer bits, m fraction bits, and n + m + 1 bits overall.
Short question: How exactly is a A(a,b)*A(c,d) and A(a,b)+A(c,d) carried out when a != c and b != d?
Long question: In my FFT algorithm, I am generating a random signal with values between -10 V and 10 V. The signed input (in) is scaled to A(15,16), and the twiddle factors (tw) are scaled to A(2,29). Both are stored as ints. Something like this:
float temp = (((float)rand() / (float)(RAND_MAX)) * (MAX_SIG - MIN_SIG)) + MIN_SIG;
in_seq[i][j] = (int)(roundf(temp * (1 << numFracBits)));
And similarly for the twiddle factors.
Now I need to perform
res = a*tw
Questions:
a) how do I implement this?
b) Should the size of res be 64 bit?
c) can I make 'res' A(17,14) since I know the ranges of a and tw? if yes, should I be scaling a*tw by 2^14 to store correct value in res?
a + res
Questions:
a) How do I add these two numbers of different Q formats?
b) if not, how do I do this operation?
Maybe it's easiest to make an example.
Suppose you want to add two numbers, one in the format A(3, 5), and the other in the format A(2, 10).
You can do it by converting both numbers to a "common" format - that is, they should have the same number of bits in the fractional part.
A conservative way of doing that is to choose the greater number of bits. That is, convert the first number to A(3, 10) by shifting it 5 bits left. Then, add the second number.
The result of an addition has the range of the greater format, plus 1 bit. In my example, if you add A(3, 10) and A(2, 10), the result has the format A(4, 10).
I call this the "conservative" way because you cannot lose information - it guarantees that the result is representable in the fixed-point format, without losing precision. However, in practice, you will want to use smaller formats for your calculation results. To do that, consider these ideas:
You can use the less-accurate format as your common representation. In my example, you can convert the second number to A(2, 5) by shifting the integer right by 5 bits. This will lose precision, and usually this precision loss is not problematic, because you are going to add a less-precise number to it anyway.
You can use 1 fewer bit for the integer part of the result. In applications, it often happens that the result cannot be too big. In this case, you can allocate 1 fewer bit to represent it. You might want to check if the result is too big, and clamp it to the needed range.
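A minimal sketch of the conservative addition path described above (my own illustration, using the formats from the example, A(3,5) and A(2,10)):
#include <stdint.h>

// Adds x in A(3,5) to y in A(2,10). x is first converted to the common
// fraction width of 10 bits; the sum then fits the conservative result
// format A(4,10).
static int32_t add_a35_a210(int16_t x_a35, int16_t y_a210) {
    int32_t x_a310 = (int32_t)x_a35 * (1 << 5);  // A(3,5) -> A(3,10)
    return x_a310 + y_a210;                      // result in A(4,10)
}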
Now, on multiplication.
It's possible to multiply two fixed-point numbers directly - they can be in any format. The format of the result is the "sum of the input formats" - all the parts added together - and add 1 to the integer part. In my example, multiplying A(3, 5) with A(2, 10) gives a number in the format A(6, 15). This is a conservative rule - the output format is able to store the result without loss of precision, but in applications, almost always you want to cut the precision of the output, because it's just too many bits.
In your case, where the number of bits for all numbers is 32, you probably want to lose precision in such a way that all intermediate results have 32 bits.
For example, multiplying A(17, 14) with A(2, 29) gives A(20, 43) - 64 bits required. You probably should cut 32 bits from it, and throw away the rest. What is the range of the result? If your twiddle factor is a number up to 4, the result is probably limited by 2^19 (the conservative number 20 above is needed to accommodate the edge case of multiplying -1 << 31 by -1 << 31 - it's almost always worth rejecting this edge-case).
So use A(19, 12) for your output format, i.e. remove 31 bits from the fractional part of your output.
So, instead of
res = a*tw;
you probably want
int64_t res_tmp = (int64_t)a * tw;        // A(20, 43)
if (res_tmp == ((int64_t)1 << 62))        // you might want to neglect this edge case
    --res_tmp;                            // A(19, 43)
int32_t res = (int32_t)(res_tmp >> 31);   // A(19, 12)
Your question seems to assume that there is a single right way to perform the operations you are interested in, but you are explicitly asking about some of the details that direct how the operations should be performed. Perhaps this is the kernel of your confusion.
res = a*tw
a is represented as A(15,16) and tw is represented as A(2,29), so the natural representation of their product is A(18,45). You need more value bits (as many bits as the two factors have combined) to maintain full precision. A(18,45) is how you should interpret the result of widening your ints to a 64-bit signed integer type (e.g. int64_t) and computing their product.
If you don't actually need or want 45 bits of fraction, then you can indeed round that to A(18,13) (or to A(18+x,13-x) for any non-negative x) without changing the magnitude of the result. That does require scaling. I would probably implement it like this:
/*
* Computes a magnitude-preserving fixed-point product of any two signed
* fixed-point numbers with a combined 31 (or fewer) value bits. If x
* is represented as A(s,t) and y is represented as A(u,v),
* where s + t == u + v == 31, then the representation of the result is
* A(s + u + 1, t + v - 32).
*/
int32_t fixed_product(int32_t x, int32_t y) {
    int64_t full_product = (int64_t) x * (int64_t) y;
    int32_t truncated = full_product / (1U << 31);
    int round_up = ((uint32_t) full_product) >> 31;
    return truncated + round_up;
}
That avoids several potential issues and implementation-defined characteristics of signed integer arithmetic. It assumes that you want the results to be in a consistent format (that is, depending only on the formats of the inputs, not on their actual values), without overflowing.
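For the formats in the question this would be used roughly like this (a usage sketch; a and tw are assumed to already hold the A(15,16) and A(2,29) representations):
// a is A(15,16), tw is A(2,29): s + t == u + v == 31, so per the comment
// above the result res is in A(18,13).
int32_t res = fixed_product(a, tw);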
a + res
Addition is actually a little harder if you cannot rely on the operands to initially have the same scale. You need to rescale so that they match before you can perform the addition. In the general case, you may not be able to do that without rounding away some precision.
In your case, you start with one A(15,16) and one A(18,13). You can compute an intermediate result in A(19,16) or wider (presumably A(47,16) in practice) that preserves magnitude without losing any precision, but if you want to represent that in 32 bits then the best you can do without risk of changing the magnitude is A(19,11). That would be this:
int32_t a_plus_res(int32_t a, int32_t res) {
    int64_t res16 = ((int64_t) res) * (1 << 3);
    int64_t sum16 = a + res16;
    int round_up = (((uint32_t) sum16) >> 4) & 1;
    return (int32_t) ((sum16 / (1 << 5)) + round_up);
}
A generic version would need to accept the scales of the operands' representations as additional arguments. Such a thing is possible, but the above is enough to chew on as it is.
All of the foregoing assumes that the fixed-point format for each operand and result is constant. That is more or less the distinguishing feature of fixed-point, differentiating it from floating-point formats on one hand and from arbitrary-precision formats on the other. You do, however, have the alternative of allowing formats to vary, and tracking them with a separate variable per value. That would be basically a hybrid of fixed-point and arbitrary-precision formats, and it would be messier.
Additionally, the foregoing assumes that overflow must be avoided at all costs. It would also be possible to instead put operands and results on a consistent scale; this would make addition simpler and multiplication more complicated, and it would afford the possibility of arithmetic overflow. That might nevertheless be acceptable if you have reason to believe that such overflow is unlikely for your particular data.
