I have Y, Cb and Cr values, each with a size of 8 bits. What would be a simple C function which can convert these values to R,G,B (each with a size of 8 bits)?
Here is a prototype of the function I am looking for:
void convertYCbCrToRGB(
unsigned char Y,
unsigned char cb,
unsigned char cr,
unsigned char &r,
unsigned char &g,
unsigned char &b);
P.S.
I am looking for the correct conversion formula only, since I have found different versions of it everywhere. I am very well-versed in C/C++.
Here is my solution to my own question.
This is a full-range YCbCr to RGB conversion routine (written in C#, but the formula is what matters):
Color GetColorFromYCbCr(int y, int cb, int cr, int a)
{
double Y = (double) y;
double Cb = (double) cb;
double Cr = (double) cr;
int r = (int) (Y + 1.40200 * (Cr - 0x80));
int g = (int) (Y - 0.34414 * (Cb - 0x80) - 0.71414 * (Cr - 0x80));
int b = (int) (Y + 1.77200 * (Cb - 0x80));
r = Max(0, Min(255, r));
g = Max(0, Min(255, g));
b = Max(0, Min(255, b));
return Color.FromArgb(a, r, g, b);
}
The problem is that nearly everybody confuses YCbCr, YUV and YPbPr, so the literature you find is often unreliable. First you have to know whether you really have YCbCr or whether someone is lying to you :-).
YUV coded data comes from analog sources (PAL video decoder, S-Video, ...)
YPbPr coded data also comes from analog sources but produces better color results than YUV (Component Video)
YCbCr coded data comes from digital sources (DVB, HDMI, ...)
YPbPr and YCbCr are related. Here are the right formulae:
https://web.archive.org/web/20180421030430/http://www.equasys.de/colorconversion.html
(the archive.org snapshot replaces the old, broken link).
An integer-only version of the ITU-R conversion for YCbCr is (adapted from Wikipedia); note that the parentheses around each shift are required in C, because + binds tighter than >>:
Cr = Cr - 128 ;
Cb = Cb - 128 ;
r = Y + ( (Cr >> 2) + (Cr >> 3) + (Cr >> 5) ) ;
g = Y - ( (Cb >> 2) + (Cb >> 4) + (Cb >> 5) ) - ( (Cr >> 1) + (Cr >> 3) + (Cr >> 4) + (Cr >> 5) ) ;
b = Y + ( Cb + (Cb >> 1) + (Cb >> 2) + (Cb >> 6) ) ;
or, equivalently but more concisely:
Cr = Cr - 128 ;
Cb = Cb - 128 ;
r = Y + 45 * Cr / 32 ;
g = Y - (11 * Cb + 23 * Cr) / 32 ;
b = Y + 113 * Cb / 64 ;
Do not forget to clamp the values of r, g and b to the range [0, 255].
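If you want to stay in integer arithmetic, a sketch of the concise variant above with clamping could look like this (ycbcr_to_rgb_int and clamp_u8 are just names I picked, and full-range input is assumed, as above):
#include <stdint.h>

static uint8_t clamp_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* integer-only full-range YCbCr -> RGB, using the 45/32, 11/32, 23/32
   and 113/64 approximations from above */
void ycbcr_to_rgb_int(uint8_t y, uint8_t cb8, uint8_t cr8,
                      uint8_t *r, uint8_t *g, uint8_t *b)
{
    int cb = (int)cb8 - 128;
    int cr = (int)cr8 - 128;

    *r = clamp_u8(y + (45 * cr) / 32);
    *g = clamp_u8(y - (11 * cb + 23 * cr) / 32);
    *b = clamp_u8(y + (113 * cb) / 64);
}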
Check out this page. It contains useful information on conversion formulae.
As an aside, you could return an unsigned int with the R, G, B and A values packed from the most significant byte to the least significant byte, i.e.
unsigned int YCbCrToRGBA(unsigned char Y, unsigned char Cb, unsigned char Cr) {
    unsigned char R = 0; /* conversion goes here */
    unsigned char G = 0; /* conversion goes here */
    unsigned char B = 0; /* conversion goes here */
    return ((unsigned int)R << 24) | ((unsigned int)G << 16) | ((unsigned int)B << 8) | 255u;
}
I have created with GIMP a C-Source image dump like the following:
/* GIMP RGBA C-Source image dump (example.c) */
static const struct {
guint width;
guint height;
guint bytes_per_pixel; /* 2:RGB16, 3:RGB, 4:RGBA */
guint8 pixel_data[304 * 98 * 2 + 1];
} example= {
304, 98, 2,
"\206\061\206\061..... }
Is there a way to convert this image from RGB565 to RGB888?
I mean, I have found a way to convert pixel by pixel:
for (i = 0; i < w * h; i++)
{
    uint16_t color = *RGB565p++;
    /* expand 5/6/5 bits to 8 bits per channel */
    uint8_t r = ((((color >> 11) & 0x1F) * 527) + 23) >> 6;
    uint8_t g = ((((color >> 5) & 0x3F) * 259) + 33) >> 6;
    uint8_t b = (((color & 0x1F) * 527) + 23) >> 6;
    uint32_t RGB888 = r << 16 | g << 8 | b;
    printf("%d \n", RGB888);
}
The problem is that with this logic I get numbers that are not the same as the ones used in the original image:
P3
304 98
255
3223857
3223857
3223857
3223857
3223857
3223857
3223857
3223857
Did I miss something?
EDIT: here you can find the original image:
https://drive.google.com/file/d/1YBphg5_V6M2FA3HWcaFZT4fHqD6yeEOl/view
There are two things you need to do to create a C file similar to the original.
Increase the size of the pixel buffer, because you are creating three bytes per pixel from the original's two bytes.
Write strings that represent the new pixel data
The first part means simply changing the 2 to 3, so you get:
guint8 pixel_data[304 * 98 * 3 + 1];
} example= {
304, 98, 3,
In the second part, the simplest method would be to print ALL characters in hexadecimal or octal representation. (The original GIMP dump keeps the "printable" characters visible and writes the non-printable ones as octal escape sequences.)
To print ALL the characters in hexadecimal representation, do something like this:
for (i = 0; i < w * h; i++)
{
/* R, G and B calculation goes here */
// Print start of line and string (every 16 pixels)
if (i % 16 == 0)
printf("\n\"");
printf("\\x%02x\\x%02x\\x%02x", r, g, b);
// Print end of string and line (every 16 pixels)
if ((i+1) % 16 == 0)
printf("\"\n");
}
if ((w * h) % 16 != 0)
printf("\"\n"); // Termination of last line, if it was not already closed
This prints three bytes in hex representation \xab\xcd\xef and after 16 pixels, prints end of string and newline.
Note that the byte order might need changing depending on your implementation. So b, g, r instead of r, g, b.
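Putting both parts together, a self-contained sketch might look like the following. The struct from the question is replaced by bare placeholders here, and the RGB565 byte order (p[0] low, p[1] high) is an assumption you may have to flip for your dump:
#include <stdint.h>
#include <stdio.h>

#define WIDTH  304
#define HEIGHT 98

/* stand-in for example.pixel_data from the GIMP dump in the question */
static const unsigned char pixel_data[WIDTH * HEIGHT * 2 + 1] = "";

int main(void)
{
    printf("/* GIMP RGB C-Source image dump */\n"
           "static const struct {\n"
           "  guint width;\n  guint height;\n  guint bytes_per_pixel;\n"
           "  guint8 pixel_data[%d * %d * 3 + 1];\n"
           "} example = {\n  %d, %d, 3,\n", WIDTH, HEIGHT, WIDTH, HEIGHT);

    for (size_t i = 0; i < (size_t)WIDTH * HEIGHT; i++) {
        const unsigned char *p = pixel_data + 2 * i;
        uint16_t color = (uint16_t)((p[1] << 8) | p[0]);   /* adjust byte order if needed */

        uint8_t r = (uint8_t)(((((color >> 11) & 0x1F) * 527) + 23) >> 6);
        uint8_t g = (uint8_t)(((((color >> 5)  & 0x3F) * 259) + 33) >> 6);
        uint8_t b = (uint8_t)((((color & 0x1F) * 527) + 23) >> 6);

        if (i % 16 == 0)
            printf("\"");                       /* open a new string every 16 pixels */
        printf("\\x%02x\\x%02x\\x%02x", r, g, b);
        if ((i + 1) % 16 == 0)
            printf("\"\n");                     /* close it again */
    }
    if ((WIDTH * HEIGHT) % 16 != 0)
        printf("\"\n");                         /* close the last, partial line */
    printf("};\n");
    return 0;
}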
I am writing a program that converts a binary value to its hexadecimal string representation, so each byte of the value becomes two hexadecimal characters in the string. This means the result will be twice the size: a hexadecimal representation of 1 byte needs two bytes in a string.
Hexadecimal Characters
0123456789 ;0x30 - 0x39
ABCDEF ;0x41 - 0x46
Example
0xF05C1E3A ;hex
4032568890 ;dec
would become
0x4630354331453341 ;hex
5057600944242766657 ;dec
Question?
Are there any elegant/alternative(/interesting) methods for converting between these states, other than a lookup table, (bitwise operations, shifts, modulo, etc)?
I'm not looking for a function in a library, but rather how one would/should be implemented. Any ideas?
Here's a solution with nothing but shifts, and/or, and add/subtract. No loops either.
uint64_t x, m;
x = 0xF05C1E3A;
/* spread the eight nibbles of x into eight separate bytes */
x = ((x & 0x00000000ffff0000LL) << 16) | (x & 0x000000000000ffffLL);
x = ((x & 0x0000ff000000ff00LL) << 8) | (x & 0x000000ff000000ffLL);
x = ((x & 0x00f000f000f000f0LL) << 4) | (x & 0x000f000f000f000fLL);
/* add 6 to every byte, so bit 4 becomes set exactly for nibbles >= 10 */
x += 0x0606060606060606LL;
/* per byte: 0x80 where the nibble was >= 10, 0x7f where it was 0-9 */
m = ((x & 0x1010101010101010LL) >> 4) + 0x7f7f7f7f7f7f7f7fLL;
/* digits still need +0x2a to reach '0'+n, letters need +0x31 to reach 'A'+n-10 */
x += (m & 0x2a2a2a2a2a2a2a2aLL) | (~m & 0x3131313131313131LL);
Above is the simplified version I came up with after a little time to reflect. Below is the original answer.
uint64_t x, m;
x = 0xF05C1E3A;
x = ((x & 0x00000000ffff0000LL) << 16) | (x & 0x000000000000ffffLL);
x = ((x & 0x0000ff000000ff00LL) << 8) | (x & 0x000000ff000000ffLL);
x = ((x & 0x00f000f000f000f0LL) << 4) | (x & 0x000f000f000f000fLL);
x += 0x3636363636363636LL;
m = (x & 0x4040404040404040LL) >> 6;
x += m;
m = m ^ 0x0101010101010101LL;
x -= (m << 2) | (m << 1);
See it in action: http://ideone.com/nMhJ2q
Spreading out the nibbles to bytes is easy with pdep (the _pdep_u64 intrinsic from <immintrin.h>, which needs BMI2 hardware support):
spread = _pdep_u64(raw, 0x0F0F0F0F0F0F0F0F);
Now we'd have to add 0x30 to the bytes in the range 0-9 and 0x37 (that is, 'A' - 10) to the higher bytes. This could be done by SWAR-subtracting 10 from every byte and then using the sign to select which constant to add, such as (not tested)
uint64_t H   = 0x8080808080808080;
uint64_t ten = 0x0A0A0A0A0A0A0A0A;
// SWAR subtract: byte-wise spread - 10, without borrows between bytes
uint64_t cmp = ((spread | H) - (ten & ~H)) ^ ((spread ^ ~ten) & H);
// 0xFF in every byte where the nibble was < 10, 0x00 elsewhere
uint64_t masks = ((cmp & H) >> 7) * 255;
// if nibble - 10 is negative, add 0x30 ('0'), else add 0x37 ('A' - 10)
uint64_t add = (masks & 0x3030303030303030) | (~masks & 0x3737373737373737);
uint64_t asString = spread + add;
That SWAR compare can probably be optimized since you shouldn't need a full subtract to implement it.
There are some different suggestions here, including SIMD: http://0x80.pl/articles/convert-to-hex.html
A slightly simpler version based on Mark Ransom's:
uint64_t x = 0xF05C1E3A;
x = ((x & 0x00000000ffff0000LL) << 16) | (x & 0x000000000000ffffLL);
x = ((x & 0x0000ff000000ff00LL) << 8) | (x & 0x000000ff000000ffLL);
x = ((x & 0x00f000f000f000f0LL) << 4) | (x & 0x000f000f000f000fLL);
x = (x + 0x3030303030303030LL) +
(((x + 0x0606060606060606LL) & 0x1010101010101010LL) >> 4) * 7;
And if you want to avoid the multiplication:
uint64_t m, x = 0xF05C1E3A;
x = ((x & 0x00000000ffff0000LL) << 16) | (x & 0x000000000000ffffLL);
x = ((x & 0x0000ff000000ff00LL) << 8) | (x & 0x000000ff000000ffLL);
x = ((x & 0x00f000f000f000f0LL) << 4) | (x & 0x000f000f000f000fLL);
m = (x + 0x0606060606060606LL) & 0x1010101010101010LL;
x = (x + 0x3030303030303030LL) + (m >> 1) - (m >> 4);
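If it helps, here is one way to wrap the no-multiplication variant into a reusable function and pull the eight ASCII characters out of the 64-bit result (the function name u32_to_hex is mine, not from any library):
#include <stdint.h>
#include <stdio.h>

static void u32_to_hex(uint32_t value, char buf[9])
{
    uint64_t x = value, m;
    /* spread the eight nibbles into eight bytes */
    x = ((x & 0x00000000ffff0000ULL) << 16) | (x & 0x000000000000ffffULL);
    x = ((x & 0x0000ff000000ff00ULL) << 8)  | (x & 0x000000ff000000ffULL);
    x = ((x & 0x00f000f000f000f0ULL) << 4)  | (x & 0x000f000f000f000fULL);
    m = (x + 0x0606060606060606ULL) & 0x1010101010101010ULL; /* 0x10 where nibble >= 10 */
    x = (x + 0x3030303030303030ULL) + (m >> 1) - (m >> 4);    /* '0' + n, or 'A' + n - 10 */
    for (int i = 0; i < 8; i++)              /* most significant byte is the first character */
        buf[i] = (char)(x >> (56 - 8 * i));
    buf[8] = '\0';
}

int main(void)
{
    char buf[9];
    u32_to_hex(0xF05C1E3A, buf);
    printf("%s\n", buf);   /* prints F05C1E3A */
    return 0;
}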
A somewhat more flexible conversion from an integer to a string, in any base from 2 up to the length of the digits table:
#include <string.h> /* for strlen */

char *reverse(char *);
const char digits[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
char *convert(long long number, char *buff, int base)
{
char *result = (buff == NULL || base > strlen(digits) || base < 2) ? NULL : buff;
char sign = 0;
if (number < 0)
{
sign = '-';
number = -number;
}
if (result != NULL)
{
do
{
*buff++ = digits[number % base];
number /= base;
} while (number);
if(sign) *buff++ = sign;
*buff = 0;
reverse(result);
}
return result;
}
char *reverse(char *str)
{
char tmp;
int len;
if (str != NULL)
{
len = strlen(str);
for (int i = 0; i < len / 2; i++)
{
tmp = *(str + i);
*(str + i) = *(str + len - i - 1);
*(str + len - i - 1) = tmp;
}
}
return str;
}
example - counting from -50 to 50 decimal in base 23
-24 -23 -22 -21 -20 -1M -1L -1K -1J -1I -1H -1G -1F -1E -1D
-1C -1B -1A -19 -18 -17 -16 -15 -14 -13 -12 -11 -10 -M -L
-K -J -I -H -G -F -E -D -C -B -A -9 -8 -7 -6
-5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9
A B C D E F G H I J K L M 10 11
12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F 1G
1H 1I 1J 1K 1L 1M 20 21 22 23 24
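A possible usage sketch (the buffer size is just a safe upper bound: a long long in base 2 needs at most 64 digits plus a sign and the terminator):
#include <stdio.h>

/* assumes convert() and reverse() from above are in the same file */
int main(void)
{
    char buf[72];
    puts(convert(255, buf, 16));   /* prints FF  */
    puts(convert(-50, buf, 23));   /* prints -24 */
    return 0;
}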
A LUT (lookup table) C++ variant. I didn't check the actual machine code produced, but I believe any modern C++ compiler will get the idea and compile it well.
static const char nibble2hexChar[] { "0123456789ABCDEF" };
// 17B in total, because I'm lazy to init it per char
void byteToHex(std::ostream & out, const uint8_t value) {
out << nibble2hexChar[value>>4] << nibble2hexChar[value&0xF];
}
// this one is actually written more toward short+simple source, than performance
void dwordToHex(std::ostream & out, uint32_t value) {
int i = 8;
while (i--) {
out << nibble2hexChar[value>>28];
value <<= 4;
}
}
EDIT: For C code you just have to switch from std::ostream to some other output mechanism. Unfortunately your question lacks details about what you are actually trying to achieve and why you don't use the built-in printf family of C functions.
For example C like this can write to some char* output buffer, converting arbitrary amount of bytes:
/**
* Writes hexadecimally formatted "n" bytes array "values" into "outputBuffer".
* Make sure there's enough space in output buffer allocated, and add zero
* terminator yourself, if you plan to use it as C-string.
*
* #Returns: pointer after the last character written.
*/
char* dataToHex(char* outputBuffer, const size_t n, const unsigned char* values) {
for (size_t i = 0; i < n; ++i) {
*outputBuffer++ = nibble2hexChar[values[i]>>4];
*outputBuffer++ = nibble2hexChar[values[i]&0xF];
}
return outputBuffer;
}
And finally: I once helped somebody on Code Review who had a performance bottleneck in exactly this kind of hexadecimal formatting. I wrote a conversion variant there without a LUT, and the whole thread (the other answer plus the performance measurements) may be instructive for you, because the fastest solution doesn't just blindly convert the result; it folds the conversion into the main operation to get better performance overall. That's why I wonder what you are actually trying to solve: the surrounding problem often allows a more optimal solution. If you only care about the conversion itself, printf("%x", ...) is a safe bet.
Here is that other approach to the "to hex" conversion:
fast C++ XOR Function
Decimal -> Hex
Just iterate through the string, convert every character to an int, and then you can do
printf("%02x", c);
or use sprintf to save the result into another variable.
Hex -> Decimal
Code
printf("%c",16 * hexToInt('F') + hexToInt('0'));
int hexToInt(char c)
{
if(c >= 'a' && c <= 'z')
c = c - ('a' - 'A');
int sum;
sum = c / 16 - 3;
sum *= 10;
sum += c % 16;
return (sum > 9) ? sum - 1 : sum;
}
The articles below compare different methods of converting digits to strings. Hex numbers are not covered, but it should not be a big problem to switch from dec to hex:
Integers
Fixed and floating point
EDIT
Thank you for pointing out that the answer above is not relevant.
A common way with no LUT is to split the integer into nibbles and map each of them to ASCII:
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#define HI_NIBBLE(b) (((b) >> 4) & 0x0F)
#define LO_NIBBLE(b) ((b) & 0x0F)
void int64_to_char(char carr[], int64_t val){
memcpy(carr, &val, 8);
}
uint64_t inp = 0xF05C1E3A;
char tmp_st[8];
int main()
{
int64_to_char(tmp_st,inp);
printf("Sample: %llx\n", (unsigned long long)inp);
printf("Result: 0x");
for (unsigned int k = 8; k; k--){ /* walk the bytes from most to least significant (little-endian memcpy assumed) */
char tmp_ch = *(tmp_st+k-1);
char hi_nib = HI_NIBBLE(tmp_ch);
char lo_nib = LO_NIBBLE(tmp_ch);
if (hi_nib || lo_nib){ /* suppresses leading zero bytes (beware: also drops embedded zero bytes) */
printf("%c%c",hi_nib+((hi_nib>9)?55:48),lo_nib+((lo_nib>9)?55:48));
}
}
printf("\n");
return 0;
}
Another way is to use Allison's algorithm. I am a total noob at ASM, so I post the code in the form I found it.
Variant 1:
ADD AL,90h
DAA
ADC AL,40h
DAA
Variant 2:
CMP AL, 0Ah
SBB AL, 69h
DAS
If A and B are of type uint8_t and I want the result C = A*B % N where N is 2^16, how do I do this if I can't use wider integers (so I can't declare N as an int, only uint8_t) in C?
N.B.: A, B and C are stored in uint8_t arrays, so they are "expressed" as uint8_t, but their values can be bigger.
In general there is no easy way to do this.
Firstly you need to implement the multiplication with carry between A and B for each uint8_t block. See the answer here.
Dividing by 2^16 would mean "disregard" the low 16 bits, i.e. don't use the last two uint8_t elements (since you use arrays of bytes). As you have the modulus operator instead, it is just the opposite: only the last two uint8_t elements matter, both in the operands and in the result.
Take the lowest two uint8_t of A (say a0 and a1, with a0 the least significant) and of B (b0 and b1):
Split each uint8_t into a high and a low nibble:
a0h = a0 >> 4; /* the same as a0h = a0 / 16 */
a0l = a0 % 16; /* the same as a0l = a0 & 0x0f */
a1h = a1 >> 4;
a1l = a1 % 16;
b0h = b0 >> 4;
b0l = b0 % 16;
b1h = b1 >> 4;
b1l = b1 % 16;
These are the four base-16 digits of the low 16 bits of A (a0l is the least significant digit, a1h the most significant) and of B. The result S = s1:s0 consists of the low four digits of an ordinary long multiplication; every partial product of two nibbles is at most 15 * 15 = 225, so it fits in a uint8_t, and after every addition you take the low nibble as the digit and the high nibble as carry.
Digit 0 (x is an 8-bit buffer variable):
x = a0l * b0l;
s0l = x % 16;
c = x >> 4;
Digit 1 is a0l*b0h + a0h*b0l plus the carry; add one product at a time (t is a second 8-bit buffer) so nothing overflows 8 bits:
x = a0l * b0h + c;
t = x % 16;
c = x >> 4;
x = a0h * b0l + t;
s0h = x % 16;
c = c + (x >> 4);
The low byte of the result is now complete:
s0 = (s0h << 4) + s0l;
Digit 2 is a0l*b1l + a0h*b0h + a1l*b0l plus the carry, again added one product at a time:
x = a0l * b1l + c;
t = x % 16;
c = x >> 4;
x = a0h * b0h + t;
t = x % 16;
c = c + (x >> 4);
x = a1l * b0l + t;
s1l = x % 16;
c = c + (x >> 4);
Digit 3 is a0l*b1h + a0h*b1l + a1l*b0h + a1h*b0l plus the carry. Everything above 16 bits is thrown away by the modulus anyway, so 8-bit wrap-around during these additions does no harm and only the low nibble is kept:
x = a0l * b1h + a0h * b1l + a1l * b0h + a1h * b0l + c;
s1h = x % 16;
s1 = (s1h << 4) + s1l;
Your result is s1:s0, which is exactly (A * B) % 2^16. If you had to divide by something other than a power of two, you would need a similar long-division routine instead; in that case be careful to catch division by zero yourself, otherwise you get garbage (there is no NaN in integer arithmetic).
You can put A, B, C and S into arrays and loop over the indexes to make the code cleaner.
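Below is a compact sketch of exactly these steps as a C function (the name mul16_mod65536 and the interface are mine, not anything standard; every value that gets stored fits in a uint8_t, and the final digit deliberately relies on 8-bit truncation):
#include <stdint.h>
#include <stdio.h>

/* multiply the low 16 bits of A (a1:a0) and B (b1:b0) modulo 2^16,
   returning the result in s1:s0 */
static void mul16_mod65536(uint8_t a1, uint8_t a0, uint8_t b1, uint8_t b0,
                           uint8_t *s1, uint8_t *s0)
{
    uint8_t a0l = a0 & 0x0F, a0h = a0 >> 4;
    uint8_t a1l = a1 & 0x0F, a1h = a1 >> 4;
    uint8_t b0l = b0 & 0x0F, b0h = b0 >> 4;
    uint8_t b1l = b1 & 0x0F, b1h = b1 >> 4;
    uint8_t x, t, c, r0, r1, r2, r3;

    x = a0l * b0l;        r0 = x & 0x0F; c = x >> 4;     /* digit 0 */

    x = a0l * b0h + c;    t = x & 0x0F;  c = x >> 4;     /* digit 1 */
    x = a0h * b0l + t;    r1 = x & 0x0F; c += x >> 4;

    x = a0l * b1l + c;    t = x & 0x0F;  c = x >> 4;     /* digit 2 */
    x = a0h * b0h + t;    t = x & 0x0F;  c += x >> 4;
    x = a1l * b0l + t;    r2 = x & 0x0F; c += x >> 4;

    /* digit 3: anything that overflows 16 bits is discarded anyway */
    x = a0l * b1h + a0h * b1l + a1l * b0h + a1h * b0l + c;
    r3 = x & 0x0F;

    *s0 = (uint8_t)((r1 << 4) | r0);
    *s1 = (uint8_t)((r3 << 4) | r2);
}

int main(void)
{
    uint8_t s1, s0;
    mul16_mod65536(0x12, 0x34, 0x00, 0x56, &s1, &s0);
    printf("0x1234 * 0x56 mod 2^16 = 0x%02X%02X\n", s1, s0); /* 0x1D78 */
    return 0;
}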
Here's my effort. I took the liberty of using larger integers and pointers for looping through the arrays. The numbers are represented by arrays of uint8_t in big-endian order. All the intermediate results are kept in uint8_t variables. The code could be made more efficient if intermediate results could be stored in wider integer variables!
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
static void add_c(uint8_t *r, size_t len_r, uint8_t x)
{
uint8_t o;
while (len_r--) {
o = r[len_r];
r[len_r] += x;
if (o <= r[len_r])
break;
x = 1;
}
}
void multiply(uint8_t *res, size_t len_res,
const uint8_t *a, size_t len_a, const uint8_t *b, size_t len_b)
{
size_t ia, ib, ir;
for (ir = 0; ir < len_res; ir++)
res[ir] = 0;
for (ia = 0; ia < len_a && ia < len_res; ia++) {
uint8_t ah, al, t;
t = a[len_a - ia - 1];
ah = t >> 4;
al = t & 0xf;
for (ib = 0; ib < len_b && ia + ib < len_res; ib++) {
uint8_t bh, bl, x, o, c0, c1;
t = b[len_b - ib - 1];
bh = t >> 4;
bl = t & 0xf;
c0 = al * bl;
c1 = ah * bh;
o = c0;
t = al * bh;
x = (t & 0xf) << 4;
c0 += x;
x = (t >> 4);
c1 += x;
if (o > c0)
c1++;
o = c0;
t = ah * bl;
x = (t & 0xf) << 4;
c0 += x;
x = (t >> 4);
c1 += x;
if (o > c0)
c1++;
add_c(res, len_res - ia - ib, c0);
add_c(res, len_res - ia - ib - 1, c1);
}
}
}
int main(void)
{
uint8_t a[2] = { 0xee, 0xdd };
uint8_t b[2] = { 0xcc, 0xbb };
uint8_t r[4];
multiply(r, sizeof(r), a, sizeof(a), b, sizeof(b));
printf("0x%02X%02X * 0x%02X%02X = 0x%02X%02X%02X%02X\n",
a[0], a[1], b[0], b[1], r[0], r[1], r[2], r[3]);
return 0;
}
Output:
0xEEDD * 0xCCBB = 0xBF06976F
I have this function called byteSwap that I am supposed to implement. The idea is that the function takes 3 integers (int x, int y, int z) and swaps the y and z bytes of the int x. The restrictions pretty much limit me to bitwise operations (no loops, and no if statements or logical operators such as ==).
I don't believe that I presented this problem adequately, so I'm going to re-attempt it.
I now understand that
byte 0 refers to bits 0-7
byte 1 refers to bits 8-15
byte 2 refers to bits 16-23
byte 3 refers to bits 24-31
My function is supposed to take 3 integer inputs, x, y and z. The y byte and the z byte of x then have to get swapped.
int byteSwap(int x, int y, int z)
Examples of the working function:
byteSwap(0x12345678, 1, 3) = 0x56341278
byteSwap(0xDEADBEEF, 0, 2) = 0xDEEFBEAD
My original code had some huge errors in it, namely the fact that I was treating a byte as 2 bits instead of 8. The main problem I'm struggling with is that I do not know how to access the bits inside a given byte. For example, when I'm given bytes y and z, how do I access their respective bits? As far as I can tell I can't find a mathematical relationship between a given byte and its starting bit. I'm assuming I have to shift and then mask, and save those to variables, though I cannot even get that far.
Extract the i-th byte with (x & ((1ll << ((i + 1) * 8)) - 1)) >> (i * 8). Swap the two extracted bytes using the XOR trick, and put the swapped bytes back in their places.
int x, y, z;
y = 1, z = 3;
x = 0x12345678;
int a, b; /* bytes to swap */
a = (x & ((1ll << ((y + 1) * 8)) - 1)) >> (y * 8);
b = (x & ((1ll << ((z + 1) * 8)) - 1)) >> (z * 8);
/* swap */
a = a ^ b;
b = a ^ b;
a = a ^ b;
/* put zeros in bytes to swap */
x = x & (~((0xff << (y * 8))));
x = x & (~((0xff << (z * 8))));
/* put new bytes in place */
x = x | (a << (y * 8));
x = x | (b << (z * 8));
When you say "the y and z bytes of x", this implies x is an array of bytes, not an integer. If so:
x[z] ^= x[y];
x[y] ^= x[z];
x[z] ^= x[y];
will do the trick, by swapping x[y] and x[z]
After your edit, it appears you want to swap individual bytes of a 32 bit integer:
On a little-endian machine:
int
swapbytes (int x, int y, int z)
{
char *b = (char *)&x;
b[z] ^= b[y];
b[y] ^= b[z];
b[z] ^= b[y];
return x;
}
On a big-endian machine:
int
swapbytes (int x, int y, int z)
{
char *b = (char *)&x;
b[3-z] ^= b[3-y];
b[3-y] ^= b[3-z];
b[3-z] ^= b[3-y];
return x;
}
With a strict interpretation of the rules, you don't even need the xor trick:
int
swapbytes (int x, int y, int z)
{
char *b = (char *)&x;
char tmp = b[z];
b[z] = b[y];
b[y] = tmp;
return x;
}
On a big-endian machine:
int
swapbytes (int x, int y, int z)
{
char *b = (char *)&x;
char tmp = b[3-z];
b[3-z] = b[3-y];
b[3-y] = tmp;
return x;
}
If you want to do it using bit shifts (note <<3 multiplies by 8):
int
swapbytes (unsigned int x, int y, int z)
{
unsigned int masky = 0xff << (y<<3);
unsigned int maskz = 0xff << (z<<3);
unsigned int origy = (x & masky) >> (y<<3);
unsigned int origz = (x & maskz) >> (z<<3);
return (x & ~masky & ~maskz) | (origz << (y<<3)) | (origy << (z<<3));
}
I have a co-processor which does not have floating-point support. I tried to use 32-bit fixed point, but it is unable to work with very small numbers. My numbers range from 1 down to 1e-18. One way is to use floating-point emulation, but it is too slow. Can we make it faster in this case, where we know the numbers won't be greater than 1 or smaller than 1e-18? Or is there a way to make fixed point work with very small numbers?
It is not possible for a 32-bit fixed-point encoding to represent numbers from 10^-18 to 1. This is immediately obvious from the fact that the span from 10^-18 to 1 is a ratio of 10^18, but the non-zero encodings of a 32-bit integer span a ratio of less than 2^32, which is much less than 10^18. Therefore, no choice of scale for the fixed-point encoding will provide the desired span.
So a 32-bit fixed-point encoding will not work, and you must use some other technique.
In some applications, it may be suitable to use multiple fixed-point encodings. That is, various input values would be encoded with a fixed-point encoding but each with a scale suitable to it, and intermediate values and the outputs would also have customized scales. Obviously, this is possible only if suitable scales can be determined at design time. Otherwise, you should abandon 32-bit fixed-point encodings and consider alternatives.
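As a small illustration of the per-quantity scales idea (the Q-formats and the function name here are made up for the example, not taken from the question): if one operand is kept in Q2.30 and the other in Q0.32, the raw product has a scale of 2^-62 and one shift brings it back to the chosen output format.
#include <stdint.h>

/* a is in Q2.30 (1 LSB = 2^-30), b is in Q0.32 (1 LSB = 2^-32).
   The raw product has 1 LSB = 2^-62, so shifting right by 30 yields a
   Q0.32 result; this assumes the true result is below 1, so it fits. */
static uint32_t mul_q2_30_by_q0_32(uint32_t a, uint32_t b)
{
    uint64_t p = (uint64_t)a * b;
    return (uint32_t)(p >> 30);
}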
Will simplified 24-bit floating point be fast enough and accurate enough?:
#include <stdio.h>
#include <limits.h>
#if UINT_MAX >= 0xFFFFFFFF
typedef unsigned myfloat;
#else
typedef unsigned long myfloat;
#endif
#define MF_EXP_BIAS 0x80
myfloat mfadd(myfloat a, myfloat b)
{
unsigned ea = a >> 16, eb = b >> 16;
if (ea > eb)
{
a &= 0xFFFF;
b = (b & 0xFFFF) >> (ea - eb);
if ((a += b) > 0xFFFF)
a >>= 1, ++ea;
return a | ((myfloat)ea << 16);
}
else if (eb > ea)
{
b &= 0xFFFF;
a = (a & 0xFFFF) >> (eb - ea);
if ((b += a) > 0xFFFF)
b >>= 1, ++eb;
return b | ((myfloat)eb << 16);
}
else
{
return (((a & 0xFFFF) + (b & 0xFFFF)) >> 1) | ((myfloat)++ea << 16);
}
}
myfloat mfmul(myfloat a, myfloat b)
{
unsigned ea = a >> 16, eb = b >> 16, e = ea + eb - MF_EXP_BIAS;
myfloat p = ((a & 0xFFFF) * (b & 0xFFFF)) >> 16;
return p | ((myfloat)e << 16);
}
myfloat double2mf(double x)
{
myfloat f;
unsigned e = MF_EXP_BIAS + 16;
if (x <= 0)
return 0;
while (x < 0x8000)
x *= 2, --e;
while (x >= 0x10000)
x /= 2, ++e;
f = x;
return f | ((myfloat)e << 16);
}
double mf2double(myfloat f)
{
double x;
unsigned e = (f >> 16) - 16;
if ((f & 0xFFFF) == 0)
return 0;
x = f & 0xFFFF;
while (e > MF_EXP_BIAS)
x *= 2, --e;
while (e < MF_EXP_BIAS)
x /= 2, ++e;
return x;
}
int main(void)
{
double testConvData[] = { 1e-18, .25, 0.3333333, .5, 1, 2, 3.141593, 1e18 };
unsigned i;
for (i = 0; i < sizeof(testConvData) / sizeof(testConvData[0]); i++)
printf("%e -> 0x%06lX -> %e\n",
testConvData[i],
(unsigned long)double2mf(testConvData[i]),
mf2double(double2mf(testConvData[i])));
printf("300 * 5 = %e\n", mf2double(mfmul(double2mf(300),double2mf(5))));
printf("500 + 3 = %e\n", mf2double(mfadd(double2mf(500),double2mf(3))));
printf("1e18 * 1e-18 = %e\n", mf2double(mfmul(double2mf(1e18),double2mf(1e-18))));
printf("1e-18 + 2e-18 = %e\n", mf2double(mfadd(double2mf(1e-18),double2mf(2e-18))));
printf("1e-16 + 1e-18 = %e\n", mf2double(mfadd(double2mf(1e-16),double2mf(1e-18))));
return 0;
}
Output (ideone):
1.000000e-18 -> 0x459392 -> 9.999753e-19
2.500000e-01 -> 0x7F8000 -> 2.500000e-01
3.333333e-01 -> 0x7FAAAA -> 3.333282e-01
5.000000e-01 -> 0x808000 -> 5.000000e-01
1.000000e+00 -> 0x818000 -> 1.000000e+00
2.000000e+00 -> 0x828000 -> 2.000000e+00
3.141593e+00 -> 0x82C90F -> 3.141541e+00
1.000000e+18 -> 0xBCDE0B -> 9.999926e+17
300 * 5 = 1.500000e+03
500 + 3 = 5.030000e+02
1e18 * 1e-18 = 9.999390e-01
1e-18 + 2e-18 = 2.999926e-18
1e-16 + 1e-18 = 1.009985e-16
Subtraction is left as an exercise. Ditto for better conversion routines.
Use 64 bit fixed point and be done with it.
Compared with 32 bit fixed point it will be four times slower for multiplication, but it will still be far more efficient than float emulation.
In embedded systems I'd suggest using a 16+32, 16+16, 8+16 or 8+24-bit redundant floating-point representation, where each number is simply M * 2^exp.
In this case you can choose to represent zero as M = 0 with exp = 0. Because the format is redundant, each power of 2 has 16-32 different representations (depending on the mantissa width), which mainly makes comparison a bit harder than usual. On the other hand you can postpone normalization, e.g. until after a subtraction.
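For what it's worth, a rough sketch of such a redundant 16+16-bit format (the rfloat type and the helpers are made up; widths, rounding and the normalization policy would be tuned to the actual co-processor):
#include <stdint.h>

/* value = m * 2^e; m is not required to be normalized */
typedef struct { int16_t m; int16_t e; } rfloat;

/* optional: shift m up so its top bits are used, to preserve precision */
static rfloat rf_norm(rfloat a)
{
    if (a.m == 0) { a.e = 0; return a; }
    while (a.m > -16384 && a.m < 16384) { a.m = (int16_t)(a.m * 2); a.e--; }
    return a;
}

static rfloat rf_mul(rfloat a, rfloat b)
{
    int32_t p = (int32_t)a.m * b.m;   /* 16 x 16 -> 32-bit product */
    rfloat r = { (int16_t)(p >> 16), (int16_t)(a.e + b.e + 16) };
    return r;   /* keeps the top 16 bits of the mantissa, bumps the exponent;
                   normalize the operands first (rf_norm) to avoid losing precision,
                   and note an arithmetic right shift is assumed for negative products */
}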