Efficiently implementing arrays of ternary data types in C - c

I need to implement "big" arrays (~1800 elements) of a ternary datatype as runtime-efficiently as possible in C for cryptographic research. I thought of the following:
Using an array of any-sized integers, using 2 Bits to represent one element each
So I'd have
typedef uint32_t block;
const int blocksize = sizeof(block)<<3;
block dataArray[3]; // 3*32 bit => 48 Elements
uint8_t getElementAt(block *data, int position)
{
position = position * 2;
return (data[position/blocksize] >> (position % blocksize)) & 3;
}
returning me 0..2 which i can map to my three values.
Using an array of uint8_t addressing the elementy directly.
uint8_t data[48];
Sure, that needs at least four times more RAM but addressing and setting might be more efficient - is it?
Are there any other good possibilities I'm missing or special caveats in any of the two solutions?

The answer depends on how big the arrays will get, and how you want to optimize. I sketch some scenarios:
Runtime, small arrays.
Simply use unsigned long arr[N]. Reading only on machine word boundaries is the fastest, but uses a lot of memory. When the memory usage gets too big you actually do not want to do this, because cache performance outweighs the aligned reads.
Runtime, big arrays.
Use unsigned char arr[N]. This will give you fast reads/writes at a decent speed.
Good memory usage, mediocre speed.
Use unsigned long arr[N] and store each trit in two bits, unpacking using shifts and masks.
Better memory usage, slow.
Use unsigned long arr[N], and store floor(CHAR_BIT * sizeof(long) * log(2) / log(3)) numbers, by storing digits in base-3. You can pack 20 trits in 32 bits using this method.
Best memory usage, horrendous.
Store all the numbers as digits in one base-3 number, using a bignum implementation.

Related

C: Is using char faster than using int?

Since char is only 1 byte long, is it better to to use char while dealing with 8-bit unsigned int?
Example:
I was trying to create a struct for storing rgb values of a color.
struct color
{
unsigned int r: 8;
unsigned int g: 8;
unsigned int b: 8;
};
Now since it is int, it allocates a memory of 4 bytes in my case. But if I replace them with unsigned char, they will be taking 3 bytes of memory as intended (in my platform).
No. Maybe a tiny bit.
First, this is a very platform dependent question.
However the <stdint.h> header was introduced to help with this.
Some hardware platforms are optimised for a particular size of operand and have an overhead to using smaller (in bit-size) operands even though logically the smaller operand requires less computation.
You should find that uint_fast8_t is the fastest unsigned integer type with at least 8 bits (#include <stdint.h> to use it).
That may be the same as unsigned char or unsigned int depending on whether your question is 'yes' or 'no' respectively(*).
So the idea would be that if you're speed focused you'd use uint_fast8_t and the compiler will pick the fastest type fitting your purpose.
There are a couple of downsides to this scheme.
One is that if you create very vast quantities of data performance can be impaired (and limits reached) because you're using an 'oversized' type for the purpose.
On a platform where a 4-byte int is faster than a 1-byte char you're using about 4 times as much memory as you need.
If your platform is small or your scale large that can be a significant overhead.
Also you need to be careful that if the underlying type isn't the minimum size you expect then some calculations may be confounded.
Arithmetic 'clocks' neatly for unsigned operands but obviously at different sizes if uint_fast8_t isn't in fact 8-bits.
It's platform dependent what the following returns:
#include <stdint.h>
//...
int foo() {
uint_fast8_t x=255;
++x;
if(x==0){
return 1;
}
return 0;
}
The overhead of dealing with potentially outsized types can claw back your gains.
I tend to agree with Knuth that "premature optimisation is the root of all evil" and would say you should only get into this sort of cleverness if you need it.
Do a typedef for typedef uint8_t color_comp; for now and get the application working before trying to shave off fractions of a second performance later!
I don't know what your application does but it may be that it's not compute intensive in RGB channels and the bottleneck (if any) is elsewhere. Maybe you find some high load calculation where it's worth dealing with uint_fast8_t conversions and issues.
The wisdom of Knuth is that you probably don't know where that is until later anyway.
(*) It could be unsigned long or indeed some other type. The only constraint is that it is an unsigned type and at least 8 bits.

Converting from AoS to SoA - Write performance and Optimal packing

I am currently thinking about converting the data storage used in my CUDA kernels from Array of Structs (AoS) to Struct of Arrays (SoA).
I have a struct Element:
struct Element
{
float3 origin;
float3 direction;
uint8_t count1;
uint8_t count2;
unsigned int index;
float distance;
uint16_t instanceId;
uint64_t hash;
};
These structs are written in kernel1 each as a whole into an array residing in global memory and then subsets of the entries are used in multiple subsequent kernels.
I could now convert this to the following structure:
struct ElementSoA
{
float3 origin[N];
float3 direction[N];
uint8_t count1[N];
uint8_t count2[N];
unsigned int index[N];
float distance[N];
uint16_t instanceId[N];
uint64_t hash[N];
};
Questions:
1) Is the write performance affected if I have 8 separate, "small" writes in kernel1 instead of 1 "big" write?
2) Would it make sense to "pack" parts of the entries within ElementSoA, e.g. combining count1 and count2 into a
struct uint8_2
{
uint8_t count1;
uint8_t count2;
};
3) If "packing" is useful, is there a way to calculate the optimal structure of ElementSoA?
Suppose I have a list of the per-kernel read accesses like this:
kernel2: origin, direction, hash
kernel3: count1, count2, distance, hash
...
The reason I ask for a calculation of the optimal solution is that I have multiple structs and they contain even more entries than Element, so there is a huge number of combinations which I need to implement and test.
Assuming that thread i accesses element i and only element i. From the context this seems to be the case.
Is the write performance affected if I have 8 separate, "small" writes
in kernel1 instead of 1 "big" write?
Yes. It should be faster. The "big" write will be split into multiple small writes by the compiler, each of which will be strided. The memory subsystem works much better when access pattern is unstrided.
It's worth pointing out here that the float3 types you're using will also work like this, and will be split into three 32-bit transactions with a stride. There's no reason you can't convert these from AoS to SoA as well though.
Would it make sense to "pack" parts of the entries within ElementSoA?
Yes. Larger aligned power-of-two types (on current hardware, up to 128 bits) allow the hardware to load and store more efficiently. The difference isn't huge, but if it's easy to do, it's often worth it.
If "packing" is useful, is there a way to calculate the optimal
structure of ElementSoA?
There is no way to calculate this. One problem is that kernels have different characteristics. It maybe be that one is strongly bandwidth bound, so using efficient loads will help. Another may be compute bound, so more efficient loads won't help much.

When to use bit-fields in C

On the question 'why do we need to use bit-fields?', searching on Google I found that bit fields are used for flags.
Now I am curious,
Is it the only way bit-fields are used practically?
Do we need to use bit fields to save space?
A way of defining bit field from the book:
struct {
unsigned int is_keyword : 1;
unsigned int is_extern : 1;
unsigned int is_static : 1;
} flags;
Why do we use int?
How much space is occupied?
I am confused why we are using int, but not short or something smaller than an int.
As I understand only 1 bit is occupied in memory, but not the whole unsigned int value. Is it correct?
A quite good resource is Bit Fields in C.
The basic reason is to reduce the used size. For example, if you write:
struct {
unsigned int is_keyword;
unsigned int is_extern;
unsigned int is_static;
} flags;
You will use at least 3 * sizeof(unsigned int) or 12 bytes to represent three small flags, that should only need three bits.
So if you write:
struct {
unsigned int is_keyword : 1;
unsigned int is_extern : 1;
unsigned int is_static : 1;
} flags;
This uses up the same space as one unsigned int, so 4 bytes. You can throw 32 one-bit fields into the struct before it needs more space.
This is sort of equivalent to the classical home brew bit field:
#define IS_KEYWORD 0x01
#define IS_EXTERN 0x02
#define IS_STATIC 0x04
unsigned int flags;
But the bit field syntax is cleaner. Compare:
if (flags.is_keyword)
against:
if (flags & IS_KEYWORD)
And it is obviously less error-prone.
Now I am curious, [are flags] the only way bitfields are used practically?
No, flags are not the only way bitfields are used. They can also be used to store values larger than one bit, although flags are more common. For instance:
typedef enum {
NORTH = 0,
EAST = 1,
SOUTH = 2,
WEST = 3
} directionValues;
struct {
unsigned int alice_dir : 2;
unsigned int bob_dir : 2;
} directions;
Do we need to use bitfields to save space?
Bitfields do save space. They also allow an easier way to set values that aren't byte-aligned. Rather than bit-shifting and using bitwise operations, we can use the same syntax as setting fields in a struct. This improves readability. With a bitfield, you could write
directions.alice_dir = WEST;
directions.bob_dir = SOUTH;
However, to store multiple independent values in the space of one int (or other type) without bitfields, you would need to write something like:
#define ALICE_OFFSET 0
#define BOB_OFFSET 2
directions &= ~(3<<ALICE_OFFSET); // clear Alice's bits
directions |= WEST<<ALICE_OFFSET; // set Alice's bits to WEST
directions &= ~(3<<BOB_OFFSET); // clear Bob's bits
directions |= SOUTH<<BOB_OFFSET; // set Bob's bits to SOUTH
The improved readability of bitfields is arguably more important than saving a few bytes here and there.
Why do we use int? How much space is occupied?
The space of an entire int is occupied. We use int because in many cases, it doesn't really matter. If, for a single value, you use 4 bytes instead of 1 or 2, your user probably won't notice. For some platforms, size does matter more, and you can use other data types which take up less space (char, short, uint8_t, etc.).
As I understand only 1 bit is occupied in memory, but not the whole unsigned int value. Is it correct?
No, that is not correct. The entire unsigned int will exist, even if you're only using 8 of its bits.
Another place where bitfields are common are hardware registers. If you have a 32 bit register where each bit has a certain meaning, you can elegantly describe it with a bitfield.
Such a bitfield is inherently platform-specific. Portability does not matter in this case.
We use bit fields mostly (though not exclusively) for flag structures - bytes or words (or possibly larger things) in which we try to pack tiny (often 2-state) pieces of (often related) information.
In these scenarios, bit fields are used because they correctly model the problem we're solving: what we're dealing with is not really an 8-bit (or 16-bit or 24-bit or 32-bit) number, but rather a collection of 8 (or 16 or 24 or 32) related, but distinct pieces of information.
The problems we solve using bit fields are problems where "packing" the information tightly has measurable benefits and/or "unpacking" the information doesn't have a penalty. For example, if you're exposing 1 byte through 8 pins and the bits from each pin go through their own bus that's already printed on the board so that it leads exactly where it's supposed to, then a bit field is ideal. The benefit in "packing" the data is that it can be sent in one go (which is useful if the frequency of the bus is limited and our operation relies on frequency of its execution), and the penalty of "unpacking" the data is non-existent (or existent but worth it).
On the other hand, we don't use bit fields for booleans in other cases like normal program flow control, because of the way computer architectures usually work. Most common CPUs don't like fetching one bit from memory - they like to fetch bytes or integers. They also don't like to process bits - their instructions often operate on larger things like integers, words, memory addresses, etc.
So, when you try to operate on bits, it's up to you or the compiler (depending on what language you're writing in) to write out additional operations that perform bit masking and strip the structure of everything but the information you actually want to operate on. If there are no benefits in "packing" the information (and in most cases, there aren't), then using bit fields for booleans would only introduce overhead and noise in your code.
To answer the original question »When to use bit-fields in C?« … according to the book "Write Portable Code" by Brian Hook (ISBN 1-59327-056-9, I read the German edition ISBN 3-937514-19-8) and to personal experience:
Never use the bitfield idiom of the C language, but do it by yourself.
A lot of implementation details are compiler-specific, especially in combination with unions and things are not guaranteed over different compilers and different endianness. If there's only a tiny chance your code has to be portable and will be compiled for different architectures and/or with different compilers, don't use it.
We had this case when porting code from a little-endian microcontroller with some proprietary compiler to another big-endian microcontroller with GCC, and it was not fun. :-/
This is how I have used flags (host byte order ;-) ) since then:
# define SOME_FLAG (1 << 0)
# define SOME_OTHER_FLAG (1 << 1)
# define AND_ANOTHER_FLAG (1 << 2)
/* test flag */
if ( someint & SOME_FLAG ) {
/* do this */
}
/* set flag */
someint |= SOME_FLAG;
/* clear flag */
someint &= ~SOME_FLAG;
No need for a union with the int type and some bitfield struct then. If you read lots of embedded code those test, set, and clear patterns will become common, and you spot them easily in your code.
Why do we need to use bit-fields?
When you want to store some data which can be stored in less than one byte, those kind of data can be coupled in a structure using bit fields.
In the embedded word, when one 32 bit world of any register has different meaning for different word then you can also use bit fields to make them more readable.
I found that bit fields are used for flags. Now I am curious, is it the only way bit-fields are used practically?
No, this not the only way. You can use it in other ways too.
Do we need to use bit fields to save space?
Yes.
As I understand only 1 bit is occupied in memory, but not the whole unsigned int value. Is it correct?
No. Memory only can be occupied in multiple of bytes.
Bit fields can be used for saving memory space (but using bit fields for this purpose is rare). It is used where there is a memory constraint, e.g., while programming in embedded systems.
But this should be used only if extremely required because we cannot have the address of a bit field, so address operator & cannot be used with them.
A good usage would be to implement a chunk to translate to—and from—Base64 or any unaligned data structure.
struct {
unsigned int e1:6;
unsigned int e2:6;
unsigned int e3:6;
unsigned int e4:6;
} base64enc; // I don't know if declaring a 4-byte array will have the same effect.
struct {
unsigned char d1;
unsigned char d2;
unsigned char d3;
} base64dec;
union base64chunk {
struct base64enc enc;
struct base64dec dec;
};
base64chunk b64c;
// You can assign three characters to b64c.enc, and get four 0-63 codes from b64dec instantly.
This example is a bit naive, since Base64 must also consider null-termination (i.e. a string which has not a length l so that l % 3 is 0). But works as a sample of accessing unaligned data structures.
Another example: Using this feature to break a TCP packet header into its components (or other network protocol packet header you want to discuss), although it is a more advanced and less end-user example. In general: this is useful regarding PC internals, SO, drivers, an encoding systems.
Another example: analyzing a float number.
struct _FP32 {
unsigned int sign:1;
unsigned int exponent:8;
unsigned int mantissa:23;
}
union FP32_t {
_FP32 parts;
float number;
}
(Disclaimer: Don't know the file name / type name where this is applied, but in C this is declared in a header; Don't know how can this be done for 64-bit floating-point numbers since the mantissa must have 52 bits and—in a 32 bit target—ints have 32 bits).
Conclusion: As the concept and these examples show, this is a rarely used feature because it's mostly for internal purposes, and not for day-by-day software.
To answer the parts of the question no one else answered:
Ints, not Shorts
The reason to use ints rather than shorts, etc. is that in most cases no space will be saved by doing so.
Modern computers have a 32 or 64 bit architecture and that 32 or 64 bits will be needed even if you use a smaller storage type such as a short.
The smaller types are only useful for saving memory if you can pack them together (for example a short array may use less memory than an int array as the shorts can be packed together tighter in the array). For most cases, when using bitfields, this is not the case.
Other uses
Bitfields are most commonly used for flags, but there are other things they are used for. For example, one way to represent a chess board used in a lot of chess algorithms is to use a 64 bit integer to represent the board (8*8 pixels) and set flags in that integer to give the position of all the white pawns. Another integer shows all the black pawns, etc.
You can use them to expand the number of unsigned types that wrap. Ordinary you would have only powers of 8,16,32,64... , but you can have every power with bit-fields.
struct a
{
unsigned int b : 3 ;
} ;
struct a w = { 0 } ;
while( 1 )
{
printf("%u\n" , w.b++ ) ;
getchar() ;
}
To utilize the memory space, we can use bit fields.
As far as I know, in real-world programming, if we require, we can use Booleans instead of declaring it as integers and then making bit field.
If they are also values we use often, not only do we save space, we can also gain performance since we do not need to pollute the caches.
However, caching is also the danger in using bit fields since concurrent reads and writes to different bits will cause a data race and updates to completely separate bits might overwrite new values with old values...
Bitfields are much more compact and that is an advantage.
But don't forget packed structures are slower than normal structures. They are also more difficult to construct since the programmer must define the number of bits to use for each field. This is a disadvantage.
Why do we use int? How much space is occupied?
One answer to this question that I haven't seen mentioned in any of the other answers, is that the C standard guarantees support for int. Specifically:
A bit-field shall have a type that is a qualified or unqualified version of _Bool, signed int, unsigned int, or some other implementation defined type.
It is common for compilers to allow additional bit-field types, but not required. If you're really concerned about portability, int is the best choice.
Nowadays, microcontrollers (MCUs) have peripherals, such as I/O ports, ADCs, DACs, onboard the chip along with the processor.
Before MCUs became available with the needed peripherals, we would access some of our hardware by connecting to the buffered address and data buses of the microprocessor. A pointer would be set to the memory address of the device and if the device saw its address along with the R/W signal and maybe a chip select, it would be accessed.
Oftentimes we would want to access individual or small groups of bits on the device.
In our project, we used this to extract a page table entry and page directory entry from a given memory address:
union VADDRESS {
struct {
ULONG64 BlockOffset : 16;
ULONG64 PteIndex : 14;
ULONG64 PdeIndex : 14;
ULONG64 ReservedMBZ : (64 - (16 + 14 + 14));
};
ULONG64 AsULONG64;
};
Now suppose, we have an address:
union VADDRESS tempAddress;
tempAddress.AsULONG64 = 0x1234567887654321;
Now we can access PTE and PDE from this address:
cout << tempAddress.PteIndex;

Different ways to store information efficiently when working with limited space

I am and have been working on software for the Pebble. It is the first time I have worked with C, and I am struggling to get my head around how to manage information/data within the program.
I am used to being able to have multi-dimensional arrays with thousands of entries. With the Pebble we are very limited.
I can talk to the requirements for my program, but happy to see any sort of discussion on the topic.
The application I am building needs to store a running feed of data with every button press. Ideally I would like to store one binary value and two small integer values with each press. I would like to take advantage of the local storage on the Pebble which is limited to 256 bytes per array which presents a challenge.
I have thought about using a custom struct - and having multiple arrays of those, making sure to check that each array doesn't exceed the 256 byte mark. It just seems really messy and complicated to manage... am I missing something fundamentally simple, or does it need to be this complicated?
At the moment my program only stores the binary value and I haven't bothered with the small integer values at all.
Perhaps you could define structures as follows:
#pragma pack(1)
typedef struct STREAM_RECORD_S
{
unsigned short uint16; // The uint16 field will store a number from 0-65535
unsigned short uint15 : 15; // The uint15 field will store a number from 0-32767
unsigned short binary : 1; // The binary field will store a number from 0-1
} STREAM_RECORD_T;
typedef struct STREAM_BLOCK_S
{
struct STREAM_BLOCK_S *nextBlock; // Store a pointer to the next block.
STREAM_RECORD_T records[1]; // Array of records for this block.
} STREAM_BLOCK_T;
#pragma pack(0);
The actual number of records in the array would depend on the size of the nextBlock pointer. For example, if you are running with 32-bit addressing, the nextBlock size would likely be 4 bytes; and it would be 2 bytes with 16-bit addressing, or 8 bytes with 64-bit addressing. (I do not know the pointer size on an ARM Cortex-M3 processor).
So, recordsPerArray = (256 - sizeof(nextBlock)) / sizeof(STREAM_RECORD_T);

Precisly convert float 32 to unsigned short or unsigned char

First of all sorry if this is a duplicate, I couldn't find any subject answering my question.
I'm coding a little program that will be used to convert 32-bit floating point values to short int (16 bits) and unsigned char (8 bits) values. This is for HDR images purpose.
From here I could get the following function (without clamping):
static inline uint8_t u8fromfloat(float x)
{
return (int)(x * 255.0f);
}
I suppose that in the same way we could get short int by multiplying by (pow( 2,16 ) -1)
But then I ended up thinking about ordered dithering and especially to Bayer dithering. To convert to uint8_t I suppose I could use a 4x4 matrix and a 8x8 matrix for unsigned short.
I also thought of a Look-up table to speed-up the process, this way:
uint16_t LUT[0x10000] // 2¹⁶ values contained
and store 2^16 unsigned short values corresponding to a float.
This same table could be then used for uint8_t as well because of the implicit cast between unsigned short ↔ unsigned int
But wouldn't a look-up table like this be huge in memory? Also how would one fill a table like this?!
Now I'm confused, what would be best according to you?
EDIT after uwind answer: Let's say now that I also want to do basic color space conversion at the same time, that is before converting to U8/U16 , do a color space conversion (in float), and then shrink it to U8/U16. Wouldn't in that case use a LUT be more efficient? And yeah I would still have the problem to index the LUT.
The way I see it, the look-up table won't help since in order to index into it, you need to convert the float into some integer type. Catch 22.
The table would require 0x10000 * sizeof (uint16_t) bytes, which is 128 KB. Not a lot by modern standards, but on the other hand cache is precious. But, as I said, the table doesn't add much to the solution since you need to convert float to integer in order to index.
You could do a table indexed by the raw bits of the float re-interpreted as integer, but that would have to be 32 bits which becomes very large (8 GB or so).
Go for the straight-forward runtime conversion you outlined.
Just stay with the multiplication - it'll work fine.
Practically all modern CPU have vector instructions (SSE, AVX, ...) adapted to this stuff, so you might look at programming for that. Or use a compiler that automatically vectorizes your code, if possible (Intel C, also GCC). Even in cases where table-lookup is a possible solution, this can often be faster because you don't suffer from memory latency.
First, it should be noted that float has 24 bits of precision, which can no way fit into a 16-bit int or even 8 bits. Second, float have much larger range, which can't be stored in any int or long long int
So your question title is actually incorrect, no way to precisely convert any float to short or char. You want to map a float value between 0 and 1 to an 8-bit or 16-bit int range.
For the code you use above, it'll work fine. However the value 255 is extremely unlikely to be returned because it needs exactly 1.0 as input, otherwise values such as 254.99999 will ended up being truncated as 254. You should round the value instead
return (int)(x * 255.0f + .5f);
or better, use the code provided in your link for more balanced distribution
static inline uint8_t u8fromfloat_trick(float x)
{
union { float f; uint32_t i; } u;
u.f = 32768.0f + x * (255.0f / 256.0f);
return (uint8_t)u.i;
}
Using LUT wouldn't be any faster because a table for 16-bit values is too large for fitting in cache, and in fact may reduce your performance greatly. The snippet above needs only 2 floating-point instructions, or only 1 with FMA. And SIMD will improve performance 4-32x (or more) further, so LUT method would be easily outperformed as it's much harder to parallelize table look ups

Resources