C: Most efficient way to store variables where every bit matters

To start off: this might be a duplicate, but I can't seem to find a definitive answer to this question after searching for it on Google.
For a project I am designing a program that makes two ATmega328P chips communicate. At the moment I'm testing the best speed for this, but my goal is to achieve really high baud rates. I have plenty of experience with making code efficient, but not with the memory management part. The problem:
I want to store a multiple of 8 bits (e.g. 48 bits). My first thought was to use an array of length 6 and type uint8_t, but I don't know how efficient arrays are compared to other types. Some people say pointers are more efficient and others say it doesn't matter, but I can't find a definitive answer for really small amounts of memory. Last question: I know the size of the sent bits will never be bigger than 64 bits, so would it matter if I just always used uint64_t?
Edit:
To clarify: my goal is to minimize the storage size, not the transmission size.
Edit2:
What I meant by having a varying size: the size is determined at compile time, not while the program is running.

The ATmega328P is an 8-bit processor: its registers and ALU operate on 8 bits at a time, so wider types are handled as a sequence of byte operations. Nothing will be faster than simply using a uint8_t array.
What you can do is look at the .lss file generated when you compile. It shows the assembly code, and you can then look up each instruction in the AVR instruction set to see how many clock cycles it takes. I think you will find that using a uint64_t just adds unnecessary overhead unless you are very careful about how you put the bytes into it.
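A minimal sketch of the byte-array approach, assuming a 48-bit payload whose size is fixed at compile time. The names PAYLOAD_BYTES, payload_store and payload_load are made up for illustration; the uint64_t conversion is only there for code that is not time critical.

```c
#include <stdint.h>

#define PAYLOAD_BYTES 6  /* 48 bits; assumed to be fixed at compile time */

/* Split the low 48 bits of value into bytes, least significant byte first. */
static void payload_store(uint8_t buf[PAYLOAD_BYTES], uint64_t value)
{
    for (uint8_t i = 0; i < PAYLOAD_BYTES; i++) {
        buf[i] = (uint8_t)(value & 0xFF);
        value >>= 8;
    }
}

/* Rebuild the value from the byte array, most significant byte first. */
static uint64_t payload_load(const uint8_t buf[PAYLOAD_BYTES])
{
    uint64_t value = 0;
    for (uint8_t i = PAYLOAD_BYTES; i > 0; i--) {
        value = (value << 8) | buf[i - 1];
    }
    return value;
}
```

The array occupies exactly 6 bytes of RAM, whereas a uint64_t always occupies 8, so for storage size the array wins whenever the payload is shorter than 64 bits.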

If the length of your packets might vary, the most efficient approach would be to compress each packet before communication.
For example, the first 3 bits of each packet could determine the size of that packet.
Compressed packets are communicated faster and use less memory.
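One way the 3-bit size field could look, as a rough sketch. The layout is an assumption, not something from the question: the top 3 bits of the first byte hold the payload length minus one (so 1 to 8 bytes), and the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Write a frame into out: one header byte whose top 3 bits store (len - 1),
 * followed by len payload bytes. Returns the frame size, or 0 if len is
 * outside 1..8. The low 5 header bits are left free in this sketch. */
static uint8_t frame_encode(uint8_t *out, const uint8_t *payload, uint8_t len)
{
    if (len < 1 || len > 8) {
        return 0;
    }
    out[0] = (uint8_t)((len - 1) << 5);
    memcpy(&out[1], payload, len);
    return (uint8_t)(len + 1);
}

/* Read the payload length (1..8 bytes) back from the header byte. */
static uint8_t frame_payload_len(const uint8_t *frame)
{
    return (uint8_t)((frame[0] >> 5) + 1);
}
```

A receiver reads the first byte, calls frame_payload_len, and then knows how many more bytes to expect.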

Related

Can bit-level operations ever be "fast" in software?

Let me clarify the soft-sounding title straight away. This is actually something that has been nagging me for quite a while now, despite feeling like a pretty basic question.
Many languages give a faulty impression of efficiency by letting the developer play with bits, such as the stdbool.h C header, which, as I understand it, is essentially just an int with a wrapper around it. Essentially, the byte seems to be the absolute lowest atomic unit of computation in C - bool x = 0 is not faster/more memory efficient than int x = 0.
What I'm wondering is then, what do we do when we want to implement an algorithm that is inherently tied to loading and manipulating single bits, such as decoding binary codes, unweighted graph connectivity problems and many others? In other words, is the atomicity of the byte an inherent property of modern CPUs or could we theoretically rival the efficiency of an ASIC just by using machine code?
EDIT: Pretty surprised by the downvotes, but I suppose people just didn't understand what I was asking. I think a really good, canonical example is traversing a binary tree (or any other sequential list of yes/no questions really). What I was wondering is if modern cpu architectures are fundamentally poorly equipped to do this (as compared to an ASIC/FPGA, that is), or if this is an artifact of some abstraction layer (language/kernel/etc). Mark's answer was good though (although I'd love a reference to the mentioned architecture extension)

No you can't rival the efficiency of an ASIC. An ASIC means you can replicate parallel bit streams as much as you have budget for on the chip. You just cut and paste your HDL until you fill your die space. A CPU only has a limited number of cores.
I'm guessing that you think that bit operations like z = (x | (1 << y)) >> 4 are slow, and yes, all that bit shifting is extra overhead. But that is just accessing the bits. The bit operations themselves (OR, AND, etc.) are all as fast as you can get on a modern CPU, i.e. 1 cycle throughput.
The 8051 architecture has a way of accessing individual bits directly, without using byte registers, but if you are worried about speed, you wouldn't consider an 8051.
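To make the "accessing the bits" overhead concrete, here is a small sketch of single-bit access in a byte array. The helper names are made up; each call is one shift, one mask and one byte load or store.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit i lives in byte i / 8, at position i % 8 (least significant bit first). */
static inline void bit_set(uint8_t *bits, size_t i)
{
    bits[i >> 3] |= (uint8_t)(1u << (i & 7));
}

static inline void bit_clear(uint8_t *bits, size_t i)
{
    bits[i >> 3] &= (uint8_t)~(1u << (i & 7));
}

static inline int bit_test(const uint8_t *bits, size_t i)
{
    return (bits[i >> 3] >> (i & 7)) & 1;
}
```

The work per bit is small, but it is still a handful of instructions for something dedicated hardware does with a wire.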

By convention, a byte is the smallest addressable piece of memory in a computer. The number of bits that a byte has can differ from one system to another.
In the case of x86, there are instructions to move bytes from memory to a register and back, and instructions to manipulate values in registers. I can't speak to other architectures, but they most likely work in a similar way.
So anytime you need to manipulate some number of bits you need to do so a byte (or word, i.e. multiple bytes) at a time.

I also don't know why this question got so many downvotes. The question:
In other words, is the atomicity of the byte an inherent property of modern CPUs or could we theoretically rival the efficiency of an ASIC just by using machine code?
seems reasonable to me. It's certainly not bad compared to many questions on Stack Overflow.
The answer is: no, CPUs can't match the efficiency of an ASIC.
However, the reason is not because CPUs are manipulating bytes instead of bits. Instead it's because most of the work that CPUs do to process an instruction is involved with loading it from memory, decoding it, tracking dependencies, etc., rather than performing the actual arithmetic operations on bits or bytes that the instruction directs the CPU to perform.
A good explanation of this is shown in the following presentation from the 2014 LLVM developers meeting. The presentation shows how OpenCL can be used to generate custom FPGA hardware. Slides 12 to 28 show a nice pictorial example of overhead associated with a CPU algorithm and how custom hardware can remove much of this overhead.

Fastest use of a dataset of just over 64 bytes?

Structure: I have 8 64-bit integers (512 bits = 64 bytes, the assumed cache line width) that I would like to compare to another, single 64-bit integer, in turn, without cache misses. The data set is, unfortunately, absolutely inflexible -- it's already as small as possible.
Access pattern: Each uint64_t is in fact an array of 4x4x4 bits, each bit representing the presence or absence of a voxel. This means sometimes I will be using half of one chunk and half of another, or even corners of 8 different 64-bit chunks.... I guess what this means is there is a high likelihood of a lack of alignment.
How can I do this as fast as possible i.e. without thrashing the cache?
P.S. The idea is that this code will ultimately run on a fairly wide range of architectures with a cache line width of at least 64 B, so I'd prefer this were absolutely as fast as possible. This also means I can't rely on MOVNTDQA, which may incur a performance hit of its own anyway, in spite of loading the 9th element directly to the CPU.
P.P.S. My knowledge of this area is fairly limited so please take it easy on me. But please spare me the premature optimisation comments; be sure that this is the 3% of this application that really counts.

I wouldn't worry about it. If your dataset is really only 9 integers, most of it will likely be stored in registers anyway. Also, there isn't really any way to optimize cache usage without specifying an architecture, since cache structure is architecture dependent. If you can list several target architectures you may be able to find some commonalities that you can optimize toward, but without knowing those architectures, I don't think there's much we can do for you.
Lastly, this seems like a good example of optimizing too early. I would suggest you take the following steps:
1. Decide what your maximum acceptable run time is.
2. Finish your program in C.
3. Compile for all of your target architectures.
4. For those platforms that don't meet your speed spec, hand-optimize the intermediate assembly files and recompile until you meet your spec.

Are you sure you get cache misses?
Even if the comparison value is not in a register, I think your first uint64 array should be in one cache line (or whatever it is called) and your other data in another.
Your cache surely has some n-way associativity, which prevents your array's line from being evicted from the cache just by accessing your comparison value.
Do not waste your time on micro-optimizations. Improve your algorithms and data structures.
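If you do want the eight words to share a single line, a C11 sketch like the following would do it, assuming a 64-byte cache line. The type and function names are hypothetical, and "compare" is taken here to mean "do any voxel bits overlap", which is a guess at the intended operation.

```c
#include <stdalign.h>
#include <stdint.h>

/* Eight 64-bit occupancy words aligned to an assumed 64-byte cache line. */
typedef struct {
    alignas(64) uint64_t chunk[8];
} voxel_block;

/* Returns an 8-bit mask: bit i is set when chunk[i] shares a set bit with probe. */
static unsigned voxel_block_overlap(const voxel_block *b, uint64_t probe)
{
    unsigned mask = 0;
    for (int i = 0; i < 8; i++) {
        mask |= (unsigned)((b->chunk[i] & probe) != 0) << i;
    }
    return mask;
}
```

With the block aligned like this, the loop touches exactly one line for the array, and the probe value will normally sit in a register for the whole loop.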

C fastest way to compare two bitmaps

There are two arrays of bitmaps in the form of char arrays with millions of records. What would be the fastest way to compare them using C?
I can imagine using the bitwise XOR operator one byte at a time in a for loop.
Important points about the bitmaps:
The bitmaps differ in 1% to 10% of the runs of the algorithm. Most of the time they will be the same. When they do differ, they can differ by as much as 100%. There is a high probability that changed bits occur in a continuous streak.
Both bitmaps are of the same length.
Aim:
Check whether they differ and, if so, where.
Be correct every time (the probability of detecting an error if there is one should be 1).

This answer assumes you mean 'bitmap' as a sequence of 0/1 values rather than 'bitmap image format'
If you simply have two bitmaps of the same length and wish to compare them quickly, memcmp() will be effective, as someone suggested in the comments. You could try SSE-type optimizations if you want, but these are not as easy as memcmp(). Using memcmp() assumes you simply want to know whether they are different and nothing more.
If you want to know how many bits they differ by, e.g. 615 bits differ, then again you have little option except to XOR every byte and count the number of differences. As others have noted, you probably want to do this 32, 64 or even 256 bits at a time, depending on your platform. However, if the arrays are millions of bytes long, then the biggest delay (with current CPUs) will be the time to transfer the data from main memory to the CPU, and it won't matter terribly what the CPU does (lots of caveats here).
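A sketch of the XOR-and-count idea, 64 bits at a time. The function names are invented for the example, and the bit count uses a portable loop; many compilers offer a faster builtin for that step.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable population count (clears the lowest set bit per iteration). */
static unsigned popcount64(uint64_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
}

/* Returns the number of differing bits between a and b (len bytes each).
 * *first_diff receives the byte offset of the first difference, or len if
 * the bitmaps are identical. */
static size_t bitmap_diff(const unsigned char *a, const unsigned char *b,
                          size_t len, size_t *first_diff)
{
    size_t count = 0, i = 0;
    *first_diff = len;
    for (; i + 8 <= len; i += 8) {
        uint64_t wa, wb;
        memcpy(&wa, a + i, 8);   /* memcpy avoids unaligned-access problems */
        memcpy(&wb, b + i, 8);
        uint64_t x = wa ^ wb;
        if (x) {
            if (*first_diff == len) {
                size_t j = i;
                while (a[j] == b[j]) j++;   /* locate the byte inside this word */
                *first_diff = j;
            }
            count += popcount64(x);
        }
    }
    for (; i < len; i++) {                  /* tail bytes */
        unsigned char x = (unsigned char)(a[i] ^ b[i]);
        if (x) {
            if (*first_diff == len) *first_diff = i;
            count += popcount64(x);
        }
    }
    return count;
}
```

Since the bitmaps are identical most of the time, calling memcmp() first and only running this when it reports a difference keeps the common case cheap.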
If your question is more about comparing A to B many times over, such as A to B and then to C, D, E, etc., then you can do a couple of things:
A. Store a checksum of each array and compare the checksums first. If these are the same, then there is a high chance the arrays are the same. Obviously there is a risk here that the checksums can be equal while the data differs, so make sure that a false result in this case will not have dramatic side effects. And, if you cannot withstand false results, do not use this technique.
B. If the arrays have structure, such as image data, then leverage specific tools for this. How to do so is beyond the scope of this answer.
C. If the image data can be compressed effectively, then compress each array and compare using the compressed form. If you use a ZIP type of compression you cannot tell directly from the compressed data how many bits differ, but other techniques such as RLE can be effective for quickly counting bit differences (though they are a lot of work to build and to get correct and fast).
D. If the risk with (A) is acceptable, then you can checksum each chunk of, say, 262144 bits, and only count differences where the checksums differ. This heavily reduces main memory access and will go a lot faster.
All of the options A to D are about reducing main memory access, as this is the nub of any performance gain (for the problem as stated).

How can we allocate memory of order 10^15 in C

I need to allocate memory of the order of 10^15 elements to store integers, which can be of type long long.
If I use an array and declare something like
long long a[1000000000000000];
that's never going to work. So how can I allocate such a huge amount of memory?

Really large arrays generally aren't a job for memory, more one for disk. 10^15 array elements at 64 bits apiece is (I think) 8 petabytes. You can pick up 8G memory slices for about $15 at the moment so, even if your machine could handle that much memory or address space, you'd be outlaying about $15 million.
In addition, with upcoming DDR4 being clocked at up to about 4 GT/s (giga-transfers per second), even if each transfer was a 64-bit value, it would still take about 250,000 seconds (10^15 transfers divided by 4 x 10^9 per second) just to initialise that array to zero. Do you really want to be waiting around for nearly three days before your code even starts doing anything useful?
And, even if you go the disk route, that's quite a bit. At (roughly) $50 per TB, you're still looking at $400,000 and you'll possibly have to provide your own software for managing those 8,000 disks somehow. And I'm not even going to contemplate figuring out how long it would take to initialise the array on disk.
You may want to think about rephrasing your question to indicate the actual problem rather than what you currently have, a proposed solution. It may be that you don't need that much storage at all.
For example, if you're talking about an array where many of the values are left at zero, a sparse array is one way to go.
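As a rough sketch of what a sparse array can look like in C: only the elements that were actually written are stored, in a small open-addressing hash table, and every other index reads as zero. All names here are hypothetical, the capacity is fixed (no resizing), and the capacity is assumed to be a power of two.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t *keys;      /* stored as index + 1, so 0 marks an empty slot */
    long long *values;
    size_t capacity;     /* assumed to be a power of two */
    size_t used;
} sparse_array;

static int sparse_init(sparse_array *s, size_t capacity_pow2)
{
    s->keys = calloc(capacity_pow2, sizeof *s->keys);
    s->values = calloc(capacity_pow2, sizeof *s->values);
    s->capacity = capacity_pow2;
    s->used = 0;
    return (s->keys && s->values) ? 0 : -1;
}

/* Linear probing: returns the slot holding index, or the empty slot for it. */
static size_t sparse_slot(const sparse_array *s, uint64_t index)
{
    size_t i = (size_t)((index * 0x9E3779B97F4A7C15ULL) >> 32) & (s->capacity - 1);
    while (s->keys[i] != 0 && s->keys[i] != index + 1) {
        i = (i + 1) & (s->capacity - 1);
    }
    return i;
}

/* Returns 0 on success, -1 when the table is full. */
static int sparse_set(sparse_array *s, uint64_t index, long long value)
{
    size_t i = sparse_slot(s, index);
    if (s->keys[i] == 0) {
        if (s->used + 1 >= s->capacity) {
            return -1;   /* keep one slot empty so probing always terminates */
        }
        s->keys[i] = index + 1;
        s->used++;
    }
    s->values[i] = value;
    return 0;
}

/* Indices that were never set read as zero. */
static long long sparse_get(const sparse_array *s, uint64_t index)
{
    size_t i = sparse_slot(s, index);
    return s->keys[i] ? s->values[i] : 0;
}

static void sparse_free(sparse_array *s)
{
    free(s->keys);
    free(s->values);
}
```

Memory use is proportional to the table capacity rather than to the 10^15 logical elements, which only helps if most of the array really does stay at zero.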

You can't. You don't have all this memory, and you won't have it for a while. Simple.
EDIT: If you really want to work with data that does not fit into your RAM, you can use a library that works with mass storage data, like stxxl, but it will be a lot slower, and you are always limited by disk size.

MPI is what you need. That's actually a small size for parallel computing problems: the Blue Gene/Q monster at Lawrence Livermore National Laboratory holds around 1.5 PB of RAM. You need to use block decomposition to divide up your problem and voilà!
The basic approach is dividing the array into equal blocks or chunks among many processors.
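The index arithmetic behind block decomposition can be sketched without any MPI calls. The helper below is hypothetical; in a real MPI program the rank and process count would come from MPI_Comm_rank and MPI_Comm_size.

```c
#include <stdint.h>

/* Half-open range [*begin, *end) of the elements owned by rank when n
 * elements are split as evenly as possible across nprocs processes. */
static void block_range(uint64_t n, uint64_t nprocs, uint64_t rank,
                        uint64_t *begin, uint64_t *end)
{
    uint64_t base = n / nprocs;
    uint64_t extra = n % nprocs;   /* the first extra ranks get one more element */
    *begin = rank * base + (rank < extra ? rank : extra);
    *end = *begin + base + (rank < extra ? 1 : 0);
}
```

Each process then allocates only end - begin elements and works on its own slice.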

You need to upgrade to a 64-bit system. Then get a 64-bit-capable compiler and put an l at the end of 100000000000000000.
Have you heard of sparse matrix implementations? With a sparse matrix, you only use a very small part of the matrix even though the matrix itself is huge.
Here are some libraries for you.
Here is some basic info about sparse matrices. You don't actually use all of it, just the few points you need.

LZW compression/decompression under low memory conditions

Can anybody give pointers on how I can implement LZW compression/decompression in low memory conditions (< 2k)? Is that possible?

The zlib library that everyone uses is bloated, among other problems (for embedded). I am pretty sure it won't work for your case. I had a little more memory, maybe 16K, and couldn't get it to fit. It allocates and zeros large chunks of memory, keeps copies of stuff, etc. The algorithm can maybe do it, but finding existing code is the challenge.
I went with http://lzfx.googlecode.com. The decompression loop is tiny. It is the older LZ type of compression that relies on the prior results, so you need to have access to the uncompressed results: the next byte is a 0x5, the next byte is a 0x23, the next 15 bytes are a copy of the 15 bytes from 200 bytes ago, the next 6 bytes are a copy of the 6 bytes from 127 bytes ago, and so on. The newer LZ algorithm is variable-width and table-based, and the table can be big or grow depending on how it is implemented.
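To show how small such a decompression loop can be, here is a sketch of an LZ77-style decoder. The token format is invented for this example and is not the LZFX format: a control byte below 0x80 starts a literal run of (control + 1) bytes; otherwise the low 7 bits plus 3 give a match length and the next byte plus 1 gives the distance back into the output.

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes in[0..in_len) into out. Returns the number of bytes written,
 * or -1 on malformed input or if out_cap is too small. */
static long lz_decode(const uint8_t *in, size_t in_len,
                      uint8_t *out, size_t out_cap)
{
    size_t ip = 0, op = 0;
    while (ip < in_len) {
        uint8_t ctrl = in[ip++];
        if (ctrl < 0x80) {                       /* literal run */
            size_t n = (size_t)ctrl + 1;
            if (ip + n > in_len || op + n > out_cap) return -1;
            while (n--) out[op++] = in[ip++];
        } else {                                 /* back reference */
            size_t n = (size_t)(ctrl & 0x7F) + 3;
            if (ip >= in_len) return -1;
            size_t dist = (size_t)in[ip++] + 1;
            if (dist > op || op + n > out_cap) return -1;
            while (n--) {                        /* byte by byte, so overlapping copies work */
                out[op] = out[op - dist];
                op++;
            }
        }
    }
    return (long)op;
}
```

Apart from the input and output buffers, the only state is two indices, which is why this family of decoders fits where a table-based one does not. The output buffer doubles as the dictionary.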
I was dealing with repetitive data and trying to squeeze a few K down into a few hundred bytes. I think the compression was about 50%, not great, but it did the job and the decompression routine was tiny. The lzfx package above is small, not like zlib: about two main functions that have the code right there, not dozens of files. You could likely change the depth of the buffer, and perhaps improve the compression algorithm if you so desire. I did have to modify the decompression code (20 or 30 lines of code perhaps). It was pointer heavy and I switched it to arrays because in my embedded environment the pointers were in the wrong place. That burns maybe an extra register, or not, depending on how you implement it and on your compiler. I also did that so I could abstract the fetches and stores of the bytes, as I had them packed into memory that wasn't byte addressable.
If you find something better please post it here or ping me through stackoverflow, I am also very interested in other embedded solutions. I searched quite a bit and the above was the only useful one I found and I was lucky that my data was such that it compressed well enough using that algorithm...for now.

Can anybody give pointers on how I can implement LZW compression/decompression in low memory conditions (< 2k)? Is that possible?
Why LZW? LZW needs lots of memory. It is based on a hash/dictionary, and the compression ratio is proportional to the hash/dictionary size. More memory: better compression. Less memory: the output can be even larger than the input.
I haven't touched encoding for a very long time, but IIRC Huffman coding is a little bit better when it comes to memory consumption.
But it all depends on type of information you want to compress.

I have used LZSS. I used code from Haruhiko Okumura as a base. It uses the last portion of the uncompressed data (2K) as the dictionary. The code I linked can be modified to use almost no memory if you have all the uncompressed data available in memory. With a bit of googling you will find a lot of different implementations.

If the choice of compression algorithm isn't set in stone, you might try gzip/LZ77 instead. Here's a very simple implementation I used and adapted once:
ftp://quatramaran.ens.fr/pub/madore/misc/myunzip.c
You'll need to clean up the way it reads input, error handling, etc. but it's a good start. It's probably also way too big if your data AND code need to fit in 2k, but at least the data size is small already.
Big plus is that it's public domain so you can use it however you like!

It has been over 15 years since I last played with the LZW compression algorithm, so take the following with a grain of salt.
Given the memory constraints, this is going to be difficult at best. The dictionary you build is going to consume the vast majority of what you have available. (Assuming that code + memory <= 2k.)
Pick a small fixed size for your dictionary. Say 1024 entries.
Let each dictionary entry take the form of:
    struct entry {
        intType prevIdx;   /* index of the previous entry in the chain */
        charType newChar;  /* character appended by this entry */
    };
This structure makes the dictionary recursive. You need the item at the previous index to be valid in order for it to work properly. Is this workable? I'm not sure. However, let us assume for the moment that it is and find out where it leads us ....
If you use the standard types for int and char, you are going to run out of memory fast. You will want to pack things together as tightly as possible. An index into 1024 entries takes 10 bits to store. Your new character will likely take 8 bits. Total = 18 bits.
18 bits * 1024 entries = 18432 bits or 2304 bytes.
At first glance this appears too large. What do we do? Take advantage of the fact that the first 256 entries are already known: your typical extended ASCII set or what have you. This means we really only need 768 entries.
768 * 18 bits = 13824 bits or 1728 bytes.
This leaves you with about 320 bytes to play with for code. Naturally, you can play around with the dictionary size and see what's good for you, but you will not end up with very much space for your code. Since you are looking at so little code space, I would expect that you would end up coding in assembly.
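A sketch of how the 18-bit entries could be packed back to back in a byte array, assuming the 768-entry layout described above. The names are made up, and the bit-by-bit loops favour clarity over speed; a real implementation would move several bits per step.

```c
#include <stdint.h>

#define DICT_ENTRIES 768   /* codes 256..1023; codes 0..255 are implicit */
#define ENTRY_BITS   18    /* 10-bit previous index + 8-bit character */

static uint8_t dict[(DICT_ENTRIES * ENTRY_BITS + 7) / 8];   /* 1728 bytes */

/* Store (prev_idx, ch) in the given slot, one bit at a time. */
static void dict_put(uint16_t slot, uint16_t prev_idx, uint8_t ch)
{
    uint32_t entry = ((uint32_t)(prev_idx & 0x3FF) << 8) | ch;
    uint32_t bit = (uint32_t)slot * ENTRY_BITS;
    for (uint8_t i = 0; i < ENTRY_BITS; i++, bit++) {
        uint8_t mask = (uint8_t)(1u << (bit & 7));
        if ((entry >> i) & 1) {
            dict[bit >> 3] |= mask;
        } else {
            dict[bit >> 3] &= (uint8_t)~mask;
        }
    }
}

/* Read the entry in the given slot back into *prev_idx and *ch. */
static void dict_get(uint16_t slot, uint16_t *prev_idx, uint8_t *ch)
{
    uint32_t entry = 0;
    uint32_t bit = (uint32_t)slot * ENTRY_BITS;
    for (uint8_t i = 0; i < ENTRY_BITS; i++, bit++) {
        entry |= (uint32_t)((dict[bit >> 3] >> (bit & 7)) & 1) << i;
    }
    *prev_idx = (uint16_t)(entry >> 8);
    *ch = (uint8_t)(entry & 0xFF);
}
```

The slot number is the LZW code minus 256, since the first 256 codes are the single characters themselves and need no storage.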
I hope this helps.

My best recommendation is to examine the BusyBox source and see if their LZW implementation is sufficiently small to work in your environment.

The smallest dictionary for LZW is a trie on a linked list. See the original implementation in LZW AB. I've rewritten it in the fork LZWS. The fork is compatible with compress. Detailed documentation here.
An n-bit dictionary requires (2 ** n) * sizeof(code) + ((2 ** n) - 257) * sizeof(code) + (2 ** n) - 257 bytes.
So:
9 bit code - 1789 bytes.
12 bit code - 19709 bytes.
16 bit code - 326909 bytes.
Please be aware that these are the requirements for the dictionary only. You also need about 100-150 bytes for state or variables on the stack.
The decompressor will use less memory than the compressor.
So I think you can try to compress your data with the 9-bit version, but it won't provide a good compression ratio. The more bits you have, the better the ratio.
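The formula can be checked with a few lines of C. The function name is made up; a code_size of 2 bytes is an assumption, but it is the value that reproduces the three figures listed above.

```c
#include <stddef.h>

/* Dictionary size in bytes for an n_bits-wide code (n_bits >= 9), where
 * code_size stands for sizeof(code) in the formula. */
static size_t lzw_dict_bytes(unsigned n_bits, size_t code_size)
{
    size_t codes = (size_t)1 << n_bits;
    return codes * code_size + (codes - 257) * code_size + (codes - 257);
}
```

For example, lzw_dict_bytes(9, 2) gives 1789, which leaves roughly 250 bytes of a 2048-byte budget for everything else.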

typedef unsigned int UINT;
typedef unsigned char BYTE;
BYTE *lzw_encode(BYTE *input ,BYTE *output, long filesize, long &totalsize);
BYTE *lzw_decode(BYTE *input ,BYTE *output, long filesize, long &totalsize);
