Padding in 24-bit RGB bitmap - C

Could somebody explain to me why, in a 24-bit RGB bitmap file, I have to add padding whose size depends on the width of the image? What is it for?
I mean, I must add this code to my program (in C):
if( read % 4 != 0 ) {
    read = 4 - (read % 4);
    printf( "Padding: %d bytes\n", read );
    fread( pixel, read, 1, inFile );
}

Because 24 bits is an odd number of bytes (3), and for a variety of reasons all the image rows are required to start at an address which is a multiple of 4 bytes.

According to Wikipedia, the bitmap file format specifies that:
The bits representing the bitmap pixels are packed in rows. The size of each row is rounded up to a multiple of 4 bytes (a 32-bit DWORD) by padding. Padding bytes (not necessarily 0) must be appended to the end of the rows in order to bring up the length of the rows to a multiple of four bytes. When the pixel array is loaded into memory, each row must begin at a memory address that is a multiple of 4. This address/offset restriction is mandatory only for Pixel Arrays loaded in memory. For file storage purposes, only the size of each row must be a multiple of 4 bytes while the file offset can be arbitrary. A 24-bit bitmap with Width=1, would have 3 bytes of data per row (blue, green, red) and 1 byte of padding, while Width=2 would have 2 bytes of padding, Width=3 would have 3 bytes of padding, and Width=4 would not have any padding at all.
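As a quick illustration of that rule, here is a minimal sketch (mine, not from the format spec) that computes the padded row size of a 24-bit bitmap for the widths mentioned in the quote; the (raw + 3) & ~3 expression is the usual round-up-to-a-multiple-of-4 trick:
#include <stdio.h>

int main(void)
{
    for (int width = 1; width <= 4; width++) {
        int raw    = width * 3;        /* 3 bytes per 24-bit pixel    */
        int stride = (raw + 3) & ~3;   /* round up to a multiple of 4 */
        printf("Width=%d: %d data + %d padding = %d bytes per row\n",
               width, raw, stride - raw, stride);
    }
    return 0;
}
This prints 1, 2, 3 and 0 bytes of padding for widths 1 to 4, matching the quote above.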
The wikipedia article on Data Structure Padding is also an interesting read that explains the reasons that paddings are generally used in computer science.

I presume this was a design decision to align rows for better memory access patterns while not wasting much space (for a 319 px wide image you would waste 3 bytes, or about 0.3%).
Imagine you need to access some odd row directly. You could access the first 4 pixels of the n-th row by doing:
uint8_t *startRow = bmp + n * width * 3; //3 bytes per pixel
uint8_t r1 = startRow[0];
uint8_t g1 = startRow[1];
//... Repeat
uint8_t b4 = startRow[11];
Note that if n and width are odd (and bmp is even), startRow is going to be odd.
Now if you tried to do the following speedup:
uint32_t *startRow = (uint32_t *) (bmp + n * width * 3);
uint32_t a = startRow[0]; //Loading register at a time is MUCH faster
uint32_t b = startRow[1]; //but only if address is aligned
uint32_t c = startRow[2]; //else code can hit bus errors!
uint8_t r1 = (a & 0xFF000000) >> 24;
uint8_t g1 = (a & 0x00FF0000) >> 16;
//... Repeat
uint8_t b4 = (c & 0x000000FF) >> 0;
You'd run into lots of problems. In the best-case scenario (that is, an Intel CPU), every load of a, b and c would need to be broken into two loads, since startRow is not divisible by 4. In the worst-case scenario (e.g. a Sun SPARC), your program would crash with a "bus error".
In newer designs it is common to force rows to be aligned to at least the L1 cache line size (64 bytes on Intel CPUs, or 128 bytes on NVIDIA GPUs).

Short version
Because the BMP file format specifies that rows must fit exactly into 32-bit "memory cells". Because pixels are 24 bits, some combinations of pixels will not sit perfectly in those 32-bit "cells". In this case, the row is "padded up to" the next full 32 bits.
8 bits per byte ∴
cell: 32 bits = 4 bytes ∴
pixel: 24 bits = 3 bytes
// If it doesn't fit perfectly in a 4-byte "cell"
if( read % 4 != 0 ) {
    // find the difference between the "cell" and the "partial fit"
    read = 4 - (read % 4);
    printf( "Padding: %d bytes\n", read );
    // skip the difference
    fread( pixel, read, 1, inFile );
}
Long version
In computing, a word is the natural unit of data used by a particular processor design. A word is a fixed-sized piece of data handled as a unit by the instruction set or the hardware of the processor
-wiki: Word_(computer_architecture)
Computer systems basically have a preferred "word length" (though not as important these days). A standard data unit allows all sorts of optimisations in the architecture of the computer system (think of what shipping containers did for the shipping industry). There is a 32-bit standard called DWORD, aka double word, and that's what typical bitmap images are optimised for.
So if you have 24 bits per pixel, there will be various "literal pixel" row lengths that will not fit nicely into 32 bits. So in that case, pad the row out.
Note: today, you are probably using a computer with a 64-bit word size. Check your processor.

It depends on the format whether or not there is padding at the end of each row.
There really isn't much reason for it with 3 × 8-bit channel images, since I/O is byte oriented anyway. For images with pixels packed into less than a byte (1 bit per pixel, for example), padding is useful so that each row starts at a byte offset.

Related

Copy from one memory to another skipping constant bytes in C

I am working on an embedded system application. I want to copy from source to destination, skipping a constant number of bytes. For example: with source[6] = {0,1,2,3,4,5}, I want the destination to be {0,2,4}, skipping one byte each time. Unfortunately memcpy could not fulfil my requirement. How can I achieve this in C without using a loop? I have a large amount of data to process, and using a loop introduces a time overhead.
My current implementation is something like this, which takes up to 5-6 milliseconds to copy 1500 bytes:
unsigned int len_actual = 1500;
/* Fill in the SPI DMA buffer. */
while (len_actual-- != 0)
{
    *(tgt_handle->spi_tx_buff++) = ((*write_irp->buffer++)) | (2 << 16) | DSPI_PUSHR_CONT;
}
You could write a "cherry picker" function
void *memcpk(void *destination, const void *source,
             size_t num, size_t size,
             int (*test)(const void *item));
which copies at most num "objects", each having size size, from source to destination. Only the objects that satisfy the test are copied.
Then with
int oddp(const void * intptr) { return (*((int *)intptr))%2; }
int evenp(const void * intptr) { return !oddp(intptr); }
you could do
int destination[6];
memcpk(destination, source, 6, sizeof(int), evenp);
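Only the prototype was given above; a minimal sketch of how memcpk itself could be implemented:
#include <stddef.h>
#include <string.h>

void *memcpk(void *destination, const void *source,
             size_t num, size_t size,
             int (*test)(const void *item))
{
    char *dst = destination;
    const char *src = source;
    for (size_t i = 0; i < num; i++) {
        if (test(src)) {    /* copy only objects that satisfy the test */
            memcpy(dst, src, size);
            dst += size;
        }
        src += size;        /* always advance over the source */
    }
    return destination;
}
Note that it still loops internally; when the selection depends on the data, every object has to be examined.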
Almost all CPUs have caches; which means that (e.g.) when you modify one byte the CPU fetches an entire cache line from RAM, modifies the byte in the cache, then writes the entire cache line back to RAM. By skipping small pieces you add overhead (more instructions for the CPU to care about) and won't reduce the amount of data transferred between cache and RAM.
Also, memcpy() is typically optimised to copy larger pieces. For example, if you copy an array of bytes but the CPU is capable of copying 32 bits (4 bytes) at once, then memcpy() will probably do the majority of the copying as a loop with 4 bytes per iteration (to reduce the number of reads and writes and reduce the number of loop iterations).
In other words; code to avoid copying specific bytes will make it significantly slower than memcpy() for multiple reasons.
To avoid that, you really want to separate the data that needs to be copied from the data that doesn't - e.g. put everything that doesn't need to be copied at the end of the array and only copy the first part of the array (so that it remains "copy a contiguous area of bytes").
If you can't do that, the next alternative to consider would be masking. For example, if you have an array of bytes where some bytes shouldn't be copied, then you'd also have an array of "mask bytes" and do something like dest[i] = (dest[i] & mask[i]) | (src[i] & ~mask[i]); in a loop. This sounds horrible (and is horrible) until you optimise it by operating on larger pieces - e.g. if the CPU can copy 32-bit pieces, masking allows you to do 4 bytes per iteration by pretending all of the arrays are arrays of uint32_t. Note that for this technique wider is better - e.g. if the CPU supports operations on 256-bit pieces (AVX on 80x86) you'd be able to do 32 bytes per iteration of the loop. It also helps if you can make guarantees about the size and alignment (e.g. if the CPU can operate on 32 bits/4 bytes at a time, ensure that the size of the arrays is always a multiple of 4 bytes and that the arrays are always 4-byte aligned; even if it means adding unused padding at the end).
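A sketch of that masked copy over 32-bit pieces (assuming dest, src and mask are 4-byte aligned and len is a multiple of 4, as discussed above):
#include <stdint.h>
#include <stddef.h>

void masked_copy32(uint8_t *dest, const uint8_t *src,
                   const uint8_t *mask, size_t len)
{
    /* The casts assume 4-byte alignment of all three buffers. */
    uint32_t *d = (uint32_t *)dest;
    const uint32_t *s = (const uint32_t *)src;
    const uint32_t *m = (const uint32_t *)mask;

    for (size_t i = 0; i < len / 4; i++)
        d[i] = (d[i] & m[i]) | (s[i] & ~m[i]); /* 0xFF mask byte keeps dest */
}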
Also note that depending on which CPU it actually is, there might be special support in the instruction set. For one example, modern 80x86 CPUs (that support SSE2) have a maskmovdqu instruction that is designed specifically for selectively writing some bytes but not others. In that case, you'd need to resort to intrinsics or inline assembly because "pure C" has no support for this type of thing (beyond bitwise operators).
Having overlooked your speed requirements at first:
You may try to find a way which solves the problem without copying at all.
Some ideas here:
If you want to iterate the destination array, you could define a kind of "picky iterator" for source that advances to the next number you allow: instead of iter++, do iter = advance_source(iter) (see the sketch after this list).
If you want to search the destination array then wrap a function around bsearch() that searches source and inspects the result. And so on.
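A sketch of the first idea, with an "even values only" predicate hard-coded purely for illustration:
const int *advance_source(const int *iter, const int *end)
{
    do {
        ++iter;                               /* step past the current element */
    } while (iter < end && (*iter % 2) != 0); /* skip disallowed values        */
    return iter;
}
You would then iterate with iter = advance_source(iter, end) instead of iter++.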
Depending on your processor memory width, and number of internal registers, you might be able to speed this up by using shift operations.
You need to know if your processor is big-endian or little-endian.
Let's say you have a 32-bit processor and bus, and at least 4 spare registers that the compiler can use for optimisation. This means you can read or write 4 bytes in the same target word, having read 2 source words. Note that you are reading the bytes you are going to discard.
You can also improve the speed by making sure that everything is word-aligned and by ignoring any gaps between the buffers, so you don't have to worry about odd byte counts.
So, for little-endian:
inline unsigned long CopyEven(unsigned long a, unsigned long b)
{
    unsigned long c = a & 0xff;      /* source byte 0 -> dest byte 0 */
    c |= (a >> 8) & 0xff00;          /* source byte 2 -> dest byte 1 */
    c |= (b << 16) & 0xff0000;       /* source byte 4 -> dest byte 2 */
    c |= (b << 8) & 0xff000000;      /* source byte 6 -> dest byte 3 */
    return c;
}
unsigned long *d = (unsigned long *)dest;
unsigned long *s = (unsigned long *)source;
for (int count = 0; count < sourceLenBytes; count += 8)
{
    *d = CopyEven(s[0], s[1]);
    d++;
    s += 2;
}

Parsing ID3V2 Frames in C

I have been attempting to retrieve ID3V2 Tag Frames by parsing through the mp3 file and retrieving each frame's size. So far I have had no luck.
I have effectively allocated memory to a buffer to aid in reading the file and have been successful in printing out the header version, but am having difficulty retrieving both the header and frame sizes. For the header size I get 1347687723, although viewing the file in a hex editor I see 05 2B 19.
Two snippets of my code:
typedef struct {                  // structure used to read tag information
    char tagid[3];                // bytes 0-2: "ID3"
    unsigned char tagversion;     // byte  3:   $04
    unsigned char tagsubversion;  // byte  4:   00
    unsigned char flags;          // byte  5:   %abc00000
    uint32_t size;                // bytes 6-9: 4 * %0xxxxxxx
} ID3TAG;
if (buff) {
    fseek(filename, 0, SEEK_SET);
    fread(&Tag, 1, sizeof(Tag), filename);
    if (memcmp(Tag.tagid, "ID3", 3) == 0)
    {
        printf("ID3V2.%02x.%02x.%02x \nHeader Size:%lu\n", Tag.tagversion,
               Tag.tagsubversion, Tag.flags, Tag.size);
    }
}
Due to memory alignment, the compiler has inserted 2 bytes of padding between flags and size. If your struct were laid out directly in memory with no padding, size would be at offset 6 from the beginning of the struct. Since a 4-byte element must sit at an address that is a multiple of 4, the compiler adds 2 bytes so that size moves to the closest multiple-of-4 offset, which here is 8. So when you read from your file, size contains bytes 8-11 rather than bytes 6-9. If you print *(uint32_t *)((char *)&Tag.size - 2), you'll likely see the expected value.
To fix that, you can read fields one by one.
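For example (a sketch; fp is the open FILE * - the question's code calls it filename - and the 10-byte ID3v2 header layout is "ID3", version, revision, flags, then a 4-byte size):
ID3TAG Tag;
unsigned char rawsize[4];

fread(Tag.tagid, 1, 3, fp);              /* bytes 0-2: "ID3"    */
fread(&Tag.tagversion, 1, 1, fp);        /* byte  3:   version  */
fread(&Tag.tagsubversion, 1, 1, fp);     /* byte  4:   revision */
fread(&Tag.flags, 1, 1, fp);             /* byte  5:   flags    */
fread(rawsize, 1, 4, fp);                /* bytes 6-9: size     */
/* rawsize holds a synchsafe integer; see the decoding answer below */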
ID3v2 header structure is consistent across all ID3v2 versions (ID3v2.0, ID3v2.3 and ID3v2.4).
Its size is stored as a big-endian synch-safe int32
Synchsafe integers are integers that keep its highest bit (bit 7) zeroed, making seven bits out of eight available. Thus a 32 bit synchsafe integer can store 28 bits of information.
Example:
255 (%11111111) encoded as a 16 bit synchsafe integer is 383 (%00000001 01111111).
Source : http://id3.org/id3v2.4.0-structure § 6.2
Below is a straightforward, real-life C# implementation that you can easily adapt to C:
public int DecodeSynchSafeInt32(byte[] bytes)
{
    return
        bytes[0] * 0x200000 + // 2^21
        bytes[1] * 0x4000 +   // 2^14
        bytes[2] * 0x80 +     // 2^7
        bytes[3];
}
=> Using values you read on your hex editor (00 05 EB 19), the actual tag size should be 112025 bytes.
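Adapted to C, the same routine might look like this (a sketch):
#include <stdint.h>

uint32_t DecodeSynchSafeInt32(const unsigned char bytes[4])
{
    return ((uint32_t)bytes[0] << 21) | /* * 0x200000, i.e. 2^21 */
           ((uint32_t)bytes[1] << 14) | /* * 0x4000,   i.e. 2^14 */
           ((uint32_t)bytes[2] << 7)  | /* * 0x80,     i.e. 2^7  */
            (uint32_t)bytes[3];
}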
By coincidence I am also working on an ID3V2 reader. The doc says that the size is encoded in four 7-bit bytes. So you need another step to convert the byte array into an integer... I don't think just reading those bytes as an int will work because of the null bit on top.

MATLAB: difference between double and logical data allocation

I need to create a large binary matrix that is over the array size limit for MATLAB.
By default, MATLAB creates numeric arrays as double precision arrays. But since my matrix is binary, I am hoping that there is a way to create an array of bits instead of doubles and consume far less memory.
I created a random binary matrix A and converted it to a logical array B:
A = randi([0 1], 1000, 1000);
B=logical(A);
I saved both as .mat files. They take up about the same space on my computer so I don't think MATLAB is using a more compact data type for logicals, which seems very wasteful. Any ideas?
Are you sure that the variables take the same amount of space? logical matrices / arrays inherently take 1 byte per number, whereas randi produces double precision, which is 8 bytes per number. A simple call to whos will show you how much memory each variable takes:
>> A = randi([0 1], 1000, 1000);
>> B = logical(A);
>> whos
  Name        Size             Bytes  Class      Attributes

  A        1000x1000         8000000  double
  B        1000x1000         1000000  logical
As you can see, A takes 8 x 1000 x 1000 = 8M bytes whereas B takes 1 x 1000 x 1000 = 1M bytes. There are most certainly memory savings between them.
The drawback with logicals is that they take 1 byte per number, and you're looking for 1 bit instead. The best thing I can think of is to use an unsigned integer type and pack chunks of N bits, where N is the bit width of the data type (uint8, uint16, uint32, etc.), into a single packed array. As such, with uint32, 32 binary digits can be packed per number, and you can save this final matrix.
Going off on a tangent - Images
In fact, this is how Java packs colour pixels when reading images in using its BufferedImage class. Each pixel in an RGB image is 24 bits, where there are 8 bits per colour channel - red, green and blue. Each pixel is represented as a proportion of red, green and blue, and the trio of 8-bit values is concatenated into a single 24-bit integer. Usually, integers are represented as 32 bits, and so you may think that 8 extra bits are being wasted. There is in fact an alpha channel that represents the transparency of each pixel, and that takes another 8 bits. If you don't use transparency, these are assumed to be all 1s, and so the collection of these 4 groups of 8 bits constitutes 32 bits per pixel. There are, however, compression algorithms that reduce the size of each pixel on average to significantly less than 32 bits per pixel, but that's outside the scope of what I'm talking about.
Going back to our discussion, one way to represent this binary matrix in bit form would be perhaps in a for loop like so:
Abin = zeros(1, ceil(numel(A)/32), 'uint32');
for ii = 1 : numel(Abin)
    val = A((ii-1)*32 + 1 : ii*32);
    dec = bin2dec(sprintf('%d', val));
    Abin(ii) = dec;
end
Bear in mind that this will only work for matrices where the total number of elements is divisible by 32. I won't go into how to handle the case where it isn't because I solely want to illustrate the point that you can do what you ask, but it requires a bit of manipulation. Your case of 1000 x 1000 = 1M is certainly divisible by 32 (you get 1M / 32 = 31250), and so this will work.
This is probably not the most optimized code, but it gets the point across. Basically, we take chunks of 32 numbers (0/1), going column-wise from left to right, and determine the 32-bit unsigned integer representation of each chunk. We then store this in a single location in the matrix Abin. What you will get in the end, given your 1000 x 1000 matrix, is 31250 32-bit unsigned integers, which corresponds to 1000 x 1000 bits, or 1M bits = 125,000 bytes.
Try looking at the size of each variable now:
>> whos
  Name        Size             Bytes  Class      Attributes

  A        1000x1000         8000000  double
  Abin     1x31250            125000  uint32
  B        1000x1000         1000000  logical
To perform a reconstruction, try:
Arec = zeros(size(A));
for ii = 1 : numel(Abin)
    val = dec2bin(Abin(ii), 32) - '0';
    Arec((ii-1)*32 + 1 : ii*32) = val(:);
end
Also not the most optimized, but it gets the point across. Given the "compressed" matrix Abin that we calculated before, for each element we reconstruct what the original 32-bit number was, then assign these numbers in 32-bit chunks to Arec.
You can verify that Arec is indeed equal to the original matrix A:
>> isequal(A, Arec)
ans =
1
Also, check out the workspace with whos:
>> whos
  Name        Size             Bytes  Class      Attributes

  A        1000x1000         8000000  double
  Abin     1x31250            125000  uint32
  Arec     1000x1000         8000000  double
  B        1000x1000         1000000  logical
You are storing your data in a compressed file format. For MAT-files in versions 7.0 and 7.3, gzip compression is used. The uncompressed data has different sizes, but both variables compress down to roughly the same size. That happens because both contain only 0s and 1s, which can be compressed efficiently.

How to transpose a matrix in ARM assembly

I'm trying to perform a matrix transposition of specifically 8 n-bit arrays, each having n bits (around 70,000), to a byte array of n elements.
Context information: The 8 n-bits arrays are RGB data for 8 channels. I need to have one byte representing the nth-bit position of the 8 arrays. This will be running on an ARM Cortex-M3 processor and needs to perform as fast as possible since I'm generating 8 simultaneous signals using the resulting array.
I've come up with a pseudo algorithm (in the link) to do this, but I'm afraid it might be too costly for the processor.
Pseudo Algorithm
I'm looking for the fastest executing code. Size is of secondary importance.
I will appreciate suggestions.
This is what I implemented but the results are not that good.
do {
    for (b = 0; b < 24; b++) { // Optimize to for(b=24;b!=0;b--)
        m = 1 << b;
        *dataBytes = *dataBytes + __ROR((*s0 & m), 32+b-0); // strip 0 data
        *dataBytes = *dataBytes + __ROR((*s1 & m), 32+b-1); // strip 1 data
        *dataBytes = *dataBytes + __ROR((*s2 & m), 32+b-2); // strip 2 data
        *dataBytes = *dataBytes + __ROR((*s3 & m), 32+b-3); // strip 3 data
        *dataBytes = *dataBytes + __ROR((*s4 & m), 32+b-4); // strip 4 data
        *dataBytes = *dataBytes + __ROR((*s5 & m), 32+b-5); // strip 5 data
        *dataBytes = *dataBytes + __ROR((*s6 & m), 32+b-6); // strip 6 data
        *dataBytes = *dataBytes + __ROR((*s7 & m), 32+b-7); // strip 7 data
        dataBytes++;
    }
    s0 += 3;
    s1 += 3;
    s2 += 3;
    s3 += 3;
    s4 += 3;
    s5 += 3;
    s6 += 3;
    s7 += 3;
} while (n--);
s0 to s7 are the 8 individual vectors from which the bits are being taken in groups of 24.
n is the number of groups, m is the mask and b is the mask position.
dataBytes is the resulting array.
There are two things that are always present when optimizing:
Memory bandwidth
CPU clocks
Bandwidth
Your current algorithm is loading a byte at a time. You may do this more efficiently by loading at least 32 bits at a time, which makes better use of the ARM bus. Then the end algorithm will certainly not be bus bound, and if it is, you have at least optimized for this.
For the different ARM CPUs, there are instructions like pld, etc. which can try to optimize bus use by pre-fetching the next data elements in advance. This may or may not apply to your Cortex-M. Another technique is to relocate the data to faster memory, such as TCM, if possible.
CPU speed
Pixel processing is almost always sped up by SIMD-type instructions. The Cortex-M has instructions labelled SIMD. Don't get hung up on the label SIMD; use the concept. If you have loaded multiple bytes into a word, then you can use a table.
const unsigned long bits[16] = {
    0,         1,         0x100,     0x101,
    0x10000,   0x10001,   0x10100,   0x10101,
    0x1000000, 0x1000001, 0x1000100, 0x1000101,
    0x1010000, 0x1010001, 0x1010100, 0x1010101
};
A similar concept is used in many CRC algorithms on the Internet. Process each nibble (4 bits) and form the next four bytes of output a bit at a time. There is probably a multiplication value which can replace the table, but this depends on the speed of your multiply, which depends on the type of Cortex-M and/or ARM.
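To make the table idea concrete, here is a sketch (mine, untested on real hardware) that transposes one byte from each of the 8 strips into 8 output bytes, which is the per-bit packing the question asks for:
#include <stdint.h>

static const uint32_t bits[16] = {
    0x00000000, 0x00000001, 0x00000100, 0x00000101,
    0x00010000, 0x00010001, 0x00010100, 0x00010101,
    0x01000000, 0x01000001, 0x01000100, 0x01000101,
    0x01010000, 0x01010001, 0x01010100, 0x01010101
};

/* src[k] is one byte from strip k; out[i] collects bit i of every strip,
   so bit k of out[i] equals bit i of src[k]. */
void transpose8(const uint8_t src[8], uint8_t out[8])
{
    uint32_t lo = 0, hi = 0;
    for (int k = 0; k < 8; k++) {
        lo |= bits[src[k] & 0x0F] << k;        /* bits 0-3 of strip k */
        hi |= bits[(src[k] >> 4) & 0x0F] << k; /* bits 4-7 of strip k */
    }
    for (int i = 0; i < 4; i++) {
        out[i]     = (uint8_t)(lo >> (8 * i));
        out[i + 4] = (uint8_t)(hi >> (8 * i));
    }
}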
Definitely prototype in C and then convert to assembler, or use inline assembler if possible. If you have many mov statements in your algorithm, it is a signal that a compiler could probably allocate the registers better than you. Many sophisticated algorithms use a code generator (scripted in Python, Perl, etc.) which may unroll whatever optimal loop you end up with and also track registers in an algorithmic way.
Note: double-check my table; it is just a first crack and I have not actually coded this particular algorithm. There may be more slick ways to process multiple bits at a time, but the idea is probably fruitful.

Reading data from GIF Headers in C

The first 13 bytes of any GIF image file are as follows:
3 bytes - the ascii characters "GIF"
3 bytes - the version - either "87a" or "89a"
2 bytes - width in pixels
2 bytes - height in pixels
1 byte - packed fields
1 byte - background color index
1 byte - pixel aspect ratio
I can get the first six bytes myself by using some sort of code like:
int G = getchar();
int I = getchar();
int F = getchar();
etc., doing the same for the 87a/89a part; all this gets the first 6 bytes, providing the ASCII characters for, say, GIF87a.
Well, I can't manage to figure out how to get the rest of the information. I tried going along with the same getchar() method, but it's not what I would expect it to be. Say I have a 350x350 GIF file. Since the width and height are 2 bytes each, I use getchar twice, and I end up with the width being "94" and "1" - two numbers, as there are two bytes. But how would I use this information to get the actual width and height, in base 10? I tried bitwise-ANDing 94 and 1, but then realized it returns 0.
I figure maybe if I can find out how to get the width and height I'll be able to access the rest of the information on my own.
Pixel width and height are stored in little-endian format.
It's just like any other number broken into parts with a limited range. For example, look at 43. Each digit has a limited range, from 0 to 9. So the next digit counts the number of 10s, then hundreds (10*10) and so on. In this case, each byte can range from 0 to 255, so the next byte counts the number of 256s.
256 * 1 + 94 = 350
The standard should specify the byte order, that is, whether the most significant byte comes first (called big-endian) or the least significant byte comes first (called little-endian). In this case:
Byte Order: Little-endian
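So, continuing with the getchar() approach from the question, the two bytes combine like this:
int lo = getchar();         /* low byte comes first (little-endian) */
int hi = getchar();
int width = lo | (hi << 8); /* 94 | (1 << 8) = 94 + 256 = 350       */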
Typically, for reading a compressed bitstream or image data, we open the file in binary read mode, read the data, and interpret it through a getBits functionality. For example, please consider the sample below:
fptr = fopen("myfile.gif", "rb");
// Read a word
fread(&cache, sizeof(unsigned char), 4, fptr);
//Read your width through getBits
width = getBits(cache, position, number_of_bits);
Please refer to K & R Question: Need help understanding "getbits()" method in Chapter 2 for more details on getBits.
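For reference, the getbits() discussed there (from K&R Chapter 2) returns the right-adjusted n-bit field of x that begins at bit position p, with bits numbered from 0 at the right:
unsigned getbits(unsigned x, int p, int n)
{
    return (x >> (p + 1 - n)) & ~(~0u << n); /* e.g. getbits(x,4,3) -> bits 4,3,2 */
}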
