C code runs slower when SIMD instructions are used? - c

I am new to SIMD and am writing a program that converts an image from ARGB to grayscale. The main loop is as follows:
void* ptr;
int* pBitmap;
posix_memalign(&ptr, 16, height * width * sizeof(int));
pBitmap = (int*)ptr;
for(row = 0; row < height; row++){
for(col = 0; col < width; col++){
int pixel = pBitmap[col + row * width];
int alpha = (pixel >> 24) & 0xff;
int red = (pixel >> 16) & 0xff;
int green = (pixel >> 8) & 0xff;
int blue = pixel & 0xff;
int bw = (int)(red * 0.299 + green * 0.587 + blue * 0.114);
pBitmap[col + row * width] = (alpha << 24) + (bw << 16) + (bw << 8) + (bw);
}
}
And this is my modified SIMD program, which is much slower than the original one.
__m128i bw;
__m128i* rec;
__m128d blue, green, red, alpha;
for(int i = 0; i < width * height; i += 2){
rec = (__m128i*)(pBitmap + i);
alpha = _mm_cvtepi32_pd(_mm_srli_epi32(*rec, 24));
red = _mm_cvtepi32_pd(_mm_and_si128(_mm_srli_epi32(*rec, 16), _mm_set1_epi32(0xff)));
green = _mm_cvtepi32_pd(_mm_and_si128(_mm_srli_epi32(*rec, 8), _mm_set1_epi32(0xff)));
blue = _mm_cvtepi32_pd(_mm_and_si128(*rec, _mm_set1_epi32(0xff)));
bw = _mm_add_epi32(_mm_cvtpd_epi32(_mm_mul_pd(red, _mm_set_pd1(0.299))), _mm_cvtpd_epi32(_mm_mul_pd(green, _mm_set_pd1(0.587))));
bw = _mm_add_epi32(bw, _mm_cvtpd_epi32(_mm_mul_pd(blue, _mm_set_pd1(0.114))));
*rec = _mm_add_epi32(_mm_add_epi32(_mm_slli_epi32(_mm_cvtpd_epi32(alpha), 24), _mm_slli_epi32(bw, 16)), _mm_add_epi32(_mm_slli_epi32(bw, 8), bw));
}
Is the slowdown caused by the extra type conversions? I don't know where else I can optimize. Please help, thank you.

There are a few issues with your implementation.
SIMD works best when it processes multiple pixels at a time in parallel. Search the Internet for "Arrays of Structures vs. Structures of Arrays" for some examples.
Why use doubles instead of single precision? That halves your throughput.
Most compilers do not have a way to automatically create data constants from SIMD vectors. All those calls to _mm_set_* intrinsics do a lot of work at runtime that you should really do at compile time.
Replace all uses of the _mm_set_* intrinsics with something like:
union simdConstant
{
float f[4];
__m128 v;
};
static const union simdConstant c_luminance = { { 0.299f, 0.587f, 0.114f, 1.f } };
static const union simdConstant c_luminanceRed = { { 0.299f, 0.299f, 0.299f, 0.299f } };
Then use c_luminance.v or c_luminanceRed.v instead of _mm_set_ps or _mm_set_ps1. (In C the union keyword is required in these declarations; in C++ it can be omitted.)
See also DirectXMath, which provides numerous examples of SIMD implementations.
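To make the first two points concrete, here is a minimal sketch that converts four pixels per iteration in single precision using SSE2. It is not the answerer's code: the names argb_to_gray and gray_pixel are made up for illustration, the vector constants are simply built once before the loop, and unaligned loads are used so that no alignment is assumed. On targets without SSE2 the sketch falls back to the scalar path.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Scalar version (hypothetical helper); also handles the tail and non-SSE2 targets. */
static uint32_t gray_pixel(uint32_t p)
{
    float r = (float)((p >> 16) & 0xff);
    float g = (float)((p >> 8) & 0xff);
    float b = (float)(p & 0xff);
    uint32_t bw = (uint32_t)(r * 0.299f + g * 0.587f + b * 0.114f);
    return (p & 0xff000000u) | (bw << 16) | (bw << 8) | bw;
}

/* Converts count ARGB pixels to grayscale in place, four per iteration. */
void argb_to_gray(uint32_t *px, size_t count)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* Constants are built once, outside the loop. */
    const __m128i mask = _mm_set1_epi32(0xff);
    const __m128 wr = _mm_set1_ps(0.299f);
    const __m128 wg = _mm_set1_ps(0.587f);
    const __m128 wb = _mm_set1_ps(0.114f);
    for (; i + 4 <= count; i += 4) {
        __m128i p = _mm_loadu_si128((const __m128i *)(px + i));
        /* Extract each channel of all four pixels and convert to float. */
        __m128 r = _mm_cvtepi32_ps(_mm_and_si128(_mm_srli_epi32(p, 16), mask));
        __m128 g = _mm_cvtepi32_ps(_mm_and_si128(_mm_srli_epi32(p, 8), mask));
        __m128 b = _mm_cvtepi32_ps(_mm_and_si128(p, mask));
        __m128 y = _mm_add_ps(_mm_add_ps(_mm_mul_ps(r, wr), _mm_mul_ps(g, wg)),
                              _mm_mul_ps(b, wb));
        __m128i bw = _mm_cvttps_epi32(y);  /* truncate, like the (int) cast */
        __m128i a = _mm_slli_epi32(_mm_srli_epi32(p, 24), 24);  /* keep alpha */
        __m128i out = _mm_or_si128(
            _mm_or_si128(a, _mm_slli_epi32(bw, 16)),
            _mm_or_si128(_mm_slli_epi32(bw, 8), bw));
        _mm_storeu_si128((__m128i *)(px + i), out);
    }
#endif
    for (; i < count; i++)
        px[i] = gray_pixel(px[i]);
}
```

With the buffer from the question this would be called as argb_to_gray((uint32_t *)pBitmap, (size_t)width * height). Whether it actually beats the scalar loop depends on the compiler and flags, so it is worth measuring.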

Related

Color gradient in C

I'm taking my first steps in C and am trying to write a gradient color function that draws a series of rectangles to the screen (vertically).
This is the code so far:
void draw_gradient(uint32_t start_color, uint32_t end_color) {
int steps = 8;
int draw_height = window_height / 8;
//Change this value inside the loop to write different color
uint32_t loop_color = start_color;
for (int i = 0; i < steps; i++) {
draw_rect(0, i * draw_height, window_width, draw_height, loop_color);
}
}
Ignoring the end_color for now, I want to pass in a simple red color like 0xFFFF0000 (ARGB), then take the red 'FF' and convert it to an integer or decrease it using the loop_color variable.
I'm not sure how to get the red value from the hex code, manipulate it as a number, and then write it back to hex. Any ideas?
So in 8 steps the code should, for example, go in hex from FF to 00, or as an integer from 255 to 0.
As you have said, your color is in ARGB format. This calculation assumes a vertical gradient, meaning from the top to the bottom, drawn one horizontal line at a time.
The steps are:
Get the number of lines to draw; this is your rectangle height
Get the A, R, G, B color components from your start and end colors
uint8_t start_a = start_color >> 24;
uint8_t start_r = start_color >> 16;
uint8_t start_g = start_color >> 8;
uint8_t start_b = start_color >> 0;
uint8_t end_a = end_color >> 24;
uint8_t end_r = end_color >> 16;
uint8_t end_g = end_color >> 8;
uint8_t end_b = end_color >> 0;
Calculate the step for each of the components
float step_a = (float)(end_a - start_a) / (float)height;
float step_r = (float)(end_r - start_r) / (float)height;
float step_g = (float)(end_g - start_g) / (float)height;
float step_b = (float)(end_b - start_b) / (float)height;
Run a for loop and apply the step for each component
for (int i = 0; i < height; ++i) {
uint32_t color = 0 |
((uint32_t)(start_a + i * step_a) & 0xFF) << 24 |
((uint32_t)(start_r + i * step_r) & 0xFF) << 16 |
((uint32_t)(start_g + i * step_g) & 0xFF) << 8 |
((uint32_t)(start_b + i * step_b) & 0xFF) << 0;
draw_horizontal_line(i, color);
}
It is better to use float for step_x and multiply/add on each iteration. Otherwise, with integer rounding, the value may never change because the step always gets rounded down to zero.
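The steps above can be collected into one small function. This is only a sketch under the same assumptions as the answer (ARGB packing, step computed by dividing by the number of lines); gradient_color and lerp_channel are hypothetical names, and drawing is left to the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Interpolates one 8-bit channel located at bit position `shift`. */
static uint32_t lerp_channel(uint32_t start, uint32_t end, int shift,
                             int i, int steps)
{
    int s = (int)((start >> shift) & 0xFF);
    int e = (int)((end >> shift) & 0xFF);
    float step = (float)(e - s) / (float)steps;   /* may be negative */
    return ((uint32_t)((float)s + (float)i * step) & 0xFF) << shift;
}

/* Color of line i (0 <= i <= steps) in a gradient from start to end (ARGB). */
uint32_t gradient_color(uint32_t start, uint32_t end, int i, int steps)
{
    assert(steps > 0 && i >= 0 && i <= steps);
    return lerp_channel(start, end, 24, i, steps) |
           lerp_channel(start, end, 16, i, steps) |
           lerp_channel(start, end, 8, i, steps) |
           lerp_channel(start, end, 0, i, steps);
}
```

For the 8-step red-to-black case from the question, gradient_color(0xFFFF0000, 0xFF000000, 4, 8) gives 0xFF7F0000. Because the step is divided by the number of lines, line steps - 1 stops one step short of the end color; divide by steps - 1 instead if the last line should match end_color exactly.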

CHIP-8 SDL rendering problems

I have coded a CHIP-8 emulator. Whatever I do, I cannot get any pixels to show on the screen. The weird thing is that I have been checking the code top to bottom for 2 days already, and there does not seem to be any problem. It reads the .rom file into memory and fetches the opcodes correctly.
Here is the source code:
SDL_SetRenderDrawColor( renderer, 0, 0, 0, SDL_ALPHA_OPAQUE );
SDL_RenderClear(renderer);
uint32_t pixels[(WINDOW_WIDTH / 10) * (WINDOW_HEIGHT / 10)];
uint16_t i;
for(i = 0; i < 64*32; i++){
pixels[i] = (0x00FFFFFF * display[i]) | 0xFF000000;
}
//upload the pixels to the texture
SDL_UpdateTexture(tex,NULL,pixels, 64 * sizeof(uint32_t));
//Now get the texture to the screen
SDL_RenderCopy(renderer,tex,NULL,NULL);
SDL_RenderPresent(renderer); // Update screen
ch8.drawF = false;
uint16_t x = ch8->V[((ch8->opcode & 0x0F00) >> 8)];
uint16_t y = ch8->V[((ch8->opcode & 0x00F0) >> 4)];
uint8_t n = (ch8->opcode & 0x000F);
for(i = 0; i < n; i++) {
uint8_t pixel= memory[ch8->I.word + i];
for(j = 0; j < 8; j++) {
if((pixel & (0x80 >> j)) != 0){
if(display[x + j + ((y + i) * 64)] == 1) {
ch8->V[0xF] = 1;
}
display[x + j + ((y + i) * 64)] ^= 1;
}
}
}
So basically, the problem was in the init() function. I was initially using SDL_CreateWindow and SDL_CreateRenderer, but now I'm using SDL_CreateWindowAndRenderer, which takes pointers to pointers to SDL_Window and SDL_Renderer instead of a pointer to a char and a pointer to a window.
There were also 3 other problems I fixed.
1. I was adding 0x200 to the NNN part of opcodes, because at first I thought that the NNN addresses in ROMs were relative to 0, so I removed the + 0x200 from each XNNN opcode. I also forgot a * at SDL_Texture* tex; it is supposed to be SDL_Texture** tex. I was merely changing the address the local pointer was pointing to.
2. At opcode 2NNN, instead of (ch8->SP) = ch8->opcode & 0x0FFF; it is (ch8->SP) = ch8->PC.word;
3. At opcode FX65, the loop condition is i <= ((ch8->opcode & 0x0F00) >> 8)
Basically, the differences between SDL_CreateWindowAndRenderer and SDL_CreateWindow plus SDL_CreateRenderer had me confused; I should have checked the documentation first.
Now I only need to make the emulator redraw only the changed pixels, and then make it play sound.
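For readers hitting the same opcode mistakes, the three fixes can be sketched as follows. This is an illustration, not the poster's code: the chip8_t layout and the function names are hypothetical, and the sketch assumes PC was already advanced past the opcode when it was fetched.

```c
#include <stdint.h>

/* Hypothetical, simplified machine state. */
typedef struct {
    uint8_t  memory[4096];
    uint8_t  V[16];
    uint16_t I;
    uint16_t PC;
    uint16_t stack[16];
    uint8_t  SP;
} chip8_t;

/* 2NNN: save the return address on the stack, then jump to NNN.
   NNN is an absolute address, so no 0x200 offset is added. */
void op_call(chip8_t *c, uint16_t opcode)
{
    c->stack[c->SP & 0x0F] = c->PC;   /* the stack slot receives PC, not NNN */
    c->SP++;
    c->PC = opcode & 0x0FFF;
}

/* FX65: load V0..VX (inclusive, hence <=) from memory starting at I. */
void op_load_regs(chip8_t *c, uint16_t opcode)
{
    uint8_t x = (uint8_t)((opcode & 0x0F00) >> 8);
    for (uint8_t i = 0; i <= x; i++)
        c->V[i] = c->memory[(c->I + i) & 0x0FFF];
}
```

Some interpreters also change I after FX65; this sketch leaves it untouched.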

Optimization using NEON intrinsics

I'm a beginner with NEON intrinsics. I am trying to optimize the algorithm below:
uint32_t blue = 0, red = 0 , green = 0, alpha = 0, factor = 0 , shift = 0;
// some initial calculation to compute factor, shift and the initial R, G, B values; all are expected to be initialized as 16-bit unsigned
// psrc is a 32 bpp flat pixel array and count is the total pixel count
for( int i = 0; i < count; i++ )
{
blue += *psrc++;
green += *psrc++;
red += *psrc++;
alpha += *psrc++;
*pDest++ = static_cast< uint8_t >( ( blue * factor ) >> shift );
*pDest++ = static_cast< uint8_t >( ( green * factor ) >> shift );
*pDest++ = static_cast< uint8_t >( ( red * factor ) >> shift );
*pDest++ = static_cast< uint8_t >( ( alpha * factor ) >> shift );
}
I am not sure how to do this since I need the result in 32-bit containers and I have source data as 8-bit ( R G B A ), and there is no instruction which can add 8-bits with 32-bits.
Can anyone help me out with this?
I was able to convert them to 32 bits as suggested by Paul's link and do the necessary arithmetic. Now I have:
uint32x4_t result1 = vshlq_u32(mult1281, shift);
uint32x4_t result2 = vshlq_u32(mult1282, shift);
uint32x4_t result3 = vshlq_u32(mult1283, shift);
uint32x4_t result4 = vshlq_u32(mult1284, shift);
result1/2/3/4 now contain 32-bit (per channel) RGB channels. How can I now combine them to get 8-bit (per channel) RGB channels and write them back to the destination?
I still haven't fully understood the purpose of the algorithm, but of course you can optimize it using NEON:
uint32_t blue = 0, red = 0, green = 0, alpha = 0, factor = 0, shift = 0;
// your initializations go here.
uint32x4_t bgra = { blue, green, red, alpha };
// vshrq_n_u32 needs a compile-time constant, so for a run-time shift
// use vshlq_u32 with a negative count, which shifts right.
const uint32x4_t vfactor = vdupq_n_u32(factor);
const int32x4_t vshift = vdupq_n_s32(-(int32_t)shift);
// two pixels are processed per iteration, so count is assumed to be even
for (int i = 0; i < count; i += 2)
{
//load 8 8-bit values and unpack to 16-bit
uint16x8_t src = vmovl_u8(vld1_u8(psrc + i * 4));
//accumulate low 4 values
bgra = vaddw_u16(bgra, vget_low_u16(src));
//get low 4 values of dst
uint32x4_t lo = vshlq_u32(vmulq_u32(bgra, vfactor), vshift);
//accumulate high 4 values
bgra = vaddw_u16(bgra, vget_high_u16(src));
//get high 4 values of dst
uint32x4_t hi = vshlq_u32(vmulq_u32(bgra, vfactor), vshift);
//pack 8 32-bit values to 8 8-bit.
uint8x8_t dst = vmovn_u16(vcombine_u16(vmovn_u32(lo), vmovn_u32(hi)));
//store result
vst1_u8(pDest + i * 4, dst);
}

Fastest sort algorithm for millions of UINT64 RGBZ graphics pixels

I am sorting 10+ million uint64_ts with RGB data from .RAW files and 79% of my C program time is spent in qsort. I am looking for a faster sort for this specific data type.
Being RAW graphical data, the numbers are very random and ~80% unique. No partial sorting or runs of sorted data can be expected. The 4 uint16_ts inside the uint64_t are R, G, B and zero (possibly a small count <= ~20).
I have the simplest comparison function I can think of using unsigned long longs (you CANNOT just subtract them):
qsort(hpidx, num_pix, sizeof(uint64_t), comp_uint64);
...
int comp_uint64(const void *a, const void *b) {
if(*((uint64_t *)a) > *((uint64_t *)b)) return(+1);
if(*((uint64_t *)a) < *((uint64_t *)b)) return(-1);
return(0);
} // End Comp_uint64().
There was a very interesting "Programming Puzzles & Code Golf" on StackExchange, but they used floats. Then there are QSort, RecQuick, heap, stooge, tree, radix...
The swenson/sort looked interesting but had no (obvious) support for my datatype, uint64_t. And the "quick sort" time was the best. Some sources say the system qsort can be anything, not necessarily "Quick Sort".
A C++ sort bypasses the generic casting of void pointers and realizes great improvements in performance over C. There has to be an optimized method to slam U8s through a 64bit processor at warp speed.
System/compiler info:
I am currently using the GCC with Strawberry Perl
gcc version 4.9.2 (x86_64-posix-sjlj, built by strawberryperl.com
Intel 2700K Sandy Bridge CPU, 32GB DDR3
windows 7/64 pro
gcc -D__USE_MINGW_ANSI_STDIO -O4 -ffast-math -m64 -Ofast -march=corei7-avx -mtune=corei7 -Ic:/bin/xxHash-master -Lc:/bin/xxHash-master c:/bin/stddev.c -o c:/bin/stddev.g6.exe
First attempt at a better qsort, QSORT()!
Tried to use Michael Tokarev's inline qsort.
"READY-TO-USE"? From qsort.h documentation
-----------------------------
* Several ready-to-use examples:
*
* Sorting array of integers:
* void int_qsort(int *arr, unsigned n) {
* #define int_lt(a,b) ((*a)<(*b))
* QSORT(int, arr, n, int_lt);
--------------------------------
Change from type "int" to "uint64_t"
compile error on TYPE???
c:/bin/bpbfct.c:586:8: error: expected expression before 'uint64_t'
QSORT(uint64_t, hpidx, num_pix, islt);
I can't find a real, compiling, working example program, just comments with the "general concept"
#define QSORT_TYPE uint64_t
#define islt(a,b) ((*a)<(*b))
uint64_t *QSORT_BASE;
int QSORT_NELT;
hpidx=(uint64_t *) calloc(num_pix+2, sizeof(uint64_t)); // Hash . PIDX
QSORT_BASE = hpidx;
QSORT_NELT = num_pix; // QSORT_LT is function QSORT_LT()
QSORT(uint64_t, hpidx, num_pix, islt);
//QSORT(uint64_t *, hpidx, num_pix, QSORT_LT); // QSORT_LT mal-defined?
//qsort(hpidx, num_pix, sizeof(uint64_t), comp_uint64); // << WORKS
The "ready-to-use" examples use types of int, char * and struct elt. Isn't uint64_t a type?? Try long long
QSORT(long long, hpidx, num_pix, islt);
c:/bin/bpbfct.c:586:8: error: expected expression before 'long'
QSORT(long long, hpidx, num_pix, islt);
Next attempt: RADIXSORT:
Results: RADIX_SORT is RADICAL!
I:\br3\pf.249465>grep "Event" bb12.log | grep -i Sort
<< 1.40 sec average
4) Time=1.411 sec = 49.61%, Event RADIX_SORT , hits=1
4) Time=1.396 sec = 49.13%, Event RADIX_SORT , hits=1
4) Time=1.392 sec = 49.15%, Event RADIX_SORT , hits=1
16) Time=1.414 sec = 49.12%, Event RADIX_SORT , hits=1
I:\br3\pf.249465>grep "Event" bb11.log | grep -i Sort
<< 5.525 sec average = 3.95 times slower
4) Time=5.538 sec = 86.34%, Event QSort , hits=1
4) Time=5.519 sec = 79.41%, Event QSort , hits=1
4) Time=5.519 sec = 79.02%, Event QSort , hits=1
4) Time=5.563 sec = 79.49%, Event QSort , hits=1
4) Time=5.684 sec = 79.83%, Event QSort , hits=1
4) Time=5.509 sec = 79.30%, Event QSort , hits=1
3.94 times faster than whatever sort qsort out of the box uses!
And, even more importantly, there was actual, working code, not just 80% of what you need given by some Guru who assumes you know everything they know and can fill in the other 20%.
Fantastic solution! Thanks Louis Ricci!
I would use Radix Sort with an 8bit radix. For 64bit values a well optimized radix sort will have to iterate over the list 9 times (one to precalculate the counts and offsets and 8 for 64bits/8bits). 9*N time and 2*N space (using a shadow array).
Here's what an optimized radix sort would look like.
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
typedef union {
struct {
uint32_t c8[256];
uint32_t c7[256];
uint32_t c6[256];
uint32_t c5[256];
uint32_t c4[256];
uint32_t c3[256];
uint32_t c2[256];
uint32_t c1[256];
};
uint32_t counts[256 * 8];
} rscounts_t;
uint64_t * radixSort(uint64_t * array, uint32_t size) {
rscounts_t counts;
memset(&counts, 0, 256 * 8 * sizeof(uint32_t));
uint64_t * cpy = (uint64_t *)malloc(size * sizeof(uint64_t));
uint32_t o8=0, o7=0, o6=0, o5=0, o4=0, o3=0, o2=0, o1=0;
uint32_t t8, t7, t6, t5, t4, t3, t2, t1;
uint32_t x;
// calculate counts
for(x = 0; x < size; x++) {
t8 = array[x] & 0xff;
t7 = (array[x] >> 8) & 0xff;
t6 = (array[x] >> 16) & 0xff;
t5 = (array[x] >> 24) & 0xff;
t4 = (array[x] >> 32) & 0xff;
t3 = (array[x] >> 40) & 0xff;
t2 = (array[x] >> 48) & 0xff;
t1 = (array[x] >> 56) & 0xff;
counts.c8[t8]++;
counts.c7[t7]++;
counts.c6[t6]++;
counts.c5[t5]++;
counts.c4[t4]++;
counts.c3[t3]++;
counts.c2[t2]++;
counts.c1[t1]++;
}
// convert counts to offsets
for(x = 0; x < 256; x++) {
t8 = o8 + counts.c8[x];
t7 = o7 + counts.c7[x];
t6 = o6 + counts.c6[x];
t5 = o5 + counts.c5[x];
t4 = o4 + counts.c4[x];
t3 = o3 + counts.c3[x];
t2 = o2 + counts.c2[x];
t1 = o1 + counts.c1[x];
counts.c8[x] = o8;
counts.c7[x] = o7;
counts.c6[x] = o6;
counts.c5[x] = o5;
counts.c4[x] = o4;
counts.c3[x] = o3;
counts.c2[x] = o2;
counts.c1[x] = o1;
o8 = t8;
o7 = t7;
o6 = t6;
o5 = t5;
o4 = t4;
o3 = t3;
o2 = t2;
o1 = t1;
}
// radix
for(x = 0; x < size; x++) {
t8 = array[x] & 0xff;
cpy[counts.c8[t8]] = array[x];
counts.c8[t8]++;
}
for(x = 0; x < size; x++) {
t7 = (cpy[x] >> 8) & 0xff;
array[counts.c7[t7]] = cpy[x];
counts.c7[t7]++;
}
for(x = 0; x < size; x++) {
t6 = (array[x] >> 16) & 0xff;
cpy[counts.c6[t6]] = array[x];
counts.c6[t6]++;
}
for(x = 0; x < size; x++) {
t5 = (cpy[x] >> 24) & 0xff;
array[counts.c5[t5]] = cpy[x];
counts.c5[t5]++;
}
for(x = 0; x < size; x++) {
t4 = (array[x] >> 32) & 0xff;
cpy[counts.c4[t4]] = array[x];
counts.c4[t4]++;
}
for(x = 0; x < size; x++) {
t3 = (cpy[x] >> 40) & 0xff;
array[counts.c3[t3]] = cpy[x];
counts.c3[t3]++;
}
for(x = 0; x < size; x++) {
t2 = (array[x] >> 48) & 0xff;
cpy[counts.c2[t2]] = array[x];
counts.c2[t2]++;
}
for(x = 0; x < size; x++) {
t1 = (cpy[x] >> 56) & 0xff;
array[counts.c1[t1]] = cpy[x];
counts.c1[t1]++;
}
free(cpy);
return array;
}
EDIT: this implementation was based on a JavaScript version; see "Fastest way to sort 32bit signed integer arrays in JavaScript?"
Here's the IDEONE for the C radix sort http://ideone.com/JHI0d9
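For comparison, the same least-significant-byte-first idea can be written more compactly with a loop over the eight byte positions. This is a sketch, not the tuned version above: radix_sort_u64 is a hypothetical name, and it recounts on every pass, so it reads the data 16 times instead of 9 in exchange for much shorter code.

```c
#include <stdint.h>
#include <stdlib.h>

/* LSD radix sort with an 8-bit radix and a shadow array.
   Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int radix_sort_u64(uint64_t *a, size_t n)
{
    uint64_t *tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL && n != 0)
        return -1;

    uint64_t *src = a, *dst = tmp;
    for (int shift = 0; shift < 64; shift += 8) {
        size_t count[256] = {0};

        /* Histogram of the current byte. */
        for (size_t i = 0; i < n; i++)
            count[(src[i] >> shift) & 0xff]++;

        /* Convert counts to starting offsets (exclusive prefix sum). */
        size_t offset = 0;
        for (int b = 0; b < 256; b++) {
            size_t c = count[b];
            count[b] = offset;
            offset += c;
        }

        /* Stable scatter into the other buffer. */
        for (size_t i = 0; i < n; i++)
            dst[count[(src[i] >> shift) & 0xff]++] = src[i];

        /* Swap roles; after 8 passes the sorted data is back in a. */
        uint64_t *t = src;
        src = dst;
        dst = t;
    }
    free(tmp);
    return 0;
}
```

Each pass is stable, which is what lets sorting byte by byte from the low end produce a fully sorted array after the final pass.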
I see a few options, roughly in order of easiest to hardest.
Enable link-time optimization with the -flto switch. This may get the compiler to inline your comparison function. It's too easy not to try.
If LTO has no effect, you can use an inline qsort implementation like Michael Tokarev's inline qsort. This page suggests a 2x improvement, again solely due to the compiler's ability to inline the comparison function.
Use the C++ std::sort. I know your code is in C, but you can make a small module that only sorts and provides a C interface. You're already using a toolchain that has great C++ support.
Try swenson/sort's library. It implements many algorithms so you can use the one that works best on your data. It appears to be inlineable, and they claim to be faster than qsort.
Find another sorting library. Something that can do Louis' Radix Sort is a good suggestion.
Note you can also do your comparison with a single branch instead of two. Just find out which is bigger, then subtract.
With some compilers/platforms the following is branch-less and faster, though not much different than OP's original.
int comp_uint64_b(const void *a, const void *b) {
return
(*((uint64_t *)a) > *((uint64_t *)b)) -
(*((uint64_t *)a) < *((uint64_t *)b));
}
Maybe some ?: instead of ifs would make things a tad quicker.
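As a small usage sketch of the comparison-based route, the comparator below reads both values once and returns the difference of the two comparisons. The names cmp_u64 and sort_u64 are hypothetical. The comment explains why plain subtraction is not an option: the 64-bit difference has to be squeezed into an int return value, and that conversion can lose the sign or even yield zero for unequal inputs.

```c
#include <stdint.h>
#include <stdlib.h>

/* Returns -1, 0 or +1. Subtracting the values instead would not work:
   the uint64_t difference does not fit in the int that qsort expects,
   so unequal values whose difference is a multiple of 2^32 could
   compare as equal on typical platforms. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sorts n values in ascending order with the C library qsort. */
void sort_u64(uint64_t *v, size_t n)
{
    qsort(v, n, sizeof *v, cmp_u64);
}
```

This keeps the generic qsort interface, so the call through a function pointer remains; the inlining-based options listed above are what remove that cost.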

Writing images with an Arduino

I have an SD card, an SD card shield, and an Arduino Uno R3. I need to write an image onto the SD card. I would much prefer going from a raw array to JPEG/PNG/BMP/etc. rather than using formats that are easy to write but not widely openable (PPM, PGM, etc.).
Is the image writing function included in the Arduino standard libraries? If not, what library should I use? I've looked at lodePNG, but ran into weird errors (vector is not a member of std).
I take zero credit for this code as I pulled it from a thread on the Arduino forums (http://forum.arduino.cc/index.php?topic=112733.0). It writes a .bmp file to an SD card.
Another discussion indicated that because of the compression algorithms associated with JPG and PNG files, the amount of code to make those work would be more difficult to fit on an Arduino, which makes sense in my head (http://forum.arduino.cc/index.php?topic=76376.0).
Hope this helps. Definitely not an expert with Arduino - just tinkered a bit.
#include <SdFat.h>
#include <SdFatUtil.h>
/*
WRITE BMP TO SD CARD
Jeff Thompson
Summer 2012
TO USE MEGA:
The SdFat library must be edited slightly to use a Mega - in line 87
of SdFatConfig.h, change to:
#define MEGA_SOFT_SPI 1
(this uses pins 10-13 for writing to the card)
Writes pixel data to an SD card, saved as a BMP file. Lots of code
via the following...
BMP header and pixel format:
http://stackoverflow.com/a/2654860
SD save:
http://arduino.cc/forum/index.php?topic=112733 (lots of thanks!)
... and the SdFat example files too
www.jeffreythompson.org
*/
char name[] = "9px_0000.bmp"; // filename convention (will auto-increment)
const int w = 16; // image width in pixels
const int h = 9; // " height
const boolean debugPrint = true; // print details of process over serial?
const int imgSize = w*h;
int px[w*h]; // actual pixel data (grayscale - added programatically below)
SdFat sd;
SdFile file;
const uint8_t cardPin = 8; // pin that the SD is connected to (d8 for SparkFun MicroSD shield)
void setup() {
// iteratively create pixel data
int increment = 256/(w*h); // divide color range (0-255) by total # of px
for (int i=0; i<imgSize; i++) {
px[i] = i * increment; // creates a gradient across pixels for testing
}
// SD setup
Serial.begin(9600);
if (!sd.init(SPI_FULL_SPEED, cardPin)) {
sd.initErrorHalt();
Serial.println("---");
}
// if name exists, create new filename
for (int i=0; i<10000; i++) {
name[4] = (i/1000)%10 + '0'; // thousands place
name[5] = (i/100)%10 + '0'; // hundreds
name[6] = (i/10)%10 + '0'; // tens
name[7] = i%10 + '0'; // ones
if (file.open(name, O_CREAT | O_EXCL | O_WRITE)) {
break;
}
}
// set fileSize (used in bmp header)
int rowSize = 4 * ((3*w + 3)/4); // how many bytes in the row (used to create padding)
int fileSize = 54 + h*rowSize; // headers (54 bytes) + pixel data
// create image data; heavily modified version via:
// http://stackoverflow.com/a/2654860
unsigned char *img = NULL; // image data
if (img) { // if there's already data in the array, clear it
free(img);
}
img = (unsigned char *)malloc(3*imgSize);
for (int y=0; y<h; y++) {
for (int x=0; x<w; x++) {
int colorVal = px[y*w + x]; // classic formula for px listed in line
img[(y*w + x)*3+0] = (unsigned char)(colorVal); // R
img[(y*w + x)*3+1] = (unsigned char)(colorVal); // G
img[(y*w + x)*3+2] = (unsigned char)(colorVal); // B
// padding (the 4th byte) will be added later as needed...
}
}
// print px and img data for debugging
if (debugPrint) {
Serial.print("\nWriting \"");
Serial.print(name);
Serial.print("\" to file...\n");
for (int i=0; i<imgSize; i++) {
Serial.print(px[i]);
Serial.print(" ");
}
}
// create padding (based on the number of pixels in a row
unsigned char bmpPad[rowSize - 3*w];
for (int i=0; i<sizeof(bmpPad); i++) { // fill with 0s
bmpPad[i] = 0;
}
// create file headers (also taken from StackOverflow example)
unsigned char bmpFileHeader[14] = { // file header (always starts with BM!)
'B','M', 0,0,0,0, 0,0, 0,0, 54,0,0,0 };
unsigned char bmpInfoHeader[40] = { // info about the file (size, etc)
40,0,0,0, 0,0,0,0, 0,0,0,0, 1,0, 24,0 };
bmpFileHeader[ 2] = (unsigned char)(fileSize );
bmpFileHeader[ 3] = (unsigned char)(fileSize >> 8);
bmpFileHeader[ 4] = (unsigned char)(fileSize >> 16);
bmpFileHeader[ 5] = (unsigned char)(fileSize >> 24);
bmpInfoHeader[ 4] = (unsigned char)( w );
bmpInfoHeader[ 5] = (unsigned char)( w >> 8);
bmpInfoHeader[ 6] = (unsigned char)( w >> 16);
bmpInfoHeader[ 7] = (unsigned char)( w >> 24);
bmpInfoHeader[ 8] = (unsigned char)( h );
bmpInfoHeader[ 9] = (unsigned char)( h >> 8);
bmpInfoHeader[10] = (unsigned char)( h >> 16);
bmpInfoHeader[11] = (unsigned char)( h >> 24);
// write the file (thanks forum!)
file.write(bmpFileHeader, sizeof(bmpFileHeader)); // write file header
file.write(bmpInfoHeader, sizeof(bmpInfoHeader)); // " info header
for (int i=0; i<h; i++) { // iterate image array
file.write(img+(w*(h-i-1)*3), 3*w); // write px data
file.write(bmpPad, (4-(w*3)%4)%4); // and padding as needed
}
file.close(); // close file when done writing
if (debugPrint) {
Serial.print("\n\n---\n");
}
}
void loop() { }
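The header arithmetic in the sketch above can be isolated into two small plain-C helpers, which makes it easier to check on a desktop before moving to the board. This is a sketch with hypothetical names (bmp_row_size, bmp_header_24, put_le32), not part of the forum code. It uses uint32_t throughout so the shifts by 16 and 24 stay well defined even where int is 16 bits wide, as on the Uno.

```c
#include <stdint.h>
#include <string.h>

/* Bytes per row of a 24-bit BMP: 3 bytes per pixel, padded to a multiple of 4. */
uint32_t bmp_row_size(uint32_t w)
{
    return 4u * ((3u * w + 3u) / 4u);
}

/* Stores v in little-endian byte order, as the BMP format requires. */
static void put_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v);
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* Fills the 54-byte header: 14-byte file header + 40-byte info header. */
void bmp_header_24(uint8_t hdr[54], uint32_t w, uint32_t h)
{
    memset(hdr, 0, 54);
    hdr[0] = 'B';
    hdr[1] = 'M';
    put_le32(hdr + 2, 54u + h * bmp_row_size(w));  /* total file size */
    put_le32(hdr + 10, 54u);                       /* offset of pixel data */
    put_le32(hdr + 14, 40u);                       /* info header size */
    put_le32(hdr + 18, w);                         /* width in pixels */
    put_le32(hdr + 22, h);                         /* height in pixels */
    hdr[26] = 1;                                   /* color planes */
    hdr[28] = 24;                                  /* bits per pixel */
}
```

For the 16 x 9 image in the sketch, bmp_row_size(16) is 48 and the file size field becomes 486. Note that 24-bit BMP stores each pixel as blue, green, red and stores rows bottom-up; the grayscale test image hides the channel order because all three bytes are equal.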
