Matching CRC32 from STM32F0 and zlib - c

I'm working on a communication link between a computer running Linux and a STM32F0. I want to use some kind of error detection for my packets and since the STM32F0 has CRC32 hw and I have zlib with CRC32 on Linux I thought it would be a good idea to use CRC32 for my project. The problem is that I won't get the same CRC value for the same data on the different platforms.
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>
int
main(void)
{
uint8_t byte0 = 0x00;
uint32_t crc0 = crc32(0L, Z_NULL, 0);
crc0 = crc32(crc0, &byte0, 1);
printf("CRC32 value of %" PRIu8 " is: %08" PRIx32 "\n", byte0, crc0);
}
outputs CRC32 value of 0 is: d202ef8d which matches the result on several online calculators.
It seems though that whatever settings I use on the STM32 I can't get the same CRC.
I have found a flowchart on how the CRC hw calculates its value in an application note from ST but I can't figure out how it's done in zlib.
Does anyone know if they are compatible?
[edit 1] They both uses the same init value and polynomial.
[edit 2] The STM32 code is relatively uninteresging since it's using the hw.
...
/* Default values are used for init value and polynomial, see edit 1 */
CRC->CR |= CRC_CR_RESET;
CRC->DR = (uint8_t)0x00;
uint32_t crc = CRC->DR;
...

The CRC32 implementation on STM32Fx seems to be not the standard CRC32 implementation you find on many online CRC calculators and the one used in zip.
STM32 implements CRC32-MPEG2 which uses big endian and no final flip mask compared to the zip CRC32 which uses little endian and a final flip mask.
I found this online calculator which supports CRC32-MPEG2.
If you are more interested on other CRC algorithms and their implementation, look at this link.
PS: The HAL driver from STM supports input in byte, half word and word formats and they seem to work fine for STM32F0x in v1.3.1

From the documentation, it appears that your STM32 code is not just uninteresting — it is rather incomplete. From the documentation, in order to use the CRC hardware you need to:
Enable the CRC peripheral clock via the RCC peripheral.
Set the CRC Data Register to the initial CRC value by configuring the Initial CRC value register (CRC_INIT).(a)
Set the I/O reverse bit order through the REV_IN[1:0] and REV_OUT bits respectively in CRC Control register (CRC_CR).(a)
Set the polynomial size and coefficients through the POLYSIZE[1:0] bits in CRC
Control register (CRC_CR) and CRC Polynomial register (CRC_POL) respectively.(b)
Reset the CRC peripheral through the Reset bit in CRC Control register (CRC_CR).
Set the data to the CRC Data register.
Read the content of the CRC Data register.
Disable the CRC peripheral clock.
Note in particular steps 2, 3, and 4, which define the CRC being computed. They say that their example has rev_in and rev_out false, but for the zlib crc, they need to be true. Depending on the way the hardware is implemented, the polynomial will likely need to reversed as well (0xedb88320UL). The initial CRC needs to be 0xffffffff, and the final CRC inverted to match the zlib crc.

I haven't tested this, but I suspect that you're not, in fact, doing an 8-bit write on the STM32.
Instead you're probably doing a write to the full width of the register (32 bits), which of course means you're computing the CRC32 of more bytes than you intended.
Disassemble the generated code to analyze the exact store instruction that is being used.

I had similar issues implementing a CRC on an STM32 CRC module where the final checksum was not matching. I was able to fix this by looking at the example code in stm32f30x_crc.c that is provided with STMCube. In the source it has a function for 8-bit CRC that used the following code
(uint8_t)(CRC_BASE) = (uint8_t) CRC_Data;
to write to the DR register. As stated previously, the line CRC->DR is defined as volatile uint32_t which will access the whole register. It would have been helpful if ST had been more explicit about accessing the DR register rather than simply saying it supports 8-bit types.

From my experience, you can not compare the CRC32 code between the STM32 CRC unit output with the online CRC32 calculator.
Please find my CRC32 calculator for STM32 in this link.
This function has been used in my project, and proved to be correct.

For reference, getting compliant crc32 using the Low Level API:
void crc_init(void) {
LL_AHB1_GRP1_EnableClock(LL_AHB1_GRP1_PERIPH_CRC);
LL_CRC_SetPolynomialCoef(CRC, LL_CRC_DEFAULT_CRC32_POLY);
LL_CRC_SetPolynomialSize(CRC, LL_CRC_POLYLENGTH_32B);
LL_CRC_SetInitialData(CRC, LL_CRC_DEFAULT_CRC_INITVALUE);
LL_CRC_SetInputDataReverseMode(CRC, LL_CRC_INDATA_REVERSE_WORD);
LL_CRC_SetOutputDataReverseMode(CRC, LL_CRC_OUTDATA_REVERSE_BIT);
}
uint32_t crc32(const char* const buf, const uint32_t len) {
uint32_t data;
uint32_t index;
LL_CRC_ResetCRCCalculationUnit(CRC);
/* Compute the CRC of Data Buffer array*/
for (index = 0; index < (len / 4); index++) {
data = (uint32_t)((buf[4 * index + 3] << 24) |
(buf[4 * index + 2] << 16) |
(buf[4 * index + 1] << 8) | buf[4 * index]);
LL_CRC_FeedData32(CRC, data);
}
/* Last bytes specific handling */
if ((len % 4) != 0) {
if (len % 4 == 1) {
LL_CRC_FeedData8(CRC, buf[4 * index]);
}
if (len % 4 == 2) {
LL_CRC_FeedData16(CRC, (uint16_t)((buf[4 * index + 1] << 8) |
buf[4 * index]));
}
if (len % 4 == 3) {
LL_CRC_FeedData16(CRC, (uint16_t)((buf[4 * index + 1] << 8) |
buf[4 * index]));
LL_CRC_FeedData8(CRC, buf[4 * index + 2]);
}
}
return LL_CRC_ReadData32(CRC) ^ 0xffffffff;
}
Tweaked from the example here
The LL_CRC_ResetCRCCalculationUnit() re-initializes the crc unit so that repeated calls to this function return the expected value, otherwise the initial value will be whatever the result of the previous crc was, instead of 0xFFFFFFFF
You probably don't need the lines setting the polynomial to the default value because they should already be initialized to these values from reset, but I've left them in for posterity.

Related

STM32 HAL_CRC 16 Bit

I try to use the HAL_CRC on my STM32L4 in order to calculate a 16 bit CRC, but somehow I always get the same result no matter what the input is...
The CRC init
hcrc.Instance = CRC;
hcrc.Init.CRCLength = CRC_POLYLENGTH_16B; //as I have a 16bit polynome
hcrc.Init.DefaultPolynomialUse = DEFAULT_POLYNOMIAL_DISABLE;
hcrc.Init.GeneratingPolynomial = 0x1021; //MCRF4xx polynome
hcrc.Init.DefaultInitValueUse = DEFAULT_INIT_VALUE_ENABLE; //I want to init with 0xFFFF
hcrc.Init.InputDataInversionMode = CRC_INPUTDATA_INVERSION_BYTE; //input inversion
hcrc.Init.OutputDataInversionMode = CRC_OUTPUTDATA_INVERSION_ENABLE; //output inversion
hcrc.InputDataFormat = CRC_INPUTDATA_FORMAT_BYTES; //I have byte input
if (HAL_CRC_Init(&hcrc) != HAL_OK)
{
Error_Handler();
}
and then the calculation is called with
uint32_t result;
uint8_t pBuffer[3] = {0x33, 0x33, 0x55};
result = HAL_CRC_Calculate(&hcrc,pBuffer,3);
but the result is always 0xe000ed04, I'd expect 0xC91B for this specific case but at least it should change if a change the input. Does anyone spot an issue with this code snippet? I couldn't find any sample codes for 16bit CRC with the HAL Library.
I'm aware that the return value of HAL_CRC_Calculate() is a uint32_t, so my result would be the two lower bytes - in this case 0xed04. At least that's my interpretation of the function description.
The documentation indicates that you need to enable the CRC hardware clock with __HAL_RCC_CRC_CLK_ENABLE();. Are you doing that?

AVR Internal eeprom reading issue

I am using the internal EEPROM of atmega8A using avr's EEPROM library. My code looks like this
#define EEPROM_ADDR 0x0A
int main(void)
{
_delay_ms(2000);
LED_Initialize();
vBlink_Led(100, 2);
//eeprom_write_byte((uint8_t*)EEPROM_ADDR, 8);
val = eeprom_read_byte((uint8_t*)EEPROM_ADDR);
while (1);
}
when i uncomment the line eeprom_write_byte((uint8_t*)EEPROM_ADDR, 8); and then read using val = eeprom_read_byte((uint8_t*)EEPROM_ADDR);, the correct value 8 is read. But when I comment on the line and then reflash the code, the values changes to 255.
any suggestions?
note - i have unchecked the box in avrdudes for erase flash and eeprom
Normaly, when "Chip Erase" operation is performed, the EEPROM is being cleared as well.
To prevent this, you have to program (i.e. set to zero) EESAVE fuse bit, which is 3rd bit in the high fuse byte (please refer to the datasheet chapter 29.2 Fuse Bits)

M95128-W EEPROM. First byte of each page not writing or reading correctly

I am working on a library for controlling the M95128-W EEPROM from an STM32 device. I have the library writing and reading back data however the first byte of each page it not as expected and seems to be fixed at 0x04.
For example I write 128 bytes across two pages starting at 0x00 address with value 0x80. When read back I get:
byte[0] = 0x04;
byte[1] = 0x80;
byte[2] = 0x80;
byte[3] = 0x80;
.......
byte[64] = 0x04;
byte[65] = 0x80;
byte[66] = 0x80;
byte[67] = 0x80;
I have debugged the SPI with a logic analyzer and confirmed the correct bytes are being sent. When using the logic analyzer on the read command the mysterios 0x04 is transmitted from the EEPROM.
Here is my code:
void FLA::write(const void* data, unsigned int dataLength, uint16_t address)
{
int pagePos = 0;
int pageCount = (dataLength + 64 - 1) / 64;
int bytePos = 0;
int startAddress = address;
while (pagePos < pageCount)
{
HAL_GPIO_WritePin(GPIOB,GPIO_PIN_2, GPIO_PIN_SET); // WP High
chipSelect();
_spi->transfer(INSTRUCTION_WREN);
chipUnselect();
uint8_t status = readRegister(INSTRUCTION_RDSR);
chipSelect();
_spi->transfer(INSTRUCTION_WRITE);
uint8_t xlow = address & 0xff;
uint8_t xhigh = (address >> 8);
_spi->transfer(xhigh); // part 1 address MSB
_spi->transfer(xlow); // part 2 address LSB
for (unsigned int i = 0; i < 64 && bytePos < dataLength; i++ )
{
uint8_t byte = ((uint8_t*)data)[bytePos];
_spi->transfer(byte);
printConsole("Wrote byte to ");
printConsoleInt(startAddress + bytePos);
printConsole("with value ");
printConsoleInt(byte);
printConsole("\n");
bytePos ++;
}
_spi->transfer(INSTRUCTION_WRDI);
chipUnselect();
HAL_GPIO_WritePin(GPIOB,GPIO_PIN_2, GPIO_PIN_RESET); //WP LOW
bool writeComplete = false;
while (writeComplete == false)
{
uint8_t status = readRegister(INSTRUCTION_RDSR);
if(status&1<<0)
{
printConsole("Waiting for write to complete....\n");
}
else
{
writeComplete = true;
printConsole("Write complete to page ");
printConsoleInt(pagePos);
printConsole("# address ");
printConsoleInt(bytePos);
printConsole("\n");
}
}
pagePos++;
address = address + 64;
}
printConsole("Finished writing all pages total bytes ");
printConsoleInt(bytePos);
printConsole("\n");
}
void FLA::read(char* returndata, unsigned int dataLength, uint16_t address)
{
chipSelect();
_spi->transfer(INSTRUCTION_READ);
uint8_t xlow = address & 0xff;
uint8_t xhigh = (address >> 8);
_spi->transfer(xhigh); // part 1 address
_spi->transfer(xlow); // part 2 address
for (unsigned int i = 0; i < dataLength; i++)
returndata[i] = _spi->transfer(0x00);
chipUnselect();
}
Any suggestion or help appreciated.
UPDATES:
I have tried writing sequentially 255 bytes increasing data to check for rollover. The results are as follows:
byte[0] = 4; // Incorrect Mystery Byte
byte[1] = 1;
byte[2] = 2;
byte[3] = 3;
.......
byte[63] = 63;
byte[64] = 4; // Incorrect Mystery Byte
byte[65] = 65;
byte[66] = 66;
.......
byte[127] = 127;
byte[128] = 4; // Incorrect Mystery Byte
byte[129} = 129;
Pattern continues. I have also tried writing just 8 bytes from address 0x00 and the same problem persists so I think we can rule out rollover.
I have tried removing the debug printConsole and it has had no effect.
Here is a SPI logic trace of the write command:
And a close up of the first byte that is not working correctly:
Code can be viewed on gitlab here:
https://gitlab.com/DanielBeyzade/stm32f107vc-home-control-master/blob/master/Src/flash.cpp
Init code of SPI can be seen here in MX_SPI_Init()
https://gitlab.com/DanielBeyzade/stm32f107vc-home-control-master/blob/master/Src/main.cpp
I have another device on the SPI bus (RFM69HW RF Module) which works as expected sending and receiving data.
The explanation was actually already given by Craig Estey in his answer. You do have a rollover. You write full page and then - without cycling the CS pin - you send INSTRUCTION_WRDI command. Guess what's the binary code of this command? If you guessed that it's 4, then you're absolutely right.
Check your code here:
chipSelect();
_spi->transfer(INSTRUCTION_WRITE);
uint8_t xlow = address & 0xff;
uint8_t xhigh = (address >> 8);
_spi->transfer(xhigh); // part 1 address MSB
_spi->transfer(xlow); // part 2 address LSB
for (unsigned int i = 0; i < 64 && bytePos < dataLength; i++ )
{
uint8_t byte = ((uint8_t*)data)[bytePos];
_spi->transfer(byte);
// ...
bytePos ++;
}
_spi->transfer(INSTRUCTION_WRDI); // <-------------- ROLLOEVER!
chipUnselect();
With these devices, each command MUST start with cycling CS. After CS goes low, the first byte is interpreted as command. All remaining bytes - until CS is cycled again - are interpreted as data. So you cannot send multiple commands in a single "block" with CS being constantly pulled low.
Another thing is that you don't need WRDI command at all - after the write instruction is terminated (by CS going high), the WEL bit is automatically reset. See page 18 of the datasheet:
The Write Enable Latch (WEL) bit, in fact, becomes reset by any of the
following events:
• Power-up
• WRDI instruction execution
• WRSR instruction completion
• WRITE instruction completion.
Caveat: I don't have a definitive solution, just some observations and suggestions [that would be too large for a comment].
From 6.6: Each time a new data byte is shifted in, the least significant bits of the internal address counter are incremented. If more bytes are sent than will fit up to the end of the page, a condition known as “roll-over” occurs. In case of roll-over, the bytes exceeding the page size are overwritten from location 0 of the same page.
So, in your write loop code, you do: for (i = 0; i < 64; i++). This is incorrect in the general case if the LSB of address (xlow) is non-zero. You'd need to do something like: for (i = xlow % 64; i < 64; i++)
In other words, you might be getting the page boundary rollover. But, you mentioned that you're using address 0x0000, so it should work, even with the code as it exists.
I might remove the print statements from the loop as they could have an effect on the serialization timing.
I might try this with an incrementing data pattern: (e.g.) 0x01,0x02,0x03,... That way, you could see which byte is rolling over [if any].
Also, try writing a single page from address zero, and write less than the full page size (i.e. less that 64 bytes) to guarantee that you're not getting rollover.
Also, from figure 13 [the timing diagram for WRITE], it looks like once you assert chip select, the ROM wants a continuous bit stream clocked precisely, so you may have a race condition where you're not providing the data at precisely the clock edge(s) needed. You may want to use the logic analyzer to verify that the data appears exactly in sync with clock edge as required (i.e. at clock rising edge)
As you've probably already noticed, offset 0 and offset 64 are getting the 0x04. So, this adds to the notion of rollover.
Or, it could be that the first data byte of each page is being written "late" and the 0x04 is a result of that.
I don't know if your output port has a SILO so you can send data as in a traditional serial I/O port or do you have to maintain precise bit-for-bit timing (which I presume the _spi->transfer would do)
Another thing to try is to write a shorter pattern (e.g. 10 bytes) starting at a non-zero address (e.g. xhigh = 0; xlow = 4) and the incrementing pattern and see how things change.
UPDATE:
From your update, it appears to be the first byte of each page [obviously].
From the exploded view of the timing, I notice SCLK is not strictly uniform. The pulse width is slightly erratic. Since the write data is sampled on the clock rising edge, this shouldn't matter. But, I wonder where this comes from. That is, is SCLK asserted/deasserted by the software (i.e. transfer) and SCLK is connected to another GPIO pin? I'd be interested in seeing the source for the transfer function [or a disassembly].
I've just looked up SPI here: https://en.wikipedia.org/wiki/Serial_Peripheral_Interface_Bus and it answers my own question.
From that, here is a sample transfer function:
/*
* Simultaneously transmit and receive a byte on the SPI.
*
* Polarity and phase are assumed to be both 0, i.e.:
* - input data is captured on rising edge of SCLK.
* - output data is propagated on falling edge of SCLK.
*
* Returns the received byte.
*/
uint8_t SPI_transfer_byte(uint8_t byte_out)
{
uint8_t byte_in = 0;
uint8_t bit;
for (bit = 0x80; bit; bit >>= 1) {
/* Shift-out a bit to the MOSI line */
write_MOSI((byte_out & bit) ? HIGH : LOW);
/* Delay for at least the peer's setup time */
delay(SPI_SCLK_LOW_TIME);
/* Pull the clock line high */
write_SCLK(HIGH);
/* Shift-in a bit from the MISO line */
if (read_MISO() == HIGH)
byte_in |= bit;
/* Delay for at least the peer's hold time */
delay(SPI_SCLK_HIGH_TIME);
/* Pull the clock line low */
write_SCLK(LOW);
}
return byte_in;
}
So, the delay times need be at least the ones the ROM needs. Hopefully, you can verify that is the case.
But, I also notice that on the problem byte, the first bit of the data appears to lag its rising clock edge. That is, I would want the data line to be stabilized before clock rising edge.
But, that assumes CPOL=0,CPHA=1. Your ROM can be programmed for that mode or CPOL=0,CPHA=0, which is the mode used by the sample code above.
This is what I see from the timing diagram. It implies that the transfer function does CPOL=0,CPHA=0:
SCLK
__
| |
___| |___
DATA
___
/ \
/ \
This is what I originally expected (CPOL=0,CPHA=1) based on something earlier in the ROM document:
SCLK
__
| |
___| |___
DATA
___
/ \
/ \
The ROM can be configured to use either CPOL=0,CPHA=0 or CPOL=1,CPHA=1. So, you may need to configure these values to match the transfer function (or vice-versa) And, verify that the transfer function's delay times are adequate for your ROM. The SDK may do all this for you, but, since you're having trouble, double checking this may be worthwhile (e.g. See Table 18 et. al. in the ROM document).
However, since the ROM seems to respond well for most byte locations, the timing may already be adequate.
One thing you might also try. Since it's the first byte that is the problem, and here I mean first byte after the LSB address byte, the memory might need some additional [and undocumented] setup time.
So, after the transfer(xlow), add a small spin loop after that before entering the data transfer loop, to give the ROM time to set up for the write burst [or read burst].
This could be confirmed by starting xlow at a non-zero value (e.g. 3) and shortening the transfer. If the problem byte tracks xlow, that's one way to verify that the setup time may be required. You'd need to use a different data value for each test to be sure you're not just reading back a stale value from a prior test.

Reading 22 bits via SPI [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 7 years ago.
Improve this question
I am trying to read ADC(ADS8320) values using SPI communication. I am using IAR embedded workbench for ARM and SM470R1B1 controller. In the data sheet it says first 6 bits are dummy read and next 16 bits is the actval data. I am trying to read 24 bits and ignore first 6 bits and last 2 bits. I am getting wrong values when i try with the below code.
unsigned int readADC8320(void)
{
value = AD8320_16(0xFFFFFF); // read registe
return value;
}
unsigned int AD8320_16(unsigned int value)
{
unsigned int data;
AD8320_CS_Status(0);
SW_Delay(DELAY_10US);
while(GIODOUTA&X2);
data = spi2(value >> 16); //read high 8 bits
data = (data << 6)| spi2(value >> 8); //read next 8 bits but 6+8 =14
data = (data << 8)| spi2(value >> 2); //add last two bits only
SW_Delay(DELAY_10US);
AD8320_CS_Status(1);
return data;
}
unsigned char spi2(unsigned char data)
{
// Write byte to SPI2DAT1 register
SPI2DAT1 = data;
while(!(SPI2CTRL3 & 0x01)){} // Wait for RxFlag to get set
return (SPI2BUF & 0x000000FF); // Read SPIBUF
}
Can anyone suggest me where i am doing wrong. I am quite poor with shift operations.
Your code looks wonky. For example, what purpose does value have?
The datasheet you linked to indicates that the 22 clock cycles suffice (and that 24 clock cycles are okay), no extra delays are needed, assuming the SPI clock is within limits (at most 2.4 MHz, I believe). So, I'd expect your code to look more like
void AD8320_setup(void)
{
/* TODO:
* - Set SPI2 clock frequency, if programmable (max. 2.4 MHz)
* - Set SPI2 transfer size, if programmable (8 bits per byte)
* - Set SPI2 to latch data on the falling edge of clock pulse
* - Set SPI2 to pull clock high when inactive
* - Set AD8320 chip select pin as an output
* - Set AD8320 chip select pin high (it is active low)
* - Ensure AD8320 is powered
*/
}
uint16_t AD8320_16(void)
{
uint16_t result;
/* TODO: Set AD8320 chip select pin 0/LOW (to select it)
*/
/* TODO: Clear SPI2 FIFO if it has one,
* so that we get actual wire data,
* not old cached data.
*/
result = ((uint16_t)(spi2_read() & 3U)) << 14;
result |= ((uint16_t)spi2_read()) << 6;
result |= ((uint16_t)spi2_read()) >> 2;
/* TODO: Set AD8320 chip select pin 1/HIGH (to deselect it)
*/
return result;
}
The prototype for spi2_read is typically unsigned char spi2_read(void);, but it might be of any integer type that can represent all eight bit values correctly (0 to 255, inclusive). Above, I cast the result -- after masking out any irrelevant bits, if any, with a binary AND operation -- to uint16_t before shifting it into the correct position in the result, so that regardless of the spi2_read() return type, we can use it to construct our 16-bit value.
The first spi2_read() reads the first 8 bits. I am assuming your device SPI bus sends and receives the most significant bit first; this means that the 6 dummy bits are most significant bits, and the 2 least significant bits are the most significant bits of the result. That's why we only keep the two least significant bits of the first byte, and transfer it 14 places left -- that way we get them in the most significant position.
The second spi2_read() reads the next eight bits. These are bits 13..6 of the data; that's why we shift this byte left by 6 places, and OR with the existing data.
The last spi2_read() reads the next eight bits. The last two bits are to be discarded, because only the first (most significant) 6 bits contain the rest of the data we're interested in. So, we shift this down by 2 places, and OR with the existing data.
Preparing spi2 involves stuff like word size (8 bits), data rate (if programmable), and so on.
The chip select for the AD8320 is just a generic output pin, it's just active low. That's why it's described as ^CS/SHDN in the datasheet: chip selected when low, shutdown when high.
The chip is a micropower one, so it requires less than 2mA, so you could conceivably power it from an output pin on many microcontrollers. If you were to do so, you'd also have to remember to make that pin an output and 1/high in _setup() function.

Looking for a fast portable method for colorkeying 8bit images

I am looking for a fast, C language, portable method (freestanding, no libraries) for performing colorkeyed blits. The target is a freestanding emulator library for a retro style computer design.
My current take on this (the images are stored in uint32_t typed storage, however they are 8bit palettized images):
uint32_t sdata; /* Last read source data (4 pixels) */
uint32_t ckey; /* Prepared colorkey, such as 0x30303030U if the key is 0x30 */
uint32_t t32;
...
t32 = sdata ^ ckey; /* 0x00 where the colorkey matches */
t32 = (((t32 & 0x7F7F7F7FU) + 0x7F7F7F7FU) | t32) & 0x80808080U;
t32 = (t32 - (t32 >> 7)) + t32; /* Colorkey mask prepared (0xFF or 0x00) */
Some explanations:
It basically utilizes this bit twiddling hack modified to produce a mask to be applied on the source. The second line is the core of the hack, which my case returns 0x80 for every pixel (byte) not matching the colorkey, 0 otherwise. The third line just expands all 0x80's to 0xFF's. Later applying the mask is trivial, so I didn't include it (and in the actual code it is combined with other stuff, anyway).
So the question is, under the constraints of plain vanilla C (C89, to be precise), is there any way to make this task faster in general (for most architectures)?

Resources