Subtracting two images using NEON - arm

I'm trying to subtract two grayscale images using NEON intrinsics as an exercise, but I don't know the best way to subtract two vectors using the C intrinsics.
void subtractTwoImagesNeonOnePass( uint8_t *src, uint8_t*dest, uint8_t*result, int srcWidth)
{
for (int i = 0; i<srcWidth; i++)
{
// load 8 pixels
uint8x8x3_t srcPixels = vld3_u8 (src);
uint8x8x3_t dstPixels = vld3_u8 (src);
// subtract them
uint8x8x3_t subPixels = vsub_u8(srcPixels, dstPixels);
// store the result
vst1_u8 (result, subPixels);
// move 8 pixels
src+=8;
dest+=8;
result+=8;
}
}

It looks like you're using the wrong kind of loads and stores. Did you copy this from a three channel example? I think this is what you need:
#include <stdint.h>
#include <arm_neon.h>
void subtractTwoImagesNeon( uint8_t *src, uint8_t*dst, uint8_t*result, int srcWidth, int srcHeight)
{
for (int i = 0; i<(srcWidth/8); i++)
{
// load 8 pixels
uint8x8_t srcPixels = vld1_u8(src);
uint8x8_t dstPixels = vld1_u8(dst);
// subtract them
uint8x8_t subPixels = vsub_u8(srcPixels, dstPixels);
// store the result
vst1_u8 (result, subPixels);
// move 8 pixels
src+=8;
dst+=8;
result+=8;
}
}
You should also check that srcWidth is a multiple of 8. Also, you'd need to include all the lines of the image, as it appears that your code only handles the first line (maybe you know this and just cut down the example for simplicity).
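A minimal sketch of how the two remarks above could be handled together, assuming both images are stored contiguously (no padding between rows) so the whole image can be treated as one run of srcWidth * srcHeight bytes. The name subtractImages and the scalar fallback are illustrative additions, not part of the answer; the NEON path is only compiled when the compiler defines __ARM_NEON.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: result[i] = src[i] - dst[i] for every pixel of two
 * contiguous 8-bit grayscale images. The subtraction wraps modulo 256,
 * which is also what vsub_u8 does. */
void subtractImages(const uint8_t *src, const uint8_t *dst, uint8_t *result,
                    int srcWidth, int srcHeight)
{
    size_t count = (size_t)srcWidth * (size_t)srcHeight;
    size_t i = 0;
#if defined(__ARM_NEON)
    /* Vector part: 8 pixels per iteration. */
    for (; i + 8 <= count; i += 8) {
        uint8x8_t a = vld1_u8(src + i);
        uint8x8_t b = vld1_u8(dst + i);
        vst1_u8(result + i, vsub_u8(a, b));
    }
#endif
    /* Scalar tail: the leftover pixels when count is not a multiple of 8,
     * or the whole image when NEON is unavailable. */
    for (; i < count; i++) {
        result[i] = (uint8_t)(src[i] - dst[i]);
    }
}
```

If wrap-around is not what you want, vqsub_u8 (saturating subtract) or vabd_u8 (absolute difference) take the same operands as vsub_u8.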


Problems printing a bitmap font

I'm trying to print a bitmap font in C using only a set_pixel function (which sets the color of the pixel at a given screen coordinate).
The problem is that my code does not work at all.
The code is below; do you know why it is failing?
The only clue I could find is a warning that my compiler (clang) reports:
x86_64-uefi/tty.c:67:11: warning: incompatible pointer to integer conversion passing 'char [2]' to parameter of type 'char' [-Wint-conversion]
put_char("c", 1, 1, r | g | b);
I don't know how to fix it.
Files involved:
1: tty.c (The one that prints)
#include "font.h"
#include "tty.h"
KABI void put_char(char c, uint8_t x, uint8_t y, uint32_t rgb)
{
uint8_t i,j;
// Convert the character to an index
c = c & 0x7F;
if (c < ' ') {
c = 0;
} else {
c -= ' ';
}
// 'font' is a multidimensional array of [96][char_width]
// which is really just a 1D array of size 96*char_width.
const uint8_t* chr = font[c*CHAR_WIDTH];
// Draw pixels
for (j=0; j<CHAR_WIDTH; j++) {
for (i=0; i<CHAR_HEIGHT; i++) {
if (chr[j] & (1<<i)) {
set_pixel(x+j, y+i, rgb);
}
}
}
}
2: tty.h (the header of tty.c)
#pragma once
// Types (uint8_t, etc.)
#include "typedefs.h"
// tty_init: Cleans the screen and setup all.
KABI void tty_init(void);
KABI void put_char(char c, uint8_t x, uint8_t y, uint32_t rgb);
3: font.h (I made it a little bit shorter)
// Our types (uint8_t, etc.)
#include "typedefs.h"
#define CHAR_WIDTH 6
#define CHAR_HEIGHT 8
const unsigned char font[96][6] = {
{0x00,0x00,0x00,0x00,0x00,0x00}, //
{0x2e,0x00,0x00,0x00,0x00,0x00}, // !
{0x03,0x00,0x03,0x00,0x00,0x00}, // "
{0x0a,0x1f,0x0a,0x1f,0x0a,0x00}, // #
{0x2e,0x2a,0x6b,0x2a,0x3a,0x00}, // $
{0x0e,0x2a,0x1e,0x08,0x3c,0x2a}, // %
{0x3e,0x2a,0x2a,0x22,0x38,0x08}, // &
{0x03,0x00,0x00,0x00,0x00,0x00}, // '
...
{0x3e,0x24,0x24,0x24,0x3c,0x00}, // b
{0x3c,0x24,0x24,0x24,0x24,0x00}, // c
...
{0x00,0x00,0x00,0x00,0x00,0x00}
};
Calling the tty_init function appears to output nothing, but if you take a closer look (the font is pretty small), it prints just a line of 6 pixels.
Thanks in advance!
Notes: I don't know if it is relevant, but KABI stands for __attribute__((sysv_abi)), and the pixel line is printed at the top of the "represented" string.

Algorithm for writing to EEPROM?

I have a memory organized as a column of 4-byte rows. Over I2C, I can only write to it 16 bytes at a time, while reads are done 4 bytes at a time (that is, row by row).
I am interested in how to write data into the EEPROM: the data being written consists of a few different parts, two of which can be of variable length. For example, I can have XYYZ or XYYYYZZZZZZZ, where each letter is 4 bytes.
My question is: how should I go about this to get a general way of writing the message to the memory, using 16-byte writes, that accommodates the variable length of the two parts?
Rather than try to work in 4 or 16-byte units, you could consider using a small (21-byte) static cache for the eeprom. Let's assume you have
void eeprom_read16(uint32_t page, uint8_t *data);
void eeprom_write16(uint32_t page, const uint8_t *data);
where page is the address divided by 16, and both functions always operate on 16-byte chunks. The cache itself, its initialization function (which you'd call once at power-on), and the flush function would be
static uint32_t eeprom_page; /* uint16_t suffices for an EEPROM smaller than 1 MiB */
static uint8_t eeprom_cache[16];
static uint8_t eeprom_dirty;
static void eeprom_init(void)
{
eeprom_page = 0x80000000U; /* "None", at 32 GiB */
eeprom_dirty = 0;
}
static void eeprom_flush(void)
{
if (eeprom_dirty) {
eeprom_write16(eeprom_page, eeprom_cache);
eeprom_dirty = 0;
}
}
The eeprom_flush() function is only needed if you wish to ensure some data is stored in the EEPROM -- basically, after each complete transaction. You can safely call it at any time.
To access any memory in the EEPROM, you use the accessor functions
static inline uint8_t eeprom_get(const uint32_t address)
{
    const uint32_t page = address >> 4;
    if (page != eeprom_page) {
        /* Write back the old page before replacing it in the cache. */
        if (eeprom_dirty) {
            eeprom_write16(eeprom_page, eeprom_cache);
            eeprom_dirty = 0;
        }
        eeprom_read16(page, eeprom_cache);
        eeprom_page = page;
    }
    /* The low 4 bits of the address select the byte within the page. */
    return eeprom_cache[address & 0xFU];
}
static inline void eeprom_set(const uint32_t address, const uint8_t value)
{
    const uint32_t page = address >> 4;
    if (page != eeprom_page) {
        /* Write back the old page before replacing it in the cache. */
        if (eeprom_dirty) {
            eeprom_write16(eeprom_page, eeprom_cache);
            eeprom_dirty = 0;
        }
        eeprom_read16(page, eeprom_cache);
        eeprom_page = page;
    }
    eeprom_dirty = 1;
    eeprom_cache[address & 0xFU] = value;
}
Feel free to omit the inline if you like; it is just an optimization. The static inline above tells a C99 compiler to inline the functions if possible. It might increase your code size a bit, but it should produce faster code (because the compiler can optimize better when such small functions are inlined into the calling code).
Note that you should not use the above in interrupt handlers, because normal code is not prepared for the eeprom page to change mid-operation.
You can mix read and write operations, but that may lead to unnecessary wear on the EEPROM. You can, of course, split the read and write sides to separate caches, if you do mix reads and writes. That would also allow you to safely do EEPROM reads from an interrupt context (although the delay/latency of the I2C access might wreak havoc elsewhere).
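One way to try the cache out on a desktop machine before touching hardware is to back eeprom_read16 and eeprom_write16 with a plain array. The sketch below does that, under the assumption of a 256-byte device; sim_eeprom, sim_writes and eeprom_store are hypothetical names introduced for the illustration, and the cache is reduced to the write path.

```c
#include <stdint.h>
#include <string.h>

/* Simulated 256-byte EEPROM; stands in for the real I2C transfers. */
static uint8_t  sim_eeprom[256];
static unsigned sim_writes;           /* number of 16-byte page writes issued */

static void eeprom_read16(uint32_t page, uint8_t *data)
{
    memcpy(data, &sim_eeprom[page * 16], 16);
}

static void eeprom_write16(uint32_t page, const uint8_t *data)
{
    memcpy(&sim_eeprom[page * 16], data, 16);
    sim_writes++;
}

/* Single-page write-back cache, following the description above. */
static uint32_t eeprom_page = 0x80000000U;   /* "none" */
static uint8_t  eeprom_cache[16];
static uint8_t  eeprom_dirty;

static void eeprom_flush(void)
{
    if (eeprom_dirty) {
        eeprom_write16(eeprom_page, eeprom_cache);
        eeprom_dirty = 0;
    }
}

static void eeprom_set(uint32_t address, uint8_t value)
{
    const uint32_t page = address >> 4;
    if (page != eeprom_page) {
        eeprom_flush();                       /* write back the old page */
        eeprom_read16(page, eeprom_cache);    /* load the new one */
        eeprom_page = page;
    }
    eeprom_cache[address & 0xFU] = value;
    eeprom_dirty = 1;
}

/* Hypothetical helper: store len bytes starting at address, then flush. */
static void eeprom_store(uint32_t address, const uint8_t *data, uint32_t len)
{
    for (uint32_t i = 0; i < len; i++) {
        eeprom_set(address + i, data[i]);
    }
    eeprom_flush();
}
```

Storing 20 bytes at address 12 this way touches two pages and therefore costs two 16-byte writes, while the bytes on either side of the message keep their old contents.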
Not tailored specifically to your examples, completely untested and relying on having "read 4 bytes from EEPROM" and "write 16 bytes to EEPROM" encapsulated in suitable functions.
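#include <stddef.h>
#include <stdint.h>
#include <sys/types.h> /* ssize_t (POSIX) */
/* read_from_eeprom() reads 4 bytes and write_to_eeprom() writes 16 bytes;
   both are assumed to exist. The wrapper is named write_data_to_eeprom here
   so that it does not clash with the 16-byte primitive. */
void write_data_to_eeprom(uint32_t start, size_t len, uint8_t *data) {
    uint32_t eeprom_dst = start & 0xfffffff0;
    uint8_t buffer[16];
    // data_offset is the index into data[] of the first byte of the current
    // 16-byte block; it is negative while the block begins before 'start'.
    ssize_t data_offset;
    for (data_offset = -(ssize_t)(start - eeprom_dst); data_offset < (ssize_t)len; data_offset += 16, eeprom_dst += 16) {
        if ((data_offset < 0) || (((ssize_t)len - data_offset) < 16)) {
            // Partial block: fill our buffer with the existing EEPROM data first
            read_from_eeprom(eeprom_dst, buffer); // read 4 bytes, place at ptr
            read_from_eeprom(eeprom_dst+4, buffer+4);
            read_from_eeprom(eeprom_dst+8, buffer+8);
            read_from_eeprom(eeprom_dst+12, buffer+12);
            ssize_t offset = data_offset;
            for (int buf_ix = 0; buf_ix < 16; buf_ix++, offset++) {
                if ((offset >= 0) && (offset < (ssize_t)len)) {
                    // This position is covered by actual data
                    buffer[buf_ix] = data[offset];
                }
            }
        } else {
            // We don't need to cater for edge cases and can simply shift
            // 16 bytes into our tmp buffer.
            for (int ix = 0; ix < 16; ix++) {
                buffer[ix] = data[data_offset + ix];
            }
        }
        write_to_eeprom(eeprom_dst, buffer);
    }
}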

Determine if a message is too long to embed in an image

I created a program that embeds a message in a PPM file by messing with the last bit in each byte in the file. The problem I have right now is that I don't know if I am checking if a message is too long or not correctly. Here's what I've got so far:
int hide_message(const char *input_file_name, const char *message, const char *output_file_name)
{
unsigned char * data;
int n;
int width;
int height;
int max_color;
//n = 3 * width * height;
int code = load_ppm_image(input_file_name, &data, &n, &width, &height, &max_color);
if (code)
{
// return the appropriate error message if the image doesn't load correctly
return code;
}
int len_message;
int count = 0;
unsigned char letter;
// get the length of the message to be hidden
len_message = (int)strlen(message);
if (len_message > n/3)
{
fprintf(stderr, "The message is longer than the image can support\n");
return 4;
}
for(int j = 0; j < len_message; j++)
{
letter = message[j];
int mask = 0x80;
// loop through each byte
for(int k = 0; k < 8; k++)
{
if((letter & mask) == 0)
{
//set right most bit to 0
data[count] = 0xfe & data[count];
}
else
{
//set right most bit to 1
data[count] = 0x01 | data[count];
}
// shift the mask
mask = mask>>1 ;
count++;
}
}
// create the null character at the end of the message (00000000)
for(int b = 0; b < 8; b++){
data[count] = 0xfe & data[count];
count++;
}
// write a new image file with the message hidden in it
int code2 = write_ppm_image(output_file_name, data, n, width, height, max_color);
if (code2)
{
// return the appropriate error message if the image doesn't load correctly
return code2;
}
return 0;
}
So I'm checking to see if the length of the message (len_message) is longer than n/3, which is the same thing as width*height. Does that seem correct?
The check you're currently doing is checking whether the message has more bytes than the image has pixels. Because you're only using 1 bit per pixel to encode the message, you need to check if the message has more bits than the image has pixels.
So you need to do this:
if (len_message*8 > n/3)
In addition to @dbush's remarks about checking the number of bits in your message, you appear not to be accounting for all the bytes available to you in the image. Normal ("raw", P6-format) PPM images use three color samples per pixel, at either 8 or 16 bits per sample. Thus, the image contains at least 3 * width * height bytes of color data, and maybe as many as 6 * width * height.
On the other hand, the point of steganography is to make the presence of a hidden message difficult to detect. In service to that objective, if you have a PPM with 16 bits per sample then you probably want to avoid modifying the more-significant bytes of the samples. Or if you don't care about that, then you might as well use the whole low-order byte of each sample in that case.
Additionally, PPM files record the maximum possible value of any sample, which does not need to be the same as the maximum value of the underlying type. It is possible for your technique to change the actual maximum value to be greater than the recorded maximum, and if you do not then change the maximum-value field as well then the inconsistency could be a tip-off that the file has been tampered with.
Furthermore, raw PPM format affords the possibility of multiple images of the same size in one file. The file header does not express how many there are, so you have to look at the file size to tell. You can use the bytes of every image in the file to hide your message.
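As a rough sketch of a capacity check along these lines: the loop in the question stores one message bit in each byte of data and then spends eight more bytes on the terminating null character, so, assuming n is the number of data bytes returned by load_ppm_image, the message needs (strlen(message) + 1) * 8 of them. The helper name message_fits is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if message, plus its terminating null
 * character, fits when one bit is stored in each of the n data bytes,
 * and 0 otherwise. */
int message_fits(const char *message, size_t n)
{
    size_t bytes_needed = strlen(message) + 1;   /* +1 for the terminator */
    return bytes_needed <= n / 8;                /* 8 data bytes per message byte */
}
```

Dividing n by 8 instead of multiplying the message length by 8 avoids any overflow for very long messages.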

How to optimize C for loop for font rendering on oled display

I need to optimize this function. Is there any clever way to optimize the for loops? (I don't think an early break is possible.)
void SeeedGrayOLED::putChar(unsigned char C)
{
if(C < 32 || C > 127) //Ignore non-printable ASCII characters. This can be modified for multilingual font.
{
C=' '; //Space
}
uint8_t k,offset = 0;
char bit1,bit2,c = 0;
for(char i=0;i<16;i++)
{
for(char j=0;j<32;j+=2)
{
if(i>8){
k=i-8;
offset = 1;
}else{
k=i;
}
// Character is constructed two pixel at a time using vertical mode from the default 8x8 font
c=0x00;
bit1=(pgm_read_byte(&hallfetica_normal[C-32][j+offset]) >> (8-k)) & 0x01;
bit2=(pgm_read_byte(&hallfetica_normal[C-32][j+offset]) >> ((8-k)-1)) & 0x01;
// Each bit is changed to a nibble
c|=(bit1)?grayH:0x00;
c|=(bit2)?grayL:0x00;
sendData(c);
}
}
}
The font is in the array hallfetica_normal, which is an array of arrays of uint8_t; could it be compressed or something like that?
This code runs on an Arduino, and I have to run a countdown from 500 to 0, decrementing by one every 10/20 ms.
EDIT
This is the new code after your suggestions, thanks all:
I'm looking into organising the font differently to allow fewer calls to pgm_read_byte (something like changing the orientation, I wonder).
void SeeedGrayOLED::putChar(unsigned char C)
{
if(C < 32 || C > 127) //Ignore non-printable ASCII characters. This can be modified for multilingual font.
{
C=' '; //Space
}
char c,byte = 0x00;
unsigned char nibble_lookup[] = { 0, grayL, grayH, grayH | grayL };
for(int ii=0;ii<2;ii++){
for(int i=0;i<8;i++)
{
for(int j=0;j<32;j+=2)
{
byte = pgm_read_byte(&hallfetica_normal[C-32][j+ii]);
c = nibble_lookup[(byte >> (8-i)) & 3];
sendData(c);
}
}
}
}
Well, you seem to be reading the same byte twice in a row unnecessarily via pgm_read_byte(&hallfetica_normal[C-32][j+offset]). You could load that once into a local variable.
Additionally, you could avoid the if(i>8){ check per iteration by breaking up the code into two loops; one where i goes from 0 to 8 and another where it goes from 9 to 15. (Although I suspect you really intended >= here, making the loop boundaries 0-7 then 8-15.) That also means things like offset become constant values, which will help.
In an effort to make the inner loop as fast as possible, I'd try to get rid of all branching with a lookup table and see whether that helped.
First, I'd define the lookup table outside the loop:
/* outside the loop */
unsigned char h_lookup[] = { 0, grayH };
unsigned char l_lookup[] = { 0, grayL };
Then inside the loop, since you're testing the least-significant bit, you can use that as an index into the lookup table. If it's clear, then the lookup index will be 0. If it's set, then the lookup index will be 1:
/* inside the loop */
byte = pgm_read_byte(&hallfetica_normal[C-32][j+offset]);
c = h_lookup[((byte >> (8-k)) & 0x01)] |
    l_lookup[((byte >> (8-k-1)) & 0x01)];
sendData(c);
Since you're masking and testing 2 adjacent bits, 8-k and 8-k-1, you could list all 4 possibilities in a single lookup table:
/* Outside loop */
unsigned char nibble_lookup[] = { 0, grayL, grayH, grayH | grayL };
And then the lookup becomes dramatically simplified.
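/* loop */
byte = pgm_read_byte(&hallfetica_normal[C-32][j+offset]);
/* shift by 8-k-1 so that bit 8-k-1 lands in index bit 0 (grayL) and bit 8-k in index bit 1 (grayH) */
c = nibble_lookup[(byte >> (8-k-1)) & 3];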
sendData(c);
The other answer has addressed what to do about the branches in the top part of your inner loop.
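When collapsing two single-bit tests into one table, the index has to be built from the same two bits the branches test, with the lower bit ending up in index bit 0. A small host-side sketch makes that easy to check exhaustively; the values given to grayH and grayL here are assumptions for illustration, and pixels_branch, pixels_lookup and the shift parameter (the position of the lower of the two bits) are hypothetical names.

```c
#include <stdint.h>

/* Assumed nibble values, for illustration only; real code should use the
 * library's own grayH / grayL. */
static const uint8_t grayH = 0xF0;
static const uint8_t grayL = 0x0F;

/* Branching version: tests bit (shift + 1) and bit shift separately. */
uint8_t pixels_branch(uint8_t byte, unsigned shift)
{
    uint8_t c = 0;
    if ((byte >> (shift + 1)) & 0x01) c |= grayH;
    if ((byte >> shift) & 0x01)       c |= grayL;
    return c;
}

/* Table version: both bits are extracted at once and used as the index. */
uint8_t pixels_lookup(uint8_t byte, unsigned shift)
{
    const uint8_t nibble_lookup[4] = { 0, grayL, grayH, grayH | grayL };
    return nibble_lookup[(byte >> shift) & 3];
}
```

Looping over all 256 byte values and every shift from 0 to 6 and comparing the two functions is a cheap way to catch an off-by-one in the shift before the code goes onto the board.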

Calling userdefined functions in thrust

I'm loading a .png file using OpenCV and I want to extract its blue intensity values using thrust library.
My code goes like this:
Loading an image using OpenCV IplImage pointer
Copying the image data into thrust::device_vector
Extracting the blue intensity values from the device vector inside a structure using thrust library.
My problem is extracting the blue intensity values from the device vector.
I have already written this code in CUDA and am now converting it to use the thrust library.
The blue intensity values are fetched inside the FetchBlueValues functor shown below.
I want to know how to call this struct FetchBlueValues from the main function.
Code:
#define ImageWidth 14
#define ImageHeight 10
thrust::device_vector<int> BinaryImage(ImageWidth*ImageHeight);
thrust::device_vector<int> ImageVector(ImageWidth*ImageHeight*3);
struct FetchBlueValues
{
__host__ __device__ void operator() ()
{
int index = 0 ;
for(int i=0; i<= ImageHeight*ImageWidth*3 ; i = i+3)
{
BinaryImage[index]= ImageVector[i];
index++;
}
}
};
void main()
{
src = cvLoadImage("../Input/test.png", CV_LOAD_IMAGE_COLOR);
unsigned char *raw_ptr,*out_ptr;
raw_ptr = (unsigned char*) src->imageData;
thrust::device_ptr<unsigned char> dev_ptr = thrust::device_malloc<unsigned char>(ImageHeight*src->widthStep);
thrust::copy(raw_ptr,raw_ptr+(src->widthStep*ImageHeight),dev_ptr);
int index=0;
for(int j=0;j<ImageHeight;j++)
{
for(int i=0;i<ImageWidth;i++)
{
ImageVector[index] = (int) dev_ptr[ (j*src->widthStep) + (i*src->nChannels) + 0 ];
ImageVector[index+1] = (int) dev_ptr[ (j*src->widthStep) + (i*src->nChannels) + 1 ];
ImageVector[index+2] = (int) dev_ptr[ (j*src->widthStep) + (i*src->nChannels) + 2 ];
index +=3 ;
}
}
}
Since the image is stored in pixel format, and each pixel includes distinct colors, there is a natural "stride" in accessing the individual color components of each pixel. In this case, it appears that the color components of a pixel are stored in three successive int quantities per pixel, so the access stride for a given color component would be three.
An example strided range access iterator methodology is covered here.
