Running C Bilinear Interpolation

I came across this, the C version of bilinear interpolation. I can't figure out how to run it. I have an image called image_1.jpeg that I want to use...
It looks like you just need to call scale(), but how exactly do you write the main() function to do that?
The code:
#include <stdint.h>

typedef struct {
    uint32_t *pixels;
    unsigned int w;
    unsigned int h;
} image_t;

#define getByte(value, n) (value >> (n*8) & 0xFF)

uint32_t getpixel(image_t *image, unsigned int x, unsigned int y){
    return image->pixels[(y*image->w)+x];
}
float lerp(float s, float e, float t){return s+(e-s)*t;}
float blerp(float c00, float c10, float c01, float c11, float tx, float ty){
    return lerp(lerp(c00, c10, tx), lerp(c01, c11, tx), ty);
}
void putpixel(image_t *image, unsigned int x, unsigned int y, uint32_t color){
    image->pixels[(y*image->w) + x] = color;
}
void scale(image_t *src, image_t *dst, float scalex, float scaley){
    int newWidth = (int)src->w*scalex;
    int newHeight= (int)src->h*scaley;
    int x, y;
    for(x= 0, y=0; y < newHeight; x++){
        if(x >= newWidth){              // >= so we never sample or write one column past the end
            x = 0; y++;
            if(y >= newHeight) break;   // and never one row past the end
        }
        float gx = x / (float)(newWidth) * (src->w-1);
        float gy = y / (float)(newHeight) * (src->h-1);
        int gxi = (int)gx;
        int gyi = (int)gy;
        uint32_t result=0;
        uint32_t c00 = getpixel(src, gxi, gyi);
        uint32_t c10 = getpixel(src, gxi+1, gyi);
        uint32_t c01 = getpixel(src, gxi, gyi+1);
        uint32_t c11 = getpixel(src, gxi+1, gyi+1);
        uint8_t i;
        for(i = 0; i < 3; i++){
            //((uint8_t*)&result)[i] = blerp( ((uint8_t*)&c00)[i], ((uint8_t*)&c10)[i], ((uint8_t*)&c01)[i], ((uint8_t*)&c11)[i], gxi - gx, gyi - gy); // this is shady
            result |= (uint8_t)blerp(getByte(c00, i), getByte(c10, i), getByte(c01, i), getByte(c11, i), gx - gxi, gy - gyi) << (8*i);
        }
        putpixel(dst, x, y, result);
    }
}
source: https://rosettacode.org/wiki/Bilinear_interpolation#C

Your problem here is probably less about figuring out how to call scale() and more about figuring out how to load your images. I recommend the SOIL library; it handles most common formats and gets you the raw pixel data.
You'll also most likely need some function to create the actual image_t instances from the raw image data. SOIL and other image libraries can give you the width and height, as well as some kind of byte or int array for the image itself. image_t appears to be row-major in your example, so something like this could work:
#include <stdlib.h>

image_t to_image_t(int width, int height, unsigned char* data) {
    image_t img = { malloc(sizeof(uint32_t) * width * height), width, height };
    for (int i = 0; i < width * height; i++) {
        // You may need to fiddle with this based on your image's format
        // and how you load it (channel order, 3 vs. 4 bytes per pixel).
        img.pixels[i] = ((uint32_t)data[i*4 + 0] << 24)
                      | (data[i*4 + 1] << 16)
                      | (data[i*4 + 2] << 8)
                      | (data[i*4 + 3]);
    }
    return img;
}
Just remember to free your image's pixel data after you're done using it.
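For completeness, here is a rough sketch of what main() could look like. It assumes SOIL's SOIL_load_image()/SOIL_free_image_data() and the to_image_t() helper above, and that you allocate the destination buffer yourself; treat the file name, scale factors, and header path as placeholders for your setup.
#include <stdio.h>
#include <stdlib.h>
#include <SOIL/SOIL.h>   // header location may differ depending on how SOIL is installed

int main(void) {
    int w, h, channels;
    // Force RGBA so every pixel is exactly 4 bytes, matching to_image_t().
    unsigned char *data = SOIL_load_image("image_1.jpeg", &w, &h, &channels, SOIL_LOAD_RGBA);
    if (!data) {
        fprintf(stderr, "failed to load image\n");
        return 1;
    }

    image_t src = to_image_t(w, h, data);
    SOIL_free_image_data(data);

    float scalex = 2.0f, scaley = 2.0f;
    image_t dst = {
        malloc(sizeof(uint32_t) * (size_t)(w * scalex) * (size_t)(h * scaley)),
        (unsigned int)(w * scalex),
        (unsigned int)(h * scaley)
    };

    scale(&src, &dst, scalex, scaley);

    // ... write dst.pixels back out with whatever image writer you prefer ...

    free(src.pixels);
    free(dst.pixels);
    return 0;
}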

Related

store bits(byte) in long long by concatenation

poly8_bitslice() takes an array of char as input; this input will be converted to bits (one byte per bit) by the function intToBits().
After the conversion I want to store the result in a long long variable. Is this possible?
Can I concatenate the result of intToBits()?
I want to do this with the following code:
#include <stdio.h>
#include <string.h>
#include <inttypes.h>
//#include <math.h>

typedef unsigned char poly8;
typedef unsigned long long poly8x64[8];

void intToBits(unsigned k, poly8 nk[8]) {
    int i;
    for(i=7;i>=0;i--){
        nk[i] = (k%2);
        k = (int)(k/2);
    }
}

void poly8_bitslice(poly8x64 r, const poly8 x[64])
{
    //TODO
    int i;
    for(i=0;i<64;i++){
        poly8 xb[8];
        intToBits(x[i], xb);
        int j;
        long long row;
        for(j=0;j<8;j++){
            row = row + x[j];
        }
        printf("row=%d \n", row);
    }
}

int main()
{
    poly8 a[64], b[64], r[64];
    poly8x64 va, vb, vt;
    int i;
    FILE *urandom = fopen("/dev/urandom","r");
    for(i=0;i<64;i++)
    {
        a[i] = fgetc(urandom);
        b[i] = fgetc(urandom);
    }
    poly8_bitslice(va, a);
    poly8_bitslice(vb, b);
    fclose(urandom);
    return 0;
}
I am not sure I fully understood your question, but you can do something like this:
unsigned char ch0 = 0xAA;   // unsigned char avoids sign extension when widening
unsigned char ch1 = 0xBB;
unsigned char ch2 = 0xCC;
unsigned char ch3 = 0xDD;

long long int x = 0;          // x is 0x00000000
x = (long long int)ch0;       // x is 0x000000AA
x = x << 8;                   // x is 0x0000AA00
x = x | (long long int)ch1;   // x is 0x0000AABB
x = x << 8;                   // x is 0x00AABB00
x = x | (long long int)ch2;   // x is 0x00AABBCC
x = x << 8;                   // x is 0xAABBCC00
x = x | (long long int)ch3;   // x is 0xAABBCCDD
In this case x would contain 0xAABBCCDD
The << operator shifts its left-hand operand left by the number of bits given by its right-hand operand, so 0xAA << 8 becomes 0xAA00. Note that zeros are shifted in at the low end.
The | operator performs a bitwise OR of its two operands, bit by bit: the first bit of the left-hand operand is OR'ed with the first bit of the right-hand operand and placed in the first bit of the result, and so on.
Anything OR'ed with zero yields itself, so
0xAA00 | 0x00BB
would result in
0xAABB
In general, an append function would be
long long int bitAppend(long long int x, unsigned char ch) {
    return ((x << 8) | (long long int)ch);
}
This function takes the long long integer you want to append to and the byte to append, and returns the appended long long int (it appends one whole byte, i.e. 8 bits, per call). Note that as soon as the 64 bits are filled up, the high-order bits are shifted out.
For example
long long int x = 0x1122334455667788;
x = x << 8; // x is now 0x2233445566778800
This results in x being 0x2233445566778800: there are only 64 bits in a long long int, so the high-order byte had to be shifted out.
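To tie this back to the original question, here is a small, self-contained sketch (my own, not from the question or answer) that packs the 8 single-bit values produced by intToBits() for one input byte into a single long long, one bit per shift; the helper name packBits is made up for illustration.
#include <stdio.h>

typedef unsigned char poly8;

void intToBits(unsigned k, poly8 nk[8]) {
    int i;
    for (i = 7; i >= 0; i--) {
        nk[i] = k % 2;
        k = k / 2;
    }
}

// Concatenate the 8 single-bit values into one integer, most significant bit first.
long long packBits(const poly8 bits[8]) {
    long long r = 0;
    int j;
    for (j = 0; j < 8; j++) {
        r = (r << 1) | bits[j];
    }
    return r;
}

int main(void) {
    poly8 bits[8];
    intToBits(0xA5, bits);
    printf("%lld\n", packBits(bits)); // prints 165, i.e. 0xA5 reassembled
    return 0;
}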

How to convert 16-bit unsigned short to 8-bit unsigned char using scaling efficiently?

I'm trying to convert 16-bit unsigned short data to 8-bit unsigned char using some scaling function. Currently I'm doing this by converting to float, scaling down, and then saturating to 8 bits. Is there a more efficient way to do this?
int _tmain(int argc, _TCHAR* argv[])
{
    float Scale=255.0/65535.0;
    USHORT sArr[8]={512,1024,2048,4096,8192,16384,32768,65535};
    BYTE bArr[8],bArrSSE[8];

    //Desired Conventional Method
    for (int i = 0; i < 8; i++)
    {
        bArr[i]=(BYTE)(sArr[i]*Scale);
    }

    __m128 vf_scale = _mm_set1_ps(Scale),
           vf_Round = _mm_set1_ps(0.5),
           vf_zero  = _mm_setzero_ps();
    __m128i vi_zero = _mm_setzero_si128();
    __m128i vi_src  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&sArr[0]));

    __m128 vf_Src_Lo=_mm_cvtepi32_ps(_mm_unpacklo_epi16(vi_src, _mm_set1_epi16(0)));
    __m128 vf_Src_Hi=_mm_cvtepi32_ps(_mm_unpackhi_epi16(vi_src, _mm_set1_epi16(0)));
    __m128 vf_Mul_Lo=_mm_sub_ps(_mm_mul_ps(vf_Src_Lo,vf_scale),vf_Round);
    __m128 vf_Mul_Hi=_mm_sub_ps(_mm_mul_ps(vf_Src_Hi,vf_scale),vf_Round);

    __m128i v_dst_i = _mm_packus_epi16(_mm_packs_epi32(_mm_cvtps_epi32(vf_Mul_Lo), _mm_cvtps_epi32(vf_Mul_Hi)), vi_zero);
    _mm_storel_epi64((__m128i *)(&bArrSSE[0]), v_dst_i);

    for (int i = 0; i < 8; i++)
    {
        printf("ushort[%d]= %d * %f = %.3f ,\tuChar[%d]= %d,\t SSE uChar[%d]= %d \n",i,sArr[i],Scale,(float)(sArr[i]*Scale),i,bArr[i],i,bArrSSE[i]);
    }
    return 0;
}
Please note that the scaling factor may need to be set to other values, e.g. 255.0/512.0, 255.0/1024.0 or 255.0/2048.0, so any solution should not be hard-coded for 255.0/65535.0.
If the ratio in your code is fixed, you can perform the scaling with the following algorithm:
Shift the high byte of each word into the low one.
E.g. 0x200 -> 0x2, 0xff80 -> 0xff
Add an offset of -1 if the low byte was less than 0x80.
E.g. 0x200 -> offset -1, 0xff80 -> offset 0
The first part is easily achieved with _mm_srli_epi16.
The second part is trickier: it basically consists of taking bit 7 (the high bit of the low byte) of each word, replicating it all over the word and then negating it.
I used another approach: I created a vector of words valued -1 by comparing a vector with itself for equality.
Then I isolated bit 7 of each source word and added it to the -1 words.
#include <stdio.h>
#include <emmintrin.h>

int main(int argc, char* argv[])
{
    float Scale=255.0/65535.0;
    unsigned short sArr[8]={512,1024,2048,4096,8192,16384,32768,65535};
    unsigned char bArr[8], bArrSSE[16];

    //Desired Conventional Method
    for (int i = 0; i < 8; i++)
    {
        bArr[i]=(unsigned char)(sArr[i]*Scale);
    }

    //Values to be converted
    __m128i vi_src = _mm_loadu_si128((__m128i const*)sArr);

    //This computes 8 words (16-bit) that are
    // -1 if the low byte of the relative word in vi_src is less than 0x80
    //  0 if the low byte of the relative word in vi_src is >= 0x80
    __m128i vi_off = _mm_cmpeq_epi8(vi_src, vi_src);    //Set all words to -1
    //Add bit 7 of each word in vi_src (the high bit of its low byte) to each -1 word
    vi_off = _mm_add_epi16(vi_off, _mm_srli_epi16(_mm_slli_epi16(vi_src, 8), 15));

    //Shift each vi_src word right by 8 (move the high byte into the low byte)
    vi_src = _mm_srli_epi16 (vi_src, 8);

    //Add the offsets
    vi_src = _mm_add_epi16(vi_src, vi_off);

    //Pack the words into bytes
    vi_src = _mm_packus_epi16(vi_src, vi_src);

    _mm_storeu_si128((__m128i *)bArrSSE, vi_src);

    for (int i = 0; i < 8; i++)
    {
        printf("%02x %02x\n", bArr[i],bArrSSE[i]);
    }
    return 0;
}
Here is an implementation and test harness using _mm_mulhi_epu16 to perform a fixed point scaling operation.
scale_ref is your original scalar code, scale_1 is the floating point SSE implementation from your (currently deleted) answer, and scale_2 is my fixed point implementation.
I've factored out the various implementations into separate functions and also added a size parameter and a loop, so that they can be used for any size array (although currently n must be a multiple of 8 for the SSE implementations).
There is a compile-time flag, ROUND, which controls whether the fixed point implementation truncates (like your scalar code) or rounds (to nearest). Truncation is slightly faster.
Also note that scale is a run-time parameter, currently hard-coded to 255 (equivalent to 255.0/65535.0) in the test harness below, but it can be any reasonable value.
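As a side note (my addition, not part of the answer below): the fixed-point trick in scale_2 boils down to keeping the high 16 bits of a 16x16-bit multiply, which approximates src * scale / 65535 with no floating point. A scalar sketch of the same idea, under the rounding behaviour described above, looks like this:
#include <stdint.h>

// Scalar sketch of roughly what scale_2 computes per element.
static uint8_t scale_fixed_scalar(uint16_t src, uint16_t scale)
{
    uint32_t v = (uint32_t)src + scale / 2;    // optional rounding, like _mm_adds_epu16
    if (v > 65535) v = 65535;                  // saturate the 16-bit add
    uint32_t q = (v * scale) >> 16;            // what _mm_mulhi_epu16 computes
    return q > 255 ? 255 : (uint8_t)q;         // clamp, analogous to the unsigned pack
}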
#include <stdio.h>
#include <stdint.h>
#include <limits.h>
#include <emmintrin.h>   // SSE2 intrinsics (_mm_mulhi_epu16 et al.)

#define ROUND 1 // use rounding rather than truncation

typedef uint16_t USHORT;
typedef uint8_t BYTE;

static void scale_ref(const USHORT *src, BYTE *dest, const USHORT scale, const size_t n)
{
    const float kScale = (float)scale / (float)USHRT_MAX;

    for (size_t i = 0; i < n; i++)
    {
        dest[i] = src[i] * kScale;
    }
}

static void scale_1(const USHORT *src, BYTE *dest, const USHORT scale, const size_t n)
{
    const float kScale = (float)scale / (float)USHRT_MAX;
    __m128 vf_Scale = _mm_set1_ps(kScale);
    __m128 vf_Round = _mm_set1_ps(0.5f);
    __m128i vi_zero = _mm_setzero_si128();

    for (size_t i = 0; i < n; i += 8)
    {
        __m128i vi_src = _mm_loadu_si128((__m128i *)&src[i]);
        __m128 vf_Src_Lo = _mm_cvtepi32_ps(_mm_unpacklo_epi16(vi_src, _mm_set1_epi16(0)));
        __m128 vf_Src_Hi = _mm_cvtepi32_ps(_mm_unpackhi_epi16(vi_src, _mm_set1_epi16(0)));
        __m128 vf_Mul_Lo = _mm_mul_ps(vf_Src_Lo, vf_Scale);
        __m128 vf_Mul_Hi = _mm_mul_ps(vf_Src_Hi, vf_Scale);

        //Convert -ive to +ive Value
        vf_Mul_Lo = _mm_max_ps(_mm_sub_ps(vf_Round, vf_Mul_Lo), vf_Mul_Lo);
        vf_Mul_Hi = _mm_max_ps(_mm_sub_ps(vf_Round, vf_Mul_Hi), vf_Mul_Hi);

        __m128i v_dst_i = _mm_packus_epi16(_mm_packs_epi32(_mm_cvtps_epi32(vf_Mul_Lo), _mm_cvtps_epi32(vf_Mul_Hi)), vi_zero);
        _mm_storel_epi64((__m128i *)&dest[i], v_dst_i);
    }
}

static void scale_2(const USHORT *src, BYTE *dest, const USHORT scale, const size_t n)
{
    const __m128i vk_scale = _mm_set1_epi16(scale);
#if ROUND
    const __m128i vk_round = _mm_set1_epi16(scale / 2);
#endif

    for (size_t i = 0; i < n; i += 8)
    {
        __m128i v = _mm_loadu_si128((__m128i *)&src[i]);
#if ROUND
        v = _mm_adds_epu16(v, vk_round);
#endif
        v = _mm_mulhi_epu16(v, vk_scale);
        v = _mm_packus_epi16(v, v);
        _mm_storel_epi64((__m128i *)&dest[i], v);
    }
}

int main(int argc, char* argv[])
{
    const size_t n = 8;
    const USHORT scale = 255;
    USHORT src[n] = { 512, 1024, 2048, 4096, 8192, 16384, 32768, 65535 };
    BYTE dest_ref[n], dest_1[n], dest_2[n];

    scale_ref(src, dest_ref, scale, n);
    scale_1(src, dest_1, scale, n);
    scale_2(src, dest_2, scale, n);

    for (size_t i = 0; i < n; i++)
    {
        printf("src = %u, ref = %u, test_1 = %u, test_2 = %u\n", src[i], dest_ref[i], dest_1[i], dest_2[i]);
    }

    return 0;
}
OK, I found the solution with reference to this.
Here is my solution:
int _tmain(int argc, _TCHAR* argv[])
{
    float Scale=255.0/65535.0;
    USHORT sArr[8]={512,1024,2048,4096,8192,16384,32768,65535};
    BYTE bArr[8],bArrSSE[8];

    //Desired Conventional Method
    for (int i = 0; i < 8; i++)
    {
        bArr[i]=(BYTE)(sArr[i]*Scale);
    }

    __m128 vf_scale = _mm_set1_ps(Scale),
           vf_zero  = _mm_setzero_ps();
    __m128i vi_zero = _mm_setzero_si128();
    __m128i vi_src  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&sArr[0]));

    __m128 vf_Src_Lo=_mm_cvtepi32_ps(_mm_unpacklo_epi16(vi_src, _mm_set1_epi16(0)));
    __m128 vf_Src_Hi=_mm_cvtepi32_ps(_mm_unpackhi_epi16(vi_src, _mm_set1_epi16(0)));
    __m128 vf_Mul_Lo=_mm_mul_ps(vf_Src_Lo,vf_scale);
    __m128 vf_Mul_Hi=_mm_mul_ps(vf_Src_Hi,vf_scale);

    //Convert -ive to +ive Value
    vf_Mul_Lo=_mm_max_ps(_mm_sub_ps(vf_zero, vf_Mul_Lo), vf_Mul_Lo);
    vf_Mul_Hi=_mm_max_ps(_mm_sub_ps(vf_zero, vf_Mul_Hi), vf_Mul_Hi);

    __m128i v_dst_i = _mm_packus_epi16(_mm_packs_epi32(_mm_cvtps_epi32(vf_Mul_Lo), _mm_cvtps_epi32(vf_Mul_Hi)), vi_zero);
    _mm_storel_epi64((__m128i *)(&bArrSSE[0]), v_dst_i);

    for (int i = 0; i < 8; i++)
    {
        printf("ushort[%d]= %d * %f = %.3f ,\tuChar[%d]= %d,\t SSE uChar[%d]= %d \n",i,sArr[i],Scale,(float)(sArr[i]*Scale),i,bArr[i],i,bArrSSE[i]);
    }
    return 0;
}

multi-precision multiplication in CUDA

I am trying to implement multi-precision multiplication in CUDA. To do that, I have implemented a kernel which should compute the multiplication of a uint32_t operand with a 256-bit operand and put the result in a 288-bit array. So far, I have come up with this code:
__device__ __constant__ UN_256fe B_const;

__global__ void multiply32x256Kernel(uint32_t A, UN_288bite* result){

    uint8_t tid = blockIdx.x * blockDim.x + threadIdx.x;
    //for managing warps
    //uint8_t laneid = tid % 32;

    //allocate partial products into array of uint64_t
    __shared__ uint64_t partialMuls[8];
    uint32_t carry, r;

    if((tid < 8) && (tid != 0)){
        //compute partial products
        partialMuls[tid] = A * B_const.uint32[tid];

        //add partial products and propagate carry
        result->uint32[8] = (uint32_t)partialMuls[7];
        r = (partialMuls[tid] >> 32) + ((uint32_t)partialMuls[tid - 1]);
        carry = r < (partialMuls[tid] >> 32);
        result->uint32[0] = (partialMuls[0] >> 32);

        while(__any(carry)){
            r = r + carry;
            //new carry?
            carry = r < carry;
        }
        result->uint32[tid] = r;
    }
}
and my data-type is :
typedef struct UN_256fe{
uint32_t uint32[8];
}UN_256fe;
typedef struct UN_288bite{
uint32_t uint32[9];
}UN_288bite;
My kernel runs, but it gives me the wrong result. I cannot debug inside the kernel, so I would appreciate it if someone could let me know where the problem is, or how I can debug my code inside the kernel on tegra-ubuntu with cuda-6.0.
Thanks
This answer has nothing to do with CUDA itself, but is a general C implementation.
I can't quite follow what you are doing (especially with carry), but you could try this snippet based on my own bignum functions. I defined dtype to make it easier to test with smaller fields. Note that I don't use an explicit carry; instead I carry forward the partial product.
// little-endian
#include <stdio.h>
#include <stdint.h>
#include <limits.h>

#define dtype uint8_t            // for testing
//#define dtype uint32_t         // for proper ver

#define SHIFTS (sizeof(dtype)*CHAR_BIT)
#define NIBBLES (SHIFTS/4)
#define ARRLEN 8

typedef struct UN_256fe {
    dtype uint[ARRLEN];
} UN_256fe;

typedef struct UN_288bite {
    dtype uint[ARRLEN+1];
} UN_288bite;

void multiply(UN_288bite *product, UN_256fe *operand, dtype multiplier)
{
    int i;
    uint64_t partial = 0;

    for (i=0; i<ARRLEN; i++) {
        partial = partial + (uint64_t)multiplier * operand->uint[i];
        product->uint[i] = (dtype)partial;
        partial >>= SHIFTS;      // carry
    }
    product->uint[i] = (dtype)partial;
}

int main(void)
{
    int i;
    dtype multiplier = 0xAA;
    UN_256fe operand = { 1, 2, 3, 4, 5, 6, 7, 8};
    UN_288bite product;

    multiply(&product, &operand, multiplier);

    for(i=ARRLEN-1; i>=0; i--)
        printf("%0*X", NIBBLES, operand.uint[i]);
    printf("\n * %0*X = \n", NIBBLES, multiplier);
    for(i=ARRLEN; i>=0; i--)
        printf("%0*X", NIBBLES, product.uint[i]);
    printf("\n");
    return 0;
}
Program output for uint8_t
0807060504030201
* AA =
0554A9FF54A9FF54AA

Filling up the bytes in an int variable

There are two variables,
uint8_t x (8 bit type)
uint16_t y (16 bit type)
that together hold information about the value of an int num. Say num consists of four bytes abcd (where a is the most significant). Then x needs to be copied to b, and y needs to be copied to cd. What is the best way/code to do this?
This works for me:
#include <stdio.h>
#include <stdint.h>

int main()
{
    uint8_t x = 0xF2;
    uint16_t y = 0x1234;
    int a = 0x87654321;

    // The core operations that put x and y in a.
    a = (a & 0xFF000000) | (x<<16);
    a = (a & 0xFFFF0000) | y;

    printf("x: %X\n", x);
    printf("y: %X\n", y);
    printf("a: %X\n", a);
}
Here's the output:
x: F2
y: 1234
a: 87F21234
You can use a union (although be careful with padding/alignment, and note that which struct member lands on which byte of abcd depends on endianness):
typedef union
{
    uint32_t abcd;
    struct
    {
        uint8_t  a;
        uint8_t  b;
        uint16_t cd;
    } parts;
} myvaluetype;

myvaluetype myvalue = {0};
uint8_t x = 42;
uint16_t y = 2311;
myvalue.parts.b = x;
myvalue.parts.cd = y;
printf( "%u\n", myvalue.abcd );
Byte masks and shifts will do, something like below:
num = (num & 0xFF000000u) | ((uint32_t)x << 16) | (uint32_t)y;

Round down float using bit operations in C

I am trying to round down a float using bit operations in C.
I start by converting the float to an unsigned int.
I think my strategy should be to get the exponent, and then zero out the bits after that, but I'm not sure how to code that. This is what I have so far:
float roundDown(float f);
unsigned int notRounded = *(unsigned int *)&f;
unsigned int copy = notRounded;
int exponent = (copy >> 23) & 0xff;
int fractional = 127 + 23 - exponent;
if(fractional > 0){
//not sure how to zero out the bits.
//Also don't know how to deal with the signed part.
Since it's just for fun, and I'm not sure what the constraints are, here's a variant that DOES work for negative numbers:
float myRoundDown_1 (float v) { //only works right for positive numbers
    return ((v-0.5f)+(1<<23)) - (1<<23);
}

float myRoundDown_2 (float v) { //works for all numbers
    static union {
        unsigned int i;   // 32-bit integer so it matches the size of float
        float f;
    } myfloat;
    unsigned int n;
    myfloat.f = v;
    n = myfloat.i & 0x80000000;
    myfloat.i &= 0x7fffffff;
    myfloat.f = myRoundDown_1(myfloat.f+(n>>31));
    myfloat.i |= n;
    return myfloat.f;
}
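A quick usage sketch (my addition, not part of the answer) showing the expected floor-like behaviour on a positive and a negative input:
#include <stdio.h>

int main(void) {
    printf("%f\n", myRoundDown_2(2.7f));   // expected: 2.000000
    printf("%f\n", myRoundDown_2(-2.7f));  // expected: -3.000000
    return 0;
}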
float roundDown(float f); should be float roundDown(float f) {.
unsigned int notRounded = *(unsigned int *)&f; is incompatible with modern compiler optimizations. Look up “strict aliasing”.
Here is a working function that rounds down to a power of two:
#include <stdio.h>
#include <assert.h>
#include <string.h>

float roundDown(float f) {
    unsigned int notRounded;
    assert(sizeof(int) == sizeof(float));
    memcpy(&notRounded, &f, sizeof(int));

    // zero out the significand (mantissa):
    unsigned int rounded = notRounded & 0xFF800000;

    float r;
    memcpy(&r, &rounded, sizeof(int));
    return r;
}

int main()
{
    printf("%f %f\n", 1.33, roundDown(1.33));
    printf("%f %f\n", 3.0, roundDown(3.0));
}
This should produce:
1.330000 1.000000
3.000000 2.000000
