I am trying to understand the 9-point stencil algorithm from this book. The logic is clear to me, but it is the calculation of the WIDTHP macro that I am unable to understand. Here is the brief code (the original code is more than 300 lines long!):
#define PAD64 0
#define WIDTH 5900
#if PAD64
#define WIDTHP ((((WIDTH*sizeof(REAL))+63)/64)*(64/sizeof(REAL)))
#else
#define WIDTHP WIDTH
#endif
#define HEIGHT 10000
REAL *fa = (REAL *)malloc(sizeof(REAL)*WIDTHP*HEIGHT);
REAL *fb = (REAL *)malloc(sizeof(REAL)*WIDTHP*HEIGHT);
The original array is 5900 x 10000, but if I define PAD64, the array becomes 5915.75 x 10000 (by my calculation).
So far I can guess that the author is trying to align and pad the array to a 64-byte boundary. But the array returned by malloc is usually aligned (and padded); also, posix_memalign gives you a chunk of memory that is guaranteed to have the requested alignment, and we can also use
__attribute__((aligned(64)))
What impact can this WIDTHP have on my code's performance?
The idea is that each row of the matrix (or column, if it's treated as a column-major matrix) can be aligned to the start of a new cache line, by adding padding to the end of the line. Exactly what impact this has depends of course a lot on the access pattern, but in general cache-friendliness can be quite important for intensely number-crunching code.
Also, the computation is done in integer arithmetic, so the result is certainly not 5915.75; that doesn't make sense.
I was going to put this in as a comment to unwind's answer because he's right. But perhaps I can explain more clearly, albeit in more characters than will fit in a comment.
When I do the math, I get 5904 reals, which is 23616 bytes (assuming REAL is a 4-byte float), which is 369 cache lines of 64 bytes each. It is the number of bytes, rather than the number of elements, that must be a multiple of 64.
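If you want to verify that arithmetic yourself, here is a small standalone check of my own (it assumes REAL is a 4-byte float, which the question does not show; the macros simply mirror the ones quoted above):

#include <stdio.h>

/* Standalone sanity check of the padding math; assumes REAL is a 4-byte float.
   The macros mirror the ones quoted in the question. */
typedef float REAL;
#define WIDTH  5900
#define WIDTHP ((((WIDTH*sizeof(REAL))+63)/64)*(64/sizeof(REAL)))

int main(void)
{
    printf("WIDTHP      = %zu elements\n", (size_t)WIDTHP);                   /* 5904  */
    printf("row size    = %zu bytes\n",    (size_t)(WIDTHP*sizeof(REAL)));    /* 23616 */
    printf("cache lines = %zu per row\n",  (size_t)(WIDTHP*sizeof(REAL)/64)); /* 369   */
    return 0;
}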
As to why you want to pad the value of width, let's look at a smaller example. Let's pretend we had a "cache line" that holds 10 letters and that we have an "array" with a width of 8 letters and a height of 4. Now since our hypothetical array is in C and C is row major, the array will look something like this:
AAAAAAAA
BBBBBBBB
CCCCCCCC
DDDDDDDD
but what does it look like when it is arranged in cache lines, since those are 10 letters long:
AAAAAAAABB
BBBBBBCCCC
CCCCDDDDDD
DD
Not good. Only the first row of the array is aligned. But if we pad width by two spaces, we get this in cache:
AAAAAAAA__
BBBBBBBB__
CCCCCCCC__
DDDDDDDD__
which is what we want. Now we can have a nested loop like
for i = 1 to height
for j = 1 to width
and know that every time we start to work on the j loop, the data we need will be aligned.
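In C terms, the inner loop only ever touches the first WIDTH elements of each WIDTHP-sized row, so every row starts on a cache-line boundary (provided the base pointer itself is aligned). A minimal sketch, reusing the names from the question:

/* Sketch only: traverse the logical WIDTH x HEIGHT grid stored with a padded
   row stride of WIDTHP elements; fa is the pointer from the question. */
for (int i = 0; i < HEIGHT; i++) {
    for (int j = 0; j < WIDTH; j++) {     /* note: bound is WIDTH, not WIDTHP */
        fa[i * WIDTHP + j] = 0.0;         /* row i begins on a 64-byte boundary */
    }
}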
Oh, and yes, they really should do something to make sure that the first element of the array is aligned. __attribute__((aligned(64))) won't work because the arrays are being allocated dynamically, but they could have used posix_memalign instead of malloc.
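For reference, a hedged sketch of that substitution (this is not what the book does, just the posix_memalign alternative mentioned above, replacing the two malloc calls from the question):

#include <stdlib.h>

/* Sketch: request 64-byte-aligned base pointers; combined with the padded
   row stride WIDTHP, every row then starts on a cache-line boundary. */
REAL *fa = NULL, *fb = NULL;
if (posix_memalign((void **)&fa, 64, sizeof(REAL) * WIDTHP * HEIGHT) != 0 ||
    posix_memalign((void **)&fb, 64, sizeof(REAL) * WIDTHP * HEIGHT) != 0) {
    /* allocation failed */
}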
The WIDTHP calculation is essentially (width in bytes / 64) + 1 cache lines, rounded for integer-precision math. I'd give you a better answer, except that in the SE mobile app it isn't practical to flick between this and the listing.
I need to make an algorithm (formula, function) using AND, OR, XOR, NEG, SHIFT, NOT, etc. which calculates an element of an array from an index; the size of an element is one byte.
E.g. element = index^constant, where the constant is array[index]^index (previously calculated).
This works only if the array size is less than 256.
How do I make a byte from an index when the index is bigger than 1 byte?
The same way; however, there will be duplicates, as you have only 256 possible numbers in a BYTE, so if your array is bigger than 256 there must be duplicates.
To avoid obvious mirroring you cannot use monotonic functions. For example,
value[ix] = ix
is monotonic, so it will produce a sawtooth-like shape, mirroring the content of the array every 256 bytes. To avoid this you need to combine more stuff together. It's similar to computing your own pseudo-random generator. The usual approaches are:
Modular arithmetic
something like:
value[ix]=( ( c0*ix + c1*ix*ix + c2*ix*ix*ix )%prime )&255
If the constants c0, c1, c2 and prime are big enough, the output looks random, so there will be far fewer repeated patterns visible in the output... But you need to use arithmetic of a bit width that can hold the prime...
If you are hitting the upper bound of your arithmetic's bit width, then you need to use modmul/modpow to avoid overflows. See:
Modular arithmetics and NTT (finite field DFT) optimizations
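A self-contained sketch of this approach (the constants c0, c1, c2 and the prime below are arbitrary values chosen only for illustration, not taken from anywhere):

#include <stdio.h>
#include <stdint.h>

/* Illustrative constants only; any sufficiently large values will do. */
#define C0    12345u
#define C1    67891u
#define C2    23457u
#define PRIME 1000003u   /* a prime much larger than 256 */

uint8_t value_at(uint32_t ix)
{
    /* 64-bit intermediates so C2*ix*ix*ix does not overflow for moderate ix;
       for very large ix you would need modmul as mentioned above. */
    uint64_t x = ix;
    uint64_t v = (C0 * x + C1 * x * x + C2 * x * x * x) % PRIME;
    return (uint8_t)(v & 255u);
}

int main(void)
{
    for (uint32_t ix = 0; ix < 16; ix++)
        printf("value[%u] = %u\n", ix, value_at(ix));
    return 0;
}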
Swapping bits
Simply do some math on your ix where you also use ix with its bits swapped. That will change the monotonic properties a lot... This approach works best, however, on a cumulative sub-result, which is not the case here. I would try:
value[ix]=( ix + ((ix<<3)*5) - ((ix>>2)*7) + ((3*ix)^((ix<<4)|(ix>>4))) )&255
Playing with the constants and operators achieves different results. However, with this approach you need to check validity (which I did not!). So render a graph for the first few values (say 1024), where the x axis is ix and the y axis is value[ix]. There you should see whether the output is repeating or saturating towards some value, and if it is, change the equation.
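A minimal sketch of that check, printing the first 1024 pairs so they can be plotted (my own throwaway code, not from the answer):

#include <stdio.h>
#include <stdint.h>

/* The formula from above; dump ix and value[ix] so the pairs can be plotted. */
uint8_t value_at(uint32_t ix)
{
    return (uint8_t)((ix + ((ix << 3) * 5) - ((ix >> 2) * 7)
                     + ((3 * ix) ^ ((ix << 4) | (ix >> 4)))) & 255u);
}

int main(void)
{
    for (uint32_t ix = 0; ix < 1024; ix++)
        printf("%u %u\n", ix, value_at(ix));   /* x axis: ix, y axis: value[ix] */
    return 0;
}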
for more info see How to seed to generate random numbers?
Of course, after all this it's not possible to get the ix back from value[ix]...
I am reading from two raw 8-bit video files of 256x256 frame size. One is the upper byte and one is the lower byte of a 16-bit raw video source.
I am struggling to use Matlab to either combine the two bytes into a single uint16 array to write to a file, or try to write the lower byte first and then the upper byte. Each time I use fwrite, I must write the full 256x256 frame of the upper byte, and then the full frame for the lower byte.
I have a loop like below that is working, but it is excruciatingly slow.
for j = 1:256
    for k = 1:256
        fwrite(RemergedFID, lowFrame(k,j), 'uint8');
        fwrite(RemergedFID, highFrame(k,j), 'uint8');
    end
end
Is there any better and faster way to write out something like this?
A combination of reshape and typecast can do what you want in Matlab:
%% // optional, in case your frames are not always 256x256
nLine = size(highFrame,1) ;
nCol = size(highFrame,2) ;
%% // if you like the "black magic" effect, in one line:
FullFrame = reshape( typecast( reshape( [lowFrame(:) highFrame(:)].' , [] , nLine*nCol*2 ) , 'uint16') , nLine , [] ) ;
fwrite(RemergedFID, FullFrame, 'uint16'); %// then write the full matrix in one go
If you do not like long obscure lines, you can decompose it into:
%% // merge the 2x uint8 frames into a single uint16 frame (lower endian ordering)
FullFrame = [lowFrame(:) highFrame(:)] ; %// Make a double column with high and low bytes in lower endian ordering
FullFrame = reshape( FullFrame.' , [] , nLine*nCol*2 ) ; %// Reshape the transpose to get one single line vector
FullFrame = typecast( FullFrame , 'uint16') ; %// cast every 2 uint8 value into a uint16 value (lower endian ordering)
FullFrame = reshape( FullFrame , nLine , [] ) ; %// reshape that into an m*n matrix
fwrite(RemergedFID, FullFrame, 'uint16'); %// then write the full matrix in one go
Matlab, and most PCs, will use little-endian byte ordering for values of more than 8 bits. Unless you are writing a file for a system which you know is specifically big-endian (some embedded processors are), I suggest you stick with little-endian, or better yet, let Matlab or your system handle it.
I can suggest creating a 16-bit frame where the upper half of each pixel corresponds to lowFrame and the lower half corresponds to highFrame. Your wording suggests the opposite, yet the way you are writing the results in MATLAB suggests what I have just said. I'm going to go with the way you are writing it in your code, but you can change it according to what you want to do.
In any case, simply take lowFrame and bitshift all of its bits to the left by 8, then add this result to highFrame to get the concatenated result. By bitshifting one frame to the left by 8 bits, you leave room for the other frame to occupy the lower half when you add. You would then write this entire array at once to file.

Bear in mind that when you write arrays to file, they are written in column-major format: the columns are stacked together in a single contiguous block and written out. I don't know what application you're ultimately going to be using this for, but if you want a row-major format, where the rows are stacked together in a single contiguous block, you'll need to transpose this output frame first before writing. As such, do something like this:
lowFrame = uint16(lowFrame);
highFrame = uint16(highFrame);
outFrame = bitshift(lowFrame, 8) + highFrame;
fwrite(RemergedFID, outFrame, 'uint16');
%fwrite(RemergedFID, outFrame.', 'uint16'); %// Use this if you want row-major
What's important is the first two lines of code, where we cast the frames, assuming that they are uint8, to uint16. We must do this for the bitshifting to work. If we left them as uint8, the shifted bits would simply fall off the end of the 8-bit type, so moving to the left by 8 bits would lose the lowFrame values entirely. The third line does the magic that concatenates the bits from both frames into single 16-bit words, then you write the results to file.
In terms of performance, this will certainly be much faster than looping over your pixels, because bit shifting and addition are cheap elementwise operations that take almost no time at all.
I don't understand how the address of a given point in a 2-dimensional Mat structure is computed as:
addr(M_{i,j}) = M.data + M.step[0]*i + M.step[1]*j
And why is it that
M.step[i] >= M.step[i+1] (in fact, M.step[i] >= M.step[i+1]*M.size[i+1])?
For example, if we have a 2-dimensional array of size 5x10, the way I know how to compute the address of the point (4,7) is the following:
Address = 4 + 7*5
Could someone shed some light on it??
Best regards,
1) The address you are talking about is an index into the array, not an address in computer memory. For example, if you have an array that occupies memory from 10000 to 20000, then the address of the pixel at point (0,0) is 10000, not 0.
2) An image may have more than one channel, and pixel values may use more than one byte. For example, if you have a matrix with 3 channels and the pixels are ints (i.e. 4 bytes), then step[1] is 3x4 = 12 bytes. The address of the pixel at (0,5) in such an array will be 10000 + step[0] x 0 + 12 x 5.
3) Also, your computation is missing the fact that the matrix may not be continuous in memory, i.e. between the end of one row and the beginning of the next there may be a gap. This is also incorporated in step[0].
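To make points 2) and 3) concrete, here is a plain-C sketch of the same address arithmetic (the struct and names are made up for illustration; step1 is the pixel size in bytes, step0 the full row size in bytes including any trailing gap):

#include <stddef.h>
#include <stdint.h>

/* Illustration only: mirrors addr(M_{i,j}) = M.data + M.step[0]*i + M.step[1]*j */
typedef struct {
    uint8_t *data;    /* start of the pixel data                        */
    size_t   step0;   /* bytes from one row to the next (incl. padding) */
    size_t   step1;   /* bytes from one pixel to the next within a row  */
} FakeMat;

uint8_t *pixel_addr(const FakeMat *m, size_t i, size_t j)
{
    return m->data + m->step0 * i + m->step1 * j;
}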
Just a recommendation: don't bother too much with all those step computations. If you need to access random pixels in the image, use the function at(), and if you work on the rows sequentially, use ptr() to get a pointer to the beginning of the row. This will save you a lot of computation and potential bugs.
I have this simple program:
% Read Image:
I=imread('Bureau.bmp');
% calculate Hist:
G= unique(I); % Calculate the different gray values
Hist= zeros(size(G)); % initialize an array with the same size as G
% For each different gray value, loop all the image, and each time you find
% a value that equals the gray value, increment the hist by 1
for j=1:numel(G)
    for i=1:numel(I)
        if G(j)== I(i)
            Hist(j)=Hist(j)+1;
        end
    end
end
Now look at this multiplication:
>> G(2)
ans =
1
>> Hist(2)
ans =
550
>> Hist(2)*G(2)
ans =
255
And it's giving me 255 not only for the index 2, but for any combination of indexes!
Two things for your problem.
First, here is the reason for your multiplication problem: different types. I, and therefore G, is of type uint8; Hist is of type double. When you perform the multiplication, Matlab uses the more restrictive type, here uint8. So the result of Hist(2)*G(2) is of type uint8 and limited to the range 0 to 255: 550*1 saturates to 255.
Second: please DON'T compute a histogram this way. Matlab has numerous functions for that (hist and histc being the most common ones), so please read the doc and use them instead of creating your own code. If you nevertheless want to write your own function (for learning purposes), this code is far too slow: you go through the image about 256 times, which is unnecessary. Instead, a classic way would be:
Hist = zeros(1,256);
for i = 1:numel(I)
    Hist(int32(I(i))+1) = Hist(int32(I(i))+1) + 1;
end
You directly use the value of the current pixel (+1 because indexing starts at 1 in Matlab) to access the corresponding slot of your histogram. Also, you must cast the pixel value to int32 to avoid the problem of value 255 (with uint8 variables, 255+1 saturates to 255, so pixels with value 255 would be counted in the wrong bin).
I don't want to be pedantic here, but Matlab comes with thousands of functions (not to mention the dozens of toolboxes) and very well-written documentation, so please read it and use every suitable function you can find; that's the best advice I can give to anybody who is starting to learn Matlab.
For my university project I'm simulating a process called random sequential adsorption.
One of the things I have to do involves randomly depositing squares (which cannot overlap) onto a lattice until there is no more room left, repeating the process several times in order to find the average 'jamming' coverage %.
Basically I'm performing operations on a large array of integers, which can take 3 possible values: 0, 1 and 2. The sites marked with '0' are empty, and the sites marked with '1' are full. Initially the array is defined like this:
int i, j;
int n = 1000000000;
int array[n][n];
for(j = 0; j < n; j++)
{
    for(i = 0; i < n; i++)
    {
        array[i][j] = 0;
    }
}
Say I want to deposit 5*5 squares randomly on the array (so that they cannot overlap), with the squares represented by '1's. This would be done by choosing the x and y coordinates randomly and then creating a 5*5 square of '1's with the top-left point of the square at that position. I would then mark sites near the square as '2's. These represent the sites that are unavailable, since depositing a square there would cause it to overlap an existing square. This process would continue until there is no more room left to deposit squares on the array (basically, no more '0's left on the array).
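To make the description above concrete, here is a rough sketch of a single deposition step using plain ints on a small lattice (the size N, the function name and the exact blocking rule are my own illustration, not code from the question):

#define N  64    /* small illustrative lattice, not the real problem size */
#define SQ 5     /* square side length */

int lattice[N][N];   /* 0 = empty, 1 = occupied, 2 = blocked as a top-left corner */

/* Deposit one SQ*SQ square whose top-left corner is (x, y), assuming (x, y)
   was still 0 and leaves room for the square inside the lattice, then block
   every nearby position that can no longer serve as a top-left corner. */
void deposit(int x, int y)
{
    for (int i = x; i < x + SQ; i++)
        for (int j = y; j < y + SQ; j++)
            lattice[i][j] = 1;

    /* any top-left corner within SQ-1 of (x, y) would overlap this square */
    for (int i = x - (SQ - 1); i <= x + (SQ - 1); i++)
        for (int j = y - (SQ - 1); j <= y + (SQ - 1); j++)
            if (i >= 0 && j >= 0 && i < N && j < N && lattice[i][j] == 0)
                lattice[i][j] = 2;
}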
Anyway, to the point. I would like to make this process as efficient as possible, by using bitwise operations. This would be easy if I didn't have to mark sites near the squares. I was wondering whether creating a 2-bit number would be possible, so that I can account for the sites marked with '2'.
Sorry if this sounds really complicated, I just wanted to explain why I want to do this.
You can't create a datatype that is 2 bits in size, since it wouldn't be addressable. What you can do is pack several 2-bit numbers into a larger cell:
struct Cell {
    unsigned int a : 2;
    unsigned int b : 2;
    unsigned int c : 2;
    unsigned int d : 2;
};
This specifies that each of the members a, b, c and d should occupy two bits in memory.
EDIT: This is just an example of how to create 2-bit variables, for the actual problem in question the most efficient implementation would probably be to create an array of int and wrap up the bit fiddling in a couple of set/get methods.
Instead of a two-bit array you could use two separate 1-bit arrays. One holds filled squares and one holds adjacent squares (or available squares if this is more efficient).
I'm not really sure that this has any benefit though over packing 2-bit fields into words.
I'd go for byte arrays unless you are really short of memory.
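For illustration, a minimal sketch of the two-bitmap variant (one packed bit per site in each array; the type and function names here are mine, not from any library):

#include <stdint.h>
#include <stdlib.h>

/* Two parallel bitmaps: one bit per lattice site in each. */
typedef struct {
    uint32_t *filled;    /* bit set => site is covered by a square           */
    uint32_t *blocked;   /* bit set => site unavailable as a top-left corner */
} Bitmaps;

Bitmaps bitmaps_create(size_t nsites)
{
    size_t nwords = (nsites + 31) / 32;
    Bitmaps b;
    b.filled  = calloc(nwords, sizeof(uint32_t));
    b.blocked = calloc(nwords, sizeof(uint32_t));
    return b;
}

void bit_set(uint32_t *bits, size_t i)       { bits[i / 32] |= (uint32_t)1 << (i % 32); }
int  bit_get(const uint32_t *bits, size_t i) { return (int)((bits[i / 32] >> (i % 32)) & 1u); }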
The basic idea
Unfortunately, there is no way to do this directly in C. You can create arrays of 1 byte, 2 bytes, etc., but you can't create arrays of bits.
The best thing you can do, then, is to write a new library for yourself, which makes it look like you're dealing with arrays of 2 bits, but in reality does a lot of hard work. The same way that the string libraries give you functions that work on "strings" (which in C are just arrays), you'll be creating a new library which works on "bit arrays" (which in reality will be arrays of integers, with a few special functions to deal with them as if they were arrays of bits).
NOTE: If you're new to C, and haven't learned the ideas of "creating a new library/module", or the concept of "abstraction", then I'd recommend learning about them before you continue with this project. Understanding them is IMO more important than optimizing your program to use a little less space.
How to implement this new "library" or module
For your needs, I'd create a new module called "2-bit array", which exports functions for dealing with the 2-bit arrays, as you need them.
It would have a few functions that deal with setting/reading bits, so that you can work with it as if you have an actual array of bits (you'll actually have an array of integers or something, but the module will make it seem like you have an array of bits).
Using this module would look something like this:
// This is just an example of how to use the functions in the twoBitArray library.
twoB my_array = Create2BitArray(size); // This will "create" a twoBitArray and return it.
SetBit(my_array, 5, 1); // Set element 5 to 1.
bit b = GetBit(my_array, 5); // Where bit is typedefed to an int by your module.
What the module will actually do is implement all these functions using regular-old arrays of integers.
For example, the function GetBit(), for GetBit(my_arr, 17), will calculate which integer of your underlying array element 17 falls into and which bits inside that integer it occupies (depending on sizeof(int), obviously), and you'd return those bits by using bitwise operations.
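A minimal sketch of what such a module could look like, packing 16 two-bit cells into each 32-bit word (the names twobit_create/twobit_set/twobit_get are illustrative, not an existing library):

#include <stdint.h>
#include <stdlib.h>

/* Illustrative "two-bit array" module: 16 two-bit cells per 32-bit word. */
typedef struct {
    uint32_t *words;
    size_t    nelems;
} TwoBitArray;

TwoBitArray twobit_create(size_t nelems)
{
    TwoBitArray a;
    a.nelems = nelems;
    a.words  = calloc((nelems + 15) / 16, sizeof(uint32_t));
    return a;
}

void twobit_set(TwoBitArray *a, size_t i, unsigned value)   /* value in 0..3 */
{
    unsigned shift = (unsigned)(i % 16) * 2;
    a->words[i / 16] = (a->words[i / 16] & ~((uint32_t)3 << shift))
                     | ((uint32_t)(value & 3) << shift);
}

unsigned twobit_get(const TwoBitArray *a, size_t i)
{
    return (a->words[i / 16] >> ((i % 16) * 2)) & 3u;
}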
You can compact one dimension of the array into sub-integer cells. To convert a coordinate (let's say x, for example) to a position inside a byte and read its 2-bit value:
unsigned char cell = array[i][x / 4];                /* 4 two-bit cells per byte    */
unsigned char mask = 0x03 << (2 * (x % 4));          /* the two bits for position x */
unsigned char data = (cell & mask) >> (2 * (x % 4)); /* the value (0, 1 or 2) at x  */
To write data, do the reverse.
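A hedged sketch of that write path, matching the read above and again assuming array holds unsigned char:

/* Store a 2-bit value (0, 1 or 2 in this problem) at logical column x of row i. */
unsigned char mask = 0x03 << (2 * (x % 4));
array[i][x / 4] = (array[i][x / 4] & ~mask) | ((value & 0x03) << (2 * (x % 4)));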