Dividing a Bitmap for Parallel Processing - c

How would I approach dividing a bitmap into segments and using them for parallel processing? I already have the height and width of the bitmap, but where do I go from here? I've read that I should use MPI_Cart_shift() and MPI_Sendrecv(), but I'm unsure how to approach using them.
width = BMP_GetWidth (bmp);
height = BMP_GetHeight (bmp);
new_bmp = BMP_Create(width, height, 24); // BMP_Create(UINT width, UINT height, USHORT depth)

How I'd approach dividing a bitmap into segments for use with parallel processing depends on what type of processing is being done.
Your tags (but not your question) mention Gaussian blur, so that's probably a good place to start.
For Gaussian blur, each output pixel depends on lots of input pixels and nothing else. If each processor has a (read only) copy of all input pixels then you can split the work however you like, but "banding" would work best. Specifically, if there are N processors the first processor would find the first group of "total_pixels/N" output pixels (which is likely to be a band of pixels at the top of the image), the second processor would do the second group of "total_pixels/N" output pixels (likely to be a band of pixels just below the first band), etc. Once all the processors are done you'd just append the output pixels from each processor in the right order to get the whole output bitmap.
Note that (due to rounding) some processors may need to do a different number of pixels - e.g. if the bitmap has 10000 pixels and you have 64 processors, then "10000/64 = 156.25" but a processor can't do a quarter of a pixel, so you end up with 48 processors doing 156 pixels while 16 processors do 157 pixels ("48*156 + 16*157 = 10000").
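For example, a minimal C sketch of this split (illustrative names; it hands the remainder out one extra pixel per processor, matching the "48*156 + 16*157" example above):

/* Sketch only: compute which output pixels rank `rank` of `nprocs` should do. */
void band_for_rank(long total_pixels, int nprocs, int rank,
                   long *first, long *count)
{
    long base = total_pixels / nprocs;   /* every processor gets at least this many */
    long rem  = total_pixels % nprocs;   /* leftover pixels, handed out one each    */

    *count = base + (rank < rem ? 1 : 0);
    *first = rank * base + (rank < rem ? rank : rem);
}

For the 10000-pixel / 64-processor example this gives ranks 0 to 15 a count of 157 and ranks 16 to 63 a count of 156.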
Also, if processors may be different speeds and/or different latencies, you might want to split the work into more pieces (e.g. if there are 64 processors split the work into 128 pieces, where slower processors might only do 1 piece while faster processors might do 4 pieces).
If the processors don't already have a copy of all input pixels (and if there's no shared memory) then you can send each processor a fraction of all pixels. For example, if you have a Gaussian matrix that's 7 rows high (3 rows above the output position, one row at the output position and 3 rows below the output position), and if each processor outputs a band of 100 rows of pixels, then you'd send each processor a "3+100+3 = 106" band of input pixels to work on (except for the processors that do the first band and last band, which would only get "3+100" or "100+3" rows of input pixels).
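Since the question mentions MPI_Sendrecv(), here is a rough sketch (my assumptions, not a drop-in solution) of exchanging those 3 boundary rows between neighbouring ranks. It assumes row-wise bands of 24-bit pixels (3 bytes each) stored in a buffer laid out as [3 halo rows | own rows | 3 halo rows]; MPI_Cart_shift() on a 1-D Cartesian communicator could supply the up/down neighbours instead of the manual rank arithmetic:

#include <mpi.h>

#define RADIUS 3   /* Gaussian kernel reaches 3 rows up and 3 rows down */

/* band: [RADIUS halo rows][rows own rows][RADIUS halo rows], 3 bytes/pixel */
void exchange_halo(unsigned char *band, int rows, int width,
                   int rank, int nprocs, MPI_Comm comm)
{
    int row_bytes = width * 3;
    int up   = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    unsigned char *top_halo    = band;
    unsigned char *first_own   = band + RADIUS * row_bytes;
    unsigned char *last_own    = band + rows * row_bytes;          /* last RADIUS own rows */
    unsigned char *bottom_halo = band + (RADIUS + rows) * row_bytes;

    /* send my top own rows up, receive my bottom halo from the rank below */
    MPI_Sendrecv(first_own,   RADIUS * row_bytes, MPI_UNSIGNED_CHAR, up,   0,
                 bottom_halo, RADIUS * row_bytes, MPI_UNSIGNED_CHAR, down, 0,
                 comm, MPI_STATUS_IGNORE);
    /* send my bottom own rows down, receive my top halo from the rank above */
    MPI_Sendrecv(last_own, RADIUS * row_bytes, MPI_UNSIGNED_CHAR, down, 1,
                 top_halo, RADIUS * row_bytes, MPI_UNSIGNED_CHAR, up,   1,
                 comm, MPI_STATUS_IGNORE);
    /* border ranks have MPI_PROC_NULL neighbours, so their outer halos are simply left untouched */
}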
For something like (e.g.) Floyd–Steinberg dithering things get a lot more complicated because one output pixel depends on all previous output pixels (in addition to the input pixels). In this case you can split the "3 colour" bitmap into three separate monochrome bitmaps (one for each processor, for up to 3 processors) and each processor can dither its monochrome bitmap, and then you can merge the three resulting monochrome bitmaps back together to get a single "3 colour" output bitmap; but it's virtually impossible to use more than 3 processors (without changing to a different dithering algorithm that's more suited to parallelisation).
For drawing one circle or one ellipse, you might get each processor to draw an arc and combine the arcs; for drawing 1234 shapes you might split the image into a grid and get each processor to do a tile within the grid.

Related

What is the difference between Static HDR and dynamic HDR?

HDR (high dynamic range) is widely used in video devices to provide a better viewing experience.
What is the difference between static HDR and dynamic HDR?
Dynamic HDR can achieve higher HDR media quality across a variety of displays.
The following presentation: SMPTE ST 2094 and Dynamic Metadata summarizes the subject of Dynamic Metadata:
Dynamic Metadata for Color Volume Transforms (DMCVT)
- Can preserve the creative intent in HDR media across a variety of displays
- Carried in files, video streams, packaged media
- Standardized in SMPTE ST 2094
It all starts with digital Quantization.
Assume you need to approximate the numbers between 0 and 1,000,000 using only 1000 possible values.
Your first option is uniform quantization:
Values in the range [0, 999] are mapped to 0, values in [1000, 1999] are mapped to 1, values in [2000, 2999] are mapped to 2, and so on...
When you need to restore the original data, you can't restore it exactly, so you restore each quantized value to the one with minimal average error:
quantized value 0 is mapped back to 500 (the center of the range [0, 999]),
quantized value 1 is mapped back to 1500 (the center of the range [1000, 1999]), and so on.
When you restore the quantized data, you lose a lot of information.
The information you lose is called "quantization error".
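For concreteness, here is a tiny C sketch of this uniform quantization and reconstruction (purely illustrative numbers; the step is 1000 because 1,000,000 is divided into 1000 levels):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    long value = 123456;                     /* some original brightness  */
    long step  = 1000;                       /* 1,000,000 / 1000 levels   */

    long q     = value / step;               /* quantized code: 0..999    */
    long recon = q * step + step / 2;        /* restore to bin center     */

    printf("code=%ld reconstructed=%ld error=%ld\n",
           q, recon, labs(value - recon));   /* error is at most step/2   */
    return 0;
}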
Common HDR video uses 10 bits per color component (10 bits for the Y component, 10 bits for U and 10 bits for V), or 10 bits for red, 10 for green and 10 for blue in RGB color space.
10 bits can store 1024 possible values (values in range [0, 1023]).
Assume you have a very good monitor that can display 1,000,001 different brightness levels (0 is darkest and 1000000 is the brightest).
Now you need to quantize the 1,000,001 levels to 1024 values.
Since the response of the human visual system to brightness is not linear, the uniform quantization illustrated above is sub-optimal.
The quantization to 10 bits is performed after applying a gamma function.
Example of a gamma function: divide each value by 1,000,000 (the new range is [0, 1]), compute the square root of each value, and multiply the result by 1,000,000.
Apply the quantization after the gamma function.
The result is that more accuracy is kept for the darker values, at the expense of the brighter values.
The monitor does the opposite operations (de-quantization and the inverse gamma).
Performing the quantization after applying the gamma function gives better quality for the human visual system.
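Here is a small, illustrative C program (it simply follows the square-root example above, not any standard transfer function) that quantizes one level to 10 bits with and without the gamma step, so you can see that a dark value is reconstructed more accurately when the gamma is applied first:

#include <math.h>
#include <stdio.h>

#define MAX_LEVEL 1000000.0
#define CODES     1024.0            /* 10 bits */

int main(void)
{
    double level = 2000.0;          /* a dark input level */

    /* linear: quantize directly */
    int    lin_code  = (int)(level / MAX_LEVEL * (CODES - 1) + 0.5);
    double lin_recon = lin_code / (CODES - 1) * MAX_LEVEL;

    /* "gamma": square root before quantization, square after de-quantization */
    int    g_code  = (int)(sqrt(level / MAX_LEVEL) * (CODES - 1) + 0.5);
    double g_recon = pow((double)g_code / (CODES - 1), 2.0) * MAX_LEVEL;

    printf("linear: code=%d recon=%.0f  gamma: code=%d recon=%.0f\n",
           lin_code, lin_recon, g_code, g_recon);
    return 0;
}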
In reality, square root is not the best gamma function.
There are three types of standard HDR static gamma functions:
HLG - Hybrid Log Gamma
PQ - Perceptual Quantizer
HDR10 - Static Metadata
Can we do better?
What if we could select the optimal "gamma functions" for each video frame?
Example for Dynamic Metadata:
Consider the case where all the brightness levels in the image are in range [500000, 501000]:
Now we can map all the levels to 10 bits, without any quantization.
All we need to do is send 500000 as the minimum level and 501000 as the maximum level in the image metadata.
Instead of quantization, we can just subtract 500000 from each value.
The monitor that receives the image, reads the metadata, and knows to add 500000 to each value - so there is a perfect data reconstruction (no quantization errors).
Assume the levels of the next image are in the range 400000 to 401000, so we need to adjust the metadata (dynamically).
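A toy C sketch of this idea (illustrative only; real dynamic metadata is far more involved, as noted below): the encoder sends the minimum level of the frame as metadata and stores only offsets, and the display adds the minimum back.

#include <stdio.h>

int main(void)
{
    long min_level = 500000;                 /* sent as frame metadata */
    long level     = 500731;                 /* an actual pixel level  */

    int  code  = (int)(level - min_level);   /* fits easily in 10 bits */
    long recon = min_level + code;           /* the display adds it back */

    printf("code=%d reconstructed=%ld (lossless)\n", code, recon);
    return 0;
}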
DMCVT - Dynamic Metadata for Color Volume Transform
The true math of DMCVT is much more complicated than the example above (and much more than quantization), but it's based on the same principles - adjusting the metadata dynamically according to the scene and display, can achieve better quality compared to static gamma (or static metadata).
In case you are still reading...
I am really not sure that the main advantage of DMCVT is reducing the quantization errors.
(It was just simpler to give an example of reducing the quantization errors).
Reducing the conversion errors:
Accurate conversion from the digital representation of the input (e.g. BT.2100) to the optimal pixel value of the display (like the RGB voltage of the pixel) requires "heavy math".
The conversion process is called Color Volume Transformation.
Displays replace the heavy computation with mathematical approximations (using look-up tables and interpolation, I suppose).
Another advantage of DMCVT is moving the "heavy math" from the display to the video post-production process.
The computational resources in the video post-production stage are orders of magnitude higher than the display resources.
In the post-production stage, the computers can calculate metadata that helps the display perform a much more accurate Color Volume Transformation (with fewer computational resources) and reduces the conversion errors considerably.
Why does "HDR static gamma functions" called static?
Opposed to DMCVT, the static gamma functions are fixed across the entire movie, or fixed (pre-defined) across the entire "system".
For example: Most PC systems (PC and monitors) are using sRGB color space (not HDR).
The sRGB standard uses a fixed gamma function (roughly a 1/2.4 power curve with a short linear segment near black).
Both the PC system and the display know in advance that they are working in the sRGB standard and which gamma function is used (without adding any metadata, or adding just one byte of metadata that marks the video data as sRGB).

Rendering image using Multithread

I have a ray tracing algorithm, which works with only 1 thread and I am trying to make it work with any number of threads.
My question is: how can I divide this task among the threads?
At first my instructor told me to just divide the width of the image. For example, if I have an 8x8 image and I want 2 threads to do the task, let thread 1 render columns 0 to 3 (all the way down vertically, of course) and thread 2 render columns 4 to 7.
I found this approach to work perfectly when both my image width and the number of threads are powers of 2, but I have no idea how to deal with an odd number of threads, or any number of threads that doesn't divide the width without a remainder.
My approach to this problem was to let the threads render the image by alternating columns. For example, for an 8x8 image and 3 threads:
thread 1 renders pixels 0, 3, 6 in the horizontal direction
thread 2 renders pixels 1, 4, 7 in the horizontal direction
thread 3 renders pixels 2, 5 in the horizontal direction
Sorry that I can't provide all my code, since there are more than 5 files with a few hundred lines of code in each one.
Here are the for loops that loop through the horizontal area; the vertical loop is nested inside these, but I am not going to provide it here.
My instructor's suggestion:
for( int px=(threadNum*(width/nthreads)); px < ((threadNum+1)*(width/nthreads)); ++px )
threadNum is the current thread that I am on (meaning thread 0,1,2 and so on)
width is the width of the image
nthreads is the overall number of threads.
My solution to this problem:
for( int px= threadNum; px< width; px+=nthreads )
I know my question is not so clear, and sorry, but I can't provide the whole code here. Basically all I am asking is: which is the best way to divide the rendering of the image among a given number of threads (which can be any positive number)? Also, I want the threads to render the image by columns, meaning I can't touch the part of the code which handles the vertical rendering.
Thank you, and sorry for chaotic question.
First thing, let me tell you that, under the assumption that the rendering of each pixel is independent of the other pixels, your task is what the HPC field calls an "embarrassingly parallel problem"; that is, a problem that can be efficiently divided between any number of threads (until each thread has a single "unit of work"), without any intercommunication between the processes (which is very good).
That said, it doesn't mean that any parallelization scheme is as good as any other. For your specific problem, I would say that the two main factors to keep in mind are load balancing and cache efficiency.
Load balancing means that you should divide the work among threads in a way that gives each thread roughly the same amount of work: this way you prevent one or more threads from waiting for that one last thread that has to finish its last job.
E.g.
You have 5 threads and you split your image into 5 big chunks (let's say 5 horizontal strips, but they could be vertical and it wouldn't change the point). The problem being embarrassingly parallel, you expect a 5x speedup, and instead you get a meager 1.2x.
The reason might be that your image has most of its computationally expensive detail in the lower part (I know nothing of rendering, but I assume that a reflective object might take far more time to render than flat empty space), because it is composed of a set of polished metal marbles on the floor of an otherwise empty frame.
In this scenario, only one thread (the one with the bottom 1/5 of the image) does all the work anyway, while the other 4 remain idle after finishing their brief tasks.
As you can imagine, this isn't a good parallelization: keeping load balancing alone in mind, the best parallelization scheme would be to assign interleaved pixels to each core to process, under the (very reasonable) assumption that the complexity of the image averages out across threads (true for natural images; it might yield surprises in very contrived scenarios).
With this solution, the work is evenly distributed among threads (statistically), and the worst-case scenario is N-1 threads waiting for a single thread to compute a single pixel (you wouldn't notice, performance-wise).
To do that you need to cycle over all pixels, forgetting about lines, in this way (pseudo code, not tested):
for(i = thread_num; i < width * height; i += n_threads)
The second factor, cache efficiency, deals with the way computers are designed, specifically the fact that they have many layers of cache to speed up computations and prevent the CPUs from starving (remaining idle while waiting for data); accessing data in the "right way" can speed up computations considerably.
It's a very complex topic, but in your case a rule of thumb might be: "feeding each thread the right amount of memory will improve the computation" (emphasis on "right amount" intended...).
It means that, even if passing interleaved pixels to each thread is probably the perfect balancing, it's probably also the worst possible memory access pattern you could devise, and you should pass "bigger chunks" to them, because this keeps the CPU busy (note: memory alignment also comes heavily into play: if your image has padding after each line to keep lines at multiples of, say, 32 bytes, like some image formats do, you should take that into consideration!!)
Without expanding an already verbose answer to alarming sizes, this is what I would do (I'm assuming the memory of the image is contiguous, without padding between lines!):
1. Create a program that splits the image into chunks of N consecutive pixels (use a preprocessor constant or a command-line argument for N, so you can change it!) for each of M threads, like this (a sketch of this chunked split follows the list):
1111111122222222333333334444444411111111
2. Do some profiling for various values of N, stepping from 1 to, let's say, 2048, by powers of two (good values to test might be: 1 to get a baseline, 32, 64, 128, 256, 512, 1024, 2048).
3. Find out where the sweet spot is between perfect load balancing (N=1) and best caching (N <= the biggest cache line in your system).
4a. Try the program on more than one system, and keep the smallest value of N that gives you the best results across the machines, in order to make your code run fast everywhere (as the caching details vary among systems).
4b. If you really, really want to squeeze every cycle out of every system you install your code on, forget step 4a and create code that automatically finds the best value of N by rendering a small test image before tackling the appointed task :)
5. Fool around with SIMD instructions (just kidding... sort of :) )
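Here is a minimal C sketch of step 1 (untested, illustrative names: render_pixel() is a stand-in for your per-pixel ray tracing routine, and CHUNK_N plays the role of N). It treats the image as one flat array of width*height pixels with no padding between lines:

#define CHUNK_N 64   /* tune this: 1 = perfect balance, larger = better caching */

void render_thread(int thread_num, int nthreads, int width, int height)
{
    int total = width * height;

    /* chunks of CHUNK_N consecutive pixels, handed out round-robin to threads */
    for (int chunk_start = thread_num * CHUNK_N;
         chunk_start < total;
         chunk_start += nthreads * CHUNK_N) {

        int chunk_end = chunk_start + CHUNK_N;
        if (chunk_end > total)
            chunk_end = total;

        for (int i = chunk_start; i < chunk_end; ++i) {
            int px = i % width;          /* column */
            int py = i / width;          /* row    */
            /* render_pixel(px, py);  -- hypothetical per-pixel routine */
            (void)px; (void)py;
        }
    }
}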
A bit theoretical (and overly long...), but still I hope it helps!
An alternating division of the columns will probably lead to a suboptimal cache usage. The threads should operate on a larger continuous range of data. By the way, if your image is stored row-wise it would also be better to distribute the rows instead of the columns.
This is one way to divide the data equally with any number of threads:
#define min(x,y) ((x)<(y)?(x):(y))
/*...*/
int q = width / nthreads;                    /* columns every thread gets at least     */
int r = width % nthreads;                    /* leftover columns                       */
int w = q + (threadNum < r);                 /* this thread's column count             */
int start = threadNum*q + min(threadNum,r);  /* first column assigned to this thread   */
for( int px = start; px < start + w; px++ )
/*...*/
The remainder r is distributed over the first r threads. This is important when calculating the start index for a thread.
For the 8x8 image this would lead to:
thread 0 renders columns 0-2
thread 1 renders columns 3-5
thread 2 renders columns 6-7
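For completeness, here is one possible way to drive this split with pthreads (a sketch only; render_column() is a hypothetical stand-in for your existing per-column rendering code):

#include <pthread.h>

#define MIN(x, y) ((x) < (y) ? (x) : (y))

typedef struct { int threadNum, nthreads, width, height; } job_t;

extern void render_column(int px, int height);   /* assumed to exist in your code */

static void *worker(void *arg)
{
    job_t *j = (job_t *)arg;
    int q = j->width / j->nthreads;
    int r = j->width % j->nthreads;
    int w = q + (j->threadNum < r);
    int start = j->threadNum * q + MIN(j->threadNum, r);

    for (int px = start; px < start + w; ++px)
        render_column(px, j->height);            /* render one full column */
    return NULL;
}

void render_parallel(int nthreads, int width, int height)
{
    pthread_t tid[64];
    job_t job[64];                               /* assumes nthreads <= 64 */

    for (int t = 0; t < nthreads; ++t) {
        job[t] = (job_t){ t, nthreads, width, height };
        pthread_create(&tid[t], NULL, worker, &job[t]);
    }
    for (int t = 0; t < nthreads; ++t)
        pthread_join(tid[t], NULL);
}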

Antipole Clustering

I made a photo mosaic script (PHP). This script takes one picture and changes it into a photo built up of little pictures. From a distance it looks like the real picture; when you move closer you see that it is all little pictures. I take a square of a fixed number of pixels and determine the average color of that square. Then I compare this with my database, which contains the average color of a couple of thousand pictures. I determine the color distance to all available images. But running this script fully takes a couple of minutes.
The bottleneck is matching the best picture with a part of the main picture. I have been searching online for how to reduce this and came across "Antipole Clustering." Of course I tried to find some information on how to use this method myself, but I can't seem to figure out what to do.
There are two steps. 1. Database acquisition and 2. Photomosaic creation.
Let's start with step one; when this is all clear, maybe I'll understand step 2 myself.
Step 1:
partition each image of the database into 9 equal rectangles arranged in a 3x3 grid
compute the RGB mean values for each rectangle
construct a vector x composed by 27 components (three RGB components for each rectangle)
x is the feature vector of the image in the data structure
Well, points 1 and 2 are easy, but what should I do at point 3? How do I compose a vector x out of the 27 components (9 × the R mean, G mean, and B mean)?
And when I succeed in composing the vector, what is the next step I should do with this vector?
Peter
Here is how I think the feature vector is computed:
You have 3 x 3 = 9 rectangles.
Each pixel is essentially 3 numbers, 1 for each of the Red, Green, and Blue color channels.
For each rectangle you compute the mean for the red, green, and blue colors for all the pixels in that rectangle. This gives you 3 numbers for each rectangle.
In total, you have 9 (rectangles) x 3 (mean for R, G, B) = 27 numbers.
Simply concatenate these 27 numbers into a single 27 by 1 (often written as 27 x 1) vector, that is, 27 numbers grouped together. This vector of 27 numbers is the feature vector X that represents the color statistics of your photo. In code, if you are using C++, this will probably be an array of 27 numbers, or perhaps even an instance of the (aptly named) vector class. You can think of this feature vector as a form of "summary" of what the color in the photo is like. Roughly, it looks like this: [R1, G1, B1, R2, G2, B2, ..., R9, G9, B9], where R1 is the mean/average of the red pixels in the first rectangle, and so on.
I believe step 2 involves some form of comparing these feature vectors so that those with similar feature vectors (and hence similar color) will be placed together. Comparison will likely involve the use of the Euclidean distance (see here), or some other metric, to compare how similar the feature vectors (and hence the photos' color) are to each other.
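If it helps, here is a rough C sketch of both steps (illustrative names; it assumes an interleaved 8-bit RGB image stored row-major): building the 27-component vector and comparing two such vectors with the Euclidean distance.

#include <math.h>

/* mean R, G, B for each cell of a 3x3 grid -> 27-component feature vector */
void feature_vector(const unsigned char *rgb, int width, int height,
                    double x[27])
{
    for (int gy = 0; gy < 3; ++gy)
        for (int gx = 0; gx < 3; ++gx) {
            int x0 = gx * width / 3,  x1 = (gx + 1) * width / 3;
            int y0 = gy * height / 3, y1 = (gy + 1) * height / 3;
            double sum[3] = { 0, 0, 0 };
            long n = 0;

            for (int py = y0; py < y1; ++py)
                for (int px = x0; px < x1; ++px) {
                    const unsigned char *p = rgb + 3 * (py * width + px);
                    sum[0] += p[0]; sum[1] += p[1]; sum[2] += p[2];
                    ++n;
                }
            for (int c = 0; c < 3; ++c)
                x[(gy * 3 + gx) * 3 + c] = sum[c] / n;   /* mean R, G, B of this cell */
        }
}

/* Euclidean distance between two feature vectors; smaller = more similar */
double feature_distance(const double a[27], const double b[27])
{
    double d2 = 0.0;
    for (int i = 0; i < 27; ++i)
        d2 += (a[i] - b[i]) * (a[i] - b[i]);
    return sqrt(d2);
}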
Lastly, as Anony-Mousse suggested, converting your pixels from RGB to HSB/HSV color would be preferable. If you use OpenCV or have access to it, this is a simple one-liner. Otherwise the Wikipedia article on HSV etc. will give you the math formulas to perform the conversion.
Hope this helps.
Instead of using RGB, you might want to use HSB space. It gives better results for a wide variety of use cases. Put more weight on hue to get better color matches for photos, or on brightness when composing high-contrast images (logos etc.)
I have never heard of antipole clustering. But the obvious next step would be to put all the images you have into a large index. Say, an R-Tree. Maybe bulk-load it via STR. Then you can quickly find matches.
Maybe it means vector quantization (VQ). In VQ the image isn't subdivided into rectangles but into density areas. Then you can take the mean point of each cluster. First you need to take all colors and pixels separately and transfer them to a vector with XY coordinates. Then you can use density clustering like Voronoi cells and get the mean point. You can compare this point with other pictures in the database. Read here about VQ: http://www.gamasutra.com/view/feature/3090/image_compression_with_vector_.php.
How to compute a vector from adjacent pixels:
d(x) = I(x+1,y) - I(x,y)
d(y) = I(x,y+1) - I(x,y)
Here's another link: http://www.leptonica.com/color-quantization.html.
Update: When you have already computed the mean color of your thumbnails, you can proceed to sort all the mean colors in an RGB map and use the formula I gave you to compute the vector x. Now that you have a vector for all your thumbnails, you can use the antipole tree to search for a thumbnail. This is possible because the antipole tree is something like a kd-tree and subdivides the 2D space. Read here about the antipole tree: http://matt.eifelle.com/2012/01/17/qtmosaic-0-2-faster-mosaics/. Maybe you can ask the author and download the source code?

OpenCV lower color values

I was wondering if there is a way to lower the color scheme of an image. Let's say I have an image that has a 32-bit color range in RGB. I was wondering if it would be possible to scale it down to, perhaps, an 8-bit color scheme. This would be similar to a "cartoon" filter in applications like Photoshop, or changing your screen color space from 32-bit true color to 256 colors.
Thanks
If you want the most realistic result, take a look at colour quantisation. Basically, find the blocks of pixels with a similar RGB colour and replace them with a single colour; you are trying to minimize the number of pixels that are changed and the amount each new pixel differs from its original colour - so it's a space parameterisation problem.
Well, you could do convertTo(newimg, CV_8U) to convert it to 8 bits per channel, but that's still 16 million colors. If the image has integer pixel values you can also do val = val / reductionFactor * reductionFactor + reductionFactor / 2 (or some optimization thereof) on each pixel's R, G, and B values for arbitrary reduction factors, or val = (val & mask) + (reductionFactor >> 1) for reduction factors that are a power of two.
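For example, a rough sketch of that per-channel reduction as a plain C loop over an interleaved 8-bit RGB buffer (names are illustrative; with reduction_factor = 64 each channel keeps 4 levels, i.e. 64 colors in total):

void reduce_colors(unsigned char *rgb, int num_pixels, int reduction_factor)
{
    for (int i = 0; i < num_pixels * 3; ++i) {
        int v = rgb[i];
        /* snap each channel to the center of its bin */
        rgb[i] = (unsigned char)(v / reduction_factor * reduction_factor
                                 + reduction_factor / 2);
    }
}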
Have you tried the pyramidal Mean Shift filter example program given in the samples with OpenCV? The mention of "cartoon" filter reminded me of it - the colors are flattened and subtle shades are merged and reduced resulting in a reduction in the number of colors present.
The reduction is based on a threshold and some experimentation should surely get satisfactory results.

Search image pattern

I need to write a program that does this: given an image (5*5 pixels), I have to search how many images like it exist in another image, which is composed of many other images. That is, I need to search for a given pattern in an image.
The language to use is C. I have to use parallel computing to search at the 4 angles (0°, 90°, 180° and 270°).
What is the best way to do that?
Seems straightforward.
Create 4 versions of the image rotated by 0°, 90°, 180°, and 270°.
Start four threads, each with one version of the image.
For all positions from (0,0) to (width - 5, height - 5):
compare the 25 pixels of the reference image with the 25 pixels at the current position;
if they are equal enough using some metric, report the finding.
Use normalized correlation to determine a match of templates.
@Daniel's solution is good for leveraging your multiple CPUs. He doesn't mention a quality metric, so I would like to suggest one quality metric that is very common in image processing.
I suggest using normalized correlation [1] as a comparison metric because it outputs a number from -1 to +1, where 0 means no correlation, 1 is output if the two templates are identical, and -1 if the two templates are exactly opposite.
Once you compute the normalized correlation you can test to see if you have found the template by doing either a threshold test or a peak-to-average test[2].
[1 - footnote] How do you implement normalized correlation? It is pretty simple and only has two for loops. Once you have an implementation that is good enough you can verify your implementation by checking to see if the identical image gets you a 1.
[2 - footnote] You compute the ratio max(array) / average(array_without_peak), then threshold it to make sure you have a good peak-to-average ratio.
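For reference, here is one possible implementation of footnote [1] in plain C (illustrative; it assumes single-channel 8-bit data and a 5x5 template, and the caller must ensure the window fits inside the image). It returns a value in [-1, 1]:

#include <math.h>

double ncc_5x5(const unsigned char *img, int img_width, int x0, int y0,
               const unsigned char *tpl /* 5x5, row-major */)
{
    /* means of the image window and of the template */
    double mean_i = 0.0, mean_t = 0.0;
    for (int y = 0; y < 5; ++y)
        for (int x = 0; x < 5; ++x) {
            mean_i += img[(y0 + y) * img_width + (x0 + x)];
            mean_t += tpl[y * 5 + x];
        }
    mean_i /= 25.0;
    mean_t /= 25.0;

    /* zero-mean cross-correlation, normalized by both variances */
    double num = 0.0, var_i = 0.0, var_t = 0.0;
    for (int y = 0; y < 5; ++y)
        for (int x = 0; x < 5; ++x) {
            double di = img[(y0 + y) * img_width + (x0 + x)] - mean_i;
            double dt = tpl[y * 5 + x] - mean_t;
            num   += di * dt;
            var_i += di * di;
            var_t += dt * dt;
        }
    if (var_i == 0.0 || var_t == 0.0)
        return 0.0;                 /* flat patch: treat as no correlation */
    return num / sqrt(var_i * var_t);
}

A quick sanity check, as the footnote suggests, is to pass the template as both arguments (at offset 0,0) and verify you get 1.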
There's no need to create the additional three versions of the image, just address them differently or use something like the class I created here. Better still, just duplicate the 5x5 matrix and rotate that instead. You can then linearly scan the image for all rotations (which is a good thing).
This problem will not scale well for parallel processing since the bottleneck is certainly accessing the image data. Having multiple threads accessing the same data will slow it down, especially if the threads get 'out of sync', i.e. one thread gets further through the image than the other threads so that the other threads end up reloading the data the first thread has discarded.
So, the solution I think will be most efficient is to create four threads that each scan 5 lines of the image, one thread per rotation. A fifth thread loads the image data one line at a time and passes the line to each of the four scanning threads, waiting for all four threads to complete: load one line of the image, append it to a five-line buffer, start the four scanning threads, wait for the threads to end, and repeat until all image lines have been read.
5 * 5 = 25.
25 bits fit in an integer, so (for a binary image) each rotation of the 5x5 pattern can be encoded as one integer, and the pattern with its four rotations becomes an array of 4 integers.
Iterate over your larger image (hopefully it is not too big),
pulling out every 5 * 5 sub-image, converting it to an integer and comparing it against the 4 encoded rotations.
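A sketch of that packing (assuming a binary 0/1 image; names are illustrative):

#include <stdint.h>

/* pack the 5x5 window at (x0, y0) into the low 25 bits of an integer */
uint32_t pack_5x5(const unsigned char *img, int img_width, int x0, int y0)
{
    uint32_t code = 0;
    for (int y = 0; y < 5; ++y)
        for (int x = 0; x < 5; ++x)
            code = (code << 1) | (img[(y0 + y) * img_width + (x0 + x)] & 1u);
    return code;
}

/* usage idea: report a match if pack_5x5(image, w, x, y) equals any of the
 * four pre-packed rotations of the 5x5 pattern */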
