I have a ray tracing algorithm that currently works with only 1 thread, and I am trying to make it work with any number of threads.
My question is: how should I divide this task among threads?
At first my instructor told me to just divide the width of the image. For example, if I have an 8x8 image and I want 2 threads to do the task, let thread 1 render columns 0 to 3 (all the way down vertically, of course) and thread 2 render columns 4 to 7.
I found this approach to work perfectly when both my image width and the number of threads are powers of 2, but I have no idea how to deal with an odd number of threads, or any number of threads that doesn't divide the width without a remainder.
My approach to this problem was to let the threads render the image in an alternating pattern. For example, with an 8x8 image and, let's say, 3 threads:
thread 1 renders pixels 0, 3, 6 in the horizontal direction
thread 2 renders pixels 1, 4, 7 in the horizontal direction
thread 3 renders pixels 2, 5 in the horizontal direction
Sorry that I can't provide all my code, since there are more than 5 files with a few hundred lines of code in each one.
Here are the for loops that iterate over the horizontal area; the vertical loop is nested inside these, but I am not going to include it here.
My instructor's suggestion
for( int px=(threadNum*(width/nthreads)); px < ((threadNum+1)*(width/nthreads)); ++px )
threadNum is the current thread that I am on (meaning thread 0,1,2 and so on)
width is the width of the image
nthreads is the overall number of threads.
My solution to this problem
for( int px= threadNum; px< width; px+=nthreads )
I know my question is not very clear, and I'm sorry that I can't provide the whole code here, but basically all I am asking is: what is the best way to divide the rendering of the image among a given number of threads (which can be any positive number)? Also, I want the threads to render the image by columns, meaning I can't touch the part of the code which handles vertical rendering.
Thank you, and sorry for the chaotic question.
First, let me tell you that, under the assumption that the rendering of each pixel is independent of the other pixels, your task is what the HPC field calls an "embarrassingly parallel" problem: one that can be efficiently divided among any number of threads (until each thread has a single "unit of work"), without any communication between the processes (which is very good).
That said, it doesn't mean that any parallelization scheme is as good as any other. For your specific problem, I would say that the two main factors to keep in mind are load balancing and cache efficiency.
Load balancing means that you should divide the work among threads in a way that gives each thread roughly the same amount of work: in this way you prevent one or more threads from waiting for that one last thread that has to finish its last job.
E.g.
You have 5 threads and you split your image into 5 big chunks (let's say 5 horizontal strips, but they could be vertical and it wouldn't change the point). Since the problem is embarrassingly parallel, you expect a 5x speedup, and instead you get a meager 1.2x.
The reason might be that your image has most of its computationally expensive detail in the lower part (I know nothing of rendering, but I assume that a reflective object might take far more time to render than flat empty space), because it is composed of a set of polished metal marbles on the floor of an otherwise empty frame.
In this scenario, only one thread (the one with the bottom 1/5 of the image) does all the work anyway, while the other 4 remain idle after finishing their brief tasks.
As you can imagine, this isn't a good parallelization: with load balancing alone in mind, the best scheme would be to assign interleaved pixels to each core, under the (very reasonable) assumption that the complexity of the image averages out across threads (true for natural images; it might yield surprises in very contrived scenarios).
With this solution, the work is evenly distributed among threads (statistically), and the worst-case scenario is N-1 threads waiting for a single thread to compute a single pixel (you wouldn't notice it, performance-wise).
To do that you need to cycle over all pixels forgetting about lines, in this way (pseudo code, not tested):
for(i = thread_num; i < width * height; i += num_threads)  /* step by the total thread count, not the thread index */
The second factor, cache efficiency, deals with the way computers are designed: specifically, the fact that they have many layers of cache to speed up computation and prevent the CPUs from starving (remaining idle while waiting for data). Accessing data in the "right way" can speed up computation considerably.
It's a very complex topic, but in your case a rule of thumb might be: "feeding each thread the right amount of memory will improve the computation" (emphasis on "right amount" intended...).
It means that, even if handing each thread interleaved pixels is probably the perfect balancing, it's probably also the worst possible memory access pattern you could devise, and you should pass "bigger chunks" instead, because this keeps the CPU busy. (Note: memory alignment also comes heavily into play: if your image pads each line to a multiple of, say, 32 bytes, as some image formats do, you should take that into consideration!)
Without expanding an already verbose answer to alarming sizes, this is what I would do (I'm assuming the memory of the image is contiguous, without padding between lines!):
1. Create a program that splits the image into chunks of N consecutive pixels (use a preprocessor constant or a command-line argument for N, so you can change it!) for each of M threads, like this:
1111111122222222333333334444444411111111
2. Do some profiling for various values of N, stepping from 1 to, let's say, 2048, by powers of two (good values to test might be: 1 to get a baseline, then 32, 64, 128, 256, 512, 1024, 2048).
3. Find out where the perfect balance is between perfect load balancing (N=1) and best caching (N <= the biggest cache line in your system).
4a. Try the program on more than one system, and keep the smallest value of N that gives you the best test results among the machines, in order to make your code run fast everywhere (as the caching details vary among systems).
4b. If you really, really want to squeeze every cycle out of every system you install your code on, forget step 4a and write code that automatically finds the best value of N by rendering a small test image before tackling the appointed task :)
5. Fool around with SIMD instructions (just kidding... sort of :) )
A bit theoretical (and overly long...), but still I hope it helps!
An alternating division of the columns will probably lead to suboptimal cache usage. The threads should operate on a larger contiguous range of data. By the way, if your image is stored row-wise, it would also be better to distribute the rows instead of the columns.
This is one way to divide the data equally with any number of threads:
#define min(x,y) ((x)<(y)?(x):(y))
/*...*/
int q = width / nthreads;
int r = width % nthreads;
int w = q + (threadNum < r);
int start = threadNum*q + min(threadNum,r);
for( int px = start; px < start + w; px++ )
/*...*/
The remainder r is distributed over the first r threads. This is important when calculating the start index for a thread.
For the 8x8 image with 3 threads, this would lead to:
thread 0 renders columns 0-2
thread 1 renders columns 3-5
thread 2 renders columns 6-7
let A be an MxN matrix with entries a_ij in {0, ..., n-1}.
we can think of the entries in A as a rectangular grid that has been n-colored.
I am interested in partitioning each colored region into rectangles in such a way that the total number of rectangles is minimized. That is, I want to produce n sets of quadruples
L_k = {(i, j, w, h) | a_xy = k for all i <= x < i + w, j <= y < j + h}
satisfying the condition that every a_ij belongs to exactly one rectangle and all of the rectangles are disjoint. Furthermore, the sum
|L_0| + ... + |L_(n-1)| is minimized.
Obviously, minimizing each of the L_k can be done independently, but there is also a requirement that this happen extremely fast. Assume this is a real-time application. It may be the case that since the sets are disjoint, sharing information between the L_ks speeds things up more than doing everything in parallel. n can be small (say, <100) and M and N can be large.
I assume there is a dynamic programming approach to this, or maybe there is a way to rephrase it as a graph problem, but it is not immediately obvious to me how to approach this.
EDIT:
There seems to be some confusion about what I mean. Here is a picture to help illustrate.
Imagine this is a 10x10 matrix with red = 0, green = 1, and blue = 2. Draw the black boxes like so, minimizing the number of boxes. The output here would be
L_0 = {(0,8,2,2),(1,7,2,1),(2,8,1,1),(4,5,4,2),(6,7,2,2)}
L_1 = {(0,0,4,4),(4,0,6,2),(6,2,2,3),(8,8,2,2)}
L_2 = {(0,4,4,3),(0,7,1,1),(2,9,6,1),(3,7,3,2),(4,2,2,4),(8,2,2,6)}
One thing to immediately do is to note that you can separate the problem into individual instances of connected regions of colors. From there, the post linked in the article explains how you can use a maximal matching to construct the optimal solution.
But it's probably quite hard to implement (and judging by your tag of C, even harder). So I recommend one of two strategies: backtracking, or greedy.
To do backtracking, you will recurse on the set of tiles which are not yet covered. (I assume this makes sense, as you have listed all integer coordinates; otherwise the approach changes, but not massively.) Take the topmost, leftmost uncovered tile and loop over all possible rectangles which contain it (there are only ~n^2 of them, and hopefully fewer). Then recurse.
To optimize this, you will want to prune. An easy way to prune this is to stop recursing if you have already seen a solution with a better answer. This pruning is known as branch and bound.
You can also just quit early in the backtracking, if you only need an approximate answer. Since you mentioned "real-time application," it might be okay if you're only off by a little.
Continuing with the idea of approximation, you could also do something similar by just greedily picking the largest rectangle that you can at the moment. This is annoying to implement, but doable. You can also combine this with recursion and backing out early.
I am trying to implement collaborative diffusion behaviour for the first time, and I am stuck on a problem. I understand how to make obstacles block the diffusion of scents, and how to dampen a scent for other friendly agents if one of them is already pursuing it. What I cannot understand is how to make scents distribute evenly in the matrix.
It seems to me that every way of iterating over the matrix makes the scent distribute faster and better in the tiles I check later in the iteration. I mean, if I iterate from i to maxRows and j to maxCols, and then apply the diffusion equation in every tile, then on the 'north' and 'west' sides of the goal I will have only one tile with the correct potential, whereas on the 'east' and 'south' sides I will have more of them, since their neighbours already have an assigned potential.
How can I make the values distribute evenly? A double iteration from both extremities of the matrix, combining the results, seems like a memory-eater, as does a goal-oriented approach: if I try to start from the goals and work around them, I will have to execute the calculations for every goal and every tile with an assigned potential, which means roughly 4^(turns since diffusion started) * nrOfGoals extra calculations every turn, which seems inefficient in a large matrix with a lot of goals.
My question is how can I evenly distribute the values in the matrix in an efficient way. I'm using the AiChallenge Ants, if that helps in any way!
I thank you in anticipation and I'm sorry for the grammar mistakes I've made in this post.
There may be a better solution, but the easiest way to do it is to use something similar to how a simple implementation of the game of life is done.
You have two buffers. One has the current "generation" of scent (and if you are doing multitasking, it can be locked so only readers can look at it), and another has the next generation of scent being calculated. You only "mix" scents from the current generation.
Once you are done, you swap the two buffers by simply changing the pointers / references.
Another way to think about it would be to have all the tiles calculate their new scent by asking their neighbors and averaging. When asked by their neighbors what their scent level is, they report their pre-calculated values from the previous pass. The new scent is only locked in once everyone has finished calculating.
Background
I work with very large datasets from Synthetic Aperture Radar satellites. These can be thought of as high dynamic range greyscale images of the order of 10k pixels on a side.
Recently, I've been developing applications of a single-scale variant of Lindeberg's scale-space ridge detection algorithm for detecting linear features in a SAR image. This is an improvement on using directional filters or the Hough Transform, methods that have both previously been used, because it is less computationally expensive than either. (I will be presenting some recent results at JURSE 2011 in April, and I can upload a preprint if that would be helpful.)
The code I currently use generates an array of records, one per pixel, each of which describes a ridge segment in the rectangle to bottom right of the pixel and bounded by adjacent pixels.
struct ridge_t { unsigned char top, left, bottom, right; };
int rows, cols;
struct ridge_t *ridges; /* An array of rows*cols ridge entries */
An entry in ridges contains a ridge segment if exactly two of top, left, right and bottom have values in the range 0 - 128. Suppose I have:
ridge_t entry;
entry.top = 25; entry.left = 255; entry.bottom = 255; entry.right = 76;
Then I can find the ridge segment's start (x1,y1) and end (x2,y2):
float x1, y1, x2, y2;
x1 = (float) col + (float) entry.top / 128.0;
y1 = (float) row;
x2 = (float) col + 1;
y2 = (float) row + (float) entry.right / 128.0;
When these individual ridge segments are rendered, I get an image something like this (a very small corner of a far larger image):
Each of those long curves are rendered from a series of tiny ridge segments.
It's trivial to determine whether two adjacent locations which contain ridge segments are connected. If I have ridge1 at (x, y) and ridge2 at (x+1, y), then they are parts of the same line if 0 <= ridge1.right <= 128 and ridge2.left == ridge1.right.
Problem
Ideally, I would like to stitch together all of the ridge segments into lines, so that I can then iterate over each line found in the image to apply further computations. Unfortunately, I'm finding it hard to find an algorithm for doing this which is low complexity, memory-efficient, and suitable for multiprocessing (all important considerations when dealing with really huge images!).
One approach that I have considered is scanning through the image until I find a ridge which only has one linked ridge segment, and then walking the resulting line, flagging any ridges in the line as visited. However, this is unsuitable for multiprocessing, because there's no way to tell if there isn't another thread walking the same line from the other direction (say) without expensive locking.
What do readers suggest as a possible approach? It seems like the sort of thing that someone would have figured out an efficient way to do in the past...
I'm not entirely sure this is correct, but I thought I'd throw it out for comment. First, let me introduce a lockless disjoint set algorithm, which will form an important part of my proposed algorithm.
Lockless disjoint set algorithm
I assume the presence of a two-pointer-sized compare-and-swap operation on your choice of CPU architecture. This is available on x86 and x64 architectures at the least.
The algorithm is largely the same as described on the Wikipedia page for the single threaded case, with some modifications for safe lockless operation. First, we require that the rank and parent elements to both be pointer-sized, and aligned to 2*sizeof(pointer) in memory, for atomic CAS later on.
Find() need not change; the worst case is that the path compression optimization will fail to have full effect in the presence of simultaneous writers.
Union() however, must change:
function Union(x, y)
redo:
x = Find(x)
y = Find(y)
if x == y
return
xSnap = AtomicRead(x) -- read both rank and pointer atomically
ySnap = AtomicRead(y) -- this operation may be done using a CAS
if (xSnap.parent != x || ySnap.parent != y)
goto redo
-- Ensure x has lower rank (meaning y will be the new root)
if (xSnap.rank > ySnap.rank)
swap(xSnap, ySnap)
swap(x, y)
-- if same rank, use pointer value as a fallback sort
else if (xSnap.rank == ySnap.rank && x > y)
swap(xSnap, ySnap)
swap(x, y)
yNew = ySnap
yNew.rank = max(yNew.rank, xSnap.rank + 1)
xNew = xSnap
xNew.parent = y
if (!CAS(y, ySnap, yNew))
goto redo
if (!CAS(x, xSnap, xNew))
goto redo
return
This should be safe in that it will never form loops, and will always result in a proper union. We can confirm this by observing that:
First, prior to termination, one of the two roots will always end up with a parent pointing to the other. Therefore, as long as there is no loop, the merge succeeds.
Second, rank always increases. After comparing the order of x and y, we know x has lower rank than y at the time of the snapshot. In order for a loop to form, another thread would need to have increased x's rank first, then merged x and y. However in the CAS that writes x's parent pointer, we check that rank has not changed; therefore, y's rank must remain greater than x.
In the event of simultaneous mutation, it is possible that y's rank may be increased, then return to redo due to a conflict. However, this implies that either y is no longer a root (in which case rank is irrelevant) or that y's rank has been increased by another process (in which case the second go around will have no effect and y will have correct rank).
Therefore, there should be no chance of loops forming, and this lockless disjoint-set algorithm should be safe.
And now on to the application to your problem...
Assumptions
I make the assumption that ridge segments can only intersect at their endpoints. If this is not the case, you will need to alter phase 1 in some manner.
I also make the assumption that co-habitation of a single integer pixel location is sufficient for ridge segments to be considered connected. If not, you will need to change the array in phase 1 to hold multiple candidate ridge-segment/disjoint-set pairs, and filter through them to find ones that are actually connected.
The disjoint set structures used in this algorithm shall carry a reference to a line segment in their structures. In the event of a merge, we choose one of the two recorded segments arbitrarily to represent the set.
Phase 1: Local line identification
We start by dividing the map into sectors, each of which will be processed as a separate job. Multiple jobs may be processed in different threads, but each job will be processed by only one thread. If a ridge segment crosses a sector boundary, it is split into two segments, one for each sector.
For each sector, an array mapping pixel position to a disjoint-set structure is established. Most of this array will be discarded later, so its memory requirements should not be too much of a burden.
We then proceed over each line segment in the sector. We first choose a disjoint set representing the entire line the segment forms a part of. We first look up each endpoint in the pixel-position array to see if a disjoint set structure has already been assigned. If one of the endpoints is already in this array, we use the assigned disjoint set. If both are in the array, we perform a merge on the disjoint sets, and use the new root as our set. Otherwise, we create a new disjoint-set, and associate with the disjoint-set structure a reference to the current line segment. We then write back into the pixel-position array our new disjoint set's root for each of our endpoints.
This process is repeated for each line segment in the sector; by the end, we will have identified all lines completely within the sector by a disjoint set.
Note that since the disjoint sets are not yet shared between threads, there's no need to use compare-and-swap operations yet; simply use the normal single-threaded union-merge algorithm. Since we do not free any of the disjoint set structures until the algorithm completes, allocation can also be made from a per-thread bump allocator, making memory allocation (virtually) lockless and O(1).
Once a sector is completely processed, all data in the pixel-position array is discarded; however data corresponding to pixels on the edge of the sector is copied to a new array and kept for the next phase.
Since iterating over the entire image is O(x*y), and a disjoint-merge is effectively O(1), this operation is O(x*y) and requires working memory O(m + 2*x*y/k + k^2) = O(x*y/k + k^2), where k is the width of a sector and m is the number of partial line segments in the sector (depending on how often lines cross borders, m may vary significantly, but it will never exceed the number of line segments). The memory carried over to the next phase is O(m + 2*x*y/k) = O(x*y/k).
Phase 2: Cross-sector merges
Once all sectors have been processed, we then move to merging lines that cross sectors. For each border between sectors, we perform lockless merge operations on lines that cross the border (ie, where adjacent pixels on each side of the border have been assigned to line sets).
This operation has running time O(x+y) and consumes O(1) memory (we must retain the memory from phase 1 however). Upon completion, the edge arrays may be discarded.
Phase 3: Collecting lines
We now perform a multi-threaded map operation over all allocated disjoint-set structure objects. We first skip any object which is not a root (ie, where obj.parent != obj). Then, starting from the representative line segment, we move out from there and collect and record any information desired about the line in question. We are assured that only one thread is looking at any given line at a time, as intersecting lines would have ended up in the same disjoint-set structure.
This has O(m) running time, and memory usage dependent on what information you need to collect about these line segments.
Summary
Overall, this algorithm should have O(x*y) running time, and O(x*y/k + k^2) memory usage. Adjusting k gives a tradeoff between transient memory usage on the phase 1 processes, and the longer-term memory usage for the adjacency arrays and disjoint-set structures carried over into phase 2.
Note that I have not actually tested this algorithm's performance in the real world; it is also possible that I have overlooked concurrency issues in the lockless disjoint-set union-merge algorithm above. Comments welcome :)
You could use a non-generalized form of the Hough Transform. It can reach an impressive O(N) time complexity on N x N mesh arrays (if you've got access to ~10000x10000 SIMD arrays and your mesh is N x N; note that in your case N would be a ridge struct, or a cluster of A x B ridges, NOT a pixel). More conservative (non-kernel) solutions list the complexity as O(kN^2), where k = [-π/2, π].
However, the Hough Transform does have some steep-ish memory requirements: the space complexity is O(kN), but if you precompute sin() and cos() and provide appropriate lookup tables, it goes down to O(k + N), which may still be too much depending on how big your N is... but I don't see you getting it any lower.
Edit: The problem of cross-thread/kernel/SIMD/process line elements is non-trivial. My first impulse is to subdivide the mesh into recursive quad-trees (dependent on a certain tolerance), check the immediate edges, ignore all edge ridge structs (you can flag these as "potential long lines" and share them throughout your distributed system), and just do the work on everything INSIDE each particular quad, progressively moving outward. Here's a graphical representation (green is the first pass, red is the second, etc.). However, my intuition tells me that this is computationally expensive.
If the ridges are resolved enough that the breaks are only a few pixels then the standard dilate - find neighbours - erode steps you would do for finding lines / OCR should work.
Joining longer contours from many segments and knowing when to create a neck or when to make a separate island is much more complex
Okay, so having thought about this a bit longer, I've got a suggestion that seems like it's too simple to be efficient... I'd appreciate some feedback on whether it seems sensible!
1) Since I can easily determine whether each ridge_t ridge segment is connected to zero, one or two adjacent segments, I could colour each one appropriately (LINE_NONE, LINE_END or LINE_MID). This can easily be done in parallel, since there is no chance of a race condition.
2) Once colouring is complete:
for each `LINE_END` ridge segment X found:
traverse line until another `LINE_END` ridge segment Y found
if X is earlier in memory than Y:
change X to `LINE_START`
else:
change Y to `LINE_START`
This is also free of race conditions, since even if two threads are simultaneously traversing the same line, they will make the same change.
3) Now every line in the image will have exactly one end flagged as LINE_START. The lines can be located and packed into a more convenient structure in a single thread, without having to do any look-ups to see if the line has already been visited.
It's possible that I should consider whether statistics such as line length should be gathered in step 2), to help with the final re-packing...
Are there any pitfalls that I've missed?
Edit: The obvious problem is that I end up walking the lines twice, once to locate LINE_STARTs and once to do the final re-packing, leading to some computational inefficiency. It still appears to be O(N) in terms of storage and computation time, though, which is a good sign...
Given is an array of 320 elements (int16), which represent an audio signal (16-bit LPCM) of 20 ms duration. I am looking for a most simple and very fast method which should decide whether this array contains active audio (like speech or music), but not noise or silence. I don't need a very high quality of the decision, but it must be very fast.
It first occurred to me to add up all the squares or absolute values of the elements and compare the sum with a threshold, but such a method is very slow on my system, even though it is O(n).
You're not going to get much faster than a sum-of-squares approach.
One optimization that you may not be doing so far is to use a running total. That is, in each time step, instead of summing the squares of the last n samples, keep a running total and update that with the square of the most recent sample. To avoid your running total from growing and growing over time, add an exponential decay. In pseudocode:
decay_constant=0.999; // Some suitable value smaller than 1
total=0;
for t=1,...
// Exponential decay
total=total*decay_constant;
  // Add in the square of the latest sample (sum of squares, not of raw samples)
  total+=current_sample*current_sample;
if total>threshold
// do something
end
end
Of course, you'll have to tune the decay constant and threshold to suit your application. If this isn't fast enough to run in real time, you have a seriously underpowered DSP...
You might try calculating two simple "statistics". The first is the spread (max - min): silence will have a very low spread. The second is variety: divide the range of possible values into, say, 16 brackets and, as you go through the elements, record which bracket each element falls into. Noise will have similar counts in all brackets, whereas music or speech should prefer some of them while neglecting others.
This should be possible to do in just one pass through the array and you do not need complicated arithmetics, just some addition and comparison of values.
Also consider some approximation; for example, take only every fourth value, thus reducing the number of checked elements to 80. For an audio signal, this should be okay.
I did something like this a while back. After some experimentation I arrived at a solution that worked sufficiently well in my case.
I used the rate of change of the cube of a running average over about 120 ms. When there is silence (that is, only noise), the expression hovers around zero. As soon as the rate starts increasing over a couple of runs, you probably have some action going on.
rate = cur_avg^3 - prev_avg^3
I used a cube because the square just wasn't aggressive enough. If the cube is too slow for you, try using the square and a bit shift instead. Hope this helps.
Clearly, the complexity must be at least O(n). Some simple algorithm that calculates a value range is probably good for the moment, but I would look up Voice Activity Detection on the web and search for related code samples.
I need to write a program that does this: given an image (5x5 pixels), I have to count how many occurrences of it exist in another image, composed of many other images. That is, I need to search for a given pattern in an image.
The language to use is C. I have to use parallel computing to search at the 4 angles (0°, 90°, 180° and 270°).
What is the best way to do that?
Seems straightforward.
Create 4 versions of the image rotated by 0°, 90°, 180°, and 270°.
Start four threads each with one version of the image.
For all positions from (0,0) to (width - 5, height - 5)
Compare the 25 pixels of the reference image with the 25 pixels at the current position
If they are equal enough using some metric, report the finding.
Use normalized correlation to determine a match of templates.
@Daniel's solution is good for leveraging your multiple CPUs, but he doesn't mention the quality metric, so I would like to suggest one that is very common in image processing.
I suggest using normalized correlation [1] as a comparison metric because it outputs a number from -1 to +1, where 0 means no correlation, 1 would be output if the two templates were identical, and -1 if the two templates were exactly opposite.
Once you compute the normalized correlation you can test to see if you have found the template by doing either a threshold test or a peak-to-average test[2].
[1 - footnote] How do you implement normalized correlation? It is pretty simple and only has two for loops. Once you have an implementation that is good enough you can verify your implementation by checking to see if the identical image gets you a 1.
[2 - footnote] You do the ratio of the max(array) / average(array_without_peak). Then threshold to make sure you have a good peak to average ratio.
There's no need to create the additional three versions of the image, just address them differently or use something like the class I created here. Better still, just duplicate the 5x5 matrix and rotate those instead. You can then linearly scan the image for all rotations (which is a good thing).
This problem will not scale well for parallel processing since the bottleneck is certainly accessing the image data. Having multiple threads accessing the same data will slow it down, especially if the threads get 'out of sync', i.e. one thread gets further through the image than the other threads so that the other threads end up reloading the data the first thread has discarded.
So, the solution I think will be most efficient is to create four threads that scan 5 lines of the image, one thread per rotation. A fifth thread loads the image data one line at a time and passes the line to each of the four scanning threads, waiting for all four threads to complete, i.e. load one line of image, append to five line buffer, start the four scanning threads, wait for threads to end and repeat until all image lines are read.
5 * 5 = 25, and 25 bits fit in an integer.
Each image can therefore be encoded as an array of 4 integers (one per rotation).
Iterate over your larger image (hopefully it is not too big), pull out all 5x5 sub-images, convert each to an integer, and compare against the 4 encoded rotations.