Why does convolution with kernels work? - c

I don't understand how someone could come up with a simple 3x3 matrix called kernel, so when applied to the image, it would produce some awesome effect. Examples: http://en.wikipedia.org/wiki/Kernel_(image_processing) . Why does it work? How did people come up with those kernels (trial and error?)? Is it possible to prove it will always work for all images?

I don't understand how someone could come up with a simple 3x3 matrix called kernel, so when applied to the image, it would produce some awesome effect. Examples: http://en.wikipedia.org/wiki/Kernel_(image_processing).
If you want to dig into the history, you'll need to check some other terms. In older textbooks on image processing, what we think of as kernels today are more likely to be called "operators." Another key term is convolution. Both these terms hint at the mathematical basis of kernels.
http://en.wikipedia.org/wiki/Convolution
You can read about mathematical convolution in the textbook Computer Vision by Ballard and Brown. The book dates back to the early 80s, but it's still quite useful, and you can read it for free online:
http://homepages.inf.ed.ac.uk/rbf/BOOKS/BANDB/toc.htm
From the table of contents to the Ballard and Brown book you'll find a link to a PDF for section 2.2.4 Spatial Properties.
http://homepages.inf.ed.ac.uk/rbf/BOOKS/BANDB/LIB/bandb2_2.pdf
In the PDF, scroll down to the section "The Convolution Theorem." This provides the mathematical background for convolution. It's a relatively short step from thinking about convolution expressed as functions and integrals to the application of the same principles to the discrete world of grayscale (or color) data in 2D images.
You will notice that a number of kernels/operators are associated with names: Sobel, Prewitt, Laplacian, Gaussian, and so on. These names help suggest that there's a history--really quite a long history--of mathematical development and image processing research that has lead to the large number of kernels in common use today.
Gauss and Laplace lived long before us, but their mathematical work has trickled down into forms we can use in image processing. They didn't work on kernels for image processing, but mathematical techniques they developed are directly applicable and commonly used in image processing. Other kernels were developed specifically for processing images.
The Prewitt operator (kernel), which is quite similar to the Sobel operator, was published in 1970, if Wikipedia is correct.
http://en.wikipedia.org/wiki/Prewitt_operator
Why does it work?
Read about the mathematical theory of convolution to understand how one function can be "passed over" or "dragged" across another. That can explain the theoretical basis.
Then there's the question of why individual kernels work. In you look at the edge transition from dark to light in an image, and if you plot the pixel brightness on a 2D scatterplot, you'll notice that the values in the Y-axis increase rapidly about the edge transition in the image. That edge transition is a slope. A slope can be found using the first derivative. Tada! A kernel that approximates a first derivative operator will find edges.
If you know there's such a thing in optics as Gaussian blur, then you might wonder how it could be applied to a 2D image. Thus the derivation of the Gaussian kernel.
The Laplacian, for instance, is an operator that, according to the first sentence from the Wikipedia entry, "is a differential operator given by the divergence of the gradient of a function on Euclidean space."
http://en.wikipedia.org/wiki/Laplacian
Hoo boy. It's quite a leap from that definition to a kernel. The following page does a fine job of explaining the relationship between derivatives and kernels, and it's a quick read:
http://www.aishack.in/2011/04/the-sobel-and-laplacian-edge-detectors/
You'll also see that one form of the Laplacian kernel is simply named the "edge-finding" kernel in the Wikipedia entry you cited.
There is more than one edge-finding kernel, and each has its place. The Laplacian, Sobel, Prewitt, Kirsch, and Roberts kernels all yield different results, and are suited for different purposes.
How did people come up with those kernels (trial and error?)?
Kernels were developed by different people following a variety of research paths.
Some kernels (to my memory) were developed specifically to model the process of "early vision." Early vision isn't what happens only to early humans, or only for people who rise at 4 a.m., but instead refers to the low-level processes of biological vision: sensing of basic color, intensity, edges, and that sort of thing. At the very low level, edge detection in biological vision can be modeled with kernels.
Other kernels, such as the Laplacian and Gaussian, are approximations of mathematical functions. With a little effort you can derive the kernels yourself.
Image editing and image processing software packages will often allow you to define your own kernel. For example, if you want to identify a shape in an image small enough to be defined by a few connected pixels, then you can define a kernel that matches the shape of the image feature you want to detect. Using custom kernels to detect objects is too crude to work in most real-world applications, but sometimes there are reasons to create a special kernel for a very specific purpose, and sometimes a little trial and error is necessary to find a good kernel.
As user templatetypedef pointed out, you can think of kernels intuitively, and in a fairly short time develop a feel for what each would do.
Is it possible to prove it will always work for all images?
Functionally, you can throw a 3x3, 5x5, or NxN kernel at an image of the appropriate size and it'll "work" in the sense that the operation will be performed and there will be some result. But then the ability to compute a result whether it's useful or not isn't a great definition of "works."
One information definition of whether a kernel "works" is whether convolving an image with that kernel produces a result that you find useful. If you're manipulating images in Photoshop or GIMP, and if you find that a particular enhancement kernel doesn't yield quite what you want, then you might say that kernel doesn't work in the context of your particular image and the end result you want. In image processing for computer vision there's a similar problem: we must pick one or more kernels and other (often non-kernel based) algorithms that will operate in sequence to do something useful such as identify faces, measures the velocity of cars, or guide robots in assembly tasks.
Homework
If you want to understand how you can translate a mathematical concept into a kernel, it helps to derive a kernel by yourself. Even if you know what the end result of the derivation should be, to grok the notion of kernels and convolution it helps to derive a kernel from a mathematical function by yourself, on paper, and (preferably) from memory.
Try deriving the 3x3 Gaussian kernel from the mathematical function.
http://en.wikipedia.org/wiki/Gaussian_function
Deriving the kernel yourself, or at least finding an online tutorial and reading closely, will be quite revealing. If you'd rather not do the work, then you may not appreciate the way that some mathematical expression "translates" to a bunch of numbers in a 3x3 matrix. But that's okay! If you get the general sense of a common kernel is useful, and if you observe how two similar kernels produce slightly different results, then you'll develop a good feel for them.

Intuitively, a convolution of an image I with a kernel K produces a new image that's formed by computing a weighted sum, for each pixel, of all the nearby pixels weighted by the weights in K. Even if you didn't know what a convolution was, this idea still seems pretty reasonable. You can use it to do a blur effect (by using a Gaussian weighting of nearby pixels) or to sharpen edges (by subtracting each pixel from its neighbors and putting no weight anywhere else.) In fact, if you knew you needed to do all these operations, it would make sense to try to write a function that given I and K did the weighted sum of nearby pixels, and to try to optimize that function as aggressively as possible (since you'd probably use it a lot).
To get from there to the idea of a convolution, you'd probably need to have a background in Fourier transforms and Fourier series. Convolutions are a totally natural idea in that domain - if you compute the Fourier transformation of two images and multiply the transforms together, you end up computing the transform of the convolution. Mathematicians had worked that out a while back, probably by answering the very natural question "what function has a Fourier transform defined by the product of two other Fourier transforms?," and from there it was just a matter of time before the connection was found. Since Fourier transforms are already used extensively in computing (for example, in signal processing in networks), my guess is that someone with a background in Fourier series noticed that they needed to apply a kernel K to an image I, then recognized that this is way easier and more computationally efficient when done in frequency space.
I honestly have no idea what the real history is, but this is a pretty plausible explanation.
Hope this helps!

There is a good deal of mathematical theory about convolutions, but the kernel examples you link to are simple to explain intuitively:
[ 0 0 0]
[ 0 1 0]
[ 0 0 0]
This one says to take the original pixel and nothing else, so it yields just the original image.
[-1 -1 -1]
[-1 8 -1]
[-1 -1 -1]
This one says to subtract the eight neighbors from eight times the original pixel. First consider what happens in a smooth part of the image, where there is solid, unchanging color. Eight times the original pixel equals the sum of eight identical neighbors, so the difference is zero. Thus, smooth parts of the image become black. However, parts of the images where there are changes do not become black. Thus, this kernel highlights changes, so it highlights places where one shape ends and another begins: the edges of objects in the image.
[ 0 1 0]
[ 1 -4 1]
[ 0 1 0]
This is similar to the one above, but it is tuned differently.
[ 0 -1 0]
[-1 5 -1]
[0 -1 0]
Observe that this is just the negation of the edge detector above plus the first filter we saw, the one for the original image. So this kernel both highlights edges and adds that to the original image. The result is the original image with more visible edges: a sharpening effect.
[ 1 2 1]
[ 2 4 2]
[ 1 2 1]
[ 1 1 1]
[ 1 1 1]
[ 1 1 1]
Both of these blend the original pixel with its neighbors. So they blur the image a little.

There are two ways of thinking about (or encoding) an image: the spatial domain and the frequency domain. A spatial representation is based on pixels, so it's more familiar and easier to obtain. Both the image and the kernel are expressed in the spatial domain.
To get to the frequency domain, you need to use a Fourier or related transform, which is computationally expensive. Once you're there, though, many interesting manipulations are simpler. To blur an image, you can just chop off some high-frequency parts — like cropping the image in the spatial domain. Sharpening is the opposite, akin to increasing the contrast of high-frequency information.
Most of the information of an image is in the high frequencies, which represent detail. Most interesting detail information is at a small, local scale. You can do a lot by looking at neighboring pixels. Blurring is basically taking a weighted average of neighboring pixels. Sharpening consists of looking at the difference between a pixel and its neighbors and enhancing the contrast.
A kernel is usually produced by taking a frequency-domain transformation, then keeping only the high-frequency part and expressing it in the spatial domain. This can only be done for certain transformation algorithms. You can compute the ideal kernel for blurring, sharpening, selecting certain kinds of lines, etc., and it will work intuitively but otherwise seems like magic because we don't really have a "pixel arithmetic."
Once you have a kernel, of course, there's no need to get into the frequency domain at all. That hard work is finished, conceptually and computationally. Convolution is pretty friendly to all involved, and you can seldom simplify any further. Of course, smaller kernels are friendlier. Sometimes a large kernel can be expressed as a convolution of small sub-kernels, which is a kind of factoring in both the math and software senses.
The mathematical process is pretty straightforward and has been studied since long before there were computers. Most common manipulations can be done mechanically on an optical bench using 18th century equipment.

I think the best way to explain them is to start in 1d and discuss the z-transform and its inverse. That switches from the time domain to the frequency domain — from describing a wave as a timed sequence of samples to describing it as the amplitude of each frequency that contributes to it. The two representations contain the same amount of information, they just express it differently.
Now suppose you had a wave described in the frequency domain and you wanted to apply a filter to it. You might want to remove high frequencies. That would be a blur. You might want to remove low frequencies. That would be a sharpen or, in extremis, an edge detect.
You could do that by just forcing the frequencies you don't want to 0 — e.g. by multiplying the entire range by a particular mask, where 1 is a frequency you want to keep and 0 is a frequency you want to eliminate.
But what if you want to do that in the time domain? You could transfer to the frequency domain, apply the mask, then transform back. But that's a lot of work. So what you do (approximately) is transform the mask from the frequency domain to the time domain. You can then apply it in the time domain.
Following the maths involved for transforming back and forth, in theory to apply that you'd have to make each output sample a weighted sum of every single input sample. In the real world you make a trade-off. You use the sum of, say, 9 samples. That gives you a smaller latency and less processing cost than using, say, 99 samples. But it also gives you a less accurate filter.
A graphics kernel is the 2d analogue of that line of thought. They tend to be small because processing cost grows with the square of the edge length so it gets expensive very quickly. But you can approximate any sort of frequency domain limiting filter.

Related

Non-converging Neural Network in C

I wrote my first feed-forward neural network in C, using the sigmoid 1.0 / (1.0 + exp(-x)) as activation function and gradient descent to adjust the weights. I tried to approximate sin(x) to make sure my network works. However, the output of the neuron on the output layer seems to always oscillate between the extreme values 0 and 1 and the weights of the neurons grow to absurd sizes, no matter how many hidden layers there are, how many neurons are in the hidden layer(s), how many training samples I provide, or even what the target outputs are.
1) Are there any standard 'tried and tested' data sets used to proof-test neural networks for errors? If yes, what structures work best (e.g. numbers of neuron(s) in the hidden layer) to converge to the desired output?
2) Are there any common errors that generate the same symptoms? I found this thread, but the issue was because of faulty data, which I believe is not my case.
3) Is there any preferred way of training the network? In my implementation I cycle through the training sets and adjust the weights each time, then rinse and repeat ~1000 times. Is there any other order that works better?
So, to sum up:
Assuming that your gradient propagation works properly usually the values of parameters like topology, learning rate, batch size or value of a constant connected with weight penalty (L1 and L2 decay) are computed using a techniques called grid search or random search. It was empirically proved that random search performs better in this task.
The most common reason of weight divergence is wrong learning rate. Big value of it might make learning really hard. But on the other hand - when learning rate is too small - learning process might take a really long time. Usually - you should babysit the learning phase. The specified instruction might be found e.g. here.
In your learning phase you used a technique called SGD. Usually - it may achieve good results but it's vulnerable to variance of data sets and big values of learning rates. What I advice you is to use batch learning and set a batch size as additional learning parameter learnt during grid or random search. You can read about here e.g. here.
Another thing which you might consider is to change your activation function to tanh or relu. There are a lot of problems with saturation regions of sigmoid and it usually needs a proper initialization. You can read about it here.

Subtle Lack of Randomness in Maze Generation Algorithm in C

This concerns what I can only surmise is a flaw in somebody's code used to generate random mazes. The code is somewhat long, but most of it is commented-out options and/or not concerned specifically with the randomization.
I got the 2001x2001 maze from the link dllu put up and saved as a png here. From that, I have created this. To get the blue pattern, I started filling in dead ends starting in the bottom left-hand corner of the maze. According to the backtracker algorithm he used, that's the point at which the maze begins to be generated: so if you following the trail of dead ends created by that, you can systematically fill in all the dead ends on that side of the maze. In other words, the central blue mass represents the total accessible area starting from the bottom left, up to the sole frontier pixel at 2678 x 1086.
But there's something immediately anomalous, in that the blue "fractal" seems to repeat itself. Indeed, by overlaying one part of the fractal, rotated and mirrored, you can see there is an exact correspondence of shape. Another anomaly from this overlay maps part of one of the continents onto another, but strangely only a sliver of the landmass this time. Evidently these aren't the only auto-correspondences.
But besides the shape of the dead-end components, there is repetition of the actual pattern of the walls when you zoom in. Strangest of all, the repetition isn't exact, but only maybe 50-60% of the walls correspond. Zoomed and constrasted sample of a region:
Long question short, what in the code is causing this obscure lack of randomness?
The standard library function rand is often horribly implemented and your code is doing rand()%4 which exacerbates the problem because poor implementations tend to have even less randomness in the lower bits. Try replacing rand with a different random number generator.
Anonymous has properly indicated that rand is poorly implemented and that it has poor entropy in its lower bits.
To alleviate this issue, you could use a stronger pseudo-random number generator (like one suitable for cryptographic methods); however, the higher quality of the randomness tends to lead to longer time in the random number generating function.
I would initially attempt to pull out some of the upper level bits from your random generator. It might be sufficient to improve the result with a less-than perfectly random generator. As they are pulled from a different region of the result, and shifted to the 0-3 values you need, their distribution should be different (and hopefully a bit more random)

Musical instrument automated teaching via microcontroller

The premise of the project will be:
There will be a prerecorded track of guitar, for example. The student will play the same track on his guitar. I need to compare these two sounds and find out whether the student played it good or not. I will be using STM32 microcontroller and Keil uVision software for simulation at first (programming at C).
I know that I will be using an ADC using DMA and I assume I would Fast Fourier Transform the wave signals and then somehow compare the two frequency responses. Also, would there be a problem with tempo? I mean it is not logical that every note will hit on the exact ms and then compare it
I've seen some methods like Hidden Markov Model or Goertzel algorithm but I am not quite sure what they do and if they are optimal and easy for the project. So my question would be: is there a specific algorithm that suits best and how would I implement it on my code (since I haven't really started working on code, mostly theoretical reading so far).
edit: I've made a similar post yesterday but my premise was too complicated to solve so I am posting on a new premise, much easier to accomplish. I thought not to ask on the first thread since it would mix up two different issues.
Assuming that you can use FFT to find out which notes are playing at what time (this may prove to be difficult for distorted guitar chords), you can do this e.g. 10 times per second for both streams, and then check how often the notes in both streams match. This will give you a percentage, if you need a binary value you'd have to use a threshold value.
If both streams are not equal length (different tempo) then you will have to stretch. You don't have to stretch the actual audio, just the times between the note measurements (e.g. every 100 ms for the first stream and every 125 seconds for the second stream).
So the biggest problem may be to find out what notes are playing at any given moment in time.
I'd start with constructing a mapping of frequencies to notes. Also it may be a good idea to low-pass filter the signal at around 1100 Hz to already get rid of some of the unwanted harmonics (you can't play higher than that on the guitar anyway) and similaryly high-pass filter the signal at 80 Hz. Then after the FFT or DFT (not sure if it matters which you choose), find the frequencies that are close to the real note frequencies. Then pick the loudest one and those that are above a certain threshold relative to the loudest one (e.g. drop anything that is less than half as loud as the loudest one, but some experimentation will be needed to find a good threshold value).

Calculate values for spectrum analyser [duplicate]

This question already has an answer here:
How to generate the audio spectrum using fft in C++? [closed]
(1 answer)
Closed 9 years ago.
How would I go about implementing a spectrum analyser like the ones in WinAmp below?
Just by looking at it, I think that these bars are rendered to display the 'volume level' of a specific frequency band of the incoming audio data; however, I'm not sure how to actually calculate this data needed for the rather easy task of drawing the bars.
From what I've been told and understand, calculating these values can be done by using an FFT — however, I'm not exactly sure how to calculate those, given a buffer of input data — am I on the right track about FFTs? How would I apply an FFT on the input data and get, say, an integer out of the FFT that represents the 'volume' of a specific frequency band?
The drawing part isn't a problem, since I can just draw directly to my framebuffer and render that out. I'm doing this as a project on an FPGA, using a Nios II soft-CPU, in case anyone's wondering about potential hardware limitations. Audio data comes in as 24-bit data at 96kHz.
You're probably looking for FFTw.
edit:
To elaborate on your question:
calculating these values can be done by using an FFT — however, I'm not exactly sure how to calculate those, given a buffer of input data: yes, you're right; that's exactly how it's done. You take a (necssarily small, due to the time-frequency uncertainty principle) sample segment out of the currently playing audio data, and feed it to a (typically) discrete, real-only FFT (one of the best known, most widely used and fastest being the DCT family of DFTs - in fact there are highly optimized versions of most DCTs in FFTw). Then you take out the next sample segment and repeat the process.
The output of the FFT will be the freqeuncy decomposition of the audio signal that has been fed in - you then need to decide how to display it (i.e. which function to use on the outputs of the FFT, common candidates being f(x) = x; f(x) = sqrt(x); f(x) = log(x)) and also how to present/animate the following readings (e.g. you could average each band in the temporal direction or you could have the maximums "fall off" slowly).
rage-edit:
Additional links since it appears somebody knows how to downvote but not how to use google:
http://en.wikipedia.org/wiki/FFTW
http://webcache.googleusercontent.com/search?q=cache:m6ou54tn_soJ:www.fftw.org/+&cd=1&hl=en&ct=clnk&gl=it&client=firefox-a
http://web.archive.org/web/20130123131356/http://fftw.org/
It's pretty straight forward - just use one of the many FFT algorithms! Most of them require floating point calculations, but a google search brings up methods with just integers. You are spot on though, FFT's are what you want.
http://en.wikipedia.org/wiki/Fast_Fourier_transform
To understand how to apply a FFT, you should have a read of this page on discrete fourier transforms, although it's quite heavy on maths:
http://en.wikipedia.org/wiki/Discrete_Fourier_transform
To implement it on your FPGA, I'd have a look at the source code of this project:
http://qt-project.org/doc/qt-4.8/demos-spectrum.html
Here's a previous SO question that gives a summary of how it works (in any language).
How to generate the audio spectrum using fft in C++?
There's an immense amount of information on creating what by the way is called a "Spectrum Analyser", there must be dozens of complete implementations where the source code is freely available. Just page through a google search for "Spectrum analyser source code C" for example.

What is the most practical board representation for a magic bitboard move generation system?

I am rewriting a chess engine I wrote to run on magic bitboards. I have the magic functions written and they take a square of the piece and an occupancy bitboard as parameters. What I am having debates with myself is which one of these board representation schemes is faster/more practical:
scheme 1: There is a bitboard for each type of piece, 1 for white knights, 1 for black rooks. . . , and in order to generate moves and push them to the move stack, I must serialize them to find the square of the piece and then call the magic function. Then I must serialize that move bitboard and push them. The advantage is is that the attacking and occupancy bitboards are closer at hand.
scheme 2: A simple piece centric array [2][16] or [32] contains the square indices of the pieces. A simply loopthrough and call of the functions is all it takes for the move bitboards. I then serialize those bitboards and push them to the move stack. I also have to maintain an occupancy bitboard. I guess getting an attack bitboard shouldn't be any different: I have to once again generate all the move bitboards and, instead of serializing them, I bitwise operate them in a mashup of magic.
I'm leaning towards scheme 2, but for some reason I think there is some sort of implementation similar to scheme 1 that is standard. For some reason I can't find drawbacks of making a "bitboard" engine without actually using bitboards. I wouldn't even be using bitboards for king and knight data, just a quick array lookup.
I guess my question is more of whether there is a better way to do this board representation, because I remember reading that keeping a bitboard for each type of piece is standard (maybe this is only necessary with rotated bitboards?). I'm relatively new to bitboard engines but I've read a lot and I've implemented the magic method. I certainly like the piece centric array approach - it makes a lot of arbitrary stuff like printing the board to the screen easier, but if there is a better/equal/more standard way can someone please point it out? Thanks in advance - I know this is a fairly specific question and difficult to answer unless you are very familiar with chess programming.
Last minute question: how is the speed of a lookup into a 2D array measure up to using a 1D array and adding 16 * team_side to the normal index to lookup the piece?
edit: I thought I should add that I am valuing speed over almost all else in my chess implementation. Why else would I go with magic bitboards as opposed to simply arrays with slide data?
There is no standard answer to this, sorry.
The number and types of data structures you need depends on exactly what you want to do in your program. For example, having more than one representation of the pieces on the board makes some operations faster. On the other hand, it takes more time to update your data during each move.
To get the maximum speed, it is your job to find out what works best for your program. Does maintaining an extra array of pieces result in a net speedup for a program? It depends!
Sometimes it is a net gain to maintain a data structure, sometimes you can delay the calculations and cache the result, and sometimes you just calculate it when needed (and hope it isn't needed very often).

Resources