Holistic Word Recognition algorithm in detail - c

Where can I find algorithm details for holistic word recognition? I need to build a simple OCR system in hardware (FPGAs, actually), and the scientific journals seem so abstract.
Are there any open source (open core) codes for holistic word recognition?
Thanks

For an algorithm that is quite suitable for FPGA implementation (embarrassingly parallel) you might look at:
http://en.wikipedia.org/wiki/Cross-correlation
It is fast, and easily implemented.
The only thing is: it recognizes a shape (in your case some text) dependent on rotation, size, stretch, skew, etc. But if that isn't a problem, it can be very fast and is quite robust. You should only watch out for interpretation problems with characters that look similar (like o and c).
I used it to find standard texts on scanned forms to get bearings on where the Regions of Interest are, and searching those images (6M pixels) only took around 15 ms with our implementation on a Core2 CPU in a single thread.
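In case it helps to see the idea concretely, here is a minimal C sketch of the brute-force approach (the function name, image layout, and lack of normalization are illustrative assumptions, not the implementation mentioned above): slide a small template over the image, and at each offset sum the products of the overlapping pixels; the offset with the highest score is the best match.

/* Naive 2D cross-correlation: find the (x, y) offset where a small
 * template best matches a grayscale image. Images are row-major arrays
 * of 8-bit pixels. Every offset is computed independently, which is what
 * makes the method embarrassingly parallel (and FPGA-friendly). */
void cross_correlate(const unsigned char *img, int iw, int ih,
                     const unsigned char *tpl, int tw, int th,
                     int *best_x, int *best_y)
{
    long long best = -1;
    for (int y = 0; y + th <= ih; ++y) {
        for (int x = 0; x + tw <= iw; ++x) {
            long long score = 0;
            for (int ty = 0; ty < th; ++ty)
                for (int tx = 0; tx < tw; ++tx)
                    score += (long long)img[(y + ty) * iw + (x + tx)]
                             * tpl[ty * tw + tx];
            if (score > best) {
                best = score;
                *best_x = x;
                *best_y = y;
            }
        }
    }
}

In practice you would normalize the score (normalized cross-correlation) so bright regions of the image don't dominate, and for large images you would do the correlation in the frequency domain rather than with this brute-force loop.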

Related

Why does convolution with kernels work?

I don't understand how someone could come up with a simple 3x3 matrix called kernel, so when applied to the image, it would produce some awesome effect. Examples: http://en.wikipedia.org/wiki/Kernel_(image_processing) . Why does it work? How did people come up with those kernels (trial and error?)? Is it possible to prove it will always work for all images?
I don't understand how someone could come up with a simple 3x3 matrix called kernel, so when applied to the image, it would produce some awesome effect. Examples: http://en.wikipedia.org/wiki/Kernel_(image_processing).
If you want to dig into the history, you'll need to check some other terms. In older textbooks on image processing, what we think of as kernels today are more likely to be called "operators." Another key term is convolution. Both these terms hint at the mathematical basis of kernels.
http://en.wikipedia.org/wiki/Convolution
You can read about mathematical convolution in the textbook Computer Vision by Ballard and Brown. The book dates back to the early 80s, but it's still quite useful, and you can read it for free online:
http://homepages.inf.ed.ac.uk/rbf/BOOKS/BANDB/toc.htm
From the table of contents of the Ballard and Brown book you'll find a link to a PDF for section 2.2.4, Spatial Properties.
http://homepages.inf.ed.ac.uk/rbf/BOOKS/BANDB/LIB/bandb2_2.pdf
In the PDF, scroll down to the section "The Convolution Theorem." This provides the mathematical background for convolution. It's a relatively short step from thinking about convolution expressed as functions and integrals to the application of the same principles to the discrete world of grayscale (or color) data in 2D images.
You will notice that a number of kernels/operators are associated with names: Sobel, Prewitt, Laplacian, Gaussian, and so on. These names help suggest that there's a history--really quite a long history--of mathematical development and image processing research that has led to the large number of kernels in common use today.
Gauss and Laplace lived long before us, but their mathematical work has trickled down into forms we can use in image processing. They didn't work on kernels for image processing, but mathematical techniques they developed are directly applicable and commonly used in image processing. Other kernels were developed specifically for processing images.
The Prewitt operator (kernel), which is quite similar to the Sobel operator, was published in 1970, if Wikipedia is correct.
http://en.wikipedia.org/wiki/Prewitt_operator
Why does it work?
Read about the mathematical theory of convolution to understand how one function can be "passed over" or "dragged" across another. That can explain the theoretical basis.
Then there's the question of why individual kernels work. If you look at the edge transition from dark to light in an image, and if you plot the pixel brightness on a 2D plot, you'll notice that the values on the Y-axis increase rapidly around the edge transition in the image. That edge transition is a slope. A slope can be found using the first derivative. Tada! A kernel that approximates a first derivative operator will find edges.
If you know there's such a thing in optics as Gaussian blur, then you might wonder how it could be applied to a 2D image. Thus the derivation of the Gaussian kernel.
The Laplacian, for instance, is an operator that, according to the first sentence from the Wikipedia entry, "is a differential operator given by the divergence of the gradient of a function on Euclidean space."
http://en.wikipedia.org/wiki/Laplacian
Hoo boy. It's quite a leap from that definition to a kernel. The following page does a fine job of explaining the relationship between derivatives and kernels, and it's a quick read:
http://www.aishack.in/2011/04/the-sobel-and-laplacian-edge-detectors/
You'll also see that one form of the Laplacian kernel is simply named the "edge-finding" kernel in the Wikipedia entry you cited.
There is more than one edge-finding kernel, and each has its place. The Laplacian, Sobel, Prewitt, Kirsch, and Roberts kernels all yield different results, and are suited for different purposes.
How did people come up with those kernels (trial and error?)?
Kernels were developed by different people following a variety of research paths.
Some kernels (to my memory) were developed specifically to model the process of "early vision." Early vision isn't what happens only to early humans, or only for people who rise at 4 a.m., but instead refers to the low-level processes of biological vision: sensing of basic color, intensity, edges, and that sort of thing. At the very low level, edge detection in biological vision can be modeled with kernels.
Other kernels, such as the Laplacian and Gaussian, are approximations of mathematical functions. With a little effort you can derive the kernels yourself.
Image editing and image processing software packages will often allow you to define your own kernel. For example, if you want to identify a shape in an image small enough to be defined by a few connected pixels, then you can define a kernel that matches the shape of the image feature you want to detect. Using custom kernels to detect objects is too crude to work in most real-world applications, but sometimes there are reasons to create a special kernel for a very specific purpose, and sometimes a little trial and error is necessary to find a good kernel.
As user templatetypedef pointed out, you can think of kernels intuitively, and in a fairly short time develop a feel for what each would do.
Is it possible to prove it will always work for all images?
Functionally, you can throw a 3x3, 5x5, or NxN kernel at an image of the appropriate size and it'll "work" in the sense that the operation will be performed and there will be some result. But then the ability to compute a result whether it's useful or not isn't a great definition of "works."
One informal definition of whether a kernel "works" is whether convolving an image with that kernel produces a result that you find useful. If you're manipulating images in Photoshop or GIMP, and if you find that a particular enhancement kernel doesn't yield quite what you want, then you might say that kernel doesn't work in the context of your particular image and the end result you want. In image processing for computer vision there's a similar problem: we must pick one or more kernels and other (often non-kernel-based) algorithms that will operate in sequence to do something useful such as identify faces, measure the velocity of cars, or guide robots in assembly tasks.
Homework
If you want to understand how you can translate a mathematical concept into a kernel, it helps to derive a kernel by yourself. Even if you know what the end result of the derivation should be, to grok the notion of kernels and convolution it helps to derive a kernel from a mathematical function by yourself, on paper, and (preferably) from memory.
Try deriving the 3x3 Gaussian kernel from the mathematical function.
http://en.wikipedia.org/wiki/Gaussian_function
Deriving the kernel yourself, or at least finding an online tutorial and reading it closely, will be quite revealing. If you'd rather not do the work, then you may not appreciate the way that some mathematical expression "translates" into a bunch of numbers in a 3x3 matrix. But that's okay! If you get a general sense of what a common kernel does, and if you observe how two similar kernels produce slightly different results, then you'll develop a good feel for them.
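If it helps to see what that derivation boils down to, here is a small C sketch (the sigma value is an arbitrary choice for illustration): sample the 2D Gaussian at integer offsets and normalize so the weights sum to 1.

#include <math.h>
#include <stdio.h>

/* Derive a 3x3 Gaussian kernel by sampling
 *   G(x, y) = exp(-(x*x + y*y) / (2 * sigma * sigma))
 * at integer offsets -1..1 and normalizing so the weights sum to 1. */
int main(void)
{
    const double sigma = 0.85;   /* arbitrary choice for illustration */
    double k[3][3], sum = 0.0;

    for (int y = -1; y <= 1; ++y)
        for (int x = -1; x <= 1; ++x) {
            k[y + 1][x + 1] = exp(-(x * x + y * y) / (2.0 * sigma * sigma));
            sum += k[y + 1][x + 1];
        }

    for (int y = 0; y < 3; ++y) {
        for (int x = 0; x < 3; ++x)
            printf("%6.3f ", k[y][x] / sum);   /* normalized weight */
        printf("\n");
    }
    return 0;
}

With sigma around 0.85 the normalized weights come out very close to the familiar 1-2-1 / 2-4-2 / 1-2-1 kernel divided by 16, which is a nice sanity check that the matrix of small integers really is just a sampled Gaussian.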
Intuitively, a convolution of an image I with a kernel K produces a new image that's formed by computing a weighted sum, for each pixel, of all the nearby pixels weighted by the weights in K. Even if you didn't know what a convolution was, this idea still seems pretty reasonable. You can use it to do a blur effect (by using a Gaussian weighting of nearby pixels) or to sharpen edges (by subtracting each pixel from its neighbors and putting no weight anywhere else.) In fact, if you knew you needed to do all these operations, it would make sense to try to write a function that given I and K did the weighted sum of nearby pixels, and to try to optimize that function as aggressively as possible (since you'd probably use it a lot).
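A minimal C sketch of that weighted sum (the function name, array layout, and border handling are illustrative assumptions):

/* Convolve a grayscale image with a 3x3 kernel: each output pixel is the
 * weighted sum of its 3x3 neighborhood. Border pixels are simply copied
 * to keep the sketch short; real code would pad or mirror the border. */
void convolve3x3(const unsigned char *in, unsigned char *out,
                 int w, int h, const float k[3][3])
{
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            if (x == 0 || y == 0 || x == w - 1 || y == h - 1) {
                out[y * w + x] = in[y * w + x];
                continue;
            }
            float sum = 0.0f;
            for (int ky = -1; ky <= 1; ++ky)
                for (int kx = -1; kx <= 1; ++kx)
                    sum += k[ky + 1][kx + 1] * in[(y + ky) * w + (x + kx)];
            if (sum < 0.0f)   sum = 0.0f;     /* clamp to the 8-bit range */
            if (sum > 255.0f) sum = 255.0f;
            out[y * w + x] = (unsigned char)(sum + 0.5f);
        }
    }
}

Passing in a Gaussian kernel gives a blur, a Laplacian-style kernel gives edge detection, and so on; the loop itself never changes.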
To get from there to the idea of a convolution, you'd probably need to have a background in Fourier transforms and Fourier series. Convolutions are a totally natural idea in that domain - if you compute the Fourier transformation of two images and multiply the transforms together, you end up computing the transform of the convolution. Mathematicians had worked that out a while back, probably by answering the very natural question "what function has a Fourier transform defined by the product of two other Fourier transforms?," and from there it was just a matter of time before the connection was found. Since Fourier transforms are already used extensively in computing (for example, in signal processing in networks), my guess is that someone with a background in Fourier series noticed that they needed to apply a kernel K to an image I, then recognized that this is way easier and more computationally efficient when done in frequency space.
I honestly have no idea what the real history is, but this is a pretty plausible explanation.
Hope this helps!
There is a good deal of mathematical theory about convolutions, but the kernel examples you link to are simple to explain intuitively:
[ 0 0 0]
[ 0 1 0]
[ 0 0 0]
This one says to take the original pixel and nothing else, so it yields just the original image.
[-1 -1 -1]
[-1 8 -1]
[-1 -1 -1]
This one says to subtract the eight neighbors from eight times the original pixel. First consider what happens in a smooth part of the image, where there is solid, unchanging color. Eight times the original pixel equals the sum of eight identical neighbors, so the difference is zero. Thus, smooth parts of the image become black. However, parts of the images where there are changes do not become black. Thus, this kernel highlights changes, so it highlights places where one shape ends and another begins: the edges of objects in the image.
[ 0 1 0]
[ 1 -4 1]
[ 0 1 0]
This is similar to the one above, but it is tuned differently.
[ 0 -1 0]
[-1 5 -1]
[ 0 -1 0]
Observe that this is just the negation of the edge detector above plus the first filter we saw, the one for the original image. So this kernel both highlights edges and adds that to the original image. The result is the original image with more visible edges: a sharpening effect.
[ 1 2 1]
[ 2 4 2]
[ 1 2 1]

[ 1 1 1]
[ 1 1 1]
[ 1 1 1]

Both of these blend the original pixel with its neighbors, so they blur the image a little. (In practice each is divided by the sum of its weights, 16 and 9 respectively, so the overall brightness is preserved.)
There are two ways of thinking about (or encoding) an image: the spatial domain and the frequency domain. A spatial representation is based on pixels, so it's more familiar and easier to obtain. Both the image and the kernel are expressed in the spatial domain.
To get to the frequency domain, you need to use a Fourier or related transform, which is computationally expensive. Once you're there, though, many interesting manipulations are simpler. To blur an image, you can just chop off some high-frequency parts — like cropping the image in the spatial domain. Sharpening is the opposite, akin to increasing the contrast of high-frequency information.
Most of the information of an image is in the high frequencies, which represent detail. Most interesting detail information is at a small, local scale. You can do a lot by looking at neighboring pixels. Blurring is basically taking a weighted average of neighboring pixels. Sharpening consists of looking at the difference between a pixel and its neighbors and enhancing the contrast.
A kernel is usually produced by taking a frequency-domain transformation, then keeping only the high-frequency part and expressing it in the spatial domain. This can only be done for certain transformation algorithms. You can compute the ideal kernel for blurring, sharpening, selecting certain kinds of lines, etc., and it will work, even though it can seem like magic because we don't really have an intuition for "pixel arithmetic."
Once you have a kernel, of course, there's no need to get into the frequency domain at all. That hard work is finished, conceptually and computationally. Convolution is pretty friendly to all involved, and you can seldom simplify any further. Of course, smaller kernels are friendlier. Sometimes a large kernel can be expressed as a convolution of small sub-kernels, which is a kind of factoring in both the math and software senses.
The mathematical process is pretty straightforward and has been studied since long before there were computers. Most common manipulations can be done mechanically on an optical bench using 18th century equipment.
I think the best way to explain them is to start in 1d and discuss the z-transform and its inverse. That switches from the time domain to the frequency domain — from describing a wave as a timed sequence of samples to describing it as the amplitude of each frequency that contributes to it. The two representations contain the same amount of information, they just express it differently.
Now suppose you had a wave described in the frequency domain and you wanted to apply a filter to it. You might want to remove high frequencies. That would be a blur. You might want to remove low frequencies. That would be a sharpen or, in extremis, an edge detect.
You could do that by just forcing the frequencies you don't want to 0 — e.g. by multiplying the entire range by a particular mask, where 1 is a frequency you want to keep and 0 is a frequency you want to eliminate.
But what if you want to do that in the time domain? You could transform to the frequency domain, apply the mask, then transform back. But that's a lot of work. So what you do (approximately) is transform the mask from the frequency domain to the time domain. You can then apply it in the time domain.
Following the maths of transforming back and forth, in theory you'd have to make each output sample a weighted sum of every single input sample. In the real world you make a trade-off. You use the sum of, say, 9 samples. That gives you a smaller latency and less processing cost than using, say, 99 samples. But it also gives you a less accurate filter.
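As a hedged 1D sketch of that trade-off in C (the tap count and weights are illustrative, not a properly designed filter): each output sample is a weighted sum of the last 9 input samples; equal weights give a crude moving-average low-pass, while windowed-sinc weights would give a sharper cutoff.

/* Apply a 9-tap FIR filter: each output sample is a weighted sum of the
 * 9 most recent input samples. Equal weights (1/9 each) give a crude
 * moving-average low-pass; windowed-sinc weights give a sharper cutoff. */
#define NTAPS 9

void fir_filter(const float *in, float *out, int n, const float taps[NTAPS])
{
    for (int i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (int t = 0; t < NTAPS; ++t)
            if (i - t >= 0)               /* skip samples before the start */
                acc += taps[t] * in[i - t];
        out[i] = acc;
    }
}

Using 99 taps instead of 9 means ten times the multiplies per sample and a longer delay before the filter "sees" enough input, which is exactly the accuracy-versus-cost trade-off described above.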
A graphics kernel is the 2d analogue of that line of thought. They tend to be small because processing cost grows with the square of the edge length so it gets expensive very quickly. But you can approximate any sort of frequency domain limiting filter.

Musical instrument automated teaching via microcontroller

The premise of the project will be:
There will be a prerecorded track of guitar, for example. The student will play the same track on his guitar. I need to compare these two sounds and find out whether the student played it well or not. I will be using an STM32 microcontroller and Keil uVision software for simulation at first (programming in C).
I know that I will be using an ADC with DMA, and I assume I would Fast Fourier Transform the wave signals and then somehow compare the two frequency responses. Also, would there be a problem with tempo? It is not realistic that every note will land at exactly the same millisecond, so how would I compare them?
I've seen some methods like the Hidden Markov Model or the Goertzel algorithm, but I am not quite sure what they do or whether they are optimal and easy for this project. So my question is: is there a specific algorithm that suits this best, and how would I implement it in my code (since I haven't really started working on code, mostly theoretical reading so far)?
edit: I made a similar post yesterday, but my premise was too complicated to solve, so I am posting with a new, much easier premise. I chose not to ask in the first thread since it would mix up two different issues.
Assuming that you can use the FFT to find out which notes are playing at what time (this may prove to be difficult for distorted guitar chords), you can do this e.g. 10 times per second for both streams and then check how often the notes in both streams match. This will give you a percentage; if you need a binary value, you'd have to apply a threshold.
If the two streams are not the same length (different tempo) then you will have to stretch. You don't have to stretch the actual audio, just the times between the note measurements (e.g. every 100 ms for the first stream and every 125 ms for the second stream).
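A minimal sketch of that comparison in C, assuming the notes have already been extracted as one number per measurement interval (e.g. MIDI note numbers, with -1 meaning "no note detected"): positions in the reference are mapped proportionally onto the student's sequence and matches are counted.

/* Compare two note sequences of possibly different lengths by mapping
 * positions in the reference proportionally onto the student's sequence.
 * Returns the fraction of reference notes that match (0.0 .. 1.0). */
double compare_notes(const int *ref, int nref, const int *student, int nstu)
{
    if (nref == 0 || nstu == 0)
        return 0.0;

    int matches = 0;
    for (int i = 0; i < nref; ++i) {
        int j = (int)((long long)i * nstu / nref);   /* stretched index */
        if (ref[i] == student[j])
            ++matches;
    }
    return (double)matches / nref;
}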
So the biggest problem may be to find out what notes are playing at any given moment in time.
I'd start by constructing a mapping of frequencies to notes. It may also be a good idea to low-pass filter the signal at around 1100 Hz to get rid of some of the unwanted harmonics (you can't play much higher than that on a guitar anyway) and similarly high-pass filter it at 80 Hz. Then, after the FFT or DFT (not sure if it matters which you choose), find the frequencies that are close to the real note frequencies. Then pick the loudest one and those that are above a certain threshold relative to the loudest one (e.g. drop anything that is less than half as loud as the loudest one, but some experimentation will be needed to find a good threshold value).
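For the "mapping of frequencies to notes" step, a hedged C sketch assuming equal temperament with A4 = 440 Hz: convert a detected peak frequency to the nearest MIDI note number, and back again to see how far off the pitch is.

#include <math.h>

/* Map a detected frequency (Hz) to the nearest MIDI note number, assuming
 * equal temperament with A4 = 440 Hz (MIDI note 69). */
int freq_to_midi_note(double freq_hz)
{
    return (int)lround(69.0 + 12.0 * log2(freq_hz / 440.0));
}

/* Reference frequency of a MIDI note, useful for checking how close the
 * detected FFT peak is to the nominal pitch. */
double midi_note_to_freq(int note)
{
    return 440.0 * pow(2.0, (note - 69) / 12.0);
}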

how to align two meshes

I have a very nice & tricky question for you.
I need to align two meshes using a very fast algorithm. Given mesh1 and mesh2, I want to find how I need to translate and rotate mesh1 so that it is in the same position as mesh2.
Firstly I did this using the inertia moments of the two meshes, but the algorithm does not work if the second mesh is similar to the first one but with some missing parts. In other words, take two identical meshes and cut some parts off one of them.
I'd like to write the code in C because I need to run it on multiple platforms (Linux/Windows) and it has to be very fast: it will be used inside a genetic algorithm (GA).
The two meshes are in STL (stereolithography) format (binary or ASCII), but maybe another kind of file format would be more useful.
Do you have any idea how to perform this stuff?
question update:
first of all I want to thank you guys very much for all your suggestions. I've downloaded and installed PCL on my machine and successfully compiled the ICP (tutorial) algorithm taken from the PCL web site.
But now I have some questions about it, maybe because this is a brand new thing for me. What is the meaning of the 4x4 matrix output alongside the fitness? I was expecting a rotation matrix and a translation vector.
I hope some of you can help me.
If you need any other info please ask.
The Point Cloud Library (PCL) has several resources that you may find useful. As @Throwback1986 says, ICP is one excellent algorithm for aligning geometry. PCL also features other, often faster alignment algorithms, based on identifying and matching features of interest in two pieces of geometry. The library finds a lot of use in the robotics community, which, like you, is very performance conscious.
PCL is written in C++. While not as portable as straight C, they offer installation instructions for Windows, a few *nix flavors, and Mac OS. I've seen it running on iOS and Android as well. Check out the tutorials.
Iterative Closest Point (ICP) is one way of registering (aligning) 3D point clouds with rigid transformations. (It can also apply to meshes.)
Here is a good introduction: http://www.cs.duke.edu/courses/spring07/cps296.2/scribe_notes/lecture24.pdf
Here is a reasonable summary:
students.asl.ethz.ch/upl_pdf/314-report.pdf
Here is a matlab implementation:
http://www.mathworks.com/matlabcentral/fileexchange/12627-iterative-closest-point-method
Here are some potential optimizations:
http://www.cs.princeton.edu/~smr/papers/fasticp/
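On the question update about the 4x4 matrix: ICP implementations (PCL included) usually report a homogeneous transformation, which packs the rotation and translation you were expecting into a single matrix; the fitness score reported next to it is separate and is typically the mean squared distance between corresponding points after alignment. A minimal C sketch of reading it out and applying it to a point:

/* A 4x4 homogeneous rigid transform T combines rotation R and translation t:
 *
 *   [ R00 R01 R02 | tx ]
 *   [ R10 R11 R12 | ty ]
 *   [ R20 R21 R22 | tz ]
 *   [  0   0   0  |  1 ]
 *
 * Applying it to a point p gives p' = R * p + t. */
void apply_transform(const double T[4][4], const double p[3], double out[3])
{
    for (int i = 0; i < 3; ++i)
        out[i] = T[i][0] * p[0] + T[i][1] * p[1] + T[i][2] * p[2] + T[i][3];
}

So the upper-left 3x3 block is your rotation matrix and the rightmost column (minus the final 1) is your translation vector.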

Writing "Power" Efficient Code [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Power Efficient Software Coding
Adobe announced at Google I/O that its next version of Flash, 10.1, is going to be more efficient for devices where power consumption matters.
This got me to thinking: how do you write code that uses less power? Are there any helpful resources regarding this topic?
My guess would be that it is a combination of:
reducing the complexity of your application
writing efficient code that is executed quickly (presumably because processing time = power consumed)
There's actually one much bigger way to reduce power consumption that hasn't been touched on.
Let's take a computer and divide all functions into two basic groups. Those implemented in hardware and those implemented in software.
If a function is implemented in hardware (that is- there is circuitry for which you can put the inputs on one set of wires and the outputs come out another set of wires) then the power consumption is equal to the power consumed in the total number of gates. The clock ticks one time (draining a little power) and the bus goes hot for the output (draining a little power).
If a function is implemented in software (that is- there is no single circuit which is used to implement the function) then it requires the use of multiple circuits, multiple clock cycles, and often-times lots of memory calls. Keep in mind that SRAM (used for processor registers) is made of D flip-flops which are constantly draining power so long as they are in use.
As a simple example, let's look at the H.264 encoder. H.264 is a video encoding used by QuickTime videos. It's also used in MPEG videos, many AVIs, and it's used by Skype. Because it's so common someone sat down and found a way to make a chip in hardware to which you feed the encoded file on one end and the red, green, and blue video channels come out the other end.
Before this chip existed (and before Flash 10.1) you had to decode this using software. Decoding it involves lots of sines and cosines. Sine and cosine are transcendental functions (that is, there is no way to write them with the four basic math operations without an infinite series). This means that the best you could do was run a loop 32-64 times, getting gradually more accurate, with each iteration of the loop adding, multiplying, and dividing. Each iteration of the loop also moves values in and out of registers (which, as you recall, uses power).
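As a toy illustration of that kind of iterative software approximation (not the actual decoder code): a truncated Taylor series for sine, where every extra loop iteration buys a little more accuracy at the cost of more multiplies and divides.

/* Approximate sin(x) with a truncated Taylor series:
 *   sin(x) = x - x^3/3! + x^5/5! - ...
 * Each iteration refines the result a little, at the cost of a multiply
 * and a divide; this is the repeated work a dedicated hardware block
 * avoids. Accuracy is best for |x| near 0. */
double sine_approx(double x, int terms)
{
    double term = x;      /* current term of the series */
    double sum  = x;
    for (int n = 1; n < terms; ++n) {
        term *= -x * x / ((2 * n) * (2 * n + 1));
        sum  += term;
    }
    return sum;
}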
Flash used to decode video by mathematically decoding it in software. Now it just says "pass the video to the H.264 chip". Of course it also has to check for the existence of this chip and use software if it doesn't exist. This means Flash, as a whole, is now larger. But on any system (like HTC phones) with an H.264 chip, it now uses less power.
Apply this same logic for:
Multiplying (adding multiple times in software)
Modulus (division and subtraction in software)
Comparing (subtracting and checking if negative in software)
Drawing (sines/cosines/nastiness in software. Easy to pass to a videocard)
Seeing as this is probably aimed at embedded devices, I would venture to say that the best way to save power is to not be on, and to minimize how long the device is on. This means putting the processor to sleep and waking it up only when work needs to be done. The best way I can think of to do this would be to make the application entirely interrupt-driven.
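A hedged sketch of that pattern in C for a Cortex-M-class microcontroller (the __WFI() intrinsic and the handler name depend on the vendor's CMSIS/startup code, so treat them as assumptions): the CPU sleeps until an interrupt fires, and the main loop only runs when there is actually work to do.

#include <stdbool.h>

static volatile bool work_pending = false;

/* Interrupt handler: the name and vector depend on the vendor's startup
 * code (EXTI0 on an STM32 is just an example). It only flags the work so
 * the handler itself stays short. */
void EXTI0_IRQHandler(void)
{
    /* ...clear the peripheral's interrupt flag here... */
    work_pending = true;
}

int main(void)
{
    /* ...configure clocks, peripherals, and enable the interrupt here... */
    for (;;) {
        if (work_pending) {
            work_pending = false;
            /* ...do the actual work... */
        }
        __WFI();   /* sleep until the next interrupt; provided by CMSIS */
        /* A production version would also guard against the race where the
         * interrupt fires between the check above and the sleep. */
    }
}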
In addition to Kevin's suggestion, I would think that minimizing Internet communications would help. This would include fetching data in bulk so more time can be spent asleep.
Also keep in mind that accessing devices like drives and wifi increases power consumption. Try to minimize access to such devices.

Is it possible to programmatically edit a sound file based on frequency?

Just wondering if it's possible to go through a FLAC, MP3, WAV, etc. file and edit portions of it, or the entire file, by removing sections based on a specific frequency range?
So for example, I have a recording of a friend reciting a poem with a few percussion instruments in the background. Could I write a C program that goes through the entire file and cuts out everything except the vocals (human voice frequency ranges from 85-255 Hz, from what I've been reading)?
Thanks in advance for any ideas!
To address the OP's specific example: I think your understanding of human voice frequency is wrong. Perhaps the fundamental frequency of male spoken voice stays in that range (for tenor singing, or female speech or singing, or shouting, even the fundamental will go much higher, maybe 500-1000 Hz). But that doesn't even matter, because even if the fundamental is low, the overtones which create the different vowel sounds will go up to 2000-4000 Hz or higher. And the frequencies which define "noise" consonants like "t" and "s" go all the way to the top of the audio range, say 5000-10000 Hz. Percussion fills this same audio range, so I doubt that you can separate voice and percussion by filtering certain frequencies in or out.
It is certainly possible, otherwise digital studio mixing software wouldn't exist.
What you're effectively asking for is to attenuate frequency ranges across an entire file. In analog land, you would apply a low-pass and a high-pass filter (or some other combination of filters) to attenuate the frequencies.
In software, you'd solve this problem by writing a digital filter of sorts that attenuates particular frequencies. The frequencies present can be identified via an FFT computation.
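As a hedged sketch of the simplest such filter in C (a first-order low-pass; the coefficient is derived from the cutoff frequency): it attenuates content above the cutoff rather than removing it cleanly, which is part of why isolating a voice this way is hard.

/* First-order (one-pole) low-pass filter over raw samples: attenuates
 * frequencies above the cutoff rather than removing them sharply. A
 * high-pass can be formed by subtracting the low-passed signal from the
 * original, and chaining the two gives a crude band-pass. */
void lowpass(const float *in, float *out, int n,
             float cutoff_hz, float sample_rate_hz)
{
    float rc    = 1.0f / (6.2831853f * cutoff_hz);   /* 1 / (2*pi*fc) */
    float dt    = 1.0f / sample_rate_hz;
    float alpha = dt / (rc + dt);
    float y     = 0.0f;

    for (int i = 0; i < n; ++i) {
        y = y + alpha * (in[i] - y);   /* exponential smoothing */
        out[i] = y;
    }
}

Note that this operates on decoded PCM samples; for MP3 or FLAC you would first decode to raw audio, filter, and re-encode.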
The fastest thing to do would be to use an audio editing app and apply the changes there.
There is an audio library called PortAudio that may provide some support for editing an audio stream at the numerical level. It is written in C, and has a C API.
If you want to test out audio processing algorithms I strongly suggest SuperCollider. It's free and has many kinds of audio filters built in. But eliminating voice could require considerable tweaking. SuperCollider will allow you to write code driven by various parameters and then hook those parameters up to a GUI which you'll be able to tweak while supplying it with live (or recorded) data.
Even if you want to write C code, you'll learn a lot from using SuperCollider first. Many of the filters are surprisingly easy to implement in C, but you'll need to write a certain amount of framework code before you can get started.
Additionally, I learnt quite a bit about writing digital audio filters from this book. Among other things, it discusses some of the characteristics of human speech, as well as how to build filters to selectively enhance or knock out particular frequencies. It also provides working C code.
SciPy can do all sorts of signal processing.
You can also use Max/MSP (which is paid) or Pure Data (which is free) for working with music algorithms; they are the basis from which SuperCollider was created, and they are excellent software if you want to do that in real-time environments.

Resources