Tf-idf with cosine similarity for document similarity of nearly identical sentences - tf-idf

I am using tf-idf with cosine similarity to calculate description (sentence) similarity.
Input string:
3/4x1/2x3/4 blk mi tee
Below are the sentences among which I need to find the one most similar to the input string:
smith-cooper® 33rt1 reducing pipe tee 3/4 x 1/2 x 3/4 in npt 150 lb malleable iron black
smith-cooper® 33rt1 reducing pipe tee 1 x 1/2 x 3/4 in npt 150 lb malleable iron black
smith-cooper® 33rt1 reducing pipe tee 1-1/4 x 1 x 3/4 in npt 150 lb malleable iron black
smith-cooper® 33rt1 reducing pipe tee 1-1/2 x 3/4 x 1-1/2 in npt 150 lb malleable iron black
smith-cooper® 33rt1 reducing pipe tee 1-1/2 x 1-1/4 x 1 in npt 150 lb malleable iron black
smith-cooper® 33rt1 reducing pipe tee 2 x 2 x 3/4 in npt 150 lb malleable iron black
smith-cooper® 33rt1 reducing pipe tee 2 x 1-1/2 x 1-1/4 in npt 150 lb malleable iron black
smith-cooper® 33rt1 reducing pipe tee 2-1/2 x 2 x 2 in npt 150 lb malleable iron black
smith-cooper® 33rt1 reducing pipe tee 3 x 3 x 2 in npt 150 lb malleable iron black
As the sentences are almost identical, I am using the tf-idf approach, which gives a low score to words that appear in every document (idf) and a higher score to unique words, which makes it easier to find the most similar document.
Is there any approach that works better than this?

There are certainly other approaches such as latent semantic analysis, but what will work best totally depends on your data/corpus. In my experience, TF-IDF is a good starting point. More sophisticated approaches may underperform TF-IDF, or provide a negligible improvement relative to their complexity.
Something to experiment with in TF-IDF is different-sized n-grams, along with other pre-processing strategies for your corpus. Given your example, you may not want to tokenize purely on word boundaries; maybe you want to treat some of those sentence components, e.g. 3/4 x 1/2 x 3/4, as a single term. I'd experiment with different-sized n-grams first.
In your example, the sentences are identical except for the measurements/dimensions. If this sample is representative, you may want to put more thought into how to measure distances between those measurements.
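For instance, a minimal sketch of the character n-gram idea with scikit-learn (assumed available); the query and descriptions are the ones from the question, and the (2, 4) n-gram range is just a starting point to tune:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

query = "3/4x1/2x3/4 blk mi tee"
candidates = [
    "smith-cooper® 33rt1 reducing pipe tee 3/4 x 1/2 x 3/4 in npt 150 lb malleable iron black",
    "smith-cooper® 33rt1 reducing pipe tee 1 x 1/2 x 3/4 in npt 150 lb malleable iron black",
    # ... the remaining descriptions from the question ...
]

# Character n-grams keep fragments like "3/4" intact better than plain word splits.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
matrix = vectorizer.fit_transform(candidates + [query])

scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
best = int(scores.argmax())
print(scores[best], candidates[best])
```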

Related

WAV File ambiguity

I have an 8-bit mono WAV file, so each byte of the file represents one sample (1/44100 of a second).
I want to know if the sound, which is heard when the wav file is played, depends only on the samples or does it depend on the difference between two consecutive samples?
Will the following pairs of consecutive samples produce the same sound?
1st pair :20 and 22
2nd pair :78 and 80
Please help me...
The sound is in the differences between consecutive samples. In your example 20 22 78 80 you will "hear" mostly the jump 22->78 (if you "play" the four samples). The sample values define the voltage on the speaker, which moves the speaker membrane to the corresponding position. The neutral (middle) position of the membrane corresponds to the middle sample value (in the 8-bit case that is 127 or 128; 0 and 255 are the lowest and highest positions).
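A tiny sketch of that mapping for the samples in the question (the 128 midpoint is the neutral membrane position described above):

```python
samples = [20, 22, 78, 80]

positions = [s - 128 for s in samples]                 # distance from the neutral position
steps = [b - a for a, b in zip(samples, samples[1:])]  # what you actually "hear"
print(positions)   # [-108, -106, -50, -48]
print(steps)       # [2, 56, 2] -- the 22 -> 78 jump dominates
```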
It's almost meaningless to talk about the "same sound" for just two 44 kHz samples, but you can think of both pairs as representing the same frequency (i.e. pitch), with the latter significantly louder than the former.

Why does convolution with kernels work?

I don't understand how someone could come up with a simple 3x3 matrix, called a kernel, such that when applied to an image it produces some awesome effect. Examples: http://en.wikipedia.org/wiki/Kernel_(image_processing). Why does it work? How did people come up with those kernels (trial and error?)? Is it possible to prove it will always work for all images?
I don't understand how someone could come up with a simple 3x3 matrix, called a kernel, such that when applied to an image it produces some awesome effect. Examples: http://en.wikipedia.org/wiki/Kernel_(image_processing).
If you want to dig into the history, you'll need to check some other terms. In older textbooks on image processing, what we think of as kernels today are more likely to be called "operators." Another key term is convolution. Both these terms hint at the mathematical basis of kernels.
http://en.wikipedia.org/wiki/Convolution
You can read about mathematical convolution in the textbook Computer Vision by Ballard and Brown. The book dates back to the early 80s, but it's still quite useful, and you can read it for free online:
http://homepages.inf.ed.ac.uk/rbf/BOOKS/BANDB/toc.htm
From the table of contents to the Ballard and Brown book you'll find a link to a PDF for section 2.2.4 Spatial Properties.
http://homepages.inf.ed.ac.uk/rbf/BOOKS/BANDB/LIB/bandb2_2.pdf
In the PDF, scroll down to the section "The Convolution Theorem." This provides the mathematical background for convolution. It's a relatively short step from thinking about convolution expressed as functions and integrals to the application of the same principles to the discrete world of grayscale (or color) data in 2D images.
You will notice that a number of kernels/operators are associated with names: Sobel, Prewitt, Laplacian, Gaussian, and so on. These names help suggest that there's a history--really quite a long history--of mathematical development and image processing research that has led to the large number of kernels in common use today.
Gauss and Laplace lived long before us, but their mathematical work has trickled down into forms we can use in image processing. They didn't work on kernels for image processing, but mathematical techniques they developed are directly applicable and commonly used in image processing. Other kernels were developed specifically for processing images.
The Prewitt operator (kernel), which is quite similar to the Sobel operator, was published in 1970, if Wikipedia is correct.
http://en.wikipedia.org/wiki/Prewitt_operator
Why does it work?
Read about the mathematical theory of convolution to understand how one function can be "passed over" or "dragged" across another. That can explain the theoretical basis.
Then there's the question of why individual kernels work. If you look at an edge transition from dark to light in an image, and you plot the pixel brightness along that transition on a 2D plot, you'll notice that the values on the Y-axis increase rapidly around the edge transition. That edge transition is a slope. A slope can be found using the first derivative. Tada! A kernel that approximates a first derivative operator will find edges.
If you know there's such a thing in optics as Gaussian blur, then you might wonder how it could be applied to a 2D image. Thus the derivation of the Gaussian kernel.
The Laplacian, for instance, is an operator that, according to the first sentence from the Wikipedia entry, "is a differential operator given by the divergence of the gradient of a function on Euclidean space."
http://en.wikipedia.org/wiki/Laplacian
Hoo boy. It's quite a leap from that definition to a kernel. The following page does a fine job of explaining the relationship between derivatives and kernels, and it's a quick read:
http://www.aishack.in/2011/04/the-sobel-and-laplacian-edge-detectors/
You'll also see that one form of the Laplacian kernel is simply named the "edge-finding" kernel in the Wikipedia entry you cited.
There is more than one edge-finding kernel, and each has its place. The Laplacian, Sobel, Prewitt, Kirsch, and Roberts kernels all yield different results, and are suited for different purposes.
How did people come up with those kernels (trial and error?)?
Kernels were developed by different people following a variety of research paths.
Some kernels (to my memory) were developed specifically to model the process of "early vision." Early vision isn't what happens only to early humans, or only for people who rise at 4 a.m., but instead refers to the low-level processes of biological vision: sensing of basic color, intensity, edges, and that sort of thing. At the very low level, edge detection in biological vision can be modeled with kernels.
Other kernels, such as the Laplacian and Gaussian, are approximations of mathematical functions. With a little effort you can derive the kernels yourself.
Image editing and image processing software packages will often allow you to define your own kernel. For example, if you want to identify a shape in an image small enough to be defined by a few connected pixels, then you can define a kernel that matches the shape of the image feature you want to detect. Using custom kernels to detect objects is too crude to work in most real-world applications, but sometimes there are reasons to create a special kernel for a very specific purpose, and sometimes a little trial and error is necessary to find a good kernel.
As user templatetypedef pointed out, you can think of kernels intuitively, and in a fairly short time develop a feel for what each would do.
Is it possible to prove it will always work for all images?
Functionally, you can throw a 3x3, 5x5, or NxN kernel at an image of the appropriate size and it'll "work" in the sense that the operation will be performed and there will be some result. But the ability to compute a result, useful or not, isn't a great definition of "works."
One informal definition of whether a kernel "works" is whether convolving an image with that kernel produces a result that you find useful. If you're manipulating images in Photoshop or GIMP, and if you find that a particular enhancement kernel doesn't yield quite what you want, then you might say that kernel doesn't work in the context of your particular image and the end result you want. In image processing for computer vision there's a similar problem: we must pick one or more kernels and other (often non-kernel based) algorithms that will operate in sequence to do something useful such as identify faces, measure the velocity of cars, or guide robots in assembly tasks.
Homework
If you want to understand how you can translate a mathematical concept into a kernel, it helps to derive a kernel by yourself. Even if you know what the end result of the derivation should be, to grok the notion of kernels and convolution it helps to derive a kernel from a mathematical function by yourself, on paper, and (preferably) from memory.
Try deriving the 3x3 Gaussian kernel from the mathematical function.
http://en.wikipedia.org/wiki/Gaussian_function
Deriving the kernel yourself, or at least finding an online tutorial and reading closely, will be quite revealing. If you'd rather not do the work, then you may not appreciate the way that some mathematical expression "translates" to a bunch of numbers in a 3x3 matrix. But that's okay! If you get a general sense of how a common kernel is useful, and if you observe how two similar kernels produce slightly different results, then you'll develop a good feel for them.
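If you want to check a paper derivation numerically, here is a minimal sketch of sampling and normalizing the 2D Gaussian (sigma is a free parameter; with sigma ≈ 0.85 the result lands very close to the familiar [1 2 1; 2 4 2; 1 2 1]/16 blur kernel):

```python
import numpy as np

# Sample the 2D Gaussian on a small grid and normalize so the weights sum to 1.
def gaussian_kernel(size=3, sigma=0.85):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return g / g.sum()

print(np.round(gaussian_kernel(3, 0.85), 3))   # ≈ [[1, 2, 1], [2, 4, 2], [1, 2, 1]] / 16
```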
Intuitively, a convolution of an image I with a kernel K produces a new image that's formed by computing a weighted sum, for each pixel, of all the nearby pixels weighted by the weights in K. Even if you didn't know what a convolution was, this idea still seems pretty reasonable. You can use it to do a blur effect (by using a Gaussian weighting of nearby pixels) or to sharpen edges (by subtracting each pixel from its neighbors and putting no weight anywhere else.) In fact, if you knew you needed to do all these operations, it would make sense to try to write a function that given I and K did the weighted sum of nearby pixels, and to try to optimize that function as aggressively as possible (since you'd probably use it a lot).
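A rough sketch of that weighted-sum function, assuming a grayscale image stored as a 2D numpy array and an odd-sized kernel (strictly speaking, convolution also flips the kernel; for the symmetric kernels discussed here that makes no difference):

```python
import numpy as np

def apply_kernel(I, K):
    # For each pixel, take the window of neighbours the same size as K and
    # compute the weighted sum. Edges are handled by repeating border pixels.
    kh, kw = K.shape
    pad_h, pad_w = kh // 2, kw // 2
    padded = np.pad(I, ((pad_h, pad_h), (pad_w, pad_w)), mode="edge")
    out = np.zeros(I.shape, dtype=float)
    for y in range(I.shape[0]):
        for x in range(I.shape[1]):
            out[y, x] = np.sum(padded[y:y + kh, x:x + kw] * K)
    return out
```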
To get from there to the idea of a convolution, you'd probably need to have a background in Fourier transforms and Fourier series. Convolutions are a totally natural idea in that domain - if you compute the Fourier transformation of two images and multiply the transforms together, you end up computing the transform of the convolution. Mathematicians had worked that out a while back, probably by answering the very natural question "what function has a Fourier transform defined by the product of two other Fourier transforms?," and from there it was just a matter of time before the connection was found. Since Fourier transforms are already used extensively in computing (for example, in signal processing in networks), my guess is that someone with a background in Fourier series noticed that they needed to apply a kernel K to an image I, then recognized that this is way easier and more computationally efficient when done in frequency space.
I honestly have no idea what the real history is, but this is a pretty plausible explanation.
Hope this helps!
There is a good deal of mathematical theory about convolutions, but the kernel examples you link to are simple to explain intuitively:
[ 0 0 0]
[ 0 1 0]
[ 0 0 0]
This one says to take the original pixel and nothing else, so it yields just the original image.
[-1 -1 -1]
[-1 8 -1]
[-1 -1 -1]
This one says to subtract the eight neighbors from eight times the original pixel. First consider what happens in a smooth part of the image, where there is solid, unchanging color. Eight times the original pixel equals the sum of eight identical neighbors, so the difference is zero. Thus, smooth parts of the image become black. However, parts of the images where there are changes do not become black. Thus, this kernel highlights changes, so it highlights places where one shape ends and another begins: the edges of objects in the image.
[ 0 1 0]
[ 1 -4 1]
[ 0 1 0]
This is similar to the one above, but it is tuned differently.
[ 0 -1 0]
[-1 5 -1]
[ 0 -1 0]
Observe that this is just the negation of the edge detector above plus the first filter we saw, the one for the original image. So this kernel both highlights edges and adds that to the original image. The result is the original image with more visible edges: a sharpening effect.
[ 1 2 1]
[ 2 4 2]
[ 1 2 1]
[ 1 1 1]
[ 1 1 1]
[ 1 1 1]
Both of these blend the original pixel with its neighbors (once the weights are normalized to sum to 1, i.e. divided by 16 and 9 respectively), so they blur the image a little.
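A quick numerical check of those relationships, with numpy arrays standing in for the matrices above:

```python
import numpy as np

identity = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]])
edge     = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]])
sharpen  = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])

# The sharpening kernel really is the identity kernel minus the edge detector.
assert (identity - edge == sharpen).all()

# The blur kernels only need normalizing so their weights sum to 1.
gaussian_blur = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 16
box_blur      = np.ones((3, 3)) / 9
```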
There are two ways of thinking about (or encoding) an image: the spatial domain and the frequency domain. A spatial representation is based on pixels, so it's more familiar and easier to obtain. Both the image and the kernel are expressed in the spatial domain.
To get to the frequency domain, you need to use a Fourier or related transform, which is computationally expensive. Once you're there, though, many interesting manipulations are simpler. To blur an image, you can just chop off some high-frequency parts — like cropping the image in the spatial domain. Sharpening is the opposite, akin to increasing the contrast of high-frequency information.
Most of the information of an image is in the high frequencies, which represent detail. Most interesting detail information is at a small, local scale. You can do a lot by looking at neighboring pixels. Blurring is basically taking a weighted average of neighboring pixels. Sharpening consists of looking at the difference between a pixel and its neighbors and enhancing the contrast.
A kernel is usually produced by taking a frequency-domain transformation, then keeping only the high-frequency part and expressing it in the spatial domain. This can only be done for certain transformation algorithms. You can compute the ideal kernel for blurring, sharpening, selecting certain kinds of lines, etc., and it will work, though without the frequency-domain view it can seem like magic because we don't really have a "pixel arithmetic."
Once you have a kernel, of course, there's no need to get into the frequency domain at all. That hard work is finished, conceptually and computationally. Convolution is pretty friendly to all involved, and you can seldom simplify any further. Of course, smaller kernels are friendlier. Sometimes a large kernel can be expressed as a convolution of small sub-kernels, which is a kind of factoring in both the math and software senses.
The mathematical process is pretty straightforward and has been studied since long before there were computers. Most common manipulations can be done mechanically on an optical bench using 18th century equipment.
I think the best way to explain them is to start in 1d and discuss the z-transform and its inverse. That switches from the time domain to the frequency domain — from describing a wave as a timed sequence of samples to describing it as the amplitude of each frequency that contributes to it. The two representations contain the same amount of information, they just express it differently.
Now suppose you had a wave described in the frequency domain and you wanted to apply a filter to it. You might want to remove high frequencies. That would be a blur. You might want to remove low frequencies. That would be a sharpen or, in extremis, an edge detect.
You could do that by just forcing the frequencies you don't want to 0 — e.g. by multiplying the entire range by a particular mask, where 1 is a frequency you want to keep and 0 is a frequency you want to eliminate.
But what if you want to do that in the time domain? You could transfer to the frequency domain, apply the mask, then transform back. But that's a lot of work. So what you do (approximately) is transform the mask from the frequency domain to the time domain. You can then apply it in the time domain.
Following the maths involved for transforming back and forth, in theory to apply that you'd have to make each output sample a weighted sum of every single input sample. In the real world you make a trade-off. You use the sum of, say, 9 samples. That gives you a smaller latency and less processing cost than using, say, 99 samples. But it also gives you a less accurate filter.
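A 1D numpy sketch of that trade-off (the signal, the cutoff, and the 9-tap truncation are all arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal(256)

# Exact filtering in the frequency domain: transform, zero the high bins, return.
spectrum = np.fft.rfft(signal)
mask = np.ones_like(spectrum)
mask[32:] = 0                                   # keep only the lowest 32 bins (a blur)
blurred_exact = np.fft.irfft(mask * spectrum, n=len(signal))

# Approximate filtering in the time domain: the mask's inverse transform is the
# full-length filter; truncate it to 9 centred taps and convolve.
full_filter = np.fft.irfft(mask, n=len(signal))
taps = np.roll(full_filter, 4)[:9]
blurred_approx = np.convolve(signal, taps, mode="same")

print("max difference:", np.max(np.abs(blurred_exact - blurred_approx)))
```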
A graphics kernel is the 2d analogue of that line of thought. They tend to be small because processing cost grows with the square of the edge length so it gets expensive very quickly. But you can approximate any sort of frequency domain limiting filter.

Perfect/ideal hash to isolate anagrams

In an effort to accelerate fast-out behaviour on testing strings for anagrams, I came up with a prime-based hashing scheme -- although it looks like I wasn't the first.
The basic idea is to map letters to prime numbers, and to compute the product of these primes. Any rearrangement of the letters will have the same product, and if the result can be arbitrarily large then no combination of other letters can produce the same result.
I had initially envisioned this as just a hash. Eventually the product would overflow and start to alias other letter combinations. However, by mapping the most frequent letters to the smallest primes the product grows slowly and can often avoid overflow altogether. In this case we get a perfect hash, giving both definite positive and negative results without additional testing.
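For concreteness, here is a sketch of the scheme as described (Python for brevity; the letter-frequency ordering and the 64-bit limit are assumptions, and the catch-all prime 103 matches the largest factor mentioned below):

```python
# Most frequent letters get the smallest primes so the product grows slowly.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
FREQ_ORDER = "etaoinshrdlcumwfgypbvkjxqz"        # rough English letter frequency
PRIME_OF = {c: p for c, p in zip(FREQ_ORDER, PRIMES)}

def anagram_hash(word, limit=2**64):
    h = 1
    for c in word.lower():
        h *= PRIME_OF.get(c, 103)                # lump anything unexpected onto 103
        if h >= limit:
            return None                          # overflow: fall back to a weaker hash
    return h

# Anagrams hash identically as long as the product stays under the limit.
assert anagram_hash("listen") == anagram_hash("silent")
```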
What's notable is that it doesn't fill the coding space very efficiently before overflowing. No result will have any prime factors greater than 103, and the distribution of small primes is fixed and not necessarily a great match to letter frequency.
Now I'm wondering if there's something substantially better than this. Something that covers more results with perfect hashes and has strong distribution in the remaining cases.
The densest coding scheme I can think of is to sort the letters and then pack them into a word with an entropy coder. In this scheme the letter frequency will obviously be enormously biased because of the range constraints applied to each position (e.g., the likelihood of a sorted array starting with z is substantially lower than that of a sorted array ending with a z).
That sounds like a whole lot of work, though -- and I can't see it guaranteeing to give good distribution in the overflow case.
Perhaps there's a better set of factors to map the letters to, and a better way to detect when the risk of aliasing has started. Or a hashing scheme that doesn't rely on multiplication? Something that's easy to calculate?
So that's:
A perfect hash for as much real-world input as possible (for some sensible number of bits).
A strong hash for remaining cases, with a means of distinguishing the two cases.
Easy to calculate.
English language constraints (26 letters with typical English-like word structure) will do fine. Multi-byte coding schemes are a whole other problem.
C code preferred because I understand it.
If you are using n-bit hashes with an alphabet of size m, you can get a unique hash for anagrams up to (n-m) characters long using the approach I described here. This makes collision detection unnecessary but it does limit your word size depending on the size of the alphabet and your available space.
To allow words of any length, I would use n-1 bits to do that hash for words up to (n-m-1) characters in length, and save the last bit to signal that the word is m characters or longer. In those cases you would use the remaining n-1 bits for your prime-number or other hashing algorithm, but of course you would have to do collision detection anytime you got multiple words in those buckets. Since in a real-world application the majority of the words will occupy the shorter word lengths, you'll drastically cut the collision detection needed for the longer words.
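The linked description isn't reproduced here, so the following is just one plausible reading of the bit-packed histogram idea (the "alphabet unary" scheme in the results table further down): write each letter's count in unary with a 0-bit separator, so a word of length L over an m-letter alphabet needs L + m bits and anagrams collide exactly. A Python sketch, with big integers standing in for the fixed-width words you'd use in C:

```python
def unary_histogram_hash(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    h = 0
    for letter in alphabet:
        count = word.lower().count(letter)
        h = (h << (count + 1)) | ((1 << count) - 1)   # a 0 separator, then `count` ones
    return h

assert unary_histogram_hash("listen") == unary_histogram_hash("silent")
assert unary_histogram_hash("listen") != unary_histogram_hash("listed")
```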
Implemented flancor's answer here: https://github.com/sh1boot/anagram/blob/master/bitgram.c
Using http://ftp.debian.org/debian/pool/main/s/scowl/wbritish_7.1-1_all.deb as the
dictionary, I get the following proportions of perfect codes (both positive and
negative matches from the hash comparison) using these coding schemes:
order  | frequency | alphabet | frequency | alphabet | frequency
code   | prime     | unary    | unary     | huffman  | huffman
-------|-----------|----------|-----------|----------|----------
64-bit |    93.95% |  100.00% |   100.00% |  100.00% |   100.00%
32-bit |    24.60% |   69.14% |    74.23% |   86.57% |    90.58%
28-bit |    15.20% |   41.75% |    60.27% |   68.16% |    74.34%
24-bit |     8.21% |   13.14% |    38.55% |   42.67% |    50.34%
20-bit |     3.72% |    3.35% |    16.73% |   19.41% |    25.59%
16-bit |     1.31% |    0.82% |     4.86% |    5.51% |     9.18%
This shows that any variant of the bit-packed histogram encoding can capture
the whole dictionary in perfect 64-bit hashes, and more than two thirds of the
dictionary in 32-bit hashes; whereas the prime products scheme struggles to
reach 100% even in 64-bit, and can't even cover a quarter of the dictionary in
a 32-bit hash.
This list does contain a lot of apostrophes, though, and almost all (but not
all) of the 64-bit misses include apostrophes. Putting that character in its
proper place in the frequency table might help.
Some additional tests:
frequency/huffman, overflowed:
88aff7eb53ef9940: Pneumonoultramicroscopicsilicovolcanoconiosis
frequency/huffman, 64-bit perfect:
17415fbc30c2ebf7: Chargoggagoggmanchauggagoggchaubunagungamaugg
018a6b5dda873fba: Supercalifragilisticexpialidocious
00021ae1bcf50dba: Pseudopseudohypoparathyroidism
00070003dd83b5fb: Floccinaucinihilipilification
00002b800ebbdf7a: Antidisestablishmentarianism
000006c6ab75b3f9: Honorificabilitudinitatibus
0000000421b2ad94: glyptoperichthys
0000000201b2ad94: pterygoplichtys
It's probably the case that the tuning should be done against words that are likely to fall near the boundary, as they may have properties (e.g., language of origin) which strongly influence their distribution, and making shorter words longer doesn't matter until they overflow.

detecting outliers in a sparse distribution?

I would like to find the best way to detect outliers. Here is the problem and some things which probably will not work. Let's say we want to fish out some quasi-uniform data from a dirty varchar(50) column in MySQL. Let's start by doing an analysis of string length.
| strlen | freq  |
|      0 |  2312 |
|      3 |    45 |
|      9 |    75 |
|     10 | 15420 |
|     11 |   395 |
|     12 |   114 |
|     19 |    27 |
|     20 |  1170 |
|     21 |    33 |
|     35 |     9 |
What I would like to do is devise an algorithm to determine which string lengths have a high probability of being purposeful rather than typos or random garbage. This field has the possibility of being an "enum" type, so there can be several frequency spikes for valid values. Clearly 10 and 20 are valid, and 0 is just omitted data. 35 and 3 might be random trash despite being very different in frequency. 19 and 21 might be typos around the 20 format. 11 might be typos for 10, but what about 12?
It seems simply using occurrence frequency percentages is not enough. There need to be hotspots of higher "just an error" probability around the obvious outliers.
Also, having a fixed threshold fails when there are 15 unique lengths which vary between 5 and 20 chars, each with between 7% and 20% occurrence.
Standard deviation will not work because it relies on the mean. Median absolute deviation probably won't work because you can have a high-frequency outlier that cannot be discarded.
Yes, there will be other params for cleaning the data in the code, but length seems to very quickly pre-filter and classify fields with any amount of structure.
Are there any known methods which would work efficiently? I'm not very familiar with Bayesian filters or machine learning, but maybe they can help?
Thanks!
Leon
Sounds like anomaly detection is the way to go. Anomaly detection is a kind of machine learning that is used to find outliers. It comes in a couple of varieties, including supervised and unsupervised. In supervised learning, the algorithm is trained using examples of outliers. In unsupervised learning, the algorithm attempts to find outliers without any examples. Here are a couple of links to start out:
http://en.wikipedia.org/wiki/Anomaly_detection
http://s3.amazonaws.com/mlclass-resources/docs/slides/Lecture15.pdf
I didn't find any links to readily available libraries. Something like MATLAB, or its free cousin Octave, might be a nice way to go if you can't find an anomaly detection library in your language of choice. https://goker.wordpress.com/tag/anomaly-detection/
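Not a library, but as a toy unsupervised sketch of the idea on the length histogram from the question: estimate p(length) from the counts, flag anything whose share falls below a threshold, and treat flagged lengths adjacent to a big spike as probable typos. The epsilon threshold, the 5% "spike" cutoff, and the ±2 window are arbitrary knobs, and length 0 would normally be treated as missing data up front.

```python
counts = {0: 2312, 3: 45, 9: 75, 10: 15420, 11: 395, 12: 114,
          19: 27, 20: 1170, 21: 33, 35: 9}
total = sum(counts.values())
epsilon = 0.01                       # minimum share to count as "purposeful"

for length, freq in sorted(counts.items()):
    share = freq / total
    near_spike = any(counts.get(length + d, 0) / total > 0.05 for d in (-2, -1, 1, 2))
    if share >= epsilon:
        label = "likely valid"
    elif near_spike:
        label = "possible typo of a neighbouring format"
    else:
        label = "random garbage?"
    print(f"{length:>3} {share:8.3%}  {label}")
```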

How to extract semi-precise frequencies from a WAV file using Fourier Transforms

Let us say that I have a WAV file. In this file is a series of sine tones at precise 1-second intervals. I want to use the FFTW library to extract these tones in sequence. Is this particularly hard to do? How would I go about it?
Also, what is the best way to write tones of this kind into a WAV file? I assume I would only need a simple audio library for the output.
My language of choice is C
To get the power spectrum of a section of your file:
collect N samples, where N is a power of 2 - if your sample rate is 44.1 kHz for example and you want to sample approx every second then go for say N = 32768 samples.
apply a suitable window function to the samples, e.g. Hanning
pass the windowed samples to an FFT routine - ideally you want a real-to-complex FFT but if all you have is a complex-to-complex FFT then pass 0 for all the imaginary input parts
calculate the squared magnitude of your FFT output bins (re * re + im * im)
(optional) calculate 10 * log10 of each magnitude squared output bin to get a magnitude value in dB
Now that you have your power spectrum you just need to identify the peak(s), which should be pretty straightforward if you have a reasonable S/N ratio. Note that frequency resolution improves with larger N. For the above example of 44.1 kHz sample rate and N = 32768 the frequency resolution of each bin is 44100 / 32768 = 1.35 Hz.
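The same steps in numpy form, purely for illustration (the question targets FFTW in C, but the flow is identical); the 1 kHz tone is a stand-in for real data:

```python
import numpy as np

sample_rate = 44100
N = 32768
t = np.arange(N) / sample_rate
samples = np.sin(2 * np.pi * 1000.0 * t)           # stand-in for one buffer of the WAV

windowed = samples * np.hanning(N)                 # Hanning window
spectrum = np.fft.rfft(windowed)                   # real-to-complex FFT
power = spectrum.real**2 + spectrum.imag**2        # squared magnitude per bin
power_db = 10 * np.log10(power + 1e-12)            # optional dB scale

peak_bin = int(np.argmax(power))
print("peak at about", peak_bin * sample_rate / N, "Hz")   # bin width = 44100/32768 ≈ 1.35 Hz
```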
You are basically interested in estimating a spectrum - assuming you've already gone past the stage of reading the WAV and converting it into a discrete-time signal.
Among the various methods, the most basic is the periodogram, which amounts to taking a windowed Discrete Fourier Transform (with an FFT) and keeping its squared magnitude. This corresponds to Paul's answer. You need a window which spans several periods of the lowest frequency you want to detect. Example: if your sinusoids can be as low as 10 Hz (period = 100 ms), you should take a window of 200 ms or 300 ms or so (or more). However, the periodogram has some disadvantages, though it's simple to compute and it's more than enough if high precision is not required:
The raw periodogram is not a good spectral estimate because of spectral bias and the fact that the variance at a given frequency does not decrease as the number of samples used in the computation increases.
The periodogram can perform better by averaging several windows, with a judicious choice of the widths (Bartlett's method). And there are many other methods for estimating the spectrum (AR modelling).
Actually, you are not exactly interested in estimating a full spectrum, but only the location of a single frequency. This can be done by seeking a peak of an estimated spectrum (computed as explained), but also by more specific and powerful (and complicated) methods (Pisarenko, the MUSIC algorithm). They would probably be overkill in your case.
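A quick scipy illustration of the averaging idea (Bartlett/Welch): split the signal into segments, window each, and average their periodograms; the test tone and segment length are arbitrary choices, and scipy is assumed only for the demonstration (the question prefers C):

```python
import numpy as np
from scipy import signal

sample_rate = 44100
t = np.arange(sample_rate) / sample_rate            # one second of samples
x = np.sin(2 * np.pi * 440.0 * t)

freqs, psd = signal.welch(x, fs=sample_rate, nperseg=8192)
print("estimated tone:", freqs[np.argmax(psd)], "Hz")   # ~440 Hz, to within one bin
```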
WAV files contain linear pulse code modulated (LPCM) data. That just means that it is a sequence of amplitude values at a fixed sample rate. A RIFF header is contained at the beginning of the file to convey information like sampling rate and bits per sample (e.g. 8 kHz signed 16-bit).
The format is very simple and you could easily roll your own. However, there are several libraries available to speed up the process, such as libsndfile. Simple DirectMedia Layer (SDL)/SDL_mixer and PortAudio are two nice libraries for playback.
As for feeding the data into FFTW, you would need to buffer 1 second chunks (determine size by the sample rate and bits per sample). Then convert all of the samples to IEEE floating-point (i.e. float or double depending on the FFTW configuration--libsndfile can do this for you). Next create another array to hold the frequency domain output. Finally, create and execute an FFTW plan by passing both buffers to fftw_plan_dft_r2c_1d and calling fftw_execute with the returned fftw_plan handle.
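Purely as an illustration of the buffering and real-to-complex FFT flow described above (the answer itself is about C with libsndfile and FFTW), here is a Python/numpy sketch using the soundfile bindings for libsndfile; "tones.wav" is a placeholder filename:

```python
import numpy as np
import soundfile as sf   # libsndfile bindings; any WAV reader would do

data, sample_rate = sf.read("tones.wav", dtype="float64")
if data.ndim > 1:
    data = data.mean(axis=1)                         # mix down to mono

chunk = sample_rate                                  # one second of samples
for start in range(0, len(data) - chunk + 1, chunk):
    frame = data[start:start + chunk] * np.hanning(chunk)
    spectrum = np.fft.rfft(frame)                    # frequency-domain output
    tone_hz = np.argmax(np.abs(spectrum)) * sample_rate / chunk
    print(f"{start // chunk}s: ~{tone_hz:.1f} Hz")

# Writing test tones is the reverse: synthesize samples and hand them back.
t = np.arange(sample_rate) / sample_rate
sf.write("tone_440.wav", 0.5 * np.sin(2 * np.pi * 440.0 * t), sample_rate)
```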
