I saw this coding challenge posted somewhere on an Elixir forum and have not quite figured out how to solve it. I have generalized the problem to make it more understandable.
Given an input of a random sequence of numbers, compute the M most common K-length sequences. M and K are constants. For example, compute the 10 most common 3-number sequences from the input.
The input could be potentially very large, so the solution should scale to any size.
I know that storing the sequences in a hash table in a higher-level language is potentially the simplest, most efficient solution, but I’d like to find another solution that can be done in C without any hash functions.
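One hash-free way to approach this is to sort the K-length windows so that equal windows become adjacent, then count the runs. The sketch below is only an illustration of that idea, written in Python for brevity; the function name top_sequences is made up here, and it assumes the input fits in memory (for larger inputs the in-memory sort would have to become an external merge sort). The same sort-then-count structure carries over to C with qsort.

```python
import heapq
from itertools import groupby

def top_sequences(numbers, k, m):
    """Return the m most common k-length windows as (count, window) pairs."""
    # Collect every contiguous window of length k.
    windows = [tuple(numbers[i:i + k]) for i in range(len(numbers) - k + 1)]
    # Sorting puts equal windows next to each other, so no hash table is needed.
    windows.sort()
    # Count the length of each run of equal windows.
    runs = ((sum(1 for _ in group), window) for window, group in groupby(windows))
    # Keep only the m runs with the largest counts.
    return heapq.nlargest(m, runs)

print(top_sequences([1, 2, 3, 1, 2, 3, 1, 2, 3], k=3, m=1))
```

Sorting costs O(n log n) comparisons of K-length windows, which is slower than hashing but needs nothing beyond a comparison function.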
I am looking for an efficient algorithm that allows mismatches (at most 3) when comparing a pattern with a text. The original KMP algorithm does the exact-matching job efficiently on my data, and I was considering extending it to accommodate mismatches.
For my case: GACCCT is considered a match with GGGGGAGGTTTTTT with start position 4 in second sequence
I need to do pairwise comparison between two files. Each contains approximately 500,000 sequences. Sequences in one file are relatively short (~50 bases), while those in the other are longer (~200).
I tried the Regex package in Python, the Levenshtein algorithm and edit distances. But they are slow, and I would have to wait a couple of weeks to get the work done.
I think your data isn't too large, so maybe this will work:
I think you should create a suffix tree for your data. Once you do this, finding substrings will be very easy, whether or not you want to count mismatches: you just traverse the tree with the characters you're looking for, until you've either found a substring, or hit the most number of mismatches you can tolerate.
If you want at most three mismatches, there's a simple but kind of daft algorithm that'll work on most real cases. Break your pattern into four contiguous parts arbitrarily. (It is probably useful for them to match a random text location with roughly the same probability.) Find all matches in the text of your four contiguous parts. See which of those completes to an at-most-three-mismatches match by brute force.
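The split-into-four idea above can be sketched as follows. With at most three mismatches spread over four parts, at least one part must match exactly, so exact hits for each part give candidate alignments that are then checked by brute force. This is a minimal illustration, not a tuned implementation; the names find_with_mismatches and count_mismatches are invented for the example, positions are 0-based, and plain str.find stands in for whatever exact matcher (KMP or similar) you already have.

```python
def count_mismatches(pattern, text, start, limit):
    """Count mismatches of pattern against text at start, stopping once limit is exceeded."""
    mismatches = 0
    for p, t in zip(pattern, text[start:start + len(pattern)]):
        if p != t:
            mismatches += 1
            if mismatches > limit:
                break
    return mismatches

def find_with_mismatches(pattern, text, k=3):
    """Return all 0-based start positions where pattern matches text with at most k mismatches."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    if m <= k:
        # Every alignment trivially has at most k mismatches.
        return list(range(n - m + 1))
    parts = k + 1
    # Split points for k + 1 contiguous, non-empty parts.
    bounds = [i * m // parts for i in range(parts + 1)]
    candidates = set()
    for i in range(parts):
        lo, hi = bounds[i], bounds[i + 1]
        piece = pattern[lo:hi]
        pos = text.find(piece)
        while pos != -1:
            start = pos - lo  # alignment of the whole pattern implied by this hit
            if 0 <= start <= n - m:
                candidates.add(start)
            pos = text.find(piece, pos + 1)
    # Brute-force verification of each candidate alignment.
    return sorted(s for s in candidates if count_mismatches(pattern, text, s, k) <= k)

print(find_with_mismatches("GACCCT", "GGGGGAGGTTTTTT"))
```

For patterns of about 50 bases the four parts are roughly 12 bases each, so exact hits should be rare enough that the verification step stays cheap on most real data.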
Mehrdad's solution of using a suffix tree is better in general, but it requires more programming effort.
I want to tackle some image-processing problems in Haskell. I'm working with both bitonal (bitmap) and color images with millions of pixels. I have a number of questions:
On what basis should I choose between Vector.Unboxed and UArray? They are both unboxed arrays, but the Vector abstraction seems heavily advertised, particularly around loop fusion. Is Vector always better? If not, when should I use which representation?
For color images I will wish to store triples of 16-bit integers or triples of single-precision floating-point numbers. For this purpose, is either Vector or UArray easier to use? More performant?
For bitonal images I will need to store only 1 bit per pixel. Is there a predefined datatype that can help me here by packing multiple pixels into a word, or am I on my own?
Finally, my arrays are two-dimensional. I suppose I could deal with the extra indirection imposed by a representation as "array of arrays" (or vector of vectors), but I'd prefer an abstraction that has index-mapping support. Can anyone recommend anything from a standard library or from Hackage?
I am a functional programmer and have no need for mutation :-)
For multi-dimensional arrays, the current best option in Haskell, in my view, is repa.
Repa provides high performance, regular, multi-dimensional, shape polymorphic parallel arrays. All numeric data is stored unboxed. Functions written with the Repa combinators are automatically parallel provided you supply +RTS -Nwhatever on the command line when running the program.
Recently, it has been used for some image processing problems:
Real time edge detection
Efficient Parallel Stencil Convolution in Haskell
I've started writing a tutorial on the use of repa, which is a good place to start if you already know Haskell arrays, or the vector library. The key stepping stone is the use of shape types instead of simple index types, to address multidimensional indices (and even stencils).
The repa-io package includes support for reading and writing .bmp image files, though support for more formats is needed.
Addressing your specific questions, here is a graphic, with discussion:
On what basis should I choose between Vector.Unboxed and UArray?
They have approximately the same underlying representation. The primary difference is the breadth of the API: vectors have almost all the operations you'd normally associate with lists (with a fusion-driven optimization framework), while UArray has almost no API.
For color images I will wish to store triples of 16-bit integers or triples of single-precision floating-point numbers.
UArray has better support for multi-dimensional data, as it can use arbitrary data types for indexing. While this is possible in Vector (by writing an instance of UA for your element type), it isn't the primary goal of Vector -- instead, this is where Repa steps in, making it very easy to use custom data types stored in an efficient manner, thanks to the shape indexing.
In Repa, your triple of shorts would have the type:
Array DIM3 Word16
That is, a 3D array of Word16s.
For bitonal images I will need to store only 1 bit per pixel.
UArrays pack Bools as bits. Vector uses an instance for Bool which does not do bit packing, instead using a representation based on Word8. However, it is easy to write a bit-packing implementation for vectors -- here is one, from the (obsolete) uvector library. Under the hood, Repa uses Vectors, so I think it inherits that library's representation choices.
Is there a predefined datatype that can help me here by packing multiple pixels into a word
You can use the existing instances for any of the libraries, for different word types, but you may need to write a few helpers using Data.Bits to roll and unroll packed data.
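To make the "roll and unroll" step concrete, here is a language-neutral sketch of the shifts and masks involved, written in Python rather than Haskell purely for illustration. The helper names pack_bits and get_pixel are invented for this example; in Haskell the same operations would come from Data.Bits (shiftL, shiftR, .&., .|.) applied to a Word8 or wider word. It also shows the row-major index mapping for a 2D image stored in a flat packed buffer.

```python
def pack_bits(pixels):
    """Pack a flat sequence of 0/1 pixels into a bytearray, 8 pixels per byte."""
    packed = bytearray((len(pixels) + 7) // 8)
    for i, pixel in enumerate(pixels):
        if pixel:
            # Byte i // 8 holds the pixel; bit i % 8 selects it within that byte.
            packed[i >> 3] |= 1 << (i & 7)
    return packed

def get_pixel(packed, width, row, col):
    """Read one pixel from a packed row-major image of the given width."""
    index = row * width + col  # 2D to 1D index mapping
    return (packed[index >> 3] >> (index & 7)) & 1

# A 3x3 bitonal image stored row by row.
image = [1, 0, 1,
         1, 0, 0,
         0, 0, 1]
packed = pack_bits(image)
print(list(packed), get_pixel(packed, 3, 2, 2))
```

Rows here are not padded to a byte boundary; padding each row would waste a few bits but make row-wise operations simpler.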
Finally, my arrays are two-dimensional
UArray and Repa support efficient multi-dimensional arrays. Repa also has a rich interface for doing so. Vector on its own does not.
Notable mentions:
hmatrix, a custom array type with extensive bindings to linear algebra packages. Should be bound to use the vector or repa types.
ix-shapeable, getting more flexible indexing from regular arrays
chalkboard, Andy Gill's library for manipulating 2D images
codec-image-devil, read and write various image formats to UArray
Once I reviewed the features of Haskell array libraries which matter for me, and compiled a comparison table (only spreadsheet: direct link). So I'll try to answer.
On what basis should I choose between Vector.Unboxed and UArray? They are both unboxed arrays, but the Vector abstraction seems heavily advertised, particularly around loop fusion. Is Vector always better? If not, when should I use which representation?
UArray may be preferred over Vector if one needs two-dimensional or multi-dimensional arrays. But Vector has a nicer API for manipulating, well, vectors. In general, Vector is not well suited for simulating multi-dimensional arrays.
Vector.Unboxed cannot be used with parallel strategies. I suspect that UArray cannot be used either, but at least it is very easy to switch from UArray to boxed Array and see if the parallelization benefits outweigh the boxing costs.
For color images I will wish to store triples of 16-bit integers or triples of single-precision floating-point numbers. For this purpose, is either Vector or UArray easier to use? More performant?
I tried using Arrays to represent images (though I needed only grayscale images). For color images I used Codec-Image-DevIL library to read/write images (bindings to DevIL library), for grayscale images I used pgm library (pure Haskell).
My major problem with Array was that it provides only random-access storage: it doesn't provide many means of building Array algorithms, nor does it come with ready-to-use libraries of array routines (it doesn't interface with linear algebra libs, and doesn't allow expressing convolutions, FFT and other transforms).
Almost every time a new Array has to be built from an existing one, an intermediate list of values has to be constructed (like in matrix multiplication from the Gentle Introduction). The cost of array construction often outweighs the benefits of faster random access, to the point that a list-based representation is faster in some of my use cases.
STUArray could have helped me, but I didn't like fighting with cryptic type errors and efforts necessary to write polymorphic code with STUArray.
So the problem with Arrays is that they are not well suited for numerical computations. Hmatrix' Data.Packed.Vector and Data.Packed.Matrix are better in this respect, because they come along with a solid matrix library (attention: GPL license). Performance-wise, on matrix multiplication, hmatrix was sufficiently fast (only slightly slower than Octave), but very memory-hungry (consumed several times more than Python/SciPy).
There is also blas library for matrices, but it doesn't build on GHC7.
I didn't have much experience with Repa yet, and I don't understand repa code well. From what I see it has very limited range of ready to use matrix and array algorithms written on top of it, but at least it is possible to express important algorithms by the means of the library. For example, there are already routines for matrix multiplication and for convolution in repa-algorithms. Unfortunately, it seems that convolution is now limited to 7×7 kernels (it's not enough for me, but should suffice for many uses).
I didn't try Haskell OpenCV bindings. They should be fast, because OpenCV is really fast, but I am not sure if the bindings are complete and good enough to be usable. Also, OpenCV by its nature is very imperative, full of destructive updates. I suppose it's hard to design a nice and efficient functional interface on top of it. If one goes OpenCV way, he is likely to use OpenCV image representation everywhere, and use OpenCV routines to manipulate them.
For bitonal images I will need to store only 1 bit per pixel. Is there a predefined datatype that can help me here by packing multiple pixels into a word, or am I on my own?
As far as I know, Unboxed arrays of Bools take care of packing and unpacking bit vectors. I remember looking at implementation of arrays of Bools in other libraries, and didn't see this elsewhere.
Finally, my arrays are two-dimensional. I suppose I could deal with the extra indirection imposed by a representation as "array of arrays" (or vector of vectors), but I'd prefer an abstraction that has index-mapping support. Can anyone recommend anything from a standard library or from Hackage?
Apart from Vector (and simple lists), all other array libraries are capable of representing two-dimensional arrays or matrices. I suppose they avoid unnecessary indirection.
Although this doesn't exactly answer your question and isn't really even Haskell as such, I would recommend taking a look at the CV or CV-combinators libraries on Hackage. They bind many rather useful image processing and vision operators from the OpenCV library and make working with machine vision problems much faster.
It would be rather great if someone figures out how repa or some such array library could be directly used with opencv.
Here is a new Haskell Image Processing library that can handle all of the tasks in question and much more. Currently it uses the Repa and Vector packages for its underlying representations, so it inherits fusion, parallel computation, mutation and most of the other goodies that come with those libraries. It provides an easy-to-use interface that is natural for image manipulation:
2D indexing and unboxed pixels with arbitrary precision (Double, Float, Word16, etc..)
all essential functions like map, fold, zipWith, traverse ...
support for various color spaces: RGB, HSI, gray scale, Bi-tonal, Complex, etc.
common image processing functionality:
Binary morphology
Convolution
Interpolation
Fourier transform
Histogram plotting
etc.
Ability to treat pixels and images as regular numbers.
Reading and writing common image formats through JuicyPixels library
Most importantly, it is a pure Haskell library, so it does not depend on any external programs. It is also highly extendable: new color spaces and image representations can be introduced.
One thing that it does not do is pack multiple binary pixels into a Word; instead it uses a Word per binary pixel. Maybe in the future...
As part of a larger problem, I need to solve small linear systems (i.e. NxN where N ~10), so using the relevant CUDA libraries doesn't make any sense in terms of speed.
Unfortunately something that's also unclear is how to go about solving such systems without pulling in the big guns like GSL, EIGEN etc.
Can anyone point me in the direction of a dense matrix solver (Ax=B) in straight C?
For those interested, the basic structure of the generator for this section of code is:
    ndarray = some.generator(N, N)
    for v in range(N):
        B[v] = _F(v) * constant
        for x in range(N):
            A[v, x] = -_F(v) * ndarray[x, v]
Unfortunately I have approximately zero knowledge of higher mathematics, so any advice would be appreciated.
UPDATE: I've been working away at this, and have a nearly-solution that runs but isn't working. Anyone lurking is welcome to check out what I've got so far on pastebin.
I'm using Crout Decomposition with Pivoting which seems to be the most general approach. The idea for this test is that every thread does the same work. Boring I know, but the plan is that the matrixcount variable is increased, actual data is put in, and each thread solves the small matrices individually.
Thanks to everyone who's been checking on this.
POST-ANSWER UPDATE: Finished the matrix solving code for CPU and GPU operation, check out my lazy-writeup here
CUDA won't help here, that's true. Matrices like that are just too small for it.
What you do to solve a system of linear equations is LU decomposition:
http://en.wikipedia.org/wiki/LU_decomposition
http://mathworld.wolfram.com/LUDecomposition.html
Or, even better, a QR decomposition, computed with Householder reflections or the Gram-Schmidt process.
http://en.wikipedia.org/wiki/QR_decomposition#Computing_the_QR_decomposition
Solving the linear equation becomes easy afterwards, but I'm afraid there always is some "higher mathematics" (linear algebra) involved. That, and there are many (many!) C libraries out there for solving linear equations. Doesn't seem like "big guns" to me.
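For a system as small as 10x10, the elimination behind LU decomposition fits in a few dozen lines with no library at all. The following is a rough sketch of Gaussian elimination with partial pivoting followed by back substitution, written in Python for readability; the function name solve and the list-of-rows layout are choices made for this example. The loops translate almost line for line into plain C arrays.

```python
def solve(a, b):
    """Solve a x = b for a dense square matrix a (list of rows) and right-hand side b."""
    n = len(a)
    # Work on an augmented copy [a | b] so the caller's data is left untouched.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry of this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate the entries below the pivot.
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

# 2x + y = 3 and x + 3y = 5
print(solve([[2, 1], [1, 3]], [3, 5]))
```

If the same matrix has to be solved against many right-hand sides, storing the elimination factors (that is, keeping the actual L and U) avoids repeating the O(N^3) work.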
Recently I asked this question: How to get the fundamental frequency from FFT? (you don't actually need to read it)
My question right now is: how do I use the cepstral algorithm?
I just don't know how to use it because the only language that I know is ActionScript 3, and for this reason I have few references about the native functions found in C, Java and so on, and how I should implement them in AS. Most articles are about these languages =/
(although answers in languages other than AS are welcome; just explain how the script works, please)
The articles I found about cepstral to find the fundamental frequency of a FFT result told me that I should do this:
signal → FT → abs() → square → log → FT → abs() → square → power cepstrum
mathematically:
|F{log(|F{f(t)}|²)}|²
Important info:
I am developing a GUITAR TUNER in flash
This is the first time I am dealing with advanced sound
I am using an FFT to extract frequency bins from the signal that reaches user's microphone, but I got stuck in getting the fundamental frequency from it
I don't know:
How to apply a square in an ARRAY (I mean, the data that my FFT gives me is an array. Should I multiply it by itself? ActionScript's debug throws errors when I try to fftResults * fftResults)
How to apply the "log". I would not know how to apply it even if I had a single number.
What is the difference between complex cepstral and power cepstral. Also, what of them should I use? I am trying to develop a guitar tuner.
Thanks!
Note that the output of an FFT is an array of complex values, i.e. each bin = re + j*im. I think you can just combine the abs and square operations and calculate re*re + im*im for each bin. This gives you a single positive value for each bin, and you can then take the log of each bin's value individually. You then need to do a second FFT on this log-squared data and, using the output of this second FFT, again calculate re*re + im*im for each bin. You will then have an array of positive values which will have one or more peaks representing the fundamental frequency or frequencies of your input.
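The per-bin steps described above can be sketched like this. It is only an illustration in Python: the naive O(n^2) dft function stands in for whatever FFT routine you already have, the names power_of, power_cepstrum and fundamental_hz are invented for the example, and the small floor constant is an assumption added to keep log() away from zero. Note that "square" and "log" are applied to each array element in a loop, not to the array as a whole. The index of a cepstral peak is a lag in samples, so the frequency is the sample rate divided by that index.

```python
import cmath
import math

def dft(values):
    """Naive discrete Fourier transform; a real FFT would replace this in practice."""
    n = len(values)
    return [sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(values))
            for k in range(n)]

def power_of(bins):
    """abs() followed by square, per complex bin: re*re + im*im."""
    return [z.real * z.real + z.imag * z.imag for z in bins]

def power_cepstrum(samples, floor=1e-12):
    """signal -> FT -> abs -> square -> log -> FT -> abs -> square."""
    spectrum = power_of(dft(samples))
    # floor is an assumed small constant so that empty bins do not produce log(0).
    log_spectrum = [math.log(p + floor) for p in spectrum]
    return power_of(dft(log_spectrum))

def fundamental_hz(samples, sample_rate, min_lag, max_lag):
    """Pick the strongest cepstral peak between min_lag and max_lag (in samples)."""
    cepstrum = power_cepstrum(samples)
    best = max(range(min_lag, max_lag + 1), key=lambda q: cepstrum[q])
    return sample_rate / best

# An impulse train repeating every 8 samples at 8000 samples per second.
pulses = [1.0 if n % 8 == 0 else 0.0 for n in range(64)]
print(fundamental_hz(pulses, 8000, 2, 12))
```

The search range matters: it should cover the lags of the notes you expect and exclude the very low lags, where the cepstrum is dominated by the overall spectral shape.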
The autocorrelation is the easiest and most logical approach, and the best place to start.
To get this working, start with a simple autocorrelation, and then, if necessary, improve it following the outline provided by YIN. (YIN is based on the autocorrelation with refinements. But whether or not you'll need these refinements depends on details of your situation.) This way also, you can learn as you go rather than trying to understand the whole thing in one shot.
Although FFT approaches can also work, they are a bit more confusing. The issue is that what you are really after is the period, and this isn't well represented by the FFT. The missing fundamental is a good example of this, where if you have 2Hz and 3Hz, the fundamental is 1Hz, but is nowhere in the FFT, while 1Hz is obvious in a time based representation (e.g. the autocorrelation). Add to this that overtones aren't necessarily harmonic, and noise, etc... and all of these issues make it usually best to start with a direct approach to the problem.
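As a starting point, a plain autocorrelation pitch estimate can be sketched in a few lines. This is a bare-bones illustration in Python without any of the YIN refinements; the names autocorrelation and estimate_pitch, and the min_hz/max_hz search limits, are assumptions made for this example. The signal is compared with a delayed copy of itself, and the delay (lag) that gives the best agreement is taken as the period.

```python
import math

def autocorrelation(samples, lag):
    """Sum of products of the signal with itself delayed by lag samples."""
    return sum(samples[n] * samples[n + lag] for n in range(len(samples) - lag))

def estimate_pitch(samples, sample_rate, min_hz, max_hz):
    """Return the frequency whose period gives the strongest autocorrelation."""
    # Convert the frequency range into a range of lags (periods in samples).
    min_lag = int(sample_rate / max_hz)
    max_lag = int(sample_rate / min_hz)
    best_lag = max(range(min_lag, max_lag + 1),
                   key=lambda lag: autocorrelation(samples, lag))
    return sample_rate / best_lag

# A 200 Hz sine sampled at 8000 Hz has a period of 40 samples.
tone = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(400)]
print(estimate_pitch(tone, 8000, 100, 400))
```

The resolution is limited to whole-sample lags, so for a tuner you would likely want to interpolate around the best lag, which is one of the refinements YIN describes.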
There are many ways of finding fundamental frequency (F0).
For languages like Java etc. there are many libraries with those types of algorithms already implemented (you can study their sources).
MFCC (based on cepstral) implemented in Comirva (Open source).
Audacity (beta version!) (Open source) presents cepstrum, autocorrelation, enhanced autocorrelation.
Yin based on autocorrelation (example)
Finding max signal values after FFT
All these algorithms may be very helpful for you. However, the easiest way to get F0 (one value in Hz) would be to use Yin.