When to use which base of log for tf-idf? - c

I'm working on a simple search engine where I use the TF-IDF formula to score how important a search word is. I see people using different bases for the formula, but I see no explanation for when to use which. Does it matter at all, and do you have any recommendations?
My current implementation uses the regular log() function from math.h, which computes the natural logarithm.

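TF-IDF literature usually uses base 2, although common implementations differ; sklearn, for example, uses natural logarithms. Just take into account that the lower the base, the bigger the score. For the plain tf * log(N/df) form, changing the base multiplies every score by the same constant, so the ranking stays the same, but truncating the search result set at a fixed score threshold is affected.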
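To make the effect of the base concrete, here is a minimal Python sketch. It assumes the plain idf = log(N/df) form; the corpus size, the document frequencies and the helper names `idf` and `ranking` are made up for illustration.

```python
import math

def idf(n_docs, doc_freq, base):
    """Plain inverse document frequency: log_base(N / df)."""
    return math.log(n_docs / doc_freq, base)

# Hypothetical corpus: 1000 documents and the document frequency of three terms.
doc_freq = {"the": 900, "search": 120, "tfidf": 3}

def ranking(base):
    """Terms ordered from highest to lowest idf for the given log base."""
    scores = {term: idf(1000, df, base) for term, df in doc_freq.items()}
    return sorted(scores, key=scores.get, reverse=True)

# The order is identical for every base; only the magnitudes differ.
for base in (2, math.e, 10):
    print(base, ranking(base), round(idf(1000, 3, base), 3))
```

A smaller base gives larger numbers, so a fixed cut-off such as "keep results scoring above 2.0" would need to be rescaled if the base changes.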
Note that from a mathematical point of view, the base can always be changed later. It's easy to convert from one base to another, because the following equality holds:
log_a(x)/log_a(y) = log_b(x)/log_b(y)
You can always convert from one base to another. It's actually very easy. Just use this formula:
log_b(x) = log_a(x)/log_a(b)
Bases like 2 and 10 are often preferred among engineers: 2 is convenient for halving and doubling times, and 10 matches our number system. Math people prefer the natural logarithm because it makes calculus a lot easier. The derivative of the function b^x, where b is a constant, is k*b^x with k = ln(b). But if b is equal to e (the base of the natural logarithm), then k is 1.
So let's say that you want to compute the base-2 logarithm of 5.63 using log(). Just use log(5.63)/log(2).
If you have the need for it, just use this function for arbitrary base:
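#include <math.h>

/* Logarithm of x in base b. Named log_base because math.h already declares logb(). */
double log_base(double x, double b) {
    return log(x) / log(b);
}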
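As a quick sanity check of the change-of-base formula, the same computation can be sketched in Python (used here only for brevity; `log_base` is an illustrative name) and compared against the built-in base-2 logarithm:

```python
import math

def log_base(x, b):
    """Logarithm of x in base b via the change-of-base formula."""
    return math.log(x) / math.log(b)

# The example from the text: the base-2 logarithm of 5.63.
print(log_base(5.63, 2), math.log2(5.63))
```

Both printed values should agree up to floating-point rounding.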

Should I use Halfcomplex2Real or Complex2Complex

Good morning, I'm trying to perform a 2D FFT as 2 1-Dimensional FFT.
The problem setup is the following:
There's a matrix of complex numbers generated by an inverse FFT on an array of real numbers; let's call it arr[-nx..+nx][-nz..+nz].
Now, since the original array was made up of real numbers, I exploit the symmetry and reduce my array to be arr[0..nx][-nz..+nz].
My problem starts here, with arr[0..nx][-nz..nz] provided.
Now I should come back in the domain of real numbers.
The question is what kind of transformation I should use in the 2 directions?
In x I use the fftw_plan_r2r_1d( .., .., .., FFTW_HC2R, ..), called Half complex to Real transformation because in that direction I've exploited the symmetry, and that's ok I think.
But in the z direction I can't figure out whether I should use the same transformation or the complex-to-complex (C2C) transformation.
Which is the correct one, and why?
In case it is needed: the HC2R transformation is briefly described here, at page 11.
Thank you
"To easily retrieve a result comparable to that of fftw_plan_dft_r2c_2d(), you can chain a call to fftw_plan_dft_r2c_1d() and a call to the complex-to-complex DFT fftw_plan_many_dft(). The arguments howmany and istride can easily be tuned to match the pattern of the output of fftw_plan_dft_r2c_1d(). Contrary to fftw_plan_dft_r2c_1d(), the r2r_1d(...FFTW_R2HC...) transform separates the real and imaginary components of each frequency. A second FFTW_R2HC can be applied and would be comparable to fftw_plan_dft_r2c_2d(), but not exactly the same.
As quoted on the page 11 of the documentation that you judiciously linked,
'Half of these column transforms, however, are of imaginary parts, and should therefore be multiplied by I and combined with the r2hc transforms of the real columns to produce the 2d DFT amplitudes; ... Thus, ... we recommend using the ordinary r2c/c2r interface.'
Since you have an array of complex numbers, you can either use c2r transforms or unfold the real/imaginary parts and try to use HC2R transforms. The former option seems the most practical. Which one might solve your issue?"
-#Francis
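To illustrate the order of operations when going back to real data, here is a small pure-Python sketch. It is not FFTW code: a naive O(n^2) DFT stands in for the library, the helper names `dft`, `forward_half_spectrum` and `backward_to_real` are hypothetical, and it assumes the usual 0..N-1 frequency ordering with only kx = 0..NX/2 stored (rather than the centred -nx..+nx indexing of the question). The backward pass applies a complex-to-complex transform along z first, then a Hermitian (complex-to-real) transform along x.

```python
import cmath

def dft(seq, sign):
    """Unnormalised 1-D DFT; sign=-1 is forward, sign=+1 is backward."""
    n = len(seq)
    return [sum(v * cmath.exp(sign * 2j * cmath.pi * j * k / n)
                for j, v in enumerate(seq))
            for k in range(n)]

def forward_half_spectrum(a):
    """Real NX x NZ array -> 2-D spectrum, keeping only kx = 0..NX//2."""
    nx, nz = len(a), len(a[0])
    rows = [dft(row, -1) for row in a]                       # along z
    cols = [dft([rows[x][kz] for x in range(nx)], -1)        # along x
            for kz in range(nz)]
    return [[cols[kz][kx] for kz in range(nz)] for kx in range(nx // 2 + 1)]

def backward_to_real(half, nx):
    """Half spectrum -> real NX x NZ array."""
    nz = len(half[0])
    # Step 1: complex-to-complex backward transform along z for each stored kx.
    g = [dft(row, +1) for row in half]
    out = [[0.0] * nz for _ in range(nx)]
    for z in range(nz):
        # Step 2: rebuild the Hermitian-symmetric column in kx (this is what a
        # c2r transform does implicitly), then transform along x.
        col = [g[kx][z] if kx < len(g) else g[nx - kx][z].conjugate()
               for kx in range(nx)]
        for x, v in enumerate(dft(col, +1)):
            out[x][z] = v.real / (nx * nz)   # undo the unnormalised scaling
    return out

a = [[1.0, 2.0, 0.5], [-1.0, 0.0, 3.0], [4.0, 1.5, -2.0], [0.25, 2.5, 1.0]]
print(backward_to_real(forward_half_spectrum(a), len(a)))
```

The symmetry is only exploited in the x direction, so only that direction can use a real-output transform; the z direction still carries general complex data at that stage.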

Sorting and making "genes" in output bitstrings from a genetic algorithm

I was wondering if anybody had suggestions as to how I could analyze the output bitstrings that are being permuted by a genetic algorithm. In particular, it would be nice if I could identify patterns of bits (I'm calling them genes here) that seem to yield a desirable cv score. The difficulty is that there are a lot of them (I probably already have something like 30 million bitstrings that are 140 bits long, and I'll probably hit over 100 million pretty quickly), so even after I sort out the desirable data there are still a lot of candidate datasets, and doing similarity comparisons by eye is out of the question. My questions are:
How should I compare for similarity between these bitstrings?
How can I identify "genes" in these bitstrings in an algorithmic (aka programmable) way?
As you want to extract common gene patterns, what about looking at the positions where two strings agree? So if you have
set1 = 11011101110011...
set2 = 11001100000110...
# apply bitwise equality (XNOR): a 1 marks a position where both strings agree
~(set1 ^ set2) == 11101110001010...
The result now shows which genes are the same, and could be used in further analysis.
For the similarity part you need to do an exclusive-or (XOR). The result of this bit-wise operation will give you the difference between two bit strings, and is probably the most efficient and easy way of doing it (for pair comparison). As an example:
>>> from bitarray import bitarray
>>> a = bitarray('0001100111')
>>> b = bitarray('0100110110')
>>> a ^ b
bitarray('0101010001')
Then you can either count the differences, inspect quickly where the differences lie, etc.
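If the bitarray package is not available, the same idea can be sketched with plain Python integers. The helper names `hamming` and `agreement_mask` are illustrative, and the bitstrings are assumed to be '0'/'1' text of equal length.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions where two equal-length bitstrings differ."""
    return bin(int(a, 2) ^ int(b, 2)).count("1")

def agreement_mask(a: str, b: str) -> str:
    """Bitstring with a 1 wherever a and b hold the same bit (XNOR)."""
    n = len(a)
    same = ~(int(a, 2) ^ int(b, 2)) & ((1 << n) - 1)  # keep only n bits
    return format(same, f"0{n}b")

print(hamming("0001100111", "0100110110"))
print(agreement_mask("11011101110011", "11001100000110"))
```

For 140-bit strings a single Python integer per string is enough, so a pairwise comparison is one XOR plus a bit count.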
For the second part, it depends on the representation of course, and on the programming language (PL) chosen for the implementation. Most PL libraries will have a search function, that retrieves all or at least the first of the indexes where some pattern is found in a string (or bitstring, or bitstream...). You just have to refer to the documentation of your chosen PL to know more about the performance if you have more than one option for the task.
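As one possible illustration in Python, assuming the bitstrings are held as ordinary text strings, the built-in `str.find` can be looped to collect every (possibly overlapping) position of a candidate pattern. `find_all` is a hypothetical helper name.

```python
def find_all(bits: str, pattern: str) -> list:
    """Start indexes of every occurrence of pattern in bits, overlaps included."""
    positions = []
    i = bits.find(pattern)
    while i != -1:
        positions.append(i)
        i = bits.find(pattern, i + 1)  # step by one so overlapping hits are found
    return positions

print(find_all("0101010001", "010"))
```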

Graphs for Big O notation

I wonder if there's any tool/website where I can plot some run times as graphs, because the asymptotic notation is often not what I want, i.e. I don't want to ignore constants.
For example, suppose I have two notations, like:
1) n * log n
2) (n * log n * log n)/5
It's obvious that the first one is asymptotically better. But what I want to see is how they perform and at which point one starts to become better than the other.
A graphical tool where I can enter different equations and plot them to see how they vary would be greatly useful for this purpose. In my search I found this site where they have some plots. I am looking for something similar, but I also want to input my own equations and plot them to analyse the performance for various values of n.
As soon as you stop "ignoring constants", you're no longer graphing "Big O" notation, but just drawing a standard XY plot. As such, any graphing program, even an online graphing calculator, will let you display this: just replace n with x and you'll get the proper graph.
Would this or this help?
If you use a 3d grapher, you can use the other dimension (say y) as a constant replacement.
This way you would be able to interpret results as:
when y is 5, n*log(n)*log(n)/y is better than n*log(n) up to n = (actual value)
Also, you can ignore the 3rd dimension. Or use it if you have a complexity depending on 2 variables.
Just input the difference between the complexities. In this case, ignoring the 3rd dimension and considering log(x) = ln(x), the equation is:
z = x*ln(x) - x*ln(x)*ln(x)/5
And you can interpret that as x*ln(x) being more efficient when z is negative.
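The crossover can also be found numerically without a plotting tool. A minimal Python sketch, assuming natural logarithms as above (the function names are illustrative):

```python
import math

def cost_1(n):
    return n * math.log(n)

def cost_2(n):
    return n * math.log(n) ** 2 / 5

def first_n_where_cost_1_wins(limit=10**6):
    """Smallest integer n >= 2 for which n*ln(n) is cheaper than n*ln(n)^2/5."""
    for n in range(2, limit):
        if cost_1(n) < cost_2(n):
            return n
    return None

print(first_n_where_cost_1_wins(), math.exp(5))
```

The two expressions are equal where ln(n) = 5, i.e. n = e^5, which is roughly 148.4; below that the second expression is smaller, above it the first one is.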
If you want to see how they actually perform, then you have to implement the algorithms and execute them on various inputs. Modern processors, with memory locality effects and cache misses, make it really hard to come up with an equation that gives you a reasonable estimate.
I can guarantee you that you won't measure what you would expect.

Fast way to in-place update one vector with another

I have a vector A, represented by an angle and a length. I want to add vector B, updating the original A. B comes from a lookup table, so it can be represented in whichever way makes the computation easier.
Specifically, A is defined thusly:
uint16_t A_angle; // 0-65535 = 0-2π
int16_t A_length;
Approximations are fine. Checking for overflow is not necessary. A fast sin/cos approximation is available.
The fastest way I can think is to have B represented as a component vector, convert A to component, add A and B, convert the result back to angle/length and replace A. (This requires the addition of a fast asin/acos)
I am not especially good at math and wonder if I am missing a more sensible approach?
I am primarily looking for a general approach, but specific answers/comments about useful micro-optimizations in C are also interesting.
If you need to do a lot of additive operations, it would probably be worth considering storing everything in Cartesian coordinates, rather than polar.
Polar is well-suited to rotation operations (and scaling, I guess), but sticking with Cartesian (where a rotation is four multiplies, see below) is probably going to be cheaper than using cos/sin/acos/asin every time you want to do a vector addition. Although, of course, it depends on the distribution of operations in your case.
FYI, a rotation in Cartesian coordinates is as follows (see http://en.wikipedia.org/wiki/Rotation_matrix):
x' = x*cos(a) - y*sin(a)
y' = x*sin(a) + y*cos(a)
If a is known ahead of time, then cos(a) and sin(a) can be precomputed.
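For comparison, here is a rough Python sketch of both operations (Python is used only to keep the example short; the arithmetic maps directly to C). It assumes the 16-bit angle covers one full turn in 65536 steps and uses the standard library's exact sin/cos/atan2 where the question would use fast approximations. The names `add_polar` and `rotate` are hypothetical.

```python
import math

STEPS = 65536  # assumed: a uint16 angle covers one full turn

def add_polar(angle_u16, length, bx, by):
    """Add Cartesian vector (bx, by) to polar vector A; return the new polar A."""
    theta = angle_u16 * 2 * math.pi / STEPS
    x = length * math.cos(theta) + bx
    y = length * math.sin(theta) + by
    new_angle = int(round(math.atan2(y, x) * STEPS / (2 * math.pi))) % STEPS
    return new_angle, math.hypot(x, y)

def rotate(x, y, cos_a, sin_a):
    """Rotate a Cartesian vector using precomputed cos(a) and sin(a)."""
    return x * cos_a - y * sin_a, x * sin_a + y * cos_a

print(add_polar(0, 100, 0.0, 100.0))
print(rotate(1.0, 0.0, math.cos(math.pi / 2), math.sin(math.pi / 2)))
```

Note that converting back needs one atan2 and one square root per addition, which is the cost that storing everything in Cartesian form avoids.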

Efficient comparison of 1 million vectors containing (float, integer) tuples

I am working in a chemistry/biology project. We are building a web-application for fast matching of the user's experimental data with predicted data in a reference database. The reference database will contain up to a million entries. The data for one entry is a list (vector) of tuples containing a float value between 0.0 and 20.0 and an integer value between 1 and 18. For instance (7.2394 , 2) , (7.4011, 1) , (9.9367, 3) , ... etc.
The user will enter a similar list of tuples and the web-app must then return the - let's say - top 50 best matching database entries.
One thing is crucial: the search algorithm must allow for discrepancies between the query data and the reference data because both can contain small errors in the float values (NOT in the integer values). (The query data can contain errors because it is derived from a real-life experiment and the reference data because it is the result of a prediction.)
Edit - Moved text to answer -
How can we get an efficient ranking of 1 query on 1 million records?
You should add a physicist to the project :-) This is a very common problem to compare functions e.g. look here:
http://en.wikipedia.org/wiki/Autocorrelation
http://en.wikipedia.org/wiki/Correlation_function
In the first link you can read: "The SEQUEST algorithm for analyzing mass spectra makes use of autocorrelation in conjunction with cross-correlation to score the similarity of an observed spectrum to an idealized spectrum representing a peptide."
An efficient linear scan of 1 million records of that type should take a fraction of a second on a modern machine; a compiled loop should be able to do it at about memory bandwidth, which would transfer that much data in two or three milliseconds.
But, if you really need to optimise this, you could construct a hash table of the integer values, which would divide the job by the number of integer bins. And, if the data is stored sorted by the floats, that improves the locality of matching by those; you know you can stop once you're out of tolerance. Storing the offsets of each of a number of bins would give you a position to start.
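A rough Python sketch of that indexing idea, under several assumptions not stated in the thread: each reference entry is a list of (float, int) tuples, similarity is simply the number of query tuples that find a reference tuple with the same integer and a float within a tolerance, and the names `build_index`, `rank` and `tol` are made up for illustration.

```python
from bisect import bisect_left, bisect_right
from collections import Counter, defaultdict

def build_index(entries):
    """Map each integer value to a float-sorted list of (float, entry_id)."""
    index = defaultdict(list)
    for entry_id, peaks in enumerate(entries):
        for value, kind in peaks:
            index[kind].append((value, entry_id))
    for bucket in index.values():
        bucket.sort()
    return index

def rank(index, query, tol=0.05, top=50):
    """Return (entry_id, hit_count) pairs for the best-matching entries."""
    hits = Counter()
    for value, kind in query:
        bucket = index.get(kind, [])
        # Binary search for the slice of floats inside [value - tol, value + tol].
        lo = bisect_left(bucket, (value - tol, -1))
        hi = bisect_right(bucket, (value + tol, float("inf")))
        for _, entry_id in bucket[lo:hi]:
            hits[entry_id] += 1
    return hits.most_common(top)

entries = [
    [(7.2394, 2), (7.4011, 1), (9.9367, 3)],
    [(7.25, 2), (3.0, 1)],
    [(15.0, 5)],
]
print(rank(build_index(entries), [(7.24, 2), (7.40, 1), (9.94, 3)]))
```

An entry with several tuples inside the same tolerance window is counted more than once here; a real scoring function would need to decide how to handle that.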
I guess I don't see the need for a fancy algorithm yet... describe the problem a bit more, perhaps (you can assume a fairly high level of chemistry and physics knowledge if you like; I'm a physicist by training)?
Ok, given the extra info, I still see no need for anything better than a direct linear search, if there's only 1 million reference vectors and the algorithm is that simple. I just tried it, and even a pure Python implementation of linear scan took only around three seconds. It took several times longer to make up some random data to test with. This does somewhat depend on the rather lunatic level of optimisation in Python's sorting library, but that's the advantage of high level languages.
import random
# 1 million random (float, integer) reference tuples
r = [(random.uniform(0, 20), random.randint(1, 18)) for i in range(1000000)]
# this is a decorate-sort-undecorate pattern
# look for matches to (7, 9)
# obviously, you can use whatever distance expression you want;
# abs() is applied per component so that opposite errors do not cancel
zz = [(abs(7 - x) + abs(9 - y), x, y) for x, y in r]
zz.sort()
# return the 50 best matches
best = [(x, y) for a, x, y in zz[:50]]
Can't you sort the tuples and perform binary search on the sorted array ?
I assume your database is built once and for all, and the positions of the entries are not important. You can sort this array so that the tuples are in a given order. When a tuple is entered by the user, you just look in the middle of the sorted array. If the query value is larger than the center value, you repeat the work on the upper half, otherwise on the lower one.
The worst case is O(log n) comparisons.
If you can "map" your reference data to x-y coordinates on a plane there is a nifty technique which allows you to select all points under a given distance/tolerance (using Hilbert curves).
Here is a detailed example.
One approach we are trying ourselves, which allows for the discrepancies between query and reference, is binning the float values. We are testing and want to offer the user the choice of different bin sizes. Bin sizes will be 0.1, 0.2, 0.3 or 0.4. So binning leaves us with between 50 and 200 bins, each with a corresponding integer value between 0 and 18, where 0 means there was no value within that bin. The reference data can be pre-binned and stored in the database. We can then take the binned query data and compare it with the reference data. One approach could be, for all bins, to subtract the query integer value from the reference integer value. By summing up the absolute differences we get the similarity score, with the most similar reference entries resulting in the lowest scores.
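The binning scheme described above could be sketched in Python roughly as follows. The function names are illustrative, and two details are assumptions: when two tuples fall into the same bin the later one overwrites the earlier, and the score uses absolute differences.

```python
import math

def bin_entry(peaks, bin_size=0.2, max_value=20.0):
    """Turn a list of (float, int) tuples into a fixed-length list of integers."""
    n_bins = math.ceil(max_value / bin_size - 1e-9)  # 200, 100, 67 or 50 bins
    bins = [0] * n_bins                              # 0 means "no value here"
    for value, kind in peaks:
        i = min(int(value / bin_size), n_bins - 1)   # clamp value == max_value
        bins[i] = kind                               # assumption: last one wins
    return bins

def score(query_bins, ref_bins):
    """Sum of absolute per-bin differences; lower means more similar."""
    return sum(abs(q - r) for q, r in zip(query_bins, ref_bins))

query = bin_entry([(7.2394, 2), (7.4011, 1)])
reference = bin_entry([(7.25, 2), (7.41, 3)])
print(len(query), score(query, reference))
```

Values that sit close to a bin edge can land in neighbouring bins for query and reference, which is one reason to experiment with the bin size.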
Another (simpler) search option we want to offer is one where the user only enters the float values. The integer values in both the query and the reference list can then be set to 1. We then use the Hamming distance to compute the difference between the binned query values and the binned reference values. I have previously asked about an efficient algorithm for that search.
This binning is only one way of achieving our goal. I am open to other suggestions. Perhaps we can use Principal Component Analysis (PCA), as described here.
