Compute trigram probability from bigram probabilities

Given bigram probabilities for words in a text, how would one compute trigram probabilities?
For example, if we know that P(dog cat) = 0.3 and P(cat mouse) = 0.2,
how do we find P(dog cat mouse)?
Thank you!

In the following I consider a trigram as three random variables A,B,C. So dog cat horse would be A=dog, B=cat, C=horse.
Using the chain rule: P(A,B,C) = P(A,B) * P(C|A,B). Now you're stuck if you want to stay exact.
What you can do is assume that C is independent of A given B. Then it holds that P(C|A,B) = P(C|B), and P(C|B) = P(C,B) / P(B), which you should be able to compute from your bigram probabilities. Note that in your case P(C|B) should really be the probability of C following a B, so it's the probability of a BC divided by the probability of a B*.
So to sum it up, when using the conditional independence assumption:
P(ABC) = P(AB) * P(BC) / P(B*)
And to compute P(B*) you have to sum up the probabilities of all bigrams beginning with B.
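For illustration, here is a small Python sketch of that formula; the bigram table and the helper names are made up for the example.

# Hypothetical bigram probabilities; P(dog cat) and P(cat mouse) are the values
# from the question, P(cat dog) is invented so that P(B*) differs from P(BC).
bigram_p = {("dog", "cat"): 0.3, ("cat", "mouse"): 0.2, ("cat", "dog"): 0.1}

def p_b_star(b):
    # probability mass of all bigrams beginning with b
    return sum(p for (first, _), p in bigram_p.items() if first == b)

def trigram_p(a, b, c):
    # P(ABC) = P(AB) * P(BC) / P(B*) under the conditional independence assumption
    return bigram_p.get((a, b), 0.0) * bigram_p.get((b, c), 0.0) / p_b_star(b)

print(trigram_p("dog", "cat", "mouse"))  # 0.3 * 0.2 / (0.2 + 0.1) = 0.2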


Finding max of difference + element for an array

We have an array consisting of each entry as a tuple of two integers. Let the array be A = [(a1, b1), (a2, b2), ..., (an, bn)]. Now we have multiple queries where we are given an integer x, and we need to find the maximum value of ai + |x - bi| for 1 <= i <= n.
I understand this can be easily achieved in O(n) time complexity for each query but I am looking for something faster than that, probably O(log n) for each query. I can preprocess the array in O(n) time, but the queries should be done faster than O(n).
Any kind of help would be appreciated.
It seems to be way too easy to over-think this.
For n = 1, the function is v-shaped with a minimum of a1 at b1, with slopes of -1 and 1 to the left and right of it - let's call these values ac and bc (for "combined").
For an additional pair (ai, bi), one of the pairs may dominate the other (|bc - bi| ≤ |ac - ai|), in which case the dominated one may be ignored.
Otherwise, the falling slope of the combination comes from the pair with the larger b, the rising slope from the other.
The minimum lies between the individual b values, closer to the b of the pair with the larger a; its distance from that b is half the difference between the absolute values of the two "coordinate" differences, and the minimum value is that same amount higher than the larger a.
The main catch is that neither the position nor the value of this minimum needs to be an integer - the only alternative being exactly in the middle between two integers.
(You end up with the falling slope determined by max(ai + bi), and the rising slope by max(ai - bi).)
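To make that concrete, here is a small Python sketch of the resulting O(n)-preprocess / O(1)-query scheme (function names are mine): the upper envelope of all a_i + |x - b_i| is just max(x + max(a_i - b_i), max(a_i + b_i) - x).

def preprocess(pairs):
    # O(n): keep only the best slope-(+1) and slope-(-1) lines
    rise = max(a - b for a, b in pairs)
    fall = max(a + b for a, b in pairs)
    return rise, fall

def query(x, rise, fall):
    # O(1) per query
    return max(x + rise, fall - x)

pairs = [(3, 1), (1, 5)]
rise, fall = preprocess(pairs)
print(query(2, rise, fall))   # max(3 + |2-1|, 1 + |2-5|) = 4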

How to use linked list to implement the multiplication of two polynomial in O(M^2N)?

This is an exercise from "Data Structures and Algorithm Analysis in C", exercise 3.7.
Assume there are two polynomials implemented as linked lists. One has M terms, the other has N terms. The exercise asks me to implement the multiplication of the two polynomials in O(M^2 N) (assume that M is the smaller one). How do I solve this?
I can give you the idea.
Suppose the polynomials are 1+x+x^3 and 1+x^2.
Create a linked list P using (1,0)--->(1,1)--->(1,3).
Create another linked list Q: (1,0)--->(1,2), where (a,b) denotes a term with coefficient a and exponent b, i.e. a*x^b.
Now for each node in P, multiply it with each node of Q.
How? We will create a node res with (x,y) where
x = P->coeff * Q->coeff
and y = P->exp + Q->exp.
Add this new node to the polynomial which will contain the answer.
During insertion into the answer polynomial you have to keep in mind 2 things:
i) Keep the list sorted by exp (increasing, as I have taken it here -- you can take decreasing also).
ii) Find the correct position for the new node; if a node with the same exp value already exists, just add the coeff to it and discard the node you were about to insert.
Ok! Now print the polynomial.
For the complexity analysis, see the answers below.
You can find my implementation of polynomial multiplication here; it reads the polynomials from a file in the following format:
Let's say 2x^5 + 3x^3 + 5x is a polynomial expression; its linked list representation is as follows:
| 2 | 5 | -|---> | 3 | 3 | -|---> | 5 | 1 | NULL|
Regarding the time complexity, I perform the multiplication by multiplying each term of the first polynomial by each term of the second, and then adding common terms (i.e. those with the same degree) together.
If the first polynomial has N terms and the second has M terms, then multiplying them term by term gives N*M multiplications.
Note that after the common terms are combined the output polynomial has at most N*M terms (only about N+M when the polynomials are dense). So if you assume that M < N, and at each multiplication step you add the value to the cell which holds the same degree (if such a cell exists, else you append a new cell), you get the desired complexity.
Assume that M < N.
Multiply each term of the M-term polynomial by the entire N-term polynomial. At each step you get a temporary result polynomial with N terms; then take the union of the accumulated result polynomial with this temporary polynomial, which costs O(MN+N) in the last step.
Note that there are O(MN) multiplications.
The union operations in total take O((N+N)+(2N+N)+(3N+N)+...+(MN+N)) = O(M^2N).
So the overall complexity is O(MN) + O(M^2N) = O(M^2N).
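To make the merge argument concrete, here is a small Python sketch of that approach, with ordinary lists of (coeff, exp) pairs (kept sorted by exponent) standing in for the linked lists.

def scale(poly, coeff, exp):
    # multiply one term by a whole N-term polynomial: O(N)
    return [(coeff * c, exp + e) for c, e in poly]

def add(p, q):
    # merge two exponent-sorted polynomials, combining equal exponents
    out, i, j = [], 0, 0
    while i < len(p) and j < len(q):
        if p[i][1] == q[j][1]:
            out.append((p[i][0] + q[j][0], p[i][1])); i += 1; j += 1
        elif p[i][1] < q[j][1]:
            out.append(p[i]); i += 1
        else:
            out.append(q[j]); j += 1
    return out + p[i:] + q[j:]

def multiply(p, q):                # p has M terms, q has N terms
    result = []
    for c, e in p:                 # M passes; the k-th merge costs O(kN + N)
        result = add(result, scale(q, c, e))
    return result                  # total: O(M^2 N)

# (1 + x + x^3) * (1 + x^2)
print(multiply([(1, 0), (1, 1), (1, 3)], [(1, 0), (1, 2)]))
# [(1, 0), (1, 1), (1, 2), (2, 3), (1, 5)]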

How to efficiently evaluate or approximate a road Clothoid?

I'm facing the problem of computing values of a clothoid in C in real-time.
First I tried using the Matlab Coder to obtain auto-generated C code for the quadgk integrator applied to the Fresnel formulas. This essentially works great in my test scenarios. The only issue is that it runs incredibly slowly (in Matlab as well as in the auto-generated code).
Another option was interpolating a data table of the unit clothoid, connecting the sample points via straight lines (linear interpolation). I gave up after I found out that for only small changes in curvature (tiny steps along the clothoid) the results obviously degraded to lines. What a surprise...
I know that circles may be plotted using a different formula, but low changes in curvature are often encountered in real-world scenarios, and 30k sampling points between headings of 0° and 360° didn't provide enough angular resolution for my problems.
Then I tried a Taylor approximation around the R = inf point, hoping that there would be significant curvature everywhere I wanted it. I soon realized I couldn't use more than 4 terms (up to the power of 15), as the polynomial otherwise quickly becomes unstable (probably due to numerical inaccuracies in double-precision floating-point computation). Thus accuracy obviously degrades quickly for large t values. And by "large t values" I'm talking about every point on the clothoid that represents a curve of more than 90° w.r.t. the zero-curvature point.
For instance, when evaluating a road that goes from R=150m to R=125m while making a 90° turn, I'm way outside the region of valid approximation. Instead I'm in the range of 204.5° - 294.5°, whereas my Taylor limit would be at around 90° of the unit clothoid.
I'm kinda done randomly trying out things now. I mean I could just try to spend time on the dozens of papers one finds on that topic. Or I could try to improve or combine some of the methods described above. Maybe there even exists an integrate function in Matlab that is compatible with the Coder and fast enough.
This problem is so fundamental that it feels like I shouldn't have this much trouble solving it. Any suggestions?
About the 4 terms in the Taylor series: you should be able to use many more. A total theta of 2*pi is certainly doable with doubles.
You're probably calculating each term in isolation, according to the full formula, computing full factorial and power values; that is why you lose precision extremely fast.
Instead, calculate the terms progressively, each one from the previous one: find the formula for the ratio of the next term to the previous one in the series, and use it.
For increased precision, do not calculate in theta but rather in the distance s (so as not to lose precision on scaling).
Your example is an extremely flat clothoid. If I made no mistake, it goes from (25/22) pi =~ 204.545° to (36/22) pi =~ 294.545° (why not include these details in your question?). Nevertheless it should be OK. Even 2 pi = 360°, the full circle (and twice that), should pose no problem.
Given: r = 150 -> 125, with a 90-degree turn:
r * s = A^2, so 150 s = 125 (s + x)
=> 1 + x/s = 150/125 = 1 + 25/125, so x/s = 1/5
theta  = s^2 / (2 A^2) = s^2 / (300 s) = s / 300          ( = (pi/2) * (25/11) = 204.545° )
theta2 = (s + x)^2 / (300 s) = (6/5)^2 * s / 300          ( = (pi/2) * (36/11) = 294.545° )
theta2 - theta = (36/25 - 1) * s / 300 = pi/2
=> s = 300 * (pi/2) * (25/11) = 1070.99749554   and   x = s/5 = 214.1994991
A^2 = 150 s = 150 * 300 * (pi/2) * (25/11)
a = sqrt(2 A^2) = 300 * sqrt( (pi/2) * (25/11) ) = 566.83264608
The reference point is at r = Infinity, where theta = 0.
We have x = a * INT[u=0..(s/a)] cos(u^2) du, where a = sqrt(2 r s) and theta = (s/a)^2. Write out the Taylor series for cos and integrate it term by term to get your Taylor approximation for x as a function of the distance s along the curve, measured from the 0-point. That's all.
Next you have to decide with what density to calculate your points along the clothoid. You can find it from a desired tolerance value above the chord, for your minimal radius of 125. These points will then define the approximation of the curve by line segments drawn between consecutive points.
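A minimal Python sketch of the progressive-term idea (the function name and tolerance are my own choices, and only the cosine/x integral is shown; the sine/y integral works the same way):

def fresnel_c(t, tol=1e-15, max_terms=60):
    # C(t) = integral from 0 to t of cos(u^2) du
    #      = sum over k >= 0 of (-1)^k * t^(4k+1) / ((2k)! * (4k+1))
    # Each term is derived from the previous one via the term ratio, so no
    # large factorials or powers are ever formed explicitly.
    term = t           # the k = 0 term
    total = term
    k = 0
    while abs(term) > tol and k < max_terms:
        term *= -t**4 * (4*k + 1) / ((2*k + 1) * (2*k + 2) * (4*k + 5))
        total += term
        k += 1
    return total

# x-coordinate along the clothoid: x(s) = a * C(s / a), with the values
# derived above for the 150 m -> 125 m example
a = 566.83264608
s = 1070.99749554
print(a * fresnel_c(s / a))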
I am doing my thesis in the same area right now.
My approach is the following.
At each point on your clothoid, calculate (change in heading / distance traveled along the clothoid); this simple formula gives you the curvature at each point.
Then plot the curvature values: the x-axis is the distance along the clothoid, the y-axis is the curvature. By plotting this and applying a simple piecewise-linear simplification algorithm (search for a Peucker, i.e. Douglas-Peucker, implementation in your language of choice),
you can easily identify the sections where the curvature is zero (a line has no curvature), linearly increasing or decreasing (a CCW/CW Euler spiral), or constant and non-zero (an arc has constant curvature at all of its points).
I hope this will help you a little bit.
You can find my code on GitHub. I implemented some algorithms for such problems, like the Peucker algorithm.
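A tiny illustration of the curvature-profile idea (the sample headings and distances below are made up):

heading = [0.00, 0.00, 0.01, 0.03, 0.06, 0.08, 0.10]   # radians at each sample
dist    = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]          # metres along the path

# curvature at each sample ~= change in heading / distance traveled
curvature = [(heading[i + 1] - heading[i]) / (dist[i + 1] - dist[i])
             for i in range(len(dist) - 1)]
print(curvature)
# Stretches where the curvature stays at 0 are straight lines, constant
# non-zero stretches are arcs, and linear ramps are clothoid (Euler spiral) pieces.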

Compact storage coefficients of a multivariate polynomial

The setup
I am writing code for dealing with polynomials of degree n over a d-dimensional variable x, and ran into a problem that others have likely faced in the past. Such a polynomial can be characterized by coefficients c(alpha) corresponding to x^alpha, where alpha is a length-d multi-index specifying the powers the d variables must be raised to.
The dimension and order are completely general, but known at compile time, and could easily be as high as n = 30 and d = 10, though probably not at the same time. The coefficients are dense, in the sense that most coefficients are non-zero.
The number of coefficients required to specify such a polynomial is (n + d) choose n, which in high dimensions is much less than the n^d coefficients that could fill a cube of side length n. As a result, in my situation I have to store the coefficients rather compactly. This comes at a price, because retrieving the coefficient for a given multi-index alpha requires knowing its location.
The question
Is there a (straightforward) function mapping a d-dimensional multi-index alpha to a position in an array of length (n + d) choose n?
Ordering combinations
A well-known way to order combinations can be found on this wikipedia page. Very briefly you order the combinations lexically so you can easily count the number of lower combinations. An explanation can be found in the sections Ordering combinations and Place of a combination in the ordering.
Precomputing the binomial coefficients will speed up the index calculation.
Associating monomials with combinations
If we can now associate each monomial with a combination we can effectively order them with the method above. Since each coefficient corresponds with such a monomial this would provide the answer you're looking for. Luckily if
alpha = (a[1], a[2], ..., a[d])
then the combination you're looking for is
combination = (a[1] + 0, a[1] + a[2] + 1, ..., a[1] + a[2] + ... + a[d] + d - 1)
The index can then readily be calculated with the formula from the wikipedia page.
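Here is a small Python sketch of that mapping, using the combinatorial number system described on the wikipedia page (the function name is mine):

from math import comb

def index_of(alpha):
    # alpha = (a[1], ..., a[d]); build the strictly increasing combination
    # c[k] = a[1] + ... + a[k] + (k - 1) and return its rank sum_k C(c[k], k),
    # which lies in [0, C(n + d, d)) whenever sum(alpha) <= n.
    rank, running = 0, 0
    for k, a in enumerate(alpha, start=1):
        running += a
        rank += comb(running + k - 1, k)
    return rank

# all multi-indices of total degree <= 1 in d = 2 variables
print(index_of((0, 0)), index_of((0, 1)), index_of((1, 0)))   # 0 1 2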
A better, more object-oriented solution would be to create Monomial and Polynomial classes. The Polynomial class would encapsulate a collection of Monomials. That way you can easily model a pathological case like
y(x) = 1.0 + x^50
using just two terms rather than 51.
Another solution would be a map/dictionary where the key is the exponent and the value is the coefficient. That would only require two entries for my pathological case. You're in business if you have a C/C++ hash map.
Personally, I don't think doing it the naive way with arrays is so terrible, even with a polynomial containing 1000 terms. RAM is cheap; that array won't make or break you.
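For completeness, a toy Python sketch of the map idea (exponent -> coefficient):

poly = {0: 1.0, 50: 1.0}        # y(x) = 1.0 + x^50, stored with just two entries

def evaluate(poly, x):
    return sum(c * x**e for e, c in poly.items())

print(evaluate(poly, 2.0))      # 1.0 + 2.0**50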

How should I weight an N-gram sentence generator so that it doesn't favor short sentences?

I'm playing around with writing an n-gram sentence comparison/generation script. The model heavily favors shorter sentences; any quick suggestions on how I might weight it more towards longer sentences?
Assuming that you compute a score for each n-gram and rank the n-grams by these scores, you can adjust the scores by applying a different scalar weight for each value of n, e.g., v = <0.1, 0.2, 0.5, 0.9, 1.0>, where v[0] would be applied to an n-gram with n == 1. Such a vector could be determined from a larger text corpus by measuring the relative frequencies of a set of representative solution n-grams (e.g., if you are looking for sentences, then calculate n for each sentence, count the frequencies of each value of n, and create a probability distribution from that data).
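A toy Python sketch of that re-weighting (the weight vector is the illustrative one above; the candidate scores are made up):

weights = [0.1, 0.2, 0.5, 0.9, 1.0]          # weights[0] applies to n == 1

def reweighted(score, n):
    # fall back to the last weight for n-grams longer than the vector
    return score * (weights[n - 1] if n <= len(weights) else weights[-1])

candidates = [("the cat", 0.30, 2), ("the cat sat on the mat", 0.08, 6)]
ranked = sorted(candidates, key=lambda t: reweighted(t[1], t[2]), reverse=True)
print(ranked)   # the longer sentence now outranks the shorter one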
