Solving ill-conditioned system of linear equations with Lapack&co - c

I have the following 11x11 linear system of equations Ax = b with:
A = {
{1.0000000000000000, 8.0000000000000000, 6.0000000000000000, 12.0000000000000000, 24.0000000000000000, 24.0000000000000000, 8.0000000000000000, 6.0000000000000000, 24.0000000000000000, 24.0000000000000000, 24.0000000000000000},
{4.5999999999999996, 41.8531411531233601, 33.0479488942856037, 87.8349057232554173, 149.3783917109033439, 195.3689938163366833, 121.0451669808013690, 48.8422484540841708, 223.6406089026404516, 851.8470736603384239, 269.3015780207464900},
{21.1599999999999966, 218.9606780479085160, 182.0278210198854936, 642.9142219510971472, 929.7459962556697519, 1590.3768227003254196, 1831.4915561762611560, 397.5942056750813549, 2083.9634145976574473, 30235.1432043200838962, 3021.8058301860087340},
{97.3359999999999701, 1145.5240206653393216, 1002.6076877338904296, 4705.8591727678940515, 5786.8317341801457587, 12946.2633183243797248, 27711.6501551604087581, 3236.5658295810949312, 19419.1186238102454809, 1073154.9275125553831458, 33907.3782725576675148},
{447.7455999999998539, 5992.9723163999815370, 5522.3546042079124163, 34444.8913989153879811, 36017.8173980603314703, 105387.4349242659372976, 419295.1650431178859435, 26346.8587310664843244, 180954.3130575636751018, 38090161.8577392920851707, 380471.2698060897528194},
{0.0000000000000000, 34.2801357124991952, 168.4702728821191613, 2101.6181209908259007, 1236.1435394200643714, 6631.0420254749351443, 38374.2674650820554234, 4069.0485156323466072, 28291.8793721561523853, 7044717.1197200166061521, 60211.4334496619121637},
{2059.6297599999993508, 31353.0895356311411888, 30417.0821226643129194, 252121.9823892920394428, 224178.4848274685500655, 857893.2134182706940919, 6344206.6583608603104949, 214473.3033545676735230, 1686197.1981563565786928, 1351958038.0734937191009521, 4269229.7229307144880295},
{0.0000000000000000, 179.3414198404317403, 927.9328280691040618, 15382.9524602928686363, 7693.8805767663707229, 53979.1670196200575447, 580627.4516345988959074, 33123.5797620395824197, 263633.8804078772664070, 250042569.2999326586723328, 675626.4184535464737564},
{0.0000000000000000, 938.2502198978935439, 5111.0461132262771571, 112596.6815912620077142, 47887.4794405465727323, 439410.6478194649680518, 8785268.3545934017747641, 269638.3520710353623144, 2456635.0642409822903574, 8874917956.1941699981689453, 7581135.8600852200761437},
{0.0000000000000000, 938.2502198978935439, 0.0000000000000000, 56298.3407956310038571, 23943.7397202732863661, 319571.3802323381532915, 8785268.3545934017747641, 0.0000000000000000, 269630.6777825467870571, 3293783983.7421655654907227, 1735440.7390556528698653},
{0.0000000000000000, 70.9608494071368625, 1546.2151390406352220, 34063.2210755480555235, 13279.8613116998949408, 129911.1650312914862297, 2657756.2850107550621033, 183537.2854802548536099, 1654054.3836708476301283, 5487391301.6329326629638672, 5049794.3807012736797333}
};
b = {1, 6.167551546217714, 39.66265463865314, 267.9960092725794, 1918.2310370808632, 137.49061855461255, 14662.396462231256, 1216.4598834815756, 11424.520672986631, 3808.17355766221, 6082.299417407878};
The matrix is clearly ill-conditioned, although the correct solution can be found with mathematica:
x = {0.0775277, 0.00771443, 0.087553, 0.0208838, 8.47931*1e-7, 0.00197285, 0.0000611365, 0.00187375, 0.000283606, 3.82771*1e-9, 0.000788588};
I now want to solve the system using this and many other similar matrices inside a C program.
I have tried almost every lapack function for solving a linear system of equations, in particular:
dgesv
dsgesv
dgels
dgelss
dgelsy
but they all give severely wrong results.
At this point I don't expect to have any typo / mistake from a programming point of view, since trying with well-conditioned matrices I get correct results.
I guess the problem is conceptual, or maybe I have to use other tools. Is there anything I can do to get the correct solution with some routine from mathematical libraries?

Solving ill-conditioned linear equations is generally hard. At the very least, you cannot expect those one-step LAPACK drivers to give an answer with satisfactory numerical error.
As a good start, you could use the truncated SVD method to get a more numerically stable result.
https://en.m.wikipedia.org/wiki/Linear_least_squares_(mathematics)
This method is the most computationally intensive, but is particularly useful if the normal equations matrix, X^T X, is very ill-conditioned (i.e. if its condition number multiplied by the machine's relative round-off error is appreciably large). In that case, including the smallest singular values in the inversion merely adds numerical noise to the solution. This can be cured with the truncated SVD approach, giving a more stable and exact answer, by explicitly setting to zero all singular values below a certain threshold and so ignoring them, a process closely related to factor analysis.
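To make the thresholding step concrete, here is a minimal sketch in C. It assumes the factors A = U * diag(s) * V^T have already been computed by an SVD routine (for example LAPACK's dgesvd, which hands back V^T rather than V, so the indexing of v would need to be transposed). The function name tsvd_solve and the row-major layout are assumptions made for this illustration, not part of any library. Note also that dgelss itself takes an RCOND argument with the same meaning, so raising it above the machine-precision default may be worth trying before writing anything by hand.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Truncated-SVD solve of A x = b for an n-by-n matrix A = U * diag(s) * V^T.
 * u and v are n-by-n and row-major (hypothetical layout for this sketch);
 * s holds the singular values in descending order.
 * Singular values with s[i] <= rcond * s[0] are treated as zero.
 * Returns the number of singular values kept (the effective rank).
 */
int tsvd_solve(int n, const double *u, const double *s, const double *v,
               const double *b, double rcond, double *x)
{
    assert(n > 0);
    int rank = 0;
    for (int j = 0; j < n; j++)
        x[j] = 0.0;
    for (int i = 0; i < n; i++) {
        if (s[i] <= rcond * s[0])
            break;              /* sorted, so every later value is dropped too */
        double c = 0.0;         /* c = (i-th column of U . b) / s[i] */
        for (int k = 0; k < n; k++)
            c += u[k * n + i] * b[k];
        c /= s[i];
        for (int j = 0; j < n; j++)
            x[j] += c * v[j * n + i];   /* add c times the i-th column of V */
        rank++;
    }
    return rank;
}
```

The returned rank tells you how many singular values survived the cut, which is a quick way to see how much of the system was actually used for a given rcond.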
More effective methods may involve making the matrix well-conditioned before solving, by finding a preconditioning matrix. For that you need some understanding of the structure of the original matrix. You can find some more ideas in the following discussion.
https://www.researchgate.net/post/How_can_I_solve_an_ill-conditioned_linear_system_of_equations
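One very simple form of preconditioning is column scaling, which may be worth a try here because the columns of A range from order 1 to order 1e10. The sketch below is only an illustration under that assumption: scale_columns is a hypothetical helper, the matrix is taken to be row-major, and there is no guarantee that scaling alone rescues this particular system. LAPACK's expert driver dgesvx can perform a similar equilibration itself.

```c
#include <assert.h>

/*
 * Scales each column of the n-by-n row-major matrix a so that its largest
 * entry has magnitude 1, and records the factors in scale[j].
 * If y solves the scaled system, x[j] = scale[j] * y[j] solves the original.
 * An all-zero column is left untouched (scale[j] = 1).
 */
void scale_columns(int n, double *a, double *scale)
{
    assert(n > 0);
    for (int j = 0; j < n; j++) {
        double big = 0.0;                 /* largest magnitude in column j */
        for (int i = 0; i < n; i++) {
            double v = a[i * n + j];
            if (v < 0.0)
                v = -v;
            if (v > big)
                big = v;
        }
        scale[j] = (big > 0.0) ? 1.0 / big : 1.0;
        for (int i = 0; i < n; i++)
            a[i * n + j] *= scale[j];
    }
}
```

After solving the scaled system with any of the usual drivers, multiply each component of the result by the matching scale[j] to recover x.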

Related

How do I implement a controlled Rx in Cirq/Tensorflow Quantum?

I am trying to implement a controlled rotation gate in Cirq/Tensorflow Quantum.
The readthedocs.io at https://cirq.readthedocs.io/en/stable/gates.html states:
"Gates can be converted to a controlled version by using Gate.controlled(). In general, this returns an instance of a ControlledGate. However, for certain special cases where the controlled version of the gate is also a known gate, this returns the instance of that gate. For instance, cirq.X.controlled() returns a cirq.CNOT gate. Operations have similar functionality Operation.controlled_by(), such as cirq.X(q0).controlled_by(q1)."
I have implemented
cirq.rx(theta_0).on(q[0]).controlled_by(q[3])
I get the following error:
~/.local/lib/python3.6/site-packages/cirq/google/serializable_gate_set.py in
serialize_op(self, op, msg, arg_function_language)
193 return proto_msg
194 raise ValueError('Cannot serialize op {!r} of type {}'.format(
--> 195 gate_op, gate_type))
196
197 def deserialize_dict(self,
ValueError: Cannot serialize op cirq.ControlledOperation(controls=(cirq.GridQubit(0, 3),), sub_operation=cirq.rx(sympy.Symbol('theta_0')).on(cirq.GridQubit(0, 0)), control_values=((1,),)) of type <class 'cirq.ops.controlled_gate.ControlledGate'>
I have the qubits and symbols initialized as:
q = cirq.GridQubit.rect(1, 4)
symbol_names = x_0, x_1, x_2, x_3, theta_0, theta_1, z_2, z_3
I do re-use the circuits with various circuits.
My question: How do I properly implement a controlled Rx in Cirq/Tensorflow Quantum?
P.S. I can't find a tag for Google Cirq
Follow up:
How does this generalize to the similar situations of Controlled Ry and controlled Rz?
For Rz I found a gate decomposition at https://threeplusone.com/pubs/on_gates.pdf, involving H.on(q1), CNOT(q0, q1), H.on(q2), but this is not yet a CRz with an arbitrary angle. Would I introduce the angle before the H?
For the Ry, I did not find a decomposition yet, neither the CRy.
What you have is a completely correct implementation of a controlled X rotation in Cirq. It can be used in simulation and other things like cirq.unitary without any issues.
TFQ only supports a subset of the gates in Cirq. For example, a cirq.ControlledGate can have an arbitrary number of control qubits, which in some cases makes it harder to decompose down to primitive gates that are compatible with NISQ hardware platforms (this is why cirq.decompose doesn't do anything to ControlledOperations). TFQ only supports these primitive-style gates. For a full list of the supported gates, you can do:
tfq.util.get_supported_gates().keys()
In your case it is possible to come up with a simpler implementation of this gate. First we can note that cirq.rx(some angle) is equal to cirq.X**(some angle / pi) offset by a global phase:
>>> a = cirq.rx(0.3)
>>> b = cirq.X**(0.3 / np.pi)
>>> cirq.equal_up_to_global_phase(cirq.unitary(a), cirq.unitary(b))
True
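For reference, the identity behind that check can be written out. This assumes the usual conventions, namely that cirq.rx(theta) is the exponential of -i*theta*X/2 and that cirq.X**t carries no extra global shift, so treat the phase factor as an assumption if your version defines things differently.

```latex
\begin{align*}
R_x(\theta) &= e^{-i\theta X/2} = \cos\tfrac{\theta}{2}\, I - i \sin\tfrac{\theta}{2}\, X, \\
X^{t} &= e^{i\pi t/2}\left(\cos\tfrac{\pi t}{2}\, I - i \sin\tfrac{\pi t}{2}\, X\right), \\
X^{\theta/\pi} &= e^{i\theta/2}\, R_x(\theta).
\end{align*}
```

The two gates therefore differ only by the scalar factor e^{i theta/2}, which is what equal_up_to_global_phase ignores.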
Let's move to using X now. Then the operation we are after is:
>>> qs = cirq.GridQubit.rect(1,2)
>>> a = (cirq.X**0.3)(qs[0]).controlled_by(qs[1])
>>> b = cirq.CNOT(qs[1], qs[0]) ** 0.3  # control qs[1], target qs[0], matching a
>>> cirq.equal_up_to_global_phase(cirq.unitary(a), cirq.unitary(b))
True
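Since cirq.CNOT is in the TFQ supported gates, it should be serializable without any issues. If you want to make a symbolized version of the gate, you can just replace the 0.3 with a sympy.Symbol. One caveat: the global phase that separates cirq.rx from a power of cirq.X turns into a relative phase on the control qubit once the gate is controlled. A power of CNOT therefore equals the controlled rx only up to a phase gate on the control; for rx(theta) versus CNOT**(theta / pi), that gate is cirq.Z**(-theta / (2 * pi)) on the control qubit.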
Answer to follow up: If you want to do a CRz you can do the same thing you did above, swapping out the CNOT gate for the CZ gate. For CRy it's not as easy. For that I would recommend doing some combination of: cirq.Y(0) and cirq.YY(0, 1).
Edit: tfq-nightly builds and likely releases after 0.4.0 now include support for arbitrary controlled gates. So on these versions of tfq you could also do things like cirq.Y(...).controlled_by(...) to achieve the desired result now too.

cublas matrix matrix multiplication gives INTERNAL ERROR when applying to matrix with one very long dimension with multiple GPUs

What I tried to do was simply apply cublasDgemm (matrix-matrix multiplication) to several matrices with "double" (8-byte) elements, all of which have one very large dimension. In my case, the matrices are 12755046 by 46. Simply put, A[46,12755046]*B_i[12755046,46] = C_i[46,46], where i = 1,2,3,....
The machine includes 128GB memory and two GTX2080Ti (11GB GPU memory) so my original strategy was to distribute B_i to each GPU. However, I always get INTERNAL ERROR when I execute my code on two GPUs.
So I solved this problem by trying three things:
1. use one GPU only. No error.
2. downsize the matrix size but keep using two GPUs. No error.
3. use cublasXt which implicitly uses two GPUs. No error.
Though it is solved, I am still interested in why my original plan did not work for matrices with one very large dimension. I am guessing this could be due to some internal limitation of cublas, or did I miss some configuration?
I attached my simplified code here to illustrate my original plan:
double *A, *B[2], *C[2];
cudaMallocManaged(&A, 46*12755046*sizeof(double));
cudaMallocManaged(&B[0], 46*12755046*sizeof(double));
cudaMallocManaged(&B[1], 46*12755046*sizeof(double));
cudaMallocManaged(&C[0], 46*12755046*sizeof(double));
cudaMallocManaged(&C[1], 46*12755046*sizeof(double));
givevalueto(A);
givevalueto(B[0]);
givevalueto(B[1]);
double alpha = 1.0;
double beta = 0.0;
cublasHandle_t handle[nGPUs];
int iGPU;
for(iGPU=0;iGPU<nGPUs;iGPU++)
{
    cublasCreate (& handle[iGPU]);
}
for(iGPU=0;iGPU<nGPUs;iGPU++)
{
    cudaSetDevice(iGPU);
    cublasDgemm(handle[iGPU],CUBLAS_OP_N,CUBLAS_OP_N,46,46,12755046,&alpha,A,46,B[iGPU],12755046,&beta,C[iGPU],46);
}
for(iGPU=0;iGPU<nGPUs;iGPU++)
{
    cudaSetDevice(iGPU);
    cudaDeviceSynchronize();
}
for(iGPU=0;iGPU<nGPUs;iGPU++)
{
    cudaFree(B[iGPU]);
}
The cublas handle is applicable to the device that was active when the handle was created.
From the documentation for cublasCreate:
The CUBLAS library context is tied to the current CUDA device.
See also the description of the cublas context:
The device associated with a particular cuBLAS context is assumed to remain unchanged between the corresponding cublasCreate() and cublasDestroy() calls. In order for the cuBLAS library to use a different device in the same host thread, the application must set the new device to be used by calling cudaSetDevice() and then create another cuBLAS context, which will be associated with the new device, by calling cublasCreate().
You can fix your code with:
for(iGPU=0;iGPU<nGPUs;iGPU++)
{
    cudaSetDevice(iGPU); // add this line
    cublasCreate (& handle[iGPU]);
}

Filtering "Smoothing" an array of numbers in C

I am writing an application in Xcode. It gathers sensor data (gyroscope) and then transforms it through FFTW. At the end I get the result in an array. In the app I plot the graph, but there are so many peaks (see the graph in red) that I would like to smooth it.
My array:
double magnitude[S];
...
magnitude[i]=sqrt((fft_result[i][0])*(fft_result[i][0])+ (fft_result[i][1])*(fft_result[i][1]) );
An example array (for 30 samples, normally I am working with 256 samples):
"0.9261901713034604",
"2.436272348237486",
"1.618854900218465",
"1.849221286218342",
"0.8495016887742839",
"0.5716796354304043",
"0.4229791869017677",
"0.3731843430827401",
"0.3254446111798023",
"0.2542702545675339",
"0.25237940627189",
"0.2273716541964159",
"0.2012780334451323",
"0.2116151847259499",
"0.1921943719520009",
"0.1982429400169304",
"0.18001770452247",
"0.1982429400169304",
"0.1921943719520009",
"0.2116151847259499",
"0.2012780334451323",
"0.2273716541964159",
"0.25237940627189",
"0.2542702545675339",
"0.3254446111798023",
"0.3731843430827401",
"0.4229791869017677",
"0.5716796354304043",
"0.8495016887742839",
"1.849221286218342"
How can I filter/smooth it? What about a Gaussian filter? Any idea how to begin, or even some sample code?
Thank you for your help!
best regards
josef
Simplest way to smooth would be to replace each sample with the average of it and its 2 neighbors.
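As a minimal sketch of that idea in C: the function below writes the centered 3-point average into a separate output array. The name smooth3 and the handling of the two end points (averaging only the neighbours that exist) are choices made for this illustration; other edge treatments are just as reasonable.

```c
#include <assert.h>
#include <stddef.h>

/* Centered 3-point moving average; end points average the neighbours that exist. */
void smooth3(const double *in, double *out, size_t n)
{
    assert(in != out);              /* needs a separate output buffer */
    for (size_t i = 0; i < n; i++) {
        double sum = in[i];
        int count = 1;
        if (i > 0) {                /* left neighbour */
            sum += in[i - 1];
            count++;
        }
        if (i + 1 < n) {            /* right neighbour */
            sum += in[i + 1];
            count++;
        }
        out[i] = sum / count;
    }
}
```

With the array from the question this would be called as smooth3(magnitude, smoothed, S), where smoothed is a second array of S doubles. Running it more than once smooths more strongly.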
The simplest idea would be to take the average of each pair of adjacent points and put it into a new array. Something like
double smooth_array[S];
for (i = 0; i < S-1; i++)
    smooth_array[i] = (magnitude[i] + magnitude[i+1]) / 2;
smooth_array[S-1] = magnitude[S-1];
It is not the best method, but I think it should be OK.
If you need a more rigorous approach, use some kind of approximation algorithm, such as least-squares function approximation or even full SE13/SE35 etc. algorithms.

Testing an Algorithms speed. How?

I'm currently testing different algorithms that determine whether an integer is a perfect square or not. During my research I found this question on SO:
Fastest way to determine if an integer's square root is an integer
I'm comparatively new to programming. When testing the different algorithms presented in that question, I found out that this one
bool istQuadratSimple(int64 x)
{
    int32 tst = (int32)sqrt(x);
    return tst*tst == x;
}
actually works faster than the one provided by A. Rex in the Question I posted. I've used an NS-Timer object for this testing, printing my results with an NSLog.
My question now is: How is speed-testing done in a professional way? How can I achieve equivalent results to the ones provided in the question I posted above?
The problem with calling just this function in a loop is that everything will be in the cache (both the data and the instructions). You wouldn't measure anything sensible; I wouldn't do that.
Given how small this function is, I would try to look at the generated assembly code of this function and the other one and I would try to reason based on the assembly code (number of instructions and the cost of the individual instructions, for example).
Unfortunately, it only works in trivial / near trivial cases. For example, if the assembly codes are identical then you know there is no difference, you don't need to measure anything. Or if one code is like the other plus additional instructions; in that case you know that the longer one takes longer to execute. And then there are the not so clear cases... :(
(See the update below.)
You can get the assembly with the -S flag from both gcc and clang (with clang, -S -emit-llvm gives you the LLVM IR instead).
Hope this helps.
UPDATE: Response to Prateek's question in the comment "is there any way to determine the speed of one particular algorithm?"
Yes, it is possible, but it gets horribly complicated REALLY quickly. Long story short, ignoring the complexity of modern processors and simply accumulating some predefined cost associated with the instructions can lead to very inaccurate results (the estimate can be off by a factor of 100, due to the cache and the pipeline, among others). If you try to take into consideration the complexity of modern processors, the hierarchical cache, the pipeline, etc., things get very difficult. See for example Worst Case Execution Time Prediction.
Unless you are in a clear situation (trivial / near trivial case), for example the generated assembly codes are identical or one is like the other plus a few instructions, it is also hard to compare algorithms based on their generated assembly.
However, here a simple function of two lines is shown, and for that, looking at the assembly could help. Hence my answer.
I am not sure if there is any professional way of checking the speed (if there is, let me know as well). For the method that you pointed to in your question, I would probably do something like this in Java.
package Programs;
import java.math.BigDecimal;
import java.math.RoundingMode;
public class SquareRootInteger {
    public static boolean isPerfectSquare(long n) {
        if (n < 0)
            return false;
        long tst = (long) (Math.sqrt(n) + 0.5);
        return tst * tst == n;
    }
    public static void main(String[] args) {
        long iterator = 1;
        int precision = 10;
        long startTime = System.nanoTime(); // System time before calling isPerfectSquare repeatedly
        while (iterator < 1000000000) {
            isPerfectSquare(iterator);
            iterator++;
        }
        long endTime = System.nanoTime(); // System time after all the calls to isPerfectSquare
        long duration = endTime - startTime;
        BigDecimal dur = new BigDecimal(duration);
        BigDecimal iter = new BigDecimal(iterator);
        System.out.println("Speed "
                + dur.divide(iter, precision, RoundingMode.HALF_UP).toString()
                + " nano secs"); // Average time taken for one execution of the method
    }
}
You can check your method in a similar fashion and see which one outperforms the other.
Record the time value before your massive calculation and the value after it. The difference is the elapsed time.
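The before-and-after idea can be sketched in plain C with clock() from <time.h>. Everything here is illustrative: time_per_call_ns is a hypothetical helper, and is_square_int is an integer-only stand-in for the candidate, chosen so that the sketch needs nothing beyond the standard headers; swap in the function you actually want to measure. Note that clock() reports processor time rather than wall-clock time, and that the results are summed so the compiler has no excuse to drop the calls.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Stand-in candidate: perfect-square test using an integer Newton iteration. */
static int is_square_int(int64_t x)
{
    if (x < 0)
        return 0;
    if (x < 2)
        return 1;
    int64_t r = x / 2;                    /* starts at or above floor(sqrt(x)) */
    int64_t next = (r + x / r) / 2;
    while (next < r) {                    /* decreases until it reaches floor(sqrt(x)) */
        r = next;
        next = (r + x / r) / 2;
    }
    return r * r == x;
}

/*
 * Calls f on 0..reps-1 and returns the average processor time per call in
 * nanoseconds. The results are summed into *hits so the calls stay observable.
 */
double time_per_call_ns(int (*f)(int64_t), int64_t reps, int64_t *hits)
{
    assert(reps > 0);
    int64_t found = 0;
    clock_t start = clock();              /* time before the work */
    for (int64_t i = 0; i < reps; i++)
        found += f(i);
    clock_t end = clock();                /* time after the work */
    *hits = found;
    return 1e9 * (double)(end - start) / CLOCKS_PER_SEC / (double)reps;
}
```

Use a repetition count large enough that the total run takes well above the clock resolution, repeat the measurement a few times, and compare candidates on the same inputs.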
Write a shell script in which you run the program, and run 'time ./xxx.sh' to get its running time.

C++ path finding in a 2d array

I have been struggling badly with this challenge my lecturer has provided. I have programmed the files that set up the class needed for this solution, but I have no idea how to implement it. Here is the class in question, where I need to add the algorithm.
#include "Solver.h"
int* Solver::findNumPaths(const MazeCollection& mazeCollection)
{
    int *numPaths = new int[mazeCollection.NUM_MAZES];
    return numPaths;
}
And here is the problem description we have been provided. Does anybody know how to implement this, or can anyone set me on the right track? Thank you!
00C, we need your help again.
Angry with being thwarted, the diabolically evil mastermind Dr Russello Kane has unleashed a scurry of heavy-armed squirrels to attack the BCB and eliminate all the delightfully beautiful and intellectual superior computing students.
We need to respond to this threat at short notice and have plans to partially barricade the foyer of the BCB. The gun-toting squirrels will enter the BCB at square [1,1] and rush towards the exit shown at [10,10].
A square that is barricaded is impassable to the furry rodents. Importantly, the squirrel bloodlust is such that they will only ever move towards the exit – either moving one square to the right, or one square down. The squirrels will never move up or to the left, even if a barricade is blocking their approach.
Our boffins need to run a large number of tests to determine how barricade placement will impede the movement of the squirrels. In each test, a number of squares will be barricaded and you must determine the total number of different paths from the start to the exit (adhering to the squirrel movement patterns noted above).
A number of our boffins have been heard to mumble something incoherent about a recursive counting algorithm, others about the linkage between recursion and iteration, but I’m sure, OOC, you know better than to be distracted by misleading advice.
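Start with the obvious:
const int N = 10;
bool blocked[N + 1][N + 1]; // assumed: blocked[x][y] is true for a barricaded square (1-based)
int count = 0;
void countPaths(int x, int y) {
    if (blocked[x][y])
        return; // a barricaded square cannot be entered
    if (x == N && y == N) {
        count++;
        return;
    }
    if (x < N) // can move right
        countPaths(x + 1, y);
    if (y < N) // can move down
        countPaths(x, y + 1);
}
Start by calling countPaths(1, 1), since the squirrels enter at [1,1].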
Not the most efficient by a long shot, but it'll work. Then look for ways to optimize (for example, you end up re-computing paths from the squares close to the goal a LOT -- reducing that work could make a big difference).
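One way to avoid that repeated work is to compute the number of paths into each square once, row by row, instead of recursing. The sketch below is written in plain C and is only an illustration: count_paths and the row-major blocked array are assumptions made here, not part of the assignment's Solver or MazeCollection classes, and it uses 0-based indices rather than the 1-based squares of the problem text.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Counts right/down paths from the top-left to the bottom-right square of an
 * n-by-n grid. blocked is row-major (blocked[row * n + col] != 0 means
 * barricaded); NULL means no barricades. Returns -1 if allocation fails.
 */
long long count_paths(int n, const unsigned char *blocked)
{
    assert(n > 0);
    long long *row = calloc((size_t)n, sizeof *row);
    if (row == NULL)
        return -1;
    for (int r = 0; r < n; r++) {
        for (int c = 0; c < n; c++) {
            if (blocked != NULL && blocked[r * n + c]) {
                row[c] = 0;                     /* nothing passes through a barricade */
            } else if (r == 0 && c == 0) {
                row[c] = 1;                     /* the start square itself */
            } else {
                long long from_above = row[c];  /* value left over from row r-1 */
                long long from_left = (c > 0) ? row[c - 1] : 0;
                row[c] = from_above + from_left;
            }
        }
    }
    long long total = row[n - 1];
    free(row);
    return total;
}
```

Each square is visited exactly once, so a 10-by-10 maze costs 100 additions no matter how many paths exist, whereas the recursive version does work proportional to the number of paths.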

Resources