Data.Map vs. Data.Array for symmetric matrices? - arrays

Sorry for the vague question, but I hope for an experienced Haskeller this is a no-brainer.
I have to represent and manipulate symmetric matrices, so there are basically three different choices for the data type:
1. A complete matrix storing both the (i,j) and the (j,i) element, although m(i,j) = m(j,i):
   Data.Array (Int, Int) Int
2. A map storing only the elements (i,j) with i <= j (upper triangular matrix):
   Data.Map (Int, Int) Int
3. A vector indexed by k, storing the upper triangular matrix given some vector order f(i,j) = k:
   Data.Array Int Int
Many operations are going to be necessary on the matrices, updating a single element, querying for rows and columns etc. However, they will mainly act as containers, no linear algebra operations (inversion, det, etc) will be required.
Which of these options would be the fastest in general if the matrices are going to be around 20x20? If I understand correctly, every update (with (//) in the case of an array) requires a full copy, so going from 20x20 = 400 elements to 20*21/2 = 210 elements in cases 2 or 3 would make a lot of sense. However, access is slower in case 2, and case 3 needs an index conversion at some point.
Are there any guidelines?
Btw: The 3rd option is not a really good one, as computing f^-1 requires square roots.
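To make the cost of option 3 concrete, here is a minimal sketch of one possible vector order and its inverse. It is written in Python rather than Haskell purely for illustration; the names tri_index and tri_unindex and the ordering k = j*(j+1)/2 + i (for i <= j) are assumptions, not something fixed by the question. With this ordering the inverse needs only an integer square root.

```python
from math import isqrt

def tri_index(i, j):
    """Map (i, j) of a symmetric matrix to a flat index (assumed ordering)."""
    if i > j:              # m(i,j) = m(j,i), so normalise to i <= j
        i, j = j, i
    return j * (j + 1) // 2 + i

def tri_unindex(k):
    """Inverse of tri_index: recover (i, j) with i <= j from k."""
    j = (isqrt(8 * k + 1) - 1) // 2   # largest j with j*(j+1)/2 <= k
    i = k - j * (j + 1) // 2
    return i, j

n = 20
size = n * (n + 1) // 2    # number of stored elements for an n x n matrix
```

For the 20x20 case, size is the 210 elements mentioned in the question, and tri_index(3, 2) and tri_index(2, 3) address the same slot.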

You could try using Data.Array using a specialized Ix class that only generates the upper half of the matrix:
import Data.Array  -- also re-exports the Ix class

newtype Symmetric = Symmetric { pair :: (Int, Int) } deriving (Ord, Eq)

instance Ix Symmetric where
    -- Enumerate only the half with x >= y, with y outermost, so that the
    -- order of `range` agrees with the positions computed by `index`.
    range (Symmetric (x1,y1), Symmetric (x2,y2)) =
        map Symmetric [(x,y) | y <- range (y1,y2), x <- range (x1,x2), x >= y]

    inRange (lo,hi) i = x <= hix && x >= lox && y <= hiy && y >= loy && x >= y
      where
        (lox,loy) = pair lo
        (hix,hiy) = pair hi
        (x,y) = pair i

    index (lo,hi) i
        | inRange (lo,hi) i = (x-lox) + sum (take (y-loy) [hix-lox, hix-lox-1 ..])
        | otherwise = error "Error in array index"
      where
        (lox,loy) = pair lo
        (hix,hiy) = pair hi
        (x,y) = pair i

-- Smart constructor: swaps the coordinates so that x >= y always holds.
sym :: Int -> Int -> Symmetric
sym x y
    | x < y = Symmetric (y,x)
    | otherwise = Symmetric (x,y)
*Main Data.Ix> let a = listArray (sym 0 0, sym 6 6) [0..]
*Main Data.Ix> a ! sym 3 2
14
*Main Data.Ix> a ! sym 2 3
14
*Main Data.Ix> a ! sym 2 2
13
*Main Data.Ix> length $ elems a
28
*Main Data.Ix> let b = listArray (sym 0 0, sym 19 19) [0..]
*Main Data.Ix> length $ elems b
210

There is a fourth option: use an array of decreasingly-large arrays. I would go with either option 1 (using a full array and just storing every element twice) or this last one. If you intend to be updating a lot of elements, I strongly recommend using a mutable array; IOArray and STArray are popular choices.
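As a rough illustration of this fourth option with in-place updates, here is a minimal Python sketch in which nested lists stand in for a mutable array of arrays. The helper names make_symmetric, get and put are hypothetical, and the rows here grow (row i has i + 1 entries) instead of shrinking, which is the mirror image of the same layout.

```python
def make_symmetric(n, fill=0):
    """Row i holds the i + 1 entries (i, 0) .. (i, i)."""
    return [[fill] * (i + 1) for i in range(n)]

def get(m, i, j):
    """Read element (i, j); the two index orders share one slot."""
    return m[i][j] if i >= j else m[j][i]

def put(m, i, j, value):
    """In-place update: no copy of the whole matrix is made."""
    if i >= j:
        m[i][j] = value
    else:
        m[j][i] = value

m = make_symmetric(20)
put(m, 2, 5, 42)
```

After the update, get(m, 2, 5) and get(m, 5, 2) both return 42, and only 210 slots are allocated in total.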
Unless this is for homework or something, you should also take a peek at Hackage. A quick look suggests the problem of manipulating matrices has been solved several times already.

Related

Knights tour in haskell getting a loop

I'm in the process of coding a knight's tour function, and this is as far as I have got; it seems to run into an infinite loop in GHCi:
type Field = (Int, Int)

nextPositions :: Int -> Field -> [Field]
nextPositions n (x,y) = filter onBoard
    [(x+2,y-1),(x+2,y+1),(x-2,y-1),(x-2,y+1),(x+1,y-2),(x+1,y+2),(x-1,y-2),(x-1,y+2)]
  where onBoard (x,y) = x `elem` [1..n] && y `elem` [1..n]

type Path = [Field]

knightTour :: Int -> Field -> [Path]
knightTour n start = [posi:path | (posi,path) <- tour (n*n)]
  where tour 1 = [(start, [])]
        tour k = [(posi', posi:path) | (posi, path) <- tour (k-1), posi' <- (filter (`notElem` path) (nextPositions n posi))]
For example, knightTour 10 (4,4) does not produce any output!
Any advice?
I think one of the main problems is checking if you have visited a square. This takes too much time. You should look for a data structure that makes that more efficient.
For small boards, for example up to 8×8, you can use a 64-bit integer for that. A 64-bit integer can be seen as 64 booleans, each of which records whether the knight has already visited the corresponding square.
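The bookkeeping itself is small. As a minimal sketch of the idea (in Python, with the hypothetical helper names bit, is_visited and visit, and assuming the row-major bit numbering n*r + c):

```python
def bit(n, square):
    """Bit that represents square (r, c) on an n x n board."""
    r, c = square
    return 1 << (n * r + c)

def is_visited(n, mask, square):
    """True if the square's bit is set in the mask."""
    return mask & bit(n, square) != 0

def visit(n, mask, square):
    """Return a new mask with the square marked; the old mask is unchanged."""
    return mask | bit(n, square)

n = 8
mask = visit(n, 0, (0, 0))
mask = visit(n, mask, (1, 2))
```

Both the membership test and the update are constant-time bit operations, in contrast to a linear `notElem` scan over the path so far.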
We can thus implement this with:
{-# LANGUAGE BangPatterns #-}

import Data.Bits (testBit, setBit)
import Data.Word (Word64)

-- Is square (r, c) of an n×n board already marked in the bitmask w?
testPosition :: Int -> Word64 -> (Int, Int) -> Bool
testPosition !n !w (!r, !c) = testBit w (n*r + c)

-- Mark square (r, c) as visited.
setPosition :: Int -> (Int, Int) -> Word64 -> Word64
setPosition !n (!r, !c) !w = setBit w (n*r + c)

-- Unvisited squares on the board that are one knight move away.
nextPositions :: Int -> Word64 -> (Int, Int) -> [(Int, Int)]
nextPositions !n !w (!x, !y) =
    [ c
    | c@(x', y') <- [(x-1,y-2), (x-1,y+2), (x+1,y-2), (x+1,y+2), (x-2,y-1), (x-2,y+1), (x+2,y-1), (x+2,y+1)]
    , x' >= 0
    , y' >= 0
    , x' < n
    , y' < n
    , not (testPosition n w c)
    ]

knightTour :: Int -> (Int, Int) -> [[(Int, Int)]]
knightTour n p0 = go (n*n-1) (setPosition n p0 0) p0
  where
    go 0 _ _ = [[]]
    go !k !w !ps =
        [ ps':rs
        | ps' <- nextPositions n w ps
        , rs <- go (k-1) (setPosition n ps' w) ps'
        ]

main :: IO ()
main = print (knightTour 6 (1,1))
If I compile this with the -O2 flag and run this locally for a 5×5 board where the knight starts at (1,1), all the solutions are generated in 0.32 seconds. For a 6×6 board, it takes 2.91 seconds to print the first solution, but it takes forever to find all solutions that start at (1,1). For an 8×8 board, the first solution was found in 185.76 seconds:
[(0,3),(1,5),(0,7),(2,6),(1,4),(0,2),(1,0),(2,2),(3,0),(4,2),(3,4),(4,6),(5,4),(6,2),(5,0),(3,1),(2,3),(3,5),(2,7),(0,6),(2,5),(1,3),(0,1),(2,0),(3,2),(2,4),(0,5),(1,7),(3,6),(4,4),(5,6),(7,7),(6,5),(7,3),(6,1),(4,0),(5,2),(7,1),(6,3),(7,5),(6,7),(5,5),(4,7),(6,6),(7,4),(5,3),(7,2),(6,0),(4,1),(3,3),(2,1),(0,0),(1,2),(0,4),(1,6),(3,7),(4,5),(5,7),(7,6),(6,4),(4,3),(5,1),(7,0)]
However, it is not a good idea to solve this with a brute-force approach. If we assume an average branching factor of ~6 moves, then for a 6×6 board we already have about 6^36 ≈ 1.031×10^28 possible sequences to examine.
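The estimate is easy to reproduce. A minimal Python sketch, assuming (as above) a constant branching factor and one choice per square; the function name naive_sequences is made up for this illustration:

```python
def naive_sequences(n, branching=6):
    """Rough estimate: `branching` choices for each of the n*n squares."""
    return branching ** (n * n)

for n in (5, 6, 8):
    print(f"{n}x{n}: about {naive_sequences(n):.3e} sequences")
```

This is only an order-of-magnitude argument, since the real branching factor shrinks as squares get used up, but it shows how quickly the search space grows with the board size.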
It is better to work with a divide and conquer approach. It is easy to split a board like 8×8 into four 4×4 boards. Then you determine places where you can hop from one board to another, and then you solve the subproblems for a 4×4 board. For small boards, you can easily store the solutions to go from any square to any other square on a 4×4 board, and then reuse these for all quadrants, so you save computational effort, by not calculating this a second time, especially since you do not need to store symmetrical queries multiple times. If you know how to go from (1,0) to (2,3) on a 4×4 board, you can easily use this to go from (3,0) to (2,3) on the same board, just by mirroring this.

Fractal dimension algorithms gives results of >2 for time-series

I'm trying to compute the fractal dimension of a very specific time-series array.
I've found an implementation of the Higuchi FD algorithm:
import numpy as np

def hFD(a, k_max): #Higuchi FD
    L = []
    x = []
    N = len(a)
    for k in range(1,k_max):
        Lk = 0
        for m in range(0,k):
            #we pregenerate all idxs
            idxs = np.arange(1,int(np.floor((N-m)/k)),dtype=np.int32)
            Lmk = np.sum(np.abs(a[m+idxs*k] - a[m+k*(idxs-1)]))
            Lmk = (Lmk*(N - 1)/(((N - m)/ k)* k)) / k
            Lk += Lmk
        L.append(np.log(Lk/(m+1)))
        x.append([np.log(1.0/ k), 1])
    (p, r1, r2, s)=np.linalg.lstsq(x, L)
    return p[0]
from https://github.com/gilestrolab/pyrem/blob/master/src/pyrem/univariate.py
and Katz FD algorithm:
def katz(data):
    n = len(data)-1
    L = np.hypot(np.diff(data), 1).sum() # Sum of distances
    d = np.hypot(data - data[0], np.arange(len(data))).max() # furthest distance from first point
    return np.log10(n) / (np.log10(d/L) + np.log10(n))
from https://github.com/ProjectBrain/brainbits/blob/master/katz.py
I expect results of ~1.5 in both cases, but I get 2.2 and 4 instead:
hFD(x,4) = 2.23965648024 (the value of k here is chosen as an example; the result does not change much in the range 4-12. Edit: I was able to get a result of ~1.9 with k=22, but this still does not make any sense);
katz(x) = 4.03911343057
In theory, this should not be possible for a 1D time-series array.
My questions are: are the Higuchi and Katz algorithms unsuitable for time-series analysis in general, or am I doing something wrong on my side? Also, are there any other Python libraries with existing, error-free implementations that I could use to verify my results?
My array of interest (each element represents a point in time t, t+1, t+2, ..., t+N):
x = np.array([373.4413096546802, 418.58026161917803,
395.7387698762124, 416.21163042783206,
407.9812265426947, 430.2355284504048,
389.66095393296763, 442.18969320408166,
383.7448638776275, 452.8931822090381,
413.5696828065546, 434.45932712853585
,429.95212301648996, 436.67612861616215,
431.10235365546964, 418.86935850068545,
410.84902747247423, 444.4188867775925,
397.1576881118471, 451.6129904245434,
440.9181246439599, 438.9857353268666,
437.1800408012741, 460.6251405281339,
404.3208481355302, 500.0432305427639,
380.49579242696177, 467.72953450552893,
333.11328535523967, 444.1171938340972,
303.3024198243042, 453.16332062153276,
356.9697406524534, 520.0720647379901,
402.7949987727925, 536.0721418821788,
448.21609036718445, 521.9137447208354,
470.5822486372967, 534.0572029633416,
480.03741443274765, 549.2104258193126,
460.0853321729541, 561.2705350421926,
444.52689144575794, 560.0835589548401,
462.2154563472787, 559.7166600213686,
453.42374550322353, 559.0591804941763,
421.4899935529862, 540.7970410737004,
454.34364779193913, 531.6018122709779,
437.1545739076901, 522.4262260216169,
444.6017030695873, 533.3991716674865,
458.3492761150962, 513.1735160522104])
The array for which you are trying to estimate the HFD is too short. You need to get a longer sample, or oversample the current one, so that you have at least 128 points for HFD and more than 4000 points for Katz:
import scipy.signal as signal
...
x_res=signal.resample(x,128)
hFD(x_res,4) will then be 1.74383694265

efficient way to remember array index of two large arrays

I have two Fortran arrays with 2 and 3 dimensions, say a(nx,ny) and b(nx,ny,nz). In array a, I need to find the points that satisfy a condition, say values > 0. Then I need to locate the vectors in array b that have the same x and y indices as those points in a. What is the easiest and fastest way to do this? The two arrays are big, and I don't want to search element by element. I hope I have explained my problem clearly. Thanks!
I'm not sure that this is the best method, but here's what I would do:
Put a where clause inside a do loop over the z-values. You can first get a 2D map of valid indices into a logical array if you don't want to recalculate the points every time:
program indices
    implicit none
    integer, parameter :: nx = 3000, ny = 400, nz = 500
    integer, dimension(nx, ny) :: a
    integer, dimension(nx, ny, nz) :: b
    logical, dimension(nx, ny) :: valid_points
    integer :: x, y, z

    do y = 1, ny
        do x = 1, nx
            a(x, y) = x - y
        end do
    end do

    valid_points = (a > 0)

    do z = 1, nz
        where(valid_points)
            b(:, :, z) = z
        else where
            b(:, :, z) = 0
        end where
    end do
end program indices
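If the matching points are sparse, an alternative to the logical mask is to scan a once and remember the (x, y) index pairs themselves. The following is only a small Python sketch of that idea, with nested lists standing in for the Fortran arrays and with tiny made-up sizes and fill values; it is not a performance claim about Fortran.

```python
nx, ny, nz = 4, 3, 2

# a[x][y] and b[x][y][z] as nested lists (0-based, unlike Fortran).
a = [[x - y for y in range(ny)] for x in range(nx)]
b = [[[100 * x + 10 * y + z for z in range(nz)] for y in range(ny)]
     for x in range(nx)]

# Scan a once and remember the (x, y) pairs that satisfy the condition.
valid = [(x, y) for x in range(nx) for y in range(ny) if a[x][y] > 0]

# Reuse the remembered pairs to pick out the matching z-vectors of b.
vectors = {(x, y): b[x][y] for (x, y) in valid}
```

The list valid can be reused for every later pass over b, so a is scanned only once.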

Mapping elements in 3D lower "triangle" to linear structure

This is the 3D version of an existing question.
A 3D array M[x,y,z] of shape (n,n,n) should be mapped to a flat vector containing only the elements with x<=y<=z in order to save space. So what I need is an expression similar to the 2D case (index := x + (y+1)*y/2). I tried to derive some formulas but just can't get it right. Note that the element order inside the vector doesn't matter.
This is an extension of user3386109's answer for mapping an array of arbitrary dimension d with shape (n,...,n) into a vector of size size(d,n) only containing the elements whose indices satisfy X_1 <= X_2 <= ... <= X_d.
The 3D version of the equation is
index := (z * (z+1) * (z+2)) / 6 + (y * (y+1))/2 + x
In case someone is interested, here is the code of @letmaik's answer in Python:
import math
from itertools import combinations_with_replacement
import numpy as np

ndim = 3 # The one you'd like
size = 4 # The size you'd like
array = np.ones([size for _ in range(ndim)]) * -1
indexes = combinations_with_replacement([n for n in range(size)], ndim)

def index(*args):
    acc = []
    for idx, val in enumerate(args):
        rx = np.prod([val + i for i in range(idx + 1)])
        acc.append(rx / math.factorial(idx + 1))
    return sum(acc)

for args in indexes:
    array[args] = index(*args)
print(array)
Although I must confess it could be improved, as the order of the elements does not seem natural.
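One way to check that such a formula wastes no space is to confirm that it hits every slot of the flat vector exactly once. Here is a minimal sketch using only the standard library (math.comb needs Python 3.8+); the names simplex_index and simplex_size are made up for this example, and each term is written as a binomial coefficient, which equals the product-over-factorial form used above.

```python
from itertools import combinations_with_replacement
from math import comb

def simplex_index(*coords):
    """Flat index for coords with coords[0] <= coords[1] <= ... (0-based)."""
    return sum(comb(v + k, k + 1) for k, v in enumerate(coords))

def simplex_size(ndim, n):
    """Number of index tuples X_1 <= ... <= X_d with entries in range(n)."""
    return comb(n + ndim - 1, ndim)

ndim, n = 3, 4
indices = [simplex_index(*c)
           for c in combinations_with_replacement(range(n), ndim)]
```

For ndim = 3 this is the same expression as x + y*(y+1)/2 + z*(z+1)*(z+2)/6, and sorted(indices) should come out as the consecutive integers from 0 to simplex_size(ndim, n) - 1.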

C variable assignment and R equivalent

Hi, I am trying to understand the following variable assignment in C and to rewrite it in R. I use R often but have only really glanced at C.
int age,int b_AF,int b_ra,int b_renal,int b_treatedhyp,int b_type2,double bmi,int ethrisk,int fh_cvd,double rati,double sbp,int smoke_cat,int surv,double town
)
{
    double survivor[3] = {
        0,
        0.996994316577911,
        0.993941843509674
    };
    a = /*pre assigned*/
    double score = 100.0 * (1 - pow(survivor[surv], exp(a)) );
    return(score);
}
How does survivor[surv] work in this context? An explanation would be helpful, and any input on how to do the assignment in R would be a bonus.
Thanks very much!
This is an aggregate initializer:
    double survivor[3] = {
        0,
        0.996994316577911,
        0.993941843509674
    };
and is equivalent to:
double survivor[3];
survivor[0] = 0;
survivor[1] = 0.996994316577911;
survivor[2] = 0.993941843509674;
and survivor[surv] is the value stored at index surv of the survivor array. Array indexes run from 0 to N - 1, so if surv is 1 then survivor[surv] has the value 0.996994316577911.
Note, the function as currently written does not check that surv is a valid index for the array survivor (i.e. surv > -1 and surv < 3) and runs the risk of undefined behaviour.
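To show what such a check could look like, here is a minimal sketch of the same calculation in Python, which is 0-based like C. The function name score and the idea of passing a as a parameter are assumptions made for illustration, since the original computes a elsewhere.

```python
import math

SURVIVOR = (0.0, 0.996994316577911, 0.993941843509674)

def score(surv, a):
    """100 * (1 - survivor[surv] ** exp(a)), with an explicit index check."""
    # Python would silently accept negative indices, so check both ends.
    if not 0 <= surv < len(SURVIVOR):
        raise IndexError(f"surv must be in 0..{len(SURVIVOR) - 1}, got {surv}")
    return 100.0 * (1.0 - SURVIVOR[surv] ** math.exp(a))
```

With this guard, an out-of-range surv fails loudly instead of reading past the end of the array.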
Given @hmjd's answer, the R equivalent would be
survivor <- c(0, 0.996994316577911, 0.993941843509674)
or if survivor already exists and you wish to assign into the first 3 elements:
survivor[1:3] <- c(0, 0.996994316577911, 0.993941843509674)
(Note R's indices are 1-based unlike C's 0-based ones.)
As for the extraction, the general idea is the same as with C, but the details matter:
R> survivor[0] ## 0 index returns an empty vector
numeric(0)
R> survivor[-1] ## negative index **drops** that element
[1] 0.9969943 0.9939418
R> survivor[10] ## positive outside length of vector returns NA
[1] NA
R> surv <- 2
R> survivor[surv] ## same holds for whatever surv contains
[1] 0.9969943
