Lossless Join Decomposition - database

I am studying for a test, and this is on the study guide sheet. This is not homework, and will not be graded.
Relation Schema R = (A,B,C,D,E)
Functional Dependencies = (AB->E, C->AD, D->B, E->C)
Is r1 = (A,C,D), r2 = (B,C,E) or
x1 = (A,C,D), x2 = (A,B,E) a lossless join decomposition, and why?

My relational algebra is horribly rusty, but here is how I remember it to go
If (r1 ∩ r2) -> (r1 - r2) or (r1 ∩ r2) -> (r2 - r1) is in F+ (i.e., follows from the given FDs), then you have a lossless decomposition.
r1 ∩ r2 = C
r1 - r2 = AD
C->AD is in functional dependencies => lossless
for x1 and x2
x1 ∩ x2 = A
x1 - x2 = CD
A->CD does not follow from the FDs
now check x2 - x1
x2 - x1 = BE
A->BE does not follow from the FDs either, therefore the decomposition is lossy
My reference is here; please check for horrible mistakes that I might have committed.

Here is my understanding: basically you look at your decompositions and determine whether the common attributes between the relations are a key of at least one of the relations.
So with R1 and R2, the only thing common between them is C. C is a key of R1, since you are given C -> AD. So it's lossless.
For X1 and X2, the only thing common is A, which by itself is a key of neither X1 nor X2 under the functional dependencies you are given.

Functional Dependencies = (AB->E, C->AD, D->B, E->C)
r1 = (A,C,D), r2 = (B,C,E) is lossless, as you can verify by performing the chase algorithm.
It can be seen that both tables agree on 'C' and the dependency C->AD is preserved in the table ACD.
x1 = (A,C,D), x2 = (A,B,E) is lossy, as you will conclude after performing the chase algorithm.
Alternatively, it can be seen that the two tables agree only on A, and no dependency lets A alone determine the remaining attributes of either table.
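For reference, here is a minimal Python sketch of the two-row chase (tableau) test described above; the function and variable names (chase_two and so on) are my own, and it only handles a decomposition into two relations:
def chase_two(r, r1, r2, fds):
    # one tableau row per relation: 'a' is the distinguished value,
    # ('b', i, A) is a unique non-distinguished value for row i, attribute A
    rows = [{A: ('a' if A in ri else ('b', i, A)) for A in r}
            for i, ri in enumerate((r1, r2))]
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            for i in range(len(rows)):
                for j in range(len(rows)):
                    if i != j and all(rows[i][A] == rows[j][A] for A in lhs):
                        for A in rhs:
                            # equate the two rhs values, preferring the distinguished 'a'
                            v = 'a' if 'a' in (rows[i][A], rows[j][A]) else rows[i][A]
                            if rows[i][A] != v or rows[j][A] != v:
                                rows[i][A] = rows[j][A] = v
                                changed = True
    # lossless iff some row has become all-distinguished
    return any(all(row[A] == 'a' for A in r) for row in rows)

R = {'A', 'B', 'C', 'D', 'E'}
fds = [({'A', 'B'}, {'E'}), ({'C'}, {'A', 'D'}), ({'D'}, {'B'}), ({'E'}, {'C'})]
print(chase_two(R, {'A', 'C', 'D'}, {'B', 'C', 'E'}, fds))  # True  (lossless)
print(chase_two(R, {'A', 'C', 'D'}, {'A', 'B', 'E'}, fds))  # False (lossy)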

As described here, decomposition of R into R1 and R2 is lossless if
1. Attributes(R1) U Attributes(R2) = Attributes(R)
2. Attributes(R1) ∩ Attributes(R2) ≠ Φ
3. The common attributes must be a key for at least one relation (R1 or R2)
EDIT
with the assumption that only the non-trivial cases are considered here, and I think OP intended that too (so that (2) holds under this non-trivial assumption):
e.g., we do not consider the trivial corner case where all tuples of R1 / R2 are unique, i.e., the empty set {} is a key (as #philipxy pointed out); in that case any decomposition is lossless and hence not interesting, since spurious tuples can't be created upon joining. The corner cases for which the decomposition can be lossless despite
Attributes(R1) ∩ Attributes(R2) = Φ are thus ruled out.
We can check the above conditions with the following Python code snippet:
def closure(s, fds):
    # compute the attribute closure s+ under the given FDs
    c = s
    for f in fds:
        l, r = f[0], f[1]
        if l.issubset(c):
            c = c.union(r)
    if s != c:
        c = closure(c, fds)
    return c

def is_superkey(s, rel, fds):
    c = closure(s, fds)
    print(f'({"".join(sorted(s))})+ = {"".join(sorted(c))}')
    return rel.issubset(c)  # c == rel

def is_lossless_decomp(r1, r2, r, fds):
    c = r1.intersection(r2)
    if r1.union(r2) != r:
        print('not lossless: R1 U R2 ≠ R!')
        return False
    if len(c) == 0:
        print('not lossless: no common attribute between R1 and R2!')
        return False
    if not is_superkey(c, r1, fds) and not is_superkey(c, r2, fds):
        print(f'not lossless: common attribute {"".join(c)} not a key in R1 or R2!')
        return False
    print('lossless decomposition!')
    return True
To convert the given FDs from their standard string form into a suitable data structure, we can use the following function:
import re

def process_fds(fds):
    # turn 'AB->E' style strings into [LHS attribute set, RHS attribute set] pairs
    pfds = []
    for fd in fds:
        fd = re.sub(r'\s+', '', fd)
        l, r = fd.split('->')
        pfds.append([set(list(l)), set(list(r))])
    return pfds
Now let's test with the above decompositions given:
R = {'A','B','C','D','E'}
fds = process_fds(['AB->E', 'C->AD', 'D->B', 'E->C'])
R1, R2 = {'A', 'C', 'D'}, {'B', 'C', 'E'}
is_lossless_decomp(R1, R2, R, fds)
# (C)+ = ACD
# lossless decomposition!
R1, R2 = {'A', 'C', 'D'}, {'A', 'B', 'E'}
is_lossless_decomp(R1, R2, R, fds)
# (A)+ = A
# not lossless: common attribute A not a key in R1 or R2!

Related

Fractal dimension algorithms give results of >2 for time-series

I'm trying to compute the fractal dimension of a very specific time-series array.
I've found an implementation of the Higuchi FD algorithm:
import numpy as np

def hFD(a, k_max):  # Higuchi FD
    L = []
    x = []
    N = len(a)
    for k in range(1, k_max):
        Lk = 0
        for m in range(0, k):
            # we pregenerate all idxs
            idxs = np.arange(1, int(np.floor((N - m) / k)), dtype=np.int32)
            Lmk = np.sum(np.abs(a[m + idxs * k] - a[m + k * (idxs - 1)]))
            Lmk = (Lmk * (N - 1) / (((N - m) / k) * k)) / k
            Lk += Lmk
        L.append(np.log(Lk / (m + 1)))
        x.append([np.log(1.0 / k), 1])
    (p, r1, r2, s) = np.linalg.lstsq(x, L)
    return p[0]
from https://github.com/gilestrolab/pyrem/blob/master/src/pyrem/univariate.py
and the Katz FD algorithm:
def katz(data):
    n = len(data) - 1
    L = np.hypot(np.diff(data), 1).sum()  # sum of distances between successive points
    d = np.hypot(data - data[0], np.arange(len(data))).max()  # furthest distance from the first point
    return np.log10(n) / (np.log10(d / L) + np.log10(n))
from https://github.com/ProjectBrain/brainbits/blob/master/katz.py
I expect results of ~1.5 in both cases, but I get 2.2 and 4 instead...
hFD(x, 4) = 2.23965648024 (the k value here is chosen as an example; the result won't change much in the range 4-12. Edit: I was able to get a result of ~1.9 with k=22, however this still does not make any sense);
katz(x) = 4.03911343057
This should not, in theory, be possible for a 1D time-series array.
My questions are: are the Higuchi and Katz algorithms unsuitable for time-series analysis in general, or am I doing something wrong on my side? Also, are there any other Python libraries with already implemented and error-free algorithms I could use to verify my results?
My array of interest (each element represents a point in time t, t+1, t+2, ..., t+N):
x = np.array([373.4413096546802, 418.58026161917803,
395.7387698762124, 416.21163042783206,
407.9812265426947, 430.2355284504048,
389.66095393296763, 442.18969320408166,
383.7448638776275, 452.8931822090381,
413.5696828065546, 434.45932712853585,
429.95212301648996, 436.67612861616215,
431.10235365546964, 418.86935850068545,
410.84902747247423, 444.4188867775925,
397.1576881118471, 451.6129904245434,
440.9181246439599, 438.9857353268666,
437.1800408012741, 460.6251405281339,
404.3208481355302, 500.0432305427639,
380.49579242696177, 467.72953450552893,
333.11328535523967, 444.1171938340972,
303.3024198243042, 453.16332062153276,
356.9697406524534, 520.0720647379901,
402.7949987727925, 536.0721418821788,
448.21609036718445, 521.9137447208354,
470.5822486372967, 534.0572029633416,
480.03741443274765, 549.2104258193126,
460.0853321729541, 561.2705350421926,
444.52689144575794, 560.0835589548401,
462.2154563472787, 559.7166600213686,
453.42374550322353, 559.0591804941763,
421.4899935529862, 540.7970410737004,
454.34364779193913, 531.6018122709779,
437.1545739076901, 522.4262260216169,
444.6017030695873, 533.3991716674865,
458.3492761150962, 513.1735160522104])
The array for which you are trying to estimate the Higuchi FD is too short. You need to get a longer sample or oversample the current one to have at least 128 points for the Higuchi FD and more than 4000 points for Katz:
import scipy.signal as signal
...
x_res = signal.resample(x, 128)
hFD(x_res, 4) will be 1.74383694265
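By the same logic, the Katz estimate should be given a much longer sample. A minimal sketch along the lines of the snippet above, reusing x and katz from this question (the 4096 figure is just an illustrative choice; I have not verified the resulting value):
import scipy.signal as signal

x_res_long = signal.resample(x, 4096)  # oversample well past the suggested 4000 points
katz(x_res_long)                       # Katz estimate on the longer sample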

Interleave and Deinterleave a vector into two new vectors

Interleaver: Assume we have a vector X = randi(1,N). I would like to split the contents of X into two new vectors X1 and X2 such that the first element of X1 is the first element of X, the first element of X2 is the second element of X, the second element of X1 is the third element of X, the second element of X2 is the fourth element of X, ... and so on until the last element of the vector X.
I have the following idea
X1(1)=X(1);
X2(1)=X(2);
for i=1:length(X)
    X1(i)= X(i+2);
end
for j=2:length(X)
    X2(i)= X(i+2)
end
My question is: is my method correct? Is there a better way to do it?
Deinterleaver
I also have the reverse problem: in this case I have X1 and X2 and would like to recover X. How would I efficiently recover X?
I think the terminology in this question is reversed. Interleaving would be to merge two vectors alternating their values:
x1 = 10:10:100;
x2 = 1:1:10;
x = [x1;x2];
x = x(:).';
This is the same as the one-liner:
x = reshape([x1;x2],[],1).';
Deinterleaving would be to separate the interleaved data, as already suggested by David in a comment and Tom in an answer:
y1 = x(1:2:end);
y2 = x(2:2:end);
but can also be done in many other ways, for example inverting the process we followed above:
y = reshape(x,2,[]);
y1 = y(1,:);
y2 = y(2,:);
To verify:
isequal(x1,y1)
isequal(x2,y2)
I was hoping for some cool new one-liner as well, but anyway, following the previous answer you can use the same indexing expressions for the assignment.
x = 1:20
x1 = x(1:2:end)
x2 = x(2:2:end)
y = zeros(20,1)
y(1:2:end) = x1
y(2:2:end) = x2
I think it's hard to get a cleaner solution than this:
x = 1:20
x1 = x(1:2:end)
x2 = x(2:2:end)
Just to add another option, you could use the deal function and some precomputed indices. This is basically the same as the answer from Peter M, but collecting the assignments into single lines:
X = randi(10, [1 20]); % Sample data
ind1 = 1:2:numel(X); % Indices for x1
ind2 = 2:2:numel(X); % Indices for x2
[x1, x2] = deal(X(ind1), X(ind2)); % Unweave (i.e. deinterleave)
[X(ind1), X(ind2)] = deal(x1, x2); % Interleave

Can the xor-swap be extended to more than two variables?

I've been trying to extend the xor-swap to more than two variables, say n variables. But I've gotten nowhere that's better than 3*(n-1).
For two integer variables x1 and x2 you can swap them like this:
swap(x1,x2) {
    x1 = x1 ^ x2;
    x2 = x1 ^ x2;
    x1 = x1 ^ x2;
}
So, assume you have x1 ... xn with values v1 ... vn. Clearly you can "rotate" the values by successively applying swap:
swap(x1,x2);
swap(x2,x3);
swap(x3,x4);
...
swap(xm,xn); // with m = n-1
You will end up with x1 = v2, x2 = v3, ..., xn = v1.
Which costs n-1 swaps, each costing 3 xors, leaving us with (n-1)*3 xors.
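To make the counting concrete, here is a small Python sketch (the function name xor_rotate is my own) that performs the rotation using only xor-assignments and counts them:
def xor_rotate(v):
    # rotate the list left by one position in place, using only x ^= y steps
    xors = 0
    for i in range(len(v) - 1):
        # xor-swap v[i] and v[i+1]: three xor-assignments
        v[i] ^= v[i + 1]
        v[i + 1] ^= v[i]
        v[i] ^= v[i + 1]
        xors += 3
    return xors

v = [10, 20, 30, 40, 50]
print(xor_rotate(v), v)  # 12 [20, 30, 40, 50, 10]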
Is a faster algorithm using xor and assignment only and no additional variables known?
As a partial result I tried a brute force search for N=3,4,5 and all of these agree with your formula.
Python code:
from collections import defaultdict, deque

D = defaultdict(int)  # map from tuple of bitmasks to number of steps to get there
N = 5
Q = deque()
Q.append((tuple(1 << n for n in range(N)), 0))
goal = tuple(1 << ((n + 1) % N) for n in range(N))
while Q:
    masks, ops = Q.popleft()
    if len(D) % 10000 == 0:
        print(len(D), len(Q), ops)
    ops += 1
    # choose a pair (a, b) and apply the xor-assignment x_a ^= x_b
    for a in range(N):
        for b in range(N):
            if a == b:
                continue
            masks2 = list(masks)
            masks2[a] = masks2[a] ^ masks2[b]
            masks2 = tuple(masks2)
            if masks2 in D:
                continue
            D[masks2] = ops
            if masks2 == goal:
                print('found goal in', ops)
                raise ValueError
            Q.append((masks2, ops))

BCNF Decomposition, when to stop decomposing?

I'm having difficulties understanding BCNF decomposition.
If I have:
R=(A,B,C)
FDs: AB -> C, C -> B
Computing the closure, I have concluded that the minimal keys are {AB} and {AC}.
Therefore,
AB -> C is NOT a BCNF violation because AB is a key.
C -> B IS a violation because C is not a superkey.
I decompose on C -> B like this:
R1 = Closure of C = (C,B)
R2 = (A,C)
I'm unsure how to proceed from here. If it needs to be further decomposed, what do I need to do? If I am supposed to stop here, how do you know when to stop decomposing?
Computing the closure, I have concluded that the minimal keys are {AB} and {AC}.
The candidate keys of R are {AB} and {AC}.
You decompose R into these two relations, and you identify all the candidate keys in each of those relations.
R1 {AB -> C}
R2 {C -> B}
The only candidate key to R1 is {AB}.
The only candidate key to R2 is {C}. The attribute {C} is not a key in R, but it is a key in R2.
R1 and R2 are where you stop. After the decomposition you identify the keys and the functional dependencies in the new relations. The key of R1 is C (with FD C -> B, no BCNF violation), and the key of R2 is AC (no BCNF violation either).
Write the closures for AB->C and C->B:
{A,B}+ = {A, B, C}
{C}+ = {C, B}
{A,B} is a superkey, hence AB->C does not violate BCNF. So we use the violating FD (C->B) to do the decomposition:
{A, B, C} - {B, C} = {A}
Then add the left side of C->B to {A}, which gives {A, C} and {B, C}.
Hence we decompose R(A, B, C) into R1(B, C) and R2(A, C).
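To double-check this mechanically, here is a minimal Python sketch (the helper name closure and the FD encoding are my own, along the lines of the closure code in the lossless-join question above) that flags FDs whose left side is not a superkey:
def closure(attrs, fds):
    # attribute closure of attrs under fds (a list of (lhs set, rhs set) pairs)
    c = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= c and not rhs <= c:
                c |= rhs
                changed = True
    return c

R = {'A', 'B', 'C'}
fds = [({'A', 'B'}, {'C'}), ({'C'}, {'B'})]
for lhs, rhs in fds:
    if not R <= closure(lhs, fds):  # lhs is not a superkey of R, so this FD violates BCNF
        print(''.join(sorted(lhs)), '->', ''.join(sorted(rhs)), 'violates BCNF')
# prints: C -> B violates BCNF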

R: Aggregate on Group 1 and NOT Group 2

I am trying to create two data sets, one which summarizes data by 2 groups which I have done using the following code:
x = rnorm(1:100)
g1 = sample(LETTERS[1:3], 100, replace = TRUE)
g2 = sample(LETTERS[24:26], 100, replace = TRUE)
aggregate(x, list(g1, g2), mean)
The second needs to summarize the data by the first group and NOT the second group.
If we consider the possible pairs from the previous example:
A - X B - X C - X
A - Y B - Y C - Y
A - Z B - Z C - Z
The second dataset should summarize the data as the average of the out-group:
A - not X
A - not Y
A - not Z etc.
Is there a way to manipulate aggregate functions in R to achieve this?
I also thought there could be a dummy variable that represents the data in this way, although I am unsure how it would look.
I have found this answer here:
R using aggregate to find a function (mean) for "all other"
I think this indicates that a dummy variable for each pairing is necessary. However, if anyone can offer a better or more efficient way, that would be appreciated, as there are many pairings in the true data set.
Thanks in advance
First let us generate the data reproducibly (using set.seed):
# same as question but added set.seed for reproducibility
set.seed(123)
x = rnorm(1:100)
g1 = sample(LETTERS[1:3], 100, replace = TRUE)
g2 = sample(LETTERS[24:26], 100, replace = TRUE)
Now we have two solutions both of which use aggregate:
1) ave
# x equals the sums over the groups and n equals the counts
ag = cbind(aggregate(x, list(g1, g2), sum),
           n = aggregate(x, list(g1, g2), length)[, 3])
ave.not <- function(x, g) ave(x, g, FUN = sum) - x
transform(ag,
          x = NULL, # don't need x any more
          n = NULL, # don't need n any more
          mean = x/n,
          mean.not = ave.not(x, Group.1) / ave.not(n, Group.1)
)
This gives:
Group.1 Group.2 mean mean.not
1 A X 0.3155084 -0.091898832
2 B X -0.1789730 0.332544353
3 C X 0.1976471 0.014282465
4 A Y -0.3644116 0.236706489
5 B Y 0.2452157 0.099240545
6 C Y -0.1630036 0.179833987
7 A Z 0.1579046 -0.009670734
8 B Z 0.4392794 0.033121335
9 C Z 0.1620209 0.033714943
To double check the first value under mean and under mean.not:
> mean(x[g1 == "A" & g2 == "X"])
[1] 0.3155084
> mean(x[g1 == "A" & g2 != "X"])
[1] -0.09189883
2) sapply
Here is a second approach which gives the same answer:
ag <- aggregate(list(mean = x), list(g1, g2), mean)
f <- function(i) mean(x[g1 == ag$Group.1[i] & g2 != ag$Group.2[i]])
ag$mean.not = sapply(1:nrow(ag), f)
ag
REVISED: revised based on comments by the poster, added a second approach, and made some minor improvements.
