For this question, I found that the answer is (c), but I can give an example to show that (c) is not correct. Which is the right answer?
Let r be a relation instance with schema R = (A, B, C, D). We define r1 = ‘select A,B,C from r’ and r2 = ‘select A, D from r’. Let s = r1 * r2 where * denotes natural join. Given that the decomposition of r into r1 and r2 is lossy, which one of the following is true?
(a) s is a subset of r
(b) r U s = r
(c) r is a subset of s
(d) r * s = s
If the answer is (c), consider the following example with a lossy decomposition of r into r1 and r2.
Table r
A B C D
1 10 100 1000
2 20 200 1000
3 20 200 1001
Table r1
A B C
1 10 100
2 20 200
Table r2
A D
2 1000
3 1001
Table s (natural join of r1 and r2)
A B C D
2 20 200 1000
So by this example the answer is not (c). But I can also give you an example where (c) is the answer.
What should be the answer?
Table r
A B C D
1 10 100 1000
1 20 200 2000
Table r1
A B C
1 10 100
1 20 200
Table r2
A D
1 1000
1 2000
Table s (natural join of r1 and r2)
A B C D
1 10 100 1000
1 20 200 1000
1 10 100 2000
1 20 200 2000
The decomposition is called "lossy" because we have lost the ability to recompose the original relation value using natural join. This lost ability manifests itself as extra rows appearing when we try the natural join. The underlying cause is that the decomposition did not fully retain any key of the original ({B}, {C} and {D} in the second example) in both tables of the decomposition. If any key of the original is fully retained in all components of a decomposition, then the decomposition isn't lossy. For instance, decomposing the second example into (A, B, C) and (B, D) instead would keep the key {B} in both components, and the natural join would reproduce r exactly.
Table r is:
A B C D
1 10 100 1000
2 20 200 1000
3 20 200 1001
r1 is:
r1 = ‘select A,B,C from r’
A B C
1 10 100
2 20 200
3 20 200
r2 is:
r2 = ‘select A, D from r’
A D
1 1000
2 1000
3 1001
s is:
s = r1 * r2
A B C D
1 10 100 1000
2 20 200 1000
3 20 200 1001
So effectively r is a subset of s. If the definition of r1 is ‘select A,B,C from r’, you simply can't remove a row from the result (as you did in your example) and still say that r1 complies with the definition; the same applies to r2, where you removed the first row. In general, every tuple of r survives the two projections and reappears when they are joined back, so r ⊆ s always holds: a lossy decomposition can only add spurious tuples, never lose original ones. Hence the answer is (c).
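If you want to check this mechanically, here is a minimal sketch in standard SQL (EXCEPT is MINUS in some dialects); it assumes r1 and r2 really are the projections defined in the question:
-- always empty: every tuple of r reappears in s = r1 natural join r2
SELECT A, B, C, D FROM r
EXCEPT
SELECT r1.A, r1.B, r1.C, r2.D FROM r1 JOIN r2 ON r1.A = r2.A;
-- the spurious tuples: non-empty exactly when the decomposition is lossy
SELECT r1.A, r1.B, r1.C, r2.D FROM r1 JOIN r2 ON r1.A = r2.A
EXCEPT
SELECT A, B, C, D FROM r;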
I have data of the following form for several categories and years. The data is too large for Import so I am using DirectQuery.
Id Cat1 Cat2 Cat3 Year Value
1 A X Q 2000 1
2 A X R 2000 2
3 A Y Q 2000 3
4 A Y R 2000 4
5 A X Q 2000 1
6 A X R 2000 2
7 A Y Q 2000 3
8 A Y R 2000 4
9 A X Q 2001 1
10 A X R 2001 2
11 A Y Q 2001 3
12 A Y R 2001 4
13 A X Q 2001 1
14 A X R 2001 2
15 A Y Q 2001 3
16 A Y R 2001 4
I would like to construct a pivot table similar to what can be done in Excel.
Cat1 Cat2 Cat3 2000 2001
A X Q 2 2
A X R 4 4
A Y Q 6 6
A Y R 8 8
I tried this with the Matrix visual by placing columns Cat1, Cat2, and Cat3 in Rows, Year in Columns, and Value in Values. Unfortunately, this produces a hierarchical view.
Cat1 2000 2001
A 20 20
X 6 6
Q 2 2
R 4 4
Y 14 14
Q 6 6
R 8 8
How do I get the simpler Excel pivot table view of the data instead of the hierarchical view?
I'm not sure if it's possible to get the row headers to repeat like in your example, but if you go to Format > Row headers > Stepped layout and toggle that off, then your matrix should change from the hierarchical view you are seeing to something much closer to the flat layout you want.
I have a number of columns that contain values between 0 and 4, like so:
ID Q1 Q2 Q3 Q4 Q5 Q6 ... Q30
0001 4 0 3 1 0 4 2
0002 0 2 1 2 0 3 1
0003 4 2 3 0 3 0 4
0004 1 4 2 4 1 1 3
I need to transform these values so that 4=0, 3=25, 2=50, 1=75 and 0=100.
So, in the transformed version my first row would show the following:
ID Q1 Q2 Q3 Q4 Q5 Q6 ... Q30
0001 0 100 25 75 100 0 50
If it were just one column, I would use a CASE expression:
case
when Q1=4 then 0
when Q1=3 then 25
when Q1=2 then 50
when Q1=1 then 75
when Q1=0 then 100
end Q1
Can I apply this to a range of columns instead of doing a separate case statement for each column?
Or is there a more efficient way of achieving the same outcome?
Writing the condition like this is more efficient, because it is a single arithmetic operation rather than a CASE with five checks:
SELECT 100 - (Q1 * 25) AS Q1
update yourtable
set Q1 = 100 - Q1 * 25,
Q2 = 100 - Q2 * 25,
Q3 = 100 - Q3 * 25,
Q4 = 100 - Q4 * 25
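If you only need the transformed values on the way out rather than an in-place update, the same arithmetic works in a plain SELECT; a sketch (column list abbreviated here - extend the same pattern through Q30):
SELECT ID,
100 - Q1 * 25 AS Q1,
100 - Q2 * 25 AS Q2,
100 - Q3 * 25 AS Q3
FROM yourtable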
Think about functions; something like this:
CREATE FUNCTION COL_EVAL(@Q1 int) RETURNS INT
AS
BEGIN
RETURN case
when @Q1=4 then 0
when @Q1=3 then 25
when @Q1=2 then 50
when @Q1=1 then 75
when @Q1=0 then 100
end
END
And then use it as:
SELECT COL_EVAL(Q1) as Q1, COL_EVAL(Q2)....
P.S. I have no server at hand to check the syntax...
I have the following table:
DATA:
Lines <- " ID MeasureX MeasureY x1 x2 x3 x4 x5
1 1 1 1 1 1 1 1
2 1 1 0 1 1 1 1
3 1 1 1 2 3 3 3"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
What I would like to achieve is:
Create 5 columns (r1-r5), each the division of the corresponding column x1-x5 by MeasureX (e.g. x1/MeasureX, x2/MeasureX, etc.)
Create 5 columns (p1-p5), each the division of the corresponding column x1-x5 by its position 1-5 (e.g. x1/1, x2/2, etc.)
MeasureY is irrelevant for now; the end product would be ID plus columns r1-r5 and p1-p5. Is this feasible?
In SAS i would go with something like this:
data test6;
set test5;
array x {5} x1- x5;
array r{5} r1 - r5;
array p{5} p1 - p5;
do i=1 to 5;
r{i} = x{i}/MeasureX;
p{i} = x{i}/(i);
end;
The reason is to keep it dynamic, because the number of columns could change in the future.
Argument recycling allows you to do element-wise division with a constant vector. The tricky part was extracting the digits from the column names; I then repeated each digit by the number of rows to do the second division task.
DF[ ,paste0("r", 1:5)] <- DF[ , grep("x", names(DF) )]/ DF$MeasureX
DF[ ,paste0("p", 1:5)] <- DF[ , grep("x", names(DF) )]/ # element-wise division
rep( as.numeric( sub("\\D","",names(DF)[ # remove non-digits
grep("x", names(DF))] #returns only 'x'-cols
) ), each=nrow(DF) ) # make them as long as needed
#-------------
> DF
ID MeasureX MeasureY x1 x2 x3 x4 x5 r1 r2 r3 r4 r5 p1 p2 p3 p4 p5
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.5 0.3333333 0.25 0.2
2 2 1 1 0 1 1 1 1 0 1 1 1 1 0 0.5 0.3333333 0.25 0.2
3 3 1 1 1 2 3 3 3 1 2 3 3 3 1 1.0 1.0000000 0.75 0.6
This could be greatly simplified if you already knew the sequence vector for the second division task would be 1-5, but it was designed to allow "gaps" in the column-name sequence and still use the digit information in the names as the divisor. (You were not entirely clear about which situations this code would be used in.) The r{1-5} construct in SAS is mimicked by [ , paste0('r', 1:5)]. SAS is a macro language, and experienced SAS users sometimes have trouble figuring out how to make R behave like one. Generally it takes a while to lose the for-loop mentality and begin using R as a functional language.
An alternative with the data.table package (note that grouping by ID makes each .SD a single row, so .SD / 1:5 divides the five x columns by 1-5, and each row is divided by its own MeasureX):
cols <- names(DF[c(4:8)])
library(data.table)
setDT(DF)[, (paste0("r",1:5)) := .SD / MeasureX, by = ID, .SDcols = cols
][, (paste0("p",1:5)) := .SD / 1:5, by = ID, .SDcols = cols]
which results in:
> DF
ID MeasureX MeasureY x1 x2 x3 x4 x5 r1 r2 r3 r4 r5 p1 p2 p3 p4 p5
1: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.5 0.3333333 0.25 0.2
2: 2 1 1 0 1 1 1 1 0 1 1 1 1 0 0.5 0.3333333 0.25 0.2
3: 3 1 1 1 2 3 3 3 1 2 3 3 3 1 1.0 1.0000000 0.75 0.6
You could put together a nifty loop or apply to do this, but here it is explicitly:
# Handling the "r" columns.
DF$r1 <- DF$x1 / DF$MeasureX
DF$r2 <- DF$x2 / DF$MeasureX
DF$r3 <- DF$x3 / DF$MeasureX
DF$r4 <- DF$x4 / DF$MeasureX
DF$r5 <- DF$x5 / DF$MeasureX
# Handling the "p" columns.
DF$p1 <- DF$x1 / 1
DF$p2 <- DF$x2 / 2
DF$p3 <- DF$x3 / 3
DF$p4 <- DF$x4 / 4
DF$p5 <- DF$x5 / 5
# Taking only the columns we want.
FinalDF <- DF[, c("ID", "r1", "r2", "r3", "r4", "r5", "p1", "p2", "p3", "p4", "p5")]
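For reference, the loop version mentioned above could look like this; a sketch that assumes the same DF and x1-x5 naming as in the question:
# Loop over the column index, mirroring the SAS array logic.
for (i in 1:5) {
  DF[[paste0("r", i)]] <- DF[[paste0("x", i)]] / DF$MeasureX
  DF[[paste0("p", i)]] <- DF[[paste0("x", i)]] / i
}
FinalDF <- DF[, c("ID", paste0("r", 1:5), paste0("p", 1:5))]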
Just noting that this is pretty straightforward matrix manipulation that you definitely could have found elsewhere. Perhaps you're new to R, but still, put a little more effort in next time. If you are new to R, it's definitely worth the time to look up a basic R coding tutorial or video.
For a given data frame, I would like to multiply a column of the data frame by values from an array. The data frame consists of rows containing a name, a numerical value, and two factor values:
name credit gender group
n1 10 m A
n2 20 f B
n3 30 m A
n4 40 m B
n5 50 f C
This data frame can be generated using the commands:
name <- c('n1','n2','n3','n4','n5')
credit <- c(10,20,30,40,50)
gender <- c('m','f','m','m','f')
group <- c('A','B','A','B','C')
DF <-data.frame(cbind(name,credit,gender,group))
# binds columns together and uses it as a data frame
Additionally we have a matrix derived from the data frame (in more complex cases this will be an array). This matrix contains the sum value of all contracts that fall into a particular category (characterized by m/f and A/B/C):
m f
A 40 NA
B 40 20
C NA 50
The goal is to multiply the values in DF$credit by using the corresponding value assigned to each category in the matrix, e.g. the value 10 of the first row in DF would be multiplied by 40 (the category defined by m and A).
The result would look like:
name credit gender group result
n1 10 m A 400
n2 20 f B 400
n3 30 m A 1200
n4 40 m B 1600
n5 50 f C 2500
If possible, I would like to perform this using base R, but I am open to any helpful solution that works nicely.
You can construct a set of indices into derived (being your derived matrix) by making an index matrix out of DF$group and DF$gender. The reason the as.character is there is because DF$group and DF$gender are factors, whereas I just want character indices.
>idx = matrix( c(as.character(DF$group),as.character(DF$gender)),ncol=2)
>idx
[,1] [,2]
[1,] "A" "m"
[2,] "B" "f"
[3,] "A" "m"
[4,] "B" "m"
[5,] "C" "f"
>DF$result = DF$credit * derived[idx]
Note that with that last line, using the code you have above to generate DF, your numeric columns turn out as factors (i.e. DF$credit is a factor). In that case you need to do as.numeric(as.character(DF$credit)) * derived[idx] (plain as.numeric on a factor would give you the level codes, not the values). However, I imagine that in your actual data DF$credit is not a factor but numeric.
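In case it helps, the derived matrix itself can be built from DF with tapply; a minimal sketch in base R (the as.numeric(as.character(...)) wrapper guards against the factor problem just mentioned):
# Sum credit per (group, gender) cell; empty cells become NA.
derived <- tapply(as.numeric(as.character(DF$credit)),
                  list(DF$group, DF$gender), sum)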
When you create the data.frame object, don't use cbind, it's not necessary and it forces the credit variable to become a factor.
Just use DF <- data.frame(name, credit, gender, group)
Then run a for loop that goes through each row in your data.frame object.
n <- length(DF$credit)
result <- rep(0, n)
for(i in 1:n) {
result[i] <- DF$credit[i] * sum(DF$credit[DF$gender==DF$gender[i] & DF$group==DF$group[i]])
}
Replace your data.frame object with this new one that includes your results.
DF <- data.frame(name, credit, gender, group, result)
I recommend the plyr package, but you can do this using the base by function:
> by(DF, DF['name'], function (row) row$credit * m[as.character(row$group), as.character(row$gender)])
name: n1
[1] 400
---------------------------------------------------------------------
name: n2
[1] 400
---------------------------------------------------------------------
name: n3
[1] 1200
---------------------------------------------------------------------
name: n4
[1] 1600
---------------------------------------------------------------------
name: n5
[1] 2500
plyr can give you the result as a data frame which is nice:
> ddply(DF, .(name), function (row) row$credit * m[as.character(row$group), as.character(row$gender)])
name V1
1 n1 400
2 n2 400
3 n3 1200
4 n4 1600
5 n5 2500
What is the best way of transcribing the following Transact-SQL code to Informix Dynamic Server (IDS) 9.40:
Objective: I need the first 50 orders with their respective order lines
select *
from (select top 50 * from orders) a inner join lines b
on a.idOrder = b.idOrder
My problem is with the subselect, because Informix does not allow the FIRST option in a subselect.
Any simple ideas?
The official answer would be 'Please upgrade from IDS 9.40 since it is no longer supported by IBM'. That is, IDS 9.40 is not a current version - and should (ideally) not be used.
Solution for IDS 11.50
Using IDS 11.50, I can write:
SELECT *
FROM (SELECT FIRST 10 * FROM elements) AS e
INNER JOIN compound_component AS a
ON e.symbol = a.element
INNER JOIN compound AS c
ON c.compound_id = a.compound_id
;
This is more or less equivalent to your query. Consequently, if you use a current version of IDS, you can write the query using almost the same notation as in Transact-SQL (using FIRST in place of TOP).
Solution for IDS 9.40
What can you do in IDS 9.40? Excuse me a moment...I have to run up my IDS 9.40.xC7 server (this fix pack was released in 2005; the original release was probably in late 2003)...
First problem - IDS 9.40 does not allow sub-queries in the FROM clause.
Second problem - IDS 9.40 does not allow 'FIRST n' notation in either of these contexts:
SELECT FIRST 10 * FROM elements INTO TEMP e;
INSERT INTO e SELECT FIRST 10 * FROM elements;
Third problem - IDS 9.40 doesn't have a simple ROWNUM.
So, to work around these, we can write the following, using a temporary table (we'll remove it later). The self-join counts, for each row e1, the rows e2 whose atomic number is no larger than e1's, so keeping only the rows where that count is at most 10 selects the first ten elements in atomic-number order:
SELECT e1.*
FROM elements AS e1, elements AS e2
WHERE e1.atomic_number >= e2.atomic_number
GROUP BY e1.atomic_number, e1.symbol, e1.name, e1.atomic_weight, e1.stable
HAVING COUNT(*) <= 10
INTO TEMP e;
SELECT *
FROM e INNER JOIN compound_component AS a
ON e.symbol = a.element
INNER JOIN compound AS c
ON c.compound_id = a.compound_id;
This produces the same answer as the single query in IDS 11.50. Can we avoid the temporary table? Yes, but it is more verbose:
SELECT e1.*, a.*, c.*
FROM elements AS e1, elements AS e2, compound_component AS a,
compound AS c
WHERE e1.atomic_number >= e2.atomic_number
AND e1.symbol = a.element
AND c.compound_id = a.compound_id
GROUP BY e1.atomic_number, e1.symbol, e1.name, e1.atomic_weight,
e1.stable, a.compound_id, a.element, a.seq_num,
a.multiplicity, c.compound_id, c.name
HAVING COUNT(*) <= 10;
Applying that to the original orders plus order lines example is left as an exercise for the reader.
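As a head start on that exercise, here is a sketch that assumes idOrder is the key defining "first" and that lines.idOrder references orders.idOrder (names taken from the question):
SELECT o1.idOrder
FROM orders AS o1, orders AS o2
WHERE o1.idOrder >= o2.idOrder
GROUP BY o1.idOrder
HAVING COUNT(*) <= 50
INTO TEMP first50;
-- join the first 50 order keys back to the orders and their lines
SELECT *
FROM orders AS a, first50 AS f, lines AS b
WHERE a.idOrder = f.idOrder
AND a.idOrder = b.idOrder;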
Relevant subset of schema for 'Table of Elements':
-- See: http://www.webelements.com/ for elements.
-- See: http://ie.lbl.gov/education/isotopes.htm for isotopes.
CREATE TABLE elements
(
atomic_number INTEGER NOT NULL UNIQUE CONSTRAINT c1_elements
CHECK (atomic_number > 0 AND atomic_number < 120),
symbol CHAR(3) NOT NULL UNIQUE CONSTRAINT c2_elements,
name CHAR(20) NOT NULL UNIQUE CONSTRAINT c3_elements,
atomic_weight DECIMAL(8,4) NOT NULL,
stable CHAR(1) DEFAULT 'Y' NOT NULL
CHECK (stable IN ('Y', 'N'))
);
CREATE TABLE compound
(
compound_id SERIAL NOT NULL PRIMARY KEY,
name VARCHAR(100) NOT NULL UNIQUE
);
-- The sequence number is used to order the components within a compound.
CREATE TABLE compound_component
(
compound_id INTEGER REFERENCES compound,
element CHAR(3) NOT NULL REFERENCES elements(symbol),
seq_num SMALLINT DEFAULT 1 NOT NULL
CHECK (seq_num > 0 AND seq_num < 20),
multiplicity INTEGER NOT NULL
CHECK (multiplicity > 0 AND multiplicity < 20),
PRIMARY KEY(compound_id, seq_num)
);
Output (on my sample database):
1 H Hydrogen 1.0079 Y 1 H 1 2 1 water
1 H Hydrogen 1.0079 Y 3 H 2 4 3 methane
1 H Hydrogen 1.0079 Y 4 H 2 6 4 ethane
1 H Hydrogen 1.0079 Y 5 H 2 8 5 propane
1 H Hydrogen 1.0079 Y 6 H 2 10 6 butane
1 H Hydrogen 1.0079 Y 11 H 2 5 11 ethanol
1 H Hydrogen 1.0079 Y 11 H 4 1 11 ethanol
6 C Carbon 12.0110 Y 2 C 1 1 2 carbon dioxide
6 C Carbon 12.0110 Y 3 C 1 1 3 methane
6 C Carbon 12.0110 Y 4 C 1 2 4 ethane
6 C Carbon 12.0110 Y 5 C 1 3 5 propane
6 C Carbon 12.0110 Y 6 C 1 4 6 butane
6 C Carbon 12.0110 Y 7 C 1 1 7 carbon monoxide
6 C Carbon 12.0110 Y 9 C 2 1 9 magnesium carbonate
6 C Carbon 12.0110 Y 10 C 2 1 10 sodium bicarbonate
6 C Carbon 12.0110 Y 11 C 1 2 11 ethanol
8 O Oxygen 15.9990 Y 1 O 2 1 1 water
8 O Oxygen 15.9990 Y 2 O 2 2 2 carbon dioxide
8 O Oxygen 15.9990 Y 7 O 2 1 7 carbon monoxide
8 O Oxygen 15.9990 Y 9 O 3 3 9 magnesium carbonate
8 O Oxygen 15.9990 Y 10 O 3 3 10 sodium bicarbonate
8 O Oxygen 15.9990 Y 11 O 3 1 11 ethanol
If I understand your question, you are having a problem with "TOP". Try using a Top-N query via ROWNUM (note that ROWNUM is Oracle syntax; as mentioned above, Informix itself does not have it, so treat this as the general pattern).
For example:
select *
from (SELECT *
FROM foo
where foo_id=[number]
order by foo_id desc)
where rownum <= 50
This will get you the top fifty results (because of the descending ORDER BY in the subquery).