Why do these join differently based on size? - arrays

In PostgreSQL, if you unnest two arrays of the same size, each value from one array is lined up with the corresponding value from the other; but if the two arrays are not the same size, each value from one ends up paired with every value from the other.
select unnest(ARRAY[1, 2, 3, 4, 5]::bigint[]) as id,
unnest(ARRAY['a', 'b', 'c', 'd', 'e']) as value
Will return
1 | "a"
2 | "b"
3 | "c"
4 | "d"
5 | "e"
But
select unnest(ARRAY[1, 2, 3, 4, 5]::bigint[]) as id, -- 5 elements
unnest(ARRAY['a', 'b', 'c', 'd']) as value -- 4 elements
order by id
Will return
1 | "a"
1 | "b"
1 | "c"
1 | "d"
2 | "b"
2 | "a"
2 | "c"
2 | "d"
3 | "b"
3 | "d"
3 | "a"
3 | "c"
4 | "d"
4 | "a"
4 | "c"
4 | "b"
5 | "d"
5 | "c"
5 | "b"
5 | "a"
Why is this? I assume some implicit rule is being applied, and I'd like to know if I can control it explicitly (e.g. if I want the second style even when the array sizes match, or if I want missing values in the shorter array to be treated as NULL).

Support for set-returning functions in SELECT is a PostgreSQL extension, and an IMO very weird one. It's broadly considered deprecated and best avoided where possible.
Avoid using SRF-in-SELECT where possible
Now that LATERAL is supported in 9.3, one of the two main uses is gone. It used to be necessary to use a set-returning function in SELECT if you wanted to use the output of one SRF as the input to another; that is no longer needed with LATERAL.
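For example, with LATERAL the output of one SRF can feed straight into another in the FROM clause; a minimal sketch (the array literal here is just made up for illustration):
select elem, n
from unnest(ARRAY[2, 3]) as t(elem),
     lateral generate_series(1, t.elem) as n;
This returns (2,1), (2,2), (3,1), (3,2), (3,3), with no SRF needed in the SELECT list.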
The other use will be replaced in 9.4, when WITH ORDINALITY is added, allowing you to preserve the output ordering of a set-returning function. That's currently the main remaining use: to do things like zip the output of two SRFs into a rowset of matched value pairs. WITH ORDINALITY is most anticipated for unnest, but works with any other SRF.
Why the weird output?
The logic that PostgreSQL is using here (for whatever IMO insane reason it was originally introduced back in ancient history) is: keep emitting rows until all the functions have run out of output at the same time. If one function runs out before the others, restart it and scan its output again to supply the remaining rows. Only when they all run out together does it stop emitting rows.
It's easier to see with generate_series.
regress=> SELECT generate_series(1,2), generate_series(1,2);
generate_series | generate_series
-----------------+-----------------
1 | 1
2 | 2
(2 rows)
regress=> SELECT generate_series(1,2), generate_series(1,3);
generate_series | generate_series
-----------------+-----------------
1 | 1
2 | 2
1 | 3
2 | 1
1 | 2
2 | 3
(6 rows)
regress=> SELECT generate_series(1,2), generate_series(1,4);
generate_series | generate_series
-----------------+-----------------
1 | 1
2 | 2
1 | 3
2 | 4
(4 rows)
In the majority of cases what you really want is a simple cross join of the two, which is a lot saner.
regress=> SELECT a, b FROM generate_series(1,2) a, generate_series(1,2) b;
a | b
---+---
1 | 1
1 | 2
2 | 1
2 | 2
(4 rows)
regress=> SELECT a, b FROM generate_series(1,2) a, generate_series(1,3) b;
a | b
---+---
1 | 1
1 | 2
1 | 3
2 | 1
2 | 2
2 | 3
(6 rows)
regress=> SELECT a, b FROM generate_series(1,2) a, generate_series(1,4) b;
a | b
---+---
1 | 1
1 | 2
1 | 3
1 | 4
2 | 1
2 | 2
2 | 3
2 | 4
(8 rows)
The main exception at the moment is when you want to run multiple functions in lock-step, pairwise (like a zip), which you cannot currently do with joins.
WITH ORDINALITY
This will be improved in 9.4 with WITH ORDINALITY, and while it'll be a bit less efficient than a multiple-SRF scan in SELECT (unless optimizer improvements are added) it'll be a lot saner.
Say you wanted to pair up 1..3 and 1..4 with nulls for the excess elements. Using WITH ORDINALITY that'd be (PostgreSQL 9.4 only):
regress=# SELECT aval, bval
FROM generate_series(1,3) WITH ORDINALITY a(aval,apos)
RIGHT OUTER JOIN generate_series(1,4) WITH ORDINALITY b(bval, bpos)
ON (apos=bpos);
aval | bval
------+------
1 | 1
2 | 2
3 | 3
| 4
(4 rows)
whereas the SRF-in-SELECT approach would instead return:
regress=# SELECT generate_series(1,3) aval, generate_series(1,4) bval;
aval | bval
------+------
1 | 1
2 | 2
3 | 3
1 | 4
2 | 1
3 | 2
1 | 3
2 | 4
3 | 1
1 | 2
2 | 3
3 | 4
(12 rows)
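Applied to the arrays from the original question, a hedged sketch of the same NULL-padded pairing (9.4 or later) would be:
select a.id, b.value
from unnest(ARRAY[1, 2, 3, 4, 5]::bigint[]) with ordinality as a(id, pos)
full outer join unnest(ARRAY['a', 'b', 'c', 'd']) with ordinality as b(value, pos)
  on a.pos = b.pos
order by coalesce(a.pos, b.pos);
which yields (1,a), (2,b), (3,c), (4,d), (5,NULL). On 9.4 and later the multi-argument form of unnest in FROM, e.g. unnest(array1, array2), gives the same NULL-padded zip in a single call.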


What is the maximum number of tuples that can be returned by natural join?

Consider that relation R(A,B,C) contains 200 tuples and relation S(A,D,E) contains 100 tuples. What is the maximum number of tuples possible in a natural join of R and S?
Select one:
A. 300
B. 200
C. 100
D. 20000
It will be great if the answer is provided with some explanation.
The maximum number of tuples possible in natural join will be 20000.
You can look up exactly what a natural join is elsewhere.
Let us check for the given example:
Let the table R(A,B,C) be in the given format:
A | B | C
---------------
1 | 2 | 4
1 | 6 | 8
1 | 5 | 7
and the table S(A,D,E) be in the given format:
A | D | E
---------------
1 | 2 | 4
1 | 6 | 8
Here, the result of natural join will be:
A | B | C | D | E
--------------------------
1 | 2 | 4 | 2 | 4
1 | 2 | 4 | 6 | 8
1 | 6 | 8 | 2 | 4
1 | 6 | 8 | 6 | 8
1 | 5 | 7 | 2 | 4
1 | 5 | 7 | 6 | 8
Thus we can see the resulting table has 3*2=6 rows. This is the maximum possible value because both the input tables have the same single value in column A (1).
A natural join returns every tuple that can be formed by combining a tuple from one input relation with a tuple from the other, provided they agree on the common attributes. If all tuples share a single value on the common attributes, and the non-common parts are unique within each relation, then every pairing produces a distinct result tuple, and no more than that is possible. So the maximum number of tuples is the product of the tuple counts of the two relations.
Here that's option D: 20000.
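As a quick sanity check, here's a hedged sketch that reproduces the worked example above (the table and column names are made up for illustration):
create table r (a int, b int, c int);
create table s (a int, d int, e int);
insert into r values (1, 2, 4), (1, 6, 8), (1, 5, 7);
insert into s values (1, 2, 4), (1, 6, 8);
select * from r natural join s;  -- joins only on the shared column a: 3 * 2 = 6 rows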
A is present in both R and S, so according to the natural join, 100 tuples take part in the join process.
Option C (100) is the answer.

Using Arithmetic in SQL on my own columns to fill a third column where it is zero (complicated, only when certain criteria are met)

So here is my question. Brace yourself as it takes some thinking just to wrap your head around what I am trying to do.
I'm working with Quarterly Census of Employment and Wages (QCEW) data. QCEW data has something called suppression codes. If a data denomination (the data comes as overall, location quotient, and over-the-year figures, for each year and each quarter) is suppressed, then all the data for that denomination is zero. I have my table set up in the following way (only showing you the columns that are relevant for the question):
A County_Id column,
Industry_ID column,
Year column,
Qtr column,
Suppressed column (0 for not suppressed and 1 for suppressed),
Data_Category column (1 for overall, 2 for lq, and 3 for over the year),
Data_Denomination column (goes 1-8 for what specific data is being looked at in that category ex: monthly employment,Taxable wage, etc. typical data),
and a value column (which will be zero if the Data_Category is suppressed - since all the data denomination values will be zero).
Now, if overall data (cat 1) for, say, 1991 quarter 1 is suppressed, but the next year's quarter 1 has both overall and over-the-year (cats 1 and 3) NOT suppressed, then we can infer what the value would be for that first year's suppressed data, since OTY1991q1 = (Overall1991q1 - Overall1990q1). So to find that suppressed data we would just subtract our cat 1 (denom 1-8) values from our cat 3 (denom 1-8) values to fill in the zeroes that are in our suppressed values from the year before. It's fairly easy to grasp mathematically; the difficulty is that there are millions of rows to check against these criteria. I'm trying to write some kind of SQL query that would do this for me: check that Overall for a given quarter is suppressed, then look to see whether the next year is unsuppressed for both overall and OTY (maybe in some sort of complicated CASE statement?), and if those criteria are met, perform the arithmetic for the two Data_Cat/Data_Denom combinations and replace the zero in the respective Cat/Denom values.
Below is a simple sample (non-relevant data_cats removed) that I hope will help get what I'm trying to do across.
|CountyID IndustryID Year Qtr Suppressed Data_Cat Data_Denom Value
| 5 10 1990 1 1 1 1 0
| 5 10 1990 1 1 1 2 0
| 5 10 1990 1 1 1 3 0
| 5 10 1991 1 0 1 1 5
| 5 10 1991 1 0 1 2 15
| 5 10 1991 1 0 1 3 25
| 5 10 1991 1 0 3 1 20
| 5 10 1991 1 0 3 2 20
| 5 10 1991 1 0 3 3 35
So basically what we're trying to do here is take, for each data_denom (narrowed down from 8 to 3 for simplicity; I also removed lq ~ data_cat 2 because it isn't relevant), the 1991 overall value and subtract it from the 1991 over-the-year value, which gives the applicable value for the previous year's (1990) cat 1. So here data_cat 1, data_denom 1 would be 15 (20-5), denom 2 would be 5 (20-15), and denom 3 would be 10 (35-25). (OTY 1991q1 - Overall 1991q1) = 1990q1. I hope this helps. Like I said, the problem isn't the math, it's formulating a query that will check these criteria millions and millions of times.
If you want to find suppressed data that has two rows of unsuppressed data for the next year and same quarter, we could use cross apply() to do something like this:
test setup: http://rextester.com/ORNCFR23551
using cross apply() to return rows with a valid derived value:
select t.*
, NewValue = cat3.value - cat1.value
from t
cross apply (
select i.value
from t as i
where i.CountyID = t.CountyID
and i.IndustryID = t.IndustryID
and i.Data_Denom = t.Data_Denom
and i.Year = t.Year +1
and i.Qtr = t.Qtr
and i.Suppressed = 0
and i.Data_Cat = 1
) cat1
cross apply (
select i.value
from t as i
where i.CountyID = t.CountyID
and i.IndustryID = t.IndustryID
and i.Data_Denom = t.Data_Denom
and i.Year = t.Year +1
and i.Qtr = t.Qtr
and i.Suppressed = 0
and i.Data_Cat = 3
) cat3
where t.Suppressed = 1
and t.Data_Cat = 1
returns:
+----------+------------+------+-----+------------+----------+------------+-------+----------+
| CountyID | IndustryID | Year | Qtr | Suppressed | Data_Cat | Data_Denom | Value | NewValue |
+----------+------------+------+-----+------------+----------+------------+-------+----------+
| 5 | 10 | 1990 | 1 | 1 | 1 | 1 | 0 | 15 |
| 5 | 10 | 1990 | 1 | 1 | 1 | 2 | 0 | 5 |
| 5 | 10 | 1990 | 1 | 1 | 1 | 3 | 0 | 10 |
+----------+------------+------+-----+------------+----------+------------+-------+----------+
Using outer apply() to return all rows
select t.*
, NewValue = coalesce(nullif(t.value,0),cat3.value - cat1.value,0)
from t
outer apply (
select i.value
from t as i
where i.CountyID = t.CountyID
and i.IndustryID = t.IndustryID
and i.Data_Denom = t.Data_Denom
and i.Year = t.Year +1
and i.Qtr = t.Qtr
and i.Suppressed = 0
and i.Data_Cat = 1
) cat1
outer apply (
select i.value
from t as i
where i.CountyID = t.CountyID
and i.IndustryID = t.IndustryID
and i.Data_Denom = t.Data_Denom
and i.Year = t.Year +1
and i.Qtr = t.Qtr
and i.Suppressed = 0
and i.Data_Cat = 3
) cat3
returns:
+----------+------------+------+-----+------------+----------+------------+-------+----------+
| CountyID | IndustryID | Year | Qtr | Suppressed | Data_Cat | Data_Denom | Value | NewValue |
+----------+------------+------+-----+------------+----------+------------+-------+----------+
| 5 | 10 | 1990 | 1 | 1 | 1 | 1 | 0 | 15 |
| 5 | 10 | 1990 | 1 | 1 | 1 | 2 | 0 | 5 |
| 5 | 10 | 1990 | 1 | 1 | 1 | 3 | 0 | 10 |
| 5 | 10 | 1991 | 1 | 0 | 1 | 1 | 5 | 5 |
| 5 | 10 | 1991 | 1 | 0 | 1 | 2 | 15 | 15 |
| 5 | 10 | 1991 | 1 | 0 | 1 | 3 | 25 | 25 |
| 5 | 10 | 1991 | 1 | 0 | 3 | 1 | 20 | 20 |
| 5 | 10 | 1991 | 1 | 0 | 3 | 2 | 20 | 20 |
| 5 | 10 | 1991 | 1 | 0 | 3 | 3 | 35 | 35 |
+----------+------------+------+-----+------------+----------+------------+-------+----------+
Ok, I think I get it.
If you're just wanting to make that one inference, then the following may help. (If this is just the first of many inferences you want to make in filling data gaps, you may find that a different method leads to a more efficient solution for doing both/all of them, but I guess cross that bridge when you get there...)
While much of the basic logic stays the same, how you'd tweak it depends on whether you want a query just to provide the values you would infer (e.g. to drive an UPDATE statement), or whether you want to use this logic inline in a bigger query. For performance reasons, I suspect the former makes more sense (especially if you can do the update once and then read the resulting dataset many times), so I'll start by framing things that way and come back to the other in a moment...
It sounds like you have a single table (I'll call it QCEW) with all these columns. In that case, use joins to associate each suppressed overall datapoint (c_oa in the following code) with the corresponding overall and oty datapoints from a year later:
SELECT c_oa.*, n_oa.value - n_oty.value inferred_value
FROM QCEW c_oa --current yr/qtr overall
inner join QCEW n_oa --next yr (same qtr) overall
on c_oa.countyId = n_oa.countyId
and c_oa.industryId = n_oa.industryId
and c_oa.year = n_oa.year - 1
and c_oa.qtr = n_oa.qtr
and c_oa.data_denom = n_oa.data_denom
inner join QCEW n_oty --next yr (same qtr) over-the-year
on c_oa.countyId = n_oty.countyId
and c_oa.industryId = n_oty.industryId
and c_oa.year = n_oty.year - 1
and c_oa.qtr = n_oty.qtr
and c_oa.data_denom = n_oty.data_denom
WHERE c_oa.SUPPRESSED = 1
AND c_oa.DATA_CAT = 1
AND n_oa.SUPPRESSED = 0
AND n_oa.DATA_CAT = 1
AND n_oty.SUPPRESSED = 0
AND n_oty.DATA_CAT = 3
Now it sounds like the table is big, and we've just joined 3 instances of it; so for this to work you'll need good physical design (appropriate indexes/stats for join columns, etc.). And that's why I'd suggest doing an update based on the above query once; sure, it may run long, but then you can read the inferred values in no time.
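For what it's worth, a hedged sketch of that one-off UPDATE (SQL Server-style UPDATE ... FROM assumed, reusing the join logic from the query above; the subtraction is written the same way as that query, so swap the operands if your convention is OTY minus Overall as in the question's worked example):
UPDATE c_oa
SET c_oa.value = n_oa.value - n_oty.value   -- same expression as the SELECT above
FROM QCEW c_oa
INNER JOIN QCEW n_oa
    ON  c_oa.countyId = n_oa.countyId
    AND c_oa.industryId = n_oa.industryId
    AND c_oa.year = n_oa.year - 1
    AND c_oa.qtr = n_oa.qtr
    AND c_oa.data_denom = n_oa.data_denom
INNER JOIN QCEW n_oty
    ON  c_oa.countyId = n_oty.countyId
    AND c_oa.industryId = n_oty.industryId
    AND c_oa.year = n_oty.year - 1
    AND c_oa.qtr = n_oty.qtr
    AND c_oa.data_denom = n_oty.data_denom
WHERE c_oa.suppressed = 1
  AND c_oa.data_cat = 1
  AND n_oa.suppressed = 0 AND n_oa.data_cat = 1
  AND n_oty.suppressed = 0 AND n_oty.data_cat = 3;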
But if you really want to merge this directly into a query of the data you could modify it some to show all values, with inferred values mixed in. We need to switch to outer joins to do this, and I'm going to do some slightly weird things with join conditions to make it fit together:
SELECT src.COUNTYID
, src.INDUSTRYID
, src.YEAR
, src.QTR
, case when (n_oa.value - n_oty.value) is null
then src.suppressed
else 2
end as SUPPRESSED_CODE -- 0=NOT SUPPRESSED, 1=SUPPRESSED, 2=INFERRED
, src.DATA_CAT
, src.DATA_DENOM
, coalesce(n_oa.value - n_oty.value, src.value) as VALUE
FROM QCEW src --a source row from which we'll generate a record
left join QCEW n_oa --next yr (same qtr) overall (if src is suppressed/overall)
on src.countyId = n_oa.countyId
and src.industryId = n_oa.industryId
and src.year = n_oa.year - 1
and src.qtr = n_oa.qtr
and src.data_denom = n_oa.data_denom
and src.SUPPRESSED = 1 and n_oa.SUPPRESSED = 0
and src.DATA_CAT = 1 and n_oa.DATA_CAT = 1
left join QCEW n_oty --next yr (same qtr) over-the-year (if src is suppressed/overall)
on src.countyId = n_oty.countyId
and src.industryId = n_oty.industryId
and src.year = n_oty.year - 1
and src.qtr = n_oty.qtr
and src.data_denom = n_oty.data_denom
and src.SUPPRESSED = 1 and n_oty.SUPPRESSED = 0
and src.DATA_CAT = 1 and n_oty.DATA_CAT = 3

Tableau pivot rows into fixed number of columns

I am new to Tableau so maybe this is an easy question but I can't get it done yet. I have my data in the following format:
EntityId | ActionId
------------|------------
1 | 2
1 | 6
1 | 1
1 | 7
1 | 7
2 | 1
2 | 2
2 | 3
2 | 3
My desired table format for my visualizations looks like the following:
EntityId | 1stActId | 2ndActId | 3rdActId
-------------------------------------------------
1 | 2 | 6 | 1
1 | 6 | 1 | 7
1 | 1 | 7 | 7
2 | 1 | 2 | 3
2 | 2 | 3 | 3
So I want to extract all action triples, with every action in its own column. The next step would be to make the number of columns variable so that I can get pairs, triples, quadruples, and so on.
Is there a way to do this in Tableau directly or do I have to transform it before importing it in Tableau?
Thanks in advance!
Best regards,
Tim
Interestingly, Tableau works best with your current data format rather than your desired table format. There is a feature called Pivot which transforms your desired table into your current format, but not vice versa. To achieve what you want, you will have to transform the data before importing it into Tableau. Otherwise, consider the format below: depending on your objective, it may give you the opportunity to filter, group, and drill down into your data. It does, however, duplicate the EntityId, assuming that is not an issue for you.
EntityId | ActionId | Position
------------|------------|------------
1 | 2 | 1st
1 | 6 | 1st
1 | 1 | 1st
2 | 1 | 1st
2 | 2 | 1st
1 | 6 | 2nd
1 | 1 | 2nd
1 | 7 | 2nd
2 | 2 | 2nd
2 | 3 | 2nd
1 | 1 | 3rd
1 | 7 | 3rd
1 | 7 | 3rd
2 | 3 | 3rd
2 | 3 | 3rd
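If you do that reshaping in the database before the Tableau import, a hedged sketch with window functions can also produce the wide triple format from the question directly; it assumes the source table is called Actions and has (or can be given) a SeqNo column that defines the order of the actions, since the original table has no explicit ordering column:
SELECT EntityId,
       ActionId                                                      AS Act1,
       LEAD(ActionId, 1) OVER (PARTITION BY EntityId ORDER BY SeqNo) AS Act2,
       LEAD(ActionId, 2) OVER (PARTITION BY EntityId ORDER BY SeqNo) AS Act3
FROM Actions;
-- wrap this in a subquery and filter on Act3 IS NOT NULL to keep only complete triples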

SQL query/functions to flatten a multiple table item and hierarchical item links

I have the data structure below, storing items and the links between them in a parent-child relationship.
I need to display the result as shown below, one line per parent, with all of its children.
The values are the ItemCodes by item type; for example, C-1 and C-2 are the first two items of type C, and so on.
In a previous version of the application, there was at most one C and one H for each P.
So I used a mix of max() and group by, and that gave the result.
But now, parents may be linked to children of different types and in varying numbers.
I tried several techniques, including adding temporary tables, views, use of PIVOT, ROLLUP, CUBE, stored procedures and cursors (!), but nothing worked for this specific problem.
I finally managed to adapt the query. However, it contains many select from (select ...) clauses, as well as row_number-based subqueries.
Also, the result is not dynamic, meaning the number of columns is fixed (which is acceptable).
My question is: what would be your approach to such a problem (if possible in a single query)? Thank you!
The table structure:
Item
-------------------------------
ItemId | ItemCode | ItemType
-------------------------------
1 | P1 | P
2 | C11 | C
3 | H11 | H
4 | H12 | H
5 | P2 | P
6 | C21 | C
7 | C22 | C
8 | C23 | C
9 | H21 | H
ItemLink
---------------------------------------
LinkId | ParentItemId | ChildItemId
---------------------------------------
1 | 1 | 2
2 | 1 | 3
3 | 1 | 4
4 | 2 | 6
5 | 2 | 7
6 | 2 | 8
7 | 2 | 9
Expected Result
-----------------------------------------------------
P C-1 C-2 ... C-N H-1 H-2 ... H-N
-----------------------------------------------------
P1 C11 NULL NULL NULL H11 H12 NULL NULL
P2 C21 C22 C23 NULL H21 NULL NULL NULL
...
Part of my current query (which is working) is shown in this screenshot: http://s12.postimg.org/r64tgjjnh/SOQuestion.png
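For what it's worth, a common shape for this kind of query is ROW_NUMBER() per parent and child type feeding a conditional aggregation. A hedged sketch of the idea (SQL Server syntax assumed, a fixed number of output columns, and only direct parent-child links handled):
WITH children AS (
    SELECT p.ItemCode AS ParentCode,
           c.ItemCode AS ChildCode,
           c.ItemType,
           ROW_NUMBER() OVER (PARTITION BY p.ItemId, c.ItemType
                              ORDER BY c.ItemId) AS rn
    FROM Item p
    JOIN ItemLink l ON l.ParentItemId = p.ItemId
    JOIN Item c     ON c.ItemId = l.ChildItemId
)
SELECT ParentCode AS P,
       MAX(CASE WHEN ItemType = 'C' AND rn = 1 THEN ChildCode END) AS C_1,
       MAX(CASE WHEN ItemType = 'C' AND rn = 2 THEN ChildCode END) AS C_2,
       MAX(CASE WHEN ItemType = 'C' AND rn = 3 THEN ChildCode END) AS C_3,
       MAX(CASE WHEN ItemType = 'H' AND rn = 1 THEN ChildCode END) AS H_1,
       MAX(CASE WHEN ItemType = 'H' AND rn = 2 THEN ChildCode END) AS H_2
FROM children
GROUP BY ParentCode;
The column list is fixed, so extending it to more C or H slots means adding more CASE expressions (or generating the statement dynamically).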

How to find multivalued dependencies that R does not satisfy?

I'm currently taking a course on databases, and the current topic is relational design theory. The sub-topic is multivalued dependencies, and I'm completely lost :(
I have such a question:
R(A,B,C):
A | B | C
----------
1 | 2 | 3
1 | 3 | 2
1 | 2 | 2
3 | 2 | 1
3 | 2 | 3
Which of the following multivalued dependencies does this instance of R not satisfy?
1. AB ↠ C
2. B ↠ C
3. C ↠ A
4. BC ↠ C
I know that:
A | B | rest
----------
a | b1 | r1
a | b2 | r2
a | b1 | r2
a | b2 | r1
I'm making tables for those dependencies in those manner:
B ↠ C
B | C | A
----------
2 | 3 | 1
2 | 2 | 1
2 | 1 | 3
2 | 3 | 3
3 | 2 | 1
According to my assumptions:
(2, 2, 1) and (2, 1, 3) fit the pattern above, so there should also be (2, 1, 1) and (2, 2, 3) records.
So this one is not satisfied. Am I right?
And I have no idea how to build such a table for AB ↠ C. Will it be the same table?
A | B | C
----------
1 | 2 | 3
1 | 3 | 2
1 | 2 | 2
3 | 2 | 1
3 | 2 | 3
Is this what is called a trivial dependency?
Let me stretch my mind back to college days, from my Database Modeling Concepts class... Taking a quick refresher course via some online PDFs...
I wrote and rewrote a paragraph here trying to concisely explain the multi-determines operator, but I can't explain it any more clearly than those PDFs do. Roughly speaking, if B ↠ C holds, then for any two tuples in R that agree on B, swapping their C values must produce tuples that are also in R.
In the case of your last question, you're right; that is exactly what's called a trivial multivalued dependency. Since there are no columns other than A, B, and C, the attributes of AB and C together cover all of R, so AB ↠ C is trivially satisfied. To build a table for it, I'd order by A first, then B. It would look something like
A | B | C
---------
1 | 2 | 2
1 | 2 | 3
1 | 3 | 2
3 | 2 | 1
3 | 2 | 3
B ↠ C is a much more interesting example. Since the (B, C, A) tuples (2, 3, 1) and (2, 1, 3) both exist, the tuples (2, 1, 1) and (2, 3, 3) (obtained by swapping the C values) must both exist for R to satisfy the dependency. Since (2, 1, 1) is not in R, R does not satisfy B ↠ C. Writing out the tables like you did makes it quite easy to see.
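If it helps, the same swap test can be checked mechanically. A hedged SQL sketch over the sample instance, listing the tuples that B ↠ C would require but that are missing from R (the table name r is just for illustration):
create table r (a int, b int, c int);
insert into r values (1, 2, 3), (1, 3, 2), (1, 2, 2), (3, 2, 1), (3, 2, 3);
select distinct t1.b as b, t1.c as c, t2.a as a   -- tuple that B ↠ C would require...
from r t1
join r t2 on t1.b = t2.b
where not exists (                                -- ...but that is not present in r
    select 1 from r t3
    where t3.b = t1.b and t3.c = t1.c and t3.a = t2.a
);
This returns (2, 1, 1) and (2, 2, 3) in (B, C, A) order, so this instance of R does not satisfy B ↠ C; an empty result would mean the dependency is satisfied.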
The examples in the PDFs mentioned earlier put meaningful names on the columns, which may aid your understanding. I hope this sets you on the right path!
