determine the number of intersecting records - snowflake-cloud-data-platform

I have a set of objects that have a variety of attributes:
id   title                                owner_id
25   The Name of the object               5000
31   Another name of a different object   5000
71   Yet another                          5000
19   example title                        6000
21   another example                      6000
23   more example                         6000
and a list of attributes for each object (this list isn't completely normalized, but could be if necessary)
object_id   attribute
25          Blue
25          Green
25          Black
31          Blue
31          Purple
71          White
19          Grey
19          Blue
21          Yellow
23          Pink
23          Grey
The goal is to calculate how many of each owner's objects share at least one attribute with the objects of another owner. In the example, owner 5000 has 3 objects (25, 31, 71) and owner 6000 has 3 (19, 21, 23).
5000
object 25 is (blue | green | black) and overlaps with 19 because 19 also has a blue attribute.
object 31 is (blue | purple) and overlaps with 19 because 19 also has a blue attribute.
object 71 is (white) and doesn't overlap with any of owner 6000's objects, so it is unique.
6000
object 19 is (grey | blue) and overlaps 25 and 31 through the blue attribute
object 21 is (yellow) and is unique
object 23 is (pink | grey) and is unique because none of 5000's objects are pink or grey
Since it doesn't really matter whether 1 or all of the other owner's objects match, I can create a summary table for each owner that lists all of that owner's distinct attributes, which would look like:
5000: ( blue | green | black | purple | white )
6000: ( grey | blue | yellow | pink )
so the goal would be some output like:
owner_a   owner_b   count_a   count_b   similarity_a_to_b   similarity_b_to_a   unique_a   unique_b
5000      6000      3         3         2                   1                   1          2
the challenge is that I'm dealing with several thousand owner_ids, with several million objects, which have 10s of millions of attributes, so I'm trying to figure out how to summarize the data so I could generate these types of metrics.

Well the SQL can be done via:
WITH objects (id, title, owner_id) AS (
    SELECT * FROM VALUES
        (25, 'The Name of the object', 5000),
        (31, 'Another name of a different object', 5000),
        (71, 'Yet another', 5000),
        (19, 'example title', 6000),
        (21, 'another example', 6000),
        (23, 'more example', 6000)
), attributes (object_id, attribute) AS (
    SELECT * FROM VALUES
        (25, 'Blue'),
        (25, 'Green'),
        (25, 'Black'),
        (31, 'Blue'),
        (31, 'Purple'),
        (71, 'White'),
        (19, 'Grey'),
        (19, 'Blue'),
        (21, 'Yellow'),
        (23, 'Pink'),
        (23, 'Grey')
), obj_att AS (
    -- every object paired with each of its attributes
    SELECT o.*, a.*
    FROM objects AS o
    JOIN attributes AS a
        ON o.id = a.object_id
), owner_att AS (
    -- distinct attributes per owner
    SELECT DISTINCT OWNER_ID, ATTRIBUTE
    FROM obj_att
), dist_owners AS (
    SELECT DISTINCT OWNER_ID
    FROM objects
), owner_pairs AS (
    -- every ordered pair of different owners
    SELECT a.owner_id as a_id, b.owner_id as b_id
    FROM dist_owners AS a
    JOIN dist_owners AS b
        ON a.owner_id <> b.owner_id
), object_mixed_attributes AS (
    -- b_att is non-null when owner b also has this attribute
    SELECT op.*, oba.*, owa.attribute as b_att
    FROM owner_pairs AS op
    JOIN obj_att AS oba ON oba.owner_id = op.a_id
    LEFT JOIN owner_att AS owa ON owa.owner_id = op.b_id and oba.attribute = owa.attribute
)
SELECT op.a_id as owner_a
    ,op.b_id as owner_b
    ,count(distinct(om_a.id)) as count_a
    ,count(distinct(om_b.id)) as count_b
    ,count(distinct(iff(om_a.b_att is not null,om_a.id,null))) as similarity_a_to_b
    ,count(distinct(iff(om_b.b_att is not null,om_b.id,null))) as similarity_b_to_a
    ,count_a-similarity_a_to_b as unique_a
    ,count_b-similarity_b_to_a as unique_b
FROM owner_pairs AS op
JOIN object_mixed_attributes AS om_a ON om_a.a_id = op.a_id
JOIN object_mixed_attributes AS om_b ON om_b.a_id = op.b_id
WHERE op.a_id < op.b_id
GROUP BY 1,2;
which gives:
OWNER_A   OWNER_B   COUNT_A   COUNT_B   SIMILARITY_A_TO_B   SIMILARITY_B_TO_A   UNIQUE_A   UNIQUE_B
5000      6000      3         3         2                   1                   1          2
Given it's all equijoins, you will get the best performance out of merge joins. Looking at the profile for this tiny example, there are a couple of WithReference blocks, which over large tables can sometimes be slower than just reading from the table twice; this comes from the CTE being reused. So experimenting there can help.
Also, I have been lazy and not fully expanded the selects, using x.* at times, and maybe those values are not needed lower down. The compiler will solve that for us, and if your tables are that large, it will be lost in the wash.
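As a sanity check on the numbers, the per-owner summary idea from the question can be sketched in plain Python with sets. This is only an illustrative cross-check on the sample data, not a replacement for the SQL; the function name overlap_metrics is made up for this sketch.

```python
from collections import defaultdict
from itertools import combinations

# Sample data copied from the question: (object id, owner id) and (object id, attribute).
objects = [(25, 5000), (31, 5000), (71, 5000), (19, 6000), (21, 6000), (23, 6000)]
attributes = [
    (25, "Blue"), (25, "Green"), (25, "Black"), (31, "Blue"), (31, "Purple"),
    (71, "White"), (19, "Grey"), (19, "Blue"), (21, "Yellow"), (23, "Pink"), (23, "Grey"),
]


def overlap_metrics(objects, attributes):
    """Return one tuple per unordered owner pair, in the SQL output's column order.

    Note: unlike the SQL (which inner-joins objects to attributes), an object
    with no attributes would still be counted here, as a unique object.
    """
    obj_attrs = defaultdict(set)      # object id -> its attributes
    for object_id, attribute in attributes:
        obj_attrs[object_id].add(attribute)

    owner_objs = defaultdict(set)     # owner id -> its object ids
    owner_attrs = defaultdict(set)    # owner id -> union of its objects' attributes
    for object_id, owner_id in objects:
        owner_objs[owner_id].add(object_id)
        owner_attrs[owner_id] |= obj_attrs[object_id]

    rows = []
    for a, b in combinations(sorted(owner_objs), 2):
        # An object is "similar" if it shares at least one attribute with the other owner.
        sim_a = sum(1 for o in owner_objs[a] if obj_attrs[o] & owner_attrs[b])
        sim_b = sum(1 for o in owner_objs[b] if obj_attrs[o] & owner_attrs[a])
        count_a, count_b = len(owner_objs[a]), len(owner_objs[b])
        rows.append((a, b, count_a, count_b, sim_a, sim_b, count_a - sim_a, count_b - sim_b))
    return rows


print(overlap_metrics(objects, attributes))
# [(5000, 6000, 3, 3, 2, 1, 1, 2)]
```

The pair loop is quadratic in the number of owners, so at several thousand owners this is only useful for spot-checking a handful of pairs against the SQL result.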

Related

star-schema - fact table modeling

I need to model a star schema for some business needs around liquidity stress testing.
I will try to give an analogous example.
Let's say we have deals about financing, financial securities, etc.
In the fact table,
at a given date, these deals have a real value of X euros,
but that value will vary over time, so we also have some projected values.
My concern is how to represent these projected values for the deals, and more specifically what granularity to choose.
(The example below is an oversimplification of the fact table, and yes, in the real table dimension IDs are used instead.)
Method 1 : as many metric columns as projection values calculated
AsofDate     DealId           0Day   1Day   7Days   1Month
2022-01-01   financingDeal1   100    99     98      85
2022-01-01   financingDeal2   150    150    120     120
2022-01-01   financingDeal3   100    99     98      85
2022-01-01   financingDeal4   100    99     98      85
Method 2 : add a level of granularity: a row is no longer just a deal on a given date; it is a deal on a given date together with one of its projections over the next few days/months
AsofDate     DealId           projection      value
2022-01-01   financingDeal1   0Day - actual   100
2022-01-01   financingDeal1   1Day            99
2022-01-01   financingDeal1   7Days           99
2022-01-01   financingDeal1   1Month          85
From where I see it:
For method 1, the main drawback is that if in the future we have a new projection value for 3 months, we will need to add a column in the ETL and in the OLAP cube for '3months'.
For method 2:
we will have as many rows as (deals * projections), and we have 11 projections, so that is 11 rows for each deal, and we have 1 million+ deals.
What is your opinion on this topic?
Thanks for your consideration

SQLITE: horizontally combining large sql tables based on common entry

I have multiple tables I need to join horizontally based on a common entry. For the first sets of tables, I need to join tables that look like:
Table 1 dimension = 1093367x18 and It looks like
ROW #       ID    TEMPERATURE   DESCR   ...   NUMB
1           32    23            Y       ...   23
2           47    54            N       ...   24
...         ...   ...           ...     ...   ...
1,093,367   78    12            Y       ...   45
Table 2 dimension = 1093367x648
ROW #       ID    COLOR 1   COLOR 2   ...   COLOR 648
1           32    RED       BLUE      ...   GREEN
2           47    BLUE      PURPLE    ...   RED
...         ...   ...       ...       ...   ...
1,093,367   78    YELLOW    RED       ...   BLUE
And I need [Table 1 |Table 2]:
ROW #       ID    TEMPERATURE   DESCR   ...   NUMB   COLOR 1   COLOR 2   ...   COLOR 648
1           32    23            Y       ...   23     RED       BLUE      ...   GREEN
2           47    54            N       ...   24     BLUE      PURPLE    ...   RED
...         ...   ...           ...     ...   ...    ...       ...       ...   ...
1,093,367   78    12            Y       ...   45     YELLOW    RED       ...   BLUE
Is this possible to do in SQLITE? I have only found solutions in which I would have to type out all 648 columns for table 2. Is this the only way to do this in SQLITE?

You don't have to write any column name in the SELECT statement if you do a join with the USING clause instead of the ON clause:
SELECT *
FROM Table1 INNER JOIN Table2
USING(id);
This will return all the columns of the 2 tables (first Table1's columns and then Table2's columns), but the columns used in the USING clause (in this case the column id on which the join is based) will be returned only once.
You can find more about the USING clause in Determination of input data (FROM clause processing).
See a simplified demo.
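The behaviour described above can be checked with Python's built-in sqlite3 module. The sketch below uses two tiny stand-in tables (only a few of the columns from the question, with made-up lowercase column names) and prints the column list of the joined result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Small stand-ins for Table1 and Table2; the real tables have 18 and 648 columns.
conn.executescript("""
    CREATE TABLE Table1 (id INTEGER, temperature INTEGER, descr TEXT);
    CREATE TABLE Table2 (id INTEGER, color1 TEXT, color2 TEXT);
    INSERT INTO Table1 VALUES (32, 23, 'Y'), (47, 54, 'N');
    INSERT INTO Table2 VALUES (32, 'RED', 'BLUE'), (47, 'BLUE', 'PURPLE');
""")

cur = conn.execute("SELECT * FROM Table1 INNER JOIN Table2 USING (id) ORDER BY id")
columns = [d[0] for d in cur.description]  # column names of the joined result
rows = cur.fetchall()

print(columns)  # ['id', 'temperature', 'descr', 'color1', 'color2']
print(rows)     # [(32, 23, 'Y', 'RED', 'BLUE'), (47, 54, 'N', 'BLUE', 'PURPLE')]
```

The id column appears once, followed by the remaining Table1 columns and then the Table2 columns, without any column being named in the query.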

SQL- Is this a multi step line of code...add column than add data for each row ONLY for that particular column

There is a table below, and further below is the question. Is this a multi-line query that needs parentheses? This is for a business analyst assignment; it's the first time I'm using SQL (I've used Python, JS, HTML, and CSS, self-taught, back when I was trying to be a web developer).
SQL Queries
Table Name: TRADES
DATE        FIRM   SYMBOL   SIDE   QUANTITY   PRICE
2/3/2014    1ABC   A123     B      200        41
2/4/2014    2BCD   B234     B      600        60
2/7/2014    1ABC   C345     S      600        70
2/10/2014   3CDE   C345     S      600        70
2/12/2014   4DEF   B234     B      200        62
2/14/2014   3CDE   B234     B      300        61
2/21/2014   1ABC   A123     B      300        40
2/24/2014   1ABC   A123     S      300        30
2/25/2014   4DEF   C345     B      2100       71
2/27/2014   CDE    B234     S      1100       63
Q3. Your business user asks you to show them a table that includes the number of trades for each firm and symbol combination in the data table above. Please write the SQL query you would use to query TRADES table to get below result
FIRM   SYMBOL   NO_TRADES
1ABC   A123     3
2BCD   B234     1
1ABC   C345     1
3CDE   C345     1
4DEF   B234     1
3CDE   B234     1
4DEF   C345     1
CDE    B234     1

This looks like simple aggregation:
select firm, symbol, count(*) as no_trades
from trades
group by firm, symbol
order by no_trades desc, symbol
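To try the aggregation without a database server, the TRADES rows can be loaded into an in-memory SQLite database from Python. This is a minimal sketch; the column name trade_date is an assumption chosen for the demo, and the rows are copied from the question.

```python
import sqlite3

# Rows copied from the TRADES table in the question.
trades = [
    ("2/3/2014", "1ABC", "A123", "B", 200, 41),
    ("2/4/2014", "2BCD", "B234", "B", 600, 60),
    ("2/7/2014", "1ABC", "C345", "S", 600, 70),
    ("2/10/2014", "3CDE", "C345", "S", 600, 70),
    ("2/12/2014", "4DEF", "B234", "B", 200, 62),
    ("2/14/2014", "3CDE", "B234", "B", 300, 61),
    ("2/21/2014", "1ABC", "A123", "B", 300, 40),
    ("2/24/2014", "1ABC", "A123", "S", 300, 30),
    ("2/25/2014", "4DEF", "C345", "B", 2100, 71),
    ("2/27/2014", "CDE", "B234", "S", 1100, 63),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trades (trade_date TEXT, firm TEXT, symbol TEXT, "
    "side TEXT, quantity INTEGER, price INTEGER)"
)
conn.executemany("INSERT INTO trades VALUES (?, ?, ?, ?, ?, ?)", trades)

# One output row per (firm, symbol) combination, with the number of trades in it.
result = conn.execute("""
    SELECT firm, symbol, COUNT(*) AS no_trades
    FROM trades
    GROUP BY firm, symbol
    ORDER BY no_trades DESC, symbol
""").fetchall()

for row in result:
    print(row)
```

The first row printed is ('1ABC', 'A123', 3); the remaining seven combinations each have a count of 1, and their relative order within the same symbol is not guaranteed.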

T-SQL: having trouble making a subquery or a join (on a table that defines a widget type) yield column titles for a result set

Say widgets are stored in a table WIDGETS. Widgets come in 3 flavors: type 0, type 1, and type 2. A trivial WIDGETS table containing 4 widgets (with primary key IDWidget), might be:
IDWidget   wiType
-----------------
501        0
502        0
503        2
504        1
A table TYPEATTRIBUTES lists the structure of attributes for the types of widgets, with primary key IDTypeAttribute. The attributes for type 0 widgets might be described as:
IDTypeAttribute   wiType   taLabel    taStore    taSequence
------------------------------------------------------------
1                 0        'Type0a'   'text'     1
2                 0        'Type0b'   'text'     2
3                 0        'Type0c'   'number'   3
Another table AttributeValues contains the attribute values for individual widgets. The attribute values for some widgets might be:
IDAttributeValue   IDWidget   IDTypeAttribute   avValText   avValNumber
------------------------------------------------------------------------
101                501        1                 'val1'      NULL
102                501        2                 'val2'      NULL
103                501        3                 NULL        123
104                502        1                 'vala'      NULL
105                502        2                 'valb'      NULL
106                502        3                 NULL        789
Now I want to select all the type 0 widgets and display their attributes in a table whose column headings are the names of each attribute. The result set would be:
IDWidget   Type0a   Type0b   Type0c
------------------------------------
501        'val1'   'val2'   123
502        'vala'   'valb'   789
Of course, if I only had 1 type with a fixed structure (e.g., I know the names of the attributes for Type 0), I could do something like:
SELECT
    w.IDWidget,
    (SELECT avValText FROM ATTRIBUTEVALUES WHERE IDWidget = w.IDWidget AND IDTypeAttribute = 1) AS Type0a,
    (SELECT avValText FROM ATTRIBUTEVALUES WHERE IDWidget = w.IDWidget AND IDTypeAttribute = 2) AS Type0b,
    (SELECT avValNumber FROM ATTRIBUTEVALUES WHERE IDWidget = w.IDWidget AND IDTypeAttribute = 3) AS Type0c
FROM
    WIDGETS w
WHERE
    w.wiType = 0
But I want a general result where the TYPEATTRIBUTES table gives the column titles based on the attributes for this type of widget, and specifies whether to look for the text or numeric value, (not by hard-coding these).
I've tried various convoluted approaches using a sub-query to extract the proper attribute names based on wiType from the TYPEATTRIBUTES table, but I run into problems with the requirement that a subquery return only a single value.
Is it possible to name columns in this way with SQL Server?

SQL Parent child tree data return only completed tree nodes from list

I have a parent child tree relationship in SQL Database. There are many tables holding all my data/relationships, the important data I need for this purpose is in the table below and I am using a recursive CTE query to get that. My application and database design are working perfectly for what I am doing. I need to run some management/maintenance to confirm the data is correct and what I am trying to do is outlined below:
What my data set contains is:
- FilterID - This is the Parent Record ID
- depth - in my CTE, this is giving me the number of records for that tree relationship.
- sortCol - in my CTE I am building this out (this is just a combined binary string of all the FilterIDs)
- TreeListOfFilterIDs - This is a list of all the FilterIDs in the tree for this record at this level.
- RESULTS I WANT - This is a column I added to identify the list of fields I want to return from my dataset
In this example the FilterID 35 is the starting record. Each record has 1 parent, but a parent can have 2 (or more) different children, each starting its own path. (NOTE: This relationship is intentional, as 1 parent can have 2 different children and I am using that in my code/processes.)
What I need from the data set below is to return only the records at the end of each path of the parent/child relationship. I added a column to the data below, called "RESULTS I WANT", that my query does not produce; I only want to get the records marked with an "X" in it.
I am populating the tables, and my code/processes are working with how I have designed everything. What I am trying to accomplish now with this logic is to find the unique final paths so I can easily identify each final path and make sure they are correct. This would be for maintenance purposes, to monitor the data and the paths.
FilterID   depth   sortCol                                              TreeListOfFilterIDs   RESULTS I WANT
35         0       0x00000023                                           35
36         1       0x0000002300000024                                   35,36
37         2       0x000000230000002400000025                           35,36,37
39         3       0x00000023000000240000002500000027                   35,36,37,39           X
38         2       0x000000230000002400000026                           35,36,38
40         3       0x00000023000000240000002600000028                   35,36,38,40
44         4       0x000000230000002400000026000000280000002C           35,36,38,40,44        X
41         3       0x00000023000000240000002600000029                   35,36,38,41
45         4       0x000000230000002400000026000000290000002D           35,36,38,41,45        X
42         3       0x0000002300000024000000260000002A                   35,36,38,42
46         4       0x0000002300000024000000260000002A0000002E           35,36,38,42,46        X
43         3       0x0000002300000024000000260000002B                   35,36,38,43
47         4       0x0000002300000024000000260000002B0000002F           35,36,38,43,47
48         5       0x0000002300000024000000260000002B0000002F00000030   35,36,38,43,47,48     X
49         5       0x0000002300000024000000260000002B0000002F00000031   35,36,38,43,47,49     X
One note, the above data is from my CTE results but is being sorted by the sortcol value (so this is not the order the data is being inserted into the CTE).
SQL to generate the above results:
-- this combines the required answers for the next questions to display (will join to what is filled out in the form)
IF OBJECT_ID('tempdb..#RequiredAnswersToFindNextFiltersToDisplay') IS NOT NULL
    DROP TABLE #RequiredAnswersToFindNextFiltersToDisplay
CREATE TABLE #RequiredAnswersToFindNextFiltersToDisplay (
    FilterID INT,
    FormAssociationID INT,
    RequiredAnswerFilterID INT
)
-- this gets the RequiredAnswersIntoA String for joining on
INSERT INTO #RequiredAnswersToFindNextFiltersToDisplay (
    FilterID, FormAssociationID, RequiredAnswerFilterID
)
VALUES
    ( 35, 1, 0 ),
    ( 36, 2, 35 ),
    ( 37, 3, 36 ),
    ( 38, 4, 36 ),
    ( 39, 5, 37 ),
    ( 40, 6, 38 ),
    ( 41, 7, 38 ),
    ( 42, 8, 38 ),
    ( 43, 9, 38 ),
    ( 44, 10, 40 ),
    ( 45, 11, 41 ),
    ( 46, 12, 42 ),
    ( 47, 13, 43 ),
    ( 48, 14, 47 ),
    ( 49, 15, 47 )
;WITH ItemDataCTE(FilterID, FormAssociationID, RequiredAnswerFilterID, depth, sortcol, TreeListOfFilterIDs)
AS (
    -- anchor member
    SELECT FilterID, FormAssociationID, RequiredAnswerFilterID, 0, CAST(FilterID AS VARBINARY(900)) AS SortCol,
        CAST(FilterID AS VARCHAR(MAX)) AS TreeListOfFilterIDs
    FROM #RequiredAnswersToFindNextFiltersToDisplay
    WHERE RequiredAnswerFilterID = 0
    UNION ALL
    -- recursive member: append each child to its parent's path
    SELECT ID.FilterID, ID.FormAssociationID, ID.RequiredAnswerFilterID, M.depth + 1, CAST(M.SortCol + CAST(ID.FilterID AS BINARY(4)) AS VARBINARY(900)) SortCol,
        CAST(M.TreeListOfFilterIDs + ',' + CAST(ID.FilterID AS VARCHAR(50)) AS VARCHAR(MAX)) AS TreeListOfFilterIDs
    FROM #RequiredAnswersToFindNextFiltersToDisplay ID
    INNER JOIN ItemDataCTE AS M ON ID.RequiredAnswerFilterID = M.FilterID
)
SELECT *
FROM ItemDataCTE
ORDER BY ItemDataCTE.sortcol

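The simplest way to do it is to check the recursive CTE against itself with NOT EXISTS, keeping only the rows whose path is not a prefix of any other row's path. Replace the final SELECT after the CTE definition with:
SELECT *
FROM ItemDataCTE AS c0
WHERE NOT EXISTS
(
    SELECT 1
    FROM ItemDataCTE AS c1
    WHERE c1.TreeListOfFilterIDs LIKE c0.TreeListOfFilterIDs + ',%'
)
ORDER BY sortcol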
Result:
FilterID   FormAssociationID   RequiredAnswerFilterID   depth   sortcol                          TreeListOfFilterIDs
39         5                   37                       3       00035000360003700039             35,36,37,39
44         10                  40                       4       0003500036000380004000044        35,36,38,40,44
45         11                  41                       4       0003500036000380004100045        35,36,38,41,45
46         12                  42                       4       0003500036000380004200046        35,36,38,42,46
48         14                  47                       5       000350003600038000430004700048   35,36,38,43,47,48
49         15                  47                       5       000350003600038000430004700049   35,36,38,43,47,49
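The same prefix test can be tried outside SQL Server. The sketch below is a port to SQLite through Python's sqlite3 module, so the syntax differs slightly from the T-SQL above (WITH RECURSIVE, || for concatenation); the table name filters and the CTE name tree are made up for the demo, and only the two columns needed for the hierarchy are kept.

```python
import sqlite3

# (FilterID, RequiredAnswerFilterID) pairs copied from the question's INSERT.
edges = [
    (35, 0), (36, 35), (37, 36), (38, 36), (39, 37), (40, 38), (41, 38), (42, 38),
    (43, 38), (44, 40), (45, 41), (46, 42), (47, 43), (48, 47), (49, 47),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE filters (FilterID INTEGER, RequiredAnswerFilterID INTEGER)")
conn.executemany("INSERT INTO filters VALUES (?, ?)", edges)

leaf_paths = conn.execute("""
    WITH RECURSIVE tree(FilterID, path) AS (
        -- anchor member: the root has RequiredAnswerFilterID = 0
        SELECT FilterID, CAST(FilterID AS TEXT)
        FROM filters
        WHERE RequiredAnswerFilterID = 0
        UNION ALL
        -- recursive member: append each child to its parent's path
        SELECT f.FilterID, t.path || ',' || f.FilterID
        FROM filters AS f
        JOIN tree AS t ON f.RequiredAnswerFilterID = t.FilterID
    )
    SELECT c0.FilterID, c0.path
    FROM tree AS c0
    WHERE NOT EXISTS (
        -- a row is a final node when no other path extends its path
        SELECT 1
        FROM tree AS c1
        WHERE c1.path LIKE c0.path || ',%'
    )
    ORDER BY c0.FilterID
""").fetchall()

for filter_id, path in leaf_paths:
    print(filter_id, path)
```

On the sample data this prints the six final nodes 39, 44, 45, 46, 48 and 49 with their paths. Appending the comma before the % matters: without it, a path such as 35,36,3 would wrongly count as a prefix of 35,36,37.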