DBMS: Relational Algebra Execution Plan Cost Calculation

I have been trying for the past few days to come up with a solution to the following question.
Let's suppose that we have the following three tables.
Film(ID',Title,Country,Production_Date)
Actor(ID',Name,Genre,Nationality)
Cast(Actor_ID',Film_ID',Role)
Given information:
Film holds N(film) = 50,000 records, r(film) = 40 bytes, sequentially organized, index on PK
Actor holds N(actor) = 200,000 records, r(actor) = 80 bytes, heap organized, index on PK
Cast holds N(cast) = 100,000 records, r(cast) = 25 bytes, heap organized, no indexes
The execution tree and relational algebra expression for the execution plan are shown in the following picture:
For the lower-level join between Cast and Film I calculate the following costs:
Block Nested Loop Join: B(cast) × B(film)
Index Nested Loop Join: B(cast) + N(cast) × C(film)
I keep the smallest value, which is achieved with the INLJ.
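For concreteness, assuming a 4 KB block size (not given above):
B(cast) = ⌈100,000 / ⌊4096/25⌋⌉ = ⌈100,000 / 163⌉ = 614 blocks
B(film) = ⌈50,000 / ⌊4096/40⌋⌉ = ⌈50,000 / 102⌉ = 491 blocks
so the BNLJ costs 614 × 491 = 301,474 block accesses, while the INLJ costs 614 + 100,000 × C(film), where C(film) is the cost of one lookup in Film's PK index.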
Question:
Now that I have the cost in blocks of that join, how can I calculate the size of the joined table, and the new record size r of the joined table, so that I can proceed to calculate the upper-level join between this intermediate result and the Actor table?

I assume you want to do a natural join on FILM.ID = CAST.FILM_ID, where CAST.FILM_ID is a foreign key referencing FILM.ID.
1) Size of one row:
A join of Film and Cast results in tuples of the form
[FILM_ID, TITLE, COUNTRY, PRODUCTION_DATE, ACTOR_ID, ROLE].
Hence the row size should be something like
R(FILM JOIN CAST) = R(FILM) + R(CAST) - R(FILM_ID)
since the FILM_ID is the only column which is shared.
2) Number of rows:
N(FILM JOIN CAST) = N(CAST)
since there is exactly one matching row in FILM for every row in CAST.
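Plugging in the numbers from the question (and assuming a 4-byte FILM_ID, which the question does not specify):
R(FILM JOIN CAST) = 40 + 25 - 4 = 61 bytes
N(FILM JOIN CAST) = N(CAST) = 100,000 rows
With the same assumed 4 KB block size as above, the blocking factor is ⌊4096/61⌋ = 67 rows per block, so the intermediate result occupies B = ⌈100,000 / 67⌉ = 1,493 blocks, which is what you need to cost the upper-level join with Actor.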

Related

In Apache Flink, how to join certain columns from 2 dynamic tables (Kafka topic-based) using TableAPI

Here are the specs of my system:
2 Kafka topics:
Kafka Topic A: Contains {"key": string, "some_data1": [...], "timestamp": int}
Kafka Topic B: Contains {"key": string, "some_data2": [...], "timestamp": int}
Let's say both topics receive 5 messages per second. (Let's ignore delays for this example)
I want to add the some_data2 values from B into A for a certain duration using hop windowing (let's say a 2-second hop and window length for the sake of this example).
I tried the following:
1 - Create a SQL VIEW
First I created a view that joins both topics like this:
CREATE TEMPORARY VIEW IF NOT EXISTS my_joined_view (
  key,
  some_data1,
  some_data2,
  timestamp
) AS SELECT
  A.key,
  A.some_data1,
  B.some_data2,
  A.timestamp
FROM
  A
LEFT JOIN B ON A.key = B.key
2 - Continuous Query
Then I run my windowed query on the joined view like this:
SELECT
  key,
  my_udf(key, some_data1, some_data2, timestamp),
  timestamp,
  HOP_START(timestamp, INTERVAL '1' SECOND, INTERVAL '5' SECOND)
FROM
  my_joined_view
GROUP BY
  HOP(timestamp, INTERVAL '1' SECOND, INTERVAL '5' SECOND), key
3 - My Expectations
My expectation was that the my_udf accumulator would receive 10 entries each for some_data1 and some_data2, like:
class MyUDF(AggregateFunction):
    def accumulate(self, accumulator, *args):
        key = args[0]
        some_data1 = args[1]
        some_data2 = args[2]
        timestamp = args[3]
        assert len(some_data1) == 10
        assert len(some_data2) == 10
But in most cases I receive duplicates of each entry. It looks like the join mechanism is creating one row for each combination of values in the columns.
I am using a vectorized UDF, so the arguments I receive in the accumulator are pandas.Series objects.
There is clearly something wrong with the way I am joining my tables; I don't understand why I am receiving duplicate entries in my UDF.
Thanks,
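(To illustrate the duplication: a LEFT JOIN on key alone matches each row of A against every row of B with that key that the join has seen, so if B has received three messages for a key, every A row with that key comes out three times. One possible direction, sketched here on the assumption that rows should only pair up when close in time, is an interval join; the 2-second bound is taken from the example above:)
SELECT
  A.key,
  A.some_data1,
  B.some_data2,
  A.timestamp
FROM A
LEFT JOIN B
  ON A.key = B.key
  AND B.timestamp BETWEEN A.timestamp - INTERVAL '2' SECOND AND A.timestamp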

Missing value of measure (calculated member)

I've got a data warehouse which is the underlying database of an OLAP cube.
When I run a query like this:
SELECT dimS.Attribute2, SUM(fact.LastValue)
FROM FactTable fact
JOIN DimS dimS ON fact.DimSKey = dimS.DimSKey
GROUP BY dimS.Attribute2
I can see that all existing Attribute2 values in the dimS table have corresponding rows in the fact table.
On the other hand I've got a calculated measure:
CREATE MEMBER CURRENTCUBE.[MEASURES].[MyMeasure]
AS ([Measures].[FactTable - LastValue]
, [DimS].[S Hierarchy].[All].[Hierarchy SomeName]
, [DimS].[Category].[All]
, [DimS].[Question].CurrentMember
, [CimC].[Status].&[Active]
),DISPLAY_FOLDER='Folder',VISIBLE = 1;
and when running the MDX below:
SELECT
{ [Measures].[MyMeasure] } ON COLUMNS,
{ ([Survey].[Attribute2].ALLMEMBERS ) } ON ROWS
FROM [MyCube]
I can see that 2 of the Attribute2 members have no values (null) assigned to them.
What can cause an issue like that (the DimS dimension and the cube have both just been fully processed)?
Found the root cause.
The reference in the MDX definition of the calculated measure to [DimS].[S Hierarchy].[All].[Hierarchy SomeName] is itself another calculated member, one in which values within the dimension hierarchy are hardcoded, and for 2 of the Attribute2 members this condition is not met.

Returning from a join the first result of one column based on a second column

I need some help to improve part of my query. The query is returning the correct data, I just need to exclude some extra information that I don't need.
I believe that one of the main parts that will change is:
JOIN TBL_DATA_TYPE_RO_BODY TB ON TB.FK_ID_TBL_FILE_NAMES=VMI.ID_TBL_FILE_NAMES
In this part, if I have, for example, 2 rows in TBL_DATA_TYPE_RO_BODY with the same FK_ID_TBL_FILE_NAMES, it will return 2 results.
The data that I have is (I excluded some extra columns):
If I have 2 or more equal MAG values for the same ONLY_FILE_NAME field, I should return only the first one (I don't care about the other ones). I believe this is a simple case for GROUP BY, but I am having trouble doing the group by on the join.
My ideas:
Use SELECT TOP (i.e. here)
Use FIRST_VALUE (i.e. here)
What I have (note the last 2 lines):
Freq|Mag|Phase|Date|ONLY_FILE_NAME
1608039|767|3234|37:00.0|RO_Mass_Load_4b
1608039|781|3371|44:00.0|RO_Mass_Load_4b
1608039|788|3138|37:00.0|RO_Mass_Load_4b
1608039|797|3326|44:00.0|RO_Mass_Load_4b
1608039|808|3117|37:00.0|RO_Mass_Load_4b
1608039|808|3269|44:00.0|RO_Mass_Load_4b
What I would like to have (note the last line):
Freq|Mag|Phase|Date|ONLY_FILE_NAME
1608039|767|3234|37:00.0|RO_Mass_Load_4b
1608039|781|3371|44:00.0|RO_Mass_Load_4b
1608039|788|3138|37:00.0|RO_Mass_Load_4b
1608039|797|3326|44:00.0|RO_Mass_Load_4b
1608039|808|3117|37:00.0|RO_Mass_Load_4b
Note that the mag field is coming from my JOIN.
Ideas? Any help?
In case you want to see it, the whole query is:
SELECT TW.CURRENT_MEASUREMENT as Cycle_Current_Measurement,
TW.REF_MEASUREMENT as Cycle_Ref_Measurement,
CONVERT(REAL,TT.CURRENT_TEMP) as Cycle_Current_Temp,
CONVERT(REAL,TT.REF_TEMP) as Cycle_Ref_Temp,
TP.TYPE as Cycle_Type, TB.FREQUENCY as Freq,
TB.MAGNITUDE as Mag,
TB.PHASE as Phase,
VMI.TIME_FORMATTED as Date,
VMI.ID_TBL_FILE_NAMES as IdFileNames, VMI.ID_TBL_DATA_TYPE_RO_HEADER as IdHeader, VMI.*
FROM VW_MAIN_INFO VMI
JOIN TBL_DATA_TYPE_RO_BODY TB ON TB.FK_ID_TBL_FILE_NAMES=VMI.ID_TBL_FILE_NAMES
LEFT JOIN TBL_POINTS_AND_CYCLES TP ON VMI.ID_TBL_DATA_TYPE_RO_HEADER = TP.FK_ID_TBL_DATA_TYPE_RO_HEADER
LEFT JOIN TBL_POINTS_AND_MEASUREMENT TW ON VMI.ID_TBL_DATA_TYPE_RO_HEADER = TW.FK_ID_TBL_DATA_TYPE_RO_HEADER
LEFT JOIN TBL_POINTS_AND_TEMP TT ON VMI.ID_TBL_DATA_TYPE_RO_HEADER = TT.FK_ID_TBL_DATA_TYPE_RO_HEADER
Try something like this. The PARTITION BY is like a GROUP BY: it defines the groups over which ROW_NUMBER will auto-increment an integer by 1. The ORDER BY tells ROW_NUMBER which rows should get the lower numbers, so in this example the lowest date will have RID = 1. Then wrap it in a subquery, and select only those rows which have RID = 1.
select *
from (select RID = row_number() over (partition by tb.MAGNITUDE order by vmi.TIME_FORMATTED),
             ...<rest of your select list>
      from ...<rest of your query>) a
where a.RID = 1
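Applied to the query above, a sketch might look like the following (it assumes ONLY_FILE_NAME comes from VW_MAIN_INFO, which the question does not show, and it omits the three LEFT JOINs for brevity; the partition is by file name and magnitude so that equal Mag values within a file keep only the earliest-dated row):
SELECT a.*
FROM (
    SELECT TB.FREQUENCY AS Freq,
           TB.MAGNITUDE AS Mag,
           TB.PHASE AS Phase,
           VMI.TIME_FORMATTED AS Date,
           VMI.ONLY_FILE_NAME,
           -- number the rows within each (file name, magnitude) group, earliest date first
           ROW_NUMBER() OVER (PARTITION BY VMI.ONLY_FILE_NAME, TB.MAGNITUDE
                              ORDER BY VMI.TIME_FORMATTED) AS RID
    FROM VW_MAIN_INFO VMI
    JOIN TBL_DATA_TYPE_RO_BODY TB ON TB.FK_ID_TBL_FILE_NAMES = VMI.ID_TBL_FILE_NAMES
) a
WHERE a.RID = 1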

Resolve many to many relationship

Does anyone have a process or approach for determining how to resolve a many-to-many relationship in a relational database? Here is my scenario. I have a group of contacts and a group of phone numbers. Each contact can be associated with multiple phone numbers, and each phone number can be associated with multiple contacts.
A simple example of this situation would be an office with two employees (e1 & e2), one private voice line (v1), and one main voice line (v2). e1 is the CEO, so they have their own private voice line, v1, but they can also be reached by calling the main line, v2, and asking for the CEO. e2 is just an employee and can only be reached by calling v2.
So, e1 should be related to v1 & v2, and e2 should be related to v2. Conversely, v1 should be related to e1, and v2 should be related to e1 & e2.
The goal here is to be able to run queries like "what numbers can e1 be reached at" and "what employees can be reached at v2", etc. I know the answer will involve an intermediate table or tables, but I just can't seem to nail down the exact architecture.
You don't need any temp tables for the query. There is an intermediary table for the mapping.
numbers_tbl
-----------
nid int
number varchar
employees_tbl
-----------
eid int
name varchar
employee_to_phone_tbl
-----------
eid int
nid int
How can I call Bob?
select *
from employees_tbl e
inner join employee_to_phone_tbl m
on e.eid = m.eid
inner join numbers_tbl n
on m.nid = n.nid
where e.name = 'Bob'
Who might pickup if I call this number?
select *
from numbers_tbl n
inner join employee_to_phone_tbl m
on m.nid = n.nid
inner join employees_tbl e
on e.eid = m.eid
where n.number = '555-5555'
Employees:
eID, eName
1, e1
2, e2
PhoneNumbers:
pID, pNumber
1, v1
2, v2
EmployeePhones:
eID, pID
1, 1
1, 2
2, 2
Then you inner join. If you need to find out what number(s) e1 can be reached at (T-SQL):
SELECT E.eName, P.pNumber
FROM dbo.Employees E
INNER JOIN dbo.EmployeePhones EP ON E.eID = EP.eID
INNER JOIN dbo.PhoneNumbers P ON EP.pID = P.pID
WHERE E.eName = 'e1'
I believe this should work (testing it right now...)
EDIT: Took me a few minutes to type up, sorry for duplication...
Others have explained the schema, but I'm going to explain the concept. What they're building for you, the tables named EmployeePhones and employee_to_phone_tbl, is called an Associative Entity, which is a type of Weak Entity.
A Weak Entity does not have its own natural key and must instead be defined in terms of its foreign keys. An Associative Entity exists for the sole purpose of mapping a many-to-many relationship in a database that does not support the concept. Its primary key is the grouped foreign keys to the tables it maps.
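A minimal DDL sketch of that idea, using the EmployeePhones naming from the answer above:
CREATE TABLE EmployeePhones (
    eID INT NOT NULL,
    pID INT NOT NULL,
    PRIMARY KEY (eID, pID),  -- the grouped foreign keys form the primary key
    FOREIGN KEY (eID) REFERENCES Employees (eID),
    FOREIGN KEY (pID) REFERENCES PhoneNumbers (pID)
);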
For further information on relational theory, see this link
Normalize
Best Practices on Referential Integrity
Check this - http://statisticsio.com/Home/tabid/36/articleType/ArticleView/articleId/327/Need-Some-More-Sample-Databases.aspx
After just a little more thought, here is what I came up with. It probably goes along with the approach AviewAnew is thinking of.
employees
id (index)
name
numbers
id (index)
number
relations
employees.id (index)
numbers.id (index)
employees
1 : e1
2 : e2
numbers
1 : v1
2 : v2
relations
1 : 1
1 : 2
2 : 2
Is this the best/only approach?

How would you query an array of 1's and 0's chars from a database?

Say you had a long array of chars that are either 1 or 0, kind of like a bit vector, but stored in a database column. How would you query to know which values are set/not set? Say you need to know if char 500 and char 1500 are "true" or not.
SELECT
Id
FROM
BitVectorTable
WHERE
SUBSTRING(BitVector, 500, 1) = '1'
AND SUBSTRING(BitVector, 1500, 1) = '1'
No index can be used for this kind of query, though. When you have many rows, this will get slow very quickly.
Edit: On SQL Server at least, all built-in string functions are deterministic. That means you could look into the possibility of making computed columns based on the SUBSTRING() results of the whole combined value, putting an index on each of them. Inserts will be slower and table size will increase, but searches will be really fast.
SELECT
Id
FROM
BitVectorTable
WHERE
BitVector_0500 = '1'
AND BitVector_1500 = '1'
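A minimal sketch of that computed-column setup (SQL Server syntax; the names follow the example above):
ALTER TABLE BitVectorTable
    ADD BitVector_0500 AS SUBSTRING(BitVector, 500, 1) PERSISTED,
        BitVector_1500 AS SUBSTRING(BitVector, 1500, 1) PERSISTED;
CREATE INDEX IX_BitVector_0500 ON BitVectorTable (BitVector_0500);
CREATE INDEX IX_BitVector_1500 ON BitVectorTable (BitVector_1500);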
Edit #2: The limits for SQL Server are:
1,024 columns per normal table
30,000 columns per "wide" table
In MySQL, something using substring like
select foo from bar
where substring(col, 500,1)='1' and substring(col, 1500,1)='1';
This will be pretty inefficient though; you might want to rethink your schema. For example, you could store each bit separately to trade off space for speed...
create table foo
(
id int not null,
bar varchar(128),
primary key(id)
);
create table foobit
(
  foo_id int not null,
  idx int not null,
  value tinyint not null,
  primary key(foo_id, idx),
  index(idx, value)
);
Which would be queried
select foo.bar from foo
inner join foobit as bit500
on(foo.id=bit500.foo_id and bit500.idx=500)
inner join foobit as bit1500
on(foo.id=bit1500.foo_id and bit1500.idx=1500)
where
bit500.value=1 and bit1500.value=1;
Obviously consumes more storage, but should be faster for those query operations as an index will be used.
I would convert the column to multiple bit-columns and rewrite the relevant code; bit masks are so much faster than string comparisons. But if you can't do that, you must use db-specific functions. Regular expressions could be an option:
-- Flavor: MySql
SELECT * FROM table WHERE column REGEXP "^.{499}1.{999}1"
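A minimal sketch of the bit-mask idea mentioned above (it assumes the flags are repacked into 64-bit integer columns with bits numbered from 0; the table and column names are hypothetical):
-- bit 500 lives in word 7 (500 DIV 64) at position 52 (500 MOD 64);
-- bit 1500 lives in word 23 (1500 DIV 64) at position 28 (1500 MOD 64)
SELECT id
FROM flags_table
WHERE (word_07 >> 52) & 1 = 1
  AND (word_23 >> 28) & 1 = 1;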
select substring(your_col, 500,1) as char500,
substring(your_col, 1500,1) as char1500 from your_table;
