DWH Ingest ALL to one table or Split out to smaller tables for Transformation (ETL prep)

Another DWH best-practice question 🙂
Our DBs in MongoDB follow a structure like so:
*company (DB name)
-- collectionId
-- collectionId_metrics
-- collectionId_dashboards
*activity_log (DB name)
-- collectionId
-- collectionId_actions
-- collectionId_notification
What is best practice on ingestion? Should I extract each 'sub-type' to a separate raw table, or just dump the whole DB into one big table and then parse it out during transformation?
i.e. TABLE = company or TABLE = activity_log, and then transform and split out,
OR
TABLE = company, TABLE = company_metrics, TABLE = company_dashboards, etc.?
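For illustration only, here is a minimal sketch of the first option (one big raw table, split later), assuming a Snowflake-style warehouse with a VARIANT column; the table, column, and document field names are hypothetical:

-- Hypothetical landing table: one row per raw document, tagged with its source collection.
create table raw_company (
    collection_name string,                                    -- e.g. 'collectionId_metrics'
    doc             variant,                                    -- the raw MongoDB document as JSON
    loaded_at       timestamp_ntz default current_timestamp()
);

-- The transformation step then splits the sub-types back out into their own tables.
create table company_metrics as
select doc:_id::string as mongo_id,        -- assumed document field
       doc             as metrics_doc,
       loaded_at
from   raw_company
where  endswith(collection_name, '_metrics');

The second option is simply one landing table per collection instead, which trades a simpler transformation layer for more ingestion objects to keep in sync with the source schema.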

Related

Best method to get today's data through a view in Snowflake

My warehouse details:
warehouse - XS
reading data via external tables from S3 into Snowflake
Refresh structure: SNS
I have the S3 folder structure as below
S3://eveningdtaa/2022-06-07/files -- contains parquet format
S3://eveningdtaa/2022-06-08/files -- contains parquet format
S3://eveningdtaa/2022-06-09/files -- contains parquet format
I am using external tables to read this S3 data in Snowflake.
So: tables hold the historical information,
views hold the daily data.
My view definition is as below:
create view result_view as (
select * from table1 where date_part=(select max(date_part) from table1)
)
My question: our daily view is running slow even though it has only 70k rows. Is there a way to rewrite my view to pick only the latest data instead of taking the max of the date, or to make this view run faster through some kind of indexing?
Thanks,
Xi
It may be rewritten using QUALIFY:
create view result_view
as
select *
from table1
qualify date_part=max(date_part) over();
It is also worth adding a partition on the date column: see Partitioning Parameters for external tables.
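For example, a date-partitioned external table might look roughly like the sketch below; the stage name @evening_stage and the folder layout are assumptions based on the S3 paths above:

create or replace external table table1 (
    -- derive the partition column from the folder name, e.g. '2022-06-07'
    date_part date as to_date(split_part(metadata$filename, '/', 1), 'YYYY-MM-DD')
)
partition by (date_part)
location = @evening_stage               -- assumed stage pointing at s3://eveningdtaa/
file_format = (type = parquet)
auto_refresh = true;                    -- kept up to date by the SNS notifications

With the partition column defined on the folder date, the max(date_part) filter in the view should only have to touch the latest folder's files instead of scanning the whole bucket; the Parquet rows themselves remain available through the implicit VALUE column.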

Transfer data from one database to another regarding keys

How can I transfer rows from two tables (Patient and ContactDetails) from DB1 to DB2?
Both DBs already have these two tables with data; I just want to add the data from these two tables in DB1 to DB2.
I tried following that,
but it didn't work, because there are some rows with the same keys and overwriting is forbidden.
Is there another way to do it, or am I missing something?
The patient and contactdetails relationship is:
patient inner join contactdetails
on (foreign_key) patient.contactdetailsid = (primary_key) contactdetails.id
Loop over the source contactdetails table, inserting each row one at a time and saving in a temp table the old contactdetails id together with the matching new contactdetails id (a sketch of such a SQL loop follows the temp-table definition below).
The temp table should be something like:
create table #temptableforcopy (
oldcontactdetailsid [insertheretherightdatatype],
newcontactdetailsid [insertheretherightdatatype]
)
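A hedged sketch of that loop, assuming contactdetails.id is an IDENTITY column and (hypothetical) name and phone columns:

-- Hypothetical sketch: the contactdetails column names (name, phone) are assumptions.
declare @oldid int, @name varchar(100), @phone varchar(50);

declare cd_cursor cursor local fast_forward for
    select id, name, phone from olddb.oldschema.contactdetails;

open cd_cursor;
fetch next from cd_cursor into @oldid, @name, @phone;

while @@fetch_status = 0
begin
    insert into newdb.newschema.contactdetails (name, phone)
    values (@name, @phone);

    -- remember which new id corresponds to which old id
    insert into #temptableforcopy (oldcontactdetailsid, newcontactdetailsid)
    values (@oldid, scope_identity());

    fetch next from cd_cursor into @oldid, @name, @phone;
end;

close cd_cursor;
deallocate cd_cursor;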
Then copy the data from the patient table joined to the temp table from the previous step, like this:
insert into newdb.newschema.patient (contactdetailsid, field1, field2, ...)
select TT.newcontactdetailsid,
old.field1,
old.field2,
...
from olddb.oldschema.patient old
join #temptableforcopy TT on TT.oldcontactdetailsid = old.contactdetailsid
Please note that my proposal is just a wild guess: you gave no information about structure, keys, or constraints, no detail about which key is preventing the copy or with which error message, the solution you already discarded, or the amount of data you have to deal with...

Merge query using two tables in SQL Server 2012

I am very new to SQL and SQL Server, and would appreciate any help with the following problem.
I am trying to update a share price table with new prices.
The table has three columns: share code, date, price.
The share code + date = PK
As you can imagine, if you have thousands of share codes and 10 years' data for each, the table can get very big. So I have created a separate table called a share ID table, and use a share ID instead in the first table (I was reliably informed this would speed up the query, as searching by integer is faster than string).
So, to summarise, I have two tables as follows:
Table 1 = Share_code_ID (int), Date, Price
Table 2 = Share_code_ID (int), Share_name (string)
So let's say I want to update the table/s with today's price for share ZZZ. I need to:
Look for the Share_code_ID corresponding to 'ZZZ' in table 2
If it is found, update table 1 with the new price for that date, using the Share_code_ID I just found
If the Share_code_ID is not found, update both tables
Let's ignore for now how the Share_code_ID is generated for a new code, I'll worry about that later.
I'm trying to use a merge query loosely based on the following structure, but have no idea what I am doing:
MERGE INTO [Table 1]
USING (VALUES (1,23-May-2013,1000)) AS SOURCE (Share_code_ID,Date,Price)
{ SEEMS LIKE THERE SHOULD BE AN INNER JOIN HERE OR SOMETHING }
ON Table 2 = 'ZZZ'
WHEN MATCHED THEN UPDATE SET Table 1.Price = 1000
WHEN NOT MATCHED THEN INSERT { TO BOTH TABLES }
Any help would be appreciated.
http://msdn.microsoft.com/library/bb510625(v=sql.100).aspx
You use Table1 as the target table and Table2 as the source table.
You want to take action when a given ID is not found in Table2 - in the source table.
In the documentation that you have already read, that corresponds to the clause
WHEN NOT MATCHED BY SOURCE ... THEN <merge_matched>
and the latter corresponds to
<merge_matched>::=
{ UPDATE SET <set_clause> | DELETE }
Ergo, you cannot insert into the source table there.
You could use triggers for auto-insertion when you insert something into Table1, but they would not be able to insert the proper Share_name - the trigger just won't know it.
So you have two options, I guess.
1) Make a T-SQL code block - look into stored procedures. I think there is also a construct for executing an anonymous code block in MS SQL, like the EXECUTE BLOCK command in Firebird, but I don't know that for sure.
2) Create an updatable SQL VIEW joining Table1 and Table2 to show the most current date, so that when you insert a row into this view, the view's on-insert trigger actually inserts rows into both tables, and when you update data through the view, the on-update trigger modifies the data.
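As a rough sketch of option 1 (not necessarily the exact code you need - the data types and the IDENTITY assumption on Table 2 are guesses), a single T-SQL block could look like this:

-- Hypothetical sketch: look up the share, create it if missing, then upsert the price.
declare @Share_name varchar(10)     = 'ZZZ',
        @Date       date            = '2013-05-23',
        @Price      decimal(18, 4)  = 1000;

declare @Share_code_ID int;

select @Share_code_ID = Share_code_ID
from   [Table 2]
where  Share_name = @Share_name;

if @Share_code_ID is null
begin
    -- new share: create it in the lookup table first
    insert into [Table 2] (Share_name) values (@Share_name);
    set @Share_code_ID = scope_identity();   -- assumes Share_code_ID is an IDENTITY column
end;

merge into [Table 1] as target
using (values (@Share_code_ID, @Date, @Price)) as source (Share_code_ID, [Date], Price)
    on  target.Share_code_ID = source.Share_code_ID
    and target.[Date] = source.[Date]
when matched then
    update set target.Price = source.Price
when not matched then
    insert (Share_code_ID, [Date], Price)
    values (source.Share_code_ID, source.[Date], source.Price);

Wrapped in a stored procedure (or a transaction), this covers both the "matched" and "not matched" paths described in the question.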

Hierarchical List of All tables

In a SQL Server DB, I have to find all the "master" (parent) tables and also build a hierarchical list of parent/child tables. Finally, I would like to traverse that hierarchical list from the bottom up and delete all the child table data, so that at the end I can delete the parent data as well.
I have tried one approach: using system tables (like sys.objects etc.) I queried the metadata of the DB (its primary and foreign keys), but I don't know how to formulate the tree-like structure.
Try this in SQL Server Management Studio:
EXEC sp_msdependencies @intrans = 1
If you insert the results into a temp table, you can then filter them down to just tables or just views, or use the proc's other, alternative parameters to do the same thing:
EXEC sp_msdependencies @intrans = 1, @objtype = 8 -- 8 = tables
EXEC sp_msdependencies @intrans = 1, @objtype = 3 -- 3 = tables is the correct one
Check this for more on hierarchical dependencies.
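If you prefer to build the tree yourself from the catalog views you already queried, a recursive CTE over sys.foreign_keys is one way to sketch it (this assumes no circular or self-referencing foreign keys, and a table with several parents will appear once per parent):

with fk as (
    select parent_tbl = object_name(f.referenced_object_id),   -- the "master" side
           child_tbl  = object_name(f.parent_object_id)        -- the table holding the FK
    from   sys.foreign_keys f
),
tree as (
    -- anchor: tables that are not a child of anything (top-level parents)
    select t.name as tbl, cast(null as sysname) as parent_tbl, 0 as lvl
    from   sys.tables t
    where  t.name not in (select child_tbl from fk)
    union all
    -- recurse: attach each child one level below its parent
    select fk.child_tbl, fk.parent_tbl, tree.lvl + 1
    from   fk
    join   tree on tree.tbl = fk.parent_tbl
)
select tbl, parent_tbl, lvl
from   tree
order by lvl desc;    -- deepest children first, i.e. the order in which to delete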

How to optimize oracle query for repeated full table sorts?

I have a database infrastructure where we are regularly (at least once a day) replicating the full content of tables from a source database to approximately 20 target databases. Due to the replication code in use (we have to use regular Oracle queries, with no control over or direct access to the source database), this results in 20 full-table sorts of the source table.
Is there any way to optimize for this in the query? I'm looking for something that would basically tell Oracle "I'm going to be repeatedly sorting this entire table". MySQL had an option with myisamchk where you could tell it to sort a table and keep it in sorted order, but obviously that wouldn't apply here, for multiple reasons.
Currently, there are also some intermediate tables involved (sync from A to B, then from B to C.) We do have control over the intermediate tables, so if there are tuning options there, that would be useful as well.
Generally, the queries are almost all of the very simplistic form:
select a, b, c, d, e, ... z from tbl1 order by a, b, c, d, e, ... z;
I'm aware of streams, but as described above, the primary source tables are outside of our control, so we won't be able to use streams there. (Additionally, those source tables are rebuilt completely from a snapshot daily, so streams wouldn't really work anyway.)
You could look into the multi-table INSERT feature. It should perform a single FULL SCAN and insert into multiple tables. Consider (10gR2):
SQL> CREATE TABLE t1 (ID NUMBER);
Table created
SQL> CREATE TABLE t2 (ID NUMBER);
Table created
SQL> INSERT ALL
2 INTO t1 VALUES (d_id)
3 INTO t2 VALUES (d_id)
4 /* your select goes here */
5 SELECT ROWNUM d_id FROM dual d CONNECT BY LEVEL <= 5;
10 rows inserted
SQL> SELECT COUNT(*) FROM t1;
COUNT(*)
----------
5
SQL> SELECT COUNT(*) FROM t2;
COUNT(*)
----------
5
You will have to check if it works over database links.
Some things that would help the sorting issue are indexes on the columns you are sorting on (and also joining the tables on, if they are not there already). You could also create materialized views that are already sorted, and Oracle would keep the sorted results cached.
You don't say exactly how the replication is done or the data volumes involved (or why you are sorting the data).
If the aim is to minimise the impact on the source database, your best bet may be to extract into an intermediate file and load the file into the destination databases. The sort could be done on the intermediate file (if plain text), or as part of either the export or import into the destination databases.
In the source database:
create table export_emp_info
organization external
( type oracle_datapump
default directory DATA_PUMP_DIR
location ('emp.dmp')
) as select emp_id, emp_name, dept_id from emp order by dept_id
/
Copy the file, then import it into the destination database:
create table import_emp_info
(EMP_ID NUMBER(12),
EMP_NAME VARCHAR2(100),
DEPT_ID NUMBER)
organization external
( type oracle_datapump
default directory DATA_PUMP_DIR
location ('emp.dmp')
)
/
insert into emp_info select * from import_emp_info;
If you don't want or can't have the external table on the source db, you can use a straight expdp of the emp table (possibly using NETWORK_LINK if you have limited access to the source database directory structure) and QUERY to do the ordering.
You could load data from source table A to an intermediate table B and then do a partition exchange between B and destination table C. Exact replication, no sorting involved.
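A minimal sketch of that partition-exchange idea (all object names here are assumptions; the destination table must be partitioned and the staging table must match its column layout):

-- Stage the data in a plain, non-partitioned copy of the destination's structure.
create table stage_b as select * from source_a where 1 = 0;

insert /*+ append */ into stage_b
select * from source_a;
commit;

-- Swap the staged rows into one partition of the destination table.
-- This is a data-dictionary operation: no rows are copied and nothing is sorted.
alter table dest_c
  exchange partition p_current
  with table stage_b
  without validation;   -- indexes on both sides need to line up if you add INCLUDING INDEXES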
This I/U/D form of replication is what the MERGE command is there for. It's very doubtful that an expensive sort-merge would be required, and I'd expect to see hash joins instead. As long as the hash table can be stored in memory the hash join is barely more expensive than scanning the tables.
A handy optimisation is to store a hash value based on the non-key attributes, so that you can join between source and target tables on the key column(s) and compare small hash values instead of the full set of columns - change detection made easy.
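A hedged sketch of that hash-based MERGE, with made-up table and column names; STANDARD_HASH needs Oracle 12c or later (ORA_HASH is an older alternative), and the concatenation needs care with NULLs and delimiter collisions:

-- Assumed layout: src and tgt share a key column id, payload columns a, b, c,
-- and tgt stores the hash of the payload in row_hash.
merge into tgt t
using (
    select id, a, b, c,
           standard_hash(a || '|' || b || '|' || c, 'MD5') as row_hash
    from   src
) s
on (t.id = s.id)
when matched then
    update set t.a = s.a, t.b = s.b, t.c = s.c, t.row_hash = s.row_hash
    where  t.row_hash <> s.row_hash          -- only touch rows that actually changed
when not matched then
    insert (id, a, b, c, row_hash)
    values (s.id, s.a, s.b, s.c, s.row_hash);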
