Running a query like:
select *
from (select * from tableA where date = '2020-07-01') as prev
join
(select * from tableB where di_data_dt = '2020-08-01') as cur
on prev.ID = cur.ID;
Query Profile shows:
Question:
Why does Snowflake load the first table, then the second table, and only then join? Why can't it load both together and save time?
P.S.: I am using an XL warehouse, and the tables are not so massive that Snowflake can't handle them together.
A Snowflake cluster utilizes all the available bandwidth for remote IO, and the transfer is already distributed across the cluster. If one table can be retrieved at a rate of X, then two tables retrieved concurrently would each come in at a rate of approximately 0.5 * X, so the total time would be about the same.
#mike-walton points out that doing the scans sequentially can result in faster queries, due to the partition pruning that can result from join filters.
Snowflake makes a plan for the entire query, so if there is one step that needs Partitions A & B from Table A, and a later sub-select that needs Partitions B & C, then A, B and C would be retrieved during one tablescan.
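As a rough illustration of that (taking the answer above at face value; this exact query is not from the question), both sub-selects below read tableA with different date filters, so the partitions needed by both filters would be fetched as part of a single scan of tableA:
select *
from (select * from tableA where date = '2020-07-01') as prev
join (select * from tableA where date = '2020-08-01') as cur
on prev.ID = cur.ID;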
I have a simple DB table with ONLY 5 columns, no primary key, and 7 billion+ (7,50,01,771) rows. Yes, you read it correctly. It has one clustered index.
DB table columns
Cluster index
If I write a simple SELECT query to get data, it takes 7-8 minutes to return the data. Now you get my next question: what techniques can I apply to this DB table so that I can get the data in time?
In the actual scenario where I use this table, it is joined with 2 temp tables that have WHERE clauses and filtered data. Please find my query below for reference.
SELECT dt.ZipFrom, dt.ZipTo, dt.Total_time,
       sz.storelocation, sz.AcctShip, sz.Licensee, sz.Entity
FROM #Zips z
INNER JOIN DriveTime_ZIPtoZIP dt ON dt.ZipFrom = z.zip
INNER JOIN #storeZips sz ON dt.ZipTo = sz.zip
ORDER BY z.zip DESC, dt.Total_time ASC
Thanks
You can index according to the WHERE conditions in the query. However, this comes at a cost: storage.
The ORDER BY clause is also important. If you have to use ORDER BY in your query, you can index accordingly.
But do not forget the cost of indexing ...
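If it helps, here is a hedged sketch of such an index for the query above (the index name is made up, and whether it actually helps depends on the execution plan and data distribution):
-- Hypothetical example: a nonclustered index covering the join columns used above,
-- with Total_time included so the sort can be served from the index.
CREATE NONCLUSTERED INDEX IX_DriveTime_ZipFrom_ZipTo
    ON DriveTime_ZIPtoZIP (ZipFrom, ZipTo)
    INCLUDE (Total_time);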
I have a local SQL Server DB table with about 5 million records.
I have a Snowflake server that has a similar table that is updated daily.
I need to update my local table with the new records that are added to the Snowflake table.
This code works, but it takes about an hour to retrieve about 200,000 records. I insert the records into a local temp table and then insert them into my SQL Server DB.
Is there a faster way to retrieve the records from Snowflake and get them into SQL Server?
TIA
JohnB
SELECT A.*
into #Sale2020New
FROM OPENQUERY(SNOW, 'SELECT * FROM "DATA"."PUBLIC"."Sales" where "Sales"."Date" >= ''1/1/2020'' and "Sales"."Date" <= ''12/31/2020'' ') A
Left JOIN [SnowFlake].[dbo].Sale2020 B
ON B.PrimaryKey = A.PrimaryKey
WHERE
b.PrimaryKey IS NULL;
Does it take 1 hour just retrieving data from Snowflake or the whole process?
To speed up data retrieval from Snowflake, implement clustering on the DATE column of the Snowflake table. This will prune micro-partitions and avoid a full table scan. You can get more information on clustering here.
As for the delta load, instead of a join you can apply a filter on the DATE column for the current date; this avoids a costly join operation and filters the data at the start.
SELECT * FROM "Sales"
WHERE "Sales"."Date" = '2020-04-07'
There is a database with two namespaces (main and archive) located on the same disk space.
Task: move records created before a certain point in time (the creation date is present in one of the tables) into identical tables in the archive namespace. (Oracle 12c; hundreds of billions of rows in each table.)
Because this will run on production, transferring only the necessary records to a new table is not an option, because the data will keep being updated in the process (SELECT/INSERT/UPDATE will be executed against the same tables of the main namespace).
Currently, the best option I have found is:
CREATE TABLE MAIN_NAME_SPACE.TEMP AS SELECT ID FROM MAIN_NAME_SPACE.MAIN_TABLE;
Create a temp table with the IDs for some period (~5 days) (done as a separate procedure);
PROCEDURE ARCH AS
  TYPE id_tab IS TABLE OF MAIN_NAME_SPACE.TABLE_1.ID%TYPE;
  ids id_tab;
BEGIN
  -- load the IDs to be archived into a collection
  SELECT ID BULK COLLECT INTO ids FROM MAIN_NAME_SPACE.TEMP;
  FORALL i IN 1 .. ids.COUNT
    DELETE FROM ARCH_NAME_SPACE.TABLE_1 T WHERE T.ID = ids(i);
  FORALL i IN 1 .. ids.COUNT
    INSERT INTO ARCH_NAME_SPACE.TABLE_1
      SELECT * FROM MAIN_NAME_SPACE.TABLE_1 T WHERE T.ID = ids(i);
  ...
  FORALL i IN 1 .. ids.COUNT
    DELETE FROM ARCH_NAME_SPACE.TABLE_N T WHERE T.ID = ids(i);
  FORALL i IN 1 .. ids.COUNT
    INSERT INTO ARCH_NAME_SPACE.TABLE_N
      SELECT * FROM MAIN_NAME_SPACE.TABLE_N T WHERE T.ID = ids(i);
END ARCH;
We read the IDs into a collection and, for each table in the archive namespace, use FORALL to delete the records (if they already exist there) and insert the new ones;
PROCEDURE DELETE AS  -- note: DELETE is a reserved word, so in practice the procedure needs another name
  TYPE id_tab IS TABLE OF MAIN_NAME_SPACE.TABLE_1.ID%TYPE;
  ids id_tab;
BEGIN
  SELECT ID BULK COLLECT INTO ids FROM MAIN_NAME_SPACE.TEMP;
  FORALL i IN 1 .. ids.COUNT
    DELETE FROM MAIN_NAME_SPACE.TABLE_1 T WHERE T.ID = ids(i);
  ...
  FORALL i IN 1 .. ids.COUNT
    DELETE FROM MAIN_NAME_SPACE.TABLE_N T WHERE T.ID = ids(i);
END;
Another procedure deletes the data from the main namespace in the same way (if this piece is left inside the same procedure, the time increases to 2.5 hours for unknown reasons).
But the speed leaves much to be desired: transferring 10 million records to the archive takes 43 minutes, and deleting them from the main namespace takes 1h 5min.
Is there any other way to speed this up? (Earlier, before the upgrade to 12c, all of this ran through a cursor very slowly and was started very rarely.)
P.S.: tables are not partitioned.
Also, maybe somebody can explain why DELETE/INSERT operations work faster with INVISIBLE indexes? I do not understand how that works.
Thanks in advance.
I am creating a query which takes data from multiple other databases through DB links.
One of the tables in the query, "ord", is immensely large, say more than 50 million rows.
Now I want to write the query so that it traverses the data and retrieves the required rows based on the partitions defined in t1,
i.e. if ord has 50 partitions with 1 million records each, I want to run the whole query on the first partition, get the result, then move to the 2nd, 3rd, and so on, up to the last partition.
How can I do that?
Please consider the sample query below, where from the local DB I am accessing all the remote DBs using DB links.
This query lists all the orders which are active.
Select ord.order_no,
ord.customer_id,
ord.order_date,
cust.customer_id,
cust.cust_name,
adr.street,
adr.city,
adr.state,
ship.ship_addr_street,
ship.ship_addr_city,
ship.ship_addr_state,
ship.ship_date
from order@ordDB ord
inner join customer@custDB cust on cust.customer_id = ord.customer_id
inner join address@adrDB adr on adr.address_id = cust.address_id
inner join shipment@shipDB ship on ship.shipment_id = ord.shipment_id
where ord.active = 'true';
Now there is a field "partition_key" defined in this table, and each key value is associated with about 1 million rows. I want to restructure the query so that it takes one partition of Order at a time, runs the whole query on that partition, and moves on to the next partition until the whole table has been processed.
Please help me create a sample query.
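One possible way to express this, as a rough sketch only (the PL/SQL loop, the local table order_report, and everything not taken from the query above are assumptions):
-- Hypothetical sketch: run the same query once per partition_key value.
BEGIN
  FOR p IN (SELECT DISTINCT partition_key FROM order@ordDB) LOOP
    INSERT INTO order_report  -- assumed local staging table
      SELECT ord.order_no, ord.customer_id, ord.order_date,
             cust.customer_id, cust.cust_name,
             adr.street, adr.city, adr.state,
             ship.ship_addr_street, ship.ship_addr_city,
             ship.ship_addr_state, ship.ship_date
      FROM order@ordDB ord
      INNER JOIN customer@custDB cust ON cust.customer_id = ord.customer_id
      INNER JOIN address@adrDB adr ON adr.address_id = cust.address_id
      INNER JOIN shipment@shipDB ship ON ship.shipment_id = ord.shipment_id
      WHERE ord.active = 'true'
        AND ord.partition_key = p.partition_key;
    COMMIT;
  END LOOP;
END;
/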
I have a database infrastructure where we regularly (at least once a day) replicate the full content of tables from a source database to approximately 20 target databases. Due to the replication code in use (we have to use regular Oracle queries, with no control over or direct access to the source database), this results in 20 full-table sorts of the source table.
Is there any way to optimize for this in the query? I'm looking for something that would basically tell Oracle "I'm going to be repeatedly sorting this entire table". MySQL had an option with myisamchk where you could tell it to sort a table and keep it in sorted order, but obviously that wouldn't apply here, for multiple reasons.
Currently, there are also some intermediate tables involved (sync from A to B, then from B to C.) We do have control over the intermediate tables, so if there are tuning options there, that would be useful as well.
Generally, the queries are almost all of the very simplistic form:
select a, b, c, d, e, ... z from tbl1 order by a, b, c, d, e, ... z;
I'm aware of streams, but as described above, the primary source tables are outside of our control, so we won't be able to use streams there. (Additionally, those source tables are rebuilt completely from a snapshot daily, so streams wouldn't really work anyway.)
You could look into the multi-table INSERT feature. It should perform a single FULL SCAN of the source and will insert into multiple tables. Consider (10gR2):
SQL> CREATE TABLE t1 (ID NUMBER);
Table created
SQL> CREATE TABLE t2 (ID NUMBER);
Table created
SQL> INSERT ALL
2 INTO t1 VALUES (d_id)
3 INTO t2 VALUES (d_id)
4 /* your select goes here */
5 SELECT ROWNUM d_id FROM dual d CONNECT BY LEVEL <= 5;
10 rows inserted
SQL> SELECT COUNT(*) FROM t1;
COUNT(*)
----------
5
SQL> SELECT COUNT(*) FROM t2;
COUNT(*)
----------
5
You will have to check if it works over database links.
Some things that would help the sorting issue are indexes on the columns that you are sorting on (and also joining the tables on, if they are not there already). You could also create materialized views that are already sorted, and Oracle would keep the sorted results cached.
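As a hedged illustration of the materialized view idea (the names follow the example query above; note that Oracle only honours the ORDER BY during the initial build, not on refresh):
-- Hypothetical sketch: a materialized view over the replicated table.
CREATE MATERIALIZED VIEW mv_tbl1_sorted
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
AS
SELECT a, b, c, d, e FROM tbl1
ORDER BY a, b, c, d, e;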
You don't say exactly how the replication is done or the data volumes involved (or why you are sorting the data).
If the aim is to minimise the impact on the source database, your best bet may be to extract into an intermediate file and load the file into the destination databases. The sort could be done on the intermediate file (if plain text), or as part of either the export or import into the destination databases.
In source database :
create table export_emp_info
organization external
( type oracle_datapump
default directory DATA_PUMP_DIR
location ('emp.dmp')
) as select emp_id, emp_name, dept_id from emp order by dept_id
/
Copy the file, then import it into the destination database:
create table import_emp_info
(EMP_ID NUMBER(12),
EMP_NAME VARCHAR2(100),
DEPT_ID NUMBER)
organization external
( type oracle_datapump
default directory DATA_PUMP_DIR
location ('emp.dmp')
)
/
insert into emp_info select * from import_emp_info;
If you don't want or can't have the external table on the source db, you can use a straight expdp of the emp table (possibly using NETWORK_LINK if you have limited access to the source database directory structure) and QUERY to do the ordering.
You could load data from source table A to an intermediate table B and then do a partition exchange between B and destination table C. Exact replication, no sorting involved.
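A rough sketch of that exchange, with all object names as placeholders (the destination table must be partitioned and the staging table must match its structure):
-- Hypothetical sketch: swap staging table B into one partition of target table C.
ALTER TABLE c
  EXCHANGE PARTITION p_current WITH TABLE b
  INCLUDING INDEXES WITHOUT VALIDATION;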
This I/U/D form of replication is what the MERGE command is there for. It's very doubtful that an expensive sort-merge would be required, and I'd expect to see hash joins instead. As long as the hash table can be stored in memory the hash join is barely more expensive than scanning the tables.
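A minimal MERGE sketch of that insert/update pattern (key and attribute column names are assumptions, not from the question):
-- Hypothetical sketch: insert-or-update the target from the source in one pass.
MERGE INTO target_tbl t
USING source_tbl s
ON (t.id = s.id)
WHEN MATCHED THEN
  UPDATE SET t.a = s.a, t.b = s.b
WHEN NOT MATCHED THEN
  INSERT (id, a, b) VALUES (s.id, s.a, s.b);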
A handy optimisation is to store a hash value based on the non-key attributes, so that you can join between source and target tables on the key column(s) and compare small hash values instead of the full set of columns - change detection made easy.
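For example (a sketch with assumed names), if both tables carry a stored row_hash column computed from the non-key attributes, changed rows can be found like this:
-- Hypothetical sketch: row_hash is a stored column, e.g. ORA_HASH(a || '|' || b || '|' || c).
SELECT s.id
FROM   source_tbl s
JOIN   target_tbl t ON t.id = s.id
WHERE  s.row_hash <> t.row_hash;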