I have a single database of over 9,000 tables, built by ingesting an enormous dataset. Each table can vary in size, but the columns of each table are the same (day, month, year, time, measurement, altitude).
How can I merge all of these tables into a single table? I've read that a single table will be much more efficient to query for, say, all measurements within a specific month. Rebuilding the database itself would be a pain, and would take too much time.
Assuming you have the list of tables, execute the following SQL statement for each table:
INSERT INTO Dataset_all SELECT * FROM Dataset_123;
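If scripting 9,000 statements by hand is impractical, you can generate them from the catalog. A minimal sketch, assuming the source tables share a Dataset_ prefix and your database exposes INFORMATION_SCHEMA (both assumptions):
-- Generate one INSERT per source table, then run the generated statements.
SELECT CONCAT('INSERT INTO Dataset_all SELECT * FROM ', TABLE_NAME, ';')
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME LIKE 'Dataset_%'
  AND TABLE_NAME <> 'Dataset_all';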
I have a single MSSQL 2017 Standard table, let's call it myTable, with data going back to 2015, containing 206.4 million rows. Once INSERTed by the application, these rows are never modified or deleted. The table is actively collecting data, 24/7.
My goal is to reduce the data in this table to only the most recent full 6 months plus the current month, split into monthly partitions for easy pruning. myTable.dateCreated would determine which partition the data ultimately resides in.
(Unrelated, but mentioning in case it ends up being relevant: I have an existing application that replicates all data that gets stored in myTable out to a data warehouse for long term storage every 15 minutes; the main application is able to query myTable for recent data and the data warehouse for older data as needed.)
Because I want to prune the oldest month of data out of myTable each time a new month starts, partitioning myTable by month makes the most sense - I can simply SWITCH the oldest partition to a staging table, then truncate that staging table, without causing downtime or a performance hit on the main table.
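(For reference, that monthly pruning step would look roughly like the sketch below; the staging table, partition function name, and boundary value are illustrative assumptions, and the staging table must match myTable's structure and filegroup.)
-- Move the oldest month out of the partitioned table (a metadata-only operation), then discard it.
ALTER TABLE myTable SWITCH PARTITION 1 TO myTable_staging;
TRUNCATE TABLE myTable_staging;
-- Optionally merge the now-empty boundary so the partition layout stays compact.
ALTER PARTITION FUNCTION pfMonthly() MERGE RANGE ('2022-07-01');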
I've come up with the following plan, and my questions are simple: Is this the best way to approach this task, and will it keep downtime/performance degradation to a minimum?
Create a new table, myTable_pending, with the same exact table structure as myTable, EXCEPT that it will have a total of 7 monthly partitions (6 months retention plus current month) configured;
In one complete step: rename myTable to myTable_transfer, and rename myTable_pending to myTable. This should have the net effect of allowing incoming data to continue being stored, but now it will be in a partition for the month of 2023-01;
Step 3 is where I need advice... which of the following might be best to get the remaining 6mos + current data back into the now-partitioned myTable, or are there additional options I should consider?
OPTION 1: Run a Bulk Insert of just the most recent 6 months of data from myTable_transfer back into myTable, causing the data to end up in the correct partitions in the process (with the understanding that this may still take some time, but not as long as a bunch of INSERTs that would end up chewing on the transaction log);
OPTION 2: Run a DELETE against myTable_transfer to get rid of all data except the most recent full 6 months + current month, then set up and apply partitions on THIS table, causing SQL Server to reorganize the data into those partitions without affecting access or performance on myTable; after that I could just SWITCH the partitions from myTable_transfer into myTable for immediate access. (Related issue: since myTable is still collecting current data, and myTable_transfer will contain data from the current month as well, can the current-month partitions be merged?)
OPTION 3: Any other way to do this, so that myTable ends up with 6 months worth of data, properly partitioned, without significant downtime?
We ended up revising our solution. Since the original table was replicated to a data warehouse anyway, we simply renamed it and created a new, partitioned table to start collecting data from the rename point forward. This gave us the least downtime, the fastest schema changes, and the partitioning we needed to maintain the table efficiently going forward.
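A rough sketch of that approach, assuming a monthly partition function on dateCreated (object names, boundary dates, filegroup, and the abbreviated column list are illustrative assumptions):
-- Monthly partition function and scheme covering the 6-month window plus the current month.
CREATE PARTITION FUNCTION pfMonthly (datetime2)
    AS RANGE RIGHT FOR VALUES ('2022-07-01', '2022-08-01', '2022-09-01',
                               '2022-10-01', '2022-11-01', '2022-12-01', '2023-01-01');
CREATE PARTITION SCHEME psMonthly
    AS PARTITION pfMonthly ALL TO ([PRIMARY]);
-- New table with the same structure as myTable, created on the partition scheme.
CREATE TABLE dbo.myTable_pending
(
    id          bigint    NOT NULL,
    dateCreated datetime2 NOT NULL
    -- ... remaining columns identical to myTable ...
) ON psMonthly (dateCreated);
-- Swap the tables in one short transaction so incoming inserts continue with minimal interruption.
BEGIN TRAN;
    EXEC sp_rename 'dbo.myTable', 'myTable_transfer';
    EXEC sp_rename 'dbo.myTable_pending', 'myTable';
COMMIT;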
I have a Snowflake query with multiple CTEs that inserts into a table via a Talend job. It takes more than 90 minutes to execute. The query is a chain of cascading CTEs, each one calling the next.
I want to improve the performance of the query. It is about 1,000 lines of code, so I can't paste it here. The query profile shows that the window functions and aggregate functions are what slow it down.
For example, the slowest is:
ROW_NUMBER() OVER (PARTITION BY LOWER(S.SUBSCRIPTIONID)
ORDER BY S.ISROWCURRENT DESC NULLS FIRST,
TO_NUMBER(S.STARTDATE) DESC NULLS FIRST,
IFF(S.ENDDATE IS NULL, '29991231', S.ENDDATE) DESC NULLS FIRST)
takes 7.3% of the time. Can you suggest an alternative way to improve the performance of the query please?
The problem is that 1000 lines are very hard for any query analyzer to optimize. It also makes troubleshooting a lot harder for you and for a future team member who inherits the code.
I recommend breaking the query up and applying these optimizations:
Use CREATE TEMPORARY TABLE AS instead of CTEs, and add an ORDER BY on the column you will later join or filter on as you create each table. Temporary tables are easier for the optimizer to build and later use, and the ORDER BY tells Snowflake what to optimize for in subsequent joins to other tables. They're also easier to troubleshoot.
In your example, see if you can persist these expressions as permanent columns so that Snowflake can skip the conversion and have better statistics on them: TO_NUMBER(S.STARTDATE) and IFF(S.ENDDATE IS NULL, '29991231', S.ENDDATE).
As an alternative to step 2, instead of sorting by startDate and endDate, see if you can add an IDENTITY or SEQUENCE, or populate an INTEGER column, to use as the sort key. You can even literally name this new column sortKey. Sorting an integer will be significantly faster than running a function on a DATETIME and then ordering by the result.
If any of the CTEs can be changed into materialized views, they will be pre-built and significantly faster.
Finally, stage all of the data in a temporary table - ordered by the same columns as the target table - before you insert it. This makes the insert step itself quicker, and Snowflake will have an easier time handling concurrent changes to that table.
Notes:
To create a temporary table:
create or replace temporary table table1 as select * from dual;
After that, refer to table1 in your code instead of the CTE.
Materialized views are an Enterprise edition feature; they are covered in the Snowflake documentation. The syntax is:
create materialized view mymv as select col1, col2 from mytable;
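Putting suggestions 1-3 together, a minimal sketch based on the fragment above (the source table name subscriptions and the staged column names are assumptions):
-- Persist the expensive expressions once, ordered by the column used downstream.
create or replace temporary table subs_staged as
select
    s.*,
    lower(s.subscriptionid)                       as subscriptionid_lower,
    to_number(s.startdate)                        as startdate_num,
    iff(s.enddate is null, '29991231', s.enddate) as enddate_filled
from subscriptions s
order by subscriptionid_lower;
-- Later steps rank on the pre-computed columns instead of re-running the functions.
select *,
       row_number() over (partition by subscriptionid_lower
                          order by isrowcurrent desc nulls first,
                                   startdate_num desc nulls first,
                                   enddate_filled desc nulls first) as rn
from subs_staged;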
I have two tables (T_1 & T_2) with the same fields. What I need: every hour, T_2 should contain only the data that was inserted into T_1 within that hour (the previous hour's data will be erased). I am using SQL Server. Please help me.
Why would you set up two tables to do this?
Your use-case seems like a canonical case for table partitioning. This is a way of storing data in separate "units" (files). You seem to want T_1 to have its data split by hour.
Then you can directly access the data for a particular hour. This will be as efficient from an access perspective as copying the data into a separate table.
If you really wanted to, you could copy the most recent partition to another table every hour -- swapping in the new data for the older data. But that seems unnecessary in practice.
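For illustration, reading one hour's worth of data straight out of T_1 could look like the sketch below (the insert timestamp column name InsertedAt is an assumption; it should be the partitioning or indexed column):
-- Boundaries of the previous full hour.
DECLARE @hourStart datetime = DATEADD(HOUR, DATEDIFF(HOUR, 0, GETDATE()) - 1, 0);
DECLARE @hourEnd   datetime = DATEADD(HOUR, 1, @hourStart);
SELECT *
FROM T_1
WHERE InsertedAt >= @hourStart
  AND InsertedAt <  @hourEnd;  -- with partitioning on InsertedAt, only one partition is touched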
I have a table Transactions which has a columnstore index and stores 74,445,000 rows.
I have a query like the one below:
SELECT
CustomerNumber,
MONTH(CreationDate) AS m,
YEAR(CreationDate) AS y,
CurrencyID AS c
FROM Transactions
I am mulling whether a JOIN to a date dimension table that contains month and year for all dates might be better than the above query, which uses SQL date functions.
Can anyone verify this assumption and/or point to a resource that provides details?
Any alteration of the original column will have an impact on query performance. For the calculation of month & year in your query you should get very efficient Batch Execution Mode, which will make the alternatives look quite pale in comparison.
Plus, if your join is done on an integer/bigint column, then you might be able to get Segment Elimination, which should improve your query performance; joining on any string column will make the query last an incredibly long time compared with the int data types.
In other words: unnecessary headaches.
There are no big alternatives since Columnstore Indexes do not support computed columns yet (SQL Server 2016).
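For comparison, the date-dimension join being considered would look roughly like this (DimDate, DateKey, and CreationDateKey are assumed names, and Transactions would need to store the integer key); as noted above, an integer join key is what matters for segment elimination, and the MONTH()/YEAR() version is usually the simpler choice:
SELECT
    t.CustomerNumber,
    d.MonthNumber AS m,
    d.YearNumber  AS y,
    t.CurrencyID  AS c
FROM dbo.Transactions AS t
JOIN dbo.DimDate      AS d
    ON d.DateKey = t.CreationDateKey;  -- integer/bigint key, not a string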
I have a large database and use a query like this:
...WHERE (DATETIME > '30.10.2014 00:00:00' AND DATETIME < '03.11.2014 00:00:00')
My query is already ordered by the DATETIME field, so is it possible to stop the query once DATETIME < '03.11.2014 00:00:00' is reached for the first time, so that Oracle doesn't need to check the remaining rows? They aren't needed, and skipping them would save time.
Thanks!
You have basically 3 options here (ordered from best to worst):
If you create the table partitioned by the DATETIME column, the optimizer will scan only the relevant partitions (those in the range 30.10.2014 00:00:00 to 03.11.2014 00:00:00) instead of accessing the entire table.
Create the table as an IOT (Index Organized Table) — this way the table will be stored as an ordered B-Tree.
Create an index on the DATETIME column. This way, when accessing the column, you will scan the ordered index. This option has a major disadvantage — if the data which is being inserted into this table is "real time sequential data" (I mean that DATETIME column values are always increasing [sysdate for example] and not random [date of birth for example]), there will always be a Hot Block on your B-Tree index. This will cause contentions and probably many wait events (dependent on the data insertion rates of course). The way to "solve" this is to create this index reversed, but then you will have a big problem performing queries on ranges (like the query you've presented here), because the data will be scattered across the index and not stored sequentially.
So, my best advice for situations like this: work with partitions. It is the best way to work efficiently with large amounts of data in Oracle databases.
If there is not enough data in this table to justify partitions, then the table is not that big and you can consider options #2 and #3.
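A minimal sketch of option 1 using interval partitioning, so Oracle creates a new daily partition automatically (the table name and the columns other than DATETIME are assumptions):
-- Daily interval-partitioned table; the range predicate then reads only the
-- partitions covering 30.10.2014 - 03.11.2014 (partition pruning).
CREATE TABLE measurements (
    datetime DATE   NOT NULL,
    value    NUMBER
)
PARTITION BY RANGE (datetime)
INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
(
    PARTITION p_initial VALUES LESS THAN (DATE '2014-01-01')
);
-- The original range query benefits without changes (explicit date format shown for clarity).
SELECT *
FROM measurements
WHERE datetime > TO_DATE('30.10.2014 00:00:00', 'DD.MM.YYYY HH24:MI:SS')
  AND datetime < TO_DATE('03.11.2014 00:00:00', 'DD.MM.YYYY HH24:MI:SS');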
Best regards.