I currently have an Azure SQL Database (Standard 100 DTUs S3) and I'm wanting to create partitions on a large table splitting a datetime2 value into YYYYMM. Each table has at least the following columns:
Guid (uniqueidentifier type)
MsgTimestamp (datetime2 type) << partition using this.
I've been looking on Azure documentation and SO but can't find anything that clearly says how to create a partition on a 'datetime2' in the desired format or even if it's supported on the SQL database type.
Another example if trying the link below, but I do not find the option to create a partition within SQL Studio to create a partition on the Storage menu.
https://www.sqlshack.com/database-table-partitioning-sql-server/
In addition, would these tables have to be created daily as the clock goes past 12am or is this done automatically?
UPDATE
I suspect I may have to manually create the partitions using the first link below and then at the beginning of each month, use the second link to create the next months partition table in advance.
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-partition-function-transact-sql?view=sql-server-ver15
https://learn.microsoft.com/en-us/sql/t-sql/statements/alter-partition-function-transact-sql?view=sql-server-ver15
Context
I currently connect into a real-time feed that feeds upto 600 rows a minute and have a backlog of around 370 million for 3 years worth of data.
Correct.
You can create partitions based upon datetime2 columns. Generally, you'd just do that on the start of month date, and you'd use a RANGE RIGHT (so that the start of the month is included in the partition).
And yes, at the end of every month, the normal action is to:
Split the partition function to add a new partition option.
Switch out the oldest monthly partition into a separate table for archiving purposes (presuming you want to have a rolling period of months)
And another yes, we all wish the product had options to do this for you automatically.
I was one of the tech reviewers on the following whitepaper by Ron Talmage, back in 2008 and 99% of the advice in it is still current:
https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2008/dd578580(v=sql.100)
Related
I have a single MSSQL 2017 Standard table, let's call it myTable, with data going back to 2015, containing 206.4 million rows. Once INSERTed by the application, these rows are never modified or deleted. The table is actively collecting data, 24/7.
My goal is to reduce the data in this table to only the most recent full 6 months plus current month, into monthly-based partitions for easy monthly pruning. myTable.dateCreated would determine which partition the data ultimately resides.
(Unrelated, but mentioning in case it ends up being relevant: I have an existing application that replicates all data that gets stored in myTable out to a data warehouse for long term storage every 15 minutes; the main application is able to query myTable for recent data and the data warehouse for older data as needed.)
Because I want to prune the oldest one month worth of data out of myTable each time a new month starts, partitioning myTable by month makes the most sense - I can simply SWITCH the oldest partition to a staging table, then truncate that staging table without causing downtime or performance on the main table.
I've come up with the following plan, and my questions are simple: Is this the best way to approach this task, and will it keep downtime/performance degradation to a minimum?
Create a new table, myTable_pending, with the same exact table structure as myTable, EXCEPT that it will have a total of 7 monthly partitions (6 months retention plus current month) configured;
In one complete step: rename myTable to myTable_transfer, and rename myTable_pending to myTable. This should have the net effect of allowing incoming data to continue being stored, but now it will be in a partition for the month of 2023-01;
Step 3 is where I need advice... which of the following might be best to get the remaining 6mos + current data back into the now-partitioned myTable, or are there additional options I should consider?
OPTION 1: Run a Bulk Insert of just the most recent 6 months of data from myTable_transfer back into myTable, causing the data to end up in the correct partitions in the process (with the understanding that this may still take some time, but not as long as a bunch of INSERTs that would end up chewing on the transaction log);
OPTION 2: Run a DELETE against myTable_transfer, getting rid of all data except the most recent full 6 months + current, and then set up and apply partitions on THIS table, that would then cause SQL Server to reorganize the data into those partitions, but without affecting access or performance on myTable, after which I could just SWITCH the partitions from myTable_transfer into myTable for immediate access; (related issue: since myTable is still collecting current data, and myTable_transfer will contain data from the current month as well, can the current month partitions be merged?)
OPTION 3: Any other way to do this, so that myTable ends up with 6 months worth of data, properly partitioned, without significant downtime?
We ended up revising our solution, since the original table was replicated to a data warehouse anyway, we simply renamed the table and created a new one with partitioning to start collecting new data from the rename point. This provided the least amount of downtime, the fastest schema changes, and gave us the partitioning we needed to maintain the table efficiently going forward.
I have a Table in sql server consisting of 200 million records in two different servers. I need to move this table from Server 1 to Server 2.
Table in server 1 can be a subset or a superset of the table in server 2. Some of the records(around 1 million) in server 1 are updated which I need to update in server 2. So currently I am following this approach :-
1) Use SSIS to move data from server 1 to staging database in server 2.
2) Then compare data in staging with the table in server 2 column by column. If any of the column is different, I update the whole row.
This is taking a lot of time. I tried using hashbytes inorder to compare rows like this:-
HASHBYTES('sha',CONCAT(a.[account_no],a.[transaction_id], ...))
<>
HASHBYTES('sha',CONCAT(b.[account_no],b.[transaction_id], ...))
But this is taking even more time.
Any other approach which can be faster and can save time?
This is a problem that's pretty common.
First - do not try and do the updates directly in SQL - the performance will be terrible, and will bring the database server to its knees.
In context, TS1 will be the table on Server 1, TS2 will be the table on Server 2
Using SSIS - create two steps within the job:
First, find the deleted - scan TS2 by ID, and any TS2 ID that does not exist in TS1, delete it.
Second, scan TS1, and if the ID exists in TS2, you will need to update that record. If memory serves, SSIS can inspect for differences and only update if needed, otherwise, just execute the update statement.
While scanning TS1, if the ID does not exist in TS2, then insert the record.
I can't speak to performance on this due to variations in schemas as servers, but it will be compute intensive to analyze the 200mm records. It WILL take a long time.
For on-going execution, you will need to add a "last modified date" timestamp to each record and a trigger to update the field on any legitimate change. Then use that to filter out your problem space. The first scan will not be terrible, as it ONLY looks at the IDs. The insert/update phase will actually benefit from the last modified date filter, assuming the number of records being modified is small (< 5%?) relative to the overall dataset. You will also need to add an index to that column to aid in the filtering.
The other option is to perform a burn and load each time - disable any constraints around TS2, truncate TS2 and copy the data into TS2 from TS1, finally reenabling the constraints and rebuild any indexes.
Best of luck to you.
What is the correct way to model data in a star schema such that a BI tool (such as PowerBI) can select a date range crossing multiple days?
I've currently got fact tables that have separate date and time dimensions. My time resolution is to the second, date resolution is to the day.
It's currently very easy to do aggregation providing the data of interest is in the same day, or even multiple complete days, but it becomes a lot more complicated when you're asking for, say, a 12 hour rolling window that crosses the midnight boundary.
Yes, I can write a SQL statement to first pull out all rows for the entirety of the days in question, and then by storing the actual date time as a field in the fact table I can further filter down to the actual time range I'm interested in, but that's not trivial (or possible in some cases) to do in BI reporting tools.
However this must be a frequent scenario in data warehouses... So how should it be done?
An example would be give me the count of ordered items from the fact_orders table between 2017/Jan/02 1600 and 2017/Jan/03 0400.
Orders are stored individually in the fact_orders table.
In my actual scenario I'm using Azure SQL database, but it's more of a general design question.
Thank you.
My first option would be (as you mention in the question) to include a calculated column (Date + Time) in the SQL query and then filter the time part inside the BI tool.
If that doesn't work, you can create a view in the database to achieve the same effect. The easiest is to take the full joined fact + dimensions SQL query that you'd like to use in the BI tool and add the date-time column in the view.
Be sure to still filter on the Date field itself to allow index use! So for your sliding window, your parameters would be something like
WHERE Date between 2017/Jan/02 AND 2017/Jan/03
AND DateTime between 2017/Jan/02 1600 and 2017/Jan/03 0400
If that doesn't perform well enough due to data volumes, you might want to set up and maintain a separate table or materialized view (depending on your DB and ETL options) that does a Cartesian join of the time dimension with a small range of the Date dimension (only the last week or whatever period you are interested in partial day reports), then join the fact table to that.
The DateTimeWindow table/view would be indexed on the DateTime column and have only two extra columns: DateKey and TimeKey. Inner join that to the fact table using both keys and you should get exactly the window you want when the BI tool supplies a datetime range.
That is not easily modeled. A solution would be to build a additional dimension with date + time. Of course this could means you have to severely limit the granularity of the time dimension.
10 year hour granularity: 365 * 10 * 24 = 87600 rows
10 year minute granularity: 365 * 10 * 24 * 60 = 5256000 rows
You could use just this dimension, or (better) add it and do not show it to all users. It would means an additional key in the fact table: if the FT is not gigantic, no big deal.
I have a SQL Server database. The size is about 150GB which saves some data for analysis. Each day, new data comes in and we need to delete old data (based on date). Recently, the daily data size increase a lot, it will be about 8-9GB per day soon.
Currently, we delete in small batch, which takes a very long time to finish. Is there a general guide to make it faster? Tried to drop/disable index before delete, after delete finished, then rebuild index. It does not help much.
Or, this will totally depend on the actual date?
Thanks
Given the amount of data I would use a partitioned table, one for each day.
Swapping partitions in and out is going to be the fastest way to delete all data for one day.
EDIT: since truncating a partition is not as trivial as it should be in SQL Server, I figured I'd provide more details, in case you're not familiar with partitions.
In the next release of SQL Server, you should be able to just TRUNCATE PARTITION or something like that. In the meantime you have to proceed as follows:
The quickest way to delete a day of data in your database is to have the table partitioned by day and then:
Swap out the partition that you want to delete to another table: ALTER TABLE partitioned SWAP PARTITION n TO otherTableToDelete.
TRUNCATE TABLE otherTableToDelete.
I have a huge database, it process the email traffic everyday. In the system, it needs delete some old emails everyday:
Delete from EmailList(nolock)
WHERE EmailId IN (
SELECT EmailId
FROM Emails
WHERE EmailDate < DATEADD([days], -60, GETDATE())
)
It works, but the problem is: it takes a long time to finish and the log file becomes very huge because of this. The log file size increases more than 100GB everyday.
I'm thinking we can change it to
Delete from EmailList(nolock)
WHERE EXISTS (
SELECT EmailId
FROM Emails
WHERE (Emails.EmailId = EmailList.EmailId) AND
(EmailDate < DATEADD([days], -60, GETDATE()))
)
But other than this, is there anything we can do to improve the performance. most of all, reduce the log file size?
EmailId is indexed.
Ive seen
GetDate()-60
style syntax perform MUCH better than
DATEADD([days], -60, GETDATE()))
especially if there is an index on the date column. A few fellow DBAs and I had spent quite a bit of time trying to understand WHY it would perform better, but the result was in the pudding.
Another thing you might want to consider, considering the volume of records I presume you have to delete, is to chunk the deletes in batches of say 1000 or 10000 records. This would probably speed up the delete process.
Have you tried parition by date, then you can just drop the table versions for the days yo uare not interested in anymore. Given a "hugh" database you for sure do run enterprise edition of SQL Server (after all, hugh is bigger than very large) and that has table partitioning.
[EDIT]:
regarding #TomTom's comment:
If you have SQL Server Enterprise edition available you should use Table Partitioning.
If this is not the case, my original post may be helpfull:
[ORIGINAL POST]
Deleting a large amount of data is always difficult. I faced the same problem and I went with the following solution:
Depending on your requirements this will not work, but maybe you can get some ideas from it.
Instead of using 1 table, use 2 tables, with the same schema. Create a synonym (I assume you are using MS SQL server) that points to the "active table of the 2 tables (active means, this is the table that you currently write to). Use this synyonym for the inserts in your application, or instead of using the synonym, the application could just change the table each x days it writes to.
Every x days you can truncate the old/inactive table and afterwards recreate the synonym to target the truncated table (if you use the synonnym solution), so effectively you are partitioning the data per time.
You have to synchronize the switch of the active table. I automated this completely, by using a shared App-lock for the application, and an Exclusive Applock when changing the synonym (== blocking the writing application during the switching process).
If changing your applicaiotn's code is not an option, consider using the same principle but instead of writing to the synonym you could create a view with instead of triggers (the insert operation would insert into the "active" partition). The trigger code would need syhcnronize using something like the Applock as mentioned above (so that writes during the switching process work).
My solution is a litte more complex, so I currently cannot post the code here, But it works without problems for a high load application and the swithcingt/cleanup process is completely automated.