Partition table based on variable value - SQL Server 2008

Can I specify a range so that all rows whose CreatedDate value is more than one month older than GETDATE() are placed in one partition and the rest in another, so that I can query the 2nd partition for the latest data and the 1st for archived data?

No, you can't. A partition function must be deterministic. Deterministic functions always return the same result any time they are called with a specific set of input values. Unfortunately, you can't use GETDATE(), because GETDATE() is a nondeterministic function.
See http://shannonlowder.com/2010/08/partitioning/ for more details
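For illustration, here is a minimal sketch of a valid partition function; the boundary date is an arbitrary constant chosen for this example:

-- Boundary values are fixed once the function is created; they cannot
-- slide along with GETDATE().
CREATE PARTITION FUNCTION pfArchive (datetime)
AS RANGE RIGHT FOR VALUES ('2012-06-01');
-- CreatedDate <  '2012-06-01' -> partition 1 (archive)
-- CreatedDate >= '2012-06-01' -> partition 2 (recent)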
@Ismail
There are alternatives:
Create a bit column LastMonth and a partition function based on that column. You would need to update the field every day, before you start using your data. You don't have to do it daily, though; it may be better to update the column that flags your fresh data (or change your partition function) once per period of your choosing (week/month/quarter).
I haven't tried this approach; you may need to run some maintenance on the table to get full performance back after updating the column.
Another idea that might work is to create a partition for every month and change filegroups when a new month starts. For example, if you want your latest data on fast disk f: and history on s:, you would have PartitionJan on s: and PartitionFebruary on f:; when March starts, move PartitionFebruary to s: and start using PartitionMarch on f:.
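As a rough sketch of that monthly-partition idea, assuming filegroups FastFG (on f:) and SlowFG (on s:) already exist; all names and dates are illustrative:

CREATE PARTITION FUNCTION pfMonthly (datetime)
AS RANGE RIGHT FOR VALUES ('2012-01-01', '2012-02-01');

-- Two boundaries create three partitions; map the newest month to the
-- fast disk and everything older to the slow one.
CREATE PARTITION SCHEME psMonthly
AS PARTITION pfMonthly TO (SlowFG, SlowFG, FastFG);

Note that moving an existing partition to another filegroup is not a single command; it generally means rebuilding the clustered index on the new filegroup or switching the partition out and back in.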

Related

Converting Large Data Table To Use Partitions

I have a single MSSQL 2017 Standard table, let's call it myTable, with data going back to 2015, containing 206.4 million rows. Once INSERTed by the application, these rows are never modified or deleted. The table is actively collecting data, 24/7.
My goal is to reduce the data in this table to only the most recent full 6 months plus the current month, in monthly partitions for easy monthly pruning. myTable.dateCreated would determine which partition the data ultimately resides in.
(Unrelated, but mentioning in case it ends up being relevant: I have an existing application that replicates all data that gets stored in myTable out to a data warehouse for long term storage every 15 minutes; the main application is able to query myTable for recent data and the data warehouse for older data as needed.)
Because I want to prune the oldest month's worth of data out of myTable each time a new month starts, partitioning myTable by month makes the most sense - I can simply SWITCH the oldest partition to a staging table, then truncate that staging table, without causing downtime or a performance hit on the main table.
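For illustration, the switch-and-truncate step could look like the sketch below, assuming the oldest month sits in partition 1 and an empty staging table with identical structure already exists (names are hypothetical):

-- Both steps are metadata-only or minimally logged, so the main table
-- stays available.
ALTER TABLE dbo.myTable SWITCH PARTITION 1 TO dbo.myTable_staging;
TRUNCATE TABLE dbo.myTable_staging;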
I've come up with the following plan, and my questions are simple: Is this the best way to approach this task, and will it keep downtime/performance degradation to a minimum?
Create a new table, myTable_pending, with the same exact table structure as myTable, EXCEPT that it will have a total of 7 monthly partitions (6 months retention plus current month) configured;
In one complete step: rename myTable to myTable_transfer, and rename myTable_pending to myTable. This should have the net effect of allowing incoming data to continue being stored, but now it will be in a partition for the month of 2023-01;
Step 3 is where I need advice... which of the following might be best to get the remaining 6mos + current data back into the now-partitioned myTable, or are there additional options I should consider?
OPTION 1: Run a Bulk Insert of just the most recent 6 months of data from myTable_transfer back into myTable, causing the data to end up in the correct partitions in the process (with the understanding that this may still take some time, but not as long as a bunch of INSERTs that would end up chewing on the transaction log);
OPTION 2: Run a DELETE against myTable_transfer to get rid of all data except the most recent full 6 months + current month, then set up and apply partitioning on THIS table, which would cause SQL Server to reorganize the data into those partitions without affecting access or performance on myTable; after that I could simply SWITCH the partitions from myTable_transfer into myTable for immediate access. (Related issue: since myTable is still collecting current data, and myTable_transfer will contain data from the current month as well, can the current-month partitions be merged?)
OPTION 3: Any other way to do this, so that myTable ends up with 6 months worth of data, properly partitioned, without significant downtime?
We ended up revising our solution. Since the original table was replicated to a data warehouse anyway, we simply renamed the table and created a new one with partitioning to start collecting new data from the rename point. This gave us the least downtime, the fastest schema changes, and the partitioning we needed to maintain the table efficiently going forward.
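A rough sketch of that rename swap, assuming a monthly partition scheme (psMonthly here) already exists; the column list is abbreviated and all names are illustrative:

BEGIN TRAN;
EXEC sp_rename 'dbo.myTable', 'myTable_transfer';

CREATE TABLE dbo.myTable
(
    id          bigint IDENTITY(1,1) NOT NULL,
    dateCreated datetime2 NOT NULL,
    payload     nvarchar(200) NULL,  -- stand-in for the remaining columns
    CONSTRAINT PK_myTable PRIMARY KEY CLUSTERED (dateCreated, id)
) ON psMonthly (dateCreated);
COMMIT;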

Azure SQL Database Partitioning

I currently have an Azure SQL Database (Standard 100 DTUs S3) and I want to create partitions on a large table, splitting a datetime2 value by YYYYMM. Each table has at least the following columns:
Guid (uniqueidentifier type)
MsgTimestamp (datetime2 type) << partition using this.
I've been looking at the Azure documentation and SO, but I can't find anything that clearly says how to create a partition on a 'datetime2' in the desired format, or even whether it's supported on this SQL database type.
Another example I tried is the link below, but I cannot find the option on the Storage menu in SQL Studio to create a partition.
https://www.sqlshack.com/database-table-partitioning-sql-server/
In addition, would these tables have to be created daily as the clock goes past 12am or is this done automatically?
UPDATE
I suspect I may have to manually create the partitions using the first link below, and then at the beginning of each month use the second link to create the next month's partition in advance.
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-partition-function-transact-sql?view=sql-server-ver15
https://learn.microsoft.com/en-us/sql/t-sql/statements/alter-partition-function-transact-sql?view=sql-server-ver15
Context
I currently connect to a real-time feed that delivers up to 600 rows a minute, and I have a backlog of around 370 million rows covering 3 years of data.
Correct.
You can create partitions based upon datetime2 columns. Generally, you'd just do that on the start of month date, and you'd use a RANGE RIGHT (so that the start of the month is included in the partition).
And yes, at the end of every month, the normal action is to:
Split the partition function to add a new partition.
Switch out the oldest monthly partition into a separate table for archiving purposes (presuming you want to keep a rolling window of months).
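A minimal sketch of those two steps, assuming a partition function pfMonthly, a scheme psMonthly, an archive table dbo.Messages_Archive with matching structure, and example boundary dates:

-- 1) Add next month's boundary in advance (split).
ALTER PARTITION SCHEME psMonthly NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pfMonthly() SPLIT RANGE ('2021-07-01');

-- 2) Switch the oldest month out for archiving, then remove its boundary.
ALTER TABLE dbo.Messages SWITCH PARTITION 1 TO dbo.Messages_Archive;
ALTER PARTITION FUNCTION pfMonthly() MERGE RANGE ('2021-01-01');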
And another yes, we all wish the product had options to do this for you automatically.
I was one of the technical reviewers of the following whitepaper by Ron Talmage, back in 2008, and 99% of the advice in it is still current:
https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2008/dd578580(v=sql.100)

Why is this arithmetic calculation faster than casting to nvarchar?

The following query is faster (1:05):
SELECT DATEPART(DW, DATEFROMPARTS(
    FLOOR(20180501 / 10000),                                   -- year:  2018
    FLOOR(20180501 - FLOOR(20180501 / 10000) * 10000) / 100,   -- month: 5
    FLOOR(20180501 - FLOOR(20180501 / 100) * 100)))            -- day:   1
GO 1000
Than (1:10):
SELECT DATEPART(DW,CAST(CAST(20180501 AS nvarchar) AS DATE))
GO 1000
Why?
I have a table with roughly 2 billion records, so the difference becomes important. There is far more logic behind the hardcoded date. Otherwise, if there is a better approach, in terms of performance, for executing the same logic, feel free to correct me.
The date column is always an integer, but it does not always have the same format: two formats occur, YYYYMMDD and YYYYMM. I know, a bit of a mess.
Thanks!

Delete duplicate rows when the first day of the month (YYYYMM01) is Monday

If you want to speed up the delete, create a temporary table (or a permanent one if this is a recurring operation) with a column of the same data type as your table's "date" column, containing all first Mondays of each month across XX years. Make sure the data is in the same format you mentioned in your question, and be sure that this column has a (clustered) index. Now use this table in your query as the filter, without doing any conversions, which will allow SQL Server to take advantage of any indexes that exist on the existing table's "date" column.
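A sketch of that idea, assuming the big table stores dates as int in YYYYMMDD form; table names and the date range are illustrative, and the duplicate-detection logic from the original query would combine with this filter:

-- Build the helper table of all month-starts that fall on a Monday.
CREATE TABLE #FirstMondays (datekey int NOT NULL PRIMARY KEY CLUSTERED);

DECLARE @d date = '2015-01-01';            -- always the 1st of a month
WHILE @d < '2025-01-01'
BEGIN
    IF DATENAME(WEEKDAY, @d) = 'Monday'    -- language-dependent; adjust if needed
        INSERT #FirstMondays VALUES (CONVERT(int, CONVERT(char(8), @d, 112)));
    SET @d = DATEADD(MONTH, 1, @d);
END;

-- No conversion on t.[date], so an existing index on it can be used.
DELETE t
FROM dbo.bigTable AS t
JOIN #FirstMondays AS m ON m.datekey = t.[date];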

SQL Server - how to generate unique sequential number on update

I have a column in my table, say updateStamp. I'd like an approach to update that field with a new sequential number upon row update.
The database has a lot of traffic, mostly reads, but multiple concurrent updates could also happen in batches. Therefore the solution should cause minimal locking.
The reason for this requirement is that I need a solution for clients to iterate over the table forwards, and if a row is updated, it should come up in the result set again.
So, query would then be like
SELECT *
FROM mytable
WHERE updateStamp > @lastReturnedUpdateStamp
ORDER BY updateStamp
Unfortunately timestamps do not work here, because multiple updates could happen at the same time.
The timestamp (deprecated) or rowversion (current) data type is the only one I'm aware of that is updated on every write operation on the row.
It's not a time stamp per se - it doesn't store a date or a time in hours, seconds, etc. - it's really more of a RowVersion (hence the name change): a unique, ever-increasing (binary) number on the row.
It's typically used to check for any modifications between the time you have read the row, and the time you're going to update it.
Since it's not really date/time information, you will most likely have to add another column for that human-readable information. You can add a LastModified DATETIME column to your table with a DEFAULT GETDATE() constraint, so that a value is set upon insertion. To keep it up to date, you'll have to write an AFTER UPDATE trigger that updates the LastModified column whenever an update occurs.
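A sketch putting those pieces together; table, column, and trigger names are illustrative:

CREATE TABLE dbo.mytable
(
    id           int IDENTITY PRIMARY KEY,
    payload      nvarchar(100) NOT NULL,
    updateStamp  rowversion,                          -- bumped automatically on every write
    LastModified datetime NOT NULL DEFAULT GETDATE()  -- set on insert
);
GO
CREATE TRIGGER dbo.trg_mytable_touch ON dbo.mytable
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE t
    SET LastModified = GETDATE()
    FROM dbo.mytable AS t
    JOIN inserted AS i ON i.id = t.id;
END;
GO
-- Clients iterate forward from the last stamp they saw:
DECLARE @lastReturnedUpdateStamp binary(8) = 0x0;
SELECT * FROM dbo.mytable
WHERE updateStamp > @lastReturnedUpdateStamp
ORDER BY updateStamp;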
SQL Server 2011 (a.k.a. "Denali") will bring us SEQUENCES, which would be the perfect fit in your case here - but alas, that's still at least a year from official release...

How can I get a list of modified records from a SQL Server database?

I am currently in the process of revamping my company's management system to run a little leaner in terms of network traffic. Right now I'm trying to figure out an effective way to query only the records that have been modified (by any user) since the last time I asked.
When the application starts it loads the job information and caches it locally like the following: SELECT * FROM jobs.
I am writing out the date/time a record was modified, like UPDATE jobs SET Widgets=@Widgets, LastModified=GetDate() WHERE JobID=@JobID.
When any user requests the list of jobs, I query all records that have been modified since the last time I requested the list, like the following: SELECT * FROM jobs WHERE LastModified>=@LastRequested, and I store the date/time of the request to pass in as @LastRequested when the user asks again. In theory this will return only the records that have been modified since the last request.
The issue I'm running into is when the user's date/time is not quite in sync with the server's date/time, and also the server load when querying an un-indexed date/time column. Is there a more effective system than querying date/time information?
I don't know that I would rely on date/time, since it is external to SQL Server.
If you have an Identity column, I would use that column together with a tracking table: UserId, LastQueryDateTime, LastIdRetrieved.
Every time you query the base table, insert a new row for the user (or update it if one exists) with the max id into this table. The query should also read the row from this table to get the LastIdRetrieved and use that in the WHERE clause.
All this could be eliminated if all of your code chooses to insert GetDate() from SQL Server instead of from the client machines, but that task is pretty labor intensive.
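A sketch of that tracking-table idea; all object names are illustrative, and the first-visit INSERT (upsert) is omitted for brevity:

CREATE TABLE dbo.UserQueryLog
(
    UserId            int      NOT NULL PRIMARY KEY,
    LastQueryDateTime datetime NOT NULL,
    LastIdRetrieved   int      NOT NULL
);

DECLARE @UserId int = 42, @LastId int = 0;
SELECT @LastId = LastIdRetrieved
FROM dbo.UserQueryLog
WHERE UserId = @UserId;

SELECT * FROM dbo.jobs WHERE JobID > @LastId;   -- rows added since the last visit

UPDATE dbo.UserQueryLog
SET LastIdRetrieved   = (SELECT MAX(JobID) FROM dbo.jobs),
    LastQueryDateTime = GETDATE()
WHERE UserId = @UserId;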
The easiest solution seems to be to settle on one clock as leading.
One way would be to settle on the server time. After updating the row, store the value returned by SELECT LastModified FROM jobs WHERE JobID = @JobID on the client side. That way, the client can effectively query using only the server time as reference.
Use an update sequence number (USN) much like Active Directory and DNS use to keep track of the objects that have changed since their last replication. Pick a number to start with, and each time a record in the Jobs table is inserted or modified, write the most recent USN. Keep track of the USN when the last Select query was executed, and you then always know what records were altered since the last query. For example...
Set LastQryUSN = 100
Update Jobs Set USN=101, ...
Update Jobs Set USN=102, ...
Insert Jobs (USN, ...) Values (103, ...)
Select * From Jobs Where USN > LastQryUSN
Set LastQryUSN = 103
Update Jobs Set USN=104
Insert Jobs (USN, ...) Values (105, ...)
Select * From Jobs Where USN > LastQryUSN
Set LastQryUSN = 105
... and so on
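On SQL Server 2012 and later, one way to implement such a USN is a SEQUENCE plus a trigger; this sketch is an illustration under those assumptions, and all object names are hypothetical:

CREATE SEQUENCE dbo.JobUSN AS bigint START WITH 100 INCREMENT BY 1;
GO
CREATE TRIGGER dbo.trg_jobs_usn ON dbo.jobs
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- With the default RECURSIVE_TRIGGERS OFF, this self-update does not
    -- re-fire the trigger.
    UPDATE j
    SET USN = NEXT VALUE FOR dbo.JobUSN
    FROM dbo.jobs AS j
    JOIN inserted AS i ON i.JobID = j.JobID;
END;
GO
-- Readers then pick up everything past their stored high-water mark:
DECLARE @LastQryUSN bigint = 100;
SELECT * FROM dbo.jobs WHERE USN > @LastQryUSN;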
When you get the Jobs, get the server time too:
DECLARE @now DATETIME = GETUTCDATE();
SELECT @now AS [ServerTime], * FROM Jobs WHERE Modified >= @LastModified;
The first time, you pass in a minimum date as @LastModified. On each subsequent call, you pass in the ServerTime returned by the previous call. This way the client time is taken out of the equation.
The answer to the server load is, I hope, obvious: add an index on the Modified column.
And one more piece of advice: never use local time, not even on the server. Always use UTC times, and store UTC time in Modified. As it is right now, your program is completely broken twice a year, when daylight saving time starts or ends.
Current SQL Server versions have change tracking, which you can use for exactly this. Just enable change tracking on the tables you want to track.
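A minimal sketch, with illustrative database and table names (the tracked table must have a primary key):

ALTER DATABASE MyDb
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 7 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.jobs ENABLE CHANGE_TRACKING;

DECLARE @last_sync bigint = 0;   -- version saved after the previous sync
SELECT j.*
FROM CHANGETABLE(CHANGES dbo.jobs, @last_sync) AS ct
JOIN dbo.jobs AS j ON j.JobID = ct.JobID;

SELECT CHANGE_TRACKING_CURRENT_VERSION();  -- persist this for the next sync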
