Using SQL, it is taking over 4 hours every evening to pull over all the data from the twelve Production database tables or views needed for our Sandbox database. There has to be a significantly more efficient and effective manner to get this data into our Sandbox.
Currently, I'm creating a UID (unique ID) by concatenating each view's primary keys and system date fields.
The UID is used in two steps:
Step 1.
INSERT INTO Sandbox
WHERE UID IS NULL,
looking back only the last 30 days based on the system date
(using a LEFT JOIN from the Production table/view UID to the existing Sandbox table/view UID).
Step 2.
UPDATE Sandbox
WHERE Production.UID = Sandbox.UID
(using an INNER JOIN of the Production table/view UID to the existing Sandbox table/view UID).
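A minimal T-SQL sketch of those two steps, where Prod, Sandbox, MyView, MyTable, and LastUpdated are hypothetical placeholder names:
-- Step 1: insert rows from the last 30 days that do not yet exist in the Sandbox.
INSERT INTO Sandbox.dbo.MyTable (UID, Col1, Col2, LastUpdated)
SELECT p.UID, p.Col1, p.Col2, p.LastUpdated
FROM   Prod.dbo.MyView p
LEFT JOIN Sandbox.dbo.MyTable s ON s.UID = p.UID
WHERE  s.UID IS NULL
  AND  p.LastUpdated >= DATEADD(DAY, -30, GETDATE());

-- Step 2: refresh rows that already exist in the Sandbox.
UPDATE s
SET    s.Col1 = p.Col1,
       s.Col2 = p.Col2,
       s.LastUpdated = p.LastUpdated
FROM   Sandbox.dbo.MyTable s
INNER JOIN Prod.dbo.MyView p ON p.UID = s.UID;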
I've cut the 4 hour time down to 2 hours, but it feels like this process I've created is missing a (big) step.
How can I cut this time down? Should I put a 30 day filter on my UPDATE statement as well?
Assuming you're not moving billions of rows into your development environment, I would just create a simple ETL strategy to truncate the dev environment and do a full load from production. If you don't want the full dataset, add a filter to the source queries for your ETL. Just make sure that doesn't have any effect on the integrity of the data.
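A minimal sketch of that truncate-and-reload approach, again using hypothetical names and showing an optional source filter:
TRUNCATE TABLE Sandbox.dbo.MyTable;

INSERT INTO Sandbox.dbo.MyTable (UID, Col1, Col2, LastUpdated)
SELECT UID, Col1, Col2, LastUpdated
FROM   Prod.dbo.MyView
-- Optional: filter the source if you don't want the full dataset.
WHERE  LastUpdated >= DATEADD(YEAR, -1, GETDATE());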
If your data is in the billions, you likely have an enterprise storage solution in place. Many of those can handle snapshotting the data files to another location. There are some security aspects of that approach that you'll need to consider as well.
I found an answer that is in two parts. It may not be the best solution, but it seems to be working for the moment.
For the most part, I can use the primary keys from the production database tables as my UID, updating them with a 30-90 day filter.
The views are a bit trickier, as they UNION two identical tables and can return duplicate primary keys. So I created my own UID by concatenating multiple primary key fields, again updating with a 30-90 day filter.
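A rough sketch of that concatenated UID (all names are hypothetical, and the '|' separator is just there to avoid accidental key collisions):
-- Refresh existing sandbox rows for the unioned view using a composite UID.
UPDATE s
SET    s.Col1 = p.Col1
FROM   Sandbox.dbo.MyViewCopy s
INNER JOIN (
    SELECT CAST(KeyField1 AS varchar(20)) + '|' + CAST(KeyField2 AS varchar(20)) AS UID,
           Col1, LastUpdated
    FROM   Prod.dbo.MyUnionView
) p ON p.UID = s.UID
WHERE  p.LastUpdated >= DATEADD(DAY, -90, GETDATE());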
The previous process would take up to 4+ hours to complete. The new process runs in an hour, and seems to be working for the moment.
I've tried to search for some ideas but can't find anything that's very suitable for my scenario.
I have a table which I write and update data to from multiple sites, maybe a row per second during specific hours of the day, with around 50k records added daily on average. Separate to this, I have dashboards where people can query this data, but some of the queries may be quite complex and take a number of seconds to complete.
I can't afford my write/updates to slow down
Although the dashboards don't need to be real time, it would be a bonus
I'm hosting on Azure SQL DB S2. What options are available?
My current idea is to use an 'active' table for writes/updates and flush the data to the full table every x minutes. My only concern is that I have a seeded bigint as a PK on the main table, and because I also save other data linked to this, I'd have problems linking to that ID until I commit to the main table. An option would be to reseed the active table and use IDENTITY_INSERT on the main table to populate the IDs myself, but I'm not 100% happy with this.
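If you did go the IDENTITY_INSERT route, a minimal sketch of the periodic flush might look like this (names are hypothetical, and the exclusive table lock briefly blocks writers to the active table while it runs):
-- Hypothetical flush job, run every x minutes.
BEGIN TRAN;

SET IDENTITY_INSERT dbo.MainTable ON;

INSERT INTO dbo.MainTable (Id, SiteId, Payload, CreatedAt)
SELECT Id, SiteId, Payload, CreatedAt
FROM   dbo.ActiveTable WITH (TABLOCKX);   -- hold writers off while we copy and clear

SET IDENTITY_INSERT dbo.MainTable OFF;

DELETE FROM dbo.ActiveTable;

COMMIT TRAN;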
Just looking for suggestions until I go ahead with my current idea! Thanks
I have a very large table from which I need to extract a subset of records, the last 30 days of records, and replicate these 30 days of records to a second database for reporting purposes. Currently I am using transactional replication, where I added a filter to the published articles to isolate the 30 days of records and get a near real-time replication environment.
The issue I have is that the replication seems to be incremental, meaning that the most recent records are added to the replica, but the older records are not removed, so the replica keeps growing.
When a record that is outside the filtering criteria is updated and falls back inside the filtering criteria, the replication fails with a "duplicate primary key" error.
How can I make it work so that the replica contains only the last 30 days of data?
Is the behaviour described above something I should expect to see?
Many thanks,
Well, the simplest way is not to use MSSQL's filter. Instead, replace the stored procedures the article uses for UPDATE and DELETE with custom ones, so that you don't get errors when deleting or updating rows that are absent from the replica. This is done from the article's advanced properties. For the delete you can just use a MERGE and apply your filtering criteria there.
Also have a job that deletes from the replica tables whatever needs to be removed.
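A minimal sketch of that cleanup job on the subscriber (table and column names are hypothetical):
-- Run daily on the replica to keep only the last 30 days.
DELETE FROM dbo.ReplicaTable
WHERE  RecordDate < DATEADD(DAY, -30, GETDATE());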
Of course you will need to be very careful when doing structure updates, but it is doable.
Another, uglier way is to keep SQL's stored procedures and just ignore the errors (through the Distribution Agent's -SkipErrors 2601:2627:20598). This will again require a job to delete old rows, and it will not bring back into scope old rows that have just been updated. All in all, the first solution should be the best one.
Hope it helps.
We want to know which rows in a certain table are used frequently, and which are never used. We could add an extra column for this, but then we'd get an UPDATE for every SELECT, which sounds expensive? (The table contains 80k+ rows, some of which are used very often.)
Is there a better and perhaps faster way to do this? We're using some old version of Microsoft's SQL Server.
This kind of logging/tracking is classically the application server's task. If you want to build your own tracking architecture, do it in your own layer.
In any case, you will need an application server for this. You are not going to update a tracking field in the same transaction as the SELECT, are you? What about rollbacks? So you need some manager component that first runs the SELECT and then writes the tracking information. And what is the point of sending the tracking information back to the database alongside the entity data? Save it to a file on the application server instead.
You could update the column in the table as you suggested, but if it were me I'd log the event to another table: the ID of the record, a datetime, a user ID (maybe IP address, browser version, etc.), and just about anything else I could capture that was even possibly relevant. (For example, six months from now your manager decides s/he wants to know not only which records were used the most, but also which users are using the most records, or what time of day that usage happens, etc.)
This type of information can be useful for things you've never even thought of down the road, and if it starts to grow large you can always roll it up and prune the table to a smaller one if performance becomes an issue. When possible, I log everything I can. You may never use some of this information, but you'll never wish you didn't have it available down the road, and it would be impossible to re-create historically.
In terms of making sure the application doesn't slow down, you may want to SELECT the data from within a stored procedure that also issues the logging command, so that the client is not doing two round trips (one for the select, one for the update/insert).
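A hedged sketch of such a procedure (all object names are hypothetical):
-- Logs which rows were accessed, then returns them, in a single round trip.
CREATE PROCEDURE dbo.GetRecordsAndLog
    @Stuff1 varchar(50),
    @UserId int
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO dbo.RecordAccessLog (RecordId, AccessedAt, UserId)
    SELECT id, GETDATE(), @UserId
    FROM   dbo.myTable
    WHERE  stuff1 = @Stuff1;

    SELECT id, stuff1, stuff2
    FROM   dbo.myTable
    WHERE  stuff1 = @Stuff1;
END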
Alternatively, if this is a web application, you could use an async ajax call to issue the logging action which wouldn't slow down the users experience at all.
Adding a new column to track SELECTs is not good practice, because it may affect database performance, and database performance is one of the major concerns in database server administration.
Instead, you can use a very good database feature called auditing; it is easy to set up and puts less stress on the database.
Search for "database auditing for SELECT statements" for more information.
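A hedged sketch of what that can look like. Note this requires SQL Server 2008 or later (database-level audit specifications were Enterprise-only before 2016 SP1), all names here are hypothetical, and the audit records the SELECT statements issued, not the individual rows returned:
-- Server-level audit writing to a file target (run in master).
USE master;
CREATE SERVER AUDIT RowUsageAudit
    TO FILE (FILEPATH = 'D:\AuditLogs\');
ALTER SERVER AUDIT RowUsageAudit WITH (STATE = ON);
GO

-- Database-level specification capturing every SELECT against the table.
USE MyDatabase;
CREATE DATABASE AUDIT SPECIFICATION TrackTableSelects
    FOR SERVER AUDIT RowUsageAudit
    ADD (SELECT ON OBJECT::dbo.myTable BY public)
    WITH (STATE = ON);
GO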
Use another table as a key/value pair with two columns (e.g. id_selected, times) to store the IDs of the records you select from your standard table, and increment the times value by 1 every time those records are selected.
To do this you'd have to do a mass insert/update of the selected IDs into the counting table. As a quick example:
SELECT id, stuff1, stuff2 FROM myTable WHERE stuff1='somevalue';
INSERT INTO countTable(id_selected, times)
SELECT id, 1 FROM myTable mt WHERE mt.stuff1='somevalue' # or just build a list of ids as values from your last result
ON DUPLICATE KEY
UPDATE times=times+1
The ON DUPLICATE KEY syntax is MySQL, written from the top of my head. For conditionally inserting or updating in MSSQL you would need to use MERGE instead.
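For reference, a hedged T-SQL equivalent using MERGE (names follow the example above):
MERGE dbo.countTable AS tgt
USING (SELECT id FROM dbo.myTable WHERE stuff1 = 'somevalue') AS src
    ON tgt.id_selected = src.id
WHEN MATCHED THEN
    UPDATE SET tgt.times = tgt.times + 1
WHEN NOT MATCHED THEN
    INSERT (id_selected, times) VALUES (src.id, 1);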
I have a huge database that processes the email traffic every day. The system needs to delete some old emails every day:
Delete from EmailList(nolock)
WHERE EmailId IN (
SELECT EmailId
FROM Emails
WHERE EmailDate < DATEADD([days], -60, GETDATE())
)
It works, but the problem is that it takes a long time to finish, and the log file becomes very large because of it. The log file grows by more than 100 GB every day.
I'm thinking we can change it to
Delete from EmailList(nolock)
WHERE EXISTS (
SELECT EmailId
FROM Emails
WHERE (Emails.EmailId = EmailList.EmailId) AND
(EmailDate < DATEADD([days], -60, GETDATE()))
)
But other than this, is there anything we can do to improve the performance and, most of all, reduce the log file size?
EmailId is indexed.
I've seen
GetDate()-60
style syntax perform MUCH better than
DATEADD([days], -60, GETDATE())
especially if there is an index on the date column. A few fellow DBAs and I spent quite a bit of time trying to understand WHY it would perform better, but the proof was in the pudding.
Another thing you might want to consider, given the volume of records I presume you have to delete, is to chunk the deletes into batches of, say, 1,000 or 10,000 records. This would probably speed up the delete process, and the smaller transactions also give the log a chance to truncate between batches (under the SIMPLE recovery model, or with frequent log backups), which helps with the log-size problem.
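A minimal sketch of such a batched delete against the tables from the question (the batch size is arbitrary):
-- Delete in chunks so each transaction, and its log usage, stays small.
DECLARE @rows int;
SET @rows = 1;

WHILE @rows > 0
BEGIN
    DELETE TOP (10000)
    FROM   EmailList
    WHERE  EXISTS (SELECT 1
                   FROM   Emails e
                   WHERE  e.EmailId = EmailList.EmailId
                     AND  e.EmailDate < DATEADD(day, -60, GETDATE()));

    SET @rows = @@ROWCOUNT;
END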
Have you tried partitioning by date? Then you can just drop the partitions for the days you are not interested in anymore. Given a "huge" database you surely run the Enterprise edition of SQL Server (after all, huge is bigger than very large), and that has table partitioning.
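A hedged sketch of the partitioning idea. The names, date column, and boundary values are illustrative only, and prior to SQL Server 2016 you purge by switching the old partition into an empty staging table with the same structure and truncating that:
-- Monthly partitions on the date column.
CREATE PARTITION FUNCTION pfEmailDate (datetime)
    AS RANGE RIGHT FOR VALUES ('2012-01-01', '2012-02-01', '2012-03-01');

CREATE PARTITION SCHEME psEmailDate
    AS PARTITION pfEmailDate ALL TO ([PRIMARY]);

-- ...create or rebuild the Emails table (and its clustered index) on psEmailDate(EmailDate)...

-- Purge the oldest partition: switch it out, then truncate the staging table.
ALTER TABLE Emails SWITCH PARTITION 1 TO EmailsStaging;
TRUNCATE TABLE EmailsStaging;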
[EDIT]:
Regarding @TomTom's comment:
If you have SQL Server Enterprise edition available you should use Table Partitioning.
If this is not the case, my original post may be helpful:
[ORIGINAL POST]
Deleting a large amount of data is always difficult. I faced the same problem and I went with the following solution:
Depending on your requirements this will not work, but maybe you can get some ideas from it.
Instead of using one table, use two tables with the same schema. Create a synonym (I assume you are using MS SQL Server) that points to the "active" table of the two (active meaning the table that you currently write to). Use this synonym for the inserts in your application; or, instead of using the synonym, the application could simply change which table it writes to every x days.
Every x days you can truncate the old/inactive table and afterwards recreate the synonym to target the truncated table (if you use the synonym solution), so effectively you are partitioning the data by time.
You have to synchronize the switch of the active table. I automated this completely by using a shared app lock for the application and an exclusive app lock when changing the synonym (i.e. blocking the writing application during the switching process).
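A hedged sketch of the switch itself (synonym, table, and lock names are hypothetical):
-- One-time setup: two identical tables and a synonym the application writes to.
-- CREATE SYNONYM dbo.EmailListActive FOR dbo.EmailList_A;

-- Switch job, run every x days; assumes writers take the same applock in
-- Shared mode around their inserts.
BEGIN TRAN;

EXEC sp_getapplock @Resource = 'EmailListSwitch',
                   @LockMode = 'Exclusive',
                   @LockTimeout = 60000;

TRUNCATE TABLE dbo.EmailList_B;                            -- the inactive table

DROP SYNONYM dbo.EmailListActive;
CREATE SYNONYM dbo.EmailListActive FOR dbo.EmailList_B;    -- B becomes active

EXEC sp_releaseapplock @Resource = 'EmailListSwitch';

COMMIT TRAN;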
If changing your application's code is not an option, consider the same principle, but instead of writing to the synonym you could create a view with INSTEAD OF triggers (the insert operation would insert into the "active" partition). The trigger code would need to synchronize using something like the app lock mentioned above (so that writes during the switching process still work).
My solution is a little more complex, so I currently cannot post the code here, but it works without problems for a high-load application and the switching/cleanup process is completely automated.
We have an indexed view that runs across three large tables. Two of these tables (A & B) are constantly being updated with user transactions, and the other table (C) contains product data that needs to be updated once a week. This product table contains over 6 million records.
We need this view across these three tables for our core business process, and unfortunately we cannot change this aspect. We even had a SQL Server MVP come in to help test under load to make sure we have the most efficient configuration. There is one column in the product table that gets used in the view and has to be updated each week.
The problem we are now encountering is that as volume is increasing on our transactions against tables A & B, the update to Table C is causing deadlocks.
I have tried several different methods to no avail:
1) I was hoping that we could change the view so that table C could be a dirty read "WITH (NOLOCK)", but apparently that functionality is not available with indexed views.
2) I thought about updating a new column in table C and then just renaming it when the process is done, but you cannot do that due to the dependency in the view.
3) I also entertained the idea of writing this value to a temporary product table and then running an ALTER statement against the view to have it point to my new table. However, when I did that, the indexes on my view were dropped and it took quite a bit of time to recreate them.
4) We tried to do the weekly update in small chunks (as small as 100 records at a time), but we still ran into deadlocks.
questions:
a) We are using SQL Server 2005. Does SQL Server 2008 have new functionality for indexed views that would help us? Is there now a way to do dirty reads with an indexed view?
b) Is there a better approach to altering an existing view to point to a new table?
thanks!
The issue you're experiencing is that adding the indexed view between the three tables is causing lock contention. There is a really good post about the issue here : http://sqlblog.com/blogs/alexander_kuznetsov/archive/2009/06/02/be-ready-to-drop-your-indexed-view.aspx
Partitioning the table might provide some relief, although I don't know if the partitioning will circumvent the lock issue. You will have to upgrade to 2008 if you want to investigate this option however - as you need to use partition-aligned indexed views. 2005 will require you to drop the view before you swap in/out any partitions.
More information about partition-aligned indexed views: http://msdn.microsoft.com/en-us/library/dd171921.aspx
Have you considered making C a partitioned table and swapping a partition in/out as your price update mechanism? I'm not sure how that would work with an indexed view - I would think the index needs to be rebuilt at that point. I think this is probably the same situation you are seeing with your ALTER approach, actually.
Is the indexed view really necessary? i.e. could appropriate indexes on the 3 underlying tables perform just as well when a normal view is used? Remember that the indexed view may have to be updated on key changes to any of the three tables, while an index on a single table only has to be updated if a key changes or data moves in just that table. Typically indexed views are indexed on different columns than the base tables because they give a different kind of slice across the data than is available in the underlying tables - does that description really apply here?
How long does the pricing update take? This would appear to be the core of your problem, but it's hard to say without more information.
You can try this to avoid locking:
SELECT a,b,c FROM indexedview as v WITH (NOEXPAND,NOLOCK) WHERE ...