I'm writing some processes to pre-format certain data for another downstream process to consume. The pre-formatting essentially involves gathering data from several permanent tables in one DB, applying some logic, and saving the results into another DB.
The problem I am running into is the volume of data. The resulting data set that I need to commit has about 132.5 million rows, and the commit itself takes almost 2 hours. I can cut that by changing the logging to simple, but it's still quite substantial (seeing as generating the 132.5 million rows into a temp table only takes 9 minutes).
I have been reading up on the best methods for migrating large data sets, but most of the solutions implicitly assume that the source data already resides in a single file/data table (which is not the case here). Some solutions, like using the SSMS task option, make it difficult to embed the logic I need to apply.
I'm wondering if anyone here has some solutions.
Assuming you're on SQL Server 2014 or later, the temp table is not flushed to disk immediately, so the difference is probably just disk speed.
Try making the target table a Clustered Columnstore to optimize for compression and minimize IO.
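A minimal sketch of that idea, assuming SQL Server 2014 or later; the table and column names (PreFormattedResults, #StagedRows, etc.) are placeholders, not from the original post:

-- Hypothetical target table; adjust columns to the real result set.
CREATE TABLE dbo.PreFormattedResults
(
    SourceKey    bigint        NOT NULL,
    MeasureValue decimal(18,4) NULL,
    LoadDate     datetime2     NOT NULL
);

-- Clustered columnstore gives heavy compression and far less IO on big scans (2014+).
CREATE CLUSTERED COLUMNSTORE INDEX CCI_PreFormattedResults
    ON dbo.PreFormattedResults;

-- With SIMPLE or BULK_LOGGED recovery, TABLOCK lets the engine bulk-load the rows
-- (and, on newer versions, parallelize the insert), which should shrink the 2-hour commit.
INSERT INTO dbo.PreFormattedResults WITH (TABLOCK)
SELECT SourceKey, MeasureValue, SYSDATETIME()
FROM #StagedRows;   -- the temp table produced by the 9-minute generation step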
The problem:
I have 2 SQL Server databases from 2 different applications. They describe different aspects of industrial machines: one is about "how many consumables were spent per order", the other is about "how many good/bad production items were produced per operator". Sometimes many operators are working on 1 order one after another, sometimes one operator is working on multiple small orders, and there is no connection Order-Operator in the database.
I want to have a united fact table where, for every timestamp, I know the MachineID, OrderID and OperatorID. If a timestamp exists in DB1, the record will have the numeric measures from it (consumables); if it exists in DB2, it will have the numeric measures from DB2 (good/bad production items). If it exists in both databases, it will have all numeric measures. A simple UNION ALL is not enough, because I want to have MachineID, OrderID and OperatorID for every record.
I created a T-SQL stored procedure to make FULL JOIN by timestamp and MachineID. But on large data sets (multiple machines, multiple customers) it becomes very slow. Both applications support editing history, so I need to merge full history from both databases at every nightly load.
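For reference, a sketch of the kind of FULL JOIN described; the database, table, and column names are hypothetical, since the real schema isn't shown in the question:

SELECT
    COALESCE(c.MachineID, p.MachineID) AS MachineID,
    COALESCE(c.EventTime, p.EventTime) AS EventTime,
    c.OrderID,            -- known only from DB1
    p.OperatorID,         -- known only from DB2
    c.ConsumablesUsed,
    p.GoodItems,
    p.BadItems
FROM DB1.dbo.ConsumablesFact AS c
FULL JOIN DB2.dbo.ProductionFact AS p
       ON  p.MachineID = c.MachineID
       AND p.EventTime = c.EventTime;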
To speed up the process, I would like to put calculations into multiple parallel threads, separated by Customer, MachineID, and Year.
I tried to do it by using SQL Server stored procedures, running in parallel by SQL Agent with different parameters, but I found that it didn't help the performance. Instead it created multiple deadlocks when updating staging and final tables.
I am looking for an alternative way to solve this problem, but I don't know what the right tool is. Could Hadoop or a similar parallel-processing tool help with this task?
I am looking for a solution with minimal cost, because it is needed for just this one specific task. For everything else, SQL Server and PowerBI reporting are working just fine for me.
Hadoop seems hard to justify in this use case, given the limited scope. The thing about Hadoop is that it scales well not only due to parallel processing but also thanks to parallel IO, when data is distributed across multiple servers/storage media. Unless you're happy to copy all the data to HDFS distributed among multiple nodes, it likely won't help much. If you spin up a Hadoop cluster and run multiple jobs querying a single SQL Server, it'll likely end up badly for the latter.
Have you considered optimizations that would let you limit the amount of data you process nightly?
E.g. what is the 'timestamp' field? Does it reflect the last update time? Can you use it to filter out rows which haven't been updated since the previous run?
Even if the 'timestamp' is not the time of the last update, can you add an 'updateTime' field, plus update triggers to populate it, so you don't need to import rows which have not changed since the previous run? If you build an index on that field, then, as long as the number of updates during the day is not high relative to the total table size, a query filtering on it will hit the index and fetching the incremental changes should be fast.
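A rough sketch of that idea, assuming SQL Server 2008 or later; the table, key, and variable names below are placeholders:

-- Add the tracking column (the default back-fills existing rows) and index it.
ALTER TABLE dbo.SourceFact
    ADD UpdateTime datetime2 NOT NULL
        CONSTRAINT DF_SourceFact_UpdateTime DEFAULT SYSUTCDATETIME();

CREATE INDEX IX_SourceFact_UpdateTime ON dbo.SourceFact (UpdateTime);
GO
-- Keep it current on every update; FactID is assumed to be the primary key.
CREATE TRIGGER trg_SourceFact_Update
ON dbo.SourceFact
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE f
       SET UpdateTime = SYSUTCDATETIME()
      FROM dbo.SourceFact AS f
      JOIN inserted AS i ON i.FactID = f.FactID;
END;
GO
-- The nightly load then only fetches rows changed since the previous run.
DECLARE @LastLoadTime datetime2 = '20240101';   -- in practice persisted by the ETL job
SELECT *
FROM dbo.SourceFact
WHERE UpdateTime > @LastLoadTime;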
Another thing to consider - are those DBs running on the same node/SQL Server instance? Access to remote DBs is slow, so if they aren't, think about how to fix that first.
I have two SQL Server 2005 instances that are geographically separated. Important databases are replicated from the primary location to the secondary using transactional replication.
I'm looking for a way that I can monitor this replication and be alerted immediately if it fails.
We've had occasions in the past where the network connection between the two instances has gone down for a period of time. Because replication couldn't occur and we didn't know, the transaction log blew out and filled the disk, causing an outage on the primary database as well.
My Google searching some time ago led us to monitoring the MSrepl_errors table and alerting when there were any entries, but this simply doesn't work. The last time replication failed (last night, hence the question), errors only hit that table when it was restarted.
Does anyone else monitor replication and how do you do it?
Just a little bit of extra information:
It seems that last night the problem was that the Log Reader Agent died and didn't start up again. I believe this agent is responsible for reading the transaction log and putting records in the distribution database so they can be replicated on the secondary site.
As this agent runs inside SQL Server, we can't simply make sure a process is running in Windows.
We have emails sent to us for Merge Replication failures. I have not used Transactional Replication but I imagine you can set up similar alerts.
The easiest way is to set it up through Replication Monitor.
Go to Replication Monitor and select a particular publication. Then select the Warnings and Agents tab and then configure the particular alert you want to use. In our case it is Replication: Agent Failure.
For this alert, we have the Response set up to Execute a Job that sends an email. The job can also do some work to include details of what failed, etc.
This works well enough for alerting us to the problem so that we can fix it right away.
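If you prefer to script it rather than click through the GUI, something along these lines should work. This assumes the predefined 'Replication: agent failure' alert already exists in msdb (it is normally created when replication is configured), and the job name is a placeholder for one of your own:

USE msdb;
GO
-- Enable the predefined agent-failure alert and attach a job that emails the details.
EXEC dbo.sp_update_alert
    @name     = N'Replication: agent failure',
    @enabled  = 1,
    @job_name = N'Notify DBA of replication failure';   -- hypothetical existing job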
You could run a regular check that data changes are taking place, though this could be complex depending on your application.
If you have some form of audit trail table that is very regularly updated (e.g. our main product has a base audit table that lists all actions that result in data being updated or deleted), then you could query that table on both servers and make sure the results you get back are the same. Something like:
SELECT CHECKSUM_AGG(CHECKSUM(*))
FROM audit_base
WHERE action_timestamp BETWEEN <time1> AND <time2>
where <time1> and <time2> are rounded values to allow for different delays in contacting the databases. For instance, if you are checking at ten past the hour you might check items from the start of the last hour to the start of this hour. You now have two small values that you can transmit somewhere and compare. If they are different then something has most likely gone wrong in the replication process - have whatever process does the check/comparison send you an email and an SMS so you know to check and fix any problem that needs attention.
By using CHECKSUM_AGG the amount of data returned for each table is very, very small, so the bandwidth used by the checks will be insignificant. You just need to make sure your checks are not too expensive in the load they apply to the servers, and that you don't check data that might be part of open replication transactions and so might be expected to differ at that moment (hence checking the audit trail a few minutes back in time instead of right now in my example), otherwise you'll get too many false alarms.
Depending on your database structure the above might be impractical. For tables that are not insert-only within the timeframe of your check (i.e. anything that sees updates or deletes, unlike the audit trail above), working out what can safely be compared while avoiding false alarms is likely to be both complex and expensive, if not actually impossible to do reliably.
You could manufacture a rolling insert-only table if you do not already have one, by having a small table (containing just an indexed timestamp column) to which you add one row regularly - the data serves no purpose other than to exist so you can check that updates to the table are getting replicated. You can delete data older than your checking window, so the table shouldn't grow large. Testing only one table does not prove that all the other tables are replicating, but finding an error in this one table would be a good "canary" check (if this table isn't updating in the replica, then the others probably aren't either).
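A minimal sketch of such a canary table and its checks; the names, schedule, and one-day retention window are only examples:

-- On the publisher: a tiny insert-only table with an indexed timestamp.
CREATE TABLE dbo.ReplicationHeartbeat
(
    HeartbeatTime datetime NOT NULL
);
CREATE CLUSTERED INDEX IX_ReplicationHeartbeat ON dbo.ReplicationHeartbeat (HeartbeatTime);
GO
-- Run from a SQL Agent job every few minutes: add a row, trim anything older than a day.
INSERT INTO dbo.ReplicationHeartbeat (HeartbeatTime) VALUES (GETDATE());
DELETE FROM dbo.ReplicationHeartbeat WHERE HeartbeatTime < DATEADD(day, -1, GETDATE());
GO
-- The monitoring check runs this on both publisher and subscriber and compares the counts,
-- deliberately ignoring the last few minutes to allow for replication latency.
SELECT COUNT(*) AS RowsInWindow
FROM dbo.ReplicationHeartbeat
WHERE HeartbeatTime BETWEEN DATEADD(minute, -70, GETDATE())
                        AND DATEADD(minute, -10, GETDATE());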
This sort of check has the advantage of being independent of the replication process - you are not waiting for the replication process to record exceptions in logs, you are instead proactively testing some of the actual data.
In an update project I have to do the following:
Move 3 databases from SQL2000 to SQL2005 and merge them at the same time. There are already quite a few cross-database queries used in SP's and Views.
The current plan is to move each of the old databases into a separate schema in 1 database.
That means we will also have to change our current SP's and Views, we now have:
SELECT OrderId, OrderDate FROM Sales.dbo.Orders
and expect we will have to change that into
SELECT OrderId, OrderDate FROM Sales.Orders
The question is: how do we do that as automated as possible?
I know about SED and similar tools for changing the scripts. I would welcome tips on how to be 'smart' about this, like strategies for partitioning the scripts, performance (tons of INSERT INTO lines), etc.
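As a starting point for the automation, a query along these lines (it works on both SQL 2000 and 2005) can inventory which procedures and functions still contain the old three-part names; only 'Sales' is named in the question, so treat the pattern as a placeholder and repeat it for the other two source databases:

-- Note: ROUTINE_DEFINITION only holds the first 4000 characters of each module,
-- so treat this as a checklist, not a definitive search.
SELECT ROUTINE_SCHEMA, ROUTINE_NAME, ROUTINE_TYPE
FROM INFORMATION_SCHEMA.ROUTINES
WHERE ROUTINE_DEFINITION LIKE '%Sales.dbo.%';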
Note: I did look at the Import/Export Wizard but apparently I would have to set the Schema manually on each output table and fix the SP's through ALTER scripts anyway.
I did this a couple of years ago, and I ran into a few problems that you want to be aware of.
Assumptions:
You've got a single SQL 2000 database server with 3 databases, A/B/C
You want all of the objects to end up in SQL 2005 in database A (we'll refer to that as the Target)
You want to get rid of databases B and C eventually (the old Sources)
You don't have a full-blown test environment where you can automatically restore your production databases every day, and script this again and again until it's right. (That's the best way, and I've taken that approach too, but it's labor-intensive.)
Here's my hard lessons learned:
Don't do the merge and the SQL 2005 change the same day. Either do the merge before you go to 2005, or after, but don't try to accomplish it all in a single outage. It'll be a finger-pointing mess. If it was me, I'd go to 2005 first just to get it out of the way. That way, I know anything that breaks isn't because of a schema change, and those types of breaks are easier to fix. You want at least a week of end user activity on the 2005 box before you declare victory and move on to the merge.
Build the new objects in Target ahead of time. Even if they're not being queried in your live production apps, go ahead and build 'em now. That way you can populate fake test data in there to test your applications ahead of time. Yes, this means mixing live and test data, but frankly, you're already out there working without a net. Be wary of identity fields, though, since you can end up with conflicting records with the same identity number but different data in the Target and Source databases.
Create views in Target ahead of time. You mentioned that you've got views that already do cross-database queries. Copy those from Source to Target now, and tell any other developers (report guys, power users) to start referring to the Target views instead. This isn't going to speed up your own work, but it speeds up THEIR work. If you can get to the point where you can verify that they're only hitting Target (even though the Target views still point to tables in Source) then it'll make troubleshooting easier on migration day. Then you can start denying permissions on the Source views ahead of time.
Sync tables ahead of time. Make a list of all of the tables that need to be moved out of the Sources, and for each one, analyze how it's being updated. If it's only being inserted into (not updated or deleted), like a log table, then write a T-SQL script to start keeping it in sync in Target. Run that script via a SQL Agent job during periods of low activity on your server, like nightly. This way, when it's go-live day, you won't have to push as many records around, meaning your go-live window will be smaller and your Target transaction logs can stay smaller. Tables that are being constantly updated or deleted aren't as easy, and it's up to you whether you decide to sync those as well. We did it for any tables over a million lines.
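A rough sketch of that kind of nightly sync job for an insert-only table; the database, table, and key names are placeholders:

-- Copy only the rows the Target hasn't seen yet; safe for insert-only tables like logs.
INSERT INTO TargetDB.dbo.AppLog (LogID, LoggedAt, Message)
SELECT s.LogID, s.LoggedAt, s.Message
FROM SourceB.dbo.AppLog AS s
WHERE s.LogID > (SELECT ISNULL(MAX(LogID), 0) FROM TargetDB.dbo.AppLog);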
Check for record conflicts between the Source databases. It sounds like this one doesn't apply to you specifically, but I'm noting it here in case anybody else does a merge and they're reading it for tips. If you have more than one Source database, dump out the list of objects. If you've got two objects with the same name, check their schema. I've worked with instances where they had a State or Region table in each database, and they were supposed to be identical, but they had identity fields for their primary keys. Each child table (like Customers, which linked to a Region table) referred to the parent table (Region) by the primary key (identity field) - which didn't match from one database to the other. In that case, the smart thing to do is take an outage window ahead of time, before the migration day, to clean those records up with manual update scripts.
Disable any constraints or foreign key relationships
Change the identity fields (if they're lookup tables, you may be able to turn off the identity stuff and just run with manually specified pk numbers)
Modify the Region table to add a NewID field, matching what it's going to become, and an OldID field, showing what it used to be
Update all of the child tables (Customers) to use the NewID number instead of the original
Update the Region table so that the real ID field now has the NewID value, and the OldID field has what the Region used to be. (You're probably going to screw something up like miss a child table you didn't know about, and you're going to wonder what it used to be.)
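A rough sketch of that re-keying sequence, using the Region/Customers example above. It assumes constraints are already disabled and RegionID is no longer an identity column, and the +1000 offset is just one way to avoid collisions between old and new keys:

ALTER TABLE dbo.Region ADD NewID int NULL, OldID int NULL;
GO
-- Work out the new keys and remember the old ones.
UPDATE dbo.Region SET NewID = RegionID + 1000, OldID = RegionID;

-- Re-point every child table before touching the parent.
UPDATE c
   SET c.RegionID = r.NewID
  FROM dbo.Customers AS c
  JOIN dbo.Region    AS r ON r.OldID = c.RegionID;

-- Finally re-key the parent, keeping OldID around for troubleshooting later.
UPDATE dbo.Region SET RegionID = NewID;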
Break the migration into pieces. List every stored proc in all of the databases. If any of them can be moved without moving data, do that first. For example, if you've got Source.dbo.usp_RunReport, and it only refers to tables in the Target database, then do that in a first phase. If you've got small system lookup tables that are only used internally in your app, not visible to customers or reports, then put that in the first phase too. It sounds like it's too small to bother with, but the idea is to reduce the amount of panic on migration day. The less you wonder about, the better you can troubleshoot. We moved every static lookup table (State, Region, Calendar, etc) over ahead of time. The amount of work required in Phase 1 - just moving those small, static tables - got management to understand how huge it was going to be to move the rest, and it bought us resources and time we wouldn't have gotten otherwise.
Pre-grow the data files for Target. If you're not using SQL 2005's new Instant File Initialization, data file growths take quite a while. Enable Instant File Initialization if you've got a choice, then grow the data files to make sure they're not fragmented. If they just grow naturally during your migration day, they can be fragmented. If you can't use Instant File Initialization, you still need to pre-grow the files, but you want to do that ahead of time during periods of low activity to speed up the maintenance window.
On migration day, run your inserts one table at a time, or in even smaller batches. You want to keep your insert transactions as tight as possible. The smaller your insert transactions, the less space you'll need in the transaction log. Remember that the transaction log will grow with insert statements even in simple mode. After every round of inserts, do a sanity check to make sure that they worked, and that you're not going to run out of drive space for data files or t-log files.
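One way to keep those transactions small is a batched copy loop like the sketch below; the table names, key, and 50,000-row batch size are placeholders:

DECLARE @BatchSize int;
SET @BatchSize = 50000;

WHILE 1 = 1
BEGIN
    -- Each iteration is its own small transaction.
    INSERT INTO TargetDB.dbo.Orders (OrderID, OrderDate, CustomerID)
    SELECT TOP (@BatchSize) s.OrderID, s.OrderDate, s.CustomerID
    FROM SourceB.dbo.Orders AS s
    WHERE NOT EXISTS (SELECT 1 FROM TargetDB.dbo.Orders AS t WHERE t.OrderID = s.OrderID)
    ORDER BY s.OrderID;

    IF @@ROWCOUNT = 0 BREAK;   -- nothing left to copy

    CHECKPOINT;   -- under SIMPLE recovery this lets log space be reused between batches
END;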
After the updates finish, change security on the Source databases. Put every non-SA login into the db_denydatareader and db_denydatawriter roles in the Source databases. That way they can still log in if they've hard-coded the database name in the connection string, but they won't be able to do anything. This makes your troubleshooting easier too: if an app or a query runs into problems, you can take that login out of the deny roles and see if it works - if it does, you know it's still relying on the Source data. The risk with that is that they might run a transaction that uses the Source database data to update the Target database (get customers from Source, update them in Target) and it might cause issues.
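For reference, a couple of lines like these handle the deny roles; 'SomeAppUser' and the database name are placeholders for your own Source database users:

USE SourceB;
GO
EXEC sp_addrolemember N'db_denydatareader', N'SomeAppUser';
EXEC sp_addrolemember N'db_denydatawriter', N'SomeAppUser';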
Other options for the Source databases are:
Rename them, so you can still query 'em but the apps won't touch 'em
Detach them, but keep the files available in case you need to troubleshoot
Strip out all logins, and use new logins to access the existing databases just in case. Then if somebody's read-only report is totally borked, you can let it work temporarily by issuing them a new login and telling them it's referring to the wrong database.
After the updates finish, rebuild indexes & statistics on Target. If you're just doing continuous inserts, this isn't a big deal, but if you're merging multiple databases (like two Sales databases that had been broken up into regions of the country) then you'll want to clean things up.
IMHO, use one schema unless you can justify a gain from multiple schemas. This last one is just my two cents, but it sounds like you're going through an awful lot of work to go from 3 databases 1 schema each, to 1 database with 3 schemas. If you're not really sure about the 3 schema thing, you might consider using 1 schema - or else you'll be in another messy rework later on down the road. 3 schemas does make sense if you have specific security needs, but otherwise, just make sure you're getting the bang for the buck that you want. Now would be a great time to go to one schema.
You could give Redgate SQL Compare and Data Compare a shot. They have a schema-mapping feature that should let you map the dbo schema in one database to the Sales schema in another, and then move the tables and procs. It would save you from having to mess with the SQL export wizard. You would still have to refactor your other objects, though.
I love these two tools.
edit:
I think you can get a fully functional demo too.
edit:
Additionally, they offer SQL Refactor, which does a 'smart' rename. Score!
Could you have a dummy database called SALES that has a VIEW called [Orders]:
USE Sales;
GO
CREATE VIEW dbo.Orders
AS
SELECT OrderId, OrderDate, ...
FROM CombinedDatabase.Sales.Orders
and then
SELECT ... FROM Sales.dbo.Orders
will still work.
You won't be able to INSERT / UPDATE that table without some further jiggery-pokery though.
If you could have such VIEWs log that they were used, that would enable you to find and fix the code that calls them, but I can't think of a way to do that. However, you could disable each one in turn, run some tests, fix whatever is broken, then move on to the next one - and thus eradicate them by refactoring, while keeping a largely working application during the process.
I've used SED for this type of thing, but we have unique names for all our tables and all our columns, and we use variable names within our application that match the database column names - so I would have high confidence that changing xxx_yyy_ID to aaa_bbb_ID in our application would work well, and not have accidental side effects.
If you have actual column/table names like "Sales" and "Orders", I think that something like SED would be risky.
Ok, so my basic understanding of your problem is something like this:
You have three different databases (i.e. Sales, Manu, Inventory)
They have distinct table & procedure names (no table/proc names in Sales exist in Manu or Inventory)
You want all the tables/procs from all three databases in a single database (i.e. SaleManInv)
Some stored procedures in each database explicitly refer to tables in the other databases (i.e. Sales.dbo.lookupItem() explicitly refers to Inventory.dbo.Items table)
Exporting and importing the tables doesn't seem like it will be a problem, what I would do for the procs:
Export one proc from the SQL Server 2000 db to the SQL Server 2005 DB to determine if you need to get rid of the ".dbo." portion of the cross references.
Export all the procs to text files (same folder for all procs)
Use a text editor with a "Search and Replace in Files" feature (I use PSPad) and replace all the "Sales.dbo." references with "SaleManInv.dbo.", then all the "Inventory.dbo." references with "SaleManInv.dbo.", etc., to convert all the references to the new db.
Then run the exported and modified proc scripts against your new db.
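If you'd rather keep the rewrite inside SQL, a rough alternative is to generate corrected definitions for review on the SQL Server 2005 side once the procs have been imported there (sys.sql_modules doesn't exist in 2000); the database names come from the example above:

SELECT REPLACE(REPLACE(REPLACE(m.definition,
           'Sales.dbo.',     'SaleManInv.dbo.'),
           'Manu.dbo.',      'SaleManInv.dbo.'),
           'Inventory.dbo.', 'SaleManInv.dbo.') AS RewrittenDefinition
FROM sys.sql_modules AS m
JOIN sys.objects     AS o ON o.object_id = m.object_id
WHERE o.type = 'P';   -- stored procedures; add 'V', 'FN', etc. for views and functions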
Is that making any sense? :-)
I was in a similar position where I had several SQL Server 2008 databases that were merged into one. My solution was to use Integration Services' Transfer SQL Server Objects task into a new target database. All data was copied over along with the tables. Afterwards, in what was a very complex query, I scripted out all stored procedures/functions/views/etc. to a file, changed all cross-database references, and re-created the stored procedures and other objects.
The trick with the stored procedures was to script them out in dependency order, to ensure that stored procedures or functions that referenced other stored procedures/functions internally were created last.
If there was a tool that I felt could have handled this task in an automated fashion, I would have purchased it immediately.
I would like to know whether it's the same kind of data. Anyway, I would create a new column named 'SourceSystem'. So when the boss comes running, asking:
" - what was the sales diff between databasesystem1 and db2 in 2004"
then you can answer that. Then, in a year or two, if that question doesn't pop up, you can delete the column. Merging data removes the origin of the data.
There is a SQL Server 2000 database we have to update during the weekend.
Its size is almost 10 GB.
The updates range from schema changes and primary key updates to some millions of records being updated, corrected, or inserted.
The weekend is hardly enough for the job.
We set up a dedicated server for the job,
switched the database to SINGLE_USER,
and made every optimization we could think of: dropping/recreating indexes, relations, etc.
Can you propose anything to speedup the process?
SQL Server 2000 is not negotiable (not my decision). Updates are run through a custom-made program, not BULK INSERT.
EDIT:
Schema updates are done via Query Analyzer T-SQL scripts (one script per version update).
Data updates are done by a C# .NET 3.5 app.
Data comes from a bunch of text files (with many problems) and is written to the local DB.
The computer is not connected to any Network.
Although dropping excess indexes may help, you need to make sure that you keep those indexes that will enable your upgrade script to easily find those rows that it needs to update.
Otherwise, make sure you have plenty of memory in the server (although SQL Server 2000 Standard is limited to 2 GB), and if need be pre-grow your MDF and LDF files to cope with any growth.
If possible, your custom program should be processing updates as sets instead of row by row.
EDIT:
Ideally, try and identify which operation is causing the poor performance. If it's the schema changes, it could be because you're making a column larger and causing a lot of page splits to occur. However, page splits can also happen when inserting and updating for the same reason - the row won't fit on the page anymore.
If your C# application is the bottleneck, could you load the changes into a staging table first (before your maintenance window), and then perform a single set-based update onto the actual tables? A single update of 1 million rows will be more efficient than an application making 1 million update calls. Admittedly, if you need to do this this weekend, you might not have a lot of time to set it up.
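A sketch of that staging-table approach; the table and column names are placeholders:

-- 1) Before the window: bulk-load the corrections from the text files into dbo.Staging_Products.
-- 2) During the window: apply them in one set-based statement instead of a million single calls.
UPDATE t
   SET t.Price       = s.Price,
       t.Description = s.Description
  FROM dbo.Products          AS t
  JOIN dbo.Staging_Products  AS s
    ON s.ProductID = t.ProductID;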
What exactly does this "custom made program" look like? i.e. how is it talking to the data? Minimising the amount of network IO (from a db server to an app) would be a good start... typically this might mean doing a lot of work in TSQL, but even just running the app on the db server might help a bit...
If the app is re-writing large chunks of data, it might still be able to use bulk insert to submit the new table data. Either via command-line (bcp etc), or through code (SqlBulkCopy in .NET). This will typically be quicker than individual inserts etc.
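For the T-SQL side of that, a hedged sketch of bulk-loading the raw text file into a staging table (the file path, delimiters, and table name are placeholders); the "problem" rows can then be cleaned up with set-based queries before the final update:

BULK INSERT dbo.Staging_Products
FROM 'D:\import\products.txt'
WITH (
    FIELDTERMINATOR = '\t',
    ROWTERMINATOR   = '\n',
    TABLOCK,
    BATCHSIZE       = 50000
);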
But it really depends on this "custom made program".