Mirroring legacy FlashFiler2 database to SQL Server

I maintain a very old data acquisition system which uses a legacy FlashFiler2 database. One of my customers would like the database tables to be mirrored into a SQL Server database for easier post-processing.
There are two types of tables:
1) Measured values: these tables get new timestamped data every day and my strategy would be to record the timestamp of the last data record that I have already mirrored for every measuring point and only add the new ones to the target database.
2) Mostly static tables: these tables rarely change and the records don't bear a timestamp. Maybe think of a customers table that rarely gets new entries and the existing records are changed very rarely.
To handle case 2 by brute force, I would have to either clear the target table and recreate it every day or compare each and every record for changes, detect deleted records and add new ones.
What is an efficient way to accomplish this task?
In a related post I found the idea to create MD5 hashes of the target records. The hash could then be used to compare records for changes. Of course I would still have to check for added and deleted records. Would this be worth the effort or should I go with one of the brute force methods?
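A minimal sketch of the hash idea, assuming each table has a primary key and that one hash per mirrored record is stored alongside the target. The names `row_hash` and `diff_tables` and the dict-based row layout are hypothetical and only illustrate the comparison; they are not part of FlashFiler2 or ADO.NET.

```python
import hashlib


def row_hash(row):
    """Return an MD5 fingerprint of a row (a tuple of column values)."""
    # The unit separator keeps ("ab", "c") distinct from ("a", "bc").
    # Assumption: None and "" are treated as the same value here.
    joined = "\x1f".join("" if v is None else str(v) for v in row)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def diff_tables(source, target_hashes):
    """Compare source rows with the hashes stored for the target.

    source:        {primary_key: row_tuple} read from the source table
    target_hashes: {primary_key: md5_hex} kept alongside the mirrored rows
    Returns (added, changed, deleted) as sorted lists of primary keys.
    """
    added = sorted(k for k in source if k not in target_hashes)
    deleted = sorted(k for k in target_hashes if k not in source)
    changed = sorted(
        k for k, row in source.items()
        if k in target_hashes and row_hash(row) != target_hashes[k]
    )
    return added, changed, deleted


# Example data: key 2 was edited, key 3 was deleted, key 4 is new.
source = {1: ("Alice", "Berlin"), 2: ("Bob", "Paris"), 4: ("Dan", "Oslo")}
mirrored = {1: ("Alice", "Berlin"), 2: ("Bob", "Rome"), 3: ("Cy", "Lima")}
target_hashes = {k: row_hash(r) for k, r in mirrored.items()}

added, changed, deleted = diff_tables(source, target_hashes)
print(added, changed, deleted)
```

Note that the source table still has to be read in full to compute the hashes, so the saving is on the comparison and on writes to the target, not on reads from the source.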
My tools are: Visual Studio 2017, C# with the ADO.NET provider for SQL Server and FlashFiler2 Delphi components.

I ended up with the following:
1) for all tables that contain rarely changing data I chose to recreate them completely. The performance is indeed better than I had thought.
2) for the measured values tables I chose a configurable lookback period; all data from the start of that period onward are copied chronologically to the target database. This is mostly redundant because the lookback intentionally includes data that have already been transferred. It is necessary, however, because data in the source that have already been transferred sometimes need to be corrected. In my case those corrections usually occur within one or two months, so a lookback of 3 months is usually enough.
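The lookback could be sketched roughly as follows. This is only an illustration under assumptions: rows are modelled as `(timestamp, value)` tuples, and the function names `lookback_cutoff` and `apply_lookback` are made up for the example.

```python
from datetime import datetime, timedelta


def lookback_cutoff(now, lookback_days):
    """Oldest timestamp that is still re-copied on this run."""
    return now - timedelta(days=lookback_days)


def apply_lookback(source_rows, target_rows, cutoff):
    """Replace everything in the target at or after the cutoff.

    Both inputs are lists of (timestamp, value) tuples. Target rows older
    than the cutoff are kept untouched; newer ones are discarded and
    re-copied from the source in chronological order.
    """
    kept = [r for r in target_rows if r[0] < cutoff]
    recopied = sorted(r for r in source_rows if r[0] >= cutoff)
    return kept + recopied


now = datetime(2019, 6, 1)
cutoff = lookback_cutoff(now, 90)

source_rows = [
    (datetime(2019, 1, 10), 1.0),
    (datetime(2019, 4, 1), 2.5),   # corrected in the source after transfer
    (datetime(2019, 5, 20), 3.0),  # not yet transferred
]
target_rows = [
    (datetime(2019, 1, 10), 1.0),
    (datetime(2019, 4, 1), 2.0),   # stale value from an earlier run
]

mirrored = apply_lookback(source_rows, target_rows, cutoff)
```

The correction inside the window replaces the stale value, while the row outside the window is left alone.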
The main performance bottleneck now seems to be my record by record insertion into the target database. I guess there are batch routines for this but that would be a topic for another post.
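To illustrate the difference between record-by-record and batched insertion, here is a rough sketch that uses Python's built-in SQLite driver as a stand-in for the target database. The table name `measurements`, its columns and the batch size are assumptions. With the ADO.NET provider, the `SqlBulkCopy` class would be the natural thing to look at for the same purpose.

```python
import sqlite3
from itertools import islice


def batches(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def insert_batched(conn, rows, batch_size=1000):
    """Insert rows in batches, committing once per batch."""
    total = 0
    for chunk in batches(rows, batch_size):
        conn.executemany(
            "INSERT INTO measurements (point_id, ts, value) VALUES (?, ?, ?)",
            chunk,
        )
        conn.commit()  # one commit per batch instead of one per row
        total += len(chunk)
    return total


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (point_id INTEGER, ts TEXT, value REAL)")

# 2500 fake measurements, produced lazily so they are never all in memory.
rows = ((i % 5, str(i), i * 0.1) for i in range(2500))
inserted = insert_batched(conn, rows, batch_size=1000)
count = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]
```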

Related

Is there a way I can fast load data with SSIS?

I'm moving data from an ODBC source to an OLE DB Destination; records get inserted every day on the ODBC side in different tables. The packages get slower and slower: it takes about a day for a million records, sometimes more. The tables can have new rows inserted or existing rows updated, and the loading and looking up of new data slows the process. Is there any way I can fast-track the ETL process, or is there any open source platform I can use to load the data faster?
I tried to count the number of rows in the OLE DB Destination so that I could insert only the records in the ODBC Source beyond that count, but to my surprise the ROW_NUMBER() function isn't supported in OpenEdge ODBC.
Based on the limited information in your question, I'd design your packages like the following
SEQC PG to SQL
The point of these operations is to transfer data from our source system verbatim to the target. The target table should be brand new and the SQL Server equivalent of the PG table from a data type perspective. Clustered Key if one exists, otherwise, see how a heap performs. I am going to reference this as a staging table.
The Data Flow itself is going to be bang simple
By default, the destination will perform a fast load and lock the table.
Run the package and observe times.
Edit the OLE DB Destination and change the Maximum Commit Size to something less than 2147483647. Try 100000 - is it better, worse? Move up/down an order of magnitude until you have an idea of the fastest the package can move data.
There are a ton of variables at this stage of the game: how busy is the source PG database, what are the data types involved, how far does the data need to travel from the Source, to your computer, to the Destination. But this can at least help you answer "can I pull (insert large number here) rows from the source system within the expected tolerance?" If you can get the data moved from PG to SQL within the expected SLA and you still have processing time left, then move on to the next section.
Otherwise, you have to rethink your strategy for what data gets brought over. Maybe there are reliable (system generated) insert/update times associated with the rows. Maybe it's a financial-like system where rows aren't updated, just new versions of the row are inserted and the net values are all that matters. Too many possibilities here but you'll likely need to find a Subject Matter Expert on the system - someone who knows the logical business process the database models as well as how the data is stored in the database. Buy that person some tasty snacks because they are worth their weight in gold.
Now what?
At this point, we have transferred the data from PG to SQL Server and we need to figure out what to do with it. 4 possibilities exist
The data is brand new. We need to add the row into the target table
The data is unchanged. Do nothing
The data exists but is different. We need to change the existing row in the target table
There is data in the target table that isn't in the staging table. We're not going to do anything about this case either.
Adding data, inserts, are easy and can be fast - it depends on table design.
Changing data, updates, are less easy in SSIS and are slower than adding new rows. Slower because behind the scenes, the database will delete and add the row back in.
Non-Clustered indexes are also potential bottlenecks here, but they can also be beneficial. Welcome to the world of "it depends"
Option 1 is to just write the SQL statements to handle the insert and update. Yes, you have a lovely GUI tool for creating data flows but you need speed and this is how you get it (especially since we've already moved all the data from the external system to a central repository)
Option 2 is to use a Data Flow and potentially an Execute SQL Task to move the data. The idea being, the Data Flow will segment your data into New, which will use an OLE DB Destination to write the inserts, and Updates. For the updates, it depends on volume what makes the most sense from an efficiency perspective. If it's tens, hundreds, thousands of rows to update, eh, take the performance penalty and use an OLE DB Command to update the row. Maybe it's hundreds of thousands and the package runs well enough, then keep it.
Otherwise, route your changed rows to yet another staging table and then do a mass update from the staged updates to the target table. But at this point, you just wrote half the query you needed for the first option so just write the Insert and be done (and speed up performance because now everything is just SQL Engine "stuff")
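As a rough illustration of the set-based approach (Option 1), the sketch below runs one UPDATE and one INSERT from a staging table into a target table. It uses Python's built-in SQLite driver purely as a stand-in, so the SQL dialect differs from T-SQL, and the table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE target  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE staging (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO target  VALUES (1, 'unchanged'), (2, 'old value'), (9, 'only in target');
    INSERT INTO staging VALUES (1, 'unchanged'), (2, 'new value'), (3, 'brand new');
""")

# Changed rows: a single UPDATE covers every staged row whose value differs.
# Note: "<>" does not detect changes to or from NULL; real code would need
# to handle nullable columns explicitly.
conn.execute("""
    UPDATE target
       SET name = (SELECT s.name FROM staging s WHERE s.id = target.id)
     WHERE EXISTS (SELECT 1 FROM staging s
                    WHERE s.id = target.id AND s.name <> target.name)
""")

# New rows: a single INSERT covers every staged key missing from the target.
conn.execute("""
    INSERT INTO target (id, name)
    SELECT s.id, s.name FROM staging s
     WHERE NOT EXISTS (SELECT 1 FROM target t WHERE t.id = s.id)
""")
conn.commit()

# Row 9 exists only in the target and is deliberately left alone.
result = conn.execute("SELECT id, name FROM target ORDER BY id").fetchall()
```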
You might want to investigate Progress' Change Data Capture feature. If you have a modern release of OpenEdge (11.7 or better) and the proper licenses you can enable CDC policies to track changes. Your ETL process could then use that information to target its efforts.
Warning: it's complicated. There is a lot more to actually doing it than marketing would have you believe. But if your use-case is straight-forward it might not be too terrible.
Or you could implement Progress "Pro2" product to do all the dirty work for you. (That's an extra cost option.)

How do I handle rows that were deleted from the source using SSIS Slowly Changing Dimension

I am trying to implement an ETL process for our Type 1 slowly changing dimension tables in a SQL 2014 database. The load needs to happen across servers, and I would prefer not to use linked servers.
I have been looking for ways to do this in SSIS and found the slowly changing dimension wizard which works fine except that this seems to only allow either inserting new rows or updating rows where there is a match on the business key, however I haven't found a place where it allows me to handle when a record exists in the dimension table but was deleted from the source. I would like to make sure these are deleted. Am I missing something? Has anyone found a better way to handle this in SSIS?
I know that I could just dump everything into another table on the destination server and write a TSQL merge, but it just seems like there should be a simple way to do this in SSIS.
First, I would avoid the SCD functionality in SSIS, as its performance tends to be terrible - I've actually been told to avoid it by MS certified trainers, as well as plenty of people with a lot of experience. It's OK-ish on very small dimensions, but quickly tends to become unmanageable. There's a blog post here from someone who thinks it's usable in some situations, but even they suggest using a staging table for updates.
If you want to do this in SSIS you could use a Lookup to find the rows that need to be deleted (find the rows in your destination which aren't in the source using the no match output), then an OLE DB Command to delete them. But I'd give some serious thought to simply moving the data over to a staging area and doing this in TSQL, because SSIS will do it row by agonising row. Similarly to the SCD tool - it might be OK on small amounts of data, but if you're dealing with larger amounts (or might be in future), it may well become unmanageable.
If you don't want to move all of the data over to a staging area, you could use SSIS to build up a table only holding the unique IDs of the rows that need deleting, then fire off an Execute SQL Task from the Control Flow to delete them all at once.
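The key-table variant might look roughly like this. It is a sketch only: SQLite stands in for the destination server, and the table names `dim_customer` and `keys_to_delete` as well as the sample keys are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (business_key TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE keys_to_delete (business_key TEXT PRIMARY KEY);
    INSERT INTO dim_customer VALUES ('C1', 'Ann'), ('C2', 'Ben'), ('C3', 'Cat');
""")

# Business keys currently present in the source system. In practice these
# would be read over a separate connection to the source server.
source_keys = {"C1", "C3"}

# Keys that exist in the dimension but no longer in the source.
dest_keys = {row[0] for row in conn.execute("SELECT business_key FROM dim_customer")}
obsolete = sorted(dest_keys - source_keys)

# Stage only the keys, then remove all matching rows in one statement
# instead of issuing one DELETE per row.
conn.executemany("INSERT INTO keys_to_delete VALUES (?)", [(k,) for k in obsolete])
conn.execute("""
    DELETE FROM dim_customer
     WHERE business_key IN (SELECT business_key FROM keys_to_delete)
""")
conn.commit()

remaining = [
    r[0] for r in conn.execute("SELECT business_key FROM dim_customer ORDER BY business_key")
]
```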

Copy Multiple Tables into ONE Table (From Multiple Databases)

I've got multiple identical databases (distributed on several servers) and need to gather them to one single point to do data mining, etc.
The idea is to take Table1, Table2, ..., TableN from each database and merge them and put the result into one single big database.
To be able to write queries, and to know which database each row came from, we will add a single column DatabaseID to the target table, describing where the row came from.
Editing the source tables is not an option, it belongs to some proprietary software.
We've got ~40 servers, ~170 databases and need to copy ~40 tables.
Now, how should we implement this given that it should be:
Easy to setup
Easy to maintain
Preferably easy to adjust if database schema changes
Reliable, logging/alarm if something fails
Not too hard to add more tables to copy
We've looked into SSIS, but it seemed that we would have to add each table as a source/transformation/destination. I'm guessing it would also be quite tied to the database schema. Right?
Another option would be to use SQL Server Replication, but I don't see how to add the DatabaseID column to each table. It seems it's only possible to copy data, not modify it.
Maybe we could copy all the data into separate databases, and then to run a local job on the target server to merge the tables?
It also seems like a lot of work if we'd need to add more tables to copy, as we'd have to redistribute new publications for each database (manual work?).
Last option (?) is to write a custom application to our needs. Bigger time investment, but it'd at least do precisely what we'd like.
To make it worse... we're using Microsoft SQL Server 2000.
We will upgrade to SQL Server 2008 R2 within 6 months, but we'd like the project to be usable sooner.
Let me know what you guys think!
UPDATE 20110721
We ended up with an F# program opening a connection to the SQL Server where we would like the aggregated databases. From there we query the 40 linked SQL Servers to fetch all rows (but not all columns) from some tables, and add an extra column to each table to say which DatabaseID the row came from.
Configuration of servers to fetch from, which tables and which columns, is a combination of text file configuration and hard coded values (heh :D).
It's not super fast (sequential fetching so far) but it's absolutely manageable, and the data processing we do afterwards takes far longer time.
Future improvements could be to:
improve error handling if it turns out to be a problem (if a server isn't online, etc).
implement parallel fetching, to reduce the total amount of time to finish fetching.
figure out if it's enough to fetch only some of the rows, like only what's been added/updated.
All in all it turned out to be quite simple, no dependencies to other products, and it works well in practice.
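The parallel-fetching and error-handling improvements listed above could be sketched along these lines. This is not the F# program from the update; `fetch_rows` is a hypothetical stand-in for the per-database query, and the database names and sample rows are invented.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_rows(database_id):
    """Hypothetical stand-in for querying one source database."""
    fake_data = {
        "db1": [("2011-07-01", 10)],
        "db2": [("2011-07-02", 20), ("2011-07-03", 30)],
    }
    if database_id not in fake_data:
        raise ConnectionError(f"{database_id} is offline")
    return fake_data[database_id]


def gather(database_ids, max_workers=4):
    """Fetch all databases in parallel and tag each row with its origin."""
    merged, failures = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_rows, db): db for db in database_ids}
        for future in as_completed(futures):
            db = futures[future]
            try:
                # Prepend the DatabaseID so every row records where it came from.
                merged.extend((db,) + row for row in future.result())
            except Exception as exc:
                # One unreachable server must not stop the others.
                failures[db] = str(exc)
    return sorted(merged), failures


merged, failures = gather(["db1", "db2", "db3"])
```

Collecting failures per database rather than raising keeps the run going and leaves something to log or alert on afterwards.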
Nothing fancy but couldn't you do something like
-- Empty the table rather than dropping it, so the INSERT below still has a target
TRUNCATE TABLE dbo.Merged
INSERT INTO dbo.Merged
SELECT [DatabaseID] = 'Database1', * FROM ServerA.dbo.Table
UNION ALL SELECT [DatabaseID] = 'Database2', * FROM ServerB.dbo.Table
...
UNION ALL SELECT [DatabaseID] = 'DatabaseX', * FROM ServerX.dbo.Table
Advantages
Easy to setup
Easy to maintain
Easy to adjust
Easy to add more tables
Disadvantages
Performance
Reliable logging
We had a similar requirement where we took a different approach. First we created a central database to collect the data. Then we created an inventory table to store the list of target servers / databases. Then we wrote a small VB.NET-based CLR procedure which takes the path of a SQL query, the target SQL instance name and the target table which will store the data (this eliminates the setup of a linked server when new targets are added). It also adds two additional columns to the result set: the target server name and the timestamp when the data was captured.
Then we set up a Service Broker queue/service and pushed the list of target servers to interrogate.
The above CLR procedure is wrapped in another procedure which dequeues the message, executes the SQL on the target server provided. The wrapper procedure is then configured as the activated procedure for the queue.
With this we are able to achieve a bit of parallelism to capture the data.
Advantages:
Easy to set up
Easy to manage (add / remove targets)
Same framework works for multiple queries
Logging tables to check for failed queries
Works independently for each target, so if one target fails to respond, the others still continue
Workflow can be paused gracefully by disabling the queue (for maintenance on the central server) and then resumed by re-enabling it
Disadvantage:
requires good understanding of service brokers.
should properly handle poison messages.
Please Let me know if it helps

SQL Server performance with a large number of tables in database

I am updating a piece of legacy code in one of our web apps. The app allows the user to upload a spreadsheet, which we will process as a background job.
Each of these user uploads creates a new table to store the spreadsheet data, so the number of tables in my SQL Server 2000 database will grow quickly - thousands of tables in the near term. I'm worried that this might not be something that SQL Server is optimized for.
It would be easiest to leave this mechanism as-is, but I don't want to leave a time-bomb that is going to blow up later. Better to fix it now if it needs fixing (the obvious alternative is one large table with a key associating records with user batches).
Is this architecture likely to create a performance problem as the number of tables grows? And if so, could the problem be mitigated by upgrading to a later version of SQL Server ?
Edit: Some more information in response to questions:
Each of these tables has the same schema. There is no reason that it couldn't have been implemented as one large table; it just wasn't.
Deleting old tables is also an option. They might be needed for a month or two, no longer than that.
Having many tables is not an issue for the engine. The catalog metadata is optimized for very large sizes. There are also some advantages on having each user own its table, like ability to have separate security ACLs per table, separate table statistics for each user content and not least improve query performance for the 'accidental' table scan.
What is a problem though is maintenance. If you leave this in place you must absolutely set up a task for automated maintenance; you cannot leave this as a manual task for your admins.
I think this is definitely a problem that will be a pain later. Why would you need to create a new table every time? Unless there is a really good reason to do so, I would not do it.
The best way would be to simply create an ID and associate all uploaded data with an ID, all in the same table. This will require some work on your part, but it's much safer and more manageable to boot.
Having all of these tables isn't ideal for any database. After the upload, does the web app use the newly created table? Maybe it gives some feedback to the user on what was uploaded?
Does your application utilize all of these tables for any reporting etc? You mentioned keeping them around for a few months - not sure why. If not move the contents to a central table and drop the individual table.
Once the backend is taken care of, recode the website to save uploads to a central table. You may need two tables. An UploadHeader table to track the upload batch: who uploaded, when, etc. and link to a detail table with the individual records from the excel upload.
I would suggest storing this data in a single table. On the server side you can create a console from which a user/operator can manually start the task of freeing up the table entries. You can ask them for the range of dates whose data is no longer needed and delete that data from the db.
You can go a step further and set up an automated database job to wipe the entries/records after a specified time period. You can again add a UI from which the User/Operator/Admin can set this data validity limit.
Thus you could create the system such that the junk data will be auto deleted after specified time which could again be set by the Admin, as well as provide them with a console using which they can manually delete additional unwanted data.
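A rough sketch of the single-table layout with time-based cleanup, using SQLite as a stand-in for SQL Server. The `UploadHeader` name comes from an earlier answer; `UploadDetail`, the column names and the `purge_older_than` helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE UploadHeader (
        upload_id   INTEGER PRIMARY KEY,
        uploaded_by TEXT NOT NULL,
        uploaded_at TEXT NOT NULL            -- ISO 8601 date
    );
    CREATE TABLE UploadDetail (
        upload_id INTEGER NOT NULL REFERENCES UploadHeader(upload_id),
        row_no    INTEGER NOT NULL,
        payload   TEXT,
        PRIMARY KEY (upload_id, row_no)
    );
    INSERT INTO UploadHeader VALUES (1, 'ann', '2019-01-15'), (2, 'bob', '2019-04-02');
    INSERT INTO UploadDetail VALUES (1, 1, 'a'), (1, 2, 'b'), (2, 1, 'c');
""")


def purge_older_than(conn, cutoff_date):
    """Delete every upload batch made before cutoff_date (ISO date string)."""
    # Detail rows first, then the headers they belong to.
    conn.execute(
        """DELETE FROM UploadDetail
            WHERE upload_id IN (SELECT upload_id FROM UploadHeader
                                 WHERE uploaded_at < ?)""",
        (cutoff_date,),
    )
    conn.execute("DELETE FROM UploadHeader WHERE uploaded_at < ?", (cutoff_date,))
    conn.commit()


purge_older_than(conn, "2019-03-01")
headers = conn.execute("SELECT upload_id FROM UploadHeader").fetchall()
details = conn.execute("SELECT upload_id, row_no, payload FROM UploadDetail").fetchall()
```

The same function could be called from a scheduled job with a configurable retention period, or from an admin console with a date chosen by the operator.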

Migrate and Merge several databases into one

In an update project I have to do the following:
Move 3 databases from SQL2000 to SQL2005 and merge them at the same time. There are already quite a few cross database queries used in SP's and Views.
The current plan is to move each of the old databases into a separate schema in 1 database.
That means we will also have to change our current SP's and Views, we now have:
SELECT OrderId, OrderDate FROM Sales.dbo.Orders
and expect we will have to change that into
SELECT OrderId, OrderDate FROM Sales.Orders
The question is: how do we do that as automated as possible?
I know about SED and similar for changing the scripts. I would welcome tips about how to be 'smart' about this, like strategies for partitioning the scripts, performance (tons of INSERT INTO lines) etc.
Note: I did look at the Import/Export Wizard but apparently I would have to set the Schema manually on each output table and fix the SP's through ALTER scripts anyway.
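One possible way to automate the rename on scripted-out procedures and views is a regular expression that only matches the full `Database.dbo.` prefix. This is a sketch under assumptions: the list of database names is hypothetical (only `Sales` appears in the question), and the pattern does not distinguish code from comments or string literals, so the output would still need review.

```python
import re

# Hypothetical names of the databases being merged into schemas.
OLD_DATABASES = ["Sales", "Manu", "Inventory"]

# Matches "Sales.dbo." and "[Sales].[dbo]." but not "PreSales.dbo." or a
# four-part name such as "Server.Sales.dbo." (the lookbehind rejects a
# preceding word character, dot or opening bracket).
PATTERN = re.compile(
    r"(?<![\w.\[])\[?("
    + "|".join(re.escape(name) for name in OLD_DATABASES)
    + r")\]?\.\[?dbo\]?\.",
    re.IGNORECASE,
)


def rewrite_references(sql):
    """Turn Database.dbo.Object into Database.Object (database becomes schema)."""
    return PATTERN.sub(lambda m: m.group(1) + ".", sql)


before = "SELECT OrderId, OrderDate FROM Sales.dbo.Orders"
after = rewrite_references(before)
print(after)
```

Anchoring on the whole three-part prefix, rather than on a bare table or database name, is what keeps the replacement from touching unrelated identifiers.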
I did this a couple of years ago, and I ran into a few problems that you want to be aware of.
Assumptions:
You've got a single SQL 2000 database server with 3 databases, A/B/C
You want all of the objects to end up in SQL 2005 in database A (we'll refer to that as the Target)
You want to get rid of databases B and C eventually (the old Sources)
You don't have a full-blown test environment where you can automatically restore your production databases every day, and script this again and again until it's right. (That's the best way, and I've taken that approach too, but it's labor-intensive.)
Here's my hard lessons learned:
Don't do the merge and the SQL 2005 change the same day. Either do the merge before you go to 2005, or after, but don't try to accomplish it all in a single outage. It'll be a finger-pointing mess. If it was me, I'd go to 2005 first just to get it out of the way. That way, I know anything that breaks isn't because of a schema change, and those types of breaks are easier to fix. You want at least a week of end user activity on the 2005 box before you declare victory and move on to the merge.
Build the new objects in Target ahead of time. Even if they're not being queried in your live production apps, go ahead and build 'em now. That way you can populate fake test data in there to test your applications ahead of time. Yes, this means mixing live and test data, but frankly, you're already out there working without a net. Be wary of identity fields, though, since you can end up with conflicting records with the same identity number but different data in the Target and Source databases.
Create views in Target ahead of time. You mentioned that you've got views that already do cross-database queries. Copy those from Source to Target now, and tell any other developers (report guys, power users) to start referring to the Target views instead. This isn't going to speed up your own work, but it speeds up THEIR work. If you can get to the point where you can verify that they're only hitting Target (even though the Target views still point to tables in Source) then it'll make troubleshooting easier on migration day. Then you can start denying permissions on the Source views ahead of time.
Sync tables ahead of time. Make a list of all of the tables that need to be moved out of the Sources, and for each one, analyze how it's being updated. If it's only being inserted into (not updated or deleted), like a log table, then write a T-SQL script to start keeping it in sync in Target. Run that script via a SQL Agent job during periods of low activity on your server, like nightly. This way, when it's go-live day, you won't have to push as many records around, meaning your go-live window will be smaller and your Target transaction logs can stay smaller. Tables that are being constantly updated or deleted aren't as easy, and it's up to you whether you decide to sync those as well. We did it for any tables over a million lines.
Check for record conflicts between the Source databases. It sounds like this one doesn't apply to you specifically, but I'm noting it here in case anybody else does a merge and they're reading it for tips. If you have more than one Source database, dump out the list of objects. If you've got two objects with the same name, check their schema. I've worked with instances where they had a State or Region table in each database, and they were supposed to be identical, but they had identity fields for their primary keys. Each child table (like Customers, which linked to a Region table) referred to the parent table (Region) by the primary key (identity field) - which didn't match from one database to the other. In that case, the smart thing to do is take an outage window ahead of time, before the migration day, to clean those records up with manual update scripts.
Disable any constraints or foreign key relationships
Change the identity fields (if they're lookup tables, you may be able to turn off the identity stuff and just run with manually specified pk numbers)
Modify the Region table to add a NewID field, matching to what it's going to become, and an OldID field, showing what it used to be
Update all of the child tables (Customers) to use the NewID number instead of the original
Update the Region table so that the real ID field now has the NewID value, and the OldID field has what the Region used to be. (You're probably going to screw something up like miss a child table you didn't know about, and you're going to wonder what it used to be.)
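The remapping steps above can be illustrated with plain dictionaries. This is a sketch only: it assumes regions can be matched between databases by name, and the helper names and sample data are invented.

```python
def build_id_map(source_regions, target_regions):
    """Map each source Region ID to the target ID of the same-named region."""
    target_by_name = {name: rid for rid, name in target_regions.items()}
    # A region name missing from the target raises KeyError, which is the
    # kind of conflict that should be resolved before migration day.
    return {old_id: target_by_name[name] for old_id, name in source_regions.items()}


def remap_children(customers, id_map):
    """Return customers with region_id rewritten, keeping the old value."""
    return [
        {**c, "region_id": id_map[c["region_id"]], "old_region_id": c["region_id"]}
        for c in customers
    ]


# Same region names in both databases, but different identity values.
target_regions = {1: "North", 2: "South", 3: "West"}
source_regions = {1: "South", 2: "West"}
customers = [
    {"name": "Acme", "region_id": 1},
    {"name": "Bolt", "region_id": 2},
]

id_map = build_id_map(source_regions, target_regions)
remapped = remap_children(customers, id_map)
```

Keeping `old_region_id` on each child row mirrors the OldID field: if a child table was missed, there is still a record of what the key used to be.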
Break the migration into pieces. List every stored proc in all of the databases. If any of them can be moved without moving data, do that first. For example, if you've got Source.dbo.usp_RunReport, and it only refers to tables in the Target database, then do that in a first phase. If you've got small system lookup tables that are only used internally in your app, not visible to customers or reports, then put that in the first phase too. It sounds like it's too small to bother with, but the idea is to reduce the amount of panic on migration day. The less you wonder about, the better you can troubleshoot. We moved every static lookup table (State, Region, Calendar, etc) over ahead of time. The amount of work required in Phase 1 - just moving those small, static tables - got management to understand how huge it was going to be to move the rest, and it bought us resources and time we wouldn't have gotten otherwise.
Pre-grow the data files for Target. If you're not using SQL 2005's new Instant File Initialization, data file growths take quite a while. Enable Instant File Initialization if you've got a choice, then grow the data files to make sure they're not fragmented. If they just grow naturally during your migration day, they can be fragmented. If you can't use Instant File Initialization, you still need to pre-grow the files, but you want to do that ahead of time during periods of low activity to speed up the maintenance window.
On migration day, run your inserts one table at a time, or smaller. You want to keep your insert transactions as tight as possible. The smaller your insert transactions, the less space you'll need in the transaction log. Remember that the transaction log will grow with insert statements even in simple mode. After every round of inserts, do a sanity check to make sure that they worked, and that you're not going to run out of drive space for data files or t-log files.
After the updates finish, change security on the Source databases. Put every non-SA login into the db_denydatareader and db_denydatawriter roles in the Source databases. That way they can still log in if they've hard-coded the database name in the connection string, but they won't be able to do anything. This makes your troubleshooting easier too: if an app or a query runs into problems, you could consider taking their login out of the deny roles and see if it works - if it does, it's borked. The risk with that is that they might run a transaction that uses the Source database data to update the Target database (get customers from Source, update them in Target) and it might cause issues.
Other options for the Source databases are:
Rename them, so you can still query 'em but the apps won't touch 'em
Detach them, but keep the files available in case you need to troubleshoot
Strip out all logins, and use new logins to access the existing databases just in case. Then if somebody's read-only report is totally borked, you can let it work temporarily by issuing them a new login and telling them it's referring to the wrong database.
After the updates finish, rebuild indexes & statistics on Target. If you're just doing continuous inserts, this isn't a big deal, but if you're merging multiple databases (like two Sales databases that had been broken up into regions of the country) then you'll want to clean things up.
IMHO, use one schema unless you can justify a gain from multiple schemas. This last one is just my two cents, but it sounds like you're going through an awful lot of work to go from 3 databases 1 schema each, to 1 database with 3 schemas. If you're not really sure about the 3 schema thing, you might consider using 1 schema - or else you'll be in another messy rework later on down the road. 3 schemas does make sense if you have specific security needs, but otherwise, just make sure you're getting the bang for the buck that you want. Now would be a great time to go to one schema.
You could give Redgate SQL Compare and Data Compare a shot. They have a schema mapping feature that should let you map the dbo schema to the sales schema in another and then move the tables and procs. It would make it so you don't have to mess with the SQL export wizard. You still would have to refactor your other objects though.
I love these two tools.
edit:
I think you can get a fully functional demo too.
edit:
Additionally, they offer SQL Refactor, which does a 'smart' rename. Score!
Could you have a dummy database called SALES that has a VIEW called [Orders]:
-- CREATE VIEW does not accept a database prefix, so switch to the dummy database first
USE Sales
GO
CREATE VIEW dbo.Orders
AS
SELECT OrderId, OrderDate, ...
FROM CombinedDatabase.Sales.Orders
and then
SELECT ... FROM Sales.dbo.Orders
will still work.
You won't be able to INSERT / UPDATE that table without some further jiggery-pokery though.
If you could have such VIEWs log that they were used that would enable you to fix the code that called them!! but I can't think of a way to do that; however you could disable each in turn, run some tests, fix whatever is broken, then move on to next one ... and thus eradicate them by refactoring, but have a largely working application during the process.
I've used SED for this type of thing, but we have unique names for all our tables and all our columns, and we use variable names within our application that match the database column names - so I would have high confidence that changing xxx_yyy_ID to aaa_bbb_ID in our application would work well, and not have accidental side effects.
If you have actual column/table names like "Sales" and "Orders" I think that something like SED would be risky
Ok, so my basic understanding of your problem is something like this:
You have three different databases (i.e. Sales, Manu, Inventory)
They have distinct table & procedure names (no table/proc names in Sales exist in Manu or Inventory)
You want all the tables/procs from all three databases in a single database (i.e. SaleManInv)
Some stored procedures in each database explicitly refer to tables in the other databases (i.e. Sales.dbo.lookupItem() explicitly refers to Inventory.dbo.Items table)
Exporting and importing the tables doesn't seem like it will be a problem, what I would do for the procs:
Export one proc from the SQL Server 2000 db to the SQL Server 2005 DB to determine if you need to get rid of the ".dbo." portion of the cross references.
Export all the procs to text files (same folder for all procs)
Use a text editor with a "Search and Replace in Files" feature (I use PSPad) and replace all the "Sales.dbo." with "SaleManInv.dbo.", then all the "Inventory.dbo." with "SaleManInv.dbo." etc. to convert all the references to the new db.
Then run the exported and modified procs into your new db.
Is that making any sense? :-)
I was in a similar position where I had several SQL Server 2008 databases that were merged into 1. My solution was to use Integration Services' Transfer Server Objects task into a new target database. All data was copied over along with tables. Afterwards - in what was a very complex query, I scripted out all stored procedures/functions/views/etc. to a file and changed all cross-database references and re-created the stored procedures and other objects.
The trick with the stored procedures was to script them out in the order of sysconstraints, to ensure that stored procedures or functions that referenced other stored procedures/functions internally were created last.
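Creating objects so that dependencies come first is a topological sort. A minimal sketch, assuming the dependency map has already been extracted from the database by some other means; the object names below are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: object -> objects it references internally.
dependencies = {
    "usp_RunReport": {"fn_FormatDate", "usp_GetOrders"},
    "usp_GetOrders": {"fn_FormatDate"},
    "fn_FormatDate": set(),
}

# static_order() yields every object after all of the objects it depends on,
# so referenced procedures and functions are created before their callers.
creation_order = list(TopologicalSorter(dependencies).static_order())
print(creation_order)
```

A circular reference between two procedures would raise `graphlib.CycleError`, which is a useful signal that those objects need manual handling.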
If there was a tool that I felt could have handled this task in an automated fashion, I would have purchased it immediately.
I would like to know if it's the same kind of data. Anyway, I would create a new column named 'SourceSystem'. So when the boss comes running and asks:
" - what was the sales diff between databasesystem1 and db2 in 2004?"
then you can answer that. In a year or two, if that kind of question doesn't pop up, you can delete the column. Merging data removes the origin of the data.

Resources