I'm using SSIS to migrate data from a legacy system to a replacement system.
Both old and new systems have a hierarchy set up along the lines of Company has many sites, sites have many locations etc etc. Each file to be migrated will only contain one company. Using SSIS, given that the ID of the site needs to ripple down to location and location ID to the levels below what is the best approach. I was thinking nested foreach loops but being to SSIS I have no idea if this is the right way to go. Perhaps I am missing some clever feature of SSIS that can handle this?
In the end I've used a version of the technique detailed in this article
http://agilebi.com/jwelch/2010/10/29/insert_parent_child_pattern1/
I created views in the source database that return the data I need including a guid I can use to get hold of the inserted ID from the target database. I then use this in a lookup transform to get the ID I need and populate this into the target table. I am going to create a dataflow for each level following this pattern. It is working fine for the first couple of layers of the hierachy.
Related
I have source tables in Snowflake and Destination tables in Snowflake.
I need to load data from source to destination using ADF.
Requirement: I need to load data using single pipeline for all the tables.
Eg: For suppose i have 40 tables in source and load the total 40 tables data to destination tables. I need to create a single pipeline to load all tables at a time.
Can anyone help me in achieving this?
Thanks,
P.
This is a fairly broad question. So take this all as general thoughts, more than specific advice.
Feel free to ask more specific questions, and I'll try to update/expand on this.
ADF is useful as an orchestration/monitoring process, but can be tricky to manage the actual copying and maneuvering of data in Snowflake. My high level recommendation is to write your logic and loading code in snowflake stored procedures
then you can use ADF to orchestrate by simply calling those stored procedures. You get the benefits of using ADF for what it is good at, and allow Snowflake to do the heavy lifting, which is what it is good at.
hopefully you'd be able to parameterize procedures so that you can have one procedure (or a few) that takes a table name and dynamically figures out column names and the like to run your loading process.
Assorted Notes on implementation.
ADF does have a native Snowflake connector. It is fairly new, so a lot of online posts will tell you how to set up a custom ODBC connector. You don't need to do this. Use the native connector and auto resolve integration and it should work for you.
You can write a query in an ADF lookup activity to output your list of tables, along with any needed parameters (like primary key, order by column, procedure name to call, etc.), then feed that list into an ADF foreach loop.
foreach loops are a little limited in that there are some things that you can't nest inside of a loop (like conditionals). If you need extra functionality, you can have the foreach loop call a child ADF pipeline (passing in those parameters) and have the child pipline manage your table processing logic.
Snowflake has pretty good options for querying metadata based on a tablename. See INFORMATION_SCHEMA. Between that and just a tiny bit of javascript logic, it's not too bad to generate dynamic queries (e.g. with column names specific to a provided tablename).
If you do want to use ADF's Copy Activities, I think You'll need to set up an intermediary Azure Storage Account connection. I believe this is because it uses COPY INTO under the hood which requires using external storage.
ADF doesn't have many good options for avoiding running one pipeline multiple times at once. Either be careful about making sure that your code can handle edge cases like this, or that your scheduling/timeouts won't allow for that scenario with a pipeline running too long.
Extra note:
I don't know how tied you are to ADF, but without more context, I might suggest a quick look into DBT for this use case. It's a great tool for this specific scenario of Snowflake to Snowflake processing/transforming. My team's been much happier since moving some of our projects from ADF to DBT. (not sponsored :P )
This is linked to my other question when to move from a spreadsheet to RDBMS
Having decided to move to an RDBMS from an excel book, here is what I propose to do.
The existing data is loosely structured across two sheets in a work-book. The first sheet contains main record. The second sheet allows additional data.
My target DBMS is mysql, but I'm open to suggestions.
Define RDBMS schema
Define, say, web-services to interface with the database so the same can be used for both, UI and migration.
Define a migration script to
Read each group of affiliated rows from the spreadsheet
Apply validation/constraints
Write to RDBMS using the web-service
Define macros/functions/modules in spreadsheet to enforce validation where possible. This will allow use of the existing system while the new comes up. At the same time, ( i hope ) it will reduce migration failures when the move is eventually made.
What strategy would you follow?
There are two aspects to this question.
Data migration
Your first step will be to "Define RDBMS schema" but how far are you going to go with it? Spreadsheets are notoriously un-normalized and so have lots of duplication. You say in your other question that "Data is loosely structured, and there are no explicit constraints." If you want to transform that into a rigourously-defined schema (at least 3NF) then you are going to have to do some cleansing. SQL is the best tool for data manipulation.
I suggest you build two staging tables, one for each worksheet. Define the columns as loosely as possible (big strings basically) so that it is easy to load the spreadsheets' data. Once you have the data loaded into the staging tables you can run queries to assess the data quality:
how many duplicate primary keys?
how many different data formats?
what are the look-up codes?
do all the rows in the second worksheet have parent records in the first?
how consistent are code formats, data types, etc?
and so on.
These investigations will give you a good basis for writing the SQL with which you can populate your actual schema.
Or it might be that the data is so hopeless that you decide to stick with just the two tables. I think that is an unlikely outcome (most applications have some underlying structure, we just have to dig deep enough).
Data Loading
Your best bet is to export the spreadsheets to CSV format. Excel has a wizard to do this. Use it (rather than doing Save As...). If the spreadsheets contain any free text at all the chances are you will have sentences which contain commas, so make sure you choose a really safe separator, such as ^^~
Most RDBMS tools have a facility to import data from CSV files. Postgresql and Mysql are the obvious options for an NGO (I presume cost is a consideration) but both SQL Server and Oracle come in free (if restricted) Express editions. SQL Server obviously has the best integration with Excel. Oracle has a nifty feature called external tables which allow us to define a table where the data is held in a CSV file, removing the need for staging tables.
One other thing to consider is Google App Engine. This uses Big Table rather than an RDBMS but that might be more suited to your loosely-structured data. I suggest it because you mentioned Google Docs as an alternative solution. GAE is an attractive option because it is free (more or less, they start charging if usage exceeds some very generous thresholds) and it would solve the app sharing issue with those other NGOs. Obviously your organisation may have some qualms about Google hosting their data. It depends on what field they are operating in, and the sensitivity of the information.
Obviously, you need to create a target DB and the necessary table structure.
I would skip the web services and write a groovy script which reads the .xls (using the POI library), validates and saves the data in the database.
In my view, anything more involved (web services, GUI...) is not justified: these kinds of tasks are very well suited for scripts because they're concise and extremely flexible while things like performance, code base scalability and such are less of an issue here. Once you have something that works, you will be able to adapt the script to any future document with different data anomalies you run into in a matter of minutes or a few hours.
This is all assuming your data isn't in perfect order and needs to be filtered and/or cleaned.
Alternatively, if the data and validation rules aren't too complex, you can probably get good results with using a visual data transfer tool like Kettle: you just define the .xls as your source, the database table as the table, some validation/filter rules if needed and trigger the loading process. Quite painless.
If you'd rather use a tool that roll your own, check out SeekWell, which lets you write to your database from Google Sheets. Once you define your schema, Select the tables into a Sheet, then edit or insert the records and mark them for the appropriate action (e.g., update, insert, etc.). Set the schedule for the update and you're done. Read more about it here. Disclaimer--I'm a co-founder.
Hope that helps!
You might be doing more work than you need to. Excel spreadsheets can be saved as CVS or XML files and many RDBMS clients support importing these files directly into tables.
This could allow you skip writing web service wrappers and migration scripts. Your database constraints would still be properly enforced during any import. If your RDBMS data model or schema is very different from your Excel spreadsheets, however, then some translation would of course have to take place via scripts or XSLT.
I have an issue where I'm creating a greenfield web application using ASP.NET MVC to replace a lengthy paper form that manually gets (mostly) entered into an existing SQL Server 2005 database. So the front end is the new part, but I'm working against an existing moderately normalized schema. I can easily add new tables, views, etc. to the schema, but modifying tables is going to be near impossible. There's currently at least 2 existing applications (that I'm aware of) that reference this schema and I've stumbled upon at least a dozen "SELECT * FROM..." statements in each. They exist both in code and in views/triggers/stored procs/etc. That's why modifying existing table schemas is a no-go.
All that being said, the form targets different fields in multiple tables in database. It also has to be dynamic enough to allow the end users to add new questions targeting fields. The end users have a rough idea of the existing database schema so they're savvy enough to know how to pick out tables/fields to be targeted.
I'm have a really rough idea of how I could tackle this, but it seems like complete overkill and will be difficult to write up. I'm hoping somebody might have a simple(r) way of handling this sort of project that I haven't thought of.
If users know DB schema maybe you should go with Dynamic Data project and just create a web app front end of that DB to them. So you would only make the model they need and do the application that will display data from those tables with insert/edit capabilities.
But it's completely different story if they have some additional functionality to it.
How to you handle the contents of lookup tables that should be treated as "code" rather than data?
There is currently no good support in VS for Database Developers for managing reference data.
There is a good post at MSDN that proposes a work around for this missing feature that uses a temp table and a post deployment script to merge changes to the reference data at deployment time.
Essentially your reference data exists as a static set of insert statements into your temp table and then the merge keeps the live tables up to date.
I've been using the approach on a large fully CI project for the last five months or so and found it works well.
I've also got my fingers crossed that they will add better support for this in VSDB vNext.
I have a desktop (winforms) application that uses a Firebird database as a data store (in embedded mode) and I use NHibernate for ORM. One of the functions we need to support is to be able to import / export groups of data to/from an external file. Currently, this external file is also a database with the same schema as the main database.
I've already got NHibernate set up to look at multiple databases and I can work with two databases at the same time. The problem, however, is copying data between the two databases. I have two copy strategies: (1) copy with all the same IDs for objects [aka import/export] and (2) copy with mostly new IDs [aka duplicate / copy]. I say "mostly new" because there are some lookup items that will always be copied with the same ID.
Copying everything with new IDs is fine, because I'll just have a "CopyForExport" method that can create copies of everything and not assign new IDs (or wipe out all the IDs in the object tree).
What is the "best practices" way to handle this situation and to copy data between databases while keeping the same IDs?
Clarification: I'm not trying to synchronize two databases, just exporting a subset (user-selectable) or data for transfer to someone else (who will then import the subset of data into their own database).
Further Clarification: I think I've isolated the problem down to this:
I want to use the ISession.SaveOrUpdate feature of NHibernate, so I set up my entities with an identity generator that isn't "assigned". However, I have a problem when I want to override the generated identity (for copying data between multiple databases in the same process).
Is there a way to use a Guid.Comb or UUID generator, but be able to sometimes specify my own identifier (for transferring to a different database connection with the same schema).
I found the answer to my own question:
The key is the ISession.Replicate method. This allows you to copy object graphs between data stores and keep the same identifier. To create new identifiers, I think I can use ISession.Merge, but I still have to verify this.
There are a few caveats though: my test class has a reference to the parent object (many-to-one relationship) and I had to make the class non-lazy-loading to get Replicate to work properly. If I didn't have it set to eager load (non lazy load I guess), it would only replicate the object and not the parent object (cascade="all" in my hbm.xml file).
The java Hibernate docs have a reference to Replicate(), but the NHibernate documentation doesn't (section 10.9 in the java docs).
This makes sense for the Replicate behavior because we want to have fully hydrated entities before transferring them to another data store. What's weird though is that even with both sessions open (one to each data store), it didn't think to hydrate the object when I wanted to replicate it.
You can use FBCopy for this. Just define which tables and columns you want copied and I'll do the job. You can also add optional WHERE clause for each table, so it only copies the rows you want.
While copying it makes sure the order of which data is exported is maintained, so that foreign keys do not break. It also supports generators.