SSIS common destination, multiple file inputs, different structures - file

I'm sure this question is a common one, but I'm having a really challenging time coming up with a modular design for what should not be an impossible task. I have common destination tables, about five or six of them, but multiple input files which need to be massaged into a certain format for insertion. I've been tasked with making the design modular, so that as we work with new data providers with different formats, the pieces of the package that handle the insertion, the error reporting, etc. don't change, just the input side. I've suggested using a common file format, which would mean taking the source files, transforming them, and then running the rest of the common import process on them. It was suggested that I consider using tables for this process on the input side.
I guess what strikes me about this process is the fact that the package can be saved as a template and I can use the common pieces over and over and set up new connections as we work with other data providers. Outside of that, I could see resorting to custom code in a script task to ensure a common format to be inserted into common input tables, but that's as far as I've gotten.
If anyone has ever dealt with a situation as such, I would appreciate design suggestions to accommodate functionality for now and in the future.
Update: I think the layered architectural design that is being emphasized in this particular instance would be as such (which is why I find it confusing):
There would be six layers. They are as follows:
A. File acquisition
B. File Preparation
C. Data Translation to common file format (in XML)
D. Transformation of data to destination format (XML - preparation for insertion into database)
E. Insert into database
F. Post processing (reporting and output of erred out
Since we will be dealing with several different data providers, the sequence of steps would be the same for processing the data, but the individual steps themselves may differ between providers, if that makes sense. Example: provider A sends us zipped CSV files, while provider B's files are uncompressed XML. One provider might send files to us, while for another we may have to go and pick the files up (this would take place in the file acquisition step above).
So my questions are:
A. Is it more important to follow the architectural pattern here or to combine things where possible? One suggestion was to combine all the connection items in a single package as the top layer, so that a single package would handle things like making a service call, SFTP, FTP, and anything else that was needed. I'm not sure quite how one would do multiple connections for different providers when a schedule is needed. It just seems to complicate things... I'm thinking of a connection layer, but one that is specific to the provider, not a be-all and end-all.
B. Same thing with the file preparation layer.
I think a modular design is key, but stacking things into control tasks seems to make things more complicated in design than they should be. Any feedback or suggestions would be helpful in this area.

I would do what was suggested in another comment and import each file into the appropriate temp table first and then union them all later. This will give you more modularity and make adding or removing input files easier and make debugging easier because you can easily see which section failed. Here is an outline of what I would do:
Step 1 SQL Task:
Create the temp table (repeat as needed for each task with a unique table):
IF EXISTS (SELECT * FROM sys.objects
           WHERE object_id = OBJECT_ID(N'[dbo].[tbl_xxxxxx_temp]')
             AND type IN (N'U'))
    DROP TABLE [dbo].[tbl_xxxxxx_temp];
-- No GO here: it is an SSMS/sqlcmd batch separator, not T-SQL, and fails in an Execute SQL Task.
CREATE TABLE [dbo].[tbl_xxxxxx_temp](
    (columns go here)
) ON [PRIMARY];
Step 2: Data Flow Task
Create a Data Flow Task and import each file into its own temp table, which you created above.
Step 3: Data Flow Task
Create a second DFT and connect each temp table to a Union All transformation (converting or deriving columns as needed), and then connect the output to your static database table.
Step 4: SQL Task: Drop temp tables
DROP TABLE tbl_xxxxxx_temp
Please note it is necessary to set "DelayValidation" to True in each Data Flow Task in order for this to work.
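The four steps above can also be sketched outside SSIS. The following is a minimal Python illustration of the same flow, assuming SQLite as a stand-in for SQL Server; the two provider layouts, the staging table names and the `customer` destination table are hypothetical and only serve the example.

```python
import csv
import io
import sqlite3

# Hypothetical inputs: two providers sending the same facts in different layouts.
PROVIDER_A_CSV = "id,full_name\n1,Ann Lee\n2,Bob Ray\n"
PROVIDER_B_CSV = "name;customer_no\nCy Dow;3\n"


def load_staging(conn, table, text, delimiter):
    """Steps 1-2: recreate one staging table and load one file into it as-is."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, data = rows[0], rows[1:]
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)


def run_import(conn):
    conn.execute("CREATE TABLE IF NOT EXISTS customer (id INTEGER, name TEXT)")
    load_staging(conn, "stg_provider_a", PROVIDER_A_CSV, ",")
    load_staging(conn, "stg_provider_b", PROVIDER_B_CSV, ";")
    # Step 3: map each staging layout to the common shape and union them.
    conn.execute(
        """
        INSERT INTO customer (id, name)
        SELECT CAST(id AS INTEGER), full_name FROM stg_provider_a
        UNION ALL
        SELECT CAST(customer_no AS INTEGER), name FROM stg_provider_b
        """
    )
    # Step 4: drop the staging tables.
    conn.execute("DROP TABLE stg_provider_a")
    conn.execute("DROP TABLE stg_provider_b")
    conn.commit()
    return conn.execute("SELECT id, name FROM customer ORDER BY id").fetchall()


conn = sqlite3.connect(":memory:")
result = run_import(conn)
remaining_tables = [
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]
print(result)
```

Adding a provider in this sketch means one more `load_staging` call and one more `SELECT` in the union; the destination insert stays the same, which is the modularity the outline is after.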


Count how many fields were cleansed and which fields on SSIS

I'm doing an exercise in which I have to clean data from a Flat File Source and write it on my Database. I have already managed to clean all of the fields by using some data quality rules for each field and also generate error codes which I write to a different table when a rule is broken.
My problem is that for the final step of the exercise I have to generate some Power BI graphics that show how many fields were fixed from the source and which fields were cleansed. The only things I have thought of are comparing the DB table to the flat file source, or maybe doing something with script components, but I don't really think those are good solutions.
Has anybody encountered this problem? If somebody could point me to some info on something like this, it would be great. Thanks!
If I were facing a similar issue, I would do this in three steps:
Importing data without any transformation to a staging table
Cleaning data and loading it into the destination table
Comparing staging and destination table to get how many values were fixed.
From a design standpoint, establishing a key is central before starting to clean.
You could use an SSIS Derived Column transformation to build a business key by concatenating the available fields into a unique value, using the FINDSTRING function and the other string functions.
Similar to the above step add a column in your staging table or use a derived column (depending on if you are using sql cleanup or ssis tasks to cleanup) to indicate if it was cleaned or not.
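The third step (comparing staging and destination) can be sketched as follows. This is a minimal Python illustration, assuming SQLite as a stand-in for the real database; the table names `staging` and `destination`, the `business_key` column and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE staging (business_key TEXT PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE destination (business_key TEXT PRIMARY KEY, name TEXT, city TEXT);
    -- Raw rows exactly as they arrived in the flat file.
    INSERT INTO staging VALUES
        ('k1', ' ann ', 'Oslo'), ('k2', 'Bob', 'n/a'), ('k3', 'Cy', 'Rome');
    -- The same rows after the cleansing rules ran.
    INSERT INTO destination VALUES
        ('k1', 'Ann', 'Oslo'), ('k2', 'Bob', NULL), ('k3', 'Cy', 'Rome');
    """
)


def count_fixed_values(conn, columns):
    """Return {column: number of rows whose value changed during cleansing}.

    Column names are interpolated into the SQL, so they must come from a
    trusted list, never from user input.
    """
    counts = {}
    for column in columns:
        # IS NOT is SQLite's NULL-safe inequality, so 'n/a' -> NULL counts as a fix.
        sql = (
            "SELECT COUNT(*) FROM staging s "
            "JOIN destination d ON d.business_key = s.business_key "
            f"WHERE s.{column} IS NOT d.{column}"
        )
        counts[column] = conn.execute(sql).fetchone()[0]
    return counts


fixed = count_fixed_values(conn, ["name", "city"])
print(fixed)
```

A per-column result like this could be written to a small reporting table and used as the source for the Power BI visuals.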

Reduce column size and also trim data, in production database, handle constraints/dependencies on same column

I have a scenario where a Java developer has changed the variable that is used to transfer the data from column col of table tbl.
Now I have to change the column from varchar(15) to varchar(10). But before making this change, I have to handle the existing data and the constraints/dependencies on that column.
What should be the best sequence of doing so?
I am thinking to check the constraints first, then trim the existing data and then alter the table.
Please suggest how to handle the constraints/dependencies and, before handling them, how to check for such dependencies.
Schema-evolution (the DDL changes that happen over time to tables and columns in a database, while preserving existing data and functionality) is a well understood topic with several solutions, some of which are RDBMS independent, others are built-in to the RDBMS solution.
A key requirement for production environments is to have both a forward change and a back-out, each of which can be run unattended.
Many open source advocates use Liquibase which also has a commercial variant.
Db2 for Linux/Unix/Windows also offers a built-in stored-procedure SYSPROC.ALTOBJ which helps to automate various schema-evolution alterations, including decreasing the size of a column. You would need to study its documentation carefully and test it fully on non-production environments until you are satisfied. Read about it here
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.1.0/com.ibm.db2.luw.sql.rtn.doc/doc/r0011934.html
You can grow-your-own script of course, in whatever language you prefer, including SQL, but remember you should also build and test a back-out script.
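As a rough illustration of a home-grown forward script paired with a back-out script, here is a minimal Python sketch, assuming SQLite as a stand-in database. The `id` key column and the `tbl_col_backup` table are hypothetical, and the RDBMS-specific `ALTER TABLE` step (plus any constraint handling) is deliberately left out.

```python
import sqlite3

NEW_WIDTH = 10  # target width from the question: varchar(15) -> varchar(10)


def forward(conn):
    """Save every value that would be truncated, then trim it."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS tbl_col_backup (id INTEGER PRIMARY KEY, col TEXT)"
        )
        conn.execute(
            "INSERT INTO tbl_col_backup (id, col) "
            "SELECT id, col FROM tbl WHERE length(col) > ?",
            (NEW_WIDTH,),
        )
        conn.execute(
            "UPDATE tbl SET col = substr(col, 1, ?) WHERE length(col) > ?",
            (NEW_WIDTH, NEW_WIDTH),
        )


def backout(conn):
    """Restore the saved values and remove the backup table."""
    with conn:
        conn.execute(
            "UPDATE tbl SET col = "
            "(SELECT b.col FROM tbl_col_backup b WHERE b.id = tbl.id) "
            "WHERE id IN (SELECT id FROM tbl_col_backup)"
        )
        conn.execute("DROP TABLE tbl_col_backup")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER PRIMARY KEY, col TEXT)")
conn.executemany("INSERT INTO tbl VALUES (?, ?)", [(1, "short"), (2, "exactly15chars!")])

forward(conn)
after_forward = conn.execute("SELECT col FROM tbl ORDER BY id").fetchall()
backout(conn)
after_backout = conn.execute("SELECT col FROM tbl ORDER BY id").fetchall()
print(after_forward, after_backout)
```

In a real back-out the column would have to be widened again before the saved values are restored; the sketch only shows that the data needed for the back-out is captured before anything is trimmed.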

How to attach and view pdf documents to access database

I have a very simple database in Access, but for each record I need to attach a scanned document (probably a PDF). What is the best way to do this? The database should not just link to a file on the PC, but should copy the file and keep it, meaning that if the original file goes missing, or the database is moved or copied, the file should still be accessible from within the database. Is this possible, and what is the easiest way of doing it? If I have to, I can write a macro; I just don't know where to start. Also, when I display a report of the table, I would like to see just thumbnails of the documents.
Thank you.
As the other answerers have noted, storing file data inside a database table can be a questionable practice. That said, I wouldn't personally rule it out, though if you are going to take that option, I'd strongly suggest splitting out the file data into its own table in its own backend file. For example:
Create a new database file called Scanned files.mdb (or Scanned files.accdb).
Add a single table called Scans with fields such as FileID (AutoNumber, primary key), MainTableID (matches whatever is the primary key of the main table in the main database file), FileName (Text), FileExt (Text) and FileData ('OLE object', really just a BLOB - don't actually use OLE Objects because they will bloat the database horribly).
Back in the frontend, add a reference to Scans as a linked table.
Use a bit of VBA to upload and extract files from the Scans table (if you're interested in the mechanics of this, post a separate question).
Use the VBA Shell routine (if you must) or ShellExecute from the Windows API (= the better option IMO) to open extracted data.
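The answer leaves the upload/extract mechanics to VBA. Purely to illustrate the shape of the Scans table and the round trip, here is a minimal Python sketch that uses SQLite as a stand-in for the Access backend file; it is not Access or VBA code, and the helper names `open_scans_db`, `upload` and `extract` are hypothetical.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_scans_db(path=":memory:"):
    """Open the separate 'scanned files' database and ensure the Scans table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS Scans (
            FileID      INTEGER PRIMARY KEY AUTOINCREMENT,
            MainTableID INTEGER NOT NULL,
            FileName    TEXT NOT NULL,
            FileExt     TEXT NOT NULL,
            FileData    BLOB NOT NULL
        )
        """
    )
    return conn


def upload(conn, main_table_id, source):
    """Copy the file's bytes into the Scans table and return the new FileID."""
    source = Path(source)
    cursor = conn.execute(
        "INSERT INTO Scans (MainTableID, FileName, FileExt, FileData) VALUES (?, ?, ?, ?)",
        (main_table_id, source.stem, source.suffix, source.read_bytes()),
    )
    conn.commit()
    return cursor.lastrowid


def extract(conn, file_id, target_dir):
    """Write the stored bytes back to disk so a viewer can open them."""
    name, ext, data = conn.execute(
        "SELECT FileName, FileExt, FileData FROM Scans WHERE FileID = ?", (file_id,)
    ).fetchone()
    target = Path(target_dir) / f"{name}{ext}"
    target.write_bytes(data)
    return target


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "invoice.pdf"
    original.write_bytes(b"%PDF-1.4 demo bytes")
    conn = open_scans_db()
    file_id = upload(conn, 42, original)
    original.unlink()  # the copy in the database survives the original file
    restored = extract(conn, file_id, tmp)
    restored_name = restored.name
    restored_bytes = restored.read_bytes()
```

Extra columns such as a scan date would simply be added to the table definition, which is the flexibility the hand-built approach offers.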
If you are using the newer ACCDB format, then you have the 'attachment' field type available, as smk081 suggests. This basically does most of the above steps for you; however, doing things 'by hand' gives you greater flexibility - for example, it allows giving each file a 'DateScanned' or 'DateEffective' field.
That said, your requirement for thumbnails will require explicit coding whatever option you take. It might be possible to leverage the Windows file previewing API, though I'd be certain thumbnails are a definite requirement before investigating this - Access VBA is powerful enough to encourage attempts at complex solutions, but frequently not clean and modern enough to allow fulfilling them in a particularly maintainable fashion.
There is an Attachment type under Data Type when you go into Design View of your table. You can add an attachment field here. When you go into the Datasheet view of the table you can select this field for a particular row and a window will open for you to specify the attachment. This will cause your database to quickly grow in size if you add a lot of large attachments.
You can use an OLE field in a table, but I would really suggest you not use this approach. The database is going to be HUGE in no time, and you're going to regret it.
Instead, you should consider adding a field that stores the path to the file, and keep the files in one folder on your network. Then you can use a SHELL() command to open the file. What's the difference between restoring an Access database and restoring PDF files if something goes wrong? This will keep your database at a manageable size and reduce the possibility of corruption.

ADO - Can I edit results of a complex query with multiple join statements?

I'm working on a data conversion utility which can push data from one master database out to a number of different databases. The utility itself will have no knowledge of how data is kept in the destination (table structure), but I would like to allow for a SQL statement that returns data from the destination using a complex query with multiple join statements, as long as the ADO query returns the data in a standardized format (field names) that the utility can recognize.
What I would like to do is then modify the live data in this ADO Query. However, since there are multiple join statements, I'm not sure if it's possible to do this. I know at least with BDE (I've never used BDE), it was very strict and you had to return all fields (*) and such. ADO I know is more flexible, but I don't know quite how flexible in this case.
Is it supposed to be possible to modify data in a TADOQuery in this manner, when the results include fields from different tables? And even if so, suppose I want to append a new record to the end (TADOQuery.Append). Would it append to two different tables?
The actual primary table I'm selecting from has a complementary table which is joined by the same primary key field; one is a "Small" table (brief info) and the other is a "Detail" table (more info for each record in the Small table). So, a typical statement would include something like this:
select ts.record_uid, ts.SomeField, td.SomeOtherField from table_small ts
join table_detail td on td.record_uid = ts.record_uid
There are also a number of other joins to records in other tables, but I'm not worried about appending to those ones. I'm only worried about appending to the "Small" and "Detail" tables - at the same time.
Is such a thing possible in an ADO Query? I'm willing to tweak and modify the SQL statement in any way necessary to make this possible. I have a bad feeling though that it's not possible.
Compatibility:
SQL Server 2000 through 2008 R2
Delphi XE2
Editing these Fields which have no influence on the joins is usually no problem.
Appending is ... you can limit the Append to one of the Tables by
procedure TForm.ADSBeforePost(DataSet: TDataSet);
begin
  inherited;
  TCustomADODataSet(DataSet).Properties['Unique Table'].Value := 'table_small';
end;
but without a Requery you won't get much further.
The better way is to set the values through a procedure, e.g. in BeforePost, then Requery and Abort.
If your View would be persistent you would be able to use INSTEAD OF Triggers
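To make the "set the values through a procedure, then requery" idea concrete, here is a minimal Python sketch, assuming SQLite as a stand-in for SQL Server and a plain function in place of a stored procedure. It is not ADO or Delphi code; `append_record` is a hypothetical name, and the table and column names are taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table_small (record_uid INTEGER PRIMARY KEY, SomeField TEXT);
    CREATE TABLE table_detail (
        record_uid INTEGER PRIMARY KEY REFERENCES table_small (record_uid),
        SomeOtherField TEXT
    );
    """
)

JOIN_SQL = (
    "select ts.record_uid, ts.SomeField, td.SomeOtherField from table_small ts "
    "join table_detail td on td.record_uid = ts.record_uid "
    "order by ts.record_uid"
)


def append_record(conn, some_field, some_other_field):
    """Insert into both tables in one transaction, then re-run the join."""
    with conn:  # both inserts commit together or roll back together
        cursor = conn.execute(
            "INSERT INTO table_small (SomeField) VALUES (?)", (some_field,)
        )
        record_uid = cursor.lastrowid
        conn.execute(
            "INSERT INTO table_detail (record_uid, SomeOtherField) VALUES (?, ?)",
            (record_uid, some_other_field),
        )
    # The "requery": the joined result now contains the new row.
    return conn.execute(JOIN_SQL).fetchall()


rows = append_record(conn, "brief", "more info")
print(rows)
```

The point of the design is that the joined query stays read-only; the append goes through one routine that knows about both tables, and the dataset is refreshed afterwards.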
Jerry,
I encountered the same problem on Firebird, and from experience I can tell you that it can be done (as long as the complexity stays low) by using CachedUpdates. A very good resource is this one: http://podgoretsky.com/ftp/Docs/Delphi/D5/dg/11_cache.html. This article has the answers to all your questions.
I have abandoned the original idea of live ADO query updates, as it has become more complex than I can wrap my head around. The scope of the data push project has changed, and therefore this is no longer an issue for me, however still an interesting subject to know.
The new structure of the application consists of attaching multiple "Field Links" on various fields from the original set of data. Each of these links references the original field name and a SQL Statement which is to be executed when that field is being imported. Multiple field links can be on one single field, therefore can execute multiple statements, placing the value in various tables, etc. The end goal was an app which I can easily and repeatedly export a common dataset from an original source to any outside source with different data structures, without having to recompile the app.
However the concept of cached updates was not appealing to me, simply for the fact pointed out in the link in RBA's answer that data can be changed in the database in the mean-time. So I will instead integrate my own method of customizable data pushes.
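The "Field Links" structure described above could look roughly like this. It is a minimal Python sketch with SQLite as a stand-in destination; the link table, the destination tables and the `:id`/`:value` parameter names are all hypothetical, not the actual application's design.

```python
import sqlite3

# Hypothetical link table: source field name -> SQL statements run for that field.
# Because this is data, it could be loaded from a config file without recompiling.
FIELD_LINKS = {
    "customer_name": [
        "INSERT INTO dest_names (source_id, name) VALUES (:id, :value)",
        "INSERT INTO dest_audit (source_id, field, value) "
        "VALUES (:id, 'customer_name', :value)",
    ],
    "city": ["INSERT INTO dest_cities (source_id, city) VALUES (:id, :value)"],
}


def push_record(conn, record, field_links):
    """Run every linked statement for every field present in the record."""
    executed = 0
    with conn:  # all statements for one record commit or roll back together
        for field, statements in field_links.items():
            if field not in record:
                continue
            for sql in statements:
                conn.execute(sql, {"id": record["id"], "value": record[field]})
                executed += 1
    return executed


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dest_names (source_id INTEGER, name TEXT);
    CREATE TABLE dest_audit (source_id INTEGER, field TEXT, value TEXT);
    CREATE TABLE dest_cities (source_id INTEGER, city TEXT);
    """
)
executed = push_record(conn, {"id": 7, "customer_name": "Ann", "city": "Oslo"}, FIELD_LINKS)
audit_rows = conn.execute("SELECT source_id, field, value FROM dest_audit").fetchall()
print(executed, audit_rows)
```

One field can carry several links, so a single source value can land in several destination tables, as the description above requires.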

Temporary storage for cleaned data in Integration Services

I have an Excel file that I need to process three times in Integration Services: once for projects, once for persons and once for time tracking data.
At each step I have the Excel source and I need to do some data clean-up and type conversions (the same in all three steps).
Is there an easy way of creating a step that does all this and that allows me to use the output as input to the other "real" steps?
I am starting to think about importing it into SQL server in a temp table, which is by all means ok, but it would be nice if I could skip that step.
This can actually be achieved using a single data flow.
You can read the Excel data source once and then use Multicast Transformation to create copies of the data set in memory. You can then process each of your three data flow branches accordingly and can also make use of parallel processing!
See the following reference for details:
http://msdn.microsoft.com/en-us/library/ms137701(SQL.90).aspx
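The read-once, clean-once, fan-out idea behind the Multicast transformation can be sketched in plain Python. This only illustrates the data flow shape, not SSIS itself; the column names and the three branch functions are hypothetical.

```python
def clean(row):
    """Shared clean-up and type conversion, applied once per source row."""
    return {
        "person": row["person"].strip(),
        "project": row["project"].strip().title(),
        "hours": float(row["hours"]),
    }


def load_projects(rows):
    return sorted({row["project"] for row in rows})


def load_persons(rows):
    return sorted({row["person"] for row in rows})


def load_time_tracking(rows):
    return [(row["person"], row["project"], row["hours"]) for row in rows]


def multicast(source_rows, branches):
    """Clean the source once, then hand the same cleaned rows to every branch."""
    cleaned = [clean(row) for row in source_rows]
    return {name: branch(cleaned) for name, branch in branches.items()}


# Stand-in for the rows read from the Excel source.
source = [
    {"person": " Ann ", "project": "apollo", "hours": "2.5"},
    {"person": "Bob", "project": " Apollo", "hours": "4"},
]
outputs = multicast(
    source,
    {
        "projects": load_projects,
        "persons": load_persons,
        "time_tracking": load_time_tracking,
    },
)
print(outputs)
```

Each branch sees identical cleaned rows, so the clean-up logic lives in one place and no temporary table is needed.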
I hope what I have detailed is clear and understandable but please feel free to contact me directly if you require further guidance.
Cheers, John
[Added in response to comments]
With regard to your further question, you can specify the precedence/flow control of your package using more than one flow. For example, you could use the Multicast transformation to create three data flows and then define your precedence constraints so that all transformation tasks in flow one must be completed before the transformations in flow two can begin.
You could use three separate Data Flow Tasks with a File System Task first. The File System Task would copy the original Excel file to a temporary area. Each of the three Data Flow Tasks would start with the temp file and write to the temp file (I think they may need to write to a copy).
An issue with this is that this makes the data flows operate sequentially. This might not be an issue for your Excel file, but would be an issue for processing larger numbers of rows. In such a case, it would be better to process the three "steps" in parallel, and join the results at the final stage.
