I'm using SQL Server 2008 to load sensor data into a table with Integration Services, and I have to deal with hundreds of files. The problem is that the CSV files all have slightly different schemas. Each file can have a maximum of 20 data fields, and all of the files share a common core of fields, but some files have all of the fields while others have only a subset. In addition, the order of the fields can vary.
Here's an example of what the file schemas look like.
Station Name,Station ID,LOCAL_DATE,T_1,TD_1,RH_1,CL_1,RS_1,RI_1,PR_1,RD_1,SH_1,CL_2
Station Name,Station ID,LOCAL_DATE,T_1,TD_1,RH_1,CL_1,RS_1,RI_1,PR_1,WS_1,WD_1,WSM_1,WDM_1,SH_1
Station Name,Station ID,LOCAL_DATE,T_1,TD_1,RH_1,RS_1,RI_1,PR_1,RD_1,WS_1,WD_1,WSM_1,WDM_1
Station Name,Station ID,LOCAL_DATE,T_1,RH_1,RS_1,PR_1,VI_1,PW_1,WS_1,WD_1,WSM_1
Station Name,Station ID,LOCAL_DATE,T_1,RH_1,RS_1,WS_1,WD_1,WSM_1
Station Name,Station ID,LOCAL_DATE,T_1,RH_1,RS_1,PR_1,VI_1,WS_1,WD_1,WSM_1
I'm using a Script Component in the Data Flow to process the data via CreateNewOutputRows() and MyOutputBuffer.AddRow(). I have a working package to load the data, but it isn't reliable or robust: as I add more files, the package fails whenever it meets a file whose schema hasn't been defined in CreateNewOutputRows().
I'm looking for a dynamic solution that can cope with the variation in the file schemas. Does anyone have any ideas?
Who controls the data model for the output of the sensors? If it's not you, do they know what they are doing? If they create new and inconsistent models every time they invent a new sensor, you are pretty much up the creek.
If you can influence or control the evolution of the schemas for CSV files, try to come up with a top level data architecture. In the bad old days before there were databases, files made up of records often had, as the first field of each record, a "record type". CSV files could be organized the same way. The first field of every record could indicate what type of record you are dealing with. When you get an unknown type, put it in the "bad input file" until you can maintain your software.
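For illustration only (the record-type codes and values below are invented, not taken from the actual sensor feeds), a layout like that would let every data row carry its own schema tag as the first field:

WXA,Foo Station,1021,2010-06-01 12:00,21.4,18.2,87,3.1,270,5.5
WXB,Foo Station,1021,2010-06-01 12:00,21.4,87,0.2,1013.2

A row whose first field isn't a recognized type code goes straight to the "bad input" file.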
If that isn't dynamic enough for you, you may have to consider artificial intelligence, or looking for a different job.
Another option is the command line: SQL Server can import a CSV with the bcp utility or a BULK INSERT statement.
If the CSV files that share an identical format follow the same file-name convention, or can be separated out in some other fashion, you can use one ForEach Loop Container per file schema type.
One possible way to separate out the CSV files is to run a script (in VB) in SSIS that reads the first row of each CSV file, checks which schema it matches (assuming the column names are in the first row), and then moves the file to the appropriate folder for its ForEach Loop Container.
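A minimal sketch of that sorting step, written here in C# rather than the VB the answer mentions; the folder paths and the header-to-folder-name scheme are assumptions, not part of the original suggestion:

using System;
using System.IO;
using System.Linq;

class SortCsvByHeader
{
    static void Main()
    {
        string incoming = @"C:\Sensor\Incoming";   // assumed source folder
        string sortedRoot = @"C:\Sensor\Sorted";   // assumed destination root

        foreach (string path in Directory.GetFiles(incoming, "*.csv"))
        {
            // The first row is assumed to hold the column names.
            string header = File.ReadLines(path).FirstOrDefault() ?? "";

            // Turn the header into a folder name so every distinct column layout
            // lands in its own folder (one ForEach Loop Container per folder).
            string schemaKey = header.Replace(" ", "").Replace(",", "_");
            string targetDir = Path.Combine(sortedRoot, schemaKey);

            Directory.CreateDirectory(targetDir);
            File.Move(path, Path.Combine(targetDir, Path.GetFileName(path)));
        }
    }
}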
Related
I am loading a large set (tens of thousands) of CSV files into a single staging SQL Server table, using a standard SSIS approach.
The vast majority of the source CSV files have an identical column structure (order, set of columns, data types). There are around 140 columns altogether.
However, in certain (<1%) cases a source file will be lacking some columns (I know exactly which columns they are, and there are three possible combinations of missing columns). This is by design i.e. this is a valid business scenario (meh).
Can I somehow create a "virtual" column (filled with NULL/empty/blank values) for a source CSV connection if (and only if) that column does not exist in the physical source CSV file?
I know I can read the CSV header with a C# scripting component, create multiple source connections, and redirect to the right data flow based on the existence (or absence) of certain columns, but I am hoping for a more "elegant" solution, with just a single CSV data source "smart" enough to "artificially" add the blank columns that are missing from the source file.
For simplicity let's assume that the full column set is:
ID;C1;C2;C3
And that C3 is missing occasionally i.e. some CSV files are:
ID;C1;C2
Any hints welcome.
No, there is no "smart" CSV data source built in to SSIS.
You are certainly going to need to use a script component, but instead of using a Script Task outside the dataflow that directs the control flow to the correct dataflow, you can simply create one dataflow that has a script component as the data source. The script component reads the CSV that is currently being imported, and if the column in question is missing, it supplies it with NULL or default values.
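A minimal sketch of such a script component source, based on the simplified ID;C1;C2;C3 layout above. This method sits inside the component's ScriptMain class (with System and System.IO imported at the top of that file); the semicolon delimiter, the CsvFilePath package variable, and the string-typed Output0 columns are assumptions:

public override void CreateNewOutputRows()
{
    string path = Variables.CsvFilePath;                // assumed read-only package variable
    using (var reader = new StreamReader(path))
    {
        string[] header = reader.ReadLine().Split(';');
        int idIdx = Array.IndexOf(header, "ID");
        int c1Idx = Array.IndexOf(header, "C1");
        int c2Idx = Array.IndexOf(header, "C2");
        int c3Idx = Array.IndexOf(header, "C3");        // -1 when the file lacks C3

        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string[] fields = line.Split(';');

            Output0Buffer.AddRow();
            Output0Buffer.ID = fields[idIdx];
            Output0Buffer.C1 = fields[c1Idx];
            Output0Buffer.C2 = fields[c2Idx];

            if (c3Idx >= 0)
                Output0Buffer.C3 = fields[c3Idx];
            else
                Output0Buffer.C3_IsNull = true;         // supply NULL for the missing column
        }
    }
}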
I have several CSV files, and each has a corresponding table in the database (with the same columns as the CSV, with appropriate data types) and the same name as the CSV. So every CSV has a table in the database.
I somehow need to map them all dynamically. Once I run the mapping, the data from all the CSV files should be transferred to the corresponding tables. I don't want to have a different mapping for every CSV.
Is this possible through Informatica?
Appreciate your help.
PowerCenter does not provide such feature out-of-the-box. Unless the structures of the source files and target tables are the same, you need to define separate source/target definitions and create mappings that use them.
However, you can use Stage Mapping Generator to generate a mapping for each file automatically.
My understanding is that you have many CSV files with different column layouts and you need to load them into the appropriate tables in the database.
Approach 1: If you use any RDBMS, you should have some kind of import option. Explore that route to create tables based on the CSV files. This is a manual task.
Approach 2: Open the CSV file and write formulae using the header to generate a CREATE TABLE statement. Execute the formula result in your DB, so that all the tables are created. Then use Informatica to read each CSV, import the table definitions, and load the data into the tables.
Approach 3: Using Informatica alone. You need to do a lot of coding to create a dynamic mapping on the fly.
Proposed solution:
Mapping 1:
1. Read the CSV file and pass the header information to a Java transformation.
2. The Java transformation should normalize and split the header columns into rows; you can write them to a text file.
3. Now you have all the columns in a text file. Read this text file and use a SQL transformation to create the tables in the database.
Mapping 2:
Now that the table is available, read the CSV file (excluding the header) and load the data into the table created by Mapping 1, via a SQL transformation (insert statement).
You can follow this approach for all the CSV files. I haven't tried this solution at my end, but I am sure the approach would work.
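Outside of Informatica, the heart of Mapping 1 is just turning a header row into a CREATE TABLE statement. Here is a rough sketch of that logic, written in C# purely for illustration (the Java transformation would do the equivalent); the file name, the comma delimiter, and the blanket VARCHAR(255) typing are all assumptions:

using System;
using System.IO;
using System.Linq;

class HeaderToDdl
{
    static void Main()
    {
        string csvPath = "sensor_station.csv";                    // assumed file name
        string tableName = Path.GetFileNameWithoutExtension(csvPath);

        // Read only the header row of the CSV.
        string[] columns = File.ReadLines(csvPath).First().Split(',');

        // Every column is typed VARCHAR here for simplicity; real types would
        // have to come from a lookup or from profiling the data.
        string ddl = "CREATE TABLE " + tableName + " (\n  " +
                     string.Join(",\n  ", columns.Select(c => "\"" + c.Trim() + "\" VARCHAR(255)")) +
                     "\n)";

        // In the real mapping this statement would feed the SQL transformation.
        Console.WriteLine(ddl);
    }
}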
If you're not using any transformations, it's wise to use the import option of the database (e.g. a BTEQ script in Teradata). But if you are doing transformations, then you have to create as many sources and targets as the number of files you have.
On the other hand you can achieve this in one mapping.
1. Create a separate flow for every file (i.e. Source-Transformation-Target) in the single mapping.
2. Use target load plan for choosing which file gets loaded first.
3. Configure the file names and corresponding database table names in the session for that mapping.
If all the mappings (if you have to create them separately) are the same, use the Indirect file method. In the session properties, under the Mapping tab's source options, you will find this setting; the default is Direct, so change it to Indirect.
I don't have the tool at hand right now to explore further and guide you clearly, but look into this Indirect file load type in Informatica. I am sure it will meet the requirement.
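For reference, with an indirect load the file the session actually reads is just a list of the real data files, one path per line. The paths below are made up:

/data/feeds/station_a.csv
/data/feeds/station_b.csv
/data/feeds/station_c.csv

This works cleanly only when all the listed files share the structure of the source definition.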
I have written a workflow in Informatica that does it, but some of the complex steps are handled inside the database. The workflow watches a folder for new files. Once it sees all the files that constitute a feed, it starts to process the feed. It takes a backup in a time stamped folder and then copies all the data from the files in the feed into an Oracle table. An Oracle procedure gets to work and then transfers the data from the Oracle table into their corresponding destination staging tables and finally the Data Warehouse. So if I have to add a new file or a feed, I have to make changes in configuration tables only. No changes are required either to the Informatica Objects or the db objects. So the short answer is yes this is possible but it is not an out of the box feature.
I have a very simple database in Access, but for each record I need to attach a scanned-in document (probably a PDF). What is the best way to do this? The database should not just link to a file on the PC but should copy and keep the file with it, meaning that if the original file goes missing or the database is moved or copied, the file should still be accessible from within the database. Is this possible, and what is the easiest way of doing it? If I should, I can write a macro; I just don't know where to start. Also, when I display a report of the table, I would like to see just thumbnails of the documents.
Thank you.
As the other answerers have noted, storing file data inside a database table can be a questionable practice. That said, I wouldn't personally rule it out, though if you are going to take that option, I'd strongly suggest splitting out the file data into its own table in its own backend file. For example:
Create a new database file called Scanned files.mdb (or Scanned files.accdb).
Add a single table called Scans with fields such as FileID (AutoNumber, primary key), MainTableID (matches whatever is the primary key of the main table in the main database file), FileName (Text), FileExt (Text) and FileData ('OLE object', really just a BLOB - don't actually use OLE Objects because they will bloat the database horribly).
Back in the frontend, add a reference to Scans as a linked table.
Use a bit of VBA to upload and extract files from the Scans table (if you're interested in the mechanics of this, post a separate question).
Use the VBA Shell routine (if you must) or ShellExecute from the Windows API (= the better option IMO) to open extracted data.
If you are using the newer ACCDB format, then you have the 'attachment' field type available, as smk081 suggests. This basically does most of the above steps for you; however, doing things 'by hand' gives you greater flexibility - for example, it allows giving each file a 'DateScanned' or 'DateEffective' field.
That said, your requirement for thumbnails will require explicit coding whatever option you take. It might be possible to leverage the Windows file previewing API, though I'd be certain thumbnails are a definite requirement before investigating this - Access VBA is powerful enough to encourage attempts at complex solutions, but frequently not clean and modern enough to allow fulfilling them in a particularly maintainable fashion.
There is an Attachment type under Data Type when you go into Design View of your table. You can add an attachment field here. When you go into the Datasheet view of the table you can select this field for a particular row and a window will open for you to specify the attachment. This will cause your database to quickly grow in size if you add a lot of large attachments.
You can use an OLE field in a table, but I would really suggest you not use this approach. The database is going to be HUGE in no time, and you're going to regret it.
Instead, you should consider adding a field that stores the path to the file, and keep the files in one folder on your network. Then you can use a SHELL() command to open the file. What's the difference between restoring an Access database and restoring PDF files if something goes wrong? This will keep your database at a manageable size and reduce the possibility of corruption.
I am using a file consisting of published scientific data. I'm using this file with a program that reads in the first 5 space delimited data fields, and everything after that is considered a comment by the program.
2 example lines (of thousands):
FeII 1608.4511 0.521 55.36 -1300 M03 Journal of Physics
FeII 1611.23045 0.0321 55.36 1100 01J AJ
The program reads it as:
FeII 1608.4511 0.521 55.36 -1300
FeII 1611.23045 0.0321 55.36 1100
Each of these numbers is a measurement, and most (don't get me started) have associated errors that are not listed in this file. I would like to store this information in a useful and updatable way. That is, say the first entry FeII 1608.4511 has an error of plus/minus 0.002. Consider when a new measurement is made and changes it to: FeII 1608.45034 plus/minus 0.0005. I would like to update the value, the error, and record some information about the publication it came from.
The program that uses this file is legacy code and is both crucial and inflexible: it needs the file to look like the above output when it's read in. I would really like for there to be a way to update the input file to include things like errors on the values and publication hyperlinks in comments. I would also like a kind of version-control ability to recover the state of this large file as it is today, or as it will be in 5 months after 20 more lines have been updated with new values.
Any suggestions on how best to accomplish this? Should I store everything in some kind of database?
Databases are deeply tied to identity. If a database can't identify a row by the data that's in it, a database isn't going to help you.
If I were you, I'd start by storing the base file in a version control system, not a database. At 20 changes per 5 months, I'd probably make those changes manually and commit each batch of changes. (I don't know what might constitute a batch for you. Could be a single change every time.)
Since the format of the existing file is both crucial and brittle, I'm not sure whether modifying it is a good idea. I think I'd feel better about storing error ranges and publication hyperlinks in a separate file, and using a script to put the pieces together for applications that can use error ranges and hyperlinks.
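A minimal sketch of such a glue script, in C# for illustration; the file names, the semicolon-separated layout of the side file (key;error;link), and the choice of the first two fields (species and wavelength) as the join key are all assumptions:

using System;
using System.Collections.Generic;
using System.IO;

class MergeMetadata
{
    static void Main()
    {
        // Side file: "FeII 1608.4511;0.002;https://doi.org/..." (one line per measurement).
        var extras = new Dictionary<string, string>();
        foreach (string line in File.ReadLines("errors_and_links.txt"))
        {
            string[] parts = line.Split(';');
            extras[parts[0]] = parts[1] + " " + parts[2];
        }

        using (var writer = new StreamWriter("atomic_data_enriched.txt"))
        {
            foreach (string line in File.ReadLines("atomic_data.txt"))
            {
                string[] fields = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
                string key = fields[0] + " " + fields[1];   // species + wavelength

                // Append the error and link after the first 5 fields, where the
                // legacy program treats everything as a comment anyway.
                writer.WriteLine(extras.TryGetValue(key, out string extra)
                    ? line + " " + extra
                    : line);
            }
        }
    }
}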
A database sounds sensible; SQL Server Express is free and widely used.
You can read in the text file, including all comments, and output the edited data in the same format. You can use a number of front ends, including Access for rapid development, something you create yourself in VB.Net, or even Excel at a pinch.
You will need to consider the structure of the table(s) but it should not be too difficult, and you can get help here.
For updating the information in the file introducing errors and links, you don't need any database; just open the file, iterate through the lines and update each one.
If you want to be able to restore a line's state, you definitely need some kind of database. You can create a database in SQL Server or Firebird, for example, and store in it a row for each line's historical state (with the date of creation, of course); your file itself would remain the repository for current values, and you would be able to restore the file as of a given date with some simple fetching of the database information.
If you can't use a database like Firebird or SQL Server, you can store the historical data in a simple text file; it's up to you. Just remember that, as @CatCall commented, you will necessarily need a way to identify each line in order to create a relation between the line in the file and the historical data stored in your repository.
Is it possible to create a Star Schema Database from data available in an XML file or a text file?
Your valuable inputs appreciated.
Thanks in advance.
Regards,
Sam.
It's a very vague question, but you can pretty much describe anything logical with an XML file. XML is "extensible", so you can make it describe nearly any set of data in any form as long as there is consistency.
I've done both. I wrote some JavaScripts that parse files and create records in an Oracle database. One script processed the Tycho-2 data from a formatted text file. I've only done two of the 20 text files (250,000 stars) because of the size, and I'm doing this on my laptop. A similar script created a table for the Hipparcos data (118,000 stars). The other script, which I just finished, processed the data from a file called "Stars - 2300+6000-2.xml." That file contained some stars with names, and I created a table with the named stars that matched the Hipparcos stars. Alas, only 92 of them showed up.
For the XML, I used the Microsoft.XMLDOM interface, and to add the data to Oracle, I used ADODB.Connection. The XML script also used ADODB.Recordset because I had to run queries against the Hipparcos table to find the Hipparcos number (where there was a match) matching a star in the XML file.
If that is what you're looking for, let me know and I can send the JavaScripts.
Frank Perry, MSEE