Identifying duplicate files based on data content in SSIS - sql-server

I receive files in a shared location. Every file has different metadata, i.e. file name and creation date.
I have to extract the data using SSIS if and only if the file content differs from that of previously processed files.

This should be fairly straightforward:
Use a Foreach Loop container configured with the Foreach File enumerator. The folder would be the shared location, and the file name should be a wildcard (for example, *.csv).
Create a table in SQL Server called LoadedFiles to hold the names of the files already loaded. When you configure the Foreach Loop container, you will also have created a variable that holds the current file name. Inside the container, check whether the value of this variable already exists in the LoadedFiles table, and load the file only if it does not.
I've assumed that all the files have the same metadata (column names and data types). Even if they do not, you can apply the same logic.
Also, if it isn't obvious: for this to work, you need to insert a new row into the LoadedFiles table every time you do load a file.
EDIT: It seems that the same file name does not imply the same content for the OP. In that case, they should do a MERGE into the SQL table instead of a blind insert.
MERGE on the primary key: WHEN MATCHED, do nothing; WHEN NOT MATCHED, INSERT.
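Since the same content can arrive under different file names, another option (not the MERGE approach above, but a swapped-in technique) is to key the LoadedFiles check on a hash of the file content instead of its name. The following is a minimal sketch in plain Python rather than SSIS; the function names and the in-memory set standing in for a hash column in LoadedFiles are assumptions for illustration only.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's bytes, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def should_load(path, seen_hashes):
    """Return True and record the hash if this content has not been seen before.

    seen_hashes is a hypothetical stand-in for a hash column in LoadedFiles.
    """
    content_hash = file_sha256(path)
    if content_hash in seen_hashes:
        return False
    seen_hashes.add(content_hash)
    return True
```

In a package, the same idea could probably live in a Script Task that computes the hash and an Execute SQL Task that looks it up, so two files with different names but identical bytes are loaded only once.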

I found a workaround:
an SSIS Execute Process Task that calls FC.exe.
http://www.howtogeek.com/206123/how-to-use-fc-file-compare-from-the-windows-command-prompt/
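For readers who want to see what such a file-to-file comparison amounts to, here is a rough Python sketch of a byte-for-byte check against previously processed files. It is not how FC.exe works internally, only a comparable check; the function name and the list of processed files are assumptions.

```python
import filecmp


def is_duplicate_of_any(new_file, processed_files):
    """Return True if new_file has the same bytes as any already processed file."""
    return any(
        # shallow=False forces a content comparison instead of trusting os.stat data.
        filecmp.cmp(new_file, old_file, shallow=False)
        for old_file in processed_files
    )
```

Note that comparing each new file against every old file grows with the number of processed files, which is one reason a stored hash may scale better.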

Related

Add data from other object within SSIS package to populate a field for a table

There are many aspects of what I want to do but I think learning one piece will let me derive the rest.
I have an SSIS package that uses PowerShell to download a publicly available zip file, runs a script to unzip it with 7-Zip, and then uses data flows to load the unzipped files into the corresponding tables.
What I want to do is add the file name (and eventually other aspects of the file like creation date, record counts and so on) from any one of the unzipped files to a log table that keeps track of the summary level details of the files.
How do I dynamically store this type of information as part of the package? Derived columns? But what's the input? Thanks!
There are many options for dynamically working with files through SSIS. Below is an overview of one method. Of course this can vary, depending on your specific needs and requirements.
Add a Foreach Loop Container. On the Collection pane, the Folder property can be set either through the GUI or from a parameter or variable by using the Directory expression. Searching subfolders can be enabled by checking the "Traverse subfolders" checkbox or, like the Folder field, by using the Recurse expression.
The Files field specifies which files to use, and wildcards are allowed. * matches any number of characters. For example, *.csv will get all CSV files regardless of name, and Test*.txt will return all .txt files with names beginning with Test, regardless of how many or which characters follow. To match a single character instead, use ?. The FileSpec expression allows this to be set dynamically from a variable or parameter, in the same way as the directory.
The Variable Mappings pane lets you set a variable to hold each file name from the directory. Map a variable to index 0 to receive the file name.
You indicated that you wanted to store the file name. The detail of this can be controlled from the "Retrieve file name" field on the Collection window. As their names imply, Fully Qualified will hold the complete file path, Name and Extension will return the file name with extension, and Name Only is just the file name.
As for other aspects of the file, I'd recommend using a Script Task, which offers more functionality. The C# FileInfo class (System.IO.FileInfo) provides options for finding details about the file such as the creation date, the last time the file was accessed, and when the file was most recently written to. Additional information on this can be found in the .NET documentation for FileInfo.
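To give a feel for the kind of details involved, here is a small sketch in Python rather than C# (so it is not the Script Task code itself). The function name and dictionary keys are made up for illustration. Creation time is left out on purpose because how it is exposed differs between platforms.

```python
import os
from datetime import datetime, timezone


def describe_file(path):
    """Collect basic file details comparable to what FileInfo exposes."""
    info = os.stat(path)
    return {
        "name": os.path.basename(path),
        "size_bytes": info.st_size,
        # Last write and last access times, converted to timezone-aware UTC values.
        "last_write_utc": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
        "last_access_utc": datetime.fromtimestamp(info.st_atime, tz=timezone.utc),
    }
```

In a Script Task, values like these would typically be written back to package variables so a later Execute SQL Task can log them.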
For the record counts from the file, you'll need to create a Connection Manager for this and work with the data within the package. I'm assuming these are flat files? If so, creating a Flat File Connection Manager and setting the same variable from the Variable Mappings pane of the Foreach Loop on the ConnectionString expression of the Connection Manager will allow you to dynamically loop through each file. If you decide to do this, make sure that the Fully Qualified option is used for the "Retrieve file name" field described earlier. You will also want to configure the correct columns and data types for the Connection Manager ahead of time. The same process can be followed for Excel files; however, the variable with the file name will be used on the ExcelFilePath expression instead.
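Inside the package a Row Count transformation in the data flow is the usual way to capture this number. Purely as an illustration of what is being counted, a Python sketch for a comma-separated flat file could look like the following; the function name, the UTF-8 encoding, and the header handling are assumptions.

```python
import csv


def count_records(path, has_header=True):
    """Count rows in a delimited flat file, optionally excluding a header row.

    Every line the csv reader yields is counted, including blank lines.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        rows = sum(1 for _ in csv.reader(handle))
    if has_header and rows > 0:
        rows -= 1
    return rows
```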
As for storing information about a file in a log table, there are a multitude of options for this. A very basic example of an Insert statement within an Execute SQL Task that's placed within the Foreach Loop is below. The three-part table name is only necessary if you're using a table in a database that differs from the initial catalog of the Connection Manager. The ? is the parameter marker (assuming this is an OLE DB connection). After this, map the same variable/parameter that stores the file name using the Parameter Mapping pane. Set the direction to Input, choose the appropriate data type (likely VARCHAR/NVARCHAR), and enter 0 in the Parameter Name field to indicate this is the first parameter in the SQL statement. Additional ? markers can be used for subsequent parameters in the SQL statement; increment the Parameter Name field accordingly. The default Parameter Size can be left at -1. Again, this is a simple example and you'll probably want to store more about the files and their contents, but this can get you started.
Sample SQL Insert:
INSERT INTO YourDataBase.YourSchema.YourTable (ColumnToHoldFileName)
VALUES (?)
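To show the ? marker in action outside SSIS, here is a small runnable sketch that uses Python's built-in sqlite3 as a stand-in for SQL Server (it happens to use the same ? marker). The table name FileLog, the sample file names, and the function name are hypothetical; the list plays the role of the Foreach Loop.

```python
import sqlite3


def log_file_name(connection, file_name):
    """Insert one file name into the log table using a ? parameter marker."""
    connection.execute(
        "INSERT INTO FileLog (ColumnToHoldFileName) VALUES (?)",
        (file_name,),  # first (index 0) parameter, like Parameter Name 0 in SSIS
    )
    connection.commit()


# In-memory database standing in for the real log database.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE FileLog (ColumnToHoldFileName TEXT)")

for name in ["Test1.csv", "Test2.csv"]:  # stand-in for the Foreach Loop
    log_file_name(connection, name)
```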
You can use a variable to store the file name while you loop through the files; after a file has been loaded into the table, you can use the current file name to insert into or update the log table.
Figured it out from looking at other posts. I had to expand the parameter size... easy fix!

Enumerating sub-folders and saving the records from flat file to SQL

I have a root folder that contains a few CoNum folders, each of which contains CycleDate folders. Every CycleDate folder contains a file named N718010.txt with comma-separated records that I want to insert into a SQL database table. How can I achieve this? I'm a beginner in the SSIS world.
I followed this url- http://microsoft-ssis.blogspot.in/2011/01/foreach-folder-enumerator.html
but it is incomplete and this ended up getting the path in a variable (xmldoc) like:
How can I get these records saved to the SQL database table? Note: I also have to save CoNum and CycleDate to the table with each record.
What you need is a simple Foreach Loop Container.
You can read about a very basic method here
To get CoNum and CycleDate, you can just substring your FullPath variable in a derived column.
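The substring idea can be sketched as follows, in Python rather than as an SSIS expression. It assumes the layout described in the question (root folder, then CoNum, then CycleDate, then the file); the function name and the sample path in the usage note are hypothetical.

```python
from pathlib import PureWindowsPath


def conum_and_cycledate(full_path):
    """Return (CoNum, CycleDate) from a path laid out as root/CoNum/CycleDate/file."""
    parts = PureWindowsPath(full_path).parts
    if len(parts) < 3:
        raise ValueError(f"path has too few components: {full_path}")
    # The file is last, CycleDate is its parent folder, CoNum is the grandparent.
    return parts[-3], parts[-2]
```

For a made-up path such as C:\Root\1001\20150101\N718010.txt this yields the folder names 1001 and 20150101, which could then be added to every row loaded from that file.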

load matching files in table

I have a folder that contains multiple .csv files, one per employee, named like empname_date.csv, and I want to load these files into one table.
I don't want all files, only those whose file name matches an entry in the tbl_empmaster table, which contains the master list of employees.
I do not want to check each file because it would take too much time. I need to filter the files against the master list and then load the matching employee files.
Please advise what I can do in this case.
I am using SSIS for this.
Create an SSIS package with a Foreach Loop Container to read all the CSV files in the given folder.
Read the file name without its extension into a variable and, before inserting, perform a lookup against the table to see whether that file name exists there. Insert only if a match is found.
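The filtering logic can be sketched like this in Python (not SSIS). It assumes file names of the form empname_date.csv, splits on the last underscore so employee names containing underscores still work, and compares names case-insensitively; those choices, the function name, and the list of names standing in for tbl_empmaster are all assumptions.

```python
from pathlib import Path


def files_for_known_employees(folder, employee_names):
    """Return CSV files in folder whose name part before the last '_' is a known employee."""
    known = {name.lower() for name in employee_names}
    matches = []
    for path in sorted(Path(folder).glob("*.csv")):
        emp_name, sep, _date = path.stem.rpartition("_")
        # Files without an underscore have no employee part and are ignored.
        if sep and emp_name.lower() in known:
            matches.append(path)
    return matches
```

Loading the master list once into a set keeps the per-file check cheap, which addresses the concern about checking each file taking too long.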

Moving files based on a source path found in a table using SSIS

I've chased my tail for a full 12 hours. Haven't found the right solution.
I'm locked into using SSIS. I have a SQL Server table with full paths and filenames already concatenated. Examples:
\\MydevServer1\C$\ABC\App_Data\Sample.pdf
\\MydevServer2\E$\Garth\App_Data\Morefiles.txt
\\MydevServer3\D$\Paths\App_Data\MySS.xlsx
etc.
I need to read each row of the table, get the path and filename and move that file to a new static destination directory.
The rows in the table will remain unchanged. I only use it as a source to locate the file to be moved.
I've tried:
1) Feeding a result set from an OLE DB Source to a Recordset Destination, then to an Object variable that feeds a Foreach Loop Container holding a File System Task. (Very problematic.)
2) Sending the table rows to a .csv file and reading each line of the .csv file using a Foreach Loop Container holding a File System Task.
3) Reading directly from the table rows using a Foreach Loop Container holding a File System Task. (Preferred.)
and many other scenarios.
I have viewed a hundred examples online, but most of them involve loading a table, or sending results to flat files, or moving files from one folder to another based on extension type, etc. I haven't found anything on configuring a file system task to read a table supplied path and move the file based on the table value as the source.
I'm rambling. :-)
Any insight or help will be appreciated. I'm not new to SSIS, but I sure feel like it right now.
Create two string variables to store the source and destination paths.
Use an Execute SQL Task with ResultSet set to "Full result set" to populate a variable of the Object data type.
Use a Foreach Loop container with the Foreach ADO enumerator to go through each row of the recordset and set those two variables.
Inside the Foreach Loop container, use a File System Task. You need to specify IsSourcePathVariable = True and IsDestinationPathVariable = True, set the path variables (SourceVariable / DestinationVariable), and choose the operation (copy, move, etc.).
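As a sketch of what the loop does for each row, here is the same flow in plain Python (not SSIS): iterate over the paths read from the table and move each file into one static destination folder. The function name is made up, and skipping blank, NULL, or missing entries instead of failing is an added assumption, not part of the steps above.

```python
import shutil
from pathlib import Path


def move_listed_files(source_paths, destination_dir):
    """Move each listed file into destination_dir.

    Returns (moved file names, skipped rows).
    """
    destination = Path(destination_dir)
    destination.mkdir(parents=True, exist_ok=True)
    moved, skipped = [], []
    for raw in source_paths:
        text = (raw or "").strip()
        # Blank or NULL rows and missing files are skipped instead of failing the loop.
        if not text or not Path(text).is_file():
            skipped.append(raw)
            continue
        source = Path(text)
        shutil.move(str(source), str(destination / source.name))
        moved.append(source.name)
    return moved, skipped
```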
It appears I've been chasing my tail because of the "Source is empty" error.
This was caused by a blank first row in my recordset. I was searching for a fix for an empty Object variable, when in reality the issue was that the first row it read contained no data.
Insert shameful smug here.
Thanks to Anton for the help.

SSIS Dynamic Mapping column

I'm fairly new to SSIS and I need to import some flat files into SQL tables that have the same structure.
(Assume each table already exists with the same structure, and that the table name and the flat file name are the same.)
I thought I would create a generic package (SQL 2014) to import all those files by looping through a folder.
I created a Data Flow Task inside a Foreach Loop Container, and in the Data Flow Task I dropped a Flat File Source and an ADO.NET Destination.
I set the file source from a variable so that it picks up a new file on every iteration. Similarly, I set the ADO.NET table name from a variable so that it selects a different table each time, according to the file name.
Since the source and destination column names are the same, I assumed the columns would be mapped automatically.
But it would not let me run the package as it was, so I added a column on the source, selected a table, and mapped it.
When I ran the package, I assumed it would automatically remap everything.
It ran for the first file, but failed on the second file with mapping errors.
Can someone let me know whether this is achievable with some kind of dynamic mapping, or in any other way?
Any help would be much appreciated.
Thanks
Ned
