SSIS Extract links from Excel cells to load into SQL - sql-server

The problem:
I have an SSIS package that loops through 100+ Excel files and reads the data, then copies the contents over to a SQL Server Table. In these Excel files, this one column has hyperlinks. The column text itself says something like DSH-LN-4, but clicking on it in Excel opens up a folder that contains some images. How do I copy the underlying link in this column rather than the actual text in the cells?
What have I tried so far:
I haven't really tried anything because I found almost no resources on how to do this in SSIS. Manually adding a column to the Excel files is NOT possible, since there are hundreds of files. The only resource I found was this SO question, but it does not explain how to do this without manually manipulating the Excel files.
What I would like:
In my ForEach loop container, I have a data flow task that gets the Excel contents and shoves it into the SQL Table. The column that contains hyperlinks is called PhotoReference (since these hyperlinks open the folder that has the photos). I would like this PhotoReference column to copy over the underlying hyperlink of the cell and add that to the SQL column.
For instance, I want the PhotoReference column to contain this:
www.companyname.box.com/asjdfbgkjb134kjbsdafo2bm21n4bk
If I can manage to do this, my Power BI report running off of this underlying data could contain a clickable text that would open the image directly.
Any help would be appreciated.
UPDATE:
I was able to try two different methods to extract the hyperlinks from my column, but each has its own issues:
Method 1: I added a Script Task to my ForEach container and, as I loop through each Excel file, used the Hyperlinks interface from the Microsoft.Office.Interop.Excel assembly to get the hyperlink from my Excel column. BUT, I don't know what to do with it afterwards. I figured the only thing to do is to overwrite the Excel column's content with my extracted hyperlink, but I would really rather not change my Excel files in any manner.
Method 2: I added a Script Component inside my data flow task, between my Excel source and SQL destination. With this method I could not get nearly as far, because the auto-generated Input0_ProcessInputRow method has an argument Row of type Input0Buffer, and I am not able to apply any Microsoft.Office.Interop.Excel properties to an Input0Buffer object. So I am stuck.

If you have the right to alter the Excel files, you can simply add a Script Task before the data flow task to replace the URL column value with the hyperlink.
In this answer, I will provide a step-by-step solution to solve this problem:
Creating Excel samples
First of all, I created some Excel files with the following columns:
First name (text)
Last name (text)
Age (number)
Photo (hyperlink)
Creating the SSIS package
First, add an Excel connection manager that points to one of the Excel files you need to import, and an OLE DB connection manager that connects to the SQL Server instance.
Add an SSIS variable of type String, named ExcelFilePath, to store the Excel file path supplied by the Foreach enumerator.
Add a Foreach Loop container and configure it to loop over the Excel files, storing the current file path in the ExcelFilePath variable.
Within the Foreach Loop container, add a Script Task followed by a Data Flow Task.
Open the Data Flow Task, add an Excel source and an OLE DB destination, and configure the column mappings between them.
Open the Script Task configuration and select the ExcelFilePath variable as a ReadOnly variable.
Open the script editor and, in the Solution Explorer window, right-click the References node and click "Add Reference...".
When the Add Reference dialog appears, click the COM tab, search for Excel, and select the Microsoft Excel Object Library from the results.
Also make sure to add a reference to Microsoft.CSharp.dll.
At the top of the script, add the following lines:
using Excel = Microsoft.Office.Interop.Excel;
using System.Runtime.InteropServices;
In the Main() function add the following lines:
// Open the workbook for the file currently selected by the Foreach loop.
string originalPath = Dts.Variables["User::ExcelFilePath"].Value.ToString();
Excel.Application excel = new Excel.Application();
excel.Visible = false;
excel.DisplayAlerts = false;
Excel.Workbook workbook = excel.Workbooks.Open(originalPath);
Excel.Worksheet worksheet = (Excel.Worksheet)workbook.Worksheets[1];
Excel.Range usedRange = worksheet.UsedRange;
int intURLColidx = 0;

// Find the index of the "Photo" column in the header row.
for (int i = 1; i <= usedRange.Columns.Count; i++)
{
    Excel.Range header = worksheet.Cells[1, i] as Excel.Range;
    if (header.Value2 != null && header.Value2.ToString() == "Photo")
    {
        intURLColidx = i;
        break;
    }
}

// Replace each cell's display text with its hyperlink address.
// The loop is skipped when the "Photo" column was not found.
for (int i = 2; intURLColidx > 0 && i <= usedRange.Rows.Count; i++)
{
    Excel.Range cell = worksheet.Cells[i, intURLColidx] as Excel.Range;
    if (cell.Hyperlinks.Count > 0)
    {
        cell.Value2 = cell.Hyperlinks.Item[1].Address.ToString();
    }
}

// Save the changes, then release the COM objects so Excel does not linger.
workbook.Save();
Marshal.FinalReleaseComObject(usedRange);
Marshal.FinalReleaseComObject(worksheet);
workbook.Close(Type.Missing, Type.Missing, Type.Missing);
Marshal.FinalReleaseComObject(workbook);
excel.Quit();
Marshal.FinalReleaseComObject(excel);
Dts.TaskResult = (int)ScriptResults.Success;
The code above first searches the header row for the index of the column that contains the hyperlinks (in this example the column name is "Photo"). Then, for each data row, if the cell has a hyperlink, the cell value is replaced with the hyperlink address.
Finally, configure the Excel connection manager to read the file path from the ExcelFilePath variable by setting an expression on its ExcelFilePath property.
Experiments
After running the package, opening one of the Excel files shows that the cell values in the Photo column have been replaced with the URLs, and the data is imported into SQL Server with those URLs.
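If changing the workbooks is not acceptable, the link targets can also be read without Excel, because an .xlsx file is a zip package in which each worksheet lists its hyperlinks and a companion .rels part holds the external targets. The following is a minimal Python sketch run outside SSIS, not part of the package above. It assumes .xlsx files (not legacy .xls) and that the worksheet is stored as sheet1.xml, which is common but not guaranteed; read_hyperlinks is a hypothetical helper name.

```python
import zipfile
import xml.etree.ElementTree as ET

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
DOC_REL_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def read_hyperlinks(xlsx, sheet_name="sheet1"):
    """Return {cell reference: external target} for one worksheet part.

    `xlsx` is a path or a file-like object. `sheet_name` is the worksheet
    part name inside the package (assumption: the first sheet is usually,
    but not always, stored as sheet1.xml).
    """
    sheet_part = f"xl/worksheets/{sheet_name}.xml"
    rels_part = f"xl/worksheets/_rels/{sheet_name}.xml.rels"
    with zipfile.ZipFile(xlsx) as package:
        if rels_part not in package.namelist():
            return {}  # no relationship part means no external links
        rels = ET.fromstring(package.read(rels_part))
        sheet = ET.fromstring(package.read(sheet_part))

    # Map relationship ids (rId1, rId2, ...) to their targets.
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.iter(f"{{{PKG_REL_NS}}}Relationship")
    }

    # Each <hyperlink> element names a cell (ref) and a relationship id.
    links = {}
    for link in sheet.iter(f"{{{MAIN_NS}}}hyperlink"):
        rel_id = link.get(f"{{{DOC_REL_NS}}}id")
        if rel_id in targets:
            links[link.get("ref")] = targets[rel_id]
    return links
```

The result is keyed by cell reference (for example "D2"), so it could be joined to the rows read by the data flow on the row number. Links that point inside the workbook carry no relationship id and are skipped by this sketch.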
References
Missing compiler required member 'microsoft.csharp.runtimebinder.binder.convert'
Extracting a URL from hyperlinked text in Excel cell
Excel interop prevent showing password dialog

What you will probably need to do is some hackery involving the Excel COM API, or macros. Ideally, though, you should stay away from using the Office COM API in SSIS.
You could pre-process the Excel file to extract that value with a non-standard operation in SSIS, such as a Script Component.
These are the steps you need to follow to import that data using the Script component:
Drag and drop a script component and select "source" as the script option type.
By default the script language is Microsoft Visual C# 2008 and I have done this sample with Microsoft Visual Basic 2008. Change this if you need to.
Define your output columns with the correct data type in "data type properties"
Edit the script. In the IDE you should add reference:
Microsoft Excel 11.0 Object Library
(if that reference doesn't work, try the Microsoft Excel 5.0 Object Library)
Finally, write some code:
Imports Microsoft.Office.Interop.Excel

Public Overrides Sub CreateNewOutputRows()
    ' Late-bound Excel objects; Excel must be installed on the machine.
    Dim oExcel As Object = CreateObject("Excel.Application")
    Dim FileName As String = Variables.FileName
    Dim oBook As Object = oExcel.Workbooks.Open(FileName)
    Dim oSheet As Object = oBook.Worksheets(1)

    Output0Buffer.AddRow()
    ' Change A1 to your actual column and row.
    Dim link As Object = oSheet.Range("A1").Hyperlinks(1)
    Output0Buffer.Address = link.Address & "#" & link.SubAddress

    ' Close the workbook without saving and shut Excel down.
    oBook.Close(False)
    oExcel.Quit()
End Sub
(Keep in mind that this code is only an illustration and may not run as is.)
You could see code in C# here:
C# Script in SSIS Script Task to convert Excel Column in "Text" Format to "General"
The only issue with the script method is that you need to have the Excel runtime installed.
More about script component here:
https://www.tutorialgateway.org/ssis-script-component-as-transformation/

Related

Update excel file using selenium webdriver

In my application, there are two buttons (export and import). Using the export button I can download an Excel file, and that downloaded file contains lots of blank fields.
I need to upload the same file using the import button, but if I upload it as is I get many mandatory-field validation messages because of the blank fields. Instead of uploading the same file, I have another Excel file in which every blank field is filled.
The exported file contains a unique id that is generated at run time, and I have to set the same unique id in the other Excel file. That file matches the exported file with every blank field filled, but its unique id is different, so unless I change it I will get a validation message.
I want to replace the epricer quote number 8766876 with 4181981 in the new import file.
Possible duplicate of the following question:
update excel cell in selenium webdriver
Possible solution from link :
// Open the workbook and locate the cell at row index 2, column index 3 (zero-based).
InputStream inp = new FileInputStream("workbook.xls");
Workbook wb = WorkbookFactory.create(inp);
Sheet sheet = wb.getSheetAt(0);
Row row = sheet.getRow(2);
Cell cell = row.getCell(3);
if (cell == null)
    cell = row.createCell(3);
// setCellValue(String) stores the value as a string cell, so no separate
// setCellType call is needed (Cell.CELL_TYPE_STRING is gone from recent POI versions).
cell.setCellValue("a test");
// Write the output to a file
FileOutputStream fileOut = new FileOutputStream("workbook.xls");
wb.write(fileOut);
fileOut.close();
You cannot update Excel using Selenium. Selenium is a set of libraries for driving a browser; it cannot interact with Excel documents.
If you want to interact with Excel documents you will need to use a library specifically designed to do that, some options are:
Apache POI
DocX4J
JExcelAPI

WildCards in SSIS Collection {not include} name xlsx

I have a process built in SSIS that loops through Excel files and imports data only from those whose name includes Report.
My UserVariable used as Expression is: *Report*.xlsx
and it works perfectly fine. Now I am trying to build a similar loop, but only for files that do NOT include Report in the file name.
Something like *<>Report*.xlsx
Is it possible?
Thanks for help!
Matt
In your loop, put a Script Task before your first task. Connect the two with a precedence constraint. Right-click the constraint and set its Constraint Options to Expression. Your expression would look like this...
FINDSTRING(@var, "Report", 1) == 0
where @var is the variable that holds the current file name in the loop.
Only files without "Report" inside will proceed to the next step.
Referencing this exact answer. SSIS Exclude certain files in Foreach Loop Container
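The constraint works because FINDSTRING returns the 1-based position of the requested occurrence, or 0 when the string is not found. A rough Python model of that behaviour, for illustration only (findstring is a hypothetical name, and how SSIS counts overlapping matches is not modelled here):

```python
def findstring(text, search, occurrence=1):
    """Return the 1-based position of the given occurrence of `search`, or 0."""
    position = -1
    for _ in range(occurrence):
        # Look for the next match after the previous one.
        position = text.find(search, position + 1)
        if position == -1:
            return 0  # fewer matches than requested
    return position + 1


# A file name without "Report" yields 0, so the constraint lets it through.
print(findstring("Sales.xlsx", "Report"))           # 0
print(findstring("Monthly Report.xlsx", "Report"))  # 9
```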
Unfortunately, you cannot achieve this with a file mask in the Foreach Loop (something like *[^...]*.xlsx), so you have to use a workaround:
Workarounds
First
Get the filtered list of files with a Script Task before entering the loop, then loop over that list using a Foreach Loop container (Foreach From Variable Enumerator).
Add an SSIS variable (e.g. User::FilesList) of type System.Object (Scope: Package).
Add a Script Task before the Foreach Loop container and add User::FilesList as a ReadWrite variable.
In the script, write the following code:
Imports System.Linq
Imports System.IO
Imports System.Collections.Generic

Public Sub Main()
    ' Collect the .xlsx files whose file name (not folder path) does not contain "Report".
    Dim lstFiles As New List(Of String)
    lstFiles.AddRange(Directory.GetFiles("C:\Temp", "*.xlsx", SearchOption.TopDirectoryOnly).Where(Function(x) Not Path.GetFileName(x).Contains("Report")).ToList)
    Dts.Variables.Item("FilesList").Value = lstFiles
    Dts.TaskResult = ScriptResults.Success
End Sub
In the Foreach Loop container, choose 'Foreach From Variable Enumerator' as the enumerator type and choose the FilesList variable as the source.
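The filtering step of this first workaround can be sketched in Python, outside SSIS, to show why the keyword should be tested against the file name rather than the full path. The function name files_without_keyword and the sample paths are hypothetical.

```python
import fnmatch
from pathlib import PureWindowsPath


def files_without_keyword(paths, mask="*.xlsx", keyword="Report"):
    """Keep paths whose file name matches `mask` but does not contain `keyword`."""
    kept = []
    for path in paths:
        # Test the file name only, so a folder called "Reports" is ignored.
        name = PureWindowsPath(path).name
        if fnmatch.fnmatchcase(name, mask) and keyword not in name:
            kept.append(path)
    return kept


paths = [
    r"C:\Reports\Sales.xlsx",
    r"C:\Reports\Monthly Report.xlsx",
    r"C:\Reports\notes.txt",
]
print(files_without_keyword(paths))  # ['C:\\Reports\\Sales.xlsx']
```

Checking the whole path instead would drop every file in this example, because the folder name itself contains "Report".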
Second
Inside the Foreach Loop, add an Expression Task that checks whether the file name contains the string Report.
Add a variable of type System.Boolean (Name: ExcludeFile).
Inside the Foreach Loop container, add an Expression Task before the Data Flow Task that imports the Excel file.
Inside the Expression Task, write the following, which sets ExcludeFile to True when the file name contains Report:
@[User::ExcludeFile] = (FINDSTRING(@[User::XlsxFile], "Report", 1) > 0)
Double-click the connector between the Expression Task and the Data Flow Task and write the following expression, so that only files that are not excluded proceed:
@[User::ExcludeFile] == False
Note: It is not necessary to use an Expression Task for this check. You can use a dummy Data Flow Task or a Script Task to check whether the file name contains the keyword you want to exclude.

Is there a way to import an image from excel to a PictureBox?

I am writing an application that works with Excel files. So far I have been using GemBox.Spreadsheet to work with them. However, I discovered that with GemBox.Spreadsheet I can save pictures to Excel files but not retrieve them. Can anyone recommend how to retrieve a picture from an Excel file? Thank you
Here is how you can retrieve an image from an Excel file with GemBox.Spreadsheet:
ExcelFile workbook = ExcelFile.Load("Sample.xlsx");
ExcelWorksheet worksheet = workbook.Worksheets.ActiveWorksheet;
// Select Picture element.
ExcelPicture picture = worksheet.Pictures[0];
// Import to PictureBox control.
this.pictureBox1.Image = Image.FromStream(picture.PictureStream);
// Or write to file.
File.WriteAllBytes("Sample.png", picture.PictureStream.ToArray());

Apply VBA code to Excel file from SSIS

Good evening everyone.
I have to build a SSIS package that does as follows:
1) Execute a VBA code to a XLS file (Transpose a range into another range)
2) Save the XLS (In the same file or as a new file)
3) Import the modified XLS from the Transposed range.
Basically I have to transpose the data inside a XLS that I must import, and I didn't find a good way to do that in SSIS (Since the column range can change between files)
With this simple VBA script I can do that and make SSIS read the data in a very straightforward way. However I'm not finding a way to apply this code without modifying the Excel previously manually to add the script and run the VBA script. I want to automate this so the package prepares the xls, extracts the new data, and save it to a table.
Can anyone shed some ideas on how to apply this code or other ways to do this? The most important point I think is that it's a very specific range that I want to transpose.
Sub myTranspose()
    With Range("a18:ZZ27", Range("a18:ZZ27").End(xlDown))
        .Copy
        Range("a30").PasteSpecial Transpose:=True
    End With
End Sub
Create a Script Task that is piped into a Data Flow task
Edit the Script Task by double clicking the Script Task and clicking the Edit Script button.
Add references to Excel and CSharp as seen in this answer
Add some code similar to the following:
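public void Main()
{
    string filepath = @"c:\temp\transpose.xlsx";
    Excel.Application xlApp = null;
    Excel._Workbook oWB = null;
    try
    {
        xlApp = new Excel.Application();
        xlApp.Visible = false;
        oWB = (Excel.Workbook)xlApp.Workbooks.Open(filepath);
        // Transpose B4:F5 and write the result starting at A8.
        Excel.Range fromrng = xlApp.get_Range("B4", "F5");
        Object[,] transposedRange = (Object[,])xlApp.WorksheetFunction.Transpose(fromrng);
        Excel.Range to_rng = xlApp.get_Range("A8", "A8");
        // Assumes the returned array is 1-based, so its upper bounds equal the row and column counts.
        to_rng = to_rng.Resize[transposedRange.GetUpperBound(0), transposedRange.GetUpperBound(1)];
        to_rng.Value = transposedRange;
        oWB.Save();
        Dts.TaskResult = (int)ScriptResults.Success;
    }
    catch (Exception ex)
    {
        // Report the failure instead of silently returning Success.
        Dts.Events.FireError(0, "Transpose", ex.ToString(), string.Empty, 0);
        Dts.TaskResult = (int)ScriptResults.Failure;
    }
    finally
    {
        // Close the workbook (already saved) and quit Excel so the process does not linger.
        if (oWB != null) oWB.Close(false);
        if (xlApp != null) xlApp.Quit();
    }
}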
Run against the sample transpose.xlsx I created, this writes the transposed copy of B4:F5 into the sheet starting at A8.
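The reshaping itself is simple: a block of R rows by C columns becomes C rows by R columns, which is why the destination range has to be resized before the values are written. A small Python illustration with made-up sample data (transpose is a hypothetical helper, not an Excel API):

```python
def transpose(block):
    """Turn the rows of a rectangular 2-D block into columns."""
    return [list(column) for column in zip(*block)]


# Two rows by three columns, standing in for a small source range.
block = [
    ["Name", "Q1", "Q2"],
    ["North", 10, 12],
]
transposed = transpose(block)
print(transposed)                          # [['Name', 'North'], ['Q1', 10], ['Q2', 12]]
print(len(transposed), len(transposed[0])) # 3 2
```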

Importing Excel file with dynamic name into SQL table via SSIS?

I've done a few searches here, and while some issues are similar, they don't seem to be exactly what I need.
What I'm trying to do is import an Excel file into a SQL table via SSIS, but the problem is that I will never know the exact filename. We get files at no steady interval, and the file usually has a date/month in the name. For instance, our current file is "Census Data - May 2013.xls". We will only ever load ONE file at a time, so I don't need to loop through a directory for multiple Excel files.
My concept is that I can take this file, copy it to a "Loading" directory, and load it from there. At the start of the package, I will first clear out the loading directory, then scan the original directory for an Excel file, copy it to the loading directory and then load it into SQL. I suppose I may have to store the file names somewhere so I don't copy the same file into the loading directory in subsequent months, but I'm not really sure of the best way to handle that.
I've pretty much got everything down except the part that scans the directory for the Excel file and copies it to the loading directory. I've taken the majority of my info from this page, which (again) is close to what I want to do but not quite exactly the solution I need.
Can anyone get me over the finish line? I can't seem to get the Excel Connection Manager right (this is my first time using variables), and I can't figure out how to get the file into the Loading directory.
Problem statement
How do I dynamically identify a file name?
You will require some mechanism to inspect the contents of a folder and see what exists. Specifically, you are looking for an Excel file in your "Loading" directory. You know the file extension and that is it.
Resolution A
Use a ForEach File Enumerator.
Configure the Enumerator with an Expression on FileSpec of *.xls or *.xlsx depending on which flavor of Excel you're dealing with.
Add another Expression on Directory to be your Loading directory.
I typically create SSIS Variables named FolderInput and FileMask and assign those in the Enumerator.
Now when you run your package, the Enumerator is going to look in Directory and find all the files that match the FileSpec.
Something needs to be done with what is found. You need to use the file name that the Enumerator returns. That's done through the Variable Mappings tab. I created a third Variable called CurrentFileName and assigned it the results of the enumerator.
If you put a Script Task inside the ForEach Enumerator, you should be able to see in the "Locals" window that the value of @[User::CurrentFileName] has changed from its design-time value to the "real" file name.
Resolution B
Use a Script Task.
You will still need to create a Variable to hold the current file name and it probably won't hurt to also have the FolderInput and FileMask Variables available. Set the former as ReadWrite and the latter as ReadOnly variables.
Choose the .NET language of your choice. I'm using C#. The script uses the System.IO.Directory.EnumerateFiles method:
using System;
using System.Data;
using System.IO;
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;

namespace ST_fe2ea536a97842b1a760b271f190721e
{
    [Microsoft.SqlServer.Dts.Tasks.ScriptTask.SSISScriptTaskEntryPointAttribute]
    public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
    {
        public void Main()
        {
            string folderInput = Dts.Variables["User::FolderInput"].Value.ToString();
            string fileMask = Dts.Variables["User::FileMask"].Value.ToString();
            try
            {
                var files = Directory.EnumerateFiles(folderInput, fileMask, SearchOption.AllDirectories);
                foreach (string currentFile in files)
                {
                    Dts.Variables["User::CurrentFileName"].Value = currentFile;
                    break;
                }
            }
            catch (Exception e)
            {
                Dts.Events.FireError(0, "Script overkill", e.ToString(), string.Empty, 0);
            }
            Dts.TaskResult = (int)ScriptResults.Success;
        }

        enum ScriptResults
        {
            Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,
            Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
        };
    }
}
Decision tree
Given the two resolutions to the above problem, how do you choose? Normally, people say "It Depends", but the only time it would depend is if the process should stop or error out when more than one file exists in the Loading folder. That is the one case where the ForEach enumerator would be more cumbersome than a Script Task. Otherwise, as I stated in my original response, the Script Task adds cost to your project for Development, Testing and Maintenance for no appreciable gain.
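The "stop if more than one file exists" case mentioned above can be sketched as follows. This is a Python illustration of the check only, not SSIS code; find_single_file is a hypothetical name, and it looks at the top level of the folder only, unlike the SearchOption.AllDirectories call in the script.

```python
import fnmatch
import os


def find_single_file(folder, mask):
    """Return the one file in `folder` matching `mask`; raise otherwise."""
    matches = sorted(
        entry.path
        for entry in os.scandir(folder)
        if entry.is_file() and fnmatch.fnmatch(entry.name, mask)
    )
    if len(matches) != 1:
        # Zero or several candidates: refuse to guess which one to load.
        raise RuntimeError(
            f"expected exactly one file matching {mask!r} in {folder!r}, "
            f"found {len(matches)}"
        )
    return matches[0]
```

Calling find_single_file(loading_dir, "*.xls") with a Loading directory that holds a single workbook returns its path. An empty directory or a second matching workbook raises instead of silently picking the first match.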
Bits and bobs
Further addressing nuances in the question: Configuring Excel - you'll need to be more specific in what isn't working. Both Siva's SO answer and the linked blogspot article show how to use the value of the Variable I call CurrentFileName to ensure the Excel File is pointing to the "right" file.
You will need to set the DelayValidation to True for both the Connection Manager and the Data Flow as the design-time value for the Variable will not be valid when the package begins execution. See this answer for a longer explanation but again, Siva called that out in their SO answer.

Resources