Bulk importing text files / VB2005 / SQL Server 2005 - sql-server

I've inherited a .NET app to support and enhance which reads in a couple of files of high hundreds of thousands of rows each, and one of millions of rows.
The original developer left me code like this:
For Each ModelListRow As String In ModelListDataArray
    If ModelListRow.Trim.Length = 0 Or ModelListRow.Contains(",") = False Then
        GoTo SKIP_ROW
    End If
    Dim ModelInfo = ModelListRow.Split(",")
    Dim ModelLocation As String = UCase(ModelInfo(0))
    Dim ModelCustomer As String = UCase(ModelInfo(1))
    Dim ModelNumber As String = UCase(ModelInfo(2))
    If ModelLocation = "LOCATION" Or ModelNumber = "MODEL" Then
        GoTo SKIP_ROW
    End If
    Dim MyDataRow As DataRow = dsModels.Tables(0).NewRow
    MyDataRow.Item("location") = ModelLocation.Replace(vbCr, "").Replace(vbLf, "").Replace(vbCrLf, "")
    MyDataRow.Item("model") = ModelNumber.Replace(vbCr, "").Replace(vbLf, "").Replace(vbCrLf, "")
    dsModels.Tables(0).Rows.Add(MyDataRow)
SKIP_ROW:
Next
and it takes an age (well, nearly half an hour) to import these files.
I suspect there's a MUCH better way to do it. I'm looking for suggestions.
Thanks in advance.

Take a look at BULK INSERT.
http://msdn.microsoft.com/en-us/library/ms188365(v=SQL.90).aspx
Basically you point SQL Server at a text file in CSV format and it does all the logic of pulling the data into a table. If you need to massage it more than that, you can pull the text file into a staging location in SQL Server, and then run a stored proc to massage it into the format you are looking for.
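To make that concrete in the VB2005 app itself, here is a minimal sketch that runs BULK INSERT into a staging table and then a cleanup proc. The table name dbo.ModelStaging, the proc usp_CleanModelStaging, the connection string and the file path are illustrative placeholders, not values from the question. Note that BULK INSERT reads the file from the SQL Server machine, so the path must be visible to the server, not the client.
Imports System.Data.SqlClient

Module BulkInsertSketch
    Sub LoadModelFile()
        ' Hypothetical object names throughout; adjust to your own schema.
        Dim sql As String = _
            "BULK INSERT dbo.ModelStaging " & _
            "FROM 'C:\Imports\ModelList.csv' " & _
            "WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2); " & _
            "EXEC dbo.usp_CleanModelStaging;"
        Using cn As New SqlConnection("Data Source=MyServer;Initial Catalog=MyDb;Integrated Security=SSPI;")
            cn.Open()
            Using cmd As New SqlCommand(sql, cn)
                cmd.CommandTimeout = 0   ' large files can exceed the default 30-second timeout
                cmd.ExecuteNonQuery()
            End Using
        End Using
    End Sub
End Module
FIRSTROW = 2 skips a header line, which takes the place of the "LOCATION"/"MODEL" header check in the original loop.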

The main options (apart from writing your own code from scratch) are:
BULK INSERT or bcp.exe, which work well if your data is cleanly formatted
SSIS, if you need workflow, data type transformations, data cleansing etc.
.NET SqlBulkCopy API (a minimal sketch follows below)
jkohlhepp's suggestion about pulling data into a staging table then cleaning it is a good one and a very common pattern in ETL processes. But if your "massaging" isn't easy to do in TSQL then you will probably need some .NET code anyway, whether it's in SSIS or in a CLR procedure.
Personally I would use SSIS in your case, because it looks like the data is not cleanly formatted so you will probably need some custom code to clean/re-format the data on its way to the database. However it does depend on what you're most comfortable/productive with and what existing tools and standards you have in place.
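For the SqlBulkCopy option mentioned above, a minimal VB2005 sketch could look like the following. It assumes the rows currently reach SQL Server one at a time (which would explain the half-hour run) and that the existing parsing loop still builds dsModels.Tables(0); the destination table name, column names and connection string are placeholders, not details from the question.
Imports System.Data
Imports System.Data.SqlClient

Module SqlBulkCopySketch
    Sub BulkCopyModels(ByVal dtModels As DataTable)
        Using cn As New SqlConnection("Data Source=MyServer;Initial Catalog=MyDb;Integrated Security=SSPI;")
            cn.Open()
            Using bulk As New SqlBulkCopy(cn)
                bulk.DestinationTableName = "dbo.Models"   ' placeholder destination table
                bulk.BatchSize = 10000                     ' commit in chunks rather than one huge batch
                bulk.BulkCopyTimeout = 0
                bulk.ColumnMappings.Add("location", "location")
                bulk.ColumnMappings.Add("model", "model")
                bulk.WriteToServer(dtModels)
            End Using
        End Using
    End Sub
End Module
Called once after the parsing loop, e.g. BulkCopyModels(dsModels.Tables(0)), this streams the whole DataTable to the server in batches instead of issuing an INSERT per row.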

Dim ExcelConnection As New System.Data.OleDb.OleDbConnection("Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\MyExcelSpreadsheet.xlsx;Extended Properties=""Excel 12.0 Xml;HDR=Yes""")
ExcelConnection.Open()

Related

Alternative to Index + Seek on Access/DAO Using SQL Server/ADO

I'm converting an Excel app from a local MS Access backend (with DAO) to SQL Server running on Azure (with ADO).
A common task I perform with DAO is the index + seek method to scan a large amount of input rows (~10,000, and using multi field indexes), check for matching records in the database, and update or add new records as required. The NoMatch property of the Seek method works very nicely when deciding to add or update.
This seems like it should be pretty simple with SQL Server, but I can't seem to find a good solution that lets me check for matches, add or update, and use multi column indexes.
Downloading the table to memory and then doing a batch update would be fine, but ADO's Find method doesn't seem as good as index + seek, and it can't use multiple columns. Connecting to SQL Server with an ADO provider that supports index seek would also work (Jet 4.0?) but I can't find examples of that either.
Am I missing something obvious? What is the best way to check for matches and add or update a large number of rows in SQL Server? Thanks
Edit:
Here's a simple example of the operation I'm doing currently in Access/DAO:
Set rs = db.OpenRecordset("TableName", dbOpenTable)
With rs
    .Index = "MultiFieldIndex"
    'Loop through the input data
    For i = 1 To 10000
        .Seek "=", Criteria1, Criteria2
        If Not .NoMatch Then
            'Found a match, just update specific fields
            .Edit    'DAO requires Edit before changing fields of an existing record
            !Field1 = a
            !Field2 = b
        Else
            'No match found, add a new record then populate
            .AddNew
            !Field3 = c
            !Field4 = d
        End If
        .Update
    Next i
End With
What's the best way to do something like this with SQL Server? I'd still probably start by loading a disconnected recordset of the full target table, but I'm not sure how to update a few thousand records when I don't know in advance whether I'll need to update or add new, or what the values of the input criteria will be. How do I find the rows I need to update/check without index + seek?
OR can I create a temp table in memory with only the input data then somehow just 'merge' that table to the database and the db will figure out how to update or add?
I feel like this should be a pretty basic procedure, but maybe I'm just missing some fundamental SQL concept?
Thanks for all the help!
The solution would be to write a stored procedure that does the update-or-insert:
create procedure dbo.prc_TableNameUpdIns (@parm1 int, @parm2 int)
AS
IF EXISTS (SELECT * FROM MyTable WHERE A = @parm1 AND B = @parm2)
    UPDATE ...
ELSE
    INSERT ...
GO
(you could use merge but I advise against it)
And you call the SP by creating an AdoConnection (in VBA)
Dim myAdoConnection As ADODB.Connection
Dim X As Integer, Y As Integer
X = 7
Y = 8
'SET the myAdoConnection
'Calling the proc
myAdoConnection.prc_TableNameUpdIns X, Y
Doing as much of the processing as possible in SQL is preferable. The only time you need this type of ADO manipulation is when you need to update the form only, and not necessarily the database itself.
For setting up the ADO connection itself, research the subject some more; this is only a pointer.

Automated export of a table/query from SQL SVR 2012 into Excel 2007 and subsequent VBA formatting macro?

I’ve been given the task of updating some code which isn’t working after our SQL SVR 2000 to SQL SVR 2012 (don't laugh) conversion. We have an automated task that creates 500ish xls files in a line-by-line manner using the old sp_OACreate command. In 2000, this job always took hours and was rickety at best. Needless to say, it doesn't work at all in 2012. That's what you get for never upgrading?
I rewrote the job by constructing a table, adding info into the table, & doing a bulk export into xls (using OpenRowSet). The new task ran in less than 15 minutes and I was ecstatic. Then Ops complained that the new files weren’t formatted…
I looped back and tried to get the formatting into the automation. That’s where everything fell apart.
Use OpenRowSet and export into an “xls template file” that has an autorun vba code to do the formatting when the file is opened for the first time. Didn’t work.
Use OpenRowSet and export into an “xlsm template file” that has an autorun vba code to do the formatting when the file is opened for the first time. Didn’t work.
Use BCP and export into an “xls template file” that has an autorun vba code to do the formatting when the file is opened for the first time. Didn’t work.
Use BCP and export into an “xlsm template file” that has an autorun vba code to do the formatting when the file is opened for the first time. Didn’t work.
Use BCP/OpenRowSet and export into xls/xlsx and try to execute a vba code from another workbook…Didn’t work.
Use BCP/OpenRowSet, export into a “dummy” xls/xlsx, use a bat file to recopy the file, resave, etc…Didn’t work.
Use BCP/OpenRowSet, export into a “dummy” txt/csv, use a bat file to convert the file into excel, resave, etc…Didn’t work.
Etc. Didn't work.
I tried every combination, every export method, every filetype, etc. Certain methods allowed me to export but no formatting. Other methods wouldn’t even export anything. No method permitted the export/format, though.
Then I discovered the “REAL” problem: when I export using BCP/OpenRowSet and then try to open the file, I always get the “file you are trying to open is in a different format than specified by the file extension” error (FYI, using Excel 2007). I had sort of ignored this error, but after days of banging my head against the wall I can see that it was pointing to the real problem all along. The export file (regardless of xls/xlsm/xlsx) is not an actual Excel file; it’s really just a bunch of HTML tags. That’s why the formatting VBA code won’t work: no matter what I do, it’s simply not a real Excel file.
So, I need an automated method to export a table to some sort of excel format (xls/xlsm/xlsx) and then execute a vba code to format (bold, column width, number/date formatting, etc.) the newly-exported file. This seems like such a routine task…but I see now that routine <> easy. I’ve seen references to NPOI and ClosedXML in forums, but I simply can’t believe that I need additional 3rd party software to accomplish this task.
You could write some VBA code in an Excel spreadsheet to go get the data from your database and then run the formatting VBA afterwards. The code below should be a good starting point.
Please note that you will have to add a reference to the Microsoft ActiveX Data Objects library for this code to work. Also, you may need to change the Provider in your connection string depending on your system's configuration. The code below uses SQL Server Native Client 11.
Dim cn As New ADODB.Connection
Dim SQLrs As New ADODB.Recordset
Dim SQLCommand As String
SQLCommand = "SELECT...FROM..."
cn.ConnectionString = "Provider=SQLNCLI11;Server=xxxxxxx;DataBase=xxxxxx;Trusted_Connection=yes;"
cn.Open
cn.Execute "SET NOCOUNT ON" ''required for complex T-SQL queries
SQLrs.CursorLocation = adUseClient
Call SQLrs.Open(SQLCommand, cn, adOpenStatic, adLockBatchOptimistic)
ActiveWorkbook.Sheets("sheet1").Range("a1").CopyFromRecordset SQLrs
SQLrs.Close
Set SQLrs = Nothing
cn.Close
Set cn = Nothing

Recommended solutions to load a huge CSV file to SQL Server 2008 R2 [duplicate]

What is your recommended way to import .csv files into Microsoft SQL Server 2008 R2?
I'd like something fast, as I have a directory with a lot of .csv files (>500MB spread across 500 .csv files).
I'm using SQL Server 2008 R2 on Win 7 x64.
Update: Solution
Here's how I solved the problem in the end:
I abandoned trying to use LINQ to Entities to do the job. It works, but it doesn't support bulk insert, so it's about 20x slower. Maybe the next version of LINQ to Entities will support this.
Took the advice given on this thread, used bulk insert.
I created a T-SQL stored procedure that uses bulk insert. Data goes into a staging table, is normalized then copied into the target tables.
I mapped the stored procedure into C# using the LINQ to Entities framework (there is a video on www.learnvisualstudio.net showing how to do this).
I wrote all the code to cycle through files, etc in C#.
This method eliminates the biggest bottleneck, which is reading tons of data off the drive and inserting it into the database.
The reason why this method is extremely quick at reading .csv files? Microsoft SQL Server gets to import the files directly from the hard drive straight into the database, using its own highly optimized routines. Most of the other C# based solutions require much more code, and some (like LINQ to Entities) end up having to pipe the data slowly into the database via the C#-to-SQL-server link.
Yes, I know it'd be nicer to have 100% pure C# code to do the job, but in the end:
(a) For this particular problem, using T-SQL requires much less code compared to C#, about 1/10th, especially for the logic to denormalize the data from the staging table. This is simpler and more maintainable.
(b) Using T-SQL means you can take advantage of the native bulk insert procedures, which speeds things up from a 20-minute wait to a 30-second pause.
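The poster did this in C#, but translated to the VB.NET used for the main question on this page, the "cycle through files, bulk insert into staging, then normalize" part of that approach might look roughly like this sketch. The folder, the staging table dbo.CsvStaging, the proc usp_NormalizeCsvStaging and the connection string are illustrative assumptions, and because BULK INSERT runs inside SQL Server the file paths must be visible to the server machine itself.
Imports System.IO
Imports System.Data.SqlClient

Module CsvFolderLoader
    Sub LoadFolder(ByVal folder As String)
        Using cn As New SqlConnection("Data Source=MyServer;Initial Catalog=MyDb;Integrated Security=SSPI;")
            cn.Open()
            For Each csvFile As String In Directory.GetFiles(folder, "*.csv")
                ' Empty the staging table, bulk load one file, then normalize into the target tables.
                Dim sql As String = _
                    "TRUNCATE TABLE dbo.CsvStaging; " & _
                    "BULK INSERT dbo.CsvStaging FROM '" & csvFile.Replace("'", "''") & "' " & _
                    "WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n'); " & _
                    "EXEC dbo.usp_NormalizeCsvStaging;"
                Using cmd As New SqlCommand(sql, cn)
                    cmd.CommandTimeout = 0
                    cmd.ExecuteNonQuery()
                End Using
            Next
        End Using
    End Sub
End Module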
Using BULK INSERT in a T-SQL script seems to be a good solution.
http://blog.sqlauthority.com/2008/02/06/sql-server-import-csv-file-into-sql-server-using-bulk-insert-load-comma-delimited-file-into-sql-server/
You can get the list of files in your directory with xp_cmdshell and the dir command (with a bit of cleanup). In the past, I tried to do something like this with sp_OAMethod and VBScript functions and had to use the dir method because I had trouble getting the list of files with the FSO object.
http://www.sqlusa.com/bestpractices2008/list-files-in-directory/
If you have to do anything with the data in the files other than insert it, then I would recommend using SSIS. It can not only insert and/or update, it can also clean the data for you.
The first officially supported way of importing large text files is the command-line tool bcp (the Bulk Copy Utility), which is also very useful for huge amounts of binary data.
Please check out this link: http://msdn.microsoft.com/en-us/library/ms162802.aspx
However, in SQL Server 2008 I presume the BULK INSERT command would be your first choice, since it is part of the standard command set. If for any reason you have to maintain backward compatibility, I'd stick with the bcp utility, which is available for SQL Server 2000 too.
HTH :)
EDITED LATER: Googling around, I recalled that SQL Server 2000 had the BULK INSERT command too... however, there was obviously some reason I stuck with bcp.exe, and I cannot recall why... perhaps some limitations, I guess.
I would recommend this:
using System;
using System.Data;
using Microsoft.VisualBasic.FileIO;

namespace ReadDataFromCSVFile
{
    static class Program
    {
        static void Main()
        {
            string csv_file_path = @"C:\Users\Administrator\Desktop\test.csv";
            DataTable csvData = GetDataTableFromCSVFile(csv_file_path);
            Console.WriteLine("Rows count:" + csvData.Rows.Count);
            Console.ReadLine();
        }

        private static DataTable GetDataTableFromCSVFile(string csv_file_path)
        {
            DataTable csvData = new DataTable();
            try
            {
                using (TextFieldParser csvReader = new TextFieldParser(csv_file_path))
                {
                    csvReader.SetDelimiters(new string[] { "," });
                    csvReader.HasFieldsEnclosedInQuotes = true;
                    // The first row supplies the column names
                    string[] colFields = csvReader.ReadFields();
                    foreach (string column in colFields)
                    {
                        DataColumn datecolumn = new DataColumn(column);
                        datecolumn.AllowDBNull = true;
                        csvData.Columns.Add(datecolumn);
                    }
                    while (!csvReader.EndOfData)
                    {
                        string[] fieldData = csvReader.ReadFields();
                        // Making empty value as null
                        for (int i = 0; i < fieldData.Length; i++)
                        {
                            if (fieldData[i] == "")
                            {
                                fieldData[i] = null;
                            }
                        }
                        csvData.Rows.Add(fieldData);
                    }
                }
            }
            catch (Exception ex)
            {
                // Swallowing the exception hides parse errors; consider logging ex here
            }
            return csvData;
        }
    }
}
//Copy the DataTable to SQL Server using SqlBulkCopy
// Requires a reference to System.Data and "using System.Data.SqlClient;"
static void InsertDataIntoSQLServerUsingSQLBulkCopy(DataTable csvData)
{
    using (SqlConnection dbConnection = new SqlConnection("Data Source=ProductHost;Initial Catalog=yourDB;Integrated Security=SSPI;"))
    {
        dbConnection.Open();
        using (SqlBulkCopy s = new SqlBulkCopy(dbConnection))
        {
            s.DestinationTableName = "Your table name";
            foreach (DataColumn column in csvData.Columns)
                s.ColumnMappings.Add(column.ColumnName, column.ColumnName);
            s.WriteToServer(csvData);
        }
    }
}
If the structure of all your CSVs is the same, I recommend using Integration Services (SSIS) to loop through them and insert all of them into the same table.
I understand this is not exactly your question, but if you get into a situation where you use a straight INSERT, use TABLOCK and insert multiple rows per statement. It depends on the row size, but I usually go for 600-800 rows at a time. If it is a load into an empty table, then sometimes dropping the indexes and recreating them after it is loaded is faster. If you can, sort the data on the clustered index before it is loaded. Use IGNORE_CONSTRAINTS and IGNORE_TRIGGERS if you can. Put the database in single-user mode if you can.
USE AdventureWorks2008R2;
GO
INSERT INTO Production.UnitMeasure with (tablock)
VALUES (N'FT2', N'Square Feet ', '20080923'), (N'Y', N'Yards', '20080923'), (N'Y3', N'Cubic Yards', '20080923');
GO

How Do I Convert an Excel XLS to an Access MDB using Excel VBA

I need to use VBA from Excel to load an Excel workbook into Access and transfer it into a database.
Dim acc As New Access.Application
acc.OpenCurrentDatabase "C:\Test.xls"
I got that far and Excel crashes and has to restart. My plan was to use the following but I can't get that far.
acc.DoCmd.TransferDatabase
Any ideas? I've googled for days and come up with nothing.
Edit: Thanks for the responses so far. I absolutely have to use Excel VBA, unfortunately. There is an Excel spreadsheet that has a bunch of empty columns that are being recognized by the Jet engine as defined columns, too many in fact: > 255 (or is it > 256?). I do NOT want to open the Excel worksheet for any reason (this takes far too long over the network), and I don't have the option or choice to format it correctly or clean it up. It's easy to convert an XLS spreadsheet into an MDB database inside of Access, as you all say, but that's not an option. So like I said, I need to use VBA in Excel to manipulate the Access object to convert an XLS workbook to an MDB database; once I have this, the rest will be cake. Thanks so much! I love this site.
This task is straightforward if you're able to use VBA from within Microsoft Access; e.g.:
DoCmd.TransferSpreadsheet , , _
"tblImportFromExcel","C:\path\to\myfile.xls", True, "A1:B200"
TransferSpreadsheet documentation.
What is wrong with the suggested solution from Adam Bernier (with the addition from PowerUser concerning using an Access object from within Excel)? Your last comment was after those suggestions, and you did not reply.
Dim acc As New Access.Application
acc.OpenCurrentDatabase "C:\Test.mdb"
acc.DoCmd.TransferSpreadsheet _
TransferType:=acImport, _
Spreadsheettype:=acSpreadsheetTypeExcel8, _
TableName:="tblImportFromExcel", _
Filename:="C:\path\to\myfile.xls", _
HasFieldNames:=True, _
Range:="A1:B200"
(Adapt as needed, especially the HasFieldNames and Range arguments.) If this does not work, then there is probably something really wrong with your Excel sheet.
The only other thing I can think of (but that would mean opening the file) is to save the Excel sheet as XML, transform the values via XSLT into a more suitable format, and then import the resulting XML. But that might be overkill (it depends on how complex your file is and how often you need to run this import).
HTH
Andreas

How to Export binary data in SqlServer to file using DTS

I have an image column in a SQL Server 2000 table that is used to store the binary content of a PDF file.
I need to export the contents of each row in the column to an actual physical file using SQL Server 2000 DTS.
I found the following method for VB at http://www.freevbcode.com/ShowCode.asp?ID=1654&NoBox=True
Set rs = conn.execute("select BinaryData from dbo.theTable")
FileHandle = FreeFile
Open ("AFile") For Binary As #FileHandle
ByteLength = LenB(rs("BinaryData"))
ByteContent = rs("BinaryData").GetChunk(ByteLength)
Put #FileHandle, ,ByteContent
Close #FileHandle
Unfortunately, the DTS script task is VBScript, not VB, and it throws up on the As keyword in the third line.
Any other ideas?
Writing binary files is a notoriously difficult task in VBScript. The only direct file operations exposed by VBScript live in the FileSystemObject object, which only supports writing text files. The only viable option is to use ADO Stream objects, which still is cumbersome because VBScript does not support passing script-created Byte arrays to COM objects, which is required to be able to write arbitrary binary data.
Here's a technique using ADO streams that probably won't work but might put you on the track to the right solution.
adTypeBinary = 1
Set rs = CreateObject("ADODB.Recordset")
rs.Open "SELECT binary_data FROM dbo.data_table", "connection string..."
Set strm = CreateObject("ADODB.Stream")
strm.Type = adTypeBinary
strm.Open
strm.Write rs.Fields("binary_data").GetChunk( _
LenB(rs.Fields("binary_data").Value))
strm.SaveToFile "binarydata.bin"
strm.Close
rs.Close
I experimented with this code using Access, and unfortunately it didn't seem to work. I didn't get the same binary data that was in the table I created, but I haven't bothered going into a hex editor to see what was different.
If you get errors from an ADO operation, you can get the actual messages by adding an "On Error Resume Next" to the top of the script and using this code.
For Each err In rs.ActiveConnection.Errors
WScript.Echo err.Number & ": " & err.Description
Next
I would go with SQL Server Integration Services (SSIS) instead of DTS, if at all possible, and use a Script Task which would allow you to use VB.NET.
You can connect to your SQL Server 2000 data source and point the exported output to a file.
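Inside an SSIS Script Task the VB.NET needed is fairly small: read each row's image column and write the bytes to a file. Here is a minimal sketch, assuming a DocumentId key column, an output folder and a connection string that are placeholders rather than details from the question (only the BinaryData column and dbo.theTable come from the code quoted above).
Imports System.IO
Imports System.Data.SqlClient

Module ExportPdfSketch
    Sub ExportPdfs()
        Dim outputFolder As String = "C:\Export\"
        Using cn As New SqlConnection("Data Source=MyServer;Initial Catalog=MyDb;Integrated Security=SSPI;")
            cn.Open()
            Using cmd As New SqlCommand("SELECT DocumentId, BinaryData FROM dbo.theTable", cn)
                Using rdr As SqlDataReader = cmd.ExecuteReader()
                    While rdr.Read()
                        Dim fileName As String = Path.Combine(outputFolder, rdr("DocumentId").ToString() & ".pdf")
                        Dim bytes As Byte() = CType(rdr("BinaryData"), Byte())
                        File.WriteAllBytes(fileName, bytes)   ' one PDF per row
                    End While
                End Using
            End Using
        End Using
    End Sub
End Module
For very large PDFs you could switch the reader to CommandBehavior.SequentialAccess and stream the column in chunks, but for typical documents reading the whole column per row is simpler.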
