Using Powershell to Bulk Import Large CSV into SQL Server - sql-server

I came across a post discussing how to use PowerShell to bulk import massive amounts of data relatively quickly. I have a typical CSV file with about 5 million rows, formatted in the usual way.
I keep getting the same error messages regardless of whether I import a txt or csv file, and playing around with the csvdelimiter/firstcolumnnames section created its own issues.
I've spent hours trying to get this working with MY csv files, and I keep getting the same error messages no matter what I try. All fields accept NULL and they are identical in every way between the table and the csv file. The table does not have a primary key.
# Database variables
$sqlserver = "SERVERNAMEHERE"
$database = "autos"
$table = "AgedAutos"
# CSV variables
$csvfile = "C:\temp\aged.csv"
$csvdelimiter = "',"
$firstRowColumnNames = $true
################### No need to modify anything below ###################
Write-Host "Script started..."
$elapsed = [System.Diagnostics.Stopwatch]::StartNew()
[void][Reflection.Assembly]::LoadWithPartialName("System.Data")
[void][Reflection.Assembly]::LoadWithPartialName("System.Data.SqlClient")
# 50k worked fastest and kept memory usage to a minimum
$batchsize = 50000
# Build the sqlbulkcopy connection, and set the timeout to infinite
$connectionstring = "Data Source=$sqlserver;Integrated Security=true;Initial Catalog=$database;"
$bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($connectionstring, [System.Data.SqlClient.SqlBulkCopyOptions]::TableLock)
$bulkcopy.DestinationTableName = $table
$bulkcopy.bulkcopyTimeout = 0
$bulkcopy.batchsize = $batchsize
# Create the datatable, and autogenerate the columns.
$datatable = New-Object System.Data.DataTable
# Open the text file from disk
$reader = New-Object System.IO.StreamReader($csvfile)
$columns = (Get-Content $csvfile -First 1).Split($csvdelimiter)
if ($firstRowColumnNames -eq $true) { $null = $reader.readLine() }
foreach ($column in $columns) {
$null = $datatable.Columns.Add()
}
# Read in the data, line by line
while (($line = $reader.ReadLine()) -ne $null) {
$null = $datatable.Rows.Add($line.Split($csvdelimiter))
$i++; if (($i % $batchsize) -eq 1) {
$bulkcopy.WriteToServer($datatable)
Write-Host "$i rows have been inserted in $($elapsed.Elapsed.ToString())."
$datatable.Clear()
}
}
# Add in all the remaining rows since the last clear
if($datatable.Rows.Count -gt 0) {
$bulkcopy.WriteToServer($datatable)
$datatable.Clear()
}
# Clean Up
$reader.Close(); $reader.Dispose()
$bulkcopy.Close(); $bulkcopy.Dispose()
$datatable.Dispose()
Write-Host "Script complete. $i rows have been inserted into the database."
Write-Host "Total Elapsed Time: $($elapsed.Elapsed.ToString())"
# Sometimes the Garbage Collector takes too long to clear the huge datatable.
[System.GC]::Collect()
Error message listed below.
Exception calling "WriteToServer" with "1" argument(s): "The given value of type String from the data source cannot be converted to
type date of the specified target column."
At C:\powershell_scripts\batch_csv_import-code1-working-test for auto table.ps1:43 char:3
+ $bulkcopy.WriteToServer($datatable)
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (:) [], MethodInvocationException
+ FullyQualifiedErrorId : InvalidOperationException
340000 rows have been inserted in 00:00:03.5156162
I have no idea what that error means since I cannot find anything useful on Google. I'm thinking one of the columns might be listed incorrectly in SQL Server, but I could be wrong.
Please help me figure out the problem. Thanks.

You are getting all the data in the first column because your value for $csvdelimiter is incorrect.
You have: $csvdelimiter = "',"
It should be: $csvdelimiter = ","
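A quick way to sanity-check the delimiter before rerunning the full import (a minimal sketch, reusing the file path from the script above) is to split just the header line and inspect the resulting columns:
# Read only the header line and show how it splits with the corrected delimiter
$csvdelimiter = ","
$header = Get-Content "C:\temp\aged.csv" -TotalCount 1
$header.Split($csvdelimiter) | ForEach-Object { "Column: [$_]" }
If each column name comes out on its own line, the delimiter matches the file; if everything appears on one line, the delimiter is still wrong.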

Related

Sqlbulkcopy Excessive Memory Consumption even with EnableStreaming and low BatchSize

I am trying to bulk load data from Oracle to SQL Server through PowerShell using SqlBulkCopy.
On small data everything works fine, but on big datasets, even with BatchSize and EnableStreaming set, SqlBulkCopy takes all the available memory... until it runs out of memory.
Also, the notify function never seems to fire, so I guess that even with streaming=True the process first loads everything into memory...
What did I miss?
$current = Get-Date
#copy table from Oracle table to SQL Server table
add-type -path "D:\oracle\product\12.1.0\client_1\odp.net\managed\common\Oracle.ManagedDataAccess.dll";
#define Oracle connection string
$conn_str = "cstr"
# query for oracle table
$qry = "
SELECT
ID,CREATEDT,MODIFIEDDT
FROM MYTABLE
WHERE source.ISSYNTHETIC=0 AND source.VALIDFROM >= TO_Date('2019-01-01','yyyy-mm-dd')
";
# key (on the left side) is the source column while value (on the right side) is the target column
[hashtable] $mapping = @{'ID'='ID';'CREATEDT'='CREATEDT';'MODIFIEDDT'='MODIFIEDDT'};
$adapter = new-object Oracle.ManagedDataAccess.Client.OracleDataAdapter($qry, $conn_str);
#$info = new-object Oracle.ManagedDataAccess.Client;
#Write-Host ( $info | Format-Table | Out-String)
$dtbl = new-object System.Data.DataTable('MYTABLE');
#this Fill method will populate the $dtbl with the query $qry result
$adapter.Fill($dtbl);
#define sql server target instance
$sqlconn = "cstr";
$sqlbc = new-object system.data.sqlclient.Sqlbulkcopy($sqlconn)
$sqlbc.BatchSize = 1000;
$sqlbc.EnableStreaming = $true;
$sqlbc.NotifyAfter = 1000;
$sqlbc.DestinationTableName="DWHODS.MYTABLE";
#need to tell $sqlbc the column mapping info
foreach ($k in $mapping.keys)
{
$colMapping = new-object System.Data.SqlClient.SqlBulkCopyColumnMapping($k, $mapping[$k]);
$sqlbc.ColumnMappings.Add($colMapping) | out-null
}
$sqlbc.WriteToServer($dtbl);
$sqlbc.Close();
$end= Get-Date
$diff= New-TimeSpan -Start $current -End $end
Write-Output "import needed : $diff"
Thanks to Jeroen, I changed the code as shown below, and now it no longer consumes all the memory:
$oraConn = New-Object Oracle.ManagedDataAccess.Client.OracleConnection($conn_str);
$oraConn.Open();
$command = $oraConn.CreateCommand();
$command.CommandText=$qry;
$reader = $command.ExecuteReader()
...
$sqlbc.WriteToServer($reader);
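For reference, a minimal end-to-end sketch of the reader-based approach (reusing $conn_str, $qry and $sqlconn from the code above; the SqlBulkCopy settings are unchanged):
# Open the Oracle connection and create a data reader instead of filling a DataTable
$oraConn = New-Object Oracle.ManagedDataAccess.Client.OracleConnection($conn_str);
$oraConn.Open();
$command = $oraConn.CreateCommand();
$command.CommandText = $qry;
$reader = $command.ExecuteReader();
# SqlBulkCopy pulls rows from the reader in batches instead of buffering the whole result set
$sqlbc = New-Object System.Data.SqlClient.SqlBulkCopy($sqlconn);
$sqlbc.BatchSize = 1000;
$sqlbc.EnableStreaming = $true;
$sqlbc.DestinationTableName = "DWHODS.MYTABLE";
# add ColumnMappings here, as in the original script, if the column names differ
$sqlbc.WriteToServer($reader);
# Clean up both ends
$sqlbc.Close();
$reader.Close();
$oraConn.Close();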

Load data into memory or select multiple times

I have a process that runs every hour. As part of the process it iterates over a text file that contains about 100K strings, and it needs to check whether each line already exists in a specific table in a SQL Server database that has about 30M records.
I have 2 options:
Option 1: SELECT all the strings from the table, load them into memory, and then check each line of the file against the in-memory data during the process.
Downside: It eats up the machine's memory.
Option 2: Check whether each line of the 100K text file is found in the database (assuming the table is indexed correctly).
Downside: It requires multiple requests (100K requests) to the database.
Questions:
If I'm using option 2, can SQL Server handle this number of requests?
What is the preferred way in order to overcome this issue?
Below is example PowerShell code for another option: bulk insert the strings into a temp table and perform the lookups as a single set-based SELECT query. I would expect this method to typically run in a few seconds, depending on your infrastructure.
$connectionString = "Data Source=.;Initial Catalog=YourDatabase;Integrated Security=SSPI"
$connection = New-Object System.Data.SqlClient.SqlConnection($connectionString)
# load strings from file into a DataTable
$timer = [System.Diagnostics.Stopwatch]::StartNew()
$dataTable = New-Object System.Data.DataTable
($dataTable.Columns.Add("StringData", [System.Type]::GetType("System.String"))).MaxLength = 20
$streamReader = New-Object System.IO.StreamReader("C:\temp\temp_strings.txt")
while ($streamReader.Peek() -ge 0) {
$string = $streamReader.ReadLine()
$row = $dataTable.NewRow()
[void]$dataTable.Rows.Add($row)
$row[0] = $string
}
$streamReader.Close()
Write-Host "DataTable load completed. Duration $($timer.Elapsed.ToString())"
# bulk insert strings into temp table
$timer = [System.Diagnostics.Stopwatch]::StartNew()
$connection.Open();
$command = New-Object System.Data.SqlClient.SqlCommand("CREATE TABLE #temp_strings(StringValue varchar(20));", $connection)
[void]$command.ExecuteNonQuery()
$bcp = New-Object System.Data.SqlClient.SqlBulkCopy($connection)
$bcp.DestinationTableName = "#temp_strings"
$bcp.WriteToServer($dataTable)
$bcp.Close()
Write-Host "BCP completed. Duration $($timer.Elapsed.ToString())"
# execute set-based lookup query and return found/notfound for each string
$timer = [System.Diagnostics.Stopwatch]::StartNew()
$command.CommandText = @"
SELECT
strings.StringValue
, CASE
WHEN YourTable.YourTableKey IS NOT NULL THEN CAST(1 AS bit)
ELSE CAST(0 AS bit)
END AS Found
FROM #temp_strings AS strings
LEFT JOIN dbo.YourTable ON strings.StringValue = YourTable.YourTableKey;
"#
$reader = $command.ExecuteReader()
while($reader.Read()) {
Write-Host "String $($reader["StringValue"]) found: $($reader["Found"])"
}
$connection.Close()
Write-Host "Lookups completed. Duration $($timer.Elapsed.ToString())"
As an alternative to bulk insert, you could pass the strings using a table-valued parameter (or XML, JSON, or delimited values) for use in the query.
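A minimal sketch of the table-valued parameter variant is shown below. It assumes a user-defined table type (called dbo.StringList here purely for illustration) has already been created on the server, and it reuses the $command and $dataTable from the script above in place of the temp-table step:
# Assumed to exist on the server (hypothetical name):
#   CREATE TYPE dbo.StringList AS TABLE (StringValue varchar(20));
$command.CommandText = @"
SELECT
    strings.StringValue
    , CASE
        WHEN YourTable.YourTableKey IS NOT NULL THEN CAST(1 AS bit)
        ELSE CAST(0 AS bit)
      END AS Found
FROM @StringList AS strings
LEFT JOIN dbo.YourTable ON strings.StringValue = YourTable.YourTableKey;
"@
# Pass the DataTable directly as a structured (table-valued) parameter
$tvp = $command.Parameters.Add("@StringList", [System.Data.SqlDbType]::Structured)
$tvp.TypeName = "dbo.StringList"
$tvp.Value = $dataTable
$reader = $command.ExecuteReader()
This avoids creating and bulk-loading the temp table, at the cost of maintaining the table type on the server.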

Powershell function to import csv file to SQL Server database table

I have created a PowerShell function that bulk copies data from a .csv file (first row is the header), and inserts the data in to a SQL Server database table.
See my code:
function BulkCsvImport($sqlserver, $database, $table, $csvfile, $csvdelimiter, $firstrowcolumnnames) {
Write-Host "Bulk Import Started."
$elapsed = [System.Diagnostics.Stopwatch]::StartNew()
[void][Reflection.Assembly]::LoadWithPartialName("System.Data")
[void][Reflection.Assembly]::LoadWithPartialName("System.Data.SqlClient")
# 50k worked fastest and kept memory usage to a minimum
$batchsize = 50000
# Build the sqlbulkcopy connection, and set the timeout to infinite
$connectionstring = "Data Source=$sqlserver;Integrated Security=true;Initial Catalog=$database;"
# Wipe the bulk insert table first
Invoke-Sqlcmd -Query "TRUNCATE TABLE $table" -ServerInstance $sqlserver -Database $database
$bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($connectionstring, [System.Data.SqlClient.SqlBulkCopyOptions]::TableLock)
$bulkcopy.DestinationTableName = $table
$bulkcopy.bulkcopyTimeout = 0
$bulkcopy.batchsize = $batchsize
# Create the datatable, and autogenerate the columns.
$datatable = New-Object System.Data.DataTable
# Open the text file from disk
$reader = New-Object System.IO.StreamReader($csvfile)
$columns = (Get-Content $csvfile -First 1).Split($csvdelimiter)
if ($firstrowcolumnnames -eq $true) { $null = $reader.readLine() }
foreach ($column in $columns) {
$null = $datatable.Columns.Add()
}
# Read in the data, line by line
while (($line = $reader.ReadLine()) -ne $null) {
$null = $datatable.Rows.Add($line.Split($csvdelimiter))
$i++;
if (($i % $batchsize) -eq 0) {
$bulkcopy.WriteToServer($datatable)
Write-Host "$i rows have been inserted in $($elapsed.Elapsed.ToString())."
$datatable.Clear()
}
}
# Add in all the remaining rows since the last clear
if($datatable.Rows.Count -gt 0) {
$bulkcopy.WriteToServer($datatable)
$datatable.Clear()
}
# Clean Up
$reader.Close();
$reader.Dispose()
$bulkcopy.Close();
$bulkcopy.Dispose()
$datatable.Dispose()
Write-Host "Bulk Import Completed. $i rows have been inserted into the database."
# Write-Host "Total Elapsed Time: $($elapsed.Elapsed.ToString())"
# Sometimes the Garbage Collector takes too long to clear the huge datatable.
$i = 0
[System.GC]::Collect()
}
I am looking to modify the above, though, so that the column names in the .csv file are matched up with the column names in the SQL Server database table. They should be identical. At the moment the data is being imported into the wrong database columns.
Could I get some assistance as what I need to do to modify the above function to achieve this?
I would use an existing open source solution:
Import-DbaCsv - dbatools.io
Import-DbaCsv.ps1
Efficiently imports very large (and small) CSV files into SQL Server.
Import-DbaCsv takes advantage of .NET's super fast SqlBulkCopy class to import CSV files into SQL Server.
Parameters:
-ColumnMap
By default, the bulk copy tries to automap columns. When it doesn't
work as desired, this parameter will help.
PS C:\> $columns = @{
>> Text = 'FirstName'
>> Number = 'PhoneNumber'
>> }
PS C:\> Import-DbaCsv -Path c:\temp\supersmall.csv
-SqlInstance sql2016 -Database tempdb -ColumnMap $columns
-BatchSize 50000 -Table table_name -Truncate
The CSV column 'Text' is inserted into the SQL column 'FirstName' and the CSV column 'Number' is inserted into the SQL column 'PhoneNumber'. All other columns are ignored and therefore end up with null or default values.
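If you would rather keep the hand-rolled function, a minimal sketch of the same idea (assuming, as stated in the question, that the CSV header names are identical to the table's column names and contain no quoting) is to name the DataTable columns after the header and add an explicit mapping for each one, replacing the existing column-creation loop:
# Name each DataTable column after the CSV header and map it to the same-named table column
$columns = (Get-Content $csvfile -TotalCount 1).Split($csvdelimiter)
foreach ($column in $columns) {
    $null = $datatable.Columns.Add($column)
    $null = $bulkcopy.ColumnMappings.Add($column, $column)
}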

Powershell Looping through eventlog

I am trying to gather data from the event logs for logons, disconnects, logoffs, etc. This data will be stored in CSV format.
This is the script I am working with, which I got from Microsoft TechNet and have modified to meet my requirements. The script works as it should, but there is some looping going on that I can't figure out how to stop.
$ServersToQuery = Get-Content "C:\Users\metho.HOME\Desktop\computernames.txt"
$cred = "home\Administrator"
$StartTime = "September 19, 2018"
#$Yesterday = (Get-Date) - (New-TimeSpan -Days 1)
foreach ($Server in $ServersToQuery) {
$LogFilter = @{
LogName = 'Microsoft-Windows-TerminalServices-LocalSessionManager/Operational'
ID = 21, 23, 24, 25
StartTime = (Get-Date).AddDays(-1)
}
$AllEntries = Get-WinEvent -FilterHashtable $LogFilter -ComputerName $Server -Credential $cred
$AllEntries | Foreach {
$entry = [xml]$_.ToXml()
$Output += New-Object PSObject -Property @{
TimeCreated = $_.TimeCreated
User = $entry.Event.UserData.EventXML.User
IPAddress = $entry.Event.UserData.EventXML.Address
EventID = $entry.Event.System.EventID
ServerName = $Server
}
}
}
$FilteredOutput += $Output | Select TimeCreated, User, ServerName, IPAddress, @{Name='Action';Expression={
if ($_.EventID -eq '21'){"logon"}
if ($_.EventID -eq '22'){"Shell start"}
if ($_.EventID -eq '23'){"logoff"}
if ($_.EventID -eq '24'){"disconnected"}
if ($_.EventID -eq '25'){"reconnection"}
}
}
$Date = (Get-Date -Format s) -replace ":", "-"
$FilePath = "$env:USERPROFILE\Desktop\$Date`_RDP_Report.csv"
$FilteredOutput | Sort TimeCreated | Export-Csv $FilePath -NoTypeInformation
Write-host "Writing File: $FilePath" -ForegroundColor Cyan
Write-host "Done!" -ForegroundColor Cyan
#End
The first time I run the script, it runs fine and I get the CSV output as it should be. When I run the script again, a new CSV is created (as it should be) but the same event log entries are created twice, and if I run it again, three entries are created for the same event. This is very strange, as a new CSV is created each time and I do not have the -Append switch configured for Export-Csv.
$FilteredOutput = @()
$Output = @()
I did try adding these two lines to the above script, as I read somewhere that this is needed if I am mixing multiple variables into an array (I do not understand this, so apologies if I have got it wrong).
Can someone please help me with this? More importantly, I need to understand it, as it will be good to know for future projects.
Thanks again.
mEtho
It sounds like the $Output and $FilteredOutput variables aren't getting cleared when you run the script on subsequent occasions (nothing in the current script looks to do that), so the results just keep getting appended to these variables each time.
As you've already said, you could add these to the top of your script:
$FilteredOutput = @()
$Output = @()
This will initialise them as empty arrays at the beginning, which ensures they start empty as well as making it possible for them to be appended to (which happens in the script via +=). Without doing this the script would likely have failed on the first run, so I assume you must have done this in your current session at some point for it to be working at all.
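As a side note, a common way to avoid the += accumulation pattern altogether is to capture the loop output in a single assignment. A minimal sketch (reusing $ServersToQuery, $cred and the $LogFilter hashtable from the question) looks like this:
# Everything emitted inside the loop is collected into $Output in one go;
# nothing carries over from a previous run
$Output = foreach ($Server in $ServersToQuery) {
    Get-WinEvent -FilterHashtable $LogFilter -ComputerName $Server -Credential $cred |
        ForEach-Object {
            $entry = [xml]$_.ToXml()
            New-Object PSObject -Property @{
                TimeCreated = $_.TimeCreated
                User        = $entry.Event.UserData.EventXML.User
                IPAddress   = $entry.Event.UserData.EventXML.Address
                EventID     = $entry.Event.System.EventID
                ServerName  = $Server
            }
        }
}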

PowerShell pipe / SQL insert query data limits (and increasing them?)

Are there any limits to the size of a string you can assign to a variable in PowerShell, or any limits to the size of the text sent within a SQL INSERT query?
I have a big CSV file coming into PowerShell, and through string construction in a foreach loop I am generating SQL INSERT queries for each row. The resulting collection of INSERT queries comes to about 4 MB.
The SQL Server has a perfect schema to receive the data; however, when sending the 4 MB collection of INSERT queries (each separated by ;) I get an error that looks to me like the long 4 MB set of insert queries was truncated somehow. I guess I have hit some kind of limit.
Is there a way of getting around this (programmatically in PowerShell), or a way of increasing the size limit of an acceptable collection of SQL INSERT queries?
My code is using System.Data.SqlClient.SqlConnection and System.Data.SqlClient.SqlCommand.
Smaller datasets work OK, but the larger datasets give an error like the following example. Each dataset gives a different "Incorrect syntax near" indicator.
Exception calling "ExecuteNonQuery" with "0" argument(s): "Incorrect syntax
near '('."
At C:\Users\stuart\Desktop\git\ADStfL\WorkInProgress.ps1:211 char:3
+ $SQLCommand.executenonquery()
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (:) [], MethodInvocationException
+ FullyQualifiedErrorId : SqlException
In my experience, the best-performing way to do this is to load the CSV into a DataTable and then use SqlBulkCopy.
$ErrorActionPreference = 'Stop';
$Csv = Import-Csv -Path $FileName;
$SqlServer = 'MyServer';
$SqlDatabase = 'MyDatabase';
$DestinationTableName = 'MyTable';
# Create Connection String
$SqlConnectionString = 'Data Source={0};Initial Catalog={1};Integrated Security=SSPI' -f $SqlServer, $SqlDatabase;
# Define your DataTable. The column order of the DataTable must either match the table in the database, or
# you must specify the column mapping in SqlBulkCopy.ColumnMapping. If you have an IDENTITY column, it's a
# bit more complicated
$DataTable = New-Object -TypeName System.Data.DataTable -ArgumentList $DestinationTableName;
$NewColumn = $DataTable.Columns.Add('Id',[System.Int32]);
$NewColumn.AllowDBNull = $false;
$NewColumn = $DataTable.Columns.Add('IntegerField',[System.Int32]);
$NewColumn.AllowDBNull = $false;
$NewColumn = $DataTable.Columns.Add('DecimalField',[System.Decimal]);
$NewColumn.AllowDBNull = $false;
$NewColumn = $DataTable.Columns.Add('VarCharField',[System.String]);
$NewColumn.MaxLength = 50;
$NewColumn = $DataTable.Columns.Add('DateTimeField',[System.DateTime]);
$NewColumn.AllowDBNull = $false;
# Populate your datatable from the CSV file
# You may find that you need to type cast some of the fields.
$Csv | ForEach-Object {
$NewRow = $DataTable.NewRow();
$NewRow['Id'] = $_.Id;
$NewRow['IntegerField'] = $_.IntegerField;
$NewRow['DecimalField'] = $_.DecimalField;
$NewRow['VarCharField'] = $_.StringField1;
$NewRow['DateTimeField'] = $_.DateTimeField1;
$DataTable.Rows.Add($NewRow);
}
# Create Connection
$SqlConnection = New-Object -TypeName System.Data.SqlClient.SqlConnection -ArgumentList $SqlConnectionString;
# Open Connection
$SqlConnection.Open();
# Start Transaction
$SqlTransaction = $SqlConnection.BeginTransaction();
# Double check the possible options at https://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopyoptions(v=vs.110).aspx
# If you need multiple then -bor them together
$SqlBulkCopyOptions = [System.Data.SqlClient.SqlBulkCopyOptions]::CheckConstraints;
# Create SqlBulkCopy class
$SqlBulkCopy = New-Object -TypeName System.Data.SqlClient.SqlBulkCopy -ArgumentList $SqlConnection, $SqlBulkCopyOptions, $SqlTransaction;
# Specify destination table
$SqlBulkCopy.DestinationTableName = $DestinationTableName;
# Do the insert; rollback on error
try {
$SqlBulkCopy.WriteToServer($DataTable);
$SqlTransaction.Commit();
}
catch {
# Roll back transaction and rethrow error
$SqlTransaction.Rollback();
throw ($_);
}
finally {
$SqlConnection.Close();
$SqlConnection.Dispose();
}
The other method is to use a SqlCommand and do it row by row:
$ErrorActionPreference = 'Stop';
$Csv = Import-Csv -Path $FileName;
$SqlServer = 'MyServer';
$SqlDatabase = 'MyDatabase';
# Create Connection String
$SqlConnectionString = 'Data Source={0};Initial Catalog={1};Integrated Security=SSPI' -f $SqlServer, $SqlDatabase;
# Create Connection
$SqlConnection = New-Object -TypeName System.Data.SqlClient.SqlConnection -ArgumentList $SqlConnectionString;
# Create Command
$InsertCommandText = 'INSERT INTO DestinationTable (Id, IntegerField, DecimalField, StringField, DateTimeField) VALUES (@Id, @IntegerField, @DecimalField, @StringField, @DateTimeField)';
$InsertCommand = New-Object -TypeName System.Data.SqlClient.SqlCommand -ArgumentList $SqlConnection;
[void]$InsertCommand.Parameters.Add('@Id', [System.Data.SqlDbType]::Int);
[void]$InsertCommand.Parameters.Add('@IntegerField', [System.Data.SqlDbType]::Int);
[void]$InsertCommand.Parameters.Add('@DecimalField', [System.Data.SqlDbType]::Decimal);
[void]$InsertCommand.Parameters.Add('@StringField', [System.Data.SqlDbType]::VarChar, 50);
[void]$InsertCommand.Parameters.Add('@DateTimeField', [System.Data.SqlDbType]::DateTime);
# Open connection and start transaction
$SqlConnection.Open()
$SqlTransaction = $SqlConnection.BeginTransaction();
$InsertCommand.Transaction = $SqlTransaction;
$RowsInserted = 0;
try {
$line = 0;
$Csv | ForEach-Object {
$line++;
# Specify parameter values
$InsertCommand.Parameters['@Id'].Value = $_.Id;
$InsertCommand.Parameters['@IntegerField'].Value = $_.IntegerField;
$InsertCommand.Parameters['@DecimalField'].Value = $_.DecimalField;
$InsertCommand.Parameters['@StringField'].Value = $_.StringField;
$InsertCommand.Parameters['@DateTimeField'].Value = $_.DateTimeField;
$RowsInserted += $InsertCommand.ExecuteNonQuery();
# Clear parameter values
$InsertCommand.Parameters | ForEach-Object { $_.Value = $null };
}
$SqlTransaction.Commit();
Write-Output "Rows affected: $RowsInserted";
}
catch {
# Roll back transaction and rethrow error
$SqlTransaction.Rollback();
Write-Error "Error on line $line" -ErrorAction Continue;
throw ($_);
}
finally {
$SqlConnection.Close();
$SqlConnection.Dispose();
}
Edit: Oh, I forgot one important point. If you need to set the value of a field to null in the database, you need to set its value to [System.DBNull]::Value, not $null.
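For example (a small sketch using the parameter names from the row-by-row version above):
# An empty source value must become DBNull; leaving the parameter at $null does not insert a NULL
if ([string]::IsNullOrEmpty($_.DateTimeField)) {
    $InsertCommand.Parameters['@DateTimeField'].Value = [System.DBNull]::Value;
}
else {
    $InsertCommand.Parameters['@DateTimeField'].Value = $_.DateTimeField;
}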
