SQL Server Bulk insert of CSV file with inconsistent quotes

Is it possible to BULK INSERT (SQL Server) a CSV file in which the fields are only occasionally surrounded by quotes? Specifically, quotes only surround those fields that contain a ",".
In other words, I have data that looks like this (the first row contains headers):
id, company, rep, employees
729216,INGRAM MICRO INC.,"Stuart, Becky",523
729235,"GREAT PLAINS ENERGY, INC.","Nelson, Beena",114
721177,GEORGE WESTON BAKERIES INC,"Hogan, Meg",253
Because the quotes aren't consistent, I can't use '","' as a delimiter, and I don't know how to create a format file that accounts for this.
I tried using ',' as a delimiter and loading it into a temporary table where every column is a varchar, then using some kludgy processing to strip out the quotes, but that doesn't work either, because the fields that contain ',' are split into multiple columns.
Unfortunately, I don't have the ability to manipulate the CSV file beforehand.
Is this hopeless?
Many thanks in advance for any advice.
By the way, I saw this post: SQL bulk import from csv, but in that case, EVERY field was consistently wrapped in quotes. So, in that case, he could use ',' as a delimiter, then strip out the quotes afterwards.

It isn't possible to do a bulk insert of this file as-is. From MSDN:
To be usable as a data file for bulk import, a CSV file must comply with the following restrictions:
Data fields never contain the field terminator.
Either none or all of the values in a data field are enclosed in quotation marks ("").
(http://msdn.microsoft.com/en-us/library/ms188609.aspx)
Some simple text processing should be all that's required to get the file ready for import. Alternatively, your users could be required to either format the file according to these guidelines or use something other than a comma as a delimiter (e.g. |).
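If the file can be supplied pipe-delimited instead, the import itself is then straightforward. A minimal sketch (table name and path are placeholders):
BULK INSERT MyTable
FROM 'C:\data\pipe_delimited.txt'
WITH
(
FIELDTERMINATOR = '|',
ROWTERMINATOR = '\n'
)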

You are going to need to preprocess the file, period.
If you really really need to do this, here is the code. I wrote this because I absolutely had no choice. It is utility code and I'm not proud of it, but it works. The approach is not to get SQL to understand quoted fields, but instead manipulate the file to use an entirely different delimiter.
EDIT: Here is the code in a github repo. It's been improved and now comes with unit tests! https://github.com/chrisclark/Redelim-it
This function takes an input file and will replace all field-delimiting commas (NOT commas inside quoted-text fields, just the actual delimiting ones) with a new delimiter. You can then tell SQL Server to use the new field delimiter instead of a comma. In the version of the function here, the placeholder is <*TMP*> (I feel confident this will not appear in the original csv - if it does, brace for explosions).
Therefore, after running this function, you import into SQL by doing something like:
BULK INSERT MyTable
FROM 'C:\FileCreatedFromThisFunction.csv'
WITH
(
FIELDTERMINATOR = '<*TMP*>',
ROWTERMINATOR = '\n'
)
And without further ado, the terrible, awful function that I apologize in advance for inflicting on you (edit - I've posted a working program that does this instead of just the function on my blog here):
Private Function CsvToOtherDelimiter(ByVal InputFile As String, ByVal OutputFile As String) As Integer
Dim PH1 As String = "<*TMP*>"
Dim objReader As StreamReader = Nothing
Dim count As Integer = 0 'This will also serve as a primary key'
Dim sb As New System.Text.StringBuilder
Try
objReader = New StreamReader(File.OpenRead(InputFile), System.Text.Encoding.Default)
Catch ex As Exception
UpdateStatus(ex.Message)
End Try
If objReader Is Nothing Then
UpdateStatus("Invalid file: " & InputFile)
count = -1
Return count
End If
'grab the first line
Dim line = objReader.ReadLine()
'and advance to the next line b/c the first line is column headings
If hasHeaders Then 'hasHeaders: a Boolean set elsewhere indicating a header row
line = Trim(objReader.ReadLine)
End If
While Not String.IsNullOrEmpty(line) 'loop through each line
count += 1
'Replace commas with our custom-made delimiter
line = line.Replace(",", ph1)
'Find a quoted part of the line, which could legitimately contain commas.
'In that case we will need to identify the quoted section and swap commas back in for our custom placeholder.
Dim starti = line.IndexOf(ph1 & """", 0)
If line.IndexOf("""",0) = 0 then starti=0
While starti > -1 'loop through quoted fields
Dim FieldTerminatorFound As Boolean = False
'Find end quote token (originally a ",)
Dim endi As Integer = line.IndexOf("""" & ph1, starti)
If endi < 0 Then
FieldTerminatorFound = True
If endi < 0 Then endi = line.Length - 1
End If
While Not FieldTerminatorFound
'Find any more quotes that are part of that sequence, if any
Dim backChar As String = """" 'thats one quote
Dim quoteCount = 0
While backChar = """"
quoteCount += 1
backChar = line.Chars(endi - quoteCount)
End While
If quoteCount Mod 2 = 1 Then 'odd number of quotes. real field terminator
FieldTerminatorFound = True
Else 'keep looking
endi = line.IndexOf("""" & ph1, endi + 1)
End If
End While
'Grab the quoted field from the line, now that we have the start and ending indices
Dim source = line.Substring(starti + ph1.Length, endi - starti - ph1.Length + 1)
'And swap the commas back in
line = line.Replace(source, source.Replace(ph1, ","))
'Find the next quoted field
' If endi >= line.Length - 1 Then endi = line.Length 'During the swap, the length of line shrinks so an endi value at the end of the line will fail
starti = line.IndexOf(ph1 & """", starti + ph1.Length)
End While
sb.AppendLine(line)
line = objReader.ReadLine
End While
objReader.Close()
SaveTextToFile(sb.ToString, OutputFile)
Return count
End Function

I found the answer by Chris very helpful, but I wanted to run it from within SQL Server using T-SQL (and not using CLR), so I converted his code to T-SQL code. But then I took it one step further by wrapping everything up in a stored procedure that did the following:
use bulk insert to initially import the CSV file
clean up the lines using Chris's code
return the results in a table format
For my needs, I further cleaned up the lines by removing quotes around values and converting doubled double quotes ("") back to a single double quote (the standard CSV escape).
CREATE PROCEDURE SSP_CSVToTable
-- Add the parameters for the stored procedure here
@InputFile nvarchar(4000)
, @FirstLine int
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
--convert the CSV file to a table
--clean up the lines so that commas are handled correctly
DECLARE @sql nvarchar(4000)
DECLARE @PH1 nvarchar(50)
DECLARE @LINECOUNT int -- This will also serve as a primary key
DECLARE @CURLINE int
DECLARE @Line nvarchar(4000)
DECLARE @starti int
DECLARE @endi int
DECLARE @FieldTerminatorFound bit
DECLARE @backChar nvarchar(4000)
DECLARE @quoteCount int
DECLARE @source nvarchar(4000)
DECLARE @COLCOUNT int
DECLARE @CURCOL int
DECLARE @ColVal nvarchar(4000)
-- new delimiter
SET @PH1 = '†'
-- create single column table to hold each line of file
CREATE TABLE [#CSVLine]([line] nvarchar(4000))
-- bulk insert into temp table
-- cannot use variable path with bulk insert
-- so we must run using dynamic sql
SET @Sql = 'BULK INSERT #CSVLine
FROM ''' + @InputFile + '''
WITH
(
FIRSTROW=' + CAST(@FirstLine as varchar) + ',
FIELDTERMINATOR = ''\n'',
ROWTERMINATOR = ''\n''
)'
-- run dynamic statement to populate temp table
EXEC(@sql)
-- get number of lines in table
SET @LINECOUNT = @@ROWCOUNT
-- add identity column to table so that we can loop through it
ALTER TABLE [#CSVLine] ADD [RowId] [int] IDENTITY(1,1) NOT NULL
IF @LINECOUNT > 0
BEGIN
-- cycle through each line, cleaning each line
SET @CURLINE = 1
WHILE @CURLINE <= @LINECOUNT
BEGIN
-- get current line
SELECT @line = line
FROM #CSVLine
WHERE [RowId] = @CURLINE
-- Replace commas with our custom-made delimiter
SET @Line = REPLACE(@Line, ',', @PH1)
-- Find a quoted part of the line, which could legitimately contain commas.
-- In that case we will need to identify the quoted section and swap commas back in for our custom placeholder.
SET @starti = CHARINDEX(@PH1 + '"' ,@Line, 0)
If CHARINDEX('"', @Line, 0) = 0 SET @starti = 0
-- loop through quoted fields
WHILE @starti > 0
BEGIN
SET @FieldTerminatorFound = 0
-- Find end quote token (originally a ",)
SET @endi = CHARINDEX('"' + @PH1, @Line, @starti) -- sLine.IndexOf("""" & PH1, starti)
IF @endi < 1
BEGIN
SET @FieldTerminatorFound = 1
If @endi < 1 SET @endi = LEN(@Line) - 1
END
WHILE @FieldTerminatorFound = 0
BEGIN
-- Find any more quotes that are part of that sequence, if any
SET @backChar = '"' -- that's one quote
SET @quoteCount = 0
WHILE @backChar = '"'
BEGIN
SET @quoteCount = @quoteCount + 1
SET @backChar = SUBSTRING(@Line, @endi-@quoteCount, 1) -- sLine.Chars(endi - quoteCount)
END
IF (@quoteCount % 2) = 1
BEGIN
-- odd number of quotes. real field terminator
SET @FieldTerminatorFound = 1
END
ELSE
BEGIN
-- keep looking
SET @endi = CHARINDEX('"' + @PH1, @Line, @endi + 1) -- sLine.IndexOf("""" & PH1, endi + 1)
END
END
-- Grab the quoted field from the line, now that we have the start and ending indices
SET @source = SUBSTRING(@Line, @starti + LEN(@PH1), @endi - @starti - LEN(@PH1) + 1)
-- sLine.Substring(starti + PH1.Length, endi - starti - PH1.Length + 1)
-- And swap the commas back in
SET @Line = REPLACE(@Line, @source, REPLACE(@source, @PH1, ','))
--sLine.Replace(source, source.Replace(PH1, ","))
-- Find the next quoted field
-- If endi >= line.Length - 1 Then endi = line.Length 'During the swap, the length of line shrinks so an endi value at the end of the line will fail
SET @starti = CHARINDEX(@PH1 + '"', @Line, @starti + LEN(@PH1))
--sLine.IndexOf(PH1 & """", starti + PH1.Length)
END
-- get table based on current line
IF OBJECT_ID('tempdb..#Line') IS NOT NULL
DROP TABLE #Line
-- converts a delimited list into a table
SELECT *
INTO #Line
FROM dbo.iter_charlist_to_table(@Line,@PH1)
-- get number of columns in line
SET @COLCOUNT = @@ROWCOUNT
-- dynamically create CSV temp table to hold CSV columns and lines
-- only need to create once
IF OBJECT_ID('tempdb..#CSV') IS NULL
BEGIN
-- create initial structure of CSV table
CREATE TABLE [#CSV]([Col1] nvarchar(100))
-- dynamically add a column for each column found in the first line
SET @CURCOL = 1
WHILE @CURCOL <= @COLCOUNT
BEGIN
-- first column already exists, don't need to add
IF @CURCOL > 1
BEGIN
-- add field
SET @sql = 'ALTER TABLE [#CSV] ADD [Col' + Cast(@CURCOL as varchar) + '] nvarchar(100)'
--print @sql
-- this adds the fields to the temp table
EXEC(@sql)
END
-- go to next column
SET @CURCOL = @CURCOL + 1
END
END
-- build dynamic sql to insert current line into CSV table
SET @sql = 'INSERT INTO [#CSV] VALUES('
-- loop through line table, dynamically adding each column value
SET @CURCOL = 1
WHILE @CURCOL <= @COLCOUNT
BEGIN
-- get current column
Select @ColVal = str
From #Line
Where listpos = @CURCOL
IF LEN(@ColVal) > 0
BEGIN
-- remove quotes from beginning if exist
IF LEFT(@ColVal,1) = '"'
SET @ColVal = RIGHT(@ColVal, LEN(@ColVal) - 1)
-- remove quotes from end if exist
IF RIGHT(@ColVal,1) = '"'
SET @ColVal = LEFT(@ColVal, LEN(@ColVal) - 1)
END
-- write column value
-- make value sql safe by replacing single quotes with two single quotes
-- also, replace two double quotes with a single double quote
SET @sql = @sql + '''' + REPLACE(REPLACE(@ColVal, '''',''''''), '""', '"') + ''''
-- add comma separator except for the last record
IF @CURCOL <> @COLCOUNT
SET @sql = @sql + ','
-- go to next column
SET @CURCOL = @CURCOL + 1
END
-- close sql statement
SET @sql = @sql + ')'
--print @sql
-- run sql to add line to table
EXEC(@sql)
-- move to next line
SET @CURLINE = @CURLINE + 1
END
END
-- return CSV table
SELECT * FROM [#CSV]
END
GO
The stored procedure makes use of this helper function that parses a string into a table (thanks Erland Sommarskog!):
CREATE FUNCTION [dbo].[iter_charlist_to_table]
(@list ntext,
@delimiter nchar(1) = N',')
RETURNS @tbl TABLE (listpos int IDENTITY(1, 1) NOT NULL,
str varchar(4000),
nstr nvarchar(2000)) AS
BEGIN
DECLARE @pos int,
@textpos int,
@chunklen smallint,
@tmpstr nvarchar(4000),
@leftover nvarchar(4000),
@tmpval nvarchar(4000)
SET @textpos = 1
SET @leftover = ''
WHILE @textpos <= datalength(@list) / 2
BEGIN
SET @chunklen = 4000 - datalength(@leftover) / 2
SET @tmpstr = @leftover + substring(@list, @textpos, @chunklen)
SET @textpos = @textpos + @chunklen
SET @pos = charindex(@delimiter, @tmpstr)
WHILE @pos > 0
BEGIN
SET @tmpval = ltrim(rtrim(left(@tmpstr, @pos - 1)))
INSERT @tbl (str, nstr) VALUES(@tmpval, @tmpval)
SET @tmpstr = substring(@tmpstr, @pos + 1, len(@tmpstr))
SET @pos = charindex(@delimiter, @tmpstr)
END
SET @leftover = @tmpstr
END
INSERT @tbl(str, nstr) VALUES (ltrim(rtrim(@leftover)), ltrim(rtrim(@leftover)))
RETURN
END
Here's how I call it from T-SQL. In this case, I'm inserting the results into a temp table, so I create the temp table first:
-- create temp table for file import
CREATE TABLE #temp
(
CustomerCode nvarchar(100) NULL,
Name nvarchar(100) NULL,
[Address] nvarchar(100) NULL,
City nvarchar(100) NULL,
[State] nvarchar(100) NULL,
Zip nvarchar(100) NULL,
OrderNumber nvarchar(100) NULL,
TimeWindow nvarchar(100) NULL,
OrderType nvarchar(100) NULL,
Duration nvarchar(100) NULL,
[Weight] nvarchar(100) NULL,
Volume nvarchar(100) NULL
)
-- convert the CSV file into a table
INSERT #temp
EXEC [dbo].[SSP_CSVToTable]
@InputFile = @FileLocation
,@FirstLine = @FirstImportRow
I haven't tested the performance much, but it works well for what I need - importing CSV files with less than 1000 rows. However, it might choke on really large files.
Hopefully someone else also finds it useful.
Cheers!

I have also created a function to convert a CSV to a usable format for Bulk Insert. I used Chris Clark's answer as a starting point to create the following C# function.
I ended up using a regular expression to find the fields. I then recreated the file line by line, writing it to a new file as I went, thus avoiding having the entire file loaded into memory.
private void CsvToOtherDelimiter(string CSVFile, System.Data.Linq.Mapping.MetaTable tbl)
{
char PH1 = '|';
StringBuilder ln;
//Confirm file exists. Else, throw exception
if (File.Exists(CSVFile))
{
using (TextReader tr = new StreamReader(CSVFile))
{
//Use a temp file to store our conversion
using (TextWriter tw = new StreamWriter(CSVFile + ".tmp"))
{
string line = tr.ReadLine();
//If we have already converted, no need to reconvert.
//NOTE: We make the assumption here that the input header file
// doesn't have a PH1 value unless it's already been converted.
if (line.IndexOf(PH1) >= 0)
{
tw.Close();
tr.Close();
File.Delete(CSVFile + ".tmp");
return;
}
//Loop through input file
while (!string.IsNullOrEmpty(line))
{
ln = new StringBuilder();
//1. Use Regex expression to find comma separated values
//using quotes as optional text qualifiers
//(what MS EXCEL does when you import a csv file)
//2. Remove text qualifier quotes from data
//3. Replace any values of PH1 found in column data
//with an equivalent character
//Regex: \A[^,]*(?=,)|(?:[^",]*"[^"]*"[^",]*)+|[^",]*"[^"]*\Z|(?<=,)[^,]*(?=,)|(?<=,)[^,]*\Z|\A[^,]*\Z
List<string> fieldList = Regex.Matches(line, @"\A[^,]*(?=,)|(?:[^"",]*""[^""]*""[^"",]*)+|[^"",]*""[^""]*\Z|(?<=,)[^,]*(?=,)|(?<=,)[^,]*\Z|\A[^,]*\Z")
.Cast<Match>()
.Select(m => RemoveCSVQuotes(m.Value).Replace(PH1, '¦'))
.ToList<string>();
//Add the list of fields to ln, separated by PH1
fieldList.ToList().ForEach(m => ln.Append(m + PH1));
//Write to file. Don't include trailing PH1 value.
tw.WriteLine(ln.ToString().Substring(0, ln.ToString().LastIndexOf(PH1)));
line = tr.ReadLine();
}
tw.Close();
}
tr.Close();
//Optional: replace input file with output file
File.Delete(CSVFile);
File.Move(CSVFile + ".tmp", CSVFile);
}
}
else
{
throw new ArgumentException(string.Format("Source file {0} not found", CSVFile));
}
}
//The output file no longer needs quotes as a text qualifier, so remove them
private string RemoveCSVQuotes(string value)
{
//an empty quoted field ("") becomes an empty string
if (value == @"""""") value = "";
//collapse doubled quotes to single, then strip any quotes on the ends
value = value.Replace(@"""""", @"""");
if (value.Length >= 2)
if (value.Substring(0, 1) == @"""")
value = value.Substring(1, value.Length - 2);
return value;
}

More often than not, this issue is caused by users exporting an Excel file to CSV.
There are two ways around this problem:
Export from Excel using a macro, as per Microsoft's suggestion
Or the really easy way:
Open the CSV in Excel.
Save as Excel file. (.xls or .xlsx).
Import that file into SQL Server as an Excel file.
Chuckle to yourself because you didn't have to code anything like the solutions above.... muhahahaha
Here's some SQL if you really want to script it (after saving the CSV as Excel):
select *
into SQLServerTable FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0',
'Excel 8.0;Database=D:\testing.xls;HDR=YES',
'SELECT * FROM [Sheet1$]')

This might be more complicated or involved than what you're willing to use, but ...
If you can implement the logic for parsing the lines into fields in VB or C#, you can do this using a CLR table valued function (TVF).
A CLR TVF can be a well-performing way to read data in from an external source when you want to have some C# or VB code separate the data into columns and/or adjust the values.
You have to be willing to add a CLR assembly to your database (and one that allows external or unsafe operations so it can open files). This can get a bit complicated or involved, but might be worth it for the flexibility you get.
I had some large files that needed to be regularly loaded to tables as fast as possible, but certain code translations needed to be performed on some columns and special handling was needed to load values that would have otherwise caused datatype errors with a plain bulk insert.
In short, a CLR TVF lets you run C# or VB code against each line of the file with bulk-insert-like performance (although you may need to worry about logging). The example in the SQL Server documentation lets you create a TVF to read from the event log that you could use as a starting point.
Note that the code in the CLR TVF can only access the database in an init stage before the first row is processed (e.g. no lookups for each row - you use a normal TVF on top of this to do such things). You don't appear to need this based on your question.
Also note, each CLR TVF must have its output columns explicitly specified, so you can't write a generic one that is reusable for each different csv file you might have.
You could write one CLR TVF to read whole lines from the file, returning a one column result set, then use normal TVFs to read from that for each type of file. This requires the code to parse each line to be written in T-SQL, but avoids having to write many CLR TVFs.
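A sketch of that layering, with both function names being hypothetical placeholders (dbo.clr_ReadFileLines standing in for the one-column CLR TVF, and dbo.ParseOrdersLine for a plain T-SQL TVF that parses one file type):
-- hypothetical names: clr_ReadFileLines (CLR TVF returning raw lines),
-- ParseOrdersLine (T-SQL TVF splitting one line into typed columns)
SELECT parsed.*
FROM dbo.clr_ReadFileLines('C:\data\orders.csv') AS raw
CROSS APPLY dbo.ParseOrdersLine(raw.line) AS parsed;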

An alternate method, assuming you don't have a load of fields or expect a quote to appear in the data itself, would be to use the REPLACE function.
UPDATE dbo.tablename
SET dbo.tablename.target_field = REPLACE(t.importedValue, '"', '')
FROM #tempTable t
WHERE dbo.tablename.target_id = t.importedID;
I have used it. I can't make any claims regarding performance. It is just a quick and dirty way to get around the problem.

Preprocessing is needed.
The PowerShell function Import-CSV supports this type of file. Export-CSV will then encapsulate each value in quotes.
Single file:
Import-Csv import.csv | Export-Csv -NoTypeInformation export.csv
To merge many files with paths C:\year\input_date.csv:
$inputPath = 'C:\????\input_????????.csv'
$outputPath = 'C:\merged.csv'
Get-ChildItem $inputPath |
Select -ExpandProperty FullName |
Import-CSV |
Export-CSV -NoTypeInformation -Path $outputPath
PowerShell can typically be run with SQL Server Agent using a PowerShell proxy account.
In case delimiters are not handled properly, explicitly specify another delimiter.
Export-CSV -NoTypeInformation -Delimiter ';' -Path $outputPath

You should be able to specify not only the field separator, which should be [,], but also the text qualifier, which in this case would be ["]. (Using [] to enclose those so there's no confusion with ".)

I found a few issues when fields contained a ',' inside them, like Mike,"456 2nd St, Apt 5".
A solution to this issue is at http://crazzycoding.blogspot.com/2010/11/import-csv-file-into-sql-server-using.html
Thanks,
- Ashish

Chris,
Thanks a bunch for this!! You saved my biscuits!! I could not believe that the bulk loader wouldn't handle this case when Excel does such a nice job... don't these guys see each other in the halls???
Anyway... I needed a console application version, so here is what I hacked together. It's down and dirty but it works like a champ! I hardcoded the delimiter and commented out the header handling as they were not needed for my app.
I wish I could also paste a nice big beer in here for ya too.
Module Module1
Sub Main()
Dim arrArgs() As String = Command.Split(",")
Dim i As Integer
Dim obj As New ReDelimIt()
Console.Write(vbNewLine & vbNewLine)
If arrArgs(0) <> Nothing Then
For i = LBound(arrArgs) To UBound(arrArgs)
Console.Write("Parameter " & i & " is " & arrArgs(i) & vbNewLine)
Next
obj.ProcessFile(arrArgs(0), arrArgs(1))
Else
Console.Write("Usage Test1 <inputfile>,<outputfile>")
End If
Console.Write(vbNewLine & vbNewLine)
End Sub
End Module
Public Class ReDelimIt
Public Function ProcessFile(ByVal InputFile As String, ByVal OutputFile As String) As Integer
Dim ph1 As String = "|"
Dim objReader As System.IO.StreamReader = Nothing
Dim count As Integer = 0 'This will also serve as a primary key
Dim sb As New System.Text.StringBuilder
Try
objReader = New System.IO.StreamReader(System.IO.File.OpenRead(InputFile), System.Text.Encoding.Default)
Catch ex As Exception
MsgBox(ex.Message)
End Try
If objReader Is Nothing Then
MsgBox("Invalid file: " & InputFile)
count = -1
Return count
End If
'grab the first line
Dim line = objReader.ReadLine()
'and advance to the next line b/c the first line is column headings
'Removed Check Headers can put in if needed.
'If chkHeaders.Checked Then
'line = objReader.ReadLine
'End If
While Not String.IsNullOrEmpty(line) 'loop through each line
count += 1
'Replace commas with our custom-made delimiter
line = line.Replace(",", ph1)
'Find a quoted part of the line, which could legitimately contain commas.
'In that case we will need to identify the quoted section and swap commas back in for our custom placeholder.
Dim starti = line.IndexOf(ph1 & """", 0)
While starti > -1 'loop through quoted fields
'Find end quote token (originally a ",)
Dim endi = line.IndexOf("""" & ph1, starti)
'The end quote token could be a false positive because there could occur a ", sequence.
'It would be double-quoted ("",) so check for that here
Dim check1 = line.IndexOf("""""" & ph1, starti)
'A """, sequence can occur if a quoted field ends in a quote.
'In this case, the above check matches, but we actually SHOULD process this as an end quote token
Dim check2 = line.IndexOf("""""""" & ph1, starti)
'If we are in the check1 ("",) situation, keep searching for an end quote token
'The +1 and +2 accounts for the extra length of the checked sequences
While (endi = check1 + 1 AndAlso endi <> check2 + 2) 'loop through "false" tokens in the quoted fields
endi = line.IndexOf("""" & ph1, endi + 1)
check1 = line.IndexOf("""""" & ph1, check1 + 1)
check2 = line.IndexOf("""""""" & ph1, check2 + 1)
End While
'We have searched for an end token (",) but can't find one, so that means the line ends in a "
If endi < 0 Then endi = line.Length - 1
'Grab the quoted field from the line, now that we have the start and ending indices
Dim source = line.Substring(starti + ph1.Length, endi - starti - ph1.Length + 1)
'And swap the commas back in
line = line.Replace(source, source.Replace(ph1, ","))
'Find the next quoted field
If endi >= line.Length - 1 Then endi = line.Length 'During the swap, the length of line shrinks so an endi value at the end of the line will fail
starti = line.IndexOf(ph1 & """", starti + ph1.Length)
End While
'Add our primary key to the line
' Removed for now
'If chkAddKey.Checked Then
'line = String.Concat(count.ToString, ph1, line)
' End If
sb.AppendLine(line)
line = objReader.ReadLine
End While
objReader.Close()
SaveTextToFile(sb.ToString, OutputFile)
Return count
End Function
Public Function SaveTextToFile(ByVal strData As String, ByVal FullPath As String) As Boolean
Dim bAns As Boolean = False
Dim objReader As System.IO.StreamWriter
Try
objReader = New System.IO.StreamWriter(FullPath, False, System.Text.Encoding.Default)
objReader.Write(strData)
objReader.Close()
bAns = True
Catch Ex As Exception
Throw Ex
End Try
Return bAns
End Function
End Class

This code works for me:
public bool CSVFileRead(string fullPathWithFileName, string fileNameModified, string tableName)
{
SqlConnection con = new SqlConnection(ConfigurationSettings.AppSettings["dbConnectionString"]);
string filepath = fullPathWithFileName;
StreamReader sr = new StreamReader(filepath);
string line = sr.ReadLine();
string[] value = line.Split(',');
DataTable dt = new DataTable();
DataRow row;
foreach (string dc in value)
{
dt.Columns.Add(new DataColumn(dc));
}
while (!sr.EndOfStream)
{
//string[] stud = sr.ReadLine().Split(',');
//for (int i = 0; i < stud.Length; i++)
//{
// stud[i] = stud[i].Replace("\"", "");
//}
//value = stud;
value = sr.ReadLine().Split(',');
if (value.Length == dt.Columns.Count)
{
row = dt.NewRow();
row.ItemArray = value;
dt.Rows.Add(row);
}
}
SqlBulkCopy bc = new SqlBulkCopy(con.ConnectionString, SqlBulkCopyOptions.TableLock);
bc.DestinationTableName = tableName;
bc.BatchSize = dt.Rows.Count;
con.Open();
bc.WriteToServer(dt);
bc.Close();
con.Close();
return true;
}

I put together the below to solve my case. I needed to pre-process very large files and sort out the inconsistent quoting. Just paste it into a blank C# application, set the consts to your requirements, and away you go. This worked on very large CSVs of over 10 GB.
namespace CsvFixer
{
using System.IO;
using System.Text;
public class Program
{
private const string delimiter = ",";
private const string quote = "\"";
private const string inputFile = "C:\\temp\\input.csv";
private const string fixedFile = "C:\\temp\\fixed.csv";
/// <summary>
/// This application fixes inconsistently quoted csv (or delimited) files with support for very large file sizes.
/// For example : 1223,5235234,8674,"Houston","London, UK",3425,Other text,stuff
/// Must become : "1223","5235234","8674","Houston","London, UK","3425","Other text","stuff"
/// </summary>
/// <param name="args"></param>
static void Main(string[] args)
{
// Use streaming to allow for large files.
using (StreamWriter outfile = new StreamWriter(fixedFile))
{
using (FileStream fs = File.Open(inputFile, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (BufferedStream bs = new BufferedStream(fs))
using (StreamReader sr = new StreamReader(bs))
{
string currentLine;
// Read each input line in and write each fixed line out
while ((currentLine = sr.ReadLine()) != null)
{
outfile.WriteLine(FixLine(currentLine, delimiter, quote));
}
}
}
}
/// <summary>
/// Fully quote a partially quoted line
/// </summary>
/// <param name="line">The partially quoted line</param>
/// <returns>The fully quoted line</returns>
private static string FixLine(string line, string delimiter, string quote)
{
StringBuilder fixedLine = new StringBuilder();
// Split all on the delimiter, accepting that some quoted fields
// that contain the delimiter will be split into many pieces.
string[] fieldParts = line.Split(delimiter.ToCharArray());
// Loop through the fields (or parts of fields)
for (int i = 0; i < fieldParts.Length; i++)
{
string currentFieldPart = fieldParts[i];
// If the current field part starts and ends with a quote it is a field, so write it to the result
if (currentFieldPart.StartsWith(quote) && currentFieldPart.EndsWith(quote))
{
fixedLine.Append(string.Format("{0}{1}", currentFieldPart, delimiter));
}
// else if it starts with a quote but doesn't end with one, it is part of a longer field.
else if (currentFieldPart.StartsWith(quote))
{
// Add the start of the field
fixedLine.Append(string.Format("{0}{1}", currentFieldPart, delimiter));
// Append any additional field parts (we will only hit the end of the field
// when the last field part finishes with a quote).
while (!fieldParts[++i].EndsWith(quote))
{
fixedLine.Append(string.Format("{0}{1}", fieldParts[i], delimiter));
}
// Append the last field part - i.e. the part containing the closing quote
fixedLine.Append(string.Format("{0}{1}", fieldParts[i], delimiter));
}
else
{
// The field has no quotes, so add the field part with quotes as bookends
fixedLine.Append(string.Format("{0}{1}{0}{2}", quote, currentFieldPart, delimiter));
}
}
// Return the fixed string
return fixedLine.ToString();
}
}
}

Speaking from practice... In SQL Server 2017 you can provide a 'Text qualifier' of double quote, and it doesn't "supersede" your delimiter. I bulk insert several files that look just like the example from the OP. My files are ".csv" and they have inconsistent text qualifiers that are only found when the value contains a comma. I have no idea what version of SQL Server this feature/functionality started working in, but I know it works in SQL Server 2017 Standard. Pretty easy.

You do not need to preprocess the file outside of SQL.
What worked for me was changing
ROWTERMINATOR = '\n'
to
ROWTERMINATOR = '0x0a'.
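In context, that change looks like this (a sketch; the table and path are placeholders):
BULK INSERT MyTable
FROM 'C:\data\import.csv'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '0x0a' -- 0x0a is the hex code for the LF character
)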

A new option was added in SQL 2017 to specify WITH ( FORMAT='CSV') for BULK INSERT commands.
An example from a Microsoft GitHub page:
BULK INSERT Product
FROM 'product.csv'
WITH ( DATA_SOURCE = 'MyAzureBlobStorage',
FORMAT='CSV', CODEPAGE = 65001, --UTF-8 encoding
FIRSTROW=2,
ROWTERMINATOR = '0x0a',
TABLOCK);
Detailed documentation for that option is available here:
https://learn.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql?view=sql-server-2017#input-file-format-options
I have successfully used this option with CSV data containing optional quotes just as the OP gave an example of.
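The same documentation page also describes a FIELDQUOTE option that sets the quote character used with FORMAT='CSV'; it defaults to the double quote, so a sketch with it spelled out explicitly (table and path are placeholders) looks like:
BULK INSERT MyTable
FROM 'C:\data\input.csv'
WITH
(
FORMAT = 'CSV',
FIELDQUOTE = '"', -- the default; shown for clarity
FIRSTROW = 2,
ROWTERMINATOR = '0x0a'
)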

Create a VB.NET program to convert to a new delimiter, using the .NET 4.5 Framework TextFieldParser.
This will automatically handle text-qualified fields.
I modified the code above to use the built-in TextFieldParser.
Module Module1
Sub Main()
Dim arrArgs() As String = Command.Split(",")
Dim i As Integer
Dim obj As New ReDelimIt()
Dim InputFile As String = ""
Dim OutPutFile As String = ""
Dim NewDelimiter As String = ""
Console.Write(vbNewLine & vbNewLine)
If Not IsNothing(arrArgs(0)) Then
For i = LBound(arrArgs) To UBound(arrArgs)
Console.Write("Parameter " & i & " is " & arrArgs(i) & vbNewLine)
Next
InputFile = arrArgs(0)
If Not IsNothing(arrArgs(1)) Then
If Not String.IsNullOrEmpty(arrArgs(1)) Then
OutPutFile = arrArgs(1)
Else
OutPutFile = InputFile.Replace("csv", "pipe")
End If
Else
OutPutFile = InputFile.Replace("csv", "pipe")
End If
If Not IsNothing(arrArgs(2)) Then
If Not String.IsNullOrEmpty(arrArgs(2)) Then
NewDelimiter = arrArgs(2)
Else
NewDelimiter = "|"
End If
Else
NewDelimiter = "|"
End If
obj.ConvertCSVFile(InputFile,OutPutFile,NewDelimiter)
Else
Console.Write("Usage ChangeFileDelimiter <inputfile>,<outputfile>,<NewDelimiter>")
End If
obj = Nothing
Console.Write(vbNewLine & vbNewLine)
'Console.ReadLine()
End Sub
End Module
Public Class ReDelimIt
Public Function ConvertCSVFile(ByVal InputFile As String, ByVal OutputFile As String, Optional ByVal NewDelimiter As String = "|") As Integer
Using MyReader As New Microsoft.VisualBasic.FileIO.TextFieldParser(InputFile)
MyReader.TextFieldType = FileIO.FieldType.Delimited
MyReader.SetDelimiters(",")
Dim sb As New System.Text.StringBuilder
Dim strLine As String = ""
Dim currentRow As String()
While Not MyReader.EndOfData
Try
currentRow = MyReader.ReadFields()
Dim currentField As String
strLine = ""
For Each currentField In currentRow
'MsgBox(currentField)
If strLine = "" Then
strLine = strLine & currentField
Else
strLine = strLine & NewDelimiter & currentField
End If
Next
sb.AppendLine(strLine)
Catch ex As Microsoft.VisualBasic.FileIO.MalformedLineException
'MsgBox("Line " & ex.Message & "is not valid and will be skipped.")
Console.WriteLine("Line " & ex.Message & "is not valid and will be skipped.")
End Try
End While
SaveTextToFile(sb.ToString, OutputFile)
End Using
Return Err.Number
End Function
Public Function SaveTextToFile(ByVal strData As String, ByVal FullPath As String) As Boolean
Dim bAns As Boolean = False
Dim objReader As System.IO.StreamWriter
Try
If FileIO.FileSystem.FileExists(FullPath) Then
Kill(FullPath)
End If
objReader = New System.IO.StreamWriter(FullPath, False, System.Text.Encoding.Default)
objReader.Write(strData)
objReader.Close()
bAns = True
Catch Ex As Exception
Throw Ex
End Try
Return bAns
End Function
End Class

Related

Visual Basic when inserting data in SQL Server

I put a SQL statement into a button in Visual Basic to insert data into the DB, and when I click it, this error happens:
Conversion from string "Insert into TBL_Usuario_102 valu" to type 'Double' is not valid.
This is the code that's in the button:
Private Sub Guardar_Click(sender As Object, e As EventArgs) Handles Guardar.Click
If NombreDePersona.Text <> "" And Cedula.Text <> "" And RepetirContraseña.Text <> "" And Contraseña.Text <> "" Then
If (RepetirContraseña.Text = Contraseña.Text) Then
instruccionSQL = New SqlClient.SqlCommand("Insert into TBL_Usuario_102 values" +
"(" + Cedula.Text + "," +
NombreDePersona.Text + "," + 3 +
"," + Contraseña.Text + "," +
FechaInclusion.Text + "," + 0 +
"," + FechaInclusion.Text + "," + 3 + ")")
MsgBox("Datos Guardados Correctamente")
Cedula.Clear()
NombreDePersona.Clear()
Contraseña.Clear()
RepetirContraseña.Clear()
Else
MsgBox("Las contraseñas no coinciden")
End If
Else
MsgBox("Escriba en Cada Campo")
End If
End Sub
The SQL connection is in a module and it is working, because when I insert the data manually in SQL Server the login works fine.
The type of data in the table of the database is in this order
varchar(15)
varchar(20)
int
varchar(50)
datetime
bit
datetime
int
Creating a SQL string like this is dangerous, as it can lead to SQL injection attacks. Usually it is recommended to use command parameters; however, you can also escape single quotes in strings by doubling them. This should make such an attack impossible. Command parameters also have the advantage that you don't have to care about the formatting of strings (and escaping them), numbers, Booleans and dates. E.g. see: How to pass a parameter from vb.net.
As it is now, there is another problem with your SQL statement. Strings must be enclosed in single quotes. Also use & for string concatenation, not + (it's this + that makes VB think you want to add Doubles).
The types of your text and number inputs do not seem to match those in the table (is NombreDePersona a varchar(20)?), and you are inserting FechaInclusion twice.
I would also specify the column names explicitly
INSERT INTO TBL_Usuario_102 (column_name1, column_name2, ...) values ('a text', 3, ...)
Finally, you don't execute your command. After having opened a connection:
instruccionSQL.ExecuteNonQuery()

Matlab using editbox to match database

I'm facing a problem when changing the user input to an edit box when retrieving a value from the database.
The following is the working code.
conn = database('SQL', '', '');
name = input('what is your name: ', 's');
sqlquery = ['select Staff.staffPW from imageProcessing.dbo.Staff '...
'where Staff.staffID = ' '''' name ''''];
curs = exec(conn,sqlquery);
curs = fetch(curs);
curs.Data
close(curs)
close(conn)
But now when I changed the input to use an edit box, a problem occurred:
function pushbutton1_Callback(hObject, eventdata, handles)
conn = database('SQL', '', '');
name = get(handles.edit1,'String');
sqlquery = ['select Staff.staffPW from imageProcessing.dbo.Staff '...
'where Staff.staffID = ' '''' name ''''];
curs = exec(conn,sqlquery);
curs = fetch(curs);
curs.Data
close(curs)
close(conn)
I can get the correct pw from the working code, but with the input from the edit box I'm getting nothing. Can anyone teach me how to make it work? Thanks a lot!
The immediate issue is that the String property in your case is a cell array containing a string rather than just a plain string.
MATLAB edit-style uicontrols are capable of displaying multiple lines of text. To fill in multiple lines of text, in these cases, the string can be passed in as a cell array of strings where each element is on a different line of the edit box.
control = uicontrol('style', 'edit', ...
'Max', 2, ... % A two-line edit box
'String', {'Row1', 'Row2'});
Because of this, even for a single-line edit box, the String property could be a cell array of strings or just a string. Therefore, when retrieving values from an edit box, be sure to check if it is a cell array (iscell) or a string (ischar) before using it.
So adapting this to your code, we could do something like the following
name = get(handles.edit1, 'String');
% Check to ensure it is a cell array of one string
if iscell(name) && numel(name) == 1
name = name{1};
end
% Disallow non-strings, cell arrays of multiple strings, or empty strings
if ~ischar(name) || isempty(name)
error('A valid string must be supplied!');
end
sqlquery = ['select Staff.staffPW from imageProcessing.dbo.Staff '...
'where Staff.staffID = ' '''' name ''''];

test ssis transformation expressions in management studio and creating an expression

I am trying to remove part of a string from another string, for example:
declare @url varchar(50)
set @url = 'www.test.com~~34235645463544563554'
select @url, substring(@url,1,CHARINDEX('~~',@url)-1)
I am trying to remove '~~34235645463544563554'
I am used to using the built-in T-SQL functions (as shown above) to do this, but I'm trying to do the same thing as a step in SSIS as a "derived column transformation". Can someone suggest how I can do this, and whether there is an easy way to quickly test it out in Management Studio? I.e. using the expression written in SSIS to test for the expected result. I would prefer not to run the whole thing, as it is a big package.
In the end I used a script component:
Dim debugOn As Boolean
debugOn = False
If debugOn Then MsgBox(Row.trackingCode)
If Row.trackingCode <> "" Then
' find the delimiter location
Dim endLocation As Integer
If debugOn Then MsgBox("index of ~~ is " & Row.trackingCode.ToString().IndexOf("~~", 0))
' chk if we have ~~ in field, if not in field then -1 is returned
If Row.trackingCode.ToString().IndexOf("~~", 0) > -1 Then
' if ~~ at the beginning ie no tracking code
If Row.trackingCode.ToString().IndexOf("~~", 0) = 1 Then
endLocation = Row.trackingCode.ToString().IndexOf("~~", 0)
ElseIf Row.trackingCode.ToString().IndexOf("~~", 0) > 1 Then
endLocation = Row.trackingCode.ToString().IndexOf("~~", 0) - 1
End If
If debugOn Then MsgBox("end of track code is " & endLocation)
Row.trackingCode = Row.trackingCode.Substring(1, endLocation)
End If
End If

Natural (human alpha-numeric) sort in Microsoft SQL 2005

We have a large database on which we have DB side pagination. This is quick, returning a page of 50 rows from millions of records in a small fraction of a second.
Users can define their own sort, basically choosing what column to sort by. Columns are dynamic - some have numeric values, some dates and some text.
While most columns sort as expected, text sorts in a dumb way. Well, I say dumb; it makes sense to computers, but frustrates users.
For instance, sorting by a string record id gives something like:
rec1
rec10
rec14
rec2
rec20
rec3
rec4
...and so on.
I want this to take account of the number, so:
rec1
rec2
rec3
rec4
rec10
rec14
rec20
I can't control the input (otherwise I'd just format in leading 000s) and I can't rely on a single format - some are things like "{alpha code}-{dept code}-{rec id}".
I know a few ways to do this in C#, but can't pull down all the records to sort them, as that would be too slow.
Does anyone know a way to quickly apply a natural sort in SQL Server?
We're using:
ROW_NUMBER() over (order by {field name} asc)
And then we're paging by that.
We can add triggers, although we wouldn't. All their input is parametrised and the like, but I can't change the format - if they put in "rec2" and "rec10" they expect them to be returned just like that, and in natural order.
We have valid user input that follows different formats for different clients.
One might go rec1, rec2, rec3, ... rec100, rec101
While another might go: grp1rec1, grp1rec2, ... grp20rec300, grp20rec301
When I say we can't control the input I mean that we can't force users to change these standards - they have a value like grp1rec1 and I can't reformat it as grp01rec001, as that would be changing something used for lookups and linking to external systems.
These formats vary a lot, but are often mixtures of letters and numbers.
Sorting these in C# is easy - just break it up into { "grp", 20, "rec", 301 } and then compare sequence values in turn.
However there may be millions of records and the data is paged, I need the sort to be done on the SQL server.
SQL Server sorts by value, not comparison - in C# I can split the values out to compare, but in SQL I need some logic that (very quickly) gets a single value that consistently sorts.
@moebius - your answer might work, but it does feel like an ugly compromise to add a sort-key for all these text values.
order by LEN(value), value
Not perfect, but works well in a lot of cases.
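Applied to the OP's rec* ids (column name assumed), that is simply:
SELECT rec_id
FROM MyTable
ORDER BY LEN(rec_id), rec_id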
Most of the SQL-based solutions I have seen break when the data gets complex enough (e.g. more than one or two numbers in it). Initially I tried implementing a NaturalSort function in T-SQL that met my requirements (among other things, handles an arbitrary number of numbers within the string), but the performance was way too slow.
Ultimately, I wrote a scalar CLR function in C# to allow for a natural sort, and even with unoptimized code the performance calling it from SQL Server is blindingly fast. It has the following characteristics:
will sort the first 1,000 characters or so correctly (easily modified in code or made into a parameter)
properly sorts decimals, so 123.333 comes before 123.45
because of above, will likely NOT sort things like IP addresses correctly; if you wish different behaviour, modify the code
supports sorting a string with an arbitrary number of numbers within it
will correctly sort numbers up to 25 digits long (easily modified in code or made into a parameter)
The code is here:
using System;
using System.Data.SqlTypes;
using System.Text;
using Microsoft.SqlServer.Server;
public class UDF
{
[SqlFunction(DataAccess = DataAccessKind.None, IsDeterministic=true)]
public static SqlString Naturalize(string val)
{
if (String.IsNullOrEmpty(val))
return val;
while(val.Contains("  "))
val = val.Replace("  ", " ");
const int maxLength = 1000;
const int padLength = 25;
bool inNumber = false;
bool isDecimal = false;
int numStart = 0;
int numLength = 0;
int length = val.Length < maxLength ? val.Length : maxLength;
//TODO: optimize this so that we exit for loop once sb.ToString() >= maxLength
var sb = new StringBuilder();
for (var i = 0; i < length; i++)
{
int charCode = (int)val[i];
if (charCode >= 48 && charCode <= 57)
{
if (!inNumber)
{
numStart = i;
numLength = 1;
inNumber = true;
continue;
}
numLength++;
continue;
}
if (inNumber)
{
sb.Append(PadNumber(val.Substring(numStart, numLength), isDecimal, padLength));
inNumber = false;
}
isDecimal = (charCode == 46);
sb.Append(val[i]);
}
if (inNumber)
sb.Append(PadNumber(val.Substring(numStart, numLength), isDecimal, padLength));
var ret = sb.ToString();
if (ret.Length > maxLength)
return ret.Substring(0, maxLength);
return ret;
}
static string PadNumber(string num, bool isDecimal, int padLength)
{
return isDecimal ? num.PadRight(padLength, '0') : num.PadLeft(padLength, '0');
}
}
To register this so that you can call it from SQL Server, run the following commands in Query Analyzer:
CREATE ASSEMBLY SqlServerClr FROM 'SqlServerClr.dll' --put the full path to DLL here
go
CREATE FUNCTION Naturalize(@val as nvarchar(max)) RETURNS nvarchar(1000)
EXTERNAL NAME SqlServerClr.UDF.Naturalize
go
Then, you can use it like so:
select *
from MyTable
order by dbo.Naturalize(MyTextField)
Note: If you get an error in SQL Server along the lines of Execution of user code in the .NET Framework is disabled. Enable "clr enabled" configuration option., follow the instructions here to enable it. Make sure you consider the security implications before doing so. If you are not the db admin, make sure you discuss this with your admin before making any changes to the server configuration.
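For reference, enabling it is typically just the following (but do read the security guidance first):
EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;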
Note 2: This code does not properly support internationalization (e.g. it assumes the decimal marker is ".", is not optimized for speed, etc.). Suggestions on improving it are welcome!
Edit: Renamed the function to Naturalize instead of NaturalSort, since it does not do any actual sorting.
I know this is an old question, but I just came across it and it doesn't have an accepted answer.
I have always used ways similar to this:
SELECT [Column] FROM [Table]
ORDER BY RIGHT(REPLICATE('0', 1000) + LTRIM(RTRIM(CAST([Column] AS VARCHAR(MAX)))), 1000)
The only common times this has issues are if your column won't cast to a VARCHAR(MAX), or if LEN([Column]) > 1000 (but you can change that 1000 to something else if you want); you can use this rough idea for what you need.
Also, this performs much worse than a normal ORDER BY [Column], but it does give you the result asked for in the OP.
Edit: Just to further clarify: the above will not work if you have decimal values such as 1, 1.15 and 1.5 (they will sort as {1, 1.5, 1.15}), as that is not what is asked for in the OP, but that can easily be done by:
SELECT [Column] FROM [Table]
ORDER BY REPLACE(RIGHT(REPLICATE('0', 1000) + LTRIM(RTRIM(CAST([Column] AS VARCHAR(MAX)))) + REPLICATE('0', 100 - CHARINDEX('.', REVERSE(LTRIM(RTRIM(CAST([Column] AS VARCHAR(MAX))))), 1)), 1000), '.', '0')
Result: {1, 1.15, 1.5}
And still all entirely within SQL. This will not sort IP addresses because you're now getting into very specific number combinations as opposed to simple text + number.
RedFilter's answer is great for reasonably sized datasets where indexing is not critical; however, if you want an index, several tweaks are required.
First, mark the function as not doing any data access and being deterministic and precise:
[SqlFunction(DataAccess = DataAccessKind.None,
SystemDataAccess = SystemDataAccessKind.None,
IsDeterministic = true, IsPrecise = true)]
Next, MSSQL has a 900 byte limit on the index key size, so if the naturalized value is the only value in the index, it must be at most 450 characters long. If the index includes multiple columns, the return value must be even smaller. Two changes:
CREATE FUNCTION Naturalize(@str AS nvarchar(max)) RETURNS nvarchar(450)
EXTERNAL NAME ClrExtensions.Util.Naturalize
and in the C# code:
const int maxLength = 450;
Finally, you will need to add a computed column to your table, and it must be persisted (because MSSQL cannot prove that Naturalize is deterministic and precise), which means the naturalized value is actually stored in the table but is still maintained automatically:
ALTER TABLE YourTable ADD nameNaturalized AS dbo.Naturalize(name) PERSISTED
You can now create the index!
CREATE INDEX idx_YourTable_n ON YourTable (nameNaturalized)
I've also made a couple of changes to RedFilter's code: using chars for clarity, incorporating duplicate space removal into the main loop, exiting once the result is longer than the limit, setting maximum length without substring etc. Here's the result:
using System.Data.SqlTypes;
using System.Text;
using Microsoft.SqlServer.Server;
public static class Util
{
[SqlFunction(DataAccess = DataAccessKind.None, SystemDataAccess = SystemDataAccessKind.None, IsDeterministic = true, IsPrecise = true)]
public static SqlString Naturalize(string str)
{
if (string.IsNullOrEmpty(str))
return str;
const int maxLength = 450;
const int padLength = 15;
bool isDecimal = false;
bool wasSpace = false;
int numStart = 0;
int numLength = 0;
var sb = new StringBuilder();
for (var i = 0; i < str.Length; i++)
{
char c = str[i];
if (c >= '0' && c <= '9')
{
if (numLength == 0)
numStart = i;
numLength++;
}
else
{
if (numLength > 0)
{
sb.Append(pad(str.Substring(numStart, numLength), isDecimal, padLength));
numLength = 0;
}
if (c != ' ' || !wasSpace)
sb.Append(c);
isDecimal = c == '.';
if (sb.Length > maxLength)
break;
}
wasSpace = c == ' ';
}
if (numLength > 0)
sb.Append(pad(str.Substring(numStart, numLength), isDecimal, padLength));
if (sb.Length > maxLength)
sb.Length = maxLength;
return sb.ToString();
}
private static string pad(string num, bool isDecimal, int padLength)
{
return isDecimal ? num.PadRight(padLength, '0') : num.PadLeft(padLength, '0');
}
}
Here's a solution written for SQL 2000. It can probably be improved for newer SQL versions.
/**
* Returns a string formatted for natural sorting. This function is very useful when having to sort alpha-numeric strings.
*
* @author Alexandre Potvin Latreille (plalx)
* @param {nvarchar(4000)} string The formatted string.
* @param {int} numberLength The length each number should have (including padding). This should be the length of the longest number. Defaults to 10.
* @param {char(50)} sameOrderChars A list of characters that should have the same order. Ex: '.-/'. Defaults to empty string.
*
* @return {nvarchar(4000)} A string for natural sorting.
* Example of use:
*
* SELECT Name FROM TableA ORDER BY Name
* TableA (unordered) TableA (ordered)
* ------------ ------------
* ID Name ID Name
* 1. A1. 1. A1-1.
* 2. A1-1. 2. A1.
* 3. R1 --> 3. R1
* 4. R11 4. R11
* 5. R2 5. R2
*
*
* As we can see, humans would expect A1., A1-1., R1, R2, R11 but that's not how SQL is sorting it.
* We can use this function to fix this.
*
* SELECT Name FROM TableA ORDER BY dbo.udf_NaturalSortFormat(Name, default, '.-')
* TableA (unordered) TableA (ordered)
* ------------ ------------
* ID Name ID Name
* 1. A1. 1. A1.
* 2. A1-1. 2. A1-1.
* 3. R1 --> 3. R1
* 4. R11 4. R2
* 5. R2 5. R11
*/
ALTER FUNCTION [dbo].[udf_NaturalSortFormat](
@string nvarchar(4000),
@numberLength int = 10,
@sameOrderChars char(50) = ''
)
RETURNS varchar(4000)
AS
BEGIN
DECLARE @sortString varchar(4000),
@numStartIndex int,
@numEndIndex int,
@padLength int,
@totalPadLength int,
@i int,
@sameOrderCharsLen int;
SELECT
@totalPadLength = 0,
@string = RTRIM(LTRIM(@string)),
@sortString = @string,
@numStartIndex = PATINDEX('%[0-9]%', @string),
@numEndIndex = 0,
@i = 1,
@sameOrderCharsLen = LEN(@sameOrderChars);
-- Replace all chars that have the same order by a space.
WHILE (@i <= @sameOrderCharsLen)
BEGIN
SET @sortString = REPLACE(@sortString, SUBSTRING(@sameOrderChars, @i, 1), ' ');
SET @i = @i + 1;
END
-- Pad numbers with zeros.
WHILE (@numStartIndex <> 0)
BEGIN
SET @numStartIndex = @numStartIndex + @numEndIndex;
SET @numEndIndex = @numStartIndex;
WHILE(PATINDEX('[0-9]', SUBSTRING(@string, @numEndIndex, 1)) = 1)
BEGIN
SET @numEndIndex = @numEndIndex + 1;
END
SET @numEndIndex = @numEndIndex - 1;
SET @padLength = @numberLength - (@numEndIndex + 1 - @numStartIndex);
IF @padLength < 0
BEGIN
SET @padLength = 0;
END
SET @sortString = STUFF(
@sortString,
@numStartIndex + @totalPadLength,
0,
REPLICATE('0', @padLength)
);
SET @totalPadLength = @totalPadLength + @padLength;
SET @numStartIndex = PATINDEX('%[0-9]%', RIGHT(@string, LEN(@string) - @numEndIndex));
END
RETURN @sortString;
END
I know this is a bit old at this point, but in my search for a better solution, I came across this question. I'm currently using a function to order by. It works fine for my purpose of sorting records which are named with mixed alphanumerics ('item 1', 'item 10', 'item 2', etc.).
CREATE FUNCTION [dbo].[fnMixSort]
(
@ColValue NVARCHAR(255)
)
RETURNS NVARCHAR(1000)
AS
BEGIN
DECLARE @p1 NVARCHAR(255),
@p2 NVARCHAR(255),
@p3 NVARCHAR(255),
@p4 NVARCHAR(255),
@Index TINYINT
IF @ColValue LIKE '[a-z]%'
SELECT @Index = PATINDEX('%[0-9]%', @ColValue),
@p1 = LEFT(CASE WHEN @Index = 0 THEN @ColValue ELSE LEFT(@ColValue, @Index - 1) END + REPLICATE(' ', 255), 255),
@ColValue = CASE WHEN @Index = 0 THEN '' ELSE SUBSTRING(@ColValue, @Index, 255) END
ELSE
SELECT @p1 = REPLICATE(' ', 255)
SELECT @Index = PATINDEX('%[^0-9]%', @ColValue)
IF @Index = 0
SELECT @p2 = RIGHT(REPLICATE(' ', 255) + @ColValue, 255),
@ColValue = ''
ELSE
SELECT @p2 = RIGHT(REPLICATE(' ', 255) + LEFT(@ColValue, @Index - 1), 255),
@ColValue = SUBSTRING(@ColValue, @Index, 255)
SELECT @Index = PATINDEX('%[0-9,a-z]%', @ColValue)
IF @Index = 0
SELECT @p3 = REPLICATE(' ', 255)
ELSE
SELECT @p3 = LEFT(REPLICATE(' ', 255) + LEFT(@ColValue, @Index - 1), 255),
@ColValue = SUBSTRING(@ColValue, @Index, 255)
IF PATINDEX('%[^0-9]%', @ColValue) = 0
SELECT @p4 = RIGHT(REPLICATE(' ', 255) + @ColValue, 255)
ELSE
SELECT @p4 = LEFT(@ColValue + REPLICATE(' ', 255), 255)
RETURN @p1 + @p2 + @p3 + @p4
END
Then call
select item_name from my_table order by dbo.fnMixSort(item_name)
It easily triples the processing time for a simple data read, so it may not be the perfect solution.
Here is an other solution that I like:
http://www.dreamchain.com/sql-and-alpha-numeric-sort-order/
It's not Microsoft SQL, but since I ended up here when I was searching for a solution for Postgres, I thought adding this here would help others.
EDIT: Here is the code, in case the link goes away.
CREATE or REPLACE FUNCTION pad_numbers(text) RETURNS text AS $$
SELECT regexp_replace(regexp_replace(regexp_replace(regexp_replace(($1 collate "C"),
E'(^|\\D)(\\d{1,3}($|\\D))', E'\\1000\\2', 'g'),
E'(^|\\D)(\\d{4,6}($|\\D))', E'\\1000\\2', 'g'),
E'(^|\\D)(\\d{7}($|\\D))', E'\\100\\2', 'g'),
E'(^|\\D)(\\d{8}($|\\D))', E'\\10\\2', 'g');
$$ LANGUAGE SQL;
"C" is the default collation in postgresql; you may specify any collation you desire, or remove the collation statement if you can be certain your table columns will never have a nondeterministic collation assigned.
usage:
SELECT * FROM wtf w
WHERE TRUE
ORDER BY pad_numbers(w.my_alphanumeric_field)
For the following varchar data:
BR1
BR2
External Location
IR1
IR2
IR3
IR4
IR5
IR6
IR7
IR8
IR9
IR10
IR11
IR12
IR13
IR14
IR16
IR17
IR15
VCR
This worked best for me:
ORDER BY substring(fieldName, 1, 1), LEN(fieldName)
If you're having trouble loading the data from the DB to sort in C#, then I'm sure you'll be disappointed with any approach at doing it programmatically in the DB. When the server is going to sort, it's got to calculate the "perceived" order just as you would have -- every time.
I'd suggest that you add an additional column to store the preprocessed sortable string, using some C# method, when the data is first inserted. You might try to convert the numerics into fixed-width ranges, for example, so "xyz1" would turn into "xyz00000001". Then you could use normal SQL Server sorting.
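A sketch of that approach (names are placeholders; the SortKey value itself would be computed in C# before the row is written):
-- hypothetical precomputed sort column, e.g. 'xyz1' stored as 'xyz00000001'
ALTER TABLE MyTable ADD SortKey varchar(100) NULL;
CREATE INDEX IX_MyTable_SortKey ON MyTable (SortKey);
-- paging then becomes a plain indexed sort:
SELECT *, ROW_NUMBER() OVER (ORDER BY SortKey) AS rn
FROM MyTable;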
At the risk of tooting my own horn, I wrote a CodeProject article implementing the problem as posed in the CodingHorror article. Feel free to steal from my code.
Simply sort by:
ORDER BY
cast (substring(name,(PATINDEX('%[0-9]%',name)),len(name))as int)
I've just read an article somewhere about such a topic. The key point is: you only need the integer value to sort the data, while the 'rec' string belongs to the UI. You could split the information into two fields, say alpha and num, sort by alpha and num (separately), and then show a string composed of alpha + num. You could use a computed column to compose the string, or a view.
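A minimal sketch of that idea, assuming ids like 'rec42' are split at insert time (table and column names are placeholders):
CREATE TABLE RecordIds
(
alpha varchar(20) NOT NULL, -- e.g. 'rec'
num int NOT NULL, -- e.g. 42
display AS alpha + CAST(num AS varchar(10)) -- recomposed string for the UI
);
SELECT display FROM RecordIds ORDER BY alpha, num; -- natural order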
Hope it helps
You can use the following code to resolve the problem:
Select *,
substring(Cote,1,len(Cote) - Len(RIGHT(Cote, LEN(Cote) - PATINDEX('%[0-9]%', Cote)+1)))alpha,
CAST(RIGHT(Cote, LEN(Cote) - PATINDEX('%[0-9]%', Cote)+1) AS INT)intv
FROM Documents
left outer join Sites ON Sites.IDSite = Documents.IDSite
Order BY alpha, intv
I'm fashionably late to the party as usual. Nevertheless, here is my attempt at an answer that seems to work well (I would say that). It assumes text with digits at the end, like in the original example data.
First a function that won't end up winning a "pretty SQL" competition anytime soon.
CREATE FUNCTION udfAlphaNumericSortHelper (
@string varchar(max)
)
RETURNS @results TABLE (
txt varchar(max),
num float
)
AS
BEGIN
DECLARE @txt varchar(max) = @string
DECLARE @numStr varchar(max) = ''
DECLARE @num float = 0
DECLARE @lastChar varchar(1) = ''
set @lastChar = RIGHT(@txt, 1)
WHILE @lastChar <> '' and @lastChar is not null
BEGIN
IF ISNUMERIC(@lastChar) = 1
BEGIN
set @numStr = @lastChar + @numStr
set @txt = Substring(@txt, 0, len(@txt))
set @lastChar = RIGHT(@txt, 1)
END
ELSE
BEGIN
set @lastChar = null
END
END
SET @num = CAST(@numStr as float)
INSERT INTO @results select @txt, @num
RETURN;
END
Then call it like below:
declare @str nvarchar(250) = 'sox,fox,jen1,Jen0,jen15,jen02,jen0004,fox00,rec1,rec10,jen3,rec14,rec2,rec20,rec3,rec4,zip1,zip1.32,zip1.33,zip1.3,TT0001,TT01,TT002'
SELECT tbl.value --, sorter.txt, sorter.num
FROM STRING_SPLIT(@str, ',') as tbl
CROSS APPLY dbo.udfAlphaNumericSortHelper(value) as sorter
ORDER BY sorter.txt, sorter.num, len(tbl.value)
With results:
fox
fox00
Jen0
jen1
jen02
jen3
jen0004
jen15
rec1
rec2
rec3
rec4
rec10
rec14
rec20
sox
TT01
TT0001
TT002
zip1
zip1.3
zip1.32
zip1.33
I still don't understand (probably because of my poor English).
You could try:
ROW_NUMBER() OVER (ORDER BY dbo.human_sort(field_name) ASC)
But it won't work for millions of records.
That is why I suggested using a trigger which fills a separate column with the human value.
Moreover:
- built-in T-SQL functions are really slow, and Microsoft suggests using .NET functions instead.
- the human value is constant, so there is no point calculating it each time the query runs.
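A minimal sketch of that trigger idea, assuming an id key, a human_value column, and the dbo.human_sort function mentioned above (all names are placeholders):
CREATE TRIGGER trg_FillHumanValue ON MyTable
AFTER INSERT, UPDATE
AS
BEGIN
SET NOCOUNT ON;
-- recompute the precalculated sort value for just the affected rows
UPDATE t
SET human_value = dbo.human_sort(t.field_name)
FROM MyTable AS t
JOIN inserted AS i ON i.id = t.id;
END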

How to do hit-highlighting of results from a SQL Server full-text query

We have a web application that uses SQL Server 2008 as the database. Our users are able to do full-text searches on particular columns in the database. SQL Server's full-text functionality does not seem to provide support for hit highlighting. Do we need to build this ourselves or is there perhaps some library or knowledge around on how to do this?
BTW the application is written in C# so a .Net solution would be ideal but not necessary as we could translate.
Expanding on Ishmael's idea, it's not the final solution, but I think it's a good way to start.
Firstly we need to get the list of words that have been retrieved with the full-text engine:
declare @SearchPattern nvarchar(1000) = 'FORMSOF (INFLECTIONAL, " ' + @SearchString + ' ")'
declare @SearchWords table (Word varchar(100), Expansion_type int)
insert into @SearchWords
select distinct display_term, expansion_type
from sys.dm_fts_parser(@SearchPattern, 1033, 0, 0)
where special_term = 'Exact Match'
There is already quite a lot one can expand on; for example, the search pattern is quite basic, and there are probably better ways to filter out the words you don't need, but at least it gives you a list of stem words etc. that would be matched by full-text search.
After you get the results you need, you can use RegEx to parse through the result set (or preferably only a subset to speed it up, although I haven't yet figured out a good way to do so). For this I simply use two while loops and a bunch of temporary tables and variables:
declare @FinalResults table
while (select COUNT(*) from #PrelimResults) > 0
begin
select top 1 @CurrID = [UID], @Text = Text from #PrelimResults
declare @TextLength int = LEN(@Text)
declare @IndexOfDot int = CHARINDEX('.', REVERSE(@Text), @TextLength - dbo.RegExIndexOf(@Text, '\b' + @FirstSearchWord + '\b') + 1)
set @Text = SUBSTRING(@Text, case @IndexOfDot when 0 then 0 else @TextLength - @IndexOfDot + 3 end, 300)
while (select COUNT(*) from #TempSearchWords) > 0
begin
select top 1 @CurrWord = Word from #TempSearchWords
set @Text = dbo.RegExReplace(@Text, '\b' + @CurrWord + '\b', '<b>' + SUBSTRING(@Text, dbo.RegExIndexOf(@Text, '\b' + @CurrWord + '\b'), LEN(@CurrWord) + 1) + '</b>')
delete from #TempSearchWords where Word = @CurrWord
end
insert into @FinalResults
select * from #PrelimResults where [UID] = @CurrID
delete from #PrelimResults where [UID] = @CurrID
end
Several notes:
1. Nested while loops probably aren't the most efficient way of doing it, however nothing else comes to mind. If I were to use cursors, it would essentially be the same thing?
2. @FirstSearchWord here refers to the first instance in the text of one of the original search words, so essentially the text you are replacing is only going to be in the summary. Again, it's quite a basic method; some sort of text-cluster-finding algorithm would probably be handy.
3. To get RegEx in the first place, you need CLR user-defined functions.
It looks like you could parse the output of the new SQL Server 2008 stored procedure sys.dm_fts_parser and use regex, but I haven't looked at it too closely.
You might be missing the point of the database in this instance. Its job is to return the data to you that satisfies the conditions you gave it. I think you will want to implement the highlighting probably using regex in your web control.
Here is something a quick search would reveal.
http://www.dotnetjunkies.com/PrintContent.aspx?type=article&id=195E323C-78F3-4884-A5AA-3A1081AC3B35
Some details:
search_kiemeles=replace(lcase(search),"""","")
do while not rs.eof 'The search result loop
hirdetes=rs("hirdetes")
data=RegExpValueA("([A-Za-zöüóőúéáűíÖÜÓŐÚÉÁŰÍ0-9]+)",search_kiemeles) 'Give back all the search words in an array, I need non-english characters also
For i=0 to Ubound(data,1)
hirdetes = RegExpReplace(hirdetes,"("&NoAccentRE(data(i))&")","<em>$1</em>")
Next
response.write hirdetes
rs.movenext
Loop
...
Functions
'All Match to Array
Function RegExpValueA(patrn, strng)
Dim regEx
Set regEx = New RegExp ' Create a regular expression.
regEx.IgnoreCase = True ' Set case insensitivity.
regEx.Global = True
Dim Match, Matches, RetStr
Dim data()
Dim count
count = 0
Redim data(-1) 'VBSCript Ubound array bug workaround
if isnull(strng) or strng="" then
RegExpValueA = data
exit function
end if
regEx.Pattern = patrn ' Set pattern.
Set Matches = regEx.Execute(strng) ' Execute search.
For Each Match in Matches ' Iterate Matches collection.
count = count + 1
Redim Preserve data(count-1)
data(count-1) = Match.Value
Next
set regEx = nothing
RegExpValueA = data
End Function
'Replace non-english chars
Function NoAccentRE(accent_string)
NoAccentRE=accent_string
NoAccentRE=Replace(NoAccentRE,"a","§")
NoAccentRE=Replace(NoAccentRE,"á","§")
NoAccentRE=Replace(NoAccentRE,"§","[aá]")
NoAccentRE=Replace(NoAccentRE,"e","§")
NoAccentRE=Replace(NoAccentRE,"é","§")
NoAccentRE=Replace(NoAccentRE,"§","[eé]")
NoAccentRE=Replace(NoAccentRE,"i","§")
NoAccentRE=Replace(NoAccentRE,"í","§")
NoAccentRE=Replace(NoAccentRE,"§","[ií]")
NoAccentRE=Replace(NoAccentRE,"o","§")
NoAccentRE=Replace(NoAccentRE,"ó","§")
NoAccentRE=Replace(NoAccentRE,"ö","§")
NoAccentRE=Replace(NoAccentRE,"ő","§")
NoAccentRE=Replace(NoAccentRE,"§","[oóöő]")
NoAccentRE=Replace(NoAccentRE,"u","§")
NoAccentRE=Replace(NoAccentRE,"ú","§")
NoAccentRE=Replace(NoAccentRE,"ü","§")
NoAccentRE=Replace(NoAccentRE,"ű","§")
NoAccentRE=Replace(NoAccentRE,"§","[uúüű]")
end function
