I am bulk inserting a CSV file into SQL Server 2012. The data is pipe-delimited (|) and currently arrives as one long string for each row. I'd like to separate the data into different columns at each pipe.
Here is how the data looks as it's imported:
ID|ID2|Person|Person2|City|State
"1"|"ABC"|"Joe"|"Ben"|"Boston"|"MA"
"2"|"ABD"|"Jack"|"Tim"|"Nashua"|"NH"
"3"|"ADC"|"John"|"Mark"|"Hartford"|"CT"
I'd like to separate the data into columns at each pipe:
ID  ID2  Person  Person2  City      State
1   ABC  Joe     Ben      Boston    MA
2   ABD  Jack    Tim      Nashua    NH
3   AFC  John    Mark     Hartford  CT
I'm finding it difficult to use the CHARINDEX and SUBSTRING functions because of the number of columns in the data. I've also tried PARSENAME, since that is available in 2012, but that isn't working either: all the columns come out as NULL.
The file contains about 300k rows. I've found a solution using XML, but it is very slow: it takes about a minute to separate the data.
Here's the slow xml solution:
CREATE TABLE #tbl(iddata varchar(200))
DECLARE @i int = 0
WHILE @i < 100000
BEGIN
SET @i = @i + 1
INSERT INTO #tbl(iddata)
SELECT '"1"|"ABC"|"Joe"|"Ben"|"Boston"|"MA"'
UNION ALL
SELECT '"2"|"ABD"|"Jack"|"Tim"|"Nashua"|"NH"'
UNION ALL
SELECT '"3"|"AFC"|"John"|"Mark"|"Hartford"|"CT"'
END
;WITH XMLData
AS
(
SELECT idData,
CONVERT(XML,'<IDs><id>'
+ REPLACE(iddata,'|', '</id><id>') + '</id></IDs>') AS xmlname
FROM (
SELECT REPLACE(iddata,'"','') as iddata
FROM #tbl
)x
)
SELECT xmlname.value('/IDs[1]/id[1]','varchar(100)') AS ID,
xmlname.value('/IDs[1]/id[2]','varchar(100)') AS ID2,
xmlname.value('/IDs[1]/id[3]','varchar(100)') AS Person,
xmlname.value('/IDs[1]/id[4]','varchar(100)') AS Person2,
xmlname.value('/IDs[1]/id[5]','varchar(100)') AS City,
xmlname.value('/IDs[1]/id[6]','varchar(100)') AS State
FROM XMLData
This will do it for you.
CREATE TABLE #Import (
ID NVARCHAR(MAX),
ID2 NVARCHAR(MAX),
Person NVARCHAR(MAX),
Person2 NVARCHAR(MAX),
City NVARCHAR(MAX),
State NVARCHAR(MAX))
SET QUOTED_IDENTIFIER OFF
BULK INSERT #Import
FROM 'C:\MyFile.csv'
WITH
(
FIRSTROW = 2,
FIELDTERMINATOR = '|',
ROWTERMINATOR = '\n',
ERRORFILE = 'C:\myRubbishData.log'
)
select * from #Import
DROP TABLE #Import
Unfortunately using BULK INSERT will not deal with text qualifiers, so you will end up with "ABC" rather than ABC.
Either remove the text qualifiers from the csv file, or run a replace on your table once the data has been imported.
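If you go the route of removing the text qualifiers from the file before importing, a minimal sketch of that preprocessing step could look like the following. It is written in Python purely for illustration; the function name `strip_qualifiers` is made up, and it assumes no field contains a pipe or a line break inside the quotes.

```python
import csv
import io

def strip_qualifiers(src, dst):
    """Copy pipe-delimited rows from src to dst without the double-quote qualifiers."""
    reader = csv.reader(src, delimiter="|", quotechar='"')
    # QUOTE_NONE writes fields bare; escapechar is only used if a field itself
    # contains a pipe, which BULK INSERT would not understand (assumed absent here).
    writer = csv.writer(dst, delimiter="|", quoting=csv.QUOTE_NONE,
                        escapechar="\\", lineterminator="\n")
    for row in reader:
        writer.writerow(row)

# In-memory demo using the header and first row from the question.
sample = io.StringIO(
    'ID|ID2|Person|Person2|City|State\n'
    '"1"|"ABC"|"Joe"|"Ben"|"Boston"|"MA"\n'
)
cleaned = io.StringIO()
strip_qualifiers(sample, cleaned)
print(cleaned.getvalue())
```

For a real file you would pass two open file handles (opened with `newline=""`) instead of the in-memory buffers.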
To save you the pain and misery of having to deal with pipes, I would strongly recommend that you process your input file to convert those pipes into commas, and then use SQL Server's built-in capacity to parse CSV into a table.
If you are using Java, replacing the pipes would literally take just one line of code:
String line = "\"1\"|\"ABC\"|\"Joe\"|\"Ben\"|\"Boston\"|\"MA\"";
// replace() treats "|" literally; replaceAll() would parse it as a regex alternation
line = line.replace("|", ",");
// then write this line back out to file
BULK INSERT YourTable
FROM 'input.csv'
WITH
(
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n',
ERRORFILE = 'C:\CSVDATA\SchoolsErrorRows.csv',
TABLOCK
)
If you cannot work with the BULK suggestions (due to rights) you might speed your query up by about 30% with this:
SELECT AsXml.value('x[1]','varchar(100)') AS ID
,AsXml.value('x[2]','varchar(100)') AS ID2
,AsXml.value('x[3]','varchar(100)') AS Person
,AsXml.value('x[4]','varchar(100)') AS Person2
,AsXml.value('x[5]','varchar(100)') AS City
,AsXml.value('x[6]','varchar(100)') AS State
FROM #tbl
CROSS APPLY(SELECT CAST('<x>' + REPLACE(SUBSTRING(iddata,2,LEN(iddata)-2),'"|"','</x><x>') + '</x>' AS XML)) AS a(AsXml)
I need to read a pipe-delimited file in which an array repeats up to 30 times. I need to access these array elements, change their sequence, and write them to the output file.
E.g.
Tanya|1|Pen|2|Book|3|Eraser
Raj|11|Eraser|22|Bottle
In the above example, first field is the Customer name. After that we have an array of items ordered - Order ID and Item name.
Could you please suggest how to read these array elements individually to process these further?
You will be using copy activity in Azure Data Factory pipeline in which the source will be DelimitedText dataset and sink will be JSON dataset. If your source and destination files are located in Azure Blob Storage, create a Linked Service to connect the files with Azure Data Factory.
In the source dataset properties, set Column delimiter to Pipe (|).
In the JSON sink dataset settings, just specify the output location path where the file will be saved after conversion.
In the copy activity's Sink tab, set File pattern to Array of objects, then trigger the pipeline.
If you can control the output file format then you're better off using two (or three) delimiters, one for the person and the other for the order items. If you can get a third then use that to split the order item from the order line.
Assuming the data format:
Person Name | 1:M [Order]
And each Order is
order line | item name
You can simply ingest the entire row into a single nvarchar(max) column and then use SQL to break out the data you need.
The following is one such example.
declare @tbl table (d nvarchar(max));
insert into @tbl values('Tanya|1|Pen|2|Book|3|Eraser'),('Raj|11|Eraser|22|Bottle');
declare @base table (id int, person varchar(100),total_orders int, raw_orders varchar(max));
declare @output table (id int, person varchar(100),item_id int, item varchar(100));
with a as
(
select
CHARINDEX('|',d) idx
,d
,ROW_NUMBER() over (order by d) as id /*Or newid()*/
from @tbl
), b as
(
select
id
,SUBSTRING(d,0,idx) person
,SUBSTRING(d,idx+1,LEN(d)-idx+1) order_array
from a
), c as
(
select id, person, order_array
,(select count(1) from string_split(order_array,'|')) /2 orders
from b
)
insert into @base (id,person,total_orders,raw_orders)
select id,person,orders,order_array from c
declare @total_persons int = (select count(1) from @base);
declare @person_enu int = 1;
while @person_enu <= @total_persons
BEGIN
declare @total_orders int = (select total_orders from @base where id = @person_enu);
declare @raw_orders nvarchar(max) = (select raw_orders from @base where id = @person_enu);
declare @order_enu int = 1;
declare @i int = 1;
print CONCAT('Person ', @person_enu, '. Total orders: ', @total_orders);
while @order_enu <= @total_orders
begin
--declare @id int = (select value from string_split(@raw_orders,'|',1) where ordinal = @i);
--declare @val varchar(100) = (select value from string_split(@raw_orders,'|',1) where ordinal = @i+1);
--print concat('Will process order ',@order_enu);
--print concat('ID:',@i, ' Value:', @i+1)
--print concat('ID:',@id, ' Value:', @val)
--The third string_split argument (enable_ordinal) needs SQL Server 2022 or Azure SQL
INSERT INTO @output (id,person,item_id,item)
select b.id,b.person,n.value [item_id], v.value [item] from @base b
cross apply string_split(b.raw_orders,'|',1) n
cross apply string_split(b.raw_orders,'|',1) v
where b.id = @person_enu and n.ordinal = @i and v.ordinal = @i+1;
set @order_enu +=1;
set @i+=2;
end
set @person_enu += 1;
END
select * from @output;
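The pairing logic above (first field is the customer, then alternating order ID and item name) can also be sketched outside the database. The following Python snippet is only an illustration of that logic; `parse_orders` is a made-up name, and it assumes order IDs are integers and that no field contains a pipe.

```python
def parse_orders(line):
    """Split 'Name|id|item|id|item...' into (name, [(order_id, item), ...])."""
    fields = line.rstrip("\n").split("|")
    person, rest = fields[0], fields[1:]
    if len(rest) % 2 != 0:
        # Every order needs both an ID and an item name.
        raise ValueError(f"dangling order field in line: {line!r}")
    # Walk the remaining fields two at a time: (order ID, item name).
    orders = [(int(rest[i]), rest[i + 1]) for i in range(0, len(rest), 2)]
    return person, orders

for raw in ["Tanya|1|Pen|2|Book|3|Eraser", "Raj|11|Eraser|22|Bottle"]:
    print(parse_orders(raw))
```

Once the pairs are separated like this, reordering them before writing the output file is a plain list operation.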
I have the following in table TABLE
id content
-------------------------------------
1 Hellö world, I äm text
2 ènd there äré many more chars
3 that are speçial in my dat£base
I now need to export these records into HTML files, using bcp:
set @command = 'bcp "select [content] from [TABLE] where [id] = ' +
@id + '" queryout "' + @filename + '.html" -S ' + @instance +
' -c -U ' + @username + ' -P ' + @password
exec xp_cmdshell @command, no_output
To make the output look correct, I need to first replace all special characters with their respective HTML entities (pseudo)
insert into [#temp_html] ..
replace(replace([content], 'ö', '&ouml;'), 'ä', '&auml;')
But by now, I have 30 nested replaces and it's starting to look insane.
After much searching, I found this post which uses a HTML conversion table but it is too advanced for me to understand:
The table does not list the special chars itself as they are in my text (ö, à etc) but UnicodeHex. Do I need to add them to the table to make the conversions that I need?
I am having trouble understanding how to update my script to replace all special chars. Can someone please show me a snippet of (pseudo) code?
One way to do that with a translation table is using a recursive cte to do the replaces, and one more cte to get only the last row of each translated value.
First, create and populate sample table (Please save us this step in your future questions):
DECLARE @T AS TABLE
(
id int,
content nvarchar(100)
)
INSERT INTO @T (id, content) VALUES
(1, 'Hellö world, I äm text'),
(2, 'ènd there äré many more chars'),
(3, 'that are speçial in my dat£base')
Then, create and populate the translation table (I don't know the HTML entities for these chars, so I've just used numbers [plus it's easier to see in the results]). Also, please note that this can be done using yet another cte in the chain.
DECLARE @Translations AS TABLE
(
str nchar(1),
replacement nvarchar(10)
)
INSERT INTO @Translations (str, replacement) VALUES
('ö', '-1-'),
('à', '-2-'),
('è', '-3-'),
('ä', '-4-'),
('é', '-5-'),
('ç', '-6-'),
('£', '-7-')
Now, the first cte will do the replaces, and the second cte just adds a row_number so that for each id, the last value of lvl will get 1:
;WITH CTETranslations AS
(
SELECT id, content, 1 As lvl
FROM @T
UNION ALL
SELECT id, CAST(REPLACE(content, str, replacement) as nvarchar(100)), lvl+1
FROM CTETranslations
JOIN @Translations
ON content LIKE '%' + str + '%'
), cteNumberedTranslation AS
(
SELECT id, content, ROW_NUMBER() OVER(PARTITION BY Id ORDER BY lvl DESC) rn
FROM CTETranslations
)
Select from the second cte where rn = 1, I've joined the original table to show the source and translation side by side:
SELECT r.id, s.content, r.content
FROM @T s
JOIN cteNumberedTranslation r
ON s.Id = r.Id
WHERE rn = 1
ORDER BY Id
Results:
id  content                          content
1   Hellö world, I äm text           Hell-1- world, I -4-m text
2   ènd there äré many more chars    -3-nd there -4-r-5- many more chars
3   that are speçial in my dat£base  that are spe-6-ial in my dat-7-base
Please note that if your content has more than 100 special chars, you will need to add the MAXRECURSION 0 hint to the final select:
SELECT r.id, s.content, r.content
FROM @T s
JOIN cteNumberedTranslation r
ON s.Id = r.Id
WHERE rn = 1
ORDER BY Id
OPTION ( MAXRECURSION 0 );
See a live demo on rextester.
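Since the translation table above uses placeholder numbers rather than real entities, here is a hedged sketch of how the actual HTML entity names could be looked up. It uses Python's standard `html.entities.codepoint2name` mapping; the function name `to_entities` is invented for this example, and it deliberately leaves ASCII characters (including any existing markup) untouched.

```python
from html.entities import codepoint2name

def to_entities(text):
    """Replace non-ASCII characters with named HTML entities where one exists."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)                         # plain ASCII stays as-is
        elif code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")  # e.g. ö -> &ouml;
        else:
            out.append(f"&#{code};")               # numeric fallback, no named entity
    return "".join(out)

for special in "öäèéç£":
    # Each pair printed here could become one row of the translation table.
    print(special, to_entities(special))
```

The printed pairs can be pasted into the translation table in place of the `-1-`, `-2-` placeholders.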
I'm trying to merge a very wide table from a source (linked Oracle server) to a target table (SQL Server 2012) w/o listing all the columns. Both tables are identical except for the records in them.
This is what I have been using:
TRUNCATE TABLE TargetTable
INSERT INTO TargetTable
SELECT *
FROM SourceTable
When/if I get this working I would like to make it a procedure so that I can pass into it the source, target and match key(s) needed to make the update. For now I would just love to get it to work at all.
USE ThisDatabase
GO
DECLARE
@Columns VARCHAR(4000) = (
SELECT COLUMN_NAME + ','
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'TargetTable'
FOR XML PATH('')
)
MERGE TargetTable AS T
USING (SELECT * FROM SourceTable) AS S
ON (T.ID = S.ID AND T.ROWVERSION = S.ROWVERSION)
WHEN MATCHED THEN
UPDATE SET @Columns = S.@Columns
WHEN NOT MATCHED THEN
INSERT (@Columns)
VALUES (S.@Columns)
Please excuse my noob-ness. I feel like I'm only half way there, but I don't understand some parts of SQL well enough to put it all together. Many thanks.
As previously mentioned in the answers, if you don't want to specify the columns, then you have to write a dynamic query.
Something like this in your case should help:
DECLARE
@Columns VARCHAR(4000) = (
SELECT COLUMN_NAME + ','
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'TargetTable'
FOR XML PATH('')
)
DECLARE @MergeQuery NVARCHAR(MAX)
DECLARE @UpdateQuery VARCHAR(MAX)
DECLARE @InsertQuery VARCHAR(MAX)
DECLARE @InsertQueryValues VARCHAR(MAX)
DECLARE @Col VARCHAR(200)
SET @UpdateQuery='Update Set '
SET @InsertQuery='Insert ('
SET @InsertQueryValues=' Values('
WHILE LEN(@Columns) > 0
BEGIN
SET @Col=left(@Columns, charindex(',', @Columns+',')-1);
IF @Col<> 'ID' AND @Col <> 'ROWVERSION'
BEGIN
SET @UpdateQuery= @UpdateQuery+ 'TargetTable.'+ @Col + ' = SourceTable.'+ @Col+ ','
SET @InsertQuery= @InsertQuery+@Col + ','
SET @InsertQueryValues=@InsertQueryValues+'SourceTable.'+ @Col+ ','
END
SET @Columns = stuff(@Columns, 1, charindex(',', @Columns+','), '')
END
SET @UpdateQuery=LEFT(@UpdateQuery, LEN(@UpdateQuery) - 1)
SET @InsertQuery=LEFT(@InsertQuery, LEN(@InsertQuery) - 1)
SET @InsertQueryValues=LEFT(@InsertQueryValues, LEN(@InsertQueryValues) - 1)
SET @InsertQuery=@InsertQuery+ ')'+ @InsertQueryValues +')'
SET @MergeQuery=
N'MERGE TargetTable
USING SourceTable
ON TargetTable.ID = SourceTable.ID AND TargetTable.ROWVERSION = SourceTable.ROWVERSION ' +
'WHEN MATCHED THEN ' + @UpdateQuery +
' WHEN NOT MATCHED THEN '+@InsertQuery +';'
Execute sp_executesql @MergeQuery
If you want more information about MERGE, you could read this excellent article.
Don't feel bad. It takes time. Merge has interesting syntax. I've actually never used it. I read Microsoft's documentation on it, which is very helpful and even has examples. I think I covered everything. I think there may be a slight amount of tweaking you might have to do, but I think it should work.
Here's the documentation for MERGE:
https://msdn.microsoft.com/en-us/library/bb510625.aspx
As for your code, I commented pretty much everything to explain it and show you how to do it.
This part is to help write your merge statement
USE ThisDatabase --This says what database context to use.
--Pretty much what database you're querying.
--Like this: database.schema.objectName
GO
DECLARE
@SetColumns VARCHAR(4000) = (
SELECT CONCAT(QUOTENAME(COLUMN_NAME),' = S.',QUOTENAME(COLUMN_NAME),',',CHAR(10)) --CONCAT just says concatenate these values. It adds the strings together.
--QUOTENAME adds brackets around the column names
--CHAR(10) is a line break for formatting purposes (totally optional)
FROM INFORMATION_SCHEMA.COLUMNS
--WHERE TABLE_NAME = 'TargetTable'
FOR XML PATH('')
) --This uses some fancy XML trick to get your columns concatenated into one row.
--What really is in your table is a column of your column names in different rows.
--BTW If the column names in both tables are identical, then this will work.
DECLARE @Columns VARCHAR(4000) = (
SELECT QUOTENAME(COLUMN_NAME) + ','
FROM INFORMATION_SCHEMA.COLUMNS
--WHERE TABLE_NAME = 'TargetTable'
FOR XML PATH('')
)
SET @Columns = SUBSTRING(@Columns,0,LEN(@Columns)) -- this gets rid of the comma at the end of your list
SET @SetColumns = SUBSTRING(@SetColumns,0,LEN(@SetColumns)) --same thing here
SELECT @SetColumns --You're going to want to copy and paste this into your WHEN MATCHED statement
SELECT @Columns --You're going to want to copy this into your WHEN NOT MATCHED statement
GO
Merge Statement
Especially look at my notes on ROWVERSION.
MERGE INTO TargetTable AS T
USING SourceTable AS S --Don't really need to write SELECT * FROM since you need the whole table anyway
ON (T.ID = S.ID AND T.[ROWVERSION] = S.[ROWVERSION]) --These are your matching parameters
--One note on this, if ROWVERSION is different versions of the same data you don't want to have RowVersion here
--Like lets say you have ID 1 ROWVERSION 2 in your source but only version 1 in your targetTable
--If you leave T.ID =S.ID AND T.ROWVERSION = S.ROWVERSION, then it will insert the new ROWVERSION
--So you'll have two versions of ID 1
WHEN MATCHED THEN --When TargetTable ID and ROWVERSION match in the matching parameters
--Update the values in the TargetTable
UPDATE SET /*Copy and paste @SetColumns here*/
--Should look like this(minus the "--"):
--Col1 = S.Col1,
--Col2 = S.Col2,
--Col3 = S.Col3,
--Etc...
WHEN NOT MATCHED THEN --This says okay there are no rows with the existing ID, now insert a new row
INSERT (col1,col2,col3) --Copy and paste @Columns in between the parentheses. Should look like I show it. Note: This is insert into target table so you're listing the target table columns
VALUES (col1,col2,col3) --Same thing here. This is the list of source table columns
I have the following tables:
tbl_File:
FileID | Filename
-----------------
1 | test.jpg
and
tbl_Tag:
TagID | TagName
---------------
1 | Red
and
tbl_TagFile:
ID | TagID | FileID
-------------------
1 | 1 | 1
I need to pass a non-inclusive query against these tables. For example, imagine a list of checkboxes to select one or more tags, and then a search button. I need to pass the TagID's to the query as a PIPE delimited string, such as "1|2|5|"
The search results need to be non-inclusive, such as if it must meet all the criteria. If 3 tags are selected, the results are to be files that have all 3 tags associated with them.
I think I've made this too complicated, but tried iterating over the tags using charindex and stuff to work my way through the string, but it seems there must be an easier way.
I'd like to do this as a function... Such as
SELECT FileID, Filename
FROM tbl_Files
WHERE dbo.udf_FileExistswithTags(@Tags, FileID) = 1
Any efficient way to do this?
It doesn't sound from your example scenario that the actual "need" is to pass a pipe-delimited string. I would strongly suggest abandoning that idea and using a table-valued parameter (TVP) in your stored procedure. This has numerous advantages: you will not hit a datatype limit or a "number of parameters" limit that might occur with very large sets of criteria. Additionally, it removes any need to run a (potentially very slow) UDF.
Split the string into tokens on the application side, and then insert each token as a row in the TVP. Example below:
Create the TVP type in your database:
CREATE TYPE [dbo].[FileNameType] AS TABLE
(
fileName varchar(1000)
)
On the application side, build your list of filename tokens into a recordset:
private static List<SqlDataRecord> BuildFileNameTokenRecords(IEnumerable<string> tokens)
{
var records = new List<SqlDataRecord>();
foreach (string token in tokens){
var record = new SqlDataRecord(
new SqlMetaData[]
{
// Length matches the varchar(1000) column of dbo.FileNameType
new SqlMetaData("fileName", SqlDbType.VarChar, 1000),
}
);
record.SetString(0, token); // copy the token into the record's only column
records.Add(record);
}
return records;
}
Wherever you run your proc from (rough code here):
var records = BuildFileNameTokenRecords(listofstrings);
var sqlCmd = sqlDb.GetStoredProcCommand("FileExists");
sqlDb.AddInParameter(sqlCmd, "tvpFilenameTokens", SqlDbType.Structured, records);
ExecuteNonQuery(sqlCmd);
Filtering your select statement then simply becomes a matter of joining on the tokens in the table parameter. Something like this:
CREATE PROCEDURE dbo.FileExists
(
-- Put additional parameters here
@tvpFilenameTokens dbo.FileNameType READONLY
)
AS
BEGIN
SELECT f.FileID, f.Filename
FROM tbl_Files f INNER JOIN @tvpFilenameTokens t
ON f.Filename = t.fileName
END
Here is an option that should scale. All of the functionality is available back to SQL Server 2005. It uses a CTE to separate the portion of the query that finds only the FileIDs that have all of the TagIDs passed in, and then that list of FileIDs is joined to the [File] table to get the details. It also uses an INNER JOIN instead of an IN list to match the TagID's.
Please note that the example below uses a SQLCLR splitter that is freely available in the SQL# library (which I wrote, but this function is in the Free version). The specific splitter used is not the important part; it should just be one that is either SQLCLR, an inline tally-table (like the one used in @wewesthemenace's answer), or is the XML method. Just don't use a splitter based on a WHILE-loop or a recursive CTE.
---- TEST SETUP
DECLARE @File TABLE
(
FileID INT NOT NULL PRIMARY KEY,
[Filename] NVARCHAR(200) NOT NULL
);
DECLARE @TagFile TABLE
(
TagID INT NOT NULL,
FileID INT NOT NULL,
PRIMARY KEY (TagID, FileID)
);
INSERT INTO @File VALUES (1, 'File1.txt');
INSERT INTO @File VALUES (2, 'File2.txt');
INSERT INTO @File VALUES (3, 'File3.txt');
INSERT INTO @TagFile VALUES (1, 1);
INSERT INTO @TagFile VALUES (2, 1);
INSERT INTO @TagFile VALUES (5, 1);
INSERT INTO @TagFile VALUES (1, 2);
INSERT INTO @TagFile VALUES (2, 2);
INSERT INTO @TagFile VALUES (4, 2);
INSERT INTO @TagFile VALUES (1, 3);
INSERT INTO @TagFile VALUES (2, 3);
INSERT INTO @TagFile VALUES (5, 3);
INSERT INTO @TagFile VALUES (6, 3);
---- DONE WITH TEST SETUP
DECLARE @TagsToGet VARCHAR(100); -- this would be the proc input parameter
SET @TagsToGet = '1|2|5';
CREATE TABLE #Tags (TagID INT NOT NULL PRIMARY KEY);
DECLARE @NumTags INT;
INSERT INTO #Tags (TagID)
SELECT split.SplitVal
FROM SQL#.String_Split4k(@TagsToGet, '|', 1) split;
SET @NumTags = @@ROWCOUNT;
;WITH files AS
(
SELECT tf.FileID
FROM @TagFile tf
INNER JOIN #Tags tg
ON tg.TagID = tf.TagID
GROUP BY tf.FileID
HAVING COUNT(*) = @NumTags
)
SELECT fl.*
FROM @File fl
INNER JOIN files
ON files.FileID = fl.FileID
ORDER BY fl.[Filename] ASC;
DROP TABLE #Tags; -- don't need this if code above is placed in a proc
Results:
FileID Filename
1 File1.txt
3 File3.txt
Notes
As much as I love TVPs (and I do, when they are done correctly and used appropriately), I would say that they are a bit much for this type of small scale, single dimensional array scenario. There won't really be any performance gain over using a SQLCLR streaming TVF string splitter but it would require more app code and the additional User-Defined Table Type, which can't be updated without first dropping all procs that reference it. That doesn't happen all of the time, but needs to be considered in terms of long-term maintenance costs.
The JOIN between TagFile and the temporary table populated from the split operation should be much more efficient than using an IN list with a subquery for the split operation. An IN list is short-hand for all of the values in it to be their own OR conditions. Hence the JOIN is a fully set-based approach that lets the Query Optimizer do its thang.
The structure I used for the test @TagFile table only has the two relevant IDs in it: TagID and FileID. It does not have the ID field that I assume is an IDENTITY field on this table. Unless there is a very specific reason for needing that IDENTITY field, I would suggest removing it. It adds no inherent benefit, as the combination of TagID and FileID is a natural key (i.e. it is both NOT NULL and unique). And if the clustered PK of this table were simply those two fields, the JOIN to the temp table of those split-out TagIDs would be quite fast, even with millions of rows in TagFile.
One reason that this approach works so much better than trying to handle this via a function per FileID (outside of the obvious set-based is better than cursor-based reason) is that the list of TagIDs is the same for all files to be checked. So splitting that out more than one time is a waste of effort.
By not splitting the TagID list inline in the query I am able to capture the number of elements in that list with no additional effort. Hence this saves from needing to do a secondary calculation.
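To make the "must have all of the tags" logic concrete, here is a small Python sketch of the same idea, using the test rows from the setup above. It is only an illustration of the matching rule, not a replacement for the set-based query; `files_with_all_tags` is a made-up name.

```python
from collections import defaultdict

def files_with_all_tags(tag_file_rows, tags_param):
    """Return sorted FileIDs whose tag set contains every TagID in tags_param."""
    # Split the pipe-delimited parameter once, ignoring empty pieces
    # (e.g. from a trailing pipe as in "1|2|5|").
    wanted = {int(t) for t in tags_param.split("|") if t.strip()}
    tags_by_file = defaultdict(set)
    for tag_id, file_id in tag_file_rows:
        tags_by_file[file_id].add(tag_id)
    # A file qualifies only if the wanted set is a subset of its tags.
    return sorted(f for f, tags in tags_by_file.items() if wanted <= tags)

# (TagID, FileID) pairs copied from the test setup.
rows = [(1, 1), (2, 1), (5, 1),
        (1, 2), (2, 2), (4, 2),
        (1, 3), (2, 3), (5, 3), (6, 3)]
print(files_with_all_tags(rows, "1|2|5"))
```

File 2 drops out because it has tag 4 instead of tag 5, which mirrors what the `HAVING COUNT(*)` comparison does in the query.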
Here is a function called DelimitedSplit8K by Jeff Moden. This is used to split strings of length up to 8000. For more info, read this: http://www.sqlservercentral.com/articles/Tally+Table/72993/
CREATE FUNCTION [dbo].[DelimitedSplit8K](
@pString VARCHAR(8000), --WARNING!!! DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE!
@pDelimiter CHAR(1)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
WITH E1(N) AS (--10E+1 or 10 rows
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (
SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT 1 UNION ALL
SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString, t.N, 1) = @pDelimiter
),
cteLen(N1, L1) AS(--==== Return start and length (for use in substring)
SELECT
s.N1,
ISNULL(NULLIF(CHARINDEX(@pDelimiter, @pString, s.N1), 0) - s.N1, 8000)
FROM cteStart s
)
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
SELECT
ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = SUBSTRING(@pString, l.N1, l.L1)
FROM cteLen l
Your query would now be:
DECLARE @pString VARCHAR(8000) = '1|3|5'
SELECT
f.*
FROM tbl_File f
INNER JOIN tbl_TagFile tf ON tf.FileID = f.FileID
WHERE
tf.TagID IN(SELECT CAST(item AS INT) FROM dbo.DelimitedSplit8K(@pString, '|'))
GROUP BY f.FileID, f.FileName
HAVING COUNT(tf.ID) = (LEN(@pString) - LEN(REPLACE(@pString,'|','')) + 1)
The statement below counts the number of TagID in the parameter by counting the occurrence of the delimiter | + 1.
(LEN(@pString) - LEN(REPLACE(@pString,'|','')) + 1)
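One thing to watch with the delimiter-counting formula: it assumes the string has no trailing pipe. The question's sample input was "1|2|5|", where counting delimiters plus one gives 4 rather than 3. The short Python sketch below (function names invented for the example) shows the difference between counting delimiters and counting non-empty tokens.

```python
def count_by_delimiters(tags):
    """Mirror of LEN(s) - LEN(REPLACE(s,'|','')) + 1: number of pipes plus one."""
    return tags.count("|") + 1

def count_tokens(tags):
    """Count only the non-empty pieces between pipes."""
    return len([t for t in tags.split("|") if t])

for sample in ["1|3|5", "1|2|5|"]:
    print(sample, count_by_delimiters(sample), count_tokens(sample))
```

If the application may send a trailing pipe, trimming it before the query runs keeps the HAVING comparison correct.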
Here is an option that does not require UDF's.
It can be argued that this is also complicated.
DECLARE @TagList VARCHAR(50)
-- pass in this
SET @TagList = '1|3|6'
SELECT
FinalSet.FileID,
FinalSet.Tag,
FinalSet.TotalMatches
FROM
(
SELECT
tbl_TagFile.FileID,
tbl_TagFile.Tag,
COUNT(*) OVER(PARTITION BY tbl_TagFile.FileID) TotalMatches
FROM
(
SELECT 1 FileID, '1' Tag UNION ALL
SELECT 1 , '2' UNION ALL
SELECT 1 , '3' UNION ALL
SELECT 1 , '6' UNION ALL
SELECT 2 , '1' UNION ALL
SELECT 2 , '3'
) tbl_TagFile
INNER JOIN
(
SELECT tbl_Tag.Tag
FROM
(
SELECT '1' Tag UNION ALL
SELECT '2' UNION ALL
SELECT '3' UNION ALL
SELECT '4' UNION ALL
SELECT '5' UNION ALL
SELECT '6'
) tbl_Tag
WHERE '|' + @TagList + '|' LIKE '%|' + Tag + '|%'
) LimitedTagTable
ON LimitedTagTable.Tag = tbl_TagFile.Tag
) FinalSet
WHERE
FinalSet.TotalMatches = (LEN(@TagList) - LEN(REPLACE(@TagList,'|','')) + 1)
There are some complications in this around data types, indexes and so on, but you can see the concept: you only get the records that match your passed-in string.
subtable LimitedTagTable is your tag list filtered by your input pipe delimited string
subtable FinalSet joins your limited tag list to your list of files
column TotalMatches works out how many tag matches your file had
Finally this line limits the output to those files that had enough matches:
FinalSet.TotalMatches = (LEN(@TagList) - LEN(REPLACE(@TagList,'|','')) + 1)
Please experiment with different inputs and datasets and see if it suits as I have made a number of assumptions.
I'm answering my own question, in hopes that someone can let me know if/how flawed it is. So far it seems to be working but just early testing.
Function:
ALTER FUNCTION [dbo].[udf_FileExistsByTags]
(
@FileID int
,@Tags nvarchar(max)
)
RETURNS bit
AS
BEGIN
DECLARE @Exists bit = 0
DECLARE @Count int = 0
DECLARE @TagTable TABLE ( FileID int, TagID int )
DECLARE @Tag int
WHILE len(@Tags) > 0
BEGIN
SET @Tag = CAST(LEFT(@Tags, charindex('|', @Tags + '|') -1) as int)
SET @Count = @Count + 1
IF EXISTS (SELECT * FROM tbl_FileTag WHERE FileID = @FileID AND TagID = @Tag )
BEGIN
INSERT INTO @TagTable ( FileID, TagID ) VALUES ( @FileID, @Tag )
END
SET @Tags = STUFF(@Tags, 1, charindex('|', @Tags + '|'), '')
END
SET @Exists = CASE WHEN @Count = (SELECT COUNT(*) FROM @TagTable) THEN 1 ELSE 0 END
RETURN @Exists
END
Then in the query:
SELECT * FROM tbl_File a WHERE dbo.udf_FileExistsByTags(a.FileID, @Tags) = 1
So now I'm looking for errors.
What do you think? Probably not very efficient; however, this search will be used only on a periodic basis.
This may be a bit of a noob question, but is there a nice simple way of inserting ROWS from a select statement into COLUMNS of another table?
I'm not just talking about doing an INSERT / SELECT.
I have a function which splits some CSV into rows. So say I have two rows of data like this
joe,bloggs,joe.bloggs@domain.fake;jane,soap,jane.soap@domain.notreal;
I split first by semi-colon, then by comma
Result of first split call
Id Data
1 joe,bloggs,joe.bloggs@domain.fake
2 jane,soap,jane.soap@domain.notreal
Then on each of these I run the split function again
Id Data
1 Joe
2 Bloggs
3 joe.bloggs@domain.fake
With this returned data, I want to do an insert statement that looks like this
INSERT INTO Customers (@first,@last,@email)
SELECT [row1].[col2],[row2].[col2],[row3].[col2]
Is there any simple way to do this?
An easy way to load all the data from a CSV file into a database table is to use the BULK INSERT statement. All you need to do is use the correct parameters (in your case FIELDTERMINATOR = ',', ROWTERMINATOR = ';').
BULK INSERT Customers
FROM 'c:\split.csv'
WITH
(FIELDTERMINATOR = ',',
ROWTERMINATOR = ';'
)
GO
After the first split you have the data in table format. You can then use the XML method, which is a great way of splitting a column holding a delimited string into multiple columns:
INSERT dbo.Customers([first], [last], [email])
SELECT Split.a.value('/M[1]', 'VARCHAR(100)' ) AS [first],
Split.a.value('/M[2]', 'VARCHAR(100)' ) AS [last],
Split.a.value('/M[3]', 'VARCHAR(100)' ) AS [email]
FROM (SELECT CAST('<M>' + REPLACE(Data , ',' , '</M><M>' ) + '</M>' AS XML) AS xmlData
FROM dbo.testCSV
) AS x CROSS APPLY XMLDATA.nodes('.') AS Split(a)
Demo on SQLFiddle
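If the raw string is available on the application side before it reaches the database, the two-level split (semicolon for rows, comma for columns) can be sketched as below. This is an illustrative Python snippet only; `parse_customers` is a hypothetical helper and it assumes exactly three fields per record with no embedded commas or semicolons.

```python
def parse_customers(raw):
    """Split 'first,last,email;first,last,email;' into (first, last, email) tuples."""
    rows = []
    for record in raw.split(";"):
        record = record.strip()
        if not record:
            continue  # skip the empty piece after the trailing ';'
        parts = [p.strip() for p in record.split(",")]
        if len(parts) != 3:
            raise ValueError(f"expected 3 fields, got {len(parts)}: {record!r}")
        rows.append(tuple(parts))
    return rows

customers = parse_customers(
    "joe,bloggs,joe.bloggs@domain.fake;jane,soap,jane.soap@domain.notreal;"
)
for first, last, email in customers:
    print(first, last, email)
```

Each tuple then maps directly onto one parameterized INSERT into Customers ([first], [last], [email]).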