Remove lines recursively from a SQL TEXT column - sql-server

I have a SQL Text column that holds a block of text. Several of its lines are not relevant, and I need to remove only those lines.
Example - This is all one column value:
Header_ID askdjfhklasjdhfklajhfwoi fhweiohrognfk
ABC
SECTION_ID asdfhkwjehfi efjhewiu1382204 3904834
123
SECTION_ID deihefgjkahf dfjsdhfkl edfashldfkljh
So basically I need to remove all lines that start with Header_ID or Section_ID, and the output text I need is just
ABC
123
The only thing constant about these lines is the first word it starts with and depending on that I need to remove the whole line.

Here is a solution. Details about how it works are below. Note this solution needs MSSQL 2017+ to work.
-- Place the raw string value as varchar data in a variable so it is convenient to work with:
declare @rawValue varchar(max) = 'Header_ID askdjfhklasjdhfklajhfwoi fhweiohrognfk
ABC
SECTION_ID asdfhkwjehfi efjhewiu1382204 3904834
123
SECTION_ID deihefgjkahf dfjsdhfkl edfashldfkljh';
-- Split into lines, drop the unwanted ones, and re-join the rest into another variable:
declare @convertedValue varchar(max) =
(
    select string_agg(value, char(13) + char(10))
    from string_split(@rawValue, char(10))
    where value not like 'header_id%' and value not like 'section_id%'
);
-- Display converted value.
select @convertedValue;
The magic begins with the string_split() function which produces a table value. It detects the line feed character, char(10), and splits the multi-line string into a table with each line from the string in a separate row.
Next, we filter out the rows from the table that we don't want. These rows begin with the known substrings header_id and section_id. This is accomplished in the where clause.
Lastly, for the output, we use string_agg() and aggregate the remaining rows (the lines we do want) back into a string with the individual values delimited by a combination of the carriage return char(13) and line feed char(10) characters.
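One caveat worth checking, sketched here under assumptions: if the stored text uses CRLF line endings, splitting on char(10) alone leaves a trailing char(13) on each piece, and re-joining with CRLF would then double the carriage returns. The variant below strips that leftover character first. The sample string is made up for illustration. Also note that string_split() is not documented to return rows in input order (SQL Server 2022 adds an optional ordinal output), so verify the ordering on your own data.

```sql
-- Hypothetical sample with explicit CRLF line endings.
declare @rawValue varchar(max) =
    'Header_ID aaa' + char(13) + char(10) +
    'ABC'           + char(13) + char(10) +
    'SECTION_ID bbb' + char(13) + char(10) +
    '123';

-- Split on LF, drop the unwanted lines, remove any leftover CR,
-- then re-join the remaining lines with CRLF.
select string_agg(replace(value, char(13), ''), char(13) + char(10)) as convertedValue
from string_split(@rawValue, char(10))
where value not like 'header_id%'
  and value not like 'section_id%';
```

The LIKE patterns match regardless of case only under a case-insensitive collation, which is the common default but still an assumption here.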

Since I am using SQL Server 2016 and not 2017, this is what I did to resolve the issue: first I broke the data into multiple rows (CROSS APPLY with the split function) using CHAR(13) as the delimiter, then kept only the rows that did not start with Header_ID, Section_ID, etc., and finally rebuilt the text block using STUFF.
Thanks again @otto for the resolution.
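The SQL Server 2016 approach described above might look roughly like the following. This is a sketch, not the poster's actual code: it assumes the built-in string_split() (available at compatibility level 130) rather than a custom split function, uses a made-up sample string, and relies on the common STUFF plus FOR XML PATH idiom to re-join the rows. Row order is not guaranteed by string_split(), so check it against real data.

```sql
-- Hypothetical sample with explicit CRLF line endings.
declare @rawValue varchar(max) =
    'Header_ID aaa' + char(13) + char(10) +
    'ABC'           + char(13) + char(10) +
    'SECTION_ID bbb' + char(13) + char(10) +
    '123';

select stuff(
    (
        -- Prefix every kept line with CRLF; the first prefix is removed by STUFF.
        select char(13) + char(10) + l.line
        from string_split(@rawValue, char(13)) as s
        -- Splitting on CR leaves a leading LF on later pieces, so strip it.
        cross apply (select replace(s.value, char(10), '') as line) as l
        where l.line not like 'header_id%'
          and l.line not like 'section_id%'
        for xml path(''), type
    ).value('.', 'varchar(max)'),
    1, 2, '') as convertedValue;
```

The TYPE directive together with .value() is used so that characters such as CR, < and & come back as plain text instead of XML entities.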

Related

Check if one of item in comma delimited string exist in another comma delimited string

I can check for a single string in a comma-delimited string, for example finding a varchar data field that contains a single value such as 1 or 2 or 5 in the comma-delimited string '1,2,3,4,5', as described here: https://stackoverflow.com/a/49202026/1830909. But I'm wondering how to do the check when the string being compared is not a single value and is itself a comma-delimited string. For example, the data field is a varchar containing a comma-delimited string like '1,3,4', and I want to check whether any of its items (1 or 3 or 4) exists in the comma-delimited string '1,2,3,4,5'. I hope that clarifies it. Any help appreciated.
Clarifying: Although "keeping delimited strings in a column is a bad
idea", I think it doesn't matter when the largest value contains
fewer than 15 items. In some situations I already have too many tables
and don't want to add more. The other reason is similar to using JSON
for transferring data: pack all values into one delimited string, save
it to one column of a DB table, pull it from the DB as a string, and
parse it back into the separate values.
You need a string splitter (AKA tokenizer). In SQL Server 2016+ you can use string_split; pre-2016 I suggest DelimitedSplit8K. This code returns 1 if there is a matching value and 0 otherwise.
DECLARE
    @string1 varchar(100) = '1,32,2',
    @string2 varchar(100) = '1,2,3,4,5';
SELECT matchingValue = ISNULL(MAX(1),0)
FROM string_split(@string1,',')
WHERE [value] IN (SELECT [value] FROM string_split(@string2,','));
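Since the question is about a data field rather than a variable, the same idea can be applied per row. The following is only a sketch: the table variable @items, its columns, and the sample values are hypothetical and not from the question.

```sql
-- Hypothetical table holding comma-delimited values in a varchar column.
declare @items table (Id int, Codes varchar(100));
insert into @items (Id, Codes) values
    (1, '1,3,4'),
    (2, '7,8'),
    (3, '9,5');

declare @allowed varchar(100) = '1,2,3,4,5';

-- Return the rows where at least one item in Codes also appears in @allowed.
select i.Id, i.Codes
from @items as i
where exists
(
    select 1
    from string_split(i.Codes, ',') as c
    where c.value in (select a.value from string_split(@allowed, ',') as a)
);
```

With this sample data, rows 1 and 3 would be expected to match. Items are compared as exact strings, so values stored with spaces after the commas (for example '1, 3') would need trimming first.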

FOR XML PATH always adds trailing space to value

Using the FOR XML PATH structure to create a list of values,
I find that (annoyingly) it always adds a trailing space to selected values.
This ruins my attempts at providing my own delimiters - the trailing space is added after the column and delimiters have been concatenated.
For example:
SELECT country + '-' FROM countryTable...
results in the following string:
china- france- england-
Has anyone else seen this, and is there a way to stop it?
I don't think TRIM() will work, as that would be applied before the extra space is inserted...
I'm using SQL Server 2016.
Thanks
Ok, thanks to John C and his sample query I found the culprit.
I had an AS [data()] clause after the column name/delimiter.
Removing that removed the trailing space.
I don't know how/why but it did...
I suspect the data inside the country column. What if each value in the country column has a leading space? FOR XML PATH does not add any space to the data.
Try this:
SELECT RTRIM(LTRIM(country)) + '-' FROM countryTable...
You may have leading/trailing spaces and/or CRLFs. Perhaps this will help
Declare @countryTable table (country varchar(100))
Insert Into @countryTable values
(' china'), -- leading space
(char(13)+'france'), -- leading char(13)
(char(10)+'england') -- leading char(10)
Select Value=Stuff((Select Distinct '-' + ltrim(rtrim(replace(replace(country,char(13),''),char(10),'')))
From @countryTable
Where 1=1
For XML Path ('')),1,1,'')
Returns
Value
china-england-france
To add to this, on FOR XML PATH ... AS [data()], from the MS help:
If the path specified as column name is data(), the value is treated as an atomic value in the generated XML. A space character is added to the XML if the next item in the serialization is also an atomic value. This is useful when you are creating list typed element and attribute values.
When you write ... AS something instead, something is used as the opening/closing markup tag for each selected value.
Addendum 2: it is possible to concatenate several fields from each row in the SELECT clause. Values of non-string types must first be converted to a string type with CAST.
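The difference described above can be seen side by side with a small experiment. This is a sketch using a hypothetical table variable with clean values (no leading or trailing whitespace), so any space in the output comes from the [data()] alias and not from the data.

```sql
-- Hypothetical sample data with no stray whitespace.
declare @countryTable table (country varchar(100));
insert into @countryTable (country) values ('china'), ('france'), ('england');

-- Aliased as [data()]: each value is an atomic value, so a space is
-- inserted between consecutive values.
select country + '-' as [data()]
from @countryTable
for xml path('');

-- No alias: the values are concatenated as text with nothing in between.
select country + '-'
from @countryTable
for xml path('');
```

The first query should produce the space-separated form reported in the question (china- france- england-), and the second the form without spaces. Row order is not guaranteed without an ORDER BY.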

How to strip characters out of Temp Table after Bulk Insert

I am trying to remove some very annoying inline characters from my .csv file. I need to strip out ! CR LF because these are junking up my import. I have a proc to try to get rid of the crap but it doesn't seem to work. Here's the code:
CREATE TABLE #Cleanup
(
SimpleData nvarchar(MAX)
)
BULK INSERT #Cleanup from '**********************\myimport.csv'
SELECT * FROM #Cleanup
DECLARE @ReplVar nvarchar(MAX)
SET @ReplVar = CONCAT(char(33),char(10),char(13),char(32))
UPDATE #Cleanup SET SimpleData = replace([SimpleData], @ReplVar,'') from #Cleanup
SELECT * FROM #Cleanup
My plan is if the goofy line break gets removed, the second select should not have it in there. The text looks like
js5t,1599,This is this and that is t!
hat,asdf,15426
when that line should read
js5t,1599,This is this and that is that,asdf,15426
See my quandary? Once the sequential characters !crlfsp are removed, I will take that temp table and feed it into the working one.
Edit to show varbinary data:
`0x31003700360039002C004300560045002D0032003000310035002D0030003000380035002C0028004D005300310035002D00300032003200290020004D006900630072006F0073006F006600740020004F006600660069006300650020004D0065006D006F00720079002000480061006E0064006C0069006E0067002000520065006D006F0074006500200043006F0064006500200045007800650063007500740069006F006E0020002800330030003300380039003900390029002C004D006900630072006F0073006F00660074002C00570069006E0064006F007700730020004F0053002C0053006F006600740077006100720065002000550070006700720061006400650020006F0072002000500061007400630068002C002C0042006100730065006C0069006E0065002C002C00310035002D003100320035002C0039002E0033002C0048006900670068002C0022005500730065002D00610066007400650072002D0066007200650065002000760075006C006E00650072006100620069006C00690074007900200069006E0020004D006900630072006F0073006F006600740020004F00660066006900630065002000320030003000370020005300500033002C00200045007800630065006C002000320030003000370020005300500033002C00200050006F0077006500720050006F0069006E0074002000320030003000370020005300500033002C00200057006F00720064002000320030003000370020005300500033002C0020004F00660066006900630065002000320030003100300020005300500032002C00200045007800630065006C002000320030003100300020005300500032002C00200050006F0077006500720050006F0069006E0074002000320030003100300020005300500032002C00200057006F00720064002000320030003100300020005300500032002C0020004F006600660069006300650020003200300031003300200047006F006C006400200061006E00640020005300500031002C00200057006F007200640020003200300031003300200047006F006C006400200061006E00640020005300500031002C0020004F006600660069006300650020003200300031003300200052005400200047006F006C006400200061006E00640020005300500031002C00200057006F007200640020003200300031003300200052005400200047006F006C006400200061006E00640020005300500031002C00200045007800630065006C0020005600690065007700650072002C0020004F0066006600690063006500200043006F006D007000610074006900620069006C0069007400790020005000610063006B0020005300500033002
C00200057006F007200640020004100750074006F006D006100740069006F006E0020005300650072007600690063006500730020006F006E0020005300680061007200650050006F0069006E00740020005300650072007600650072002000320030003100300020005300500032002C00200045007800630065006C0020005300650072007600690063006500730020006F006E0020005300680061007200650050006F0069006E007400200053006500720076006500720020003200300031003300200047006F006C006400200061006E00640020005300500031002C00200057006F007200640020004100750074006F006D006100740069006F006E0020005300650072007600690063006500730020006F006E0020005300680061007200650050006F0069006E007400200053006500720076006500720020003200300031003300200047006F006C006400200061006E00640020005300500031002C00200057006500620020004100700070006C00690063006100740069006F006E0073002000320030003100300020005300500032002C0020004F006600660069006300650020005700650062002000410070007000730020005300650072007600650072002000320030003100300020005300500032002C00200057006500620020004100700070007300200053006500720076006500720020003200300031003300200047006F006C006400200061006E00640020005300500031002C0020005300680061007200650050006F0069006E00740020005300650072007600650072002000320030003000370020005300500033002C002000570069006E0064006F007700730020005300680061007200650050006F0069006E007400200053006500720076006900630065007300200033002E00300020005300500033002C0020005300680061007200650050006F0069006E007400200046006F0075006E0064006100740069006F006E002000320030003100300020005300500032002C0020005300680061007200650050006F0069006E00740020005300650072007600650072002000320030003100300020005300500032002C0020005300680061007200650050006F0069006E007400200046006F0075006E0064006100740069006F006E0020003200300031003300200047006F006C006400200061006E00640020005300500031002C00200061006E00640020005300680061007200650050006F0069006E007400200053006500720076006500720020003200300031003300200047006F006C006400200061006E0064002000530050003100200061006C006C006F00770073002000720065006D006F002100
`0x2000740065002000610074007400610063006B00650072007300200074006F00200065007800650063007500740065002000610072006200690074007200610072007900200063006F00640065002000760069006100200061002000630072006100660074006500640020004F0066006600690063006500200064006F00630075006D0065006E0074002C00200061006B0061002000220022004D006900630072006F0073006F006600740020004F0066006600690063006500200043006F006D0070006F006E0065006E0074002000550073006500200041006600740065007200200046007200650065002000560075006C006E00650072006100620069006C006900740079002E00220022002200
@seagulledge, in a comment on the question, is correct, or at least partially correct, in stating that the CHAR(10) and CHAR(13) are out of order. A carriage-return (CR) is CHAR(13) and a line-feed (LF) is CHAR(10).
HOWEVER, the main thing preventing this from working is not the order of those two characters: it is the simple fact that the newlines -- whether they are \r\n or just \n -- are in the incoming CSV file, and hence the BULK INSERT command is assuming that the newlines are separating input rows (which makes sense for it to do). This can be seen looking at the VARBINARY output in the question. There are two rows of output, both starting with 0x.
This problem can only be solved by fixing the incoming CSV file prior to calling BULK INSERT. That way the erroneously embedded newlines will be removed such that each row imports as a single row into the temp table.
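For reference, the character-order point can be checked on a single in-memory string. This sketch only illustrates the CR/LF order inside REPLACE; it does not address the row-splitting done by BULK INSERT, which still requires fixing the file first. The sample value is made up and assumes the break is exactly "!" followed by CR, LF, and a space.

```sql
-- Hypothetical value containing the unwanted sequence: ! CR LF space
declare @sample nvarchar(max) =
    N'This is this and that is t!' + nchar(13) + nchar(10) + N' hat,asdf,15426';

-- CR (13) must come before LF (10) to match a Windows-style line break.
declare @ReplVar nvarchar(max) = concat(nchar(33), nchar(13), nchar(10), nchar(32));

select replace(@sample, @ReplVar, N'') as cleaned;
-- cleaned: This is this and that is that,asdf,15426
```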

T-SQL Extract portion of xml or nvarchar(max) column matching pattern

I'd like to do something that I think is fairly trivial using T-SQL / SQL Server 2008 R2, but I can't seem to figure out a way.
If I were in Java, C#, C++, whatever, I would do:
Find position of first occurrence of '123' in string
Execute substring operation from that position getting next 50 characters
So, in SQL Server, I'd basically like:
Find all rows where column (X) contains said string (basically a LIKE clause)
Return 50 characters from that column starting at the said string's location.
Can I do this somehow? I can cast an XML column to nvarchar(max), do a like operation, and do a substring operation, I don't know how to get the position of the said string in the column in the first place though.
Sample content requested in comment
CREATE TABLE SampleTable(xmlData xml);
Pretend the value in one of SampleTable's xmlData rows is as follows. I would like to, for debugging purposes, extract the string from the funny unicode Þ character forward 50 characters (or to the end of the file if that's less than 50).
<RootNode>
<Row>
<NestedNode1>
some text.
</NestedNode1>
<NestedNode2>
123456
</NestedNode2>
<NestedNode3>
Þ Some crazy name with unicode letters. Þ
</NestedNode3>
</Row>
</RootNode>
Are you looking for CHARINDEX?
;WITH CTE AS(
SELECT CAST (xmlData as nvarchar(max)) as X
FROM SampleTable
)
SELECT SUBSTRING(X,CHARINDEX(N'Þ',X),50) as [String]
FROM CTE
WHERE CHARINDEX(N'Þ',X)>0

How can I make SQL Server return FALSE for comparing varchars with and without trailing spaces?

If I deliberately store trailing spaces in a VARCHAR column, how can I force SQL Server to see the data as mismatch?
SELECT 'foo' WHERE 'bar' = 'bar '
I have tried:
SELECT 'foo' WHERE LEN('bar') = LEN('bar ')
One method I've seen floated is to append a specific character to the end of every string then strip it back out for my presentation... but this seems pretty silly.
Is there a method I've overlooked?
I've noticed that it does not apply to leading spaces so perhaps I run a function which inverts the character order before the compare.... problem is that this makes the query unSARGable....
From the docs on LEN (Transact-SQL):
Returns the number of characters of the specified string expression, excluding trailing blanks. To return the number of bytes used to represent an expression, use the DATALENGTH function
Also, from the support page on How SQL Server Compares Strings with Trailing Spaces:
SQL Server follows the ANSI/ISO SQL-92 specification on how to compare strings with spaces. The ANSI standard requires padding for the character strings used in comparisons so that their lengths match before comparing them.
Update: I deleted my code using LIKE (which does not pad spaces during comparison) and DATALENGTH() since they are not foolproof for comparing strings
This has also been asked in a lot of other places as well for other solutions:
SQL Server 2008 Empty String vs. Space
Is it good practice to trim whitespace (leading and trailing)
Why would SqlServer select statement select rows which match and rows which match and have trailing spaces
You could try something like this:
declare @a varchar(10), @b varchar(10)
set @a='foo'
set @b='foo '
select @a, @b, DATALENGTH(@a), DATALENGTH(@b)
Sometimes the dumbest solution is the best:
SELECT 'foo' WHERE 'bar' + 'x' = 'bar ' + 'x'
So basically append any character to both strings before making the comparison.
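The behaviours discussed so far can be checked in one query. This is a small sketch using varchar literals only; the column aliases are arbitrary.

```sql
select
    -- Trailing spaces are ignored by =, so this comparison is true.
    case when 'bar' = 'bar ' then 1 else 0 end             as equals_result,     -- 1
    -- LEN excludes trailing blanks.
    len('bar ')                                            as len_result,        -- 3
    -- DATALENGTH counts bytes, including the trailing space.
    datalength('bar ')                                     as datalength_result, -- 4
    -- After appending a character the space is no longer trailing.
    case when 'bar' + 'x' = 'bar ' + 'x' then 1 else 0 end as appended_result;   -- 0
```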
After some searching, the simplest solution I found was on Anthony Bloesch's weblog.
Just append some text (a single character is enough) to the end of the data:
SELECT 'foo' WHERE 'bar' + 'BOGUS_TXT' = 'bar ' + 'BOGUS_TXT'
Also works for 'WHERE IN'
SELECT <columnA>
FROM <tableA>
WHERE <columnA> + 'BOGUS_TXT' in ( SELECT <columnB> + 'BOGUS_TXT' FROM <tableB> )
The approach I’m planning to use is to use a normal comparison which should be index-keyable (“sargable”) supplemented by a DATALENGTH (because LEN ignores the whitespace). It would look like this:
DECLARE @testValue VARCHAR(MAX) = 'x';
SELECT t.Id, t.Value
FROM dbo.MyTable t
WHERE t.Value = @testValue AND DATALENGTH(t.Value) = DATALENGTH(@testValue)
It is up to the query optimizer to decide the order of filters, but it should choose to use an index for the data lookup if that makes sense for the table being tested and then further filter down the remaining result by length with the more expensive scalar operations. However, as another answer stated, it would be better to avoid these scalar operations altogether by using an indexed calculated column. The method presented here might make sense if you have no control over the schema, or if you want to avoid creating the calculated columns, or if creating and maintaining the calculated columns is considered more costly than the worse query performance.
I've only really got two suggestions. One would be to revisit the design that requires you to store trailing spaces - they're always a pain to deal with in SQL.
The second (given your SARG-able comments) would be to add a computed column to the table that stores the length, and add this column to appropriate indexes. That way, at least, the length comparison should be SARG-able.
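A minimal sketch of that computed-column idea follows. The table definition, column sizes, and index name are assumptions made for illustration (the thread only mentions dbo.MyTable with Id and Value), and a temp table is used so the script does not touch an existing schema.

```sql
-- Hypothetical table; ValueLength is a persisted computed column that
-- records the byte length of Value, including trailing spaces.
create table #MyTable
(
    Id          int identity(1, 1) primary key,
    Value       varchar(100) not null,
    ValueLength as datalength(Value) persisted
);

-- Index covering both the value and its stored length.
create index IX_MyTable_Value_Length on #MyTable (Value, ValueLength);

insert into #MyTable (Value) values ('x'), ('x '), ('y');

declare @testValue varchar(100) = 'x';

-- The equality ignores trailing spaces; the length check restores the distinction.
select t.Id, t.Value
from #MyTable as t
where t.Value = @testValue
  and t.ValueLength = datalength(@testValue);

drop table #MyTable;
```

With this sample data only the row holding 'x' without the trailing space should be returned. Whether the optimizer actually uses the index depends on the table and its statistics.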
