SQL Select statement until a character - sql-server

I'm looking to extract all the text up until a '\' (backslash).
The substring is required to remove all preceding characters (17 in total), so I would like to return everything after the 17th character until it comes across a backslash.
I've tried using CHARINDEX, but it doesn't seem to stop at the \; it returns characters afterwards. My code is as follows:
SELECT path, substring(path,17, CHARINDEX('\',Path)+ LEN(Path)) As Data
FROM [Table].[dbo].[Projects]
WHERE Path like '\ENQ%\' AND
Deleted = '0'
Example
The screenshot below shows the basic query and result, i.e. the whole string.
I then use SUBSTRING to remove the first X characters, as there will always be the same number of preceding characters.
But what I'm actually after (based on the above result) is the "Testing 1", "Testing 2" and "Testing ABC" sections.

The substring is required to remove all preceding characters (17 in total), so I would like to return everything after the 17th character until it comes across a backslash.
select
substring(path, 17, CHARINDEX('\', Path, 17) - 17)
from
table
To overcome the "Invalid length parameter passed to the LEFT or SUBSTRING function" error (raised when no backslash is found at or after position 17), you can use CASE:
select
substring(path, 17,
    CASE WHEN CHARINDEX('\', Path, 17) > 0
         THEN CHARINDEX('\', Path, 17) - 17   -- stop at the next backslash
         ELSE LEN(Path)                       -- no backslash after position 17: take the rest
    END
)
from
table
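As a quick sanity check with a made-up path (the real values are only visible in the screenshot), a derived table built with VALUES can exercise the expression; the hypothetical path below is constructed so the wanted text starts at position 17:
SELECT SUBSTRING(path, 17,
    CASE WHEN CHARINDEX('\', path, 17) > 0
         THEN CHARINDEX('\', path, 17) - 17
         ELSE LEN(path)
    END) AS Data
FROM (VALUES ('\ENQ\Project001\Testing 1\Docs')) v(path)
-- returns 'Testing 1'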

Related

substring to retrieve a date between two characters

Could someone shed some light on how I could extract the date field from the string below?
'NA,Planning-Completed=17-07-2019 10:38,Print-Dispatch-Date=10-02-2020 13:06,Award-Complete=NA'
Expected Output:
10-02-2020
This is what I have written so far; it seems to work sometimes, but I noticed some rows are incorrect.
[Dispatch Date] = SUBSTRING(dates,CHARINDEX('Print-Dispatch-Date=',dates) + LEN('Print-Dispatch-Date='), LEN(dates) )
The 3rd parameter for SUBSTRING is wrong. You have it as LEN(dates), which means it returns every character after 'Print-Dispatch-Date='; in this case that is '10-02-2020 13:06,Award-Complete=NA'.
As your date is 10 characters long, just use 10:
SELECT dates, SUBSTRING(dates,CHARINDEX('Print-Dispatch-Date=',dates) + LEN('Print-Dispatch-Date='), 10)
FROM (VALUES('NA,Planning-Completed=17-07-2019 10:38,Print-Dispatch-Date=10-02-2020 13:06,Award-Complete=NA'))V(dates)
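If the date's length can't be relied on, the length can also be computed from the position of the space that follows it (a sketch, assuming a space and a time always follow the date):
SELECT dates, SUBSTRING(dates, s.pos, CHARINDEX(' ', dates, s.pos) - s.pos) AS [Dispatch Date]
FROM (VALUES('NA,Planning-Completed=17-07-2019 10:38,Print-Dispatch-Date=10-02-2020 13:06,Award-Complete=NA'))V(dates)
CROSS APPLY (VALUES (CHARINDEX('Print-Dispatch-Date=', dates) + LEN('Print-Dispatch-Date='))) s(pos)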

Underscore in Where clause yields unexpected result, why? [duplicate]

I want to find all tables starting with TB_, hence I've written the following script:
select *
from INFORMATION_SCHEMA.TABLES
where TABLE_NAME like 'TB_%'
To my surprise, I got the following result:
TB103_xxx
TB037_bbb
TB104_ccc
I'm curious why?
The underscore means "any single character" when combined with LIKE. See
MSDN - LIKE (Transact-SQL)
% - Any string of zero or more characters.
_ - Any single character. _a will match aa, ba etc.
[ ] - Any single character within the specified range ([a-f]) or set ([abcdef]).
[^] - Any single character not within the specified range ([^a-f]) or set ([^abcdef]).
You could use [_] to match an underscore, so LIKE 'TB[_]%'.
Or you could use LIKE 'TB\_%' ESCAPE '\'. (thanks to Jeroen Mostert)
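Applied to the original query, either form restricts the match to table names that really start with TB_ (a quick check, reusing the query from the question):
select *
from INFORMATION_SCHEMA.TABLES
where TABLE_NAME like 'TB[_]%'
-- or: where TABLE_NAME like 'TB\_%' escape '\'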
This is because you used the underscore (_) symbol, which matches any single character in a LIKE pattern.
See the SQL LIKE Operator documentation.
Instead, use WHERE TABLE_NAME LIKE 'TB[_]%' or WHERE TABLE_NAME LIKE 'TB\_%' ESCAPE '\'.
% - The percent sign represents zero, one, or multiple characters.
_ - The underscore represents a single character.
[] - Any single character within the specified range ([a-f]) or set ([abcdef]).
[^] - Any single character not within the specified range ([^a-f]) or set ([^abcdef]).
Here are some examples:
WHERE CustomerName LIKE 'a%' Finds any values that start with "a"
WHERE CustomerName LIKE '%a' Finds any values that end with "a"
WHERE CustomerName LIKE '%or%' Finds any values that have "or" in any position
WHERE CustomerName LIKE '_r%' Finds any values that have "r" in the second position
WHERE CustomerName LIKE 'a_%' Finds any values that start with "a" and are at least 2 characters in length
WHERE CustomerName LIKE 'a%o' Finds any values that start with "a" and end with "o"
WHERE CustomerName LIKE '[a-e]arsen' Finds any values that end with "arsen" and start with any single character between "a" and "e"
WHERE CustomerName LIKE '[^a-e]arsen' Finds any values that end with "arsen" and start with any single character that isn't between "a" and "e".

SQL: Fix for CSV import mistake

I have a database that has multiple columns populated with various numeric fields. While trying to populate it from a CSV, I must have mucked up assigning the delimited fields. The end result is a column containing its correct information, but also the next column's data, separated by a comma.
So instead of column UPC1 containing "958634", it contains "958634,95877456". The "95877456" is supposed to be in the UPC2 column; instead UPC2 is NULL.
Is there a way for me to split on the comma and send the data to UPC2 while keeping the UPC1 data before the comma intact?
Thanks.
You can do this with string functions. To query the values and verify the logic, try this:
SELECT
LEFT(UPC1, CHARINDEX(',', UPC1) - 1),
SUBSTRING(UPC1, CHARINDEX(',', UPC1) + 1, 1000)
FROM myTable;
If the result is what you want, turn it into an update:
UPDATE myTable SET
UPC1 = LEFT(UPC1, CHARINDEX(',', UPC1) - 1),
UPC2 = SUBSTRING(UPC1, CHARINDEX(',', UPC1) + 1, 1000);
The expression for UPC1 takes the left side of UPC1 up to one character before the comma.
The expression for UPC2 takes the remainder of the UPC1 string starting one character after the comma.
The third argument to SUBSTRING needs some explaining. It's the number of characters you want to include after the starting position (which in this case is one character after the comma's location). If you specify a value that's longer than the remaining string, SUBSTRING simply returns everything up to the end of the string. Using 1000 here is a lot easier than calculating the exact number of characters you need to get to the end.
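One caveat: if some rows were imported correctly and contain no comma, CHARINDEX returns 0 and LEFT(..., -1) raises the invalid length error, so it's safer to restrict the update to the affected rows (a sketch using the table and column names from above):
UPDATE myTable SET
UPC1 = LEFT(UPC1, CHARINDEX(',', UPC1) - 1),
UPC2 = SUBSTRING(UPC1, CHARINDEX(',', UPC1) + 1, 1000)
WHERE UPC1 LIKE '%,%';   -- only touch rows that actually contain a comma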

Organizing and Searching for (dates , strings , countries) in matlab

I just finished a project in R and am now doing some work in MATLAB.
I need to make 3 vectors:
DOD
Country
Age
These should be counted and stored from a .txt list with 236 data points. The data in the text file looks like this:
Unknown woman
Cause of death: found dead, with eyes removed.
Location of death: Jardim dos Ipês Itaquaquecetuba, São Paulo, Brazil
Date of death: August 9th, 2014
Cris
Cause of death: multiple gunshot wounds
Location of death: Portal da Foz, Foz do Iguaçu, Brazil
Date of death: September 13th, 2014
Betty Skinner (52 years old)
Cause of death: blunt force trauma to the head
Location of death: Cleveland, Ohio, USA
Date of death: December 4th, 2013
Brittany Stergis (22 years old)
Cause of death: gunshot wound to the head
Location of death: Cleveland, Ohio, USA
Date of death: December 5th, 2013
I have no idea how to look for the strings and organize them, but would appreciate any ideas on how to get started.
You can use textscan to read the file into a cell array of strings, and then use regexp to parse the strings to get your desired fields.
First, we read the text file into a cell array of strings:
fid = fopen('deaths.txt');
scanned_fields = textscan(fid, '%s', 'Delimiter','\n');
text_array = scanned_fields{1};
fclose(fid);
While textscan is capable of some rudimentary parsing, it's not sophisticated enough for what we're doing. So we're just using it to read each line as a single string: format %s means we are expecting a string, and setting Delimiter to \n means that the strings are separated by newline characters.
Next, we can unleash the awesome power of regular expressions to parse your string of dead women:
format = {
'(?<name>[ \w]*)'
' \('
'(?<age>[\d]*)'
' years old\) - Cause of death: '
'(?<cause>[ \w]*)'
' - Location of death: '
'(?<city>[ \w]*)'
', '
'(?<province>[ \w]*)'
', '
'(?<country>[ \w]*)'
' - Date of death: '
'(?<date>[ ,\w]*)'
};
format = [format{:}];
Here we're just defining a format string. I've broken it up like this to make it a little clearer what's going on. Let's go through it line-by-line:
(?<name>[ \w]*) The parentheses indicate that this is a chunk of text (a.k.a. a "token") that we wish to capture. The ?<name> says that we will call this token "name". Finally, the [ \w]* specifies what kind of text to match. The stuff inside the square brackets specifies which characters to look for: spaces ( ) and/or alphanumeric characters (\w). The * outside the square brackets indicates that we will accept any number of these characters.
\( Next we are looking for a space and an open parenthesis. The backslash in front of the parenthesis is to indicate that we are looking for a literal parenthesis, i.e. this parenthesis should not be interpreted as the start of another token to capture.
(?<age>[\d]*) Another token to capture. This one is called "age" and contains any number of \d (numeric characters).
years old\) - Cause of death: More text to look for. Again, we will be matching this text, but we will not be capturing it (because it is not enclosed in parentheses).
(?<city>[ \w]*) Another token to capture. This one is called "city" and contains any number of spaces and/or alphanumeric characters.
, Comma, space
(?<province>[ \w]*), (?<country>[ \w]*) - Date of death: You get the idea
(?<date>[ ,\w]*) Our final token, called "date", which contains any number of spaces, commas, and/or alphanumeric characters.
Then we parse the strings into a struct array:
parsed_fields = regexp(text_array, format, 'names');
parsed_fields = [parsed_fields{:}]'
This is what the output should look like:
>> parsed_fields(1)
ans =
name: 'Jacqueline Cowdrey'
age: '50'
cause: 'unknown'
city: 'Worthing'
province: 'West Sussex'
country: 'United Kingdom'
date: 'November 20th, 2013'
So you can get your vector of countries pretty straightforwardly:
Country = {parsed_fields.country}';
Age is a simple numeric conversion:
Age_str = {parsed_fields.age};
Age = cellfun(@str2double, Age_str)';
Date as a string is pretty easy:
Date_str = {parsed_fields.date}';
But it's nice to have it as a MATLAB "serial date number", which allows arithmetic computations and reformatting into different representation formats. Unfortunately, having the day as "20th" instead of "20" is incompatible with the conversion functions, so we'll need to first strip off the "st", "nd", "rd", "th" from "1st", "2nd", "3rd", "4th", etc.:
Date_str = regexprep(Date_str, '(?<day>[\d]+)(st|nd|rd|th)', '$<day>');
Date_num = datenum(Date_str, 'mmmm dd, yyyy');
Some other notes:
If the file is very large, you may wish to use fgetl to read it one line at a time (and then also parse it one line at a time) rather than reading the entire file into memory as we did above.
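A minimal sketch of that line-at-a-time approach (same file name as above) could look like this:
fid = fopen('deaths.txt');
line_str = fgetl(fid);       % fgetl returns the next line without its newline,
while ischar(line_str)       % and a numeric -1 once the end of the file is reached
    % ... parse line_str here, e.g. with the regexp format above ...
    line_str = fgetl(fid);
end
fclose(fid);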
In your example, it looks like the entries are separated by an extra newline. I'm not sure if that's case in your actual data or if that's just a stackoverflow thing, but if you need to remove these newlines you can do so with:
is_empty_line = cellfun(@isempty, text_array);
text_array = text_array(~is_empty_line);
In your example, there were a lot of typos (an extra space here and there, sometimes the colons or dashes were other symbols). If these typos exist in your actual data, you will need to adjust the format specification to account for this. For example, instead of using - to match (space, dash, space), you can use \s*\W\s* to match (any number of whitespace characters, a single non-alphanumeric character, any number of whitespace characters).
If syntax like format = [format{:}]; or Country = {parsed_fields.country}'; look strange to you, these are equivalent to:
format = [format{1} format{2} format{3} ... format{end}];
Country = cell(length(parsed_fields),1);
for ii = 1:length(parsed_fields)
Country{ii} = parsed_fields(ii).country;
end
MATLAB R2014b added a new datetime class, so there may be a better way to deal with that nowadays.
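For instance, after stripping the ordinal suffixes as above, something along these lines should work on R2014b and later (a sketch, not tested against your real data):
Date_dt = datetime(Date_str, 'InputFormat', 'MMMM d, yyyy');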
Sorry about my previous answer; I had misunderstood how exactly the data is formatted.
As before, let's first read the text file into a cell array of strings:
fid = fopen('deaths.txt');
scanned_fields = textscan(fid, '%s', 'Delimiter','\n');
text_array = scanned_fields{1};
fclose(fid);
While textscan is capable of some rudimentary parsing, it's not sophisticated enough for what we're doing. So we're just using it to read each line as a single string: format %s means we are expecting a string, and setting Delimiter to \n means that the strings are separated by newline characters.
In the sample data you posted, each entry is 4 lines (name, cause, location, date) followed by an empty line. As long as we can rely on this formatting, this provides an easy way to split up the data (instead of the regexp parsing I proposed in my previous answer).
name_str_array = text_array(1:5:end);
cause_str_array = text_array(2:5:end);
loc_str_array = text_array(3:5:end);
date_str_array = text_array(4:5:end);
So for example, name_str_array is going to be every 5th line, starting with line #1. Likewise, cause_str_array is every 5th line, starting with line #2. Just be careful that there are not any extra or missing lines in the data.
Next we will parse each of these to get the information that we want. In my previous answer, I proposed parsing all of the strings at once, but I think it would be easier to understand if we went through it one entry at a time. For example, let's consider the first entry.
name_str = name_str_array{1};
loc_str = loc_str_array{1};
date_str = date_str_array{1};
Let's start with the easiest one: parsing the date.
date_format = 'Date of death:\s*(?<date>.*)';
parsed_fields = regexp(date_str, date_format, 'names');
DOD = parsed_fields.date;
The format we're looking for is the string Date of death:, followed by any number of whitespace characters (\s*), followed by the chunk of text (aka "token") that we wish to capture: (?<date>.*)
The parentheses indicate that this is a token we wish to capture, the ?<date> indicates that we wish to call this token "date", and the .* specifies which characters to look for. The . is the universal wildcard, i.e. it matches all possible characters. The * indicates that we are interested in any number of repeats. So in essence, this .* means "match all remaining characters in the string".
Calling regexp with the names option causes it to return a struct with the named tokens as its fields.
Next, let's do the country. This one is a little trickier because there is a variable number of city/region specifiers. But the country will always be the last one, so that's the one we'll grab.
country_format = '(?<country>\w[ \w]*)$';
parsed_fields = regexp(loc_str, country_format, 'names');
Country = parsed_fields.country;
This format specification is the token (?<country>\w[ \w]*) followed by the end of the string (denoted by the special character $). In the token specification we are matching an alphanumeric character (\w) followed by any number of spaces and/or alphanumeric characters ([ \w]*). The reason for specifying this leading \w is so that we don't match the space between the previous comma and the start of the country name.
Finally, let's do the age. This one is tricky because not every entry has an age. At least it's easy because the age (if it exists) is the only numeric data in the line. Hence:
age_format = '(?<age>[\d]+)';
parsed_fields = regexp(name_str, age_format, 'names');
if isempty(parsed_fields)
    Age = -1;
else
    Age = str2double(parsed_fields.age);
end
The format specification is simply the token (?<age>[\d]+), which specifies that we are looking for numeric characters (\d), and we are looking for one or more of them (+).
After parsing, we check whether or not there was a match. If not (parsed_fields is empty), then we assign Age a value of -1. Otherwise, we convert the parsed age field into a number.
So putting it all together:
date_format = 'Date of death:\s*(?<date>.*)';
country_format = '(?<country>\w[ \w]*)[\W]?$';
age_format = '(?<age>[\d]+)';
nEntries = length(date_str_array);
DOD = cell(nEntries, 1);
Country = cell(nEntries, 1);
Age = zeros(nEntries, 1);
for ii = 1:nEntries
    name_str = name_str_array{ii};
    loc_str = loc_str_array{ii};
    date_str = date_str_array{ii};

    parsed_fields = regexp(date_str, date_format, 'names');
    assert(~isempty(parsed_fields), 'Could not parse date from:\n%s', date_str);
    DOD{ii} = parsed_fields.date;

    parsed_fields = regexp(loc_str, country_format, 'names');
    assert(~isempty(parsed_fields), 'Could not parse country from:\n%s', loc_str);
    Country{ii} = parsed_fields.country;

    parsed_fields = regexp(name_str, age_format, 'names');
    if isempty(parsed_fields)
        Age(ii) = -1;
    else
        Age(ii) = str2double(parsed_fields.age);
    end
end
I added the assert statements to help debug what's going on if you encounter errors in parsing.
For example, you may also notice that I added an [\W]? to the country format. This is because while running it on your example data, I encountered one country that contained a period at the end of the line (i.e. it ended with "Brazil." instead of just "Brazil"). So now we're looking to match a non-alphanumeric character (\W) repeated zero or 1 times (?), and it's outside of the parentheses so it is not being captured as part of the "country" token.

SSIS How to get part of a string by separator

I need an SSIS expression to get the left part of a string before the separator, and then put the new string in a new column. I checked the Derived Column transformation; there seems to be no such expression. SUBSTRING can only return part of a string with a fixed length.
For example, with the separator '-':
Art-Reading Should return Art
Art-Writing Should return Art
Science-chemistry Should return Science
P.S.
I know this can be done in MySQL with SUBSTRING_INDEX(), but I'm looking for an equivalent in SSIS, or at least in SQL Server.
Better late than never, but I wanted to do this too and found this.
TOKEN(character_expression, delimiter_string, occurrence)
TOKEN("a little white dog"," ",2)
returns "little". The source is below:
http://technet.microsoft.com/en-us/library/hh213216.aspx
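Applied to the question's data (assuming the input column is named name, as in the expressions further down), that would be:
TOKEN(name,"-",1)
which should return Art for Art-Reading and Science for Science-chemistry.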
Of course you can: just configure your Derived Column with the expression below. Here it is, to make your life easier:
SUBSTRING(name,1,FINDSTRING(name,"-",1) - 1)
FYI, the second "1" means to get the first occurrence of the string "-"
EDIT:
Expression to deal with strings that do not contain a "-":
FINDSTRING(name,"-",1) != 0 ? (SUBSTRING(name,1,FINDSTRING(name,"-",1) - 1)) : name
You can specify the length to copy in the SUBSTRING function and check for the location of the dash using CHARINDEX
SELECT SUBSTRING(@sString, 1, CHARINDEX('-', @sString) - 1)
For the SSIS expression it is pretty much the same code:
SUBSTRING(@[User::String], 1, FINDSTRING(@[User::String], "-", 1) - 1)
If the SUBSTRING length parameter evaluates to -1, it results in an error:
"The length -1 is not valid for function "SUBSTRING". The length parameter cannot be negative. Change the length parameter to zero or a positive value."
