Regex named-capture groups in T-SQL - sql-server

I need to extract ICD 9 values from a requirements document. The ICD 9 values could be individual codes like V91.19 or ranges 441.00-441. For example:
4. Peripheral vascular disorders - ICD-9-CM codes: 440.0-440.9, 441.00-441.9, 442.0-442.9, 443.1-443.9, 447.1, 557.1, 557.9, V43.4, V91.19, V9000, V91, M8440/0
Ultimately, the goal is to use these values in a WHERE clause:
SELECT *
FROM ICD9
WHERE (
(CODE BETWEEN '440.0' AND '440.9')
OR (CODE BETWEEN '441.00' AND '441.9')
...
OR CODE IN ('447.1', '557.1', '557.9', 'V43.4', 'V91.19', 'V9000', 'V91', 'M8440/0')
)
This regex:
/[A-Z]?[0-9]+[\.\/]?[0-9]*/g
matches:
the individual ICD 9 values (447.1)
the starting and ending values of the range (440.0-440.9)
4. and ICD-9-CM - not desirable
How do I need to modify my regex to:
create a capture group for the individual values?
create a capture group for the range values?
exclude the undesirables?

Do you mean like this?
[A-Z]?[0-9]+[\.\/]?(?=\d)[0-9]*
where
(?=\d) is a positive lookahead: it asserts that the next character is a digit [0-9], without consuming it.
You get the same results if you drop the lookahead and replace the * quantifier on the last digit group with +:
[A-Z]?[0-9]+[\.\/]?[0-9]+
https://regex101.com/r/nK3zB3/2
As for the ranges and groups, I think it could be something like:
(([A-Z]?[0-9]+[\.\/]?[0-9]+)[-]*(([A-Z]?[0-9]+[\.\/]?[0-9]+))?)
Online Demo
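For illustration, here is a minimal Python sketch of the same idea using named capture groups, on the assumption that the requirements text is pre-processed outside the database. The names CODE, TOKEN, parse_codes and to_where are hypothetical, and the sketch assumes the code list follows the text "codes:" and is comma-separated. Anchoring each token with ^ and $ is what keeps "4." and "ICD-9-CM" out of the results.

```python
import re

# One ICD-9 style code: optional letter, digits, optional "." or "/" plus digits.
CODE = r"[A-Z]?\d+(?:[./]\d+)?"
# A token is either a single code or a "start-end" range.
TOKEN = re.compile(rf"^(?P<start>{CODE})(?:-(?P<end>{CODE}))?$")


def parse_codes(line):
    """Split the text after 'codes:' on commas and classify each token."""
    _, _, tail = line.partition("codes:")
    singles, ranges = [], []
    for token in tail.split(","):
        m = TOKEN.match(token.strip())
        if not m:
            continue  # skip anything that is not a code or a range
        if m.group("end"):
            ranges.append((m.group("start"), m.group("end")))
        else:
            singles.append(m.group("start"))
    return singles, ranges


def to_where(singles, ranges):
    """Build the body of the WHERE clause from the parsed values."""
    parts = [f"(CODE BETWEEN '{a}' AND '{b}')" for a, b in ranges]
    if singles:
        quoted = ", ".join(f"'{s}'" for s in singles)
        parts.append(f"CODE IN ({quoted})")
    return "\nOR ".join(parts)


line = ("4. Peripheral vascular disorders - ICD-9-CM codes: 440.0-440.9, "
        "441.00-441.9, 442.0-442.9, 443.1-443.9, 447.1, 557.1, 557.9, "
        "V43.4, V91.19, V9000, V91, M8440/0")
singles, ranges = parse_codes(line)
print(to_where(singles, ranges))
```

Splitting on commas first keeps the pattern simple; a single global regex over the whole line would need extra lookarounds to reject the list number and the "ICD-9-CM" label.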


Regex string with 2+ different numbers and some optional characters in Snowflake syntax

I would like to check if a specific column in one of my tables meets the following conditions:
String must contain at least three characters
String must contain at least two different numbers [e.g. 123 would work but 111 would not]
Characters which are allowed in the string:
Numbers (0-9)
Uppercase letters
Lowercase letters
Underscores (_)
Dashes (-)
I have some experience with Regex but am having issues with Snowflake's syntax. Whenever I try using the '?' regex character (to mark something as optional) I receive an error. Can someone help me understand a workaround and provide a solution?
What I have so far:
SELECT string,
LENGTH(string) AS length
FROM tbl
WHERE REGEXP_LIKE(string,'^[0-9]+{3,}[-+]?[A-Z]?[a-z]?$')
ORDER BY length;
Thanks!
Your regex looks a little confusing and invalid, and it doesn't look like it quite meets your needs either. I read this expression as a string that:
Must start with one or more digits, at least 3 or more times
The confusing part to me is the '+' is a quantifier, which is not quantifiable with {3,} but somehow doesn't produce an error for me
Optionally followed by either a dash or plus sign
Followed by an uppercase character zero or one times (giving back as needed)
Followed by and ending with a lowercase character zero or one times (giving back as needed)
Questions
You say that your string must contain 3 characters and at least 2 different numbers, numbers are characters but I'm not sure if you mean 3 letters...
Are you considering the numbers to be characters?
Does the order of the characters matter?
Can you provide an example of the error you are receiving?
Notes
Checking for a second digit that is not the same as the first involves the concept of a lookahead with a backreference. Snowflake does not support backreferences.
One thing about pattern matching with regular expressions is that order makes a difference. If order is not of importance to you, then you'll have multiple patterns to match against.
Example
Below is how you can test each part of your requirements individually. I've included a few regexp_substr functions to show how extraction can work to check if something exists again.
Uncomment the WHERE clause to see the dataset filtered. The filters are written as expressions so you can remove any/all of the regexp_* columns.
select randstr(36,random(123)) as r_string
,length(r_string) AS length
,regexp_like(r_string,'^[0-9]+{3,}[-+]?[A-Z]?[a-z]?$') as reg
,regexp_like(r_string,'.*[A-Za-z]{3,}.*') as has_3_consecutive_letters
,regexp_like(r_string,'.*\\d+.*\\d+.*') as has_2_digits
,regexp_substr(r_string,'(\\d)',1,1) as first_digit
,regexp_substr(r_string,'(\\d)',1,2) as second_digit
,first_digit <> second_digit as digits_1st_not_equal_2nd
,not(regexp_instr(r_string,regexp_substr(r_string,'(\\d)',1,1),1,2)) as first_digit_does_not_appear_again
,has_3_consecutive_letters and has_2_digits and first_digit_does_not_appear_again as test
from table(generator(rowcount => 10))
//where regexp_like(r_string,'.*[A-Za-z]{3,}.*') // has_3_consecutive_letters
// and regexp_like(r_string,'.*\\d+.*\\d+.*') // has_2_digits
// and not(regexp_instr(r_string,regexp_substr(r_string,'(\\d)',1,1),1,2)) // first_digit_does_not_appear_again
;
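As a language-neutral illustration of the notes above, the check can be split into two steps so that no backreference is needed: one pattern for length and allowed characters, then a count of distinct digits. This is a minimal Python sketch, not Snowflake syntax. The name is_valid is hypothetical, and it assumes that digits count toward the three-character minimum and that order does not matter.

```python
import re

# At least three characters, all from the allowed set.
ALLOWED = re.compile(r"[A-Za-z0-9_-]{3,}")


def is_valid(s):
    """True if s uses only allowed characters, has length >= 3,
    and contains at least two different digits."""
    if s is None or not ALLOWED.fullmatch(s):
        return False
    distinct_digits = set(re.findall(r"[0-9]", s))
    return len(distinct_digits) >= 2


for candidate in ["123", "111", "a1_2-b", "12 3", "ab"]:
    print(candidate, is_valid(candidate))
```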
Assuming the digits need to be contiguous, you can use a JavaScript UDF to find the number in a string with the largest number of distinct digits:
create or replace function f(S text)
returns float
language javascript
returns null on null input
as
$$
const m = S.match(/\d+/g)
if (!m) return 0
const lengths = m.map(m=> [...new Set (m.split(''))].length)
const max_length = lengths.reduce((a,b) => Math.max(a,b))
return max_length
$$
;
Combined with WHERE-clause, this does what you want, I believe:
select column1, f(column1) max_length
from t
where max_length>1 and length(column1)>2 and column1 rlike '[\\w\\d-]+';
Yielding:
COLUMN1 | MAX_LENGTH
------------------------+-----------
abc123def567ghi1111_123 | 3
123 | 3
111222 | 2
Assuming this input:
create or replace table t as
select * from values ('abc123def567ghi1111_123'), ('xyz111asdf'), ('123'), ('111222'), ('abc 111111111 abc'), ('12'), ('asdf'), ('123 456'), (null);
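To make it easier to see why only three of these rows come back, here is a rough Python mirror of the UDF and the WHERE clause. The names max_distinct_digits and passes are hypothetical. The sketch assumes that RLIKE has to match the whole string (which is why fullmatch is used) and that a NULL input never passes the filter.

```python
import re


def max_distinct_digits(s):
    """Largest count of distinct digits within any contiguous run of digits."""
    if s is None:
        return None  # mirrors "returns null on null input"
    runs = re.findall(r"[0-9]+", s)
    return max((len(set(run)) for run in runs), default=0)


def passes(s):
    """Mirror of: max_length > 1 and length > 2 and rlike '[\\w\\d-]+'."""
    return (
        s is not None
        and len(s) > 2
        and re.fullmatch(r"[\w-]+", s, re.ASCII) is not None
        and max_distinct_digits(s) > 1
    )


rows = ["abc123def567ghi1111_123", "xyz111asdf", "123", "111222",
        "abc 111111111 abc", "12", "asdf", "123 456", None]
for row in rows:
    if passes(row):
        print(row, max_distinct_digits(row))
```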
The function is even simpler if the digits don't have to be contiguous (i.e. count the distinct digits in a string). Then core logic changes to:
const m = S.match(/\d/g)
if (!m) return 0
const length = [...new Set (m)].length
return length
Hope that's helpful!

T-SQL: SUM Number between Delimiters from String

I need to get numbers with a variable length out of a string and sum them.
The strings got the following format:
EH:NUMBER=SomeOtherStuff->Code
I'm extracting the code via RIGHT() and join with another table to get the group right, at the moment I'm using sum to get it together via date:
SUM(CASE WHEN (MONTH(data.DATE1) = 5 AND YEAR(data.DATE1) = YEAR(GETDATE())) THEN 1 ELSE 0 END) N'Mai',
I then need to sum the numbers from the string and not the number of rows.
Some Examples:
Month1 EH:1=24->ZTM
Month1 EH:4=13-21->LKm
Month2 EH:3=34,33,43->LKm
Month2 EH:7=12,92-29,29->LKm
Month2 EH:5=24-26,11,21,22->ZOL
What i need:
Material - Month1 - Month2
ZTM - 1 - 0
LKM - 4 - 10
ZOL - 0 - 5
Could you help me please?
Greetings
Short version:
What you are looking for is SUBSTRING.
Longer version:
To get the sum of the numerical value of NUMBER, you need to think about how to break the problem down.
I'd recommend following these steps:
Extract the NUMBER part from the string. This should be done with SUBSTRING (much like you extract Code with RIGHT). To get the start and length of your substring, use CHARINDEX (or PATINDEX if you like).
Convert the NUMBER part to a numerical value with CAST (or CONVERT, or whatever you are familiar with).
Now you can do your aggregation.
So: SUM(CAST(SUBSTRING(*this part you will have to figure out by yourself*) AS *correct numerical data type*)).
I'll let you figure out the values to insert yourself. I would recommend first finding the positions of the delimiting characters, then extracting the NUMBER part, then getting the numerical value, and so on.
This way you gain a better understanding of what you are actually doing.
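The steps above can be sketched outside SQL to check the logic against the sample rows. This is a minimal Python sketch, not the T-SQL itself: str.index plays the role of CHARINDEX, slicing plays the role of SUBSTRING, and int plays the role of CAST. The names extract and sum_by_code are hypothetical, and upper-casing the code (so that LKm and LKM group together) is an assumption based on the expected output.

```python
from collections import defaultdict


def extract(s):
    """Return (NUMBER, code) from a string like 'EH:4=13-21->LKm'."""
    start = s.index("EH:") + len("EH:")   # position right after the prefix
    end = s.index("=", start)             # position of the delimiter
    number = int(s[start:end])            # the NUMBER part as an integer
    code = s.rsplit("->", 1)[1].upper()   # everything after the last '->'
    return number, code


def sum_by_code(rows):
    """Sum NUMBER per code and month."""
    totals = defaultdict(lambda: defaultdict(int))
    for month, s in rows:
        number, code = extract(s)
        totals[code][month] += number
    return {code: dict(months) for code, months in totals.items()}


rows = [
    ("Month1", "EH:1=24->ZTM"),
    ("Month1", "EH:4=13-21->LKm"),
    ("Month2", "EH:3=34,33,43->LKm"),
    ("Month2", "EH:7=12,92-29,29->LKm"),
    ("Month2", "EH:5=24-26,11,21,22->ZOL"),
]
print(sum_by_code(rows))
```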
Cheers, and good luck with your assignment
Martin

Anychart tables: How to include thousand separators?

How can I put the text "100.000" in a table in Anychart? When I try to get the string "100.000" in, it is modified to "100".
For a working example see https://jsfiddle.net/Republiq/xcemvm9L/
table = anychart.standalones.table(2,2);
table.getCell(0,0).content("100.000");
table.container("container").draw();
If you want to use this number formatting for the whole table, you can define numberLocale at the beginning. If the actual number is 100, '.' is the decimal separator, and you want to show three zeros as decimals, put the following lines before creating the table:
anychart.format.locales.default.numberLocale.decimalsCount = 3;
anychart.format.locales.default.numberLocale.zeroFillDecimals = true;
And then put in the number as:
table.getCell(0,0).content(100);
If '.' is the group separator and the actual number is 100000, put the following line:
anychart.format.locales.default.numberLocale.groupsSeparator = '.';
And then put in the number as:
table.getCell(0,0).content(100000);
If you want a special format for a single cell only, we recommend the number formatter, which lets you configure all of these options for a single number. For example, it may look like this:
table = anychart.standalones.table(5,5);
table.getCell(0,0).content(anychart.format.number(100000, 3, ".", ","));
table.container("container").draw();
Also, you may learn more about this useful method and find examples in this article

MATLAB Extract all rows between two variables with a threshold

I have a cell array called BodyData in MATLAB that has around 139 columns and 3500 odd rows of skeletal tracking data.
I need to extract all rows between two string values (these are timestamps when an event happened) that I have
e.g.
BodyData{}=
Column 1 2 3
'10:15:15.332' 'BASE05' ...
...
'10:17:33:230' 'BASE05' ...
The two timestamps should match a value in the array but might also be within a few ms of those in the array e.g.
TimeStamp1 = '10:15:15.560'
TimeStamp2 = '10:17:33.233'
I have several questions!
How can I return an array for all the data between the two string values plus or minus a small threshold of say .100ms?
Also can I also add another condition to say that all str values in column2 must also be the same, otherwise ignore? For example, only return the timestamps between A and B only if 'BASE02'
Many thanks,
The best approach to the first part of your problem is probably to change from strings to numeric date values. In MATLAB this can be done quite painlessly with datenum.
For the second part you can just use logical indexing: this is where you put a condition (i.e. that the second column is BASE02) within the indexing expression.
A self-contained example:
% some example data:
BodyData = {'10:15:15.332', 'BASE05', 'foo';...
'10:15:16.332', 'BASE02', 'bar';...
'10:15:17.332', 'BASE05', 'foo';...
'10:15:18.332', 'BASE02', 'foo';...
'10:15:19.332', 'BASE05', 'bar'};
% create column vector of numeric times, and define start/end times
dateValues = datenum(BodyData(:, 1), 'HH:MM:SS.FFF');
startTime = datenum('10:15:16.100', 'HH:MM:SS.FFF');
endTime = datenum('10:15:18.500', 'HH:MM:SS.FFF');
% select data in range, and where second column is 'BASE02'
BodyData(dateValues > startTime & dateValues < endTime & strcmp(BodyData(:, 2), 'BASE02'), :)
Returns:
ans =
'10:15:16.332' 'BASE02' 'bar'
'10:15:18.332' 'BASE02' 'foo'
References: datenum manual page, matlab help page on logical indexing.
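The example above uses fixed start and end times. For comparison, here is a minimal Python sketch of the same filter with the plus-or-minus threshold from the question made explicit. It is not MATLAB; the names parse and rows_between, and the default tolerance of 100 ms, are assumptions for illustration.

```python
from datetime import datetime, timedelta


def parse(ts):
    """Parse 'HH:MM:SS.FFF' into a datetime (the date part is a dummy)."""
    return datetime.strptime(ts, "%H:%M:%S.%f")


def rows_between(rows, start, end, label, tol_ms=100):
    """Rows whose time lies in [start - tol, end + tol] and whose
    second column equals label."""
    tol = timedelta(milliseconds=tol_ms)
    lo, hi = parse(start) - tol, parse(end) + tol
    return [r for r in rows if lo <= parse(r[0]) <= hi and r[1] == label]


body_data = [
    ("10:15:15.332", "BASE05", "foo"),
    ("10:15:16.332", "BASE02", "bar"),
    ("10:15:17.332", "BASE05", "foo"),
    ("10:15:18.332", "BASE02", "foo"),
    ("10:15:19.332", "BASE05", "bar"),
]
for row in rows_between(body_data, "10:15:16.400", "10:15:18.300", "BASE02"):
    print(row)
```

With the 100 ms tolerance both BASE02 rows fall inside the widened window; with tol_ms=0 neither does, which shows the effect of the threshold.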

How do I match a substring of variable length?

I am importing data into my SQL database from an Excel spreadsheet.
The imp table is the imported data, the app table is the existing database table.
app.ReceiptId is formatted as "A" followed by some numbers. Formerly it was 4 digits, but now it may be 4 or 5 digits.
Examples:
A1234
A9876
A10001
imp.ref is a free-text reference field from Excel. It consists of some arbitrary length description, then the ReceiptId, followed by an irrelevant reference number in the format " - BZ-0987654321" (which is sometimes cropped short, or even missing entirely).
Examples:
SHORT DESC A1234 - BZ-0987654321
LONGER DESCRIPTION A9876 - BZ-123
REALLY LONG DESCRIPTION A2345 - B
REALLY REALLY LONG DESCRIPTION A23456
The code below works for a 4-digit ReceiptId, but will not correctly capture a 5-digit one.
UPDATE app
SET
[...]
FROM imp
INNER JOIN app
ON app.ReceiptId = right(right(rtrim(replace(replace(imp.ref,'-',''),'B','')),5)
+ rtrim(left(imp.ref,charindex(' - BZ-',imp.ref))),5)
How can I change the code so it captures either 4 (A1234) or 5 (A12345) digits?
As ughai rightfully wrote in his comment, it's not recommended to use anything other than columns in the ON clause of a join.
The reason is that wrapping the columns in functions prevents SQL Server from using any indexes on those columns that it might otherwise use.
Therefore, I would suggest adding another column to the imp table that will hold the actual ReceiptId and be calculated during the import process itself.
I think the best way of extracting the ReceiptId from the ref column is using substring with patindex, as demonstrated in this fiddle:
SELECT ref,
RTRIM(SUBSTRING(ref, PATINDEX('%A[0-9][0-9][0-9][0-9]%', ref), 6)) As ReceiptId
FROM imp
Update
After the conversation with t-clausen-dk in the comments, I came up with this:
SELECT ref,
CASE WHEN PATINDEX('%[ ]A[0-9][0-9][0-9][0-9][0-9| ]%', ref) > 0
OR PATINDEX('A[0-9][0-9][0-9][0-9][0-9| ]%', ref) = 1 THEN
SUBSTRING(ref, PATINDEX('%A[0-9][0-9][0-9][0-9][0-9| ]%', ref), 6)
ELSE
NULL
END As ReceiptId
FROM imp
fiddle here
This will return NULL if there is no match.
A match is a substring consisting of A followed by 4 or 5 digits, separated by spaces from the rest of the string; it can be found at the start, middle, or end of the string.
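PATINDEX wildcards have no "4 or 5 times" quantifier, which is why the patterns above spell out each digit. If the ReceiptId column is computed during the import process outside the database, the same rule can be written with a regular expression. This is a minimal Python sketch under that assumption; the names RECEIPT and extract_receipt_id are hypothetical.

```python
import re

# 'A' followed by 4 or 5 digits, not glued to other letters or digits.
RECEIPT = re.compile(r"(?<![A-Za-z0-9])A[0-9]{4,5}(?![A-Za-z0-9])")


def extract_receipt_id(ref):
    """Return the first ReceiptId found in ref, or None if there is none."""
    m = RECEIPT.search(ref)
    return m.group(0) if m else None


refs = [
    "SHORT DESC A1234 - BZ-0987654321",
    "LONGER DESCRIPTION A9876 - BZ-123",
    "REALLY LONG DESCRIPTION A2345 - B",
    "REALLY REALLY LONG DESCRIPTION A23456",
]
for ref in refs:
    print(ref, "->", extract_receipt_id(ref))
```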
Try this, it will remove all characters before the A[number][number][number][number] and take the first 6 characters after that:
UPDATE app
SET
[...]
FROM imp
INNER JOIN app
ON app.ReceiptId in
(
left(stuff(ref,1, patindex('%A[0-9][0-9][0-9][0-9][ ]%', imp.ref + ' ') - 1, ''), 5),
left(stuff(ref,1, patindex('%A[0-9][0-9][0-9][0-9][0-9][ ]%', imp.ref + ' ') - 1, ''), 6)
)
When comparing with =, trailing spaces are not evaluated.
