Escaping dynamically generated regex - snowflake-cloud-data-platform

I'm dynamically creating a string that will be used as a regular expression pattern. I'm creating it and using it in Snowflake SQL. There are reserved regular expression characters in it that I want to keep as part of the original text. For example:
'word1, word2, a.b.c, hot/cool, | general'
I'm going to convert those commas to | so that a search returns a positive match if any of the terms appear in the text. The | in "| general" may also legitimately appear in the text, so it needs to be escaped. The ., / and many other reserved characters also occur in the text. Basically, I need to escape them all. I'm doing this transform in separate steps, so I can convert the commas to pipes after this escaping step.
This is the simplest test case and solution I can come up with:
select regexp_replace(
'+ . * ? ^ $ , [ ] { } ( ) | /', -- text to escape
'\\+|\\.|\\*|\\?|\\^|\\$|\\,|\\[|\\]|\\{|\\}|\\(|\\)|\\||\\/', -- pattern
'\\\\$0' -- replace captured text with \\ in front of it
)
Even in this case I've left the \ out of the text to escape, because including it throws an error. The result of this is:
\$0 \$0 \$0 \$0 \$0 \$0 \$0 \$0 \$0 \$0 \$0 \$0 \$0 \$0 \$0
I've tried many variations of backslashes before the $0 and nothing works.
Python has a re.escape() function. Javascript has ways of doing it (https://stackoverflow.com/a/3561711/1884101). I can't figure out any way to do this in Snowflake other than a UDF, which I would really like to avoid. Someone else tried my example above in Postgres and it worked.
Is there a way to do this in Snowflake SQL (including escaping the \)?

The regexp_replace function behaves somewhat differently across databases. On Snowflake, the following works:
select REGEXP_REPLACE(
'+ . * ? ^ $ , [ ] { } ( ) | /',
'\\+|\\.|\\*|\\?|\\^|\\$|\\,|\\[|\\]|\\{|\\}|\\(|\\)|\\||\\/', -- escaped reserved characters
'\\\\\\\\\\0' -- I want to add \\ in front of every reserved character
)
Four backslashes in the string literal end up as one literal backslash in the output, so adding two backslashes takes 4*2 = 8 backslashes, followed by \\0 to reference the matched text.
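To see what the escaped text should look like without Snowflake's string-literal quoting in the way, here is a minimal shell sketch of the same two-step transform (escape first, then turn the comma separators into pipes). The function names regex_escape and build_pattern are hypothetical, and the sketch assumes the terms are separated by a comma followed by one space, as in the question's example.

```shell
#!/usr/bin/env bash

# regex_escape (hypothetical name): put a backslash in front of every
# regex metacharacter, including the backslash itself.
# '#' is used as the sed delimiter because '/' and '|' are in the set.
regex_escape() {
  printf '%s' "$1" | sed 's#[][\.*+?^${}()|/]#\\&#g'
}

# build_pattern (hypothetical name): escape first, then convert the
# ", " separators into regex alternation.
build_pattern() {
  local escaped
  escaped=$(regex_escape "$1")
  printf '%s\n' "${escaped//, /|}"
}

build_pattern 'word1, word2, a.b.c, hot/cool, | general'
```

Doing the escaping before the comma-to-pipe step matters: a | that was already in the text ends up escaped, while the pipes introduced as separators stay unescaped.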

Can you try this
select regexp_replace(
'+ . * ? ^ $ , [ ] { } ( ) | /', -- text to escape
'\\+|\\.|\\*|\\?|\\^|\\$|\\,|\\[|\\]|\\{|\\}|\\(|\\)|\\||\\/', -- pattern
'\\\\\\\\$0' -- replace captured text with \\ in front of it
)

Related

Parsing a String with quoted Fields like a CSV-line in Powershell

I have to parse a variable input-string into a string-array.
The input is a CSV-style comma-separated field-list where each field has its own quoted string.
Because I don't want to write my own full-blown CSV parser, the only working solution I have come up with so far is this one:
$input = '"Miller, Steve", "Zappa, Frank", "Johnson, Earvin ""Magic"""'
Add-Type -AssemblyName Microsoft.VisualBasic
$enc = [System.Text.Encoding]::UTF8
$bytes = $enc.GetBytes($input)
$stream = [System.IO.MemoryStream]::new($bytes)
$parser = [Microsoft.VisualBasic.FileIO.TextFieldParser]::new($stream)
$parser.Delimiters = ','
$parser.HasFieldsEnclosedInQuotes = $true
$list = $parser.ReadFields()
$list
Output looks like this:
Miller, Steve
Zappa, Frank
Johnson, Earvin "Magic"
Is there a better solution available via another .NET library for PowerShell?
Ideally, I could avoid the extra byte array and stream.
I am also not sure whether this VisualBasic assembly will remain available in the long term.
Any ideas here?
With some extra precautions for security and to prevent inadvertent string extrapolation, you can combine Invoke-Expression with Write-Output, though note that Invoke-Expression should generally be avoided:
$fieldList = '"Miller, Steve", "Zappa, Frank", "Johnson, Earvin ""Magic""", "Honey, I''m $HOME"'
# Parse into array.
$fields = (
Invoke-Expression ("Write-Output -- " + ($fieldList -replace '\$', "`0"))
) -replace "`0", '$$'
Note:
-replace '\$', "`0" temporarily replaces literal $ chars. in the input with NUL chars. to prevent accidental (or malicious) string expansion (interpolation); the second -replace operation restores the original $ chars.
See this answer for more information about the regex-based -replace operator.
Prepending Write-Output -- to the resulting string and interpreting the result as a PowerShell command via Invoke-Expression causes Write-Output to parse the remainder of the string as individual arguments and output them as such. -- ensures that any arguments that happen to look like Write-Output's own parameters are not interpreted as such.
If and only if the input string is guaranteed to never contain embedded $ characters, the solution can be simplified to:
$fields = Invoke-Expression "Write-Output -- $fieldList"
Outputting $fields yields the following:
Miller, Steve
Zappa, Frank
Johnson, Earvin "Magic"
Honey, I'm $HOME
Explanation and list of constraints:
The solution relies on making the input string part of a string whose content is a syntactically valid Write-Output call, with the input string serving as the latter's arguments. Invoke-Expression then evaluates this string as if its content had directly been submitted as a command and therefore executes the Write-Output command. Based on how PowerShell parses command arguments, this implies the following constraints:
Supported field separators:
Either: ,-separated (with per-field (unquoted) leading and/or trailing whitespace getting removed, as shown above).
Or: whitespace-separated, using one or more whitespace characters between the fields.
Non-/quoting of embedded fields:
Fields can be quoted:
If single-quoted ('...'), field-internal ' characters must be escaped as ''.
If double-quoted, field-internal " characters must be escaped as either "" or `".
Fields can also be unquoted:
However, such fields mustn't contain any PowerShell argument-mode metacharacters (of these, < > @ # are only metacharacters at the start of a token):
<space> ' " ` , ; ( ) { } | & < > @ #
Alternative, via ConvertFrom-Csv:
iRon's helpful answer shows a solution based on ConvertFrom-Csv, given that the field list embedded in the input string is comma-separated (,):
On the one hand, it is more limited in that it only supports "..."-quoting of fields and ""-escaping of field-internal ", and doesn't support fields separated by varying amounts of whitespace (only).
On the other hand, it is more flexible, in that it supports any single-character separator between the fields (irrespective of incidental leading/trailing per-field whitespace), which can be specified via the -Delimiter parameter.
What makes the solution awkward is the need to anticipate the max. number of embedded fields and to provide dummy headers (column names) for them (-Header (0..99)) in order to make ConvertFrom-Csv work, which is both fragile and potentially wasteful.
However, a simple trick can bypass this problem: Submit the input string twice, in which case ConvertFrom-Csv treats the fields in the input string as both the column names and as the column values of the one and only output row (object), whose values can then be queried:
$fieldList = '"Miller, Steve", "Zappa, Frank", "Johnson, Earvin ""Magic""", "Honey, I''m $HOME"'
# Creates the same array as the solution at the top.
$fields = ($fieldList, $fieldList | ConvertFrom-Csv).psobject.Properties.Value
If the list is limited, you might use the parser of the ConvertFrom-Csv cmdlet, like:
$List = '"Miller, Steve", "Zappa, Frank", "Johnson, Earvin ""Magic""", "Honey, I''m $HOME"'
($List | ConvertFrom-Csv -Header (0..99)).PSObject.Properties.Value.Where{ $Null -ne $_ }
Miller, Steve
Zappa, Frank
Johnson, Earvin "Magic"
Honey, I'm $HOME

Recursive parsing and arrays in shell script

I intend to accept a single argument for my shell script my_script.sh and parse the values from it using separators. For example,
./my_script.sh a-e,f/b-1/c-5,g/d
means my primary separator is / and secondary separator is - and tertiary separator is ,. The challenge here is the number of values separated by , or - is not fixed, but variable. Like in d, there is no - or , at all. I can always parse the values separated by / as:
IFS='/' read -ra list_l1 <<<$1
This way, I get the number of times I need to loop over. But I'm stuck trying a parsing within list_l1. Here,
I need to see if there is - and , or if they are there at all.
If there is - and ,, get the values after - and pass it/them as arguments to another script (eg. for a e,f will be passed as separate arguments to another script).
If there is no - and ,, just run another script without arguments (eg. for d, another script is run without any arguments).
How can I get this done?
UPDATE:
I managed to figure a way for level one:
IFS='/' read -ra list_l1 <<<$1
for i in "${!list_l1[@]}"; do
list_l2[$i]="${list_l1[$i]//,/$' '}"
# This section is a pseudocode of what I would like to do:
get 'type' from first part (before '-' as in example above)
if type == 'a':
pass the with parameters after '-' to another .sh script, discarding the separators '-', ','
elif type == 'b':
pass the with parameters after '-' to another .sh script, discarding the separators '-', ','
elif type == 'c':
pass the with parameters after '-' to another .sh script, discarding the separators '-', ','
elif type == 'd':
pass the with parameters after '-' to another .sh script, discarding the separators '-', ','
# This section is a pseudocode of what I would like to do:
done
Take a look at this:
#!/usr/bin/env bash
f() { printf 'I am called with %d arguments: %s\n' "$#" "$*"; }
param='a-e,f/b-1/c-5,g/d'
IFS=/ read -ra a <<< "$param"
for i in "${a[@]}"; do
IFS=- read -r _ b <<< "$i"
IFS=, read -ra c <<< "$b"
f "${c[@]}"
done
$ ./script
I am called with 2 arguments: e f
I am called with 1 arguments: 1
I am called with 2 arguments: 5 g
I am called with 0 arguments:
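The script above discards the part before the - (it is read into _). If the type letter is also needed, as in the question's pseudocode, the same three-level split can keep it. This is only a sketch: run_for_type is a hypothetical stand-in for the other script, and parse_spec is a hypothetical wrapper name.

```shell
#!/usr/bin/env bash

# run_for_type (hypothetical): stands in for the script to be called.
# It reports the type, the argument count, and the arguments.
run_for_type() {
  local type=$1
  shift
  printf '%s:%d:%s\n' "$type" "$#" "$*"
}

# parse_spec (hypothetical): split on '/', then on the first '-',
# then on ','. An item without '-' yields an empty argument list.
parse_spec() {
  local spec=$1 item type rest
  local -a items args
  IFS=/ read -ra items <<< "$spec"
  for item in "${items[@]}"; do
    IFS=- read -r type rest <<< "$item"
    IFS=, read -ra args <<< "$rest"
    run_for_type "$type" "${args[@]}"
  done
}

parse_spec 'a-e,f/b-1/c-5,g/d'
```

Because each IFS assignment is a prefix to read, it applies to that one command only and the shell's global IFS is left alone.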
Based on what I understood of your question, I produced this code:
Edit no. 1: calling another script using that array
#!/bin/bash
arg='a-e,f/b-1/c-5,g/d'
# Cuts it into [a-e,f] [b-1] [c-5,g] [d]
IFS='/' read -ra list_l1 <<<"$arg"
echo "First cut on /."
echo "Content of list_l1"
for K in "${!list_l1[@]}"
do
echo "list_l1[$K]: ${list_l1[$K]}"
done
echo ""
declare -A list_l2
echo "Then loop, cut on '-' and replace ',' by ' '."
for onearg in "${list_l1[@]}"
do
IFS='-' read -r part1 part2 <<<"$onearg"
list_l2[$part1]=${part2//,/ }
done
echo "Content of list_l2:"
for K in "${!list_l2[@]}"
do
echo "list_l2[$K]: ${list_l2[$K]}"
done
# Calling another script using these values
echo ""
for K in "${!list_l2[@]}"
do
echo "./another_script.sh ${list_l2[$K]}"
done
Which gives the following output:
$ ./t.bash
First cut on /.
Content of list_l1
list_l1[0]: a-e,f
list_l1[1]: b-1
list_l1[2]: c-5,g
list_l1[3]: d
Then loop, cut on '-' and replace ',' by ' '.
Content of list_l2:
list_l2[a]: e f
list_l2[b]: 1
list_l2[c]: 5 g
list_l2[d]:
./another_script.sh e f
./another_script.sh 1
./another_script.sh 5 g
./another_script.sh
Some details:
The first step is to cut on '/'. This creates list_l1.
Each element of list_l1 starts with a type letter ('a', 'b', 'c', 'd', ...), i.e. the first character of each element after the cut on '/'.
Then each of these is cut a second time on '-'.
The first part of that cut (left of the '-') becomes key.
The second part of that cut (right of the '-') becomes the value.
list_l2 is created as an associative array, using the key and value that were just calculated.
This way list_l2 contains everything you need, without having to reference list_l1 at all later. If you need the list of keys, use ${!list_l2[@]}. If you need the list of values, use ${list_l2[@]}.
Let me know if that meets your requirement.
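The script above only echoes the commands. To actually run something with each value passed as a separate argument, the stored string has to be split again. Here is a rough sketch of that step. It assumes bash 4 or later (for declare -A), uses a hypothetical function another_script in place of ./another_script.sh, and sorts the keys because bash does not guarantee any iteration order for associative arrays.

```shell
#!/usr/bin/env bash

# another_script (hypothetical): stands in for ./another_script.sh.
another_script() { printf '%d args: %s\n' "$#" "$*"; }

# Same content as list_l2 after the parsing loop in the answer.
declare -A list_l2=([a]='e f' [b]='1' [c]='5 g' [d]='')

# call_all (hypothetical): call another_script once per key.
call_all() {
  local key
  local -a args
  while IFS= read -r key; do
    # Split the space-joined value back into separate arguments.
    read -ra args <<< "${list_l2[$key]}"
    another_script "${args[@]}"
  done < <(printf '%s\n' "${!list_l2[@]}" | sort)
}

call_all
```

Splitting on whitespace like this only works when the individual values never contain spaces themselves.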

Can't write character array to file in Powershell

OK, Powershell may not be the best tool for the job but it's the only one available to me.
I have a bunch of 600K+ row .csv data files. Some of them have delimiter errors, e.g. " in the middle of a text field or "" at the start of one. They are too big to edit (even in UltraEdit) and fix manually, even if I wanted to, which I don't!
Because of the double "" delimiter at the start of some text fields and the rogue " delimiter in the middle of others, I haven't used a header row to define the columns: these rows appear to have an extra column because of the extra delimiter.
I need to parse the file looking for "" instead of " at the start of a text-field and also to look for " in the middle of a text field and remove them.
I have managed to write the code to do this (after a fashion) by basically reading the whole file into an array, looping through it and adding output characters to an output array.
What I haven't managed to do is successfully write this output array to a file.
I have read every part of https://learn.microsoft.com/en-us/powershell/module/Microsoft.PowerShell.Utility/out-file?view=powershell-5.1 that seemed relevant. I've also trawled through about 10 similar questions on this site and attempted various code gleaned from them.
The output array prints perfectly to screen using Write-Host, but I can't get the data back into a file for love or money. I have a total of 1.5 days of PowerShell experience so far! All suggestions gratefully received.
Here is my code to read/identify rogue delimiters (not pretty (at all), refer previous explanation of data and available technology constraints):
$ContentToCheck=get-content 'myfile.csv' | foreach { $_.ToCharArray()}
$ContentOutputArray=@()
for ($i = 0; $i -lt $ContentToCheck.count; $i++)
{
if (!($ContentToCheck[$i] -match '"')) {#not a quote
if (!($ContentToCheck[$i] -match ',')) {#not a comma i.e. other char that could be enclosed in ""
if ($ContentToCheck[$i-1] -match '"' ) {#check not rogue " delimiter in previous char allow for start of file exception i>1?
if (!($ContentToCheck[$i-2] -match ',') -and !($ContentToCheck[$i-3] -match '"')){
Write-Host 'Delimiter error' $i
$ContentOutputArray+= ''
}#endif not preceded by ",
}#endif"
else{#previous char not a " so move on
$ContentOutputArray+= $ContentToCheck[$i]
}
}#endifnotacomma
else
{#a comma, include it
$ContentOutputArray+= $ContentToCheck[$i]
}#endacomma
}#endifnotaquote
else
{#a quote so just append it to the output array
$ContentOutputArray+= $ContentToCheck[$i]
}#endaquote
}#endfor
So far so good, if inelegant. if I do a simple
Write-Host $ContentOutputArray
data displays nicely " 6 5 " , " 652 | | 999 " , " 99 " , " " , " 678 | | 1 " ..... furthermore when I check the size of the array (based on a cut-down version of one of the problem files)
$ContentOutputArray.count
I get 2507 character length of array. Happy out. However, then variously using:
$ContentOutputArray | Set-Content 'myfile_FIXED.csv'
creates blank file
$ContentOutputArray | out-file 'myfile_FIXED.csv' -encoding ASCII
creates blank file
$ContentOutputArray | export-csv 'myfile_FIXED.csv'
gives only '#TYPE System.Char' in file
$ContentOutputArray | Export-Csv 'myfile_FIXED.csv' -NoType
gives empty file
$ContentOutputArray >> 'myfile_FIXED.csv'
gives blanks separated by ,
What else can I try to write an array of characters to a flat file? It seems such a basic question but it has me stumped. Thanks for reading.
Convert (or cast) the char array to a string before exporting it.
(New-Object string (,$ContentOutputArray)) | Set-Content myfile_FIXED.csv

PowerShell regex to extract SID from filename

I have an array $vhdlist with contents similar to the following filenames:
UVHD-S-1-5-21-8746256374-654813465-374012747-4533.vhdx
UVHD-S-1-5-21-8746256374-654813465-374012747-6175.vhdx
UVHD-S-1-5-21-8746256374-654813465-374012747-8147.vhdx
UVHD-template.vhdx
I want to use a regex and be left with an array containing only SID portion of the filenames.
I am using the following:
$sids = foreach ($file in $vhdlist)
{
[regex]::split($file, '^UVHD-(?:([(\d)(\w)-]+)).vhdx$')
}
There are 2 problems with this: in the resulting array there are 3 blank lines for every SID; and the "template" filename matches (the resulting line in the output is just "template"). How can I get an array of SIDs as the output and not include the "template" line?
You seem to want to filter the list down to those filenames that contain an SID. Filtering is done with Where-Object (where for short); you don't need a loop.
An SID could be described as "S- and then a bunch of digits and dashes" for this simple case. That leaves us with ^UVHD-S-[\d-]*\.vhdx$ for the filename.
In combination we get:
$vhdlist | where { $_ -Match "^UVHD-S-[\d-]*\.vhdx$" }
When you don't really have an array of strings, but actually an array of files, use them directly.
dir C:\some\folder | where { $_.Name -Match "^UVHD-S-[\d-]*\.vhdx$" }
Or, possibly you can even make it as simple as:
dir C:\some\folder\UVHD-S-*.vhdx
EDIT
Extracting the SIDs from a list of strings can be thought of as a combined transformation (for each element, extract the SID) and filter (remove non-matches) operation.
PowerShell's ForEach-Object cmdlet (foreach for short) works like map() in other languages. It takes every input element and returns a new value. In effect it transforms a list of input elements into output elements. Together with the -replace operator you can extract SIDs this way.
$vhdlist | foreach { $_ -replace "^(?:UVHD-(S-[\d-]*)\.vhdx|.*)$","`$1" } | where { $_ -gt "" }
The regex back-reference for .NET languages is $1. The $ is a special character in PowerShell strings, so it needs to be escaped, except when there is no ambiguity. The backtick is the PS escape character. You can escape the $ in the regex as well, but there it's not necessary.
As a final step we use where to remove empty strings (i.e. non-matches). Doing it this way around means we only need to apply the regex once, instead of two times when filtering first and replacing second.
PowerShell operators can also work on lists directly. So the above could even be shortened:
$vhdlist -replace "^(?:UVHD-(S-[\d-]*)\.vhdx|.*)$","`$1" | where { $_ -gt "" }
The shorter version only works on lists of actual strings or objects that produce the right thing when .ToString() is called on them.
Regex breakdown:
^ # start-of-string anchor
(?: # begin non-capturing group (either...)
UVHD- # 'UVHD-'
( # begin group 1
S-[\d-]* # 'S-' and however many digits and dashes
) # end group 1
\.vhdx # '.vhdx'
| # ...or...
.* # anything else
) # end non-capturing group
$ # end-of-string anchor
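As a cross-check outside PowerShell, the same extract-and-filter idea can be sketched in bash with the =~ operator. This is an illustration only: extract_sids is a hypothetical function name, and [0-9-] replaces [\d-] because bash uses POSIX extended regular expressions, which have no \d.

```shell
#!/usr/bin/env bash

# extract_sids (hypothetical): print the SID of every name that matches,
# and print nothing for names that do not (such as the template file).
extract_sids() {
  local name
  local re='^UVHD-(S-[0-9-]*)\.vhdx$'
  for name in "$@"; do
    if [[ $name =~ $re ]]; then
      printf '%s\n' "${BASH_REMATCH[1]}"   # capture group 1 holds the SID
    fi
  done
}

extract_sids \
  'UVHD-S-1-5-21-8746256374-654813465-374012747-4533.vhdx' \
  'UVHD-template.vhdx'
```

Here the filter is the if test itself, so no "or anything else" alternative is needed in the pattern.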

Remove all vowels from a file name using shell script

Current code:
find . -depth | \
while read LONG; do
SHORT=$( basename "$LONG" | tr '[aeiou]' '[ ]' )
DIR=$( dirname "$LONG" )
if [ "${LONG}" != "${DIR}/${SHORT}" ]; then
mv "${LONG}" "${DIR}/${SHORT}"
fi
done
So if I have files like aaa abc bdf I get the files ' ' ' bc' 'bdf'
The way I want this to work is to return 'aaa' 'bc' 'bdf'.
(Completely remove the a from the second file, and if all the characters, excluding the file extension, are vowels, ignore it.)
I think the two problems with your solution are:
You're substituting a space for each vowel. Shouldn't you substitute an empty string, i.e. delete the vowels?
Then you need to test if SHORT is empty. If it is, discard it, perhaps by assigning the original name back to SHORT.
Remove all vowels:
tr -d aeiou
Ignore if basename (excluding the file extension) is only vowels:
case $SHORT in ''|.*) continue;; esac
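Put together, the two snippets above amount to something like the following sketch. strip_vowels is a hypothetical helper name, and instead of continue (which only makes sense inside the loop) it prints the original name when the result would be empty or start with a dot. Like the original, it handles lowercase vowels only, and vowels in the extension are removed too.

```shell
#!/usr/bin/env bash

# strip_vowels (hypothetical): print the name with lowercase vowels removed,
# or the unchanged name if nothing but an extension (or nothing at all)
# would be left.
strip_vowels() {
  local name=$1 short
  short=$(printf '%s' "$name" | tr -d 'aeiou')
  case $short in
    ''|.*) printf '%s\n' "$name" ;;   # only vowels before the extension: keep as is
    *)     printf '%s\n' "$short" ;;
  esac
}

for f in aaa abc bdf; do
  strip_vowels "$f"
done
```

Note that the .* branch also leaves hidden files such as .bashrc untouched, since their stripped name still starts with a dot.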
