Bulk regex removals against large array very slow in PowerShell

I am trying to find the quickest / most efficient way to run many regex removals against an array.
My $hosts array contains tens of thousands of individual items, in domain format. E.g:
test.domain.xyz
domain.xyz
something.com
anotherdomain.net
My $local_regex array contains ~1000 individual regexes, in multi-line format. E.g:
^ad. (ad.*)
domain.xyz$ (*domain.xyz)
I am currently trying to exclude any regex matches in the following way, but it is EXTREMELY slow with a large array and many regexes to match:
Function Regex-Remove
{
Param
(
[Parameter(Mandatory=$true)]
$local_regex,
[Parameter(Mandatory=$true)]
$hosts
)
# Loop through each regex and select only non-matching items
foreach($regex in $local_regex)
{
# Multi line, case insensitive
$regex = "(?im)$regex"
# Select hosts that do not match regex
$hosts = $hosts -notmatch $regex
}
return $hosts
}
Is there a better way to do this?

Reassigning a large array is costly. An array's size is fixed, so producing a filtered copy means allocating a new array and copying the surviving elements into it. With, say, 10 000 hostnames and 1 000 regexes, that adds up to as many as 10 000 000 element copies, which will have a measurable effect. You can use the Measure-Command cmdlet to time how long each approach takes.
As an alternative, index into the array and overwrite the unwanted values with $null, like so:
foreach($regex in $local_regex) {
$regex = "(?im)$regex"
for($i=0;$i -lt $hosts.length; ++$i) {
if( $hosts[$i] -match $regex) {
$hosts[$i] = $null
}
}
}
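To check whether the in-place approach actually helps on your data, it can be wrapped in Measure-Command. The following is a minimal sketch, not a benchmark: the $hosts and $local_regex values are small made-up stand-ins for the real lists, and the final Where-Object pass (an addition, not part of the loop above) drops the $null placeholders once at the end.

```powershell
# Hypothetical sample data; the real arrays would be far larger.
$hosts       = @('test.domain.xyz', 'domain.xyz', 'something.com', 'ad.example.net')
$local_regex = @('^ad\.', 'domain\.xyz$')

$elapsed = Measure-Command {
    foreach ($regex in $local_regex) {
        $regex = "(?im)$regex"
        for ($i = 0; $i -lt $hosts.Length; ++$i) {
            # Skip entries that an earlier regex already blanked out
            if ($null -ne $hosts[$i] -and $hosts[$i] -match $regex) {
                $hosts[$i] = $null
            }
        }
    }
}

# Drop the $null placeholders in a single pass at the end
$remaining = @($hosts | Where-Object { $null -ne $_ })

$remaining
"Elapsed: $($elapsed.TotalMilliseconds) ms"
```

With this sample data only something.com survives; the elapsed time will vary from run to run.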

You can use a System.Collections.ArrayList object instead of an array. This should make the process much faster, because it has methods to add and remove items without rebuilding the whole array:
$var = New-Object System.Collections.ArrayList
$var.Add()
$var.AddRange()
$var.Remove()
$var.RemoveRange()
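The method names above are listed without arguments. As a rough illustration of how they are called, here is a small sketch; the list contents are invented sample values.

```powershell
# Build the list once from an existing array (sample values are made up)
$list = [System.Collections.ArrayList]@('test.domain.xyz', 'domain.xyz', 'something.com')

# Add() returns the new item's index; cast to [void] to keep it out of the output
[void]$list.Add('anotherdomain.net')

# AddRange() appends a whole collection at once
$list.AddRange(@('a.example', 'b.example'))

# Remove() deletes the first occurrence of a value
$list.Remove('domain.xyz')

# RemoveRange(index, count) deletes a run of items by position
$list.RemoveRange(0, 1)

$list.Count
```

Note that Remove() only removes the first occurrence, so a value that appears several times has to be removed repeatedly.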

As suggested by @Roberto, I switched the $hosts array to a System.Collections.ArrayList (created with New-Object).
The ability to remove from the ArrayList on the fly is exactly what I needed, and the while loop makes sure to remove duplicate values.
Function Regex-Remove
{
Param
(
[Parameter(Mandatory=$true)]
$local_regex,
[Parameter(Mandatory=$true)]
$hosts
)
# Loop through each regex and remove the matching items
foreach($regex in $local_regex)
{
# Case insensitive
$regex = "(?i)$regex"
# Find the hosts that match the regex and remove every occurrence of each
$hosts -match $regex | % {
while($hosts.Contains($_))
{
$hosts.Remove($_)
}
}
}
return $hosts
}

Related

String comparison in PowerShell doesn't seem to work

I am trying to compare strings from one array to every string from the other. The current method works, as I have tested it with simpler variables and it retrieves them correctly. However, the strings that I need it to compare don't seem to be affected by the comparison.
The method is the following:
$counter = 0
for ($num1 = 0; $num1 -le $serverid_oc_arr.Length; $num1++) {
for ($num2 = 0; $num2 -le $moss_serverid_arr.Length; $num2++) {
if ($serverid_oc_arr[$num1] -eq $moss_serverid_arr[$num2]) {
break
}
else {
$counter += 1
}
if ($counter -eq $moss_serverid_arr.Length) {
$unmatching_serverids += $serverid_oc_arr[$num1]
$counter = 0
break
}
}
}
For each string in the first array it is iterating between all strings in the second and comparing them. If it locates equality, it breaks and goes to the next string in the first array. If it doesn't, for each inequality it is adding to the counter and whenever the counter hits the length of the second array (meaning no equality has been located in the second array) it adds the corresponding string to a third array that is supposed to retrieve all strings that don't match to anything in the second array in the end. Then the counter is set to 0 again and it breaks so that it can go to the next string from the first array.
This is all okay in theory and also works in practice with simpler strings, however the strings that I need it to work with are server IDs and look like this:
289101b4-3e6c-4495-9c67-f317589ba92c
Hence, the script seems to completely ignore the comparison and just puts all the strings from the first array into the third one and retrieves them at the end (sometimes also some random strings from both first and second array).
Another method I tried with similar results was:
$unmatching_serverids = $serverid_oc_arr | Where {$moss_serverid_arr -NotContains $_}
Can anyone spot any mistake that I may be making anywhere?

The main issue with your code is the use of -le instead of -lt. Collection indexes start at 0, so the last valid index is Length - 1 (or Count - 1); with -le the loops also read the element at index Length, which does not exist and evaluates to $null. This was causing $null values to be added to your result collection.
In addition, $counter was never reset when the following condition was not met:
if ($counter -eq $moss_serverid_arr.Length) { ... }
You would need to add a $counter = 0 before the inner for loop to prevent this.
Below is the same algorithm, slightly improved, along with a test case showing that it works.
$ran = [random]::new()
$ref = [System.Collections.Generic.HashSet[int]]::new()
# generate a 100 GUID collection
$arr1 = 1..100 | ForEach-Object { [guid]::NewGuid() }
# pick 90 unique GUIDs from arr1
$arr2 = 1..90 | ForEach-Object {
do {
$i = $ran.Next($arr1.Count)
} until($ref.Add($i))
$arr1[$i]
}
$result = foreach($i in $arr1) {
$found = foreach($z in $arr2) {
if ($i -eq $z) {
$true; break
}
}
if(-not $found) { $i }
}
$result.Count -eq 10 # => Must be `$true`
As an aside, the above could be reduced to the following using the .ExceptWith(..) method of the HashSet<T> class:
$hash = [System.Collections.Generic.HashSet[guid]]::new([guid[]]$arr1)
$hash.ExceptWith([guid[]]$arr2)
$hash.Count -eq 10 # => `$true`
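For comparison, here is a sketch of the original nested loop with the two fixes described in this answer applied: -lt in both loop conditions and a counter reset before each inner loop. The counter check has also been moved after the inner loop, which is a small restructuring of my own. The server IDs are invented sample values.

```powershell
# Invented sample data standing in for the real server ID arrays
$serverid_oc_arr   = @('289101b4-3e6c-4495-9c67-f317589ba92c', 'aaaaaaaa-0000-0000-0000-000000000001')
$moss_serverid_arr = @('289101b4-3e6c-4495-9c67-f317589ba92c', 'bbbbbbbb-0000-0000-0000-000000000002')

$unmatching_serverids = @()
for ($num1 = 0; $num1 -lt $serverid_oc_arr.Length; $num1++) {
    $counter = 0   # reset for every element of the first array
    for ($num2 = 0; $num2 -lt $moss_serverid_arr.Length; $num2++) {
        if ($serverid_oc_arr[$num1] -eq $moss_serverid_arr[$num2]) {
            break
        }
        $counter += 1
    }
    # No match was found if every comparison failed
    if ($counter -eq $moss_serverid_arr.Length) {
        $unmatching_serverids += $serverid_oc_arr[$num1]
    }
}
$unmatching_serverids
```

With this sample data, only the second ID from the first array ends up in $unmatching_serverids.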

The working answer that I found for this is the below:
$unmatching_serverids = @()
foreach ($item in $serverid_oc_arr)
{
if ($moss_serverid_arr -NotContains $item)
{
$unmatching_serverids += $item
}
}
No obvious differences can be seen between it and the other methods (especially for the for-loop, this is just a simplified variant), but somehow this works correctly.

Removing Strings if Substring of the String is Present in Same Array

Noob here.
I'm trying to pare down a list of domains by eliminating all subdomains if the parent domain is present in the list. I've managed to cobble together a script that somewhat does this with PowerShell after some searching and reading. The output is not exactly what I want, but will work OK. The problem with my solution is that it takes so long to run because of the size of my initial list (tens of thousands of entries).
UPDATE: I've updated my example to clarify my question.
Example "parent.txt" list:
adk2.co
adk2.com
adobe.com
helpx.adobe.com
manage.com
list-manage.com
graph.facebook.com
Example output "repeats.txt" file:
adk2.com (different top level domain than adk2.co but that's ok)
helpx.adobe.com
list-manage.com (not subdomain of manage.com but that's ok)
I would then take and eliminate the repeats from the parent, leaving a list of "unique" subdomains and domains. I have this in a separate script.
Example final list with my current script:
adk2.co
adobe.com
manage.com
graph.facebook.com (it's not facebook.com because facebook.com wasn't in the original list.)
Ideal final list:
adk2.co
adk2.com (since adk2.co and adk2.com are actually distinct domains)
adobe.com
manage.com
graph.facebook.com
Below is my code:
I've taken my hosts list (parent.txt) and checked it against itself, and spit out any matches into a new file.
$parent = Get-Content("parent.txt")
$hosts = Get-Content("parent.txt")
$repeats = @()
$out_file = "$PSScriptRoot\repeats.txt"
$hosts | where {
$found = $FALSE
foreach($domains in $parent){
if($_.Contains($domains) -and $_ -ne $domains){
$found = $TRUE
$repeats += $_
}
if($found -eq $TRUE){
break
}
}
$found
}
$repeats = $repeats -join "`n"
[System.IO.File]::WriteAllText($out_file,$repeats)
This seems like a really inefficient way to do it since I'm going through each element of the array. Any suggestions on how to best optimize this? I have some ideas like putting more conditions on what elements to check and check against, but I feel like there's a drastically different approach that would be far better.

First, a solution based strictly on shared domain names (e.g., helpx.adobe.com and adobe.com are considered to belong to the same domain, but list-manage.com and manage.com are not).
This is not what you asked for, but perhaps more useful to future readers:
Get-Content parent.txt | Sort-Object -Unique { ($_ -split '\.')[-2,-1] -join '.' }
Assuming list.manage.com rather than list-manage.com in your sample input, the above command yields:
adk2.co
adk2.com
adobe.com
graph.facebook.com
manage.com
{ ($_ -split '\.')[-2,-1] -join '.' } sorts the input lines by their last 2 domain components (e.g., adobe.com).
-Unique discards duplicates.
A shared-suffix solution, as requested:
# Helper function for (naively) reversing a string.
# Note: Does not work properly with Unicode combining characters
# and surrogate pairs.
function reverse($str) { $a = $str.ToCharArray(); [Array]::Reverse($a); -join $a }
# * Sort the reversed input lines, which effectively groups them by shared suffix
# with the shortest entry first (e.g., the reverse of 'manage.com' before the
# reverse of 'list-manage.com').
# * It is then sufficient to output only the first entry in each group, using
# wildcard matching with -notlike to determine group boundaries.
# * Finally, sort the re-reversed results.
Get-Content parent.txt | ForEach-Object { reverse $_ } | Sort-Object |
ForEach-Object { $prev = $null } {
if ($null -eq $prev -or $_ -notlike "$prev*" ) {
reverse $_
$prev = $_
}
} | Sort-Object
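A third option, which is neither of the two solutions above, is to treat an entry as a repeat only if one of its proper parent domains is itself in the list. The sketch below assumes that this is the desired rule and that PowerShell 5 or later is available for the ::new() syntax; the variable names are illustrative. Unlike the shared-suffix solution, it keeps list-manage.com, because that is not a subdomain of manage.com.

```powershell
# Sample input taken from the question
$hosts = @('adk2.co', 'adk2.com', 'adobe.com', 'helpx.adobe.com',
           'manage.com', 'list-manage.com', 'graph.facebook.com')

# Case-insensitive set of all entries for fast lookups
$known = [System.Collections.Generic.HashSet[string]]::new([string[]]$hosts, [System.StringComparer]::OrdinalIgnoreCase)

$kept = foreach ($h in $hosts) {
    $labels = $h.Split('.')
    $hasParent = $false
    # Try every proper parent: strip one leading label at a time,
    # stopping before the bare top-level label
    for ($i = 1; $i -lt $labels.Length - 1; $i++) {
        $candidate = $labels[$i..($labels.Length - 1)] -join '.'
        if ($known.Contains($candidate)) { $hasParent = $true; break }
    }
    if (-not $hasParent) { $h }
}
$kept
```

Each entry costs only a handful of set lookups, so the work grows roughly linearly with the size of the list instead of comparing every entry against every other one.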

One approach is to use a hash table to store all your parent values, then for each repeat, remove it from the table. The value 1 when adding to the hash table does not matter since we only test for existence of the key.
$parent = @(
'adk2.co',
'adk2.com',
'adobe.com',
'helpx.adobe.com',
'manage.com',
'list-manage.com'
)
$repeats = (
'adk2.com',
'helpx.adobe.com',
'list-manage.com'
)
$domains = @{}
$parent | % {$domains.Add($_, 1)}
$repeats | % {if ($domains.ContainsKey($_)) {$domains.Remove($_)}}
$domains.Keys | Sort

Validate members of array

I have a string I am pulling from XML that SHOULD contain comma separated integer values. Currently I am using this to convert the string to an array and test each member of the array to see if it is an Int. Ultimately I still want an array in the end, as I also have an array of default success codes and I want to combine them. That said, I have never found this pattern of setting the test condition true then looping and potentially setting it to false to be all that elegant. So, I am wondering if there is a better approach. I mean, this works, and it's fast, and the code is easy to read, so in a sense there is no reason to change it, but if there is a better way...
$supplamentalSuccessCode = ($string.Split(',')).Trim()
$validSupplamentalSuccessCode = $true
foreach ($code in $supplamentalSuccessCode) {
if ($code -as [int] -isNot [int]) {
$validSupplamentalSuccessCode = $false
}
}
EDIT: To clarify, this example is fairly specific, but I am curious about a more generic solution. So imagine the array could contain values that need to be checked against a lookup table, or local drive paths that need to be checked with Test-Path. So more generically, I wonder if there is a better solution than the Set variable true, foreach, if test fails set variable false logic.
Also, I have played with a While loop, but in most situations I want to find ALL bad values, not exit validation on the first bad one, so I can provide the user with a complete error in a log. Thus the ForEach loop approach I have been using.

In PSv4+ you can enlist the help of the .Where() collection "operator" to determine all invalid values:
Here's a simplified example:
# Sample input.
$string = '10, no, 20, stillno, -1'
# Split the list into an array.
$codes = ($string.Split(',')).Trim()
# Test all array members with a script block passed to .Where()
# As usual, $_ refers to the element at hand.
# You can perform whatever validation is necessary inside the block.
$invalidCodes = $codes.Where({ $null -eq ($_ -as [int]) })
$invalidCodes # output the invalid codes, if any
The above yields:
no
stillno
Note that what .Where() returns is not a regular PowerShell array ([object[]]), but an instance of [System.Collections.ObjectModel.Collection[PSObject]], but in most situations the difference shouldn't matter.
A PSv2-compatible solution is a bit more cumbersome:
# Sample input.
$string = '10, no, 20, stillno, -1'
# Split the list into an array.
# Note: In PSv3 you could use the simpler $codes = ($string.Split(',')).Trim()
# as in the PSv4+ solution.
$codes = foreach ($code in $string.Split(',')) { $code.Trim() }
# Emulate the behavior of .Where() with a foreach loop:
# Note that you do get an [object[]] instance back this time.
$invalidCodes = foreach ($code in $codes) { if ($null -eq ($code -as [int])) { $code } }
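Since the question also wants to keep the valid values and merge them with a set of defaults, the PSv4+ .Where() method can be called in its 'Split' mode, which returns the passing and the failing elements in one call. This is a sketch only; $defaultSuccessCodes and its values are hypothetical.

```powershell
# Sample input, as in the examples above
$string = '10, no, 20, stillno, -1'
$codes  = ($string.Split(',')).Trim()

# 'Split' mode returns two collections: elements that pass the test, then those that fail
$validCodes, $invalidCodes = $codes.Where({ $null -ne ($_ -as [int]) }, 'Split')

# Hypothetical default success codes to merge with the validated ones
$defaultSuccessCodes = @(0, 3010)
$allSuccessCodes = $defaultSuccessCodes + @($validCodes | ForEach-Object { [int]$_ })

# Report every bad value at once, e.g. for a log entry
if ($invalidCodes.Count -gt 0) {
    "Invalid codes: $($invalidCodes -join ', ')"
}
$allSuccessCodes
```

The validity check is then just $invalidCodes.Count -eq 0, so no separate flag variable is needed.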

PowerShell - Create an array that ignores duplicate values

Curious if there is a construct in PowerShell that does this?
I know you can do this:
$arr = @(1,1,1,1,2,2,2,2,3,3,3,3,3,4,4,4,4,4)
$arr = $arr | Get-Unique
But seems like performance-wise it would be better to ignore the value as you are entering it into the array instead of filtering out after the fact.

If you are inserting a large number of items into an array (thousands), performance does drop, because the array has to be recreated every time you add to it. In your case it may be better, performance-wise, to use something else.
A Dictionary or Hashtable could be a way. Your one-dimensional unique array can then be retrieved with $hash.Keys. For example:
$hash = @{}
$hash.Set_Item(1,1)
$hash.Set_Item(2,1)
$hash.Set_Item(1,1)
$hash.Keys
1
2
If you use Set_Item, the key will be created or updated but never duplicated. Put anything for the value if you're not using it, but you may find a use for the value in your problem too.
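If no value is needed at all, a HashSet is another way to get the same "ignore duplicates on insert" behavior. This is a minimal sketch, assuming PowerShell 5 or later for the ::new() syntax:

```powershell
# A HashSet stores each value at most once
$set = [System.Collections.Generic.HashSet[int]]::new()

foreach ($n in 1,1,1,1,2,2,2,2,3,3,3,3,3,4,4,4,4,4) {
    # Add() returns $true for a new value and $false for a duplicate;
    # [void] keeps that return value out of the output
    [void]$set.Add($n)
}

$set.Count
```

The set ends up with the four distinct values, and duplicates are rejected at insertion time rather than filtered afterwards.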

You could also use an ArrayList:
Measure-Command -Expression {
$bigarray = $null
$bigarray = [System.Collections.ArrayList]@()
# Note: this assignment replaces the ArrayList with a plain array
$bigarray = (1,1,1,1,2,2,2,2,3,3,3,3,3,4,4,4,4,4)
$bigarray | select -Unique
}
Time passed:
TotalSeconds : 0,0006581
TotalMilliseconds : 0,6581
Measure-Command -Expression {
$array = @(1,1,1,1,2,2,2,2,3,3,3,3,3,4,4,4,4,4)
$array | select -Unique
}
Time passed:
TotalSeconds : 0,0009261
TotalMilliseconds : 0,9261

Powershell Leading zeros are trimmed in array using Measure-Object cmdlet

When using Powershell to find out the maximum or minimum value in a string array, the leading zeros of the outcome string are trimmed. How to retain the zeros?
$arr = @("0001", "0002", "0003")
($arr | Measure-Object -Maximum).Maximum
>>> 3

Enumerating the array is the fastest method:
$max = ''
foreach ($el in $arr) {
if ($el -gt $max) {
$max = $el
}
}
$max
Or use SortedSet from the .NET 4 framework (built in since Windows 8). It is about two times faster than Measure-Object but two times slower than the manual enumeration above. It might still be useful if you want to sort the data without duplicates quickly, since it's faster than the built-in Sort-Object.
([Collections.Generic.SortedSet[string]]$arr).max
Obviously, it'll allocate some memory for the array index, but not the actual data as it'll be reused from the existing array. If you're concerned about it, just force garbage collection with [gc]::Collect()

Try this:
$arr = @("0001", "0002", "0003")
$arr | sort -Descending | select -First 1
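Plain string sorting gives the numeric maximum only when every value is zero-padded to the same width. If that is not guaranteed, one possible variation is to sort by the numeric value while still returning the original string. The mixed-width sample values below are made up for illustration.

```powershell
# Made-up values with different widths
$arr = @('0001', '010', '9')

# Sort by the integer value, but output the original strings unchanged
$max = $arr | Sort-Object -Property { [int]$_ } -Descending | Select-Object -First 1

$max   # the original string, leading zeros intact
```

Here the result is 010, whereas a plain descending string sort would have picked 9.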
