Efficient way to remove duplicates from large 2D arrays in PowerShell

I have a large data set, roughly 10 million items, that I need to process quickly, removing duplicate items based on two of the six columns.
I have tried grouping and sorting the items, but it's horrendously slow.
$p1 = $test | Group-Object -Property ComputerSerialID,ComputerID
$p2 = foreach ($group in $p1) {
    $group.Group | Sort-Object -Property FirstObserved | Select-Object -First 1
}
The goal is to remove duplicates by comparing two columns while keeping the oldest record based on FirstObserved.
The data looks something like this:
LastObserved : 2019-06-05T15:40:37
FirstObserved : 2019-06-03T20:29:01
ComputerName : 1
ComputerID : 2
Virtual : 3
ComputerSerialID : 4
LastObserved : 2019-06-05T15:40:37
FirstObserved : 2019-06-03T20:29:01
ComputerName : 5
ComputerID : 6
Virtual : 7
ComputerSerialID : 8
LastObserved : 2019-06-05T15:40:37
FirstObserved : 2019-06-03T20:29:01
ComputerName : 9
ComputerID : 10
Virtual : 11
ComputerSerialID : 12

Your question is a little hard to read, so you may want to clean it up, but I'll answer as best I can based on what I understand you're trying to do.
Unfortunately, with this much data there is no way to make this fast. String comparison and sorting are done by brute force: comparing two strings means checking them character by character, and that cost cannot be reduced any further.
(Honestly, if this were me, I'd just use Export-Csv on $object and perform this operation in Excel. For a one-off task, the time spent scripting it just wouldn't be worth it.)
By "items" I'm going to assume that you mean rows in your table, and that you're not trying to retrieve only particular strings from those rows. You've already got the basic idea of Select-Object down; you can apply the same idea to the whole table:
$outputFirstObserved = $inputData | Sort-Object -Property FirstObserved -Unique
$outputLastObserved = $inputData | Sort-Object -Property LastObserved -Unique
Now you have ~20 million rows in memory, but I guess that beats doing it by hand. All that's left is to join the two tables. You can download the Join-Object command from the PowerShell Gallery with Install-Script -Name Join and use it in the way described. If you want to do this step yourself, the easiest way is to concatenate the two tables and sort them again:
$output = $outputFirstObserved + $outputLastObserved
$return = $output | Sort-Object | Get-Unique

Does this do it? It keeps the one it finds first.
$test | sort -u ComputerSerialID, ComputerID

I created this function to de-duplicate my multi-dimensional arrays.
Basically, I concatenate the contents of the record, add this to a hash.
If the concatenate text already exists in the hash, don't add it to the array to be returned.
Function DeDupe_Array
{
    param
    (
        $Data
    )
    $Return_Array = @()
    $Check_Hash = @{}
    Foreach($Line in $Data)
    {
        $Concatenated = ''
        $Elements = ($Line | Get-Member -MemberType NoteProperty | % {"$($_.Name)"})
        foreach($Element in $Elements)
        {
            $Concatenated += $line.$Element
        }
        If($Check_Hash.$Concatenated -ne 1)
        {
            $Check_Hash.add($Concatenated,1)
            $Return_Array += $Line
        }
    }
    return $Return_Array
}
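One caveat with plain concatenation: two different records can produce the same key (for example, the values 'ab','c' and 'a','bc' both concatenate to 'abc'). The following is a minimal sketch of the same idea with a separator between fields and a HashSet for the lookup. The function name Get-DistinctRecord and the '|' separator are illustrative choices, not part of the answer above, and the sketch assumes PowerShell 5 or later and that the separator never occurs in the data.

```powershell
function Get-DistinctRecord {
    param([object[]]$Data)

    # HashSet.Add() returns $false when the key is already present.
    $seen = [System.Collections.Generic.HashSet[string]]::new()
    foreach ($line in $Data) {
        # Join all property values with a separator so that
        # ('ab','c') and ('a','bc') yield different keys.
        $key = ($line.PSObject.Properties | ForEach-Object { [string]$_.Value }) -join '|'
        if ($seen.Add($key)) {
            $line
        }
    }
}

# Hypothetical sample data: the third record duplicates the first.
$sample = @(
    [pscustomobject]@{ A = 'ab'; B = 'c' }
    [pscustomobject]@{ A = 'a';  B = 'bc' }
    [pscustomobject]@{ A = 'ab'; B = 'c' }
)
$distinct = @(Get-DistinctRecord -Data $sample)
```

Emitting records to the output stream instead of appending with += also avoids copying the result array on every addition, which matters for large inputs.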

Try the following script.
It should be about as fast as it gets in PowerShell, because it avoids the pipeline entirely.
$hashT = @{}
foreach ($item in $csvData) {
    # Build the hash table key from the two columns that define a duplicate
    $key = '{0}###{1}' -f $item.ComputerSerialID, $item.ComputerID
    # Store the item if $key doesn't exist yet, OR if its FirstObserved is earlier than the stored one
    # (the string comparison is only valid when the date is in a sortable / international format)
    if ((-not $hashT.ContainsKey($key)) -or ($item.FirstObserved -lt $hashT[$key].FirstObserved)) {
        $hashT[$key] = $item
    }
}
$result = $hashT.Values
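As a quick way to check the behaviour, here is a small sketch of the same loop run against three made-up rows. The sample values and the variable name $oldest are assumptions for illustration. It also casts FirstObserved to [datetime] before comparing, which is one way to avoid depending on the timestamps being string-sortable.

```powershell
# Hypothetical sample rows; the first two share the same key columns.
$csvData = @(
    [pscustomobject]@{ ComputerSerialID = 4; ComputerID = 2; FirstObserved = '2019-06-03T20:29:01' }
    [pscustomobject]@{ ComputerSerialID = 4; ComputerID = 2; FirstObserved = '2019-06-01T08:00:00' }
    [pscustomobject]@{ ComputerSerialID = 8; ComputerID = 6; FirstObserved = '2019-06-03T20:29:01' }
)

$oldest = @{}
foreach ($item in $csvData) {
    $key = '{0}###{1}' -f $item.ComputerSerialID, $item.ComputerID
    # Keep the item if the key is new, or if it was observed earlier
    # than the item currently stored under that key.
    if ((-not $oldest.ContainsKey($key)) -or
        ([datetime]$item.FirstObserved -lt [datetime]$oldest[$key].FirstObserved)) {
        $oldest[$key] = $item
    }
}
$result = @($oldest.Values)
```

With this input, $result holds two rows, and the row kept for key 4###2 is the one first observed on 2019-06-01.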

Related

Query PSCustomObject Array for row with largest value

I'm trying to find the row with an attribute that is larger than the other row's attributes. Example:
$Array
Name Value
---- ----
test1 105
test2 101
test3 512 <--- Selects this row as it is the largest value
Here is my attempt to do this in one line, but it doesn't work.
$Array | % { If($_.value -gt $Array[0..($Array.Count)].value){write-host "$_.name is the largest row"}}
Currently it outputs nothing.
Desired Output:
"test3 is the largest row"
I'm having trouble visualizing how to do this efficiently without some serious spaghetti code.

You could take advantage of Sort-Object to rank them by the property "Value" like this:
$array = @(
    [PSCustomObject]@{Name='test1';Value=105}
    [PSCustomObject]@{Name='test2';Value=101}
    [PSCustomObject]@{Name='test3';Value=512}
)
$array | Sort-Object -Property value -Descending | Select-Object -First 1
Output
Name Value
---- -----
test3 512
To incorporate your Write-Host call, you can just run the selected object through ForEach-Object.
$array | Sort-Object -Property value -Descending |
Select-Object -First 1 | Foreach-Object {Write-host $_.name,"has the highest value"}
test3 has the highest value
Or capture to a variable
$Largest = $array | Sort-Object -Property value -Descending | Select-Object -First 1
Write-host $Largest.name,"has the highest value"
test3 has the highest value

PowerShell has many built in features to make tasks like this easier.
If this is really an array of PSCustomObjects you can do something like:
$Array =
@(
    [PSCustomObject]@{ Name = 'test1'; Value = 105 }
    [PSCustomObject]@{ Name = 'test2'; Value = 101 }
    [PSCustomObject]@{ Name = 'test3'; Value = 512 }
)
$Largest = ($Array | Sort-Object Value)[-1].Name
Write-host $Largest,"has the highest value"
This will sort your array according to the Value property. Then reference the last element using the [-1] syntax, then return the name property of that object.
Or if you're a purist you can assign the variable like:
$Largest = $Array | Sort-Object Value | Select-Object -Last 1 -ExpandProperty Name
If you want the whole object just remove .Name & -ExpandProperty Name respectively.
Update:
As noted PowerShell has some great tools to help with common tasks like sorting & selecting data. However, that doesn't mean there's never a need for looping constructs. So, I wanted to make a couple of points about the OP's own answer.
First, if you do need to reference array elements by index use a traditional For loop, which might look something like:
For( $i = 0; $i -lt $Array.Count; ++$i )
{
    If( $array[$i].Value -gt $LargestValue )
    {
        $LargestName = $array[$i].Name
        $LargestValue = $array[$i].Value
    }
}
$i is commonly used as an iteration variable, and within the script block is used as the array index.
Second, even the traditional loop is unnecessary in this case. You can stick with the ForEach loop and track the largest value as and when it's encountered. That might look something like:
ForEach( $Row in $array )
{
    If( $Row.Value -gt $LargestValue )
    {
        $LargestName = $Row.Name
        $LargestValue = $Row.Value
    }
}
Strictly speaking you don't need to assign the variables beforehand, though it may be a good practice to precede either of these with:
$LargestName = ""
$LargestValue = 0
In these examples you'd have to follow with a slightly modified Write-Host command
Write-host $LargestName,"has the highest value"
Note: Borrowed some of the test code from Doug Maurer's Fine Answer. Considering our answers were similar, this was just to make my examples more clear to the question and easier to test.
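Another option that avoids both sorting and a hand-written loop is Measure-Object, which can compute the maximum of a property directly. This is a minimal sketch using the same test data; the variable name $max is an assumption for illustration.

```powershell
$Array = @(
    [PSCustomObject]@{ Name = 'test1'; Value = 105 }
    [PSCustomObject]@{ Name = 'test2'; Value = 101 }
    [PSCustomObject]@{ Name = 'test3'; Value = 512 }
)

# Measure-Object computes the maximum without sorting the array.
$max = ($Array | Measure-Object -Property Value -Maximum).Maximum

# Where-Object returns every row with that value, so ties are kept.
$Largest = $Array | Where-Object { $_.Value -eq $max }
Write-Host $Largest.Name, "has the highest value"
```

Unlike Select-Object -First 1, this returns all rows that share the highest value, which may or may not be what you want.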

Figured it out, hopefully this isn't awful:
$Count = 1
$CurrentLargest = 0
Foreach($Row in $Array) {
# Compare This iteration vs the next to find the largest
If($Row.value -gt $Array.Value[$Count]){$CurrentLargest = $Row}
Else {$CurrentLargest = $Array[$Count]}
# Replace the existing largest value with the new one if it is larger than it.
If($CurrentLargest.Value -gt $Largest.Value){ $Largest = $CurrentLargest }
$Count += 1
}
Write-host $Largest.name,"has the highest value"
Edit: it's awful; look at the other answers for a better way.

Format an array of hashtables as table with Powershell

I wish to display an array of hashtables as a table. I found several threads dealing with this issue, but so far no solution has worked for me (see here and here).
My array consists of 18 hashtables in this form:
$Array = @(
    @{Company=1}
    @{Company=2}
    @{Company=3}
    @{Contact='X'}
    @{Contact='Y'}
    @{Contact='Z'}
    @{Country='A'}
    @{Country='B'}
    @{Country='C'}
)
I would like to get the following output:
Company Contact Country
------------------------------
1 X A
2 Y B
3 Z C
I've tried the following:
$Array | ForEach-Object { New-Object -Type PSObject -Property $_ } | Format-Table
That displays the following:
Company
----------
1
2
3
A Format-List works better; I then get this:
Company: 1 2 3
Contact: X Y Z
Country: A B C
Is there any way to accomplish my desired output?

You can do the following if you want to work with CSV data and custom objects:
$Array = @(
    @{Company=1}
    @{Company=2}
    @{Company=3}
    @{Contact='X'}
    @{Contact='Y'}
    @{Contact='Z'}
    @{Country='A'}
    @{Country='B'}
    @{Country='C'}
)
$keys = $Array.Keys | Get-Unique
$data = for ($i = 0; $i -lt $Array.($Keys[0]).Count; $i++) {
($Keys | Foreach-Object { $Array.$_[$i] }) -join ','
}
($Keys -join ','),$data | ConvertFrom-Csv | Format-Table
Explanation:
Since $Array is an array of hash tables, then the .Keys property will return all the keys of those hash tables. Since we only care about unique keys when building our object, Get-Unique is used to remove duplicates.
$Array.($Keys[0]).Count counts the number of items in a group. $Keys[0] here will be Company. So it returns the number of hash tables (3) that contain the key Company.
$Array.Company for example returns all values from the hash tables that contain the key Company. The Foreach-Object loops through each unique key's value at a particular index ($i). Once each key-value is read at a particular index, the values are joined by a comma.
When the loop completes, ($Keys -join ','),$data outputs the data in CSV format, which is piped into ConvertFrom-Csv to create a custom object.
Note: if your data contains commas, you may want to consider the alternative method below.
Alternatively, you can work with hash tables and custom objects with the following:
$Array = @(
    @{Company=1}
    @{Company=2}
    @{Company=3}
    @{Contact='X'}
    @{Contact='Y'}
    @{Contact='Z'}
    @{Country='A'}
    @{Country='B'}
    @{Country='C'}
)
$keys = $Array.Keys | Get-Unique
$data = for ($i = 0; $i -lt $Array.($Keys[0]).Count; $i++) {
    $hash = [ordered]@{}
    $Keys | Foreach-Object { $hash.Add($_,$Array.$_[$i]) }
    [pscustomobject]$hash
}
$data | Format-Table
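For readers who prefer explicit loops, the same pivot can be sketched in two passes: first collect the values of each key into a column, then build one object per row index. The variable names $columns, $rowCount and $rows are assumptions for illustration, and the sketch assumes PowerShell 5 or later and that every key has the same number of values.

```powershell
$Array = @(
    @{Company=1}, @{Company=2}, @{Company=3}
    @{Contact='X'}, @{Contact='Y'}, @{Contact='Z'}
    @{Country='A'}, @{Country='B'}, @{Country='C'}
)

# Pass 1: collect the values of each key, in order of appearance.
$columns = [ordered]@{}
foreach ($hash in $Array) {
    foreach ($key in $hash.Keys) {
        if (-not $columns.Contains($key)) {
            $columns[$key] = [System.Collections.Generic.List[object]]::new()
        }
        $columns[$key].Add($hash[$key])
    }
}

# Pass 2: row $i takes the $i-th value of every column.
$rowCount = @($columns.Values)[0].Count
$rows = for ($i = 0; $i -lt $rowCount; $i++) {
    $row = [ordered]@{}
    foreach ($key in $columns.Keys) {
        $row[$key] = $columns[$key][$i]
    }
    [pscustomobject]$row
}
$rows | Format-Table
```

Because the keys are gathered with an ordered dictionary, the column order follows the order in which the keys first appear in the input.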

Removing Strings if Substring of the String is Present in Same Array

Noob here.
I'm trying to pare down a list of domains by eliminating all subdomains if the parent domain is present in the list. I've managed to cobble together a script that somewhat does this with PowerShell after some searching and reading. The output is not exactly what I want, but will work OK. The problem with my solution is that it takes so long to run because of the size of my initial list (tens of thousands of entries).
UPDATE: I've updated my example to clarify my question.
Example "parent.txt" list:
adk2.co
adk2.com
adobe.com
helpx.adobe.com
manage.com
list-manage.com
graph.facebook.com
Example output "repeats.txt" file:
adk2.com (different top level domain than adk2.co but that's ok)
helpx.adobe.com
list-manage.com (not subdomain of manage.com but that's ok)
I would then take and eliminate the repeats from the parent, leaving a list of "unique" subdomains and domains. I have this in a separate script.
Example final list with my current script:
adk2.co
adobe.com
manage.com
graph.facebook.com (it's not facebook.com because facebook.com wasn't in the original list.)
Ideal final list:
adk2.co
adk2.com (since adk2.co and adk2.com are actually distinct domains)
adobe.com
manage.com
graph.facebook.com
Below is my code:
I've taken my hosts list (parent.txt) and checked it against itself, and spit out any matches into a new file.
$parent = Get-Content("parent.txt")
$hosts = Get-Content("parent.txt")
$repeats = @()
$out_file = "$PSScriptRoot\repeats.txt"
$hosts | where {
$found = $FALSE
foreach($domains in $parent){
if($_.Contains($domains) -and $_ -ne $domains){
$found = $TRUE
$repeats += $_
}
if($found -eq $TRUE){
break
}
}
$found
}
$repeats = $repeats -join "`n"
[System.IO.File]::WriteAllText($out_file,$repeats)
This seems like a really inefficient way to do it since I'm going through each element of the array. Any suggestions on how to best optimize this? I have some ideas like putting more conditions on what elements to check and check against, but I feel like there's a drastically different approach that would be far better.

First, a solution based strictly on shared domain names (e.g., helpx.adobe.com and adobe.com are considered to belong to the same domain, but list-manage.com and manage.com are not).
This is not what you asked for, but perhaps more useful to future readers:
Get-Content parent.txt | Sort-Object -Unique { ($_ -split '\.')[-2,-1] -join '.' }
Assuming list.manage.com rather than list-manage.com in your sample input, the above command yields:
adk2.co
adk2.com
adobe.com
graph.facebook.com
manage.com
{ ($_ -split '\.')[-2,-1] -join '.' } sorts the input lines by their last 2 domain components (e.g., adobe.com).
-Unique then discards all but one line per distinct sort key.
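To see what that sort key looks like, the script block can be evaluated on its own. This is a small sketch with three sample names; the variable names $domains and $keys are assumptions for illustration.

```powershell
$domains = 'adobe.com', 'helpx.adobe.com', 'graph.facebook.com'

# The same expression Sort-Object evaluates for each line:
# split on literal dots, keep the last two components, rejoin them.
$keys = $domains | ForEach-Object { ($_ -split '\.')[-2,-1] -join '.' }
$keys
```

The first two names map to the same key, adobe.com, which is why only one of them survives -Unique. Note that taking the last two components does not account for multi-part suffixes such as co.uk.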
A shared-suffix solution, as requested:
# Helper function for (naively) reversing a string.
# Note: Does not work properly with Unicode combining characters
# and surrogate pairs.
function reverse($str) { $a = $str.ToCharArray(); [Array]::Reverse($a); -join $a }
# * Sort the reversed input lines, which effectively groups them by shared suffix
# with the shortest entry first (e.g., the reverse of 'manage.com' before the
# reverse of 'list-manage.com').
# * It is then sufficient to output only the first entry in each group, using
# wildcard matching with -notlike to determine group boundaries.
# * Finally, sort the re-reversed results.
Get-Content parent.txt | ForEach-Object { reverse $_ } | Sort-Object |
    ForEach-Object { $prev = $null } {
        if ($null -eq $prev -or $_ -notlike "$prev*" ) {
            reverse $_
            $prev = $_
        }
    } | Sort-Object

One approach is to use a hash table to store all your parent values, then for each repeat, remove it from the table. The value 1 when adding to the hash table does not matter since we only test for existence of the key.
$parent = @(
    'adk2.co',
    'adk2.com',
    'adobe.com',
    'helpx.adobe.com',
    'manage.com',
    'list-manage.com'
)
$repeats = @(
    'adk2.com',
    'helpx.adobe.com',
    'list-manage.com'
)
$domains = @{}
$parent | % {$domains.Add($_, 1)}
$repeats | % {if ($domains.ContainsKey($_)) {$domains.Remove($_)}}
$domains.Keys | Sort
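The same lookup idea can also replace the nested loop that finds the repeats in the first place. The sketch below is an assumption-laden variant, not the approach from the question: it treats a name as a repeat only if one of its proper suffixes, cut at a dot, is itself in the list. The variable names $known and $final are illustrative, and it assumes PowerShell 5 or later.

```powershell
$parent = 'adk2.co', 'adk2.com', 'adobe.com', 'helpx.adobe.com',
          'manage.com', 'list-manage.com', 'graph.facebook.com'

# Case-insensitive set of every listed name for constant-time lookups.
$comparer = [System.StringComparer]::OrdinalIgnoreCase
$known = [System.Collections.Generic.HashSet[string]]::new([string[]]$parent, $comparer)

$final = foreach ($name in $parent) {
    $labels = $name -split '\.'
    $hasParent = $false
    # Test each proper suffix at a label boundary, e.g. for
    # 'helpx.adobe.com' test 'adobe.com' and then 'com'.
    for ($i = 1; $i -lt $labels.Count; $i++) {
        $suffix = $labels[$i..($labels.Count - 1)] -join '.'
        if ($known.Contains($suffix)) { $hasParent = $true; break }
    }
    if (-not $hasParent) { $name }
}
$final
```

Each name costs only as many lookups as it has labels, so the work grows linearly with the list instead of quadratically. With this input only helpx.adobe.com is dropped; adk2.com and list-manage.com are kept because neither is a subdomain of another entry.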

Export array with "sub-array"

I am currently trying to automate license counting in Office 365 across multiple partner tenants using PowerShell.
My current code (acquired from the internet), with some modifications, gives me this output:
Column A Column B Column C
-------- -------- --------
CustA LicA,LicB 1,3
CustB LicA,LicB 7,3
CustC LicA 4
But the output I want from this code is:
Column A Column B Column C
-------- -------- --------
CustA LicA 1
LicB 3
CustB LicA 7
LicB 3
Here is my current code which is exported using Export-Csv -NoType:
$tenantID = (Get-MsolPartnerContract).tenantid
foreach($i in $tenantID){
    $tenantName = Get-MsolPartnerInformation -TenantId $i
    $tenantLicense = Get-MsolSubscription -TenantId $i
    $properties = [ordered]@{
        'Company' = ($tenantName.PartnerCompanyName -join ',')
        'License' = ($tenantLicense.SkuPartNumber -join ',')
        'LicenseCount' = ($tenantLicense.TotalLicenses -join ',')
    }
    $obj = New-Object -TypeName psobject -Property $properties
    Write-Output $obj
}
I have tried this, along with several other iterations of the code, all of which fail catastrophically:
$properties = [ordered]@{
    'Company' = ($tenantName.PartnerCompanyName -join ','),
    @{'License' = ($tenantLicense.SkuPartNumber -join ',')},
    @{'LicenseCount' = ($tenantLicense.TotalLicenses -join',')}
}
I was thinking about making a "sub-array" of $tenantLicense.SkuPartNumber and $tenantLicense.TotalLicenses, but I'm not quite sure how to approach appending it to the object or "main-array".

A second loop over each $tenantLicense should do the trick for you. I don't have access to an environment like yours, so I cannot test this.
$tenantID | ForEach-Object{
    $tenantName = Get-MsolPartnerInformation -TenantId $_
    $tenantLicense = Get-MsolSubscription -TenantId $_
    # Make an object for each $tenantLicense
    $tenantLicense | ForEach-Object{
        $properties = [ordered]@{
            'Company' = $tenantName.PartnerCompanyName
            'License' = $_.SkuPartNumber
            'LicenseCount' = $_.TotalLicenses
        }
        # Send the new object down the pipe.
        New-Object -TypeName psobject -Property $properties
    }
}
Since you have multiple $tenantLicense objects with the same company name, we just loop over them and reuse the company name in each output object. Assuming this works, it would not match your desired output exactly, since there is no logic to omit the company in subsequent rows. I would argue that it is better this way, since you can now sort the data without losing information or context.
Notice that I changed foreach() to ForEach-Object. This makes it simpler to send objects down the pipe.

Without providing the code solution, I can say that you need to build up the array.
In programming terms, you need to iterate over the array ARRAY1 and populate another one, ARRAY2, with the extra rows. For example, if columns A and B are simple values and C is an array of 3 items, you would add 3 rows to the new table: A,B,C1; A,B,C2; and A,B,C3. On each iteration of the loop you need to calculate all the combinations, in your case the ones generated by column B and column C.
This should also be possible with pipelining using the ForEach-Object cmdlet, but that is more difficult, and since you mentioned you are relatively new to PowerShell I would not pursue this path, unless of course you want to learn.
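The row expansion described above might be sketched as follows. The input objects are made up to resemble the question's current output (no Msol cmdlets are called), and the sketch pairs the license and count values by position rather than generating every combination, which is an assumption about what the desired table needs.

```powershell
# Hypothetical input shaped like the question's current output.
$rows = @(
    [pscustomobject]@{ Company = 'CustA'; License = 'LicA,LicB'; LicenseCount = '1,3' }
    [pscustomobject]@{ Company = 'CustC'; License = 'LicA';      LicenseCount = '4' }
)

$expanded = foreach ($row in $rows) {
    # Split the joined columns back into parallel arrays.
    $licenses = @($row.License -split ',')
    $counts   = @($row.LicenseCount -split ',')
    # Emit one output row per license, repeating the company name.
    for ($i = 0; $i -lt $licenses.Count; $i++) {
        [pscustomobject]@{
            Company      = $row.Company
            License      = $licenses[$i]
            LicenseCount = [int]$counts[$i]
        }
    }
}
$expanded | Format-Table
```

CustA expands to two rows and CustC to one, so the result can be passed straight to Export-Csv -NoType.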

How can I import a .csv into multiple arrays in a simpler way?

I am new to PowerShell and am writing my first somewhat complicated script. I would like to import a .csv file and create multiple text arrays from it. I think I have found a way that will work, but it will be time-consuming to write all of the lines that I need. I assume I can do it more simply using ForEach-Object, but I can't seem to get the syntax right.
See my current code...
$vmimport = Import-Csv "gss_prod.csv"
$gssall = $vmimport | ForEach-Object {$_.vmName}
$gssweb = $vmimport | Where-Object {$_.tier -eq 'web'} | ForEach-Object {$_.vmName}
$gssapp = $vmimport | Where-Object {$_.tier -eq 'app'} | ForEach-Object {$_.vmName}
$gsssql = $vmimport | Where-Object {$_.tier -eq 'sql'} | ForEach-Object {$_.vmName}
The goal is to make 1 group with all entries containing only the vmName value, and then 3 separate groups containing only the vmName value but using the tier value to sort them.
Can anyone help me with an easier way to do this?
Thanks!

For the last three, you can group the objects by the tier property and get the result as a hashtable. Then you can reference the tier name to get its VMs.
# group objects by tier
$gs = $vmimport | Group-Object tier -AsHashTable
# get web VMs
$gs['web']
# get app VMs
$gs['app']

You may want to use a dictionary for storing the data:
$vmimport = Import-Csv "gss_prod.csv"
$gssall = $vmimport | % { $_.vmName }
$categories = "web", "app", "sql", ...
$gss = @{}
foreach ($cat in $categories) {
    $gss[$cat] = $vmimport | ? { $_.tier -eq $cat } | % { $_.vmName }
}

I like Shay Levy's way, but the values of the hash table remain hash tables. Here is another, more efficient approach where the values are jagged arrays and the categories are created automatically (unlike Ansgar Wiechers' solution):
# define hashtable
$gs = @{};
# fill it
$vmimport | foreach {$gs[$_.tier] += , $_.vmName};
# get web VMs
$gs['web'] # the result is an array of 'web' vmNames.
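The grouping and the name-only arrays can also be combined. This is a minimal sketch that groups once with Group-Object and then keeps only the vmName values per tier. The sample rows stand in for Import-Csv "gss_prod.csv" and are made up for illustration, and the member enumeration used here needs PowerShell 3 or later.

```powershell
# Hypothetical rows standing in for the imported CSV.
$vmimport = @(
    [pscustomobject]@{ vmName = 'web01'; tier = 'web' }
    [pscustomobject]@{ vmName = 'app01'; tier = 'app' }
    [pscustomobject]@{ vmName = 'web02'; tier = 'web' }
    [pscustomobject]@{ vmName = 'sql01'; tier = 'sql' }
)

# Group once, then map each tier name to an array of its vmName values.
$gss = @{}
foreach ($group in ($vmimport | Group-Object -Property tier)) {
    $gss[$group.Name] = @($group.Group.vmName)
}

# All VM names, regardless of tier.
$gssall = @($vmimport.vmName)
$gss['web']
```

New tiers in the CSV get their own entry automatically, and each value is always an array, even when a tier contains a single VM.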
