I have a CSV file (Comma Separated Values)
The file looks like this:
20171108,120909470,SO1244,12,101
20171109,122715740,AG415757,11,101
I need to obscure the data in (for example) columns 3 and, without affecting any of the other entries in the file.
I want to do this using a hashing algorithm like SHA1 or MD5, so that the same strings will resolve to the same hash values anywhere they are encountered.
I need to send data to a third party, and certain columns contain sensitive information (e.g. customer names). I need the file to be complete, and where a string is replaced, I need it to be done in the same way every time it is encountered (so that any mapping or grouping remains). It does not need military-grade encryption, just to be difficult to reverse. As I need to do this intermittently, a scripted solution would be ideal.
What is the easiest way to achieve this using a command line tool or script?
By preference, I would like a PowerShell script, since that does not require any additional software to achieve...
This question seems like a duplicate of "I need to hash (obfuscate) a column of data in a CSV file. Script preferred", but the proposed solution didn't resolve my problem and throws the following error:
You cannot call a method on a null-valued expression.
At C:\Users\mey\Hashr.ps1:4 char:5
+ $_.column3 = $_.column3.gethashcode()
The script is the following
(Import-Csv .\results.csv -delimiter ',' ) | ForEach-Object{
$_.column3 = $_.column3.gethashcode()
$_
} | Export-Csv .\myobfuscated.csv -NoTypeInformation -delimiter ','
Update:
Here's the program I am running, as proposed by @BaconBits:
function Get-StringHash {
[CmdletBinding()]
param (
[Parameter(Mandatory = $true, ValueFromPipeline = $true, Position = 0)]
[String[]]
$String,
[Parameter(Position = 1)]
[ValidateSet('SHA1', 'MD5', 'SHA256', 'SHA384', 'SHA512')]
[String]
$HashName = 'SHA256'
)
process {
$StringBuilder = [System.Text.StringBuilder]::new(128)
[System.Security.Cryptography.HashAlgorithm]::Create($HashName).ComputeHash([System.Text.Encoding]::UTF8.GetBytes($String)) | ForEach-Object {
[Void]$StringBuilder.Append($_.ToString("x2"))
}
$StringBuilder.ToString()
}
}
$csv = Import-Csv .\results.csv -delimiter ','
foreach ($line in $csv) {
$line.column1 = Get-StringHash $line.column1
}
$csv | Export-Csv .\myobfuscated.csv -NoTypeInformation -delimiter ','
The CSV file I am importing is the output of another Java program I wrote. It creates no header row; it just fills the CSV file with values.
I am getting this error:
Get-StringHash : Cannot bind argument to parameter 'String' because it is null.
Based on the doc, you're not going to want to use GetHashCode() this way:
A hash code is intended for efficient insertion and lookup in collections that are based on a hash table. A hash code is not a permanent value. For this reason:
Do not serialize hash code values or store them in databases.
Do not use the hash code as the key to retrieve an object from a keyed collection.
Do not send hash codes across application domains or processes. In some cases, hash codes may be computed on a per-process or per-application domain basis.
Do not use the hash code instead of a value returned by a cryptographic hashing function if you need a cryptographically strong hash. For cryptographic hashes, use a class derived from the System.Security.Cryptography.HashAlgorithm or System.Security.Cryptography.KeyedHashAlgorithm class.
Do not test for equality of hash codes to determine whether two objects are equal. (Unequal objects can have identical hash codes.) To test for equality, call the ReferenceEquals or Equals method.
Bullet point 4 is the main problem. There's no guarantee that the hashing isn't reversible. The hashing function used is an implementation detail, not a secure cryptographic function like SHA.
I'd use a function like this one:
function Get-StringHash {
[CmdletBinding()]
param (
[Parameter(Mandatory = $true, ValueFromPipeline = $true, Position = 0)]
[String[]]
$String,
[Parameter(Position = 1)]
[ValidateSet('SHA1', 'MD5', 'SHA256', 'SHA384', 'SHA512')]
[String]
$HashName = 'SHA256'
)
process {
$StringBuilder = [System.Text.StringBuilder]::new(128)
[System.Security.Cryptography.HashAlgorithm]::Create($HashName).ComputeHash([System.Text.Encoding]::UTF8.GetBytes($String)) | ForEach-Object {
[Void]$StringBuilder.Append($_.ToString("x2"))
}
$StringBuilder.ToString()
}
}
$csv = Import-Csv .\results.csv -delimiter ',' -Header column1,column2,column3,column4,column5
foreach ($line in $csv) {
$line.column3 = Get-StringHash $line.column3
}
$csv | Export-Csv .\myobfuscated.csv -NoTypeInformation -delimiter ','
I believe I based that function off of this one, but it's been a while since I wrote it.
Edit by LotPings to show results of hash
"column1","column2","column3","column4","column5"
"20171108","120909470","0cdd3c3acdb7cfa107286565c044c5a0f1e58268f6f10e7e3415ff84942e577d","12","101 "
"20171109","122715740","0a7fb9f6bb7a180f2fd9429b0fbd1e7b0a83597b6a64aa6a123cef3e84700fe3","11","101"
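For a header-less file like the one in the question, the same idea can be tried without touching the disk. The following is only a minimal sketch: the three sample rows are made up, `Get-Sha256Hex` is a hypothetical helper (not the function from the answer above), and `ConvertFrom-Csv -Header` stands in for `Import-Csv -Header`.

```powershell
# Hypothetical inline sample standing in for a header-less results.csv.
$raw = @(
    '20171108,120909470,SO1244,12,101'
    '20171109,122715740,AG415757,11,101'
    '20171110,130000000,SO1244,13,101'
)

# Hypothetical helper: lowercase hex SHA-256 of a string's UTF-8 bytes.
function Get-Sha256Hex([string]$Text) {
    $sha = [System.Security.Cryptography.SHA256]::Create()
    try {
        $bytes = $sha.ComputeHash([System.Text.Encoding]::UTF8.GetBytes($Text))
        # Two hex digits per byte, joined into one string.
        -join ($bytes | ForEach-Object { $_.ToString('x2') })
    }
    finally {
        $sha.Dispose()
    }
}

# -Header supplies the column names that the file itself lacks.
$rows = $raw | ConvertFrom-Csv -Header column1,column2,column3,column4,column5

# Replace only column3; every other column is left untouched.
foreach ($row in $rows) {
    $row.column3 = Get-Sha256Hex $row.column3
}

$rows | ConvertTo-Csv -NoTypeInformation
```

Rows 1 and 3 share the same original value in column3, so they end up with the same hash, which is what preserves grouping on the receiving side.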
Bacon Bits appears to have the correct methodology minus one part. The ForEach loop in your original example does not modify the original variable. Also, it appears the column you want to modify is not 'Column3', but 'Column #2' as the headers begin at zero. I'll repeat the function provided in Bacon Bits's suggestion.
function Get-StringHash {
[CmdletBinding()]
param (
[Parameter(Mandatory = $true, ValueFromPipeline = $true, Position = 0)]
[String[]]
$String,
[Parameter(Position = 1)]
[ValidateSet('SHA1', 'MD5', 'SHA256', 'SHA384', 'SHA512')]
[String]
$HashName = 'SHA256'
)
process {
$StringBuilder = [System.Text.StringBuilder]::new(128)
[System.Security.Cryptography.HashAlgorithm]::Create($HashName).ComputeHash([System.Text.Encoding]::UTF8.GetBytes($String)) | ForEach-Object {
[Void]$StringBuilder.Append($_.ToString("x2"))
}
$StringBuilder.ToString()
}
}
I would suggest for the substitution:
$csv = Import-Csv .\results.csv | Select-Object *,@{n='Column #2';e={Get-StringHash $_.'Column #2'}} -ExcludeProperty 'Column #2'
$csv | Export-Csv .\myobfuscated.csv -NoTypeInformation
This will put the 'Column #2' last in the CSV. You can simply list them explicitly if you need it to appear in the same order, e.g.:
Select-Object 'Column #0','Column #1',@{n='Column #2';e={Get-StringHash $_.'Column #2'}},'Column #3'
Related
My array has a lot of properties but I'm only looking to edit one. The goal is to remove domains from the hostname but I haven't been able to get it working. This data is being returned from a REST API and there are thousands of assets that contain the same type of data (JSON content). The end goal is to compare assets pulled from the API with assets in a CSV file. The issue is that the domain may appear on one list and not the other, so I'm trying to strip the domains off for comparison. I didn't want to iterate through both files, and the comparison has to go from the CSV file to the API data, hence the need to get rid of the domain altogether.
There are other properties in the array that I will need to pull from later. This is just an example of one of the arrays with a few properties:
$array = @(
    @{"description"="data"; "name"="host1.domain1.com"; "model"="data"; "ip"="10.0.0.1"; "make"="Microsoft"}
    @{"description"="data"; "name"="host2.domain2.com"; "model"="data"; "ip"="10.0.0.2"; "make"="Different"}
    @{"description"="data"; "name"="10.0.0.5"; "model"="data"; "ip"="10.0.0.5"; "make"="Different"}
)
The plan was to match the domain and then strip using the period.
$domainList = @(".domain1.com", ".domain2.com")
$domains = $domainList.ForEach{[Regex]::Escape($_)} -join '|'
$array = $array | ForEach-Object {
if($_.name -match $domains) {
$_.name = $_.name -replace $domains }
$_.name
}
You may do the following to output a new array of values without the domain names:
# starting array
$array = @("hosta.domain1.com", "hostb.domain2.com", "host3", "10.0.0.1")
# domains to remove
$domains = @('.domain1.com','.domain2.com')
# create regex expression with alternations: item1|item2|item3 etc.
# regex must escape literal . since dot has special meaning in regex
$regex = $domains.foreach{[regex]::Escape($_)} -join '|'
# replace domain strings and output everything else
$newArray = $array -replace $regex
Arrays use pointers so I needed to load the array and pipe through ForEach-Object and then set the object when logic was complete. Thanks for all of the help.
$domainList = @(".domain1.com", ".domain2.com")
$domains = $domainList.ForEach{[Regex]::Escape($_)} -join '|'
$array = $array | ForEach-Object {
if($_.name -match $domains) {
$_.name = $_.name -replace $domains }
$_.name
}
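If the other properties are needed later, one option is to update the name in place and keep the whole objects rather than emitting only the names. A minimal sketch, assuming made-up sample assets and an extra `$` anchor (not in the code above) so that only a trailing domain is stripped:

```powershell
# Hypothetical sample assets; real data would come from the REST API.
$assets = @(
    [pscustomobject]@{ name = 'host1.domain1.com'; ip = '10.0.0.1' }
    [pscustomobject]@{ name = 'host2.domain2.com'; ip = '10.0.0.2' }
    [pscustomobject]@{ name = '10.0.0.5';          ip = '10.0.0.5' }
)

$domainList = @('.domain1.com', '.domain2.com')

# Escape each domain (the dot is a regex metacharacter), join with '|',
# and anchor at the end of the string.
$pattern = '(' + (($domainList | ForEach-Object { [regex]::Escape($_) }) -join '|') + ')$'

# -replace with no replacement operand removes the matched text.
foreach ($asset in $assets) {
    $asset.name = $asset.name -replace $pattern
}

$assets
```

Names without a listed domain, such as the bare IP address, pass through unchanged because the pattern simply does not match.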
I have a large data set, roughly 10 million items, that I need to process quickly and efficiently, removing duplicate items based on two of the six columns.
I have tried grouping and sorting items but it's horrendously slow.
$p1 = $test | Group-Object -Property ComputerSerialID,ComputerID
$p2 = foreach ($object in $p1.group) {
$object | Sort-Object -Property FirstObserved | Select-Object -First 1
}
The goal would be to remove duplicates by assessing two columns while maintaining the oldest record based on first observed.
The data looks something like this:
LastObserved : 2019-06-05T15:40:37
FirstObserved : 2019-06-03T20:29:01
ComputerName : 1
ComputerID : 2
Virtual : 3
ComputerSerialID : 4
LastObserved : 2019-06-05T15:40:37
FirstObserved : 2019-06-03T20:29:01
ComputerName : 5
ComputerID : 6
Virtual : 7
ComputerSerialID : 8
LastObserved : 2019-06-05T15:40:37
FirstObserved : 2019-06-03T20:29:01
ComputerName : 9
ComputerID : 10
Virtual : 11
ComputerSerialID : 12
You might want to clean up your question a little bit, because it's a little bit hard to read, but I'll try to answer the best I can with what I can understand about what you're trying to do.
Unfortunately, with so much data there's no way to do this quickly. String Comparison and sorting are done by brute force; there is no way to reduce the complexity of comparing each character in one string against another any further than measuring them one at a time to see if they're the same.
(Honestly, if this were me, I'd just use export-csv $object and perform this operation in excel. The time tradeoff to scripting something like this only once just wouldn't be worth it.)
By "Items" I'm going to assume that you mean rows in your table, and that you're not trying to retrieve only the strings in the rows you're looking for. You've already got the basic idea of select-object down, you can do that for the whole table:
$outputFirstObserved = $inputData | Sort-Object -Property FirstObserved -Unique
$outputLastObserved = $inputData | Sort-Object -Property LastObserved -Unique
Now you have ~20 million rows in memory, but I guess that beats doing it by hand. All that's left is to join the two tables. You can download that Join-Object command from the powershell gallery with Install-Script -Name Join and use it in the way described. If you want to do this step yourself, the easiest way would be to squish the two tables together and sort them again:
$output = $outputFirstObserved + $outputLastObserved
$return = $output | Sort-Object | Get-Unique
Does this do it? It keeps the one it finds first.
$test | sort -u ComputerSerialID, ComputerID
I created this function to de-duplicate my multi-dimensional arrays.
Basically, I concatenate the contents of the record, add this to a hash.
If the concatenated text already exists in the hash, don't add it to the array to be returned.
Function DeDupe_Array
{
param
(
$Data
)
$Return_Array = @()
$Check_Hash = @{}
Foreach($Line in $Data)
{
$Concatenated = ''
$Elements = ($Line | Get-Member -MemberType NoteProperty | % {"$($_.Name)"})
foreach($Element in $Elements)
{
$Concatenated += $line.$Element
}
If($Check_Hash.$Concatenated -ne 1)
{
$Check_Hash.add($Concatenated,1)
$Return_Array += $Line
}
}
return $Return_Array
}
Try the following script.
Should be as fast as possible due to avoiding any pipe'ing in PS.
$hashT = @{}
foreach ($item in $csvData) {
# Building hash table key
$key = '{0}###{1}' -f $item.ComputerSerialID, $item.ComputerID
# if $key doesn't exist yet OR when $key exists and "FirstObserved" is less than the existing one in $hashT (only valid when the date is provided in a sortable / international format)
if ((! $hashT.ContainsKey($key)) -or ( $item.FirstObserved -lt $hashT[$key].FirstObserved )) {
$hashT[$key] = $item
}
}
$result = $hashT.Values
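The hash-table approach can be sanity-checked on a tiny data set before running it against millions of rows. A minimal sketch, assuming three made-up records and a hypothetical `|` key separator; it uses an ordinal string comparison, which only orders correctly because the timestamps are fixed-width ISO 8601 strings:

```powershell
# Hypothetical sample: two records share the same (ComputerSerialID, ComputerID) pair.
$records = @(
    [pscustomobject]@{ ComputerSerialID = 'A'; ComputerID = 1; FirstObserved = '2019-06-03T20:29:01' }
    [pscustomobject]@{ ComputerSerialID = 'A'; ComputerID = 1; FirstObserved = '2019-06-01T08:00:00' }
    [pscustomobject]@{ ComputerSerialID = 'B'; ComputerID = 2; FirstObserved = '2019-06-02T12:00:00' }
)

$oldest = @{}
foreach ($r in $records) {
    # Composite key built from the two columns that define a duplicate.
    $key = '{0}|{1}' -f $r.ComputerSerialID, $r.ComputerID

    # Keep the record if the key is new, or if it is older than the stored one.
    if ((-not $oldest.ContainsKey($key)) -or
        ([string]::CompareOrdinal($r.FirstObserved, $oldest[$key].FirstObserved) -lt 0)) {
        $oldest[$key] = $r
    }
}

$deduped = @($oldest.Values)
$deduped
```

Each record is visited once and each lookup is a hash-table probe, so the work grows roughly linearly with the number of rows instead of requiring a sort or a group per key.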
Noob here.
I'm trying to pare down a list of domains by eliminating all subdomains if the parent domain is present in the list. I've managed to cobble together a script that somewhat does this with PowerShell after some searching and reading. The output is not exactly what I want, but will work OK. The problem with my solution is that it takes so long to run because of the size of my initial list (tens of thousands of entries).
UPDATE: I've updated my example to clarify my question.
Example "parent.txt" list:
adk2.co
adk2.com
adobe.com
helpx.adobe.com
manage.com
list-manage.com
graph.facebook.com
Example output "repeats.txt" file:
adk2.com (different top level domain than adk2.co but that's ok)
helpx.adobe.com
list-manage.com (not subdomain of manage.com but that's ok)
I would then take and eliminate the repeats from the parent, leaving a list of "unique" subdomains and domains. I have this in a separate script.
Example final list with my current script:
adk2.co
adobe.com
manage.com
graph.facebook.com (it's not facebook.com because facebook.com wasn't in the original list.)
Ideal final list:
adk2.co
adk2.com (since adk2.co and adk2.com are actually distinct domains)
adobe.com
manage.com
graph.facebook.com
Below is my code:
I've taken my hosts list (parent.txt) and checked it against itself, and spit out any matches into a new file.
$parent = Get-Content("parent.txt")
$hosts = Get-Content("parent.txt")
$repeats = @()
$out_file = "$PSScriptRoot\repeats.txt"
$hosts | where {
$found = $FALSE
foreach($domains in $parent){
if($_.Contains($domains) -and $_ -ne $domains){
$found = $TRUE
$repeats += $_
}
if($found -eq $TRUE){
break
}
}
$found
}
$repeats = $repeats -join "`n"
[System.IO.File]::WriteAllText($out_file,$repeats)
This seems like a really inefficient way to do it since I'm going through each element of the array. Any suggestions on how to best optimize this? I have some ideas like putting more conditions on what elements to check and check against, but I feel like there's a drastically different approach that would be far better.
First, a solution based strictly on shared domain names (e.g., helpx.adobe.com and adobe.com are considered to belong to the same domain, but list-manage.com and manage.com are not).
This is not what you asked for, but perhaps more useful to future readers:
Get-Content parent.txt | Sort-Object -Unique { ($_ -split '\.')[-2,-1] -join '.' }
Assuming list.manage.com rather than list-manage.com in your sample input, the above command yields:
adk2.co
adk2.com
adobe.com
graph.facebook.com
manage.com
{ ($_ -split '\.')[-2,-1] -join '.' } sorts the input lines by the last 2 domain components (e.g., adobe.com):
-Unique discards duplicates.
A shared-suffix solution, as requested:
# Helper function for (naively) reversing a string.
# Note: Does not work properly with Unicode combining characters
# and surrogate pairs.
function reverse($str) { $a = $str.ToCharArray(); [Array]::Reverse($a); -join $a }
# * Sort the reversed input lines, which effectively groups them by shared suffix
# with the shortest entry first (e.g., the reverse of 'manage.com' before the
# reverse of 'list-manage.com').
# * It is then sufficient to output only the first entry in each group, using
# wildcard matching with -notlike to determine group boundaries.
# * Finally, sort the re-reversed results.
Get-Content parent.txt | ForEach-Object { reverse $_ } | Sort-Object |
ForEach-Object { $prev = $null } {
if ($null -eq $prev -or $_ -notlike "$prev*" ) {
reverse $_
$prev = $_
}
} | Sort-Object
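If only true subdomains should be dropped (so that list-manage.com survives alongside manage.com, unlike in the shared-suffix command above), a set lookup on each parent suffix is one possible alternative. This is a sketch under that assumption, using the sample list from the question inline rather than reading parent.txt:

```powershell
# Sample list from the question; real input would come from Get-Content parent.txt.
$domains = @(
    'adk2.co', 'adk2.com', 'adobe.com', 'helpx.adobe.com',
    'manage.com', 'list-manage.com', 'graph.facebook.com'
)

# Case-insensitive set for constant-time membership tests.
$known = [System.Collections.Generic.HashSet[string]]::new(
    [string[]]$domains, [System.StringComparer]::OrdinalIgnoreCase)

$kept = foreach ($d in $domains) {
    $hasParent = $false
    $rest = $d
    # Strip one leading label at a time and check whether the remainder is listed.
    while (($i = $rest.IndexOf('.')) -ge 0) {
        $rest = $rest.Substring($i + 1)
        if ($known.Contains($rest)) { $hasParent = $true; break }
    }
    if (-not $hasParent) { $d }
}

$kept | Sort-Object
```

Because the comparison happens on whole labels, adk2.com is not treated as a child of adk2.co, and each entry costs only as many lookups as it has labels.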
One approach is to use a hash table to store all your parent values, then for each repeat, remove it from the table. The value 1 when adding to the hash table does not matter since we only test for existence of the key.
$parent = @(
'adk2.co',
'adk2.com',
'adobe.com',
'helpx.adobe.com',
'manage.com',
'list-manage.com'
)
$repeats = (
'adk2.com',
'helpx.adobe.com',
'list-manage.com'
)
$domains = @{}
$parent | % {$domains.Add($_, 1)}
$repeats | % {if ($domains.ContainsKey($_)) {$domains.Remove($_)}}
$domains.Keys | Sort
There are quite a few posts on SO that address PowerShell transposition. However, most of the code is specific to the use case or addresses data being gathered from a text/CSV file and does me no good. I'd like to see a solution that can do this work without such specifics and works with arrays directly in PS.
Example data:
Customer Name: SomeCompany
Abbreviation: SC
Company Contact: Some Person
Address: 123 Anywhere St.
ClientID: XXXX
This data is much more complicated, but I can work with it using other methods if I can just get the rows and columns to cooperate. The array thinks that "Name:" and "SomeCompany" are column headers. This is a byproduct of how the data is gathered and cannot be changed. I'm importing the data from an Excel spreadsheet with PSExcel and the spreadsheet format is not changeable.
Desired output:
Customer Name:, Abbreviation:, Company Contact:, Address:, ClientID:
SomeCompany, SC, Some Person, 123 Anywhere St., XXXX
Example of things I've tried:
$CustInfo = Import-XLSX -Path "SomePath" -Sheet "SomeSheet" -RowStart 3 -ColumnStart 2
$b = @()
foreach ($Property in $CustInfo.Property | Select -Unique) {
    $Props = [ordered]@{ Property = $Property }
    foreach ($item in $CustInfo."Customer Name:" | Select -Unique){
        $Value = ($CustInfo.where({ $_."Customer Name:" -eq $item -and
            $_.Property -eq $Property })).Value
        $Props += @{ $item = $Value }
    }
    $b += New-Object -TypeName PSObject -Property $Props
}
This does not work because of the "other" data I mentioned. There are many other sections in this particular workbook so the "Select -Unique" fails without error and the output is blank. If I could limit the input to only select the rows/columns I needed, this might have a shot. It appears that while there is a "RowStart" and "ColumnStart" to Import-XLSX, there are no properties for stopping either one.
I've tried methods from the above linked SO questions, but as I said, they are either too specific to the question's data or apply to importing CSV files and not working with arrays.
I was able to resolve this by doing two things:
Removed the extra columns by using the "-Header" switch on the Import-XLSX function to add fake header names and then only select those headers.
$CustInfo = Import-XLSX -Path "SomePath" -Sheet "SomeSheet" -RowStart 2 -ColumnStart 2 -Header 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18 | Select "1","2"
The downside to this is that I had to know how many columns the input data had -- Not dynamic. If anyone can provide a solution to this issue, I'd be grateful.
Flipped the columns and headers with a simple foreach loop:
$obj = [PSCustomObject]@{}
ForEach ($item in $CustInfo) {
$value = $null
$name = $null
if ($item."2") { [string]$value = $item."2" }
if ($item."1") { [string]$name = $item."1" }
if ($value -and $name) {
$obj | Add-Member -NotePropertyName $name -NotePropertyValue $value
}
}
I had to force the string type on the property names and values because the zip codes and CustID were being formatted as Int32. Otherwise, this does what I need.
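The same flip can also be written by collecting the pairs into an ordered dictionary and casting it once at the end. A minimal sketch, assuming made-up rows that mimic the two fake headers "1" and "2" from the import above:

```powershell
# Hypothetical rows shaped like the Import-XLSX output: "1" holds the label, "2" the value.
$rows = @(
    [pscustomobject]@{ '1' = 'Customer Name:'; '2' = 'SomeCompany' }
    [pscustomobject]@{ '1' = 'Abbreviation:';  '2' = 'SC' }
    [pscustomobject]@{ '1' = $null;            '2' = $null }
    [pscustomobject]@{ '1' = 'ClientID:';      '2' = 'XXXX' }
)

# An ordered dictionary preserves the original row order as property order.
$props = [ordered]@{}
foreach ($row in $rows) {
    # Skip blank rows; force both sides to strings so numeric cells keep their text form.
    if ($row.'1' -and $row.'2') {
        $props[[string]$row.'1'] = [string]$row.'2'
    }
}

$obj = [pscustomobject]$props
$obj
```

Casting to a string key matters here: indexing an ordered dictionary with an integer would be treated as a position rather than a key.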
I'm interested in some ideas on how one would approach coding a search of a filesystem for files that match any entries contained in a master CSV file. I have a function to search the filesystem, but filtering against the CSV is proving harder than I expected. I have a CSV with headers in it for Name & IPaddr:
#create CSV object
$csv = import-csv filename.csv
#create filter object containing only Name column
$filter = $csv | select-object Name
#Now run the search function
SearchSubfolders | where {$_.name -match $filter} #returns no results
I guess my question is this: Can I filter against an array within a pipeline like this???
You need a pair of loops:
#create CSV object
$csv = import-csv filename.csv
#Now run the search function
#loop through the folders
foreach ($folder in (SearchSubfolders)) {
#check that folder against each item in the csv filter list
#this sets up the loop
foreach ($Filter in $csv.Name) {
#and this does the checking and outputs anything that is matched
If ($folder.name -match $Filter) { "$filter" }
}
}
Usually CSVs are 2-dimensional data structures, so you can't use them directly for filtering. You can convert the 2-dimensional array into a 1-dimensional array, though:
$filter = Import-Csv 'C:\path\to\some.csv' | % {
$_.PSObject.Properties | % { $_.Value }
}
If the CSV has just a single column, the "mangling" can be simplified to this (replace Name with the actual column name):
$filter = Import-Csv 'C:\path\to\some.csv' | % { $_.Name }
or this:
$filter = Import-Csv 'C:\path\to\some.csv' | select -Expand Name
Of course, if the CSV has just a single column, it would've been better to make it a flat list right away, so it could've been imported like this:
$filter = Get-Content 'C:\path\to\some.txt'
Either way, with the $filter prepared, you can apply it to your input data like this:
SearchSubFolders | ? { $filter -contains $_.Name } # ARRAY -contains VALUE
The -match operator won't work, because it compares a value (left operand) against a regular expression (right operand).
See Get-Help about_Comparison_Operators for more information.
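The difference between the two operators is easy to see on a small example. A minimal sketch with made-up file names (none of them come from the question):

```powershell
# Hypothetical filter list and candidate names.
$filter = @('report.txt', 'notes.md')
$names  = @('report.txt', 'report.txt.bak', 'other.log')

# ARRAY -contains VALUE: exact, whole-string membership test.
$exact = $names | Where-Object { $filter -contains $_ }

# VALUE -match REGEX: the right operand is a pattern, and an unanchored
# pattern also matches names that merely contain it.
$loose = $names | Where-Object { $_ -match 'report.txt' }

"exact: $($exact -join ', ')"
"loose: $($loose -join ', ')"
```

Only `report.txt` passes the `-contains` test, while the `-match` test also lets `report.txt.bak` through.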
Another option is to create a regex from the filename collection and use that to filter for all the filenames at once:
$filenames = import-csv filename.csv |
foreach { $_.name }
[regex]$filename_regex = '(?i)^(' + (($filenames | foreach {[regex]::escape($_)}) -join "|") + ')$'
$SearchSubfolders |
where { $_.name -match $filename_regex }
You can use Compare-Object to do this pretty easily if you are matching the actual Names of the files to names in the list. An example:
$filter = import-csv files.csv
ls | Compare-Object -ReferenceObject $filter -IncludeEqual -ExcludeDifferent -Property Name
This will print the files in the current directory that match any Name in files.csv. You could also print only the different ones by dropping the -IncludeEqual and -ExcludeDifferent flags. If you need full regex matching you will have to loop through each regex in the CSV and see if it is a match.
Here's an alternate solution that uses regular expression filters. Note that we will create and cache the regex instances so we don't have to rely on the runtime's internal cache (which defaults to 15 items). First we have a useful helper function, Test-Any, that will loop through an array of items and stop if any of them satisfies a criterion:
function Test-Any() {
param(
[Parameter(Mandatory=$True,ValueFromPipeline=$True)]
[object[]]$Items,
[Parameter(Mandatory=$True,Position=2)]
[ScriptBlock]$Predicate)
begin {
$any = $false
}
process {
foreach($item in $items) {
if ($predicate.Invoke($item)) {
$any = $true
break
}
}
}
end { $any }
}
With this, the implementation is relatively simple:
$filters = import-csv files.csv | foreach { [regex]$_.Name }
ls -recurse | where { $name = $_.Name; $filters | Test-Any { $_.IsMatch($name) } }
I ended up using a 'loop within a loop' construct to get this done after much trial and error:
#the SearchSubFolders function was amended to force results in a variable, SearchResults
$SearchResults2 = @()
foreach ($result in $SearchResults){
foreach ($line in $filter){
if ($result -match $line){
$SearchResults2 += $result
}
}
}
This works great after collapsing my CSV file down to a text-based array containing only the necessary column data from that CSV. Much thanks to Ansgar Wiechers for assisting me with that particular thing!!!
All of you presented viable solutions, some more complex than I cared for, nevertheless if I could mark multiple answers as correct, I would!! I chose the correct answer based on not only correctness but also simplicity.....