How to convert a string into a useable format - arrays

I'm trying to automate a process, essentially we receive an error code and line numbers for a file. The script I'm writing needs to get the error code and line numbers, then dive into the file and retrieve the lines.
Everything is working except parsing the error code and line numbers into some sort of useable format so I can loop through them
The format is:
Error code Line number
1234 00232,00233,00787
3333 00444
1111 01232,2132
I've tried
$a = $a -replace "\s+","="
$a|ConvertFrom-StringData
But I'm drawing a blank when it comes to looping through the hashtable and dealing with the occasional CSV values.
I did think of converting the whole thing to a CSV but I'm running up against the edge of my knowledge...

Use a regular expression that matches a space immediately followed by either a digit or an uppercase letter, replace that space with a delimiter, and finally parse the resulting string as a CSV document:
# read target file into memory for later extraction
$fileContents = Get-Content C:\path\to\source\file.txt
# define error report; replace with `Get-Content` if the data is coming from a file too
$errorReport = @'
Error code Line number
1234 00232,00233,00787
3333 00444
1111 01232,2132
'@
# replace the middle space and parse as CSV
$errorMappingList = $errorReport -replace '(?-i) (?=\p{Lu}|\d)', '|' | ConvertFrom-Csv -Delimiter '|'

# go through each entry in the error mapping list
foreach ($errorMapping in $errorMappingList) {
    # go through each line number associated with the error code
    foreach ($lineNo in $errorMapping.'Line number' -split ',') {
        # extract the line from the file contents, output 1 new object per line extracted
        [pscustomobject]@{
            ErrorCode  = $errorMapping.'Error code'
            LineNumber = $lineNo
            Line       = $fileContents[$lineNo - 1]
        }
    }
}
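For reference, this is what the sample report looks like right after the -replace step, before ConvertFrom-Csv parses it; only the space between the two columns becomes a |, while the commas inside the line-number list are untouched:
$errorReport -replace '(?-i) (?=\p{Lu}|\d)', '|'
# Error code|Line number
# 1234|00232,00233,00787
# 3333|00444
# 1111|01232,2132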

Related

Can't write character array to file in Powershell

OK, Powershell may not be the best tool for the job but it's the only one available to me.
I have a bunch of 600K+ row .csv data files. Some of them have delimiter errors, e.g. " in the middle of a text field or "" at the start of one. They are too big to edit (even in UltraEdit) and fix manually, even if I wanted to (which I don't!).
Because of the double-""-delimiter at the start of some text fields and the rogue-"-delimiter in the middle of some text fields, I haven't used a header row to define the columns, because these rows appear as if there is an extra column in them due to the extra delimiter.
I need to parse the file looking for "" instead of " at the start of a text-field and also to look for " in the middle of a text field and remove them.
I have managed to write the code to do this (after a fashion) by basically reading the whole file into an array, looping through it and adding output characters to an output array.
What I haven't managed to do is successfully write this output array to a file.
I have read every part of https://learn.microsoft.com/en-us/powershell/module/Microsoft.PowerShell.Utility/out-file?view=powershell-5.1 that seemed relevant. I've also trawled through about 10 similar questions on this site and attempted various code gleaned from them.
The output array prints perfectly to screen using Write-Host but I can't get the data back into a file for love or money. I have a total of 1.5 days of PowerShell experience so far! All suggestions gratefully received.
Here is my code to read/identify rogue delimiters (not pretty (at all); refer to the previous explanation of the data and available technology constraints):
$ContentToCheck = get-content 'myfile.csv' | foreach { $_.ToCharArray() }
$ContentOutputArray = @()
for ($i = 0; $i -lt $ContentToCheck.count; $i++)
{
    if (!($ContentToCheck[$i] -match '"')) {#not a quote
        if (!($ContentToCheck[$i] -match ',')) {#not a comma i.e. other char that could be enclosed in ""
            if ($ContentToCheck[$i-1] -match '"' ) {#check not rogue " delimiter in previous char allow for start of file exception i>1?
                if (!($ContentToCheck[$i-2] -match ',') -and !($ContentToCheck[$i-3] -match '"')){
                    Write-Host 'Delimiter error' $i
                    $ContentOutputArray += ''
                }#endif not preceded by ",
            }#endif"
            else {#previous char not a " so move on
                $ContentOutputArray += $ContentToCheck[$i]
            }
        }#endifnotacomma
        else
        {#a comma, include it
            $ContentOutputArray += $ContentToCheck[$i]
        }#endacomma
    }#endifnotaquote
    else
    {#a quote so just append it to the output array
        $ContentOutputArray += $ContentToCheck[$i]
    }#endaquote
}#endfor
So far so good, if inelegant. If I do a simple
Write-Host $ContentOutputArray
the data displays nicely: " 6 5 " , " 652 | | 999 " , " 99 " , " " , " 678 | | 1 " ..... Furthermore, when I check the size of the array (based on a cut-down version of one of the problem files)
$ContentOutputArray.count
I get a 2507-character array length. Happy out. However, then variously using:
$ContentOutputArray | Set-Content 'myfile_FIXED.csv'
creates blank file
$ContentOutputArray | out-file 'myfile_FIXED.csv' -encoding ASCII
creates blank file
$ContentOutputArray | export-csv 'myfile_FIXED.csv'
gives only '#TYPE System.Char' in file
$ContentOutputArray | Export-Csv 'myfile_FIXED.csv' -NoType
gives empty file
$ContentOutputArray >> 'myfile_FIXED.csv'
gives blanks separated by ,
What else can I try to write an array of characters to a flat file? It seems such a basic question but it has me stumped. Thanks for reading.
Convert (or cast) the char array to a string before exporting it.
(New-Object string (,$ContentOutputArray)) |Set-Content myfile_FIXED.csv
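For completeness, two equivalent ways of turning the char array into a single string before writing it out (a minimal sketch, reusing the file name from above):
($ContentOutputArray -join '') | Set-Content myfile_FIXED.csv      # -join '' concatenates the chars into one string
[string]::new($ContentOutputArray) | Set-Content myfile_FIXED.csv  # PowerShell 5+ constructor syntax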

Extracting values from strings in an array using powershell

I've got an issue trying to extract particular values from an Array. I have an array that contains 40010 rows each of which is a string of pipe separated values (64 on each line).
I need to extract values 7, 4, 22, 23, 24, 52 and 62 from each row and write it into a new array so that I will end up with a new array containing 40010 rows with only 7 pipe separated values in each (could be comma separated) row.
I've looked at split and can't seem to get my head around it to even get close to what I need.
I'd also be open to doing this from a file as I'm currently creating my 1st array with
$data = (Get-content $statement_file|Select-String "^01")
If I can add to that command to do the split on the input so I only have one array and don't need the intermediate array that would be even better.
I know if I were on Linux I could do the split with AWK quite easily, but I'm fairly new to PowerShell so would appreciate any suggestions.
# create an array of header columns (assuming your pipe separated file doesn't have headers)
$header = 1..64 | ForEach-Object { "h$_" }
# import the file as 'csv' but with pipes as separators, use the above header, then select columns 7,4,22,23,24,52,62
# edit 1: then only return rows that start with 01
# edit 2: then join these into a pipe separated string
$smallerArray = $statement_file |
    Import-Csv -Delimiter '|' -Header $header |
    Where-Object { $_.h1.StartsWith('01') } |
    Select-Object @{Name="piped"; Expression={ @($_.h7,$_.h4,$_.h22,$_.h23,$_.h24,$_.h52,$_.h62) -join '|' }} |
    Select-Object -ExpandProperty piped
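If the reduced rows should also end up in a file rather than only in the $smallerArray variable, they can simply be written out afterwards (the output file name here is just a placeholder):
$smallerArray | Set-Content 'extracted_columns.txt'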

Powershell: Delete Every n Files

I just imported a bunch of pictures and realized that there are 3 copies of each picture, but they're named sequentially.
Basically these three files are the same:
P5240901.dng
P5240902.dng
P5240903.dng
And that, for about 1600 pictures.
I was looking into writing a simple PowerShell script (I use Windows) that would look into the directory of these files, and keep 1 file out of three, just looping through a range of files.
I didn't find anything that would deal with the 'P' character in front of the numbers in my file names, and I'm not familiar with the PowerShell language.
Any ideas?
Thank you!
Assuming everything in the dir follows the naming convention & is in a set of 3, something like this should work:
$mydir = 'C:\path\to\files'
[int]$idx = 1
get-childitem $mydir | sort-object {$_.Name} | foreach-object {
    if ($idx % 3 -ne 1){ #get the modulus
        $_ | remove-item
    }
    $idx++
}
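Before deleting anything for real, the same loop can be dry-run by adding the common -WhatIf switch to Remove-Item, which only reports what would be removed:
$mydir = 'C:\path\to\files'
[int]$idx = 1
get-childitem $mydir | sort-object {$_.Name} | foreach-object {
    if ($idx % 3 -ne 1){
        $_ | remove-item -WhatIf   # prints a "What if:" message instead of deleting
    }
    $idx++
}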
Try the following, which will keep only the 1st file in each group of files whose names are the same except for the last character before the filename extension, assuming that character is a digit (syntax assumes PSv3+):
'P5240901.dng', 'P5240902.dng', 'P5240903.dng', 'A1.dng', 'A2.dng', 'singleton.dng' |
  Group-Object { $_ -replace '^(.+)\d\.', '$1' } |
  ? Count -gt 1 |
  % { $_.Group[1..$($_.Group.Count)] }
yields:
P5240902.dng
P5240903.dng
A2.dng
Replace the sample input array with a call to Get-ChildItem -File, and prepend Remove-Item to $_.Group[1..$($_.Group.Count)] to perform actual deletion.
The above command uses a string array with input filenames, but the [System.IO.FileInfo] instances output by Get-ChildItem will effectively act the same in a string context: they will expand to their respective filenames.
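Putting those two substitutions together, a sketch of the deleting variant could look like this (assuming the current directory holds the .dng files; piping the surplus group members to Remove-Item lets their paths bind from the pipeline):
Get-ChildItem -File *.dng |
  Group-Object { $_ -replace '^(.+)\d\.', '$1' } |
  ? Count -gt 1 |
  % { $_.Group[1..$($_.Group.Count)] | Remove-Item }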
The advantage of this solution is that it doesn't rely on input files appearing strictly in groups of 3:
Any group of input files sharing the same name except for a digit before the filename extension that has at least 2 members (and any number beyond that) will have every member but the 1st deleted.
Any other files are left untouched.
Explanation:
Group-Object { $_ -replace '^(.+)\d\.', '$1' }
effectively groups the input files by the portion of the filename they share (but only if they share everything but the last char. before the filename extension, and if that char. is a digit).
? Count -gt 1
only passes on those resulting groups that have at least 2 members.
% { $_.Group[1..$($_.Group.Count)] }
processes each group's files, except the 1st.
Update: Here's a variation prompted by the OP's later comments:
The following, given input filenames such as P5240901.dng, P5240902.dng, ..., P5240910.dng, P5240911.dng, ..., P5240990.dng, P5240991.dng, ..., P5240999.dng, will consider each group of 10 files a group (based on the tens place), and within each group only retain the 1st file:
1..99 | % { "P52409$('{0:00}' -f $_).dng" } |
  Group-Object { $_ -replace '^(.+\d)\d\.', '$1' } |
  ? Count -gt 1 |
  % { $_.Group[1..$($_.Group.Count)] }
yields:
# tens place of 0; skips ...01.dng
P5240902.dng
P5240903.dng
... # up to ...09.dng
# tens place of 1; skips ...10.dng
P5240911.dng
P5240912.dng
... # skips ...20.dng, ...30.dng, ...
# tens place of 9; skips ...90.dng
P5240991.dng
P5240992.dng
...
P5240999.dng
In order to only pass the files of interest to the command, replace the sample input array with
Get-ChildItem P52515[0-9][0-9].dng.

Simplifying elements of a list/array and then adding incremental identifiers a,b,c,d.... etc to them

I'm processing headers of a .fasta file (a file format universally used in genetics/bioinformatics to store DNA/RNA sequence data). Fasta files have headers starting with a > symbol (which gives specific info), followed on the next line by the actual sequence data that the header describes. The sequence data extends until the next \n, after which comes the next header and its respective sequence. For example:
>scaffold1.1_size947603
ACGCTCGATCGTACCAGACTCAGCATGCATGACTGCATGCATGCATGCATCATCTGACTGATG....
>scaffold2.1_size747567.2.603063_605944
AGCTCTGATCGTCGAAATGCGCGCTCGCTAGCTCGATCGATCGATCGATCGACTCAGACCTCA....
and so on...
So, I have a problem with the fasta headers of the genome for the organism with which I am working. Unfortunately the Perl expertise needed to solve this problem seems to be beyond my current skill level :S So I was hoping someone on here could show me how it can be done.
My genome consists of about 25000 fasta headers and their respective sequences, the headers in their current state are giving me a lot of trouble with sequence aligners I am trying to use, so I have to simplify them significantly. Here is an example of my first few headers:
>scaffold1.1_size947603
>scaffold10.1_size550551
>scaffold100.1_size305125:1-38034
>scaffold100.1_size305125:38147-38987
>scaffold100.1_size305125:38995-44965
>scaffold100.1_size305125:76102-78738
>scaffold100.1_size305125:84171-87568
>scaffold100.1_size305125:87574-89457
>scaffold100.1_size305125:90495-305068
>scaffold1000.1_size94939
Essentially I would like to refine these to look like this:
>scaffold1.1a
>scaffold10.1a
>scaffold100.1a
>scaffold100.1b
>scaffold100.1c
>scaffold100.1d
>scaffold100.1e
>scaffold100.1f
>scaffold100.1g
>scaffold1000.1a
Or perhaps even this (but this seems like it would be more complicated):
>scaffold1.1
>scaffold10.1
>scaffold100.1a
>scaffold100.1b
>scaffold100.1c
>scaffold100.1d
>scaffold100.1e
>scaffold100.1f
>scaffold100.1g
>scaffold1000.1
What I'm doing here is getting rid of all the size data for each scaffold of the genome. For scaffolds that happen to be fragmented, I'd like to denote them with a,b,c,d etc. There are a few scaffolds with more than 26 fragments so perhaps I could denote them with x, y, z, A, B, C, D .... etc..
I was thinking to do this with a simple replace foreach loop similar to this:
#!/usr/bin/perl -w
### Open the files
$gen = './Hc_genome/haemonchus_V1.fa';
open(FASTAFILE, $gen);
@lines = <FASTAFILE>;
#print @lines;

### Add an @ symbol to the start of the label
my @refined;
foreach my $lines (@lines){
    chomp $lines;
    $lines =~ s/match everything after .1/replace it with a, b, c.. etc/g;
    push @refined, $lines;
}
#print @refined;

### Push the array on to a new fasta file
open FILE3, "> ./Hc_genome/modded_haemonchus_V1.fa" or die "Cannot open output.txt: $!";
foreach (@refined)
{
    print FILE3 "$_\n"; # Print each entry in our array to the file
}
close FILE3;
But I don't know how to build the added alphabetical labels in between the $1 and the \n in the match-and-replace operator, essentially because I'm not sure how to step sequentially through the alphabet for each fragment of a particular scaffold (all I could manage was to add an a to the start of each one...).
Please if you don't mind, let me know how I might achieve this!
Much appreciated!
Andrew
In Perl, the increment operator ++ has “magical” behaviour with respect to strings. E.g. my $s = "a"; $s++ increments $s to "b". This goes on until "z", where the increment will produce "aa" and so forth.
The headers of your file appear to be properly sorted, so we can just loop through each header. From the header, we extract the starting part (everything up to including the .1). If this starting part is the same as the starting part of the previous header, we increment our sequence identifier. Otherwise, we set it to "a":
use strict; use warnings; # start every script with these

my $index = "a";
my $prev = "";

# iterate over all lines (rather than reading all 25E3 into memory at once)
while (<>) {
    # pass through non-header lines
    unless (/^>/) {
        print; # comment this line to remove non-header lines
        next;
    }

    s/\.1\K.*//s; # remove everything after ".1". Implies chomping

    # reset or increment $index
    if ($_ eq $prev) {
        $index++;
    } else {
        $index = "a";
    }

    # update the previous line
    $prev = $_;

    # output new header
    print "$_$index\n";
}
Usage: $ perl script.pl <./Hc_genome/haemonchus_V1.fa >./Hc_genome/modded_haemonchus_V1.fa.
It is considered good style to write programs that accept input from STDIN and write to STDOUT, as this improves flexibility. Rather than hardcoding paths in your perl script, keep your script general, and use shell redirection operators like < to specify the input. This also saves you the hassle of manually opening the files.
Example Output:
>scaffold1.1a
>scaffold10.1a
>scaffold100.1a
>scaffold100.1b
>scaffold100.1c
>scaffold100.1d
>scaffold100.1e
>scaffold100.1f
>scaffold100.1g
>scaffold1000.1a

Powershell -- replace string from one file into another

I have two files; A and B. They both contain similar text but of course there are subtle differences in each.
I need to replace a single line of text in file B with a line that comes from file A, leaving all the rest of the text in file B as is. The thing is that I don't know the full line of text that will exist in file A, just the first few letters.
Said another way:
I can get the single line of text(string) from file A: $a = (get-content $original_file)[5]
How do I replace line 5 of file B with what's in variable $a?
thanks!
PowerShell arrays are zero-based so line 5 would be index 4. The rest of the script would be something like:
$b = (get-content $another_file)
$b[4] = $a
$b | Out-File -Encoding Ascii $another_file
You can pick Ascii or Unicode (or UTF8) for the encoding.
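If only the first few letters of the line in file A are known, as the question mentions, one option is to locate that line with Select-String instead of relying on a fixed line index; a rough sketch, where '^ABC' stands in for whatever known prefix applies:
$a = (Select-String -Path $original_file -Pattern '^ABC' | Select-Object -First 1).Line
$b = Get-Content $another_file
$b[4] = $a                                   # still replacing line 5 (index 4) of file B
$b | Out-File -Encoding Ascii $another_file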
