Use of uninitialized value within @spl in substitution (s///) - arrays

I am getting the following error while running the script:
Use of uninitialized value in print at PreProcess.pl line 137.
Use of uninitialized value within @spl in substitution (s///) at PreProcess.pl line 137.
Is there a syntax error in the script?
(Running it on Windows, with the latest 64-bit Strawberry Perl.)
my $Dat=2;
my $a = 7;
foreach (@spl) {
if ( $_ =~ $NameInstru ) {
print $spl[$Dat] =~ s/-/\./gr, " 00:00; ",$spl[$a],"\n"; # data
$Dat += 87;
$a += 87;
}
}
Inside the array I have this type of data:
"U.S. DOLLAR INDEX - ICE FUTURES U.S."
150113
2015-01-13
098662
ICUS
01
098
128104
14111
88637
505
13200
50
269
43140
34142
1862
37355
482
180
110623
126128
17480
1976
1081
-3699
8571
-120
646
50
248
1581
-8006
319
2093
31
-30
1039
1063
42
18
100.0
11.0
69.2
0.4
10.3
0.0
0.2
33.7
26.7
1.5
29.2
0.4
0.1
86.4
98.5
13.6
1.5
215
7
.
.
16
.
.
50
16
8
116
6
4
197
34
28.6
85.1
41.3
91.3
28.2
85.1
40.8
91.2
"(U.S. DOLLAR INDEX X $1000)"
"098662"
"ICUS"
"098"
"F90"
"Combined"
"U.S. DOLLAR INDEX - ICE FUTURES U.S."
150106
2015-01-06
098662
ICUS
01
098
127023
17810
80066
625
12554
0
21
41559
42148
1544
35262
452
210
109585
125065
17438
1958
19675
486
23911
49
2717
0
-73
9262
-5037
30
5873
270
95
18439
19245
1237
431
100.0
14.0
63.0
0.5
9.9
0.0
0.0
32.7
33.2
1.2
27.8
0.4
0.2
86.3
98.5
13.7
1.5
202
7
.
.
16
0
.
48
16
9
105
6
4
185
34
29.3
83.2
43.2
90.6
28.9
83.2
42.8
90.5
"(U.S. DOLLAR INDEX X $1000)"
"098662"
"ICUS"
"098"
"F90"
"Combined"

You are probably loading a file made of data sets of 87 lines each into a single array, and then you get the error at the end of your data, when you try to read past the last array index.
You can probably solve it by iterating over the array indexes instead of the array values, e.g.
my $Dat = 2;
my $a = 7;
my $set_size = 87;
for (my $n = 0; $n + $a < @spl; $n += $set_size) {
if ( $spl[$n] =~ $NameInstru ) {
print $spl[$n + $Dat] =~ s/-/\./gr, " 00:00; ",$spl[$n + $a],"\n"; # data
}
}
While this might solve your problem, it might be better to try and find a proper way to parse your file.
If the records inside the input file are separated by a blank line, you can try to read whole records at once by changing the input record separator to "" or "\n\n". Then you can split each element in the resulting array on newline \n and get an entire record as a result. For example:
$/ = "";
my @spl;
open my $fh ...
while (<$fh>) {
push @spl, [ split "\n", $_ ];
}
...
for my $record (@spl) {
# @$record is now an 87 element array with each record in the file
}

TLP's solution of iterating over the indexes of an array, incrementing by 87 at a time, is great.
Here's a more complex solution, but one that doesn't require loading the entire file into memory.
my $lines_per_row = 87;
my @row;
while (<>) {
chomp;
push @row, $_;
if (@row == $lines_per_row) {
my ($instru, $dat, $a) = @row[0, 2, 7];
if ($instru =~ $NameInstru) {
print $dat =~ s/-/\./gr, " 00:00; $a\n";
}
@row = ();
}
}

AWK loop to parse file

I have trouble understanding an awk command which I want to change slightly (but can't, because I don't understand the code well enough).
This awk command combines text files that each have 6 columns. In the output file, the first column is a merge of the first columns of all input files. The other columns of the output file are the remaining columns of the input files, with blanks added where needed so they still line up with the first-column values.
First, I would like to parse only some specific columns from these files, not all 6. I couldn't figure out where to specify that in the awk loop.
Secondly, the column headers are no longer the first row of the output file. It would be nice to keep them as the header of the output file as well.
Thirdly, I need to know which file the data comes from. I know the command takes the files in the order they appear in ls -lh *mosdepth.summary.txt, so I can deduce that the first 6 columns are from file 1, the next 6 from file 2, etc. However, I would like this information to appear automatically in the output file, to reduce the potential human errors I could make by inferring the origin of the data.
Here is the awk command:
awk -F"\t" -v OFS="\t" 'F!=FILENAME { FNUM++; F=FILENAME }
{ COL[$1]++; C=$1; $1=""; A[C, FNUM]=$0 }
END {
for(X in COL)
{
printf("%s", X);
for(N=1; N<=FNUM; N++) printf("%s", A[X, N]);
printf("\n");
}
}' *mosdepth.summary.txt > Se_combined.coverage.txt
The input files look like this:
cat file1
chrom length bases mean min max
contig_1_pilon 223468 603256 2.70 0 59
contig_2_pilon 197061 1423255 7.22 0 102
contig_6_pilon 162902 1372153 8.42 0 80
contig_19_pilon 286502 1781926 6.22 0 243
contig_29_pilon 263348 1251842 4.75 0 305
contig_32_pilon 291449 1819758 6.24 0 85
contig_34_pilon 51310 197150 3.84 0 29
contig_37_pilon 548146 4424483 8.07 0 399
contig_41_pilon 7529 163710 21.74 0 59
cat file2
chrom length bases mean min max
contig_2_pilon 197061 2098426 10.65 0 198
contig_19_pilon 286502 1892283 6.60 0 233
contig_32_pilon 291449 2051790 7.04 0 172
contig_37_pilon 548146 6684861 12.20 0 436
contig_42_pilon 14017 306188 21.84 0 162
contig_79_pilon 17365 883750 50.89 0 1708
contig_106_pilon 513441 6917630 13.47 0 447
contig_124_pilon 187518 374354 2.00 0 371
contig_149_pilon 1004879 13603882 13.54 0 801
The (wrong) output looks like this:
contig_149_pilon 1004879 13603882 13.54 0 801
contig_79_pilon 17365 883750 50.89 0 1708
contig_1_pilon 223468 603256 2.70 0 59
contig_106_pilon 513441 6917630 13.47 0 447
contig_2_pilon 197061 1423255 7.22 0 102 197061 2098426 10.65 0 198
chrom length bases mean min max length bases mean min max
contig_37_pilon 548146 4424483 8.07 0 399 548146 6684861 12.20 0 436
contig_41_pilon 7529 163710 21.74 0 59
contig_6_pilon 162902 1372153 8.42 0 80
contig_42_pilon 14017 306188 21.84 0 162
contig_29_pilon 263348 1251842 4.75 0 305
contig_19_pilon 286502 1781926 6.22 0 243 286502 1892283 6.60 0 233
contig_124_pilon 187518 374354 2.00 0 371
contig_34_pilon 51310 197150 3.84 0 29
contig_32_pilon 291449 1819758 6.24 0 85 291449 2051790 7.04 0 172
EDIT:
Thanks to input from several users, I managed to address points 1 and 3 like this:
awk -F"\t" -v OFS="\t" 'F!=FILENAME { FNUM++; F=FILENAME }
{ B[FNUM]=F; COL[$1]; C=$1; $1=""; A[C, FNUM]=$4}
END {
printf("%s\t", "contig")
for (N=1; N<=FNUM; N++)
{ printf("%.5s\t", B[N])}
printf("\n")
for(X in COL)
{
printf("%s\t", X);
for(N=1; N<=FNUM; N++)
{ printf("%s\t", A[X, N]);
}
printf("\n");
}
}' file1.txt file2.txt > output.txt
with this output:
contig file1 file2
contig_149_pilon 13.54
contig_79_pilon 50.89
contig_1_pilon 2.70
contig_106_pilon 13.47
contig_2_pilon 7.22 10.65
chrom mean mean
contig_37_pilon 8.07 12.20
contig_41_pilon 21.74
contig_6_pilon 8.42
contig_42_pilon 21.84
contig_29_pilon 4.75
contig_19_pilon 6.22 6.60
contig_124_pilon 2.00
contig_34_pilon 3.84
contig_32_pilon 6.24 7.04
Awk processes files in records, where the records are separated by the record separator RS. Each record is split into fields, where the field separator is defined by the variable FS, which the -F flag can set.
In the case of the program presented in the OP, the record separator is the default value, which is the <newline> character, and the field separator is set to the <tab> character.
Awk programs are generally written as a sequence of pattern-action pairs of the form pattern { action }. These pairs are evaluated in order, and action is performed whenever pattern evaluates to a non-zero number or a non-empty string.
In the current program there are three such pattern-action pairs:
F!=FILENAME { FNUM++; F=FILENAME }: This states that if the value of F differs from the FILENAME currently being processed, then increase the value of FNUM by one and update F to the current FILENAME.
In the end, this is the same as just checking if we are processing a new file or not. The equivalent version of this would be:
(FNR==1) { FNUM++ }
which reads: If we are processing the first record of the current file (FNR), then increase the file count FNUM.
{ COL[$1]++; C=$1; $1=""; A[C, FNUM]=$0 }: As there is no pattern, it defaults to true. So, for each record/line, increment the number of times the value in the first column has been seen and store that count in the associative array COL (key-value pairs). Memorize the first field in C and store the current record, with its first field removed, in the array A. So if a record of the second file reads "foo A B C D" and foo has already been seen 3 times, then COL["foo"] will be equal to 4 and A["foo",2] will read " A B C D".
END{ ... }: This is a special pattern-action pair. END indicates that the action should only be executed at the end, after all files have been processed. What the END block does is straightforward: it just prints all records of each file, including empty records.
In the end, this entire script can be simplified to the following:
awk 'BEGIN{ FS="\t" }
{ file_list[FILENAME]
key_list[$1]
record_list[FILENAME,$1]=$0 }
END { for (key in key_list)
for (fname in file_list)
print ( record_list[fname,key] ? record_list[fname,key] : key )
}' file1 file2 file3 ...
Assuming your '*mosdepth.summary.txt' files look like the following:
$ ls *mos*txt
1mosdepth.summary.txt 2mosdepth.summary.txt 3mosdepth.summary.txt
And contents are:
$ cat 1mosdepth.summary.txt
chrom length bases mean min max
contig_1_pilon 223468 1181176 5.29 0 860
contig_2_pilon 197061 2556215 12.97 0 217
contig_6_pilon 162902 2132156 13.09 0 80
$ cat 2mosdepth.summary.txt
chrom length bases mean min max
contig_19_pilon 286502 2067244 7.22 0 345
contig_29_pilon 263348 2222566 8.44 0 765
contig_32_pilon 291449 2671881 9.17 0 128
contig_34_pilon 51310 525393 10.24 0 47
$ cat 3mosdepth.summary.txt
chrom length bases mean min max
contig_37_pilon 548146 6652322 12.14 0 558
contig_41_pilon 7529 144989 19.26 0 71
The following awk command might be appropriate:
$ awk -v target_cols="1 2 3 4 5 6" 'BEGIN{split(target_cols, cols," ")} \
NR==1{printf "%s ", "file#"; for (i=1;i<=length(cols);i++) {printf "%s ", $cols[i]} print ""} \
FNR==1{fnbr++} \
FNR>=2{printf "%s ", fnbr; for (i=1;i<=length(cols);i++) {printf "%s ", $cols[i]} print ""}' *mos*txt | column -t
Output:
file# chrom length bases mean min max
1 contig_1_pilon 223468 1181176 5.29 0 860
1 contig_2_pilon 197061 2556215 12.97 0 217
1 contig_6_pilon 162902 2132156 13.09 0 80
2 contig_19_pilon 286502 2067244 7.22 0 345
2 contig_29_pilon 263348 2222566 8.44 0 765
2 contig_32_pilon 291449 2671881 9.17 0 128
2 contig_34_pilon 51310 525393 10.24 0 47
3 contig_37_pilon 548146 6652322 12.14 0 558
3 contig_41_pilon 7529 144989 19.26 0 71
Alternatively, the following will output the filename rather than file#:
$ awk -v target_cols="1 2 3 4 5 6" 'BEGIN{split(target_cols, cols," ")} \
NR==1{printf "%s ", "fname"; for (i=1;i<=length(cols);i++) {printf "%s ", $cols[i]} print ""} \
FNR==1{fnbr=FILENAME} \
FNR>=2{printf "%s ", fnbr; fnbr="-"; for (i=1;i<=length(cols);i++) {printf "%s ", $cols[i]} print ""}' *mos*txt | column -t
Output:
fname chrom length bases mean min max
1mosdepth.summary.txt contig_1_pilon 223468 1181176 5.29 0 860
- contig_2_pilon 197061 2556215 12.97 0 217
- contig_6_pilon 162902 2132156 13.09 0 80
2mosdepth.summary.txt contig_19_pilon 286502 2067244 7.22 0 345
- contig_29_pilon 263348 2222566 8.44 0 765
- contig_32_pilon 291449 2671881 9.17 0 128
- contig_34_pilon 51310 525393 10.24 0 47
3mosdepth.summary.txt contig_37_pilon 548146 6652322 12.14 0 558
- contig_41_pilon 7529 144989 19.26 0 71
With either command, target_cols="1 2 3 4 5 6" specifies the columns to extract.
For example, target_cols="1 2 3" will produce:
fname chrom length bases
1mosdepth.summary.txt contig_1_pilon 223468 1181176
- contig_2_pilon 197061 2556215
- contig_6_pilon 162902 2132156
2mosdepth.summary.txt contig_19_pilon 286502 2067244
- contig_29_pilon 263348 2222566
- contig_32_pilon 291449 2671881
- contig_34_pilon 51310 525393
3mosdepth.summary.txt contig_37_pilon 548146 6652322
- contig_41_pilon 7529 144989
target_cols="4 5 6" will produce:
fname mean min max
1mosdepth.summary.txt 5.29 0 860
- 12.97 0 217
- 13.09 0 80
2mosdepth.summary.txt 7.22 0 345
- 8.44 0 765
- 9.17 0 128
- 10.24 0 47
3mosdepth.summary.txt 12.14 0 558
- 19.26 0 71

PowerShell script to break up a list into multiple arrays

I am very new to PowerShell. I have code that a co-worker helped me build. It works on a small set of data. However, I am sending this to a SAP Business Objects query, which will only accept about 2000 pieces of data. The amount of data I have to run varies each month but is usually around 7000-8000 items. I need help updating my script to run through the list of data, create an array, add 2000 items to it, then create a new array with the next 2000 items, and so on until it reaches the end of the list.
$source = "{0}\{1}" -f $ENV:UserProfile, "Documents\Test\DataSD.xls"
$WorkbookSource = $Excel.Workbooks.Open("$source")
$WorkSheetSource = $WorkbookSource.WorkSheets.Item(1)
$WorkSheetSource.Activate()
$row = [int]2
$docArray = @()
$docArray.Clear() |Out-Null
Do
{
$worksheetSource.cells.item($row, 1).select() | Out-Null
$docArray += @($worksheetSource.cells.item($row, 1).value())
$row++
}
While ($worksheetSource.cells.item($row,1).value() -ne $null)
So for this example I would need the script to create 4 separate arrays. The first 3 would have 2000 items in them and the last would have 1200 items in it.
For this to work, you will need to export the data to a CSV or otherwise extract it into a collection that holds all the items. Using something like StreamReader would probably allow for faster processing, but I have never worked with it. [blush]
Once $CurBatch is generated, you can feed it into whatever process you want.
$InboundCollection = 1..100
$ProcessLimit = 22
# the "- 1" is to correct for "starts at zero"
$ProcessLimit = $ProcessLimit - 1
$BatchCount = [math]::Floor($InboundCollection.Count / $ProcessLimit)
#$End = 0
foreach ($BC_Item in 0..$BatchCount)
{
if ($BC_Item -eq 0)
{
$Start = 0
}
else
{
$Start = $End + 1
}
$End = $Start + $ProcessLimit
# powershell will happily slice past the end of an array
$CurBatch = $InboundCollection[$Start..$End]
''
$Start
$End
# the 1st item is not the _number in $Start_
# it's the number in the array # "[$Start]"
"$CurBatch"
}
output ...
0
21
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
22
43
23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
44
65
45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66
66
87
67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
88
109
89 90 91 92 93 94 95 96 97 98 99 100
To do this, there are a number of options.
You can read in everything from the Excel file in one large array and split that afterwards in smaller chunks
or you can add the Excel file values in separate arrays while reading.
The code below does just that.
In any case, it is up to you when you would like to actually send the data. You can:
1. process each array immediately (send it to a SAP business objects query) while reading from Excel
2. add it to a Hashtable so you keep all arrays together in memory
3. store it on disk for later use
In the code below, I chose the second option: read the data into a number of arrays and keep these in memory in a Hashtable.
The advantage is that you do not need to interrupt the reading of the Excel data as with option 1, and there is no need to create and re-read intermediate files as with option 3.
$source = Join-Path -Path $ENV:UserProfile -ChildPath "Documents\Test\DataSD.xls"
$maxArraySize = 2000
$Excel = New-Object -ComObject Excel.Application
# It would speed up things considerably if you set $Excel.Visible = $false
$WorkBook = $Excel.Workbooks.Open($source)
$WorkSheet = $WorkBook.WorkSheets.Item(1)
$WorkSheet.Activate()
# Create a Hashtable object to store each array under its own key
# I don't know if you need to keep the order of things later,
# but it maybe best to use an '[ordered]' hash here.
# If you are using PowerShell version below 3.0. you need to create it using
# $hash = New-Object System.Collections.Specialized.OrderedDictionary
$hash = [ordered]@{}
# Create an ArrayList for better performance
$list = New-Object System.Collections.ArrayList
# Initiate a counter to use as Key in the Hashtable
$arrayCount = 0
# and maybe a counter for the total number of items to process?
$totalCount = 0
# Start reading the Excel data. Begin at row $row
$row = 2
do {
$list.Clear()
# Add the values of column 1 to the arraylist, but keep track of the maximum size
while ($WorkSheet.Cells.Item($row, 1).Value() -ne $null -and $list.Count -lt $maxArraySize) {
[void]$list.Add($WorkSheet.Cells.Item($row, 1).Value())
$row++
}
if ($list.Count) {
# Store this array in the Hashtable using the $arrayCount as Key.
$hash.Add($arrayCount.ToString(), $list.ToArray())
# Increment the $arrayCount variable for the next iteration
$arrayCount++
# Update the total items counter
$totalCount += $list.Count
}
} while ($list.Count)
# You're done reading Excel data, so close it and release Com objects from memory
$Excel.Quit()
[System.Runtime.Interopservices.Marshal]::ReleaseComObject($WorkSheet) | Out-Null
[System.Runtime.Interopservices.Marshal]::ReleaseComObject($WorkBook) | Out-Null
[System.Runtime.Interopservices.Marshal]::ReleaseComObject($Excel) | Out-Null
[System.GC]::Collect()
[System.GC]::WaitForPendingFinalizers()
# At this point you should have all arrays stored in the hash to process
Write-Host "Processing $($hash.Count) arrays with a total of $totalCount items"
foreach ($key in $hash.Keys) {
# Send each array to a SAP business objects query separately
# The array itself is at $hash.$key or use $hash[$key]
}
This is not 100% there yet, but I will fine-tune it a bit later today:
$docarray = @{}
$values = @()
$i = 0
$y = 0
for ($x = 0; $x -le 100; $x++) {
if ($i -eq 20) {
$docarray.add($y, $values)
$y++
$i=0
$values = @()
}
$values += $x
$i++
}
$docarray.add($y, $values) ## required
$docarray | Format-List
If the limit is 2000, you would set the if statement to trigger at 2000. The result will be a hashtable with however many entries are needed:
Name : 4
Value : {80, 81, 82, 83...}
Name : 3
Value : {60, 61, 62, 63...}
Name : 2
Value : {40, 41, 42, 43...}
Name : 1
Value : {20, 21, 22, 23...}
Name : 0
Value : {0, 1, 2, 3...}
Each name in the hashtable holds as many values as the $i counter in the if statement allows.
You should then be able to send this to your SAP business objects query by using a foreach loop over the entries of the hashtable:
foreach ($item in $docarray.GetEnumerator()) {
$item.Value
}

Getting an exact match in a hash, but with a twist

I have something I cannot get my head around.
Let's say I have a phone list, used for receiving and dialing out, stored like below. The from and to locations are specified as well.
Country1 Country2 number1 number2
USA_Chicago USA_LA 12 14
AUS_Sydney USA_Chicago 19 15
AUS_Sydney USA_Chicago 22 21
CHI_Hong-Kong RSA_Joburg 72 23
USA_LA USA_Chicago 93 27
Now all I want to do is remove all the duplicates, keeping the location pairs as keys and every number assigned to each pair, but the pair needs to be treated as bidirectional.
In other words, I need to get the results back and then print them like this:
USA_Chicago-USA_LA 27 93 12 14
Aus_Sydney-USA_Chicago 19 15 22 21
CHI_Hong-kong-RSA_Joburg 72 23
I have tried many methods, including a normal hash, and the results seem fine, but it does not handle the bidirectionality, so I get this instead:
USA_Chicago-USA_LA 12 14
Aus_Sydney-USA_Chicago 19 15 22 21
CHI_Hong-kong-RSA_Joburg 72 23
USA_LA-USA_Chicago 93 27
So the duplicate removal works in one direction, but because there is another direction, it will not remove the duplicate "USA_LA-USA_Chicago", which already exists as "USA_Chicago-USA_LA", and will store the same numbers under a swapped name.
The hash I tried last was something like this (not exactly, as I trashed the lot and had to rewrite it for this post):
my @input = ("USA_Chicago USA_LA 12 14",
"AUS_Sydney USA_Chicago 19 15",
"AUS_Sydney USA_Chicago 22 21",
"CHI_Hong-Kong RSA_Joburg 72 23",
"USA_LA USA_Chicago 93 27");
my %hash;
for my $line (@input) {
my ($c1, $c2, $n1, $n2) = split / [\s\|]+ /x, $line;
my $arr = $hash{$c1} ||= [];
push @$arr, "$n1 $n2";
}
for my $c1 (sort keys %hash) {
my $arr = $hash{$c1};
my $vals = join " : ", @$arr;
print "$c1 $vals\n";
}
So if A-B exists and so does B-A, use only one, but assign the values from the key being removed to the remaining key. What I basically need to do is get rid of any duplicate key in either direction, while assigning its values to the remaining key. So A-B and B-A would be considered duplicates, but A-C and B-C are not. -_-
Simply normalise the destinations. I chose to sort them.
use strictures;
use Hash::MultiKey qw();
my @input = (
'USA_Chicago USA_LA 12 14',
'AUS_Sydney USA_Chicago 19 15',
'AUS_Sydney USA_Chicago 22 21',
'CHI_Hong-Kong RSA_Joburg 72 23',
'USA_LA USA_Chicago 93 27'
);
tie my %hash, 'Hash::MultiKey';
for my $line (@input) {
my ($c1, $c2, $n1, $n2) = split / [\s\|]+ /x, $line;
my %map = ($c1 => $n1, $c2 => $n2);
push @{ $hash{[sort keys %map]} }, @map{sort keys %map};
}
__END__
(
['CHI_Hong-Kong', 'RSA_Joburg'] => [72, 23],
['AUS_Sydney', 'USA_Chicago'] => [19, 15, 22, 21],
['USA_Chicago', 'USA_LA'] => [12, 14, 27, 93],
)
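If you would rather not depend on Hash::MultiKey, the same normalisation idea works with a plain hash whose key is the sorted pair joined into one string. A minimal sketch reusing @input from above (the "-" join separator and the %pairs name are my own choices):
my %pairs;
for my $line (@input) {
    my ($c1, $c2, $n1, $n2) = split ' ', $line;
    # Sort the two locations so A-B and B-A collapse into the same key,
    # and keep the numbers aligned with the sorted order of their locations.
    my %map = ($c1 => $n1, $c2 => $n2);
    my $key = join '-', sort keys %map;
    push @{ $pairs{$key} }, @map{ sort keys %map };
}
print "$_ @{ $pairs{$_} }\n" for sort keys %pairs;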
Perl is great for creating complex data structures, but learning to use them effectively takes practice.
Try:
#!/usr/bin/env perl
use strict;
use warnings;
# --------------------------------------
use charnames qw( :full :short );
use English qw( -no_match_vars ); # Avoids regex performance penalty
use Data::Dumper;
# Make Data::Dumper pretty
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Indent = 1;
# Set maximum depth for Data::Dumper, zero means unlimited
local $Data::Dumper::Maxdepth = 0;
# conditional compile DEBUGging statements
# See http://lookatperl.blogspot.ca/2013/07/a-look-at-conditional-compiling-of.html
use constant DEBUG => $ENV{DEBUG};
# --------------------------------------
# skip the column headers
<DATA>;
my %bidirectional = ();
while( my $line = <DATA> ){
chomp $line;
my ( $country1, $country2, $number1, $number2 ) = split ' ', $line;
push @{ $bidirectional{ $country1 }{ $country2 } }, [ $number1, $number2 ];
push @{ $bidirectional{ $country2 }{ $country1 } }, [ $number1, $number2 ];
}
print Dumper \%bidirectional;
__DATA__
Country1 Country2 number1 number2
USA_Chicago USA_LA 12 14
AUS_Sydney USA_Chicago 19 15
AUS_Sydney USA_Chicago 22 21
CHI_Hong-Kong RSA_Joburg 72 23
USA_LA USA_Chicago 93 27

Comparing all elements of two big files

How can I compare all elements of one file with all elements of another file, using C or Perl, for much larger data? File 1 contains 100,000 such numbers and file 2 contains 500,000 elements.
I used a foreach within a foreach, splitting each and every element into arrays. It worked correctly in Perl, but checking and printing every occurrence of the elements of just a single column of file 2 in file 1 took 40 minutes. There are 28 such columns.
Is there any way to reduce the time, perhaps by using another language like C?
File 1:
0.1
0.11
0.12
0.13
0.14
0.15
0.16
0.17
0.18
0.19
0.2
File 2:
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.1 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 1.1 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18 1.19 1.2 1.21 1.22 1.23 1.24 1.25 1.26 1.27 1.28
EDIT:
Expected output:
If an element in file 2 matches, print the column number; if not, print '0'.
1 2 0 0 0 0 0 0 0 10 11 12 13 14 15 16 17 18 19 20 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Here is the code I am using. Note: it checks file 2 column-wise against file 1 and prints the column number on a match and '0' otherwise. It prints the output for every column into 28 different files.
#!/usr/bin/perl -w
chomp($file = "File1.txt");
open(FH, $file);
@k_org = <FH>;
chomp($hspfile = 'file2.txt');
open(FH1, $hspfile);
@hsporg = <FH1>;
for $z (1 .. 28) {
open(OUT, ">$z.txt");
foreach (@hsporg) {
$i = 0;
@h_org = split('\t', $_);
chomp ($h_org[0]);
foreach(@k_org) {
@orginfo = split('\t', $_);
chomp($orginfo[0]);
if($h_org[0] eq $orginfo[0]) {
print OUT "$z\n";
$i = 1;
goto LABEL;
} elsif ($h_org[0] ne $orginfo[0]) {
if($h_org[0]=~/(\w+\s\w+)\s/) {
if($orginfo[0] eq $1) {
print OUT "0";
$i = 1;
goto LABEL;
}
}
}
}
if ($i == 0) {
print OUT "0";
}
LABEL:
}
}
close FH;
close FH1;
close OUT;
If you sort(1) the files, you can then check them in a single pass. It should not take more than a couple of seconds (including the sort).
Another way is to load all values from file 1 into a hash. It's a bit more memory-consuming, especially if file 1 is large, but it should be fast (again, no more than a couple of seconds).
I would choose Perl over C for such a job, even though I'm more proficient in C than in Perl. It's much faster to code this kind of job in Perl, it's less error-prone, and it runs fast enough.
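A minimal sketch of that hash approach, using the file names from the question (I assume whitespace-separated columns in file 2; swap in split /\t/ if the real file is tab-separated):
use strict;
use warnings;

# Load every value of File1 into a hash for constant-time lookups
open my $fh1, '<', 'File1.txt' or die "File1.txt: $!";
my %in_file1 = map { chomp; $_ => 1 } <$fh1>;
close $fh1;

# For each line of file2, print the 1-based column number on a match, else 0
open my $fh2, '<', 'file2.txt' or die "file2.txt: $!";
while (my $line = <$fh2>) {
    my @cols = split ' ', $line;
    print join(' ', map { $in_file1{ $cols[$_] } ? $_ + 1 : 0 } 0 .. $#cols), "\n";
}
close $fh2;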
This script runs a test case. Note that your expected output is objectively wrong: In file 2, line 1, column 20, the value 0.2 exists.
#!perl
use 5.010; # just for `say`
use strict; use warnings;
use Test::More;
# define input files + expected outcome
my $file_1_contents = <<'_FILE1_';
0.1
0.11
0.12
0.13
0.14
0.15
0.16
0.17
0.18
0.19
0.2
_FILE1_
my $file_2_contents = <<'_FILE2_';
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.1 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 1.1 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18 1.19 1.2 1.21 1.22 1.23 1.24 1.25 1.26 1.27 1.28
_FILE2_
my $expected_output = <<'_OUTPUT_';
1 2 0 0 0 0 0 0 0 10 11 12 13 14 15 16 17 18 19 20 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
_OUTPUT_
# open the filehandles
open my $file1, "<", \$file_1_contents or die "$!";
open my $file2, "<", \$file_2_contents or die "$!";
open my $expected, "<", \$expected_output or die "$!";
my %file1 = map { chomp; 0+$_ => undef } <$file1>;
while (<$file2>) {
chomp;
my @vals = split;
# If value exists in file1, print the col number.
my $line = join " " => map { exists $file1{0+$vals[$_]} ? $_+1 : 0 } 0 .. $#vals;
chomp(my $expected_line = <$expected>);
is $line, $expected_line;
}
done_testing;
To print the exact same output to 28 files, you would remove the testing code, and rather
say {$_} $line for @filehandles;
instead.
Old answer
Your existing code is simply quite inefficient and unidiomatic. Let me help you fix that.
First, start all your Perl scripts with use strict; use warnings;, and if you have a modern perl (5.10 or later), you could do use 5.010; (or whatever your version is) to activate additional features.
The chomp call takes a variable and removes the current value of $/ (usually a newline) from the end of the string. This is important because the readline operator doesn't do that for us. It is not good for declaring a constant variable. Rather, do
my $file = "File1.txt";
my $hspfle = "File2.txt";
The use strict forces you to declare your variables properly; you can do so with my.
To open a file, you should use the following idiom:
open my $fh, "<", $filename or die "Can't open $filename: $!";
Instead of or die ... you can use autodie at the top of your script.
This will abort the script if you can't open the file, tell you the reason ($!), and specify an explicit open mode (here: < = read). This avoids bugs with special characters in filenames.
Lexical filehandles (in my variables, as contrasted to bareword filehandles) have proper scope, and close themselves. There are various other reasons why you should use them.
The split function takes a regex, not a string, as its first argument. If you inspect your program closely, you will see that you split each element in @hsporg 28 times, and each element in @k_org 28 × @hsporg times. This is extremely slow and unnecessary, as we can do the splitting beforehand.
If a condition is false, you don't need to explicitly test for the falseness again in
if ($h_org[0] eq $orginfo[0]) {
...;
} elsif ($h_org[0] ne $orginfo[0]) {
...;
}
as $a ne $b is exactly equivalent to not $a eq $b.
It is quite unidiomatic in Perl to use a goto, and jumping to a label somewhere isn't especially fast either. Labels are mainly used for loop control:
# random example
LOOP: for my $i (1 .. 10) {
for my $j (1 .. 5) {
next if $i == $j; # start next iteration of current loop
next LOOP if 2 * $i == $j; # start next iteration of labeled loop
last LOOP if $i + $j == 13; # like `break` in C
}
}
The redo loop control verb is similar to next, but doesn't recheck the loop condition, if there is one.
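For illustration, a small contrived example of redo (not from the original post):
my @values = (1, 5, 40);
for my $v (@values) {
    if ($v < 10) {
        $v *= 2;   # $v aliases the array element, so the element is updated
        redo;      # repeat this iteration; the foreach does not advance
    }
    print "$v\n";  # prints 16, 10, 40
}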
Because of these loop control facilities, and the ability to break out of any enclosing loop, maintaining flags or elaborate gotos is often quite unnecessary.
Here is a cleaned-up version of your script, without fixing too much of the actual algorithm:
#!/usr/bin/perl
use strict; use warnings;
use autodie; # automatic error messages
my ($file, $hspfile) = ("File1.txt", "file2.txt");
open my $fh1, "<", $file;
open my $fh2, "<", $hspfile;
my @k_org = <$fh1>;
my @hsporg = <$fh2>;
# Presplit the contents of the arrays:
for my $arr (\@k_org, \@hsporg) {
for (@$arr) {
chomp;
$_ = [ split /\t/ ]; # put an *anonymous arrayref* into each slot
}
}
my $output_files = 28;
for my $z (1 .. $output_files) {
open my $out, ">", "$z.txt";
H_ORG:
for my $h_org (@hsporg) {
my $i = 0;
ORGINFO:
for my $orginfo (@k_org) {
# elements in array references are accessed like $arrayref->[$i]
if($h_org->[0] eq $orginfo->[0]) {
print $out "$z\n";
$i = 1;
last ORGINFO; # break out of this loop
} elsif($h_org->[0] =~ /(\w+\s\w+)\s/ and $orginfo->[0] eq $1) {
print $out "0";
$i = 1;
last ORGINFO;
}
}
print $out "0" if not $i;
}
}
# filehandles are closed automatically.
Now we can optimize further: In each line, you only ever use the first element. This means we don't have to store the rest:
...;
for (@$arr) {
chomp;
$_ = (split /\t/, $_, 2)[0]; # save just the first element
}
...;
ORGINFO:
for my $orginfo (@k_org) {
# elements in array references are accessed like $arrayref->[$i]
if($h_org eq $orginfo) {
...;
} elsif($h_org =~ /(\w+\s\w+)\s/ and $orginfo eq $1) {
...;
}
}
Also, accessing scalars is a bit faster than accessing array elements.
The third arg to split limits the number of resulting fragments. Because we are only interested in the first field, we don't have to split the rest too.
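For example:
my ($first) = split /\t/, "colA\tcolB\tcolC", 2;  # stops after two fragments; $first is "colA"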
Next, we last out of the ORGINFO loop, then check a flag. This is unnecessary: we can jump directly to the next iteration of the H_ORG loop instead of setting the flag. If we drop out of the ORGINFO loop naturally, the flag is guaranteed not to be set, so we can execute the print anyway:
H_ORG:
for my $h_org (@hsporg) {
for my $orginfo (@k_org) {
if($h_org eq $orginfo) {
print $out "$z\n";
next H_ORG;
} elsif($h_org =~ /(\w+\s\w+)\s/ and $orginfo eq $1) {
print $out "0";
next H_ORG;
}
}
print $out "0";
}
Then, you compare the same data 28 times to print it to different files. Better: Define two subs print_index and print_zero. Inside these, you loop over the output filehandles:
# make this initialization *before* you use the subs!
my @filehandles = map {open my $fh, ">", "$_.txt"; $fh} 1 .. $output_files;
...; # the H_ORG loop
sub print_index {
for my $i (0 .. $#filehandles) {
print {$filehandles[$i]} $i+1, "\n";
}
}
sub print_zero {
print {$_} 0 for @filehandles;
}
Then:
# no enclosing $z loop!
H_ORG:
for my $h_org (@hsporg) {
for my $orginfo (@k_org) {
if($h_org eq $orginfo) {
print_index();
next H_ORG;
} elsif($h_org =~ /(\w+\s\w+)\s/ and $orginfo eq $1) {
print_zero();
next H_ORG;
}
}
print_zero();
}
This avoids checking data you already know doesn't match.
In C, you could try using the qsort and bsearch functions.
First, you need to load your files into arrays.
Then you should run qsort() (unless you are sure the elements are already ordered), and use bsearch() to perform a binary search on your array.
http://linux.die.net/man/3/bsearch
This will be much faster than checking all elements one by one.
You could also implement a binary search in Perl if one does not exist already; it is a simple algorithm.
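A minimal Perl sketch of such a binary search (illustrative only; it uses string comparison via cmp, so switch to <=> for numeric ordering):
use strict;
use warnings;

# Return the index of $target in the sorted array @$sorted, or -1 if absent.
sub binary_search {
    my ($sorted, $target) = @_;
    my ($lo, $hi) = (0, $#$sorted);
    while ($lo <= $hi) {
        my $mid = int( ($lo + $hi) / 2 );
        my $cmp = $sorted->[$mid] cmp $target;
        if    ($cmp < 0) { $lo = $mid + 1 }
        elsif ($cmp > 0) { $hi = $mid - 1 }
        else             { return $mid }
    }
    return -1;
}

my @sorted = sort ('0.2', '0.1', '0.12', '0.11');
print binary_search(\@sorted, '0.12'), "\n";   # prints 2
print binary_search(\@sorted, '0.3'),  "\n";   # prints -1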

How could I print a @slice array's elements in Perl?

I have this piece of code to catch the greatest value of an array nested in a hash. When Perl identifies the biggest value, that array is removed into the @slice array:
if ( max(map $_->[1], @$val)){
my @slice = (@$val[1]);
my @ignored = @slice;
delete(@$val[1]);
print "$key\t @ignored\n";
warn Dumper \@slice;
}
Data::Dumper output:
$VAR1 = [
[
'3420',
'3446',
'13',
'39',
55
]
];
I want to print that information separated by tabs (\t), one record per line, like this list:
miRNA127 dvex589433 - 131 154
miRNA154 dvex546562 + 232 259
miRNA154 dvex573491 + 297 324
miRNA154 dvex648254 + 147 172
miRNA154 dvex648254 + 287 272
miRNA32 dvex320240 - 61 83
miRNA32 dvex623745 - 141 163
miRNA79 dvex219016 + ARRAY(0x100840378)
But in the last line I always obtain this result.
How could I generate this output instead?
miRNA127 dvex589433 - 131 154
miRNA154 dvex546562 + 232 259
miRNA154 dvex573491 + 297 324
miRNA154 dvex648254 + 147 172
miRNA154 dvex648254 + 287 272
miRNA32 dvex320240 - 61 83
miRNA32 dvex623745 - 141 163
miRNA79 dvex219016 + 3420 3446
Additional explanation:
In this case, I want to catch the highest value in $VAR->[1] and check whether its difference from the minimum in $VAR->[0] is <= 55. If not, I need to eliminate this AoA (the one with the highest value) and fill an @ignored array with it. Next, I want to print some values of @ignored, as a list. Then, with the remaining AoAs, I want to repeat the same process...
print "$key\t $ignored[0]->[0]\t$ignored[0]->[1]";
You have an array of arrays, so each element of @ignored is an array reference. The notation $ignored[0] gets the zeroth element (which is an array), and ->[0] and ->[1] retrieve the zeroth and first elements of that array.
For example:
use strict;
use warnings;
use Data::Dumper;
my @ignored;
$ignored[0] = [ '3420', '3446', '13', '39', 55 ];
my $key = 'miRNA79 dvex219016 +';
print Dumper \@ignored;
print "\n";
print "$key\t$ignored[0]->[0]\t$ignored[0]->[1]";
Output:
$VAR1 = [
[
'3420',
'3446',
'13',
'39',
55
]
];
miRNA79 dvex219016 + 3420 3446
Another option that generates the same output is to join all the values with a \t:
print join "\t", $key, @{ $ignored[0] }[ 0, 1 ];
