I have a problem I am hoping someone can help with (greatly simplified for the purposes of explaining what I am trying to do)...
I have three different arrays:
my @array1 = ("DOG","CAT","HAMSTER");
my @array2 = ("DONKEY","FOX","PIG", "HORSE");
my @array3 = ("RHINO","LION","ELEPHANT");
I also have a variable that contains the content from a web page (using WWW::Mechanize):
my $variable = $r->content;
I now want to see if any of the elements in each of the arrays are found in the variable, and if so which array it comes from:
e.g
if ($variable =~ (any of the elements in @array1)) {
print "FOUND IN ARRAY1";
} elsif ($variable =~ (any of the elements in @array2)) {
print "FOUND IN ARRAY2";
} elsif ($variable =~ (any of the elements in @array3)) {
print "FOUND IN ARRAY3";
}
What is the best way to go about doing this using the arrays and iterating through each element in the arrays? Is there a better way this can be done?
Your help is much appreciated, thanks.
You can make a regex out of the array elements, but you'll most likely want to disable meta characters and make sure you do not get partial matches:
my $rx = join('\b|\b', map quotemeta, @array1);
if ($variable =~ /\b$rx\b/) {
print "matched array 1\n";
}
If you do want to get partial matches, such as FOXY below, simply remove all the \b sequences.
Demonstration:
use strict;
use warnings;
my @array1 = ("DOG","CAT","HAMSTER");
my @array2 = ("DONKEY","FOX","PIG", "HORSE");
my @array3 = ("RHINO","LION","ELEPHANT");
my %checks = (
array1 => join('\b|\b', map quotemeta, @array1),
array2 => join('\b|\b', map quotemeta, @array2),
array3 => join('\b|\b', map quotemeta, @array3),
);
while (<DATA>) {
chomp;
print "The string: '$_'\n";
for my $key (sort keys %checks) {
print "\t";
if (/\b$checks{$key}\b/) {
print "does";
} else {
print "does not";
}
print " match $key\n";
}
}
__DATA__
A DOG ATE MY RHINO
A FOXY HORSEY
Output:
The string: 'A DOG ATE MY RHINO'
does match array1
does not match array2
does match array3
The string: 'A FOXY HORSEY'
does not match array1
does not match array2
does not match array3
my $re1 = join '|', @array1;
say "found in array 1" if $variable =~ /$re1/;
Repeat for each additional array (or use an array of regexes and an array of arrays of terms).
First of all, when you find yourself adding an integer suffix to variable names, think "I should have used an array."
Therefore, first I am going to put the wordsets in an array of arrayrefs. That will help identify where the matched word came from.
Second, I am going to use Regex::PreSuf to make a pattern out of each word list because I always forget the right way to do that.
Third, note that using \b in regex patterns can lead to surprising results. So, instead, I am going to split up the content into individual sequences of \w characters.
Fourth, you say "I also have a variable that contains the content from a web page (using WWW::Mechanize)". Do you want to match words in the comments? In title attributes? If you don't, you should parse the HTML document either to extract full plain text or to restrict the match to within a certain element or set of elements.
Then, grep from the list of words in the text those that are in a wordset and map them to the wordset they matched.
#!/usr/bin/env perl
use strict; use warnings;
use Regex::PreSuf qw( presuf );
my @wordsets = (
[ qw( DOG CAT HAMSTER ) ],
[ qw( DONKEY FOX PIG HORSE ) ],
[ qw( RHINO LION ELEPHANT ) ],
);
my @patterns = map {
my $pat = presuf(@$_);
qr/\A($pat)\z/;
} @wordsets;
my $content = q{Lorem ipsum dolor sit amet, consectetur adipisicing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim
ad minim veniam, quis ELEPHANT exercitation ullamco laboris nisi ut aliquip
ex ea commodo consequat. Duis aute irure dolor in reprehenderit in HAMSTER
velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat
cupidatat non proident, sunt in DONKEY qui officia deserunt mollit anim id
est laborum.};
my @contents = split /\W+/, $content;
use YAML;
print Dump [
map {
my $i = $_;
map +{$_ => $i },
grep { $_ =~ $patterns[$i] } @contents
} 0 .. $#patterns
];
Here, grep { $_ =~ $patterns[$i] } @contents extracts the words from @contents which are in the given wordset. Then, map +{$_ => $i } maps those words to the wordset from which they came. The outer map just loops over each wordset pattern.
Output:
---
- HAMSTER: 0
- DONKEY: 1
- ELEPHANT: 2
That is, you get a list of hashrefs where the key in each hashref is the word that was found and the value is the wordset that matched.
I assume $variable is not an array, in which case use a foreach statement.
foreach my $item (@array1) {
if ($item eq $variable) {
print "FOUND IN ARRAY1";
}
}
and repeat the above for each array, i.e. array2, array3...
EDIT: I think you could use perl's map function, something like this:
my @a1matches = map { $variable =~ /$_/ ? $_ : (); } @array1;
print "FOUND IN ARRAY1\n" if $#a1matches >= 0;
my @a2matches = map { $variable =~ /$_/ ? $_ : (); } @array2;
print "FOUND IN ARRAY2\n" if $#a2matches >= 0;
my @a3matches = map { $variable =~ /$_/ ? $_ : (); } @array3;
print "FOUND IN ARRAY3\n" if $#a3matches >= 0;
A fun side effect is that @a1matches contains the elements of @array1 that were in $variable.
Regexp::Assemble may be helpful if you would like to use a module. It lets you assemble several regular expression strings into one regular expression that matches all of the individual expressions.
I've got a strange edge-case.
I have a long string which contains \n (newline characters).
So the string looks something like:
text="loremipsum\nDollor sit atmet \n aliquyam erat,
sed diam\naliquyam erat \n sed diam"
I need to split the string into an array, but keep the newline characters uninterpreted,
so the array/output looks like:
"loremipsum\n"
"Dollor sit atmet \n"
"aliquyam erat, sed diam\n"
"aliquyam erat \n"
"sed diam"
I couldn't find a way to split the string and preserve the \n characters.
If I use IFS=$"\n" the \n characters are deleted,
but if I use IFS="\n" it gets split and deletes all occurrences of n.
I tried it like:
IFS=$"\n" read -d '' -a arr <<<"$text"
How can I solve this?
Clarification/Update
The text is dynamic and can be very long (3000+ characters),
so creating the array like: declare -a arr=([0]=$'loremipsum\n'... is not an option.
The \n characters (0x5c + 0x6e in ascii code) should all be treated the same,
they should not be replaced with an actual newline.
The \n characters must be preserved,
because the program which gets the output looks for these in plaintext.
The \n characters can be at every position in a sentence,
also in a word like:
lor\nem or with spaces: Lorem \n ipsum
So the \n characters must be at the end of the elements inside the array, like shown above.
The text must only be split at \n, not at spaces etc.
My understanding from the sample (input/output) data given:
there is one actual newline character in text (between erat, and sed diam); this is to be removed and assuming there is no (space) after erat, we need to add a (space), ie, replace the actual newline character with a (space)
there are 4 literal strings of \ + n; we are to break the array after these literals; the literal \ + n are to remain in the text that is stored in the array
the output should have a leading space removed from array values
I'm assuming the final results should not include the double quotes (ie, OP included the double quotes in the desired output as a means of delimiting the array values for display purposes)
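The reading of the requirements listed above can also be sketched in pure bash, without sed or word splitting. This is only an illustration under those same assumptions; the variable names sep, rest and piece are made up for the example.

```bash
#!/usr/bin/env bash
# Sample input: literal backslash-n sequences plus one real newline.
text='loremipsum\nDollor sit atmet \n aliquyam erat,
sed diam\naliquyam erat \n sed diam'

# Replace the real newline with a space.
text=${text//$'\n'/ }

sep='\n'                            # two characters: a backslash and the letter n
arr=()
rest=$text
while [[ $rest == *"$sep"* ]]; do
    piece=${rest%%"$sep"*}          # everything before the first separator
    rest=${rest#*"$sep"}            # everything after the first separator
    piece=${piece#" "}              # drop one leading space, if present
    arr+=( "$piece$sep" )           # keep the literal \n at the end of the element
done
rest=${rest#" "}
if [[ -n $rest ]]; then
    arr+=( "$rest" )                # trailing text that has no separator
fi

# %s prints the elements verbatim, so the literal \n stays visible.
printf '"%s"\n' "${arr[@]}"
```

Quoting "$sep" inside the expansions makes bash treat it as a literal string rather than a pattern, which is why the backslash is matched as a backslash.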
One idea:
text="loremipsum\nDollor sit atmet \n aliquyam erat,
sed diam\naliquyam erat \n sed diam"
# convert actual newline character to a (space)
text=${text//$'\n'/ }
# add an actual newline character after the literal `\` + `n`
text=${text//\\n/\\n$'\n'}
# print our value, remove leading (space), and load into array
IFS=$'\n' arr=( $(printf "%s\n" "${text}." | sed 's/^ //g') )
# display array
typeset -p arr
declare -a arr=([0]="loremipsum\\n" [1]="Dollor sit atmet \\n" [2]="aliquyam erat, sed diam\\n" [3]="aliquyam erat \\n" [4]="sed diam.")
# loop through array and display individual strings; add double quotes as delimiters for display purposes
for i in "${!arr[@]}"
do
echo "\"${arr[${i}]}\""
done
"loremipsum\n"
"Dollor sit atmet \n"
"aliquyam erat, sed diam\n"
"aliquyam erat \n"
"sed diam."
You can use process substitution and echo, e.g.
text="loremipsum\nDollor sit atmet \n aliquyam erat, sed diam\naliquyam erat \n sed diam"
readarray arr < <(echo -e "$text")
You can also use printf in the process substitution as well, e.g.
< <(printf "$text")
Since the -t option is not given to readarray, the '\n' is included as part of the array element.
Example Use/Output
Adding a declare -p arr to output the array, you would have:
text="loremipsum\nDollor sit atmet \n aliquyam erat, sed diam\naliquyam erat \n sed diam"
readarray arr < <(echo -e "$text")
declare -p arr
declare -a arr=([0]=$'loremipsum\n' [1]=$'Dollor sit atmet \n' [2]=$' aliquyam erat, sed diam\n' [3]=$'aliquyam erat \n' [4]=$' sed diam\n')
If you want to trim leading whitespace, you can use the parameter expansion ${element#"${element%%[![:space:]]*}"}. Up to you.
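As a rough sketch of the trimming step, the loop below strips leading whitespace from every element with a nested parameter expansion. The sample elements and the names trimmed and leading are assumptions made for the illustration, not part of the answer above.

```bash
#!/usr/bin/env bash
# Hypothetical sample elements, similar to what readarray produced above.
arr=( $'loremipsum\n' $' aliquyam erat, sed diam\n' $'\t  sed diam\n' )

trimmed=()
for element in "${arr[@]}"; do
    # ${element%%[![:space:]]*} is the leading run of whitespace (possibly empty);
    # removing exactly that prefix leaves the rest of the element untouched.
    leading=${element%%[![:space:]]*}
    trimmed+=( "${element#"$leading"}" )
done

declare -p trimmed
```

Only the whitespace at the start of each element is removed; the trailing newline that readarray kept is left in place.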
I am trying to match and remove elements from an array called @array. The elements to be removed must match the patterns stored inside an array called @del_pattern.
my @del_pattern = ('input', 'output', 'wire', 'reg', '\b;\b', '\b,\b');
my @array = (['input', 'port_a', ','],
['output', '[31:0]', 'port_b,', 'port_c', ',']);
To remove the patterns contained in @del_pattern from @array, I loop through all the elements in @del_pattern and exclude them using grep.
## delete the patterns found in @del_pattern array
foreach $item (@del_pattern) {
foreach $i (@array) {
@$i = grep(!/$item/, @$i);
}
}
However, I have been unable to remove ',' from @array. If I use ',' instead of '\b,\b' in @del_pattern, element port_b, gets removed from the @array as well, which is not an intended outcome. I am only interested in removing elements that contain only ','.
You are using the wrong regular expression. I updated the code, tried it, and it's working fine. Here is the updated code:
my @del_pattern = ('input', 'output', 'wire', 'reg', '\b;\b', '^,$');
my @array = (['input', 'port_a', ','],
['output', '[31:0]', 'port_b,', 'port_c', ',']);
## delete the patterns found in @del_pattern array
foreach my $item (@del_pattern) {
foreach my $i (@array) {
@$i = grep(!/$item/, @$i);
}
}
The only change made is in the Regex '\b,\b' to '^,$'. I don't have much info on \b but the regex I am suggesting is doing what you intend.
You want
^,\z
An explanation of why \b,\b doesn't match at all follows.
\b
defines the boundary of a "word". It is equivalent to
(?<=\w)(?!\w) | (?<!\w)(?=\w)
so
\b,\b
is equivalent to
(?: (?<=\w)(?!\w) | (?<!\w)(?=\w) ) , (?: (?<=\w)(?!\w) | (?<!\w)(?=\w) )
Since comma is a non-word character, that simplifies to
(?<=\w),(?=\w)
So
'a,b' =~ /\b,\b/ # Match
',' =~ /\b,\b/ # No match
This works, though the code is not very nice.
This code snippet also removes the ','
my $elm = [];
sub extract {
my $elm = shift;
foreach my $del (@del_pattern) {
$elm =~ s/$del//g;
if ( $elm ) {
return $elm;
}
}
}
foreach my $item (@array) {
foreach my $i (@$item) {
my $extract = extract($i);
if ($extract) {
push(@$elm, $extract);
}
}
}
print Dumper($elm);
Why does your @array contain arrays? Why not one big array?
I'm trying to follow this tutorial for my own code, which right now reads a value into a scalar that is pushed into an array called states. However, it doesn't properly hash the function like in the tutorial, and I believe it's because the contents of the array aren't properly quoted.
I've tried
foreach (@states)
{
q($_);
}
and
push @states, q($key);
but neither produces the necessary output. Currently my output displays as
NY, NJ, MI , NJ
when using
print join(", ", @states);
I want it to display
'NY', 'NJ', 'MI' , 'NJ'
Take states, map them to quoted strings, join by comma:
my @states = qw( NY NJ MI );
print join ', ', map "'$_'", @states;
To add quotes around a value you can use double-quoted string interpolation:
"'$_'"
Or you can use string concatenation:
"'".$_."'"
So you can write your foreach loop as follows:
foreach (@states) {
$_ = "'$_'";
}
Note that $_ must be assigned, otherwise the body of the loop has no effect (this is the case with your q($_); code).
Full demo:
use strict;
use warnings;
my @states = qw(NY NJ MI NJ);
foreach (@states) {
$_ = "'$_'";
}
print(join(', ', @states ));
'NY', 'NJ', 'MI', 'NJ'
One other way:
use strict;
use warnings;
my @states = qw/ NY NJ MI NJ /;
my $output = join ', ', map qq/'$_'/, @states;
print $output;
This results in a formatted list (string) of single-quoted elements, each separated as you expect.
'NY', 'NJ', 'MI', 'NJ'
I am experiencing something that I don't understand. Here's my code:
@iprouteur = split( /./, $arraylist[1] );
print "lip du routeur est $arraylist[1]\n";
for ( $i = 2; $i <= $#arraylist; $i++ ) {
print "we found secondary which is $arraylist[$i]\n";
@secondary = split( /./, $arraylist[$i] );
print "voici les ip a comparer : $iprouteur[0] $iprouteur[1] $iprouteur[2] et $secondary[0] $secondary[1] $secondary[2] \n";
if ( $iprouteur[0] eq $secondary[0] && $iprouteur[1] eq $secondary[1] && $iprouteur[2] eq $secondary[2] ) {
print "we need to splice \n";
}
}
The output is like :
lip du routeur est 126.x.x.x
we found secondary which is 126.x.x.x/24
voici les ip a comparer : et
we need to splice
Why can't Perl find what is inside the $iprouteur[x] and $secondary[y] variables?
The problem is that you split with the regex /./. The period . is a meta character, and it is the wildcard, matching any char except newline. So it consumes your entire string when used, and returns a bunch of empty strings. The solution is to escape the period:
@secondary = split(/\./, $arraylist[$i]);
# ^--- note the backslash
Also what I meant in the comments is that this line:
print "voici les ip a comparer : $iprouteur[0] $iprouteur[1] $iprouteur[2] et $secondary[0] $secondary[1] $secondary[2] \n";
can be written:
print "voici les ip a comparer : @iprouteur[0,1,2] et @secondary[0,1,2] \n";
Which is easier both to read and to write.
My bad, I forgot to escape "." in the split function.
I forgot that it was a special char.
@iprouteur = split(/\./,$arraylist[1]);
@secondary = split(/\./, $arraylist[$i]);
Suppose I have a script called 'Hello', something like:
array[0]="hello world"
array[1]="goodbye world"
echo ${array[*]}
and I want to do something like this in another script:
tmp=(`Hello`)
the result I need is:
echo ${tmp[0]} #prints "hello world"
echo ${tmp[1]} #prints "goodbye world"
instead I get
echo ${tmp[0]} #prints "hello"
echo ${tmp[1]} #prints "world"
or in other words, every word is put in a different spot in the tmp array.
how do I get the result I need?
Emit it as a NUL-delimited stream:
printf '%s\0' "${array[@]}"
...and, in the other side, read from that stream:
array=()
while IFS= read -r -d '' entry; do
array+=( "$entry" )
done
This often comes in handy in conjunction with process substitution; in the below example, the initial code is in a command (be it a function or an external process) invoked as generate_an_array:
array=()
while IFS= read -r -d '' entry; do
array+=( "$entry" )
done < <(generate_an_array)
You can also use declare -p to emit a string which can be evaled to get the content back:
array=( "hello world" "goodbye world" )
declare -p array
...and, on the other side...
eval "$(generate_an_array)"
However, this is less preferable -- it's not as portable to programming languages other than bash (whereas almost all languages can read a NUL-delimited stream), and it requires the receiving program to trust the sending program to return declare -p results and not malicious content.
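A minimal round-trip sketch of the NUL-delimited approach, assuming bash 4.4 or later (needed for the -d option of readarray). Here generate_an_array is defined as a function purely for illustration, and the third element is an added assumption to show that embedded newlines survive the trip.

```bash
#!/usr/bin/env bash
# generate_an_array is the placeholder name used above; here it is a function.
generate_an_array() {
    local array=( "hello world" "goodbye world" $'two\nlines' )
    printf '%s\0' "${array[@]}"     # one NUL byte after every element
}

# -d '' makes NUL the delimiter; -t strips it from each stored element.
readarray -t -d '' tmp < <(generate_an_array)

printf '%d elements\n' "${#tmp[@]}"
printf '[%s]\n' "${tmp[@]}"
```

On older bash versions the while/read loop shown above does the same job.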
Although there are workarounds, you can't really "return" an array from a bash function or script, since the normal way of "returning" a value is to send it as a string to stdout and let the caller capture it with command substitution. [Note 1] That's fine for simple strings or very simple arrays (such as arrays of numbers, where the elements cannot contain whitespace), but it's really not a good way to send structured data.
There are workarounds, such as printing a string with specific delimiters (in particular, with NUL bytes) which can be parsed by the caller, or in the form of an executable bash statement which can be evaluated by the caller with eval, but on the whole the simplest mechanism is to require that the caller provide the name of an array variable into which the value can be placed. This only works with bash functions, since scripts can't modify the environment of the caller, and it only works with functions called directly in the parent process, so it won't work with pipelines. Effectively, this is a mechanism similar to that used by the read built-in, and a few other bash built-ins.
Here's a simple example. The function split takes three arguments: an array name, a delimiter, and a string:
split () {
IFS=$2 read -a "$1" -r -d '' < <(printf %s "$3")
}
eg:
$ # Some text
$ lorem="Lorem ipsum dolor
sit amet, consectetur
adipisicing elit, sed do
eiusmod tempor incididunt"
# Split at the commas, putting the pieces in the array phrase
$ split phrase "," "$lorem"
# Print the pieces in a way that you can see the elements.
$ printf -- "--%s\n" "${phrase[@]}"
--Lorem ipsum dolor
sit amet
-- consectetur
adipisicing elit
-- sed do
eiusmod tempor incididunt
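The same "caller supplies the array name" idea can also be sketched with a nameref, assuming bash 4.3 or later. The names fill_greetings and _out are hypothetical and chosen only for this illustration.

```bash
#!/usr/bin/env bash
# fill_greetings is a hypothetical function; local -n needs bash 4.3 or later.
fill_greetings() {
    local -n _out=$1                # _out becomes an alias for the caller's array
    _out=( "hello world" "goodbye world" )
}

# The function runs in the current shell, so tmp is set directly.
fill_greetings tmp
echo "${tmp[0]}"
echo "${tmp[1]}"
```

The caller's array must not itself be named _out, otherwise bash reports a circular name reference.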
Notes:
Any function or script does have a status return, which is a small integer; this is what is actually returned by the return and exit special forms. However, the status return mostly works as a boolean value, and certainly cannot carry a structured value.
hello.sh
declare -a array # declares a global array variable
array=(
"hello world"
"goodbye world"
)
other.sh
. hello.sh
tmp=( "${array[@]}" ) # if you need to make a copy of the array
echo "${tmp[0]}"
echo "${tmp[1]}"
If you truly want a function to spit out values that your script will capture, do this:
hello.sh
#!/bin/bash
array=(
"hello world"
"goodbye world"
)
printf "%s\n" "${array[@]}"
other.sh
#!/bin/bash
./hello.sh | {
readarray -t tmp
echo "${tmp[0]}"
echo "${tmp[1]}"
}
# or
readarray -t tmp < <(./hello.sh)
echo "${tmp[0]}"
echo "${tmp[1]}"