Suppose I have:
awk 'BEGIN{
c["1","2","3"]=1
c["12","3"]=2
c["123"]=3 # fleeting...
c["1","23"]=4
c["1" "2" "3"]=5 # will replace c["123"] above...
for (x in c) {
print length(x), x, c[x]
split(x, d, "") # is there something that would split c["12", "3"] into "12, 3"?
# better: some awk / gawkism in one step?
for (i=1; i <= length(x); i++)
printf("|%s|", d[i])
print "\n"
}
}'
Prints:
4 123 4
|1||||2||3|
3 123 5
|1||2||3|
4 123 2
|1||2||||3|
5 123 1
|1||||2||||3|
In each case, the use of the , in forming the array entry produces a visually similar result (123) when printed in the terminal but a distinct hash value. It would appear that there is an 'invisible' separator between the elements that is lost when printing (i.e., what delimiter makes c["12", "3"] hash differently than c["123"]?)
What value would I use in split to be able to determine where in the string the comma was placed when the array index was created? i.e., if I created an array entry with c["12","3"] what is the easiest way to print "12","3" vs "123" as a visually distinctly different string (in the terminal) than c["123"]?
(I know that I could do c["12" "," "3"] when creating the array entry. But what makes c["12","3"] hash differently than c["123"] and how to print those so they are seen differently in the terminal...)
c["12","3"] is the same as c["12" SUBSEP "3"].
See SUBSEP in the awk man pages. If you have a CSV (so that FS is a comma), you can set SUBSEP=FS in the BEGIN section to write c["12","3"] instead of c["12" FS "3"] and have commas printed as the separator in the array indices.
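To make the separator visible when printing, one option is to split each index on SUBSEP and rejoin the pieces with a comma. The following is a minimal sketch rather than the only way to do it; the names parts, label, and out are illustrative assumptions, and the output is piped through sort only because for (x in c) visits the indices in an unspecified order.

```shell
# Print each array index with its SUBSEP-joined parts separated by commas.
out=$(awk 'BEGIN {
    c["12", "3"] = 2
    c["123"] = 3
    for (x in c) {
        n = split(x, parts, SUBSEP)      # recover the original subscripts
        label = parts[1]
        for (i = 2; i <= n; i++)
            label = label "," parts[i]   # rejoin with a visible comma
        print label " => " c[x]
    }
}' | LC_ALL=C sort)
printf '%s\n' "$out"
```

This prints 12,3 => 2 followed by 123 => 3, so the two keys are now visibly different in the terminal.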
I am pulling my hair out manipulating arrays in bash. I have an array of strings, which contain spaces. I would like an array containing all but the first element of my input array.
input=("first string" "second string" "third string")
echo ${#input[@]}
# len(input)=3
# get slice of all except for first element of input
slice=${input[@]:1}
echo ${#slice[@]}
# expect 2, but get 1
echo $slice
# second string third string
# slice should contain ("second string" "third string"), but instead is "second string third string"
Slicing the array clearly works to eliminate the first element, but the result appears to be a concatenation of all remaining strings, rather than an array. Is there a way to slice an array in bash and get an array as a result?
(sorry, I'm not new to bash, but I've never used it for much before, and I can't find any documentation showing why my slice is flattened)
First off, you should always quote variable expansions. Be very wary of any solution that relies on unquoted expansions. ShellCheck.net is a great tool for catching bugs related to quoting (among many other issues).
To your specific issue, slice=${input[@]:1} does not do what you want. It defines a single scalar variable slice rather than an array, meaning the array expansion (denoted by the [@]) will first be munged into a single string using the current IFS. Here's a demo:
$ arr=(1 2 '3 4')
$ IFS=,
$ var="${arr[@]:1}"
$ echo "$var"
2,3 4
To instead declare and populate an array use the =() notation, like so:
$ var=("${arr[@]:1}")
$ printf '%s\n' "${var[@]}"
2
3 4
Indexes are reset, element 1 is now element 0:
slice=("${input[@]:1}")
Element and index are removed, the first element is now index 1, not index 0:
unset 'input[0]'
${#slice[@]} or ${#input[@]} will now be 1 less than the previous value of ${#input[@]}. Starting out with three elements in input, the values of "${!slice[@]}" and "${!input[@]}" will be 0 1 and 1 2 respectively (for the first and second approach).
If you don't quote the expansion, i.e. write slice=(${input[@]:1}), each array element is split on whitespace, creating many more elements.
I am trying to re-do my program for match-all, match-any, match-none of the items in an array. Some of the documentation on Perl6 doesn't explain the behavior of the current implementation (Rakudo 2018.04), and I have a few more questions.
(1) Documentation on regex says that interpolating array into match regex means "longest match"; however, this code does not seem to do so:
> my $a="123 ab 4567 cde";
123 ab 4567 cde
> my @b=<23 b cd 567>;
[23 b cd 567]
> say (||@b).WHAT
(Slip)
> say $a ~~ m/ @b /
「23」 # <=== I expected the match to be "567" (@b[3] matching $a) which is longer than "23";
(2) (||@b) is a Slip; how do I easily do OR or AND of all the elements in the array without explicitly looping through the array?
> say $a ~~ m:g/ @b /
(「23」 「b」 「567」 「cd」)
> say $a ~~ m:g/ ||@b /
(「23」 「b」 「567」 「cd」)
> say $a ~~ m/ ||@b /
「23」
> say $a ~~ m:g/ |@b /
(「23」 「b」 「567」 「cd」)
> say $a ~~ m:g/ &@b /
(「23」 「b」 「567」 「cd」)
> say $a ~~ m/ &@b /
「23」
> say $a ~~ m/ &&@b /
「23」 # <=== && and & don't do the AND function
(3) What I ended up doing is condensing my previous codes into 2 lines:
my $choose = &any; # can prompt for choice of any, one, all, none here;
say so (gather { for @b -> $z { take $a ~~ m/ { say "==>$_ -->$z"; } <{$z}> /; } }).$choose;
The output is "True" as expected. But I am hoping for a simpler way, without the "gather-take" and "for" loop.
Thank you very much for any insights.
lisprog
interpolate array in match for AND, OR, NOT functions
I don't know any better solution than Moritz's for AND.
I cover OR below.
One natural way to write a NOT of a list of match tokens would be to use the negated versions of a lookahead or lookbehind assertion, eg:
my $a="123 ab 4567 cde";
my @b=<23 b cd 567>;
say $_>>.pos given $a ~~ m:g/ <!before @b> /;
displays:
(0 2 3 4 6 7 9 10 11 13 14 15)
which are the positions of the 12 matches of not 23, b, cd, or 567 in the string "123 ab 4567 cde", shown by the line of ^s below, which point to each of the character positions that matched:
my $a="123 ab 4567 cde";
       ^ ^^^ ^^ ^^^ ^^^
       0123456789012345
I am trying to re-do my program for match-all, match-any, match-none of the items in an array.
These sound junction-like, and some of the rest of your question is clearly all about junctions. If you linked to your existing program, it might make it easier for me/others to see what you're trying to do.
(1)
||@b matches the leftmost matching token in @b, not the longest one.
Write |@b, with a single |, to match the longest matching token in @b. Or, better yet, write just plain @b, which is shorthand for the same thing.
Both of these match patterns (|@b or ||@b), like any other match patterns, are subject to the way the regex engine works, as briefly described by Moritz and in more detail below.
When the regex engine matches a regex against an input string, it starts at the start of the regex and the start of the input string.
If it fails to match, it steps past the first character in the input string, giving up on that character, and instead pretends the input string began at its second character. Then it tries matching again, starting at the start of the regex but the second character of the input string. It repeats this until it either gets to the end of the string or finds a match.
Given your example, the engine fails to match right at the start of 123 ab 4567 cde but successfully matches 23 starting at the second character position. So it's then done -- and the 567 in your match pattern is irrelevant.
One way to get the answer you expected:
my $a="123 ab 4567 cde";
my @b=<23 b cd 567>;
my $longest-overall = '';
sub update-longest-overall ($latest) {
if $latest.chars > $longest-overall.chars {
$longest-overall = $latest
}
}
$a ~~ m:g/ @b { update-longest-overall( $/ ) } /;
say $longest-overall;
displays:
「567」
The use of :g is explained below.
(2)
|@b or ||@b in mainline code mean something completely unrelated to what they mean inside a regex. As you can see, |@b is the same as @b.Slip. ||@b means @b.Slip.Slip which evaluates to @b.Slip.
To do a "parallel" longest-match-pattern-wins OR of the elements of @b, write @b (or |@b) inside a regex.
To do a "sequential" leftmost-match-pattern-wins OR of the elements of @b, write ||@b inside a regex.
I've so far been unable to figure out what & and && do when used to prefix an array in a regex. It looks to me like there are multiple bugs related to their use.
In some of the code in your question you've specified the :g adverb. This directs the engine to not stop when it finds a match but rather to step past the substring it just matched and begin trying to match again further along in the input string.
(There are other adverbs. The :ex adverb is the most extreme. In this case, when there's a match at a given position in the input string, the engine tries to match any other match pattern at the same position in the regex and input string. It keeps doing this no matter how many matches it accumulates until it has tried every last possible match at that position in the regex and input string. Only when it's exhausted all these possibilities does it move forward one character in the input string, and tries exhaustively matching all over again.)
(3)
My best shot:
my $a="123 ab 4567 cde";
my @b=<23 b cd 567>;
my &choose = &any;
say so choose do for @b -> $z {
$a ~~ / { say "==>$a -->$z"; } $z /
}
(1) Documentation on regex says that interpolating array into match regex means "longest match"; however, this code does not seem to do so:
The actual rule is that a regex finds the left-most match first, and the longest match second.
However, the left-most rule is true for all regex matches, which is why the regex documentation doesn't explicitly mention it when talking about alternations.
(2) (||@b) is a Slip; how do I easily do OR or AND of all the elements in the array without explicitly looping through the array?
You can always construct a regex as text first:
my $re_text = join '&&', @branches;
my $regex = rx/ <$re_text> /;
For the following project, I am supposed to take input in this format: R1C5+2, which reads as "in the table, at Row 1, Column 5, add 2", or in this format: R1C2C3-5, which reads as "in the table, at Row 1, Columns 2-3, subtract 5". This assumes that all numbers in the table are initially 0.
Where I left off:
I am having trouble finding a way to detect a "+" or "-" so that I can add to or subtract from values in the table. I also need to support a range, so that multiple cells are updated when two C's or R's are given. For example: R1R5C2C3+2 (row range 1-5, column range 2-3, add 2).
Here is the following code:
puts 'Please input: '
x = gets.chomp
col = []
row = []
x.chars.each_slice(2) { |u| u[0] == "R" ? row << u[1] : col << u[1] }
p col
p row
puts "Largest # in Row array: #{row.max}"
puts "Largest # in Columns array: #{col.max}" #must be in "" to return value
big_row = row.max.to_i
big_col = col.max.to_i
table = Array.new (big_row) { Array.new(big_col) }
The method you are looking for is the =~ operator. If you use it on a string and give it a regexp pattern it will return the location of that pattern in the string. Thus:
x = 'R1C2C3-5'
x =~ /R/
returns: 0 since that is the position of 'R' in the string (counted just like an array 0,1,2...).
If you are unfamiliar with regexp and the =~ operator, I suggest you check out the Ruby doc on it; it is very valuable. Basically, the pattern between the forward slashes gets matched. You are looking to match + or -. Since + has a special meaning in regexp, you have to escape it with a backslash; escaping - is not required outside a character class, but it is harmless.
x =~ /\+/
x =~ /\-/
but you can combine those into one pattern matcher with an OR symbol (pipe) |
x =~ /\+|\-/
So now you have a method to get the operator:
def operator(my_string)
  # Return the first "+" or "-" in the string, or nil if there is none.
  my_string[/\+|\-/]
end
I would also use the operator to split your string into the column/row part and the numeric part:
op = operator(x) # returns '-' for x = 'R1C2C3-5'
arr = x.split(op) # returns an array of two strings: ['R1C2C3', '5']
I leave further string manipulation up to you. I would read through this page on the String class: Ruby String Class and this on arrays: Ruby Array Class as Ruby contains so many methods to make things like this easier. One thing I've learned to do with Ruby is think "I want to do this, I wonder if there is already a built in method to do this?" and I go check the docs. Even more so with Rails!
If you look at the output of this awk test, you see that the array seems to be printed in some random order. The order seems to be the same for the same number of inputs. Why does it do so?
echo "one two three four five six" | awk '{for (i=1;i<=NF;i++) a[i]=$i} END {for (j in a) print j,a[j]}'
4 four
5 five
6 six
1 one
2 two
3 three
echo "P04637 1A1U 1AIE 1C26 1DT7 1GZH 1H26 1HS5 1JSP 1KZY 1MA3 1OLG 1OLH 1PES 1PET 1SAE 1SAF 1SAK 1SAL 1TSR 1TUP 1UOL 1XQH 1YC5 1YCQ" | awk '{for (i=1;i<=NF;i++) a[i]=$i} END {for (j in a) print j,a[j]}'
17 1SAF
4 1C26
18 1SAK
5 1DT7
19 1SAL
6 1GZH
7 1H26
8 1HS5
9 1JSP
10 1KZY
20 1TSR
11 1MA3
21 1TUP
12 1OLG
22 1UOL
13 1OLH
23 1XQH
14 1PES
1 P04637
24 1YC5
15 1PET
2 1A1U
25 1YCQ
16 1SAE
3 1AIE
Why does it do so, is there rule for this?
From 8. Arrays in awk --> 8.5 Scanning All Elements of an Array in the GNU Awk user's guide when referring to the for (value in array) syntax:
The order in which elements of the array are accessed by this
statement is determined by the internal arrangement of the array
elements within awk and cannot be controlled or changed. This can lead
to problems if new elements are added to array by statements in the
loop body; it is not predictable whether or not the for loop will
reach them. Similarly, changing var inside the loop may produce
strange results. It is best to avoid such things.
So if you want to print the array in the order you store it, then you have to use the classical for loop:
for (j=1; j<=NF; j++) print j,a[j]
Example:
$ awk '{for (i=1;i<=NF;i++) a[i]=$i} END {for (j=1; j<=NF; j++) print j,a[j]}' <<< "P04637 1A1U 1AIE 1C26 1DT7 1GZH 1H26 1HS5 1JSP 1KZY 1MA3 1OLG 1OLH 1PES 1PET 1SAE 1SAF 1SAK 1SAL 1TSR 1TUP 1UOL 1XQH 1YC5 1YCQ"
1 P04637
2 1A1U
3 1AIE
4 1C26
5 1DT7
6 1GZH
7 1H26
8 1HS5
9 1JSP
10 1KZY
11 1MA3
12 1OLG
13 1OLH
14 1PES
15 1PET
16 1SAE
17 1SAF
18 1SAK
19 1SAL
20 1TSR
21 1TUP
22 1UOL
23 1XQH
24 1YC5
25 1YCQ
Awk uses hash tables to implement associative arrays, and this ordering is an inherent property of that data structure. The location in which a particular element is stored depends on the hash of its key. Another factor is the implementation of the hash table: if it is memory efficient, it will limit the range each key gets stored in using the modulus function or some other method. Different keys may also produce clashing hash values, so chaining will occur, which again affects the order depending on which key was inserted first.
The construct (key in array) is perfectly fine when used appropriately to loop over every key, but you cannot count on the order, and you should not update the array whilst in the loop, as you may end up processing array[key] multiple times by mistake.
There is a good description of hash tables in the book Think Complexity.
The issue is the operator you use to get the array indices, not the fact that the array is stored in a hash table.
The in operator provides the array indices in a random(-looking) order (which IS by default related to the hash table but that's an implementation choice and can be modified).
A for loop that explicitly provides the array indices in numerically increasing order operates on the same hash table as the in operator, but it produces output in a specific order regardless.
It's just 2 different ways of getting the array indices, both of which work on a hash table.
See man awk and look up the in operator.
If you want to control the output order using the in operator, you can do so with GNU awk (from release 4.0 on) by populating PROCINFO["sorted_in"]. See http://www.gnu.org/software/gawk/manual/gawk.html#Controlling-Array-Traversal for details.
Some common ways to access array indices:
To print array elements in an order you don't care about:
{a[$1]=$0} END{for (i in a) print i, a[i]}
To print array elements in numeric order of indices if the indices are numeric and contiguous starting at 1:
{a[++i]=$0} END{for (i=1;i in a;i++) print i, a[i]}
To print array elements in numeric order of indices if the indices are numeric but non-contiguous:
{a[$1]=$0; min=($1<min?$1:min); max=($1>max?$1:max)} END{for (i=min;i<=max;i++) if (i in a) print i, a[i]}
To print array elements in the order they were seen in the input:
{a[$1]=$0; b[++max]=$1} END{for (i=1;i <= max;i++) print b[i], a[b[i]]}
To print array elements in a specific order of indices using gawk 4.0+:
BEGIN{PROCINFO["sorted_in"]=whatever} {a[$1]=$0} END{for (i in a) print i, a[i]}
For anything else, write your own code and/or see gawk asort() and asorti().
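As one portable sketch of the "write your own code" route, the unordered output of for (i in a) can simply be piped through the external sort utility. The sample input and the variable name out are assumptions made for illustration; this does not rely on any gawk extension.

```shell
# Let awk emit the pairs in whatever order it likes, then sort them
# numerically on the first field outside of awk.
out=$(printf '%s\n' "10 ten" "2 two" "33 thirty-three" |
    awk '{ a[$1] = $2 } END { for (i in a) print i, a[i] }' |
    sort -k1,1n)
printf '%s\n' "$out"
```

The result is 2 two, 10 ten, 33 thirty-three, one pair per line, regardless of the order in which awk walked the array.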
If you are using mawk, or a gawk older than 4.0, you can also set the environment variable WHINY_USERS, which sorts the indices before iterating (gawk 4.0 dropped it in favor of PROCINFO["sorted_in"]).
Example:
echo "one two three four five six" | WHINY_USERS=true awk '{for (i=1;i<=NF;i++) a[i]=$i} END {for (j in a) print j,a[j]}'
1 one
2 two
3 three
4 four
5 five
6 six
From mawk's manual:
WHINY_USERS
This is an undocumented gawk feature. It tells mawk to sort array indices before it starts to iterate over the elements of an array.
Is it possible to initialize an array like this in AWK ?
Colors[1] = ("Red", "Green", "Blue")
Colors[2] = ("Yellow", "Cyan", "Purple")
And then to have a two dimensional array where Colors[2,3]="Purple".
From another thread I understand that it's not possible ( "sadly, there is no way to set an array all at once without abusing split()" ). Anyways I want to be 100% sure and I'm sure that there are others with the same question.
I am looking for the easiest method to initialize arrays like the one above, will be nice to have it well written.
If you have GNU awk, you can use a true multidimensional array. Although this answer uses the split() function, it most certainly doesn't abuse it. Run like:
awk -f script.awk
Contents of script.awk:
BEGIN {
x=SUBSEP
a="Red" x "Green" x "Blue"
b="Yellow" x "Cyan" x "Purple"
Colors[1][0] = ""
Colors[2][0] = ""
split(a, Colors[1], x)
split(b, Colors[2], x)
print Colors[2][3]
}
Results:
Purple
You can create a 2-dimensional array easily enough. What you can't do, AFAIK, is initialize it in a single operation. As dmckee hints in a comment, one of the reasons for not being able to initialize an array is that there is no restriction on the types of the subscripts, and hence no requirement that they are pure numeric. You can do multiple assignments as in the script below. The subscripts are formally separated by an obscure character designated by the variable SUBSEP, with default value 034 (U+001C, FILE SEPARATOR). Clearly, if one of the indexes contains this character, confusion will follow (but when was the last time you used that character in a string?).
BEGIN {
Colours[1,1] = "Red"
Colours[1,2] = "Green"
Colours[1,3] = "Blue"
Colours[2,1] = "Yellow"
Colours[2,2] = "Cyan"
Colours[2,3] = "Purple"
}
END {
for (i = 1; i <= 2; i++)
for (j = 1; j <= 3; j++)
printf "Colours[%d,%d] = %s\n", i, j, Colours[i,j];
}
Example run:
$ awk -f so14063783.awk /dev/null
Colours[1,1] = Red
Colours[1,2] = Green
Colours[1,3] = Blue
Colours[2,1] = Yellow
Colours[2,2] = Cyan
Colours[2,3] = Purple
$
Thanks for the answers.
Anyways, for those who want to initialize unidimensional arrays, here is an example:
SColors = "Red_Green_Blue"
split(SColors, Colors, "_")
print Colors[1] " - " Colors[2] " - " Colors[3]
The existing answers are helpful and together cover all aspects, but I thought I'd give a more focused summary.
The question conflates two aspects:
initializing arrays in Awk in general
doing so to fill a two-dimensional array in particular
Array initialization:
Awk has no array literal (initializer) syntax.
The simplest workaround is to:
represent the array elements as a single string and
use the split() function to split that string into the elements of an array.
$ awk 'BEGIN { n=split("Red Green Blue", arr); for (i=1;i<=n;++i) print arr[i] }'
Red
Green
Blue
This is what the OP did in their own helpful answer.
If the elements themselves contain whitespace, use a custom separator that's not part of the data, | in this example:
$ awk 'BEGIN { n=split("Red (1)|Green (2)", arr, "|"); for (i=1;i<=n;++i) print arr[i] }'
Red (1)
Green (2)
Initialization of a 2-dimensional array:
Per POSIX, Awk has no true multi-dimensional arrays, only an emulation of it using a one-dimensional array whose indices are implicitly concatenated with the value of built-in variable SUBSEP to form a single key (index; note that all Awk arrays are associative).
arr[1, 2] is effectively the same as arr[1 SUBSEP 2], where 1 SUBSEP 2 is a string concatenation that builds the key value.
Because there aren't truly multiple dimensions - only a flat array of compound keys - you cannot enumerate the (pseudo-)dimensions individually with for (i in ...), such as to get all sub-indices for primary (pseudo-)dimension 1 only.
The default value of SUBSEP is the "INFORMATION SEPARATOR FOUR" (FILE SEPARATOR) character, a rarely used control character that's unlikely to appear in data; in ASCII and UTF-8 it is represented as the single byte 0x1c (octal 034); if needed, you can change the value.
By contrast, GNU Awk, as a nonstandard extension, does have support for true multi-dimensional arrays.
Important: You must then always specify the indices separately; e.g., instead of arr[1,2] you must use arr[1][2].
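The compound-key behavior described above can be checked directly from the shell. This is a small sketch under a few assumptions: the variable names subsep_octal and same_key are illustrative, od and tr are assumed to be available, and the awk in use is assumed to be one of the common implementations (gawk, mawk, or the one-true-awk), whose default SUBSEP is "\034".

```shell
# Dump the default SUBSEP as an octal byte value (whitespace stripped).
subsep_octal=$(awk 'BEGIN { printf "%s", SUBSEP }' | od -An -to1 | tr -d '[:space:]')

# Check that arr[1, 2] and arr[1 SUBSEP 2] name the same element.
same_key=$(awk 'BEGIN {
    arr[1, 2] = "x"
    if ((1 SUBSEP 2) in arr) print "yes"; else print "no"
}')

printf 'SUBSEP (octal): %s\nsame key: %s\n' "$subsep_octal" "$same_key"
```

With those implementations this reports octal 034 (that is, 0x1c) and yes, showing that the comma form is only shorthand for string concatenation with SUBSEP.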
POSIX-compliant example (similar to TrueY's helpful answer):
awk 'BEGIN {
n=split("Red Green Blue", arrAux); for (i in arrAux) Colors[1,i] = arrAux[i]
n=split("Yellow Cyan Purple", arrAux); for (i in arrAux) Colors[2,i] = arrAux[i]
print Colors[1,2]
print "---"
# Enumerate all [2,*] values - see comments below.
for (i in Colors) { if (index(i, 2 SUBSEP)==1) print Colors[i] }
}'
Green
---
Yellow
Cyan
Purple
Note that the emulation of multi-dimensional arrays with a one-dimensional array using compound keys has the following inconvenient implications:
Auxiliary array arrAux is needed, because you cannot directly populate a given (pseudo-)dimension of an array.
You cannot enumerate just one (pseudo-)dimension with for (i in ...), you can only enumerate all indices, across (pseudo-)dimensions.
for (i in Colors) { if (index(i, 2 SUBSEP)==1) print Colors[i] }
above shows how to work around that by enumerating all keys and then matching only the ones whose first constituent index is 2, which means that the key value must start with 2, followed by SUBSEP.
GNU Awk example (similar to Steve's helpful answer, improved with Ed Morton's comment):
GNU Awk's (nonstandard) support for true multi-dimensional arrays makes the inconveniences of the POSIX-compliant solution (mostly) go away
(GNU Awk also doesn't have array initializers, however):
gawk 'BEGIN {
Colors[1][""]; split("Red Green Blue", Colors[1])
Colors[2][""]; split("Yellow Cyan Purple", Colors[2])
# NOTE: Always use *separate* indices: [1][2] instead of [1,2]
print Colors[1][2]
print "---"
# Enumerate all [2][*] values
for (i in Colors[2]) print Colors[2][i]
}'
Note:
Important: As stated, to address a specific element in a multi-dimensional array, always use separate indices; e.g., [1][2] rather than [1,2].
If you use [1,2] you'll get the standard POSIX-mandated behavior, and you'll mistakenly create a new, single index (key) with (string-concatenated) value 1 SUBSEP 2.
split() can conveniently be used to directly populate a sub-array.
As a prerequisite, however, the 2-dimensional target arrays must be initialized:
Colors[1][""] and Colors[2][""] do just that.
Dummy index [""] is just there to create a 2-dimensional array; it is discarded when split() fills that dimension later.
Enumerating a specific dimension with for (i in ...) is supported:
for (i in Colors[2]) ... conveniently enumerates only the sub-indices of Colors[2].
A similar solution. SUBSEP=":" is not really needed, just set to any visible char for demo:
awk 'BEGIN{SUBSEP=":"
split("Red Green Blue",a); for(i in a) Colors[1,i]=a[i];
split("Yellow Cyan Purple",a); for(i in a) Colors[2,i]=a[i];
for(i in Colors) print i" => "Colors[i];}'
Or a little bit more cryptic version:
awk 'BEGIN{SUBSEP=":"
split("Red Green Blue Yellow Cyan Purple",a);
for(i in a) Colors[int((i-1)/3)+1,(i-1)%3+1]=a[i];
for(i in Colors) print i" => "Colors[i];}'
Output:
1:1 => Red
1:2 => Green
1:3 => Blue
2:1 => Yellow
2:2 => Cyan
2:3 => Purple