Consider the following function which is currently in the public domain.
function join(array, start, end, sep, result, i)
{
if (sep == "")
sep = " "
else if (sep == SUBSEP) # magic value
sep = ""
result = array[start]
for (i = start + 1; i <= end; i++)
result = result sep array[i]
return result
}
I would like to use this function to join contiguous columns such as $2, $3, $4, where the start and end of the range are variables.
However, in order to do this, I must first convert all the fields into an array using a loop like the following.
for (i = 1; i <= NF; i++) {
a[i] = $i
}
Or the shorter version, as @StevenPenny mentioned.
split($0, a)
Unfortunately both approaches require the creation of a new variable.
Does awk have a built-in way of accessing the columns as an array so that the above manual conversions are not necessary?
No such array is defined in POSIX awk (the only array type special variables are ARGV and ENVIRON).
None exists in gawk either, though it adds PROCINFO, SYMTAB and FUNCTAB special arrays. You can check all the defined variables and types at runtime using the SYMTAB array (gawk-4.1.0 feature):
BEGIN { PROCINFO["sorted_in"]="@ind_str_asc" } # automagic sort for "in"
{ print $0 }
END { for (ss in SYMTAB) printf("%-12s: %s\n",PROCINFO["identifiers"][ss],ss) }
(though you will find that SYMTAB and FUNCTAB themselves are missing from the list, and missing from --dump-variables too, they are treated specially by design).
gawk also offers a few standard loadable extensions, though none implements this feature (and given the dynamic relationship between $0, $1..., NF and OFS, an array that had the same functionality would be a little tricky to implement).
As suggested by Jidder, one work-around is to skip the array altogether and use fields. There's nothing special about the field names: a variable $n can be used the same as a literal like $1 (just take care to use parentheses for precedence in expressions like $(NF-1)). Here's an fjoin function which works on fields rather than an array:
function fjoin(start,end,sep, result,ii) {
if (sep=="") sep=" "
else if (sep==SUBSEP) sep =""
result=$start
for (ii=start+1; ii<=end; ii++)
result=result sep $ii
return result
}
{ print "2,4: " fjoin(2,4,":") }
(this does not treat $0 as a special case)
Or just use split() and be happy, gawk at least guarantees that it behaves identically to field splitting (assuming that none of FS, FIELDWIDTHS and possibly IGNORECASE are being modified so as to change the behaviour).
This is what I do in my own code:
function iter0gen() {
PROCINFO["sorted_in"] = "@ind_num_asc"; # skip this for mawk
return split(sprintf("%0"(NF)"d", 0), iter0, //)
}
Since splitting by the null string puts one character in each bin, just split a string of zeros whose length equals NF to create an array called iter0; then you can do
for (x in iter0) { $(x) = do stuff….. }
This is only for when you need a lazy iterator. The plus side is that since indices begin at 1 by default, you can't accidentally get $0 in the iterator loop. The down side is that, if you're not careful, you will have switched all the input FS into OFS the moment you assign to any field, and this doesn't back up $0 on your behalf.
If you just want the columns, then just do a standard split() into an array using FS. If you're using gawk and would like the seps array too, then add the optional 4th argument, which is non-portable.
Related
I'd like to select a specific element of an array from a file with awk, where the file is not set up to specify every entry as being part of an array. I plan on putting this in a for loop or assigning it to a variable to be used for arithmetic operations. However, I am finding that the way I'm selecting the element doesn't work when assigning it to a variable or using it in a for loop.
1 2 3 4
5 6 7 8
9 8 7 6
If these elements are not specified in awk as being part of an array, referencing them could be done with
FNR == 1 {print $3}
However, I cannot assign this as a variable to be used later, nor can I put this in a loop.
Is there another way to reference a single element of an array without having to restructure the input file?
You can read the file into an array, then access the array. When accessing the array, use split:
{ array[NR] = $0 }
After the input scanning is complete, array[42] gives you the contents of record #42, usually the 42nd line of the input. We can put in an END { ... } block where we process the array.
To get the third element of array[1], we can do this:
split(array[1], fields)
Now we have an array called fields. fields[3] holds the same datum that $3 held when the first record was being processed, which we assigned to array[1].
In Awk we can also simulate two-dimensional arrays, by catenating multiple indices together with some unambiguous separator, like a space or dash.
{ for (i = 1; i <= NF; i++)
array[NR "-" i] = $i }
After this executes for every input record, we can access $3 from record 1 as array["1-3"]. The key 1-3 is a character string.
The expression NR "-" i in the loop body places several expressions next to each other with no operators in between. That denotes string catenation. When NR is 17 and i is 5, we get the string "17-5" and so on.
Since the number of fields per record is variable, we could have another array which gives the NF value for each element of array.
{ nf[NR] = NF;
for (i = 1; i <= NF; i++)
array[NR "-" i] = $i }
Now we know that if nf[17] is 5, the fields array["17-1"] through array["17-5"] are valid.
I know split returns the number of fields parsed if it is assigned to a scalar, and returns the fields themselves if it is assigned to an array.
Is there a way to check whether a line is successfully parsed without having to call split twice (once to check how many fields were parsed, and, if the correct number of fields were parsed, a second time to return the fields in an array)?
foreach (@lines) {
if ( split ) {
my ($ipaddr, $hostname) = split;
}
}
I need to check whether the split succeeded in order to avoid later uninitialized references to $ipaddr and $hostname. It just seems like I ought to be able to combine the two calls to split into a single call.
Sure:
foreach (@lines) {
if (2 == (my ($ipaddr, $hostname) = split)) {
# Got exactly two fields
}
}
So if you just want to skip bad lines, you can simply use:
foreach (@lines) {
2 == (my ($ipaddr, $hostname) = split)
or next;
# Got exactly two fields
}
Don't forget to remove trailing whitespace from your lines first (such as by using chomp to remove line feeds) or it will mess up your field count.
You can change the == to <= if there might be more fields.
I think I would prefer a regex match:
for ( @lines ) {
next unless my ($ipaddr, $hostname) = /(\S+)\s+(\S+)/;
# use $ipaddr & $hostname
}
This is different from the original in that it will succeed if more than two non-space substrings are found, but a fix is simple if it is necessary.
I've often been frustrated by the fact that AutoHotkey is not a zero based language. It doesn't match well when you are translating code from other languages, or even interacting with them such as in JScript through COM ScriptControl. Even parsing DOM elements you have to account for them being zero based, it just seems that most languages have adopted zero based arrays.
Now you can declare an array and make it zero based by doing this:
arr := []
arr[0] := 1
The above works, if I asked for arr[0] it would return 1. But if I use length() method it returns 0, even though there is a value in there!
If we declare and then push():
arr := []
arr.push(3)
It's always stored starting from 1, I want this changed!
Is this possible to do?
Because AutoHotkey is a prototype OOP language (like JavaScript) you can override any function, even built-in ones. Below is a demonstration of overriding Array(); according to Lexikos, it is an undocumented fact that this also overrides arrays defined with [].
I didn't believe it possible, as there are several threads on the forums asking for zero based arrays to be implemented natively, but none offered a solution. Even the thread where an override of Array() was demonstrated made no mention that this would be possible!
As a bonus, I included split() (zero based StrSplit() function), to help demonstrate further the endless possibilities of this feature.
Just to note, I haven't unit tested or implemented every method override. It's possible I've overlooked something, but I felt it was enough for a proof of concept. Further, I have no doubt that this will affect performance on large arrays, particularly because of how I implemented Length() for this demo.
x := [] ; declare empty array
x.push("Zero Based rocks!") ; push message to the array.
msgbox % x[0]
x := "" ; clear our Object
x := split("AutoHotkey with Zero Based Arrays")
msgbox % x.2 " " x.3 " " x.4 " " x.1 " " x.0
Array(prm*) {
x := {}
loop % prm.length()
x[A_Index -1] := prm[A_Index]
x.base := _Array
return x
}
split(x, dlm:="", opt:="") {
r := []
for k,v in StrSplit(x, dlm, opt)
r.push(v)
return r
}
Class _Array {
; Modified .length() to account for 0 index
length() {
c:=0
for k in this
c++
return c
}
; Modified .push() to start at 0
push(x) {
if (this.0 == "" && this.length() == 0)
return this.0 := x
else
return this[this.MaxIndex()+1] := x
}
}
I have various subroutines that give me arrays of arrays. I have tested them separately, but somehow when I write my main routine, I fail to make the program recognize my arrays. I know it's a problem of dereferencing, or at least I suspect it heavily.
The code is a bit long but I'll try to explain it:
my @leaderboard=@arrarraa; #an array of arrays
my $parentmass=$spect[$#spect]; #scalar
while (scalar @leaderboard>0) {
    for my $i(0..(scalar @leaderboard-1)) {
        my $curref=$leaderboard[$i]; #the program says here that there is an uninitialized value. But I start with a list of 18 elements.
        my @currentarray=@$curref; #then i try to dereference the array
        my $w=sumaarray (@currentarray);
        if ($w==$parentmass) {
            if (defined $Leader[0]) {
                my $sc1=score (@currentarray);
                my $sc2=score (@Leader);
                if ($sc1>$sc2) {
                    @Leader=@currentarray;
                }
            }
            else {@Leader=@currentarray;}
        }
        elsif ($w>$parentmass) {splice @leaderboard,$i,1;} #here i delete the element if it doesn't work. I hope it's done correctly.
    }
    my $leadref= cut (@leaderboard); #here i take the first 10 scores of the AoAs
    @leaderboard = @$leadref;
    my $leaderef=expand (@leaderboard); #then i expand the AoAs by one term
    @leaderboard= @$leaderef; #and i should end with a completely different list to work with in the while loop
}
So I don't know how to dereference the AoAs correctly. The output of the program says:
"Use of uninitialized value $curref in concatenation (.) or string at C:\Algorithms\22cyclic\cyclospectrumsub.pl line 183.
Can't use an undefined value as an ARRAY reference at C:\Algorithms\22cyclic\cyclospectrumsub.pl line 184."
I would appreciate enormously any insight or recommendation.
The problem is with the splice that modifies the list while it is being processed. By using 0..(scalar @leaderboard-1) you set up the range of elements to process at the beginning, but when some elements are removed by the splice, the list ends up shorter than that, and once $i runs off the end of the modified list you get undefined references.
A quick fix would be to use
for (my $i = 0; $i < @leaderboard; $i++)
although that's neither very idiomatic nor efficient.
Note that doing something like $i < @leaderboard or @leaderboard-1 already provides scalar context for the array variable, so you don't need the scalar() call; it does nothing here.
I'd probably use something like
my @result;
while(my $elem = shift @leaderboard) {
    ...
    if ($w==$parentmass) {
        # do more stuff
        push @result, $elem;
    }
}
So instead of deleting from the original list, all elements would be taken off the original and only the successful (by whatever criterion) ones included in the result.
There seem to be two things going on here
You're removing all arrays from @leaderboard whose sumaarray is greater than $parentmass
You're putting in @Leader the array with the highest score of all the arrays in @leaderboard whose sumaarray is equal to $parentmass
I'm unclear whether that's correct. You don't seem to handle the case where sumaarray is less than $parentmass at all. But that can be written very simply by using grep together with the max_by function from the List::UtilsBy module
use List::UtilsBy 'max_by';
my $parentmass = $spect[-1];
my @leaderboard = grep { sumaarray(@$_) <= $parentmass } @arrarraa;
my $leader = max_by { score(@$_) }
             grep { sumaarray(@$_) == $parentmass }
             @leaderboard;
I'm sure this could be made a lot neater if I understood the intention of your algorithm; especially how those elements with a sumaarray of less than $parentmass
Given a string and an array of strings, find the longest suffix of the string that appears in the array.
For example:
string = google.com.tr
array = tr, nic.tr, gov.nic.tr, org.tr, com.tr
returns com.tr
I have tried to use binary search with specific comparator, but failed.
C-code would be welcome.
Edit:
I should have said that I'm looking for a solution where I can do as much work as possible in a preparation step (when I only have an array of suffixes, and I can sort it in every way possible, build any data structure around it, etc.), and then, for a given string, find its suffix in this array as fast as possible. I also know that I can build a trie out of this array, and that will probably give me the best performance possible, BUT I'm very lazy, and keeping a trie in raw C in a huge piece of tangled enterprise code is no fun at all. So some binsearch-like approach will be very welcome.
Assuming constant-time addressing of characters within strings, this problem is isomorphic to finding the longest prefix.
1. Let i = 0.
2. Let S = null.
3. Let c = prefix[i].
4. Remove every string a from A for which a[i] != c. If A is now empty, return S. Otherwise, replace S with a if a.Length == i + 1.
5. Increment i.
6. Go to step 3.
Is that what you're looking for?
Example:
prefix = rt.moc.elgoog
array = rt.moc, rt.org, rt.cin.vog, rt.cin, rt
Pass 0: prefix[0] is 'r' and array[j][0] == 'r' for all j so nothing is removed from the array. i + 1 -> 0 + 1 -> 1 is our target length, but none of the strings have a length of 1, so S remains null.
Pass 1: prefix[1] is 't' and array[j][1] == 't' for all j so nothing is removed from the array. However there is a string that has length 2, so S becomes rt.
Pass 2: prefix[2] is '.' and array[j][2] == '.' for the remaining strings so nothing changes.
Pass 3: prefix[3] is 'm' and array[j][3] != 'm' for rt.org, rt.cin.vog, and rt.cin so those strings are removed.
etc.
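The passes above can be sketched in C without actually reversing anything, by comparing characters from the end of each string. This is only a minimal sketch: the function name longest_suffix_filter, the alive[] bookkeeping, and the cap of 64 candidates are assumptions made for illustration, not part of the answer above.

```c
#include <stddef.h>
#include <string.h>

/* Returns the longest entry of suffixes[0..n-1] that is a suffix of s,
 * or NULL if none matches. Comparing from the end of each string is
 * equivalent to prefix matching on the reversed strings. */
const char *longest_suffix_filter(const char *s,
                                  const char *const *suffixes, size_t n)
{
    size_t slen = strlen(s);
    const char *best = NULL;      /* S in the steps above */
    unsigned char alive[64];      /* hypothetical cap on the candidate count */
    size_t remaining = n;

    if (n > sizeof alive)
        return NULL;
    memset(alive, 1, n);

    for (size_t i = 0; i < slen && remaining > 0; i++) {
        char c = s[slen - 1 - i]; /* i-th character of the reversed string */
        for (size_t j = 0; j < n; j++) {
            if (!alive[j])
                continue;
            size_t len = strlen(suffixes[j]);
            if (len <= i || suffixes[j][len - 1 - i] != c) {
                alive[j] = 0;     /* remove this candidate from A */
                remaining--;
            } else if (len == i + 1) {
                best = suffixes[j]; /* fully matched: replace S */
            }
        }
    }
    return best;
}
```

Called with "google.com.tr" and the five suffixes from the question, this should return "com.tr". The loop also stops once the end of the input string is reached, which the steps above leave implicit.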
Another naïve, pseudo-answer.
Set boolean "found" to false. While "found" is false, iterate over the array comparing the source string to the strings in the array. If there's a match, set "found" to true and break. If there's no match, use something like strchr() to get to the segment of the string following the first period. Iterate over the array again. Continue until there's a match, or until the last segment of the source string has been compared to all the strings in the array and failed to match.
Not very efficient....
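A rough C sketch of that walk follows. Note one substitution: instead of the linear scan over the array described above, it sorts the array once and uses bsearch() for each candidate, since the question asked for a binsearch-like lookup after a preparation step. The names prepare_suffixes, longest_suffix_walk and cmp_str are hypothetical, and the sketch assumes suffixes only need to match on '.' boundaries.

```c
#include <stdlib.h>
#include <string.h>

/* qsort/bsearch comparator for an array of C strings */
static int cmp_str(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Preparation step: sort the suffix table once. */
void prepare_suffixes(const char **suffixes, size_t n)
{
    qsort(suffixes, n, sizeof *suffixes, cmp_str);
}

/* Lookup: try the whole string, then the part after each '.', so the
 * first hit is the longest matching suffix. Returns a pointer into s,
 * or NULL if nothing matches. */
const char *longest_suffix_walk(const char *s,
                                const char *const *sorted, size_t n)
{
    while (s != NULL && *s != '\0') {
        if (bsearch(&s, sorted, n, sizeof *sorted, cmp_str) != NULL)
            return s;
        s = strchr(s, '.');   /* find the next label boundary */
        if (s != NULL)
            s++;              /* step past the '.' */
    }
    return NULL;
}
```

Each lookup then costs one binary search per label of the input string, rather than one full pass over the array per label.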
Naive, pseudo-answer:
Sort array of suffixes by length (yes, there may be strings of same length, which is a problem with the question you are asking I think)
Iterate over array and see if suffix is in given string
If it is, exit the loop because you are done! If not, continue.
Alternatively, you could skip the sorting and just iterate, assigning the biggestString if the currentString is bigger than the biggestString that has matched.
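The no-sorting variant might look roughly like this in C. It is a sketch only; ends_with and longest_suffix_scan are names invented here for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Nonzero if suffix is a suffix of s. */
static int ends_with(const char *s, const char *suffix)
{
    size_t slen = strlen(s), flen = strlen(suffix);
    return flen <= slen && memcmp(s + slen - flen, suffix, flen) == 0;
}

/* Single pass over the array, remembering the longest entry that
 * matches the end of s. Returns NULL if nothing matches. */
const char *longest_suffix_scan(const char *s,
                                const char *const *suffixes, size_t n)
{
    const char *best = NULL;
    size_t best_len = 0;

    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(suffixes[i]);
        /* only test entries that could beat the current best */
        if ((best == NULL || len > best_len) && ends_with(s, suffixes[i])) {
            best = suffixes[i];
            best_len = len;
        }
    }
    return best;
}
```

This does no preparation at all and costs one pass over the array per lookup, so it is probably only reasonable when the array is small.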
Edit 0:
Maybe you could improve this by looking at your array beforehand and considering "minimal" elements that need to be checked.
For instance, if .com appears in 20 members you could just check .com against the given string to potentially eliminate 20 candidates.
Edit 1:
On second thought, in order to compare elements in the array you will need to use a string comparison. My feeling is that any gain you get out of an attempt at optimizing the list of strings for comparison might be negated by the expense of comparing them before doing so, if that makes sense. Would appreciate if a CS type could correct me here...
If your array of strings is something along the following lines:
char string[STRINGS][MAX_STRING_LENGTH];
strcpy(string[0], "google.com.tr");
strcpy(string[1], "nic.tr");
etc., then you can simply do this (it needs <string.h>):
/* find the index of the longest string */
int x, longest = 0;
for (x = 1; x < STRINGS; x++) {
    if (strlen(string[x]) > strlen(string[longest])) {
        longest = x;
    }
}
/* advance to the first '.' in that string */
x = 0;
while (string[longest][x] != '\0' && string[longest][x] != '.') {
    x++;
}
/* copy everything after the '.' into output */
char output[MAX_STRING_LENGTH];
int y = 0;
if (string[longest][x] == '.') {
    x++;
}
while (string[longest][x] != '\0') {
    output[y++] = string[longest][x++];
}
output[y] = '\0';
(The above code may not actually work (errors, etc.), but you should get the general idea.)
Why don't you use suffix arrays? They work well when you have a large number of suffixes.
Complexity, O(n(logn)^2), there are O(nlogn) versions too.
Implementation in c here. You can also try googling suffix arrays.