How to keep track of printed items in a for loop? - arrays

I was recently dealing with a hash that I wanted to print in a nice manner.
To simplify, it is just an array with two fields, a['name']="me" and a['age']=77, and I want to print the data like key1=value1,key2=value2,... and end with a newline. That is:
name=me,age=77
Since it is not an array whose indices are autoincremented values, I do not know how to loop through them and know when I am processing the last one.
This matters because it lets me use a different separator for the last element: a newline is printed after it instead of the comma that is printed after the rest of the fields.
I ended up using a counter to compare to the length of the array:
awk 'BEGIN {a["name"]="me"; a["age"]=77;
n = length(a);
for (i in a) {
count++;
printf "%s=%s%s", i, a[i], (count<n?",":ORS)
}
}'
This works well. However, is there a better way to handle this? I don't like having to add an extra count++.

In general when you know the end point of the loop you put the OFS or ORS after each field:
for (i=1; i<=n; i++) {
printf "%s%s", $i, (i<n?OFS:ORS)
}
but if you don't then you put the OFS before the second and subsequent fields and print the ORS after the loop:
for (idx in array) {
printf "%s%s", (++i>1?OFS:""), array[idx]
}
print ""
I do like the:
n = length(array)
for (idx in array) {
printf "%s%s", array[idx], (++i<n?OFS:ORS)
}
idea to get the end of the loop too, but length(array) is gawk-specific and the resulting code isn't any more concise or efficient than the 2nd loop above:
$ cat tst.awk
BEGIN {
OFS = ","
array["name"] = "me"
array["age"] = 77
for (idx in array) {
printf "%s%s=%s", (++i>1?OFS:""), idx, array[idx]
}
print ""
}
vs
$ cat tst.awk
BEGIN {
OFS = ","
array["name"] = "me"
array["age"] = 77
n = length(array) # or non-gawk: for (idx in array) n++
for (idx in array) {
printf "%s=%s%s", idx, array[idx], (++i<n?OFS:ORS)
}
}
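For comparison, here is the same "separator before every item except the first" idea as a minimal Python sketch. The function name format_pairs and its parameters are made up for illustration; the point is only that the loop never needs to know which item is last.

```python
def format_pairs(mapping, sep=",", end="\n"):
    """Render key=value pairs joined by sep and terminated by end."""
    parts = []
    first = True
    for key, value in mapping.items():
        # Emit the separator BEFORE every item except the first,
        # so there is no need to detect the last item.
        if not first:
            parts.append(sep)
        parts.append(f"{key}={value}")
        first = False
    # The terminator goes after the loop, like the trailing print "" in awk.
    parts.append(end)
    return "".join(parts)


print(format_pairs({"name": "me", "age": 77}), end="")  # prints: name=me,age=77
```

Python dicts keep insertion order, which is why the output order is predictable here; an awk for (idx in array) loop gives no such guarantee.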

Related

Creating a unique array in awk: can this snippet be elaborated?

Thanks to @EdMorton, I can unique an array in awk this way:
BEGIN {
# create an array
# here, I create an array from a string, but other approaches are possible, too
split("a b c d e a b", array)
# unique it
for (i=1; i in array; i++) {
if ( !seen[array[i]]++ ) {
unique[++j] = array[i]
}
}
# print out the result
for (i=1; i in unique; i++) {
print unique[i]
}
# results in:
# a
# b
# c
# d
# e
}
What I don't understand, though, is this ( !seen[array[i]]++ ) condition with an increment:
I do understand that we collect unique indices in the seen array;
So, we check if our temp array seen already has an index array[i] (and add it to unique, if it hasn't);
But the increment after the index is the thing I still can't get :) (despite the detailed explanation provided by Ed).
So, my question is the following: can we somehow rewrite this conditional in a more elaborate way? Maybe this would really help to finalise my take on it :)
Hope this is clearer but idk - best I can say is it's more elaborate as requested!
$ cat tst.awk
BEGIN {
# create an array
# here, I create an array from a string, but other approaches are possible, too
split("a b c d e a b", array)
# unique it
for (i=1; i in array; i++) {
val = array[i]
count[val] = count[val] + 1
if ( count[val] == 1 ) {
is_first_time_val_seen = 1
}
else {
is_first_time_val_seen = 0
}
if ( is_first_time_val_seen ) {
unique[++j] = val
}
}
# print out the result
for (i=1; i in unique; i++) {
print unique[i]
}
}
$ awk -f tst.awk
a
b
c
d
e
Another approach is to put array's values into a new associative array as keys. That will enforce uniqueness:
BEGIN {
# it's helpful to use the return value from `split`
n = split("a b c d e a b", array)
# use the element value as a key.
# It doesn't really matter what the right-hand side of the assignment is.
for (i = 1; i <= n; i++) uniq[array[i]] = i
# now, it's easy to iterate over the unique keys
for (elem in uniq) print elem
}
outputs in no guaranteed order:
a
b
c
d
e
If you're using GNU awk, use PROCINFO["sorted_in"] to control the order of the array traversal.
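If it helps to see the !seen[x]++ idiom outside awk, here is a rough Python sketch that spells out each step. The name unique_in_order is made up for illustration; the comments map each line back to what the awk expression does implicitly.

```python
def unique_in_order(items):
    """Keep the first occurrence of each value, preserving order."""
    seen = {}    # value -> how many times it has been met so far
    unique = []
    for item in items:
        previous = seen.get(item, 0)  # awk: value of seen[item] BEFORE the ++
        seen[item] = previous + 1     # awk: the post-increment itself
        if previous == 0:             # awk: !previous, true only the first time
            unique.append(item)
    return unique


print(unique_in_order("a b c d e a b".split()))
```

The key point is that the post-increment yields the old count to the test and only then bumps it, so the condition is true exactly once per distinct value.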

awk array that overtypes itself when printed

this is my first question so please let me know if I miss anything.
This is an awk script that uses arrays to make key-value pairs.
I have a file that has header information separated by colons. The data is below it and is separated by colons as well. My goal is to make key-value pairs that print out to a new file. I have everything set up to be placed in arrays, and it prints out almost perfectly.
Here is the input:
...:iscsi_name:iscsi_alias:panel_name:enclosure_id:canister_id:enclosure_serial_number
...:iqn.1111-00.com.abc:2222.blah01.blah01node00::11BLAH00:::
Here is the code:
#!/bin/awk -f
BEGIN {
FS = ":"
}
{
x = 1
if (NR==1) {
num_fields = NF ###This is done in case there are uneven head fields to data fields###
while (x <= num_fields) {
head[x] = $x
x++
}
}
y = 2
while (y <= NR) {
if (NR==y) {
x = 1
while (x <= num_fields) {
data[x] = $x
x++
}
x = 1
while (x <= num_fields) {
print head[x]"="data[x]
x++
}
}
y++
}
}
END {
print "This is the end of the arrays and the beginning of the test"
print head[16]
print "I am head[16]-"head[16]"- and now I'm going to overwrite everything"
print "I am data[16]-"data[16]"- and I will not overwrite everything, also there isn't any data in data[16]"
}
Here is the output:
...
iscsi_name=iqn.1111-00.com.abc
iscsi_alias=2222.blah01.blah01node00
panel_name=
enclosure_id=11BLAH00
canister_id=
=nclosure_serial_number ### Here is my issue ###
This is the end of the arrays and the beginning of the test
enclosure_serial_number
- and now I'm going to overwrite everything
I am data[16]-- and I will not overwrite everything, also there isn't any data in data[16]
NOTE: data[16] is not at the end of a line, for some reason, there is an extra colon on the data lines, hence the num_fields note above
Why does head[16] overwrite itself? Is it that there is a newline (\n) at the end of the field? If so, how do I get rid of it? I have tried removing the last character, no luck. I have tried to limit the number of characters the array can take in on that field, no luck. I have tried many more ideas, no luck.
Full Disclosure: I am relatively new to all of this, I might have messed up these previous fixes!
Does anyone have any ideas as to why this is happening?
Thanks!
-cheezter88
Your script is unnecessarily complex. If you want to set the record size from the first row, do so directly.
(I replaced the "..." prefix with "x".)
awk -F: 'NR==1 {n=split($0,h); next} # populate header fields and record size
NR==2 {for(i=1;i<=n;i++) # do the assignment up to header size
print h[i]"="$i}' file
x=x
iscsi_name=iqn.1111-00.com.abc
iscsi_alias=2222.blah01.blah01node00
panel_name=
enclosure_id=11BLAH00
canister_id=
enclosure_serial_number=
If you want to do this for the rest of the records, remove the NR==2 condition.
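The answer above pairs each header name with the field at the same position, up to the header's size. A minimal Python sketch of that pairing follows; pair_fields is a made-up name. Stripping "\r\n" is an assumption on my part: output that overwrites the start of its own line is a typical symptom of a stray carriage return from DOS line endings, but nothing in the question confirms that this file has them.

```python
def pair_fields(header_line, data_line, sep=":"):
    """Pair header names with data fields, limited to the header's size."""
    # Assumption: drop any trailing "\r" and "\n" so the last field is clean.
    header = header_line.rstrip("\r\n").split(sep)
    data = data_line.rstrip("\r\n").split(sep)
    pairs = []
    for i, name in enumerate(header):
        # Extra data fields beyond the header are ignored;
        # missing ones are treated as empty.
        value = data[i] if i < len(data) else ""
        pairs.append(f"{name}={value}")
    return pairs


header = "x:iscsi_name:iscsi_alias:panel_name:enclosure_id:canister_id:enclosure_serial_number\r\n"
data = "x:iqn.1111-00.com.abc:2222.blah01.blah01node00::11BLAH00:::\r\n"
for line in pair_fields(header, data):
    print(line)
```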

How to increment array dynamically with 'awk'?

I have a file that contains a list of IPs:
1.1.1.1
2.2.2.2
3.3.3.3
5.5.5.5
1.1.1.1
5.5.5.5
I want to create a file that lists a counter for each of the above IPs, like:
1.1.1.1: 2
2.2.2.2: 1
3.3.3.3: 1
5.5.5.5: 2
Where 2,1,1,2 are counters.
I started to write a script that works for a fixed, known set of IPs, but I don't know how to continue:
./ff.sh file_with_IPs.txt
script
#!/bin/sh
file=$1
awk '
BEGIN {
for(x=0; x<4; ++x)
count[x] = 0;
ip[0] = "1.1.1.1";
ip[1] = "2.2.2.2";
ip[2] = "3.3.3.3";
ip[3] = "5.5.5.5";
}
{
if($1==ip[0]){
count[0] += 1;
} else if($1==ip[1]){
count[1] += 1;
}else if($1==ip[2]){
count[2] += 1;
}else if($1==ip[3]){
count[3] += 1;
}
}
END {
for(x=0; x<4; ++x) {
print ip[x] ": " count[x]
}
}
' $file > newfile.txt
The main problem is that I don't know how many IPs are stored in the file or what they look like.
So I need to grow the ip array each time awk encounters a new IP.
I think it is easier with sort and uniq -c (sort -u alone removes duplicates but does not count them), but awk can do it too:
awk '{a[$0]++; next}END {for (i in a) print i": "a[i]}' file_with_IPs.txt
Output:
1.1.1.1: 2
3.3.3.3: 1
5.5.5.5: 2
2.2.2.2: 1
(with a little help of this tutorial that sudo_O recommended me)
You can use uniq for that task, like:
sort IPFILE | uniq -c
(Note, that this prints the occurrences in front of the IP.)
Or with awk (if there are only IP addresses on the lines):
awk '{ips[$0]++} END { for (k in ips) { print k, ips[k] } }' IPFILE
(Note: this prints the IP addresses in no particular order. You can sort inside GNU awk, see the docs for asort and asorti, or simply pipe the output to sort.)
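The awk one-liners above boil down to "use the line itself as the key and bump its counter". Sketched in Python with collections.Counter, assuming one IP per line (count_lines and the sample list are made up for illustration):

```python
from collections import Counter


def count_lines(lines):
    """Count how often each non-empty line occurs."""
    counts = Counter()
    for line in lines:
        ip = line.strip()
        if ip:
            counts[ip] += 1  # new keys start at 0, like awk's ips[$0]++
    return counts


sample = ["1.1.1.1", "2.2.2.2", "3.3.3.3", "5.5.5.5", "1.1.1.1", "5.5.5.5"]
# sorted() here is a plain string sort, not a numeric IP sort.
for ip, n in sorted(count_lines(sample).items()):
    print(f"{ip}: {n}")
```

No list of known IPs is needed up front; a key is created the first time it is seen.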

Perl Modification of non creatable array value attempted, subscript -1

I have a Perl script that executes a recursive function. Within it, it compares two elements of a two-dimensional array.
I call the routine with a 2D array "@data" and "0" as a starting value. First I load the parameters into a separate 2D array "@test".
Then I want to see whether the array contains only one element, by comparing whether the last element equals the first. This is where the error occurs: Modification of non creatable array value attempted, subscript -1.
You tried to make an array value spring into existence, and the subscript was probably negative, even counting from end of the array backwards.
This didn't help me much...I'm pretty sure it has to do with the if-clause "$counter-1". But I don't know what, hope you guys can help me!
routine(@data,0);
sub routine {
my @test; #(2d-Array)
my $counter = $_[-1];
for(my $c=0; $_[$c] ne $_[-1]; $c++){
for (my $j=0; $j<13;$j++){ #Each element has 13 other elements
$test[$c][$j] = $_[$c][$j];
}
}
if ($test[$counter-1][1] eq $test[-1][1]){
$puffertime = $test[$counter][4];
}
else{
for (my $l=0; $l<=$counter;$l++){
$puffertime+= $test[$l][4]
}
}
}
#
#
#
if ($puffertime <90){
if($test[$counter][8]==0){
$counter++;
routine(@test,$counter);
}
else{ return (print"false");}
}
else{return (print "true");}
The weird thing is that I tried it this morning and it worked. After running for a short time, the error message came up again. It might be that I haven't caught some error condition that arises from the dynamic database entries.
Your routine() function would be easier to read if it starts off like this:
sub routine {
my @data = @_;
my $counter = pop(@data);
my @test;
for(my $c=0; $c <= $#data; $c++){
for (my $j=0; $j<13;$j++){ #Each element has 13 other elements
$test[$c][$j] = $data[$c][$j];
}
}
You can check whether @data has only one element with scalar(@data) == 1 or $#data == 0. From your code snippet, I do not see why you need to copy the data passed to routine() into @test. It seems superfluous: you can skip all this copying if you are not going to modify any of the data passed to your routine.
Your next code might look like this:
if ($#test == 0) {
$puffertime = $test[0][4];
} else {
for (my $l=0; $l <= $counter; $l++) {
$puffertime += $test[$l][4];
}
}
But if your global variable $puffertime was initialized to zero then you can replace this code with:
for (my $l=0; $l <= $counter; $l++) {
$puffertime += $test[$l][4];
}

Algorithm for joining e.g. an array of strings

I have wondered for some time, what a nice, clean solution for joining an array of strings might look like.
Example: I have ["Alpha", "Beta", "Gamma"] and want to join the strings into one, separated by commas – "Alpha, Beta, Gamma".
Now I know that most programming languages offer some kind of join method for this. I just wonder how these might be implemented.
When I took introductory courses, I often tried to go it alone, but never found a satisfactory algorithm. Everything seemed rather messy, the problem being that you can not just loop through the array, concatenating the strings, as you would add one too many commas (either before or after the last string).
I don’t want to check conditions in the loop. I don’t really want to add the first or the last string before/after the loop (I guess this is maybe the best way?).
Can someone show me an elegant solution? Or tell me exactly why there can’t be anything more elegant?
The most elegant solution I've found for problems like this is something like the following (in pseudocode):
separator = ""
foreach(item in stringCollection)
{
concatenatedString += separator + item
separator = ","
}
You just run the loop, and the separator is only set at the end of the first pass, so the first time around nothing gets prepended. It's not as clean as I'd like it to be, so I'd still add comments, but it's better than an if statement or adding the first or last item outside the loop.
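The pseudocode above translates almost line for line. A minimal Python sketch, with join_with_separator as a made-up name:

```python
def join_with_separator(items, sep=", "):
    """Join strings without any condition inside the loop."""
    result = ""
    current_sep = ""      # empty on the first pass, so nothing is prepended
    for item in items:
        result += current_sep + item
        current_sep = sep  # every later pass prepends the real separator
    return result


print(join_with_separator(["Alpha", "Beta", "Gamma"]))  # prints: Alpha, Beta, Gamma
```

An empty list falls out naturally as the empty string, with no special case.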
All of these solutions are decent ones, but for an underlying library, both independence of separator and decent speed are important. Here is a function that fits the requirement assuming the language has some form of string builder.
public static String join(String[] strings, String sep) {
if(strings.length == 0) return "";
if(strings.length == 1) return strings[0];
StringBuilder sb = new StringBuilder();
sb.append(strings[0]);
for(int i = 1; i < strings.length; i++) {
sb.append(sep);
sb.append(strings[i]);
}
return sb.toString();
}
EDIT: I suppose I should mention why this would be speedier. The main reason is that any time you write c = a + b; the underlying construct is usually c = (new StringBuilder()).append(a).append(b).toString();. By reusing the same string builder object, we reduce the number of allocations and the amount of garbage we produce.
And before someone chimes in that optimization is evil: we're talking about implementing a common library function. Acceptable, scalable performance is one of the requirements, then. A join that takes a long time is one that won't often be used.
Most languages nowadays (e.g. Perl, mentioned by Jon Ericson, PHP, JavaScript) have a join() function or method, and this is by far the most elegant solution. Less code is better code.
In response to Mendelt Siebenga, if you do require a hand-rolled solution, I'd go with the ternary operator for something like:
separator = ","
foreach (item in stringCollection)
{
concatenatedString += concatenatedString ? separator + item : item
}
I usually go with something like...
list = ["Alpha", "Beta", "Gamma"];
output = "";
separator = "";
for (int i = 0; i < list.length ; i++) {
output = output + separator;
output = output + list[i];
separator = ", ";
}
This works because on the first pass the separator is empty (so you don't get a comma at the start), but on every subsequent pass you add a comma before adding the next element.
You could certainly unroll this a little to make it a bit faster (assigning to the separator over and over isn't ideal), though I suspect that's something the compiler could do for you automatically.
In the end though, I suspect this is pretty much what most language-level join functions come down to. Nothing more than syntax sugar, but it sure is sweet.
For pure elegance, a typical recursive functional-language solution is quite nice. This isn't in an actual language syntax but you get the idea (it's also hardcoded to use comma separator):
join([]) = ""
join([x]) = "x"
join([x, rest]) = "x," + join(rest)
In reality you would write this in a more generic way, to reuse the same algorithm but abstract away the data type (doesn't have to be strings) and the operation (doesn't have to be concatenation with a comma in the middle). Then it usually gets called 'reduce', and many functional languages have this built in, e.g. multiplying all numbers in a list, in Lisp:
(reduce #'* '(1 2 3 4 5)) => 120
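To make the recursive definition and the reduce generalisation concrete, here is a small Python sketch. join_rec is a made-up name; functools.reduce and operator.mul are standard library.

```python
from functools import reduce
import operator


def join_rec(items, sep=","):
    """Recursive join mirroring the three cases of the definition above."""
    if not items:
        return ""                 # join([]) = ""
    if len(items) == 1:
        return items[0]           # join([x]) = x
    # join([x, rest...]) = x + sep + join(rest)
    return items[0] + sep + join_rec(items[1:], sep)


print(join_rec(["Alpha", "Beta", "Gamma"]))    # prints: Alpha,Beta,Gamma
# The same folding shape with a different operation, like the Lisp example:
print(reduce(operator.mul, [1, 2, 3, 4, 5]))   # prints: 120
```

The slicing copies the list on every call, so this is for illustration rather than speed.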
@Mendelt Siebenga
Strings are corner-stone objects in programming languages. Different languages implement strings differently. An implementation of join() strongly depends on underlying implementation of strings. Pseudocode doesn't reflect underlying implementation.
Consider join() in Python. It can be easily used:
print(", ".join(["Alpha", "Beta", "Gamma"]))
# Alpha, Beta, Gamma
It could be easily implemented as follows:
from functools import reduce  # reduce is not a builtin in Python 3
def join(seq, sep=" "):
    if not seq: return ""
    elif len(seq) == 1: return seq[0]
    return reduce(lambda x, y: x + sep + y, seq)
print(join(["Alpha", "Beta", "Gamma"], ", "))
# Alpha, Beta, Gamma
And here is how the join() method is implemented in C (taken from trunk, for the Python 2 string type):
PyDoc_STRVAR(join__doc__,
"S.join(sequence) -> string\n\
\n\
Return a string which is the concatenation of the strings in the\n\
sequence. The separator between elements is S.");
static PyObject *
string_join(PyStringObject *self, PyObject *orig)
{
char *sep = PyString_AS_STRING(self);
const Py_ssize_t seplen = PyString_GET_SIZE(self);
PyObject *res = NULL;
char *p;
Py_ssize_t seqlen = 0;
size_t sz = 0;
Py_ssize_t i;
PyObject *seq, *item;
seq = PySequence_Fast(orig, "");
if (seq == NULL) {
return NULL;
}
seqlen = PySequence_Size(seq);
if (seqlen == 0) {
Py_DECREF(seq);
return PyString_FromString("");
}
if (seqlen == 1) {
item = PySequence_Fast_GET_ITEM(seq, 0);
if (PyString_CheckExact(item) || PyUnicode_CheckExact(item)) {
Py_INCREF(item);
Py_DECREF(seq);
return item;
}
}
/* There are at least two things to join, or else we have a subclass
* of the builtin types in the sequence.
* Do a pre-pass to figure out the total amount of space we'll
* need (sz), see whether any argument is absurd, and defer to
* the Unicode join if appropriate.
*/
for (i = 0; i < seqlen; i++) {
const size_t old_sz = sz;
item = PySequence_Fast_GET_ITEM(seq, i);
if (!PyString_Check(item)){
#ifdef Py_USING_UNICODE
if (PyUnicode_Check(item)) {
/* Defer to Unicode join.
* CAUTION: There's no gurantee that the
* original sequence can be iterated over
* again, so we must pass seq here.
*/
PyObject *result;
result = PyUnicode_Join((PyObject *)self, seq);
Py_DECREF(seq);
return result;
}
#endif
PyErr_Format(PyExc_TypeError,
"sequence item %zd: expected string,"
" %.80s found",
i, Py_TYPE(item)->tp_name);
Py_DECREF(seq);
return NULL;
}
sz += PyString_GET_SIZE(item);
if (i != 0)
sz += seplen;
if (sz < old_sz || sz > PY_SSIZE_T_MAX) {
PyErr_SetString(PyExc_OverflowError,
"join() result is too long for a Python string");
Py_DECREF(seq);
return NULL;
}
}
/* Allocate result space. */
res = PyString_FromStringAndSize((char*)NULL, sz);
if (res == NULL) {
Py_DECREF(seq);
return NULL;
}
/* Catenate everything. */
p = PyString_AS_STRING(res);
for (i = 0; i < seqlen; ++i) {
size_t n;
item = PySequence_Fast_GET_ITEM(seq, i);
n = PyString_GET_SIZE(item);
Py_MEMCPY(p, PyString_AS_STRING(item), n);
p += n;
if (i < seqlen - 1) {
Py_MEMCPY(p, sep, seplen);
p += seplen;
}
}
Py_DECREF(seq);
return res;
}
Note that the "Catenate everything." code above is only a small part of the whole function.
In pseudocode:
/* Catenate everything. */
for each item in sequence
copy-assign item
if not last item
copy-assign separator
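The C function works in two passes: first add up the sizes, then allocate once and copy each piece into place. A rough Python sketch of the same idea, using UTF-8 bytes and a bytearray as a stand-in for the C char buffer. join_presized is a made-up name, and in real Python code you would simply call str.join.

```python
def join_presized(items, sep):
    """Two-pass join: compute the total size, allocate once, then copy."""
    chunks = [s.encode("utf-8") for s in items]
    sep_bytes = sep.encode("utf-8")
    if not chunks:
        return ""
    # Pass 1: total size = all items plus one separator between each pair.
    total = sum(len(c) for c in chunks) + len(sep_bytes) * (len(chunks) - 1)
    buf = bytearray(total)  # single allocation of the final size
    # Pass 2: copy every item, and the separator after all but the last.
    pos = 0
    last = len(chunks) - 1
    for i, chunk in enumerate(chunks):
        buf[pos:pos + len(chunk)] = chunk
        pos += len(chunk)
        if i < last:
            buf[pos:pos + len(sep_bytes)] = sep_bytes
            pos += len(sep_bytes)
    return buf.decode("utf-8")


print(join_presized(["Alpha", "Beta", "Gamma"], ", "))  # prints: Alpha, Beta, Gamma
```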
' Pseudo code Assume zero based
ResultString = InputArray[0]
n = 1
while n (is less than) Number_Of_Strings
ResultString (concatenate) ", "
ResultString (concatenate) InputArray[n]
n = n + 1
loop
In Perl, I just use the join command:
$ echo "Alpha
Beta
Gamma" | perl -e 'print(join(", ", map {chomp; $_} <> ))'
Alpha, Beta, Gamma
(The map stuff is mostly there to create a list.)
In languages that don't have a built in, like C, I use simple iteration (untested):
for (i = 0; i < N-1; i++){
strcat(s, a[i]);
strcat(s, ", ");
}
strcat(s, a[N-1]);
Of course, you'd need to check the size of s before you add more bytes to it.
You either have to special case the first entry or the last.
Collecting different language implementations?
Here is, for your amusement, a Smalltalk version:
join:collectionOfStrings separatedBy:sep
|buffer|
buffer := WriteStream on:''.
collectionOfStrings
do:[:each | buffer nextPutAll:each ]
separatedBy:[ buffer nextPutAll:sep ].
^ buffer contents.
Of course, the above code is already in the standard library found as:
Collection >> asStringWith:
so, using that, you'd write:
#('A' 'B' 'C') asStringWith:','
But here's my main point:
I would like to put more emphasis on the fact that using a StringBuilder (or what is called "WriteStream" in Smalltalk) is highly recommended. Do not concatenate strings using "+" in a loop: the result will be many, many intermediate throw-away strings. If you have a good garbage collector, that's fine. But some are not, and a lot of memory needs to be reclaimed. StringBuilder (and WriteStream, which is its great-grandfather) uses a buffer-doubling or even adaptive growing algorithm, which needs MUCH less scratch memory.
However, if it's only a few small strings you are concatenating, don't bother, and just "+" them; the extra work of using a StringBuilder might actually be counter-productive, up to an implementation- and language-dependent number of strings.
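As a sketch of the buffer approach in Python, io.StringIO can play the role of the StringBuilder or WriteStream. That mapping is my own analogy, not something the Smalltalk code prescribes, and join_buffered is a made-up name. The loop mirrors do:separatedBy: by writing the separator between elements only.

```python
import io


def join_buffered(items, sep):
    """Join strings by writing into one growing buffer."""
    buffer = io.StringIO()
    first = True
    for item in items:
        if not first:
            buffer.write(sep)  # runs between elements, never before the first
        buffer.write(item)
        first = False
    return buffer.getvalue()


print(join_buffered(["A", "B", "C"], ","))  # prints: A,B,C
```

For a handful of short strings the difference is unlikely to matter, which matches the caveat above.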
The following is no longer language-agnostic (but that doesn't matter for the discussion because the implementation is easily portable to other languages). I tried to implement Luke's (theoretically best) solution in an imperative programming language. Take your pick; mine's C#. Not very elegant at all. However, (without any testing whatsoever) I could imagine that its performance is quite decent because the recursion is in fact tail recursive.
My challenge: give a better recursive implementation (in an imperative language). You say what “better” means: less code, faster, I'm open for suggestions.
private static StringBuilder RecJoin(IEnumerator<string> xs, string sep, StringBuilder result) {
result.Append(xs.Current);
if (xs.MoveNext()) {
result.Append(sep);
return RecJoin(xs, sep, result);
} else
return result;
}
public static string Join(this IEnumerable<string> xs, string separator) {
var i = xs.GetEnumerator();
if (!i.MoveNext())
return string.Empty;
else
return RecJoin(i, separator, new StringBuilder()).ToString();
}
join() function in Ruby:
def join(seq, sep)
seq.inject { |total, item| total << sep << item } or ""
end
join(["a", "b", "c"], ", ")
# => "a, b, c"
join() in Perl:
use feature 'say';
use List::Util qw(reduce);
sub mjoin($@) {my $sep = shift; reduce {$a.$sep.$b} @_ or ''}
say mjoin(', ', qw(Alpha Beta Gamma));
# Alpha, Beta, Gamma
Or without reduce:
sub mjoin($@)
{
my ($sep, $sum) = (shift, shift);
$sum .= $sep.$_ for (@_);
$sum or ''
}
Perl 6
sub join( $separator, @strings ){
my $return = shift @strings;
for @strings -> ( $string ){
$return ~= $separator ~ $string;
}
return $return;
}
Yes I know it is pointless because Perl 6 already has a join function.
I wrote a recursive version of the solution in Lisp. If the length of the list is greater than 2, it splits the list in half as best it can and then merges the joined sublists.
(defun concatenate-string(list)
(cond ((= (length list) 1) (car list))
((= (length list) 2) (concatenate 'string (first list) "," (second list)))
(t (let ((mid-point (floor (/ (- (length list) 1) 2))))
(concatenate 'string
(concatenate-string (subseq list 0 mid-point))
","
(concatenate-string (subseq list mid-point (length list))))))))
(concatenate-string '("a" "b"))
I tried applying the divide and conquer strategy to the problem, but I guess that does not give a better result than plain iteration. Please let me know if this could have been done better.
I have also performed an analysis of the recursion obtained by the algorithm, it is available here.
Use the String.join method in C#
http://msdn.microsoft.com/en-us/library/57a79xd0.aspx
In Java 5, with unit test:
import junit.framework.Assert;
import org.junit.Test;
public class StringUtil
{
public static String join(String delim, String... strings)
{
StringBuilder builder = new StringBuilder();
if (strings != null)
{
for (String str : strings)
{
if (builder.length() > 0)
{
builder.append(delim);
}
builder.append(str);
}
}
return builder.toString();
}
@Test
public void joinTest()
{
Assert.assertEquals("", StringUtil.join(", ", null));
Assert.assertEquals("", StringUtil.join(", ", ""));
Assert.assertEquals("", StringUtil.join(", ", new String[0]));
Assert.assertEquals("test", StringUtil.join(", ", "test"));
Assert.assertEquals("foo, bar", StringUtil.join(", ", "foo", "bar"));
Assert.assertEquals("foo, bar, baz", StringUtil.join(", ", "foo", "bar", "baz"));
}
}
