awk: make it less system dependent - c

If I'm not mistaken, awk parses a number depending on the OS language (e.g., echo "1,2" | awk '{printf("%f\n",$1)}' would be interpreted as 1 on an English system and as 1.2 on a system where a comma separates the integer part from the decimal part).
I don't know if the C printf does this too, so I added the C tag.
I would like to modify the previous command so that it returns the same value (1.2) regardless of the system being used.

Welcome to the ugliness of locale. To fix your problem, first set the locale to the C one.
export LC_NUMERIC=C
echo "1,2" | awk '...your code...'
To turn off other locale-dependent tomfoolery, you can
export LC_ALL=C

If you're using gawk, you can use the --use-lc-numeric option.
$ LC_NUMERIC=de_DE.UTF-8 awk 'BEGIN {printf("%f\n", "1,2")}'
1.000000
$ LC_NUMERIC=de_DE.UTF-8 awk --use-lc-numeric 'BEGIN {printf("%f\n", "1,2")}'
1,200000
From the GAWK manual:
The POSIX standard says that awk always uses the period as the decimal
point when reading the awk program source code, and for command-line
variable assignments (see Other Arguments). However, when interpreting
input data, for print and printf output, and for number to string
conversion, the local decimal point character is used. Here are some
examples indicating the difference in behavior, on a GNU/Linux system:
$ gawk 'BEGIN { printf "%g\n", 3.1415927 }'
-| 3.14159
$ LC_ALL=en_DK gawk 'BEGIN { printf "%g\n", 3.1415927 }'
-| 3,14159
$ echo 4,321 | gawk '{ print $1 + 1 }'
-| 5
$ echo 4,321 | LC_ALL=en_DK gawk '{ print $1 + 1 }'
-| 5,321
The ‘en_DK’ locale is for English in Denmark, where the comma acts as
the decimal point separator. In the normal "C" locale, gawk treats
‘4,321’ as ‘4’, while in the Danish locale, it's treated as the full
number, 4.321.
Some earlier versions of gawk fully complied with this aspect of the
standard. However, many users in non-English locales complained about
this behavior, since their data used a period as the decimal point, so
the default behavior was restored to use a period as the decimal point
character. You can use the --use-lc-numeric option (see Options) to
force gawk to use the locale's decimal point character. (gawk also
uses the locale's decimal point character when in POSIX mode, either
via --posix, or the POSIXLY_CORRECT environment variable.)
I get similar behavior from /usr/bin/printf
$ LC_NUMERIC=de_DE.UTF-8 /usr/bin/printf "%f\n" "1,2"
/usr/bin/printf: 1,2: value not completely converted
1,000000
$ LC_NUMERIC=de_DE.UTF-8 /usr/bin/printf "%f\n" "1.2"
1,200000
But without the ability to override it.
If your intent is to do the opposite, that is, to take "European" input and
output "US" numbers, you're going to need to use something more robust, possibly
Python or Perl with their locale modules.
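For simple input, a lighter sketch is possible in awk itself: pin the locale to C and turn the comma into a period before the value is used as a number. The helper name normalize_decimal is hypothetical, and the sketch assumes the first field never contains thousands separators (a value such as 1.234,5 would be misread).

```sh
# normalize_decimal: hypothetical helper, reads one number per line on stdin.
# LC_ALL=C makes awk use the period as the decimal point on every system;
# gsub() rewrites a decimal comma to a period before printf converts it.
normalize_decimal() {
    LC_ALL=C awk '{ gsub(/,/, ".", $1); printf("%f\n", $1) }'
}

echo "1,2" | normalize_decimal
# prints: 1.200000
```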

Related

BASH: Parsing CSV into array using IFS last array element missing when is an empty string [duplicate]

To parse colon-delimited fields I can use read with a custom IFS:
$ echo 'foo.c:41:switch (color) {' | { IFS=: read file line text && echo "$file | $line | $text"; }
foo.c | 41 | switch (color) {
If the last field contains colons, no problem, the colons are retained.
$ echo 'foo.c:42:case RED: //alert' | { IFS=: read file line text && echo "$file | $line | $text"; }
foo.c | 42 | case RED: //alert
A trailing delimiter is also retained...
$ echo 'foo.c:42:case RED: //alert:' | { IFS=: read file line text && echo "$file | $line | $text"; }
foo.c | 42 | case RED: //alert:
...Unless it's the only extra delimiter. Then it's stripped. Wait, what?
$ echo 'foo.c:42:case RED:' | { IFS=: read file line text && echo "$file | $line | $text"; }
foo.c | 42 | case RED
Bash, ksh93, and dash all do this, so I'm guessing it is POSIX standard behavior.
Why does it happen?
What's the best alternative?
I want to parse the strings above into three variables and I don't want to mangle any text in the third field. I had thought read was the way to go but now I'm reconsidering.
Yes, that's standard behaviour (see the read specification and Field Splitting). A few shells (ash-based including dash, pdksh-based, zsh, and yash at least) used not to do it, but except for zsh (when not in POSIX mode) and busybox sh, most of them have been updated for POSIX compliance.
That's the same for:
$ var='a:b:c:' IFS=:
$ set -- $var; echo "$#"
3
(see how the POSIX specification for read actually defers to the Field Splitting mechanism where a:b:c: is split into 3 fields, and so with IFS=: read -r a b c, there are as many fields as variables).
The rationale is that in ksh (on which the POSIX spec is based), $IFS (initially, in the Bourne shell, the internal field separator) became a field delimiter, I think so that any list of elements (not containing the delimiter) could be represented.
When $IFS is a separator, one can't represent a list of one empty element ("" is split into a list of zero elements, ":" into a list of two empty elements¹). When it's a delimiter, you can express a list of zero elements with "", one empty element with ":", or two empty elements with "::".
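The delimiter semantics can be checked with a small sketch. The helper count_fields is a hypothetical name; it splits its argument on : in a subshell, so the caller's IFS and globbing settings are untouched. The counts in the comments are what a POSIX-compliant shell such as bash should report, per the description above.

```sh
# count_fields STRING: hypothetical helper that prints how many fields
# STRING yields when split on ":" by ordinary field splitting.
count_fields() {
    (
        IFS=:
        set -o noglob    # keep characters like * from being expanded
        set -- $1        # unquoted on purpose: this is where splitting happens
        echo "$#"
    )
}

count_fields ''          # 0: an empty list
count_fields ':'         # 1: one empty element
count_fields '::'        # 2: two empty elements
count_fields 'a:b:c:'    # 3: the trailing ":" ends the third field
```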
It's a bit unfortunate as one of the most common usages of $IFS is to split $PATH. And a $PATH like /bin:/usr/bin: is meant to be split into "/bin", "/usr/bin", "", not just "/bin" and "/usr/bin".
Now, with POSIX shells (but not all shells are compliant in that regard), for word splitting upon parameter expansion, that can be worked around with:
IFS=:; set -o noglob
for dir in $PATH""; do
something with "${dir:-.}"
done
That trailing "" makes sure that if $PATH ends in a trailing :, an extra empty element is added. And also that an empty $PATH is treated as one empty element as it should be.
That approach can't be used for read though.
Short of switching to zsh, there's no easy workaround other than inserting an extra : and removing it afterwards, like:
echo a:b:c: | sed 's/:/::/2' | { IFS=: read -r x y z; z=${z#:}; echo "$z"; }
Or (less portable):
echo a:b:c: | paste -d: - /dev/null | { IFS=: read -r x y z; z=${z%:}; echo "$z"; }
I've also added the -r which you generally want when using read.
Most likely here you'd want to use a proper text processing utility like sed/awk/perl instead of writing convoluted and probably inefficient code around read which has not been designed for that.
¹ Though in the Bourne shell, that was still split into zero elements as there was no distinction between IFS-whitespace and IFS-non-whitespace characters there, something that was also added by ksh
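For the three-field case in the question, one alternative that avoids read entirely is plain POSIX parameter expansion. This is only a sketch: split3 and the variable names file, lineno and text are hypothetical, and it assumes the first two fields never contain a colon.

```sh
# split3 LINE: hypothetical helper that sets file, lineno and text from a
# "file:line:text" string, keeping every character after the second colon.
split3() {
    file=${1%%:*}        # everything before the first colon
    rest=${1#*:}         # everything after the first colon
    lineno=${rest%%:*}   # everything before the next colon
    text=${rest#*:}      # the remainder, trailing colon included
}

split3 'foo.c:42:case RED:'
echo "$file | $lineno | $text"
# prints: foo.c | 42 | case RED:
```

Because no field splitting is involved, a lone trailing delimiter in the third field is not stripped.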
One "feature" of read is that it will strip leading and trailing whitespace separators in the variables it populates - it is explained in much more detail at the linked answer. This enables beginners to have read do what they expect when doing for example read first rest <<< ' foo bar ' (note the extra spaces).
The take-away? It is hard to do accurate text processing using Bash and shell tools. If you want full control it's probably better to use a "stricter" language like for example Python, where split() will do what you want, but where you might have to dig much deeper into string handling to explicitly remove newline separators or handle encoding.

Bash array indirection in a function [duplicate]

Bash script to create multiple arrays from csv with unknown columns.
I am trying to write a script to compare two csv files with similar columns. I need it to locate the matching column from the other csv and compare any differences. The kicker is I would like the script to be dynamic to allow any number of columns to be entered and it still be able to function. I thought I had a good plan to solve this but turns out I'm running into syntax errors. Here is a sample of a csv I need to compare.
IP address, Notes, Nmap-SSH, Nmap-SMTP, Nmap-HTTP, Nmap-HTTPS,
10.0.0.1, , open, closed, open, open,
10.0.0.2, , closed, open, closed, closed,
When I read the csv file I was planning to look for "IF column == open; then; populate this column's array with the IP address" This would have given me 4 lists in this scenario with the IPs that were listening on said port. I could then compare that to my security device configuration to make sure it was configured properly. Finally to the meat, here is what I thought would accomplish creating the arrays for me to search later. However I ran into a snag when I tried to use a variable inside an array name. Can my syntax be corrected or is there just a better way to do this sort of thing?
#!/bin/bash
#
#
# This script compares config_cleaned_<ip>.txt output against ext_web_env.csv and outputs the differences
#
#
# Read from ext_web_env.csv file and create Array
#
FILENAME=./tmp/ext_web_env.csv
#
index=0
#
while read line
do
# How many columns are in the .csv?
varEnvCol=$(echo $line | awk -F, '{print NF}')
echo "columns = $varEnvCol"
# While loop to create array for each column
while [ $varEnvCol != 2 ]
do
# Checks to see if port is open; if so then add IP address to array
varPortCon=$(echo $line | awk -F, -v i=$varEnvCol '{print $i}')
if [ $varPortCon = "open" ]
then
arr$varEnvCol[$index]="$(echo $line | awk -F, '{print $1}')"
# I get this error message "line29 : arr8[194]=10.0.0.194: command not found"
fi
echo "arrEnv$varEnvCol is: ${arr$varEnvCol[#]}"
# Another error but not as important since I am using this to debug "line31: arr$varEnvCol is: ${arr$varEnvCol[#]}: bad substitution"
varEnvCol=$(($varEnvCol - 1))
done
index=$(($index + 1 ))
done < $FILENAME
UPDATE
I also tried using the eval command since all the data will be populated by other scripts.
but am getting this error message:
./compare.sh: line 41: arr8[83]=10.0.0.83: command not found
Here is my new code for this example:
if [[ $varPortCon = *'open'* ]]
then
eval arr\$varEnvCol[$index]=$(echo $line | awk -F, '{print $1}')
fi
arr$varEnvCol[$index]="$(...)"
doesn't work the way you expect it to - you cannot assign to shell variables indirectly - via an expression that expands to the variable name - this way.
Your attempted workaround with eval is also flawed - see below.
tl;dr
If you use bash 4.3 or above:
declare -n targetArray="arr$varEnvCol"
targetArray[index]=$(echo $line | awk -F, '{print $1}')
bash 4.2 or earlier:
declare "arr$varEnvCol"[index]="$(echo $line | awk -F, '{print $1}')"
Caveat: This will work in your particular situation, but may fail subtly in others; read on for details, including a more robust, but cumbersome alternative based on read.
The eval-based solution mentioned by @shellter in a since-deleted comment is problematic not only for security reasons (as they mentioned), but also because it can get quite tricky with respect to quoting; for completeness, here's the eval-based solution:
eval "arr$varEnvCol[index]"='$(echo $line | awk -F, '\''{print $1}'\'')'
See below for an explanation.
Assign to a bash array variable indirectly:
bash 4.3+: use declare -n to effectively create an alias ('nameref') of another variable
This is by far the best option, if available:
declare -n targetArray="arr$varEnvCol"
targetArray[index]=$(echo $line | awk -F, '{print $1}')
declare -n effectively allows you to refer to a variable by another name (whether that variable is an array or not), and the name to create an alias for can be the result of an expression (an expanded string), as demonstrated.
bash 4.2-: there are several options, each with tradeoffs
NOTE: With non-array variables, the best approach is to use printf -v. Since this question is about array variables, this approach is not discussed further.
[most robust, but cumbersome]: use read:
IFS=$'\n' read -r -d '' "arr$varEnvCol"[index] <<<"$(echo $line | awk -F, '{print $1}')"
IFS=$'\n' ensures that leading and trailing whitespace in each input line is left intact.
-r prevents interpretation of \ chars. in the input.
-d '' ensures that ALL input is captured, even multi-line.
Note, however, that any trailing \n chars. are stripped.
If you're only interested in the first line of input, omit -d ''
"arr$varEnvCol"[index] expands to the variable - array element, in this case - to assign to; note that referring to variable index inside an array subscript does NOT need the $ prefix, because subscripts are evaluated in arithmetic context, where the prefix is optional.
<<< - a so-called here-string - sends its argument to stdin, where read takes its input from.
[simplest, but may break]: use declare:
declare "arr$varEnvCol"[index]="$(echo $line | awk -F, '{print $1}')"
(This is slightly counter-intuitive, in that declare is meant to declare, not modify a variable, but it works in bash 3.x and 4.x, with the constraints noted below.)
Works fine OUTSIDE a FUNCTION - whether the array was explicitly declared with declare or not.
Caveat: INSIDE a function, only works with LOCAL variables - you cannot reference shell-global variables (variables declared outside the function) from inside a function that way. Attempting to do so invariably creates a LOCAL variable ECLIPSING the shell-global variable.
[insecure and tricky]: use eval:
eval "arr$varEnvCol[index]"='$(echo $line | awk -F, '\''{print $1}'\'')'
CAVEAT: Only use eval if you fully control the contents of the string being evaluated; eval will execute any command contained in a string, with potentially unwanted results.
Understanding what variable references/command substitutions get expanded when is nontrivial - the safest approach is to delay expansion so that they happen when eval executes rather than immediate expansion that happens when arguments are passed to eval.
For a variable assignment statement to succeed, the RHS (right-hand side) must eventually evaluate to a single token - either unquoted without whitespace or quoted (optionally with whitespace).
The above example uses single quotes to delay expansion; thus, the string passed mustn't contain single quotes directly and thus is broken into multiple parts with literal ' chars. spliced in as \'.
Also note that the LHS (left-hand side) of the assignment statement passed to eval must be a double-quoted string - using an unquoted string with selective quoting of $ won't work, curiously:
OK: eval "arr$varEnvCol[index]"=...
FAILS: eval arr\$varEnvCol[index]=...
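To tie the nameref option back to the question's title, here is a minimal sketch of using declare -n inside a function. It assumes bash 4.3 or later; append_to_column and the sample values are hypothetical, and only the arr<N> naming scheme comes from the question.

```bash
#!/usr/bin/env bash
# Requires bash 4.3+ (declare -n).

# append_to_column COLUMN VALUE: hypothetical helper that appends VALUE
# to the global array named arr<COLUMN>, e.g. arr8.
append_to_column() {
    local column=$1 value=$2
    declare -n target="arr$column"   # target is a local alias for arr<COLUMN>
    target+=("$value")               # the assignment lands in the aliased array
}

append_to_column 8 10.0.0.1
append_to_column 8 10.0.0.194
append_to_column 5 10.0.0.2

echo "arr8 has ${#arr8[@]} entries: ${arr8[*]}"
# prints: arr8 has 2 entries: 10.0.0.1 10.0.0.194
```

Only the nameref itself is local to the function; the array it points to stays visible to the caller, which sidesteps the local-variable caveat noted for declare above.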

Multiple grep keywords on same line?

I'm using the command grep 3 times on the same line like this
ls -1F ./ | grep / | grep -v 0_*.* | grep -v undesired_result
is there a way to combine them into one command instead of having it to pipe it 3 times?
There's no way to do both a positive search (grep <something>) and a negative search (grep -v <something>) in one command line, but if your grep supports -E (alternatively, egrep), you could do ls -1F ./ | grep / | grep -E -v '0_*.*|undesired_result' to reduce the sub-process count by one. To go beyond that, you'd have to come up with a specific regular expression that matches either exactly what you want or everything you don't want.
Actually, I guess that first sentence isn't entirely true if you have egrep, but building the proper regular expression that correctly includes both the positive and negative parts and covers all possible orderings of the parts might be more frustrating than it's worth...
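If a tool other than grep is acceptable, awk can combine the positive and negative tests in a single process. This is a sketch under assumptions: filter_dirs is a hypothetical name, and it guesses that the intent of 0_*.* was "names starting with 0_", which the original pattern does not actually express as a regular expression.

```sh
# filter_dirs: hypothetical helper that keeps lines ending in "/" (the
# directory marker from ls -F) unless they start with "0_" or contain
# "undesired_result".
filter_dirs() {
    awk '/\/$/ && !/^0_/ && !/undesired_result/'
}

printf '%s\n' 'src/' '0_old/' 'undesired_result/' 'notes.txt' | filter_dirs
# prints: src/
```

With real data the call would be ls -1F ./ | filter_dirs.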

How to 'cut' on null?

Unix 'file' command has a -0 option to output a null character after a filename. This is supposedly good for using with 'cut'.
From man file:
-0, --print0
Output a null character ‘\0’ after the end of the filename. Nice
to cut(1) the output. This does not affect the separator which is
still printed.
(Note, on my Linux, the '-F' separator is NOT printed - which makes more sense to me.)
How can you use 'cut' to extract a filename from output of 'file'?
This is what I want to do:
find . "*" -type f | file -n0iNf - | cut -d<null> -f1
where <null> is the NUL character.
Well, that is what I am trying to do, what I want to do is get all file names from a directory tree that have a particular MIME type. I use a grep (not shown).
I want to handle all legal file names and not get stuck on file names with colons, for example, in their name. Hence, NUL would be excellent.
I guess non-cut solutions are fine too, but I hate to give up on a simple idea.
Just specify an empty delimiter:
cut -d '' -f1
(N.B.: The space between the -d and the '' is important, so that the -d and the empty string get passed as separate arguments; if you write -d'', then that will get passed as just -d, and then cut will think you're trying to use -f1 as the delimiter, which it will complain about, with an error message that "the delimiter must be a single character".)
This works with gnu awk.
awk 'BEGIN{FS="\x00"}{print$1}'
ruakh's helpful answer works well on Linux.
On macOS, the cut utility doesn't accept '' as a delimiter argument (it fails with a "bad delimiter" error).
Here is a portable workaround that works on both platforms, via the tr utility; it only makes one assumption:
The input mustn't contain \1 control characters (START OF HEADING, U+0001) - which is unlikely in text.
You can substitute any character known not to occur in the input for \1; if it's a character that can be represented verbatim in a string, that simplifies the solution because you won't need the aux. command substitution ($(...)) with a printf call for the -d argument.
If your shell supports so-called ANSI C-quoted strings - which is true of bash, zsh and ksh - you can replace "$(printf '\1')" with $'\1'
(The following uses a simpler input command to demonstrate the technique).
# In zsh, bash, ksh you can simplify "$(printf '\1')" to $'\1'
$ printf '[first field 1]\0[rest 1]\n[first field 2]\0[rest 2]' |
tr '\0' '\1' | cut -d "$(printf '\1')" -f 1
[first field 1]
[first field 2]
Alternatives to using cut:
C. Paul Bond's helpful answer shows a portable awk solution.

Eval madness in ksh

Man I hate eval...
I'm stuck with this ksh, and it has to be this way.
There's this function I need, which will receive a variable name and a value. Will do some things to the contents of that variable and the value and then would have to update the variable that was received. Sort of:
REPORT="a text where TADA is wrong.."
setOutputReport REPORT "this"
echo $REPORT
a text where this is wrong..
Where the function would be something like
function setOutputReport {
eval local currentReport=\$$1
local reportVar=$1
local varValue=$2
newReport=$(echo "$currentReport"|sed -e 's/TADA/$varValue')
# here be dragons
eval "$reportVar=\"$newReport\""
}
I've had this headache before and never managed to get this eval right at first. Importantly, the REPORT var may contain multiple lines (\n's). This might matter, as one of my attempts managed to correctly replace the contents of the variable, but with the first line only :/
thanks.
One risk, not with eval but with the "varValue" as the replacement in the sed command: if varValue contains a slash, the sed command will break
local varValue=$(printf "%s\n" "$2" | sed 's:/:\\/:g')
local newReport=$(echo "$currentReport"|sed -e "s/TADA/$varValue/")
If your printf has the %q specifier, that will add a layer of security -- %q escapes things like quotes, backticks and dollar signs, and also escaped chars like newline and tab:
eval "$(printf "%s=%q" "$reportVar" "$newReport")"
Here's an example of what %q does (this is bash, I hope your version of ksh corresponds):
$ y='a `string` "with $quotes"
and multiple
lines'
$ printf "%s=%q\n" x "$y"
x=$'a `string` "with $quotes"\nand multiple\nlines'
$ eval "$(printf "%s=%q" x "$y")"
$ echo "$x"
a `string` "with $quotes"
and multiple
lines
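If %q is not available, another common pattern is to let eval expand the value itself by escaping the $ in the assignment. The following is a sketch, not a drop-in replacement: it keeps the question's function name, but the _sor_ variables are hypothetical, the replacement value is assumed to be a single line, and it uses prefixed globals rather than local so that it stays within POSIX sh syntax.

```sh
# setOutputReport VARNAME VALUE: replace the first TADA on each line of
# the variable named VARNAME with VALUE (assumed to be a single line).
setOutputReport() {
    _sor_var=$1
    # Refuse anything that is not a plain variable name before using eval.
    case $_sor_var in
        '' | [!A-Za-z_]* | *[!A-Za-z0-9_]*)
            echo "setOutputReport: invalid variable name" >&2
            return 1 ;;
    esac
    # Escape \, / and & so that sed takes VALUE literally.
    _sor_value=$(printf '%s\n' "$2" | sed 's:[\\/&]:\\&:g')
    eval "_sor_current=\$$_sor_var"
    _sor_new=$(printf '%s\n' "$_sor_current" | sed "s/TADA/$_sor_value/")
    # The escaped \$ defers expansion to eval, so newlines and quotes in
    # the new text are assigned verbatim.
    eval "$_sor_var=\$_sor_new"
}

REPORT='a text where TADA is wrong..
second line'
setOutputReport REPORT 'this'
printf '%s\n' "$REPORT"
# prints:
# a text where this is wrong..
# second line
```

Since the string passed to eval only ever contains a validated variable name and a fixed reference to _sor_new, the report text is never parsed as shell code.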
