Convert messages to arrays using jq

My messages generator outputs:
$ ./messages.sh
{"a":"v1"}
{"b":"v2"}
{"c":"v3"}
...
Required output:
$ ./messages.sh | jq xxxxxx
[{"a":"v1"},{"b":"v2"}]
[{"c":"v3"},{"d":"v4"}]
...

Take the first item using ., and the second using input (wrapped in try to handle the case where there aren't enough input items). Then wrap them both in array brackets, and pass the -c option for compact output. jq works through its whole input this way, two items at a time.
./messages.sh | jq -c '[., try input]'
[{"a":"v1"},{"b":"v2"}]
[{"c":"v3"},{"d":"v4"}]
What if I want more objects in the array than 2? For example, 3, 10, 100?
You can surround the array body with limit, and use inputs instead (note the s) to fetch more than just one item:
./messages.sh | jq -c '[limit(3; ., try inputs)]'
[{"a":"v1"},{"b":"v2"},{"c":"v3"}]
[{"d":"v4"}]

Use --slurp with _nwise(2) to chunk the input into arrays of 2:
jq --slurp --compact-output '_nwise(2)' <<< "$(./messages.sh)"
[{"a":"v1"},{"b":"v2"}]
[{"c":"v3"},{"d":"v4"}]
The --compact-output option prints each array on a single line.

Here is a stream-oriented, generic and portable def of nwise:
# Group the items in the given stream into arrays of length $n,
# assuming $n is a positive integer.
# Input: a stream
# Output: a stream of arrays, none longer than $n,
# such that [stream] == ([ nwise(stream; $n) ] | add)
# Notice that there is no assumption about an end-of-stream marker.
def nwise(stream; $n):
  foreach ((stream | [.]), null) as $x ([];
    if $x == null then .
    elif length == $n then $x
    else . + $x end;
    if $x == null and length > 0 then .
    elif length == $n then .
    else empty
    end);
For the task at hand, you could use it like so:
nwise(inputs; 2)
with the -n command-line option.
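For example, with the messages generator above, the def can be inlined directly (a sketch reusing the definition exactly as given):
./messages.sh | jq -nc '
  def nwise(stream; $n):
    foreach ((stream | [.]), null) as $x ([];
      if $x == null then .
      elif length == $n then $x
      else . + $x end;
      if $x == null and length > 0 then .
      elif length == $n then .
      else empty
      end);
  nwise(inputs; 2)'
[{"a":"v1"},{"b":"v2"}]
[{"c":"v3"},{"d":"v4"}]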

Related

How does bash array slicing work when start index is not provided?

I'm looking at a script, and I'm having trouble determining what is going on.
Here is an example:
# Command to get the last 4 occurrences of a pattern in a file
lsCommand="ls /my/directory | grep -i my_pattern | tail -4"
# Load the results of that command into an array
dirArray=($(echo $(eval $lsCommand) | tr ' ' '\n'))
# What is this doing?
yesterdaysFileArray=($(echo ${x[@]::$((${#x[@]} / 2))} | tr ' ' '\n'))
There is a lot going on here. I understand how arrays work, but I don't know how $x is getting referenced if it was never declared.
I see that the $((${#x[@]} / 2)) is taking the number of elements and dividing it in half, and the tr is used to create the array. But what else is going on?
I think the last line is an array slice pattern in bash of the form ${array[@]:start:length}; for example, ${array[@]:1:2} takes a slice of length 2, starting at index 1.
In your case the start index is omitted, so it defaults to 0, and the length is half the element count of the array.
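A quick illustration of the slice syntax:
$ arr=(a b c d e f)
$ echo "${arr[@]:1:2}"             # length 2, starting at index 1
b c
$ echo "${arr[@]::${#arr[@]}/2}"   # start omitted, defaults to 0: first half
a b c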
But there is a much better way to do this in bash, shown below. Don't use eval; use the shell's built-in globbing support instead:
cd /my/directory
fileArray=()
for file in *my_pattern*; do
  [[ -f "$file" ]] || { printf '%s\n' 'no file found'; exit 1; }
  fileArray+=( "$file" )
done
and do
printf '%s\n' "${fileArray[@]::${#fileArray[@]}/2}"

Picking valid IPs from an array of strings

In my use case I'm filtering certain IPv4s from a list and putting them into an array for further tasks:
readarray -t firstarray < <(grep -ni '^ser*' IPbook.file | cut -f 2 -d "-")
As a result the output is:
10.8.61.10
10.0.10.15
172.0.20.30
678.0.0.10
As you can see, the last row is not a valid IP, so I need to add a regex check on FIRSTARRAY.
I do not want to write intermediate files, so I'm looking for an "on-the-fly" way to regex-check the firstarray. I tried the following:
for X in "${FIRSTARRAY[@]}"; do
  readarray -t SECONDARRAY < <(grep -E '\b((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(\.|$)){4}\b' "$X")
done
But the output shows that the system treats $X as a file/directory name rather than processing its value, even though it clearly sees it:
line ABC: 172.0.20.30: No such file or directory
line ABC: 678.0.0.10: No such file or directory
What am I doing wrong and what would be the best approach to proceed?
You are passing "$X" as an argument to grep, so it is treated as a file name. Use a herestring (<<<) instead:
for X in "${firstarray[@]}"; do
  readarray -t secondarray < <(grep -E '\b((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(\.|$)){4}\b' <<< "$X")
done
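Alternatively, a sketch that skips the loop and filters the whole array in one pass (grep -x requires the regex to match each line in full, so lines like 678.0.0.10 are rejected):
readarray -t secondarray < <(
  printf '%s\n' "${firstarray[@]}" |
  grep -Ex '((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
)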
You are better off writing a function to validate the IP instead of relying on just a regex match:
#!/bin/bash
validate_ip() {
  local arr element
  IFS=. read -r -a arr <<< "$1"                # convert IP string to array
  [[ ${#arr[@]} != 4 ]] && return 1            # doesn't have four parts
  for element in "${arr[@]}"; do
    [[ $element =~ ^[0-9]+$ ]]  || return 1    # non-numeric characters found
    [[ $element =~ ^0[0-9]+$ ]] && return 1    # leading 0 followed by other digits not allowed, to prevent it from being interpreted as an octal number
    ((element < 0 || element > 255)) && return 1 # number out of range
  done
  return 0
}
And then loop through your array:
for x in "${firstarray[@]}"; do
  validate_ip "$x" && secondarray+=("$x")  # add to second array if element is a valid IP
done
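Expected behavior with the values from the question:
$ validate_ip 172.0.20.30 && echo valid || echo invalid
valid
$ validate_ip 678.0.0.10 && echo valid || echo invalid
invalid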
The problem is that you are passing $X to grep as an argument, while grep expects to read standard input instead.
You can use your regex to filter the IP addresses right in the first command:
readarray -t firstarray < <(grep -ni '^ser*' IPbook.file | cut -f 2 -d "-" | grep -E '\b((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(\.|$)){4}\b' )
Then you have only IP addresses in firstarray.

How to split an array into chunks with jq?

I have a very large JSON file containing an array. Is it possible to use jq to split this array into several smaller arrays of a fixed size? Suppose my input was this: [1,2,3,4,5,6,7,8,9,10], and I wanted to split it into 3 element long chunks. The desired output from jq would be:
[1,2,3]
[4,5,6]
[7,8,9]
[10]
In reality, my input array has nearly three million elements, all UUIDs.
There is an (undocumented) builtin, _nwise, that meets the functional requirements:
$ jq -nc '[1,2,3,4,5,6,7,8,9,10] | _nwise(3)'
[1,2,3]
[4,5,6]
[7,8,9]
[10]
Also:
$ jq -nc '_nwise([1,2,3,4,5,6,7,8,9,10];3)'
[1,2,3]
[4,5,6]
[7,8,9]
[10]
Incidentally, _nwise can be used for both arrays and strings.
(I believe it's undocumented because there was some doubt about an appropriate name.)
TCO version
Unfortunately, the builtin version is carelessly defined, and will not perform well for large arrays. Here is an optimized version (it should be about as efficient as a non-recursive version):
def nwise($n):
  def _nwise:
    if length <= $n then . else .[0:$n], (.[$n:] | _nwise) end;
  _nwise;
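For instance:
$ jq -nc '
  def nwise($n):
    def _nwise:
      if length <= $n then . else .[0:$n], (.[$n:] | _nwise) end;
    _nwise;
  [1,2,3,4,5,6,7,8,9,10] | nwise(3)'
[1,2,3]
[4,5,6]
[7,8,9]
[10]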
For an array of size 3 million, this is quite performant: 3.91s on an old Mac, with a maximum resident set size of 162746368.
Notice that this version (using tail-call optimized recursion) is actually faster than the version of nwise/2 using foreach shown elsewhere on this page.
The following stream-oriented definition of window/3, due to Cédric Connes (github:connesc), generalizes _nwise, and illustrates a "boxing technique" that circumvents the need for an end-of-stream marker, and can therefore be used even if the stream contains the non-JSON value nan. A definition of _nwise/1 in terms of window/3 is also included.
The first argument of window/3 is interpreted as a stream. $size is the window size and $step specifies the number of values to be skipped. For example,
window(1,2,3; 2; 1)
yields:
[1,2]
[2,3]
window/3 and _nwise/1
def window(values; $size; $step):
  def checkparam(name; value):
    if (value | isnormal) and value > 0 and (value | floor) == value then .
    else error("window \(name) must be a positive integer")
    end;
  checkparam("size"; $size)
  | checkparam("step"; $step)
  # We need to detect the end of the loop in order to produce the terminal partial group (if any).
  # For that purpose, we introduce an artificial null sentinel, and wrap the input values into
  # singleton arrays in order to distinguish them.
  | foreach ((values | [.]), null) as $item (
      {index: -1, items: [], ready: false};
      (.index + 1) as $index
      # Extract items that must be reused from the previous iteration
      | if (.ready | not) then .items
        elif $step >= $size or $item == null then []
        else .items[-($size - $step):]
        end
      # Append the current item unless it must be skipped
      | if ($index % $step) < $size then . + $item
        else .
        end
      | {$index, items: ., ready: (length == $size or ($item == null and length > 0))};
      if .ready then .items else empty end
    );

def _nwise($n): window(.[]; $n; $n);
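One way to run it (a sketch, assuming the two definitions above are saved, together with a driver line such as window(inputs; 3; 1), in a file named window.jq):
$ seq 5 | jq -nc -f window.jq
[1,2,3]
[2,3,4]
[3,4,5]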
Source:
https://gist.github.com/connesc/d6b87cbacae13d4fd58763724049da58
If the array is too large to fit comfortably in memory, then I'd adopt the strategy suggested by @CharlesDuffy -- that is, stream the array elements into a second invocation of jq using a stream-oriented version of nwise, such as:
def nwise(stream; $n):
  foreach (stream, nan) as $x ([];
    if length == $n then [$x] else . + [$x] end;
    if (.[-1] | isnan) and length > 1 then .[:-1]
    elif length == $n then .
    else empty
    end);
The "driver" for the above would be:
nwise(inputs; 3)
But please remember to use the -n command-line option.
To create the stream from an arbitrary array:
$ jq -cn --stream '
fromstream( inputs | (.[0] |= .[1:])
| select(. != [[]]) )' huge.json
So the shell pipeline might look like this:
$ jq -cn --stream '
fromstream( inputs | (.[0] |= .[1:])
| select(. != [[]]) )' huge.json |
jq -n -f nwise.jq
This approach is quite performant. For grouping a stream of 3 million items into groups of 3 using nwise/2,
/usr/bin/time -lp
for the second invocation of jq gives:
user 5.63
sys 0.04
1261568 maximum resident set size
Caveat: this definition uses nan as an end-of-stream marker. Since nan is not a JSON value, this cannot be a problem for handling JSON streams.
Here's a simple one that worked for me:
def chunk(n):
range(length/n|ceil) as $i | .[n*$i:n*$i+n];
example usage:
jq -n \
'def chunk(n): range(length/n|ceil) as $i | .[n*$i:n*$i+n];
[range(5)] | chunk(2)'
[
0,
1
]
[
2,
3
]
[
4
]
Bonus: it doesn't use recursion and doesn't rely on _nwise, so it also works with jaq.
The below is hackery, to be sure -- but memory-efficient hackery, even with an arbitrarily long list:
jq -c --stream 'select(length==2)|.[1]' <huge.json \
| jq -nc 'foreach inputs as $i (null; null; [$i,try input,try input])'
The first piece of the pipeline streams in your input JSON file, emitting one line per element, assuming the array consists of atomic values (where [] and {} are here included as atomic values). Because it runs in streaming mode it doesn't need to store the entire content in memory, despite being a single document.
The second piece of the pipeline repeatedly reads up to three items and assembles them into a list.
This should avoid needing more than three pieces of data in memory at a time.

Bash: delete duplicates ("doublons") within an array

I have a file containing a number of fields. I am trying to delete duplicates ("doublons", e.g. two copies of the same attribute with different dates) within the same field. For example, from this:
Andro manual gene 1 100 . + . ID=truc;Name=truc;modified=13-09-1993;added=13-09-1993;modified=13-09-1997
Andro manual mRNA 1 100 . + . ID=truc-mRNA;Name=truc-mRNA;modified=13-09-1993;added=13-09-1993;modified=13-09-1997
We can see that modified=13-09-1993 and modified=13-09-1997 are duplicates. So I want to obtain this:
Andro manual gene 1 100 . + . ID=truc;Name=truc;added=13-09-1993;modified=13-09-1997
Andro manual mRNA 1 100 . + . ID=truc-mRNA;Name=truc-mRNA;added=13-09-1993;modified=13-09-1997
I want to keep the latest occurrence of each attribute and delete the older one. A given attribute appears at most twice in the same row.
I've tried this code (which is now working):
INPUT=$1
ID=$2
ALL_FEATURES=()
CONTIG_FEATURES=$(grep $ID $INPUT)
while read LINE; do
FEATURES=$(echo -e "$LINE" | cut -f 9)
#For each line, store all attributes from every line in an array
IFS=';' read -r -a ARRAY <<< "$FEATURES"
#Once the array is created, loop in array to look for doublons
for INDEX in "${!ARRAY[@]}"
do
ELEMENT=${ARRAY[INDEX]}
#If we are not at the end of the array, compare actual element and next element
ACTUAL=$ELEMENT
for INDEX2 in "${!ARRAY[@]}"
do
NEXT="${ARRAY[INDEX2]}"
ATTRIBUTE1=$(echo -e "$ACTUAL" | cut -d'=' -f1)
ATTRIBUTE2=$(echo -e "$NEXT" | cut -d'=' -f1)
echo "Comparing element number $INDEX ($ATTRIBUTE1) with element number $INDEX2 ($ATTRIBUTE2) ..."
if [[ $ATTRIBUTE1 = $ATTRIBUTE2 ]] && [[ $INDEX -ne $INDEX2 ]]
then
echo "Deleting features..."
#Delete actual element, because next element will be more recent
NEW=()
for VAL in "${ARRAY[@]}"
do
[[ $VAL != "${ARRAY[INDEX]}" ]] && NEW+=($VAL)
done
ARRAY=("${NEW[@]}")
unset NEW
fi
done
done
#Rewriting array into string separated by ;
FEATURES2=$( IFS=$';'; echo "${ARRAY[*]}" )
sed -i "s/$FEATURES/$FEATURES2/g" $INPUT
done < <(echo -e "$CONTIG_FEATURES")
I need advice because I think my array approach may not be a clever one, but I want a bash solution in any case. If anyone has bash advice/shortcuts, any suggestions to improve my bash understanding will be appreciated.
I'm sorry if I forgot any details. Thanks for your help.
Roxane
In awk:
$ awk '
{
  n=split($NF,a,";")            # split the last field on ;
  for(i=n;i>=1;i--) {           # iterate backwards to keep the last "doublon"
    split(a[i],b,"=")           # split key=value at =
    if(b[1] in c==0) {          # if key not yet seen in hash c
      d=a[i] (d==""?"":";") d   # prepend key=value to d, separated by ;
      c[b[1]]                   # record key in c
    }
  }
  $NF=d                         # set d as the last field
  delete c                      # clear c for the next record
  d=""                          # d too
}
1                               # output
' file
Andro manual gene 1 100 . + . ID=truc;Name=truc;added=13-09-1993;modified=13-09-1997
Andro manual mRNA 1 100 . + . ID=truc-mRNA;Name=truc-mRNA;added=13-09-1993;modified=13-09-1997
The following awk could also help you with the same task.
awk -F';' '{
  for(i=NF;i>0;i--){
    split($i, array,"=");
    if(++a[array[1]]>1){
      $i="\b"
    }
  };
  delete a
}
1
' OFS=";" Input_file
Output will be as follows.
Andro manual gene 1 100 . + . ID=truc;Name=truc;added=13-09-1993;modified=13-09-1997
Andro manual mRNA 1 100 . + . ID=truc-mRNA;Name=truc-mRNA;added=13-09-1993;modified=13-09-1997
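Since you asked for a bash solution in any case, here is a minimal pure-bash sketch of the same backwards-iteration idea, assuming the file is tab-separated with the attributes in the ninth field (as your cut -f 9 implies):
#!/bin/bash
while IFS=$'\t' read -r -a fields; do
  IFS=';' read -r -a attrs <<< "${fields[8]}"  # ninth field: the key=value list
  declare -A seen=()
  out=()
  for (( i=${#attrs[@]}-1; i>=0; i-- )); do    # backwards: latest occurrence wins
    key=${attrs[i]%%=*}
    [[ ${seen[$key]} ]] && continue            # older duplicate: drop it
    seen[$key]=1
    out=( "${attrs[i]}" "${out[@]}" )          # prepend to restore original order
  done
  fields[8]=$(IFS=';'; echo "${out[*]}")
  (IFS=$'\t'; printf '%s\n' "${fields[*]}")    # re-join the fields with tabs
done < file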

Bash: find non-repeated elements in an array

I'm looking for a way to find non-repeated elements in an array in bash.
Simple example:
joined_arrays=(CVE-2015-4840 CVE-2015-4840 CVE-2015-4860 CVE-2015-4860 CVE-2016-3598)
<magic>
non_repeated=(CVE-2016-3598)
To give context, the goal here is to end up with an array of all package update CVEs that aren't generally available via 'yum update' on a host due to being excluded. The way I came up with doing such a thing is to populate 3 preliminary arrays:
available_updates=() #just what 'yum update' would provide
all_updates=() #including excluded ones
joined_updates=() # contents of both prior arrays
Then apply logic to joined_updates=() that would return only elements that are included exactly once. Any element with two occurrences is one that can be updated normally and doesn't need to end up in the 'excluded_updates=()' array.
Hopefully this makes sense. As I was typing it out I'm wondering if it might be simpler to just remove all elements found in available_updates=() from all_updates=(), leaving the remaining ones as the excluded updates.
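(That set-difference variant is indeed straightforward, e.g. with comm; a sketch, assuming no element contains a newline:)
# lines present only in all_updates, i.e. the excluded updates
readarray -t excluded_updates < <(
  comm -23 <(printf '%s\n' "${all_updates[@]}" | sort) \
           <(printf '%s\n' "${available_updates[@]}" | sort)
)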
Thanks!
One pure-bash approach is to store a counter in an associative array, and then look for items where the counter is exactly one:
declare -A seen=( )                     # create an associative array (requires bash 4)
for item in "${joined_arrays[@]}"; do   # iterate over original items
  (( seen[$item] += 1 ))                # increment value associated with item
done

declare -a non_repeated=( )
for item in "${!seen[@]}"; do           # iterate over keys
  if (( ${seen[$item]} == 1 )); then    # if counter for that key is 1...
    non_repeated+=( "$item" )           # ...add that item to the output array.
  fi
done

declare -p non_repeated                 # print result
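With the joined_arrays example from the top of the question, this should print:
declare -a non_repeated=([0]="CVE-2016-3598")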
Another, terser (but buggier -- doesn't work with values containing newline literals) approach is to take advantage of standard text manipulation tools:
non_repeated=( ) # setup
# use uniq -c to count; filter for results with a count of 1
while read -r count value; do
(( count == 1 )) && non_repeated+=( "$value" )
done < <(printf '%s\n' "${joined_arrays[@]}" | sort | uniq -c)
declare -p non_repeated # print result
...or, even terser (and buggier, requiring that the array value split into exactly one field in awk):
readarray -t non_repeated \
  < <(printf '%s\n' "${joined_arrays[@]}" | sort | uniq -c | awk '$1 == 1 { print $2; }')
To crib an answer I really should have come up with myself from @Aaron (who deserves an upvote from anyone using this; do note that it retains the doesn't-work-with-values-with-newlines bug), one can also use uniq -u:
readarray -t non_repeated < <(printf '%s\n' "${joined_arrays[@]}" | sort | uniq -u)
I would rely on uniq.
Its -u option is made for this exact case, outputting only the unique occurrences. It requires its input to be a sorted, linefeed-separated list of tokens, hence the printf '%s\n' and sort:
$ my_test_array=( 1 2 3 2 1 0 )
$ printf '%s\n' "${my_test_array[@]}" | sort | uniq -u
0
3
Here is a single awk based solution that doesn't require sort:
arr=( 1 2 3 2 1 0 )
printf '%s\n' "${arr[@]}" |
awk '{++fq[$0]} END{for(i in fq) if (fq[i]==1) print i}'
0
3
