jq: How to Catenate an Array and Strip Spaces

The following jq command (Windows) catenates all the "text" properties into one string, replacing any spaces with a single space, albeit in a roundabout way. It is almost correct. What I really want is to first strip any leading or trailing spaces from each "text" value, then catenate all the "text" properties; the difference is that embedded (non-leading, non-trailing) spaces must not be removed. How can this be done?
jq ".segments[].words | map(.text?) | join(\",\") | gsub(\"[ ]\"; \"\") | gsub(\"[,]\"; \" \")"

Consider:
def trim: sub("^ *";"") | sub(" *$";"");
Or you could simply use: gsub("^\\s+|\\s+$";"") (note that the backslash must be doubled inside a jq string literal, and the + strips runs of whitespace rather than a single character).
There are other ways to trim a string but the above should get you started.
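Putting it together with the original filter, trimming first and then joining on a single space might look like this (a sketch in the same Windows quoting style; the // \"\" guard against missing "text" properties is an assumption about the data):
jq "def trim: sub(\"^ *\";\"\") | sub(\" *$\";\"\"); .segments[].words | map(.text? // \"\" | trim) | join(\" \")"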

Related

BASH: Parsing CSV into an array using IFS: last array element is missing when it is an empty string [duplicate]

To parse colon-delimited fields I can use read with a custom IFS:
$ echo 'foo.c:41:switch (color) {' | { IFS=: read file line text && echo "$file | $line | $text"; }
foo.c | 41 | switch (color) {
If the last field contains colons, no problem, the colons are retained.
$ echo 'foo.c:42:case RED: //alert' | { IFS=: read file line text && echo "$file | $line | $text"; }
foo.c | 42 | case RED: //alert
A trailing delimiter is also retained...
$ echo 'foo.c:42:case RED: //alert:' | { IFS=: read file line text && echo "$file | $line | $text"; }
foo.c | 42 | case RED: //alert:
...Unless it's the only extra delimiter. Then it's stripped. Wait, what?
$ echo 'foo.c:42:case RED:' | { IFS=: read file line text && echo "$file | $line | $text"; }
foo.c | 42 | case RED
Bash, ksh93, and dash all do this, so I'm guessing it is POSIX standard behavior.
Why does it happen?
What's the best alternative?
I want to parse the strings above into three variables and I don't want to mangle any text in the third field. I had thought read was the way to go but now I'm reconsidering.
Yes, that's standard behaviour (see the read specification and Field Splitting). A few shells (at least ash-based ones including dash, pdksh-based ones, zsh, and yash) used not to do it, but except for zsh (when not in POSIX mode) and busybox sh, most of them have since been updated for POSIX compliance.
That's the same for:
$ var='a:b:c:' IFS=:
$ set -- $var; echo "$#"
3
(see how the POSIX specification for read actually defers to the Field Splitting mechanism where a:b:c: is split into 3 fields, and so with IFS=: read -r a b c, there are as many fields as variables).
The rationale is that in ksh (on which the POSIX spec is based), $IFS (initially, in the Bourne shell, the internal field separator) became a field delimiter; I think that was so any list of elements (not containing the delimiter) could be represented.
When $IFS is a separator, one can't represent a list of one empty element ("" is split into a list of zero elements, ":" into a list of two empty elements¹). When it's a delimiter, you can express a list of zero elements with "", one empty element with ":", or two empty elements with "::".
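For instance, extending the same demonstration (delimiter semantics in bash and other POSIX shells):
$ var='::' IFS=:
$ set -- $var; echo "$#"
2
$ var=':' IFS=:
$ set -- $var; echo "$#"
1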
It's a bit unfortunate as one of the most common usages of $IFS is to split $PATH. And a $PATH like /bin:/usr/bin: is meant to be split into "/bin", "/usr/bin", "", not just "/bin" and "/usr/bin".
Now, with POSIX shells (but not all shells are compliant in that regard), for word splitting upon parameter expansion, that can be worked around with:
IFS=:; set -o noglob
for dir in $PATH""; do
something with "${dir:-.}"
done
That trailing "" makes sure that if $PATH ends in a trailing :, an extra empty element is added. And also that an empty $PATH is treated as one empty element as it should be.
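For instance, a quick sketch with a throwaway $PATH value (printf is a shell builtin, so overriding PATH here is harmless):
IFS=:; set -o noglob
PATH='/bin:/usr/bin:'
for dir in $PATH""; do
printf '%s\n' "${dir:-.}"
done
This prints /bin, /usr/bin and ., the final . being the trailing empty element rendered through ${dir:-.}.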
That approach can't be used for read though.
Short of switching to zsh, there's no easy workaround other than inserting an extra : and removing it afterwards, like:
echo a:b:c: | sed 's/:/::/2' | { IFS=: read -r x y z; z=${z#:}; echo "$z"; }
Or (less portable):
echo a:b:c: | paste -d: - /dev/null | { IFS=: read -r x y z; z=${z%:}; echo "$z"; }
I've also added the -r which you generally want when using read.
Most likely here you'd want to use a proper text processing utility like sed/awk/perl instead of writing convoluted and probably inefficient code around read which has not been designed for that.
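For example, a minimal awk sketch along those lines (it assumes the first two fields never contain colons, and reconstructs the third field byte for byte, trailing delimiter included):
$ echo 'foo.c:42:case RED:' | awk -F: '{print $1 " | " $2 " | " substr($0, length($1)+length($2)+3)}'
foo.c | 42 | case RED: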
¹ Though in the Bourne shell, that was still split into zero elements, as there was no distinction between IFS-whitespace and IFS-non-whitespace characters there; that distinction was also added by ksh.
One "feature" of read is that it will strip leading and trailing whitespace separators in the variables it populates - it is explained in much more detail at the linked answer. This enables beginners to have read do what they expect when doing for example read first rest <<< ' foo bar ' (note the extra spaces).
The take-away? It is hard to do accurate text processing using Bash and shell tools. If you want full control, it's probably better to use a "stricter" language such as Python, where split() will do what you want, but where you might have to dig much deeper into string handling to explicitly remove newline separators or handle encoding.

BASH grep result as array name

Heyhey,
since this is my first post, please be patient :) I'll try my best.
I'm trying to "grep" the language out of my system (OS X) and use the result as the name of one of my existing strings, to set a language.
I've got some strings called $en['a' 'b' 'c'], $de['d' 'e' 'f'] and $fr['g' 'h' 'i'] somewhere...
I use:
language=$(locale | grep LANG= | cut -d'"' -f2 | cut -d_ -f1)
which gives me an ISO value like en, fr, de, ...
Here comes my main problem: I just can't use ${language[*]}.
It feels like I've tried everything, doing trial and error with {}, (), '' and $.
The only thing I found out while debugging is that language results in
language=ISO (so this part works correctly)
and if I try to get this value as my desired string
echo ${language[*]} , ${language[0]} , ${language[1]}
results in
ISO , ISO ,
which is not correct. It seems like I'm creating a new string but I want to use the existing ones.
Don't know any more keywords to google :(
These all use array syntax:
echo ${language[*]} , ${language[0]} , ${language[1]}
But the way you created the variable, this is not an array, this is a simple variable:
language=$(locale | grep LANG= | cut -d'"' -f2 | cut -d_ -f1)
To access its value, use simply $language, for example:
echo $language
Also, the pipeline with locale, grep and cut gets the first two characters of the LANG variable in a very inefficient way.
You can get it more efficiently using a substring:
language=${LANG:0:2}
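For example (the locale value shown is only an assumption; yours may differ):
$ echo "$LANG"
en_US.UTF-8
$ echo "${LANG:0:2}"
en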
If you want to use an array (though I don't see the point here), then you must put parentheses around the values to assign, for example:
language=(${LANG:0:2})
The array in this example has one element; you can access its value like this:
echo ${language[0]}
Note that the syntax of Bash is very strict with respect to symbols and spaces, every little detail may make a big difference, so it's important to write accurately.
You can paste your scripts to shellcheck.net to check for trivial errors.
You can read more about arrays in man bash.
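If the underlying goal is to use the computed name to reach one of the existing arrays ($en, $de, $fr), one hedged possibility is a bash nameref (requires bash 4.3+; the array contents below are made up for illustration):
en=('a' 'b' 'c'); de=('d' 'e' 'f'); fr=('g' 'h' 'i')
language=en
declare -n chosen="$language"  # chosen now refers to the array whose name is in $language
echo "${chosen[@]}"  # prints: a b c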

Parsing HTML to array only returns one word

I'm trying to parse some HTML subtitles into an array using Bash and html-xml-utils. I've also tried using a Lynx dump to pretty it up, but I had the same problem: I can't get my sed to put more than one word at a time into the array.
Code:
array=($(echo $PAGE |
hxselect -i ".sub_info_container .sub_title" |
sed -r 's/.*\">(.*)<\/a>.*/\1/' ))
echo $array
This gets piped into sed:
<div class="sub_title"><a class="sub_title" href="/link">Some Random Title.</a></div><div class="sub_title"><a class="sub_title" href="/link2">Another subtitle I want.</a>
Output of echo $array:
Some
What I'm trying to get:
Some Random Title
Without the punctuation would be nice (the subtitles often have ? or ! instead of a period), but it could work with the punctuation included too.
Things I've tried:
Using Lynx to pretty up the code, then using awk to grab the elements
A lot of different sed and awk methods of grabbing the text
I'm not sure why, but my code ended up splitting on the spaces, so each word became a separate item. The solution was the following code:
array=($(echo $PAGE |
hxselect -i ".sub_info_container .sub_title" |
lynx -stdin -dump | tr " " - ))
I used tr to turn the spaces into dashes, allowing each title to be passed into the array as a single element. Taking off the extra parentheses, as everybody suggested, actually removed the assignment of the values into an array, which, as I stated, was my intention. After the code completed, I simply re-converted all the dashes back to spaces. It's not pretty, but it works!
Try this:
s='<div class="sub_title"><a class="sub_title" href="/link">Some Random Title.</a></div><div class="sub_title"><a class="sub_title" href="/link2">Another subtitle I want.</a>'
array=$(echo "$s" | sed 's/<\/div><div /\n/' | sed -r 's/.*\">(.*)<\/a>.*/\1/g')
echo "$array"
I had to add a newline between the divs to match both. I'm not that good with sed and couldn't figure out how to do it without that.
Your main problem was with the extra parentheses:
array=($(echo .....))
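Note, though, that array=$(...) without the outer parentheses assigns a plain string, not an array. If an actual bash array (one element per title) is the goal, one hedged sketch is to keep that newline splitting but collect the lines with mapfile instead of relying on word splitting (GNU sed assumed for \n in the replacement):
mapfile -t array < <(echo "$s" | sed 's/<\/div><div /\n/g' | sed -r 's/.*\">(.*)<\/a>.*/\1/')
echo "${array[1]}"  # prints: Another subtitle I want.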

Bash read file to an array based on two delimiters

I have a file that I need to parse into an array, but I really only want a brief portion of each line, and only from the first 84 lines.
Sometimes the line maybe:
>MT gi...
And I would just want the MT to be entered into the array. Other times it might be something like this:
>GL000207.1 dn...
and I would need the GL000207.1
I was thinking that you might be able to set two delimiters (one being the '>' and the other being the ' ' whitespace), but I am not sure how you would go about it. I have read other people's posts about the internal field separator, but I am really not sure how that would work. I would think perhaps something like this might work, though?
desiredArray=$(echo file.whatever | tr ">" " ")
for x in $desiredArray
do
echo > $x
done
Any suggestions?
How about:
head -84 <file> | awk '{print $1}' | tr -d '>'
head takes only the first 84 lines of the file, awk keeps just the first whitespace-delimited field of each line, and tr gets rid of the '>'.
You can also do it with sed:
head -n 84 <file> | sed 's/>\([^ ]*\).*/\1/'
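And to land those first fields in a bash array, as the title asks, one option is mapfile on top of either pipeline (bash 4+; file.whatever stands in for your actual file, as in the question):
mapfile -t desiredArray < <(head -n 84 file.whatever | awk '{print $1}' | tr -d '>')
echo "${desiredArray[0]}"  # e.g. MT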

How to 'cut' on null?

The Unix 'file' command has a -0 option to output a null character after each filename. This is supposedly good for use with 'cut'.
From man file:
-0, --print0
Output a null character ‘\0’ after the end of the filename. Nice to cut(1) the output. This does not affect the separator which is still printed.
(Note: on my Linux, the '-F' separator is NOT printed, which makes more sense to me.)
How can you use 'cut' to extract a filename from output of 'file'?
This is what I want to do:
find . "*" -type f | file -n0iNf - | cut -d<null> -f1
where <null> is the NUL character.
Well, that is what I am trying to do. What I actually want is to get all the file names in a directory tree that have a particular MIME type; I use a grep for that (not shown).
I want to handle all legal file names and not get stuck on file names with colons, for example, in their name. Hence, NUL would be excellent.
I guess non-cut solutions are fine too, but I hate to give up on a simple idea.
Just specify an empty delimiter:
cut -d '' -f1
(N.B.: The space between the -d and the '' is important, so that the -d and the empty string get passed as separate arguments; if you write -d'', then that will get passed as just -d, and then cut will think you're trying to use -f1 as the delimiter, which it will complain about, with an error message that "the delimiter must be a single character".)
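For example (GNU coreutils cut; a quick check of the empty-delimiter behaviour):
$ printf 'foo.txt\0text/plain\n' | cut -d '' -f1
foo.txt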
This works with GNU awk:
awk 'BEGIN{FS="\x00"}{print$1}'
ruakh's helpful answer works well on Linux. On macOS, however, the cut utility doesn't accept '' as a delimiter argument (it fails with "bad delimiter").
Here is a portable workaround that works on both platforms, via the tr utility; it only makes one assumption:
The input mustn't contain \1 control characters (START OF HEADING, U+0001), which is unlikely in text.
You can substitute any character known not to occur in the input for \1; if it's a character that can be represented verbatim in a string, that simplifies the solution, because you won't need the auxiliary command substitution ($(...)) with a printf call for the -d argument.
If your shell supports so-called ANSI C-quoted strings (which is true of bash, zsh and ksh), you can replace "$(printf '\1')" with $'\1'.
(The following uses a simpler input command to demonstrate the technique).
# In zsh, bash, ksh you can simplify "$(printf '\1')" to $'\1'
$ printf '[first field 1]\0[rest 1]\n[first field 2]\0[rest 2]' |
tr '\0' '\1' | cut -d "$(printf '\1')" -f 1
[first field 1]
[first field 2]
Alternatives to using cut:
C. Paul Bond's helpful answer shows a portable awk solution.
