How to 'cut' on null? - file

The Unix 'file' command has a -0 option to output a null character after each filename. This is supposedly useful for piping into 'cut'.
From man file:
-0, --print0
Output a null character ‘\0’ after the end of the filename. Nice
to cut(1) the output. This does not affect the separator which is
still printed.
(Note, on my Linux, the '-F' separator is NOT printed - which makes more sense to me.)
How can you use 'cut' to extract a filename from output of 'file'?
This is what I want to do:
find . "*" -type f | file -n0iNf - | cut -d<null> -f1
where <null> is the NUL character.
Well, that is what I am trying to do; what I actually want is to get all file names from a directory tree that have a particular MIME type. I use a grep for that (not shown).
I want to handle all legal file names and not get stuck on file names with colons, for example, in their name. Hence, NUL would be excellent.
I guess non-cut solutions are fine too, but I hate to give up on a simple idea.

Just specify an empty delimiter:
cut -d '' -f1
(N.B.: The space between the -d and the '' is important, so that the -d and the empty string get passed as separate arguments; if you write -d'', then that will get passed as just -d, and then cut will think you're trying to use -f1 as the delimiter, which it will complain about, with an error message that "the delimiter must be a single character".)
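For the original goal, the whole pipeline might look something like this (a sketch, assuming GNU cut and GNU grep; the MIME pattern is just a placeholder for whatever type you're filtering on):
find . -type f | file -n0iNf - | grep -a 'text/x-shellscript' | cut -d '' -f1
(grep -a is there because GNU grep may otherwise treat NUL-containing input as binary and not print the matching lines.)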

This works with GNU awk:
awk 'BEGIN{FS="\x00"} {print $1}'
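Plugged into the original pipeline it might look like this (a sketch; the /text\// filter is just a placeholder standing in for the grep mentioned in the question):
find . -type f | file -n0iNf - | awk 'BEGIN{FS="\x00"} /text\//{print $1}'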

ruakh's helpful answer works well on Linux.
On macOS, the cut utility doesn't accept '' as a delimiter argument (it reports "bad delimiter").
Here is a portable workaround that works on both platforms, via the tr utility; it only makes one assumption:
The input mustn't contain \1 control characters (START OF HEADING, U+0001) - which is unlikely in text.
You can substitute any character known not to occur in the input for \1; if it's a character that can be represented verbatim in a string, that simplifies the solution, because you won't need the auxiliary command substitution ($(...)) with a printf call for the -d argument.
If your shell supports so-called ANSI C-quoted strings - which is true of bash, zsh, and ksh - you can replace "$(printf '\1')" with $'\1'.
(The following uses a simpler input command to demonstrate the technique).
# In zsh, bash, ksh you can simplify "$(printf '\1')" to $'\1'
$ printf '[first field 1]\0[rest 1]\n[first field 2]\0[rest 2]' |
tr '\0' '\1' | cut -d "$(printf '\1')" -f 1
[first field 1]
[first field 2]
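Applied to the original file pipeline, the same workaround might look like this (again a sketch, with a placeholder MIME pattern):
find . -type f | file -n0iNf - | tr '\0' '\1' | grep 'text/plain' | cut -d "$(printf '\1')" -f 1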
Alternatives to using cut:
C. Paul Bond's helpful answer shows a portable awk solution.

Related

shell send args to a C program with spaces [duplicate]

This question already has answers here: How can I store a command in a variable in a shell script?
These work as advertised:
grep -ir 'hello world' .
grep -ir hello\ world .
These don't:
argumentString1="-ir 'hello world'"
argumentString2="-ir hello\\ world"
grep $argumentString1 .
grep $argumentString2 .
Despite 'hello world' being enclosed by quotes in the second example, grep interprets 'hello (and hello\) as one argument and world' (and world) as another, which means that, in this case, 'hello will be the search pattern and world' will be the search path.
Again, this only happens when the arguments are expanded from the argumentString variables. grep properly interprets 'hello world' (and hello\ world) as a single argument in the first example.
Can anyone explain why this is? Is there a proper way to expand a string variable that will preserve the syntax of each character such that it is correctly interpreted by shell commands?
Why
When the string is expanded, it is split into words, but it is not re-evaluated to find special characters such as quotes or dollar signs or ... This is the way the shell has 'always' behaved, since the Bourne shell back in 1978 or thereabouts.
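You can see the splitting (and the literal quotes) directly - a minimal illustration; printf '<%s>\n' just prints each argument it receives on its own line, wrapped in angle brackets:
argumentString1="-ir 'hello world'"
printf '<%s>\n' $argumentString1
<-ir>
<'hello>
<world'>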
Fix
In bash, use an array to hold the arguments:
argumentArray=(-ir 'hello world')
grep "${argumentArray[#]}" .
Or, if brave/foolhardy, use eval:
argumentString="-ir 'hello world'"
eval "grep $argumentString ."
On the other hand, discretion is often the better part of valour, and working with eval is a place where discretion is better than bravery. If you are not completely in control of the string that is eval'd (if there's any user input in the command string that has not been rigorously validated), then you are opening yourself to potentially serious problems.
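A minimal illustration of the risk (the injected command here is harmless, but with unvalidated input it could be anything):
argumentString="-ir 'hello world' \$(echo INJECTED >&2) ."
eval "grep $argumentString"
# eval runs the embedded $(...) before grep ever sees its arguments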
Note that the sequence of expansions for Bash is described in Shell Expansions in the GNU Bash manual. Note in particular sections 3.5.3 Shell Parameter Expansion, 3.5.7 Word Splitting, and 3.5.9 Quote Removal.
When you put quote characters into variables, they just become plain literals (see http://mywiki.wooledge.org/BashFAQ/050; thanks #tripleee for pointing out this link)
Instead, try using an array to pass your arguments:
argumentString=(-ir 'hello world')
grep "${argumentString[#]}" .
In looking at this and related questions, I'm surprised that no one brought up using an explicit subshell. For bash, and other modern shells, you can execute a command line explicitly. In bash, it requires the -c option.
argumentString="-ir 'hello world'"
bash -c "grep $argumentString ."
Works exactly as original questioner desired. There are two restrictions to this technique:
Quoting within the command string needs care; it is simplest to single-quote the whole command string when assigning it and use double quotes inside it (or the reverse).
Only exported environment variables will be available to the command.
Also, this technique handles redirection and piping, and other shellisms work as well. You also can use bash internal commands as well as any other command that works at the command line, because you are essentially asking a subshell bash to interpret it directly as a command line. Here's a more complex example, a somewhat gratuitously complex ls -l variant.
cmd="prefix=`pwd` && ls | xargs -n 1 echo \'In $prefix:\'"
bash -c "$cmd"
I have built command processors both this way and with parameter arrays. Generally, this way is much easier to write and debug, and it's trivial to echo the command you are executing. OTOH, param arrays work nicely when you really do have abstract arrays of parameters, as opposed to just wanting a simple command variant.

BASH: Parsing CSV into array using IFS - last array element missing when it is an empty string [duplicate]

To parse colon-delimited fields I can use read with a custom IFS:
$ echo 'foo.c:41:switch (color) {' | { IFS=: read file line text && echo "$file | $line | $text"; }
foo.c | 41 | switch (color) {
If the last field contains colons, no problem, the colons are retained.
$ echo 'foo.c:42:case RED: //alert' | { IFS=: read file line text && echo "$file | $line | $text"; }
foo.c | 42 | case RED: //alert
A trailing delimiter is also retained...
$ echo 'foo.c:42:case RED: //alert:' | { IFS=: read file line text && echo "$file | $line | $text"; }
foo.c | 42 | case RED: //alert:
...Unless it's the only extra delimiter. Then it's stripped. Wait, what?
$ echo 'foo.c:42:case RED:' | { IFS=: read file line text && echo "$file | $line | $text"; }
foo.c | 42 | case RED
Bash, ksh93, and dash all do this, so I'm guessing it is POSIX standard behavior.
Why does it happen?
What's the best alternative?
I want to parse the strings above into three variables and I don't want to mangle any text in the third field. I had thought read was the way to go but now I'm reconsidering.
Yes, that's standard behaviour (see the read specification and Field Splitting). A few shells (ash-based including dash, pdksh-based, zsh, and yash, at least) used not to do it, but apart from zsh (when not in POSIX mode) and busybox sh, most of them have since been updated for POSIX compliance.
That's the same for:
$ var='a:b:c:' IFS=:
$ set -- $var; echo "$#"
3
(see how the POSIX specification for read actually defers to the Field Splitting mechanism where a:b:c: is split into 3 fields, and so with IFS=: read -r a b c, there are as many fields as variables).
The rationale is that in ksh (on which the POSIX spec is based) $IFS (initially, in the Bourne shell, the internal field separator) became a field delimiter, I think so that any list of elements (not containing the delimiter) could be represented.
When $IFS is a separator, one can't represent a list of one empty element ("" is split into a list of 0 elements, ":" into a list of two empty elements¹). When it's a delimiter, you can express a list of zero elements with "", one empty element with ":", or two empty elements with "::".
It's a bit unfortunate as one of the most common usages of $IFS is to split $PATH. And a $PATH like /bin:/usr/bin: is meant to be split into "/bin", "/usr/bin", "", not just "/bin" and "/usr/bin".
Now, with POSIX shells (but not all shells are compliant in that regard), for word splitting upon parameter expansion, that can be worked around with:
IFS=:; set -o noglob
for dir in $PATH""; do
  something with "${dir:-.}"
done
That trailing "" makes sure that if $PATH ends in a trailing :, an extra empty element is added. And also that an empty $PATH is treated as one empty element as it should be.
That approach can't be used for read though.
Short of switching to zsh, there's no easy work around other than inserting an extra : and remove it afterwards like:
echo a:b:c: | sed 's/:/::/2' | { IFS=: read -r x y z; z=${z#:}; echo "$z"; }
Or (less portable):
echo a:b:c: | paste -d: - /dev/null | { IFS=: read -r x y z; z=${z%:}; echo "$z"; }
I've also added the -r which you generally want when using read.
Most likely here you'd want to use a proper text processing utility like sed/awk/perl instead of writing convoluted and probably inefficient code around read which has not been designed for that.
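For instance, a sketch with awk (the field layout is assumed from the sample lines above); note that the third field, colons and all, comes through unmangled:
echo 'foo.c:42:case RED:' | awk -F: '{file=$1; line=$2; sub(/^[^:]*:[^:]*:/, ""); print file " | " line " | " $0}'
foo.c | 42 | case RED: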
¹ Though in the Bourne shell, that was still split into zero elements as there was no distinction between IFS-whitespace and IFS-non-whitespace characters there, something that was also added by ksh
One "feature" of read is that it will strip leading and trailing whitespace separators in the variables it populates - it is explained in much more detail at the linked answer. This enables beginners to have read do what they expect when doing for example read first rest <<< ' foo bar ' (note the extra spaces).
The take-away? It is hard to do accurate text processing using Bash and shell tools. If you want full control it's probably better to use a "stricter" language like for example Python, where split() will do what you want, but where you might have to dig much deeper into string handling to explicitly remove newline separators or handle encoding.

Shell Script regex matches to array and process each array element

While I've handled this task in other languages easily, I'm at a loss for which commands to use when Shell Scripting (CentOS/BASH)
I have some regex that provides many matches in a file I've read to a variable, and would like to take the regex matches to an array to loop over and process each entry.
For regex, I typically use https://regexr.com/ to form my capture groups and throw that at JS/Python/Go to get an array and loop - but in shell scripting I'm not sure what I can use.
So far I've played with "sed" to find all matches and replace, but don't know if it's capable of returning an array to loop from matches.
Take regex, run on file, get array back. I would love some help with Shell Scripting for this task.
EDIT:
Based on comments, I put this together (not working, according to shellcheck.net):
#!/bin/sh
examplefile="
asset('1a/1b/1c.ext')
asset('2a/2b/2c.ext')
asset('3a/3b/3c.ext')
"
examplearr=($(sed 'asset\((.*)\)' $examplefile))
for el in ${!examplearr[*]}
do
echo "${examplearr[$el]}"
done
This works in bash on a mac:
#!/bin/sh
examplefile="
asset('1a/1b/1c.ext')
asset('2a/2b/2c.ext')
asset('3a/3b/3c.ext')
"
examplearr=(`echo "$examplefile" | sed -e '/.*/s/asset(\(.*\))/\1/'`)
for el in ${examplearr[*]}; do
echo "$el"
done
output:
'1a/1b/1c.ext'
'2a/2b/2c.ext'
'3a/3b/3c.ext'
Note the wrapping of $examplefile in quotes, and the use of sed to replace the entire line with the match. If there will be other content in the file, either on the same lines as the "asset" string or in other lines with no assets at all you can refine it like this:
#!/bin/sh
examplefile="
fooasset('1a/1b/1c.ext')
asset('2a/2b/2c.ext')bar
foobar
fooasset('3a/3b/3c.ext')bar
"
examplearr=(`echo "$examplefile" | grep asset | sed -e '/.*/s/^.*asset(\(.*\)).*$/\1/'`)
for el in ${examplearr[*]}; do
echo "$el"
done
and achieve the same result.
There are several ways to do this. I'd do it with GNU grep with Perl-compatible regex (ah, delightful line noise):
mapfile -t examplearr < <(grep -oP '(?<=[(]).*?(?=[)])' <<<"$examplefile")
for i in "${!examplearr[#]}"; do printf "%d\t%s\n" $i "${examplearr[i]}"; done
0 '1a/1b/1c.ext'
1 '2a/2b/2c.ext'
2 '3a/3b/3c.ext'
This uses the bash mapfile command to read lines from stdin and assign them to an array.
The bits you're missing from the sed command:
$examplefile is text, not a filename, so you have to send it to sed's stdin
sed's a funny little language with 1-character commands: you've given it the "a" command, which is inappropriate in this case.
you only want to output the captured parts of the matches, not every line, so you need the -n option, and you need to print somewhere: the p flag in s///p means "print the [line] if a substitution was made".
sed -n 's/asset\(([^)]*)\)/\1/p' <<<"$examplefile"
# or
echo "$examplefile" | sed -n 's/asset\(([^)]*)\)/\1/p'
Note that this returns values like ('1a/1b/1c.ext') -- with the parentheses. If you don't want them, add the -r or -E option to sed: among other things, that flips the meaning of ( and \(
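For instance, a sketch of the -E form that also drops the surrounding single quotes (dropping them is an assumption about the output you want):
sed -nE "s/.*asset\('([^']*)'\).*/\1/p" <<<"$examplefile"
1a/1b/1c.ext
2a/2b/2c.ext
3a/3b/3c.ext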

Moving things in terminal based on their name

Edit: I think this has been answered successfully, but I can't check 'til later. I've reformatted it as suggested though.
The question: I have a series of files, each with a name of the form XXXXNAME, where XXXX is some number. I want to move them all to separate folders called XXXX and have them called NAME. I can do this manually, but I was hoping that by naming them XXXXNAME there'd be some way I could tell Terminal (I think that's the right name, but not really sure) to move them there. Something like
mv *NAME */NAME
but where it takes whatever * was in the first case and regurgitates it to the path.
This is on some form of Linux, with a bash shell.
In the real life case, the files are 0000GNUmakefile, with sequential numbering. I'm having to make lots of similar-but-slightly-altered versions of a program to compile and run on a cluster as part of my research. It would probably have been quicker to write a program to edit all the files and put in the right place in the first place, but I didn't.
This is probably extremely simple, and I should be able to find an answer myself, if I knew the right words. Thing is, I have no formal training in programming, so I don't know what to call things to search for them. So hopefully this will result in me getting an answer, and maybe knowing how to find out the answer for similar things myself next time. With the basic programming I've picked up, I'm sure I could write a program to do this for me, but I'm hoping there's a simple way to do it just using functionality already in Terminal. I probably shouldn't be allowed to play with these things.
Thanks for any help! I can actually program in C and Python a fair amount, but that's through trial and error largely, and I still don't know what I can do and can't do in Terminal.
SO many ways to achieve this.
I find that the old standbys sed and awk are often the most powerful.
ls | sed -rne 's:^([0-9]{4})(NAME)$:mv -iv & \1/\2:p'
If you're satisfied that the commands look right, pipe the command line through a shell:
ls | sed -rne 's:^([0-9]{4})(NAME)$:mv -iv & \1/\2:p' | sh
I put NAME in brackets and used \2 so that if it varies more than your example indicates, you can come up with a regular expression to handle your filenames better.
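As a quick illustration, with hypothetical files 0001NAME and 0002NAME in the current directory, the sed line emits one mv command per match:
$ ls | sed -rne 's:^([0-9]{4})(NAME)$:mv -iv & \1/\2:p'
mv -iv 0001NAME 0001/NAME
mv -iv 0002NAME 0002/NAME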
To do the same thing in gawk (GNU awk, the variant found in most GNU/Linux distros):
ls | gawk '/^[0-9]{4}NAME$/ {printf("mv -iv %s %s/%s\n", $1, substr($0,1,4), substr($0,5))}'
As with the first sample, this produces commands which, if they make sense to you, can be piped through a shell by appending | sh to the end of the line.
Note that with all these mv commands, I've added the -i and -v options. This is for your protection. Read the man page for mv (by typing man mv in your Linux terminal) to see if you should be comfortable leaving them out.
Also, I'm assuming with these lines that all your directories already exist. You didn't mention if they do. If they don't, here's a one-liner to create the directories.
ls | sed -rne 's:^([0-9]{4})(NAME)$:mkdir -p \1:p' | sort -u
As with the others, append | sh to run the commands.
I should mention that it is generally recommended to use constructs like for (in Tim's answer) or find instead of parsing the output of ls. That said, when your filename format is as simple as /[0-9]{4}word/, I find the quick sed one-liner to be the way to go.
Lastly, if by NAME you actually mean "any string of characters" rather than the literal string "NAME", then in all my examples above, replace NAME with .*.
The following script will do this for you. Copy the script into a file on the remote machine (we'll call it sortfiles.sh).
#!/bin/bash
# Get all files in the current directory tree having names XXXXsomename, where X is a digit
files=$(find . -name '[0-9][0-9][0-9][0-9]*')
# Build a list of the XXXX prefixes found in the list of files
# (characters 3-6 skip the leading './' that find puts on each name)
dirs=
for name in ${files}; do
    dirs="${dirs} $(echo ${name} | cut -c 3-6)"
done
# Remove redundant entries from the list of XXXX prefixes
dirs=$(echo ${dirs} | tr ' ' '\n' | sort -u)
# Create any XXXX directories that are not already present
for name in ${dirs}; do
    if [[ ! -d ${name} ]]; then
        mkdir ${name}
    fi
done
# Move each XXXXsomename file into its directory, dropping the XXXX prefix from its name
for name in ${files}; do
    mv ${name} "$(echo ${name} | cut -c 3-6)/$(echo ${name} | cut -c 7-)"
done
# Return from script with normal status
exit 0
From the command line, do chmod +x sortfiles.sh
Execute the script with ./sortfiles.sh
Just open the Terminal application, cd into the directory that contains the files you want moved/renamed, and copy and paste these commands into the command line.
shopt -s extglob    # the parameter expansions below use extended glob patterns
for file in [0-9][0-9][0-9][0-9]*; do
  dirName="${file%%*([^0-9])}"              # strip the trailing non-digit NAME part, leaving the digits
  mkdir -p "$dirName"
  mv "$file" "$dirName/${file##*([0-9])}"   # strip the leading digits, leaving NAME
done
This assumes all the files that you want to rename and move are in the same directory. The file globbing also assumes that there are at least four digits at the start of the filename. If there are more than four digits, the file will still be caught, but not if there are fewer than four. If there are fewer than four, take off the appropriate number of [0-9]s from the first line.
It does not handle the case where "NAME" (i.e. the name of the new file you want) starts with a number.
See this site for more information about string manipulation in bash.

script for getting extensions of a file

I need to get all the file extension types in a folder. For instance, if the directory's ls gives the following:
a.t
b.t.pg
c.bin
d.bin
e.old
f.txt
g.txt
I should get this by running the script
.t
.t.pg
.bin
.old
.txt
I have a bash shell.
Thanks a lot!
See the BashFAQ entry on ParsingLS for a description of why many of these answers are evil.
The following approach avoids this pitfall (and, by the way, completely ignores files with no extension):
shopt -s nullglob
for f in *.*; do
printf '%s\n' ".${f#*.}"
done | sort -u
Among the advantages:
Correctness: ls behaves inconsistently and can result in inappropriate results. See the link at the top.
Efficiency: Minimizes the number of subprocesses invoked (only one, sort -u, and even that could be removed if we used Bash 4's associative arrays to store results)
Things that still could be improved:
Correctness: this will correctly discard newlines in filenames before the first . (which some other answers won't) -- but filenames with newlines after the first . will be treated as separate entries by sort. This could be fixed by using nulls as the delimiter, or by the aforementioned bash 4 associative-array storage approach.
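A sketch of the associative-array variant mentioned above (assumes bash 4 or newer; the output order is arbitrary, so pipe through sort if you need it sorted):
shopt -s nullglob
declare -A seen
for f in *.*; do
  seen[".${f#*.}"]=1     # key is the extension; the value doesn't matter
done
(( ${#seen[@]} )) && printf '%s\n' "${!seen[@]}"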
try this:
ls -1 | sed 's/^[^.]*\(\..*\)$/\1/' | sort -u
ls lists files in your folder, one file per line
sed magic extracts extensions
sort -u sorts extensions and removes duplicates
sed magic reads as:
s/ / /: substitutes whatever is between first and second / by whatever is between second and third /
^: match beginning of line
[^.]: match any character that is not a dot
*: match it as many times as possible
\( and \): remember whatever is matched between these two parentheses
\.: match a dot
.: match any character
*: match it as many times as possible
$: match end of line
\1: this is what has been matched between parentheses
People are really over-complicating this - particularly the regex:
ls | grep -o "\..*" | uniq
ls - get all the files
grep -o "\..*" - -o only show the match; "\..*" match at the first "." & everything after it
uniq - collapse duplicate lines (adjacent ones only, which works here because ls output is sorted) while keeping the original order
you can also sort if you like, but sorting doesn't match the example
This is what happens when you run it:
> ls -1
a.t
a.t.pg
c.bin
d.bin
e.old
f.txt
g.txt
> ls | grep -o "\..*" | uniq
.t
.t.pg
.bin
.old
.txt
