Truncate NUL bytes off a file

I have about 500 files with trailing NUL bytes, maybe produced with
truncate -s 8M <file>
How can I cut off the zeroes?

This Perl script should do it:
for f in *; do
    perl -e '$/=undef; $_=<>; s|\0+$||; print;' < "$f" > "${f}_fixed"
done
This will keep all NULs within the file, remove any at the end, and save the result into <original filename>_fixed. Note the braces in "${f}_fixed": without them, bash would look for a variable named f_fixed, which is unset.
Script explanation: $/=undef tells perl to operate on the whole file rather than splitting it into lines; $_=<> loads the file; s|\0+$|| removes any run of NULs at the end of the loaded file 'string'; and print outputs the result. The rest is standard Bash file redirection.
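To sanity-check a fixed file, here is a minimal sketch using tail and od (both standard tools); any remaining trailing NULs would show up as \0 in the output:
tail -c 16 "${f}_fixed" | od -c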

If the file is a "text" file and not a "binary" file, you can simply do
strings a.txt > b.txt
Be aware, though, that strings keeps only runs of printable characters, so it can alter the file in more ways than just dropping NULs.

Use tr:
tr -d '\0' < "$input_file" > "$output_file"
Note that this deletes every NUL in the file, not just trailing ones, and that $input_file and $output_file must be different files.

Following the suggestion of @Eevee, you can actually avoid padding the files below 8M in the first place. Using the following condition in your loop, and the fact that truncate assumes bytes as the default unit if you don't append a suffix to the size parameter, the files below 8M won't be padded:
for file in directory/*; do
    # ...
    SIZE=$(stat -c%s "$file")
    LIMIT=$((8 * 1024 * 1024))
    if [ "$SIZE" -lt "$LIMIT" ]; then
        truncate -s "$SIZE" "$file"
    else
        truncate -s 8M "$file"
    fi
    # ...
done

There isn't really a Unix tool for this particular case. Here's a Python (3) script:
import sys

for fn in sys.argv[1:]:
    with open(fn, 'rb') as f:
        contents = f.read()
    with open(fn, 'wb') as f:
        f.write(contents.rstrip(b'\0'))
Run it as:
python retruncate.py file1 file2 files* etc...

Related

Read txt file into an array in sh script

My configs.txt file is in the format
example1
example2
example3
I would like to read a text file configs.txt into an array and then loop through the array to perform a tar command to create a separate tarfile for each entry like example1.gz etc.
Not sure why you need an array:
while read p; do
echo "$p";
# tar here
done < configs.txt
Or, with a for loop (note this splits entries on whitespace):
for p in $(cat configs.txt); do
echo "$p";
# tar here
done
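To fill in the # tar here placeholder, a minimal sketch, assuming each entry in configs.txt names an existing file or directory to archive (the question's example1.gz suggests gzip-compressed output is wanted):
while read -r p; do
    tar -czf "$p.tar.gz" "$p"
done < configs.txt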

How can I use sed to make thousands of substitutions in a file using a reference file?

I have a big file with two columns like this:
tiago@tiago:~/$ head Ids.txt
TRINITY_DN126999_c0_g1_i1 ENSMUST00000040656.6
TRINITY_DN126999_c0_g1_i1 ENSMUST00000040656.6
TRINITY_DN126906_c0_g1_i1 ENSMUST00000126770.1
TRINITY_DN126907_c0_g1_i1 ENSMUST00000192613.1
TRINITY_DN126988_c0_g1_i1 ENSMUST00000032372.6
.....
and I have another file with data, like this:
"baseMean" "log2FoldChange" "lfcSE" "stat" "pvalue" "padj" "super" "sub" "threshold"
"TRINITY_DN41319_c0_g1" 178.721774751278 2.1974294626636 0.342621318593487 6.41358066008381 1.4214085388179e-10 5.54686423073089e-08 TRUE FALSE "TRUE"
"TRINITY_DN87368_c0_g1" 4172.76139849472 2.45766387851112 0.404014016558211 6.08311538160958 1.17869459181235e-09 4.02673069375893e-07 TRUE FALSE "TRUE"
"TRINITY_DN34622_c0_g1" 39.1949851245197 3.28758092748061 0.54255370348027 6.05945716781964 1.3658169042862e-09 4.62597265729593e-07 TRUE FALSE "TRUE"
.....
I was thinking of using sed to perform a translation of the values in the first column of the data file, using the first file as a dictionary.
That is, considering each line of the data file in turn, if the value in the first column matches a value in the first column of the dictionary file, then a substitution would be made; otherwise, the line would simply be printed.
Any suggestions would be appreciated.
You can turn your first file Ids.txt into a sed script:
$ sed -r 's| *(\S+) (\S+)|s/^"\1/"\2/|' Ids.txt > repl.sed
$ cat repl.sed
s/^"TRINITY_DN126999_c0_g1_i1/"ENSMUST00000040656.6/
s/^"TRINITY_DN126999_c0_g1_i1/"ENSMUST00000040656.6/
s/^"TRINITY_DN126906_c0_g1_i1/"ENSMUST00000126770.1/
s/^"TRINITY_DN126907_c0_g1_i1/"ENSMUST00000192613.1/
s/^"TRINITY_DN126988_c0_g1_i1/"ENSMUST00000032372.6/
This removes leading spaces and makes each line into a substitution command.
Then you can use this script to do the replacements in your data file:
sed -f repl.sed datafile
... with redirection to another file, or in-place with sed -i.
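For example (the output file name here is just illustrative):
sed -f repl.sed datafile > datafile.translated
# or, with GNU sed, edit the data file in place:
sed -i -f repl.sed datafile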
If you don't have GNU sed, you can use this POSIX conformant version of the first command:
sed 's| *\([^ ]*\) \([^ ]*\)|s/^"\1/"\2/|' Ids.txt
This uses basic instead of extended regular expressions and uses [^ ] for "not space" instead of \S.
Since the first file (the dictionary file) is large, using sed may be very slow; a much faster and not much more complex approach would be to use awk as follows:
awk -v col=1 -v dict=Ids.txt '
BEGIN { while ((getline < dict) > 0) a["\""$1"\""] = "\""$2"\"" }
$col in a { $col = a[$col] } { print }' datafile
(Here, Ids.txt is the dictionary file, "col" is the column number of the field of interest, and datafile is the data file to be translated.)
This approach also has the advantage of not requiring any modification to the dictionary file.
#!/bin/bash
# Declare hash table
declare -A Ids
# Go through the first input file and add key-value pairs to the hash table
while read -r Id; do
    key=$(echo "$Id" | cut -d " " -f1)
    value=$(echo "$Id" | cut -d " " -f2)
    Ids+=([$key]=$value)
done < "$1"
# Go through the second input file and replace every first column with
# the corresponding value in the hash table, if it exists
while read -r line; do
    first_col=$(echo "$line" | cut -d '"' -f2)
    new_id=${Ids[$first_col]}
    if [ -n "$new_id" ]; then
        sed -i "s/$first_col/$new_id/g" "$2"
    fi
done < "$2"
I would call the script as
./script.sh Ids.txt data.txt

First line of every file in a new file

How can I get the first line of EVERY file in a directory and save them all in a new file?
#!/bin/bash
rm FIRSTLINE
for file in "$(find $1 -type f)";
do
head -1 $file >> FIRSTLINE
done
cat FIRSTLINE
This is my bash script, but when I run it and open the file FIRSTLINE,
I see this:
==> 'path of the file' <==
'first line' of the file
and this for all the files in my argument.
Does anybody have a solution?
find . -type f -exec head -1 \{\} \; > YOURFILE
might work for you.
The problem is that you've quoted the output of find so it gets treated as a single string, so the for loop only runs once, with a single argument containing all the files. That means you run head -1 file1 file2 file3 file4 ... etc. and when given multiple files head prints the ==> file1 <== headers.
So to fix it, remove the double quotes around the find shell-out, which ensures you run the for loop once for each file, as intended. Also, the semi-colon after the shell-out is unnecessary.
#!/bin/bash
rm FIRSTLINE
for file in $(find $1 -type f)
do
head -1 $file >> FIRSTLINE
done
cat FIRSTLINE
This has some style issues though, do you really need to write to a file then cat the file to stdout? You could just print the output to stdout:
#!/bin/bash
for file in $(find $1 -type f)
do
head -1 $file
done
Personally I'd write it like this:
find $1 -type f | xargs -L1 head -1
or if you need the output in the file and printed to stdout:
find $1 -type f | xargs -L1 head -1 | tee FIRSTLINE
for file in $(find "$1" -type f); do
    echo ''
    echo "$file"
    head -n 4 "$file"
done
For gzip files, for instance:
for file in *.gz; do gzcat "$file" | head -n 1; done > toto.txt

Store the output of find command in an array [duplicate]

This question already has answers here:
How can I store the "find" command results as an array in Bash
How do I put the result of find $1 into an array?
In a for loop:
for /f "delims=/" %%G in ('find $1') do %%G | cut -d\/ -f6-
I want to cry.
In bash:
file_list=()
while IFS= read -d $'\0' -r file ; do
    file_list=("${file_list[@]}" "$file")
done < <(find "$1" -print0)
echo "${file_list[@]}"
file_list is now an array containing the results of find "$1".
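A quick way to confirm the array was populated:
echo "${#file_list[@]} files found"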
What's special about "field 6"? It's not clear what you were attempting to do with your cut command.
Do you want to cut each file after the 6th directory?
for file in "${file_list[@]}" ; do
    echo "$file" | cut -d/ -f6-
done
But why "field 6"? Can I presume that you actually want to return just the last element of the path?
for file in "${file_list[@]}" ; do
    echo "${file##*/}"
done
Or even
echo "${file_list[@]##*/}"
which will give you the last path element for each path in the array. You could even do something with the result:
for file in "${file_list[@]##*/}" ; do
    echo "$file"
done
Explanation of the bash program elements:
(One should probably use the builtin readarray instead)
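For instance, a minimal readarray sketch, assuming bash 4.4 or later for the -d option:
readarray -d '' -t file_list < <(find "$1" -print0)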
find "$1" -print0
Find stuff and 'print the full file name on the standard output, followed by a null character'. This is important as we will split that output by the null character later.
<(find "$1" -print0)
"Process Substitution" : The output of the find subprocess is read in via a FIFO (i.e. the output of the find subprocess behaves like a file here)
while ...
done < <(find "$1" -print0)
The output of the find subprocess is read by the while command via <
IFS= read -d $'\0' -r file
This is the while condition:
read
Read one line of input (from the find command). The return value of read is 0 unless EOF is encountered, at which point the while loop exits.
-d $'\0'
...taking the null character as delimiter (see QUOTING in the bash manpage). This matches the null delimiters we produced with -print0 earlier.
-r
backslash is not considered an escape character as it may be part of the filename
file
The result (the first word, which here is the whole line since IFS is empty) is put into the variable file.
IFS=
The read command is run with IFS set to the empty string. IFS is the special variable holding the characters on which read splits input into words; we clear it because we don't want any splitting.
And inside the loop:
file_list=("${file_list[@]}" "$file")
Inside the loop, the file_list array is just grown by $file, suitably quoted.
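The same append is more idiomatically written with the += operator:
file_list+=("$file")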
arrayname=( $(find "$1") )
I don't understand your loop question. If you are asking how to work with that array, then in bash you can loop through all array elements like this:
for element in $(seq 0 $((${#arrayname[@]} - 1)))
do
echo "${arrayname[$element]}"
done
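If you only need the values rather than the indices, you can iterate over them directly:
for element in "${arrayname[@]}"; do
    echo "$element"
done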
This is not 100% foolproof, but it should work 99% of the time (I used the GNU utilities; the BSD utilities won't work without modifications; this was also done on an ext4 filesystem):
declare -a BASH_ARRAY_VARIABLE=$(find <path> <other options> -print0 | sed -e 's/\x0$//' | awk -F'\0' 'BEGIN { printf "("; } { for (i = 1; i <= NF; i++) { printf "%c"gensub(/"/, "\\\\\"", "g", $i)"%c ", 34, 34; } } END { printf ")"; }')
Then you would iterate over it like so:
for FIND_PATH in "${BASH_ARRAY_VARIABLE[@]}"; do echo "$FIND_PATH"; done
Make sure to enclose $FIND_PATH inside double-quotes when working with the path.
Here's a simpler, pipeless version, based on user2618594's answer:
declare -a names=$(echo "("; find <path> <other options> -printf '"%p" '; echo ")")
for nm in "${names[@]}"
do
echo "$nm"
done
To loop through the output of find, you can simply do:
for file in $(find "$1"); do
    echo "$file" | cut -d/ -f6-
done
That's what I understood from your question.

how do I output the contents of a while read line loop to multiple arrays in bash?

I read the files of a directory and put each file name into an array (SEARCH).
Then I loop through each file name in the array (SEARCH), open the file with a while read line loop, and read each line into another array (filecount). My problem is that it ends up as one huge array with 39 lines (each file has 13 lines), and I need it to be 3 separate arrays, where
filecount1[line1] is the first line from the 1st file, and so on. Here is my code so far...
typeset -A files
for file in ${SEARCH[@]}; do
while read line; do
files["$file"]+="$line"
done < "$file"
done
So, thanks Ivan for this example! However, I'm not sure I follow how this puts it into a separate array, because with this example wouldn't all the arrays still be named "files"?
If you're just trying to store the file contents into an array:
declare -A contents
for file in "${SEARCH[@]}"; do
    contents["$file"]=$(< "$file")
done
If you want to store the individual lines in a array, you can create a pseudo-multi-dimensional array:
declare -A contents
for file in "${SEARCH[@]}"; do
    NR=1
    while read -r line; do
        contents["$file,$NR"]=$line
        (( NR++ ))
    done < "$file"
done
for key in "${!contents[@]}"; do
    printf "%s\t%s\n" "$key" "${contents[$key]}"
done
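Individual lines can then be looked up by their composite key; for example (the file name here is hypothetical):
echo "${contents[example.txt,3]}"   # third line of example.txt, assuming it was listed in SEARCH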
Line 6 is
$filecount[$linenum]}="$line"
Seems it is missing a {, right after the $.
Should be:
${filecount[$linenum]}="$line"
If the above is true, then it is trying to run the output as a command.
Line 6 is (after "fixing" it above):
${filecount[$linenum]}="$line"
However, ${filecount[$linenum]} expands to a value, and you can't assign to a value.
Should be:
filecount[$linenum]="$line"
Now I'm confused as to whether the { is actually missing, or the } is the actual typo :S :P
btw, bash supports arithmetic evaluation for this too:
(( filecount++ ))   # no need for $ inside ((...)); ++ increments in place
This should work:
typeset -A files
for file in "${SEARCH[@]}"; do  # for each file
    while read -r line; do      # read each line
        files["$file"]+="$line" # and append it to this file's entry
    done < "$file"              # reading from the current file
done
a small test shows it works
# set up
mkdir -p /tmp/test && cd $_
echo "abc" > a
echo "foo" > b
echo "bar" > c
# read files into arrays
typeset -A files
for file in *; do
    while read -r line; do
        files["$file"]+="$line"
    done < "$file"
done
# print arrays
for file in *; do
    echo "${files[$file]}"
done
# same as:
echo ${files[a]} # prints: abc
echo ${files[b]} # prints: foo
echo ${files[c]} # prints: bar
