Convert field names to lower case using miller - unix-text-processing

I would like to use miller (mlr) to convert column names to lower case. The closest I get is using the rename verb with a regular expression. \L should change the case, but instead the column names are getting prefixed by "\L".
I'm using macOS Catalina and miller 5.10.0
echo -e 'A,B,C\n1,2,3' | mlr --csv --opprint rename -r '(.*),\L\1'
prints
\LA \LB \LC
1 2 3
But I would like it to print
a b c
1 2 3

Two example approaches:
echo -e 'A,B,C\n1,2,3' | mlr --csv put '
map inrec = $*;
$* = {};
for (oldkey, value in inrec) {
newkey = tolower(oldkey);
$[newkey] = value;
}
'
or
echo -e 'A,B,C\n1,2,3' | mlr --csv -N put -S 'if (NR == 1) {for (k in $*) {$[k] = tolower($[k])}}'

Sometimes, standard tools are easier to use:
echo -e 'A,B,C\n1,2,3' | awk 'NR == 1 {print tolower($0); next} 1'
UPDATE
with Miller:
echo -e 'A,B,C\n1,2,3' |
mlr --csv -N put 'NR == 1 {for (k,v in $*) {$[k] = tolower(v)}}'
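For machines without Miller, the awk approach above can be wrapped in a small function. This is a minimal sketch: the name lower_header is made up for illustration, and it lowercases the entire first line of each input file, so it assumes the header holds only field names.

```shell
# Lowercase the header (first line) of each CSV file given as an argument,
# or of standard input when no file is given. All other lines pass through.
# lower_header is a hypothetical helper name, not part of Miller or awk.
lower_header() {
    awk 'FNR == 1 { print tolower($0); next } { print }' "$@"
}

printf 'A,B,C\n1,2,3\n' | lower_header
```

Using FNR instead of NR means each file's own header is lowercased when several files are passed at once.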

Related

Shell script which reads each line of a csv file and counts the number of each column

I want a script that reads each row of a CSV file called sample.csv and counts the number of fields in each row. If the count exceeds a threshold (here 14), it should store either the whole line or just two fields of that line in another file (Hello.bsd). The script I wrote is below:
while read -r line
do
echo "$line" > tmp.kk
count= $(awk -F, '{ print NF; exit }' ~/tmp.kk)
if [ "$count" -gt 14 ]; then
field1=$(echo "$line" | awk -F',' '{printf "%s", $1}' | tr -d ',')
field2=$(echo "$line" | awk -F',' '{printf "%s", $2}' | tr -d ',')
echo "$field1 $field2" >> Hello.bsd
fi
done < ~/sample.csv
There is no output from the above code.
I would be so grateful if you could help me in this regard.
Best regards,
sina
FOR JUST FIRST 2 FIELDS
< sample.csv |
mawk 'NF=(_=(+__<NF))+_' FS=',' __="14" # enter constant or shell variable
SAMPLE OUTPUT
echo "${a}"
04z,Y7N,=TT,WLq,n54,cb8,qfy,LLG,ria,hIQ,Mmd,8N2,FK=,7a9,
us6,ck6,LvI,tnY,CQm,wBp,gPH,8ly,JAH,Phv,uwm,x1r,MF1,ide,
03I,GEs,Mok,BxK,z2D,IUH,VWn,Zb7,TkP,Ddt,RE9,mv2,XyD,tr5,
A2t,u0z,MLi,3RF,es1,goz,G0S,l=h,8Ka,coN,vHP,snk,tTV,xNF,
RiU,yBI,QrS,N6D,fWG,oOr,CwZ,9lb,f8h,g5I,c1u,D3X,kOo,lKG,
CSj,da4,Y54,S7R,AEj,Vqx,Fem,sqn,l4Z,YEA,OKe,6Bu,0xU,hGc,
1X8,jUD,XZM,pMc,Q6V,piz,6jp,SJp,E3W,zgJ,BuW,5wd,qVg,wBy,
TQC,O9k,RJ9,fie,2AV,XZ4,meR,tEC,U7v,JWH,LTs,ngF,3A3,ZPa,
ONJ,Phw,jrp,UvY,9Kb,qxf,57f,yHo,a0Q,2S=,=Ob,l1b,XjC
echo "${a}" | mawk 'NF=(_=(+__<NF))+_' FS=',' __="14"
04z Y7N
us6 ck6
03I GEs
A2t u0z
RiU yBI
CSj da4
1X8 jUD
TQC O9k
Note that the last line is not printed because it does not exceed the NF threshold.
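The golfed one-liner above is hard to read, so here is a more explicit sketch of the same filter. It assumes plain POSIX awk and comma-separated input without quoted commas; the function name first_two_if_wide is hypothetical. It also sidesteps two problems in the question's loop: `count= $(...)` (with a space after `=`) leaves count empty, and the line is written to tmp.kk in the current directory but read back from ~/tmp.kk.

```shell
# Print the first two fields of every line that has more than `min` fields.
# first_two_if_wide is a hypothetical name; min defaults to 14 as in the question.
# Arguments: input file, optional threshold.
first_two_if_wide() {
    awk -F',' -v min="${2:-14}" 'NF > min { print $1, $2 }' "$1"
}

# Usage, with the file names from the question:
# first_two_if_wide ~/sample.csv > Hello.bsd
```

To store the whole line instead of two fields, the action can be reduced to `{ print }`.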

diff two arrays each containing files paths into a third array (for removal)

In the function below you will see notes on several attempts to solve this problem; each attempt has a note indicating what went wrong. Between my attempts there is a line from another question here which purports to solve some element of the matter. Again, I've added a note indicating what that is supposed to solve. My brain is mush at this point. What is the stupid simple thing I'm overlooking?
function func_removeDestinationOrphans() {
readarray -d '' A_Destination_orphans < <( find "${directory_PMPRoot_destination}" -type f -print0 )
for (( i = 0 ; i < ${#A_Destination_orphans[@]} ; i++ )) ; do
printf '%s\n' "→ ${A_Destination_orphans[${i}]}" # path to each track
done
printf '%b\n' ""
# https://stackoverflow.com/questions/2312762/compare-difference-of-two-arrays-in-bash
# echo ${Array1[@]} ${Array2[@]} | tr ' ' '\n' | sort | uniq -u ## original
# Array3=(`echo ${Array1[@]} ${Array2[@]} | tr ' ' '\n' | sort | uniq -u `) ## store in array
# A_Destination_orphans_diff=(`echo "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" | tr ' ' '\n' | sort | uniq -u `) # drops file path after space
# printf "%s\0" "${Array1[@]}" "${Array2[@]}" | sort -z | uniq -zu ## newlines and white spaces
# A_Destination_orphans_diff=($( printf "%s\0" "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" | sort -z | uniq -zu )) # throws warning and breaks at space but not newline
# printf '%s\n' "${Array1[@]}" "${Array2[@]}" | sort | uniq -u ## manage spaces
# A_Destination_orphans_diff=($( printf '%s\n' "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" | sort | uniq -u )) # breaks at space and newline
# A_Destination_orphans_diff="($( printf '%s\n' "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" | sort | uniq -u ))" # creates string surrounded by ()
# A_Destination_orphans_diff=("$( printf '%s\n' "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" | sort | uniq -u )") # creates string
# A_Destination_orphans_diff=($( printf '%s\n' ${A_Destination_dubUnders[@]} ${A_Destination_orphans[@]} | sort | uniq -u )) # drops file path after space
for (( i = 0 ; i < ${#A_Destination_orphans_diff[@]} ; i++ )) ; do
printf '%s\n' "→ ${A_Destination_orphans_diff[${i}]}" # path to each track
done
printf '%b\n' ""
for (( i = 0 ; i < ${#A_Destination_orphans_diff[@]} ; i++ )) ; do
echo # rm "${A_Destination_orphans_diff[i]}"
done
func_EnterToContinue
}
This throws a warning and splits entries at spaces because the array is built by direct assignment from an unquoted command substitution. The shell drops the null bytes (hence the warning) and then word-splits what is left, so an entry containing a space is broken into several elements.
A_Destination_orphans_diff=($( printf "%s\0" "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" | sort -z | uniq -zu ))
To avoid this, read the null-delimited stream into the array with mapfile/readarray:
mapfile -t -d '' A_Destination_orphans_diff < <(
printf "%s\0" "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" |
sort -z |
uniq -zu
)
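As a small self-contained check of the mapfile approach above, the following sketch uses two made-up arrays whose entries contain spaces. It assumes bash 4.4 or later (for mapfile -d) and sort/uniq implementations that accept -z, as the commands above already do. Note that uniq -u yields the symmetric difference, so an entry present only in the first array would be reported as well.

```shell
#!/usr/bin/env bash
# Requires bash 4.4+ for `mapfile -d`, and sort/uniq with -z support.
# Array names and sample paths below are made up for illustration.
keep=( "music/a b.mp3" "music/c.mp3" )
found=( "music/a b.mp3" "music/c.mp3" "music/old track.mp3" )

# Entries that occur exactly once across both arrays, read back NUL-delimited
# so spaces (and newlines) inside a path never split an element.
mapfile -t -d '' diff < <(
    printf '%s\0' "${keep[@]}" "${found[@]}" | sort -z | uniq -zu
)

printf '%s\n' "${#diff[@]} orphan(s):" "${diff[@]}"
```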
In case your shell version is too old to support mapfile -d (added in bash 4.4), you can perform the same task with IFS=$'\37' read -r -d '' -a array.
$'\37' is the shell's C-style string syntax for octal code 37, which is ASCII 31, US (Unit Separator):
IFS=$'\37' read -r -d '' -a A_Destination_orphans_diff < <(
printf "%s\0" "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" |
sort -z |
uniq -zu |
xargs -0 printf '%s\37'
)
To remove all files not present in A_Destination_dubUnders array you could:
func_removeDestinationOrphans() {
find "${directory_PMPRoot_destination}" -type f -print0 |
sort -z |
join -z -v1 -t '' - <(printf "%s\0" "${A_Destination_dubUnders[@]}" | sort -z) |
xargs -0 echo rm
}
Use join or comm to find elements that are present in one list but not in another. I am usually wrong about -v1, so try -v2 if it echoes the elements from the wrong list (I do not understand whether you want to remove files present in the A_Destination_dubUnders list or files not present in it; you did not specify that).
Note that function name() is a mix of the ksh and POSIX function definition styles. Just use name() {. See the Bash Hackers Wiki page on obsolete syntax.
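The comm variant of the idea above can be sketched as follows. This is an illustration under stated assumptions, not the answer's exact method: the helper name only_in_first is hypothetical, and the lists are newline-delimited files, so file names containing newlines are not supported.

```shell
# List lines that appear in the first file but not in the second.
# only_in_first is a hypothetical helper; it expects one path per line.
only_in_first() {
    a=$(mktemp) && b=$(mktemp) || return 1
    sort "$1" > "$a"
    sort "$2" > "$b"
    comm -23 "$a" "$b"   # -2 drops lines only in the second file, -3 drops common lines
    rm -f "$a" "$b"
}

# Usage (file names are placeholders):
# only_in_first files_on_disk.txt files_in_playlist.txt
```

Both inputs must be sorted the same way for comm to work, which is why the function sorts them itself before comparing.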
Here is the working version with modifications thanks to suggested input from the first two respondents (thanks!).
function func_removeDestinationOrphans() {
printf '%s\n' " → Purge playlist orphans: " ""
printf '%b\n' "First we will remove any files not present in your proposed playlist. "
func_EnterToContinue
bash_version="$( bash --version | head -n1 | cut -d " " -f4 | cut -d "(" -f1 )"
if printf '%s\n' "4.4.0" "${bash_version}" | sort -V -C ; then
readarray -d '' A_Destination_orphans < <( find "${directory_PMPRoot_destination}" -type f -print0 ) # readarray or mapfile -d fails before bash 4.4.0
readarray -t -d '' A_Destination_orphans_diff < <(
printf "%s\0" "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" |
sort -z |
uniq -zu
)
else
while IFS= read -r -d $'\0'; do
A_Destination_orphans+=( "$REPLY" )
done < <( find "${directory_PMPRoot_destination}" -type f -print0 )
IFS=$'\37' read -r -d '' -a A_Destination_orphans_diff < <(
printf "%s\0" "${A_Destination_dubUnders[@]}" "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" |
sort -z |
uniq -zu |
xargs -0 printf '%s\37'
)
fi
if [[ ! "${A_Destination_orphans_diff[*]}" = '' ]] ; then
for (( i = 0 ; i < ${#A_Destination_orphans_diff[@]} ; i++ )) ; do
rm "${A_Destination_orphans_diff[i]}"
done
fi
}
If you would like to see the entire Personal Music Player sync script, you can find that via my GitHub.

Replacing column 2 in original with column 2 in new

I have a file containing thousands of original results and a file containing hundreds of new results. Only column 2 of new is different from the original. I also need to keep original results that haven't been changed. How should I go about doing this? Is it possible to create a file3 containing the original results which did not change and the new results? See below for an example.
Original   New     file3
1:1:1      2:5:2   1:1:1
2:2:2      3:4:3   2:5:2
3:3:3      5:9:5   3:4:3
4:4:4      6:8:6   4:4:4
5:5:5              5:9:5
6:6:6              6:8:6
7:7:7              7:7:7
awk
awk -F':' '{a[$1]=$0}END{for(i in a) print a[i]}' Original_file new_file | sort
Original_file new_file - read both files, in that order.
For each line of each file:
1) -F':' - use : as the field separator
2) a[$1]=$0 - store the whole line in a hash keyed by the first column; if the key already exists, the new line overwrites the old value
3) for(i in a) print a[i] - print the hash values (in unspecified order)
4) sort - sort the results
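If the order of the original file must be kept without a final sort, a two-pass variant of the same hash idea is possible. This is a sketch, assuming the new file is non-empty and that every key in it also exists in the original; merge_by_key is a hypothetical name. Unlike the command above, keys that occur only in the new file are not added to the output.

```shell
# Print Original with every line whose first field also occurs in New
# replaced by the line from New. Output keeps the order of Original.
# merge_by_key is a hypothetical name; arguments: original file, new file.
merge_by_key() {
    awk -F':' '
        NR == FNR { new[$1] = $0; next }   # first file (New): remember lines by key
        $1 in new { print new[$1]; next }  # key was updated: print the new line
        { print }                          # otherwise keep the original line
    ' "$2" "$1"
}

# Usage with the file names from the question:
# merge_by_key Original New > file3
```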
You can use the diff command between the old file and the new file.
diff -y Original.txt New.txt
Original New
1:1:1 1:1:1
2:2:2 | 2:5:2
3:3:3 | 3:4:3
4:4:4 4:4:4
5:5:5 | 5:9:5
6:6:6 | 6:8:6
7:7:7 7:7:7
For each line, if it contains the character "|", use awk to take the value from the new file. Otherwise take the value from either side, since both are equal.
Try something like this:
number_of_lines_pipe=$(diff -y Original.txt New.txt | grep -e "|" | wc -l)
number_of_lines_without_pipe=$(diff -y Original.txt New.txt | grep -v "|" | wc -l)
for ((i = 1; i <= $number_of_lines_pipe; i++))
do
line=$(diff -y Original.txt New.txt | grep -e "|" | sed -n $i'p')
echo "$line" | awk -F"|" '{ print $2 }' | sed 's/\t *//' >> File3.log
done
for ((i = 1; i <= $number_of_lines_without_pipe; i++))
do
line=$(diff -y Original.txt New.txt | grep -v "|" | sed -n $i'p')
echo "$line" | awk -F" " '{ print $1 }' >> File3.log
done

Array job gives error in bash

As I want to run several simulations with different values in R, I have been recommended to use a job array in bash.
1) I generated the combination of parameters and saved it in a txt file, called parameters.txt.
2) I want now to use each combination of parameters into R. Each combination is represented by a line of 3 numbers (the 3 parameters) in parameters.txt.
When I run my script, an error message appears :
head: parameters.txt: invalid number of lines
head: parameters.txt: invalid number of lines
head: parameters.txt: invalid number of lines
Job array item : rx=, ry=, rz=
Here is my script:
# Sweeping parameters.txt
N=${SLURM_ARRAY_TASK_ID}
rx=`head -n ${N} parameters.txt | tail -n 1 | cut -d' ' -f1`
ry=`head -n ${N} parameters.txt | tail -n 1 | cut -d' ' -f2`
rz=`head -n ${N} parameters.txt | tail -n 1 | cut -d' ' -f3`
# Display
echo "Job array item $N: rx=$rx, ry=$ry, rz=$rz"
echo "---------------------------------"
# Run
R CMD BATCH ex.R $rx $ry $rz
It seems SLURM_ARRAY_TASK_ID is not set, so N is empty here:
N=${SLURM_ARRAY_TASK_ID}
Bash then expands the command to
rx=`head -n parameters.txt ...
You can wrap the commands in an if statement as follows:
N=${SLURM_ARRAY_TASK_ID}
if [ -n "${N}" ]; then
rx=`head -n ${N} parameters.txt | tail -n 1 | cut -d' ' -f1`
ry=`head -n ${N} parameters.txt | tail -n 1 | cut -d' ' -f2`
rz=`head -n ${N} parameters.txt | tail -n 1 | cut -d' ' -f3`
# Display
echo "Job array item $N: rx=$rx, ry=$ry, rz=$rz"
echo "---------------------------------"
# Run
R CMD BATCH ex.R $rx $ry $rz
else
echo "SLURM_ARRAY_TASK_ID / N is None"
fi
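The three head | tail | cut pipelines can also be collapsed into one lookup plus a single read. The sketch below is an illustration, with a hypothetical helper name nth_line and a throwaway parameter file; in the real job script the line number would come from SLURM_ARRAY_TASK_ID, which Slurm sets only for jobs submitted as an array (for example with sbatch --array=1-10, where the range is a made-up value).

```shell
# Print line N of a file, failing loudly when N is missing or not a number.
# nth_line is a hypothetical helper name; arguments: line number, file.
nth_line() {
    case "$1" in
        ''|*[!0-9]*) echo "nth_line: invalid line number '$1'" >&2; return 1 ;;
    esac
    awk -v n="$1" 'NR == n { print; exit }' "$2"
}

# Demonstration with a temporary parameter file (sample values are made up).
params=$(mktemp)
printf '%s\n' '0.1 0.2 0.3' '1 2 3' > "$params"
read -r rx ry rz <<EOF
$(nth_line 2 "$params")
EOF
echo "rx=$rx, ry=$ry, rz=$rz"
rm -f "$params"
```

In the job script, `nth_line "${SLURM_ARRAY_TASK_ID:-}" parameters.txt` would then report an error instead of silently producing empty parameters.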

Splitting files into multiple files based on some pattern and take some information

I'm working with a lot of files with this structure:
BEGIN
TITLE=id=PRD000012;PRIDE_Exp_Complete_Ac_1645.xml;spectrum=1393
PEPMASS=946.3980102539062
CHARGE=3.0+
USER03=
SEQ=DDDIAAL
TAXONOMY=9606
272.228 126847.000
273.252 33795.000
END
BEGIN IONS
TITLE=id=PRD000012;PRIDE_Exp_Complete_Ac_1645.xml;spectrum=1383
PEPMASS=911.3920288085938
CHARGE=2.0+
USER03=
SEQ=QGKFEAAETLEEAAMR
TAXONOMY=9606
1394.637 71404.000
1411.668 122728.000
END
BEGIN IONS
TITLE=id=PRD000012;PRIDE_Exp_Complete_Ac_1645.xml;spectrum=2965
PEPMASS=946.3900146484375
CHARGE=3.0+
TAXONOMY=9606
1564.717 92354.000
1677.738 33865.000
END
This structure is repeated thousands of times but with different data inside. As you can see, between some begin-end, sometimes SEQ and USER03 are not there. This is because the protein is not identified ... And here comes my problem.
I would like to know how many proteins are identified and how many are unidentified. To do this I was trying this:
for i in $(ls *.txt ); do
echo $i
awk '/^BEGIN/{n++;w=1} n&&w{print > "./cache/out" n ".txt"} /^END/{w=0}' $i
done
I found this here (Split a file into multiple files based on a pattern and name the new files by the search pattern in Unix?)
And then use the outputs and classify them:
for i in $(ls cache/*.txt ); do
echo $i
if grep -q 'SEQ' $i; then
mv $i ./archive_identified
else
mv $i ./archive_unidentified
fi
done
After this, I'd like to take some data (Example: spectrum, USER03, SEQ, TAXONOMY) from classified files.
for i in $( ls archive_identified/*.txt ); do
echo $i
grep 'SEQ' $i | cut -d "=" -f2- | tr ',' '\n' >> ./sequences_ide.txt
grep 'TAXONOMY' $i | cut -d "=" -f2- | tr ',' '\n' >> ./taxonomy_ide.txt
grep 'USER' $i | cut -d "=" -f2- >> ./modifications_ide.txt
grep 'TITLE' $i | sed 's/^.*\(spectrum.*\)/\1/g' | cut -d "=" -f2- >> ./spectrum.txt
done
for i in $( ls archive_unidentified/*.txt ); do
echo $i
grep 'SEQ' $i | cut -d "=" -f2- | tr ',' '\n' >> ./sequences_unide.txt
grep 'TAXONOMY' $i | cut -d "=" -f2- | tr ',' '\n' >> ./taxonomy_unide.txt
grep 'USER' $i | cut -d "=" -f2- >> ./modifications_unide.txt
grep 'TITLE' $i | sed 's/^.*\(spectrum.*\)/\1/g' | cut -d "=" -f2- >> ./spectrum_unide.txt
done
The problem is that the first part of the script takes too much time due to the large size of the data (12-15 GB). Is there an easier way to do this?
Thank you in advance.
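You can do it all in one awk script. awk iterates over all records itself, so you don't need an external loop. With -v RS= awk reads in paragraph mode, so this assumes the blocks are separated by blank lines and each block becomes one record. For example, for the data file you provided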
$ awk -v RS= '/\nSEQ/ {seq++; print > "file_path_with_seq" NR ".txt"; next}
{noseq++; print > "file_path_without_seq" NR ".txt"}
END { print "with seq:", seq;
print "without seq:", noseq}' file
will print
with seq: 2
without seq: 1
and produces the files
$ head file_path_with*
==> file_path_with_seq1.txt <==
BEGIN
TITLE=id=PRD000012;PRIDE_Exp_Complete_Ac_1645.xml;spectrum=1393
PEPMASS=946.3980102539062
CHARGE=3.0+
USER03=
SEQ=DDDIAAL
TAXONOMY=9606
272.228 126847.000
273.252 33795.000
END
==> file_path_with_seq2.txt <==
BEGIN IONS
TITLE=id=PRD000012;PRIDE_Exp_Complete_Ac_1645.xml;spectrum=1383
PEPMASS=911.3920288085938
CHARGE=2.0+
USER03=
SEQ=QGKFEAAETLEEAAMR
TAXONOMY=9606
1394.637 71404.000
1411.668 122728.000
END
==> file_path_without_seq3.txt <==
BEGIN IONS
TITLE=id=PRD000012;PRIDE_Exp_Complete_Ac_1645.xml;spectrum=2965
PEPMASS=946.3900146484375
CHARGE=3.0+
TAXONOMY=9606
1564.717 92354.000
1677.738 33865.000
END
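If only the counts are needed, or if the blocks are not separated by blank lines, a state-based variant avoids writing any per-block files. This is a minimal sketch that relies only on the BEGIN and END marker lines; the function name count_identified is hypothetical.

```shell
# Count BEGIN..END blocks with and without a SEQ= line in a single pass.
# count_identified is a hypothetical name; it takes one or more input files.
count_identified() {
    awk '
        /^BEGIN/ { has_seq = 0 }                 # new block: nothing seen yet
        /^SEQ=/  { has_seq = 1 }                 # block carries a sequence
        /^END/   { if (has_seq) ident++; else unident++ }
        END      { printf "identified: %d\nunidentified: %d\n", ident, unident }
    ' "$@"
}

# Usage (all input files in one awk process):
# count_identified *.txt
```

Because nothing is written per block, the cost is a single read of the input, which should matter for files in the 12-15 GB range mentioned in the question.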
