bash array not working - arrays

I am rather new to bioinformatics but trying my best to learn. I am running into an issue and I was hoping someone would know what to do, and could explain how this kind of bash tooling for multiple files actually works.
I have a folder with 160 unzipped RNA-seq libraries, each named like name.fastq.
I want to run cutadapt (a program that removes the adapter sequences from my libraries) on all of them at once; for one library, the command looks like this:
python2.6 /imports/home/w/workshop/oibc2013/oibc1/Apps/cutadapt-1.2.1/bin/cutadapt -a name_adapter input_file.fastq > out
So I tried to make a bash array loop to run it on all 160 files I have, but it does not work:
#!/bin/bash
. $HOME/.bashrc
my_array=(*.fastq)
echo ${myarray["SGE_TASK_ID"-1]}
python2.6 \
/imports/home/w/workshop/oibc2013/oibc1/Apps/cutadapt-1.2.1/bin/cutadapt \
-a CTGTCTCTTATACACATCT \
-b AATTGCAGTGGTATCAACGCAGAGCGGCCGC \
-b GCGGCCGCTCTGCGTTGATACCACTGCAATT \
-b AAGCAGTGGTATCAACGCAGAGTACATGGG \
-b CCCATGTACTCTGCGTTGATACCACTGCTT \
inputs.$SGE_TASK_ID \
results.$SGE_TASK_ID]}

Rather than an array, you just want a loop. In this case, since you're matching a glob pattern (*.fastq), a for ... in loop would make sense.
The general syntax is for variable_name in list_of_words; do something_with $variable_name; done;. In your case:
#!/bin/bash
. $HOME/.bashrc
path=/imports/home/w/workshop/oibc2013/oibc1/Apps/cutadapt-1.2.1/bin
for file in *.fastq
do
python2.6 "$path"/cutadapt -a name_adapter "$file" > "$file.out"
done
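If you do want to keep the SGE array-job approach from the question, a minimal sketch would look like the following (assuming the job is submitted with -t 1-160 so that SGE_TASK_ID runs from 1 to 160; the .trimmed output name is only for illustration). The key points are that the array name must match when you index it, and that SGE_TASK_ID is 1-based while bash arrays are 0-based:
#!/bin/bash
. $HOME/.bashrc
my_array=(*.fastq)
# SGE_TASK_ID counts from 1, bash array indices from 0
input="${my_array[$SGE_TASK_ID-1]}"
python2.6 /imports/home/w/workshop/oibc2013/oibc1/Apps/cutadapt-1.2.1/bin/cutadapt \
    -a CTGTCTCTTATACACATCT \
    "$input" > "$input.trimmed"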

Related

Organizing ".bash_history" into ".bash_history.cmd" category files

My script loads a "categories" array with chosen terminal commands.
This array is then used to yank matching ".bash_history" records into
separate category files. My function "extract_records()" extracts each
category using an ERE grep:
BASH_HISTORY="$HOME"/.bash_history
...
# (where "$1" here is a category)
grep -E "^($1 )" "$BASH_HISTORY" >> ".bash_history.$1"
Once the chosen records are grepped from "$BASH_HISTORY" into individual
category files, they are then removed from "$BASH_HISTORY". This is done
using "grep -v -e" patterns where the category list is re-specified.
My script works but a potential problem exists: the list of history
command keywords is defined twice, once in the array and then in a grep
pattern list. Excerpted from the script:
#!/usr/bin/bash
# original array definition.
categories=(apt cat dpkg echo file find git grep less locate)
...
for i in "${categories[@]}"; do
extract_records "$i" # which does the grep -E shown above.
done
...
# now remove what has been categorized to separate files.
grep -E -v \
-e "^(apt )" \
-e "^(cat )" \
-e "^(dpkg )" \
... \
"$BASH_HISTORY" >> "$BASH_HISTORY".$$
# finally the temporary "$$" file is optionally sorted and moved
# back as the main "$BASH_HISTORY".
The first part calls extract_records() each time to grep and create
each category file. The second part uses a single grep to remove
records using a pattern list, re-specified based on the array.
PROBLEM: Potentially, the two independent lists can be mismatched.
Optimally, the array "${categories[@]}" should be used for both parts:
extracting the chosen records, and then rebuilding "$BASH_HISTORY" without
the separated records. This would replace the "grep -E -v" pattern list I
use now. Something of the sort:
grep -E -v "^(${categories[@]})" "$BASH_HISTORY"
It's nice and compact, but this does not work.
The goal is to divide out oft used terminal commands into separate files
so as to keep "$BASH_HISTORY" reasonably small. The separately saved
records can then be recalled using another script that functions like
Bash's internal history facility. In this way, no history is lost
and everything is grouped and better managed.
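One way to keep the keyword list in a single place, sketched here rather than taken from the script itself, is to build the grep alternation from the array (this assumes the category names contain no regex metacharacters):
# Join the array with "|" to build one ERE alternation,
# e.g. "^((apt|cat|dpkg|...) )", so the keyword list lives only in the array.
pattern="^(($(IFS='|'; echo "${categories[*]}")) )"
# Rebuild the history without the categorized records,
# reusing the same array-derived pattern instead of a hand-written list.
grep -E -v -- "$pattern" "$BASH_HISTORY" > "$BASH_HISTORY.$$"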

assign two bash arrays with a single command

youtube-dl can take some time parsing remote sites when called multiple times.
EDIT0 : I want to fetch multiple properties (here fileNames and remoteFileSizes) output by youtube-dl without having to run it multiple times.
I use those two properties to compare the local file size with ${remoteFileSizes[$i]} to tell whether the file has finished downloading.
$ youtube-dl --restrict-filenames -o "%(title)s__%(format_id)s__%(id)s.%(ext)s" -f m4a,18,webm,251 -s -j https://www.youtube.com/watch?v=UnZbjvyzteo 2>errors_youtube-dl.log | jq -r ._filename,.filesize | paste - - > input_data.txt
$ cat input_data.txt
Alan_Jackson_-_I_Want_To_Stroll_Over_Heaven_With_You_Live__18__UnZbjvyzteo__youtube_com.mp4 8419513
Alan_Jackson_-_I_Want_To_Stroll_Over_Heaven_With_You_Live__250__UnZbjvyzteo__youtube_com.webm 1528955
Alan_Jackson_-_I_Want_To_Stroll_Over_Heaven_With_You_Live__140__UnZbjvyzteo__youtube_com.m4a 2797366
Alan_Jackson_-_I_Want_To_Stroll_Over_Heaven_With_You_Live__244__UnZbjvyzteo__youtube_com.webm 8171725
I want the first column in the fileNames array and the second column in the remoteFileSizes array.
For the time being, I use a while read loop, but when the loop is finished my two arrays are lost:
$ fileNames=()
$ remoteFileSizes=()
$ cat input_data.txt | while read fileName remoteFileSize; do \
fileNames+=($fileName); \
remoteFileSizes+=($remoteFileSize); \
done
$ for fileNames in "${fileNames[@]}"; do \
echo PROCESSING....; \
done
$ echo "=> fileNames[0] = ${fileNames[0]}"
=> fileNames[0] =
$ echo "=> remoteFileSizes[0] = ${remoteFileSizes[0]}"
=> remoteFileSizes[0] =
$
Is it possible to assign two bash arrays with a single command?
You assign the variables in a subshell (the right-hand side of the pipe), so they are not visible in the parent shell. Read https://mywiki.wooledge.org/BashFAQ/024. Remove the cat and use a redirection instead to solve your problem.
while IFS=$'\t' read -r fileName remoteFileSize; do
fileNames+=("$fileName")
remoteFileSizes+=("$remoteFileSize")
done < input_data.txt
You might also be interested in https://mywiki.wooledge.org/BashFAQ/001.
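For the title question itself, an alternative that skips the explicit loop (not from the answer above; it assumes bash 4+ for readarray and the tab-separated input_data.txt shown earlier):
# One readarray call per column; cut splits on the tab written by paste.
readarray -t fileNames < <(cut -f1 input_data.txt)
readarray -t remoteFileSizes < <(cut -f2 input_data.txt)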
For what it's worth, if you're looking for specific/bespoke functionality from youtube-dl, I recommend creating your own python scripts using the 'embedded' approach: https://github.com/ytdl-org/youtube-dl/blob/master/README.md#embedding-youtube-dl
You can set your own signal for when a download is finished (text/chime/mail/whatever) and track downloads without having to compare file sizes.

Bash array as argument inside of screen

The code below is not working as I expect it to. It might be because I am doing this all wrong, but I think it may be a quoting issue.
#!/bin/bash
IFS=$'\n'
fortune_lines=($(fortune | fold -w 30))
Screen_Session=$(mainscreen)
Screen_OneLiner=$(screen -p 0 -S ${Screen_Session} -X stuff "`printf "say ${fortune_lines[@]}\r"`")
for var in "${Screen_OneLiner[@]}"
do
echo "${var}"
done
I think I am not quoting something correctly, because when I attempt to execute this I get:
line 5: mainscreen: command not found
[screen is terminating
Essentially I am attempting to add this function (that works)
IFS=$'\n'
fortune_lines=($(fortune | fold -w 30))
To this screen one liner
screen -p 0 -S ${Screen_Session} -X stuff "`printf "say ${fortune_lines[@]}\r"`"
Then have it loop the array
for var in "${ArrayName[@]}"
do
echo "${var}"
done
So I am not sure how far away I am (in code) from what I am trying to do. Any help would be great.
Since feature requests to mark a comment as an answer remain declined, I copy the above solution here.
I managed to get this to work... gist.github.com/4006586 – user1787331
This line is problematic:
Screen_Session=$(mainscreen)
You are using command substitution here, so if mainscreen is not a valid command, you'll get a "command not found" error.
Maybe you mean to use braces instead of parentheses?
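For reference, a minimal sketch of what the script seems to be aiming for (assuming mainscreen is the name of an already-running screen session, so it is assigned as a plain string rather than through command substitution):
#!/bin/bash
IFS=$'\n'
fortune_lines=($(fortune | fold -w 30))
Screen_Session=mainscreen   # a session name, not a command
# Send each wrapped line into window 0 of the session as a "say" command.
for line in "${fortune_lines[@]}"; do
    screen -p 0 -S "$Screen_Session" -X stuff "say $line"$'\r'
done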

makefile loop var pipe

I'm trying to write a loop in a makefile that runs a remote ssh command to obtain the PID of a process so I can kill it.
Like this:
target:
for node in 23 ; do \
echo $$node ; \
ssh user@pc$$node "~/jdk1.6.0_31/bin/jps | grep CassandraDaemon | awk '{print \$$1}'" > $(PID); \
ssh user@pc$$node "kill -9 $(PID); \
done
But I get:
/bin/sh: 3: Syntax error: ";" unexpected
The issue, I think, is storing the PID that the remote ssh command returns (it works well without the > $(PID)).
> redirects into files, not into variables. $() captures in a way you can assign to variables... but is also make syntax, so you need to escape it. You also need to escape it when you use it so that you don't get the make variable instead (no, you can't store it in a make variable).
for node in 23 ; do \
echo $$node ; \
PID=$$(ssh user@pc$$node "~/jdk1.6.0_31/bin/jps | grep CassandraDaemon | awk '{print \$$1}'"); \
ssh user@pc$$node "kill -9 $$PID"; \
done
(assuming one of your many, many edits hasn't changed things too much from when I copied and pasted that to fix it...)
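If pkill is available on the remote hosts (an assumption, not something stated in the question), the lookup and the kill can be collapsed into one remote command, which sidesteps capturing the PID entirely (recipe lines indented with a tab as usual):
target:
	for node in 23 ; do \
	  echo $$node ; \
	  ssh user@pc$$node "pkill -9 -f CassandraDaemon" ; \
	done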

Moving things in terminal based on their name

Edit: I think this has been answered successfully, but I can't check 'til later. I've reformatted it as suggested though.
The question: I have a series of files, each with a name of the form XXXXNAME, where XXXX is some number. I want to move them all to separate folders called XXXX and have them called NAME. I can do this manually, but I was hoping that by naming them XXXXNAME there'd be some way I could tell Terminal (I think that's the right name, but not really sure) to move them there. Something like
mv *NAME */NAME
but where it takes whatever * was in the first case and regurgitates it to the path.
This is on some form of Linux, with a bash shell.
In the real life case, the files are 0000GNUmakefile, with sequential numbering. I'm having to make lots of similar-but-slightly-altered versions of a program to compile and run on a cluster as part of my research. It would probably have been quicker to write a program to edit all the files and put them in the right place to begin with, but I didn't.
This is probably extremely simple, and I should be able to find an answer myself, if I knew the right words. Thing is, I have no formal training in programming, so I don't know what to call things to search for them. So hopefully this will result in me getting an answer, and maybe knowing how to find out the answer for similar things myself next time. With the basic programming I've picked up, I'm sure I could write a program to do this for me, but I'm hoping there's a simple way to do it just using functionality already in Terminal. I probably shouldn't be allowed to play with these things.
Thanks for any help! I can actually program in C and Python a fair amount, but that's through trial and error largely, and I still don't know what I can do and can't do in Terminal.
SO many ways to achieve this.
I find that the old standbys sed and awk are often the most powerful.
ls | sed -rne 's:^([0-9]{4})(NAME)$:mv -iv & \1/\2:p'
If you're satisfied that the commands look right, pipe the command line through a shell:
ls | sed -rne 's:^([0-9]{4})(NAME)$:mv -iv & \1/\2:p' | sh
I put NAME in brackets and used \2 so that if it varies more than your example indicates, you can come up with a regular expression to handle your filenames better.
To do the same thing in gawk (GNU awk, the variant found in most GNU/Linux distros):
ls | gawk '/^[0-9]{4}NAME$/ {printf("mv -iv %s %s/%s\n", $1, substr($0,1,4), substr($0,5))}'
As with the first sample, this produces commands which, if they make sense to you, can be piped through a shell by appending | sh to the end of the line.
Note that with all these mv commands, I've added the -i and -v options. This is for your protection. Read the man page for mv (by typing man mv in your Linux terminal) to see if you should be comfortable leaving them out.
Also, I'm assuming with these lines that all your directories already exist. You didn't mention if they do. If they don't, here's a one-liner to create the directories.
ls | sed -rne 's:^([0-9]{4})(NAME)$:mkdir -p \1:p' | sort -u
As with the others, append | sh to run the commands.
I should mention that it is generally recommended to use constructs like for (in Tim's answer) or find instead of parsing the output of ls. That said, when your filename format is as simple as /[0-9]{4}word/, I find the quick sed one-liner to be the way to go.
Lastly, if by NAME you actually mean "any string of characters" rather than the literal string "NAME", then in all my examples above, replace NAME with .*.
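For completeness, a find-based version of the same idea, as recommended above instead of parsing ls (a sketch, assuming the literal name NAME and that the files sit in the current directory):
find . -maxdepth 1 -type f -name '[0-9][0-9][0-9][0-9]NAME' -print0 |
while IFS= read -r -d '' path; do
    file=${path#./}          # drop the leading "./" that find adds
    dir=${file:0:4}          # the four-digit prefix becomes the directory name
    mkdir -p "$dir"
    mv -iv "$file" "$dir/${file:4}"
done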
The following script will do this for you. Copy the script into a file on the remote machine (we'll call it sortfiles.sh).
#!/bin/bash
# Get all files in current directory having names XXXXsomename, where X is an integer
files=$(find . -maxdepth 1 -type f -name '[0-9][0-9][0-9][0-9]*')
# Build a list of the XXXX patterns found in the list of files
dirs=
for name in ${files}; do
dirs="${dirs} $(echo ${name} | cut -c 3-6)"   # chars 3-6: skip the "./" prefix that find adds
done
# Remove redundant entries from the list of XXXX patterns
dirs=$(echo ${dirs} | tr ' ' '\n' | sort -u)
# Create any XXXX directories that are not already present
for name in ${dirs}; do
if [[ ! -d ${name} ]]; then
mkdir ${name}
fi
done
# Move each of the XXXXsomename files to the appropriate directory
for name in ${files}; do
mv ${name} $(echo ${name} | cut -c 3-6)
done
# Return from script with normal status
exit 0
From the command line, do chmod +x sortfiles.sh
Execute the script with ./sortfiles.sh
Just open the Terminal application, cd into the directory that contains the files you want moved/renamed, and copy and paste these commands into the command line.
shopt -s extglob   # the *( ) patterns below need extended globbing
for file in [0-9][0-9][0-9][0-9]*; do
dirName="${file%%*([^0-9])}"
mkdir -p "$dirName"
mv "$file" "$dirName/${file##*([0-9])}"
done
This assumes all the files that you want to rename and move are in the same directory. The file globbing also assumes that there are at least four digits at the start of the filename. If there are more than four digits, the file will still be caught, but not if there are fewer than four; in that case, remove the appropriate number of [0-9]s from the first line.
It does not handle the case where "NAME" (i.e. the name of the new file you want) starts with a number.
See this site for more information about string manipulation in bash.
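For example, with extglob enabled and a filename like 0000GNUmakefile (the real-life case mentioned above), the two expansions split it like this:
file=0000GNUmakefile
echo "${file%%*([^0-9])}"   # strips the trailing non-digits -> 0000
echo "${file##*([0-9])}"    # strips the leading digits      -> GNUmakefile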
