Change ID in multiple FASTA files - loops

I need to rename multiple sequences in multiple fasta files and I found this script in order to do so for a single ID:
original_file = "./original.fasta"
corrected_file = "./corrected.fasta"
with open(original_file) as original, open(corrected_file, 'w') as corrected:
records = SeqIO.parse(original_file, 'fasta')
for record in records:
print record.id
if record.id == 'foo':
record.id = 'bar'
record.description = 'bar' # <- Add this line
print record.id
SeqIO.write(record, corrected, 'fasta')
Each fasta file corresponds to a single organism, but it is not specified in the IDs. I have the original fasta files (because these have been translated) with the same filenames but different directories and include in their IDs the name of each organism.
I wanted to figure out how to loop through all these fasta files and rename each ID in each file with the corresponding organism name.

ok my effort, got to use my own input folders/files since they where not specified in question
/old folder contains files :
MW628877.1.fasta :
>MW628877.1 Streptococcus agalactiae strain RYG82 DNA gyrase subunit A (gyrA) gene, complete cds
ATGCAAGATAAAAATTTAGTAGATGTTAATCTAACTAGTGAAATGAAAACGAGTTTTATCGATTACGCCA
TGAGTGTCATTGTTGCTCGTGCACTTCCAGATGTTAGAGATGGTTTAAAACCTGTTCATCGTCGTATTTT
>KY347969.1 Neisseria gonorrhoeae strain 1448 DNA gyrase subunit A (gyrA) gene, partial cds
CGGCGCGTACCGTACGCGATGCACGAGCTGAAAAATAACTGGAATGCCGCCTACAAAAAATCGGCGCGCA
TCGTCGGCGACGTCATCGGTAAATACCACCCCCACGGCGATTTCGCAGTTTACGGCACCATCGTCCGTAT
MG995190.1.fasta :
>MG995190.1 Mycobacterium tuberculosis strain UKR100 GyrA (gyrA) gene, complete cds
ATGACAGACACGACGTTGCCGCCTGACGACTCGCTCGACCGGATCGAACCGGTTGACATCCAGCAGGAGA
TGCAGCGCAGCTACATCGACTATGCGATGAGCGTGATCGTCGGCCGCGCGCTGCCGGAGGTGCGCGACGG
and an /empty folder.
/new folder contains files :
MW628877.1.fasta :
>MW628877.1
MQDKNLVDVNLTSEMKTSFIDYAMSVIVARALPDVRDGLKPVHRRI
>KY347969.1
RRVPYAMHELKNNWNAAYKKSARIVGDVIGKYHPHGDFAVYGTIVR
MG995190.1.fasta :
>MG995190.1
MTDTTLPPDDSLDRIEPVDIQQEMQRSYIDYAMSVIVGRALPEVRD
my code is :
from Bio import SeqIO
from os import scandir
old = './old'
new = './new'
old_ids_dict = {}
for filename in scandir(old):
if filename.is_file():
print(filename)
for seq_record in SeqIO.parse(filename, "fasta"):
old_ids_dict[seq_record.id] = ' '.join(seq_record.description.split(' ')[1:3])
print('_____________________')
print('old ids ---> ',old_ids_dict)
print('_____________________')
for filename in scandir(new):
if filename.is_file():
sequences = []
for seq_record in SeqIO.parse(filename, "fasta"):
if seq_record.id in old_ids_dict.keys():
print('### ', seq_record.id,' ', old_ids_dict[seq_record.id])
seq_record.id += '.'+old_ids_dict[seq_record.id]
seq_record.description = ''
print('-->', seq_record.id)
print(seq_record)
sequences.append(seq_record)
SeqIO.write(sequences, filename, 'fasta')
check how it works, it actually overwrites both files in new folder,
as pointed out by #Vovin in his comment it needs to be adapted per your files template from-to.
I am sure there is more than a way to do this, probably better and more pythonic than may way, I am learning too. Let us know

Related

Error loading physionet ECG database on MATLAB

I'm using this code to load the ECG-ID database into MATLAB:
%% Initialization
clear all; close all; clc
%% read files from folder A
% Specify the folder where the files live.
myFolder = 'Databases\ECG_ID';
% Check to make sure that folder actually exists. Warn user if it doesn't.
if ~isfolder(myFolder)
errorMessage = sprintf('Error: The following folder does not exist:\n%s\nPlease specify a new folder.', myFolder;)
uiwait(warndlg(errorMessage);)
myFolder = uigetdir(; % Ask for a new one.)
if myFolder == 0
% User clicked Cancel
return;
end
end
% Get a list of all files in the folder with the desired file name pattern.
filePattern = fullfile(myFolder, '**/rec_*'; % Change to whatever pattern you need.)
theFiles = dir(filePattern;)
for k = 1 : length(theFiles)
baseFileName = theFiles(k.name;)
fullFileName = fullfile(theFiles(k.folder, baseFileName);)
fprintf(1, 'Now reading %s\n', fullFileName;)
% Now do whatever you want with this file name,
% such as reading it in as an image array with imread()
[sig, Fs, tm] = rdsamp(fullFileName, [1],[],[],[],1;)
end
But I keep getting this error message:
Now reading C:\Users\******\Documents\MATLAB\Databases\ECG_ID\Person_01\rec_1.atr
Error using rdsamp (line 203)
Could not find record: C:\Users\******\Documents\MATLAB\Databases\ECG_ID\Person_01\rec_1.atr. Search path is set to: '.
C:\Users\******\Documents\MATLAB\mcode\..\database\ http://physionet.org/physiobank/database/'
I can successfully load one signal at a time (but I can't load the entire database using the above code) using this command:
[sig, Fs, tm] = rdsamp('Databases\ECG_ID\Person_01\rec_1');
How do I solve this problem? How can I load all the files in MATLAB?
Thanks in advance.

Handling Hebrew files and folders with Python 3.4

I used Python 3.4 to create a programm that goes through E-mails and saves specific attachments to a file server.
Each file is saved to a specific destination depending on the sender's E-mail's address.
My problem is that the destination folders and the attachments are both in Hebrew and for a few attachments I get an error that the path does not exsist.
Now that's not possible because It can fail for one attachment but not for the others on the same Mail (the destination folder is decided by the sender's address).
I want to debug the issue but I cannot get python to display the file path it is trying to save correctly. (it's mixed hebrew and english and it always displays the path in a big mess, although it works correctly 95% of the time when the file is being saved to the file server)
So my questions are:
what should I add to this code so that it will proccess Hewbrew correctly?
Should I encode or decode somthing?
Are there characters I should avoid when proccessing the files?
here's the main piece of code that fails:
try:
found_attachments = False
for att in msg.Attachments:
_, extension = split_filename(str(att))
# check if attachment is not inline
if str(att) not in msg.HTMLBody:
if extension in database[sender][TYPES]:
file = create_file(str(att), database[sender][PATH], database[sender][FORMAT], time_stamp)
# This is where the program fails:
att.SaveAsFile(file)
print("Created:", file)
found_attachments = True
if found_attachments:
items_processed.append(msg)
else:
items_no_att.append(msg)
except:
print("Error with attachment: " + str(att) + " , in: " + str(msg))
and the create file function:
def create_file(att, location, format, timestamp):
"""
process an attachment to make it a file
:param att: the name of the attachment
:param location: the path to the file
:param format: the format of the file
:param timestamp: the time and date the attachment was created
:return: return the file created
"""
# create the file by the given format
if format == "":
output_file = location + "\\" + att
else:
# split file to name and type
filename, extension = split_filename(att)
# extract and format the time sent on
time = str(timestamp.time()).replace(":", ".")[:-3]
# extract and format the date sent on
day = str(timestamp.date())
day = day[-2:] + day[4:-2] + day[:4]
# initiate the output file
output_file = format
# add the original file name where needed
output_file = output_file.replace(FILENAME, filename)
# add the sent date where needed
output_file = output_file.replace(DATE, day)
# add the time sent where needed
output_file = output_file.replace(TIME, time)
# add the path and type
output_file = location + "\\" + output_file + "." + extension
print(output_file)
# add an index to the file if necessary and return it
index = get_file_index(output_file)
if index:
filename, extension = split_filename(output_file)
return filename + "(" + str(index) + ")." + extension
else:
return output_file
Thanks in advance, I would be happy to explain more or supply more code if needed.
I found out that the promlem was not using Hebrew. I found that there's a limit on the number of chars that the (path + filename) can hold (255 chars).
The files that failed excided that limit and that caused the problem

Find string in log files and return extra characters

How can I get Python to loop through a directory and find a specific string in each file located within that directory, then output a summary of what it found?
I want to search the long files for the following string:
FIRMWARE_VERSION = "2.15"
Only, the firmware version can be different in each file. So I want the log file to report back with whatever version it finds.
import glob
import os
print("The following list contains the firmware version of each server.\n")
os.chdir( "LOGS\\" )
for file in glob.glob('*.log'):
with open(file) as f:
contents = f.read()
if 'FIRMWARE_VERSION = "' in contents:
print (file + " = ???)
I was thinking I could use something like the following to return the extra characters but it's not working.
file[:+5]
I want the output to look something like this:
server1.web.com = FIRMWARE_VERSION = "2.16"
server2.web.com = FIRMWARE_VERSION = "3.01"
server3.web.com = FIRMWARE_VERSION = "1.26"
server4.web.com = FIRMWARE_VERSION = "4.1"
server5.web.com = FIRMWARE_VERSION = "3.50"
Any suggestions on how I can do this?
You can use regex for grub the text :
import re
for file in glob.glob('*.log'):
with open(file) as f:
contents = f.read()
if 'FIRMWARE_VERSION = "' in contents:
print (file + '='+ re.search(r'FIRMWARE_VERSION ="([\d.]+)"',contents).group(1))
In this case re.search will do the job! with searching the file content based on the following pattern :
r'FIRMWARE_VERSION ="([\d.]+)"'
that find a float number between two double quote!also you can use the following that match anything right after FIRMWARE_VERSIONbetween two double quote.
r'FIRMWARE_VERSION =(".*")'

Looping through A Folder and all its subdirectories

Ok. Many trouble shooting hours ...and many error "dings" later, I'm still having the same problem. Due to my beginner skills I'm having problems achieving the following segment of my project:
I will be as detailed as possible so I can hopefully nail it this time:
On my computer i have a folder C:\data which contains many different subfolders.
The subfolders are named by dates in a MMDDYY fashion. For example "040312"
In each subfolder are excel files named after Baseball teams. each subfolder may contain a different combination of xls files.
I am trying to write code that achieves the following objectives:
1.) Loops through all the subfolders of the C:\data folder looking for xls files that have the filenames: Angles.xls, Diamondbacks.xls, etc.
2.) If the files are found in each subfolder import the spreadsheet data and generate a plot of the data titled "Score" and "Allow".
3.) If the file is not found any given subfolder skip and continue to the next file to be located.
4.)Then save the generated plot in the same folder that the spreadsheet was imported from as a .fig and a .bmp file.
I've gotten hints to use various functions like: genpath, dir, but the code I've been fumbling through isn't able to achieve my goals.
a) the script doesn't import the excel files from all the subfolders
b) the script wont save the .fig or .bmp file in the associated subfolder
Here is the code I have been fumbling through:
%I know all of this is wrong wrong wrong. Please help to adjust my code to %achieve the objectives outlined above!
addpath(genpath('c:\data'))
folder = 'c:\data';
subdirs = dir(folder);
subdirs(~[subdirs.isdir]) = [] ;
numberOfFolders = length(subdirs);
if numberOfFolders <= 0
uiwait(warndlg('Number of folders = 0!'))
end
wantedfiles = {'Angels' 'Diamondbacks' 'Orioles' 'Royals' 'Yankees' 'Mets' 'Giants'};
for K = 1 : numberOfFolders
thissubdir = subdirs(K).name;
if strcmp(thissubdir, '.') || strcmp(thissubdir, '..')
continue;
end
subdirpath = [folder '\' thissubdir];
for L = 1 : length(wantedfiles)
for wantedfiles = {'Angels' 'Diamondbacks' 'Orioles' 'Royals' 'Yankees' 'Mets' 'Giants'};
folder = '';
fileToRead1 = [wantedfiles{1} '.xls'];
sheetName='Sheet1';
if exist(fileToRead1, 'file') == 0
% File does not exist
% Skip to bottom of loop and continue with the loop
continue;
end
%This is to import the data and organize it
% All of this code I had auto-generated from importing files manually
[numbers, strings, raw] = xlsread(fileToRead1, sheetName);
if ~isempty(numbers)
newData1.data = numbers;
end
if ~isempty(strings) && ~isempty(numbers)
[strRows, strCols] = size(strings);
[numRows, numCols] = size(numbers);
likelyRow = size(raw,1) - numRows;
% Break the data up into a new structure with one field per column.
if strCols == numCols && likelyRow > 0 && strRows >= likelyRow
newData1.colheaders = strings(likelyRow, :);
end
end
% Create new variables in the base workspace from those fields.
for i = 1:size(newData1.colheaders, 2)
assignin('base', genvarname(newData1.colheaders{i}), newData1.data(:,i));
end
% Now I execute the plotting of data
subplot (2,1,1), plot(Score,Allow)
title([wantedfiles{1} 'Testing to see if it works']);
subplot (2,1,2), plot(Allow,Score)
title('Well, did it?');
% here I save the generated plots, but they don't save where I want them to
saveas(gcf,[wantedfiles{1} ' did it work.fig']);
saveas(gcf,[wantedfiles{1} ' did it work.bmp']);
end
end
end
%At the end of the script I still was unable to loop over the files that I wanted
rmpath(genpath('c:\data'));

Groovy - create file issue: The filename, directory name or volume label syntax is incorrect

I'm running a script made in Groovy from Soap UI and the script needs to generate lots of files.
Those files have also in the name two numbers from a list (all the combinations in that list are different), and there are 1303 combinations
available and the script generates just 1235 files.
A part of the code is:
filename = groovyUtils.projectPath + "\\" + "$file"+"_OK.txt";
targetFile = new File(filename);
targetFile.createNewFile();
where $file is actually that part of the file name which include those 2 combinations from that list:
file = "abc" + "-$firstNumer"+"_$secondNumber"
For those file which are not created is a message returned:"The filename, directory name or volume label syntax is incorrect".
I've tried puting another path:
filename = "D:\\rez\\" + "\\" + "$file"+"_OK.txt";
targetFile = new File(filename);
targetFile.createNewFile();
and also:
File parentFolder = new File("D:\\rez\\");
File targetFile = new File(parentFolder, "$file"+"_OK.txt");
targetFile.createNewFile();
(which I've found here: What are possible reasons for java.io.IOException: "The filename, directory name, or volume label syntax is incorrect")
but nothing worked.
I have no ideea where the problem is. Is strange that 1235 files are created ok, and the rest of them, 68 aren't created at all.
Thanks,
My guess is that some of the files have illegal characters in their paths. Exactly which characters are illegal is platform specific, e.g. on Windows they are
\ / : * ? " < > |
Why don't you log the full path of the file before targetFile.createNewFile(); is called and also log whether this method succeeded or not, e.g.
filename = groovyUtils.projectPath + "\\" + "$file"+"_OK.txt";
targetFile = new File(filename);
println "attempting to create file: $targetFile"
if (targetFile.createNewFile()) {
println "Successfully created file $targetFile"
} else {
println "Failed to create file $targetFile"
}
When the process is finished, check the logs and I suspect you'll see a common pattern in the ""Failed to create file...." messages
File.createNewFile() returns false when a file or directory with that name already exists. In all other failure cases (security, I/O) it throws an exception.
Evaluate createNewFile()'s return value or, additionally, use the File.exists() method:
File file = new File("foo")
// works the first time
createNewFile(file)
// prints an error message
createNewFile(file)
void createNewFile(File file) {
if (!file.createNewFile()) {
assert file.exists()
println file.getPath() + " already exists."
}
}

Resources