Convert WordNet files to .txt

I need to convert the WordNet database files (noun.shape, noun.state, verb.cognition, etc.) from their custom extensions to .txt so that I can more easily extract the nouns, verbs, adjectives and adverbs in each category.
In other words, the "DATABASE FILES ONLY" download contains the files I'm looking for, but they have extensions such as .STATE or .SHAPE. They are readable in Notepad, but I need a list of all the items in those files without their definitions in parentheses.

If you're using WordNet simply as a dictionary, you can try Open Multilingual WordNet, see http://compling.hss.ntu.edu.sg/omw/
import os
from nltk.corpus import wordnet as wn

# Read an Open Multilingual WordNet .tab file
def readWNfile(wnfile, option="ss"):
    entries = {}
    with open(wnfile, encoding="utf8") as reader:
        for l in reader:
            if l[0] == "#": continue
            fields = l.rstrip("\n").split("\t")
            if option == "ss":
                k, v = fields[0], fields[2]  # synset ID as key, word as value
            else:
                k, v = fields[2], fields[0]  # word as key, synset ID as value
            # Join several values for the same key with ";"
            if k in entries:
                entries[k] += ";" + v
            else:
                entries[k] = v
    return entries

if not os.path.exists('msa/wn-data-zsm.tab'):
    os.system('wget http://compling.hss.ntu.edu.sg/omw/wns/zsm.zip')
    os.system('unzip zsm.zip')

msa_wn = readWNfile('msa/wn-data-zsm.tab')
# Keys look like "00001740-n": zero-padded offset plus part of speech
eng_wn_keys = {str(i.offset()).zfill(8) + '-' + i.pos(): i for i in wn.all_synsets()}
for i in set(eng_wn_keys).intersection(msa_wn.keys()):
    print(eng_wn_keys[i], msa_wn[i])
Meanwhile, hold on for a while because the NLTK developers are going to put the Open Multilingual Wordnet API together soon, see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py from line 1048
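The question itself asks for the entries without their parenthesised definitions. A rough sketch of just that stripping step, assuming each entry sits on one line and the definition is the only parenthesised text on it; `strip_gloss` and the sample line are made up for illustration and are not taken from the WordNet files:

```python
import re

def strip_gloss(line):
    """Remove the parenthesised definition from one line of text."""
    # Greedy match from the first "(" to the last ")" on the line, so
    # parentheses nested inside the definition are removed as well.
    return re.sub(r"\s*\(.*\)", "", line).strip()

# Hypothetical entry, not a real WordNet line
sample = "{ circle, round, shape,@ (a made-up definition (with a nested note)) }"
print(strip_gloss(sample))
```

Note that this also removes any other parenthesised text on the line, so it would need adjusting if the files use parentheses for anything besides definitions.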

Related

Change ID in multiple FASTA files

I need to rename multiple sequences in multiple fasta files and I found this script in order to do so for a single ID:
from Bio import SeqIO

original_file = "./original.fasta"
corrected_file = "./corrected.fasta"
with open(original_file) as original, open(corrected_file, 'w') as corrected:
    records = SeqIO.parse(original_file, 'fasta')
    for record in records:
        print(record.id)
        if record.id == 'foo':
            record.id = 'bar'
            record.description = 'bar' # <- Add this line
            print(record.id)
        SeqIO.write(record, corrected, 'fasta')
Each FASTA file corresponds to a single organism, but the organism is not specified in the IDs. I also have the original FASTA files (the ones these were translated from), with the same filenames but in a different directory, and their IDs include the name of each organism.
I want to figure out how to loop through all these FASTA files and rename each ID in each file with the corresponding organism name.

OK, here is my attempt. I had to use my own input folders and files, since they were not specified in the question.
/old folder contains files :
MW628877.1.fasta :
>MW628877.1 Streptococcus agalactiae strain RYG82 DNA gyrase subunit A (gyrA) gene, complete cds
ATGCAAGATAAAAATTTAGTAGATGTTAATCTAACTAGTGAAATGAAAACGAGTTTTATCGATTACGCCA
TGAGTGTCATTGTTGCTCGTGCACTTCCAGATGTTAGAGATGGTTTAAAACCTGTTCATCGTCGTATTTT
>KY347969.1 Neisseria gonorrhoeae strain 1448 DNA gyrase subunit A (gyrA) gene, partial cds
CGGCGCGTACCGTACGCGATGCACGAGCTGAAAAATAACTGGAATGCCGCCTACAAAAAATCGGCGCGCA
TCGTCGGCGACGTCATCGGTAAATACCACCCCCACGGCGATTTCGCAGTTTACGGCACCATCGTCCGTAT
MG995190.1.fasta :
>MG995190.1 Mycobacterium tuberculosis strain UKR100 GyrA (gyrA) gene, complete cds
ATGACAGACACGACGTTGCCGCCTGACGACTCGCTCGACCGGATCGAACCGGTTGACATCCAGCAGGAGA
TGCAGCGCAGCTACATCGACTATGCGATGAGCGTGATCGTCGGCCGCGCGCTGCCGGAGGTGCGCGACGG
and an /empty folder.
/new folder contains files :
MW628877.1.fasta :
>MW628877.1
MQDKNLVDVNLTSEMKTSFIDYAMSVIVARALPDVRDGLKPVHRRI
>KY347969.1
RRVPYAMHELKNNWNAAYKKSARIVGDVIGKYHPHGDFAVYGTIVR
MG995190.1.fasta :
>MG995190.1
MTDTTLPPDDSLDRIEPVDIQQEMQRSYIDYAMSVIVGRALPEVRD
My code is:
from Bio import SeqIO
from os import scandir

old = './old'
new = './new'

# Map each sequence ID to its organism name (words 2 and 3 of the header)
old_ids_dict = {}
for filename in scandir(old):
    if filename.is_file():
        print(filename)
        for seq_record in SeqIO.parse(filename.path, "fasta"):
            old_ids_dict[seq_record.id] = ' '.join(seq_record.description.split(' ')[1:3])
print('_____________________')
print('old ids ---> ',old_ids_dict)
print('_____________________')

# Rewrite every file in ./new, appending the organism name to each known ID
for filename in scandir(new):
    if filename.is_file():
        sequences = []
        for seq_record in SeqIO.parse(filename.path, "fasta"):
            if seq_record.id in old_ids_dict:
                print('### ', seq_record.id,' ', old_ids_dict[seq_record.id])
                seq_record.id += '.'+old_ids_dict[seq_record.id]
                seq_record.description = ''
                print('-->', seq_record.id)
                print(seq_record)
            sequences.append(seq_record)
        SeqIO.write(sequences, filename.path, 'fasta')
Check how it works: it overwrites both files in the new folder.
As @Vovin pointed out in his comment, it needs to be adapted to the from-to naming pattern of your own files.
I am sure there is more than one way to do this, probably better and more Pythonic than mine; I am learning too. Let us know how it goes.
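The key step in the code above is turning each header line of the old files into an ID-to-organism mapping. A minimal sketch of that step alone, using only the standard library instead of Biopython, and assuming (as in the sample files) that the two words after the ID are the genus and species; `organism_by_id` is a hypothetical helper name:

```python
def organism_by_id(fasta_text):
    """Map each sequence ID to the two words that follow it in the header."""
    mapping = {}
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                # fields[0] is the ID; fields[1:3] are assumed to be genus and species
                mapping[fields[0]] = " ".join(fields[1:3])
    return mapping

sample = """>MW628877.1 Streptococcus agalactiae strain RYG82 DNA gyrase subunit A (gyrA) gene, complete cds
ATGCAAGATAAAAATTTAGTAGATG
>KY347969.1 Neisseria gonorrhoeae strain 1448 DNA gyrase subunit A (gyrA) gene, partial cds
CGGCGCGTACCGTACGCGATG
"""
print(organism_by_id(sample))
```

Headers that do not follow this word order would need their own parsing rule.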

Handling Hebrew files and folders with Python 3.4

I used Python 3.4 to create a program that goes through e-mails and saves specific attachments to a file server.
Each file is saved to a specific destination depending on the sender's e-mail address.
My problem is that the destination folders and the attachments are both named in Hebrew, and for a few attachments I get an error that the path does not exist.
That should not be possible, because it can fail for one attachment but not for the others in the same mail (the destination folder is decided by the sender's address).
I want to debug the issue, but I cannot get Python to display the file path it is trying to save correctly. (The path is mixed Hebrew and English and always displays as a mess, although saving to the file server works correctly 95% of the time.)
So my questions are:
What should I add to this code so that it processes Hebrew correctly?
Should I encode or decode something?
Are there characters I should avoid when processing the files?
Here's the main piece of code that fails:
try:
    found_attachments = False
    for att in msg.Attachments:
        _, extension = split_filename(str(att))
        # check if attachment is not inline
        if str(att) not in msg.HTMLBody:
            if extension in database[sender][TYPES]:
                file = create_file(str(att), database[sender][PATH], database[sender][FORMAT], time_stamp)
                # This is where the program fails:
                att.SaveAsFile(file)
                print("Created:", file)
                found_attachments = True
    if found_attachments:
        items_processed.append(msg)
    else:
        items_no_att.append(msg)
except:
    print("Error with attachment: " + str(att) + " , in: " + str(msg))
And the create_file function:
def create_file(att, location, format, timestamp):
    """
    process an attachment to make it a file
    :param att: the name of the attachment
    :param location: the path to the file
    :param format: the format of the file
    :param timestamp: the time and date the attachment was created
    :return: return the file created
    """
    # create the file by the given format
    if format == "":
        output_file = location + "\\" + att
    else:
        # split file to name and type
        filename, extension = split_filename(att)
        # extract and format the time sent on
        time = str(timestamp.time()).replace(":", ".")[:-3]
        # extract and format the date sent on
        day = str(timestamp.date())
        day = day[-2:] + day[4:-2] + day[:4]
        # initiate the output file
        output_file = format
        # add the original file name where needed
        output_file = output_file.replace(FILENAME, filename)
        # add the sent date where needed
        output_file = output_file.replace(DATE, day)
        # add the time sent where needed
        output_file = output_file.replace(TIME, time)
        # add the path and type
        output_file = location + "\\" + output_file + "." + extension
        print(output_file)
    # add an index to the file if necessary and return it
    index = get_file_index(output_file)
    if index:
        filename, extension = split_filename(output_file)
        return filename + "(" + str(index) + ")." + extension
    else:
        return output_file
Thanks in advance, I would be happy to explain more or supply more code if needed.

I found out that the problem was not caused by the Hebrew. There is a limit on the number of characters that the path plus filename can hold (255 characters).
The files that failed exceeded that limit, and that caused the problem.
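That finding suggests checking the full path length before saving. A minimal sketch, taking the 255-character figure from this answer as the limit (the actual limit depends on the operating system and file system, so treat it as an assumption); `MAX_PATH_CHARS` and `path_fits` are hypothetical names:

```python
MAX_PATH_CHARS = 255  # figure quoted in the answer; an assumption on other systems

def path_fits(path, limit=MAX_PATH_CHARS):
    """Return True if the full path (folder plus filename) is within the limit."""
    return len(path) <= limit

short_path = "C:\\attachments\\report.pdf"
long_path = "C:\\attachments\\" + "x" * 300 + ".pdf"
print(path_fits(short_path))
print(path_fits(long_path))
```

Printing `len(path)` next to the failing path during debugging would have exposed the cause without needing the Hebrew text to render correctly.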

File Open Mechanism

I'm trying to create an application. The application gives the user 2 combo boxes. Combo Box 1 gives the first part of the file name the user wants, and Combo Box 2 gives the second part of the file name. E.g. Combo box 1 option 1 is 1 and Combo Box 2 option 1 is A; the selected file is 1_A.txt.
I have a Load button that should take this file name and open the file with that name. If no such file exists, the application should open a dialog saying "No Such File Exists".
from PySide import QtGui, QtCore
from PySide.QtCore import *
from PySide.QtGui import *
class MainWindow(QtGui.QMainWindow):
    def __init__(self,):
        QtGui.QMainWindow.__init__(self)
        QtGui.QApplication.setStyle('cleanlooks')
        #PushButtons
        load_button = QPushButton('Load',self)
        load_button.move(310,280)
        run_Button = QPushButton("Run", self)
        run_Button.move(10,340)
        stop_Button = QPushButton("Stop", self)
        stop_Button.move(245,340)
        #ComboBoxes
        #Option1
        o1 = QComboBox(self)
        l1 = QLabel(self)
        l1.setText('Option 1:')
        l1.setFixedSize(170, 20)
        l1.move(10,230)
        o1.move(200, 220)
        o1.setFixedSize(100, 40)
        o1.insertItem(0,'')
        o1.insertItem(1,'A')
        o1.insertItem(2,'B')
        o1.insertItem(3,'test')
        #Option2
        o2 = QComboBox(self)
        l2 = QLabel(self)
        l2.setText('Option 2:')
        l2.setFixedSize(200, 20)
        l2.move(10,290)
        o2.move(200,280)
        o2.setFixedSize(100, 40)
        o2.insertItem(0,'')
        o2.insertItem(1,'1')
        o2.insertItem(2,'2')
        o2.insertItem(3,'100')
        self.fileName = QLabel(self)
        self.fileName.setText("Select Options")
        o1.activated.connect(lambda: self.fileName.setText(o1.currentText() + '_' + o2.currentText() + '.txt'))
        o2.activated.connect(lambda: self.fileName.setText(o1.currentText() + '_' + o2.currentText() + '.txt'))
        load_button.clicked.connect(self.fileHandle)
    def fileHandle(self):
        file = QFile(str(self.fileName.text()))
        open(file, 'r')
if __name__ == '__main__':
    import sys
    app = QtGui.QApplication(sys.argv)
    window = MainWindow()
    window.setWindowTitle("Test11")
    window.resize(480, 640)
    window.show()
    sys.exit(app.exec_())
The error I'm getting is TypeError: invalid file: <PySide.QtCore.QFile object at 0x031382B0> and I suspect this is because the string described in the file handle isn't being inserted in the QFile properly. Can someone please help

The Python open() function doesn't have any knowledge of objects of type QFile. I doubt you actually need to construct a QFile object though.
Instead, just open the file directly via open(self.fileName.text(), 'r'). Preferably, you would do:
with open(self.fileName.text(), 'r') as myfile:
    # do stuff with the file
unless you need to keep the file open for a long period of time.
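The question also wants a "No Such File Exists" dialog when the file is missing. A minimal sketch of the file-handling half of that, using only the standard library; `read_text_or_none` is a hypothetical helper, and the GUI code would show the dialog whenever it returns None:

```python
import os
import tempfile

def read_text_or_none(path):
    """Return the file's contents, or None if the file does not exist."""
    try:
        with open(path, "r") as handle:
            return handle.read()
    except FileNotFoundError:
        return None

# Demonstration in a throwaway folder: one file exists, the other does not
with tempfile.TemporaryDirectory() as folder:
    existing = os.path.join(folder, "A_1.txt")
    with open(existing, "w") as handle:
        handle.write("hello")
    print(read_text_or_none(existing))
    print(read_text_or_none(os.path.join(folder, "B_2.txt")))
```

Catching the exception rather than testing for existence first avoids a gap between the check and the open.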

I also came up with a solution.
def fileHandle(self):
    string = str(self.fileName.text())
    file = QFile()
    file.setFileName(string)
    file.open(QIODevice.ReadOnly)
    print(file.exists())
    line = file.readLine()
    print(line)
This takes the string from the file name label, creates the file object, sets the file object's name to that string, and then opens the file. I call exists() to check that the file is there, and after reading the test document I have, it seemed to work as I wanted.
Thanks anyway @three_pineapples, but I'm going to use my solution :P

Find string in log files and return extra characters

How can I get Python to loop through a directory and find a specific string in each file located within that directory, then output a summary of what it found?
I want to search the log files for the following string:
FIRMWARE_VERSION = "2.15"
However, the firmware version can be different in each file, so I want the script to report whatever version it finds in each log file.
import glob
import os
print("The following list contains the firmware version of each server.\n")
os.chdir( "LOGS\\" )
for file in glob.glob('*.log'):
    with open(file) as f:
        contents = f.read()
    if 'FIRMWARE_VERSION = "' in contents:
        print (file + " = ???)
I was thinking I could use something like the following to return the extra characters but it's not working.
file[:+5]
I want the output to look something like this:
server1.web.com = FIRMWARE_VERSION = "2.16"
server2.web.com = FIRMWARE_VERSION = "3.01"
server3.web.com = FIRMWARE_VERSION = "1.26"
server4.web.com = FIRMWARE_VERSION = "4.1"
server5.web.com = FIRMWARE_VERSION = "3.50"
Any suggestions on how I can do this?

You can use a regex to grab the text:
import glob
import re
for file in glob.glob('*.log'):
    with open(file) as f:
        contents = f.read()
    # search once and only print when the pattern was actually found
    match = re.search(r'FIRMWARE_VERSION = "([\d.]+)"', contents)
    if match:
        print(file + ' = ' + match.group(1))
Here re.search does the job, searching the file contents with the following pattern:
r'FIRMWARE_VERSION = "([\d.]+)"'
which finds a dotted number between two double quotes. Note that the spaces around the = sign must match the log file exactly. Alternatively, you can use the following, which matches anything between two double quotes right after FIRMWARE_VERSION:
r'FIRMWARE_VERSION = (".*")'
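Since a pattern like this only matches when the spacing around the = sign is exactly as written, a slightly more tolerant variant may be useful. A minimal sketch, assuming the version is always enclosed in double quotes; `VERSION_RE` and `find_firmware_version` are hypothetical names:

```python
import re

# \s* allows any amount of whitespace (including none) around the "=" sign
VERSION_RE = re.compile(r'FIRMWARE_VERSION\s*=\s*"([^"]*)"')

def find_firmware_version(text):
    """Return the quoted firmware version found in text, or None if absent."""
    match = VERSION_RE.search(text)
    return match.group(1) if match else None

log_text = 'boot ok\nFIRMWARE_VERSION = "2.15"\nuptime 3 days\n'
print(find_firmware_version(log_text))
```

Returning None instead of calling `.group(1)` directly avoids an AttributeError on log files that contain no version line.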

Looping through A Folder and all its subdirectories

OK. Many troubleshooting hours and many error "dings" later, I'm still having the same problem. Because of my beginner skills, I'm having trouble with the following part of my project.
I will be as detailed as possible so I can hopefully nail it this time:
On my computer I have a folder C:\data which contains many different subfolders.
The subfolders are named by date in MMDDYY format, for example "040312".
Each subfolder contains Excel files named after baseball teams. Each subfolder may contain a different combination of .xls files.
I am trying to write code that achieves the following objectives:
1.) Loop through all the subfolders of the C:\data folder looking for .xls files with the filenames Angels.xls, Diamondbacks.xls, etc.
2.) If a file is found in a subfolder, import the spreadsheet data and generate a plot of the data in the columns titled "Score" and "Allow".
3.) If a file is not found in a given subfolder, skip it and continue to the next file to be located.
4.) Then save the generated plot in the same folder that the spreadsheet was imported from, as a .fig and a .bmp file.
I've gotten hints to use various functions like genpath and dir, but the code I've been fumbling through doesn't achieve my goals:
a) the script doesn't import the Excel files from all the subfolders
b) the script won't save the .fig or .bmp file in the associated subfolder
Here is the code I have been fumbling through:
%I know all of this is wrong wrong wrong. Please help to adjust my code to
%achieve the objectives outlined above!
addpath(genpath('c:\data'))
folder = 'c:\data';
subdirs = dir(folder);
subdirs(~[subdirs.isdir]) = [] ;
numberOfFolders = length(subdirs);
if numberOfFolders <= 0
    uiwait(warndlg('Number of folders = 0!'))
end
wantedfiles = {'Angels' 'Diamondbacks' 'Orioles' 'Royals' 'Yankees' 'Mets' 'Giants'};
for K = 1 : numberOfFolders
    thissubdir = subdirs(K).name;
    if strcmp(thissubdir, '.') || strcmp(thissubdir, '..')
        continue;
    end
    subdirpath = [folder '\' thissubdir];
    for L = 1 : length(wantedfiles)
        for wantedfiles = {'Angels' 'Diamondbacks' 'Orioles' 'Royals' 'Yankees' 'Mets' 'Giants'};
            folder = '';
            fileToRead1 = [wantedfiles{1} '.xls'];
            sheetName='Sheet1';
            if exist(fileToRead1, 'file') == 0
                % File does not exist
                % Skip to bottom of loop and continue with the loop
                continue;
            end
            %This is to import the data and organize it
            % All of this code I had auto-generated from importing files manually
            [numbers, strings, raw] = xlsread(fileToRead1, sheetName);
            if ~isempty(numbers)
                newData1.data = numbers;
            end
            if ~isempty(strings) && ~isempty(numbers)
                [strRows, strCols] = size(strings);
                [numRows, numCols] = size(numbers);
                likelyRow = size(raw,1) - numRows;
                % Break the data up into a new structure with one field per column.
                if strCols == numCols && likelyRow > 0 && strRows >= likelyRow
                    newData1.colheaders = strings(likelyRow, :);
                end
            end
            % Create new variables in the base workspace from those fields.
            for i = 1:size(newData1.colheaders, 2)
                assignin('base', genvarname(newData1.colheaders{i}), newData1.data(:,i));
            end
            % Now I execute the plotting of data
            subplot (2,1,1), plot(Score,Allow)
            title([wantedfiles{1} 'Testing to see if it works']);
            subplot (2,1,2), plot(Allow,Score)
            title('Well, did it?');
            % here I save the generated plots, but they don't save where I want them to
            saveas(gcf,[wantedfiles{1} ' did it work.fig']);
            saveas(gcf,[wantedfiles{1} ' did it work.bmp']);
        end
    end
end
%At the end of the script I still was unable to loop over the files that I wanted
rmpath(genpath('c:\data'));
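The traversal described in objectives 1, 3 and 4 can be separated from the importing and plotting. A rough sketch of only that traversal logic, written in Python rather than MATLAB to match the other examples on this page; `plan_outputs` is a hypothetical helper, and reading the spreadsheet and drawing the plot are deliberately left out:

```python
import tempfile
from pathlib import Path

WANTED_TEAMS = ["Angels", "Diamondbacks", "Orioles", "Royals", "Yankees", "Mets", "Giants"]

def plan_outputs(root, teams=WANTED_TEAMS):
    """Return (spreadsheet, fig, bmp) paths for every wanted file that exists."""
    jobs = []
    for subdir in sorted(Path(root).iterdir()):
        if not subdir.is_dir():
            continue
        for team in teams:
            # Build the full path inside this subfolder, not just the bare file name
            spreadsheet = subdir / (team + ".xls")
            if not spreadsheet.is_file():
                continue  # objective 3: skip teams missing from this subfolder
            # objective 4: outputs go next to the spreadsheet they came from
            jobs.append((spreadsheet, subdir / (team + ".fig"), subdir / (team + ".bmp")))
    return jobs

# Demonstration with a throwaway folder tree standing in for C:\data
with tempfile.TemporaryDirectory() as root:
    for folder, team in [("040312", "Angels"), ("040412", "Mets")]:
        (Path(root) / folder).mkdir()
        (Path(root) / folder / (team + ".xls")).touch()
    for spreadsheet, fig, bmp in plan_outputs(root):
        print(spreadsheet.name, "->", fig.parent.name)
```

The point that carries over to the MATLAB code is that both the existence check and the save call need the subfolder path prepended to the file name.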