I have a problem processing Holter ECG (medical) files based on the headers within them. They are binary data files of approximately 20 MB, each starting with a structured header followed by the data. What I would like to achieve, preferably with a VBS script, is:
1) To check all files in the current folder and move the processed ones to the archive folder, based on a specific string in the header:
After the constant string "User Field #20" comes a text string 250-400 characters long that contains a substring like "Wn:", "WN:" or "wn:" (with the colon). If it is there, the file has been processed and goes to the archive.
The two examples hold conclusion strings like:
i)
Analize przeprowadzono w warunkach szpitalnych. Rytm prowadzacy zatokowy z HR sr 70/min ( zakres 45-133/min).
Zarejestrowano 1 SVPB, bez epizodow czestoskurczu. Komorowych zaburzeń rytmu serca nie ma.
PQ i QTc w normie.
WN: zapis prawidłowy bez zaburezń rytmu serca
ii)
Zapis w warunkach szpitalnych. Rytm zatokowy, HR w zakresie 38 /min do 126/min, średnio 66/min; przeciętnie w dzień 58-95/min, w nocy 52-65/min. Nie zarejestrowano SVPB, VPB, pauz>2,5sek.PQ w normie wiekowej. QTc prawidłowe. Dobowy profil rytmu w normie.
Wn: Zapis holterowski bez cech istotnej patologii
Newlines and special and regional characters are possible within the string. I can't tell for sure, but it seems the conclusions string ends with hex 80 (the euro sign).
2) If possible, add a log to the script: plain text, semicolon-separated (so it can be imported into Excel if necessary).
archive_log.txt: Timestamp; Lastname; Firstname; DateRecorded; DateProcessed; ConclusionsLongText (about 250-400 chars).
DateRecorded and DateProcessed are based on the file's creation and last-modified dates.
This is an extension of a problem that was solved some time ago ("Use the contents of a file to rename it"). The problem is different; only the files to handle are the same.
You could use the same strategy Ansgar used in the link you referenced. Read the file contents and then you can use InStr() to search for your string:
' Create the FileSystemObject used to open the file...
Set objFS = CreateObject("Scripting.FileSystemObject")
' Read the entire file into a string...
strContents = objFS.OpenTextFile("file.dat").ReadAll()
' Search for the string "WN:" (case-insensitive)...
intPos = InStr(1, strContents, "WN:", vbTextCompare)
If intPos > 0 Then
    ' Found
End If
This should find the first "WN:" occurrence within the file. Note that this could be some other occurrence, outside the header, so you could also determine the position of "User Field" and compare that to the position of "WN:". For example:
intPosUser = InStr(1, strContents, "User Field", vbTextCompare)
intPosWN = InStr(1, strContents, "WN:", vbTextCompare)
' "User Field" must exist, and "WN:" should be within 400 chars of the first User Field record...
If intPosUser > 0 And intPosWN > intPosUser And intPosWN - intPosUser < 400 Then
    ' Found
End If
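For comparison, the same check can be sketched in Python, reading the file as bytes so that binary content is not interpreted as text. This is only a sketch under assumptions: the marker "User Field #20" and the 400-character window come from the question, while the function name `is_processed` and the choice to treat a missing marker as "not processed" are made up for illustration.

```python
import re

MARKER = b"User Field #20"                       # constant string named in the question
WN_PATTERN = re.compile(rb"wn:", re.IGNORECASE)  # matches Wn:, WN:, wn: and wN:


def is_processed(data: bytes, window: int = 400) -> bool:
    """Return True if a 'WN:' tag appears within `window` bytes after the marker."""
    start = data.find(MARKER)
    if start < 0:
        return False  # assumption: no marker means the file is not processed
    begin = start + len(MARKER)
    # Restrict the search to the window that follows the marker.
    return WN_PATTERN.search(data, begin, begin + window) is not None
```

A file would be checked with something like `is_processed(open(path, "rb").read())`; reading only the header portion would be enough if its maximum size is known.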
I need to rename multiple sequences in multiple fasta files and I found this script in order to do so for a single ID:
from Bio import SeqIO
original_file = "./original.fasta"
corrected_file = "./corrected.fasta"
with open(original_file) as original, open(corrected_file, 'w') as corrected:
    records = SeqIO.parse(original_file, 'fasta')
    for record in records:
        print(record.id)
        if record.id == 'foo':
            record.id = 'bar'
            record.description = 'bar'  # <- Add this line
        print(record.id)
        SeqIO.write(record, corrected, 'fasta')
Each fasta file corresponds to a single organism, but it is not specified in the IDs. I have the original fasta files (because these have been translated) with the same filenames but different directories and include in their IDs the name of each organism.
I wanted to figure out how to loop through all these fasta files and rename each ID in each file with the corresponding organism name.
OK, here is my effort. I had to use my own input folders/files since they were not specified in the question.
/old folder contains files :
MW628877.1.fasta :
>MW628877.1 Streptococcus agalactiae strain RYG82 DNA gyrase subunit A (gyrA) gene, complete cds
ATGCAAGATAAAAATTTAGTAGATGTTAATCTAACTAGTGAAATGAAAACGAGTTTTATCGATTACGCCA
TGAGTGTCATTGTTGCTCGTGCACTTCCAGATGTTAGAGATGGTTTAAAACCTGTTCATCGTCGTATTTT
>KY347969.1 Neisseria gonorrhoeae strain 1448 DNA gyrase subunit A (gyrA) gene, partial cds
CGGCGCGTACCGTACGCGATGCACGAGCTGAAAAATAACTGGAATGCCGCCTACAAAAAATCGGCGCGCA
TCGTCGGCGACGTCATCGGTAAATACCACCCCCACGGCGATTTCGCAGTTTACGGCACCATCGTCCGTAT
MG995190.1.fasta :
>MG995190.1 Mycobacterium tuberculosis strain UKR100 GyrA (gyrA) gene, complete cds
ATGACAGACACGACGTTGCCGCCTGACGACTCGCTCGACCGGATCGAACCGGTTGACATCCAGCAGGAGA
TGCAGCGCAGCTACATCGACTATGCGATGAGCGTGATCGTCGGCCGCGCGCTGCCGGAGGTGCGCGACGG
and an /empty folder.
/new folder contains files :
MW628877.1.fasta :
>MW628877.1
MQDKNLVDVNLTSEMKTSFIDYAMSVIVARALPDVRDGLKPVHRRI
>KY347969.1
RRVPYAMHELKNNWNAAYKKSARIVGDVIGKYHPHGDFAVYGTIVR
MG995190.1.fasta :
>MG995190.1
MTDTTLPPDDSLDRIEPVDIQQEMQRSYIDYAMSVIVGRALPEVRD
my code is :
from Bio import SeqIO
from os import scandir
old = './old'
new = './new'
old_ids_dict = {}
for filename in scandir(old):
    if filename.is_file():
        print(filename)
        for seq_record in SeqIO.parse(filename, "fasta"):
            old_ids_dict[seq_record.id] = ' '.join(seq_record.description.split(' ')[1:3])
print('_____________________')
print('old ids ---> ',old_ids_dict)
print('_____________________')
for filename in scandir(new):
    if filename.is_file():
        sequences = []
        for seq_record in SeqIO.parse(filename, "fasta"):
            if seq_record.id in old_ids_dict.keys():
                print('### ', seq_record.id,' ', old_ids_dict[seq_record.id])
                seq_record.id += '.'+old_ids_dict[seq_record.id]
                seq_record.description = ''
            print('-->', seq_record.id)
            print(seq_record)
            sequences.append(seq_record)
        SeqIO.write(sequences, filename, 'fasta')
Check how it works: it actually overwrites both files in the new folder in place.
As pointed out by @Vovin in his comment, it needs to be adapted to the from-to template of your own files.
I am sure there is more than one way to do this, probably better and more Pythonic than mine; I am learning too. Let us know how it goes.
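The core of the approach above is a lookup table from sequence ID to organism name, built from the header lines of the old files. Here is a dependency-free sketch of just that step. It assumes, as in the sample files, that the organism is the second and third word of each header. `organism_by_id` and `rename_header` are hypothetical helper names, not Biopython functions.

```python
def organism_by_id(fasta_lines):
    """Map each sequence ID to 'Genus species' taken from its header line."""
    mapping = {}
    for line in fasta_lines:
        if line.startswith(">"):
            words = line[1:].split()
            if len(words) >= 3:  # ID plus at least two organism words
                mapping[words[0]] = " ".join(words[1:3])
    return mapping


def rename_header(header, mapping):
    """Append the organism to a '>ID' header, joined with '.' as in the code above."""
    seq_id = header[1:].split()[0]
    if seq_id in mapping:
        return ">" + seq_id + "." + mapping[seq_id]
    return header.rstrip("\n")  # unknown IDs are left unchanged
```

Sequence lines would be copied through untouched; only lines starting with ">" go through `rename_header`.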
I've got a class that parses a CNC file, but I'm having difficulties with trailing "words" on each line of the file.
My code parses all leading "words" until it reaches the final word. It's most noticeable when parsing "Z" values or other Double type values. I've debugged it enough to notice that it successfully parses the numerical value just as it does with "X" and "Y" values, but it doesn't seem to successfully convert it to double. Is there an issue with a character I'm missing or something?
Here's my code:
If IO.File.Exists("Some GCode File.eia") Then
    Dim sr As New IO.StreamReader("Some GCode File.eia")
    Dim i As Integer = 0
    'Read text file
    Do While Not sr.EndOfStream
        'Get the words in the line
        Dim words() As String = sr.ReadLine.Split(" ")
        'iterate through each word
        For i = 0 To words.Length - 1 Step 1
            'iterate through each "registered" keyword. Handled earlier in program
            For Each cmd As String In _registeredCmds.Keys
                'if current word resembles keyword then process
                If words(i) Like cmd & "*" Then
                    _commands.Add(i, _registeredCmds(cmd))
                    'Double check availability of a Type to convert to
                    If Not IsNothing(_commands(i).DataType) Then
                        'Verify enum ScopeType exists
                        If Not IsNothing(_commands(i).Scope) Then
                            'If ScopeType is modal then just set it to True. I'll fix later
                            If _commands(i).Scope = ScopeType.Modal Then
                                _commands(i).DataValue = True
                            Else
                                'Catch errors in conversion
                                Try
                                    'Get the value of the gcode command by removing the "registered" keyword from the string
                                    Dim strTemp As String = words(i).Remove(0, words(i).IndexOf(_commands(i).Key) + _commands(i).Key.Length)
                                    'Save the parsed value into an Object type in another class
                                    _commands(i).DataValue = Convert.ChangeType(strTemp, _commands(i).DataType)
                                Catch ex As Exception
                                    'Log(vbTab & "Error:" & ex.Message)
                                End Try
                            End If
                        Else
                            'Log(vbTab & "Command scope is null")
                        End If
                    Else
                        'Log(vbTab & "Command datatype is null")
                    End If
                    Continue For
                End If
            Next
        Next
        i += 1
    Loop
Else
    Throw New ApplicationException("FilePath provided does not exist! FilePath Provided:'Some GCode File.eia'")
End If
Here's an example of the GCode:
N2930 X-.2187 Y-1.2378 Z-.0135
N2940 X-.2195 Y-1.2434 Z-.0121
N2950 X-.2187 Y-1.249 Z-.0108
N2960 X-.2164 Y-1.2542 Z-.0096
N2970 X-.2125 Y-1.2585 Z-.0086
N2980 X-.207 Y-1.2613 Z-.0079
N2990 X-.2 Y-1.2624 Z-.0076
N3000 X0.
N3010 X12.
N3020 X24.
N3030 X24.2
N3040 X24.2072 Y-1.2635 Z-.0075
N3050 X24.2127 Y-1.2665 Z-.0071
N3060 X24.2167 Y-1.2709 Z-.0064
N3070 X24.2191 Y-1.2763 Z-.0057
N3080 X24.2199 Y-1.2821 Z-.0048
N3090 X24.2191 Y-1.2879 Z-.004
N3100 X24.2167 Y-1.2933 Z-.0032
N3110 X24.2127 Y-1.2977 Z-.0026
N3120 X24.2072 Y-1.3007 Z-.0021
N3130 X24.2 Y-1.3018 Z-.002
N3140 X24.
N3150 X12.
N3160 X0.
N3170 X-.2
N3180 X-.2074 Y-1.3029 Z-.0019
N3190 X-.2131 Y-1.306 Z-.0018
N3200 X-.2172 Y-1.3106 Z-.0016
N3210 X-.2196 Y-1.3161 Z-.0013
N3220 X-.2204 Y-1.3222 Z-.001
N3230 X-.2196 Y-1.3282 Z-.0007
N3240 X-.2172 Y-1.3338 Z-.0004
N3250 X-.2131 Y-1.3384 Z-.0002
N3260 X-.2074 Y-1.3415 Z-.0001
N3270 X-.2 Y-1.3426 Z0.
N3280 X0.
N3290 X12.
N3300 X24.
N3310 X24.2
N3320 G0 Z.1
N3330 Z1.0
N3340 G91 G28 Z0.0
N3350 G90
With regard to the sample CNC code above, you'll notice that X and Y commands with a trailing Z command parse correctly.
EDIT
Per comment, here is a breakdown of _commands()
_commands = SortedList(Of Integer, Command)
Command is a class with the following properties:
Scope as Enum ScopeType
Name as String
Key as String
DataType as Type
DataValue as Object
EDIT: Solution!
Figured out what was wrong. The arrays that make up the construction of the classes were essentially being passed a reference to the "registered" array of objects from the Command class. Therefore every time I parsed the value out of the "word" each line, I was overwriting the DataValue in the Command object.
The solution was to declare a new 'Command' object with every parse and append it to the proper array.
Here's my shorthand:
...
For I = 0 To words.Length - 1 Step 1
    'iterate through each "registered" keyword. Handled earlier in program
    For Each cmd As String In _registeredCmds.Keys
        'if current word resembles keyword then process
        If words(I) Like cmd & "*" Then
            'NEW!!! Declare unassigned Command object
            Dim com As Command
            ' ****** New elongated logic double checking existence of values.....
            If _registeredCmds(cmd).Scope = ScopeType.Modal Then
                'assign Command object to previously declared variable com
                com = New Command() 'Arguments are actually passed here now to ensure items are transferred
            Else
                'Parse and pass DataValue from this word
                com = New Command() 'Arguments are actually passed here now to ensure items are transferred
            End If
            'New sub to add Command object to local array
            Add(com)
            Continue For
        End If
    Next
Next
...
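The pitfall described above, where every parsed word writes into the same registered object, is not specific to VB.NET. Below is a minimal Python sketch of the bug and the fix. The `Command` class here is a hypothetical stand-in with only `key` and `data_value` fields, and the `registered` table is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Command:
    key: str
    data_value: object = None


# Hypothetical table of "registered" commands, one shared object per keyword.
registered = {"X": Command("X"), "Z": Command("Z")}


def parse_shared(words):
    """Buggy: stores references to the registered objects themselves."""
    out = []
    for word in words:
        cmd = registered[word[0]]          # same object every time for a given key
        cmd.data_value = float(word[1:])   # overwrites the value seen by earlier entries
        out.append(cmd)
    return out


def parse_fresh(words):
    """Fixed: a new Command object is created for every parsed word."""
    return [Command(word[0], float(word[1:])) for word in words]
```

With `parse_shared`, two "X" words end up as the same object holding only the last value; with `parse_fresh`, each word keeps its own value.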
I have TWO large CSV files (around 1 GB each). They are related to each other (the ID acts like a foreign key). The structure is simple, line by line, but CSV cells with a line break inside the value string can appear:
37373;"SOMETXT-CRCF or other other line break-";3838383;"sasa ssss"
One file is the P file and the other is the T file. T is about 70% of the size of P (P > T). I must cut them into smaller parts since they are too big for the program I have to import them into. I cannot simply use split -l 100000 since I would lose the ID=ID relations, which must be preserved! The relation can be 1:1, 2:3, 4:6 or 1:5. So naive file splitting is not an option; we must check the place where we start a new file. Below is an example with a simplified CSV structure and the place where I want the files to be cut (the lines above the cut go to a separate P|T__00x file, and we continue until P or T ends). Lines are sorted in both files, so there is no need to search for IDs across the whole file!
File "P" (empty lines for clearness):
CSV_FILE_P;HEADER;GOES;HERE
564788402;1;1;"^";"01"
564788402;2;1;"^";"01"
564788402;3;1;"^";"01"
575438286;1;1;"^";"01"
575438286;1;1;"^";"01"
575438286;2;1;"37145859"
575438286;2;1;"37145859"
575438286;3;1;"37145859"
575438286;3;1;"37145859"
575439636;1;1;"^"
575439636;1;1;"^"
# lets say ~100k line limit of file P is somewhere here and no more 575439636 ID lines , so we cut.
575440718;1;1;"^"
575440718;1;1;"^"
575440718;2;1;"10943890"
575440718;2;1;"10943890"
575440718;3;1;"10943890"
575440718;3;1;"10943890"
575441229;1;1;"^";"01"
575441229;1;1;"^";"01"
575441229;2;1;"4146986"
575441229;2;1;"4146986"
575441229;3;1;"4146986"
575441229;3;1;"4146986"
File T (empty lines for clearness)
CSV_FILE_T;HEADER;GOES;HERE
564788402;4030000;1;"0204"
575438286;6102000;1;"0408"
575438286;6102000;0;"0408"
575439636;7044010;1;"0408"
575439636;7044010;0;"0408"
# we must cut here since bigger file "P" 100k limit has been reached
# and we end here because 575439636 ID lines are over.
575440718;6063000;1;"0408"
575440718;6063000;0;"0408"
575441229;8001001;0;"0408"
575441229;8001001;1;"0408"
Can you please help with splitting those two files into many separate files of 100 000 (or so) lines each, T_001 with a corresponding P_001 and so on, so that the IDs match between file parts? I believe awk will be the best tool, but I do not have much experience with it. One last thing: the CSV header should be preserved in each of the files.
I have a powerful AIX machine to cope with that (Linux is also possible, since AIX commands are sometimes limited).
You can parse the leading IDs with awk and then check whether the current ID is the same as the last one. Only when it is different are you allowed to close the current output file and open a new one. At that point, record the ID so the next file can be tracked against it. You can track this ID in a text file or in memory. I've done it in memory, but with big files like this you could run into trouble. It's easier to keep track in memory than to open multiple files and read from them.
Then you just need to distinguish between the first file (output and recording) and the second file (output and using the prerecorded data).
The code does a very brute-force check for a possible CRLF inside a field: if the line does not begin with what looks like an ID, it outputs the line and does no further testing on it. That is a problem if the CRLF is followed immediately by a number and a semicolon! This might be unlikely though...
Run with: gawk -f parser.awk P T
I don't promise this works!
BEGIN {
    MAXLINES = 100000
    block = 0
    trackprevious = 0
}
FNR == 1 {
    # First line is CSV header
    csvheader = $0
    if (FILENAME == "-")
    {
        _error = 1
        print "Error: Need filename on command line"
        exit 1
    }
    if (trackprevious)
    {
        _error = 1
        print "Only one file can track another"
        exit 1
    }
    if (block >= 1)
    {
        # New file - track previous output...
        close(outputname)
        Tracking[block] = idval
        print "Output for " FILENAME " tracks previous file"
        trackprevious = 1
    }
    else
    {
        print "Chunking output (" MAXLINES ") for " FILENAME
    }
    linecount = 0
    idval = 0
    block = 1
    outputprefix = FILENAME "_block"
    outputname = sprintf("%s_%03d", outputprefix, block)
    print csvheader > outputname
    next
}
/^[0-9]+;/ {
    linecount++
    newidval = $0
    sub(/;.*$/, "", newidval)
    newidval = newidval + 0 # make a number
    startnewfile = 0
    if (trackprevious && (idval != newidval) && (idval == Tracking[block]))
    {
        startnewfile = 1
    }
    else if (!trackprevious && (idval != newidval) && (linecount > MAXLINES))
    {
        # Last ID value found before new file:
        Tracking[block] = idval
        startnewfile = 1
    }
    if (startnewfile)
    {
        close(outputname)
        block++
        outputname = sprintf("%s_%03d", outputprefix, block)
        print csvheader > outputname
        linecount = 1
    }
    print $0 > outputname
    idval = newidval
    next
}
{
    linecount++
    print $0 > outputname
}
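The same two-pass idea can be sketched in Python on in-memory lists of data lines, with the CSV header already removed. This is an illustration under assumptions, not a port of the awk script: it assumes numeric IDs sorted in ascending order, ignores fields that contain line breaks, and the function names `split_primary` and `split_secondary` are made up. The first pass cuts P only where the ID changes and records the last ID of each chunk. The second pass cuts T at those same IDs.

```python
def split_primary(lines, max_lines):
    """Split sorted data lines into chunks, cutting only where the ID changes.

    Returns (chunks, boundaries); boundaries[i] is the last ID in chunks[i].
    """
    chunks, boundaries, current, prev_id = [], [], [], None
    for line in lines:
        line_id = line.split(";", 1)[0]
        # Cut only when the limit is reached AND a new ID starts.
        if current and line_id != prev_id and len(current) >= max_lines:
            chunks.append(current)
            boundaries.append(prev_id)
            current = []
        current.append(line)
        prev_id = line_id
    if current:
        chunks.append(current)
        boundaries.append(prev_id)
    return chunks, boundaries


def split_secondary(lines, boundaries):
    """Split the second file so that chunk i holds the IDs up to boundaries[i]."""
    if not boundaries:
        return [list(lines)] if lines else []
    chunks = [[] for _ in boundaries]
    i = 0
    for line in lines:
        line_id = int(line.split(";", 1)[0])
        # Move on once the ID passes the last ID of the current primary chunk.
        while i < len(boundaries) - 1 and line_id > int(boundaries[i]):
            i += 1
        chunks[i].append(line)
    return chunks
```

For files of this size, the lists would be replaced by streaming reads and writes, and the header line would be written at the top of every output file.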
My Python 3 program lets people choose from a list of employee names.
The data held in each text file looks like this: ('larry', 3, 100)
(being the person's name, weeks worked and payment)
I need a way to assign each part of the text file to a new variable,
so that the user can enter a new amount of weeks and the program calculates the new payment.
Below is my code and attempt at figuring it out.
import os
choices = [f for f in os.listdir(os.curdir) if f.endswith(".txt")]
print (choices)
emp_choice = input("choose an employee:")
file = open(emp_choice + ".txt")
data = file.readlines()
name = data[0]
weeks_worked = data[1]
weekly_payment= data[2]
new_weeks = int(input ("Enter new number of weeks"))
new_payment = new_weeks * weekly_payment
print (name + "will now be paid" + str(new_payment))
currently you are assigning the first three lines from the file to name, weeks_worked and weekly_payment. but what you want (i think) is to separate a single line, formatted as ('larry', 3, 100) (does each file have only one line?).
so you probably want code like:
from re import compile
# your code to choose file
line_format = compile(r"\s*\(\s*'([^']*)'\s*,\s*(\d+)\s*,\s*(\d+)\s*\)")
file = open(emp_choice + ".txt")
line = file.readline() # read the first line only
match = line_format.match(line)
if match:
    # groups() returns strings; convert the numbers with int() before doing arithmetic
    name, weeks_worked, weekly_payment = match.groups()
else:
    raise Exception('Could not match %s' % line)
# your code to update information
the regular expression looks complicated, but is really quite simple:
\(...\) matches the parentheses in the line
\s* matches optional spaces (it's not clear to me if you have spaces or not
in various places between words, so this matches just in case)
\d+ matches a number (1 or more digits)
[^']* matches anything except a quote (so matches the name)
(...) (without the \ backslashes) indicates a group that you want to read
afterwards by calling .groups()
and these are built from simpler parts (like * and + and \d) which are described at http://docs.python.org/2/library/re.html
if you want to repeat this for many lines, you probably want something like:
name, weeks_worked, weekly_payment = [], [], []
for line in file.readlines():
    match = line_format.match(line)
    if match:
        name.append(match.group(1))
        weeks_worked.append(match.group(2))
        weekly_payment.append(match.group(3))
    else:
        raise ...
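putting the pieces together, here is a small self-contained sketch that parses one line and computes the new payment. it assumes, as the code in the question does, that the third field is the weekly payment; `parse_employee` and `new_payment` are made-up helper names. the key step is converting the matched strings with int() before multiplying.

```python
import re

LINE_FORMAT = re.compile(r"\s*\(\s*'([^']*)'\s*,\s*(\d+)\s*,\s*(\d+)\s*\)")


def parse_employee(line):
    """Return (name, weeks_worked, weekly_payment) with the numbers as ints."""
    match = LINE_FORMAT.match(line)
    if not match:
        raise ValueError("Could not match %r" % line)
    name, weeks, payment = match.groups()
    return name, int(weeks), int(payment)


def new_payment(line, new_weeks):
    """Payment for `new_weeks`, treating the third field as the weekly rate."""
    _, _, weekly_payment = parse_employee(line)
    return new_weeks * weekly_payment
```

without the int() conversion, multiplying an int by the matched string would repeat the string instead of doing arithmetic.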
Okedoke... I have an Excel spreadsheet with a filename in column A. The filenames listed in column A appear in one or more text files in one or more source directories.
I need Excel to search the text files recursively and return, in column B, the path of each file that contains the filename specified in column A. If more than one file matches, the additional paths go into column C, and so on.
The Excel sheet would be
__________________________________
__|______A___________|______B_____|
1 | filename.avi | |
2 | another_file.flv | |
The text files to search are in multiple directories under C:\WebDocs\ and are DokuWiki pages. Some are quite short, such as this page, which would need to be returned:
===== Problem Description =====
Reopen a closed bank reconciliation.
===== Solution =====
Demonstration of the tool box routine that allows reposting of the bank rec.
{{videos:bank_rec_reopen1006031511.flv|}}
===== Additional Information -cm =====
You may have noticed that in the video there is a number to the right of the bank account number. In this case it was a 0. That indicates department 0 which is all departments. You get the department 0 if you have all departments combined using the option in the bank set up called "One Bank for All Departments". If this setting is not checked then when you create your starting bank rec for each department you will get a 1 to the right of the bank rec for department 1 and so on. You should normally only have a 0, or have numbers 1 or greater. If you have both, then the method was changed after the initial bank rec was made. You just have to be aware of this as you move forward. As always backup before you make any changes.
There are other, much longer pages in the directories being searched that do not contain videos. The format is the same: plain text, where ==== marks headings, and the pages may contain links to other pages/sites.
I did find an existing VBA script that sort of does what I need it to. It does not recurse and returns too much information, date/time stamp for instance, where all I need is the path.
Private Sub CommandButton1_Click()
    Dim sh As Worksheet, rng As Range, lr As Long, fPath As String
    Set sh = Sheets(1) 'Change to actual
    lstRw = sh.Cells.Find(What:="*", After:=sh.Range("A1"), LookAt:=xlPart, LookIn:=xlFormulas, SearchOrder:=xlByRows, SearchDirection:=xlPrevious, MatchCase:=False).Row
    Set rng = sh.Range("A2:A" & lstRw)
    With Application.FileDialog(msoFileDialogFolderPicker)
        .Show
        fPath = .SelectedItems(1)
    End With
    If Right(fPath, 1) <> "\" Then
        fPath = fPath & "\"
    End If
    fwb = Dir(fPath & "*.*")
    x = 2
    Do While fwb <> ""
        For Each c In rng
            If InStr(LCase(fwb), LCase(c.Value)) > 0 Then
                Worksheets("Sheet2").Range("C" & x) = fwb
                Set fs = CreateObject("Scripting.FileSystemObject")
                Set f = fs.GetFile(fwb)
                Worksheets("Sheet2").Range("D" & x) = f.DateLastModified
                Worksheets("Sheet2").Range("B" & x) = f.Path
                Worksheets("sheet2").Range("A" & x) = c.Value
                Columns("A:D").AutoFit
                Set fs = Nothing
                Set f = Nothing
                x = x + 1
            End If
        Next
        fwb = Dir
    Loop
    Set sh = Nothing
    Set rng = Nothing
    Sheets(2).Activate
End Sub
My attempts at modification so far have generally resulted in a broken script and have thus led me here asking for help.
Thanks,
Simon
Downloaded the win32 port of the GNU tool grep from http://gnuwin32.sourceforge.net/
Saved the list of video files into a plain text file instead of using a spreadsheet.
grep --file="C:\file_containing video_file_names.txt" -R --include=*.txt C:\Path\To\Files >grep_output.txt
The information written to the grep_output.txt file looked like
C:\wiki_files\wiki\pages/my_bank_rec_page.txt:{{videos:bank_rec_reopen1006031511.flv|}}
So there was the path to the file containing the video name and the video name on one line.
Imported the grep_output.txt file into a new Excel workbook.
Used regular formulae to do the following
Split Column A at the "/" to give the path in Column A and the page and video information in Column B
Split the data in Column B at the ":{{" characters, leaving the page name in Column B and the video information in Column C
Stripped the :{{ and |}} from the front and rear of the string in Column C
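The three splitting steps above can also be done in a script instead of with worksheet formulas. A minimal Python sketch follows. It assumes every grep output line has exactly the shape shown earlier, with the matched text starting at "{{", and `split_grep_line` is a made-up name. Splitting on ":{{" rather than on the first ":" matters because the Windows drive letter also contains a colon.

```python
def split_grep_line(line):
    """Split 'dir/page.txt:{{videos:name|}}' into (directory, page, video)."""
    # Separate the file location from the matched wiki markup.
    location, _, markup = line.strip().partition(":{{")
    # The last "/" separates the directory from the page file name.
    directory, _, page = location.rpartition("/")
    # Drop the closing "|}}" of the DokuWiki media link, if present.
    if markup.endswith("|}}"):
        markup = markup[:-3]
    return directory, page, markup
```

Lines where the match is preceded by other text on the same line would need a more tolerant split, for example on the first "{{" after the ".txt:" part.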
From my limited experience, it seems you'd want to perform 4 tasks.
1) Loop through Directories
2) Loop through files per directory (Good idea to keep the filename in a variable)
3) Test the text file for values. I would suggest clearing a "scribble sheet", importing the file into it and running a check, e.g.
Sheets("YourScratchPatch").Select
Application.CutCopyMode = False
' yourpath and yourfile are String variables holding the folder and the text file name
With ActiveSheet.QueryTables.Add(Connection:="TEXT;" & yourpath & yourfile, Destination:=Range("A1"))
    .FieldNames = True
    .RowNumbers = False
    .FillAdjacentFormulas = False
    .PreserveFormatting = True
    .RefreshOnFileOpen = False
    .RefreshStyle = xlInsertDeleteCells
    .SavePassword = False
    .SaveData = True
    .RefreshPeriod = 0
    .TextFilePromptOnRefresh = False
    .TextFilePlatform = 850
    .TextFileStartRow = 2
    .TextFileParseType = xlDelimited
    .TextFileTextQualifier = xlTextQualifierDoubleQuote
    .TextFileConsecutiveDelimiter = False
    .TextFileTabDelimiter = True
    .TextFileSemicolonDelimiter = False
    .TextFileCommaDelimiter = True
    .TextFileSpaceDelimiter = False
    .TextFileColumnDataTypes = Array(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
    .TextFileTrailingMinusNumbers = True
    .Refresh BackgroundQuery:=False
End With
4) if values are found, write the file name variable to the index sheet.
I'm sure there should be better (arrays?) ways to do the comparison check as well, but it depends on what's inside the text file (i.e. just one file name?)
More info on the text file structure would be useful. Hope this helps.
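For reference, the four tasks listed above (loop through directories, loop through files, test each file, record the hits) can be sketched outside Excel in a few lines of Python. This is a sketch under assumptions: the pages are plain .txt files readable as UTF-8, matching is case-insensitive, and the function name `find_pages` is made up. The resulting paths could then be pasted or imported into columns B, C and so on.

```python
import os


def find_pages(root, names, extension=".txt"):
    """Map each name to the sorted paths of text files under `root` that mention it."""
    hits = {name: [] for name in names}
    # os.walk visits `root` and every directory below it.
    for folder, _subdirs, files in os.walk(root):
        for filename in files:
            if not filename.lower().endswith(extension):
                continue
            path = os.path.join(folder, filename)
            with open(path, encoding="utf-8", errors="replace") as handle:
                text = handle.read().lower()
            for name in names:
                if name.lower() in text:
                    hits[name].append(path)
    return {name: sorted(paths) for name, paths in hits.items()}
```

A call such as `find_pages(r"C:\WebDocs", ["filename.avi", "another_file.flv"])` would return one list of paths per filename from column A.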