Reading and writing text to a new file - MATLAB

I have a file that contains a full set of values for some sentences which have been transcribed for a speech recognition program. I've been trying to write some MATLAB code to go through this file and extract the values for each sentence and write them to a new individual file. So instead of having them all in one 'mlf' file I want them in separate files, one per sentence.
For example, my 'mlf' file (which contains all values for all sentences) looks like this:
#!MLF!#
"/N001.lab"
AH
SEE
I
GOT
THEM
MONTHS
AGO
.
"/N002.lab"
WELL
WORK
FOR
LIVE
WIRE
BUT
ERM
.
"/N003.lab"
IM
GOING
TO
SEE
JAMES
VINCENT
MCMORROW
.
etc
So each sentence is delimited by its 'Nxxx.lab' line and the '.'. I need to create a new file for every Nxxx.lab; for example, the file for N001 would just contain:
AH
SEE
I
GOT
THEM
MONTHS
AGO
I've been trying to use fgetl to detect the 'Nxxx.lab' and '.' boundaries, but it doesn't work, as I don't know how to write the content into a new file separate from the 'mlf'.
If anyone can give me any guidance on what sort of approach to use, it would be greatly appreciated!
Cheers!

Try this code (input file test.mlf has to be in the working directory):
%# read the file
filename = 'test.mlf';
fid = fopen(filename,'r');
lines = textscan(fid,'%s','Delimiter','\n','HeaderLines',1);
lines = lines{1};
fclose(fid);
%# find start and stop indices
istart = find(cellfun(@(x) strcmp(x(1),'"'), lines));
istop = find(strcmp(lines, '.'));
assert(numel(istart)==numel(istop) && all(istop>istart),'Check the input file format.')
%# write lines to new files
for k = 1:numel(istart)
    filenew = lines{istart(k)}(2:end-1);
    fout = fopen(filenew,'wt');
    for l = (istart(k)+1):(istop(k)-1)
        fprintf(fout,'%s\n',lines{l});
    end
    fclose(fout);
end
The code assumes that the file names are in double quotes, as in your example. If not, you can find the istart indices based on a pattern. Or just assume that the entries for a new file start from the 2nd line and follow each dot: istart = [1; istop(1:end-1)+1];

You could use a growing cell array to gather the information.
Read one line at a time from the file.
Grab the file name and put it into the first column if it's the first read for the sentence.
If the line read is a period, append it to the string and move the index to a new row in the array. Write the new file with that content.
This bit of code should help you build the cell array while appending to a string within it. I assume reading line by line is not a problem. You can also retain the carriage returns/newlines within the string ('\n').
%% Declare A
A = {}
%% Fill row 1
A(1,1) = {'file1'}
A(1,2) = {'Sentence 1'}
A(1,2) = { strcat(A{1,2}, ', has been appended')}
%% Fill row 2
A(2,1) = {'file2'}
A(2,2) = {'Sentence 2'}
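For comparison, the whole read/split/write loop described above can be sketched in Python (the header line, quoting, and output file names are assumptions based on the example in the question):

```python
# Minimal sketch of the line-by-line splitting approach described above.
# Assumes the example format from the question: a "#!MLF!#" header line,
# quoted "NAME.lab" lines opening each sentence, and a "." ending it.
def split_mlf(path):
    out = None
    with open(path) as mlf:
        next(mlf)                       # skip the "#!MLF!#" header line
        for raw in mlf:
            line = raw.strip()
            if line.startswith('"'):    # e.g. "/N001.lab" opens a new sentence file
                name = line.strip('"').lstrip('/')
                out = open(name, 'w')
            elif line == '.':           # "." closes the current sentence
                out.close()
                out = None
            elif out is not None:
                out.write(line + '\n')  # a word belonging to the sentence
```

Each call then creates one file per sentence in the current directory, containing only the words between the .lab line and the dot.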

While I'm sure you can do this in MATLAB, I would suggest you use Perl to split the original file and then process the individual files using MATLAB.
The following Perl script reads the entire file ("xxx.txt") and writes out the individual files according to the "NAME.lab" lines:
open(my $fh, "<", "xxx.txt") or die "cannot open xxx.txt: $!";
# read the entire file into $contents
# This may not be a good idea if the file is huge.
my $contents = do { local $/; <$fh> };
# iterate over the $contents string and extract the individual
# files
while($contents =~ /"(.*)"\n((.*\n)*?)\./mg) {
# We arrive here with $1 holding the filename
# and $2 the content up to the "." ending the section/sentence.
open(my $fout, ">", $1);
print $fout $2;
close($fout);
}
close($fh);
The multiline regular expression is a bit tricky, but it does the job.
For this sort of text manipulation, Perl is much faster and more convenient; it is a good tool to learn if you process a lot of text.

Related

How to split array with 1 column into multiple columns (numpy array)

I currently have a .txt file that has been loaded into Python as a list and then placed into a NumPy array as a single column with n rows, depending on the file size. The file has rows trimmed off the top and bottom to clean it up. The relevant code is shown below:
# Open the .txt file in the user selected folder
txtFile = glob.glob(folderView + "**/*.txt", recursive = True)
print (" The text files in the folder " + userInput + " are: " + str(txtFile))
# The quantitative output epatemp.txt file is always index position 1 in the glob list.
epaTemp = txtFile[1]
print (epaTemp)
###
quantData = open(epaTemp, 'r')
print (quantData)
result = [line.split(',') for line in quantData.readlines()[24:]]
print (result)
n = 4
result = result[:-n or None]
print (result)
print (" Length of List: " + str(len(result)))
print (result[0])
### Loop through all list components and split each
print (" NUMPY STUFF! ")
list_array = np.array(result)
print (list_array)
The array is then printed as a single column: much cleaner than the 1990s .txt files I am given to process...
My issue is that I am unable to split this single column into multiple columns after defining the opened file as a NumPy array in the final lines of code listed above. I am trying to produce a separate column for everything that is separated by a space (" ") but am having trouble doing so. It might also be important to note that when I use numpy.shape(list_array), it returns nothing, as if the array is not being interpreted properly. The overall goal is to turn this 1-column-by-n-rows array into 7 columns by n rows, one column for every value split by a " " in the text file. If anyone can help me with this issue I'd really appreciate it.
All fixed. I did split the .txt file incorrectly; using line.split() solved the issue. pandas was also necessary, as this was not the best application for NumPy. Thanks everyone for the feedback.
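For anyone hitting the same problem, the fix the poster describes can be sketched like this (the sample rows and the 3-column shape are hypothetical; the real files have 7 columns):

```python
import numpy as np

# Sketch of the fix described above: calling line.split() with no
# argument splits on runs of whitespace, so every row ends up with
# the same number of fields and NumPy can build a proper 2-D array.
lines = ["630.195880143 277165.0 268233.0",
         "630.216312946 277214.0 268270.0"]   # hypothetical sample rows
result = [line.split() for line in lines]     # split on any whitespace
list_array = np.array(result, dtype=float)    # 2-D, not a ragged 1-D object array
print(list_array.shape)                       # (2, 3)
```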

How can I concatenate arrays in a loop? MATLAB

There is a loop;
for i = 1:n
    X_rotate = X.*cos(i*increment) - Y.*sin(i*increment);
    X_rotateh = X_rotate./cos(deg2rad(helix_angle));
    Y_rotate = X.*sin(i*increment) + Y.*cos(i*increment);
    Helix = [X_rotateh(1:K1) ; Y_rotate(1:K1)];
    fileID = fopen('helix_values.txt', 'w');
    fprintf(fileID,'%f %f\n ', Helix);
    fclose(fileID);
end
X and Y are row vectors, and their size depends on the inputs. The iteration number n and the size of X and Y can also be different; as I said, they depend on the inputs.
When I open the text file, it contains only the values from the last iteration of X_rotateh and Y_rotate. I need to collect the values of X_rotateh and Y_rotate, from the first value to the K1-th value of both, for every iteration. I have tried to use the cat command, but it did not give what I want. On top of that, I often run into problems with the length or size of arrays.
Those values should appear in order in the text file, like:
%for first iteration;
X_rotateh(1) Y_rotate(1)
X_rotateh(2) Y_rotate(2)
.
.
X_rotateh(K1) Y_rotate(K1)
%for second iteration;
X_rotateh(1) Y_rotate(1)
X_rotateh(2) Y_rotate(2)
.
.
X_rotateh(K1) Y_rotate(K1)
%so on..
What can I do?
Thanks in advance.
As you said, the text file has results from only the last iteration. That's probably because you are opening the text file with 'w' permission, which writes the new content but also erases the previously stored content in the file. Try using 'a' permission instead; it will append new content without erasing the previous content.
fileID = fopen('helix_values.txt', 'a');
You can find more details with the help fopen command in MATLAB.
Let me know if this solves the problem.
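The same 'w'-versus-'a' behaviour exists in most languages; here is a quick Python illustration of why only the last iteration survives with 'w' (the file names are arbitrary):

```python
import os

# Illustration of the 'w' vs 'a' distinction discussed above.
# Start fresh so the append demo is reproducible.
for name in ('demo_w.txt', 'demo_a.txt'):
    if os.path.exists(name):
        os.remove(name)

for i in range(3):
    with open('demo_w.txt', 'w') as f:   # 'w' truncates on every open
        f.write("iteration %d\n" % i)
    with open('demo_a.txt', 'a') as f:   # 'a' appends to existing content
        f.write("iteration %d\n" % i)

print(open('demo_w.txt').read())  # only "iteration 2" survives
print(open('demo_a.txt').read())  # iterations 0, 1 and 2 all survive
```

Moving the open outside the loop, as the answer suggests with fopen, has the same effect as using append mode here.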

Perl: RegEx: Storing variable lines in array

I'm developing a Perl script, and one of its functions is to detect many lines of data between two markers and store them in an array. I need to store all the lines in an array, grouped separately, with the 1st line in $1, the 2nd in $2, and so on. The problem here is that the number of these lines is variable and will change with each new run.
my @statistics_of_layers_var;
for( <ALL_FILE> ) {
    @statistics_of_layers_var = ($all_file =~ /(Statistics\s+Of\s+Layers)
(?:(\n|.)*)(Summary)/gm );
    print @statistics_of_layers_var;
The given data should be
Statistics Of Layers
Line#1
Line#2
Line#3
...
Summary
How I could achieve it?
You can achieve this without a complicated regular expression. Simply use the range operator (also called flip-flop operator) to find the lines you want.
use strict;
use warnings;
use Data::Printer;
my @statistics_of_layers_var;
while (<DATA>) {
    # use the range operator to find lines between the start and end flags
    if (/^Statistics Of Layers/ .. /^Summary/) {
        # but do not keep the start and the end
        next if m/^Statistics Of Layers/ || m/^Summary/;
        # add the line to the collection
        push @statistics_of_layers_var, $_;
    }
}
p @statistics_of_layers_var;
__DATA__
Some Data
Statistics Of Layers
Line#1
Line#2
Line#3
...
Summary
Some more data
It works by looking at the current line and flipping the block on and off. If /^Statistics Of Layers/ matches the line, it will run the block for each following line until /^Summary/ matches a line. Because those start and end lines are included, we need to skip them when adding lines to the array.
This also works if your file contains multiple instances of this pattern; you'd then get all of those lines in the array.
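For comparison, the same between-two-markers extraction can be sketched in Python (the marker strings are taken from the question):

```python
# Sketch of the marker-based extraction above: collect the lines that
# fall between the start and end markers, excluding the markers themselves.
def lines_between(lines, start, stop):
    collecting = False
    out = []
    for line in lines:
        if line.startswith(start):
            collecting = True      # start marker seen: begin collecting
        elif line.startswith(stop):
            collecting = False     # end marker seen: stop collecting
        elif collecting:
            out.append(line)
    return out

data = ["Some Data", "Statistics Of Layers", "Line#1", "Line#2",
        "Line#3", "Summary", "Some more data"]
print(lines_between(data, "Statistics Of Layers", "Summary"))
# ['Line#1', 'Line#2', 'Line#3']
```

The boolean flag plays the role of the flip-flop operator's internal state.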
Maybe you can try this :
push @statistics_of_layers_var, [$a] = ($slurp =~ /(Statistics\s+Of\s+Layers)
(?:(\n|.)*)(Summary)/gm );

Text File to Lua Array Table

I need to transfer the text from a text file/string into a table of two-element vectors, like this:
Text File:
Gustavo 20
Danilo 20
Dimas 40
Table
Names = {{Gustavo,20},{Danilo,20},{Dimas,40}}
Need help to do this.
You can use io.lines() for this.
vectorarray = {}
for line in io.lines(filename) do
    local w, n = string.match(line, "^(%w+)"), string.match(line, "(%d+)$")
    table.insert(vectorarray, {w, n})
end
This assumes, of course, that the name sits at the absolute start of the line and the number at the absolute end, and that those are the only two fields per line. If you're using the file name in many other places, you could set a global variable for it and use that each time, such as:
arrayfile = "C:/arrayfile.txt"
Either way, make sure you put the correct path in quotation marks in the file name.
A shorter variation of Josh's answer that puts the result directly into the table. This matches alphabetic names followed by at least one space and then digits, but you can change the pattern as needed:
Names = {}
for line in io.lines(filename) do
    Names[ #Names+1 ] = {line:match('(%a+)%s+(%d+)')}
end
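The same name/number parse reads naturally in Python as well, for comparison (sample lines taken from the question):

```python
import re

# Sketch of the parse above: one [name, number] pair per input line,
# matching an alphabetic name, whitespace, then digits.
lines = ["Gustavo 20", "Danilo 20", "Dimas 40"]
names = []
for line in lines:
    m = re.match(r'([A-Za-z]+)\s+(\d+)', line)
    if m:
        names.append([m.group(1), int(m.group(2))])
print(names)
# [['Gustavo', 20], ['Danilo', 20], ['Dimas', 40]]
```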

Best way to compare data from file to data in array in Matlab

I am having a bit of trouble with a specific file I/O task in MATLAB; I am fairly new to it, so some things are still a bit of a mystery to me. The input file is structured like so:
File Name: Processed_kplr003942670-2010174085026_llc.fits.txt
File contents- 6 Header Lines then:
1, 2, 3
1, 2, 3
basically a matrix of about [1443,3] with varying values
now here is the matrix that I'm comparing it to:
[(0123456, 1, 2, 3), (0123456, 2, 3, 4), (etc..)]
Now here is my problem. First, I need to know how to do the file input in a way that lets me compare the ID number (0123456) in the filename with the ID value in the matrix, so that I can compare the other columns of both. I do not know how to achieve this in MATLAB. Furthermore, I need to be able to loop over every point in the matrix that matches up to the specific file. For example:
If I have 15 files ranging from 'Processed_0123456_1' to 'Processed_0123456_15', then I want to be able to read in the values contained in 'Processed_0123456_1' and compare them to ANY row in the matrix that corresponds to that ID (0123456). I don't know if accumarray can be used for this; as I said, I'm not sure.
So the code must:
-Read in file
-Compare file to any point in the matrix with corresponding ID
-Do operations
-Loop over until full list of files in the directory are read in and processed, and output a matrix with the results.
Thanks for any help.
EDIT: Exact File Sample--
Kepler I.D.-----Channel
[1161345]--------[84]
-TTYPE1--------TTYPE8------------TTYPE4
['TIME']---['PDCSAP_FLUX']---['SAP_FLUX']
['BJD - 2454833']--['e-/s']--------['e-/s']
CROWDSAP --- 0.9791
630.195880143,277165.0,268233.0
630.216312946,277214.0,268270.0
630.23674585,277239.0,268293.0
630.257178554,277296.0,268355.0
630.277611357,277294.0,268364.0
630.29804426,277365.0,268441.0
630.318476962,277337.0,268419.0
630.338909764,277403.0,268481.0
630.359342667,277389.0,268463.0
630.379775369,277441.0,268508.0
630.40020817,277545.0,268604.0
There are more entries than what is posted here, but they run on for about 1000 lines, so it is impractical to post them all.
To get the file ID, use regular expressions, e.g.:
filename = 'Processed_0123456_1';
file_id_str = regexprep(filename, 'Processed_(\d+)_\d+', '$1');
file_num_str = regexprep(filename, 'Processed_\d+_(\d+)', '$1')
To read in the file contents, assuming that it's all comma-separated values without a header, use textscan, e.g.,
fid = fopen(filename)
C = textscan(fid, '%f,%f,%f') % Use as many %f specifiers as you have entries per line in the file
textscan also works on strings. So, for example, if your file contents was:
filestr = sprintf('1, 2, 3\n1, 3, 3')
Then running textscan on filestr works like this:
C = textscan(filestr, '%f,%f,%f')
C =
[2x1 double] [2x1 double] [2x1 double]
You can convert that to a matrix using cell2mat:
cell2mat(C)
ans =
1 2 3
1 3 3
You could then repeat this procedure for all files with the same ID and concatenate them into a single matrix, e.g.,
C_full = [];
for (all files with the same ID)
C = do_all_the_above_stuff;
C_full = [C_full; C];
end
Then you can look for what you want in C_full.
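The "for (all files with the same ID)" pseudocode above could look roughly like this in Python, assuming the Processed_<ID>_<n> naming from the question (the directory layout and comma-separated content are assumptions):

```python
import re
from pathlib import Path

# Sketch of the pseudocode loop above: gather the rows of every file
# whose name carries the wanted ID, concatenated into one list of rows.
def rows_for_id(directory, wanted_id):
    pattern = re.compile(r'Processed_(\d+)_\d+')
    rows = []
    for path in sorted(Path(directory).iterdir()):
        m = pattern.match(path.name)
        if m and m.group(1) == wanted_id:            # keep matching IDs only
            for line in path.read_text().splitlines():
                rows.append([float(v) for v in line.split(',')])
    return rows
```

The regex plays the role of the regexprep calls above, and the list of rows corresponds to C_full.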
Update based on updated OP Dec 12, 2013
Here's code to read the values from a single file. Wrap this all in the loop that I mentioned above to go over all your files and read them all into a single matrix.
fid = fopen('/path/to/file');
% Skip over 12 header lines
for kk = 1:12
fgetl(fid);
end
% Read in values to a matrix
C = textscan(fid, '%f,%f,%f');
C = cell2mat(C);
I think your requirements are too complicated to write the whole script here. Nonetheless, I will try to give some pointers to help. Disclaimer: none of this is tested, just my best guess. Please expect syntax errors, etc.; I hope you can figure them out :-)
1) You can use the textscan function with the delimiter option to get data from the lines of your file. Since your format varies as it does, we will probably want to use...
2) ... fgetl to read the first two lines into strings and process them separately using textscan. Such an operation might look like:
fid = fopen('file.txt','r');
tline1 = fgetl(fid);
tline2 = fgetl(fid);
fclose(fid);
C1 = textscan(tline1,'%s %d %s','delimiter','_'); %C1{2} will be the integer we want
C2 = textscan(tline2,'%s %s','delimiter',':'); %C2{2} will be the values we want, but they're still a string, so...
mat = str2num(C2{2});
3) Then, for the rest of the lines, we can use something like dlmread:
mat2 = dlmread('file.txt',',',2,0);
The 2,0 specifies the offset, in 0-based rows and columns, from the start of the file. You may need to look at something like vertcat to stitch mat and mat2 together.
4) The list of files in the directory can be found with the dir command. The filename is an attribute of the structure that's returned:
dirlist = dir;
for i = 1:length(dirlist)
filename = dirlist(i).name
%process your files
end
You can also pass matching strings to dir, like so:
dirlist = dir('*.txt');
which will find all of the files with extension .txt.
5) You can very easily loop through the comparison matrix:
sze = size(comparisonmatrix);
for i = 1:sze(1)
%compare comparisonmatrix(i,1) to C1{2}
%Perform whatever operations you need
end
Hope that helps!
