Reading and processing a large text file in Matlab

I'm trying to read a large text file (a few million lines) into Matlab. Initially I was using importdata(file_name), which seemed like a concise solution. However, I need to use Matlab 7 (yeah, I know it's old) and it seems importdata isn't supported. As such I tried the following:
while ~feof(fid)
fline = fgetl(fid);
fdata{1,lno} = fline ;
lno = lno + 1;
end
But this is really slow. I'm guessing it's because it's resizing the array on each iteration. Is there a better way of doing this? Bear in mind that the first 20 lines of the input data are string type data and the remainder of the data is 3 to 6 columns of hexadecimal values.

One option is fread, although you will have to do some reshaping afterwards.
But, as was mentioned, this essentially locks you into a rectangular import. So another option would be to use textscan. As I mention in another note, I'm not 100% sure when it was implemented; all I know is that you don't have importdata().
fid = fopen('textfile.txt')
Out = textscan(fid,'%s','delimiter',sprintf('\n'));
fclose(fid)
With textscan you get a cell array with one string per line, which you can then manipulate however you want. And as I say in my comments, it no longer matters whether the lines are the same length or not. Now you can parse the cell array more quickly. But as gnovice mentions (and he also has a very elegant solution), you may have to concern yourself with memory requirements.
The one thing you want to avoid in MATLAB whenever you can is looping structures. They are fast in C/C++ etc., but in MATLAB they are often the slowest way of getting where you are going.
EDIT: Just looked it up, and it looks like textscan was introduced in version 7 (R14), so if that's what you have, you should be good to use it.

I see two options:
Rather than growing by 1 every single time, you could e.g. double the size of your array only when necessary. This massively reduces the number of reallocations required.
Do a two-pass approach. The first pass simply counts the number of lines, without storing them. The second pass actually fills in the array (which has been preallocated to the correct size).
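As a rough illustration of the two options above, here is a sketch in Python rather than MATLAB. The function names read_lines_doubling and read_lines_two_pass are made up for this example, and Python lists already grow efficiently on their own, so the explicit bookkeeping is only there to make the idea visible. The same pattern should carry over to a preallocated MATLAB cell array.

```python
import io


def read_lines_doubling(stream):
    """Option 1: double the buffer only when it is full."""
    buf = [None] * 1          # preallocated storage (hypothetical starting size)
    count = 0                 # number of slots actually used
    reallocations = 0
    for line in stream:
        if count == len(buf):
            buf.extend([None] * len(buf))   # double the capacity
            reallocations += 1
        buf[count] = line.rstrip("\n")
        count += 1
    return buf[:count], reallocations


def read_lines_two_pass(stream):
    """Option 2: count the lines first, then fill a preallocated buffer."""
    n = sum(1 for _ in stream)   # first pass: count only, store nothing
    stream.seek(0)               # rewind for the second pass
    buf = [None] * n
    for i, line in enumerate(stream):
        buf[i] = line.rstrip("\n")
    return buf


# In-memory stand-in for a file with 1000 lines
sample = io.StringIO("".join(f"line {i}\n" for i in range(1000)))
lines_a, n_realloc = read_lines_doubling(sample)
sample.seek(0)
lines_b = read_lines_two_pass(sample)
print(len(lines_a), n_realloc, lines_a == lines_b)
```

With doubling, 1000 lines need only a handful of reallocations instead of one per line; the two-pass version needs none, at the cost of reading the file twice.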

One solution is to read the entire contents of the file as a string of characters with FSCANF, split the string into individual cells at the points where newline characters occur using MAT2CELL, remove extra white space on the ends with STRTRIM, then process the string data in each cell as needed. For example, using this sample text file 'junk.txt':
hi
hello
1 2 3
FF 00 FF
12 A6 22 20 20 20
FF FF FF
The following code will put each line in a cell of a cell array cellData:
>> fid = fopen('junk.txt','r');
>> strData = fscanf(fid,'%c');
>> fclose(fid);
>> nCharPerLine = diff([0 find(strData == char(10)) numel(strData)]);
>> cellData = strtrim(mat2cell(strData,1,nCharPerLine))
cellData =
'hi' 'hello' '1 2 3' 'FF 00 FF' '12 A6 22 20 20 20' 'FF FF FF'
Now if you want to convert all of the hexadecimal data (lines 3 through 6 in my sample data file) from strings to vectors of numbers, you can use CELLFUN and SSCANF like so:
>> cellData(3:end) = cellfun(@(s) {sscanf(s,'%x',[1 inf])},cellData(3:end));
>> cellData{3:end} %# Display contents
ans =
1 2 3
ans =
255 0 255
ans =
18 166 34 32 32 32
ans =
255 255 255
NOTE: Since you are dealing with such large arrays, you will have to be mindful of the amount of memory being used by your variables. The above solution is vectorized, but may take up a lot of memory. You may have to overwrite or clear large variables like strData when you create cellData. Alternatively, you could loop over the elements in nCharPerLine and individually process each segment of the larger string strData into the vectors you need, which you can preallocate now that you know how many lines of data you have (i.e. nDataLines = numel(nCharPerLine)-nHeaderLines;).
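The looping alternative mentioned in the note can be sketched as follows. This is Python, not MATLAB, and parse_hex_lines and n_header_lines are names invented for the illustration; it only shows the shape of the idea (keep the header lines as strings, preallocate one slot per data line, convert each data line from hexadecimal).

```python
def parse_hex_lines(lines, n_header_lines):
    """Keep header lines as strings; convert the rest to lists of ints."""
    header = lines[:n_header_lines]
    n_data_lines = len(lines) - n_header_lines
    data = [None] * n_data_lines            # one preallocated slot per data line
    for i, line in enumerate(lines[n_header_lines:]):
        # int(tok, 16) plays the role of sscanf(s, '%x') for a single token
        data[i] = [int(tok, 16) for tok in line.split()]
    return header, data


# Same content as the sample file junk.txt above
text = "hi\nhello\n1 2 3\nFF 00 FF\n12 A6 22 20 20 20\nFF FF FF"
header, data = parse_hex_lines(text.split("\n"), n_header_lines=2)
print(header)
print(data)
```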

Related

SPSS recoding variables data from multiple variables into boolean variables

I have 26 variables, each of which contains numbers ranging from 1 to 61. For each value (1, 2, etc.) I want a new variable that contains the number 1 if that value occurs in any of the 26 variables. If the value does not occur, the new variable should contain 2.
So 26 variables with data like:
1 15 28 39 46 1 12 etc.
And I want 61 variables with:
1 2 1 2 2 1 etc.
I have been reading about creating vectors, loops, do if's etc but I can't find the right way to code it. What I have done is just creating 61 variables and writing
do if V1=1 or V2=1 or (etc until V26).
recode newV1=1.
end if.
exe.
**repeat this for all 61 variables.
recode newV1 to newV61(missing=2).
So this is a lot of code and quite a detour from what I imagine it could be.
Anyone who can help me out with this one? Your help is much appreciated!
noumenal is correct: you could do it with two loops. Another way, though, is to use the original value as an index into a VECTOR, writing a 1 at that position and then recoding every element that was never set.
To illustrate, first I make some fake data (with 4 original variables instead of 26) named X1 to X4.
*Fake Data.
SET SEED 10.
INPUT PROGRAM.
LOOP Id = 1 TO 20.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
VECTOR X(4,F2.0).
LOOP #i = 1 TO 4.
COMPUTE X(#i) = TRUNC(RV.UNIFORM(1,62)).
END LOOP.
EXECUTE.
Now what this code does is create four vector sets to go along with each variable, then uses DO REPEAT to actually refer to the VECTOR stub. Then finishes up with RECODE - if it is missing it should be coded a 2.
VECTOR V1_ V2_ V3_ V4_ (61,F1.0).
DO REPEAT orig = X1 TO X4 /V = V1_ V2_ V3_ V4_.
COMPUTE V(orig) = 1.
END REPEAT.
RECODE V1_1 TO V4_61 (SYSMIS = 2).
It is a little painful, as for the original VECTOR command you need to write out all of the stubs, but then you can copy-paste that into the DO REPEAT subcommand (or make a macro to do it for you).
For a more simple illustration, if we have our original variable, say A, that can take on integer values from 1 to 61, and we want to expand to our 61 dummy variables, we would then make a vector and then access the location in that vector.
VECTOR DummyVec(61,F1.0).
COMPUTE DummyVec(A) = 1.
For a record with A = 10, DummyVec10 will equal 1, and all the other DummyVec variables will still be system missing by default. No need to use DO IF for 61 values.
The rest of the code is just extra to do it in one swoop for multiple original variables.
This should do it:
do repeat NewV=NewV1 to NewV61/vl=1 to 61.
compute NewV=any(vl,v1 to v26).
end repeat.
EXPLANATION:
This syntax will go through values 1 to 61, for each one checking whether any of the variables v1 to v26 has that value. If any of them do, the right NewV will receive the value of 1. If none of them do, the right NewV will receive the value of 0.
Just make sure v1 to v26 are consecutively ordered in the file. If not, then change to:
compute NewV=any(vl,v1, v2, v3, v4 ..... v26).
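For readers who think more easily outside SPSS syntax, the logic of the any() approach can be sketched in Python. The function name indicators and its parameters are hypothetical; the present/absent codes default to 1 and 2 as requested in the question, whereas the SPSS any() function itself yields 1 and 0.

```python
def indicators(case_values, n_categories=61, present=1, absent=2):
    """For each value 1..n_categories, report whether it occurs in the case."""
    seen = set(case_values)
    return [present if v in seen else absent
            for v in range(1, n_categories + 1)]


# One case, using the example values from the question
case = [1, 15, 28, 39, 46, 1, 12]
new_vars = indicators(case)
print(new_vars[:15])
```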
You need a nested loop: two loops - one outer and one inner.

MATLAB append two arrays(one string cell and one numeric vector) into a .csv

I have two arrays, X and Y
X has numeric data like this:
12342
1355
2324
...
Y has strings like this:
APPLE
METRO
BLANKET
...
I want to append these arrays into a pre-existing .csv in this format:
X(1,1), Y{1,1}
X(2,1), Y{2,1}
X(3,1), Y{3,1}
...
How should I do so?
Unfortunately, writing cell arrays to CSV files can be tricky. One way or another you need to loop through so you can access the data within each cell. The reason cells are not written automatically is that a cell can contain anything, such as a whole matrix of data.
The most versatile way that will allow for any types of cell/matrix combos is to use fprintf - you just need to specify a format.
http://au.mathworks.com/help/matlab/ref/fprintf.html
nrows = size(X,1);
fid = fopen('newfile.csv','a'); % 'a' appends to the existing file; 'w' would overwrite it
for aa = 1 : nrows
fprintf(fid,'%f,%s\r\n', X(aa,1), Y{aa,1});
end
fclose(fid);
You can use %d instead of %f if the numbers are integers rather than floating-point values.
I have found fprintf to be faster than any of the alternatives for this sort of situation. It is a little more low-level but allows you to write anything you like to file.
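The same append-row-by-row idea, sketched in Python for comparison. The file name existing.csv and the helper append_rows are assumptions made for this example; the point is only that the file is opened in append mode and each numeric value is paired with its string.

```python
import csv
import os
import tempfile


def append_rows(path, numbers, labels):
    """Append one 'number,label' row per pair to an existing CSV file."""
    with open(path, "a", newline="") as fh:   # "a" = append, keep old content
        writer = csv.writer(fh)
        for num, label in zip(numbers, labels):
            writer.writerow([num, label])


# Create a small pre-existing file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "existing.csv")
with open(path, "w", newline="") as fh:
    fh.write("1,HEADER\r\n")

append_rows(path, [12342, 1355, 2324], ["APPLE", "METRO", "BLANKET"])

with open(path, newline="") as fh:
    rows = list(csv.reader(fh))
print(rows)
```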

Filling an array, A$(X,X) in Commodore BASIC?

I am trying to fill A$(X,X) with "."s in Commodore BASIC.
This is what I have so far....but I'm not really sure what to do concerning ASCII values and such. Any commentary?
INPUT A$
FOR I = 0 TO X = DIM A$(X,X)
A$(".",x)
I'm still EXTREMELY confused on PET BASIC's API... Any suggestions would be GREATLY appreciated.
My answers are based around a youth in front of a Commodore 64 and may not be completely correct for the PET series. But seeing as you haven't had any other answers yet I'll give it a bash.
In the first line of your code you are requesting a string from the user and storing it in A$. The dollar sign denotes the variable is a string. In the second line, you are redefining A$ as a two dimensional array. The dimensions are both X which hasn't been defined. I don't recall DIM having a return value but I could be wrong.
The function to get an ASCII value from a char is ASC(), and to convert back you use CHR$(), like so:
10 NUMA = ASC("A"): REM NUMA now contains 65
20 CHARA$ = CHR$(NUMA): REM CHARA$ now contains "A"
Something you should know is that these functions use "PET ASCII", which is slightly different from ASCII. It never caused me any problems, but it's something to remember.
FOR loops always have a NEXT to end the block, like so:
10 FOR A = 1 TO 10
20 PRINT A: REM Displays series of numbers.
30 NEXT
I'm not entirely clear what you're trying to achieve but hopefully I have at least given you enough pieces to work it out. From what I understand, you need something like:
10 INPUT "Please enter a number"; X
20 DIM A$(X, X)
30 FOR I = 0 TO X
40 FOR J = 0 TO X
50 A$(I, J) = "."
60 NEXT
70 NEXT
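For comparison, the same nested fill written as a Python sketch (make_grid is a made-up name). One detail it makes explicit: in Commodore BASIC, DIM A$(X,X) allows indices 0 through X, so the array holds (X+1) by (X+1) elements, which is why the loops above run from 0 to X.

```python
def make_grid(x, fill="."):
    """Build an (x+1) by (x+1) grid, mirroring indices 0..x in each dimension."""
    return [[fill for _ in range(x + 1)] for _ in range(x + 1)]


grid = make_grid(3)
for row in grid:
    print("".join(row))
```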

MATLAB: Write strings and numbers of variables in files

I have following data:
a=[3 1 6]';
b=[2 5 2]';
c={'ab' 'bc' 'cd'}';
I now want to make a file which looks like this (the delimiter is tab):
ab 3 2
bc 1 5
cd 6 2
my solution (with a loop) is:
a=[3 1 6]';
b=[2 5 2]';
c={'ab' 'bc' 'cd'}';
c=cell2mat(c);
fid=fopen('filename','w');
for i=1:numel(b)
fprintf(fid,'%s\t%u\t%u\n',c(i,:),a(i),b(i));
end
fclose(fid);
Is there a possibility without loop and/or the possibility to write cell arrays directly in files?
Thanks.
How about this:
%A cell array holding all data
% (Note transpose)
data = cat(2, c, num2cell(a), num2cell(b))';
%Write data to a file
fid = fopen('example.txt', 'w');
fprintf(fid, '%s\t%u\t%u\n', data{:});
fclose(fid);
This will be memory wasteful if your datasets get large (it is probably better to leave them as separate variables and loop), but it seems to work.
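The trick above works because the transposed cell array hands fprintf the values row by row. A Python sketch of the same pairing, using zip in place of the transposed cell array (the variable names follow the question; writing to an in-memory buffer is just for the illustration):

```python
import io

a = [3, 1, 6]
b = [2, 5, 2]
c = ["ab", "bc", "cd"]

# zip(c, a, b) yields one (string, number, number) tuple per output row
out = io.StringIO()
out.writelines(f"{name}\t{x}\t{y}\n" for name, x, y in zip(c, a, b))
text = out.getvalue()
print(text, end="")
```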

In Fortran 90, what is a good way to write an array to a text file, row-wise?

I am new to Fortran, and I would like to be able to write a two-dimensional array to a text file, in a row-wise manner (spaces between columns, and each row on its own line). I have tried the following, and it seems to work in the following simple example:
PROGRAM test3
IMPLICIT NONE
INTEGER :: i, j, k, numrows, numcols
INTEGER, DIMENSION(:,:), ALLOCATABLE :: a
numrows=5001
numcols=762
ALLOCATE(a(numrows,numcols))
k=1
DO i=1,SIZE(a,1)
DO j=1,SIZE(a,2)
a(i,j)=k
k=k+1
END DO
END DO
OPEN(UNIT=12, FILE="aoutput.txt", ACTION="write", STATUS="replace")
DO i=1,numrows
WRITE(12,*) (a(i,j), j=1,numcols)
END DO
END PROGRAM test3
As I said, this seems to work fine in this simple example: the resulting text file, aoutput.txt, contains the numbers 1-762 on line 1, numbers 763-1524 on line 2, and so on.
But when I use the above ideas (i.e., the fifth-to-last through second-to-last lines of code above) in a more complicated program, I run into trouble; each row is delimited (by a new line) only intermittently, it seems. (I have not posted, and probably will not post, my entire complicated program here, because it is rather long.) The lack of consistent row delimiters in my complicated program probably suggests a bug elsewhere in my code, not in the four-line write-to-file routine above, since the simple example appears to work okay. Still, I am wondering: is there a better row-wise write-to-text-file routine that I should be using?
Thank you very much for your time. I really appreciate it.
There are a few issues here.
The fundamental one is that you shouldn't use text as a data format for sizable chunks of data. It's big and it's slow. Text output is good for something you're going to read yourself; you aren't going to sit down with a printout of 3.81 million integers and flip through them. As the code below demonstrates, the correct text output is about 10x slower, and 50% bigger, than the binary output. If you move to floating point values, there are also precision loss issues with using ASCII strings as a data interchange format.
If your aim is to interchange data with matlab, it's fairly easy to write the data into a format matlab can read; you can use the matOpen/matPutVariable API from matlab, or just write it out as an HDF5 array that matlab can read. Or you can just write out the array in raw Fortran binary as below and have matlab read it.
If you must use ASCII to write out huge arrays (which, as mentioned, is a bad and slow idea), then you're running into problems with default record lengths in list-directed IO. Best is to generate at runtime a format string which correctly describes your output, and safest on top of this for such large (~5000 character wide!) lines is to set the record length explicitly to something larger than what you'll be printing out, so that the Fortran IO library doesn't helpfully break up the lines for you.
In the code below,
WRITE(rowfmt,'(A,I4,A)') '(',numcols,'(1X,I7))'
generates the string rowfmt, which in this case would be ( 762(1X,I7)); this is the format you'll use for printing out. The I7 width is needed because the largest value here, 5001*762 = 3810762, has seven digits, and a field that is too narrow is printed as asterisks. The RECL option to OPEN sets the record length to be something bigger than 8*numcols + 1.
PROGRAM test3
IMPLICIT NONE
INTEGER :: i, j, k, numrows, numcols
INTEGER, DIMENSION(:,:), ALLOCATABLE :: a
CHARACTER(LEN=30) :: rowfmt
INTEGER :: txtclock, binclock
REAL :: txttime, bintime
numrows=5001
numcols=762
ALLOCATE(a(numrows,numcols))
k=1
DO i=1,SIZE(a,1)
DO j=1,SIZE(a,2)
a(i,j)=k
k=k+1
END DO
END DO
CALL tick(txtclock)
WRITE(rowfmt,'(A,I4,A)') '(',numcols,'(1X,I7))'
OPEN(UNIT=12, FILE="aoutput.txt", ACTION="write", STATUS="replace", &
RECL=(8*numcols+10))
DO i=1,numrows
WRITE(12,FMT=rowfmt) (a(i,j), j=1,numcols)
END DO
CLOSE(UNIT=12)
txttime = tock(txtclock)
CALL tick(binclock)
OPEN(UNIT=13, FILE="boutput.dat", ACTION="write", STATUS="replace", &
FORM="unformatted")
WRITE(13) a
CLOSE(UNIT=13)
bintime = tock(binclock)
PRINT *, 'ASCII time = ', txttime
PRINT *, 'Binary time = ', bintime
CONTAINS
SUBROUTINE tick(t)
INTEGER, INTENT(OUT) :: t
CALL system_clock(t)
END SUBROUTINE tick
! returns time in seconds from now to time described by t
REAL FUNCTION tock(t)
INTEGER, INTENT(IN) :: t
INTEGER :: now, clock_rate
call system_clock(now,clock_rate)
tock = real(now - t)/real(clock_rate)
END FUNCTION tock
END PROGRAM test3
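To make the two ideas in this answer concrete outside Fortran, here is a small Python sketch: it builds the row format at runtime from the number of columns and the widest value, then compares the size of the text output with a raw binary dump. The array dimensions and values are made up for the illustration, and the binary layout here (row-major, no record markers) is not what a Fortran unformatted WRITE produces; it is only meant to show the size difference.

```python
import struct

numrows, numcols = 5, 4
# Small stand-in array with seven-digit values
a = [[1_000_000 + r * numcols + c + 1 for c in range(numcols)]
     for r in range(numrows)]

# Build the row format at runtime, as the WRITE(rowfmt, ...) line does
width = len(str(max(max(row) for row in a)))   # digits in the largest value
row_fmt = " ".join(f"{{:{width}d}}" for _ in range(numcols))

# One formatted line per row versus 4-byte little-endian integers
text = "".join(row_fmt.format(*row) + "\n" for row in a)
binary = b"".join(struct.pack(f"<{numcols}i", *row) for row in a)

print(text, end="")
print(len(text), len(binary))
```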
This may be a very roundabout and time-consuming way of doing it, but anyway... You could simply print each array element separately, using advance='no' (to suppress insertion of a newline character after what was being printed) in your write statement. Once you're done with a line you use a 'normal' write statement to get the newline character, and start again on the next line. Here's a small example:
program testing
implicit none
integer :: i, j, k
k = 1
do i=1,4
do j=1,10
write(*, '(I2,1X)', advance='no') k
k = k + 1
end do
write(*, *) '' ! this gives you the line break
end do
end program testing
When you run this program the output is as follows:
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40
Using an "*" is list-directed IO: Fortran will make the decisions for you, and some behaviors aren't specified. You could gain more control by using a format statement. If you want to positively identify row boundaries, you could write a marker symbol after each row. Something like:
DO i=1,numrows
WRITE(12,*) a(i,:)
write (12, '("X")' )
END DO
Addendum several hours later:
Perhaps with large values of numcols the lines are too long for some programs that you are using to examine the file? For the output statement, try:
WRITE(12, '( 10(2X, I11) )' ) a(i,:)
which will break each row of the matrix, if it has more than 10 columns, into multiple, shorter lines in the file.
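The wrapping behavior of a repeated format such as 10(2X, I11) can be mimicked in a short Python sketch. The helper format_row is hypothetical; it simply starts a new line after every ten values, each written as two spaces followed by a right-justified field of width 11.

```python
def format_row(row, per_line=10, width=11):
    """Split one matrix row into lines of at most per_line fixed-width values."""
    lines = []
    for start in range(0, len(row), per_line):
        chunk = row[start:start + per_line]
        lines.append("".join(f"  {v:{width}d}" for v in chunk))
    return lines


# A row with 25 columns becomes three lines: 10 + 10 + 5 values
wrapped = format_row(list(range(1, 26)))
for line in wrapped:
    print(line)
```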
