I am trying to parse a molecular dynamics dump file in which headers are printed periodically. Between two successive headers there is data in column format (the length of the data is not guaranteed to be the same between any two successive headers), which I want to store and post-process. Is there a way to do this without excessive use of for loops?
The basic gist of it is:
ITEM: TIMESTEP
0
ITEM: NUMBER OF ENTRIES
1079
ITEM: BOX BOUNDS xy xz yz ff ff pp
-1e+06 1e+06 0
-1e+06 1e+06 0
-1e+06 1e+06 0
ITEM: ENTRIES index c_1[1] c_1[2] c_2[1] c_2[2] c_2[3] c_2[4] c_2[5]
1 1 94 0.0399999 0 0.171554 -0.00124379 0
2 1 106 0.0399999 0 -0.0638316 0.116503 0
3 1 204 0.0299999 0 -0.124742 0.0290103 0
4 1 675 0.0299999 0 0.0245382 -0.116731 0
5 2 621 0.03 0 0.0328324 0.00185942 0
6 2 656 0.04 0 -0.0315086 0.016237 0
7 2 671 0.04 0 -0.00291159 -0.0169882 0
8 3 76 0.03 0 0.01775 0.0100646 0
9 3 655 0.03 0 0.00434063 -0.00750336 0
.
.
.
.
.
1076 678 692 100000 0 -0.222481 -1.44632e-06 0
1077 679 692 100000 0 -0.00232206 -8.05951e-09 0
1078 682 691 100000 0 0.0753935 -2.89438e-07 0
1079 687 692 100000 0 -0.0153246 -2.51076e-08 0
ITEM: TIMESTEP
1000
ITEM: NUMBER OF ENTRIES
1078
ITEM: BOX BOUNDS xy xz yz ff ff pp
-1e+06 1e+06 0
-1e+06 1e+06 0
-1e+06 1e+06 0
ITEM: ENTRIES index c_1[1] c_1[2] c_2[1] c_2[2] c_2[3] c_2[4] c_2[5]
1 1 94 0.0399997 0 1.3535 -0.00981109 0
2 1 106 0.0399986 0 -6.36969 11.6275 0
3 1 204 0.0299893 0 -236.114 54.9339 0
4 1 675 0.0299998 0 0.148064 -0.704365 0
.
.
.
.
TIA!
You don't need to write a single for loop to parse this file, MATLAB writes them for you:
[headers, tables] = parseTables('tables.txt')
...
function [headers, tables] = parseTables(filename)
content = fileread(filename); % read whole file
lines = splitlines(content); % split lines
values = cellfun(@str2num, lines, 'UniformOutput', false); % convert lines to float, when possible
headerLines = cellfun(@isempty, values); % lines with no floats
headers = lines(headerLines); % extract headers
startLines = find(headerLines)+1; % indices of first lines of tables
endLines = [startLines(2:end)-1; length(values)]; % indices of last lines of tables
tables = arrayfun(@(i, j) cell2mat(values(i:j)), ...
startLines, endLines, 'UniformOutput', false); % merge table rows to single matrix
end
The results will be stored in cell arrays:
headers =
8×1 cell array
{'ITEM: TIMESTEP' }
{'ITEM: NUMBER OF ENTRIES' }
{'ITEM: BOX BOUNDS xy xz yz ff ff pp' }
{'ITEM: ENTRIES index c_1[1] c_1[2] c_2[1] c_2[2] c_2[3] c_2[4] c_2[5] '}
{'ITEM: TIMESTEP' }
{'ITEM: NUMBER OF ENTRIES' }
{'ITEM: BOX BOUNDS xy xz yz ff ff pp' }
{'ITEM: ENTRIES index c_1[1] c_1[2] c_2[1] c_2[2] c_2[3] c_2[4] c_2[5] '}
tables =
8×1 cell array
{[ 0]}
{[ 1079]}
{ 3×3 double}
{13×8 double}
{[ 1000]}
{[ 1078]}
{ 3×3 double}
{ 4×8 double}
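For readers who want to check the splitting logic outside MATLAB, here is a minimal Python sketch of the same idea: any line that cannot be converted to numbers is treated as a header, and the numeric lines that follow are collected under it. The function name `parse_tables` and the shortened sample text are assumptions made for illustration; they are not part of the answer above.

```python
def parse_tables(text):
    """Return a list of (header, rows) pairs.

    A header is any non-empty line that is not made up entirely of numbers;
    the numeric lines that follow it are stored as lists of floats.
    """
    blocks = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        try:
            row = [float(f) for f in fields]
        except ValueError:
            # Not purely numeric: start a new block under this header.
            blocks.append((line.strip(), []))
            continue
        if blocks:  # numeric lines before the first header are ignored
            blocks[-1][1].append(row)
    return blocks


# Shortened, hypothetical sample in the same layout as the dump file.
sample = """ITEM: TIMESTEP
0
ITEM: NUMBER OF ENTRIES
2
ITEM: ENTRIES index c_1[1] c_1[2]
1 1 94
2 1 106
ITEM: TIMESTEP
1000
"""

blocks = parse_tables(sample)
for header, rows in blocks:
    print(header, len(rows))
```

Because each block keeps its own list of rows, tables of different lengths between headers need no special handling.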
Let's say I have a table of 10000 observations:
Obs X Y Z
1
2
3
...
10000
For each observation, I call a macro, mymacro(X, Y, Z), that takes X, Y and Z as inputs. The macro creates a table with 1 observation and 4 new variables: var1, var2, var3, var4.
I would like to know how to loop through the 10000 observations in my initial data set so that the result looks like:
Obs X Y Z Var1 Var2 Var3 Var4
1
2
3
...
10000
Update:
The calculation of Var1, Var2, Var3, Var4:
I have a reference table:
Z 25 26 27 28 29 30
0 10 000 10 000 10 000 10 000 10 000 10 000
1 10 000 10 000 10 000 10 000 10 000 10 000
2 10 000 10 000 10 000 10 000 10 000 10 000
3 10 000 10 000 10 000 10 000 10 000 10 000
4 9 269 9 322 9 322 9 381 9 381 9 436
5 8 508 8 619 8 619 8 743 8 743 8 850
6 7 731 7 914 7 914 8 102 8 102 8 258
7 6 805 7 040 7 040 7 280 7 280 7 484
8 5 864 6 137 6 137 6 421 6 421 6 655
9 5 025 5 328 5 328 5 629 5 629 5 929
10 4 359 4 648 4 648 4 934 4 934 5 320
And my HAVE data set looks like:
Obs X Y Z
1 27 4 9
2
3
10000
So for the first observation (27, 4, 9):
Var1 = (8 619+ 7 914+ 7 040 + 6 137 + 5 328)/ 9 322
Var2 = (8 743+ 8 102+ 7 280+ 6 421 + 5 629 )/ 9 381
So that:
Var1 = sum of all numbers in column 27 (X), from reference row 5 (Y+1) to reference row 9 (Z), divided by the value in column 27 (X) at reference row 4 (Y)
Var2 = sum of all numbers in column 28 (X+1), from reference row 5 (Y+1) to reference row 9 (Z), divided by the value in column 28 (X+1) at reference row 4 (Y)
I would convert the reference table to a form that lets you do the calculations for all observations at once. So make your reference table into a tall structure, either by transposing the existing table or just reading it that way to start with:
data ref_tall;
input z @;
do col=25 to 30 ;
input value :comma9. @;
output;
end;
datalines;
0 10,000 10,000 10,000 10,000 10,000 10,000
1 10,000 10,000 10,000 10,000 10,000 10,000
2 10,000 10,000 10,000 10,000 10,000 10,000
3 10,000 10,000 10,000 10,000 10,000 10,000
4 9,269 9,322 9,322 9,381 9,381 9,436
5 8,508 8,619 8,619 8,743 8,743 8,850
6 7,731 7,914 7,914 8,102 8,102 8,258
7 6,805 7,040 7,040 7,280 7,280 7,484
8 5,864 6,137 6,137 6,421 6,421 6,655
9 5,025 5,328 5,328 5,629 5,629 5,929
10 4,359 4,648 4,648 4,934 4,934 5,320
;
Now take your list table HAVE:
data have;
input id x y z;
datalines;
1 27 4 9
2 25 2 4
;
And combine it with the reference table and make your calculations:
proc sql ;
create table want1 as
select a.id
, sum(b.value)/min(c.value) as var1
from have a
left join ref_tall b
on a.x=b.col
and b.z between a.y+1 and a.z
left join ref_tall c
on a.x=c.col
and c.z = a.y
group by a.id
;
create table want2 as
select a.id
, sum(d.value)/min(e.value) as var2
from have a
left join ref_tall d
on a.x+1=d.col
and d.z between a.y+1 and a.z
left join ref_tall e
on a.x+1=e.col
and e.z = a.y
group by a.id
;
create table want as
select *
from want1 natural join want2 natural join have
;
quit;
Results:
Obs id x y z var1 var2
1 1 27 4 9 3.75864 3.85620
2 2 25 2 4 1.92690 1.93220
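As a cross-check of the numbers above, the same lookup can be sketched in Python. This is only an illustration of the tall (z, col) layout, not a replacement for the SQL; the names `ref_tall` and `ratio` are assumptions made for the sketch.

```python
columns = [25, 26, 27, 28, 29, 30]
rows = {
    0: [10000] * 6,
    1: [10000] * 6,
    2: [10000] * 6,
    3: [10000] * 6,
    4: [9269, 9322, 9322, 9381, 9381, 9436],
    5: [8508, 8619, 8619, 8743, 8743, 8850],
    6: [7731, 7914, 7914, 8102, 8102, 8258],
    7: [6805, 7040, 7040, 7280, 7280, 7484],
    8: [5864, 6137, 6137, 6421, 6421, 6655],
    9: [5025, 5328, 5328, 5629, 5629, 5929],
    10: [4359, 4648, 4648, 4934, 4934, 5320],
}

# Tall form of the reference table: one entry per (z, col) pair.
ref_tall = {(z, col): value
            for z, values in rows.items()
            for col, value in zip(columns, values)}


def ratio(col, y, z):
    """Sum of column `col` over rows y+1..z, divided by the value at row y."""
    numerator = sum(ref_tall[(k, col)] for k in range(y + 1, z + 1))
    return numerator / ref_tall[(y, col)]


have = [(1, 27, 4, 9), (2, 25, 2, 4)]  # (id, x, y, z)
want = [(i, ratio(x, y, z), ratio(x + 1, y, z)) for i, x, y, z in have]
for i, var1, var2 in want:
    print(i, round(var1, 5), round(var2, 5))
```

Var1 uses column X and Var2 uses column X+1, matching the two joins in the query.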
The reference table can be loaded into an array, which makes the specified computations easy to perform. The reference values can then be accessed using a direct address reference.
Example
The reference table data was moved into a data set so the values can be changed over time or reloaded from some source such as Excel. The reference values can be loaded into an array for use during a DATA step.
* reference information in data set, x property column names are _<num>;
data ref;
input z (_25-_30) (comma9. &);
datalines;
0 10,000 10,000 10,000 10,000 10,000 10,000
1 10,000 10,000 10,000 10,000 10,000 10,000
2 10,000 10,000 10,000 10,000 10,000 10,000
3 10,000 10,000 10,000 10,000 10,000 10,000
4 9,269 9,322 9,322 9,381 9,381 9,436
5 8,508 8,619 8,619 8,743 8,743 8,850
6 7,731 7,914 7,914 8,102 8,102 8,258
7 6,805 7,040 7,040 7,280 7,280 7,484
8 5,864 6,137 6,137 6,421 6,421 6,655
9 5,025 5,328 5,328 5,629 5,629 5,929
10 4,359 4,648 4,648 4,934 4,934 5,320
;
* computation parameters, might be a thousand of them specified;
data have;
input id x y z;
datalines;
1 27 4 9
;
* perform computation for each parameters specified;
data want;
set have;
array ref[0:10,1:30] _temporary_;
if _n_ = 1 then do ref_row = 0 by 1 until (last_ref);
* load reference data into an array for direct addressing during computation;
set ref end=last_ref;
array ref_cols _25-_30;
do index = 1 to dim(ref_cols);
colname = vname(ref_cols[index]);
colnum = input(substr(colname,2),8.);
ref[ref_row,colnum] = ref_cols[index];
end;
end;
* perform computation for parameters specified;
array vars var1-var4;
do index = 1 to dim(vars);
ref_column = x + index - 1 ; * column x, then x+1, then x+2, then x+3;
numerator = 0; * algorithm against reference data;
do ref_row = y+1 to z;
numerator + ref[ref_row,ref_column];
end;
denominator = ref[y,ref_column];
vars[index] = numerator / denominator; * result;
end;
keep id x y z numerator denominator var1-var4;
run;
I have a data that goes like this:
Subject Treatment X
1 1 X12
2 2 X12
3 3 X13
4 1 X11
5 2 X13
6 3 X12
7 1 X11
8 2 X12
9 1 X11
10 3 X13
I have to count the number of X's using the variable Z, so Z11 = # of X11's, Z12 = # of X12's, and so on. But if the last digit of X and the Treatment (T) are the same, then you add one to the allocation.
So Z11=X11+1 if T=1, Z12=X12+1 if T=2 and Z13=X13+1 if T=3; if the last digit and T don't correspond, the count stays the same: Z11=X11, Z12=X12 and Z13=X13. I am using proc sql to count the allocations but don't know how to add 1 each time the loop goes through another subject.
proc sql;
create table new1 as select
sum(y="X11")+1 as z11,
sum(y="X12") as z12,
sum(y="X13") as z13
from dynamic;
quit;
Any help will be appreciated.
I'm not clear exactly what you want, but I would do this in a data step creating the variables you want, and then sum up the totals.
data have;
input Subject Treatment X $;
datalines;
1 1 X12
2 2 X12
3 3 X13
4 1 X11
5 2 X13
6 3 X12
7 1 X11
8 2 X12
9 1 X11
10 3 X13
;
data have;
set have;
z11 = 0;
z12 = 0;
z13 = 0;
if x="X11" then do;
z11 = z11 + 1;
if Treatment=1 then
z11 = z11+1;
end;
else if x="X12" then do;
z12 = z12 + 1;
if Treatment=2 then
z12 = z12 + 1;
end;
else if x="X13" then do;
z13 = z13 + 1;
if Treatment=3 then
z13 = z13+1;
end;
run;
proc summary data=have;
var z:;
output out=want sum=;
run;
produces:
_TYPE_ _FREQ_ z11 z12 z13
0 10 6 6 5
You can drop the _TYPE_ and _FREQ_ variables if you need.
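The counting rule in the data step can be checked with a short Python sketch. It assumes, as the answer does, that every row adds 1 to its own counter and one more when the last digit of X equals the Treatment; the name `totals` is hypothetical.

```python
from collections import Counter

# (Subject, Treatment, X) rows copied from the question.
rows = [
    (1, 1, "X12"), (2, 2, "X12"), (3, 3, "X13"), (4, 1, "X11"),
    (5, 2, "X13"), (6, 3, "X12"), (7, 1, "X11"), (8, 2, "X12"),
    (9, 1, "X11"), (10, 3, "X13"),
]

totals = Counter()
for subject, treatment, x in rows:
    key = "z" + x[1:]              # "X11" -> "z11"
    totals[key] += 1               # count the occurrence itself
    if x.endswith(str(treatment)):
        totals[key] += 1           # extra 1 when last digit matches Treatment

print(dict(totals))
```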
I have the following table:
DATA:
Lines <- " ID MeasureX MeasureY x1 x2 x3 x4 x5
1 1 1 1 1 1 1 1
2 1 1 0 1 1 1 1
3 1 1 1 2 3 3 3"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
What I would like to achieve is:
Create 5 columns(r1-r5)
which is the division of each column x1-x5 with MeasureX (example x1/measurex, x2/measurex etc.)
Create 5 columns(p1-p5)
which is the division of each column x1-x5 with number 1-5 (the number of xcolumns) example x1/1, x2/2 etc.
MeasureY is irrelevant for now, the end product would be the ID and columns r1-r5 and p1-p5, is this feasible?
In SAS I would go with something like this:
data test6;
set test5;
array x {5} x1- x5;
array r{5} r1 - r5;
array p{5} p1 - p5;
do i=1 to 5;
r{i} = x{i}/MeasureX;
p{i} = x{i}/(i);
end;
The reason is that I want this to be more dynamic, because the number of columns could change in the future.
Argument recycling allows you to do element-wise division with a constant vector. The tricky part was extracting the digits from the column names. I then repeated each of the digits by the number of rows to do the second division task.
DF[ ,paste0("r", 1:5)] <- DF[ , grep("x", names(DF) )]/ DF$MeasureX
DF[ ,paste0("p", 1:5)] <- DF[ , grep("x", names(DF) )]/ # element-wise division
rep( as.numeric( sub("\\D","",names(DF)[ # remove non-digits
grep("x", names(DF))] #returns only 'x'-cols
) ), each=nrow(DF) ) # make them as long as needed
#-------------
> DF
ID MeasureX MeasureY x1 x2 x3 x4 x5 r1 r2 r3 r4 r5 p1 p2 p3 p4 p5
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.5 0.3333333 0.25 0.2
2 2 1 1 0 1 1 1 1 0 1 1 1 1 0 0.5 0.3333333 0.25 0.2
3 3 1 1 1 2 3 3 3 1 2 3 3 3 1 1.0 1.0000000 0.75 0.6
This could be greatly simplified if you already know the sequence vector for the second division task would be 1-5, but this was designed to allow "gaps" in the sequence for column names and still use the digit information in the names as the divisor. (You were not entirely clear about what situations this code would be used in.) The construct of r{1-5} in SAS is mimicked by [ , paste0('r', 1:5)]. SAS is a macro language and sometimes experienced users have trouble figuring out how to make R behave like one. Generally it takes a while to lose the for-loop mentality and begin using R as a functional language.
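To make the "digits from the column names" idea concrete, here is a small Python sketch of the same two divisions on the question's data. It is only an illustration in another language; `records`, `x_cols` and `result` are names assumed for the sketch.

```python
import re

records = [
    {"ID": 1, "MeasureX": 1, "x1": 1, "x2": 1, "x3": 1, "x4": 1, "x5": 1},
    {"ID": 2, "MeasureX": 1, "x1": 0, "x2": 1, "x3": 1, "x4": 1, "x5": 1},
    {"ID": 3, "MeasureX": 1, "x1": 1, "x2": 2, "x3": 3, "x4": 3, "x5": 3},
]

# Pick the x-columns by name, so gaps or extra columns are handled.
x_cols = [name for name in records[0] if re.fullmatch(r"x\d+", name)]

result = []
for rec in records:
    out = {"ID": rec["ID"]}
    for col in x_cols:
        out["r" + col[1:]] = rec[col] / rec["MeasureX"]  # x_k / MeasureX
    for col in x_cols:
        out["p" + col[1:]] = rec[col] / int(col[1:])     # x_k / k
    result.append(out)

print(result[2])
```

The divisor for each p column comes from the digits in the column name, so a missing x4, say, would not shift the later divisors.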
An alternative with the data.table package:
cols <- names(df[c(4:8)])
library(data.table)
setDT(df)[, (paste0("r",1:5)) := .SD / df$MeasureX, by = ID, .SDcols = cols
][, (paste0("p",1:5)) := .SD / 1:5, by = ID, .SDcols = cols]
which results in:
> df
ID MeasureX MeasureY x1 x2 x3 x4 x5 r1 r2 r3 r4 r5 p1 p2 p3 p4 p5
1: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.5 0.3333333 0.25 0.2
2: 2 1 1 0 1 1 1 1 0 1 1 1 1 0 0.5 0.3333333 0.25 0.2
3: 3 1 1 1 2 3 3 3 1 2 3 3 3 1 1.0 1.0000000 0.75 0.6
You could put together a nifty loop or apply to do this, but here it is explicitly:
# Handling the "r" columns.
DF$r1 <- DF$x1 / DF$MeasureX
DF$r2 <- DF$x2 / DF$MeasureX
DF$r3 <- DF$x3 / DF$MeasureX
DF$r4 <- DF$x4 / DF$MeasureX
DF$r5 <- DF$x5 / DF$MeasureX
# Handling the "p" columns.
DF$p1 <- DF$x1 / 1
DF$p2 <- DF$x2 / 2
DF$p3 <- DF$x3 / 3
DF$p4 <- DF$x4 / 4
DF$p5 <- DF$x5 / 5
# Taking only the columns we want.
FinalDF <- DF[, c("ID", "r1", "r2", "r3", "r4", "r5", "p1", "p2", "p3", "p4", "p5")]
Just noting that this is pretty straightforward matrix manipulation that you definitely could have found elsewhere. Perhaps you're new to R, but still put a little more effort in next time. If you are new to R, it's definitely worth the time to look up some basic R coding tutorial or video.
I have an array with some durations (in seconds), and I'd like to split that array into groups whose accumulated duration does not exceed 3600 seconds in MATLAB. The durations are in order.
Input:
Duration(s) | 2010 1000 500 1030 80 2030 1090
With an:
------------- ------------ ----
Accumulated duration (s) | 3510 3140 1090
------------- ------------ ----
1st group 2nd group 3rd
Output:
Groups index | 1 1 1 2 2 2 3
I've tried with some scripts, but these take so long, and I have to process a lot of data.
Here is a vectorized way using bsxfun and cumsum:
durations = [2010 1000 500 1030 80 2030 1090]
stepsize = 3600;
idx = sum(bsxfun(@ge, cumsum(durations), (0:stepsize:sum(durations)).'),1)
idx =
1 1 1 2 2 2 3
The accumulated durations you can then get with:
accDuration = accumarray(idx(:),durations(:),[],@sum).'
accDuration =
3510 3140 1090
Explanation:
%// cumulative sum of all durations
csum = cumsum(durations);
%// thresholds
threshs = 0:stepsize:sum(durations);
%// comparison
comp = bsxfun(@ge, csum(:).',threshs(:)) %'
comp =
1 1 1 1 1 1 1
0 0 0 1 1 1 1
0 0 0 0 0 0 1
%// get index
idx = sum(comp,1)
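The comparison step can also be traced in plain Python. This sketch mirrors the explanation above: each element's group index is the number of thresholds (0, stepsize, 2*stepsize, ...) that the running total has reached. The variable names are assumptions made for the illustration.

```python
from itertools import accumulate

durations = [2010, 1000, 500, 1030, 80, 2030, 1090]
stepsize = 3600

csum = list(accumulate(durations))              # running total
thresholds = range(0, csum[-1] + 1, stepsize)   # 0, 3600, 7200

# Count, for every element, how many thresholds its running total has reached.
idx = [sum(c >= t for t in thresholds) for c in csum]

# Accumulated duration per group (the accumarray step).
group_totals = {}
for g, d in zip(idx, durations):
    group_totals[g] = group_totals.get(g, 0) + d

print(idx)
print(group_totals)
```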
This will get you close . . .
durs = [2010 1000 500 1030 80 2030 1090];
cums = cumsum(durs);
t = 3600;
idx = zeros(size(durs));
while ~all(idx)
idx = idx + (cums <= t);
cums = cums - max(cums(cums <= t));
end
You can then get the output into your preferred format with a simple . .
idx = -(idx-max(idx)-1)
and just in case you don't have enough, yet another way to do it:
durations = [2010 1000 500 1030 80 2030 1090] ;
stepsize = 3600;
cs = cumsum(durations) ;
idxbeg = [1 find(sign([1 diff(mod(cs,stepsize))])==-1)] ; %// first index of each group
idxend = [idxbeg(2:end)-1 numel(durations)] ; %// last index of each group
groupDuration = [cs(idxend(1)) diff(cs(idxend))]
groupIndex = cell2mat( arrayfun(@(x,y) repmat(x,1,y), 1:numel(idxbeg) , idxend-idxbeg+1 , 'uni',0) )
groupDuration =
3510 3140 1090
groupIndex =
1 1 1 2 2 2 3
although, if you ask me, I find the bsxfun solution more elegant.