Split array into chunks based on timestamp in Haskell - arrays

I have an array of records (a custom data type) in Haskell which I want to aggregate based on each record's timestamp. In very general terms, each record looks like this:
data Record = Record { event :: String,
                       time :: Double,
                       from :: Int,
                       to :: Int
                     } deriving (Show, Eq)
I used a Double for the timestamp since that is the same format used in the tracefile.
And I parse them from a CSV file into an array of records: [Record]
Now I'm looking to get an approximation of instantaneous events / time. So I want to split the array into several arrays based on the timestamp (say, every 1 second) and then fold across each smaller array.
The problem is I can't figure out how to split an array based on the value of a record. Looking on Hoogle I found several functions like splitEvery and splitWhen, but I'm lost. I considered using splitWhen to break up the list when, say, (mod time 0.1) == 0, but even if that worked it would remove the elements it's splitting on (which I don't want to do).
I should note that the records are NOT evenly spaced in time, i.e. the timestamps on sequential records do not differ by a fixed amount.
I am more than willing to store the data in a different format if you can suggest one that would make this sort of work easier.
A quick sample of the data I'm parsing (from a ns2 simulation):
r 0.114 1 2 tcp 1000 ________ 2 1.0 5.0 0 2
r 0.240 1 2 tcp 1000 ________ 2 1.0 5.0 0 2
r 0.914 2 1 tcp 1000 ________ 2 5.0 1.0 0 3

If you have [Record] and you want to group them by a specific condition, you can use Data.List.groupBy. I'm assuming that for your time :: Double, 1 second is the base unit, so time = 1 is 1 second, time = 100 is 100 seconds, etc., so adjust this to whatever system you're actually using. Note that groupBy only groups adjacent elements, so the list must already be sorted by time:
import Data.List
import Data.Function (on)
isInSameClockSecond :: Record -> Record -> Bool
isInSameClockSecond = (==) `on` (floor . time :: Record -> Integer)
-- The type signature is given for floor . time to remove any ambiguity
-- due to floor's polymorphic type signature.
groupBySameClockSecond :: [Record] -> [[Record]]
groupBySameClockSecond = groupBy isInSameClockSecond

Related

How to sample a DataFrame?

The goal is to subsample a data frame.
code:
# 1. convert the date column to datetime
dg.Yr_Mo_Dy = pd.to_datetime(dg.Yr_Mo_Dy, format='%Y%m%d')
# 2. use the date as the index
dg = dg.set_index(dg.Yr_Mo_Dy, drop = True)
# 3. group the rows and average each group
dg.resample('1AS').mean().mean()
That gives:
RPT 14.847325
VAL 12.914560
ROS 13.299624
KIL 7.199498
SHA 11.667734
BIR 8.054839
DUB 11.819355
CLA 9.512047
MUL 9.543208
CLO 10.053566
BEL 14.550520
MAL 18.028763
dtype: float64
The code takes each group of 10 values and averages them.
Similarly, it is also possible to sum these 10 values by replacing mean() with sum().
However, what I want is not an average but a sampling. That is, to keep only one value out of each group, without averaging or summing the intermediate values.
For example, the data 1,2,3,4,5,6... sampled by 0.5 gives 2,4,6... and not 1.5,2.5,3.5,5.5...
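The difference between aggregating and picking one value per group can be sketched in plain Python. This is a minimal illustration rather than pandas itself; the list `values` and the step of 2 are assumptions taken from the 1,2,3,4,5,6 example above. On a DataFrame, the analogous positional slice would go through `iloc`, for instance `dg.iloc[1::2]`.

```python
# Hypothetical input mirroring the example in the question.
values = [1, 2, 3, 4, 5, 6]
step = 2  # keep one value out of every `step` values (assumed)

# Aggregating: one mean per consecutive group of `step` values.
means = [sum(values[i:i + step]) / step for i in range(0, len(values), step)]

# Sampling: keep only the last value of each group, with no arithmetic.
sampled = values[step - 1::step]

print(means)    # [1.5, 3.5, 5.5]
print(sampled)  # [2, 4, 6]
```

The slice start `step - 1` picks the last element of each group; starting at 0 instead would pick the first.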

Lag over columns/variables in SPSS

I want to do something I thought was really simple.
My (mock) data looks like this:
data list free/totalscore.1 to totalscore.5.
begin data.
1 2 6 7 10 1 4 9 11 12 0 2 4 6 9
end data.
These are total scores accumulating over a number of trials (in this mock data, from 1 to 5). Now I want to know the number of points earned in each trial. In other words, I want to subtract the value in trial n from the value in trial n+1.
The simplest syntax would look like this:
COMPUTE trialscore.1 = totalscore.2 - totalscore.1.
EXECUTE.
COMPUTE trialscore.2 = totalscore.3 - totalscore.2.
EXECUTE.
COMPUTE trialscore.3 = totalscore.4 - totalscore.3.
EXECUTE.
And so on...
So that the result would look like this:
But of course it is not possible and not fun to do this for 200+ variables.
I attempted to write a syntax using VECTOR and DO REPEAT as follows:
COMPUTE #y = 1.
VECTOR totalscore = totalscore.1 to totalscore.5.
DO REPEAT trialscore = trialscore.1 to trialscore.5.
COMPUTE #y = #x + 1.
END REPEAT.
COMPUTE trialscore(#i) = totalscore(#y) - totalscore(#i).
EXECUTE.
But it doesn't work.
Any help is appreciated.
P.S. I've looked into using LAG, but that loops over rows while I need it to go over one column at a time.
I am assuming respid is your original (unique) record identifier.
EDIT:
If you do not have a record identifier, you can very easily create a dummy one:
compute respid=$casenum.
exe.
end of EDIT
You could try re-structuring the data, so that each score is a distinct record:
varstocases
/make totalscore from totalscore.1 to totalscore.5
/index=scorenumber
/NULL=keep.
exe.
Then sort your cases so that scores are in descending order (in order to be able to use the lag function):
sort cases by respid (a) scorenumber (d).
Then actually do the lag-based computations
do if respid=lag(respid).
compute trialscore=lag(totalscore)-totalscore.
end if.
exe.
In the end, un-do the restructuring:
casestovars
/id=respid
/index=scorenumber.
exe.
You should end up with a set of trialscore variables (the last one will be empty), which will hold what you need.
You can also use DO REPEAT this way:
do repeat
before=totalscore.1 to totalscore.4
/after=totalscore.2 to totalscore.5
/diff=trialscore.1 to trialscore.4 .
compute diff=after-before.
end repeat.
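For comparison, the subtraction that both answers perform can be written out in Python. This is only a sketch outside SPSS; the three rows are the mock data from the question, and `per_trial_scores` is a name made up for this illustration. Each per-trial score is the cumulative total of trial n+1 minus that of trial n.

```python
def per_trial_scores(totals):
    """Return the differences between consecutive cumulative totals."""
    return [after - before for before, after in zip(totals, totals[1:])]


# Mock data from the question: one row per case, five cumulative totals each.
rows = [
    [1, 2, 6, 7, 10],
    [1, 4, 9, 11, 12],
    [0, 2, 4, 6, 9],
]

for row in rows:
    print(per_trial_scores(row))
# [1, 4, 1, 3]
# [3, 5, 2, 1]
# [2, 2, 2, 3]
```

Pairing each total with its successor via `zip` mirrors the before/after variable lists in the DO REPEAT version, which is why five totals yield four differences.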

GMT subtraction on MATLAB

I'm currently working on a small project on handling time differences in MATLAB. I have two input files, Time_in and Time_out. Both files contain arrays of times in the format HHMM, e.g. 2315 (GMT, hours and minutes).
I've read both Time_in and Time_out into MATLAB, but I don't know how to perform the subtraction. Also, I want the answers to be in minutes only, e.g. 2 hrs 30 mins = 150 minutes.
This is one of several possible solutions:
First, you should convert your time strings to a MATLAB serial date number. If you've done this, you can do your calculation as you want:
% input time as string
time_in = '2115';
time_out = '2345';
% read the input time as datenum
dTime_in = datenum(time_in,'HHMM');
dTime_out = datenum(time_out,'HHMM');
% subtract to get the time difference
timeDiff = abs(dTime_out - dTime_in);
% Get the minutes of the time difference
timeout = timeDiff * 24 * 60;
Furthermore, to calculate the time differences correctly you also should put some information about the date in your time vector, in order to calculate the correct time around midnight.
If you need further information about the function datenum you should read the following part of the MATLAB documentation:
https://de.mathworks.com/help/matlab/ref/datenum.html
Any questions?
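The midnight problem mentioned above can be illustrated with a small Python sketch. It is not part of the MATLAB solution, and it rests on one loud assumption: since the files carry no date, a Time_out that is smaller than Time_in is taken to fall on the following day, less than 24 hours later. The function names are made up for this example.

```python
def hhmm_to_minutes(hhmm):
    """Convert a 4-digit 'HHMM' string to minutes since midnight."""
    hours, minutes = int(hhmm[:2]), int(hhmm[2:])
    return hours * 60 + minutes


def elapsed_minutes(time_in, time_out):
    """Minutes from time_in to time_out.

    Assumption: time_out is less than 24 hours after time_in, so a
    smaller time_out is read as belonging to the next day.
    """
    return (hhmm_to_minutes(time_out) - hhmm_to_minutes(time_in)) % (24 * 60)


print(elapsed_minutes('2115', '2345'))  # 150
print(elapsed_minutes('2315', '0045'))  # 90, wrapping past midnight
```

Taking the absolute difference instead, as in the MATLAB snippet, would report 1350 minutes for the second pair, which is why date information (or an assumption like the one above) is needed.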
In a recent version of MATLAB, you could use textscan together with datetime and duration data types to do this.
% read the first file
fh1 = fopen('Time_in');
d1 = textscan(fh1, '%{HHmm}D');
fclose(fh1);
fh2 = fopen('Time_out');
d2 = textscan(fh2, '%{HHmm}D');
fclose(fh2);
Note the format specifier '%{HHmm}D' tells MATLAB to read the 4-digit string into a datetime array.
d1 and d2 are now cell arrays where the only element is a datetime vector. You can subtract these, and then use the minutes function to find the number of minutes.
result = minutes(d2{1} - d1{1})

MATLAB Extract all rows between two variables with a threshold

I have a cell array called BodyData in MATLAB that has around 139 columns and 3500 odd rows of skeletal tracking data.
I need to extract all rows between two string values (these are timestamps when an event happened) that I have
e.g.
BodyData{}=
Column 1 2 3
'10:15:15.332' 'BASE05' ...
...
'10:17:33:230' 'BASE05' ...
The two timestamps should match a value in the array but might also be within a few ms of those in the array e.g.
TimeStamp1 = '10:15:15.560'
TimeStamp2 = '10:17:33.233'
I have several questions!
How can I return an array for all the data between the two string values plus or minus a small threshold of say .100ms?
Can I also add another condition saying that all string values in column 2 must be the same, and otherwise be ignored? For example, only return the rows between timestamps A and B if column 2 is 'BASE02'.
Many thanks,
The best approach to the first part of your problem is probably to change from strings to numeric date values. In MATLAB this can be done quite painlessly with datenum.
For the second part you can just use logical indexing: this is where you put a condition (i.e. that the second column is BASE02) within the indexing expression.
A self-contained example:
% some example data:
BodyData = {'10:15:15.332', 'BASE05', 'foo';...
'10:15:16.332', 'BASE02', 'bar';...
'10:15:17.332', 'BASE05', 'foo';...
'10:15:18.332', 'BASE02', 'foo';...
'10:15:19.332', 'BASE05', 'bar'};
% create column vector of numeric times, and define start/end times
dateValues = datenum(BodyData(:, 1), 'HH:MM:SS.FFF');
startTime = datenum('10:15:16.100', 'HH:MM:SS.FFF');
endTime = datenum('10:15:18.500', 'HH:MM:SS.FFF');
% select data in range, and where second column is 'BASE02'
BodyData(dateValues > startTime & dateValues < endTime & strcmp(BodyData(:, 2), 'BASE02'), :)
Returns:
ans =
'10:15:16.332' 'BASE02' 'bar'
'10:15:18.332' 'BASE02' 'foo'
References: datenum manual page, matlab help page on logical indexing.
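The answer above does not show the plus-or-minus threshold from the question. One way to think about it is to widen the window by the tolerance on both sides before comparing, sketched here in Python rather than MATLAB. The function names, the 100 ms default, and the start and end times are assumptions chosen for illustration; the rows are the example data from the answer.

```python
from datetime import datetime, timedelta


def parse_time(text):
    """Parse an 'HH:MM:SS.mmm' string into a datetime (date part is a dummy)."""
    return datetime.strptime(text, '%H:%M:%S.%f')


def rows_between(rows, start, end, label, tolerance_ms=100):
    """Keep rows whose time lies in [start - tol, end + tol] and whose
    second column equals `label`."""
    tol = timedelta(milliseconds=tolerance_ms)
    lower = parse_time(start) - tol
    upper = parse_time(end) + tol
    return [row for row in rows
            if lower <= parse_time(row[0]) <= upper and row[1] == label]


body_data = [
    ['10:15:15.332', 'BASE05', 'foo'],
    ['10:15:16.332', 'BASE02', 'bar'],
    ['10:15:17.332', 'BASE05', 'foo'],
    ['10:15:18.332', 'BASE02', 'foo'],
    ['10:15:19.332', 'BASE05', 'bar'],
]

# Both limits miss a BASE02 row by less than 100 ms, so the tolerance matters.
selected = rows_between(body_data, '10:15:16.400', '10:15:18.300', 'BASE02')
for row in selected:
    print(row)
```

With a tolerance of 0 the same call returns no rows, because 16.332 lies before the start and 18.332 after the end.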

Link two tables based on conditions in matlab

I am using matlab to prepare my dataset in order to run it in certain data mining models and I am facing an issue with linking the data between two of my tables.
So, I have two tables, A and B, which contain sequential recordings of certain values in a certain timestamps and I want to create a third table, C, in which I will add columns of both A and B in the same rows according to some conditions.
Tables A and B don't have the same amount of rows (A has more measurements) but they both have two columns:
1st column: time of the recording (hh:mm:ss) and
2nd column: recorded value in that time
Columns of A and B are going to be added in table C when all the following conditions stand:
The time difference between A and B is more than 3 sec but less than 5 sec
The recorded value of A is the 40% - 50% of the recorded value of B.
Any help would be greatly appreciated.
For the first condition you need something like [row,col,val]=find((A(:,1)-B(:,1))>3sec & (A(:,1)-B(:,1))<5sec), where you need datenum or an equivalent to convert your timestamps to numbers first. The second condition works the same way: [row,col,val]=find(A(:,2)>0.4*B(:,2) & A(:,2)<0.5*B(:,2)). Note that the element-wise & operator is required here, since && only accepts scalar operands.
datenum allows you to transform your arrays, so do that first:
A(:,1) = datenum(A(:,1));
B(:,1) = datenum(B(:,1));
You might need to check the documentation on datenum regarding the format your string is in.
time1 = [datenum([0 0 0 0 0 3]) datenum([0 0 0 0 0 5])];
creates the datenums for 3 and 5 seconds. All combined:
A(:,1) = datenum(A(:,1));
B(:,1) = datenum(B(:,1));
time1 = [datenum([0 0 0 0 0 3]) datenum([0 0 0 0 0 5])];
[row1,col1,val1]=find((A(:,1)-B(:,1))>time1(1) & (A(:,1)-B(:,1))<time1(2));
[row2,col2,val2]=find(A(:,2)>0.4*B(:,2) & A(:,2)<0.5*B(:,2));
The variables of row and col you might not need when you want only the values though. val1 contains the values of condition 1, val2 of condition 2. If you want both conditions to be valid at the same time, use both in the find command:
[row3,col3,val3]=find((A(:,1)-B(:,1))>time1(1) & ...
(A(:,1)-B(:,1))<time1(2) & A(:,2)>0.4*B(:,2) ...
& A(:,2)<0.5*B(:,2));
The actual adding of your two arrays based on the conditions:
C = A(row3,2)+B(row3,2);
Thank you for your response and help! However, for the time column I took a different approach and converted hh:mm:ss to seconds, which will make the comparison easier later on:
dv1 = datevec(A, 'dd.mm.yyyy HH:MM:SS.FFF ');
secs = [3600,60,1];
dv1(:,6) = floor(dv1(:,6));
timestamp = dv1(:,4:6)*secs.';
Now I am working on combining both time and weight conditions in a piece of code that will run. Should I use an if condition inside a for loop or is a for loop not necessary?
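Because A and B have different numbers of rows, every row of A has to be compared with every row of B. A plain nested loop is one straightforward way to do that, sketched here in Python rather than MATLAB. The sample tables, the function names, the use of the absolute time difference, and the inclusive 40% and 50% bounds are all assumptions made for this illustration, since the question does not pin them down.

```python
def to_seconds(hhmmss):
    """Convert an 'hh:mm:ss' string to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(':'))
    return hours * 3600 + minutes * 60 + seconds


def link_rows(table_a, table_b):
    """Pair every row of A with every row of B that meets both conditions."""
    linked = []
    for time_a, value_a in table_a:
        for time_b, value_b in table_b:
            gap = abs(to_seconds(time_a) - to_seconds(time_b))
            time_ok = 3 < gap < 5                                   # condition 1
            value_ok = 0.4 * value_b <= value_a <= 0.5 * value_b    # condition 2
            if time_ok and value_ok:
                linked.append((time_a, value_a, time_b, value_b))
    return linked


# Hypothetical recordings: (time, value) pairs; A has more rows than B.
table_a = [('10:00:00', 45.0), ('10:00:02', 90.0), ('10:00:10', 20.0)]
table_b = [('10:00:04', 100.0), ('10:00:14', 48.0)]

table_c = link_rows(table_a, table_b)
for row in table_c:
    print(row)
```

With whole-second timestamps, "more than 3 but less than 5 seconds" only matches a gap of exactly 4 seconds, so it may be worth checking whether the bounds were meant to be inclusive.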
