lag over columns/ variables SPSS - loops

I want to do something I thought was really simple.
My (mock) data looks like this:
data list free/totalscore.1 to totalscore.5.
begin data.
1 2 6 7 10 1 4 9 11 12 0 2 4 6 9
end data.
These are total scores accumulating over a number of trials (in this mock data, from 1 to 5). Now I want to know the number of scores earned in each trial. In other words, I want to subtract the value in the n trial from the n+1 trial.
The most simple syntax would look like this:
COMPUTE trialscore.1 = totalscore.2 - totalscore.1.
EXECUTE.
COMPUTE trialscore.2 = totalscore.3 - totalscore.2.
EXECUTE.
COMPUTE trialscore.3 = totalscore.4 - totalscore.3.
EXECUTE.
And so on...
So that the result would look like this:
But of course it is not possible and not fun to do this for 200+ variables.
I attempted to write a syntax using VECTOR and DO REPEAT as follows:
COMPUTE #y = 1.
VECTOR totalscore = totalscore.1 to totalscore.5.
DO REPEAT trialscore = trialscore.1 to trialscore.5.
COMPUTE #y = #x + 1.
END REPEAT.
COMPUTE trialscore(#i) = totalscore(#y) - totalscore(#i).
EXECUTE.
But it doesn't work.
Any help is appreciated.
Ps. I've looked into using LAG but that loops over rows while I need it to go over 1 column at a time.

I am assuming respid is your original (unique) record identifier.
EDIT:
If you do not have a record indentifier, you can very easily create a dummy one:
compute respid=$casenum.
exe.
end of EDIT
You could try re-structuring the data, so that each score is a distinct record:
varstocases
/make totalscore from totalscore.1 to totalscore.5
/index=scorenumber
/NULL=keep.
exe.
then sort your cases so that scores are in descending order (in order to be bale to use lag function):
sort cases by respid (a) scorenumber (d).
Then actually do the lag-based computations
do if respid=lag(respid).
compute trialscore=totalscore-lag(totalscore).
end if.
exe.
In the end, un-do the restructuring:
casestovars
/id=respid
/index=scorenumber.
exe.
You should end up with a set of totalscore variables (the last one will be empty), which will hold what you need.

you can use do repeat this way:
do repeat
before=totalscore.1 to totalscore.4
/after=totalscore.2 to totalscore.5
/diff=trialscore.1 to trialscore.4 .
compute diff=after-before.
end repeat.

Related

Update table with random numbers in kdb+q

when I run the following script:
tbl: update prob: 1?100 from tbl;
I was expecting that I get a new column created with each row having a random number. However, I get back a column containing the same number for all the rows in the table.
How do I resolve this? I need to update my existing table and not create a table from scratch.
When you are using 1?100 you are only requesting 1 random value within the range of 0-100. If you use 10?100, you will be returned a list of 10 random values between 0-100.
So to do this in an update you want to use something like this
tbl:([]time:5?.z.p;sym:5?`3;price:5?10f;qty:5?10)
time sym price qty
-----------------------------------------------
2012.02.19D18:34:27.148501760 gkn 8.376952 9
2008.07.29D20:23:13.601434560 odo 7.041609 3
2007.02.07D08:17:59.482332864 pbl 0.955069 9
2001.04.27D03:36:44.475531384 aph 1.127308 2
2010.03.03D03:35:55.253069888 mgi 0.7663449 6
update r:abs count[i]?0h from tbl
time sym price qty r
-----------------------------------------------------
2012.02.19D18:34:27.148501760 gkn 8.376952 9 23885
2008.07.29D20:23:13.601434560 odo 7.041609 3 19312
2007.02.07D08:17:59.482332864 pbl 0.955069 9 10372
2001.04.27D03:36:44.475531384 aph 1.127308 2 25281
2010.03.03D03:35:55.253069888 mgi 0.7663449 6 27503
Note that I am using type short and abs to return positive values.
You need to seed your initial data, using something like rand(time), otherwise it will use the same seed, and thus, give the same sequence of random numbers.
EDIT: Per https://code.kx.com/wiki/Reference/SystemCommands
Use \S?n, where n is any integer.
EDIT2: Check out https://code.kx.com/wiki/Reference/SystemCommands#.5CS_.5Bn.5D_-_random_seed for how to use random numbers, please.
Just generate as many random numbers as you have rows using count tbl:
First create your table tbl:
tbl:([]date:reverse .z.d-til 100;price:sums 100?1f)
date price
--------------------
2018.04.26 0.2426471
2018.04.27 0.6163571
2018.04.28 1.179559
..
Then add a column of random numbers between 0 and 100:
update rdn:(count tbl)?100 from tbl
date price rdn
------------------------
2018.04.26 0.2426471 25
2018.04.27 0.6163571 33
2018.04.28 1.179559 13
..

Data Entry in SAS using Loops

I just learned about the "do" loop today and would like to try using it for data entry in SAS. I have tried most examples online, but I still cannot figure it out.
My dataset in an experiment with 6 treatments (1 to 6) using 2 sets of cues, 3 each, Visual and Audio. There's lag measured in seconds, which are 5, 10, and 15, which there are 2 sets.
Basically it looks like this:
Table
The entries I want are:
1. Obs_no, ranging from 1 to 18 (total of 18 observations, this allows me to easily delete outliers with an IF THEN)
2. Treatment type, which are Auditory and Visual.
3.Treatment number, 1 to 6, 3 sets.
4. Lag, 5, 10 or 15.
5. And the data itself
So far, my code makes 2 and 5 possible, it also makes the rest possible with an IF THEN statement and input statement, although I assume there's a way easier method:
data AVCue;
do cue = 'Auditory','Visual';
do i = 1 to 3;
input AVCue ##;
output;
end;
end;
datalines;
.204 .167 .202 .257 .283 .256
.170 .182 .198 .279 .235 .281
.181 .187 .236 .269 .260 .258
;
Lag and the rest was made possible using an IF THEN statement and the crude method of input:
data AVCue;
set AVCue;
IF i=1 THEN Lag=5;
IF i=2 THEN Lag=10;
IF i=3 THEN Lag=15;
input obs_no treatment;
cards;
1 1
2 2
3 3
4 4
5 5
6 6
7 1
8 2
9 3
10 4
11 5
12 6
13 1
14 2
15 3
16 4
17 5
18 6
;
proc print data=AVCue;
run;
The IF THEN should be fine, but the input statement here is just in my opinion counterproductive, and defeats the purpose of using loops, which is to me, to save time. If done this way, I might as well just put the data into excel and import it, or type everything out with ample copy and paste of the text in the
input obs_no treatment;
cards;
section.
My coding knowledge is basic, so sorry if this question sounds silly, I want to know:
1. How would I make a list of numbers using the "do" loops in SAS? I've made several attempts and all I get is a list containing the next number. I know why this happens, the loop counts to x and the value assigned would just be x. I just don't know how to get around that. Somehow this didn't happen in the datalines section, I guess SAS knows there's 18 numbers and the entry i is stored accordingly... or something?
2. How would I go about assigning in this case, the numbers 1 to 6 to each entry?
Thanks!
It is certainly much easier to read in the actual dataset instead of having to impute some of the variables based on the order the values have in the source data. You might be able to combine a SET statement and an INPUT statement in the same data step and get it to work, but it is probably NOT worth the effort. Just make two datasets and merge them.
Looking at the photograph you posted it looks like TREATMENT is not an independent variable. Instead it is just a label for the combination of CUE and LAG. To make it cycle from 1 to 6 just reset it back to 1 when it gets too large.
data AVCue;
do cue = 'Auditory','Visual';
do lag= 5, 10, 15 ;
treatment+1;
if treatment=7 then treatment=1;
obsno+1;
input AVCue ##;
output;
end;
end;
datalines;
.204 .167 .202 .257 .283 .256
.170 .182 .198 .279 .235 .281
.181 .187 .236 .269 .260 .258
;
You can get in trouble if you just let SAS guess at how you want to define your variables. For example if you change the order of the CUE values do cue = 'Visual','Auditory'; then SAS will make CUE with length $5 instead of $8. Add a LENGTH statement to define your variables before you use them.
length obsno 8 treatment 8 cue $8 lag 8 AVCue 8 ;
This will also let you control the order they are created in the dataset.
If you really did already have a SAS dataset and you wanted to add a variable like TREATMENT that cycled from 1 to 6 (or really any DO loop construct) then could nest the SET statement inside the DO loop. Just remember to add the explicit OUTPUT statement.
data new ;
do treatment=1 to 6 ;
set old;
output;
end;
run;

Link two tables based on conditions in matlab

I am using matlab to prepare my dataset in order to run it in certain data mining models and I am facing an issue with linking the data between two of my tables.
So, I have two tables, A and B, which contain sequential recordings of certain values in a certain timestamps and I want to create a third table, C, in which I will add columns of both A and B in the same rows according to some conditions.
Tables A and B don't have the same amount of rows (A has more measurements) but they both have two columns:
1st column: time of the recording (hh:mm:ss) and
2nd column: recorded value in that time
Columns of A and B are going to be added in table C when all the following conditions stand:
The time difference between A and B is more than 3 sec but less than 5 sec
The recorded value of A is the 40% - 50% of the recorded value of B.
Any help would be greatly appreciated.
For the first condition you need something like [row,col,val]=find((A(:,1)-B(:,1))>2sec && (A(:,1)-B(:,1))<5sec) where you do need to use datenum or equivalent to transform your timestamps. For the second condition this works the same, use [row,col,val]=find(A(:,2)>0.4*B(:,2) && A(:,2)<0.5*B(:,2)
datenum allows you to transform your arrays, so do that first:
A(:,1) = datenum(A(:,1));
B(:,1) = datenum(B(:,1));
you might need to check the documentation on datenum, regarding the format your string is in.
time1 = [datenum([0 0 0 0 0 3]) datenum([0 0 0 0 0 3])];
creates the datenums for 3 and 5 seconds. All combined:
A(:,1) = datenum(A(:,1));
B(:,1) = datenum(B(:,1));
time1 = [datenum([0 0 0 0 0 3]) datenum([0 0 0 0 0 3])];
[row1,col1,val1]=find((A(:,1)-B(:,1))>time1(1)&& (A(:,1)-B(:,1))<time1(2));
[row2,col2,val2]=find(A(:,2)>0.4*B(:,2) && A(:,2)<0.5*B(:,2);
The variables of row and col you might not need when you want only the values though. val1 contains the values of condition 1, val2 of condition 2. If you want both conditions to be valid at the same time, use both in the find command:
[row3,col3,val3]=find((A(:,1)-B(:,1))>time1(1)&& ...
(A(:,1)-B(:,1))<time1(2) && A(:,2)>0.4*B(:,2)...
&& A(:,2)<0.5*B(:,2);
The actual adding of your two arrays based on the conditions:
C = A(row3,2)+B(row3,2);
Thank you for your response and help! However for the time I followed a different approach by converting hh:mm:ss to seconds that will make the comparison easier later on:
dv1 = datevec(A, 'dd.mm.yyyy HH:MM:SS.FFF ');
secs = [3600,60,1];
dv1(:,6) = floor(dv1(:,6));
timestamp = dv1(:,4:6)*secs.';
Now I am working on combining both time and weight conditions in a piece of code that will run. Should I use an if condition inside a for loop or is a for loop not necessary?

Get rid of kth smallest and largest values of a dataset in SAS

I have a datset sort of like this
obs| foo | bar | more
1 | 111 | 11 | 9
2 | 9 | 2 | 2
........
I need to throw out the 4 largest and 4 smallest of foo (later then I would do a similar thing with bar) basically to proceed but I'm unsure the most effective way to do this. I know there are functions smallest and largest but I don't understand how I can use them to get the smallest 4 or largest 4 from an already made dataset. I guess alternatively I could just remove the min and max 4 times but that sounds needlessly tedious/time consuming. Is there a better way?
PROC RANK will do this for you pretty easily. If you know the total count of observations, it's trivial - it's slightly harder if you don't.
proc rank data=sashelp.class out=class_ranks(where=(height_r>4 and weight_r>4));
ranks height_r weight_r;
var height weight;
run;
That removes any observation that is in the 4 smallest heights or weights, for example. The largest 4 would require knowing the maximum rank, or doing a second processing step.
data class_final;
set class_ranks nobs=nobs;
if height_r lt (nobs-3) and weight_r lt (nobs-3);
run;
Of course if you're just removing the values then do it all in the data step and call missing the variable if the condition is met rather than deleting the observation.
You are going to need to make at least 2 passes through your dataset however you do this - one to find out what the top and bottom 4 values are, and one to exclude those observations.
You can use proc univariate to get the top and bottom 5 values, and then use the output from that to create a where filter for a subsequent data step. Here's an example:
ods _all_ close;
ods output extremeobs = extremeobs;
proc univariate data = sashelp.cars;
var MSRP INVOICE;
run;
ods listing;
data _null_;
do _N_ = 1 by 1 until (last.varname);
set extremeobs;
by varname notsorted;
if _n_ = 2 then call symput(cats(varname,'_top4'),high);
if _n_ = 4 then call symput(cats(varname,'_bottom4'),low);
end;
run;
data cars_filtered;
set sashelp.cars(where = ( &MSRP_BOTTOM4 < MSRP < &MSRP_TOP4
and &INVOICE_BOTTOM4 < INVOICE < &INVOICE_TOP4
)
);
run;
If there are multiple observations that tie for 4th largest / smallest this will filter out all of them.
You can use proc sql to place the number of distinct values of foo into a macro var (includes null values as distinct).
In you data step you can use first.foo and the macro var to selectively output only those that are no the smallest or largest 4 values.
proc sql noprint;
select count(distinct foo) + count(distinct case when foo is null then 1 end)
into :distinct_obs from have;
quit;
proc sort data = have; by foo; run;
data want;
set have;
by foo;
if first.foo then count+1;
if 4 < count < (&distinct_obs. - 3) then output;
drop count;
run;
I also found a way to do it that seems to work with IML (I'm practicing by trying to redo things different ways). I knew my maximum number of observations and had already sorted it by order of the variable of interest.
PROC IML;
EDIT data_set;
DELETE point {{1, 2, 3, 4,51, 52, 53, 54};
PURGE;
close data_set;
run;
I've not used IML very much but I stumbled upon this while reading documentation. Thank you to everyone who answered my question!

Creating loop syntax with a count variable (SPSS)

I have a longitudinal (person-level) dataset I am looking to for syntax for counting the number of times an event happened up to a certain point. More specifically I have 200 weeks of data (each week is coded 1-7, I'm only interested in weeks where the value is 5 or greater), But I am only interested in weeks that happened before a certain time point (the time point is different for each person but and captured under a single variable "eventweek"). So for person Y whose eventweek = 154, I want to know what percentage of weeks before week 154 (wks 1-153) where that person was coded a 5 or above. For person Z whose eventweek = 52, I want to know what percentage of weeks before week 52 (wks 1-51) for which that person was coded a 5 or above, and so on.
Any ideas on how to code this?
Try the following:
vector v = week1 to week200.
compute #z = 0.
loop #i = 1 to (eventweek -1).
compute #z = #z + (v(#i) ge 5).
end loop.
compute ew_perc = #z/(eventweek -1)*100.
exe.

Resources