I'm just learning SPSS and I want to do simple subgroup analysis based on a variable "status" I created which can take values from 0 to 8. I would like to print outputs in one go.
this is the pseudocode for what I want to do:
for( i = 1, i = 8, i++)
{
filter by (ststus = i)
display analysis
remove filter
}
That way I can do it all in one go but also i can add to the analysis code and do something easily for the 8 subgroups.
I don't know if it's relevant but here is the code I want to iterate over currently:
USE ALL.
COMPUTE filter_$=(Workforce EQ 1 AND SurveySample = 1 AND State = 1).
VARIABLE LABELS filter_$ 'Workforce EQ 1 (FILTER)'.
> VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'. FORMATS filter_$
> (f1.0). FILTER BY filter_$. EXECUTE.
>
>
> FREQUENCIES VARIABLES = Q86 Q33 Q34 Q88 FSEScore /BARCHART FREQ
> /ORDER=ANALYSIS.
>
> CROSSTABS /TABLES=FSEScore BY Q86 /FORMAT=AVALUE TABLES
> /CELLS=ROW /COUNT ROUND CELL.
>
> FILTER OFF. USE ALL.
Thanks guys.
split file command may solve the problem - it causes your analysis reports to show results for each category of your split variable separately:
*run your transformations.
sort cases by status.
split file by status.
FREQUENCIES .....
CROSSTABS ....
split file off.
If this is not enough, you can use a macro to run through "status" categories:
first define the macro:
define MyMacro ()
!do !ST=1 !to 8
* filter commands using **status = !ST**
* transformations using **status = !ST**
FREQUENCIES .....
CROSSTABS ....
!doend
!enddefine.
now call your macro:
MyMacro .
this is probably a very getto way of doing this, the suggestion above is probably more sensible.
You can initialise Python is spss. The following code works:
begin program.
import spss
for i in xrange(1,8):
string = str(i)
spss.Submit("""
USE ALL.
COMPUTE filter_$=(Workforce EQ 1 AND SurveySample = 1 AND Status = %s).
VARIABLE LABELS filter_$ 'Workforce EQ 1 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.
#analysis as required
FREQUENCIES VARIABLES = Q86
/BARCHART FREQ
/ORDER=ANALYSIS.
"""%(' '.join(string)) )
end program.
Many thanks to eli-k I probably should have just used splitfile.
Related
I want to do something I thought was really simple.
My (mock) data looks like this:
data list free/totalscore.1 to totalscore.5.
begin data.
1 2 6 7 10 1 4 9 11 12 0 2 4 6 9
end data.
These are total scores accumulating over a number of trials (in this mock data, from 1 to 5). Now I want to know the number of scores earned in each trial. In other words, I want to subtract the value in the n trial from the n+1 trial.
The most simple syntax would look like this:
COMPUTE trialscore.1 = totalscore.2 - totalscore.1.
EXECUTE.
COMPUTE trialscore.2 = totalscore.3 - totalscore.2.
EXECUTE.
COMPUTE trialscore.3 = totalscore.4 - totalscore.3.
EXECUTE.
And so on...
So that the result would look like this:
But of course it is not possible and not fun to do this for 200+ variables.
I attempted to write a syntax using VECTOR and DO REPEAT as follows:
COMPUTE #y = 1.
VECTOR totalscore = totalscore.1 to totalscore.5.
DO REPEAT trialscore = trialscore.1 to trialscore.5.
COMPUTE #y = #x + 1.
END REPEAT.
COMPUTE trialscore(#i) = totalscore(#y) - totalscore(#i).
EXECUTE.
But it doesn't work.
Any help is appreciated.
Ps. I've looked into using LAG but that loops over rows while I need it to go over 1 column at a time.
I am assuming respid is your original (unique) record identifier.
EDIT:
If you do not have a record indentifier, you can very easily create a dummy one:
compute respid=$casenum.
exe.
end of EDIT
You could try re-structuring the data, so that each score is a distinct record:
varstocases
/make totalscore from totalscore.1 to totalscore.5
/index=scorenumber
/NULL=keep.
exe.
then sort your cases so that scores are in descending order (in order to be bale to use lag function):
sort cases by respid (a) scorenumber (d).
Then actually do the lag-based computations
do if respid=lag(respid).
compute trialscore=totalscore-lag(totalscore).
end if.
exe.
In the end, un-do the restructuring:
casestovars
/id=respid
/index=scorenumber.
exe.
You should end up with a set of totalscore variables (the last one will be empty), which will hold what you need.
you can use do repeat this way:
do repeat
before=totalscore.1 to totalscore.4
/after=totalscore.2 to totalscore.5
/diff=trialscore.1 to trialscore.4 .
compute diff=after-before.
end repeat.
I want to iterate the same code for different macro sets like in SAS and then append all tables populated together. As I am coming from sas background, I am quite confused about how to do this in Pyspark environment. Any help is much appreciated!
Example code is below :
STEP1: define macro variables
lastyear_st=201615
lastyear_end=201622
thisyear_st=201715
thisyear_end=201722
STEP2: loop the code through various macro variables
customer_spend=sqlContext.sql("""
select a.customer_code,
sum(case when a.week_id between %d and %d then a.spend else 0 end) as spend
from tableA
group by a.card_code
"""
%(lastyear_st,lastyear_end)
(thisyear_st,thisyear_end))
STEP3: append each of the dataset populated above to the base table
# macroVars are your start and end values arranged as list of list.
# where each innner list contains start and end value
macroVars = [[201615,201622],[201715, 201722]]
# loop thru list of list ==>
for start,end in macroVars:
# prepare query using the values of start and end
query = "SELECT a.customer_code,Sum(CASE\
WHEN a.week_id BETWEEN {} AND {} \
THEN a.spend \
ELSE 0 END) \
AS spend FROM tablea GROUP BY a.card_code".format(start,end)
# execute query
customer_spend = sqlContext.sql(query)
# depending on your base table setup use appropriate write command for example
customer_spend\
.write.mode('append')\
.parquet(os.path.join(tempfile.mkdtemp(), 'data'))
I have a cell array called BodyData in MATLAB that has around 139 columns and 3500 odd rows of skeletal tracking data.
I need to extract all rows between two string values (these are timestamps when an event happened) that I have
e.g.
BodyData{}=
Column 1 2 3
'10:15:15.332' 'BASE05' ...
...
'10:17:33:230' 'BASE05' ...
The two timestamps should match a value in the array but might also be within a few ms of those in the array e.g.
TimeStamp1 = '10:15:15.560'
TimeStamp2 = '10:17:33.233'
I have several questions!
How can I return an array for all the data between the two string values plus or minus a small threshold of say .100ms?
Also can I also add another condition to say that all str values in column2 must also be the same, otherwise ignore? For example, only return the timestamps between A and B only if 'BASE02'
Many thanks,
The best approach to the first part of your problem is probably to change from strings to numeric date values. In Matlab this can be done quite painlessly with datenum.
For the second part you can just use logical indexing... this is were you put a condition (i.e. that second columns is BASE02) within the indexing expression.
A self-contained example:
% some example data:
BodyData = {'10:15:15.332', 'BASE05', 'foo';...
'10:15:16.332', 'BASE02', 'bar';...
'10:15:17.332', 'BASE05', 'foo';...
'10:15:18.332', 'BASE02', 'foo';...
'10:15:19.332', 'BASE05', 'bar'};
% create column vector of numeric times, and define start/end times
dateValues = datenum(BodyData(:, 1), 'HH:MM:SS.FFF');
startTime = datenum('10:15:16.100', 'HH:MM:SS.FFF');
endTime = datenum('10:15:18.500', 'HH:MM:SS.FFF');
% select data in range, and where second column is 'BASE02'
BodyData(dateValues > startTime & dateValues < endTime & strcmp(BodyData(:, 2), 'BASE02'), :)
Returns:
ans =
'10:15:16.332' 'BASE02' 'bar'
'10:15:18.332' 'BASE02' 'foo'
References: datenum manual page, matlab help page on logical indexing.
I am trying to run this code:
function calcs.grps(Number,ion_color)
grp .. ion_color .. Y[Number] = (ion_py_mm)
grp .. ion_color .. Z[Number] = (ion_pz_mm)
end
in a Lua script, the arrays already exist (eg grp2Y,grp5Z etc) and I want to use this function to populate them based upon the two variables fed in. I keep getting the error ' '=' expected near '..' '. What am I doing wrong?
To flesh it out a bit:
I am 'flying' 120 ions in my simulation. This is actually 12 groups of 10 ions. The individual groups of 10 are distinguished by ion_color, which is an integer value from 1 to 12. The variable 'Number' just cycles through 1 to 10 each time before moving on to the next color. Once I have populated these arrays I want to get the standard deviation for each group.
Thank you!
You can't "construct" name of variable, but you can construct an index. Use two levels of nested tables.
function calcs.grps(Number,ion_color)
ion['grp' .. ion_color .. 'Y'][Number] = (ion_py_mm)
ion['grp' .. ion_color .. 'Z'][Number] = (ion_pz_mm)
end
Well, actually you can, since all global variables are just entries in _G table, but don't do that since it is bad - it is unreadable, makes stuff spill to other functions you didn't intend to, etc.
The technical answer to your question is to simply index _G, _G is a table which holds all global variables:
function calcs.grps(Number,ion_color)
_G['grp' .. ion_color .. Y'][Number] = (ion_py_mm)
_G['grp' .. ion_color .. 'Z'][Number] = (ion_pz_mm)
end
But I think the better question, is why aren't you organizing it like this...
local ions = {
Red = {
{
Y = 0, --Y property
Z = 0 --Z property
},
--Continue your red ions
},
NewColor = {
Y = 0, --Y property
Z = 0 --Z property
},
--Continue this color's ions
},
--You get the idea
}
function calcs.grps(color, number)
ions[color][number].Y = (ion_py_mm)
ions[color][number].Z = (ion_pz_mm)
end
Then you would pass a color, and a number indicating which ion of this color
It looks a lot cleaner, IMO.
I am using matlab to prepare my dataset in order to run it in certain data mining models and I am facing an issue with linking the data between two of my tables.
So, I have two tables, A and B, which contain sequential recordings of certain values in a certain timestamps and I want to create a third table, C, in which I will add columns of both A and B in the same rows according to some conditions.
Tables A and B don't have the same amount of rows (A has more measurements) but they both have two columns:
1st column: time of the recording (hh:mm:ss) and
2nd column: recorded value in that time
Columns of A and B are going to be added in table C when all the following conditions stand:
The time difference between A and B is more than 3 sec but less than 5 sec
The recorded value of A is the 40% - 50% of the recorded value of B.
Any help would be greatly appreciated.
For the first condition you need something like [row,col,val]=find((A(:,1)-B(:,1))>2sec && (A(:,1)-B(:,1))<5sec) where you do need to use datenum or equivalent to transform your timestamps. For the second condition this works the same, use [row,col,val]=find(A(:,2)>0.4*B(:,2) && A(:,2)<0.5*B(:,2)
datenum allows you to transform your arrays, so do that first:
A(:,1) = datenum(A(:,1));
B(:,1) = datenum(B(:,1));
you might need to check the documentation on datenum, regarding the format your string is in.
time1 = [datenum([0 0 0 0 0 3]) datenum([0 0 0 0 0 3])];
creates the datenums for 3 and 5 seconds. All combined:
A(:,1) = datenum(A(:,1));
B(:,1) = datenum(B(:,1));
time1 = [datenum([0 0 0 0 0 3]) datenum([0 0 0 0 0 3])];
[row1,col1,val1]=find((A(:,1)-B(:,1))>time1(1)&& (A(:,1)-B(:,1))<time1(2));
[row2,col2,val2]=find(A(:,2)>0.4*B(:,2) && A(:,2)<0.5*B(:,2);
The variables of row and col you might not need when you want only the values though. val1 contains the values of condition 1, val2 of condition 2. If you want both conditions to be valid at the same time, use both in the find command:
[row3,col3,val3]=find((A(:,1)-B(:,1))>time1(1)&& ...
(A(:,1)-B(:,1))<time1(2) && A(:,2)>0.4*B(:,2)...
&& A(:,2)<0.5*B(:,2);
The actual adding of your two arrays based on the conditions:
C = A(row3,2)+B(row3,2);
Thank you for your response and help! However for the time I followed a different approach by converting hh:mm:ss to seconds that will make the comparison easier later on:
dv1 = datevec(A, 'dd.mm.yyyy HH:MM:SS.FFF ');
secs = [3600,60,1];
dv1(:,6) = floor(dv1(:,6));
timestamp = dv1(:,4:6)*secs.';
Now I am working on combining both time and weight conditions in a piece of code that will run. Should I use an if condition inside a for loop or is a for loop not necessary?