How to use proc surveyselect to randomly select a sample while a variable's mean needs to stay the same? - sampling

I am having trouble using proc surveyselect to randomly select a sample from a population.
Here is the scenario:
I have a sample pool of, say, 1000 observations, with the variables ID, gender, and income. My goal is to randomly select 400 observations to form group 1, with the rest going to group 2. However, the mean of income in group 1 and in group 2 should be the same as the mean in the sample pool. I also need the proportions of males and females in groups 1 and 2 to be the same as in the pool. Is there any way to do this in proc surveyselect (SAS)?
Can anyone share example syntax?

You can control for gender by using a strata statement to tell proc surveyselect to sample each gender separately, then combine the separate samples for each gender. I think it should then be possible to use proc stdize to rescale the sample mean incomes based on the output from proc surveyselect and your original dataset. I don't have time to provide full details just now as this is quite a complex proc, but I think that's your best line of inquiry at this point.

Really you are just talking about using strata here, if your income is (or can be treated as) a discrete variable. An example:
data population;
  call streaminit(7);
  do _n_ = 1 to 1000;
    if rand('Uniform') > 0.6 then sex='M';
    else sex='F';
    income = ceil(6*rand('Uniform'));
    output;
  end;
run;
proc freq data=population;
  tables sex income;
run;
proc sort data=population;
  by sex income;
run;
proc surveyselect data=population out=sample samprate=0.4 outall;
  strata sex income;
run;
proc sort data=sample;
  by selected;
run;
proc freq data=sample;
  by selected;
  tables sex income;
run;
That gives you a 40% sample from each sex-by-income stratum separately (40% of 'Male, income=1', 40% of 'Female, income=3', and so on), so the overall sample ends up with the same sex and income distribution as the population.
This doesn't work when income is a continuous variable; you can try the CONTROL statement there, in which case the distribution won't match as exactly but should still be in the ballpark.
This does differ in selection probabilities from sampling the entire population while controlling for the two variables independently: here you get 40% of each combined bucket, whereas a whole-population sample that balances income and sex separately might have a lot more of 'Female 3' and fewer of 'Male 3', but then more of 'Male 2' and fewer of 'Female 2'.
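As a language-neutral illustration of what sampling within strata does, here is a minimal Python sketch (not SAS). The function name `stratified_split` and its parameters are made up for this example, and the per-stratum sample size uses Python's `round`, which is an assumption and need not match how proc surveyselect rounds.

```python
import random
from collections import defaultdict

def stratified_split(rows, keys, rate, seed=7):
    """Flag roughly `rate` of each stratum as selected (hypothetical helper)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for i, row in enumerate(rows):
        # A stratum is one combination of the key variables, e.g. ('F', 3).
        strata[tuple(row[k] for k in keys)].append(i)
    selected = set()
    for members in strata.values():
        n = round(len(members) * rate)  # assumed rounding rule
        selected.update(rng.sample(members, n))
    # Keep every row and add a 0/1 flag, similar in spirit to OUTALL.
    return [dict(row, selected=int(i in selected)) for i, row in enumerate(rows)]

# Simulated population: about 40% 'M', income in 1..6 (discrete).
rng = random.Random(7)
population = [
    {"sex": "M" if rng.random() > 0.6 else "F", "income": rng.randint(1, 6)}
    for _ in range(1000)
]
sample = stratified_split(population, ["sex", "income"], 0.4)
```

Because every sex-by-income cell contributes about 40% of its members, both groups inherit the population's sex proportions and income distribution, and therefore approximately its mean income.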

Related

How to establish successful rate by interval?

I have a data like this: state of projects - number of state's projects - interval
Is it possible to establish successful rate by interval?
For example: we add up the successful project counts (count_n where state = successful) and divide by the total number of projects (sum of count_n) within the 1-10 interval. We then do the same for the next interval.
I would like to get data like this:
successful rate | interval
X 1-10
Y 10-20
I'm coding in SAS but I can use SQL Server in it.
Thanks.
In PROC SQL, you can do it this way:
proc sql;
create table want as
select interval
, sum( (upcase(state) = 'SUCCESSFUL')*count_n)/sum(count_n) format=percent8.1 as success_rate
from have
group by interval
;
quit;
The code (upcase(state) = 'SUCCESSFUL') produces a 1/0 value such that only rows where the state is successful are summed. Multiplying this by count_n will give 0 for non-successful states and count_n for successful states. This is a shortcut that prevents you from having to do multiple joins to get the required numerator.
Example code:
data have;
length state $20.;
input state$ count_n interval$;
datalines;
successful 70 1-10
successful 10 1-10
fail 20 1-10
successful 70 11-20
successful 5 11-20
fail 25 11-20
;
run;
Output:
interval success_rate
1-10 80.0%
11-20 75.0%
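The indicator-times-count trick is not specific to SQL. As a rough cross-check of the numbers above, here is a small Python sketch using the same six rows; the function name `success_rate` is invented for this illustration.

```python
from collections import defaultdict

have = [
    ("successful", 70, "1-10"),
    ("successful", 10, "1-10"),
    ("fail", 20, "1-10"),
    ("successful", 70, "11-20"),
    ("successful", 5, "11-20"),
    ("fail", 25, "11-20"),
]

def success_rate(rows):
    """Per-interval share of count_n that belongs to successful rows."""
    numerator = defaultdict(int)
    denominator = defaultdict(int)
    for state, count_n, interval in rows:
        # The comparison yields True/False, which acts as 1/0 when multiplied.
        numerator[interval] += (state.upper() == "SUCCESSFUL") * count_n
        denominator[interval] += count_n
    return {k: numerator[k] / denominator[k] for k in denominator}

rates = success_rate(have)
for interval, rate in rates.items():
    print(f"{interval} {rate:.1%}")
```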
I prefer using pre-defined SAS procedures whenever possible - they're typically more efficient.
For something like this you can use PROC FREQ.
You need to specify the WEIGHT statement with the count variable to indicate that each observation is counted multiple times, and then you can get a variety of percentages - PCT_COL is the percentage within each column category (here, each interval).
Remove the WHERE/KEEP to see the full output and the different statistics it generates for you. Neither of these solutions accounts for missing values. If you need to, add the MISSING option within PROC FREQ.
proc freq data=have noprint;
table state*interval / out=want (keep = state interval count pct_col where=(state='successful')) missing outpct;
weight count_n;
run;
proc print data=want;
run;

trying to concatenate values from data sets into an array in sas

I am trying to add a Data step that creates the work.orders_fin_qtr_tot data set from the work.orders_fin_tot data set. This new data set should contain new variables for quarterly sales and profit. Use two arrays to create the new variables: QtrSales1-QtrSales4 and QtrProfit1-QtrProfit4. These represent total sales and total profit for the quarter (1-4). Use the quarter number of the year in which the order was placed to index into the correct variable to add either the TotalSales or TotalProfit to the new appropriate variable.
Add a Proc step that displays the first 10 observations of the work.orders_fin_qtr_tot data set.
My issue is that I can't seem to get the two different arrays to meld without spaces
proc sort data=work.orders_fin_tot_qtr;
by workqtr;
run;
data work.orders_fin_tot_qtr;
set work.orders_fin_tot_qtr;
array QtrSales{4} quarter1-quarter4 ;
do i = 1 by 1 until (last.order_id);
if workqtr=i then QtrSales{i}=totalsales;
end;
drop totalsales totalprofit _TYPE_ _FREQ_;
run;
proc print data=work.orders_fin_tot_qtr;
run;
The syntax last.order_id is only appropriate if there is a BY statement in the DATA Step -- if not present, the last. reference is always missing and the loop will never end; so you have coded an infinite loop!
The step has drop totalsales totalprofit _TYPE_ _FREQ_. Those underscored variables indicate the incoming data set was probably created with a Proc SUMMARY.
Your orders_fin_tot data set should have columns order_id quarter (valid values 1,2,3,4), and totalsales. If the data is multi-year, it should have another column named year.
The missing BY and the presence of last.order_id indicate you are reshaping the data from a categorical vector going down a column to one that goes across a row -- this is known as a pivot or transpose. The do construct you show in the question is incorrect but similar to a technique known in SAS circles as a DOW loop -- what makes the technique special is that the SET and BY are coded inside the loop.
Try adjusting your code to the following pattern
data want;
  do _n_ = 1 by 1 until (last.order_id);
    SET work.orders_fin_tot; * <--- presumed to have data 'down' a column for each quarter of an order_id;
    BY order_id; * <--- ensures data is sorted and makes automatic flag variable LAST.ORDER_ID available for the until test;
    array QtrSales quarter1-quarter4 ; * <--- define array for step and creates four variables in the program data vector (PDV);
    * this is where the pivot magic happens;
    * the (presumed) quarter value (1,2,3,4) from data going down the input column becomes an
    * index into an array connected to variables going across the PDV (the output row);
    QtrSales{quarter} = totalsales;
  end;
run;
Notice there is no OUTPUT statement inside or outside the loop. When the loop completes its iterations for an order_id, the code flow reaches the bottom of the data step and does an implicit OUTPUT (because there is no explicit OUTPUT elsewhere in the step).
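For readers who think more easily in a general-purpose language, the reshaping done by that loop can be sketched in Python (this is not SAS, and the three input rows are hypothetical). Each group of rows sharing an order_id collapses into one output row, with the quarter value used as an index into four slots.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input, already sorted by order_id (the role of the BY statement).
orders = [
    {"order_id": 1, "quarter": 1, "totalsales": 100.0},
    {"order_id": 1, "quarter": 3, "totalsales": 50.0},
    {"order_id": 2, "quarter": 2, "totalsales": 75.0},
]

def pivot_quarters(rows):
    """Turn one row per (order_id, quarter) into one row per order_id."""
    out = []
    for order_id, group in groupby(rows, key=itemgetter("order_id")):
        qtr_sales = [None] * 4  # plays the role of quarter1-quarter4
        for row in group:
            # The quarter value (1..4) indexes into the array of output slots.
            qtr_sales[row["quarter"] - 1] = row["totalsales"]
        record = {"order_id": order_id}
        for i, value in enumerate(qtr_sales, start=1):
            record[f"quarter{i}"] = value
        out.append(record)  # one output row per group, like the implicit OUTPUT
    return out

want = pivot_quarters(orders)
```

Quarters with no input row stay empty (None here, missing in SAS).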
Also, for any data set specified in code, you can use data set option OBS= to select which observation numbers are used.
proc print data=MyData(obs=10);
OBS is a tricky option name because it really means last observation number to use. FIRSTOBS is another data set option for specifying the row numbers to use, and when not present defaults to 1. So the above is equivalent to
proc print data=MyData(firstobs=1 obs=10);
OBS= should be thought of conceptually as LASTOBS=; there is no actual option name LASTOBS. The following would log an ERROR: because OBS < FIRSTOBS
proc print data=MyData(firstobs=50 obs=10);

Calculate number of full time equivalent employees using DAX measures

I am able to calculate how many employees are working using DAX measures:
Number of employees started := CALCULATE(COUNTA([Emp from]);FILTER(ALL(tDate[Date]);tDate[Date]<=MAX(tDate[Date])))
Number of employees quit := CALCULATE(COUNTA([Emp unitl]);FILTER(ALL(tDate[Date]);tDate[Date]<=MAX(tDate[Date])))
Number of employees working := [Number of employees started] - [Number of employees quit]
But I have not managed to calculate how many full time equivalent employees are working. Each employee has a workload from 0% to 100%.
How can I calculate the number of full time equivalent employees?
I have tried the following for number of full time equivalent employees started, but in contrast to the measures above it doesn't accumulate over time. It just shows the results for each individual month:
Number of full time equivalent employees started:=CALCULATE(SUMX(tEmployees;tEmployees[Workload]*Not(ISBLANK(tEmployees[Emp from])));FILTER(ALL(tDate[Date]);tDate[Date]<=MAX(tDate[date])))
Do you have any suggestion for how I can solve this?
You might try something like this. In your Emp table, have the start date and end date for each employee. In your measure, use the CALCULATETABLE function to create an in-memory table that has one row for each date in your date table and each employee ID. This same in-memory table would take the work percentage and create a column that represents the number of hours worked by that employee on that day. Then you just need to express the number of 'equivalent' employees as: sum of hours worked / (number of hours in a 'full-time work day' * count of days in period). This should give you a measure you can use along with your dates to find the number of full-time equivalent employees on any given day or over any given period.
See my sample table structure in this TechNet forum post. This is a modelling problem first, and a DAX problem second.
Once you've created your headcount fact, all of this becomes trivial.
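The arithmetic behind the "hours worked / (full-time hours * days)" idea can be checked outside DAX. The following Python sketch is only an illustration under stated assumptions: the two employees are hypothetical, a full-time day is assumed to be 8 hours, every calendar day counts as a working day, and the end date is treated as the last day worked.

```python
from datetime import date, timedelta

FULL_TIME_HOURS = 8.0  # assumed length of a full-time work day

# Hypothetical employees; workload is a fraction between 0 and 1.
employees = [
    {"id": 1, "emp_from": date(2020, 1, 1), "emp_until": None, "workload": 1.0},
    {"id": 2, "emp_from": date(2020, 1, 1), "emp_until": date(2020, 1, 3), "workload": 0.5},
]

def fte(staff, start, end):
    """Full-time equivalents over the inclusive period start..end."""
    days = (end - start).days + 1
    hours = 0.0
    for offset in range(days):
        day = start + timedelta(days=offset)
        for e in staff:
            # One (date, employee) pair per iteration, like the in-memory table.
            active = e["emp_from"] <= day and (
                e["emp_until"] is None or day <= e["emp_until"]
            )
            if active:
                hours += e["workload"] * FULL_TIME_HOURS
    return hours / (FULL_TIME_HOURS * days)
```

For a single day this reduces to the sum of workloads of the employees active on that day; over a longer period it is the average of those daily sums.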

SAS: If a column has values that are in an array

Below is an example of what I'm doing. I want to get the subset of the dataset.(i.e the rows that have these letters in the Alphabet column). I want to select only those records where Transport_company is either Hyundai, Toyota or Ford.
Data arrayInIf;
set OldTable;
array Car_array {3}a b c('Hyundai', 'Toyota', 'Ford');
If Transport_company ^= Car_array
Then
Delete;
Run;
What's wrong? How can I get this to work?
Ok, so sample data would be:
Zip Transport_Company No. Sold
12345 Hyundai 10
90145 NASA 50
20202 Toyota 30
40002 Harley Davidson 5
10000 Ford 15
So, I would only want to keep all the rows related to car companies
Robbie's right that if your data isn't already in an array you shouldn't use array methods, as it's adding extra complication - in is fine.
However, if it is in an array already, whichc (or whichn for numerics) is a good solution.
data oldtable;
input Zip Transport_Company $ No_Sold;
datalines;
12345 Hyundai 10
90145 NASA 50
20202 Toyota 30
40002 HarleyDavidson 5
10000 Ford 15
;;;;
run;
Data arrayInIf;
set OldTable;
array Car_array{3} $ ('Hyundai', 'Toyota', 'Ford');
If whichc(transport_company,of car_array[*])=0
Then
Delete;
Run;
In general, the best way to do this is to construct a format. Look up PROC FORMAT CNTLIN for how to do this from a dataset; or you can do this in code:
proc format;
  value $automakerF
    'Hyundai','Toyota','Ford'='1'
    other='0';
run;
data fmtInIf;
  set oldtable;
  * a character format is referenced with its $ prefix;
  if put(transport_company,$automakerF.) ne '1'
    then delete;
run;
This has the advantage of separating your data from your code, and you can bring the automaker names in from a dataset if you want; you can also handle all of your different industries in one format. It's also very fast, faster than a bunch of if statements or the in operator.
I think you don't need to use array here. If you just want to select rows based on multiple values, use the in keyword. The concept of array in SAS is different from some other programming languages which usually see array as a set of string and numerical values. The array in SAS stores a set of columns (variables).
data b;
set a;
where Transport_Company in ('Hyundai', 'Toyota', 'Ford');
run;
The output:
Obs Zip Transport_Company Sold
1 12345 Hyundai 10
2 20202 Toyota 30
3 10000 Ford 15
As @alex has mentioned in his comment, if you need to filter rows based on a long list, where ... in () will become cumbersome. In that case, my usual solution is to create a new dataset containing these names.
Transport_Company
Hyundai
Toyota
Ford
...
BMW
Then do a simple pseudo-merge (conditional selection) using proc sql. This should be fairly fast.
proc sql;
create table c as
select a.* from a, cars where a.Transport_Company = cars.Transport_Company;
quit;
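All of the approaches in this thread (in, whichc, a format, or a join against a lookup table) boil down to a membership test against a set of allowed names. Purely as an illustration of that logic, here is the same filter as a Python sketch on the sample rows from the question.

```python
# Sample rows from the question: (zip, transport_company, no_sold).
old_table = [
    (12345, "Hyundai", 10),
    (90145, "NASA", 50),
    (20202, "Toyota", 30),
    (40002, "Harley Davidson", 5),
    (10000, "Ford", 15),
]

# The lookup list; in practice it could be loaded from a separate table.
car_makers = {"Hyundai", "Toyota", "Ford"}

# Keep only rows whose company appears in the lookup set.
kept = [row for row in old_table if row[1] in car_makers]

for row in kept:
    print(*row)
```

Keeping the list of names in its own table, as the join above does, means the filter code does not change when the list grows.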

SPSS: Loop through the values of a variable

I have a dataset that has patient data according to the site they visited our mobile clinic. I have now written up a series of commands such as freqs and crosstabs to produce the analyses I need, however I would like this to be done for patients at each site, rather than the dataset as whole.
If I had only one site, a mere filter command with the variable that specifies a patient's site would suffice, but alas I have 19 sites, so I would like to find a way to loop through my code to produce these outputs for each site. That is to say for i in 1 to 19:
1. Take the i th site
2. Compute a filter for this i th site
3. Run the tables using this filtered data of patients at ith site
Here is my first attempt using DO REPEAT. I also tried using LOOP earlier.
However, it does not work; I keep getting an error even though these are closed loops.
Is there a way to do this in SPSS syntax? Bear in mind I do not know Python well enough to do this using that plugin.
*LOOP #ind= 1 TO 19 BY 1.
DO REPEAT #ind= 1 TO 20.
****8888888888888888888888888888888888888888888888888888888 Select the Site here.
COMPUTE filter_site=(RCDSITE=#ind).
USE ALL.
FILTER BY filter_site.
**********************Step 3: Apply the necessary code for tables
*********Participation in the wellness screening, we actually do not care about those who did FP as we are not reporting it.
COUNT BIO= CheckB (1).
* COUNT FPS=CheckF(1).
* COUNT BnF= CheckB CheckF(1).
VAL LABEL BIO
1 ' Has the Wellness screening'
0 'Does not have the wellness screening'.
*VAL LABEL FPS
1 'Has the First patient survey'.
* VAL LABEL BnF
1 'Has either Wellness or FPS'
2 'Has both surveys done'.
FREQ BIO.
*************************Use simple math to calcuate those who only did the Wellness/First Patient survey FUB= F+B -FnB.
*******************************************************Executive Summary.
***********Blood Pressure.
FREQ BP.
*******************BMI.
FREQ BMI.
******************Waist Circumference.
FREQ OBESITY.
******************Glucose.
FREQ GLUCOSE.
*******************Cholesterol.
FREQ TC.
************************ Heamoglobin.
FREQ HAEMOGLOBIN.
*********************HIV.
FREQ HIV.
******************************************************************************I Lifestyle and General Health.
MISSING VALUES Gender GroupDep B8 to B13 ('').
******************Graphs 3.1
Is this just Frequencies you are producing? Try the SPLIT FILE command with the variable RCDSITE. That should be enough.
SPLIT FILE allows you to partition your data by up to eight variables. Then each procedure will automatically iterate over each group.
If you need to group the results at a higher level than the procedure, that is, to run a bunch of procedures for each group before moving on to the next one so that all the output for a group will be together, you can use the SPSSINC SPLIT DATASET and SPSSINC PROCESS FILES extension commands to do this.
These commands require the Python Essentials. That and the commands can be downloaded from the SPSS Community website (www.ibm.com/developerworks/spssdevcentral) if you have at least version 18.
HTH,
Jon Peck
A simple but perhaps not very elegant way is to select from the menu: Data/Select Cases/If condition, there you enter the filter for site 1 and press Paste, not OK.
This will give the used filter as syntax code.
So with some copy/paste/replace/repeat you can get the freqs and all other results based on the different sites.
