parsing a text file in sas - database

So I have a rather messy text file I'm trying to convert to a sas data set. It looks something like this (though much bigger):
0305679 SMITH, JOHN ARCH05 001 2
ARCH05 005 3
ARCH05 001 7
I'm trying to set 5 separate variables (ID, name, job, time, hours) but clearly only 3 of the variables appear after the first line. I tried this:
infile "C:\Users\Desktop\jobs.txt" dlm = ' ' dsd missover;
input ID $ name $ job $ time hours;
and didn't get the right output, then I tried to parse it
infile "C:\Users\Desktop\jobs.txt" dlm = ' ' dsd missover; input
allData $; id = substr(allData, find(allData,"305")-2, 7);
but I'm still not getting the right output. Any ideas?
EDIT: I'm now trying to use scan() and substr() to take apart the larger data set. How do I subset a single line from the data?

Your data might not be all that messy; it just might be in a hierarchical format where the first row contains all five variables and subsequent rows contain values for variables 3-5. In other words, ID and NAME should be retained as you read through the file.
If that is correct (it's a hierarchical layout), here is a possible solution:
data have;
retain ID NAME;
informat ID 7. JOB $6. TIME 3. HOURS 1.;
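input @1 test_string $7. @; * peek at the first 7 columns and hold the line;
if notdigit(test_string) = 0
then input @1 ID NAME $12. JOB time hours; * all digits: header line with ID and NAME;
else input @1 JOB time hours; * otherwise: continuation line, ID and NAME are retained;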
drop test_string;
datalines;
0305679 SMITH, JOHN ARCH05 001 2
ARCH05 005 3
ARCH05 001 7
0305680 JONES, MARY ARCH06 002 4
ARCH06 005 3
ARCH07 001 7
run;
The key thing is to really understand how your raw file is organized. Once you know the rules, using SAS to read it is a snap!
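The same idea can be pointed at the external file from the question instead of in-stream data. This is only a sketch under assumptions: the path is the one given in the question, the 12-column width for NAME is taken from the answer above and may not hold for the whole file, and ID is read as character here (my choice, not the answer's) so the leading zero is kept.

```sas
data jobs;
  infile "C:\Users\Desktop\jobs.txt" truncover;  /* path taken from the question */
  length ID $7 NAME $12 JOB $6;                  /* widths are assumptions */
  retain ID NAME;                                /* carry header values down to detail lines */
  input @1 test_string $7. @;                    /* peek at columns 1-7 and hold the line */
  if notdigit(test_string) = 0 then
    /* header line: ID in columns 1-7, one blank, NAME in the next 12 columns */
    input @1 ID $7. +1 NAME $12. JOB TIME HOURS;
  else
    /* detail line: only JOB, TIME and HOURS are present */
    input @1 JOB TIME HOURS;
  drop test_string;
run;

proc print data=jobs;
run;
```

If names in the real file vary in length, the fixed `$12.` read would need to be replaced by something that splits on the known layout of the line.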

A list input solution could be the following:
data have;
array all(6) $20. ID LNAME FNAME JOB TIME HOURS;
retain Id Lname Fname;
drop i;
input @;
nitems = countw(_infile_,', ');
if notdigit(scan(_infile_,1)) = 0 then
do i = 1 to nitems;
all(i) = Scan(_infile_,i);
end;
else
do i = 1 to 3;
all(i+3) = Scan(_infile_,i);
if i = 6 then all(i) = all(i)*1;
end;
datalines;
0305679 SMITH, JOHN ARCH05 001 2
ARCH05 005 3
ARCH05 001 7
0305680 JONES, MARY ARCH06 002 4
ARCH06 005 3
ARCH07 001 7
run;
proc print; run;

Related

Modifying final column value in SAS by group

I have the following data set:
Student TestDayStart TestDayEnd
001 1 5
001 6 10
001 11 15
002 1 4
002 5 9
002 10 14
I would like to make the last 'TestDayEnd' the final value for 'TestDayStart' for each Student.
So the data should look like this:
Student TestDayStart TestDayEnd
001 1 5
001 6 10
001 11 15
001 15 15
002 1 4
002 5 9
002 10 14
002 14 14
I'm not quite sure how I can do this in SAS. Any insight would be appreciated.
After sorting the dataset you can do this within a data step.
proc sort data=have;
by student testdaystart testdayend;
run;
Now you can use the by and retain statements in the data step. The by statement allows you to find the last student, and the retain statement lets you keep the previous value in the dataset.
data want;
set have;
retain last_testdayend;
by student testdaystart testdayend;
output;
last_testdayend = testdayend;
if last.student then do;
if testdaystart ne testdayend then do;
testdaystart = last_testdayend;
testdayend = last_testdayend;
output; * this second output statement creates a new record in the dataset;
end;
end;
drop last_testdayend;
run;

Retain last 5 visits by Person in SAS

I have the following that contains dates, the visit number, and a specific variable of interest. I would like to retain the last five visits that are available in SAS by person. I am familiar with retaining the first and last visits. The data for a single subject is listed below:
Person Date VisitNumber VariableOfInterest
001 10/10/2001 1 6
001 11/12/2001 3 8
001 01/05/2002 5 12
001 03/10/2002 6 5
001 05/03/2002 8 3
001 07/29/2002 10 11
Any insight would be appreciated.
A double DOW loop will let you measure the group in the first loop and select from the group based on your desired per-group criteria in the second loop. This is useful when have is large and pre-sorted, and you want to avoid additional sorting.
data want;
* measure the group size;
do _n_ = 1 by 1 until (last.person);
set have;
by person visitnumber; * visitnumber is in the by statement only to enforce the expected sort order;
end;
_i_ = _n_;
* apply the criteria "last 5 rows in group";
do _n_ = 1 to _n_;
set have;
if _i_ - _n_ < 5 then output;
end;
run;
It is easier if you sort by descending VisitNumber, so that the problem becomes "take the first 5 observations for a person". Then just generate a counter of which observation this is for the person and subset on that.
data want;
set have ;
by person descending visitnumber;
if first.person then rowno=0;
rowno+1;
if rowno <= 5;
run;
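The sort that this second approach relies on is not shown. A minimal sketch, assuming the input data set is called have as in the step above and that overwriting it in sorted order is acceptable:

```sas
/* Put each person's most recent visits first */
proc sort data=have;
  by person descending visitnumber;
run;
```

After the subsetting step, want is still in descending visit order, so a second sort by person and visitnumber may be needed if chronological order matters.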

SAS macro to print out change of baseline scores

I'm looking for a way to print out change of tests scores for each subject with a SAS macro. Here is a sample of the data:
Subject Visit Date Test Score
001 Baseline 01/01/99 Jump 5
001 Baseline 01/01/99 Reach 3
001 Week 6 02/12/99 Jump 7
001 Week 6 02/12/99 Reach 6
002 Baseline 03/01/99 Jump 2
002 Baseline 03/01/99 Reach 4
002 Week 6 04/12/99 Jump 5
002 Week 6 04/12/99 Reach 9
I would like to create a macro that generates the following for each subject:
Subject Visit Date (Days from Baseline) Test Score Change from Baseline Score
001 Baseline 01/01/99 Jump 5
01/01/99 Reach 3
001 Week 6 02/12/99 (42) Jump 7 +2
02/12/99 (42) Reach 6 +3
002 Baseline 03/01/99 Jump 2
03/01/99 Reach 4
002 Week 6 04/12/99 (42) Jump 5 +3
04/12/99 (42) Reach 9 +5
I believe I can just use the INTCK function for the Days from Baseline, but I'm not sure how to print out each test without retaining the 'Subject' and 'Visit' values in each row. Any help would be much appreciated.
You can sort by test and process using a retain for date and score for computing deltas. The print out can be done with Proc REPORT, formatting delta values appropriately.
Example:
data have;
* The & modifier lets Visit contain a single blank, so each Visit and Date value must be followed by at least two blanks;
input Subject Visit & $8. Date & mmddyy8. Test $ Score;
format date mmddyy8.;
datalines;
001 Baseline  01/01/99  Jump 5
001 Baseline  01/01/99  Reach 3
001 Week 6  02/12/99  Jump 7
001 Week 6  02/12/99  Reach 6
002 Baseline  03/01/99  Jump 2
002 Baseline  03/01/99  Reach 4
002 Week 6  04/12/99  Jump 5
002 Week 6  04/12/99  Reach 9
run;
proc sort data=have;
by subject test date;
run;
data for_report;
set have;
by subject test;
retain base_date base_score;
if first.subject then do;
base_date = .;
base_score = .;
end;
if first.test and visit='Baseline' then do;
base_date = date;
base_score = score;
end;
if not first.test then do;
delta_days = intck('days', date, base_date);
delta_score = score - base_score;
end;
run;
proc format;
picture plus low-0 = [best12.] other = '000000009' (prefix='+');
options missing=' ';
proc report data=for_report;
columns subject visit date delta_days test score delta_score;
define subject / order;
define visit / order order=data;
format delta_days negparen.;
format delta_score plus.;
run;
options missing='.';
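One detail in the step above is easy to miss: INTCK counts from its second argument to its third, so passing date first and base_date second gives a negative count, and the NEGPAREN format then displays that negative number in parentheses. A small sketch with the dates from the sample data (the variable names simply mirror the step above):

```sas
data _null_;
  base_date  = '01JAN1999'd;
  date       = '12FEB1999'd;
  /* counted from date back to base_date, so the result is -42 */
  delta_days = intck('day', date, base_date);
  put delta_days=;           /* raw value */
  put delta_days negparen8.; /* negative values are written in parentheses */
run;
```

Swapping the two date arguments would give a positive 42, which NEGPAREN would print without parentheses.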
An alternate report can be more subject-centric:
proc report data=for_report
style(lines) = [just=left fontweight=bold]
;
columns subject visit date delta_days test score delta_score;
define subject / order noprint;
define visit / order order=data;
format delta_days negparen.;
format delta_score plus.;
compute before subject;
subj = catx(' ', "Subject:", subject);
line subj $200.;
endcomp;
run;
Here is one way of doing it. The SQL-step calculates changes from baseline. The case-when-construct is only there to suppress zeroes in the output.
Printing using group-variables in proc report means Subject- and Visit-values are not retained on every line (but note that subject is not repeated each week).
I put the code in a macro, as that was the question. It doesn't really do much, however.
/* Creating test data*/
data testdata;
input Subject $3. @5 Visit $8. @17 Date mmddyy10. @28 Test $5. Score;
format date mmddyy10.;
datalines;
001 Baseline    01/01/99   Jump  5
001 Baseline    01/01/99   Reach 3
001 Week 6      02/12/99   Jump  7
001 Week 6      02/12/99   Reach 6
002 Baseline    03/01/99   Jump  2
002 Baseline    03/01/99   Reach 4
002 Week 6      04/12/99   Jump  5
002 Week 6      04/12/99   Reach 9
;
%macro baselines(dataset=);
/* Adding days from baseline and change from baseline. Please note that the first visit
must be denoted as exactly "Baseline"*/
proc sql;
create table changes as
select t1.*, case when t1.date-t2.date>0 then t1.date-t2.date else . end as days
"Days from baseline", case when t1.score-t2.score>0 then t1.score-t2.score else .
end as score_change "Change from Baseline"
from &dataset as t1 left join (select * from &dataset where visit="Baseline") as t2
on t1.subject=t2.subject and t1.test=t2.test
order by subject, visit, test;
/* Printing the dataset. The use of subject and visit as group variables keeps SAS from repeating the values*/
title "Changes based on the dataset &dataset";
proc report data=changes;
column subject visit days test score score_change;
define subject / group;
define visit / group;
run;
%mend;
%baselines(dataset=testdata)

create "pair key=value" file with SAS datastep

I have to create a JSON-style file from a dataset, but without a line break between variables.
All variables have to be on the same line.
I would like to have something like this:
ID1 "key1"="value1" "key2"="value2" .....
Each key is a column of a dataset.
I work with SAS 9.3 on UNIX.
Sample :
I have
ID Name Sex Age
123 jerome M 30
345 william M 26
456 ingrid F 25
I would like
123 "Name"="jerome" "sex"="M" "age"="30"
345 "Name"="william" "sex"="M" "age"="26"
456 "Name"="ingrid" "sex"="F" "age"="25"
Thanks
If your data looked like this...
Obs Name _NAME_ COL1
1 Alfred Name Alfred
2 Alfred Sex M
3 Alfred Age 14
4 Alfred Height 69
5 Alfred Weight 112.5
6 Alice Name Alice
7 Alice Sex F
8 Alice Age 13
9 Alice Height 56.5
10 Alice Weight 84
11 Barbara Name Barbara
12 Barbara Sex F
13 Barbara Age 13
14 Barbara Height 65.3
15 Barbara Weight 98
16 Carol Name Carol
17 Carol Sex F
18 Carol Age 14
19 Carol Height 62.8
20 Carol Weight 102.5
21 Henry Name Henry
22 Henry Sex M
23 Henry Age 14
24 Henry Height 63.5
25 Henry Weight 102.5
You could use code like this to write the value pairs. Assuming this is what you're talking about.
data _null_;
do until(last.name);
set class;
by name;
col1 = left(col1);
if first.name then put name @;
put _name_:$quote. +(-1) '=' col1:$quote. @;
end;
put;
run;
Alfred "Name"="Alfred" "Sex"="M" "Age"="14" "Height"="69" "Weight"="112.5"
Alice "Name"="Alice" "Sex"="F" "Age"="13" "Height"="56.5" "Weight"="84"
Barbara "Name"="Barbara" "Sex"="F" "Age"="13" "Height"="65.3" "Weight"="98"
Carol "Name"="Carol" "Sex"="F" "Age"="14" "Height"="62.8" "Weight"="102.5"
Henry "Name"="Henry" "Sex"="M" "Age"="14" "Height"="63.5" "Weight"="102.5"
NOTE: There were 25 observations read from the data set WORK.CLASS.
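The answer does not show how the long data set class was built. One way it could have been produced, and this is an assumption on my part, is by transposing the first five rows of SASHELP.CLASS by name:

```sas
/* Assumed source: the first five rows of SASHELP.CLASS, which are already in name order */
proc transpose data=sashelp.class(obs=5) out=class;
  by name;
  var _all_;  /* mixing character and numeric variables makes COL1 character */
run;

proc print data=class;
run;
```

When numeric and character variables are transposed together, the numeric values are converted to character, which is probably why the step above applies left() to col1 before writing it.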
Consider these non-transposing variations:
Actual JSON, use Proc JSON
data have;input
ID Name $ Sex $ Age; datalines;
123 jerome M 30
345 william M 26
456 ingrid F 25
run;
filename out temp;
proc json out=out;
export have;
run;
* What hath been wrought ?;
data _null_; infile out; input; put _infile_; run;
----- LOG -----
{"SASJSONExport":"1.0","SASTableData+HAVE":[{"ID":123,"Name":"jerome","Sex":"M","Age":30},{"ID":345,"Name":"william","Sex":"M","Age":26},{"ID":456,"Name":"ingrid","Sex":"F","Age":25}]}
A concise name-value pair output of the variables using the PUT statement specification syntax (variable-list) (format-list), using _ALL_ for the variable list and = for the format.
filename out2 temp;
data _null_;
set have;
file out2;
put (_all_) (=);
run;
data _null_;
infile out2; input; put _infile_;
run;
----- LOG -----
ID=123 Name=jerome Sex=M Age=30
ID=345 Name=william Sex=M Age=26
ID=456 Name=ingrid Sex=F Age=25
Iterate the variables using the VNEXT routine. Extract the formatted values using VVALUEX function, and conditionally construct the quoted name and value parts.
filename out3 temp;
data _null_;
set have;
file out3;
length _name_ $34 _value_ $32000;
do _n_ = 1 by 1;
call vnext(_name_);
if _name_ = "_name_" then leave;
if _n_ = 1
then _value_ = strip(vvaluex(_name_));
else _value_ = quote(strip(vvaluex(_name_)));
_name_ = quote(trim(_name_));
if _n_ = 1
then put _value_ @;
else put _name_ +(-1) '=' _value_ @;
end;
put;
run;
data _null_;
infile out3; input; put _infile_;
run;
----- LOG -----
123 "Name"="jerome" "Sex"="M" "Age"="30"
345 "Name"="william" "Sex"="M" "Age"="26"
456 "Name"="ingrid" "Sex"="F" "Age"="25"

SAS use a lookup dataset like array in another dataset

I have 1 data set with content description for a school
contents:
num description
content1 math
content2 spanish
content3 geography
content4 chemistry
content5 history
In another data set (students) I have the variables content1-content5, and I use a flag to indicate which contents each student has.
students
name age content1 content2 content3 content4 content5
BOB 15 1 1 1 1
BRYA 16
CARL 15 1 1
SUE 17 1 1 1
LOU 15 1
if i use a code like this:
data students1;
set students;
array content[5];
format allcontents $100.;
do i=1 to dim(content);
if content[i]=1 then do;
allcontents=cat(vname(content[i]),',',allcontents);
end;
end;
run;
the result is:
name age content1 content2 content3 content4 content5 allcontents
BOB 15 1 1 1 1 content1,content2,content3,content5,
BRYA 16
CARL 15 1 1 content2,content5,
SUE 17 1 1 1 content3,content4,content5,
LOU 15 1 content5
1) i want to use the name of the lookup table (data set contents) to use the name of the content and not the arrays names of content[1-5] in the variable allcontents. how can i do that?
2) and later i want the result by content description, not by student, like this:
description name age
math BOB 15
spanish BOB 15
geography BOB 15
history BOB 15
spanish CARL 15
history CARL 15
spanish SUE 17
chemistry SUE 17
history SUE 17
history LOU 15
is it possible?
thanks.
First, grab the %create_hash() macro from this post.
Use the hash table to look up the values.
data students1;
set students;
array content[5];
format num $32. description $16.;
if _n_ = 1 then do;
%create_hash(cnt,num,description,"contents");
end;
do i=1 to 5;
if content[i]=1 then do;
num = vname(content[i]);
rc = cnt.find();
output;
end;
end;
keep description name age;
run;
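The %create_hash() macro is not reproduced here, so the step above cannot run as shown. A sketch of the same lookup using the data step hash object directly, assuming the lookup data set is named contents with variables num and description as in the question, and that the values of num match the variable names content1-content5 exactly, including case:

```sas
data students1;
  set students;
  array content[5];                 /* refers to content1-content5 */
  length num $32 description $16;   /* lengths are assumptions */
  if _n_ = 1 then do;
    /* load the lookup table once: key = num, data = description */
    declare hash cnt(dataset: "contents");
    cnt.defineKey("num");
    cnt.defineData("description");
    cnt.defineDone();
    call missing(num, description);
  end;
  do i = 1 to dim(content);
    if content[i] = 1 then do;
      num = vname(content[i]);          /* e.g. content2 */
      if cnt.find() = 0 then output;    /* write a row only when the lookup succeeds */
    end;
  end;
  keep description name age;
run;
```

This writes one row per student and flagged content, which is the layout asked for in question 2.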
I find proc transpose suitable. Transposing once is enough for question 2), and transposing twice renames the variables content1-content5 (hence question 1). The key is the ID statement in proc transpose, which names the transposed variables after the values of the ID variable.
The code below should give you the desired answers (although the names are ordered alphabetically, which may not be the same as your original ordering).
/* original data sets */
data names;
input num $ description $;
cards;
content1 math
content2 spanish
content3 geography
content4 chemistry
content5 history
;run;
data students;
input name $ age content1 content2 content3 content4 content5;
cards;
BOB 15 1 1 1 . 1
BRYA 16 . . . . .
CARL 15 . 1 . . 1
SUE 17 . . 1 1 1
LOU 15 . . . . 1
;run;
/* transpose */
proc sort data=students out=tmp_sorted;
by name age;
run;
proc transpose data=tmp_sorted out=tmp_transposed;
by name age;
run;
/* merge the names of content1-5 */
* If you want to preserve ordering from contents1-contents5
* instead of alphabetical ordering of "description" column
* from a-z, do not drop the "num" column for further use.;
proc sql;
create table tmp_merged as
select B.description, A.name, A.age, B.num, A.COL1
from tmp_transposed as A
left join names as B
on A._NAME_=B.num
order by A.name, B.num;
quit;
/* transpose again */
proc transpose data=tmp_merged(drop=num) out=tmp_renamed(drop=_name_);
by name age;
ID description; *name the transposed variables;
run;
/* answer (1) */
data ans1;
set tmp_renamed;
array content[5] math--history;
format allcontents $100.;
do i=1 to dim(content);
* better use cats (cat does not seem to work);
if content[i]=1 then allcontents=cats(allcontents,',',vname(content[i]));
end;
*kill the leading comma;
allcontents=substr(allcontents,2,99);
run;
/* answer (2) */
data ans2(drop=num col1);
set tmp_merged;
where col1=1;
run;
*cleanup;
proc datasets lib=work nolist;
delete tmp_:;
quit;
