I have a folder with multiple files of different types (for example, 5 files are tab delimited, 5 are CSV files, 5 are pipe delimited, and so on). How can I import them all using SAS without importing each one separately?
Interesting question. If you can write logic that determines the delimiter by looking at the line, it should be easy. Let's make some example files.
%let path=%sysfunc(pathname(work));
data _null_;
set sashelp.class ;
if _n_ < 7 then file "&path/1_csv.txt" dsd dlm=',';
else if _n_ < 13 then file "&path/2_tab.txt" dsd dlm='09'x;
else file "&path/3_pipe.txt" dsd dlm='|' ;
put (_all_) (+0);
run;
Now let's try to read them. First read in the line and then set the DLM variable based on what you see. Then read the data.
data want ;
if 0 then set sashelp.class ;
length dlm $1.;
infile "&path/*.txt" dsd dlm=dlm truncover;
input @; /* hold the line so _INFILE_ can be inspected before it is parsed */
if indexc(_infile_,'09'x) then dlm='09'x;
else if indexc(_infile_,'|') then dlm='|';
else if indexc(_infile_,',') then dlm=',';
input name -- weight;
run;
Now let's compare the results with the original data. Note that, depending on your operating system, the files might not be read in the same order as they were generated, so you might need to sort before trying to compare.
proc compare data=want compare=sashelp.class; run;
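If the files do come back in a different order, the sort mentioned above could look something like the sketch below. It assumes NAME uniquely identifies each row (true for SASHELP.CLASS); the names want_sorted and class_sorted are made up for illustration, and the helper variable DLM is dropped so it does not show up as an unmatched variable.

```sas
/* Sort both copies by NAME so PROC COMPARE lines up matching rows */
proc sort data=want(drop=dlm) out=want_sorted;
  by name;
run;

proc sort data=sashelp.class out=class_sorted;
  by name;
run;

/* Compare the re-read data against the original */
proc compare base=class_sorted compare=want_sorted;
run;
```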
I'm trying to import a comma-delimited .dat file into SAS University Edition. However, one variable contains special characters (e.g. French accents). Most are replaced with �, and some observations are also read incorrectly.
Example of a problem:
An original observation in the data looks like this:
Crème Brûlée,105,280
Running the following command:
DATA BenAndJerrys;
INFILE '/folders/myfolders/HW3/BenAndJerrys.dat' DLM = ',' DSD MISSOVER;
INPUT flavor_name :$48. portion_size calories;
RUN;
It has this problem:
flavor_name=Cr�me Br�l�e,105 portion_size=280 calories=
As you can see, the value 105, which belongs to portion_size, is merged into flavor_name, and the value 280, which belongs to calories, is assigned to portion_size.
How can I solve this problem and get SAS to import the data with the special characters?
Try telling SAS what encoding to use when reading the file.
I copied and saved your sample line into a text file using Windows NOTEPAD editor.
%let path=C:\Downloads ;
data _null_;
infile "&path\test.txt" dsd encoding=wlatin1;
length x1-x3 $50 ;
input x1-x3;
put (_all_) (=);
run;
Result in the log.
x1=Crème Brûlée x2=105 x3=280
NOTE: 1 record was read from the infile "C:\Downloads\test.txt".
The minimum record length was 20.
The maximum record length was 20.
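Applied to the data step from the question, the same idea might look like the sketch below. It assumes the .dat file was saved in a Windows Latin-1 code page (wlatin1); if the file was written in some other encoding, that value would need to change.

```sas
data BenAndJerrys;
  /* ENCODING= tells SAS how the bytes in the file should be interpreted */
  /* (wlatin1 is an assumption about how the file was saved)             */
  infile '/folders/myfolders/HW3/BenAndJerrys.dat'
         dlm=',' dsd missover encoding='wlatin1';
  input flavor_name :$48. portion_size calories;
run;
```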
After importing my CSV data with GETNAMES=NO, I have 59 columns with variable names VAR1, VAR2, . . . VAR59. My first row contains the names I need for the new variables, but they first needed to be cleaned up by removing special characters and turning spaces into underscores, since SAS doesn't like spaces in variable names. This is the array I used for that piece:
DATA DATA1; SET DATA (FIRSTOBS=7);
ARRAY VAR(59) VAR1-VAR59;
IF _N_ = 1 THEN DO;
DO I = 1 TO 59;
VAR[I] = COMPRESS(TRANSLATE(TRIM(VAR[I]),'_',' '),'?()');
PUT VAR[I]=;
END;
END;
DROP I;
RUN;
This worked perfectly, but now I need to get this first row up to the new variable names. I tried a similar array to perform this:
DATA DATA2; SET DATA1;
ARRAY V(59) VAR1-VAR59;
DO I = 1 TO 59;
IF _N_ = 1 AND V[I] NE "" THEN CALL SYMPUT("NEWNAME",V[I]);
RENAME VAR[I] = &NEWNAME;
END;
DROP I;
RUN;
This only puts the name of VAR59 since there is no [i] connected to the &NEWNAME, and it still isn't working quite right. Any suggestions to moving a row up to variable names AFTER manipulation?
Your primary problem is you are trying to use a macro variable in the data step it's created in. You can't. You're also trying to create rename statements in the data step; rename, as with other similar statements (keep, drop), must be defined before the data step is compiled.
You need to write code somewhere - either in a text file, a macro variable, whatever - with this information. For example:
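filename renamef temp;
data _null_;
set myfile (obs=1);
file renamef;
array var[59];
length strput $200;
do _i = 1 to dim(var);
/* clean the name: spaces become underscores, then ?, ( and ) are removed */
var[_i] = compress(translate(trim(var[_i]),'_',' '),'?()');
/* write one statement per variable, e.g. "rename VAR1 = new_name ;" */
strput = catx(' ','rename',vname(var[_i]),'=',var[_i],';');
put strput;
end;
run;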
data want;
set myfile (firstobs=2);
%include renamef;
run;
There are lots of other examples of this on the site and on the web; "list processing" is the term for it.
Joe -- using your suggestions and another one of your posts, the following worked flawlessly:
Put the row of needed variables into long format (in my case, first row so n = 1)
DATA NEWVARS; SET DATA;
IF _N_ = 1 THEN OUTPUT NEWVARS;
RUN;
PROC TRANSPOSE DATA = NEWVARS OUT=NEWVARS1;
VAR _ALL_;
RUN;
Create a list of rename macro calls.
PROC SQL;
SELECT CATS('%RENAME(VAR=',_NAME_,',NEWVAR=',COL1,')')
INTO :RENAMELIST SEPARATED BY ' '
FROM NEWVARS1;
QUIT;
%MACRO RENAME(VAR=,NEWVAR=);
RENAME &VAR.=&NEWVAR.;
%MEND RENAME;
Call in the list created in Step 2 to rename all variables.
PROC DATASETS LIB=WORK NOLIST;
MODIFY DATA;
&RENAMELIST.;
QUIT;
I had to perform a few additional checks to make sure that the variable names were not longer than 32 characters, which was easy to check once the data was in long format after transposing. If certain words make the names too long, the TRANWRD function can easily replace them with abbreviations.
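The length check described above could be sketched roughly as follows. It assumes the transposed data set NEWVARS1 from the earlier step, with the proposed names in COL1; the names long_names and name_length are made up for illustration.

```sas
/* List any proposed names that exceed the 32-character limit */
data long_names;
  set newvars1;
  name_length = lengthn(strip(col1));
  if name_length > 32;              /* keep only the offending rows */
  keep _name_ col1 name_length;
run;

proc print data=long_names;
run;
```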
My source data contains 200,000+ observations, one of the many variables in the data set is "county." My goal is to write a macro that will take this one data set as an input, and split them into 58 different temporary data sets for each of the California counties.
First question is if it is possible to specify the 58 counties on the data statement using something like a global reference array defined beforehand.
Second question is, assuming the output data sets have been properly specified on the data statement, is it possible to use a do loop to choose the right data set to write to?
I can get the comparison to work properly, but cannot seem to use an array reference to specify an output data set. This is most likely because I need more experience with the macro environment!
Please see below for the simplistic skeleton framework I have written so far. c_long array contains the names of each of the counties, c_short array contains a 3 letter abbreviation for each of the counties. Thanks in advance!
data splitraw;
length county_name $15;
infile "&path/random.csv" dsd firstobs=2;
input county_name $ number;
run;
%macro _58countysplit(dxtosplit,countycol);
data <need to specify 58 data sets here named something like &dxtosplit_ALA, &dxtosplit_ALP, etc..>;
set &dxtosplit;
do i=1 to 58;
if c_long{i}=&countycol then output &dxtosplit._&c_short{i};
end;
run;
%mend _58countysplit;
%_58countysplit(splitraw,county_name);
The code you provided will need to run through the large dataset 58 times, each time writing a small one. I have done it a bit differently.
First I create a sample dataset with a variable "county" this will contain ten different values:
data large;
attrib county length=$12;
do i=1 to 10000;
county=put(mod(i,10)+1,ROMAN.);
output;
end;
run;
First, I start with finding all the unique values and constructing the names of all the different tables I would like to create:
proc sql noprint;
select distinct compbl("large_"!!county) into :counties separated by " "
from large;
quit;
Now I have a macro variable "counties" that contains the names of all the datasets I want to create.
Here I am writing the IF-statements to a file:
filename x temp;
data _null_;
attrib county length=$12 ds length=$18;
file x;
i=1;
do while(scan("&counties",i," ") ne "");
ds=scan("&counties",i," ");
county=scan(ds,-1,"_");
put "if county=""" county +(-1) """ then output " ds ";";
i+1;
end;
run;
Now I have what I need to create the small datasets:
data &counties;
set large;
%inc x;
run;
I agree with user667489: there is almost always a better way than splitting one large data set into many small data sets. However, if you want to proceed along these lines, there is a view in sashelp called vcolumn which lists all your libraries, their tables, and each column (in each table), and it should help you. Also, if you want
if c_long{i}=&countycol then output &dxtosplit._&c_short{i};
to resolve you might mean:
if c_long{i}=&countycol then output &&dxtosplit._&c_short{i};
It's likely, depending upon what you're actually trying to do, that BY processing is all you need. Nevertheless, here is a simple solution:
%macro split_by(data=, splitvar=);
%local dslist iflist;
proc sql noprint;
select distinct cats("&splitvar._", &splitvar)
into :dslist separated by ' '
from &data;
select distinct
catt("if &splitvar='", &splitvar, "' then output &splitvar._", &splitvar, ";", '0A'x)
into :iflist separated by "else "
from &data;
quit;
data &dslist;
set &data;
&iflist
run;
%mend split_by;
Here is some test data to illustrate:
options mprint;
data test;
length county $1 val $2;
input county val;
infile cards;
datalines;
A 2
B 3
A 5
C 8
C 9
D 10
run;
%split_by(data=test, splitvar=county)
And you can view the log to see how the macro generates the DATA step you want:
MPRINT(SPLIT_BY): proc sql noprint;
MPRINT(SPLIT_BY): select distinct cats("county_", county) into :dslist separated by ' ' from test;
MPRINT(SPLIT_BY): select distinct catt("if county='", county, "' then output county_", county, ";", '0A'x) into :iflist separated
by "else " from test;
MPRINT(SPLIT_BY): quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
MPRINT(SPLIT_BY): data county_A county_B county_C county_D;
MPRINT(SPLIT_BY): set test;
MPRINT(SPLIT_BY): if county='A' then output county_A;
MPRINT(SPLIT_BY): else if county='B' then output county_B;
MPRINT(SPLIT_BY): else if county='C' then output county_C;
MPRINT(SPLIT_BY): else if county='D' then output county_D;
MPRINT(SPLIT_BY): run;
NOTE: There were 6 observations read from the data set WORK.TEST.
NOTE: The data set WORK.COUNTY_A has 2 observations and 2 variables.
NOTE: The data set WORK.COUNTY_B has 1 observations and 2 variables.
NOTE: The data set WORK.COUNTY_C has 2 observations and 2 variables.
NOTE: The data set WORK.COUNTY_D has 1 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.05 seconds
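For comparison, the BY-processing alternative mentioned above could be sketched like this. The data set names by_demo and by_demo_sorted are made up for illustration, and VAL is read as numeric here so that it can be summarized; this is one possible shape, not the only one.

```sas
/* Small illustrative data set: one row per observation */
data by_demo;
  length county $1;
  input county val;
  datalines;
A 2
B 3
A 5
C 8
C 9
D 10
;
run;

/* BY processing needs the data sorted (or indexed) by the BY variable */
proc sort data=by_demo out=by_demo_sorted;
  by county;
run;

/* One block of output per county, with no need to split the data */
proc means data=by_demo_sorted n sum;
  by county;
  var val;
run;
```

When only a single county is needed, a WHERE= data set option such as by_demo(where=(county='A')) is another way to avoid creating separate data sets.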
I'm trying to read a comma delimited .txt file (called 'file.txt' in the code below) into SAS in order to create a permanent database that includes only some of the variables and observations.
Here's a snippet of the .txt file for reference:
SUMLEV,REGION,DIVISION,STATE,NAME,POPESTIMATE2013,POPEST18PLUS2013,PCNT_POPEST18PLUS
10,0,0,0,United States,316128839,242542967,76.7
40,3,6,1,Alabama,4833722,3722241,77
40,4,9,2,Alaska,735132,547000,74.4
40,4,8,4,Arizona,6626624,5009810,75.6
40,3,7,5,Arkansas,2959373,2249507,76
My (abbreviated) code is as follows:
options nocenter nodate ls=72 ps=58;
filename foldr1 'C:\Users\redacted\Desktop\file.txt';
libname foldr2 'C:\Users\redacted\Desktop\Data';
libname foldr3 'C:\Users\redacted\Desktop\Formats';
options fmtsearch=(foldr3.bf_fmts);
proc format library=foldr3.bf_fmts;
[redacted]
run;
data foldr2.file;
infile foldr1 DLM=',' firstobs=2 obs=52;
input STATE $ NAME $ REGION $ POPESTIMATE2013;
PERCENT=POPESTIMATE2013/316128839;
format REGION $regfmt.;
run;
proc print data=foldr2.file;
sum POPESTIMATE2013 PERCENT;
title 'Title';
run;
In my INPUT statement, I list the variables that I want to include in my new truncated database (STATE, NAME, REGION, etc.).
When I print my truncated database, I notice that all of my INPUT variables do not correspond to the same variables in the original file.
Instead my variables print out like this:
STATE (1st var listed in INPUT) printed as SUMLEV (1st var listed in .txt file)
NAME (2nd var listed in INPUT) printed as REGION (2nd var listed in .txt file)
REGION (3rd " " " ") printed as DIVISION (3rd " " " ")
POPESTIMATE2013 (4th " " " ") printed as STATE (4th " " " ")
It seems that SAS is matching my INPUT variables based on order, not on name. So, because I list STATE first in my INPUT statement, SAS prints out the first variable of the original .txt file (i.e., the SUMLEV variable).
Any idea what's wrong with my code? Thanks for your help!
Your current code is reading in the first 4 values from each line of the CSV file and assigning them to columns with the names you have listed.
The input statement lists all the columns you want to read in (and where to read them from); it does not search for named columns within the input file.
The code below should produce the output you want. The keep statement lists the columns that you want in the output.
data foldr2.file;
infile foldr1 dlm = "," firstobs = 2 obs = 52;
/* Prevent truncating the name variable */
informat NAME $20.;
/* Name each of the columns; REGION is read as character so that $regfmt. can be applied */
input SUMLEV REGION $ DIVISION STATE NAME $ POPESTIMATE2013 POPEST18PLUS2013 PCNT_POPEST18PLUS;
/* Keep only the columns you want */
keep STATE NAME REGION POPESTIMATE2013 PERCENT;
PERCENT = POPESTIMATE2013/316128839;
format REGION $regfmt.;
run;
For a slightly more involved solution see Joe's excellent answer here. Applying this approach to your data will require setting the lengths of your columns in advance and converting character values to numeric.
data foldr2.file;
infile foldr1 dlm = "," firstobs = 2 obs = 52;
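/* NAME is $20 so that longer names such as "District of Columbia" are not truncated */
length STATE 8 NAME $20 REGION $8 POPESTIMATE2013 8;
/* read the line into _INFILE_ without parsing it */
input @;
STATE = input(scan(_INFILE_, 4, ','), best.);
NAME = scan(_INFILE_, 5, ',');
/* REGION stays character so that the $regfmt. format can be applied */
REGION = scan(_INFILE_, 2, ',');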
POPESTIMATE2013 = input(scan(_INFILE_, 6, ','), best.);
PERCENT = POPESTIMATE2013/316128839;
format REGION $regfmt.;
run;
If you are looking to become more familiar with SAS it would be worth your while to take a look at the SAS documentation for reading files.
Your current data step is telling SAS what to name the first four variables in the txt file. To do what you want, you need to list all of the variables in the txt file in your "input" statement. Then, in your data statement, use the keep= option to select the variables you want to be included in the output dataset.
data foldr2.file (keep=STATE NAME REGION POPESTIMATE2013 PERCENT);
infile foldr1 DLM=',' firstobs=2 obs=52;
input
SUMLEV
REGION $
DIVISION
STATE $
NAME $
POPESTIMATE2013
POPEST18PLUS2013
PCNT_POPEST18PLUS;
PERCENT=POPESTIMATE2013/316128839;
format REGION $regfmt.;
run;
I have been trying to export a SAS data set with 49 variables. Each of these variables can potentially be 32767 characters long. I want to write this data set to a txt file, but SAS limits me with the lrecl option at 32767 characters. Is there a way to do this? I tried using the data step.
data _null_;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
%let _EFIREC_ = 0; /* clear export record count macro variable */
file 'C:path\TEST.txt';
if _n_ = 1 then do;
put "<BLAH>"
;
end;
set WORK.SAS_DATASET end=EFIEOD;
format raw1 $32767. ;
format raw2 $32767. ;
etc...
do;
EFIOUT + 1;
put raw1 $ @;
put raw2 $ @;
etc...
;
end;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
if EFIEOD then
do;
put "</BLAH>"
;
call symputx('_EFIREC_',EFIOUT);
end;
run;
Sure. You just need to specify the LRECL yourself.
filename test temp;
data _null_;
set sashelp.class;
file test lrecl=999999;
put
@1 name $32767.
@32768 sex $32767.
@65535 age 8.
;;;;
run;
Some OSs might limit your logical record length, but it's at least 1e6 in Windows so you should be okay.