SAS INPUT DATA WITH SPECIAL CHARACTERS

I'm trying to import a .dat file (comma delimited) into SAS University Edition. However, one variable contains special characters (e.g. French accents). Most of them are replaced with �, and some observations also end up with their values shifted into the wrong columns.
Example of a problem:
An original observation in the data looks like this:
Crème Brûlée,105,280
Running the following command:
DATA BenAndJerrys;
INFILE '/folders/myfolders/HW3/BenAndJerrys.dat' DLM = ',' DSD MISSOVER;
INPUT flavor_name :$48. portion_size calories;
RUN;
It has this problem:
flavor_name=Cr�me Br�l�e,105 portion_size=280 calories=
As you can see, the value 105, which belongs to portion_size, is merged into the value of flavor_name; the value 280, which belongs to calories, is assigned to portion_size; and calories ends up missing.
How can I solve this problem and get SAS to import the data with the special characters intact?

Try telling SAS what encoding to use when reading the file.
I copied and saved your sample line into a text file using Windows NOTEPAD editor.
%let path=C:\Downloads ;
data _null_;
infile "&path\test.txt" dsd encoding=wlatin1;
length x1-x3 $50 ;
input x1-x3;
put (_all_) (=);
run;
Result in the log.
x1=Crème Brûlée x2=105 x3=280
NOTE: 1 record was read from the infile "C:\Downloads\test.txt".
The minimum record length was 20.
The maximum record length was 20.
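Applied back to the original question, a minimal sketch (assuming the .dat file was saved in a Windows Latin-1 encoding; if it is actually UTF-8, use ENCODING='utf-8' instead):
DATA BenAndJerrys;
INFILE '/folders/myfolders/HW3/BenAndJerrys.dat' DLM = ',' DSD MISSOVER
ENCODING='wlatin1'; /* assumption: adjust to match how the file was actually saved */
INPUT flavor_name :$48. portion_size calories;
RUN;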


SAS programming problem in using data delimiter *

I'm trying to define a data set from the data in the products.txt file, using * as the delimiter.
products.txt data:
hartie 2 birotica
creione 10 birotica
apa 6 alimente
ceai 8 alimente
tricou 100 haine
I tried to use the delimiter *:
data produse;
infile '/home/u47505185/produse.txt' dlm='*';
input Nume $ Pret Categorie $;
run;
The DSD option changes the delimiter from a space to a comma. I want the option that changes it from a space to *.
The DSD option, in addition to the other things it does, changes the DEFAULT delimiter from space to comma. But you can override the default delimiter to any list of characters you want by using the DLM= (also known as DELIMITER=) option, whether or not you are using the DSD option.
From the comments it sounds like you just want to do text manipulation. Just change the spaces to stars. Make sure to remove any trailing spaces (unless you want those to also become stars).
data _null_;
infile '/home/u47505185/produse.txt';
input;
file '/home/u47505185/produse_star.txt';
_infile_=translate(trimn(_infile_),'*',' ');
put _infile_;
run;
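Once the file has been rewritten with stars, a minimal sketch of reading it back with DLM='*' (the informat lengths here are just assumptions):
data produse;
infile '/home/u47505185/produse_star.txt' dlm='*' truncover;
input Nume :$20. Pret Categorie :$20.;
run;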
To display missing numeric values as an asterisk (*) in output or data viewers, use this setting:
OPTIONS MISSING='*';
The INFILE DLM= option is for specifying what character(s) in the data file are to be used to separate the variables being INPUT.
DLM does NOT specify a replacement value for missing values.
You told SAS to use * as a field separator.
So what is happening? The log will tell you. Essentially, Nume was read as an 8-character variable (the default length) and the delimiter * never appeared in the data. So Pret, a numeric variable, had nothing left to read and was assigned a missing value. When viewed in output or a data viewer, that value appears as a period (.).
data want;
infile datalines dlm='*'; * '/home/u47505185/produse.txt' dlm='*';
input Nume $ Pret Categorie $;
datalines;
hartie 2 birotica
creione 10 birotica
apa 6 alimente
ceai 8 alimente
tricou 100 haine
;
Log
25 data want;
26 infile datalines dlm='*'; * '/home/u47505185/produse.txt' dlm='*';
27 input Nume $ Pret Categorie $;
28 datalines;
NOTE: Invalid data for Pret in line 30 1-80.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+--
31 apa 6 alimente
NOTE: Invalid data errors for file CARDS occurred outside the printed range.
NOTE: Increase available buffer lines with the INFILE n= option.
Nume=hartie 2 Pret=. Categorie=apa 6 al _ERROR_=1 _N_=1
NOTE: Invalid data for Pret in line 33 1-80.
NOTE: LOST CARD.
34 ;
NOTE: Invalid data errors for file CARDS occurred outside the printed range.
NOTE: Increase available buffer lines with the INFILE n= option.
Nume=ceai 8 a Pret=. Categorie= _ERROR_=1 _N_=2
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.WANT has 1 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds
By default, what is shown to you when a value is missing?
Numeric variables: a period (.), or the current setting of the session option MISSING="<one-char>".
Character variables: a blank, because the missing value for character variables is a single space.
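A quick sketch to see both defaults in the log:
data _null_;
x = .;   /* numeric missing value */
c = ' '; /* character missing value (a single blank) */
put x= c=;
run;
The log line reads x=. c= , i.e. a period for the numeric missing value and a blank for the character one.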

How to import multiple different delimited files in sas?

I have a folder with multiple files, each in a different format (say, for example: 5 files are tab delimited, 5 are CSV files, 5 are pipe delimited, and so on). How can I import them using SAS? (I don't want to import them separately.)
Interesting question. If you can write logic to determine the delimiter by looking at the line, then it should be easy. Let's make some example files.
%let path=%sysfunc(pathname(work));
data _null_;
set sashelp.class ;
if _n_ < 7 then file "&path/1_csv.txt" dsd dlm=',';
else if _n_ < 13 then file "&path/2_tab.txt" dsd dlm='09'x;
else file "&path/3_pipe.txt" dsd dlm='|' ;
put (_all_) (+0);
run;
Now let's try to read them. First read in the line and then set the DLM variable based on what you see. Then read the data.
data want ;
if 0 then set sashelp.class ;
length dlm $1.;
infile "&path/*.txt" dsd dlm=dlm truncover;
input #;
if indexc(_infile_,'09'x) then dlm='09'x;
else if indexc(_infile_,'|') then dlm='|';
else if indexc(_infile_,',') then dlm=',';
input name -- weight;
run;
Now let's compare the results with the original data. Note: depending on your operating system, the files might not be read in the same order as they were generated, so you might need to sort before trying to compare.
proc compare data=want compare=sashelp.class; run;
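If the observations do come back in a different order, a sketch of sorting both sides before the comparison (Name happens to be unique in sashelp.class, so it works as a key here):
proc sort data=want; by name; run;
proc sort data=sashelp.class out=class; by name; run;
proc compare data=want compare=class; run;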

How can I "define" SAS data sets using a macro variable and write to them using an array

My source data contains 200,000+ observations; one of the many variables in the data set is "county." My goal is to write a macro that will take this one data set as input and split it into 58 different temporary data sets, one for each of the California counties.
First question is if it is possible to specify the 58 counties on the data statement using something like a global reference array defined beforehand.
Second question is, assuming the output data sets have been properly specified on the data statement, is it possible to use a do loop to choose the right data set to write to?
I can get the comparison to work properly, but cannot seem to use an array reference to specify an output data set. This is most likely because I need more experience with the macro environment!
Please see below for the simplistic skeleton framework I have written so far. c_long array contains the names of each of the counties, c_short array contains a 3 letter abbreviation for each of the counties. Thanks in advance!
data splitraw;
length county_name $15;
infile "&path/random.csv" dsd firstobs=2;
input county_name $ number;
run;
%macro _58countysplit(dxtosplit,countycol);
data <need to specify 58 data sets here named something like &dxtosplit_ALA, &dxtosplit_ALP, etc..>;
set &dxtosplit;
do i=1 to 58;
if c_long{i}=&countycol then output &dxtosplit._&c_short{i};
end;
run;
%mend _58countysplit;
%_58countysplit(splitraw,county_name);
The code you provided will need to run through the large dataset 58 times, each time writing a small one. I have done it a bit differently.
First I create a sample dataset with a variable "county" that contains ten different values:
data large;
attrib county length=$12;
do i=1 to 10000;
county=put(mod(i,10)+1,ROMAN.);
output;
end;
run;
Next, I find all the unique values and construct the names of all the different tables I would like to create:
proc sql noprint;
select distinct compbl("large_"!!county) into :counties separated by " "
from large;
quit;
Now I have a macro variable "counties" that contains the names of all the data sets I want to create.
Here I am writing the IF-statements to a file:
filename x temp;
data _null_;
attrib county length=$12 ds length=$18;
file x;
i=1;
do while(scan("&counties",i," ") ne "");
ds=scan("&counties",i," ");
county=scan(ds,-1,"_");
put "if county=""" county +(-1) """ then output " ds ";";
i+1;
end;
run;
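For this sample data the generated file holds one test per data set, along these lines (a sketch; the Roman-numeral values come from the ROMAN. format used above):
if county="I" then output large_I ;
if county="II" then output large_II ;
...
if county="X" then output large_X ;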
Now I have what I need to create the small datasets:
data &counties;
set large;
%inc x;
run;
I agree with user667489, there is almost always a better way than splitting one large data set into many small data sets. However, if you want to proceed along these lines, there is a table in sashelp called vcolumn which lists all your libraries, their tables, and each column (in each table) that should help you. Also, if you want
if c_long{i}=&countycol then output &dxtosplit._&c_short{i};
to resolve you might mean:
if c_long{i}=&countycol then output &&dxtosplit._&c_short{i};
It's likely, depending upon what you're actually trying to do, that BY processing is all you need. Nevertheless, here is a simple solution:
%macro split_by(data=, splitvar=);
%local dslist iflist;
proc sql noprint;
select distinct cats("&splitvar._", &splitvar)
into :dslist separated by ' '
from &data;
select distinct
catt("if &splitvar='", &splitvar, "' then output &splitvar._", &splitvar, ";", '0A'x)
into :iflist separated by "else "
from &data;
quit;
data &dslist;
set &data;
&iflist
run;
%mend split_by;
Here is some test data to illustrate:
options mprint;
data test;
length county $1 val $1;
input county val;
infile cards;
datalines;
A 2
B 3
A 5
C 8
C 9
D 10
run;
%split_by(data=test, splitvar=county)
And you can view the log to see how the macro generates the DATA step you want:
MPRINT(SPLIT_BY): proc sql noprint;
MPRINT(SPLIT_BY): select distinct cats("county_", county) into :dslist separated by ' ' from test;
MPRINT(SPLIT_BY): select distinct catt("if county='", county, "' then output county_", county, ";", '0A'x) into :iflist separated
by "else " from test;
MPRINT(SPLIT_BY): quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
MPRINT(SPLIT_BY): data county_A county_B county_C county_D;
MPRINT(SPLIT_BY): set test;
MPRINT(SPLIT_BY): if county='A' then output county_A;
MPRINT(SPLIT_BY): else if county='B' then output county_B;
MPRINT(SPLIT_BY): else if county='C' then output county_C;
MPRINT(SPLIT_BY): else if county='D' then output county_D;
MPRINT(SPLIT_BY): run;
NOTE: There were 6 observations read from the data set WORK.TEST.
NOTE: The data set WORK.COUNTY_A has 2 observations and 2 variables.
NOTE: The data set WORK.COUNTY_B has 1 observations and 2 variables.
NOTE: The data set WORK.COUNTY_C has 2 observations and 2 variables.
NOTE: The data set WORK.COUNTY_D has 1 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.05 seconds
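As mentioned above, BY-group processing often removes the need to split at all. A minimal sketch on the same test data, producing one summary row per county instead of one data set per county:
proc sort data=test; by county; run;
data county_counts;
set test;
by county;
if first.county then n = 0; /* reset the counter at the start of each county */
n + 1;                      /* sum statement, so n is retained across rows */
if last.county then output; /* write one row per county */
run;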

How to create a truncated permanent database from a larger file in SAS [duplicate]

This question already has answers here:
Read specific columns of a delimited file in SAS
(3 answers)
Closed 8 years ago.
I'm trying to read a comma delimited .txt file (called 'file.txt' in the code below) into SAS in order to create a permanent database that includes only some of the variables and observations.
Here's a snippet of the .txt file for reference:
SUMLEV,REGION,DIVISION,STATE,NAME,POPESTIMATE2013,POPEST18PLUS2013,PCNT_POPEST18PLUS
10,0,0,0,United States,316128839,242542967,76.7
40,3,6,1,Alabama,4833722,3722241,77
40,4,9,2,Alaska,735132,547000,74.4
40,4,8,4,Arizona,6626624,5009810,75.6
40,3,7,5,Arkansas,2959373,2249507,76
My (abbreviated) code is as follows:
options nocenter nodate ls=72 ps=58;
filename foldr1 'C:\Users\redacted\Desktop\file.txt';
libname foldr2 'C:\Users\redacted\Desktop\Data';
libname foldr3 'C:\Users\redacted\Desktop\Formats';
options fmtsearch=(FMTfoldr.bf_fmts);
proc format library=foldr3.bf_fmts;
[redacted]
run;
data foldr2.file;
infile foldr1 DLM=',' firstobs=2 obs=52;
input STATE $ NAME $ REGION $ POPESTIMATE2013;
PERCENT=POPESTIMATE2013/316128839;
format REGION $regfmt.;
run;
proc print data=foldr2.file;
sum POPESTIMATE2013 PERCENT;
title 'Title';
run;
In my INPUT statement, I list the variables that I want to include in my new truncated database (STATE, NAME, REGION, etc.).
When I print my truncated database, I notice that all of my INPUT variables do not correspond to the same variables in the original file.
Instead my variables print out like this:
STATE (1st var listed in INPUT) printed as SUMLEV (1st var listed in the .txt file)
NAME (2nd var listed in INPUT) printed as REGION (2nd var listed in the .txt file)
REGION (3rd var listed in INPUT) printed as DIVISION (3rd var listed in the .txt file)
POPESTIMATE2013 (4th var listed in INPUT) printed as STATE (4th var listed in the .txt file)
It seems that SAS is matching my INPUT variables based on order, not on name. So, because I list STATE first in my INPUT statement, SAS prints out the first variable of the original .txt file (i.e., the SUMLEV variable).
Any idea what's wrong with my code? Thanks for your help!
Your current code is reading in the first 4 values from each line of the CSV file and assigning them to columns with the names you have listed.
The input statement lists all the columns you want to read in (and where to read them from); it does not search for named columns within the input file.
The code below should produce the output you want. The keep statement lists the columns that you want in the output.
data foldr2.file;
infile foldr1 dlm = "," firstobs = 2 obs = 52;
/* Prevent truncating the name variable */
informat NAME $20.;
/* Name each of the columns; REGION is read as character so that the $regfmt. format applies */
input SUMLEV REGION $ DIVISION STATE NAME $ POPESTIMATE2013 POPEST18PLUS2013 PCNT_POPEST18PLUS;
/* Keep only the columns you want */
keep STATE NAME REGION POPESTIMATE2013 PERCENT;
PERCENT = POPESTIMATE2013/316128839;
format REGION $regfmt.;
run;
For a slightly more involved solution see Joe's excellent answer here. Applying this approach to your data will require setting the lengths of your columns in advance and converting character values to numeric.
data foldr2.file;
infile foldr1 dlm = "," firstobs = 2 obs = 52;
length STATE 8. NAME $13. REGION $8. POPESTIMATE2013 8.;
input #;
STATE = input(scan(_INFILE_, 4, ','), best.);
NAME = scan(_INFILE_, 5, ',');
REGION = scan(_INFILE_, 2, ','); /* kept as character so the $regfmt. format applies */
POPESTIMATE2013 = input(scan(_INFILE_, 6, ','), best.);
PERCENT = POPESTIMATE2013/316128839;
format REGION $regfmt.;
run;
If you are looking to become more familiar with SAS it would be worth your while to take a look at the SAS documentation for reading files.
Your current data step is telling SAS what to name the first four variables in the txt file. To do what you want, you need to list all of the variables in the txt file in your "input" statement. Then, in your data statement, use the keep= option to select the variables you want to be included in the output dataset.
data foldr2.file (keep=STATE NAME REGION POPESTIMATE2013 PERCENT);
infile foldr1 DLM=',' firstobs=2 obs=52;
input
SUMLEV
REGION $
DIVISION
STATE $
NAME $
POPESTIMATE2013
POPEST18PLUS2013
PCNT_POPEST18PLUS;
PERCENT=POPESTIMATE2013/316128839;
format REGION $regfmt.;
run;

Reading a text file in MATLAB line by line

I have a CSV file. I want to read this file and do some pre-calculations on each row to see, for example, whether that row is useful for me or not, and if so, save it to a new CSV file.
Can someone give me an example?
In more detail, this is what my data looks like (string,float,float); the numbers are coordinates.
ABC,51.9358183333333,4.183255
ABC,51.9353866666667,4.1841
ABC,51.9351716666667,4.184565
ABC,51.9343083333333,4.186425
ABC,51.9343083333333,4.186425
ABC,51.9340916666667,4.18688333333333
Basically, I want to save the rows whose distance is 50 or more to a new file. The string field should also be copied.
Thanks.
You could actually use xlsread to accomplish this. After first placing your sample data above in a file 'input_file.csv', here is an example for how you can get the numeric values, text values, and the raw data in the file from the three outputs from xlsread:
>> [numData,textData,rawData] = xlsread('input_file.csv')
numData = % An array of the numeric values from the file
51.9358 4.1833
51.9354 4.1841
51.9352 4.1846
51.9343 4.1864
51.9343 4.1864
51.9341 4.1869
textData = % A cell array of strings for the text values from the file
'ABC'
'ABC'
'ABC'
'ABC'
'ABC'
'ABC'
rawData = % All the data from the file (numeric and text) in a cell array
'ABC' [51.9358] [4.1833]
'ABC' [51.9354] [4.1841]
'ABC' [51.9352] [4.1846]
'ABC' [51.9343] [4.1864]
'ABC' [51.9343] [4.1864]
'ABC' [51.9341] [4.1869]
You can then perform whatever processing you need to on the numeric data, then resave a subset of the rows of data to a new file using xlswrite. Here's an example:
index = sqrt(sum(numData.^2,2)) >= 50; % Find the rows where the point is
% at a distance of 50 or greater
% from the origin
xlswrite('output_file.csv',rawData(index,:)); % Write those rows to a new file
If you really want to process your file line by line, a solution might be to use fgetl:
1. Open the data file with fopen
2. Read the next line into a character array using fgetl
3. Retrieve the data you need using sscanf on the character array you just read
4. Perform any relevant test
5. Output what you want to another file
6. Back to point 2 if you haven't reached the end of your file.
Unlike the previous answer, this is not very much in the style of Matlab but it might be more efficient on very large files.
Hope this will help.
You cannot read text strings with csvread.
Here is another solution:
fid1 = fopen('test.csv','r'); %# open csv file for reading
fid2 = fopen('new.csv','w'); %# open new csv file
while ~feof(fid1)
line = fgets(fid1); %# read line by line
A = sscanf(line,'%*[^,],%f,%f'); %# sscanf can read only numeric data :(
if A(2)<4.185 %# test the values
fprintf(fid2,'%s',line); %# write the line to the new file
end
end
fclose(fid1);
fclose(fid2);
Just read it into MATLAB in one block
fid = fopen('file.csv');
data=textscan(fid,'%s %f %f','delimiter',',');
fclose(fid);
You can then process it using logical addressing
ind50 = data{2}>=50 ;
ind50 is then an index of the rows where column 2 is greater than or equal to 50. So
data{1}(ind50)
will list all the strings for the rows of interest.
Then just use fprintf to write out your data to the new file
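A sketch of that final write step, assuming the data and ind50 variables from the textscan example above (the output file name is arbitrary):
fid = fopen('new_file.csv','w');  % open the output file for writing
rows = find(ind50);               % row indices where column 2 is >= 50
for k = rows'                     % loop over the matching rows
fprintf(fid,'%s,%f,%f\n', data{1}{k}, data{2}(k), data{3}(k));
end
fclose(fid);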
Here is the documentation for reading a CSV: http://www.mathworks.com/access/helpdesk/help/techdoc/ref/csvread.html
and for writing one: http://www.mathworks.com/access/helpdesk/help/techdoc/ref/csvwrite.html
EDIT
An example that works:
file.csv:
1,50,4.1
2,49,4.2
3,30,4.1
4,71,4.9
5,51,4.5
6,61,4.1
The code:
File = csvread('file.csv')
[m,n] = size(File)
index=1
temp=0
for i = 1:m
if (File(i,2)>=50)
temp = temp + 1
end
end
Matrix = zeros(temp, 3)
for j = 1:m
if (File(j,2)>=50)
Matrix(index,1) = File(j,1)
Matrix(index,2) = File(j,2)
Matrix(index,3) = File(j,3)
index = index + 1
end
end
csvwrite('outputFile.csv',Matrix)
and the resulting output file:
1,50,4.1
4,71,4.9
5,51,4.5
6,61,4.1
This probably isn't the best solution, but it works! We can read the CSV file, check the value in each row, and save the rows that pass the test to a new file.
Hope it will help!
