I'm trying to define, using the data in the products.txt file, a data set with the delimiter *.
products.txt data:
hartie 2 birotica
creione 10 birotica
apa 6 alimente
ceai 8 alimente
tricou 100 haine
I tried to use the delimiter *:
data produse;
infile '/home/u47505185/produse.txt' dlm='*';
input Nume $ Pret Categorie $;
run;
dsd command is changing space into , . i want the command for changing space into *
The DSD option, in addition to the other things it does, changes the DEFAULT delimiter from space to comma. But you can override the default delimiter to any list of characters you want by using the DLM= (also known as DELIMITER=) option, whether or not you are using the DSD option.
From the comments it sounds like you just want to do text manipulation. Just change the spaces to stars. Make sure to remove any trailing spaces (unless you want those to also become stars).
data _null_;
infile '/home/u47505185/produse.txt';
input;
file '/home/u47505185/produse_star.txt';
_infile_=translate(trimn(_infile_),'*',' ');
put _infile_;
run;
To display missing numeric values as an asterik (*), in output or data viewers, use this setting
OPTIONS MISSING='*';
The INFILE DLM= option is for specifying what character(s) in the data file are to be used to separate the variables being INPUT.
DLM does NOT specify a replacement value for missing values.
You told SAS to use * as a field separator.
So what is happening ? The LOG will tell you. Essentially Nume was read as a 8 character variable (default length) and the delimiter never appeared. So, Pret, a numeric variable, had nothing to be read-in from and was assigned a missing value. When viewed in output or data viewer, the value appears as a ..
data want;
infile datalines dlm='*'; * '/home/u47505185/produse.txt' dlm='*';
input Nume $ Pret Categorie $;
datalines;
hartie 2 birotica
creione 10 birotica
apa 6 alimente
ceai 8 alimente
tricou 100 haine
;
Log
25 data want;
26 infile datalines dlm='*'; * '/home/u47505185/produse.txt' dlm='*';
27 input Nume $ Pret Categorie $;
28 datalines;
NOTE: Invalid data for Pret in line 30 1-80.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+--
31 apa 6 alimente
NOTE: Invalid data errors for file CARDS occurred outside the printed range.
NOTE: Increase available buffer lines with the INFILE n= option.
Nume=hartie 2 Pret=. Categorie=apa 6 al _ERROR_=1 _N_=1
NOTE: Invalid data for Pret in line 33 1-80.
NOTE: LOST CARD.
34 ;
NOTE: Invalid data errors for file CARDS occurred outside the printed range.
NOTE: Increase available buffer lines with the INFILE n= option.
Nume=ceai 8 a Pret=. Categorie= _ERROR_=1 _N_=2
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.WANT has 1 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds
By default, what is shown to you when a value is missing?
Numeric variables, . or the current setting for session option MISSING="<one-char>"
Character variables, . The missing value for character variables is a single space.
Related
I´m trying to import some dat file (comma delimited) to SAS University. However, one variable contains special characters (e.g. french accents). Most are replaced with �, but also some observations have some problems.
Example of a problem:
An original observation in the data looks like this:
Crème Brûlée,105,280
Running the following command:
DATA BenAndJerrys;
INFILE '/folders/myfolders/HW3/BenAndJerrys.dat' DLM = ',' DSD MISSOVER;
INPUT flavor_name :$48. portion_size calories;
RUN;
It has this problem:
flavor_name=Cr�me Br�l�e,105 portion_size=280 calories=
as you can see the value 105 which is the value of portion_size is merged with the value of flavor_name, and the value 280 of calories is assigned to portion_size.
How can solve this problem and allow SAS to import the data with the special characters?
Try telling SAS what encoding to use when reading the file.
I copied and saved your sample line into a text file using Windows NOTEPAD editor.
%let path=C:\Downloads ;
data _null_;
infile "&path\test.txt" dsd encoding=wlatin1;
length x1-x3 $50 ;
input x1-x3;
put (_all_) (=);
run;
Result in the log.
x1=Crème Brûlée x2=105 x3=280
NOTE: 1 record was read from the infile "C:\Downloads\test.txt".
The minimum record length was 20.
The maximum record length was 20.
This question already has answers here:
Read specific columns of a delimited file in SAS
(3 answers)
Closed 8 years ago.
I'm trying to read a comma delimited .txt file (called 'file.txt' in the code below) into SAS in order to create a permanent database that includes only some of the variables and observations.
Here's a snippet of the .txt file for reference:
SUMLEV,REGION,DIVISION,STATE,NAME,POPESTIMATE2013,POPEST18PLUS2013,PCNT_POPEST18PLUS
10,0,0,0,United States,316128839,242542967,76.7
40,3,6,1,Alabama,4833722,3722241,77
40,4,9,2,Alaska,735132,547000,74.4
40,4,8,4,Arizona,6626624,5009810,75.6
40,3,7,5,Arkansas,2959373,2249507,76
My (abbreviated) code is as follows:
options nocenter nodate ls=72 ps=58;
filename foldr1 'C:\Users\redacted\Desktop\file.txt';
libname foldr2 'C:\Users\redacted\Desktop\Data';
libname foldr3 'C:\Users\redacted\Desktop\Formats';
options fmtsearch=(FMTfoldr.bf_fmts);
proc format library=foldr3.bf_fmts;
[redacted]
run;
data foldr2.file;
infile foldr1 DLM=',' firstobs=2 obs=52;
input STATE $ NAME $ REGION $ POPESTIMATE2013;
PERCENT=POPESTIMATE2013/316128839;
format REGION $regfmt.;
run;
proc print data=foldr2.file;
sum POPESTIMATE2013 PERCENT;
title 'Title';
run;
In my INPUT statement, I list the variables that I want to include in my new truncated database (STATE, NAME, REGION, etc.).
When I print my truncated database, I notice that all of my INPUT variables do not correspond to the same variables in the original file.
Instead my variables print out like this:
STATE (1st var listed in INPUT) printed as SUMLEV (1st var listed in
.txt file)
NAME (2nd var listed in INPUT) printed as REGION (2nd var listed in .txt file)
REGION (3rd " " " ") printed as DIVISION (3rd " " " ")
POPESTIMATE2013 (4th " " " ") printed as STATE (4th " " " ")
It seems that SAS is matching my INPUT variables based on order, not on name. So, because I list STATE first in my INPUT statement, SAS prints out the first variable of the original .txt file (i.e., the SUMLEV variable).
Any idea what's wrong with my code? Thanks for your help!
Your current code is reading in the first 4 values from each line of the CSV file and assigning them to columns with the names you have listed.
The input statement lists all the columns you want to read in (and where to read them from), it does not search for named columns within the input file.
The code below should produce the output you want. The keep statement lists the columns that you want in the output.
data foldr2.file;
infile foldr1 dlm = "," firstobs = 2 obs = 52;
/* Prevent truncating the name variable */
informat NAME $20.;
/* Name each of the columns */
input SUMLEV REGION DIVISION STATE NAME $ POPESTIMATE2013 POPEST18PLUS2013 PCNT_POPEST18PLUS;
/* Keep only the columns you want */
keep STATE NAME REGION POPESTIMATE2013 PERCENT;
PERCENT = POPESTIMATE2013/316128839;
format REGION $regfmt.;
run;
For a slightly more involved solution see Joe's excellent answer here. Applying this approach to your data will require setting the lengths of your columns in advance and converting character values to numeric.
data foldr2.file;
infile foldr1 dlm = "," firstobs = 2 obs = 52;
length STATE 8. NAME $13. REGION 8. POPESTIMATE2013 8.;
input #;
STATE = input(scan(_INFILE_, 4, ','), best.);
NAME = scan(_INFILE_, 5, ',');
REGION = input(scan(_INFILE_, 2, ','), best.);
POPESTIMATE2013 = input(scan(_INFILE_, 6, ','), best.);
PERCENT = POPESTIMATE2013/316128839;
format REGION $regfmt.;
run;
If you are looking to become more familiar with SAS it would be worth your while to take a look at the SAS documentation for reading files.
Your current data step is telling SAS what to name the first four variables in the txt file. To do what you want, you need to list all of the variables in the txt file in your "input" statement. Then, in your data statement, use the keep= option to select the variables you want to be included in the output dataset.
data foldr2.file (keep=STATE NAME REGION POPESTIMATE2013 PERCENT);
infile foldr1 DLM=',' firstobs=2 obs=52;
input
SUMLEV
REGION $
DIVISION
STATE $
NAME $
POPESTIMATE2013
POPEST18PLUS2013
PCNT_POPEST18PLUS;
PERCENT=POPESTIMATE2013/316128839;
format REGION $regfmt.;
run;
I have a vcf file like this:
http://www.1000genomes.org/node/101
Here's the example from that site:
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
After the header lines, each line has fields that contain genotypes starting with the 10th field. The 10th field is below the NA0001 heading; the 11th field is genotype NA0002, etc. I have a file with 123 different genotypes, so going from position 10 to 133 (NA0001 until NA0123). What is shown in these fields can be 0/0, 0/1, 0/2 .... till 8/9 for instance. Now I want to replace all the non-equal ones. So I would like to keep 0/0, 1/1, 2/2, etc. And replace 0/1, 0/2, 1/2, 4/5, 4/6 etc by ./.
I would like to write this in a C script. Thought about using sed y/regexp/replacement/ but no idea how to write all those unequal values in a regular expression. And on other positions in the file there could also be these values, so really only positions 10 till 133 should be replaced. And it needs to be replaced; I will be needing the rest of the file with the new values.
Hope it is clear. Anyone any idea how to do this?
This regex should do what you want: \s(\d)[|\/](?!\1)\d: Replace matches with ./.:
Breakdown:
\s(\d) matches a space followed by a single digit, capturing the digit in capture group #1
[|\/] matches a pipe or slash (since it seems that the VCF format allows either)
(?!\1)\d uses a negative lookahead to ensure that the next character is not the same as capture group #1, and matches the digit
Caveats:
I matched a leading space and trailing : to try to ensure it matches only the intended values. I couldn't work out a good way to limit it to fields 10 and after.
Example using perl:
perl -pe 's#\s(\d)[|/](?!\1)\d:# ./.:#g' testfile.vcf > testfile_afterchange.vcf
Note: I used # as the delimiter to avoid having to escape the / characters in the regex.
I wanted via matlab to read a table of data from a txt file after a specific expression and a number of non desired lines for example the AA.txt have:
Information about students :
AAAA
BBBB
1 10 100
2 3 15
! ! ! a number of lines
10 6 9
I have like information the expression 'Information about students', the number of skipped lines 2 and the number of columns 3 and rows 10 in desired matrix.
if I understand correctly, you wanna skip the first 3 lines (assuming them as headers) and then reading the rest.
I would follow this procedure:
fid = fopen(filename,'r');
A = textscan(fid,'%f %f %f','HeaderLines',3,'Delimiter','\r\n');
I currently do not have access to MATLAB, but I do believe it will work.
I have the following txt file with records in the following way
LEE ATHNOS
1215 RAINTREE CIRCLE
PHOENIX AZ 85044
JOYCE BENEFIT
85 MAPLE AVENUE
MENLO PARK CA 94025
These are actually 2 records in multiple lines. The code i am using to input the records
data lineinput;
infile linein;
input Lname $ Fname $ /
Address $1-20 /
City & $10. State $ zip $ ;
run;
I am not able to read the 2nd record .
Below is the log
NOTE: The infile LINEIN is:
Filename=\\VBOXSVR\win_7\SAS\DATA\INPUT\linepointer.txt,
RECFM=V,LRECL=256,File Size (bytes)=105,
Last Modified=30Jun2015:00:45:47,
Create Time=30Jun2015:00:32:31
NOTE: LOST CARD.
Lname=JOYCE Fname=BENEFIT Address=MENLO PARK CA 94025 City= State= zip= _ERROR_=1 _N_=2
NOTE: 6 records were read from the infile LINEIN.
The minimum record length was 10.
The maximum record length was 20.
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.LINEINPUT has 1 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
cpu time 0.01 seconds
on proc print I am getting follwing output
ANY idea guys why I am not getting the 2nd record correct.(There is a space between city name so i have used & )
You need to add an option 'truncover' for it to work in a streaming setting:
filename FT15F001 temp lrecl=512;
data hipa;
infile FT15F001 truncover;
input Lname $ Fname $ /
Address $1-20 /
City & $10. State $ zip $ ;
parmcards4;
LEE ATHNOS
1215 RAINTREE CIRCLE
PHOENIX AZ 85044
JOYCE BENEFIT
85 MAPLE AVENUE
MENLO PARK CA 94025
;;;;
run;