Using awk or shell to total values in a delimited file

I have a pipe-delimited file in which certain fields themselves contain delimited values, and I want to total the values in the 37th field and append that total to the end of each line.
A sample line looks like this:
1|2|18|45324.56|John|Smith|...etc then the 37th field has |1.99^2.46^79.87|next data field here|etc
I want to add the numbers 1.99, 2.46, and 79.87 together and append the total to the end of the line:
1|2|18|45324.56|John|Smith|...etc then the 37th field has |1.99^2.46^79.87|next data field here|84.32 <- total of all values in $37 (this field can have one value or over 100 values in it)
Obviously I can do awk -F'|' '{print $37}' file and it will show me 1.99^2.46^79.87, but I'm not sure how to total those values, since it's delimited data nested inside differently delimited data.
Edit: here is a full line of data:
1|12|15|29786.31|test|true|2019-12-01|2021-02-28||2019-12-01|2021-02-28|1417.00|t0000000|John|Smith|current|1234 Main St|Dallas|TX|75000|Office|8709999999||||||||Attachment^Attachment|t0000000_4042 - Application Documents.pdf^t0000000_4042 - Lease Agreement.pdf|704405808^704405809^704405810^704523038^704523039^704523593^704523594|2021-03-01^2021-03-01^2021-03-01^2021-02-28^2021-02-28^2021-03-06^2021-03-06|RUBS Income Water/Sewer^RUBS Income Water/Sewer^Utility Billing Revenue^Damages/Cleaning Fees^Damages/Cleaning Fees^RUBS Income Water/Sewer^RUBS Income Water/Sewer|Charge^Charge^Charge^Charge^Charge^Charge^Charge|18.25^15.26^2.99^40.00^25.00^18.88^15.78|18.25^15.26^2.99^40.00^25.00^18.88^15.78|Charge Code^Charge Code Desc^Transaction Note^Charge Code^Charge Code Desc^Transaction Note^Charge Code^Charge Code Desc^Transaction Note^Charge Code^Charge Code Desc^Transaction Note^Charge Code^Charge Code Desc^Transaction Note^Charge Code^Charge Code Desc^Transaction Note^Charge Code^Charge Code Desc^Transaction Note

awk has a split function that splits a field on a given delimiter:
awk -F'|' '{
    n = split($37, a, "^")               # split field 37 on caret, store pieces in a
    s = 0
    for (i = 1; i <= n; i++) s += a[i]   # sum the values
    print $0 "|" s                       # append the total as a new field and output
}' sample_file
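As a quick sanity check, here is the same technique on a minimal sketch line, with the caret-delimited values moved to field 3 purely to keep the sample short (the field number is the only thing that changes):

```shell
# Same approach as above: split the caret-delimited field, sum the
# pieces, and append the total as a new pipe-delimited field.
printf 'a|b|1.99^2.46^79.87|c\n' |
awk -F'|' '{
    n = split($3, vals, "^")
    s = 0
    for (i = 1; i <= n; i++) s += vals[i]
    printf "%s|%.2f\n", $0, s
}'
# prints: a|b|1.99^2.46^79.87|c|84.32
```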

Related

SAS programming problem in using the data delimiter *

I'm trying to define a data set from the data in the products.txt file, using the delimiter *.
products.txt data:
hartie 2 birotica
creione 10 birotica
apa 6 alimente
ceai 8 alimente
tricou 100 haine
I tried to use the delimiter *:
data produse;
infile '/home/u47505185/produse.txt' dlm='*';
input Nume $ Pret Categorie $;
run;
The DSD option changes the space delimiter into a comma. I want the option that changes space into *.
The DSD option, in addition to the other things it does, changes the DEFAULT delimiter from space to comma. But you can override the default delimiter to any list of characters you want by using the DLM= (also known as DELIMITER=) option, whether or not you are using the DSD option.
From the comments it sounds like you just want to do text manipulation. Just change the spaces to stars. Make sure to remove any trailing spaces (unless you want those to also become stars).
data _null_;
infile '/home/u47505185/produse.txt';
input;
file '/home/u47505185/produse_star.txt';
_infile_=translate(trimn(_infile_),'*',' ');
put _infile_;
run;
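The same space-to-star rewrite can be checked outside SAS with standard shell tools; a sketch, where the trailing-blank trim mirrors what TRIMN does before TRANSLATE runs:

```shell
# Trim trailing blanks, then turn the remaining spaces into stars,
# mirroring translate(trimn(_infile_), '*', ' ').
printf 'hartie 2 birotica   \n' | sed -e 's/[[:space:]]*$//' -e 's/ /*/g'
# prints: hartie*2*birotica
```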
To display missing numeric values as an asterisk (*) in output or data viewers, use this setting:
OPTIONS MISSING='*';
The INFILE DLM= option is for specifying what character(s) in the data file are to be used to separate the variables being INPUT.
DLM does NOT specify a replacement value for missing values.
You told SAS to use * as a field separator.
So what is happening? The LOG will tell you. Essentially, Nume was read as an 8-character variable (the default length) and the delimiter never appeared. So Pret, a numeric variable, had nothing to be read in and was assigned a missing value. When viewed in output or a data viewer, that value appears as a period (.).
data want;
infile datalines dlm='*'; * '/home/u47505185/produse.txt' dlm='*';
input Nume $ Pret Categorie $;
datalines;
hartie 2 birotica
creione 10 birotica
apa 6 alimente
ceai 8 alimente
tricou 100 haine
;
Log
25 data want;
26 infile datalines dlm='*'; * '/home/u47505185/produse.txt' dlm='*';
27 input Nume $ Pret Categorie $;
28 datalines;
NOTE: Invalid data for Pret in line 30 1-80.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+--
31 apa 6 alimente
NOTE: Invalid data errors for file CARDS occurred outside the printed range.
NOTE: Increase available buffer lines with the INFILE n= option.
Nume=hartie 2 Pret=. Categorie=apa 6 al _ERROR_=1 _N_=1
NOTE: Invalid data for Pret in line 33 1-80.
NOTE: LOST CARD.
34 ;
NOTE: Invalid data errors for file CARDS occurred outside the printed range.
NOTE: Increase available buffer lines with the INFILE n= option.
Nume=ceai 8 a Pret=. Categorie= _ERROR_=1 _N_=2
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.WANT has 1 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds
By default, what is shown to you when a value is missing?
Numeric variables: a period (.), or the current setting of the session option MISSING='<one-char>'.
Character variables: a single space (blank).

How can I put CSV files in an array in bash?

So I need to put the content of some of the columns of a CSV file into an array so I can operate on them.
My File looks like this:
userID,placeID,rating,food_rating,service_rating
U1077,135085,2,2,2
U1077,135038,2,2,1
U1077,132825,2,2,2
U1077,135060,1,2,2
U1068,135104,1,1,2
U1068,132740,0,0,0
U1068,132663,1,1,1
U1068,132732,0,0,0
U1068,132630,1,1,1
U1067,132584,2,2,2
U1067,132733,1,1,1
U1067,132732,1,2,2
U1067,132630,1,0,1
U1067,135104,0,0,0
U1067,132560,1,0,0
U1103,132584,1,2,1
U1103,132732,0,0,2
U1103,132630,1,2,0
U1103,132613,2,2,2
U1103,132667,1,2,2
U1103,135104,1,2,0
U1103,132663,1,0,2
U1103,132733,2,2,2
U1107,132660,2,2,1
U1107,132584,2,2,2
U1107,132733,2,2,2
U1044,135088,2,2,2
U1044,132583,1,2,1
U1070,132608,2,2,1
U1070,132609,1,1,1
U1070,132613,1,1,0
U1031,132663,0,0,0
U1031,132665,0,0,0
U1031,132668,0,0,0
U1082,132630,1,1,1
and I want to get the placeID and save it in an array, and in the same position also store the ratings. What I need is the average rating for every placeID.
I have been trying something like
cut -d"," -f2 FileName >> var[#]
Hard to accomplish in bash but pretty straightforward in awk:
awk -F',' 'NR>1 {sum[$2] += $3; count[$2]++}; END{ for (id in sum) { print id, sum[id]/count[id] } }' file.csv
Explanation: -F sets the field separator; you want field 2 and the average of field 3. We process every row but the first (row number greater than 1), accumulating a sum and a count per ID, and at the end we print each unique ID with its average.
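If you do want it in bash itself, associative arrays (bash 4+) can hold the running sums and counts. This is a sketch of the same aggregation, assuming the header line should be skipped and the file is named file.csv as in the awk command above:

```shell
#!/usr/bin/env bash
# Average the rating (column 3) per placeID (column 2) using bash 4+
# associative arrays; mirrors the awk one-liner above.
declare -A sum count
{
    read -r _header                      # skip the header line
    while IFS=',' read -r _user place rating _rest; do
        sum[$place]=$(( ${sum[$place]:-0} + rating ))
        count[$place]=$(( ${count[$place]:-0} + 1 ))
    done
} < file.csv
for place in "${!sum[@]}"; do
    # bash has no floating point, so hand the division to awk
    awk -v s="${sum[$place]}" -v c="${count[$place]}" -v p="$place" \
        'BEGIN { printf "%s %.2f\n", p, s / c }'
done
```

Note the awk version is still the better tool here; the bash loop is markedly slower on large files.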

SAS INPUT DATA WITH SPECIAL CHARACTERS

I'm trying to import a .dat file (comma-delimited) into SAS University Edition. However, one variable contains special characters (e.g. French accents). Most are replaced with �, and some observations have additional problems.
Example of a problem:
An original observation in the data looks like this:
Crème Brûlée,105,280
Running the following command:
DATA BenAndJerrys;
INFILE '/folders/myfolders/HW3/BenAndJerrys.dat' DLM = ',' DSD MISSOVER;
INPUT flavor_name :$48. portion_size calories;
RUN;
It has this problem:
flavor_name=Cr�me Br�l�e,105 portion_size=280 calories=
as you can see, the value 105, which belongs to portion_size, is merged into the value of flavor_name, and the value 280 for calories is assigned to portion_size.
How can I solve this problem and let SAS import the data with the special characters?
Try telling SAS what encoding to use when reading the file.
I copied and saved your sample line into a text file using Windows NOTEPAD editor.
%let path=C:\Downloads ;
data _null_;
infile "&path\test.txt" dsd encoding=wlatin1;
length x1-x3 $50 ;
input x1-x3;
put (_all_) (=);
run;
Result in the log.
x1=Crème Brûlée x2=105 x3=280
NOTE: 1 record was read from the infile "C:\Downloads\test.txt".
The minimum record length was 20.
The maximum record length was 20.
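Outside SAS, the same encoding mismatch can be confirmed or repaired with iconv; a sketch, assuming the file really is Latin-1 (which is what SAS's wlatin1 encoding reads):

```shell
# \350, \373, \351 are the Latin-1 octal codes for è, û, é.
# Converting the bytes to UTF-8 recovers the accented characters.
printf 'Cr\350me Br\373l\351e,105,280\n' | iconv -f ISO-8859-1 -t UTF-8
# prints: Crème Brûlée,105,280
```

Alternatively, converting the whole file to UTF-8 once with iconv avoids needing the ENCODING= option at all.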

Merging CSV file lines with the same initial fields and sorting them by their length

I have a huge CSV file with four fields per line in this format (ID1, ID2, score, elem):
HELLO, WORLD, 2323, elem1
GOODBYE, BLUESKY, 3232, elem2
HELLO, WORLD, 421, elem3
GOODBYE, BLUESKY, 41134, elem4
ETC...
I would like to merge all lines that share the same ID1, ID2 fields onto one line, eliminating the score field, resulting in:
HELLO, WORLD, elem1, elem3.....
GOODBYE, BLUESKY, elem2, elem4.....
ETC...
where each elem comes from a different line with the same ID1, ID2.
After that I would like to sort the lines by their length.
I tried coding it in Java but it is super slow. I have read online about awk, but I can't really find a good resource for understanding its syntax for CSV files.
I used this command; how can I adapt it to my needs?
awk -F',' 'NF>1{a[$1] = a[$1]","$2}END{for(i in a){print i""a[i]}}' finale.txt > finale2.txt
Your key should be composite, and the delimiter needs to be set to accommodate the comma and the spaces after it.
$ awk -F', *' -v OFS=', ' '{k=$1 OFS $2; a[k]=k in a?a[k] OFS $4:$4}
END{for(k in a) print k, a[k]}' file
GOODBYE, BLUESKY, elem2, elem4
HELLO, WORLD, elem1, elem3
Explanation
Set the field separator (FS) to a comma followed by one or more spaces, and the output field separator (OFS) to the normalized form (comma and one space). Create a composite key from the first two fields, joined with OFS (since we're going to use it in the output). Append the fourth field to the array element indexed by that key (the first element is treated specially since we don't want to start with OFS). When all records are done (the END block), print all keys and values.
To add the length, keep a parallel counter and increment it each time you append for a key (c[k]++), then use it when printing. That is,
$ awk -F', *' -v OFS=', ' '{k=$1 OFS $2; c[k]++; a[k]=k in a?a[k] OFS $4:$4}
END{for(k in a) print k, c[k], a[k]}' file |
sort -t, -k3n
GOODBYE, BLUESKY, 2, elem2, elem4
HELLO, WORLD, 2, elem1, elem3

How to create a truncated permanent database from a larger file in SAS [duplicate]

This question already has answers here:
Read specific columns of a delimited file in SAS
(3 answers)
Closed 8 years ago.
I'm trying to read a comma delimited .txt file (called 'file.txt' in the code below) into SAS in order to create a permanent database that includes only some of the variables and observations.
Here's a snippet of the .txt file for reference:
SUMLEV,REGION,DIVISION,STATE,NAME,POPESTIMATE2013,POPEST18PLUS2013,PCNT_POPEST18PLUS
10,0,0,0,United States,316128839,242542967,76.7
40,3,6,1,Alabama,4833722,3722241,77
40,4,9,2,Alaska,735132,547000,74.4
40,4,8,4,Arizona,6626624,5009810,75.6
40,3,7,5,Arkansas,2959373,2249507,76
My (abbreviated) code is as follows:
options nocenter nodate ls=72 ps=58;
filename foldr1 'C:\Users\redacted\Desktop\file.txt';
libname foldr2 'C:\Users\redacted\Desktop\Data';
libname foldr3 'C:\Users\redacted\Desktop\Formats';
options fmtsearch=(FMTfoldr.bf_fmts);
proc format library=foldr3.bf_fmts;
[redacted]
run;
data foldr2.file;
infile foldr1 DLM=',' firstobs=2 obs=52;
input STATE $ NAME $ REGION $ POPESTIMATE2013;
PERCENT=POPESTIMATE2013/316128839;
format REGION $regfmt.;
run;
proc print data=foldr2.file;
sum POPESTIMATE2013 PERCENT;
title 'Title';
run;
In my INPUT statement, I list the variables that I want to include in my new truncated database (STATE, NAME, REGION, etc.).
When I print my truncated database, I notice that my INPUT variables do not correspond to the same-named variables in the original file.
Instead my variables print out like this:
STATE (1st var listed in INPUT) printed as SUMLEV (1st var listed in the .txt file)
NAME (2nd var listed in INPUT) printed as REGION (2nd var listed in the .txt file)
REGION (3rd var listed in INPUT) printed as DIVISION (3rd var listed in the .txt file)
POPESTIMATE2013 (4th var listed in INPUT) printed as STATE (4th var listed in the .txt file)
It seems that SAS is matching my INPUT variables based on order, not on name. So, because I list STATE first in my INPUT statement, SAS prints out the first variable of the original .txt file (i.e., the SUMLEV variable).
Any idea what's wrong with my code? Thanks for your help!
Your current code is reading in the first 4 values from each line of the CSV file and assigning them to columns with the names you have listed.
The input statement lists all the columns you want to read in (and where to read them from), it does not search for named columns within the input file.
The code below should produce the output you want. The keep statement lists the columns that you want in the output.
data foldr2.file;
infile foldr1 dlm = "," firstobs = 2 obs = 52;
/* Prevent truncating the name variable */
informat NAME $20.;
/* Name each of the columns */
input SUMLEV REGION DIVISION STATE NAME $ POPESTIMATE2013 POPEST18PLUS2013 PCNT_POPEST18PLUS;
/* Keep only the columns you want */
keep STATE NAME REGION POPESTIMATE2013 PERCENT;
PERCENT = POPESTIMATE2013/316128839;
format REGION $regfmt.;
run;
For a slightly more involved solution see Joe's excellent answer here. Applying this approach to your data will require setting the lengths of your columns in advance and converting character values to numeric.
data foldr2.file;
infile foldr1 dlm = "," firstobs = 2 obs = 52;
length STATE 8 NAME $13 REGION 8 POPESTIMATE2013 8;
input #;
STATE = input(scan(_INFILE_, 4, ','), best.);
NAME = scan(_INFILE_, 5, ',');
REGION = input(scan(_INFILE_, 2, ','), best.);
POPESTIMATE2013 = input(scan(_INFILE_, 6, ','), best.);
PERCENT = POPESTIMATE2013/316128839;
format REGION $regfmt.;
run;
If you are looking to become more familiar with SAS it would be worth your while to take a look at the SAS documentation for reading files.
Your current data step is telling SAS what to name the first four variables in the txt file. To do what you want, you need to list all of the variables in the txt file in your "input" statement. Then, in your data statement, use the keep= option to select the variables you want to be included in the output dataset.
data foldr2.file (keep=STATE NAME REGION POPESTIMATE2013 PERCENT);
infile foldr1 DLM=',' firstobs=2 obs=52;
input
SUMLEV
REGION $
DIVISION
STATE $
NAME $
POPESTIMATE2013
POPEST18PLUS2013
PCNT_POPEST18PLUS;
PERCENT=POPESTIMATE2013/316128839;
format REGION $regfmt.;
run;
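For comparison, name-based column selection (what the asker expected SAS to do) can be sketched with awk, which can read the header row and build a name-to-position map. A sketch, assuming the comma-delimited layout shown above and no quoted fields containing commas:

```shell
# Read the header, remember each column's position by name, then print
# only the wanted columns from every data row.
awk -F',' '
NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
{ print $(col["STATE"]), $(col["NAME"]), $(col["REGION"]), $(col["POPESTIMATE2013"]) }
' file.txt
```

Because lookup is by header name, this keeps working even if the file's column order changes.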
