I have a file with 30 columns (ID, Name and Place, repeated) and I need to extract 3 columns at a time and put them into a new file each time:
ID Name Place ID Name Place ID Name Place ID Name place ...
19 john NY 23 Key NY 22 Tom Ny 24 Jeff NY....
20 Jen NY 22 Jill NY 22 Ki LA 34 Jack Roh ....
So I will have 10 files like these -
Output1.txt
ID Name Place
19 john NY
20 Jen NY
Output2.txt
ID Name Place
23 Key NY
22 Jill NY
and 8 more files like these. I can print columns like
awk '{print $1,$2,$3}' Input.txt > Output1.txt but that is too cumbersome for 10 files. Is there any way I can make it faster?
Thanks!
$ awk '{for (i=1;i<=NF;i+=3) {print $i,$(i+1),$(i+2) > ("output" ((i+2)/3) ".txt")}}' file.txt
# output1.txt
ID Name Place
19 john NY
20 Jen NY
# output2.txt
ID Name Place
23 Key NY
22 Jill NY
# output3.txt
ID Name Place
22 Tom Ny
22 Ki LA
# output4.txt
ID Name place
24 Jeff NY
34 Jack Roh
Tweaking Ed Morton's wonderful answer a bit,
awk -v d=3 '{sfx=0; for(i=1;i<=NF;i+=d) {str=fs=""; for(j=i;j<i+d;j++) \
{str = str fs $j; fs=" "}; print str > ("output_file_" ++sfx)} }' file
will split the file up as you requested.
Remember that the awk variable d defines the number of columns per output file, which is 3 in your case.
$ awk '{for(i=0;i<=NF/3-1;i++) print $(i*3+1), $(i*3+2), $(i*3+3) > ((i+1) ".txt")}' file
$ cat 1.txt
ID Name Place
19 john NY
20 Jen NY
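For comparison, the same column-chunking idea can be sketched in Python. This is only a minimal sketch, not part of the answers above: the function name split_columns, the width parameter and the output{n}.txt naming are assumptions chosen for illustration, and the input is assumed to be whitespace-separated.

```python
from pathlib import Path


def split_columns(src, out_dir, width=3):
    """Write every `width` consecutive whitespace-separated columns of
    `src` to its own file (output1.txt, output2.txt, ...) in `out_dir`.
    Returns the sorted list of group numbers that were written."""
    out_dir = Path(out_dir)
    handles = {}  # group number -> open file handle
    try:
        with open(src) as fh:
            for line in fh:
                fields = line.split()
                for start in range(0, len(fields), width):
                    n = start // width + 1
                    if n not in handles:
                        handles[n] = open(out_dir / f"output{n}.txt", "w")
                    chunk = fields[start:start + width]
                    handles[n].write(" ".join(chunk) + "\n")
    finally:
        # Close every output file, even if reading failed part-way.
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

Calling split_columns("Input.txt", ".") would write one file per group of three columns into the current directory.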
I have to create a file from a dataset in a JSON-like style, but without a line break (CR) between the variables.
All variables of a record have to be on the same line.
I would like to have something like this:
ID1 "key1"="value1" "key2"="value2" .....
Each key is a column of the dataset.
I work with SAS 9.3 on UNIX.
Sample :
I have
ID Name Sex Age
123 jerome M 30
345 william M 26
456 ingrid F 25
I would like
123 "Name"="jerome" "sex"="M" "age"="30"
345 "Name"="william" "sex"="M" "age"="26"
456 "Name"="ingrid" "sex"="F" "age"="25"
Thanks
If your data looked like this...
Obs Name _NAME_ COL1
1 Alfred Name Alfred
2 Alfred Sex M
3 Alfred Age 14
4 Alfred Height 69
5 Alfred Weight 112.5
6 Alice Name Alice
7 Alice Sex F
8 Alice Age 13
9 Alice Height 56.5
10 Alice Weight 84
11 Barbara Name Barbara
12 Barbara Sex F
13 Barbara Age 13
14 Barbara Height 65.3
15 Barbara Weight 98
16 Carol Name Carol
17 Carol Sex F
18 Carol Age 14
19 Carol Height 62.8
20 Carol Weight 102.5
21 Henry Name Henry
22 Henry Sex M
23 Henry Age 14
24 Henry Height 63.5
25 Henry Weight 102.5
You could use code like this to write the value pairs, assuming this is what you're talking about.
data _null_;
   do until(last.name);
      set class;
      by name;
      col1 = left(col1);
      if first.name then put name @;
      put _name_:$quote. +(-1) '=' col1:$quote. @;
   end;
   put;
run;
Alfred "Name"="Alfred" "Sex"="M" "Age"="14" "Height"="69" "Weight"="112.5"
Alice "Name"="Alice" "Sex"="F" "Age"="13" "Height"="56.5" "Weight"="84"
Barbara "Name"="Barbara" "Sex"="F" "Age"="13" "Height"="65.3" "Weight"="98"
Carol "Name"="Carol" "Sex"="F" "Age"="14" "Height"="62.8" "Weight"="102.5"
Henry "Name"="Henry" "Sex"="M" "Age"="14" "Height"="63.5" "Weight"="102.5"
NOTE: There were 25 observations read from the data set WORK.CLASS.
Consider these non-transposing variations:
Actual JSON, use Proc JSON
data have;input
ID Name $ Sex $ Age; datalines;
123 jerome M 30
345 william M 26
456 ingrid F 25
run;
filename out temp;
proc json out=out;
export have;
run;
* What hath been wrought ?;
data _null_; infile out; input; put _infile_; run;
----- LOG -----
{"SASJSONExport":"1.0","SASTableData+HAVE":[{"ID":123,"Name":"jerome","Sex":"M","Age":30},{"ID":345,"Name":"william","Sex":"M","Age":26},{"ID":456,"Name":"ingrid","Sex":"F","Age":25}]}
A concise name-value pair output of the variables using the PUT statement specification syntax (variable-list) (format-list), using _ALL_ for the variable list and = for the format.
filename out2 temp;
data _null_;
set have;
file out2;
put (_all_) (=);
run;
data _null_;
infile out2; input; put _infile_;
run;
----- LOG -----
ID=123 Name=jerome Sex=M Age=30
ID=345 Name=william Sex=M Age=26
ID=456 Name=ingrid Sex=F Age=25
Iterate the variables using the VNEXT routine. Extract the formatted values using the VVALUEX function, and conditionally construct the quoted name and value parts.
filename out3 temp;
data _null_;
  set have;
  file out3;
  length _name_ $34 _value_ $32000;
  do _n_ = 1 by 1;
    call vnext(_name_);
    if _name_ = "_name_" then leave;
    if _n_ = 1
      then _value_ = strip(vvaluex(_name_));
      else _value_ = quote(strip(vvaluex(_name_)));
    _name_ = quote(trim(_name_));
    if _n_ = 1
      then put _value_ @;
      else put _name_ +(-1) '=' _value_ @;
  end;
  put;
run;
data _null_;
infile out3; input; put _infile_;
run;
----- LOG -----
123 "Name"="jerome" "Sex"="M" "Age"="30"
345 "Name"="william" "Sex"="M" "Age"="26"
456 "Name"="ingrid" "Sex"="F" "Age"="25"
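The logic of that last variation (first variable printed bare, every other variable as a quoted name and a quoted value) can also be sketched outside SAS. The following Python version is only an illustration under stated assumptions: the function name name_value_lines is hypothetical, the input is assumed to be whitespace-separated with a header row, and embedded quotes in values are not escaped.

```python
def name_value_lines(rows):
    """rows: whitespace-separated lines, header first.
    The first column is emitted bare; the others as "name"="value"."""
    header = rows[0].split()
    out = []
    for row in rows[1:]:
        values = row.split()
        pairs = [f'"{k}"="{v}"' for k, v in zip(header[1:], values[1:])]
        out.append(" ".join([values[0]] + pairs))
    return out


sample = [
    "ID Name Sex Age",
    "123 jerome M 30",
    "345 william M 26",
    "456 ingrid F 25",
]
for line in name_value_lines(sample):
    print(line)
```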
Beth 45 0
Danny 33 0
Thomas 22 40
Mark 65 100
Mary 29 121
Susie 39 76.5
Joey 51 189.52
Peter 23 78.26
Maximus 34 289.71
Rebecca 21 45.79
Sophie 26 28.44
Barbara 24 107.36
Elizabeth 35 105.69
Peach 40 102.69
Lily 41 123
The above is a data file which has three fields: name, age, salary.
I want to print the average salary, the head count, and the names for two groups: people aged under 30 and people aged 30 or above.
In this exercise, I want to practise using strings as subscripts.
Here is my AWK code:
BEGIN { OFS = "\t\t" }
{
if ($2 < 30)
{
a = "age below 30";
salary[a] += $NF;
count[a]++;
name[a] = name[a] $1 "\t";
}
else
{
a = "age equals or above 30";
salary[a] += $NF;
count[a]++;
name[a] = name[a] $1 "\t";
}
}
END {
for (a in salary)
for (a in count)
for (a in name)
{
print "The average salary of " a " is " salary[a] / count[a];
print "There are " count[a] " people " a ;
print "Their names are " name[a];
print "********************************************************";
}
}
The following is the output:
The average salary of age equals or above 30 is 109.679
There are 9 people age equals or above 30
Their names are Beth Danny Mark Susie Joey Maximus Elizabeth Peach Lily
********************************************************
The average salary of age below 30 is 70.1417
There are 6 people age below 30
Their names are Thomas Mary Peter Rebecca Barbara Sophie
********************************************************
The average salary of age equals or above 30 is 109.679
There are 9 people age equals or above 30
Their names are Beth Danny Mark Susie Joey Maximus Elizabeth Peach Lily
********************************************************
The average salary of age below 30 is 70.1417
There are 6 people age below 30
Their names are Thomas Mary Peter Rebecca Barbara Sophie
********************************************************
The average salary of age equals or above 30 is 109.679
There are 9 people age equals or above 30
Their names are Beth Danny Mark Susie Joey Maximus Elizabeth Peach Lily
********************************************************
The average salary of age below 30 is 70.1417
There are 6 people age below 30
Their names are Thomas Mary Peter Rebecca Barbara Sophie
********************************************************
The average salary of age equals or above 30 is 109.679
There are 9 people age equals or above 30
Their names are Beth Danny Mark Susie Joey Maximus Elizabeth Peach Lily
********************************************************
The average salary of age below 30 is 70.1417
There are 6 people age below 30
Their names are Thomas Mary Peter Rebecca Barbara Sophie
********************************************************
The output is very difficult for me to understand.
What I anticipated should look like this:
The average salary of age equals or above 30 is 109.679
There are 9 people age equals or above 30
Their names are Beth Danny Mark Susie Joey Maximus Elizabeth Peach Lily
********************************************************
The average salary of age equals or above 30 is 109.679
There are 9 people age equals or above 30
Their names are Thomas Mary Peter Rebecca Barbara Sophie
********************************************************
The average salary of age equals or above 30 is 109.679
There are 6 people age below 30
Their names are Beth Danny Mark Susie Joey Maximus Elizabeth Peach Lily
********************************************************
The average salary of age equals or above 30 is 109.679
There are 6 people age below 30
Their names are Thomas Mary Peter Rebecca Barbara Sophie
********************************************************
The average salary of age below 30 is 70.1417
There are 9 people age equals or above 30
Their names are Beth Danny Mark Susie Joey Maximus Elizabeth Peach Lily
********************************************************
The average salary of age below 30 is 70.1417
There are 9 people age equals or above 30
Their names are Thomas Mary Peter Rebecca Barbara Sophie
********************************************************
The average salary of age below 30 is 70.1417
There are 6 people age below 30
Their names are Beth Danny Mark Susie Joey Maximus Elizabeth Peach Lily
********************************************************
The average salary of age below 30 is 70.1417
There are 6 people age below 30
Their names are Thomas Mary Peter Rebecca Barbara Sophie
********************************************************
So my first question is: what did I misunderstand?
And my second question is:
I actually don't need so many loops. I just need
The average salary of age equals or above 30 is 109.679
There are 9 people age equals or above 30
Their names are Beth Danny Mark Susie Joey Maximus Elizabeth Peach Lily
********************************************************
The average salary of age equals or above 30 is 109.679
There are 9 people age equals or above 30
Their names are Thomas Mary Peter Rebecca Barbara Sophie
********************************************************
for (a in salary, count, names) doesn't work. Is there a better way ?
for (x in salary)
for (y in count)
for (z in name)
print "foo"
says: for every index in salary, loop through every index in count and, while doing so, for every index in count loop through every index in name and print "foo" each time. So if salary, count, and name each had 3 entries then you'd print "foo" 3*3*3 = 27 times.
It gets more complicated than that in your code though because you're using the same variable to hold the index value of each array at every level of the nested loop:
for (a in salary)
for (a in count)
for (a in name)
so I'm not sure what awk is going to do with that - it may even be undefined behavior.
Since all 3 arrays have the same indices, just pick one of the arrays and loop on its indices; you can then access all 3 arrays using that same index.
$ cat tst.awk
{
bracket = "age " ($2 < 30 ? "under" : "equals or above") " 30"
names[bracket] = (bracket in names ? names[bracket] "\t" : "") $1
count[bracket]++
salary[bracket] += $NF
}
END {
for (bracket in names) {
print "The average salary of", bracket, "is", salary[bracket] / count[bracket]
print "There are", count[bracket], "people", bracket
print "Their names are", names[bracket]
print "********************************************************"
}
}
$ awk -f tst.awk file
The average salary of age equals or above 30 is 109.679
There are 9 people age equals or above 30
Their names are Beth Danny Mark Susie Joey Maximus Elizabeth Peach Lily
********************************************************
The average salary of age under 30 is 70.1417
There are 6 people age under 30
Their names are Thomas Mary Peter Rebecca Sophie Barbara
********************************************************
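The difference between nesting the loops and looping once over the shared indices can be checked with a small experiment. The sketch below uses Python dictionaries in place of awk arrays; the keys and values are made up purely for illustration.

```python
from itertools import product

# Three dictionaries sharing the same three keys (illustrative values).
salary = {"a": 10, "b": 20, "c": 30}
count = {"a": 1, "b": 2, "c": 3}
name = {"a": "x", "b": "y", "c": "z"}

# Nested loops visit every combination of keys: 3 * 3 * 3 iterations.
nested = sum(1 for _ in product(salary, count, name))

# One loop over the shared keys visits each group exactly once
# and can still read all three dictionaries.
single = [(k, salary[k] / count[k], name[k]) for k in salary]

print(nested)       # 27
print(len(single))  # 3
```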
In my script, I start with a file of campaign contributors, and anyone who donates a collective $500 is eligible for a contest. Anyone who meets that criterion is added to an array with an incrementing index, so the array grows as needed. Each element is formatted as outlined below, with the X's being a phone number. In the END portion of the script, I need to sort this array by last name ($2) for printing. I've done some searching but come up empty-handed. I'm not asking for someone to type the script for me, merely to point me in a better direction for searching or to offer advice. I need help sorting the contestants array; it is currently filled properly with the string values the way I need them for the assignment.
v1, v2, and v3 are the campaign contributions. I am using -F'[ :]' in my command to get both spaces and colons as field separators.
Input File lab4.data
Fname Lname:Phone__Number:v1:v2:v3
Mike Harrington:(510) 548-1278:250:100:175
Christian Dobbins:(408) 538-2358:155:90:201
Susan Dalsass:(206) 654-6279:250:60:50
Archie McNichol:(206) 548-1348:250:100:175
Jody Savage:(206) 548-1278:15:188:150
Guy Quigley:(916) 343-6410:250:100:175
Dan Savage:(406) 298-7744:450:300:275
Nancy McNeil:(206) 548-1278:250:80:75
John Goldenrod:(916) 348-4278:250:100:175
Chet Main:(510) 548-5258:50:95:135
Tom Savage:(408) 926-3456:250:168:200
Elizabeth Stachelin:(916) 440-1763:175:75:300
Array to hold anyone who donated more than $500. $8 is created and holds the value $5+$6+$7;
the array is filled in the for loop given below:
$8 = $5+$6+$7;
contestants[len++]
Loop to check people and add them to the contestants array.
name and number are arrays that hold their respective values for later use.
for(i=0;i<=NR;i++)if(contrib[i]>500){contestants[len++]= name[i]" "number[i] }
Formatting of indexes(desired array values for contestant[len++]):
[0] Mike Harrington (510) 548-1278
[1] Archie McNichol (206) 548-1348
[2] Guy Quigley (916) 343-6410
[3] Dan Savage (406) 298-7744
[4] John Goldenrod (916) 348-4278
[5] Tom Savage (408) 926-3456
[6] Elizabeth Stachelin (916) 440-1763
Loop to print/check that array has been correctly filled(it is)
for (i=0; i <len; i++) {print contestants[i]}
Output:
Mike Harrington (510) 548-1278
Archie McNichol (206) 548-1348
Guy Quigley (916) 343-6410
Dan Savage (406) 298-7744
John Goldenrod (916) 348-4278
Tom Savage (408) 926-3456
Elizabeth Stachelin (916) 440-1763
Desired Final Output: Ignore the formatting, as it displays correctly in my terminal; I just had a hard time getting it all nice in here.
***FIRST QUARTERLY REPORT***
***CAMPAIGN 2004 CONTRIBUTIONS***
Name Phone Jan | Feb | Mar | Total Donated
Mike Harrington (510)548-1278 $ 250 $ 100 $ 175 $ 525
Christian Dobbins (408)538-2358 $ 155 $ 90 $ 201 $ 446
Susan Dalsass (206)654-6279 $ 250 $ 60 $ 50 $ 360
Archie McNichol (206)548-1348 $ 250 $ 100 $ 175 $ 525
Jody Savage (206)548-1278 $ 15 $ 188 $ 150 $ 353
Guy Quigley (916)343-6410 $ 250 $ 100 $ 175 $ 525
Dan Savage (406)298-7744 $ 450 $ 300 $ 275 $ 1025
Nancy McNeil (206)548-1278 $ 250 $ 80 $ 75 $ 405
John Goldenrod (916)348-4278 $ 250 $ 100 $ 175 $ 525
Chet Main (510)548-5258 $ 50 $ 95 $ 135 $ 280
Tom Savage (408)926-3456 $ 250 $ 168 $ 200 $ 618
Elizabeth Stachelin (916)440-1763 $ 175 $ 75 $ 300 $ 550
-----------------------------------------------------------------------------
SUMMARY
-----------------------------------------------------------------------------
The campaign received a total of $6137.00 for this quarter.
The average donation for the 12 contributors was $511.42.
The highest total contribution was $1025.00 made by Dan Savage.
***Thank you Dan Savage***
The following people donated over $500 to the campaign.
They are eligible for the quarterly drawing!!
Listed are their names(sorted by last names) and phone numbers.
John Goldenrod (916) 348-4278
Mike Harrington (510) 548-1278
Archie McNichol (206) 548-1348
Guy Quigley (916) 343-6410
Dan Savage (406) 298-7744
Tom Savage (408) 926-3456
Elizabeth Stachelin (916) 440-1763
Thank you all for your continued support!!
Using gawk, this is straightforward to do with the built-in sort functions, e.g.
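BEGIN {
  data["Jane Doe (123) 456-7890"] = 600;
  data["Fred Adams (123) 456-7891"] = 800;
  data["John Smith (123) 456-7892"] = 900;
  exit;
}
END {
  for (i in data) {
    split(i,x," ")
    # key each entry by "Lname Fname phone" so sorting orders by last name
    data1[x[2] " " x[1] " " x[3] " " x[4]] = i;
  }
  n = asorti(data1,sdata1);
  # walk the sorted indices numerically; "for (i in sdata1)" has no guaranteed order
  for (i=1; i<=n; i++) {
    print data1[sdata1[i]],"\t",data[data1[sdata1[i]]];
  }
}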
... which produces:
Fred Adams (123) 456-7891 800
Jane Doe (123) 456-7890 600
John Smith (123) 456-7892 900
In plain awk, the same result can be achieved by writing the array indices to a file, sorting that file and then reading the file back using getline.
The way to approach this is to produce the pre-SUMMARY output as you read the data, so you don't need to store all of your data in an array. Store only the people who contributed more than $500, and insert them into the array in the desired order using an insertion sort algorithm.
You would do it something like this:
awk -F':' '
NR==1 {
    print "header stuff"
    next
}
{
    tot = $3 + $4 + $5
    printf "%-20s%10s $%5s $%5s $%5s $%5s\n", $1, $2, $3, $4, $5, tot
}
tot > 500 {
    split($1,name,/ /)
    surname = name[2]
    # insertion sort: shift entries with a larger surname up one slot
    for (i=numContribs; i>=1 && surnames[i] > surname; i--) {
        surnames[i+1] = surnames[i]
        contribs[i+1] = contribs[i]
    }
    surnames[i+1] = surname
    contribs[i+1] = $1 " " $2
    numContribs++
}
END {
    print "SUMMARY and text below it and then the list of $500+ contributors:"
    for (i=1; i<=numContribs; i++) {
        print contribs[i]
    }
}
' lab4.data
The above is not a fully functional program. It's just intended to show you the right approach per your request.
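To make the insertion step easier to check, here is the same idea as a small Python sketch. It is an illustration only: the helper name insert_sorted is hypothetical, and the rows are a subset of lab4.data without the header line.

```python
def insert_sorted(entries, surname, text):
    """Insert (surname, text) into `entries`, keeping it ordered by surname.
    Entries with equal surnames keep their arrival order."""
    i = len(entries)
    # Walk left past every entry whose surname sorts after the new one.
    while i > 0 and entries[i - 1][0] > surname:
        i -= 1
    entries.insert(i, (surname, text))


rows = [
    "Mike Harrington:(510) 548-1278:250:100:175",
    "Christian Dobbins:(408) 538-2358:155:90:201",
    "Dan Savage:(406) 298-7744:450:300:275",
    "John Goldenrod:(916) 348-4278:250:100:175",
    "Tom Savage:(408) 926-3456:250:168:200",
]

eligible = []
for row in rows:
    name, phone, *amounts = row.split(":")
    if sum(int(a) for a in amounts) > 500:
        insert_sorted(eligible, name.split()[1], f"{name} {phone}")

for _, text in eligible:
    print(text)
```

Because each new entry is placed after any existing entry with the same surname, Dan Savage stays ahead of Tom Savage, matching the order of the input file.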
I have hundreds of files, each with two columns :
For example :
file1.txt
ID Value1
1 40
2 30
3 70
file2.txt
ID Value2
1 50
2 70
3 20
And so on, till
file150.txt
ID Value150
1 98
2 52
3 71
How do I merge these files based on the first column (which is common). My output should be
ID Value1 Value2...........Value150
1 40 50 98
2 30 70 52
3 70 20 71
Thank you.
You can use a combination of cut and paste to merge three or more files. cd to a folder that contains only file1, file2, file3, ... file150:
## note: cut -f assumes tab-separated columns
i=0
cut -f 1 file1 > delim ## keep the first (ID) column once
for file in file*
do
    i=$((i+1)) ## counter to distinguish the temp files from the original ones
    cut -f 2 "$file" > "${file}__${i}.temp"
done
paste -d'\t' delim file*__*.temp > output
Another solution is to use join, merging two files at a time in a pipeline (join expects its inputs to be sorted on the join field):
join -j 1 test1 test2 | join -j 1 test3 - | join -j 1 test4 -
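If a scripting language is an option, the merge can also be done in one pass over all files. The following Python sketch is an illustration under stated assumptions: the names merge_on_id and numeric_order are hypothetical, the columns are assumed to be whitespace-separated, and every file is assumed to contain the same IDs (a missing ID would leave that row short and misaligned).

```python
import re
from pathlib import Path


def merge_on_id(paths):
    """Merge two-column files that share the same first column.
    Returns the merged rows (header included) as lists of strings."""
    merged = {}  # ID or header label -> values, in first-seen order
    for path in paths:
        with open(path) as fh:
            for line in fh:
                parts = line.split()
                if len(parts) >= 2:
                    merged.setdefault(parts[0], []).append(parts[1])
    return [[key] + values for key, values in merged.items()]


def numeric_order(path):
    """Sort key so that file2.txt comes before file10.txt."""
    digits = re.findall(r"\d+", Path(path).name)
    return int(digits[0]) if digits else 0
```

A possible call would be merge_on_id(sorted(Path(".").glob("file*.txt"), key=numeric_order)), followed by writing each row joined with tabs. Sorting with numeric_order keeps the value columns in the order Value1, Value2, ..., Value150, which a plain alphabetical glob would not.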
I have a database that I have now combined using this function:
def ReadAndMerge():
    library1=input("Enter 1st filename to read and merge:")
    with open(library1, 'r') as library1names:
        library1contents = library1names.read()
    library2=input("Enter 2nd filename to read and merge:")
    with open(library2, 'r') as library2names:
        library2contents = library2names.read()
    print(library1contents)
    print(library2contents)
    combined_contents = library1contents + library2contents # concatenate text
    print(combined_contents)
    return(combined_contents)
The two databases originally looked like this
Bud Abbott 51 92.3
Mary Boyd 52 91.4
Hillary Clinton 50 82.1
and this
Don Adams 51 90.4
Jill Carney 53 76.3
Randy Newman 50 41.2
After being combined they now look like this
Bud Abbott 51 92.3
Mary Boyd 52 91.4
Hillary Clinton 50 82.1
Don Adams 51 90.4
Jill Carney 53 76.3
Randy Newman 50 41.2
If I wanted to sort this database by last name, how would I go about doing that?
Is there a sort function built in to Python, like for lists? Is this considered a list?
Or would I have to use another function that locates the last name and then orders the entries alphabetically?
You sort with the built-in sorted() function. But you can't sort one big string; you need to have the data in a list or something similar. Something like this (untested):
def get_library_names(): # Better name of function
    library1 = input("Enter 1st filename to read and merge:")
    with open(library1, 'r') as library1names:
        library1contents = library1names.readlines()
    library2 = input("Enter 2nd filename to read and merge:")
    with open(library2, 'r') as library2names:
        library2contents = library2names.readlines()
    print(library1contents)
    print(library2contents)
    # Sort on the second word of each line (the last name); the [1:2]
    # slice yields an empty list for blank lines instead of raising an error
    combined_contents = sorted(library1contents + library2contents,
                               key=lambda line: line.split()[1:2])
    print(combined_contents)
    return combined_contents