awk lookup table, blank column replacement - arrays

I'm trying to use a lookup table to do a search and replace for two specific columns and keep getting a blank column as output. I've followed the syntax for several examples of lookup tables that I've found on stack, but no joy. Here is a snippet from each of the files.
Sample lookup table -- want to search for instances of column 1 in my data file and replace them with the corresponding value in column 2 (first row is a header):
#xyz type
N 400
C13 401
13A 402
13B 402
13C 402
C14 405
The source file to be substituted has the following format:
1 N 0.293000 2.545000 16.605000 0 2 6 10 14
2 C13 0.197000 2.816000 15.141000 0 1
3 13A 1.173000 2.887000 14.676000 0
4 13B -0.319000 3.756000 14.937000 0
5 13C -0.351000 1.998000 14.678000 0
6 C14 0.749000 3.776000 17.277000 0 1
The corresponding values in column 2 of the lookup table will replace the values in column 6 of my source file (currently all zeroes). Here's the awk one-liner that I thought should work:
awk -v OFS='\t' 'NR==1 { next } FNR==NR { a[$1]=$2; next } $2 in a { $6=a[$1] }1' lookup.txt source.txt
But my output essentially deletes the entire entry for column 6:
1 N 0.293000 2.545000 16.605000 2 6 10 14
2 C13 0.197000 2.816000 15.141000 1
3 13A 1.173000 2.887000 14.676000
4 13B -0.319000 3.756000 14.937000
5 13C -0.351000 1.998000 14.678000
6 C14 0.749000 3.776000 17.277000 1
(The sixth column should be 400 to 405. I considered using sed, but I have duplicate values in the source and output columns of my lookup table, so that won't work in this case. What's frustrating is that I had this one-liner working on almost the exact same source file the other week, but now can only get this behavior. I'd love to be able to modify my awk call to do lookups of two different columns simultaneously, but wanted to start simple for now. Thanks!

You have $6=a[$1] instead of $6=a[$2] in your script.
$ awk -v OFS='\t' 'NR==FNR{map[$1]=$2; next} {$6=map[$2]} 1' file1 file2
1 N 0.293000 2.545000 16.605000 400 2 6 10 14
2 C13 0.197000 2.816000 15.141000 401 1
3 13A 1.173000 2.887000 14.676000 402
4 13B -0.319000 3.756000 14.937000 402
5 13C -0.351000 1.998000 14.678000 402
6 C14 0.749000 3.776000 17.277000 405 1

Related

Drop columns from a data frame but I keep getting this error below

enter image description here
enter image description here
enter image description here
enter image description here
enter image description here
enter image description here
enter image description here
No matter how I try to code this in R, I still cannot drop my columns so that I can build my logistic regression model. I tried to run it two different ways
cols<-c("EmployeeCount","Over18","StandardHours")
Trainingmodel1 <- DAT_690_Attrition_Proj1EmpAttrTrain[-cols,]
Error in -cols : invalid argument to unary operator
cols<-c("EmployeeCount","Over18","StandardHours")
Trainingmodel1 <- DAT_690_Attrition_Proj1EmpAttrTrain[!cols,]
Error in !cols : invalid argument type
This may solve your problem:
Trainingmodel1 <- DAT_690_Attrition_Proj1EmpAttrTrain[ , !colnames(DAT_690_Attrition_Proj1EmpAttrTrain) %in% cols]
Please note that if you want to drop columns, you should put your code inside [ on the right side of the comma, not on the left side.
So [, your_code] not [your_code, ].
Here is an example of dropping columns using the code above.
cols <- c("cyl", "hp", "wt")
mtcars[, !colnames(mtcars) %in% cols]
# mpg disp drat qsec vs am gear carb
# Mazda RX4 21.0 160.0 3.90 16.46 0 1 4 4
# Mazda RX4 Wag 21.0 160.0 3.90 17.02 0 1 4 4
# Datsun 710 22.8 108.0 3.85 18.61 1 1 4 1
# Hornet 4 Drive 21.4 258.0 3.08 19.44 1 0 3 1
# Hornet Sportabout 18.7 360.0 3.15 17.02 0 0 3 2
# Valiant 18.1 225.0 2.76 20.22 1 0 3 1
#...
Edit to Reproduce the Error
The error message you got indicates that there is a column that has only one, identical value in all rows.
To show this, let's try a logistic regression using a subset of mtcars data, which has only one, identical values in its cyl column, and then we use that column as a predictor.
mtcars_cyl4 <- mtcars |> subset(cyl == 4)
mtcars_cyl4
# mpg cyl disp hp drat wt qsec vs am gear carb
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
# Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
# Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
# Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
# Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
glm(am ~ as.factor(cyl) + mpg + disp, data = mtcars_cyl4, family = "binomial")
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
# contrasts can be applied only to factors with 2 or more levels
Now, compare it with the same logistic regression by using full mtcars data, which have various values in cyl column.
glm(am ~ as.factor(cyl) + mpg + disp, data = mtcars, family = "binomial")
# Call: glm(formula = am ~ as.factor(cyl) + mpg + disp, family = "binomial",
# data = mtcars)
#
# Coefficients:
# (Intercept) as.factor(cyl)6 as.factor(cyl)8 mpg disp
# -5.08552 2.40868 6.41638 0.37957 -0.02864
#
# Degrees of Freedom: 31 Total (i.e. Null); 27 Residual
# Null Deviance: 43.23
# Residual Deviance: 25.28 AIC: 35.28
It is likely that, even though you have drop three columns that have one,identical values in all the respective rows, there is another column in Trainingmodel1 that has one identical values. The identical values in the column were probably resulted during filtering the data frame and splitting data into training and test groups. Better to have a check by using summary(Trainingmodel1).
Further edit
I have checked the summary(Trainingmodel1) result, and it becomes clear that EmployeeNumber has one identical value (called "level" for a factor) in all rows. To run your regression properly, either you drop it from your model, or if EmployeeNumber has another level and you want to include it in your model, you should make sure that it contains at least two levels in the training data. It is possible to achieve that during splitting by repeating the random sampling until the randomly selected EmployeeNumber samples contain at least two levels. This can be done by looping using for, while, or repeat. It is possible, but I don't know how proper the repeated sampling is for your study.
As for your question about subsetting more than one variable, you can use subset and conditionals. For example, you want to get a subset of mtcars that has cyl == 4 and mpg > 20 :
mtcars |> subset(cyl == 4 & mpg > 20 )
If you want a subset that has cyl == 4 or mpg > 20:
mtcars |> subset(cyl == 4 | mpg > 20 )
You can also subset by using more columns as subset criteria:
mtcars |> subset((cyl > 4 & cyl <8) | (mpg > 20 & gear > 4 ))

How to aggregate number of notes sent to each user?

Consider the following tables
group (obj_id here is user_id)
group_id obj_id role
--------------------------
100 1 A
100 2 root
100 3 B
100 4 C
notes
obj_id ref_obj_id note note_id
-------------------------------------------
1 2 10
1 3 10
1 0 foobar 10
1 4 20
1 2 20
1 0 barbaz 20
2 0 caszes 30
2 1 30
4 1 70
4 0 taz 70
4 3 70
Note: a note in the system can be assigned to multiple users (for instance: an admin could write "sent warning to 2 users" and link it to 2 user_ids). The first user the note gets linked to is stored differently than the other linked users. The note itself is linked to the first linked user only. Whenever group.obj_id = notes.obj_id then ref_obj_id = 0 and note <> null
I need to make an overview of the notes per user. Normally I would do this by joining on group.obj_id = notes.obj_idbut here this goes wrong because of ref_obj_id being 0 (in which case I should join on notes.obj_id)
There are 4 notes in this system (foobar, barbaz, caszes and taz).
The desired output is:
obj_id user_is_primary notes_primary user_is_linked notes_linked
-------------------------------------------------------------------
1 2 10;20 2 30;70
2 1 30 2 10;20
3 0 2 10;70
4 1 70 1 20
How can I get to this aggregated result?
I hope that I was able to explain the situation clearly; perhaps it is my inexperience but I find the data model not the most straightforward.
Couldn't you simply put this in the ON clause of your join?
case when notes.ref_obj_id = 0 then notes.obj_id else notes.ref_obj_id end = group.obj_id

Using group by in Proc SQL for SAS

I am trying to summarize my data set using the proc sql, but I have repeated values in the output, a simple version of my code is:
PROC SQL;
CREATE TABLE perm.rx_4 AS
SELECT patid,ndc,fill_mon,
COUNT(dea) AS n_dea,
sum(DEDUCT) AS tot_DEDUCT
FROM perm.rx
GROUP BY patid,ndc,fill_mon;
QUIT;
Some sample output is:
Obs Patid Ndc FILL_mon n_dea DEDUCT
3815 33003605204 00054465029 2000-05 2 0
3816 33003605204 00054465029 2000-05 2 0
12257 33004361450 00406035701 2000-06 2 0
16564 33004744098 00603128458 2000-05 2 0
16565 33004744098 00603128458 2000-05 2 0
16566 33004744098 00603128458 2000-06 2 0
16567 33004744098 00603128458 2000-06 2 0
46380 33008165116 00406035705 2000-06 2 0
85179 33013674758 00406035801 2000-05 2 0
89248 33014228307 00054465029 2000-05 2 0
107514 33016949900 00406035805 2000-06 2 0
135047 33056226897 63481062370 2000-05 2 0
213691 33065594501 00472141916 2000-05 2 0
215192 33065657835 63481062370 2000-06 2 0
242848 33066899581 60432024516 2000-06 2 0
As you can see there are repeated out put, for example obs 3815,3816. I have saw some people had similar problem, but the answers didn't work for me.
The content of the dataset is this:
The SAS System 5
17:01 Thursday, December 3, 2015
The CONTENTS Procedure
Engine/Host Dependent Information
Data Set Page Size 65536
Number of Data Set Pages 210
First Data Page 1
Max Obs per Page 1360
Obs in First Data Page 1310
Number of Data Set Repairs 0
Filename /home/zahram/optum/rx_4.sas7bdat
Release Created 9.0401M2
Host Created Linux
Inode Number 424673574
Access Permission rw-r-----
Owner Name zahram
File Size (bytes) 13828096
The SAS System 6
17:01 Thursday, December 3, 2015
The CONTENTS Procedure
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat Label
3 FILL_mon Num 8 YYMMD. Fill month
2 Ndc Char 11 $11. $20. Ndc
1 Patid Num 8 19. Patid
4 n_dea Num 8
5 tot_DEDUCT Num 8
Sort Information
Sortedby Patid Ndc FILL_mon
Validated YES
Character Set ASCII
The SAS System 7
17:01 Thursday, December 3, 2015
The CONTENTS Procedure
Sort Information
Sort Option NODUPKEY
NOTE: PROCEDURE CONTENTS used (Total process time):
real time 0.08 seconds
cpu time 0.01 seconds
I'll guess that you have a format on a variable, most likely the date. Proc SQL does not aggregate over formatted values but will use the underlying values but still shows them as formatted, so they appear as duplicates. Your proc contents confirms this. You can get around this by converting this the variable to a character variable.
PROC SQL;
CREATE TABLE perm.rx_4 AS
SELECT patid,ndc, put(fill_mon, yymmd.) as fill_month,
COUNT(dea) AS n_dea,
sum(DEDUCT) AS tot_DEDUCT
FROM perm.rx
GROUP BY patid,ndc, calculated fill_month;
QUIT;

Adding and multiplying tables' data by values in another table

Say I have a table of subtractions and divisions sorted by date:
tblFactors
dt sub divide
2014-07-01 1 1
2014-06-01 0 5
2014-05-01 2 1
2014-05-01 0 3
I have another table of values, sorted by date:
tblValues
dt val
2014-07-05 4
2014-06-15 5
2014-05-15 21
2014-04-14 31
2014-03-15 71
I need to perform some sequential calculations. For the first value in tblFactors, I need to subtract 1 from every val where tblValues.dt < '2014-07-01'.
Next, I need to process the second row in tblFactors. There is nothing to subtract. However, the divide = 5 means that I need to divide every val by 5 where tblValues.dt < '2014-06-01'. The tricky thing is that I need to do this on the modified val from the row before (divide 20 / 5, not 21 / 5).
Each row in tblFactors would process in this manner, giving a sequence like this:
tblFactors: Row 1 Row 2 Row 3 Row 4
Dt Original Val Subtract 1 Divide by 5 Subtract 2 Divide by 3
7/5/2014 4
6/15/2014 5 4
5/15/2014 21 20 4
4/14/2014 31 30 6 4
3/25/2014 71 70 14 12 4
This would leave me with:
qryValues
dt val
2014-07-05 4
2014-06-15 4
2014-05-15 4
2014-04-14 4
2014-03-15 4
Right now I'm doing vector multiplications over loops in R. I was wondering if there was a clever way to accomplish this in the native sql. I tried doing some aggregations but I've had limited success.

Associating / linking an array column with another column in the array

I have an array that has some calcultations done on the second column. I would like the values from the third column to follow/be linked to the second column.
Test Code:
a1= [1,10,-11;
2,70,232;
3,33.2,-33;
4,40,44;]
a2calc=abs(a1(:,2)-max(a1(:,2))) %calculation
a2=[a1(:,1),a2calc,a1(:,3)] %new array
Example:
original a1 Array
1 10 -11
2 70 232
3 33.2 -33
4 40 44
a2 Array after column 2 calculations looks like this
1 60 -11
2 0 232
3 36.8 -33
4 30 44
I'm trying to get the final array to look like this (column 3 values follow / are linked to the second column)
1 60 232
2 0 -11
3 36.8 44
4 30 -33
What I'm having problems with is I'm not sure if I should use the index values of column 2 and if so how I can get it to look like the final output array I included in the question.
I might be wrong here, but it looks to me like the logic is:
After calculating the second column, change the order of the third column so that the third column is sorted the same way as the second. To see what I mean:
This represents the two columns, numbered from highest to lowest:
A = 1 1
4 3
2 2
3 4
If I understand it right, you want the resulting matrix to be
A = 1 1
4 4
2 2
3 3
If this is the right logic then you should check out sort with two outputs. You can use the second output to index the third column.
[~, idx] = sort(A(:, 2));
sorted_3 = sort(A(:, 3));
A(idx, 3) = sorted_3;
The output from this is:
A =
1.00000 60.00000 232.00000
2.00000 0.00000 -33.00000
3.00000 36.80000 44.00000
4.00000 30.00000 -11.00000
Good luck!

Resources