Apply a custom (weighted) dictionary to text based on sentiment analysis - inner-join

I am looking to adjust this code so that I can assign each of these modal verbs a different weight. The idea is to use something similar to the NRC lexicon, except that here the "numbers" 1-5 act as category labels rather than numeric values.
modals<-data_frame(word=c("must", "will", "shall", "should", "may", "can"),
modal=c("5", "4", "4", "3", "2", "1"))
My problem is that when I run the following code, five "may"s count the same as one "must". What I want is for each word to have a different weight, so that when I run this analysis I can see the concentration of uses of the stronger "must" versus, say, the much weaker "can". (Here "tidy.DF" is my corpus, and "School" and "Target" are the column names.)
MODAL<-tidy.DF %>%
inner_join(modals) %>%
count(School, Target, modal, index=wordnumber %/% 50, modal) %>%
spread(modal, n, fill=0)
ggplot(MODAL, aes(index, 5, fill=Target)) +
geom_col(show.legend=FALSE) +
facet_wrap(~Target, ncol=2, scales="free_x")
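One way to get weighted counts while staying in the tidyverse is to store the weight as a number and sum it instead of counting rows. Below is a minimal sketch, assuming the column names from the question (tidy.DF with word, School, Target and wordnumber columns); it is an illustration, not tested against your data.
library(dplyr)
# store the weight as a number rather than a character category
modals <- tibble(word   = c("must", "will", "shall", "should", "may", "can"),
                 weight = c(5, 4, 4, 3, 2, 1))
MODAL <- tidy.DF %>%
  inner_join(modals, by = "word") %>%
  group_by(School, Target, index = wordnumber %/% 50) %>%
  summarise(score = sum(weight))   # "must" adds 5 per use, "can" adds 1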

Here's a suggestion for a better approach, using the quanteda package instead. The approach:
Create a named vector of weights, corresponding to your "dictionary".
Create a document feature matrix, selecting only the terms in the dictionary.
Weight the observed counts.
# set modal values as a named numeric vector
modals <- c(5, 4, 4, 3, 2, 1)
names(modals) <- c("must", "will", "shall", "should", "may", "can")
library("quanteda", warn.conflicts = FALSE)
## Package version: 1.4.0
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
I'll use the most recent inaugural speeches as a reproducible example here.
dfmat <- data_corpus_inaugural %>%
corpus_subset(Year > 2000) %>%
dfm() %>%
dfm_select(pattern = names(modals))
This produces the raw counts.
dfmat
## Document-feature matrix of: 5 documents, 6 features (26.7% sparse).
## 5 x 6 sparse Matrix of class "dfm"
## features
## docs will must can should may shall
## 2001-Bush 23 6 6 1 0 0
## 2005-Bush 22 6 7 1 3 0
## 2009-Obama 19 8 13 0 3 3
## 2013-Obama 20 17 7 0 4 0
## 2017-Trump 40 3 1 1 0 0
Weighting this now is as simple as calling dfm_weight() to reweight the counts by the values of your weight vector. The function will automatically apply the weights using fixed matching of the vector element names to the dfm features.
dfm_weight(dfmat, weight = modals)
## Document-feature matrix of: 5 documents, 6 features (26.7% sparse).
## 5 x 6 sparse Matrix of class "dfm"
## features
## docs will must can should may shall
## 2001-Bush 92 30 6 3 0 0
## 2005-Bush 88 30 7 3 6 0
## 2009-Obama 76 40 13 0 6 12
## 2013-Obama 80 85 7 0 8 0
## 2017-Trump 160 15 1 3 0 0
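If you want these weighted scores back in an ordinary data frame (for example to plot with ggplot2), quanteda's convert() reshapes the dfm, and rowSums() collapses it to one weighted "modality score" per document. A small sketch using the objects above:
dfmat_weighted <- dfm_weight(dfmat, weight = modals)
# one row per document, one column per modal verb
convert(dfmat_weighted, to = "data.frame")
# or a single weighted modality score per document
rowSums(dfmat_weighted)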


Drop columns from a data frame but I keep getting this error below

No matter how I try to code this in R, I still cannot drop my columns so that I can build my logistic regression model. I tried to run it two different ways:
cols<-c("EmployeeCount","Over18","StandardHours")
Trainingmodel1 <- DAT_690_Attrition_Proj1EmpAttrTrain[-cols,]
Error in -cols : invalid argument to unary operator
cols<-c("EmployeeCount","Over18","StandardHours")
Trainingmodel1 <- DAT_690_Attrition_Proj1EmpAttrTrain[!cols,]
Error in !cols : invalid argument type
This may solve your problem:
Trainingmodel1 <- DAT_690_Attrition_Proj1EmpAttrTrain[ , !colnames(DAT_690_Attrition_Proj1EmpAttrTrain) %in% cols]
Please note that if you want to drop columns, the selection goes inside [ ] on the right side of the comma, not on the left side.
So [, your_code], not [your_code, ].
Here is an example of dropping columns using the code above.
cols <- c("cyl", "hp", "wt")
mtcars[, !colnames(mtcars) %in% cols]
# mpg disp drat qsec vs am gear carb
# Mazda RX4 21.0 160.0 3.90 16.46 0 1 4 4
# Mazda RX4 Wag 21.0 160.0 3.90 17.02 0 1 4 4
# Datsun 710 22.8 108.0 3.85 18.61 1 1 4 1
# Hornet 4 Drive 21.4 258.0 3.08 19.44 1 0 3 1
# Hornet Sportabout 18.7 360.0 3.15 17.02 0 0 3 2
# Valiant 18.1 225.0 2.76 20.22 1 0 3 1
#...
Edit to Reproduce the Error
The error message you got indicates that there is a column that has only one identical value in all rows.
To show this, let's try a logistic regression on a subset of the mtcars data in which the cyl column contains only a single value, and use that column as a predictor.
mtcars_cyl4 <- mtcars |> subset(cyl == 4)
mtcars_cyl4
# mpg cyl disp hp drat wt qsec vs am gear carb
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
# Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
# Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
# Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
# Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
glm(am ~ as.factor(cyl) + mpg + disp, data = mtcars_cyl4, family = "binomial")
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
# contrasts can be applied only to factors with 2 or more levels
Now compare it with the same logistic regression on the full mtcars data, where the cyl column takes several different values.
glm(am ~ as.factor(cyl) + mpg + disp, data = mtcars, family = "binomial")
# Call: glm(formula = am ~ as.factor(cyl) + mpg + disp, family = "binomial",
# data = mtcars)
#
# Coefficients:
# (Intercept) as.factor(cyl)6 as.factor(cyl)8 mpg disp
# -5.08552 2.40868 6.41638 0.37957 -0.02864
#
# Degrees of Freedom: 31 Total (i.e. Null); 27 Residual
# Null Deviance: 43.23
# Residual Deviance: 25.28 AIC: 35.28
It is likely that, even though you have dropped the three columns that hold a single identical value in all rows, there is another column in Trainingmodel1 with only one value. That constant column probably resulted from filtering the data frame and splitting the data into training and test groups. It is better to check with summary(Trainingmodel1).
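If the data frame is wide, scanning the summary() output by eye is tedious. Here is a small sketch (assuming Trainingmodel1 is your training data frame) that lists the constant columns and drops them with the same %in% idiom as above:
# names of columns whose values are all identical (constant columns)
constant_cols <- names(Filter(function(col) length(unique(col)) <= 1, Trainingmodel1))
constant_cols
# drop them before fitting the model
Trainingmodel1 <- Trainingmodel1[, !colnames(Trainingmodel1) %in% constant_cols]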
Further edit
I have checked the summary(Trainingmodel1) result, and it is clear that EmployeeNumber has only one value (called a "level" for a factor) in all rows. To run your regression properly, either drop it from your model, or, if EmployeeNumber has other levels that you want to include, make sure the training data contains at least two of them. One way to achieve that during splitting is to repeat the random sampling until the sampled EmployeeNumber values contain at least two levels, for example with a for, while, or repeat loop. It is possible, but I don't know how appropriate repeated sampling is for your study.
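A rough sketch of that resampling idea (full_data and the 70/30 split are hypothetical placeholders, not objects from the question): keep redrawing the split until the training rows contain at least two levels of EmployeeNumber.
repeat {
  train_idx <- sample(nrow(full_data), size = round(0.7 * nrow(full_data)))
  train <- full_data[train_idx, ]
  # stop once the training set contains at least two levels of EmployeeNumber
  if (length(unique(train$EmployeeNumber)) >= 2) break
}
test <- full_data[-train_idx, ]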
As for your question about subsetting on more than one variable, you can use subset with conditionals. For example, to get the subset of mtcars that has cyl == 4 and mpg > 20:
mtcars |> subset(cyl == 4 & mpg > 20 )
If you want a subset that has cyl == 4 or mpg > 20:
mtcars |> subset(cyl == 4 | mpg > 20 )
You can also subset by using more columns as subset criteria:
mtcars |> subset((cyl > 4 & cyl < 8) | (mpg > 20 & gear > 4))

Pandas How to Align Two Columns in a DataFrame and NaN empty cells

I'm using Python 3.8.8
I have a DataFrame structured like this:
A  B
0  1
1  2
2  1
3  7
4  7
5  8
and an array:
C = [3, 4, 7]
I would like to add the array "C" as a new column to the DataFrame. The problem is that this array has a different length than the df. I would like to make up for the difference in length by filling the empty cells with NaNs. My desired result would look something like:
A  B  C
0  1  NaN
1  2  NaN
2  1  3
3  7  4
4  7  7
5  8  NaN
What I am looking for specifically is a way to add C starting at a specific index of the df, but I don't know how to work around the discrepancy between the length of the df and array.
Thank you for your time
To get around the problem of 'different length' when putting your list into the dataframe, you can convert it to a pandas series. Once you do that, you can easily add it to your dataframe with the rest of the values being filled with np.nan.
In your case, you can also set the index when you convert your C list to a Series, which you can then assign to your DataFrame. Because pandas aligns data on the index, the Series values will land on the right rows.
Consider using the code below:
c = pd.Series([3, 4, 7],index=[2,3,4])
df['C'] = c
prints:
   A  B    C
0  0  1  NaN
1  1  2  NaN
2  2  1  3.0
3  3  7  4.0
4  4  7  7.0
5  5  8  NaN

xor after applying filters on an array

We have an original array and a list of filters, where each filter consists of the indices that are allowed through the filter. The filters are rather nice: there is one for each power of 2, grouped in the following way (the filters shown are for up to n = 20).
1 (2^0) = 1 3 5 7 9 11 13 15 17 19
2 (2^1) = 1 2 5 6 9 10 13 14 17 18
4 (2^2) = 1 2 3 4 9 10 11 12 17 18 19 20
8 (2^3) = 1 2 3 4 5 6 7 8 17 18 19 20
I hope you get the idea. Now we apply some or all of these filters (the user dictates which filters to apply) to the original array, and the xor of the elements of the transformed array is the answer. To take an example, if the original array were [3 7 8 1 2 9 6 4 11], i.e. n = 9, and we needed to apply the filters 4, 2 and 1, the transformations would look like this:
After applying filter of 4 - [3 7 8 1 x x x x 11]
After applying filter of 2 - [3 7 x x x x x x 11]
After applying filter of 1 - [3 x x x x x x x 11]
Now the xor of 3 and 11, i.e. 8, is the answer. I can solve this in O(n * number of filters) time, but I need a better solution that gives the answer in O(number of filters) time. Is there any way to take advantage of the properties of xor and/or to pre-compute results so that each query can be answered faster? There are many queries with filters, so I need to answer each one in O(number of filters) time. Any kind of help will be appreciated.
It can be done in O(M) where M is the number of items that pass all filters (independent of the number of filters) by iterating over the array in a particular way, generating only the indexes that pass all the filters.
This is easier to see if you write down the examples starting at zero:
1: 0 2 4 6 8 10 12 14 16 18 (numbers that don't contain 1)
2: 0 1 4 5 8 9 12 13 16 17 (numbers that don't contain 2, etc)
4: 0 1 2 3 8 9 10 11 16 17 18 19
8: 0 1 2 3 4 5 6 7 16 17 18 19
The filters are really just a constraint on the bits of the indexes in the array. That constraint is of the form index & filters = 0, where filters is the sum of all the individual filters (e.g. 1 + 2 + 4 = 7). Given a valid index i, the next valid index i' can be computed with only primitive operations: i' = ((i | filters) + 1) & ~filters. The idea is to set the filtered bits to one so that the +1 carries through them, then clear the filtered bits again to make the index valid. The total effect is that the unfiltered bits are incremented and the filtered bits stay zero.
This gives a simple algorithm to iterate directly over all valid indexes. Start at 0 (which is always valid) and increment using the rule above until the end of the array is reached:
for (int i = 0; i < N; i = (i | filters) + 1 & ~filters)
// do something with array[i], like XOR them all together
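For a concrete, runnable illustration of the same loop (a sketch, not part of the original answer), here is the idea in R using the base bitwAnd/bitwOr/bitwXor/bitwNot functions; the xor_filtered() helper is hypothetical and maps the 0-based index back to R's 1-based indexing:
xor_filtered <- function(arr, filters) {
  result <- 0L
  i <- 0L                                   # 0-based index; 0 always passes
  while (i < length(arr)) {
    result <- bitwXor(result, arr[i + 1L])  # arr is 1-based in R
    # increment the unfiltered bits; the filtered bits stay zero
    i <- bitwAnd(bitwOr(i, filters) + 1L, bitwNot(filters))
  }
  result
}
# example from the question: filters 4 + 2 + 1 = 7 leave indexes 0 and 8,
# so the answer is 3 xor 11 = 8
xor_filtered(c(3L, 7L, 8L, 1L, 2L, 9L, 6L, 4L, 11L), 7L)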

Julia: Sort the columns of a matrix by the values in another vector (in place...)?

I am interested in sorting the columns of a matrix in terms of the values in 2 other vectors. As an example, suppose the matrix and vectors look like this:
M = [ 1 2 3 4 5 6 ;
7 8 9 10 11 12 ;
13 14 15 16 17 18 ]
v1 = [ 2 , 6 , 6 , 1 , 3 , 2 ]
v2 = [ 3 , 1 , 2 , 7 , 9 , 1 ]
I want to sort the columns of M in terms of their corresponding values in v1 and v2, with v1 taking precedence over v2. Additionally, I am interested in sorting the matrix in place, as the matrices I am working with are very large. Currently, my crude solution looks like this:
MM = [ v1' ; v2' ; M ] ; ## concatenate the vectors with the matrix
MM[:,:] = sortcols(MM , by=x->(x[1],x[2]))
M[:,:] = MM[3:end,:]
which gives the desired result:
3x6 Array{Int64,2}:
4 6 1 5 2 3
10 12 7 11 8 9
16 18 13 17 14 15
Clearly my approach is not ideal, as it requires computing and storing intermediate matrices. Is there a more efficient/elegant approach for sorting the columns of a matrix in terms of 2 other vectors? And can it be done in place to save memory?
Previously I have used sortperm for sorting an array in terms of the values stored in another vector. Is it possible to use sortperm with 2 vectors (and in-place)?
I would probably do it this way:
julia> cols = sort!([1:size(M,2);], by=i->(v1[i],v2[i]));
julia> M[:,cols]
3×6 Array{Int64,2}:
4 6 1 5 2 3
10 12 7 11 8 9
16 18 13 17 14 15
This should be pretty fast and uses only one temporary vector and one copy of the matrix. It's not fully in-place, but doing this operation completely in-place is not easy. You would need a sorting function that moves columns as it works, or alternatively a version of permute! that works on columns. You could start with the code for permute!! in combinatorics.jl and modify it to permute columns, reusing a single column-size temporary buffer.

How can I scale an array to another length preserving its approximate values in R

I have two arrays with different lengths
value <- c(1,1,1,4,4,4,1,1,1)
time <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
How can I resize the value array to make it the same length as the time array, preserving its approximate values?
The approx() function complains that the lengths differ.
I want the value array to end up like this:
value <- c(1,1,1,1,1,4,4,4,4,4,4,1,1,1,1)
time <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
so the lengths are equal.
UPD
Okay, the main goal is to calculate the correlation of v1 with v2, where
v1 is inside the data.frame (v1, t1), and v2 is inside the data.frame (v2, t2).
The (v1, t1) and (v2, t2) data frames have different lengths, but we know that t1 and t2 cover the same time period, so we can overlay them.
For t1 we have 1,3,5,7,9 and for t2 we have 1,2,3,4,5,6,7,8,9,10.
The problem is that the two data frames were recorded separately but simultaneously, so I need to rescale one of them to overlay it on the other data frame. Then I can calculate the correlation of how v1 affects v2.
That is why I need to scale v1 to the length of t2.
Sorry, I don't know how to phrase the goal correctly in English.
You may use the xout argument in approx
"xout: an optional set of numeric values specifying where interpolation is to take place.".
# create some fake data, which I _think_ may resemble the data you described in edit.
set.seed(123)
# "for t1 we have 1,3,5,7,9"
df1 <- data.frame(time = c(1, 3, 5, 7, 9), value = sample(1:10, 5))
df1
# "for t2 we have 1,2,3,4,5,6,7,8,9,10", the 'full time series'.
df2 <- data.frame(time = 1:10, value = sample(1:10))
# interpolate using approx and the xout argument
# The time values for 'full time series', df2$time, is used as `xout`.
# default values of arguments (e.g. linear interpolation, no extrapolation)
interpol1 <- with(df1, approx(x = time, y = value, xout = df2$time))
# some arguments you may wish to check
# extrapolation rules
interpol2 <- with(df1, approx(x = time, y = value, xout = df2$time,
rule = 2))
# interpolation method ('last observation carried forward')
interpol3 <- with(df1, approx(x = time, y = value, xout = df2$time,
rule = 2, method = "constant"))
df1
# time value
# 1 1 3
# 2 3 8
# 3 5 4
# 4 7 7
# 5 9 6
interpol1
# $x
# [1] 1 2 3 4 5 6 7 8 9 10
#
# $y
# [1] 3.0 5.5 8.0 6.0 4.0 5.5 7.0 6.5 6.0 NA
interpol3
# $x
# [1] 1 2 3 4 5 6 7 8 9 10
#
# $y
# [1] 3 3 8 8 4 4 7 7 6 6
# correlation between a vector of inter-(extra-)polated values
# and the 'full' time series
cor.test(interpol3$y, df2$value)
This little function tries to pad the values in the shorter vector out as evenly as possible and is generalisable. Haven't thought too much about edge cases, and I am sure there are many that break it. Plus it seems like it could be simplified, but is this what you are looking to do...
pad <- function(x, y){
  fill <- length(y) - length(x)
  run <- rle(x)
  add <- fill %/% length(run$lengths)
  pad <- diff(c(0, as.integer(seq(add, fill, length.out = length(run$lengths)))))
  rep(run$values, times = run$lengths + pad)
}
pad(value,time)
[1] 1 1 1 1 1 4 4 4 4 4 1 1 1 1 1
Or e.g.
value <- 1:2
time <- 1:10
pad(value,time)
[1] 1 1 1 1 1 2 2 2 2 2
