How to release heap memory on Apache Drill once the query is complete? - heap-memory

The problem is quite simple: every time I run a query on Drill, the heap memory keeps accumulating. My heap is 7 GB, but it's not getting released. Every 15 minutes I have to kill Drill and start it again to clear the heap memory.
Current Config:
-) I am running Apache Drill on a single node. Queries are executed on Drill using the R package 'sergeant', and the target files are usually Parquet files. The OS is Windows 7 Enterprise.
-) We first build the query using src_drill and then use drl_con to execute it. Building the query and then executing it separately is an architectural choice: we want the application to be able to switch between different query engines, like SQL, Hive, Spark, etc.
library(sergeant)
library(magrittr) # provides the %<>% compound-assignment pipe used below
# setting up the drill query; I do not use collect() here
ds <- src_drill("localhost")
query <- tbl(ds, "cp.`employee.json`")
query %<>% dbplyr::sql_render()
# using drill con to execute the query
drl_con <- drill_connection("localhost")
Mapping <- drill_query(drl_con, query, .progress = FALSE)
## # A tibble: 100 x 16
## employee_id full_name first_name last_name position_id position_title store_id department_id birth_date hire_date
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 Sheri No… Sheri Nowmer 1 President 0 1 1961-08-26 1994-12-…
## 2 2 Derrick … Derrick Whelply 2 VP Country Ma… 0 1 1915-07-03 1994-12-…
## 3 4 Michael … Michael Spence 2 VP Country Ma… 0 1 1969-06-20 1998-01-…
## 4 5 Maya Gut… Maya Gutierrez 2 VP Country Ma… 0 1 1951-05-10 1998-01-…
## 5 6 Roberta … Roberta Damstra 3 VP Informatio… 0 2 1942-10-08 1994-12-…
## 6 7 Rebecca … Rebecca Kanagaki 4 VP Human Reso… 0 3 1949-03-27 1994-12-…
## 7 8 Kim Brun… Kim Brunner 11 Store Manager 9 11 1922-08-10 1998-01-…
## 8 9 Brenda B… Brenda Blumberg 11 Store Manager 21 11 1979-06-23 1998-01-…
## 9 10 Darren S… Darren Stanz 5 VP Finance 0 5 1949-08-26 1994-12-…
## 10 11 Jonathan… Jonathan Murraiin 11 Store Manager 1 11 1967-06-20 1998-01-…
## # … with 90 more rows, and 6 more variables: salary <chr>, supervisor_id <chr>, education_level <chr>,
## # marital_status <chr>, gender <chr>, management_role <chr>
Ideally I would expect Drill to garbage-collect the heap on its own after every query, but that is not happening.

Apache Drill has its own memory manager.
In Task Manager it never appears to release the heap, but in the background it starts reusing heap memory once it is full.
If you are running into memory issues, chances are you are exceeding one of the other memory parameters, such as the total memory allotted to a single query.
Recycling of heap memory is not something you should be worried about.
Refer to: https://books.google.com.au/books?id=-Tp7DwAAQBAJ&printsec=frontcover&dq=apache+drill+nook&hl=en&sa=X&ved=0ahUKEwil7LeJuPzkAhXKZSsKHUDoBw4Q6AEIKjAA#v=onepage&q&f=false for more details
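If you do want to resize the heap rather than just restart Drill, the drillbit's JVM sizes are controlled by environment variables read by Drill's launch scripts (conf/drill-env.sh on Linux; on Windows you would set the equivalent variables in the environment before launching). A minimal sketch — the values here are illustrative examples, not recommendations:

```shell
# conf/drill-env.sh
export DRILL_HEAP="4G"                # JVM heap used by the drillbit
export DRILL_MAX_DIRECT_MEMORY="8G"   # off-heap (direct) memory used for query execution
```

Note that the heap figure shown in Task Manager includes memory the JVM has reserved but not yet garbage-collected, so a flat 7 GB reading does not by itself indicate a leak.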


Drop columns from a data frame but I keep getting this error below

No matter how I try to code this in R, I still cannot drop my columns so that I can build my logistic regression model. I tried to run it two different ways:
cols<-c("EmployeeCount","Over18","StandardHours")
Trainingmodel1 <- DAT_690_Attrition_Proj1EmpAttrTrain[-cols,]
Error in -cols : invalid argument to unary operator
cols<-c("EmployeeCount","Over18","StandardHours")
Trainingmodel1 <- DAT_690_Attrition_Proj1EmpAttrTrain[!cols,]
Error in !cols : invalid argument type
This may solve your problem:
Trainingmodel1 <- DAT_690_Attrition_Proj1EmpAttrTrain[ , !colnames(DAT_690_Attrition_Proj1EmpAttrTrain) %in% cols]
Please note that if you want to drop columns, you should put your code inside [ on the right side of the comma, not on the left side.
So [, your_code] not [your_code, ].
Here is an example of dropping columns using the code above.
cols <- c("cyl", "hp", "wt")
mtcars[, !colnames(mtcars) %in% cols]
# mpg disp drat qsec vs am gear carb
# Mazda RX4 21.0 160.0 3.90 16.46 0 1 4 4
# Mazda RX4 Wag 21.0 160.0 3.90 17.02 0 1 4 4
# Datsun 710 22.8 108.0 3.85 18.61 1 1 4 1
# Hornet 4 Drive 21.4 258.0 3.08 19.44 1 0 3 1
# Hornet Sportabout 18.7 360.0 3.15 17.02 0 0 3 2
# Valiant 18.1 225.0 2.76 20.22 1 0 3 1
#...
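An equivalent way to write the same drop, using setdiff() on the column names instead of a negated %in% (purely a stylistic alternative; the result is identical):

```r
cols <- c("cyl", "hp", "wt")
# keep every column whose name is not in cols
mtcars[, setdiff(colnames(mtcars), cols)]
```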
Edit to Reproduce the Error
The error message you got indicates that there is a column that has only one distinct value across all rows.
To show this, let's try a logistic regression on a subset of the mtcars data that has only one distinct value in its cyl column, and then use that column as a predictor.
mtcars_cyl4 <- mtcars |> subset(cyl == 4)
mtcars_cyl4
# mpg cyl disp hp drat wt qsec vs am gear carb
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
# Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
# Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
# Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
# Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
glm(am ~ as.factor(cyl) + mpg + disp, data = mtcars_cyl4, family = "binomial")
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
# contrasts can be applied only to factors with 2 or more levels
Now, compare it with the same logistic regression using the full mtcars data, which has several distinct values in the cyl column.
glm(am ~ as.factor(cyl) + mpg + disp, data = mtcars, family = "binomial")
# Call: glm(formula = am ~ as.factor(cyl) + mpg + disp, family = "binomial",
# data = mtcars)
#
# Coefficients:
# (Intercept) as.factor(cyl)6 as.factor(cyl)8 mpg disp
# -5.08552 2.40868 6.41638 0.37957 -0.02864
#
# Degrees of Freedom: 31 Total (i.e. Null); 27 Residual
# Null Deviance: 43.23
# Residual Deviance: 25.28 AIC: 35.28
It is likely that, even though you have dropped three columns that each held a single identical value in all rows, there is another column in Trainingmodel1 with only one distinct value. The identical values probably resulted from filtering the data frame and splitting the data into training and test groups. Better to check by using summary(Trainingmodel1).
Further edit
I have checked the summary(Trainingmodel1) result, and it becomes clear that EmployeeNumber has a single identical value (called a "level" for a factor) in all rows. To run your regression properly, either drop it from your model, or, if EmployeeNumber has other levels that you want to include, make sure the training data contains at least two of them. You could achieve that during splitting by repeating the random sampling (looping with for, while, or repeat) until the sampled EmployeeNumber values contain at least two levels. It is possible, but I don't know how appropriate such repeated sampling would be for your study.
As for your question about subsetting on more than one variable, you can use subset with conditionals. For example, to get a subset of mtcars with cyl == 4 and mpg > 20:
mtcars |> subset(cyl == 4 & mpg > 20 )
If you want a subset that has cyl == 4 or mpg > 20:
mtcars |> subset(cyl == 4 | mpg > 20 )
You can also subset by using more columns as subset criteria:
mtcars |> subset((cyl > 4 & cyl <8) | (mpg > 20 & gear > 4 ))

Obtaining data from an array to a dataframe

So I have 2 datasets; the first one is a data frame:
df1 <- data.frame(user=c(1:10), h01=c(3,3,6,8,9,10,4,1,2,5), h12=c(5,5,3,4,1,2,8,8,9,10),a=numeric(10))
The first column represents the user id, and h01 represents the id of the cell phone antenna the user is connected to for a period of time (00:00 - 1:00AM), while h12 represents the same for 1:00AM - 2:00AM.
And then I have an array:
array1 <- array(c(23,12,63,11,5,6,9,41,23,73,26,83,41,51,29,10,1,5,30,2), dim=c(10,2))
The rows represent the cell phone antenna id, the columns represent the periods of time, and the values in array1 represent how many people are connected to that antenna in that period. So array1[1,1] gives how many people are connected to antenna 1 between 00:00 and 1:00, array1[2,2] gives how many are connected to antenna 2 between 1:00 and 2:00, and so on.
What I want to do is, for each user in df1, obtain from array1 how many people in total are connected to the same antennas in the same periods of time, and place the value in column a.
For example, the first user is connected to antenna 3 between 00:00 and 1:00AM and antenna 5 between 1:00AM and 2:00AM, so the value in a should be array1[3,1] plus array1[5,2].
I used a for loop to do this
aux1 <- df1[, 2]
aux2 <- df1[, 3]
for (i in 1:length(df1$user)) {
  df1[i, 4] <- sum(array1[aux1[i], 1], array1[aux2[i], 2])
}
which gives
user h01 h12 a
1 1 3 5 92
2 2 3 5 92
3 3 6 3 47
4 4 8 4 92
5 5 9 1 49
6 6 10 2 156
7 7 4 8 16
8 8 1 8 28
9 9 2 9 42
10 10 5 10 7
This loop works and gives the correct values; the problem is that the 2 datasets (df1 and array1) are really big. df1 has over 20,000 users and 24 periods of time, and array1 has over 1300 antennas, not to mention that this data corresponds to users from one socioeconomic level, and I have 5 in total, so simplifying the code is mandatory.
I would love it if someone could show me a different approach to this, especially one without a for loop.
Try this approach:
df1$a <- array1[df1$h01,1] + array1[df1$h12,2]
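Using the df1 and array1 defined in the question, a quick check confirms that the vectorized version reproduces the loop's column a: indexing array1 with a vector of row numbers does all the lookups at once, one per time period, and the two lookups are added element-wise.

```r
df1 <- data.frame(user = 1:10,
                  h01 = c(3, 3, 6, 8, 9, 10, 4, 1, 2, 5),
                  h12 = c(5, 5, 3, 4, 1, 2, 8, 8, 9, 10))
array1 <- array(c(23, 12, 63, 11, 5, 6, 9, 41, 23, 73,
                  26, 83, 41, 51, 29, 10, 1, 5, 30, 2), dim = c(10, 2))

# one vectorized lookup per time period, added element-wise -- no loop needed
df1$a <- array1[df1$h01, 1] + array1[df1$h12, 2]
df1$a
# [1]  92  92  47  92  49 156  16  28  42   7
```

These match the values produced by the for loop in the question.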

Clustering Coefficient using SQL Server/C#

I have two tables in SQL Server. One table is GraphNodes:
---------------------------------------------------------
id | Node_ID | Node | Node_Label | Node_Type
---------------------------------------------------------
1 677 Nuno Vasconcelos Author 1
2 1359 Peng Shi Author 1
3 6242 Z. Q. Shi Author 1
4 8318 Kiyoung Choi Author 1
5 12405 Johan A. K. Author 1
6 26615 Tzung-Pei Hong Author 1
7 30559 Luca Benini Author 1
...
...
and the other table is GraphEdges:
-----------------------------------------------------------------------------------------
id | Source_Node | Source_Node_Type | Target_Node | Target_Node_Type | Year | Edge_Type
-----------------------------------------------------------------------------------------
1 1 1 10965 2 2005 1
2 1 1 10179 2 2007 1
3 1 1 10965 2 2007 1
4 1 1 19741 2 2007 1
5 1 1 10965 2 2009 1
6 1 1 4816 2 2011 1
7 1 1 5155 2 2011 1
...
...
I also have two lookup tables, GraphNodeTypes:
-------------------------
id | Node | Node_Type
-------------------------
1 Author 1
2 CoAuthor 2
3 Venue 3
4 Paper 4
and GraphEdgeTypes as:
-------------------------------
id | Edge | Edge_Type
-------------------------------
1 AuthorCoAuthor 1
2 CoAuthorVenue 2
3 AuthorVenue 3
4 PaperVenue 4
5 AuthorPaper 5
6 CoAuthorPaper 6
Now, I want to calculate the clustering coefficient for this graph, in two variants:
If N(V) is the number of links between the neighbors of node V, and K(V) is the degree of node V, then
Local Clustering Coefficient(V) = 2 * N(V) / (K(V) * (K(V) - 1))
and
Global Clustering Coefficient = 3 * (# of triangles) / (# of connected triplets)
The question is, how can I calculate the degree of a node? Is that possible in SQL Server, or is C# programming required? Please also suggest hints for calculating the local and global CCs.
Thanks!
The degree of a node is not "calculated". It's simply the number of edges this node has.
While you can try to do this in SQL, the performance will likely be mediocre. Such type of analysis is commonly done in specialized databases and, if possible, in memory.
Count the degree of each vertex as the number of edges connected to it. COUNT(Source_Node) with GROUP BY Source_Node will be helpful here.
To find N(V), you can join the edge table with itself, then intersect the result with the edge table; from that result, take the COUNT() for each vertex.
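As a sketch of the degree count in T-SQL (table and column names taken from the question; assuming the graph is treated as undirected, so a node's degree counts its appearances at either end of an edge):

```sql
-- Degree K(V): number of edges touching each node
SELECT Node_ID, COUNT(*) AS Degree
FROM (
    SELECT Source_Node AS Node_ID FROM GraphEdges
    UNION ALL
    SELECT Target_Node AS Node_ID FROM GraphEdges
) AS EdgeEnds
GROUP BY Node_ID;
```

With degrees in hand, N(V) needs the self-join described above, and the global coefficient would require joining GraphEdges three times to count triangles; on a large graph the performance caveat in the other answer applies.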

Using group by in Proc SQL for SAS

I am trying to summarize my data set using PROC SQL, but I get repeated values in the output. A simple version of my code is:
PROC SQL;
CREATE TABLE perm.rx_4 AS
SELECT patid,ndc,fill_mon,
COUNT(dea) AS n_dea,
sum(DEDUCT) AS tot_DEDUCT
FROM perm.rx
GROUP BY patid,ndc,fill_mon;
QUIT;
Some sample output is:
Obs Patid Ndc FILL_mon n_dea DEDUCT
3815 33003605204 00054465029 2000-05 2 0
3816 33003605204 00054465029 2000-05 2 0
12257 33004361450 00406035701 2000-06 2 0
16564 33004744098 00603128458 2000-05 2 0
16565 33004744098 00603128458 2000-05 2 0
16566 33004744098 00603128458 2000-06 2 0
16567 33004744098 00603128458 2000-06 2 0
46380 33008165116 00406035705 2000-06 2 0
85179 33013674758 00406035801 2000-05 2 0
89248 33014228307 00054465029 2000-05 2 0
107514 33016949900 00406035805 2000-06 2 0
135047 33056226897 63481062370 2000-05 2 0
213691 33065594501 00472141916 2000-05 2 0
215192 33065657835 63481062370 2000-06 2 0
242848 33066899581 60432024516 2000-06 2 0
As you can see there is repeated output, for example obs 3815 and 3816. I have seen some people with a similar problem, but the answers didn't work for me.
The content of the dataset is this:
The SAS System 5
17:01 Thursday, December 3, 2015
The CONTENTS Procedure
Engine/Host Dependent Information
Data Set Page Size 65536
Number of Data Set Pages 210
First Data Page 1
Max Obs per Page 1360
Obs in First Data Page 1310
Number of Data Set Repairs 0
Filename /home/zahram/optum/rx_4.sas7bdat
Release Created 9.0401M2
Host Created Linux
Inode Number 424673574
Access Permission rw-r-----
Owner Name zahram
File Size (bytes) 13828096
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat Label
3 FILL_mon Num 8 YYMMD. Fill month
2 Ndc Char 11 $11. $20. Ndc
1 Patid Num 8 19. Patid
4 n_dea Num 8
5 tot_DEDUCT Num 8
Sort Information
Sortedby Patid Ndc FILL_mon
Validated YES
Character Set ASCII
Sort Option NODUPKEY
NOTE: PROCEDURE CONTENTS used (Total process time):
real time 0.08 seconds
cpu time 0.01 seconds
I'll guess that you have a format on a variable, most likely the date. PROC SQL does not aggregate over formatted values; it groups by the underlying values but still displays them formatted, so they appear as duplicates. Your PROC CONTENTS output confirms this: FILL_mon is numeric with a YYMMD. format. You can get around this by converting the variable to character.
PROC SQL;
CREATE TABLE perm.rx_4 AS
SELECT patid,ndc, put(fill_mon, yymmd.) as fill_month,
COUNT(dea) AS n_dea,
sum(DEDUCT) AS tot_DEDUCT
FROM perm.rx
GROUP BY patid,ndc, calculated fill_month;
QUIT;

Selecting specific rows based on values in 2 columns in R

I have a large data set of GPS collar locations that have a varying number of locations each day. I want to separate out only the days that have a single location collected and make a new data frame containing all their information.
month day easting northing time ID
6 1 ####### ######## 0:00 ##
6 2 ####### ######## 6:00 ##
6 2 ####### ######## 0:00 ##
6 3 ####### ######## 18:00 ##
6 3 ####### ######## 12:00 ##
6 4 ####### ######## 0:00 ##
6 5 ####### ######## 6:00 ##
Currently I have hashed together something, but can't quite get to the next step.
library(plyr)
dog<-count(data1,vars=c("MONTH","day"))
datasub1<-subset(dog,freq==1)
This gives me a readout that looks like
MONTH day freq
1 6 29 1
7 7 5 1
8 7 6 1
10 7 8 1
12 7 10 1
I am trying to use the Month and day values to pull out the matching rows from the main dataset, so that I can make a data frame containing only the points with a frequency of 1, but with all the associated data. I've got to this point:
sis<-c(datasub1$MONTH)
bro<-c(datasub1$day)
datasub2<-subset(data1,MONTH==sis&day==bro)
... but that doesn't give me anything, although intuitively (I'm an R beginner) it seems like it should subset out the rows that match both the bro and sis values.
Any help would be greatly appreciated.
Revised:
datasub2<-subset(data1, paste(month,day,sep=".") %in% paste(datasub1$MONTH, datasub1$day,sep=".") )
It's not very likely (and quite possibly impossible) that any particular MONTH item will exactly equal that subset: == recycles the shorter vector element-wise rather than testing membership. You are presumably more interested in whether a "Month.day" combination is in the set of "Month.day" combinations in datasub1. You have also mixed up the capitalization that the count() function returns, if the headers were as you illustrated.
> dog
month day freq
1 6 1 1
2 6 2 2
3 6 3 2
4 6 4 1
5 6 5 1
> datasub1
month day freq
1 6 1 1
4 6 4 1
5 6 5 1
> datasub2
month day easting northing time ID
1 6 1 ####### ######## 0:00 ##
6 6 4 ####### ######## 0:00 ##
7 6 5 ####### ######## 6:00 ##
After this:
library(plyr)
dog<-count(data1,vars=c("MONTH","day"))
try this:
indx = which(dog$freq==1)
data1[indx,]
data1[rownames(datasub1), ]
This is an extension of the OP's original thinking, though it may not be what they're after; it is really just what Wesley suggested, carrying the OP's original steps one step further (minus the bro/sis part, which confused me a bit, for the same reason DWin gave). You're after the row names, not the values in those columns; you've already got that information, and the row names carry it back to the original data set.
n <- 100
data1 <- data.frame(
Accuracy = round(runif(n, 0, 5), 1),
MONTH = sample(1:5, n, replace=TRUE),
day = sample(1:28, n, replace=TRUE),
Easting = rnorm(n),
Northing = rnorm(n),
Etc = rnorm(n)
)
library(plyr)
dog<-count(data1,vars=c("MONTH","day"))
datasub1<-subset(dog,freq==1)
data1[rownames(datasub1), ]
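Another loop-free option, not shown in the answers above: base R's ave() can attach the per-(MONTH, day) row count directly to every row, so the single-location days can be kept in one subsetting step without plyr or row-name bookkeeping. A sketch on simulated data shaped like the question's:

```r
set.seed(1)
n <- 100
data1 <- data.frame(
  MONTH    = sample(1:5, n, replace = TRUE),
  day      = sample(1:28, n, replace = TRUE),
  easting  = rnorm(n),
  northing = rnorm(n)
)

# for each row, how many rows share its MONTH/day combination
n_locs <- ave(seq_len(nrow(data1)), data1$MONTH, data1$day, FUN = length)

# keep only the days observed exactly once, with all their columns intact
datasub2 <- data1[n_locs == 1, ]
```

Because the count lives in a vector parallel to data1, there is no second data frame to match back against the original.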
