I need to further prepare my data set in order to apply the apriori algorithm.
There are only two columns:
The first column is transaction_id.
The second column is item_name, formatted as c("" "a" "b" "c"...)
I run:
rules <- apriori(nz.mb, parameter = list(supp = 0.001, conf = 0.8))
I get an error:
Error in asMethod(object) :
column(s) 2 not logical or a factor. Discretize the columns first.
So I run:
nz.mb$item_name <- discretize(nz.mb$item_name)
I get another error:
Error in min(x, na.rm = TRUE) : invalid 'type' (list) of argument
What is my next step with item_name so that it's formatted correctly for apriori?
Most Apriori implementations expect a data set in this form, with one column per item and one row per transaction:
a b c d
1 1 1 0 means a,b,c are there
1 0 0 1 means a,d are there
Either use this form or check the documentation for the supported data format.
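The binary layout described above can be sketched as follows. This is a minimal illustration in Python rather than R, and the transactions dictionary is made-up sample data, not the asker's nz.mb; the point is only how a list of items per transaction turns into one 0/1 column per item.

```python
# Hypothetical sample data: transaction_id -> items in that transaction.
transactions = {
    1: ["a", "b", "c"],
    2: ["a", "d"],
}

# Every distinct item becomes one column, in sorted order.
items = sorted({item for basket in transactions.values() for item in basket})

# One row per transaction: 1 if the item is present, 0 otherwise.
matrix = {
    tid: [1 if item in basket else 0 for item in items]
    for tid, basket in transactions.items()
}

print(" ".join(items))
for tid, row in matrix.items():
    print(" ".join(str(flag) for flag in row))
```

With the sample data, the two printed rows match the two example rows shown above.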
Here is a data.frame df in which I want to convert the value column to g/g (grams/gram). The first two entries are in ug/mg (micrograms/milligram) and the last two are in ng/mg (nanograms/milligram). However, ud.convert() only seems to consider the first unit entry it encounters (i.e. ug/mg) and converts all value entries from that unit, ignoring the change in units in row #3.
require(udunits2)
df = data.frame(
value = rep(1,4),
unit = c(rep('ug/mg', 2), rep('ng/mg', 2)),
stringsAsFactors = FALSE
)
df$value2 = ud.convert(df$value, df$unit, 'g/g')
df
#   value  unit value2
# 1     1 ug/mg  0.001
# 2     1 ug/mg  0.001
# 3     1 ng/mg  0.001
# 4     1 ng/mg  0.001
Every other R function I can think of does such an operation for each row. Consider paste() or substr():
paste(df$value, df$unit, 'g/g', sep = '---')
substr(df$unit,1,2)
In my opinion this is a very un-R behavior of ud.convert() and should be changed or at least a warning should be given. Or am I overlooking something? The conversion happens in the C-function R_ut_convert. Unfortunately, I don't know any C to propose a change ;)
I have a character vector of variable names, "comorbid_names", and I want to select the people who have those comorbidities into "comorbidities". However, I only want to select the variable names for which the value is true.
For example, patient 1 has "chd" only, so only that should be displayed as TRUE.
comorbid_names
[1] "chd" "heart_failure" "stroke"
[4] "hypertension" "diabetes" "copd"
[7] "epilepsy" "hypothyroidism" "cancer"
[10] "asthma" "ckd_stage3" "ckd_stage4"
[13] "ckd_stage5" "atrial_fibrilation" "learning_disability"
[16] "peripheral_arterial_disease" "osteoporosis"
class(comorbid_names)
[1] "character"
comorbidities <- names(p[, comorbid_names][p[, comorbid_names] == 1])
At this point I get this error:
Error: Unsupported use of matrix or array for column indexing
I am not entirely sure why, but I think it has to do with comorbid_names being a character vector.
Does anyone have any advice?
If p is a tibble as opposed to or in addition to a data.frame, you might be dealing with the following:
https://blog.rstudio.org/2016/03/24/tibble-1-0-0/
Look at the bottom of the post:
Interacting with legacy code
A handful of functions don’t work with tibbles because they expect df[, 1] to return a vector, not a data frame. If you encounter one of these functions, use as.data.frame() to turn a tibble back to a data frame:
class(as.data.frame(tbl_df(iris)))
You might get along by doing p <- as.data.frame(p) as well.
Simply using p[, comorbid_names] == 1 will give you the table of TRUE/FALSE values for your selected morbidities. To add the patient names or IDs to that table, use cbind, like this: cbind(p["patient_id"], p[, comorbid_names] == 1), where "patient_id" is the name of the column that identifies patients.
Here's a complete reproducible example:
comorbid_names <- c("chd", "heart_failure","stroke", "hypertension",
"diabetes", "copd", "epilepsy", "hypothyroidism",
"cancer", "asthma", "ckd_stage3", "ckd_stage4",
"ckd_stage5", "atrial_fibrilation", "learning_disability",
"peripheral_arterial_disease", "osteoporosis")
all_morbidities <- c("chd", "heart_failure","stroke", "hypertension",
"diabetes", "copd", "epilepsy", "hypothyroidism",
"cancer", "asthma", "ckd_stage3", "ckd_stage4",
"ckd_stage5", "atrial_fibrilation", "learning_disability",
"peripheral_arterial_disease", "osteoporosis",
"hairyitis", "jellyitis", "transparency")
# Create dummy data frame "p" with patient ids and whether or not they suffer from each condition
patients <- data.frame(patient_id = 1:20)
conditions <- matrix(sample(0:1, nrow(patients)*length(all_morbidities), replace=TRUE),
nrow(patients),
length(all_morbidities))
p <- cbind(patients, conditions)
names(p) <- c(names(patients), all_morbidities)
# Final step: get patient IDs and whether they suffer from specific morbidities
comorbidities <- cbind(p["patient_id"], p[, comorbid_names] == 1)
If you want to select only those patients who suffer from at least one of the morbidities, do this (note the trailing comma, which makes the logical vector select rows rather than columns):
comorbidities[rowSums(comorbidities[-1]) != 0, ]
In step 1, I find out what types of data exist in the database.
In step 2, I retrieve all the data from the database and try to store it in arrays of varying sizes.
1. Accessing data from MongoDB
mong <- mongo(collection = "mycollection", db = "dbname", url = "mongodb://localhost")
agg_df <- mong$aggregate('[{ "$group" :
{ "_id" : "$tagname",
"number_records" : { "$sum" : 1}
}
}]')
print(agg_df)
OUTPUT:
_id number_records
1 raees 100
2 DearZindagi 100
3 FAN 100
4 DDD 21
NOTE: The step 1 output shows that there are 4 categories, with 100, 100, 100 and 21 records respectively.
2. From step 1, I need to create 4 arrays, each with 1 column and a varying number of rows (100, 100, 100, 21), and name those arrays Raees, DearZindagi, FAN, DDD
Dataset <- mong$find('{}','{"text":1}')
Dataset$text <- sapply(Dataset$text,function(row) iconv(row, "latin1", "ASCII", sub=""))
typeof(Dataset$text)
> [1] "character"
3. The arrays to be created, and their sizes (in rows), depend on the output of step 1. The output of step 1 will never have more than 15 rows.
How should I do this?
The split function would split the Dataset into arrays, but how do I name these arrays? My attempt:
rows <- nrow(agg_df)
for (i in 1:rows){
array<- split(Dataset$text, rep(1:rows, c(agg_df[i,2])))
}
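The grouping being asked for, one named collection of texts per tag, can be sketched like this. It is written in Python rather than R and rests on an assumption: that each record carries its tagname next to its text (the find call above only projects "text"). The records list is hypothetical sample data.

```python
from collections import defaultdict

# Hypothetical records, shaped like documents fetched with both
# "tagname" and "text" projected (an assumption, see the lead-in).
records = [
    {"tagname": "raees", "text": "first text"},
    {"tagname": "FAN", "text": "second text"},
    {"tagname": "raees", "text": "third text"},
]

# One list per tag; the tag itself serves as the name of the group,
# so no separate renaming step is needed.
texts_by_tag = defaultdict(list)
for record in records:
    texts_by_tag[record["tagname"]].append(record["text"])

for tag, texts in texts_by_tag.items():
    print(tag, len(texts))
```

Keying the groups by tag avoids having to know the number or sizes of the groups in advance.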
I am using MATLAB to prepare my data set so that I can run it through certain data mining models, and I am facing an issue with linking the data between two of my tables.
I have two tables, A and B, which contain sequential recordings of certain values at certain timestamps, and I want to create a third table, C, in which columns of both A and B are placed in the same rows according to some conditions.
Tables A and B don't have the same amount of rows (A has more measurements) but they both have two columns:
1st column: time of the recording (hh:mm:ss) and
2nd column: recorded value in that time
Columns of A and B are added to table C when both of the following conditions hold:
The time difference between A and B is more than 3 sec but less than 5 sec.
The recorded value of A is 40% - 50% of the recorded value of B.
Any help would be greatly appreciated.
For the first condition you need something like [row,col,val]=find((A(:,1)-B(:,1))>3sec & (A(:,1)-B(:,1))<5sec), where you need to use datenum or an equivalent to transform your timestamps first. The second condition works the same way: use [row,col,val]=find(A(:,2)>0.4*B(:,2) & A(:,2)<0.5*B(:,2)). Note the element-wise & operator; && only accepts scalar operands.
datenum allows you to transform your arrays, so do that first:
A(:,1) = datenum(A(:,1));
B(:,1) = datenum(B(:,1));
You might need to check the datenum documentation regarding the format your string is in.
time1 = [datenum([0 0 0 0 0 3]) datenum([0 0 0 0 0 5])];
creates the datenums for 3 and 5 seconds. All combined:
A(:,1) = datenum(A(:,1));
B(:,1) = datenum(B(:,1));
time1 = [datenum([0 0 0 0 0 3]) datenum([0 0 0 0 0 5])];
[row1,col1,val1]=find((A(:,1)-B(:,1))>time1(1) & (A(:,1)-B(:,1))<time1(2));
[row2,col2,val2]=find(A(:,2)>0.4*B(:,2) & A(:,2)<0.5*B(:,2));
You might not need the row and col outputs if you only want the values. val1 belongs to condition 1, val2 to condition 2. If you want both conditions to hold at the same time, use both in the find command:
[row3,col3,val3]=find((A(:,1)-B(:,1))>time1(1) & ...
(A(:,1)-B(:,1))<time1(2) & A(:,2)>0.4*B(:,2) ...
& A(:,2)<0.5*B(:,2));
The actual adding of your two arrays based on the conditions:
C = A(row3,2)+B(row3,2);
Thank you for your response and help! However, for the time I followed a different approach, converting hh:mm:ss to seconds, which will make the comparison easier later on:
dv1 = datevec(A, 'dd.mm.yyyy HH:MM:SS.FFF ');
secs = [3600,60,1];
dv1(:,6) = floor(dv1(:,6));
timestamp = dv1(:,4:6)*secs.';
Now I am working on combining the time and weight conditions in one piece of code that will run. Should I use an if condition inside a for loop, or is a for loop not necessary?
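One way to combine the two conditions is to compare every row of A with every row of B, which also works when the tables have different numbers of rows. The sketch below is in Python rather than MATLAB and makes several assumptions: the time difference is taken as absolute, the 3 and 5 second bounds are strict, the 40% and 50% bounds are inclusive, and A, B and to_seconds are hypothetical sample data and helper names.

```python
def to_seconds(hhmmss):
    """Convert an 'hh:mm:ss' string to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return 3600 * hours + 60 * minutes + seconds

# Hypothetical recordings: (time, value). A has more rows than B.
A = [("10:00:00", 45.0), ("10:00:10", 30.0), ("10:00:21", 9.0)]
B = [("10:00:04", 100.0), ("10:00:25", 20.0)]

# Each row of C holds a pair of recordings that satisfies both conditions.
C = []
for time_a, value_a in A:
    for time_b, value_b in B:
        gap = abs(to_seconds(time_a) - to_seconds(time_b))
        time_ok = 3 < gap < 5
        value_ok = 0.4 * value_b <= value_a <= 0.5 * value_b
        if time_ok and value_ok:
            C.append((time_a, value_a, time_b, value_b))

for row in C:
    print(row)
```

The nested loop is the simplest form; for large tables the same test can be vectorized over a matrix of pairwise differences.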
I have a query analogous to:
update x
set x.y = (
select sum(x2.y)
from mytable x2
where x2.y < x.y
)
from mytable x
The point is that I'm iterating over rows and updating a field based on a subquery over the same field, whose values are changing.
What I'm seeing is the subquery is being executed for each row before any updates occur, so the changed values for each row are not being picked up.
How can I force the subquery to be re-evaluated for each row of the update?
Is there a suitable table hint or something?
As an aside, I was doing the below and it did work; however, since modifying my query somewhat (for logic purposes, not to try to solve this issue), this trick no longer works :(
declare @temp int
update x
set @temp = (
select sum(x2.y)
from mytable x2
where x2.y < x.y
),
x.y = @temp
from mytable x
I'm not particularly concerned about performance, this is a background task run over a few rows
It looks like the task is ill-defined, or other rules should apply.
Let's look at an example. Say you have the values 4, 1, 2, 3, 1, 2.
SQL updates rows based on the original values, i.e. within a single update statement, newly calculated values are NOT mixed with the original values:
-- only original values used
4 -> 9 (1+2+3+1+2)
1 -> null
2 -> 2 (1+1)
3 -> 6 (1+2+1+2)
1 -> null
2 -> 2 (1+1)
Based on your request, you want the update of each row to take the newly calculated values into account. (Note that SQL does not guarantee the order in which rows are processed.)
Let's do this calculation by processing the rows from top to bottom:
-- from top
4 -> 9 (1+2+3+1+2)
1 -> null
2 -> 1 (1)
3 -> 4 (1+1+2)
1 -> null
2 -> 1 (1)
Now do the same in the other order, from bottom to top:
-- from bottom
4 -> 3 (2+1)
1 -> null
2 -> 1 (1)
3 -> 5 (2+2+1)
1 -> null
2 -> 2 (1+1)
As you can see, the result you expect depends on the processing order, so it is inconsistent. To make it well-defined you need to fix the calculation rule, for instance by defining a strict order in which the rows are processed (by date, id, ...).
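The three cases above can be reproduced with a small simulation. This is an illustrative Python sketch, not SQL: None stands in for NULL, and sum_smaller and update_in_order are hypothetical helper names.

```python
def sum_smaller(values, x):
    """Emulate SELECT SUM(y) ... WHERE y < x; None stands in for NULL."""
    smaller = [v for v in values if v is not None and v < x]
    return sum(smaller) if smaller else None

def update_in_order(values, order):
    """Update rows one at a time, so later rows see earlier updates."""
    current = list(values)
    for i in order:
        current[i] = sum_smaller(current, current[i])
    return current

original = [4, 1, 2, 3, 1, 2]

# What a single UPDATE statement does: every row sees the original values.
snapshot = [sum_smaller(original, x) for x in original]

# Row-by-row processing in two different orders.
top_down = update_in_order(original, range(len(original)))
bottom_up = update_in_order(original, reversed(range(len(original))))

print(snapshot)   # [9, None, 2, 6, None, 2]
print(top_down)   # [9, None, 1, 4, None, 1]
print(bottom_up)  # [3, None, 1, 5, None, 2]
```

The two row-by-row results differ from each other and from the snapshot result, which is why a processing order has to be part of the rule.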
Also, if you want to do some recursive processing, look at common table expressions (common_table_expression):
http://technet.microsoft.com/en-us/library/ms186243(v=sql.105).aspx