Post hoc tests for permutation repeated measure anova - permutation

I have a two-way designed experiment, which are two species and three treatments(n = 5 or 6, for each group), and I measure leaf area every three months of same individuals (repeated measured).
Because my data is heterogeneous variance (tested with levene test) and the sample size is small, I decide to use permutation.
Here is my script. I used aovperm() to build the model, and set Error(number/(month)) as random factor (number is the ID of individuals).
area.aovp<-aovperm(p_area~species*treatment*month + Error(number/(month)), data = growth, type = "permutation", np = 2000, method = "Rd_kheradPajouh_renaud")
For doing post-hoc tests, I've tryed lsmeans() and glht(), but neither of them could work with the permutated anova.
Is there other proper way to do such post hoc?

Related

How to iterate through two numpy arrays of different dimensions

I am working with the MNIST dataset, x_test has dimension of (10000,784) and y_test has a dimension of (10000,10). I need to iterate through each sample of these two numpy arrays at the same time as I need to pass them individually to score.evaluate()
I tried nditer, but it throws an error saying operands could not be broadcasted together since hey have different shape.
score=[]
for x_sample, y_sample in np.nditer ([x_test,y_test]):
a=x_sample.reshape(784,1)
a=np.transpose(a)
b=y_sample.reshape(10,1)
b=np.transpose(b)
s=model.evaluate(a,b,verbose=0)
score.append(s)
Assuming that what you are actually trying to do here is to get the individual loss per sample in your test set, here is a way to do it (in your approach, even if you get past the iteration part, you will have issues with model.evaluate, which was not designed for single sample pairs)...
To make the example reproducible, here I also assume we have first run the Keras MNIST CNN example for only 2 epochs; so, the shape of our data is:
x_test.shape
# (10000, 28, 28, 1)
y_test.shape
# (10000, 10)
Given that, here is a way to get the individual loss per sample:
from keras import backend as K
y_pred = model.predict(x_test)
y_test = y_test.astype('float32') # necessary, as y_pred.dtype is 'float32'
y_test_tensor = K.constant(y_test)
y_pred_tensor = K.constant(y_pred)
g = K.categorical_crossentropy(target=y_test_tensor, output=y_pred_tensor)
ce = K.eval(g) # 'ce' for cross-entropy
ce
# array([1.1563368e-05, 2.0206178e-05, 5.4946734e-04, ..., 1.7662416e-04,
# 2.4232995e-03, 1.8954457e-05], dtype=float32)
ce.shape
# (10000,)
i.e. ce now contains what the score list in your question was supposed to contain.
For confirmation, let's calculate the loss for all test samples using model.evaluate:
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
# Test loss: 0.050856544668227435
and again manually, averaging the values of the ce we have just calculated:
import numpy as np
log_loss = np.sum(ce)/ce.shape[0]
log_loss
# 0.05085654296875
which, although not exactly equal (due to different numeric precision involved in the two ways of calculation), they are practically equal indeed:
log_loss == score[0]
# False
np.isclose(log_loss, score[0])
# True
Now, the adaptation of this to your own case, where the shape of x_test is (10000, 784), is arguably straighforward...
You are mixing between training features and testing labels. The training set has 60,000 samples and the test set has 10,000 samples (that is, your x_test should be of dimension (10000,784)). Make sure you have download all the correct data, and don't mix up training data with testing data.
http://yann.lecun.com/exdb/mnist/

ArangoDB - Graph based recommender system

I am using ArangoDB and I am trying to build a graph-based recommender system with it.
The data model just contains users, items and ratings (edges).
Therefore want to calculate the affinity of a user to a movie with the katz measure.
Eventually I want to do this:
Get all (or a certain number of) paths between a user and a item
For all of these paths do the following:
Multiply each edge's rating with a damping factor (e.g. 0.7)
Sum up all calculated values within a path
Calculate the average of all calculated path values
The result is some kind of affinity between a user and an item, weighted with the intermediary ratings and damped by a defined factor.
I was trying to realize something like that in AQL but it was either wrong or much too slow. How could a algorithm like this look in AQL?
From a performance point of view there might be better choices for graph based recommender systems. If someone has a suggestion (e.g. Item Rank or other algorithms), it would also be nice to get some ideas here.
I love this topic but sometimes I get to my borders.
In the following, #start and #end are parameters representing the two endpoints; for simplicity, I've assumed that:
the maximum admissible path length is 10000
"rates" is the name of the "edges" collection
"rating" is the name of the property giving a weight to an edge
the "damping" factor is as per the requirements
FOR v,e,p IN 0..10000 OUTBOUND #start rates
OPTIONS {uniqueVertices: "path"}
FILTER v._id==#end
LET r = AVERAGE(p.edges[*].rating) * 0.7
COLLECT AGGREGATE avg = AVERAGE(r)
RETURN avg

combination of smote and undersampling on weka

according to paper which written by chawla, et al (2002)
the best perfomance of balancing data is combining undersampling with SMOTE.
I’ve tried to combine my dataset using under-sampling and SMOTE,
but I am bit confuse about the attribute for under-sampling.
In weka there is Resample to decrease the majority class.
there is a attribute in Resample
biasToUniformClass -- Whether to use bias towards a uniform class. A value of 0 leaves the class distribution as-is, a value of 1 ensures the class distribution is uniform in the output data.
I use value 0 and the data in majority class is down so the minority do and when I use value 1, the data in majority decrease but in minority class, the data is up.
I try to use value 1 for that attribute, but I don't using smote to increase the instances of minority class because the data is already balance and the result is good too.
so, is that the same as I combine the SMOTE and under-sampling or I still have to try with value 0 in that attribute and do the SMOTE ?
For undersampling, see the EasyEnsemble algorithm (a Weka implementation was developed by Schubach, Robinson, and Valentini).
The EasyEnsemble algorithm allows you to split the data into a certain number of balanced partitions. To achieve this balance, set the numIterations parameter equal to:
(# of majority instances) / (# minority instances) = numIterations
For example, if there are 30 total instances with 20 in the majority class and 10 in the minority class, set the numIterations parameter equal to 2 (i.e., 20 majority instances / 10 instances equals 2 balanced partitions). These 2 partitions should each contain 20 instances; each has the same 10 minority instances along with a different 10 instances from the majority class.
The algorithm then trains classifiers on each of the balanced partitions,
and at test time, ensembles the batch of classifiers trained on each of the balanced partitions for prediction.

How to split the data into training and test sets?

One approach to split the data into two disjoint sets, one for training and one for tests is taking the first 80% as the training set and the rest as the test set. Is there another approach to split the data into training and test sets?
** For example, I have a data contains 20 attributes and 5000 objects. Therefore, I will take 12 attributes and 1000 objects as my training data and 3 attributes from the 12 attributes as test set. Is this method correct?
No, that's invalid. You would always use all features in all data sets. You split by "objects" (examples).
It's not clear why you are taking just 1000 objects and trying to extract a training set from that. What happened to the other 4000 you threw away?
Train on 4000 objects / 20 features. Cross-validate on 500 objects / 20 features. Evaluate performance on the remaining 500 objects/ 20 features.
If your training produces a classifier based on 12 features, it could be (very) hard to evaluate its performances on a test set based only on a subset of these features (your classifier is expecting 12 inputs and you'll give only 3).
Feature/attribute selection/extraction is important if your data contains many redundant or irrelevant features. So you could identify and use only the most informative features (maybe 12 features) but your training/validation/test sets should be based on the same number of features (e.g. since you're mentioning weka Why do I get the error message 'training and test set are not compatible'?).
Remaining on a training/validation/test split (holdout method), a problem you can face is that the samples might not be representative.
For example, some classes might be represented with very few instance or even with no instances at all.
A possible improvement is stratification: sampling for training and testing within classes. This ensures that each class is represented with approximately equal proportions in both subsets.
However, by partitioning the available data into fixed training/test set, you drastically reduce the number of samples which can be used for learning the model. An alternative is cross validation.

SPSS creating a loop for a multiple regression over several variables

For my master thesis I have to use SPSS to analyse my data. Actually I thought that I don't have to deal with very difficult statistical issues, which is still true regarding the concepts of my analysis. BUT the problem is now that in order to create my dependent variable I need to use the syntax editor/ programming in general and I have no experience in this area at all. I hope you can help me in the process of creating my syntax.
I have in total approximately 900 companies with 6 year observations. For all of these companies I need the predicted values of the following company-specific regression:
Y= ß1*X1+ß2*X2+ß3*X3 + error
(I know, the ß won t very likely be significant, but this is nothing to worry about in my thesis, it will be mentioned in the limitations though).
So far my data are ordered in the following way
COMPANY YEAR X1 X2 X3
1 2002
2 2002
1 2003
2 2003
But I could easily change the order, e.g. in
1
1
2
2 etc.
Ok let's say I have rearranged the data: what I need now is that SPSS computes for each company the specific ß and returns the output in one column (the predicted values with those ß multiplied by the specific X in each row). So I guess what I need is a loop that does a multiple linear regression for 6 rows for each of the 939 companies, am I right?
As I said I have no experience at all, so every hint is valuable for me.
Thank you in advance,
Janina.
Bear in mind that with only six observations per company and three (or 4 if you also have a constant term) coefficients to estimate, the coefficient estimates are likely to be very imprecise. You might want to consider whether companies can be pooled at least in part.
You can use SPLIT FILE to estimate the regressions specific for each company, example below. Note that one would likely want to consider other panel data models, and assess whether there is autocorrelation in the residuals. (This is IMO a useful approach though for exploratory analysis of multi-level models.)
The example declares a new dataset to pipe the regression estimates to (see the OUTFILE subcommand on REGRESSION) and suppresses the other tables (with 900+ tables much of the time is spent rendering the output). If you need other statistics either omit the OMS that suppresses the tables, or tweak it to only show the tables you want. (You can use OMS to pipe other results to other datasets as well.)
************************************************************.
*Making Fake data.
SET SEED 10.
INPUT PROGRAM.
LOOP #Comp = 1 to 1000.
COMPUTE #R1 = RV.NORMAL(10,2).
COMPUTE #R2 = RV.NORMAL(-3,1).
COMPUTE #R3 = RV.NORMAL(0,5).
LOOP Year = 2003 to 2008.
COMPUTE Company = #Comp.
COMPUTE Rand1 = #R1.
COMPUTE Rand2 = #R2.
COMPUTE Rand3 = #R3.
END CASE.
END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME Companies.
COMPUTE x1 = RV.NORMAL(0,1).
COMPUTE x2 = RV.NORMAL(0,1).
COMPUTE x3 = RV.NORMAL(0,1).
COMPUTE y = Rand1*x1 + Rand2*x2 + Rand3*x3 + RV.NORMAL(0,1).
FORMATS Company Year (F4.0).
*Now sorting cases by Company and Year, then using SPLIT file to estimate
*the regression.
SORT CASES BY Company Year.
*Declare new set and have OMS suppress the other results.
DATASET DECLARE CoeffTable.
OMS
/SELECT TABLES
/IF COMMANDS = 'Regression'
/DESTINATION VIEWER = NO.
*Now split file to get the coefficients.
SPLIT FILE BY Company.
REGRESSION
/DEPENDENT y
/METHOD=ENTER x1 x2 x3
/SAVE PRED (CompSpePred)
/OUTFILE = COVB ('CoeffTable').
SPLIT FILE OFF.
OMSEND.
************************************************************.

Resources