Obtaining LRR and BAF values from an Affymetrix SNP array

I am trying to extract the LRR and BAF values from an Affymetrix SNP chip using Linux-based tools, without success. I tried a small subset in a Windows-only program called Axiom™ CNV Summary Tools Software and it works perfectly. The problem is that my dataset is huge, and no Windows machine I have access to is powerful enough to run it.
Let me lay out my steps up to this point. First, I obtained the five tab-delimited files required by the Linux and/or Windows pipelines (files 1-3 were produced with the Affymetrix APT software).
1 - The Axiom calls.txt or genotype file:
calls <- 'probeset_id sample_1 sample_2 sample_3
AX-100010998 2 2 2
AX-100010999 1 0 1
AX-100011005 0 1 2
AX-100011007 2 2 1
AX-100011008 1 1 2
AX-100011010 2 2 2
AX-100011011 0 1 0
AX-100011012 0 1 0
AX-100011016 0 0 1
AX-100011017 0 0 2'
calls <- read.table(text=calls, header=T)
2 - The confidences.txt file:
conf<- 'probeset_id sample_1 sample_2 sample_3
AX-100010998 0.00001 0.0002 0.00006
AX-100010999 0.00001 0.00001 0.00001
AX-100011005 0.00007 0.00017 0.00052
AX-100011007 0.00001 0.00001 0.00001
AX-100011008 0.001 0.00152 0.00001
AX-100011010 0.00001 0.00001 0.00002
AX-100011011 0.00004 0.00307 0.00002
AX-100011012 0.00001 0.00001 0.00001
AX-100011016 0.00003 0.00001 0.00001
AX-100011017 0.00003 0.01938 0.00032'
conf <- read.table(text=conf, header=T)
3 - The summary.txt file:
summ <- 'probeset_id sample_1 sample_2 sample_3
AX-100010998-A 740.33229 655.41465 811.98053
AX-100010998-B 1139.25679 1659.55079 917.7128
AX-100010999-A 1285.67306 1739.03296 1083.48455
AX-100010999-B 1403.51265 341.85893 1237.48577
AX-100011005-A 1650.03408 1274.57594 485.5324
AX-100011005-B 430.3122 2674.70182 4070.90727
AX-100011007-A 411.28952 449.76345 2060.7136
AX-100011007-B 4506.77692 4107.12982 2065.58516
AX-100011008-A 427.78263 439.63541 333.86312
AX-100011008-B 1033.41335 1075.31617 1623.69271
AX-100011010-A 390.12996 350.54456 356.63156
AX-100011010-B 1183.29912 1256.01391 1650.82396
AX-100011011-A 3593.93578 2902.34079 2776.2503
AX-100011011-B 867.33447 2252.54552 961.31596
AX-100011012-A 2250.44699 1192.46116 1927.70581
AX-100011012-B 740.31957 1721.70283 662.1414
AX-100011016-A 1287.9221 1367.95468 1037.98191
AX-100011016-B 554.8795 666.93132 1487.2143
AX-100011017-A 2002.40468 1787.42982 490.28802
AX-100011017-B 849.92775 1025.44417 1429.96567'
summ <- read.table(text=summ, header=T)
4 - The gender.txt:
gender <- 'cel_files gender
sample_1 female
sample_2 female
sample_3 female'
gender <- read.table(text=gender, header=T)
And finally the map file: map.db on Windows (not human-readable) or map.txt on Linux, as follows:
map <- 'Name Chr Position
AX-100010998 Z 70667736
AX-100010999 4 36427048
AX-100011005 26 4016045
AX-100011007 6 25439800
AX-100011008 2 147800617
AX-100011010 1 98919397
AX-100011011 Z 66652642
AX-100011012 7 28180218
AX-100011016 1A 33254907
AX-100011017 5 1918020'
map <- read.table(text=map, header=T)
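As an aside, the summary file interleaves the A- and B-allele intensities of each probeset as consecutive rows. In R they can be pulled apart like this (a sketch using the summ object defined above):
# split summary intensities into per-allele tables, one row per probeset
A <- summ[grepl("-A$", summ$probeset_id), ]
B <- summ[grepl("-B$", summ$probeset_id), ]
# sanity check: the two tables should line up probeset by probeset
stopifnot(sub("-A$", "", A$probeset_id) == sub("-B$", "", B$probeset_id))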
This is the Windows-based result for sample_1:
Name Chr Position sample_1.GType sample_1.Log R Ratio sample_1.B Allele Freq
AX-100010998 Z 70667736 BB Infinity 0.675637419295063
AX-100010999 4 36427048 AB 0.101639462657534 0.531373516807123
AX-100011005 26 4016045 AA -0.111910305454305 0
AX-100011007 6 25439800 BB 0.148781943283483 1
AX-100011008 2 147800617 AB -0.293273363654622 0.609503132331127
AX-100011010 1 98919397 BB -0.283993308525307 0.960031843823016
AX-100011011 Z 66652642 AA Infinity 0.00579049667757003
AX-100011012 7 28180218 AA 0.0245684274744242 0.032174599843476
AX-100011016 1A 33254907 AA -0.265925457515035 0
AX-100011017 5 1918020 AA -0.0091211520536838 0
The values from the Windows-based tool seem to be correct, but that is not the case for the Linux output. I am following the steps described in the PennCNV documentation (http://penncnv.openbioinformatics.org/en/latest/user-guide/input/): I log2-transformed my summary.txt and quantile-normalized it with the limma package using normalizeBetweenArrays(x), producing corrsummary.txt:
corrsum <- 'probeset_id sample_1 sample_2 sample_3
AX-100010998-A 9.804932 9.285738 9.530882
AX-100010998-B 10.249239 10.528922 9.804932
AX-100010999-A 10.528922 10.641862 10.134816
AX-100010999-B 10.641862 8.472829 10.249239
AX-100011005-A 10.804446 10.249239 8.816931
AX-100011005-B 8.835381 11.186266 12.045852
AX-100011007-A 8.542343 8.835381 11.039756
AX-100011007-B 12.045852 12.045852 11.186266
AX-100011008-A 8.816931 8.816931 8.472829
AX-100011008-B 10.134816 9.910173 10.592867
AX-100011010-A 8.472829 8.542343 8.542343
AX-100011010-B 10.374032 10.134816 10.641862
AX-100011011-A 11.593784 11.593784 11.593784
AX-100011011-B 10.012055 11.039756 9.910173
AX-100011012-A 11.186266 10.012055 10.804446
AX-100011012-B 9.530882 10.592867 9.285738
AX-100011016-A 10.592867 10.374032 10.012055
AX-100011016-B 9.285738 9.530882 10.528922
AX-100011017-A 11.039756 10.804446 8.835381
AX-100011017-B 9.910173 9.804932 10.374032'
corrsum <- read.table(text=corrsum, header=T)
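For reference, that transformation step can be written like this (a sketch assuming the summ object defined earlier; normalizeBetweenArrays() comes from the Bioconductor package limma):
library(limma)
# log2-transform the raw intensities, then quantile-normalize across samples
x <- log2(as.matrix(summ[, -1]))
corrsum <- data.frame(probeset_id = summ$probeset_id,
                      normalizeBetweenArrays(x, method = "quantile"))
write.table(corrsum, "corrsummary.txt", sep = "\t", quote = FALSE, row.names = FALSE)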
Thus I applied:
./generate_affy_geno_cluster.pl calls.txt confidences.txt corrsummary.txt --locfile map.txt --sexfile gender.txt --output gencluster
and
./normalize_affy_geno_cluster.pl --locfile map.txt gencluster calls.txt --output lrrbaf.txt
My Linux-based result (lrrbaf.txt), which should contain the LRR and BAF information, looks like this:
output <- 'Name Chr Position sample_1.LogRRatio sample_1.BAlleleFreq sample_2.LogRRatio sample_2.BAlleleFreq sample_3.LogRRatio sample_3.BAlleleFreq
AX-100010999 4 36427048 -1952.0739 2 -1953.0739 2 -1952.0739 2
AX-100011005 26 4016045 -2245.1784 2 -2244.1784 2 -2243.1784 2
AX-100011007 6 25439800 -4433.4661 2 -4433.4661 2 -4434.4661 2
AX-100011008 2 147800617 -1493.2287 2 -1493.2287 2 -1492.2287 2
AX-100011011 Z 66652642 -4088.2311 2 -4087.2311 2 -4088.2311 2
AX-100011012 7 28180218 -2741.2623 2 -2740.2623 2 -2741.2623 2
AX-100011016 1A 33254907 -2117.7005 2 -2117.7005 2 -2116.7005 2
AX-100011017 5 1918020 -3067.4077 2 -3067.4077 2 -3065.4077 2'
output <- read.table(text=output, header=T)
As shown above, the Linux result is completely different from the Windows-based one (and makes much less sense), and it also lacks the GType column. Sorry for composing such a long question, but my intention was to make it as reproducible as possible. I would be grateful for any light on this problem, as well as for any important remarks about this kind of data that I may have forgotten.

Related

Drop columns from a data frame but I keep getting this error below

No matter how I try to code this in R, I cannot drop my columns so that I can build my logistic regression model. I tried to run it two different ways:
cols<-c("EmployeeCount","Over18","StandardHours")
Trainingmodel1 <- DAT_690_Attrition_Proj1EmpAttrTrain[-cols,]
Error in -cols : invalid argument to unary operator
cols<-c("EmployeeCount","Over18","StandardHours")
Trainingmodel1 <- DAT_690_Attrition_Proj1EmpAttrTrain[!cols,]
Error in !cols : invalid argument type
This may solve your problem:
Trainingmodel1 <- DAT_690_Attrition_Proj1EmpAttrTrain[ , !colnames(DAT_690_Attrition_Proj1EmpAttrTrain) %in% cols]
Please note that if you want to drop columns, you should put your code inside [ on the right side of the comma, not on the left side.
So [, your_code] not [your_code, ].
Here is an example of dropping columns using the code above.
cols <- c("cyl", "hp", "wt")
mtcars[, !colnames(mtcars) %in% cols]
# mpg disp drat qsec vs am gear carb
# Mazda RX4 21.0 160.0 3.90 16.46 0 1 4 4
# Mazda RX4 Wag 21.0 160.0 3.90 17.02 0 1 4 4
# Datsun 710 22.8 108.0 3.85 18.61 1 1 4 1
# Hornet 4 Drive 21.4 258.0 3.08 19.44 1 0 3 1
# Hornet Sportabout 18.7 360.0 3.15 17.02 0 0 3 2
# Valiant 18.1 225.0 2.76 20.22 1 0 3 1
#...
Edit to Reproduce the Error
The error message you got indicates that there is a column in which all rows contain one identical value.
To show this, let's try a logistic regression using a subset of the mtcars data in which the cyl column contains a single value, and use that column as a predictor.
mtcars_cyl4 <- mtcars |> subset(cyl == 4)
mtcars_cyl4
# mpg cyl disp hp drat wt qsec vs am gear carb
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
# Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
# Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
# Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
# Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
glm(am ~ as.factor(cyl) + mpg + disp, data = mtcars_cyl4, family = "binomial")
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
# contrasts can be applied only to factors with 2 or more levels
Now compare it with the same logistic regression using the full mtcars data, which has varied values in the cyl column.
glm(am ~ as.factor(cyl) + mpg + disp, data = mtcars, family = "binomial")
# Call: glm(formula = am ~ as.factor(cyl) + mpg + disp, family = "binomial",
# data = mtcars)
#
# Coefficients:
# (Intercept) as.factor(cyl)6 as.factor(cyl)8 mpg disp
# -5.08552 2.40868 6.41638 0.37957 -0.02864
#
# Degrees of Freedom: 31 Total (i.e. Null); 27 Residual
# Null Deviance: 43.23
# Residual Deviance: 25.28 AIC: 35.28
It is likely that, even though you have dropped the three columns whose rows all contain one identical value, there is another such column in Trainingmodel1. The identical values probably resulted from filtering the data frame and from splitting the data into training and test groups. It is better to check with summary(Trainingmodel1).
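For example, the constant columns can be listed programmatically (a sketch, assuming Trainingmodel1 is a data frame):
# names of columns holding a single distinct value in all rows
names(Trainingmodel1)[sapply(Trainingmodel1, function(x) length(unique(x)) == 1)]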
Further edit
I have checked the summary(Trainingmodel1) result, and it is now clear that EmployeeNumber has one identical value (called a "level" for a factor) in all rows. To run your regression properly, either drop it from your model or, if EmployeeNumber has other levels that you want to include, make sure the training data contains at least two of them. You can achieve that during splitting by repeating the random sampling until the selected EmployeeNumber samples contain at least two levels, looping with for, while, or repeat. It is possible, but I don't know how appropriate repeated sampling is for your study.
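One possible shape for that resampling loop (a sketch; full_data and train_frac are assumed names, not from your code):
set.seed(1)                      # for reproducibility
train_frac <- 0.7                # assumed training proportion
repeat {
  idx   <- sample(nrow(full_data), floor(train_frac * nrow(full_data)))
  train <- full_data[idx, ]
  # keep resampling until the training set has at least two EmployeeNumber levels
  if (length(unique(train$EmployeeNumber)) >= 2) break
}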
As for your question about subsetting on more than one variable, you can use subset with conditionals. For example, to get a subset of mtcars with cyl == 4 and mpg > 20:
mtcars |> subset(cyl == 4 & mpg > 20 )
If you want a subset that has cyl == 4 or mpg > 20:
mtcars |> subset(cyl == 4 | mpg > 20 )
You can also subset by using more columns as subset criteria:
mtcars |> subset((cyl > 4 & cyl < 8) | (mpg > 20 & gear > 4))

Forecasting in gretl

Consider the following gretl script (hansl):
open bjg.gdt
arima 1 1 0 ; 2 1 0 ; g
series fitted = $yhat
g1 <- gnuplot g fitted --with-lines --time-series --output=display
What I want to do next is to make a forecast for, let's say, 24 steps ahead, that is, from Jan 1961 to Dec 1962. I believe the fifth line should be something like
fcast [options] --plot=display
What options to use here? I have tried several combinations but none is successful.
After further experimentation, here is the solution:
open bjg.gdt
arima 1 1 0 ; 2 1 0 ; g
series fitted = $yhat
g1 <- gnuplot g fitted --with-lines --time-series --output=display
dataset addobs 24
g2 <- fcast --dynamic --out-of-sample --plot=display

Array row calculations

I have the following table:
DATA:
Lines <- " ID MeasureX MeasureY x1 x2 x3 x4 x5
1 1 1 1 1 1 1 1
2 1 1 0 1 1 1 1
3 1 1 1 2 3 3 3"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
What I would like to achieve is:
Create 5 columns (r1-r5),
which are the division of each column x1-x5 by MeasureX (for example x1/MeasureX, x2/MeasureX, etc.).
Create 5 columns (p1-p5),
which are the division of each column x1-x5 by its number 1-5 (the index of the x column), for example x1/1, x2/2, etc.
MeasureY is irrelevant for now; the end product would be ID plus the columns r1-r5 and p1-p5. Is this feasible?
In SAS I would go with something like this:
data test6;
set test5;
array x {5} x1- x5;
array r{5} r1 - r5;
array p{5} p1 - p5;
do i=1 to 5;
r{i} = x{i}/MeasureX;
p{i} = x{i}/(i);
end;
The reason is to stay dynamic, because the number of columns could change in the future.
Argument recycling allows you to do element-wise division with a constant vector. The tricky part was extracting the digits from the column names. I then repeated each digit by the number of rows to do the second division task.
DF[ ,paste0("r", 1:5)] <- DF[ , grep("x", names(DF) )]/ DF$MeasureX
DF[ ,paste0("p", 1:5)] <- DF[ , grep("x", names(DF) )]/ # element-wise division
rep( as.numeric( sub("\\D","",names(DF)[ # remove non-digits
grep("x", names(DF))] #returns only 'x'-cols
) ), each=nrow(DF) ) # make them as long as needed
#-------------
> DF
ID MeasureX MeasureY x1 x2 x3 x4 x5 r1 r2 r3 r4 r5 p1 p2 p3 p4 p5
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.5 0.3333333 0.25 0.2
2 2 1 1 0 1 1 1 1 0 1 1 1 1 0 0.5 0.3333333 0.25 0.2
3 3 1 1 1 2 3 3 3 1 2 3 3 3 1 1.0 1.0000000 0.75 0.6
This could be greatly simplified if you already know that the sequence vector for the second division task is 1-5, but this was designed to allow "gaps" in the sequence of column names and still use the digit information in the names as the divisor. (You were not entirely clear about which situations this code would be used in.) The construct r{1-5} in SAS is mimicked by [ , paste0('r', 1:5)]. SAS is a macro language, and experienced SAS users sometimes have trouble figuring out how to make R behave like one. Generally it takes a while to lose the for-loop mentality and begin using R as a functional language.
An alternative with the data.table package:
cols <- names(DF)[4:8]
library(data.table)
# within each by-group, MeasureX refers to that group's own value,
# which is safer than indexing the full DF$MeasureX vector
setDT(DF)[, (paste0("r",1:5)) := .SD / MeasureX, by = ID, .SDcols = cols
][, (paste0("p",1:5)) := .SD / 1:5, by = ID, .SDcols = cols]
which results in:
> DF
ID MeasureX MeasureY x1 x2 x3 x4 x5 r1 r2 r3 r4 r5 p1 p2 p3 p4 p5
1: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.5 0.3333333 0.25 0.2
2: 2 1 1 0 1 1 1 1 0 1 1 1 1 0 0.5 0.3333333 0.25 0.2
3: 3 1 1 1 2 3 3 3 1 2 3 3 3 1 1.0 1.0000000 0.75 0.6
You could put together a nifty loop or apply to do this, but here it is explicitly:
# Handling the "r" columns.
DF$r1 <- DF$x1 / DF$MeasureX
DF$r2 <- DF$x2 / DF$MeasureX
DF$r3 <- DF$x3 / DF$MeasureX
DF$r4 <- DF$x4 / DF$MeasureX
DF$r5 <- DF$x5 / DF$MeasureX
# Handling the "p" columns.
DF$p1 <- DF$x1 / 1
DF$p2 <- DF$x2 / 2
DF$p3 <- DF$x3 / 3
DF$p4 <- DF$x4 / 4
DF$p5 <- DF$x5 / 5
# Taking only the columns we want.
FinalDF <- DF[, c("ID", "r1", "r2", "r3", "r4", "r5", "p1", "p2", "p3", "p4", "p5")]
Just noting that this is pretty straightforward matrix manipulation that you definitely could have found elsewhere. Perhaps you're new to R, but still put a little more effort in next time. If you are new to R, it's definitely worth the time to look up some basic R coding tutorial or video.
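For reference, the loop version alluded to above might look like this (a sketch using the DF from the question):
for (i in 1:5) {
  DF[[paste0("r", i)]] <- DF[[paste0("x", i)]] / DF$MeasureX  # divide by MeasureX
  DF[[paste0("p", i)]] <- DF[[paste0("x", i)]] / i            # divide by the column index
}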

VTK Structured Point file

I am trying to parse a VTK file in C by extracting its point data and storing each point in a 3D array. However, the file I am working with has 9 shorts per point and I am having difficulty understanding what each number means.
I believe I understand most of the header information (please correct me if I have misunderstood):
ASCII: Type of file (ASCII or Binary)
DATASET: Type of dataset
DIMENSIONS: dims of voxels (x,y,z)
SPACING: Volume of each voxel (w,h,d)
ORIGIN: Unsure
POINT DATA: Total number of points/voxels (dimx.dimy.dimz)
I have looked at the documentation and I am still not able to interpret the data. Could someone please help me understand, or point me to some helpful resources?
# vtk DataFile Version 3.0
vtk output
ASCII
DATASET STRUCTURED_POINTS
DIMENSIONS 256 256 130
SPACING 1 1 1.3
ORIGIN 86.6449 -133.929 116.786
POINT_DATA 8519680
SCALARS scalars short
LOOKUP_TABLE default
0 0 0 0 0 0 0 0 0
0 0 7 2 4 5 3 3 4
4 5 5 1 7 7 1 1 2
1 6 4 3 3 1 0 4 2
2 3 2 4 2 2 0 2 6
...
thanks.
You are correct regarding the meaning of fields in the header.
ORIGIN corresponds to the coordinates of the 0-0-0 corner of the grid.
An example of a DATASET STRUCTURED_POINTS can be found in the documentation.
Starting from this, here is a small file with 6 values per point. Each line represents a point.
# vtk DataFile Version 2.0
Volume example
ASCII
DATASET STRUCTURED_POINTS
DIMENSIONS 3 4 2
ASPECT_RATIO 1 1 1
ORIGIN 0 0 0
POINT_DATA 24
SCALARS volume_scalars char 6
LOOKUP_TABLE default
0 1 2 3 4 5
1 1 2 3 4 5
2 1 2 3 4 5
0 2 2 3 4 5
1 2 2 3 4 5
2 2 2 3 4 5
0 3 2 8 9 10
1 3 2 8 9 10
2 3 2 8 9 10
0 4 2 8 9 10
1 4 2 8 9 10
2 4 2 8 9 10
0 1 3 18 19 20
1 1 3 18 19 20
2 1 3 18 19 20
0 2 3 18 19 20
1 2 3 18 19 20
2 2 3 18 19 20
0 3 3 24 25 26
1 3 3 24 25 26
2 3 3 24 25 26
0 4 3 24 25 26
1 4 3 24 25 26
2 4 3 24 25 26
The first 3 fields can be displayed to understand the data layout: x changes faster than y, which changes faster than z in the file.
If you wish to store the data in an array a[2][4][3][6], just read it in a loop:
FILE *file = fopen("volume.vtk", "r"); /* assumed file name; the header lines must be consumed first */
int a[2][4][3][6];                     /* indexed as a[z][y][x][component] */
int i, j, k, l;
for(k=0;k<2;k++){ //z loop
    for(j=0;j<4;j++){ //y loop : y changes faster than z
        for(i=0;i<3;i++){ //x loop : x changes faster than y
            for(l=0;l<6;l++){
                fscanf(file,"%d",&a[k][j][i][l]);
            }
        }
    }
}
To read the header, fscanf() may be used as well:
int sizex,sizey,sizez;
char headerpart[100];
fscanf(file,"%s",headerpart);
if(strcmp(headerpart,"DIMENSIONS")==0){
fscanf(file,"%d%d%d",&sizex,&sizey,&sizez);
}
Note that fscanf() needs a pointer to the data (&sizex, not sizex). Since an array of char decays to a pointer to its first element and the string is terminated by \0, "%s",headerpart works fine; it can be replaced by "%s",&headerpart[0]. The function strcmp() compares two strings and returns 0 if the strings are identical.
As your grid seems large, smaller files can be obtained using the BINARY kind instead of ASCII, but watch for endianness as specified here.

Subsetting Last N Values From a Data Frame, R

I have all the results of a football season in a data frame called new. I want to extract the last 5 games of all teams, home and away. The home variable is column 1 and the away variable is column 2.
Say there are 20 teams in a character vector called teams, each with a unique name. If it were just a single team it would be easy to subset; say team1 is "Arsenal", I could use something like
Arsenal <- "Arsenal"
head(new[new[,1] == Arsenal | new[,2] == Arsenal,], 5)
But I want to loop through the character vector teams to obtain the last 5 results of all teams, 20 in total. Can somebody help me please?
Edit: Here is some sample data. As an example, I would like to obtain the last two games of all teams. It would be easy to subset a single team, but I'm not sure how to subset multiple teams.
V1 V2 V3 V4 V5
1 Chelsea Everton 2 1 19/05/2013
2 Liverpool QPR 1 0 19/05/2013
3 Man City Norwich 2 3 19/05/2013
4 Newcastle Arsenal 0 1 19/05/2013
5 Southampton Stoke 1 1 19/05/2013
6 Swansea Fulham 0 3 19/05/2013
7 Tottenham Sunderland 1 0 19/05/2013
8 West Brom Man United 5 5 19/05/2013
9 West Ham Reading 4 2 19/05/2013
10 Wigan Aston Villa 2 2 19/05/2013
11 Arsenal Wigan 4 1 14/05/2013
12 Reading Man City 0 2 14/05/2013
13 Everton West Ham 2 0 12/05/2013
14 Fulham Liverpool 1 3 12/05/2013
15 Man United Swansea 2 1 12/05/2013
16 Norwich West Brom 4 0 12/05/2013
17 QPR Newcastle 1 2 12/05/2013
18 Stoke Tottenham 1 2 12/05/2013
19 Sunderland Southampton 1 1 12/05/2013
20 Aston Villa Chelsea 1 2 11/05/2013
21 Chelsea Tottenham 2 2 08/05/2013
22 Man City West Brom 1 0 07/05/2013
23 Wigan Swansea 2 3 07/05/2013
24 Sunderland Stoke 1 1 06/05/2013
25 Liverpool Everton 0 0 05/05/2013
26 Man United Chelsea 0 1 05/05/2013
27 Fulham Reading 2 4 04/05/2013
28 Norwich Aston Villa 1 2 04/05/2013
29 QPR Arsenal 0 1 04/05/2013
30 Swansea Man City 0 0 04/05/2013
31 Tottenham Southampton 1 0 04/05/2013
32 West Brom Wigan 2 3 04/05/2013
33 West Ham Newcastle 0 0 04/05/2013
34 Aston Villa Sunderland 6 1 29/04/2013
35 Arsenal Man United 1 1 28/04/2013
36 Chelsea Swansea 2 0 28/04/2013
37 Reading QPR 0 0 28/04/2013
38 Everton Fulham 1 0 27/04/2013
39 Man City West Ham 2 1 27/04/2013
40 Newcastle Liverpool 0 6 27/04/2013
41 Southampton West Brom 0 3 27/04/2013
42 Stoke Norwich 1 0 27/04/2013
43 Wigan Tottenham 2 2 27/04/2013
Where df is your data.frame, this will create a list of 20 data.frames, each element being the last-five dataset for one team. This also assumes that the dataset is already ordered, since you mentioned it.
names(df) <- c('hometeam','awayteam','homegoals','awaygoals','fixturedate') # base R; setnames() would require the data.table package
allteams <- sort(unique(df$hometeam))
eachteamlastfive <- vector(mode = "list", length = length(allteams))
for ( i in seq(length(allteams)))
{
eachteamlastfive[[i]] <- head(df[df$hometeam==allteams[i] | df$awayteam == allteams[i], ],5)
}
Take a look at sapply:
sapply(unique(new[,1]), function(team) head(new[new[,1] == team | new[,2] == team,], 5))
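Note that sapply() will try to simplify the result into a matrix; if you want a named list with one data frame per team, lapply() may be more convenient (a sketch using the new data frame from the question, assuming it is ordered most recent first):
teams <- sort(unique(c(new[, 1], new[, 2])))   # every team, home or away
last5 <- setNames(lapply(teams, function(team)
  head(new[new[, 1] == team | new[, 2] == team, ], 5)), teams)
last5[["Arsenal"]]                             # last five games involving Arsenal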
