R: Aggregate on Group 1 and NOT Group 2 - database

I am trying to create two data sets. The first summarizes the data by two groups, which I have done using the following code:
x = rnorm(1:100)
g1 = sample(LETTERS[1:3], 100, replace = TRUE)
g2 = sample(LETTERS[24:26], 100, replace = TRUE)
aggregate(x, list(g1, g2), mean)
The second needs to summarize the data by the first group and NOT the second group.
If we consider the possible pairs from the previous example:
A - X B - X C - X
A - Y B - Y C - Y
A - Z B - Z C - Z
The second dataset should summarize the data as the average of the outgroup.
A - not X
A - not Y
A - not Z etc.
Is there a way to manipulate aggregate functions in R to achieve this?
I also thought there could be a dummy variable that could represent the data in this way, although I am unsure how it would look.
I have found this answer here:
R using aggregate to find a function (mean) for "all other"
I think this indicates that a dummy variable for each pairing is necessary. However, if anyone can offer a better or more efficient way, that would be appreciated, as there are many pairings in the true data set.
Thanks in advance

First let us generate the data reproducibly (using set.seed):
# same as question but added set.seed for reproducibility
set.seed(123)
x = rnorm(1:100)
g1 = sample(LETTERS[1:3], 100, replace = TRUE)
g2 = sample(LETTERS[24:26], 100, replace = TRUE)
Now we have two solutions, both of which use aggregate:
1) ave
# x equals the sums over the groups and n equals the counts
ag = cbind(aggregate(x, list(g1, g2), sum),
           n = aggregate(x, list(g1, g2), length)[, 3])

ave.not <- function(x, g) ave(x, g, FUN = sum) - x

transform(ag,
          x = NULL,   # don't need x any more
          n = NULL,   # don't need n any more
          mean = x/n,
          mean.not = ave.not(x, Group.1) / ave.not(n, Group.1)
)
This gives:
  Group.1 Group.2       mean     mean.not
1       A       X  0.3155084 -0.091898832
2       B       X -0.1789730  0.332544353
3       C       X  0.1976471  0.014282465
4       A       Y -0.3644116  0.236706489
5       B       Y  0.2452157  0.099240545
6       C       Y -0.1630036  0.179833987
7       A       Z  0.1579046 -0.009670734
8       B       Z  0.4392794  0.033121335
9       C       Z  0.1620209  0.033714943
To double check the first value under mean and under mean.not:
> mean(x[g1 == "A" & g2 == "X"])
[1] 0.3155084
> mean(x[g1 == "A" & g2 != "X"])
[1] -0.09189883
2) sapply
Here is a second approach which gives the same answer:
ag <- aggregate(list(mean = x), list(g1, g2), mean)
f <- function(i) mean(x[g1 == ag$Group.1[i] & g2 != ag$Group.2[i]])
ag$mean.not = sapply(1:nrow(ag), f)
ag
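For completeness, the same mean.not can also be computed directly from per-g1 totals, since the outgroup mean is just (g1 total sum - cell sum) / (g1 total count - cell count). A minimal base R sketch of that identity (object names here are illustrative, not from the answer above):

tot_sum <- tapply(x, g1, sum)      # sums per g1 level
tot_n   <- tapply(x, g1, length)   # counts per g1 level
cell    <- aggregate(x, list(g1, g2), function(v) c(sum = sum(v), n = length(v)))
# remove each cell's own contribution from its g1 total
mean.not <- (tot_sum[as.character(cell$Group.1)] - cell$x[, "sum"]) /
            (tot_n[as.character(cell$Group.1)]  - cell$x[, "n"])
cbind(cell[1:2], mean.not = unname(mean.not))

This should reproduce the mean.not column shown above, since it is the same subtraction that ave.not performs.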
REVISED: revised based on comments by the poster; added a second approach and made some minor improvements.

Related

Minimize (firstA_max - firstA_min) + (secondB_max - secondB_min)

Given n pairs of integers, split them into two subsets A and B so as to minimize (maximum difference among first values of A) + (maximum difference among second values of B).
Example: n = 4
{0; 0}; {5; 5}; {1; 1}; {3; 4}
A = {{0; 0}; {1; 1}}
B = {{5; 5}; {3; 4}}
(maximum difference among first values of A) = fA_max - fA_min = 1 - 0 = 1
(maximum difference among second values of B) = sB_max - sB_min = 5 - 4 = 1
Therefore, the answer is 1 + 1 = 2, and this is the best split.
Obviously, the maximum difference among the values equals (maximum value - minimum value). Hence, what we need to do is find the minimum of (fA_max - fA_min) + (sB_max - sB_min).
Suppose the given array is arr[], the first value is arr[].first and the second value is arr[].second.
I think it is quite easy to solve this in quadratic complexity. You just need to sort the array by the first value. Then all the elements in subset A should be picked consecutively in the sorted array, so you can loop over all ranges [L; R] of the sorted array. For each range, try to add all the elements in that range to subset A and put all the remaining elements into subset B.
For more detail, this is my C++ code:
int calc(pair<int, int> a[], int n){
    int m = 1e9, M = -1e9, res = 2e9; // m and M are min and max of the second values of a[1..l-1] (the prefix that goes to B)
    for (int l = 1; l <= n; l++){
        int g = m, G = M; // g and G are min and max of all the second values in subset B
        for (int r = n; r >= l; r--) {
            if (r - l + 1 < n){
                res = min(res, a[r].first - a[l].first + G - g);
            }
            g = min(g, a[r].second);
            G = max(G, a[r].second);
        }
        m = min(m, a[l].second);
        M = max(M, a[l].second);
    }
    return res;
}
Now I want to improve my algorithm down to loglinear complexity. Of course, sort the array by the first value. After that, if I fix fA_min = a[i].first, then as the right end of A's range moves further right, fA_max increases while (sB_max - sB_min) decreases.
But I am still stuck here: is there any way to solve this problem in loglinear complexity?
The following approach is an attempt to escape the n^2 behaviour, using an argmin list for the second element of the tuples (let's say the y-part), where the points are sorted by x.
One observation is that there is an optimum solution where A includes index argmin[0] or argmin[n-1] or both.
In get_best_interval_min_max we focus once on including argmin[0], then the next smallest element on y, and so on. Then we do the same from the max element.
We get two dictionaries {(i,j): (profit, idx)}, telling us how much we gain in y when including points[i:j+1] in A, towards the min or the max on y. idx is the index in the argmin array.
Calculate the objective for each dict assuming the max/min of y is not in A.
Combine the results of both dictionaries: (i1,j1): (v1, idx1) and (i2,j2): (v2, idx2) give the result j2 - i1 + max_y - min_y - v1 - v2.
Constraint: idx1 < idx2, because the indices in the argmin array cannot intersect; otherwise some profit in y might be counted twice.
On average the dictionaries (dmin, dmax) are smaller than n, but in the worst case, when x and y correlate ([(i,i) for i in range(n)]), they are exactly n and we do not gain any time. Anyhow, on random instances this approach is much faster. Maybe someone can improve upon this.
import numpy as np
from random import randrange
import time

def get_best_interval_min_max(points):  # sorted input according to x dim
    L = len(points)
    argmin_b = np.argsort([p[1] for p in points])
    b_min, b_max = points[argmin_b[0]][1], points[argmin_b[L-1]][1]
    arg = [argmin_b[0], argmin_b[0]]
    res_min = dict()
    for i in range(1, L):
        res_min[tuple(arg)] = points[argmin_b[i]][1] - points[argmin_b[0]][1], i  # the profit in b towards min
        if arg[0] > argmin_b[i]: arg[0] = argmin_b[i]
        elif arg[1] < argmin_b[i]: arg[1] = argmin_b[i]
    arg = [argmin_b[L-1], argmin_b[L-1]]
    res_max = dict()
    for i in range(L-2, -1, -1):
        res_max[tuple(arg)] = points[argmin_b[L-1]][1] - points[argmin_b[i]][1], i  # the profit in b towards max
        if arg[0] > argmin_b[i]: arg[0] = argmin_b[i]
        elif arg[1] < argmin_b[i]: arg[1] = argmin_b[i]
    # return the two dicts and the difference along y
    return res_min, res_max, b_max - b_min

def argmin_algo(points):
    # return the objective value, sets A and B, and the interval for A in points.
    points.sort()
    # get the profits for different intervals on the sorted array for max and min
    dmin, dmax, y_diff = get_best_interval_min_max(points)
    key = [None, None]
    res_min = 2e9
    # the best result when only the min/max b value is included in A
    for d in [dmin, dmax]:
        for k, (v, i) in d.items():
            res = points[k[1]][0] - points[k[0]][0] + y_diff - v
            if res < res_min:
                key = k
                res_min = res
    # combine the results for max and min.
    for k1, (v1, i) in dmin.items():
        for k2, (v2, j) in dmax.items():
            if i > j: break  # their argmin_b indices can not intersect!
            idx_l, idx_h = min(k1[0], k2[0]), max(k1[1], k2[1])  # index low and index high for the combination
            res = points[idx_h][0] - points[idx_l][0] - v1 - v2 + y_diff
            if res < res_min:
                key = (idx_l, idx_h)  # new merged interval
                res_min = res
    return res_min, points[key[0]:key[1]+1], points[:key[0]] + points[key[1]+1:], key

def quadratic_algorithm(points):
    points.sort()
    m, M, res = 1e9, -1e9, 2e9
    idx = (0, 0)
    for l in range(len(points)):
        g, G = m, M
        for r in range(len(points)-1, l-1, -1):
            if r - l + 1 < len(points):
                res_n = points[r][0] - points[l][0] + G - g
                if res_n < res:
                    res = res_n
                    idx = (l, r)
            g = min(g, points[r][1])
            G = max(G, points[r][1])
        m = min(m, points[l][1])
        M = max(M, points[l][1])
    return res, points[idx[0]:idx[1]+1], points[:idx[0]] + points[idx[1]+1:], idx

# let's try it and compare running times to the quadratic_algorithm
# get some "random" points
c1 = 0
c2 = 0
for i in range(100):
    points = [(randrange(100), randrange(100)) for i in range(1, 200)]
    points.sort()  # sorted for x dimension
    s = time.time()
    r1 = argmin_algo(points)
    e1 = time.time()
    r2 = quadratic_algorithm(points)
    e2 = time.time()
    c1 += (e1 - s)
    c2 += (e2 - e1)
    if not r1[0] == r2[0]:
        print(r1, r2)
        raise Exception("Error, results are not equal")
print("time of argmin_algo", c1, "time of quadratic_algorithm", c2)
UPDATE: @Luka proved that the algorithm described in this answer is not exact. But I will keep it here because it is a good performance heuristic and opens the way to many probabilistic methods.
I will describe a loglinear algorithm. I couldn't find a counterexample, but I also couldn't find a proof :/
Let set A be ordered by first element and set B be ordered by second element. They are initially empty. Take floor(n/2) random points of your set of points and put in set A. Put the remaining points in set B. Define this as a partition.
Let's call a partition stable if you can't take an element of set A, put it in B and decrease the objective function and if you can't take an element of set B, put it in A and decrease the objective function. Otherwise, let's call the partition unstable.
For an unstable partition, the only interesting moves are the ones that take the first or the last element of A and move it to B, or take the first or the last element of B and move it to A. So we can find all interesting moves for a given unstable partition in O(1). If an interesting move decreases the objective function, do it. Keep going until the partition becomes stable. I conjecture that it takes at most O(n) moves for the partition to become stable. I also conjecture that at the moment the partition becomes stable, you will have a solution.
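A minimal R sketch of this local search (the helper names and the half-and-half start are choices made here for illustration); it recomputes the objective from scratch on every move instead of maintaining ordered sets, so it only illustrates the moves and the stopping rule, not the conjectured complexity:

obj <- function(A, B) {
  # (max - min) of first values in A plus (max - min) of second values in B
  diff(range(A[, 1])) + diff(range(B[, 2]))
}

local_search <- function(pts) {
  # pts: a two-column matrix of (first, second) pairs
  n <- nrow(pts)
  in_A <- sample(rep(c(TRUE, FALSE), length.out = n))   # random half-and-half start
  repeat {
    A <- pts[in_A, , drop = FALSE]
    B <- pts[!in_A, , drop = FALSE]
    best <- obj(A, B)
    # the "interesting" moves: first/last of A by first value, first/last of B by second value
    cand <- unique(c(which(in_A)[order(A[, 1])][c(1, nrow(A))],
                     which(!in_A)[order(B[, 2])][c(1, nrow(B))]))
    improved <- FALSE
    for (i in cand) {
      try_A <- in_A
      try_A[i] <- !try_A[i]
      if (sum(try_A) == 0 || sum(try_A) == n) next     # keep both sets non-empty
      val <- obj(pts[try_A, , drop = FALSE], pts[!try_A, , drop = FALSE])
      if (val < best) {
        in_A <- try_A
        improved <- TRUE
        break
      }
    }
    if (!improved) {
      return(list(value = best,
                  A = pts[in_A, , drop = FALSE],
                  B = pts[!in_A, , drop = FALSE]))
    }
  }
}

# the example from the question; per the UPDATE above, the result is not guaranteed optimal
pts <- rbind(c(0, 0), c(5, 5), c(1, 1), c(3, 4))
local_search(pts)$value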

Iterating through groups of 4

I have up to 16 datasets (only using 8 for the example here), and I'm trying to sort them into groups of 4, where the datasets within each group are as closely matched as possible. (Using a VBA macro in Excel).
My aim was to iterate through every possible combination of groups of 4 and compare how well matched each one is to the previous "best match", overwriting that if it is better.
I've got no problems comparing how well matched the groups are, but the code I have won't iterate through every possible combination.
My question is: why doesn't this code work? And if there is a better solution, please let me know.
For a = 1 To UBound(Whitelist) - 3
    For b = a + 1 To UBound(Whitelist) - 2
        For c = b + 1 To UBound(Whitelist) - 1
            For d = c + 1 To UBound(Whitelist)
                TempGroups(1, 1) = a: TempGroups(1, 2) = b: TempGroups(1, 3) = c: TempGroups(1, 4) = d
                For e = 1 To UBound(Whitelist) - 3
                    If InArray(TempGroups, e) = False Then
                        For f = e + 1 To UBound(Whitelist) - 2
                            If InArray(TempGroups, f) = False Then
                                For g = f + 1 To UBound(Whitelist) - 1
                                    If InArray(TempGroups, g) = False Then
                                        For h = g + 1 To UBound(Whitelist)
                                            If InArray(TempGroups, h) = False Then
                                                TempGroups(2, 1) = e: TempGroups(2, 2) = f: TempGroups(2, 3) = g: TempGroups(2, 4) = h
                                                If HowClose(Differences, TempGroups, 1) + HowClose(Differences, TempGroups, 2) < HowClose(Differences, Groups, 1) + HowClose(Differences, Groups, 2) Then
                                                    For x = 1 To 4
                                                        For y = 1 To 4
                                                            Groups(x, y) = TempGroups(x, y)
                                                        Next y
                                                    Next x
                                                End If
                                            End If
                                        Next h
                                    End If
                                Next g
                            End If
                        Next f
                    End If
                Next e
            Next d
        Next c
    Next b
Next a
For reference, UBound(Whitelist) can be taken as 8 (the number of datasets I have to match).
TempGroups is an array which I write to on each iteration, so it can be compared to...
Groups, the array which will contain the data sorted into matched groups.
The InArray function checks whether the value is already allocated to a group, as each dataset can only be in one group.
Thanks in advance!
[Images: Datasets; Relatively Well Matched Data; Fairly Poorly Matched Data]
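Not VBA, but as a sanity check on the enumeration being attempted, here is a short R sketch that lists every split of 8 items into two unordered groups of 4. Fixing item 1 in the first group generates each split exactly once, giving choose(7, 3) = 35 splits:

items <- 1:8
# choose the other 3 members of item 1's group; the second group is the complement
first_groups <- combn(items[-1], 3, FUN = function(g) c(1, g), simplify = FALSE)
splits <- lapply(first_groups, function(g) list(group1 = g, group2 = setdiff(items, g)))
length(splits)   # 35
splits[[1]]      # group1: 1 2 3 4, group2: 5 6 7 8

For 16 items the same idea extends by choosing 4, then 4 from the remaining 12, and so on, dividing out the orderings of the groups if the groups are unlabelled.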

apply sum along subsets of array 3rd dimension

I have the following objects:
A: one array with x, y, z dimensions -> containing a variable (temperature)
B & C: two arrays with x, y dimensions -> containing the indices of vectors along A's z dimension
A <- array(rnorm(n = 12*4*3*5), dim = c(4,3,5))
dimnames(A) <- list("x" = c(1:4), "y" = c(1:3), "z" = c(1:5))
B <- matrix(rep(c(2:1), 6), nrow = 4)
dimnames(B) <- list("x" = c(1:4), "y" = c(1:3))
C <- matrix(rep(c(4:5), 6), nrow = 4)
dimnames(C) <- list("x" = c(1:4), "y" = c(1:3))
I'm looking for a way to sum A across the z dimension, but only between the indices indicated by B and C.
If instead of the 3D-array I had a vector I would solve it like this:
> A <- round(c(rnorm(5)), 1)
> B <- 2 #index of first value to sum
> C <- 4 #index of last value to sum
> vindex <- seq(B,C,1)
> A
[1] 0.0 -0.9 -1.1 -1.7 -0.4
> vindex
[1] 2 3 4
> sum(A[vindex])
[1] -3.7
>
# or better with a function
> foo <- function(x, start_idx, end_idx) {
+ vidx <- seq(start_idx, end_idx, 1)
+ return(sum(x[vidx]))
+ }
>
> foo(A,B,C)
[1] -3.7
Unfortunately seq() does not accept vectors as arguments, and therefore it's not straightforward to use the apply function. If again A were A[x,y,z] and B and C were B[x,y] and C[x,y]:
> apply(A,c(1,2),foo,B,C)
Error in seq.default(start_idx, end_idx, 1) : 'from' must be of length 1
Called from: seq.default(start_idx, end_idx, 1)
It would be great if anybody knew how to make this function workable with apply or with other clean solutions.
Thanks a lot!
This is not a very nice task for base R, and I would prefer to implement it in C++ in the absence of a package that already does so (?).
Logically speaking, a plain but vectorized solution to your problem could be structured as:
# initialize index array
D <- array(
  1,
  dim = c(4,3,5),
  dimnames = list(x = letters[1:4], y = letters[1:3], z = letters[1:5])
)
# set indices out of bounds to zero
E <- rep(1:5, each = 4*3)
BB <- rep(B, times = 5)
D[E < BB] <- 0
CC <- rep(C, times = 5)
D[E > CC] <- 0
# multiply with index array and sum
apply(A * D, c(1,2), sum)
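As a quick sanity check, the result can be compared against a direct sum for a single (x, y) cell, using the three-dimensional A and the matrices B and C defined at the top of the question (res is just a name chosen here):

res <- apply(A * D, c(1, 2), sum)
# direct sum along z between the indices given by B and C for cell (1, 1)
res[1, 1]
sum(A[1, 1, B[1, 1]:C[1, 1]])   # should agree with res[1, 1]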

How to obtain elements of an array close to another array in MATLAB?

There must be an easier way to do this; an optimization method is also welcome. I have an array 'Y' and several parameters that have to be adjusted such that Y nears zero (= 'X'), as given in the MWE. Is there a much better procedure to minimize this difference? This is just an example equation; there can be 6 coefficients to optimize.
x = zeros(10,1)
y = rand(10,1)
for a = 1:0.1:4
    for b = 2:0.1:5
        for c = 3:0.1:6
            z = (a * y .^ 3 + b * y + c) - x
            if -1 <= range(z) <= 1
                a, b, c
                break
            end
        end
    end
end
I believe
p = polyfit(y, x, 2);
is what you are looking for, where p will be an array of your [a, b, c] coefficients.

How to merge two differently ordered arrays by columns

I am trying to cbind two differently ordered named arrays into a dataframe.
x = c("a" = 1, "z" = 10)
y = c("z" = 10, "a" = 1)
# Expected output:
# x y
# a 1 1
# z 10 10
I've tried the following, and all of them ignored the arrays' names:
# Unexpected outputs:
cbind(x,y)
merge(as.data.frame(x),as.data.frame(y))
library(dplyr); bind_cols(as.data.frame(x),as.data.frame(y))
In principle, I know that I could transform the arrays into data frames and then bind by row names, or I could match the names and index the arrays during binding.
I was wondering if there is a more straightforward way for such a straightforward task.
I came up with
x <- c("a" = 1, "z" = 10)
y <- c("z" = 10, "a" = 1)
cbind(x, "y"=y[names(x)])
   x  y
a  1  1
z 10 10
May not be optimal but maybe it is enough for your purposes...
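For reference, the merge-by-row-names route mentioned in the question would look something like this (it returns the key as a Row.names column rather than as row names):

merge(as.data.frame(x), as.data.frame(y), by = "row.names")
#   Row.names  x  y
# 1         a  1  1
# 2         z 10 10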
