Element-wise comparison in R - arrays

I'm attempting to write a for loop that will compare values between every pair of different individuals, but never an individual with itself. The following data frame contains values for five subjects:
         Value1
Subject1      0
Subject2      1
Subject3      5
Subject4      6
Subject5      8
I've written a double loop that creates a 'Value2' variable based on the following criteria:
If the subject has a larger Value1, then the result is +1.
If the subject has an equal Value1, then the result is 0.
If the subject has a smaller Value1, then the result is -1.
For example, Subject 1's Value1 is smaller than the other four subjects; this should result in -4. So far the loop I've written works for the first subject but fails to iterate to the second subject.
Value2 <- 0
i = 0
w = 0
for(i in 1:length(Value1)){
  for(j in 1:length(Value1)){
    if(i != j){
      Value1[i] = w
      if(w > Value1[j]){
        Value2[i] = Value2[i] + 1
      }
      if(w < Value1[j]){
        Value2[i] = Value2[i] - 1
      }
      if(w == Value1[j]){
        Value2[i] = Value2[i] + 0
      }
    }
  }
}

If I'm understanding the problem correctly, this should give you what you want:
x <- c(0, 1, 5, 6, 8)
colSums(outer(x, x, '<')) - colSums(outer(x, x, '>'))
# [1] -4 -2 0 2 4
Or
-colSums(sign(outer(x, x, '-')))
# [1] -4 -2 0 2 4
Edit: If your vector is large (or even if it isn't, really) use d.b.'s rank method instead. The outer function will create an NxN matrix where N is the length of x. For example, when x is sample(1e5) outer will attempt to create a matrix >30Gb in size! This means most people's laptops in 2019 don't even have enough memory for this method to work on large vectors. With this same x, the method using rank provided by d.b. returns the result almost instantly.
Benchmark for vector of size 1000
x <- sample(1000)
microbenchmark(
  outer_diff = colSums(-sign(outer(x, x, '-'))),
  outer_gtlt = colSums(outer(x, x, '<')) - colSums(outer(x, x, '>')),
  rank = {r <- rank(x); 2*(r - mean(r))}
)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# outer_diff 15930.26 16872.4175 20946.2980 18030.776 25346.677 38668.324 100 b
# outer_gtlt 14168.21 15120.4165 28970.7731 16698.264 23857.651 352390.298 100 b
# rank 111.18 141.5385 170.8885 177.026 188.513 282.257 100 a

x = c(0, 1, 5, 6, 8)
r = rank(x)
ans = 2 * (r - mean(r))
ans
#[1] -4 -2  0  2  4
This works because, with no ties, a value of rank r is larger than r - 1 of the other values and smaller than n - r of them, so its score is (r - 1) - (n - r) = 2r - (n + 1) = 2*(r - mean(r)), since mean(r) = (n + 1)/2. The identity also holds with ties, because rank() assigns average ranks.

@IceCreamToucan's benchmark considers cases with distinct values (sampling without replacement), but if we extend to repeated values (covered by criterion 2 in the OP), I figured tabulating first saves time.
library(data.table)
# from @d.b's answer and comments from d.b, ICT
fdb = function(x) {
  r = frank(x)
  2 * (r - mean(r))
}
# from @chinsoon's comment and some algebra
fdb2 = function(x) {
  r = frank(x)
  2 * r - length(x) - 1
}
# tabulation with data.table
ff = function(x){
  nx = length(x)
  xDT = setDT(list(x=x))
  resDT = xDT[, .N, keyby=x][, res := 2L*cumsum(N) - N - nx]
  resDT[xDT, x.res]
}
Sample data and results:
nv = 1e4 # number of values
n = 1e7 # length of vector
x = sample(nv, n, replace=TRUE)
system.time(res_fdb <- fdb(x))
# user system elapsed
# 0.32 0.09 0.24
system.time(res_fdb2 <- fdb2(x))
# user system elapsed
# 0.25 0.13 0.27
system.time(res_ff <- ff(x))
# user system elapsed
# 0.58 0.24 0.50
identical(res_ff, as.integer(res_fdb)) # TRUE
identical(res_ff, as.integer(res_fdb2)) # TRUE
It turns out ff() is not as fast as direct use of data.table::frank, taking roughly twice as long, because grouping by distinct values is done twice: once to count and again in the lookup.
I guess the tabulation can also be done with base R's table.
ft = function(x){
  nx = length(x)
  N = table(x)
  cN = cumsum(N)
  res = 2L*cN - N - nx
  as.vector(res[as.character(x)])
}
system.time(res_ft <- ft(x))
# user system elapsed
# 7.58 0.34 7.93
identical(res_ff, res_ft)
# [1] TRUE

Related

Selecting elements from a vector based on condition on another vector

I want to know how to select those numbers which correspond (i.e. same position) to my pre-defined numbers.
For example, I have these vectors:
a = [ 1 0.1 2 3 0.1 0.5 4 0.1];
b = [100 200 300 400 500 600 700 800]
I need to select elements from b which correspond to the positions of the whole numbers in a (1, 2, 3 and 4), so the output must be:
output = [1 100
2 300
3 400
4 700]
How can this be done?
Create a logical index based on a, and apply it to both a and b to get the desired result:
ind = ~mod(a,1); % true for integer numbers
output = [a(ind); b(ind)].'; % build result
round(x) == x ----> x is a whole number
round(x) ~= x ----> x is not a whole number
round(2.4) = 2 ------> round(2.4) ~= 2.4 --> 2.4 is not a whole number
round(2) = 2 --------> round(2) == 2 ----> 2 is a whole number
Following the same logic:
a = [ 1 0.1 2 3 0.1 0.5 4 0.1];
b = [100 200 300 400 500 600 700 800];
iswhole = (round(a) == a);
output = [a(iswhole); b(iswhole)]
Result:
output =
1 2 3 4
100 300 400 700
We can generate a logical index based on a using the fix() function:
ind = (a==fix(a));
output = [a(ind); b(ind)]'
Although the intention is not entirely clear, logical indexing into the vectors is the solution.
My solution is
checkint = @(x) ~isinf(x) & floor(x) == x % very fast on a big array
[a(checkint(a))' b(checkint(a))']
The key here is creating a logical index into a and b that marks the integer values in a. This checkint function does a good job of checking for integers.
Other approaches to checking for integers could be
checkint = @(x) double(uint64(x))==x % slower, but it works fine
or
checkint = @(x) mod(x,1) == 0 % slowest, but robust and easiest to understand
or
checkint = @(x) ~mod(x,1) % slowest; relies on ~ treating 0 as false
It's been discussed in many other threads.

Total numbers having frequency k in a given range

How do I find the count of numbers having frequency exactly k in a given range (l, r) of an array? There are 10^5 queries of the form l, r, and each query is built from the previous query's answer: both l and r are incremented by the previous answer and reduced modulo n, and l and r are swapped if l > r. Note that 0 <= a[i] <= 10^9 and the array has n = 10^5 elements.
My Attempt:
n, k, q = map(int, input().split())
a = list(map(int, input().split()))
ans = 0
for _ in range(q):
    l, r = map(int, input().split())
    l += ans
    l %= n
    r += ans
    r %= n
    if l > r:
        l, r = r, l
    d = {}
    for i in a[l:r+1]:
        try:
            d[i] += 1
        except:
            d[i] = 1
    curr_ans = 0
    for i in d.keys():
        if d[i] == k:
            curr_ans += 1
    ans = curr_ans
    print(ans)
Sample Input:
5 2 3
7 6 6 5 5
0 4
3 0
4 1
Sample Output:
2
1
1
If the number of different values in the array is not too large, you may consider storing prefix-count arrays as long as the input array, one per unique value, counting the number of appearances of that value up to each point. Then you just need to subtract the counts at the beginning of the range from the counts just past its end to find how many frequency matches there are:
def range_freq_queries(seq, k, queries):
    n = len(seq)
    c = freq_counts(seq)
    result = [0] * len(queries)
    offset = 0
    for i, (l, r) in enumerate(queries):
        result[i] = range_freq_matches(c, offset, l, r, k, n)
        offset = result[i]
    return result

def freq_counts(seq):
    s = {v: i for i, v in enumerate(set(seq))}
    counts = [None] * (len(seq) + 1)
    counts[0] = [0] * len(s)
    for i, v in enumerate(seq, 1):
        counts[i] = list(counts[i - 1])
        j = s[v]
        counts[i][j] += 1
    return counts

def range_freq_matches(counts, offset, start, end, k, n):
    start, end = sorted(((start + offset) % n, (end + offset) % n))
    return sum(1 for cs, ce in zip(counts[start], counts[end + 1]) if ce - cs == k)
seq = [7, 6, 6, 5, 5]
k = 2
queries = [(0, 4), (3, 0), (4, 1)]
print(range_freq_queries(seq, k, queries))
# [2, 1, 1]
You can do it faster with NumPy, too. Since each result depends on the previous one, you will have to loop in any case, but you can use Numba to really speed things up:
import numpy as np
import numba as nb

def range_freq_queries_np(seq, k, queries):
    seq = np.asarray(seq)
    c = freq_counts_np(seq)
    return _range_freq_queries_np_nb(seq, k, queries, c)

@nb.njit  # This is not necessary but will make things faster
def _range_freq_queries_np_nb(seq, k, queries, c):
    n = len(seq)
    offset = np.int32(0)
    out = np.empty(len(queries), dtype=np.int32)
    for i, (l, r) in enumerate(queries):
        l = (l + offset) % n
        r = (r + offset) % n
        l, r = min(l, r), max(l, r)
        out[i] = np.sum(c[r + 1] - c[l] == k)
        offset = out[i]
    return out

def freq_counts_np(seq):
    uniq = np.unique(seq)
    seq_pad = np.concatenate([[uniq.max() + 1], seq])
    comp = seq_pad[:, np.newaxis] == uniq
    return np.cumsum(comp, axis=0)
seq = np.array([7, 6, 6, 5, 5])
k = 2
queries = [(0, 4), (3, 0), (4, 1)]
print(range_freq_queries_np(seq, k, queries))
# [2 1 1]
Let's compare it with the original algorithm:
from collections import Counter

def range_freq_queries_orig(seq, k, queries):
    n = len(seq)
    ans = 0
    counter = Counter()
    out = [0] * len(queries)
    for i, (l, r) in enumerate(queries):
        l += ans
        l %= n
        r += ans
        r %= n
        if l > r:
            l, r = r, l
        counter.clear()
        counter.update(seq[l:r+1])
        ans = sum(1 for v in counter.values() if v == k)
        out[i] = ans
    return out
Here is a quick test and timing:
import random
import numpy as np

# Make random input
random.seed(0)
seq = random.choices(range(1000), k=5000)
queries = [(random.choice(range(len(seq))), random.choice(range(len(seq))))
           for _ in range(20000)]
k = 20

# Input as array for NumPy version
seq_arr = np.asarray(seq)
# Check all functions return the same result
res1 = range_freq_queries_orig(seq, k, queries)
res2 = range_freq_queries(seq, k, queries)
print(all(r1 == r2 for r1, r2 in zip(res1, res2)))
# True
res3 = range_freq_queries_np(seq_arr, k, queries)
print(all(r1 == r3 for r1, r3 in zip(res1, res3)))
# True
# Timings
%timeit range_freq_queries_orig(seq, k, queries)
# 3.07 s ± 1.11 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit range_freq_queries(seq, k, queries)
# 1.1 s ± 307 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit range_freq_queries_np(seq_arr, k, queries)
# 265 ms ± 726 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Obviously the effectiveness of this depends on the characteristics of the data. In particular, if most values are distinct, the time and memory cost of constructing the counts table approaches O(n^2).
Let's say the input array is A, with |A| = n. I'm going to assume that the number of distinct elements in A is much smaller than n.
We can divide A into sqrt(n) segments, each of size sqrt(n). For each of these segments we can build a map from element to count; building these maps takes O(n) time in total.
With that preprocessing done, we can answer each query by adding together all the segment maps wholly contained in (l, r), of which there are at most sqrt(n), then adding the leftover elements at the ends one by one (or going one segment over and subtracting), also at most O(sqrt(n)) of them.
Since each segment map has at most min(k, sqrt(n)) entries, where k is the number of distinct elements, merging the maps costs O(sqrt(n) * min(k, sqrt(n))) per query, i.e. O(n) in the worst case where every element of A is distinct.
You can keep track of how many elements have the desired count while combining the maps and the extra elements, as in the sketch below.
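Below is a minimal sketch of this segment-count idea, under the assumption that the chained-offset handling of l and r from the question is done separately; the names build_segments and count_freq_k are just illustrative, not from the original answer.
from collections import Counter
from math import isqrt

def build_segments(a):
    # Split a into segments of size ~sqrt(n) and count values in each.
    n = len(a)
    size = max(1, isqrt(n))
    return size, [Counter(a[i:i + size]) for i in range(0, n, size)]

def count_freq_k(a, size, seg_counts, l, r, k):
    # Count values occurring exactly k times in a[l..r] (inclusive).
    total = Counter()
    i = l
    while i <= r:
        seg = i // size
        seg_start = seg * size
        seg_end = seg_start + size - 1
        if i == seg_start and seg_end <= r:
            total.update(seg_counts[seg])   # whole segment lies inside the range
            i = seg_end + 1
        else:
            total[a[i]] += 1                # partial segment: add element by element
            i += 1
    return sum(1 for c in total.values() if c == k)

a = [7, 6, 6, 5, 5]
size, seg_counts = build_segments(a)
print(count_freq_k(a, size, seg_counts, 0, 4, 2))  # 2
print(count_freq_k(a, size, seg_counts, 0, 2, 2))  # 1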

Arranging a 3 dimensional contingency table in R in order to run a Cochran-Mantel-Haenszel analysis?

I am attempting to run a Mantel-Haenszel analysis in R to determine whether or not a comparison of proportions test is still significant when accounting for a 'diagnosis' ratio within groups. This test is available in the stats package.
library(stats)
mantelhaen.test(x)
Having done some reading, I've found that this test can perform an odds ratio test on a contingency table that is n x n x k, as opposed to simply n x n. However, I am having trouble arranging my data in the proper way, as I am fairly new to R. I have created some example data...
ex.label <- c("A","A","A","A","A","A","A","B","B","B")
ex.status <- c("+","+","-","+","-","-","-","+","+","-")
ex.diag <- c("X","X","Z","Y","Y","Y","X","Y","Z","Z")
ex.data <- data.frame(ex.label,ex.diag,ex.status)
Which looks like this...
ex.label ex.diag ex.status
1 A X +
2 A X +
3 A Z -
4 A Y +
5 A Y -
6 A Y -
7 A X -
8 B Y +
9 B Z +
10 B Z -
I was originally able to use a simple N-1 chi-square to run a comparison of proportions test of + to - for just A and B, which basically compares the significance of the ratio in each column, but now I want to be able to account for ex.diag as well.
I tried to use the ftable() function to arrange my data in a way that would work.
ex.ftable <- ftable(ex.data)
Which looks like this...
                  ex.status - +
ex.label ex.diag
A        X                  1 2
         Y                  2 1
         Z                  1 0
B        X                  0 0
         Y                  0 1
         Z                  1 1
However, when I run mantelhaen.test(ex.ftable), I get the error 'x' must be a 3-dimensional array. How can I arrange my data in such a way that I can actually run this test?
In mantelhaen.test, the last dimension of the 3-dimensional contingency table x needs to be the stratification variable (ex.diag). This array can be generated as follows:
ex.label <- c("A","A","A","A","A","A","A","B","B","B")
ex.status <- c("+","+","-","+","-","-","-","+","+","-")
ex.diag <- c("X","X","Z","Y","Y","Y","X","Y","Z","Z")
# Now ex.diag is in the first column
ex.data <- data.frame(ex.diag, ex.label, ex.status)
# The flat table
( ex.ftable <- ftable(ex.data) )
#                   ex.status - +
# ex.diag ex.label
# X       A                   1 2
#         B                   0 0
# Y       A                   2 1
#         B                   0 1
# Z       A                   1 0
#         B                   1 1
The 3D array can be generated using aperm:
# Transform the ftable into a 2 x 2 x 3 array
# First dimension: ex.label
# Second dimension: ex.status
# Third dimension: ex.diag
( mtx3D <- aperm(array(t(as.matrix(ex.ftable)), c(2,2,3)), c(2,1,3)) )
# , , 1
#
# [,1] [,2]
# [1,] 1 2
# [2,] 0 0
#
# , , 2
#
# [,1] [,2]
# [1,] 2 1
# [2,] 0 1
#
# , , 3
#
# [,1] [,2]
# [1,] 1 0
# [2,] 1 1
Now the Cochran-Mantel-Haenszel chi-squared test can be performed.
# Cochran-Mantel-Haenszel chi-squared test of the null that
# two nominal variables are conditionally independent in each stratum
#
mantelhaen.test(mtx3D, exact=FALSE)
The result of the test is
Mantel-Haenszel chi-squared test with continuity correction
data: mtx3D
Mantel-Haenszel X-squared = 0.23529, df = 1, p-value = 0.6276
alternative hypothesis: true common odds ratio is not equal to 1
95 percent confidence interval:
NaN NaN
sample estimates:
common odds ratio
Inf
Given the low number of cases, it is preferable to compute an exact conditional test (option exact=TRUE).
mantelhaen.test(mtx3D, exact=T)
# Exact conditional test of independence in 2 x 2 x k tables
#
# data: mtx3D
# S = 4, p-value = 0.5
# alternative hypothesis: true common odds ratio is not equal to 1
# 95 percent confidence interval:
# 0.1340796 Inf
# sample estimates:
# common odds ratio
# Inf

IndexError: index 10 is out of bounds for axis 0 with size 10

I am numerically setting up a mesh: an x-grid and x-vector (position, which should only run from 0 to 20) and also a time grid (t from 0 to 1000), in order to solve a heat equation. But every time I, for example, make the number of steps 10, I get an error:
"Traceback (most recent call last):
File "/home/universe/Desktop/Python/Heat_1.py", line 33, in <module>
x[i] = a + i*h
IndexError: index 10 is out of bounds for axis 0 with size 10"
Here is my code:
from math import sin, pi
import numpy
import numpy as np

# Constant variables
N = int(input("Number of intervals in x (<=20):"))
M = int(input("Number of time steps (<=1000):"))

# Some initialised variables
a = 0.0
b = 1.0
t_min = 0.0
t_max = 0.5

# Array variables
x = np.linspace(a, b, M)
t = np.linspace(t_min, t_max, M)

# Some scalar variables
n = []          # the number of x-steps
i, s = [], []   # the position and time

# Get the number of x-steps to use
for n in range(0, N):
    if n > 0 or n <= N:
        continue

# Get the number of time steps to use
for m in range(0, M):
    if m > 0 or n <= M:
        continue

# Set up x-grid and x-vector
h = (b - a) / n
for i in range(0, N+1):
    x[i] = a + i*h

# Set up time-grid
k = (t_max - t_min) / m
for s in range(0, M+1):
    t[s] = t_min + k*s

print(x, t)
You try to index outside the range:
for s in range(0, M+1):
    t[s] = t_min + k*s
Change it to:
for s in range(M):
    t[s] = t_min + k*s
and it works. (The traceback actually points at the x loop, which has the same kind of off-by-one: x was created with M points, but that loop runs i from 0 to N inclusive.)
You create t with length M:
t = np.linspace(t_min, t_max, M)
So you can only access M elements in t.
Python always starts indexing with zero. Therefore:
for s in range(M):
will do M loops, while:
for s in range(0, M+1):
will do M+1 loops.
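As a minimal sketch, assuming the intent is N intervals in x (so N + 1 grid points) and M time steps (so M + 1 time points), np.linspace can build both grids directly, so the explicit fill loops are not needed at all:
import numpy as np

N = 10    # number of x intervals -> N + 1 grid points
M = 100   # number of time steps  -> M + 1 time points
a, b = 0.0, 1.0
t_min, t_max = 0.0, 0.5

x = np.linspace(a, b, N + 1)          # x[0] = a, x[N] = b
t = np.linspace(t_min, t_max, M + 1)  # t[0] = t_min, t[M] = t_max

h = (b - a) / N                       # x step, equals x[1] - x[0]
k = (t_max - t_min) / M               # time step, equals t[1] - t[0]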

How to loop through an array in pairs?

I have a 1000x1 vector (1000 rows and 1 column). I want to get elements in pairs (row 1 and row 2, row 3 and row 4, row 5 and row 6, etc.)
Here's what I have so far
for (j in 1:ncol(total_loci)){
  for (i in 1:sample_size){
    # a pair
    genotype[i] <- paste(total_loci[i, j], total_loci[i+1, j], sep="")
  }
}
Genotype should thus be a 500x1 vector (500 rows and 1 column) containing the genotypes. Assume that my for-loops are correct. I think my i needs to skip every other index -- so i should start at 1, then 3, 5, 7, 9, etc. The variable total_loci is of class data frame.
You should try to use vectorized solutions where possible. They're usually more memory efficient and faster than loops.
In this case, you can use seq to generate an index vector for every other element. Then you can use that index vector to subset the original vector in pairs.
# sample data
x <- replicate(5, sample(LETTERS, 1000, replace=TRUE), simplify=FALSE)
x <- as.data.frame(x, stringsAsFactors=FALSE)
names(x) <- paste("V", 1:NCOL(x), sep="")
# function to concatenate every other observation as a pair
f <- function(x) {
  s <- seq(2, length(x), 2)
  paste(x[s-1], x[s], sep="")
}
# run algorithm for each column
y <- as.data.frame(lapply(x, f), stringsAsFactors=FALSE)
Here is a general approach for processing an array in consecutive chunks of n elements. You can set n = 2 to process it by pairs.
First, here is a function that splits a vector n-by-n, returning a list of n elements:
n.by.n <- function(x, n) split(x, 1 + (seq_along(x) - 1) %% n)
n.by.n(x = 1:24, n = 2)
# $`1`
# [1] 1 3 5 7 9 11 13 15 17 19 21 23
#
# $`2`
# [1] 2 4 6 8 10 12 14 16 18 20 22 24
Then you can run any function on the slices using mapply, via do.call:
do.call(mapply, c(FUN = paste, n.by.n(x = 1:24, n = 2), sep = "_"))
# [1] "1_2" "3_4" "5_6" "7_8" "9_10" "11_12" "13_14" "15_16"
# [9] "17_18" "19_20" "21_22" "23_24"
do.call(mapply, c(FUN = paste, n.by.n(x = 1:24, n = 6), sep = "_"))
# [1] "1_2_3_4_5_6" "7_8_9_10_11_12" "13_14_15_16_17_18"
# [4] "19_20_21_22_23_24"
Here is a way to do it without any apply family calls or loops:
# Generate some sample data.
total_loci <- data.frame(genotype=sample(LETTERS, 500, replace=TRUE))
# Paste consecutive rows together using recycled logical indexes.
paste0(total_loci[c(TRUE,FALSE),],
       total_loci[c(FALSE,TRUE),])
