How to extract numbers from an array in a cell - arrays

In one of the columns of my df, the values in the cell are reported as an array (e.g. [1,2,3,4,8]) as opposed to being just single numbers. This is because the question was a "select all that apply" question.
However, when I try to count how many of each number occurs, I am not able to do so because these numbers are nested within a list. How can I extract the numbers so that I am able to count them?
For example:
row 1: [1,2,3,4,8]
row 2: [3]
row 3: [1,2,3,4]
I want to be able to run a statement such as: nrow(df[df$column == 1,]) that will count all of the occurrences of the number 1. So, in this case, the output would be 2, but right now it says 0.

Here is a method using base R:
# set up data
df <- as.data.frame(c('[1,2,3,4,8]', '[3]', '[1,2,3,4]'))
colnames(df) <- c('data')
# strip off starting and ending brackets
stripped <- substr(df$data, 2, nchar(df$data)-1)
# split each row by comma
split <- strsplit(stripped, ',')
# flatten the list of numbers to a vector
numbers <- unlist(split)
# view table of frequency of each number
table(numbers)
output:
numbers
1 2 3 4 8
2 2 3 2 1
Getting the count of a single number:
# view count of a single number
length(which(numbers == '8'))
output:
[1] 1
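Since strsplit() returns character strings, here is a quick sketch for counting a single value numerically, matching the count the OP wanted:
# convert the character values to numeric and count a single value
nums <- as.numeric(numbers)
sum(nums == 1)
output:
[1] 2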

Related

R: convert JAGS output matrix to array based on column names

In R, I have a JAGS model output (made in parallel with jags.parfit from the dclone package) that is a list of six 2-dimensional matrices (corresponding to six chains each with 3000 reps) with column names equivalent to the indices of an array. The first digit has 3 unique values, the second 2000, the third 4, and the fourth 6.
head(colnames(m1[[1]]))
[1] "y.pred[1,1,1,1]" "y.pred[2,1,1,1]" "y.pred[3,1,1,1]" "y.pred[1,2,1,1]" "y.pred[2,2,1,1]" "y.pred[3,2,1,1]"
I want to convert each of these long-form matrices into an array with 5 dimensions: the 3000 reps as the first dimension and the 4 indices from the column names as the remaining dimensions. This array will have the following dimensions:
dim(m1.array)
[1] 3000 3 2000 4 6
Is there a relatively straightforward way to do this?
UPDATE
Based on the suggestion below, I was able to convert each matrix to the expected array with the following code:
m1.arrayList <- lapply(m1, function(x) array(x, dim = c(3000, 3, 2000, 4, 6)))
I was then able to convert the list of 5-dim arrays into a 6-dim array with:
m1.array <- simplify2array(m1.arrayList)
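As a quick sanity check (a sketch with arbitrarily chosen indices), the reshape works because array() fills in column-major order, matching the colnames above where the first bracket index varies fastest:
# element [rep, i, j, k, l] of the new array should equal the column
# named "y.pred[i,j,k,l]" of the original matrix, e.g.:
stopifnot(m1.arrayList[[1]][10, 2, 7, 3, 1] == m1[[1]][10, "y.pred[2,7,3,1]"])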

Find closest value in array column 4 where array column 1 and 2 match data of another array. Create a new array extracting the results

I have an extensive dataset in an array format
a=[X, Y, Z, value]. At the same time I have another array b=[X,Y] with all the unique combinations of coordinates (X,Y) for the same dataset.
I would like to generate a new array, where for a given z=100, it contains the records of the original array a[X,Y,Z,value] where the Z is closest to the given z=100 for each possible X,Y combination.
The purpose of this is to extract a Z slice of the original dataset at a given depth.
A description of the desired outcome would go like this:
np.in1d(a[:,0], b[:,0]) and np.in1d(a[:,1], b[:,1]) # for each row
#where both these two arguments are True
a[:,2] == z + min(abs(a[:,2]-z)) # find the rows where Z is closest to z=100
#and append these rows to a new array c[X,Y,Z,value]
The idea is to first find the unique X,Y data and effectively slice the dataset into X,Y columns of the domain, then search each of these columns to extract the row where Z is closest to the given z value.
Any suggestion, even for a much different approach, would be highly appreciated.
import numpy as np

a = np.random.rand(10000, 4) * [[20, 20, 200, 1]]  # data in a 20*20*200 space
a[:, :2] //= 1                   # integer coords for X,Y
bj = a.T[0] + 1j * a.T[1]        # complex-number trick for sorting on 2 cols at once
b = np.unique(bj)                # the unique (x,y) pairs
ib = bj.argsort()                # indices that sort the rows by (x,y)
splits = bj[ib].searchsorted(b)  # indices for splitting into (x,y) groups
xy = np.split(a[ib], splits)     # subsets of data grouped by (x,y); xy[0] is empty
c = np.array([s[abs(s.T[2] - 100).argmin()] for s in xy[1:]])  # closest-to-z row per group
print(c[:10])
gives:
[[ 0. 0. 110.44068611 0.71688432]
[ 0. 1. 103.64897184 0.31287547]
[ 0. 2. 100.85948189 0.74353677]
[ 0. 3. 105.28286975 0.98118126]
[ 0. 4. 99.1188121 0.85775638]
[ 0. 5. 107.53733825 0.61015178]
[ 0. 6. 100.82311896 0.25322798]
[ 0. 7. 104.16430907 0.26522796]
[ 0. 8. 100.47370563 0.2433701 ]
[ 0. 9. 102.40445547 0.89028359]]
At a higher level, with pandas:
import pandas as pd

labels = list('xyzt')
df = pd.DataFrame(a, columns=labels)
df['dist'] = abs(df.z - 100)
indices = df.groupby(['x', 'y'])['dist'].idxmin()  # row label of the closest z per (x,y)
c = df.loc[indices, labels].reset_index(drop=True)
print(c.head())
which gives:
x y z t
0 0 0 110.440686 0.716884
1 0 1 103.648972 0.312875
2 0 2 100.859482 0.743537
3 0 3 105.282870 0.981181
4 0 4 99.118812 0.857756
It is clearer, but 8x slower.

Excel average rows to array formula

I want to take the average of rows which would result in a column (array). Example input:
3 4
4 4
4 6
With an array formula I want to create:
3.5
4
5
The average is the sum of the numbers divided by the count of those numbers.
So first add them (A1:A3+B1:B3)
3+4 = 7
4+4 = 8
4+6 = 10
Then divide by the number of numbers (/2):
7/2 = 3.5
8/2 = 4
10/2 = 5
{=(A1:A3+B1:B3)/2}
Edit after comment from OP:
Formula for addition without adding a column manually, from https://productforums.google.com/forum/#!topic/docs/Q9x44sclzfY:
{=mmult(A1:B3,sign(transpose(column(A1:B3))))/Columns(A1:B3)}
This is one way to do that in Excel
=SUBTOTAL(1,OFFSET(A1:B3,ROW(A1:B3)-MIN(ROW(A1:B3)),0,1))
OFFSET supplies an "array of ranges", each range being a single row, and SUBTOTAL with 1 as the first argument averages each of those ranges. You can use this in another formula or function, or enter it in a range on the worksheet.
The advantage over Siphor's suggestion with MMULT is that this will still work even with blanks or text values in the range (those will be ignored).
If the first column is A and the second is B, then enter this formula in column C:
=AVERAGE(A1,B1)
and extend it to the last row.
You can also use a range if you have more than 2 columns (this function allows some cells to be empty):
=AVERAGE(A1:F1)

R read from file different-sized arrays

I need to apply the Mann Kendall trend test in R to a big number (about 1 million) of different-sized time series. I've already created a script that takes the time-series (practically a list of numbers) from all the files in a certain directory and then outputs the results to a .txt file.
The problem is that I have about 1 million time series, so creating 1 million files isn't exactly nice. So I thought that putting all the time series in only one .txt file (separated by some symbol, like "#" for example) could be more manageable. So I have a file like this:
1
2
4
5
4
#
2
13
34
#
...
I'm wondering, is it possible to extract such series (between two "#") in R and then apply the analysis?
EDIT
Following @acesnap's hints I'm using this code:
library(Kendall)
a <- read.table("to_r.txt")
numData <- 1017135
for (i in 1:numData){
  s1 <- subset(a, a$V1 == i)
  m <- MannKendall(s1$V2)
  cat(m[[1]], " ", m[[2]], " ", m[[3]], " ", m[[4]], " ", m[[5]], "\n",
      file = "monotonic_trend_checking.txt", append = TRUE)
}
This approach works, but the problem is that it takes ages to compute. Can you suggest a faster approach?
If you were to number the datasets as they go into the larger file, it would make things easier; you could then use a for loop and subsetting.
setNum data
1 1
1 2
1 4
1 5
1 4
2 2
2 13
2 34
... ...
Then do something like:
answers1 <- c()
numOfDataSets <- 1000000
for (i in 1:numOfDataSets){
  ss1 <- subset(bigData, bigData$setNum == i)  ## creates subset of each data set
  ans1 <- MannKendall(ss1$data)$tau            ## tau statistic from the Kendall package's test
  answers1 <- c(answers1, ans1)                ## inserts answer into vector
  print(paste(i, " | ", ans1, sep = ""))       ## prints which data set is in use
  flush.console()                              ## prints to console now instead of waiting
}
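If that loop is still too slow, here is a sketch of a vectorised variant (assuming the numbered two-column layout above and MannKendall() from the Kendall package the OP already uses): split() groups every series in one pass instead of re-scanning the whole table with subset() on each iteration, and the results are written once at the end.
library(Kendall)
a <- read.table("to_r.txt")   # V1 = set number, V2 = value
series <- split(a$V2, a$V1)   # one numeric vector per series
results <- t(sapply(series, function(s) unlist(MannKendall(s))))
write.table(results, file = "monotonic_trend_checking.txt")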
Here is perhaps a more elegant solution:
# Read in your data
x <- c('1','2','3','4','5','#','4','5','5','6','#','3','6','23','#')
# Build a list of the indices to split at:
ind <- c(0, which(x == '#'))
# Use those indices to split the vector into a list
lapply(seq(length(ind) - 1), function(y) as.numeric(x[(ind[y] + 1):(ind[y + 1] - 1)]))
Note that for this code to work, you must have a '#' character at the very end of the file.

Algorithm to find "most common elements" in different arrays

I have, for example, four arrays with some elements (numbers) in them:
1,4,8,10
1,2,3,4,11,15
2,4,20,21
2,30
I need to find the most common elements in those arrays, where every chosen element must go all the way till the end (see the rule below). In this example the solution is 4,4,4,2 (or the same one but with "30" on the end, it's the "same"), because it contains the smallest number of different elements (only two: 4 and 2/30).
This rule means a combination isn't good if a chosen number's run is broken: if I have, for example, "4", it must "go" till it ends, and once I switch away from it, no later array's pick may be "4" again.
EDIT2: Any other combination that breaks a number's run like that is NOT good.
Is there some algorithm to speed this thing up (if I have thousands of arrays with hundreds of elements in each one)?
To make it clear - the solution must contain the lowest number of different elements, and the groups (of the same numbers) must be ordered from the largest group first to the smallest group last. So in the example above, 4,4,4,2 is better than 4,2,2,2 because in the first the group of 4's is larger than the group of 2's.
EDIT: To be more specific: the solution must contain the smallest number of different elements, and those elements must be grouped from first to last. So if I have three arrays like
1,2,3
1,4,5
4,5,6
the solution is 1,1,4 or 1,1,5 or 1,1,6, NOT 2,5,5, because the 1's form a larger group (two of them) than the 2's (only one).
Thanks.
EDIT3: I can't be more specific :(
EDIT4: @spintheblack: 1,1,1,2,4 is the correct solution, because a number used the first time (let's say at position 1) can't be used later (except within the SAME group of 1's). I would say that grouping has "priority". Also, I didn't mention it (sorry about that), but the numbers in the arrays are NOT sorted in any way; I typed them that way in this post because it was easier for me to follow.
Here is the approach you want to take, if arrays is an array that contains each individual array.
1. Start at i = 0.
2. current = arrays[i]
3. Loop i from i+1 to len(arrays)-1:
4. new = current & arrays[i] (set intersection, finds common elements)
5. If there are any elements in new, do step 6, otherwise skip to step 7.
6. current = new; return to step 3 (continue loop).
7. Print or yield an element from current; current = arrays[i]; return to step 3 (continue loop).
Here is a Python implementation:
def mce(arrays):
    count = 1
    current = set(arrays[0])
    for i in range(1, len(arrays)):
        new = current & set(arrays[i])
        if new:
            count += 1
            current = new
        else:
            # the run ended: emit one of its common elements, then restart
            print(" ".join([str(current.pop())] * count), end=" ")
            count = 1
            current = set(arrays[i])
    print(" ".join([str(current.pop())] * count))
>>> mce([[1, 4, 8, 10], [1, 2, 3, 4, 11, 15], [2, 4, 20, 21], [2, 30]])
4 4 4 2
If all are number lists, and all are sorted, then:
1. Convert each list to a bitmap.
2. Keep AND'ing the bitmaps till you hit zero. The position of a 1 bit in the previous (non-zero) value identifies the element for that run.
3. Restart step 2 from the next list.
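A rough sketch of that bitmap idea (my own illustration in base R, not the answerer's code; it assumes all values fit below bit 31 and emits the highest common element of each run):
to_mask <- function(v) Reduce(bitwOr, bitwShiftL(1L, v))  # one bit per value
highest <- function(mask) max(which(bitwAnd(mask, bitwShiftL(1L, 0:30)) != 0)) - 1
arrays <- list(c(1,4,8,10), c(1,2,3,4,11,15), c(2,4,20,21), c(2,30))
masks <- sapply(arrays, to_mask)
result <- c(); current <- masks[1]; count <- 1
for (m in masks[-1]) {
  if (bitwAnd(current, m) != 0) {  # intersection still non-empty: extend the run
    current <- bitwAnd(current, m); count <- count + 1
  } else {                         # run ended: emit it and restart
    result <- c(result, rep(highest(current), count))
    current <- m; count <- 1
  }
}
c(result, rep(highest(current), count))  # 4 4 4 30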
This has now turned into a graph problem with a twist.
The problem is a directed acyclic graph of connections between stops, and the goal is to minimize the number of line switches when riding a train/tram.
ie. this list of sets:
1,4,8,10 <-- stop A
1,2,3,4,11,15 <-- stop B
2,4,20,21 <-- stop C
2,30 <-- stop D, destination
He needs to pick lines that are available at his exit stop, and his arrival stop, so for instance, he can't pick 10 from stop A, because 10 does not go to stop B.
So, this is the set of available lines and the stops they stop on:
A B C D
line 1 -----X-----X-----------------
line 2 -----------X-----X-----X-----
line 3 -----------X-----------------
line 4 -----X-----X-----X-----------
line 8 -----X-----------------------
line 10 -----X-----------------------
line 11 -----------X-----------------
line 15 -----------X-----------------
line 20 -----------------X-----------
line 21 -----------------X-----------
line 30 -----------------------X-----
If we consider that a line under consideration must go between at least 2 consecutive stops, let me highlight the possible choices of lines with equal signs:
A B C D
line 1 -----X=====X-----------------
line 2 -----------X=====X=====X-----
line 3 -----------X-----------------
line 4 -----X=====X=====X-----------
line 8 -----X-----------------------
line 10 -----X-----------------------
line 11 -----------X-----------------
line 15 -----------X-----------------
line 20 -----------------X-----------
line 21 -----------------X-----------
line 30 -----------------------X-----
He then needs to pick a way that transports him from A to D, with the minimal number of line switches.
Since he explained that he wants the longest rides first, the following sequence seems the best solution:
take line 4 from stop A to stop C, then switch to line 2 from C to D
Code example:
stops = [
    [1, 4, 8, 10],
    [1, 2, 3, 4, 11, 15],
    [2, 4, 20, 21],
    [2, 30],
]

def calculate_possible_exit_lines(stops):
    """
    only return lines that are available at both exit
    and arrival stops, discard the rest.
    """
    result = []
    for index in range(0, len(stops) - 1):
        lines = []
        for value in stops[index]:
            if value in stops[index + 1]:
                lines.append(value)
        result.append(lines)
    return result

def all_combinations(lines):
    """
    produce all combinations which travel from one end
    of the journey to the other, across available lines.
    """
    if not lines:
        yield []
    else:
        for line in lines[0]:
            for rest_combination in all_combinations(lines[1:]):
                yield [line] + rest_combination

def reduce(combination):
    """
    reduce a combination by returning the number of
    times each value appears consecutively, i.e.
    [1,1,4,4,3] would return (2,2,1) since
    the 1's appear twice, the 4's appear twice, and
    the 3 appears only once.
    """
    result = []
    while combination:
        count = 1
        value = combination[0]
        combination = combination[1:]
        while combination and combination[0] == value:
            combination = combination[1:]
            count += 1
        result.append(count)
    return tuple(result)

def calculate_best_choice(lines):
    """
    find the best choice by reducing each available
    combination down to the number of stops you can
    sit on a single line before having to switch,
    and then picking the one that has the most stops
    first, and then so on.
    """
    available = []
    for combination in all_combinations(lines):
        count_stops = reduce(combination)
        available.append((count_stops, combination))
    available = [k for k in reversed(sorted(available))]
    return available[0][1]

possible_lines = calculate_possible_exit_lines(stops)
print("possible lines: %s" % (str(possible_lines), ))
best_choice = calculate_best_choice(possible_lines)
print("best choice: %s" % (str(best_choice), ))
This code prints:
possible lines: [[1, 4], [2, 4], [2]]
best choice: [4, 4, 2]
Since, as I said, I list the lines between stops, the above solution can be read either as the lines you exit each stop on, or as the lines you arrive at the next stop on.
So the route is:
Hop onto line 4 at stop A and ride on that to stop B, then to stop C
Hop onto line 2 at stop C and ride on that to stop D
There are probably edge-cases here that the above code doesn't work for.
However, I'm not bothering more with this question. The OP has demonstrated a complete incapability in communicating his question in a clear and concise manner, and I fear that any corrections to the above text and/or code to accommodate the latest comments will only provoke more comments, which leads to yet another version of the question, and so on ad infinitum. The OP has gone to extraordinary lengths to avoid answering direct questions or to explain the problem.
I am assuming that "distinct elements" do not have to actually be distinct; they can repeat in the final solution. That is, if presented with [1], [2], [1], the obvious answer [1, 2, 1] is allowed. But we'd count this as having 3 distinct elements.
If so, then here is a Python solution:
def find_best_run(first_array, *argv):
    # initialize data structures.
    this_array_best_run = {}
    for x in first_array:
        this_array_best_run[x] = (1, (1,), (x,))

    for this_array in argv:
        # find the best runs ending at each value in this_array
        last_array_best_run = this_array_best_run
        this_array_best_run = {}
        for (y, pattern) in last_array_best_run.items():
            pass  # (see loop below)
        for x in this_array:
            for (y, pattern) in last_array_best_run.items():
                (distinct_count, lengths, elements) = pattern
                if x == y:
                    lengths = tuple(lengths[:-1] + (lengths[-1] + 1,))
                else:
                    distinct_count += 1
                    lengths = tuple(lengths + (1,))
                    elements = tuple(elements + (x,))
                if x not in this_array_best_run:
                    this_array_best_run[x] = (distinct_count, lengths, elements)
                else:
                    (prev_count, prev_lengths, prev_elements) = this_array_best_run[x]
                    if distinct_count < prev_count or (distinct_count == prev_count and prev_lengths < lengths):
                        this_array_best_run[x] = (distinct_count, lengths, elements)

    # find the best overall run
    best_count = len(argv) + 10  # needs to be bigger than any possible answer
    for (distinct_count, lengths, elements) in this_array_best_run.values():
        if distinct_count < best_count:
            best_count = distinct_count
            best_lengths = lengths
            best_elements = elements
        elif distinct_count == best_count and best_lengths < lengths:
            best_lengths = lengths
            best_elements = elements

    # convert it into a more normal representation.
    answer = []
    for (length, element) in zip(best_lengths, best_elements):
        answer.extend([element] * length)
    return answer

# example
print(find_best_run(
    [1, 4, 8, 10],
    [1, 2, 3, 4, 11, 15],
    [2, 4, 20, 21],
    [2, 30]))  # prints [4, 4, 4, 2]; [4, 4, 4, 30] is an equally good run
Here is an explanation. The *_best_run dictionaries have keys that are elements in the current array, and values that are tuples (distinct_count, lengths, elements). We are trying to minimize distinct_count, then maximize lengths (lengths is a tuple, so this prefers the run with the largest count in the first spot), and we track elements so we can reconstruct the answer at the end. At each step I construct all possible runs that combine a run up to the previous array with this element next in sequence, and keep whichever is best ending at the current element. When I get to the end I pick the best possible overall run, then turn it into a conventional representation and return it.
If you have N arrays of length M, this should take O(N*M*M) time to run.
I'm going to take a crack at this based on the comments; please feel free to comment further to clarify.
We have N arrays and we are trying to find the "most common" value over all arrays when one value is picked from each array. There are several constraints: 1) we want the smallest number of distinct values; 2) the most common is the maximal grouping of similar letters (switching to letters here for clarity). Thus, 4 t's and 1 p beats 3 x's and 2 y's.
I don't think either problem can be solved greedily - here's a counterexample: given [[1,4],[1,2],[1,2],[2],[3,4]], a greedy algorithm would pick [1,1,1,2,4] (3 distinct numbers) instead of [4,2,2,2,4] (two distinct numbers).
This looks like a bipartite matching problem, but I'm still coming up with the formulation...
EDIT: ignore the above; it's a different problem, but if anyone can figure it out, I'd be really interested.
EDIT 2: For anyone that's interested, the problem that I misinterpreted can be formulated as an instance of the Hitting Set problem, see http://en.wikipedia.org/wiki/Vertex_cover#Hitting_set_and_set_cover. Basically, the left-hand side of the bipartite graph would be the arrays and the right-hand side would be the numbers; edges would be drawn between each array and the numbers it contains. Unfortunately, this is NP-complete, but the greedy solutions described above are essentially the best approximation.
