Using multiple predictors in mixed-model in Python - data-modeling

I'm working in Python 3.6 using Spyder on Windows 10.
I would like to be able to predict "Speed" using some predictors such as "level", "distraction", "target", etc. However, my data has several levels:
Participant 1: Session 1 (levels 1 to m), Session 2 (levels m+1 to n), ..., Session 20 (levels x to z)
Participant 2: Session 1 (levels 1 to m), Session 2 (levels m+1 to n), ..., Session 20 (levels x to z)
...
Participant n: Session 1 (levels 1 to m), Session 2 (levels m+1 to n), ..., Session 20 (levels x to z)
For each level, I'm measuring Speed, Distraction, Target, etc.
This is how my data looks.
So, I think the best model for my data is a mixed model (random-effects model), because the relation between my dependent variable (Speed) and the predictors may vary across different sessions and different participants.
I'm new to Python and I was wondering how I'd regress these in Python to get the mixed-model regression.
I think this is the code for only using one predictor:
model = smf.mixedlm("Speed ~ target", data=df, groups=df[["Participant","Sessions"]]).fit()
What if I want to use multiple predictors?
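For what it's worth, below is a minimal sketch of one way to do this with statsmodels. It is my sketch, not a confirmed answer: it assumes the DataFrame has columns named target, level, distraction, Participant and Session, and since mixedlm takes a single grouping variable, it uses Participant as the group and folds the nested Session effect in as a variance component.

import statsmodels.formula.api as smf

# Extra fixed-effect predictors are simply added to the formula with '+'.
# mixedlm expects one grouping column, so Participant is used as the group
# and the nested Session effect is modeled as a variance component.
# (All column names here are assumptions based on the question.)
model = smf.mixedlm(
    "Speed ~ target + level + distraction",
    data=df,
    groups=df["Participant"],
    vc_formula={"Session": "0 + C(Session)"},
).fit()
print(model.summary())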

Related

Searching for efficient clustering algorithm

In a 2D NxN matrix, each point represents an area of a map. There are M customers in random areas who need to be served by K customer service centers, also in random areas. Each customer service center can serve up to X jobs. The total number of customers is at most the total capacity of the customer service centres. Every customer must be assigned to some service centre, and the Manhattan distance is the cost (a customer can only move up, left, down and right towards a service centre). How do I assign customers to minimise the total cost? I'm looking for a direction: is this a well-known problem, or is there at least pseudocode for it?
I think you can handle this problem using a MinCost/MaxFlow algorithm. Create the graph as follows:
Create M + K + 2 nodes; M customer-nodes, K customer-service-center-nodes (csc-nodes), a source and a sink.
Create K edges from the source to the K csc-nodes with cost 0 and capacity equal to the number of customers that each CSC can serve.
Create M edges from the M customer-nodes to the sink, each edge will have capacity 1 and cost 0.
Create K * M edges from the K csc-nodes to the M customer-nodes each one with a capacity equal to 1 and cost equal to the distance between the CSC and the customer.
Run MinCost/MaxFlow algorithm on the network (V = M + K + 2, E = M + K + M*K). If the max-flow value is equal to M, then you can serve all the customers with the resulting (minimum) cost.
The solution for this case is 23.
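As a rough illustration, here is a sketch of that construction using networkx (my choice of library, not part of the original answer); customers and centers are given as (row, col) grid positions and cap is the per-center capacity X:

import networkx as nx

def assign_customers(customers, centers, cap):
    # Build the flow network described above: source -> centers -> customers -> sink.
    G = nx.DiGraph()
    for k, (cr, cc) in enumerate(centers):
        G.add_edge("source", ("csc", k), capacity=cap, weight=0)
        for m, (r, c) in enumerate(customers):
            dist = abs(cr - r) + abs(cc - c)          # Manhattan distance as edge cost
            G.add_edge(("csc", k), ("cust", m), capacity=1, weight=dist)
    for m in range(len(customers)):
        G.add_edge(("cust", m), "sink", capacity=1, weight=0)
    flow = nx.max_flow_min_cost(G, "source", "sink")
    served = sum(flow[("cust", m)]["sink"] for m in range(len(customers)))
    if served < len(customers):
        return None                                   # not every customer can be served
    return nx.cost_of_flow(G, flow)                   # minimum total cost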
The way the problem is formulated, you have a constrained optimization problem, and not a clustering problem. It likely is convex, integer and linear.
Clustering algorithms won't satisfy the capacity constraint.
There is plenty of research on such optimization. There are various highly optimized solvers available.
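To make that framing concrete, here is a sketch of the linear-programming formulation using scipy.optimize.linprog (one solver among many; my choice, not the answer's). cost[m, k] is the Manhattan distance from customer m to center k and cap is the per-center capacity; for transportation-type constraint matrices the LP relaxation already has an integral optimum, so no explicit integer constraints are needed in this sketch.

import numpy as np
from scipy.optimize import linprog

def solve_assignment(cost, cap):
    # Variables z[m, k] (flattened row-major): fraction of customer m served by center k.
    M, K = cost.shape
    c = cost.reshape(-1)
    # Each customer is assigned exactly once.
    A_eq = np.zeros((M, M * K))
    for m in range(M):
        A_eq[m, m * K:(m + 1) * K] = 1
    b_eq = np.ones(M)
    # Each center serves at most `cap` customers.
    A_ub = np.zeros((K, M * K))
    for k in range(K):
        A_ub[k, k::K] = 1
    b_ub = np.full(K, cap)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, 1), method="highs")
    return res.fun if res.success else None           # total cost, or None if infeasible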

AI: evaluate the mass of a spaceship by prodding it (exerting force lightly) and sensing the change in its velocity

Problem
I have to code an AI to find the mass of a spaceship in a game.
My AI can exert a small force c on the spaceship, to measure the mass via the change in velocity.
However, my AI can only access the current position of the spaceship, x, at every time-step.
Mass is not constant, but it is safe to assume that it will not change too fast.
For simplicity :-
Let the space be 1D with no gravity.
The timestep is always 1 second.
Forces
There are many forces currently acting on the spaceship, e.g. gravity, an automatic propulsion system controlled by an unknown AI, collision impulses, etc.
The summation of these forces is b, which depends on t (time).
Acceleration a for a certain timestep is calculated by a game-play formula which is out of my control:-
a = (b+c)/m ................. (1)
The velocity v is updated as:-
v = vOld + a ................. (2)
The position x is updated as:-
x = xOld + v ................. (3)
The order of execution of (1)-(3) is also unknown, i.e. the AI should not rely on any particular order.
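For reference, a minimal sketch of one timestep under (1)-(3), assuming the order shown (which, as noted, the real engine may not follow):

def step(x_old, v_old, b, c, m):
    a = (b + c) / m        # (1) acceleration from the total force
    v = v_old + a          # (2) velocity update (timestep = 1 s)
    x = x_old + v          # (3) position update
    return x, v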
My poor solution
I will exert c0=0.001 for a few seconds and compare the result against when I exert c1=-0.001.
I would assume that b and m are constant for the time period.
I calculate acceleration via :-
t 0 1 2 3   (exert force `c0` at `t1`, `c1` at `t2`)
x 0 1 2 3   (the numbers are points in the timeline at which I sample x)
v 0 1 2     (v0=x1-x0, v1=x2-x1, ... )
a 0 1       (a0=v1-v0, ... )
Now I know the acceleration at 2 points of the timeline, and I know c because I am the one who exerts it.
With a = (b+c)/m, with unknown b and m and known a0, a1, c0 and c1:-
a0 = (b+c0)/m
a1 = (b+c1)/m
I can solve them to find b and m.
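Explicitly, subtracting the two equations eliminates b, so (assuming a0 ≠ a1):
m = (c0 - c1) / (a0 - a1)
b = a0*m - c0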
However, my assumption is wrong from the beginning:
b and m are actually not constant.
This problem might be viewed in a more casual way:
Many people are trying to lift a heavy rock.
I am one of them.
How can I measure the mass of the rock (from the feeling in my hand) without interrupting them too much?

Query on an array

Assume that I have an array A = {a, b, c, d, e, f, g, h........} and Q queries. In each query I will be asked to do one of the following operations:
1 i j -> increase the i-th element by 1 and decrease the j-th element by 1
2 x -> tell the number of elements of the array which are less than x
If there were no update operations I could have done this with lower bound. I can still do it by sorting the array and finding the lower bound, but the complexity will be too high since the size of the array A and the number of queries Q can both be 10^5. Is there any faster algorithm or way to do this?
The simplest way is to use std::count_if.
What complexity bound do you have to meet? (10^5)^2 is still only 10^10.
If you have to do better than that, I suspect you have to have a "value" which has back pointers to the "index", and an "index" which is a pointer to the value. Sort the values initially, and then when you update, move the value to the right point. (Probably best to see if the value needs to move at all before searching).
Then the query is still a lower bound operation.
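A sketch of that idea in Python using bisect (my illustration, not the answerer's code): keep a sorted copy of A next to the original, move at most two values on an update (linear in the worst case), and answer a query with one binary search.

import bisect

class LessThanCounter:
    def __init__(self, a):
        self.a = list(a)
        self.sorted_a = sorted(a)

    def _move(self, old, new):
        # Remove one occurrence of `old` and insert `new`, keeping sorted_a sorted.
        self.sorted_a.pop(bisect.bisect_left(self.sorted_a, old))
        bisect.insort(self.sorted_a, new)

    def update(self, i, j):
        # Query type 1: A[i] += 1, A[j] -= 1.
        self._move(self.a[i], self.a[i] + 1)
        self.a[i] += 1
        self._move(self.a[j], self.a[j] - 1)
        self.a[j] -= 1

    def count_less(self, x):
        # Query type 2: how many elements are < x.
        return bisect.bisect_left(self.sorted_a, x)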
Once you sort the array (O(n log n)), a query "LESS(X)" runs in log n time, since you can use binary search. Once you know that element X (or the next largest element in A) is found at position k, you know that k is your answer (k elements are less than X).
The (i, j) command implies a partial reorder of the array between the element which is immediately less than min(A[i]+1, A[j]-1) and the one which is immediately after max(A[i], A[j]). You can find both in log n time, worst case log n + n. The following is close to the worst case:
k   0   1   2   3   4   5   6   7   8   9     command: (4, 5)
v   7  14  14  15  15  15  16  16  16  18
                    ^   ^
         becomes 16     becomes 14 -- does it go before 3 or before 1?
The re-sort is then worst case n, since your array is already almost sorted except for two elements, which means you'll do well by using two runs of insertion sort.
So with m update queries and q simple queries you can expect to have
n log n + m*2*(log n + 2*n) + q * log n
complexity. Average case (no pathological arrays, reasonable sparseness, no pathological updates, (j-i) = d << n) will be
( n + 2m + q ) * log n + 2m*d
which is linearithmic. With n = m = q = 10^5, you get an overall complexity which is still below 10^7 unless you've got pathological arrays and ad hoc queries, in which case the complexity should be quadratic (or maybe even cubic; I haven't examined it closely).
In a real world scenario, you can also conceivably employ some tricks. Remember the last values of the modified indexes of i and j, and the last location query k. This costs little. Now on the next query, chances are that you will be able to use one of the three values to prime your binary search and shave some time.

Choosing a distributed computing framework for very large overlap queries

I am trying to analyze 2 billion rows (of text files in HDFS). Each line contains an array of sorted integers:
[1,2,3,4]
The integer values can be 0 to 100,000. Within each array of integers, I am looking to generate all possible pair combinations (one-way, i.e. (1,2) and (2,1) count as the same pair), then reduce and sum the counts of those overlaps. For example:
File:
[1,2,3,4]
[2,3,4]
Final Output:
(1,2) - 1
(1,3) - 1
(1,4) - 1
(2,3) - 2
(2,4) - 2
(3,4) - 2
The methodology that I have tried is using Apache Spark to create a simple job that parallelizes the processing and reducing of blocks of data. However, I am running into issues where memory can't hold a hash of ((100,000)^2)/2 options, and thus I have to resort to traditional MapReduce: map, sort, shuffle, reduce locally, sort, shuffle, reduce globally. I know creating the combinations is a double for loop, so O(n^2), but what is the most efficient way to do this programmatically so I write to disk as little as possible? I am trying to perform this task in under 2 hours on a cluster of 100 nodes (64 GB RAM / 2 cores each). Any recommended technologies or frameworks are also welcome. Below is what I have been using in Apache Spark and Pydoop. I tried using more memory-optimized hashes, but they still took too much memory.
import collection.mutable.HashMap
import collection.mutable.ListBuffer

def getArray(line: String): List[Int] = {
  val a = line.split("\\x01")(1).split("\\x02")
  val ids = new ListBuffer[Int]
  for (x <- 0 to a.length - 1) {
    ids += Integer.parseInt(a(x).split("\\x03")(0))
  }
  ids.toList
}

val textFile = sc.textFile("hdfs://data/")
val counts = textFile.mapPartitions(lines => {
  val hashmap = new HashMap[(Int, Int), Int]()
  lines.foreach(line => {
    val array = getArray(line)
    for ((x, i) <- array.view.zipWithIndex) {
      for (j <- (i + 1) to array.length - 1) {
        hashmap((x, array(j))) = hashmap.getOrElse((x, array(j)), 0) + 1
      }
    }
  })
  hashmap.toIterator
}).reduceByKey(_ + _)
I also tried Pydoop:

def mapper(_, text, writer):
    columns = text.split("\x01")
    slices = columns[1].split("\x02")
    slice_array = []
    for slice_obj in slices:
        slice_id = slice_obj.split("\x03")[0]
        slice_array.append(int(slice_id))
    for i, x in enumerate(slice_array):
        for j in range(i + 1, len(slice_array)):
            writer.emit((x, slice_array[j]), 1)

def reducer(key, vals, writer):
    writer.emit(key, sum(map(int, vals)))

def combiner(key, vals, writer):
    writer.count('combiner calls', 1)
    reducer(key, vals, writer)
I think your problem can be reduced to word count where the corpus contains at most 5 billion distinct words.
In both of your code examples, you're trying to pre-count all of the items appearing in each partition and sum the per-partition counts during the reduce phase.
Consider the worst-case memory requirements for this, which occur when every partition contains all of the 5 billion keys. The hashtable requires at least 8 bytes to represent each key (as two 32-bit integers) and 8 bytes for the count if we represent it as a 64-bit integer. Ignoring the additional overheads of Java/Scala hashtables (which aren't insignificant), you may need at least 74 gigabytes of RAM to hold the map-side hashtable:
num_keys = 100000**2 / 2
bytes_per_key = 4 + 4 + 8
bytes_per_gigabyte = 1024 **3
hashtable_size_gb = (num_keys * bytes_per_key) / (1.0 * bytes_per_gigabyte)
The problem here is that the keyspace at any particular mapper is huge. Things are better at the reducers, though: assuming a good hash partitioning, each reducer processes an even share of the keyspace, so the reducers only require roughly (74 gigabytes / 100 machines) ~= 740 MB per machine to hold their hashtables.
Performing a full shuffle of the dataset with no pre-aggregation is probably a bad idea, since the 2 billion row dataset probably becomes much bigger once you expand it into pairs.
I'd explore partial pre-aggregation, where you pick a fixed size for your map-side hashtable and spill records to reducers once the hashtable becomes full. You can employ different policies, such as LRU or randomized eviction, to pick elements to evict from the hashtable. The best technique might depend on the distribution of keys in your dataset (if the distribution exhibits significant skew, you may see larger benefits from partial pre-aggregation).
This gives you the benefit of reducing the amount of data transfer for frequent keys while using a fixed amount of memory.
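As an illustration (my sketch, reusing the Pydoop-style writer interface from above, with randomized eviction as one possible policy), a bounded map-side table might look like this:

import random

MAX_KEYS = 1000000          # tune to the memory available per mapper
counts = {}

def emit_pair(pair, writer):
    # Accumulate locally; once the table is full, evict one entry and send it
    # downstream, so memory stays fixed while frequent pairs are still mostly
    # combined before the shuffle.
    if pair in counts:
        counts[pair] += 1
        return
    if len(counts) >= MAX_KEYS:
        victim = random.choice(list(counts))    # O(n) per eviction; an LRU structure would be cheaper
        writer.emit(victim, counts.pop(victim))
    counts[pair] = 1

def flush(writer):
    # Call once at the end of the mapper to emit whatever is still buffered.
    for pair, count in counts.items():
        writer.emit(pair, count)
    counts.clear()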
You could also consider using a disk-backed hashtable that can spill blocks to disk in order to limit its memory requirements.

Need some help calculating percentile

An RPC server is given which receives millions of requests a day. Each request i takes processing time Ti to process. We want to find the 65th percentile processing time (when processing times are sorted by value in increasing order) at any moment. We cannot store the processing times of all past requests, as the number of requests is very large, so the answer need not be the exact 65th percentile; an approximate answer, i.e. a processing time that is around the exact 65th percentile, is fine.
Hint: It's something to do with how a histogram (i.e. an overview) is stored for very large data without storing all of the data.
Take one day's data. Use it to figure out what size to make your buckets. Say one day's data shows that the vast majority (95%?) of your data is within 0.5 seconds of 1 second (ridiculous values, but hang in).
To get the 65th percentile, you'll want at least 20 buckets in that range, but be generous and make it 80. So you divide your 1-second window (-0.5 seconds to +0.5 seconds around the center) into 80 buckets, making each 1/80th of a second wide.
Each bucket is 1/80th of a second. Make bucket 0 go from (center - deviation) = (1 - 0.5) = 0.5 to 0.5 + 1/80th of a second. Bucket 1 is 0.5 + 1/80th to 0.5 + 2/80ths. Etc.
For every value, find out which bucket it falls in, and increment a counter for that bucket.
To find 65th percentile, get the total count, and walk the buckets from zero until you get to 65% of that total.
Whenever you want to reset, set the counters all to zero.
If you always want to have good data available, keep two of these, and alternate resetting them, using the one you reset least recently as having more useful data.
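A minimal sketch of that bucketed histogram in Python (the window and bucket count are the example's assumed values, not prescribed ones):

LOW, HIGH, NBUCKETS = 0.5, 1.5, 80          # window and bucket count from the example
WIDTH = (HIGH - LOW) / NBUCKETS
counts = [0] * NBUCKETS

def record(t):
    # Clamp values outside the expected window into the edge buckets.
    b = int((t - LOW) / WIDTH)
    counts[min(max(b, 0), NBUCKETS - 1)] += 1

def percentile(p):
    # Walk the buckets until p percent of the total count is covered and
    # return the upper edge of that bucket as the estimate.
    target = p / 100.0 * sum(counts)
    running = 0
    for b, c in enumerate(counts):
        running += c
        if running >= target:
            return LOW + (b + 1) * WIDTH
    return HIGH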
Use an updown filter:
if q < x:
    q += .01 * (x - q)    # up a little
else:
    q += .005 * (x - q)   # down a little
Here a quantile estimator q tracks the x stream,
moving a little towards each x.
If both factors were .01, it would move up as often as down,
tracking the 50th percentile.
With .01 up, .005 down, it floats up to the 67th percentile;
in general, it tracks the up / (up + down) th percentile.
Bigger up/down factors track faster but noisier --
you'll have to experiment on your real data.
(I have no idea how to analyze updowns, would appreciate a link.)
The updown() below works on long vectors X, Q in order to plot them:
#!/usr/bin/env python
from __future__ import division
import sys
import numpy as np
import pylab as pl

def updown( X, Q, up=.01, down=.01 ):
    """ updown filter: running ~ up / (up + down) th percentile
        here vecs X in, Q out to plot
    """
    q = X[0]
    for j, x in np.ndenumerate(X):
        if q < x:
            q += up * (x - q)  # up a little
        else:
            q += down * (x - q)  # down a little
        Q[j] = q
    return q

#...............................................................................
if __name__ == "__main__":
    N = 1000
    up = .01
    down = .005
    plot = 0
    seed = 1
    exec "\n".join( sys.argv[1:] )  # python this.py N= up= down=
    np.random.seed(seed)
    np.set_printoptions( 2, threshold=100, suppress=True )  # .2f
    title = "updown random.exponential: N %d up %.2g down %.2g" % (N, up, down)
    print title
    X = np.random.exponential( size=N )
    Q = np.zeros(N)
    updown( X, Q, up=up, down=down )
    # M = np.zeros(N)
    # updown( X, M, up=up, down=up )
    print "last 10 Q:", Q[-10:]
    if plot:
        fig = pl.figure( figsize=(8,3) )
        pl.title(title)
        x = np.arange(N)
        pl.plot( x, X, "," )
        pl.plot( x, Q )
        pl.ylim( 0, 2 )
        png = "updown.png"
        print >>sys.stderr, "writing", png
        pl.savefig( png )
        pl.show()
An easier way to get the value that represents a given percentile of a list or array is the scoreatpercentile function in the scipy.stats module.
>>> import scipy.stats as ss
>>> ss.scoreatpercentile(v, 65)
There's a sibling function, percentileofscore, to return the percentile given the value.
You will need to store a running sum and a total count.
Then check out standard deviation calculations.
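For illustration, a sketch of that running-moments idea (my interpretation of the hint): keep a count, a sum and a sum of squares, and estimate the 65th percentile as mean + z*std with z ≈ 0.385, the 65th-percentile point of a standard normal. This assumes the processing times are roughly normally distributed, which real latency data often is not.

import math

count, total, total_sq = 0, 0.0, 0.0

def record(t):
    global count, total, total_sq
    count += 1
    total += t
    total_sq += t * t

def estimate_p65():
    mean = total / count
    var = max(total_sq / count - mean * mean, 0.0)   # population variance
    return mean + 0.385 * math.sqrt(var)             # normal 65th-percentile point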
