emcee MCMC Sampling... Small set of walkers not converging?

I'm running MCMC (emcee) on a simulation where I know the true parameter values. Suppose I use 100 walkers and 10,000 steps in the chain. The bulk of the walkers converge to the correct parameter values, but ~3 walkers (out of the 100) just trail off into the distance and never converge. I'm not sure what could be at play here, other than a potential error on my end.
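A minimal way to flag the stragglers, assuming emcee 3's get_chain() and that sampler is the existing EnsembleSampler (the 10x MAD cutoff is an arbitrary illustrative threshold, not a standard one):
import numpy as np

chain = sampler.get_chain()                   # shape (nsteps, nwalkers, ndim)
last = chain[-1]                              # final position of every walker
med = np.median(last, axis=0)                 # ensemble median per parameter
mad = np.median(np.abs(last - med), axis=0)   # robust per-parameter spread
stuck = np.any(np.abs(last - med) > 10 * mad, axis=1)
print("possibly stuck walkers:", np.where(stuck)[0])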

Related

Is there a possible way to get data sampling in continuous data batches?

We have a data stream that continuously dumps data into our data lake. Is there a good solution, with minimal running cost, to get a 10% random sample of the data?
I'm currently using the code snippet below, but this will outgrow 10% total sampling as new batches arrive. I also tried it on 10 batches of 100 records each with a 0.1 mean, but it resulted in ~32% sampling.
select id,
       (uniform(0::float, 1::float, random(1)) < .10)::boolean as sampling
from temp_hh_mstr;
Before this, I considered getting the sample via Snowflake's TABLESAMPLE, subtracting the IDs already in the sample from the total count in the table. But that requires recalculating every time a batch arrives, which will increase the cost.
Some additional references I've been looking towards:
Wilson Score Interval With Continuity Correction
Binomial Confidence Interval
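For what it's worth, a quick sanity check in plain Python (not Snowflake; the seed and batch sizes are just for illustration) suggests that flagging each row independently at probability 0.1 keeps the overall sampled fraction near 10% regardless of how many batches arrive:
import random

random.seed(1)
sampled = total = 0
for batch in range(10):          # ten batches of 100 records, as in the question
    for _ in range(100):
        total += 1
        sampled += random.random() < 0.10   # Bernoulli(0.1) per row
print(sampled / total)           # ~0.10, not ~0.32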

Simple MATLAB Script

I have conducted 400 experiments and collected a lot of data. So I am writing code that allows me to read all the files using MATLAB, analyze them, and store the results in an array. Otherwise I would have to process them individually, which would take forever.
Code:
clear
clc
close all
myfolder = 'C:\Users\Surface\Desktop\New_folder\';
filepattern = fullfile(myfolder, '*.txt'); % '*.txt', not '.txt', or dir() finds nothing
thefiles = dir(filepattern);
Fs = 500; % Sampling frequency
Xrms = zeros(1, length(thefiles));
for k = 1:length(thefiles) % use one loop index; do not reuse it in a nested loop
    baseFileName = thefiles(k).name;
    fullFileName = fullfile(myfolder, baseFileName);
    fprintf(1, 'Now reading %s\n', fullFileName);
    [t, X, Xrms(k)] = process_data(fullFileName);
    % Extract the data for the time period [ts tf]
    ts = 0;
    tf = 3;
    t = t(ts*Fs+1:tf*Fs+1);
    X = X(ts*Fs+1:tf*Fs+1);
end
% FFT analysis to see the natural frequency
fftplot(1/Fs, X);
% Find the damping ratio
[zeta, zeta_m] = finddamping(t, X)
I was wondering where I went wrong.

probability of collision in MD5

Worst case, I have 180 million values in a cache (15-minute window before they go stale), and MD5 has 2^128 possible values. What is my probability of a collision? Or better yet, is there a web page somewhere that answers that question, or gives a rough estimate? That would rock, so I know my chances.
The exact probability is 1 − m!/(mⁿ·(m−n)!), where m = 2¹²⁸ and n = 180,000,000.
Running it through Wolfram Alpha online exceeds the available computation time!
If you have Smalltalk installed locally, you can run this:
| m n p |
m := 2 raisedTo: 128.
n := 180000000.
p := (1 - (m factorial / ((m raisedTo: n) * (m - n) factorial))) asFloat.
Transcript show: p printString; cr.
A search for the Birthday Problem brings up a Wikipedia page with a table showing that for 128 bits and 2.6×10¹⁰ hashes, the probability of a collision is 1 in 10¹⁸. That is ~140× the number of hashes you're considering, so you know your odds of a collision are even lower than that.
A good approximation when n ≪ m is 1 − e^(−n²/2m); plugging in the m and n above gives 4.76×10⁻²³, or 1 in 2.10×10²², as the probability of a collision.
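If you'd rather compute it than trust the arithmetic, here is a small Python sketch of that approximation (expm1 is just a numerically stable way to get 1 − e^x for tiny exponents):
import math

m = 2**128          # number of possible MD5 values
n = 180_000_000     # hashes held during the 15-minute window
p = -math.expm1(-n**2 / (2 * m))   # 1 - exp(-n^2/2m), computed stably
print(p)            # ~4.76e-23, i.e. about 1 in 2.1e22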
Even though the probability of a collision is very low, it is prudent in the FOOBAR case, say if there is an issue and the hashes accumulate for more than 15 minutes, to at least confirm what would happen in the event of a collision. This will also help if someone somehow injects duplicate hashes in order to try to compromise it.

Is there an easy way to get the percentage of successful reads of the last x minutes?

I have a setup with a BeagleBone Black which communicates over I²C with its slaves every second and reads data from them. Sometimes the I²C readout fails, though, and I want to gather statistics about these failures.
I would like to implement an algorithm that reports the percentage of successful communications over the last 5 minutes (up to 24 hours) and updates that value constantly. If I implemented that 'normally', with an array storing success/no success for every second, that would mean a lot of wasted RAM/CPU for a minor feature (especially if I want to see statistics for the last 24 hours).
Does someone know a good way to do that, or can anyone point me in the right direction?
Why don't you just implement a low-pass filter? For every successful transfer you push in a 1, for every failed one a 0; the result is a number between 0 and 1. Assuming that your transfers happen periodically, this works well, and you just have to adjust the cutoff frequency of that filter to your desired "averaging duration".
However, I can't follow your RAM argument: assuming you store one byte representing success or failure per transfer, which you say happens every second, you end up with 86400B per day -- 85KB/day is really negligible.
EDIT Cutoff frequency is something from signal theory and describes the highest or lowest frequency that passes a low or high pass filter.
Implementing a low-pass filter is trivial; in Python (with the transfer function below standing in for your actual I²C readout):
new_val = 1.0    # init with "no failed transfers"
alpha = 0.001    # smoothing factor; smaller = longer averaging window
while True:
    old_val = new_val
    success = do_transfer_and_return_1_on_success_or_0_on_failure()
    new_val = alpha * success + (1 - alpha) * old_val
That's a single-tap IIR (infinite impulse response) filter; single tap because there's only one alpha and thus, only one number that is stored as state.
EDIT2: the value of alpha defines the behaviour of this filter: its effective memory is roughly 1/alpha samples, so alpha = 0.001 averages over roughly the last 1000 transfers (about 17 minutes at one transfer per second).
EDIT3: you can use a filter design tool to give you the right alpha; just set your low-pass filter's cutoff frequency to something like 0.5/integrationLengthInSamples and select a first-order IIR (the single-tap filter above has exactly one pole; a zero-order filter is just a gain and doesn't filter at all). Most tools default to a Butterworth design, which is fine at first order.
I'd use scipy and convert the resulting (b, a) tuple to the feedback form above; for this filter b ≈ [alpha] and a = [1, −(1−alpha)].
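A minimal sketch of that scipy route (the 300-sample window and the Butterworth choice are my assumptions for illustration):
import scipy.signal as sig

fs = 1.0                  # one transfer per second
window = 300              # desired averaging window in samples (5 minutes)
cutoff = 0.5 / window     # rule-of-thumb cutoff in Hz

# First-order low-pass; keep its pole and unity DC gain for the one-tap form.
b, a = sig.iirfilter(1, cutoff, btype='lowpass', ftype='butter', fs=fs)
alpha = 1 + a[1]          # a == [1, -(1-alpha)]
print(alpha)              # ~0.01 for a 300-sample window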
UPDATE In light of the OP's comment 'determine a trend of which devices are failing', I would recommend the geometric average that Marcus Müller put forward.
ACCURATE METHOD
The method below is aimed at obtaining well-defined statistics for performance over time that are also useful for after-the-fact analysis.
Notice that the geometric average 'looks back' over a number of recent messages rather than over a fixed time period.
Maintain a rolling array of 24*60/5 = 288 'prior success rates' (SR[i] with i = -1, -2, ..., -288), each representing a 5-minute interval in the preceding 24 hours.
That will consume about 2.3 KB if the elements are 64-bit doubles.
To 'effect' constant updating use an Estimated 'Current' Success Rate as follows:
ECSR = (t*(S/M) + (300-t)*SR[-1]) / 300
where S and M are the counts of successes and messages in the current (partially complete) period, SR[-1] is the previous (now complete) bucket, and t is the number of seconds elapsed in the current bucket.
NB: when you start up there is no SR[-1] yet, so just use S/M until the first bucket completes.
In essence the approximation assumes the error rate was steady over the preceding 5 - 10 minutes.
To 'effect' a 24-hour look-back you can either shuffle the data down (by copy or memcpy()) at the end of each 5-minute interval, or implement a circular array by keeping track of the current bucket index.
NB: For many management/diagnostic purposes intervals of 15 minutes are often entirely adequate. You might want to make the 'grain' configurable.
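A minimal sketch of that scheme in Python, assuming one transfer per second (so t equals the message count M); the class and method names are mine, not from the answer:
class SuccessStats:
    BUCKETS = 288   # 24 h of 5-minute intervals
    PERIOD = 300    # seconds (= messages) per bucket

    def __init__(self):
        self.sr = [1.0] * self.BUCKETS  # circular array of prior success rates
        self.idx = 0                    # index of the bucket being filled
        self.s = 0                      # successes in the current partial bucket
        self.m = 0                      # messages in the current partial bucket

    def record(self, success):
        self.s += 1 if success else 0
        self.m += 1
        if self.m == self.PERIOD:       # bucket complete: store rate, advance
            self.sr[self.idx] = self.s / self.m
            self.idx = (self.idx + 1) % self.BUCKETS
            self.s = self.m = 0

    def ecsr(self):
        # ECSR = (t*(S/M) + (300-t)*SR[-1]) / 300, with t = m here
        prev = self.sr[(self.idx - 1) % self.BUCKETS]
        return (self.s + (self.PERIOD - self.m) * prev) / self.PERIOD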

how do I figure out provisioned throughput for an AWS DynamoDB table?

My system is supposed to write a large amount of data into a DynamoDB table every day. These writes come in bursts, i.e. at certain times each day several different processes have to dump their output data into the same table. Write speed is not critical as long as all the daily data gets written before the next dump occurs. I need to figure out the right way of calculating the provisioned capacity for my table.
So for simplicity, let's assume that I have only one process writing data once a day, and it has to write up to X items into the table (each item < 1 KB). Is the capacity I would have to specify essentially equal to X / 24 / 3600 writes/second?
Thx
The provisioned capacity is in terms of writes/second. You need to make sure that you can handle the PEAK number of writes/second you expect, not the average over the day. So, if you have a single process that runs once a day and makes X writes of Y size (in KB, rounded up) over Z seconds, your formula would be
capacity = (X * Y) / Z
So, say you had 100K writes over 100 seconds and each write < 1KB, you would need 1000 w/s capacity.
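As a sketch of the arithmetic in Python (the function name is just for illustration):
def write_capacity(writes, item_kb, seconds):
    """Peak write capacity units: X writes of Y KB (rounded up) over Z seconds."""
    return writes * item_kb / seconds

print(write_capacity(100_000, 1, 100))   # 1000.0 w/s, matching the example above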
Note that in order to minimize provisioned write capacity needs, it is best to add data into the system on a more continuous basis, so as to reduce peaks in necessary read/write capacity.
