Reducing the range of a hash function while maintaining even distribution (C)

I'm writing a cron-like job-dispatcher that runs jobs every minute (and other jobs every 5 minutes, etc.). However, instead of immediately dispatching all the jobs that run on a particular period at the top-of-the-minute, I want to spawn them evenly over their periods.
For example, if I have N jobs that run every P minutes, rather than spawn them all at P:00, I want to spawn the jobs evenly over the P*60 seconds, i.e., ceil(N/(P*60)) jobs per second. Hence, the spawn time for each job would be "skewed" somewhat later.
However, for each job J, I want J to be spawned at the same skew every time it's dispatched so that the time between spawns for J is constant (and matches its period).
Each job has various information associated with it, including several strings that vary for each job. My original thought was to calculate a hash-code, H, for one or more of the strings and mod it by P*60 to calculate a constant skew, S, for each job. As long as the strings associated with the job remain the same, the calculated skew would remain constant.
However, I would assume that S = H % (P*60) suffers from problems similar to those of using rand() % N (an uneven distribution biased towards lower numbers). I don't think the usual solutions (calling rand() multiple times) would apply to my case, because the hash function for a given job would always return the same hash.
So how can I get what I want? (I'm writing in C.)
Examples:
Suppose I have N every-minute jobs (a cron schedule of * * * * *). For N < 60 (let's say 2), then job-1 might be skewed to start at :23 (23 seconds past the minute) and job-2 might be skewed to start at :37. With so few jobs, it may not seem evenly distributed. However, as N approached 60, the "gaps" would fill in (assuming a perfect skew function) so that one job would be spawned every second. If N passed 60, some jobs spawned at some seconds would "double up." Similarly, as N approached 120, the "gaps" would again fill in so that two jobs would be spawned every second. And so on.
Suppose I have N every-five-minute jobs (a cron schedule of */5 * * * *). In "normal" cron, that means "every five minutes on the zeroth second on the fives." I instead want that to mean "every five minutes, but not necessarily (and most likely not) on the zeroth second of some minute; the only guarantee is that the interval between spawns will be five minutes." So for example, a particular job might be spawned at 00:07:24, 00:12:24, 00:17:24, etc. As N approached 300, one job would be spawned per second.
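One approach (a sketch, not taken from the question): the heavy bias in rand() % N comes from the generator's small output range. If you instead hash the job's strings with a well-mixed 64-bit hash such as FNV-1a and reduce modulo P*60, the modulo bias is on the order of (P*60)/2^64, which is negligible in practice. The names skew_for_job and job_key below are illustrative:

```c
#include <stdint.h>

/* FNV-1a 64-bit string hash: simple and reasonably well distributed. */
static uint64_t fnv1a64(const char *s)
{
    uint64_t h = 14695981039346656037ULL;  /* FNV offset basis */
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 1099511628211ULL;             /* FNV prime */
    }
    return h;
}

/* Constant skew in [0, range) for a job. With a 64-bit hash, the bias
 * introduced by the modulo is on the order of range / 2^64, which is
 * negligible for a range like P*60 = 300. */
static unsigned skew_for_job(const char *job_key, unsigned range)
{
    return (unsigned)(fnv1a64(job_key) % range);
}
```

Because the hash depends only on the job's strings, the same job always gets the same skew, so the interval between its spawns stays constant.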

Related

Need help finding a logical solution solving a problem

Given the variable 'points' which increases every time a variable 'player' collects a point, how do I logically find a way to reward user for finding 30 points inside a 5 minutes limit? There's no countdown timer.
E.g., a player may have 4 points, but if within some 5-minute span he reaches 34 points, that also counts.
I was thinking about using timestamps but I don't really know how to do that.
What you are talking about is a "sliding window". Your window is time based. Record each point's timestamp and slide your window over these timestamps. You will need to pick a time increment to slide your window.
Upon each "slide", count your points. When you get the amount you need, "reward your user". The "upon each slide" means you need some sort of timer that calls a function each time to evaluate the result and do what you want.
For example, set a window of 5 minutes and a slide of 1 second. Don't keep a single variable called points. Instead, simply create an array of timestamps. Every timer tick (of 1 second in this case), count the number of timestamps that fall between t − 5 minutes and the current time t; if there are 30 or more, you've met your threshold and can reward your super-fast user. If you need the actual value (which may be 34), well, you've just computed it, so you can use it.
There may be ways to optimize this. I've provided the naive approach. Timestamps that have gone out of range can be deleted to save space.
If there are "points going into the window" that count, then just add them to the sum.
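A minimal C sketch of the naive approach described above: record each point's timestamp in an array, and on every 1-second tick scan the array and count the entries inside the window. The names and the MAX_EVENTS bound are illustrative:

```c
#include <stddef.h>

#define WINDOW_SECONDS (5 * 60)   /* 5-minute window */
#define THRESHOLD      30         /* points needed inside the window */
#define MAX_EVENTS     1024       /* illustrative fixed capacity */

/* Timestamps (in seconds) of collected points; naive unsorted array. */
static long events[MAX_EVENTS];
static size_t n_events = 0;

static void record_point(long now)
{
    if (n_events < MAX_EVENTS)
        events[n_events++] = now;
}

/* Called on every tick: count points in (now - WINDOW_SECONDS, now]. */
static int window_reached(long now)
{
    size_t count = 0;
    for (size_t i = 0; i < n_events; i++)
        if (events[i] > now - WINDOW_SECONDS && events[i] <= now)
            count++;
    return count >= THRESHOLD;
}
```

As the answer notes, timestamps older than the window can be deleted to keep the scan cheap; this sketch omits that optimization.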

Starvation of one of 2 streams in ConnectedStreams

Background
We have 2 streams, let's call them A and B.
They produce elements a and b respectively.
Stream A produces elements at a slow rate (one every minute).
Stream B receives a single element once every 2 weeks. It uses a flatMap function which receives this element and generates ~2 million b elements in a loop:
(Java)
for (BElement value : valuesList) {
    out.collect(value);
}
The valuesList here contains ~2 million b elements.
We connect those streams (A and B) using connect, key by some key and perform another flatMap on the connected stream:
streamA.connect(streamB).keyBy(AClass::someKey, BClass::someKey).flatMap(processConnectedStreams)
Each of the b elements has a different key, meaning there are ~2 million keys coming from the B stream.
The Problem
What we see is starvation. Even though there are a elements ready to be processed they are not processed in the processConnectedStreams.
Our attempts to solve the issue
We tried to throttle stream B to 10 elements per second by performing a Thread.sleep() every 10 elements:
long totalSent = 0;
for (BElement value : valuesList) {
    totalSent++;
    out.collect(value);
    if (totalSent % 10 == 0) {
        Thread.sleep(1000);
    }
}
The processConnectedStreams is simulated to take 1 second with another Thread.sleep() and we have tried it with:
* Setting a parallelism of 10 for the whole pipeline - didn't work
* Setting a parallelism of 15 for the whole pipeline - did work
The question
We don't want to use all these resources, since stream B is activated very rarely, and for stream A elements a high parallelism is overkill.
Is it possible to solve it without setting the parallelism to more than the number of b elements we send every second?
It would be useful if you shared the complete workflow topology. For example, you don't mention doing any keying or random partitioning of the data. If that's really the case, then Flink is going to pipeline multiple operations in one task, which can (depending on the topology) lead to the problem you're seeing.
If that's the case, then forcing partitioning prior to the processConnectedStreams can help, as then that operation will be reading from network buffers.

Weight assignment to define an objective function

I have a set of jobs with execution times (C1,C2...Cn) and deadlines (D1,D2,...Dn). Each job will complete its execution in some time, i.e,
response time (R1,R2,....Rn). However, there is a possibility that not every job will complete its execution before its deadline. So I define a variable called Slack for each job, i.e., (S1,S2,...Sn). Slack is basically the difference between the deadline and the response time of jobs, i.e.,
S1 = D1 - R1
S2 = D2 - R2, and so on.
I have a set of slacks [S1,S2,S3,...Sn]. These slacks can be positive or negative depending on the deadline and completion time of tasks, i.e., D and R.
The problem is I need to define weights (W) for each job (or slack) such that the job with negative slack (i.e., R>D, jobs that miss deadlines) has more weight (W) than the jobs with positive slack and based on these weights and slacks I need to define an objective function that can be used to maximize the slack.
The problem doesn't seem to be a difficult one. However, I couldn't find a solution. Some help in this regard is much appreciated.
Thanks
This can often be done easily with variable splitting:
splus(i) - smin(i) = d(i) - r(i)
splus(i) ≥ 0, smin(i) ≥ 0
If we have a term in the objective so that we are minimizing:
sum(i, w1 * splus(i) + w2 * smin(i) )
this will work ok: we don't need to add the complementarity condition splus(i)*smin(i)=0.
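Written out as a small linear program (my own formalization of the answer's sketch, with weights chosen so that deadline misses are penalized more heavily, e.g. w_2 > w_1 ≥ 0):

```latex
\min \sum_i \left( w_1\, s^{+}_i + w_2\, s^{-}_i \right)
\quad \text{s.t.} \quad
s^{+}_i - s^{-}_i = d_i - r_i, \qquad
s^{+}_i \ge 0, \; s^{-}_i \ge 0 \quad \forall i
```

With both weights strictly positive, any solution with both s⁺ᵢ and s⁻ᵢ positive can be improved by reducing each by their minimum, so the complementarity condition s⁺ᵢ·s⁻ᵢ = 0 holds automatically at an optimum.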

Where would arrays be used? [closed]

Suppose that a fast-food restaurant sells salad and burger. There are two cashiers. With cashier 1, the number of seconds that it takes to complete an order of salad is uniformly distributed in {55,56,...,64,65}, and the number of seconds it takes to complete an order of burger is uniformly distributed in {111,112,...,129,130}. With cashier 2, the number of seconds that it takes to complete an order of salad is uniformly distributed in {65,66,...,74,75}, and the number of seconds it takes to complete an order of burger is uniformly distributed in {121,122,...,139,140}. Assume that the customers arrive at random times but have an average arrival rate of r customers per minute.
Consider two different scenarios.
• Customers wait in one line for service and, when either of the two cashiers is available, the first customer in the line goes to that cashier and gets serviced. In this scenario, when a customer arrives at the restaurant, he either gets serviced if there is no line-up, or waits at the end of the line.
• Customers wait in two lines, one for each cashier. The first customer in a line will get serviced if and only if the cashier for his line becomes available. In this scenario, when a customer arrives at the restaurant, he joins the shorter line. In addition, we impose the condition that if a customer joins a line, he will not move to the other line or to the other cashier when the other line becomes shorter or when the other cashier becomes free.
In both scenarios considered, a cashier will only start serving the next customer when the customer he is currently serving has received his ordered food. (That is the point we call "the customer's order is completed".)
... Simulation
For each of the two scenarios and for several choices of r (see later description), you are to simulate the customers arriving/waiting/getting service over a period of 3 hours, namely, from time 0 to time 180 minutes, where you assume that at time 0 there is no customer waiting and both cashiers are available. The entire period of 3 hours is to be divided into time slots each of 1 second duration. At each time slot, with probability r/60 you make one new customer arrive, and with probability 1 − r/60 you make no new customer arrive. This should give rise to an average customer arrival rate of r customers/minute, and the arrival model will be reasonably close to what is described above. In each time slot, you will make your program handle whatever is necessary.
... Objectives and Deliverables
You need to write a program to investigate the following. For each of the two scenarios and for each r, you are to divide the three-hour simulated period into 10-minute periods, and for every customer arriving during period i (i ∈ {1,2,...,18}), compute the overall waiting time of the customer (namely, from the time he arrives at the restaurant to the time when his order is completed). You need to print for each i the average waiting time for the customers arriving during period i. Note that if a customer arriving in period i has not been served within the three-hour simulated period, then his waiting time is not known, so the average waiting time for customers arriving in this period cannot be computed. In that case, simply print "not available" as the average waiting time for that period.
So, this program deals with hours, minutes, and seconds.
Would it be best to make a three-dimensional array as such:
time[3][60][60]
A total of three hours, with 60 minutes within, with 60 seconds within.
Alternatively, I was thinking that I should make a "for-loop" with this structure:
for (t = 0; t < 10800; t++)
Every iteration of this loop will represent one second of the three-hour simulation (3 h × 60 min × 60 s = 10800 seconds).
Am I on the right track here, guys? Which method is more practical? Are there other arrays that are critical for this program?
Help is appreciated, as always!
It's almost always best to have your internal representation of time be in seconds; you'll have a much easier time working with your for loop than with a three-dimensional array. One nice convention is to write it as
#define MAX_SECONDS (3 * 60 * 60)
for (t = 0; t < MAX_SECONDS; t++)
The data structure to look into for this project is, appropriately enough, a queue. This can be implemented using arrays, but will require some extra work.
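As a sketch of that suggestion: a fixed-capacity circular-buffer queue in C is enough here, since at most one customer can arrive per second over the 10800 slots. Storing each customer's arrival second makes the waiting-time computation a simple subtraction at service time. The names are illustrative:

```c
#include <assert.h>

#define QCAP 10800   /* at most one arrival per second over 3 hours */

/* Fixed-capacity circular-buffer FIFO of customer arrival times. */
typedef struct {
    int times[QCAP];   /* arrival second of each waiting customer */
    int head, tail, len;
} Queue;

static void q_init(Queue *q) { q->head = q->tail = q->len = 0; }
static int  q_empty(const Queue *q) { return q->len == 0; }

static void q_push(Queue *q, int arrival_sec)
{
    assert(q->len < QCAP);
    q->times[q->tail] = arrival_sec;
    q->tail = (q->tail + 1) % QCAP;
    q->len++;
}

static int q_pop(Queue *q)
{
    assert(q->len > 0);
    int t = q->times[q->head];
    q->head = (q->head + 1) % QCAP;
    q->len--;
    return t;
}
```

Scenario 1 then uses one Queue shared by both cashiers; scenario 2 uses two Queues, pushing each new arrival onto the shorter one.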

How do I figure out provisioned throughput for an AWS DynamoDB table?

My system is supposed to write a large amount of data into a DynamoDB table every day. These writes come in bursts, i.e., at certain times each day several different processes have to dump their output data into the same table. Speed of writing is not critical as long as all the daily data gets written before the next dump occurs. I need to figure out the right way of calculating the provisioned capacity for my table.
So for simplicity let's assume that I have only one process writing data once a day and it has to write up to X items into the table (each item < 1KB). Is the capacity I would have to specify essentially equal to X / 24 / 3600 writes/second?
Thx
The provisioned capacity is in terms of writes/second. You need to make sure that you can handle the PEAK number of writes/second that you are going to expect, not the average over the day. So, if you have a single process that runs once a day and makes X number of writes, of Y size (in KB, rounded up), over Z number of seconds, your formula would be
capacity = (X * Y) / Z
So, say you had 100K writes over 100 seconds and each write < 1KB, you would need 1000 w/s capacity.
Note that in order to minimize provisioned write capacity needs, it is best to add data into the system on a more continuous basis, so as to reduce peaks in necessary read/write capacity.
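The formula above, with ceiling division so that fractional capacity units round up, might be sketched in C as follows (the function name is illustrative):

```c
/* Peak write capacity units needed for X writes of Y KB (Y already
 * rounded up to a whole KB) spread over Z seconds: ceil(X * Y / Z). */
static long write_capacity_units(long x_writes, long y_kb_per_write,
                                 long z_seconds)
{
    long total_units = x_writes * y_kb_per_write;
    return (total_units + z_seconds - 1) / z_seconds;  /* ceiling division */
}
```

For the example in the answer, 100,000 writes of under 1 KB over 100 seconds comes out to 1000 writes/second of provisioned capacity.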
