Bin packing variation where item size varies with time

I have a use case similar to traditional bin packing (https://en.wikipedia.org/wiki/Bin_packing_problem) but different in the following way:
The size of each item, s(i), varies with time, and the distribution is available in advance.
The objective of the problem is: given a fixed bin size and a fixed number of bins, pack the items such that total average utilisation is maximised.
Definitions
utilisation of a bin at time t : min(sum of the sizes of the items in the bin at time t, bin capacity)
total average utilisation : average over time of (sum of the utilisation of each bin at time t)
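To make the objective concrete, here is the same definition written out as a formula (a sketch of my reading; C for the bin capacity, B_j for the set of items in bin j, and T for the number of time steps are notation I am introducing, not from the original question):
utilisation_j(t) = min( sum over items i in B_j of s_i(t), C )
total average utilisation = (1/T) * sum over t = 1..T of ( sum over bins j of utilisation_j(t) )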
Are there existing algorithms for the above type of problem? The most important part is that the size of the items varies with time.

How is Average Seek Time Calculated?

A hard disk system has the following parameters :
Number of tracks = 500
Number of sectors/track = 100
Number of bytes /sector = 500
Time taken by the head to move from one track to adjacent track = 1 ms
Rotation speed = 600 rpm.
What is the average time taken for transferring 250 bytes from the disk ?
Well, I wanted to know how the average seek time is calculated.
My Approach
Avg. time to transfer = Avg. seek time + Avg. rotational delay + Data transfer time
Avg Seek Time
given that : time to move between successive tracks is 1 ms
time to move from track 1 to track 1 : 0ms
time to move from track 1 to track 2 : 1ms
time to move from track 1 to track 3 : 2ms
..
..
time to move from track 1 to track 500 : 499 ms
Avg seek time = (0 + 1 + 2 + ... + 499) / 500
= 249.5 ms
But after reading the answer given here: Why is average disk seek time one-third of the full seek time?,
I'm confused about my approach.
My questions are:
Is my approach correct?
If not, please explain the correct way to calculate average seek time.
If yes, please explain why we are not considering the average over every possible pair of tracks (as mentioned in the above link).
There are a lot more than 500 possible seek times. Your method only accounts for seeks starting at track 1.
What about seeks starting from track 2? Or from track 285?
I wouldn't say your approach is wrong, but it's certainly incomplete.
As is pointed out in the link you're referring to in this question, the average seek time is calculated as the average distance from ANY track to ANY track. So you have to add all of the sub-sums (one for each possible starting track) to the one you are using to calculate the average seek time, and then divide that total by the number of tracks. It works out to roughly N/3, where N is the distance between track 0 and the last track.
E.g., the average distance from track 249 (a track near the middle) to any other track is one such sub-sum.
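A quick way to check the N/3 figure is to average the distance over every ordered pair of tracks. A minimal sketch of that check (mine, not from the answer above), at 1 ms per track:

#include <stdio.h>

int main(void) {
    const int tracks = 500;               /* tracks 0..499, so N = 499 */
    double total = 0.0;

    /* add up |from - to| over every ordered pair of tracks */
    for (int from = 0; from < tracks; from++)
        for (int to = 0; to < tracks; to++)
            total += (from > to) ? (from - to) : (to - from);

    double avg = total / ((double)tracks * tracks);
    printf("average seek distance = %.2f tracks, N/3 = %.2f\n",
           avg, 499.0 / 3.0);             /* about 166.67 vs 166.33 */
    return 0;
}

At 1 ms per track this gives an average seek of roughly 166 ms, noticeably less than the 249.5 ms you get when every seek is assumed to start at track 1.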
Your calculation is the average track seek; you need to add the sector seek to that.
When seeking for a read operation, the head is positioned on (a) a track, at a given (b) sector.
The (average) seek time is the time taken to move from one such position to any other position, considering both (a) the track and (b) the sector.
Once positioned, the read can start.
The disk RPM comes into play here: if the disk spins at 600 rpm and has 100 sectors per track, the time to move from one sector to the next adjacent one is
60,000 ms (one minute) / 600 rpm / 100 sectors per track = 1 ms
Normally, you would have to consider that as you change tracks, the disk is still spinning and thus account for the sector offset change. But since we are interested only in the average, this cancels out (hopefully).
So, to your 249.5 ms average track seek time, you need to add, by the same formula:
(0 + 1 + ... + 100) / 100 * 1 ms (sector seek speed) = 50.5 ms
Thus, the average seek time for both track and sector is 300 ms.
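For completeness, a small sketch that just reproduces this answer's own arithmetic (it keeps the answer's 0..100 sum for the sector average exactly as written):

#include <stdio.h>

int main(void) {
    /* average track seek: mean of the distances 0..499, at 1 ms per track */
    double track_sum = 0.0;
    for (int d = 0; d <= 499; d++)
        track_sum += d;
    double avg_track = track_sum / 500.0;       /* 249.5 ms */

    /* average sector seek: the answer's sum over 0..100, at 1 ms per sector */
    double sector_sum = 0.0;
    for (int d = 0; d <= 100; d++)
        sector_sum += d;
    double avg_sector = sector_sum / 100.0;     /* 50.5 ms */

    printf("%.1f ms + %.1f ms = %.1f ms\n",
           avg_track, avg_sector, avg_track + avg_sector);   /* 300.0 ms */
    return 0;
}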

Sampling Calculation / multimedia

I have a serious exam tomorrow and this is one of the sample questions provided. I have tried to solve this problem many times but could never get an accurate answer. There is no information on these calculations in my lecture materials. I googled many things and looked for ways of calculating this in two different books I have, but could not find anything related. I do not know the exact subject name for this sort of calculation, but I think it is multimedia/sampling. I would greatly appreciate any information regarding the problem; seriously, any briefing would do. I just want to be able to solve it. I have quoted the question below.
"A supermarket must store text, image and video information on 2,000
items. There is text information associated with each item occupying 0.5
Kb. For 200 items, it is also necessary to store an image consisting of 1
million pixels. Each pixel represents one of 255 colours. For 10 items, it is
also necessary to store a 4 second colour video (25 frames per second), to
be viewed on a screen with a resolution of 1000 x 1000 pixels. The total
storage required for the database is:"
TOTAL = (2,000 items x 0.5 kilobytes)
+ (200 items x 1,000,000 pixels x 1 byte each)
+ (10 items x (25 frames/second x 4 seconds) x (1,000 x 1,000 pixels x 1 byte each))
= 1,000,000 + 200,000,000 + 1,000,000,000
= 1,201,000,000 bytes = 1.201 GB
Notes:
Kb could represent either 1,000 or 1,024 bytes, depending on how coherent your syllabus is. Given the choice of the other numbers, I imagine it is 1,000.
Each of the 255 colours can be stored in a single-byte TINYINT (a byte can represent 256 distinct values).
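A quick sketch of the same arithmetic, under the answer's assumptions (1 Kb = 1,000 bytes, 1 byte per pixel):

#include <stdio.h>

int main(void) {
    long long text  = 2000LL * 500;                           /* 0.5 Kb of text per item   */
    long long image = 200LL * 1000000LL * 1;                  /* 1,000,000 pixels x 1 byte */
    long long video = 10LL * (25 * 4) * (1000LL * 1000) * 1;  /* 100 frames of 1000x1000   */

    long long total = text + image + video;
    printf("total = %lld bytes (about %.3f GB)\n", total, total / 1e9);  /* 1,201,000,000 */
    return 0;
}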

Search for a string from 100 million rows of strings

I have a text file containing md5 hashes, 100 million rows of them. I have another, smaller file with a few thousand md5 hashes. I want to find, for each md5 hash in this new smaller file, its corresponding row index in the old bigger file.
What is the most efficient way to do it? Is it possible to do it in 15 minutes or so?
I have tried lots of things, but they do not work. First I tried to import the bigger data into a database file and create an index on the md5 hash column. Creating this index takes forever. I am not even sure it would increase the query speed much. Suggestions?
Don't do this in a DB; use a simple program.
Read the md5 hashes from the small file into a hash map in memory, which allows for fast look-ups.
Then read through the md5s in the big file one row at a time, and check whether the row is in the hash map.
Average look-up time in the hash map ought to be close to O(1), so the processing time is basically how fast you can read through the big file.
The 15 minutes is easily achievable on today's hardware with this approach.
First of all: 100 megarows at 32 bytes each is roughly 3.2 GB of data. Reading them in 15 minutes translates to about 3.5 MB per second, which should easily be doable with modern hardware.
I recommend not using a database, but a process consisting of a few easy steps:
Sort your data - you only have to do this once, and you can parallelize much of it
Read the small file into memory (sorted into an array)
Cycle this array:
Read the big file line by line, comparing with the current line of your array (first compare the first byte, then the first and second, ...) until you either reach a match (output the index) or pass the value (output "not found")
Move to the next array element
The initial sort might easily take longer than 15 minutes, but the look-ups should be quite fast: if you have enough RAM (and an OS that supports processes bigger than 2 GB) you should be able to get a compare rate at least an order of magnitude faster!
There are algorithms specifically designed for searching for multiple strings in a large file. One of them is Rabin-Karp. I have a blog post about this.
More simply, the following pseudo-code should get you there in no time:
Load your few thousand strings in a set data structure
For each line (index: i) in your file
If that line appears in your set of values
print i
This will be very fast: the set data structure will have almost-instant look-ups, so the IO will be the culprit, and 100 million hash sums will fit in 15 minutes without too much difficulty.
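A minimal C sketch of that pseudo-code, using a sorted array plus bsearch() as the set, since plain C has no built-in hash set (the file names, the MAX_SMALL bound, and the error handling are assumptions of mine):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_SMALL 10000          /* assumed upper bound on the small file */
#define LINE_LEN  64             /* 32 hex chars plus newline and slack   */

static int cmp(const void *a, const void *b) {
    return strcmp((const char *)a, (const char *)b);
}

int main(void) {
    static char set[MAX_SMALL][LINE_LEN];
    size_t nset = 0;
    char line[LINE_LEN];

    /* load the few thousand hashes into a sorted array ("the set") */
    FILE *small = fopen("small.txt", "r");       /* placeholder file name */
    if (!small) return 1;
    while (nset < MAX_SMALL && fgets(set[nset], LINE_LEN, small)) {
        set[nset][strcspn(set[nset], "\r\n")] = '\0';
        nset++;
    }
    fclose(small);
    qsort(set, nset, LINE_LEN, cmp);

    /* scan the big file line by line; print the index of every matching row */
    FILE *big = fopen("big.txt", "r");           /* placeholder file name */
    if (!big) return 1;
    for (long i = 0; fgets(line, LINE_LEN, big); i++) {
        line[strcspn(line, "\r\n")] = '\0';
        if (bsearch(line, set, nset, LINE_LEN, cmp))
            printf("%ld\n", i);
    }
    fclose(big);
    return 0;
}

Each look-up is a binary search over the few thousand sorted hashes, so, as the answers above point out, the run time is dominated by reading the big file.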
Assumptions:
(1) every record in the small file appears in the large file
(2) the data in each file is randomly ordered.
Options:
(1) For each record in the large file, search the small file linearly for a match. Since most searches will not find a match, the time will be close to
Nlarge * Nsmall * k
where k represents the time to attempt one match.
(2) For each record in the small file, search the large file linearly for a match. Since every search will find a match, the time will be about
Nlarge/2 * Nsmall * k.
This looks twice as fast as option (1) -- but only if you can fit the large file completely into fast memory. You would probably need 6 GB of RAM.
(3) Sort the small file into an easily searchable form. A balanced binary tree is best, but a sorted array is almost as good. Or you could trust the author of some convenient hash table object to have paid attention in CS school. For each record in the large file, search the structured small file for a match. The time will be
log2 Nsmall * s
to sort the small file, where s represents the time to sort one record, plus
log2 Nsmall * Nlarge * k
for the scan. This gives a total time of
log2 Nsmall * (s + Nlarge * k).
(4) Sort the large file into an easily searchable form. For each record in the small file, search the structured large file for a match. The time will be
log2 Nlarge * s
to sort the large file plus
log2 Nlarge * Nsmall * k
for the scan, giving a total of
log2 Nlarge * (s + Nsmall * k).
Option (4) is obviously the fastest, as reducing any coefficient of Nlarge dominates all other improvements. But if the sortable structure derived from the large file will not fit completely into RAM, then option (3) might turn out to be faster.
(5) Sort the large file into an easily searchable form. Break this structure into pieces that will fit into your RAM. For each such piece, load the piece into RAM, then for each record in the small file, search the currently loaded piece for a match. The time will be
log2 Nlarge * s
to sort the large file plus
log2 Nlarge * Nsmall * k * p
for the scan, where the structure was broken into p pieces, giving a total of
log2 Nlarge * (s + Nsmall * k * p).
With the values you indicated for Nlarge and Nsmall, and enough RAM so that p can be kept to a single digit, option (5) seems likely to be the fastest.

How do I figure out provisioned throughput for an AWS DynamoDB table?

My system is supposed to write a large amount of data into a DynamoDB table every day. These writes come in bursts, i.e. at certain times each day several different processes have to dump their output data into the same table. The speed of writing is not critical as long as all the daily data gets written before the next dump occurs. I need to figure out the right way of calculating the provisioned capacity for my table.
So for simplicity let's assume that I have only one process writing data once a day, and it has to write up to X items into the table (each item < 1 KB). Is the capacity I would have to specify essentially equal to X / 24 / 3600 writes/second?
Thx
The provisioned capacity is in terms of writes/second. You need to make sure that you can handle the PEAK number of writes/second that you are going to expect, not the average over the day. So, if you have a single process that runs once a day and makes X number of writes, of Y size (in KB, rounded up), over Z number of seconds, your formula would be
capacity = (X * Y) / Z
So, say you had 100K writes over 100 seconds and each write < 1KB, you would need 1000 w/s capacity.
Note that in order to minimize provisioned write capacity needs, it is best to add data into the system on a more continuous basis, so as to reduce peaks in necessary read/write capacity.
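A tiny sketch of the same capacity calculation, using the X, Y, Z names from the answer above (the 0.8 KB item size in the example is just a placeholder):

#include <stdio.h>
#include <math.h>

/* capacity = (X * Y) / Z, with Y the item size in KB rounded up, as in the answer */
static double write_capacity(long x_items, double item_kb, double z_seconds) {
    double y = ceil(item_kb);
    return (x_items * y) / z_seconds;
}

int main(void) {
    /* the answer's example: 100K writes of < 1 KB each over 100 seconds */
    printf("%.0f writes/second\n", write_capacity(100000, 0.8, 100.0));   /* 1000 */
    return 0;
}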

How to resample an array of n elements into an array of m elements

I have an array of N measurements that I should present as a graph, but the graph can be only M pixels wide and is scrolled only by M pixels.
While M is constant, N can be anything from tens to thousands.
Each time I need to show the graph, I know what N is; however, as N/M may not be an integer, there is an accumulated error that I want to compensate for somehow.
I am working in plain C, and no math libraries can be used.
EDIT 2:
The data is relatively homogeneous, with peaks once in a while, and I do not want to miss these peaks while interpolating.
EDIT 3:
I am looking for a solution that will work well enough for any N, whether greater than or less than M.
Thanks.
One good solution is not to iterate over your input samples, but over your output positions. That is, you will always draw exactly M pixels. To calculate the nearest sample value for the ith pixel, use the array offset:
[(i*N+M/2)/M]
Of course only using the nearest sample will give a very aliased result (discarding most of your samples in the case where N is large). If you're sure N will always be larger than M, a good but simple approach is to average sufficient neighboring samples with a weighted average such that each sample gets a total weight of 1 (with endpoints having their weight split between neighboring output pixels). Of course there are more elaborate resampling algorithms you can use that may be more appropriate (especially if your data is something like audio samples which are more meaningful in the frequency domain), but for an embedded device with tight memory and clock cycle requirements, averaging is likely the approach you want.
