Sampling Calculation / multimedia - sampling

I have a serious exam tomorrow and this is one of the sample questions provided. I tried to solve this problem many times but I could never get an accurate answer. There are no information regarding on the calculations in my lecture materials. I googled many things and looked for ways of calculating this in two different books which I have but could not find anything related. I do not know what the exact subject name for these sort of calculations but I think it is multimedia/sampling. I would greatly appreciate any information regarding the problem seriously any briefing would do. I just want to be able to solve it. I have quoted the question below.
"A supermarket must store text, image and video information on 2,000
items. There is text information associated with each item occupying 0.5
Kb. For 200 items, it is also necessary to store an image consisting of 1
million pixels. Each pixel represents one of 255 colours. For 10 items, it is
also necessary to store a 4 second colour video (25 frames per second), to
be viewed on a screen with a resolution of 1000 x 1000 pixels. The total
storage required for the database is:"

TOTAL = 2,000 items x 0.5 kilobytes +
(200 items x (1,000,000 pixels x 1 byte each)) +
(10 items x (25 frames x 4
seconds) x (1,000 pixels x 1,000 pixels x 1 byte each))
= 1,000,000 + 200,000,000 + 1,000,000,000
= 1,201,000,000 bytes = 1.201 GB
Notes:
Kb could represent either 1000 or 1024, depending on how coherent your syllabus is. I imagine given the choice of the other numbers it is 1,000.
Each of 255 colors can be stored in a single byte TINYINT (as 256 is the TINYINT max).

Related

Maximum Number of Cells in a Cassandra Table

I have a system that stores measurements from machines with many transducers, once per second. I'm considering using Cassandra and would like to store the 1 second sample of machine state measurements in a single table, which would be something like:
create table inst_samples (
machine_id text,
batch_id int,
sample_time timestamp,
var1 double,
var2 double,
.....
varN double,
PRIMARY KEY ((machine_id, batch_id), sample_time)
);
There are approximately 20 machines with 400 state variables each and the batch_id will update every 1-2 hours. I have reviewed the documentation on the 2 billion cells maximum per table and noted similar questions
here What are the maximum number of columns allowed in Cassandra and here Cassandra has a limit of 2 billion cells per partition, but what's a partition?
If I am understanding this limit correctly I would hit the 2 billion cell limit for a single machine in the inst_samples table in approximately 60 days?
(2e9 cells / 400 cols/row) / (3600 rows / hour) / (24 hours / day) =~ 58 days?
I am a total Cassandra newbie. Thanks.
This 2 billion limit is for partition, and if you have good data model, you should have many partitions. In practice, it's recommended to keep number of cells per partition under control - something like, not more 100,000 cells per partition, otherwise there could be some performance problems, etc. But the actual limit depends on the multiple factors, like Cassandra version, what queries are executed, etc.
In your case, we have partition key of machine_id + batch_id, and that gives us for batch size of 2 hours: 400x7200 = 2880000 - almost 3 million cells. It may still work (would be better if you set batch size to 1 hour), but will require testing on real hardware - this could be done for example, with NoSQLBench.
There are also other ways to optimize your data model - for example, instead of allocating a separate column for every variable, just use frozen<map<text, double>> - in this case, all measurements will be stored as a single cell. The drawback of it - you can't change the individual values without reading the map & inserting it with changed value. Another drawback is that you'll need to read all measurements at once - but this could be ok.

Is there a space efficent way to store and retrieve the order of a dataset?

Here's my problem. I have a set of 20 objects stored in memory as an array. I want to store a second piece of data that defines an order for the objects to be displayed.
The simplest way to store the order is as an array of 20 unsigned integers, each of which is 5 bits (aka 0-31). The position of the object in the output list would be defined by the number stored in this array at the same index as the object in it's array.
But.. I know from statistics that there are only 20! (that's 20 factorial), ways to arrange these objects.
This could be stored in 62 bits, since 2^62 > 20!
I'm currently using 100 bits to store the same information.
So my question is this: Is there a space efficient way to store ORDER as a sequence of bits?
I have some addition constraints as well. This will run on an embedded device, so I can't use any huge arrays or high level math functions. I would need a simple iterative method.
Edit: Some clarification on why this is necessary. Say for example the objects are pictures, and they're stored in ROM (aka they can't be moved around). Now lets say I want to keep track of what order to display the images in, and i'm going to update that order every second. My device has 1k of storage with wear leveling, but each bit in the storage can only be written 1000 times before it becomes unreliable. If I need 1kb to store the order, than my device will only work for 1000 seconds. If I need 0.1kb, it will work for 10k seconds, and so on. Thus the devices longevity will be inversely proportional to the number of bits I need to update every cycle.
You can store the order in a single 64-bit value x:
For the first choice, 20 possibilities, compute the index as x % 20 and update x as x /= 20,
For the next choice, only 19 possibilities, compute x % 19 and update x as x /= 19.
Continue this process 17 more times and you are done.
I think I've found a partial solution to my own question. Assuming I start at the left side of the order array, for every move right there are fewer remaining possibilities for the position value. The number of possibilities is 20,19,18,etc. I can take advantage of this by populating the order array in a relative fashion. The first index will place a value in the order array. There are 20 possibilities so this takes 5 bits. Placing the next value, there are only 19 position available (still 5 bits). Proceeding though the whole array. The bits-required is now 5,5,5,5,4,4,4,4,4,4,4,4,3,3,3,3,2,2,1,0. So that gets me down to 69 bits, much better.
There's still some "wasted" precision in each of the values, since for example the first position can store 32 possible values, even though there are only 20. I'm not sure how to deal with this, but I think will have something to do with carrying a remainder from one calculation to the next..

lookup table AI

I have a question about the number of lookup tables generated when dealing with AI. I'm reading the book AI, A modern approach and there, I read an example that the lookup table will contain
∑|P|^t
where P is possible percepts and t is the lifetime of the agent. In the book, the visual input for from a single camera, at a rate of 27 megabytes per second (30 frames per second, 640X480 pixels with 24 bits of color information) will lead to
10^(250,000,000,000) entries in the lookup table.
To understand this, I read online and for the same hour, the visual input from
a single camera comes in at the rate of 50 megabytes per second (25 frames per second, 1000X1000 pixels with 8 bits of color and 8 bits of intensity information). So the lookup table for an
hour would be 2^(60*60*50M) entries.
Can someone explain me what's the difference between the two answers? How come they are so different?
yeah, according to the formula the lookup tables for each rate/type and encoding would be different. Is there something I'm missing to answer?

Is the Leptonica implementation of 'Modified Median Cut' not using the median at all?

I'm playing around a bit with image processing and decided to read up on how color quantization worked and after a bit of reading I found the Modified Median Cut Quantization algorithm.
I've been reading the code of the C implementation in Leptonica library and came across something I thought was a bit odd.
Now I want to stress that I am far from an expert in this area, not am I a math-head, so I am predicting that this all comes down to me not understanding all of it and not that the implementation of the algorithm is wrong at all.
The algorithm states that the vbox should be split along the lagest axis and that it should be split using the following logic
The largest axis is divided by locating the bin with the median pixel
(by population), selecting the longer side, and dividing in the center
of that side. We could have simply put the bin with the median pixel
in the shorter side, but in the early stages of subdivision, this
tends to put low density clusters (that are not considered in the
subdivision) in the same vbox as part of a high density cluster that
will outvote it in median vbox color, even with future median-based
subdivisions. The algorithm used here is particularly important in
early subdivisions, and 3is useful for giving visible but low
population color clusters their own vbox. This has little effect on
the subdivision of high density clusters, which ultimately will have
roughly equal population in their vboxes.
For the sake of the argument, let's assume that we have a vbox that we are in the process of splitting and that the red axis is the largest. In the Leptonica algorithm, on line 01297, the code appears to do the following
Iterate over all the possible green and blue variations of the red color
For each iteration it adds to the total number of pixels (population) it's found along the red axis
For each red color it sum up the population of the current red and the previous ones, thus storing an accumulated value, for each red
note: when I say 'red' I mean each point along the axis that is covered by the iteration, the actual color may not be red but contains a certain amount of red
So for the sake of illustration, assume we have 9 "bins" along the red axis and that they have the following populations
4 8 20 16 1 9 12 8 8
After the iteration of all red bins, the partialsum array will contain the following count for the bins mentioned above
4 12 32 48 49 58 70 78 86
And total would have a value of 86
Once that's done it's time to perform the actual median cut and for the red axis this is performed on line 01346
It iterates over bins and check they accumulated sum. And here's the part that throws me of from the description of the algorithm. It looks for the first bin that has a value that is greater than total/2
Wouldn't total/2 mean that it is looking for a bin that has a value that is greater than the average value and not the median ? The median for the above bins would be 49
The use of 43 or 49 could potentially have a huge impact on how the boxes are split, even though the algorithm then proceeds by moving to the center of the larger side of where the matched value was..
Another thing that puzzles me a bit is that the paper specified that the bin with the median value should be located, but does not mention how to proceed if there are an even number of bins.. the median would be the result of (a+b)/2 and it's not guaranteed that any of the bins contains that population count. So this is what makes me thing that there are some approximations going on that are negligible because of how the split actually takes part at the center of the larger side of the selected bin.
Sorry if it got a bit long winded, but I wanted to be as thoroughas I could because it's been driving me nuts for a couple of days now ;)
In the 9-bin example, 49 is the number of pixels in the first 5 bins. 49 is the median number in the set of 9 partial sums, but we want the median pixel in the set of 86 pixels, which is 43 (or 44), and it resides in the 4th bin.
Inspection of the modified median cut algorithm in colorquant2.c of leptonica shows that the actual cut location for the 3d box does not necessarily occur adjacent to the bin containing the median pixel. The reasons for this are explained in the function medianCutApply(). This is one of the "modifications" to Paul Heckbert's original method. The other significant modification is to make the decision of which 3d box to cut next based on a combination of both population and the product (population * volume), thus permitting splitting of large but sparsely populated regions of color space.
I do not know the algo, but I would assume your array contains the population of each red; let's explain this with an example:
Assume you have four gradations of red: A,B,C and D
And you have the following sequence of red values:
AABDCADBBBAAA
To find the median, you would have to sort them according to red value and take the middle:
median
v
AAAAAABBBBCDD
Now let's use their approach:
A:6 => 6
B:4 => 10
C:1 => 11
D:2 => 13
13/2 = 6.5 => B
I think the mismatch happened because you are counting the population; the average color would be:
(6*A+4*B+1*C+2*D)/13

What is the length of time to send a list of 200,000 integers from a client's browser to an internet sever?

Over the connections that most people in the USA have in their homes, what is the approximate length of time to send a list of 200,000 integers from a client's browser to an internet sever (say Google app engine)? Does it change much if the data is sent from an iPhone?
How does the length of time increase as the size of the integer list increases (say with a list of a million integers) ?
Context: I wasn't sure if I should write code to do some simple computations and sorting of such lists for the browser in javascript or for the server in python, so I wanted to explore this issue of how long it takes to send the output data from a browser to a server over the web in order to help me decide where (client's browser or app engine server) is the best place for such computations to be processed.
More Context:
Type of Integers: I am dealing with 2 lists of integers. One is a list of ids for the 200,000 objects whose integers look like {0,1,2,3,...,99,999}. The second list of 100,000 is just single digits {...,4,5,6,7,8,9,0,1,...} .
Type of Computations: From the browser a person will create her own custom index (or rankings) based changing the weights associated to about 10 variables referenced to the 100,000 objects. INDEX = w1*Var1 + w2*Var2 + ... wNVarN. So the computations refer to vector (array) multiplication to a scalar and addition of 2 vectors, as well as sorting the final INDEX variable vector of 100,000 values.
In a nutshell...
This is probably a bad idea,
in particular with/for mobile devices where, aside from the delay associated with transfer(s), limits and/or extra fees associated with monthly volumes exceeding various plans limits make this a lousy economical option...
A rough estimate (more info below) is that the one-way transmission takes between 0.7 and and 5 seconds.
There is a lot of variability in this estimate, due mainly to two factors
Network technology and plan
compression ratio which can be obtained for a 200k integers.
Since the network characteristics are more or less a given, the most significant improvement would come from the compression ratio. This in turn depends greatly on the statistic distribution of the 200,000 integers. For example, if most of them are smaller than say 65,000, it would be quite likely that the list would compress to about 25% of its original size (75% size reduction). The time estimates provided assumed only a 25 to 50% size reduction.
Another network consideration is the availability of binary mime extension (8 bits mime) which would avoid the 33% overhead of B64 for example.
Other considerations / idea:
This type of network usage for iPhone / mobile devices plans will not fare very well!!!
ATT will love you (maybe), your end-users will hate you at least the ones with plan limits, which many (most?) have.
Rather than sending one big list, you could split the list over 3 or 4 chunks, allowing the server-side sorting to take place [mostly] in parallel to the data transfer.
One gets better compression ratio for integers when they are [roughly] sorted, maybe you can have a first pass sorting of some kind client-side.
How do I figure? ...
1) Amount of data to transfer (one-way)
200,000 integers
= 800,000 bytes (assumes 4 bytes integers)
= 400,000 to 600,000 bytes compressed (you'll want to compress!)
= 533,000 to 800,000 bytes in B64 format for MIME encoding
2) Time to upload (varies greatly...)
Low-end home setup (ADSL) = 3 to 5 seconds
broadband (eg DOCSIS) = 0.7 to 1 second
iPhone = 0.7 to 5 seconds possibly worse;
possibly a bit better with high-end plan
3) Time to download (back from server, once list is sorted)
Assume same or slightly less than upload time.
With portable devices, the differential is more notable.
The question is unclear of what would have to be done with the resulting
(sorted) array; so I didn't worry to much about the "return trip".
==> Multiply by 2 (or 1.8) for a safe estimate of a round trip, or inquire
about specific network/technlogy.
By default, typically integers are stored in a 32-bit value, or 4 bytes. 200,000 integers would then be 800,000 bytes, or 781.25 kilobytes. It would depend on the client's upload speed, but at 640Kbps upload, that's about 10 seconds.
well that is 800000 bytes or 781.3 kb, or you could say the size of a normal jpeg photo. for broadband, that would be within seconds, and you could always consider compression (there are libraries for this)
the time increases linearly for data.
Since you're sending the data from JavaScript to the server, you'll be using a text representation. The size will depend a lot on the number of digits in each integer. Are talking about 200,000 two to three digit integers or six to eight integers? It also depends on if HTTP compression is enabled and if Safari on the iPhone supports it (I'm not sure).
The amount of time will be linear depending on the size. Typical upload speeds on an iPhone will vary a lot depending on if the user is on a business wifi, public wifi, home wifi, 3G, or Edge network.
If you're so dependent on performance perhaps this is more appropriate for a native app than an HTML app. Even if you don't do the calculations on the client, you can send/receive binary data and compress it which will reduce time.

Resources