What is the length of time to send a list of 200,000 integers from a client's browser to an internet server? - google-app-engine

Over the connections that most people in the USA have in their homes, what is the approximate length of time to send a list of 200,000 integers from a client's browser to an internet server (say, Google App Engine)? Does it change much if the data is sent from an iPhone?
How does the length of time increase as the size of the integer list increases (say, with a list of a million integers)?
Context: I wasn't sure whether I should write code to do some simple computations and sorting of such lists in JavaScript in the browser or in Python on the server, so I wanted to explore how long it takes to send the output data from a browser to a server over the web, to help me decide where (client's browser or App Engine server) these computations are best processed.
More Context:
Type of integers: I am dealing with 2 lists of integers. One is a list of ids for the 200,000 objects, with values like {0, 1, 2, 3, ..., 99,999}. The second list, of 100,000 integers, is just single digits {..., 4, 5, 6, 7, 8, 9, 0, 1, ...}.
Type of computations: From the browser, a person will create her own custom index (or ranking) by changing the weights associated with about 10 variables referenced to the 100,000 objects: INDEX = w1*Var1 + w2*Var2 + ... + wN*VarN. So the computations are multiplication of a vector (array) by a scalar, addition of 2 vectors, and sorting the final INDEX vector of 100,000 values.
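For scale, the computation itself is cheap compared with the transfer. A minimal Python sketch of the weighted-index-and-sort step described above (the data here is random and purely illustrative):

import random

N_OBJECTS = 100000  # 10 variables over 100,000 objects, per the question
N_VARS = 10

# Illustrative data: one list of values per variable, plus one weight each.
variables = [[random.random() for _ in range(N_OBJECTS)] for _ in range(N_VARS)]
weights = [random.random() for _ in range(N_VARS)]

# INDEX = w1*Var1 + w2*Var2 + ... + wN*VarN, evaluated per object.
index = [sum(w * var[i] for w, var in zip(weights, variables))
         for i in range(N_OBJECTS)]

# Rank object ids by their index value, highest first.
ranking = sorted(range(N_OBJECTS), key=index.__getitem__, reverse=True)

This runs in about a second or less in either pure Python or JavaScript, so the transfer time, not the computation, is likely to dominate the decision.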

In a nutshell...
This is probably a bad idea,
in particular for mobile devices, where, aside from the transfer delay, the limits and extra fees on plans with monthly volume caps make this a lousy economic option...
A rough estimate (more info below) is that the one-way transmission takes between 0.7 and 5 seconds.
There is a lot of variability in this estimate, due mainly to two factors:
the network technology and plan
the compression ratio that can be obtained for 200k integers
Since the network characteristics are more or less a given, the most significant improvement would come from the compression ratio. This in turn depends greatly on the statistical distribution of the 200,000 integers. For example, if most of them are smaller than, say, 65,000, it would be quite likely that the list would compress to about 25% of its original size (a 75% size reduction). The time estimates provided assume only a 25 to 50% size reduction.
Another network consideration is the availability of a binary MIME extension (8-bit MIME), which would avoid the 33% overhead of Base64, for example.
Other considerations / idea:
This type of network usage will not fare well on iPhone / mobile device plans!
AT&T will love you (maybe); your end users will hate you, at least the ones with plan limits, which many (most?) have.
Rather than sending one big list, you could split it over 3 or 4 chunks, allowing the server-side sorting to take place [mostly] in parallel with the data transfer.
One gets a better compression ratio for integers when they are [roughly] sorted, so maybe you can do a first-pass sort of some kind client-side (see the quick experiment below).
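To see how much ordering and distribution matter, here is a quick zlib experiment one could run (the id distribution is an assumption; your real data will compress differently):

import random
import struct
import zlib

ids = [random.randrange(100000) for _ in range(200000)]

def sizes(values):
    # Pack as 4-byte little-endian ints, then compress at zlib's max level.
    raw = struct.pack('<%di' % len(values), *values)
    return len(raw), len(zlib.compress(raw, 9))

raw_len, comp_len = sizes(ids)
_, sorted_len = sizes(sorted(ids))
print('raw:', raw_len, 'compressed:', comp_len, 'sorted+compressed:', sorted_len)

Sorting first tends to shrink the compressed size because neighbouring values share high-order bytes; delta-encoding the sorted list before compressing typically helps even more.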
How do I figure? ...
1) Amount of data to transfer (one-way)
200,000 integers
= 800,000 bytes (assumes 4 bytes integers)
= 400,000 to 600,000 bytes compressed (you'll want to compress!)
= 533,000 to 800,000 bytes in B64 format for MIME encoding
2) Time to upload (varies greatly...)
Low-end home setup (ADSL) = 3 to 5 seconds
broadband (e.g. DOCSIS) = 0.7 to 1 second
iPhone = 0.7 to 5 seconds, possibly worse;
possibly a bit better with a high-end plan
3) Time to download (back from server, once list is sorted)
Assume the same as, or slightly less than, the upload time.
With portable devices, the differential is more notable.
The question doesn't say what would have to be done with the resulting
(sorted) array, so I didn't worry too much about the "return trip".
==> Multiply by 2 (or 1.8) for a safe estimate of a round trip, or inquire
about the specific network/technology.
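The arithmetic behind those figures, as a sketch (the uplink rates are illustrative assumptions, not measurements):

def upload_seconds(payload_bytes, uplink_bps):
    # One-way transfer time, ignoring latency and protocol overhead.
    return payload_bytes * 8.0 / uplink_bps

for label, size in [('50% compression + B64', 533000),
                    ('25% compression + B64', 800000)]:
    for link, bps in [('ADSL uplink ~1.5 Mbps', 1500000),
                      ('DOCSIS uplink ~6 Mbps', 6000000)]:
        print('%s over %s: %.1f s' % (label, link, upload_seconds(size, bps)))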

Typically, integers are stored as 32-bit values, or 4 bytes each. 200,000 integers would then be 800,000 bytes, or 781.25 kilobytes. The time depends on the client's upload speed, but at 640 Kbps upload, that's about 10 seconds.

Well, that is 800,000 bytes, or 781.3 KB, roughly the size of a normal JPEG photo. For broadband, that would transfer within seconds, and you could always consider compression (there are libraries for this).
The time increases linearly with the amount of data.

Since you're sending the data from JavaScript to the server, you'll be using a text representation. The size will depend a lot on the number of digits in each integer. Are we talking about 200,000 two-to-three-digit integers or six-to-eight-digit integers? It also depends on whether HTTP compression is enabled and whether Safari on the iPhone supports it (I'm not sure).
The amount of time will scale linearly with the size. Typical upload speeds on an iPhone will vary a lot depending on whether the user is on business wifi, public wifi, home wifi, 3G, or the EDGE network.
If you're that dependent on performance, perhaps this is more appropriate for a native app than an HTML app. Even if you don't do the calculations on the client, you can send/receive binary data and compress it, which will reduce the time (see the size comparison below).
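To put rough numbers on the text-versus-binary point, a small sketch comparing payload sizes (random six-digit integers are an assumption; fewer digits make the text form smaller):

import random
import struct
import zlib

values = [random.randrange(1000000) for _ in range(200000)]

as_text = ','.join(map(str, values)).encode('ascii')    # what JS would likely POST
as_binary = struct.pack('<%di' % len(values), *values)  # 4 bytes per integer

print('text: %d bytes (%d compressed)' % (len(as_text), len(zlib.compress(as_text))))
print('binary: %d bytes (%d compressed)' % (len(as_binary), len(zlib.compress(as_binary))))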

Related

Compressing a sparse bit array

I have arrays of 1024 bytes (8192 bits) which are mostly zero.
Between 0.01% and 10% of bits will be set (random, no pattern).
How could these be compressed, given the lack of structure and the relatively small size?
(My first thought was to store the distances between set bits. I need 13 bits for each distance, but at worst case 10% occupancy this needs 13 * 816 / 8 = 1326 bytes, which is not an improvement.)
This is for ultra-low bandwidth comms, so every byte matters.
I've dealt deeply with a similar problem, but my sets are much bigger (30 million possible values with between 1 and 30 million elements in each set), so they both gain much more from compression and the compression metadata is insignificant compared to the size of the data. I have never gone down to squeezing things into units smaller than uint16_t, so the things I write below might not apply if you start chopping up 13 bit values into pieces. It feels like it should work, but caveat emptor.
What I've found works is to employ several strategies depending on the particular data. The good news is that the count of elements in each set is a very good indicator of which compression strategy will work best for a particular set, so the only metadata you need is the count of elements. In my data format the first and only metadata value (I'll be unspecific and just call it a "value"; you can squeeze things into bytes, 16-bit values, or 13-bit values however you like) is the count of elements in the set; the rest is just the encoding of the set elements.
The strategies are:
If very few elements are in the set, you can't do better than an array that says "1, 4711, 8140", so in this case the data is encoded as: [3, 1, 4711, 8140]
If almost all elements are in the set, you can just keep track of elements that aren't. For example [8190, 17, 42].
If around half of the elements are in the set, you pretty much can't do better than a bitmap, so you get [4000, {bitmap}]; this is the only case where your data ends up longer than strictly uncompressed.
If more than "a few" but many fewer than "around half" of the elements are set, I found another strategy. Divide the bits of your possible values in half. Say we have 2^16 possible values (it's easier to describe; it should work similarly for 2^13). The values are divided into 256 ranges, each covering 256 possible values. We then have an array of 256 bytes, each of which says how many values fall in its range (so byte 0 tells us how many elements are in [0,255], byte 1 gives us [256,511], etc.); immediately after follow arrays with the values in each range mod 256. The trick here is that while every element encoded as an array (strategy 1) would take 2 bytes, in this scheme each element takes only 1 byte, plus 256 static bytes for the counts. This means that as soon as we have more than 256 elements in the set, switching from strategy 1 to 4 saves space.
Strategy 4 can be refined (probably meaningless if your data is random, as you mention, but my data sometimes had more patterns, so it worked for me). Since we still need 8 bits for each element in the previous encoding, as soon as a sub-array goes over 32 elements (the 32 bytes a 256-bit bitmap would take), we can store it as a bitmap instead. This is also a good breakpoint for switching between strategies 4/5 and 3: if all the sub-arrays in this strategy would be bitmaps anyway, we should simply use strategy 3 (it's more complicated than that, but the breakpoints between strategies can be precomputed accurately enough that you'll end up picking the most efficient strategy almost every time).
I have only vaguely tried saving deltas between numbers in the set. Quick experiments showed that they weren't really much more efficient than the strategies I mentioned above, had unpredictable degenerate cases, but most importantly, the application I work with really likes to not have to deserialise its data, just use it raw straight from disk (mmap).
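For concreteness, a minimal sketch of the strategy switch described above, scaled down to the asker's 8192-bit universe (the breakpoints are illustrative; in practice you would precompute them, and strategy 4's range-bucket scheme is omitted for brevity):

UNIVERSE = 8192  # possible values per array, per the question

def encode(members):
    # Pick the cheapest of: member list, complement list, or raw bitmap.
    # The element count, stored first, tells the decoder which strategy
    # to expect.
    n = len(members)
    if n * 13 < UNIVERSE:                  # few elements: list them, 13 bits each
        return ('list', n, sorted(members))
    if (UNIVERSE - n) * 13 < UNIVERSE:     # nearly full: list the absentees
        absent = sorted(set(range(UNIVERSE)) - set(members))
        return ('complement', n, absent)
    bitmap = bytearray(UNIVERSE // 8)      # middling density: plain bitmap
    for m in members:
        bitmap[m // 8] |= 1 << (m % 8)
    return ('bitmap', n, bytes(bitmap))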

How to compress/archive a temperature curve effectively?

Summary: An industrial thermometer is used to sample temperature at a technology device. For a few months, the samples are simply stored in an SQL database. Are there any well-known ways to compress the temperature curve so that a much longer history could be stored effectively (say, for audit purposes)?
More details: Actually, there are many more thermometers, and possibly other sensors related to the technology. And there are well-known time intervals where the curve belongs to a batch processed on the machine. The temperature curves should be added to the batch documentation.
My idea was that the temperature is a smooth function that could be interpolated somehow, say, the way sound is compressed in the MP3 format. The compression need not be lossless. However, it must be possible to reconstruct the temperature curve (not necessarily the identical sample values or the identical sampling interval), say, to be able to plot the curve or to tell what the temperature was at a certain time.
The raw sample values from the SQL table would be processed, the compressed version would be stored elsewhere (possibly also in an SQL database, as a blob), and later the raw samples could be deleted to save database space.
Is there any well-known and widely used approach to the problem?
A simple approach would be to code the temperature into one or two bytes, depending on the range and precision you need, and then to write the first temperature to your output, followed by the difference between successive temperatures for all the rest. For two-byte temperatures you can restrict the range somewhat and write one or two bytes depending on the difference, using a variable-length integer. E.g. if the high bit of the first byte is set, then the next byte contains 8 more bits of difference, allowing for 15 bits of difference. Most of the time it will be one byte, based on your description.
Then take that stream and feed it to a standard lossless compressor, e.g. zlib.
Any lossiness should be introduced at the sampling step, encoding only the number of bits you really need to encode the required range and precision. The rest of the process should then be lossless to avoid systematic drift in the decompressed values.
Subtracting successive values is the simplest predictor: the prediction of the next value is the value before it. It may also be the most effective, depending on the noisiness of your data. If your data is really smooth, then you could try a higher-order predictor to see if you get better performance. E.g. a predictor for the next point using the last two points is 2a - b, where a is the previous point and b is the point before that; using the last three points it is 3a - 3b + c, where c is the point before b. (These assume equal time steps between samples.)
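A sketch of the delta-plus-variable-length scheme just described (the 7/15-bit split follows the answer; the zigzag step for signed deltas and the int16 first sample are illustrative choices):

import struct
import zlib

def encode(samples):
    # samples: temperatures already quantised to integers (the lossy step).
    # The first value is stored whole; each later value as a zigzag-coded
    # delta: one byte if it fits in 7 bits, else two bytes (15 bits of
    # difference), flagged by the high bit of the first byte.
    out = bytearray(struct.pack('<h', samples[0]))
    prev = samples[0]
    for s in samples[1:]:
        d = s - prev
        prev = s
        z = (d << 1) ^ (d >> 31)           # zigzag: small |delta| -> small code
        if z < 0x80:
            out.append(z)
        else:                              # high bit set: one more byte follows
            out.append(0x80 | (z & 0x7F))
            out.append((z >> 7) & 0xFF)    # deltas beyond 15 bits are out of range
    return zlib.compress(bytes(out))

Decoding reverses the steps exactly, so the only loss is whatever quantisation was applied before encoding.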

Looking for an ultrafast data store to perform intersect operations

I've been using Redis for a while as a backend for Resque and now that I'm looking for a fast way to perform intersect operation on large sets of data, I decided to give Redis a shot.
I've been conducting the following test:
— x, y and z are Redis sets, they all contain approx. 1 million members (random integers taken from a seed array containing 3M+ members).
— I want to intersect x, y and z, so I'm using SINTERSTORE (to avoid the overhead of retrieving the result from the server to the client)
sinterstore r x y z
— the resulting set (r) contains about half a million members, Redis computes this set in approximately half a second.
Half a second is not bad, but I would need to perform such calculations on sets that could contain more than a billion members each.
I haven't tested how Redis would react with such enormous sets but I assume it would take a lot more time to process the data.
Am I doing this right? Is there a faster way to do that?
Notes:
— native arrays aren't an option since I'm looking for a distributed data store that would be accessed by several workers.
— I get these results on an 8-core 3.4GHz Mac with 16GB of RAM; disk persistence has been disabled in the Redis configuration.
I suspect that bitmaps are your best hope.
In my experience, Redis is a perfect server for bitmaps; you would use the string data structure (one of the five data structures available in Redis).
Many, or perhaps all, of the operations you will need to perform are available out of the box in Redis, as atomic operations.
The Redis SETBIT operation has O(1) time complexity.
In a typical implementation, you would hash your array values to offsets on the bit string, then set each bit at its corresponding offset (or index), like so:
>>> r1.setbit('k1', 20, 1)
the first argument is the key, the second is the offset (index value) and the third is the value at that index on the bitmap.
To find whether a bit is set at this offset (20), call getbit, passing in the key for the bit string.
>>> r1.getbit('k1', 20)
Then, on those bitmaps, you can of course perform the usual bitwise operations, e.g. logical AND, OR, XOR.
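A short sketch with redis-py, assuming a local Redis instance (the key names are illustrative; SETBIT, GETBIT, BITOP, and BITCOUNT are standard Redis commands):

import redis

r = redis.Redis()

def add_members(key, members):
    # Represent a set of non-negative ints as a bitmap in a Redis string.
    pipe = r.pipeline()
    for m in members:
        pipe.setbit(key, m, 1)   # O(1) per bit
    pipe.execute()

add_members('x', [3, 42, 1000])
add_members('y', [7, 42, 1000])

# Server-side intersection: bitwise AND of the two bitmaps into 'dest'.
r.bitop('AND', 'dest', 'x', 'y')
print(r.getbit('dest', 42))   # 1: present in both sets
print(r.bitcount('dest'))     # 2: cardinality of the intersection

Note that if your values are hashed down to offsets, distinct members can collide on the same bit, so membership becomes approximate (Bloom-filter-like) unless the mapping is exact.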

Loading tiles for a 2D game

I'm trying to make a 2D online game (with Z positions), and currently I'm working on loading a map from a txt file. I have three different map files. One contains an int for each tile saying what kind of floor there is, one says what kind of decoration there is, and one says what might be covering the tile. The problem is that the current map (20, 20, 30) takes 200 ms to load, and I want it to be much, much bigger. I have tried to find a good solution for this and have so far come up with some ideas.
Recently I've thought about storing all tiles in separate files, one file per tile. I'm not sure if this is a good idea (it feels wrong somehow), but it would mean that I wouldn't have to store any unnecessary tiles as "-1" in a text file, and I would be able to just pick the right tile from the folder easily during run time (read the file named mapXYZ). If the tile is empty, I would just catch the FileNotFoundException. Could anyone tell me a reason why this is a bad solution? Other solutions I've thought about would be to split the map into smaller parts or to read the map during startup in a BackgroundWorker.
Try making a much larger map in the same format as your current one first - it may be that the 200ms is mostly just overhead of opening and initial processing of the file.
If I'm understanding your proposed solution (opening one file per X,Y or X,Y,Z coordinate of a single map), this is a bad idea for two reasons:
There will be significant overhead to opening so many files.
Catching a FileNotFoundException and eating it will be significantly slower - there is actually a lot of overhead with catching exceptions, so you shouldn't rely on them to perform application logic.
Are you loading the file from a remote server? If so, that's why it's taking so long. Instead, you should embed the file in the game. I say this because each tile probably takes 2-3 bytes, so the file is about 30 KB, and 200 ms sounds like a reasonable download time for a file of that size (including overhead etc., depending on your internet connection).
Regarding how to lower the file size, there are two easy techniques I can think of that will decrease it a bit:
1) If you have mostly empty squares and only some significant ones, your map is what is often referred to as "sparse". When storing a sparse array of data you can use a simple compression technique (formally known as run-length encoding) where each time you come across empty squares, you specify how many of them there are. So, for example, instead of {0,0,0,0,0,0,0,0,0,0,1,1,2,3,0,0,0,0,0,0,0,0,0,0,0,0,1} you could store {10 0's, 1, 1, 2, 3, 12 0's, 1}.
2) To save space, I recommend that you store everything as binary data. The exact setup of the file mainly depends on how many possible tile types there are, but this is a better solution than storing the ASCII characters corresponding to the base-10 representation of the numbers, separated by delimiters.
Example Binary Format
File is organized into segments which are 3 or 4 bytes long, as explained below.
First segment indicates the version of the game for which the map was created. 3 bytes long.
Segments 2, 3, and 4 indicate the dimensions of the map (x, y, z). 3 bytes long each.
The remaining segments each indicate a tile number; they are 3 bytes long with an MSB of 0, with one exception, described next.
If a segment represents a run of empty tiles, it is 4 bytes long with an MSB of 1, and indicates the number of consecutive empty tiles (including that one).
The reason I suggest the MSB flag is so that you can distinguish between segments that hold tile numbers and segments that count the empty tiles that follow. For the counting segments I increase the length to 4 bytes (you might want to make it 5) so that you can store larger runs of empty tiles per segment. A writer for this format is sketched below.
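A sketch of a writer for the format described above (the big-endian layout and the -1 marker for empty tiles are illustrative choices; the reader is omitted):

import struct

def write_map(path, version, dims, tiles, empty=-1):
    # tiles: flat list of tile ids in x, y, z order; `empty` marks empty tiles.
    def seg3(value):
        # 3-byte big-endian segment; masking keeps the MSB at 0.
        return struct.pack('>I', value & 0x7FFFFF)[1:]
    with open(path, 'wb') as f:
        f.write(seg3(version))
        for d in dims:              # x, y, z dimensions, one segment each
            f.write(seg3(d))
        i = 0
        while i < len(tiles):
            if tiles[i] == empty:   # run of empty tiles: 4 bytes, MSB set
                run = 1
                while i + run < len(tiles) and tiles[i + run] == empty:
                    run += 1
                f.write(struct.pack('>I', 0x80000000 | run))
                i += run
            else:                   # ordinary tile: 3 bytes, MSB clear
                f.write(seg3(tiles[i]))
                i += 1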

Any strategies for assessing the trade-off between CPU loss and memory gain from compression of data held in a datastore model's TextProperty?

Are very large TextProperties a burden? Should they be compressed?
Say I have information stored in 2 attributes of type TextProperty in my datastore entities.
The strings are always the same length of 65,000 characters and have lots of repeating integers, a sample appearing as follows:
entity.pixel_idx = 0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,5,5,5,5,5,5,5,5,5,5,5,5....etc.
entity.pixel_color = 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,...etc.
So the above could be represented using much less storage by compressing, say, each run into the integer and the length of its series ('0,8' for '0,0,0,0,0,0,0,0'), but then it takes time and CPU to compress and decompress?
Any general ideas?
Are there some tricks for testing different attempts to the problem?
If all of your integers are single-digit numbers (as in your example), then you can reduce your storage space in half by simply omitting the commas.
The Short Answer
If you expect to have a lot of repetition, then compressing your data makes sense - your data is not so small (65K) and is highly repetitive => it will compress well. This will save you storage space and will reduce how long it takes to transfer the data back from the datastore when you query for it.
The Long Answer
I did a little testing starting with the short example string you provided and that same string repeated to 65000 characters (perhaps more repetitive than your actual data). This string compressed from 65K to a few hundred bytes; you may want to do some additional testing based on how well your data actually compresses.
Anyway, the test shows a significant savings when using compressed data versus uncompressed data (for just the above test where compression works really well!). In particular, for compressed data:
API time takes 10x less for a single entity (41ms versus 387ms on average)
Storage used is significantly less (so it doesn't look like GAE is doing any compression on your data).
Unexpectedly, CPU time is about 50ms (roughly 30%) lower (130ms versus 180ms when fetching 100 entities). I expected CPU time to be a little worse, since the compressed data has to be uncompressed. There must be some other CPU work (like decoding the protocol buffer) which costs even more for the much larger uncompressed data.
These differences mean wall clock time is also significantly faster for the compressed version (<100ms versus 426ms when fetching 100 entities).
To make it easier to take advantage of compression, I wrote a custom CompressedDataProperty which handles all of the compressing/decompressing business so you don't have to worry about it (I used it in the above tests too). You can get the source from the above link, but I've also included it here since I wrote it for this answer:
from google.appengine.ext import db
import zlib

class CompressedDataProperty(db.Property):
  """A property for storing compressed data or text.

  Example usage:

  >>> class CompressedDataModel(db.Model):
  ...  ct = CompressedDataProperty()

  You create a compressed data property, simply specifying the data or text:

  >>> model = CompressedDataModel(ct='example uses text too short to compress well')
  >>> model.ct
  'example uses text too short to compress well'
  >>> model.ct = 'green'
  >>> model.ct
  'green'
  >>> model.put() # doctest: +ELLIPSIS
  datastore_types.Key.from_path(u'CompressedDataModel', ...)

  >>> model2 = CompressedDataModel.all().get()
  >>> model2.ct
  'green'

  Compressed data is not indexed and therefore cannot be filtered on:

  >>> CompressedDataModel.gql("WHERE ct = :1", 'green').count()
  0
  """
  data_type = db.Blob

  def __init__(self, level=6, *args, **kwargs):
    """Constructor.

    Args:
      level: Controls the level of zlib's compression (between 1 and 9).
    """
    super(CompressedDataProperty, self).__init__(*args, **kwargs)
    self.level = level

  def get_value_for_datastore(self, model_instance):
    value = self.__get__(model_instance, model_instance.__class__)
    if value is not None:
      return db.Blob(zlib.compress(value, self.level))

  def make_value_from_datastore(self, value):
    if value is not None:
      return zlib.decompress(value)
I think this should be pretty easy to test. Just create 2 handlers, one that compresses the data and one that doesn't, and record how much CPU each one uses (using the Appstats package for whichever language you are developing in). You should also create 2 entity types, one for the compressed data and one for the uncompressed.
Load in a few hundred thousand or a million entities (using the task queue, perhaps). Then you can check the disk space usage in the administrator's console and see how much each entity type uses. If the data is compressed internally by App Engine, you shouldn't see much difference in the space used (unless their compression is significantly better than yours); if it is not compressed, there should be a stark difference.
Of course, you may want to hold off on this type of testing until you know for sure that these entities will account for a significant portion of your quota usage and/or your page load time.
Alternatively, you could wait for Nick or Alex to pop in and they could probably tell you whether the data is compressed in the datastore or not.
