Avoiding Memcache "1000000 bytes in length" limit on values - google-app-engine

My model has different entities that I'd like to calculate once, like the employees of a company. To avoid making the same query again and again, the calculated list is saved in Memcache (duration=1 day). The problem is that the app sometimes gives me an error that more bytes are being stored in Memcache than is permissible:
Values may not be more than 1000000 bytes in length; received 1071339 bytes
Is storing a list of objects something you should be doing with Memcache? If so, what are the best practices for avoiding the error above? I'm currently pulling 1000 objects. Do you limit values to fewer than 200? Checking an object's size in memory doesn't seem like a good approach, because the objects are presumably processed (serialized or something like that) before going into Memcache.

David, you don't say which language you use, but in Python you can do the same thing Ibrahim suggests by using pickle. All you need to do is write two little helper functions that read and write a large object to memcache. Here's an (untested) sketch:
import pickle
from google.appengine.api import memcache

def store(key, value, chunksize=950000):
    # Split the pickled value into chunks that each fit under the 1 MB limit.
    serialized = pickle.dumps(value, 2)
    values = {}
    for i in xrange(0, len(serialized), chunksize):
        values['%s.%s' % (key, i // chunksize)] = serialized[i:i + chunksize]
    return memcache.set_multi(values)

def retrieve(key, max_chunks=32):
    # Reassemble the chunks in numeric order (a string sort would put '10' before '2').
    result = memcache.get_multi(['%s.%s' % (key, i) for i in xrange(max_chunks)])
    chunks = [result.get('%s.%s' % (key, i)) for i in xrange(max_chunks)]
    if chunks[0] is None:
        return None  # cache miss, or the first chunk was evicted
    serialized = ''.join(c for c in chunks if c is not None)
    return pickle.loads(serialized)
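A hypothetical usage for the employee list from the question (the Employee model, query, and cache key are made up; to get the one-day duration you could also pass time=86400 to memcache.set_multi inside store):

# Hypothetical usage: cache the expensive employee query across requests.
employees = retrieve('company:acme:employees')
if employees is None:
    employees = Employee.all().fetch(1000)      # the expensive datastore query
    store('company:acme:employees', employees)  # written as <1 MB chunks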

I frequently store objects several megabytes in size in memcache. I cannot comment on whether this is a good practice or not, but my opinion is that sometimes we simply need a relatively fast way to transfer megabytes of data between our App Engine instances.
Since I am using Java, I serialize my raw objects with Java's serializer, producing a byte array. Since the size of the serialized object is then known, I can cut it into chunks of 800 KB byte arrays. I then encapsulate each byte array in a container object and store that object instead of the raw object.
Each container object has a pointer to the memcache key where the next byte-array chunk can be fetched, or null if there are no more chunks to fetch (i.e., just like a linked list). To read the object back, I re-merge the chunks into one large byte array and deserialize it with Java's deserializer.
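The same linked-chunk idea, sketched in Python against the App Engine memcache API rather than Java (the chunk size, key scheme, and container layout are illustrative, not taken from the answer above):

import pickle
from google.appengine.api import memcache

CHUNK = 800 * 1024  # 800 KB chunks, comfortably under the 1 MB value limit

def store_linked(key, obj):
    data = pickle.dumps(obj, 2)
    offsets = range(0, len(data), CHUNK)
    chunk_keys = ['%s#%d' % (key, n) for n in xrange(len(offsets))]
    for n, off in enumerate(offsets):
        # Each container holds one chunk plus the key of the next chunk (or None),
        # exactly like a singly linked list.
        next_key = chunk_keys[n + 1] if n + 1 < len(chunk_keys) else None
        memcache.set(chunk_keys[n], {'data': data[off:off + CHUNK], 'next': next_key})
    return chunk_keys[0]

def retrieve_linked(first_key):
    parts, key = [], first_key
    while key is not None:
        container = memcache.get(key)
        if container is None:
            return None  # a chunk was evicted: treat the whole value as a miss
        parts.append(container['data'])
        key = container['next']
    return pickle.loads(''.join(parts))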

Do you always need to access all the data you store? If not, you will benefit from partitioning the dataset and accessing only the part you need.
If you display a list of 1000 employees, you are probably going to paginate it. And if you paginate, you can definitely partition.
You can make two versions of your dataset: a light one with just the most essential information, which fits into 1 MB, and a full one that is divided into several parts. On the light list you can apply the most common operations, for example filtering by employee name or pagination. Then, when the heavy dataset is needed, you can load only the parts you really need.
These suggestions take time to implement, though. If you can live with your current design, just divide your list into lumps of ~300 items (or whatever number is safe), cache each lump separately, and load and merge them all when needed (see the sketch below).
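A minimal sketch of that lump-and-merge fallback, assuming App Engine's Python memcache API and a lump size of 300 (both the lump size and the key names are illustrative):

from google.appengine.api import memcache

LUMP = 300  # items per cached lump; pick whatever count stays well under 1 MB

def cache_employee_lumps(employees):
    lumps = {}
    for i in xrange(0, len(employees), LUMP):
        lumps['employees.%d' % (i // LUMP)] = employees[i:i + LUMP]
    memcache.set('employees.lumps', len(lumps), time=86400)
    memcache.set_multi(lumps, time=86400)

def load_employee_lumps():
    count = memcache.get('employees.lumps')
    if count is None:
        return None
    lumps = memcache.get_multi(['employees.%d' % i for i in xrange(count)])
    if len(lumps) != count:
        return None  # at least one lump was evicted; fall back to the datastore
    return [e for i in xrange(count) for e in lumps['employees.%d' % i]]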

If you know how large the objects will be, you can start memcached with the option that allows larger objects:
memcached -I 10m
This allows objects up to 10 MB. (Note that this applies to a self-hosted memcached server; App Engine's managed memcache service does not expose this option, so its 1 MB value limit still stands there.)

Related

How do I store byte arrays inside an object in Couchbase?

I want to store byte arrays (less than 1 MB) as a field value. I know about ByteArrayDocument and storing binary data as an independent non-JSON object.
To store a field as a byte array, do I just use com.couchbase.client.core.utils.Base64 to build a string value?
Or is some other approach recommended?
If you want to store it as an attribute in your JSON document, base64 would be the right approach.
However, unless your document contains only metadata about the file itself, I don't recommend using this strategy. Documents are automatically cached, and if your document is big, the cache memory will be filled quite easily.
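Independent of the Couchbase SDK, the base64-as-a-field approach looks like this generic Python sketch (the field names and payload are made up; note that base64 adds roughly 33% size overhead):

import base64
import json

raw = open('avatar.png', 'rb').read()   # any binary payload under ~1 MB

doc = {
    'type': 'user_avatar',
    'content_type': 'image/png',
    # base64 turns raw bytes into a JSON-safe ASCII string
    'data': base64.b64encode(raw).decode('ascii'),
}
serialized = json.dumps(doc)            # what would be stored as the document body

# Reading it back: decode the field to recover the original bytes.
restored = base64.b64decode(json.loads(serialized)['data'])
assert restored == raw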

How many items need to be stored before a Map becomes more efficient than an array

I feel this question has probably been answered somewhere, but I haven't been able to find it. I'm working on some legacy code (first job) and I've come across a point where a map is being used to store exactly 23 variables (and from my understanding of the code, this number never changes). Fundamentally I understand that access time for a map is O(1), but I've always heard there is some overhead when dealing with maps.
So my question is: at what point does it become more efficient to use a map rather than just an array that stores these 23 values? I guess essentially I would be creating my own map in the code, where whenever I need to store xValue, I just access the array position where I know xValue is being stored.
Is the overhead for a map so small that I'm overthinking this? I figure the readability of the code might be a bit better with the map, since the key being used is essentially a description of what the xValue is.
Using a map (or dictionary) is unlikely to be faster than direct array access. That is, if you know that you have exactly N items and you know what those items are, you can create an array and constants that say where in the array each item is. For example (pseudocode):
const NumItems = 23
const ItemType1 = 0
const ItemType2 = 1
const ItemType3 = 2
...
const ItemType23 = 22
Array myItems[23]
Then, if you want to access the value for item type 15, it's simply myItems[ItemType15].
The drawback to this approach is that you must have storage space for one of every possible item. So if you have 20,000 items instead of just 23, you would have to allocate your array to hold all 20,000 possible items. And create constants to define where each item type would be stored.
A map lets you save space overall, but at a higher per-item cost. The per-item overhead can be two or three times the per-item cost of an array. But if you have a huge universe of possible items and you're only going to hold a few dozen at a time, then the map is more memory efficient. Say you're going to have 30 items out of 20,000 possible: a full array would cost you storage for 20,000 items, while the map would cost you storage for perhaps 90 items (three times the cost of a 30-item array).
The point here is that a map trades speed for storage space. The map can store a subset of the entire universe of items more efficiently than an array can, but it takes more time to access an individual item.
So the issue really isn't "which is faster," but rather if the space savings afforded by a map is worth the extra processing time. In your case, where you have exactly one each of a small number of items, there is no advantage to using a map.
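The same trade-off, illustrated in Python (the names and values are invented; a real program would use whatever 23 constants the legacy code defines):

# Fixed-slot array: each item type gets a known index, so access is a plain
# offset with no hashing and no per-entry overhead.
ITEM_SALARY, ITEM_BONUS, ITEM_OVERTIME = 0, 1, 2   # ... and so on, up to 23
NUM_ITEMS = 23

values = [0.0] * NUM_ITEMS
values[ITEM_BONUS] = 1500.0
bonus = values[ITEM_BONUS]

# Map/dictionary: every access hashes the key and each entry carries bucket
# overhead, but only the keys actually present consume space.
values_by_name = {'bonus': 1500.0}
bonus = values_by_name['bonus']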

Data Structure to do lookup on large number

I have a requirement to do a lookup based on a large number. The number could fall anywhere in the range 1 - 2^32. Based on the input, I need to return some other data structure. My question is: what data structure should I use to hold this efficiently?
I would have used an array, giving me O(1) lookup, if the numbers were in a range of, say, 1 to 5000. But when the input numbers get this large, an array becomes unrealistic, as the memory requirements would be huge.
I am therefore looking for a data structure that yields results fast and is not very heavy on memory.
Any clues anybody?
EDIT:
It would not make sense to use an array, since I may have only 100 or 200 indices to store.
Abhishek
unordered_map or map, depending on what version of C++ you are using.
http://www.cplusplus.com/reference/unordered_map/unordered_map/
http://www.cplusplus.com/reference/map/map/
A simple solution in C, given you've stated at most 200 elements, is just an array of structs with an index and a data pointer (or two parallel arrays, one of indices and one of data pointers, where index[i] corresponds to data[i]). Linearly search the array for the index you want. With a small number of elements (200), that will be very fast.
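The same parallel-array idea, sketched in Python purely for illustration (in C these would be a static array of structs, or two plain arrays, scanned with a for loop):

# keys[i] corresponds to data[i]; with at most ~200 entries a linear scan is
# effectively instant and carries no per-entry overhead beyond the two lists.
keys = [17, 4096, 3000000000, 123456789]      # sparse values anywhere in 1 .. 2**32
data = ['alpha', 'beta', 'gamma', 'delta']    # whatever structure you need to return

def lookup(wanted):
    for i, k in enumerate(keys):
        if k == wanted:
            return data[i]
    return None  # not found

print(lookup(3000000000))   # -> 'gamma'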
One possibility is a Judy Array, which is a sparse associative array. There is a C Implementation available. I don't have any direct experience of these, although they look interesting and could be worth experimenting with if you have the time.
Another (probably more orthodox) choice is a hash table. Hash tables are data structures which map keys to values, and provide fast lookup and insertion times (provided a good hash function is chosen). One thing they do not provide, however, is ordered traversal.
There are many C implementations. A quick Google search turned up uthash which appears to be suitable, particularly because it allows you to use any value type as the key (many implementations assume a string as the key). In your case you want to use an integer as the key.

redis memory efficiency

I want to load data with 4 columns and 80 million rows from MySQL into Redis, so that I can reduce the fetching delay.
However, when I try to load all the data, it becomes 5 times larger.
The original data was 3 GB (when exported to CSV format), but when I load it into Redis, it takes 15 GB... that's too large for our system.
I also tried different data types:
1) 'table_name:row_number:column_name' -> string
2) 'table_name:row_number' -> hash
but all of them take too much memory.
Am I missing something?
Added:
My data has 4 columns: user id (primary key), count, created time, and a date.
The most memory-efficient way is to store the values as a JSON array, and to split your keys so that they can be stored in a ziplist-encoded hash.
Encode your data as, say, a JSON array, so you have key=value pairs like user:1234567 -> [21,'25-05-2012','14-06-2010'].
Split your keys into two parts, such that the second part has about 100 possibilities. For example, user:12345 and 67.
Store this combined key in a hash, like this: hset user:12345 67 <json>
To retrieve the user details for user id 9876523, simply do hget user:98765 23 and parse the JSON array.
Make sure to adjust the settings hash-max-ziplist-entries and hash-max-ziplist-value so that these hashes stay ziplist-encoded (see the sketch below).
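A minimal sketch of those steps using redis-py (the bucket size of 100, key names, and field layout are illustrative):

import json
import redis

r = redis.Redis()
BUCKET = 100  # the second key part has ~100 possibilities, keeping each hash small

def save_user(user_id, count, created, date):
    bucket, field = divmod(user_id, BUCKET)            # e.g. 1234567 -> 12345, 67
    r.hset('user:%d' % bucket, field, json.dumps([count, created, date]))

def load_user(user_id):
    bucket, field = divmod(user_id, BUCKET)
    raw = r.hget('user:%d' % bucket, field)
    return json.loads(raw) if raw is not None else None

# load_user(9876523) issues HGET user:98765 23 and parses the JSON array.

For this to pay off, hash-max-ziplist-entries must be at least the bucket size (100 here) and hash-max-ziplist-value larger than the encoded JSON strings, so every bucket keeps the compact encoding.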
Instagram wrote a great blog post explaining this technique, so I will skip explaining why this is memory efficient.
Instead, I can tell you the disadvantages of this technique.
You cannot access or update a single attribute on a user; you have to rewrite the entire record.
You always have to fetch the entire JSON object, even if you only care about some of the fields.
Finally, you have to write this logic on splitting keys, which is added maintenance.
As always, this is a trade-off. Identify your access patterns and see if such a structure makes sense. If not, you'd have to buy more memory.
One more idea that may free some memory in this case: key zipping based on a crumbs dictionary, plus base62 encoding for storing integers.
That shrinks user:12345 60 down to 'u:3d7' 'Y', which takes half the memory for storing the key.
And with custom compression of the data (packing it not into an array but into one long integer), it's possible to convert [21,'25-05-2012','14-06-2010'] into an int like 212505201214062010; the last two parts have a fixed length, so it's straightforward to pack and unpack such a value.
So the whole set of keys/values ends up about 1.75 times smaller (a small base62 helper is sketched below).
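A minimal base62 helper as an illustration of that key-shrinking idea (the digit alphabet is a convention you would have to fix once and use everywhere):

import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 symbols

def to_base62(n):
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return ''.join(reversed(out))

def from_base62(s):
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n

print(to_base62(12345))   # '3d7' with this alphabet
print(to_base62(60))      # 'Y'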
If your codebase is Ruby-based, I can suggest the me-redis gem, which seamlessly implements all the ideas from Sripathi's answer plus the ones given here.

Why use arrays in VBA when there are collections?

Many people make extensive use of arrays in Excel/VBA to store a list of data. However, there is the Collection object, which in my view is MUCH MUCH more convenient (mainly because you don't need to define or redefine the length of the list).
So I am sincerely asking myself if I am missing something. Why do other people still use arrays to store a list of data? Is it simply a holdover from the past?
Several reasons to use arrays instead of collections (or dictionaries):
you can easily transfer an array to a range (and vice versa) with Range("A1:B12") = MyArray
collections can only store unique keys, whereas arrays can store any values
collections have to store (key, value) pairs, whereas you can store whatever you like in an array
See Chip Pearson's article about arrays for a better understanding.
A better question would be why people use collections rather than dictionaries (granted, collections are standard VBA, whereas you have to add a reference, Microsoft Scripting Runtime, to use dictionaries).
@CharlesWilliams' answer is correct: looping through all the values of an array is faster than iterating a Collection or a Dictionary; so much so that I always use the Keys() or Items() method of a Dictionary when I need to do that, since both methods return a vector array.
A note: I use the Dictionary class far more than I use collections; the Exists() method is just too useful.
There are, of course, drawbacks to collections and dictionaries. One of them is that arrays can be 2- or even 3-dimensional, a much better data structure for tabulated data. You can store arrays as members of a collection, but there are some downsides to that: one of them is that you might not be getting a reference to the item; unless you use arrItem = MyDictionary(strKey), you will almost certainly get a 'ByVal' copy of the array. That's bad if your data is dynamic and subject to change by multiple processes. It's also slow: lots of allocation and deallocation.
Worst of all, I don't quite trust VBA to deallocate the memory if I have a collection or dictionary with arrays (or objects!) as members: not on going out of scope, not with Set objCollection = Nothing, not even with objDictionary.RemoveAll. It's difficult to prove the problem exists with the limited testing toolkit available in the VBE, but I've seen enough memory leaks in applications that used arrays in dictionaries to know that you need to be cautious. That being said, I never use an array without an Erase command somewhere.
@JMax has explained the other big plus for arrays: you can populate an array in a single 'hit' to the worksheet, and write your work back in a single 'hit'.
You can, of course, get the best of both worlds by constructing an Indexed Array class: a 2-dimensional array with associated collection or dictionary objects storing some kind of row identifier as the keys, and the row ordinals as the data items.
Collections that auto-resize are slower (theoretically speaking; different implementations will obviously have their own mileage). If you know you have a set number of entries and you only need to access them in a linear fashion, then a traditional array is the correct approach.
