I want to store byte arrays (less than 1 MB) as a field value. I know about ByteArrayDocument and storing binary data as an independent non-JSON object.
To store a field as a byte array, do I just use com.couchbase.client.core.utils.Base64 to build a string value?
Or is some other approach recommended?
If you want to store it as an attribute in your JSON document, base64 would be the right approach.
However, unless your document contains only metadata about the file itself, I don't recommend using this strategy. Documents are automatically cached, and if your document is big, the cache memory will be filled quite easily.
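If you do embed the bytes as a field, the round trip is just a base64 encode on write and a decode on read. Here is a minimal sketch, shown in Python since that's the language used elsewhere in this thread (the Java Base64 helper mentioned in the question plays the same role):

    import base64
    import json

    raw = b'\x00\x01\x02' * 1000           # hypothetical byte payload, well under 1 MB

    # Encode the bytes as a base64 string so they can live inside a JSON field.
    doc = {'name': 'blob.bin', 'data': base64.b64encode(raw).decode('ascii')}
    body = json.dumps(doc)                  # this is what you would store as the document

    # Later, decode the field back into the original bytes.
    restored = base64.b64decode(json.loads(body)['data'])
    assert restored == raw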
The _ids that are generated by MongoDB are always in this form: ObjectId("5f1b0e51b931af765f21edd4")
If the main reason for creating the _id field is to have something that uniquely identifies a document, why isn't the generated _id simply in this form: "5f1b0e51b931af765f21edd4"?
I don't know if I'm right, but I also suspect that the first format occupies more space.
The _ids that are generated by MongoDB are always in this form: ObjectId("5f1b0e51b931af765f21edd4")
Not at all. Ids generated by MongoDB are 12-byte values. The mongo shell uses the rendering ObjectId("xxx") to indicate that the value is stored as a 12-byte ObjectId and not as the 24-character string "5f1b0e51b931af765f21edd4".
I don't know if I'm right, but also I suspect that the first format occupies more space.
As stored by the server, an ObjectId occupies less space than the hex string you see on your screen (half as much, in fact). The ObjectId(...) rendering takes up more room on screen precisely to convey that the stored value is this compact type.
ObjectId is a special type in MongoDB. It is not like a normal object/document and only takes up 12 bytes. ObjectId("24-character-hex-string") is just its human-readable notation.
A 24-character string takes up at least 24 bytes, and if we look at the BSON spec, it stores an additional 4 bytes for the length and 1 byte for a null terminator, so 29 bytes total.
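You can verify both sizes yourself with PyMongo's bson package (assuming it is installed):

    from bson import ObjectId

    oid = ObjectId("5f1b0e51b931af765f21edd4")
    print(len(oid.binary))  # 12 -- the raw bytes the server actually stores
    print(len(str(oid)))    # 24 -- the hex characters you see in the shell rendering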
I'm trying to understand the internals of Redis. It uses a simple implementation of a dictionary as data-storage in memory. Moreover data transferred from the client to the server is serialized by its own RESP protocol.
What I haven't figured out so far is how the data is stored in Redis. Does it store the corresponding RESP value as a simple dynamic string (sds), or does it first parse the value from RESP, e.g. as an integer, and store it as an int (possibly from the shared integers array)? I'm curious because in dict.c, e.g. in int dictAdd(dict *d, void *key, void *val){...}, data is passed as void *, which could indicate that data is stored as a string, an int, or anything else, but tracing it down I didn't find any piece of code converting an sds into objects.
But if it stores the data as sds, how does it store lists and sets?
Each data type in Redis has its own encoding, and most of them have several encodings for different scenarios. Even sds strings (and yes, string keys are usually sds strings) can have multiple encodings.
Sets, sorted sets, lists and hashes use a compact "ziplist" encoding in memory when they are small, but move to a faster, more memory-hungry encoding when they grow.
The most complex object is the sorted set, which is a combination of a skiplist and a hash table. And the new streams object also has a very interesting representation.
In RDB though, they get serialized into a compact representation and not kept as they are in memory.
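You can watch the encoding switch as a structure grows, for example with redis-py against a local server (exact names vary by Redis version; newer servers report listpack where older ones said ziplist):

    import redis

    r = redis.Redis()                      # assumes a Redis server on localhost:6379

    r.delete('h')
    r.hset('h', 'field', 'value')
    print(r.object('encoding', 'h'))       # small hash: ziplist / listpack

    for i in range(200):                   # grow past hash-max-ziplist-entries
        r.hset('h', 'f%d' % i, i)
    print(r.object('encoding', 'h'))       # hashtable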
I'm writing a Dart library in which I'm very regularly dealing with byte arrays or byte strings. Since Dart doesn't have a byte type nor an array type, I'm using List for all byte arrays.
Is this good practice? I only recently found out about the existence of Uint8List in the dart:typed_data package. It's clear that this class aims to be the go-to implementation for byte arrays.
But does it have any direct advantages?
I can imagine that it always performs checks on new items, so that the user can be sure no non-byte-value integers end up in the list. But are there other advantages or differences?
There is also a class named ByteArray, but it seems to be a quite inefficient alternative to List...
The advantage should be that Uint8List consumes less memory than a normal List, because it is known from the beginning that each element's size is a single byte.
Uint8List can also be mapped directly to an underlying optimized typed-array type (e.g. Uint8Array in JavaScript).
Copies of list slices are also easier to perform, because all bytes are laid out contiguously in memory, so a slice can be copied in a single operation to another Uint8List (or equivalent) type.
Whether this advantage is fully exploited, however, depends on how good the implementation of Uint8List in Dart is.
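The density point is easy to see in any language. As a rough analogy in Python (not Dart, but the layout trade-off is the same), a generic list stores one object reference per element while a typed byte buffer stores one byte per element:

    import sys

    values = [i % 256 for i in range(1000)]
    packed = bytearray(values)             # typed storage: one byte per element

    print(sys.getsizeof(values))           # several KB of pointers (the ints live elsewhere)
    print(sys.getsizeof(packed))           # roughly 1000 bytes plus a small header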
John McCutchan of the Dart team explains that the Dart VM relies on 3 different integer representations: much like the Three Musketeers, there is the small machine integer (smi), the medium integer (mint) and the big heavy integer (bint). The VM takes care of switching automatically between the three depending on the size of the integer in play.
Within the smi range, which depends on the CPU architecture, integers fit in a register and can therefore be loaded and stored directly in a field instead of being fetched from memory. They also never require memory allocation. Which leads to the performance side of the story: within the smi range, storing integers in ordinary object lists is faster than putting them in a typed list.
Typed lists have to tag and untag, which are the VM operations that box and unbox smi values without allocating memory or loading the value from an object. The leaner, the better.
On the other hand, typed lists have two big advantages to consider. Garbage collection pressure is very low, as typed lists can never store object references, only numbers. Typed lists can also be much more dense, so an Int8List requires much less memory and makes better use of the CPU's cache. The smi range principle applies in typed lists too, so playing with numbers within that range provides the best performance.
All in all, what remains of this is that we need to benchmark each approach to find out which works best in a given situation.
I want to load data with 4 columns and 80 million rows from MySQL into Redis, so that I can reduce fetching delay.
However, when I try to load all the data, it becomes 5 times larger.
The original data was 3 GB (when exported to CSV format), but when I load it into Redis, it takes 15 GB... that's too large for our system.
I also tried different datatypes -
1) 'table_name:row_number:column_name' -> string
2) 'table_name:row_number' -> hash
but all of them take too much memory.
Am I missing something?
Added:
My data has 4 columns: user id (PK), count, created time, and a date.
The most memory-efficient way is to store values as a JSON array and split your keys so that you can store them using a ziplist-encoded hash (a short sketch follows these steps).
Encode your data as, say, a JSON array, so you have key=value pairs like user:1234567 -> [21,'25-05-2012','14-06-2010'].
Split your keys into two parts, such that the second part has about 100 possibilities. For example, user:12345 and 67.
Store this combined key in a hash like this: hset user:12345 67 <json>
To retrieve user details for user id 9876523, simply do hget user:98765 23 and parse the JSON array.
Make sure to adjust the settings hash-max-ziplist-entries and hash-max-ziplist-value.
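A minimal redis-py sketch of these steps (the split into the last two digits and the config values are just the example numbers from above):

    import json
    import redis

    r = redis.Redis()                                  # assumes a local Redis instance

    # Keep buckets small enough to stay in the compact hash encoding.
    r.config_set('hash-max-ziplist-entries', 128)
    r.config_set('hash-max-ziplist-value', 64)

    def split(user_id):
        s = str(user_id)                               # assumes ids with 3+ digits
        return 'user:%s' % s[:-2], s[-2:]              # e.g. user:12345 and 67

    def save_user(user_id, count, created, a_date):
        key, field = split(user_id)
        r.hset(key, field, json.dumps([count, created, a_date]))

    def load_user(user_id):
        key, field = split(user_id)
        raw = r.hget(key, field)
        return json.loads(raw) if raw is not None else None

    save_user(1234567, 21, '25-05-2012', '14-06-2010')
    print(load_user(1234567))                          # [21, '25-05-2012', '14-06-2010']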
Instagram wrote a great blog post explaining this technique, so I will skip explaining why this is memory efficient.
Instead, I can tell you the disadvantages of this technique.
You cannot access or update a single attribute on a user; you have to rewrite the entire record.
You'd always have to fetch the entire JSON object, even if you only care about some fields.
Finally, you have to write this logic on splitting keys, which is added maintenance.
As always, this is a trade-off. Identify your access patterns and see if such a structure makes sense. If not, you'd have to buy more memory.
One more idea that may free some memory in this case: key zipping based on a crumbs dictionary, plus base62 encoding for storing integers.
This shrinks user:12345 60 to 'u:3d7' 'Y', which takes half the memory to store the key.
You can also apply custom compression to the data, packing it not into an array but into one long integer (it's possible to convert [21,'25-05-2012','14-06-2010'] into such an int: 212505201214062010; the last two parts have a fixed length, so it's straightforward to pack and unpack such a value).
Altogether, the whole set of keys and values ends up about 1.75 times smaller.
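A rough sketch of that packing idea in Python (the 8-digit date width is an assumption taken from the example value):

    def pack(count, created, a_date):
        # 21, '25-05-2012', '14-06-2010' -> 212505201214062010
        return int('%d%s%s' % (count, created.replace('-', ''), a_date.replace('-', '')))

    def unpack(packed):
        s = str(packed)
        count, created, a_date = s[:-16], s[-16:-8], s[-8:]
        as_date = lambda d: '%s-%s-%s' % (d[:2], d[2:4], d[4:])
        return [int(count), as_date(created), as_date(a_date)]

    print(pack(21, '25-05-2012', '14-06-2010'))        # 212505201214062010
    print(unpack(212505201214062010))                  # [21, '25-05-2012', '14-06-2010']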
If your codebase is Ruby-based, I can suggest the me-redis gem, which seamlessly implements all the ideas from Sripathi's answer plus the ones given here.
My model has different entities that I'd like to calculate once, like the employees of a company. To avoid making the same query again and again, the calculated list is saved in Memcache (duration=1 day). The problem is that the app sometimes gives me an error that more bytes are being stored in Memcache than is permissible:
Values may not be more than 1000000 bytes in length; received 1071339 bytes
Is storing a list of objects something that you should be doing with Memcache? If so, what are best practices in avoiding the error above? I'm currently pulling 1000 objects. Do you limit values to < 200? Checking for an object's size in memory doesn't seem like too good an idea because they're probably being processed (serialized or something like that) before going into Memcache.
David, you don't say which language you use, but in Python you can do the same thing as Ibrahim suggests using pickle. All you need to do is write two little helper functions that read and write a large object to memcache. Here's an (untested) sketch:
import pickle
from google.appengine.api import memcache

def store(key, value, chunksize=950000):
    # Pickle the value and split it into chunks that stay under the 1 MB limit.
    serialized = pickle.dumps(value, 2)
    values = {}
    for i in xrange(0, len(serialized), chunksize):
        values['%s.%s' % (key, i // chunksize)] = serialized[i:i + chunksize]
    return memcache.set_multi(values)

def retrieve(key):
    # Fetch up to 32 chunks in one call and reassemble them in numeric chunk order.
    result = memcache.get_multi(['%s.%s' % (key, i) for i in xrange(32)])
    serialized = ''.join(result.get('%s.%s' % (key, i), '') for i in xrange(32))
    return pickle.loads(serialized)
I frequently store objects several megabytes in size in memcache. I cannot comment on whether this is good practice or not, but my opinion is that sometimes we simply need a relatively fast way to transfer megabytes of data between our App Engine instances.
Since I am using Java, I serialized my raw objects using Java's serializer, producing a serialized array of bytes. Since the size of the serialized object is then known, I could cut it into chunks of 800 KB byte arrays. I then encapsulate each byte array in a container object and store that object instead of the raw object.
Each container object has a pointer to the next memcache key from which the next byte-array chunk can be fetched, or null if there are no more chunks to fetch from memcache (i.e. just like a linked list). I then re-merge the chunks of byte arrays into one large byte array and deserialize it using Java's deserializer.
Do you always need to access all the data which you store? If not then you will benefit from partitioning the dataset and accessing only the part of data you need.
If you display a list of 1000 employees you probably are going to paginate it. If you paginate then you definitely can partition.
You can make two lists of your dataset: a lighter one with just the most essential information, which can fit into 1 MB, and another list divided into several parts with the full information. On the light list you can apply the most essential operations, for example filtering by employee name or pagination. Then, when the heavy dataset is needed, you can load only the parts you really require.
Admittedly, these suggestions take time to implement. If you can live with your current design, then just divide your list into lumps of ~300 items, or whatever number is safe, load them all, and merge.
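A rough sketch of that lump approach with the App Engine memcache API (the key names and lump size are made up for illustration):

    from google.appengine.api import memcache

    LUMP = 300                                         # items per memcache entry

    def store_list(key, items):
        # Store the list as several smaller entries plus a count of how many there are.
        lumps = dict(('%s.%d' % (key, i // LUMP), items[i:i + LUMP])
                     for i in xrange(0, len(items), LUMP))
        memcache.set_multi(lumps)
        memcache.set('%s.count' % key, len(lumps))

    def load_list(key):
        n = memcache.get('%s.count' % key) or 0
        lumps = memcache.get_multi(['%s.%d' % (key, i) for i in xrange(n)])
        merged = []
        for i in xrange(n):
            merged.extend(lumps.get('%s.%d' % (key, i), []))
        return merged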
If you know how large the objects will be, you can use the memcached option that allows larger objects:
memcached -I 10m
This will allow objects up to 10MB.