redis memory efficiency - database

I want to load a table with 4 columns and 80 million rows from MySQL into Redis, so that I can reduce fetching delay.
However, when I try to load all the data, it becomes 5 times larger.
The original data was 3 GB (when exported to CSV format), but when I load it into Redis, it takes 15 GB... it's too large for our system.
I also tried different datatypes -
1) 'table_name:row_number:column_name' -> string
2) 'table_name:row_number' -> hash
but all of them take too much memory.
Am I missing something?
Added:
My data has 4 columns - user id (PK), count, created time, and a date.

The most memory-efficient way is to store the values as a JSON array, and to split your keys so that they can be stored in a ziplist-encoded hash.
Encode your data as a JSON array, so you have key=value pairs like user:1234567 -> [21,'25-05-2012','14-06-2010'].
Then split each key into two parts, such that the second part has about 100 possibilities. For example, user:12345 and 67.
Store this combined key in a hash like this: hset user:12345 67 <json>.
To retrieve the user details for user id 9876523, simply do hget user:98765 23 and parse the JSON array.
Make sure to adjust the settings hash-max-ziplist-entries and hash-max-ziplist-value so that hashes of this size still get the ziplist encoding.
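For illustration, here is roughly how the read/write side might look in Python with the redis-py client (the client choice, function names, and the bucket size of 100 are assumptions, not part of the question):
import json
import redis

r = redis.Redis()  # assumes a local Redis instance and the redis-py client

def save_user(user_id, count, created, a_date):
    # everything except the last two digits becomes the hash key,
    # the last two digits become the field, so each hash holds ~100 users
    key, field = 'user:%d' % (user_id // 100), str(user_id % 100)
    r.hset(key, field, json.dumps([count, created, a_date]))

def load_user(user_id):
    key, field = 'user:%d' % (user_id // 100), str(user_id % 100)
    raw = r.hget(key, field)
    return json.loads(raw) if raw is not None else None

save_user(1234567, 21, '25-05-2012', '14-06-2010')
print(load_user(1234567))   # [21, '25-05-2012', '14-06-2010']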
Instagram wrote a great blog post explaining this technique, so I will skip explaining why this is memory efficient.
Instead, I can tell you the disadvantages of this technique.
You cannot access or update a single attribute on a user; you have to rewrite the entire record.
You always have to fetch the entire JSON object, even if you only care about a few fields.
Finally, you have to write the logic for splitting keys, which is added maintenance.
As always, this is a trade-off. Identify your access patterns and see if such a structure makes sense. If not, you'd have to buy more memory.

One more idea that may free some memory in this case: key zipping based on a crumbs dictionary, plus base62 encoding for stored integers.
It shrinks user:12345 60 to 'u:3d7' 'Y', which takes half the memory for storing the key.
And with custom compression of the data - not into an array but into one long int (it's possible to convert [21,'25-05-2012','14-06-2010'] into such an int: 212505201214062010; the last two parts have fixed length, so it is straightforward to pack/unpack such a value).
So the whole bunch of keys/values now takes about 1.75 times less memory.
If your codebase is Ruby-based, I can suggest the me-redis gem, which seamlessly implements all the ideas from Sripathi's answer plus the ones given here.
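For illustration, a rough Python sketch of both ideas (the base62 alphabet order and the exact packing layout are assumptions of mine; the gem may do it differently):
ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

def base62(n):
    # encode a non-negative integer with digits 0-9a-zA-Z, e.g. 12345 -> '3d7', 60 -> 'Y'
    if n == 0:
        return '0'
    out = ''
    while n:
        n, rem = divmod(n, 62)
        out = ALPHABET[rem] + out
    return out

def pack_value(count, created, a_date):
    # pack [21, '25-05-2012', '14-06-2010'] into one int: 212505201214062010
    # (assumes count >= 1; the two date parts are a fixed 8 digits each)
    return int(str(count) + created.replace('-', '') + a_date.replace('-', ''))

def unpack_value(packed):
    s = str(packed)
    count, created, a_date = s[:-16], s[-16:-8], s[-8:]
    fmt = lambda d: '%s-%s-%s' % (d[:2], d[2:4], d[4:])
    return int(count), fmt(created), fmt(a_date)

print(base62(12345), base62(60))                                # 3d7 Y
print(unpack_value(pack_value(21, '25-05-2012', '14-06-2010')))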

Related

Alternatives for data entry k* in index k

I'm having a hard time wrapping my head around this concept of Alternative 1 vs 2/3 for data entry into an index. Here is an excerpt from some notes:
Alternative 1: Actual data record (with key value k)
– If this is used, index structure is a file organization for data records (like Heap files or sorted files).
– At most one index on a given collection of data records can use Alternative 1.
– This alternative saves pointer lookups but can be expensive to maintain with insertions and deletions.
Alternative 2: (k, rid of matching data record) and Alternative 3: (k, list of rids of matching data records)
– Easier to maintain than Alt 1.
– If more than one index is required on a given file, at most one index can use Alternative 1; rest must use Alternatives 2 or 3.
– Alternative 3 more compact than Alternative 2, but leads to variable sized data entries even if search keys are of fixed length.
– Even worse, for large rid lists the data entry would have to span multiple blocks!
Can someone help me understand this by providing some concrete examples?
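To make the alternatives concrete, here is a small illustrative sketch (the Students file, its fields, and the rid values are made up, not from the notes):
# An index on key "age" over a Students heap file.

# Alternative 1: the leaf entry IS the full data record.
alt1_entry = {'sid': 53666, 'name': 'Jones', 'age': 18, 'gpa': 3.4}

# Alternative 2: (key, rid) - the rid (page id, slot) says where the record
# lives in the heap file, so reading the record costs one extra pointer lookup.
alt2_entry = (18, (4, 2))

# Alternative 3: (key, [rids]) - one entry per distinct key value; the rid list
# makes the entry variable-sized even though the key itself is fixed-length.
alt3_entry = (18, [(4, 2), (7, 5), (9, 1)])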

how to write order preserving minimal perfect hash for integer keys?

I have searched Stack Overflow and Google and can't find exactly what I'm looking for, which is this:
I have a set of 4-byte unsigned integer keys, up to a million or so, that I need to use as an index into a table. The easiest would be to simply use the keys as an array index, but I don't want to have a 4 GB array when I'm only going to use a couple of million entries! The table entries and keys are sequential, so I need a hash function that preserves order.
e.g.
keys = {56, 69, 3493, 49956, 345678, 345679,....etc}
I want to translate the keys into {0, 1, 2, 3, 4, 5,....etc}
The keys could potentially be any integer, but there won't be more than 2 million in total. The number will vary, as keys (and corresponding array entries) will be deleted, but new keys will always be higher numbered than the previous highest numbered key.
In the above example, if key 69 was deleted, then the hash integer returned on hashing 3493 should be 1 (rather than 2) as it then becomes the 2nd lowest number.
I hope I'm explaining this right. Is the above possible with any fast, efficient hashing solution? I need the translation to take in the low 100s of ns, though deletion I expect to take longer. I looked at CMPH but couldn't find any usage examples that didn't involve getting the data from a file. It needs to run under Linux and be compiled with gcc using pure C.
Actually, I don't know if I understand what exactly you want to do.
It seems you are trying to obtain the index number in the "array" (or "list") of sequentially ordered integers that you have stored somewhere.
If you have stored these integer values in an array, then the algorithm that returns the index integer in optimal time is Binary Search.
Binary Search Algorithm
Since your list is known to be in order, then binary search works in O(log(N)) time, which is very fast.
If you delete an element from the list of "keys", the Binary Search Algorithm still works, without extra effort or space (although removing one element from the list naturally forces you to shift all the elements to the right of the deleted element).
You only have to provide three things to the Binary Search Algorithm: the array, the size of the array, and the desired key, of course.
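A minimal sketch of that idea in Python (the asker wants pure C, but the translation is direct; bisect does the binary search, and the position in the sorted key array is the order-preserving index):
import bisect

keys = [56, 69, 3493, 49956, 345678, 345679]   # kept sorted

def index_of(key):
    # binary search, O(log N): returns the key's rank among the live keys
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return i
    raise KeyError(key)

print(index_of(3493))    # 2
keys.pop(index_of(69))   # deleting 69 shifts everything to its right
print(index_of(3493))    # now 1, as the question requires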
There is a full Python implementation here. See also the materials available here. If you only need to decode the dictionary, the simplest way to go is to modify the Python code to make it spit out a C file defining the necessary array, and reimplement only the lookup function.
It could be solved by using two dynamically allocated arrays: one for the "keys" and one for the data for the keys.
To get the data for a specific key, you first find it in the key array, and its index in the key array is the index into the data array.
When you remove a key-data pair, or want to insert a new item, you reallocate the arrays, and copy over the keys/data to the correct places.
I don't claim this to be the best or most effective solution, but it is one solution to your problem anyway.
You don't need an order-preserving minimal perfect hash, because any old hash would do. You don't want to use a 4 GB array, but with ~2 million items you wouldn't mind using ~3 million lookup entries.
A standard implementation of a hash map will do the job. It will allow you to delete and add entries and assign any value to entries as you add them.
This leaves you with the question "What hash function might I use on integers?" The usual answer is to take the remainder when dividing by a prime. The prime is chosen to be a bit larger than your expected number of items. For example, if you expect 2 million items, then choose a prime around 3 million.
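A toy sketch of that in Python (illustrative only; in practice you would simply use your language's built-in hash map):
PRIME = 10007   # demo size; for ~2 million items you would pick a prime around 3 million

table = [[] for _ in range(PRIME)]   # separate chaining: each bucket is a list

def put(key, value):
    bucket = table[key % PRIME]      # hash = remainder when dividing by the prime
    for i, (k, _) in enumerate(bucket):
        if k == key:
            bucket[i] = (key, value)
            return
    bucket.append((key, value))

def get(key):
    for k, v in table[key % PRIME]:
        if k == key:
            return v
    raise KeyError(key)

put(345678, 'some data')
print(get(345678))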

Data Structure to do lookup on large number

I have a requirement to do a lookup based on a large number. The number could fall in the range 1 - 2^32. Based on the input, I need to return some other data structure. My question is: what data structure should I use to hold this effectively?
I would have used an array, giving me O(1) lookup, if the numbers were in the range of, say, 1 to 5000. But when my input number gets large, it becomes unrealistic to use an array, as the memory requirements would be huge.
I am hence trying to look at a data structure that yields the result fast and is not very heavy.
Any clues anybody?
EDIT:
It would not make sense to use an array since I may have only 100 or 200 indices to store.
Abhishek
unordered_map or map, depending on what version of C++ you are using.
http://www.cplusplus.com/reference/unordered_map/unordered_map/
http://www.cplusplus.com/reference/map/map/
A simple solution in C, given you've stated at most 200 elements, is just an array of structs with an index and a data pointer (or two arrays, one of indices and one of data pointers, where index[i] corresponds to data[i]). Linearly search the array looking for the index you want. With a small number of elements (200), that will be very fast.
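A sketch of the idea in Python (a list of (index, data) pairs stands in for the C struct array):
entries = [(5000, 'record A'), (123456789, 'record B'), (4294967295, 'record C')]

def lookup(wanted):
    # linear scan - perfectly fast for a couple hundred entries
    for idx, data in entries:
        if idx == wanted:
            return data
    return None

print(lookup(123456789))   # 'record B'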
One possibility is a Judy Array, which is a sparse associative array. There is a C Implementation available. I don't have any direct experience of these, although they look interesting and could be worth experimenting with if you have the time.
Another (probably more orthodox) choice is a hash table. Hash tables are data structures which map keys to values, and provide fast lookup and insertion times (provided a good hash function is chosen). One thing they do not provide, however, is ordered traversal.
There are many C implementations. A quick Google search turned up uthash which appears to be suitable, particularly because it allows you to use any value type as the key (many implementations assume a string as the key). In your case you want to use an integer as the key.

Avoiding Memcache "1000000 bytes in length" limit on values

My model has different entities that I'd like to calculate once, like the employees of a company. To avoid making the same query again and again, the calculated list is saved in Memcache (duration=1day). The problem is that the app sometimes gives me an error that more bytes are being stored in Memcache than is permissible:
Values may not be more than 1000000 bytes in length; received 1071339 bytes
Is storing a list of objects something that you should be doing with Memcache? If so, what are best practices in avoiding the error above? I'm currently pulling 1000 objects. Do you limit values to < 200? Checking for an object's size in memory doesn't seem like too good an idea because they're probably being processed (serialized or something like that) before going into Memcache.
David, you don't say which language you use, but in Python you can do the same thing as Ibrahim suggests using pickle. All you need to do is write two little helper functions that read and write a large object to memcache. Here's an (untested) sketch:
import pickle
from google.appengine.api import memcache

def store(key, value, chunksize=950000):
    # pickle the object, then split the byte string into chunks that stay
    # under memcache's 1,000,000-byte value limit
    serialized = pickle.dumps(value, 2)
    values = {}
    for i in xrange(0, len(serialized), chunksize):
        values['%s.%s' % (key, i // chunksize)] = serialized[i:i + chunksize]
    return memcache.set_multi(values)

def retrieve(key):
    result = memcache.get_multi(['%s.%s' % (key, i) for i in xrange(32)])
    # reassemble in numeric chunk order (a plain sort would put '10' before '2')
    parts = sorted(result.items(), key=lambda kv: int(kv[0].rsplit('.', 1)[1]))
    serialized = ''.join([v for k, v in parts if v is not None])
    return pickle.loads(serialized)
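Usage is then just store('employees', employee_list) when you build the list, and employee_list = retrieve('employees') on later requests (the key name is only an example). Note that the retrieve sketch assumes at most 32 chunks, i.e. roughly 30 MB of pickled data.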
I frequently store objects several megabytes in size in memcache. I cannot comment on whether this is a good practice or not, but my opinion is that sometimes we simply need a relatively fast way to transfer megabytes of data between our App Engine instances.
Since I am using Java, what I did was serialize my raw objects using Java's serializer, producing a serialized array of bytes. Since the size of the serialized object is then known, I can cut it into chunks of 800 KB byte arrays. I then encapsulate each byte array chunk in a container object, and store that object instead of the raw object.
Each container object has a pointer to the next memcache key where I can fetch the next byte array chunk, or null if there are no more chunks to fetch from memcache (i.e. just like a linked list). I then re-merge the chunks of byte arrays into one large byte array and deserialize it using Java's deserializer.
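In Python the same linked-list layout might look roughly like this (a sketch; the function names are made up, it uses the same App Engine memcache client as the snippet above, and blob is assumed to be a pickled bytes string):
def store_chained(key, blob, chunksize=800 * 1024):
    # each chunk records the key of the next chunk (or None at the end),
    # threading a linked list through memcache
    next_key = None
    for start in reversed(range(0, len(blob), chunksize)):
        chunk_key = '%s.%d' % (key, start // chunksize)
        memcache.set(chunk_key, {'data': blob[start:start + chunksize],
                                 'next': next_key})
        next_key = chunk_key
    return next_key   # key of the first chunk

def retrieve_chained(first_key):
    parts, k = [], first_key
    while k is not None:
        container = memcache.get(k)
        parts.append(container['data'])
        k = container['next']
    return b''.join(parts)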
Do you always need to access all the data which you store? If not then you will benefit from partitioning the dataset and accessing only the part of data you need.
If you display a list of 1000 employees you probably are going to paginate it. If you paginate then you definitely can partition.
You can make two lists from your dataset: a lighter one with just the most essential information, which can fit into 1 MB, and another list which is divided into several parts with the full information. On the light list you will be able to apply the most essential operations, for example filtering by employee name or pagination. And then, when the heavy dataset is needed, you will be able to load only the parts which you really need.
But these suggestions take time to implement. If you can live with your current design, then just divide your list into lumps of ~300 items, or whatever number is safe, and load them all and merge.
If you know how large the objects will be, you can use the memcached option to allow larger objects:
memcached -I 10m
This will allow objects up to 10MB.

Dropping Leading Zeros

I have a form that records a student ID number. Some of those numbers contain a leading zero. When the number gets recorded into the database it drops the leading 0.
The field is set up to only accept numbers. The length of the student ID varies.
I need the field to be recorded and displayed with the leading zero.
If you are always going to have a number of a certain length (say, it will always be 10 characters), then you can just get the length of the number in the database (after it is converted to a string) and then add the appropriate 0's.
However, if this is an arbitrary amount of leading zeros, then you will have to store the content as a string in the database so you can capture the leading zeros.
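For example, in Python (the fixed width of 10 is just an assumption):
student_id = 7012345                       # stored as a number, leading zeros lost
display_id = str(student_id).zfill(10)     # '0007012345' - padded back out for display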
It sounds like this should be stored as string data. It sounds like the leading zeros are part of the data itself, not just part of its formatting.
You could reformat the data for display with the leading zeros in it; however, I believe you should store the correct form of the ID number, as it will lead to fewer bugs down the road (e.g. you format it in one place but forget to in another).
There are a few ways of doing this - depending on the answers to my comments in your question:
Store the extra data in the database by converting the datatype from numeric to varchar/string.
Advantages: Very simple in its implementation; You can treat all the values in the same way.
Disadvantage: If you've got very large amounts of data, storage sizes will escalate; indexing and sorting on strings doesn't perform so well.
Use if: Each number may have an arbitrary length (and hence number of zeros).
Don't use if: You're going to be spending a lot of time sorting data, sorting numeric strings is a pain in the ass - look up natural sorting to see some of the pitfalls;
Continue to store the data in the database as numeric but pad the numeric back to a set length (i.e. 10 as I have suggested in my example below):
Advantages: Data will index better, search better, not require such large amounts of storage if you've got large amounts of data.
Disadvantage: Every query or display of data will require every data instance to be padded to the correct length causing a slight performance hit.
Use if: All the output numbers will be the same length (i.e. including zeros they're all [for example] 10 digits); Large amounts of sorting will be necessary.
Add a field to your table to store the original length of the numeric, and continue to store the value as numeric (to leverage the sorting/indexing performance gains of numeric vs. string); in your new field, store the length as it would be including the significant zeros:
Advantages: Reduction in required storage space; maximum use of indexing; sorting of numerics is far easier than sorting text numerics; You still get the ability to pad numerics to arbitrary lengths like you have with option 1.
Disadvantages: An extra field is required in your database, so all your queries will have to pull that extra field thus potentially requiring a slight increase in resources at query/display time.
Use if: Storage space/indexing/sorting performance is any sort of concern.
Don't use if: You don't have the luxury of changing the table structure to include the extra value; This will overcomplicate already complex queries.
If I were you, and I had access to modify the db structure slightly, I'd go with option 3. Sure, you need to pull out an extra field to get the length, but the slightly increased complexity pays huge dividends in the advantages versus the disadvantages. The performance hit of padding the string back out to the correct length will be far outweighed by the performance gains in indexing and storage space.
I worked with a database with a similar problem. They were storing zip codes as a number. The consequence was that people in New Jersey couldn't use our app.
You're using data that is logically a text string and not a number. It just happens to look like a number, but you really need to treat it as text. Use a text-oriented data type, or at least create a database view that enables you to pull back a properly formatted value for this.
See here: Pad or remove leading zeroes from numbers
declare @recordNumber integer;
set @recordNumber = 93088;
declare @padZeroes integer;
set @padZeroes = 8;
select right(replicate('0', @padZeroes) + convert(varchar, @recordNumber), @padZeroes);
Unless you intend to do calculations on that ID, it's probably best to store it as text/string.
Another option: since the field is an id, I would recommend creating a secondary field for the display number (nvarchar) that you can use for reports, etc...
Then in your application, when the student id is entered, you can insert it into the database as the number, as well as the display number.
An Oracle solution
Store the ID as a number and convert it into a character for display. For instance, to display 42 as a zero-padded, three-character string:
SQL> select to_char(42, '099') from dual;
042
Change the format string to fit your needs.
(I don't know if this is transferable to other SQL flavors, however.)
You could just concatenate '1' to the beginning of the ID when storing it in the database. When retrieving it, treat it as a string and remove the first char.
MySQL Example:
SET @student_id = '123456789';
INSERT INTO student_table (id,name) VALUES(CONCAT('1',@student_id),'John Smith');
...
SELECT SUBSTRING(id,2) FROM student_table;
Mathematically:
Initially I thought too much and did it mathematically by adding an integer to the student ID, depending on its length (like 1,000,000,000 if it's 9 digits), before storing it.
SET @new_student_id = ABS(@student_id) + POW(10, CHAR_LENGTH(@student_id));
INSERT INTO student_table (id,name) VALUES(@new_student_id,'John Smith');

Resources