Is Redis good for what I need - database

I have a website where users can submit text messages, dead simple data structure...
Name <-- Less than 20 characters
Message <-- Max 150 characters
Timestamp
IP
Hidden <-- Bool (True or False)
On the previous version of the website they were stored in a MySQL database, which has become very big with lots of tables, and I want to simplify the database. So I heard Redis is good for simple data structures and non-relational information...
Would Redis be a good option for this kind of data, and how would it perform in terms of memory usage and read times when talking about 100,000+ records a year...

redis is really only good for in-memory problem sets. It DOES have a page-to-disk capability - but then you're at the mercy of the OS swapper - namely, your RAM will be in competition with system caches. Also, I think the keys always have to fit in RAM. So you're NOT going to want to store 1G+ log records - a mysql archive table is MUCH better for that.
redis has master-slave functionality, similar to mysql. So you can perform various tricks such as sorting on a slave to keep the master responsive. While I haven't used it, I'd speculate that for in-memory databases, mysql-cluster is probably far more advanced - but with corresponding extra complexity / resource costs.
If you have large values for your key-value set, you can perform client-side compression/decompression. There isn't much the server can do to search on the values of those 'blobs' anyway.
One common way to get around the RAM limitation is to perform client-side sharding (partitioning). Namely, if you KNOW your upper bounds, and you don't have enough RAM to throw at the problem for some reason (say you already have 64 GB of RAM), then you could 'shard' based on the primary key. If it's a sequence counter, you could take the bottom 3 bits (or some hashing function + partition function) and distribute amongst 4, 8, 16, etc. server nodes. That scales linearly, though if you need to re-partition, that could be painful. You COULD take advantage of 'slots' in redis to start off with fewer machines: say 1 machine with 16 slots, then later dump slots 7-15 and restore them on a different machine, and remap all the clients to point to the two machines (with the same slot numbers). And so forth up to 16-way sharding, at which point you'd need to remap ALL your data to go 32-way.
Obviously first evaluate the command-set of redis to see if ALL your data-storage and reporting needs can be met. There are equivalents to "select * from foo for update", but they're not obvious. Not all RDBMS queries can be reproduced efficiently with key-value stores. But for simple natural-primary-key record-structures it should do fine.
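For the kind of record in the original question, a minimal sketch with the redis-py client might look like this (the key names and the sorted-set index are my own illustrative choices, not prescribed by the answer):

import time
import redis

r = redis.Redis()  # assumes a local Redis server on the default port

def save_message(msg_id: int, name: str, message: str, ip: str, hidden: bool):
    # Store one submitted message as a hash, and index it by timestamp
    # so recent messages can be listed.
    ts = int(time.time())
    r.hset(f"message:{msg_id}", mapping={
        "name": name[:20], "message": message[:150],
        "timestamp": ts, "ip": ip, "hidden": int(hidden),
    })
    r.zadd("messages:by_time", {f"message:{msg_id}": ts})

def latest_messages(count: int = 20):
    keys = r.zrevrange("messages:by_time", 0, count - 1)
    return [r.hgetall(k) for k in keys]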
Additionally, it should be easy to extend the redis command-set to perform custom operations. Just keep in mind that it's designed around no-pause single-threaded execution (it avoids locking/context-switching overhead).
But the things I really like are the FIFOs, pub/sub, data time-outs, atomic mutations (inc/dec), lazy sorting (e.g. on a client with read-only nodes), and maps of maps. It's simple enough that instead of using namespaces, you just launch separate redis processes on different ports/UNIX sockets (my preference if possible).
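A few of those features as illustrative one-liners with redis-py (the key names are made up; each line corresponds to a real Redis command):

import redis

r = redis.Redis()                              # assumes a local Redis server

r.lpush("jobs", "job-1")                       # FIFO queue: push on the left...
job = r.rpop("jobs")                           # ...pop from the right
r.publish("events", "user-signed-up")          # pub/sub: fire-and-forget publish
r.set("session:abc", "data", ex=3600)          # data time-out: expires in 1 hour
r.incr("page:views")                           # atomic increment
r.hset("user:1", mapping={"name": "bob"})      # map of fields under one key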
It's meant to replace memcached more than anything else, but it has a very nice background persistence framework.


Storing and accessing a large number of relatively small files

I am running lots of very slow computations with reusable results (and often computing something new relies on a computation that was already performed before). To make use of them, I want to store the results somewhere (permanently). The computations can be uniquely identified by two identifiers: experiment name and computation name, and the value is an array of floats (which I currently store as raw binary data).

They need to be individually accessed (read and written) by experiment and computation name very often, and sometimes also just by experiment name (i.e. all computations with their results for a given experiment). They are also sometimes concatenated, but if reading and writing is fast, no additional support for this operation is needed.

This data will not need to be accessed by any web application (it is used only by non-production scripts that need the results of the computations, but calculating them each time is not feasible), and there is no need for transactions, but every write needs to be atomic (e.g. turning off the computer should not result in corrupted/partial data). Reading also needs to be atomic (e.g. if two processes try to access the result of one computation and it's not there, so one of them starts saving the new result, the other process should either receive it when it's done, or receive nothing at all). Accessing the data remotely is not required, but would be helpful.
So, TL;DR requirements:
permanent storage of binary data (no metadata other than the identifier needs to be stored)
very fast access (read/write) based on a compound identifier
ability to read all data by one part of a compound identifier
concurrent, atomic read/write
no need for transactions, complex queries, etc.
remote access would be nice to have, but not required
the whole thing is there mostly to save time, so speed is critical
The solutions I tried so far are:
storing them as individual files (one directory per experiment, one binary file per computation) - requires manual handling of atomicity, and most file systems support file names only up to 255 characters long (and computation names may be longer than that), so an additional mapping would be required; also I'm not sure whether ext4 (the filesystem I'm using, and I can't change it) is designed to handle millions of files
using a sqlite database (with just one table and a compound primary key) - at first it seemed perfect, but when we got to hundreds of gigabytes of data (millions of ~100 KB blobs, and both their number and their size will increase), it started being really slow, even after applying optimizations found on the internet
Naturally, after sqlite failed, the first idea was to just move to a "proper" database like postgres, but then I realized that perhaps in this case a relational database is not really the way to go (especially since speed is critical here, and I don't need most of their features) - and especially postgres is probably not the way to go, since the closest thing to a blob is bytea, which requires additional conversions (so a performance hit is guaranteed). However, after researching a bit about key-value databases (which seemed to apply to my problem), I found that all of the databases I checked do not support compound keys, and often have length limitations for keys (e.g. couchbase allows just 250 bytes). So, should I just go with a normal relational database, try one of the NoSQL databases, or maybe something completely different like HDF5?
One way to improve on the database solution is to externalize the data blob.
You can use SeaweedFS https://github.com/chrislusf/seaweedfs as an object store: upload the blob and get a file id, and then store the file id in the database. (I am working on SeaweedFS.)
This should reduce the database load quite a bit, and querying will be much faster.
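As a rough sketch of that workflow (this is my reading of the SeaweedFS HTTP API as documented in its README; the master address and the helper names are placeholders, not the author's code):

import requests

MASTER = "http://localhost:9333"   # assumed SeaweedFS master address

def store_blob(data: bytes) -> str:
    # Ask the master for a file id and a volume server, then upload the blob there.
    assign = requests.get(f"{MASTER}/dir/assign").json()
    fid, volume = assign["fid"], assign["url"]
    requests.post(f"http://{volume}/{fid}", files={"file": data}).raise_for_status()
    return fid   # store this fid alongside (experiment, computation) in the database

def load_blob(fid: str) -> bytes:
    # Look up which volume server holds the file, then fetch it.
    volume_id = fid.split(",")[0]
    lookup = requests.get(f"{MASTER}/dir/lookup", params={"volumeId": volume_id}).json()
    volume = lookup["locations"][0]["url"]
    return requests.get(f"http://{volume}/{fid}").content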
So, I ended up using a relational database anyway (since only there I could use compound keys without any hacks).
I performed a benchmark to compare sqlite with postgres and mysql - 500 000 inserts of ~60 KB blobs and then 50 000 selects by the whole key. This was not enough to slow down sqlite to the unacceptable levels I was experiencing, but it set a point of reference (i.e. the speed at which sqlite was running with this relatively small number of records was acceptable to me). I assumed that I wouldn't experience a huge performance hit when adding more records with mysql and postgres (since they were designed to work with much larger amounts of data than sqlite), and when finally using one of them, that turned out to be true.
The settings (other than the defaults) were as follows:
sqlite: journal mode=wal (required for parallel access), isolation level autocommit, values as BLOB (a sketch of this setup follows the list)
postgres: isolation level autocommit (can't turn off transactions, and doing everything in one huge transaction was not an option for me), values as BYTEA (which sadly includes the double conversion I wrote about)
mysql: engine=aria, transactions disabled, values as MEDIUMBLOB
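A minimal sketch of the SQLite variant of that setup (the table layout and helper names are my guesses at what is being described, not the author's actual benchmark code):

import sqlite3

conn = sqlite3.connect("results.db", isolation_level=None)  # autocommit mode
conn.execute("PRAGMA journal_mode=WAL")                     # allows parallel readers

conn.execute("""
    CREATE TABLE IF NOT EXISTS results (
        experiment  TEXT NOT NULL,
        computation TEXT NOT NULL,
        value       BLOB NOT NULL,
        PRIMARY KEY (experiment, computation)
    )
""")

def put(experiment: str, computation: str, value: bytes):
    conn.execute("INSERT OR REPLACE INTO results VALUES (?, ?, ?)",
                 (experiment, computation, sqlite3.Binary(value)))

def get(experiment: str, computation: str):
    row = conn.execute(
        "SELECT value FROM results WHERE experiment=? AND computation=?",
        (experiment, computation)).fetchone()
    return row[0] if row else None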
As you can see, I was able to customize mysql much more to fit the task at hand. The results below reflect it well:
         sqlite        postgres      mysql
selects  90.816292     191.910514    106.363534
inserts  4367.483822   7227.473075   5081.281370
MySQL had similar speed to SQLite, with PostgreSQL being significantly slower.

Redis as a cache for RDBMS

I am planning to use Redis as a cache for an already existing database (MS SQL). I would like to use the data from Redis to populate the front end. I will be dealing with a huge amount of data, around 100 GB a day. I will mostly have a table which contains a time value and some counter values (some 10-100 columns). How would Redis perform if I am to do aggregation on this much data based on hour, day, etc. (i.e. based on the time column)?
Is Redis the right way to do it, or are there any alternatives? I don't know how good NoSQL is at aggregation compared to an RDBMS.
And how would MongoDB do in such a scenario?
Thanks
If you need to store 100 GB and you don't expect your data set to grow much beyond that, start with 3 redis instances, each with 64 GB of RAM, for a total of 192 GB, more than enough to hold your data set with room to grow.
Each redis instance will be a master, so your data will be split amongst the instances equally. You'll need to shard across the instances from the application layer using a simple hashing algorithm, for instance...
(from your application layer)
shardKey = "redis" + getShardKey(cacheKey);
redisConnection = getRedisConnectionByShardKey(shardKey);
// do work with redisConnection here
The function getShardKey(string) takes the cacheKey, converts it to an integer, then mods it by the number of redis instances, returning either 0, 1, or 2. Configure a connection pool for each redis instance, give each one a name like redis0, redis1, etc., after you call the hash function, use the shard key to get a connection for the target redis instance. Once you have the data you need, do the aggregation in your application layer.
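A runnable sketch of that idea in Python with the redis-py client (the hostnames, the CRC32 hash, and the key names are illustrative assumptions, not part of the original answer):

import zlib
import redis

# One client per redis master; swap in your own connection settings / pools.
SHARDS = [
    redis.Redis(host="redis0.example.internal", port=6379),
    redis.Redis(host="redis1.example.internal", port=6379),
    redis.Redis(host="redis2.example.internal", port=6379),
]

def get_shard_key(cache_key: str) -> int:
    # Convert the key to an integer, then mod by the number of instances.
    return zlib.crc32(cache_key.encode()) % len(SHARDS)

def get_connection(cache_key: str) -> redis.Redis:
    return SHARDS[get_shard_key(cache_key)]

# Usage: bump a counter stored under an hourly key, then aggregate client-side.
key = "metric:42:2015-06-01T13"
get_connection(key).hincrby(key, "count", 1)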
This is a simple approach; it distributes data equally amongst the redis instances (more or less), and avoids stuffing everything into a single redis instance. Redis is single-threaded, so if you're doing lots of I/O you'll be bound by how fast your CPU can service requests; using multiple instances distributes that load.
This solution breaks down when your data set grows beyond 180 GB. If you add another redis instance to accommodate a larger data set, the hash function must be updated to use modulo 4 instead of 3, and you'll have to move most of your data around. This gets ugly, so use this approach only if you're 100% sure the data set will stay below 150 GB.

Database choice: High-write, low-read

I'm building a component for recording historical data. Initially I expect it to do about 30 writes/second, and less than 1 read/second.
The data will never be modified, only new data will be added. Reads are likely to be done with fresh records.
The demand is likely to increase rapidly, expecting around 80 writes/second in one year time.
I could choose to distribute my component and use a common database such as MySQL, or I could go with a distributed database such as MongoDB. Either way, I'd like the database to handle writes very well.
The database must be free. Open source would be a plus :-)
Note: A record is plain text in variable size, typically 50 to 500 words.
Your question can be solved a few different ways, so let's break it down and look at the individual requirements you've laid out:
Writes - It sounds like the bulk of what you're doing is append-only writes at a relatively low volume (80 writes/second). Just about any product on the market with a reasonable storage backend is going to be able to handle this. You're looking at 50-500 "words" of data being saved. I'm not sure what constitutes a word, but for the sake of argument let's assume a word averages 8 characters, so your data is going to be some kind of metadata (a key/timestamp/whatever) plus 400-4,000 bytes of words. Barring implementation-specific details of different RDBMSes, this is still pretty normal: we're probably writing at most (including record overhead) 4,100 bytes per record. That maxes out at 328,000 bytes per second or, as I like to put it, not a lot of writing.
Deletes - You also need the ability to delete your data. There's not a lot I can say about that. Deletes are deletes.
Reading - Here's where things get tricky. You mention that it's mostly primary keys and reads are being done on fresh data. I'm not sure what either of these mean, but I don't think that it matters. If you're doing key only lookups (e.g. I want record 8675309), then life is good and you can use just about anything.
Joins - If you need the ability to write actual joins where the database handles them, you've written yourself out of the major non-relational database products.
Data size/Data life - This is where things get fun. You've estimated your writes at 80/second and I guess at 4,100 bytes per record, or 328,000 bytes per second. There are 86,400 seconds in a day, which gives us 28,339,200,000 bytes. Terrifying! That's 27,675,000 KB, 27,026 MB, or roughly 26 GB / day. Even if you're keeping your data for 1 year, that's 9,633 GB, or 10 TB of data. You can lease 1 TB of data from a cloud hosting provider for around $250 per month or buy it from a SAN vendor like EqualLogic for about $15,000.
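For reference, the arithmetic above can be reproduced in a few lines (binary units, same assumptions of 4,100 bytes per record at 80 writes/second):

# Back-of-the-envelope check of the storage estimate above.
writes_per_sec = 80
bytes_per_record = 4_100                          # ~4,000 bytes of words + overhead
per_second = writes_per_sec * bytes_per_record    # 328,000 bytes/s
per_day = per_second * 86_400                     # 28,339,200,000 bytes/day
per_day_gib = per_day / 1024**3                   # ~26.4 GB/day
per_year_gib = per_day_gib * 365                  # ~9,633 GB ≈ 10 TB/year
print(per_second, per_day, round(per_day_gib, 1), round(per_year_gib))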
Conclusion: I can only think of a few databases that couldn't handle this load. 10TB is getting a bit tricky and requires a bit of administration skill, and you might need to look at certain data lifecycle management techniques, but almost any RDBMS should be up to this task. Likewise, almost any non-relational/NoSQL database should be up to this task. In fact, almost any database of any sort should be up to the task.
If you (or your team members) already have skills in a particular product, just stick with that. If there's a specific product that excels in your problem domain, use that.
This isn't the type of problem that requires any kind of distributed magical unicorn powder.
OK, for MySQL I would advise you to use InnoDB without any indexes except on primary keys; even then, if you can skip them, so much the better, in order to keep the input flow uninterrupted.
Indexes optimize reading, but decrease write performance.
You may also use PostgreSQL, where you would need to skip indexes as well; there you won't have an engine selection, but its write capabilities are also very strong.
The approach you want is actually used in some solutions, but with two database servers, or at least two databases. The first receives a lot of new data (your case), while the second communicates with the first and stores it in a well-structured database (with indexes, rules, etc.). Then, when you need to read or make a snapshot of the data, you refer to the second server (or second database), where you can use transactions and so on.
You could also take a look at Oracle Express Edition (I think that was its name) and SQL Server Express Edition. These two have better performance, but also some limitations, so look into them to get a more detailed picture.

For millions of objects, is it better to store in an array or a database like redis if the objects are needed in realtime?

I am developing a simulation in which there can be millions of entities that can interact with each other. At the moment, all the entities are stored in a list. Would it be better to store the objects in a database like redis instead of a list?
Note: I assumed this was being implemented in Java (force of habit). My answer is not terribly useful if it is not Java.
Making lots of assumptions about your requirements, I'd consider Redis if:
You are running into unacceptable GC pauses as a result of your millions of objects OR
The entities you create can be reused across multiple simulation runs
Java apps with giant heaps and lots of long-lived objects can run into very long GC pauses, depending on work-load. i.e. the old gen fills up with all these millions of objects and they're never eligible for collection. Regardless, periodically a full collect will happen (unless you're a GC tuning master) and have to scan these millions of objects in the old gen. This can take many seconds each time it happens, and you're frozen during that time. If this is happening and you don't like it, you could off-load all these long-lived objects to Redis, and pay the serialize/deserialize cost of accessing them rather than the GC pauses.
On the other point about reusing entities: if you're loading up a big Redis db and then dropping all its data when the simulation ends, it feels a bit wasteful. If you can re-use entities across simulation runs you might save yourself a bunch of time by persisting them in Redis.
The best choice depends on a number of factors, including how you access data, whether it will fit in memory, and what the distribution of accesses looks like. As a broad generalization, keeping data in memory is always faster than on disk, and keeping it in-process is faster than keeping it elsewhere.
If your data fits in memory, is accessed in a manner that means you can use basic data structures like lists/arrays and hashtables efficiently, and all items are accessed roughly equally often, keeping your data in memory is probably the best option.
If your data fits in memory, but you need to access it in complex ways, you may be best choosing a datastore like redis that supports in-memory databases.
If your data doesn't fit in memory, or you have a very uneven access pattern such that evicting the least used data to disk might allow other things to be loaded, speeding up your task in general, a regular disk-based datastore may be a better choice.
A list is not necessarily the best data structure unless "interaction" is limited to the respective next or previous element. Random access (by index) is very slow on a list.
Lists rocket at inserting at front and end, and at finding the next (or previous) element, or inserting one in between. They totally blow for accessing element 164553 and then element 10657, being O(N) on random access. Thus "interact with each other" suggests that list is a bad choice.
It very much depends on the access and allocation patterns, but a vector or deque will likely be much better suited than a list for your simulation.
Redis is based on a hash table, which has (much!) better characteristics for random access, but it will most likely still be slower, because of the considerable overhead: you serialize the data, it goes through a socket, redis deserializes and processes it and sends a reply, and you parse that reply.

Need for speed: Best database solution

What I want to create is a huge index over an even bigger collection of data. The data is a huge collection of images (and I mean millions of photos!) and I want to build an index on all unique images.
So I calculate a hash value of every image and combine it with the width, height and file size of the image. This should generate a practically unique key for every image. This would be combined with the location of the image, or locations in case of duplicates.
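A quick sketch of building such a composite key in Python (Pillow for the image dimensions and SHA-256 as the hash are my assumptions; the question doesn't specify either):

import hashlib
import os
from PIL import Image

def image_key(path: str) -> str:
    # Hash the file contents, then append width, height and file size.
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    width, height = Image.open(path).size
    size = os.path.getsize(path)
    return f"{digest}-{width}-{height}-{size}"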
Technically speaking, this would fit perfectly in a single database table. A unique index on file name, plus an additional non-unique index on hash-width-height-size, would be enough. However, I could use an existing database system to solve this, or just write my own, optimized version. It will be a single-user application anyway and the main purpose is to detect when I add a duplicate image to the collection so it will warn me that I already have it in my collection and display the locations where the other copies are. I can then decide to still add the duplicate or to discard it.
I've written hash-table implementations before and it's not that difficult once you know what you have to be aware of. So I could just implement my own file format for this data. It's unlikely that I'll ever need to add more information to these images and I'm not interested in similar images, just exact images. I'm not storing the original images in this file either, just the hash, size and location.
From experience, I know this could run extremely fast. I've done it before and have been doing similar things for nearly three decades, so it's likely that I will choose this solution.
But I do wonder... Doing the same with an existing database system like SQL Server, Oracle, Interbase or MySQL, would performance still be high enough? There would be about 750 TB of images indexed in this database, which roughly translates to around 30 million records in a single, small table. Is it even worth considering the use of a regular database?
I have doubts about the usability of a database for this project. The amount of data is huge, yet the structure is real simple. I don't need multi-user support or most other features that most databases provide. So I don't see a need for a database. But I'm interested in the opinions of other programmers about this. (Although I expect most will agree with me here.)
The project itself, which is still just an idea in my head, is supposed to be some tool or add-on for explorer or whatever. Basically, it builds an index for any external hard disk that I attach to the system and when I copy an image to this disk somewhere, it's supposed to tell me if the image already exists at this disk. It will allow me to avoid filling up my backup disks with duplicates, although I sometimes would like to add duplicates. (E.g. because they're part of a series.) Since I like to create my own rendered artwork I have plenty of images. Plus, I've been taking digital pictures with digital cameras since 1996 so I also have a huge collection of photos. Add some other large collections to this and you'll soon realise that the amount of data will be huge. (And yes, there are already plenty of duplicates in my collection...)
Since it's a single-user application that you are considering, I'd probably have a look at SQLite. It ought to fit your other requirements rather nicely, I'd say.
I just tested the performance of PostgreSQL on my laptop (Core 2 Duo T5800 2.0 GHz 3.0 GiB RAM). I have a table with slightly more than 100M records, 5 columns and some indexes. I performed a range query on one indexed column (not the primary key) and returned all columns. A mean query returned 75 rows and executed in 750ms. You have to decide if this is fast enough.
I would avoid DIY-ing it unless you know all the repercussions of what you're doing.
Transactional consistency, for example, is not trivial.
I would suggest designing your code in such a way that the backend can be easily replaced later, then running with something sane (SQLite is a good starting choice), developing it in the most sane and rational way possible, and then trying to slot in the alternative backing store.
Then profile the differences, and run regression tests against it to make sure your database is not worse than SQLite.
Existing database solutions tend to win because they've had years of improvement and fine-tuning to get their benefits, and a naïve attempt will likely be slower, buggier, and do less, all the while increasing your development load to purely MONUMENTAL proportions.
http://fetter.org/optimization.html
The first rule of Optimization is, you do not talk about Optimization.
The second rule of Optimization is, you DO NOT talk about Optimization.
If your app is running faster than the underlying transport protocol, the optimization is over.
One factor at a time.
No marketroids, no marketroid schedules.
Testing will go on as long as it has to.
If this is your first night at Optimization Club, you have to write a test case.
Also, with databases, there is one thing you utterly MUST get ingrained.
Speed is unimportant
Your data being there when you need it, that is important.
When you have the assuredness that your data will always be there, then you may worry about trivial concerns like speed.
Hashes
You also mention that you'll be using image SHAs/MD5s etc. to deduplicate images. This is a fallacious notion of its own: hashes of files are only able to tell you that files are different, not that they're the same.
The logic is akin to asking 30 people to flip a coin, and you see the first one get heads, and thus decide to delete every other person who gets a head, because they're obviously the same person.
https://stackoverflow.com/questions/405628/what-is-the-best-method-to-remove-duplicate-image-files-from-your-computer
Although you may think it unlikely you'd have 2 different files with the same hash, your odds are about as good as winning the lotto. The chances of you winning the lotto are low, but somebody wins the lotto every day. Don't let it be you.
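If you do go the hash route, a common mitigation (my sketch, not something the answer above spells out) is to treat a hash match only as a duplicate candidate and confirm it with a byte-for-byte comparison:

import filecmp
import hashlib

def file_hash(path: str) -> str:
    # Hash the file in chunks so large images don't need to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def is_duplicate(existing_path: str, new_path: str) -> bool:
    if file_hash(existing_path) != file_hash(new_path):
        return False                       # different hashes => different files
    # Same hash: confirm byte-for-byte before declaring a true duplicate.
    return filecmp.cmp(existing_path, new_path, shallow=False)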
