I have written a PHP application which requires storage of millions of integers between 0 and 10,000,000 inclusive. Each number is incremented by one very frequently (on average 100 values are updated every second) and read very frequently (20,000 reads per second). The numbers are reset to 0 either nightly, weekly, monthly or annually.
I've got a fairly good handle on MySQL but it feels like overkill, and not very efficient in the process.
Has anyone had to deal with this before, and/or could shed some light on a suitable data storage system?
Depending on how much memory you have, and how long you have to keep them around - you could just use the APC cache, or memcache. memcache has a nice increment operation, so it's crazy efficient doing it.
BTW, you said you need 20K reads. And you are doing it in PHP? on one server or on a cluster? Sounds like you need some architectural guidance..... :)
If it's a web-app, you are going to be using a cluster. And if it's not a web app, I'd suggest you think about re-doing it in something which isn't PHP. Java, C++ and c# come to mind.
Related
I am at the beginning of a project where we will need to manage a near real-time flow of messages containing some ids (e.g. sender's id, receiver's id, etc.). We expect a throughput of about 100 messages per second.
What we will need to do is to keep track of the number of times these ids appeared in a specific time frame (e.g. last hour or last day) and store these values somewhere.
We will use the values to perform some real time analysis (i.e. apply a predictive model) and update them when needed while parsing the messages.
Considering the high throughput and the need to be in real time what DB solution would be the better choice?
I was thinking about a key-value in memory DB that will persist data on disk periodically (like Redis).
Thanks in advance for the help.
The best choice depends on many factors we don’t know, like what tech stack is your team already using, how open are they to learning new things, how much operational burden are you willing to take on, etc.
That being said, I would build a counter on top of DynamoDB. Since DynamoDB is fully managed, you have no operational burden (no database server upgrades, etc.). It can handle very high throughput, and it has single-digit millisecond latency for writes and reads to a single row. AWS even has documentation describing how to use DynamoDB as a counter.
I’m not as familiar with other cloud platforms, but you can probably find something in Azure or GCP that offers similar functionality.
I have a particular use case for multiple in memory key value maps that need very fast lookup time. They are set just set once a day so can be considered immutable for all practical purposes. Redis is not an option since it gets CPU throttled in case of multiple threads accessing it. Multi instance redis takes up too much memory because of data replication. The important thing to consider here is that the read rate is very high in bursts. Around 10 million requests in bursts from around 40-50 workers simultaneously.
I was thinking of creating a simple client server architecture with multiple readers connecting to a server to read from shared memory maps. However I wonder if such an architecture already exists and has been tested profusely for this use case in which case I should not be reinventing the wheel.
So to sum up what is my best alternative? TIA.
Might not be suitable for you but you could try RBLDNSD and store your values in DNS. It's high performance and results will be cached, and it's easy to read the values from pretty much any programming environment. To write values to it you'll need to write directly to its zone files, but the format is simple and easy to write.
You don't mention the size of your maps, but given that performance is so critical, it sounds like you may want to consider keeping copies of your 'multiple in memory key value maps' with each worker.
You could then implement a simple mechanism to notify each worker that it's time to refresh their maps (e.g. Redis PUBLISH, or any other pubsub type framework).
At the risk of running afoul of the stackoverlow self-promotion police :-) eXtremeDB might be a consideration. It's not schema-less, but your schema can simply define a key-value pair. It supports MVCC (optimistic, non-blocking) concurrency so even the relatively infrequent writes won't get in the way of readers, and you'll be able to utilize all the CPU cores.
I'm working on a system that will need to store lots of temperature data. I could potentially store 5 samples per second or more.
I've done this in the past with a relatively simple mysql database and performance became unbearable. Inserts were not too bad, but had a noticeable load. Queries, however, could take minutes.
At that time, I had something like 50 gb of data, which is ridiculous. I can think of many ways to compress or discard data without losing critical information, but that's a completely different problem.
I'd like to pick a tool/database that is optimized for this kind of data, preferably cross platform (at a minimum, linux/c++).
RRD (Round Robin Database) seems built for this kind of thing, but it seems designed more for processing data than for storing it.
What other tools are available?
Edit: more info...
This will be running on an embedded system (a Raspberry Pi) so an ideal tool has low computing overhead, low memory footprint, and few library dependencies.
The storage may not necessarily be on the same device.
I suppose growth could reach as much as 500k samples per hour in a contrived, extreme case. More likely it will be about 20k samples per hour.
Internet access should not be presumed.
Looks like you're looking for a time series DB.
I know of two candidates:
http://tempo-db.com/
http://opentsdb.net/
If you could be a bit more specific about your requirements (req/s, daily data growth, API type, self-hosted or a fully managed solution, etc), I'd be able to go into more details or recommend other solutions.
Good luck.
I've been using Cassandra instance without reboot for a few days for a simple
task for storing tweets, 1-2 saves in a second. After then Cassandra got really slow and I had to kill a restart it.
Is this Cassandras's expected stability now? Would it be a good solution to write a daemon to kill/restart it every day or two?
No. Cassandra is widely expected to be more stable than that. If it is not stable, there is a substantial chance you have configured it wrong. It may be attempting to use more memory than you expect, for instance. If you have encountered a bug or defect in Cassandra, it is not one which is afflicting the majority of users.
As for your "restart daemon" plan, I'm going to go with "that's a horrible solution for pretty much everything, and especially so for something that you're trusting with any data you actually care about."
From https://cassandra.apache.org/
Cassandra is in use at Netflix, Twitter, Urban Airship, Constant
Contact, Reddit, Cisco, OpenX, Digg, CloudKick, Ooyala, and more
companies that have large, active data sets. The largest known
Cassandra cluster has over 300 TB of data in over 400 machines.
It was (is?) largely used at Facebook too. I would say it's stable. :)
And btw, I don't think it's meant to be restarted every 1-2 days at all: you use it if you have huge data sets with high availability (HA) requirements, and going down every 2 days is not HA.
I have a very small amount of data (~200 bytes) that I retrieve from the database very often. The write rate is insignificant.
I would like to get away from all the unnecessary database calls to fetch this almost static data. Is memcached a good use for this? Something else?
If it's of any relevance I'm running this on GAE using python. The data in question could be easily (de)serialized as json.
Memcache is well-suited for this - reading from the datastore is much more expensive than reading from memcache. This is especially true for small amounts of data for which the cost to retrieve is dominated by latency to the datastore.
If your app receives enough requests that instances typically stay alive for a little while, then you could go one step further and use App Caching to largely avoid memcache too. (Basically, cache the value in a global variable, and also app-cache the time the value was last updated. Provide an accessor for the value which retrieves the latest from memcache/db if it hasn't been updated in X minutes). Memcache is pretty cheap though, so this extra work might only make sense if you access this variable rather frequently.
If it changes less often than once per day, you could just hardcode it in webapp code, and reupload the file each time it changes.