I need to run a process that deals with a large volume of data. Too large to reasonably work with in RAM. The data does not need to be preserved between process instances or shared between process instances though, so I was hoping to use some kind of disk-based database for the storage.
My first thought was SQLite, but Heroku explicitly does not support that. The second thing that I tried was Ruby's PStore, but that turned out to be way too flaky for the job. The next thing that I tried was DBM, which Heroku does not explicitly say is unsupported. When I tried to run the code after deploying it, however, I got "LoadError: cannot load such file -- dbm".
Given that I do not need data to be persisted between process instances, is there any way to work around Heroku's bias against support for file-based relational or key/value data stores?
I'm not sure what the structure of the data is, but have you considered using Redis? There is an effective free tier on Heroku. The issue with file-based solutions is that Heroku uses an ephemeral file system. It's not designed to scale disk space among computing units.
Traditionally, in a non-serverless environment, I would have the following system. Say I have a custom ID generation protocol for all my models. Say I also have 20 servers scattered around. I give each server a slice of IDs to work with off the whole stack of IDs. When they are done or the server goes down, it returns the IDs back to the system so they don't get wasted. The reason for sending each server a batch of IDs is so that every time a new record is created you don't need to fetch from a central ID server to get the next ID. Instead they have a local set they can work with freely.
How would you do this sort of thing in a serverless system? I am deploying to Vercel and wondering what the appropriate architecture might be for such an ID batching system. There are other use cases for needing a persistent copy of data in a local server, so if you don't like the ID example just imagine another sort of system. How do you solve this optimization problem in a serverless environment?
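The batching scheme described in the question can be sketched roughly as follows. This is only an in-process illustration in Python: the names `IdBlockAllocator` and `LocalIdSource` are hypothetical, and in a real deployment the allocator would sit behind some network service with durable state rather than live in the same process as its callers.

```python
import threading


class IdBlockAllocator:
    """Central service: leases contiguous ID blocks and takes back unused ones."""

    def __init__(self, block_size=1000):
        self._block_size = block_size
        self._next_start = 1
        self._returned = []  # (start, end) ranges handed back by workers
        self._lock = threading.Lock()

    def lease(self):
        """Return a half-open range [start, end) of IDs owned by the caller."""
        with self._lock:
            if self._returned:
                # Reuse a returned range first so those IDs are not wasted.
                return self._returned.pop()
            start = self._next_start
            self._next_start += self._block_size
            return (start, start + self._block_size)

    def give_back(self, start, end):
        """Accept the unused tail of a block."""
        with self._lock:
            if start < end:
                self._returned.append((start, end))


class LocalIdSource:
    """Per-worker helper: serves IDs locally, contacting the allocator once per block."""

    def __init__(self, allocator):
        self._allocator = allocator
        self._next, self._end = allocator.lease()

    def next_id(self):
        if self._next >= self._end:
            # Local block exhausted: this is the only point that needs the central service.
            self._next, self._end = self._allocator.lease()
        value = self._next
        self._next += 1
        return value

    def release(self):
        """Call on shutdown to return whatever is left of the current block."""
        self._allocator.give_back(self._next, self._end)
        self._next = self._end
```

A worker that crashes without calling `release()` simply loses the rest of its block, which is the trade-off this kind of scheme usually accepts.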
Serverless is an approach. Like all such things (solutions), it should be matched to the problem - not the other way around. Is this simply a case where serverless is a good solution choice for dealing with 80% of your problem, and all you need to do is choose something appropriate to deal with the other 20%?
Assuming you have the freedom to do this, can't you just have the serverless parts of the solution consume non-serverless services - e.g. an ID Service?
Separately to this, caching comes to mind - just the general idea of having some data close by which might be mastered somewhere else. Caching patterns like Write Behind would allow you to work with local copies (i.e. immediate consumption) whilst farming out the cache-master communication.
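As a rough illustration of the write-behind idea, here is a minimal Python sketch. The class name `WriteBehindCache` and the `max_pending` threshold are assumptions made for the example, and the backing store is assumed to be anything dict-like (supporting `[]` lookup and `update()`); a real master would be a remote service.

```python
class WriteBehindCache:
    """Serve reads and writes locally; push changes to the master in batches."""

    def __init__(self, backing_store, max_pending=100):
        self._store = backing_store  # assumed dict-like: __getitem__ and update()
        self._data = {}              # local copies
        self._dirty = set()          # keys changed locally but not yet written back
        self._max_pending = max_pending

    def get(self, key):
        if key not in self._data:
            # Cache miss: fetch once from the master and keep a local copy.
            self._data[key] = self._store[key]
        return self._data[key]

    def put(self, key, value):
        # The write is visible locally at once; the master sees it later.
        self._data[key] = value
        self._dirty.add(key)
        if len(self._dirty) >= self._max_pending:
            self.flush()

    def flush(self):
        """Send all pending changes to the master in one batch."""
        batch = {key: self._data[key] for key in self._dirty}
        self._store.update(batch)
        self._dirty.clear()
```

Anything still pending when the process dies is lost, so the batch size is a trade-off between round trips to the master and how much work can be lost.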
I'm trying to build a very simple wiki-like system in Clojure and serving the http using Ring.
Instead of using a regular database I was thinking about using just an atom and serialising it to a file when it gets changed. Something like https://github.com/alandipert/enduro, just with a delayed write.
Having the data in-mem in vectors and maps will surely make the service faster and the code simpler/more intuitive to write?
Will that work with a multithreaded Jetty/Ring server?
The content of the atom will surely fit in memory for now, but that might not hold true in the future. Any ideas on how I can structure the code to make it easier to switch to an alternative storage backend in the future?
This is the best guide for keeping data in memory and storing it to a single file: http://www.brandonbloom.name/blog/2013/06/26/slurp-and-spit/
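The general shape of "data in memory, one file on disk, delayed write" can be sketched as below. The sketch is in Python rather than Clojure, and it is not taken from the linked guide or from enduro: the class name `FileBackedStore`, the JSON format, and the `delay_seconds` parameter are all assumptions for illustration. Values are assumed to be JSON-serialisable.

```python
import json
import os
import tempfile
import threading


class FileBackedStore:
    """In-memory dict guarded by a lock, persisted to one JSON file after a delay."""

    def __init__(self, path, delay_seconds=2.0):
        self._path = path
        self._delay = delay_seconds
        self._lock = threading.Lock()        # protects _data and _timer
        self._write_lock = threading.Lock()  # keeps file writes in snapshot order
        self._timer = None
        self._data = {}
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as fh:
                self._data = json.load(fh)

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)

    def put(self, key, value):
        with self._lock:
            self._data[key] = value
            # Schedule at most one pending write; later changes ride along with it.
            if self._timer is None:
                self._timer = threading.Timer(self._delay, self.flush)
                self._timer.daemon = True
                self._timer.start()

    def flush(self):
        """Write a snapshot to a temp file, then atomically swap it into place."""
        with self._write_lock:
            with self._lock:
                snapshot = json.dumps(self._data)
                if self._timer is not None:
                    self._timer.cancel()
                    self._timer = None
            directory = os.path.dirname(os.path.abspath(self._path))
            fd, tmp_path = tempfile.mkstemp(dir=directory)
            with os.fdopen(fd, "w", encoding="utf-8") as fh:
                fh.write(snapshot)
            os.replace(tmp_path, self._path)
```

If request handlers only ever call `get` and `put`, a different class with the same two methods (backed by a real database, say) can be swapped in later without touching the handlers.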
Datomic would give you a few options.
You could use the in-memory db which would give you query power and thread safety. It would also be very easy to switch to a persistent datastore if/when the time comes. However, I'm not sure about serialization of the in-memory db.
Or you could use Datomic just for Datalog, which can be used for querying data structures. In that case, you could use an atom and then serialize as planned. Moving to a persistent datastore would be more work than the first case, but still not much. In either case, most of your code wouldn't need to change.
In my opinion, you'd be better off just starting with the free version of Datomic that uses the file system as a datastore. I don't think using an atom simplifies your code very much.
I second the recommendation for Datomic.
I've been using it on a "real" project for a few weeks now, and the more I use it, the more I realize that it would be a solid foundation for handling your data in any non-trivial project. Even if you never plan to use a "real" database in the future, just having a fact-based data model, powerful querying, and even full-text search built in is a huge win over just using an atom to store some huge map.
I checked and the free version does give you local storage as well as the in-memory database, so that would solve your storage needs perfectly (it uses an H2 database behind the scenes). And if you ever find yourself needing to scale to something bigger, you're already set.
I'm writing a document editing web service, in which documents can be edited via a website, or locally and pushed via git. I'm trying to decide if the documents should be stored as individual documents on the filesystem, or in a database. The points I'm wondering are:
If they're in a database, is there any way for git to see the documents?
How much higher are the overheads using the filesystem? I assume the OS is doing a lot more work. How can I alleviate some of this? For example, the web editor autosaves, what would the best way to cache the save data be, to minimise writes?
Does one scale significantly better or worse than the other? If all goes according to plan, this will be a service with many thousands of documents being accessed and edited.
If the documents go into a database, git can't directly see the documents. git will see the backing storage file(s) for the database, but have no way of correlating changes there to changes to files.
The overhead of using the database is higher than using a filesystem, as answered by Carlos. Databases are optimized for transactions, which they'll do in memory, but they still have to hit the file. Unless you program the application to do database transactions at a sub-document level (e.g. changing only modified lines), the database will give you no performance improvement. Most modern filesystems do caching, and you can 'write' in a way that will sit in RAM rather than going to your backing storage as well. You'll need to manage the granularity of the 'autosaves' in your application (every change? every 30 seconds? 5 minutes?), but really, doing it at the same granularity with a database will cause the same amount of traffic to the backing store.
I think you intended to ask "does the filesystem scale as well as the database"? :) If you have some way to organize your files per-user, and you figure out the security issue of a particular user only being able to access/modify the files they should be able to (which is doable, IMO), the filesystem should work fine.
The filesystem will always be faster than a DB, because after all, DBs store their data in the filesystem!
Git is quite efficient on its own, as proven by GitHub, so I say you stick with git and work around it.
After all, Linus should know something... ;)
Need to be able to set server(s) that replicate all information, as a master data store that has all the data.
Also need servers that specifically store/replicate certain data, available in local LANs, so that when the internet connection goes down, they can still access their local data. Under normal circumstances, the clients will access most of their data from the local LAN, and may use others when the local LAN server goes down.
This is wanted alongside the benefits of a distributed data store, such as failure resistance and speed.
Which Distributed Key-Value Data Store or other data storage method would be most suited for this?
Try out CouchDB. Your use case reads like it was built for it. Granted, CouchDB is much more than a key/value store, but that makes it no less suitable for the job.
You get replication and, as an added bonus, fault tolerance, conflict detection (and resolution), and an easy HTTP API.
Let me know if you have any other questions.
Of course you must remember that replication is something completely different from backup, because one system's programmatic failure in handling the data can quickly replicate to other nodes resulting in total mayhem.
Maybe using a Hadoop File System or OpenAFS would be a good solution here?
I haven't used any of those systems in real-life scenarios, only had interest in them during my research on peer-to-peer and distributed storage solutions, but I think they're worth a try.
Have you checked out Microsoft's new Velocity? http://msdn.microsoft.com/en-us/data/cc655792.aspx. Unlike many other cloud services, you can run the setup (for Velocity) on your premises.
I'm working on a web app that is somewhere between an email service and a social network. I feel it has the potential to grow really big in the future, so I'm concerned about scalability.
Instead of using one centralized MySQL/InnoDB database and then partitioning it when that time comes, I've decided to create a separate SQLite database for each active user: one active user per 'shard'.
That way backing up the database would be as easy as copying each user's small database file to a remote location once a day.
Scaling up will be as easy as adding extra hard disks to store the new files.
When the app grows beyond a single server I can link the servers together at the filesystem level using GlusterFS and run the app unchanged, or rig up a simple SQLite proxy system that will allow each server to manipulate sqlite files in adjacent servers.
Concurrency issues will be minimal because each HTTP request will only touch one or two database files at a time, out of thousands, and SQLite only blocks on writes anyway.
I'm betting that this approach will allow my app to scale gracefully and support lots of cool and unique features. Am I betting wrong? Am I missing anything?
UPDATE I decided to go with a less extreme solution, which is working fine so far. I'm using a fixed number of shards - 256 sqlite databases, to be precise. Each user is assigned and bound to a random shard by a simple hash function.
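A fixed-shard assignment of this kind might look like the sketch below. The post does not say which hash function was used, so the SHA-1 choice, the function names, and the `shards/shard_NNN.sqlite` file layout are all assumptions.

```python
import hashlib

NUM_SHARDS = 256


def shard_for_user(user_id):
    """Map a user ID to a shard number in [0, NUM_SHARDS), the same way every time."""
    # A stable digest is used instead of Python's built-in hash(), which is
    # randomised per process for strings and would move users between runs.
    digest = hashlib.sha1(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS


def shard_path(user_id, base_dir="shards"):
    """Hypothetical file layout: one SQLite file per shard."""
    return f"{base_dir}/shard_{shard_for_user(user_id):03d}.sqlite"
```

Because the mapping depends on `NUM_SHARDS`, changing the shard count later means moving users between files, which is one reason to fix the number up front.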
Most features of my app require access to just one or two shards per request, but there is one in particular that requires the execution of a simple query on 10 to 100 different shards out of 256, depending on the user. Tests indicate it would take about 0.02 seconds, or less, if all the data is cached in RAM. I think I can live with that!
UPDATE 2.0 I ported the app to MySQL/InnoDB and was able to get about the same performance for regular requests, but for that one request that requires shard walking, InnoDB is 4-5 times faster. For this reason, and other reasons, I'm dropping this architecture, but I hope someone somewhere finds a use for it... thanks.
The place where this will fail is if you have to do what's called "shard walking" - which is finding out all the data across a bunch of different users. That particular kind of "query" will have to be done programmatically, asking each of the SQLite databases in turn - and will very likely be the slowest aspect of your site. It's a common issue in any system where data has been "sharded" into separate databases.
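To make the cost concrete, here is a minimal sketch of shard walking using Python's standard `sqlite3` module. The function name `walk_shards` is hypothetical, and the sketch assumes every shard file has the same schema.

```python
import sqlite3


def walk_shards(shard_paths, sql, params=()):
    """Run the same query against each shard in turn and concatenate the rows."""
    rows = []
    for path in shard_paths:
        conn = sqlite3.connect(path)
        try:
            rows.extend(conn.execute(sql, params).fetchall())
        finally:
            conn.close()
    return rows
```

Each shard costs a separate open, query, and close, and any ordering, limit, or aggregate that spans shards has to be applied in application code after the rows are merged.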
If all of the data is self-contained to the user, then this should scale pretty well - the key to making this an effective design is to know how the data is likely going to be used and whether data from one person will be interacting with data from another (in your context).
You may also need to watch out for file system resources - SQLite is great, awesome, fast, etc - but you do get some caching and writing benefits when using a "standard database" (i.e. MySQL, PostgreSQL, etc) because of how they're designed. In your proposed design, you'll be missing out on some of that.
Sounds to me like a maintenance nightmare. What happens when the schema changes on all those DBs?
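One common way to keep this manageable is to store a schema version inside each file and upgrade files one at a time. The sketch below uses SQLite's `PRAGMA user_version` for that; the `MIGRATIONS` list and the `messages` table are hypothetical examples, not anything from the question.

```python
import sqlite3

# Hypothetical migrations: entry i upgrades a database from version i to i + 1.
MIGRATIONS = [
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)",
    "ALTER TABLE messages ADD COLUMN created_at TEXT",
]


def migrate(path):
    """Bring one SQLite file up to the latest schema version and return that version."""
    # isolation_level=None puts the connection in autocommit mode, so the
    # BEGIN/COMMIT statements below control the transactions explicitly.
    conn = sqlite3.connect(path, isolation_level=None)
    try:
        version = conn.execute("PRAGMA user_version").fetchone()[0]
        for target in range(version, len(MIGRATIONS)):
            conn.execute("BEGIN")
            try:
                conn.execute(MIGRATIONS[target])
                # The version bump is committed together with the schema change.
                conn.execute(f"PRAGMA user_version = {target + 1}")
                conn.execute("COMMIT")
            except Exception:
                conn.execute("ROLLBACK")
                raise
        return len(MIGRATIONS)
    finally:
        conn.close()
```

Running `migrate` over every database file, or lazily whenever a file is first opened, lets files at different versions coexist while a rollout is in progress. It does not remove the operational burden, but it makes each file's state explicit.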
http://freshmeat.net/projects/sphivedb
SPHiveDB is a server for SQLite databases. It uses JSON-RPC over HTTP to expose a network interface to SQLite. It supports combining multiple SQLite databases into one file. It also supports the use of multiple files. It is designed for the extreme sharding schema -- one SQLite database per user.
One possible problem is that having one database for each user will use disk space and RAM very inefficiently, and as the user base grows the benefit of using a light and fast database engine will be lost completely.
A possible solution to this problem is to create "minishards" consisting of maybe 1024 SQLite databases housing up to 100 users each. This will be more efficient than the DB-per-user approach, because data is packed more efficiently, and lighter than the InnoDB database server approach, because we're using SQLite.
Concurrency will also be pretty good, but queries will be less elegant (shard_id yuckiness). What do you think?
If you're creating a separate database for each user, it sounds like you're not setting up relationships... so why use a relational database at all?
If your data is this easy to shard, why not just use a standard database engine, and if you scale large enough that the DB becomes the bottleneck, shard the database, with different users in different instances? The effect is the same, but you're not using scores of tiny little databases.
In reality, you probably have at least some shared data that doesn't belong to any single user, and you probably frequently need to access data for more than one user. This will cause problems with either system, though.
I am considering this same architecture, as I basically wanted to use the server-side SQLite databases as a backup and syncing copy for clients. My idea for querying across all the data is to use Sphinx for full-text search and run Hadoop jobs from flat dumps of all the data to Scribe and then expose the results as web services. This post gives me some pause for thought, however, so I hope people will continue to respond with their opinions.
Having one database per user would make it really easy to restore individual users' data of course, but as @John said, schema changes would require some work.
Not enough to make it hard, but enough to make it non-trivial.