Extreme Sharding: One SQLite Database Per User

I'm working on a web app that is somewhere between an email service and a social network. I feel it has the potential to grow really big in the future, so I'm concerned about scalability.
Instead of using one centralized MySQL/InnoDB database and then partitioning it when that time comes, I've decided to create a separate SQLite database for each active user: one active user per 'shard'.
That way backing up the database would be as easy as copying each user's small database file to a remote location once a day.
Scaling up will be as easy as adding extra hard disks to store the new files.
When the app grows beyond a single server I can link the servers together at the filesystem level using GlusterFS and run the app unchanged, or rig up a simple SQLite proxy system that will allow each server to manipulate sqlite files in adjacent servers.
Concurrency issues will be minimal because each HTTP request will only touch one or two database files at a time, out of thousands, and SQLite only blocks on reads anyway.
I'm betting that this approach will allow my app to scale gracefully and support lots of cool and unique features. Am I betting wrong? Am I missing anything?
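A note on the daily backup idea: copying a database file while it is being written can capture a half-finished state. SQLite has an online backup API for this, which Python exposes as Connection.backup. A minimal sketch, assuming one file per user; the function name and paths are made up for illustration:

```python
import sqlite3

def backup_database(src_path, dest_path):
    """Copy a live SQLite database to dest_path using the online backup API."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Copies every page of src into dest in a consistent state.
        src.backup(dest)
    finally:
        dest.close()
        src.close()
```

The resulting file is an ordinary SQLite database, so it can then be shipped to a remote location like any other file.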
UPDATE I decided to go with a less extreme solution, which is working fine so far. I'm using a fixed number of shards - 256 sqlite databases, to be precise. Each user is assigned and bound to a random shard by a simple hash function.
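The post does not say which hash function was used, so the following is only a sketch of the general idea, assuming SHA-256 and a hypothetical file layout. Python's built-in hash() is avoided because it is salted per process for strings, which would send the same user to different shards after a restart.

```python
import hashlib

NUM_SHARDS = 256  # fixed shard count mentioned in the post

def shard_for_user(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a stable shard number in [0, num_shards)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards

def shard_path(user_id: str) -> str:
    """Hypothetical file layout: shards/shard_000.sqlite ... shard_255.sqlite."""
    return f"shards/shard_{shard_for_user(user_id):03d}.sqlite"
```

Because the shard count is fixed, a user stays bound to the same file for as long as NUM_SHARDS does not change.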
Most features of my app require access to just one or two shards per request, but there is one in particular that requires the execution of a simple query on 10 to 100 different shards out of 256, depending on the user. Tests indicate it would take about 0.02 seconds, or less, if all the data is cached in RAM. I think I can live with that!
UPDATE 2.0 I ported the app to MySQL/InnoDB and was able to get about the same performance for regular requests, but for that one request that requires shard walking, InnoDB is 4-5 times faster. For this reason, and others, I'm dropping this architecture, but I hope someone somewhere finds a use for it...thanks.

The place where this will fail is if you have to do what's called "shard walking" - which is finding out all the data across a bunch of different users. That particular kind of "query" will have to be done programmatically, asking each of the SQLite databases in turn - and will very likely be the slowest aspect of your site. It's a common issue in any system where data has been "sharded" into separate databases.
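To make "shard walking" concrete, here is a rough sketch using Python's sqlite3 module. The function name is made up, and it assumes every shard file has the same schema:

```python
import sqlite3

def walk_shards(shard_paths, sql, params=()):
    """Run the same query against each shard in turn and merge the rows."""
    rows = []
    for path in shard_paths:
        conn = sqlite3.connect(path)
        try:
            rows.extend(conn.execute(sql, params).fetchall())
        finally:
            conn.close()
    return rows
```

Each shard costs a separate open, query and close, and any sorting or aggregation across shards has to happen in application code afterwards, which is where the slowness comes from.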
If all of the data is self-contained to the user, then this should scale pretty well - the key to making this an effective design is to know how the data is likely going to be used and if data from one person will be interacting with data from another (in your context).
You may also need to watch out for file system resources - SQLite is great, awesome, fast, etc - but you do get some caching and writing benefits when using a "standard database" (i.e. MySQL, PostgreSQL, etc) because of how they're designed. In your proposed design, you'll be missing out on some of that.

Sounds to me like a maintenance nightmare. What happens when the schema changes on all those DBs?
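One common way to keep this manageable is to store a schema version inside each file and apply pending migrations on demand. A minimal sketch, assuming SQLite's PRAGMA user_version is used as the version counter; the MIGRATIONS list and table names are invented for illustration:

```python
import sqlite3

# Hypothetical ordered migrations: entry i upgrades version i to version i + 1.
MIGRATIONS = [
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)",
    "ALTER TABLE messages ADD COLUMN created_at TEXT",
]

def migrate(path):
    """Bring one database file up to the latest version; return that version."""
    # isolation_level=None: no implicit transactions, we issue BEGIN/COMMIT ourselves.
    conn = sqlite3.connect(path, isolation_level=None)
    try:
        version = conn.execute("PRAGMA user_version").fetchone()[0]
        for target, statement in enumerate(MIGRATIONS[version:], start=version + 1):
            conn.execute("BEGIN")
            conn.execute(statement)
            conn.execute(f"PRAGMA user_version = {target}")
            conn.execute("COMMIT")
        return conn.execute("PRAGMA user_version").fetchone()[0]
    finally:
        conn.close()
```

Running migrate on every file (or lazily, the first time a user's database is opened) is safe to repeat, since files already at the latest version are left alone.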

http://freshmeat.net/projects/sphivedb
SPHiveDB is a server for SQLite databases. It uses JSON-RPC over HTTP to expose a network interface to SQLite databases. It supports combining multiple SQLite databases into one file. It also supports the use of multiple files. It is designed for the extreme sharding schema -- one SQLite database per user.

One possible problem is that having one database for each user will use disk space and RAM very inefficiently, and as the user base grows the benefit of using a light and fast database engine will be lost completely.
A possible solution to this problem is to create "minishards" consisting of maybe 1024 SQLite databases housing up to 100 users each. This will be more efficient than the DB per user approach, because data is packed more efficiently, and lighter than the InnoDB database server approach, because we're using SQLite.
Concurrency will also be pretty good, but queries will be less elegant (shard_id yuckiness). What do you think?

If you're creating a separate database for each user, it sounds like you're not setting up relationships... so why use a relational database at all?

If your data is this easy to shard, why not just use a standard database engine, and if you scale large enough that the DB becomes the bottleneck, shard the database, with different users in different instances? The effect is the same, but you're not using scores of tiny little databases.
In reality, you probably have at least some shared data that doesn't belong to any single user, and you probably frequently need to access data for more than one user. This will cause problems with either system, though.

I am considering this same architecture, as I basically wanted to use the server-side SQLite databases as a backup and syncing copy for clients. My idea for querying across all the data is to use Sphinx for full-text search and run Hadoop jobs from flat dumps of all the data to Scribe and then expose the results as web services. This post gives me some pause for thought, however, so I hope people will continue to respond with their opinions.

Having one database per user would make it really easy to restore an individual user's data of course, but as John said, schema changes would require some work.
Not enough to make it hard, but enough to make it non-trivial.

Related

DB recommendation - Portable, Concurrent (multiple read only, one write)

I'm looking for a portable database solution I can use with a website that is designed to handle service outages. I need to retrieve a list of users from SQL Server nightly and upsert their details into a portable database. It's roughly 250,000 users (and growing) and each one has probably 25 fields that are required. Of those fields, I'd say fewer than 5 need to be searched on. The rest just need retrieving.
The idea is, in times of a service outage, we can use a website that's designed to work from the portable database rather than SQL Server. Our long term goal, is to move to the cloud and handle things in an entirely different way, but for the short term this is our aim.
The website is going to be a .Net Core web api, so it will be accessed by multiple users in multiple threads. The website will only ever need read access; it will not be updating these details at all.
To keep the portable database up-to-date I'm thinking of having another application that just runs nightly to update the data. Our business is 24 hours (albeit quieter overnight), so there is a chance this updater will be running while the website is in use. While a service outage would suggest the SQL Server is down, this may not be the case. There are other factors in play that could cause what we would describe as outages. This will be the only piece of software updating the database.
I've tried using LiteDB but I couldn't get it working in a way that met my concurrency requirements. It did seem to do some of the job, and was easy to get running. However, I'd often run into locked files due to the nature of web api. I did work out a solution for that, but then the updater app couldn't access the database file.
Does anyone have any recommendations I can look into?
Given the description of the problem (1 table, 250k rows with - I assume - a relatively fast growth rate) and the requirements, I don't think a relational database is what you are looking for.
I think NoSQL databases, or, more specifically, document-oriented databases are better suited to your requirements. There are many choices: Mongo, Cassandra, CouchDB, ... the choice is yours.
Personally I have some experience with ElasticSearch (https://www.elastic.co/elasticsearch), that is quite easy to learn, is portable (runs on Linux, Windows, Containers, etc...), is scalable, and it is fast. I mean, really, really fast, you can get results in 10-20 milliseconds (even less, sometimes).
The NEST nuget package acts as a high level client for working with ElasticSearch (https://www.elastic.co/guide/en/elasticsearch/client/net-api/7.x/nest-getting-started.html)

Which NoSQL backend to store trace data from webpage

In our web application we need to trace what users click, what they write into search box, etc. Lots of data will be sent by AJAX. Generally functionality is a bit similar to google analytics, but we need to customize it in different ways.
Data will be collected and once per day aggregated and exported to PostgreSQL, so the backend should be able to handle dozens of inserts. I'm not considering a traditional SQL database, because it probably won't handle so many inserts efficiently.
Which backend would you use for such a task? I'm currently thinking about MongoDB or Cassandra. But maybe you know better software for that task? Maybe something other than a NoSQL database?
Web application is written in Ruby on Rails so support for Ruby would be nice but that's definitely not the most important.
Sounds like you need to analyse your specific requirements.
It may be that the best solution is to split / partition / shard a conventional database and then push the data up from there.
Depending on what your tolerance for data loss is, there are a lot of options. If you choose a system which has single-server durability, a major source of write bottleneck will be fdatasync() (assuming you use hard drives to store your data on).
If you can tolerate syncing less often than on every commit, then you may be able to tune your database to commit at timed intervals.
Depending on your table, index structure etc, I'd expect that you can get rather a lot of inserts with a "conventional" db (e.g. postgresql), if you manage it correctly and tune the durability (if it supports that) to your liking.
Sharding this into several instances of course will enable you to scale this up. However, you need to be mindful of operational requirements (i.e. what happens if some of the instances are down). Talk to your Ops team about what they're comfortable managing.

SQLite as a production database for a low-traffic site?

I'm considering using SQLite as a production database for a site that would receive perhaps 20 simultaneous users, but with the potential for a peak that could be many multiples of that (since the site would be accessible on the open internet and there's always a possibility that someone will post a link somewhere that could drive many people to the site all at once).
Is SQLite a possibility?
I know it's not an ideal production scenario. I'm only asking if this is within the realm of being a realistic possibility.
SQLite supports only limited concurrency (many readers, but only one writer at a time), so you may have problems running it on a production website. If you're looking for a 'lighter' database, perhaps consider trying a contemporary object-document store like CouchDB.
By all means, continue to develop against SQLite, and you're probably fine to use it initially. If you find your application has more users down the track, you're going to want to transition to Postgres or MySQL however.
The author of SQLite addresses this on the website:
SQLite works great as the database engine for most low to medium traffic websites (which is to say, most websites). The amount of web traffic that SQLite can handle depends on how heavily the website uses its database. Generally speaking, any site that gets fewer than 100K hits/day should work fine with SQLite. The 100K hits/day figure is a conservative estimate, not a hard upper bound. SQLite has been demonstrated to work with 10 times that amount of traffic.
The SQLite website (https://www.sqlite.org/) uses SQLite itself, of course, and as of this writing (2015), it handles about 400K to 500K HTTP requests per day, about 15-20% of which are dynamic pages touching the database. Dynamic content uses about 200 SQL statements per webpage. This setup runs on a single VM that shares a physical server with 23 others and yet still keeps the load average below 0.1 most of the time.
So I think the long and short of it is, go for it, and if it's not working well for you, making the transition to an enterprise-class database is fairly trivial anyway. Do take care of your schema, however, and design your database with growth and efficiency in mind.
Here's a thread with some more independent comments around using SQLite for a production web application. It sounds like it has been used with some mixed results.
Edit (2014):
Since this answer was posted, SQLite now features a multi-threaded mode and write ahead logging mode which may influence your evaluation of its suitability for low-medium traffic sites.
Charles Leifer has written a blog post about SQLite's WAL (write ahead logging) feature and some well-considered opinions on appropriate use cases.
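For reference, WAL is switched on with a pragma. A small sketch using Python's sqlite3 module; the helper name is made up:

```python
import sqlite3

def enable_wal(path):
    """Switch the database at path to write-ahead logging; return the resulting mode."""
    conn = sqlite3.connect(path)
    try:
        # The pragma returns the journal mode actually in effect afterwards.
        return conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    finally:
        conn.close()
```

The WAL setting is stored in the database file, so it stays in effect for later connections. In WAL mode readers do not block the writer and the writer does not block readers, though there is still only one writer at a time.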
This small excerpt from the SQLite website says it all:
Is the data separated from the application by a network? → choose client/server
Many concurrent writers? → choose client/server
Big data? → choose client/server
Otherwise → choose SQLite!
SQLite "just works" (until it doesn't of course)
We often use SQLite for internal databases: the employee directory, our calendar of events, and other intranet services all run on lightweight databases. It would be major overkill to be running these apps at the scale we do on a "real" database like MySQL. This is especially true when you factor in that they're running alongside 4 other virtual machines on a single mid-range computer.
At one point we had an outward facing site that ran on an sqlite db for months with only a single reboot required. Obviously, it was very low traffic, but it putted along nicely for what it did.
We have encountered a similar option on an environment with absolutely no writes, and we selected using SQLite.
See my blog post on the subject:
Well, the main assumption which makes this solution theoretically possible is that our SQLite database is totally read-only. Our server code should never change it. This would solve any locking problems, as there are no read locks. We could find nowhere on the internet anyone saying there is a problem in high-throughput reading of SQLite when there are no writes - it could be possible!
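If the database really is read-only from the server's point of view, that can be enforced when opening it. A sketch using SQLite's URI filename syntax with mode=ro; the helper name is hypothetical:

```python
import pathlib
import sqlite3

def open_read_only(path):
    """Open an existing SQLite file so that any write attempt fails."""
    # as_uri() yields a properly escaped file:// URI for the absolute path.
    uri = pathlib.Path(path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

With a connection opened this way, an accidental INSERT or UPDATE raises sqlite3.OperationalError instead of silently taking a write lock.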
I think it would depend mostly on what your read/write ratio will be. If it's mostly reading from the database, you may be okay. Multi-user writing in SQLite can be a problem because of how it locks the database.
People speak about concurrency problems, but sqlite has a way to cache incoming requests and have them wait for some time. It doesn't timeout immediately.
I've read things about the default timeout setting being zero, meaning it times out immediately, and that's nonsense. Maybe people didn't adjust this setting?
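For context: the SQLite C library installs no busy handler by default, so a locked database fails immediately unless a timeout is set, while Python's sqlite3.connect defaults to timeout=5.0 seconds. A sketch of setting the wait explicitly through the busy_timeout pragma (value in milliseconds); the helper name is made up:

```python
import sqlite3

def connect_with_busy_timeout(path, milliseconds=10000):
    """Open a connection that waits up to `milliseconds` for locks to clear."""
    conn = sqlite3.connect(path)
    # int() keeps the interpolated value a plain number.
    conn.execute(f"PRAGMA busy_timeout = {int(milliseconds)}")
    return conn
```

With this set, a statement that hits a locked database retries until the timeout elapses before giving up with a "database is locked" error.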
Depends on the usage of the site. If most of the time you're just reading data, you can pretty much use anything for a DB and cache the data in the application to achieve good performance.
I am using it in a very low traffic web server (it is a genomic database) and I don't have any problems. But there are only SELECT statements, no writing to the DB involved.
To add to an already brilliant answer: since you are working with a server-less solution in this case, you can say goodbye to replication, or any sort of horizontal scaling of your db, as well as other advanced options. It also isn't the best choice if you have multiple users updating the same exact chunk of information. If you were to shard the database in the future you would have to migrate the data and move to something else. Also, if you have a load balancer and multiple systems involved, it would be difficult to maintain data centrality if using SQLite. These are just some of the reasons why it isn't recommended. It's great for smaller projects, and great for development.
It seems like with queuing you could also get away with avoiding a lot of the concurrency write problems with SQLite. Instead of writing directly to the sqlite db you would write to a queue that then in turn sequentially writes to the sqlite db in a first in first out mode. Not sure if your application reaches to where you would need this if it would be worth writing or just moving on to client/server DB...but a thought.
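A rough sketch of that queuing idea, assuming a single process where request handlers hand their writes to one background thread. The class and its method names are invented for illustration, and error handling is left out:

```python
import queue
import sqlite3
import threading

class WriteQueue:
    """Funnel all writes through one thread so SQLite only ever sees one writer."""

    def __init__(self, path):
        self._queue = queue.Queue()  # FIFO, safe to use from many threads
        self._thread = threading.Thread(target=self._run, args=(path,), daemon=True)
        self._thread.start()

    def submit(self, sql, params=()):
        """Enqueue a write; it is applied later, in submission order."""
        self._queue.put((sql, params))

    def close(self):
        """Apply everything still queued, then stop the writer thread."""
        self._queue.put(None)
        self._thread.join()

    def _run(self, path):
        # The connection is created and used only inside the writer thread.
        conn = sqlite3.connect(path)
        while True:
            item = self._queue.get()
            if item is None:
                break
            sql, params = item
            conn.execute(sql, params)
            conn.commit()
        conn.close()
```

The trade-off is that callers no longer learn whether their write succeeded at the moment they submit it, and the queue only helps within one process.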

In Memory Database

I'm using SQL Server to drive a WPF application. I'm currently using NHibernate and pre-read all the data so it's cached for performance reasons. That works for a single client app, but I was wondering if there's an in-memory database that I could use so I can share the information across multiple apps on the same machine. Ideally this would sit below my NHibernate stack, so my code wouldn't have to change. Effectively I'm looking to move my DB from its traditional format on the server to be an in-memory DB on the client.
Note I only need select functionality.
I would be incredibly surprised if you even need to load all your information in memory. I say this because, just as one example, I'm working on a Web app at the moment that (for various reasons) loads thousands of records on many pages. This is PHP + MySQL. And even so it can do it and render a page in well under 100ms.
Before you go down this route make sure that you have to. First make your database as performant as possible. Now obviously this includes things like having appropriate indexes and tuning your database, but even those are putting the cart before the horse.
First and foremost you need to make sure you have a good relational data model: one that lends itself to performant queries. This is as much art as it is science.
Also, you may like NHibernate but ORMs are not always the best choice. There are some corner cases, for example, that hand-coded SQL will be vastly superior in.
Now assuming you have a good data model and assuming you've then optimized your indexes and database parameters and then you've properly configured NHibernate, then and only then should you consider storing data in memory if and only if performance is still an issue.
To put this in perspective, the only times I've needed to do this are on systems that need to perform millions of transactions per day.
One reason to avoid in-memory caching is because it adds a lot of complexity. You have to deal with issues like cache expiry, independent updates to the underlying data store, whether you use synchronous or asynchronous updates, how you give the client a consistent (if not up-to-date) view of your data, how you deal with failover and replication and so on. There is a huge complexity cost to be paid.
Assuming you've done all the above and you still need it, it sounds to me like what you need is a cache or grid solution. Here is an overview of Java grid/cluster solutions but many of them (eg Coherence, memcached) apply to .Net as well. Another choice for .Net is Velocity.
It needs to be pointed out and stressed that something like NHibernate is only consistent so long as nothing externally updates the database and that there is exactly one NHibernate-enabled process (barring clustered solutions). If two desktop apps on two different PCs are both updating the same database with NHibernate the caching simply won't work because the persistence units simply won't be aware of the changes the other is making.
http://www.db4o.com/ can be your friend!
Velocity is an out of process object caching server designed by Microsoft to do pretty much what you want although it's only in CTP form at the moment.
I believe there are also wrappers for memcached, which can also be used to cache objects.
You can use HANA, express edition. You can download it for free, it's in-memory, columnar and allows for further analytics capabilities such as text analytics, geospatial or predictive. You can also access with ODBC, JDBC, node.js hdb library, REST APIs among others.

What are the advantages of using a single database for EACH client?

In a database-centric application that is designed for multiple clients, I've always thought it was "better" to use a single database for ALL clients - associating records with proper indexes and keys. In listening to the Stack Overflow podcast, I heard Joel mention that FogBugz uses one database per client (so if there were 1000 clients, there would be 1000 databases). What are the advantages of using this architecture?
I understand that for some projects, clients need direct access to all of their data - in such an application, it's obvious that each client needs their own database. However, for projects where a client does not need to access the database directly, are there any advantages to using one database per client? It seems that in terms of flexibility, it's much simpler to use a single database with a single copy of the tables. It's easier to add new features, it's easier to create reports, and it's just easier to manage.
I was pretty confident in the "one database for all clients" method until I heard Joel (an experienced developer) mention that his software uses a different approach -- and I'm a little confused with his decision...
I've heard people cite that databases slow down with a large number of records, but any relational database with some merit isn't going to have that problem - especially if proper indexes and keys are used.
Any input is greatly appreciated!
Assume there's no scaling penalty for storing all the clients in one database; for most people, and well configured databases/queries, this will be fairly true these days. If you're not one of these people, well, then the benefit of a single database is obvious.
In this situation, benefits come from the encapsulation of each client. From the code perspective, each client exists in isolation - there is no possible situation in which a database update might overwrite, corrupt, retrieve or alter data belonging to another client. This also simplifies the model, as you don't need to ever consider the fact that records might belong to another client.
You also get benefits of separability - it's trivial to pull out the data associated with a given client, and move it to a different server. Or restore a backup of that client when they call up to say "We've deleted some key data!", using the builtin database mechanisms.
You get easy and free server mobility - if you outscale one database server, you can just host new clients on another server. If they were all in one database, you'd need to either get beefier hardware, or run the database over multiple machines.
You get easy versioning - if one client wants to stay on software version 1.0, and another wants 2.0, where 1.0 and 2.0 use different database schemas, there's no problem - you can migrate one without having to pull them out of one database.
I can think of a few dozen more, I guess. But all in all, the key concept is "simplicity". The product manages one client, and thus one database. There is never any complexity from the "But the database also contains other clients" issue. It fits the mental model of the user, where they exist alone. Advantages like being able to do easy reporting on all clients at once are minimal - how often do you want a report on the whole world, rather than just one client?
Here's one approach that I've seen before:
Each customer has a unique connection string stored in a master customer database.
The database is designed so that everything is segmented by CustomerID, even if there is a single customer on a database.
Scripts are created to migrate all customer data to a new database if needed, and then only that customer's connection string needs to be updated to point to the new location.
This allows for using a single database at first, and then easily segmenting later on once you've got a large number of clients, or more commonly when you have a couple of customers that overuse the system.
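The connection-string lookup described above could look roughly like this. It is only a sketch: the master database is modeled with SQLite for brevity, and the table, column and function names are all assumptions:

```python
import sqlite3

def init_master(master):
    """Create the hypothetical customers table in the master database."""
    with master:
        master.execute(
            "CREATE TABLE IF NOT EXISTS customers ("
            "customer_id INTEGER PRIMARY KEY, dsn TEXT NOT NULL)"
        )

def dsn_for(master, customer_id):
    """Look up where a customer's data currently lives."""
    row = master.execute(
        "SELECT dsn FROM customers WHERE customer_id = ?", (customer_id,)
    ).fetchone()
    if row is None:
        raise KeyError(customer_id)
    return row[0]

def repoint(master, customer_id, new_dsn):
    """After the customer's data has been migrated, update only their connection string."""
    with master:
        master.execute(
            "UPDATE customers SET dsn = ? WHERE customer_id = ?",
            (new_dsn, customer_id),
        )
```

Application code calls dsn_for on each request (or caches the result), so moving a heavy customer to their own database is a data copy followed by one repoint call.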
I've found that restoring specific customer data is really tough when all the data is in the same database, but managing upgrades is much simpler.
When using a single database per customer, you run into a huge problem of keeping all customers running at the same schema version, and that doesn't even consider backup jobs on a whole bunch of customer-specific databases. Naturally restoring data is easier, but if you make sure not to permanently delete records (just mark with a deleted flag or move to an archive table), then you have less need for database restore in the first place.
To keep it simple. You can be sure that your client is only seeing their data. The client with fewer records doesn't have to pay the penalty of having to compete with hundreds of thousands of records that may be in the database but not theirs. I don't care how well everything is indexed and optimized, there will be queries that determine that they have to scan every record.
Well, what if one of your clients tells you to restore to an earlier version of their data due to some botched import job or similar? Imagine how your clients would feel if you told them "you can't do that, since your data is shared between all our clients" or "Sorry, but your changes were lost because client X demanded a restore of the database".
As for the pain of upgrading 1000 database servers at once, some fairly simple automation should take care of that. As long as each database maintains an identical schema, then it won't really be an issue. We also use the database per client approach, and it works well for us.
Here is an article on this exact topic (yes, it is MSDN, but it is a technology independent article): http://msdn.microsoft.com/en-us/library/aa479086.aspx.
Another discussion of multi-tenancy as it relates to your data model here: http://www.ayende.com/Blog/archive/2008/08/07/Multi-Tenancy--The-Physical-Data-Model.aspx
Scalability. Security. Our company uses the 1 DB per customer approach as well. It also makes code a bit easier to maintain.
In regulated industries such as health care it may be a requirement of one database per customer, possibly even a separate database server.
The simple answer to updating multiple databases when you upgrade is to do the upgrade as a transaction, and take a snapshot before upgrading if necessary. If you are running your operations well then you should be able to apply the upgrade to any number of databases.
Clustering is not really a solution to the problem of indices and full table scans. If you move to a cluster, very little changes. If you have many smaller databases to distribute over multiple machines you can do this more cheaply without a cluster. Reliability and availability are considerations but can be dealt with in other ways (some people will still need a cluster but the majority probably don't).
I'd be interested in hearing a little more context from you on this because clustering is not a simple topic and is expensive to implement in the RDBMS world. There is a lot of talk/bravado about clustering in the non-relational world (Google Bigtable etc.), but they are solving a different set of problems, and lose some of the useful features of an RDBMS.
There are a couple of meanings of "database":
the hardware box
the running software (e.g. "the oracle")
the particular set of data files
the particular login or schema
It's likely Joel means one of the lower layers. In this case, it's just a matter of software configuration management... you don't have to patch 1000 software servers to fix a security bug, for example.
I think it's a good idea, so that a software bug doesn't leak information across clients. Imagine the case with an errant where clause that showed me your customer data as well as my own.

Resources