I'm architecting a database to hold thousands of users using CouchDB. We need to synchronise data between the server-side DB and a device DB. [ couchDB <-> pouchDB ]
The only problem is that we will have hundreds to thousands of users, and we want to restrict the data synchronisation to only the data for that user.
This means that if we want 1000 users, we need 1000 replications set up for couchDB on the server side. Just wondering if this will push the limits of Couch, or if it's within it's designed capabilities. Or if there could be any performance deficiencies with this approach?
Thanks
It's difficult to give a yes or no answer to this as there are a number of factors that would have to be taken into account. You would really need to do some testing to see how well it scaled for you. In principle there is no reason a CouchDB server couldn't handle thousands of users.
How many and how frequently documents are created and modified will have a big impact on replication. The more there is to replicate the more work the server will have to do.
Running a filter function on replication so that each user only gets their documents will crease the load on the server. All changed documents will be checked for all replications. A per-user database model will avoid this extra work.
With a per-user database model you will also be able to scale out much more easily. If you find that you can comfortably host x users on a server, you can double that simply by adding another server.
Related
I'm making a web-application. I wanted to know the experienced experts here, how other web-applications such as Facebook, Fantasy sports, trello, todoist or Google Keep etc store the data of users, not username/passwords, but by user data I mean data like their statuses, posts, tasks, reminders, due dates, their configurations etc.
I'm of the impression that all of the above examples would need some kind of database operation. So that, when the user demands/changes some value, the program will have to query into the database where those values are stored. However aren't databases usually a little slow to operate? especially since one user will probably be demanding lots of information at the same time, equaling lots of queries into the database, equaling more slow.
Is there another approach those softwares apply?
Most web-applications have a database to store user information as you described, yes. This isn't slow however - it's the exact opposite. Databases are optimized for fast retrieval of large amounts of data, and they're very good at enforcing and tracking relationships between sets of data and querying for specific sets. So most web applications have a model of a front-end, a server back-end, and a server database, with the server back-end translating web requests into database queries.
This isn't the only way to go, but other options can be more complicated. The programming language Racket, for example, has built-in features for webservers that include the ability to store the program's state between server requests using continuations. This allows the server to essentially remember its running program state between requests, and lets it turn multiple requests into a continuous session by storing these continuations and communicating keys to access them to the browser. I've never heard of any other languages offering this sort of feature, and it can require some mind-bending to get your head around how it works, but it does show that the standard architecture of frontend/backend/database for webapps isn't necessarily set in stone.
We maintain a Software as a Service (SaaS) web application that sits on top of a multi-tenant SQL Server database. There are about 200 tables in the system, this biggest with just over 100 columns in it, at last look the database was about 10 gigabytes in size. We have about 25 client companies using the application every entering their data and running reports.
The single instance architecture is working very effectively for us - we're able to design and develop new features that are released to all clients every month. Each client experience can be configured through the use of feature-toggles, data dictionary customization, CSS skinning etc.
Our typical client is a corporate with several branches, one head office and sometimes their own inhouse IT software development teams.
The problem we're facing now is that a few of the clients are undertaking their own internal projects to develop reporting, data warehousing and dashboards based on the data presently stored in our multi-tenant database. We see it as likely that the number and sophistication of these projects will increase over time and we want to cater for it effectively.
At present, we have a "lite" solution whereby we expose a secured XML webservice that clients can call to get a full download of their records from a table. They specify the table, and we map that to a purpose-built stored proc that returns a fixed number of columns. Currently clients are pulling about 20 tables overnight into a local SQL database that they manage. Some clients have tens of thousands of records in a few of these tables.
This "lite" approach has several drawbacks:
1) Each client needs to develop and maintain their own data-pull mechanism, deal with all the logging, error handling etc.
2) Our database schema is constantly expanding and changing. The stored procs they are calling have a fixed number of columns, but occasionally when we expand an existing column (e.g. turn a varchar(50) into a varchar(100)) their pull will fail because it suddenly exceeds the column size in their local database.
3) We are starting to amass hundreds of different stored procs built for each client and their specific download expectations, which is a management hassle.
4) We are struggling to keep up with client requests for more data. We provide a "shell" schema (i.e. a copy of our database with no data in it) and ask them to select the tables they need to pull. They invariably say "all of them" which compounds the changing schema problem and is a heavy drain on our resources.
Sorry for the long winded question, but what I'm looking for is an approach to this problem that other teams have had success with. We want to securely expose all their data to them in a way they can most easily use it, but without getting caught in a constant process of negotiating data exchanges and cleaning up after schema changes.
What's worked for you?
Thanks,
Michael
I've worked for a SaaS company that went through a similar exercise some years back and Web Services is the probably the best solution here. incidentally, one of your "drawbacks" is actually a benefit. Customers should be encouraged to do their own data pulls because each customer's needs on timing and amount of data will be different.
Now instead of a LITE solution, you should look at building out a WSDL with separate CRUD calls for each table and good filtering capabilities. Also, make sure you have change times for records on each table. this way a customer can hit each table and immediately pull only the records that have been updated since the last time they pulled.
Will it be easy. Not a chance, but if you want scalability, it's the only route to go.
ood luck.
In our web application we need to trace what users click, what they write into search box, etc. Lots of data will be sent by AJAX. Generally functionality is a bit similar to google analytics, but we need to customize it in different ways.
Data will be collected and once per day aggregated and exported to PostgreSQL, so backend should be able to handle dozens of inserts. I don't consider usage of traditional SQL database, because probably it won't handle so many inserts efficiently.
I wonder which backend would you use for such task? Actually I think about MongoDB or Cassandra. But maybe you know better software for that task? Maybe something different then NoSQL database?
Web application is written in Ruby on Rails so support for Ruby would be nice but that's definitely not the most important.
Sounds like you need to analyse your specific requirements.
It may be that the best solution is to split / partition / shard a conventional database and then push the data up from there.
Depending on what your tolerance for data loss is, there are a lot of options. If you choose a system which has single-server durability, a major source of write bottleneck will be fdatasync() (assuming you use hard drives to store your data on).
If you can tolerate syncing less often than on every commit, then you may be able to tune your database to commit at timed intervals.
Depending on your table, index structure etc, I'd expect that you can get rather a lot of inserts with a "conventional" db (e.g. postgresql), if you manage it correctly and tune the durability (if it supports that) to your liking.
Sharding this into several instances of course will enable you to scale this up. However, you need to be mindful of operational requirements (i.e. what happens if some of the instances are down). Talk to your Ops team about what they're comfortable managing.
I hear a lot about couchdb, but after reading some documents about it, I still don't get why to use it and how.
Could you clarify this mystery for me?
It's a non-relational database, open-source, distributed (incremental, bidirectional replication), schema-free. A CouchDB database is a collection of documents; each document is a bunch of string "keys" and corresponding "values" (which can be numbers, strings, lists, dates, ...). You can have indices, queries, views.
If a relational DB feels confining to you (you find schemas too rigid, can't spread the DB engine work around a very large numbers of servers, etc), CouchDB is worth considering (it's one of the most interesting of the many non-relational DBs that are emerging these days).
But if all of your work happily fits in a relational database, that's what you probably want to continue using for production work (even though "playing around" with some non-relational DB is still well worth your time, just for personal growth and edification, that's quite different from transferring huge production systems over from a relational DB!-).
It sounds like you should be reading Why CouchDB
To quote from wikipedia
It is not a relational database management system. Instead of storing data in rows and columns, the database manages a collection of JSON documents. The documents in a collection need not share a schema, but retain query abilities via views.
CouchDB provides a different model for data storage than a traditional relational database in that it does not represent data as rows within tables, instead it stores data as "documents" in JSON format.
This difference in data storage model is what differenciates CouchDB from products like MySQL and SQL Server.
In terms of programatic access to CouchDB, it exposes a REST API which you can access by sending HTTP requests from your code
I hope this has been somewhat helpful, though I acknowlege it may not be given my minimal familiarity with the product
I'm far from an expert(all I've done is play around with it some...) but here's how I'm thinking of using it:
Usually when I'm designing an app I've got a bunch of app servers behind a load balancer. Often times, I've got sticky sessions so that each user will go back to the same app server during that session. What I'm thinking of doing is have a couchdb instance tied to each app server.
That way you can use that local couchdb to access user preferences, product data...whatever data you've got that doesn't have to be perfectly up to date.
So...now you've got data on these local CouchDBs. CouchDB allows replication. So, every fixed time period, merge the data back(every X seconds?) into it's peers to keep them up to date.
As a whole you shouldn't have to worry about conflicts b/c each appserver has it's own CouchDB and users are attached to the appserver, and you've got eventual consistency because you've got replication.
Does that answer your question?
A good example is when you say have to deal with people data in either a website or application. If you set off wishing to design the data and keep the individuals' information seperate, that makes a good case for CouchDB, which stores data in documents rather than relational tables. In a production deployment, my users may end up adding adhoc data about 10% of the people and some other funny details for another selected 5%. In a relational context, this could add up to loads of redundancy but not for CouchDB.
And it's not just about the fact that CouchDB is non-relational: if you're too focus on that, you're missing the point. CouchDB is plugged into the web, all you need to start with is HTTP for creating and making queries (GET/PUT/POST/DELETE...), and it's RESTful, plus the fact that it's portable and great for peer to peer sharing. It can also serve up web applications in what is termed as 'CouchApps', where CouchDB totally holds the images, CSS, markup as data stored under special documents called design documents.
Check out this collection of videos introducing non-relational databases, the one on CouchDB should give you a better idea.
I'm working on a web app that is somewhere between an email service and a social network. I feel it has the potential to grow really big in the future, so I'm concerned about scalability.
Instead of using one centralized MySQL/InnoDB database and then partitioning it when that time comes, I've decided to create a separate SQLite database for each active user: one active user per 'shard'.
That way backing up the database would be as easy as copying each user's small database file to a remote location once a day.
Scaling up will be as easy as adding extra hard disks to store the new files.
When the app grows beyond a single server I can link the servers together at the filesystem level using GlusterFS and run the app unchanged, or rig up a simple SQLite proxy system that will allow each server to manipulate sqlite files in adjacent servers.
Concurrency issues will be minimal because each HTTP request will only touch one or two database files at a time, out of thousands, and SQLite only blocks on reads anyway.
I'm betting that this approach will allow my app to scale gracefully and support lots of cool and unique features. Am I betting wrong? Am I missing anything?
UPDATE I decided to go with a less extreme solution, which is working fine so far. I'm using a fixed number of shards - 256 sqlite databases, to be precise. Each user is assigned and bound to a random shard by a simple hash function.
Most features of my app require access to just one or two shards per request, but there is one in particular that requires the execution of a simple query on 10 to 100 different shards out of 256, depending on the user. Tests indicate it would take about 0.02 seconds, or less, if all the data is cached in RAM. I think I can live with that!
UPDATE 2.0 I ported the app to MySQL/InnoDB and was able to get about the same performance for regular requests, but for that one request that requires shard walking, innodb is 4-5 times faster. For this reason, and other reason, I'm dropping this architecture, but I hope someone somewhere finds a use for it...thanks.
The place where this will fail is if you have to do what's called "shard walking" - which is finding out all the data across a bunch of different users. That particular kind of "query" will have to be done programmatically, asking each of the SQLite databases in turn - and will very likely be the slowest aspect of your site. It's a common issue in any system where data has been "sharded" into separate databases.
If all the of the data is self-contained to the user, then this should scale pretty well - the key to making this an effective design is to know how the data is likely going to be used and if data from one person will be interacting with data from another (in your context).
You may also need to watch out for file system resources - SQLite is great, awesome, fast, etc - but you do get some caching and writing benefits when using a "standard database" (i.e. MySQL, PostgreSQL, etc) because of how they're designed. In your proposed design, you'll be missing out on some of that.
Sounds to me like a maintenance nightmare. What happens when the schema changes on all those DBs?
http://freshmeat.net/projects/sphivedb
SPHiveDB is a server for sqlite database. It use JSON-RPC over HTTP to expose a network interface to use SQLite database. It supports combining multiple SQLite databases into one file. It also supports the use of multiple files. It is designed for the extreme sharding schema -- one SQLite database per user.
One possible problem is that having one database for each user will use disk space and RAM very inefficiently, and as the user base grows the benefit of using a light and fast database engine will be lost completely.
A possible solution to this problem is to create "minishards" consisting of maybe 1024 SQLite databases housing up to 100 users each. This will be more efficient than the DB per user approach, because data is packed more efficiently. And lighter than the Innodb database server approach, because we're using Sqlite.
Concurrency will also be pretty good, but queries will be less elegant (shard_id yuckiness). What do you think?
If you're creating a separate database for each user, it sounds like you're not setting up relationships... so why use a relational database at all?
If your data is this easy to shard, why not just use a standard database engine, and if you scale large enough that the DB becomes the bottleneck, shard the database, with different users in different instances? The effect is the same, but you're not using scores of tiny little databases.
In reality, you probably have at least some shared data that doesn't belong to any single user, and you probably frequently need to access data for more than one user. This will cause problems with either system, though.
I am considering this same architecture as I basically wanted to use the server side SQLLIte databases as backup and synching copy for clients. My idea for querying across all the data is to use Sphinx for full-text search and run Hadoop jobs from flat dumps of all the data to Scribe and then expose the results as webservies. This post gives me some pause for thought however, so I hope people will continue to respond with their opinion.
Having one database per user would make it really easy to restore individual users data of course, but as #John said, schema changes would require some work.
Not enough to make it hard, but enough to make it non-trivial.