Best practices to structure a database to be scaling-ready

I know this is a very generic and subjective question, so feel free to vote to close it if it does not meet the StackOverflow netiquette... but for me, it's worth trying ;)
I've never built a high-traffic application until now, so I'm not aware of scaling practices (except for some reading on the web).
How can I design a database so that, when scaling is needed, I don't have to refactor the database structure or the application code?
I know that development (and optimization) should proceed step by step, optimizing bottlenecks as they appear, and that it's nearly impossible to design the perfect structure when you don't know how many users you'll have or how they will use the database (e.g. the read/write ratio); I'm just looking for a good base to start from.
What are the best practices for making a structure almost ready to be scaled with partitioning and sharding, and what hacks must be absolutely avoided?
Edit: some details about my application:
The application will behave as a multisite application
I'll have a database for each application version (db_0_0_1, db_0_0_2, etc.)
Every 'site' will have a schema inside a database, and a role that can access only its own schemas
Application code will be mostly PHP, with a few things (daemons and maintenance tasks) in Python
Web server will probably be Nginx, with lighttpd or node.js as support for long-polling tasks (e.g. chat)
Caching will be done with memcached (plus APC for things strictly related to the PHP code), since memcached can also be used outside PHP

The question is really generic, but here are a few tips:
Do not use any session variables (pg_backend_pid(), inet_client_addr()) or per-session control (SET ROLE, SET SESSION) in application code.
Do not use explicit transaction control (BEGIN/COMMIT/SET TRANSACTION) in application code. All such logic should be wrapped in UDFs. This enables stateless, statement-mode pooling, which is the fastest kind of DB pooling (see the pgbouncer docs and the PostgreSQL wiki for more info).
Encapsulate all App<->DB communication in a well-defined DB API of UDFs - this will let you use PL/Proxy. If doing this for all SELECTs is too hard, do it at least for all data writes (INSERT/UPDATE/DELETE). Example: instead of INSERT INTO users(name) VALUES('Joe') you call SELECT create_user('Joe').
Check your DB schema - is it easy to separate all data belonging to a given user? (Most probably this will be the partitioning key.) All that's left is common, shared data, which will need to be replicated to all nodes.
Think of caching before you need it. What will be the cache key? What will be the cache timeout? Will you use memcached? (A sketch of the last three points follows.)
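
Below is a minimal sketch (Python with psycopg2 and pymemcache, chosen since the question mentions memcached and a PHP/Python stack) of the last three points: calling UDFs instead of issuing DML directly, keeping the connection in autocommit so statement-mode pooling works, and deciding the cache key and timeout up front. The function names, DSN and cache layout are illustrative assumptions, not part of the original answer.

import json
import psycopg2
from pymemcache.client.base import Client

# In production this DSN would point at pgbouncer in statement-pooling mode.
conn = psycopg2.connect("host=127.0.0.1 dbname=app user=app")
conn.autocommit = True        # no BEGIN/COMMIT from the app; the UDFs own their transactions
cache = Client(("127.0.0.1", 11211))

def create_user(name):
    # Instead of: INSERT INTO users(name) VALUES (%s)
    with conn.cursor() as cur:
        cur.execute("SELECT create_user(%s)", (name,))
        return cur.fetchone()[0]

def get_user_name(user_id):
    key = "user_name:%d" % user_id            # cache key decided up front
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    with conn.cursor() as cur:
        # get_user_name() is an assumed read-side UDF, mirroring create_user()
        cur.execute("SELECT get_user_name(%s)", (user_id,))
        name = cur.fetchone()[0]
    cache.set(key, json.dumps(name), expire=60)   # explicit cache timeout
    return name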

Related

DB recommendation - Portable, Concurrent (multiple read only, one write)

I'm looking for a portable database solution I can use with a website that is designed to handle service outages. I need to retrieve a list of users from SQL Server nightly and upsert their details into a portable database. It's roughly 250,000 users (and growing), and each one has probably 25 fields that are required. Of those fields, I'd say fewer than 5 need to be searched on. The rest just need retrieving.
The idea is, in times of a service outage, we can use a website that's designed to work from the portable database rather than SQL Server. Our long term goal, is to move to the cloud and handle things in an entirely different way, but for the short term this is our aim.
The website is going to be a .NET Core web API, so it will be accessed by multiple users on multiple threads. The website will only ever need read access; it will not be updating these details whatsoever.
To keep the portable database up to date, I'm thinking of having another application that just runs nightly to update the data. Our business runs 24 hours a day (albeit quieter overnight), so there is a chance this updater is running while the website is in use. While a service outage would suggest that SQL Server is down, this may not be the case; there are other factors in play that could cause what we would describe as outages. This will be the only piece of software updating the database.
I've tried using LiteDB, but I couldn't get it working in a way that met my concurrency requirements. It did seem to do some of the job and was easy to get running. However, I'd often run into locked files due to the nature of the web API. I did work out a solution for that, but then the updater app couldn't access the database file.
Does anyone have any recommendations I can look into?
Given the description of the problem (one table, 250k rows with, I assume, a relatively fast growth rate) and the requirements, I don't think a relational database is what you are looking for.
I think NoSQL databases or, more specifically, document-oriented databases are better suited to your requirements. There are many choices: Mongo, Cassandra, CouchDB, ... the choice is yours.
Personally I have some experience with ElasticSearch (https://www.elastic.co/elasticsearch), which is quite easy to learn, is portable (runs on Linux, Windows, containers, etc.), is scalable, and is fast. I mean really, really fast: you can get results in 10-20 milliseconds (even less, sometimes).
The NEST NuGet package acts as a high-level client for working with ElasticSearch (https://www.elastic.co/guide/en/elasticsearch/client/net-api/7.x/nest-getting-started.html).
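To make the nightly-upsert / read-only-search pattern concrete, here is a hedged sketch using the official Python Elasticsearch client rather than NEST (Python is used only to stay consistent with the other sketches in this thread); the index name, field names and local URL are assumptions.

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def refresh_users(rows):
    # Nightly upsert: index each user document under its own id, so
    # re-running the job simply overwrites the previous version.
    actions = (
        {"_index": "users", "_id": row["id"], "_source": row}
        for row in rows
    )
    helpers.bulk(es, actions)

def find_by_surname(surname):
    # Only a handful of fields are ever searched on; everything else is
    # just read back from the stored document.
    res = es.search(index="users", query={"match": {"surname": surname}})
    return [hit["_source"] for hit in res["hits"]["hits"]]

The NEST client mentioned above offers equivalent bulk-index and search calls for .NET.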

Database: Repositories with NoSQL/Document Database (DDD)

Looking for any advice from anyone who has migrated their repositories from a relational DB to NoSQL.
We are currently building an App using a Postgres database & ORM (SQLAlchemy). However, there is a possibility that at a later date we may need to migrate the App to an environment that currently only supports a couple of NoSQL solutions.
With that in mind, we're following the Persistence-Oriented approach to repositories covered in Vaughn Vernon's Implementing Domain-Driven Design. This results in the following API:
save(aggregate)
save_all(aggregates)
remove(aggregate)
get_by_...
Without going into detail, the ORM-specific code has been hidden away in the repository itself. The Session is only used for the short span of time when data is retrieved or updated, and it is then immediately committed and closed (in the repository's methods). This means lots of merging on save, and not the most efficient use of the Session.
def save(aggregate):
    session = session_factory()          # a new Session per repository call
    try:
        session.merge(aggregate)         # re-attach the detached aggregate
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()

def get(aggregate_id):
    session = session_factory()
    try:
        aggregate = session.query(Aggregate).get(aggregate_id)
        if aggregate is not None:
            session.expunge(aggregate)   # detach it so it can leave the repository
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()
    return aggregate

etc., etc.
The advantages:
We are limiting ourselves to updating a single Aggregate per Use Case, so the impact of not fully utilising UoW transaction control in the Application Service is minimal (outside of performance). Transaction control is enabled in the repos while the aggregate is written, to ensure the full aggregate is persisted.
No ORM-specific code leaks outside of the Repositories, and that code would need to be re-written in the event of switching to a NoSQL DB anyway.
So if we do have to switch to a NoSQL DB, we should have a minimal amount of work to do.
However, almost everything I have read encourages transactional behaviour to live in the Application Service layer, although I believe there is a distinction here between business-transactional and DB-transactional.
Likewise, we're taking a performance hit, in that we ask the session factory for a session on every call to the repository. Most services contain about three or so calls to a repository.
So, the questions, for anyone who has migrated from a relational to a NoSQL DB:
Does the concept of a Unit of Work / Session mean anything in a NoSQL world?
Should we fully embrace the ORM in the meantime, and move the UOW/Session outside of the Repository into the Application Service?
If we do that, what was the level of effort to re-engineer the Application Service, if we need to migrate to a NoSQL solution in the end? (The repositories will need to be re-written in any case.)
Finally, has anyone had much experience writing an implementation-agnostic repository?
PS: I understand we could drop the ORM entirely and go pure SQL in the meantime, but we have been asked to ensure we are using an ORM.
EDIT: In this answer I focus on document DBs, based on the question's title. Of course other NoSQL stores exist with vastly different characteristics (for example graph DBs, stores using event sourcing, and others).
It should not be a problem really.
In document DBs your entire aggregate should be a single document. This way you have exactly the guarantees you need for transactional consistency: regardless of how many entities change within the aggregate, you're still storing a single document. You will need to make sure you enforce some form of optimistic concurrency (through an etag, a version field or similar) rather than a Unit of Work pattern, but after that your transactional requirements are covered.
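As a hedged illustration of that version check (one aggregate = one document, compare-and-swap on a version field), here is a minimal Python sketch; MongoDB via pymongo, the collection name and the field names are assumptions made purely for the example.

from pymongo import MongoClient

orders = MongoClient()["shop"]["orders"]   # assumed database/collection

class ConcurrencyError(Exception):
    pass

def save(aggregate):
    # The whole aggregate is one document; the replace only succeeds if the
    # stored version still matches the version we loaded earlier.
    result = orders.replace_one(
        {"_id": aggregate["_id"], "version": aggregate["version"]},
        {**aggregate, "version": aggregate["version"] + 1},
    )
    if result.matched_count == 0:
        raise ConcurrencyError("the aggregate was modified by someone else")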
I can't really comment on whether you should fully embrace a UoW pattern now vs. rely on the ORM implementation, etc. This really depends a lot on your current situation and implementation details. What I can say, though, is that it is quite probable you won't need to migrate from normal form (SQL) to documents all in one go. Start with a simple one so that you can see what works for you and what doesn't.
I don't know if implementation-agnostic repositories exist, but the idea doesn't make a lot of sense to me. The whole point of a repository is encapsulating persistence, so you can't abstract that away: there won't be any other responsibility allocated to them. Also, you can't assume that the repository will need to compose different models into the aggregate model: this is platform-specific, so it's not agnostic.
One final comment: I see in your question that you wrote save_all(aggregates). I'm not sure what you're referring to, but at a minimum each aggregate save should be wrapped in its own transaction, otherwise this operation violates the transactional-boundary characteristic of an Aggregate.
Does the concept of a Unit of Work / Session mean anything in a NoSQL world?
Yes, it can still be an interesting concept to have. Just because you're using a NoSQL storage doesn't mean that the need for some sort of business transaction management disappears. Many NoSQL databases have drivers or third party libraries that manage change tracking. See RavenDB for instance.
Sure, if you're only ever loading one aggregate per transaction and if your NoSQL unit of storage matches an aggregate perfectly, most of a Unit of Work's features will be less important, but you'll still be facing exceptions to that rule. Besides, the part of a UoW that's relevant in any case is Commit and possibly Abort.
Should we fully embrace the ORM in the meantime, and move the UOW/Session outside of the Repository into the Application Service?
What I recommend instead is materializing the concept of a Unit of Work in a full-fledged class:
class UnitOfWork {
    void Commit()
    {
        // Call your ORM persistence here
    }
}
Application Services are just the place where the Unit of Work is called, not where it is implemented.
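Since the question's stack is Python and SQLAlchemy, here is a hedged sketch of the same idea in that form: the Unit of Work owns the session and the commit, while the application service only opens it. The engine URL, the Order model and the repository are illustrative assumptions, not part of the answer above.

from sqlalchemy import create_engine, Column, Integer, Boolean
from sqlalchemy.orm import sessionmaker, declarative_base

engine = create_engine("sqlite:///app.db")      # assumed engine
session_factory = sessionmaker(bind=engine)
Base = declarative_base()

class Order(Base):                               # assumed aggregate root
    __tablename__ = "orders"
    id = Column(Integer, primary_key=True)
    shipped = Column(Boolean, default=False)

    def mark_shipped(self):
        self.shipped = True

Base.metadata.create_all(engine)

class OrderRepository:
    def __init__(self, session):
        self._session = session

    def get(self, order_id):
        return self._session.get(Order, order_id)

class UnitOfWork:
    def __enter__(self):
        self.session = session_factory()
        self.orders = OrderRepository(self.session)   # repositories share the session
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self.session.commit()       # the UoW commits, not the repository
        else:
            self.session.rollback()
        self.session.close()

# The application service calls the Unit of Work; it does not implement it.
def ship_order(order_id):
    with UnitOfWork() as uow:
        order = uow.orders.get(order_id)
        order.mark_shipped()
        # changes tracked by the session are committed when the block exits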
If we do that, what was the level of effort to re-engineer the Application Service, if we need to migrate to a NoSQL solution in the end. (The repositories will need to be re-written in any instance).
It depends on a lot of other parameters such as Unit of Work support by your NoSQL API or third party libraries, and similarity in shape between Aggregates and the NoSQL storage. It can range from practically no work to writing a full UoW/change tracking implementation yourself. If the latter, extracting UoW logic from the Repository to a separate class won't be the hardest part of the job.
Finally, anyone had much experience writing an implementation-agnostic repository?
I concur with SKleanthous here: implementation-agnostic repos don't make much sense IMO. You've got your repository abstractions (interfaces), which are of course agnostic, but when it comes to implementations you have to address a particular persistent storage.

Database for a java application in cluster

I'd like to play around with Kubernetes. I'm able to start a simple app, but now I'd like to design something more complex. Nevertheless, I can't figure out how to handle database access in such an architecture.
Let's say I have 100 pod replicas of some simple chat application. They all need to access the same database (or, more precisely, the same data set) and perform CRUD operations on it. How do I design this to keep the data consistent and eliminate the risk of deadlocks?
If possible, I'd like to use SQL-like database, so I can comfortably use hibernate and other tools I'm familiar with.
Is this even possible, or do I have to use a totally different approach? What is the name of the technology or architecture I'm searching for?
1) You can use a connection pool to reduce the number of connections and make the connection settings more aggressive/elastic (see the sketch after this list);
2) Split your microservices in such a way that access to persistence is itself a microservice, exposing your CRUD service to your persistence layer (MySQL/RDBMS/NoSQL/etc.). That way you most likely don't need hundreds of replicas of your pods.
3) Deadlocks / locking strategies: as Andrew mentioned in the comments, this is more related to your software architecture than to K8s itself. There are plenty of ways to deal with it, each with pros and cons.
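For point 1, here is a rough, hedged illustration of a capped connection pool per replica. The question's stack is Java/Hibernate, where a pooler such as HikariCP plays this role; the sketch below uses Python/SQLAlchemy only to stay consistent with the other examples in this thread, and the numbers and URL are made up.

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://app:secret@db:5432/chat",   # assumed connection URL
    pool_size=5,          # steady-state connections per pod
    max_overflow=5,       # allow short bursts only
    pool_timeout=3,       # fail fast instead of queueing forever
    pool_recycle=1800,    # drop stale connections after 30 minutes
    pool_pre_ping=True,   # validate a connection before handing it out
)

With 100 replicas, even these small numbers mean up to 1000 connections at the database, which is another argument for point 2 (a single persistence-facing service) or an external pooler.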

Why shouldn't I give outsiders access to my database?

Lots of sites today have APIs that allow users to get data from the site as XML or JSON using a GET HTTP request. Flickr and del.icio.us are examples of sites with APIs. These APIs require the server to access the database and then output the result as either XML or JSON.
Why do we need this translation, though? Why not just create a user on the database (for example MySQL)? The user would be given limited access to the database, only being allowed to SELECT, and only certain tables and certain columns in those tables. Wouldn't this be a lot more efficient for the server (it wouldn't have to deal with the HTTP request), and wouldn't it be easier for developers, who could now access exactly the data they need, the way they need it?
Security considerations aside, so that you can change your database structure without affecting your clients. Also, poorly formed queries tie up your server, not the clients.
Can you prevent a malicious individual from crafting a super-complex SQL query that will peg your database's CPU at 100%? Can you prevent a lot of innocent programmers from crafting inefficient queries that will never be optimized that will do the same thing?
Coding to Contract - with APIs, you may change everything behind them without affecting outsiders' use of them. Here you'd be tying them not just to MySQL but to your schema.
Caching - allowing arbitrary queries removes almost any opportunity to cache the predictable queries that come in over HTTP. This is probably the number one way to relieve the often number one bottleneck, the database.
Security - with this approach, it would be easy to mount a denial-of-service attack, even by accident. Not to mention that you'd have to give access to the data layer, which is often placed in a restricted zone where security can be tightened.
Usability - not everyone is a developer or wants to understand your internal domain. They'd probably prefer a pre-baked, straightforward and self-explanatory API. An extreme example would be giving managers DB privileges rather than reports.
An API:
Makes it easier to monitor and control usage (implementing 'limited queries per X' for DB users may be harder).
Allows for presenting simpler structures to the user than may be used in the DB (see the sketch after this list).
Means the user doesn't have to understand your DB structure.
Allows for DB portability. (Oh you've grown massive and now need to implement: sharding, move to bigtable, etc. - With an API the user doesn't need to know)
Allows for different (better? / variable?) caching of requests.
Means you don't have to pay for extra DB users (If that's how the DB is licensed.)
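To make the contract idea concrete, here is a hedged sketch of a tiny read-only endpoint: the only thing a client can do is hit a fixed, parameterized, cacheable query, never the schema itself. Flask, SQLite and the table/column names are illustrative assumptions.

import sqlite3
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/photos/<int:user_id>")
def photos(user_id):
    # Clients only ever see this response shape, so the tables behind it can
    # change (or move to another database entirely) without breaking them.
    db = sqlite3.connect("site.db")
    rows = db.execute(
        "SELECT id, title, taken_at FROM photos WHERE owner_id = ? LIMIT 100",
        (user_id,),
    ).fetchall()
    db.close()
    resp = jsonify([{"id": r[0], "title": r[1], "taken_at": r[2]} for r in rows])
    resp.headers["Cache-Control"] = "public, max-age=60"   # cacheable over HTTP
    return resp

Because the query is fixed and bounded, it can be cached, rate-limited and monitored; none of that is possible if clients send arbitrary SQL.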
Portability too. Let's say for licensing reasons and scaling you make the business decision to move from MSSQL to MySQL. The syntax ain't quite the same, and your clients would all have to change their code.
Much better to just buffer it all off and keep the implementation abstracted away. Who's to say you're not persisting the state of the application using trained monkeys scratching marks on bottletops?
Security is the number 1 reason but I hope those reasons are obvious. The user tying up precious resources with bad queries is another good reason.
Beyond that though, why an abstraction layer?
Might you ever want to add some logging to database queries to diagnose speed or to help debug?
Might you ever go from MySQL to MS SQL or vice versa where SQL other than pure ANSI might break?
Should the customer really have to learn your schema rather than a more logical abstraction?
When a new programmer learns of normalization and can now see your whole schema including your carefully balanced denormalizations, do you want to put up with every uninformed criticism?
When a more experienced db person points out improvements, do you want to be stuck with your old schema?
Why to use an API is a question of why to use abstractions and my list here barely scratches the surface.
The web server gives you a buffer that you can control. If there is some bug in your SQL server or whatever, you don't want it exposed directly to the internet. True, if the web server has bugs, it might be just as bad... except you have that extra layer between the data and the world.
-don
It's not so much a 'why not' as a 'why should you' question. Handling HTTP requests is a small penalty for complete control over what data you allow or disallow a user from accessing. Further, should the nature, quantity or security level of the data change in future, you will be better off with a JSON/XML response than with allowing total access.
The thing to bear in mind when you're thinking of security issues is that it's really hard to anticipate all of the possible vectors that someone could use to attack you. For instance, are you really sure you've gotten your database permissions set so that people can't mess things up?
Therefore, you want to try restricting actions to only what you know to be good, not just trying to restrict the things you know to be bad. This can be done with a web service that you have absolute control over, but it's difficult to allow somebody to access the database directly and be sure that you're secure.
An API is a kind of wrapper around the database. Users don't know anything about the database's internal representation of the data; they only need to send a number of unified requests and get unified responses back. How and when the data is processed on the server is not their headache.

In Memory Database

I'm using SQL Server to drive a WPF application. I'm currently using NHibernate and pre-read all the data so it's cached for performance reasons. That works for a single-client app, but I was wondering if there's an in-memory database that I could use so I can share the information across multiple apps on the same machine. Ideally this would sit below my NHibernate stack, so my code wouldn't have to change. Effectively I'm looking to move my DB from its traditional format on the server to an in-memory DB on the client.
Note I only need select functionality.
I would be incredibly surprised if you even need to load all your information into memory. I say this because, just as one example, I'm working on a web app at the moment that (for various reasons) loads thousands of records on many pages. This is PHP + MySQL, and even so it can do that and render a page in well under 100 ms.
Before you go down this route, make sure that you have to. First make your database as performant as possible. Obviously this includes things like having appropriate indexes and tuning your database, but even those are putting the cart before the horse.
First and foremost you need to make sure you have a good relational data model: one that lends itself to performant queries. This is as much art as it is science.
Also, you may like NHibernate, but ORMs are not always the best choice. There are some corner cases, for example, where hand-coded SQL will be vastly superior.
Now assuming you have a good data model and assuming you've then optimized your indexes and database parameters and then you've properly configured NHibernate, then and only then should you consider storing data in memory if and only if performance is still an issue.
To put this in perspective, the only times I've needed to do this are on systems that need to perform millions of transactions per day.
One reason to avoid in-memory caching is because it adds a lot of complexity. You have to deal with issues like cache expiry, independent updates to the underlying data store, whether you use synchronous or asynchronous updates, how you give the client a consistent (if not up-to-date) view of your data, how you deal with failover and replication and so on. There is a huge complexity cost to be paid.
Assuming you've done all the above and you still need it, it sounds to me like what you need is a cache or grid solution. Here is an overview of Java grid/cluster solutions but many of them (eg Coherence, memcached) apply to .Net as well. Another choice for .Net is Velocity.
It needs to be pointed out and stressed that something like NHibernate's caching is only consistent so long as nothing external updates the database and there is exactly one NHibernate-enabled process (barring clustered solutions). If two desktop apps on two different PCs are both updating the same database with NHibernate, the caching simply won't work, because each persistence unit won't be aware of the changes the other is making.
http://www.db4o.com/ can be your friend!
Velocity is an out-of-process object caching server designed by Microsoft to do pretty much what you want, although it's only in CTP form at the moment.
I believe there are also wrappers for memcached, which can also be used to cache objects.
You can use HANA, express edition. You can download it for free; it's in-memory, columnar, and offers further analytics capabilities such as text analytics, geospatial or predictive. You can also access it with ODBC, JDBC, the node.js hdb library, or REST APIs, among others.
