Encapsulating server communication for data intensive multi-server applications - database

There are two databases, A and B, that serve web pages and communicate with each-other via internal network when they need to share data. Sometimes server A needs to produce a webpage with a chart that requires intensive calculations involving a large quantity of data on server B. Lets assume both servers are equally powerful and the network is fast. I'm trying to figure out a good way to do this. Ideas so far:
Server B could do a lot of the calculations then pass back a small set of results to server A. Less flexible, but fairly efficient. Unfortunately this is a somewhat complex transaction. I'd compare it to a method with a lot of parameters, side-effects and results.
Server B could simply provide its own raw data server to A and let it handle all of the calculations itself. More "open" and flexible, but less efficient since there is a lot of data involved in the calculation, and server A would have to pull all of it over the network.
Server B could produce and return the chart, or a link to it. Perhaps the most efficient, but least flexible, and also creates a messy relationship where server B is partially responsible for producing server A's webpages. However, at least server A doesn't have to worry about getting back a complex object and know what to do with it.
Is there a best practice here that results in a balance of performance and maintainable encapsulation? Or is it simply a case-by-case situation? I'm leaning towards the first option. I hope this question is "answerable" enough and not a discussion issue. I've tried to concentrate the general problem into a specific scenario.

I cannot answer the best practice part of your question, But will for the performance aspect. What you have described is two servers with no real technical difference between them, and presumably you have equal access to update both of them, neither starved of resource, etc.
To truly calculate performance, there are a few number needed.
A) how many requests per second?
B) for techniques a and b, how much traffic will flow between the machines?
C) how much latency can you withstand?
Your first option will be reasonable for performance, if the data flowing is small enough. Option b will generate most load on your infra structure, and if this is approaching hardware limits then ditch this option now. If all the numbers are small, then this isn't really a performance based decision.
If you are driving this all from web pages, then cannot the first page load simply return Img tags etc that refer to server B? Ie I request pageA, but the HTML stream includes resources from serverB? This is essentially your option C, but will perhaps give higher perceived user performance as the resources from different servers can download in parallel (assuming there are several to be accessed).
from a design standpoint, I struggle to see Where a calculation is performed is Relevant in this question, you control both servers, so the question becomes what 'contract' does server b have to provide information to server a. Is there some reason that b cannot provide filtered results only and never the underlying raw data? Or put another way, why should a have to process b's data? Whether the logic is in a or b doesn't matter when you control both ends.
If you do not control both servers, then the decision is more around who should bare the cost (build and support) for processing the raw data to results.
Summary/opinion. Option c if possible and suitable, else a in absence of some overriding reason for b

Related

Which one is better, iterate and sort data in backend or let the database handle it?

I'm trying to design a database schema for Djabgo rest framework web application.
At some point, I have two choces:
1- Choose a schema in which in one or several apies, I have to get a queryset from database and iterate and order it with python. (For example, I can store some datas in an array-data-typed column, get them from database and sort them with python.)
2- store the data in another table and insert a kind of big number of rows with each insert. This way, I can get the data in my favorite format in much less lines with orm codes.
I tried some basic tests and benchmarking to see which way is faster, and letting database handle more of the job (second way) didn't let me down. But I don't have the means of setting a more real situatuin and here's the question:
Is it still a good idea to let database handle the job when it also has to handle hundreds of requests from other apies and clients each second?
Is database (and orm) usually faster and more reliable than backend?
As a general rule, you want to let the database do work when the work is appropriate for the database. Sorting result sets would be in that category.
Keep in mind:
The database is running on a server, often on a distributed system and so it has access to more resources.
Databases are designed to handle large data, so they are not limited by the memory in a single thread.
When this question comes up, often more data needs to be passed back to the application than is strictly needed. Consider a problem such as getting the top 10 of something.
Mixing processing in the application and the database often requires multiple queries and passing data back and forth, which is expensive.
(And there are no doubt other considerations.)
There are some situations where it might be more efficient or convenient to do work in the application. A common example is formatting result sets for the application -- say turning 1234.56 into $1,234.56. Other examples would be when the application language has capabilities that are not directly in SQL or are hard to implement in SQL.

How to implement sharding?

First world problems: We've got a production system that is growing rapidly, and we are aiming to grow our user base even more. At peak times our DB is flatlining at 100% CPU, which I take as an indication that it's pretty much stretched to the limit. Being an AWS instance, we could always throw some more hardware at it, but long term, it seems we will need to implement sharding.
I've Googled all over and found lots of explanations of what sharding is, why it is a good idea under certain circumstances, what design considerations, etc... but not a word on the practicality of how to do it.
What are the practical steps to shard a database? How do you redirect queries to the appropriate shard? And how do you run reports that require data from all shards?
The first thing you'll want to decide is whether or not you want to take on the complexity of routing queries in your application. If you decide to roll your own implementation, there are a number of complexities that you'll need to deal with over time.
You'll need a scheme to distribute data and queries evenly across the cluster. You'll need to ensure that this scheme is forward-compatible with a larger cluster, as if your data is already big enough to require a sharded architecture, it's likely that you'll need to add more servers.
The problem with sharding schemes is that they force you to make tradeoffs that you wouldn't have to make with a single-server database. For example, if you are sharding by user_id, any query which spans multiple users will need to be sent to all servers (or a subset of servers) and the results must be accumulated in your client application. This is especially complex if you are using aggregate queries that rely on the ordering of the data, such as MAX(), or any histogram computation.
All of this complexity isn't meant to scare you, but it's something you'll need to pay attention to. There are tools out there that can help you (disclosure: my company makes a tool called dbShards) but you can definitely put together your own solution, especially if your application is mature and the query patterns are quite predictable.

How can I avoid first retrieving aggregate root for optimizations in DDD friendly way?

I have written an application and tried to use DDD as far as possible. I have run into a problem that I am not sure how to address. Most of the system is driven through users via a web front end, the volume of requests that the application is pretty low and quite CRUDy in nature. However there is one part of the system that integrate into other online systems and the volume of requests will be substantial. We already know from previous experience that we have to streamline this as much as possible.
The scenario is that we want to add create an entity if it doesn't exist but add to the aggregate root if it does exist.
var foo = fooRepo.Get(id);
if (foo != null)
{
foo.Bars.Add(new Bar("hello"));
fooRepo.Add(foo);
}
else
{
var foo = new Foo();
foo.Bars.Add(new Bar("hello"));
fooRepo.Add(fooo);
}
If the action is initiated by a user we don't mind because it is low volume. But the system generated ones it is an extra trip to the DB to determine if it exists. We have already determined that this is a bottle neck for us.
How do I do the above in a DDD way but avoiding the extra trip to the database. I started looking into CQRS because I thought that part of the pattern was to not have repositories so that I could execute commands. But I see that most CQRS examples still use repos and in most examples are still performing get operations for the aggregate roots and then manipulating them and then saving them back.
Clearly repositories don't work in my scenario because I can't treat the problem like an in memory collection of items. (In fact there is just way to much data to even cache everything). Best thing for performance is a straight write to the DB.
Is there anything that can be done in this scenario? Or should I just code an exception?
You could have a FooRepository.AddOrUpdate(foo, bar) method implemented as a direct INSERT INTO ... IF NOT EXISTS or MERGE INTO SQL statement if the extra round-trip is really a concern (which I'm not really sure why it would).
ETL or mass import processes rarely obey the same rules or go through the same layers as UI-based ones. The Application layer will typically be a separate one, and you might even want to bypass the Domain layer in some cases.
It could be a good idea to think about this second input flow in terms of transactions (do these system generated changes come in batches or units, do you want to isolate them the same way you isolate two concurrent end user transactions, etc.?) and create a specific Application layer based on that.
If it turns out that even like this you're too constrained by your aggregate design, you might want to short circuit the domain layer altogether.

Database Joins Done On The Webserver

Today I found an article online discussing Facebooks architecture (though it's a bit dated). While reading it I noticed under the section Software that helps Facebook scale, the third bullet point states:
Facebook uses MySQL, but primarily as a key-value persistent storage,
moving joins and logic onto the web servers since optimizations are
easier to perform there (on the “other side” of the Memcached layer).
Why move complex joins to the web server? Aren't databases optimized to perform join logic? This methodology seems contrary to what I've learned up to this point, so maybe the explanation is just eluding me.
If possible, could someone explain this (an example would help tremendously) or point me to a good article (or two) for the benefits (and possibly examples) of how and why you'd want to do this?
I'm not sure about Facebook, but we have several applications where we follow a similar model. The basis is fairly straightforward.
The database contains huge amounts of data. Performing joins at the database level really slows down any queries we make on the data, even if we're only returning a small subset. (Say 100 rows of parent data, and 1000 rows of child data in a parent-child relationship for example)
However, using .NET DataSet objects, of we select in the rows we need and then create DataRelation objects within the DataSet, we see a dramatic boost in performance.
I can't answer why this is, as I'm not knowledgeable about the internal workings of either, but I can venture a guess...
The RDBMS (Sql Server in our case) has to deal with the data that lives in files. These files are very large, and only so much of it can be loaded into memory, even on our heavy-hitter SQL Servers, so it there is a penalty of disk I/O.
When we load a small portion of it into a Dataset, the join is happening entirely in memory, so we lose the I/O penalty of going to the disk.
Even though I can't explain the reason for the performance boost completely (and I'd love to have someone more knowledgeable tell me if my guess is right) I can tell you that in certain cases, when there is a VERY large amount of data, but your app only needs to pull a small subset of it, there is a noticeable boot in performance by following the model described. We've seen it turn apps that just crawl into lightning-quick apps.
But if done improperly, there is a penalty - if you overload the machine's RAM but doing it inappropriately or in every situation, then you'll have crashes or performance issues as well.

How to share data across an organization

What are some good ways for an organization to share key data across many deparments and applications?
To give an example, let's say there is one primary application and database to manage customer data. There are ten other applications and databases in the organization that read that data and relate it to their own data. Currently this data sharing is done through a mixture of database (DB) links, materialized views, triggers, staging tables, re-keying information, web services, etc.
Are there any other good approaches for sharing data? And, how do your approaches compare to the ones above with respect to concerns like:
duplicate data
error prone data synchronization processes
tight vs. loose coupling (reducing dependencies/fragility/test coordination)
architectural simplification
security
performance
well-defined interfaces
other relevant concerns?
Keep in mind that the shared customer data is used in many ways, from simple, single record queries to complex, multi-predicate, multi-sort, joins with other organization data stored in different databases.
Thanks for your suggestions and advice...
I'm sure you saw this coming, "It Depends".
It depends on everything. And the solution to sharing Customer data for department A may be completely different for sharing Customer data with department B.
My favorite concept that has risen up over the years is the concept of "Eventual Consistency". The term came from Amazon talking about distributed systems.
The premise is that while the state of data across a distributed enterprise may not be perfectly consistent now, it "eventually" will be.
For example, when a customer record gets updated on system A, system B's customer data is now stale and not matching. But, "eventually", the record from A will be sent to B through some process. So, eventually, the two instances will match.
When you work with a single system, you don't have "EC", rather you have instant updates, a single "source of truth", and, typically, a locking mechanism to handle race conditions and conflicts.
The more able your operations are able to work with "EC" data, the easier it is to separate these systems. A simple example is a Data Warehouse used by sales. They use the DW to run their daily reports, but they don't run their reports until the early morning, and they always look at "yesterdays" (or earlier) data. So there's no real time need for the DW to be perfectly consistent with the daily operations system. It's perfectly acceptable for a process to run at, say, close of business and move over the days transactions and activities en masse in a large, single update operation.
You can see how this requirement can solve a lot of issues. There's no contention for the transactional data, no worries that some reports data is going to change in the middle of accumulating the statistic because the report made two separate queries to the live database. No need to for the high detail chatter to suck up network and cpu processing, etc. during the day.
Now, that's an extreme, simplified, and very coarse example of EC.
But consider a large system like Google. As a consumer of Search, we have no idea when or how long it takes for a search result that Google harvests to how up on a search page. 1ms? 1s? 10s? 10hrs? It's easy to imaging how if you're hitting Googles West Coast servers, you may very well get a different search result than if you hit their East Coast servers. At no point are these two instances completely consistent. But by large measure, they are mostly consistent. And for their use case, their consumers aren't really affected by the lag and delay.
Consider email. A wants to send message to B, but in the process the message is routed through system C, D, and E. Each system accepts the message, assume complete responsibility for it, and then hands it off to another. The sender sees the email go on its way. The receiver doesn't really miss it because they don't necessarily know its coming. So, there is a big window of time that it can take for that message to move through the system without anyone concerned knowing or caring about how fast it is.
On the other hand, A could have been on the phone with B. "I just sent it, did you get it yet? Now? Now? Get it now?"
Thus, there is some kind of underlying, implied level of performance and response. In the end, "eventually", A's outbox matches B inbox.
These delays, the acceptance of stale data, whether its a day old or 1-5s old, are what control the ultimate coupling of your systems. The looser this requirement, the looser the coupling, and the more flexibility you have at your disposal in terms of design.
This is true down to the cores in your CPU. Modern, multi core, multi-threaded applications running on the same system, can have different views of the "same" data, only microseconds out of date. If your code can work correctly with data potentially inconsistent with each other, then happy day, it zips along. If not you need to pay special attention to ensure your data is completely consistent, using techniques like volatile memory qualifies, or locking constructs, etc. All of which, in their way, cost performance.
So, this is the base consideration. All of the other decisions start here. Answering this can tell you how to partition applications across machines, what resources are shared, and how they are shared. What protocols and techniques are available to move the data, and how much it will cost in terms of processing to perform the transfer. Replication, load balancing, data shares, etc. etc. All based on this concept.
Edit, in response to first comment.
Correct, exactly. The game here, for example, if B can't change customer data, then what is the harm with changed customer data? Can you "risk" it being out of date for a short time? Perhaps your customer data comes in slowly enough that you can replicate it from A to B immediately. Say the change is put on a queue that, because of low volume, gets picked up readily (< 1s), but even still it would be "out of transaction" with the original change, and so there's a small window where A would have data that B does not.
Now the mind really starts spinning. What happens during that 1s of "lag", whats the worst possible scenario. And can you engineer around it? If you can engineer around a 1s lag, you may be able to engineer around a 5s, 1m, or even longer lag. How much of the customer data do you actually use on B? Maybe B is a system designed to facilitate order picking from inventory. Hard to imagine anything more being necessary than simply a Customer ID and perhaps a name. Just something to grossly identify who the order is for while it's being assembled.
The picking system doesn't necessarily need to print out all of the customer information until the very end of the picking process, and by then the order may have moved on to another system that perhaps is more current with, especially, shipping information, so in the end the picking system doesn't need hardly any customer data at all. In fact, you could EMBED and denormalize the customer information within the picking order, so there's no need or expectation of synchronizing later. As long as the Customer ID is correct (which will never change anyway) and the name (which changes so rarely it's not worth discussing), that's the only real reference you need, and all of your pick slips are perfectly accurate at the time of creation.
The trick is the mindset, of breaking the systems up and focusing on the essential data that's necessary for the task. Data you don't need doesn't need to be replicated or synchronized. Folks chafe at things like denormalization and data reduction, especially when they're from the relational data modeling world. And with good reason, it should be considered with caution. But once you go distributed, you have implicitly denormalized. Heck, you're copying it wholesale now. So, you may as well be smarter about it.
All this can mitigated through solid procedures and thorough understanding of workflow. Identify the risks and work up policy and procedures to handle them.
But the hard part is breaking the chain to the central DB at the beginning, and instructing folks that they can't "have it all" like they may expect when you have a single, central, perfect store of information.
This is definitely not a comprehensive reply. Sorry, for my long post and I hope it adds to thoughts that would be presented here.
I have a few observations on some of the aspect that you mentioned.
duplicate data
It has been my experience that this is usually a side effect of departmentalization or specialization. A department pioneers collection of certain data that is seen as useful by other specialized groups. Since they don't have unique access to this data as it is intermingled with other data collection, in order to utilize it, they too start collecting / storing the data, inherently making it duplicate. This issue never goes away and just like there is a continuos effort in refactoring code and removing duplication, there is a need to continuously bring duplicate data for centralized access, storage and modification.
well-defined interfaces
Most interfaces are defined with good intention keeping other constraints in mind. However, we simply have a habit of growing out of the constraints placed by previously defined interfaces. Again a case for continuos refactoring.
tight coupling vs loose coupling
If any thing, most software is plagued by this issue. The tight coupling is usually a result of expedient solution given the constraint of time we face. Loose coupling incurs a certain degree of complexity which we dislike when we want to get things done. The web services mantra has been going rounds for a number of years and I am yet to see a good example of solution that completely alleviates the point
architectural simplification
To me this is the key to fighting all the issues you have mentioned in your question. SIP vs H.323 VoIP story comes into my mind. SIP is very simplified, easy to build while H.323 like a typical telecom standard tried to envisage every issue on the planet about VoIP and provide a solution for it. End result, SIP grew much more quickly. It is a pain to be H.323 compliant solution. In fact, H.323 compliance is a mega buck industry.
On a few architectural fads that I have grown up to.
Over years, I have started to like REST architecture for it's simplicity. It provides a simple unique access to data and easy to build applications around it. I have seen enterprise solution suffer more from duplication, isolation and access of data than any other issue like performance etc. REST to me provides a panacea to some of those ills.
To solve a number of those issues, I like the concept of central "Data Hubs". A Data Hub represents a "single source of truth" for a particular entity, but only stores IDs, no information like names etc. In fact, it only stores ID maps - for example, these map the Customer ID in system A, to the Client Number from system B, and to the Customer Number in system C. Interfaces between the systems use the hub to know how to relate information in one system to the other.
It's like a central translation; instead of having to write specific code for mapping from A->B, A->C, and B->C, with its attendance exponential increase as you add more systems, you only need to convert to/from the hub: A->Hub, B->Hub, C->Hub, D->Hub, etc.

Resources