Selecting and aggregating data from multiple databases - sql-server

I'm looking for some advice, so apologies if this is the wrong section (I will delete and post in DB Administrators if that is more appropriate).
I have an application that manages legal cases. Each law firm has its own database for managing their own cases and other business processes. Therefore there can be potentially hundreds of databases in total.
A citizen may have cases with many law firms. Some of these cases are at various "stages" of their life cycle (see the example below).
I want my application to show the user an aggregation of all their cases in total, by stage and status. So to the user it will look like something from an email client:
Cases (15)
    Open (10)
        Under Examination (4)
        Escalated (6)
    Closed (5)
To get these numbers I have to select all their cases from each database for each law firm this user has a case with. In pseudo-code I'll have to:
SELECT
c1.Title, c1.Stage
FROM
[#DatabaseName].dbo.Cases c1
WHERE
c1.CustomerID = #CustomerUserID
It has to do this for EACH database (potentially 50+) and then UNION all the results together to get a total for all Open cases, by Stage, that this citizen has.
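For illustration, here is a minimal sketch of that fan-out in T-SQL, assuming a hypothetical helper table dbo.LawFirmDatabases that lists one database name per law firm (the table, variable names and sp_executesql wiring are illustrative, not part of the question):

-- Hypothetical helper table: one row per law-firm database.
-- CREATE TABLE dbo.LawFirmDatabases (DatabaseName sysname NOT NULL);

DECLARE @CustomerUserID int = 12345;   -- the citizen being aggregated
DECLARE @sql nvarchar(max) = N'';

-- Build one SELECT per database, glued together with UNION ALL.
SELECT @sql = @sql
    + CASE WHEN @sql = N'' THEN N'' ELSE N' UNION ALL ' END
    + N'SELECT Stage FROM ' + QUOTENAME(DatabaseName)
    + N'.dbo.Cases WHERE CustomerID = @cust'
FROM dbo.LawFirmDatabases;

-- Aggregate the combined rows by stage for the "Cases (15)" style summary.
SET @sql = N'SELECT Stage, COUNT(*) AS CaseCount FROM ('
         + @sql + N') AS AllCases GROUP BY Stage;';

EXEC sp_executesql @sql, N'@cust int', @cust = @CustomerUserID;

Note UNION ALL rather than UNION: UNION would deduplicate identical rows across firms and silently skew the counts.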
The only other way I can think of to make this simpler is to drop the multi-tenant architecture: just have one Cases table in one database, with a column identifying which law firm each case belongs to. The problem is that this won't scale well if the application grows into other areas such as HR or Finance. It would mean storing the entirety of every law firm's data in one database, which will kill my server and resources.
Is there a better way of tackling this?

Check out the Azure Elastic Scale API.
It provides APIs for handling multiple shards using a shard-management database.
You can create a shard for each law firm.
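The actual Elastic Scale client is a .NET API, but the idea underneath is a central shard-map database. Here is a minimal sketch of that idea in plain T-SQL, with hypothetical table and column names:

-- Central management database: one shard (database) per law firm.
CREATE TABLE dbo.ShardMap (
    LawFirmID    int     NOT NULL PRIMARY KEY,
    ServerName   sysname NOT NULL,   -- where that firm's database lives
    DatabaseName sysname NOT NULL
);

-- Hypothetical table recording which firms a citizen has cases with.
-- CREATE TABLE dbo.CitizenLawFirms (CustomerID int, LawFirmID int);

DECLARE @CustomerUserID int = 12345;

-- Step 1: resolve the set of shards to query for this citizen.
SELECT sm.ServerName, sm.DatabaseName
FROM dbo.ShardMap sm
JOIN dbo.CitizenLawFirms clf ON clf.LawFirmID = sm.LawFirmID
WHERE clf.CustomerID = @CustomerUserID;

-- Step 2 happens in the application: connect to each returned database
-- and run the Cases query there. The Elastic Scale client library also
-- ships multi-shard query support that performs this fan-out for you.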

Related

How to design the logic or database when two users choose one product?

Assuming an e-commerce web app has a high volume of requests, how do I prevent two users from choosing the only product left? Should I check the quantity when the item is added to the shopping list, or at payment? Is using a field in the DB to record the quantity of the selected product a bad approach? How do large e-commerce web apps like Amazon deal with this conflict problem?
Several options that I know of:
For an RDBMS that supports ACID transactions, you can use an optimistic locking technique on the product table (see the sketch after these options). Unless many users very often hit the buy button on the same product at nearly the same time, it should work pretty well. (How many users "many" means is something you have to measure; I'd guess 1k should be no problem, but don't take that for granted.)
Don't check it, and let users buy anyway; adjust the business flow to handle it. For example, when a user hits the buy button, tell them the order has been accepted and will be processed, without guaranteeing they will get the item. Then, at a later stage, when you find there is not enough inventory to ship the product, send an email to apologise and issue a refund.
Also, in real businesses it is common for product inventory to go negative while orders are still accepted, with the user told they will get the product XXX days later. The business can then produce or order more product from the supplier after receiving the money.
If you buy an iPhone on the Apple website, it works like this too.
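As a sketch of the first option, assuming a hypothetical Product table with a version column:

-- Hypothetical table:
-- CREATE TABLE Product (id INT PRIMARY KEY,
--                       quantity INT NOT NULL,
--                       version  INT NOT NULL);

-- 1. Read the current state; no lock is held.
SELECT quantity, version FROM Product WHERE id = 42;
-- suppose this returns quantity = 1, version = 7

-- 2. Attempt the purchase, conditional on the version we read.
UPDATE Product
SET    quantity = quantity - 1,
       version  = version + 1
WHERE  id = 42
  AND  version  = 7      -- nobody changed the row since step 1
  AND  quantity > 0;     -- never oversell

-- 3. If the UPDATE affected 0 rows, another buyer won the race:
--    re-read and retry, or tell the user the product is sold out.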
It really depends on the number of concurrent users here. In the case of millions, a NoSQL approach is preferred to manage the basket with eventual consistency; the buying process then goes through ACID to ensure the product can actually be sold.
For fewer users, you can rely on an ACID database.
If you are not sure, you may go with a database that has ACID capabilities but can also work in an eventually consistent way, or that can implement sharding for scalability. To my knowledge Oracle can do all three: COMMIT NOWAIT, COMMIT, and sharded deployment.
HTH

How to query when data is spread across different microservices?

I'm new to microservices architecture and I'm facing this problem:
I have a platform where basically Users manage the accounting of their Clients.
I have one microservice in charge of the security. This one manages which Users have access to which Clients.
Then I have another microservice that manages the Invoices of the Clients.
One of the functions here would be: given a logged-in User, list all the Invoices of all the Clients that User has access to.
For that, I thought I should ask the Security microservice for the list of Clients the User has access to, and then query the Invoice database, filtering by all those clients.
The problem is that I end up with a horrible query, as it's something like:
SELECT * FROM Invoice WHERE clientId IN (CLI1, CLI2, CLI3, ...) -- Potentially 200 clients
I thought about keeping a copy of the User-Client relation in the Invoice database, or having both microservices share the same database. But neither convinces me, as I have more microservices that may face the same problem, leading to huge duplication of data or to a big monolithic database.
Is there a better way to do this?
Thanks in advance!
In general, database access is restricted across services, and keeping information you do not own tends to be a tiresome process, as you will always be fighting to keep that data in sync with its intended source of truth.
So the only options you have are the ones you have already mentioned in the question.
You do end up with a horrible query. But is it horrible when you write it, or will it lag in performance?
It depends: if you are using MySQL, you can always perform joins over subqueries, but joins have their own cost, and even then it is worth checking whether subqueries give you the expected performance in these cases (see the sketch below).
If it's feasible, you can explore other databases that are better optimized for queries like these.
Or, worst case, you can copy the data, but then you have to put in a lot of effort to keep the copies in sync.
None of these choices is straightforward, and there will be trade-offs you need to make a call on.
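As a sketch of that join idea in MySQL, with hypothetical names: stage the IDs returned by the Security service in a temporary table instead of splicing 200 literals into the statement.

-- IDs come back from the Security microservice; stage them once.
CREATE TEMPORARY TABLE authorized_client (clientId VARCHAR(20) PRIMARY KEY);
INSERT INTO authorized_client (clientId)
VALUES ('CLI1'), ('CLI2'), ('CLI3');   -- ...one row per authorized client

-- Join instead of a giant IN list; with an index on Invoice.clientId
-- the optimizer can drive the lookup from the small temp table.
SELECT i.*
FROM   Invoice i
JOIN   authorized_client a ON a.clientId = i.clientId;

The query text stays fixed no matter how many clients the user can see, which also keeps prepared-statement reuse simple.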

Bad real-world database schemas

Our master's thesis project is creating a database schema analyzer. As a foundation for this, we are working on quantifying bad database design.
Our supervisor has tasked us with analyzing a real world schema, of our choosing, such that we can identify some/several design issues. These issues are to be used as a starting point in the schema analyzer.
Finding a good schema is a bit difficult because we do not want a schema which is well designed in all aspects, but a schema that is more "rare to medium".
We have already scheduled the following schemas for analysis: Wikimedia, Moodle and Drupal. Not sure which category each fits into. It is not necessary that the schema be open source.
The database engine used is not important, though we would like to focus on SQL Server, PostgreSQL and Oracle.
For now, literature will be deferred, as this task is supposed to give us real-world examples that can be used in the thesis, i.e. "Design X is perceived by us as bad design, which our analyzer identifies and suggests improvements to", instead of coming up with contrived examples.
I will update this post when we have some kind of a tool ready.
Check the Dell DVD Store; you can use it for free.
"The Dell DVD Store is an open source simulation of an online ecommerce site with implementations in Microsoft SQL Server, Oracle and MySQL along with driver programs and web applications."
Bill Karwin has written a great book about bad designs: SQL Antipatterns.
I'm working on a project that includes a geographical information system, and in my opinion these designs are often "medium" to "rare".
Here are some examples:
1) Geonames.org
You can find the data and the schema here: http://download.geonames.org/export/dump/ (scroll down to the bottom of the page for the schema; it's in plain text on the site!)
It'd be interesting to see how this DB design performs with such a HUGE amount of data!
2) OpenGeoDB
This one is very popular in German-speaking countries (Germany, Austria, Switzerland) because it's a database containing nearly every city/town/village in the German-speaking region with zip code, name, hierarchy and coordinates.
It comes with a .sql schema, and the table fields are in English, so this shouldn't be a problem.
http://fa-technik.adfc.de/code/opengeodb/
The interesting thing in both examples is how they managed the hierarchy of entities like Country -> State -> County -> City -> Village etc.
PS: Maybe you could judge my DB design too ;) DB Schema of a Role Based Access Control
vBulletin has a really bad database schema.
"we are working on quantifying bad database design."
It seems to me like you are developing a model, or process, or apparatus, that takes a relational schema as input and scores it for quality.
I invite you to ponder the following:
Can a physical schema be "bad" while the logical schema is nonetheless "extremely good"? Do you intend to distinguish properly between "logical schema" and "physical schema"? How do you dream to achieve that?
How do you decide that a certain aspect of physical design is "bad"? Take for example the absence of some index. If the relvar that that "supposedly desirable index" is to be on is itself constrained to be a singleton, then what detrimental effects would the absence of that index cause for the system? If there are no such detrimental effects, then what grounds are there for qualifying the absence of such an index as "bad"?
How do you decide that a certain aspect of logical design is "bad"? Choices in logical design are made as a consequence of what the actual requirements are. How can you make any judgment whatsoever about a logical design without a formalized and machine-readable way to specify what the actual requirements are?
Wow - you have an ambitious project ahead of you. Determining what constitutes good database design may be impossible, beyond broadly understood principles and guidelines.
Here are a few ideas that come to mind:
I work for a company that does database management for several large retail companies. We have custom databases designed for each of these companies, according to how they intend for us to use the data (for direct mail, email campaigns, etc.), and what kind of analysis and selection parameters they like to use. For example, a company that sells musical equipment in stores and online will want to distinguish between walk-in and online customers, categorize the customers according to the type of items they buy (drums, guitars, microphones, keyboards, recording equipment, amplifiers, etc.), and keep track of how much they spent, and what they bought, over the past 6 months or the past year. They use this information to decide who will receive catalogs in the mail. These mailings are very expensive; maybe one or two dollars per customer, so the company wants to mail the catalogs only to those most likely to buy something. They may have 15 million customers in their database, but only 3 million buy drums, and only 750,000 have purchased anything in the past year.
If you were to analyze the database we created, you would find many "work" tables, that are used for specific selection purposes, and that may not actually be properly designed, according to database design principles. While the "main" tables are efficiently designed and have proper relationships and indexes, these "work" tables would make it appear that the entire database is poorly designed, when in reality, the work tables may just be used a few times, or even just once, and we haven't gone in yet to clear them out or drop them. The work tables far outnumber the main tables in this particular database.
One also has to take into account the volume of the data being managed. A customer base of 10 million may have transaction data numbering 10 to 20 million transactions per week. Or per day. Sometimes, for manageability, this data has to be partitioned into tables by date range, and then a view would be used to select data from the proper sub-table. This is efficient for this huge volume, but it may appear repetitive to an automated analyzer.
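For example, that date-range split is what SQL Server calls a partitioned view. A minimal sketch with hypothetical weekly transaction tables (the CHECK constraints are what let the optimizer skip sub-tables that can't match a date filter):

-- Hypothetical weekly sub-tables, each constrained to its range.
CREATE TABLE dbo.Trans_W01 (
    TransID   bigint NOT NULL,
    TransDate date   NOT NULL CHECK (TransDate >= '2024-01-01' AND TransDate < '2024-01-08'),
    Amount    money  NOT NULL
);
CREATE TABLE dbo.Trans_W02 (
    TransID   bigint NOT NULL,
    TransDate date   NOT NULL CHECK (TransDate >= '2024-01-08' AND TransDate < '2024-01-15'),
    Amount    money  NOT NULL
);

-- One logical table for the application; queries filtered on TransDate
-- only touch the sub-tables whose CHECK constraints can match.
CREATE VIEW dbo.Transactions AS
    SELECT TransID, TransDate, Amount FROM dbo.Trans_W01
    UNION ALL
    SELECT TransID, TransDate, Amount FROM dbo.Trans_W02;

An automated analyzer that only counts tables would flag dozens of Trans_* tables as repetition, which is exactly the point being made.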
Your analyzer would need to be user configurable before the analysis began. Some items must be skipped, while others may be absolutely critical.
Also, how does one analyze stored procedures and user-defined functions, etc? I have seen some really ugly code that works quite efficiently. And, some of the ugliest, most inefficient code was written for one-time use only.
OK, I am out of ideas for the moment. Good luck with your project.
If you can get ahold of it, the project management system Clarity has a horrible database design. I don't know if they have a trial version you can download.

What is the best database design for thousands of rows

I'm about to start a Database Design that will simply manage users under companies.
Each company will have an admin area where it can manage its users
Each company will have around 25,000 users
The client believes there will be around 50 companies to start
My main question is: should I create tables based on companies? Like
users_company_0001, users_company_0002, users_company_0003, ...
as each company will never use "other" companies' users, and nothing will ever need to sum/count across the different user_company tables (if that big-picture view were ever needed, a simple JOIN would do the trick - more expensive in time, but workable - though in practice it will never be needed).
Or should I just create one users table holding (50 x 25,000) 1,250,000 users (and growing)?
I'm leaning towards the first option, though I'm not sure how I would use Entity Framework on such a layout... I would probably need to go back to the '90s and generate my data logic layer by hand, since everything would be simple calls to stored procedures taking the company ID.
What would you suggest?
The application will be ASP.NET (probably MVC; I'm still trying to figure that out, as all my knowledge is in WebForms, though I've watched Scott Hanselman's MVC videos - it seems easy, but I know it won't be, as problems will come up and take time to fix), plus Microsoft SQL Server.
Even though you've described this as a one-to-many relationship, I'd still design the DB as many-to-many to guard against a future change in requirements. Something like:
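Here is a minimal sketch of that layout, with hypothetical table and column names:

-- Companies and users as independent entities.
CREATE TABLE Company (
    CompanyID   int           NOT NULL PRIMARY KEY,
    CompanyName nvarchar(200) NOT NULL
);
CREATE TABLE [User] (
    UserID   int           NOT NULL PRIMARY KEY,
    UserName nvarchar(200) NOT NULL
);

-- Junction table: today each user belongs to one company, but this
-- shape survives a future "users can belong to several companies".
CREATE TABLE CompanyUser (
    CompanyID int NOT NULL REFERENCES Company (CompanyID),
    UserID    int NOT NULL REFERENCES [User] (UserID),
    PRIMARY KEY (CompanyID, UserID)   -- also serves "all users of a company"
);

-- Reverse lookup: which company (or companies) a given user belongs to.
CREATE INDEX IX_CompanyUser_UserID ON CompanyUser (UserID);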
Having worked with a multi-terabyte SQL Server database, and having experience with hundreds of tables with multi-million row counts over the course of my career, I can tell you with full assurance that SQL Server can handle your company and users tables without partitioning. It's always there when you need it, but your worry shouldn't be about your tables - pick the simplest schema that meets your needs.
If you want to do something to optimize performance, your bottleneck will almost assuredly be your disks. Don't buy large, slow disks. Get yourself a bunch of small, high-RPM disks, spread your data out across them as much as possible, and don't share disks between your logs and your data.
With databases, you're almost always better off achieving performance with good hardware, a good disk subsystem, and proper indexing. Don't compromise and overcomplicate your schema trying to anticipate performance - you'll regret it. I've seen really big databases where that sort of thing was necessary, but yours ain't it.
re: Should I create tables based on Companies?
Yes.
Like users_company_0001, users_company_0002, users_company_0003?
No. Like:
companyID, companyName, contactID
re: Or should I just create a users table to have (50 x 25,000) 1,250,000 users (and growing)?
Yes.
I think you should create separate tables for Company and User, then a third table to connect the two: CompanyAdmin. Something like:
Company(Company_Id, Company_name, ...)
User(User_Id, User_name, ...)
CompanyAdmin(Company_id, User_id)
This way you can add users and/or companies without affecting the number of tables you need to manage. It is generally a bad design when you need to modify the database (i.e. add tables) as new data (companies) are added to the system.
With proper indexing, the join costs in a database containing a few million rows should not be a problem.
Finally, if you ever need to change or record additional information about Companies, Users or the relationship between them, this setup will have the least impact on your application.

A few database design questions relating to a user content site

Designing a user content website (kind of similar to Yelp but for a different market, and with photo sharing) and had a few database questions:
1. Does each user get their own set of tables, or are we storing multiple users' data in common tables? Since this is essentially a social network, databases are usually partitioned off for scalability as the user base grows, with different sets of users served separately, so what is the best approach? I guess some data, like user accounts, can live in common tables, but for wall posts, photos etc. each user will get their own table? If so, then with 10 million users that means 10 million times however many tables per user? This is currently being designed in MySQL.
2. How does the system know which user tables to create each time a user joins the site? I am assuming there is a system table template from which it pulls the fields?
3. In addition to the above: if tomorrow we modify tables or add/remove features, how do the changes roll down to all the live user accounts/tables? I know from a page point of view we have the master template, but for the database, how will the user tables be updated? Is that something we do manually, or will each table keep checking, say every 24 hrs, against the system tables for updates to its structure?
4. If the above is all true, that means we are maintaining one master set of tables with system default values, and each user gets the same values copied to their tables? Take a field like maximum failed login attempts before the system locks the account: the system default is 5 attempts within 30 minutes, but I want to allow users to specify their own number to customize their own security, so they can override the system default in their own table?
Thanks.
1. Users should not get their own set of tables. It will most likely not perform as well as one (properly indexed) table, and schema changes would have to be deployed to every user's tables.
2. You could have default values specified on the table for things that are optional.
3. With difficulty. With one set of tables it will be a lot easier, and probably faster.
4. That sort of data should be stored in a User Preferences table holding all preferences for all users. Again, don't duplicate the schema per user.
Generally, creating a separate set of tables per entity (in this case per user) is not a good idea. If each user's data lives in separate tables, querying becomes cumbersome.
If your table is large, you should optimize it with indexes. If it gets very large, you may also want to look into partitioning it.
Partitioning lets you treat the table as one object even though it is physically split up - the DBMS handles most of the work and presents you with a single object. You SELECT, INSERT, UPDATE and ALTER as normal, and the DB figures out which partition the SQL refers to and performs the command.
Not splitting the tables up by user, and instead using indexes and partitions, deals with scalability while maintaining performance. And if you don't split up the tables manually, points 2, 3, and 4 become moot.
Here's a link to partitioning tables (SQL Server-specific):
http://databases.about.com/od/sqlserver/a/partitioning.htm
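Since the question says the design is in MySQL, here is a hedged sketch of what partitioning looks like there (MySQL 5.1+; the table and column names are made up):

-- One wall_post table for all users, hash-partitioned by user_id.
CREATE TABLE wall_post (
    post_id   BIGINT   NOT NULL,
    user_id   BIGINT   NOT NULL,
    body      TEXT,
    posted_at DATETIME NOT NULL,
    PRIMARY KEY (post_id, user_id)   -- the partitioning column must appear in every unique key
)
PARTITION BY HASH (user_id)
PARTITIONS 16;

-- Plain SQL; MySQL routes the statement to the right partition itself.
SELECT * FROM wall_post WHERE user_id = 12345 ORDER BY posted_at DESC LIMIT 20;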
It doesn't make any kind of sense to me to create a set of tables for each user. If you have a common set of tables for all users then I think that avoids all the issues you are asking about.
It sounds like you need to locate a primer on relational database design basics. Regardless of the type of application you are designing, you should start there. Learn how joins work, indices, primary and foreign keys, and so on. Learn about basic database normalization.
It's not customary to create new tables on-the-fly in an application; it's usually unnecessary in a properly designed schema. Usually schema changes are done at deployment time. The only time "users" get their own tables is an artifact of a provisioning decision, wherein each "user" is effectively a tenant in a walled-off garden; this only makes sense if each "user" (more likely, a company or organization) never needs access to anything that other users in the system have stored.
There are mechanisms for dealing with loosely structured information in databases, but if you find yourself reaching for them often (the most common method is called Entity-Attribute-Value), your problem is either not quite correctly modeled, or you may not actually need a relational database, in which case you might be better off with a document-oriented database like CouchDB/MongoDB.
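For reference, EAV in its simplest form looks like the sketch below (hypothetical names). The point above stands: if much of your data lands here instead of in real columns, re-model or consider a document store.

-- One row per attribute per entity, with values stored as text.
CREATE TABLE entity_attribute (
    entity_id BIGINT       NOT NULL,
    attribute VARCHAR(64)  NOT NULL,    -- e.g. 'favorite_color'
    value     VARCHAR(255) NULL,        -- everything is stringly typed
    PRIMARY KEY (entity_id, attribute)
);

-- Reading one "record" means pivoting rows back into columns, which is
-- exactly the awkwardness that makes EAV a last resort.
SELECT
    MAX(CASE WHEN attribute = 'favorite_color' THEN value END) AS favorite_color,
    MAX(CASE WHEN attribute = 'hometown'       THEN value END) AS hometown
FROM entity_attribute
WHERE entity_id = 42;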
Adding, based on your updated comments/notes:
Your concerns about the number of records in a particular table are most likely premature. Get something working first. Most modern DBMSes, including newer versions of MySql, support mechanisms beyond indices and clustered indices that can help deal with large numbers of records. To wit, in MS Sql Server you can create a partition function on fields on a table; MySql 5.1+ has a few similar partitioning options based on hash functions, ranges, or other mechanisms. Follow well-established conventions for database design modeling your domain as sensibly as possible, then adjust when you run into problems. First adjust using the tools available within your choice of database, then consider more drastic measures only when you can prove they are needed. There are other kinds of denormalization that are more likely to make sense before you would even want to consider having something as unidiomatic to database systems as a "table per user" model; even if I were to look at that route, I'd probably consider something like materialized views first.
I agree with the comments above that say that a table per user is a bad idea. Also, while it's a good idea to have strategies in mind now for how you can cope when things get really big, I'd concentrate on getting things right for a small number of users first - if no-one wants to / is able to use your service, then unfortunately you won't be faced with the problem of lots of users.
A common approach among very large sites is database sharding. The summary is: you have N instances of your database in parallel (on separate machines), and each holds 1/N of the total data. There's some shared way of knowing which instance holds a given bit of data. To access some data you have 2 steps, rather than the 1 you might expect:
Work out which shard holds the data
Go to that shard for the data
There are problems with this, such as: you set up, say, 8 shards and they all fill up, so you want to spread the data over, say, 20 shards -> migrating data between shards.
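A minimal sketch of the lookup step, assuming an explicit shard-map table rather than a fixed hash (names are hypothetical). An explicit mapping is what makes the 8-to-20-shard migration tractable, because users can be re-pointed one at a time:

-- Central directory: which shard holds which user.
CREATE TABLE shard_map (
    user_id  BIGINT   NOT NULL PRIMARY KEY,
    shard_no SMALLINT NOT NULL   -- 0..N-1, names a database instance
);

-- Step 1: find the shard for this user.
SELECT shard_no FROM shard_map WHERE user_id = 12345;

-- Step 2 happens in the application: connect to the database behind
-- that shard_no and run the real query there. Re-sharding a user means
-- copying their rows to the new shard, then updating shard_map.shard_no.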
