I have an old app running on SQL Server (but I suspect the concept I'm asking about applies to most/all major DBs) that got re-written. Aside from the fact that a lot of the UI changes and more modular backend code resulted in different queries, one of the major changes is that the old code used zero explicit transactions.
Yeah, as in if an error happened, you'd be left with orphan records and such. The new app has corrected that, using transactions when there are multiple inserts/updates. That seems to be a no-brainer, but we're finding that we're getting a lot of complaints about performance, particularly from clients that have more data (each client has their own separate DB). Am I correct in assuming that, given the transactions, there's a lot more room for resources to be waiting on locks, which could then drastically hurt performance?
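To make the "orphan records" point concrete, here is a minimal sketch of a multi-insert wrapped in a transaction. It uses Python's built-in sqlite3 module rather than SQL Server, and the orders/order_lines tables and the place_order helper are hypothetical names invented for the example, not anything from the app in question.

```python
import sqlite3


def make_db():
    """Create a throwaway in-memory database with a parent and a child table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE order_lines ("
        " order_id INTEGER NOT NULL,"
        " sku TEXT NOT NULL,"
        " qty INTEGER NOT NULL CHECK (qty > 0))"
    )
    return conn


def place_order(conn, customer, items):
    """Insert an order and all its lines atomically. Returns True on success."""
    try:
        # 'with conn' commits if the block succeeds and rolls back if an
        # exception escapes it, so a failed line also undoes the order row.
        with conn:
            cur = conn.execute(
                "INSERT INTO orders (customer) VALUES (?)", (customer,)
            )
            order_id = cur.lastrowid
            conn.executemany(
                "INSERT INTO order_lines (order_id, sku, qty) VALUES (?, ?, ?)",
                [(order_id, sku, qty) for sku, qty in items],
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Without the transaction, the second call in a sequence like place_order(conn, "bob", [("A1", 1), ("C3", 0)]) would leave an order row and one line behind after the CHECK constraint rejects the zero quantity. The price is that the rows touched stay locked until the commit, which is where the blocking question comes in.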
Btw, another major difference is that the old app relied on stored procedures to a point, whereas the new app does not use them at all. I'm throwing this in here just in case, but I'm really under the impression that transactions being a problem is more likely, especially given the complexity of the queries in the system (tons of queries with lots of joins, subqueries in the SELECT clause, etc.)
Also, it's worth noting we're not talking huge databases/tables. Each client has their own database and clients complaining have tables with a few million records at worst, but not billions/trillions of records or anything like that. While some new queries have been introduced and some have changed, the majority of queries are the same as in the old system, sometimes just running outside of a stored proc when before they were inside of one. Also, a lot of the more complicated queries have been checked and when run on their own, they're fast.
We solved this by setting up a series of jobs/procs packaged as dbWarden. This allowed us to identify that there was a lot of blocking happening in certain cases. Once we figured that out, we made adjustments to avoid the blocking, including making certain operations read/write to different DBs, adjusting indexes, and making some operations query different data stores (ElasticSearch).
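For anyone wondering what that blocking looks like in miniature, here is a rough sketch using Python's sqlite3 module. Note the assumptions: SQLite locks the whole database file for writes, whereas SQL Server locks rows, pages or tables, so this only illustrates the principle that an open transaction makes other writers wait. The account table and the demo_blocking function are made up for the example.

```python
import os
import sqlite3
import tempfile


def demo_blocking():
    """Return (second_writer_was_blocked, final_balance)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        # isolation_level=None: we issue BEGIN/COMMIT ourselves.
        # timeout=0: fail at once instead of waiting for the lock.
        writer_a = sqlite3.connect(path, timeout=0, isolation_level=None)
        writer_b = sqlite3.connect(path, timeout=0, isolation_level=None)
        try:
            writer_a.execute(
                "CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)"
            )
            writer_a.execute("INSERT INTO account VALUES (1, 100)")

            # Writer A opens a transaction and holds the write lock.
            writer_a.execute("BEGIN IMMEDIATE")
            writer_a.execute(
                "UPDATE account SET balance = balance - 10 WHERE id = 1"
            )

            # Writer B tries to start its own write while A is still open.
            try:
                writer_b.execute("BEGIN IMMEDIATE")
                writer_b.execute("ROLLBACK")
                blocked = False
            except sqlite3.OperationalError:  # "database is locked"
                blocked = True

            # Once A commits, B can proceed.
            writer_a.execute("COMMIT")
            writer_b.execute("BEGIN IMMEDIATE")
            writer_b.execute(
                "UPDATE account SET balance = balance + 5 WHERE id = 1"
            )
            writer_b.execute("COMMIT")
            balance = writer_b.execute(
                "SELECT balance FROM account WHERE id = 1"
            ).fetchone()[0]
            return blocked, balance
        finally:
            writer_a.close()
            writer_b.close()
```

With a real timeout instead of 0, writer B would simply sit and wait, which is what users experience as "the app is slow". The longer a transaction stays open, the longer everyone behind it waits.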
Today I found an article online discussing Facebook's architecture (though it's a bit dated). While reading it I noticed under the section Software that helps Facebook scale, the third bullet point states:
Facebook uses MySQL, but primarily as a key-value persistent storage,
moving joins and logic onto the web servers since optimizations are
easier to perform there (on the “other side” of the Memcached layer).
Why move complex joins to the web server? Aren't databases optimized to perform join logic? This methodology seems contrary to what I've learned up to this point, so maybe the explanation is just eluding me.
If possible, could someone explain this (an example would help tremendously) or point me to a good article (or two) for the benefits (and possibly examples) of how and why you'd want to do this?
I'm not sure about Facebook, but we have several applications where we follow a similar model. The basis is fairly straightforward.
The database contains huge amounts of data. Performing joins at the database level really slows down any queries we make on the data, even if we're only returning a small subset. (Say 100 rows of parent data, and 1000 rows of child data in a parent-child relationship for example)
However, using .NET DataSet objects, if we select the rows we need and then create DataRelation objects within the DataSet, we see a dramatic boost in performance.
I can't answer why this is, as I'm not knowledgeable about the internal workings of either, but I can venture a guess...
The RDBMS (SQL Server in our case) has to deal with the data that lives in files. These files are very large, and only so much of them can be loaded into memory, even on our heavy-hitter SQL Servers, so there is a penalty of disk I/O.
When we load a small portion of it into a Dataset, the join is happening entirely in memory, so we lose the I/O penalty of going to the disk.
Even though I can't explain the reason for the performance boost completely (and I'd love to have someone more knowledgeable tell me if my guess is right) I can tell you that in certain cases, when there is a VERY large amount of data, but your app only needs to pull a small subset of it, there is a noticeable boost in performance by following the model described. We've seen it turn apps that just crawl into lightning-quick apps.
But if done improperly, there is a penalty - if you overload the machine's RAM by doing it inappropriately or in every situation, then you'll have crashes or performance issues as well.
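The pattern described above (two simple selects, then relate the rows in memory) can be sketched roughly like this. It is in Python with sqlite3 rather than .NET DataSet/DataRelation, and the parent/child tables and function names are invented for illustration. It shows the shape of the approach only and makes no claim about being faster than a database join.

```python
import sqlite3
from collections import defaultdict


def make_sample_db():
    """Small hypothetical parent/child schema for the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE child (id INTEGER PRIMARY KEY, parent_id INTEGER, label TEXT)"
    )
    conn.executemany(
        "INSERT INTO parent VALUES (?, ?)",
        [(1, "alpha"), (2, "beta"), (3, "gamma")],
    )
    conn.executemany(
        "INSERT INTO child VALUES (?, ?, ?)",
        [(1, 1, "a1"), (2, 1, "a2"), (3, 2, "b1"), (4, 3, "g1")],
    )
    conn.commit()
    return conn


def load_parents_with_children(conn, parent_ids):
    """Fetch the needed subset with two key lookups, then join in memory."""
    ids = list(parent_ids)
    marks = ",".join("?" for _ in ids)
    parents = conn.execute(
        f"SELECT id, name FROM parent WHERE id IN ({marks})", ids
    ).fetchall()
    children = conn.execute(
        f"SELECT parent_id, label FROM child"
        f" WHERE parent_id IN ({marks}) ORDER BY id",
        ids,
    ).fetchall()

    # The "join": group child rows under their parent key in a dict.
    by_parent = defaultdict(list)
    for parent_id, label in children:
        by_parent[parent_id].append(label)
    return {name: by_parent.get(pid, []) for pid, name in parents}
```

The RAM caveat applies directly: this only makes sense when the subset pulled into memory is small.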
We're developing a new eCommerce website and are using NHibernate for the first time. At present we are splitting our data into multiple SQL Server databases, divided per area of functionality. So we have one for UserInfo, one for Orders, one for ProductCatalogue and so on...
Our justification for this decision is twofold really:
the website has the potential to be HUGE (it is a new website for one of the largest online brands in the UK) and we feel that by partitioning our data along functional lines we will be able to move the databases onto their own servers which would give us an easy scaling route should we need it;
my team has always worked this way - partly as a consequence of following the MS Commerce Server pattern from previous projects.
However, reading up on this decision on the internet, we find that the normal response to this sort of model is extremely scathing. "Creating more work for the devs now in order to create more work for the devs later" is one sample comment from Stack Overflow!
In addition, NHibernate is much easier to use with only one database (just one SessionFactory needed). And knowing that Stack Overflow ran off just one box for a long time makes me think that maybe we should not try to be so clever.
So, my question is, "are we correct in thinking that using fine-grained databases might increase our ability to scale or should we sacrifice this for easier development"?
Why don't you just design your database properly and put the files on appropriate disk? Use a cluster if necessary. Creating multiple databases is not an inherently scaling solution. Also - cross database referential integrity? Good luck.
What's your definition of "HUGE"? SQL Server can handle massive databases, but one thing I've learnt is that people often have no idea what constitutes a lot of data.
I've never worked in a project like this. I'm used to databases with several hundred tables, which had never been a problem.
Therefore I can't say if your idea is a good idea, I never tried it. The "my team has always worked this way"-argument is a major driver for many decisions, and I can't even say that it is always wrong.
With NHibernate you organize your data in classes. They can be in different namespaces and assemblies. You usually don't work much with the database directly, you don't need this kind of structure there.
About the scalability argument: I'm not sure if it is really scaling well when you need to access several databases every time. I mean: you always need users and orders and probably more. Then you need to get all this data from several databases.
Agree fully with starskythehutch - keep your related tables together in the same DB. BUT, you may want to consider having separate databases for things that are not related or non-critical to your main product; but that are a part of the app.
For example: if you decide to log every visit/hit to the site in a DB, you should probably keep that in a separate DB.
The reasons you should consider this:
1. Huge number of transactions - say hundreds of thousands / sec. Having non-critical un-related stuff in a separate DB will ensure that tlog contentions because of this are avoided.
2. Restore, DBCC CHECKDB, backup times. If you stuff your non-related non-critical stuff in your main DB, you are essentially increasing the size of your DB and it will affect these operations. Having it in a separate DB will help you improve performance of these operations.
A lot of web applications having a 3 tier architecture are doing all the processing in the app server and use the database for persistence just to have database independence. After paying a huge amount for a database, doing all the processing including batch at the app server and not using the power of the database seems to be a waste. I have a difficulty in convincing people that we need to use best of both worlds.
What "power" of the database are you not using in a 3-tier architecture? Presumably we exploit SQL to the full, and all the data management, paging, caching, indexing, query optimisation and locking capabilities.
I'd guess that the argument is where what we might call "business logic" should be implemented. In the app server or in database stored procedure.
I see two reasons for putting it in the app server:
1). Scalability. It's comparatively hard to add more database engines if the DB gets too busy. Partitioning data across multiple databases is really tricky. So instead pull the business logic out to the app server tier. Now we can have many app server instances all doing business logic.
2). Maintainability. In principle, Stored Procedure code can be well-written, modularised and reusable. In practice it seems much easier to write maintainable code in an OO language such as C# or Java. For some reason re-use in Stored Procedures seems to happen by cut and paste, and so over time the business logic becomes hard to maintain. I would concede that with discipline this need not happen, but discipline seems to be in short supply right now.
We do need to be careful to truly exploit the database query capabilities to the full, for example avoiding pulling large amounts of data across to the app server tier.
It depends on your application. You should set things up so your database does things databases are good for. An eight-table join across tens of millions of records is not something you're going to want to handle in your application tier. Nor is performing aggregate operations on millions of rows to emit little pieces of summary information.
On the other hand, if you're just doing a lot of CRUD, you're not losing much by treating that large expensive database as a dumb repository. But simple data models that lend themselves to application-focused "processing" sometimes end up leading you down the road to creeping unforeseen inefficiencies. Design knots. You find yourself processing recordsets in the application tier. Looking things up in ways that begin to approximate SQL joins. Eventually you painfully refactor these things back to the database tier where they run orders of magnitude more efficiently...
So, it depends.
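As a small illustration of the difference between "database does the aggregate" and "application processes the recordset", here is a sketch in Python with sqlite3. The sales table and the function names are hypothetical. Both functions return the same totals, but the second one has to pull every row across to the application first.

```python
import sqlite3


def make_sales_db():
    """Tiny hypothetical sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (customer TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("acme", 100), ("acme", 250), ("zen", 40)],
    )
    conn.commit()
    return conn


def totals_in_database(conn):
    """Let the database aggregate; only one row per customer comes back."""
    rows = conn.execute(
        "SELECT customer, SUM(amount) FROM sales GROUP BY customer"
    )
    return dict(rows)


def totals_in_application(conn):
    """Pull every detail row and add them up in the application tier."""
    totals = {}
    for customer, amount in conn.execute("SELECT customer, amount FROM sales"):
        totals[customer] = totals.get(customer, 0) + amount
    return totals
```

With three rows nobody cares. With millions of rows the second version ships all of them over the wire to produce a handful of numbers, which is the creeping inefficiency described above.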
No. They should be used for business rules enforcement as well.
Alas the DBMS big dogs are either not competent enough or not willing to support this, making this ideal impossible, and keeping their customers hostage to their major cash cows.
I've seen one application designed (by a pretty smart guy) with tables of the form:
id | one or two other indexed columns | big_chunk_of_serialised_data
Access to that in the application is easy: there are methods that will load one (or a set) of objects, deserialising it as necessary. And there are methods that will serialise an object into the database.
But as expected (but only in hindsight, sadly), there are so many cases where we want to query the DB in some way outside that application! This is worked around in various ways: an ad-hoc query interface in the app (which adds several layers of indirection to getting the data); reuse of some parts of the app code; hand-written deserialisation code (sometimes in other languages); and simply having to do without any fields that are in the deserialised chunk.
I can readily imagine the same thing occurring for almost any app: it's just handy to be able to access your data. Consequently I think I'd be pretty averse to storing serialised data in a real DB -- with possible exceptions where the saving outweighs the increase in complexity (an example being storing an array of 32-bit ints).
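To show why the serialised chunk hurts, here is a rough sketch in Python with sqlite3 and JSON as the serialisation format (the real app's format is not stated, so JSON is purely an assumption, as are the item table and function names). A field that lives in a real column can be filtered by the database; a field buried in the chunk forces you to read and deserialise every row.

```python
import json
import sqlite3


def make_item_db():
    """Table shaped like: id | indexed column | big chunk of serialised data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE item (id INTEGER PRIMARY KEY, kind TEXT, payload TEXT)"
    )
    conn.execute("CREATE INDEX idx_item_kind ON item (kind)")
    rows = [
        (1, "book", {"title": "SQL", "pages": 300}),
        (2, "book", {"title": "ORM", "pages": 120}),
        (3, "dvd", {"title": "Joins", "minutes": 90}),
    ]
    conn.executemany(
        "INSERT INTO item VALUES (?, ?, ?)",
        [(i, kind, json.dumps(data)) for i, kind, data in rows],
    )
    conn.commit()
    return conn


def find_by_column(conn, kind):
    """The database can answer this directly from the indexed column."""
    rows = conn.execute(
        "SELECT id FROM item WHERE kind = ? ORDER BY id", (kind,)
    )
    return [row[0] for row in rows]


def find_by_serialised_field(conn, field, value):
    """Every row must be fetched and deserialised before it can be tested.

    Returns (matching ids, number of rows deserialised).
    """
    matches, scanned = [], 0
    for item_id, payload in conn.execute(
        "SELECT id, payload FROM item ORDER BY id"
    ):
        scanned += 1
        if json.loads(payload).get(field) == value:
            matches.append(item_id)
    return matches, scanned
```

Anything outside the app (a report, an ad-hoc query, another language) has to reimplement the second function, which is exactly the set of workarounds listed above.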
I'm interested in database refactoring. I deal with several databases that don't have a large amount of data, just a few GB with at most a few hundred thousand rows. However, they have hundreds -- sometimes many hundreds -- of tables, views, sprocs and functions. In some places a divide-and-rule strategy using schemas has been implemented which has helped some problems of seeing ownership/usage of tables. However, it hasn't really helped object coupling.
We all read that integration via shared database isn't A Good Thing, but we also know that it is, at least for a while, a very productive thing as everything is in the database. We just don't apply the Single Responsibility Principle to databases like we do to objects.
Edit: I should add that I have no database performance issues. The tables are not large, the biggest has only a few hundred thousand rows. There is no real database performance issue; except when the database schema/logic/implementation is grotesquely inefficient (say requiring a cursor to do a sproc execution for each row in a result set in order to pre-process data for a report). Before you say I should change these, that is the whole point: I can't because the database is no longer in a state where the impact of changes can be assessed.
Clearly at some point you say "Enough!" and divide into multiple databases connected by messages, ETL, application tiers etc etc
The question is: how many is too many? What is the absolute upper limit of the number of sprocs/tables/functions that you can have before you go insane?
First, stop trying to think of databases in object oriented terms. Principles of object oriented programming simply do NOT apply to relational databases.
Shared databases are a very good thing from a business perspective. Multiple databases storing information that has to be transferred between them quickly becomes way more complex than your piddly many hundreds of objects. Data that is consistent between enterprise applications is priceless. Trying to reconcile if GE Corp and General Electric Corporation are really the same entity between two databases can be a nightmare.
Refactoring databases is a nice goal, but it is very complex in reality. Don't do it unless you have a major performance issue that needs to be addressed or unless you are willing to commit to a process of identifying all the code that might be affected by a change. Even then, consider if you can know all the code that might change (this is one reason why database people hate, hate, hate dynamic code!).
Often the best way to refactor is to add your change and start changing over to using your new field, sp etc while leaving the old one in place until a set expiration date. Since you are on an annual cycle, you will need to manage those dates over a long period of time. To see if sps are being used, you can identify the ones you aren't sure of and add some code to them to insert to a table every time they are run. If after your whole year cycle, they haven't been run, you can safely eliminate them. The cycle may be shorter depending on the sp.
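The usage-logging idea can be sketched as follows. In SQL Server you would put the INSERT at the top of each suspect sp; since this sketch is in Python with sqlite3, it uses a decorator on ordinary functions as a stand-in for the procs. The proc_usage table, track_usage and never_called are all hypothetical names for illustration.

```python
import sqlite3
from functools import wraps


def make_usage_db():
    """Table that records one row per execution of a tracked routine."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE proc_usage ("
        " proc_name TEXT NOT NULL,"
        " run_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def track_usage(conn):
    """Decorator that logs each call, like an INSERT added to an sp."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            conn.execute(
                "INSERT INTO proc_usage (proc_name) VALUES (?)",
                (func.__name__,),
            )
            conn.commit()
            return func(*args, **kwargs)
        return wrapper
    return decorator


def never_called(conn, known_names):
    """Names that have no usage rows: candidates for removal after the cycle."""
    used = {
        row[0]
        for row in conn.execute("SELECT DISTINCT proc_name FROM proc_usage")
    }
    return sorted(set(known_names) - used)
```

After the full cycle has passed, whatever never_called still reports is what you can consider retiring, subject to the risk review mentioned below.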
If I'm writing something that will only be run annually, I would normally put the word annual in the sp name. That may not be true where you are, but the function of the sp should give you an idea of whether it is something that should only be run periodically. I wouldn't expect a usp_send_email proc to only run once a year, but I might expect that a usp_attendance_report might not be run often. Of course, as I said, I would have named it something more like usp_annual_attendance_report, and you can consider doing that sort of thing moving forward.
But be aware that any refactoring you do will have to take place on a long cycle to ensure that you don't delete something you need. If your code is in a source control system (and all database tables, sp, views, UDFs, triggers, etc should be), you can probably eliminate some things knowing that if they fail you can pretty instantly put them back. Again, I'd examine the object to determine the possible risk eliminating them would have.
Of course if you have good automated tests in place, eliminating something on dev and running the tests can help you find out if something is still being referenced.
If you are looking for an easy way to refactor, I don't know of one. Refactoring databases is a time-consuming, risky activity and one which may not show enough improvement for the powers that be to be willing to pay for it.
A good book on refactoring databases is: http://www.amazon.com/Refactoring-Databases-Evolutionary-Addison-Wesley-Signature/dp/0321293533
I'm not sure there is a magical limit for any of the things you mentioned. I prefer to keep things in one place so I don't have to remember that some records are in place and other records are in another.
I'd be more interested to know if all this work is impacting your performance? And if it's not then why change it? Unless it's impacting performance in some horrible way your customers won't see any benefit from your work and then what's the point?
Your customers might be better served if you just bought a new machine or upgraded your database server software.
I've just started my first development job for a reasonably sized company that has to manage a lot of data. An average database is 6 GB (from what I've seen so far). One of the jobs is reporting. How it's done currently is -
Data is replicated and transferred onto a data warehouse. From there, all the data required for a particular report is gathered (thousands of rows and lots of tables) and aggregated to a reports database in the warehouse. This is all done with stored procedures.
When a report is requested, a stored procedure is invoked which copies the data onto a reports database which PHP reads from to display the data.
I'm not a big fan of stored procs at all. But the people I've spoken to insist that stored procedures are the only option, as queries directly against the data via a programming language are incredibly slow (think 30 mins?). Security is also a concern.
So my question is - are stored procedures required when you have a very large data set? Do queries really take that long on such a large amount of data or is there a problem with either the DB servers or how the data is arranged (and indexed?). I've got a feeling that something is wrong.
The reasoning behind using a stored procedure is that the execution plan that is created in order to execute your procedure is cached by SQL Server in an area of memory known as the Plan Cache. When the procedure is then subsequently re-run at a later time, the execution plan has the possibility of being re-used.
A stored procedure will not run any faster than the same query executed as a batch of T-SQL. It is the re-use of the execution plan that results in a performance improvement. The query cost will be the same for the actual T-SQL.
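A related point, sketched here under assumptions: a cache keyed on statement text can only re-use work when the text repeats, which is why parameterised queries matter for ad-hoc SQL. The example is in Python with sqlite3 (whose connections keep a small cache of prepared statements keyed on the SQL string); it is an analogy for SQL Server's Plan Cache, not a description of it. The orders table and helper names are invented.

```python
import sqlite3

# One statement text for every customer: a text-keyed cache can re-use it.
PARAMETERISED = "SELECT COUNT(*) FROM orders WHERE customer_id = ?"


def literal_sql(customer_id):
    """Anti-pattern shown for contrast: a new statement text per value."""
    return f"SELECT COUNT(*) FROM orders WHERE customer_id = {int(customer_id)}"


def distinct_statement_texts(customer_ids, parameterised):
    """How many different statement texts would be sent for these lookups?"""
    if parameterised:
        return {PARAMETERISED for _ in customer_ids}
    return {literal_sql(cid) for cid in customer_ids}


def make_orders_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
    conn.executemany(
        "INSERT INTO orders (customer_id) VALUES (?)",
        [(1,), (1,), (2,), (3,)],
    )
    conn.commit()
    return conn


def count_orders(conn, customer_id):
    """Same text each call; only the bound value changes."""
    return conn.execute(PARAMETERISED, (customer_id,)).fetchone()[0]
```

For three customers the parameterised form yields one statement text and the literal form yields three, so only the former gives a cache something to hit repeatedly.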
Offloading data to a reporting database is a typical pursuit; however, you may need to review your indexing strategy on the reporting database, as it will likely need to be quite different from that of your OLTP platform.
You may also wish to consider using SQL Server Analysis Services in order to service your reporting requirements as it sounds like your reports contain lots of data aggregations. Storing and processing data for the purpose of fast counts and analytics is exactly what SSAS is all about. It sounds like it is time for your business to look at building a data warehouse.
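As a rough way to see what reviewing the indexing strategy means in practice, here is a sketch that asks the database how it plans to run a report query before and after adding an index. It uses Python's sqlite3 and SQLite's EXPLAIN QUERY PLAN as a stand-in for looking at execution plans in SQL Server. The orders table, the index name and the helper functions are assumptions made for the example.

```python
import sqlite3

INDEX_NAME = "idx_orders_region"  # hypothetical index name
REPORT_SQL = "SELECT SUM(total) FROM orders WHERE region = ?"


def make_reporting_db():
    """Small stand-in for a reporting table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        " id INTEGER PRIMARY KEY, region TEXT, order_date TEXT, total INTEGER)"
    )
    conn.executemany(
        "INSERT INTO orders (region, order_date, total) VALUES (?, ?, ?)",
        [("north" if i % 2 else "south", "2020-01-01", i) for i in range(100)],
    )
    conn.commit()
    return conn


def plan_details(conn, sql, params=()):
    """Return the human-readable 'detail' column of the query plan rows."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]


def uses_index(details):
    """True if any plan step mentions our index."""
    return any(INDEX_NAME in detail for detail in details)


def add_report_index(conn):
    """Index aimed at the report's filter column."""
    conn.execute(f"CREATE INDEX {INDEX_NAME} ON orders (region)")
```

Before add_report_index the plan is a full scan of the table; afterwards the plan mentions the index. The same check is worth doing for each slow report query on the reporting database.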
I hope this helps but please feel free to request further details.
Cheers, John
In the context in which you are operating - large corporate database accessed in several places - it is virtually always best to place as much business logic inside the database as is possible.
In this case your immediate performance benefits are :
1. Firstly, if the SP involves any processing beyond a simple select, the processing of the data within the database can be orders of magnitude faster than sending rows across the network to your program for handling there.
2. You do acquire some benefits in that the SP is stored compiled. This is usually marginal compared to 1. if processing large volumes.
However, and in my mind often more important than performance, is the fact that with corporate databases encapsulating the logic inside the database itself provides major management and maintenance benefits:-
Data structures can be abstracted away from program logic, allowing database structures to change without requiring changes to programs accessing the data. Anyone who has spent hours grep'ing a corporate codebase for SQL using [mytable] before making a simple database change will appreciate this.
SPs can provide a security layer, although this can be overused and overrelied on.
You say this is your first job for a company with a database of this type, so you can be forgiven for not appreciating how a database-centric approach to handling the data is really essential in such environments. You are not alone either - in a recent podcast Jeff Atwood said he wasn't a fan of putting code into databases. This is a fine and valid opinion where you are dealing with a database serving a single application, but is 100% wrong with a database used across a company by several applications, where the best policy is to screw down the data with a full complement of constraints and use SPs liberally for access and update.
The reason for this is if you don't such databases always lose data integrity and accumulate crud. Sometimes it's virtually impossible to imagine how they do, but in any large corporate database (tens of millions of records) without sufficient constraints there will be badly formed records - at best these force a periodic clean-up of data (a task I regularly used to get dumped with as a junior programmer), or worse will cause applications to crash due to invalid inputs, or even worse not cause them to crash but deliver incorrect business information to the end-users. And if your end user is your finance director then that's your job on the line :-)
It seems to me that there is an additional step in there that, based on your description, appears unnecessary. Here is what I am referring to -
When a report is requested, a stored
procedure is invoked which gathers the
data into a format required for a
report, and forwarded to another
stored procedure which transforms the
data into a view, and forwards THAT
off to a PHP framework for display.
A sproc transforms the data for a report, then another sproc transforms this data into another format for front-end presentation - is the data ever used in the format in which it is in after the first sproc? If not, that stage seems unnecessary to me.
I'm assuming that your reports database is a data warehouse and that data is ETL'ed and stored within it in a format for the purposes of reporting. Where I currently work, this is common practice.
As for your question regarding stored procedures, they allow you to centralize logic within the database and "encapsulate" security, the first of which would appear to be of benefit within your organisation, given the other sprocs that you have for data transformation. Stored procedures also have a stored execution plan which, under some circumstances, can provide some improvement to performance.
I found that stored procedures help with large data sets because they eliminate a ton of network traffic, which can be a huge performance bottleneck depending on how large the data set actually is.
When processing large numbers of rows, where indexes are available and the SQL is relatively tuned, the database engine performing set-based operations directly on the data - through SQL, say - will almost always outperform row-by-row processing (even on the same server) in a client tool. The data is not crossing any physical or logical boundaries to leave the database server processes or to leave the database server and go out across the network. Even performing RBAR (row by agonizing row) on the server will be faster than performing it in a client tool, if only a limited amount of data really needs to ever leave the server, because...
When you start to pull more data across networks, then the process will slow down and limiting the number of rows at each stage becomes the next optimization.
All of this really has nothing to do with stored procedures. Stored procedures (in SQL Server) no longer provide much of a performance advantage over batch SQL. Stored procedures do provide a large number of other benefits like modularization, encapsulation, security management, design by contract, version management. Performance, however, is no longer an advantage.
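To make set-based versus RBAR concrete, here is a small sketch in Python with sqlite3. The product table and function names are hypothetical. Both functions apply the same percentage increase; the first sends one statement and lets the engine handle all rows, the second pulls the rows to the client and sends one UPDATE per row.

```python
import sqlite3


def make_product_db():
    """Hypothetical product table with prices stored in whole cents."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE product (id INTEGER PRIMARY KEY, price_cents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO product VALUES (?, ?)",
        [(1, 1000), (2, 2550), (3, 399)],
    )
    conn.commit()
    return conn


def raise_prices_set_based(conn, pct):
    """One statement; the engine processes every row. Returns statements sent."""
    conn.execute(
        "UPDATE product SET price_cents = price_cents + price_cents * ? / 100",
        (pct,),
    )
    conn.commit()
    return 1


def raise_prices_row_by_row(conn, pct):
    """RBAR from the client: one SELECT plus one UPDATE per row."""
    rows = conn.execute("SELECT id, price_cents FROM product").fetchall()
    for product_id, price in rows:
        new_price = price + price * pct // 100
        conn.execute(
            "UPDATE product SET price_cents = ? WHERE id = ?",
            (new_price, product_id),
        )
    conn.commit()
    return 1 + len(rows)


def prices(conn):
    return [row[0] for row in conn.execute(
        "SELECT price_cents FROM product ORDER BY id"
    )]
```

On three rows the difference is four statements versus one. On millions of rows, with a network between client and server, the row-by-row version pays a round trip for every single row.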
Generally speaking stored procedures have a number of advantages over direct queries. I can't comment on your complete end to end process, however, SPs will probably perform faster. For a start a direct query needs to be compiled and an execution plan worked out every time you do a direct query - SPs don't.
There are other reasons, why you would want to use stored procedure - centralisation of logic, security etc.
The end to end process does look a little complicated but there may be good reasons for it simply due to the data volume - it might well be that if you run the reports on the main database, the queries are slowing down the rest of the system so much that you'll cause problems for the rest of the users.
Regarding the stored procedures, their main advantage in a scenario like this is that they are pre-compiled and the database has already worked out what it considers to be the optimal query plan. Especially with the data volumes you are talking about, this might well result in a very noticeable performance improvement.
And yes, depending on the complexity of the report, a query like this can take half an hour or longer...
This reporting solution seems to have been designed by people that think the database is the centre of the world. This is a common and valid view – however I don’t always hold to it.
When moving data between tables/databases, it can be a lot quicker to use stored procs, as the data does not need to travel between the database and the application. However, in most cases I would rather not use stored procs as they make development more complex; I am in the ORM camp myself. You can sometimes get great speedups by loading lots into RAM and processing it there, however that is a totally different way of coding and will not allow the reuse of the logic that is already in the stored procs. Sorry, I think you are stuck with stored procs while in that job.
Given the amount of data being moved about, if using SQL Server I would look at using SSIS or DTS – Oracle will have something along the same lines. SSIS will do the data transformations on many threads while taking care of a lot of the details for you.
Remember the design of software has more to do with the history of the software and the people working on it than it has to do with the “right way of doing it”. Come back in 100 years and we may know how to write software; at present it is mostly a case of the blind leading the blind. Just like when the first bridges were built and a lot of them fell down, no one could tell you in advance which bridge would keep standing and why.
Unlike autogenerated code from an ORM product, stored procs can be performance tuned. This is critical in large production environment. There are many ways to tweak performance that are not available when using an ORM. Also there are many many tasks performed by a large database which have nothing to do with the user interface and thus should not be run from code produced from there.
Stored procs are also required if you want to control rights so that the users can only do the procedures specified in the proc and nothing else. Otherwise, users can much more easily make unauthorized changes to the databases and commit fraud. This is one reason why database people who work with large business critical systems, do not allow any access except through stored procs.
If you are moving large amounts of data to other servers though, I would consider using DTS (if using SQL Server 2000) or SSIS. This may speed up your processes still further, but it will depend greatly on what you are doing and how.
The fact that sps may be faster in this case doesn't preclude that indexing may be wrong or statistics out of date, but generally dbas who manage large sets of data tend to be pretty on top of this stuff.
It is true the process you describe seems a bit convoluted, but without seeing the structure of what is happening and understanding the database and environment, I can't say if maybe this is the best process.
I can tell you that new employees who come in and want to change working stuff to fit their own personal prejudices tend to be taken less than seriously and then you will have little credibility when you do need to suggest a valid change. This is particularly true when your past experience is not with databases of the same size or type of processing. If you were an expert in large systems, you might be taken more seriously from the start, but, face it, you are not and thus your opinion is not likely to sway anybody until you have been there awhile and they have a measure of your real capabilities. Plus if you learn the system as it is and work with it as it is, you will be in a better position in six months or so to suggest improvements rather than changes.
I could perhaps come up with more, but a few points.
1. Assuming a modern DB, stored procedures probably won't actually be noticeably faster than normal queries due to caching and the like.
2. The security benefits of stored procedures are somewhat overrated.
3. Change is evil. Consistency is king.
I'd say #3 trumps all other concerns unless stored procedures are causing a legitimate problem.
The faster way for reporting is to just read all data into memory (64 bit OS required) and just walk the objects. This is of course limited to ram size (affordable 32 GB) and reports where you hit a large part of the db. No need to make the effort for small reports.
In the old days I could run a report querying over 8 million objects in 1.5 seconds. That was in about a gigabyte of ram on a 3GHz pentium 4. 64 bit should be about twice as slow, but that is compensated by faster processors.