Sources of information on administering large SQL Server Databases? - sql-server

As part of my role at the firm I'm at, I've been forced to become the DBA for our database. Some of our tables have row counts approaching 100 million, and many of the things that I know how to do in SQL Server (like joins) simply break down at this level of data. I'm left with a couple of options:
1) Go out and find a DBA with experience administering VLDBs. This is going to cost us a pretty penny and come at the expense of other work that we need to get done. I'm not a huge fan of it.
2) Most of our data is historical data that we use for analysis. I could simply create a copy of our database schema and start from scratch with new data, putting on hold any analysis of our existing data until I find a proper way to solve the problem (this is my current "best" solution).
3) Reach out to the developer community to see if I can learn enough about large databases to get us through until I can implement solution #1.
Any help that anyone could provide, or any books you could recommend would be greatly appreciated.

Here are a few thoughts, but none of them are quick fixes:
- Develop an archival strategy for the data in your large tables. Create tables with similar formats to the existing transactional table and copy the data out into those tables on a periodic basis. If you can get away with whacking the data out of the transactional system, then fine.
- Develop a relational data warehouse to store the large data sets, complete with star schemas consisting of fact tables and dimensions. For an introduction to this approach there is no better book (IMHO) than Ralph Kimball's Data Warehouse Toolkit.
- For analysis, consider using MS Analysis Services for pre-aggregating this data for fast querying.
- Of course, you could also look at your indexing strategy within the existing database. Be careful with any changes, as you could add indexes that would improve querying at the cost of insert and transactional performance.
- You could also research partitioning in SQL Server (a rough sketch follows this list).
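As a very rough sketch of what table partitioning looks like in SQL Server 2005 and later (Enterprise Edition) - the table, column, and boundary values here are invented for illustration, not taken from your schema:

    -- Hypothetical example: range-partition a large history table by date.
    CREATE PARTITION FUNCTION pfHistoryByYear (datetime)
        AS RANGE RIGHT FOR VALUES ('2007-01-01', '2008-01-01', '2009-01-01');

    CREATE PARTITION SCHEME psHistoryByYear
        AS PARTITION pfHistoryByYear ALL TO ([PRIMARY]);

    CREATE TABLE dbo.TradeHistory
    (
        TradeId   bigint        NOT NULL,
        TradeDate datetime      NOT NULL,
        Amount    decimal(18,2) NOT NULL,
        CONSTRAINT PK_TradeHistory PRIMARY KEY (TradeId, TradeDate)
    )
    ON psHistoryByYear (TradeDate);

Mapping the partitions to separate filegroups (instead of ALL TO [PRIMARY]) is what lets you push older data onto cheaper storage.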
Don't feel bad about bringing in a DBA on a contract basis to help out...
To me, your best bet would be to begin investigating movement of that data out of the transactional system if it is not necessary for day to day use.
Of course, you are going to need to pick up some new skills for dealing with these amounts of data. Whatever you decide to do, make a backup first!
One more thing you should do is ensure that your I/O is being spread appropriately across as many spindles as possible. Your data files, log files, and SQL Server tempdb data files should all be on separate drives with a database system that large.
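If you do go down that road, a minimal sketch of spreading files out might look like the following - the database name, logical file names, paths, and sizes here are all placeholders:

    -- Hypothetical example: add a data file on a separate physical drive.
    ALTER DATABASE BigDb ADD FILEGROUP FG_Data2;

    ALTER DATABASE BigDb
        ADD FILE (
            NAME = BigDb_Data2,
            FILENAME = 'E:\SQLData\BigDb_Data2.ndf',
            SIZE = 10GB,
            FILEGROWTH = 1GB
        ) TO FILEGROUP FG_Data2;

    -- tempdb files can be relocated to their own drive as well
    -- (the move takes effect after the SQL Server service restarts).
    ALTER DATABASE tempdb
        MODIFY FILE (NAME = tempdev, FILENAME = 'F:\SQLTempDB\tempdb.mdf');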

DBAs are worth their weight in gold, if you can find a good one. They specialize in doing the very thing that you are describing. If this is a one-time problem, maybe you can subcontract one.
I believe Microsoft offers a similar service. You might want to ask.

You'll want to get a DBA in there, at least on contract to performance tune the database.
Joining to a 100 million record table shouldn't bring the database server to its knees. My company's customers do it many hundreds (possibly thousands) of times per minute on our system.

Related

postgresql fast transfer of a table between databases

I have a postgresql operational DB with data partitioned per day
and a postgresql data warehouse DB.
In order to copy the data quickly from the operational DB to the DWH, I would like to copy the tables as fast as possible and with the fewest resources used.
Since the tables are partitioned by day, I understand that each partition is a table in itself.
Does that mean I can somehow copy the data files between the machines and create the tables in the DWH from those data files?
What is the best practice in that case?
EDIT:
I will answer all the questions asked here:
1. I'm building an ETL. The first step of the ETL is to copy the data with as little impact on the operational DB as possible.
2. I would want to replicate the data if this won't slow down writes on the operational DB.
3. A bit more data: the operational DB is not my responsibility, but the main concern is the write time on that DB.
It writes about 500 million rows a day; there are hours that are more loaded, but there aren't hours with no writes at all.
4. I came across a few tools/approaches - replication, pg_dump. But I couldn't find anything that compares the tools to know when to use which, and to understand what fits my case.
If you are doing a bulk transfer, I would actually consider running pg_dump on the warehouse system and piping the results into psql once a day. You could probably run Slony too, but that would require more resources and would probably be more complicated.
There are many good ways to replicate data between databases. But since you are just looking for a "fast transfer of a table between databases", a simple and fast solution is provided by the extension dblink. There are many examples here on SO. Try a search.
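For example (purely a sketch - the connection string, schema, table, and column names below are invented), pulling one daily partition into the warehouse with dblink could look roughly like this:

    -- Hypothetical example: copy one day's partition from the operational DB.
    CREATE EXTENSION IF NOT EXISTS dblink;  -- PostgreSQL 9.1+

    INSERT INTO dwh.events_2014_05_01
    SELECT *
    FROM dblink('host=oltp-host dbname=operational user=etl password=secret',
                'SELECT event_id, event_time, payload FROM public.events_2014_05_01')
         AS t(event_id bigint, event_time timestamptz, payload text);

For very large partitions, a dump-and-restore approach (such as the pg_dump pipeline mentioned above) will usually be cheaper per row than pulling the data through a query.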
If you want a wider approach, continued synchronization, etc., consider one of the established tools for replication. There is a nice comparison in the manual to get you started.

Use one large database or use single databases per customer

Currently I'm working on an online web application for construction materials. Companies can log in on our website and then they can use the web app.
From the beginning the idea was to create a database per customer. But now it's becoming larger and larger (100+), so we now have 100 databases to manage.
We have to run an update script for DB maintenance approx. twice a year.
The advantage that I see is that when a customer wants to quit, we delete their database and then it's finished.
When I want to add a new customer, I have to fill the database with approx. 1,000,000 unique records for that specific customer, because every customer has different prices/materials.
For backups I use a MySQL Dump script, that creates a *.sql file per database that I download every day.
What is your opinion?
One large DB, or a database per customer?
I'm using MySQL with ASP.NET/C#...
I don't want to make a suggestion because there are far too many variables.
I do want to note, however, that my employer has 1000s of deployed databases -- we use one database per customer with replication (2+ databases).
So, the idea is workable. My job isn't related to DB management, but I do recall that we do a lot in the way of automation and online tools. Backups and DB management are handled by a team.
Ultimately, you can make the 100+ deployments work, but you are going to want to start investing in the development of utilities and tools to help automate the backup and/or management of the DBs.
Ideally, nothing (DB Management) should be done by hand. Furthermore, the connection strings should be abstracted away from a given web app deployment.
But now it's becoming larger and larger (100+), so we now have 100 databases to manage
I think you have your answer right there.
Have to agree with #Hogan - the overhead of managing that many databases is probably far from ideal - especially if you ever need to make schema changes, etc. in the future.
That said, if you use a single database are you ever likely to need to separate out a given customer's data into a standalone database/site? If this is likely, how long would it take to carry out this separation?
In essence, if it's likely to take less effort to write a set of tools to handle the above case, then I'd be tempted to go for the single database approach. However, you'll also need to factor in the likely timescales for creating a unified version of the database schemas that handle datasets for each customer, etc.
Also, are the schemas precisely the same for all of the existing 100+ databases? If not, there's potentially a world of pain if you decide to migrate the existing data into a single database.
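For what it's worth, a minimal sketch of what a consolidated, multi-tenant layout could look like in MySQL - every database, table, and column name here is invented:

    -- Hypothetical example: one shared table keyed by customer_id.
    CREATE TABLE materials_prices (
        customer_id INT NOT NULL,
        material_id INT NOT NULL,
        price       DECIMAL(10,2) NOT NULL,
        PRIMARY KEY (customer_id, material_id)
    );

    -- Repeated once per existing customer database (e.g. customer_042):
    INSERT INTO materials_prices (customer_id, material_id, price)
    SELECT 42, material_id, price
    FROM customer_042.materials_prices;

The migration itself is mechanical; the hard part is making sure all 100+ schemas really do line up first.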
Update - Incidentally, all of the above is a bit generalised, but it's hard to be specific without knowing more about the amount of data, and traffic, etc. in use. (e.g.: If you ever had a high demand site for a customer it would be trivial to put it onto its own DB server if you were using a per-customer database.)
I agree with #Hogan and #middaparke... if the schemas are the same, you should put them in one instance.
Unfortunately it is impossible to tell from here whether your schemas would benefit from reusing most of those million rows or not; if they are normalized well, then certainly it would be beneficial.
It is also impossible to tell how difficult any changes to the applications would be based on this change.
Unfortunately, it sounds like you have a large customer base with working applications, and therefore momentum to keep going in that direction - which throws you into the realm of sucking it up and dealing with it by automating the management of so many DBs... not the way you would do it from scratch, but maybe cheapest since you are where you are.

What's the best DB to store banking transactions?

We are planning to create a web app to store banking transactions for customers, e.g. purchases, transfers, etc., and allow them to tag / categorize each transaction.
Could someone point us to the best DB for this purpose? It needs to scale horizontally and we also need to perform analysis on all transactions.
Thanks
The best database to store banking transactions is the one the banks use, DB2/z.
But, since I doubt you'd be able to afford a System z mainframe, that's probably not an option. That doesn't make it any less the best database of course.
If, however, you're talking about storing transactions for Joe Bloggs or Dodgy Brothers Rug Emporium (as opposed to the two hundred million or so customers of ICBC), pretty well any database will be up to the task - Oracle (despite its inability to differentiate NULLs from empty strings), SQL Server, MySQL, PostgreSQL, even SQLite probably.
I'm going to start this by saying it's almost impossible to recommend a system based on what you've described. It could be for such a varied number of uses, ranging from mission-critical real-time financial data that needs to be there and needs to be accurate, through to a web app that sucks in financial records from a bank/credit card statement and lets the user annotate them, in which case it isn't as sensitive.
If you're storing mission critical, sensitive data, I'd go with a commercial option that includes significant support. Also a DBA would be a good idea.
Oracle or MS SQL would be my inclination, and probably Oracle over MS SQL because of its multi-platform support. If you're happy to run on Windows then MS SQL is fine.
If you're storing existing transactions that can be tagged (à la Blippy), then any database would be sufficient. If you're thinking of scaling this out to the nth degree, you might like one of the document database flavours of the month (MongoDB, Couch, etc.).
Really, I think the question should be reconsidered from the context of what your application will do, not that it happens to do it with financial data. The fact that financial data may require additional security or additional accuracy checks forms part of what the system will do, as does the way the user interacts with your web app, etc.
This may not answer your question directly, but here is what I have experienced.
I think it's really about how you'd save your banking transactions. Most database vendors provide a sufficient amount of database performance, so all you have to do is choose one over another.
What you are left with is the actual information to be saved (besides the schema). You might think about using a database encryption option, but that's not really realistic in your case; because you are talking about transactions, I assume there are quite a lot of transactions coming in, and you are doing a large amount of reads for your reporting (besides writes), possibly for mining, etc.
Usually (in SQL Server), with encryption enabled, any data that is written to the database file is encrypted. Snapshots and backups also use encryption, and the transaction log is protected as well, so it could hurt the performance that you are after.
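For reference, that kind of whole-database encryption is Transparent Data Encryption, available from SQL Server 2008 (Enterprise Edition). A minimal sketch, with placeholder names and password:

    -- Hypothetical example: enable TDE on a database.
    USE master;
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password here>';
    CREATE CERTIFICATE TdeCert WITH SUBJECT = 'TDE certificate';

    USE BankingDb;
    CREATE DATABASE ENCRYPTION KEY
        WITH ALGORITHM = AES_256
        ENCRYPTION BY SERVER CERTIFICATE TdeCert;

    ALTER DATABASE BankingDb SET ENCRYPTION ON;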
So, I see your question really boiling down to: how do you protect sensitive data?
Btw, I have deployed solutions with Oracle, SQL Server, and even Sybase as back ends, with several transactions pouring in from ATMs, and what I really look for is performance, besides security. Except for minute limitations of one over the other, they are all the same.
The following articles might help:
Database security: protecting sensitive and critical information
Using One-Way Functions to Protect Sensitive Information in SQL Server Databases

Best database solution for managing a huge amount of data

I have to design a traffic database which includes data from 8 different towns, arriving at roughly 2 MB per town every 10 minutes, 24 hours a day. The incoming data has the same format for every town. So my first question is: what is better on the performance side - design one database for all towns with many tables (one table for each town), or design many databases (one database for each town)? My second question is: what is the best database management system for this scenario - MySQL, Postgres, Oracle, or others?
The amount of data you are receiving each day is quite a lot (~5 GB), but the number of rows being inserted is actually rather low. Consequently you need to design your physical model to make database storage administration easy and querying efficient.
Having a separate database per town only makes sense if you are going to have a server per database. But you do not need load balancing, as you only have to handle eight inserts every ten minutes. On the other hand that architecture will turn every query which compares one town against another into a distributed query.
Having one table per town in the same database might give you some performance advantages if the majority of your queries are constrained to data from a town rather than comparing towns. But I wouldn't like to put much money on it. Even if it did work, it might make other sorts of queries harder.
Given that the data is the same for all towns, my preferred option would be one table with a differentiating column (TOWN_ID) - especially if I had the money to spring for an Oracle license with the Partitioning option.
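As a rough sketch of that design (Oracle-flavoured syntax; the column names and types are invented), list-partitioned on the town:

    -- Hypothetical example: one table for all towns, partitioned by TOWN_ID.
    CREATE TABLE traffic_measurements (
        town_id       NUMBER(2)  NOT NULL,
        measured_at   TIMESTAMP  NOT NULL,
        sensor_id     NUMBER(10) NOT NULL,
        vehicle_count NUMBER(10) NOT NULL,
        CONSTRAINT pk_traffic PRIMARY KEY (town_id, measured_at, sensor_id)
    )
    PARTITION BY LIST (town_id) (
        PARTITION p_town_1 VALUES (1),
        PARTITION p_town_2 VALUES (2),
        PARTITION p_other  VALUES (DEFAULT)
    );

Range partitioning on the timestamp (or a composite range-list scheme) would be the other obvious option, given that the data arrives at a steady daily rate.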
Different databases per town can be difficult to maintain, same with different tables. It might be workable if you never have to compare towns, but sooner or later I'd bet on having to compare data from different towns.
Partitioning the data is the way to go. Any database which supports partitioning of data, such as Oracle or SQL Server, would work fine. Not sure if Postgres or MySQL support this; you'd have to ask someone more familiar with those databases.

What's the best way to manage a large number of tables in MS SQL Server?

This question is related to another:
Will having multiple filegroups help speed up my database?
The software we're developing is an analytical tool that uses MS SQL Server 2005 to store relational data. Initial analysis can be slow (since we're processing millions or billions of rows of data), but there are performance requirements on recalling previous analyses quickly, so we "save" results of each analysis.
Our current approach is to save analysis results in a series of "run-specific" tables, and the analysis is complex enough that we might end up with as many as 100 tables per analysis. Usually these tables use up a couple hundred MB per analysis (which is small compared to our hundreds of GB, or sometimes multiple TB, of source data). But overall, disk space is not a problem for us. Each set of tables is specific to one analysis, and in many cases this provides us enormous performance improvements over referring back to the source data.
The approach starts to break down once we accumulate enough saved analysis results -- before we added more robust archive/cleanup capability, our testing database climbed to several million tables. But it's not a stretch for us to have more than 100,000 tables, even in production. Microsoft places a pretty enormous theoretical limit on the size of sysobjects (~2 billion), but once our database grows beyond 100,000 or so, simple queries like CREATE TABLE and DROP TABLE can slow down dramatically.
We have some room to debate our approach, but I think that might be tough to do without more context, so instead I want to ask the question more generally: if we're forced to create so many tables, what's the best approach for managing them? Multiple filegroups? Multiple schemas/owners? Multiple databases?
Another note: I'm not thrilled about the idea of "simply throwing hardware at the problem" (i.e. adding RAM, CPU power, disk speed). But we won't rule it out either, especially if (for example) someone can tell us definitively what effect adding RAM or using multiple filegroups will have on managing a large system catalog.
Without first seeing the entire system, my first recommendation would be to save the historical runs in combined tables with a RunID as part of the key - a dimensional model may also be relevant here. This table can be partitioned to improve performance, which will also allow you to spread the table across other filegroups.
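A rough illustration of the combined-table idea - the table and column names are invented, not taken from your system:

    -- Hypothetical example: all runs stored in one table, keyed by RunId.
    CREATE TABLE dbo.AnalysisResult
    (
        RunId       int         NOT NULL,
        ScenarioId  int         NOT NULL,
        MetricName  varchar(50) NOT NULL,
        MetricValue float       NOT NULL,
        CONSTRAINT PK_AnalysisResult PRIMARY KEY (RunId, ScenarioId, MetricName)
    );

    -- Recalling a saved analysis becomes a filter instead of a table lookup:
    SELECT ScenarioId, MetricName, MetricValue
    FROM dbo.AnalysisResult
    WHERE RunId = 42;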
Another possibility is to put each run in its own database and then detach them, only attaching them as needed (and in read-only form).
CREATE TABLE and DROP TABLE are probably performing poorly because the master or model databases are not optimized for this kind of behavior.
I also recommend talking to Microsoft about your choice of database design.
Are the tables all different structures? If they are the same structure you might get away with a single partitioned table.
If they are different structures, but just subsets of the same set of dimension columns, you could still store them in partitions in the same table with nulls in the non-applicable columns.
If this is analytic (derivative pricing computations perhaps?) you could dump the results of a computation run to flat files and reuse your computations by loading from the flat files.
This seems to be a very interesting problem/application that you are working with. I would love to work on something like this. :)
You have a very large problem surface area, and that makes it hard to start helping. There are several solution parameters that are not evident in your post. For example, how long do you plan to keep the run analysis tables? There are a LOT of other questions that need to be asked.
You are going to need a combination of serious data warehousing, and data/table partitioning. Depending on how much data you want to keep and archive you may need to start de-normalizing and flattening the tables.
This would be a pretty good case where contacting Microsoft directly can be mutually beneficial. Microsoft gets a good case to show other customers, and you get help directly from the vendor.
We ended up splitting our database into multiple databases. So the main database contains a "databases" table that refers to one or more "run" databases, each of which contains distinct sets of analysis results. Then the main "run" table contains a database ID, and the code that retrieves a saved result includes the relevant database prefix on all queries.
This approach allows the system catalog of each database to be more reasonable, it provides better separation between the core/permanent tables and the dynamic/run tables, and it also makes backups and archiving more manageable. It also allows us to split our data across multiple physical disks, although using multiple filegroups would have done that too. Overall, it's working well for us now given our current requirements, and based on expected growth we think it will scale well for us too.
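To give a flavour of the lookup-plus-prefix pattern (the table, column, and database names here are stand-ins, not our real ones):

    -- Hypothetical example: resolve the run database, then query with a prefix.
    DECLARE @dbName sysname, @sql nvarchar(max);

    SELECT @dbName = d.DatabaseName
    FROM dbo.Runs AS r
    JOIN dbo.Databases AS d ON d.DatabaseId = r.DatabaseId
    WHERE r.RunId = 42;

    SET @sql = N'SELECT * FROM ' + QUOTENAME(@dbName)
             + N'.dbo.AnalysisResult WHERE RunId = 42;';
    EXEC sp_executesql @sql;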
We've also noticed that SQL 2008 tends to handle large system catalogs better than SQL 2000 and SQL 2005 did. (We hadn't upgraded to 2008 when I posted this question.)
