What is the best database for my needs? - database

I am currently using MS SQL Server 2008, but I'm not sure it is the best system for this particular task.
I have a single table like so:
PK_ptA PK_ptB DateInserted LookupColA LookupColB ... LookupColF DataCol (ntext)
A common query is
SELECT TOP(1000000) DataCol FROM table
WHERE LookupColA=x AND LookupColD=y AND LookupColE=z
ORDER BY DateInserted DESC
The table has about a billion rows with 5 million inserted per day.
My main problem with SQL Server is that it isn't easy to shard or spread out the data files. Also, exporting seems to max out at 1,000 rows per second (about 1 MB/s), which seems very slow.
Another problem is that, with SQL Server, if I want to add a new LookupCol the log file grows enormously, requiring a large amount of rarely used free space to be kept on tap.
Are there any obvious better solutions for this problem?

You have a problem, and it is not SQL Server. Let me also set aside, for the moment, that you seem to have a bad table design.
Spreading data files is actually pretty easy. REORGANIZING them later is not that easy, but still doable. What do your table, filegroup and file layout look like?
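As a rough illustration of what adding another data file involves (database name, filegroup name, path and sizes are all assumptions):

ALTER DATABASE MyDb ADD FILEGROUP Data2;
ALTER DATABASE MyDb
ADD FILE (
    NAME = MyDb_Data2,
    FILENAME = 'D:\SQLData\MyDb_Data2.ndf',
    SIZE = 10GB,
    FILEGROWTH = 1GB
) TO FILEGROUP Data2;
-- moving an existing table there later means rebuilding its clustered index ON [Data2]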
Exporting at 1 MB per second is a joke. Seriously. I have handled 150-million-row files in minutes, which works out to a LOT more than 60,000 rows per minute. Something is wrong. Temp space? Did you do a performance analysis? What does the hardware look like?
Nothing will help with the log usage. Like most professional databases, the log records all database pages changed during a transaction, and adding a field changes ALL pages.
You should:
Redesign the database (use a view to keep the same old table shape in place if you have to) so that it does not have "LookupColA" etc., but is normalized (a LookupValue, and a LookupTable that is keyed by "column"). This way you get additional fields instantly. This turns into a data-warehouse-like star schema.
Do a performance analysis. It looks like you have some problems.
Definitely tell us about your hardware ;)
The problem here is definitely NOT SQL Server; it is related to bad table design AND, possibly, insufficient or badly utilized hardware.

OK, the table design (separate answer). The LookupCols are basically lookup tables.
So....
LookupTable
  pk (int)
  TableType
  Value
as fields.
ValueTable
  pk
ValueLookupMap table
  pk of ValueTable entry
  pk of LookupTable entry
So, basically, if you add a lookup "field" you just create a set of entries in the LookupTable and then add entries in the ValueLookupMap.
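A minimal T-SQL sketch of that shape (all names here are illustrative, not from the original post):

CREATE TABLE LookupTable (
    pk        int IDENTITY(1,1) PRIMARY KEY,
    TableType varchar(50)  NOT NULL,  -- which logical "column" this value belongs to
    Value     varchar(200) NOT NULL
);

CREATE TABLE ValueTable (
    pk           int IDENTITY(1,1) PRIMARY KEY,
    DateInserted datetime NOT NULL DEFAULT GETDATE(),
    DataCol      nvarchar(max) NULL
);

CREATE TABLE ValueLookupMap (
    ValueTable_pk  int NOT NULL REFERENCES ValueTable (pk),
    LookupTable_pk int NOT NULL REFERENCES LookupTable (pk),
    PRIMARY KEY (ValueTable_pk, LookupTable_pk)
);

Adding a new lookup "column" then just means inserting rows with a new TableType; there is no schema change and no log blow-up.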

Related

What affects does changing the column-order of a table in a database have on the server-memory?

I just started my new job and, after looking at the DBM, I was shocked. It's a huge mess.
Now, the first thing I wanted to do is get some consistency into the order of table columns. We publish new database versions via a .dacpac. My co-worker told me that changing the order of columns would force MSSQL to create a temporary table that stores all the data; MSSQL then creates a new table and inserts all the data into it.
So let's say my server only has 2 GB of RAM and 500 MB of storage left on the hard drive, and the whole database weighs 20 GB. Is it possible that changing the order of columns will cause (memory-related) trouble? Is my co-worker's statement correct?
I couldn't find any good source for my question.
Thanks!
You should NOT go through the tables "one by one".
You should leave your tables as they are; if you don't like the column order of some table, just create a view that reorders the columns as you want.
Not only will changing the order of columns cause your tables to be recreated, all the indexes will be recreated too, and you'll get problems with FK constraints.
And in the end you'll gain absolutely nothing and only do damage. You'll waste server resources, make your tables temporarily inaccessible, and the columns will not be stored in the order you defined anyway; internally they are stored in "var-fix" format (divided into fixed-length and variable-length portions).
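A minimal sketch of the view approach (table and column names are made up for illustration):

CREATE VIEW dbo.vCustomers_Ordered
AS
SELECT CustomerId,   -- present the columns in whatever order you prefer
       LastName,
       FirstName,
       CreatedAt
FROM dbo.Customers;

The base table keeps its physical column order; only the view's projection changes, so nothing gets rebuilt and no data moves.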

Bulkcopy inserts with DBCC CheckIdent

Our team needs to insert a huge amount of data into our SQL Server 2008 database and we're looking for a good solution. We came up with one, but I have doubts about it, simply because it doesn't feel right, so I'm asking here whether it seems like a good solution. An extra challenge is that it's a peer-to-peer replicated database across 4 servers! :)
Imagine we have 1 million rows to insert
Start transaction
Increase current ident value on a table with 1 million
Have a DataSet/DataTable ready with 1 million rows and the correct ids
BulkCopy the data into the database
Commit transaction
Is this a good solution? Might we run into concurrency issues, too-large transactions, etc.?
You'll only get problems (as far as I can see, so there might be things I'm overlooking!) if the database is online and users can insert rows into that table. Increasing the identity value for new rows at the meta level simply means that the next row inserted by the system will use that number, so if you bump it by 1 million, you have reserved those numbers up front.
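A rough T-SQL sketch of that reservation step (the table name dbo.MyTable and the block size are assumptions):

BEGIN TRANSACTION;

-- read the current identity value and move the seed past the reserved block
DECLARE @newSeed bigint = IDENT_CURRENT('dbo.MyTable') + 1000000;
DBCC CHECKIDENT ('dbo.MyTable', RESEED, @newSeed);

-- ... bulk copy the 1 million pre-numbered rows here (e.g. SqlBulkCopy, keeping identity values) ...

COMMIT TRANSACTION;

Concurrent inserts from the application will simply pick up identity values above the reserved block.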
Identity columns are 'nice' but have the side effect that they're not transferable. So if you have to migrate the data to another DB, realize that you will likely have to adjust the data to match the database you insert it into (identity values are scoped to the table, which means they could collide with rows already there).
If this is a one-time affair, it might work out. If you're planning to do this regularly, I'd look into a higher-level migration system where you migrate the data to new identity values, or use GUIDs with NEWSEQUENTIALID() so you get well-behaved indexes as well as unique, transferable IDs.
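For illustration, a hypothetical table keyed that way (names are assumptions, not from the question):

CREATE TABLE dbo.TransferableRows (
    Id uniqueidentifier NOT NULL
        CONSTRAINT DF_TransferableRows_Id DEFAULT NEWSEQUENTIALID()
        CONSTRAINT PK_TransferableRows PRIMARY KEY CLUSTERED,
    Payload nvarchar(max) NULL
);

NEWSEQUENTIALID() can only be used in a DEFAULT constraint, but because its values are sequential it avoids the index fragmentation you would get from plain NEWID().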

How to improve performance in SQL Server table with image fields?

I'm having a very particular performance problem at work!
In the system we're using there's a table that holds information about the current workflow process. One of the fields holds a spreadsheet that contains metadata about the process (don't ask me why!! and NO I CAN'T CHANGE IT!!)
The problem is that this spreadsheet is stored in an IMAGE field in an SQL Server 2005 (within a database set with SQL 2000 compatibility).
This table currently has 22K+ rows, and even a simple query like this:
SELECT TOP 100 *
FROM OFFENDING_TABLE
takes 30 seconds to retrieve the data in Query Analyzer.
I'm thinking about updating the compatibility level to SQL 2005 (since I was informed that the app can handle it).
The second thing I'm considering is changing the data type of the column to varbinary(max), but I don't know if doing so will affect the application.
Another thing I'm considering is using sp_tableoption to set the 'large value types out of row' option to 1 (it's currently 0), but I have no idea whether doing this will improve performance.
Does anyone know how to improve performance in such scenario?
Edited to clarify
My problem is that I have no control over what the application asks of SQL Server. I did some Reflection on it (the app is a .NET 1.1 website), and it uses the offending field for some internal stuff I can't identify.
I need to improve the overall performance of this table.
I'd recommend you look into the offending table layout health:
select * from sys.dm_db_index_physical_stats(
    db_id(), object_id('offending_table'), null, null, 'detailed');
Things to look for are avg_fragmentation_in_percent, page_count, avg_page_space_used_in_percent, record_count and ghost_record_count. Cues like high fragmentation, a high number of ghost records, or a low page-space-used percentage indicate problems, and things can be improved quite a bit just by rebuilding the index (i.e. the table) from scratch:
ALTER INDEX ALL ON offending_table REBUILD;
I'm saying this considering that you cannot change the table or the app. If you were able to change the table and the app, the advice you already got is good advice (don't use '*', don't select without a condition, use the newer varbinary(max) type, etc.).
I'd also look at the Page Life Expectancy performance counter to understand whether the system is memory starved. From your description of the symptoms the system looks IO bound, which leads me to think there is little page caching going on; more RAM could help, as well as a faster IO subsystem. On a SQL 2008 system I would also suggest turning page compression on, but on 2005 you can't.
And, just to be sure, make sure the queries are not blocked by contention from the app itself, i.e. that the query doesn't spend 90% of those 30 seconds waiting for a row lock. Look at sys.dm_exec_requests while the query is running and check wait_time, wait_type and wait_resource. Is it PAGEIOLATCH_XX? Or is it a lock? Also, what does sys.dm_os_wait_stats look like on your server; what are the top wait reasons?
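Something along these lines, as a sketch (the session_id filter is only there to skip system sessions):

-- what the running request is currently waiting on
select session_id, status, command, wait_type, wait_time, wait_resource, blocking_session_id
from sys.dm_exec_requests
where session_id > 50;

-- server-wide top wait reasons
select top (10) wait_type, waiting_tasks_count, wait_time_ms
from sys.dm_os_wait_stats
order by wait_time_ms desc;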
First of all - don't ever do a SELECT * in production code - reporting or not.
You have three basic choices:
move that blob field out into a separate table if it's not always needed; probably not practical since you mention you cannot change the schema
be more careful with your SELECT statements to select only those fields that you really need - and omit the blob field
see if you can limit your query to include a WHERE clause and find a way to optimize the query plan by e.g. adding a suitable index to the table (if you can)
There's no magic "make this faster" switch - but you can optimize your query or optimize your table layout. Both help. If you can't change anything - neither the table layout, nor add an index, nor change the queries, you'll have a hard time optimizing anything, I'm afraid....
Just changing the field to VARBINARY(MAX) won't change anything at all - no performance improvement to be expected just from changing the data type.
A short answer is to only do SELECTs against multiple rows when the fields returned do not include the offending image field, i.e. no SELECT *. If you want the value of the image field, retrieve it on a case-by-case basis.
Setting the 'large value types out of row' option should definitely help performance: the row size will be significantly smaller, so SQL Server needs far fewer physical reads to get through the table.
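A cautious sketch of how that would look, assuming the column has first been converted to varbinary(max) as discussed above (table and column names are placeholders):

-- push LOB values off the data pages so row scans touch fewer pages
EXEC sp_tableoption 'dbo.OFFENDING_TABLE', 'large value types out of row', 1;

-- existing rows only move off-row when their LOB value is next updated,
-- so a no-op update can be used to migrate them:
-- UPDATE dbo.OFFENDING_TABLE SET SpreadsheetData = SpreadsheetData;

Note that this option applies to varbinary(max)/nvarchar(max)/xml columns, not to the legacy image type, which is another reason to do the data type change first.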

How to copy large set of data in SQLServer db

I have a requirement to take a "snapshot" of a current database and clone it into the same database, with new Primary Keys.
The schema in question consists of about 10 tables, but a few of the tables will potentially contain hundreds of thousands to 1 million records that need to be duplicated.
What are my options here?
I'm afraid that writing a sproc will require locking the database rows in question (for concurrency) for the entire duration of the operation, which is quite annoying to other users. How long would such an operation take, assuming we optimize it to the full extent SQL Server allows? Is it going to be 30 seconds to 1 minute to perform this many inserts? I'm not able to lock the whole table(s) and do a bulk insert, because other users under other accounts are using the same tables independently.
Depending on performance expectations, an alternative would be to dump the current db into an XML file and then asynchronously clone the db from that file at leisure in the background. The obvious advantage of this is that the db is only locked for the time it takes to do the XML dump, and the inserts can run in the background.
If a good DBA can get the "clone" operation to execute start to finish in under 10 seconds, then it's probably not worth the complexity of the XML dump/web service solution. But if it's a lost cause, and inserting potentially millions of rows is likely to balloon in time, then I'd rather start with the XML approach right away.
Or maybe there's a better approach altogether??
Thanks a lot for any insights you can provide.
I would suggest backing up the database and then restoring it as a new db on your server. You can use that new DB as your source.
I would definitely recommend against the XML dump idea.
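A rough sketch of the backup/restore-as-new-name step (file paths and logical file names are assumptions; check yours with RESTORE FILELISTONLY):

BACKUP DATABASE SourceDb TO DISK = 'D:\Backups\SourceDb.bak' WITH COPY_ONLY, INIT;

RESTORE DATABASE SourceDb_Clone
FROM DISK = 'D:\Backups\SourceDb.bak'
WITH MOVE 'SourceDb'     TO 'D:\Data\SourceDb_Clone.mdf',
     MOVE 'SourceDb_log' TO 'D:\Data\SourceDb_Clone_log.ldf';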
Does it need to be in the exact same tables? You could make a set of "snapshots" tables where all these records go; you would only need a single INSERT + SELECT, like
insert into snapshots_source1 (user,col1, col2, ..., colN)
select 'john', col1, col2, ..., colN from source1
and so on.
You can give the snapshots_* tables an IDENTITY column that creates the 'new PK', and they can also preserve the old one if you wish.
This has (almost) no locking issues and looks a lot saner.
It does require a change in the code, but it shouldn't be too hard to make the app point to the snapshots tables when appropriate.
This also eases cleanup and maintenance.
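A hypothetical snapshots table along those lines (all names invented for illustration):

create table snapshots_source1 (
    new_pk int identity(1,1) primary key,  -- the 'new PK'
    old_pk int not null,                   -- preserves the original key from source1
    [user] varchar(50) not null,
    col1   varchar(100) null,
    col2   varchar(100) null
);

insert into snapshots_source1 (old_pk, [user], col1, col2)
select pk, 'john', col1, col2 from source1;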
---8<------8<------8<---outdated answer---8<---8<------8<------8<------8<---
Why don't you just take a live backup and do the data manipulation (key changing) on the destination clone?
Now, in general, this snapshot-with-new-primary-keys idea sounds suspect. If you want a replica, you have log shipping and clustering; if you want a copy of the data to generate a 'new app instance', a backup/restore/manipulate process should be enough.
You don't say how much space your DB occupies, but you can certainly back up 20 million rows (800 MB?) in about 10 seconds, depending on how fast your disk subsystem is...

Have you ever encountered a query that SQL Server could not execute because it referenced too many tables?

Have you ever seen any of these error messages?
-- SQL Server 2000
Could not allocate ancillary table for view or function resolution.
The maximum number of tables in a query (256) was exceeded.
-- SQL Server 2005
Too many table names in the query. The maximum allowable is 256.
If yes, what have you done?
Given up? Convinced the customer to simplify their demands? Denormalized the database?
#(everyone wanting me to post the query):
I'm not sure if I can paste 70 kilobytes of code in the answer editing window.
Even if I could, it wouldn't help, since those 70 kilobytes of code reference 20 or 30 views that I would also have to post, because otherwise the code would be meaningless.
I don't want to sound like I am boasting here but the problem is not in the queries. The queries are optimal (or at least almost optimal). I have spent countless hours optimizing them, looking for every single column and every single table that can be removed. Imagine a report that has 200 or 300 columns that has to be filled with a single SELECT statement (because that's how it was designed a few years ago when it was still a small report).
For SQL Server 2005, I'd recommend using table variables and partially building the data as you go.
To do this, create a table variable that represents your final result set you want to send to the user.
Then find your primary table (say the orders table in your example above) and pull that data, plus a bit of supplementary data that is only, say, one join away (customer name, product name). You can do an INSERT ... SELECT to put this straight into your table variable (a plain SELECT INTO cannot target a table variable).
From there, iterate through the table and, for each row, do a bunch of small SELECT queries that retrieve the supplemental data you need for your result set, inserting it into each column as you go.
Once complete, you can then do a simple SELECT * from your table variable and return this result set to the user.
I don't have any hard numbers for this, but there have been three distinct instances that I have worked on to date where doing these smaller queries has actually worked faster than doing one massive select query with a bunch of joins.
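As a very rough sketch of that shape (every table and column name here is invented; the real report obviously has far more columns):

DECLARE @Report TABLE (
    OrderId      int PRIMARY KEY,
    CustomerName nvarchar(100),
    ProductName  nvarchar(100),
    ShipperName  nvarchar(100)
);

-- primary table plus data that is only one join away
INSERT INTO @Report (OrderId, CustomerName, ProductName)
SELECT o.OrderId, c.Name, p.Name
FROM dbo.Orders o
JOIN dbo.Customers c ON c.CustomerId = o.CustomerId
JOIN dbo.Products  p ON p.ProductId  = o.ProductId;

-- fill in the remaining columns with small follow-up queries
UPDATE r
SET ShipperName = s.Name
FROM @Report r
JOIN dbo.Shipments sh ON sh.OrderId  = r.OrderId
JOIN dbo.Shippers  s  ON s.ShipperId = sh.ShipperId;

SELECT * FROM @Report;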
#chopeen You could change the way you're calculating these statistics and instead keep a separate table of per-product stats: when an order is placed, loop through the products and update the appropriate records in the stats table. This would shift a lot of the calculation load to the checkout page rather than running everything in one huge query when generating a report. Of course there are some stats that aren't going to work as well this way, e.g. tracking customers' next purchases after purchasing a particular product.
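A hedged sketch of that incremental-stats idea (table and parameter names are hypothetical):

CREATE TABLE dbo.ProductStats (
    ProductId    int PRIMARY KEY,
    TimesOrdered int   NOT NULL DEFAULT 0,
    TotalRevenue money NOT NULL DEFAULT 0
);

-- run once per order line at checkout time (@ProductId and @LineTotal come from the order)
UPDATE dbo.ProductStats
SET TimesOrdered = TimesOrdered + 1,
    TotalRevenue = TotalRevenue + @LineTotal
WHERE ProductId = @ProductId;

The report then reads this small table directly instead of aggregating the full order history.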
This would happen all the time when writing Reporting Services reports for Dynamics CRM installations running on SQL Server 2000. CRM has a nicely normalised data schema, which results in a lot of joins. There's actually a hotfix that raises the limit from 256 to a whopping 260: http://support.microsoft.com/kb/818406 (we always thought this a great joke on the part of the SQL Server team).
The solution, as Dillie-O alludes to, is to identify appropriate "sub-joins" (preferably ones that are used multiple times) and factor them out into temp tables or table variables that you then use in your main joins. It's a major PIA and often kills performance. I'm sorry for you.
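For illustration, a minimal version of that factoring (the tables are placeholders, not from the actual report):

-- factor a repeated sub-join out into a temp table...
SELECT c.CustomerId, c.Name, a.City
INTO #CustomerAddress
FROM dbo.Customers c
JOIN dbo.Addresses a ON a.AddressId = c.PrimaryAddressId;

-- ...and reuse it in the main query instead of repeating the join
SELECT o.OrderId, ca.Name, ca.City
FROM dbo.Orders o
JOIN #CustomerAddress ca ON ca.CustomerId = o.CustomerId;

Each sub-join you pre-build this way removes its underlying tables from the big query's count against the 256-table limit.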
#Kevin, love that tee -- says it all :-).
I have never come across this kind of situation, and to be honest the idea of referencing > 256 tables in a query fills me with a mortal dread.
Your first question should probably be "Why so many?", closely followed by "What bits of information do I NOT need?". I'd be worried that the amount of data returned by such a query would start to impact the application's performance quite severely, too.
I'd like to see that query, but I imagine it's some problem with some sort of iterator, and while I can't think of any situation where it's possible, I bet it's from a bad while/case/cursor or a ton of poorly implemented views.
Post the query :D
Also, I feel like one of the possible problems could be having a ton (read: 200+) of name/value tables which could be condensed into a single lookup table.
I had this same problem... my development box runs SQL Server 2008 (the view worked fine) but in production (with SQL Server 2005) the view didn't. I ended up creating intermediate views to avoid this limitation, using the new views as part of the query in the view that threw the error.
Kind of silly considering the logical execution is the same...
Had the same issue in SQL Server 2005 (worked in 2008) when I wanted to create a view. I resolved the issue by creating a stored procedure instead of a view.
