choosing table design for database performance - sql-server

I am developing a job application that executes multiple jobs in parallel. Every job pulls data from a third-party source and processes it; each pull has at least 100,000 records. So I am creating a new table for each job (like Job123, where 123 is the jobId) and processing the data there. When a job starts, it clears the old records, gets new records, and processes them. Now the problem is that I have 1000 jobs and the DB has 1000 tables. The DB size has increased drastically because of all these tables.
My question is whether it is OK to create a new table for each job, or whether I should have only one table called Job with a jobId column, then enter the data and process it. The only problem is that every job will have 100,000+ records. If we have only one table, will DB performance be affected?
Please let me know which approach is better.

Don't create all those tables! Even though it might work, there's a huge performance hit.
Having a big table is fine, that's what databases are for. But...I suspect that you don't need 100 million persistent records, do you? It looks like you only process one Job at a time, but it's unclear.
Edit
The database will grow to the largest size needed, but the space from deleted records is reused. If you add 100k records and delete them, over and over, the database won't keep growing. But even after the delete it will take up as much space as 100k records.

I recommend a single large table for all jobs. There should be one table for each kind of thing, not one table for each thing.
If you make the Job ID the first field in the clustered index, SQL Server will use a b-tree index to determine the physical order of data in the table. In principle, the data will automatically be physically grouped by Job ID due to the physical sort order. This may not stay strictly true forever due to fragmentation, but that would affect a multiple table design as well.
The performance impact of making the Job ID the first key field of a large table should be negligible for single-job operations as opposed to having a separate table for each job.
Also, a single large table will generally be more space efficient than multiple tables for the same amount of total data. This will improve performance by reducing pressure on the cache.
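A minimal sketch of the single-table layout described above. It uses Python's built-in sqlite3 module as a stand-in for SQL Server so that it runs anywhere; that substitution, the table name JobRecord, and the columns record_id and payload are all assumptions for illustration. SQLite's WITHOUT ROWID tables store rows in primary-key order, which is only a rough analogue of a SQL Server clustered index with the Job ID as the first key field.

```python
import sqlite3

# SQLite stands in for SQL Server here (an assumption for runnability).
# WITHOUT ROWID stores rows in primary-key order, roughly like a
# clustered index whose first key field is the job id.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE JobRecord (
        job_id    INTEGER NOT NULL,
        record_id INTEGER NOT NULL,
        payload   TEXT,
        PRIMARY KEY (job_id, record_id)
    ) WITHOUT ROWID
    """
)


def reload_job(conn, job_id, payloads):
    """Clear one job's old rows and load its new rows in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM JobRecord WHERE job_id = ?", (job_id,))
        conn.executemany(
            "INSERT INTO JobRecord (job_id, record_id, payload) VALUES (?, ?, ?)",
            ((job_id, i, p) for i, p in enumerate(payloads)),
        )


def count_rows(conn, job_id):
    """Number of rows currently stored for one job."""
    return conn.execute(
        "SELECT COUNT(*) FROM JobRecord WHERE job_id = ?", (job_id,)
    ).fetchone()[0]


reload_job(conn, 123, ["a", "b", "c"])
reload_job(conn, 456, ["x", "y"])
reload_job(conn, 123, ["d"])  # job 123 restarts: its old rows are replaced
```

Each job only ever touches the rows under its own job_id, so restarting job 123 leaves the rows of job 456 alone.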

Related

Fast data retrieval without indexes on a table with data insertions at every 10 seconds (short time span)

I am fetching data from a table of about 20K rows at a third-party source, where the way the table is filled can't be changed.
On the third-party side, the table is filled as follows:
New data arrives every 15 seconds, approximately 7K rows each time.
At any given time only the last three timestamps are available; the rest of the data is deleted.
There is no index on the table, and one cannot be requested, for unavoidable reasons and because it might slow down the inserts.
I am aware of the following:
Row locks, and other locks further up the hierarchy, are taken while data is being inserted.
The problem persists even when selecting with NOLOCK.
There is no join with any other table while fetching; we join the tables only once the data is local to us, in a temp table.
When data insertion is stopped at the third party, the data comes back in 100 ms to 122 ms.
When the service is on, it takes 3 to 5 seconds.
Any help, suggestion, or approach is appreciated.
The following is a fairly high-end solution. Based on what you have said I believe it would work, but there'd be a lot of detail to work out.
Briefly: table partitions.
Set up a partition scheme on this table
Based on an article I read recently, this CAN be done with unindexed heaps
Data is loaded every 15 seconds? Then the partitions need to be based on those 15 second intervals
For a given dataload (i.e. once per 15 seconds):
Create the "next" partition
Load the data
SWITCH the new partition (new data) into the main table
SWITCH the oldest partition out (data for only three time periods present at a time, right?)
Drop that "retired" partition
While potentially efficient and effective, this would be very messy. The big problem I see is, if they can't add a simple index, I don't see how they could possibly set up table partitioning.
Another similar trick is to set up partitioned views, which essentially is "roll your own partitioning". This would go something like:
Have a set of identically structured tables
Create a view UNION ALLing the tables
On dataload, create a new table, load data into that table, then ALTER VIEW to include that newest table and remove the oldest table.
This could have worse locking/blocking issues than the partitioning solution, though much depends on how heavy your read activity is. And, of course, it is much messier than just adding an index.
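The "roll your own partitioning" steps above can be sketched roughly as follows. This is an illustration only: Python's built-in sqlite3 stands in for SQL Server, the names Readings, Readings_N, ts and value are hypothetical, and because SQLite has no ALTER VIEW the sketch drops and recreates the view instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
KEEP = 3  # only the last three loads stay visible, as in the question


def rebuild_view(conn, tables):
    """Re-point the Readings view at the given tables.

    SQLite has no ALTER VIEW, so drop and recreate it (an assumption of
    this sketch; SQL Server would use ALTER VIEW).
    """
    conn.execute("DROP VIEW IF EXISTS Readings")
    selects = " UNION ALL ".join(f"SELECT ts, value FROM {t}" for t in tables)
    conn.execute(f"CREATE VIEW Readings AS {selects}")


def load_batch(conn, live_tables, batch_no, rows):
    """Load a batch into its own table, swap it into the view, retire the oldest."""
    name = f"Readings_{batch_no}"  # hypothetical naming scheme
    with conn:
        conn.execute(f"CREATE TABLE {name} (ts INTEGER, value REAL)")
        conn.executemany(f"INSERT INTO {name} VALUES (?, ?)", rows)
        live = live_tables + [name]
        retired, live = live[:-KEEP], live[-KEEP:]
        rebuild_view(conn, live)  # view now includes the newest table
        for old in retired:
            conn.execute(f"DROP TABLE {old}")  # oldest table is retired
    return live


live = []
for n in range(1, 6):  # five loads of two rows each
    live = load_batch(conn, live, n, [(n, 1.0), (n, 2.0)])
```

Readers only ever query Readings, so they never need to know which underlying tables are current.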

Optimum number of rows in a table for creating indexes

My understanding is that creating indexes on small tables could cost more than it benefits.
For example, there is no point creating indexes on a table with fewer than 100 rows (or even 1000 rows?)
Is there any specific number of rows as a threshold for creating indexes?
Update 1
The more I investigate, the more conflicting information I get. I might be too concerned about saving IO write operations, since my SQL Server database is in HA synchronous-commit mode.
Point #1:
This question concerns very much the IO write performance. With scenarios like SQL Server HA Synchronous-commit mode, the cost of IO write is high when database servers reside in cross subnet data centers. Adding indexes adds to the expensive IO write cost.
Point #2:
Books Online suggests:
Indexing small tables may not be optimal because it can take the query
optimizer longer to traverse the index searching for data than to
perform a simple table scan. Therefore, indexes on small tables might
never be used, but must still be maintained as data in the table
changes.
I am not sure adding an index to a table with only one row will ever have any benefit - or am I wrong?
Your understanding is wrong. Small tables also benefit from indexes, especially when they are joined with bigger tables.
The cost of an index has two parts: storage space and processing time during inserts and updates. The first is very cheap these days, so it can almost be disregarded. So your only real consideration should be tables with a lot of updates and inserts, where you need to apply the proper configuration.

Fact table partitioning: how to handle updates in ETL?

We are trying to implement table partitioning for a Data Warehouse Fact table which contains approximately 400M rows. Our ETL takes data from source system 50 days backwards (new rows, modified rows, based on source system timestamp) from the previous load. So in every ETL cycle there are new rows coming in, and also old rows which are updating the corresponding rows in the Fact table. The idea is to insert new rows into the Fact table and update modified rows.
The partition column would be date (int, YYYYMMDD) and we are considering partitioning by month.
As far as I understand, table partitioning would ease our inserts via fast partition switch operations. We could split the most recent partition to create a new free partition, load new rows into a staging table (using a date constraint, e.g. for the most recent month) and then use a partition switch operation to "move" the new rows into the partitioned Fact table. But how can we handle the modified rows which should update the corresponding rows in the Fact table? Those rows can contain data from the previous month(s). Does partition switch help here? Usually INSERT and UPDATE rows are determined by an ETL tool (e.g. SSIS in our case) or by a MERGE statement. How does partitioning work in these kinds of situations?
I'd take another look at the design and try to figure out if there's a way around the updates. Here are a few implications of updating the fact table:
Performance: Updates are fully logged transactions. Big fact tables also have lots of data to read and write.
Cubes: Updating the fact table requires reprocessing the affected partitions. As your fact table continues to grow, the cube processing time will continue to grow as well.
Budget: Fast storage is expensive. Updating big fact tables will require lots of fast reads and writes.
Purist theory: You should not change the fact table unless the initial value was an error (i.e. the user entered $15,000 instead of $1,500). Any non-error scenario will be changing the originally recorded transaction.
What is changing? Are the changing pieces really attributes of a dimension? If so, can they be moved to a dimension and have changes handled with a Slowly Changing Dimension type task?
Another possibility, can this be accomplished via offsetting transactions? Example:
The initial InvoiceAmount was $10.00. Accounting later added $1.25 for tax, then billed the customer for $11.25. Rather than updating the value to $11.25, insert a record for $1.25. The summed amount for the invoice will still be $11.25, and you can do a minimally logged insert rather than a fully logged update to accomplish this.
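A small sketch of that offsetting-transaction idea, using the invoice numbers from the example. Python's built-in sqlite3 stands in for SQL Server, and the table FactInvoice, its columns, and the choice to store amounts as integer cents (to avoid floating-point rounding) are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE FactInvoice ("
    " invoice_id INTEGER NOT NULL,"
    " amount_cents INTEGER NOT NULL,"  # cents, to keep the arithmetic exact
    " reason TEXT)"
)

with conn:
    # Original fact: the invoice was recorded at $10.00.
    conn.execute("INSERT INTO FactInvoice VALUES (1, 1000, 'initial amount')")
    # Later change: insert the $1.25 delta instead of updating the first row.
    conn.execute("INSERT INTO FactInvoice VALUES (1, 125, 'tax added later')")


def invoice_total_cents(conn, invoice_id):
    """Sum every fact row for one invoice, original plus offsets."""
    return conn.execute(
        "SELECT SUM(amount_cents) FROM FactInvoice WHERE invoice_id = ?",
        (invoice_id,),
    ).fetchone()[0]


total = invoice_total_cents(conn, 1)  # 1125 cents, i.e. $11.25
```

The original row is never touched, so the fact table stays insert-only and the history of the invoice is preserved.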
Not only is updating the fact table a bad idea in theory, it gets very expensive and non-scalable as the fact table grows. You'll be reading and writing more data, requiring more IOPS from the storage subsystem. When you get ready to do analytics, cube processing will then throw in more problems.
You'll also have to constantly justify to management why you need so many IOPS for the data warehouse. Is there business value/justification in needing all of those IOPS for your constantly changing "fact" table?
If you can't find a way around updates on the fact table, at least establish a cut-off point where the data is determined read-only. Otherwise, you'll never be able to scale.
Switching does not help here.
Maybe you can execute updates concurrently using multiple threads on distinct ranges of rows. That might speed it up. Be careful not to trigger lock escalation so you get good concurrency.
Also make sure that you update the rows mostly in ascending sort order of the clustered index. This helps with disk IO (this technique might not work well with multi-threading).
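As a rough, single-threaded sketch of the ordering advice: sort the pending changes by the clustered key and apply them in small batches, one short transaction each. Python's built-in sqlite3 stands in for SQL Server, and the table Fact, its columns, and the batch size are hypothetical; a suitable batch size for avoiding lock escalation would have to be found by testing on the real system.

```python
import sqlite3
from itertools import islice

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Fact (fact_id INTEGER PRIMARY KEY, amount INTEGER)")
with conn:
    conn.executemany("INSERT INTO Fact VALUES (?, ?)", ((i, 0) for i in range(1, 11)))


def apply_updates_in_key_order(conn, changes, batch_size=1000):
    """Apply (fact_id, new_amount) changes in ascending key order.

    Each batch is its own short transaction; returns the number of batches.
    """
    ordered = iter(sorted(changes))  # ascending fact_id, the table's key order
    batches = 0
    while True:
        batch = list(islice(ordered, batch_size))
        if not batch:
            return batches
        with conn:
            conn.executemany(
                "UPDATE Fact SET amount = ? WHERE fact_id = ?",
                [(amount, fact_id) for fact_id, amount in batch],
            )
        batches += 1


# Changes arrive unsorted; they are applied as keys 2, 5, then 7.
batches = apply_updates_in_key_order(conn, [(7, 70), (2, 20), (5, 50)], batch_size=2)
```

A multi-threaded variant would hand each worker a distinct, non-overlapping key range, which is not shown here.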
There are as many reasons to update a fact record as there are non-identifying attributes in the fact. Unless you plan on a "first delete" then "insert", you simply cannot avoid updates. You cannot simply say "record the metric deltas as new facts".

SQL Server 2008 indexes - performance gain on queries vs. loss on INSERT/UPDATE

How can you determine if the performance gained on a SELECT by indexing a column will outweigh the performance loss on an INSERT in the same table? Is there a "tipping-point" in the size of the table when the index does more harm than good?
I have a table in SQL Server 2008 with 2-3 million rows at any given time. Every time an insert is done on the table, a lookup is also done on the same table using two of its columns. I'm trying to determine if it would be beneficial to add indexes to the two columns used in the lookup.
Like everything else SQL-related, it depends:
What kind of fields are they? Varchar? Int? Datetime?
Are there other indexes on the table?
Will you need to include additional fields?
What's the clustered index?
How many rows are inserted/deleted in a transaction?
The only real way to know is to benchmark it. Put the index(es) in place and do frequent monitoring, or run a trace.
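A toy version of such a benchmark might look like the sketch below. It uses Python's built-in sqlite3 as a stand-in for SQL Server, and the table Item, the columns col_a and col_b, and the row counts are made up for illustration. The timings it produces say nothing about your server; the point is only the shape of the experiment: measure lookups and inserts, add the index, measure again.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Item (id INTEGER PRIMARY KEY, col_a INTEGER, col_b INTEGER)")
with conn:
    conn.executemany(
        "INSERT INTO Item (col_a, col_b) VALUES (?, ?)",
        ((i % 1000, i % 777) for i in range(20_000)),
    )

LOOKUP = "SELECT id FROM Item WHERE col_a = ? AND col_b = ?"


def time_lookups(conn, n=100):
    """Seconds taken by n lookups on the two columns."""
    start = time.perf_counter()
    for i in range(n):
        conn.execute(LOOKUP, (i % 1000, i % 777)).fetchall()
    return time.perf_counter() - start


def time_inserts(conn, n=5000):
    """Seconds taken to insert n rows in one transaction."""
    start = time.perf_counter()
    with conn:
        conn.executemany(
            "INSERT INTO Item (col_a, col_b) VALUES (?, ?)",
            ((i, i) for i in range(n)),
        )
    return time.perf_counter() - start


def lookup_plan(conn):
    """The query plan text SQLite reports for the lookup."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + LOOKUP, (1, 1)).fetchall()
    return " ".join(row[3] for row in rows)


before = (time_lookups(conn), time_inserts(conn), lookup_plan(conn))
conn.execute("CREATE INDEX IX_Item_a_b ON Item (col_a, col_b)")
after = (time_lookups(conn), time_inserts(conn), lookup_plan(conn))

print("lookups: %.4fs -> %.4fs" % (before[0], after[0]))
print("inserts: %.4fs -> %.4fs" % (before[1], after[1]))
print("plan:    %s -> %s" % (before[2], after[2]))
```

On a real system you would run this against a copy of production data and a realistic mix of inserts and lookups.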
This depends on your workload and your requirements. Sometimes data is loaded once and read millions of times, but sometimes not all loaded data is ever read.
Sometimes reads or writes must complete in certain time.
Case 1: If the table is static and is queried heavily (e.g. an item table in a shopping cart application), then indexes on the appropriate fields are highly beneficial.
Case 2: If the table is highly dynamic and not much querying is done on a daily basis (e.g. log tables used for auditing purposes), then indexes will slow down the writes.
If the two cases above are the boundary cases, then whether or not to build indexes on a table depends on which of them the table in question comes closest to.
If that doesn't settle it, leave it to the judgement of the Database Engine Tuning Advisor. Good luck.

Your first gut feeling on this SqlServer design question

We have 2 tables. One holds measurements, the other holds timestamps (one for every minute).
Every measurement holds an FK to a timestamp.
We have 8M (million) measurements, and 2M timestamps.
We are creating a report database via replication, and my first solution was this: when a new measurement was received via the replication process, lookup the right timestamp and add it to the measurement table.
Yes, it's duplication of data, but it is for reporting, and since we have measurements every 5 minutes and users can query for yearly data (105,000 measurements) we have to optimize for speed.
But a co-developer said: you don't have to do that, we'll just query with a join (on the two tables), SqlServer is so fast, you don't see the difference.
My first reaction was: a join on two tables with 8M and 2M records can't make 'no difference'.
What is your first feeling on this?
EDIT:
new measurements: 400 records per 5 minutes
EDIT 2:
maybe the question is not so clear:
the first solution is to get the data from the timestamp table and copy it to the measurement table when the measurement record is inserted.
In that case we have an action when the record is inserted AND an extra (duplicated) timestamp value. In this case we only query ONE table because it holds all the data.
The second solution is to join the two tables in a query.
With the proper index the join will make no difference*. My initial thought is that if the report is querying over the entire dataset, the join might actually be faster because there are literally 6 million fewer timestamps that it has to read from the disk.
*This is just a guess based on my experience with tables with millions of records. Your results will vary based on your queries.
I'd create an Indexed View (similar to a Materialized view in Oracle) which joins the tables using appropriate indexes.
If the query just retrieves the data for the given date ranges, there will be a merge join - that is, a range scan for each of the two tables. Since the timestamp table presumably contains only the timestamp, this shouldn't be expensive.
On the other hand, if you have only one table and index on the date column, the index itself becomes larger and more expensive to scan.
So, with properly constructed indexes and queries, I wouldn't expect a significant difference in performance.
I'd suggest you keep a properly normalized design until you start having performance problems that force you to change it. And then you need to carefully analyze query plans and measure performance with different options - there are lots of things that could matter in your particular case.
Frankly, in this case your best bet is to try both solutions and see which one is better. Performance tuning is an art once you start talking about large data sets, and it is highly dependent not only on your database design but also on the hardware, whether you are using partitioning, and so on. Be sure to test both getting the data out and putting the data in. Since you have so many inserts, insert speed is critical, and the index you would need on the datetime field is critical to select performance, so you really need to test this thoroughly. Don't forget to clear the cache when you test. Test multiple times and, if possible, under a typical query load.
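A scaled-down harness for trying both designs could look like this. It is only a sketch: Python's built-in sqlite3 stands in for SQL Server, the tables Stamp, Measurement and MeasurementFlat and their indexes are hypothetical, and the data volumes are tiny compared with the 8M and 2M rows in the question, so the timings themselves mean nothing. What it does show is how to check that both designs return the same rows before comparing their speed.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Normalized design: measurements point at a timestamp row.
    CREATE TABLE Stamp (id INTEGER PRIMARY KEY, ts INTEGER NOT NULL);
    CREATE TABLE Measurement (
        id INTEGER PRIMARY KEY,
        stamp_id INTEGER NOT NULL REFERENCES Stamp(id),
        value REAL
    );
    -- Denormalized design: the timestamp is copied onto each measurement.
    CREATE TABLE MeasurementFlat (id INTEGER PRIMARY KEY, ts INTEGER NOT NULL, value REAL);
    CREATE INDEX IX_Stamp_ts ON Stamp (ts);
    CREATE INDEX IX_Measurement_stamp ON Measurement (stamp_id);
    CREATE INDEX IX_Flat_ts ON MeasurementFlat (ts);
    """
)
with conn:
    conn.executemany("INSERT INTO Stamp VALUES (?, ?)", ((i, i * 60) for i in range(1000)))
    conn.executemany(
        "INSERT INTO Measurement (stamp_id, value) VALUES (?, ?)",
        ((i // 4, float(i)) for i in range(4000)),  # four measurements per timestamp
    )
    conn.execute(
        "INSERT INTO MeasurementFlat (ts, value) "
        "SELECT s.ts, m.value FROM Measurement m JOIN Stamp s ON s.id = m.stamp_id"
    )

JOINED = (
    "SELECT s.ts, m.value FROM Measurement m JOIN Stamp s ON s.id = m.stamp_id "
    "WHERE s.ts BETWEEN ? AND ? ORDER BY s.ts, m.value"
)
FLAT = "SELECT ts, value FROM MeasurementFlat WHERE ts BETWEEN ? AND ? ORDER BY ts, value"


def timed(conn, sql, params, repeats=20):
    """Run the query several times; return its rows and the total seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        rows = conn.execute(sql, params).fetchall()
    return rows, time.perf_counter() - start


joined_rows, joined_seconds = timed(conn, JOINED, (600, 6000))
flat_rows, flat_seconds = timed(conn, FLAT, (600, 6000))
print("join: %.4fs  flat: %.4fs" % (joined_seconds, flat_seconds))
```

The same pattern extends to timing the inserts for each design, which matters here because the flat design does extra work on every insert.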
