I'm looking to manage a large dataset of log files. There's an average of 1.5 million new events per month that I'm trying to keep. I've used Access in the past, though it's clearly not meant for this, and managing the dataset is a nightmare because I'm having to split it into monthly chunks.
For the most part, I just need to filter on event types and count them. But before I do a bunch of work on the data import side of things, I wanted to see if anyone can verify that SQL Server is a good choice for this. Is there a row limit I should stay under, and is there a way of archiving older entries?
The other part is that I'm collecting logs from multiple sources. With this number of entries, is it wise to put them all into the same table, or should each source have its own table to make queries faster?
Edit:
There would be no joins, and about 10 columns. Data would be filtered through a view, and I'm interested to see whether a SELECT query that filters on one or more columns would have a reasonable response time. Does creating a set of views speed things up for frequent queries?
In my experience, SQL Server is a fine choice for this, and you can definitely expect better performance from SQL Server than MS-Access, with generally more optimization methods at your disposal.
I would probably go ahead and put this stuff into SQL Server Express as you've said, hopefully installed on the best machine you can use (though you did mention only 2GB of RAM). Use one table so long as it only represents one thing (I would think a pilot's flight log and a software error log wouldn't be in the same "log" table, as an absurdly contrived example). Check your performance. If it's an issue, move forward with any number of optimization techniques available to your edition of SQL Server.
Here's how I would probably do it initially:
Create your table with a non-clustered primary key, if you use a PK on your log table at all. I normally use an identity column to give me a guaranteed order of events (unlike duplicate datetimes) and to show possible log insert failures (missing identities).
Set a clustered index on the main datetime column (you mentioned that you're already splitting into separate tables by month, so I assume you'll query this way, too).
If you have a few queries that you run on this table routinely, by all means make views of them, but don't expect a speedup simply by doing so. You'll more than likely want to index your table based on the WHERE clauses in those queries; that is where you give SQL Server the information it needs to run them efficiently.
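A minimal sketch of that starting point, with made-up table and column names (your log will obviously have different ones):

-- Hypothetical log table: identity PK kept non-clustered,
-- clustered index on the datetime column we expect to filter on.
CREATE TABLE dbo.EventLog (
    EventLogID   bigint IDENTITY(1,1) NOT NULL,
    EventTime    datetime     NOT NULL,
    EventType    varchar(50)  NOT NULL,
    SourceSystem varchar(50)  NOT NULL,
    Message      varchar(255) NULL,
    CONSTRAINT PK_EventLog PRIMARY KEY NONCLUSTERED (EventLogID)
);

-- Clustered index on the datetime column so date-range queries read contiguous pages.
CREATE CLUSTERED INDEX CIX_EventLog_EventTime ON dbo.EventLog (EventTime);

-- Index to support the "filter by event type and count" style of query.
CREATE NONCLUSTERED INDEX IX_EventLog_EventType ON dbo.EventLog (EventType, EventTime);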
If you're unable to get your desired performance through optimizing your queries, indexes, using the smallest possible datatypes (especially on your indexed columns) and running on decent hardware, it may be time to try partitioned views (which require some form of ongoing maintenance) or partitioning your table. Unfortunately, SQL Server Express may limit you on what you can do with partitioning, and you'll have to decide if you need to move to a more feature-filled edition of SQL Server. You could always test partitioning with the Enterprise evaluation or Developer editions.
Update:
For the most part, I just need to filter event types and count the number.
Since past logs don't change (much like past sales data), storing pre-aggregated numbers is a common strategy in this scenario. You can create a table which simply stores your counts for each month and insert new counts once a month (or week, day, etc.) with a scheduled job of some sort. Using the clustered index on your datetime column, SQL Server can cheaply aggregate just the current month's numbers from the live table and add them to the stored aggregates when displaying total counts and the like.
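A rough sketch of that pre-aggregation approach, with hypothetical names and assuming SQL Server 2008 or later for the date type:

-- Hypothetical summary table holding one row per month and event type.
CREATE TABLE dbo.EventTypeMonthlyCounts (
    MonthStart  date        NOT NULL,
    EventType   varchar(50) NOT NULL,
    EventCount  bigint      NOT NULL,
    CONSTRAINT PK_EventTypeMonthlyCounts PRIMARY KEY (MonthStart, EventType)
);

-- Run once a month from a scheduled job: aggregate the month that just
-- closed from the live log table and store the counts.
INSERT INTO dbo.EventTypeMonthlyCounts (MonthStart, EventType, EventCount)
SELECT CAST(DATEADD(MONTH, DATEDIFF(MONTH, 0, EventTime), 0) AS date),
       EventType,
       COUNT(*)
FROM dbo.EventLog
WHERE EventTime >= DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()) - 1, 0)
  AND EventTime <  DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()), 0)
GROUP BY DATEADD(MONTH, DATEDIFF(MONTH, 0, EventTime), 0), EventType;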
Sounds like one table to me, with indexes on exactly the sets of columns you will filter on. Restricting access through views is generally a good idea and helps ensure your indexes actually get used.
Putting each source into its own table will require UNIONs in your queries later, and SQL Server is not very good at optimizing UNION queries.
"Archiving" entries can of course be done manually, by moving entries in a date-range to another table (that can live on another disk or database), or by using "partitioning", which means you can put parts of a table (e.g. defined by date-ranges) on different disks. You have to plan for the partitions when you plan your SQL-Server installation.
Be aware that Express edition is limited to a 4GB database, so at 1.5 million rows per month this could become a problem.
I have a table like yours with 20M rows and have had little trouble querying and even joining against it, as long as the indexes are used.
We have a 4-5TB SQL Server database. The largest table is around 800 GB and contains 100 million rows. 4-5 other comparable tables are 1/3-2/3 of this size. We went through a process of creating new indexes to optimize performance. While performance certainly improved, we saw that the newly inserted data was the slowest to query.
It's a financial reporting application with a BI tool working on top of the database. The data is loaded overnight, continuing into the late morning, though the majority of it is loaded by 7am. Users start to query data around 8am through the BI tool and are most concerned with the latest (daily) data.
I wanted to know whether newly inserted data causes indexes to go out of order. Is there anything we can do so that we get better performance on the newly inserted data than on the old data? I hope I have explained the issue well here. Let me know if any information is missing. Thanks.
Edit 1
Let me describe the architecture a bit.
I have a base table (let's call it Base) with Date, Id as the clustered index.
It has around 50 columns
Then we have 5 derived tables (Derived1, Derived2, ...), one per metric type, which also have Date, Id as the clustered index and a foreign key constraint back to the Base table.
Tables Derived1 and Derived2 have 350+ columns; Derived3, 4 and 5 have around 100-200 columns. There is one large view that joins all the data tables, due to limitations of the BI tool. Date, Id are the join columns for all the tables in the view (hence the clustered index on those columns). The main concern is BI tool performance. The BI tool always uses the view and generally sends similar queries to the server.
There are other indexes as well on other filtering columns.
The main question remains - how to prevent performance from deteriorating.
In addition, I would like to know:
Whether a nonclustered index (NCI) on Date, Id on all tables would be a better bet, in addition to the clustered index on Date, Id.
Does it make sense to have 150 included columns in the NCI for the derived tables?
You have about 100 million rows, growing every day with new batches, and those new batches are the ones usually selected. With those numbers I would use partitioned indexes rather than regular indexes.
Your solution within SQL Server would be partitioning. Take a look at SQL Server partitioning and see if you can adopt it. Partitioning is a form of clustering where groups of data share a physical block. If you partition by year and month, for example, all 2018-09 records share the same physical space and are easy to find, so when you select records with those filters (plus others) it is as if the table were only the size of the 2018-09 records. That is not exactly accurate, but it is close. Be careful with the data values you partition on: unlike a standard PK cluster, where each value is unique, the partitioning column(s) should produce a manageable set of distinct combinations, and therefore partitions.
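If you go that route, the moving parts look roughly like this; monthly ranges on the Date column are an assumption, and this presumes the existing clustered index is a plain index you can rebuild rather than a PK constraint:

-- Partition function: one partition per month boundary listed here.
CREATE PARTITION FUNCTION pfMonthly (date)
AS RANGE RIGHT FOR VALUES ('2018-07-01', '2018-08-01', '2018-09-01');

-- Partition scheme: map every partition to a filegroup (all to PRIMARY here
-- for brevity; in practice you might spread them over several filegroups/disks).
CREATE PARTITION SCHEME psMonthly
AS PARTITION pfMonthly ALL TO ([PRIMARY]);

-- Rebuild the clustered index on the partition scheme, partitioned by Date.
CREATE CLUSTERED INDEX CIX_Base_Date_Id
ON dbo.Base ([Date], [Id])
WITH (DROP_EXISTING = ON)
ON psMonthly ([Date]);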
If you cannot use partitioning, you have to create 'partitions' yourself using regular indexes. This will require some experimentation. The basic idea is a column (a number, say) indicating a wave, or set of waves, of imported data. Data imported today and over the next 10 days, say, would be wave '1', the next 10 days would be '2', and so on. By filtering on the latest 10 waves you work on only the latest 100 days of imports and effectively skip all the rest of the data. Roughly, if you divided your existing 100 million rows into 100 waves, started new inserts at wave 101 and searched for waves 90 or greater, you would have about 10 million rows to search, provided SQL Server is persuaded to use the new index first (it will, eventually).
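A rough illustration of that manual 'wave' idea (the Wave column and the numbers are entirely hypothetical):

-- Tag each import batch with a wave number and index it.
ALTER TABLE dbo.Base ADD Wave int NULL;

CREATE NONCLUSTERED INDEX IX_Base_Wave ON dbo.Base (Wave);

-- Queries for "recent" data then filter on the latest waves,
-- so SQL Server only has to touch a small slice of the table.
SELECT COUNT(*)
FROM dbo.Base
WHERE Wave >= 90;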
This is a broad question, especially without knowing your system. But one thing that I would try is manually updating your statistics on the indexes/table once you are done loading data. With tables that big, it is unlikely that you will manipulate enough rows to trigger an auto-update. Without clean stats, SQL Server won't have an accurate histogram of your data.
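For example, something along these lines at the end of the nightly load (the table name is a placeholder):

-- Refresh statistics after the bulk load so the optimizer sees the new date range.
UPDATE STATISTICS dbo.Base WITH FULLSCAN;
-- Or, cheaper on a very large table, sample instead of scanning everything:
-- UPDATE STATISTICS dbo.Base WITH SAMPLE 10 PERCENT;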
Next, dive into your execution plans and see what operators are the most expensive.
We have a few tables that are periodically recomputed within SQL Server. The computation takes a few seconds to a few minutes and we do the following:
Dump the results in computed_table_tmp
Drop computed_table
Rename computed_table_tmp to computed_table (and all its indexes).
However, we still seem to run into concurrency issues where our application requests a view that uses this computed table at the precise moment when it no longer exists.
What would be the best technique to avoid this type of problem while ensuring high availability?
If this table is part of your high-availability requirement, then you can't do this the way you've been doing it. Dropping a table in a production SQL environment breaks the concept of high availability.
You might be able to accomplish what you're trying to achieve by creating one or more partitions on this table. A partitioned table is divided into subgroups of rows that can be spread across more than one filegroup in your database. For querying purposes, however, the table is still a single logical entity. The advantage of using a table partition is that you can move around subsets of your data without breaking the integrity of the database, i.e., high-availability is still in place.
In your scenario, you'd have to modify your process such that all activity takes place in the production version of the table. The new rows are dumped in to a separate partition, based on the value of your partition function. Then you'll need to switch the partitions.
One of the things you'll need to do is identify a column in your table that may be used as the partition column, which determines which partition a row will be allocated to. This might be, for example, a datetime column indicating when the row was generated. You can even use a computed column for this purpose, provided it is a PERSISTED column.
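To make the switch step concrete, here is a hedged sketch, assuming the staging and scratch tables exactly match the partitioned table's structure and constraints:

-- Empty the target partition by switching its current rows out to a scratch
-- table on the same filegroup with an identical structure...
ALTER TABLE dbo.computed_table
SWITCH PARTITION 2 TO dbo.computed_table_old;  -- partition number is illustrative

-- ...then switch the freshly computed rows in. The staging table needs a check
-- constraint matching the partition's boundary; both operations are metadata-only.
ALTER TABLE dbo.computed_table_staging
SWITCH TO dbo.computed_table PARTITION 2;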
One caveat: Table partitioning is not available in all editions of SQL Server... I don't believe Standard has it.
I have a SQL Server 2008 database with 30000000000 records in one of its major tables. Now we are looking at the performance of our queries. We have already created all our indexes. I found that we can split our database tables into multiple partitions, so that the data is spread over multiple files, which should increase query performance.
But unfortunately this functionality is only available in the SQL Server Enterprise edition, which is unaffordable for us.
Is there any way to optimize the query performance? For example, the query
select * from mymajortable where date between '2000/10/10' and '2010/10/10'
takes around 15 min to retrieve around 10000 records.
A SELECT * will obviously be less efficiently served than a query that uses a covering index.
First step: examine the query plan and look for table scans and the steps taking the most effort (%).
If you don’t already have an index on your ‘date’ column, you certainly need one (assuming sufficient selectivity). Try to reduce the columns in the select list, and if ‘sufficiently’ few, add these to the index as included columns (this can eliminate bookmark lookups into the clustered index and boost performance).
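For the sample query above, a sketch might look like this (col1/col2 stand in for whatever columns you actually need):

-- Index on date, carrying the handful of columns the report actually needs,
-- so the query can be answered from the index alone (no key lookups).
CREATE NONCLUSTERED INDEX IX_mymajortable_date
ON dbo.mymajortable ([date])
INCLUDE (col1, col2);  -- placeholders for the columns you select

SELECT col1, col2
FROM dbo.mymajortable
WHERE [date] BETWEEN '2000-10-10' AND '2010-10-10';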
You could break your data up into separate tables (say by a date range) and combine via a view.
It is also very dependent on your hardware (# cores, RAM, I/O subsystem speed, network bandwidth).
Suggest you post your table and index definitions.
First, always avoid SELECT *, as it fetches every column; if there is an index with just the columns you need, you are fetching a lot of unnecessary data. Using only the exact columns you need lets the server make better use of indexes.
Secondly, have a look at included columns for your indexes; that way, frequently requested data can be included in the index to avoid having to fetch the rows.
Third, you might try using an int column for the date and converting the date into an int. Ints are usually more effective in range searches than dates, especially if you have time information too; if you can drop the time information, the index will be smaller.
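One way to sketch that, assuming the column is a datetime and using a yyyymmdd integer key (test whether it actually beats a plain date index on your data):

-- Persisted computed column with the date as yyyymmdd (time stripped), plus an index on it.
ALTER TABLE dbo.mymajortable
ADD DateKey AS CONVERT(int, CONVERT(char(8), [date], 112)) PERSISTED;

CREATE NONCLUSTERED INDEX IX_mymajortable_DateKey ON dbo.mymajortable (DateKey);

-- Range search on the int key instead of the datetime column.
SELECT COUNT(*)
FROM dbo.mymajortable
WHERE DateKey BETWEEN 20001010 AND 20101010;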
One more thing to check is the execution plan the server uses; you can see this in Management Studio if you enable "Show Execution Plan" in the menu. It can indicate where the problem lies; you can see which indexes it tries to use, and sometimes it will suggest new indexes to add.
It can also indicate other problems: a Table Scan or Index Scan is bad, as it means the whole table or index has to be scanned, while an Index Seek is good.
It is a good source to understand how the server works.
If you add an index on date, you will probably speed up your query via an index seek plus key lookup instead of a clustered index scan, but if your filter on date returns too many records the index will not help you at all, because the key lookup is executed for each result of the index seek; SQL Server will then switch to a clustered index scan.
To get the best performance you need to create a covering index, that is, include all the columns you need in the "included columns" part of your index; but that will not help you if you use SELECT *.
Another issue with the SELECT * approach is that you can't make efficient use of the plan cache and execution plan reuse. If you really need all columns, make sure you specify all the columns instead of the *.
You should also fully qualify the object name to make sure your plan is reusable.
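In other words, prefer the schema-qualified form (column names here are placeholders):

-- Unqualified: plan reuse depends on each caller's default schema.
SELECT col1, col2 FROM mymajortable WHERE [date] >= '2010-01-01';

-- Schema-qualified: unambiguous object reference, friendlier to plan reuse.
SELECT col1, col2 FROM dbo.mymajortable WHERE [date] >= '2010-01-01';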
You might consider creating an archive database and moving anything older than, say, 10-20 years into it. This should drastically speed up your primary production database while retaining all of your historical data for reporting needs.
What type of queries are we talking about?
Is this a production table? If yes, look into normalizing the DB and see if you cannot take it a bit further.
If this is for reports, including a lot of Ad Hoc report queries, this screams data warehouse.
I would create a DW with separate pre-processed reports which include all the calculation and aggregation you could expect.
I am a bit worried about a business model which involves dealing with BIG data but does not generate enough revenue or even attract enough venture investment to upgrade to enterprise.
We have a full text index on a fairly large table of 633,569 records. The index is rebuilt from scratch as part of a maintenance plan every evening, after a bunch of DTS packages run that delete / insert records. Large chunks of data are deleted, then inserted (to take care of updates and inserts), so incremental indexing is not a possibility. Changing the packages to only delete when necessary is not a possibility either as it is a legacy application that will eventually be replaced.
The FTI includes two columns - one a varchar(50) NOT NULL and one a varchar(255) NULL.
There is a clustered index on the primary key column, which is just an identity column. There is also a combined index on an integer column and the varchar(50) column mentioned above. This latter index was added for performance reasons.
The problem is that the re-indexing is painfully slow - about 8 hours.
The server is fairly robust (dual processor, 4gb of ram), and everything runs quickly beyond this re-indexing.
Any tips on how to speed this up?
UPDATE
Our client has access to the SQL box. It turns out they had turned on change tracking on the table that is part of the full-text index. We turned this off, and the full population took less than 3 hours. Still not great, but better than 8.
UPDATE 2
The FTI is again taking ~8 hours to populate.
SQL Server's full-text indexing is slow primarily because of its asynchronous data extraction scheme.
Use change tracking with the "update
index in background" option.
The easiest way to improve the performance of full-text indexing is to use change tracking with the "update index in background" option. When you index a table (FTI, like "standard" SQL indexes, works on a per-table basis), you specify full population, incremental population, or change tracking. When you opt for full population, every row in the table you're full-text indexing is extracted and indexed. This is a two-step process.
First, you (or Enterprise Manager) run this system stored procedure:
sp_fulltext_getdata CatalogID, object_id
After all the result sets of timestamps and PK values are returned to MSSearch, MSSearch will issue another sp_fulltext_getdata, but this time once for every row in your table. So if you have 50 million rows in your database, this procedure will be issued 50 million times.
On the other hand, if you use an incremental population, MSSearch will issue an initial:
sp_fulltext_getdata CatalogID, object_id
for each row in the table that you're full-text indexing. So if you have 50 million rows in your database, this statement will also be issued 50 million times. Why? Because even with an incremental population, MSSearch must figure out exactly which rows have been changed, updated, and deleted. Another problem with incremental populations is that they'll index or re-index a row even if the change was made to a column that you aren't full-text indexing.
Although an incremental population is generally faster than a full population, you can see that for large tables, either will be time-consuming.
I recommend you enable change tracking with background or scheduled updating. If you do, you'll see that MSSearch will first issue another:
sp_fulltext_getdata CatalogID, object_id
for every row in the table with change tracking enabled. Then, for every row that has a column that you're full-text indexing and that's modified after your initial full population, the row information will be written (in the database you're indexing) to the sysfulltextnotify table. MSSearch will then issue that call only for the rows that appear in this table and will then remove them from the sysfulltextnotify table.
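If I remember the SQL Server 2000-era procedures correctly, enabling this per table looks something like the following (the table name is a placeholder; double-check the syntax for your version):

-- Enable change tracking on the full-text indexed table,
-- then have the index updated in the background as rows change.
EXEC sp_fulltext_table 'dbo.MyIndexedTable', 'start_change_tracking';
EXEC sp_fulltext_table 'dbo.MyIndexedTable', 'start_background_updateindex';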
Consider using a separate build server
Tables that are heavily updated while you're indexing can create locking problems, so if you can live with a catalog that's periodically out of date and an MSSearch engine that's sometimes unavailable, consider using a separate build server. You do this by making sure the indexing server has a copy of the table to be full-text indexed and exporting the catalog. Clearly, if you need real-time or near real-time updates to your catalog, this is not a good solution.
Limit activity when population is running
When population is running, don't run Profiler, and limit other database activity as much as possible. Profiler consumes significant resources.
Increase the number of threads for the indexing process
Increase the number of threads you're running for the indexing process. The default is only five, and on quads or 8-ways, you can bump this up to much higher values. MSSearch will, however, throttle itself if it's slurping too much data from SQL Server, so avoid doing this on single- or dual-processor systems.
Stop any anti-virus or open file-agent backup software
If this is not possible, try to prevent them from scanning the temporary directories being used by SQL FTI and the catalog directories.
Place the catalog, temp directory and pagefiles on their own controllers
If you can make that investment: place the catalog on its own controller, preferably on a RAID-1 array; place the temp directory on a RAID-1 array; similarly, consider putting the pagefile on its own RAID-1 array with its own controller.
Consider creating secondary data files for the tempdb - 1 per CPU / core
Do you have enough RAM?
What are your file drive placements in terms of RAID configuration?
Are you seeing high tempDB activity?
(BTW, half a million records is not large; it's not even medium... ;) )
Is the system offline whilst you are doing the reindex, or live?
Are these the only items in your full-text catalog? If not, you might want to consider separating them out from the remainder of your FTS data (it might help with monitoring too). In the index, is the identity column configured as the unique key?
Can you quantify the large amounts of changes? There are 3 basic options for repopulation; you might want to try switching to full or incremental, as one may suit you better than the one you are using now. In my experience, incremental works well if changes to the total DB are less than 40% (I had a similar issue during large data take-ons into the database). If there is more than 40% change, then full is likely better (from my experience - I index documents, so it might work differently for you). The third option you might want to consider is Change Tracking with the scheduled-update reindex option.
If you can take the server offline to users, what performance settings do you have FTS running under whilst reindexing? You can check this in the Full-Text Search Service Properties / Performance tab - System Resource Usage is a slider (I think there are 4 or 5 positions). There is probably a system proc to change this; I don't know it and don't have a 2000 machine to check anymore.
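The system proc you're thinking of is, if memory serves, sp_fulltext_service with the resource_usage setting (values 1-5, default 3); treat this as a sketch to verify on your version:

-- Bump the full-text service resource usage from the default (3) to the maximum (5)
-- while the box is dedicated to reindexing, then set it back afterwards.
EXEC sp_fulltext_service 'resource_usage', 5;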
FTS / reindexing loves RAM, and lots of it; the general rule of thumb is to have virtual memory 3x the physical memory. If you have several physical disks, create several Pagefile.sys files, so that each Pagefile.sys file is placed on its own physical disk. Are you on NT or Windows 2000? Check that extended memory over 2GB is actually configured properly.
Try putting the index on a separate physical disk than the database.
EDIT: Scott reports this is already the case.
Disallowing nulls in the column that currently allows them might not speed up the index, but in my experience it is a better practice, especially for indexing purposes. The only columns I can justify allowing nulls in are date columns.
Here is a checklist of parameters for FT indexing performance on SQL Server. Most of them have already been quoted and checked here. I don't see one of them in your comments, though:
The SQL Server MAX SERVER MEMORY setting should be set manually (dynamic memory allocation turned off) so that enough virtual memory is left for the Full-Text Search service to run. To achieve this, select a MAX SERVER MEMORY setting that, once set, leaves enough virtual memory for the Full-Text Search service to access an amount equal to 1.5 times the physical RAM in the server. This will take some trial and error to get right.
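Scripted rather than set through Enterprise Manager, that looks like this; the 6144 MB figure is purely an example, work out your own number from the 1.5x-RAM rule above:

-- Cap SQL Server's buffer pool so the Full-Text Search service keeps enough memory.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 6144;  -- example value, tune for your server
RECONFIGURE;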
Improve the Performance of Full-Text Indexes: http://msdn.microsoft.com/en-us/library/ms142560.aspx
Are there any hard limits on the number of rows in a SQL Server table? I am under the impression the only limit is based on physical storage.
At what point does performance significantly degrade, if at all, on tables with and without an index? Are there any common practices for very large tables?
To give a little domain knowledge, we are considering usage of an Audit table which will log changes to fields for all tables in a database and are wondering what types of walls we might run up against.
You are correct that the number of rows is limited by your available storage.
It is hard to give any numbers as it very much depends on your server hardware, configuration, and how efficient your queries are.
For example, a simple select statement will run faster and show less degradation than a Full Text or Proximity search as the number of rows grows.
BrianV is correct. It's hard to give a rule because it varies drastically based on how you will use the table, how it's indexed, the actual columns in the table, etc.
As to common practices... for very large tables you might consider partitioning. This could be especially useful if you find that for your log you typically only care about changes in the last 1 month (or 1 day, 1 week, 1 year, whatever). You could then archive off older portions of the data so that it's available if absolutely needed, but won't be in the way since you will almost never actually need it.
Another thing to consider is to have a separate change log table for each of your actual tables if you aren't already planning to do that. Using a single log table makes it VERY difficult to work with. You usually have to log the information in a free-form text field which is difficult to query and process against. Also, it's difficult to look at data if you have a row for each column that has been changed because you have to do a lot of joins to look at changes that occur at the same time side by side.
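As a sketch of the per-table approach (every name here is invented for illustration), a change-log table that mirrors the audited table's columns plus a little metadata is far easier to query than a generic one-row-per-column log:

-- Hypothetical audited table dbo.Customer and its dedicated change-log table.
-- The log repeats the business columns so a change can be read as a whole row.
CREATE TABLE dbo.Customer_Audit (
    AuditID     bigint IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CustomerID  int           NOT NULL,
    Name        nvarchar(100) NULL,
    CreditLimit decimal(18,2) NULL,
    AuditAction char(1)       NOT NULL,  -- 'I'nsert, 'U'pdate, 'D'elete
    AuditDate   datetime      NOT NULL DEFAULT GETDATE(),
    AuditUser   sysname       NOT NULL DEFAULT SUSER_SNAME()
);

CREATE TRIGGER dbo.trCustomer_Audit ON dbo.Customer
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- Log the new state for inserts/updates, and the old state for deletes.
    INSERT INTO dbo.Customer_Audit (CustomerID, Name, CreditLimit, AuditAction)
    SELECT CustomerID, Name, CreditLimit,
           CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'U' ELSE 'I' END
    FROM inserted
    UNION ALL
    SELECT CustomerID, Name, CreditLimit, 'D'
    FROM deleted
    WHERE NOT EXISTS (SELECT 1 FROM inserted);
END;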
In addition to all the above, which are great recommendations, I thought I would give a bit more context on the index/performance point.
As mentioned above, it is not possible to give a single performance number, as performance will differ depending on the quality and number of your indexes. It also depends on which operations you want to optimize. Do you need to optimize inserts, or are you more concerned about query response?
If you are truly concerned about insert speed, partitioning, as well as VERY careful index consideration, is going to be key.
The separate-table recommendation from Tom H is also a good idea.
With audit tables, another approach is to archive the data once a month (or week, depending on how much data you put in it) or so. That way, if you need to recreate some recent changes from the fresh data, it can be done against smaller tables and thus more quickly (recovering from audit tables is almost always an urgent task, I've found!). But you still have the data available in case you ever need to go back further in time.