Large scale ETL string lookups performance issues

I have an ETL process performance problem. I have a table with 4+ billion rows in it. Structure is:
id bigint identity(1,1)
raw_url varchar(2000) not null
md5hash char(32) not null
job_control_number int not null
Clustered unique index on the id and non clustered unique index on md5hash
SQL Server 2008 Enterprise
Page level compression is turned on
We have to store the raw URLs from our web-server logs as a dimension. Since a raw string can exceed 900 characters (SQL Server's index key limit is 900 bytes), we cannot put a unique index on that column. We use an md5 hash function to create a unique 32-character string for indexing purposes. We cannot allow duplicate raw_url strings in the table.
The problem is poor performance. The md5hash is of course random by nature, so index fragmentation climbs to 50%, which leads to inefficient I/O.
Looking for advice on how to structure this to allow better insertion and lookup performance as well as less index fragmentation.
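For readers who want to see the hashing step concretely, here is a minimal Python sketch. It assumes the URL is encoded as UTF-8 before hashing (the question does not say how the ETL encodes it), and the URL itself is made up. It also shows that the same digest fits in 16 raw bytes, which is half the width of the 32-character hex form; whether a binary(16) column suits this table is something to test, not a given.

```python
import hashlib


def url_hash_hex(raw_url: str) -> str:
    """Return the MD5 digest as 32 hex characters (fits a char(32) column)."""
    return hashlib.md5(raw_url.encode("utf-8")).hexdigest()


def url_hash_bytes(raw_url: str) -> bytes:
    """Return the same MD5 digest as 16 raw bytes."""
    return hashlib.md5(raw_url.encode("utf-8")).digest()


url = "http://example.com/some/page?x=1"  # hypothetical URL
print(len(url_hash_hex(url)), len(url_hash_bytes(url)))  # 32 16
```

Either form is just as random as the other, so this only narrows the key; it does not change the insert pattern that causes the fragmentation.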

I would break up the table into physical files, with the older non-changing data in a read-only file group. Make sure the non-clustered index is also in the filegroup.
Edit (from comment): And while I'm thinking about it, if you turn off page level compression, that'll improve I/O as well.

I would argue that it should be a degenerate dimension in the fact table.
And figure some way to do partitioning on the data. Maybe take the first xxx characters and store them as a separate field, and partition by that.
Then when you're doing lookups, you're passing the short and long columns, so it's looking in a partition first.
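A rough Python sketch of that idea, with in-memory dictionaries standing in for partitions. The prefix length of 20 and the class and method names are hypothetical, and UTF-8 encoding is assumed; in the database the prefix would be a separate column used as the partitioning key.

```python
import hashlib
from collections import defaultdict

PREFIX_LEN = 20  # hypothetical stand-in for "the first xxx characters"


class PartitionedUrlStore:
    """Toy model: one dict per prefix partition, keyed by the MD5 of the full URL."""

    def __init__(self):
        self._partitions = defaultdict(dict)  # prefix -> {md5 hex: raw_url}

    @staticmethod
    def _keys(raw_url):
        prefix = raw_url[:PREFIX_LEN]
        digest = hashlib.md5(raw_url.encode("utf-8")).hexdigest()
        return prefix, digest

    def add(self, raw_url):
        """Insert raw_url unless it is already stored; return True if inserted."""
        prefix, digest = self._keys(raw_url)
        partition = self._partitions[prefix]
        if digest in partition:
            return False
        partition[digest] = raw_url
        return True

    def contains(self, raw_url):
        """Look in the prefix partition first, then match on the hash."""
        prefix, digest = self._keys(raw_url)
        partition = self._partitions.get(prefix)
        return partition is not None and digest in partition


store = PartitionedUrlStore()
print(store.add("http://example.com/a"))       # True
print(store.add("http://example.com/a"))       # False
print(store.contains("http://example.com/b"))  # False
```

One thing to watch: URL prefixes tend to be heavily skewed (many rows share the same scheme and host), so the choice of prefix decides how evenly the partitions fill.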

Related

SQL Server: ~2000 Heap Tables all using GUID Uniqueidentifier - Possible Clustered Indexing?

I have just taken over a database which has around 2200 tables. Over 2000 of these have no clustered index (some have no indexes at all).
All of the tables have been configured to use a GUID as the uniqueidentifier.
Just looking at the query plans, I can see that there are many table scans occurring. Most searches use the uniqueidentifier to search on.
I am wondering if it is better to have a clustered index on the GUID than not to have a clustered index at all. I imagine that a clustered index on a 16-byte column will inevitably lead to fragmentation.
I could arguably cluster on other columns but the majority of searches tend to search by or join via the GUIDS.
Any advice would be very much welcomed. I've never seen so many GUID's!!

In general, I would recommend having an identity column as the primary key and using that for clustering. This is also a better choice for joins.
Why? First, identity keys are generally shorter than unique ids. So, foreign key references and indexes are smaller.
More importantly, inserts would always go at the "end" of the table. When using GUIDs, inserts are often going to cause fragmentation. If you are inserting rows, I would say that a secondary index on the GUID might be better than a clustered index (the fragmentation is only in the index).
With 2000 tables, I doubt you will change the structure. You can ameliorate the fragmentation using newsequentialid().
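To make the "inserts go at the end" point tangible, here is a toy Python simulation. It is not how SQL Server manages pages; it only assumes a fixed number of keys per page (100, a made-up figure), a 50/50 split when a full page receives a key in the middle, and a fresh page when a key lands past the end of a full last page.

```python
import bisect
import random

PAGE_CAPACITY = 100  # hypothetical number of keys that fit on one page


def simulate_inserts(keys, capacity=PAGE_CAPACITY):
    """Insert keys into a toy leaf level; return (page_splits, average_fullness)."""
    pages = [[]]            # each page is a sorted list of keys
    lows = [float("-inf")]  # lows[i] is the lower bound of pages[i]
    splits = 0
    for key in keys:
        idx = bisect.bisect_right(lows, key) - 1
        page = pages[idx]
        is_last = idx == len(pages) - 1
        if is_last and len(page) >= capacity and key > page[-1]:
            # Ever-increasing key: a full last page simply gets a new neighbour.
            pages.append([key])
            lows.append(key)
            continue
        bisect.insort(page, key)
        if len(page) > capacity:
            # Insert into the middle of a full page: split it roughly in half.
            half = len(page) // 2
            pages[idx] = page[:half]
            pages.insert(idx + 1, page[half:])
            lows.insert(idx + 1, page[half])
            splits += 1
    fullness = sum(len(p) for p in pages) / (len(pages) * capacity)
    return splits, fullness


n = 10_000
seq_splits, seq_full = simulate_inserts(range(n))  # identity-like keys
rnd_splits, rnd_full = simulate_inserts(random.Random(42).sample(range(10**9), n))  # GUID-like keys
print(seq_splits, round(seq_full, 2))  # 0 1.0
print(rnd_splits, round(rnd_full, 2))  # nonzero split count, pages noticeably less full
```

In this model the sequential keys never split a page and leave every page full, while the random keys split pages constantly and leave free space scattered through the index.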

A GUID column with random values is usually not the best choice for a clustered index, because it can be the root cause of index fragmentation:
The database's read-ahead won't be effective;
Insert operations become expensive, because you'll get a lot of page-split overhead.
There are 3 ways you can live with that:
Schedule regular index reorganizing and rebuilding, which will reduce index fragmentation and, in the case of a rebuild, update your statistics automatically;
Use newsequentialid() for generating values of this column;
Generate GUID values sequentially outside of the database (the Guid.Comb identifier in NHibernate is a great example of solving this issue).
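As an illustration of the third option, here is a hedged Python sketch of a COMB-style GUID: ten random bytes followed by a six-byte millisecond timestamp. This is not NHibernate's exact byte layout, and the function name is made up. It relies on the assumption that SQL Server treats the final six bytes as the most significant when it compares uniqueidentifier values, so later values sort after earlier ones.

```python
import os
import time
import uuid


def comb_guid(timestamp_ms=None):
    """Build a GUID whose last 6 bytes are a big-endian millisecond timestamp."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    random_part = os.urandom(10)                                     # 10 random bytes
    time_part = (timestamp_ms & 0xFFFFFFFFFFFF).to_bytes(6, "big")   # low 48 bits
    return uuid.UUID(bytes=random_part + time_part)


earlier = comb_guid(1_000)
later = comb_guid(2_000)
print(earlier.bytes[10:] < later.bytes[10:])  # True
```

Values generated within the same millisecond are still ordered randomly among themselves, so this reduces page splits rather than eliminating them.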

This is really a comment on your question to Gordon's good answer:
Firstly, don't forget to check the index DMVs to see which ones are being used (or not used) and have a look at the expensive query plans in the cache to focus on the tables and queries that will be causing most pain. I would expect that many of those 2200 tables are relatively small & the queries are able to look up pretty quickly even from the guid clustered index.
For those tables that aren't clustered, clustering on the guid would reduce fragmentation, since it forces all the data for the table to be colocated rather than allowing pages to be put in the next free extent & spreading tables all over the disk. This should make some of the I/O more efficient.
Check you have a low enough fillfactor so that your regular index rebuilds avoid page splitting in advance, although it will also be workload dependent (OLTP vs DW and read/write ratio of table)
If you have applications that are doing explicit column selects/inserts then you may be able to add an identity column without breaking anything. That allows you to cluster around the identity & add an index to the guid. Whether this really helps depends on the relative (in)efficiency of the new plans.
You could consider clustering around a non-guid field where queries will lookup against it fairly regularly (eg, a date range) and index the guid separately.
You'd have to look at the queries & relative performance for that more closely.

Reducing disk space of sql database

I have a database that holds 2 TB of data, and I want to reduce it to 500 GB by dropping some rows and removing some useless columns. I have other ideas for optimizations, but I need answers to some questions first.
My database has one .mdf file and 9 other .ndf files, and each file has an initial size of 100 GB.
Should I reduce the initial size of each .ndf file to 50 GB? Can this operation affect my data?
Does dropping an index help to reduce space?
PS: My database contains only a single table, which has one clustered index and two other nonclustered indexes.
I want to remove the two nonclustered indexes
Remove the insertdate column
If you have any other ideas for optimizations, they would be very helpful.

Before dropping any indexes, query these two views:
sys.dm_db_index_usage_stats
sys.dm_db_index_operational_stats
They will let you know if any of them are being used to support queries. The last thing you want is to remove an index and start seeing full table scans on a 2TB table.
If you can't split up the table into a relational model then try these for starters.
Check your data types.
-Can you replace NVARCHAR with VARCHAR or NCHAR with CHAR? (they take up half the space)
-Does your table experience a lot of Updates or a lot of Inserts (above view will tell you this)? If there are very few updates then consider changing CHAR fields to VARCHAR fields. Heavy updates can cause page splits and result in poor Page fullness.
-Check that columns only storing a Date with no time are not declared as Datetime
-Check value ranges in numeric fields i.e. try and use Smallint instead of Int.
Look at the activity on the table, update & insert behaviour. If the activity means very few Pages are re-arranged then consider increasing your Fill Factor.
Look at the Plan Cache, get an idea of how the table is being queried, if the bulk of queries focus on a specific portion of the table then implement a Filtered Index.
Is your Clustered Index Unique? If not then SQL creates a "hidden extra Integer column" that creates uniqueness under the bonnet.
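To get a feel for what the data type changes are worth, here is some back-of-the-envelope arithmetic in Python. The byte sizes are the uncompressed storage sizes of the SQL Server types; the row count and average string length are made-up figures, and row or page compression would change the real numbers.

```python
# Uncompressed storage size, in bytes, of some fixed-width SQL Server types.
TYPE_BYTES = {"datetime": 8, "date": 3, "bigint": 8, "int": 4, "smallint": 2}


def bytes_saved(rows, old_type, new_type):
    """Bytes saved across the table by narrowing one fixed-width column."""
    return rows * (TYPE_BYTES[old_type] - TYPE_BYTES[new_type])


def unicode_bytes_saved(rows, avg_chars):
    """NVARCHAR typically stores 2 bytes per character, VARCHAR 1."""
    return rows * avg_chars


ROWS = 1_000_000_000  # hypothetical row count
print(bytes_saved(ROWS, "datetime", "date") / 1e9)  # 5.0 (GB)
print(bytes_saved(ROWS, "int", "smallint") / 1e9)   # 2.0 (GB)
print(unicode_bytes_saved(ROWS, 40) / 1e9)          # 40.0 (GB)
```

The saving applies again in every index that contains the column, which is why narrowing a key column usually pays off more than narrowing a non-indexed one.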

Clustered index consideration in regards to distinct values and large result sets and a single vertical table for auditing

I've been researching best practices for creating clustered indexes and I'm just trying to totally understand these two suggestions that are listed in pretty much every blog or article on the matter
Columns that contain a large number of distinct values.
Queries that return large result sets.
These seem to be slightly contrary or I'm guessing maybe it just depends on how you're accessing the table.. Or my interpretation of what "large result sets" mean is wrong....
Unless you're doing range queries over the clustered column it seems like you typically won't be getting large result sets that matter. So in cases where SQL Server defaults the clustered indexes on the PK you're rarely going to fulfill the large result set suggestion but of course it does the large number of distinct values..
To give the question a little more context: this question stems from a vertical auditing table we have that has a column for TABLE.... Every single query that's written against this table has a
WHERE TABLE = 'TABLENAME'
But the TableName is highly non distinct... Each result set of tablenames is rather large which seems to fulfill that second condition but it's definitely not largely unique.... Which means all that other stuff happens with having to add the 4 byte uniquifier which makes the table a lot larger etc...
This situation has come up a few times for me when I've come upon DBs that have say all the contact or some accounts normalized into a single table and they are only separated by a TYPE parameter. Which is on every query....
In the case of the audit table the queries are typically not that exciting either they are just sorted by date modified, sometimes filtered by column, user that made the change etc...
My other thought with this auditing scenario was to just make the auditing table a HEAP so that inserting is fast so there's not contention between tables being audited and then to generate indexed views over the data ...

Index design is just as much art as it is science.
There are many things to consider, including:
How the table will be accessed most often: mostly inserts? any updates? more SELECTs than DML statements? Any audit table will likely have mostly inserts, no updates, rarely deletes unless there is a time-limit on the data, and some SELECTs.
For Clustered indexes, keep in mind that the data in each column of the clustered index will be copied into each non-clustered index (for UNIQUE non-clustered indexes it is stored only at the leaf level rather than at every level). This is helpful as those values are available to queries using the non-clustered index for covering, etc. But it also means that the physical space taken up by the non-clustered indexes will be that much larger.
Clustered indexes generally should either be declared with the UNIQUE keyword or be the Primary Key (though there are exceptions, of course). A non-unique clustered index will have a hidden 4-byte field called a uniquifier that is required to make each row with a non-unique key value addressable. It is just wasted space, given that the order of your rows within the non-unique groupings is not obvious, so trying to narrow down to a single row is still a range.
As is mentioned everywhere, the clustered index is the physical ordering of the data so you want to cater to what needs the best I/O. This relates also to the point directly above where non-unique clustered indexes have an order but if the data is truly non-unique (as opposed to unique data but missing the UNIQUE keyword when the index was created) then you miss out on a lot of the benefit of having the data physically ordered.
Regardless of any information or theory, TEST TEST TEST. There are many more factors involved that pertain to your specific situation.
So, you mentioned having a Date field as well as the TableName. If the combination of the Date and TableName is unique then those should be used as a composite key on a PK or UNIQUE CLUSTERED index. If they are not then find another field that creates the uniqueness, such as UserIDModified.
While most recommendations are to have the most unique field as the first one (due to statistics being only on the first field), this doesn't hold true for all situations. Given that all of your queries are by TableName, I would opt for putting that field first to make use of the physical ordering of the data. This way SQL Server can read more relevant data per read without having to seek to other locations on disk. You would likely also be ordering on the Date so I would put that field second. Putting TableName first will cause higher fragmentation across INSERTs than putting the Date first, but upon an index rebuild the data access will be faster as the data is already both grouped ( TableName ) and ordered ( Date ) as the queries expect. If you put Date first then the data is still ordered properly but the rows needed to satisfy the query are likely spread out across the datafile(s), which would require more I/O to get. And more data pages to satisfy the same query means more pages in the Buffer Pool, potentially pushing out other pages and reducing Page Life Expectancy (PLE). Also, you would then really need to include the Date field in all queries, as any queries using only TableName (and possibly other filters but NOT using the Date field) will have to scan the clustered index or force you to create a nonclustered index with TableName being first.
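A small Python model of that trade-off, under made-up assumptions: 10 audited tables, 1,000 days with one audit row per table per day, and 50 rows per page. It sorts the rows by each candidate key order and counts how many pages hold rows for one table.

```python
from itertools import product

ROWS_PER_PAGE = 50  # hypothetical page capacity


def pages_touched(rows, key, table_name, rows_per_page=ROWS_PER_PAGE):
    """Count the pages that hold rows for table_name when rows are stored in key order."""
    ordered = sorted(rows, key=key)
    return len({pos // rows_per_page
                for pos, row in enumerate(ordered) if row[0] == table_name})


tables = [f"T{i}" for i in range(10)]
rows = [(table, day) for day, table in product(range(1000), tables)]  # (TableName, Date)

table_first = pages_touched(rows, lambda r: (r[0], r[1]), "T3")
date_first = pages_touched(rows, lambda r: (r[1], r[0]), "T3")
print(table_first, date_first)  # 20 200
```

With TableName leading, the 1,000 rows for one table sit on 20 adjacent pages; with Date leading, they are spread thinly over all 200 pages, which is the extra I/O and buffer pool pressure described above.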
I would be wary of the Heap plus Indexed View model. Yes, it might be optimized for the inserts but the system still needs to maintain the data in the indexed view across all DML statements against the heap. Again you would need to test, but I don't see that being materially better than a good choice of fields for a clustered index on the audit table.

optimizing sql server database

My database has one very large table with over 2 billion rows with 3 columns.
Id(uniqueidentity), Type(int, between 0-10. 0 = most used. 10 = least used), Data(Binary data between 1-10MB)
What are some ways I can optimize this database? (primarily select queries)
*Note: I might add a few more columns to this table later (eg: location, date...)

Assuming that the id column is the clustered index key, and assuming that by uniqueidentity you mean uniqueidentifier:
do you need the uniqueidentifier type? Why?
What other alternatives have you considered?
Do you populate the data using sequential GUIDs or not?
GUIDs are a notoriously poor choice for clustered keys. See GUIDs as PRIMARY KEYs and/or the clustering key for a more detailed discussion:
But, a GUID that is not sequential - like one that has its values generated in the client (using .NET) OR generated by the newid() function (in SQL Server) can be a horribly bad choice - primarily because of the fragmentation that it creates in the base table but also because of its size. It's unnecessarily wide (it's 4 times wider than an int-based identity - which can give you 2 billion (really, 4 billion) unique rows). And, if you need more than 2 billion you can always go with a bigint (8-byte int) and get 2^63-1 rows.
Also read Disk space is cheap...That's not the point! as a follow up.
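The width argument can be put into numbers with a short Python calculation. It assumes the clustering key is stored once in the clustered index and once in every nonclustered index row, ignores compression and page overhead, and uses a hypothetical count of two nonclustered indexes; the row count is the "over 2 billion" from the question.

```python
# Storage size in bytes of candidate clustering key types.
KEY_BYTES = {"int": 4, "bigint": 8, "uniqueidentifier": 16}


def clustering_key_bytes(rows, key_type, nonclustered_indexes):
    """Bytes spent on the key: once in the clustered index, once per nonclustered index."""
    return rows * KEY_BYTES[key_type] * (1 + nonclustered_indexes)


ROWS = 2_000_000_000  # "over 2 billion rows" from the question
for key_type in KEY_BYTES:
    print(key_type, clustering_key_bytes(ROWS, key_type, 2) / 1e9, "GB")
# int 24.0 GB
# bigint 48.0 GB
# uniqueidentifier 96.0 GB

print(2**31 - 1)  # 2147483647, the largest int identity value
print(2**63 - 1)  # 9223372036854775807, the largest bigint identity value
```

Note that with more than 2 billion rows already, an int identity starting at 1 would be out of range, so bigint is the realistic narrow alternative here.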
Other than this, you need to do your homework and post the required details for such a question: exact table and index definition, prevalent data access pattern (by key, by range, filters sort order, joins etc etc).
Have you done any work to identify problems so far? If not, start with Waits and Queues, a proven methodology to identify performance bottlenecks. Once you measure and find places that need improvement, we can advise how to improve.

Add an Index(es). Decide which column(s) are the most appropriate clustered index.
Decide if storing 10MB of binary data in each (otherwise small) row is a good use of a database
[Updated in response to Remus's comment]

Do DB indexes take same amount of disc space as column data?

If I have a table column with data and create an index on this column, will the index take same amount of disc space as the column itself?
I'm interested because I'm trying to understand if b-trees actually keep copies of column data in leaf nodes or they somehow point to it?
Sorry if this is a "Will Java replace XML?" kind of question.
UPDATE:
created a table without index with a single GUID column, added 1M rows - 26MB
same table with a primary key (clustered index) - 25MB (even less!), index size - 176KB
same table with a unique key (nonclustered index) - 26MB, index size - 27MB
So only nonclustered indexes take as much space as the data itself.
All measurements were done in SQL Server 2005

The B-Tree points to the row in the table, but the B-Tree itself still takes some space on disk.
Some databases have a special kind of table which embeds the main index and the data. In Oracle, it's called an IOT -- index-organized table.
Each row in a regular table can be identified by an internal ID (but it's database specific) which is used by the B-Tree to identify the row. In Oracle, it's called rowid and looks like AAAAECAABAAAAgiAAA :)
If I have a table column with data and create an index on this column, will the index take same amount of disc space as the column itself?
In a basic B-Tree, you have the same number of nodes as the number of items in the column.
Consider 1,2,3,4:
  1
 /
2
 \ 3
  \ 4
The exact space can still be a bit different (the index is probably a bit bigger as it needs to store links between nodes, it may not be perfectly balanced, etc.), and I guess a database can use optimizations to compress part of the index. But the order of magnitude of the index and the column data should be the same.
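A minimal Python sketch of why the sizes end up comparable. A sorted list of (key, row position) pairs stands in for the B-Tree's leaf level; it is not a real B-Tree, and the class name and sample data are made up. The point is that the index holds its own copy of every key plus a pointer back to the row.

```python
import bisect


class SimpleIndex:
    """Sorted (key, row position) pairs: a stand-in for an index's leaf level."""

    def __init__(self, column_values):
        # One entry per table row: a copy of the key plus a pointer to the row.
        self.entries = sorted((value, pos) for pos, value in enumerate(column_values))
        self.keys = [key for key, _ in self.entries]

    def lookup(self, value):
        """Return the row positions whose column equals value, via binary search."""
        lo = bisect.bisect_left(self.keys, value)
        hi = bisect.bisect_right(self.keys, value)
        return [pos for _, pos in self.entries[lo:hi]]


table = ["delta", "alpha", "charlie", "bravo", "alpha"]  # the indexed column
index = SimpleIndex(table)
print(index.lookup("alpha"))            # [1, 4]
print(len(index.entries) == len(table)) # True
```

Because there is one entry per row and each entry carries the full key, the index cannot be much smaller than the column it covers unless the database compresses it.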

I'm almost sure it's quite DB dependent, but generally – yeah, they take additional space. This happens for two reasons:
This way you can utilize the fact that the data in B-tree leaves is sorted;
You gain a lookup speed advantage as you don't have to seek back and forth to fetch the necessary stuff.
PS just checked our mysql server: for a 20GB table indexes take 10GB of space :)

Judging by this article, it will, in fact, take at least the same amount of space as the data in the column (in PostgreSQL, anyway).
The article also goes to suggest a strategy to reduce disk and memory usage.
A way to check for yourself would be to use e.g. the Derby DB: create a table with a million rows and a single column, check its size, create an index on the column and check its size again. If you take the 10-15 minutes to do so, let us know the results. :)
