I am currently performance tuning a Microsoft Dynamics AX 2009 installation that is fully upgraded to the latest kernel / hotfix.
During this process I came across the database fill factor set to 80. I'm not sure why.
I have raised it to 95 for now, not daring to make the final move to 100 just yet.
Any thoughts on this?
Now the question I came here for:
What flags would be recommended on the SQL Server to support the Dynamics AX 2009?
As mentioned, it is fully upgraded, and it is set up to send parameterized queries but with the DataAreaId as a literal, in order to get a dedicated plan for each company.
During the past 10 years it has been performance tuned a few times.
These flags are currently set: 1117, 1118, 1224, 2371, 4136, 4199, 7646
I would like to remove 4136.
A bit about myself
Database performance tuning for AX in my experience requires the intersection of two expertises (AX and SQL Server) that rarely exist in one person. I would consider myself on the AX expert side of things and have enough SQL Server knowledge to be dangerous (or to get by). So be aware of this when reading the rest of this answer.
General observations
First, two general observations:
AX performance issues rarely come down to database performance issues. The first check should always be in the application, to find where the bottleneck is. 9 times out of 10, it is an issue with the application, the data or layer 8, but not the database.
If you do SQL Server performance optimization, you should know what you are doing. It is far easier to make things worse than it is to make them better. You should have good performance monitoring in place that tells you how things changed after you changed a setting. Changing a setting just because it seems like a good idea is not a good idea.
Your questions
Now, in your question you mention two separate points:
index fill factor: I wouldn't change this setting without having a good reason to do so. Many tables in AX have a large number of rows and experience frequent updates. Increasing the fill factor would degrade performance for these tables. To quote from SQL Server Index Fill factor with a Performance Benchmark:
In this example we have used 80% Fill Factor, however it doesn’t make sense to push without any benchmarking for the table. In most cases SQL Server index Fill factor will help to get well performed when Table having large number of rows and frequent update over the rows. Before setting the Fill Factor we need to analyse the Datatype of columns, actual cell size of the rows, Average number of rows in the pages and estimated updated size of row cell. This proper calculation derives an actual Fill factor value which need to be applied on the table.
removing flag 4136 (which disables parameter sniffing): I would remove this flag. As far as I know, it is not set in a standard installation of AX 2009. It was probably set with good intentions and because some blog articles on the internet say it improves performance. I suggest reading the following two articles for more information on this flag and what should be used instead:
SQL Server Parameter Sniffing with Dynamics AX, just plain evil
SQL Trace Flag for Dynamics Ax: Do we need it?
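To put rough numbers on the fill factor trade-off discussed above, here is a minimal sketch. It assumes about 8096 usable bytes per 8 KB page and ignores per-row overhead, and the row count and row size are made-up values, not measurements from any AX table. It only shows how many leaf pages an index needs right after a rebuild and how many rows of free space each page keeps for later inserts and updates.

```python
import math

# Assumption: roughly 8096 bytes of an 8 KB page are available for rows.
PAGE_DATA_BYTES = 8096


def rows_per_page(row_bytes: int, fill_factor: int) -> int:
    """Rows placed on each leaf page when the index is rebuilt."""
    usable = PAGE_DATA_BYTES * fill_factor // 100
    return max(1, usable // row_bytes)


def leaf_pages(row_count: int, row_bytes: int, fill_factor: int) -> int:
    """Estimated leaf page count right after a rebuild."""
    return math.ceil(row_count / rows_per_page(row_bytes, fill_factor))


def free_slots(row_bytes: int, fill_factor: int) -> int:
    """Rows that still fit on a page before it has to split."""
    return rows_per_page(row_bytes, 100) - rows_per_page(row_bytes, fill_factor)


# Hypothetical table: 1,000,000 rows of 200 bytes each.
for ff in (80, 95, 100):
    print(ff, leaf_pages(1_000_000, 200, ff), free_slots(200, ff))
```

A lower fill factor costs more pages to read but leaves room on each page, which is why it can pay off on tables with frequent updates. Only benchmarking on the real tables can tell which effect dominates.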
More information
If you want to go further down the rabbit hole of performance optimization for AX, I suggest the following articles. They are mostly for AX 2012, but parts should apply to AX 2009 too, and they provide a starting point:
AX Performance Troubleshooting Checklist Part 1A [Introduction and SQL Configuration]
AX Performance Troubleshooting Checklist Part 1B [Application and AOS Configuration]
AX Performance Troubleshooting Checklist Part 2
AX Performance – Checking key SQL Server configuration and database settings
A bit of meta
Your question is in a very gray area of being on topic for Stack Overflow. You may have better luck on serverfault or Database Administrators.
Related
I'm designing a database that will need to be optimized for maximum speed.
All the database data is generated once from something I call an input database (which holds the data I'm editing, mainly some polylines, markers, etc for google maps).
So the database is not subject to editing, but it needs to hold as much data as it can for quickly displaying results to the user (routes across town, custom polylines, etc).
The question is: will choosing smaller data types, for example smallint over int, improve performance or hurt it? Space is not really a problem: after some quick calculations, the database will not exceed 200 MB, and there will not be tables with more than 100.000 rows (the average will be around 5.000).
I'm asking this because I read some articles around the internet, and some say that smaller data types improve performance while others say they hurt it because additional processing must be done. I'm aware that for smaller databases the results are probably not noticeable, but I'm interested in every bit because I'm expecting many requests which will trigger a lot more queries.
The hosting environment is gonna be Windows Server 2008 R2 with SQL Server 2008 R2.
EDIT 1: Just to give you an example because I don't have a proper table structure yet:
I'm going to have a table which will hold public transportation lines (somewhere around 200), identified by a unique number in real life, and which is going to be referenced in all sorts of tables and on which all sorts of operations are going to be made. These referencing tables will hold the largest amount of data.
Because lines have unique numbers, I have thought of 3 examples of designs:
The PK is the line number of datatype: smallint
The PK is the line number of datatype: int
The PK is something different (identity for example) and the line number is stored in a different field.
Just for the sake of argument: in the 'input database', which is not subject to optimization, the PK is a GUID (16 bytes). If you like, you can compare how bad this is against the other options, if it really is bad.
So keep in mind that the PK is going to be referenced in at least 15 tables, some of which will have over 50.000 rows (the rest averaging 5.000 as I said above) which are going to be subject to constant querying and manipulation, and I'm interested in every bit of speed that I can get.
I can detail this even more if you need. Thanks
EDIT 2: And another question related to this came to my mind, think it fits into this discussion:
Will I see any performance improvements in this specific scenario if I use native SQL queries from inside my .NET application rather than using LINQ to SQL? I know LINQ is strongly optimized and generates very good queries performance-wise, but still, sure worth asking. Thanks again.
Can you point to some articles that say that smaller data types = more processing? Keeping in mind that even with SSDs most workloads today are I/O-bound (or memory-bound) and not CPU-bound.
Particularly in cases where the PK is going to be referenced in many tables, it will be beneficial to use the smallest data type possible. In this case if that's a SMALLINT then that's what I would use (though you say there are about 200 values, so theoretically you could use TINYINT, which is half the size and supports 0-255). Where you need to exercise caution is if you aren't 100% sure that there will always be ~200 values. Once you need 256 you're going to have to change the data type in all of the affected tables, and this is going to be a pain. So sometimes a trade-off is made between accommodating future growth and squeezing the absolute most performance today. If you don't know for certain that you will never exceed 255 or 32,000 values then I would probably just use an INT. Unless you also don't know that you won't ever exceed 2 billion values, in which case you would use BIGINT.
The difference between INT/SMALLINT/TINYINT is going to be more noticeable in disk space than in performance. (And if you're on Enterprise, the differences in both disk space and performance can be offset quite a bit using data compression - particularly while your INT values all fit within SMALLINT/TINYINT, though in the latter case it really will be negligible because the values are unique.) On the other hand, the difference between any of these and GUID is going to be much more noticeable in both performance and disk space. Marc gave some great links from Kimberly; I wrote this article in 2003 and while it's a little dated it does contain most of the salient points that are still relevant today.
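To make the size comparison concrete, here is a small sketch. The type sizes and ranges are the standard SQL Server ones; the row counts (3 tables of 50,000 rows and 12 of 5,000) are only my guess at the scenario in the question, and the helper names are made up for illustration.

```python
# (name, bytes, max value) for SQL Server integer key types.
KEY_TYPES = [
    ("TINYINT", 1, 255),
    ("SMALLINT", 2, 32_767),
    ("INT", 4, 2**31 - 1),
    ("BIGINT", 8, 2**63 - 1),
]
GUID_BYTES = 16  # uniqueidentifier


def smallest_key_type(max_value: int) -> str:
    """Smallest integer type whose upper bound covers max_value."""
    for name, _size, upper in KEY_TYPES:
        if max_value <= upper:
            return name
    raise ValueError("value does not fit in BIGINT")


def key_storage_bytes(rows_per_table: list[int], key_bytes: int) -> int:
    """Bytes spent on the referenced key column across all referencing tables."""
    return sum(rows_per_table) * key_bytes


# Hypothetical layout: 15 referencing tables, as described in the question.
tables = [50_000] * 3 + [5_000] * 12
for name, size, _upper in KEY_TYPES:
    print(name, key_storage_bytes(tables, size))
print("GUID", key_storage_bytes(tables, GUID_BYTES))
```

At this scale the integer types differ by a few hundred kilobytes in total, while a GUID key is several times larger again, and that is before counting the indexes that repeat the key.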
Another trade-off that sometimes needs to be considered (though not in your specific case, it seems) is whether values need to be unique across multiple systems. This is where you might need to sacrifice some performance in order to meet business requirements. In a lot of cases folks take the easy way and resign themselves to GUID. But there are other solutions too, such as identity ranges, a central custom sequence generator, and the new SEQUENCE object in SQL Server 2012. I wrote about SEQUENCE back in 2010 when the first public beta of SQL Server 2012 was released.
I think you will need to provide some more details about the table structure and the sample queries that will be running against them. Based on the information that you have provided, I believe that the impact of choosing smaller data types will be just a couple of percent, and I would suggest paying more attention to the indexes that you will have. SQL Server does a good job of suggesting which indexes to create, through the execution plans for your queries and the tuning advisor tool.
One suggestion that I have is to incorporate a decimal datatype instead of using a combination of fields. For example, instead of having a table with Date (YYYYMMDD), Store (SSSS), and Item (IIII), I would recommend...YYYYMMDD.SSSSIIII. Especially when querying multiple tables with this same key combination, it dramatically improves processing time.
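A minimal sketch of how such a combined key could be built and taken apart again, assuming four-digit store and item numbers as in the YYYYMMDD.SSSSIIII example. The function names are made up, and `Decimal` is used so that no digits are lost to floating point rounding.

```python
from datetime import date
from decimal import Decimal


def encode_key(day: date, store: int, item: int) -> Decimal:
    """Pack date, store and item into one YYYYMMDD.SSSSIIII decimal."""
    if not (0 <= store <= 9999 and 0 <= item <= 9999):
        raise ValueError("store and item must fit in four digits")
    return Decimal(f"{day:%Y%m%d}.{store:04d}{item:04d}")


def decode_key(key: Decimal) -> tuple[date, int, int]:
    """Split a YYYYMMDD.SSSSIIII decimal back into its three parts."""
    whole, frac = f"{key:.8f}".split(".")
    day = date(int(whole[:4]), int(whole[4:6]), int(whole[6:8]))
    return day, int(frac[:4]), int(frac[4:])


key = encode_key(date(2024, 3, 15), 12, 345)
print(key, decode_key(key))
```

The keys sort by date, then store, then item, so a single column can stand in for the three-column combination in joins.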
I'm still learning the ropes of OLAP, cubes, and SSAS, but I'm hitting a performance barrier and I'm not sure I understand what is happening.
So I have a simple cube, which defines two simple dimensions (type and area), a third Time dimension hierarchy (goes Year->Quarter->Month->Day->Hour->10-Minute), and one measure (sum on a field called Count). The database tracks events: when they occur, what type they are, and where they occurred. The fact table is a precalculated summary of events for each 10 minute interval.
So I set up my cube and I use the browser to view all my attributes at once: total counts per area per type over time, with drill down from Year down to the 10 Minute Interval. Reports are similar in performance to the browse.
For the most part, it's snappy enough. But as I get deeper into the drill-tree, it takes longer to view each level. Finally at the minute level it seems to take 20 minutes or so before it displays the mere 6 records. But then I realized that I could view the other minute-level drilldowns with no waiting, so it seems like the cube is calculating the entire table at that point, which is why it takes so long.
I don't understand. I would expect that going to Quarters or Years would take longest, since it has to aggregate all the data up. Going to the lowest metric, filtered down heavily to around 180 cells (6 intervals, 10 types, 3 areas), seems like it should be fastest. Why is the cube processing the entire dataset instead of just the visible sub-set? Why is the highest level of aggregation so fast and the lowest level so slow?
Most importantly, is there anything I can do by configuration or design to improve it?
Some additional details that I just thought of which may matter: This is SSAS 2005, running on SQL Server 2005, using Visual Studio 2005 for BI design. The Cube is set (as by default) to full MOLAP, but is not partitioned. The fact table has 1,838,304 rows, so this isn't a crazy enterprise database, but it's no simple test db either. There's no partitioning and all the SQL stuff runs on one server, which I access remotely from my work station.
When you are looking at the minute level - are you talking about all events from 12:00 to 12:10 regardless of day?
I would think if you need that to go faster (because obviously it would be scanning everything), you will need to make the two parts of your "time" dimension orthogonal - make a date dimension and a time dimension.
If you are getting 1/1/1900 12:00 to 1/1/1900 12:10, I'm not sure what it could be then...
Did you verify the aggregations of your cube to ensure they were correct? An easy way to tell is if you get the same number of records no matter which drill-tree you go down.
Assuming this is not the case, what Cade suggests about making a Date dimension AND a Time dimension would be the most obvious approach, but it is one of the bigger no-nos in SSAS. See this article for more information: http://www.sqlservercentral.com/articles/T-SQL/70167/
Hope this helps.
I would also check that you are running the latest service pack for SQL Server 2005.
The RTM version had some SSAS perf issues.
Also check that you have correctly defined attribute relationships on your time dimension and the other dimensions as well.
Not having these relationships defined will cause the SSAS storage engine to scan more data than necessary.
More info: http://ms-olap.blogspot.com/2008/10/attribute-relationship-example.html
As stated above, splitting out the date and time will significantly decrease the cardinality of your date dimension, which should increase performance and allow a better analytic experience.
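A quick back-of-the-envelope sketch of the cardinality argument, assuming the 10-minute grain from the question and a hypothetical three years of data:

```python
SLOTS_PER_DAY = 24 * 6  # number of 10-minute intervals in a day


def combined_members(days: int) -> int:
    """Leaf members when date and time live in one dimension."""
    return days * SLOTS_PER_DAY


def split_members(days: int) -> int:
    """Leaf members in total when date and time are separate dimensions."""
    return days + SLOTS_PER_DAY


days = 3 * 365  # assumption: about three years of history
print(combined_members(days), split_members(days))
```

The combined dimension grows by 144 members per day, while the split version grows by one member per day and the time-of-day dimension stays fixed at 144.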
As I understand it, most query optimizers are "cost-based". Others are "rule-based", or I believe they call it "Syntax Based". So, what's the best way to optimize the syntax of SQL statements to help an optimizer produce better results?
Some cost-based optimizers can be influenced by "hints" like FIRST_ROWS(). Others are tailored for OLAP. Is it possible to know more detailed logic about how Informix IDS and SE's optimizers decide what's the best route for processing a query, other than SET EXPLAIN? Is there any documentation which illustrates the ranking of SELECT statements as to what's the fastest way to access rows, assuming it's indexed?
I would imagine that "SELECT col FROM table WHERE ROWID = n" is the fastest (rank 1).
If I'm not mistaken, Informix SE's ROWID is a SERIAL(INT) which allows for a max. of 2GB nrows, or maybe it uses INT9 for TB's nrows? SE's optimizer is cost based when it has enough data but it does not use distributions like the IDS optimizer.
IDS' ROWID isn't an INT, it is the logical address of the row's page left shifted 8 bits plus the slot number on the page that contains the row's data.
IDS' optimizer is a cost based optimizer that uses data about the index depth and width, number of rows, number of pages, and the data distributions created by update statistics MEDIUM and HIGH to decide which query path is the least expensive, but there's no ranking of statements?
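The ROWID layout described above (page address shifted left 8 bits, plus the slot number) can be sketched like this. It is only an illustration of that description, not code taken from Informix, and the function names are made up.

```python
def make_rowid(page: int, slot: int) -> int:
    """Combine a logical page address and a slot number as described above."""
    if not 0 <= slot <= 0xFF:
        raise ValueError("slot must fit in 8 bits")
    return (page << 8) | slot


def split_rowid(rowid: int) -> tuple[int, int]:
    """Recover (page, slot) from a ROWID built by make_rowid."""
    return rowid >> 8, rowid & 0xFF


rowid = make_rowid(0x101, 3)
print(hex(rowid), split_rowid(rowid))
```

Because the value encodes a physical position, it changes whenever the row moves to another page or slot.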
I think Oracle uses HEX values for ROWID. Too bad ROWID can't be used often, since a row's ROWID can change. So maybe ROWID can be used by the optimizer as a counter to report a query's progress, an idea I mentioned in my "Begin viewing query results before query completes" question? I feel it wouldn't be that difficult to report a query's progress while it is being processed, perhaps at the expense of some slight overhead, but it would be nice to know ahead of time: a "Google-like" estimate of how many rows meet a query's criteria, display its progress every 100, 200, 500 or 1,000 rows, give users the ability to cancel it at any time and start displaying the qualifying rows as they are being put into the current list, while it continues searching. This is just one example; perhaps we could think of other neat/useful features. The ingredients are more or less there.
Perhaps we could fine-tune each query with more granularity than currently available? OLTP queries tend to be mostly static and pre-defined. The "what-if's" are more OLAP, so let's try to add more control and intelligence to it? So, therefore, being able to more precisely control, not just "hint/influence" the optimizer is what's needed. We can then have more dynamic SELECT statements for specific situations! Maybe even tell IDS to read blocks of index nodes at-a-time instead of one-by-one, etc. etc.
I'm not really sure what you are after, but here is some info on the SQL Server query optimizer which I've recently read:
13 Things You Should Know About Statistics and the Query Optimizer
SQL Server Query Execution Plan Analysis
and one for Informix that I just found using google:
Part 1: Tuning Informix SQL
For Oracle, your best resource would be Cost-Based Oracle Fundamentals. It's about 500 pages (and billed as Volume 1, but there haven't been any follow-ups yet).
For a (very) simple full-table scan, progress can sometimes be monitored through v$session_longops. Oracle knows how many blocks it has to scan, how many blocks it has scanned, how many it has to go, and reports on progress.
Indexes are a different matter. If I search for records for a client 'Frank', and use the index, the database will make a guess at how many 'Frank' entries are in the table, but that guess can be massively off. It may be that you have 1000 'Frankenstein' and just 1 'Frank' or vice versa.
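A toy sketch of why such a guess can be far off. It assumes the crudest possible estimate, rows divided by distinct values; real optimizers keep richer statistics such as histograms, so this is only an illustration of the problem, not how Oracle actually computes it.

```python
from collections import Counter


def uniform_estimate(values: list[str]) -> float:
    """Expected rows per key if every distinct value were equally common."""
    return len(values) / len(set(values))


# Skewed data, as in the example above: 1000 'Frankenstein' and just 1 'Frank'.
names = ["Frankenstein"] * 1000 + ["Frank"]
actual = Counter(names)

print(uniform_estimate(names), actual["Frank"], actual["Frankenstein"])
```

The same estimate is used for both names, so it is wrong by a factor of about 500 for 'Frank' and by a factor of about 2 for 'Frankenstein'.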
It gets even more complicated as you add in other filter and access predicates (e.g. where multiple indexes can be chosen), and makes another leap as you include table joins. And that's without getting into the complex stuff about remote databases and domain indexes like Oracle Text and Locator.
In short, it is very complicated. It is stuff that can be useful to know if you are responsible for tuning a large application. Even for basic development you need to have some grounding in how the database can physically retrieve the data you are interested in.
But I'd say you are going the wrong way here. The point of an RDBMS is to abstract the details so that, for the most part, they just happen. Oracle employs smart people to write query transformation stuff into the optimizer so us developers can move away from 'syntax fiddling' to get the best plans (not totally, but it is getting better).
I'm trying to work out the best way to scale my site, and I have a question about how MSSQL will scale.
The way the table currently is:
cache_id - int - identifier
cache_name - nvarchar(256) - Used for lookup along with event_id
cache_event_id - int - Basically a way of grouping
cache_creation_date - datetime
cache_data - varbinary(MAX) - Data size will be from 2k to 5k
The data stored is a byte array that's basically a cached (compressed) instance of a page on my site.
The different ways I see of storing it are:
1) 1 large table; it would contain tens of millions of records and easily grow to several gigabytes in size.
2) Multiple tables to contain the data above, meaning each table would hold 200k to a million records.
The data in this table will be used to show web pages, so anything over 200ms to get a record is bad in my eyes (I know some people think a 1-2 second page load is OK, but I think that's slow and want to do my best to keep it lower).
So it boils down to, what is it that slows down the SQL server?
Is it the size of the table ( disk space )
Is it the number of rows
At what point does it stop becoming cost effective to use multiple database servers?
If it's close to impossible to predict these things, I'll accept that as a reply too. I'm not a DBA, and I'm basically trying to design my DB so I don't have to redesign it later when it contains a huge amount of data.
So it boils down to, what is it that slows down the SQL server?
Is it the size of the table ( disk space )
Is it the number of rows
At what point does it stop becoming cost effective to use multiple database servers?
This is all a 'rule of thumb' view;
Load (and therefore, to a considerable extent, performance) of a DB is largely a factor of 2 issues: data volumes and transaction load, with IMHO the second generally being more relevant.
With regard to data volume, one can hold many gigabytes of data and get acceptable access times by way of Normalising, Indexing, Partitioning, Fast IO systems, appropriate buffer cache sizes, etc. Many of these, e.g. Normalisation, are issues that one considers at DB design time; others, e.g. more or fewer indexes and buffer cache size, are considered during system tuning.
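As a small illustration of the indexing point, here is a sketch that uses SQLite from the Python standard library as a stand-in for SQL Server. The table mirrors the columns from the question, while the index name and the sample data are made up. The point is only that a lookup on (cache_event_id, cache_name) becomes an index search instead of a scan over the whole table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE cache (
        cache_id            INTEGER PRIMARY KEY,
        cache_name          TEXT    NOT NULL,
        cache_event_id      INTEGER NOT NULL,
        cache_creation_date TEXT    NOT NULL,
        cache_data          BLOB    NOT NULL
    )
    """
)
# Hypothetical index matching the lookup pattern (event id plus name).
conn.execute("CREATE INDEX ix_cache_lookup ON cache (cache_event_id, cache_name)")

rows = [(f"page-{i}", i % 100, "2009-01-01", b"x" * 16) for i in range(10_000)]
conn.executemany(
    "INSERT INTO cache (cache_name, cache_event_id, cache_creation_date, cache_data) "
    "VALUES (?, ?, ?, ?)",
    rows,
)

lookup = "SELECT cache_data FROM cache WHERE cache_event_id = ? AND cache_name = ?"
plan = conn.execute("EXPLAIN QUERY PLAN " + lookup, (42, "page-42")).fetchall()
plan_text = " ".join(row[-1] for row in plan)  # last column holds the plan detail
result = conn.execute(lookup, (42, "page-42")).fetchone()

print(plan_text)
```

With an index like this, the cost of a single lookup grows with the depth of the index rather than with the number of rows, which is why tens of millions of rows in one table are not a problem by themselves.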
The transactional load is largely a factor of code design and total number of users. Code design includes factors like getting transaction size right (small and fast is the general goal, but like most things it is possible to take it too far and have transactions that are too small to retain integrity, or so small that the sheer number of them adds load).
When scaling I advise first scale up (bigger, faster server) then out (multiple servers). The admin issues of a multiple server instance are significant and I suggest only worth considering for a site with OS, Network and DBA skills and processes to match.
Normalize and index.
How, we can't tell you, because you haven't told us what your table is trying to model or how you're trying to use it.
1 million rows is not at all uncommon. Again, we can't tell you much in the absence of context only you can, but don't, provide.
The only possible answer is to set it up, and be prepared for a long iterative process of learning things only you will know because only you will live in your domain. Any technical advice you see here will be naive and insufficiently informed until you have some practical experience to share.
Test every single one of your guesses, compare the results, and see what works. And keep looking for more testable ideas. (And don't be afraid to back out changes that end up not helping. It's a basic requirement to have any hope of sustained simplicity.)
And embrace the fact that your database design will evolve. It's not as fearsome as your comment suggests you think it is. It's much easier to change a database than the software that goes around it.
I've really been struggling to make SQL Server into something that, quite frankly, it will never be. I need a database engine for my analytical work. The DB needs to be fast and does NOT need all the logging and other overhead found in typical databases (SQL Server, Oracle, DB2, etc.)
Yesterday I listened to Michael Stonebraker speak at the Money:Tech conference and I kept thinking, "I'm not really crazy. There IS a better way!" He talks about using column stores instead of row oriented databases. I went to the Wikipedia page for column stores and I see a few open source projects (which I like) and a few commercial/open source projects (which I don't fully understand).
My question is this: In an applied analytical environment, how do the different column based DB's differ? How should I be thinking about them? Anyone have practical experience with multiple column based systems? Can I leverage my SQL experience with these DBs or am I going to have to learn a new language?
I am ultimately going to be pulling data into R for analysis.
EDIT: I was asked to clarify what exactly I am trying to do. So, here's an example of what I would like to do:
Create a table that has 4 million rows and 20 columns (5 dims, 15 facts). Create 5 aggregation tables that calculate max, min, and average for each of the facts. Join those 5 aggregations back to the starting table. Now calculate the percent deviation from mean, percent deviation of min, and percent deviation from max for each row and add it to the original table. This table data does not get new rows each day, it gets TOTALLY replaced and the process is repeated. Heaven forbid if the process must be stopped. And the logs... ohhhhh the logs! :)
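To make the steps concrete, here is a tiny sketch of that process for one dimension and one fact, in plain Python. The column names and the four sample rows are made up; the real job would repeat this for 5 dimensions and 15 facts over 4 million rows.

```python
from collections import defaultdict
from statistics import mean


def add_deviations(rows: list[dict], dim: str, fact: str) -> list[dict]:
    """Aggregate fact by dim, join back, and add percent deviation columns."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[dim]].append(row[fact])
    # The "aggregation table": min, max and average per dimension value.
    stats = {k: (min(v), max(v), mean(v)) for k, v in groups.items()}

    out = []
    for row in rows:
        lo, hi, avg = stats[row[dim]]  # the join back to the starting table
        value = row[fact]
        # Assumes min, max and mean are non-zero; a real job needs a guard.
        out.append({
            **row,
            f"{fact}_pct_dev_mean": (value - avg) / avg * 100,
            f"{fact}_pct_dev_min": (value - lo) / lo * 100,
            f"{fact}_pct_dev_max": (value - hi) / hi * 100,
        })
    return out


sample = [
    {"region": "north", "sales": 10.0},
    {"region": "north", "sales": 30.0},
    {"region": "south", "sales": 5.0},
    {"region": "south", "sales": 15.0},
]
result = add_deviations(sample, "region", "sales")
for row in result:
    print(row)
```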
The short answer is that for analytic data, a column store will tend to be faster, with less tuning required.
A row store, the traditional database architecture, is good at inserting small numbers of rows, updating rows in place, and querying small numbers of rows. In a row store, these operations can be done with one or two disk block I/Os.
Analytic databases typically load thousands of records at a time; sometimes, as in your case, they reload everything. They tend to be denormalized, so have a lot of columns. And at query time, they often read a high proportion of the rows in the table, but only a few of these columns. So, it makes sense from an I/O standpoint to store values of the same column together.
Turns out that this gives the database a huge opportunity to do value compression. For instance, if a string column has an average length of 20 bytes but has only 25 distinct values, the database can compress to about 5 bits per value. Column store databases can often operate without decompressing the data.
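The 5-bit figure comes from dictionary encoding: each distinct string is stored once, and the column itself holds only small integer codes. A minimal sketch of the idea follows (the function names are made up, and real column stores layer further tricks on top):

```python
import math


def dictionary_encode(values: list[str]) -> tuple[list[str], list[int]]:
    """Replace each value by its index in a sorted dictionary of distinct values."""
    dictionary = sorted(set(values))
    code_of = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [code_of[value] for value in values]


def bits_per_value(distinct_count: int) -> int:
    """Bits needed to tell distinct_count different codes apart."""
    return max(1, math.ceil(math.log2(distinct_count)))


column = ["red", "green", "red", "blue", "green", "red"]
dictionary, codes = dictionary_encode(column)
decoded = [dictionary[code] for code in codes]

print(dictionary, codes, bits_per_value(25))
```

For a 20-byte average string (160 bits), 5 bits per value is a 32-fold reduction before any further compression, and equality filters can be evaluated on the codes without decoding the strings.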
Often in computer science there is an I/O versus CPU time tradeoff, but in column stores the I/O improvements often improve locality of reference, reduce cache paging activity, and allow greater compression factors, so that CPU time improves as well.
Column store databases also tend to have other analytic-oriented features like bitmap indexes (yet another case where better organization allows better compression, reduces I/O, and allows algorithms that are more CPU-efficient), partitions, and materialized views.
The other factor is whether to use a massively parallel (MPP) database. There are MPP row-store and column-store databases. MPP databases can scale up to hundreds or thousands of nodes, and allow you to store humongous amounts of data, but sometimes have compromises like a weaker notion of transactions or a not-quite-SQL query language.
I'd recommend that you give LucidDB a try. (Disclaimer: I'm a committer to LucidDB.) It is an open-source column store database, optimized for analytic applications, and also has other features such as bitmap indexes. It currently only runs on one node, but utilizes several cores effectively and can handle reasonable volumes of data with not much effort.
4 million rows times 20 columns times 8 bytes for a double is 640 MB. Following the rule of thumb that R creates three temporary copies for every object, we get to around 2 GB. That is not a lot by today's standards.
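The arithmetic, spelled out as a small sketch (the factor of three is the rule of thumb mentioned above, not a measured value):

```python
ROWS = 4_000_000
COLUMNS = 20
BYTES_PER_DOUBLE = 8
TEMP_COPIES = 3  # rule of thumb for temporary copies, not a measurement


def table_bytes(rows: int, columns: int) -> int:
    """Raw size of a table that stores every cell as a double."""
    return rows * columns * BYTES_PER_DOUBLE


raw = table_bytes(ROWS, COLUMNS)
working = raw * TEMP_COPIES

print(raw / 1e6, "MB raw;", working / 1e9, "GB with temporary copies")
```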
So this should be doable in memory on a suitable 64-bit machine with a 'decent' amount of ram (say 8 gb or more). Installing Ubuntu or Debian (possibly in the server version) can be done in a few minutes.
I have some experience with Infobright Community edition, a column-oriented DB based on mysql.
Pro:
you can use mysql interfaces/odbc mysql drivers, from R too
fast enough queries on big chunks of data selection (because of KnowledgeGrid & data packs)
very fast native data loader and connectors for ETL (talend, kettle)
optimized for exactly the operations that I (and I think most of us) use (selection by factor levels, joining, etc.)
special "lookup" option for optimized storing R factor variables ;) (ok, char/varchar variables with relatively small levels number/rows number)
FOSS
paid support option
?
Cons:
no insert/update operations in Community edition (yet?), data loading only via native data loader/ETL connectors
no utf-8 official support (collation/sort etc), planned for q3 2009
no functions in aggregate queries yet (e.g. select month(date) from ...), planned for July(?) 2009; but because of the column storage, I prefer to simply create date columns for every aggregation level (week number, month, ...) I need
cannot be installed on an existing mysql server as a storage engine (because of its own optimizer, if I understood correctly), but you may install Infobright & mysql on different ports if you need to
?
Summary:
Good FOSS solution for daily analytical tasks, and, I think, your tasks as well.
Here is my 2 cents: SQL Server does not scale well. We attempted to use SQL Server to store financial data in real time (i.e. price ticks coming in for 100 symbols). It worked perfectly for the first 2 weeks, then it went slower and slower as the database size increased, and finally ground to a halt, too slow to insert each price as it was received. We tried to work around it by moving data from the active database to offline storage every night, but ultimately the project was abandoned as it just didn't work.
Bottom line: if you're planning on storing a lot of data ( >1GB) you need something that scales properly, and that probably means a column database.
It looks like an implementation change (2-D array in column-major order, instead of row-major order), rather than an interface change.
Think "strategy" pattern, rather than being an entire paradigm shift. Of course, I've never used these products, so they may in fact force a paradigm shift down your throat. I don't know why, though.
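A toy sketch of that view: two storage layouts behind the same interface, so calling code does not change when the layout does. The class and method names are made up, and real column stores add compression and much more on top of this.

```python
class RowStore:
    """Flat array in row-major order: the cells of one row sit together."""

    def __init__(self, rows: list[list[float]], n_cols: int):
        self.n_cols = n_cols
        self.data = [value for row in rows for value in row]

    def column(self, j: int) -> list[float]:
        # Values of one column are spread out, one every n_cols cells.
        return self.data[j::self.n_cols]


class ColumnStore:
    """Flat array in column-major order: the cells of one column sit together."""

    def __init__(self, rows: list[list[float]], n_cols: int):
        self.n_rows = len(rows)
        self.data = [row[j] for j in range(n_cols) for row in rows]

    def column(self, j: int) -> list[float]:
        # Values of one column form a single contiguous slice.
        start = j * self.n_rows
        return self.data[start:start + self.n_rows]


table = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
for store in (RowStore(table, 3), ColumnStore(table, 3)):
    print(type(store).__name__, store.data, sum(store.column(1)))
```

Both stores give the same answer. The difference is that the column-major layout reads one contiguous run for a single-column aggregate, which is where the I/O savings for analytic queries come from.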
We might be better able to help you reach an informed decision if you described [1] your specific goal and [2] the issues you're running into with SQL Server.