How would you optimize this SQL Server 2008 R2 table?

I have a private messaging system for my browser game. When I check which queries use the most CPU time, this table is the biggest consumer. I am not good with indexes or query-time optimization, so I would like to get your optimization tips for this table.
First, here is the table structure:
[table structure image]
The following query counts how many unread messages a user has, and it is the most CPU-hungry one since it runs on every page load:
SELECT COUNT([Id]) [Number]
FROM [MyTable]
WHERE [ReceiverUserId] = #1
AND [ReceiverReaded] = #2
AND [ReceiverDeleted] = #3
So what kind of indexes etc might improve my performance?

Why allow NULLs on those columns at all? Either a message is read or it is not, so just default them to 0. Then index on ReceiverReaded/ReceiverDeleted/ReceiverUserId (in that order they will be "partitioned" if you need a lot of "all read" access); alternatively, if most reads are just for a single user, put an index on ReceiverUserId.
What you want is for your index to be covering. In your case, you could put an index on ReceiverUserId and INCLUDE the columns ReceiverReaded and ReceiverDeleted, and it would be covering (for that query). In the execution plan you should then see just an index seek, since you are filtering on a single user.
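A minimal sketch of such a covering index, using the table and column names from the query above (the index name is made up; since [Id] is presumably the clustered primary key it is implicitly present in the nonclustered index, so COUNT([Id]) is covered too):

-- Seek on ReceiverUserId; the two flags are included so the count
-- can be answered entirely from this index, with no key lookups.
CREATE NONCLUSTERED INDEX IX_MyTable_ReceiverUserId
    ON [MyTable] ([ReceiverUserId])
    INCLUDE ([ReceiverReaded], [ReceiverDeleted]);

Since this is SQL Server 2008 R2, a filtered index (adding WHERE [ReceiverReaded] = 0 AND [ReceiverDeleted] = 0) could shrink the index further if the unread count really is the dominant query.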
You could capture the workload and run it through the Database Engine Tuning Advisor in SQL Server; it would probably make pretty good suggestions. You need to interpret what it's telling you, of course.

You always want indexes on the fields you are searching for, so you would probably improve the query performance by adding indexes on [ReceiverUserId], [ReceiverReaded] and [ReceiverDeleted].
Of course the more columns you index, the slower your UPDATES and INSERTS will be.

A fairly simple rule of thumb in db optimization is to index any column that appears as part of a predicate in a WHERE clause or a JOIN. From your example these would include:
ReceiverUserId
ReceiverReaded
ReceiverDeleted
There are also a number of optimizer tools available that will "observe" your db and tell you what columns to index for best performance.

Different approach that may be viable for your application: don't query the messages table at all when the user isn't explicitly requesting any content, e.g. when he's not in the "messages" section of your game.
Instead, try extending your user table with integer-valued columns indicating how many messages there are and how many are already read. Every time you modify the messages table, you also update the corresponding value in the user table.
This way you won't need to look through the whole table on every refresh. Note, though, that the downside of this method is some extra synchronization work on the programmer's part. If you've encapsulated the modification of the messages table (add message, read message, delete message) properly, this shouldn't be a problem.
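A rough sketch of what that synchronization could look like, assuming a hypothetical [UnreadMessageCount] column on a [Users] table (both of those names are made up here):

BEGIN TRANSACTION;

-- Mark one message as read...
UPDATE [MyTable]
SET [ReceiverReaded] = 1
WHERE [Id] = @messageId
  AND [ReceiverUserId] = @receiverId
  AND [ReceiverReaded] = 0;

-- ...and keep the per-user counter in sync in the same transaction.
IF @@ROWCOUNT = 1
    UPDATE [Users]
    SET [UnreadMessageCount] = [UnreadMessageCount] - 1
    WHERE [Id] = @receiverId;

COMMIT;

The unread badge then becomes a single-row read of [Users] instead of a COUNT over the messages table.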

Index your search fields (per @PaulStock's answer).
Change your tinyint fields to bit fields (default value = 0)
Does your message body really need to be nvarchar(4000)? That's huge! Consider much shorter messages (such as nvarchar(300) or smaller; for reference, Twitter is just 140 characters).

Related

SQL Server - what kind of index should I create?

I need to make queries such as
SELECT
Url, COUNT(*) AS requests, AVG(TS) AS avg_timeSpent
FROM
myTable
WHERE
Url LIKE '%/myController/%'
GROUP BY
Url
run as fast as possible.
The selected and grouped columns are almost always the same; the only difference is sometimes an extra column in the SELECT and GROUP BY (the tenantId column).
What kind of index should I create to help me run this scenario?
Edit 1:
If I change my base query to '/myController/%' (note there's no % at the beginning), would it be better?
This is a query that cannot be sped up with an index. The DBMS cannot know beforehand how many records will match the condition. It may be 100% or 0.001%. There is no clue for the DBMS to guess this. And access via an index only makes sense when a small percentage of rows gets selected.
Moreover, how can such an index be structured and useful? Think of a telephone book and you want to find all names that contain 'a' or 'rs' or 'ems' or whatever. How would you order the names in the book to find all these and all other thinkable letter combinations quickly? It simply cannot be done.
So the DBMS will read the whole table record for record, no matter whether you provide an index or not.
There may be one exception: With an index on URL and TS, you'd have both columns in the index. So the DBMS might decide to read the whole index rather than the whole table then. This may make sense for instance when the table has hundreds of columns or when the table is very fragmented or whatever. I don't know. A table is usually much easier to read sequentially than an index. You can still just try, of course. It doesn't really hurt to create an index. Either the DBMS uses it or not for a query.
Columnstore indexes can be quite fast at such tasks (aggregates over global scans). But even they will have trouble handling a LIKE '%/myController/%' predicate. I recommend you parse the URL once into an additional computed column that extracts the controller part of the URL. But the truth is that looking at the global time spent per URL reveals very little information: it will contain data since the beginning of time, long since made obsolete by newer deployments, and it will not capture recent trends. A filter based on time, say per hour or per day, now that is a very useful analysis. And such a filter can be excellently served by a columnstore, because of natural time order and segment elimination.
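A sketch of the computed-column idea, assuming the controller is always the first path segment and every Url starts with '/' (the Controller column and index names are made up; the table and existing columns come from the question):

-- Persisted computed column holding the first path segment of the URL.
ALTER TABLE myTable
    ADD Controller AS SUBSTRING(Url, 2, CHARINDEX('/', Url + '/', 2) - 2) PERSISTED;

CREATE INDEX IX_myTable_Controller
    ON myTable (Controller)
    INCLUDE (Url, TS);

-- The leading-wildcard scan can then become a seekable equality filter:
SELECT Url, COUNT(*) AS requests, AVG(TS) AS avg_timeSpent
FROM myTable
WHERE Controller = 'myController'
GROUP BY Url;

If the controller can appear deeper in the path, the expression would need to be adapted accordingly.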
Based on your posted query, you should have an index on the Url column. In general, columns involved in WHERE, HAVING, ORDER BY and JOIN ON conditions should be indexed.
You should get the generated query plan for the query and see where it's taking the most time. Also, depending on the datatype of the Url column, you may consider having a FULLTEXT index on that column.

The order of a SQL Select statement without Order By clause

As far as I know from relational database theory, a select statement without an order by clause should be considered to have no particular order. But actually in SQL Server and Oracle (I've tested on those two platforms), if I query a table without an order by clause multiple times, I always get the results in the same order. Can this behavior be relied on? Can anyone help explain a little?
No, that behavior cannot be relied on. The order is determined by the way the query planner has decided to build up the result set. Simple queries like select * from foo_table are likely to be returned in the order they are stored on disk, which may be primary key order, the order they were created, or some other random order. More complex queries, such as select * from foo where bar < 10, may instead be returned in order of a different column, based on an index read, or in table order, for a table scan. Even more elaborate queries, with multiple where conditions, group by clauses, or unions, will be in whatever order the planner decides is most efficient to generate.
The order could even change between two identical queries just because of data that has changed between those queries. A "where" clause may be satisfied with an index scan in one query, but later inserts could make that condition less selective, and the planner could decide to perform a subsequent query using a table scan.
To put a finer point on it: an RDBMS has the mandate to give you exactly what you asked for, as efficiently as possible. That efficiency can take many forms, including minimizing I/O (both to disk and over the network to send data to you), minimizing CPU, and keeping its working set small (using methods that require minimal temporary storage).
Without an ORDER BY clause, you have not asked for any particular order, so the RDBMS will give you the rows in some order that (maybe) corresponds with some coincidental aspect of the query, based on whichever algorithm the RDBMS expects to produce the data the fastest.
If you care about efficiency, but not order, skip the ORDER BY clause. If you care about the order but not efficiency, use the ORDER BY clause.
Since you actually care about BOTH, use ORDER BY, and then carefully tune your query and database so that it is efficient.
No, you can't rely on getting the results back in the same order every time. I discovered that when working on a web page with a paged grid. When I went to the next page, and then back to the previous page, the previous page contained different records! I was totally mystified.
For predictable results, then, you should include an ORDER BY. Even then, if there are identical values in the specified columns, you can get different results; you may have to ORDER BY fields you didn't really think you needed, just to get a predictable result.
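A small illustrative example (the table and columns are hypothetical): when paging a grid ordered by a non-unique column, adding a unique column as a tie-breaker makes the page boundaries deterministic.

-- CreatedDate alone may contain duplicates, so Id is added as a tie-breaker.
SELECT Id, Title, CreatedDate
FROM dbo.Messages
ORDER BY CreatedDate DESC, Id DESC;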
Tom Kyte has a pet peeve about this topic. For whatever reason, people are fascinated by this, and keep trying to come up with cases where you can rely upon a specific order without specifying ORDER BY. As others have stated, you can't. Here's another amusing thread on the topic on the AskTom website.
The Right Answer
This is a new answer added to correct the old one. I've got answer from Tom Kyte and I post it here:
If you want rows sorted YOU HAVE TO USE AN ORDER. No if, and, or buts about it. period. http://tkyte.blogspot.ru/2005/08/order-in-court.html You need order by on that IOT. Rows are sorted in leaf blocks, but leaf blocks are not stored sorted. fast full scan=unsorted rows.
https://twitter.com/oracleasktom/status/625318150590980097
https://twitter.com/oracleasktom/status/625316875338149888
The Wrong Answer
(Attention! The original answer to the question is kept below only for the sake of history. It is the wrong answer; the right answer is above.)
As Tom Kyte wrote in the article mentioned before:
You should think of a heap organized table as a big unordered
collection of rows. These rows will come out in a seemingly random
order, and depending on other options being used (parallel query,
different optimizer modes and so on), they may come out in a different
order with the same query. Do not ever count on the order of rows from
a query unless you have an ORDER BY statement on your query!
But note that he only talks about heap-organized tables. There are also index-organized tables. In that case you can rely on the order of a select without ORDER BY, because the order is implicitly defined by the primary key. This is true for Oracle.
In SQL Server, clustered indexes (index-organized tables) are created by default. There is also the possibility for PostgreSQL to store information aligned to an index. More information can be found here.
UPDATE:
I see that my answer is being downvoted, so I will try to explain my point a little bit.
In the section Overview of Index-Organized Tables there is a phrase:
In an index-organized table, rows are stored in an index defined on the primary key for the table... Index-organized tables are useful when related pieces of data must be stored together or data must be physically stored in a specific order.
http://docs.oracle.com/cd/E25054_01/server.1111/e25789/indexiot.htm#CBBJEBIH
Because of the index, all data is stored in a specific order; I believe the same is true for Pg.
http://www.postgresql.org/docs/9.2/static/sql-cluster.html
If you don't agree with me, please give me a link to the documentation. I'll be happy to know that there is something for me to learn.

How to deal with billions of records in SQL Server?

I have a SQL Server 2008 database with 30,000,000,000 records in one of its major tables. Now we are looking at the performance of our queries. We have already created all of our indexes. I found that we can split our database tables into multiple partitions, so that the data will be spread over multiple files, which will increase query performance.
But unfortunately this functionality is only available in SQL Server Enterprise edition, which is unaffordable for us.
Is there any way to optimize query performance? For example, the query
select * from mymajortable where date between '2000/10/10' and '2010/10/10'
takes around 15 min to retrieve around 10000 records.
A SELECT * will obviously be less efficiently served than a query that uses a covering index.
First step: examine the query plan and look for any table scans and for the steps taking the most effort (%).
If you don’t already have an index on your ‘date’ column, you certainly need one (assuming sufficient selectivity). Try to reduce the columns in the select list, and if ‘sufficiently’ few, add these to the index as included columns (this can eliminate bookmark lookups into the clustered index and boost performance).
You could break your data up into separate tables (say by a date range) and combine via a view.
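A rough sketch of that approach, often called a partitioned view (all names here are illustrative): each per-range table carries a CHECK constraint on the date column, so the optimizer can skip tables that cannot match a date filter.

CREATE TABLE dbo.MajorTable_2009 (
    Id        BIGINT        NOT NULL,
    EventDate DATETIME      NOT NULL CHECK (EventDate >= '20090101' AND EventDate < '20100101'),
    Payload   NVARCHAR(500) NULL,
    CONSTRAINT PK_MajorTable_2009 PRIMARY KEY (Id, EventDate)
);

CREATE TABLE dbo.MajorTable_2010 (
    Id        BIGINT        NOT NULL,
    EventDate DATETIME      NOT NULL CHECK (EventDate >= '20100101' AND EventDate < '20110101'),
    Payload   NVARCHAR(500) NULL,
    CONSTRAINT PK_MajorTable_2010 PRIMARY KEY (Id, EventDate)
);
GO

-- The view stitches the yearly tables back together.
CREATE VIEW dbo.MajorTableAll
AS
SELECT Id, EventDate, Payload FROM dbo.MajorTable_2009
UNION ALL
SELECT Id, EventDate, Payload FROM dbo.MajorTable_2010;
GO

Queries against dbo.MajorTableAll that filter on EventDate then only touch the tables whose CHECK constraints overlap the requested range.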
It is also very dependent on your hardware (# cores, RAM, I/O subsystem speed, network bandwidth)
Suggest you post your table and index definitions.
First, always avoid SELECT *, as that causes the select to fetch all columns; if there is an index with just the columns you need, you are fetching a lot of unnecessary data. Using only the exact columns you need to retrieve lets the server make better use of indexes.
Secondly, have a look at included columns for your indexes; that way, often-requested data can be included in the index to avoid having to fetch the rows.
Third, you might try using an int column for the date and converting the date into an int. Ints can be more efficient in range searches than datetimes, especially if the column carries time information; if you can skip the time information, the index will be smaller.
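A conceptual sketch of that idea (the DateKey column and index name are made up; the table and [date] column come from the query in the question, and back-filling 30 billion rows would of course have to be done in batches):

ALTER TABLE mymajortable ADD DateKey INT NULL;

-- Store the date part as yyyymmdd.
UPDATE mymajortable
SET DateKey = CONVERT(INT, CONVERT(CHAR(8), [date], 112));

CREATE INDEX IX_mymajortable_DateKey ON mymajortable (DateKey);

-- Range search on the int column, selecting only the columns you need:
SELECT [date], DateKey
FROM mymajortable
WHERE DateKey BETWEEN 20001010 AND 20101010;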
One more thing to check is the execution plan the server uses; you can see this in Management Studio if you enable showing the execution plan in the menu. It can indicate where the problem lies; you can see which indexes it tries to use, and sometimes it will suggest new indexes to add.
It can also indicate other problems: a Table Scan or Index Scan is bad, as it means the whole table or index has to be scanned, while an Index Seek is good.
It is a good source to understand how the server works.
If you add an index on date, you will probably speed up your query thanks to an index seek plus key lookup instead of a clustered index scan; but if your filter on date returns too many records, the index will not help you at all, because the key lookup is executed for each result of the index seek. SQL Server will then switch to a clustered index scan.
To get the best performance you need to create a covering index, that is, include all the columns you need in the "included columns" part of your index; but that will not help you if you use SELECT *.
Another issue with the SELECT * approach is that you can't use the cache or the execution plans in an efficient way. If you really need all the columns, make sure you specify all the columns instead of the *.
You should also fully qualify the object name to make sure your plan is reusable.
You might consider creating an archive database and moving anything older than, say, 10-20 years into it. This should drastically speed up your primary production database while retaining all of your historical data for reporting needs.
What type of queries are we talking about?
Is this a production table? If yes, look into normalizing a bit more and see if you cannot go a bit further as far as normalizing the DB.
If this is for reports, including a lot of Ad Hoc report queries, this screams data warehouse.
I would create a DW with separate pre-processed reports that include all the calculations and aggregations you could expect.
I am a bit worried about a business model which involves dealing with BIG data but does not generate enough revenue or even attract enough venture investment to upgrade to enterprise.

What is SQL Server doing between the time my first record is returned and when my last record is returned?

Say I have a query that returns 10,000 records. When the first record has returned what can I assume about the state of my query?
Has it finished and is just returning records from the server to my instance of SSMS?
Is the query itself still being executed on the server?
What is it that causes the 10,000 records to be slowly returned for one query and nearly instantly for another?
There is potentially some mix of progressive processing on the server side, network transfer of the data, and rendering by the client.
If one query returns 10,000 rows quickly, and another one slowly -- and they are of similar row size, data types, etc., and are both destined for results to grid or results to text -- there is little we can do to analyze the differences unless you show us execution plans and/or client statistics for each one. These are options you can set in SSMS when running a query.
As an aside, switching between results to grid and results to text you might notice slightly different runtimes. This is because in one case Management Studio has to work harder to align the columns etc.
You cannot make a generic assumption; a query plan is composed of a number of different types of operations, or iterators. Some of these are navigational and work like a pipeline, whilst others are set-based operations, such as a sort.
If a query contains a set-based operation, it requires all the records before it can output the results (e.g. an order by clause within your statement). But if you have no set-based iterators, you can expect the rows to be streamed to you as they become available.
The answer to each of your individual questions is "it depends."
For example, consider if you include an order by clause, and there isn't an index for the column(s) you're ordering by. In this case, the server has to find all the records that satisfy your query, then sort them, before it can return the first record. This causes a long pause before you get your first record, but you (should normally) get them quite quickly once you start getting any.
Without the order by clause, the server will normally send each record as it's found, so the first record will often show up sooner, but you may see a long pause between one record and the next.
As far as simply "why is one query faster than another", a lot depends on what indexes are available and whether they can be used for a particular query. For example, something like some_column like '%something' will almost always be quite slow. The leading '%' means this won't be able to use an index, even if some_column has one. A search for something% instead of %something% might easily be 100 or 1000 times faster. If you really need the leading-wildcard search, you want to use full-text searching instead (create a full-text index, and use contains() instead of like).
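A quick sketch of that full-text alternative (the table, column, catalog and key-index names here are hypothetical; full-text search requires a unique index on the table to use as its key):

CREATE FULLTEXT CATALOG ftDefaultCatalog AS DEFAULT;

CREATE FULLTEXT INDEX ON dbo.myTable (some_column)
    KEY INDEX PK_myTable;

-- Word-based lookup that can use the full-text index instead of scanning:
SELECT *
FROM dbo.myTable
WHERE CONTAINS(some_column, 'something');

Note that full-text search matches on word boundaries, so it is not an exact substitute for an arbitrary %something% substring match.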
Of course, a lot can also depend simply on whether the database has an index for a particular column (or group of columns). With a suitable index, the query will usually be quite a lot faster.

How to design this database?

I have to design a database to store log data, but I have no prior experience with this. My table contains about 19 columns (about 500 bytes per row) and grows by up to 30,000 new rows a day. My app must be able to query this table efficiently.
I'm using SQL Server 2005.
How can I design this database?
EDIT: the data I want to store contains many types: datetime, string, short and int. NULL cells are about 25% of the total :)
However else you'll do lookups, a logging table will almost certainly have a timestamp column. You'll want to cluster on that timestamp first to keep inserts efficient. That may also mean always constraining your queries to specific date ranges, so that the selectivity on your clustered index is good.
You'll also want indexes for the fields you'll query on most often, but don't jump the gun here. You can add the indexes later. Profile first so you know which indexes you'll really need. On a table with a lot of inserts, unwanted indexes can hurt your performance.
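A minimal sketch of that layout (all names here are illustrative, not the poster's actual schema):

CREATE TABLE dbo.AppLog (
    LogId    BIGINT IDENTITY(1,1) NOT NULL,
    LoggedAt DATETIME             NOT NULL DEFAULT GETUTCDATE(),
    Source   VARCHAR(100)         NULL,
    Message  NVARCHAR(1000)       NULL
);

-- Cluster on the insert timestamp so new rows are appended at the end.
CREATE CLUSTERED INDEX CIX_AppLog_LoggedAt ON dbo.AppLog (LoggedAt);

-- Typical range-restricted query that can seek on the clustered index:
SELECT LogId, LoggedAt, Source, Message
FROM dbo.AppLog
WHERE LoggedAt >= '20100101' AND LoggedAt < '20100201';

Nonclustered indexes on the most frequently filtered columns can then be added later, once profiling shows which ones are actually needed.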
Well, given the description you've provided all you can really do is ensure that your data is normalized and that your 19 columns don't lead you to a "sparse" table (meaning that a great number of those columns are null).
If you'd like to add some more data (your existing schema and some sample data, perhaps) then I can offer more specific advice.
Throw an index on every column you'll be querying against.
Huge amounts of test data, and execution plans (with query analyzer) are your friend here.
In addition to the comment on sparse tables, you should index the table on the columns you wish to query.
Alternatively, you could test it using the profiler and see what the profiler suggests in terms of indexing based on actual usage.
Some optimisations you could make:
Cluster your data based on the most likely look-up criteria (e.g. clustered primary key on each row's creation date-time will make look-ups of this nature very fast).
Assuming that rows are written one at a time (not in batch) and that each row is inserted but never updated, you could code all select statements to use the "with (NOLOCK)" option. This will offer a massive performance improvement if you have many readers as you're completely bypassing the lock system. The risk of reading invalid data is greatly reduced given the structure of the table.
If you're able to post your table definition I may be able to offer more advice.
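Using the hypothetical AppLog table from the earlier sketch, the WITH (NOLOCK) read mentioned above would look like this (dirty reads are possible, which is usually acceptable for append-only log data):

SELECT LogId, LoggedAt, Source, Message
FROM dbo.AppLog WITH (NOLOCK)
WHERE LoggedAt >= '20100101' AND LoggedAt < '20100201';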
