CREATE TABLE dbo.test2 (
    id     INTEGER,
    name   VARCHAR(10),
    family VARCHAR(10),
    amount INTEGER);
GO

CREATE VIEW dbo.test2_v WITH SCHEMABINDING
AS
SELECT id, SUM(amount) AS amount
-- , COUNT_BIG(*) AS tmp
FROM dbo.test2
GROUP BY id;
GO

CREATE UNIQUE CLUSTERED INDEX vIdx ON dbo.test2_v (id);
I get this error with this code:
Cannot create index on view
'test.dbo.test2_v' because its select
list does not include a proper use of
COUNT_BIG. Consider adding
COUNT_BIG(*) to select list.
I can create the view like this:
CREATE VIEW dbo.test2_v WITH SCHEMABINDING
AS
SELECT id, SUM(amount) AS amount, COUNT_BIG(*) AS tmp
FROM dbo.test2
GROUP BY id;
But I'm just wondering: what is the purpose of this column in this case?
You need COUNT_BIG(*) in this case because you are using GROUP BY.
This is one of the many limitations of indexed views, and because of these restrictions indexed views can't be used in many places, or aren't as effective as they could be. Unfortunately, that is how it currently works, and it narrows the scope of usage.
http://technet.microsoft.com/en-us/library/cc917715.aspx
Looks like it's simply a hardcoded performance-related restriction that the SQL Server team had to put in place when they first designed aggregate indexed views in SQL Server 2000.
Until relatively recently you could see this in the SQL Server 2000 TechNet documentation at http://msdn.microsoft.com/en-us/library/aa902643(SQL.80).aspx, but the SQL Server 2000 documentation has since been retired. You can still download a 92 MB PDF and find the relevant notes on pages 1146 and 2190: https://www.microsoft.com/en-us/download/details.aspx?id=51958
An explanation for this restriction can be found on the SQLAuthority site - actually an excerpt from Itzik Ben-Gan's "Inside SQL" book: http://blog.sqlauthority.com/2010/09/21/sql-server-count-not-allowed-but-count_big-allowed-limitation-of-the-view-5/
It's worth noting that Oracle has the same restriction/requirement, for the same reasons (for an equivalent fast refreshable materialized view); see http://rwijk.blogspot.com.es/2009/06/fast-refreshable-materialized-view.html for a discussion on this topic.
Summary of the explanation:
Why does SQL Server logically need a materialized count column in the aggregate indexed view?
So that it can quickly know whether a particular row in the aggregate view needs to change or be removed when a given row of an underlying table is updated or deleted.
Why does this count column need to be COUNT_BIG(*)?
So that there is no possible risk of overflow: COUNT returns an int while COUNT_BIG returns a bigint, so there is no risk of an indexed view "breaking" when a particular group reaches an overly high row count.
It's relatively easy to visualize why a count is critical to efficient aggregate view maintenance - imagine the following situation:
The table structures are as specified in the question
There are 4 rows in the underlying table:
ID | name | family | amount
--- | ---- | ------ | ------
1 | a | | 10
2 | b | | 11
2 | c | | 12
3 | d | | 13
The aggregate view is materialized to something like this:
ID | amount | tmp
--- | ------ | ---
1 | 10 | 1
2 | 23 | 2
3 | 13 | 1
Simple Case:
The SQL Engine detects a change in the underlying data - the third row in the source data (id 2, name c) is deleted.
The engine needs to:
find and update the relevant row of the aggregate materialized view
reduce the "amount" sum by the amount of the deleted underlying row
reduce the "count" by 1 (if this column exists)
Target/difficult case:
The SQL Engine detects another change in the underlying data - the second row in the source data (id 2, name b) is deleted.
The engine needs to:
find and delete the relevant row of the aggregate materialized view, as there are no more source rows with the same grouping key
Consider that the engine always has the "before" row of the underlying table(s) at view-update time - it knows exactly what changed in both cases.
The notable step in the materialized-view-maintenance algorithm is determining whether the target materialized aggregate row needs to be deleted or not:
if you have a "count", you don't need to look anywhere beyond the target row - if you're dropping the count to 0, then delete the row. If you're updating to any other value, then leave the row.
if you don't have a count, then the only way to figure it out would be to query the underlying table to check for any other rows with the same aggregation key; such a process would clearly introduce much more onerous restrictions:
it would be implicitly slower, and
in join-aggregation cases would be un-optimizable!
For these reasons, the existence of a count(*) column is a fundamental requirement of the aggregate materialized view implementation. Without a count(*) column, the real-time maintenance of an aggregate materialized view in the face of underlying data changes would carry an unacceptably high performance penalty!
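To make this concrete, here is a small T-SQL sketch that replays the two cases above. It assumes the corrected view (the one with COUNT_BIG(*) AS tmp) and the unique clustered index from the question have been created; the INSERT just loads the four illustrative rows. Querying the view with NOEXPAND shows the group for id 2 first shrinking and then disappearing once its count would reach zero:
INSERT INTO dbo.test2 (id, name, family, amount)
VALUES (1, 'a', NULL, 10),
       (2, 'b', NULL, 11),
       (2, 'c', NULL, 12),
       (3, 'd', NULL, 13);

-- simple case: one of the two id = 2 rows goes away; the engine only has to
-- subtract 12 from amount and drop tmp from 2 to 1 in the materialized row
DELETE FROM dbo.test2 WHERE id = 2 AND name = 'c';
SELECT id, amount, tmp FROM dbo.test2_v WITH (NOEXPAND);  -- id 2 now shows amount 11, tmp 1

-- difficult case: the last id = 2 row goes away; tmp would drop to 0,
-- so the whole aggregate row is deleted from the materialized view
DELETE FROM dbo.test2 WHERE id = 2 AND name = 'b';
SELECT id, amount, tmp FROM dbo.test2_v WITH (NOEXPAND);  -- no row for id 2 any more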
You could still ask "why doesn't SQL Server create and maintain such a count column for me automatically when I create an aggregate materialized view?" - I don't have a particularly good answer for this. In the end, I imagine there would be even more questions and confusion along the lines of "Why does my aggregate materialized view have a COUNT_BIG column I didn't add?" if it did that, so it's simpler to make the column a basic requirement when creating the object - but that's a purely subjective opinion.
I know this thread is a bit old, but for those who still have this question, http://technet.microsoft.com/en-us/library/ms191432%28v=sql.105%29.aspx says this about indexed views:
The SELECT statement in the view cannot contain the following Transact-SQL syntax elements:
The AVG, MAX, MIN, STDEV, STDEVP, VAR, or VARP aggregate functions. If AVG(expression) is specified in queries referencing the indexed view, the optimizer can frequently calculate the needed result if the view select list contains SUM(expression) and COUNT_BIG(expression). For example, an indexed view SELECT list cannot contain the expression AVG(column1). If the view SELECT list contains the expressions SUM(column1) and COUNT_BIG(column1), SQL Server can calculate the average for a query that references the view and specifies AVG(column1).
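In other words, once SUM and COUNT_BIG are materialized, an average is just a division away. As a rough sketch against the corrected view from the question (amount = SUM(amount), tmp = COUNT_BIG(*)):
-- computing the average by hand from the materialized columns
-- (exactly equal to AVG(amount) only if amount contains no NULLs,
--  because tmp counts all rows rather than non-NULL amounts)
SELECT id,
       amount / NULLIF(tmp, 0) AS avg_amount
FROM dbo.test2_v WITH (NOEXPAND);

-- the equivalent query against the base table; the optimizer can satisfy it
-- from the indexed view using the same SUM / COUNT_BIG arithmetic
-- (depending on edition, a WITH (NOEXPAND) hint on the view may be needed
--  for the index to be used)
SELECT id, AVG(amount) AS avg_amount
FROM dbo.test2
GROUP BY id;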
Related
Suppose I have Tables:
dbo.Purchases
Id | Value | UserId
1 | 10.00 | 3
2 | 1.00 | 1
3 | 15.50 | 2
4 | 13.40 | 1
dbo.Users
Id (UQ) | Name
1 | Bob
2 | Sarah
3 | Alex
And a VIEW:
dbo.PurchasesWithUsers
SELECT *
FROM dbo.Purchases
LEFT JOIN dbo.Users ON Users.Id = UserId
And I'm going to run SELECT SUM(Value) FROM dbo.PurchasesWithUsers.
Now ... as a human, I can see that that JOIN doesn't affect that query:
It's obviously not directly used in the SUM.
It's a LEFT JOIN so it can't exclude Purchase rows.
It's joining to a column with a UQ constraint so it can't duplicate Purchase rows.
But when I run the query and look at the execution plan, the Engine (MS SQL Server) is still performing the JOIN, which degrades the performance :(.
Is there any way that I can give the engine additional clues that it can work out that it could completely skip the JOIN, whilst still using the VIEW as the thing I'm querying?
Context:
Obviously the tables are huge, which is why the performance impact is material
The Tables and the View are obviously a little more complex than that, but not actually all that much - the logical simplification is still valid, and the UQ constraints are explicit (as either UQ CONSTRAINTs or UQ indexes).
The VIEW is being used so that users can filter on a variety of different options. The Data API processes those options and applies the relevant WHERE clauses to a single VIEW. Alas, that means various of the JOINs aren't relevant, depending on which filters have been chosen :(
I'm aware that I could materialise and directly index the VIEW, but I'd prefer to avoid that if possible, given that I can see that a simpler query plan could logically exist already.
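For reference, here is a minimal sketch of the shape described above that can be used to look at the plan; the names come from the question, the data types are guesses, and the view lists its columns explicitly (a view needs unique column names, so a literal SELECT * over this join would not compile):
CREATE TABLE dbo.Users (
    Id   INT NOT NULL CONSTRAINT UQ_Users_Id UNIQUE,
    Name NVARCHAR(100) NOT NULL);

CREATE TABLE dbo.Purchases (
    Id     INT NOT NULL PRIMARY KEY,
    Value  DECIMAL(10, 2) NOT NULL,
    UserId INT NULL);
GO

CREATE VIEW dbo.PurchasesWithUsers
AS
SELECT p.Id, p.Value, p.UserId, u.Name AS UserName
FROM dbo.Purchases AS p
LEFT JOIN dbo.Users AS u ON u.Id = p.UserId;
GO

-- the query from the question: no Users column is referenced, so (given the
-- unique constraint) the join cannot change the SUM - compare its plan with
-- the second query, which really does need the join
SELECT SUM(Value) FROM dbo.PurchasesWithUsers;
SELECT SUM(Value), MAX(UserName) FROM dbo.PurchasesWithUsers;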
Unless indexed, a view does not exist as a stored set of data values in a database. The rows and columns come from the tables referenced in the query defining the view and are produced dynamically when the view is referenced. Because you don't want to "directly index the VIEW", there is no other data to work with: the server gets the data by running the query behind the view, so there is no workaround to avoid that LEFT JOIN with this approach.
Okay, context:
I have a system that needs to produce monthly, weekly and daily reports.
Architecture A:
3 tables:
1) Monthly reports
2) Weekly reports
3) Daily reports
Architecture B:
1 table:
1) Reports: with an extra column report_type, with the values "monthly", "weekly", "daily".
Which one would be more performant and why?
The common method I use for this is two tables, similar to your approach B. One table would be as you describe, with the report data and an extra column, but instead of hard-coding the values, this column would hold an id into a reference table. The reference table would then hold the names of these values. This setup allows you to easily reference the intervals from other tables should you need that later on, and it also makes name updates much more efficient. Changing the name of, say, "Monthly" to "Month" would require one update here, vs n updates if you stored the string in your report table.
Sample structure:
report_data | interval_id
xxxx | 1
interval_id | name
1 | Monthly
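A minimal DDL sketch of that structure (the table and column names here are illustrative, not prescriptive):
CREATE TABLE dbo.report_interval (
    interval_id INT NOT NULL PRIMARY KEY,
    name        NVARCHAR(50) NOT NULL);   -- 'Monthly', 'Weekly', 'Daily'

CREATE TABLE dbo.report (
    report_id   INT IDENTITY(1, 1) PRIMARY KEY,
    report_data NVARCHAR(MAX) NULL,       -- placeholder for the real report columns
    interval_id INT NOT NULL
        CONSTRAINT FK_report_interval REFERENCES dbo.report_interval (interval_id));

-- renaming 'Monthly' to 'Month' is then a single-row update:
UPDATE dbo.report_interval SET name = 'Month' WHERE name = 'Monthly';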
As a side note, you would rarely want to take your first approach, approach A, because of how it limits changing the interval type of entered data. If all of a sudden you want to change half of your Daily entries to Weekly entries, you need to do n/2 deletes and n/2 inserts, which is fairly costly, especially if you start introducing indexes. In general, tables should describe types of data (i.e. reports) and columns should describe that type (i.e. how often a report happens).
For an existing database, we are considering improving part of the design.
25 tables have very similar structures: about 90% identical columns and data types. There are also fairly frequent changes to these tables; for example, we may need to add 2 new columns to 7 of the 25 tables, and a few months later the same 2 columns may be required in 5 further tables, etc. We also get questions like: how many rows in these tables have IsActive (see example below) = TRUE? This currently means writing 25 SQL statements, and the real statements are much more complex than this simple example. It just feels wrong to query 25 tables and then combine the results.
One option we discussed would be to store all data in a master table. However in total this would mean having quite a wide table and quite a lot of NULL values.
A further idea we discussed is to keep the 25 tables and create a master view which combines them. The view would, however, need a lot of manual maintenance; an update could get forgotten and the view would still appear to work.
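(Such a master view would essentially be a UNION ALL over all 25 tables, roughly like the sketch below, using the example tables shown further down; the table names are assumptions. Every schema change means editing it by hand, which is exactly the maintenance risk mentioned above.)
CREATE VIEW dbo.AllBusinessRules
AS
SELECT 'BusinessRule1' AS SourceTable, CustomerID, IsPremiumCust, HasCreditCard, IsActive
FROM dbo.BusinessRule1
UNION ALL
SELECT 'BusinessRule2', CustomerID, IsPremiumCust, HasCreditCard, IsActive
FROM dbo.BusinessRule2
-- UNION ALL ... 23 more SELECTs, one per table, all kept in sync by hand
;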
In database design, one of the main concepts is: "For maximum flexibility, data is stored in columns, not in column names," which leads us to the main question. Does anyone have experience with storing what would otherwise be columns as rows in a table? The columns actually contain filter criteria for business logic.
Here is an example:
Table 1: Business Rule 1
CustomerID (int) | IsPremiumCust (bool) | HasCreditCard (bool) | IsActive (bool) | OrderThreshold (int)
Table 2: Business Rule 2
CustomerID (int) | IsPremiumCust (bool) | HasCreditCard (bool) | IsActive (bool) | Discount (int)
There are a further 23 tables like these, all with more columns than in this example.
Suggestion: Criteria table
Criteria ID | Criteria | Data Type
1 | IsPremiumCust | bool
2 | HasCreditCard | bool
3 | IsActive | bool
4 | OrderThreshold | int
5 | Discount | int
Suggestion: Business Rule table
Business Rule ID | Name
1 | Business Rule 1
2 | Business Rule 2
Suggestion: Intersection table
CustomerID | Business Rule ID | Criteria ID | Criteria Value
------------------------------------------------------------
1 | 1 | 1 | TRUE
2 | 2 | 1 | FALSE
I know this doesn't really work as-is, because the Criteria Value field could have different data types. However, I hope someone has run into a similar situation and can think of a full solution for this question.
This would allow us to add criteria without having to keep changing many table structures.
It sounds like you should have one table that has the core fields from your 25 common tables, with an additional field for the type of record that corresponds to the current table names. Then you want one or a few supplemental tables that use the primary key of your new core table and store just the additional fields needed by each type of record. If you find yourself with a new set of columns that only apply to a handful of existing tables, that's fine: the supplemental table only needs rows for the core-table records that actually use those columns. And when those columns expand to cover more of your original tables, adding records to the supplemental table is easy. You can still build a master view from this if you need it.
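A rough DDL sketch of that layout; the table and column names are illustrative, and the real column lists would come from your 25 tables:
CREATE TABLE dbo.BusinessRule (
    BusinessRuleId INT IDENTITY(1, 1) PRIMARY KEY,
    RuleType       VARCHAR(50) NOT NULL,  -- which of the original 25 tables the row corresponds to
    CustomerID     INT NOT NULL,
    IsPremiumCust  BIT NOT NULL,
    HasCreditCard  BIT NOT NULL,
    IsActive       BIT NOT NULL);         -- ... the ~90% of columns shared by all rules

-- one supplemental table per set of extra columns; only the rule types that
-- need these columns get a row here (a similar table could hold Discount)
CREATE TABLE dbo.BusinessRuleOrdering (
    BusinessRuleId INT NOT NULL PRIMARY KEY
        REFERENCES dbo.BusinessRule (BusinessRuleId),
    OrderThreshold INT NOT NULL);

-- the "how many rows have IsActive = TRUE" question becomes a single statement:
SELECT RuleType, COUNT(*) AS ActiveRows
FROM dbo.BusinessRule
WHERE IsActive = 1
GROUP BY RuleType;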
I think the answer to your question is a good property-inheritance tree for the end entity. If the tree is optimized for the problem domain, you will have an efficient database schema without NULL values. The problem of the number of SQL statements can be handled by a suitable ORM.
Currently I have a view called vStoreProduct which has following columns
DOEntry | StoreID | UpcCode | Value
My users filter by StoreID and UpcCode; usually I have more than 500 stores and more than 700 UpcCodes.
In my frontend or user interface, the user can select anything, i.e.
All Store & All Product
Some Store or some product
The resulting SQL query is something like this:
select count(*) from vStoreProduct where StoreID in ( ..................) and
UpcCode in (.....................)
Currently even a count is taking more than 3 mins for a view of 500,000 records.
Is this the best approach or would you recommend something else.
Thanks
What is the code in vStoreProduct? Is it just a simple select over one (or more) base tables, or does it contain more logic that is the root cause of the slowness? If it's complex, can you use an indexed view instead?
If the view is simple, are the fields you filter on indexed? Have you looked at the STATISTICS IO output or checked for heavy operations in the execution plan (table/index scans, sorts and key lookups for a large number of rows, spools)?
If the view contains 500 000 rows, how many of these are you fetching when it takes 3 minutes? How many rows are in the base tables?
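As a starting point, a sketch of the first checks (the literal filter values and the base-table/index names below are assumptions, not taken from your system):
-- see how much work the query really does
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT COUNT(*)
FROM vStoreProduct
WHERE StoreID IN (1, 2, 3)            -- the real lists come from the UI
  AND UpcCode IN ('0001', '0002');
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

-- if the view is a thin wrapper over a single base table (assumed here to be
-- dbo.StoreProduct), a narrow index on the two filter columns lets the count
-- be answered from the index alone instead of scanning the whole table
CREATE INDEX IX_StoreProduct_Store_Upc
    ON dbo.StoreProduct (StoreID, UpcCode);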
Designing my current app, I've run into a problem: the user provided the following requirement:
For each enterprise (let's say there's an enterprise table, with enterprise_id as key), there is a fixed dataset, structured more or less like this:
| Income | Declared | Expenses | Normalized | Whatever | etc
-------------+--------+----------+----------+------------+----------+-...
Short term |
Medium term |
Long term |
Unspecified |
Unknown |
The key point is that (conceptually) the columns are fixed and the rows are fixed. And by fixed, I mean it's a written law in my country! So it's not gonna change in the short term.
My doubt is: each instance of the "user" table (with its fixed M columns and N rows) looks like a DB row to me (there's a 1:1 correspondence with enterprise_id, all the data is going to get saved/retrieved as a single block each time, etc). On the other hand, this is a lot of columns (MxN may be a hundred in my real app), and, frankly, it's ugly to look at, so I'm uneasy about it.
So, should I create a table for this user data with MxN columns (plus one for the enterprise_id foreign key), or should I go along the lines of creating two tables: one with the possible "user" rows ("short term", etc.) and another with only the columns ("income", etc.), each row in this case being (enterprise_id, possible_rows_id, income, declared_income, etc.)?
Thanks in advance!
Assuming the values are all some kind of currency, my first thought was that you probably needed a table like this:
term category term_category_value
-- -- --
Short term Income <some currency amount>
Short term Declared <some currency amount>
...
Short term Whatever <some currency amount>
Medium term Income <some currency amount>
Medium term Declared <some currency amount>
...
Depending on your dbms, making this kind of data look like a spreadsheet might be fairly easy or mildly hard. But you only have to build one view to do that. SQL Server has PIVOT, for example. Some other platforms call it CROSSTAB.
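For example, a SQL Server PIVOT over a table shaped like the one above might look roughly like this (the table name dbo.enterprise_figures, the enterprise_id filter and the IN list are assumptions; the real list would include every legally fixed category):
SELECT term,
       [Income], [Declared], [Expenses], [Normalized]
FROM (SELECT term, category, term_category_value
      FROM dbo.enterprise_figures
      WHERE enterprise_id = 42) AS src
PIVOT (SUM(term_category_value)
       FOR category IN ([Income], [Declared], [Expenses], [Normalized])) AS p;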
If the columns are 1:1 with Enterprise, they should be in Enterprise, but at some point you will run into the row-length maximum.
Therefore, use a second table with the MxN columns plus EnterpriseId as both PK and FK.
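A rough sketch of that second table with the MxN columns flattened out; the names and types are illustrative, and the real column list would follow the legally fixed grid:
CREATE TABLE dbo.enterprise_figures_wide (
    enterprise_id        INT NOT NULL
        CONSTRAINT PK_enterprise_figures_wide PRIMARY KEY
        CONSTRAINT FK_enterprise_figures_wide_enterprise
            REFERENCES dbo.enterprise (enterprise_id),
    short_term_income    DECIMAL(18, 2) NULL,
    short_term_declared  DECIMAL(18, 2) NULL,
    medium_term_income   DECIMAL(18, 2) NULL,
    medium_term_declared DECIMAL(18, 2) NULL
    -- ... one column per cell of the fixed MxN grid
);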