The Question
Assuming I make an indexed view on a table containing a varbinary(max) column, will the binary content be physically copied into the indexed view's B-Tree, or will the original fields just be "referenced" somehow, without physically duplicating their content?
In other words, if I make an indexed view on a table containing BLOBs, will that duplicate the storage needed for BLOBs?
More Details
When using a full-text index on binary data, such as varbinary(max), we need an additional "filter type" column to specify how to extract text from that binary data so it can be indexed, something like this:
CREATE FULLTEXT INDEX ON <table or indexed view> (
<data column> TYPE COLUMN <type column>
)
...
In my particular case, these fields are in different tables, and I'm trying to use an indexed view to join them together so they can be used in a full-text index.
Sure, I could copy the type field into the BLOB table and maintain it manually (keeping it synchronized with the original), but I'm wondering if I can make the DBMS do it for me automatically, which would be preferable unless there is a steep price to pay in terms of storage.
Also, merging these two tables into one would have negative consequences of its own, not to go into too much detail here...
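For reference, the setup I have in mind looks roughly like this. It is only a sketch: dbo.Blobs, dbo.BlobTypes, Content, FileExtension and ftCatalog are invented names.
CREATE VIEW dbo.BlobWithType
WITH SCHEMABINDING
AS
SELECT b.BlobID,
       b.Content,        --varbinary(max) data
       t.FileExtension   --e.g. '.docx'; the "filter type" column
FROM dbo.Blobs AS b
JOIN dbo.BlobTypes AS t
    ON t.BlobID = b.BlobID;
GO
CREATE UNIQUE CLUSTERED INDEX IX_BlobWithType
ON dbo.BlobWithType (BlobID);
GO
--Full-text index on the indexed view; it needs a unique,
--single-column key index, which the clustered index provides.
CREATE FULLTEXT INDEX ON dbo.BlobWithType
(
    Content TYPE COLUMN FileExtension
)
KEY INDEX IX_BlobWithType
ON ftCatalog;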
will that duplicate the storage needed for BLOBs?
Yes. The indexed view will have its own copy.
You can see this from
CREATE TABLE dbo.T1
(
ID INT IDENTITY PRIMARY KEY,
Blob VARBINARY(MAX)
);
DECLARE @vb VARBINARY(MAX) = CAST(REPLICATE(CAST('ABC' AS VARCHAR(MAX)), 1000000) AS VARBINARY(MAX));
INSERT INTO dbo.T1
VALUES (@vb),
(@vb),
(@vb);
GO
CREATE VIEW dbo.V1
WITH SCHEMABINDING
AS
SELECT ID,
Blob
FROM dbo.T1
GO
CREATE UNIQUE CLUSTERED INDEX IX
ON dbo.V1(ID);
GO
SELECT o.NAME AS object_name,
p.index_id,
au.type_desc AS allocation_type,
au.data_pages,
partition_number,
au.total_pages,
au.used_pages
FROM sys.allocation_units AS au
JOIN sys.partitions AS p
ON au.container_id = p.partition_id
JOIN sys.objects AS o
ON p.object_id = o.object_id
WHERE o.object_id IN ( OBJECT_ID('dbo.V1'), OBJECT_ID('dbo.T1') )
Which returns the following. T1 and V1 show identical LOB_DATA allocations (1,129 total pages each), so the view really does store its own copy of the blob data:
+-------------+----------+-----------------+------------+------------------+-------------+------------+
| object_name | index_id | allocation_type | data_pages | partition_number | total_pages | used_pages |
+-------------+----------+-----------------+------------+------------------+-------------+------------+
| T1 | 1 | IN_ROW_DATA | 1 | 1 | 2 | 2 |
| T1 | 1 | LOB_DATA | 0 | 1 | 1129 | 1124 |
| V1 | 1 | IN_ROW_DATA | 1 | 1 | 2 | 2 |
| V1 | 1 | LOB_DATA | 0 | 1 | 1129 | 1124 |
+-------------+----------+-----------------+------------+------------------+-------------+------------+
Related
I thought it was a simple task, but I've been struggling with it for a couple of hours :-(
I want to get the list of column names of a table, together with their data types and the values currently contained in the columns, but I have no idea how to bind the table itself to get the current value:
DECLARE @TTab TABLE
(
fieldName nvarchar(128),
dataType nvarchar(64),
currentValue nvarchar(128)
)
INSERT INTO @TTab (fieldName,dataType)
SELECT
i.COLUMN_NAME,
i.DATA_TYPE
FROM
INFORMATION_SCHEMA.COLUMNS i
WHERE
i.TABLE_NAME = 'Users'
Expected result:
+------------+----------+---------------+
| fieldName | dataType | currentValue |
+------------+----------+---------------+
| userName | nvarchar | John |
| active | bit | true |
| age | int | 43 |
| balance | money | 25.20 |
+------------+----------+---------------+
In general the answer is: No, this is impossible. But there is a hack using text-based containers such as XML or JSON (JSON needs SQL Server 2016+):
--Let's create a test table with some rows
CREATE TABLE dbo.TestGetMetaData(ID INT IDENTITY,PreName VARCHAR(100),LastName NVARCHAR(MAX),DOB DATE);
INSERT INTO dbo.TestGetMetaData(PreName,LastName,DOB) VALUES
('Tim','Smith','20000101')
,('Tom','Blake','20000202')
,('Kim','Black','20000303')
GO
--Here's the query
SELECT C.colName
,C.colValue
,D.*
FROM
(
SELECT t.* FROM dbo.TestGetMetaData t
WHERE t.Id=2
FOR XML PATH(''),TYPE
) A(rowSet)
CROSS APPLY A.rowSet.nodes('*') B(col)
CROSS APPLY(VALUES(B.col.value('local-name(.)','nvarchar(500)')
,B.col.value('text()[1]', 'nvarchar(max)'))) C(colName,colValue)
LEFT JOIN INFORMATION_SCHEMA.COLUMNS D ON D.TABLE_SCHEMA='dbo'
AND D.TABLE_NAME='TestGetMetaData'
AND D.COLUMN_NAME=C.colName;
GO
--Clean-Up (careful with real data)
DROP TABLE dbo.TestGetMetaData;
GO
Part of the result
+----------+------------+-----------+--------------------------+-------------+
| colName | colValue | DATA_TYPE | CHARACTER_MAXIMUM_LENGTH | IS_NULLABLE |
+----------+------------+-----------+--------------------------+-------------+
| ID | 2 | int | NULL | NO |
+----------+------------+-----------+--------------------------+-------------+
| PreName | Tom | varchar | 100 | YES |
+----------+------------+-----------+--------------------------+-------------+
| LastName | Blake | nvarchar | -1 | YES |
+----------+------------+-----------+--------------------------+-------------+
| DOB | 2000-02-02 | date | NULL | YES |
+----------+------------+-----------+--------------------------+-------------+
The idea in short:
Using FOR XML PATH(''),TYPE will create an XML representing your SELECT's result set.
The big advantage with this: the XML elements will carry the columns' names.
We can use a CROSS APPLY to get each column's name and value.
Now we can JOIN the metadata from INFORMATION_SCHEMA.COLUMNS.
One hint: all values will actually be typed as nvarchar(max).
The values being strings might lead to unexpected results due to implicit conversions, or to trouble with BLOBs.
A JSON-based variant of the same trick is sketched below.
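Here is a hedged sketch of the JSON counterpart (SQL Server 2016+), kept as close as possible to the XML query above. Note that FOR JSON omits NULL columns unless you add INCLUDE_NULL_VALUES:
SELECT C.[key]   AS colName
      ,C.[value] AS colValue
      ,D.DATA_TYPE, D.CHARACTER_MAXIMUM_LENGTH, D.IS_NULLABLE
FROM
(
    --one scalar subquery returning the row as a single JSON object
    SELECT (SELECT t.* FROM dbo.TestGetMetaData t
            WHERE t.ID=2
            FOR JSON PATH, WITHOUT_ARRAY_WRAPPER)
) A(jsonRow)
CROSS APPLY OPENJSON(A.jsonRow) C  --one row per column: [key]/[value]
LEFT JOIN INFORMATION_SCHEMA.COLUMNS D ON D.TABLE_SCHEMA='dbo'
                                      AND D.TABLE_NAME='TestGetMetaData'
                                      AND D.COLUMN_NAME=C.[key];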
UPDATE
The following query wouldn't even need to specify the table's name in the JOIN:
SELECT C.colName
,C.colValue
,D.DATA_TYPE,D.CHARACTER_MAXIMUM_LENGTH,D.IS_NULLABLE
FROM
(
SELECT * FROM dbo.TestGetMetaData
WHERE Id=2
FOR XML AUTO,TYPE
) A(rowSet)
CROSS APPLY A.rowSet.nodes('/*/@*') B(attr)
CROSS APPLY(VALUES(A.rowSet.value('local-name(/*[1])','nvarchar(500)')
,B.attr.value('local-name(.)','nvarchar(500)')
,B.attr.value('.', 'nvarchar(max)'))) C(tblName,colName,colValue)
LEFT JOIN INFORMATION_SCHEMA.COLUMNS D ON CONCAT(D.TABLE_SCHEMA,'.',D.TABLE_NAME)=C.tblName
AND D.COLUMN_NAME=C.colName;
Why?
Using FOR XML AUTO will produce attribute-centric XML. The element's name will be the table's name, while the values rest within attributes.
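For illustration, with the test row above the two shapes look like this:
--FOR XML PATH(''),TYPE (element-centric):
--  <ID>2</ID><PreName>Tom</PreName><LastName>Blake</LastName><DOB>2000-02-02</DOB>
--FOR XML AUTO,TYPE (attribute-centric):
--  <dbo.TestGetMetaData ID="2" PreName="Tom" LastName="Blake" DOB="2000-02-02"/>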
UPDATE 2
Fully generic function:
CREATE FUNCTION dbo.GetRowWithMetaData(@input XML)
RETURNS TABLE
AS
RETURN
SELECT C.colName
,C.colValue
,D.*
FROM @input.nodes('/*/@*') B(attr)
CROSS APPLY(VALUES(@input.value('local-name(/*[1])','nvarchar(500)')
,B.attr.value('local-name(.)','nvarchar(500)')
,B.attr.value('.', 'nvarchar(max)'))) C(tblName,colName,colValue)
LEFT JOIN INFORMATION_SCHEMA.COLUMNS D ON CONCAT(D.TABLE_SCHEMA,'.',D.TABLE_NAME)=C.tblName
AND D.COLUMN_NAME=C.colName;
--You call it like this (note the extra parentheses!)
SELECT * FROM dbo.GetRowWithMetaData((SELECT * FROM dbo.TestGetMetaData WHERE ID=2 FOR XML AUTO));
As you see, the function does not even have to know anything in advance...
In SQL Server the maximum length of a row is 8060 bytes. In my case the dba.CostCenter rows can be at most 1383 bytes. So far so good. But our cloud software is updated on a daily basis, and the table definition gets messed up. After 37 days with 8 computed columns similar to the one below,
ALTER TABLE dba.CostCenter DROP COLUMN CcrTotBudgetAY;
ALTER TABLE dba.CostCenter ADD CcrTotBudgetAY AS (CONVERT(DECIMAL(21,5), CcrChildBudgetAY + CcrBudgetAY)) PERSISTED NOT NULL;
SQL Server raises the well-known 8060-byte error.
When I add
ALTER TABLE dba.CostCenter REBUILD PARTITION = ALL;
between the DROP and ADD COLUMN, the error message doesn't pop up and I can easily update the computed columns more than 100 times.
Is there an alternative to the REBUILD option, since it is a heavy operation, or is this a bug in SQL Server?
causes SQL Server to think the 8060 is reached while it is not
If you are seeing this error message then you are wrong about that. Every time you drop and re-create the computed column the old column is marked as dropped in the metadata but the previous column values still exist in the data page.
When you recreate the column, it is treated as a new column at the end of all existing columns. It can reclaim space from dropped columns at the end of the section, but not if they are followed by another undropped column. So, depending on the approach you take, by the time you have done a few iterations of this the data page will be full of rubbish.
Approach 1 (offsets increase)
DROP TABLE IF EXISTS dbo.CostCenter
go
CREATE TABLE dbo.CostCenter
(
CcrChildBudgetAY DECIMAL(10,2),
CcrBudgetAY DECIMAL(10,2),
CcrTotBudgetAY AS (CONVERT(DECIMAL(21,5), CcrChildBudgetAY + CcrBudgetAY)) PERSISTED NOT NULL,
CcrTotBudgetAY2 AS (CONVERT(DECIMAL(21,5), CcrChildBudgetAY + CcrBudgetAY)) PERSISTED NOT NULL
);
INSERT INTO dbo.CostCenter VALUES (1,1);
ALTER TABLE dbo.CostCenter DROP COLUMN CcrTotBudgetAY;
ALTER TABLE dbo.CostCenter ADD CcrTotBudgetAY AS (CONVERT(DECIMAL(21,5), CcrChildBudgetAY + CcrBudgetAY)) PERSISTED NOT NULL;
ALTER TABLE dbo.CostCenter DROP COLUMN CcrTotBudgetAY2;
ALTER TABLE dbo.CostCenter ADD CcrTotBudgetAY2 AS (CONVERT(DECIMAL(21,5), CcrChildBudgetAY + CcrBudgetAY)) PERSISTED NOT NULL;
SELECT c.name, is_dropped, leaf_offset
FROM sys.system_internals_partition_columns pc
JOIN sys.partitions p on p.partition_id = pc.partition_id
LEFT JOIN sys.columns c on c.object_id = p.object_id and c.column_id = pc.partition_column_id
WHERE p.object_id = OBJECT_ID('dbo.CostCenter')
+------------------+------------+-------------+
| name | is_dropped | leaf_offset |
+------------------+------------+-------------+
| CcrChildBudgetAY | 0 | 4 |
| CcrBudgetAY | 0 | 13 |
| NULL | 1 | 22 |
| NULL | 1 | 35 |
| CcrTotBudgetAY | 0 | 48 |
| CcrTotBudgetAY2 | 0 | 61 |
+------------------+------------+-------------+
Approach 2 (offsets reused)
DROP TABLE IF EXISTS dbo.CostCenter
go
CREATE TABLE dbo.CostCenter
(
CcrChildBudgetAY DECIMAL(10,2),
CcrBudgetAY DECIMAL(10,2),
CcrTotBudgetAY AS (CONVERT(DECIMAL(21,5), CcrChildBudgetAY + CcrBudgetAY)) PERSISTED NOT NULL,
CcrTotBudgetAY2 AS (CONVERT(DECIMAL(21,5), CcrChildBudgetAY + CcrBudgetAY)) PERSISTED NOT NULL
);
INSERT INTO dbo.CostCenter VALUES (1,1);
ALTER TABLE dbo.CostCenter DROP COLUMN CcrTotBudgetAY, CcrTotBudgetAY2;
ALTER TABLE dbo.CostCenter ADD CcrTotBudgetAY AS (CONVERT(DECIMAL(21,5), CcrChildBudgetAY + CcrBudgetAY)) PERSISTED NOT NULL,
CcrTotBudgetAY2 AS (CONVERT(DECIMAL(21,5), CcrChildBudgetAY + CcrBudgetAY)) PERSISTED NOT NULL
SELECT c.name, is_dropped, leaf_offset
FROM sys.system_internals_partition_columns pc
JOIN sys.partitions p on p.partition_id = pc.partition_id
LEFT JOIN sys.columns c on c.object_id = p.object_id and c.column_id = pc.partition_column_id
WHERE p.object_id = OBJECT_ID('dbo.CostCenter')
+------------------+------------+-------------+
| name | is_dropped | leaf_offset |
+------------------+------------+-------------+
| CcrChildBudgetAY | 0 | 4 |
| CcrBudgetAY | 0 | 13 |
| NULL | 1 | 22 |
| NULL | 1 | 35 |
| CcrTotBudgetAY | 0 | 22 |
| CcrTotBudgetAY2 | 0 | 35 |
+------------------+------------+-------------+
The best alternative will be to come up with a solution that doesn't involve dropping and re-creating these columns daily, as even in the best case this causes logged activity for all rows in the table, which need to be rewritten as you add back the persisted computed columns.
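One hedged idea along those lines (an untested sketch): have the daily update compare the column's current definition in sys.computed_columns and skip the drop/re-create entirely when nothing has changed. SQL Server stores a normalized form of the expression, so capture the text it actually stores before relying on the comparison:
--Sketch: only touch the column when its stored definition differs.
--The definition literal below is illustrative, and a first-time
--deployment (column not present yet) would need an extra existence check.
IF NOT EXISTS
(
    SELECT 1
    FROM sys.computed_columns
    WHERE object_id = OBJECT_ID('dba.CostCenter')
      AND name = 'CcrTotBudgetAY'
      AND definition = '(CONVERT([decimal](21,5),[CcrChildBudgetAY]+[CcrBudgetAY]))'
)
BEGIN
    ALTER TABLE dba.CostCenter DROP COLUMN CcrTotBudgetAY;
    ALTER TABLE dba.CostCenter ADD CcrTotBudgetAY AS (CONVERT(DECIMAL(21,5), CcrChildBudgetAY + CcrBudgetAY)) PERSISTED NOT NULL;
END;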
Because it is too complicated to solve this problem without real data, I will try to add some:
                | table 1          | table 2          | ... | table n
--------------------------------------------------------------------------------
column names:   | name | B | C | D | name | B | C | D | ... | name | B | C | D
column content: | John | ...      | Ben  | ...       | ... | John | ...
The objective is to extract the rows in the N tables where name = 'John'.
Where we already have a table called [table_names] with the n tables names stored in the column [column_table_name].
Now we want to do something like that:
SELECT [name]
FROM (SELECT [table_name]
FROM INFORMATION_SCHEMA.TABLES)
WHERE [name] = 'John'
Table names are dynamic and thus unknown until we run the INFORMATION_SCHEMA.TABLES query.
This final query is giving me an error. Any clue about how to use multiple stored tables names in a subquery?
You need to alias your subquery in order to reference it. Plus, [name] should be [table_name]:
SELECT [table_name]
FROM (SELECT [table_name]
FROM INFORMATION_SCHEMA.TABLES) AS X
WHERE [table_name] = 'John'
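The aliased query above only lists table names, though; to actually pull rows where name = 'John' out of the N tables you need dynamic SQL. A hedged sketch (it assumes every listed table really has a [name] column, and STRING_AGG needs SQL Server 2017+; you could equally drive it from your existing [table_names] table instead of INFORMATION_SCHEMA):
DECLARE @sql nvarchar(max);
--One SELECT per table that has a [name] column, glued with UNION ALL.
--(CAST to nvarchar(max) so STRING_AGG doesn't truncate at 8000 bytes.)
SELECT @sql = STRING_AGG(CAST(
           'SELECT ' + QUOTENAME(TABLE_NAME, '''') + ' AS table_name, [name] FROM '
         + QUOTENAME(TABLE_SCHEMA) + '.' + QUOTENAME(TABLE_NAME) AS nvarchar(max)),
           ' UNION ALL ')
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME = 'name';
SET @sql = 'SELECT table_name, [name] FROM (' + @sql + ') u WHERE [name] = ''John'';';
EXEC sys.sp_executesql @sql;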
I have two tables. One is the parent data table, the other is a mapping table for fulfilling a many-to-many relationship between this parent data table and the main table. My problem is that the parent and mapping tables have duplicate values that need to be merged. I can seemingly remove the duplicates from the parent table, but the mapping table needs to have the duplicate data merged in the same fashion. There is a FK and related cascading delete/update on the Mapping Table. How do I ensure the merges from the following statement also get reflected in the Mapping Table?
Before
Parent Table_A:
| ID | ProductName | MFG_ID |
|------+-------------+------------+
| 1 | ACME_123 | 123 |
| 2 | ACME_123 | 456 |
Mapping Table
| ID | MainRecordID | ParentTable.MFG_ID|
|------+--------------+-----------------------+
| 1 | 1 | 123 |
| 2 | 2 | 456 |
Desired After
Parent Table_A:
| ID | ProductName | MFG_ID|
|------+-------------+------------+
| 1 | ACME_123 | 123 |
Mapping Table
| ID | MainRecordID | ParentTable.MFG_ID|
|------+--------------+-----------------------+
| 1 | 1 | 123 |
| 2 | 2 | 123 |
Proposed Code to Merge Table_A Duplicates
MERGE Table_A
USING
(
SELECT
MIN(ID) ID,
ProductName,
MIN(MFG_ID) MFG_ID
FROM Table_A
GROUP BY ProductName
) NewData ON Table_A.ID = NewData.ID
WHEN MATCHED THEN
UPDATE SET
Table_A.ProductName = NewData.ProductName
WHEN NOT MATCHED BY SOURCE THEN DELETE;
Split it into two separate statements wrapped in an explicit transaction instead of a merge. Something like this:
declare @src table
(
Id int,
ProductName varchar(128),
MFG_ID int
)
set xact_abort on
insert into @src
select
Id = min(ID),
ProductName = ProductName,
MFG_ID = MIN(MFG_ID)
from Table_A
group by ProductName
begin tran
delete o
from Table_A o
where not exists
(
select 1
from @src i
where o.id = i.id
)
update t
set ProductName = s.ProductName
from Table_A t
inner join @src s
on t.Id = s.Id
commit tran
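One gap worth flagging: with a cascading delete on the FK, removing the duplicate parents deletes their mapping rows instead of merging them as the question asks. A hedged sketch of the missing step (the mapping table and column names are assumed from the question); run it inside the same transaction, before the delete:
--repoint mapping rows at the surviving parent's MFG_ID first
update m
set m.MFG_ID = s.MFG_ID
from MappingTable m
inner join Table_A o
    on o.MFG_ID = m.MFG_ID           --the parent the row points at today
inner join @src s
    on s.ProductName = o.ProductName --the surviving row for that product
where m.MFG_ID <> s.MFG_ID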
Going off the diagram here: I'm confused about columns 1 and 3.
I am working on a data warehouse table, and there are two columns that are used as a key that gets you the primary key.
The first column is the source system; there are three possible values, let's say IBM, SQL, ORACLE. The second part of the composite key is the transaction ID, which could be numerical or varchar. There is no third column, other than the surrogate key, which is generated by IDENTITY(1,1) as the record gets loaded. So, given the diagram below, I imagine if I pass in a query
Select a.Patient,
b.SourceSystem,
b.TransactionID
from Patient A
right join Transactions B
on A.sourceSystem = B.sourceSystem and
a.transactionID = B.transactionID
where SourceSystem = 'SQL'
The diagram leads me to think that column 1 in the index should be set to the SourceSystem, since it would immediately split the drill-down into the next level of the index by a third. But when showing this diagram to a coworker, they interpreted it as column 1 being the transactionID and column 2 the source system.
Cols
1 2 3
-------------
| | 1 | |
| A |---| |
| | 2 | |
|---|---| |
| | | |
| | 1 | 9 |
| B | | |
| |---| |
| | 2 | |
| |---| |
| | 3 | |
|---|---| |
First, you should qualify all column names in a query. Second, a left join usually makes more sense than a right join (the semantics are: keep all rows in the first table). Finally, if you have proper foreign key relationships, then you probably don't need an outer join at all.
Let's consider this query:
Select p.Patient, t.SourceSystem, t.TransactionID
from Patient p join
Transactions t
on t.sourceSystem = p.sourceSystem and
t.transactionID = p.transactionID
where t.SourceSystem = 'SQL';
The correct index for this query is Transactions(SourceSystem, TransactionId).
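Spelled out (table and column names as given in the question):
CREATE NONCLUSTERED INDEX IX_Transactions_SourceSystem_TransactionID
ON dbo.Transactions (SourceSystem, TransactionID);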
Notes:
Outer joins affect the choice of indexes. Basically if one of the tables has to be scanned anyway, then an index might be less useful.
t.SourceSystem = 'SQL' and p.SourceSystem = 'SQL' would probably optimize differently.
Does the patient really have a transaction id? That seems strange.