How can I include NULLs in a PivotTable Count over SSAS? - sql-server

I have a view that joins orders to web tracking data, which is being used as a fact table. I have lots of nulls because it takes a while for orders to obtain web tracking information. As you can see, I have a total row count of 86,432. However, my measure is showing a count of 52,753 (the simple row count measure created when you build a measure group), even though it uses exactly the same view.
I believe my counts are going to be wrong due to the nulls in my data. How can I get SSAS to correctly count my null values? (I am limited to what I can do to the source database as I don't have access to change the core structure of the source system).
I understand what you are saying about counting a field vs. all fields. However, as you can see, when creating a new measure in SSAS you have the option of a count of rows of a source table. That is the behaviour I would expect, and I would expect the same count as SELECT * on the table, as shown in my images...

I believe DimAd does not have a null or zero AdKey row. And I believe during processing you have to change the error configuration to have it discard or ignore any fact table rows where the foreign key is null.
My top recommendation is to change your fact table foreign keys to be not null. You will need to create a -1 key in each dimension and then use it in the fact table instead of null as described here.
If that's not feasible then add a null or zero AdKey row to any dimension where the fact table foreign key can be null. SSAS should convert the nulls to zero, so either should work. Then during processing those rows won't be dropped because they join fine, and you won't have to change the error configuration during processing.
If that's not feasible or acceptable then you can turn on the Unknown member on all dimensions which could be nullable. Then in the Dimension Usage tab set each relationship to fallback to the Unknown member. This process is described here.
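As a rough sketch of the first option (the -1 unknown member): the dimension DimAd and its AdKey come from the discussion above, but every other object and column name below (vFactOrders, Orders, WebTracking, OrderID, OrderAmount, AdName) is an invented placeholder, so adjust to your schema.
-- Add an 'Unknown' member to the dimension
-- (wrap in SET IDENTITY_INSERT if AdKey is an identity column).
INSERT INTO dbo.DimAd (AdKey, AdName)
VALUES (-1, 'Unknown');
GO
-- In the fact view, map NULL foreign keys to the -1 member so every fact row
-- joins to a dimension row and survives cube processing.
CREATE VIEW dbo.vFactOrders
AS
SELECT
    o.OrderID,
    ISNULL(t.AdKey, -1) AS AdKey,   -- nulls become the Unknown member
    o.OrderAmount
FROM dbo.Orders AS o
LEFT JOIN dbo.WebTracking AS t
    ON t.OrderID = o.OrderID;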

In order to get a true row count, do not count a specific column; use * instead.
COUNT(*) will count all rows, regardless of NULL
COUNT(Column) counts only NON-NULL values
Test Example
declare @table table (i int)
insert into @table (i) values
(1),(NULL),(NULL),(NULL)
select count(*) from @table --returns 4
select count(i) from @table --returns 1

Related

How to apply ROWID in Snowflake? Oracle code conversion to Snowflake

Is there any way of converting the a.ROWID > b.ROWID comparison in the code below into Snowflake? The code below is the Oracle version. I need to carry the ROWID logic over to Snowflake, but Snowflake does not maintain a ROWID. Is there any way to achieve the same result and work around the ROWID issue?
DELETE FROM user_tag.user_dim_default a
WHERE EXISTS (SELECT 1
              FROM rev_tag.emp_site_weekly b
              WHERE a.number = b.ID
                AND a.accountno = b.account_no
                AND a.ROWID > b.ROWID)
This Oracle code seems rather broken, because ROWID is a table-specific pseudo column, so comparing values between tables is questionable. Unless there is some aligned magic happening, like a row being written to rev_tag.emp_site_weekly whenever one is inserted into user_tag.user_dim_default. But even then I can imagine data flows where this will not get what you want.
So, as with most things Snowflake, "there is no free lunch", and the part of the data life cycle that relies on ROWID needs to be implemented explicitly.
Which implies that if you want to use two sequences, you should define them explicitly on each table. And if you want them to be related to each other, it sounds like a multi-table INSERT or MERGE should be used so you can access the first table's sequence value and relate it to the second.
ROWID is an internal hidden column used by the database for specific DB operations. Depending on the vendor, you may have additional columns such as a transaction ID or a logical delete flag. Be very careful to understand the behavior of these columns and how they work. They may not be in order, they may not be sequential, and they may change in value as a DB maintenance job runs while your code is running, or as someone else runs an update on a table. Some of these internal columns may even have the same value for more than one row.
When joining tables, the ROWID on one table has no relation to the ROWID on another table. When writing dedup logic or delete-before-insert logic, you should use the primary key, and additionally an audit column that has the date of insert or date of last update in combination with it. Check the data model or ERD diagram for the PK/FK relationships between the tables and what audit columns are available.
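As an illustration of that advice, a hedged Snowflake sketch: it assumes (number, accountno) identifies the logical record, and that both tables carry an audit column, here called load_ts, which is not in the original code.
DELETE FROM user_tag.user_dim_default
USING rev_tag.emp_site_weekly
WHERE user_dim_default.number    = emp_site_weekly.id
  AND user_dim_default.accountno = emp_site_weekly.account_no
  AND user_dim_default.load_ts   < emp_site_weekly.load_ts;  -- audit column stands in for the ROWID comparison
If the real goal is simply to deduplicate within one table, a ROW_NUMBER() OVER (PARTITION BY number, accountno ORDER BY load_ts DESC) combined with QUALIFY ... = 1 is another common Snowflake pattern.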

Using LIKE in WHERE clause for GUIDs results in full table scan

I have a table that looks something like this:
CREATE TABLE Records
(
ID UNIQUEIDENTIFIER PRIMARY KEY NONCLUSTERED,
owner UNIQUEIDENTIFIER,
value FLOAT,
timestamp DATETIME
)
There is a multi-column clustered index on some other columns not relevant to this question.
The table currently has about 500,000,000 rows, and I need to operate on the table but it's too large to deal with currently (I am hampered by slow hardware). So I decided to work on it in chunks.
But if I say
SELECT ID
FROM records
WHERE ID LIKE '0000%'
The execution plan shows that the ENTIRE TABLE is scanned. I thought that with an index, only those rows that matched the original condition would be scanned until SQL reached the '0001' records. With the % in front, I could clearly see why it would scan the whole table. But with the % at the end, it shouldn't have to scan the whole table.
I am guessing this works differently with GUIDs than with CHAR or VARCHAR columns.
So my question is this: how can I search for a subsection of GUIDs without having to scan the whole table?
From your comments, I see the actual need is to break the rows of random GUID values into chunks (ordered) based on range. In this case, you can specify a range instead of LIKE along with a filter on the desired start/end values in the last group:
SELECT ID
FROM dbo.Records
WHERE ID BETWEEN '00000000-0000-0000-0000-000000000000'
             AND '00000000-0000-0000-0000-000FFFFFFFFF';
This article explains how uniqueidentifiers (GUIDs) are stored and ordered in SQL Server, comparing and sorting the last group first rather than left-to-right as you might expect. By filtering on the last group, you'll get a sargable expression and touch only those rows in the specified range (assuming an index on ID is used).
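A small self-contained demonstration of that ordering (nothing here touches the original table; the two values are arbitrary):
DECLARE @g TABLE (ID UNIQUEIDENTIFIER);

INSERT INTO @g (ID) VALUES
    ('00000000-0000-0000-0000-010000000000'),
    ('FFFFFFFF-FFFF-FFFF-FFFF-000000000000');

-- The second value sorts first because its last group (000000000000) is lower,
-- even though its leading characters are "larger".
SELECT ID FROM @g ORDER BY ID;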

Fast duplicates filtering in T-SQL procedure

I need to store some records in a table. It's crucial the records must be unique. All duplicates must be silently ignored.
I wrote a procedure which accepts a user-defined table type; I send a record collection to it and try to have NEW, UNIQUE records stored.
How do I determine uniqueness? I calculate a SHA1 hash from a couple of columns. In my table I have a Hash column, and it has a UNIQUE index constraint.
Here comes the tricky part. Instead of using IF EXISTS(SELECT ...), I use TRY / CATCH blocks and let the INSERT silently fail on a duplicate hash.
So I use a cursor to fetch my rows, calculate the hash for each row, and try to insert it. If the insert fails, the next row is processed.
It works. It's quite fast. However, I'm very disappointed with my identity column.
If I try to enter 3 identical records and 1 new one, I get the following IDs: 1, 4. I would expect 1 and 2, not 1 and 4. So the identity is incremented on each failed insert. I need to avoid that.
I tried wrapping the INSERT in a TRANSACTION and rolling back in the CATCH block. It changes nothing: it works, but the IDs are still wrong.
Is there a way to use the UNIQUE constraint to filter duplicates efficiently, or is the only way the IF EXISTS method?
Is using a UNIQUE constraint really faster than IF EXISTS?
UPDATE:
The typical scenario would look like 1,000 duplicates and 2 new rows. There will be some concurrent calls to this procedure. I just don't want it to slow the server considerably when I have a couple of million rows in my table.
You can use SET IDENTITY_INSERT and control the identity values yourself until the end of the insert:
SET IDENTITY_INSERT dbo.YourTable ON;   -- table name is a placeholder
-- cursor loop: maintain your own counter and insert it explicitly into the identity column
SET IDENTITY_INSERT dbo.YourTable OFF;
By the way, are you sure you cannot avoid cursors?
You could use EXCEPT to get only the values that do not already exist and insert them in a single statement, which would definitely be faster; just to give an idea:
INSERT INTO DestTable
SELECT * FROM (SELECT * FROM SourceTable
               EXCEPT
               SELECT * FROM DestTable) AS NewRows
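Applied to the original scenario, a minimal sketch of that set-based idea might look as follows. The table type dbo.RecordList, the columns ColA/ColB and the procedure name are assumptions; dbo.Records is taken to have an identity Id and a UNIQUE index on Hash as described in the question.
-- Hypothetical table type standing in for the real user-defined table type.
CREATE TYPE dbo.RecordList AS TABLE (ColA NVARCHAR(100), ColB NVARCHAR(100));
GO
CREATE PROCEDURE dbo.AddRecords
    @NewRecords dbo.RecordList READONLY
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO dbo.Records (ColA, ColB, Hash)
    SELECT s.ColA, s.ColB, s.Hash
    FROM (
        SELECT DISTINCT
               ColA,
               ColB,
               HASHBYTES('SHA1', CONCAT(ColA, N'~', ColB)) AS Hash  -- CONCAT requires SQL Server 2012+
        FROM @NewRecords
    ) AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.Records r WHERE r.Hash = s.Hash);
END;
With this shape, identity values are consumed only for rows that are actually inserted, so rejected duplicates no longer burn numbers, and the UNIQUE index on Hash still acts as a safety net if two concurrent calls race past the NOT EXISTS check.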

How can the date a row was added be in a different order to the identity field on the table?

I have a 'change history' table in my SQL Server DB called tblReportDataQueue that records changes to rows in other source tables.
There are triggers on the source tables in the DB which fire after INSERT, UPDATE or DELETE. The triggers all call a stored procedure that just inserts data into the change history table that has an identity column:
INSERT INTO tblReportDataQueue
(
    [SourceObjectTypeID],
    [ActionID],
    [ObjectXML],
    [DateAdded],
    [RowsInXML]
)
VALUES
(
    @SourceObjectTypeID,
    @ActionID,
    @ObjectXML,
    GETDATE(),
    @RowsInXML
)
When a row in a source table is updated multiple times in quick succession the triggers fire in the correct order and put the changed data in the change history table in the order that it was changed. The problem is that I had assumed that the DateAdded field would always be in the same order as the identity field but somehow it is not.
So my table is in the order that things actually happened when sorted by the identity field but not when sorted by the 'DateAdded' field.
How can this happen?
screenshot of example problem
In the example image, the 'DateAdded' of the last row shown is earlier than that of the first row shown.
You are using a surrogate key. One very important characteristic of a surrogate key is that it cannot be used to determine anything about the tuple it represents, not even the order of creation. All systems which have auto-generated values like this, including Oracle's sequences, make no guarantee as to order, only that the next value generated will be unique from previously generated values. That is all that is required, really.
We all do it, of course. We look at a row with ID of 2 and assume it was inserted after the row with ID of 1 and before the row with ID of 3. That is a bad habit we should all work to break because the assumption could well be wrong.
You have the DateAdded field to provide the information you want. Order by that field and you will get the rows in order of insertion (if that field is not updateable, that is). The auto generated values will tend to follow that ordering, but absolutely do not rely on that!
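So, for this table, something along these lines gives the true change order; the identity column name used as a tie-breaker (QueueID) is assumed, and it matters only for rows sharing the same DateAdded value:
SELECT SourceObjectTypeID, ActionID, DateAdded
FROM dbo.tblReportDataQueue
ORDER BY DateAdded, QueueID;   -- QueueID is an assumed name for the identity column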
Try using a Sequence...
"Using the identity attribute for a column, you can easily generate auto-incrementing numbers (which are often used as a primary key). With Sequence, it will be a different object which you can attach to a table column while inserting. Unlike identity, the next number for the column value will be retrieved from memory rather than from the disk – this makes Sequence significantly faster than Identity.
Unlike identity column values, which are generated when rows are inserted, an application can obtain the next sequence number before inserting the row by calling the NEXT VALUE FOR function. The sequence number is allocated when NEXT VALUE FOR is called even if the number is never inserted into a table. The NEXT VALUE FOR function can be used as the default value for a column in a table definition. Use sp_sequence_get_range to get a range of multiple sequence numbers at once."
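For example, a minimal sketch of a sequence used this way (object names here are illustrative, not from the original post; SEQUENCE requires SQL Server 2012 or later):
CREATE SEQUENCE dbo.ReportDataQueueSeq
    AS BIGINT
    START WITH 1
    INCREMENT BY 1;
GO
-- The application can fetch a number before the row exists:
-- SELECT NEXT VALUE FOR dbo.ReportDataQueueSeq;

CREATE TABLE dbo.tblExampleQueue
(
    QueueID   BIGINT NOT NULL
        CONSTRAINT DF_tblExampleQueue_QueueID DEFAULT (NEXT VALUE FOR dbo.ReportDataQueueSeq),
    ObjectXML XML NULL,
    DateAdded DATETIME NOT NULL
        CONSTRAINT DF_tblExampleQueue_DateAdded DEFAULT (GETDATE())
);
Note that, as the first answer points out, sequence values (like identity values) still carry no ordering guarantee, so the DateAdded column remains the authoritative record of insertion order.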

How to uniquely identify rows in a table without a primary key

I'm importing more than 600,000,000 rows from an old database/table that has no primary key set; this table is in a SQL Server 2005 database. I created a tool to import this data into a new database with a very different structure. The problem is that I want to be able to resume the process from where it stopped for any reason, like an error or a network error. As this table doesn't have a primary key, I can't check whether a row was already imported or not. Does anyone know how to identify each row so I can check if it was already imported? This table has duplicated rows; I already tried to compute a hash of all the columns, but it's not working due to the duplicated rows...
thanks!
I would bring the rows into a staging table if this is coming from another database -- one that has an identity set on it. Then you can identify the rows where all the other data is the same except for the id and remove the duplicates before trying to put it into your production table.
So: you are loading umpteen bazillion rows of data, the rows cannot be uniquely identified, the load can (and, apparently, will) be interrupted at any point at any time, and you want to be able to resume such an interrupted load from where you left off, despite the fact that for all practical purposes you cannot identify where you left off. Ok.
Loading into a table containing an additional identity column would work, assuming that however and whenever the data load is started, it always starts at the same item and loads items in the same order. Wildly inefficient, since you have to read through everything every time you launch.
Another clunky option would be to first break the data you are loading into manageably-sized chunks (perhaps 10,000,000 rows). Load them chunk by chunk, keeping track of which chunk you have loaded. Use a staging table, so that you know and can control when a chunk has been "fully processed". If/when interrupted, you only have to toss the chunk you were working on when interrupted and resume work with that chunk.
With duplicate rows, even row_number() is going to get you nowhere, as that can change between queries (due to the way MSSQL stores data). You need to either bring it into a landing table with an identity column or add a new column with an identity onto the existing table (ALTER TABLE oldTbl ADD NewId INT IDENTITY(1,1)).
You could use row_number() and then back out the last n rows if they have more than the count in the new database for them, but it would be more straightforward to just use a landing table, as sketched below.
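A rough sketch of the landing-table route; every table and column name below is a placeholder, since the real schema isn't shown in the question.
CREATE TABLE dbo.Landing
(
    LandingId  BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Col1       INT          NULL,
    Col2       VARCHAR(50)  NULL,
    Col3       DATETIME     NULL
);

INSERT INTO dbo.Landing (Col1, Col2, Col3)
SELECT Col1, Col2, Col3
FROM dbo.OldTable;   -- or pull from the SQL Server 2005 source via SSIS, bcp, or a linked server

-- The import can then record the highest LandingId already processed and,
-- after an interruption, resume with WHERE LandingId > @LastProcessedId.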
Option 1: duplicates can be dropped
Try to find a somewhat unique field combination (duplicates are allowed) and join on a hash of the rest of the fields, which you store in the destination table.
Assume the following tables:
create table t_x(id int, name varchar(50), description varchar(100))
create table t_y(id int, name varchar(50), description varchar(100), hash varbinary(8000))

select *
from t_x x
where not exists(select *
                 from t_y y
                 where x.id = y.id
                   and hashbytes('sha1', x.name + '~' + x.description) = y.hash)
The reason to try to join on as many fields as possible is to reduce the chance of hash collisions, which are a real concern on a dataset with 600,000,000 records.
Option 2: duplicates are important
If you really need the duplicate rows, you should add a unique id column to your big table. To achieve this in a performant way, you should do the following steps (a sketch follows after the list):
Alter the table and add a uniqueidentifier or int field.
Populate it, e.g. via a newsequentialid() column default or with row_number().
Create an index on this field.
Add the id field to your destination table.
Once all the data is moved over, the field can be dropped.
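A minimal sketch of those steps using an identity column; the table name dbo.OldTable is a placeholder (the uniqueidentifier/newsequentialid() variant works similarly, but via a column DEFAULT rather than an UPDATE).
-- Adding an identity column populates it for the existing rows automatically,
-- but on ~600,000,000 rows this is a size-of-data operation and will take a while.
ALTER TABLE dbo.OldTable
    ADD NewId BIGINT IDENTITY(1,1) NOT NULL;

-- Index it so "has this row already been imported?" checks stay cheap.
CREATE UNIQUE INDEX IX_OldTable_NewId ON dbo.OldTable (NewId);

-- When the migration is finished, the helper column can be removed:
-- DROP INDEX IX_OldTable_NewId ON dbo.OldTable;
-- ALTER TABLE dbo.OldTable DROP COLUMN NewId;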
