Polybase metadata query - sql-server

All,
PolyBase keeps running this query, or a variation of it, roughly every minute. It shows up on the source system it gets its data from. While usually very fast, if any clustered index is rebuilt during maintenance, the rebuild blocks this query.
This doesn't appear to be part of the external table call, as I've stopped the job running that and we are still seeing this. Interestingly, it appears to be issued only against tables with partitioning enabled (or at least I haven't captured the query on tables that aren't partitioned).
Question:
Does anyone have thoughts on what PolyBase is doing here?
Why does it need to run this query every minute or so?
Is there a way to prevent or slow down the calls during operations like index maintenance?
Are there issues with killing the process on the source system?
I appreciate the help and input.
SELECT PARTITION_FUNCTIONS.name
,QUOTENAME(COLUMNS.name)
,cast(PARTITIONS.partition_number AS NVARCHAR(10))
FROM "<DBName>".sys.indexes AS INDEXES
,"<DBName>".sys.partitions AS PARTITIONS
,"<DBName>".sys.index_columns AS INDEX_COLUMNS
,"<DBName>".sys.columns AS COLUMNS
,"<DBName>".sys.partition_functions AS PARTITION_FUNCTIONS
,"<DBName>".sys.partition_schemes AS PARTITION_SCHEMES
WHERE INDEXES.object_id = object_id(@0)
AND INDEXES.type IN (@1,@2,@3)
AND INDEX_COLUMNS.partition_ordinal = @4
AND INDEXES.object_id = PARTITIONS.object_id
AND INDEXES.index_id = PARTITIONS.index_id
AND INDEXES.object_id = INDEX_COLUMNS.object_id
AND INDEXES.index_id = INDEX_COLUMNS.index_id
AND INDEXES.data_space_id = PARTITION_SCHEMES.data_space_id
AND PARTITION_SCHEMES.function_id = PARTITION_FUNCTIONS.function_id
AND INDEXES.object_id = COLUMNS.object_id
AND INDEX_COLUMNS.column_id = COLUMNS.column_id

It is normal for PolyBase to query the external table's metadata to identify the fields, data types, and lengths, but there are only two times when it does this: at external table creation, and when querying the data. If you're seeing it every minute with variations, check:
If you're dropping and creating the external table in a cycle.
If you're querying the external table in a cycle.
If it is part of a scale-out group, and if other nodes are querying the external table.
If you can't identify where it is being triggered, you should drop your external table while it's not in use to avoid interfering with index maintenance; that's better than creating a process to kill these sessions indefinitely.
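As a minimal sketch of that workaround, assuming a SQL Server external data source and hypothetical object names (dbo.SalesOrders_Remote, MySqlServerSource), the drop/recreate could look like this:

DROP EXTERNAL TABLE dbo.SalesOrders_Remote;

-- ...run index maintenance on the source system here...

-- Recreate the external table afterwards; this is metadata only,
-- no data is copied.
CREATE EXTERNAL TABLE dbo.SalesOrders_Remote
(
    OrderId INT NOT NULL,
    OrderDate DATETIME2 NOT NULL
)
WITH (
    LOCATION = '<DBName>.dbo.SalesOrders', -- table on the source system
    DATA_SOURCE = MySqlServerSource        -- existing EXTERNAL DATA SOURCE
);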
You can find additional information about how Polybase works in my book "Hands-on data virtualization with Polybase".

Related

Why would a dirty read of a table cause WRITELOG waits in SQL Server 2019?

I'm running into an interesting issue in production, which I cannot replicate in our QA/Staging environments.
I have a query that does dirty reads on a fairly large table (around 6 million rows, but we only keep the last 90 days of data in it; older records are warehoused in a different database). This table gets lots of writes, as it logs page views, but data is only occasionally read from it.
Recently I noticed that when one specific query is running, SQL Server 2019 starts generating a ton of WRITELOG waits and appears to hold up any other requests that are trying to write to the database.
Now the query itself has nolock hints on all the tables, because it's okay if dirty data is returned. We use the nolock hints because writes to the table are extremely frequent, and queries against it can be slow because a lot of page scans are required.
The query itself looks something like this:
select
clt.ViewDate, clt.UserId, clt.RemoteAddress, clt.LibraryId, clt.Parameters
, u.Fullname
, cl.Id as VideoId, cl.Title
-- we need a compound key for each row, so we can count the unique rows
, case
when clt.ViewDate is null then null
else row_number() over (order by clt.ViewDate, clt.UserId, clt.LibraryId, clt.Parameters)
end as compoundKey
from
ContentLibrary as cl (nolock)
left join
(
ContentLibraryTracking as clt (nolock)
inner join
[User] as u (nolock)
on
clt.UserId = u.UserId
)
on
clt.ViewDate between @startDate and @endDate
and
clt.Parameters like @filter
where
1 = 1
and
cl.ContentType = @contentType
order by
clt.ViewDate
The problem table appears to be ContentLibraryTracking. This is the table with millions of rows and lots of inserts, and we warehouse rows nightly, so there can be a lot of page fragmentation. We defragment the indexes and update stats on the table weekly.
When this query is running, sp_BlitzWho reports that the query has entered a CXCONSUMER wait. I will then see SQL Server 2019 starting to queue processes with a WRITELOG wait. These processes remain in this state until the query has finished running.
Since our application performs some kind of write transaction with every page view, this means the query is holding up execution for the entire application, which is obviously bad.
While I know page scans are bad for a query plan, the query requires searching for patterns in a varchar column, which is why they happen. Since reads are very infrequent, the table is optimized for the writes, which are extremely frequent. And while the query could perform better, considering the work it's doing, even when it's slow it finishes within 15 seconds or so.
One thing I do see from the sp_BlitzWho results is that the query is using parallelism, and it also states the transaction isolation level is Read Committed (I would expect Read Uncommitted, since all the tables have a nolock hint).
What would cause a query with dirty reads to be forcing the database to queue up WRITELOG events?
I could see this happening if the query were altering data and generating its own transaction log entries, but that should not be happening with this query. That's the whole reason we are using the nolock hint on the tables.
Also, our database, log files, and tempdb are all on their own logical storage devices, so reads from the database should not be causing IO problems for writes to the transaction log files.
A couple of notes on the environment:
We are running Microsoft SQL Server 2019 (RTM-CU8-GDR) (KB4583459) - 15.0.4083.2 (X64)
The database is running in a VM
We backup transaction logs every 5 minutes (could this be the issue?)
Memory and CPU usage appear fine while the query runs
SQL Monitor 11 only really shows spikes in the log flushes and waits (which would match the behavior). Page splits, buffer cache, and page metrics all look normal. I do see "disk read bytes/sec" go up on the logical drive that holds the database, but writes on all drives (including the transaction logs) look okay.
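A minimal way to watch these waits accumulate, assuming VIEW SERVER STATE permission, is the wait-stats DMV:

-- Cumulative WRITELOG waits since the last instance restart.
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = N'WRITELOG';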
Any thoughts would be greatly appreciated as I'm really scratching my head over this issue.
Right after I posted my question, I started looking at the sp_BlitzWho results in more detail. I noticed the parallelism was using all the CPUs, so I changed MAXDOP to half the CPUs/cores, and this appears to have resolved the issue. I'm going to keep monitoring the situation, but it looks like an instance where MAXDOP was not set correctly.
It makes sense that if a query is eating up all the available cores, other threads will be left waiting. I was just thrown off by the WRITELOG waits.
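For reference, a sketch of both ways to cap parallelism; the value 4 here is a placeholder for half the machine's logical CPUs:

-- Server-wide cap; takes effect immediately, no restart needed.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 4;
RECONFIGURE;

-- Alternatively, cap a single query without changing the server setting:
-- SELECT ... OPTION (MAXDOP 4);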

How to reduce blocking during concurrent DELETE & INSERT to a single table in SQL Server

We have a stored procedure that loads the details for an order. We always want the latest information about an order, so the order details are regenerated every time the stored procedure is called. We are using SQL Server 2016.
Pseudo code:
DELETE by clustered index based on order identifier
INSERT into the table, based on a huge query containing information about order
When multiple end users execute the stored procedure concurrently, blocking occurs on the orderdetails table. Once the first caller is done, the second caller proceeds, followed by the third. So the time to generate the orderdetails grows as callers queue up. This happens especially with big orders containing detail rows in the > 100k to 1 or 2 million range, as a table-level lock is taken.
The approach we took
We partitioned the table based on the last digit of the order identifier to allow concurrent orderdetails loading. This improves performance the first time orderdetails are loaded, as there are no deletes. But from the second time onwards, the INSERT in the first session blocks the other sessions' DELETEs. The other sessions are blocked until the first session is done with its INSERT.
We are considering creating a separate orderdetails table for every order to avoid these concurrency issues.
Question
Can you please suggest an approach that supports the concurrent DELETE & INSERT scenario?
We solved the contention issue by using a temporary table for orderdetails. We found that the huge queries were taking a long time to SELECT, and that longer time was contributing to longer table-level locks on the orderdetails table.
So, we first loaded the data into a temporary table #orderdetail and then ran the DELETE and INSERT against the orderdetail table.
As the orderdetail table is already partitioned, the DELETEs were faster and the INSERTs happened in parallel. The INSERT was also very fast here, as it is a simple table scan of the #orderdetail table.
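A minimal sketch of that staging pattern, with hypothetical names (@OrderId, dbo.OrderDetail, dbo.OrderSource):

DECLARE @OrderId INT = 12345; -- placeholder; normally the procedure parameter

-- 1. Run the expensive query into a temp table first, so no lock is
--    held on OrderDetail while the slow SELECT executes.
SELECT OrderId, LineNumber, Quantity
INTO #OrderDetail
FROM dbo.OrderSource
WHERE OrderId = @OrderId;

-- 2. The DELETE and INSERT are now short, so locks on the target
--    table are held only briefly.
BEGIN TRANSACTION;
DELETE FROM dbo.OrderDetail WHERE OrderId = @OrderId;
INSERT INTO dbo.OrderDetail (OrderId, LineNumber, Quantity)
SELECT OrderId, LineNumber, Quantity FROM #OrderDetail;
COMMIT TRANSACTION;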
You can take a look at the Hekaton engine (In-Memory OLTP). It is available even in SQL Server Standard Edition if you are using SP1.
If that is too complicated to implement due to hardware or software limitations, you can try playing with the database's isolation levels. Sometimes queries that read huge amounts of data are blocked by, or even become deadlock victims of, queries that modify parts of that data. Ask yourself whether you need to guarantee that the data a user reads is valid, or whether you can afford, for example, some dirty reads.
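One hedged example of that trade-off is enabling row versioning, so readers see the last committed version of a row instead of blocking behind writers; the database name is a placeholder:

-- Readers stop blocking behind writers and read row versions from
-- tempdb instead; the trade-off is extra tempdb load.
ALTER DATABASE OrdersDb SET READ_COMMITTED_SNAPSHOT ON
WITH ROLLBACK IMMEDIATE;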

How can I optimise Zumero sync queries

I am currently experiencing very long sync times on a Zumero-synced database (well over a minute), and after some profiling the culprit appears to be a particular query that takes 20+ seconds (suitably anonymised):
WITH relevant_rvs AS
(
SELECT rv.z_rv AS rv FROM zumero."mydb_089eb7ec0e2e4772ba0dde90170ee368_mysynceddb$z$rv$271340031" rv
WHERE (rv.txid<=913960)
AND NOT EXISTS (SELECT 1 FROM zumero."mydb_089eb7ec0e2e4772ba0dde90170ee368_mysynceddb$z$dd$271340031" dd WHERE dd.rv=rv.z_rv AND (dd.txid<=913960))
)
INSERT INTO #final_included_271340031_e021cfbe1c97213dd5adbacd667c08439fb8c6 (z_rv)
SELECT z$this.z_rv
FROM zumero."mydb_089eb7ec0e2e4772ba0dde90170ee368_mysynceddb$z$271340031" z$this
WHERE (z$this.z_rv IN (SELECT rv FROM relevant_rvs))
AND (MyID = XXX AND MyOtherField=XXX)
UNION SELECT z$this.z_rv
FROM zumero."mydb_089eb7ec0e2e4772ba0dde90170ee368_mysynceddb$z$old$271340031" z$this
WHERE (z$this.z_rv IN (SELECT rv FROM relevant_rvs))
AND (MyID = XXX AND MyOtherField=XXX)
I have taken the latter SELECT part of the query and run it in isolation, which reproduces the same poor performance. Interestingly, the execution plan recommends an index, but I'm reluctant to go changing the schema of Zumero-generated tables. Is adding indexes to these tables something that can be attempted safely, and is it likely to help?
The source tables have 100,000-ish records in them, and the filter results in each client syncing 100-1000-ish records, so the data volumes are not trivial, but not at levels I would expect to cause major query performance issues.
Does anyone have any experience optimising Zumero sync performance server-side? Do indexes on the source tables propagate to these tables? They don't appear to in this case.
Creating a custom index on the z$old table should be safe. I do hope it helps boost your query performance! (And it would be great to see a comment letting us know if it does or not.)
I believe the only issue such an index may cause would be that it could block certain schema changes on the host table. For example, if you tried to DROP the [MyOtherField] column from the host table, the Zumero triggers would attempt to drop the same column from the z$old table as well, and the transaction would fail with an error (which might be a bit surprising, since the index is not on the table being directly acted on).
Another thing to consider: it might also help to give this new index a name that will be recognizable and helpful if it ever appears in an error message. Then (as always) feel free to contact support@zumero.com with any further questions or issues if they come up.
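As a sketch only, reusing the anonymised names from the question and a deliberately recognizable index name, such an index might look like:

-- Supports the MyID / MyOtherField filter and covers the z_rv output.
CREATE NONCLUSTERED INDEX IX_zumero_custom_MyID_MyOtherField
ON zumero."mydb_089eb7ec0e2e4772ba0dde90170ee368_mysynceddb$z$old$271340031"
    (MyID, MyOtherField)
INCLUDE (z_rv);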

Oracle tables locked

I'm not a DBA, but I'm having problems with a database that has been locking tables and creating big chaos as a result. I got some information from the Oracle DBA; if someone could help me find the key to the problem, or point me to what I need to do, I've put more information here:
I have a big report from Oracle, but I don't understand 95% of the data.
This looks like regular row-level locking. A row can only be modified by one user at a time. The data dictionary has information about who is blocked and who is doing the blocking:
--Who's blocking who?
select
blocked_sql.sql_id blocked_sql_id
,blocked_sql.sql_text blocked_sql_text
,blocked_session.username blocked_username
,blocking_sql.sql_id blocking_sql_id
,blocking_sql.sql_text blocking_sql_text
,blocking_session.username blocking_username
from gv$sql blocked_sql
join gv$session blocked_session
on blocked_sql.sql_id = blocked_session.sql_id
and blocked_sql.users_executing > 0
join gv$session blocking_session
on blocked_session.final_blocking_session = blocking_session.sid
and blocked_session.final_blocking_instance = blocking_session.inst_id
left join gv$sql blocking_sql
on blocking_session.sql_id = blocking_sql.sql_id;
If you understand the system, it is usually easier to focus on "who" is doing the blocking instead of "what" is blocked. The above query only returns a few common columns, but there are dozens of other columns in those views that might help identify the process.
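If the blocker turns out to be an abandoned session, the usual last resort is for the DBA to kill it; the sid and serial# below are placeholders taken from gv$session:

-- Terminate the blocking session (on RAC, append ',@inst_id').
ALTER SYSTEM KILL SESSION '123,45678' IMMEDIATE;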

Is there metadata I can read from SQL Server to know the last changed row/table?

We have a database with hundreds of tables.
Is there some kind of metadata source in SQL Server that I can programmatically query to get the name of the last changed table and row?
Or do we need to implement this ourselves with fields in each table called LastChangedDateTime, etc.?
In terms of finding out when a table last had a modification, there is a sneaky way to get at this information, but it will not tell you which row was altered, just when.
SQL Server maintains index usage statistics, and records the last seek / scan / lookup and update on an index. It also splits this by user / system.
Filtering that to just the user tables, any insert / update / deletion will cause an update to occur on the index, and the DMV will update with this new information.
select o.name,
max(u.last_user_seek) as LastSeek,
max(u.last_user_scan) as LastScan,
max(u.last_user_lookup) as LastLookup,
max(u.last_user_update) as LastUpdate
from sys.dm_db_index_usage_stats u
inner join sys.objects o on o.object_id = u.object_id
where o.type = 'U' and o.type_desc = 'USER_TABLE'
group by o.name
It is not ideal, however: a heap has no index for a start, and I have never considered using it in production code as a tracking mechanism, only as a forensic tool to check for obvious alterations.
If you want proper row-level alteration tracking, you will either have to build that in yourself or look at the SQL Server 2008-specific Change Data Capture feature.
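As a hedged sketch (names are placeholders; CDC also needs SQL Server Agent running to harvest changes), enabling Change Data Capture for one table looks like:

-- Enable CDC at the database level, then for a single table.
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'MyTable',
    @role_name     = NULL; -- NULL = no gating role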
The [sys].[tables] view will tell you when a table was created and last modified (in terms of schema, not inserts, updates, or deletes). To my knowledge there is no built-in information about last modified for each record in the database (it would take up a lot of space anyway, so it's probably good not to have it). So you should add a last-modified field yourself, and perhaps have it updated automatically by a trigger.
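A quick sketch of reading those schema-level dates:

-- modify_date reflects schema changes only, not inserts/updates/deletes.
SELECT name, create_date, modify_date
FROM sys.tables
ORDER BY modify_date DESC;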
Depending on the recovery model, you might be able to get this from the transaction log using fn_dblog: http://www.sqlskills.com/BLOGS/PAUL/post/Search-Engine-QA-6-Using-fn_dblog-to-tell-if-a-transaction-is-contained-in-a-backup.aspx
