Queryset delete with order_by in django - django-models

To prevent deadlocks in most scenarios, it is better for deletes, updates, and other queries that affect the same portion of data to acquire their locks in a defined order. One simple practice is to use ORDER BY in MySQL statements.
I am able to delete a portion of data in a defined order with the following query in MySQL:
DELETE FROM Table WHERE ... ORDER BY id
But the equivalent Django queryset does not behave the same way:
Table.objects.filter(...).order_by('id').delete()
It ignores the order_by() and simply deletes the records.
Short of using a raw SQL query, how can I use the Django ORM to do the same job?
To clarify, I think these two simple queries could cause a deadlock if the order of acquiring locks were arbitrary:
# transaction 1
select * from Table where id > 100 for update;
# transaction 2
delete from Table where id > 101 and id < 110;
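One possible workaround, sketched below, is to materialize the matching primary keys in the desired order and then delete them by id in ascending chunks. This is not a documented ORM feature for ordered deletes; the model name Table, the myapp import, and the id > 100 filter are placeholders standing in for the question's model and WHERE clause.
# Hedged sketch: QuerySet.delete() ignores ordering, so fetch the ids in the
# desired lock order first, then delete them in ascending-id chunks.
from myapp.models import Table  # placeholder model from the question

CHUNK = 1000
ids = list(
    Table.objects.filter(id__gt=100)   # placeholder for the real WHERE clause
    .order_by("id")
    .values_list("id", flat=True)
)
for start in range(0, len(ids), CHUNK):
    Table.objects.filter(id__in=ids[start:start + CHUNK]).delete()
Each chunked delete then touches a contiguous, ascending range of ids, which keeps the lock acquisition order predictable across concurrent statements.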

Related

Snowflake CHANGES | Why does it need to perform a self join? Why is it slower than a join using another unique column?

I was facing issues with a MERGE statement over large tables.
The source table for the merge is basically a clone of the target table after applying some DML.
e.g. In the example below, PUBLIC.customer is the target and STAGING.customer is the source.
CREATE OR REPLACE TABLE STAGING.customer CLONE PUBLIC.customer;
MERGE INTO STAGING.customer TARGET USING (SELECT * FROM NEW_CUSTOMER) AS SOURCE ON TARGET.ID = SOURCE.ID
WHEN MATCHED AND SOURCE.DELETEFLAG=TRUE THEN DELETE
WHEN MATCHED AND TARGET.ROWMODIFIED < SOURCE.ROWMODIFIED THEN UPDATE SET TARGET.AGE = SOURCE.AGE, ...
WHEN NOT MATCHED THEN INSERT (AGE, DELETEFLAG, ID, ...) VALUES (SOURCE.AGE, SOURCE.DELETEFLAG, SOURCE.ID, ...);
Currently, we simply merge STAGING.customer back into PUBLIC.customer at the end.
This final merge statement is very costly for some of the large tables.
While looking for a way to reduce the cost, I discovered Snowflake's "CHANGES" mechanism. As per the documentation:
Currently, at least one of the following must be true before change tracking metadata is recorded for a table:
Change tracking is enabled on the table (using ALTER TABLE … CHANGE_TRACKING = TRUE).
A stream is created for the table (using CREATE STREAM).
Both options add hidden columns to the table which store change tracking metadata. The columns consume a small amount of storage.
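For reference, enabling the first of those options on the staging table from the example would look something like this (a sketch; only the table name is taken from the question):
-- Start recording change tracking metadata (the hidden columns mentioned above).
ALTER TABLE STAGING.customer SET CHANGE_TRACKING = TRUE;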
I assumed that the metadata added to the table would be equivalent to the result set of a SELECT statement using the "changes" clause, which doesn't seem to be the case.
INSERT INTO PUBLIC.CUSTOMER(AGE,...) (SELECT AGE,... FROM STAGING.CUSTOMER CHANGES (information => default) at(timestamp => 1675772176::timestamp) where "METADATA$ACTION" = 'INSERT' );
The SELECT statement using the "changes" clause is much slower than the merge statement I am currently using.
I checked the execution plan and found that Snowflake performs a self-join (of sorts) on the table at two different timestamps.
Should this really be the behaviour, or am I missing something here? I was hoping to get better performance, assuming it would scan the table once and then simply insert the new records, which should be faster than the merge statement.
Also, even if it does a self join, why does the merge query perform better than this? The merge query is also doing a join on similar volumes.
I was also hoping to use the same mechanism for deletes/updates on the source table.

How to safely use current identity as value in insert query

I have a table where one of the columns is a path to an image and I need to create a directory for the record being inserted.
Example:
Id | PicPath
1  | /Pics/1/0.jpg
2  | /Pics/2/0.jpg
This way I can be sure that the folder name is always valid and unique (no clash between two records).
The question is: how can I safely refer to the current id of the record being inserted? Keep in mind that this is a highly concurrent environment, and I would like to avoid multiple trips to the DB if possible.
I have tried the following:
insert into Dummy values(CONCAT('a', (select IDENT_CURRENT('Dummy'))))
and
insert into Dummy values(CONCAT('a', (select SCOPE_IDENTITY() + 1)))
The first query is not safe: when running 1000 concurrent inserts I got 58 'duplicate key' exceptions.
The second query didn't work because SCOPE_IDENTITY() returned the same value for all queries, as I suspected.
What are my alternatives here?
Try a temporary table to track your inserted ids using the OUTPUT clause:
INSERT INTO Dummy (SomeVal)   -- SomeVal stands in for your data column
OUTPUT inserted.Id INTO #temp_ids (Id)
VALUES ('a');
This will capture all of the ids generated by your insert statements. The inserted pseudo-table is scoped to your own statement, so concurrent inserts do not interfere with each other.
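A minimal end-to-end sketch of that pattern, assuming an identity column named Id and the PicPath column from the question (the table definition and column names are illustrative, not taken from the original schema):
-- Assumed table: Dummy (Id INT IDENTITY PRIMARY KEY, PicPath VARCHAR(200))
CREATE TABLE #temp_ids (Id INT NOT NULL);

-- Step 1: insert with a placeholder path and capture the generated identity.
INSERT INTO Dummy (PicPath)
OUTPUT inserted.Id INTO #temp_ids (Id)
VALUES ('');

-- Step 2: build the final path from the captured id.
UPDATE d
SET    d.PicPath = CONCAT('/Pics/', i.Id, '/0.jpg')
FROM   Dummy AS d
JOIN   #temp_ids AS i
  ON   i.Id = d.Id;
This still costs a second statement, but both statements can run in a single batch and transaction, so it avoids an extra round trip from the application.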

postgres: join against partitioned table

I want to join against a huge partitioned table. The planner probably assumes that the partitioned table is very cheap to scan.
I have the following query:
select *
from (
    select * from users where age < 18 limit 10
) as users
join clicks on users.id = clicks.userid
where clicks.ts between '2015-01-01' and now();
The table clicks is the master table, with roughly 40 child tables containing about 40 million records in total.
This query performs very slowly. When I look at the plan, Postgres first performs a complete scan of the clicks table and then scans the users table.
However, when I limit the users subquery to 1, the planner first scans users and then clicks.
It seems as if the planner assumes that the clicks table is very lightweight. If I look at the stats in pg_class, the master table clicks has 0 tuples. That is true on the one hand, because it is only the parent of the partitions, but on the other hand the planner should treat it as containing the sum of all its child tables.
How can I force the planner to use the cheapest option first?
Edit: in simplifying the query I indeed left out an additional constraint on the date.
The partitioning constraints are on: clicks.ts and clicks.userid
I have indexes on users.age, users.id, clicks.userid and clicks.ts
Maybe I just have to trust the planner. I am only a little unsure because I once had a case where Postgres showed some weird behavior with limits (PostgreSQL query very slow with limit 1).
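One workaround that is sometimes suggested for this kind of plan (a hedged sketch, not something from the original thread) is to materialize the small subquery first, for example by moving it into a CTE. In PostgreSQL versions before 12 a CTE is an optimization fence, so the ten users are selected before the planner touches clicks; in 12+ the same effect needs AS MATERIALIZED.
-- Sketch: force the users subquery to be evaluated before the join to clicks.
with young_users as (
    select * from users where age < 18 limit 10
)
select *
from young_users
join clicks on young_users.id = clicks.userid
where clicks.ts between '2015-01-01' and now();
Whether this actually helps depends on the partition constraints on clicks.userid and clicks.ts being usable once the user ids are known.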

UPDATE slow when setting column to NULL

I have a SQL Server 2008 table with 80,000 rows and am executing the following query:
UPDATE dbo.TableName WITH (ROWLOCK)
SET HelloWorldID = NULL
WHERE HelloWorldID = @helloWorldID
HelloWorldID is an int and the @helloWorldID parameter is also an int.
The query is taking too long and I'd like to optimize it. I created a nonclustered index on HelloWorldID but it didn't matter. I may have to redesign this...maybe put the HelloWorldID on another table that links it to the TableName table?
Since a simple UPDATE like this should not take that long, I have to guess that there is a trigger on dbo.TableName and that it is performing additional work that you do not expect, or perhaps some CASCADE option that is affecting other tables which have triggers on them.
It all depends on how many rows will be updated by this query.
If you're updating a lot of rows, say 30% of the table, then the index will actually slow the query down (the index has to be updated along with the table, and it won't help with filtering the rows to update). ROWLOCK will also slow it down, because the engine will issue a separate lock for each row (as opposed to the page locks that would be taken normally).
Try removing the index and running this update WITH (TABLOCK), just to see what happens.
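For example (a sketch of the suggestion above, reusing the statement from the question):
-- Same update, but with a single table-level lock instead of per-row locks.
UPDATE dbo.TableName WITH (TABLOCK)
SET HelloWorldID = NULL
WHERE HelloWorldID = @helloWorldID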
I get this problem sometimes. Your query depends on simultaneously getting a write lock on every row in the table that meets the conditions of the WHERE clause. Depending on your need for full ACID guarantees, you could do something like this:
SELECT getdate() -- force @@ROWCOUNT = 1
WHILE @@ROWCOUNT > 0
    UPDATE TOP (1000) dbo.TableName
    SET HelloWorldID = NULL
    WHERE HelloWorldID = @helloWorldID
This will do the update in smaller chunks and help overcome locking issues. But remember, this method gives up on doing the query as a single transaction. You will need to tune the 1000 to a value that is right for your server.

Updating redundant/denormalized data automatically in SQL Server

I use a high level of redundant, denormalized data in my DB designs to improve performance. I'll often store data that would normally need to be joined or calculated. For example, if I have a User table and a Task table, I would store the Username and UserDisplayName redundantly in every Task record. Another example is storing aggregates, such as a TaskCount in the User table.
User: UserID, Username, UserDisplayName, TaskCount
Task: TaskID, TaskName, UserID, UserName, UserDisplayName
This is great for performance since the app has many more reads than inserts, updates, or deletes, and since some values such as Username change rarely. However, the big drawback is that integrity has to be enforced via application code or triggers. This can be very cumbersome with updates.
My question is: can this be done automatically in SQL Server 2005/2010, maybe via a persisted/permanent view? Would anyone recommend another possible solution or technology? I've heard document-based DBs such as CouchDB and MongoDB can handle denormalized data more effectively.
You might want to first try an Indexed View before moving to a NoSQL solution:
http://msdn.microsoft.com/en-us/library/ms187864.aspx
and:
http://msdn.microsoft.com/en-us/library/ms191432.aspx
Using an Indexed View would allow you to keep your base data in properly normalized tables and maintain data-integrity while giving you the denormalized "view" of that data. I would not recommend this for highly transactional tables, but you said it was heavier on reads than writes so you might want to see if this works for you.
Based on your two example tables, one option is:
1) Add a column to the User table defined as:
TaskCount INT NOT NULL DEFAULT (0)
2) Add a Trigger on the Task table defined as:
CREATE TRIGGER UpdateUserTaskCount
ON dbo.Task
AFTER INSERT, DELETE
AS
-- Increase TaskCount for users that gained tasks in this statement.
;WITH added AS
(
    SELECT ins.UserID, COUNT(*) AS [NumTasks]
    FROM INSERTED ins
    GROUP BY ins.UserID
)
UPDATE usr
SET usr.TaskCount = (usr.TaskCount + added.NumTasks)
FROM dbo.[User] usr
INNER JOIN added
        ON added.UserID = usr.UserID
-- Decrease TaskCount for users that lost tasks in this statement.
;WITH removed AS
(
    SELECT del.UserID, COUNT(*) AS [NumTasks]
    FROM DELETED del
    GROUP BY del.UserID
)
UPDATE usr
SET usr.TaskCount = (usr.TaskCount - removed.NumTasks)
FROM dbo.[User] usr
INNER JOIN removed
        ON removed.UserID = usr.UserID
GO
3) Then create a View:
SELECT u.UserID,
       u.Username,
       u.UserDisplayName,
       u.TaskCount,
       t.TaskID,
       t.TaskName
FROM dbo.[User] u
INNER JOIN dbo.Task t
        ON t.UserID = u.UserID
And then follow the recommendations from the links above (WITH SCHEMABINDING, a unique clustered index, etc.) to make it persisted. This specific case is meant to be denormalized in a situation that has more reads than writes, so the Indexed View keeps the entire structure, including the trigger-maintained TaskCount aggregate, physically stored; each read does not have to recalculate it.
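A minimal sketch of what that persisted view might look like, assuming the dbo schema and the column names from the example (the view name is illustrative, and session SET options must also satisfy the indexed-view requirements, e.g. QUOTED_IDENTIFIER ON):
CREATE VIEW dbo.UserTaskView
WITH SCHEMABINDING
AS
SELECT u.UserID, u.Username, u.UserDisplayName, u.TaskCount,
       t.TaskID, t.TaskName
FROM dbo.[User] u
INNER JOIN dbo.Task t
        ON t.UserID = u.UserID;
GO
-- The first index on an indexed view must be unique and clustered;
-- TaskID is unique per row of this join, so it qualifies.
CREATE UNIQUE CLUSTERED INDEX IX_UserTaskView ON dbo.UserTaskView (TaskID);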
Now, if a LEFT JOIN is needed because some Users do not have any Tasks, then the Indexed View will not work, due to the many restrictions on creating indexed views (OUTER JOINs are not allowed). In that case you can create a real table (UserTask) as your denormalized structure and populate it either via a Trigger on just the User table (assuming you keep the Trigger shown above, which updates the User table based on changes in the Task table), or you can skip the TaskCount field in the User table and have Triggers on both tables populate the UserTask table. In the end, this is basically what an Indexed View does, just without you having to write the synchronization Trigger(s).
