CREATE TABLE table (
    id1 uuid,
    bucket timestamp,
    createdat timestamp,
    id2 uuid,
    data blob,
    PRIMARY KEY ((id1, bucket), createdat, id2)
)
WITH CLUSTERING ORDER BY (createdat ASC, id2 ASC) AND ...
I have this Cassandra table. In a multithreaded environment, two threads can insert rows into this table at the same time.
I have a requirement that the table must not contain two rows with the same (id1, bucket) whose data field is null.
In other words, for each (id1, bucket) only the last row may have data set to null.
To achieve that, before each insert I query the last row of the partition and only insert the new one if that row's data field is not null.
This does not work because of a race condition: two threads both check, both see a last row whose data is not null, and both insert rows with null data, violating the requirement.
I think it should be possible to fix this at the database level. I need some kind of transaction that first checks the last row and only then inserts the new one, with nothing allowed to be inserted between those two actions.
Can you hint at how this can be achieved in Cassandra?
There isn't a way to achieve this in Cassandra. As you already stated, you will always run into race conditions.
Apart from lightweight transactions, Cassandra doesn't have a locking mechanism, because one would work against its goal of high-velocity CRUD operations. Cheers!
You may be able to get this behavior to work with a Lightweight Transaction.
In your case, it may look something like:
INSERT INTO table (id1, bucket, createdat, id2, data)
VALUES (uuid(), '2021-09-27 13:20', totimestamp(now()), uuid(), textAsBlob('some data'))
IF NOT EXISTS;
Basically, this will only perform the write if a row with that exact primary key does not already exist. This exact solution might not work as-is, but it should point you in the right direction.
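For the requirement in the question, one way to adapt that direction is to guard the "open" (data = null) slot with a small companion table and a lightweight transaction. This is only a sketch; the open_row_marker table and its handling are assumptions, not part of the original schema.

-- Hypothetical companion table: at most one marker per (id1, bucket) partition
-- that currently has an open (data = null) row in the main table.
CREATE TABLE IF NOT EXISTS open_row_marker (
    id1    uuid,
    bucket timestamp,
    PRIMARY KEY ((id1, bucket))
);

-- Claim the single allowed null-data slot with a lightweight transaction.
INSERT INTO open_row_marker (id1, bucket) VALUES (?, ?) IF NOT EXISTS;
-- The driver returns an [applied] flag: only when it is true does the client
-- insert the data-less row into the main table. When that row later gets its
-- data filled in, the marker is released so a new open row can be created:
DELETE FROM open_row_marker WHERE id1 = ? AND bucket = ? IF EXISTS;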
Related
I have a table in Google Cloud Spanner.
CREATE TABLE test_id (
Id STRING(MAX) NOT NULL,
KeyColumn STRING(MAX) NOT NULL,
parent_id INT64 NOT NULL,
Updated TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp=true),
) PRIMARY KEY (Id)
And I am trying to perform transaction.insert_or_update through a Python script.
For each row in a pandas DataFrame, I am doing:
transaction.insert_or_update(
    'test_id',
    columns=['Id', 'KeyColumn', 'parent_id', 'Updated'],
    values=[(uuid.uuid4().hex, row["KeyColumn"], row["parent_id"], spanner.COMMIT_TIMESTAMP)],
)
What I want is: if row["KeyColumn"] is already present in the KeyColumn column of the table, update that row's parent_id; otherwise insert a new row into the Spanner table for that KeyColumn.
But since my primary key is Id, which is generated randomly by uuid.uuid4().hex, a new row is inserted every time.
If I understand you correctly, the following is the situation:
ID is the primary key of your table.
There is a unique index defined for the table on the column KeyColumn.
You want to insert_or_update a row using KeyColumn as the column that should be used to determine whether the row already exists.
That is unfortunately not possible. insert_or_update will always use the primary key of the table to determine whether the row exists. I can think of three possible solutions to this problem, but they all have their drawbacks:
You could change the table definition and make KeyColumn the primary key and set a unique index on the Id column. The problem with this is of course that any other code that depends on Id being the primary key also needs to change. It is also a rather cumbersome change, because Cloud Spanner does not allow you to change the primary key of a table, so you would have to create a copy of the test_id table and then drop the old table.
You could fetch the row from Cloud Spanner before updating it by reading it using the KeyColumn value that you have. The big problem with this is obviously performance. You will need to do a read for each row that you want to update.
You could use a DML statement (UPDATE test_id SET parent_id=@parent WHERE KeyColumn=@key) to execute the update and check whether it actually updated a row by inspecting the returned update count. If it did not update anything, you could then execute the insert. This will obviously also be slower than an insert_or_update mutation. A sketch of this approach follows below.
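A minimal sketch of that third option with the google-cloud-spanner Python client, since the question already uses it. Here database is assumed to be an already-constructed Database object, and key_column / parent_id stand for row["KeyColumn"] / row["parent_id"] from the DataFrame.

import uuid

from google.cloud import spanner


def upsert_by_keycolumn(transaction, key_column, parent_id):
    # Try the update first; execute_update returns the number of rows changed.
    row_count = transaction.execute_update(
        "UPDATE test_id SET parent_id = @parent WHERE KeyColumn = @key",
        params={"parent": parent_id, "key": key_column},
        param_types={
            "parent": spanner.param_types.INT64,
            "key": spanner.param_types.STRING,
        },
    )
    if row_count == 0:
        # No row with this KeyColumn yet: fall back to a plain insert.
        transaction.insert(
            "test_id",
            columns=["Id", "KeyColumn", "parent_id", "Updated"],
            values=[(uuid.uuid4().hex, key_column, parent_id, spanner.COMMIT_TIMESTAMP)],
        )


# e.g. database.run_in_transaction(upsert_by_keycolumn, row["KeyColumn"], row["parent_id"])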
For reference, there is a way to query Cloud Spanner with a specific index.
You would use something like this in the FROM clause of your query: FROM test_id@{FORCE_INDEX=KeyColumnIndex}.
Even though this is the way to run queries against secondary indexes, and it answers the question in the title, I do not know how well it applies to your use case.
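A minimal example of that hint, assuming KeyColumnIndex is the name of the secondary index on KeyColumn:

-- Id is available here because every secondary index also stores the
-- base table's primary key columns.
SELECT Id, KeyColumn
FROM test_id@{FORCE_INDEX=KeyColumnIndex}
WHERE KeyColumn = @key;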
When I add new rows using an INSERT INTO ... SELECT statement, the new rows end up scattered among the already existing rows instead of being appended to the end of the table.
I'm using: INSERT INTO Table1 (Name1) SELECT Name FROM Table2.
SQL tables are modeled after unordered sets, and hence you should not assume that there is any order to your data in the table. The only order which exists is what you specify when you query using ORDER BY, e.g.
SELECT Name1
FROM Table1
ORDER BY Name1
An index can also be thought of as a way of ordering your records, but an index is an entity largely distinct from the table itself.
I agree with Tim's answer. But if you still want rows to come back in the order they were inserted, you can add an incremental primary key yourself (like 1, 2, 3, ... or 10, 20, 30, ...).
Although I don't recommend it, the following can help if you don't want to manage the key values yourself (see the sketch after the link):
How do I add a auto_increment primary key in SQL Server database?
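In case the link is unavailable, a minimal SQL Server sketch of that approach, using the question's Table1; the column and constraint names are placeholders:

-- An IDENTITY column hands out increasing values as rows are inserted,
-- so insertion order can be reproduced explicitly with ORDER BY.
ALTER TABLE Table1 ADD Id INT IDENTITY(1,1) NOT NULL;
ALTER TABLE Table1 ADD CONSTRAINT PK_Table1 PRIMARY KEY (Id);

-- later reads must still ask for the order explicitly
SELECT Name1 FROM Table1 ORDER BY Id;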
We have a table that will store versions of records.
The columns are:
Id (Guid)
VersionNumber (int)
Title (nvarchar)
Description (nvarchar)
etc...
Saving an item will insert a new row into the table with the same Id and an incremented VersionNumber.
I am not sure how best to generate the sequential VersionNumber values. My initial thought is to:
SELECT @NewVersionNumber = MAX(VersionNumber) + 1
FROM VersionTable
WHERE Id = @ObjectId
And then use the @NewVersionNumber in my insert statement.
If I use this method, do I need to set my transaction isolation level to serializable to avoid concurrency issues? I don't want to end up with duplicate VersionNumbers for the same Id.
Is there a better way to do this that doesn't require serializable transactions?
In order to avoid concurrency issues (or, in your specific case, duplicate inserts) you could create a compound key as the primary key for your table, consisting of the Id and VersionNumber columns. This would enforce a unique constraint on that combination.
Your insert routine/logic can then be written to handle, or rather CATCH, an insert error caused by a duplicate key and simply re-issue the insert, as sketched below.
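A hedged sketch of that idea, using the question's variable names; the constraint name, the retry loop, and the error-number check are assumptions (THROW needs SQL Server 2012 or later; use RAISERROR on older versions):

ALTER TABLE VersionTable
    ADD CONSTRAINT PK_VersionTable PRIMARY KEY (Id, VersionNumber);

DECLARE @done bit = 0;
WHILE @done = 0
BEGIN
    BEGIN TRY
        INSERT INTO VersionTable (Id, VersionNumber, Title, Description)
        SELECT @ObjectId, ISNULL(MAX(VersionNumber), 0) + 1, @Title, @Description
        FROM VersionTable
        WHERE Id = @ObjectId;

        SET @done = 1;      -- insert succeeded
    END TRY
    BEGIN CATCH
        -- 2627 = duplicate key: another writer took this VersionNumber, so loop and retry.
        IF ERROR_NUMBER() <> 2627
        BEGIN
            ;THROW;         -- anything else is unexpected: re-raise it
        END
    END CATCH
END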
It may also be worth mentioning that unless you specifically need to use a GUID, e.g. because you are working with SQL Server replication or multiple data sources, you should consider using an alternative data type such as BIGINT.
I had thought that the following single insert statement would avoid concurrency issues, but after Heinzi's excellent answer to my question here it turns out that this is not safe at all:
Insert Into VersionTable
(Id, VersionNumber, Title, Description, ...)
Select @ObjectId, max(VersionNumber) + 1, @Title, @Description
From VersionTable
Where Id = @ObjectId
I'm leaving it just for reference. Of course this would work with either table hints or a transaction isolation level of Serializable, but overall the best solution is to use a constraint.
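For reference, a sketch of the table-hint variant mentioned above; inside a transaction, the UPDLOCK and SERIALIZABLE hints hold a key-range lock on this Id until commit, so two sessions cannot both read the same MAX(VersionNumber):

BEGIN TRANSACTION;

INSERT INTO VersionTable (Id, VersionNumber, Title, Description)
SELECT @ObjectId,
       ISNULL(MAX(VersionNumber), 0) + 1,   -- first version becomes 1
       @Title,
       @Description
FROM VersionTable WITH (UPDLOCK, SERIALIZABLE)
WHERE Id = @ObjectId;

COMMIT TRANSACTION;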
I have a stored procedure that works with a large amount of data, and that data is being inserted into a temp table. The overall flow of events is something like:
CREATE TABLE #TempTable (
    Col1 NUMERIC(18,0) NOT NULL --This will not be an identity column.
    ,Col2 INT NOT NULL
    ,Col3 BIGINT
    ,Col4 VARCHAR(25) NOT NULL
    --Etc...
    --
    --Create primary key here?
)
INSERT INTO #TempTable
SELECT ...
FROM MyTable
WHERE ...
INSERT INTO #TempTable
SELECT ...
FROM MyTable2
WHERE ...
--
-- ...or create primary key here?
My question is: when is the best time to create a primary key on my #TempTable table? I theorized that I should create the primary key constraint/index after I insert all the data, because the index needs to be reorganized as the primary key info is being created. But I realized that my underlying assumption might be wrong...
In case it is relevant, the data types shown are the actual ones. In the #TempTable table, Col1 and Col4 will make up my primary key.
Update: In my case, I'm duplicating the primary key of the source tables. I know that the fields that will make up my primary key will always be unique, so I have no concern about a failed ALTER TABLE if I add the primary key at the end.
That aside, my question still stands: which is faster, assuming both would succeed?
This depends a lot.
If you make the primary key index clustered after the load, the entire table will be re-written as the clustered index isn't really an index, it is the logical order of the data. Your execution plan on the inserts is going to depend on the indexes in place when the plan is determined, and if the clustered index is in place, it will sort prior to the insert. You will typically see this in the execution plan.
If you make the primary key a simple constraint, it will be a regular (non-clustered) index and the table will simply be populated in whatever order the optimizer determines and the index updated.
I think the overall quickest performance (of this process to load temp table) is usually to write the data as a heap and then apply the (non-clustered) index.
However, as others have noted, the creation of the index could fail. Also, the temp table does not exist in isolation: presumably there is a best index for reading the data from it in the next step, and that index will need to either be in place already or be created. This is where you have to trade off speed now against reliability (apply the PK and any other constraints first) and speed later (have at least the clustered index in place, if you are going to have one, before the subsequent reads).
If the recovery model of your database is set to simple or bulk-logged (for a temp table it is tempdb that matters, and tempdb always uses the simple recovery model), SELECT ... INTO ... UNION ALL may be the fastest solution: SELECT ... INTO is a bulk operation, and bulk operations are minimally logged.
eg:
-- first, create the table
SELECT ...
INTO #TempTable
FROM MyTable
WHERE ...
UNION ALL
SELECT ...
FROM MyTable2
WHERE ...
-- now, add a non-clustered primary key:
-- this will *not* recreate the table in the background
-- it will only create a separate index
-- the table will remain stored as a heap
ALTER TABLE #TempTable ADD PRIMARY KEY NONCLUSTERED (NonNullableKeyField)
-- alternatively:
-- this *will* recreate the table in the background
-- and reorder the rows according to the primary key
-- CLUSTERED key word is optional, primary keys are clustered by default
ALTER TABLE #TempTable ADD PRIMARY KEY CLUSTERED (NonNullableKeyField)
Otherwise, Cade Roux had good advice re: before or after.
You may as well create the primary key before the inserts - if the primary key is on an identity column then the inserts will be done sequentially anyway and there will be no difference.
Even more important than performance considerations, if you are not ABSOLUTELY, 100% sure that you will have unique values being inserted into the table, create the primary key first. Otherwise the primary key will fail to be created.
This prevents you from inserting duplicate/bad data.
If you add the primary key when creating the table, the first insert will be free (no checks required). The second insert just has to see if it's different from the first, the third has to be checked against the two existing rows, and so on. The checks will be index lookups, because there's a unique constraint in place.
If you add the primary key after all the inserts, every row has to be matched against every other row. So my guess is that adding the primary key early on is cheaper.
But maybe SQL Server has a really smart way of checking uniqueness. So if you want to be sure, measure it!
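One rough way to "measure it" on your own data; this sketch assumes the question's MyTable as the source and Col1/Col4 as the key columns, so adjust the names to your schema:

SET STATISTICS TIME ON;

-- primary key declared up front: uniqueness is checked row by row during the load
CREATE TABLE #WithPK (Col1 NUMERIC(18,0) NOT NULL,
                      Col4 VARCHAR(25) NOT NULL,
                      PRIMARY KEY (Col1, Col4));
INSERT INTO #WithPK (Col1, Col4)
SELECT Col1, Col4 FROM MyTable;

-- plain heap load, primary key built once at the end
CREATE TABLE #NoPK (Col1 NUMERIC(18,0) NOT NULL,
                    Col4 VARCHAR(25) NOT NULL);
INSERT INTO #NoPK (Col1, Col4)
SELECT Col1, Col4 FROM MyTable;
ALTER TABLE #NoPK ADD PRIMARY KEY (Col1, Col4);

SET STATISTICS TIME OFF;
DROP TABLE #WithPK, #NoPK;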
I was wondering if I could improve a very, very "expensive" stored procedure entailing a bunch of checks at each insert across tables, and I came across this answer. In the sproc, several temp tables are opened and reference each other. I added the primary key to the CREATE TABLE statement (even though my selects use WHERE NOT EXISTS to insert data and ensure uniqueness) and my execution time was cut down severely. I highly recommend using primary keys - always at least try it out, even when you think you don't need them.
I don't think it makes any significant difference in your case:
either you pay the penalty a little bit at a time, with each single insert
or you'll pay a larger penalty after all the inserts are done, but only once
When you create it up front before the inserts start, you could potentially catch PK violations as the data is being inserted, if the PK value isn't system-created.
But other than that - no big difference, really.
Marc
I wasn't planning to answer this, since I'm not 100% confident on my knowledge of this. But since it doesn't look like you are getting much response ...
My understanding is that a PK is a unique index, and that the index is updated and optimized as each record is inserted. So if you add the data first and then create the index, the index is only optimized once.
So, if you are confident your data is clean (without duplicate PK data) then I'd say insert, then add the PK.
But if your data may have duplicate PK data, I'd say create the PK first, so it will bomb out ASAP.
When you add the PK at table creation, the x-th insert is checked against the rows already inserted, but because the PK is backed by a unique index each check is a single index seek (about O(log x)), so the total checking work over n inserts is roughly O(n log n), paid a little at a time.
When you add the PK after inserting all the values, the check happens in one big operation at the end: the server sorts all n rows and verifies uniqueness while building the index, and a single duplicate makes the whole ALTER TABLE fail.
In my view the first one is preferable: the cost is spread across the inserts as cheap index seeks, and a bad row is reported on the insert that caused it.
P.S. Example: for 5 rows that is 5 small index seeks during the load, versus one sort-and-verify over all 5 rows afterwards.
I'm inserting a large amount of rows into an empty table with a primary key constraint on one column.
If there is a duplicate key error, is there any way to find out the value of the key (or row) that caused the error?
Validating the data prior to the insert is sadly not something I can do right now.
Using SQL 2008.
Thanks!
Doing the count(*) / GROUP BY thing is something I'm trying to avoid; this is an insert of hundreds of millions of rows from hundreds of different DBs (some of which are on remote servers), and I don't have the time or space to do the insert twice.
The data is supposed to be unique from the providers, but unfortunately their validation doesn't seem to work correctly 100% of the time, and I'm trying to at least see where it's failing so I can help them troubleshoot.
Thank you!
There's not a way of doing it that won't slow your process down, but here's one way that will make it easier. You can add an instead-of trigger on that table for inserts and updates. The trigger will check each record before inserting it and make sure it won't cause a primary key violation. You can even create a second table to catch violations, and have a different primary key (like an identity field) on that one, and the trigger will insert the rows into your error-catching table.
Here's an example of how the trigger can work:
CREATE TRIGGER mytrigger ON sometable
INSTEAD OF INSERT
AS BEGIN
    INSERT INTO sometable SELECT * FROM inserted WHERE ISNUMERIC(somefield) = 1;
    INSERT INTO sometableRejects SELECT * FROM inserted WHERE ISNUMERIC(somefield) = 0;
END
In that example, I'm checking a field to make sure it's numeric before I insert the data into the table. You'll need to modify that code to check for primary key violations instead - for example, you might join the INSERTED table to your own existing table and only insert rows where you don't find a match, as sketched below.
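A sketch of that primary-key variant; KeyCol is a placeholder for your key column, sometableRejects is assumed to share the same column layout, and duplicates within a single inserted batch would still need separate handling:

CREATE TRIGGER mytrigger_pk ON sometable
INSTEAD OF INSERT
AS BEGIN
    -- rows whose key is not already in the target table are inserted normally
    INSERT INTO sometable
    SELECT i.*
    FROM inserted i
    WHERE NOT EXISTS (SELECT 1 FROM sometable s WHERE s.KeyCol = i.KeyCol);

    -- rows that would have violated the primary key are captured for review
    INSERT INTO sometableRejects
    SELECT i.*
    FROM inserted i
    WHERE EXISTS (SELECT 1 FROM sometable s WHERE s.KeyCol = i.KeyCol);
END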
The solution would depend on how often this happens. If it's <10% of the time then I would do the following:
Insert the data
If error then do Bravax's revised solution (remove constraint, insert, find dup, report and kill dup, enable constraint).
This means it's only costing you on the few times an error occurs.
If this is happening more often then I'd look at sending the boys over to see the providers :-)
Revised:
Since you don't want to insert twice, could you:
Drop the primary key constraint.
Insert all data into the table
Find any duplicates, and remove them
Then re-add the primary key constraint (these steps are sketched below).
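A sketch of those steps; the table, constraint and key column names (MyBigTable, PK_MyBigTable, KeyCol) are placeholders, and ROW_NUMBER keeps one row per key value:

ALTER TABLE MyBigTable DROP CONSTRAINT PK_MyBigTable;

-- ... bulk insert all of the source data here ...

WITH numbered AS (
    SELECT KeyCol,
           ROW_NUMBER() OVER (PARTITION BY KeyCol ORDER BY KeyCol) AS rn
    FROM MyBigTable
)
DELETE FROM numbered WHERE rn > 1;   -- remove every duplicate beyond the first

ALTER TABLE MyBigTable ADD CONSTRAINT PK_MyBigTable PRIMARY KEY (KeyCol);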
Previous reply:
Insert the data into a duplicate of the table without the primary key constraint.
Then run a query on it to determine rows which have duplicate values for the primary key column:
select count(*), <Primary Key>
from table
group by <Primary Key>
having count(*) > 1
Use SSIS to import the data and have it check for this as part of the data flow. That is the best way to handle it. SSIS can send the bad records to a table (which you can later send to the vendor to help them clean up their act) and process the good ones.
I can't believe that SSIS does not easily address this "reality", because, let's face it, oftentimes you need and want to be able to:
See if a record exists with a certain unique or primary key
If it does not, insert it
If it does, either ignore it or update it.
I don't understand how they would let a product out the door without this capability built in, in an easy-to-use manner - like, say, setting an attribute of a component to check this automatically.
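For what it's worth, the check-and-insert-or-update pattern described above can be written directly in T-SQL (SQL Server 2008 and later) with MERGE; dbo.Target, staging.Source, KeyCol and SomeValue are placeholder names:

MERGE dbo.Target AS t
USING staging.Source AS s
    ON t.KeyCol = s.KeyCol
WHEN MATCHED THEN
    UPDATE SET t.SomeValue = s.SomeValue      -- or drop this clause to ignore existing rows
WHEN NOT MATCHED BY TARGET THEN
    INSERT (KeyCol, SomeValue)
    VALUES (s.KeyCol, s.SomeValue);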