We have a scenario in Cassandra where, for a given id_1, one insert happens per id_2, against the schema below:
CREATE TABLE IF NOT EXISTS my_table (
id_1 UUID,
id_2 UUID,
textDetails TEXT,
PRIMARY KEY (id_1, id_2)
);
A single POST request body has the details for multiple values of id_2, which triggers multiple inserts into a single table per POST request.
Each INSERT query is performed as shown below:
insertQueryString := "INSERT INTO my_table (id_1, id_2, textDetails) " +
    "VALUES (?, ?, ?) IF NOT EXISTS"

if err := cassandra.Session.Query(insertQueryString,
    id1,
    id2,
    myTextDetails).Exec(); err != nil {
    // handle/log the failed conditional insert
}
1. Does Cassandra ensure data consistency for the multiple inserts into a single table made per POST request? Each POST request is processed on a goroutine (thread), and subsequent GET requests should retrieve consistent data (the data inserted through the POST).
Using BATCH statements runs into "Batch too large" issues in staging & production: https://github.com/RBMHTechnology/eventuate/issues/166
2. We have two data centres (for Cassandra), with 3 replica nodes per data centre.
What consistency levels need to be set for the write operation (POST request) and the read operation (GET request) to ensure full consistency?
There are multiple problems here:
Batching should be used very carefully in Cassandra - only if you're inserting data into the same partition. If you insert data into multiple partitions, it's better to use separate queries executed in parallel (but you can still collect the entries that share a partition key and batch those - see the CQL sketch after this list).
You're using IF NOT EXISTS, and it's done against the same partition - as a result it leads to conflicts between multiple nodes (see the documentation on lightweight transactions), plus it requires reading data from disk, so it heavily increases the load on the nodes. But do you really need to insert data only if the row doesn't exist? What is the problem if the row already exists? It's easier to just overwrite data in Cassandra with a plain INSERT, because that doesn't require reading data from disk.
Regarding consistency level - QUORUM (or SERIAL for LWTs) will give you strong consistency, but at the expense of increased latency (because you need to wait for answers from the other DC) and a lack of fault tolerance - if you lose the other DC, all your queries will fail. In most cases LOCAL_QUORUM is enough (LOCAL_SERIAL in the case of LWTs), and it still provides fault tolerance. I recommend reading the whitepaper on best practices for building fault-tolerant applications on top of Cassandra.
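For illustration, a minimal cqlsh sketch of this advice, assuming the schema from the question (the UUID literal and the text values are placeholders): rows that share the same partition key id_1 can go into a single UNLOGGED batch, and the session consistency level can be set to LOCAL_QUORUM.

-- set the per-session consistency level (cqlsh command)
CONSISTENCY LOCAL_QUORUM;

-- both inserts target the same partition (same id_1), so one unlogged batch is fine
BEGIN UNLOGGED BATCH
    INSERT INTO my_table (id_1, id_2, textDetails)
    VALUES (11111111-1111-1111-1111-111111111111, uuid(), 'details for the first id_2');
    INSERT INTO my_table (id_1, id_2, textDetails)
    VALUES (11111111-1111-1111-1111-111111111111, uuid(), 'details for the second id_2');
APPLY BATCH;

With the Go driver, the equivalent is to set the consistency level on the session or per query rather than in CQL; the batch above only groups statements that belong to the same id_1.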
Related
My source table, called Event, sits in a different database and has millions of rows. Each event can have an action of DELETE, UPDATE or NEW.
We have a Java process that goes through these events in the order they were created, applies all sorts of rules, and then inserts the results into multiple tables for lookup, analysis, etc.
I am using JdbcTemplate and its batchUpdate to delete and upsert to a Postgres DB in sequential order right now, but I'd like to be able to run it in parallel too. Each batch is 1,000 entities to be inserted/upserted or deleted.
However, currently even when running sequentially, Postgres blocks queries somehow, in a way I don't fully understand.
Here is some of the code:
entityService.deleteBatch(deletedEntities);
indexingService.deleteBatch(deletedEntities);
...
entityService.updateBatch(allActiveEntities);
indexingService.updateBatch(....);
Each of these services does inserts/deletes into a different table. They are all in one transaction, though.
The following query
SELECT
activity.pid,
activity.usename,
activity.query,
blocking.pid AS blocking_id,
blocking.query AS blocking_query
FROM pg_stat_activity AS activity
JOIN pg_stat_activity AS blocking ON blocking.pid = ANY(pg_blocking_pids(activity.pid));
returns
Query being blocked: insert INTO ENTITY (reference, seq, data) VALUES($1, $2, $3) ON CONFLICT ON CONSTRAINT ENTITY_c DO UPDATE SET data = $4
Blocking query: delete from ENTITY_INDEX where reference = $1
There are no foreign key constraints between these tables. And we do have indexes so that we can run queries for our processing as part of the process.
Why can a query on a completely different table block queries on other tables? And how can we go about resolving this?
Your query is misleading.
What it shows as “blocking query” is really just the most recent statement that ran in the blocking transaction.
It was probably an earlier statement in the same transaction that caused ENTITY (or rather a row in it) to be locked.
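If you want to confirm this, a hedged sketch of a follow-up query against pg_locks (the pid literal is a placeholder for the blocking_id reported above) lists every lock the blocking backend currently holds, including those taken by its earlier statements:

SELECT l.pid,
       l.locktype,
       l.relation::regclass AS locked_relation,  -- table the lock is on, if any
       l.mode,
       l.granted
FROM pg_locks AS l
WHERE l.pid = 12345;  -- replace with the blocking_id from the query above

Seeing a RowExclusiveLock on ENTITY held by that pid would confirm that an earlier statement in the blocking transaction already wrote to ENTITY, which is what the INSERT ... ON CONFLICT is waiting on.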
I have a database table with more than 1 million records, uniquely identified by a GUID column. I want to find out which of these records or rows were selected or retrieved in the last 5 years. The select query can happen from multiple places. Sometimes a row is returned on its own; sometimes it is part of a set of rows. There is a select query that does the fetching over a JDBC connection from Java code, and a SQL procedure also fetches data from the table.
My intention is to clean up a database table. I want to delete all rows which were never used (retrieved via a select query) in the last 5 years.
Does Oracle DB have any built-in metadata which can give me this information?
My alternative solution was to add a LAST_ACCESSED column and update it whenever I select a row from this table. But this is a costly operation for me given the time taken by the whole process. At least 1,000 - 10,000 records are selected from the table in a single operation. Is there a more efficient way to do this than updating the table after reading it? Mine is a multi-threaded application, so updating such a large data set may result in deadlocks or long waiting periods for the next read query.
Any elegant solution to this problem?
Oracle Database 12c introduced a new feature called Automatic Data Optimization that brings you Heat Maps to track table access (modifications as well as read operations). Careful: the feature currently has to be licensed under the Advanced Compression Option or the In-Memory Option.
Heat Maps track whenever a database block has been modified and whenever a segment, i.e. a table or table partition, has been accessed. They do not track select operations per individual row, nor per individual block, because the overhead would be too heavy (data is generally read often and concurrently; having to keep a counter for each row would quickly become a very costly operation). However, if you have your data partitioned by date, e.g. you create a new partition for every day, you can over time easily determine which days are still read and which ones can be archived or purged. Note that Partitioning is also an option that needs to be licensed.
Once you have reached that conclusion you can then either use In-Database Archiving to mark rows as archived, or just go ahead and purge the rows. If you happen to have the data partitioned, you can do easy DROP PARTITION operations to purge one or many partitions rather than having to run conventional DELETE statements.
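For illustration only, a hedged SQL sketch of how that segment-level tracking can be inspected, assuming the feature is licensed and enabled and using the 12c dictionary view DBA_HEAT_MAP_SEGMENT (the schema and table names are placeholders):

-- enable Heat Map tracking instance-wide (license required)
ALTER SYSTEM SET HEAT_MAP = ON SCOPE = BOTH;

-- latest read/write times tracked per segment (table or partition)
SELECT object_name,
       subobject_name,      -- partition name, if the table is partitioned
       segment_write_time,
       segment_read_time,
       full_scan,
       lookup_scan
FROM dba_heat_map_segment
WHERE owner = 'MY_SCHEMA'
  AND object_name = 'MY_TABLE';

With daily partitions, the per-partition rows of this view are what tell you which days are still being read.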
I couldn't use any of the built-in solutions. I tried the solutions below:
1) The DB audit feature for select statements.
2) Adding a trigger to update a date column whenever a select query is executed on the table.
Both were discarded. Auditing uses up a lot of space and has a performance hit. Similarly, the trigger also had a performance hit.
Finally I resolved the issue by maintaining a separate table where entries older than 5 years that are still used or selected in a query are inserted. When deleting, I cross-check this table and avoid deleting the entries present in it.
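A hedged sketch of that cross-check (the table and column names are placeholders, not the real schema): the purge deletes only the rows whose GUID never appears in the "still used" tracking table.

DELETE FROM my_big_table t
WHERE NOT EXISTS (
        SELECT 1
        FROM still_used_entries u
        WHERE u.guid = t.guid
      );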
We have a stored procedure which loads the order details for an order. We always want the latest information about an order, so the order details are regenerated every time the stored procedure is called. We are using SQL Server 2016.
Pseudo code:
DELETE by clustered index based on order identifier
INSERT into the table, based on a huge query containing information about order
When multiple end users execute the stored procedure concurrently, blocking occurs on the orderdetails table. Once the first caller is done, the second caller proceeds, followed by the third. So the time to generate the orderdetails increases as time goes by. This happens especially for big orders with detail rows in the range of 100k to 1 or 2 million, because a table-level lock is taken.
The approach we took
We partitioned the table based on the last digit of the order identifier to allow concurrent orderdetails loading. This improves performance the first time orderdetails are loaded, as there are no deletes. But from the second time onwards, the INSERT in the first session blocks the DELETE in the other sessions. The other sessions are blocked until the first session is done with its INSERT.
We are considering creating a separate orderdetails table for every order to avoid these concurrency issues.
Question
Can you please suggest an approach which supports the concurrent DELETE & INSERT scenario?
We solved the contention issue by using a temporary table for the orderdetails. We found that the huge queries were taking a long SELECT time, and this longer time was contributing to longer table-level locks on the orderdetails table.
So we first loaded the data into a temporary table, #orderdetail, and only then did the DELETE and INSERT on the orderdetail table.
As the orderdetail table is already partitioned, the DELETEs were faster and the INSERTs happened in parallel. The INSERT was also very fast here, as it is a simple table scan from the #orderdetail table.
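For illustration, a minimal T-SQL sketch of that staging pattern, with placeholder table, column, and parameter names (this is not the actual schema or procedure):

-- placeholder parameter; in the real procedure this would be the proc argument
DECLARE @order_id INT = 42;

-- 1. run the expensive query first, without holding locks on orderdetail
SELECT order_id, line_no, detail_data
INTO #orderdetail
FROM dbo.big_order_source            -- placeholder for the huge order query
WHERE order_id = @order_id;

-- 2. keep the write window on the real table as short as possible
BEGIN TRANSACTION;
    DELETE FROM dbo.orderdetail WHERE order_id = @order_id;

    INSERT INTO dbo.orderdetail (order_id, line_no, detail_data)
    SELECT order_id, line_no, detail_data
    FROM #orderdetail;
COMMIT TRANSACTION;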
You can take a look at the Hekaton engine. It is available even in SQL Server Standard Edition if you are using SP1.
If this is too complicated to implement due to hardware or software limitations, you can try playing with the isolation levels of the database. Sometimes queries that read huge amounts of data are blocked by, or even become deadlock victims of, queries which are modifying parts of that data. Ask yourself whether you need to guarantee that the data read by the user is valid, or whether you can afford, for example, some dirty reads.
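As a hedged illustration of that isolation-level route (the database and table names are placeholders): read committed snapshot isolation lets readers work from row versions instead of blocking behind writers, while NOLOCK explicitly accepts dirty reads on a single query.

-- database-wide: readers see the last committed row version instead of waiting
ALTER DATABASE OrdersDb SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;

-- per query: explicitly accept dirty reads where slightly stale data is tolerable
SELECT order_id, line_no, detail_data
FROM dbo.orderdetail WITH (NOLOCK)
WHERE order_id = 42;                 -- placeholder order identifier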
I'm using transactions in .NET. I insert data into four tables with different methods.
Since this is the order inflow method, it is busy in peak hours. The table used for order submission is also used for bulk order processing and batch job processing at the same time, so sometimes we get timeouts in any of the three areas below.
That is to say, a single table is used in three major transaction areas at the same time:
Order submission (insert operation)
Job (updating a column)
Bulk order processing (updating a column)
Is it due to locking that order submission affects the other transactions as well?
NOLOCK is not used in the transactions.
I want to place DB2 Triggers for Insert, Update and Delete on DB2 Tables heavily used in parallel online Transactions. The tables are shared by several members on a Sysplex, DB2 Version 10.
In each of the DB2 triggers I want to insert a row into a central table, and have one background process call a stored procedure that reads this table every second to process the newly inserted rows, ordered by the sequence of the inserts (sequence number or timestamp).
I'm very concerned about DB2 Index locking contention and want to make sure that I do not introduce Deadlocks/Timeouts to the applications with these Triggers.
Obviously I would take advantage of DB2 features to reduce locking, like row-level locking, but I still don't see a really good approach to avoiding index contention.
I see three different options for selecting the newly inserted rows.
Option 1: Put a sequence number in the table and store the last processed sequence number in the background process. I would use the following SELECT statement:
SELECT COLUMN_1, .... Column_n
FROM CENTRAL_TABLE
WHERE SEQ_NO > 'last-seq-number'
ORDER BY SEQ_NO;
The locking level must be CS to avoid selecting uncommitted rows which might later be rolled back.
I think I need an index on the table with SEQ_NO ASC.
Pro: The background process only reads rows and makes no updates/deletes (only shared locks)
Con: Index contention because of the ascending key
I can clean up processed records later (e.g. by rolling partitions).
Option 2: Put a status field in the table (processed and unprocessed) and change the SELECT as follows:
SELECT COLUMN_1, .... Column_n
FROM CENTRAL_TABLE
WHERE STATUS = 'unprocessed'
ORDER BY TIMESTAMP;
Later I would update the STATUS of the selected rows to "processed"
I think I need an Index on STATUS
Pro: No ascending sequence number in the index and no direct deletes
Con: Concurrent updates by the online transactions and the background process
Clean-up would happen in off-hours
Option 3: DELETE the processed records instead of updating the status field.
SELECT COLUMN_1, .... Column_n
FROM CENTRAL_TABLE
ORDER BY TIMESTAMP;
Since the table contains very few records, no index is required which could create a hot spot.
Also, I think I could SELECT with isolation level UR, because I would detect potentially uncommitted data on the later delete of the row.
For a primary key index I could use GENERATE_UNIQUE, which is random and not ascending.
Pro: No index hot spot, and the inserts can be spread across the tablespace by the random UNIQUE_ID
Con: Tablespace scan and sort on every call of the stored procedure, and deleting records in parallel with the online inserts
Looking forward to what the community thinks about this problem. This must be a pretty common problem; e.g. SAP should have a similar issue with their batch input tables.
I tend to favour Option 3, because it avoids index contention.
Maybe there is still another solution in your minds out there.
I think you are going to have numerous performance problems with your various solutions.
(I know premature optimization is a sin, but experience tells us that some things are just not going to work in a busy system.)
You should be able to use DB2's autoincrement feature (an identity column) to get your sequence number, with little or no performance impact.
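A hedged sketch of that suggestion, with placeholder column names rather than an actual design: an identity column lets DB2 assign the ascending sequence number itself.

CREATE TABLE CENTRAL_TABLE (
    SEQ_NO     BIGINT NOT NULL GENERATED ALWAYS AS IDENTITY,
    OPERATION  CHAR(1)     NOT NULL,                           -- 'I', 'U' or 'D'
    ROW_KEY    VARCHAR(64) NOT NULL,                           -- key of the changed row
    CREATED_TS TIMESTAMP   NOT NULL WITH DEFAULT CURRENT TIMESTAMP
);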
For the rest, perhaps you should look at a queue-based solution.
Have your trigger drop the operation (INSERT/UPDATE/DELETE) and the keys of the row onto an MQ queue.
Then have a long-running background task (in CICS?) do your post-processing. As it processes one update at a time, you should not trip over yourself. Having a single loaded and active task with the ability to batch up units of work should give you a throughput on the order of three to five hundred updates a second.