Processing records in database with application running in different nodes - distributed

I need to process 100K records from DB, process them and update the status of the record in DB. If application is running on multiple nodes, how to make sure that same record is not picked by multiple nodes for processing?
This process is triggered by a quartz scheduler that runs every hour and we do not have the flexibility to configure the scheduler on each node to run at different times.
What is the best way to achieve this?

There are various approaches and the following two come to my mind at the moment.
(1) Use DB row locks
Make your nodes place an exclusive DB row lock (select for update) on the record the node is about to process. If the node can place the lock, then the process is free to process the record. Otherwise, the node should try another record because some other node has locked the record and is currently processing the record. You should randomize the selection of unprocessed records to somewhat minimize the chances of multiple nodes competing to process identical records.
(2) Assign your nodes some IDs and make your nodes use these IDs to select disjunct records from the processed data set. For example, if you have 10 nodes, assign them IDs 0 to 9. Then split your unprocessed records into disjunct sets based on the record ID by applying some function that produces numbers 0 to 9. For example, you can use the MOD function. Your nodes will then select only those unprocessed records whose ID is equal to the node ID. This is very simple to implement in SQL as long as you can assign some unique and consecutive IDs to your nodes.
If I were you, I would probably pick the second solution.

Related

Data Partitioning and Replication on Cassandra cluster

I have a 3 node Cassandra cluster with RF=3. Now when I do nodetool status I get the owns for each node in the cluster as 100%.
But when I have 5 nodes in the cluster wit RF=3. The owns is 60%(approx as shown in image below).
Now as per my understanding the partitioner will calculate the hash corresponding to first replica node and the data will also be replicated as per the RF on the other nodes.
Now we have a 5 node cluster and RF is 3.
Shouldn't 3 nodes be owning all the data evenly(100%) as partitioner will point to one node as per the partitoning strategy and then same data be replicated to remaining nodes which equals RF-1? It's like the data is getting evenly distributed among all the nodes(5) even though RF is 3.
Edit1:
As per my understanding the reason for 60%(approx) owns for each node is because the RF is 3. It means there will be 3 replicas for each row. It means there will be 300% data. Now there are 5 nodes in the cluster and partitioner will be using the default random hashing algorithm which will distribute the data evenly across all the nodes in the cluster.
But now the issue is that we checked all the nodes of our cluster and all the nodes contain all the data even though the RF is 3.
Edit2:
#Aaron I did as specified in the comment. I created a new cluster with 3 nodes.
I created a Keyspace "test" and set the class to simplestrategy and RF to 2.
Then I created a table "emp" having partition key (id,name).
Now I inserted a single row into the first node.
As per your explanation, It should only be in 2 nodes as RF=2.
But when I logged into all the 3 nodes, i could see the row replicated in all the nodes.
I think since the keyspace is getting replicated in all the nodes therefore, the data is also getting replicated.
Percent ownership is not affected (at all) by actual data being present. You could add a new node to a single node cluster (RF=1) and it would instantly say 50% on each.
Percent ownership is purely about the percentage of token ranges which a node is responsible for. When a node is added, the token ranges are recalculated, but data doesn't actually move until a streaming event happens. Likewise, data isn't actually removed from its original node until cleanup.
For example, if you have a 3 node cluster with a RF of 3, each node will be at 100%. Add one node (with RF=3), and percent ownership drops to about 75%. Add a 5th node (again, keep RF=3) and ownership for each node correctly drops to about 3/5, or 60%. Again, with a RF of 3 it's all about each node being responsible for a set of primary, secondary, and tertiary token ranges.
the default random hashing algorithm which will distribute the data evenly across all the nodes in the cluster.
Actually, the distributed hash with Murmur3 partitioner will evenly distribute the token ranges, not the data. That's an important distinction. If you wrote all of your data to a single partition, I guarantee that you would not get even distribution of data.
The data replicated to another nodes when you add them isn't cleared up automatically - you need to call nodetool cleanup on the "old" nodes after you add the new node into cluster. This will remove the ranges that were moved to other nodes.

Updating database keys where one table's keys refer to another's

I have two tables in DynamoDB. One has data about homes, one has data about businesses. The homes table has a list of the closest businesses to it, with walking times to each of them. That is, the homes table has a list of IDs which refer to items in the businesses table. Since businesses are constantly opening and closing, both these tables need to be updated frequently.
The problem I'm facing is that, when either one of the tables is updated, the other table will have incorrect data until it is updated itself. To make this clearer: let's say one business closes and another one opens. I could update the businesses table first to remove the old business and add the new one, but the homes table would then still refer to the now-removed business. Similarly, if I updated the homes table first to refer to the new business, the businesses table would not yet have this new business' data yet. Whichever table I update first, there will always be a period of time where the two tables are not in synch.
What's the best way to deal with this problem? One way I've considered is to do all the updates to a secondary database and then swap it with my primary database, but I'm wondering if there's a better way.
Thanks!
Dynamo only offers atomic operations on the item level, not transaction level, but you can have something similar to an atomic transaction by enforcing some rules in your application.
Let's say you need to run a transaction with two operations:
Delete Business(id=123) from the table.
Update Home(id=456) to remove association with Business(id=123) from the home.businesses array.
Here's what you can do to mimic a transaction:
Generate a timestamp for locking the items
Let's say our current timestamp is 1234567890. Using a timestamp will allow you to clean up failed transactions (I'll explain later).
Lock the two items
Update both Business-123 and Home-456 and set an attribute lock=1234567890.
Do not change any other attributes yet on this update operation!
Use a ConditionalExpression (check the Developer Guide and API) to verify that attribute_not_exists(lock) before updating. This way you're sure there's no other process using the same items.
Handle update lock responses
Check if both updates succeeded to Home and Business. If yes to both, it means you can proceed with the actual changes you need to make: delete the Business-123 and update the Home-456 removing the Business association.
For extra care, also use a ConditionExpression in both updates again, but now ensuring that lock == 1234567890. This way you're extra sure no other process overwrote your lock.
If both updates succeed again, you can consider the two items updated and consistent to be read by other processes. To do this, run a third update removing the lock attribute from both items.
When one of the operations fail, you may try again X times for example. If it fails all X times, make sure the process cleans up the other lock that succeeded previously.
Enforce the transaction lock throught your code
Always use a ConditionExpression in any part of your code that may update/delete Home and Business items. This is crucial for the solution to work.
When reading Home and Business items, you'll need to do this (this may not be necessary in all reads, you'll decide if you need to ensure consistency from start to finish while working with an item read from DB):
Retrieve the item you want to read
Generate a lock timestamp
Update the item with lock=timestamp using a ConditionExpression
If the update succeeds, continue using the item normally; if not, wait one or two seconds and try again;
When you're done, update the item removing the lock
Regularly clean up failed transactions
Every minute or so, run a background process to look for potentially failed transactions. If your processes take at max 60 seconds to finish and there's an item with lock value older than, say 5 minutes (remember lock value is the time the transaction started), it's safe to say that this transaction failed at some point and whatever process running it didn't properly clean up the locks.
This background job would ensure that no items keep locked for eternity.
Beware this implementation do not assure a real atomic and consistent transaction in the sense traditional ACID DBs do. If this is mission critical for you (e.g. you're dealing with financial transactions), do not attempt to implement this. Since you said you're ok if atomicity is broken on rare failure occasions, you may live with it happily. ;)
Hope this helps!

Optimizing network bandwidth over distributed database aggregation jobs

I have a distributed/federated database structured as follows:
The databases are spread across three geographic locations ("nodes")
Multiple databases are clustered at each node
The relational databases are a mix of PostgreSQL, MySQL, Oracle, and MS SQL Server; the non-relational databases are either MongoDB or Cassandra
Loose coupling within each node and across the node federation is achieved via RabbitMQ, with each node running a RabbitMQ broker
I am implementing a readonly inter-node aggregation job system for jobs that span the node federation (i.e. for jobs that are not local to a node). These jobs only perform "get" queries - they do not modify the databases. (If the results of the jobs are intended to go into one or more of the databases then this is accomplished by a separate job that is not part of the inter-node job system I am trying to optimize.) My objective is to minimize the network bandwidth required by these jobs (first to minimize the inter-node / WAN bandwidth, then to minimize the intra-node / LAN bandwidth); I assume a uniform cost for each WAN link, and another uniform cost for each LAN link. The jobs are not particularly time-sensitive. I perform some CPU load-balancing within a node but not between nodes.
The amount of data transported across the WAN/LAN for the aggregation jobs is small relative to the amount of database writes that are local to a cluster or to a specific database, so it would not be practical to fully distribute the databases across the federation.
The basic algorithm I use for minimizing network bandwidth is:
Given a job that runs on a set of data that is spread across the federation, the manager node sends a message to each the other nodes that contains the relevant database queries.
Each node runs its set of queries, compresses them with gzip, caches them, and sends their compressed sizes to the manager node.
The manager moves to the node containing the plurality of the data (specifically, to the machine within the cluster that has the most data and that has idle cores); it requests the rest of the data from the other two nodes and from the other machines within the cluster, then it runs the job.
When possible the jobs use a divide-and-conquer approach to minimize the amount of data co-location that is needed. For example, if the job needs to compute the sums of all Sales figures across the federation, then each node locally calculates its Sales sums which are then aggregated at the manager node (rather than copying all of the unprocessed Sales data to the manager node). However, sometimes (such as when performing a join between two tables that are located at different nodes) data co-location is needed.
The first thing I did to optimize this was aggregate the jobs, and to run the aggregated jobs at ten minute epochs (the machines are all running NTP, so I can be reasonably certain that "every ten minutes" means the same thing at each node). The goal is for two jobs to be able to share the same data, which reduces the overall cost of transporting the data.
Given two jobs that query the same table, I generate each job's resultset, and then I take the intersection of the two resultsets.
If both jobs are scheduled to run on the same node, then the network transfer cost is calculated as the sum of the two resultsets minus the intersection of the two resultsets.
The two resultsets are stored to PostgreSQL temporary tables (in the case of relational data) or else to temporary Cassandra columnfamilies / MongoDB collections (in the case of nosql data) at the node selected to run the jobs; the original queries are then performed against the combined resultsets and the data delivered is to the individual jobs. (This step is only performed on combined resultsets; individual resultset data is simply delivered to its job without first being stored on temporary tables/column families/collections.)
This results in an improvement to network bandwidth, but I'm wondering if there's a framework/library/algorithm that would improve on this. One option I considered is to cache the resultsets at a node and to account for these cached resultsets when determining network bandwidth (i.e. trying to reuse resultsets across jobs in addition to the current set of pre-scheduled co-located jobs, so that e.g. a job run in one 10-minute epoch can use a cached resultset from a previous 10-minute resultset), but unless the jobs use the exact same resultsets (i.e. unless they use identical where clauses) then I don't know of a general-purpose algorithm that would fill in the gaps in the resultset (for example, if the resultset used the clause "where N > 3" and a different job needs the resultset with the clause "where N > 0" then what algorithm could I use to determine that I need to take the union of the original resultset and with the resultset with the clause "where N > 0 AND N <= 3") - I could try to write my own algorithm to do this, but the result would be a buggy useless mess. I would also need to determine when the cached data is stale - the simplest way to do this is to compare the cached data's timestamp with the last-modified timestamp on the source table and replace all of the data if the timestamp has changed, but ideally I'd want to be able to update only the values that have changed with per-row or per-chunk timestamps.
I've started to implement my solution to the question.
In order to simplify the intra-node cache and also to simplify CPU load balancing, I'm using a Cassandra database at each database cluster ("Cassandra node") to run the aggregation jobs (previously I was aggregating the local database resultsets by hand) - I'm using the single Cassandra database for the relational, Cassandra, and MongoDB data (the downside is that some relational queries run slower on Cassandra, but this is made up for by the fact that the single unified aggregation database is easier to maintain than the separate relational and non-relational aggregation databases). I am also no longer aggregating jobs in ten minute epochs since the cache makes this algorithm unnecessary.
Each machine in a node refers to a Cassandra columnfamily called Cassandra_Cache_[MachineID] that is used to store the key_ids and column_ids that it has sent to the Cassandra node. The Cassandra_Cache columnfamily consists of a Table column, a Primary_Key column, a Column_ID column, a Last_Modified_Timestamp column, a Last_Used_Timestamp column, and a composite key consisting of the Table|Primary_Key|Column_ID. The Last_Modified_Timestamp column denotes the datum's last_modified timestamp from the source database, and the Last_Used_Timestamp column denotes the timestamp at which the datum was last used/read by an aggregation job. When the Cassandra node requests data from a machine, the machine calculates the resultset and then takes the set difference of the resultset and the table|key|columns that are in its Cassandra_Cache and that have the same Last_Modified_Timestamp as the rows in its Cassandra_Cache (if the timestamps don't match then the cached data is stale and is updated along with the new Last_Modified_Timestamp). The local machine then sends the set difference to the Cassandra node and updates its Cassandra_Cache with the set difference and updates the Last_Used_Timestamp on each cached datum that was used to compose the resultset. (A simpler alternative to maintaining a separate timestamp for each table|key|column is to maintain a timestamp for each table|key, but this is less precise and the table|key|column timestamp is not overly complex.) Keeping the Last_Used_Timestamps in sync between Cassandra_Caches only requires that the local machines and remote nodes send the Last_Used_Timestamp associated with each job, since all data within a job uses the same Last_Used_Timestamp.
The Cassandra node updates its resultset with the new data that it receives from within the node and also with the data that it receives from the other nodes. The Cassandra node also maintains a columnfamily that stores the same data that is in each machine's Cassandra_Cache (except for the Last_Modified_Timestamp, which is only needed on the local machine to determine when data is stale), along with a source id indicating if the data came from within the within the node or from another node - the id distinguishes between the different nodes, but does not distinguish between the different machines within the local node. (Another option is to use a unified Cassandra_Cache rather than using one Cassandra_Cache per machine plus another Cassandra_Cache for the node, but I decided that the added complexity was not worth the space savings.)
Each Cassandra node also maintains a Federated_Cassandra_Cache, which consists of the {Database, Table, Primary_Key, Column_ID, Last_Used_Timestamp} tuples that have been sent from the local node to one of the other two nodes.
When a job comes through the pipeline, each Cassandra node updates its intra-node cache with the local resultsets, and also completes the sub-jobs that can be performed locally (e.g. in a job to sum data between multiple nodes, each node sums its intra-node data in order to minimize the amount of data that needs to be co-located in the inter-node federation) - a sub-job can be performed locally if it only uses intra-node data. The manager node then determines on which node to perform the rest of the job: each Cassandra node can locally compute the cost of sending its resultset to another node by taking the set difference of its resultset and the subset of the resultset that has been cached according to its Federated_Cassandra_Cache, and the manager node minimizes the cost equation ["cost to transport resultset from NodeX" + "cost to transport resultset from NodeY"]. For example, it costs Node1 {3, 5} to transport its resultset to {Node2, Node3}, it costs Node2 {2, 2} to transport its resultset to {Node1, Node3}, and it costs Node3 {4, 3} to transport its resultset to {Node1, Node2}, therefore the job is run on Node1 with a cost of "6".
I'm using an LRU eviction policy for each Cassandra node; I was originally using an oldest-first eviction policy because it is simpler to implement and requires fewer writes to the Last_Used_Timestamp column (once per datum update instead of once per datum read), but the implementation of an LRU policy turned out not to be overly complex and the Last_Used_Timestamp writes did not create a bottleneck. When a Cassandra node reaches 20% free space it evicts data until it reaches 30% free space, hence each eviction is approximately the size of 10% of the total space available. The node maintains two timestamps: the timestamp of the last-evicted intra-node data, and the timestamp of the last-evicted inter-node / federated data; due to the increased latency of inter-node communication relative to that of intra-node communication, the goal of the eviction policy to have 75% of the cached data be inter-node data and 25% of the cached data be intra-node data, which can be quickly approximated by having 25% of each eviction be inter-node data and 75% of each eviction be intra-node data. Eviction works as follows:
while(evicted_local_data_size < 7.5% of total space available) {
evict local data with Last_Modified_Timestamp <
(last_evicted_local_timestamp += 1 hour)
update evicted_local_data_size with evicted data
}
while(evicted_federated_data_size < 2.5% of total space available) {
evict federated data with Last_Modified_Timestamp <
(last_evicted_federated_timestamp += 1 hour)
update evicted_federated_data_size with evicted data
}
Evicted data is not permanently deleted until eviction acknowledgments have been received from the machines within the node and from the other nodes.
The Cassandra node then sends a notification to the machines within its node indicating what the new last_evicted_local_timestamp is. The local machines update their Cassandra_Caches to reflect the new timestamp, and send a notification to the Cassandra node when this is complete; when the Cassandra node has received notifications from all local machines then it permanently deletes the evicted local data. The Cassandra node also sends a notification to the remote nodes with the new last_evicted_federated_timestamp; the other nodes update their Federated_Cassandra_Caches to reflect the new timestamp, and the Cassandra node permanently deletes the evicted federated data when it receives notifications from each node (the Cassandra node keeps track of which node a piece of data came from, so after receiving an eviction acknowledgment from NodeX the node can permanently delete the evicted NodeX data before receiving an eviction acknowledgment from NodeY). Until all machines/nodes have sent their notifications, the Cassandra node uses the cached evicted data in its queries if it receives a resultset from a machine/node that has not evicted its old data. For example, the Cassandra node has a local Table|Primary_Key|Column_ID datum that it has evicted, and meanwhile a local machine (which has not processed the eviction request) has not included the Table|Primary_Key|Column_ID datum in its resultset because it thinks that the Cassandra node already has the datum in its cache; the Cassandra node receives the resultset from the local machine, and because the local machine has not acknowledged the eviction request the Cassandra node includes the cached evicted datum in its own resultset.

DB optimization to use it as a queue

We have a table called worktable which has some columns(key (primary key), ptime, aname, status, content).
We have something called producer which puts in rows in this table and we have consumer which does an order-by on the key column and fetches the first row which has status 'pending'. The consumer does some processing on this row:
updates status to "processing"
does some processing using content
deletes the row
We are facing contention issues when we try to run multiple consumers (probably due to the order-by which does a full table scan).
Using advanced queues would be our next step but before we go there we want to check what the max throughput is that we can achieve with multiple consumers and producer on the table.
What are the optimizations we can do to get the best numbers possible?
Can we do an in-memory processing where a consumer fetches 1000 rows at a time processes and deletes? will that improve? What are other possibilities? partitioning of table? parallelization? Index organized tables?...
The possible optimizations depend a lot on the database used, but a pretty general approach would be to create an index that covers all fields needed to select the correct rows (it sounds like that would be the key and the status in this case). If the index is created correctly (some database need the correct order of the key elements, others don't), then the query should be much faster.

Processing a database queue across multiple threads - design advice

I have a SQL Server table full of orders that my program needs to "follow up" on (call a webservice to see if something has been done with them). My application is multi-threaded, and could have instances running on multiple servers. Currently, every so often (on a Threading timer), the process selects 100 rows, at random (ORDER BY NEWID()), from the list of "unconfirmed" orders and checks them, marking off any that come back successfully.
The problem is that there's a lot of overlap between the threads, and between the different processes, and their's no guarantee that a new order will get checked any time soon. Also, some orders will never be "confirmed" and are dead, which means that they get in the way of orders that need to be confirmed, slowing the process down if I keep selecting them over and over.
What I'd prefer is that all outstanding orders get checked, systematically. I can think of two easy ways do this:
The application fetches one order to check at a time, passing in the last order it checked as a parameter, and SQL Server hands back the next order that's unconfirmed. More database calls, but this ensures that every order is checked in a reasonable timeframe. However, different servers may re-check the same order in succession, needlessly.
The SQL Server keeps track of the last order it asked a process to check up on, maybe in a table, and gives a unique order to every request, incrementing its counter. This involves storing the last order somewhere in SQL, which I wanted to avoid, but it also ensures that threads won't needlessly check the same orders at the same time
Are there any other ideas I'm missing? Does this even make sense? Let me know if I need some clarification.
RESULT:
What I ended up doing was adding a LastCheckedForConfirmation column to my table with finished orders in it, and I added a stored procedure that updates a single, Unconfirmed row with GETDATE() and kicks out the order number so my process can check on it. It spins up as many of these as it can (given the number of threads the process is willing to run), and uses the stored procedure to get a new OrderNumber for each thread.
To handle the "Don't try rows too many times or when they're too old" problem, I did this: The SP will only return a row if "Time since last try" > "Time between creation and last try", so each time it will take twice as long before it tries again - first it waits 5 seconds, then 10, then 20, 40, 80, 120, and then after it's tried 15 times (6 hours), it gives up on that order and the SP will never return it again.
Thanks for the help, everybody - I knew the way I was doing it was less than ideal, and I appreciate your pointers in the right direction.
I recommend read and internalize Using tables as Queues.
If you use the data as a queue, you must organize it properly for queuing operations. The article I linked goes into details about how to do this, what you have is a variant of a Pending Queue.
One thing you must absolutely get rid of is the randomness. If there is one thing that is hard to reproduce in a query, is randomness. ORDER BY NEWID() will scan every row, generate a guid, then SORT, and then give you back top 100. You cannot, under any circumstances, have every worker thread scan the entire table every time, you'll kill the server as the number of unprocessed entries grows.
Instead use pending processing date. Have the queue be organized (clustered) by processing date column (when the item is due for retry) and dequeue using the techniques I show in my linked article. If you want to retry, the dequeue should postpone the item instead of deleting it, ie. WITH (...) UPDATE SET due_date = dateadd(day, 1, getutcdate()) ...
The obvious way would be to add a column LastCheckDt to the order. In each thread, retrieve the order that has gone for the longest time without checking. In the procedure that retrieves the order, update the LastCheckDt field.
I wouldn't retrieve 100 orders at once, there is a risk of the 50th order changing in the database before your thread reaches it. Get one order, and when done, get the next one.
In addition, I'd initially develop the process without multi-threading. Checking an open order is usually fast enough to be done sequentially.
One strategy you might want to Consider is a table like this;
JobID bigint PK not null, WorkerID int/nvarchar(max) null
Where worker is the id/name of the server that is processing it, or null if nobody has picked up the job. When a server picks up a job, it puts its own id/name into that column which indicates to others not to pick up the job.
One problem is that it is possible that the server working a job crashes, making the job never complete. You could add a date column that would represent the timeout, which is set when the worker picks up the job to now + some time span that you decide is appropriate.
EDIT: Forgot to mention, you will either need to delete to job when it is complete, or have a status field to indicate completion. An additional field could indicate parameters for the job to make your job table generic: ie. don't just make a solution for your orders, make a job manager that can process anything you will need in the future.

Resources