I have a SQL Server table full of orders that my program needs to "follow up" on (call a webservice to see if something has been done with them). My application is multi-threaded, and could have instances running on multiple servers. Currently, every so often (on a Threading timer), the process selects 100 rows, at random (ORDER BY NEWID()), from the list of "unconfirmed" orders and checks them, marking off any that come back successfully.
The problem is that there's a lot of overlap between the threads and between the different processes, and there's no guarantee that a new order will get checked any time soon. Also, some orders will never be "confirmed" and are dead, which means they get in the way of orders that need to be confirmed, slowing the process down if I keep selecting them over and over.
What I'd prefer is that all outstanding orders get checked systematically. I can think of two easy ways to do this:
The application fetches one order to check at a time, passing in the last order it checked as a parameter, and SQL Server hands back the next order that's unconfirmed. More database calls, but this ensures that every order is checked in a reasonable timeframe. However, different servers may re-check the same order in succession, needlessly.
The SQL Server keeps track of the last order it asked a process to check up on, maybe in a table, and gives a unique order to every request, incrementing its counter. This involves storing the last order somewhere in SQL, which I wanted to avoid, but it also ensures that threads won't needlessly check the same orders at the same time.
Are there any other ideas I'm missing? Does this even make sense? Let me know if I need some clarification.
RESULT:
What I ended up doing was adding a LastCheckedForConfirmation column to my table with finished orders in it, and I added a stored procedure that updates a single, Unconfirmed row with GETDATE() and kicks out the order number so my process can check on it. It spins up as many of these as it can (given the number of threads the process is willing to run), and uses the stored procedure to get a new OrderNumber for each thread.
To handle the "Don't try rows too many times or when they're too old" problem, I did this: The SP will only return a row if "Time since last try" > "Time between creation and last try", so each time it will take twice as long before it tries again - first it waits 5 seconds, then 10, then 20, 40, 80, 120, and then after it's tried 15 times (6 hours), it gives up on that order and the SP will never return it again.
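For anyone following along, here is a minimal sketch of what such a claiming procedure could look like, assuming an Orders table with OrderNumber, CreatedDate, LastCheckedForConfirmation and Confirmed columns (illustrative names, not the actual schema) and omitting the 15-try give-up cap:

CREATE PROCEDURE dbo.ClaimNextUnconfirmedOrder
AS
BEGIN
    SET NOCOUNT ON;

    -- Atomically claim one eligible row: stamp it with the current time and
    -- return its order number, skipping rows other threads have locked.
    UPDATE TOP (1) o
    SET    o.LastCheckedForConfirmation = GETDATE()
    OUTPUT inserted.OrderNumber
    FROM   dbo.Orders AS o WITH (ROWLOCK, UPDLOCK, READPAST)
    WHERE  o.Confirmed = 0
      AND (o.LastCheckedForConfirmation IS NULL
           -- back-off: wait at least as long as the gap between creation and
           -- the previous attempt, so the retry interval roughly doubles
           OR DATEDIFF(SECOND, o.LastCheckedForConfirmation, GETDATE())
                > DATEDIFF(SECOND, o.CreatedDate, o.LastCheckedForConfirmation));
END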
Thanks for the help, everybody - I knew the way I was doing it was less than ideal, and I appreciate your pointers in the right direction.
I recommend reading and internalizing Using tables as Queues.
If you use the data as a queue, you must organize it properly for queuing operations. The article I linked goes into the details of how to do this; what you have is a variant of a Pending Queue.
One thing you must absolutely get rid of is the randomness. If there is one thing that is hard to do efficiently in a query, it is randomness. ORDER BY NEWID() will scan every row, generate a guid, SORT, and then give you back the top 100. You cannot, under any circumstances, have every worker thread scan the entire table every time; you'll kill the server as the number of unprocessed entries grows.
Instead, use a pending processing date. Have the queue be organized (clustered) by a processing date column (when the item is due for retry) and dequeue using the techniques I show in my linked article. If you want to retry, the dequeue should postpone the item instead of deleting it, i.e. WITH (...) UPDATE SET due_date = dateadd(day, 1, getutcdate()) ...
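For illustration, a postponing dequeue along those lines might look like this (the table and column names are assumptions, and the queue is presumed clustered on due_date):

UPDATE TOP (1) q
SET    q.due_date = DATEADD(DAY, 1, GETUTCDATE())   -- postpone instead of delete
OUTPUT inserted.item_id, inserted.payload
FROM   dbo.PendingQueue AS q WITH (ROWLOCK, UPDLOCK, READPAST)
WHERE  q.due_date <= GETUTCDATE();                   -- only items due for a retry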
The obvious way would be to add a column LastCheckDt to the order. In each thread, retrieve the order that has gone for the longest time without checking. In the procedure that retrieves the order, update the LastCheckDt field.
I wouldn't retrieve 100 orders at once; there is a risk of the 50th order changing in the database before your thread reaches it. Get one order, and when done, get the next one.
In addition, I'd initially develop the process without multi-threading. Checking an open order is usually fast enough to be done sequentially.
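A hedged sketch of that retrieval, assuming an Orders table with OrderId, IsOpen and LastCheckDt columns (illustrative names only):

;WITH nextOrder AS (
    SELECT TOP (1) *
    FROM   dbo.Orders WITH (UPDLOCK, READPAST, ROWLOCK)
    WHERE  IsOpen = 1
    ORDER BY LastCheckDt ASC            -- longest time without a check first
)
UPDATE nextOrder
SET    LastCheckDt = GETDATE()          -- stamp the check time as part of the claim
OUTPUT inserted.OrderId;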
One strategy you might want to consider is a table like this:
JobID bigint PK not null
WorkerID int/nvarchar(max) null
Where worker is the id/name of the server that is processing it, or null if nobody has picked up the job. When a server picks up a job, it puts its own id/name into that column which indicates to others not to pick up the job.
One problem is that the server working a job might crash, leaving the job incomplete forever. You could add a date column representing a timeout, set when the worker picks up the job to now plus some time span that you decide is appropriate.
EDIT: Forgot to mention, you will either need to delete the job when it is complete, or have a status field to indicate completion. An additional field could hold parameters for the job to make your job table generic: i.e. don't just build a solution for your orders, build a job manager that can process anything you will need in the future.
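Illustrative DDL for such a job table (column names and types are assumptions, not a prescribed schema):

CREATE TABLE dbo.Jobs (
    JobID      bigint         NOT NULL PRIMARY KEY,
    WorkerID   nvarchar(128)  NULL,               -- NULL = nobody has picked up the job
    TimeoutAt  datetime2      NULL,               -- reclaim the job if this time passes
    Status     tinyint        NOT NULL DEFAULT 0, -- 0 = pending, 1 = in progress, 2 = complete
    Parameters nvarchar(max)  NULL                -- generic payload describing the work
);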
Related
Goal: I have a table which handles the status of a device. Whenever I don't receive the status from it for one hour or more, I want to get notified once.
The device inserts a "heartbeat" record in the table with a timestamp (NOTE: I have to stick with this implementation).
In order to get notified for any changes, I'm using a service with a queue (which is read by another program).
What I've tried: I made a job which runs every 1 minute. It:
Looks at the last heartbeat in the table
If the timestamp is one hour ago or before:
Looks in another table which stores if I have already written the notification on the queue or not
If not, writes in the queue
Else:
Looks in another table which stores if I have already written the notification on the queue or not
If yes, resets the value to no
Issues with that: I'm concerned about my technique for executing some code when no record is inserted in a while:
I feel like there is a built-in (and better) way to solve this kind of problem, but I can't find it;
It doesn't notify me as soon as possible (in the worst case, after 1 minute). This could be solved by reducing the schedule interval, but I don't know whether that would hurt performance. I want the lightest solution possible;
I don't like having to use a helper table, so I would want to remove it;
I would prefer not to use jobs if possible (I'm using a VS DB project and I would like to remove the post-deployment script).
Below is the structure of my table:
Table:
UUID (key) - let's call this **EntryKey**
HistoryLog - this is also the version number
Map<UUID, BYTE> value - let's call the map's UUID key **EntryChildKey**
version - for **optimistic locking**
Let's assume the map has around 10k entries, mapping a UUID to some value.
So, my problem is that once in a while I get requests to update all 10k EntryChildKey (map) values, and all of these requests hit the database at the same time. Because every request touches the same EntryKey row, I run into a lot of concurrency errors: the version gets updated every time, I have to retry, all the EntryChildKey updates thrash each other, and DynamoDB ends up throttling my requests.
I could get out of this problem by splitting the data into two tables as below, but I have to maintain the HistoryLog version at the EntryKey level, and there are some other problems, so I can't take this route:
Table1:
UUID EntryChildKey
BYTE value

Table2:
UUID EntryKey
List<UUID> EntryChildKey
So, another approach I am thinking of is something like a write-ahead log, where I'd update the version and record the intent to update the table, but not update the record itself; instead, I'd keep the intents as a list in a table and then apply the EntryChildKey updates sequentially. But I don't know whether there is something like this, or similar, that I can do with DynamoDB.
I'd also appreciate any other approach that could help solve this problem.
If you really do need a version attribute to be updated on a single key each time any one of the 10k EntryChild items is updated, then your only option is to decouple the table from the update source.
DynamoDB has a hard limit of up to 1000 writes/second to any item at all times. There is simply no way to increase that, for a single item. It doesn't matter what size table you have, how many partitions, or how much total write capacity you allocate to your table, a single item will never be able to be updated more than 1000 times per second.
So, if your requirement to update an attribute (the HistoryLog in your example) on the "master" entry item is really firm, then to use DynamoDB your best bet is to introduce a queue and batching to pre-process the updates before writing to Dynamo.
You could create an SQS queue and use a lambda function to read from the queue and write to Dynamo.
In a naive approach, you could simply read from the queue and then write to the table as fast as DynamoDB's throttling allows. For 10k updates to the same "master" key this will take at least 10 seconds, though in reality it will likely take longer.
A better option though, would be to run the lambda on a schedule, say once a second, and have it read all the messages available in the queue and combine all updates to the same "master" key into a single update. That way, you only write to the same item at most once every second.
The big challenge with a normal SQS queue is that it does not offer exactly-once semantics, meaning some items in the queue will be received multiple times. If you can design a system where you can safely discard duplicate updates, then this approach will work wonderfully. If not, then things get more complicated.
I'm developing a web application which displays a list of, let's say, "threads". The list can be sorted by the number of likes a thread has. There can be thousands of threads in one list.
The application needs to work in a scenario where the likes of a thread can change more than 10x in a second. The application furthermore is distributed over multiple servers.
I can't figure out an efficient way to enable paging for this sort of list, and I can't transmit the whole list, sorted by likes, to a user at once.
As soon as a user goes to page 2 of this list, it has likely changed and may contain threads already listed on page one.
Solutions which don't work:
Storing the seen threads on the client side (could be too many on mobile)
Storing the seen threads on the Server side (too many users and threads)
Snapshot the list in temp database table (it's too frequent changing data and it need to be actual)
(If it matters I'm using MongoDB+c#)
How would you solve this kind of problem?
Interesting question. Unless I'm misunderstanding you, and by all means let me know if I am, it sounds like the best solution would be to implement a system that, instead of page numbers, uses timestamps. It would be similar to what many of the main APIs already do. I know Tumblr even does this on the dashboard, where this is, of course, not an unreasonable case: there can be tons of posts added in a small amount of time at peak hours, depending on how many people the user follows.
So basically, your "next page" button could just link to /threads/threadindex/1407051000, which could translate to "all the threads that were created before 2014-08-02 17:30". That makes your query super easy to implement. Then, when you pull down the next elements, you just look for anything that occurred before the last element on the current page.
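In SQL terms (the question uses MongoDB, so this is only an illustrative sketch with assumed table and column names), the "next page" query boils down to keyset pagination on the timestamp:

SELECT TOP (20) ThreadId, Title, Score, Created
FROM   Threads
WHERE  Created < @lastCreatedOnPreviousPage   -- timestamp of the last thread already shown
ORDER BY Created DESC;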
The downfall of this, of course, is that it's hard to know how many new elements have been added since the user started browsing, but you could always log the start time and know anything since then would be new. And it's also difficult for users to type in their own pages, but that's not a problem in most applications. You also need to store the timestamps for every record in your thread, but that's probably already being done, and if it's not then it's certainly not hard to implement. You'll be paying the cost of something like eight bytes extra per record, but that's better than having to store anything about "seen" posts.
It's also nice because, and again this might not apply to you, but a user could bookmark a page in the list, and it would last unchanged forever since it's not relative to anything else.
This is typically handled using an OLAP cube. The idea here is that you add a natural time dimension. An OLAP cube may be too heavy for this application, but here's a summary in case someone else needs it.
OLAP cubes start with the fundamental concept of time. You have to know what time you care about to be able to make sense of the data.
You start off with a "Time" table:
Time {
timestamp long (PK)
created datetime
last_queried datetime
}
This basically tracks snapshots of your data. I've included a last_queried field. This should be updated with the current time any time a user asks for data based on this specific timestamp.
Now we can start talking about "Threads":
Threads {
id long (PK)
identifier long
last_modified datetime
title string
body string
score int
}
The id field is an auto-incrementing key; this is never exposed. identifier is the "unique" id for your thread. I say "unique" because there's no unique-ness constraint, and as far as the database is concerned it is not unique. Everything else in there is pretty standard... except... when you do writes you do not update this entry. In OLAP cubes you almost never modify data. Updates and inserts are explained at the end.
Now, how do we query this? You can't just directly query Threads. You need to include a star table:
ThreadStar {
timestamp long (FK -> Time.timestamp)
thread_id long (FK -> Threads.id)
thread_identifier long (matches Threads[thread_id].identifier)
(timestamp, thread_identifier should be unique)
}
This table gives you a mapping from what time it is to what the state of all of the threads are. Given a specific timestamp you can get the state of a Thread by doing:
SELECT Thread.*
FROM Thread
JOIN ThreadStar ON Thread.id = ThreadStar.thread_id
WHERE ThreadStar.timestamp = {timestamp}
AND Thread.identifier = {thread_identifier}
That's not too bad. How do we get a stream of threads? First we need to know what time it is. Basically you want to get the largest timestamp from Time and update Time.last_queried to the current time. You can throw a cache up in front of that that only updates every few seconds, or whatever you want. Once you have that you can get all threads:
SELECT Thread.*
FROM Thread
JOIN ThreadStar ON Thread.id = ThreadStar.thread_id
WHERE ThreadStar.timestamp = {timestamp}
ORDER BY Thread.score DESC
Nice. We've got a list of threads and the ordering is stable as the actual scores change. You can page through this at your leisure... kind of. Eventually data will be cleaned up and you'll lose your snapshot.
So this is great and all, but now you need to create or update a Thread. Creation and modification are almost identical. Both are handled with an INSERT, the only difference is whether you use an existing identifier or create a new one.
So now you've inserted a new Thread. You need to update ThreadStar. This is the crazy expensive part. Basically you make a copy of all of the ThreadStar entries with the most recent timestamp, except you update the thread_id for the Thread you just modified. That's a crazy amount of duplication. Fortunately it's pretty much only foreign keys, but still.
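As a hedged sketch of that copy step (the placeholder values in braces are assumptions about what the application already knows at this point):

INSERT INTO ThreadStar (timestamp, thread_id, thread_identifier)
SELECT {new_timestamp},
       -- point the modified identifier at the freshly inserted Thread row,
       -- carry every other mapping forward unchanged
       CASE WHEN ts.thread_identifier = {modified_identifier}
            THEN {new_thread_id}
            ELSE ts.thread_id END,
       ts.thread_identifier
FROM   ThreadStar AS ts
WHERE  ts.timestamp = {previous_timestamp};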
You don't do DELETEs either; mark a row as deleted or just exclude it when you update ThreadStar.
Now you're humming along, but you've got crazy amounts of data growing. You'll probably want to clean it out, unless you've got a lot of storage budget, but even then things will start slowing down (aside: this will actually perform shockingly well, even with crazy amounts of data).
Cleanup is pretty straightforward. It's just a matter of some cascading deletes and scrubbing for orphaned data. Delete entries from Time whenever you want (e.g. it's not the latest entry and last_queried is null or older than whatever cutoff). Cascade those deletes to ThreadStar. Then find any Threads with an id that isn't in ThreadStar and scrub those.
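A rough sketch of that cleanup, reusing the tables above ({cutoff} is whatever retention window you choose):

DELETE FROM Time
WHERE  timestamp <> (SELECT MAX(timestamp) FROM Time)
  AND  (last_queried IS NULL OR last_queried < {cutoff});

DELETE FROM ThreadStar
WHERE  timestamp NOT IN (SELECT timestamp FROM Time);   -- cascade the snapshot delete

DELETE FROM Threads
WHERE  id NOT IN (SELECT thread_id FROM ThreadStar);    -- scrub orphaned thread rows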
This general mechanism also works if you have more nested data, but your queries get harder.
Final note: you'll find that your inserts get really slow because of the sheer amount of data. Most places build this with appropriate constraints in development and testing environments, but then disable constraints in production!
Yeah. Make sure your tests are solid.
But at least you aren't sensitive to re-ordered data mid-paging.
For constantly changing data such as likes I would use a two-stage approach. For the frequently changing data I would use an in-memory DB to keep up with the change rate and flush it periodically to the "real" db.
Once you have that, the query for constantly changing data is easy:
1. Query the db.
2. Query the in-memory db.
3. Merge the frequently changed data from the in-memory db with the "slow" db data.
4. Remember which results you have already displayed, so pressing the next button will not display an already displayed value twice on different pages because its rank has changed.
If many people look at the same data it might help to cache the results of step 3 to reduce the load on the real db even further.
Your current architecture has no caching layers (the bigger the site the more things are cached). You will not get away with a simple DB and efficient queries against the db if things become too massive.
I would cache all 'thread' results on the server when the user first hits the database. Then return the first page of data to the user, and for each subsequent next-page call return cached results.
To minimize memory usage you can cache only the record IDs and fetch the full data when the user requests it.
The cache can be evicted each time the user exits the current page. If it isn't a ton of data I would stick with this solution, because the user won't get annoyed by data constantly changing.
I’m building a system that generates “work items” that are queued up for back-end processing. I recently completed a system that had the same requirements and came up with an architecture that I don’t feel is optimal and was hoping for some advice for this new system.
Work items are queued up centrally and need to be processed in an essentially FIFO order. If this were the only requirement, then I would probably favor an MSMQ or SQL Server Service Broker solution. However, in reality, I need to select work items in a modified FIFO order. A work item has several attributes, and they need to be assigned in FIFO order where certain combinations of attribute values exist.
As an example, a work item may have the following attributes: Office, Priority, Group Number and Sequence Number (within group). When multiple items are queued for the same Group Number, they are guaranteed to be queued in Sequence Number order and will have the same priority.
There are several back-end processes (currently implemented as Windows Services) that pull work items in modified FIFO order given certain configuration parameters for the given service. The service running in Washington, DC is configured to process only work items for DC, while the service in NY may be configured to process both NY and DC items (mainly to increase overall throughput). In addition to this type of selectivity, higher-priority items should be processed first, and items that share the same "Group Number" must be processed in Sequence Number order. So if the NY service is working on a DC item in group 100 with sequence 1, I don't want the DC service to pull off the DC item in group 100 sequence 2, because sequence 1 is not yet complete. Items in other groups should remain eligible for processing.
In the last system, I implemented the queues with SQL tables. I created stored procedures to submit items and, more importantly, to “assign” items to the Windows Services that were responsible for processing them. The assignment stored procedures contain the selection logic I described above. Each Windows Service would call the assignment stored procedure, passing it the parameters that were unique to that instance of the service (e.g. the eligible offices). This assignment stored procedure stamps the work item as assigned (in process) and when the work is complete, a final stored procedure is called to remove the item from the “queue” (table).
This solution does have some advantages in that I can quickly examine the state of these “queues” by a simple SQL select statement. I’m also able to manipulate the queues easily (e.g. I can bump priorities with a simple SQL update statement). However, on the downside, I occasionally have to deal with deadlocks on these queue tables and have the burden of writing these stored procedures (which gets tedious after a while).
Somehow I think that either MSMQ (with or without WCS) or Service Broker should be able to provide a more elegant solution. Rolling my own queuing/work-item-processing system just feels wrong. But as far as I know, these technologies don’t offer the flexibility that I need in the assignment process. I am hoping that I am wrong. Any advice would be welcome.
It seems to me that your concept of an atomic unit of work is a Group. So I would suggest that you only queue up a message that identified a Group Id, and then your worker will have to go to a table that maps Group Id to 1 or more Work Items.
You can handle your other problems by using more than one queue - NY-High, NY-Low, DC-High, DC-Low, etc.
In all honesty, though, I think you are better served to fix your deadlock issues in your current architecture. You should be reading the TOP 1 message from your queue table with Update Lock and Read Past hints, ordered by your priority logic and whatever filter criteria you want (Office/Location). Then you process your one message and change its status or move it to another table. You should be able to call that stored procedure in parallel without a deadlock issue.
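A sketch of that dequeue, assuming a WorkItems table with WorkItemId, Status, Office, Priority, GroupNumber and SequenceNumber columns (illustrative names; the NOT EXISTS clause is one possible way to express the group-ordering constraint from the question):

;WITH nextItem AS (
    SELECT TOP (1) wi.*
    FROM   dbo.WorkItems AS wi WITH (UPDLOCK, READPAST, ROWLOCK)
    WHERE  wi.Status = 'Queued'
      AND  wi.Office IN ('NY', 'DC')            -- this service's eligible offices
      AND  NOT EXISTS (                         -- skip if an earlier item in the group is still pending
               SELECT 1
               FROM   dbo.WorkItems AS prev
               WHERE  prev.GroupNumber    = wi.GroupNumber
                 AND  prev.SequenceNumber < wi.SequenceNumber
                 AND  prev.Status         <> 'Complete')
    ORDER BY wi.Priority DESC, wi.GroupNumber, wi.SequenceNumber
)
UPDATE nextItem
SET    Status = 'InProcess'
OUTPUT inserted.WorkItemId;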
Queues are for FIFO order, not random access order. Even though you are saying that you want FIFO order, you want FIFO order with respect to a random set of variables, which is essentially random order. If you want to use queues, you need to be able to determine order before the message goes in the queue, not after it goes in.
I'm trying to find if there is a reliable way (using SQLite) to find the ID of the next row to be inserted, before it gets inserted. I need to use the id for another insert statement, but don't have the option of instantly inserting and getting the next row.
Is predicting the next id as simple as getting the last id and adding one? Is that a guarantee?
Edit: A little more reasoning...
I can't insert immediately because the insert may end up being canceled by the user. User will make some changes, SQL statements will be stored, and from there the user can either save (inserting all the rows at once), or cancel (not changing anything). In the case of a program crash, the desired functionality is that nothing gets changed.
Try SELECT * FROM SQLITE_SEQUENCE WHERE name='TABLE';. This will contain a field called seq which is the largest number for the selected table. Add 1 to this value to get the next ID.
Also see the SQLite Autoincrement article, which is where the above info came from.
Cheers!
Either scrapping or committing a series of database operations all at once is exactly what transactions are for. Query BEGIN; before the user starts fiddling and COMMIT; once he/she's done. You're guaranteed that either all the changes are applied (if you commit) or everything is scrapped (if you query ROLLBACK;, if the program crashes, power goes out, etc). Once you read from the db, you're also guaranteed that the data is good until the end of the transaction, so you can grab MAX(id) or whatever you want without worrying about race conditions.
http://www.sqlite.org/lang_transaction.html
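A minimal SQLite sketch of that, with illustrative table names; everything between BEGIN and COMMIT is applied atomically, and a ROLLBACK (or a crash) discards it all:

BEGIN;
INSERT INTO orders (description) VALUES ('first row');
-- last_insert_rowid() is per-connection, so it refers to the row inserted above
INSERT INTO order_details (order_id, detail)
VALUES (last_insert_rowid(), 'detail for the row above');
COMMIT;   -- or ROLLBACK; to scrap everything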
You can probably get away with adding 1 to the value returned by sqlite3_last_insert_rowid under certain conditions, for example, using the same database connection and there are no other concurrent writers. Of course, you may refer to the sqlite source code to back up these assumptions.
However, you might also seriously consider using a different approach that doesn't require predicting the next ID. Even if you get it right for the version of sqlite you're using, things could change in the future and it will certainly make moving to a different database more difficult.
Insert the row with an INVALID flag of some kind, get the ID, edit it as needed, then delete it if necessary or mark it as valid. That, and don't worry about gaps in the sequence.
BTW, you will need to figure out how to do the invalid part yourself. Marking something as NULL might work depending on the specifics.
Edit: If you can, use Eevee's suggestion of using proper transactions. It's a lot less work.
I realize your application using SQLite is small and SQLite has its own semantics. Other solutions posted here may well have the effect that you want in this specific setting, but in my view every single one of them I have read so far is fundamentally incorrect and should be avoided.
In a normal environment holding a transaction for user input should be avoided at all costs. The way to handle this, if you need to store intermediate data, is to write the information to a scratch table for this purpose and then attempt to write all of the information in an atomic transaction. Holding transactions invites deadlocks and concurrency nightmares in a multi-user environment.
In most environments you cannot assume data retrieved via SELECT within a transaction is repeatable. For example
SELECT Balance FROM Bank ...
UPDATE Bank SET Balance = valuefromselect + 1.00 WHERE ...
By the time the UPDATE runs, the value of Balance may well have changed. Sometimes you can get around this by first updating, within the transaction, the row(s) in Bank you're interested in, as this is guaranteed to lock the row, preventing further updates from changing its value until your transaction has completed.
However, sometimes a better way to ensure consistency in this case is to check your assumptions about the contents of the data in the WHERE clause of the update and check row count in the application. In the example above when you "UPDATE Bank" the WHERE clause should provide the expected current value of balance:
WHERE Balance = valuefromselect
If the expected balance no longer matches, neither does the WHERE condition -- the UPDATE does nothing and the rowcount returns 0. This tells you there was a concurrency issue, and you need to rerun the operation when something else isn't trying to change your data at the same time.
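Putting the two pieces together as a single guarded update (a sketch only; Bank, Balance and the parameter names are the illustrative names used above, and AccountId is an assumed key column):

UPDATE Bank
SET    Balance = :valuefromselect + 1.00
WHERE  AccountId = :accountid
  AND  Balance   = :valuefromselect;  -- 0 rows affected means someone changed it first; retry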
select max(id) from particular_table is unreliable, for the reason below:
http://www.sqlite.org/autoinc.html
"The normal ROWID selection algorithm described above will generate monotonically increasing unique ROWIDs as long as you never use the maximum ROWID value and you never delete the entry in the table with the largest ROWID. If you ever delete rows or if you ever create a row with the maximum possible ROWID, then ROWIDs from previously deleted rows might be reused when creating new rows and newly created ROWIDs might not be in strictly ascending order."
I think this can't be done, because there is no way to be sure that nothing will get inserted between you asking and you inserting (you might be able to lock the table against inserts, but yuck).
BTW, I've only used MySQL, but I don't think that will make any difference.
Most likely you should be able to +1 the most recent ID. I would look at all (going back a while) of the existing IDs in the ordered table. Are they consistent, and is each row's ID one more than the last? If so, you'll probably be fine. I'd leave a comment in the code explaining the assumption, however. Taking a lock will help guarantee that you're not getting additional rows while you do this as well.
Select the last_insert_rowid() value.
Most of what needs to be said in this topic already has been... However, be very careful of race conditions when doing this. If two people both open your application/webpage/whatever, and one of them adds a row, the other user will try to insert a row with the same ID and you will have lots of issues.
select max(id) from particular_table;
The next id will be +1 from the maximum id.