To make one complete ring I need three segments. Segments C1A, C1B and C1C should be batched to C1.
After batching, I want to separate the rings based on their names (String) in the select output block and send them to their respective delay processes.
My question is how to create a relation between unbatched and batched rings in order to complete this whole process.
Thanks in advance.
My first question is why you want to batch them if they get processed separately. See my example below: I think that if you just release all three segments at the same time, you do not need to batch them.
You will need to add a wait block (this can also be done with a queue) and then wait until you have enough segments of the correct batch (assuming the batch is denoted by the first 2 letters of a segment name). See code below.
This will check if there are 3 segments with the same first two letters in their name and release them from the wait block.
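The release rule can be sketched in plain Python (this is not AnyLogic code; the function name `release_complete_batches` and the sample segment names are made up for illustration, and it assumes the first two letters of a segment name identify its ring):

```python
from collections import defaultdict


def release_complete_batches(waiting, batch_size=3):
    """Split waiting segment names into complete batches and leftovers.

    Segments are grouped by the first two letters of their name; a group is
    released only once it holds at least `batch_size` segments.
    """
    by_prefix = defaultdict(list)
    for name in waiting:
        by_prefix[name[:2]].append(name)

    released, still_waiting = [], []
    for prefix, names in by_prefix.items():
        if len(names) >= batch_size:
            released.append((prefix, names[:batch_size]))
            still_waiting.extend(names[batch_size:])
        else:
            still_waiting.extend(names)
    return released, still_waiting


# Hypothetical content of the wait block: ring C1 is complete, ring C2 is not.
waiting = ["C1A", "C2A", "C1B", "C1C"]
released, still_waiting = release_complete_batches(waiting)
```

The prefix returned with each released batch is the value that would become the name of the batched ring.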
Then you can do what you want with them. If you want to batch, set the batch size to 3 (based on your example) so that every 3 units that get released (and they will be released all at once) are batched into a new agent called Ring.
We store the first two letters of the segment name, which will be the same for all segments being batched and use them to set the new ring name.
You can then use the Ring name to decide which machine they need to go to.
Since we did not select the "Permanent batch" option in the batch block, we can simply unbatch them again.
But I think, based on your screenshot, you might just want to let them go through the system with no need for batching. I am not sure.
In my architecture, DML commands are queued into Kafka. The Storm topology comprises a single Spout and 3 Solr Bolts. The DML commands get distributed among these 3 Bolts.
My problem is how to handle the case where the order of commands gets shuffled by the Solr Bolts. For example, the sequence of commands is:
1. Insert record A with value 50.
2. Insert record B with value x.
3. Update record A to value 20.
4. Insert record C with value y.
5. Update record A to value 100.
and so on
In the above case, what if command 5 gets executed by one Bolt before command 3 gets executed by another Bolt? This can happen if Bolt 3 picks up and executes the 5th command before Bolt 2 executes command 3.
If I understand you correctly, you have a single spout (with dop=1) and a single bolt (with dop=3) that gets the data from the spout via shuffle grouping. If dependent commands are shuffled to different bolt executors, there is no way to get them executed in the correct order.
However, if you have a series of commands that depend on each other, you can use fieldsGrouping to ensure that all of those commands go to the same executor. In this case, the order is guaranteed to be preserved. To accomplish this, you just add an attribute to the spout output tuples (a counter) and field-group on this attribute. For consecutive dependent commands, you do not modify the counter. When a series of dependent commands is finished, you increase the counter by one (this means that different command series can be spread across different bolt executors, i.e., load balancing). The tricky part, I guess, is knowing when a series of dependent commands is finished. Hope this helps.
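As a rough illustration of why grouping on a series attribute preserves order, here is a small Python simulation (it is not Storm code). The `route` function and the modulo rule are stand-ins for a fields grouping, and the series ids are assigned by hand; how the spout decides that a series has ended is left open, as noted above.

```python
from collections import defaultdict


def route(tuples, num_executors):
    """Send every tuple with the same series id to the same executor queue.

    The modulo is only a stand-in for the hash-based partitioning that a
    fields grouping performs.
    """
    queues = defaultdict(list)
    for series_id, command in tuples:
        queues[series_id % num_executors].append(command)
    return dict(queues)


# Hand-assigned series ids: all commands touching record A share id 0.
tuples = [
    (0, "insert A=50"),
    (1, "insert B=x"),
    (0, "update A=20"),
    (2, "insert C=y"),
    (0, "update A=100"),
]
queues = route(tuples, num_executors=3)
```

Each executor sees its own commands in arrival order, so the three commands on record A cannot overtake each other.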
We are trying to build a system with multiple instances of a service on different machines that share the load of processing.
Each of these will check a table; if there are rows to be processed in that table, it will pick the first, mark it as processing, process it, then mark it as done. Rinse, repeat.
What is the best way to prevent a race condition where 2 instances A and B do the following:
A (1) reads the table, finds row 1 to process,
B (1) reads the table, finds row 1 to process,
A (2) marks the row as processing
B (2) marks the row as processing
In a single app we could use locks or mutexes.
I can just put A1 and A2 in a single transaction. Is it that simple, or is there a better, faster way to do this?
Should I just turn it on its head so that the steps are:
A (1) Mark the next row as mine to process
A (2) Return it to me for processing.
I figure this has to have been solved many times before, so I'm looking for the "standard" solutions, and if there are more than one, benefits and disadvantages.
Transactions are a nice simple answer, with two possible drawbacks:
1) You might want to check with the fine print of your database. Sometimes the default consistency settings don't guarantee absolute consistency in every possible circumstance.
2) Sometimes the pattern of accesses associated with using a database to queue and distribute work is hard on a database that isn't expecting it.
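The "mark it as mine first, then process it" idea from the question can be sketched with a guarded UPDATE. This is a minimal sketch using Python's built-in sqlite3 module; the `jobs` table, its columns and the `claim_next` function are assumptions made for illustration. In most relational databases, the `AND status = 'pending'` guard means that a worker that lost the race updates 0 rows and simply tries the next candidate.

```python
import sqlite3


def claim_next(conn, worker_id):
    """Claim the oldest pending job for worker_id; return its id or None."""
    while True:
        row = conn.execute(
            "SELECT id FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None  # nothing left to process
        cur = conn.execute(
            "UPDATE jobs SET status = 'processing', worker = ? "
            "WHERE id = ? AND status = 'pending'",
            (worker_id, row[0]),
        )
        conn.commit()
        if cur.rowcount == 1:
            return row[0]  # we won the row
        # rowcount == 0: another worker claimed it first, so look again


# Hypothetical table with two pending rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT, worker TEXT)"
)
conn.executemany("INSERT INTO jobs (status) VALUES (?)", [("pending",)] * 2)
conn.commit()

first = claim_next(conn, "worker-A")
```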
One possibility is to look at reliable message queuing systems, which seem to be a pretty good match for what you are looking for: worker machines could just read work from a shared queue. Possible jumping-off points are http://en.wikipedia.org/wiki/Message_queue and http://docs.oracle.com/cd/B10500_01/appdev.920/a96587/qintro.htm
Ok, so the story is like this:
-- I have lots of files (pretty big, around 25GB) that are in a particular format and need to be imported into a datastore
-- these files are continuously updated with data, sometimes new, sometimes the same data
-- I am trying to figure out an algorithm to detect whether something has changed for a particular line in a file, in order to minimize the time spent updating the database
-- the way it currently works is that I drop all the data in the database each time and then reimport it, but this won't work anymore since I'll need a timestamp for when an item has changed.
-- the files contain strings and numbers (titles, orders, prices etc.)
The only solutions I could think of are:
-- compute a hash for each row from the database, compare it against the hash of the row from the file, and if they're different then update the database
-- keep 2 copies of the files, the previous ones and the current ones, diff them (which is probably faster than updating the db) and update the db based on those diffs.
Since the amount of data is very big to huge, I am kind of out of options for now. In the long run, I'll get rid of the files and data will be pushed straight into the database, but the problem still remains.
Any advice will be appreciated.
Problem definition as understood.
Let’s say your file contains
ID,Name,Age
1,Jim,20
2,Tim,30
3,Kim,40
As you stated, rows can be added or updated, so the file becomes
ID,Name,Age
1,Jim,20 -- to be discarded
2,Tim,35 -- to be updated
3,Kim,40 -- to be discarded
4,Zim,30 -- to be inserted
Now the requirement is to update the database by inserting / updating only the 2 changed records above, in two SQL queries or in 1 batch query containing two SQL statements.
I am making the following assumptions here:
You cannot modify the existing process that creates the files.
You are using some batch processing [reading from file, processing in memory, writing to DB] to upload the data into the database.
Store the hash values of each record [Name,Age] against its ID in an in-memory map, where the ID is the key and the hash is the value [if you require scalability, use Hazelcast].
Your batch framework that loads the data [again assuming it treats one line of the file as one record] needs to check the computed hash value against the ID in the in-memory map. The first-time creation of the map can also be done using your batch framework for reading files.
If (ID present)
--- compare hash
--- found same, then discard it
--- found different, then create an update SQL statement
If the ID is not present in the in-memory map, create an insert SQL statement and store the hash value in the map
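The steps above can be sketched in Python with a plain dict standing in for the in-memory map. The function names `row_hash` and `classify` are made up for this sketch, SHA-256 is just one possible hash, and the header line is assumed to have been stripped already.

```python
import hashlib


def row_hash(fields):
    """Hash the non-key fields of a record (here: Name and Age)."""
    return hashlib.sha256(",".join(fields).encode("utf-8")).hexdigest()


def classify(lines, known_hashes):
    """Return (ids_to_insert, ids_to_update) and refresh known_hashes in place."""
    inserts, updates = [], []
    for line in lines:
        record_id, *fields = line.strip().split(",")
        new_hash = row_hash(fields)
        old_hash = known_hashes.get(record_id)
        if old_hash == new_hash:
            continue  # unchanged row: discard
        if old_hash is None:
            inserts.append(record_id)  # ID never seen before
        else:
            updates.append(record_id)  # ID known, content changed
        known_hashes[record_id] = new_hash
    return inserts, updates


known = {}
# First load: every row is new.
first_inserts, first_updates = classify(["1,Jim,20", "2,Tim,30", "3,Kim,40"], known)
# Second load: only Tim changed and Zim was added.
inserts, updates = classify(
    ["1,Jim,20", "2,Tim,35", "3,Kim,40", "4,Zim,30"], known
)
```

The whole file is still read on every run; what this saves is the database work for unchanged rows.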
You might go for parallel processing, chunk processing and in-memory data partitioning using Spring Batch and Hazelcast.
http://www.hazelcast.com/
http://static.springframework.org/spring-batch/
Hope this helps.
Instead of computing the hash for each row from the database on demand, why don't you store the hash value instead?
Then you could just compute the hash value of the file in question and compare it against the database stored ones.
Update:
Another option that came to my mind is to store the Last Modified date/time information on the database and then compare it against that of the file in question. This should work, provided the information cannot be changed either intentionally or by accident.
Well, regardless of what you use, your worst case is going to be O(n), which for n ~ 25GB of data is not so pretty, unless you can modify the process that writes to the files.
Since you are not updating all of the 25GBs all of the time, that is your biggest potential for saving cycles.
1. Don't write randomly
Why don't you make the process that writes the data append only? This way you'll have more data, but you'll have full history and you can track which data you already processed (what you already put in the datastore).
2. Keep a list of changes if you must write randomly
Alternatively, if you really must do the random writes, you could keep a list of updated rows. This list can then be processed as in #1, and then you can track which changes you have processed. If you want to save some space, you can keep a list of blocks in which the data changed (where a block is a unit that you define).
Furthermore you can keep checksums/hashes of changed block/lines. However this might not be very interesting - it is not so cheap to compute and direct comparison might be cheaper (if you have free CPU cycles during writing it might save you some reading time later, YMMV).
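For option #1, tracking what has already been processed can be as simple as remembering a byte offset into the append-only file. A minimal Python sketch, assuming newline-terminated records; the function name `read_new_lines` and the sample rows are made up for illustration:

```python
import os
import tempfile


def read_new_lines(path, offset):
    """Return the complete lines appended after `offset` and the new offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Only consume complete lines; a partially written last line waits
    # for the next call.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8").splitlines()
    return lines, offset + end


# Demo on a temporary file standing in for one of the large data files.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "ab") as f:
    f.write(b"1,Jim,20\n2,Tim,30\n")
first, offset = read_new_lines(path, 0)

with open(path, "ab") as f:
    f.write(b"2,Tim,35\n")  # a change is appended, not written in place
second, offset = read_new_lines(path, offset)
os.remove(path)
```

Only the bytes after the stored offset are read on each run, so the cost depends on the amount of new data rather than on the full file size.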
Note(s)
Both #1 and #2 are interesting only if you can make adjustment to the process that writes the data to the disk
If you cannot modify the process that writes the 25GB of data, then I don't see how checksums/hashes can help: you have to read all the data anyway to compute the hashes (since you don't know what changed), so you can directly compare while you read and come up with a list of rows to update/add (or update/add directly).
Using diff algorithms might be suboptimal: a diff algorithm will not only look for the lines that changed, but also look for the minimal edit distance between two text files given certain formatting options. (In diff, this can be controlled with -H or --minimal to work faster or slower, i.e. use a heuristic algorithm or search for the exact minimal solution; IIRC the heuristic algorithm becomes O(n log n), which is not bad, but still slower than the O(n) that is available to you if you do a direct line-by-line comparison.)
Practically, it's the kind of problem that has to be solved by backup software, so why not use some of their standard solutions?
The best one would be to hook the WriteFile calls so that you receive callbacks on each update. This would work pretty well with binary records.
Something that I cannot understand: the files are actually text files that are not just appended to, but updated? This is highly inefficient (together with the idea of keeping 2 copies of the files, because it will make the file caching work even worse).
I have a SQL Server table full of orders that my program needs to "follow up" on (call a webservice to see if something has been done with them). My application is multi-threaded, and could have instances running on multiple servers. Currently, every so often (on a Threading timer), the process selects 100 rows, at random (ORDER BY NEWID()), from the list of "unconfirmed" orders and checks them, marking off any that come back successfully.
The problem is that there's a lot of overlap between the threads, and between the different processes, and there's no guarantee that a new order will get checked any time soon. Also, some orders will never be "confirmed" and are dead, which means that they get in the way of orders that need to be confirmed, slowing the process down if I keep selecting them over and over.
What I'd prefer is that all outstanding orders get checked, systematically. I can think of two easy ways to do this:
The application fetches one order to check at a time, passing in the last order it checked as a parameter, and SQL Server hands back the next order that's unconfirmed. More database calls, but this ensures that every order is checked in a reasonable timeframe. However, different servers may re-check the same order in succession, needlessly.
The SQL Server keeps track of the last order it asked a process to check up on, maybe in a table, and gives a unique order to every request, incrementing its counter. This involves storing the last order somewhere in SQL, which I wanted to avoid, but it also ensures that threads won't needlessly check the same orders at the same time
Are there any other ideas I'm missing? Does this even make sense? Let me know if I need some clarification.
RESULT:
What I ended up doing was adding a LastCheckedForConfirmation column to my table with finished orders in it, and I added a stored procedure that updates a single, Unconfirmed row with GETDATE() and kicks out the order number so my process can check on it. It spins up as many of these as it can (given the number of threads the process is willing to run), and uses the stored procedure to get a new OrderNumber for each thread.
To handle the "Don't try rows too many times or when they're too old" problem, I did this: The SP will only return a row if "Time since last try" > "Time between creation and last try", so each time it will take twice as long before it tries again - first it waits 5 seconds, then 10, then 20, 40, 80, 120, and then after it's tried 15 times (6 hours), it gives up on that order and the SP will never return it again.
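The eligibility rule can be written down as a small Python predicate. This is only a sketch of the rule as described, not the stored procedure itself; the names `is_due`, `attempts` and `max_attempts` are made up, times are plain seconds since creation, and the first try is assumed to happen 5 seconds after creation.

```python
def is_due(created_at, last_try_at, now, attempts, max_attempts=15):
    """True if the order may be handed out again.

    The order is due once the time since the last try exceeds the time
    between creation and the last try, and it has not been tried
    max_attempts times yet.
    """
    if attempts >= max_attempts:
        return False  # give up on this order
    return (now - last_try_at) > (last_try_at - created_at)


# Simulate polling once per second for an order created at t=0
# whose first try happened at t=5.
tries = [5]
for now in range(6, 100):
    if is_due(0, tries[-1], now, attempts=len(tries)):
        tries.append(now)
```

Because each retry has to wait longer than the order's age at its previous try, the gaps between tries roughly double.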
Thanks for the help, everybody - I knew the way I was doing it was less than ideal, and I appreciate your pointers in the right direction.
I recommend reading and internalizing Using tables as Queues.
If you use the data as a queue, you must organize it properly for queuing operations. The article I linked goes into details about how to do this, what you have is a variant of a Pending Queue.
One thing you must absolutely get rid of is the randomness. If there is one thing that is hard to reproduce in a query, it is randomness. ORDER BY NEWID() will scan every row, generate a guid, then SORT, and then give you back the top 100. You cannot, under any circumstances, have every worker thread scan the entire table every time; you'll kill the server as the number of unprocessed entries grows.
Instead, use a pending processing date. Have the queue organized (clustered) by the processing date column (when the item is due for retry) and dequeue using the techniques I show in my linked article. If you want to retry, the dequeue should postpone the item instead of deleting it, i.e. WITH (...) UPDATE SET due_date = dateadd(day, 1, getutcdate()) ...
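To illustrate the "postpone instead of delete" idea without SQL Server, here is a minimal sketch using Python's built-in sqlite3 module. The `pending` table, the `due_at` column and the `dequeue_due` function are assumptions for illustration; times are plain integers, and the locking hints a real multi-worker SQL Server queue needs are not shown.

```python
import sqlite3


def dequeue_due(conn, now, retry_after):
    """Return the id of the most overdue item and push its due time back."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT id FROM pending WHERE due_at <= ? "
            "ORDER BY due_at, id LIMIT 1",
            (now,),
        ).fetchone()
        if row is None:
            return None  # nothing is due yet
        conn.execute(
            "UPDATE pending SET due_at = ? WHERE id = ?",
            (now + retry_after, row[0]),
        )
        return row[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pending (id INTEGER PRIMARY KEY, due_at INTEGER)")
conn.execute("CREATE INDEX ix_pending_due ON pending (due_at)")
conn.executemany("INSERT INTO pending VALUES (?, ?)", [(1, 0), (2, 5)])
conn.commit()

order = [dequeue_due(conn, now=10, retry_after=100) for _ in range(3)]
```

Every outstanding item is visited in due-date order, and an item that was just checked moves to the back instead of being picked again immediately.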
The obvious way would be to add a column LastCheckDt to the order. In each thread, retrieve the order that has gone for the longest time without checking. In the procedure that retrieves the order, update the LastCheckDt field.
I wouldn't retrieve 100 orders at once, there is a risk of the 50th order changing in the database before your thread reaches it. Get one order, and when done, get the next one.
In addition, I'd initially develop the process without multi-threading. Checking an open order is usually fast enough to be done sequentially.
One strategy you might want to consider is a table like this:
JobID bigint PK not null, WorkerID int/nvarchar(max) null
Where worker is the id/name of the server that is processing it, or null if nobody has picked up the job. When a server picks up a job, it puts its own id/name into that column which indicates to others not to pick up the job.
One problem is that it is possible that the server working a job crashes, making the job never complete. You could add a date column that would represent the timeout, which is set when the worker picks up the job to now + some time span that you decide is appropriate.
EDIT: Forgot to mention, you will either need to delete the job when it is complete, or have a status field to indicate completion. An additional field could hold parameters for the job to make your job table generic, i.e. don't just make a solution for your orders, make a job manager that can process anything you will need in the future.
I’m building a system that generates “work items” that are queued up for back-end processing. I recently completed a system that had the same requirements and came up with an architecture that I don’t feel is optimal and was hoping for some advice for this new system.
Work items are queued up centrally and need to be processed in an essentially FIFO order. If this were the only requirement, then I would probably favor an MSMQ or SQL Server Service Broker solution. However, in reality, I need to select work items in a modified FIFO order. A work item has several attributes, and they need to be assigned in FIFO order where certain combinations of attribute values exist.
As an example, a work item may have the following attributes: Office, Priority, Group Number and Sequence Number (within group). When multiple items are queued for the same Group Number, they are guaranteed to be queued in Sequence Number order and will have the same priority.
There are several back-end processes (currently implemented as Windows Services) that pull work items in modified FIFO order given certain configuration parameters for the given service. The service running in Washington, DC is configured to process only work items for DC, while the service in NY may be configured to process both NY and DC items (mainly to increase overall throughput). In addition to this type of selectivity, higher priority items should be processed first, and items that contain the same “Group Number” must be processed in Sequence Number order. So if the NY service is working on a DC item in group 100 with sequence 1, I don’t want the DC service to pull off the DC item in group 100 sequence 2 because sequence 1 is not yet complete. Items in other groups should remain eligible for processing.
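The selection rule can be sketched in Python over an in-memory list. This is not the stored procedure; the `WorkItem` class and `next_item` function are made up for illustration, a larger `priority` number is assumed to mean more urgent, and completed items are assumed to be removed from the queue.

```python
from dataclasses import dataclass


@dataclass
class WorkItem:
    item_id: int  # doubles as the enqueue order (FIFO)
    office: str
    priority: int  # assumption: larger number = more urgent
    group: int
    seq: int
    assigned: bool = False


def next_item(queue, offices):
    """Assign and return the next eligible item for a service, or None."""

    def blocked(item):
        # An earlier sequence number of the same group is still in the
        # queue (waiting or in process), so this item must wait.
        return any(o.group == item.group and o.seq < item.seq for o in queue)

    candidates = [
        i for i in queue
        if not i.assigned and i.office in offices and not blocked(i)
    ]
    if not candidates:
        return None
    chosen = min(candidates, key=lambda i: (-i.priority, i.item_id))
    chosen.assigned = True
    return chosen


queue = [
    WorkItem(1, "DC", 1, group=100, seq=1),
    WorkItem(2, "DC", 1, group=100, seq=2),
    WorkItem(3, "DC", 1, group=200, seq=1),
]
ny_pick = next_item(queue, {"NY", "DC"})  # NY service also handles DC items
dc_pick = next_item(queue, {"DC"})        # must skip group 100, sequence 2
```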
In the last system, I implemented the queues with SQL tables. I created stored procedures to submit items and, more importantly, to “assign” items to the Windows Services that were responsible for processing them. The assignment stored procedures contain the selection logic I described above. Each Windows Service would call the assignment stored procedure, passing it the parameters that were unique to that instance of the service (e.g. the eligible offices). This assignment stored procedure stamps the work item as assigned (in process) and when the work is complete, a final stored procedure is called to remove the item from the “queue” (table).
This solution does have some advantages in that I can quickly examine the state of these “queues” by a simple SQL select statement. I’m also able to manipulate the queues easily (e.g. I can bump priorities with a simple SQL update statement). However, on the downside, I occasionally have to deal with deadlocks on these queue tables and have the burden of writing these stored procedures (which gets tedious after a while).
Somehow I think that either MSMQ (with or without WCS) or Service Broker should be able to provide a more elegant solution. Rolling my own queuing/work-item-processing system just feels wrong. But as far as I know, these technologies don’t offer the flexibility that I need in the assignment process. I am hoping that I am wrong. Any advice would be welcome.
It seems to me that your concept of an atomic unit of work is a Group. So I would suggest that you only queue up a message that identified a Group Id, and then your worker will have to go to a table that maps Group Id to 1 or more Work Items.
You can handle your other problems by using more than one queue - NY-High, NY-Low, DC-High, DC-Low, etc.
In all honesty, though, I think you are better served by fixing the deadlock issues in your current architecture. You should be reading the TOP 1 message from your queue table with Update Lock and Read Past hints, ordered by your priority logic and whatever filter criteria you want (Office/Location). Then you process your 1 message, change its status or move it to another table. You should be able to call that stored procedure in parallel without a deadlock issue.
Queues are for FIFO order, not random access order. Even though you are saying that you want FIFO order, you want FIFO order with respect to a random set of variables, which is essentially random order. If you want to use queues, you need to be able to determine order before the message goes in the queue, not after it goes in.