How to handle DML operations in Apache Storm Topology - solr

In my architecture, DML commands are queued into Kafka. The Storm topology comprises a single spout and 3 Solr bolts. The DML commands get distributed among these 3 bolts.
My problem is how to handle the case where the order of the commands gets shuffled by the Solr bolts. For example, the sequence of commands is:
1. Insert record A with value 50.
2. Insert record B with value x.
3. Update record A to value 20.
4. Insert record C with value y.
5. Update record A to value 100.
and so on.
In the above case, what if command 5 gets executed by one bolt before command 3 is executed by another bolt? This can happen if Bolt 3 picks up and executes command 5 before Bolt 2 executes command 3.

If I understand you correctly, you have a single spout (with dop=1) and a single bolt (with dop=3) that gets its data from the spout via shuffle grouping. If dependent commands are shuffled to different bolt executors, there is no way to get them executed in the correct order.
However, if you have a series of commands that depend on each other, you can use fieldsGrouping to ensure that all of these commands go to the same executor. In this case, the order is guaranteed to be preserved. To accomplish this, add an attribute (a counter) to the spout's output tuples and field-group on this attribute. For consecutive dependent commands, do not modify the counter. When a series of dependent commands is finished, increase the counter by one (this ensures that different command series are processed by different bolt executors, i.e., load balancing). The tricky part, I guess, is knowing when a series of dependent commands is finished. Hope this helps.
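To make the fieldsGrouping idea concrete, here is a minimal Python simulation, not Storm code. It assumes the record ID can serve as the grouping field (a simpler stand-in for the counter described above, since all commands touching one record depend on each other). The names `executor_for` and `route` are hypothetical helpers, and the CRC32 hash only illustrates "same field value, same executor".

```python
import zlib
from collections import defaultdict

NUM_EXECUTORS = 3  # the three Solr bolt executors from the question


def executor_for(record_id: str, num_executors: int = NUM_EXECUTORS) -> int:
    """Pick an executor from the grouping field; equal IDs always map to the same one."""
    return zlib.crc32(record_id.encode("utf-8")) % num_executors


def route(commands):
    """Distribute (record_id, action) commands to per-executor FIFO queues."""
    queues = defaultdict(list)
    for seq, (record_id, action) in enumerate(commands, start=1):
        queues[executor_for(record_id)].append((seq, record_id, action))
    return dict(queues)


commands = [
    ("A", "insert 50"),
    ("B", "insert x"),
    ("A", "update 20"),
    ("C", "insert y"),
    ("A", "update 100"),
]
queues = route(commands)

# All commands for record A sit in one queue, in their original order.
for executor, queue in sorted(queues.items()):
    print(executor, queue)
```

Because each executor works through its own queue in order, commands 1, 3 and 5 for record A cannot overtake each other, while records B and C may still land on other executors.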

Related

Batching and unbatching of agents based on database reference AnyLogic

To make one complete ring I need three segments. Segments C1A, C1B and C1C should be batched to C1.
After batching, I want to separate the rings based on their names (String) in the select output block and send them to their respective delay processes.
My question is how to create a relation between the unbatched segments and the batched rings in order to complete this whole process.
Thanks in advance.
My first question is: why do you want to batch them if they get processed separately? See my example below. I think that if you just release all three segments at the same time, you do not need to batch them.
You will need to add a wait block (this can also be done with a queue) and then wait until you have enough segments of the correct batch (assuming the batch is denoted by the first two letters of a segment name). See code below.
This will check if there are 3 segments with the same first two letters in their name and release them from the wait block.
Then you can do what you want with them. If you want to batch, set the batch size to 3 (based on your example) so that every 3 units that get released (and they will be released all at once) are batched into a new agent called Ring.
We store the first two letters of the segment name, which will be the same for all segments being batched and use them to set the new ring name.
You can then use the Ring name to decide which machine they need to go to.
Since we did not select the "Permanent batch" option in the batch block, we can simply unbatch them again.
But based on your screenshot, I think you might just want to let them go through the system with no need for batching. I am not sure.
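The release check described above can be sketched in plain Python. This is not the original AnyLogic code (which would be written in Java inside the model), only an illustration of the logic. It assumes the batch is identified by the first two letters of the segment name, and `release_complete_rings` is a hypothetical helper name.

```python
from collections import defaultdict

BATCH_SIZE = 3  # three segments make one ring, as in the question


def release_complete_rings(waiting, batch_size=BATCH_SIZE):
    """Split waiting segment names into complete rings and segments that keep waiting."""
    by_prefix = defaultdict(list)
    for name in waiting:
        by_prefix[name[:2]].append(name)  # "C1A" belongs to ring "C1"

    rings, still_waiting = {}, []
    for prefix, names in by_prefix.items():
        if len(names) >= batch_size:
            rings[prefix] = names[:batch_size]          # release one full ring
            still_waiting.extend(names[batch_size:])    # extras keep waiting
        else:
            still_waiting.extend(names)
    return rings, still_waiting


rings, still_waiting = release_complete_rings(["C1A", "C2A", "C1B", "C1C", "C2B"])
print(rings)          # ring name -> its three segments
print(still_waiting)  # segments whose ring is not complete yet
```

The ring name (the shared prefix) is what would later drive the routing decision, and keeping the segment names inside the ring is what makes unbatching possible.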

How to get all results from results set in pentaho kettle step Input table?

I have a simple transformation consisting of 2 steps. Step 1 (Input table) queries the DB and step 2 (Java class) processes the results. Step 2 takes a long time (this is normal in my case), but after 1 hour I get an error about a closed result set:
Server has closed the connection. If result set contain huge amount of data, Server expects client to read off the result set relatively fast. In this case, please consider increasing net_wait_timeout session variable. / processing your result set faster (check Streaming result sets documentation for more information)
2017/10/02 13:12:06 - Getting of data cells .0 -
I think there should be some intermediate step (or some other option) to fetch all results from step 1 relatively fast. Could you help me with that?
I guess your step 2 is locking the same table as the one in step 1.
That's one of the drawbacks of the otherwise efficient architecture of PDI. All the steps start up at the same time, and the quickest to produce results hand them over to the next steps. With this strategy of "do the quickest first", you sometimes beat the SQL optimizer itself when there are lots of joins on sums or averages (pro rata).
The main pitfall in this respect is to read a table, make some transformation and rewrite the result to the same table with the truncate table option checked. In that case, the truncate is done a few milliseconds before the select of the input table, which starts a deadlock that never resolves. After a long time you decide to kill the ETL, but by that time the data has been lost.
Solutions:
The best practice is to rewrite step 2 using PDI steps rather than a ready-made Java class. That is the way I strongly recommend in the long run, but you may have some reason not to follow it.
If your table is small, you can put a blocking step between the input and output.
If your table is big, you can use a sort rows step instead of the blocking step. You do not really want to sort, but PDI needs to look at the last row to be sure the sort is complete before giving results to the next step. The sort will cut the data into temporary chunks on the hard disk, and you have some control over where and how the tmp data is stored.
You can copy your table into a tmp table (or file), process it, and delete it afterwards. Use a job to do that because, in a job, unlike in a transformation, the process is sequential.

Sliding processing time window computes inconsistent results

In Flink, I am reading a file using readTextFile and applying SlidingProcessingTimeWindows.of(Time.milliseconds(60), Time.milliseconds(60)) to it, i.e., a 60 msec window with a 60 msec slide. On the windowed stream I am calculating the mean of the second field of the tuple. My text file contains 1100 lines and each line is a tuple (String, Integer). I have set the parallelism to 1 and keyed the messages on the first field of the tuple.
When I run the code, I get different answers each time. It seems that sometimes it reads the entire file and sometimes it reads only the first few lines of the file. Is this related to the window size or the slide amount? How can I work out this relation so that I can decide the size and slide of the window?
The answer in the comment of AlpineGizmo is correct. I'll add a few more details here.
Flink aligns time windows to the beginning of the epoch (1970-01-01-00:00:00). This means that a window operator with a 1 hour window starts a new window with every new hour (i.e., at 00:00, 01:00, 02:00, ...) and not with the first arriving record.
Processing time windows are evaluated based on the current time of the system.
As said in the comment above, this means that the amount of data which can be processed depends on the processing resources (hardware, CPU/IO load, ...) of the machine that an operator runs on. Therefore, processing time windows cannot produce reliable and consistent results.
In your case, both described effects might cause results which are inconsistent across jobs. Depending on when you start the job, the data will be assigned to different windows (if the first record arrives just before the first 60 msec window is closed, only this element will be in the window). Depending on the IO load of the machine, it might take more or less time to access and read the file.
If you want to have consistent results, you need to use event-time. In this case, the records are processed based on the time which is encoded in the data, i.e., the results depend on the data only and not on external effects such as the starting time of the job or the load of the processing machine.
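The effect of epoch alignment can be illustrated with a small Python sketch. It is not Flink code: `window_start` mirrors the usual "round down to a multiple of the window size" rule, and `count_per_window` is a hypothetical helper that pretends records arrive at a fixed 10 ms spacing, which is an assumption made only for the example.

```python
def window_start(timestamp_ms: int, size_ms: int, offset_ms: int = 0) -> int:
    """Start of the epoch-aligned window that contains timestamp_ms."""
    return timestamp_ms - (timestamp_ms - offset_ms + size_ms) % size_ms


def count_per_window(first_arrival_ms, n_records, gap_ms, size_ms=60):
    """Count how many records fall into each window, in window order."""
    counts = {}
    for i in range(n_records):
        start = window_start(first_arrival_ms + i * gap_ms, size_ms)
        counts[start] = counts.get(start, 0) + 1
    return [counts[k] for k in sorted(counts)]


# Same 10 records, same spacing; only the wall-clock start of the job differs.
run_a = count_per_window(first_arrival_ms=600, n_records=10, gap_ms=10)
run_b = count_per_window(first_arrival_ms=650, n_records=10, gap_ms=10)
print(run_a)  # [6, 4]
print(run_b)  # [1, 6, 3]
```

The two runs see identical data but group it differently, so any per-window mean differs as well. With event time, the timestamps would come from the records themselves and the grouping would be the same on every run.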

Multiple instances of a service processing rows from a single table, what's the best way to prevent collision

We are trying to build a system with multiple instances of a service on different machines that share the load of processing.
Each of these will check a table; if there are rows to be processed in that table, it will pick the first, mark it as processing, process it, then mark it as done. Rinse, repeat.
What is the best way to prevent a race condition where 2 instances A and B do the following:
A (1) reads the table, finds row 1 to process,
B (1) reads the table, finds row 1 to process,
A (2) marks row 1 as processing,
B (2) marks row 1 as processing.
In a single app we could use locks or mutexes.
I could just put A1 and A2 in a single transaction. Is it that simple, or is there a better, faster way to do this?
Should I just turn it on its head so that the steps are:
A (1) Mark the next row as mine to process
A (2) Return it to me for processing.
I figure this has to have been solved many times before, so I'm looking for the "standard" solutions, and if there are more than one, benefits and disadvantages.
Transactions are a nice simple answer, with two possible drawbacks:
1) You might want to check the fine print of your database. Sometimes the default consistency settings don't guarantee absolute consistency in every possible circumstance.
2) Sometimes the pattern of accesses associated with using a database to queue and distribute work is hard on a database that isn't expecting it.
One possibility is to look at reliable message queuing systems, which seem to be a pretty good match for what you are looking for: worker machines could just read work from a shared queue. Possible jumping-off points are http://en.wikipedia.org/wiki/Message_queue and http://docs.oracle.com/cd/B10500_01/appdev.920/a96587/qintro.htm
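For the database route, one common pattern is the "mark it as mine, then process it" idea from the question, done as a conditional UPDATE so that only one worker can win a given row. The sketch below uses Python's built-in sqlite3 purely for illustration. The table name `work_items`, its columns, and the helper `claim_next_row` are assumptions, and a production database may offer better tools (row locks, skip-locked reads) than this retry loop.

```python
import sqlite3


def claim_next_row(conn, worker):
    """Claim the oldest pending row for `worker`; return its id, or None if none is left."""
    while True:
        row = conn.execute(
            "SELECT id FROM work_items WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The UPDATE only succeeds if the row is STILL pending, so two workers
        # that both read the same id cannot both claim it.
        cur = conn.execute(
            "UPDATE work_items SET status = 'processing', owner = ? "
            "WHERE id = ? AND status = 'pending'",
            (worker, row[0]),
        )
        conn.commit()
        if cur.rowcount == 1:
            return row[0]
        # Another worker won the race; loop and try the next pending row.


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE work_items ("
    "id INTEGER PRIMARY KEY, status TEXT NOT NULL DEFAULT 'pending', owner TEXT)"
)
conn.executemany("INSERT INTO work_items (id) VALUES (?)", [(1,), (2,)])
conn.commit()

first = claim_next_row(conn, "A")   # worker A gets row 1
second = claim_next_row(conn, "B")  # worker B gets row 2
third = claim_next_row(conn, "A")   # nothing left to claim
print(first, second, third)
```

The single connection here only demonstrates the logic. The protection against the A/B race comes from the `AND status = 'pending'` condition plus the row-count check, not from the surrounding Python.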

Architecting a Work Item Processing System with Modified FIFO Semantics in Windows

I’m building a system that generates “work items” that are queued up for back-end processing. I recently completed a system that had the same requirements and came up with an architecture that I don’t feel is optimal and was hoping for some advice for this new system.
Work items are queued up centrally and need to be processed in an essentially FIFO order. If this were the only requirement, then I would probably favor an MSMQ or SQL Server Service Broker solution. However, in reality, I need to select work items in a modified FIFO order. A work item has several attributes, and they need to be assigned in FIFO order where certain combinations of attribute values exist.
As an example, a work item may have the following attributes: Office, Priority, Group Number and Sequence Number (within group). When multiple items are queued for the same Group Number, they are guaranteed to be queued in Sequence Number order and will have the same priority.
There are several back-end processes (currently implemented as Windows Services) that pull work items in modified FIFO order given certain configuration parameters for the given service. The service running in Washington, DC is configured to process only work items for DC, while the service in NY may be configured to process both NY and DC items (mainly to increase overall throughput). In addition to this type of selectivity, higher priority items should be processed first, and items that contain the same “Group Number” must be processed in Sequence Number order. So if the NY service is working on a DC item in group 100 with sequence 1, I don’t want the DC service to pull off the DC item in group 100 with sequence 2, because sequence 1 is not yet complete. Items in other groups should remain eligible for processing.
In the last system, I implemented the queues with SQL tables. I created stored procedures to submit items and, more importantly, to “assign” items to the Windows Services that were responsible for processing them. The assignment stored procedures contain the selection logic I described above. Each Windows Service would call the assignment stored procedure, passing it the parameters that were unique to that instance of the service (e.g. the eligible offices). This assignment stored procedure stamps the work item as assigned (in process) and when the work is complete, a final stored procedure is called to remove the item from the “queue” (table).
This solution does have some advantages in that I can quickly examine the state of these “queues” by a simple SQL select statement. I’m also able to manipulate the queues easily (e.g. I can bump priorities with a simple SQL update statement). However, on the downside, I occasionally have to deal with deadlocks on these queue tables and have the burden of writing these stored procedures (which gets tedious after a while).
Somehow I think that either MSMQ (with or without WCF) or Service Broker should be able to provide a more elegant solution. Rolling my own queuing/work-item-processing system just feels wrong. But as far as I know, these technologies don’t offer the flexibility that I need in the assignment process. I am hoping that I am wrong. Any advice would be welcome.
It seems to me that your concept of an atomic unit of work is a Group. So I would suggest that you only queue up a message that identifies a Group Id, and then your worker will have to go to a table that maps the Group Id to 1 or more Work Items.
You can handle your other problems by using more than one queue - NY-High, NY-Low, DC-High, DC-Low, etc.
In all honesty, though, I think you are better served by fixing the deadlock issues in your current architecture. You should be reading the TOP 1 message from your queue table with Update Lock and Read Past hints, ordered by your priority logic and whatever filter criteria you want (Office/Location). Then you process your 1 message and change its status or move it to another table. You should be able to call that stored procedure in parallel without a deadlock issue.
Queues are for FIFO order, not random access order. Even though you are saying that you want FIFO order, you want FIFO order with respect to a random set of variables, which is essentially random order. If you want to use queues, you need to be able to determine order before the message goes in the queue, not after it goes in.
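To show why this selection is not plain FIFO, the rules from the question (office filter, priority first, strict sequence order inside a group) can be sketched as an in-memory Python function. This is only an illustration of the ordering logic, not the stored procedure. The `WorkItem` fields, the convention that a higher number means higher priority, and the name `next_item` are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class WorkItem:
    item_id: int            # arrival order; lower means queued earlier
    office: str
    priority: int           # assumption: higher number = more urgent
    group: int
    sequence: int
    status: str = "queued"  # queued | in_process | done


def next_item(items, eligible_offices):
    """Pick the next item a service may take, or None if nothing is eligible."""
    busy_groups = {i.group for i in items if i.status == "in_process"}
    candidates = []
    for item in items:
        if item.status != "queued" or item.office not in eligible_offices:
            continue
        if item.group in busy_groups:
            continue  # an earlier item of this group is still being processed
        has_earlier = any(
            o.group == item.group and o.status == "queued" and o.sequence < item.sequence
            for o in items
        )
        if not has_earlier:
            candidates.append(item)
    if not candidates:
        return None
    # Highest priority first, then oldest arrival (the FIFO part).
    return min(candidates, key=lambda i: (-i.priority, i.item_id))


items = [
    WorkItem(1, "DC", 5, 100, 1, "in_process"),  # NY service is working on this
    WorkItem(2, "DC", 5, 100, 2),
    WorkItem(3, "DC", 1, 200, 1),
    WorkItem(4, "NY", 9, 300, 1),
]
picked_dc = next_item(items, {"DC"})        # group 100 is busy, so item 3
picked_ny = next_item(items, {"NY", "DC"})  # highest priority overall, item 4
items[0].status = "done"
picked_after = next_item(items, {"DC"})     # group 100 is free again, item 2
print(picked_dc.item_id, picked_ny.item_id, picked_after.item_id)
```

The eligibility of an item depends on the current state of other items, which is why the order cannot be fixed at enqueue time, as the answer points out.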
