I often use Camel's idempotent pattern to prevent duplicate processing of discrete messages. What's the best practice to do this when the data stream in question is a large volume of messages each with a timestamp?
Consider this route configuration (pseudocode):
timer -> idempotent( search_splunk_as_batch -> split -> sql(insert))
We want to periodically query Splunk and write to SQL. We don't want to miss any messages, and we don't want any duplicate messages.
Instead of persisting an idempotent marker for each message, I'd like to note the cutoff time for each batch and begin the next query at the cutoff time.
Your method will probably work as long as you can rely on some assumptions:
Your indexers never load data that appears in the past (according to the _time field)
Your Camel route never runs in more than one process at a time writing to the same database table.
If you can make sure these are met, then you can just store the maximum timestamp that you receive from each search and use it as the "earliest" parameter of the next Splunk search. Storing and retrieving the max timestamp could be done with something like a file, a separate database table, or a column in your target table.
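For illustration, here is a minimal Camel (Java DSL) sketch of that cutoff approach. The bean names (cutoffRepository, splunkSearchClient), the timer period, and the events table and its columns are hypothetical placeholders, and the Splunk call is abstracted behind a bean rather than tied to a specific component:

import org.apache.camel.builder.RouteBuilder;

public class SplunkToSqlRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("timer:splunkPoll?period=300000")                 // poll every 5 minutes
            // 1. Load the stored cutoff (the max _time of the previous batch) into a header.
            .bean("cutoffRepository", "loadLastCutoff")
            // 2. Run the Splunk search bounded by earliest=<cutoff>; assumed to return a
            //    List<Map<String, Object>> with one map per event.
            .bean("splunkSearchClient", "searchSinceCutoff")
            // 3. Insert each event; :#event_time and :#payload are read from each event map.
            .split(body())
                .to("sql:insert into events (event_time, payload) values (:#event_time, :#payload)")
            .end()
            // 4. Only after the whole batch is written, persist the new cutoff for the next run.
            .bean("cutoffRepository", "storeCutoffFromLastBatch");
    }
}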
Currently in Snowflake we have configured an auto-ingest Snowpipe connected to an external S3 stage as documented here. This works well, and we're copying records from the pipe into a "landing" table. The end goal is to MERGE these records into a final table to deal with any duplicates, which also works well. My question is about how best to safely perform this MERGE without missing any records. At the moment we perform a single data extraction job per day, so there is normally a point where the Snowpipe queue is empty, which we use as an indicator that it is safe to proceed. However, we are looking to move to more frequent extractions, where it will become harder and harder to guarantee there will be no new records ingested at any given point.
Things we've considered:
Temporarily pause the pipe, MERGE the records, TRUNCATE the landing table, then unpause the pipe. I believe this should technically work, but it is not clear to me that this is an advisable way to work with Snowpipes. I'm not sure how resilient they are to being paused/unpaused, how long it tends to take to pause/unpause, etc. I am aware that paused pipes can become "stale" after 14 days (link), however we're talking about pausing it for a few minutes, not multiple days.
Utilize transactions in some way. I have a general understanding of SQL transactions, but I'm having a hard time determining exactly if/how they could be used in this situation to guarantee no data loss. The general thought is that if the MERGE and DELETE could be contained in a transaction, it may provide a safe way to process the incoming data throughout the day, but I'm not sure whether that's true.
Add in a third "processing" table and a task to swap the landing table with the processing table. The task to swap the tables could run on a schedule (e.g. every hour), and I believe the key is to have the conditional statement check both that there are records in the landing table AND that the processing table is empty. At this point the MERGE and TRUNCATE would work off the processing table, and the landing table would continue to receive the incoming records.
Any additional insights into these options or completely different suggestions are very welcome.
Look into table streams, which record insertions/updates/deletions on your Snowpipe landing table. You can then MERGE off the stream into your target table, which resets the stream's offset. Use a task to run your MERGE statement. Also, given that the source is Snowpipe, when creating your stream it is probably best to use an append-only stream.
However, I had a question here where, in some circumstances, we were missing some rows. Our task was set to 1-minute intervals, which may be partly the reason, but I never did get to the bottom of it, even with Snowflake support.
What we did notice, though, was that using a stored procedure with a transaction, and also running a SELECT on the stream before the MERGE, seems to have solved the issue, i.e. no more missing rows.
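As a rough sketch of the stream-plus-task setup described above (table, column, warehouse, and connection names are hypothetical, and the statements are issued through the Snowflake JDBC driver only to keep the example in Java; the same SQL can be run directly in a worksheet):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

public class SetupStreamAndTask {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("user", "MY_USER");          // hypothetical credentials and account
        props.put("password", "MY_PASSWORD");
        props.put("db", "MY_DB");
        props.put("schema", "PUBLIC");
        props.put("warehouse", "MY_WH");

        try (Connection con = DriverManager.getConnection(
                 "jdbc:snowflake://myaccount.snowflakecomputing.com/", props);
             Statement stmt = con.createStatement()) {

            // Append-only stream: Snowpipe only ever inserts into the landing table.
            stmt.execute("CREATE STREAM IF NOT EXISTS landing_stream ON TABLE landing APPEND_ONLY = TRUE");

            // Task that merges from the stream whenever it has data; consuming the
            // stream in a DML statement advances its offset.
            stmt.execute(
                "CREATE TASK IF NOT EXISTS merge_landing " +
                "  WAREHOUSE = MY_WH " +
                "  SCHEDULE = '5 MINUTE' " +
                "  WHEN SYSTEM$STREAM_HAS_DATA('LANDING_STREAM') " +
                "AS MERGE INTO final f USING landing_stream s ON f.id = s.id " +
                "   WHEN MATCHED THEN UPDATE SET f.payload = s.payload " +
                "   WHEN NOT MATCHED THEN INSERT (id, payload) VALUES (s.id, s.payload)");

            // Tasks are created suspended; resume the task to start the schedule.
            stmt.execute("ALTER TASK merge_landing RESUME");
        }
    }
}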
I'm using Apache Flink 1.8. I have a stream of records coming in from Kafka as JSON, and I'm filtering them; that all works fine.
Now, I would like to enrich the data from Kafka with a look up value from a database table.
Is that just a case of creating 2 streams, loading the table in the 2nd stream and then joining the data?
The database table does get updated, but not frequently, and I would like to avoid looking up the DB for every record that comes through the stream.
Flink has state, which you could take advantage of here. I've done something similar, where I took a daily query from my lookup table (in my case it was a bulk webservice call) and threw the results into a Kafka topic. This Kafka topic was being consumed by the same Flink job that needed the data for lookups. Both topics were keyed by the same value, but I used the lookup topic to store data into keyed state, and when processing the other topic, I'd pull the data back out of state.
I had some additional logic to check if there was NO state yet for a given key. If that was the case, I'd make an async request to the webservice. You may not need to do that however.
The caveat here is that I had enough memory for state management, and my lookup table was only about 30 million records, about 100 GB spread across 45 slots on 15 nodes.
[In answer to question in comments]
Sorry, but my answer was too long, so I had to edit my post:
I had a Python job that loaded the data via a bulk REST call (yours could just do a database lookup). It then transformed the data into the correct format and dumped it into Kafka. Then my Flink flow had two sources: one was the 'real data' topic, the other was the 'lookup data' topic. Data coming from the lookup data topic was stored in state (I used a ValueState because each key mapped to a single possible value, but there are other state types). I also had a 24-hour expiration time for each entry, but that was my use case.
The trick is that the same operator that stores the value in state from the lookup topic has to be the operator that pulls the value back out of state for the 'real' topic. This is because Flink state (even keyed state) is tied to the operator that created it.
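Below is a minimal sketch of that two-topic pattern in the Flink 1.8 DataStream API. The POJO types, the key field, and the in-line test sources are hypothetical stand-ins; in practice both streams would be FlinkKafkaConsumer sources, as described above:

import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

public class EnrichmentJob {

    // Hypothetical record types; in the real job these would be parsed from the Kafka JSON.
    public static class Event {
        public String key;
        public String value;
        public Event() { }
        public Event(String key, String value) { this.key = key; this.value = value; }
    }

    public static class LookupRow {
        public String key;
        public String refData;
        public LookupRow() { }
        public LookupRow(String key, String refData) { this.key = key; this.refData = refData; }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in sources; in practice these would be two FlinkKafkaConsumer sources.
        DataStream<Event> events = env.fromElements(new Event("42", "order-1"), new Event("42", "order-2"));
        DataStream<LookupRow> lookups = env.fromElements(new LookupRow("42", "customer reference"));

        events.keyBy(e -> e.key)
              .connect(lookups.keyBy(l -> l.key))   // both streams keyed by the same value
              .flatMap(new Enricher())
              .print();

        env.execute("keyed-state enrichment");
    }

    // Stores lookup rows in keyed state (flatMap2) and reads them back while
    // processing the 'real data' stream (flatMap1) -- the same operator does both.
    public static class Enricher extends RichCoFlatMapFunction<Event, LookupRow, String> {
        private transient ValueState<LookupRow> reference;

        @Override
        public void open(Configuration parameters) {
            ValueStateDescriptor<LookupRow> desc = new ValueStateDescriptor<>("reference", LookupRow.class);
            // Optional: expire each entry 24 hours after it was written, as mentioned above.
            desc.enableTimeToLive(StateTtlConfig.newBuilder(Time.hours(24)).build());
            reference = getRuntimeContext().getState(desc);
        }

        @Override
        public void flatMap1(Event e, Collector<String> out) throws Exception {
            LookupRow row = reference.value();      // may be null if no lookup data has arrived yet
            out.collect(e.value + " enriched with " + (row == null ? "<no lookup data>" : row.refData));
        }

        @Override
        public void flatMap2(LookupRow row, Collector<String> out) throws Exception {
            reference.update(row);                  // store/refresh the lookup value for this key
        }
    }
}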
I want to read history from state. If the state is null, then I read HBase, update the state, and use onTimer to set a state TTL. The problem is how to batch-read from HBase, because reading a single record at a time from HBase is not efficient.
In general, if you want to cache/mirror state from an external database in Flink, the most performant approach is to stream the database mutations into Flink -- in other words, turn Flink into a replication endpoint for the database's change data capture (CDC) stream, if the database supports that.
I have no experience with HBase, but https://github.com/mravi/hbase-connect-kafka is an example of something that might work (by putting Kafka in between HBase and Flink).
If you would rather query HBase from Flink, and want to avoid making point queries for one user at a time, then you could build something like this:
                 -> queryManyUsers -> keyBy(uId) ->
streamToEnrich                                       CoProcessFunction
                 -> keyBy(uID) --------------------->
Here you would split your stream, sending one copy through something like a window, a process function, or async I/O to query HBase in batches, and send the results into a CoProcessFunction that holds the cache and does the enrichment.
When records arrive in this CoProcessFunction directly, along the bottom path, if the necessary data is in the cache, then it is used. Otherwise the record is buffered, pending the arrival of data for the cache from the upper path.
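A rough sketch of just that CoProcessFunction is below (the Reference/Event/EnrichedEvent types are hypothetical placeholders, and the upper path that batch-queries HBase is omitted). The cache lives in a ValueState, and records that arrive before their reference data are parked in a ListState:

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical record types for this sketch.
class Reference { String refData; }
class Event { String uId; String value; }
class EnrichedEvent {
    Event event;
    Reference ref;
    EnrichedEvent(Event event, Reference ref) { this.event = event; this.ref = ref; }
}

// Input 1: reference data coming back from the batched HBase queries (upper path).
// Input 2: the records to enrich (bottom path). Both inputs must be keyed by uId
// before being connected, so the state below is per-key.
public class CachingEnricher extends CoProcessFunction<Reference, Event, EnrichedEvent> {

    private transient ValueState<Reference> cache;   // per-key cache of the HBase data
    private transient ListState<Event> pending;      // records waiting for the cache to be filled

    @Override
    public void open(Configuration parameters) {
        cache = getRuntimeContext().getState(new ValueStateDescriptor<>("cache", Reference.class));
        pending = getRuntimeContext().getListState(new ListStateDescriptor<>("pending", Event.class));
    }

    @Override
    public void processElement1(Reference ref, Context ctx, Collector<EnrichedEvent> out) throws Exception {
        cache.update(ref);
        // Reference data has arrived: flush anything that was buffered for this key.
        for (Event e : pending.get()) {
            out.collect(new EnrichedEvent(e, ref));
        }
        pending.clear();
    }

    @Override
    public void processElement2(Event e, Context ctx, Collector<EnrichedEvent> out) throws Exception {
        Reference ref = cache.value();
        if (ref != null) {
            out.collect(new EnrichedEvent(e, ref));   // cache hit: enrich immediately
        } else {
            pending.add(e);                           // cache miss: buffer until the data arrives
        }
    }
}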
I am trying to run a lot of update statements from code, and we have a requirement to summarize what changed for every operation for an audit log.
The update basically persists an entire graph spanning dozens of tables to SQL Server. Right now, before we begin, we collect the data from all the tables and assemble the graph(s) as a "before" picture; apply the updates; re-collect the data from all the tables and re-assemble the graph(s) for the "after"; serialize the before and after graph(s) to JSON; then send a message to an ESB queue for an off-process consumer to crunch through the graphs, identify the deltas, and update the audit log. All the SQL operations occur in a single transaction.
Needless to say, this is an expensive and time-consuming process.
I've been playing with the OUTPUT clause in T-SQL. I like the idea of getting the results of the operation back from the same command as the update, but it seems to have some limitations. For example, ideally it'd be great if I could get the INSERTED and DELETED result sets back at the same time, but there doesn't seem to be a concept of a UNION between the two table sets, so that gets unwieldy very quickly. Also, because the updates don't actually modify every column, I can't simply take the changes I made and compare them to DELETED, since we'd show deltas for columns we didn't change.
...but maybe I'm missing some syntax with the OUTPUT command, or I'm not using it correctly, so I figured I'd ask the SO community.
What is the most efficient way to collect the deltas of an update operation in SQL Server? The goal is to minimize the calls to SQL Server, and collect the minimum necessary amount of information for writing an accurate audit log, without writing a bunch of custom code for every single operation.
I have a large table, 1B+ records, that I need to pull down and run an algorithm on for every record. How can I use ADO.NET to execute a "select * from table" asynchronously and start reading the rows one by one while ADO.NET is still receiving the data?
I also need to dispose of the records after I read them to save on memory. So I am looking for a way to pull a table down record by record and basically shove each record into a queue for processing.
My data sources are Oracle and MS SQL Server. I have to do this for several data sources.
You should use SSIS for this.
You need a bit of background on how the ADO.NET data providers work to understand what you can and can't do. Let's take the SqlClient provider as an example. It is true that it is possible to execute queries asynchronously with BeginExecuteReader, but this asynchronous execution only lasts until the query starts returning results. At the wire level the SQL text is sent to the server, the server starts churning through the query execution, and eventually starts pushing result rows back to the client. As soon as the first packet comes back to the client, the asynchronous execution is done and the completion callback is executed. After that the client uses the SqlDataReader.Read() method to advance the result set. There are no asynchronous methods in the SqlDataReader.

This pattern works wonders for complex queries that return few results after some serious processing. While the server is busy producing the result, the client is idle with no threads blocked. However, things are completely different for simple queries that produce large result sets (as seems to be the case for you): the server will immediately produce results and will continue pushing them back to the client. The asynchronous callback will be almost instantaneous, and the bulk of the time will be spent by the client iterating over the SqlDataReader.
You say you're thinking of placing the records into an in-memory queue first. What is the purpose of the queue? If your algorithm's processing is slower than the throughput of the DataReader result set iteration, then this queue will start to build up. It will consume live memory and eventually will exhaust the memory on the client. To prevent this you would have to build in a flow control mechanism, i.e. if the queue size is bigger than N, don't put any more records into it. But to achieve this you would have to suspend the data reader iteration, and if you do that you push flow control back to the server, which will suspend the query until the communication pipe is available again (that is, until you start reading from the reader again). Ultimately the flow control has to be propagated all the way to the server, which is always the case in any producer-consumer relation: the producer has to stop, otherwise intermediate queues fill up. Your in-memory queue serves no purpose at all, other than complicating things. You can simply process the items from the reader one by one, and if your rate of processing is too slow, the data reader will cause flow control to be applied to the query running on the server. This happens automatically, simply because you don't call the DataReader.Read method.
To summarise: for large result set processing you cannot do asynchronous execution, and there is no need for a queue.
Now the difficult part.
Is your processing doing any sort of update back in the database? If yes, then you have much bigger problems:
You cannot use the same connection to write back the results, because it is busy with the data reader. SqlClient for SQL Server supports MARS, but that only solves the problem on SQL Server 2005/2008.
If you're going to enlist the read and the update in a transaction, and your updates occur on a different connection (see above), then this means using distributed transactions (even when the two connections involved point back to the same server). Distributed transactions are slow.
You will need to split the processing into several batches, because it is very bad to process 1B+ records in a single transaction. This also means that you have to be able to resume processing of an aborted batch, which means you must be able to identify records that were already processed (unless processing is idempotent).
A combination of a DataReader and an iterator block (a.k.a. a generator) should be a good fit for this problem. The default DataReaders provided by Microsoft pull data one record at a time from the data source.
Here's an example in C#:
static IEnumerable<User> RetrieveUsers(DbDataReader reader)
{
    // Read() advances one row at a time within the current result set.
    // (NextResult() would jump to the next result set, not the next row.)
    while (reader.Read())
    {
        User user = new User
        {
            Name = reader.GetString(0),
            Surname = reader.GetString(1)
        };
        yield return user;
    }
}
A good approach to this would be to pull back the data in blocks, iterate through each block adding to your queue, then call again for the next block. This is going to be better than hitting the DB for each row. If you are pulling the rows back by a numeric PK then this will be easy; if you need to order by something else, you can use ROW_NUMBER() to do it.
Just use the DbDataReader (just like Richard Nienaber said). It is a forward-only way of scrolling through the retrieved data. You don't have to dispose of your data because a DbDataReader is forward only.
When you use the DbDataReader it seems that the records are retrieved one by one from the database.
It is however slightly more complicated:
Oracle (and probably MySQL) will fetch a few hundred rows at a time to decrease the number of round trips to the database. You can configure the fetch size of the DataReader. Most of the time it will not matter whether you fetch 100 rows or 1,000 rows per round trip. However, a very low value like 1 or 2 rows slows things down, because such a low value means retrieving the data requires too many round trips.
You probably don't have to set the fetch size manually, the default will be just fine.
edit1: See here for an Oracle example: http://www.oracle.com/technology/oramag/oracle/06-jul/o46odp.html