Transform a specific group of records from MongoDB - apache-flink

I've got a periodically triggered batch job which writes data into MongoDB. The job takes about 10 minutes, and after that I would like to read this data and do some transformations with Apache Flink (mapping, filtering, cleaning, ...). There are dependencies between the records, which means I have to process them together. For example, I'd like to transform all records from the latest batch job where the customer id is 45666; the result would be one aggregated record.
Are there any best practices or ways to do that without implementing everything myself (get distinct customer ids from the latest job, select and transform the records for each customer, flag the transformed customers, etc.)?
I can't simply stream the data because I have to transform multiple records together, not one by one.
Currently I'm using Spring Batch, MongoDB and Kafka, and I'm thinking about Apache Flink.

Conceivably you could connect the MongoDB change stream to Flink and use that as the basis for the task you describe. The fact that 10-35 GB of data is involved doesn't rule out using Flink streaming, as you can configure Flink to spill to disk if its state can't fit on the heap.
I would want to understand the situation better before concluding that this is a sensible approach, however.
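To make that concrete, here is a minimal sketch (not the answerer's exact setup): once the change stream has been turned into a DataStream of records, grouping by customer and collapsing each group into one aggregated record is a plain keyBy/reduce. The Record POJO, its fields, and the inline source are placeholders for whatever the real connector produces.
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CustomerAggregationSketch {

    // Hypothetical record shape; the field names are assumptions.
    public static class Record {
        public long customerId;
        public String batchId;
        public double amount;

        public Record() {}

        public Record(long customerId, String batchId, double amount) {
            this.customerId = customerId;
            this.batchId = batchId;
            this.amount = amount;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for the real source (e.g. the MongoDB change stream via a CDC connector).
        DataStream<Record> records = env.fromElements(
                new Record(45666L, "batch-42", 10.0),
                new Record(45666L, "batch-42", 32.5),
                new Record(11111L, "batch-42", 7.0));

        // Group all records of one customer together and collapse them into a
        // single aggregated record per customer.
        DataStream<Record> aggregated = records
                .keyBy(r -> r.customerId)
                .reduce((a, b) -> new Record(a.customerId, a.batchId, a.amount + b.amount));

        aggregated.print();
        env.execute("customer-aggregation-sketch");
    }
}
Note that on an unbounded stream the reduce emits a running aggregate for every incoming record; running the job in batch execution mode (or applying a window) yields exactly one final record per customer.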

Related

Kafka Consumer to ingest data without querying DB for every message

We have a Kafka consumer service to ingest data into our DB. Whenever we receive a message from the topic, we compose an insert statement to insert this message into the DB. We use a DB connection pool to handle the insertions, and so far so good.
Currently, we need to add a filter to select only the relevant messages from Kafka and do the insert. There are two options in my mind to do this.
Option 1: Create a config table in the DB to define our filtering condition.
Pros
No need to make code changes or redeploy services
Just insert new filters into the config table; the service will pick them up on the next run
Cons
Need to query the DB every time we receive new messages.
Say we receive 100k new messages daily and need to filter out 50k. In total we only need to run 50k INSERT commands, but we would need to run 100k SELECT queries to check the filter condition for every single Kafka message.
Option 2: Use a hardcoded config file to define those filters.
Pros
Only need to read the filters once when the consumer starts running
Puts no burden on the DB layer
Cons
This is not scalable, since we are planning to add a lot of filters; every time, we would need to change the config file and redeploy the consumer.
My question is: is there a better option to achieve the goal? That is, find the filters without using a hardcoded config file and without increasing the number of DB queries.
Your filters could be in another Kafka topic.
Start your app and read that topic until the end, and only then start doing database inserts. Store each consumed filter in some local structure such as a ConcurrentHashMap, SQLite, RocksDB (provided by Kafka Streams), or DuckDB, which has become popular recently.
When you add a new filter, your consumer would need to temporarily pause its database operations.
If you use Kafka Streams, you could look up data from the incoming topic against your filters "table" state store using the Processor API and drop the non-matching records from the stream.
This way, you separate your database reads and writes once you start inserting 50k+ records, and your app wouldn't be blocked trying to read any "external config".
You could also use ZooKeeper, as that's one of its use cases.
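Here is a rough sketch of that idea with Kafka Streams. It uses a GlobalKTable join rather than the Processor API, and the topic names, serdes, and the assumption that the filter attribute is the record key are all illustrative.
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;

public class FilterJoinSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Filters live in their own (ideally compacted) topic, keyed by the attribute we filter on.
        GlobalKTable<String, String> filters = builder.globalTable("filters");

        // Incoming messages; here we assume the filter attribute is the record key.
        KStream<String, String> messages = builder.stream("messages");

        // Inner join against the filters table: records whose key has no matching
        // filter entry are dropped, the rest are forwarded for insertion.
        messages
                .join(filters,
                        (messageKey, messageValue) -> messageKey,   // key to look up in the filters table
                        (messageValue, filterValue) -> messageValue)
                .to("messages-to-insert"); // a downstream consumer does the batched DB inserts

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "filter-join-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        new KafkaStreams(builder.build(), props).start();
    }
}
Because the GlobalKTable is fully replicated into a local state store, each filter lookup is a local read and never touches the database.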

idiomatic way to do many dynamic filtered views of a Flink table?

I would like to create a per-user view of data tables stored in Flink, which is constantly updated as changes happen to the source data, so that I can have a constantly updating UI based on a toChangelogStream() of the user's view of the data. To do that, I was thinking that I could create an ad-hoc SQL query like SELECT * FROM foo WHERE userid=X and convert it to a changelog stream, which would have a bunch of inserts at the beginning of the stream to give me the initial state, followed by live updates after that point. I would leave that query running as long as the user is using the UI, and then delete the table when the user's session ends. I think this is effectively how the Flink SQL client must work, so it seems like this should be possible.
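For reference, a minimal sketch of that approach with the Table API; the foo schema, the userid value, and the datagen connector are stand-ins for a real Kafka/CDC-backed table.
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class PerUserViewSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Placeholder table; in practice foo would be backed by Kafka, CDC, etc.
        tEnv.executeSql(
                "CREATE TEMPORARY TABLE foo (userid BIGINT, payload STRING) " +
                "WITH ('connector' = 'datagen', 'fields.userid.min' = '1', 'fields.userid.max' = '100')");

        // One ad-hoc filtered view per user session.
        Table userView = tEnv.sqlQuery("SELECT * FROM foo WHERE userid = 42");

        // With an updating source, the stream starts with the current rows as inserts
        // and then carries the live changes.
        DataStream<Row> changelog = tEnv.toChangelogStream(userView);
        changelog.print();

        env.execute("per-user-view-sketch");
    }
}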
However, I anticipate that there may be some large overheads associated with each ad hoc query if I do it this way. When I write a SQL query, based on the answer in Apache Flink Table 1.4: External SQL execution on Table possible?, it sounds like internally this is going to compile a new JAR file and create new pipeline stages, I assume using more JVM metaspace for each user. I can have tens of thousands of users using the UI at once, so I'm not sure that's really feasible.
What's the idiomatic way to do this? The other ways I'm looking at are:
I could maybe use queryable state, since I could group the current rows under the userid as the key, but as far as I can tell it does not provide a way to get a changelog stream, so I would have to re-query the state periodically, which is not ideal for my use case (the per-user state can be large sometimes but doesn't change quickly).
Another alternative is to output the table to both a changelog stream sink and an external RDBMS sink, but if I do that, what's the best pattern for how to join those together in the client?

can (should) I use Flink like an in-memory database?

I've used batch Beam but am new to the streaming interface. I'm wondering about the appropriateness of using Apache Flink / Beam kind of like an in-memory database -- I'd like to constantly recompute and materialize one specific view of my data based on edge triggered updates.
More details: I have a few tables in a (normal) database, ranging from thousands to millions of rows, and each one has a many-to-many (M2M) relationship with other ones. Picture to explain:
Hosts <-M2M #1-> Table 1 <-M2M #2-> Table 2 <-M2M #3-> Table 3
Table 1 is a set of objects that the hosts need to know about, and each host needs to know about all downstream rows referenced directly or indirectly by the objects in Table 1 that it's related to. When changes happen anywhere other than the first many-to-many relationship M2M #1, it's not obvious which hosts need to be updated without traversing "left" to find the hosts and then traversing "right" to get all the necessary configuration. The objects and relationships at most levels change frequently, and I need sub-second latency to go from "a record or relationship changed" to recalculating any flattened config files with changes in them so that I can push updates to the hosts very quickly.
Is this an appropriate use case for streaming Flink / Beam? I have worked with Beam in a different system, but only in batch mode, and I think it would be a great tool to use here if I could edge-trigger it. The part I'm getting stuck on is: in batch mode, the PCollections are all "complete" in the sense that I can always join all records in one table with all records in another table. But with streaming, once I process a record, it gets removed from its PCollection and can't be joined against future updates that arrive later and relate to it, right? IIUC, it's only available within a window, but I effectively want an infinitely long window where only outdated versions of items in a PCollection (e.g. versions which have been overwritten by a new version that came in over the stream) would be freed up.
(Also, to bootstrap this system, I would need to scan the whole database to prefill all the state before I could start reading from a stream of edge-triggered updates. Is that a common pattern?)
I don't know enough about Beam to answer that part of the question, but you can certainly use Flink in the way you've described. The simplest way to accomplish this with Flink is with a streaming join, using the SQL/Table API. The runtime will materialize both tables into managed Flink state, and produce new and/or updated results as new and updated records are processed from the input tables. This is a commonly used pattern.
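A minimal sketch of such a streaming join with the Table API follows; the table names, columns, and the datagen stand-in sources are assumptions, and in practice the inputs would be CDC- or Kafka-backed tables.
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class StreamingJoinSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Placeholder sources; in practice these would mirror the relational tables
        // and the M2M link tables via CDC connectors.
        tEnv.executeSql("CREATE TEMPORARY TABLE hosts (host_id BIGINT, name STRING) WITH ('connector' = 'datagen')");
        tEnv.executeSql("CREATE TEMPORARY TABLE host_to_obj (host_id BIGINT, obj_id BIGINT) WITH ('connector' = 'datagen')");
        tEnv.executeSql("CREATE TEMPORARY TABLE objects (obj_id BIGINT, config STRING) WITH ('connector' = 'datagen')");

        // A regular (unbounded) streaming join: Flink keeps both sides in managed
        // state and emits new/updated rows whenever any input changes.
        tEnv.executeSql(
                "CREATE TEMPORARY VIEW host_config AS " +
                "SELECT h.host_id, o.config " +
                "FROM hosts h " +
                "JOIN host_to_obj m ON h.host_id = m.host_id " +
                "JOIN objects o ON m.obj_id = o.obj_id");

        // The changelog of this view is what you would push out to the hosts.
        tEnv.executeSql("SELECT * FROM host_config").print();
    }
}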
As for initially bootstrapping the state, before continuing to ingest the updates, I suggest using a CDC-based approach. You might start by looking at https://github.com/ververica/flink-cdc-connectors.

What's the simplest way to process lots of updates to Solr in batches?

I have a Rails app that uses Sunspot, and it is generating a high volume of individual updates which are generating unnecessary load on Solr. What is the best way to send these updates to Solr in batches?
Assuming the changes from the Rails app also update a persistent store, you can look at the Data Import Handler (DIH), which can be scheduled to run periodically and update the Solr indexes.
So instead of triggering an update and commit on Solr for every single change, you can pick a frequency and update Solr in batches.
However, expect some latency in the search results.
Also, are you updating individual records and committing each time? If you're on Solr 4.0 you can look into soft and hard commits as well.
Sunspot makes indexing a batch of documents pretty straightforward:
Sunspot.index(array_of_docs)
That will send off just the kind of batch update to Solr that you're looking for here.
The trick for your Rails app is finding the right scope for those batches of documents. Are they being created as the result of a bunch of user requests, and scattered all around your different application processes? Or do you have some batch process of your own that you control?
The sunspot_index_queue project on GitHub looks like a reasonable approach to this.
Alternatively, you can always turn off Sunspot's "auto-index" option, which fires off updates whenever your documents are updated. In your model, you can pass in auto_index: false to the searchable method.
searchable auto_index: false do
# sunspot setup
end
Then you have a bit more freedom to control indexing in batches. You might write a standalone Rake task which iterates through all objects created and updated in the last N minutes and index them in batches of 1,000 docs or so. An infinite loop of that should stand up to a pretty solid stream of updates.
At a really large scale, you really want all your updates going through some kind of queue. Inserting your document data into a queue like Kafka or AWS Kinesis for later processing in batches by another standalone indexing process would be ideal for this at scale.
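For illustration, such a standalone indexer might look roughly like this, sketched in Java with the Kafka consumer API and SolrJ; the topic, collection URL, and field names are assumptions.
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexerSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "solr-indexer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/posts").build()) {
            consumer.subscribe(List.of("solr-updates"));
            while (true) {
                // Each poll becomes one batched Solr update instead of many single-doc commits.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                if (records.isEmpty()) {
                    continue;
                }
                List<SolrInputDocument> docs = new ArrayList<>();
                for (ConsumerRecord<String, String> record : records) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", record.key());
                    doc.addField("body_text", record.value()); // field names are assumptions
                    docs.add(doc);
                }
                solr.add(docs);
                solr.commit(); // or rely on autoCommit / commitWithin instead
            }
        }
    }
}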
I used a slightly different approach here:
I was already using auto_index: false and processing Solr updates in the background using Sidekiq. So instead of building an additional queue, I used the sidekiq-grouping gem to combine Solr update jobs into batches. Then I used Sunspot.index in the job to index the grouped objects in a single request.

Which approach and database to use in performance-critical solution

I have the following scenario:
Around 70 million equipments send a signal every 3~5 minutes to the server, sending their id, status (online or offline), IP, location (latitude and longitude), parent node and some other information.
The other information might not be in a standard format (so no schema for me), but I still need to query it.
The equipments might disappear for some time (or forever), not sending signals in the meantime. So I need a way to "forget" the equipments if they have not sent a signal in the last X days. Also, new equipments might come online at any time.
I need to query all this data, e.g. knowing how many equipments are offline in a specific region or over an IP range. There won't be many queries running at the same time.
Some of the queries need to run fast (less than 3 min per query) while the database is being updated, so I need indexes on the main attributes (id, status, IP, location and parent node). The query results do not need to be 100% accurate; eventual consistency is fine as long as it doesn't take too long (more than 20 min on average) for changes to appear in the query results.
I don't need persistence at all; if the power goes out it's okay to lose everything.
Given all this, I thought of using a NoSQL approach, maybe MongoDB or CouchDB, since I have experience with MapReduce and JavaScript, but I don't know which one is better for my problem (I'm gravitating towards CouchDB) or whether they are fit at all to handle this massive workload. I don't even know whether I actually need a "traditional" database, since I don't need persistence to disk (maybe a main-memory approach would be better?), but I do need a way to build custom queries easily.
The main problems I see are the following:
I need to insert/update lots of tuples really fast, and I don't know beforehand whether the signal I receive is already in the database or not. Almost all of the signals will be in the same state as last time, so maybe query by id and check whether the tuple changed; if not, do nothing, and if it did, update?
Forgetting offline equipments. A batch job that runs during the night removing expired tuples would solve this problem.
There won't be many queries running at the same time, but they need to run fast. So I guess I need a cluster that performs a single query on multiple nodes (does CouchDB MapReduce split the workload across multiple nodes of the cluster?). I'm not entirely sure I need a cluster though; could a single, more expensive machine handle all the load?
I have never used a NoSQL system before, but I have theoretical knowledge of the subject.
Does this make sense?
Apache Flume for collecting the signals.
It is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. Easy to configure and scale. Store the data in HDFS as files using Flume.
Hive for batch queries.
Map the data files in HDFS as external tables in Hive warehouse. Write SQL like queries using HiveQL whenever you need offline-batch processing.
HBase for random real-time reads/writes.
Since HDFS, being a file system, lacks random read/write capability, you would need a DB to serve that purpose. Looking at your use case, HBase seems good to me (see the sketch below); I would not suggest MongoDB or CouchDB, as you are not dealing with documents here and both of those are document-oriented databases.
Impala for fast, interactive queries.
Impala allows you to run fast, interactive SQL queries directly on your data stored in HDFS or HBase. Unlike Hive it does not use MapReduce. It instead leverages the power of MPP so it's good for real time stuff. And it's easy to use since it uses the same metadata, SQL syntax (Hive SQL), ODBC driver etc as Hive.
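To illustrate the HBase part mentioned above, here is a minimal client sketch; the table name, column family and row-key scheme are assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SignalStoreSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("equipment_signals"))) {

            // Upsert the latest signal for one equipment; the row key is the equipment id.
            Put put = new Put(Bytes.toBytes("eq-0001"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("status"), Bytes.toBytes("online"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("ip"), Bytes.toBytes("10.0.0.42"));
            table.put(put);

            // Random read of that equipment's current state.
            Result result = table.get(new Get(Bytes.toBytes("eq-0001")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("d"), Bytes.toBytes("status"))));
        }
    }
}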
HTH
Depending on the type of analysis, CouchDB, HBase or Flume may all be good choices. For strictly numeric "write-once" metrics data, Graphite is a very popular open-source solution.
