Elasticsearch index as a source for an Apache Flink batch job - apache-flink

I am fairly new to Apache Flink. I have a specific requirement where I have to use an Elasticsearch index as a source. I tried to figure out whether Flink can use Elasticsearch as a source, but it doesn't seem to. I can see that Elasticsearch is supported as a sink, but there is no direct support for it as a source. Can anyone guide me on how to solve this problem? I am using Elasticsearch 5.5.0 and Flink 1.2.

I found one Flink Elasticsearch source connector implementation on GitHub: https://github.com/mnubo/flink-elasticsearch-source-connector. However, it has not been active for almost a year now and has limited support in terms of aggregations and ES versions.
Sharing this in case it meets someone's requirements.
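In the absence of an official source connector, one workaround is to pull the documents out of Elasticsearch yourself and hand them to Flink's batch API as a bounded collection. Here is a minimal sketch of that idea, assuming the Elasticsearch 5.x low-level REST client and Jackson are available; the host, index name, and query are placeholders:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class EsToFlinkBatch {
    public static void main(String[] args) throws Exception {
        // 1. Fetch the documents with a plain search request. For large
        //    indexes you would page through with the scroll API instead.
        List<String> docs = new ArrayList<>();
        try (RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build()) {
            NStringEntity query = new NStringEntity(
                "{\"size\": 1000, \"query\": {\"match_all\": {}}}",
                ContentType.APPLICATION_JSON);
            Response response = client.performRequest(
                "GET", "/my-index/_search", Collections.emptyMap(), query);
            // Extract each hit's _source as a raw JSON string.
            JsonNode hits = new ObjectMapper()
                .readTree(EntityUtils.toString(response.getEntity()))
                .path("hits").path("hits");
            for (JsonNode hit : hits) {
                docs.add(hit.path("_source").toString());
            }
        }

        // 2. Hand the documents to Flink's batch API as a bounded DataSet.
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> source = env.fromCollection(docs);
        source.print();
    }
}
```

Note that this fetches everything on the client before distributing it, so it only suits modest result sets; a proper parallel source would implement Flink's InputFormat and split the work, e.g. by shard or by scroll pages.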

Related

Apache Flink vs Apache Beam (with Flink runner)

I am considering using Flink or Apache Beam (with the Flink runner) for different stream processing applications. I am trying to compare the two options and make an informed choice. Here are the criteria I am looking into, and for which I am struggling to find information for the Flink runner (I have basically already found all the information for standalone Flink):
Ease of use
Scalability
Latency
Throughput
Versatility
Metrics generation
Can deploy with Kubernetes (easily)
Here are the other criteria for which I think I already know the answers:
Ability to do stateful operations: Yes for both
Exactly-once guarantees: Yes for both
Integrates well with Kafka: Yes for both (might be a little harder with Beam)
Languages supported:
Flink: Java, Scala, Python, SQL
Beam: Java, Python, Go
If you have any insight on these criteria for the flink runner please let me know! I will update the post if I find answers!
Update: a good article I found on the advantages of using Beam (ignore the Airflow part):
https://www.astronomer.io/blog/airflow-vs-apache-beam/
Similar to OneCricketeer's comment, it's quite subjective to compare these two.
If you are absolutely sure that you are going to use the FlinkRunner, you could just cut out the middleman and use Flink directly. That also saves you trouble in case Beam is not compatible with a specific FlinkRunner version you want to use in the future (or if there is a bug). If you are sure that all the I/Os you are going to use are well supported by Flink, and you know where and how to set up your FlinkRunner (in its different modes), it makes sense to just use Flink.
If you consider moving to other languages/runners in the future, Beam offers language and runner portabilities for you to write a pipeline once and run everywhere.
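To make the portability point concrete, here is a minimal sketch of a Beam pipeline in Java (the class name and elements are made up). The same code targets Flink simply by passing --runner=FlinkRunner, provided the Flink runner artifact is on the classpath:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class PortablePipeline {
    public static void main(String[] args) {
        // With no flags this runs on the local DirectRunner; passing
        // --runner=FlinkRunner (with the beam-runners-flink artifact on the
        // classpath) targets Flink instead -- no pipeline code changes.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply(Create.of("hello", "beam"))
         .apply(MapElements.into(TypeDescriptors.strings())
                           .via((String s) -> s.toUpperCase()));

        p.run().waitUntilFinish();
    }
}
```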
Beam supports more languages than Java, Python, and Go:
JavaScript: https://github.com/robertwb/beam-javascript
Scala: https://github.com/spotify/scio
Euphoria API
SQL
Runners:
DataflowRunner
FlinkRunner
NemoRunner
SparkRunner
SamzaRunner
Twister2Runner
Details can be found on https://beam.apache.org/roadmap/.

Flink on Kubernetes

We are building a stream processing job using Flink v1.12.2 and planning to run it on a Kubernetes cluster. Referring to the official Flink documentation, we came across two main ways of submitting Flink jobs to a Kubernetes cluster: Standalone mode and Native mode. We noticed that with the latter option there are no YAML config files, and it looks simpler. Just wondering what the recommended mode/approach is, and their pros and cons. Thank you.
Glad to hear you're trying out Flink on K8s!
The Native mode is the current recommendation for starting out on Kubernetes, as it is the simplest option, like you noted. Flink 1.13 (to be released in the coming weeks) adds support for specifying pod templates. One drawback of this approach is its limited ability to integrate with CI/CD.
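For reference, a Native mode submission in Flink 1.12+ is a single CLI call; the cluster id, image, and jar path below are placeholders. Flink talks to the Kubernetes API directly and creates the JobManager and TaskManager pods itself, which is why no YAML manifests are needed:

```
./bin/flink run-application \
    --target kubernetes-application \
    -Dkubernetes.cluster-id=my-flink-job \
    -Dkubernetes.container.image=registry.example.com/my-flink-job:latest \
    local:///opt/flink/usrlib/my-job.jar
```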
Some other popular approaches for a more "Kubernetes" style of running jobs (i.e. just YAML manifests) include Lyft's Operator, the Ververica Platform (disclaimer: I work here, on this), and Google Cloud Platform's Operator. These are all more work to set up, but they offer a better CI/CD story, which can make running Flink in production less effort in the long run.
If you'd like to talk about any of these in more depth, the User Mailing List is full of helpful people who can weigh the pros and cons that apply to your use case.

Can anyone provide a Kafka sink connector example in Java?

To be honest, I am at a very basic level with Apache Flink. I am looking for the Apache Flink sink connector that will send my messages to a Kafka topic.
Looking forward to quick help.
The Apache Flink training has an exercise on the topic of writing to and reading from Kafka. Included are reference solutions which you can use as a guide. The link I've given you is a deep link to the relevant exercise -- you'll probably want to browse around and explore more of the material there as well.
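For a quick start alongside those exercises, here is a minimal sketch of a job that writes strings to a Kafka topic, assuming the flink-connector-kafka dependency is on the classpath; the topic name, bootstrap servers, and example elements are placeholders, and the exact producer class name varies across Flink versions:

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

import java.util.Properties;

public class KafkaSinkExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // A toy bounded stream; in a real job this comes from your source.
        DataStream<String> messages = env.fromElements("hello", "kafka");

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");

        // Write each String element to the "output-topic" Kafka topic.
        messages.addSink(new FlinkKafkaProducer<>(
            "output-topic", new SimpleStringSchema(), props));

        env.execute("Write to Kafka");
    }
}
```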

How to synchronize Apache Lucene and Solr indexes and repositories in a JBoss cluster

I have a situation: I want to run my demo web application, built with EJB and Hibernate, on a JBoss cluster for high availability. In my application we use Apache Solr (and one part uses Lucene as well) for text-based search.
I got the clustering information from the official JBoss website, but I am not able to find any information about how to sync up Solr or Lucene indexes and their data repositories.
I am sure that many people must have done clustering with Lucene or Solr; can anyone please point me to the correct source on how to synchronize Solr or Lucene directories across multiple JBoss server instances?
I have an embedded Solr deployment, so, as Jayendra suggested below, Solr replication over HTTP is not possible for me. Is there any other way to do Solr replication with a repeater configuration (i.e. all my nodes acting as both master and slave)?
If you want to copy/sync data repositories for Solr, you can look at Solr Replication, which will allow you to sync data repositories across different Solr instances on different machines.
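For deployments that do expose Solr over HTTP, the repeater configuration asked about above amounts to declaring a node as both master and slave in solrconfig.xml. A rough sketch, with host names as placeholders:

```xml
<!-- A "repeater" node: slave of the upstream master, master to downstream
     nodes. confFiles and pollInterval values are illustrative. -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
  <lst name="slave">
    <str name="masterUrl">http://upstream-master:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```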
The clustering technology of JBoss and WildFly is based on the Infinispan OSS project.
Infinispan provides a highly efficient distributed storage model, and the project includes an Apache Lucene index storage layer:
http://infinispan.org/docs/dev/user_guide/user_guide.html#integrations:lucene-directory
It should be easy to replace the Solr Directory with this implementation.
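A rough sketch of what that looks like in code, assuming three replicated caches (here named lucene-metadata, lucene-data, and lucene-locks, all placeholder names) are defined in infinispan.xml:

```java
import org.apache.lucene.store.Directory;
import org.infinispan.Cache;
import org.infinispan.lucene.directory.DirectoryBuilder;
import org.infinispan.manager.DefaultCacheManager;

public class ClusteredIndexExample {
    public static void main(String[] args) throws Exception {
        // Caches configured as replicated/distributed in infinispan.xml,
        // so their contents are shared across the cluster nodes.
        DefaultCacheManager manager = new DefaultCacheManager("infinispan.xml");
        Cache<?, ?> metadata = manager.getCache("lucene-metadata");
        Cache<?, ?> data = manager.getCache("lucene-data");
        Cache<?, ?> locks = manager.getCache("lucene-locks");

        // Every node that builds this Directory sees the same shared index.
        Directory dir = DirectoryBuilder
            .newDirectoryInstance(metadata, data, locks, "my-index")
            .create();

        // Use `dir` with a standard Lucene IndexWriter/IndexReader as usual.
    }
}
```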

Migrate data from Solr 3

I'm thinking about migrating from Solr 3 to SolrCloud or Elasticsearch, and was wondering whether it is possible to import data indexed with Solr 3.x into SolrCloud (Solr 4) and/or Elasticsearch.
They're all Lucene based, but since they behave differently I'm not really sure that it will work.
Has anyone ever done this? How did it go? Any related issues?
Regarding importing data from Solr to Elasticsearch, you can take a look at the Elasticsearch mock Solr plugin. It adds a new Solr-like endpoint to Elasticsearch, so that you can use the indexer that you've written for Solr (if you have one) to index documents in Elasticsearch.
Also, I've been working on an Elasticsearch Solr river which allows you to import data from Solr to Elasticsearch through the SolrJ library. The only limitation is that it can import only the fields that you configured as stored in Solr. I should be able to make it public pretty soon, just a matter of days. I'll update my answer as soon as it's available.
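In the meantime, the core of what such a river automates can be sketched directly with SolrJ and the Elasticsearch REST client (shown here with present-day client libraries for illustration; the URLs, index and collection names are placeholders):

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;
import org.elasticsearch.client.RestClient;

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class SolrToEsMigration {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        HttpSolrClient solr = new HttpSolrClient.Builder(
            "http://localhost:8983/solr/my-collection").build();
        RestClient es = RestClient.builder(
            new HttpHost("localhost", 9200, "http")).build();

        // Fetch one page of documents; real code would loop with cursorMark
        // (or start/rows paging) until the whole index is copied.
        SolrQuery query = new SolrQuery("*:*").setRows(100);
        for (SolrDocument doc : solr.query(query).getResults()) {
            // Only fields marked stored="true" in the Solr schema are here.
            Map<String, Object> fields = new HashMap<>();
            for (String name : doc.getFieldNames()) {
                fields.put(name, doc.getFieldValue(name));
            }
            String id = String.valueOf(doc.getFieldValue("id"));
            es.performRequest("PUT", "/my-index/doc/" + id,
                Collections.emptyMap(),
                new NStringEntity(mapper.writeValueAsString(fields),
                                  ContentType.APPLICATION_JSON));
        }
        solr.close();
        es.close();
    }
}
```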
Regarding the upgrade of Solr from 3.x to 4.0: not a big deal. The index format has changed, but Solr will take care of upgrading the index. That happens automatically once you start Solr with your old index, but after that the index can no longer be read by a previous Solr/Lucene version. If you have a master/slave setup, you should upgrade the slaves first; otherwise the index on the master would be replicated to slaves that cannot read it yet.
UPDATE
Regarding the river that I mentioned: I made it public, you can download it from my github profile: https://github.com/javanna/elasticsearch-river-solr.
