Unified connectors for DataStream/Table APIs - apache-flink

I am writing a simple connector (source/sink) for Flink 1.14.4 which mostly wraps the official Kafka connector and automatically sets up custom serializers/deserializers. I'm a bit confused about the current state of the new source/sink interfaces introduced in FLIP-27 and FLIP-143. Is it currently possible to write truly unified connectors (that is, connectors that work across different APIs, such as DataStream and Table)? Looking at the code of the current Kafka connector, I see it comes in both legacy and new flavours, but AFAIK the connector for the Table API still relies on the legacy API only. Also, reading the official documentation:
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/table/sourcessinks/
It seems that the new interfaces cannot yet be used for the Table API. To make it worse, I find it very confusing that only sources are mentioned in the DataStream section, which already describes the new approach:
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/sources/
but nothing is said regarding sinks. Overall, I think this leaves users without a clear idea of how to approach the creation of custom connectors as of today. In particular, I would expect an equivalent section for the DataStream API, i.e. one covering the creation of user-defined sources and sinks, like the one linked above for the Table API.

The unified Source and Sink APIs (FLIP-27 and FLIP-143) were introduced to provide a single connector interface that can be used for both bounded (batch) and unbounded (streaming) data.
Both interfaces allow for building a source/sink that you can use in either DataStream or Table/SQL API. That's currently already the case for FileSystem, Kafka and Pulsar (as of Flink 1.15 which will be released shortly).
You're absolutely right that the current documentation doesn't make this clear. At the moment, the Flink community is working on externalizing the connectors (moving each of them from the Flink repository to their own individual repository) and overhauling the documentation and guides on how to write a connector.
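To illustrate what the unified interfaces look like from the DataStream side, here is a minimal sketch using the FLIP-27 KafkaSource and FLIP-143 KafkaSink that ship with Flink 1.14; the broker address, topics and group id are placeholders:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnifiedKafkaJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // FLIP-27 source: the same interface serves bounded and unbounded reads
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")            // placeholder
                .setTopics("input-topic")                         // placeholder
                .setGroupId("my-group")                           // placeholder
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> stream =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source");

        // FLIP-143 sink
        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("output-topic")                 // placeholder
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();

        stream.sinkTo(sink);
        env.execute("unified kafka job");
    }
}

For the Table/SQL side, a custom DynamicTableSource can hand the very same FLIP-27 source to the planner via SourceProvider.of(...), which is how a wrapper connector like the one described in the question can stay unified across the two APIs.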

Related

What data sources does Snowflake support?

As mentioned in the title, I'd like to know what data sources Snowflake supports. I'm not completely sure how to even approach this question. I know you can create an external stage in the cloud storage of supported cloud providers, but what if I want to load data from an Oracle database, for example? What's the best solution in that case, is it to use the ODBC driver, or something else?
And please feel free to give me any suggestions, or advice on where to continue my research. Also, let me know if any part of my question is unclear so that I can rephrase it :)
Snowflake natively supports Avro, Parquet, CSV, JSON and ORC. These are landed in a stage for ingestion: your ELT/ETL tool of choice, or even a home-built application, must land the data in a stage, either internal or external.
The file is then ingested into Snowflake using a COPY command, either driven by said tool or by something like Snowpipe.
We have documentation on Firehose / Kafka pipelines landing data for Snowpipe to ingest, either through AUTO_INGEST notifications (limited to external stages) or by calling our REST API.
All of this is covered by our documentation; simply search for the terms I have mentioned and you will find plenty of detail.
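As a concrete illustration of that land-then-COPY flow, here is a minimal sketch of a home-built Java loader using the Snowflake JDBC driver (net.snowflake:snowflake-jdbc); the account URL, credentials, file path, stage and table name are all placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

public class LoadCsvIntoSnowflake {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("user", "MY_USER");            // placeholder credentials
        props.put("password", "MY_PASSWORD");
        props.put("db", "MY_DB");
        props.put("schema", "PUBLIC");
        props.put("warehouse", "MY_WH");

        // Hypothetical account locator
        String url = "jdbc:snowflake://myaccount.snowflakecomputing.com/";

        try (Connection conn = DriverManager.getConnection(url, props);
             Statement stmt = conn.createStatement()) {
            // 1. Land the local file in the table's internal stage
            stmt.execute("PUT file:///tmp/orders.csv @%ORDERS AUTO_COMPRESS=TRUE");
            // 2. Ingest the staged file into the table with COPY
            stmt.execute("COPY INTO ORDERS FROM @%ORDERS FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)");
        }
    }
}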
Multiple existing ETL tools allow you to define Snowflake as a destination, supporting a wide variety of sources.
Native Programmatic Interfaces
Snowflake Ecosystem - Data Integration

Using Python Processors in Java Flink Application

I have a use case where I want to implement an AWS Kinesis Data Application with Flink in Java. It will listen to multiple Kinesis streams via the Data Streams API. However, the analysis of those streams will be done in Python (since our data scientists prefer Python).
From this answer, there appears to be support for calling Python UDFs from Java. However, I want to be able to convert an incoming stream to a table, via
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
Table sessionsTable = tableEnv.fromDataStream(inputStream);
...and then have a Python processor that is invoked to process that stream.
I really have 3 questions here:
Is this a supported use case?
If so, is there documentation that describes how to do so?
If so, will this add significant overhead to the application?
The starting point in the Flink documentation for learning about using Python with Tables and DataStreams is at https://ci.apache.org/projects/flink/flink-docs-stable/docs/dev/python/overview/.
The Python APIs only provide a subset of what's available from Java; you'll have to look and see if what you need is included.
Not sure about performance, but you can, for example, convert back and forth between Flink Tables and Pandas dataframes.
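For the specific pattern in the question (Java job, Python logic), one route in 1.14 is to register a Python scalar UDF from the Java Table API via SQL DDL and apply it to the table created with fromDataStream. A rough sketch, assuming the flink-python dependency is available on the cluster and where my_udfs.py and its score function are hypothetical and must be shipped with the job:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class PythonUdfFromJava {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // Ship the Python module containing the UDF with the job (path is a placeholder)
        tableEnv.getConfig().getConfiguration().setString("python.files", "/path/to/my_udfs.py");

        // Register the Python UDF declared in my_udfs.py (hypothetical module and function)
        tableEnv.executeSql(
                "CREATE TEMPORARY SYSTEM FUNCTION score AS 'my_udfs.score' LANGUAGE PYTHON");

        // Stand-in for the Kinesis input stream from the question
        DataStream<String> inputStream = env.fromElements("a", "b", "c");
        Table sessionsTable = tableEnv.fromDataStream(inputStream);
        tableEnv.createTemporaryView("sessions", sessionsTable);

        // Apply the Python function like any other scalar function
        Table scored = tableEnv.sqlQuery("SELECT score(f0) FROM sessions");
        scored.execute().print();
    }
}

Regarding overhead: the Python UDF runs in a separate Python worker process, so there is serialization and inter-process communication per record, which is the main cost to measure for your workload.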

Apache Camel vs Apache Nifi

I have been using Apache Camel for quite a long time and have found it to be a fantastic solution for all kinds of system-integration needs. But a couple of years back I came across Apache NiFi. After some googling I found that although NiFi can work as an ETL tool, it is actually meant for stream processing.
In my opinion, "which is better" is a bad question to ask, as the answer depends on many things. But it would be nice if somebody could describe the basic differences between the two, as well as answer the obvious question of when to use which.
It would help me decide, given my current requirements, which is the better option in my context, or whether I should use both of them together.
The biggest and most obvious distinction is that NiFi is a no-code approach - 99% of NiFi users will never see a line of code. It is a web based GUI with a drag and drop interface to build pipelines.
NiFi can perform ETL, and can be used in batch use cases, but it is geared towards data streams. It is not just about moving data from A to B, it can do complex (and performant) transformations, enrichments and normalisations. It comes out of the box with support for many specific sources and endpoints (e.g. Kafka, Elastic, HDFS, S3, Postgres, Mongo, etc.) as well as generic sources and endpoints (e.g. TCP, HTTP, IMAP, etc.).
NiFi is not just about messages - it can work natively with a wide array of different formats, but can also be used for binary data and large files (e.g. moving multi-GB video files).
NiFi is deployed as a standalone application - it's not a framework or API or library or something that you integrate into something else. It is a fully self-contained, realised application that is feature-complete out of the box with no additional development, though it can be extended with custom development if required.
NiFi is natively clustered - it expects (though it is not required) to be deployed on multiple hosts that work together as a cluster for performance, availability and redundancy.
So, the two tools are used quite differently; hopefully that helps highlight some of the key differences.
It's true that there is some functional overlap between NiFi and Camel, but they were designed very differently:
Apache NiFi is a data processing and integration platform that is mostly used centrally. It has a low-code approach and prefers configuration.
Apache Camel is an integration framework which is mostly used in distributed solutions. Solutions are coded in Java. Example solutions are adapters, flows, APIs, connectors, cloud functions and so on.
They can be used very well together. Especially when using a message broker like Apache ActiveMQ or Apache Kafka.
An example: a Java application is enhanced with Camel so that it can send messages to Kafka. In NiFi, the first step is consuming those messages from Kafka. Then, in the NiFi flow, the message is changed in various steps. In the middle, the message is put on another Kafka topic. A Camel function (Camel K) in the cloud performs various operations on the message and, when it's finished, puts the message on a Kafka topic. The message then goes through a NiFi flow which at the end calls an API created with Camel.
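As a small illustration of the Camel side of that example, here is a minimal sketch of a route that picks up files from a directory and publishes them to a Kafka topic using the camel-kafka component; the directory, topic and broker address are placeholders:

import org.apache.camel.CamelContext;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

public class FileToKafkaRoute {
    public static void main(String[] args) throws Exception {
        CamelContext context = new DefaultCamelContext();
        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                // Watch a directory and publish each file's content to a Kafka topic
                from("file:data/inbox")                               // placeholder directory
                        .to("kafka:orders?brokers=localhost:9092");   // placeholder topic/brokers
            }
        });
        context.start();
        Thread.sleep(60_000);   // keep the JVM alive for a while in this standalone sketch
        context.stop();
    }
}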
In a blog post I wrote in detail about the various ways to combine Camel and NiFi:
https://raymondmeester.medium.com/using-camel-and-nifi-in-one-solution-c7668fafe451

What are the differences between Hazelcast Jet and Apache Flink

More specifically, what usecases does Hazelcast Jet solve that Flink does not solve (equally well) and vice versa?
NOTE: I belong to Hazelcast Jet's core engineering team.
I'd say the main advantage of Hazelcast Jet isn't in offering a brand-new computing model, but in bringing the same level of convenience that Hazelcast is known for to the realm of DAG-based distributed computing.
If you currently have a Java application running in a cluster, adding Jet will be a snap: add the Maven dependency and write one line of code to start a Jet instance on the local member. The instances will self-discover to form their own cluster, and you can now submit your job to it.
If you want a dedicated distributed computing cluster, you can download the distribution ZIP to the cluster machines. Jet has native support for the most popular cloud environments, allowing the nodes you start to self-discover. You can then connect to the cluster using a Jet client.
Needless to say, Jet makes it very convenient to use a Hazelcast IMap or IList as a data source. A Jet cluster can host Hazelcast structures directly; you then benefit from data locality and get the data with no network traffic. On the other hand, the choice of data source is completely unconstrained, and there is a public API dedicated to implementing fast, arbitrarily partitioned, custom data sources.
Jet solves the concerns of infinite streams processing like aggregating over time-based windows, dealing with reordered events and resilience to changes in the cluster topology (e.g., failure of individual Jet nodes) while maintaining the Exactly-Once processing guarantee.
Jet's main programming paradigm is the Pipeline API which is quite similar to java.util.stream API but adapted to the specifics of distributed computing (lambda serialization and other concerns).
Pipeline API builds upon a lower-level DAG-based model that is also exposed as public API.
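A minimal sketch of that Pipeline API, assuming the standalone Jet 4.x artifacts: it starts an embedded member, reads from an IList, and writes the transformed items to another IList.

import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

import java.util.ArrayList;
import java.util.Arrays;

public class WordLengths {
    public static void main(String[] args) {
        // One call to start an embedded member; members self-discover and form a cluster
        JetInstance jet = Jet.newJetInstance();
        try {
            jet.getList("words").addAll(Arrays.asList("alpha", "beta", "gamma"));

            Pipeline p = Pipeline.create();
            p.readFrom(Sources.<String>list("words"))
             .map(String::length)
             .writeTo(Sinks.list("lengths"));

            jet.newJob(p).join();
            System.out.println(new ArrayList<>(jet.getList("lengths"))); // [5, 4, 5]
        } finally {
            Jet.shutdownAll();
        }
    }
}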
In my opinion, Flink seems to offer some very useful streaming features which are not yet offered by Hazelcast Jet:
Different flexible window operators, which can also handle out-of-order and late items (see the sketch after this list).
Fault tolerance on the cluster and delivery guarantees.
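For the first point, a minimal sketch of a Flink event-time window that tolerates out-of-order and late events; the elements and timestamps here are made up for illustration:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.time.Duration;

public class LateEventWindow {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // (word, count) elements; in a real job they would come from Kafka, Kinesis, etc.
        DataStream<Tuple2<String, Integer>> events = env
            .fromElements(Tuple2.of("a", 1), Tuple2.of("a", 2), Tuple2.of("b", 3))
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Tuple2<String, Integer>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    // made-up timestamps; a real job would extract them from the event
                    .withTimestampAssigner((e, ts) -> System.currentTimeMillis()));

        events.keyBy(e -> e.f0)
              // 1-minute event-time windows; events up to 30 s late are still incorporated
              .window(TumblingEventTimeWindows.of(Time.minutes(1)))
              .allowedLateness(Time.seconds(30))
              .sum(1)
              .print();

        env.execute("late-event windows");
    }
}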
Besides this, it also seems to be more stable and better known at the moment.
For example, you can use it as a runtime for Apache Beam and then migrate easily between Google Cloud Dataflow in the cloud and your own deployment.
So I would currently use Flink.
Best

Pros and cons using Lucidworks Fusion instead of regular Solr

I want to know the pros and cons of using Fusion instead of regular Solr. Can you give some examples (like a problem that can be solved easily using Fusion)?
First of all, I should disclose that I am the Product Manager for Lucidworks Fusion.
You seem to already be aware that Fusion works with Solr (or one or more Solr clusters or instances), using Solr for data storage and querying. The purpose of Fusion is to make it easier to use Solr, integrate Solr, and to build complex solutions that make use of Solr. Some of the things that Fusion provides that many people find helpful for this include:
Connectors and a connector framework. Bare Solr gives you a good API and the ability to push certain types of files at the command line. Fusion comes with several pre-built data source connectors that fetch data from various types of systems, process it as appropriate (including parsing, transformation, and field mapping), and send the results to Solr. These connectors include common document stores (cloud and on-premise), relational databases, NoSQL data stores, HDFS, enterprise applications, and a very powerful and configurable web crawler.
Security integration. Solr does not have any authentication or authorization (though as of version 5.2 this week, it does have a pluggable API and a basic implementation of Kerberos for authentication). Fusion wraps the Solr APIs with a secured version. Fusion has clean integrations with LDAP, Active Directory, and Kerberos for authentication. It also has a fine-grained authorization model for managing and configuring Fusion and Solr. And the Fusion authorization model can automatically link group memberships from LDAP/AD with access control lists from the Fusion connector data sources, so that you get document-level access control mirrored from your source systems when you run search queries.
Pipelines processing model. Fusion provides a pipeline model with modular stages (in both API and GUI form) to make it easier to define and edit transformations of data and documents. It is analogous to unix shell pipes. For example, while indexing you can include stages to define mappings of fields, compute new fields, aggregate documents, pull in data from other sources, etc. before writing to Solr. When querying, you could do the same, along with transforming the query, running and returning the results of other analytics, and applying security filtering.
Admin GUI. Fusion has a web UI for viewing and configuring the above (as well as the base Solr config). We think this is convenient for people who want to use Solr, but don't use it regularly enough to remember how to use the APIs, config files, and command line tools.
Sophisticated search-based features: Using the pipelines model described above, Fusion includes (and makes easy to use) some richer search-based components, including natural language processing and entity extraction modules, and real-time signals-driven relevancy adjustment. We intend to provide more of these in the future.
Analytics processing: Fusion includes and integrates Apache Spark for running deep analytics against data stored in Solr (or on its way in to Solr). While Solr implicitly includes certain data analytics capabilities, that is not its main purpose. We use Apache Spark to drive Fusion's signals extraction and relevancy tuning, and expect to expose APIs so users can easily run other processing there.
Other: many useful miscellaneous features like: dashboarding UI; basic search UI with manual relevancy tuning; easier monitoring; job management and scheduling; real-time alerting with email integration, and more.
A lot of the above can of course be built or written against Solr, without Fusion, but we think that providing these kinds of enterprise integrations will be valuable to many people.
Pros:
Connectors: Lucidworks provides a wide range of connectors with which you can connect to data sources and pull the data from them.
Reusability: In Lucidworks you can create pipelines for data ingestion and data retrieval. You can create pipelines with common logic so that they can be reused in other pipelines.
Security: You can apply restrictions over data, i.e. security trimming. Lucidworks provides built-in query-pipeline stages for security trimming, or you can write a custom pipeline for your use case.
Troubleshooting: Lucidworks comes with discrete services, i.e. API, connectors, Solr. You can troubleshoot issues per service, and each service has its own logs. You can also configure JVM properties for each service.
Support: Lucidworks support is available 24/7. You can create a support case according to the severity and they will schedule a call with you.
Cons:
Not many, but it keeps you away from your normal development; you don't get much chance to open your IDE and start coding.
