More specifically, what use cases does Hazelcast Jet solve that Flink does not solve (equally well), and vice versa?
NOTE: I belong to Hazelcast Jet's core engineering team.
I'd say the main advantage of Hazelcast Jet isn't in offering a brand-new computing model, but in bringing the same level of convenience that Hazelcast is known for to the realm of DAG-based distributed computing.
If you currently have a Java application running in a cluster, adding Jet will be a snap: add the Maven dependency and write one line of code to start a Jet instance on the local member. The instances will self-discover to form their own cluster, and you can now submit your job to it.
If you want a dedicated distributed computing cluster, you can download the distribution ZIP to the cluster machines. Jet has native support for the most popular cloud environments, allowing the nodes you start to self-discover. You can then connect to the cluster using a Jet client.
Needless to say, Jet makes it very convenient to use a Hazelcast IMap or IList as a data source. A Jet cluster can host Hazelcast structures directly; you then benefit from data locality and get the data with no network traffic. On the other hand, the choice of data source is completely unconstrained, and there is a public API dedicated to implementing fast, arbitrarily partitioned, custom data sources.
Jet addresses the concerns of infinite stream processing, such as aggregating over time-based windows, dealing with reordered events, and resilience to changes in the cluster topology (e.g., failure of individual Jet nodes), while maintaining the exactly-once processing guarantee.
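To make "time-based windows" and "reordered events" concrete, here is a minimal sketch in plain Python. It is not Jet's (or Flink's) API; the function name, the `allowed_lag` parameter and the drop-late-events policy are assumptions chosen for illustration. Events are counted per tumbling window, and a window is only emitted once a watermark (highest timestamp seen minus the allowed lag) has passed its end.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size, allowed_lag):
    """Count events per key in tumbling windows, tolerating limited reordering.

    events: iterable of (timestamp, key) pairs in arrival order.
    window_size: window length, in the same unit as the timestamps.
    allowed_lag: how far behind the highest seen timestamp an event may arrive.
    Returns (emitted, dropped): emitted is a list of (window_start, counts)
    in emission order, dropped is the list of events that arrived too late.
    """
    open_windows = defaultdict(lambda: defaultdict(int))
    emitted, dropped = [], []
    watermark = float("-inf")

    for ts, key in events:
        # The watermark trails the highest timestamp seen so far.
        watermark = max(watermark, ts - allowed_lag)
        start = ts - ts % window_size
        if start + window_size <= watermark:
            # The window this event belongs to was already closed.
            dropped.append((ts, key))
            continue
        open_windows[start][key] += 1
        # Emit every open window whose end is at or before the watermark.
        closable = sorted(w for w in open_windows if w + window_size <= watermark)
        for w in closable:
            emitted.append((w, dict(open_windows.pop(w))))

    # End of input: flush whatever is still open.
    for w in sorted(open_windows):
        emitted.append((w, dict(open_windows[w])))
    return emitted, dropped
```

A real engine additionally has to checkpoint `open_windows` and the watermark so that results stay exactly-once after a node failure; that part is not modelled here.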
Jet's main programming paradigm is the Pipeline API, which is quite similar to the java.util.stream API but adapted to the specifics of distributed computing (lambda serialization and other concerns).
The Pipeline API builds upon a lower-level DAG-based model that is also exposed as a public API.
In my opinion, Flink seems to offer some very useful streaming features which are not yet offered by Hazelcast Jet:
Different flexible window operators, which can also handle out-of-order and late items.
Fault tolerance on the cluster and delivery guarantees.
Besides this, it also seems to be more stable and better known at the moment.
For example, you can use it as a runtime for Apache Beam and then migrate easily between Google Cloud Dataflow in the cloud and your own deployment.
So I would currently use Flink.
Best
I have been using Apache Camel for quite a long time and have found it to be a fantastic solution for all kinds of system-integration business needs. But a couple of years back I came across Apache NiFi. After some googling I found that, although NiFi can work as an ETL tool, it is actually meant for stream processing.
In my opinion, "Which is better?" is a bad question to ask, as the answer depends on many things. But it would be nice if somebody could describe the basic differences between the two and answer the obvious question of when to use which.
That would help me decide which option fits my current requirements, or whether I should use both of them together.
The biggest and most obvious distinction is that NiFi is a no-code approach - 99% of NiFi users will never see a line of code. It is a web based GUI with a drag and drop interface to build pipelines.
NiFi can perform ETL, and can be used in batch use cases, but it is geared towards data streams. It is not just about moving data from A to B, it can do complex (and performant) transformations, enrichments and normalisations. It comes out of the box with support for many specific sources and endpoints (e.g. Kafka, Elastic, HDFS, S3, Postgres, Mongo, etc.) as well as generic sources and endpoints (e.g. TCP, HTTP, IMAP, etc.).
NiFi is not just about messages - it can work natively with a wide array of different formats, but can also be used for binary data and large files (e.g. moving multi-GB video files).
NiFi is deployed as a standalone application - it's not a framework, API or library that you integrate into something else. It is a fully self-contained application that is fully featured out of the box with no additional development, though it can be extended with custom development if required.
NiFi is natively clustered - it is expected (but not required) to be deployed on multiple hosts that work together as a cluster for performance, availability and redundancy.
So, the two tools are used quite differently - hopefully that helps highlight some of the key differences.
It's true that there is some functional overlap between NiFi and Camel, but they were designed very differently:
Apache NiFi is a data processing and integration platform that is mostly used centrally. It has a low-code approach and prefers configuration.
Apache Camel is an integration framework which is mostly used in distributed solutions. Solutions are coded in Java. Example solutions are adapters, flows, APIs, connectors, cloud functions and so on.
They can be used very well together, especially with a message broker like Apache ActiveMQ or Apache Kafka.
An example: a Java application is enhanced with Camel so that it can send messages to Kafka. In NiFi, the first step is consuming those messages from Kafka. The NiFi flow then changes the message in various steps, and partway through puts it on another Kafka topic. A Camel function (Camel K) in the cloud performs various operations on the message and, when it is finished, puts the message on a Kafka topic. The message then goes through a NiFi flow which, at the end, calls an API created with Camel.
I wrote in detail about the various ways to combine Camel and NiFi in a blog post:
https://raymondmeester.medium.com/using-camel-and-nifi-in-one-solution-c7668fafe451
What are the advantages and disadvantages of using Python or Java when developing Apache Flink Stateful Functions?
Is there any performance difference? Which one is more efficient for the same operation?
Can we develop the application completely in Python?
What are the features that one supports and the other does not?
StateFun supports embedded functions and remote functions.
Embedded functions are bundled and deployed within the JVM processes that run Flink. Therefore they must be implemented in a JVM language (like Java) and they would be the most performant. The downside is that any change to the function code requires a restart of the Flink cluster.
Remote functions execute in a separate process and are invoked by the Flink cluster for every incoming message addressed to them. Therefore they are expected to be less performant than embedded functions, but they provide great flexibility in:
Choosing an implementation language
Fast scaling up and down
Fast restart in case of a failure.
Rolling upgrades
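The reason remote functions can be restarted, scaled and upgraded so freely is that the runtime, not the function process, owns the state. The sketch below illustrates that idea only; it does not use the StateFun SDK, and `MiniRuntime`, `register`, `send` and the `greeter` signature are all hypothetical names. The function receives the state for one address together with the message and returns the new state plus an optional output.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Address = Tuple[str, str]  # (function type, id), e.g. ("greeter", "alice")


@dataclass
class MiniRuntime:
    """Hypothetical stand-in for the runtime: owns state and routes messages."""
    functions: Dict[str, Callable] = field(default_factory=dict)
    state: Dict[Address, dict] = field(default_factory=dict)
    outbox: List[str] = field(default_factory=list)

    def register(self, fn_type: str, fn: Callable) -> None:
        self.functions[fn_type] = fn

    def send(self, address: Address, message: str) -> None:
        fn = self.functions[address[0]]
        # Hand the function only the state that belongs to this address.
        current = self.state.get(address, {})
        new_state, output = fn(current, message)
        # Persist the returned state on the runtime side.
        self.state[address] = new_state
        if output is not None:
            self.outbox.append(output)


def greeter(state: dict, message: str):
    """Keeps no state of its own, so the process running it is replaceable."""
    seen = state.get("seen", 0) + 1
    return {"seen": seen}, f"Hello {message} (visit {seen})"
```

Because `greeter` holds nothing between calls, swapping in a new version of it between two `send` calls would not lose the visit counters, which is the property that makes rolling upgrades cheap.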
Can we develop the application completely in Python?
Yes, it is possible to develop an application completely in Python; see the Python greeter example.
What are the features that one supports and the other does not?
The following features are currently supported only in the Java SDK:
Richer routing logic from an ingress to a function. Any routing logic that you can describe via code.
A few more state types, like a table and a buffer.
Exposing existing Flink sources and sinks as ingresses and egresses.
I am building a mobile app with the following business requirements:
Db to be stored locally on the device for use when disconnected from the cloud.
A NoSQL type store is required to provide for future changes without requiring complex db rebuild and data migration.
Utilises a SQL query language for simple programming.
Run on all target platforms - Windows, Android, iOS
No central database server - data to be synchronised by matching two local copies of the db file.
I have examined a lot of dbs for mobile and none provide all these features except Couchbase Lite 2.1 Enterprise Edition. The downside of that is that the EE license might be price prohibitive in my use case.
[EDIT: yes, the EE license is USD$35K for <= 1000 devices, so that option is out for me, sadly.]
Are there any other such products out there that someone could point me to?
The client-side synchronization of local databases done by Couchbase Lite is a way to replicate data from one mobile device to another. It is a limited feature, though, because it works peer-to-peer. Take BitTorrent as an example, the fastest and most effective P2P protocol: it still has flaws, such as the risk of data corruption and partial data loss. A P2P synchronization would only be safe when running between two distinct applications on the same mobile device.
In case both databases are in the same mobile device and managed by the same application, it would be much simpler. You could do the synchronization yourself by reading data from one and saving in the other, and dealing with conflicts if needed.
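As a rough illustration of "reading data from one and saving in the other, and dealing with conflicts", here is a minimal sketch. It assumes (this is not from the question) that each store is a dictionary mapping a document id to a `(revision, body)` pair, and that the higher revision number wins; equal revisions with different bodies are reported as conflicts instead of being resolved silently.

```python
def merge_stores(a, b):
    """Merge two {doc_id: (revision, body)} stores; the higher revision wins.

    Returns (merged, conflicts), where conflicts lists the ids of documents
    that have the same revision number in both stores but different bodies.
    For those, the copy from `a` is kept until the conflict is resolved.
    """
    merged, conflicts = {}, []
    for doc_id in sorted(a.keys() | b.keys()):
        in_a, in_b = a.get(doc_id), b.get(doc_id)
        if in_a is None or in_b is None:
            # Present on one side only: copy it over.
            merged[doc_id] = in_a if in_b is None else in_b
        elif in_a[0] != in_b[0]:
            # Different revisions: keep the newer one.
            merged[doc_id] = max(in_a, in_b, key=lambda rec: rec[0])
        else:
            merged[doc_id] = in_a
            if in_a[1] != in_b[1]:
                conflicts.append(doc_id)
    return merged, conflicts
```

A plain revision counter cannot tell "newer" from "edited independently on both sides", and deletions are not handled at all; a production sync would need revision histories or vector clocks plus tombstones, which is a large part of what a replication product provides.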
I'm curious, why is it a requirement not to have a central database server? With a server you can fine-tune what data is shared and between which users it is shared. Here is how it works:
On server-side user registry, each user is assigned a list of channel names. At the same time, each JSON document added or updated is also linked to a list of channel names. For every pair of user x document with at least one channel name in common, the server allows push/pull replications to occur.
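The channel rule above boils down to a set intersection. The following is a minimal sketch of that rule only, not of the server's actual API; `can_replicate`, `documents_for` and the dictionary shapes are hypothetical.

```python
def can_replicate(user_channels, doc_channels):
    """True if the user and the document share at least one channel name."""
    return not set(user_channels).isdisjoint(doc_channels)


def documents_for(user, users, documents):
    """Ids of the documents that the given user may push or pull.

    users: {user name: list of channel names}
    documents: {document id: list of channel names}
    """
    channels = users.get(user, [])
    return sorted(
        doc_id
        for doc_id, doc_channels in documents.items()
        if can_replicate(channels, doc_channels)
    )
```

For example, a user assigned `["team-a", "public"]` would replicate a document tagged `["team-a"]` but not one tagged only `["team-b"]`.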
Good luck !
We have 2 server clusters: the first is made up of typical web applications backed by SQL databases. The second are highly optimized multiplayer game servers which keep all data in memory. Both clusters communicate with clients via HTTP (Ajax with JSON). There are a few cases in which we need to share data between the two server types, for example, reporting back and storing the results of a game (should ultimately end up in the database).
We're considering several approaches for inter-server communication:
Just share the MySQL databases between clusters (introduce SQL to the game servers)
Sharing data in a distributed key-value store like Memcache, Redis, etc.
Use an RPC technology like Google ProtoBufs or Apache Thrift
Using RESTful web services (the game server would POST back to the web servers, for example)
At the moment, we're leaning towards web services or just sharing the database. Sharing the database seems easy, but we're concerned this adds extra memory and a new dependency into the game servers. Web services provide good separation of concerns and fit with the existing Ajax we use, but add complexity, overhead and many more ways for communication to fail.
Are there any other good reasons not to use one or the other approach? Which would be easier to scale?
Sharing the DB brings the obvious drawback of not having one unit in control of the data going into the DB. This can be a big hassle, which is why I would recommend building an application layer.
If this application layer is what your web applications form, then I see nothing wrong with implementing client-server communication between the game servers and the web apps. Let the game servers push data to the application layer and have them subscribe to updates. This is a good fit to a message queueing system, but you could get away with building your own REST-based system for instance, if this fits better with your current architecture.
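Here is a minimal in-process sketch of the "push data to the application layer and subscribe to updates" idea. It stands in for a real message queue or REST endpoint; the class name `ResultsBroker`, the topic strings and the `stored` list (a placeholder for the SQL database) are assumptions made for illustration.

```python
from collections import defaultdict


class ResultsBroker:
    """Hypothetical application layer: the single writer to storage,
    which also notifies anyone who subscribed to a topic."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self.stored = []  # placeholder for the SQL database

    def subscribe(self, topic, callback):
        """Register a callable that receives every payload on the topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, payload):
        """Persist the payload first, then fan it out to subscribers."""
        self.stored.append((topic, payload))
        for callback in list(self._subscribers[topic]):
            callback(payload)
```

The game servers would call `publish("game-results", ...)` and never touch SQL themselves, which keeps the database dependency out of them. A real deployment would replace the direct callback with a network hop and would have to decide what happens when a subscriber is down.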
If the web apps do not form the application layer, I would suggest introducing such a layer by writing a small app which hides the specifics of the storage. Each side gets a handle to the app interface and writes its data to it.
In order to share the data between the two systems, the application layer could then use a distributed DB, like mnesia, or implement a multi-level cache system with replication. The simplest version of this would be time-triggered replication with for instance MySQL as you mention. Other options are message queues, replicated memory (Terracotta) and/or replicated caches (memcached), although these do not provide persistent storage.
I'd also suggest looking at Redis as a data store and nodered for distributed pub-sub.
Although Redis is an in-memory K/V store, the latest version has VM support where keys are kept in memory, but values may be swapped out as memory pressure hits a configurable threshold. It also has simple master-slave replication and publish-subscribe built in.
NodeRed is built on Node.js, which is a scalable and ridiculously fast server-side JavaScript runtime.
I'm building a web application that from day one will be at the limits of what a single server can handle. So I'm considering adopting a distributed architecture with several identical nodes. The goal is to provide scalability (add servers to accommodate more users) and fault tolerance. The nodes need to share some state, so some communication between them is required. I believe I have the following alternatives to implement this communication in Java:
Implement it using sockets and a custom protocol.
Use RMI
Use web services (each node can send, receive and parse HTTP requests).
Use JMS
Use another high-level framework like Terracotta or Hazelcast
I would like to know how these technologies compare to each other:
When the number of nodes increases
When the amount of communication between the nodes increases (1000s of messages per second and/or messages up to 100KB etc)
On a practical level (e.g. ease of implementation, available documentation, license issues etc.)
I'm also interested to know what technologies people are using in real production projects (as opposed to experimental or academic ones).
Don't forget Jini.
It gives you automatic service discovery, service leasing, and downloadable proxies so that the actual client/server communication protocol is up to you and not enforced by the framework (e.g. you can choose HTTP/RMI/whatever).
The framework is built around acknowledgement of the 8 Fallacies of Distributed Computing and recovery-oriented computing. i.e. you will have network problems, and the architecture is built to help you recover and maintain a service.
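To show why service leasing helps with recovery, here is a small sketch of a lease-based registry. It is written in Python and is not the Jini API; `LeaseRegistry` and its methods are hypothetical, and time is passed in explicitly to keep the example deterministic. A service that crashes or is cut off simply stops renewing, and its entry disappears on its own.

```python
class LeaseRegistry:
    """Hypothetical registry in which every registration expires unless renewed."""

    def __init__(self):
        self._expiry = {}  # service name -> time at which its lease lapses

    def register(self, name, now, duration):
        """Grant a fresh lease that lasts until now + duration."""
        self._expiry[name] = now + duration

    def renew(self, name, now, duration):
        """Extend a live lease; return False if it has already lapsed."""
        if self._expiry.get(name, float("-inf")) <= now:
            self._expiry.pop(name, None)
            return False  # the service has to register again
        self._expiry[name] = now + duration
        return True

    def lookup(self, now):
        """Names of the services whose lease is still valid at time `now`."""
        return sorted(name for name, expiry in self._expiry.items() if expiry > now)
```

No explicit "deregister on failure" path is needed, which is the point: the registry converges to the set of services that are actually alive.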
If you also use JavaSpaces, it's trivial to implement workflows and producer-consumer architectures. Producers write into the space, and one or more consumers take that work from the space (under a transaction) and process it. So you scale simply by adding more consumers.
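The producer-consumer pattern described above can be sketched with a thread-safe queue standing in for the space. This is Python's standard library, not JavaSpaces, and it does not model transactions: a task taken by a worker that then fails is lost here, whereas a take under a transaction would put it back. The function name `run_workers` is hypothetical.

```python
import queue
import threading


def run_workers(tasks, worker_count, handler):
    """Process tasks with worker_count consumers taking from a shared queue.

    Returns the handler results in completion order (not input order).
    """
    space = queue.Queue()  # stands in for the shared space
    results = []
    lock = threading.Lock()

    def consume():
        while True:
            task = space.get()
            if task is None:  # sentinel: no more work for this consumer
                return
            outcome = handler(task)
            with lock:
                results.append(outcome)

    workers = [threading.Thread(target=consume) for _ in range(worker_count)]
    for worker in workers:
        worker.start()
    for task in tasks:  # the producer side
        space.put(task)
    for _ in workers:  # one sentinel per consumer
        space.put(None)
    for worker in workers:
        worker.join()
    return results
```

Scaling is a matter of raising `worker_count`; the producer code does not change, which mirrors the "just add more consumers" property.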