Difference partition_key vs key in Apache Camel - apache-camel

I am using camel-kafka version 2.19.2 and I went through the documentation on Camel's website (camel-kafka) before posting this question here.
I saw that the documentation gives the following example for producing messages:
from("direct:start").process(new Processor() {
#Override
public void process(Exchange exchange) throws Exception {
exchange.getIn().setBody("Test Message from Camel Kafka Component Final",String.class);
exchange.getIn().setHeader(KafkaConstants.PARTITION_KEY, 0);
exchange.getIn().setHeader(KafkaConstants.KEY, "1");
}
}).to("kafka:localhost:9092?topic=test");
As you can see, both KafkaConstants.PARTITION_KEY and KafkaConstants.KEY exist.
For reference, my Kafka topic has 4 partitions.
I played around with both of them and understood that KafkaConstants.KEY acts as the message key, and that this key is used to determine which partition the message goes to.
KafkaConstants.PARTITION_KEY is the one I am confused about: I tried setting a partition number, but the messages were still sent to all 4 partitions.
Can anyone explain the difference between these two KafkaConstants, and specifically what KafkaConstants.PARTITION_KEY is used for?
EDIT: Corrected the Camel version being used.

From the GitHub docs for 2.21.0-SNAPSHOT:
KEY:
The record key (or null if no key is specified). If this option has been configured then it takes precedence over the header KafkaConstants.KEY.
PARTITION_KEY:
The partition to which the record will be sent (or null if no partition was specified). If this option has been configured then it takes precedence over the header KafkaConstants.PARTITION_KEY.
Source: https://github.com/apache/camel/blob/master/components/camel-kafka/src/main/docs/kafka-component.adoc
For completeness:
In Kafka:
data is actually a key-value pair
its storage happens at a partition level
The key is used for intelligent and efficient data distribution within the cluster. Depending on the key, Kafka sends the data to a specific partition and ensures that it's replicated as well (per configuration).
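To make that concrete with the plain Kafka producer API (a sketch, not from the original question; the broker address and topic name are assumptions), this is roughly what the Camel component builds under the hood:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
KafkaProducer<String, String> producer = new KafkaProducer<>(props);

// Key only: the default partitioner hashes the key to choose the partition.
producer.send(new ProducerRecord<>("test", "1", "some value"));

// Explicit partition plus key: the given partition (here 2) wins over the key hash.
producer.send(new ProducerRecord<>("test", 2, "1", "some value"));

producer.close();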
As you can see in the master branch of Camel-Kafka from Apache Camel project:
https://github.com/apache/camel/blob/master/components/camel-kafka/src/main/java/org/apache/camel/component/kafka/KafkaProducer.java#L202-L215
You'll need to specify a key in order to use the partition key you are specifying in your Camel route.
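For example, a sketch (not taken verbatim from the Camel docs, and assuming the same 4-partition test topic) that should land every message on partition 2 by setting both headers:

from("direct:start")
    .process(exchange -> {
        exchange.getIn().setBody("Test Message", String.class);
        // The message key; used by Kafka's default partitioner when no explicit partition is given.
        exchange.getIn().setHeader(KafkaConstants.KEY, "customer-42");
        // The explicit target partition; per the producer code linked above, it is only
        // honoured when KafkaConstants.KEY is also set.
        exchange.getIn().setHeader(KafkaConstants.PARTITION_KEY, 2);
    })
    .to("kafka:localhost:9092?topic=test");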

Related

An Alternative Approach for Broadcast stream

I have two different streams in my Flink job:
The first one represents a set of rules which will be applied to the actual stream. I've simply broadcast this set of rules. Changes come from Kafka, and there can be a few changes each hour (around 100-200 per hour).
The second one is the actual stream, called the customer stream, which contains some numeric values for each customer. This is basically a keyed stream based on customerId.
So, basically, I'm preparing my actual customer stream data, then applying some rules on the keyed stream and getting the calculated results.
I also know which rules should be calculated by checking a field of the customer stream data. For example, if a field of the customer data contains value X, the job has to apply only rule1, rule2 and rule5 instead of calculating all the rules (let's say there are 90 rules) for the given customer. Of course, in this case, I have to fetch and filter all the rules by the field value of the incoming data.
Everything is fine in this scenario, and it perfectly fits the broadcast pattern. But the problem is the huge broadcast size. Sometimes it can be very large, like 20 GB or more, which I suppose is far too big for broadcast state.
Is there any alternative approach to get around this limitation? For example, using a RocksDB backend (I know it's not supported, but I could implement a custom state backend for broadcast state if there is no fundamental limitation here).
Would anything change if I connected both streams without broadcasting the rules stream?
From your description it sounds like you might be able to avoid broadcasting the rules (by turning this around and broadcasting the primary stream to the rules). Maybe this could work:
make sure each incoming customer event has a unique ID
key-partition the rules so that each rule has a distinct key
broadcast the primary stream events to the rules (and don't store the customer events)
union the outputs from applying all the rules
keyBy the unique ID from step (1) to bring together the results from applying each of the rules to a given customer event, and assemble a unified result
https://gist.github.com/alpinegizmo/5d5f24397a6db7d8fabc1b12a15eeca6 shows how to do fan-out/fan-in with Flink -- see that for an example of steps 1, 4, and 5 above.
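For steps 2 and 3, here is a minimal sketch of the inverted pattern; the CustomerEvent, Rule and RuleResult types and their fields are hypothetical placeholders, not something from your job:

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class InvertedBroadcastSketch {

    // Hypothetical placeholder types, just enough to make the sketch compile.
    public static class CustomerEvent { public String eventId; public String fieldValue; }
    public static class Rule {
        public String ruleId;
        public boolean appliesTo(CustomerEvent e) { return true; }
        public double evaluate(CustomerEvent e) { return 0.0; }
    }
    public static class RuleResult {
        public String eventId; public String ruleId; public double value;
        public RuleResult() {}
        public RuleResult(String eventId, String ruleId, double value) {
            this.eventId = eventId; this.ruleId = ruleId; this.value = value;
        }
    }

    // The broadcast descriptor only acts as a "channel"; the customer events are not stored.
    static final MapStateDescriptor<String, CustomerEvent> CUSTOMER_DESCRIPTOR =
            new MapStateDescriptor<>("customers", String.class, CustomerEvent.class);

    public static DataStream<RuleResult> wire(DataStream<Rule> rules, DataStream<CustomerEvent> customers) {
        // Step 2: key-partition the rules so each rule lives in keyed state on some subtask.
        // Step 3: broadcast the customer events to all rule partitions.
        BroadcastStream<CustomerEvent> broadcastCustomers = customers.broadcast(CUSTOMER_DESCRIPTOR);
        return rules
                .keyBy((Rule r) -> r.ruleId)
                .connect(broadcastCustomers)
                .process(new ApplyRules());
        // Steps 4 and 5 happen downstream: keyBy(result -> result.eventId) and assemble a unified result.
    }

    public static class ApplyRules
            extends KeyedBroadcastProcessFunction<String, Rule, CustomerEvent, RuleResult> {

        private final ValueStateDescriptor<Rule> ruleStateDescriptor =
                new ValueStateDescriptor<>("rule", Rule.class);

        @Override
        public void processElement(Rule rule, ReadOnlyContext ctx, Collector<RuleResult> out) throws Exception {
            // Keyed side: store (or update) the rule for this key.
            getRuntimeContext().getState(ruleStateDescriptor).update(rule);
        }

        @Override
        public void processBroadcastElement(CustomerEvent event, Context ctx, Collector<RuleResult> out)
                throws Exception {
            // Broadcast side: evaluate the event against every rule stored on this subtask,
            // without storing the event itself.
            ctx.applyToKeyedState(ruleStateDescriptor, (String ruleId, ValueState<Rule> state) -> {
                Rule rule = state.value();
                if (rule != null && rule.appliesTo(event)) {
                    out.collect(new RuleResult(event.eventId, ruleId, rule.evaluate(event)));
                }
            });
        }
    }
}

Whether this actually wins depends on the relative volumes: every customer event now gets duplicated to every parallel subtask, but the 20 GB of rules stays partitioned in (for example, RocksDB-backed) keyed state instead of being replicated as broadcast state.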
If there's no way to partition the rules dataset, then I don't think you get a win by trying to connect streams.
I would check out Apache Ignite as a way of sharing the rules across all of the subtasks processing the customer stream. See this article for a description of how this could be done.

Flink job production readiness - validate UUIDs assigned to all operators

The Flink production readiness checklist (https://ci.apache.org/projects/flink/flink-docs-stable/ops/production_ready.html) suggests assigning UUIDs to all operators. I'm looking for a way to validate that all operators in a given job graph have been assigned UUIDs -- ideally to be used as a pre-deployment check in our CI flow.
We already have a process in place that uses the PackagedProgram class to get a JSON-formatted 'preview plan'. Unfortunately, that does not include any information about the assigned UUIDs (or lack thereof).
Digging into the code behind generating the JSON preview plan (PlanJSONDumpGenerator), I can trace how it visits each of the nodes as a DumpableNode<?>, but from there I can't find anything that leads me to the definition of the operator with its UUID.
When defining the job (using the DataStream API), the UUID is assigned on a StreamTransformation<T>. Is there any way to connect the data in the PackagedProgram back to the original StreamTransformation<T>s to get the UUID?
Or is there a better approach to doing this type of validation?
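For context, one direction that might work (a sketch only; it leans on Flink internals, and StreamGraph, StreamNode and the getTransformationUID/getOperatorName accessors are assumptions on my part that may change between versions) is building the StreamGraph in a test and asserting that every node carries a UID:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.graph.StreamGraph;
import org.apache.flink.streaming.api.graph.StreamNode;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
buildJob(env);  // hypothetical: the same code that defines the production pipeline
StreamGraph graph = env.getStreamGraph();
for (StreamNode node : graph.getStreamNodes()) {
    if (node.getTransformationUID() == null) {
        throw new IllegalStateException("Operator is missing uid(): " + node.getOperatorName());
    }
}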

Apache Flink Set Operator Uid vs UidHash

I'm using Apache Flink 1.2.0. According to Production Readiness Checklist (https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/production_ready.html) it is recommended to set Uids for operators to ensure compatibility for savepoints.
I couldn't find a setUid() method for a flatMap, but I found uid() and setUidHash(), whose docs say:
uid
"Sets an ID for this operator.
The specified ID is used to assign the same operator ID across job submissions (for example when starting a job from a savepoint)."
uidHash
"Sets an user provided hash for this operator. This will be used AS IS the create the JobVertexID.
The user provided hash is an alternative to the generated hashes, that is considered when identifying an operator through the default hash mechanics fails (e.g. because of changes between Flink versions)."
Which one actually should be set on a flatMap for example uid() or setUidHash()? Or both?
The uid() method is recommended in this case.
setUidHash() should be used only as a workaround to fix up jobs created with default uids instead of user-defined ones. As stated in the Javadoc:
this should be used as a workaround or for trouble shooting. The provided hash needs to be unique per transformation and job. Otherwise, job submission will fail. Furthermore, you cannot assign user-specified hash to intermediate nodes in an operator chain and trying so will let your job fail.
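For example, a minimal sketch (the input stream, the operator and the uid strings are made up) of pinning the operator ID on a flatMap:

DataStream<String> enriched = input
        .flatMap(new MyEnrichmentFunction())
        .uid("enrich-flat-map")    // stable operator ID, matched against savepoint state on restore
        .name("enrich-flat-map");  // display name only; not used for state matching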

Google Datastore app architecture questions

I'm working on a Google AppEngine app connecting to the Google Cloud Datastore via its JSON API (I'm using PHP).
I'm reading all the documentation provided by Google and I still have questions:
In the documentation about transactions, there is the following mention: "Transactions must operate on entities that belong to a limited number (5) of entity groups" (by the way, a few lines later we find: "All Datastore operations in a transaction can operate on a maximum of twenty-five entity groups"). I'm not sure what an entity group is. Let's say I have an object Country which is identified only by its kind (COUNTRY) and an auto-assigned Datastore key id, so there is no ancestor path, hierarchical relationship, etc. Do all the Country entities count as only one entity group, or does each country count as one?
For the Country entity kind I need an incremental unique id (like SQL's AUTOINCREMENT). It has to be absolutely unique and without gaps. Also, this kind of object won't be created more than a few times per minute, so there is no need to handle contention and sharding. I'm thinking of having a unique counter that acts as the auto increment and using it inside a transaction. Is the following code pattern OK?
Start a transaction, get the counter, and commit the creation of the Country along with the update of the counter; roll back the transaction if the commit fails. Does this pattern prevent the assignment of two identical ids? Can you confirm that if two processes get the counter at the same time (so the same value), the first one that commits will make the other fail (so it can restart and get the new counter value)?
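For reference, here is a minimal sketch of that pattern using the Java client library (com.google.cloud.datastore), purely for illustration since my app actually talks to the Datastore through the JSON API from PHP; the kind and property names are made up:

import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.DatastoreException;
import com.google.cloud.datastore.DatastoreOptions;
import com.google.cloud.datastore.Entity;
import com.google.cloud.datastore.Key;
import com.google.cloud.datastore.Transaction;

Datastore datastore = DatastoreOptions.getDefaultInstance().getService();
Key counterKey = datastore.newKeyFactory().setKind("Counter").newKey("country-counter");

Transaction txn = datastore.newTransaction();
try {
    // Read the current counter value inside the transaction.
    Entity counter = txn.get(counterKey);
    long nextId = counter.getLong("value") + 1;

    Key countryKey = datastore.newKeyFactory().setKind("Country").newKey(nextId);
    Entity country = Entity.newBuilder(countryKey).set("name", "France").build();

    // Write the new Country and the incremented counter atomically.
    txn.put(country, Entity.newBuilder(counter).set("value", nextId).build());
    txn.commit();  // fails if a concurrent transaction already modified the counter
} catch (DatastoreException e) {
    if (txn.isActive()) {
        txn.rollback();
    }
    // Retry here: the other writer won the race, so re-read the counter and try again.
}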
The documentation also mentions that: "If your application receives an exception when attempting to commit a transaction, it does not necessarily mean that the transaction has failed. It is possible to receive exceptions or error messages even when a transaction has been committed and will eventually be applied successfully." How are we supposed to handle that case? If this behavior occurs on the creation of my country (question #2), I will have an issue with my auto-increment id, won't I?
Since the Datastore requires all the write actions of a transaction to be done in a single call, and since the transaction ensures that either all or none of the transaction's actions will be performed, why do we have to issue a rollback?
Is the limit of 1 write/sec per entity (i.e. something defined by its kind and its key path) rather than per entity group? (I will only be reassured once I'm sure what exactly an entity group is; see question #1.)
I'm stopping here so as not to make a huge post. I'll probably come back with other (or refined) questions after getting answers to these ones ;-)
Thanks for your help.
[UPDATE] Country is just used as a sample class object.
No, ('Country', 123123) and ('Country', 679621) are not in the same entity group. But ('Country', 123123, 'City', '1') and ('Country', 123123, 'City', '2') are in the same entity group. Entities with the same ancestor are in the same group.
Sounds like a really bad idea to use auto-increment for things like countries. Just generate an ID based on the name of the country.
From the same paragraph:
Whenever possible, structure your Datastore transactions so that the end result will be unaffected if the same transaction is applied more than once.
In internal Datastore APIs like db or ndb you don't have to worry about rolling back; it happens automatically.
It's about 1 write per second per whole entity group; that's why you need to keep groups as small as possible.
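To make the entity-group point concrete, a sketch with the Java client library (used purely for illustration, since the question uses PHP via the JSON API; the kinds and ids are the hypothetical ones from above):

import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.DatastoreOptions;
import com.google.cloud.datastore.Key;
import com.google.cloud.datastore.KeyFactory;
import com.google.cloud.datastore.PathElement;

Datastore datastore = DatastoreOptions.getDefaultInstance().getService();

KeyFactory countryFactory = datastore.newKeyFactory().setKind("Country");
Key country1 = countryFactory.newKey(123123);  // root entity: its own entity group
Key country2 = countryFactory.newKey(679621);  // different root: a different entity group

KeyFactory cityFactory = datastore.newKeyFactory()
        .setKind("City")
        .addAncestor(PathElement.of("Country", 123123));
Key city1 = cityFactory.newKey("1");  // same entity group as country1
Key city2 = cityFactory.newKey("2");  // same entity group as country1 and city1

The 1 write/sec guideline then applies to everything under country1 taken together, while country1 and country2 can be written concurrently without contending.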

Can this strange behaviour be explained by Eventual Consistency on the App Engine Datastore?

I have implemented two server-side HTTP endpoints which 1) store some data and 2) process it. Method 1) calls method 2) through App Engine Tasks, since these are time-consuming tasks that I do not want the client to wait for. The process is illustrated in the sequence diagram below.
Now, from time to time, I experience that the processing task (named processSomething in the sequence diagram below) can't find the data when attempting to process it, illustrated with the yellow throw WtfException() below. Can this be explained by the Eventual Consistency model described here?
The document says strong consistency for reads but eventual consistency for writes. I'm not sure what exactly that means in relation to this case. Any clarification is appreciated.
edit: I realize I'm asking a boolean question here, but I guess I'm looking for an answer backed up with some documentation on what Eventual Consistency is in general, and specifically on Google Datastore.
edit 2: By request here are details on the actual read/write operations:
The write operation:
entityManager.persist(hand);
entityManager.close();
I'm using JPA for data persistence. The object 'hand' is received from the client and not previously stored in the db, so a new key id will be assigned.
The read operation:
SELECT p FROM Hand p WHERE p.GameId = :gid AND p.RoundNo = :rno
Neither GameId nor RoundNo is the primary key. GameId is a "foreign key" although the Datastore is oblivious of that by design.
It would help if you showed actual code, showing how you save the entity and how you retrieve it, but assuming that id is an actual Datastore ID (part of a Key) and that your load is a get using the id and not a query on some other property, then eventual consistency is not your issue.
(The documentation on this is further down the page you linked.)
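For illustration, the distinction being drawn is between a lookup by key and a query on other properties; in JPA terms (the hand key and query parameters are hypothetical):

// Lookup by key: a get on the entity's ID, which the Datastore serves with strong consistency.
Hand hand = entityManager.find(Hand.class, handKey);

// Query on non-key properties (as in the question): backed by an index that is only
// eventually consistent, so a very recent write may not be visible yet.
List<Hand> hands = entityManager
        .createQuery("SELECT p FROM Hand p WHERE p.GameId = :gid AND p.RoundNo = :rno", Hand.class)
        .setParameter("gid", gameId)
        .setParameter("rno", roundNo)
        .getResultList();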

Resources