I have a Java App Engine project and I am using DeferredTasks for push queues.
/** A hypothetical expensive operation we want to defer on a background task. */
public static class ExpensiveOperation implements DeferredTask {
    @Override
    public void run() {
        System.out.println("Doing an expensive operation...");
        // expensive operation to be backgrounded goes here
    }
}
I want to be able to create multiple shards of a DeferredTask to get more throughput. Basically, I want to run one DeferredTask that then enqueues many more DeferredTasks (up to 1,000 of them): essentially a fan-out task. How can I do that?
One issue is that when creating tasks you need to specify their names in the queue.yaml file. But if I want to have 1,000 tasks, do I really need to specify 1,000 of them in that file? It would get very tedious to write out "task-1", "task-2", and so on.
Is there a better way to do this?
This is usually done by specifying a shard parameter for each task and reusing the same queue. As noted in your example, the entire Java object is serialized with DeferredTask, so you can simply pass in any values you want via the constructor. E.g.
public static class ShardedOperation implements DeferredTask {
    private final int shard;

    public ShardedOperation(int shard) {
        this.shard = shard;
    }
}
...
@Override
public void run() {
    System.out.println("Fanning out an expensive operation...");
    Queue queue = QueueFactory.getDefaultQueue();
    for (int i = 0; i < 1000; ++i) {
        queue.add(TaskOptions.Builder.withPayload(new ShardedOperation(i)));
    }
}
This matches the section you linked to (https://cloud.google.com/appengine/docs/standard/java/taskqueue/push/creating-tasks#using_the_instead_of_a_worker_service), where the default queue is used.
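Putting the pieces together, here is a minimal sketch of the full fan-out. The FanOutOperation class name and the body of each shard's run() are illustrative assumptions, not from the original post; the Queue/TaskOptions calls are the standard App Engine task queue API.

import com.google.appengine.api.taskqueue.DeferredTask;
import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

/** Hypothetical parent task: enqueues 1,000 shards on the default queue. */
public static class FanOutOperation implements DeferredTask {
    @Override
    public void run() {
        Queue queue = QueueFactory.getDefaultQueue();
        for (int i = 0; i < 1000; ++i) {
            queue.add(TaskOptions.Builder.withPayload(new ShardedOperation(i)));
        }
    }
}

/** Each shard carries its index in its serialized state. */
public static class ShardedOperation implements DeferredTask {
    private final int shard;

    public ShardedOperation(int shard) {
        this.shard = shard;
    }

    @Override
    public void run() {
        System.out.println("Running shard " + shard + " of the expensive operation...");
        // process the slice of work belonging to this shard
    }
}

Note that only the queue itself needs to exist in queue.yaml; individual tasks are anonymous, so no per-task names are required.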
I have a simple Apache Flink job that looks very much like this:
public final class Application {

    public static void main(final String... args) throws Exception {
        final var env = StreamExecutionEnvironment.getExecutionEnvironment();
        final var executionConfig = env.getConfig();
        final var params = ParameterTool.fromArgs(args);
        executionConfig.setGlobalJobParameters(params);
        executionConfig.setParallelism(params.getInt("application.parallelism"));
        final var source = KafkaSource.<CustomKafkaMessage>builder()
                .setBootstrapServers(params.get("application.kafka.bootstrap-servers"))
                .setGroupId(params.get("application.kafka.consumer.group-id"))
                // .setStartingOffsets(OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST))
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setTopics(params.get("application.kafka.listener.topics"))
                .setValueOnlyDeserializer(new MessageDeserializationSchema())
                .build();
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "custom.kafka-source")
                .uid("custom.kafka-source")
                .rebalance()
                .flatMap(new CustomFlatMapFunction())
                .uid("custom.flatmap-function")
                .filter(new CustomFilterFunction())
                .uid("custom.filter-function")
                .addSink(new CustomDiscardSink()) // Will be a Kafka sink in the future
                .uid("custom.discard-sink");
        env.execute(params.get("application.job-name"));
    }
}
The problem is that I would like to provide an integration test for the entire application, sort of like an end-to-end (set of) test(s) for the entire job. I'm using Testcontainers, but I'm not really sure how to move forward with this. For instance, this is what the test looks like (for now):
@Testcontainers
final class ApplicationTest {

    private static final DockerImageName DOCKER_IMAGE = DockerImageName.parse("confluentinc/cp-kafka:7.0.1");

    @Container
    private static final KafkaContainer KAFKA_CONTAINER = new KafkaContainer(DOCKER_IMAGE);

    @ClassRule // How come this works in JUnit Jupiter? :/
    public static MiniClusterResource cluster;

    @BeforeAll
    static void init() {
        KAFKA_CONTAINER.start();
        // ...probably need to wait and create the topic(s) as well
        final var config = new MiniClusterResourceConfiguration.Builder()
                .setNumberSlotsPerTaskManager(2)
                .setNumberTaskManagers(1)
                .build();
        cluster = new MiniClusterResource(config);
    }

    @Test
    void main() throws Exception {
        // new Application(); // ...what's next?
    }
}
I'm not sure how to implement what's required to trigger the job as-is from that point on. Basically, I would like to execute what was defined before, without (almost) any modifications; I've seen plenty of examples that practically rebuild the entire job in the test, so that's not an option.
Can somebody provide any pointers here?
MessageDeserializationSchema is unbounded, so its isEndOfStream always returns false. I'm not sure if that's an impediment.
In order to make the pipeline more testable, I suggest you add a method to your Application class that takes a source and a sink as parameters and creates and executes the pipeline using those connectors.
In your tests you can call that method with special sources and sinks that you use for testing. In particular, you will want to use a KafkaSource built with .setBounded(...) in the tests so that it cleanly handles just the range of data intended for the test(s).
The solutions and tests for the Apache Flink training exercises are organized along these lines; see, for example, RideCleansingSolution.java and RideCleansingIntegrationTest.java. Those examples don't use Kafka or Testcontainers, but hopefully they'll still be helpful.
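As a rough illustration of that refactoring (the buildAndExecute method name is a hypothetical choice, and I'm assuming CustomFlatMapFunction emits CustomKafkaMessage; substitute the real output type of your pipeline):

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.connector.source.Source;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

public final class Application {

    public static void main(final String... args) throws Exception {
        final var env = StreamExecutionEnvironment.getExecutionEnvironment();
        final var params = ParameterTool.fromArgs(args);
        env.getConfig().setGlobalJobParameters(params);
        env.getConfig().setParallelism(params.getInt("application.parallelism"));
        final var source = KafkaSource.<CustomKafkaMessage>builder()
                // ...exactly as in the original main()...
                .build();
        buildAndExecute(env, source, new CustomDiscardSink(), params.get("application.job-name"));
    }

    // Tests call this with a bounded source, e.g. a KafkaSource built with
    // .setBounded(OffsetsInitializer.latest()), and a collecting test sink.
    static void buildAndExecute(StreamExecutionEnvironment env,
                                Source<CustomKafkaMessage, ?, ?> source,
                                SinkFunction<CustomKafkaMessage> sink,
                                String jobName) throws Exception {
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "custom.kafka-source")
                .uid("custom.kafka-source")
                .rebalance()
                .flatMap(new CustomFlatMapFunction())
                .uid("custom.flatmap-function")
                .filter(new CustomFilterFunction())
                .uid("custom.filter-function")
                .addSink(sink)
                .uid("custom.discard-sink");
        env.execute(jobName);
    }
}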
I would suggest you instrument your application as an opaque-box test by interacting with it through its public API. This can be done either as an out-of-process test (e.g., by running your application in a container as well, using Testcontainers) or as an in-process test (by creating your Application and calling its main() method).
Now, in your comments you explained that you want to check for the side effects of interacting with your application (Kafka messages being published). To check this, connect to the KafkaContainer with your own KafkaConsumer from within the test and use a library such as Awaitility to wait until the messages have been received.
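For example, a hedged sketch of such a check (the topic name, expected count, and String deserializers are placeholders for whatever your job actually produces):

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.CopyOnWriteArrayList;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.awaitility.Awaitility;
import static org.junit.jupiter.api.Assertions.assertEquals;

static void assertMessagesPublished(String bootstrapServers, int expectedCount) {
    final var props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "it-verifier");
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

    final var received = new CopyOnWriteArrayList<String>();
    try (var consumer = new KafkaConsumer<String, String>(props)) {
        consumer.subscribe(List.of("output-topic")); // hypothetical output topic
        Awaitility.await()
                .atMost(Duration.ofSeconds(30))
                .untilAsserted(() -> {
                    consumer.poll(Duration.ofMillis(250))
                            .forEach(rec -> received.add(rec.value()));
                    assertEquals(expectedCount, received.size());
                });
    }
}

In the test you would pass KAFKA_CONTAINER.getBootstrapServers() as the bootstrap servers.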
We have a legacy application that is writing results as records to some local files. We want to process these records in real time, so we are planning to use Flink as the engine. I know that I can read text files using StreamExecutionEnvironment#readFile. It seems that we need something like PROCESS_CONTINUOUSLY there, but that flag causes the whole file to be reprocessed on each change, which is not what we want here.
Of course, I could write my own custom source that saves the number of records read per file in its state. But I suppose there might be some problem with such an approach and checkpointing or something; my reasoning is that if it were easy to implement reliably, it would already have been implemented in Flink.
Any tips or suggestions on how to approach this?
You can do this rather easily with a custom source, as long as you are content to read from a single file (per source instance). You will need to use operator state and implement checkpointing. The state handling and checkpointing will look something like this:
public class CheckpointedFileSource implements SourceFunction<Event>, ListCheckpointed<Long> {

    private volatile boolean running = true;
    private long eventCnt = 0;

    @Override
    public void run(SourceContext<Event> sourceContext) throws Exception {
        final Object lock = sourceContext.getCheckpointLock();

        // skip over the eventCnt events already emitted before the restore
        ...

        while (running) {
            // read the next event (and its timestamp) from the file
            ...
            synchronized (lock) {
                eventCnt++;
                sourceContext.collectWithTimestamp(event, timestamp);
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    @Override
    public List<Long> snapshotState(long checkpointId, long checkpointTimestamp) throws Exception {
        return Collections.singletonList(eventCnt);
    }

    @Override
    public void restoreState(List<Long> state) throws Exception {
        for (Long s : state) {
            this.eventCnt = s;
        }
    }
}
For a complete example, see the checkpointed taxi-ride data source used in the Flink training exercises. You'll have to adapt it a bit, since it's designed to read a static file rather than one that is being appended to.
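If it helps, here is one possible way to fill in the elided reading-and-skipping logic, assuming one event per line in a plain text file; parseEvent, getTimestamp(), and the path variable are hypothetical stand-ins for your record format:

// Inside run(), in place of the elided sections above.
try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
    // skip over the events already emitted before the last checkpoint
    for (long i = 0; i < eventCnt; i++) {
        reader.readLine();
    }
    while (running) {
        String line = reader.readLine();
        if (line == null) {
            Thread.sleep(100); // nothing new yet; wait for the legacy app to append
            continue;
        }
        Event event = parseEvent(line); // hypothetical parser
        synchronized (lock) {
            eventCnt++;
            sourceContext.collectWithTimestamp(event, event.getTimestamp()); // assumed accessor
        }
    }
}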
I am using Apache Flink to perform analytics on streaming data.
I am using a dependency whose object takes more than 10 seconds to create, as it reads several files from HDFS during initialisation.
If I initialise the object in the open method, I get a timeout exception, and if I do it in the constructor of a sink/flatMap, I get a serialisation exception.
Currently I am using a static block to initialise the object in another class, calling Preconditions.checkNotNull(MGenerator.mGenerator) in the main class, and then it works when used in a flatMap or sink.
Is there a way to create such a non-serializable dependency object, which might take more than 10 seconds to initialise, in Flink's flatMap or sink?
public class DependencyWrap {
    static MGenerator mGenerator;

    static {
        final String configStr = "{}";
        final Config config = new Gson().fromJson(configStr, Config.class);
        mGenerator = new MGenerator(config);
    }
}
public class MyStreaming {

    public static void main(String[] args) throws Exception {
        Preconditions.checkNotNull(MGenerator.mGenerator);

        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(parallelism);
        ...
        input.flatMap(new RichFlatMapFunction<Map<String, Object>, List<String>>() {
            @Override
            public void open(Configuration parameters) {
            }

            @Override
            public void flatMap(Map<String, Object> value, Collector<List<String>> out) throws Exception {
                out.collect(MFVGenerator.mfvGenerator.generateMyResult(value.f0, value.f1));
            }
        });
    }
}
Also, please correct me if I am wrong about the question.
Doing it in the open method is 100% the right way to do it. Is the timeout exception coming from Flink, or from the object itself?
As a last-ditch method, you could wrap your object in a class that contains both the object and its JSON string or Config (is Config serializable?), with the object marked transient, and then override the readObject/writeObject methods to call the constructor. If the mGenerator object itself is stateless (and you'll have other problems if it's not), the serialization code should get called only once, when the job is distributed to the taskmanagers.
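A rough sketch of that wrapper idea, assuming the Config JSON string is enough to rebuild the object (the class name is hypothetical; readObject is the standard Java serialization hook):

public class MGeneratorWrapper implements java.io.Serializable {
    private final String configStr;
    private transient MGenerator generator; // heavy object, deliberately not serialized

    public MGeneratorWrapper(String configStr) {
        this.configStr = configStr;
        this.generator = build(configStr);
    }

    public MGenerator get() {
        return generator;
    }

    private static MGenerator build(String configStr) {
        return new MGenerator(new Gson().fromJson(configStr, Config.class));
    }

    // Rebuild the transient field after default deserialization on the taskmanager.
    private void readObject(java.io.ObjectInputStream in)
            throws java.io.IOException, ClassNotFoundException {
        in.defaultReadObject();
        this.generator = build(this.configStr);
    }
}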
Using open is usually the right place to load external lookup sources. The timeout is a bit odd; maybe there is a configuration setting for it.
However, if the data structure is huge, using a static loader (either a static class, as you did, or a singleton) has the benefit that you only need to load it once for all parallel instances of the task on the same TaskManager. Hence, you save memory and CPU time. This is especially true for you, as you use the same data structure in two separate tasks. Further, the static loader can be initialized lazily, when it's used for the first time, to avoid the timeout in open; see the sketch below.
The clear downside of this approach is that the testability of your code suffers. There are some ways around that, which I could expand on if there is interest.
I don't see a benefit in using the proxy serializer pattern: it's unnecessarily complex (custom serialization in Java) for little gain.
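To make the lazy static loader concrete, a minimal sketch (the holder class is hypothetical; the placeholder config string mirrors the question's DependencyWrap):

public class MGeneratorHolder {
    private static MGenerator instance;

    // Built at most once per JVM (i.e., once per TaskManager), shared by all
    // parallel subtasks, and only on first use, so open() can return quickly.
    public static synchronized MGenerator get() {
        if (instance == null) {
            final String configStr = "{}";
            instance = new MGenerator(new Gson().fromJson(configStr, Config.class));
        }
        return instance;
    }
}

The flatMap or sink would then call MGeneratorHolder.get() where it needs the generator, instead of relying on a static field initialized eagerly on the client.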
Using custom partitioning in Apache Flink, we specify a key for each record so that it is assigned to a particular taskmanager.
Consider that we broadcast a dataset to all of the nodes (taskmanagers). Is there any way, in a map or flatMap, to get the taskmanager ID?
A custom partitioner does not assign records to a TaskManager but to a specific parallel task instance of the subsequent operator (a TM can execute multiple parallel task instances of the same operator).
You can access the ID of a parallel task instance by extending a RichFunction, e.g., extending RichMapFunction instead of implementing MapFunction. Rich functions are available for all transformations. A RichFunction gives you access to the RuntimeContext, which tells you the ID of the parallel task instance:
public static class MyMapper extends RichMapFunction<Long, Long> {

    @Override
    public void open(Configuration config) {
        int pId = getRuntimeContext().getIndexOfThisSubtask();
    }

    @Override
    public Long map(Long value) throws Exception {
        // ...
        return value;
    }
}
I have a class. The class contains a method, ParseJSONResponse(). I want that method to be executed on a daily basis at midnight. How can I achieve this in Salesforce?
I know there is a Scheduled Apex mechanism available in Salesforce to perform such a thing, but I need the steps or code to achieve it. I am new to Salesforce. Any help would be appreciated.
public with sharing class ConsumeCloudArmsWebserviceCallout {
    public void ParseJSONResponse() {
        // handling customerList and inserting records for it
        DateTime lastModifiedDate = Common.getSynchDateByDataObject(CloudArmsWebserviceCallout.DataObject.CustomerContact);
        List<Account> lstAccounts = ConsumeCustomers.CreateCustomers(lastModifiedDate);
        ConsumeContacts.CreateContacts(lastModifiedDate);
        Common.updateSynchByDataObject(CloudArmsWebserviceCallout.DataObject.CustomerContact);
    }
}
You can implement the Schedulable interface directly on your ConsumeCloudArmsWebserviceCallout class:
https://developer.salesforce.com/docs/atlas.en-us.apexcode.meta/apexcode/apex_scheduler.htm
In order to perform callouts (which apparently you will, judging by your class name), you can use the Queueable interface:
https://developer.salesforce.com/docs/atlas.en-us.apexcode.meta/apexcode/apex_queueing_jobs.htm
public with sharing class ConsumeCloudArmsWebserviceCallout implements Schedulable {

    public void ParseJSONResponse() {
        // handling customerList and inserting records for it
        DateTime lastModifiedDate = Common.getSynchDateByDataObject(CloudArmsWebserviceCallout.DataObject.CustomerContact);
        List<Account> lstAccounts = ConsumeCustomers.CreateCustomers(lastModifiedDate);
        ConsumeContacts.CreateContacts(lastModifiedDate);
        Common.updateSynchByDataObject(CloudArmsWebserviceCallout.DataObject.CustomerContact);
    }

    // Method implemented in order to use the Schedulable interface
    public void execute(SchedulableContext ctx) {
        ConsumeCloudQueueable cloudQueueable = new ConsumeCloudQueueable();
        ID jobID = System.enqueueJob(cloudQueueable);
    }

    // Inner class that implements Queueable and can perform callouts.
    private class ConsumeCloudQueueable implements Queueable, Database.AllowsCallouts {
        public void execute(QueueableContext context) {
            ConsumeCloudArmsWebserviceCallout cloudArms = new ConsumeCloudArmsWebserviceCallout();
            cloudArms.ParseJSONResponse();
        }
    }
}
Then go to the Apex Classes page in Salesforce Setup.
There you will find a Schedule Apex button.
You will be able to schedule any class implementing the Schedulable interface.
What will happen is that it will schedule your class to run daily.
Your scheduled job will then only enqueue a ConsumeCloudQueueable, which will do the actual work whenever Salesforce runs it (pretty much straight away).
Once the job runs, it will execute whatever is in your ParseJSONResponse() method.
Let me know if you have any questions.
Cheers,
Seb
And welcome to Salesforce development.
If I have assumed something wrong, let me know. So you are looking to:
Every day at midnight
Fire a job that makes a callout to another system
Then parse the results and create records
You are looking for Apex Scheduler code: http://www.salesforce.com/us/developer/docs/apexcode/Content/apex_scheduler.htm
global class ConsumeCloudArmsWebserviceCallout_Job implements Schedulable {
    global void execute(SchedulableContext sc) {
        new ConsumeCloudArmsWebserviceCallout().ParseJSONResponse();
    }
}
Then you can schedule the job from Your Name -> Developer Console, Debug -> Open Execute Anonymous Window:
system.schedule('Consume Cloud Arms Service', '0 0 0 * * ?', new ConsumeCloudArmsWebserviceCallout_Job());
The cron expression '0 0 0 * * ?' fires at midnight every day.
Keep in mind:
You can only make 5 callouts in one Apex transaction, meaning that if you need to do more, you need to use a batch job or @future calls.
Be careful with large data sets: if your job is expensive (creating and modifying lots of data), you need to make sure you don't run into CPU limits, so a batch job requesting smaller portions of data from the service you are calling out to may be needed.
I don't think you will be running into those issues, but they can catch you off guard sometimes.
EDIT: fixed the call to a new object, less pseudo-code.