Savepoint - Operators could not be matched in Apache Flink

I'm trying to stop my job with a savepoint, then start it again from the same savepoint. In my case, I update my job and create a new version of it with a new jar. Here is my code example:
class Reader(bla bla) {
  def read() = {
    val ds = readFromKafka()
    transform(ds)
  }

  def transform(ds: DataStream[...]) = {
    ds.map()
  }
}

object MyJob {
  def run() = {
    val data = new Reader().read()
    data.keyBy(id).process(new MyStateFunc).uid("my-uid") // then write to kafka
  }
}
In this case, I stopped the job with a savepoint, then started it again from the same savepoint with the same jar, and it worked. Then I added a filter to my Reader like this:
class Reader(bla bla) {
  def read() = {
    val ds = readFromKafka()
    transform(ds)
  }

  def transform(ds: DataStream[...]) = {
    ds.map().filter() // FILTER ADDED HERE
  }
}
I stopped my job with a savepoint, which works. Then I tried to deploy the new version of the job (with the added filter) from the same savepoint, but it cannot match the operators and the job does not deploy. Why?

Unless you explicitly provide UIDs for all of your stateful operators before taking a savepoint, then after changing the topology of your job, Flink will no longer be able to figure out which state in the savepoint belongs to which operator.
I see that you have a UID on your keyed process function ("my-uid"). But you also need to have UIDs on the Kafka source and the sink, and anything else that's stateful. These UIDs need to be attached to the stateful operators themselves and need to be unique within the job (but not across all jobs). (Furthermore, each state descriptor needs to assign a name to each piece of state, using a name that is unique within the operator.)
Typically one does something like this:
env
  .addSource(...)
  .name("KafkaSource")
  .uid("KafkaSource")

results
  .addSink(...)
  .name("KafkaSink")
  .uid("KafkaSink")
where the name() method is used to supply the text that appears in the web UI.
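Applied to the pipeline from the question, a minimal sketch might look like the following (the source, sink and function names other than "my-uid" are placeholders for illustration, not the asker's actual code):
// Hypothetical sketch: every stateful operator carries an explicit uid, so
// state in a savepoint can be mapped back even after the filter is added.
val ds = env
  .addSource(kafkaConsumer)           // assumed Kafka source behind readFromKafka()
  .name("KafkaSource")
  .uid("KafkaSource")

ds.map(transformFn)                   // the map/filter are stateless, so they
  .filter(filterFn)                   // don't strictly need uids of their own
  .keyBy(_.id)
  .process(new MyStateFunc)
  .name("MyStateFunc")
  .uid("my-uid")
  .addSink(kafkaProducer)             // assumed Kafka sink ("then write to kafka")
  .name("KafkaSink")
  .uid("KafkaSink")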

Related

Is there a way to limit the members of a camel-cluster (using KubernetesClusterService) using a pod selector (selection)?

I am programmatically creating my CamelContext and successfully creating the ClusterServiceProvider using the KubernetesClusterService implementation. When running in my kubernetes cluster, it is electing a leader and responding appropriately to "master:" routes. All good.
However, I would like to limit the Pod/Deployments that are detected in the cluster member negotiation/inspection. It currently has knowledge of (finds) every Pod in the cluster namespace which includes completely unrelated deployments/instances.
The overall question is how to select what Pods/Deployments should be included in the particular camel cluster?
I see that the KubernetesLockConfiguration has an attribute called clusterLabels, however it is unclear to me how it is used or what it is used for. When I do set clusterLabels to something syntactically common in kubernetes (e.g. app -> my-app), the cluster finds no members.
I mention that I am doing this programmatically as there is no spring-boot or other commonly documented configuration of camel involved. Running in Play-Framework Scala.
First, I create a CamelContext
val context = new DefaultCamelContext()
val rb = new RouteBuilder() {
  def configure() = {
    val policy = ClusteredRoutePolicy.forNamespace("default")

    from(s"master:ip:timer://master-timer?fixedRate=true&period=60000")
      .routePolicy(policy)
      .bean(classOf[MasterTimer], "execute()")
      .log("Current Leader ${routeId}")
  }
}
Second, I create a ClusterServiceProvider
import org.apache.camel.component.kubernetes.cluster.KubernetesClusterService
import scala.jdk.CollectionConverters._ // needed for .asJava below

val crc = new ClusteredRouteController()
val service = new KubernetesClusterService
service.setCamelContext(context)
service.setKubernetesNamespace("default")
//
// If I set clusterLabels here, no camel cluster is realized/created.
// My assumption is that my syntax for the Camel Kubernetes component is
// wrong, but it is unclear from the documentation how to make this work.
//
// If I do not set clusterLabels, every pod in my kubernetes cluster
// becomes a cluster member (CamelKubernetesLeaderNotifier logs that
// the list of cluster members has changed). So I get completely
// unrelated deployments in the context of something that I want
// to be specifically related, namely all pods with an "app" label of
// "my-app".
//
service.setClusterLabels(Map("app" -> "my-app").asJava)
crc.setNamespace("default")
crc.setClusterService(service)
context.addService(service)
context.setRouteController(crc)
context.start()
context.getRouteController().startAllRoutes()

Laravel perform scheduled job on database

This is somewhat a design question for my current laravel side project.
So currently I have a table that stores a status value in one column and, in another, the date on which that status should be altered. Now I want to alter that status value automatically when the stored date is the current date. Since the table will gain more rows over time, I have to perform that altering process in a repeating manner. On top of that, I also want to perform some constraint checks on the data.
I'm sure Laravel is capable of doing that, but how?
Laravel has commands and a scheduler; combining these two gives exactly what you want.
Create your command in the Console\Commands folder with your desired logic. Your question is sparse, so most of this is pseudo logic and you can adjust it for your case.
namespace App\Console\Commands;

use Illuminate\Console\Command;

class StatusUpdater extends Command
{
    protected $signature = 'update:status';

    protected $description = 'Update status on your model';

    public function handle()
    {
        // Fetch the rows whose date column is today.
        $models = YourModel::whereDate('date', now())->get();

        $models->each(function (YourModel $model) {
            if ($model->status === 'wrong') {
                $model->status = 'new';
                $model->save();
            }
        });
    }
}
For this command to run daily, you can use the scheduler to schedule the given command. Go to app/Console/Kernel.php, where you will find a schedule() method.
namespace App\Console;

use App\Console\Commands\StatusUpdater;
use Illuminate\Console\Scheduling\Schedule;
use Illuminate\Foundation\Console\Kernel as ConsoleKernel;

class Kernel extends ConsoleKernel
{
    protected function schedule(Schedule $schedule)
    {
        // Run update:status once every day.
        $schedule->command(StatusUpdater::class)->daily();
    }
}
For scheduling to work, you have to add the following cron entry on your server, as described in the Laravel documentation:
* * * * * cd /path-to-your-project && php artisan schedule:run >> /dev/null 2>&1

Flink Table-API and DataStream ProcessFunction

I want to join a big table, impossible to fit in TM memory, with a stream (Kafka). In my tests I successfully joined the two by mixing the Table API with the DataStream API. I did the following:
val stream: DataStream[MyEvent] = env.addSource(...)

stream
  .timeWindowAll(...)
  .trigger(...)
  .process(new ProcessAllWindowFunction[MyEvent, MyEvent, TimeWindow] {
    var tableEnv: StreamTableEnvironment = _

    override def open(parameters: Configuration): Unit = {
      // init table env
    }

    override def process(context: Context, elements: Iterable[MyEvent], out: Collector[MyEvent]): Unit = {
      val table = tableEnv.sqlQuery(...)
      elements.map(e => {
        // do process
        out.collect(...)
      })
    }
  })
It is working, but I have never seen this type of implementation anywhere. Is it OK? What would be the drawbacks?
One should not use StreamExecutionEnvironment or TableEnvironment within a Flink function. An environment is used to construct a pipeline that is submitted to the cluster.
Your example submits a job to the cluster from within a cluster job.
This might work for certain use cases but is generally discouraged. Imagine your outer stream contains thousands of events and your function creates a job for every event; that could potentially DDoS your cluster.
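For contrast, a minimal sketch of the intended pattern, with both environments used only while building the job graph and nothing created inside a user function (assuming a recent Flink version where StreamTableEnvironment.toDataStream exists; the query, connector and MyEvent types are placeholders, not the asker's actual setup):
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment
import org.apache.flink.types.Row

val env = StreamExecutionEnvironment.getExecutionEnvironment
val tableEnv = StreamTableEnvironment.create(env)

val events: DataStream[MyEvent] = env.addSource(kafkaSource) // assumed Kafka source

// The query is defined once, as part of the pipeline definition...
val bigTable = tableEnv.sqlQuery("SELECT id, payload FROM big_table") // placeholder query
// ...and converted back into a DataStream so it can be combined with the event stream.
val tableStream: DataStream[Row] = tableEnv.toDataStream(bigTable)

events
  .connect(tableStream)
  // e.g. a CoProcessFunction that buffers the table side in state
  // and enriches the streaming events with it
  .process(...)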

Is there a way to asynchronously modify state in Flink KeyedProcessFunction?

I have two sources, Kafka and HBase. In Kafka, there is a data stream covering only the last 24 hours. In HBase, there is aggregated data from the beginning. My goal is to merge the two in stream processing whenever a stream input (Kafka) for some session occurs. I tried a couple of methods, but I was not satisfied with their performance.
After some searching, I had an idea using state in a keyed process function (caching via the keyed process function's state). The idea is below:
1. Key the input to the keyed process function by the session information
2. Check the keyed process function's state
3. If the state is not initialized, query HBase, initialize the state from the result, and go to 5
4. Else (the state is initialized), go to 5
5. Do the business logic using the state
While coding this idea, I faced the performance issue that querying HBase synchronously is slow. So I tried an async version, but it is complicated.
I have faced two issues. One is a thread-safety issue between processElement and the HBase async worker thread; the other is that the Context of the process function has already expired when the HBase async worker finishes (it is only valid until processElement returns).
val sourceStream = env.addSource(kafkaConsumer.setStartFromGroupOffsets())

sourceStream
  .keyBy(new KeySelector[InputMessage, KeyInfo]() {
    override def getKey(v: InputMessage): KeyInfo = v.toKeyInfo()
  })
  .process(new KeyedProcessFunction[KeyInfo, InputMessage, OUTPUTTYPE]() {
    // table is assigned in open(), so it needs to be a member field
    // (type assumed here; it was elided in the original snippet)
    var table: AsyncTable[AdvancedScanResultConsumer] = _
    var state: MapState[String, (String, Long)] = _

    override def open(parameters: Configuration): Unit = {
      val conn = ConnectionFactory.createAsyncConnection(hbaseConfInstance).join
      table = conn.getTable(TableName.valueOf("tablename"))
      state = getRuntimeContext.getMapState(stateDescripter)
    }

    def request(action: Consumer[CacheResult]): Unit = {
      if (!state.isEmpty) {
        action.accept(new CacheResult(state))
      } else { // state is empty, so load from hbase
        table.get(new Get(key)).thenAccept((hbaseResult: Result) => {
          // this is called by the worker thread
          hbaseResult.toState(state) // convert the hbase result into state
          action.accept(new CacheResult(state))
        })
      }
    }

    override def processElement(value: InputMessage,
                                ctx: KeyedProcessFunction[KeyInfo, InputMessage, OUTPUTTYPE]#Context,
                                out: Collector[OUTPUTTYPE]): Unit = {
      val businessAction = new Consumer[CacheResult]() {
        override def accept(t: CacheResult): Unit = {
          // .. do business logic here.
          out.collect( /* final result */ )
        }
      }
      request(businessAction)
    }
  })
  .addSink()
Is there any suggestion for making a KeyedProcessFunction work with an async call to a third-party system?
Or any other idea for an approach that mixes Kafka and HBase in Flink?
I think your general assumptions are wrong. I faced a similar issue, although regarding quite a different problem, and haven't resolved it yet. Keeping state in the program is contradictory with an async function, and Flink prevents using state in async code by design (which is a good thing). If you want to make your function async, then you must get rid of the state.
To achieve your goal, you probably need to redesign your solution. I don't know all the details of your problem, but you could think about splitting your process into more pipelines. E.g. you can create a pipeline that consumes data from HBase and passes it into a Kafka topic. Then another pipeline can consume the data sent by the pipeline gathering data from HBase. In such an approach you don't have to care about the state, because each pipeline is doing its own thing: just consuming data and passing it further.
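A rough sketch of that split, assuming one bridge job that publishes the HBase aggregates to Kafka and one main job that merges the two topics in keyed state (all source, sink, type and function names here are placeholders, not tested code):
// Pipeline 1 (hypothetical): read the aggregates from HBase and publish
// them to a Kafka topic such as "hbase-aggregates".
val aggregates: DataStream[Aggregate] = env.addSource(hbaseScanSource)
aggregates.addSink(kafkaProducerFor("hbase-aggregates"))

// Pipeline 2 (hypothetical): the main job consumes both topics, keys them
// by the same session key and merges them in keyed state, so no async call
// is needed inside the process function.
val events: DataStream[InputMessage] = env.addSource(kafkaConsumerFor("events"))
val history: DataStream[Aggregate]   = env.addSource(kafkaConsumerFor("hbase-aggregates"))

events
  .connect(history)
  .keyBy(_.toKeyInfo(), _.keyInfo)
  .process(new MergeFunction) // a KeyedCoProcessFunction holding the aggregate in MapState
  .addSink(outputSink)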

Google App Engine atomic section?

Say you retrieve a set of records from the datastore (something like: select * from MyClass where reserved='false').
How do I ensure that another user hasn't reserved them in the meantime, i.e. that reserved is still false when I update it?
I've looked at the Transaction documentation and was shocked by Google's solution, which is to catch the exception and retry in a loop.
Is there any solution that I'm missing? It's hard to believe that there's no way to have an atomic operation in this environment.
(By the way, I could use 'synchronized' inside the servlet, but I think that's not valid since there's no way to ensure there's only one instance of the servlet object, is there? The same applies to a static-variable solution.)
Any idea on how to solve this?
(here's the google solution:
http://code.google.com/appengine/docs/java/datastore/transactions.html#Entity_Groups
look at:
Key k = KeyFactory.createKey("Employee", "k12345");
Employee e = pm.getObjectById(Employee.class, k);
e.counter += 1;
pm.makePersistent(e);
'This requires a transaction because the value may be updated by another user after this code fetches the object, but before it saves the modified object. Without a transaction, the user's request will use the value of counter prior to the other user's update, and the save will overwrite the new value. With a transaction, the application is told about the other user's update. If the entity is updated during the transaction, then the transaction fails with an exception. The application can repeat the transaction to use the new data'
Horrible solution, isn't it?
You are correct that you cannot use synchronized or a static variable.
You are incorrect that it is impossible to have an atomic action in the App Engine environment. (See what atomic means here.) When you do a transaction, it is atomic: either everything happens, or nothing happens. It sounds like what you want is some kind of global locking mechanism. In the RDBMS world, that might be something like "select for update" or setting your transaction isolation level to serializable. Neither of those options is very scalable. Or, as you would say, they are both horrible solutions :)
If you really want global locking in app engine, you can do it, but it will be ugly and seriously impair scalability. All you need to do is create some kind of CurrentUser entity, where you store the username of the current user who has a global lock. Before you let a user do anything, you would need to first check that no user is already listed as the CurrentUser, and then write that user's key into the CurrentUser entity. The check and the write would have to be in a transaction. This way, only one user will ever be "Current" and therefore have the global lock.
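If you really do want that pattern, a rough sketch with the low-level datastore API could look like the following (written in Scala only to match the other snippets on this page; the "CurrentUser" kind, the "lock" key name and tryAcquireGlobalLock are made up for illustration):
import com.google.appengine.api.datastore.{DatastoreServiceFactory, Entity, EntityNotFoundException, KeyFactory}

// Hypothetical helper: the "CurrentUser" kind and single "lock" key are
// illustrative names, not part of any App Engine API.
def tryAcquireGlobalLock(username: String): Boolean = {
  val datastore = DatastoreServiceFactory.getDatastoreService
  val lockKey   = KeyFactory.createKey("CurrentUser", "lock")
  val txn       = datastore.beginTransaction()
  try {
    // Check inside the transaction whether anyone already holds the lock.
    val alreadyHeld =
      try datastore.get(txn, lockKey).getProperty("username") != null
      catch { case _: EntityNotFoundException => false }

    if (alreadyHeld) {
      txn.rollback()
      false
    } else {
      val lock = new Entity(lockKey)
      lock.setProperty("username", username)
      datastore.put(txn, lock)
      // If another request claimed the lock concurrently, commit() throws
      // and the caller can retry or give up.
      txn.commit()
      true
    }
  } finally {
    if (txn.isActive) txn.rollback()
  }
}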
Do you mean like this:
public void func(Data data2) {
    String query = "select from " + objectA.class.getName()
        + " where reserved == false";
    List<objectA> table = (List<objectA>) pm.newQuery(query).execute();

    // compute a weight for every candidate row, then sort by it
    for (objectA row : table) {
        Data data1 = row.getData1();
        row.setWeight(JUtils.CalcWeight(data1, data2));
    }
    Collections.sort(table, new objectA.SortByWeight());

    int retries = 0;
    int NUM_RETRIES = 10;
    for (int i = 0; i < table.size(); i++) {
        retries++;
        pm.currentTransaction().begin(); // <---- BEGIN
        objectA obj = pm.getObjectById(objectA.class, table.get(i).getKey());
        if (obj.getReserved() == false) { // <--- CHECK if it is still unreserved
            obj.setReserved(true);
        } else {
            break;
        }
        try {
            pm.currentTransaction().commit();
            break;
        } catch (JDOCanRetryException ex) {
            if (retries == NUM_RETRIES) {
                throw ex;
            }
            i--; // so we retry again on the same object
        }
    }
}
