I'm using PyFlink for stream processing and have added some metrics to monitor performance.
Here's my code for registering the UDF with metrics; I've installed apache-flink 1.13.0.
from pyflink.table import DataTypes
from pyflink.table.udf import udf, ScalarFunction


class Test(ScalarFunction):

    def __init__(self):
        self.gauge_value = 0

    def open(self, function_context):
        metric_group = function_context.get_metric_group().add_group("test")
        metric_group.gauge("gauge", lambda: self.gauge_value)

    def eval(self, i):
        self.gauge_value = i
        return i


test = udf(
    Test(),
    input_types=[DataTypes.BIGINT()],
    result_type=DataTypes.BIGINT(),
)
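For context, a minimal sketch of wiring a UDF like this into a Table API job on the same release (the table and field names are illustrative, and metrics only show up when the job runs against a cluster with the Web UI enabled):

from pyflink.table import EnvironmentSettings, TableEnvironment
from pyflink.table.expressions import call, col

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
t_env.create_temporary_function("test", test)

# toy bounded input; a real job would read from a streaming source
t = t_env.from_elements([(1,), (2,), (3,)], ['a'])
t.select(call("test", col("a"))).execute().print()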
There's no exception in the log, and if I change the type of self.gauge_value to str, registration fails, so I believe the line is executed and the gauge is registered.
However, I can never find the gauge in the metrics tab of the Web UI.
I tried adding counters and meters in this test UDF, and they all showed up correctly in the metrics tab.
Is gauge supported in PyFlink? If not, is there an alternative?
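One hedged workaround, building on the observation above that counters do show up: mirror the gauge value through a counter by applying the delta on each call. This is only a sketch, assuming PyFlink's Counter supports inc/dec with an argument; the class and metric names are illustrative:

class TestWithCounter(ScalarFunction):

    def open(self, function_context):
        group = function_context.get_metric_group().add_group("test")
        self.counter = group.counter("gauge_as_counter")
        self.last = 0

    def eval(self, i):
        # advance the counter by the delta so its running total tracks i
        delta = i - self.last
        if delta >= 0:
            self.counter.inc(delta)
        else:
            self.counter.dec(-delta)
        self.last = i
        return i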
I am new to PyFlink. I have done the official training exercises in Java: https://github.com/apache/flink-training
However, the project I am working on must use Python as the programming language. I want to know if it is possible to write a data generator using the "SourceFunction". In older PyFlink versions this was possible, using Jython: https://nightlies.apache.org/flink/flink-docs-release-1.7/dev/stream/python.html#streaming-program-example
In newer examples the dataframe contains a finite set of data, which is never extended. I have not found any example of a data generator in PyFlink, e.g. https://github.com/apache/flink-training/blob/master/common/src/main/java/org/apache/flink/training/exercises/common/sources/TaxiRideGenerator.java
I am not sure what functionality the interfaces SourceFunction and SinkFunction provide. Can they be used somehow in Python, or can they only be used in combination with other pipelines or jar files? It looks like the methods "run()" and "cancel()" are not implemented, so they cannot be used like some other classes, by overriding them.
If they cannot be used in Python, is there another way to use them? A simple example would help.
If that is not possible, is there another way to write a data generator in OOP style? Take this example: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/python/datastream_tutorial/ There the split() method is used to separate the stream. Basically, I want to do this with an extra class that just extends the stream, the way the Java TaxiRide example does via "ctx.collect()". I am trying to avoid using Java, another framework for the pipeline, and Jython. A short example would be nice, but I appreciate any tips and advice.
I tried to use SourceFunction directly, but as already mentioned, I think this is the completely wrong way; it results in the error: AttributeError: 'DataGenerator' object has no attribute '_get_object_id'
class DataGenerator(SourceFunction):

    def __init__(self):
        super().__init__(self)
        self._num_iters = 1000
        self._running = True

    def run(self, ctx):
        counter = 0
        while self._running and counter < self._num_iters:
            ctx.collect('Hello World')
            counter += 1

    def cancel(self):
        self._running = False
Solution:
After looking at some older code using the SourceFunction and SinkFunction classes, I came to a solution. There, a Kafka connector written in Java is used, and the Python code can be taken as an example of how to use PyFlink's SourceFunction and SinkFunction.
I have only written an example for the SourceFunction:
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream import SourceFunction
from pyflink.java_gateway import get_gateway


class TaxiRideGenerator(SourceFunction):

    def __init__(self):
        java_src_class = get_gateway().jvm.org.apache.flink.training.exercises.common.sources.TaxiRideGenerator
        java_src_obj = java_src_class()
        super(TaxiRideGenerator, self).__init__(java_src_obj)


def show(ds, env):
    # just a little helper to show the output of the pipeline
    ds.print()
    env.execute()


def streaming():
    # set up the Flink execution environment
    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_parallelism(1)

    taxi_src = TaxiRideGenerator()
    ds = env.add_source(taxi_src)
    show(ds, env)


if __name__ == "__main__":
    streaming()
The second line in the class __init__ was hard to find; I had expected to get the object already in the first line.
You have to create a jar file after building this project.
I changed into the directory until I could see the folder "org":
$ cd flink-training/common/build/classes/java/main
flink-training/common/build/classes/java/main$ ls
org
flink-training/common/build/classes/java/main$ jar cvf flink-training.jar org/apache/flink/training/exercises/common/**/*.class
Copy the jar file into the pyflink/lib folder, normally located inside your Python environment, e.g. flinkenv/lib/python3.8/site-packages/pyflink/lib, then start the script.
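As an alternative to copying the jar into pyflink/lib, recent PyFlink releases can register it when the job starts via add_jars; the path below is just a placeholder for wherever you put the jar built above:

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
# hypothetical absolute path - point this at the jar built above
env.add_jars("file:///path/to/flink-training.jar")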
Any chance I could get a tip on the proper way to build an agent that can read multiple points from multiple devices on a BACnet system? I am viewing the actuator agent code, trying to learn how to make the proper RPC call.
I am going through the agent development procedure with the agent creation wizard.
In the __init__ I have this hard-coded at the moment:
def __init__(self, **kwargs):
    super(Setteroccvav, self).__init__(**kwargs)
    _log.debug("vip_identity: " + self.core.identity)
    self.default_config = {}
    self.agent_id = "dr_event_setpoint_adj_agent"
    self.topic = "slipstream_internal/slipstream_hq/"
    self.jci_zonetemp_string = "/ZN-T"
The BACnet system in the building has JCI VAV boxes, all with the same zone temperature sensor point (self.jci_zonetemp_string); self.topic is how I pulled them into the VOLTTRON config store through the BACnet discovery process.
In my actuate-point function (copied from the CSV driver example), am I at all close with how to make the RPC call named reads using get_multiple_points? I am hoping to scrape the zone temperature sensor readings on BACnet device IDs 6, 7, 8, 9, and 10, which are all the same VAV box controller with the same points/BAS program running.
def actuate_point(self):
    """
    Request that the Actuator set a point on the CSV device
    """
    # Create a start and end timestamp to serve as the time we reserve to communicate with the CSV device
    _now = get_aware_utc_now()
    str_now = format_timestamp(_now)
    _end = _now + td(seconds=10)
    str_end = format_timestamp(_end)
    # Wrap the timestamps and device topic (used by the Actuator to identify the device) into an actuator request
    schedule_request = [[self.ahu_topic, str_now, str_end]]
    # Use a remote procedure call to ask the actuator to schedule us some time on the device
    result = self.vip.rpc.call(
        'platform.actuator', 'request_new_schedule', self.agent_id, 'my_test', 'HIGH', schedule_request).get(
        timeout=4)
    _log.info(f'*** [INFO] *** - SCHEDULED TIME ON ACTUATOR from "actuate_point" method success')
    reads = publish_agent.vip.rpc.call(
        'platform.actuator',
        'get_multiple_points',
        self.agent_id,
        [(('self.topic'+'6', self.jci_zonetemp_string)),
         (('self.topic'+'7', self.jci_zonetemp_string)),
         (('self.topic'+'8', self.jci_zonetemp_string)),
         (('self.topic'+'9', self.jci_zonetemp_string)),
         (('self.topic'+'10', self.jci_zonetemp_string))]).get(timeout=10)
Any tips before I break something on the live system are greatly appreciated :)
The basic form of an RPC call to the actuator is as follows:
# use the agent's VIP connection to make an RPC call to the actuator agent
result = self.vip.rpc.call('platform.actuator', <RPC exported function>, <args>).get(timeout=<seconds>)
Because we're working with devices, we need to know which devices we're interested in and what their topics are. We also need to know which points on the devices we're interested in.
device_map = {
    'device1': '201201',
    'device2': '201202',
    'device3': '201203',
    'device4': '201204',
}
building_topic = 'campus/building'
all_device_points = ['point1', 'point2', 'point3']
Getting points with the actuator requires a list of point topics, or device/point topic pairs.
# we only need one of the following:
point_topics = []
for device in device_map.values():
    for point in all_device_points:
        point_topics.append('/'.join([building_topic, device, point]))

device_point_pairs = []
for device in device_map.values():
    for point in all_device_points:
        device_point_pairs.append(('/'.join([building_topic, device]), point,))
Now we send our RPC request to the actuator:
# can use device_point_pairs instead of point_topics
point_results = self.vip.rpc.call('platform.actuator', 'get_multiple_points', point_topics).get(timeout=3)
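The reply then needs unpacking; as a hedged sketch (in current VOLTTRON the actuator's get_multiple_points returns a pair of dictionaries, point values keyed by topic plus any per-point errors, but check the docs for your platform version):

results, errors = point_results
for topic, value in results.items():
    print(topic, value)
if errors:
    print('some points failed:', errors)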
Maybe it's just my interpretation of your question, but it seems a little open-ended, so I shall respond in a similar vein: in general terms (and I'll try to keep it short).
First, you need the info for targeting each device in turn; it might consist of just an IP(v4) address (for the physical device) and the (logical) device's BOIN (BACnet Object Instance Number), or, if the request is being routed/forwarded via a BACnet router/BACnet gateway, maybe also the DNET # and the DADR too.
Then you probably want, for each device, one at a time, to retrieve element 0 of the device's Object-List property in order to get the number of objects it contains, so you know how many objects are available (including the logical device/device-type object) to retrieve and iterate over. NOTE: in the real world, as much as it's common for the device-type object to be the first one in the list, there's no guarantee it will always be the case.
As much as the BACnet standard started allowing for the retrieval of the Property-List property from each and every object, most equipment doesn't yet support it, so you might need your own idea of what properties (at least the ones of interest to you) each different object-type supports; at the very least, know which object-types support the Present-Value property and which don't.
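To make that Object-List iteration concrete, here is a rough Python sketch; read_property is a hypothetical helper standing in for whatever BACnet stack you use (e.g. bacpypes or BAC0), so the names and signatures are illustrative only:

def enumerate_objects(device_addr, device_instance, read_property):
    # element 0 of the Object-List property holds the number of entries
    count = read_property(device_addr, ('device', device_instance),
                          'objectList', index=0)
    objects = []
    for i in range(1, count + 1):
        # read one (object-type, instance) identifier per request
        objects.append(read_property(device_addr, ('device', device_instance),
                                     'objectList', index=i))
    return objects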
One ideal would be to have the following mocked facets as fakes for testing purposes, instead of testing against a live/important device (or at least consider testing against a simple BACnet-enabled Raspberry Pi or similar hardware):
a mock for your BACnet service
a mock for the BACnet communication stack
a mock for your device as a whole (if you can't write your own, then maybe even start with the YABE 'Room Control Simulator' as a starting point)
Hope this helps (in some way).
Spark DStream has a mapPartition API, while the Flink DataStream API doesn't. Could anyone help explain the reason? What I want to do is implement an API similar to Spark's reduceByKey on Flink.
Flink's stream processing model is quite different from Spark Streaming, which is centered around mini batches. In Spark Streaming each mini batch is executed like a regular batch program on a finite set of data, whereas Flink DataStream programs continuously process records.
In Flink's DataSet API, a MapPartitionFunction has two parameters. An iterator for the input and a collector for the result of the function. A MapPartitionFunction in a Flink DataStream program would never return from the first function call, because the iterator would iterate over an endless stream of records. However, Flink's internal stream processing model requires that user functions return in order to checkpoint function state. Therefore, the DataStream API does not offer a mapPartition transformation.
In order to implement functionality similar to Spark Streaming's reduceByKey, you need to define a keyed window over the stream. Windows discretize streams, which is somewhat similar to mini batches, but windows offer much more flexibility. Since a window is of finite size, you can call reduce on the window.
This could look like:
yourStream.keyBy("myKey") // organize stream by key "myKey"
.timeWindow(Time.seconds(5)) // build 5 sec tumbling windows
.reduce(new YourReduceFunction); // apply a reduce function on each window
The DataStream documentation shows how to define various window types and explains all available functions.
Note: The DataStream API has been reworked recently. The example assumes the latest version (0.10-SNAPSHOT), which will be released as 0.10.0 in the coming days.
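For readers on a modern PyFlink rather than the old Java API above, a hedged sketch of the same keyed tumbling-window reduce; this assumes a recent PyFlink (roughly 1.16 or later) where the DataStream window API is available:

from pyflink.common.time import Time
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import TumblingProcessingTimeWindows

env = StreamExecutionEnvironment.get_execution_environment()
ds = env.from_collection([("a", 1), ("b", 2), ("a", 3)])

# reduceByKey-style aggregation: key the stream, window it, reduce per window
ds.key_by(lambda x: x[0]) \
  .window(TumblingProcessingTimeWindows.of(Time.seconds(5))) \
  .reduce(lambda a, b: (a[0], a[1] + b[1])) \
  .print()

env.execute("windowed reduce sketch")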
Assuming your input stream has a single partition of data (say, String):
val new_number_of_partitions = 4

// the line below repartitions your data; you can also broadcast data to all partitions
val step1stream = yourStream.rescale.setParallelism(new_number_of_partitions)

// flexibility for mapping
val step2stream = step1stream.map(new RichMapFunction[String, (String, Int)] {
  // var local_val_to_different_part: Type = null
  var myTaskId: Int = _

  // the function below is executed once per mapper (one mapper per partition)
  override def open(config: Configuration): Unit = {
    myTaskId = getRuntimeContext.getIndexOfThisSubtask
    // do whatever initialization you want to do, e.g. read from data sources
  }

  override def map(value: String): (String, Int) = {
    (value, myTaskId)
  }
})

val step3stream = step2stream.keyBy(0).countWindow(new_number_of_partitions).sum(1).print
// Instead of sum(1), you can use .reduce((x, y) => (x._1, x._2 + y._2))
// .countWindow will first wait for a certain number of records for a particular key
// and then apply the function
Flink streaming is pure streaming (not batched). Take a look at the Iterate API.
I am trying to model an event-based gateway that waits for several messages, and optionally for a timer. Before using this in a real model I tried it in a unit test, and it seems that in the Camunda engine the condition is completely ignored. Now I'm wondering whether this is supposed to be supported by BPMN, and if not, whether there is an easy alternative way to model this.
The code for the unit test, based on the camunda-engine-unit-test project, is as follows:
Map<String, Object> variables = singletonMap("isTimerActive", (Object) false);
ProcessInstance pi = runtimeService.startProcessInstanceByKey("testProcess", variables);
assertFalse("Process instance should not be ended", pi.isEnded());
String id = pi.getProcessInstanceId();
Job timer = managementService.createJobQuery().processInstanceId(id).timers().active().singleResult();
assertNull(timer);
This is not allowed.
The outgoing Sequence Flows of the Event Gateway MUST NOT have a conditionExpression
BPMN 2.0 Specification Section 10.5.6, page 297
edit: source: http://www.omg.org/spec/BPMN/2.0/PDF
We're trying to test the write-limit exceptions, documented to be about 1 write/second, to prepare our code for them (https://developers.google.com/appengine/docs/python/datastore/exceptions -> Timeout).
So I'm creating an item and updating it with the loop count 10k times via tasks and 10k times via a loop... It doesn't seem to trigger an exception, although the writes per second should be high enough (I remember something like more than one write per second gets critical).
Always the same: things don't break when you want them to ;).
import logging
import sys

from google.appengine.ext import deferred, ndb


class Message(ndb.Model):
    text = ndb.StringProperty()
    count = ndb.IntegerProperty()


# defined in a separate file
class DeferredClass(object):
    def put(self, id, x):
        msg = Message.get_by_id(id)
        msg.count = x
        try:
            msg.put()
        except:
            logging.error("error putting the Message")
            logging.error(sys.exc_info()[0])


msg = Message(text="TestGreeting", count=0)
key = msg.put()
id = key.id()

test = DeferredClass()
for x in range(10000):
    deferred.defer(test.put, id, x)

for x in range(10000):
    msg.count = x
    try:
        msg.put()
    except:
        logging.error("error putting the Message")
        logging.error(sys.exc_info()[0])

self.response.out.write("done")
PS: We're aware that the docs are for db and the code is ndb... the basic limitations should still exist... Also: docs on ndb exceptions would be great! Anyone?
Using a non-default TaskQueue with an increased rate limit of 350 tasks/sec led to 20 instances being fired up and plenty of Timeout exceptions... Thanks, Mr. Steinrücken!
The exception is google.appengine.api.datastore_errors.Timeout, which is the same as documented for the db package, so no ndb extras there.
PS: Our idea is to catch the exception in our cache-handling class as a sign of datastore overload and automatically set up sharding for that item... monitoring the requests per minute and disabling sharding again when not needed...
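Since sharding is the planned mitigation, here is a minimal ndb sharded-counter sketch along the lines of the classic App Engine pattern; NUM_SHARDS and the model/function names are illustrative, not from the code above:

import random

from google.appengine.ext import ndb

NUM_SHARDS = 20  # assumption: tune to the expected write rate


class MessageShard(ndb.Model):
    count = ndb.IntegerProperty(default=0)


@ndb.transactional
def increment(base_id):
    # spread writes across shards so no single entity group takes every write
    shard_id = '%s-%d' % (base_id, random.randint(0, NUM_SHARDS - 1))
    shard = MessageShard.get_by_id(shard_id)
    if shard is None:
        shard = MessageShard(id=shard_id)
    shard.count += 1
    shard.put()


def total(base_id):
    # sum over all shards to read the aggregate value
    keys = [ndb.Key(MessageShard, '%s-%d' % (base_id, i)) for i in range(NUM_SHARDS)]
    return sum(s.count for s in ndb.get_multi(keys) if s is not None)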