I'm using Kinesis Data Analytics Studio which provides a Zeppelin environment.
Very simple code:
%flink.pyflink
from pyflink.common.serialization import JsonRowDeserializationSchema
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import FlinkKafkaConsumer
# create env = determine app runs locally or remotely
env = s_env or StreamExecutionEnvironment.get_execution_environment()
env.add_jars("file:///home/ec2-user/flink-sql-connector-kafka_2.12-1.13.5.jar")
# create a kafka consumer
deserialization_schema = JsonRowDeserializationSchema.builder() \
.type_info(type_info=Types.ROW_NAMED(
['id', 'name'],
[Types.INT(), Types.STRING()])
).build()
kafka_consumer = FlinkKafkaConsumer(
topics='nihao',
deserialization_schema=deserialization_schema,
properties={
'bootstrap.servers': 'kakfa-brokers:9092',
'group.id': 'group1'
})
kafka_consumer.set_start_from_earliest()
ds = env.add_source(kafka_consumer)
ds.print()
env.execute('job1')
I can get this working locally and can see the change logs being printed to the console. However, I cannot get the same results in Zeppelin.
I also checked STDOUT for the task managers in the Flink web console, and nothing is there either.
Am I missing something? Searched for days and could not find anything on it.
I'm not 100% sure, but I think you may need a sink to begin pulling data through the datastream. You could potentially use the included Print Sink Function.
I'm a beginner with the PyFlink framework and I would like to know if my use case is possible with it.
I need to create a tumbling window and apply a Python UDF (a scikit-learn clustering model) to it.
The use case is: every 30 seconds I want to apply my UDF to the previous 30 seconds of data.
So far I have succeeded in consuming data from Kafka as a stream, but I'm not able to create a 30-second window on a non-keyed stream with the Python API.
Do you know of an example for my use case? Do you know if the PyFlink API allows this?
Here is my first attempt:
from pyflink.common import Row
from pyflink.common.serialization import JsonRowDeserializationSchema, JsonRowSerializationSchema
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import FlinkKafkaConsumer, FlinkKafkaProducer
from pyflink.common.watermark_strategy import TimestampAssigner, WatermarkStrategy
from pyflink.common import Duration
import time
from utils.selector import Selector
from utils.timestampAssigner import KafkaRowTimestampAssigner
# 1. create a StreamExecutionEnvironment
env = StreamExecutionEnvironment.get_execution_environment()
# the sql connector for kafka is used here as it's a fat jar and could avoid dependency issues
env.add_jars("file:///flink-sql-connector-kafka_2.11-1.14.0.jar")
deserialization_schema = JsonRowDeserializationSchema.builder() \
.type_info(type_info=Types.ROW_NAMED(["labelId","freq","timestamp"],[Types.STRING(),Types.DOUBLE(),Types.STRING()])).build()
kafka_consumer = FlinkKafkaConsumer(
topics='events',
deserialization_schema=deserialization_schema,
properties={'bootstrap.servers': 'localhost:9092'})
# watermark_strategy = WatermarkStrategy.for_bounded_out_of_orderness(Duration.of_seconds(5))\
# .with_timestamp_assigner(KafkaRowTimestampAssigner())
ds = env.add_source(kafka_consumer)
ds.print()
ds = ds.windowAll()
# ds.print()
env.execute()
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.flink.api.java.ClosureCleaner (file:/home/dorian/dataScience/pyflink/pyflink_env/lib/python3.6/site-packages/pyflink/lib/flink-dist_2.11-1.14.0.jar) to field java.util.Properties.serialVersionUID
WARNING: Please consider reporting this to the maintainers of org.apache.flink.api.java.ClosureCleaner
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Traceback (most recent call last):
File "/home/dorian/dataScience/pyflink/project/__main__.py", line 35, in <module>
ds = ds.windowAll()
AttributeError: 'DataStream' object has no attribute 'windowAll'
Thx
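For reference, the bucketing behind a 30-second tumbling window can be illustrated without Flink at all. The following is a minimal plain-Python sketch, not PyFlink API: the names window_start and assign_to_windows and the sample events are made up for illustration, and timestamps are assumed to be epoch milliseconds. Each event falls into exactly one window, whose start is the timestamp rounded down to a multiple of the window size.

```python
from collections import defaultdict

WINDOW_MS = 30_000  # 30-second tumbling window, as in the question


def window_start(timestamp_ms, size_ms=WINDOW_MS):
    """Round a timestamp down to the start of its tumbling window."""
    return timestamp_ms - (timestamp_ms % size_ms)


def assign_to_windows(events, size_ms=WINDOW_MS):
    """Group (timestamp_ms, value) pairs by the window they belong to."""
    windows = defaultdict(list)
    for timestamp_ms, value in events:
        windows[window_start(timestamp_ms, size_ms)].append(value)
    return dict(windows)


# Hypothetical sample data: (timestamp in ms, freq value)
events = [(1_000, 0.5), (29_999, 0.7), (30_000, 0.9), (61_500, 1.1)]
buckets = assign_to_windows(events)
```

A per-window function such as the clustering model would then be applied once to each bucket's list of values.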
I am new to the Flink streaming framework and I am trying to understand the components and the flow. I am trying to run the basic word count example using the DataStream API in my IDE. The code runs with no issues when I feed it data from a collection:
DataStream<String> text = env.fromElements(
    "To be, or not to be,--that is the question:--",
    "Whether 'tis nobler in the mind to suffer",
    "The slings and arrows of outrageous fortune",
    "Or to take arms against a sea of troubles,"
);
But it fails every time I try to read data from either a socket or a file, as below:
DataStream<String> text = env.socketTextStream("localhost", 9999);
or
DataStream<String> text = env.readTextFile("sample_file.txt");
For both the socket and textfile, I am getting the following error:
Exception in thread "main" java.lang.reflect.InaccessibleObjectException: Unable to make field private final byte[] java.lang.String.value accessible: module java.base does not "opens java.lang" to unnamed module
Flink currently (as of Flink 1.14) only supports Java 8 and Java 11. Install a suitable JDK and try again.
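To check which JDK you are on, look at the version string printed by java -version. As a rough sketch (the helper names below are made up, and the supported set simply restates the Java 8 / Java 11 claim above), the major version can be extracted like this, keeping in mind that Java 8 and earlier report themselves as 1.x:

```python
import re

SUPPORTED_MAJORS = {8, 11}  # as stated above for Flink 1.14


def java_major_version(version):
    """Return the major Java version from a string like '1.8.0_292' or '17.0.1'."""
    parts = version.split(".")
    if parts[0] == "1" and len(parts) > 1:
        # Legacy scheme: 1.8.x means Java 8
        return int(parts[1])
    # Modern scheme: the leading number is the major version
    return int(re.match(r"\d+", parts[0]).group())


def is_supported(version):
    """True if the version string denotes one of the supported major versions."""
    return java_major_version(version) in SUPPORTED_MAJORS


checks = {v: is_supported(v) for v in ["1.8.0_292", "11.0.12", "17.0.1"]}
```

An InaccessibleObjectException mentioning "module java.base" is typical of running on a newer JDK where such reflective access is denied by default.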
I am trying to expose a Prometheus Gauge in a Flink app:
@transient def metricGroup: MetricGroup = getRuntimeContext.getMetricGroup
  .addGroup("site", site)
  .addGroup("sink", counterBaseName)
@transient var failedCounter: Counter = _
def expose(metricName: String, gaugeValue: Int, context: SinkFunction.Context[_]): Unit = {
  try {
    metricGroup.addGroup("hostname", metricName).gauge[Int, ScalaGauge[Int]]("test", ScalaGauge[Int](() => gaugeValue))
  } catch {
    case _: Throwable => failedCounter.inc()
  }
}
The app runs locally just fine and exposes the metrics without any problem.
While trying to move to production I encounter the following exception in Flink task manager:
WARN org.apache.flink.runtime.metrics.MetricRegistryImpl - Error while registering metric. java.lang.NullPointerException
I'm not sure what I am missing here.
Why does the local app expose metrics, while on the cluster it fails to register the gauge?
I use Prometheus in order to expose other metrics from Flink, for example, failedCounter (in the code) which is a counter.
This is the first time I have exposed a gauge in Flink, so I bet something in my implementation is broken.
Please help.
While playing around with Flink, I have been trying to upsert data into Elasticsearch. I'm getting this error on STDOUT:
Caused by: org.apache.flink.table.api.NoMatchingTableFactoryException: Could not find a suitable table factory for 'org.apache.flink.table.factories.TableSinkFactory' in the classpath.
Reason: Required context properties mismatch.
The following properties are requested:
connector.hosts=http://elasticsearch-elasticsearch-coordinating-only.default.svc.cluster.local:9200
connector.index=transfers-sum
connector.key-null-literal=n/a
connector.property-version=1
connector.type=elasticsearch
connector.version=6
format.json-schema={ \"curr_careUnit\": {\"type\": \"text\"}, \"sum\": {\"type\": \"float\"} }
format.property-version=1
format.type=json
schema.0.data-type=VARCHAR(2147483647)
schema.0.name=curr_careUnit
schema.1.data-type=FLOAT
schema.1.name=sum
update-mode=upsert
The following factories have been considered:
org.apache.flink.streaming.connectors.kafka.Kafka09TableSourceSinkFactory
org.apache.flink.table.sinks.CsvBatchTableSinkFactory
org.apache.flink.table.sinks.CsvAppendTableSinkFactory
at org.apache.flink.table.factories.TableFactoryService.filterByContext(TableFactoryService.java:322)
...
Here is what I have in my scala Flink code:
def main(args: Array[String]) {
  // Create streaming execution environment
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(1)
  // Set properties per KafkaConsumer API
  val properties = new Properties()
  properties.setProperty("bootstrap.servers", "kafka.kafka:9092")
  properties.setProperty("group.id", "test")
  // Add Kafka source to environment
  val myKConsumer = new FlinkKafkaConsumer010[String]("raw.data4", new SimpleStringSchema(), properties)
  // Read from beginning of topic
  myKConsumer.setStartFromEarliest()
  val streamSource = env
    .addSource(myKConsumer)
  // Transform CSV (with a header row per Kafka event) into a Transfers object
  val streamTransfers = streamSource.map(new TransfersMapper())
  // create a TableEnvironment
  val tEnv = StreamTableEnvironment.create(env)
  // register a Table
  val tblTransfers: Table = tEnv.fromDataStream(streamTransfers)
  tEnv.createTemporaryView("transfers", tblTransfers)
  tEnv.connect(
    new Elasticsearch()
      .version("6")
      .host("elasticsearch-elasticsearch-coordinating-only.default.svc.cluster.local", 9200, "http")
      .index("transfers-sum")
      .keyNullLiteral("n/a"))
    .withFormat(new Json().jsonSchema("{ \"curr_careUnit\": {\"type\": \"text\"}, \"sum\": {\"type\": \"float\"} }"))
    .withSchema(new Schema()
      .field("curr_careUnit", DataTypes.STRING())
      .field("sum", DataTypes.FLOAT())
    )
    .inUpsertMode()
    .createTemporaryTable("transfersSum")
  val result = tEnv.sqlQuery(
    """
      |SELECT curr_careUnit, sum(los)
      |FROM transfers
      |GROUP BY curr_careUnit
      |""".stripMargin)
  result.insertInto("transfersSum")
  env.execute("Flink Streaming Demo Dump to Elasticsearch")
}
}
I am creating a fat jar and uploading it to my remote flink instance. Here is my build.gradle dependencies:
compile 'org.scala-lang:scala-library:2.11.12'
compile 'org.apache.flink:flink-scala_2.11:1.10.0'
compile 'org.apache.flink:flink-streaming-scala_2.11:1.10.0'
compile 'org.apache.flink:flink-connector-kafka-0.10_2.11:1.10.0'
compile 'org.apache.flink:flink-table-api-scala-bridge_2.11:1.10.0'
compile 'org.apache.flink:flink-connector-elasticsearch6_2.11:1.10.0'
compile 'org.apache.flink:flink-json:1.10.0'
compile 'com.fasterxml.jackson.core:jackson-core:2.10.1'
compile 'com.fasterxml.jackson.module:jackson-module-scala_2.11:2.10.1'
compile 'org.json4s:json4s-jackson_2.11:3.7.0-M1'
Here is how the fatJar task is defined in Gradle:
jar {
    from {
        (configurations.compile).collect {
            it.isDirectory() ? it : zipTree(it)
        }
    }
    manifest {
        attributes("Main-Class": "main" )
    }
}
task fatJar(type: Jar) {
    zip64 true
    manifest {
        attributes 'Main-Class': "flinkNamePull.Demo"
    }
    baseName = "${rootProject.name}"
    from { configurations.compile.collect { it.isDirectory() ? it : zipTree(it) } }
    with jar
}
Could anybody please help me to see what I am missing? I'm fairly new to Flink and data streaming in general. Hehe
Thank you in advance!
Is the list under "The following factories have been considered:" complete? Does it contain Elasticsearch6UpsertTableSinkFactory? If not, then as far as I can tell there is a problem with the service discovery dependencies.
How do you submit your job? Can you check whether the uber jar contains a file META-INF/services/org.apache.flink.table.factories.TableFactory with an entry for Elasticsearch6UpsertTableSinkFactory?
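If it helps, that check can be scripted, since a jar is just a zip archive. Here is a small sketch using only the Python standard library; the function names are made up, and the in-memory jar at the bottom is a stand-in for your real uber jar (pass its path instead):

```python
import io
import zipfile

SERVICE_FILE = "META-INF/services/org.apache.flink.table.factories.TableFactory"


def list_table_factories(jar):
    """Return the factory class names registered in the jar's service file."""
    with zipfile.ZipFile(jar) as zf:
        if SERVICE_FILE not in zf.namelist():
            return []
        text = zf.read(SERVICE_FILE).decode("utf-8")
    entries = [line.strip() for line in text.splitlines()]
    # Skip blank lines and comments, which service files may contain
    return [e for e in entries if e and not e.startswith("#")]


def has_factory(factories, simple_name):
    """True if any registered class has the given simple class name."""
    return any(f == simple_name or f.endswith("." + simple_name) for f in factories)


# Stand-in jar that only registers a CSV factory (illustrative only)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(SERVICE_FILE, "# comment\norg.apache.flink.table.sinks.CsvBatchTableSinkFactory\n")
buf.seek(0)

factories = list_table_factories(buf)
found = has_factory(factories, "Elasticsearch6UpsertTableSinkFactory")
```

If found is False for your real jar, the Elasticsearch factory was not registered in the service file.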
When using Maven, you have to add a transformer to properly merge service files:
<!-- The service transformer is needed to merge META-INF/services files -->
<transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
I don't know how you do it in Gradle.
EDIT:
Thanks to Arvid Heise
In Gradle, when using the shadowJar plugin, you can merge service files via:
// Merging Service Files
shadowJar {
mergeServiceFiles()
}
You should use the shadow plugin to create the fat jar instead of doing it manually.
In particular, you want to merge service descriptors.
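To illustrate why the merge matters: several jars can each ship a file with the same META-INF/services path, and when they are unpacked into one fat jar only one copy of that path can survive unless the contents are combined. The sketch below shows the idea in plain Python; it is not what the shadow plugin actually runs, and merge_service_files is a made-up name. The class names are taken from the error output above.

```python
def merge_service_files(contents):
    """Combine several service-file bodies into one list of unique entries."""
    merged = []
    for text in contents:
        for line in text.splitlines():
            entry = line.strip()
            # Ignore blanks and comments, and keep the first occurrence only
            if entry and not entry.startswith("#") and entry not in merged:
                merged.append(entry)
    return merged


# Two jars, each with its own copy of the same service file path
kafka_jar_file = "org.apache.flink.streaming.connectors.kafka.Kafka09TableSourceSinkFactory\n"
table_jar_file = (
    "org.apache.flink.table.sinks.CsvBatchTableSinkFactory\n"
    "org.apache.flink.table.sinks.CsvAppendTableSinkFactory\n"
)

merged = merge_service_files([kafka_jar_file, table_jar_file])
```

Without merging, whichever copy ends up in the jar decides which factories can be discovered, which would explain a factory missing from the "considered" list.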
When I query Google App Engine's datastore for an entity using PHP (through Quercus) and the low-level data-access API, I get an error that the entity doesn't exist, even though I've previously put it in the datastore.
The specific error is "com.caucho.quercus.QuercusException: com.google.appengine.api.datastore.DatastoreService.get: No entity was found matching the key: Test(value1)"
Here's the relevant code -
<?php
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.datastore.PreparedQuery;
import com.google.appengine.api.datastore.Query;
$testkey = KeyFactory::createKey("Test", "value1");
$ent = new Entity($testkey);
$ent->setProperty("field1", "value2");
$ent->setProperty("field2", "value3");
$dataService = DatastoreServiceFactory::getDatastoreService();
$dataService->put($ent);
echo "Data entered";
try
{
    $ent = $dataService->get($testkey);
    echo "Data queried - the results are \n";
    echo "Field1 has value ".$ent->getProperty("field1")."\n";
    echo "Field2 has value ".$ent->getProperty("field2")."\n";
}
catch(EntityNotFoundException $e)
{
    echo("<br/>Entity test not found.");
    echo("<br/>Stack Trace is:\n");
    echo($e);
}
And here's the detailed stack-trace - link.
This same code runs fine in Java (of course after changing the syntax). I wonder what's wrong.
Thanks.
I have found the solution to my problem. It was caused by missing dependencies, and I solved it by using the prepackaged PHP WordPress application available here.
One thing should be noted. The package overlooked a minor issue: all files other than the src/ directory need to be in a war/ directory that sits alongside the src/ directory (this follows the App Engine conventions described in its documentation). So I organized the files that way myself, put the above PHP file in the war/ directory, and it's working fine on App Engine.