When I use PyFlink Hive SQL to read data and insert it into Elasticsearch, it throws the following exception.
The environment:
flink 1.11.2
flink-sql-connector-hive-3.1.2_2.11-1.11.2.jar
hive 3.1.2
2020-12-17 21:10:24,398 WARN org.apache.flink.runtime.taskmanager.Task [] - Source: HiveTableSource(driver_id, driver_base_lc_p1, driver_90d_lc_p1, driver_30d_lc_p1, driver_14d_lc_p1, driver_180d_lc_p1, vehicle_base_lc_p1, driver_active_zone, is_incremental, dt) TablePath: algorithm.jiawei_oas_driver_features_for_incremental_hive2kafka, PartitionPruned: false, PartitionNums: null, ProjectedFields: [0, 8, 9] -> Calc(select=[driver_id, is_incremental, dt, () AS bdi_feature_create_time]) -> Sink: Sink(table=[default_catalog.default_database.0_demo4_903157246_tmp], fields=[driver_id, is_incremental, dt, bdi_feature_create_time]) (1/1) (98f4259c3d00fac9fc3482a4cdc8df3c) switched from RUNNING to FAILED.
java.lang.ArrayIndexOutOfBoundsException: 1024
	at org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:269) ~[flink-sql-connector-hive-3.1.2_2.11-1.11.2.jar:1.11.2]
	at org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:612) ~[flink-sql-connector-hive-3.1.2_2.11-1.11.2.jar:1.11.2]
	at org.apache.orc.impl.ConvertTreeReaderFactory$AnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:445) ~[flink-sql-connector-hive-3.1.2_2.11-1.11.2.jar:1.11.2]
	at org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1477) ~[flink-sql-connector-hive-3.1.2_2.11-1.11.2.jar:1.11.2]
	at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:2012) ~[flink-sql-connector-hive-3.1.2_2.11-1.11.2.jar:1.11.2]
	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1300) ~[flink-sql-connector-hive-3.1.2_2.11-1.11.2.jar:1.11.2]
	at org.apache.flink.orc.shim.OrcShimV210.nextBatch(OrcShimV210.java:35) ~[flink-sql-connector-hive-3.1.2_2.11-1.11.2.jar:1.11.2]
	at org.apache.flink.orc.shim.OrcShimV210.nextBatch(OrcShimV210.java:29) ~[flink-sql-connector-hive-3.1.2_2.11-1.11.2.jar:1.11.2]
	at org.apache.flink.orc.OrcSplitReader.ensureBatch(OrcSplitReader.java:134) ~[flink-sql-connector-hive-3.1.2_2.11-1.11.2.jar:1.11.2]
	at org.apache.flink.orc.OrcSplitReader.reachedEnd(OrcSplitReader.java:101) ~[flink-sql-connector-hive-3.1.2_2.11-1.11.2.jar:1.11.2]
	at org.apache.flink.connectors.hive.read.HiveVectorizedOrcSplitReader.reachedEnd(HiveVectorizedOrcSplitReader.java:99) ~[flink-sql-connector-hive-3.1.2_2.11-1.11.2.jar:1.11.2]
	at org.apache.flink.connectors.hive.read.HiveTableInputFormat.reachedEnd(HiveTableInputFormat.java:261) ~[flink-sql-connector-hive-3.1.2_2.11-1.11.2.jar:1.11.2]
	at org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction.run(InputFormatSourceFunction.java:90) ~[flink-dist_2.11-1.11.2.jar:1.11.2]
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100) ~[flink-dist_2.11-1.11.2.jar:1.11.2]
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63) ~[flink-dist_2.11-1.11.2.jar:1.11.2]
	at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:213) ~[flink-dist_2.11-1.11.2.jar:1.11.2]
2020-12-17 21:10:24,402 INFO org.apache.flink.runtime.taskmanager.Task [] - Freeing task resources for Source: HiveTableSource(driver_id, driver_base_lc_p1, driver_90d_lc_p1, driver_30d_lc_p1, driver_14d_lc_p1, driver_180d_lc_p1, vehicle_base_lc_p1, driver_active_zone, is_incremental, dt) TablePath: algorithm.jiawei_oas_driver_features_for_incremental_hive2kafka, PartitionPruned: false, PartitionNums: null, ProjectedFields: [0, 8, 9] -> Calc(select=[driver_id, is_incremental, dt, () AS bdi_feature_create_time]) -> Sink: Sink(table=[default_catalog.default_database.0_demo4_903157246_tmp], fields=[driver_id, is_incremental, dt, bdi_feature_create_time]) (1/1) (98f4259c3d00fac9fc3482a4cdc8df3c).
2020-12-17 21:10:24,406 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
How can I solve this problem?
The root cause is a bug in the ORC format that is hit when the array batch size is >= 1024; Flink uses a default batch size of 2048, which triggers the bug:
https://issues.apache.org/jira/browse/ORC-598
https://issues.apache.org/jira/browse/ORC-672
We've created an issue to work around the ORC bug in Flink:
https://issues.apache.org/jira/browse/FLINK-20667
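Until that fix is released, one possible workaround (a sketch, assuming a PyFlink TableEnvironment named t_env as in a typical job) is to disable Flink's native vectorized ORC reader via the documented table.exec.hive.fallback-mapred-reader option, so Hive ORC splits are read with the Hadoop mapred record reader instead, at some performance cost:
# Workaround sketch: avoid the vectorized batch path that triggers the ORC bug
# by falling back to the Hadoop mapred record reader for Hive tables.
t_env.get_config().get_configuration().set_boolean(
    "table.exec.hive.fallback-mapred-reader", True)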
Related
Configuring a KafkaSink from the new Kafka connector API (since version 1.15) with DeliveryGuarantee.EXACTLY_ONCE and a transactionalId prefix produces an excessive amount of logs each time a new checkpoint is triggered.
The logs look like this:
2022-11-02 10:04:10,124 INFO org.apache.flink.connector.kafka.sink.FlinkKafkaInternalProducer [] - Flushing new partitions
2022-11-02 10:04:10,125 INFO org.apache.kafka.clients.producer.ProducerConfig [] - ProducerConfig values:
acks = -1
batch.size = 16384
bootstrap.servers = [localhost:9092]
buffer.memory = 33554432
client.dns.lookup = use_all_dns_ips
client.id = producer-flink-1-24
compression.type = none
connections.max.idle.ms = 540000
delivery.timeout.ms = 120000
enable.idempotence = true
interceptor.classes = []
internal.auto.downgrade.txn.commit = false
key.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer
linger.ms = 0
max.block.ms = 60000
max.in.flight.requests.per.connection = 5
max.request.size = 1048576
metadata.max.age.ms = 300000
metadata.max.idle.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partitioner.class = class org.apache.kafka.clients.producer.internals.DefaultPartitioner
receive.buffer.bytes = 32768
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 30000
retries = 2147483647
retry.backoff.ms = 100
sasl.client.callback.handler.class = null
sasl.jaas.config = null
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.login.callback.handler.class = null
sasl.login.class = null
sasl.login.refresh.buffer.seconds = 300
sasl.login.refresh.min.period.seconds = 60
sasl.login.refresh.window.factor = 0.8
sasl.login.refresh.window.jitter = 0.05
sasl.mechanism = GSSAPI
security.protocol = PLAINTEXT
security.providers = null
send.buffer.bytes = 131072
socket.connection.setup.timeout.max.ms = 30000
socket.connection.setup.timeout.ms = 10000
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.3]
ssl.endpoint.identification.algorithm = https
ssl.engine.factory.class = null
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.certificate.chain = null
ssl.keystore.key = null
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLSv1.3
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.certificates = null
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
transaction.timeout.ms = 60000
transactional.id = flink-1-24
value.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer
2022-11-02 10:04:10,131 INFO org.apache.kafka.clients.producer.KafkaProducer [] - [Producer clientId=producer-flink-1-24, transactionalId=flink-1-24] Overriding the default enable.idempotence to true since transactional.id is specified.
2022-11-02 10:04:10,161 INFO org.apache.kafka.clients.producer.KafkaProducer [] - [Producer clientId=producer-flink-0-24, transactionalId=flink-0-24] Overriding the default enable.idempotence to true since transactional.id is specified.
2022-11-02 10:04:10,161 INFO org.apache.kafka.clients.producer.KafkaProducer [] - [Producer clientId=producer-flink-0-24, transactionalId=flink-0-24] Instantiated a transactional producer.
2022-11-02 10:04:10,162 INFO org.apache.kafka.clients.producer.KafkaProducer [] - [Producer clientId=producer-flink-0-24, transactionalId=flink-0-24] Overriding the default acks to all since idempotence is enabled.
2022-11-02 10:04:10,159 INFO org.apache.kafka.clients.producer.KafkaProducer [] - [Producer clientId=producer-flink-1-24, transactionalId=flink-1-24] Instantiated a transactional producer.
2022-11-02 10:04:10,170 INFO org.apache.kafka.clients.producer.KafkaProducer [] - [Producer clientId=producer-flink-1-24, transactionalId=flink-1-24] Overriding the default acks to all since idempotence is enabled.
2022-11-02 10:04:10,181 INFO org.apache.kafka.common.utils.AppInfoParser [] - Kafka version: 2.8.1
2022-11-02 10:04:10,184 INFO org.apache.kafka.common.utils.AppInfoParser [] - Kafka commitId: 839b886f9b732b15
2022-11-02 10:04:10,184 INFO org.apache.kafka.common.utils.AppInfoParser [] - Kafka startTimeMs: 1667379850181
2022-11-02 10:04:10,185 INFO org.apache.kafka.clients.producer.internals.TransactionManager [] - [Producer clientId=producer-flink-0-24, transactionalId=flink-0-24] Invoking InitProducerId for the first time in order to acquire a producer ID
2022-11-02 10:04:10,192 INFO org.apache.kafka.common.utils.AppInfoParser [] - Kafka version: 2.8.1
2022-11-02 10:04:10,192 INFO org.apache.kafka.common.utils.AppInfoParser [] - Kafka commitId: 839b886f9b732b15
2022-11-02 10:04:10,192 INFO org.apache.kafka.common.utils.AppInfoParser [] - Kafka startTimeMs: 1667379850192
2022-11-02 10:04:10,209 INFO org.apache.kafka.clients.producer.internals.TransactionManager [] - [Producer clientId=producer-flink-1-24, transactionalId=flink-1-24] Invoking InitProducerId for the first time in order to acquire a producer ID
2022-11-02 10:04:10,211 INFO org.apache.kafka.clients.Metadata [] - [Producer clientId=producer-flink-0-24, transactionalId=flink-0-24] Cluster ID: MCY5mzM1QWyc1YCvsO8jag
2022-11-02 10:04:10,216 INFO org.apache.kafka.clients.producer.internals.TransactionManager [] - [Producer clientId=producer-flink-0-24, transactionalId=flink-0-24] Discovered transaction coordinator ubuntu:9092 (id: 0 rack: null)
2022-11-02 10:04:10,233 INFO org.apache.kafka.clients.Metadata [] - [Producer clientId=producer-flink-1-24, transactionalId=flink-1-24] Cluster ID: MCY5mzM1QWyc1YCvsO8jag
2022-11-02 10:04:10,241 INFO org.apache.kafka.clients.producer.internals.TransactionManager [] - [Producer clientId=producer-flink-1-24, transactionalId=flink-1-24] Discovered transaction coordinator ubuntu:9092 (id: 0 rack: null)
2022-11-02 10:04:10,345 INFO org.apache.kafka.clients.producer.internals.TransactionManager [] - [Producer clientId=producer-flink-0-24, transactionalId=flink-0-24] ProducerId set to 51 with epoch 0
2022-11-02 10:04:10,346 INFO org.apache.flink.connector.kafka.sink.KafkaWriter [] - Created new transactional producer flink-0-24
2022-11-02 10:04:10,353 INFO org.apache.kafka.clients.producer.internals.TransactionManager [] - [Producer clientId=producer-flink-1-24, transactionalId=flink-1-24] ProducerId set to 52 with epoch 0
2022-11-02 10:04:10,354 INFO org.apache.flink.connector.kafka.sink.KafkaWriter [] - Created new transactional producer flink-1-24
The ProducerConfig values log is repeated for every new producer created (one per sink subtask, based on the sink parallelism level).
With the checkpoint interval configured to 10 or 15 seconds, I lose valuable job logs in the noise.
Is there a way to disable these logs without setting the WARN level globally?
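For reference, one standard approach (a sketch of plain Log4j 2 configuration, not something confirmed in this thread) is a per-logger override in Flink's conf/log4j.properties, raising the level only for the chatty Kafka client loggers while the rest of the job stays at INFO:
# Silence the ProducerConfig dump and producer lifecycle messages only;
# all other loggers keep their configured level.
logger.kafka_producer_config.name = org.apache.kafka.clients.producer.ProducerConfig
logger.kafka_producer_config.level = WARN
logger.kafka_producer.name = org.apache.kafka.clients.producer.KafkaProducer
logger.kafka_producer.level = WARN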
I use PyFlink to run a Flink streaming job. If I run Flink in standalone mode, it works; but if I run Flink in yarn-per-job mode, it fails with "pyflink.util.exceptions.TableException: Failed to execute sql".
The yarn-per-job command is: flink run -t yarn-per-job -Djobmanager.memory.process.size=1024mb -Dtaskmanager.memory.process.size=2048mb -ynm flink-cluster -Dtaskmanager.numberOfTaskSlots=2 -pyfs cluster.py ...
The standalone command is: flink run -pyfs cluster.py ...
The Python environment archive is attached in cluster.py:
env = StreamExecutionEnvironment.get_execution_environment()
env_settings = EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
t_env = StreamTableEnvironment.create(env, environment_settings=env_settings)
curr_path = os.path.abspath(os.path.dirname(os.path.dirname(__file__)))
jars = f"""
file://{curr_path}/jars/flink-sql-connector-kafka_2.11-1.13.1.jar;
file://{curr_path}/jars/force-shading-1.13.1.jar"""
t_env.get_config().get_configuration().set_string("pipeline.jars", jars)
t_env.add_python_archive("%s/requirements/flink.zip" % curr_path)
t_env.get_config().set_python_executable("flink.zip/flink/bin/python")
env.set_stream_time_characteristic(TimeCharacteristic.EventTime)
env.set_parallelism(2)
env.get_config().set_auto_watermark_interval(10000)
t_env.get_config().get_configuration().set_boolean("python.fn-execution.memory.managed", True)
parse_log = udaf(LogParser(parsing_params),
input_types=[DataTypes.STRING(), DataTypes.STRING(), DataTypes.STRING(), DataTypes.STRING(),
DataTypes.STRING(), DataTypes.TIMESTAMP(3)],
result_type=DataTypes.STRING(), func_type="pandas")
process_ad = udf(ADProcessor(ad_params), result_type=DataTypes.STRING())
t_env.create_temporary_function('log_parsing_process', parse_log)
t_env.create_temporary_function('ad_process', process_ad)
tumble_window = Tumble.over("5.minutes").on("time_ltz").alias("w")
t_env.execute_sql(f"""
CREATE TABLE source_table(
ip VARCHAR, -- ip address
raws VARCHAR, -- message
host VARCHAR, -- host
log_type VARCHAR, -- type
system_name VARCHAR, -- system
ts BIGINT,
time_ltz AS TO_TIMESTAMP_LTZ(ts, 3),
WATERMARK FOR time_ltz AS time_ltz - INTERVAL '5' SECOND
) WITH (
'connector' = 'kafka',
'topic' = '{source_topic}',
'properties.bootstrap.servers' = '{source_servers}',
'properties.group.id' = '{group_id}',
'scan.startup.mode' = '{auto_offset_reset}',
'format' = 'json'
)
""")
sink_sql = f"""
CREATE TABLE sink (
alert VARCHAR, -- alert
start_time timestamp(3), -- window start timestamp
end_time timestamp(3) -- window end timestamp
) with (
'connector' = 'kafka',
'topic' = '{sink_topic}',
'properties.bootstrap.servers' = '{sink_servers}',
'json.fail-on-missing-field' = 'false',
'json.ignore-parse-errors' = 'true',
'format' = 'json'
)"""
t_env.execute_sql(sink_sql)
t_env.get_config().set_null_check(False)
source_table = t_env.from_path('source_table')
sink_table = source_table.window(tumble_window) \
.group_by("w, log_type") \
.select("log_parsing_process(ip, raws, host, log_type, system_name, time_ltz) AS pattern, "
"w.start AS start_time, "
"w.end AS end_time") \
.select("ad_process(pattern, start_time, end_time) AS alert, start_time, end_time")
sink_table.execute_insert("sink")
The error is:
File "/tmp/pyflink/xxxx/xxxx/workerbee/log_exception_detection_run_on_diff_mode.py", line 148, in run_flink
    sink_table.execute_insert("test_sink")
File "/opt/flink/flink-1.13.1_scala_2.12/opt/python/pyflink.zip/pyflink/table/table.py", line 1056, in execute_insert
File "/opt/flink/flink-1.13.1_scala_2.12/opt/python/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in __call__
File "/opt/flink/flink-1.13.1_scala_2.12/opt/python/pyflink.zip/pyflink/util/exceptions.py", line 163, in deco
pyflink.util.exceptions.TableException: Failed to execute sql
	at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:777)
	at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:742)
	at org.apache.flink.table.api.internal.TableImpl.executeInsert(TableImpl.java:572)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.flink.api.python.shaded.py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at org.apache.flink.api.python.shaded.py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at org.apache.flink.api.python.shaded.py4j.Gateway.invoke(Gateway.java:282)
	at org.apache.flink.api.python.shaded.py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at org.apache.flink.api.python.shaded.py4j.commands.CallCommand.execute(CallCommand.java:79)
	at org.apache.flink.api.python.shaded.py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
org.apache.flink.client.program.ProgramAbortException: java.lang.RuntimeException: Python process exits with code: 1
nodemanager log:
INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: launchContainer: [bash, /opt/hadoop_data/tmp/nm-local-dir/usercache/root/appcache/application_1644370510310_0002/container_1644370510310_0002_03_000001/default_container_executor.sh]
WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1644370510310_0002_03_000001 is : 1
WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1644370510310_0002_03_000001 and exit code: 1
ExitCodeException exitCode=1:
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008)
	at org.apache.hadoop.util.Shell.run(Shell.java:901)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:309)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.launchContainer(ContainerLaunch.java:585)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:373)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:103)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from container-launch.
INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: container id: container_1644370510310_0002_03_000001
INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 1
WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container launch failed : Container exited with a non-zero exit code 1
INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1644370510310_0002_03_000001 transitioned from RUNNING to EXITED_WITH_FAILURE
This looks like a classloader-related issue. For the classloader.check-leaked-classloader configuration, you can refer to https://nightlies.apache.org/flink/flink-docs-master/zh/docs/deployment/config/
In addition, you can try the add_jars API instead of setting the pipeline.jars config directly:
def add_jars(self, *jars_path: str):
    """
    Adds a list of jar files that will be uploaded to the cluster and referenced by the job.

    :param jars_path: Path of jars.
    """
    add_jars_to_context_class_loader(jars_path)
    jvm = get_gateway().jvm
    jars_key = jvm.org.apache.flink.configuration.PipelineOptions.JARS.key()
    env_config = jvm.org.apache.flink.python.util.PythonConfigUtil \
        .getEnvironmentConfig(self._j_stream_execution_environment)
    old_jar_paths = env_config.getString(jars_key, None)
    joined_jars_path = ';'.join(jars_path)
    if old_jar_paths and old_jar_paths.strip():
        joined_jars_path = ';'.join([old_jar_paths, joined_jars_path])
    env_config.setString(jars_key, joined_jars_path)
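For example (reusing the jar paths from the question), add_jars is called on the StreamExecutionEnvironment before the pipeline is built:
# Equivalent to setting pipeline.jars, but it also registers the jars with the
# current context classloader (see add_jars_to_context_class_loader above).
env.add_jars(f"file://{curr_path}/jars/flink-sql-connector-kafka_2.11-1.13.1.jar",
             f"file://{curr_path}/jars/force-shading-1.13.1.jar")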
After debugging and checking, I finally found that the issue was that I was missing some Flink Hadoop jar packages:
commons-cli-1.4.jar
flink-shaded-hadoop-3-uber-3.1.1.7.2.1.0-327-9.0.jar
hadoop-yarn-api-3.3.1.jar
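For reference, what resolved it in my setup (paths are illustrative) was copying the missing jars into Flink's lib directory so the YARN client ships them with the application:
cp commons-cli-1.4.jar \
   flink-shaded-hadoop-3-uber-3.1.1.7.2.1.0-327-9.0.jar \
   hadoop-yarn-api-3.3.1.jar \
   $FLINK_HOME/lib/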
I'm trying to consume some Kafka topics through Flink streams. Below is the code for the stream.
object VehicleFuelEventStream {

  def sink[IN](hosts: String, table: String, ds: DataStream[IN]): CassandraSink[IN] = {
    CassandraSink.addSink(ds)
      .setQuery(s"INSERT INTO global.$table (id, vehicle_id, consumption, from, to, version) values (?, ?, ?, ?, ?, ?);")
      .setClusterBuilder(new ClusterBuilder() {
        override def buildCluster(builder: Cluster.Builder): Cluster = {
          builder.addContactPoints(hosts.split(","): _*).build()
        }
      }).build()
  }

  def dataStream[T: TypeInformation](env: StreamExecutionEnvironment,
                                     source: SourceFunction[String],
                                     flinkConfigs: FlinkStreamProcessorConfigs,
                                     windowFunction: WindowFunction[VehicleFuelEvent, (String, Double, SortedMap[Date, Double], Long, Long), String, TimeWindow]): DataStream[(String, Double, SortedMap[Date, Double], Long, Long)] = {
    val fuelEventStream = env.addSource(source)
      .map(e => EventSerialiserRepo.fromJsonStr[VehicleFuelEvent](e))
      .filter(_.isSuccess)
      .map(_.get)
      .assignTimestampsAndWatermarks(VehicleFuelEventTimestampExtractor.withWatermarkDelay(flinkConfigs.windowWatermarkDelay))
      .keyBy(_.vehicleId) // key by vehicleId
      .timeWindow(Time.minutes(flinkConfigs.windowTimeWidth.toMinutes))
      .apply(windowFunction)
    fuelEventStream
  }
}
The stream is triggered when the Play framework creates its dependencies on startup via Google Guice, as below.
@Singleton
class VehicleEventKafkaConsumer @Inject()(conf: Configuration,
                                          lifecycle: ApplicationLifecycle,
                                          repoFactory: StorageFactory,
                                          cache: CacheApi,
                                          cassandra: Cassandra,
                                          fleetConfigs: FleetConfigManager) {

  private val kafkaConfigs = KafkaConsumerConfigs(conf)
  private val flinkConfigs = FlinkStreamProcessorConfigs(conf)
  private val topicsWithClassTags = getClassTagsForTopics
  private val cassandraConfigs = CassandraConfigs(conf)
  private val repoCache = mutable.HashMap.empty[String, CachedSubjectStatePersistor]
  private val props = new Properties()
  // add props

  private val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
  env.enableCheckpointing(flinkConfigs.checkpointingInterval.toMillis)

  if (flinkConfigs.enabled) {
    topicsWithClassTags.toList
      .map {
        case (topic, tag) if tag.runtimeClass.isAssignableFrom(classOf[VehicleFuelEvent]) =>
          Logger.info(s"starting for - $topic and $tag")
          val source = new FlinkKafkaConsumer09[String](topic, new SimpleStringSchema(), props)
          val fuelEventStream = VehicleFuelEventStream.dataStream[String](env, source, flinkConfigs, new VehicleFuelEventWindowFunction)
          VehicleFuelEventStream.sink(cassandraConfigs.hosts, flinkConfigs.cassandraTable, fuelEventStream)
        case (topic, _) =>
          Logger.info(s"no stream processor found for topic $topic")
      }
    Logger.info("starting flink stream processors")
    env.execute("flink vehicle event processors")
  } else
    Logger.info("Flink stream processor is disabled!")
}
I get the below errors on application startup.
03/13/2018 05:47:23 TriggerWindow(TumblingEventTimeWindows(1800000), ListStateDescriptor{serializer=org.apache.flink.api.common.typeutils.base.ListSerializer@e899e41f}, EventTimeTrigger(), WindowedStream.apply(WindowedStream.scala:582)) -> Sink: Cassandra Sink(4/4) switched to RUNNING
2018-03-13 05:47:23,262 - [info] o.a.f.r.t.Task - TriggerWindow(TumblingEventTimeWindows(1800000), ListStateDescriptor{serializer=org.apache.flink.api.common.typeutils.base.ListSerializer@e899e41f}, EventTimeTrigger(), WindowedStream.apply(WindowedStream.scala:582)) -> Sink: Cassandra Sink (3/4) (d5124be9bcef94bd0e305c4b4546b055) switched from RUNNING to FAILED.
org.apache.flink.streaming.runtime.tasks.StreamTaskException: Cannot load user class: streams.fuelevent.VehicleFuelEventWindowFunction
ClassLoader info: URL ClassLoader:
Class not resolvable through given classloader.
at org.apache.flink.streaming.api.graph.StreamConfig.getStreamOperator(StreamConfig.java:232)
at org.apache.flink.streaming.runtime.tasks.OperatorChain.<init>(OperatorChain.java:95)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:231)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:745)
2018-03-13 05:47:23,262 - [info] o.a.f.r.t.Task - TriggerWindow(TumblingEventTimeWindows(1800000), ListStateDescriptor{serializer=org.apache.flink.api.common.typeutils.base.ListSerializer@e899e41f}, EventTimeTrigger(), WindowedStream.apply(WindowedStream.scala:582)) -> Sink: Cassandra Sink (2/4) (6b3de15a4f6
.....
org.apache.flink.streaming.runtime.tasks.StreamTaskException: Could not instantiate outputs in order.
at org.apache.flink.streaming.api.graph.StreamConfig.getOutEdgesInOrder(StreamConfig.java:394)
at org.apache.flink.streaming.runtime.tasks.OperatorChain.<init>(OperatorChain.java:103)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:231)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: streams.fuelevent.VehicleFuelEventStream$$anonfun$4
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$ChildFirstClassLoader.loadClass(FlinkUserCodeClassLoaders.java:128)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
2018-03-13 05:47:23,336 - [info] o.a.f.r.t.Task - Source: Custom Source -> Map -> Filter -> Map -> Timestamps/Watermarks (4/4) (97b2313c985592fdec0ac4f7fba8062f) switched from RUNNING to FAILED.
dependencies
// kafka
libraryDependencies += "org.apache.kafka" %% "kafka" % "1.0.0"
//flink
libraryDependencies += "org.apache.flink" %% "flink-streaming-scala" % "1.4.2"
libraryDependencies += "org.apache.flink" %% "flink-connector-kafka-0.9" % "1.4.0"
libraryDependencies += "org.apache.flink" %% "flink-connector-cassandra" % "1.4.2"
Any help is appreciated to solve this issue.
Looks like Flink can't find the class streams.fuelevent.VehicleFuelEventStream$$anonfun$4.
Is that class inside the Flink classpath or inside your jar file?
The Flink docs could shed some light on this: https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/debugging_classloading.html
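A quick check you can run (the jar name here is illustrative) is to list the contents of the jar you actually submit and confirm that the window function and the compiler-generated anonymous function classes made it in:
jar tf your-app-assembly.jar | grep streams/fuelevent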
I am trying to filter all temperature events that are > 50 in Flink using the pattern below:
Pattern<MonitoringEvent, ?> warningPattern = Pattern.<MonitoringEvent>begin("first")
        .subtype(TemperatureEvent.class)
        .where(new FilterFunction<TemperatureEvent>() {
            @Override
            public boolean filter(TemperatureEvent temperatureEvent) throws Exception {
                return temperatureEvent.getTemperature() > 50;
            }
        });
The input is a text file, which is parsed into a stream by an input function. The contents of the input file are:
1,98
2,33
3,44
4,55
5,66
6,88
7,99
8,76
Here the first value is the rack id and the second is the temperature.
I have called print() on both the input stream and the warnings stream, as shown below:
inputEventStream.print();
warnings.print();
Now comes the issue. The output of Flink CEP is shown below:
08/10/2017 23:43:15 Job execution switched to status RUNNING.
08/10/2017 23:43:15 Source: Custom Source -> Sink: Unnamed(1/1) switched to SCHEDULED
08/10/2017 23:43:15 Source: Custom Source -> Sink: Unnamed(1/1) switched to DEPLOYING
08/10/2017 23:43:15 AbstractCEPPatternOperator -> Map -> Sink: Unnamed(1/1) switched to SCHEDULED
08/10/2017 23:43:15 AbstractCEPPatternOperator -> Map -> Sink: Unnamed(1/1) switched to DEPLOYING
08/10/2017 23:43:15 AbstractCEPPatternOperator -> Map -> Sink: Unnamed(1/1) switched to RUNNING
08/10/2017 23:43:15 Source: Custom Source -> Sink: Unnamed(1/1) switched to RUNNING
Rack id = 1 and temprature = 98.0)
Rack id = 2 and temprature = 33.0)
Rack id = 3 and temprature = 44.0)
Rack id = 4 and temprature = 55.0)
Rack id = 5 and temprature = 66.0)
Rack id = 6 and temprature = 88.0)
Rack id = 7 and temprature = 99.0)
Rack id = 8 and temprature = 76.0)
08/10/2017 23:43:16 Source: Custom Source -> Sink: Unnamed(1/1) switched to FINISHED
Rack id = 1 and temprature = 98.0)
Rack id = 8 and temprature = 76.0)
Rack id = 7 and temprature = 99.0)
Rack id = 6 and temprature = 88.0)
Rack id = 5 and temprature = 66.0)
Rack id = 4 and temprature = 55.0)
08/10/2017 23:43:16 AbstractCEPPatternOperator -> Map -> Sink: Unnamed(1/1) switched to FINISHED
08/10/2017 23:43:16 Job execution switched to status FINISHED.
Process finished with exit code 0
As we can see, the first complex event (Rack id = 1 and temperature = 98.0) is printed in the same order as the input, but after it, all the other complex events with temperature > 50 are printed in reverse order relative to the input stream.
My questions are:
1. Any idea why the events are printed in reverse order?
2. Is there a custom way to print values of the warning stream without using warnings.print()? For example, can I print only the temperature rather than the rack id?
Thanks in advance.
This issue was solved by assigning timestamps and watermarks to the input stream, as shown below. In event time mode, Flink CEP buffers incoming events and sorts them by timestamp once the watermark advances, so without timestamps and watermarks there is no defined order in which matches are emitted.
// Input stream of monitoring events
DataStream<MonitoringEvent> inputEventStream = env
.addSource(new InputStreamAGenerator()).assignTimestampsAndWatermarks(new IngestionTimeExtractor<>());
The generated output is shown below:
08/11/2017 00:45:09 Job execution switched to status RUNNING.
08/11/2017 00:45:09 Source: Custom Source -> Timestamps/Watermarks(1/1) switched to SCHEDULED
08/11/2017 00:45:09 Source: Custom Source -> Timestamps/Watermarks(1/1) switched to DEPLOYING
08/11/2017 00:45:09 AbstractCEPPatternOperator -> Map -> Sink: Unnamed(1/1) switched to SCHEDULED
08/11/2017 00:45:09 AbstractCEPPatternOperator -> Map -> Sink: Unnamed(1/1) switched to DEPLOYING
08/11/2017 00:45:09 Source: Custom Source -> Timestamps/Watermarks(1/1) switched to RUNNING
08/11/2017 00:45:09 AbstractCEPPatternOperator -> Map -> Sink: Unnamed(1/1) switched to RUNNING
Rack id = 1 and temprature = 98.0)
Rack id = 4 and temprature = 55.0)
Rack id = 5 and temprature = 66.0)
Rack id = 6 and temprature = 88.0)
Rack id = 7 and temprature = 99.0)
Rack id = 8 and temprature = 76.0)
08/11/2017 00:45:10 Source: Custom Source -> Timestamps/Watermarks(1/1) switched to FINISHED
08/11/2017 00:45:10 AbstractCEPPatternOperator -> Map -> Sink: Unnamed(1/1) switched to FINISHED
08/11/2017 00:45:10 Job execution switched to status FINISHED.
I followed the tutorial from https://medium.com/@chvanikoff/phoenix-react-love-story-reph-1-c68512cfe18 and developed an application, but with different versions:
elixir - 1.3.4
phoenix - 1.2.1
poison - 2.0
distillery - 0.10
std_json_io - 0.1
The application ran successfully when running locally.
But when I created a mix release (MIX_ENV=prod mix release --env=prod --verbose) and ran rel/utopia/bin/utopia console (the OTP application name is :utopia), I ran into this error:
Interactive Elixir (1.3.4) - press Ctrl+C to exit (type h() ENTER for help)
14:18:21.857 request_id=idqhtoim2nb3lfeguq22627a92jqoal6 [info] GET /
panic: write |1: broken pipe
goroutine 3 [running]:
runtime.panic(0x4a49e0, 0xc21001f480)
/usr/local/Cellar/go/1.2.2/libexec/src/pkg/runtime/panic.c:266 +0xb6
log.(*Logger).Panicf(0xc210020190, 0x4de060, 0x3, 0x7f0924c84e38, 0x1, ...)
/usr/local/Cellar/go/1.2.2/libexec/src/pkg/log/log.go:200 +0xbd
main.fatal_if(0x4c2680, 0xc210039ab0)
/Users/alco/extra/goworkspace/src/goon/util.go:38 +0x17e
main.inLoop2(0x7f0924e0c388, 0xc2100396f0, 0xc2100213c0, 0x7f0924e0c310, 0xc210000000, ...)
/Users/alco/extra/goworkspace/src/goon/io.go:100 +0x5ce
created by main.wrapStdin2
/Users/alco/extra/goworkspace/src/goon/io.go:25 +0x15a
goroutine 1 [chan receive]:
main.proto_2_0(0x7ffce6670101, 0x4e3e20, 0x3, 0x4de5a0, 0x1, ...)
/Users/alco/extra/goworkspace/src/goon/proto_2_0.go:58 +0x3a3
main.main()
/Users/alco/extra/goworkspace/src/goon/main.go:51 +0x3b6
14:18:21.858 request_id=idqhtoim2nb3lfeguq22627a92jqoal6 [info] Sent 500 in 1ms
14:18:21.859 [error] #PID<0.1493.0> running Utopia.Endpoint terminated
Server: 127.0.0.1:8080 (http)
Request: GET /
** (exit) an exception was raised:
** (Protocol.UndefinedError) protocol String.Chars not implemented for {#PID<0.1467.0>, :result, %Porcelain.Result{err: nil, out: {:send, #PID<0.1466.0>}, status: 2}}
(elixir) lib/string/chars.ex:3: String.Chars.impl_for!/1
(elixir) lib/string/chars.ex:17: String.Chars.to_string/1
(utopia) lib/utopia/react_io.ex:2: Utopia.ReactIO.json_call!/2
(utopia) web/controllers/page_controller.ex:12: Utopia.PageController.index/2
(utopia) web/controllers/page_controller.ex:1: Utopia.PageController.action/2
(utopia) web/controllers/page_controller.ex:1: Utopia.PageController.phoenix_controller_pipeline/2
(utopia) lib/utopia/endpoint.ex:1: Utopia.Endpoint.instrument/4
(utopia) lib/phoenix/router.ex:261: Utopia.Router.dispatch/2
goon panicked, and hence so did Porcelain. Could someone please provide a solution?
Related issues: https://github.com/alco/porcelain/issues/13
EDIT: My page_controller.ex
defmodule Utopia.PageController do
  use Utopia.Web, :controller

  def index(conn, _params) do
    visitors = Utopia.Tracking.Visitors.state
    initial_state = %{"visitors" => visitors}
    props = %{
      "location" => conn.request_path,
      "initial_state" => initial_state
    }
    result = Utopia.ReactIO.json_call!(%{
      component: "./priv/static/server/js/utopia.js",
      props: props
    })
    render(conn, "index.html", html: result["html"], props: initial_state)
  end
end