I want to write ORC file using Flink's Streaming File Sink but it doesn’t write files correctly - apache-flink

I am reading data from Kafka and trying to write it to the HDFS file system in ORC format. I have used the below link reference from their official website. But I can see that Flink write exact same content for all data and make so many files and all files are ok 103KB
https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/connectors/streamfile_sink.html#orc-format
Please find my code below.
object BeaconBatchIngest extends StreamingBase {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
def getTopicConfig(configs: List[Config]): Map[String, String] = (for (config: Config <- configs) yield (config.getString("sourceTopic"), config.getString("destinationTopic"))).toMap
def setKafkaConfig():Unit ={
val kafkaParams = new Properties()
kafkaParams.setProperty("bootstrap.servers","")
kafkaParams.setProperty("zookeeper.connect","")
kafkaParams.setProperty("group.id", DEFAULT_KAFKA_GROUP_ID)
kafkaParams.setProperty("auto.offset.reset", "latest")
val kafka_consumer:FlinkKafkaConsumer[String] = new FlinkKafkaConsumer[String]("sourceTopics", new SimpleStringSchema(),kafkaParams)
kafka_consumer.setStartFromLatest()
val stream: DataStream[DataParse] = env.addSource(kafka_consumer).map(new temp)
val schema: String = "struct<_col0:string,_col1:bigint,_col2:string,_col3:string,_col4:string>"
val writerProperties = new Properties()
writerProperties.setProperty("orc.compress", "ZLIB")
val writerFactory = new OrcBulkWriterFactory(new PersonVectorizer(schema),writerProperties,new org.apache.hadoop.conf.Configuration);
val sink: StreamingFileSink[DataParse] = StreamingFileSink
.forBulkFormat(new Path("hdfs://warehousestore/hive/warehouse/metrics_test.db/upp_raw_prod/hour=1/"), writerFactory)
.build()
stream.addSink(sink)
}
def main(args: Array[String]): Unit = {
setKafkaConfig()
env.enableCheckpointing(5000)
env.execute("Kafka_Flink_HIVE")
}
}
class temp extends MapFunction[String,DataParse]{
override def map(record: String): DataParse = {
new DataParse(record)
}
}
class DataParse(data : String){
val parsedJason = parse(data)
val timestamp = compact(render(parsedJason \ "timestamp")).replaceAll("\"", "").toLong
val event = compact(render(parsedJason \ "event")).replaceAll("\"", "")
val source_id = compact(render(parsedJason \ "source_id")).replaceAll("\"", "")
val app = compact(render(parsedJason \ "app")).replaceAll("\"", "")
val json = data
}
class PersonVectorizer(schema: String) extends Vectorizer[DataParse](schema) {
override def vectorize(element: DataParse, batch: VectorizedRowBatch): Unit = {
val eventColVector = batch.cols(0).asInstanceOf[BytesColumnVector]
val timeColVector = batch.cols(1).asInstanceOf[LongColumnVector]
val sourceIdColVector = batch.cols(2).asInstanceOf[BytesColumnVector]
val appColVector = batch.cols(3).asInstanceOf[BytesColumnVector]
val jsonColVector = batch.cols(4).asInstanceOf[BytesColumnVector]
timeColVector.vector(batch.size + 1) = element.timestamp
eventColVector.setVal(batch.size + 1, element.event.getBytes(StandardCharsets.UTF_8))
sourceIdColVector.setVal(batch.size + 1, element.source_id.getBytes(StandardCharsets.UTF_8))
appColVector.setVal(batch.size + 1, element.app.getBytes(StandardCharsets.UTF_8))
jsonColVector.setVal(batch.size + 1, element.json.getBytes(StandardCharsets.UTF_8))
}
}

With bulk formats (such as ORC), the StreamingFileSink rolls over to new files with every checkpoint. If you reduce the checkpointing interval (currently 5 seconds), it won't write so many files.

Related

Is there any way to zip two or more streams with order of event time in Flink?

Suppose we have a stream of data with this format:
example of input data stream:
case class InputElement(key:String,objectType:String,value:Boolean)
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
val inputStream:DataSet[InputElement] = env.fromElements(
InputElement("k1","t1",true)
,InputElement("k2","t1",true)
,InputElement("k2","t2",true)
,InputElement("k1","t2",false)
,InputElement("k1","t2",true)
,InputElement("k1","t1",false)
,InputElement("k2","t2",false)
)
it is semantically equal to have these streams:
val inputStream_k1_t1 = env.fromElements(
InputElement("k1","t1",true),
InputElement("k1","t1",false)
)
val inputStream_k1_t2 = env.fromElements(
InputElement("k1","t2",false),
,InputElement("k1","t2",true)
)
val inputStream_k2_t1 = env.fromElements(
InputElement("k2","t1",true)
)
val inputStream_k2_t2 = env.fromElements(
InputElement("k2","t2",true),
InputElement("k2","t2",false)
)
I want to have an output type like this:
case class OutputElement(key:String,values:Map[String,Boolean])
expected output data stream for the example input data:
val expectedOutputStream = env.fromElements(
OutputElement("k1",Map( "t1"->true ,"t2"->false)),
OutputElement("k2",Map("t1"->true,"t2"->true)),
OutputElement("k1",Map("t1"->false,"t2"->true)),
OutputElement("k2",Map("t2"->false))
)
==========================================
edit 1:
after some considerations about the problem the subject of the question changed:
I want to have another input stream that shows which keys are subscribed to which object types:
case class SubscribeRule(strategy:String,patterns:Set[String])
val subscribeStream: DataStream[SubscribeRule] = env.fromElements(
SubscribeRule("s1",Set("p1","p2")),
SubscribeRule("s2",Set("p1","p2"))
)
now I want to have this output:
(the result stream does not emit any thing till all the subscribed objectType are received:
val expectedOutputStream = env.fromElements(
OutputElement("k1",Map( "t1"->true ,"t2"->false)),
OutputElement("k2",Map("t1"->true,"t2"->true)),
OutputElement("k1",Map("t1"->false,"t2"->true)),
// OutputElement("k2",Map("t2"->false)) # this element will emit when a k2-t1 input message recieved
)
App.scala:
import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.api.scala.createTypeInformation
import org.apache.flink.streaming.api.datastream.BroadcastStream
import org.apache.flink.streaming.api.scala.{DataStream, KeyedStream, StreamExecutionEnvironment}
object App {
case class updateStateResult(updatedState:Map[String,List[Boolean]],output:Map[String,Boolean])
case class InputElement(key:String,objectType:String,passed:Boolean)
case class SubscribeRule(strategy:String,patterns:Set[String])
case class OutputElement(key:String,result:Map[String,Boolean])
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
// checkpoint every 10 seconds
val subscribeStream: DataStream[SubscribeRule] = env.fromElements(
SubscribeRule("s1",Set("p1","p2")),
SubscribeRule("s2",Set("p1","p2"))
)
val broadcastStateDescriptor =
new MapStateDescriptor[String, Set[String]]("subscribes", classOf[String], classOf[Set[String]])
val subscribeStreamBroadcast: BroadcastStream[SubscribeRule] =
subscribeStream.broadcast(broadcastStateDescriptor)
val inputStream = env.fromElements(
InputElement("s1","p1",true),
InputElement("s1","p2",true),
InputElement("s2","p1",false),
InputElement("s2","p2",true),
InputElement("s2","p2",false),
InputElement("s1","p1",false),
InputElement("s2","p1",true),
InputElement("s1","p2",true),
)
val expected = List(
OutputElement("s1",Map("p2"->true,"p1"->true)),
OutputElement("s2",Map("p2"->true,"p1"->false)),
OutputElement("s2",Map("p2"->false,"p1"->true)),
OutputElement("s1",Map("p2"->true,"p1"->false))
)
val keyedInputStream: KeyedStream[InputElement, String] = inputStream.keyBy(_.key)
val result = keyedInputStream
.connect(subscribeStreamBroadcast)
.process(new ZippingFunc())
result.print
env.execute("test stream")
}
}
ZippingFunc.scala
import App.{InputElement, OutputElement, SubscribeRule, updateStateResult}
import org.apache.flink.api.common.state.{ MapState, MapStateDescriptor, ReadOnlyBroadcastState}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction
import org.apache.flink.util.Collector
import java.util.{Map => JavaMap}
import scala.collection.JavaConverters.{iterableAsScalaIterableConverter, mapAsJavaMapConverter}
class ZippingFunc extends KeyedBroadcastProcessFunction[String, InputElement,SubscribeRule , OutputElement] {
private var localState: MapState[String,List[Boolean]] = _
private lazy val broadcastStateDesc =
new MapStateDescriptor[String, Set[String]]("subscribes", classOf[String], classOf[Set[String]])
override def open(parameters: Configuration) {
val localStateDesc: MapStateDescriptor[String,List[Boolean]] =
new MapStateDescriptor[String, List[Boolean]]("sourceMap1", classOf[String], classOf[List[Boolean]])
localState = getRuntimeContext.getMapState(localStateDesc)
}
def updateVar(objectType:String,value:Boolean): Option[Map[String, Boolean]] ={
val values = localState.get(objectType)
localState.put(objectType, value::values)
pickOutputs(localState.entries().asScala).map((ur: updateStateResult) => {
localState.putAll(ur.updatedState.asJava)
ur.output
})
}
def pickOutputs(entries: Iterable[JavaMap.Entry[String, List[Boolean]]]): Option[updateStateResult] = {
val mapped: Iterable[Option[(String, Boolean, List[Boolean])]] = entries.map(
(x: JavaMap.Entry[String, List[Boolean]]) => {
val key: String = x.getKey
val value: List[Boolean] = x.getValue
val head: Option[Boolean] = value.headOption
head.map(
h => {
(key, h, value.tail)
}
)
}
)
sequenceOption(mapped).map((x: List[(String, Boolean, List[Boolean])]) => {
updateStateResult(
x.map(y => (y._1, y._3)).toMap,
x.map(y => (y._1, y._2)).toMap
)
}
)
}
def sequenceOption[A](l:Iterable[Option[A]]): Option[List[A]] =
{
l.foldLeft[Option[List[A]]](Some(List.empty[A]))(
(acc: Option[List[A]], e: Option[A]) =>{
for {
xs <- acc
x <- e
} yield x :: xs
}
)
}
override def processElement(value: InputElement, ctx: KeyedBroadcastProcessFunction[String, InputElement, SubscribeRule, OutputElement]#ReadOnlyContext, out: Collector[OutputElement]): Unit = {
val bs: ReadOnlyBroadcastState[String, Set[String]] = ctx.getBroadcastState(broadcastStateDesc)
if(bs.contains(value.key)) {
val allPatterns: Set[String] = bs.get(value.key)
allPatterns.map((patternName: String) =>
if (!localState.contains(patternName))
localState.put(patternName, List.empty)
)
updateVar(value.objectType, value.passed)
.map((r: Map[String, Boolean]) =>
out.collect(OutputElement(value.key, r))
)
}
}
// )
override def processBroadcastElement(value: SubscribeRule, ctx: KeyedBroadcastProcessFunction[String, InputElement, SubscribeRule, OutputElement]#Context, out: Collector[OutputElement]): Unit = {
val bs = ctx.getBroadcastState(broadcastStateDesc)
bs.put(value.strategy,value.patterns)
}
}

State cannot be read when using State Processor Api

A Flink Streaming was developed with a filter that does the deduplication based on the id of the event using a key-value state based on RocksDB state backend.
Application Code
env.setStateBackend(new RocksDBStateBackend(checkpoint, true).asInstanceOf[StateBackend])
val stream = env
.addSource(kafkaConsumer)
.keyBy(_.id)
.filter(new Deduplication[Stream]("stream-dedup", Time.days(30))).uid("stream-filter")
Deduplication Code
class Deduplication[T](stateDescriptor: String, time: Time) extends RichFilterFunction[T] {
val ttlConfig: StateTtlConfig = StateTtlConfig
.newBuilder(time)
.setUpdateType(StateTtlConfig.UpdateType.OnReadAndWrite)
.setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
.cleanupFullSnapshot
.build
val deduplicationStateDescriptor = new ValueStateDescriptor[Boolean](stateDescriptor, classOf[Boolean])
deduplicationStateDescriptor.enableTimeToLive(ttlConfig)
lazy val deduplicationState: ValueState[Boolean] = getRuntimeContext.getState(deduplicationStateDescriptor)
override def filter(value: T): Boolean = {
if (deduplicationState.value) {
false
} else {
deduplicationState.update(true)
true
}
}
}
All of this works just fine. My goal with this question is to understand how I can read all the state using state processor api. So I started to write some code based on the documentation available.
Savepoint Reading Code
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
val savepoint = Savepoint
.load(env, savepointPath,new RocksDBStateBackend("file:/tmp/rocksdb", true))
savepoint
.readKeyedState("stream-filter", new DeduplicationStateReader("stream-dedup")).print()
Reader Function Code
class DeduplicationStateReader(stateDescriptor: String) extends KeyedStateReaderFunction[String, String] {
var state: ValueState[Boolean] = _
override def open(parameters: Configuration): Unit = {
val deduplicationStateDescriptor = new ValueStateDescriptor[Boolean](stateDescriptor, classOf[Boolean])
state = getRuntimeContext.getState(deduplicationStateDescriptor)
}
override def readKey(key: String, ctx: KeyedStateReaderFunction.Context, out: Collector[String]): Unit = {
out.collect("IT IS WORKING")
}
}
Whenever I try to read the state, a serialization error appears to me.
Is there anything wrong? Did I misunderstand all of this?

No checkpoint files are created in my simple application

I have the following simple flink application running within IDE, and I do a checkpoint every 5 seconds, and would like to write the checkpoint data into directory file:///d:/applog/out/mycheckpoint/, but after running for a while, i stop the application,but I didn't find anything under the directory file:///d:/applog/out/mycheckpoint/
The code is:
import java.util.Date
import io.github.streamingwithflink.util.DateUtil
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.api.scala._
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.runtime.state.{FunctionInitializationContext, FunctionSnapshotContext}
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
object SourceFunctionExample {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(4)
env.getCheckpointConfig.setCheckpointInterval(5 * 1000)
env.getCheckpointConfig.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
env.setStateBackend(new FsStateBackend("file:///d:/applog/out/mycheckpoint/"))
val numbers: DataStream[Long] = env.addSource(new ReplayableCountSource)
numbers.print()
env.execute()
}
}
class ReplayableCountSource extends SourceFunction[Long] with CheckpointedFunction {
var isRunning: Boolean = true
var cnt: Long = _
var offsetState: ListState[Long] = _
override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
while (isRunning && cnt < Long.MaxValue) {
ctx.getCheckpointLock.synchronized {
// increment cnt
cnt += 1
ctx.collect(cnt)
}
Thread.sleep(200)
}
}
override def cancel(): Unit = isRunning = false
override def snapshotState(snapshotCtx: FunctionSnapshotContext): Unit = {
println("snapshotState is called at " + DateUtil.format(new Date) + s", cnt is ${cnt}")
// remove previous cnt
offsetState.clear()
// add current cnt
offsetState.add(cnt)
}
override def initializeState(initCtx: FunctionInitializationContext): Unit = {
// obtain operator list state to store the current cnt
val desc = new ListStateDescriptor[Long]("offset", classOf[Long])
offsetState = initCtx.getOperatorStateStore.getListState(desc)
// initialize cnt variable from the checkpoint
val it = offsetState.get()
cnt = if (null == it || !it.iterator().hasNext) {
-1L
} else {
it.iterator().next()
}
println("initializeState is called at " + DateUtil.format(new Date) + s", cnt is ${cnt}")
}
}
I tested the application on Windows and Linux and in both cases the checkpoint files were created as expected.
Note that the program keeps running if a checkpoint fails, for example due to some permission errors or invalid path.
Flink logs a WARN message with the exception that caused the checkpoint to fail.

Generate md5 checksum scala js

I am trying to calculate hex md5 checksum at in scala js incrementally. The checksum will be verified at server side once file is transferred.
I tried using spark-md5 scala js web jar dependency:
libraryDependencies ++= Seq("org.webjars.npm" % "spark-md5" % "2.0.2")
jsDependencies += "org.webjars.npm" % "spark-md5" % "2.0.2" / "spark-md5.js"
scala js Code:-
val reader = new FileReader
reader.readAsArrayBuffer(data) // data is javascript blob object
val spark = scala.scalajs.js.Dynamic.global.SparkMD5.ArrayBuffer
reader.onload = (e: Event) => {
spark.prototype.append(e.target)
print("Checksum - > " + spark.end)
}
Error:-
Uncaught TypeError: Cannot read property 'buffer' of undefined
at Object.SparkMD5.ArrayBuffer.append (sampleapp-jsdeps.js:596)
at FileReader. (SampleApp.scala:458)
I tried google but most of the help is available are for javascript, couldn't find anything on how to use this library in scala js.
Sorry If I missed something very obvious, I am new to both javascript & scala js.
From spark-md5 readme, I read:
var spark = new SparkMD5.ArrayBuffer();
spark.append(e.target.result);
var hexHash = spark.end();
The way you translate that in Scala.js is as follows (assuming you want to do it the dynamically typed way):
import scala.scalajs.js
import scala.scalajs.js.typedarray._
import org.scalajs.dom.{FileReader, Event}
val SparkMD5 = js.Dynamic.global.SparkMD5
val spark = js.Dynamic.newInstance(SparkMD5.ArrayBuffer)()
val fileContent = e.target.asInstanceOf[FileReader].result.asInstanceOf[ArrayBuffer]
spark.append(fileContent)
val hexHashDyn = spark.end()
val hexHash = hexHashDyn.asInstanceOf[String]
Integrating that with your code snippet yields:
val reader = new FileReader
reader.readAsArrayBuffer(data) // data is javascript blob object
val SparkMD5 = js.Dynamic.global.SparkMD5
val spark = js.Dynamic.newInstance(SparkMD5)()
reader.onload = (e: Event) => {
val fileContent = e.target.asInstanceOf[FileReader].result.asInstanceOf[ArrayBuffer]
spark.append(fileContent)
print("Checksum - > " + spark.end().asInstanceOf[String])
}
If that's the only use of SparkMD5 in your codebase, you can stop there. If you plan to use it several times, you should probably define a facade type for the APIs you want to use:
import scala.scalajs.js.annotation._
#js.native
object SparkMD5 extends js.Object {
#js.native
class ArrayBuffer() extends js.Object {
def append(chunk: js.typedarray.ArrayBuffer): Unit = js.native
def end(raw: Boolean = false): String = js.native
}
}
which you can then use much more naturally as:
val reader = new FileReader
reader.readAsArrayBuffer(data) // data is javascript blob object
val spark = new SparkMD5.ArrayBuffer()
reader.onload = (e: Event) => {
val fileContent = e.target.asInstanceOf[FileReader].result.asInstanceOf[ArrayBuffer]
spark.append(fileContent)
print("Checksum - > " + spark.end())
}
Disclaimer: not tested. It might need small adaptations here and there.

JPA question (Scala using)

OpenJPA and Scala question.
There are 2 tables in database (h2) - tasks and taskLists.
TaskList contains collection ArrayList.
Each TaskList can contain tasks from another TaskLists
When taskX modified in TaskList1 it have to be modified in TaskList2, if it contains taskX.
How to implement this?
I can add new task to task list and save it to DB, but when I try to add taskX from TaskList1 tasks collection to TaskList2 tasks collection and persist it, taskX is not persisted in db.
Example:
1. Adding task1 to taskList1 and task2 to taskList2 will result:
TaskList1: [task1]
TaskList2: [task2]
2. Adding task2 to taskList1 will result:
TaskList1: [task1]
3. Adding task3 to taskList1 and task4 to taskList2 will result:
TaskList1: [task1, task3]
TaskList2: [task2, task4]
What is wrong?
Task class:
import _root_.javax.persistence._
import org.apache.openjpa.persistence._
#Entity
#Table(name="tasks")
#ManyToMany
class CTask (name: String) {
def this() = this("new task")
#Id
#GeneratedValue(strategy=GenerationType.IDENTITY)
#Column(unique = true, nullable = false)
var id: Int = _
var title: String = name
def getId = id
def getTitle = title
def setTitle(titleToBeSet: String) = title = titleToBeSet
override def toString: String = title
}
TaskList class:
import _root_.javax.persistence._
import _root_.java.util.ArrayList
import org.apache.openjpa.persistence._
#Entity
#Table(name="tasklists")
class CTaskList(titleText: String) {
#Id
#GeneratedValue(strategy=GenerationType.IDENTITY)
#Column(unique = true, nullable = false)
var id: Int = _
#Column(nullable = false)
var title: String = titleText
#ManyToMany
var tasks: ArrayList[CTask] = _;
var tags: String = ""
var rank: Int = 0
def getTasks = tasks
def this() = this("new task list")
def createTasks: ArrayList[CTask] = {
tasks = new ArrayList[CTask]
tasks
}
def getID = id
def getTags = tags
def setTags(tagsParam: String) = tags = tagsParam
def getRank = rank
def setRank(rankParam: Int) = rank = rankParam
def getTitle: String = title
def getTaskById(index: Int): CTask = {
null
}
def equals(anotherTaskList: CTaskList): Boolean = {
if (anotherTaskList == null) return false
if (anotherTaskList.getTasks == null)
return id == anotherTaskList.getID && title == anotherTaskList.getTitle && tags == anotherTaskList.getTags && rank == anotherTaskList.getRank && tasks == null;
else
return id == anotherTaskList.getID && title == anotherTaskList.getTitle && tags == anotherTaskList.getTags && rank == anotherTaskList.getRank && tasks != null && tasks.size == anotherTaskList.getTasks.size
}
def addTask(taskToBeAdded: CTask): Int = {
if (tasks == null)
tasks = new ArrayList[CTask]
tasks.add(taskToBeAdded)
tasks.indexOf(taskToBeAdded)}
def printTasks = println("TaskList's " + this + " tasks: " + tasks)
}
As far as I know, Many-To-Many association (as specified in your mapping annotations) implies association table between two entities (i.e. #JoinTable annotation should be specified).

Resources