I override afterAll in a base class.
I want afterAll to run once, after all test cases have finished. But since I enabled parallel execution, each test case calls afterAll, so the folder is deleted too early every time.
I have to use parallel execution, though.
Code:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

abstract class TestBaseWithFunSuite extends FunSuite with BeforeAndAfterAll {
  // Recursively deletes the given folder if it exists and is a directory.
  def cleanFolder(sparkPath: String): Unit = {
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val srcPath = new Path(sparkPath)
    if (fs.exists(srcPath) && fs.getFileStatus(srcPath).isDirectory)
      fs.delete(srcPath, true)
  }

  protected def csvOptions: Map[String, String] = Map("header" -> "true", "delimiter" -> ",")
  protected def sparkPath: String = s"${folderPath}/target/com.microsoft.mpdp/pipeline/spark-warehouse"
  protected def streamName: String = TEST_DATASOURCE
  protected def csvPath: String = sparkPath

  // Runs once per suite that extends this base class.
  override def afterAll(): Unit = {
    cleanFolder(sparkPath)
  }
}
Because every test case calls this method, some tests can't load files from this directory:
java.io.FileNotFoundException: File file:/D:/Marketplace.DataPlatform/src/pipeline/target/com.microsoft.mpdp/pipeline/spark-warehouse/test/watermark/part-00000-a65b3b2e-b6cf-4415-a65a-79a7549a4d71-c000.csv does not exist
I am using Flink 1.12 and I have a keyed stream. In my code it looks as though A and B share the same watermark, so B is treated as late because A's arrival has already advanced the watermark to 2020-08-30 10:50:11. Is that right?
The output is A(2020-08-30 10:50:08, 2020-08-30 10:50:16):2020-08-30 10:50:15; there is no output for B.
Is it possible to give different keys independent watermarks, so that A's watermark and B's watermark advance independently?
The application code is:
import java.text.SimpleDateFormat
import java.util.Date
import java.util.concurrent.TimeUnit
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
object DemoDiscardLateEvent4_KeyStream {
  def to_milli(str: String) =
    new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(str).getTime

  def to_char(milli: Long) = {
    val date = if (milli <= 0) new Date(0) else new Date(milli)
    new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(date)
  }

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    val data = Seq(
      ("A", "2020-08-30 10:50:15"),
      ("B", "2020-08-30 10:50:07")
    )
    env.fromCollection(data).setParallelism(1)
      .assignTimestampsAndWatermarks(new AssignerWithPunctuatedWatermarks[(String, String)]() {
        var maxSeen = Long.MinValue

        override def checkAndGetNextWatermark(lastElement: (String, String), extractedTimestamp: Long): Watermark = {
          val eventTime = to_milli(lastElement._2)
          if (eventTime > maxSeen) {
            maxSeen = eventTime
          }
          // Allow 4 seconds late
          new Watermark(maxSeen - 4000)
        }

        override def extractTimestamp(element: (String, String), previousElementTimestamp: Long): Long =
          to_milli(element._2)
      })
      .keyBy(_._1)
      .window(TumblingEventTimeWindows.of(Time.of(8, TimeUnit.SECONDS)))
      .apply(new WindowFunction[(String, String), String, String, TimeWindow] {
        override def apply(key: String, window: TimeWindow, input: Iterable[(String, String)], out: Collector[String]): Unit = {
          val start = to_char(window.getStart)
          val end = to_char(window.getEnd)
          val sb = new StringBuilder
          // The start and end of the window
          sb.append(s"$key($start, $end):")
          // The content of the window
          input.foreach {
            e => sb.append(e._2 + ",")
          }
          out.collect(sb.toString().substring(0, sb.length - 1))
        }
      })
      .print()
    env.execute()
  }
}
While it would sometimes be helpful if Flink offered per-key watermarking, it does not.
Each parallel instance of your WatermarkStrategy (or in this case, of your AssignerWithPunctuatedWatermarks) is generating watermarks independently, based on the timestamps of the events it observes (regardless of their keys).
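To make the arithmetic behind the missing B output concrete, here is a small plain-Java sketch (no Flink dependency). The class and method names are hypothetical. It assumes the timestamps are read as UTC (the question's code uses the JVM default time zone), that the 8-second windows are aligned to the epoch, and that with no allowed lateness a window counts as closed once the watermark reaches its last millisecond.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: reproduces the lateness arithmetic, not Flink itself.
final class SharedWatermarkDemo {
    static final long WINDOW_MS = 8_000L;        // tumbling window size from the question
    static final long ALLOWED_DELAY_MS = 4_000L; // "allow 4 seconds late"
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Assumption: timestamps are interpreted as UTC.
    static long toMillis(String s) {
        return LocalDateTime.parse(s, FMT).toEpochSecond(ZoneOffset.UTC) * 1000L;
    }

    // End (exclusive) of the epoch-aligned tumbling window containing ts.
    static long windowEnd(long ts) {
        return ts - Math.floorMod(ts, WINDOW_MS) + WINDOW_MS;
    }

    // An element is late if its window's last millisecond is already covered by the watermark.
    static boolean isLate(long ts, long watermark) {
        return windowEnd(ts) - 1 <= watermark;
    }

    public static void main(String[] args) {
        long a = toMillis("2020-08-30 10:50:15");
        long b = toMillis("2020-08-30 10:50:07");
        // The single assigner instance sees A first, so the shared watermark becomes 10:50:11.
        long watermark = a - ALLOWED_DELAY_MS;
        System.out.println("A late? " + isLate(a, watermark));
        System.out.println("B late? " + isLate(b, watermark));
    }
}
```

B falls into the window ending at 10:50:08, which the watermark of 10:50:11 has already passed, so under these assumptions B is discarded even though it is the first element for its key.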
One way to work around the lack of this feature is to not use watermarks at all. For example, if you would be using per-key watermarks to trigger keyed event-time windows, you can instead implement your own windows using a KeyedProcessFunction, and instead of using watermarks to trigger event time timers, keep track of the largest timestamp seen so far for each key, and whenever updating that value, determine if you now want to close one or more windows for that key.
See one of the Flink training lessons for an example of how to implement keyed tumbling windows with a KeyedProcessFunction. This example depends on watermarks but should help you get started.
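The per-key bookkeeping described above can be sketched in plain Java, without any Flink dependency. This is only a simulation of the idea: the class PerKeyTumblingWindows and its method onEvent are hypothetical names, and the two maps stand in for what would be keyed state inside a KeyedProcessFunction.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of per-key tumbling windows driven by per-key progress.
final class PerKeyTumblingWindows {
    private final long windowSizeMs;
    private final long maxOutOfOrdernessMs;
    // Largest timestamp seen so far for each key.
    private final Map<String, Long> maxSeenPerKey = new HashMap<>();
    // Open windows for each key, sorted by window start.
    private final Map<String, TreeMap<Long, List<String>>> openWindowsPerKey = new HashMap<>();

    PerKeyTumblingWindows(long windowSizeMs, long maxOutOfOrdernessMs) {
        this.windowSizeMs = windowSizeMs;
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    // Processes one event and returns the windows (if any) that this event closed.
    List<String> onEvent(String key, long timestampMs, String value) {
        List<String> closed = new ArrayList<>();
        long maxSeen = maxSeenPerKey.getOrDefault(key, Long.MIN_VALUE);
        TreeMap<Long, List<String>> windows =
                openWindowsPerKey.computeIfAbsent(key, k -> new TreeMap<>());

        long windowStart = timestampMs - Math.floorMod(timestampMs, windowSizeMs);
        long windowEnd = windowStart + windowSizeMs;
        // Lateness is judged only against this key's own progress.
        boolean late = maxSeen != Long.MIN_VALUE
                && windowEnd - 1 <= maxSeen - maxOutOfOrdernessMs;
        if (late) {
            return closed; // drop the event; its window was already emitted
        }
        windows.computeIfAbsent(windowStart, s -> new ArrayList<>()).add(value);

        if (timestampMs > maxSeen) {
            maxSeen = timestampMs;
            maxSeenPerKey.put(key, maxSeen);
        }
        // Close every window of this key that the key's progress has passed.
        long keyProgress = maxSeen - maxOutOfOrdernessMs;
        Iterator<Map.Entry<Long, List<String>>> it = windows.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, List<String>> w = it.next();
            long end = w.getKey() + windowSizeMs;
            if (end - 1 > keyProgress) {
                break; // entries are sorted, so later windows are still open
            }
            closed.add(key + "[" + w.getKey() + ", " + end + "): " + w.getValue());
            it.remove();
        }
        return closed;
    }
}
```

One consequence of this design: a key's window is only closed when a later event for the same key arrives, so a key that goes quiet keeps its last window open. A real implementation would probably need a timer or some other fallback to flush such windows.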
I've been working with Groovy and Grails for a few weeks now.
I've just run into a problem with any file-creation call such as the one below:
void validate(FileToValidate) {
    try {
        DefaultImmutableModuleIdentifierFactory moduleIdentifierFactory = new DefaultImmutableModuleIdentifierFactory()
        def moduleDescriptorConverter = new IvyModuleDescriptorConverter(moduleIdentifierFactory)
        def metadataFactory = new IvyMutableModuleMetadataFactory(moduleIdentifierFactory, null)
        def repository = new DefaultExternalResourceRepository("repo", null, null, null, null, null, null)
        def files = new java.io.File(FileToValidate)
        URI uri = files.toURI()
        def name = new ExternalResourceName(uri)
        def parser = new IvyXmlModuleDescriptorParser(moduleDescriptorConverter, moduleIdentifierFactory, repository.resource(name, true), metadataFactory)
        DescriptorParseContext ivySettings = null //new DisconnectedDescriptorParseContext();
        parser.parseMetaData(ivySettings, FileToValidate, true);
    } catch (MetaDataParseException e) {
        throw new GradleException("Invalid ivy descriptor file $FileToValidate", e);
    }
}
I am getting the error below:
Caused by: groovy.lang.GroovyRuntimeException: Could not find matching constructor for: java.io.File(File)
In my case I only had a path, so I had to create the file myself. This helped:
new java.io.File(FileToValidate.toString())
since toString returns the string representation of this path.
You are passing a File to void validate(FileToValidate) {
Then trying to create a new File out of it here
def files = new java.io.File(FileToValidate)
You don't need to do this... Just use FileToValidate
As a side note, sticking to lower case initial letters for variables and arguments is advised, to avoid confusion, so
void validate(fileToValidate) {
I am new to Flink and doing something very similar to the below link.
Cannot see message while sinking kafka stream and cannot see print message in flink 1.2
I am also trying to add JSONDeserializationSchema() as a deserializer for my Kafka input JSON messages, which have no key.
But I found that JSONDeserializationSchema() is not present.
Please let me know if I am doing anything wrong.
JSONDeserializationSchema was removed in Flink 1.8, after having been deprecated earlier.
The recommended approach is to write a deserializer that implements DeserializationSchema<T>. Here's an example, which I've copied from the Flink Operations Playground:
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
/**
 * A Kafka {@link DeserializationSchema} to deserialize {@link ClickEvent}s from JSON.
 */
public class ClickEventDeserializationSchema implements DeserializationSchema<ClickEvent> {

    private static final long serialVersionUID = 1L;

    private static final ObjectMapper objectMapper = new ObjectMapper();

    @Override
    public ClickEvent deserialize(byte[] message) throws IOException {
        return objectMapper.readValue(message, ClickEvent.class);
    }

    @Override
    public boolean isEndOfStream(ClickEvent nextElement) {
        return false;
    }

    @Override
    public TypeInformation<ClickEvent> getProducedType() {
        return TypeInformation.of(ClickEvent.class);
    }
}
For a Kafka producer you'll want to implement KafkaSerializationSchema<T>, and you'll find examples of that in that same project.
To solve the problem of reading JSON messages without a key from Kafka, I used a case class and a JSON parser.
The following code defines a case class and parses the JSON fields using the Play JSON API.
import java.util.Properties
import scala.util.Try
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import play.api.libs.json.{JsValue, Json}

object CustomerModel {
  def readElement(jsonElement: JsValue): Customer = {
    val id = (jsonElement \ "id").get.toString().toInt
    val name = (jsonElement \ "name").get.toString()
    Customer(id, name)
  }
  case class Customer(id: Int, name: String)
}

import CustomerModel.Customer

def main(args: Array[String]): Unit = {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  val properties = new Properties()
  properties.setProperty("bootstrap.servers", "xxx.xxx.0.114:9092")
  properties.setProperty("group.id", "test-grp")
  val consumer = new FlinkKafkaConsumer[String]("customer", new SimpleStringSchema(), properties)
  val stream1 = env.addSource(consumer).rebalance
  val stream2: DataStream[Customer] = stream1.map { str =>
    // Parse once; on failure, carry the Failure(...) text in the name field.
    val parsed = Try(CustomerModel.readElement(Json.parse(str)))
    parsed.getOrElse(Customer(0, parsed.toString))
  }
  stream2.print("stream2")
  env.execute("This is Kafka+Flink")
}
Wrapping the parse in Try catches any exception thrown while parsing the data.
The exception can then be returned in one of the fields (if we want), or the code can simply return a case class object with given or default field values.
Sample output of the code:
stream2:1> Customer(1,"Thanh")
stream2:1> Customer(5,"Huy")
stream2:3> Customer(0,Failure(com.fasterxml.jackson.databind.JsonMappingException: No content to map due to end-of-input
at [Source: ; line: 1, column: 0]))
I am not sure if it is the best approach but it is working for me as of now.
We have a Flink job that performs an intervalJoin on two streams, both of which consume events from Kafka. Here is the example code:
val articleEventStream: DataStream[ArticleEvent] = env.addSource(articleEventSource)
  .assignTimestampsAndWatermarks(new ArticleEventAssigner)
val feedbackEventStream: DataStream[FeedbackEvent] = env.addSource(feedbackEventSource)
  .assignTimestampsAndWatermarks(new FeedbackEventAssigner)

articleEventStream
  .keyBy(article => article.id)
  .intervalJoin(feedbackEventStream.keyBy(feedback => feedback.article.id))
  .between(Time.seconds(-5), Time.seconds(10))
  .process(new ProcessJoinFunction[ArticleEvent, FeedbackEvent, String] {
    override def processElement(left: ArticleEvent, right: FeedbackEvent, ctx: ProcessJoinFunction[ArticleEvent, FeedbackEvent, String]#Context, out: Collector[String]): Unit = {
      out.collect(left.name + " got feedback: " + right.feedback)
    }
  })

class ArticleEventAssigner extends AssignerWithPunctuatedWatermarks[ArticleEvent] {
  val bound: Long = 5 * 1000

  override def checkAndGetNextWatermark(lastElement: ArticleEvent, extractedTimestamp: Long): Watermark = {
    new Watermark(extractedTimestamp - bound)
  }

  override def extractTimestamp(element: ArticleEvent, previousElementTimestamp: Long): Long = {
    element.occurredAt
  }
}

class FeedbackEventAssigner extends AssignerWithPunctuatedWatermarks[FeedbackEvent] {
  val bound: Long = 5 * 1000

  override def checkAndGetNextWatermark(lastElement: FeedbackEvent, extractedTimestamp: Long): Watermark = {
    new Watermark(extractedTimestamp - bound)
  }

  override def extractTimestamp(element: FeedbackEvent, previousElementTimestamp: Long): Long = {
    element.occurredAt
  }
}
However, we do not see any joined output. We checked that each stream does continuously emit elements with timestamp and proper watermark. Does anyone have any hint what could be possible reasons?
After checking different parts (timestamps/watermarks, triggers), I noticed my mistake: the interval I used,
between(Time.seconds(-5), Time.seconds(10))
is simply too small, so no pair of elements from the two streams fell within it. This might sound obvious, but since I am new to Flink, I did not know where to look.
So my lesson is: if the join produces no output, check whether the interval bounds are wide enough.
And thanks all for the comments!
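For anyone checking their own bounds, the join condition can be tried out in isolation. The sketch below is plain Java with hypothetical names and made-up timestamps. It assumes the interval join pairs a left and a right element when the right timestamp lies in [left + lowerBound, left + upperBound], with both ends inclusive.

```java
// Hypothetical helper: evaluates the interval-join condition for two timestamps.
final class IntervalJoinCheck {
    // True if rightTs lies within [leftTs + lowerBoundMs, leftTs + upperBoundMs].
    static boolean joins(long leftTs, long rightTs, long lowerBoundMs, long upperBoundMs) {
        return rightTs >= leftTs + lowerBoundMs && rightTs <= leftTs + upperBoundMs;
    }

    public static void main(String[] args) {
        long articleTs = 100_000L;          // made-up article event time
        long feedbackTs = 160_000L;         // made-up feedback, one minute later
        // With between(-5 s, 10 s) the feedback is outside the interval.
        System.out.println(joins(articleTs, feedbackTs, -5_000L, 10_000L));
        // Widening the upper bound to 60 s lets the pair join.
        System.out.println(joins(articleTs, feedbackTs, -5_000L, 60_000L));
    }
}
```

Running a few real timestamp pairs from both topics through such a check is a quick way to see whether the chosen bounds could ever match.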
I am implementing a SourceFunction, which reads Data from a Database.
The job should be able to resume after being stopped or after a crash (i.e. using savepoints and checkpoints), with the data being processed exactly once.
What I have so far:
@SerialVersionUID(1L)
class JDBCSource(private val waitTimeMs: Long) extends
  RichParallelSourceFunction[Event] with StoppableFunction with LazyLogging {

  @transient var client: PostGreClient = _
  @volatile var isRunning: Boolean = true
  val DEFAULT_WAIT_TIME_MS = 1000

  def this(clientConfig: Serializable) =
    this(clientConfig, DEFAULT_WAIT_TIME_MS)

  override def stop(): Unit = {
    this.isRunning = false
  }

  override def open(parameters: Configuration): Unit = {
    super.open(parameters)
    client = new JDBCClient
  }

  override def run(ctx: SourceFunction.SourceContext[Event]): Unit = {
    while (isRunning) {
      val statement = client.getConnection.createStatement()
      val resultSet = statement.executeQuery("SELECT name, timestamp FROM MYTABLE")
      while (resultSet.next()) {
        val name: String = resultSet.getString("name")
        val timestamp: Long = resultSet.getLong("timestamp")
        ctx.collectWithTimestamp(new Event(name, timestamp), timestamp)
      }
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}
How can I make sure to only get the rows of the database which aren't processed yet?
I assumed the ctx variable would have some information about the current watermark so that I could change my query to something like:
select name, timestamp from myTable where timestamp > ctx.getCurrentWaterMark
But it doesn't have any methods that are relevant for me. Any ideas on how to solve this problem would be appreciated.
You have to implement CheckpointedFunction so that you can manage checkpointing yourself. The documentation of the interface is pretty comprehensive, but if you need more, I advise you to take a look at an existing example.
In essence, your function must implement CheckpointedFunction#snapshotState to store the state you need using Flink's managed state; then, when a restore is performed, it reads that same state back in CheckpointedFunction#initializeState.
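To illustrate the snapshot/restore cycle without pulling in Flink, here is a plain-Java simulation. Everything in it is hypothetical: ResumableSourceSketch, Row and poll are made-up names, and the two methods named snapshotState and initializeState only mimic the roles of the CheckpointedFunction methods (the real ones take context objects and store the value in operator state). It assumes rows carry strictly increasing, unique timestamps; rows sharing a timestamp would need a different offset, such as a primary key.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of a source that remembers how far it has read.
final class ResumableSourceSketch {
    // Stand-in for one database row.
    static final class Row {
        final String name;
        final long timestamp;

        Row(String name, long timestamp) {
            this.name = name;
            this.timestamp = timestamp;
        }
    }

    // The offset that would be kept in Flink's managed operator state.
    private long lastEmittedTimestamp = Long.MIN_VALUE;

    // Emits only rows newer than the offset, like
    // "SELECT name, timestamp FROM MYTABLE WHERE timestamp > ? ORDER BY timestamp".
    List<Row> poll(List<Row> tableSortedByTimestamp) {
        List<Row> emitted = new ArrayList<>();
        for (Row row : tableSortedByTimestamp) {
            if (row.timestamp > lastEmittedTimestamp) {
                emitted.add(row);
                // In a real source, emitting and updating the offset should happen
                // atomically with respect to checkpoints.
                lastEmittedTimestamp = row.timestamp;
            }
        }
        return emitted;
    }

    // Mimics snapshotState: hand the current offset to the checkpoint.
    long snapshotState() {
        return lastEmittedTimestamp;
    }

    // Mimics initializeState: continue from the restored offset.
    void initializeState(long restoredTimestamp) {
        this.lastEmittedTimestamp = restoredTimestamp;
    }
}
```

After a restore, the first poll skips every row at or below the restored offset, which is what keeps already-processed rows from being emitted again. Whether the overall pipeline is exactly-once also depends on the sinks, which this sketch does not cover.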