flink assign uid to window function - apache-flink

Is there a way to assign a uid to a window function (such as apply(ApplyCustomFunction)), as we do for map/flatmap (and other) functions in Flink? The Flink version is 1.13.1.
I would like to illustrate the case with an example:
DataStream<RECORD> outputDataStream = dataStream
    .coGroup(otherDataStream)
    .where(DATA::getKey)
    .equalTo(OTHERDATA::getKey)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(2)))
    .apply(new CoGroupFunction());
Thanks

CoGroupedStreams.WithWindow#apply(CoGroupFunction<T1,T2,T>) doesn't have the return type that's needed for setting a UID or per-operator parallelism (among other things). This was done in order to keep binary backwards compatibility, and can't be fixed before Flink 2.0.
You can work around this by using the (deprecated) with method instead of apply, as in
DataStream<RECORD> outputDataStream = dataStream
    .coGroup(otherDataStream)
    .where(DATA::getKey)
    .equalTo(OTHERDATA::getKey)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(2)))
    .with(new CoGroupFunction())
    .uid("window");
The with method will be removed once it is no longer needed.

Use with() instead of apply(). It will be fixed in version 2.0, as stated in the documentation.

Related

Find max value in a Flink DataStream

I have a DataStream of Tuple2<String, Integer>. I want to find the max of field f1, preferably without doing a keyBy(). Is that possible in Flink?
One "hack" I came up with:
DataStream<Tuple2<String, Integer>> input; // Initialized somewhere
DataStream<Tuple2<String, Integer>> maxEntry =
    input.map(entry -> new Tuple3(entry.f0, entry.f1, "foo"))
        .keyBy(2)
        .maxBy(1)
        .map(entry -> new Tuple2(entry.f0, entry.f1));
Doing the intermediate map() and keyBy() seems wasteful/inefficient to me. Is there a better way?
Thank you,
Ahmed.
You could do this, which is still hacky, but less so
input.keyBy(e -> "foo").maxBy(1)
but keep in mind that
Keying by a constant reduces the effective parallelism to 1 (which is fine in this case, as you need to process every event in the same place to find a global maximum).
KeyedStream#maxBy will be removed from Flink in the future. See https://stackoverflow.com/a/66651834/2000823 for more about that.

Use content of a tuple as a session variable

I extracted from a previous response a list of tuples with the following regex:
.check(regex(""""idSc":(.{1,8}),"pasTemps":."codePasTemps":(.),"""").ofType[(String,String)].findAll.saveAs ("OBJECTS1"))
So I get my object:
OBJECTS1 -> List((1657751,2), (1658105,2), (4557378,2), (1657750,1), (916,1), (917,2), (1658068,1), (1658069,2), (4557379,2), (1658082,1), (4557367,1), (4557368,1), (1660865,2), (1660866,2), (1658122,1), (921,1), (922,2), (923,2), (1660875,1), (1660876,2), (1660877,2), (1658300,1), (1658301,1), (1658302,1), (1658309,1), (1658310,1), (2996562,1), (4638455,1))
After that I did a foreach and need to extract every couple to add them to the next requests. So we tried:
.foreach("${OBJECTS1}", "couple") {
  exec(http("request_foreach47")
    .get("/ctr/web/api/seriegraph/bydates/${couple(0)}/${couple(1)}/1552863600000/1554191743799")
    .headers(headers_27))
}
But I get the message: named 'couple' does not support index access
I also thought that using 2 regexes on the couple to extract both parts could work, but I haven't found any way to use a regex on a session variable. (Even if it's not needed for this case, I'm really interested to learn how, as it could be useful.)
I would be really thankful if you could provide me some help. (I'm using Gatling 2 but can't use a more recent version, as it's for work and other scripts have been developed with Gatling 2.)
each "couple" is a scala tuple which can't be indexed into like a collection. Fortunately the gatling EL has a function that handles tuples.
so instead of
.get("/ctr/web/api/seriegraph/bydates/${couple(0)}/${couple(1)}/1552863600000/1554191743799")
you can use
.get("/ctr/web/api/seriegraph/bydates/${couple._1}/${couple._2}/1552863600000/1554191743799")

Parsing data from Kafka in Apache Flink

I am just getting started with Apache Flink (Scala API); my issue is the following:
I am trying to stream data from Kafka into Apache Flink based on one example from the Flink site:
val stream =
  env.addSource(new FlinkKafkaConsumer09("testing", new SimpleStringSchema(), properties))
Everything works correctly; the stream.print() statement displays the following on the screen:
2018-05-16 10:22:44 AM|1|11|-71.16|40.27
I would like to use a case class to load the data. I've tried using
flatMap(p=>p.split("|"))
but it's only splitting the data one character at a time.
Basically, the expected result is to populate the 5 fields of the case class as follows:
field(0)=2018-05-16 10:22:44 AM
field(1)=1
field(2)=11
field(3)=-71.16
field(4)=40.27
but it's now doing:
field(0) = 2
field(1) = 0
field(2) = 1
field(3) = 8
etc...
Any advice would be greatly appreciated.
Thank you in advance
Frank
The problem is the usage of String.split. If you call it with a String, then the method expects it to be a regular expression. Thus, p.split("\\|") would be the correct regular expression for your input data. Alternatively, you can also call the split variant where you specify the separating character p.split('|'). Both solutions should give you the desired result.
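To make that concrete, here is a minimal sketch of mapping each record into a case class. The case class and its field names are placeholders inferred from the sample record, not something defined in the original post, and stream is the DataStream[String] from the Kafka source above.

import org.apache.flink.streaming.api.scala._ // provides the implicit TypeInformation for the case class

// Placeholder case class; field names and types are guesses based on the sample record
case class Reading(timestamp: String, id: Int, count: Int, lon: Double, lat: Double)

val readings: DataStream[Reading] = stream.map { line =>
  val fields = line.split('|') // Char overload, so no regex escaping is needed
  Reading(fields(0), fields(1).toInt, fields(2).toInt, fields(3).toDouble, fields(4).toDouble)
}

readings.print() should then show the parsed fields instead of single characters.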

Indexing PDF documents with additional search fields using SolrNet?

I found this article useful when indexing documents. However, how can I attach additional fields so I can pass in, say, the ID of the document in our database for use in displaying the search results? I thought that by using the Fields property (of the ExtractParameters class) I could index additional data with the document, but that doesn't seem to work, or that is not its function.
Example code:
var solr = ObjectLocator.Instance.Resolve<ISolrOperations<IndexDocument>>();
var guid = Guid.NewGuid().ToString();
using (var fileStream = System.IO.File.OpenRead(Server.MapPath("~/files/") + "greenroof.pdf"))
{
    var response =
        solr.Extract(
            new ExtractParameters(fileStream, "greenRoof1234")
            {
                ExtractFormat = ExtractFormat.Text,
                ExtractOnly = false,
                Fields = new[] { new ExtractField("field1", "value1"), new ExtractField("field2", "value2") }
            });
}
@aitchnyu is correct; passing the values via the literal.field=value method is the correct way to do this.
However, according to this post on ExtractingRequestHandler support in the SolrNet Google Group, there was a bug with the ExtractParameters.Fields not working properly. This was fixed in the 0.4.0.X versions of SolrNet. Please make sure you are using one of the latest versions of SolrNet. You can obtain that by one of the following means:
Project Site Downloads
NuGet PreRelease Package
Also that discussion has some good examples of using the ExtractingRequestHandler in SolrNet as well as a workaround for adding the additional field values if you cannot upgrade to a newer version of SolrNet.
This is sufficient: http://wiki.apache.org/solr/ExtractingRequestHandler#Literals .
In general, use literal.field=value while uploading.
It turned out not to be an issue with SolrNet, but with my knowledge of Solr in general. I needed to specify the fields in my schema. After I added the fields to my schema, they were visible in my Solr query.

How to access NHibernate.IQueryOver<T,T> within ActiveRecord?

I use DetachedCriteria primarily; it has a static method For to construct an instance. But it seems IQueryOver is preferable, and I want to use it.
The normal way to get an instance of IQueryOver is ISession.QueryOver, and I want to get it with ActiveRecord gracefully. Does anyone know the way? Thanks.
First, QueryOver.Of<T> returns an instance of QueryOver<T, T>; then you build conditions with the QueryOver API.
After that, query.DetachedCriteria returns the equivalent DetachedCriteria, which can be used with ActiveRecord gracefully.
var query = QueryOver.Of<PaidProduct>()
    .Where(paid =>
        paid.Account.OrderNumber == orderNumber
        && paid.ProductDelivery.Product == product)
    .OrderBy(paid => paid.ProductDelivery.DeliveredDate).Desc;
return ActiveRecordMediator<PaidProduct>.FindAll(query.DetachedCriteria);
As far as I know, there is no direct support for QueryOver. I encourage you to create an item in the issue tracker, then fork the repository and implement it. I'd start by looking at the implementation of ActiveRecordLinqBase; it should be similar. But instead of a separate class, you could just implement this in ActiveRecordBase, then wrap it in ActiveRecordMediator so that it can also be used in classes that don't inherit ActiveRecordBase.
