Don't understand interval join in Flink - apache-flink

From Flink's official doc:
https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/operators/joining.html#interval-join
The example code is:
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector
...
val orangeStream: DataStream[Integer] = ...
val greenStream: DataStream[Integer] = ...

orangeStream
  .keyBy(elem => /* select key */)
  .intervalJoin(greenStream.keyBy(elem => /* select key */))
  .between(Time.milliseconds(-2), Time.milliseconds(1))
  .process(new ProcessJoinFunction[Integer, Integer, String] {
    override def processElement(left: Integer, right: Integer, ctx: ProcessJoinFunction[Integer, Integer, String]#Context, out: Collector[String]): Unit = {
      out.collect(left + "," + right)
    }
  })
From the above code, I would like to know how to specify a starting time (e.g., the beginning of today) from which to perform this interval join, so that data before the starting time is not taken into account.
For example, if I have been running the program for 3 days, I don't want to perform the join over all 3 days of data;
I just want to perform the join over the data generated today.

I don't think it works the way you think it does.
The interval is calculated relative to the actual timestamps of the elements of orangeStream in this case, so you are not specifying a range of data to take into account; rather, it behaves like a window that determines which elements of greenStream will be joined with a given element of the orange stream.
So, for the bounds described above, if you have an orange element with timestamp 5, it will be joined with green elements that have timestamps from 3 to 6.
I really don't think you can use it to join only part of the data. The only thing I can think of is to filter both streams on their timestamps beforehand and drop all elements that were generated before your starting time.
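A minimal sketch of that filtering approach (not from the Flink docs): it assumes a hypothetical Event type whose elements carry their own event-time field in epoch milliseconds, and it derives the start of "today" from the system default time zone.
import java.time.{LocalDate, ZoneId}
import org.apache.flink.streaming.api.scala._

// Hypothetical element type; replace with however your real elements expose their timestamp.
case class Event(key: Int, eventTime: Long, value: Int)

// Start of "today" in epoch milliseconds (assumes the system default time zone).
val startOfToday: Long =
  LocalDate.now().atStartOfDay(ZoneId.systemDefault()).toInstant.toEpochMilli

// Drop everything generated before today; apply to both streams before keying and joining.
def todayOnly(stream: DataStream[Event]): DataStream[Event] =
  stream.filter(_.eventTime >= startOfToday)

// todayOnly(orangeStream)
//   .keyBy(_.key)
//   .intervalJoin(todayOnly(greenStream).keyBy(_.key))
//   .between(Time.milliseconds(-2), Time.milliseconds(1))
//   .process(...)
Note that this only trims the input; the between() bounds still control which of the remaining elements get paired with each other.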

Related

Django query based on greatest date

I want to know how efficiently this filter can be done with Django queries. Essentially I have the following two classes:
class Act(models.Model):
    Date = models.DateTimeField()
    Doc = models.ForeignKey(Doc)
    ...

class Doc(models.Model):
    ...
so one Doc can have several Acts, and for each Doc I want to get the Act with the greatest Date. I'm only interested in Act objects.
For example, if I have
act1 = (Date=2021-01-01, Doc=doc1)
act2 = (Date=2021-01-02, Doc=doc1)
act3 = (Date=2021-01-03, Doc=doc2)
act4 = (Date=2021-01-04, Doc=doc2)
act5 = (Date=2021-01-05, Doc=doc2)
I want to get [act2, act5] (the Act with Doc=doc1 with the greatest Date and the Act with Doc=doc2 with the greatest Date).
The only solution I have is a for loop over the Docs.
Thank you so much
You can do this with one or two queries: the first query will retrieve the latest Act per Doc, and the second one will then retrieve those Acts:
from django.db.models import OuterRef, Subquery

last_acts = Doc.objects.annotate(
    latest_act=Subquery(
        Act.objects.filter(
            Doc_id=OuterRef('pk')
        ).values('pk').order_by('-Date')[:1]
    )
).values('latest_act')
and then we can retrieve the corresponding Acts:
Act.objects.filter(pk__in=last_acts)
depending on the database, it might be more efficient to first retrieve the primary keys, and then make an extra query:
Act.objects.filter(pk__in=list(last_acts))

Neo4j Create new label based on other labels with certain properties

I need to create Trajectories based on Points.
A Trajectory can contain any number of Points that meet certain criteria.
Criteria are: cameraSid, trajectoryId, classType and classQual should be equal.
The difference in time (at) between the points must be less than or equal to 1 hour.
In order to create a Trajectory we need at least one point.
In order to associate a new Point with an existing Trajectory, the latest Point already associated with the trajectory must be no more than 1 hour older than the new point.
If the new Point has the exact same properties but the time gap is more than 1 hour, then a new Trajectory needs to be created.
I've been reading a lot, but I cannot make this work as it should.
This is what I have tried so far:
MATCH (p:Point)
WHERE NOT (:Trajectory)-[:CONTAINS]->(p)
WITH p.cameraSid AS cameraSid, p.trajectoryId AS trajectoryId, p.classType AS classType, p.classQual AS classQual, COLLECT(p) AS points
UNWIND points AS point
MERGE (trajectory:Trajectory{trajectoryId:point.trajectoryId, cameraSid: point.cameraSid, classType: point.classType, classQual: point.classQual, date: date(datetime(point.at))})
MERGE (trajectory)-[:CONTAINS{at:point.at}]->(point)
I have no idea how to create this sort of condition (1hr or less) in the MERGE clause.
Here are the neo4j queries to create some data
// Create points
LOAD CSV FROM 'https://uca54485eb4c5d2a6869053af475.dl.dropboxusercontent.com/cd/0/get/AmR2pn0hC0c-CQW_mSS-TDqHQyi7MNVjPvqffQHhSIyMP37D7UMtfODdHDkNWi6-HqzQdp4ob2Q3326g6imEd26F3sdNJyJuAeNa8wJA2o_E6A/file?dl=1#' AS line
CREATE (:Point{trajectoryId: line[0],at: line[1],cameraSid: line[2],activity: line[3],x: line[4],atEpochMilli: line[5],y: line[6],control: line[7],classQual: line[8],classType: line[9],uniqueIdentifier: line[10]})
// Create Trajectory based on Points
MATCH (p:Point)
WHERE NOT (:Trajectory)-[:CONTAINS]->(p)
WITH p.cameraSid AS cameraSid, p.trajectoryId AS trajectoryId, p.classType AS classType, p.classQual AS classQual, COLLECT(p) AS points
UNWIND points AS point
MERGE (trajectory:Trajectory{trajectoryId:point.trajectoryId, cameraSid: point.cameraSid, classType: point.classType, classQual: point.classQual, date: date(datetime(point.at))})
MERGE (trajectory)-[:CONTAINS{at:point.at}]->(point)
If the link to the CSV file does not work, here is an alternative; in that case, you will have to download the file and then import it locally from your Neo4j instance.
I think this is one of those "just because you can do it in one Cypher statement doesn't mean you should" situations, and you will almost certainly find this easier to do in application code.
Regardless, it can be done using APOC and by introducing an instanceId unique property on your Trajectory nodes.
Possible solution
This almost certainly won't scale, and you'll want indexes (discussed later based on educated guesswork).
First we need to change your import script to make sure that the at property is a datetime and not just a string (otherwise we end up peppering the queries with datetime() calls):
LOAD CSV FROM 'file:///export.csv' AS line
CREATE (:Point{trajectoryId: line[0], at: datetime(line[1]), cameraSid: line[2], activity: line[3],x: line[4], atEpochMilli: line[5], y: line[6], control: line[7], classQual: line[8], classType: line[9], uniqueIdentifier: line[10]})
The following then appears to take your sample data set and add Trajectories per your requirements (and can be run whenever new Points are added).
CALL apoc.periodic.iterate(
'
MATCH (p: Point)
WHERE NOT (:Trajectory)-[:CONTAINS]->(p)
RETURN p
ORDER BY p.at
',
'
OPTIONAL MATCH (t: Trajectory { trajectoryId: p.trajectoryId, cameraSid: p.cameraSid, classQual: p.classQual, classType: p.classType })-[:CONTAINS]-(trajPoint:Point)
WITH p, t, max(trajPoint.at) as maxAt, min(trajPoint.at) as minAt
WITH p, max(case when t is not null AND (
(p.at <= datetime(maxAt) + duration({ hours: 1 }))
AND
(p.at >= datetime(minAt) - duration({ hours: 1 }))
)
THEN t.instanceId ELSE NULL END) as instanceId
MERGE (tActual: Trajectory { trajectoryId: p.trajectoryId, cameraSid: p.cameraSid, classQual: p.classQual, classType: p.classType, instanceId: COALESCE(instanceId, randomUUID()) })
ON CREATE SET tActual.date = date(datetime(p.at))
MERGE (tActual)-[:CONTAINS]->(p)
RETURN instanceId
',
{ parallel: false, batchSize: 1 })
Explanation
The problem as posed is tricky because the decision on whether or not to create a new Trajectory or add the point to an existing one depends entirely on how we handled all prior Points. That means two things:
We need to process the Points in order to make sure we create reliable Trajectories - we start with the earliest, and work up
We need each creation or amend of a Trajectory to be immediately visible for the processing of the next Point - that is to say that we need to handle each Point in isolation, as though it were a mini-transaction
We'll use apoc.periodic.iterate with a batchSize of 1 to give us the behaviour we need.
The first parameter builds the set of nodes to be processed - all those Points which aren't currently part of a Trajectory, sorted by their timestamp.
The second parameter to apoc.periodic.iterate is where the magic's happening so let's break that down - given a point p that isn't part of a Trajectory so far:
OPTIONAL MATCH (t: Trajectory { trajectoryId: p.trajectoryId, cameraSid: p.cameraSid, classQual: p.classQual, classType: p.classType })-[:CONTAINS]-(trajPoint:Point)
WITH p, t, max(trajPoint.at) as maxAt, min(trajPoint.at) as minAt
WITH p, max(case when t is not null AND (
(p.at <= datetime(maxAt) + duration({ hours: 1 }))
AND
(p.at >= datetime(minAt) - duration({ hours: 1 }))
)
THEN t.instanceId ELSE NULL END) as instanceId
This finds any Trajectory that matches the key fields and contains a Point that's within an hour of the incoming point p, and picks out its instanceId property if we find a suitable match (or the biggest one we found if there are multiple matches - we just want to ensure there is zero or one row by this point).
We'll see what instanceId is all about in a minute, but consider it a unique identifier for a given Trajectory
MERGE (tActual: Trajectory { trajectoryId: p.trajectoryId, cameraSid: p.cameraSid, classQual: p.classQual, classType: p.classType, instanceId: COALESCE(instanceId, randomUUID()) })
ON CREATE SET tActual.date = date(datetime(p.at))
This ensures that there's a Trajectory that matches the incoming point's key fields - if the previous code found a matching Trajectory then the MERGE has no work to do. Otherwise it creates a new Trajectory - we set the instanceId property to a new random UUID if we didn't match a Trajectory earlier, to compel the MERGE to create a new node (since no other node will exist with that UUID, even if one matches all the other key fields).
MERGE (tActual)-[:CONTAINS]->(p)
RETURN instanceId
tActual is now the Trajectory that the incoming point p should belong to - create the :CONTAINS relationship
The third parameter is vital:
{ parallel: false, batchSize: 1 })
Important: For this to work, each iteration of the 'inner' Cypher statement has to happen exactly in order, so we force a batchSize of 1 and disable parallelism to prevent APOC from scheduling the batches in any way other than one-at-a-time.
Indexing
I think the performance of the above is going to degrade quickly as the size of the import grows and the number of Trajectories increases. At a minimum I think you'll want composite indexes on:
:Trajectory(trajectoryId, cameraSid, classQual, classType) - so that the initial match to find candidate trajectories for a given Point is quick
:Trajectory(trajectoryId, cameraSid, classQual, classType, instanceId) - so that the MERGE at the end finds the existing Trajectory to add to quickly, if one exists
However - that's guesswork from eyeballing the query, and unfortunately you can't see into the query properly to tell what the execution plan is because we're using apoc.periodic.iterate - an EXPLAIN or PROFILE will just tell you that there's one procedure call costing 1 db hit, which is true but not helpful.

SSRS The textrun uses a First aggregate in an outer aggregate (Different datasets)

Ok, I'm working on a multi-data report that merges data from many servers.
dataset1 = One of six datasets with the data I need.
ds_BusinessDays = A calendar table dataset with Specific dates and numbers that change every day/week/month.
I'm trying to use a SWITCH where MonthName(Date) from dataset1 = MonthName(Date2) from ds_BusinessDays. Then Sum the total count.
I have successfully used similar cross dataset calculations like
SUM(SWITCH when Data = "Product" then 1) / SUM(businessdaysinmonth, "ds_BusinessDays")
This was to get the average; it works like a charm.
=SUM(
SWITCH(Fields!Requested_Month.Value = MonthName(Month(First(Fields!PreviousBusinessDate.Value, "ds_BusinessDays")))
,1)
)
All fields in the ds_BusinessDays dataset are single-entry results, for example "PreviousBusinessDay" = "6/21/2019". So I want my code to do something like this:
When MonthName(Date) from dataset1 = MonthName(PreviousBusinessDate) from ds_BusinessDays, then 1. Sum all of that up to get a total for that month.
The problem is that FIRST and SUM are the only functions available to me when referencing fields from another dataset, and they can't be used in an aggregate within an aggregate.
Is there a substitute that I can use for First in First(Fields!PreviousBusinessDate.Value, "ds_BusinessDays")?
Why are you using a SWITCH when you only have a single conditional? I think an IIF would be much easier to read. Also, if ds_BusinessDays.PreviousBusinessDate is a single entry, why do you even need FIRST? There should only be one result. If you're trying to do what I think you're trying to do, this expression should do it.
= SUM(IIF(Fields!Requested_Month.Value = MonthName(Month(Fields!PreviousBusinessDate.Value, "ds_BusinessDays")), 1, 0))
To add additional detail, a SWITCH is best used for more than 2 conditional statements. If you're only trying to compare two fields, you can just use a simple IIF. An example of when SWITCH is necessary:
=SUM(SWITCH(Fields!Date.Value = Fields!Today.Value, 1,
Fields!Date.Value = Fields!Yesterday.Value, 2,
Fields!Date.Value = Fields!Tomorrow.Value, 3,
true, 0))
This expression would check Date against three different fields and return a different count for each, ending with an always true condition that catches everything else. Your expression doesn't need a SWITCH.

Akka Streams: Fan-out operator with custom logic

I am looking for an Akka Streams operator that would allow me to split a stream based on custom logic. The set of messages that I am expecting is known in advance so there is no need for dynamic scaling of downstream consumers.
In the earlier versions of the library - when it was still labeled experimental - there was a FlexiRoute operator. I saw that at some point it accumulated a lot of cruft and was subsequently removed in favor of GraphStage.
Nowadays there are operators like Balance and Partition that come close to what I need. Balance requires me to duplicate logic per consumer. Partition works only for two outputs and I need to have N. I could make it happen with a Partition per message type but that seems hacky.
Is building a custom solution the only way?
Partition is what you need. It works for N outputs, not only 2. See https://doc.akka.io/api/akka/current/akka/stream/scaladsl/Partition.html for the API and https://blog.colinbreck.com/partitioning-akka-streams-to-maximize-throughput/#partition for an example.
A snapshot of the API doc:
new Partition(outputPorts: Int, partitioner: (T) ⇒ Int, eagerCancel: Boolean)
A snapshot of the example:
val flow = Flow.fromGraph(GraphDSL.create() { implicit b =>
  import GraphDSL.Implicits._
  val workerCount = 4
  val partition = b.add(Partition[Int](workerCount, _ % workerCount))
  val merge = b.add(Merge[Int](workerCount))
  for (_ <- 1 to workerCount) {
    partition ~> Flow[Int].map(spin).async ~> merge
  }
  FlowShape(partition.in, merge.out)
})

Source(1 to 1000)
  .via(flow)
  .runWith(Sink.ignore)
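Since the set of message types is known in advance, the partitioner function is exactly where the custom routing logic goes. Below is a minimal sketch (not from the linked docs; the Message hierarchy, port numbering and sinks are made up for illustration, Akka 2.6 style) that routes each known type to its own dedicated consumer.
import akka.actor.ActorSystem
import akka.stream.ClosedShape
import akka.stream.scaladsl.{GraphDSL, Partition, RunnableGraph, Sink, Source}

// Hypothetical message hierarchy, known in advance.
sealed trait Message
case class OrderPlaced(id: Int) extends Message
case class OrderShipped(id: Int) extends Message
case class OrderCancelled(id: Int) extends Message

object PartitionByType extends App {
  implicit val system: ActorSystem = ActorSystem("partition-by-type")

  // Custom logic: map each message type to a fixed output port.
  def portFor(msg: Message): Int = msg match {
    case _: OrderPlaced    => 0
    case _: OrderShipped   => 1
    case _: OrderCancelled => 2
  }

  val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit b =>
    import GraphDSL.Implicits._

    val source    = Source(List[Message](OrderPlaced(1), OrderShipped(1), OrderCancelled(2)))
    val partition = b.add(Partition[Message](3, portFor))

    // One dedicated consumer per known message type.
    source ~> partition.in
    partition.out(0) ~> Sink.foreach[Message](m => println(s"placed: $m"))
    partition.out(1) ~> Sink.foreach[Message](m => println(s"shipped: $m"))
    partition.out(2) ~> Sink.foreach[Message](m => println(s"cancelled: $m"))

    ClosedShape
  })

  graph.run()
}
Adding a new message type is then a compile-time change to the partitioner and the port count, which fits the "known in advance" requirement better than Balance's demand-driven routing.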

What's the best way to count results in GQL?

I figure one way to do a count is like this:
foo = db.GqlQuery("SELECT * FROM bar WHERE baz = 'baz'")
my_count = foo.count()
What I don't like is that my count will be limited to 1000 max and my query will probably be slow. Anyone out there with a workaround? I have one in mind, but it doesn't feel clean. If only GQL had a real COUNT function...
You have to flip your thinking when working with a scalable datastore like GAE and do your calculations up front. In this case that means you need to keep counters for each baz and increment them whenever you add a new bar, instead of counting at display time.
class CategoryCounter(db.Model):
    category = db.StringProperty()
    count = db.IntegerProperty(default=0)
then when creating a Bar object, increment the counter
def createNewBar(category_name):
    bar = Bar(..., baz=category_name)
    counter = CategoryCounter.all().filter('category =', category_name).get()
    if not counter:
        counter = CategoryCounter(category=category_name)
    counter.count += 1  # count the bar we are about to add
    bar.put()
    counter.put()

db.run_in_transaction(createNewBar, 'asdf')
now you have an easy way to get the count for any specific category
CategoryCounter.all().filter('category =', category_name).get().count
+1 to Jehiah's response.
The official and blessed method for keeping object counters on GAE is to build a sharded counter. Despite the heavy-sounding name, this is pretty straightforward.
Count functions in all databases are slow (e.g., O(n)) - the GAE datastore just makes that more obvious. As Jehiah suggests, you need to store the computed count in an entity and refer to that if you want scalability.
This isn't unique to App Engine - other databases just hide it better, up until the point where you're trying to count tens of thousands of records with each request, and your page render time starts to increase exponentially...
According to the GqlQuery.count() documentation, you can set the limit to be some number greater than 1000:
from models import Troll
troll_count = Troll.all(keys_only=True).count(limit=31337)
Sharded counters are the right way to keep track of numbers like this, as folks have said, but if you figure this out late in the game (like me) then you'll need to initialize the counters from an actual count of objects. But this is a great way to burn through your free quota of Datastore Small Operations (50,000 I think). Every time you run the code, it will use up as many ops as there are model objects.
I haven't tried it, and this is an utter resource hog, but perhaps iterating with .fetch() and specifying the offset would work?
LIMIT = 1000

def count(query):
    result = offset = 0
    gql_query = db.GqlQuery(query)
    while True:
        count = gql_query.fetch(LIMIT, offset)
        if count < LIMIT:
            return result
        result += count
        offset += LIMIT
orip's solution works with a little tweaking:
LIMIT = 1000

def count(query):
    result = offset = 0
    gql_query = db.GqlQuery(query)
    while True:
        count = len(gql_query.fetch(LIMIT, offset))
        result += count
        offset += LIMIT
        if count < LIMIT:
            return result
We now have Datastore Statistics that can be used to query entity counts and other data. These values do not always reflect the most recent changes as they are updated once every 24-48 hours. Check out the documentation (see link below) for more details:
Datastore Statistics
As pointed out by @Dimu, the stats computed by Google on a periodic basis are a decent go-to resource when precise counts are not needed and the percentage of records is NOT changing drastically during any given day.
To query the statistics for a given Kind, you can use the following GQL structure:
select * from __Stat_Kind__ where kind_name = 'Person'
There are a number of properties returned by this which are helpful:
count -- the number of Entities of this Kind
bytes -- total size of all Entities stored of this Kind
timestamp -- an as of date/time for when the stats were last computed
Example Code
To answer a follow-up question posted as a comment to my answer, I am now providing some sample C# code that I am using, which admittedly may not be as robust as it should be, but seems to work OK for me:
/// <summary>Returns an *estimated* number of entities of a given kind</summary>
public static long GetEstimatedEntityCount(this DatastoreDb database, string kind)
{
    var query = new GqlQuery
    {
        QueryString = $"select * from __Stat_Kind__ where kind_name = '{kind}'",
        AllowLiterals = true
    };
    var result = database.RunQuery(query);
    return (long) (result?.Entities?[0]?["count"] ?? 0L);
}
The best workaround might seem a little counter-intuitive, but it works great in all my appengine apps. Rather than relying on the integer KEY and count() methods, you add an integer field of your own to the datatype. It might seem wasteful until you actually have more than 1000 records, and you suddenly discover that fetch() and limit() DO NOT WORK PAST THE 1000 RECORD BOUNDARY.
class MyObj(db.Model):
    num = db.IntegerProperty()
When you create a new object, you must manually retrieve the highest key:
max = MyObj.all().order('-num').get()
if max:
    max = max.num + 1
else:
    max = 0
newObj = MyObj(num=max)
newObj.put()
This may seem like a waste of a query, but get() returns a single record off the top of the index. It is very fast.
Then, when you want to fetch past the 1000th object limit, you simply do:
MyObj.all().filter('num > ' , 2345).fetch(67)
I had already done this when I read Aral Balkan's scathing review: http://aralbalkan.com/1504 . It's frustrating, but when you get used to it and you realize how much faster this is than count() on a relational db, you won't mind...
