How to add custom delay to instaloader - timer

Can someone show me how to add a time delay to Instaloader, i.e. a code snippet of a working time delay inside instaloadercontext.py? I just want to add a custom delay: once a limit of 60 requests is reached, sleep for an hour before the next query.
In the code below, the RateController class provides request tracking and rate control to stay within rate limits. It can be overridden to change Instaloader's behavior regarding rate limits, for example to raise a custom exception when the rate limit is hit:
import instaloader

class MyCustomException(Exception):
    pass

class MyRateController(instaloader.RateController):
    def sleep(self, secs):
        raise MyCustomException()

L = instaloader.Instaloader(rate_controller=lambda ctx: MyRateController(ctx))

Since nobody replied, I was wondering: were you able to solve this?
By the way, the MyRateController class can do more than just "sleep":
class MyRateController(instaloader.RateController):
    def sleep(self, secs: float):
        # wait the given number of seconds; override to change how waiting is done
        return super().sleep(secs)

    def handle_429(self, query_type: str) -> None:
        # called when Instagram responds with HTTP 429 ("Too Many Requests")
        return super().handle_429(query_type)

    def query_waittime(self, query_type: str, current_time: float, untracked_queries: bool = False) -> float:
        # calculates how long to wait before the next query of this type
        return super().query_waittime(query_type, current_time, untracked_queries)

    def wait_before_query(self, query_type: str) -> None:
        # called before each request; a policy such as "sleep an hour after
        # every 60 queries" could be implemented here by counting calls
        return super().wait_before_query(query_type)

Related

Flink Watermark forBoundedOutOfOrderness includes data beyond boundaries

I'm assigning timestamps and watermarks like this:
def myProcess(dataStream: DataStream[Foo]) {
  dataStream
    .assignTimestampsAndWatermarks(
      WatermarkStrategy
        .forBoundedOutOfOrderness[Foo](Duration.ofSeconds(5))
        .withTimestampAssigner(new SerializableTimestampAssigner[Foo]() {
          // Long is milliseconds since the Epoch
          override def extractTimestamp(element: Foo, recordTimestamp: Long): Long = element.eventTimestamp
        })
    )
    .keyBy(k => k.id)
    .window(TumblingEventTimeWindows.of(Time.hours(1)))
    .reduce(new MyReducerFn(), new MyWindowFunction())
}
I have a unit test:
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

val dataTime = 1641488400000L
val lateTime = dataTime + Duration.ofHours(5).toMillis

val dataSource = env
  .addSource(new SourceFunction[Foo]() {
    def run(ctx: SourceFunction.SourceContext[Foo]) {
      ctx.collect(Foo(id = 1, value = 1, eventTimestamp = dataTime))
      ctx.collect(Foo(id = 1, value = 1, eventTimestamp = lateTime))
      // should be dropped due to past max lateness
      ctx.collect(Foo(id = 1, value = 2, eventTimestamp = dataTime))
    }
    override def cancel(): Unit = {}
  })

myProcess(dataSource).collect(new MyTestSink)
env.execute("watermark & lateness test")
I expect the second element to advance the watermark to (lateTime - Duration.ofSeconds(5)) (i.e. the bounded out-of-orderness), and therefore the third element should not be assigned to the one-hour tumbling window, since the watermark has advanced considerably past it. However, I see both the first and third elements reach my reduce function.
Am I misunderstanding watermarks here, or what forBoundedOutOfOrderness does? Can someone clarify?
Thanks!
I must have typo'd something; I have it working now.
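For the record, one likely explanation for the original observation: forBoundedOutOfOrderness only generates watermarks periodically (every 200 ms by default), so a source that collects all three records back-to-back can finish before any intermediate watermark is emitted; the third record is then not yet late when the window operator sees it, and the final watermark emitted at end-of-input fires the window with both elements included. If you want deterministic lateness in a test like this, one option is to emit timestamps and watermarks directly from the source. A rough sketch reusing the Foo source from above (the assignTimestampsAndWatermarks step would then be omitted):

val dataSource = env
  .addSource(new SourceFunction[Foo]() {
    def run(ctx: SourceFunction.SourceContext[Foo]) {
      ctx.collectWithTimestamp(Foo(id = 1, value = 1, eventTimestamp = dataTime), dataTime)
      ctx.collectWithTimestamp(Foo(id = 1, value = 1, eventTimestamp = lateTime), lateTime)
      // advance the watermark explicitly, honoring the 5-second out-of-orderness bound
      // (Watermark is org.apache.flink.streaming.api.watermark.Watermark)
      ctx.emitWatermark(new Watermark(lateTime - Duration.ofSeconds(5).toMillis))
      // this record is now behind the emitted watermark and past the window's
      // allowed lateness, so the window operator will drop it
      ctx.collectWithTimestamp(Foo(id = 1, value = 2, eventTimestamp = dataTime), dataTime)
    }
    override def cancel(): Unit = {}
  })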

A problem from Flink Training tutorial: LongRidesSolution.scala

What this function (processElement) does is pretty clear:
on the keyed stream (keyed by rideId), it processes each element belonging to that key and updates the state based on the condition below.
override def processElement(ride: TaxiRide,
                            context: KeyedProcessFunction[Long, TaxiRide, TaxiRide]#Context,
                            out: Collector[TaxiRide]): Unit = {
  val timerService = context.timerService
  if (ride.isStart) {
    // the matching END might have arrived first; don't overwrite it
    if (rideState.value() == null) {
      rideState.update(ride)
    }
  } else {
    rideState.update(ride)
  }
  timerService.registerEventTimeTimer(ride.getEventTime + 120 * 60 * 1000)
}
The timer will trigger once the watermark reaches the timestamp:
override def onTimer(timestamp: Long,
                     ctx: KeyedProcessFunction[Long, TaxiRide, TaxiRide]#OnTimerContext,
                     out: Collector[TaxiRide]): Unit = {
  val savedRide = rideState.value
  if (savedRide != null && savedRide.isStart) {
    out.collect(savedRide)
  }
  rideState.clear()
}
The problem is: if the END record comes first, then based on this logic the START will not overwrite the ride state for that key, the timer will fire two hours later, and onTimer will not collect or emit the record. But what if this record meets our requirement, i.e. the ride started more than two hours before it ended? I think there should be more logic to deal with that.
If the END record is processed before the START record, then it could be that the START record arrives very late, and when it does arrive it supplies evidence that this ride lasted for more than two hours.
However, the goal of this exercise is not to find all rides that last for more than two hours, but rather to flag, in real-time, rides that should have ended by now (because they started more than two hours ago), but haven't. Since these rides you ask about have ended, it's debatable whether they merit alerts.
You've raised an interesting point that should probably be added to the exercise discussion page.
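For what it's worth, a rough sketch of the extra branch the question is asking about, reusing the rideState and TaxiRide types from the exercise (the two-hour threshold assumes getEventTime is in milliseconds, and whether such already-ended rides should be emitted at all is exactly the judgement call discussed above):

override def processElement(ride: TaxiRide,
                            context: KeyedProcessFunction[Long, TaxiRide, TaxiRide]#Context,
                            out: Collector[TaxiRide]): Unit = {
  val timerService = context.timerService
  if (ride.isStart) {
    val saved = rideState.value()
    if (saved == null) {
      rideState.update(ride)
    } else if (!saved.isStart && saved.getEventTime - ride.getEventTime > 120 * 60 * 1000) {
      // the END was processed first, and this late START reveals that the
      // ride lasted more than two hours; the onTimer path would never emit it
      out.collect(ride)
    }
  } else {
    rideState.update(ride)
  }
  timerService.registerEventTimeTimer(ride.getEventTime + 120 * 60 * 1000)
}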

Why does Flink emit duplicate records on a DataStream join + Global window?

I'm learning/experimenting with Flink, and I'm observing some unexpected behavior with the DataStream join that I would like to understand.
Let's say I have two streams with 10 records each, which I want to join on an id field. Let's assume that each record in one stream has a matching one in the other, and that the IDs are unique within each stream. Let's also say I have to use a global window (requirement).
Join using DataStream API (my simplified code in Scala):
val stream1 = ... // from a Kafka topic on my local machine (I tried with and without .keyBy)
val stream2 = ...

stream1
  .join(stream2)
  .where(_.id).equalTo(_.id)
  .window(GlobalWindows.create()) // assume this is a requirement
  .trigger(CountTrigger.of(1))
  .apply {
    (row1, row2) => // ...
  }
  .print()
Result:
Everything is printed as expected, each record from the first stream joined with a record from the second one.
However:
If I re-send one of the records (say, with an updated field) from one of the streams to that stream, two duplicate join events get emitted 😞
If I repeat that operation (with or without an updated field), I will get 3 emitted events, then 4, 5, etc. 😞
Could someone in the Flink community explain why this is happening? I would have expected only 1 event emitted each time. Is it possible to achieve this with a global window?
In comparison, the Flink Table API behaves as expected in that same scenario, but for my project I'm more interested in the DataStream API.
Example with Table API, which worked as expected:
tableEnv
  .sqlQuery(
    """
      |SELECT *
      | FROM stream1
      | JOIN stream2
      | ON stream1.id = stream2.id
    """.stripMargin)
  .toRetractStream[Row]
  .filter(_._1) // just keep the inserts
  .map(...)
  .print() // works as expected, after re-sending updated records
Thank you,
Nicolas
The issue is that records are never removed from your global window. So the join operation is triggered on the global window whenever a new record arrives, but the old records are still present.
Thus, to get it running in your case, you'd need to implement a custom evictor. I expanded your example into a minimal working example and added the evictor, which I will explain after the snippet.
val data1 = List(
  (1L, "myId-1"),
  (2L, "myId-2"),
  (5L, "myId-1"),
  (9L, "myId-1"))
val data2 = List(
  (3L, "myId-1", "myValue-A"))

val stream1 = env.fromCollection(data1)
val stream2 = env.fromCollection(data2)

stream1.join(stream2)
  .where(_._2).equalTo(_._2)
  .window(GlobalWindows.create()) // assume this is a requirement
  .trigger(CountTrigger.of(1))
  .evictor(new Evictor[CoGroupedStreams.TaggedUnion[(Long, String), (Long, String, String)], GlobalWindow]() {
    override def evictBefore(elements: lang.Iterable[TimestampedValue[CoGroupedStreams.TaggedUnion[(Long, String), (Long, String, String)]]], size: Int, window: GlobalWindow, evictorContext: Evictor.EvictorContext): Unit = {}

    override def evictAfter(elements: lang.Iterable[TimestampedValue[CoGroupedStreams.TaggedUnion[(Long, String), (Long, String, String)]]], size: Int, window: GlobalWindow, evictorContext: Evictor.EvictorContext): Unit = {
      import scala.collection.JavaConverters._
      // find the position of the last element coming from the second input
      val lastInputTwoIndex = elements.asScala.zipWithIndex.filter(e => e._1.getValue.isTwo).lastOption.map(_._2).getOrElse(-1)
      if (lastInputTwoIndex == -1) {
        println("Waiting for the lookup value before evicting")
        return
      }
      // remove everything except that last element of the second input
      val iterator = elements.iterator()
      for (index <- 0 until size) {
        val cur = iterator.next()
        if (index != lastInputTwoIndex) {
          println(s"evicting ${cur.getValue.getOne}/${cur.getValue.getTwo}")
          iterator.remove()
        }
      }
    }
  })
  .apply((r, l) => (r, l))
  .print()
The evictor is applied after the window function (the join in this case) has been applied. It's not entirely clear how your use case should behave when there are multiple entries in the second input, but for now the evictor only handles a single entry.
Whenever a new element comes into the window, the window function is immediately triggered (count = 1). Then the join is evaluated over all elements having the same key. Afterwards, to avoid duplicate outputs, we remove all entries from the first input in the current window. Since the second input may arrive after the first inputs, no eviction is performed while the second input is empty. Note that my Scala is quite rusty; you will be able to write it in a much nicer way. The output of a run is:
Waiting for the lookup value before evicting
Waiting for the lookup value before evicting
Waiting for the lookup value before evicting
Waiting for the lookup value before evicting
4> ((1,myId-1),(3,myId-1,myValue-A))
4> ((5,myId-1),(3,myId-1,myValue-A))
4> ((9,myId-1),(3,myId-1,myValue-A))
evicting (1,myId-1)/null
evicting (5,myId-1)/null
evicting (9,myId-1)/null
A final remark: if the Table API already offers a concise way of doing what you want, I'd stick with it and convert the result to a DataStream when needed.

Flink - behaviour of timesOrMore

I want to find a pattern of events as follows.
The inner pattern is:
Have the same value for key "sensorArea".
Have different value for key "customerId".
Are within 5 seconds from each other.
And this pattern needs to:
Emit an "alert" only if the above happens 3 or more times.
I wrote something but I know for sure it is not complete.
Two questions:
I need to access the previous event's fields when I'm in the "next" pattern. How can I do that without using ctx, which is expensive?
My code produces a weird result. This is my input,
and my output is:
3> {first=[Customer[timestamp=50,customerId=111,toAdd=2,sensorData=33]], second=[Customer[timestamp=100,customerId=222,toAdd=2,sensorData=33], Customer[timestamp=600,customerId=333,toAdd=2,sensorData=33]]}
even though my desired output should be all of the first six events (users 111/222 with sensor area 33, then 44, then 55).
Pattern<Customer, ?> sameUserDifferentSensor = Pattern.<Customer>begin("first", skipStrategy)
    .followedBy("second").where(new IterativeCondition<Customer>() {
        @Override
        public boolean filter(Customer currCustomerEvent, Context<Customer> ctx) throws Exception {
            List<Customer> firstPatternEvents = Lists.newArrayList(ctx.getEventsForPattern("first"));
            int i = firstPatternEvents.size();
            int currSensorData = currCustomerEvent.getSensorData();
            int prevSensorData = firstPatternEvents.get(i - 1).getSensorData();
            int currCustomerId = currCustomerEvent.getCustomerId();
            int prevCustomerId = firstPatternEvents.get(i - 1).getCustomerId();
            return currSensorData == prevSensorData && currCustomerId != prevCustomerId;
        }
    })
    .within(Time.seconds(5))
    .timesOrMore(3);

PatternStream<Customer> sameUserDifferentSensorPatternStream = CEP.pattern(customerStream, sameUserDifferentSensor);
DataStream<String> alerts1 = sameUserDifferentSensorPatternStream.select((PatternSelectFunction<Customer, String>) Object::toString);
You will have an easier time if you first key the stream by the sensorArea. Then you will be pattern matching on streams where all of the events are for a single sensorArea, which will make the pattern easier to express and the matching more efficient.
You can't avoid using an iterative condition and the ctx, but it should be less expensive after keying the stream.
Also, your code example doesn't match the text description. The text says "within 5 seconds" and "3 or more times", while the code has within(Time.seconds(2)) and timesOrMore(2).
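To illustrate that restructuring, a rough sketch (in Scala, like most snippets on this page; Customer, skipStrategy, and customerStream are the names from the question):

// key by the sensor area first, so each partition only contains events
// for a single sensorArea
val keyedCustomers = customerStream.keyBy(_.getSensorData)

val sameAreaDifferentCustomers = Pattern.begin[Customer]("first", skipStrategy)
  .followedBy("second").where(new IterativeCondition[Customer]() {
    override def filter(curr: Customer, ctx: IterativeCondition.Context[Customer]): Boolean = {
      import scala.collection.JavaConverters._
      val firstPatternEvents = ctx.getEventsForPattern("first").asScala.toList
      // sensorData equality is already guaranteed by the keyBy, so only
      // the customer IDs need to be compared
      curr.getCustomerId != firstPatternEvents.last.getCustomerId
    }
  })
  .within(Time.seconds(5))
  .timesOrMore(3)

val alerts = CEP.pattern(keyedCustomers, sameAreaDifferentCustomers)
  .select(matched => matched.toString)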

NAO robot: places where output functions of boxes are defined

I am wondering where the output functions of NAO behavior boxes are usually defined.
I simply failed to find any related documentation in the API. There is some documentation to be found, but not for output functions.
Take the Speech Reco box for example: I can find the definition of the "WordRecognized" event in the online API, but not "wordRecognized" (case-sensitive) or "onNothing". My intuition is that they are defined as helpers in the script of the box (which you can open by double-clicking the box), but I failed to find any relevant implementation of those in the script either.
Has anyone run into this before and knows the solution? I would really appreciate any feedback, since I want to inspect how these functions are defined.
The code for Speech Reco is below; this situation happens with some other boxes too:
class MyClass(GeneratedClass):
    def __init__(self):
        GeneratedClass.__init__(self, False)
        try:
            self.asr = ALProxy("ALSpeechRecognition")
        except Exception as e:
            self.asr = None
            self.logger.error(e)
        self.memory = ALProxy("ALMemory")

    def onLoad(self):
        from threading import Lock
        self.bIsRunning = False
        self.mutex = Lock()
        self.hasPushed = False
        self.hasSubscribed = False
        self.BIND_PYTHON(self.getName(), "onWordRecognized")

    def onUnload(self):
        from threading import Lock
        self.mutex.acquire()
        try:
            if (self.bIsRunning):
                if (self.hasSubscribed):
                    self.memory.unsubscribeToEvent("WordRecognized", self.getName())
                if (self.hasPushed and self.asr):
                    self.asr.popContexts()
        except RuntimeError, e:
            self.mutex.release()
            raise e
        self.bIsRunning = False
        self.mutex.release()

    def onInput_onStart(self):
        from threading import Lock
        self.mutex.acquire()
        if (self.bIsRunning):
            self.mutex.release()
            return
        self.bIsRunning = True
        try:
            if self.asr:
                self.asr.setVisualExpression(self.getParameter("Visual expression"))
                self.asr.pushContexts()
            self.hasPushed = True
            if self.asr:
                self.asr.setVocabulary(self.getParameter("Word list").split(';'), self.getParameter("Enable word spotting"))
            self.memory.subscribeToEvent("WordRecognized", self.getName(), "onWordRecognized")
            self.hasSubscribed = True
        except RuntimeError, e:
            self.mutex.release()
            self.onUnload()
            raise e
        self.mutex.release()

    def onInput_onStop(self):
        if (self.bIsRunning):
            self.onUnload()
        self.onStopped()

    def onWordRecognized(self, key, value, message):
        if (len(value) > 1 and value[1] >= self.getParameter("Confidence threshold (%)") / 100.):
            self.wordRecognized(value[0])  # activate the output of the box
        else:
            self.onNothing()
Those methods are defined when you create or edit a box input or output. See this piece of documentation.
If you give the input the name "onMyTruc", then the method onInput_onMyTruc(self) will be called when the input is triggered.
If you give the name "output_value" to some output, it will create a callable method: self.output_value().
In your example, wordRecognized and onNothing are the names of the outputs of the Speech Reco box.
