VOLTTRON Actuator agent reports a failure message, but it appears to be working just fine

I'm using the actuator agent's get_multiple_points with VOLTTRON 8.1.3 to make about 30 BACnet read requests for sensors:
zone_setpoints_data = self.vip.rpc.call('platform.actuator', 'get_multiple_points', actuator_get_this_data).get(timeout=300)
And I notice this debug message:
2022-06-09 19:55:21,927 (loadshedagent-0.1 2930461) __main__ DEBUG: [Simple DR Agent INFO] - ACTUATOR SCHEDULE EVENT SUCESS {'result': 'FAILURE', 'data': {}, 'info': 'REQUEST_CONFLICTS_WITH_SELF'}
But I do get the data, so it appears to be working just fine, alongside the one-minute-interval scrape of all BACnet devices inside the building. Is this anything to worry about, or should I make some sort of adjustment?
EDIT
A code snippet for scheduling the actuator is below. Am I scheduling the actuator agent wrong with the _now, str_start, _end, str_end values on 30 devices for get_multiple_points? Should I be adjusting the td(seconds=10) uniquely for each device to space out the calls?
from datetime import timedelta as td
from volttron.platform.agent.utils import get_aware_utc_now, format_timestamp

# create start and end timestamps for actuator agent scheduling
_now = get_aware_utc_now()
str_start = format_timestamp(_now)
_end = _now + td(seconds=10)
str_end = format_timestamp(_end)

# build one [device_topic, start, end] entry per device
actuator_schedule_request = []
for group in self.nested_group_map.values():
    for device_address in group.values():
        device = '/'.join([self.building_topic, str(device_address)])
        actuator_schedule_request.append([device, str_start, str_end])

# request the schedule before reading the zone temperature setpoint data
result = self.vip.rpc.call('platform.actuator', 'request_new_schedule', self.core.identity, 'my_schedule', 'HIGH', actuator_schedule_request).get(timeout=90)

It seems to me that getting points won't cause that debug message. The message you are seeing comes from https://volttron.readthedocs.io/en/releases-8.x/driver-framework/actuator/actuator-agent.html?highlight=REQUEST_CONFLICTS_WITH_SELF#task-schedule-failures
So it appears that your schedules are overlapping somewhere. But the call to get_multiple_points should not have anything to do with this error.
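Per that documentation, REQUEST_CONFLICTS_WITH_SELF indicates the schedule request conflicts with itself, which can happen when the same device topic appears more than once within a single request with overlapping time windows. A minimal sketch of a guard against that, assuming (and this is only an assumption) that nested_group_map can list the same device under more than one group:

# Hypothetical dedupe guard: ensure one request never contains two
# overlapping time slots for the same device topic.
seen = set()
actuator_schedule_request = []
for group in self.nested_group_map.values():
    for device_address in group.values():
        device = '/'.join([self.building_topic, str(device_address)])
        if device in seen:
            continue  # a second entry for this device would overlap the first
        seen.add(device)
        actuator_schedule_request.append([device, str_start, str_end])

Deduplicating costs nothing, and if duplicates are the cause it makes the FAILURE response go away; the reads themselves succeed regardless, which matches what you are seeing.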

Related

Race Condition in Sagemaker Batch Transform Job

We are facing this bug in our production environment. I have been looking for a solution for a while and I cannot seem to solve it. Any help would be appreciated.
We are using SageMaker Batch Transform to perform inference on our machine learning models. Each job is supposed to create one instance using a Docker image from our ECR repository. This job then consumes a payload and starts processing it using a PyTorch script. When the job is done, the script calls an API to store the results.
The issue is that when we check the CloudWatch logs for a SINGLE job, we see that it is repeated. After the job is repeated multiple times, the individual instances of the same job may or may not finish, and the whole operation returns with an error.
Basically, we see the following issue in our CloudWatch logs and cannot figure out what is causing it:
2022-04-24 19:41:47,865 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Starting to Process Task: 12345678-abcd-1234-efgh-123456ab12c3
...
[The job is running and printing logs]
...
[There is no error but the job doesn't seem to run anymore, the same job seems to roll back again]
...
2022-04-24 19:52:09,522 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Starting to Process Task: 12345678-abcd-1234-efgh-123456ab12c3
...
[The job is running and printing logs]
...
[There is no error but the job doesn't seem to run anymore, the same job seems to roll back again]
...
2022-04-24 20:12:11,834 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Starting to Process Task: 12345678-abcd-1234-efgh-123456ab12c3
...
[The job is running and printing logs]
...
[There are no errors but the cloud watch logs stop here. Sagemaker returns an error to the client.]
The following sample code is what we are using to run the jobs:
def inference_batch(self):
    batch_input = f"s3://{self.cnf.SAGEMAKER_BUCKET}/batch-input/batch.csv"
    batch_output = f"s3://{self.cnf.SAGEMAKER_BUCKET}/batch-output/"
    # note: '%M' (minutes), not '%m' (month), so job names stay unique
    job_name = f"{self.cnf.SAGEMAKER_MODEL}-{datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
    transform_input = {
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': batch_input
            }
        },
        'ContentType': 'text/csv',
        'SplitType': 'Line',
    }
    transform_output = {
        'S3OutputPath': batch_output
    }
    transform_resources = {
        'InstanceType': self.cnf.SAGEMAKER_BATCH_INSTANCE,
        'InstanceCount': 1
    }
    # self.sm_boto_client is an instance of boto3.Session(region_name="some-region").client("sagemaker")
    self.sm_boto_client.create_transform_job(
        TransformJobName=job_name,
        ModelName=self.cnf.SAGEMAKER_MODEL,
        TransformInput=transform_input,
        TransformOutput=transform_output,
        TransformResources=transform_resources
    )

    status = self.sm_boto_client.describe_transform_job(TransformJobName=job_name)
    print(f'Executing transform job {job_name}...')
    while status['TransformJobStatus'] == 'InProgress':
        time.sleep(5)
        status = self.sm_boto_client.describe_transform_job(TransformJobName=job_name)

    if status['TransformJobStatus'] == 'Completed':
        print(f'Batch transform job {job_name} successfully completed.')
    else:
        raise Exception(f'Batch transform job {job_name} failed.')
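As a side note on the polling loop at the end: boto3 also ships a built-in waiter, transform_job_completed_or_stopped, which can replace the manual describe/sleep cycle. A minimal sketch, reusing job_name and the client from the snippet above:

# Wait for the transform job to finish using boto3's built-in waiter.
waiter = self.sm_boto_client.get_waiter('transform_job_completed_or_stopped')
waiter.wait(
    TransformJobName=job_name,
    WaiterConfig={'Delay': 5, 'MaxAttempts': 120}  # poll every 5 s, up to 10 min
)
status = self.sm_boto_client.describe_transform_job(TransformJobName=job_name)
if status['TransformJobStatus'] != 'Completed':
    raise Exception(f'Batch transform job {job_name} failed.')

This does not explain the repeated task IDs, but it rules the polling code out as a factor.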

schedule actuator agent MALFORMED_REQUEST

Can someone give me a tip on what I am doing wrong with scheduling the actuator agent in my agent code to scrape some BACnet data?
_now = get_aware_utc_now()
str_start = format_timestamp(_now)
_end = _now + td(seconds=10)
str_end = format_timestamp(_end)
peroidic_schedule_request = ["slipstream_internal/slipstream_hq/1100",  # AHU
                             "slipstream_internal/slipstream_hq/5231",  # eGauge Main
                             "slipstream_internal/slipstream_hq/5232",  # eGauge RTU
                             "slipstream_internal/slipstream_hq/5240"]  # eGauge PV
# send the request to the actuator
result = self.vip.rpc.call('platform.actuator', 'request_new_schedule', self.core.identity, 'my_schedule', 'HIGH', peroidic_schedule_request).get(timeout=90)
_log.debug(f'[Conninuous Roller Agent INFO] - ACTUATOR SCHEDULE EVENT SUCESS {result}')
meter_data_to_get = [self.ahu_clg_pid_topic, self.kw_pv_topic, self.kw_rtu_topic, self.kw_main_topic]
rpc_result = self.vip.rpc.call('platform.actuator', 'get_multiple_points', meter_data_to_get).get(timeout=90)
_log.debug(f'[Conninuous Roller Agent INFO] - kW data is {rpc_result}!')
The RPC request is working and I can get the data, but there is an actuator agent error.
{'result': 'FAILURE', 'data': {}, 'info': 'MALFORMED_REQUEST: ValueError: too many values to unpack (expected 3)'}
Full Traceback:
2021-09-13 20:47:23,334 (actuatoragent-1.0 1477601) __main__ ERROR: bad request: {'time': '2021-09-13T20:47:23.334762+00:00', 'requesterID': 'platform.continuousroller', 'taskID': 'my_schedule', 'type': 'NEW_SCHEDULE'}, [], too many values to unpack (expected 3)
2021-09-13 20:47:23,338 (continuousrolleragent-0.1 1559787) __main__ DEBUG: [Conninuous Roller Agent INFO] - ACTUATOR SCHEDULE EVENT SUCESS {'result': 'FAILURE', 'data': {}, 'info': 'MALFORMED_REQUEST: ValueError: too many values to unpack (expected 3)'}
2021-09-13 20:47:23,346 (forwarderagent-5.1 1548751) __main__ DEBUG: publish_to_historian number of items: 1
2021-09-13 20:47:23,346 (forwarderagent-5.1 1548751) __main__ DEBUG: Lasttime: 0 currenttime: 1631566043.0
2021-09-13 20:47:23,350 (forwarderagent-5.1 1548751) __main__ DEBUG: handled: 1 number of items
2021-09-13 20:47:23,710 (continuousrolleragent-0.1 1559787) __main__ DEBUG: [Conninuous Roller Agent INFO] - kW data is [{'slipstream_internal/slipstream_hq/1100/Cooling Capacity Status': 76.46278381347656, 'slipstream_internal/slipstream_hq/5231/REGCHG total_power': 59960.0, 'slipstream_internal/slipstream_hq/5232/REGCHG total_rtu_power': 44507.0, 'slipstream_internal/slipstream_hq/5240/REGCHG Generation': 1477.0}, {}]!
I suspect that the Scheduler, which is an attribute of the Actuator, encounters a problem when creating a Task from the 'requests' object, which originates from the 'requests' parameter of 'request_new_schedule'. In your case, 'requests' is the list of strings named 'peroidic_schedule_request'. Creating a Task is part of the Actuator's workflow, by dint of the Actuator's Scheduler calling 'request_slots'; see
https://github.com/VOLTTRON/volttron/blob/develop/services/core/ActuatorAgent/actuator/agent.py#L1378 and
https://github.com/VOLTTRON/volttron/blob/a7bbdfcd4c82544bd743532891389f49b771b196/services/core/ActuatorAgent/actuator/scheduler.py#L392
When a Task is created, the 'requests' object is processed in the Task constructor method; see https://github.com/VOLTTRON/volttron/blob/a7bbdfcd4c82544bd743532891389f49b771b196/services/core/ActuatorAgent/actuator/scheduler.py#L148
I suspect that each 'request' in 'requests' is not an object composed of exactly three objects, potentially being the cause of the Actuator error. But I could be wrong. To know exactly what 'requests' looks like, go back through your logs and look for a debug statement from the Actuator that begins with: "Got new schedule request: ..." (see https://github.com/VOLTTRON/volttron/blob/develop/services/core/ActuatorAgent/actuator/agent.py#L1375)
Without seeing more of the logs, as a first step, I'd recommend verifying that the 'requests' object was properly processed; I'd also recommend putting various debug statements, especially at https://github.com/VOLTTRON/volttron/blob/a7bbdfcd4c82544bd743532891389f49b771b196/services/core/ActuatorAgent/actuator/scheduler.py#L148 to catch where this ValueError is coming from.
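For reference, the shape mismatch is visible in the snippet itself: request_new_schedule expects each entry in 'requests' to be a three-element [device_topic, start, end] sequence, but peroidic_schedule_request contains bare topic strings (a bare string unpacks character by character, hence "too many values to unpack (expected 3)"). A minimal sketch of the expected shape, reusing str_start and str_end from the question:

# Each schedule entry must be [device_topic, start_time, end_time].
peroidic_schedule_request = [
    ["slipstream_internal/slipstream_hq/1100", str_start, str_end],  # AHU
    ["slipstream_internal/slipstream_hq/5231", str_start, str_end],  # eGauge Main
    ["slipstream_internal/slipstream_hq/5232", str_start, str_end],  # eGauge RTU
    ["slipstream_internal/slipstream_hq/5240", str_start, str_end],  # eGauge PV
]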

Google Cloud Run pubsub pull listener app fails to start

I'm testing a Pub/Sub "pull" subscriber on Cloud Run using just the listener part of this sample Java code (SubscribeAsyncExample... reworked slightly to fit in my Spring Boot app):
https://cloud.google.com/pubsub/docs/quickstart-client-libraries#java_1
It fails to start up during deploy... but while it's trying to start, it does pull items from the Pub/Sub queue. Originally, I had an HTTP "push" receiver (a @RestController) on a different Pub/Sub topic, and that worked fine. Any suggestions? I'm new to Cloud Run. Thanks.
Deploying...
Creating Revision... Cloud Run error: Container failed to start. Failed to start and then listen on the port defined
by the PORT environment variable. Logs for this revision might contain more information....failed
Deployment failed
In logs:
2020-08-11 18:43:22.688 INFO 1 --- [ main] o.s.web.context.ContextLoader : Root WebApplicationContext: initialization completed in 4606 ms
2020-08-11T18:43:25.287759Z Listening for messages on projects/ce-cxmo-dev/subscriptions/AndySubscriptionPull:
2020-08-11T18:43:25.351650801Z Container Sandbox: Unsupported syscall setsockopt(0x18,0x29,0x31,0x3eca02dfd974,0x4,0x28). It is very likely that you can safely ignore this message and that this is not the cause of any error you might be troubleshooting. Please, refer to https://gvisor.dev/c/linux/amd64/setsockopt for more information.
2020-08-11T18:43:25.351770555Z Container Sandbox: Unsupported syscall setsockopt(0x18,0x29,0x12,0x3eca02dfd97c,0x4,0x28). It is very likely that you can safely ignore this message and that this is not the cause of any error you might be troubleshooting. Please, refer to https://gvisor.dev/c/linux/amd64/setsockopt for more information.
2020-08-11 18:43:25.680 WARN 1 --- [ault-executor-0] i.g.n.s.i.n.u.internal.MacAddressUtil : Failed to find a usable hardware address from the network interfaces; using random bytes: ae:2c:fb:e7:92:9c:2b:24
2020-08-11T18:45:36.282714Z Id: 1421389098497572
2020-08-11T18:45:36.282763Z Data: We be pub-sub'n in pull mode2!!
Nothing else after this and the app stops running.
@Component
public class AndyTopicPullRecv {

    public AndyTopicPullRecv() {
        subscribeAsyncExample("ce-cxmo-dev", "AndySubscriptionPull");
    }

    public static void subscribeAsyncExample(String projectId, String subscriptionId) {
        ProjectSubscriptionName subscriptionName =
                ProjectSubscriptionName.of(projectId, subscriptionId);

        // Instantiate an asynchronous message receiver.
        MessageReceiver receiver =
                (PubsubMessage message, AckReplyConsumer consumer) -> {
                    // Handle incoming message, then ack the received message.
                    System.out.println("Id: " + message.getMessageId());
                    System.out.println("Data: " + message.getData().toStringUtf8());
                    consumer.ack();
                };

        Subscriber subscriber = null;
        try {
            subscriber = Subscriber.newBuilder(subscriptionName, receiver).build();
            // Start the subscriber.
            subscriber.startAsync().awaitRunning();
            System.out.printf("Listening for messages on %s:\n", subscriptionName.toString());
            // Allow the subscriber to run indefinitely unless an unrecoverable error occurs.
            // subscriber.awaitTerminated(30, TimeUnit.SECONDS);
            subscriber.awaitTerminated();
            System.out.printf("Async subscribe terminated on %s:\n", subscriptionName.toString());
        // } catch (TimeoutException timeoutException) {
        } catch (Exception e) {
            // Shut down the subscriber and stop receiving messages.
            subscriber.stopAsync();
            System.out.printf("Async subscriber exception: " + e);
        }
    }
}
Kolban's question is very important!! With the shared code, I would say "No". The Cloud Run contract is clear:
Your service must answer HTTP requests. Outside of a request, you pay nothing and no CPU is dedicated to your instance (the instance is like a daemon when no request is being processed).
Your service must be stateless (not your case here; I won't spend time on this).
If you want to pull from your Pub/Sub subscription, create an endpoint in your code with a REST controller. While you are processing this request, run your pull mechanism and process messages (see the sketch below).
This endpoint can be called by Cloud Scheduler regularly to keep the process up.
Be careful: there is a maximum request processing timeout of 15 minutes (today; subject to change in the near future). So you can't run your process for more than 15 minutes. Make it resilient to failure and set your scheduler to call your service every 15 minutes.
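The pattern is language-agnostic; here is a minimal Python sketch of the pull-inside-an-endpoint idea, assuming Flask and the google-cloud-pubsub client (PROJECT_ID, SUB_ID, and the /pull route are placeholders for illustration):

from flask import Flask
from google.cloud import pubsub_v1

app = Flask(__name__)

@app.route("/pull", methods=["POST"])
def pull_messages():
    # Bounded synchronous pull: does its work inside the request window,
    # which is what the Cloud Run contract requires.
    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path("PROJECT_ID", "SUB_ID")
    response = subscriber.pull(request={"subscription": sub_path, "max_messages": 100})
    ack_ids = []
    for received in response.received_messages:
        print("Id:", received.message.message_id)
        print("Data:", received.message.data.decode("utf-8"))
        ack_ids.append(received.ack_id)
    if ack_ids:
        subscriber.acknowledge(request={"subscription": sub_path, "ack_ids": ack_ids})
    return f"processed {len(ack_ids)} messages", 200

Cloud Scheduler then POSTs to /pull on whatever interval you choose, staying under the request timeout.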

Logging packet sent time by physical layer

From the sender node I am sending a packet with this command in the simulation script:
router << new DatagramReq(to: 1, data: [1,1,1,1], protocol: Protocol.DATA);
I want to know how to log the datagram sent time at the sender's physical layer in the log-0.txt file.
Setting the log level of the agent to FINE should log all messages sent by that agent. The logger name for each agent follows the class name of the agent, by convention. If you do a ps, you'll see the class names of the agents:
> ps
remote: org.arl.unet.remote.RemoteControl - IDLE
rdp: org.arl.unet.net.RouteDiscoveryProtocol - IDLE
ranging: org.arl.unet.phy.Ranging - IDLE
uwlink: org.arl.unet.link.ReliableLink - IDLE
node: org.arl.unet.nodeinfo.NodeInfo - IDLE
phy: org.arl.unet.sim.HalfDuplexModem - IDLE
arp: org.arl.unet.addr.AddressResolution - IDLE
transport: org.arl.unet.transport.SWTransport - IDLE
router: org.arl.unet.net.Router - IDLE
mac: org.arl.unet.mac.CSMA - IDLE
You can see that the phy agent class is org.arl.unet.sim.HalfDuplexModem, if you're running a simulator. You can set the log level of the simulation classes to FINE:
> logLevel 'org.arl.unet.sim', FINE
Now, the logs will show the messages (and their corresponding times). Example:
1567801752688|FINE|org.arl.unet.sim.HalfDuplexModem/A#47:handleTxFrameReq|TxFrameReq:REQUEST[type:DATA to:1 (4 bytes)] Data:
01010101
1567801752741|FINE|org.arl.unet.sim.HalfDuplexModem/A#47:sendTxFrameNtf|TxFrameNtf:INFORM[type:DATA txTime:753410296]
1567801753392|FINE|org.arl.unet.sim.HalfDuplexModem/B#63:sendRxFrameStartNtf|RxFrameStartNtf:INFORM[type:DATA rxTime:3515132027]
1567801754095|FINE|org.arl.unet.sim.HalfDuplexModem/B#63:sendRxFrameNtf|RxFrameNtf:INFORM[type:DATA from:232 to:1 rxTime:3515132027 (4 bytes)] Data:
01010101

AWS: Why does my RDS instance keep starting after I turned it off?

I have an RDS database instance on AWS and have turned it off for now. However, every few days it starts up on its own. I don't have any other services running right now.
There is this event in my RDS log:
"DB instance is being started due to it exceeding the maximum allowed time being stopped."
Why is there a limit to how long my RDS instance can be stopped? I just want to put my project on hold for a few weeks, but AWS won't let me turn off my DB? It costs $12.50/mo to have it sit idle, so I don't want to pay for this, and I certainly don't want AWS starting an instance for me that does not get used.
Please help!
That's a limitation of this new feature.
You can stop an instance for up to 7 days at a time. After 7 days, it will be automatically started. For more details on stopping and starting a database instance, please refer to Stopping and Starting a DB Instance in the Amazon RDS User Guide.
You can set up a cron job to stop the instance again after 7 days (see the sketch below). You can also change to a smaller instance size to save money.
Another option is the upcoming Aurora Serverless, which stops and starts for you automatically. It might be more expensive than a dedicated instance when running 24/7.
Finally, there is always Heroku, which gives you a free database instance that starts and stops itself, with some limitations.
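For the cron-job route, a minimal sketch of the stop script, assuming boto3 credentials are configured and the instance identifier is mydb (both placeholders):

import boto3

# Stop the RDS instance if it is running; safe to call from a daily cron job,
# since an already-stopped instance is simply skipped here.
rds = boto3.client("rds", region_name="us-east-1")  # adjust region as needed
instance = rds.describe_db_instances(DBInstanceIdentifier="mydb")["DBInstances"][0]
if instance["DBInstanceStatus"] == "available":
    rds.stop_db_instance(DBInstanceIdentifier="mydb")
    print("stop requested")
else:
    print(f"instance is {instance['DBInstanceStatus']}; nothing to do")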
You can also try saving the following CloudFormation template as KeepDbStopped.yml and then deploy with this command:
aws cloudformation deploy --template-file KeepDbStopped.yml --stack-name stop-db --capabilities CAPABILITY_IAM --parameter-overrides DB=arn:aws:rds:us-east-1:XXX:db:XXX
Make sure to change arn:aws:rds:us-east-1:XXX:db:XXX to your RDS ARN.
Description: Automatically stop RDS instance every time it turns on due to exceeding the maximum allowed time being stopped
Parameters:
  DB:
    Description: ARN of database that needs to be stopped
    Type: String
    AllowedPattern: arn:aws:rds:[a-z0-9\-]+:[0-9]+:db:[^:]*
Resources:
  DatabaseStopperFunction:
    Type: AWS::Lambda::Function
    Properties:
      Role: !GetAtt DatabaseStopperRole.Arn
      Runtime: python3.6
      Handler: index.handler
      Timeout: 20
      Code:
        ZipFile:
          Fn::Sub: |
            import boto3
            import time

            def handler(event, context):
                print("got", event)
                db = event["detail"]["SourceArn"]
                id = event["detail"]["SourceIdentifier"]
                message = event["detail"]["Message"]
                region = event["region"]
                rds = boto3.client("rds", region_name=region)
                if message == "DB instance is being started due to it exceeding the maximum allowed time being stopped.":
                    print("database turned on automatically, setting last seen tag...")
                    last_seen = int(time.time())
                    rds.add_tags_to_resource(ResourceName=db, Tags=[{"Key": "DbStopperLastSeen", "Value": str(last_seen)}])
                elif message == "DB instance started":
                    print("database started (and sort of available?)")
                    last_seen = 0
                    for t in rds.list_tags_for_resource(ResourceName=db)["TagList"]:
                        if t["Key"] == "DbStopperLastSeen":
                            last_seen = int(t["Value"])
                    if time.time() < last_seen + (60 * 20):
                        print("database was automatically started in the last 20 minutes, turning off...")
                        time.sleep(10)  # even waiting for the "started" event is not enough, so add some wait
                        rds.stop_db_instance(DBInstanceIdentifier=id)
                        print("success! removing auto-start tag...")
                        rds.add_tags_to_resource(ResourceName=db, Tags=[{"Key": "DbStopperLastSeen", "Value": "0"}])
                    else:
                        print("ignoring manual database start")
                else:
                    print("error: unknown database event!")
  DatabaseStopperRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Action:
              - sts:AssumeRole
            Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      Policies:
        - PolicyName: Notify
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Action:
                  - rds:StopDBInstance
                Effect: Allow
                Resource: !Ref DB
              - Action:
                  - rds:AddTagsToResource
                  - rds:ListTagsForResource
                  - rds:RemoveTagsFromResource
                Effect: Allow
                Resource: !Ref DB
                Condition:
                  ForAllValues:StringEquals:
                    aws:TagKeys:
                      - DbStopperLastSeen
  DatabaseStopperPermission:
    Type: AWS::Lambda::Permission
    Properties:
      Action: lambda:InvokeFunction
      FunctionName: !GetAtt DatabaseStopperFunction.Arn
      Principal: events.amazonaws.com
      SourceArn: !GetAtt DatabaseStopperRule.Arn
  DatabaseStopperRule:
    Type: AWS::Events::Rule
    Properties:
      EventPattern:
        source:
          - aws.rds
        detail-type:
          - "RDS DB Instance Event"
        resources:
          - !Ref DB
        detail:
          Message:
            - "DB instance is being started due to it exceeding the maximum allowed time being stopped."
            - "DB instance started"
      Targets:
        - Arn: !GetAtt DatabaseStopperFunction.Arn
          Id: DatabaseStopperLambda
It has worked for at least one person. If you have issues please report here.
