I have an operator with parallelism=256 running on 128 task managers. Every time I get a checkpoint failure, it happens at the same subtask of this operator; for example, it's always subtask 129 that gets stuck and blocks checkpointing. I want to understand what happened to this subtask by examining the logs of the task manager that subtask 129 is running on. Is there a way in Flink to map a subtask id to the corresponding Task Manager?
The taskmanager.log files contain the names of the deployed tasks, including their subtask index. You could simply search for the TASK_NAME (129/256) in all taskmanager.log files.
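For example, a plain grep over the collected log files does the job; the sketch below does the same search in Python (the log directory is an assumption, point it at wherever you collect the taskmanager.log files):

```python
import glob

# Assumed location of the collected TaskManager logs; adjust to your setup.
LOG_GLOB = "/var/log/flink/*/taskmanager.log"
MARKER = "(129/256)"  # subtask index as it appears in the deployed task name

for path in glob.glob(LOG_GLOB):
    with open(path, errors="replace") as log_file:
        if any(MARKER in line for line in log_file):
            print("Subtask found in:", path)
```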
I was able to find a non-trivial but working solution to build the required mapping programmatically at runtime.
The main idea is that the REST endpoint /jobs/:jobid/vertices/:vertexid provides the necessary information for a specific vertex in the format
{
  "id": "804e...",
  "name": "Map -> Sink",
  ...
  "subtasks": [
    {
      "subtask": 0,
      "host": "ip-10-xx-yy-zz:36ddd"
    },
    ...
  ]
}
The main difficulty was obtaining the web interface URL programmatically. I was able to get it this way (there is probably a more elegant solution):
val env = FieldUtils
  .readField(getRuntimeContext.asInstanceOf[StreamingRuntimeContext], "taskEnvironment", true)
  .asInstanceOf[RuntimeEnvironment]

try {
  println("trying to get cluster client...")
  val client = new RestClusterClient[String](env.getTaskManagerInfo.getConfiguration, "rest")
  return client.getWebInterfaceURL
} catch {
  case e: Exception =>
    println("Failed to get cluster client : ")
    e.printStackTrace()
}
Given the web interface URL, I simply made an HTTP call to it and constructed the map.
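For illustration, the HTTP call and the map construction could look roughly like this (a Python sketch using the requests library; the URL, job id, and vertex id are placeholders, and the response format is the one shown above):

```python
import requests

# Placeholders: fill in the web interface URL, job id, and vertex id of your job.
web_url = "http://<jobmanager-host>:8081"
job_id = "<jobid>"
vertex_id = "<vertexid>"

response = requests.get("{}/jobs/{}/vertices/{}".format(web_url, job_id, vertex_id))
response.raise_for_status()

# Map: subtask index -> "host:port" of the TaskManager running that subtask.
subtask_to_host = {s["subtask"]: s["host"] for s in response.json()["subtasks"]}
print(subtask_to_host.get(129))
```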
I have a couple of state machines that use, respectively, DescribeNetworkInterfaces (an EC2 API) and ListResourceRecordSets (a Route 53 API).
I compose both sets of Parameters from the Input as follows:
For EC2:
"Parameters": {
"NetworkInterfaceIds.$": "$.detail.attachments[0].details[?(#.name==networkInterfaceId)].value"
"Resource": "arn:aws:states:::aws-sdk:ec2:describeNetworkInterfaces",
For Route 53:
"Parameters": {
"HostedZoneId.$": "$.NetworkInterfaces[0].TagSet[?(#.Key==HOSTZONEID)].Value"
},
"Resource": "arn:aws:states:::aws-sdk:route53:listResourceRecordSets"
They resolve fine, but the problem is that, for some reason, the [?(@.name==networkInterfaceId)] and the [?(@.Key==HOSTZONEID)] filters turn the Parameter values into arrays. Respectively:
{
  "NetworkInterfaceIds": [
    "eni-00f25c294401006b2"
  ]
}
And:
{
  "HostedZoneId": [
    "Z0555263BOXV8ELWLRS5"
  ]
}
In my case, the EC2 call succeeds because the network interfaces API does expect an array of ENIs. However, the Route 53 call fails because the record sets API expects a single hosted zone id. This is the error message I get from Step Functions:
No hosted zone found with ID: ["Z0555263BOXV8ELWLRS5"] (Service: Route53, Status Code: 404, Request ID: fa5bc89b-ca3d-4fe9-b2f3-baf01d509d76)
Based on my tests, it seems that it's the filter expression that causes the Parameter to be turned into an array, because if I point directly to a field (which I cannot do), such as this:
"Parameters": {
"NetworkInterfaceIds.$": "States.Array($.detail.attachments[0].details[1].value)"
},
then the input parameter is no longer constructed as an array, and the network interfaces API fails with:
An error occurred while executing the state 'DescribeNetworkInterfaces (1)' (entered at the event id #2). The Parameters '{"NetworkInterfaceIds":"eni-00f25c294401006b2"}' could not be used to start the Task: [Cannot construct instance of java.util.ArrayList (although at least one Creator exists): no String-argument constructor/factory method to deserialize from String value ('eni-00f25c294401006b2')]
I can fix the above problem using the States.Array intrinsic, but it's a moot point because I need to dynamically select the values in the inputs.
So I am at the point where the EC2 call works because, by chance, I need an array... but I can't figure out how to turn the Route 53 parameter (hosted zone id) into a non-array.
It was easier than I thought. I figured there was an intrinsic that would allow me to extract a value from an array: States.ArrayGetItem.
The following did the trick:
"Parameters": {
"HostedZoneId.$": "States.ArrayGetItem($.NetworkInterfaces[0].TagSet[?(#.Key==HOSTZONEID)].Value, 0)"
},
Is it possible in Jenkins to create a job that will run n times?
I would like to write a script in the job configuration (Windows batch command / Groovy) which allows me to do it. In this script, I would like to have an array of parameters and then run the job with each parameter in a loop. It should look like this:
def paramArray = ["a", "b", "c"]
for (int i = 0; i < paramArray.size(); i++) {
    // Here I want to run this job with each parameter
    job.run(paramArray[i])
}
Please help me with this issue.
I found the answer!
We need to create 2 pipelines in Jenkins: a downstream and an upstream job.
1. The downstream job is parameterized and takes 1 string parameter in the 'General' section. In its 'Pipeline' section it then simply prints the parameter it was given, so each run shows which value it received.
2. The upstream job has an array with all possible parameters for the downstream job, and in a loop it runs the downstream job with each parameter from the array.
As a result, the upstream job will run the downstream job 3 times, once with each parameter.
:)
I don't think you can run a Jenkins job with code like the above, but you can configure a cron trigger in Jenkins using "Build periodically" to run the job on a schedule.
Go to the Jenkins job > Configure > tick 'Build periodically' under Build Triggers,
then enter a cron-style schedule (for example H/15 * * * *) and Save.
That example runs the job every 15 minutes; you can also set a specific time in the schedule.
Please see the example from https://jenkins.io/doc/book/pipeline/jenkinsfile/ in the "Handling parameters" section. With a Jenkinsfile like the one below (copied from that doc), you can launch "Build with parameters" and supply the params. Since you want multiple params, you can delimit them with , or ; or something else based on your data; you just need to parse the input params to get the values using the delimiter you chose.
pipeline {
    agent any
    parameters {
        string(name: 'Greeting', defaultValue: 'Hello', description: 'How should I greet the world?')
    }
    stages {
        stage('Example') {
            steps {
                echo "${params.Greeting} World!"
            }
        }
    }
}
I have been searching various sources, but it is not clear to this newbie: how do I load data (a CSV file) from Cloud Storage into Cloud Datastore from an App Engine PHP application? I do have an existing method which downloads the file and then loads each row in a transaction. It takes a few hours for a few million rows, so this does not seem to be the best approach, and I have been searching for a more efficient method. I appreciate any guidance.
Editing this as I have switched to trying to load the JSON data into Datastore from a remote URL, from GAE. The code is not working, though I do not know why (yet):
<?php
require 'vendor/autoload.php';
use Google\Auth\ApplicationDefaultCredentials;
use Google\Cloud\Datastore\DatastoreClient;
/**
 * Create a new product with a given SKU.
 *
 * @param DatastoreClient $datastore
 * @param $sku
 * @param $product
 * @return Google\Cloud\Datastore\Entity
 */
function add_product(DatastoreClient $datastore, $sku, $product)
{
    $productKey = $datastore->key('SKU', $sku);
    $product = $datastore->entity(
        $productKey,
        [
            'created' => new DateTime(),
            'name' => strtolower($product)
        ]);
    $datastore->upsert($product);
    return $product;
}
/**
 * Load a Cloud Datastore Kind from a remote URL
 *
 * @param $projectId
 * @param $url
 */
function load_datastore($projectId, $url) {
    // Create Datastore client
    $datastore = new DatastoreClient(['projectId' => $projectId]);

    // Enable `allow_url_fopen` to allow reading a file from a URL
    ini_set("allow_url_fopen", 1);

    // Read the products listing and load it into Cloud Datastore.
    // Use batches of 20 per transaction.
    $json = json_decode(file_get_contents($url), true);

    $count = 1;
    foreach ($json as $sku_key => $product_val) {
        if ($count == 1) {
            $transaction = $datastore->transaction();
        }

        add_product($datastore, $sku_key, $product_val);

        if ($count == 20) {
            try {
                $transaction->commit();
            } catch (Exception $err) {
                echo 'Caught exception: ', $err->getMessage(), "\n";
                $transaction->rollback();
            }
            $count = 0;
        }
        $count++;
    }
}
try {
    $projectId = 'development';
    $url = 'https://raw.githubusercontent.com/BestBuyAPIs/open-data-set/master/products.json';
    load_datastore($projectId, $url);
} catch (Exception $err) {
    echo 'Caught exception: ', $err->getMessage(), "\n";
}
?>
Google provides pre-written Dataflow templates. You can use the GCS Text to Datastore Dataflow template to read in the CSV, convert each row into Datastore Entity JSON, and write the results to Datastore.
Let's assume you have a CSV of the following:
username, first, last, age, location.zip, location.city, location.state
samsmith, Sam, Smith, 33, 94040, Mountain View, California
johndoe, John, Doe, 50, 30075, Roswell, Georgia
dannyboy, Danny, Mac, 94040, Mountain View, California
You could have the following UDF to transform this CSV into a Datastore Entity of Kind People. This UDF assumes the following schema:
username = key & string property
first = string property
last = string property
age = integer property
location = record
location.zip = integer property
location.city = string property
location.state = string property
This UDF outputs a JSON-encoded Entity. This is the same JSON payload as used by the Cloud Datastore REST API, and the property values can be any of the value types that API supports.
function myTransform(csvString) {
  var row = csvString.split(",");
  // Skip rows that don't have all 7 expected columns (e.g. malformed lines).
  if (row.length != 7) { return; }
  return JSON.stringify({
    "key": {
      "partition_id": {
        // default namespace is an empty string
        "namespace_id": ""
      },
      "path": {
        "kind": "People",
        "name": row[0]
      }
    },
    "properties": {
      "username": { "stringValue": row[0] },
      "first": { "stringValue": row[1] },
      "last": { "stringValue": row[2] },
      "age": { "integerValue": row[3] },
      "location": {
        "entityValue": {
          "properties": {
            "zip": { "integerValue": row[4] },
            "city": { "stringValue": row[5] },
            "state": { "stringValue": row[6] }
          }
        }
      }
    }
  });
}
To run the Dataflow template, first save that UDF to a GCS bucket using gsutil:
gsutil cp my_csv_udf.js gs://mybucket/my_csv_udf.js
Now head into the Google Cloud Platform Console, go to the Dataflow page, click Create Job From Template, and select "GCS Text to Datastore". You can also refer to the documentation for this template.
Your job parameters would look as follows:
textReadPattern = gs://path/to/data/*.csv
javascriptTextTransformGcsPath = gs://mybucket/my_csv_udf.js
javascriptTextTransformFunctionName = myTransform
datastoreWriteProjectId = my-project-id
errorWritePath = gs://path/to/data/errors
Note: The UDF transform only supports JavaScript ECMAScript 5.1, so only basic JavaScript; no fancy arrow functions, promises, etc.
This question is similar to Import CSV into google cloud datastore and Google Cloud Datastore: Bulk Importing w Node.js.
The quick answer is that you can use Apache Beam or Cloud Dataflow to import CSV data into Cloud Datastore.
Sorry for not being more specific, but I'm a python standard env GAE user, rather unfamiliar with the PHP environment(s).
In general your current approach is serialized and synchronous: you're processing the rows one at a time (or, at best, in batches of 20, if all the upsert calls inside a transaction actually go to the datastore in a single batch), blocking on every datastore interaction and advancing to the next row only after that interaction completes.
I'm unsure if the PHP environment supports async datastore operations and/or true batch operations (the python ndb library can batch up to 500 writes into one datastore call) - those could help speed things up.
Another thing to consider, if your rows are entirely independent: do you actually need transactions for writing them? If PHP supports plain (non-transactional) writes you could use those instead, since transactions take longer to complete.
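For reference, here is roughly what non-transactional, batched writes look like in the Python standard environment with ndb (a sketch only - the PHP client would have its own equivalent, and the Product model below is hypothetical):

```python
from google.appengine.ext import ndb


class Product(ndb.Model):
    # Hypothetical model standing in for your SKU entities.
    name = ndb.StringProperty()


def write_rows_plain(rows):
    # Build the entities in memory, then write them in a single batched call;
    # no transaction is involved since the rows are independent.
    entities = [Product(id=sku, name=name.lower()) for sku, name in rows]
    ndb.put_multi(entities)  # up to 500 entities per datastore call
```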
Even without the above-mentioned support, you can still speed things up considerably by decoupling the row reading from the waiting on datastore ops:
in the current request handler, keep just the row reading and the creation of batches of 20 rows, which are somehow passed on for processing elsewhere (task queue, pub/sub, separate threads - whatever you can get in PHP)
in a separate request handler (or task queue or pub/sub handler, depending on how you choose to pass your batch data) you receive those batches and make the actual datastore calls. This way you can have multiple batches processed in parallel, and the amount of time each one is blocked waiting for the datastore reply becomes irrelevant from the overall processing time perspective.
With such an approach your performance would be limited only by the speed at which you can read the rows and enqueue those batches (see the sketch below for the general idea). If you want to be even faster, you could also split the single CSV file into multiple smaller ones, thus also having multiple row readers that could work in parallel, feeding those batch-processing workers.
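Building on the previous sketch, the decoupling could look like this in Python terms (deferred is the App Engine task queue helper; the PHP side would need its own mechanism, and all names here are illustrative):

```python
import csv

from google.appengine.ext import deferred


def process_batch(rows):
    # Task-queue worker: performs the actual datastore writes for one batch.
    write_rows_plain(rows)  # from the earlier sketch


def enqueue_batches(csv_file, batch_size=20):
    # Request-handler side: only reads rows and hands batches to the task queue,
    # so many batches can be written in parallel by separate worker requests.
    batch = []
    for row in csv.reader(csv_file):
        batch.append(row)
        if len(batch) == batch_size:
            deferred.defer(process_batch, batch)  # executed asynchronously
            batch = []
    if batch:
        deferred.defer(process_batch, batch)
```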
Side note: maybe you want to retry the failed/rolled-back transactions or save those entities for a later retry; currently it appears you're losing them.
Current application: an Angular application with Breeze. The application has ~7 entity managers for different data domains (metadata). When the application runs we try to fetch metadata for every entity manager, like:
app.run(['$rootScope', 'datacontext1', ..., function($rootScope, datacontext1, ...) {
    datacontext1.loadMetadata();
    ...
    datacontext7.loadMetadata();
}]);
Every datacontext has its own entity manager and loadMetadata is:
function loadMetadata() {
    manager.fetchMetadata().then(function(mdata) {
        if (mdata === 'already fetched') {
            return;
        }
        ...
        applyCustomMetadata(); // Do some custom job with metadata/entity types
    });
}
Metadata comes from the server asynchronously. A few modules have really big metadata, around 200 KB, which takes some time to load and apply to the entity manager. It is possible that the first Breeze data request executed on the same entity manager starts before this loadMetadata operation has finished, and as I understand it Breeze then automatically fetches the metadata again. Usually this is not a problem, since the metadata endpoint is cached on the server, but sometimes it produces very strange Breeze behavior: EntityManager.fetchMetadata resolves the promise as "already fetched", and in that case the applyCustomMetadata() operation cannot be executed.
As I understand it, the problem is inside Breeze and the approach it uses to resolve the metadata promise (it seems the HTTP adapter is a singleton, the second request overrides the metadata with the "already fetched" string, and the applyCustomMetadata() operation never executes).
I need to figure out some way to resolve the issue without significant changes to the application.
Logically, I need to keep the whole application from using the entity managers until loadMetadata is done. I'm looking for any way, at the Breeze level, to disable the automatic metadata fetch if one is already in progress (not to interrupt the request, just to wait and try again after some time). Any other ideas are fine as well.
Why are you allowing queries to execute before the metadata is loaded? Therein lies your problem.
I have an application bootstrapper that I expose through a global variable; none of my application activities that depend on the EntityManager are started until the preliminary processes complete:
var bootstrapper = {
    pageReady: ko.observable(false)
};

initBootstrapper();

return bootstrapper;

function initBootstrapper() {
    window.MyApp.entityManagerProvider.initialize() // load metadata, lookups, etc
        .then(function () {
            window.MyApp.router.initialize(); // setup page routes, home ViewModel, etc
            bootstrapper.pageReady(true); // show homepage
        });
};
Additionally, depending on the frequency of database changes occurring in your organization, you may wish to deliver the metadata to the client synchronously on page load. See this documentation for further details:
http://breeze.github.io/doc-js/metadata-load-from-script.html
I'm trying to get JSON formatted logs on a Compute Engine VM instance to appear in the Log Viewer of the Google Developer Console. According to this documentation it should be possible to do so:
Applications using App Engine Managed VMs should write custom log
files to the VM's log directory at /var/log/app_engine/custom_logs.
These files are automatically collected and made available in the Logs
Viewer.
Custom log files must have the suffix .log or .log.json. If the suffix
is .log.json, the logs must be in JSON format with one JSON object per
line. If the suffix is .log, log entries are treated as plain text.
This doesn't seem to be working for me: logs ending with .log are visible in the Log Viewer, but displayed as plain text. Logs ending with .log.json aren't visible at all.
It also contradicts another recent article which states that file names must end in .log and that their contents are treated as plain text.
As far as I can tell Google uses fluentd to index the log files into the Log Viewer. In the GitHub repository I cannot find any evidence that .log.json files are being indexed.
Does anyone know how to get this working? Or is the documentation out-of-date and has this feature been removed for some reason?
Here is one way to generate JSON logs for the Managed VMs logviewer:
The desired JSON format
The goal is to create a single line JSON object for each log line containing:
{
  "message": "Error occurred!.",
  "severity": "ERROR",
  "timestamp": {
    "seconds": 1437712034000,
    "nanos": 905
  }
}
(information sourced from Google: https://code.google.com/p/googleappengine/issues/detail?id=11678#c5)
Using python-json-logger
See: https://github.com/madzak/python-json-logger
import datetime
import logging
import time


def get_timestamp_dict(when=None):
    """Converts a datetime.datetime to integer milliseconds since the epoch.

    Requires special handling to preserve microseconds.

    Args:
        when:
            A datetime.datetime instance. If None, the timestamp for 'now'
            will be used.

    Returns:
        Integer time since the epoch in milliseconds. If the supplied 'when' is
        None, the return value will be None.
    """
    if when is None:
        when = datetime.datetime.utcnow()
    ms_since_epoch = float(time.mktime(when.utctimetuple()) * 1000.0)
    return {
        'seconds': int(ms_since_epoch),
        'nanos': int(when.microsecond / 1000.0),
    }


def setup_json_logger(suffix=''):
    try:
        from pythonjsonlogger import jsonlogger

        class GoogleJsonFormatter(jsonlogger.JsonFormatter):
            FORMAT_STRING = "{message}"

            def add_fields(self, log_record, record, message_dict):
                super(GoogleJsonFormatter, self).add_fields(log_record,
                                                            record,
                                                            message_dict)
                log_record['severity'] = record.levelname
                log_record['timestamp'] = get_timestamp_dict()
                log_record['message'] = self.FORMAT_STRING.format(
                    message=record.message,
                    filename=record.filename,
                )

        formatter = GoogleJsonFormatter()
        log_path = '/var/log/app_engine/custom_logs/worker' + suffix + '.log.json'
        # make_sure_path_exists() is assumed to create the parent directory if missing.
        make_sure_path_exists(log_path)
        file_handler = logging.FileHandler(log_path)
        file_handler.setFormatter(formatter)
        logging.getLogger().addHandler(file_handler)
    except OSError:
        logging.warn("Custom log path not found for production logging")
    except ImportError:
        logging.warn("JSON Formatting not available")
To use it, simply call setup_json_logger - you may also want to change the worker name used for your log.
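For example, a minimal usage sketch (assuming the code above is importable; the suffix and messages are arbitrary):

```python
import logging

setup_json_logger(suffix='-demo')           # writes to .../worker-demo.log.json
logging.getLogger().setLevel(logging.INFO)

logging.info("Service started")             # emitted as one JSON object per line
logging.error("Error occurred!")            # severity becomes ERROR in the Logs Viewer
```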
I am currently working on a NodeJS app running on a managed VM, and I am also trying to get my logs to show up in the Google Developer Console. I created my log files in the '/var/log/app_engine' directory as described in the documentation. Unfortunately this doesn't seem to be working for me, even for the '.log' files.
Could you describe where your logs are created? Also, is your managed VM configured as "Managed by Google" or "Managed by User"? Thanks!