Will there be an impact on the graph database if multiple SUBMIT JOB STATS are carried out simultaneously? - graph-databases

NebulaGraph Version: 3.1.0
Deployment method: distributed / single machine
Does submitting multiple SUBMIT JOB STATS tasks at the same time affect the graph database?
We need to calculate the number of vertices and edges in each graph space.
But if I have multiple graph spaces and run dozens of SUBMIT JOB STATS tasks at the same time, what is NebulaGraph’s mechanism for executing job tasks?
Will this affect the NebulaGraph cluster, for example by all resources being consumed by the job tasks so that other operations stall?
Will NebulaGraph execute all job tasks sequentially by job ID?

Within the same graph space, jobs are executed serially. If SUBMIT JOB STATS is run in multiple graph spaces, those jobs are executed independently and in parallel.
Whether running multiple STATS jobs affects cluster performance depends on the cluster architecture and the capacity of each machine. If the cluster stores a very large volume of data and the machines are relatively underpowered, running multiple STATS jobs at the same time is not recommended.
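As a minimal sketch of how this looks from the console (space_a is a placeholder space name), you can submit the job in each space and poll its status before reading the counts:

# run the statistics job per graph space, then check on it
USE space_a;
SUBMIT JOB STATS;
SHOW JOBS;          # jobs and their status are listed per space
SHOW JOB <job_id>;  # inspect the progress of a specific job
SHOW STATS;         # vertex/edge counts, once the job has finished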

Related

Fault Tolerance in Flink

How can we configure a Flink application to start/restart only the pods/(sub)tasks that crashed, instead of restarting the whole job, i.e. restarting all the tasks/sub-tasks in the job/pipeline including the tasks that are healthy? It does not make sense and feels unnecessary to try to restart the healthy tasks along with the crashed ones. The stream processing application processes messages from Kafka and writes the output back to Kafka, runs on Flink 1.13.5 with a Kubernetes resource manager, and uses Lyft's Kubernetes operator to schedule and run the Flink job. We tried setting the property **jobmanager.execution.failover-strategy** to **region**, but it did not help.
Flink only supports partial restarts to the extent that this is possible without sacrificing completely correct, exactly-once results.
After recovery, failed tasks are restarted from the latest checkpoint. Their inputs are rewound, and they will reproduce previously emitted results. If healthy downstream consumers of those failed tasks aren't also reset and restarted from that same checkpoint, then they will end up producing duplicate/inflated results.
With streaming jobs, only embarrassingly parallel pipelines will have disjoint pipelined regions. Any use of keyBy or rebalancing (e.g., to change the parallelism) will produce a job with a single failure region.
Restart Pipelined Region Failover Strategy.
This strategy groups tasks into disjoint regions. When a task failure is detected, this strategy computes the smallest set of regions that must be restarted to recover from the failure. For some jobs this can result in fewer tasks that will be restarted compared to the Restart All Failover Strategy.
Refer to https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#restart-pipelined-region-failover-strategy
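For reference, the region strategy is enabled together with a restart strategy in flink-conf.yaml, roughly like this (the retry count and delay below are only illustrative values):

# flink-conf.yaml (values are illustrative)
jobmanager.execution.failover-strategy: region
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s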
But another failover strategy is in progress; see https://cwiki.apache.org/confluence/display/FLINK/FLIP-135+Approximate+Task-Local+Recovery
Approximate task-local recovery is useful in scenarios where a certain amount of data loss is tolerable, but a full pipeline restart is not affordable.

Use of Flink/Kubernetes to replace ETL jobs (on SSIS): one Flink cluster per job type, or create and destroy a Flink cluster per job execution?

I am trying to assess the feasibility of replacing hundreds of feed-file ETL jobs built as SSIS packages with Apache Flink jobs (with Kubernetes as the underlying infrastructure). One recommendation I saw in an article is "to use one Flink cluster for one type of job".
Since I have only a handful of jobs per day of each job type, this would mean the best approach for me is to create a Flink cluster on the fly when executing a job and destroy it afterwards to free up resources. Is that the correct way to do it? I am setting up the Flink cluster without a job manager.
Any suggestions on best practices for using Flink for batch ETL activities?
Maybe the most important question: is Flink the right solution for this problem, or should I look more at Talend and other classic ETL tools?
Flink is well suited for running ETL workloads. The two deployment modes give you the following properties:
Session cluster
A session cluster allows you to run several jobs on the same set of resources (TaskExecutors). You start the session cluster before submitting any jobs.
Benefits:
No additional cluster deployment time needed when submitting jobs => Faster job submissions
Better resource utilization if individual jobs don't need many resources
One place to control all your jobs
Downsides:
No strict isolation between jobs
Failures caused by job A can cause job B to restart
Job A runs in the same JVM as job B and hence can influence it if static variables are used
Per-job cluster
A per-job cluster starts a dedicated Flink cluster for every job.
Benefits:
Strict job isolation
More predictable resource consumption since only a single job runs on the TaskExecutors
Downsides:
Cluster deployment time is part of the job submission time, resulting in longer submission times
Not a single cluster which controls all your jobs
Recommendation
So if you have many short-lived ETL jobs which require a fast response, I would suggest using a session cluster, because you avoid the cluster start-up time for every job. If the ETL jobs have long runtimes, this additional time matters little, and I would choose the per-job mode, which gives you more predictable runtime behaviour because of strict job isolation.
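As a rough sketch of what the two modes look like on the command line (the jar names and cluster ID are placeholders, and the exact commands depend on your Flink version and deployment target, e.g. standalone vs. native Kubernetes vs. an operator):

# Session mode: start the cluster once, then submit many jobs to it
./bin/start-cluster.sh
./bin/flink run etl-job-a.jar
./bin/flink run etl-job-b.jar

# Application (per-job style) mode on native Kubernetes: a dedicated cluster
# per job that is torn down when the job finishes
./bin/flink run-application -t kubernetes-application \
    -Dkubernetes.cluster-id=etl-job-a \
    local:///opt/flink/usrlib/etl-job-a.jar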

How can I share state between my Flink jobs?

I run multiple jobs from my .jar file and want to share state between the jobs, but every job consumes all of the input (from Kafka) and generates duplicate output.
In the Flink dashboard, 'records sent' is 3 for all of my jobs; I think that number should be split across the jobs.
I create each job with this command:
bin/flink run app.jar
How can I fix it?
Because of its focus on scalability and high performance, Flink state is local. Flink doesn't really provide a mechanism for sharing state between jobs.
However, Flink does support splitting up a large job among a fleet of workers. A Flink cluster is able to run a single job in parallel, using the resources of one or many multi-core CPUs. Some Flink jobs are running on thousands of cores, just to give an idea of its scalability.
When used with Kafka, each Kafka partition can be read by a different subtask in Flink, and processed by its own parallel instance of the pipeline.
You might begin by running a single parallel instance of your job via
bin/flink run --parallelism <parallelism> app.jar
For this to succeed, your cluster will have to have at least as many free slots as the parallelism you request. The parallelism should be less than or equal to the number of partitions in the Kafka topic(s) being consumed. The Flink Kafka consumers will coordinate amongst themselves -- with each of them reading from one or more partitions.
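Before raising the parallelism, it can help to check how many task slots the cluster has free and how many partitions the topic has; for example (the host, topic, and broker below are placeholders):

# free task slots, via the Flink REST API on the JobManager
curl http://<jobmanager-host>:8081/overview

# number of partitions in the Kafka topic being consumed
bin/kafka-topics.sh --describe --topic <topic> --bootstrap-server <broker:9092>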

Is there a fast database that gives approximate results when allowed and accurate results when asked for?

I create job arrays of thousands of simulations that get executed on a network-connected cluster of servers, which all have local disk as well as being connected to NFS disk drives.
Is there a database that can be distributed amongst the servers that operates in the following way:
When I submit my job array, each individual job, running on an individual server, sends its results to the distributed DB.
Whilst the job array is still running the user can request partial summaries from the DB - the DB having the option to not wait for all the latest results from all its distributed nodes, but to "improvise" in some way
The user can request a full summary after the job array is finished which causes the DB to ensure that it returns an accurate summary of all data from all its nodes and further, that all nodes are not still receiving data from jobs (quiescent for a stated time).
In other words, I want a DB that is fast, and accurate when I tell it to be, while receiving lots of data from thousands of jobs in an LSF job array. I need to monitor the progress of the LSF job array's results but am willing to forego some accuracy when monitoring in order to increase speed, yet I need an accurate result when all is done.
The data stored for each job is: a small job ID, a small PASS/FAIL flag, and a large record of how the job fails. The "how" is likely to be spot-checked for only a very few jobs until all jobs in the job array have finished, when triage scripts will need fast access to all the DB data for the job array.

SSAS Processing Threads

We have an SSIS package that executes XMLA against SSAS to delete a database and then rebuild it. We then create a Windows Task Scheduler job to kick off the SSIS package for each of our clients at a given time interval. We have anywhere from 50-200 jobs that we would like to run on a nightly basis.
We are having a problem that when too many jobs run at the same time (anywhere from 3+ jobs), SSAS seems to freeze and these jobs never complete. When I monitor SSAS (via Perfmon) I only see a maximum of two busy processing threads at any one time. It would seem that with multiple jobs running at the same time, say 10 jobs, I should see a minimum of 10 processing threads running, but again we don't. Our server has 8 cores and 32 GB of memory, and CPU utilization is also fairly low.
Does anyone have any suggestions for using more threads in SSAS or explanation for the behavior I am seeing?
