Submitting Flink Job on YARN Issue - apache-flink

I want to submit my flink job on YARN with this command:
./bin/flink run -m yarn-cluster -p 4 -yjm 1024m -ytm 4096m ./task.jar
but I faced this error:
is running beyond virtual memory limits. Current usage: 390.3 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used. Killing container.

This is caused by the YARN setting yarn.nodemanager.vmem-pmem-ratio, which defaults to 2.1; with this command the requested ratio is 4096/1024 = 4.
You have three options:
1 - If you have access to the YARN configuration, you can set yarn.nodemanager.vmem-check-enabled to false in yarn-site.xml (see the snippet below).
2 - Alternatively, if you have access to the configuration, you can raise the ratio value from 2.1 to e.g. 5.
3 - If you cannot change the YARN configuration, you can adjust the -yjm and -ytm values so that they satisfy the ratio condition, for example -yjm 4096 -ytm 4096 (see the command below).
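For options 1 and 2, the yarn-site.xml change would look roughly like this (both property names are standard YARN settings; 5 is just the example ratio from above):
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>5</value>
</property>
For option 3, the submit command from the question would become something like:
./bin/flink run -m yarn-cluster -p 4 -yjm 4096m -ytm 4096m ./task.jar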

Related

First try on k3s with an external etcd DB

I tried to set up my first cluster, as I'm currently learning so that I can hopefully work in cloud engineering.
What I did:
I have 3 cloud servers (Ubuntu 20.04), all in one network.
I've successfully set up my etcd cluster (cluster-health shows all 3 network IPs of the servers: 1 leader, 2 followers).
Then I installed k3s on my first server:
curl -sfL https://get.k3s.io | sh -s - server \
  --datastore-endpoint="https://10.0.0.2:2380,https://10.0.0.4:2380,https://10.0.0.3:2380"
I did the same on the 2 other servers; the only difference is that I added the token value, which I checked beforehand with:
cat /var/lib/rancher/k3s/server/token
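So on the other two servers the command was roughly like this (the token is just a placeholder for the value from that file):
curl -sfL https://get.k3s.io | sh -s - server \
  --token=<token from the first server> \
  --datastore-endpoint="https://10.0.0.2:2380,https://10.0.0.4:2380,https://10.0.0.3:2380"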
Everything seems to have worked, but when I run kubectl get nodes, it only shows me one node...
Does anyone have any tips or answers for me?
k3s service file:
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
Wants=network-online.target
After=network-online.target
[Install]
WantedBy=multi-user.target
[Service]
Type=notify
EnvironmentFile=-/etc/default/%N
EnvironmentFile=-/etc/sysconfig/%N
EnvironmentFile=-/etc/systemd/system/k3s.service.env
KillMode=process
Delegate=yes
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
ExecStartPre=/bin/sh -xc '! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service'
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s \
    server \
        '--node-external-ip=78.46.241.153' \
        '--node-name=node-1' \
        '--flannel-iface=ens10' \
        '--advertise-address=10.0.0.2' \
        '--node-ip=10.0.0.2' \
        '--datastore-endpoint=https://10.0.0.2:2380,https://10.0.0.4:2380,https://10.0.0.3:2380'

Simultaneously running multiple jobs on the same node using Slurm

Most of our jobs are either (relatively) low on CPU and high on memory (data processing) or low on memory and high on CPU (simulations). The server we have is generally big enough (256 GB memory; 16 cores) to accommodate multiple jobs running at the same time, and we would like to use Slurm to schedule them. However, testing on a small (4-CPU) Amazon server, I am unable to get this working. As far as I know, I have to use SelectType=select/cons_res and SelectTypeParameters=CR_CPU_Memory. However, when I start multiple jobs that each use a single CPU, they are run sequentially rather than in parallel.
My slurm.conf
ControlMachine=ip-172-31-37-52
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
JobAcctGatherType=jobacct_gather/none
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
# COMPUTE NODES
NodeName=ip-172-31-37-52 CPUs=4 RealMemory=7860 CoresPerSocket=2 ThreadsPerCore=2 State=UNKNOWN
PartitionName=test Nodes=ip-172-31-37-52 Default=YES MaxTime=INFINITE State=UP
job.sh
#!/bin/bash
sleep 30
env
Output when running jobs:
ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
Submitted batch job 2
ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
Submitted batch job 3
ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
Submitted batch job 4
ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
Submitted batch job 5
ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
Submitted batch job 6
ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
Submitted batch job 7
ubuntu@ip-172-31-37-52:~$ squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
     3      test   job.sh   ubuntu PD       0:00      1 (Resources)
     4      test   job.sh   ubuntu PD       0:00      1 (Priority)
     5      test   job.sh   ubuntu PD       0:00      1 (Priority)
     6      test   job.sh   ubuntu PD       0:00      1 (Priority)
     7      test   job.sh   ubuntu PD       0:00      1 (Priority)
     2      test   job.sh   ubuntu  R       0:03      1 ip-172-31-37-52
The jobs are run sequentially, while in principle it should be possible to run 4 jobs in parallel.
You do not specify memory in your submission scripts, and you do not specify a default value for memory either (DefMemPerNode or DefMemPerCPU). In that case, Slurm allocates the full node memory to each job and is therefore not able to schedule multiple jobs on the node at the same time.
Try specifying the memory:
sbatch -n1 -N1 --mem-per-cpu=1G job.sh
You can check the resources consumed on a node with scontrol show node (look for the AllocTRES value).
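If you would rather not add --mem-per-cpu to every submission, a default can go in slurm.conf instead; a minimal sketch, where the value is just each CPU's share of the node's 7860 MB from the NodeName line above:
# in slurm.conf, before the node/partition definitions
DefMemPerCPU=1965
Restart slurmctld/slurmd after changing slurm.conf.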

Rebalancing rate when new node is added

When a new node is added, we see that it starts to receive new tablets (on the http://<yb-master-address>:7000/tablet-servers page) and that the system is rebalancing. But the default rate seems low. Are there any knobs to control this rate?
Rebalancing in YugaByte DB is rate-limited.
One of the parameters that governs this behavior is the yb-tserver gflag remote_bootstrap_rate_limit_bytes_per_sec, which defaults to 256 MB/sec and is the maximum transmission rate (inbound + outbound) related to rebalancing that any one server (yb-tserver) may use.
To inspect the current setting on a yb-tserver you can try this:
$ curl -s 10.150.0.20:9000/varz | grep remote_bootstrap_rate
--remote_bootstrap_rate_limit_bytes_per_sec=268435456
This particular parameter can also be changed on the fly, without needing a yb-tserver restart. For example, to set the rate to 512 MB/sec:
bin/yb-ts-cli --server_address=$TSERVER_IP:9100 set_flag --force remote_bootstrap_rate_limit_bytes_per_sec 536870912
A second aspect of this is the cluster-wide settings for how many tablet rebalances may happen simultaneously in the system. These are governed by a few yb-master gflags:
$ bin/yb-ts-cli --server_address=$MASTER_IP:7100 set_flag -force load_balancer_max_concurrent_adds 3
$ bin/yb-ts-cli --server_address=$MASTER_IP:7100 set_flag -force load_balancer_max_over_replicated_tablets 3
$ bin/yb-ts-cli --server_address=$MASTER_IP:7100 set_flag -force load_balancer_max_concurrent_tablet_remote_bootstraps 3
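To confirm the new values, the master's flags can be inspected the same way as the tserver's above (assuming the yb-master web UI is listening on its default port 7000):
$ curl -s $MASTER_IP:7000/varz | grep load_balancer_max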

Increasing the Solr 5 timeout from 30 seconds while starting Solr

Many times while starting Solr I see the message below, and then Solr is not reachable.
debraj@boutique3:~/solr5$ sudo bin/solr start -p 8789
Waiting to see Solr listening on port 8789 [-] Still not seeing Solr listening on 8789 after 30 seconds!
I have two cores in my local set-up. I am guessing this is happening because one of the cores is a little big, so Solr times out while loading it. If I take that core out of Solr, everything works fine.
Can someone let me know how I can increase this timeout from the default 30 seconds?
I am using Solr 5.2.1 on Debian 7.
Usually this is related to startup problems, but if you are running Solr on a slow machine, 30 seconds may not be enough for it to start.
In that case you may try this (I'm using Solr 5.5.0):
Windows (tested, working): in the bin/solr.cmd file, look for the parameter
-maxWaitSecs 30
a few lines below "REM now wait to see Solr come online ..." and replace 30 with a number that meets your needs (e.g. 300 seconds = 5 minutes).
Others (not tested): in the bin/solr file, search for the following code:
if [ $loops -lt 6 ]; then
sleep 5
loops=$[$loops+1]
else
echo -e "Still not seeing Solr listening on $SOLR_PORT after 30 seconds!"
tail -30 "$SOLR_LOGS_DIR/solr.log"
exit # subshell!
fi
Increase the number of waiting-loop cycles from 6 to whatever meets your needs (e.g. 60 cycles * 5 sleep seconds = 300 seconds = 5 minutes). You should also change the number of seconds in the message below it, just to stay consistent; the edited loop would look like the sketch below.
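For example, waiting up to 5 minutes (60 and 300 are just the example values from above):
if [ $loops -lt 60 ]; then
  sleep 5
  loops=$[$loops+1]
else
  echo -e "Still not seeing Solr listening on $SOLR_PORT after 300 seconds!"
  tail -30 "$SOLR_LOGS_DIR/solr.log"
  exit # subshell!
fi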

How to set the SplitMasterWorker value to false in Giraph

I am trying to execute custom Giraph code from the Eclipse IDE, and when I run it I get: Exception in thread "main" java.lang.IllegalArgumentException: checkLocalJobRunnerConfiguration: When using LocalJobRunner, must have only one worker since only 1 task at a time!
So I want to set giraph.SplitMasterWorker=false. How and where do I set it?
Pass -ca giraph.SplitMasterWorker=false to your application as an argument.
If you are running Giraph on a single-node cluster, then passing "-ca giraph.SplitMasterWorker=false" will help. However, if you try to run Giraph on a multi-node cluster such as AWS EC2 based on Hadoop 2.x.x, then I definitely recommend modifying the mapred-site.xml file and adding a parameter such as mapred.job.tracker to it; see the sketch below.
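For reference, the mapred-site.xml entry would be roughly of this shape (host and port here are only placeholders for your own cluster's job tracker address):
<property>
  <name>mapred.job.tracker</name>
  <value>your-jobtracker-host:54311</value>
</property>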
giraph.SplitMasterWorker=false is the variable you have to set when calling the Giraph runner; it can be passed in as a custom argument with -ca. Also, I think you are using the -w parameter; if you are running on your local machine, it should not be more than 1, since there are no slave nodes to act as workers.
E.g. hadoop jar /usr/local/giraph1.0/giraph-examples/target/giraph-examples-1.1.0-for-hadoop-2.7.0-jar-with-dependencies.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.ConnectedComponentsComputation -vif org.apache.giraph.io.formats.IntIntNullTextInputFormat -vip -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op -w 5 -ca giraph.SplitMasterWorker=false
