How can I increase performance of AgensGraph? - agens-graph

My project has entered the testing phase, but I am suffering from poor performance once the dataset grows beyond a few gigabytes.
Are there tunable parameters in AgensGraph?

First, you can find the data directory in the output of 'initdb'.
$ initdb
The files belonging to this database system will be owned by user "agens".
This user must also own the server process.
The database cluster will be initialized with locale "ko_KR.UTF-8".
The default database encoding has accordingly been set to "UTF8".
initdb: could not find suitable text search configuration for locale "ko_KR.UTF-8"
The default text search configuration will be set to "simple".
Data page checksums are disabled.
creating directory /Users/agens/Downloads/pgsql/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok
WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
Success. You can now start the database server using:
ag_ctl -D /Users/agens/Downloads/pgsql/data -l logfile start
Second, change the current directory to the AgensGraph data directory.
$ cd /Users/agens/Downloads/pgsql/data
Finally, you can find the buffer size parameter in the config file.
$ grep shared_buffer postgresql.conf
shared_buffers = 128MB # min 128kB
#wal_buffers = -1 # min 32kB, -1 sets based on shared_buffers
The "shared_buffers" parameter is a key factor for query performance.
A suggested buffer size is the smaller of the whole data size and half of the system memory.
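For example, on a machine with 16 GB of RAM and a dataset larger than that, this rule suggests roughly 8 GB. A minimal sketch of applying the change, assuming the data directory created above and that ag_ctl accepts the same restart verb as pg_ctl (the 8GB figure is illustrative, not part of the original answer):
$ vi /Users/agens/Downloads/pgsql/data/postgresql.conf    # set: shared_buffers = 8GB
$ ag_ctl -D /Users/agens/Downloads/pgsql/data restart
Note that shared_buffers only takes effect after a server restart, not a reload.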

Related

Need custom backup filenames for file copy using Ansible

I have a set/array of hosts that fall into the three categories below, i.e.
source_hosts (multiple servers)
ansible_host (single server)
destination_hosts (multiple servers)
Based on our architecture, the plan is to do the following steps.
1. Verify that the files exist on the source_hosts and that the source user has permission to copy them. Also verify that the "path to folder" on the destination exists and has permissions for the files to get copied. Checking that we are not "running out of space" on the destination should also be considered.
2. If the above verification is successful, the files should get copied from the source_hosts to the ansible server.
Note: I plan to use Ansible's fetch module for this: http://docs.ansible.com/ansible/fetch_module.html
3. From the ansible server the files should get copied over to the destination servers' respective locations.
Note: I plan to use Ansible's copy module for this: http://docs.ansible.com/ansible/copy_module.html
4. If the file already exists on the destination server, a backup must be created with an identifier, say "tkt432", along with the timestamp.
Note: Again, I am planning to use the copy module for backups, but I don't know how to append the identifier to the backed-up files. As far as I know, the module has no feature for appending a custom identifier to file names.
I have the following concerns.
What would be the ideal Ansible module to address Step 1?
How do I address the issue highlighted in Step 4?
Any other suggestions are welcome.
Q: "What would be the ideal Ansible module to address Step 1?"
A: The modules file and stat. For checking "running out of space", see Using ansible to manage disk space.
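For illustration only, the manual equivalents of those checks could look like the following, with hypothetical paths (the stat and file modules automate the first, and the linked disk-space approach the second):
$ stat -c '%U %a %s' /data/out/report.csv    # owner, mode and size of the source file
$ df -h /data/in                             # free space in the destination folder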
Q: "How do I address the issue highlighted in Step 4? If the file already exists on the destination server a backup must be created with an identifier say "tkt432" along with the timestamp."
A: Quoting from the parameters of the copy module:
backup - Create a backup file including the timestamp ...
Neither the extension nor the location of the backup files is configurable. See add optional backup_dir for the backup option #16305.
Q: "Any other suggestions are welcomed."
A: Take a look at the synchronize module.
Q: "1. Is there any module to check file/folder permissions (rights) for copy-paste operation with that user id?"
A: There are no copy-paste operations in Ansible.
Q: "Requesting more input on how we can append identifiers like "tkt432" to backup filenames while using the copy module's backup option, or any other good solution."
A: There is no more input. Ansible does not do that.
Q: "I feel I won't be able to use the copy module and will have to fallback to writing shell scripts for the above-mentioned issues."
A: Yes. The shell and command modules could help with this.
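Since the copy module cannot add a custom identifier, here is a minimal sketch of what the shell module could run on the destination host before the copy, assuming the identifier "tkt432" and a hypothetical file path:
$ f=/data/in/report.csv
$ [ -f "$f" ] && cp -p "$f" "${f}.tkt432.$(date +%Y%m%d%H%M%S)"
Here cp -p preserves ownership and timestamps of the backup copy, and the date suffix plays the role of the timestamp that the copy module's backup option would otherwise add.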

What is the difference between hadoop -appendToFile and hadoop -put when used for continuously updating streamed data into HDFS?

As per the Hadoop source code, the following descriptions are pulled from the classes:
appendToFile
"Appends the contents of all the given local files to the
given dst file. The dst file will be created if it does not exist."
put
"Copy files from the local file system into fs. Copying fails if the file already exists, unless the -f flag is given.
Flags:
-p : Preserves access and modification times, ownership and the mode.
-f : Overwrites the destination if it already exists.
-l : Allow DataNode to lazily persist the file to disk. Forces
replication factor of 1. This flag will result in reduced
durability. Use with care.
-d : Skip creation of temporary file(<dst>._COPYING_)."
I am trying to update a file in HDFS regularly, as it is being updated dynamically from a streaming source on my local file system.
Which one should I use out of appendToFile and put, and why?
appendToFile modifies the existing file in HDFS, so only the new data needs to be streamed/written to the filesystem.
put rewrites the entire file, so the entire new version of the file needs to be streamed/written to the filesystem.
You should favor appendToFile if you are just appending to the file (i.e. adding logs to the end of it). This will be faster if that is your use case. If the file is changing by more than simple appends to the end, you should use put (slower, but you won't lose data or corrupt your file).
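For comparison, both variants look like this on the command line, with illustrative local and HDFS paths:
$ hdfs dfs -appendToFile /local/new-records.log /data/stream.log   # appends the local file's contents to the end of the HDFS file
$ hdfs dfs -put -f /local/stream.log /data/stream.log              # re-uploads and overwrites the whole HDFS file
Note that -appendToFile appends whatever local file you pass it, so for streaming you would feed it only the new records, not the whole file again.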

Purpose of fs.hdfs.hadoopconf in flink-conf.yaml

Newbie to Flink.
I am able to run the example wordcount.jar on a file present in a remote HDFS cluster without declaring the fs.hdfs.hadoopconf variable in the Flink conf.
So I am wondering what exactly the purpose of the above-mentioned variable is.
Does declaring it change the way one runs the example jar?
Command :
flink-cluster.vm ~]$ /opt/flink/bin/flink run /opt/flink/examples/batch/WordCount.jar --input hdfs://hadoop-master:9000/tmp/test-events
Output:
.......
07/13/2016 00:50:13 Job execution switched to status FINISHED.
(foo,1)
.....
(bar,1)
(one,1)
Setup :
Remote HDFS cluster on hdfs://hadoop-master.vm:9000
Flink cluster on running on flink-cluster.vm
Thanks
Update:
As pointed out by Serhiy, I declared fs.hdfs.hadoopconf in the conf, but on running the job with the updated argument hdfs:///tmp/test-events.1468374669125 I got the following error:
flink-conf.yaml
# You can also directly specify the paths to hdfs-default.xml and hdfs-site.xml
# via keys 'fs.hdfs.hdfsdefault' and 'fs.hdfs.hdfssite'.
#
fs.hdfs.hadoopconf: hdfs://hadoop-master:9000/
fs.hdfs.hdfsdefault : hdfs://hadoop-master:9000/
Command :
flink-cluster.vm ~]$ /opt/flink/bin/flink run /opt/flink/examples/batch/WordCount.jar --input hdfs:///tmp/test-events
Output :
Caused by: org.apache.flink.runtime.JobException: Creating the input splits caused an error: The given HDFS file URI (hdfs:///tmp/test-events.1468374669125) did not describe the HDFS NameNode. The attempt to use a default HDFS configuration, as specified in the 'fs.hdfs.hdfsdefault' or 'fs.hdfs.hdfssite' config parameter failed due to the following problem: Either no default file system was registered, or the provided configuration contains no valid authority component (fs.default.name or fs.defaultFS) describing the (hdfs namenode) host and port.
at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.<init>(ExecutionJobVertex.java:172)
at org.apache.flink.runtime.executiongraph.ExecutionGraph.attachJobGraph(ExecutionGraph.java:679)
at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$submitJob(JobManager.scala:1026)
... 19 more
From the documentation:
fs.hdfs.hadoopconf: The absolute path to the Hadoop File System’s
(HDFS) configuration directory (OPTIONAL VALUE). Specifying this value
allows programs to reference HDFS files using short URIs
(hdfs:///path/to/files, without including the address and port of the
NameNode in the file URI). Without this option, HDFS files can be
accessed, but require fully qualified URIs like
hdfs://address:port/path/to/files. This option also causes file
writers to pick up the HDFS’s default values for block sizes and
replication factors. Flink will look for the “core-site.xml” and
“hdfs-site.xml” files in the specified directory.
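In other words, fs.hdfs.hadoopconf expects a local directory path on the Flink machines, not an hdfs:// URI as in the update above. A sketch of the intended setup, assuming the Hadoop client configuration lives in /etc/hadoop/conf (the paths are illustrative):
$ grep fs.hdfs.hadoopconf /opt/flink/conf/flink-conf.yaml
fs.hdfs.hadoopconf: /etc/hadoop/conf
$ grep -A1 fs.defaultFS /etc/hadoop/conf/core-site.xml
<name>fs.defaultFS</name>
<value>hdfs://hadoop-master:9000</value>
With that in place, short URIs such as hdfs:///tmp/test-events resolve against the NameNode taken from core-site.xml.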

how to limit the total size of log files managed by syslog?

How can I limit the total size of log files that are managed by syslog? The oldest archived log files should probably be removed when this size limit (quota) is exceeded.
Some of the log files are customized files specified by LOG_LOCALn, but I guess this doesn't matter regarding the quota issue.
Thanks!
The Linux utility logrotate renames and reuses system log files on a periodic basis so that they don't occupy excessive disk space. The system stores the relevant configuration in the file /etc/logrotate.conf.
There are a number of directives which help manage the log size. Please read the manual ("man logrotate") before doing anything. On my machine this file looks as follows:
# see "man logrotate" for details
# rotate log files weekly
weekly
# keep 4 weeks worth of backlogs
rotate 4
# create new (empty) log files after rotating old ones
create
# uncomment this if you want your log files compressed
#compress
# packages drop log rotation information into this directory
include /etc/logrotate.d
# no packages own wtmp, or btmp -- we'll rotate them here
/var/log/wtmp {
missingok
monthly
create 0664 root utmp
rotate 1
}
/var/log/btmp {
missingok
monthly
create 0660 root utmp
rotate 1
}
# system-specific logs may be configured here
As we can see, log files are rotated on a weekly basis; this may be changed to a daily basis. Compression is not enabled on my machine; it may be enabled if you want to make the log files smaller.
There is an excellent article which you may want to refer to for a complete understanding of this topic.
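To put an actual cap on size rather than only rotating by time, a size-based rule can be dropped into /etc/logrotate.d. A minimal sketch for a hypothetical application log (the path, size and count are illustrative; five rotations of at most 100 MB each bound the archived logs at roughly 500 MB, less once compression is applied):
$ cat /etc/logrotate.d/myapp
/var/log/myapp/*.log {
    size 100M
    rotate 5
    compress
    missingok
}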

Batch file to monitor a process's RAM, CPU%, network data, and threads

I need to generate a periodic report (say every 1 minute) that, while running, records the following in a txt file (or other format):
For a given process...
Timestamp : RAM : CPU% : Network data sent/received for last second : Total network data sent/received : threads
I believe in Process Explorer the network data sent/received for last second is called the Delta.
Could you recommend how I might capture this using either a plain batch file or, if required, another tool such as PowerShell or PsList? Or at least point me in the direction of a tool that will report all these things for a given process? Ideally it would also be able to report these for a process running on a remote machine. Many thanks, knowledge gurus!
logman create counter cpu_mem_trh -c "\Processor(_Total)\% Processor Time" "\Memory\Pool Paged Bytes" "\Process(*)\Thread Count" -f csv -o C:\PerfLogs\perflog.csv
logman update cpu_mem_trh -si 60 -v mmddhhmm
logman start cpu_mem_trh
To stop the performance counter collection, use:
logman stop cpu_mem_trh
Here are all available performance counters.
And here's the logman help.
For a remote machine, try the \\machinename prefix on each counter path, or the -s option. Time intervals are set with the -si option on the update verb. The path to the report is set with the -o option.
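To narrow the collection to a single process, here is a sketch along the same lines (the process name notepad, the 60-second interval, the output path and the remote host name REMOTEPC are placeholders; per-process counters expose I/O bytes rather than a true per-process network delta):
logman create counter proc_trh -s REMOTEPC -si 60 -f csv -o C:\PerfLogs\proc.csv -c "\Process(notepad)\% Processor Time" "\Process(notepad)\Working Set" "\Process(notepad)\Thread Count" "\Process(notepad)\IO Data Bytes/sec"
logman start proc_trh
and stop it when the run is finished:
logman stop proc_trh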
