Purpose of fs.hdfs.hadoopconf in flink-conf.yaml - apache-flink

Newbie to Flink.
I am able to run the example wordcount.jar on a file present in a remote HDFS cluster without declaring the fs.hdfs.hadoopconf variable in the Flink conf.
So I am wondering what exactly the purpose of the above-mentioned variable is.
Does declaring it change the way one runs the example jar?
Command :
flink-cluster.vm ~]$ /opt/flink/bin/flink run /opt/flink/examples/batch/WordCount.jar --input hdfs://hadoop-master:9000/tmp/test-events
Output:
.......
07/13/2016 00:50:13 Job execution switched to status FINISHED.
(foo,1)
.....
(bar,1)
(one,1)
Setup :
Remote HDFS cluster on hdfs://hadoop-master.vm:9000
Flink cluster running on flink-cluster.vm
Thanks
Update :
As pointed out by Serhiy, I declared fs.hdfs.hadoopconf in the conf, but on running the job with the updated argument hdfs:///tmp/test-events.1468374669125 I got the following error:
flink-conf.yaml
# You can also directly specify the paths to hdfs-default.xml and hdfs-site.xml
# via keys 'fs.hdfs.hdfsdefault' and 'fs.hdfs.hdfssite'.
#
fs.hdfs.hadoopconf: hdfs://hadoop-master:9000/
fs.hdfs.hdfsdefault : hdfs://hadoop-master:9000/
Command :
flink-cluster.vm ~]$ /opt/flink/bin/flink run /opt/flink/examples/batch/WordCount.jar --input hdfs:///tmp/test-events
Output :
Caused by: org.apache.flink.runtime.JobException: Creating the input splits caused an error: The given HDFS file URI (hdfs:///tmp/test-events.1468374669125) did not describe the HDFS NameNode. The attempt to use a default HDFS configuration, as specified in the 'fs.hdfs.hdfsdefault' or 'fs.hdfs.hdfssite' config parameter failed due to the following problem: Either no default file system was registered, or the provided configuration contains no valid authority component (fs.default.name or fs.defaultFS) describing the (hdfs namenode) host and port.
at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.<init>(ExecutionJobVertex.java:172)
at org.apache.flink.runtime.executiongraph.ExecutionGraph.attachJobGraph(ExecutionGraph.java:679)
at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$submitJob(JobManager.scala:1026)
... 19 more

From the documentation:
fs.hdfs.hadoopconf: The absolute path to the Hadoop File System’s (HDFS) configuration directory (OPTIONAL VALUE). Specifying this value allows programs to reference HDFS files using short URIs (hdfs:///path/to/files, without including the address and port of the NameNode in the file URI). Without this option, HDFS files can be accessed, but require fully qualified URIs like hdfs://address:port/path/to/files. This option also causes file writers to pick up the HDFS’s default values for block sizes and replication factors. Flink will look for the “core-site.xml” and “hdfs-site.xml” files in the specified directory.
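
Note that fs.hdfs.hadoopconf expects a local directory path on the Flink machines, not an HDFS URI, which is why the values shown in the flink-conf.yaml above still trigger the NameNode error. A minimal sketch of a working setup, assuming the Hadoop client configuration has been copied to /etc/hadoop/conf on flink-cluster.vm (that directory path is an assumption, adjust it to your layout):

flink-conf.yaml:
fs.hdfs.hadoopconf: /etc/hadoop/conf

/etc/hadoop/conf/core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-master:9000</value>
  </property>
</configuration>

With fs.defaultFS providing the NameNode authority, a short URI like hdfs:///tmp/test-events resolves against hadoop-master:9000, while fully qualified URIs such as hdfs://hadoop-master:9000/tmp/test-events keep working without any of this configuration.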

Related

Need custom backup filenames for file copy using Ansible

I have a set/array of hosts that fall into the below three categories, i.e.
source_hosts (multiple servers)
ansible_host (single server)
destination_hosts (multiple servers)
Based on our architecture, the plan is to do the following steps.
Verify that the files exist on source_hosts and have copy permissions for the source user. Also, verify that the "path to folder" on the destination exists and has permissions for the files to get copied. Checking that we are not "running out of space" on the destination should also be considered.
If the above verification is successful, the files should get copied from source_host to ansible_server.
Note: I plan to use ansible's fetch module for this http://docs.ansible.com/ansible/fetch_module.html
From the ansible server the files should get copied over to the destination server's respective locations.
Note: I plan to use ansible's copy module for this
http://docs.ansible.com/ansible/copy_module.html
If the file already exists on the destination server, a backup must be created with an identifier, say "tkt432", along with the timestamp.
Note: Again, I am planning to use the copy module for backups, but I don't know how to append the identifier to the backed-up files. As far as I know, the module does not have any feature for appending a custom identifier to file names.
I have the following concerns.
What would be the ideal Ansible module to address Step 1?
How do I address the issue highlighted in Step 4?
Any other suggestions are welcomed.
Q: "What would be the ideal ansible module to address Step 1 ?"
A: Modules file and stat. Checking "Running out of space" see Using ansible to manage disk space.
Q: "How do I address the issue highlighted in Step 4 ? If the file already exists on the destination server a backup must be created with an identifier say "tkt432" along with the timestamp."
A: Quoting from the parameters of copy module
backup - Create a backup file including the timestamp ...
Neither the extension nor the place of the backup files is optional. See add optional backup_dir for the backup option #16305.
Q: "Any other suggestions are welcomed."
A: Take a look at module synchronize.
Q: "1. Is there any module to check file/folder permissions (rights) for copy-paste operation with that user id?"
A: There are no copy-paste operations in Ansible.
Q: "Requesting more inputs on how we can append identifiers like "tkt432" to backup filenames while using "copy" modules backup option or any other good solution."
A: There is no more input. Ansible does not do that.
Q: "I feel I won't be able to use the copy module and will have to fallback to writing shell scripts for the above-mentioned issues."
A: Yes. Modules shell and command could help with this.
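
For step 4, a minimal sketch of that shell/command fallback, assuming a hypothetical destination file /data/app/config.yml and the "tkt432" identifier (facts must be gathered so that ansible_date_time is available):

- name: Back up the existing file with the ticket id and a timestamp (hypothetical paths)
  command: cp /data/app/config.yml /data/app/config.yml.tkt432.{{ ansible_date_time.epoch }}
  args:
    removes: /data/app/config.yml   # run only if the destination file already exists

A subsequent copy task can then overwrite /data/app/config.yml without the built-in backup option, since the custom-named backup has already been taken.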

What is the difference between hadoop -appendToFile and hadoop -put when used for continuously updating stream data into HDFS?

As per the Hadoop source code, the following descriptions are pulled from the classes:
appendToFile
"Appends the contents of all the given local files to the
given dst file. The dst file will be created if it does not exist."
put
"Copy files from the local file system into fs. Copying fails if the file already exists, unless the -f flag is given.
Flags:
-p : Preserves access and modification times, ownership and the mode.
-f : Overwrites the destination if it already exists.
-l : Allow DataNode to lazily persist the file to disk. Forces
replication factor of 1. This flag will result in reduced
durability. Use with care.
-d : Skip creation of temporary file(<dst>._COPYING_)."
I am trying to update a file in HDFS regularly as it is being updated dynamically from a streaming source in my local file system.
Which of appendToFile and put should I use, and why?
appendToFile modifies the existing file in HDFS, so only the new data needs to be streamed/written to the filesystem.
put rewrites the entire file, so the entire new version of the file needs to be streamed/written to the filesystem.
You should favor appendToFile if you are just appending to the file (i.e. adding logs to the end of a file). This function will be faster if that's your use case. If the file is changing more than just simple appends to the end, you should use put (slower but you won't lose data or corrupt your file).
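
For example, with a local file of newly arrived records (new-events.log and /tmp/stream-data.log are hypothetical names), the two approaches look like this:

hdfs dfs -appendToFile new-events.log /tmp/stream-data.log   # ships only the new records to the existing HDFS file
hdfs dfs -put -f stream-data.log /tmp/stream-data.log        # re-uploads the whole file, replacing the HDFS copy

For a continuously growing log, the first command moves far less data per update; the second is the safer choice when earlier parts of the file may also have changed.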

Flink produces out file in log folder but does not print anything

I am using Flink local mode with parallelism = 1.
In my Flink code, I have tried to print the incoming source using:
DataStream<String> ds = env.addSource(source);
ds.print();
In my local Flink_dir/log folder, I could see that an xxx.out file has been created, but nothing was printed into the file. Is there any config that I might have overlooked? I am sure that my source data contains text, as I have managed to add the data to the sink successfully. Thanks!
ds.print() writes to stdout, not to a file. ${flink_dir}/log contains only the logs of your task manager and/or job manager.
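
If the records should end up in a file rather than on stdout, a minimal sketch (continuing the snippet from the question; the output path and job name are assumptions) is to replace print() with a file sink:

// env and source are the ones from the question above
DataStream<String> ds = env.addSource(source);
// Write each record to a text file instead of stdout; the path is a placeholder
ds.writeAsText("file:///tmp/flink-output", FileSystem.WriteMode.OVERWRITE);
env.execute("write-source-to-file");

writeAsText needs org.apache.flink.core.fs.FileSystem imported for the WriteMode constant.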

Why are "sc.addFile" and "spark-submit --files" not distributing a local file to all workers?

I have a CSV file "test.csv" that I'm trying to have copied to all nodes on the cluster.
I have a 4-node Apache Spark 1.5.2 standalone cluster. There are 4 workers, where one node also acts as master/driver as well as a worker.
If I run:
$SPARK_HOME/bin/pyspark --files=./test.csv OR from within the REPL interface execute sc.addFile('file://' + '/local/path/to/test.csv')
I see spark log the following:
16/05/05 15:26:08 INFO Utils: Copying /local/path/to/test.csv to /tmp/spark-5dd7fc83-a3ef-4965-95ba-1b62955fb35b/userFiles-a4cb1723-e118-4f0b-9f26-04be39e5e28d/test.csv
16/05/05 15:26:08 INFO SparkContext: Added file file:/local/path/to/test.csv at http://192.168.1.4:39578/files/test.csv with timestamp 1462461968158
In a separate window on the master/driver node, I can easily locate the file using ls, i.e. (ls -al /tmp/spark-5dd7fc83-a3ef-4965-95ba-1b62955fb35b/userFiles-a4cb1723-e118-4f0b-9f26-04be39e5e28d/test.csv).
However, if I log into the workers, there is no file at /tmp/spark-5dd7fc83-a3ef-4965-95ba-1b62955fb35b/userFiles-a4cb1723-e118-4f0b-9f26-04be39e5e28d/test.csv and not even a folder at /tmp/spark-5dd7fc83-a3ef-4965-95ba-1b62955fb35b.
But the Apache Spark web interface shows a job running and cores allocated on all nodes, and no other warnings or errors appear in the console.
As Daniel commented, each worker manages files differently. If you want to access the added file, you can use SparkFiles.get(file). If you want to see which directory your files are going to, you can print the output of SparkFiles.getDirectory (now SparkFiles.getRootDirectory).
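
A small pyspark sketch of that, reusing test.csv from the question: each executor materialises the file under its own SparkFiles root directory when it runs a task, which is why the exact /tmp/spark-.../userFiles-... path seen on the driver does not exist on the other workers.

from pyspark import SparkFiles

sc.addFile('file:///local/path/to/test.csv')   # same call as in the question
print(SparkFiles.getRootDirectory())           # staging directory for this process (the driver here)

# Resolve the path inside the task, on whichever executor runs it
first_lines = (sc.parallelize(range(4), 4)
                 .map(lambda _: open(SparkFiles.get('test.csv')).readline())
                 .collect())
print(first_lines)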

hadoop write file and put in Distributed cache

I have a requirement to create a dynamic file based on the content in the Hadoop job.properties and then put it in the Distributed Cache.
When I create the file, I see that it has been created under the "/tmp" path.
I create a symbolic name and refer to this file in the cache. Now, when I try to read the file in the distributed cache, I am not able to access it. I get the error: Caused by: java.io.FileNotFoundException: Requested file /tmp/myfile6425152127496245866.txt does not exist.
Can you please let me know if I need to specify the path while creating the file and also use that path while accessing/reading it?
I only need the file to be available only till the job is running.
I don't really get your meaning of
I only need the file to be available only till the job is running
But when I use the distributed cache in practice, I use a path like this:
final String NAME_NODE = "hdfs://sandbox.hortonworks.com:8020";
job.addCacheFile(new URI(NAME_NODE + "/user/hue/users/users.dat"));
Hope this will help you.
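
For completeness, a minimal sketch (hypothetical mapper and key/value types) of reading that cached users.dat back inside a task; on YARN, each file registered via job.addCacheFile(...) is symlinked into the task's working directory under its base file name:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: loads the cached side file once per task in setup()
public class UsersLookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void setup(Context context) throws IOException {
        // "users.dat" is the base name of the URI added with job.addCacheFile(...)
        try (BufferedReader reader = new BufferedReader(new FileReader("users.dat"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // build an in-memory lookup from the side data
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(value, new Text(""));   // placeholder pass-through
    }
}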
