I am using Flink local mode with parallelism = 1.
In my Flink code, I have tried to print the incoming source using:
DataStream<String> ds = env.addSource(source);
ds.print();
In my local Flink_dir/log folder, I could see that an xxx.out file has been created, but nothing was printed to the file. Is there any config that I might have overlooked? I am sure that my source data contains text, as I have managed to add the data to the sink successfully. Thanks!
ds.print writes to stdout, not to a file. ${flink_dir}/log contains only the logs of your TaskManager and/or JobManager.
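If the goal is to have the records land in a file, one option is to write the stream to a file sink instead of printing. A minimal sketch (the output path is just a placeholder; with parallelism = 1 this produces a single file, otherwise a directory of part files):
// Write the stream to a text file instead of stdout; the path is a placeholder.
DataStream<String> ds = env.addSource(source);
ds.writeAsText("file:///tmp/flink-debug-out");
env.execute("debug-output-job");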
I want to set up CHECKDB using DatabaseIntegrityCheck.sql from Ola Hallengren. I have passed LogToTable = 'Y'. But will it also log to disk as text files? I did not find any parameter for that.
P.S. I know that jobs from MaintenanceSolution.sql do log to files on disk.
Script reference: DatabaseIntegrityCheck.sql
The procedure does not, by itself, log to disk. There isn't really any clean way to write to disk from inside T-SQL. Hence the use of an output file in the job step (which is what the create-job section of MaintenanceSolution.sql does).
As per the Hadoop source code, the following descriptions are pulled from the classes:
appendToFile
"Appends the contents of all the given local files to the
given dst file. The dst file will be created if it does not exist."
put
"Copy files from the local file system into fs. Copying fails if the file already exists, unless the -f flag is given.
Flags:
-p : Preserves access and modification times, ownership and the mode.
-f : Overwrites the destination if it already exists.
-l : Allow DataNode to lazily persist the file to disk. Forces
replication factor of 1. This flag will result in reduced
durability. Use with care.
-d : Skip creation of temporary file(<dst>._COPYING_)."
I am trying to regularly update a file in HDFS as it is being updated dynamically by a streaming source on my local file system.
Which one should I use, appendToFile or put, and why?
appendToFile modifies the existing file in HDFS, so only the new data needs to be streamed/written to the filesystem.
put rewrites the entire file, so the entire new version of the file needs to be streamed/written to the filesystem.
You should favor appendToFile if you are just appending to the file (i.e. adding logs to the end of a file). This function will be faster if that's your use case. If the file is changing more than just simple appends to the end, you should use put (slower but you won't lose data or corrupt your file).
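For illustration, here is a rough Java sketch of the two behaviours using the Hadoop FileSystem API; the NameNode address and all paths below are placeholders, not taken from your setup. The CLI equivalents are hdfs dfs -appendToFile and hdfs dfs -put -f.
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsAppendVsPut {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // hdfs://namenode:9000 and the paths below are placeholders
        FileSystem fs = FileSystem.get(new URI("hdfs://namenode:9000"), conf);

        // appendToFile-style: stream only the new bytes onto the end of the existing HDFS file
        try (FSDataOutputStream out = fs.append(new Path("/data/events.log"));
             InputStream in = new FileInputStream("/local/new-events.log")) {
            IOUtils.copyBytes(in, out, conf);
        }

        // put -f style: re-upload the whole local file, overwriting the destination
        fs.copyFromLocalFile(false /* delSrc */, true /* overwrite */,
                new Path("/local/events.log"), new Path("/data/events.log"));
    }
}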
I have done a couple of projects using the Foreach Loop construct. This is the first where I'm generating an output file inside a Data Flow in the loop.
I have a script that extracts the path and base name of the input file. I have looked at what is extracted (using FileInfo) and it is correct.
A few steps later, I am outputting to a flat file. I have both the input and output connection managers set to DelayValidation. The output keeps failing. Although I have the output set to a variable that includes the input file name (so it will continue to change), the output fails and shows the filename without the info from the variable.
The variable is scoped to the package, so it should be covered.
What am I missing?
Newbie to Flink.
I am able to run the example WordCount.jar on a file in a remote HDFS cluster without declaring the fs.hdfs.hadoopconf variable in the Flink configuration.
So I am wondering what exactly the purpose of the above-mentioned variable is.
Does declaring it change the way one runs the example jar?
Command:
flink-cluster.vm ~]$ /opt/flink/bin/flink run /opt/flink/examples/batch/WordCount.jar --input hdfs://hadoop-master:9000/tmp/test-events
Output:
.......
07/13/2016 00:50:13 Job execution switched to status FINISHED.
(foo,1)
.....
(bar,1)
(one,1)
Setup:
Remote HDFS cluster on hdfs://hadoop-master.vm:9000
Flink cluster running on flink-cluster.vm
Thanks
Update:
As pointed out by Serhiy, I declared fs.hdfs.hadoopconf in the configuration, but on running the job with the updated argument hdfs:///tmp/test-events.1468374669125 I got the following error:
flink-conf.yaml
# You can also directly specify the paths to hdfs-default.xml and hdfs-site.xml
# via keys 'fs.hdfs.hdfsdefault' and 'fs.hdfs.hdfssite'.
#
fs.hdfs.hadoopconf: hdfs://hadoop-master:9000/
fs.hdfs.hdfsdefault : hdfs://hadoop-master:9000/
Command:
flink-cluster.vm ~]$ /opt/flink/bin/flink run /opt/flink/examples/batch/WordCount.jar --input hdfs:///tmp/test-events
Output:
Caused by: org.apache.flink.runtime.JobException: Creating the input splits caused an error: The given HDFS file URI (hdfs:///tmp/test-events.1468374669125) did not describe the HDFS NameNode. The attempt to use a default HDFS configuration, as specified in the 'fs.hdfs.hdfsdefault' or 'fs.hdfs.hdfssite' config parameter failed due to the following problem: Either no default file system was registered, or the provided configuration contains no valid authority component (fs.default.name or fs.defaultFS) describing the (hdfs namenode) host and port.
at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.<init>(ExecutionJobVertex.java:172)
at org.apache.flink.runtime.executiongraph.ExecutionGraph.attachJobGraph(ExecutionGraph.java:679)
at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$submitJob(JobManager.scala:1026)
... 19 more
From documentation:
fs.hdfs.hadoopconf: The absolute path to the Hadoop File System’s
(HDFS) configuration directory (OPTIONAL VALUE). Specifying this value
allows programs to reference HDFS files using short URIs
(hdfs:///path/to/files, without including the address and port of the
NameNode in the file URI). Without this option, HDFS files can be
accessed, but require fully qualified URIs like
hdfs://address:port/path/to/files. This option also causes file
writers to pick up the HDFS’s default values for block sizes and
replication factors. Flink will look for the “core-site.xml” and
“hdfs-site.xml” files in the specified directory.
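In other words, fs.hdfs.hadoopconf should point to a local directory containing core-site.xml and hdfs-site.xml, not to an hdfs:// URI. A hedged example, assuming the Hadoop client configuration lives under /etc/hadoop/conf (adjust for your installation):
# flink-conf.yaml
# Local directory that contains core-site.xml and hdfs-site.xml
fs.hdfs.hadoopconf: /etc/hadoop/conf
With that in place, short URIs such as hdfs:///tmp/test-events are resolved against the fs.defaultFS declared in core-site.xml.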
I have a requirement to create a dynamic file based on the content of the Hadoop job.properties and then put it in the distributed cache.
When I create the file, I see that it has been created under the path "/tmp".
I create a symbolic name and refer to this file in the cache. Now, when I try to read the file from the distributed cache I am not able to access it. I get the error caused by: java.io.FileNotFoundException: Requested file /tmp/myfile6425152127496245866.txt does not exist.
Can you please let me know if I need to specify the path while creating the file and also use that path while accessing/reading the file?
I only need the file to be available only till the job is running.
I don't really get your meaning of
I only need the file to be available only till the job is running
But when I use the distributed cache, I use a path like this:
final String NAME_NODE = "hdfs://sandbox.hortonworks.com:8020";
job.addCacheFile(new URI(NAME_NODE + "/user/hue/users/users.dat"));
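If the file also needs to be read back inside a task, here is a rough sketch of the read side (assuming the new mapreduce API and the default symlink name, i.e. the last path component users.dat; the parsing logic is just a placeholder):
// Inside the Mapper (or Reducer): the cached file is symlinked into the task's
// working directory under its file name (or the #fragment name if one was given).
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    try (BufferedReader reader = new BufferedReader(new FileReader("users.dat"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            // parse the lookup data as needed
        }
    }
}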
Hope this helps.