Spring Batch FlatFileItemWriter does not write data to a file - file

I am new to Spring Batch application. I am trying to use FlatFileItemWriter to write the data into a file. Challenge is application is creating the file on a given path, but, now writing the actual content into it.
Following are details related to code:
List<String> dataFileList : This list contains the data that I want to write to a file
FlatFileItemWriter<String> writer = new FlatFileItemWriter<>();
writer.setResource(new FileSystemResource("C:\\Desktop\\test"));
writer.open(new ExecutionContext());
writer.setLineAggregator(new PassThroughLineAggregator<>());
writer.setAppendAllowed(true);
writer.write(dataFileList);
writer.close();
This is just generating the file at proper place but contents are not getting written into the file.
Am I missing something? Help is highly appreciated.
Thanks!

This is not a proper way to use Spring Batch Writer and writer data. You need to declare bean of Writer first.
Define Job Bean
Define Step Bean
Use your Writer bean in Step
Have a look at following examples:
https://github.com/pkainulainen/spring-batch-examples/blob/master/spring-boot/src/main/java/net/petrikainulainen/springbatch/csv/in/CsvFileToDatabaseJobConfig.java
https://spring.io/guides/gs/batch-processing/

You probably need to force a sync to disk. From the docs at https://docs.spring.io/spring-batch/trunk/apidocs/org/springframework/batch/item/file/FlatFileItemWriter.html,
setForceSync
public void setForceSync(boolean forceSync)
Flag to indicate that changes should be force-synced to disk on flush. Defaults to false, which means that even with a local disk changes could be lost if the OS crashes in between a write and a cache flush. Setting to true may result in slower performance for usage patterns involving many frequent writes.
Parameters:
forceSync - the flag value to set

Related

How to use a PipedInputStream in a Producer Endpoint in Apache Camel

I am using Camel to consume a huge (GB) file, process/modify some data in the file and finally forward the modified file via an AWS-S3, FTP or SFTP producer endpoint to its target. In the actual usage scenario using an intermediate(temporary) file holding the processed data is not allowed.
In case of the AWS producer, the configure method of the corresponding RouteBuilder specifies the route as follows:
from("file:/...")
.streamCaching()
.process(new CustomFileProcessor())
.to("aws-s3://...");
In it's process(Exchange exchange) method the CustomFileProcessor reads the input data from exchange.getIn().getBody(InputStream.class) and writes the processed and modified data into a PipedOutputStream.
Now the PipedInputStream connected with this PipedOutputStream should be used as the source for the producer sending the data to AWS-S3.
I tried exchange.getOut().setBody(thePipedInputStream) in the process method but this doesn't work and seems to create a deadlock.
So what is the correct way - if it is possible at all - of piping the processed output data of the CustomFileProcessor to the producer endpoint so that the entire data is send over?
Many thanks in advance.
After further digging into this the solution was quite simple. I only needed to place the pipe reader into a separate thread and the problem was solved.

Bro: Disable ALL log generation

I created a bro script, with the objective of extract all files for all posible protocols from a pcap file. But I dont want to write all logs. Bro create a log file for each protocol. Example: 'http.log', 'smtp.log', etc. Even a 'weird.log' is generated. My pcap files are large (20gb), so, each log file contains over 30mb of information. This log generation reduce the performance of the file extraction.
I can disable the 'conn.log' with the line Log::disable_stream(Conn::LOG) but, what about all protocol logging??
This is my script
#load base/files/extract
event bro_init()
{
Log::disable_stream(Conn::LOG);
}
event file_sniff(f: fa_file, meta: fa_metadata)
{
local ext = "";
if ( meta?$mime_type )
ext = split_string(meta$mime_type, /\//)[1];
local fname = fmt("%s-%s.%s", f$source, f$id, ext);
Files::add_analyzer(f, Files::ANALYZER_EXTRACT, [$extract_filename=fname]);
}
You can use the none writer like this:
bro -r packets.pcap Log::default_writer=Log::WRITER_NONE
I'm not totally convinced that writing these logs is harming your performance in any real way though. Typically, writing the files to disk is what causes the biggest overhead.
Here's a way to turn off whatever logging's been turned on (prior to bro_init), without having to know which stream IDs are relevant:
event bro_init()
{
# We don't want any output other than from this script.
for (id in Log::active_streams)
Log::disable_stream(id);
}
This construct makes me twitch a little about modifying a table while iterating over it, but it seems to work and I can't actually find any way to peek at one key from a table without doing an iteration. I suppose one could write
event bro_init()
{
while (|Log::active_streams|) {
for (id in Log::active_streams) {
Log::disable_stream(id);
break;
}
}
}
but that's hideous and I'm not going to use it unless I discover that I have to.
I achieved this with this line of code in main.bro:
Log::remove_filter(Conn::LOG, "default");

Hadoop Map Whole File in Java

I am trying to use Hadoop in java with multiple input files. At the moment I have two files, a big one to process and a smaller one that serves as a sort of index.
My problem is that I need to maintain the whole index file unsplitted while the big file is distributed to each mapper. Is there any way provided by the Hadoop API to make such thing?
In case if have not expressed myself correctly, here is a link to a picture that represents what I am trying to achieve: picture
Update:
Following the instructions provided by Santiago, I am now able to insert a file (or the URI, at least) from Amazon's S3 into the distributed cache like this:
job.addCacheFile(new Path("s3://myBucket/input/index.txt").toUri());
However, when the mapper tries to read it a 'file not found' exception occurs, which seems odd to me. I have checked the S3 location and everything seems to be fine. I have used other S3 locations to introduce the input and output file.
Error (note the single slash after the s3:)
FileNotFoundException: s3:/myBucket/input/index.txt (No such file or directory)
The following is the code I use to read the file from the distributed cache:
URI[] cacheFile = output.getCacheFiles();
BufferedReader br = new BufferedReader(new FileReader(cacheFile[0].toString()));
while ((line = br.readLine()) != null) {
//Do stuff
}
I am using Amazon's EMR, S3 and the version 2.4.0 of Hadoop.
As mentioned above, add your index file to the Distributed Cache and then access the same in your mapper. Behind the scenes. Hadoop framework will ensure that the index file will be sent to all the task trackers before any task is executed and will be available for your processing. In this case, data is transferred only once and will be available for all the tasks related your job.
However, instead of add the index file to the Distributed Cache in your mapper code, make your driver code to implement ToolRunner interface and override the run method. This provides the flexibility of passing the index file to Distributed Cache through the command prompt while submitting the job
If you are using ToolRunner, you can add files to the Distributed Cache directly from the command line when you run the job. No need to copy the file to HDFS first. Use the -files option to add files
hadoop jar yourjarname.jar YourDriverClassName -files cachefile1, cachefile2, cachefile3, ...
You can access the files in your Mapper or Reducer code as below:
File f1 = new File("cachefile1");
File f2 = new File("cachefile2");
File f3 = new File("cachefile3");
You could push the index file to the distributed cache, and it will be copied to the nodes before the mapper is executed.
See this SO thread.
Here's what helped me to solve the problem.
Since I am using Amazon's EMR with S3, I have needed to change the syntax a bit, as stated on the following site.
It was necessary to add the name the system was going to use to read the file from the cache, as follows:
job.addCacheFile(new URI("s3://myBucket/input/index.txt" + "#index.txt"));
This way, the program understands that the file introduced into the cache is named just index.txt. I also have needed to change the syntax to read the file from the cache. Instead of reading the entire path stored on the distributed cache, only the filename has to be used, as follows:
URI[] cacheFile = output.getCacheFiles();
BufferedReader br = new BufferedReader(new FileReader(#the filename#));
while ((line = br.readLine()) != null) {
//Do stuff
}

Neo4j store is not cleanly shut down; Recovering from inconsistent db state from interrupted batch insertion

I was importing ttl ontologies to dbpedia following the blog post http://michaelbloggs.blogspot.de/2013/05/importing-ttl-turtle-ontologies-in-neo4j.html. The post uses BatchInserters to speed up the task. It mentions
Batch insertion is not transactional. If something goes wrong and you don't shutDown() your database properly, the database becomes inconsistent.
I had to interrupt one of the batch insertion tasks as it was taking time much longer than expected which left my database in an inconsistence state. I get the following message:
db_name store is not cleanly shut down
How can I recover my database from this state? Also, for future purposes is there a way for committing after importing every file so that reverting back to the last state would be trivial. I thought of git, but I am not sure if it would help for a binary file like index.db.
There are some cases where you cannot recover from unclean shutdowns when using the batch inserter api, please note that its package name org.neo4j.unsafe.batchinsert contains the word unsafe for a reason. The intention for batch inserter is to operate as fast as possible.
If you want to guarantee a clean shutdown you should use a try finally:
BatchInserter batch = BatchInserters.inserter(<dir>);
try {
} finally {
batch.shutdown();
}
Another alternative for special cases is registering a JVM shutdown hook. See the following snippet as an example:
BatchInserter batch = BatchInserters.inserter(<dir>);
// do some operations potentially throwing exceptions
Runtime.getRuntime().addShutdownHook(new Thread() {
public void run() {
batch.shutdown();
}
});

How can Apache Camel be used to monitor file changes?

I would like to monitor all of the files in a given directory for changes, ie an updated timestamp. This use case seems natural for Camel using the file component, but I can't seem to find a way to configure this behavior.
A uri like:
file:/some/directory
will consume the files in the provided directory but will delete them.
A uri like:
file:/some/directory?noop=true
consumes each file once when it is added or when the route is started.
It's surprising that there isn't an option along the lines of
consumeOnChange=true
Is there a straightforward way to monitor file changes and not delete the file after consuming?
You can do this by setting up the idempotentKey to tell Camel how a file is considered changed. For example if the file size changes, or its timestamp changes etc.
See more details at the Camel file documentation at: https://camel.apache.org/components/latest/file-component.html
See the section Avoiding reading the same file more than once (idempotent consumer). And read about idempotent and idempotentKey.
So something alike
from("file:/somedir?noop=true&idempotentKey=${file:name}-${file:size}")
Or
from("file:/somedir?noop=true&idempotentKey=${file:name}-${file:modified}")
You can read here about the various ${file:xxx} tokens you can use: http://camel.apache.org/file-language.html
Setting noop to true will result in Camel setting idempotent=true as well, despite the fact that idempotent is false by default.
Simplest solution to monitor files would be:
.from("file:path?noop=true&idempotent=false&delay=60s")
This will monitor changes to all files in the given directory every one minute.
This can be found in the Camel documentation at: http://camel.apache.org/file2.html.
I don't think Camel supports that specific feature but with the existent options you can come up with a similar solution of monitoring a directory.
What you need to do is set a small delay value to check the directory and maintain a repository of the already read files. Depending on how you configure the repository (by size, by filename, by a mix of them...) this solution would be able to provide you information about news files and modified files. As a caveat it would be consuming the files in the directory very often.
Maybe you could use other solutions different from Camel like Apache Commons VFS2 (I wrote a explanation about how to use it for this scenario: WatchService locks some files?
I faced the same problem i.e. wanted to copy updated files also (along with new files). Below is my configuration,
public static void main(String[] a) throws Exception {
CamelContext cc = new DefaultCamelContext();
cc.addRoutes(createRouteBuilder());
cc.start();
Thread.sleep(10 * 60 * 1000);
cc.stop();
}
protected static RouteBuilder createRouteBuilder() {
return new RouteBuilder() {
public void configure() {
from("file://D:/Production"
+ "?idempotent=true"
+ "&idempotentKey=${file:name}-${file:size}"
+ "&include=.*.log"
+ "&noop=true"
+ "&readLock=changed")
.to("file://D:/LogRepository");
}
};
}
My testing steps:
Run the program and it copies few .log files from D:/Production to D:/LogRepository and then continues to poll D:/Production directory
I opened a already copied log say A.log from D:/Production (since noop=true nothing is moved) and edited it with some editor tool. This doubled the file size and save it.
At this point I think Camel is supposed to copy that particular file again since its size is modified and in my route definition I used "idempotent=true&idempotentKey=${file:name}-${file:size}&readLock=changed". But camel ignores the file.
When I use TRACE for logging it says "Skipping as file is already in progress...", but I did not find any lock file in D:/Production directory when I editted and saved the file.
I also checked that camel still ignores the file if I replace A.log (with same name but bigger size) in D:/Production directory from outside.
But I found, everything is working as expected if I remove noop=true option.
Am I missing something?
If you want monitor file changes in camel, use file-watch component.
Example -> RECURSIVE WATCH ALL EVENTS (FILE CREATION, FILE DELETION, FILE MODIFICATION):
from("file-watch://some-directory")
.log("File event: ${header.CamelFileEventType} occurred on file ${header.CamelFileName} at ${header.CamelFileLastModified}");
You can see the complete documentation here:
Camel file-watch component

Resources