Writing to a file with Groovy (Grails) fails for some lines (broken lines)

I am performing some mass writing to a .csv file using Groovy. More specifically, I have a Quartz job that runs and creates some Map messages which get sent to a RabbitMQ queue. The queue is consumed by 10 consumers, and this results in producing some lists of Strings. For each element in the list, I just write it to a pipe-separated .csv file. The service that has the method that writes to the .csv file is a standard (singleton) transactional Grails service. When I log the lines to be written, everything is fine, but in the file some lines are "broken". The way I am writing is:
def writeRowsToFile(List<String> rows, File file) {
    rows.each { row ->
        file.append("${row}\n")
    }
}
Initially I was using:
file.withWriterAppend { out ->
    out.write(row.toString())
    out.newLine()
}
and got the same result.
If something were wrong with the data itself, it would fail for all the lines. Could it be some kind of race condition, a concurrency issue, or something else?
Any help will be appreciated.
Thanks

You should be doing it the second way, i.e.:
def writeRowsToFile(List<String> rows, File file) {
    file.withWriterAppend { out ->
        rows.eachWithIndex { row, idx ->
            // It's probably \n chars in your strings
            if (row ==~ /.*[\n\r]+.*/) {
                println "Detected a CRLF char in rows[$idx]"
            }
            out.writeLine row
        }
    }
}
However, you say it might be "some kind of race condition".
Are multiple threads writing to the same file?
If not, it is more likely that your row data has \n characters in it.
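If several consumer threads can call the service for the same file at the same time, one way to rule out interleaved appends is to serialise the writes. Here is a minimal sketch in plain Java (callable from a Grails service; the class name and lock object are illustrative assumptions, not part of the original service):

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;

class RowFileWriter {
    // One lock shared by all callers, so appends from different threads cannot interleave.
    private static final Object FILE_LOCK = new Object();

    static void writeRowsToFile(List<String> rows, File file) throws IOException {
        synchronized (FILE_LOCK) {
            // Open once per batch in append mode and write whole lines.
            try (BufferedWriter out = new BufferedWriter(new FileWriter(file, true))) {
                for (String row : rows) {
                    out.write(row);
                    out.newLine();
                }
            }
        }
    }
}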

Related

Apache Camel: How to use "done" files to signal that writing records to a file is finished and it can be moved

As the title suggests, I want to move a file into a different folder after I am done writing DB records to it.
I have already looked into several questions related to this: Apache camel file with doneFileName
But my problem is a little different since I am using split, stream and parallelProcessing for getting the DB records and writing them to a file. I am not able to figure out when and how to create the done file when parallelProcessing is involved. Here is the code snippet:
My route to fetch records and write it to a file:
from(<ROUTE_FETCH_RECORDS_AND_WRITE>)
    .setHeader(Exchange.FILE_PATH, constant("<path to temp folder>"))
    .setHeader(Exchange.FILE_NAME, constant("<filename>.txt"))
    .setBody(constant("<sql to fetch records>&outputType=StreamList"))
    .to("jdbc:<endpoint>")
    .split(body(), <aggregation>).streaming().parallelProcessing()
        .<some processors>
        .aggregate(header(Exchange.FILE_NAME), (o, n) -> {
            <file aggregation>
            return o;
        }).completionInterval(<some time interval>)
        .toD("file://<to the temp file>")
        .end()
    .end()
    .to("file:" + <path to temp folder> + "?doneFileName=${file:header." + Exchange.FILE_NAME + "}.done"); // this line is just for trying out the done filename
In my aggregation strategy for the splitter I have code that basically counts records processed and prepares the response that would be sent back to the caller.
And in the outer aggregate I have code for aggregating the DB rows and, after that, writing them into the file.
And here is the file listener for moving the file:
from("file://<path to temp folder>?delete=true&include=<filename>.*.TXT&doneFileName=done")
.to(file://<final filename with path>?fileExist=Append);
Doing something like this is giving me this error:
Caused by: [org.apache.camel.component.file.GenericFileOperationFailedException - Cannot store file: <folder-path>/filename.TXT] org.apache.camel.component.file.GenericFileOperationFailedException: Cannot store file: <folder-path>/filename.TXT
at org.apache.camel.component.file.FileOperations.storeFile(FileOperations.java:292)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.component.file.GenericFileProducer.writeFile(GenericFileProducer.java:277)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.component.file.GenericFileProducer.processExchange(GenericFileProducer.java:165)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.component.file.GenericFileProducer.process(GenericFileProducer.java:79)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.util.AsyncProcessorConverterHelper$ProcessorToAsyncProcessorBridge.process(AsyncProcessorConverterHelper.java:61)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.processor.SendProcessor.process(SendProcessor.java:141)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.management.InstrumentationProcessor.process(InstrumentationProcessor.java:77)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.processor.RedeliveryErrorHandler.process(RedeliveryErrorHandler.java:460)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:190)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.processor.Pipeline.process(Pipeline.java:121)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.processor.Pipeline.process(Pipeline.java:83)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:190)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.component.seda.SedaConsumer.sendToConsumers(SedaConsumer.java:298)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.component.seda.SedaConsumer.doRun(SedaConsumer.java:207)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.component.seda.SedaConsumer.run(SedaConsumer.java:154)[209:org.apache.camel.camel-core:2.16.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)[:1.8.0_144]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)[:1.8.0_144]
at java.lang.Thread.run(Thread.java:748)[:1.8.0_144]
Caused by: org.apache.camel.InvalidPayloadException: No body available of type: java.io.InputStream but has value: Total number of records discovered: 5
What am I doing wrong? Any inputs will help.
PS: Newly introduced to Apache Camel
I would guess that the error comes from .toD("file://<to the temp file>") trying to write a file but finding the wrong type of body (the String "Total number of records discovered: 5" instead of an InputStream).
I don't understand why you have one file destination inside the splitter and one outside of it.
As @claus-ibsen suggested, try to remove the extra .aggregate(...) in your route. To split and re-aggregate, it is sufficient to reference the aggregation strategy in the splitter. Claus also pointed to an example in the Camel docs:
from(<ROUTE_FETCH_RECORDS_AND_WRITE>)
    .setHeader(Exchange.FILE_PATH, constant("<path to temp folder>"))
    .setHeader(Exchange.FILE_NAME, constant("<filename>.txt"))
    .setBody(constant("<sql to fetch records>&outputType=StreamList"))
    .to("jdbc:<endpoint>")
    .split(body(), <aggregationStrategy>)
        .streaming().parallelProcessing()
        // the processors below get individual parts
        .<some processors>
    .end()
    // The end statement above ends split-and-aggregate. From here
    // you get the re-aggregated result of the splitter.
    // So you can simply write it to a file and also write the done-file
    .to(...);
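The final .to(...) is left open in the answer; purely as an illustration of how the done-file could be written in that last step (the path is the placeholder from the question, and the file-language expression ${file:name}.done is just one possible naming choice):

    .to("file:<path to temp folder>?doneFileName=${file:name}.done");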
However, if you need to control the aggregation sizes, you have to combine the splitter and an aggregator. That would look something like this:
from(<ROUTE_FETCH_RECORDS_AND_WRITE>)
    .setHeader(Exchange.FILE_PATH, constant("<path to temp folder>"))
    .setHeader(Exchange.FILE_NAME, constant("<filename>.txt"))
    .setBody(constant("<sql to fetch records>&outputType=StreamList"))
    .to("jdbc:<endpoint>")
    // No aggregationStrategy here so it is a standard splitter
    .split(body())
        .streaming().parallelProcessing()
        // the processors below get individual parts
        .<some processors>
    .end()
    // The end statement above ends the split. From here
    // you still get individual records from the splitter.
    .to("seda:aggregate");

// new route to do the controlled aggregation
from("seda:aggregate")
    // constant(true) is the correlation predicate => collect all messages in 1 aggregation
    .aggregate(constant(true), new YourAggregationStrategy())
        .completionSize(500)
    // not sure if this 'end' is needed
    .end()
    // write files with 500 aggregated records here
    .to("...");

Good way to avoid java.lang.IndexOutOfBoundsException when joining a list

So I am trying to read a rather large XML file into a String. Currently joining a list of .readLines() like this:
def is = zipFile.getInputStream(entry)
def content = is.getText('UTF-8')
def xmlBodyList = content.readLines()
return xmlBodyList[1..xmlBodyList.size].join("")
However I am getting this output in console:
java.lang.IndexOutOfBoundsException: toIndex = 21859
I don't need any explanation on IndexOutOfBoundsExceptions, but I am having a hard time figuring out how to program around this issue.
How can I implement this differently, so it allows for a large enough file size?
About Good way to avoid java.lang.IndexOutOfBoundsException
The error is here:
return xmlBodyList[1..xmlBodyList.size].join("")
It is good practice to check variables before accessing them, and you can use a relative range accessor:
assert xmlBodyList.size>1 //check value
return xmlBodyList[1..-1].join("") //use relative indexes -1 = the last one
About large file processing
If you need to iterate through all the lines and execute some operation here is an example:
def stream = zipFile.getInputStream(entry)
stream.eachLine("UTF-8") { line, index ->
    if (index > 1) { // skip first line
        // do something here with each line from the file
        println "$line $index"
    }
}
There are a lot of additional Groovy methods on java.io.InputStream that could help you process a large file without loading it into memory:
http://docs.groovy-lang.org/latest/html/groovy-jdk/java/io/InputStream.html

Boost log: new log file every hour

I'm using Boost.Log and I want to implement a basic logging policy: a new error log at the beginning of each hour (if errors exist), named like "file_%Y%m%d%H.log".
I have 2 problems with this boost library:
1. How do I rotate the file at the beginning of each hour?
This isn't possible with the rotation_at_time_interval parameter, because it creates the new file relative to the first record written to the file, so the hour in the file name doesn't match that rule. Is it possible to have multiple rotation_at_time_point settings for one sink, or is there some other solution?
2. When the file exceeds some size, I want it to start a new file, and in that case it should append an index to the file name. With the rotation_size parameter and %N in the file name, N keeps incrementing the whole time the application is running. I want that N to be reset at the beginning of each hour, just as my file name changes. Does anybody have any idea how to do that with Boost.Log?
This is a basic principle of creating log files in industry. I really don't understand how this can't be done with a library dedicated to creating log files.
The library itself doesn't provide a way to rotate the file at the beginning of every hour, but I had the same problem, so I used a function wrapper which returns true at the beginning of every hour.
I find this way better for me, because I can control the efficiency of the code.
from boost.org:
bool is_it_time_to_rotate(); // wrapper that returns true at the beginning of every hour (implementation not shown)

void init_logging()
{
    boost::shared_ptr< sinks::text_file_backend > backend =
        boost::make_shared< sinks::text_file_backend >(
            keywords::file_name = "file_%5N.log",
            keywords::time_based_rotation = &is_it_time_to_rotate
        );
}
As for the second question, I don't really understand it well.

Saving a Binary tree to a file in C

I am currently writing a Customers program where a user can add/edit/search/list customers. I decided to use a binary tree as the backbone of this program. My idea was to save each item in the tree to "customers.dat" right before the program closes, then on start-up load everything from the file back into the tree. So far so good; however, after finally managing to save the binary search tree into a file, I have one bug.
Let's say I add 3 customers the first time. I then close the program, and when I reopen it I will find the same 3 customers in the tree. However, the next time I open the file, it gives me one of my predefined errors, which occurs when a node is not able to identify whether to go left or right, maybe because it's empty or not comparable. Here are some code snippets. I also tried file-open modes other than "ab+", and then I had no such error, but by the way I designed my program, I need the append mode or else only one record will be saved.
Customers are stored in Cstmr in the header:
typedef struct customer
{
    char Name[MAXNAME];
    char Surname[MAXNAME];
    char ID[MAXID];
    char Address[MAXADDRESS];
} Cstmr;
And the saving code:
void CustomerTreeToFile(Tree *pt)
{
    if (TreeIsEmpty(pt))
        puts("Nothing to save!");
    else
        Traverse(pt, saveItem); // Traverses each node and applies the
                                // function saveItem to each node
}

void saveItem(Cstmr C)
{
    save = C;
    customers = fopen("customers.dat", "ab+");
    fwrite(&C, sizeof(Cstmr), 1, customers);
    fclose(customers);
}
The problem is that you are always appending all the data to the same file ... so everything from the previous run is duplicated.
One solution would be to delete (unlink) the file before saving the data with your current approach.
However, as Freezerburn pointed out earlier, it's more economical not to open and close the file for each item. Just open the file once in overwrite mode (i.e. not append), then write all the data, then close the file. It should be much faster, too.
Another problem is that you save your data in binary format. Try to define an easy-to-read textual format; that would have made the problem obvious...

Multiple threads processing data from the same file

Can anyone in this forum give an example in C of how two threads can process data from one text file?
As an example, I have one text file that contains a paragraph. I have two threads that will process the data in the said file. One thread will count the number of lines in the paragraph. The second thread will count the numeric characters.
thanks
If you had asked in C++ I could give you a code example, but I haven't done ANSI C in a very long time, so I will give you the design and pseudocode.
Please keep in mind this is really bad pseudocode that is meant to give an example. I'm not questioning WHY you would want to do this. For all I know it could be an exercise with threads or because you "feel like it".
Example 1
int integerCount = 0;
int lineCount = 0;

numericThread()
{
    // By flagging the file as readonly you should
    // be able to open it as many times as you wish
    handle h = openfile("textfile.txt", readonly);
    while (!eof(h)) {
        String word = readWord(h);
        int outInteger;
        if (stringToInteger(word, outInteger)) {
            ++integerCount;
        }
    }
}

lineThread()
{
    // By flagging the file as readonly you should
    // be able to open it as many times as you wish
    handle h = openfile("textfile.txt", readonly);
    while (!eof(h)) {
        String word = readWord(h);
        if (word.equals("\n")) {
            ++lineCount;
        }
    }
}
If for some reason you aren't able to open the file twice in readonly mode, you will need to maintain a queue for each thread, having the main thread put words into each thread's queue. The threads will then pull from their queues.
Example 2
int integerCount = 0;
int lineCount = 0;

queue numericQueue;
queue lineQueue;

numericThread()
{
    while (!numericQueue.closed()) {
        String word = numericQueue.pop();
        int outInteger;
        if (stringToInteger(word, outInteger)) {
            ++integerCount;
        }
    }
}

lineThread()
{
    while (!lineQueue.closed()) {
        String word = lineQueue.pop();
        if (word.equals("\n")) {
            ++lineCount;
        }
    }
}

mainThread()
{
    handle h = openfile("textfile.txt", readonly);
    while (!eof(h)) {
        String word = readWord(h);
        numericQueue.push(word);
        lineQueue.push(word);
    }
    numericQueue.close();
    lineQueue.close();
}
There are lots of ways to do this. You can make different design decisions depending on how fast or simple or elegant or overengineered you want this to be. One way, as posted by Andrew Finnell is to have each thread open the file and read it completely independently. In theory this isn't great because you are doing expensive IO twice but in practice it's probably fine because the OS has likely cached the contents of whichever read executes first. Double IO is still more expensive than average because it involves a lot of needless system calls, but again in practice it will be irrelevant unless you have a very large file.
Another model of how to do this would be for each thread to have an input queue, or a shared global queue. The main thread reads the file and places each line in turn on the queue(s), and perhaps main doubles as one of your worker threads. This is more complicated because access to the queue(s) must be synchronized, or some lockless queue implementation must be used. In the case of a shared global queue, there is less duplication of data but now the lifecycle of that data is more complicated.
Just to point out how many ways such a simple thing can be done, you could go the overengineering route and make each thread generic. Instead of placing only data on the queue(s), you place both data (or pointers to data) and function pointers, and let each thread execute the callback. This kind of model might make sense if you plan on adding lots more kinds of things to compute but want to limit the number of threads you will use.
I don't think you will see much performance difference in using 2 threads over one. Either way, you don't want both threads to read the file. Read the file first, then pass a COPY of the stream to the methods you want and process both. The threads will not have access to the same stream of data at the same time so you'll need to use 2 copies of the textfile.
P.S. It's possible that, depending on the size of the file, you will actually lose performance using 2 threads.
