Writing to output stream in chunks in Groovy - arrays

I'm getting Jenkins console logs and writing them into an output stream like this:
ByteArrayOutputStream stream = new ByteArrayOutputStream()
currentBuild.rawBuild.getLogText().writeLogTo(0, stream)
However, the downside of this approach is that writeLogTo() method is limited to 10000 lines:
https://github.com/jenkinsci/stapler/blob/master/core/src/main/java/org/kohsuke/stapler/framework/io/LargeText.java#L572
In this case, if Jenkins console log is more than a 10000 lines then the data from line 10000 and up is lost and not written into a buffer.
I'm trying to re-write the above approach in the most easiest way to account for cases when the log has more than 10000 lines.
I feel like my attempt is very complicated and error-prone. Is there an easier way to introduce a new logic?
Please note that the code below is not tested, this is just a draft of how I'm planning to implement it:
ByteArrayOutputStream stream = new ByteArrayOutputStream()
def log = currentBuild.rawBuild.getLogText()
def offset = 0
def maxNumOfLines = 10000
# get total number of lines in the log
# def totalLines = (still trying to figure out how to get it)
if (totalLines > maxNumOfLines) {
def numOfExecutions = round(totalLines / maxNumOfLines)
}
for (int i=0; i<numOfExecutions; i++) {
log.writeLogTo(offset, stream)
offset += maxNumOfLines
}

writeLogTo(long start, OutputStream out)
According to comments this method returns the offset to start the next write operation.
Seems code could be like this
def logFile = currentBuild.rawBuild.getLogText()
def start=0
while(logFile.length()>start)
start=logFile.writeLogTo(start, stream)
stream could be a FileOutputStream to avoid reading whole log into memory.
There is another method readAll()
So, the code could be simple as this to read whole log as text:
def logText=currentBuild.rawBuild.getLogText().readAll().getText()
Or if you want to transfer it to a local file:
new File('path/to/file.log').withWriter('UTF-8'){ w->
w << currentBuild.rawBuild.getLogText().readAll()
}

Related

Get files names and content , and then merge into another file with mapreduce

I have several files with datas in it.
For example: file01.csv with x lignes in it, file02.csv with y lines in it.
I would like to treat and merge them with mapreduce in order to get a file with the x lines beginning with file01 then line content, and y files beginning with file02 then line content.
I have two issues here:
I know how to get lines from a file with mapreduce by setting FileInputFormat.setInputPath(job, new Path(inputFile));
But I don't understand how I can get lines of each file of a folder.
Once I have those lines in my mapper, how can I access to the filename corresponding, so that I can create the data I want ?
Thank you for your consideration.
Ambre
You do not need map-reduce in your situation. That's because you want to preserve the order of lines in result file. In this case single thread processing will be faster.
Just run java client with code like this:
FileSystem fs = FileSystem.get();
OutputStream os = fs.create(outputPath); // stream for result file
PrintWriter pw = new PrintWriter(new OutputStreamWriter(os));
for (String inputFile : inputs) { // reading input files
InputStream is = fs.open(new Path(inputFile));
BufferedReader br = new BufferedReader(new InputStreamReader(is));
String line;
while ((line = br.readLine()) != null) {
pw.println(line);
}
br.close();
}
pw.close();

Groovy script to modify text file line by line

I have a input file from which a Groovy file reads input. Once a particular input is processed, Groovy script should be able to comment the input line it used and then move on.
File content:
1
2
3
When it processes line 1 and line 2, the input file will look as below:
'1
'2
3
By this way, if I re-run the Groovy, I would like to start from the line it stopped last time. If a input was used and it failed, that particular line shall not be commented (') so that a retry can be attempted.
Appreciate if you can help to draft a Groovy script.
Thanks
AFAIK in Groovy you can only append text at the end of the file.
Hence to add ' on each line when it is processed you need to rewrite the entire file.
You can use the follow approach but I only recommend you to use for a small files since you're loading all the lines in memory. In summary an approach for your question could be:
// open the file
def file = new File('/path/to/sample.txt')
// get all lines
def lines = file.readLines()
try{
// for each line
lines.eachWithIndex { line,index ->
// if line not starts with your comment "'"
if(!line.startsWith("'")){
// call your process and make your logic...
// but if it fails you've to throw an exception since
// you can not use 'break' within a closure
if(!yourProcess(line)) throw new Exception()
// line is processed so add the "'"
// to the current line
lines.set(index,"'${line}")
}
}
}catch(Exception e){
// you've to catch the exception in order
// to save the progress in the file
}
// join the lines and rewrite the file
file.text = lines.join(System.properties.'line.separator')
// define your process...
def yourProcess(line){
// I make a simple condition only to test...
return line.size() != 3
}
An optimal approach to avoid load all lines in memory for a large files is to use a reader to read the file contents, and a temporary file with a writer to write the result, and optimized version could be:
// open the file
def file = new File('/path/to/sample.txt')
// create the "processed" file
def resultFile = new File('/path/to/sampleProcessed.txt')
try{
// use a writer to write a result
resultFile.withWriter { writer ->
// read the file using a reader
file.withReader{ reader ->
while (line = reader.readLine()) {
// if line not starts with your comment "'"
if(!line.startsWith("'")){
// call your process and make your logic...
// but if it fails you've to throw an exception since
// you can not use 'break' within a closure
if(!yourProcess(line)) throw new Exception()
// line is processed so add the "'"
// to the current line, and writeit in the result file
writer << "'${line}" << System.properties.'line.separator'
}
}
}
}
}catch(Exception e){
// you've to catch the exception in order
// to save the progress in the file
}
// define your process...
def yourProcess(line){
// I make a simple condition only to test...
return line.size() != 3
}

Google cloud storage using stream instead of bytebuffer - java

I'm using the following code:
GcsService gcsService = GcsServiceFactory.createGcsService();
GcsFilename filename = new GcsFilename(BUCKETNAME, fileName);
GcsFileOptions options = new GcsFileOptions.Builder()
.mimeType(contentType)
.acl("public-read")
.addUserMetadata("myfield1", "my field value")
.build();
#SuppressWarnings("resource")
GcsOutputChannel outputChannel =
gcsService.createOrReplace(filename, options);
outputChannel.write(ByteBuffer.wrap(byteArray));
outputChannel.close();
The problem is that when I try to store video files, I have to store the file in the byteArray which could cause memory issues.
But I cannot find any interface to do the same with stream.
questions:
Should I worry about mem issues in the appengine srv, or are they capable of keeping a 1 min video in mem?
is it possible to use stream instead of byte array? how?
I'm reading the bytes as byte[] byteArray = IOUtils.toByteArray(stream); should I use the byte array as a real buffer and just read chunks and upload them to the GCS? how do I do that?
The amount of memory available depends on the appengine instance type you've configured. Streaming this data seems like a good idea if you can.
Not sure about the GcsService api, but looks like you can do this using the gcloud Storage api:
https://github.com/GoogleCloudPlatform/gcloud-java/blob/master/gcloud-java-storage/src/main/java/com/google/cloud/storage/Storage.java
This code might work (untested)...
final BlobInfo info = BlobInfo.builder(bucket.getBucketName(), "name").contentType("image/png").build();
final ReadableByteChannel src = Channels.newChannel(stream);
final WriteChannel dst = gcsStorage.writer(info);
fastChannelCopy(src, dst);
private void fastChannelCopy(final ReadableByteChannel src, final WritableByteChannel dest) throws IOException {
final ByteBuffer buffer = ByteBuffer.allocateDirect(16 * 1024);
while (src.read(buffer) != -1) {
buffer.flip(); // prepare the buffer to be drained
dest.write(buffer); // write to the channel, may block
// If partial transfer, shift remainder down
// If buffer is empty, same as doing clear()
buffer.compact();
}
// EOF will leave buffer in fill state
buffer.flip();
// make sure the buffer is fully drained.
while (buffer.hasRemaining()) {
dest.write(buffer);
}
}

Any Efficient way to parse large text files and store parsing information?

My purpose is to parse text files and store information in respective tables.
I have to parse around 100 folders having more that 8000 files and whole size approximately 20GB.
When I tried to store whole file contents in a string, memory out exception was thrown.
That is
using (StreamReader objStream = new StreamReader(filename))
{
string fileDetails = objStream.ReadToEnd();
}
Hence I tried one logic like
using (StreamReader objStream = new StreamReader(filename))
{
// Getting total number of lines in a file
int fileLineCount = File.ReadLines(filename).Count();
if (fileLineCount < 90000)
{
fileDetails = objStream.ReadToEnd();
fileDetails = fileDetails.Replace(Environment.NewLine, "\n");
string[] fileInfo = fileDetails.ToString().Split('\n');
//call respective method for parsing and insertion
}
else
{
while ((firstLine = objStream.ReadLine()) != null)
{
lineCount++;
fileDetails = (fileDetails != string.Empty) ? string.Concat(fileDetails, "\n", firstLine)
: string.Concat(firstLine);
if (lineCount == 90000)
{
fileDetails = fileDetails.Replace(Environment.NewLine, "\n");
string[] fileInfo = fileDetails.ToString().Split('\n');
lineCount = 0;
//call respective method for parsing and insertion
}
}
//when content is 90057, to parse 57
if (lineCount < 90000 )
{
string[] fileInfo = fileDetails.ToString().Split('\n');
lineCount = 0;
//call respective method for parsing and insertion
}
}
}
Here 90,000 is the bulk size which is safe to process without giving out of memory exception for my case.
Still the process is taking more than 2 days for completion. I observed this is because of reading line by line.
Is there any better approach to handle this ?
Thanks in Advance :)
You can use a profiler to detect what sucks your performance. In this case it's obvious: disk access and string concatenation.
Do not read a file more than once. Let's take a look at your code. First of all, the line int fileLineCount = File.ReadLines(filename).Count(); means you read the whole file and discard what you've read. That's bad. Throw away your if (fileLineCount < 90000) and keep only else.
It almost doesn't matter if you read line-by-line in consecutive order or the whole file because reading is buffered in any case.
Avoid string concatenation, especially for long strings.
fileDetails = fileDetails.Replace(Environment.NewLine, "\n");
string[] fileInfo = fileDetails.ToString().Split('\n');
It's really bad. You read the file line-by-line, why do you do this replacement/split? File.ReadLines() gives you a collection of all lines. Just pass it to your parsing routine.
If you'll do this properly I expect significant speedup. It can be optimized further by reading files in a separate thread while processing them in the main. But this is another story.

read cloud storage content with "gzip" encoding for "application/octet-stream" type content

We're using "Google Cloud Storage Client Library" for app engine, with simply "GcsFileOptions.Builder.contentEncoding("gzip")" at file creation time, we got the following problem when reading the file:
com.google.appengine.tools.cloudstorage.NonRetriableException: java.lang.RuntimeException: com.google.appengine.tools.cloudstorage.SimpleGcsInputChannelImpl$1#1c07d21: Unexpected cause of ExecutionException
at com.google.appengine.tools.cloudstorage.RetryHelper.doRetry(RetryHelper.java:87)
at com.google.appengine.tools.cloudstorage.RetryHelper.runWithRetries(RetryHelper.java:129)
at com.google.appengine.tools.cloudstorage.RetryHelper.runWithRetries(RetryHelper.java:123)
at com.google.appengine.tools.cloudstorage.SimpleGcsInputChannelImpl.read(SimpleGcsInputChannelImpl.java:81)
...
Caused by: java.lang.RuntimeException: com.google.appengine.tools.cloudstorage.SimpleGcsInputChannelImpl$1#1c07d21: Unexpected cause of ExecutionException
at com.google.appengine.tools.cloudstorage.SimpleGcsInputChannelImpl$1.call(SimpleGcsInputChannelImpl.java:101)
at com.google.appengine.tools.cloudstorage.SimpleGcsInputChannelImpl$1.call(SimpleGcsInputChannelImpl.java:81)
at com.google.appengine.tools.cloudstorage.RetryHelper.doRetry(RetryHelper.java:75)
... 56 more
Caused by: java.lang.IllegalStateException: com.google.appengine.tools.cloudstorage.oauth.OauthRawGcsService$2#1d8c25d: got 46483 > wanted 19823
at com.google.common.base.Preconditions.checkState(Preconditions.java:177)
at com.google.appengine.tools.cloudstorage.oauth.OauthRawGcsService$2.wrap(OauthRawGcsService.java:418)
at com.google.appengine.tools.cloudstorage.oauth.OauthRawGcsService$2.wrap(OauthRawGcsService.java:398)
at com.google.appengine.api.utils.FutureWrapper.wrapAndCache(FutureWrapper.java:53)
at com.google.appengine.api.utils.FutureWrapper.get(FutureWrapper.java:90)
at com.google.appengine.tools.cloudstorage.SimpleGcsInputChannelImpl$1.call(SimpleGcsInputChannelImpl.java:86)
... 58 more
What else should be added to read files with "gzip" compression to be able to read the content in app engine? ( curl cloud storage URL from client side works fine for both compressed and uncompressed file )
This is the code that works for uncompressed object:
byte[] blobContent = new byte[0];
try
{
GcsFileMetadata metaData = gcsService.getMetadata(fileName);
int fileSize = (int) metaData.getLength();
final int chunkSize = BlobstoreService.MAX_BLOB_FETCH_SIZE;
LOG.info("content encoding: " + metaData.getOptions().getContentEncoding()); // "gzip" here
LOG.info("input size " + fileSize); // the size is obviously the compressed size!
for (long offset = 0; offset < fileSize;)
{
if (offset != 0)
{
LOG.info("Handling extra size for " + filePath + " at " + offset);
}
final int size = Math.min(chunkSize, fileSize);
ByteBuffer result = ByteBuffer.allocate(size);
GcsInputChannel readChannel = gcsService.openReadChannel(fileName, offset);
try
{
readChannel.read(result); <<<< here the exception was thrown
}
finally
{
......
It is now compressed by:
GcsFilename filename = new GcsFilename(bucketName, filePath);
GcsFileOptions.Builder builder = new GcsFileOptions.Builder().mimeType(image_type);
builder = builder.contentEncoding("gzip");
GcsOutputChannel writeChannel = gcsService.createOrReplace(filename, builder.build());
ByteArrayOutputStream byteStream = new ByteArrayOutputStream(blob_content.length);
try
{
GZIPOutputStream zipStream = new GZIPOutputStream(byteStream);
try
{
zipStream.write(blob_content);
}
finally
{
zipStream.close();
}
}
finally
{
byteStream.close();
}
byte[] compressedData = byteStream.toByteArray();
writeChannel.write(ByteBuffer.wrap(compressedData));
the blob_content is compressed from 46483 bytes to 19823 bytes.
I think it is the google code's bug
https://code.google.com/p/appengine-gcs-client/source/browse/trunk/java/src/main/java/com/google/appengine/tools/cloudstorage/oauth/OauthRawGcsService.java, L418:
Preconditions.checkState(content.length <= want, "%s: got %s > wanted %s", this, content.length, want);
the HTTPResponse has decoded the blob, so the Precondition is wrong here.
If I good understand you have to set mineType:
GcsFileOptions options = new GcsFileOptions.Builder().mimeType("text/html")
Google Cloud Storage does not compress or decompress objects:
https://developers.google.com/storage/docs/reference-headers?csw=1#contentencoding
I hope that's what you want to do .
Looking at your code it seems like there is a mismatch between what is stored and what is read. The documentation specifies that compression is not done for you (https://developers.google.com/storage/docs/reference-headers?csw=1#contentencoding). You will need to do the actual compression manually.
Also if you look at the implementation of the class that throws the exception (https://code.google.com/p/appengine-gcs-client/source/browse/trunk/java/src/main/java/com/google/appengine/tools/cloudstorage/oauth/OauthRawGcsService.java?r=81&spec=svn134) you will notice that you get the original contents back but you're actually expecting compressed content. Check the method readObjectAsync in the above mentioned class.
It looks like the content persisted might not be gzipped or the content-length is not set properly. What you should do is verify length of the compressed stream just before writing it into the channel. You should also verify that the content length is set correctly when doing the http request. It would be useful to see the actual http request headers and make sure that content length header matches the actual content length in the http response.
Also it looks like contentEncoding could be set incorrectly. Try using:.contentEncoding("Content-Encoding: gzip") as used in this TCK test. Although still the best thing to do is inspect the HTTP request and response. You can use wireshark to do that easily.
Also you need to make sure that GCSOutputChannel is closed as that's when the file is finalized.
Hope this puts you on the right track. To gzip your contents you can use java GZIPInputStream.
I'm seeing the same issue, easily reproducable by uploading a file with "gsutil cp -Z", then trying to open it with the following
ByteArrayOutputStream output = new ByteArrayOutputStream();
try (GcsInputChannel readChannel = svc.openReadChannel(filename, 0)) {
try (InputStream input = Channels.newInputStream(readChannel))
{
IOUtils.copy(input, output);
}
}
This causes an exception like this:
java.lang.IllegalStateException:
....oauth.OauthRawGcsService$2#1883798: got 64303 > wanted 4096
at ....Preconditions.checkState(Preconditions.java:199)
at ....oauth.OauthRawGcsService$2.wrap(OauthRawGcsService.java:519)
at ....oauth.OauthRawGcsService$2.wrap(OauthRawGcsService.java:499)
The only work around I've found is to read the entire file into memory using readChannel.read:
int fileSize = 64303;
ByteBuffer result = ByteBuffer.allocate(fileSize);
try (GcsInputChannel readChannel = gcs.openReadChannel(new GcsFilename("mybucket", "mygzippedfile.xml"), 0)) {
readChannel.read(result);
}
Unfortunately, this only works if the size of the bytebuffer is greater or equal to the uncompressed size of the file, which is not possible to get via the api.
I've also posted my comment to an issue registered with google: https://code.google.com/p/googleappengine/issues/detail?id=10445
This is my function for reading compressed gzip files
public byte[] getUpdate(String fileName) throws IOException
{
GcsFilename fileNameObj = new GcsFilename(defaultBucketName, fileName);
try (GcsInputChannel readChannel = gcsService.openReadChannel(fileNameObj, 0))
{
maxSizeBuffer.clear();
readChannel.read(maxSizeBuffer);
}
byte[] result = maxSizeBuffer.array();
return result;
}
The core is that you cannot use the size of the saved file cause Google Storage will give it to you with the original size, so it checks the sizes you expected and the real size and these are differents:
Preconditions.checkState(content.length <= want, "%s: got %s > wanted
%s", this, content.length, want);
So i solved it allocating the biggest amount possible for these files using BlobstoreService.MAX_BLOB_FETCH_SIZE. Actually maxSizeBuffer is only allocated once outsize of the function
ByteBuffer maxSizeBuffer = ByteBuffer.allocate(BlobstoreService.MAX_BLOB_FETCH_SIZE);
And with maxSizeBuffer.clear(); all data is flushed again.

Resources