Find the content size of docx, pptx, etc. files

I want to find the size of the content inside a docx, pptx, etc. file. Is there a package that can be used for this? I googled and found that POI is widely used to read and write MS file types, but I can't find the right API for getting the size of the file content. I want the actual content size, not the compressed file size shown in the file properties.
I finally found a way, but it throws an OutOfMemoryError if the file is too large.
OPCPackage opcPackage = OPCPackage.open(file.getAbsolutePath());
XWPFDocument doc = new XWPFDocument(opcPackage);
XWPFWordExtractor we = new XWPFWordExtractor(doc);
String paragraphs = we.getText();
System.out.println("Total Paragraphs: "+paragraphs.length() / 1024);
Please let me know if there is a better way to do this.

OK, this was asked a long time ago and there has been no response to it. I have not used OPCPackage, so my answer is not based on it.
DOCX files (and PPTX and XLSX files, for that matter) are all zip files with a particular structure.
We can therefore use the java.util.zip package to enumerate the entries of the zip file and sum the sizes of the entries under xl for an xlsx file or under word for a docx file. A more generic method is probably to ignore the following top-level zip entries, i.e. entries whose names start with:
docProps
_rels
[Content_Types].xml
The total size of the remaining zip entries (do not ignore any folder nested inside them) gives you the size of the content.
This method is also very efficient: you only read the entry metadata of the zip file, not the entry data itself, so obtaining the size information takes negligible time and memory. As a quick test, I was able to get the size of a 4 MB docx file in a fraction of a second.
A "good enough" but not thoroughly tested piece of code using this approach is pasted below. Feel free to use it as a starting point and fix any bugs you find. It would be great if you could post back your modifications or corrections so that others can benefit.
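import java.io.File;
import java.io.IOException;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

private static final void printUnzippedContentLength() throws IOException
{
    // try-with-resources closes the zip file when done
    try (ZipFile zf = new ZipFile(new File("/home/chaitra/verybigfile.docx")))
    {
        Enumeration<? extends ZipEntry> entries = zf.entries();
        long sumBytes = 0L;
        while (entries.hasMoreElements())
        {
            ZipEntry ze = entries.nextElement();
            String name = ze.getName();
            // skip packaging metadata at the top level of the archive
            if (name.startsWith("docProps") || name.startsWith("_rels") || name.startsWith("[Content_Types].xml"))
            {
                continue;
            }
            // getSize() returns -1 when the uncompressed size is not known
            sumBytes += Math.max(0L, ze.getSize());
        }
        System.out.println("Uncompressed content has size " + (sumBytes / 1024) + " KB");
    }
}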

How to convert a .dm3 file (with annotation and scale bar) to .jpg/jpeg image?

How can I convert a dm3 file into .jpg/jpeg images? The image has an annotation and a scale bar on it. I set up a script, but it always reports that "the format cannot contain the data to be saved". The conversion works via the File/Batch Convert function, so how can I do the same thing in a script? Thanks.
image test:=IntegerImage("test",2,1,100,100)
test.ShowImage()
image frontimage:=GetFrontImage()
string filename=getname(frontimage)
imagedisplay disp = frontImage.ImageGetImageDisplay(0)
disp.applydatabar()
ImageDocument frontDoc = GetFrontImageDocument()
string directoryname, pathname
number length
if(!SaveAsDialog("","Do Not Change Me",directoryname)) exit(0)
length=len(directoryname)-16
directoryname=mid(directoryname,0,length)
pathname=directoryname+filename
frontDoc.ImageDocumentSaveToFile( "JPG Format", pathname )
To convert to jpg you have to use "JPEG/JFIF Format" as the handler (=format).
It has to be exactly this string in the ImageDocument.ImageDocumentSaveToFile() function. Other formats are mentioned in the help (F1 > Scripting > Objects > Document Object Model > ImageDocument Object > ImageDocumentSaveToFile() function). Those are (for example):
'Gatan Format'
'Gatan 3 Format'
'GIF Format'
'BMP Format'
'JPEG/JFIF Format'
'Enhanced Metafile Format'
In your code you are using SaveAsDialog() to get a directory. This is not necessary; you can use GetDirectoryDialog() instead. This saves you the string manipulation on directoryname and avoids problems when users do change your filename.
Also, for concatenating paths I prefer PathConcatenate(). On the one hand this makes your code a lot more readable, since its name tells what you are doing. On the other hand it also takes care of whether the directory ends with \ or not, and other path-related details.
The following code is what I think you need:
Image test := IntegerImage("test", 2, 1, 100, 100);
test.ShowImage();
Image frontimage := GetFrontImage();
ImageDisplay disp = frontImage.ImageGetImageDisplay(0);
disp.applydatabar();
ImageDocument frontDoc = GetFrontImageDocument();
string directoryname;
// The second argument is the start point for the selection; you can of course use something else here
if(!GetDirectoryDialog("Select directory", "C:\\\\", directoryname)){
    exit(0);
}
string filename = GetName(frontimage);
string pathname = directoryname.PathConcatenate(filename);
frontDoc.ImageDocumentSaveToFile("JPEG/JFIF Format", pathname);
The answer above is correct and should be accepted. Your problem is the wrong file-type string. You want to use "JPEG/JFIF Format".
A bit more general information on image file saving in DigitalMicrograph.
One doesn't save images but always imageDocuments that can contain one, more, or even zero image objects in them. Script-commands that save an image like SaveAsGatan() really just call things like: ImageGetOrCreateImageDocument().ImageDocumentSaveToFile()
The difference doesn't really matter for simple one-image-in-document type images, but it can make a difference when there are multiple images in a document, or when a single image is displayed multiple times simultaneously (which can be done.) So it is always good to know what "really" goes on.
ImageDocuments contain some properties relating to saving:
A save format (“Gatan Format”, “TIFF Format”, …)
Default value: What it was opened with, or last used save-format in case of creation
Script commands: ImageDocumentGetCurrentFileSaveFormat() ImageDocumentSetCurrentFileSaveFormat()
A current file path:
Default value: What it was opened from, or empty
Script commands: ImageDocumentGetCurrentFile() ImageDocumentSetCurrentFile()
A dirty-state:
Default value: clean when opened, dirty when created
Script commands: ImageDocumentIsDirty() ImageDocumentClean()
A linked-to-file state:
Default value: true when opened, false when created
Script commands: ImageDocumentIsLinkedToFile()
There are two ways of saving an imageDocument:
Saving the current document itself to disc:
void ImageDocumentSave( ImageDocument imgDoc, Number save_style )
This uses the current properties of the imageDocument to save it to the current path in the current format, marking it clean in the process. The save_style parameter determines how the program deals with missing info:
0 = never ask for path
1 = ask if not linked (or empty path)
2 = always ask
Saving a copy of the current document to disc:
void ImageDocumentSaveToFile( ImageDocument imgDoc, String handler, String fileName )
This makes a copy and saves the file under the provided path in the provided format. The imageDocument in memory does not change its properties. Most noticeably, it does not become clean, and it is not linked to the provided file on disc. The fileName parameter specifies the saving location including the filename. If a file extension is provided, it has to match the file format, but it can be left out. The handler parameter specifies the file format and can be anything GMS currently supports, such as:
Gatan Format
Gatan 3 Format
GIF Format
BMP Format
JPEG/JFIF Format
Enhanced Metafile Format
In short:
To save the currently opened imageDocument with a different format, you would want to do:
imageDocument doc = GetFrontImageDocument()
doc.ImageDocumentSetCurrentFileSaveFormat("TIFF Format")
doc.ImageDocumentSave(0)
While to just save a copy of the current state you would use:
imageDocument doc = GetFrontImageDocument()
string path = doc.ImageDocumentGetCurrentFile() // full path including extension!
path = PathExtractDirectory(path,0) + PathExtractBaseName(path,0) // path without file extension
doc.ImageDocumentSaveToFile("TIFF Format", path )

boost log every hour

I'm using Boost.Log and I want to set up basic log file handling: start a new error log at the beginning of each hour (if any errors occur) and name it like "file_%Y%m%d%H.log".
I have two problems with this Boost library:
1. How to rotate the file at the beginning of each hour?
This isn't possible with the rotation_at_time_interval parameter, because it starts a new file relative to the first record written to the file, so the hour in the file name doesn't match that rule. Is it possible to have multiple rotation_at_time_point values for one file in a sink, or is there some other solution?
2. When a file exceeds some size, I want to start a new file, and in that case an index should be appended to the file name. If I add the rotation_size parameter and %N to the file name, N keeps incrementing for as long as the application is running. I want N to be reset at the beginning of each hour, just as my file name changes. Does anybody have any idea how to do that with Boost.Log?
This is a basic principle of creating log files in industry. I really don't understand how this can't be done with a library dedicated to creating log files.
The library itself doesn't provide a way to rotate the file at the beginning of every hour, but I had the same problem, so I used a wrapper function that returns true at the beginning of every hour.
I find this way better for me, because I can control the efficiency of the code.
From boost.org:
bool is_it_time_to_rotate();
void init_logging(){
    boost::shared_ptr< sinks::text_file_backend > backend =
        boost::make_shared< sinks::text_file_backend >(
            keywords::file_name = "file_%5N.log",
            keywords::time_based_rotation = &is_it_time_to_rotate
        );
}
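The snippet above only declares is_it_time_to_rotate() and does not show its body. One way such a predicate could work is to remember the hour of the previous call and report true the first time the clock hour changes. The sketch below illustrates just that logic in Python; the class name HourlyRotationCheck and the injectable clock are made up for the example, and a real Boost.Log setup would implement the same check in C++ inside is_it_time_to_rotate().

```python
from datetime import datetime


class HourlyRotationCheck:
    """Callable that returns True the first time it is called in a new clock hour."""

    def __init__(self, now=datetime.now):
        self._now = now  # injectable clock, which makes the logic easy to test
        self._last_hour = self._hour_key(self._now())

    @staticmethod
    def _hour_key(t):
        # Include the date so that 10:00 today differs from 10:00 tomorrow.
        return (t.year, t.month, t.day, t.hour)

    def __call__(self):
        current = self._hour_key(self._now())
        if current != self._last_hour:
            self._last_hour = current
            return True
        return False


if __name__ == "__main__":
    # Simulated clock readings: one for construction, then one per check.
    ticks = iter([
        datetime(2024, 1, 1, 9, 59, 30),
        datetime(2024, 1, 1, 9, 59, 59),
        datetime(2024, 1, 1, 10, 0, 1),
        datetime(2024, 1, 1, 10, 30, 0),
    ])
    check = HourlyRotationCheck(now=lambda: next(ticks))
    print([check(), check(), check()])
```

Because the check only fires when a call happens, a file is rotated on the first record written after the hour changes, not at the exact top of the hour. That fits the "only if errors exist" requirement in the question, since no file is started for an hour without records.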
As for the second question, I don't really understand it well.

How to handle large files while reading it with python xlrd in GAE without giving DeadlineExceededError

I want to read a 4 MB file using Python xlrd on GAE.
I am getting the file from the Blobstore. The code I use is given below.
book = xlrd.open_workbook(file_contents=temp_file)
sh = book.sheet_by_index(0)
for col_no in range(sh.ncols):
It gives me a DeadlineExceededError.
book = xlrd.open_workbook(file_contents=file_data)
File "/base/data/home/apps/s~appid/app-version.369475363369053908/xlrd/__init__.py", line 416, in open_workbook
ragged_rows=ragged_rows,
File "/base/data/home/apps/s~appid/app-version.369475363369053908/xlrd/xlsx.py", line 756, in open_workbook_2007_xml
x12sheet.process_stream(zflo, heading)
File "/base/data/home/apps/s~appid/app-version.369475363369053908/xlrd/xlsx.py", line 520, in own_process_stream
for event, elem in ET.iterparse(stream):
DeadlineExceededError
But I am able to read smaller files.
Actually, I only need the first few rows (30 to 50) of the file. Is there any way to get them without causing a deadline error, other than running it as a task and returning the details via the Channel API?
What can I do to handle this?
I read an Excel file of about 1000 rows and the library works okay.
Here is a link that might be useful: https://github.com/cjhendrix/HXLator-SpaceAppsVersion/blob/master/gae/main.py
From that code, it looks like the rows and columns should be traversed by taking each row as a list of values.
Example:
wb = xlrd.open_workbook(file_contents=inputfile.read())
sh = wb.sheet_by_index(0)
for rownum in range(sh.nrows):
    val_row = sh.row_values(rownum)
    # here, print an element of the list
    self.response.write(val_row[1])  # index depends on the number of columns
regards!!!

How do I get a temporary File object (of correct content-type, without writing to disk) directly from a ZipEntry (RubyZip, Paperclip, Rails 3)?

I'm currently trying to attach image files to a model directly from a zip file (i.e. without first saving them on a disk). It seems like there should be a clearer way of converting a ZipEntry to a Tempfile or File that can be stored in memory to be passed to another method or object that knows what to do with it.
Here's my code:
def extract (file = nil)
  Zip::ZipFile.open(file) { |zip_file|
    zip_file.each { |image|
      photo = self.photos.build
      # photo.image = image # this doesn't work
      # photo.image = File.open image # also doesn't work
      # photo.image = File.new image.filename
      photo.save
    }
  }
end
But the problem is that photo.image is an attachment (via paperclip) to the model, and assigning something as an attachment requires that something to be a File object. However, I cannot for the life of me figure out how to convert a ZipEntry to a File. The only way I've seen of opening or creating a File is to use a string to its path - meaning I have to extract the file to a location. Really, that just seems silly. Why can't I just extract the ZipEntry file to the output stream and convert it to a File there?
So the ultimate question: Can I extract a ZipEntry from a Zip file and turn it directly into a File object (or attach it directly as a Paperclip object)? Or am I stuck actually storing it on the hard drive before I can attach it, even though that version will be deleted in the end?
UPDATE
Thanks to blueberry fields, I think I'm a little closer to my solution. Here's the line of code that I added, and it gives me the Tempfile/File that I need:
photo.image = zip_file.get_output_stream image
However, my Photo object won't accept the file that's getting passed, since it's not an image/jpeg. In fact, checking the content_type of the file shows application/x-empty. I think this may be because getting the output stream seems to append a timestamp to the end of the file, so that it ends up looking like imagename.jpg20110203-20203-hukq0n. Edit: Also, the tempfile that it creates doesn't contain any data and is of size 0. So it's looking like this might not be the answer.
So, next question: does anyone know how to get this to give me an image/jpeg file?
UPDATE:
I've been playing around with this some more. It seems output stream is not the way to go, but rather an input stream (which is which has always kind of confused me). Using get_input_stream on the ZipEntry, I get the binary data in the file. I think now I just need to figure out how to get this into a Paperclip attachment (as a File object). I've tried pushing the ZipInputStream directly to the attachment, but of course, that doesn't work. I really find it hard to believe that no one has tried to cast an extracted ZipEntry as a File. Is there some reason that this would be considered bad programming practice? It seems to me like skipping the disk write for a temp file would be perfectly acceptable and supported in something like Zip archive management.
Anyway, the question still stands:
Is there a way of converting an Input Stream to a File object (or Tempfile)? Preferably without having to write to a disk.
Try this
Zip::ZipFile.open(params[:avatar].path) do |zipfile|
  zipfile.each do |entry|
    filename = entry.name
    basename = File.basename(filename)
    tempfile = Tempfile.new(basename)
    tempfile.binmode
    tempfile.write entry.get_input_stream.read
    tempfile.rewind # so the attachment is read from the start of the file
    user = User.new
    user.avatar = {
      :tempfile => tempfile,
      :filename => filename
    }
    user.save
  end
end
Check out the get_input_stream and get_output_stream messages on ZipFile.

Byte Array copy in Jsp

I am trying to append 2 images (as byte[]) in Google App Engine Java and then have the HttpServletResponse display them.
However, it does not seem like the second image is being appended.
Is there anything wrong with the snippet below?
...
resp.setContentType("image/jpeg");
byte[] allimages = new byte[1000000]; //1000kB in size
int destPos = 0;
for(Blob savedChart : savedCharts) {
    byte[] imageData = savedChart.getBytes(); //imageData is 150k in size
    System.arraycopy(imageData, 0, allimages, destPos, imageData.length);
    destPos += imageData.length;
}
resp.getOutputStream().write(allimages);
return;
Regards
I would expect the browser/client to issue 2 separate requests for these images, and the servlet would supply each in turn.
You can't just concatenate images together (the same goes for most other data structures). What about headers etc.? At the moment you're providing 2 JPEGs butted against one another, and a browser won't handle that at all.
If you really require 2 images together, you're going to need some image processing library to do this for you (or perhaps, as noted, AWT). Check out the ImageIO library.
It seems that you have a completely wrong idea of how image file formats work and how images are used in HTML.
In short, the arrays are copied correctly without any problem, but that is not how images work.
You will need to use AWT to combine images in Java.
