Is it possible to manipulate PDF files in Visual Basic without an external library/SDK?

I am looking at how to implement PDF merging in raw VB code so that it can be invoked by a bot for business process automation.
The software used to create the bot provides a function to invoke VB code, but I don't believe it can access any externally imported libraries because it expects plain source. So I essentially need to produce code that could run in a bare VB shell environment without anything fancy (or convenient, it seems).
All the research I've done so far points me in the direction of external packages I would need to install, such as iText; this is what I'm looking to avoid.

(previous iText employee here)
PDF is not an easy (binary) format.
Essentially, blobs of information (text that has to be rendered, fonts, images, vector graphics, etc) are compressed and gathered into objects.
Each object gets a number. Objects are allowed to reference each other (a piece of text might say 'I want to be rendered with font 4433').
All object numbers and their byte offsets in the file are gathered in the cross-reference (often called XREF) table.
A PDF includes a 'Pages' dictionary object that tells the viewer which objects belong on which page.
In order to merge PDF files, you would need to:
- read all XREF tables of all files
- adjust all of those to the correct byte offset
- update various dictionary objects within the PDF file that tell it where all the objects per page are kept
This is by no means a trivial task, but it can be done using only VB.
If you are serious about implementing a robust, scalable version of this kind of tool, perhaps it's better to look at the iText source code and try to port it to VB?
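To give a feel for the first step (reading an XREF table), here is a minimal sketch in Python rather than VB, purely for illustration. It assumes a classic, uncompressed cross-reference table; files that use cross-reference streams or incremental updates need more work. The function names and the tiny demo file are made up for this example.

```python
import re


def find_xref_offset(pdf_bytes: bytes) -> int:
    """Return the byte offset recorded after the last `startxref` keyword."""
    tail = pdf_bytes[-1024:]  # the trailer sits at the very end of the file
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref marker found; not a PDF or damaged")
    return int(matches[-1])


def read_xref_entries(pdf_bytes: bytes) -> dict:
    """Parse a classic xref table into {object number: byte offset}."""
    pos = find_xref_offset(pdf_bytes)
    lines = pdf_bytes[pos:].splitlines()
    if lines[0].strip() != b"xref":
        raise ValueError("xref stream or unexpected layout; not handled here")
    entries = {}
    i = 1
    while i < len(lines) and lines[i].strip() != b"trailer":
        # Subsection header: first object number and entry count
        first, count = (int(x) for x in lines[i].split())
        for n in range(count):
            offset, _generation, kind = lines[i + 1 + n].split()
            if kind == b"n":  # 'n' = in use, 'f' = free
                entries[first + n] = int(offset)
        i += 1 + count
    return entries


def build_demo_pdf() -> bytes:
    """Build a tiny two-object file with correct offsets (demo data only)."""
    objects = [
        b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
        b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n",
    ]
    out = b"%PDF-1.4\n"
    offsets = []
    for obj in objects:
        offsets.append(len(out))
        out += obj
    xref_pos = len(out)
    out += b"xref\n0 3\n0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size 3 /Root 1 0 R >>\n"
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return out
```

Calling `read_xref_entries(build_demo_pdf())` gives a mapping from object number to byte offset. A merge would have to renumber the objects of the second file and write a new table with recalculated offsets, which is where most of the effort goes.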

Related

Why is there no program-data independence in traditional file processing?

"In traditional file processing, the structure of data files is embedded in the application programs, so any changes to the structure of a file may require changing all programs that access that file. By contrast, DBMS access programs do not require such changes in most cases. The structure of data files is stored in the DBMS catalog separately from the access programs. We call this property program-data independence."
The text above is taken from the book Fundamentals of Database Systems. I didn't get the part about traditional file processing. Can somebody please explain (an example would be appreciated)?
I'll give you a simple example.
Microsoft Excel used to save its files in a proprietary binary format. In practical terms, this meant that you could only work on those files using Excel.
But now, Excel supports an open document format in XML that is text-based, and allows other programs like the OpenOffice SDK to interact with them. So you no longer need to rely on Excel to work with open document format Excel files.
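Closer to what the textbook describes, here is a small sketch in Python. The record layout, table name and column names are invented for illustration. In the file-based version the layout is hard-coded in the program, so adding a field breaks every reader; in the database version the layout lives in the catalog and the program only names the column it wants.

```python
import sqlite3
import struct

# Traditional file processing: the record layout is embedded in the program.
# Every program that reads the file has to repeat this format string.
OLD_LAYOUT = struct.Struct("<i20s")      # id, name
NEW_LAYOUT = struct.Struct("<i20s15s")   # id, name, phone (field added later)


def read_names_from_file(blob: bytes, layout: struct.Struct) -> list:
    """Read names from fixed-width records using a layout the program knows."""
    names = []
    for fields in layout.iter_unpack(blob):
        names.append(fields[1].rstrip(b"\0").decode("ascii"))
    return names


def read_names_from_db(conn: sqlite3.Connection) -> list:
    """The DBMS catalog knows the layout; the program only names a column."""
    cursor = conn.execute("SELECT name FROM employee ORDER BY id")
    return [row[0] for row in cursor]


def demo() -> dict:
    """Show that a structure change breaks the file reader but not the query."""
    result = {}

    new_file = (NEW_LAYOUT.pack(1, b"Ada", b"555-0100")
                + NEW_LAYOUT.pack(2, b"Bob", b"555-0101"))
    try:
        read_names_from_file(new_file, OLD_LAYOUT)  # old program, new file
        result["file_reader_survived"] = True
    except struct.error:
        result["file_reader_survived"] = False

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO employee VALUES (?, ?)",
                     [(1, "Ada"), (2, "Bob")])
    before = read_names_from_db(conn)
    conn.execute("ALTER TABLE employee ADD COLUMN phone TEXT")
    after = read_names_from_db(conn)  # same program, changed structure
    result["db_reader_survived"] = before == after
    conn.close()
    return result
```

The old file reader fails on the new file because the record size no longer matches, and it would have to be edited and redeployed, along with every other program that reads that file. The query keeps working because it never depended on the physical layout.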

Print PDF programmatically - C# WinForms

I need to print an SSRS report in PDF format from a WinForms application written in C#. The report is a PDF document (containing text, images & tables) held in a byte array, and I don't want to save it to disk for security/performance reasons. The requirements for printing are that it needs to be done:
- in the fastest way possible
- with no user interaction
- without the need to install anything on the client machine (we can't rely on any Adobe products being installed)
- third-party libraries can be used, as long as they can be installed together with the application
I came to 2 potential solutions:
1. using MigraDoc - but I can't find a way to load and print an existing file, only a newly created PDF file, or one already saved to disk
2. sending the PDF directly to the printer, using "PDF Direct Print"/PCL/etc. This seems to be the fastest option, but I haven't implemented it yet, and it seems to not be supported by all printers.
Does anybody have any suggestions on how to implement the options above, or any other options which meet the requirements?
MigraDoc cannot print PDF files, so one of your potential solutions is void.

What would be the easiest way to generate new localized resource files?

My Windows Forms application consists of one Visual Studio solution and several projects. The application is localized in English and French using resource files: each project has global resource files (e.g. fooResources.resx and fooResources.fr.resx), and each form/user control has its own resource files (e.g. fooForm.resx and fooForm.fr.resx). So let's say for argument's sake I have about 30 sets of resource files.
I now have to extract all the strings to be sent for translation into German, then when I receive the translated strings create German resource files (e.g. fooResources.de.resx and fooForm.de.resx) which contain the new captions.
Obviously I could do all of this manually, but I am a developer and thus by nature lazy! No, just kidding - but I would appreciate some suggestions on the most painless way to do this as I am sure more languages will be coming in the future.
Thanks.
I'm the author of a translation product that makes the job very easy for both the developer and the translator. See http://www.hexadigm.com.
I've dealt with this a few different ways. I have worked with translators who are more than willing to work with .resx files directly and will create the files as needed (though you may have to rename them to the proper locale code).
Otherwise, the contents of the resource files are purely XML. I wrote a little program that drops the XML data into a datagrid for export, and then imports the translated CSV or Excel file back in, using the resource key as a unique ID.
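A rough sketch of that round trip, in Python for brevity. It assumes the usual .resx shape of `<data name="..."><value>...</value></data>` entries under the root element, and it treats entries carrying a `type` or `mimetype` attribute as non-string resources to skip. The function names and CSV columns are my own choices, not part of any tool.

```python
import csv
import io
import xml.etree.ElementTree as ET


def resx_to_rows(resx_text: str) -> list:
    """Return (key, text) pairs for every plain string entry."""
    root = ET.fromstring(resx_text)
    rows = []
    for data in root.findall("data"):
        # Typed entries (sizes, locations, images) are not translatable text.
        if data.get("type") is not None or data.get("mimetype") is not None:
            continue
        value = data.find("value")
        if value is not None:
            rows.append((data.get("name"), value.text or ""))
    return rows


def rows_to_csv(rows) -> str:
    """Write the strings to CSV with an empty column for the translator."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["key", "source", "translation"])
    for key, text in rows:
        writer.writerow([key, text, ""])
    return buf.getvalue()


def apply_translations(resx_text: str, translations: dict) -> str:
    """Return new .resx text with the value replaced for every known key."""
    root = ET.fromstring(resx_text)
    for data in root.findall("data"):
        value = data.find("value")
        if value is not None and data.get("name") in translations:
            value.text = translations[data.get("name")]
    return ET.tostring(root, encoding="unicode")
```

Run `resx_to_rows` and `rows_to_csv` over each fooForm.resx, send the CSV out, then feed the returned key/translation pairs to `apply_translations` and save the result as fooForm.de.resx. Because the key is the join column, the translator can reorder or filter rows without breaking the import.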

dsofile c# API / NTFS custom file properties

I'm searching for a good way to add metadata to a file. dsofile.dll works fine for NTFS. The metadata is lost when one drops a copy on a FAT32 share (it uses NTFS hidden streams, I guess). Microsoft Word documents contain metadata that is not lost; how do they do it? Similar to FAT, sending the file via e-mail strips off all metadata created with dsofile (and also metadata created by hand with Windows Explorer). Separate metadata files are not an option. It must be compatible with standard Windows techniques. If I send someone a file with Outlook and he sends it back, the metadata should not be lost.
(the required meta data is actually only an ID)
The issue is that all file systems provide a single-stream view of the file as a greatest-common-denominator. Through this interface, which exposes the file's "contents", you can read or store properties and have them be transported with the "contents" by naive system (or user) utilities. For example, CopyFile in Windows will carefully lose alternate data streams and has no notion of "shadow files".
The question is whether or not the format of the "contents" allows for arbitrary addition of properties.
Some formats allow arbitrary content (e.g., MSFT's docfile aka .doc/.xls/etc). Some allow limited content (.mp3, .jpg, .exe).
Some are completely SOL (.txt, .bmp).
Any solution would be format-dependent. MS Office files are (all) compound files and there's a place for properties there. In some formats (PE files, for example) it's safe to just append data to the end of the file, if you know how to read it later. In a ZIP file you can probably find a place in the directory, or just add a helper file with your data to the archive. Other formats can't stand this, and you'd need to find your own way of solving the problem.
Actually, file name can also be a good placeholder for your ID.
If you need to store the files somewhere but don't need the file to remain readable by outside applications, you can pack them into a ZIP archive or use something like our SolFS library.
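For the ZIP route, one place in the directory that can hold an ID is the archive comment, which is stored in the end-of-central-directory record and travels with the file through e-mail and FAT32 copies. A minimal sketch using Python's standard `zipfile` module; the function names and the sample ID are hypothetical.

```python
import io
import zipfile


def wrap_with_id(filename: str, payload: bytes, file_id: str) -> bytes:
    """Pack one file into a ZIP and store the ID as the archive comment."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(filename, payload)
        # The comment is written with the central directory on close.
        zf.comment = file_id.encode("utf-8")
    return buf.getvalue()


def read_id(zip_bytes: bytes) -> str:
    """Read the ID back from the archive comment."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return zf.comment.decode("utf-8")
```

The ID survives as long as nobody repacks the archive with a tool that drops comments; putting a small helper file inside the archive is the sturdier variant of the same idea.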
What about the standard properties rather than custom DSOFile properties, i.e. Comments, Author etc.? Do they get wiped?
Not sure if it's ideal, but a way we've gotten around it is a tool that takes the DSOFile properties and saves them to a text file, which is then emailed along with the file. At the other end the user runs a tool to re-import the DSOFile properties from the text file.

Convert pcl to image

I'm communicating with a logic analyzer (HP 1660A) over RS232. I issue a command which tells the analyzer to take a print screen of its display and send it to the controller (my PC) over the serial link. I save the result (usually about 25 kB) to my computer and would like to view it as a TIFF or another image format. The problem is that the response from the analyzer comes in PCL format, so it is suitable for sending directly to a printer, but not for opening as an image. I have tried a few PCL-to-image converters. One of them does the job properly, but I only used the trial version and I am reluctant to purchase it. That is the background. I would appreciate any kind of help: a reference to the commands in PCL 1, and what I should do to extract the data from the PCL file and format it properly. I have no experience with PCL or image processing whatsoever, so please give me a hand here. Thank you.
P.S. I've obtained the PCL file from the analyzer both in C# and MATLAB. I have one slight problem in C# with the serial port control: some images contain uninterpreted characters when run through the above converters. I mention all this because I need an algorithm or some pointers, no matter the programming language, so please feel free to post.
PCL is complex to read. There are only a handful of tools out there that do a good job of this. We have lots of PCL expertise and still often look to others to supply conversion to PDF and other formats. If the PCL is quite simple, that is, just text, a few fonts, and a graphic or two, a couple of RegEx commands could deal with the extraction of the text, and then you could mock up a new document using whatever tools you wish.
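For a screen dump, the interesting part is more likely the raster data than the text. In PCL 5, raster rows are sent with the command `ESC * b # W` followed by # bytes of row data, between `ESC * r # A` (start raster graphics) and an end-raster command. The sketch below (Python, for illustration) pulls those rows out and writes them as a 1-bit PBM image. It assumes uncompressed rows (compression mode 0) and a monochrome dump; whether the HP 1660A output meets those assumptions is something I have not verified, and the function names are made up.

```python
ESC = 0x1B
DIGITS = b"0123456789"


def extract_raster_rows(pcl: bytes) -> list:
    """Collect the payload of every `ESC * b # W` (transfer raster data) command.

    Assumes uncompressed rows; compressed rows would need decoding first.
    """
    rows = []
    i = 0
    while i < len(pcl) - 2:
        # Look for the three-byte prefix ESC * b
        if pcl[i] != ESC or pcl[i + 1:i + 3] != b"*b":
            i += 1
            continue
        i += 3
        # A parameterized sequence is one or more <number><letter> pairs;
        # a lowercase letter means another pair follows.
        while i < len(pcl):
            start = i
            while i < len(pcl) and pcl[i] in DIGITS:
                i += 1
            if i >= len(pcl):
                break
            value = int(pcl[start:i] or b"0")
            letter = chr(pcl[i])
            i += 1
            if letter in "Ww":
                rows.append(pcl[i:i + value])
                i += value  # skip the binary payload so it is not scanned
                break
            if not letter.islower():
                break
    return rows


def rows_to_pbm(rows) -> bytes:
    """Pack 1-bit rows into a binary PBM (P4) image; a 1 bit is black in both."""
    width_bytes = max((len(r) for r in rows), default=0)
    header = b"P4\n%d %d\n" % (width_bytes * 8, len(rows))
    # Short rows are padded with zero bytes (white) to a common width.
    return header + b"".join(r.ljust(width_bytes, b"\0") for r in rows)
```

Walking the stream byte by byte, rather than running a regex over it, matters here because the row payload is binary and may itself contain ESC bytes. The resulting PBM opens in most image viewers and converts to TIFF with any common image tool.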
Looking at these files on Stack Overflow might be tough. If you can put them on an FTP server and post a link, I can take a quick look and post my findings/thoughts here. The other option is to look to an outside tool. There are a few we've had success with. Our needs are broad, so I've settled on one that works best with many different PCL streams (some PCL coding is better than others). As you are dealing with a known quantity of PCL, you may have a few options. Here are a few we've used and had some success with (in order of usefulness to us):
PCLWorks by PageTech (they have a GUI viewer and complete SDK)
VeryPDF PCL Converter (command line tool)
SwiftView
There are others, including an open-source variant of Ghostscript that handles PCL (we've never had much luck with it, as the PCL we use often contains very custom fonts, symbol sets, and tons of macros, which seem to choke it):
GhostPCL
EDIT: Most recently we've been working with LincPDF (http://www.lincolnco.com/). This is also an excellent product, which has one big benefit: deployment is simple. Some of the other tools have complex software installations. This solution is very easy for us to deploy as a feature in an application. It's also faster than any tool we've tested to date (at least with the PCL that we generate from our apps, which is quite complex as it includes specialized fonts and macros).
According to the spec sheet for the HP 1660 series (pdf), it can send TIFF, PCX and PostScript.
Wouldn't it be easier to use TIFF?
The project was put on hold for a while, but I would like to offer a complete and usable solution.
#Adrian
You can save the image to a floppy disk. I've done that, saved it as TIFF, and everything worked fine. Unfortunately, it sends only PCL through RS232. The idea behind saving the print screen over serial communication was to avoid overusing the floppy disk, which the device also needs in order to boot.
#Douglas
Thank you for your elaborate answer. I'll take a look at the indicated tools. However, my aim is to offer a complete front-end solution which yields the graphic directly. I've put some files from my tests here so that you can see the complexity of the PCL constructions. Do you know of an API that I could integrate into my application which can parse the file and interpret the PCL?
Regards,
Cosmin
We capture the serial input via a serial spooler that watches COM1:. It's called SSpool.exe. It redirects the PCL as input to PCLXForm. PCLXForm converts it into any raster format (TIFF, JPG, PDF, BMP, etc.). We can also extract the text during the conversion, and we can extract individual raster objects from the PCL for re-arrangement in the downstream application. Our pricing model is positioned for licensees that need to convert up to 50,000 pages of invoices into indexed PDFs per month. However, this type of application normally requires a custom license in order to get our pricing down to the level required. In order to do so, we often have to restrict our product to convert unlimited files, but only up to the 20th page within any one PCL print file. That provides enough page volume and gives us the ability to reduce the pricing per unit. To demo, you would need the PCLTool SDK.

Resources