How to design a document file format?

How to design a document file format? - file

I would like to design a file format which includes content like inks (saved as points), texts, pictures, records and so on.
I can't find any useful guidelines about how to design a file format. Features to be considered are:
Partial Synchronization (can sync partial elements but not always sync the hole document)
Quick response (like email，when finish downloading the content,shows the content.And then continue to download the attachment)
Compatibility (when the file format upgrade, the lower version of parser can parse a part of document which it recognizes)
Maybe I can use XML to construct the document. Are there any guidelines or design patterns for beginner to learn the method of designing a file format.

For maximum "compatibility", I'd reccommend a format consisting of labelled chunks, like TIFF. If the reading code ignores unknown chunks, then your 2.0 installations can still read 3.0 files (at least as much of it as they are able).
You can organize the chunks with a master "directory" chunk which describes the hierarchy and placement of elements (some kind of flattened tree).
A good source for more is The Encyclopedia of Graphics File Formats. They're super-cheap on Amazon because the old edition has some pages of the TOC mixed-up.

Related

Machine learning: file parsing and prediction class file

Good morning all,
I am currently on a project in the field of Machine Learning, the goal is to make a supervised classification on a set of data. My data is a large number of pdf files, each file has a specific class, the goal is to use these files as a training dataset in order to do class prediction on new files.
My problem is that I don't know how to build my training dataset since the classification algorithm must train on the content of each file and in my training data frame I have the class of each file and the name of the file in question. How do I include the content of each pdf file in my training Data Frame?
Thank you in advance for your help

PDF files are usually characterized by text, images, charts or whatever, and so they cannot be easily transformed into vectors of numbers that can be given to a machine learning algorithm. First you need to extract information of interest from your files.
In this regard, you might want to try first some libraries which can be used to extract information, and see what happens. For Python, a good start can be PyPDF2. You can find a tutorial here.
If this is does not work as expected, my advice would be to try to use some OCR tools, which directly read the pdf as an image to extract information. In pytesseract is one of the most used, but it is not the only one.

Why is there no program-data independence in traditional file processing?

"In traditional file processing, the structure of data files is embedded in the application programs, so any changes to the structure of a file may require changing all programs that access that file. By contrast, DBMS access programs do not require such changes in most cases. The structure of data files is stored in the DBMS catalog separately from the access programs. We call this property program-data independence."
The following text is taken from the book Fundamentals of the Database system. I didn't get the part about the traditional file processing can somebody please explain(an example would be appreciated)?

I'll give you a simple example.
Microsoft Excel used to save its files in a proprietary binary format. In practical terms, this meant that you could only work on those files using Excel.
But now, Excel supports an open document format in XML that is text-based, and allows other programs like the OpenOffice SDK to interact with them. So you no longer need to rely on Excel to work with open document format Excel files.

Is it possible to manipulate pdf files in Visual Basic without an external library/SDK?

I am looking at how to implement PDF merging with raw VB code so that the code may be invoked by a bot for business process automation.
The software used to create the bot provides a function to invoke VB code, but I don't believe it can access any externally imported libraries because it expects plain source, so I essentially need to produce code that one could run in a VB shell environment without anything fancy (or convenient, it seems).
All the research I've done so far point me in the direction of external packages I would need to install, such as iText; this is what I'm looking to avoid.

(previous iText employee here)
PDF is not an easy (binary) format.
Essentially, blobs of information (text that has to be rendered, fonts, images, vector graphics, etc) are compressed and gathered into objects.
Each object gets a number. Objects are allowed to reference eachother (a piece of text might say 'I want to be rendered with font 4433')
All object numbers and their byte offset in the file are gathered in the crossreference (often called XREF) table.
A PDF includes a 'Pages' dictionary object that tells the viewer which objects belong on which page.
In order to merge PDF files, you would need to:
- read all XREF tables of all files
- adjust all of those to the correct byte offset
- update various dictionary objects within the PDF file that tell it where all the objects per page are kept
This is by no means a trivial task, but it can be done using only VB.
If you are serious about implementing a robust, scalable version of this of tool, perhaps it's better to look at the iText sourcecode and try to port it to VB?

Generate a series of documents based on SQL table

I am trying to formulate a proposal for an application that allows a user to print a batch of documents based on data stored in a SQL table. The SQL table indicates which documents are due and also contains all demographic information. This is outside of what I normally do and am trying to see if these is a platform/application that already exists to do such a task
For example
List of all documents: Document #1 - Document #10
Person 1 is due for document #: 1,5,7,8
Person 2 is due for document #: 2.6
Person 3 is due for document #: 7,8,10
etc
Ideally, what I would like is for the user to be able to push a button and get a printed stack of documents that have been customized for each user including basic demographic info like name, DOB, etc
Like i said at the top, I already have all of the needed information in a database, I am just trying to figure out the best approach to move that information onto a document
I have done some research and found some people have used mail merge in Word or using Access as a front end but I don't know if this is the best way. I've also found this document. Any advice would be greatly appreciated

If I understand your problem correctly, your problem is two-fold: Firstly, you need to find a way to generated documents based on data (mail-merge) and secondly, you might need to print them two.
For document generation you have two basic approaches: template-based and programmatically from scratch. I suppose that you will opt for a template based approach which basically means that you design (in MS Word) a template document (Word, RTF, ...) that acts as a template and contains placeholders and other tags that designate »dynamic« parts of the document. Then, at document generation time, you need a .NET library/processor that you will pass this template document and the data, where the processor will populate the template with the data and return the resulting document.
One way to achieve this functionality would be employing MS Words' native mail-merge, but you should know that this would involve using Office COM and Word Application Automation which should be avoided almost always.
Another option is to build such a system on top of Open XML SDK. This is velid option, but it will be a pretty demanding task and will most probably cost you much more than buying a commercial .NET library that does mail-merge out-of-the-box – been there, done that. But of course, the good side here is that you will be able to tailer the solution to your needs. If you go down this road I recoment that you use Content Controls for tagging documents/templates. The solution with CCs will be much easier to implement than the solution with bookmarks.
I'm not very familliar with the open source solutions and I'm not sury how many there are that can do mail-merge. One I know is FlexDoc (on CodePlex) but its problem is that uses a construct (XmlControl) for tagging that is depricated in Word 2010+.
Then there are commercial solutions. Again I don't know them in detail but I know that the majority of them are a general purpose document processing libraries. Our company has been using this document generation toolkit for some time now and I can say it covers all our »template-based document generation« needs. It doesn't require MS Word at doc generation time, and has really helpful add-in for MS word and you only need several lines of code to integrate it in your project. Templating is very powerful and you can set-up a template in a very short time. While templates are Word documents, you can generate PDF or XPS docs as well. XPS is useful because you can use .NET/WPF prining framework that works with XPS docs to print documents. This is a very high-end solution, but of course, the downside here is that it is not a free solution.

Convert pcl to image

I'm communicating with a logic analyzer (HP 1660A) over RS232. I issue a command which tells the analyzer to print screen its display and send it over to the controller (my pc) through serial communication. I'm saving the result (which is usually abut 25kB) to my computer and I would like to view it as a TIFF or other format. The problem is that the response from the analyzer comes in PCL format, therefore suitable to be sent to a printer and printed directly, but not to be opened as an image. I have tried a few PCL to image converters to do the job, I found one which does it properly, however I've used the trial version and I am reluctant to purchase it. I've given you the background of my labour. I would appreciate any kind of help, a reference to the commands in pcl 1 and what should I do in order to extract the data and format it properly from the PCL file. I have no experience with PCL and image processing whatsoever, so please, give me a hand here. Thank you.
P.S. I've obtained the PCL file from the analyzer, both in C# and matlab... I have one slight problem in C# with the serial port control, some images have some uninterpreted characters in the image, when using the above converters. I say all these because I need an algorithm or some indications, no matter the programming language, so please feel free to post.

PCL is complex to read. There are only a handful of tools out there that do a good job of this. We have lots of PCL expertise and still often look to other to supply conversion to PDF and other formats. If the PCL is quite simple, that is, just text, a few fonts, and a graphic or two, a couple of RegEx commands could deal with the extraction of the text and then you could mock up a new document using whatever tools you wish.
Looking at these files in stackoverflow might be tough. If you can get them on an ftp and post a link I can take a quick look and post my findings/thoughts here. The other option is to look to an outside tool. There are a few we've had success with. Our needs are broad so I've settled on one that works the best with many different PCL streams (some PCL coding is better than others). As you are dealing with a known quantity of PCL you may have a few options. Here are a few we've used and had some success with (in order of usefulness to us)
PCLWorks by PageTech (they have a GUI viewer and complete SDK)
VeryPDF PCL Converter (command line tool)
SwiftView
There are others, and even an opensource variant of Ghostscript that handles PCL (we've never had much luck as the PCL we use often contains very custom fonts, symbol sets, and tons of macros which seem to choke it.
GhostPCL
EDIT: Most recently we've been working with LincPDF (http://www.lincolnco.com/). This is also an excellent product with has one big benefit, deployment is simple. Some of the other tools have complex software installations. This solution is very easy for us to deploy as a feature in an application. It's also faster then any tools we've tested to date (at least with the PCL that we generate from our apps which is quite complex as they include specialized fonts and macros).

According to the spec sheet for the HP 1660 (pdf) series can send the TIFF,PCX and postscript.
Wouldn't it be easier to use TIFF?

The project was put on hold for a while, but I would like to offer a complete and usable solution.
#Adrian
You can save the image to a floppy disk, I've done that, saved it as TIFF and everything worked fine. Unfortunately, it sends only PCL through RS232. The idea to save the print screen over serial communication was to avoid using too much the floppy disk, which the device uses in order to boot.
#Douglas
Thank you for your elaborate answer. I'll take a look at the indicated tools, however, my desire is to offer a complete front-end solution, which yields directly the graphic. I've put some files from my tests here in order to see the complexity of the PCL constructions. Do you have any knowledge of a possible API that I could integrate into my application, which can parse the file and interpret the PCL?
Regards,
Cosmin

We capture the serial input via a serial spooler that watches COM1:. It's called SSpool.exe. It redirects the PCL as input to PCLXForm. PCLXForm converts it into any raster format (TIFF, JPG, PDF, BMP, etc.) However, we can also extract the text during the conversion and we can extract individual raster objects from the PCL for re-arrangement in the downstream application. Our pricing model is positioned for licensee's that need to convert up to 50,000 pages of invoices into indexed PDF's per month. However, this type of application normally requires a custom license in order to get our pricing down to the level required. In order to do so, we often have to restrict our product to convert unlimited files, but only up to the 20th page within any one PCL print file. That provides enough page volume and gives us the ability to reduce the pricing per unit. To demo, you would need the PCLTool SDK.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight