How to Analyze multi-file forms with Azure Form Recognizer - azure-form-recognizer

I have a form split across three JPG files, one per page. Is it possible to instruct Form Recognizer to treat them as a single form?
Or should I first merge them into a single file? If so, which is the best free Java library for merging three JPGs into a single PDF file?
Thank you

You will need to merge the three JPG files into a single file.
You could merge them into a TIFF file, which Azure Form Recognizer supports and which can hold multiple images/pages.
This seems to be possible in Java without any additional library.
Check this post as it may give you an idea.
Moreover, if all documents have the same number of pages and layout, you can use a custom model trained without labels to analyze your form.
In case a model without labels doesn't give you good results, try training it with labels.

Related

Is it possible to manipulate pdf files in Visual Basic without an external library/SDK?

I am looking at how to implement PDF merging in raw VB code so that the code can be invoked by a bot for business process automation.
The software used to create the bot provides a function to invoke VB code, but I don't believe it can access externally imported libraries, because it expects plain source. So I essentially need code that could run in a plain VB shell environment without anything fancy (or convenient, it seems).
All the research I've done so far points me toward external packages I would need to install, such as iText; this is what I'm looking to avoid.
(Former iText employee here.)
PDF is not an easy (binary) format.
Essentially, blobs of information (text that has to be rendered, fonts, images, vector graphics, etc.) are compressed and gathered into objects.
Each object gets a number. Objects are allowed to reference each other (a piece of text might say 'I want to be rendered with font 4433').
All object numbers and their byte offsets in the file are gathered in the cross-reference (often called XREF) table.
A PDF includes a 'Pages' dictionary object that tells the viewer which objects belong on which page.
In order to merge PDF files, you would need to:
- read all XREF tables of all files
- adjust all of those to the correct byte offset
- update various dictionary objects within the PDF file that tell it where all the objects per page are kept
This is by no means a trivial task, but it can be done using only VB.
If you are serious about implementing a robust, scalable version of this tool, perhaps it's better to look at the iText source code and try to port it to VB?
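To make the bookkeeping above concrete, here is a minimal sketch in Python (not VB, and not how iText does it) that writes a tiny two-page PDF by hand. It only shows the three pieces mentioned above: numbered objects, the XREF table of byte offsets, and the 'Pages' dictionary. The function name `build_pdf` and the empty sample pages are made up for illustration. A real merge would additionally have to parse the input files, renumber their objects, and cope with compressed streams, none of which is attempted here.

```python
def build_pdf(objects):
    """Serialize object bodies (numbered from 1) and append an XREF table."""
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is the head of the free list
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)


# Hypothetical document: a catalog, a Pages dictionary, and two blank pages.
objects = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [3 0 R 4 0 R] /Count 2 >>",
    b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
    b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
]
pdf = build_pdf(objects)
```

Appending a page from another file means adding its objects under new numbers, listing the new page in `/Kids`, bumping `/Count`, and regenerating every offset, which is why the task is tedious rather than conceptually hard.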

Data Extraction from PDF

I get 15+ PDFs a day that I have to enter into a database. They are generated from a template where the "blanks" are filled in from specific table fields. Are there any tools or Python code examples I could use to extract the data from the PDFs, so that I can either write it to the database table or create a table to import into it? The database is currently an Access .mdb.
Thanks
There are a number of approaches that will work.
One simple approach is to print the PDF file to a text file and then have Access import that text. All recent versions of Windows let you install a "text" printer that sends a document's print output to a text file. You can have Access process a folder of PDFs, print them to text, and then import those text files. You might need some VBA to remove "page" markers and some extra lines before you import the data into Access.
Another approach is to use Word (automated from Access) to open the PDF. When Word opens a PDF, it converts it to a Word document, and it will even format rows as a Word table. You can then pluck out that table data and send it to Access. You can likely pull the text out without writing it to a text file, or just use Word's "Save As" to produce a text file (you can automate this from Access).
Another approach is to use the free Ghostscript library, which can extract text from a PDF (I would consider this if you did not have Word at your disposal).
Which solution is best depends largely on the software installed on the computer running Access. Opening the PDF files with Word would be my first choice to test.
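Since the question also asked about Python, the clean-up step mentioned for the print-to-text approach could look roughly like the sketch below. It is a guess at what a text dump might contain: the sample text, the "Page N of M" footer pattern, and the name `clean_text_dump` are all assumptions, so adjust the pattern to whatever your real output looks like.

```python
import re


def clean_text_dump(text):
    """Drop blank lines and 'Page N' footers from printed-to-text output."""
    kept = []
    # str.splitlines() also splits on form feeds (\f), the usual page break.
    for line in text.splitlines():
        line = line.strip()
        if not line or re.fullmatch(r"Page \d+( of \d+)?", line):
            continue
        kept.append(line)
    return kept


# Hypothetical dump of a two-page form.
SAMPLE_DUMP = (
    "Customer: Acme Ltd\n\nOrder No: 1042\nPage 1 of 2\n"
    "\fTotal: 99.50\nPage 2 of 2\n"
)
cleaned = clean_text_dump(SAMPLE_DUMP)
```

The remaining lines can then be split on the label separator and written to a CSV file for Access to import.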
At my old job we used Cogniview, which converted PDFs to Excel spreadsheets quite quickly. If you want to use Python, a quick search turned up this, which seems straightforward enough: PDF to XLS with Python

Print PDF programmatically - C# WinForms

I need to print an SSRS report in PDF format from a WinForms application written in C#. The report is a PDF document (containing text, images and tables) held in a byte array, and I don't want to save it to disk for security/performance reasons. The printing needs to be done:
- in the fastest way possible
- with no user interaction
- without the need to install anything on the client machine (we can't rely on any Adobe products being installed)
- third-party libraries can be used, as long as they can be installed together with the application
I came to 2 potential solutions:
1. using MigraDoc - but I can't find a way to load and print an existing file, only a newly created PDF file, or one already saved to disk
2. sending the PDF directly to the printer, using "PDF Direct Print"/PCL/etc. This seems to be the fastest option, but I haven't implemented it yet, and it seems to not be supported by all printers.
Does anybody have any suggestions on how to implement the options above, or any other options which meet the requirements?
MigraDoc cannot print PDF files, so one of your potential solutions is void.

File Maker Scripting - Sending Different Attachment

Is there a way to send an email with a different PDF file to each contact using FileMaker?
I am aware of how to send batch emails with one attachment, but I would like to send a personalized PDF to each contact, which does not seem so simple.
Also:
Can I add PDF files to the table itself, or would I have to use the path to the file?
Example:
Table 1
**Name** [James Brown] [James Blue]
**Email** [brown.j#gmail.com] [blue.j#gmail.com]
**PDFfileAttchamnet** [folder/PDF/JamesBrown.pdf] [folder/PDF/JamesBlue.pdf]
So an Email for James Brown would look like:
Dear James Brown, please see the attached file.
Attachment [JamesBrown.pdf] {actual file}
and
Dear James Blue, please see the attached file.
Attachment [JamesBlue.pdf] {actual file}
I think you can solve this by creating a container field in your database and importing the PDFs into it.
Then you can use the Export Field Contents script step to export each PDF and send it by email.
Hope this is useful.
"I would like to send a personalized PDF to each contact, which does not seem so simple."
Find the records of contacts you want to include and loop among them, sending mail to each one individually (i.e. without selecting the 'Collect addresses across found set' option).
"Can I add PDF files to the table itself, or would I have to use the path to the file?"
You can do either, it's up to you. If the path to the file can be calculated (as in your example), you can calculate it right there in the Send Mail script step.
Note that you can also generate the PDF files during the process itself.
Do I understand correctly that you would actually like to personalize the PDF document(s)?
This is possible; it is not trivial, but it is fairly straightforward. The trick is to prepare the PDF as a form, and then fill in the form fields to personalize it.
PDF has a native forms data format (called FDF), which is described in ISO 32000 (as well as the older PDF specification documents provided by Adobe, as you can find in the Acrobat SDK, downloadable from the Adobe website).
FDF is a simple structured text file, which can easily be assembled using FileMaker (I have done that routinely for several catalog projects). The easiest way to get going is to open the form in Acrobat, fill in the fields, and then export the data as FDF. This gives you the pattern to "fill in the blanks".
So, you create the FDF files using FileMaker. With them you can fill the blank form and feed the saved document to the email system.
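To give an idea of what such a file looks like, here is a minimal sketch of an FDF generator, written in Python rather than as a FileMaker calculation. The field names `name` and `note` and the helpers `fdf_escape` and `build_fdf` are invented for the example, and it assumes plain ASCII values. The pattern exported from your own form in Acrobat remains the reference for what your form actually expects.

```python
def fdf_escape(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def build_fdf(fields):
    """Return FDF text that assigns each form field name its value."""
    lines = ["%FDF-1.2", "1 0 obj", "<< /FDF << /Fields ["]
    for name, value in fields.items():
        # /T is the field name, /V the value to fill in.
        lines.append("<< /T (%s) /V (%s) >>" % (fdf_escape(name), fdf_escape(value)))
    lines += ["] >> >>", "endobj", "trailer", "<< /Root 1 0 R >>", "%%EOF"]
    return "\n".join(lines) + "\n"


# Hypothetical field names; use the ones defined in your own form.
fdf = build_fdf({"name": "James Brown", "note": "Invoice (March)"})
```

In FileMaker the same text can be assembled per record with a calculation that substitutes the field values into this pattern.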
Which tool to use to fill the blank form depends on the volume you have to process. Acrobat is not very powerful (and you may end up in a bit of a legal gray zone, because Acrobat is not set up to be used as a service). There are applications made specifically for filling out forms on a server (such as FDFMerge by Appligent), and there are also several libraries that can fill out forms (iText or PDFlib come to mind). These applications also allow you to flatten the PDF, which means that the form fields are removed and their contents become part of the underlying page.
The resulting file can then either be sent as an email attachment, or you can make it available on a server and send an email with a link to the file (which method you use may depend on security and privacy regulations).

Manually create a multi-sheet file for excel from C

I am working on a C application. I was planning to use a CSV file to get the values into a spreadsheet, but as the data got more and more complex (around 100 columns), I saw the need for multiple sheets. I am working on a single-board computer (SBC), and the file is used for storing diagnostic information. I would like to write the file from the SBC in an ASCII format, then import it into Excel (or the open-source alternative) and have multiple sheets. Is this even possible, or should I start working on macros to run on the data?
Maybe consider writing the data as XML. It can then easily be transformed into whatever format you want, and the data can be emitted in a way that makes semantic sense rather than following the architecture of a spreadsheet. This might give different external programs more flexibility to interact with the data. You can use XSLT to transform it into multiple CSV files if that's desired, and there are probably reasonable ways to import it into spreadsheets directly.
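As a rough illustration of the idea, the sketch below splits one XML file into one CSV text per group, each of which could be imported as its own sheet. It uses Python's ElementTree on the desktop side instead of XSLT, and the layout (`<diagnostics>` with one child element per group of `<sample>` elements) is purely an assumed example, not a standard format.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical file as the SBC might write it: one element per data group.
SAMPLE = """<diagnostics>
  <voltages>
    <sample t="0" vcc="3.30" vbat="4.10"/>
    <sample t="1" vcc="3.29" vbat="4.09"/>
  </voltages>
  <temperatures>
    <sample t="0" cpu="41.5"/>
  </temperatures>
</diagnostics>"""


def groups_to_csv(xml_text):
    """Return {group name: CSV text}, one entry per child of the root."""
    result = {}
    for group in ET.fromstring(xml_text):
        samples = list(group)
        if not samples:
            continue
        # Column names come from the attributes of the first sample.
        columns = list(samples[0].attrib)
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(columns)
        for sample in samples:
            writer.writerow([sample.get(column, "") for column in columns])
        result[group.tag] = buffer.getvalue()
    return result


csv_by_group = groups_to_csv(SAMPLE)
```

On the C side the writer only needs `fprintf` calls that emit one `<sample .../>` line per reading, so the file stays plain ASCII.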
