Data Extraction from PDF - database

I get 15+ PDFs a day that I have to enter into a database. They are generated from a table where the "blanks" are filled in from specific table fields. Are there any tools or Python code examples I could use to extract the data from the PDFs and either write it to the database directly or build a table I can import into the database table? The database is currently an Access .mdb.
Thanks

There are a number of approaches that will work.
One simple approach is to print the PDF file to a text file and then have Access import that text. All recent versions of Windows let you install a "text" printer that writes a document's printed output to a text file. You can have Access process a folder of PDFs, print them to text, and then import those text files. You might need some VBA to remove "pages" and some extra lines before you import the data into Access.
Another approach is to use Word (automated from Access) to open a PDF. When Word opens a PDF, it converts it to a Word document, and it will even format rows as a Word table. You can then pluck out that table data and send it to Access. You can likely pull the text out without writing it to a text file at all, or just use Word's Save As to produce a text file (you can automate this whole process from Access).
Another approach is to use the free Ghostscript library, which can extract text from a PDF (I would consider this if you did not have Word at your disposal).
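If you go the Ghostscript route and want to drive it from Python, a rough sketch like the one below would work, assuming Ghostscript is installed and on the PATH (the file names are placeholders; on Windows the console executable is usually gswin64c rather than gs):

import subprocess

# Ghostscript's txtwrite device dumps the text layer of a PDF to a plain text
# file, which Access can then import much like the "text printer" output above.
subprocess.run(
    ["gs", "-dBATCH", "-dNOPAUSE", "-q",
     "-sDEVICE=txtwrite", "-sOutputFile=report.txt", "report.pdf"],
    check=True,
)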
Which solution is best will largely depend on what software is installed on the computer running Access. Opening the PDF files with Word would be my first choice to test.

At my old job we used Cogniview, which converted PDFs to Excel spreadsheets quite quickly. If you want to use Python, a quick search turned up "PDF to XLS with Python", which seems straightforward enough.
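As a rough sketch of the Python route, the pdfplumber package (an assumption here, not the tool from that link) can pull table rows out of a PDF and write them to a CSV that Access can import; the file names and layout below are placeholders for your forms:

import csv
import pdfplumber

rows = []
with pdfplumber.open("form.pdf") as pdf:       # placeholder input file
    for page in pdf.pages:
        table = page.extract_table()           # list of rows, or None if no table found
        if table:
            rows.extend(table)

with open("form.csv", "w", newline="") as f:   # import this CSV into the Access table
    csv.writer(f).writerows(rows)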

Related

FileMaker Scripting - Sending Different Attachments

Is there a way to send mail with a different PDF file to different contacts using FileMaker?
I am aware of sending batch emails with one attachment, but I would like to send a personalized PDF to each contact, which seems not so simple.
Also
Can I add PDF files to the table itself or would I have to use the path to the file?
Example:
Table 1
**Name** [James Brown] [James Blue]
**Email** [brown.j#gmail.com] [blue.j#gmail.com]
**PDFfileAttachment** [folder/PDF/JamesBrown.pdf] [folder/PDF/JamesBlue.pdf]
So an Email for James Brown would look like:
Dear James Brown, please see the attached file.
Attachment [JamesBrown.pdf] {actual file}
and
Dear James Blue, please see the attached file.
Attachment [JamesBlue.pdf] {actual file}
I think you can solve it by creating a container field in your database and importing the PDFs into it.
Then you can use Export Field Contents[] to export each one and send it by email.
Hope it's useful.
I would like to send a personalized PDF for each contact, which seems not so simple.
Find the records of contacts you want to include and loop among them, sending mail to each one individually (i.e. without selecting the 'Collect addresses across found set' option).
Can I add PDF files to the table itself or would I have to use the path to the file?
You can do either, it's up to you. If the path to the file can be calculated (as in your example), you can calculate it right there in the Send Mail script step.
Note that you can also generate the PDF files during the process itself.
Do I understand correctly that you would actually like to personalize the PDF document(s)?
This is possible, and while maybe not entirely obvious, it is quite simple. The trick is to prepare the PDF as a form, and then fill the form fields to personalize it.
PDF has a native forms data format (called FDF), which is described in ISO 32000 (as well as the older PDF specification documents provided by Adobe, as you can find in the Acrobat SDK, downloadable from the Adobe website).
FDF is a simple structured text file, which can easily be assembled using FileMaker (I have done that routinely for several catalog projects). The easiest way to get going is to open the form in Acrobat, fill in the fields, and then export the data as FDF. This gives you the pattern to "fill in the blanks".
So, you create the FDF files using FileMaker. With them you can fill in the blank form and feed the saved document to the email system.
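To make the "fill in the blanks" pattern concrete, here is a minimal sketch of the FDF text such an export produces, written out with Python purely for illustration; FileMaker would assemble the same text with a calculation, the field names and values are placeholders, and values containing parentheses or backslashes would need escaping:

def write_fdf(path, fields):
    # Each /T (title) and /V (value) pair fills one named form field.
    entries = "".join("<< /T ({0}) /V ({1}) >>\n".format(t, v) for t, v in fields.items())
    fdf = ("%FDF-1.2\n"
           "1 0 obj\n"
           "<< /FDF << /Fields [\n" + entries + "] >> >>\n"
           "endobj\n"
           "trailer\n"
           "<< /Root 1 0 R >>\n"
           "%%EOF\n")
    with open(path, "w") as f:
        f.write(fdf)

write_fdf("JamesBrown.fdf", {"Name": "James Brown", "Email": "brown.j#gmail.com"})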
Which tool to use to fill the blank form depends on the volume you have to process. Acrobat is not very powerful (and you may end up in a bit of a legal gray zone, because Acrobat is not set up to be used as a service). There are applications made specifically for filling out forms on a server (such as FDFMerge by Appligent), and there are also several libraries with tools to fill out forms (iText or pdflib come to mind). These applications also allow you to flatten the PDF, meaning there are no longer any form fields; their contents become part of the page itself.
The resulting file can then either be attached to an email, or you can make it available on a server and send an email with a link to the file (which method you use may depend on security and privacy regulations).

Server backend: how to generate file paths for uploaded files?

I am trying to create a site where users can upload images, videos and other types of files.
I did some research, and people seem to suggest that saving the files as BLOBs in the database is a bad idea; instead, save the file paths in the database.
My questions are, if I save the file paths in a database:
1. How do I generate the file names?
I thought about computing the MD5 hash of the file name, but what if two files have the same name? Should I add the username and a timestamp, etc., to the file name? Does that even make sense?
2. What is the best directory structure?
If a user uploads images on 12/17/2013 and 12/18/2013, can I just put them in user_ABC/images/ and then create time-stamped sub-directories 20131217, 20131218, etc.? What is the best structure for all of this?
3. How do all these come together?
It seems like maintaining this system is such a pain, because the file system manipulation scripts are tightly coupled with the database operations (do I also need to worry about database transactions? Say in one transaction I updated the database but failed to modify the file system, so I need to roll back the database?).
And I think this system doesn't scale (what if my machine runs out of disk space and I need to upload the files to a second machine? What if my content lives on a cluster?)
I think my real question is:
4. Is there any existing framework/design pattern/db that handles this problem?
What is the standard way of handling this kind of problems?
Thanks in advance for your answers.
I've actually asked this same question when I was designing a social website for food chefs. I decided to store the URL of the image in a MySQL database along with the recipe. If you plan on storing multiple images for one recipe, as in my example, maybe a comma-separated value would work. When the recipe loaded on the page, I would fetch the image associated with that recipe and display it on the screen.
Since it was a hackathon and wasn't meant for production, I didn't encode the file name into something unique. However, if I were developing for production purposes, I would append a timestamp to the media file name when storing it on the server and in the database/backend.
I believe what I've proposed is the best approach for handling this scenario. Storing the image on the server as a file is not only faster, it should also take less space. I have found that when converting a standard JPG file of reasonable resolution to base64 encoding, the encoded text representation took about 30% more space. There is also the time spent encoding and decoding the file for storage and retrieval when using some BLOB type of data format instead of simply storing the file on the server.
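For what it's worth, the base64 overhead is easy to check yourself; a quick sketch (the 300 kB buffer just stands in for a typical JPG):

import base64
import os

raw = os.urandom(300_000)           # stand-in for a ~300 kB image file
encoded = base64.b64encode(raw)
print(len(encoded) / len(raw))      # roughly 1.33, i.e. about a third larger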
Using some sort of backend server scripting like PHP, you'll be able to do some pretty neat stuff with the information you have available. Fetch the result from the database, and load it in from the page using HTML.
As far as I know, there isn't a standard way of fetching media from a database yet. Perhaps there will be one day.
There is no standard way to do this; it differs from application to application. The idea is that you need to generate a different path and file name for every upload. Here is one way:
HashId = sha1(microsecond timestamp + random(1, 1000000))
Path = /[user_id]/[first two characters of HashId]/[last two characters of HashId]/
FileName = HashId
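A minimal Python version of that idea (the names and layout are illustrative, not a standard):

import hashlib
import os
import random
import time

def upload_path(user_id):
    raw = "{0}{1}".format(time.time(), random.randint(1, 1_000_000))
    hash_id = hashlib.sha1(raw.encode()).hexdigest()
    # e.g. 42/ab/cd/<full hash>: the two-character directories keep any single
    # folder from accumulating millions of files.
    return os.path.join(str(user_id), hash_id[:2], hash_id[-2:], hash_id)

print(upload_path(42))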

Creating Excel File in Apex code

I'm wondering, is it possible to create an Excel or CSV file in Apex code (as an attachment)? Currently I only see that it works with a VF page, but I'm looking to do it in Apex code without using a VF page, and I don't see any options.
Any help is appreciated.
Thanks
Unfortunately there is no native CSV library or API in Apex to handle the creation of CSV files, and certainly nothing for XLS documents, although you could use the native XML DOM classes to create an Excel-friendly XML document.
Writing a CSV should be fairly simple; it's basically one big string with a carriage return and/or line feed at the close of each record. You'll need to carefully manage the escaping of field values, for example wrapping each value in double quotes and doubling any embedded quotes with something like value.replaceAll('"', '""'), which in turn will chew up your script statements pretty badly.
Next, create a new Document(), convert your string to a Blob using Blob.valueOf(String) and instantiate that blob as the body of the Document.
If you are planning on writing a very large document, you may want to consider writing to some other format type or offloading this processing to a remote system (EC2?) and letting it respond with the Document in time.
I have some Apex CSV utilities already written if you need them.

What is a "[CS Format=A]" header for?

I'm trying to identify a type of file whose contents start with "[CS Format=A]".
I've extracted files from blobs in a database I was handed. I do not have access to the software that created this database. There is a column that I assume signifies compression (it's called COMPRESS). Also in the database were the names of the files and their extensions. I've extracted all the files out of the database and everything works, except that anything marked as COMPRESS is not readable as its own file type (i.e. if it was a PDF before it was stored in this DB, now that I've pulled them all back out it is not parsable as a PDF like the other non-COMPRESS PDFs). When I crack them open and look at them, the first 13 bytes are always "[CS Format=A]" (which I swear I've seen somewhere before, but can't for the life of me remember where), followed by binary data. Magic can't tell me what I'm looking at, and Google is not being very helpful with my very strict search term. These were stored in an MSSQL database before I was given the files, most likely SQL Server 2005 by the time it was pulled.
Probably not helpful, but just to make sure... Oracle will decompress automatically on select.
If it's still compressed afterwards then you're looking at some 3rd party component which can be almost anything, but I'd start with testing Mac/Win first before you run through all the 3rd party compression tools.

Simplest way to implement a database for a Song list type program (Using Visual C++)

Working on a school project, the program is supposed to read from a text file that has a record about a song in every line, fields separated by ";".
Anyway, I have no knowledge of databases, and I just want the quickest way to create a database from that text file. I will also need to change some of the fields of the records once in a while from the program, and the program needs to search through the database based on certain fields.
So far none of our projects kept a database, so when we closed the program, all the information was gone; now I actually need to keep some information for the next time the program runs. What's the fastest way to accomplish this?
I also want to be able to keep some information about the software itself, like the path of the original text file for weekly updates. Where can I save information like that?
EDIT: it doesn't have to be an actual database, as long as I can search and edit it efficiently.
If you can use an SQL database, I'd suggest the simple file-based database SQLite.
With SQLite, you can query, insert and update records by executing regular SQL statements.
Here you will find an introduction to the C++ interface. It's easy to embed SQLite support in an application because SQLite comes as a library, meaning a few header files and one or two binary archives.
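As a hedged illustration of those "regular SQL statements", the sketch below uses Python's sqlite3 module purely for brevity; the same CREATE/INSERT/SELECT statements run unchanged through SQLite's C/C++ API, and the file and column names are assumptions about the assignment:

import sqlite3

con = sqlite3.connect("songs.db")   # creates the database file if it doesn't exist
con.execute("CREATE TABLE IF NOT EXISTS songs (title TEXT, artist TEXT, album TEXT)")
con.execute("INSERT INTO songs VALUES (?, ?, ?)", ("Some Song", "Some Artist", "Some Album"))
con.commit()
for row in con.execute("SELECT * FROM songs WHERE artist = ?", ("Some Artist",)):
    print(row)
con.close()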
Your delimited text file is already a database. You can add, delete, and modify records using the standard text file routines provided by the standard C++ libraries.
Alternatively, you can import your textfile into SQL Server using BULK INSERT.
Finally, you can access your CSV (comma-delimited text) file using SQL queries. You need to find the correct connection string. See http://www.connectionstrings.com/textfile.
