Good morning all,
I am currently working on a machine learning project whose goal is supervised classification on a set of data. My data is a large number of PDF files; each file has a specific class, and the goal is to use these files as a training dataset in order to predict the class of new files.
My problem is that I don't know how to build the training dataset, since the classification algorithm must train on the content of each file, but my training data frame only contains the class of each file and the name of the file in question. How do I include the content of each PDF file in my training data frame?
Thank you in advance for your help
PDF files usually contain a mix of text, images, charts and so on, so they cannot be directly transformed into the vectors of numbers a machine learning algorithm expects. First you need to extract the information of interest from your files.
In this regard, you might want to start with a library that can extract text, and see what happens. For Python, a good starting point is PyPDF2; tutorials are easy to find.
If this does not work as expected, my advice would be to try an OCR tool, which reads the PDF as an image and extracts the text from it. In Python, pytesseract is one of the most widely used, but it is not the only one.
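To connect this back to the question, here is a minimal sketch of building the training data frame. It assumes a recent PyPDF2 (3.x), scikit-learn, and a labels.csv whose 'filename' column holds paths to the PDFs and whose 'class' column holds the labels; all of those names are placeholders for your own layout:

    import pandas as pd
    from PyPDF2 import PdfReader
    from sklearn.feature_extraction.text import TfidfVectorizer

    labels = pd.read_csv("labels.csv")   # columns: filename, class

    def extract_text(path):
        # extract_text() can return None on image-only pages, hence the or ""
        reader = PdfReader(path)
        return " ".join((page.extract_text() or "") for page in reader.pages)

    labels["text"] = labels["filename"].apply(extract_text)

    # The text column can now be vectorized and fed to any classifier
    X = TfidfVectorizer().fit_transform(labels["text"])
    y = labels["class"]

If the extracted text comes back empty, the PDFs are probably scanned images, which is exactly the case where the OCR route mentioned above applies.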
I am looking at how to implement PDF merging with raw VB code so that the code may be invoked by a bot for business process automation.
The software used to create the bot provides a function to invoke VB code, but I don't believe it can access any externally imported libraries because it expects plain source, so I essentially need to produce code that one could run in a VB shell environment without anything fancy (or convenient, it seems).
All the research I've done so far points me in the direction of external packages I would need to install, such as iText; this is what I'm looking to avoid.
(previous iText employee here)
PDF is not an easy (binary) format.
Essentially, blobs of information (text that has to be rendered, fonts, images, vector graphics, etc) are compressed and gathered into objects.
Each object gets a number. Objects are allowed to reference each other (a piece of text might say 'I want to be rendered with font 4433').
All object numbers and their byte offsets in the file are gathered in the cross-reference (often called XREF) table.
A PDF includes a 'Pages' dictionary object that tells the viewer which objects belong on which page.
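To make that concrete, here is a rough Python sketch (Python purely for illustration; the logic is what you would port to VB) that recovers each object's byte offset from a classic plain-text XREF table. Note that modern PDFs often use compressed cross-reference streams instead, which this does not handle:

    import re

    def read_xref_offsets(path):
        """Return {object_number: byte_offset} from a classic XREF table."""
        with open(path, "rb") as f:
            data = f.read()
        # The file ends with: startxref <offset of the xref table> %%EOF
        m = re.search(rb"startxref\s+(\d+)\s+%%EOF", data[-2048:])
        if not m:
            raise ValueError("no classic startxref trailer found")
        start = int(m.group(1))
        if data[start:start + 4] != b"xref":
            raise ValueError("not a plain xref table (xref streams?)")
        offsets = {}
        lines = data[start:].splitlines()
        i = 1  # skip the literal 'xref' keyword line
        while i < len(lines):
            header = lines[i].split()
            if len(header) != 2 or not header[0].isdigit():
                break  # reached the trailer dictionary
            first, count = int(header[0]), int(header[1])
            for n in range(count):
                entry = lines[i + 1 + n].split()
                # each entry: 10-digit offset, 5-digit generation, 'n' or 'f'
                if entry[2] == b"n":   # 'n' = in use, 'f' = free
                    offsets[first + n] = int(entry[0])
            i += 1 + count
        return offsets

Real files complicate this further: incremental updates chain several XREF sections together via /Prev entries in the trailer, and all of the offsets above have to be rewritten once objects from a second file are appended and renumbered.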
In order to merge PDF files, you would need to:
- read all XREF tables of all files
- adjust all of those entries to the correct new byte offsets
- update various dictionary objects within the PDF file that tell it where all the objects per page are kept
This is by no means a trivial task, but it can be done using only VB.
If you are serious about implementing a robust, scalable version of this tool, perhaps it's better to look at the iText source code and try to port it to VB?
I am trying to create a site where users can upload images, videos and other types of files.
I did some research, and people seem to suggest that saving the files as BLOBs in the database is a bad idea; instead, save the file paths in the database.
My questions are, if I save the file paths in a database:
1. How do I generate the file names?
I thought about computing the MD5 hash of the file name, but what if two files have the same name? Would appending the username, a time-stamp, etc. to the file name make sense?
2. What is the best directory structure?
If a user uploads images on 12/17/2013 and 12/18/2013, can I just put them in user_ABC/images/ and create time-stamped sub-directories 20131217, 20131218, etc.? What is the best structure for all of this?
3. How do all these come together?
It seems like maintaining this system is a real pain, because the file system manipulation scripts are tightly coupled with the database operations (I may also need to worry about database transactions: say in one transaction I updated the database but failed to modify the file system, so I need to roll back the database?).
And I don't think this system scales (what if my machine runs out of disk space and I need to store files on a second machine? What if my content lives on a cluster?).
I think my real question is:
4. Is there any existing framework/design pattern/db that handles this problem?
What is the standard way of handling this kind of problem?
Thanks in advance for your answers.
I actually asked this same question when I was designing a social website for food chefs. I decided to store the URL of the image in a MySQL database along with the recipe. If you plan on storing multiple images for one recipe, a comma-separated list of URLs could work. When the recipe loaded on the page, I would fetch the associated image and render it on screen.
Since it was a hackathon project and wasn't meant for production, I didn't encode the file name into something unique. If I were developing for production, however, I would append a time-stamp to the media file name when storing it on the server and in the database.
I believe what I've proposed is the best structure for handling this scenario. Storing the image on the server is not only faster, it should also take less space. I have found that when converting a standard JPG file of reasonable resolution to base64, the encoded text representation took about 30% more space. There is also the time spent encoding and decoding the file when using some BLOB-type format instead of simply storing the file on the server.
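That overhead is easy to check: base64 turns every 3 bytes into 4 characters, so the encoded form is roughly a third larger. A quick sketch:

    import base64, os

    data = os.urandom(30_000)          # stand-in for a ~30 KB image
    encoded = base64.b64encode(data)
    print(len(encoded) / len(data))    # ~1.33, i.e. about a third larger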
Using some sort of backend server scripting like PHP, you'll be able to do some pretty neat stuff with the information you have available: fetch the result from the database and render it in the page's HTML.
As far as I know, there isn't a standard way of fetching media from a database yet. Perhaps there will be one day.
There is no standard way to do this; it differs from application to application. The idea is that you need to generate a distinct path + file name for every upload. Here is one way, sketched in PHP: hash the current microtime plus a random number, then shard directories by the first and last two characters of the hash:

    $hashId   = sha1(microtime() . random_int(1, 1000000)); // unique per upload
    $path     = "/$userId/" . substr($hashId, 0, 2) . "/" . substr($hashId, -2);
    $fileName = $hashId;
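Putting the two answers together, here is a minimal end-to-end sketch in Python, using SQLite and a local uploads/ directory as stand-ins for your real database and storage (every name here is hypothetical). Hashing the content plus a timestamp sidesteps the "two files with the same name" problem from question 1, and the two-character sub-directories keep any single directory from growing huge:

    import hashlib, os, sqlite3, time

    def save_upload(user_id, original_name, data):
        # Content + timestamp hash: identical names (or bytes) won't collide
        unique = hashlib.sha1(data + str(time.time()).encode()).hexdigest()
        ext = os.path.splitext(original_name)[1]
        rel_path = os.path.join("uploads", user_id, unique[:2], unique + ext)
        os.makedirs(os.path.dirname(rel_path), exist_ok=True)
        with open(rel_path, "wb") as f:
            f.write(data)
        # Store only the path in the database, never the bytes themselves
        db = sqlite3.connect("media.db")
        db.execute("CREATE TABLE IF NOT EXISTS media (user_id TEXT, path TEXT)")
        db.execute("INSERT INTO media VALUES (?, ?)", (user_id, rel_path))
        db.commit()
        return rel_path

As for the scaling and transaction worries in questions 3 and 4, the usual answer today is to hand storage to a dedicated object store (e.g. Amazon S3) and keep only keys/URLs in the database, rather than rolling your own cluster of disks.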
I'm new here, so hello everyone!
I wrote a few things in the Processing language and now I need to switch to Processing.js. I need to write an app that first scans the sketch folder to prepare a list of the provided files, and what was straightforward in Processing is not in PJS.
I'm currently searching the web, but I've only found solutions for classic Processing. I know that JavaScript is restricted and in general can't access files on the user's side, but is there any way to list the sketch's own files?
The only way that comes to my mind is to list them server-side via PHP and generate the .pde file dynamically depending on the sketch folder's contents. But the catch is to not use any other language.
Thanks in advance for help!
Processing.js running on a website can only get information that URLs can provide, and since there are no "dir listings" on the web, it can't grab a directory listing for a URL for you to work with. However, depending on what you really want to do, there might be a way to make it work without resorting to PHP.
Assuming you have your Pjs page running at www.example.org/index.html, and you want to list the content of www.example.org/sketch/, one option is to have a file www.example.org/sketch/list.txt containing all the filenames the sketch can access, and simply grab that with a
String[] fileNames = loadStrings("./sketch/list.txt")
instruction.
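If keeping list.txt in sync by hand is a pain, it can be regenerated whenever you deploy the sketch; that step runs outside the browser, so it arguably stays within the "no other language at runtime" constraint. A hypothetical Python helper (any scripting language would do the same):

    # Regenerate sketch/list.txt from the folder's current contents
    import os

    files = sorted(f for f in os.listdir("sketch") if f != "list.txt")
    with open("sketch/list.txt", "w") as out:
        out.write("\n".join(files))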
If you can give an example of what you mean by "I need to write an app that first scans the sketch folder to prepare a list of provided files", a more specific solution is probably possible (i.e., what are the files, what does the user need them for, etc.).
I would like to design a file format which includes content like inks (saved as points), texts, pictures, records and so on.
I can't find any useful guidelines about how to design a file format. Features to be considered are:
Partial synchronization (sync individual elements without always syncing the whole document)
Quick response (like email: when the main content finishes downloading, show it, and keep downloading the attachments in the background)
Compatibility (when the file format is upgraded, an older parser can still read the parts of the document it recognizes)
Maybe I can use XML to construct the document. Are there any guidelines or design patterns for a beginner to learn how to design a file format?
For maximum "compatibility", I'd recommend a format consisting of labelled chunks, like TIFF. If the reading code ignores unknown chunks, then your 2.0 installations can still read 3.0 files (at least as much of them as they are able).
You can organize the chunks with a master "directory" chunk which describes the hierarchy and placement of elements (some kind of flattened tree).
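As an illustration, here is a minimal sketch of such a labelled-chunk container in Python, assuming 4-byte ASCII tags and little-endian 32-bit lengths (the tag names are made up; TIFF and RIFF use this same basic idea):

    import struct

    def write_chunk(f, tag, payload):
        # 4-byte tag + 32-bit payload length + raw payload bytes
        f.write(tag + struct.pack("<I", len(payload)) + payload)

    def read_chunks(f):
        # Yields (tag, payload); callers simply skip tags they don't know,
        # which is what lets a 2.0 reader tolerate 3.0 files
        while True:
            header = f.read(8)
            if len(header) < 8:
                return
            tag, (length,) = header[:4], struct.unpack("<I", header[4:])
            yield tag, f.read(length)

    with open("doc.bin", "wb") as f:
        write_chunk(f, b"INKS", b"...serialized ink points...")
        write_chunk(f, b"TEXT", b"hello")

    with open("doc.bin", "rb") as f:
        for tag, payload in read_chunks(f):
            if tag == b"TEXT":          # unknown tags fall through harmlessly
                print(payload.decode())

The master "directory" chunk mentioned above would just be one more tagged chunk that the writer emits first and newer readers know how to interpret.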
A good source for more is The Encyclopedia of Graphics File Formats. Copies are super-cheap on Amazon because the old edition has some pages of the TOC mixed up.