Using AWS Textract to classify a document page's structure into headlines and paragraphs - reactjs

I have been searching all over the internet for a way to extract a meaningful page structure from an uploaded document (headlines/titles and paragraphs). The document could be of any format but I'm currently testing with PDF.
Example of what I'm trying to do:
Upload PDF file client-side
Save it to S3
Request AWS Textract to detect or analyze text in that S3 object
Classify the output into: Headlines and Paragraphs
My application works fine up to step 3. AWS Textract outputs the result as blocks; a block's type can be PAGE, LINE, or WORD, and each block has a Geometry object which includes bounding box details and a Polygon object as well (more info here: AnalyzeCommandOutput (JS SDK) and AnalyzeCommandOutput (General)).
However, I still need to process the output and classify it into headlines and paragraphs (e.g. one LINE block could be a headline and the following three LINE blocks a single paragraph), so the output of step 4 would be:
{
  "Headlines": ["Headline1", "Headline2", "Headline3"],
  "Paragraphs": [{"Paragraph": "Paragraph1", "Headline": "Headline1"},
                 {"Paragraph": "Paragraph2", "Headline": "Headline1"}]
}
The unsuccessful methods I tried:
Calculate the size of a line's bounding box relative to the page size and compare it to the average bounding box size: if it's larger, it's a headline; if it's smaller than or equal, it's a paragraph (not practical; a rough sketch of this heuristic is shown at the end of this question)
Use other PDF parsers but most of them just output unformatted text
Use the "Query" option of analyze document input but it would require to define each line in the PDF as key value pairs to output something meaningful. As per here So the PDF content would be something like:
Headline1: Headline
Paragraph1: Paragraph
Paragraph2: Paragraph
Headline2: Headline
Paragraph1: Paragraph
I'm not asking for a coding solution. Maybe I'm overcomplicating things and there is a simpler way to do it. Maybe someone has tried something similar and can point me in the right direction or approach.
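For reference, this is roughly what the first (unsuccessful) approach from the list above looks like against the Blocks output, sketched in Python; the 1.3 threshold and the rule that consecutive non-headline lines form one paragraph are assumptions on my part:

def classify_lines(blocks, ratio=1.3):
    # Keep only LINE blocks; each one carries its text and a bounding box
    # whose Width/Height are fractions of the page size.
    lines = [b for b in blocks if b["BlockType"] == "LINE"]
    heights = [b["Geometry"]["BoundingBox"]["Height"] for b in lines]
    avg_height = sum(heights) / max(len(heights), 1)

    result = {"Headlines": [], "Paragraphs": []}
    current_headline = None
    paragraph_lines = []
    for block in lines:
        height = block["Geometry"]["BoundingBox"]["Height"]
        if height > ratio * avg_height:          # assumed "headline" threshold
            if paragraph_lines:
                result["Paragraphs"].append(
                    {"Paragraph": " ".join(paragraph_lines),
                     "Headline": current_headline})
                paragraph_lines = []
            current_headline = block["Text"]
            result["Headlines"].append(current_headline)
        else:
            paragraph_lines.append(block["Text"])
    if paragraph_lines:
        result["Paragraphs"].append(
            {"Paragraph": " ".join(paragraph_lines),
             "Headline": current_headline})
    return result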

Related

Meshroom: how to access the final camera parameters?

I am trying to write a script which loads the camera parameters from Meshroom and imports them into a CAD program. My first understanding was that these parameters (position, rotation matrix, focal length, etc.) are contained in the JSON file cameras.sfm in the StructureFromMotion subdirectory.
After importing these parameters into Rhino3D and comparing the resulting views onto the 3D mesh with the undistorted photographs in the PrepareDenseScene directory, I find surprisingly large discrepancies. The mesh that resulted from the run was good, so I think the deviation is because the parameters in cameras.sfm are not the final ones. This assumption is also supported by the fact that the file only contains the focal length as read from the input images' EXIF information and no refined values. So my question is:
How can I access the final camera parameters from the output of Meshroom?
Knowing this would help me a lot for re-building a photogrammetry/CAD pipeline I had previously implemented for VisualSFM + CMPMVS.
Many thanks!
EDIT: As this is my first post, I am not able to create a new tag for Meshroom. Perhaps this could be added by someone else? Thanks!

Iterating through a sequence of images in TensorFlow

I have a database with images numbered from 1 till 7500.
I need to feed these images into my model in tensorflow in the following manner:
grab the first 100 images, that is, 1 through 100; then grab another 100 images shifted by one, so the next batch is 2 through 101, the one after that is 3 through 102, and so on...
The reason for this behavior is that I am using a recurrent neural network, and the images being fed are faces detected from a video. Therefore, I need to feed sequences of images in which each image directly follows the previous one.
Any help is much appreciated!!
I don't have a perfect solution for your question, but this one might help you.
I'm assuming that you are using tfrecords to build your inputs, because if you are feeding numpy arrays to the model directly this problem doesn't come up.
Supposing your image files are listed like ["image_0", ..., "image_N"], you can build the i-th tf.Example with ["image_i", ..., "image_i+100"] as a feature.
After dequeuing, you get a tensor containing the names of those images; unstack it, read the image contents from those names with tf.read_file, decode them to images with tf.image.decode_image, concat them back into one tensor, and send it to your model as input.
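If you are on a newer TensorFlow, the same sliding-window idea can be expressed with the tf.data API instead of queues. A minimal sketch, where the file names, frame size, and batch size are placeholders:

import tensorflow as tf

SEQ_LEN = 100

# Placeholder list of frame file names, in the order they appear in the video.
filenames = ["frames/image_%d.png" % i for i in range(1, 7501)]

def load_image(path):
    # Read and decode one frame, resize it, and scale pixel values to [0, 1].
    img = tf.io.read_file(path)
    img = tf.image.decode_png(img, channels=1)
    img = tf.image.resize(img, (64, 64))              # assumed frame size
    return tf.cast(img, tf.float32) / 255.0

dataset = (
    tf.data.Dataset.from_tensor_slices(filenames)
    .window(SEQ_LEN, shift=1, drop_remainder=True)    # 1-100, 2-101, 3-102, ...
    .flat_map(lambda w: w.batch(SEQ_LEN))             # each window -> tensor of 100 paths
    .map(lambda paths: tf.map_fn(load_image, paths,
                                 fn_output_signature=tf.float32))
    .batch(4)                                         # batch of sequences for the RNN
)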

PDF: how do I find out how much space the text will occupy when rendered?

I am writing a little PDF library in C. When generating PDF source code that is responsible for rendering text, I need to know how much space the rendered text occupies in order to render the next paragraph correctly.
How do I find out?
Thank you!
The mechanisms and math of PDF text rendering are exhaustively explained in the PDF specification ISO 32000-1. Most important are chapters 8 Graphics and 9 Text.
Essentially you need to know the current graphics state (which should be easy because you, after all, are the one creating the PDF) and the metrics of the font you use, and then calculate.
Most of these details are governed by the operators and calculations described in chapter 9, but one should not forget the current transformation matrix described in chapter 8.
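As a rough illustration of the calculation for a simple one-byte font (widths taken from the font's /Widths array in 1/1000 text-space units; all values below are placeholders), sketched in Python:

def text_advance(text, glyph_widths, font_size,
                 char_spacing=0.0, word_spacing=0.0, horiz_scaling=100.0):
    # Sum of per-glyph advances as described in ISO 32000-1, section 9.4.4:
    # tx = (w0 * Tfs + Tc + Tw_for_spaces) * Th
    total = 0.0
    for code in text.encode("latin-1"):
        w0 = glyph_widths.get(code, 500) / 1000.0   # glyph width in text space
        total += w0 * font_size + char_spacing
        if code == 0x20:                            # word spacing applies to ASCII space
            total += word_spacing
    return total * horiz_scaling / 100.0            # horizontal scaling is a percentage

# Example: 12 pt font with an assumed constant glyph width of 600/1000.
widths = {code: 600 for code in range(256)}
print(text_advance("Hello world", widths, 12.0))

The result is in text-space units; the text matrix and the current transformation matrix then map it to user and device space.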

pybrain image input to dataset for Neural Network

I'm trying to write a neural network that (after being properly trained) identifies certain road signs and returns a different output for each type of sign.
Before I started to train my network, I noticed on the pybrain website that their datasets are always an array of values, each entry containing an input and a target. The images I have for my NN have been converted to grayscale pixel data (a simple array of numbers). To train each set of data, do I need to somehow add a target value for each pixel? And if so, how would I go about doing that?
QUICK ANSWER
No, you don't need a target for every single pixel; you treat the pixels from a single image as your input data and add a target to that data.
LONG ANSWER
What you are trying to do is solve a classification problem. You have an image represented by an array of numbers and you need to classify it as one class from a limited set of classes.
So let's say that you have 2 classes: prohibition signs (I'm not a native speaker, I don't know what you call signs that forbid something) and information signs. Let's say that prohibition signs are our class 1 and information signs are class 2.
Your data set should look like this:
([representation of sign in numbers], class) - single sample
After that, since it's a classification problem, I recommend using the _convertToOneOfMany() method of the DataSet class to convert your targets into multiple outputs.
I've answered a similar question here, go check it out.
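A minimal sketch of what that looks like in pybrain (the 32x32 image size and the random data are just placeholders for your grayscale pixel arrays):

import numpy as np
from pybrain.datasets import ClassificationDataSet

n_pixels = 32 * 32                                  # assumed flattened image size
ds = ClassificationDataSet(n_pixels, nb_classes=2,
                           class_labels=['prohibition', 'information'])

# One sample = (all pixels of one image, a single class index for that image).
for _ in range(10):
    pixels = np.random.rand(n_pixels)               # stand-in for real pixel data
    label = np.random.randint(0, 2)
    ds.addSample(pixels, [label])

# Expand the single class index into one output unit per class (one-hot targets).
ds._convertToOneOfMany()
print(ds.indim, ds.outdim)                          # 1024, 2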

Simple Multi-Blob Detection of a Binary Image?

Suppose there is a given 2D array of an image where thresholding has been done, so it now contains binary information.
Is there any particular way to process this image so that I get the coordinates of multiple blobs in the image?
I can't use OpenCV because this process needs to run simultaneously on 10+ simulated robots in a custom simulator written in C.
I need the blobs' xy coordinates, but first I need to find those multiple blobs.
The simplest criterion of pixel-group size should be enough, but I don't have any clue how to start the coding.
PS: A single blob should be no problem. The problem is multiple blobs.
Just a head start?
Have a look at QuickBlob, which is a small, standalone C library that sounds perfectly suited to your needs.
QuickBlob comes with a small command-line tool (csv-blobs) that outputs the position and size of each blob found within the input image:
./csv-blobs white image.png
X,Y,size,color
28.37,10.90,41,white
51.64,10.36,42,white
...
An example visualization of the detected blobs can be produced with the tiny show-blobs.py Python utility that comes with QuickBlob.
You can go through the binary image labeling the connected parts with an algorithm like the following:
Create a 2D array of ints, labelArray, that will hold the labels of the connected regions and initiate it to all zeros.
Iterate over each binary pixel, p, row by row
A. If p is true and the corresponding value for this position in the labelArray is 0 (unlabeled), assign it to a new label and do a breadth-first search that will add all surrounding binary pixels that are also true to that same label.
The only issue now is if you have multiple blobs that are touching each other. Because you know the size of the blobs, you should be able to figure out how many blobs are in a given connected region. This is the tricky part. You can try doing a k-means clustering at this point. You can also try other methods like using binary dilation.
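A compact sketch of that labeling pass (written in Python for brevity; the same two nested loops plus a queue translate directly to C):

from collections import deque

def label_blobs(binary):
    # Connected-component labeling by breadth-first search on a 2D binary image.
    # binary is a list of lists of 0/1; returns (label_array, blob_count).
    h, w = len(binary), len(binary[0])
    labels = [[0] * w for _ in range(h)]
    current = 0
    for y in range(h):
        for x in range(w):
            if binary[y][x] and labels[y][x] == 0:
                current += 1                        # start a new blob
                labels[y][x] = current
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    # 4-connected neighbours; add diagonals for 8-connectivity
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w \
                           and binary[ny][nx] and labels[ny][nx] == 0:
                            labels[ny][nx] = current
                            queue.append((ny, nx))
    return labels, current

Once every pixel is labeled, the xy coordinate of each blob is just the mean of the coordinates carrying that label.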
I know that I am very late to the party, but I am just adding this for the benefit of people who are researching this problem.
Here is a nice description that might fit your needs.
http://www.mcs.csueastbay.edu/~grewe/CS6825/Mat/BinaryImageProcessing/BlobDetection.htm
