What is the '-' in multipart/form-data?

I have a question about multipart/form-data. In the HTTP headers of a multipart POST I see Content-Type: multipart/form-data; boundary=-----...---boundaryNumber. How many '-' characters go between the '=' and the boundaryNumber?

Not a single - is mandatory. You can have any number of them. It is actually a mystery to me why user-agents tend to add so many. It is probably traditional because in the old days, when people still regularly looked at the actual protocol traffic, it provided some nice visual separation. Nowadays it is pointless.
Note however, that when you use the boundary in the stream, it must be prefixed by two hyphens (--). That’s part of the protocol. Of course, the fact that most user-agents use lots of hyphens in their boundary makes this very hard to see by example.
Furthermore, the last boundary (which marks the end of the message) is prefixed and suffixed by two hyphens (--).
So in summary, you could call your boundary OMGWTFPLZDIEKTHX, and then your traffic could look like this:
Content-Type: multipart/form-data; boundary=OMGWTFPLZDIEKTHX

--OMGWTFPLZDIEKTHX
Content-Type: text/plain

First part (plain text).
--OMGWTFPLZDIEKTHX
Content-Type: text/html

<html>Second part (HTML).</html>
--OMGWTFPLZDIEKTHX--
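To make the delimiter rules concrete, here is a minimal Python sketch that assembles a body like the one above. The function name build_multipart and its arguments are made up for illustration. It only writes a Content-Type header per part, to mirror the example; real form-data parts also carry a Content-Disposition header.

```python
def build_multipart(parts, boundary):
    """Join (content_type, body_bytes) pairs into one multipart body.

    Hypothetical helper for illustration only.
    """
    # In the stream, the boundary is always prefixed by two hyphens.
    delimiter = b"--" + boundary.encode("ascii")
    chunks = []
    for content_type, body in parts:
        if delimiter in body:
            raise ValueError("boundary occurs inside a part; pick another one")
        chunks.append(delimiter + b"\r\n")
        chunks.append(b"Content-Type: " + content_type.encode("ascii") + b"\r\n")
        chunks.append(b"\r\n")  # blank line separates part headers from content
        chunks.append(body + b"\r\n")
    # The closing delimiter is prefixed AND suffixed by two hyphens.
    chunks.append(delimiter + b"--\r\n")
    return b"".join(chunks)


body = build_multipart(
    [
        ("text/plain", b"First part (plain text)."),
        ("text/html", b"<html>Second part (HTML).</html>"),
    ],
    boundary="OMGWTFPLZDIEKTHX",
)
print(body.decode("ascii"))
```

Note that the boundary string itself contains no hyphens at all; every hyphen in the output comes from the protocol.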

The number of dashes depends on how many you want there. It can be zero, if you like -- it's just that more dashes make the boundary more obvious.
The boundary consists of a line containing two dashes plus everything after "boundary=". So if your header said boundary=ABC, the boundary looks like
--ABC

In your boundary definition, no hyphens are required. When using that boundary to separate two distinct body parts, you must start with two hyphens, followed by your previously-defined boundary string.
This is explained in RFC 1341 (MIME); see its Multipart section for additional information.

It is completely arbitrary.
The point of the boundary is to define the beginning and ending of your data. It does not matter what it is, as long as it is not part of the content.

The multipart/form-data media type can be used by a wide variety of applications and transported by a wide variety of protocols as a way of returning a set of values as the result of a user filling out a form.
Multipart/form-data follows the model of multipart MIME data streams. A multipart/form-data body contains a series of parts separated by a boundary.
Example of one part in a multipart/form-data response:
Each part has four important fields:
--<<boundary_value>>
Content-Disposition: form-data; name="<<field_name>>"
Content-Type: type of the data
<<field_value>>
The "boundary" parameter is one of the keys to reading a multipart response:
As with other multipart types, the parts are delimited with a boundary delimiter, constructed using CRLF, "--", and the value of the "boundary" parameter. The boundary is supplied as a "boundary" parameter to the multipart/form-data type. The boundary delimiter MUST NOT appear inside any of the encapsulated parts, and it is often necessary to enclose the "boundary" parameter values in quotes in the Content-Type header field.
Resource - https://datatracker.ietf.org/doc/html/rfc7578
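As a rough illustration of how those fields are read back, here is a small Python sketch that splits a body on the CRLF + "--" + boundary delimiter and pulls out each field's name and value. The function parse_form_data and the sample field names are made up for this example, and it ignores file uploads, quoted boundaries, and header edge cases that a real parser must handle.

```python
import re


def parse_form_data(body: bytes, boundary: str) -> dict:
    """Return {field_name: raw_value_bytes} for a multipart/form-data body.

    Hypothetical helper for illustration only.
    """
    delimiter = b"\r\n--" + boundary.encode("ascii")
    # Prepend CRLF so the very first delimiter matches like the others.
    sections = (b"\r\n" + body).split(delimiter)
    fields = {}
    for section in sections[1:]:       # sections[0] is the preamble, usually empty
        if section.startswith(b"--"):  # "--boundary--" closes the message
            break
        # Drop the CRLF that ends the delimiter line, then split headers from value.
        raw_headers, _, value = section[2:].partition(b"\r\n\r\n")
        match = re.search(rb'\bname="([^"]*)"', raw_headers)
        if match:
            fields[match.group(1).decode("utf-8")] = value
    return fields


sample = (
    b"--XYZ\r\n"
    b'Content-Disposition: form-data; name="title"\r\n'
    b"\r\n"
    b"hello\r\nworld\r\n"
    b"--XYZ\r\n"
    b'Content-Disposition: form-data; name="count"\r\n'
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"3\r\n"
    b"--XYZ--\r\n"
)
print(parse_form_data(sample, "XYZ"))
```

The first value contains a CRLF of its own, which is fine: only CRLF followed by the full "--boundary" sequence ends a part.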

Related

Using AWS Textract to classify given document pages structure into headlines and paragraphs

I have been searching all over the internet for a way to extract a meaningful page structure from an uploaded document (headlines/titles and paragraphs). The document could be of any format but I'm currently testing with PDF.
Example of what I'm trying to do:
1. Upload PDF file client-side
2. Save it to S3
3. Request AWS Textract to detect or analyze text in that S3 object
4. Classify the output into: Headlines and Paragraphs
My application works fine through step 3: AWS Textract outputs the result as blocks. Block types can be page, line or word, and each block has a Geometry object which includes bounding box details and a Polygon object as well (more info here: AnalayzeCommandOutput(JS_SDK) and AnalayzeCommandOutput(General)).
However, I still need to process the output and classify it into headlines (e.g. 1 block of type line could be a headline and the following 3 blocks of type line are a single paragraph) so the output of step 4 would be:
{
"Headlines": ["Headline1", "Headline2", "Headline3"],
"Paragraphs": [{"Paragraph": "Paragraph1", "Headline": "Headline1"}, {"Paragraph": "Paragraph2", "Headline": "Headline1"}]
}
The unsuccessful methods I tried:
Calculate the size of a line's bounding box relative to the page size and compare it to the average bounding box size: if it's greater, it's a headline; if it's smaller or equal, it's a paragraph (not practical)
Use other PDF parsers but most of them just output unformatted text
Use the "Query" option of the analyze document input, but that would require defining each line in the PDF as a key-value pair to output something meaningful. So the PDF content would be something like:
Headline1: Headline
Paragraph1: Paragraph
Paragraph2: Paragraph
Headline2: Headline
Paragraph1: Paragraph
I'm not asking for a coding solution. Maybe I'm overcomplicating things and there is a simpler way to do it. Maybe someone has tried something similar and can point me into the right direction or approach.
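For reference, the first attempted method (comparing each line's bounding-box height with a typical height) can be sketched in Python roughly as below. This is only an illustration of that heuristic, not a recommended solution. The function name classify_lines and the ratio threshold are made up, the sketch uses the median height rather than the average, and it assumes blocks shaped like Textract's output, with BlockType, Text and Geometry.BoundingBox.Height.

```python
from statistics import median


def classify_lines(blocks, ratio=1.3):
    """Split LINE blocks into headlines and paragraphs by relative height.

    Hypothetical helper; `ratio` is an arbitrary threshold to tune per document.
    """
    lines = [b for b in blocks if b.get("BlockType") == "LINE"]
    if not lines:
        return {"Headlines": [], "Paragraphs": []}
    typical = median(b["Geometry"]["BoundingBox"]["Height"] for b in lines)

    headlines, paragraphs = [], []
    current_headline, buffer = None, []

    def flush():
        # Emit the body lines collected so far under the current headline.
        if buffer:
            paragraphs.append(
                {"Paragraph": " ".join(buffer), "Headline": current_headline}
            )
            buffer.clear()

    for block in lines:
        height = block["Geometry"]["BoundingBox"]["Height"]
        if height > ratio * typical:
            flush()
            current_headline = block["Text"]
            headlines.append(current_headline)
        else:
            # All body lines up to the next headline are merged into one
            # paragraph; paragraph breaks are not detected here.
            buffer.append(block["Text"])
    flush()
    return {"Headlines": headlines, "Paragraphs": paragraphs}


def _line(text, height):
    # Tiny stand-in for a Textract LINE block, for demonstration only.
    return {
        "BlockType": "LINE",
        "Text": text,
        "Geometry": {"BoundingBox": {"Height": height}},
    }


sample_blocks = [
    {"BlockType": "PAGE"},
    _line("Intro", 0.05),
    _line("a", 0.02),
    _line("b", 0.02),
    _line("c", 0.02),
    _line("Next", 0.05),
    _line("d", 0.02),
]
print(classify_lines(sample_blocks))
```

The weakness is visible in the code: everything hinges on one height threshold, so bold headlines set in the body font size, or large body text, are misclassified.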

What encoding function does useSearchParams() use?

I am writing a wrapper around react-router's useSearchParams() hook that automatically handles the (de)serialization of individual search parameters. In my unit test I am attempting to verify that calling the update function returned from the useSearchParams() function will also update the location's search property properly (handling any encoding automatically).
My unit test is nearly working, but the final check is failing. I am expecting the updated value to be $%{ }#=! - and when I use encodeURIComponent on it it becomes: "%24%25%7B %20 %7D%23%3D !"
However, the value that I get back for the parameter from useLocation()'s search property is "%24%25%7B + %7D%23%3D %21"
(white space added so I can bold things for emphasis)
While very similar, and clearly the same value when decoded, react-router is encoding certain characters ever so slightly differently from the standard encodeURIComponent function. Here we see that the whitespace character is getting encoded as %20 by encodeURIComponent and as + by react-router. The exclamation mark is being left simply as ! by encodeURIComponent while react-router has encoded it as %21
Is there a standard function somewhere that will allow me to use and test against react-router's encoding? Or is there any documentation describing the differences between react-router's encoding vs the standard encodeURIComponent function so that I might write my own?
Thanks!
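The two outputs above look like two different encoding schemes: component-style percent-encoding versus the application/x-www-form-urlencoded serialization that URLSearchParams uses. That react-router builds the search string through URLSearchParams is an assumption here, not something stated in the question. The difference can be reproduced outside the browser with this Python sketch; both helper names are made up, and both only approximate the JavaScript behavior using urllib.parse.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

value = "$%{ }#=!"


def like_encode_uri_component(text: str) -> str:
    # Approximates encodeURIComponent: space becomes %20, and ! * ' ( ) stay as-is.
    return quote(text, safe="!*'()")


def like_url_search_params(text: str) -> str:
    # Approximates form-urlencoded output: space becomes "+", "!" is escaped,
    # "*" stays as-is, and "~" is escaped (quote_plus never escapes "~" itself).
    return quote_plus(text, safe="*").replace("~", "%7E")


print(like_encode_uri_component(value))  # %24%25%7B%20%7D%23%3D!
print(like_url_search_params(value))     # %24%25%7B+%7D%23%3D%21

# Both forms decode to the same value, so comparing decoded values
# is more robust in a test than comparing the encoded strings.
print(unquote(like_encode_uri_component(value)) == value)
print(unquote_plus(like_url_search_params(value)) == value)
```

If that assumption holds, one option for the unit test is to parse the location's search string and compare the decoded parameter value instead of matching the raw encoded text.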

angularjs resource slash parameters

I am using $resource to make a rest api call.
My call to that resource looks like this:
Client.get({parametres : param})
My problem is that param contains the "\" character, which makes the call fail with a 400 Bad Request response.
How can I escape the "\" character?
Thanks.
encodeURIComponent should do the trick.
The encodeURIComponent() method encodes a Uniform Resource Identifier (URI) component by replacing each instance of certain characters by one, two, three, or four escape sequences representing the UTF-8 encoding of the character (will only be four escape sequences for characters composed of two "surrogate" characters).
As per: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/encodeURIComponent
Client.get({ parametres: encodeURIComponent(param) })
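To see what "one, two, three, or four escape sequences" means in practice, here is a Python analogue using urllib.parse.quote. It is not the JavaScript function itself, and percent_encode is a made-up name, but for these characters it produces the same UTF-8 based escapes.

```python
from urllib.parse import quote


def percent_encode(text: str) -> str:
    # safe="" escapes "/" as well; letters, digits and "_.-~" are never escaped.
    return quote(text, safe="")


for ch in ["\\", "é", "€", "😀"]:
    print(repr(ch), "->", percent_encode(ch))
# '\\' -> %5C            (1 byte)
# 'é'  -> %C3%A9         (2 bytes)
# '€'  -> %E2%82%AC      (3 bytes)
# '😀' -> %F0%9F%98%80   (4 bytes)
```

The backslash from the question becomes %5C, which is safe to place in a URL.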

Can HTTP/1.1 Body contain a string like "\r\n"?

I'm trying to implement the HTTP/1.1 protocol using sockets in C. I just want to know whether the body of a request can contain strings like "\r\n", i.e. a CR LF.
Also, please let me know if there is a maximum limit to the number of characters inside the body.
There are no limits to the size or content of the body in an HTTP request or response.
Yes, the body can contain CRLFs. No, there is no limit to the length of the body. The body is arbitrary data, as far as HTTP is concerned. RFC 2616 Section 4.4 outlines how to determine the length of the body and how the body is transmitted. The Content-Type header determines how the body data is interpreted once received.
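A short sketch of why CRLFs in the body are harmless, written in Python for brevity although the question is about C: the headers end at the first blank line, and after that the receiver counts bytes according to Content-Length instead of scanning for line endings. The function name split_message, the /upload path and the example.com host are made up, and chunked transfer encoding is not handled.

```python
def split_message(raw: bytes):
    """Split a raw HTTP/1.1 message into start line, headers and body.

    Hypothetical helper; only handles Content-Length framing.
    """
    # Only the FIRST blank line ends the headers; later ones belong to the body.
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip()
    length = int(headers.get(b"content-length", b"0").decode("ascii"))
    # The body is exactly `length` bytes, whatever those bytes are.
    return lines[0], headers, rest[:length]


payload = b"line one\r\n\r\nline two\r\n"
request = (
    b"POST /upload HTTP/1.1\r\n"
    b"Host: example.com\r\n"
    b"Content-Length: " + str(len(payload)).encode("ascii") + b"\r\n"
    b"\r\n" + payload
)
start_line, headers, body = split_message(request)
print(start_line, body)
```

The same idea carries over to a C implementation: read until the header terminator, then read exactly Content-Length bytes from the socket.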

Unicode/special characters in help_text for Django form?

I am trying to add a special character (specifically the ndash) to a Model field's help_text. I'm using it in the Form output so I tried what seemed intuitive for the HTML:
help_text='2 &ndash; 30 characters'
Then I tried:
help_text='2 \2013 30 characters'
Still no luck. Thoughts?
Django escapes all HTML by default. Try wrapping your string in mark_safe.
You almost had it on your second try. First you need to declare the string as Unicode by prefacing it with a u. Second, you wrote the codepoint wrong. It needs a preface as well; like \u.
help_text=u'2\u201330 characters'
Now it will work and has the added benefit of not polluting the string with HTML character entities. Remember that field value could be used elsewhere, not just in the Form display output. This tip is universal for using Unicode characters in Python.
Further reading:
Unicode literals in Python, which mentions other codepoint prefaces (\x and \U)
PEP263 has simple instructions for using actual raw Unicode characters in a source file.
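A quick way to see why the second attempt failed, sketched in plain Python 3 with no Django needed. In Python 3 every str is already Unicode, so the u prefix is optional there. The variable names are only for illustration.

```python
# "\2013" is NOT a Unicode escape: Python reads "\201" as an octal escape
# (the character U+0081) followed by a literal "3".
wrong = "2 \2013 30 characters"
print(wrong == "2 \x813 30 characters")  # True

# "\u2013" is the en dash, written as a four-hex-digit Unicode escape.
right = "2\u201330 characters"
print(right)            # 2–30 characters
print(right[1] == "–")  # True
print(len(right))       # 15
```

The octal reading also explains why the first form raised no error: it is a valid string, just not the intended one.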
