C - libcurl - How to get dltotal without using CURLOPT_XFERINFOFUNCTION

I am using the libcurl multi interface and I need to know how much data is being sent for each request. I would rather not use CURLOPT_XFERINFOFUNCTION because it gets called a lot, and I only need to know the dltotal while I am in the CURLOPT_WRITEFUNCTION callback. Once all the data has been received, I want to clean up the existing easy handle and malloc'd data while I am still in the write callback. Is there a function I can call that will return the total amount of data being sent for a particular easy handle?
I tried curl_easy_getinfo() with CURLINFO_SIZE_DOWNLOAD and it always returned 0. I also tried CURLINFO_CONTENT_LENGTH_DOWNLOAD, which likewise always returned 0. In both cases I was calling it from within the CURLOPT_WRITEFUNCTION callback.

The reason you get zeroes back from those calls is probably that the size simply isn't known beforehand.
But let me also warn you that "I want to clean up the existing easy handle and malloc'd data while I am still in the write callback" sounds like a disaster waiting to happen. You really should not clean up the handle from within one of its own callbacks.
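That said, when the server does send a Content-Length header, it can be read from inside the write callback. Here is a minimal sketch, assuming libcurl 7.55.0 or later for the CURLINFO_CONTENT_LENGTH_DOWNLOAD_T variant, which reports -1 when the size is unknown (e.g. chunked transfers):

#include <curl/curl.h>
#include <stdio.h>

/* Pass the easy handle itself as the WRITEDATA pointer so the
 * callback can query it. */
static size_t write_cb(char *ptr, size_t size, size_t nmemb, void *userdata)
{
    CURL *handle = (CURL *)userdata;
    curl_off_t total = -1;
    (void)ptr; /* payload not used in this sketch */

    /* Reports the Content-Length if the server sent one; -1 when the
     * total size is not known. */
    curl_easy_getinfo(handle, CURLINFO_CONTENT_LENGTH_DOWNLOAD_T, &total);
    fprintf(stderr, "chunk of %zu bytes, expected total: %lld\n",
            size * nmemb, (long long)total);

    /* Consume the data here; do NOT call curl_easy_cleanup() from
     * inside this callback. */
    return size * nmemb;
}

int main(void)
{
    CURL *handle = curl_easy_init();
    if (!handle)
        return 1;
    curl_easy_setopt(handle, CURLOPT_URL, "https://example.com/");
    curl_easy_setopt(handle, CURLOPT_WRITEFUNCTION, write_cb);
    curl_easy_setopt(handle, CURLOPT_WRITEDATA, handle);
    curl_easy_perform(handle); /* or drive the handle via the multi interface */
    curl_easy_cleanup(handle);
    return 0;
}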

Related

Multithreading inside Flink's Map/Process function

I have a use case where I need to apply multiple functions to every incoming message, each producing zero or more results.
A loop won't scale for me, and ideally I would like to emit results as soon as they are ready instead of waiting for all the functions to be applied.
I thought about using AsyncIO for this, maintaining a ThreadPool, but if I am not mistaken I can only emit one record through this API. That is not a deal-breaker, but I'd like to know if there are other options, such as using a ThreadPool inside a Map/Process function so I can send the results as they become ready.
Would this be an anti-pattern, or would it cause any problems with regard to checkpointing or at-least-once guarantees?
Depending on the number of different functions involved, one solution would be to fan each incoming message out to n operators, each applying one of the functions.
I fear you'll get into trouble if you try this with a multi-threaded map/process function.
How about this instead:
You could have something like a RichCoFlatMap (or KeyedCoProcessFunction, or BroadcastProcessFunction) that is aware of all of the currently active functions, and for each incoming event, emits n copies of it, each being enriched with info about a specific function to be performed. Following that can be an async i/o operator that has a ThreadPool, and it takes care of executing the functions and emitting results if and when they become available.

Transferring data to/from a callback from/to a worker thread

My current application is a toy web service written in C, designed to replicate the behaviour of http://sprunge.us/ (it takes data in via HTTP POST, stores it to disk, and returns the client a URL to the data; it also serves previously stored data upon request).
The application is structured such that a thread pool is instantiated with worker threads (just a function pointer that takes a void* parameter) and a socket is opened to listen for incoming connections. The main loop of the program consists of a sock = accept(...) call followed by pool_add_task(worker_function_parse_http, sock), so that requests are handled quickly.
The parse_http worker parses the incoming request and either adds another task to the work queue for storing the data (POST) or serving previously stored data (GET).
My problem with this approach stems from the use of the http-parser library, which uses a callback design to return parsed data (all the http parsers I looked at used this style). The problem I encounter is as follows:
My parse_http worker:
Buffers data from the accepted socket (the function's only parameter, at this stage)
Sets up an http-parser object as per its API, complete with callback functions for it to call when it finishes parsing the URL, the BODY, and so on. (These callbacks have a fixed type signature defined by the http-parser lib, taking a pointer to a buffer containing the parsed data relevant to the call, so I can't pass in my own variables and solve the problem that way. They also return a status code to the parser, so I can't use the return values either. The suggested way to get data out of the parser for later use is to copy it out to a global variable during the callback, which is fun with multiple threads.)
Executes the parser on the buffered socket data. At this stage, the parser is expected to invoke the callbacks set up earlier as it parses different sections of the buffer, supplying each callback with the parsed data relevant to it (e.g. the BODY segment is passed to the body_parsed callback).
This is where the problem shows up: the parser has executed, but I don't have any access to the parsed data. Here is where I would add a new task to the queue, with one worker function to store the received body data and another to handle the GET request for previously stored data. These functions would need to be supplied with both the parsed information (POST data or GET URL) and the accepted socket, so that the now delegated work can respond to the request and close the connection.
Of course, the obvious solution is simply not to use this thread-pool model with asynchronous practices, but I would like to know, for now and for later, how best to tackle this problem.
How can I get the parsed data from these callbacks back to the worker thread function? I've considered simply making my on_url_parsed and on_body_parsed callbacks do the rest of the application's job (storing and retrieving data), but of course I no longer have the client's socket to respond to in those contexts.
If needed, I can post up the source code to the project when I get the chance.
Edit: It turns out that it is possible to access a user-defined void * from within the callbacks of this particular http-parser library, as the callbacks are passed a reference to the caller (the parser object), which has a user-definable data field.
A well-designed callback interface would provide for you to give the parser a void * which it would pass on to each of the callback functions when it calls them. The callback functions you provide know what type of object it points to (since you provide both the data pointer and the function pointers), so they can cast and properly dereference it. Among other advantages, this way you can provide for the callbacks to access a local variable of the function that initiates the parse, instead of having to rely on global variables.
If the parser library you are using does not have such a feature (and you don't want to switch to a better-designed one), then you can probably use thread-local storage instead of global variables. How exactly you would do that depends on your thread library and compiler, or you could roll your own by using thread identifiers as keys to thread-specific slots in some global data structure (a hash table for instance).
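To make the user-data pattern from the question's edit concrete, here is a minimal sketch against the http-parser library; the request_ctx struct and its fields are hypothetical, and a real version would heap-allocate the context before queueing follow-up tasks:

#include <string.h>
#include "http_parser.h" /* the http-parser library discussed above */

/* Hypothetical per-request context carrying everything the later
 * worker tasks need, including the socket to respond on. */
struct request_ctx {
    int sock;
    char url[1024];
    char body[65536];
    size_t body_len;
};

static int on_url(http_parser *p, const char *at, size_t len)
{
    struct request_ctx *ctx = p->data; /* the user-definable field */
    if (len >= sizeof(ctx->url))
        return 1; /* non-zero tells the parser to stop */
    memcpy(ctx->url, at, len);
    ctx->url[len] = '\0';
    return 0;
}

static int on_body(http_parser *p, const char *at, size_t len)
{
    struct request_ctx *ctx = p->data;
    if (ctx->body_len + len > sizeof(ctx->body))
        return 1;
    memcpy(ctx->body + ctx->body_len, at, len);
    ctx->body_len += len;
    return 0;
}

/* Called from the parse_http worker with the buffered socket data. */
void parse_request(int sock, const char *buf, size_t buf_len)
{
    struct request_ctx ctx = { .sock = sock };
    http_parser parser;
    http_parser_settings settings;

    memset(&settings, 0, sizeof(settings));
    settings.on_url = on_url;
    settings.on_body = on_body;

    http_parser_init(&parser, HTTP_REQUEST);
    parser.data = &ctx; /* visible to every callback via p->data */

    http_parser_execute(&parser, &settings, buf, buf_len);
    /* ctx now holds the parsed URL and body plus the socket, ready to
     * be handed to the next task in the pool. */
}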

When to Use a Callback function?

When do you use a callback function? I know how they work, I have seen them in use and I have used them myself many times.
An example from the C world would be libcurl which relies on callbacks for its data retrieval.
An opposing example would be OpenSSL: where I have used it, I use out parameters:
ret = somefunc(&target_value);
if (ret != 0) {
    /* error case */
}
I am wondering when to use which. Is a callback only useful for async stuff? I am currently in the process of designing my application's API and I am wondering whether to use a callback or just an out parameter. Under the hood it will use libcurl and OpenSSL as the main libraries it builds on, and the "returned" parameter is an OpenSSL data type.
I don't see any benefit of a callback over just returning. Is it only useful if I want to process the data in some way instead of just handing it back? But then I could process the returned data just as well. Where is the difference?
In the simplest case, the two approaches are equivalent. But if the callback can be called multiple times to process data as it arrives, then the callback approach provides greater flexibility, and this flexibility is not limited to async use cases.
libcurl is a good example: it provides an API that allows specifying a callback for all newly arrived data. The alternative, as you present it, would be to just return the data. But return it — how? If the data is collected into a memory buffer, the buffer might end up very large, and the caller might have only wanted to save it to a file, like a downloader. If the data is saved to a file whose name is returned to the caller, it might incur unnecessary IO if the caller in fact only wanted to store it in memory, like a web browser showing an image. Either approach is suboptimal if the caller wanted to process data as it streams, say to calculate a checksum, and didn't need to store it at all.
The callback approach allows the caller to decide how the individual chunks of data will be processed or assembled into a larger whole.
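As a concrete illustration of that flexibility, here is a rough sketch of a libcurl write callback that processes the stream as it arrives, without ever buffering the body; the toy rolling checksum stands in for a real hash function:

#include <curl/curl.h>
#include <stdio.h>

/* Consume each chunk as it streams in; nothing is stored. */
static size_t checksum_cb(char *ptr, size_t size, size_t nmemb, void *userdata)
{
    unsigned long *sum = userdata;
    size_t total = size * nmemb;
    size_t i;

    for (i = 0; i < total; i++)
        *sum = (*sum * 31) + (unsigned char)ptr[i];
    return total; /* tell libcurl the whole chunk was handled */
}

int main(void)
{
    unsigned long sum = 0;
    CURL *h = curl_easy_init();
    if (!h)
        return 1;
    curl_easy_setopt(h, CURLOPT_URL, "https://example.com/");
    curl_easy_setopt(h, CURLOPT_WRITEFUNCTION, checksum_cb);
    curl_easy_setopt(h, CURLOPT_WRITEDATA, &sum);
    curl_easy_perform(h);
    curl_easy_cleanup(h);
    printf("checksum: %lu\n", sum);
    return 0;
}

The same API serves a downloader (the callback fwrites each chunk to a file) or a browser (the callback appends to a memory buffer): the library never has to guess what the caller wants.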
Callbacks are useful for asynchronous notification. When you register a callback with some API, you are expecting that callback to be run when some event occurs. In the same vein, you can use them as an intermediate step in a data-processing pipeline (similar to an 'insert', if you're familiar with the audio/recording industry).
So, to summarise, these are the two main paradigms for which I have encountered and/or implemented callback schemes:
I will tell you when data arrives or some event occurs - you use it as you see fit.
I will give you the chance to modify some data before I deal with it.
If the value can be returned immediately then yes, there is no need for a callback. As you surmised, callbacks are useful in situations wherein a value cannot be returned immediately for whatever reason (perhaps it is just a long running operation which is better performed asynchronously).
My take on this: I see it as a question of which module has to know about which. Let's call them Data-User and IO.
Assume you have some IO where data comes in. The IO module might not even know who is interested in the data. The Data-User, however, knows exactly which data it needs. So the IO module should provide a function like subscribe_to_incoming_data(func), and the Data-User module will subscribe to the specific data the IO module provides. The alternative would be to change code in the IO module to call the Data-User, but with existing libs you definitely don't want to touch code that someone else has provided to you.
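In C, that subscription boils down to the IO module storing function pointers. A bare-bones sketch, where the function name follows the answer above and the fixed-size subscriber table is an arbitrary simplification:

#include <stddef.h>

/* --- IO module: knows nothing about its consumers --- */
typedef void (*data_handler)(const char *data, size_t len);

#define MAX_SUBSCRIBERS 8
static data_handler subscribers[MAX_SUBSCRIBERS];
static size_t n_subscribers;

int subscribe_to_incoming_data(data_handler fn)
{
    if (n_subscribers == MAX_SUBSCRIBERS)
        return -1; /* table full */
    subscribers[n_subscribers++] = fn;
    return 0;
}

/* Called by the IO module whenever data arrives. */
static void dispatch_incoming(const char *data, size_t len)
{
    size_t i;
    for (i = 0; i < n_subscribers; i++)
        subscribers[i](data, len);
}

/* --- Data-User module: registers interest; IO code stays untouched --- */
static void on_data(const char *data, size_t len)
{
    (void)data;
    (void)len; /* consume the data here */
}

void data_user_init(void)
{
    subscribe_to_incoming_data(on_data);
}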

Does extjs keep a list of open calls?

I need something to happen after all calls to the server have been handled, but I don't know in which order they will complete.
So I was wondering if extjs/sencha touch keeps a list of all the open calls somewhere.
It seems this does exist: Ext.Ajax.requests keeps a list of all open calls.
If I understand your question correctly, you probably don't need such a feature: you can schedule something to run after n asynchronous functions have returned by incrementing a counter when issuing each call and decrementing it in each callback. Once all calls have returned, the counter reaches zero and you can execute your "global" callback.

Should an API assume the proper calling sequence, or should it verify the assumptions?

I have two API calls to implement; let's call them "ShouldDoSomething" and "DoSomething". The first is a test to see whether a particular action needs to be taken, and the other actually takes that action. Naturally, taking the action is only necessary or valid if the test returns true. There are cases where the test is needed without actually taking the action, so both are necessary.
Should the action call internally run the test and become a no-op if it's not needed, or should the implementation assume that it will only ever be called after the test has returned true? It seems simpler and safer to validate the assumptions, but then the action call has three possible return states instead of two (success, failure, and not needed). Another approach would be to make the test an assertion and abort if the action was called unnecessarily (this is in C, so exceptions aren't really an option).
But it comes down to a choice: simpler API calls or fewer API calls?
Pick a policy according to what you think is cleanest for the intended use cases. Document it clearly and carefully. The most important thing with API decisions like this is documentation, not the specifics of what convention is chosen.
If you decide to use the first option, do you actually need to add a "not needed" return state, or can it simply indicate "success" if the operation is not needed? Don't add complexity unless it's necessary.
I think it would be best to go with the approach you suggested: make the test an assertion and abort if the action was called unnecessarily.
So, for example, if "DoSomething" is called while "ShouldDoSomething" says that the action should not be taken, then "DoSomething" should abort without doing anything. Do you really need to expose "ShouldDoSomething" to the client if it's only testing whether "DoSomething" should be called?
I think it also comes down to what is being changed. Could it be fatal if called inappropriately? If so, I think you should put assertions in. If it will be harmless, then it doesn't really matter.
I would say safer is the way to go.
Hope this helps.
Stephen Canon is correct about consistency and documentation.
Personally, I tend to err on the side of protecting people from themselves unless doing so carries a serious performance penalty. In most situations I would lean toward checking ShouldDoSomething internally from DoSomething and returning an error if DoSomething wasn't required. That indicates a logic error to the user, and it is a lot nicer than letting them blow up or leaving something in an invalid state that might be hard to debug down the line.
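Sketched in C, that check-internally approach might look like the following; the three-state result enum and the stubs are illustrative, not a prescribed design:

#include <stdbool.h>
#include <stdio.h>

typedef enum {
    DO_OK,         /* action performed successfully */
    DO_NOT_NEEDED, /* test said no; nothing was done */
    DO_FAILED      /* action attempted but failed */
} do_result;

/* Stubs standing in for the real test and the real work. */
static bool ShouldDoSomething(void) { return true; }
static int perform_action(void) { return 0; }

do_result DoSomething(void)
{
    /* Validate the assumption instead of trusting the caller. */
    if (!ShouldDoSomething())
        return DO_NOT_NEEDED;
    return perform_action() == 0 ? DO_OK : DO_FAILED;
}

int main(void)
{
    switch (DoSomething()) {
    case DO_OK:         puts("done");                         break;
    case DO_NOT_NEEDED: puts("nothing to do (logic error?)"); break;
    case DO_FAILED:     puts("action failed");                break;
    }
    return 0;
}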
The simplest API (from a user perspective) would be to have the function DoSomethingIfNeeded(), and let you take care of the conditions. Why bother the user?
The least friendly user interface is the one that says: "Now call one of these functions! No, not that one!".
