embedding a headless browser in C

Is any of env-js, PhantomJS, SlimerJS or any of the other headless browsers embeddable in a C application? This is what I need to do: I have a C application that connects to a couple of servers through HTTP & HTTPS. Up until now, I fetched every page, figured out what it did (mainly JavaScript), extracted the parts of the code that I needed (I also implemented a very simple string parser/extractor) and implemented the flow by sending the HTML code through a (secure) socket and reading back the response. That part still works smoothly.
That was until I bumped into a set of very complex (read: obfuscated and looong) JavaScript pages linked one after the other, with several scripts included, and server-side programming, and then I realized that I wouldn't be able to get a "maintainable" program with the approach I had taken.
So I have spent the last several days looking for an embeddable JavaScript interpreter. I bumped into SpiderMonkey, which is embeddable in C, but since I don't have any control over the scripts received, its lack of a DOM implementation makes it unsuitable. I also considered implementing the DOM interface myself, but honestly it is far too long a distraction from my main project.
Then I considered the headless browsers mentioned above. I have read all the information I found about them and looked for some sort of library to interface with them. In the absence of such libraries, I considered hacking the code, only to realize that trying to hack PhantomJS to embed it in my C system would be even crazier than implementing the DOM interface in SpiderMonkey.
The system currently works on Windows and I'm using MinGW to develop it, but its final target is a Raspberry Pi, so the more plain, straight C source code I have, the easier it will be to move the system to its final destination. By this I mean: I can use pre-built Windows libraries in the meantime, but I can't lose sight of the requirement that they must be buildable from source with a plain compiler. I don't have the Raspberry Pi yet, but I'm not expecting any fancy development tool set (I might be wrong on this).
Lastly, for the curious inside, the system is a stock screener, generates graphics with indicators which are put in a web server and generates alerts (to send notifications of price conditions) through Yahoo Messenger (this choice was mainly due to portability and availability of source code).
I will really appreciate your help in finding a way to implement/embed in C *any* JavaScript interpreter that has a DOM interface implemented.
Regards.
Alfredo Meraz

Related

high performance application webserver in C/C++

Is there any high performance (ideally evented and open source) web server in C or C++?
I'd like to be able to use it in that it calls a method/function in my application with a filled out HTTP Request class/struct, and then I can return a filled out HTTP Response class/struct to it.
If it isn't open source, I'd need built in support for long-polling connections, keep-alive, etc—otherwise, I think that I can add these things myself.
If you don't know of any such servers available, would you recommend writing my own web server to fit the task? It cannot be file-based, and must be written in high-performance C/C++.
Edit: I'm thinking something like the Ruby Mongrel for C, if that helps.
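For readers trying to picture the interface being asked for, here is a minimal sketch in C of a server calling into the application with a filled-out request struct and getting a filled-out response struct back. All names (http_request, http_response, http_handler_fn, hello_handler, server_dispatch) are hypothetical and are not taken from any of the servers discussed here; a real server would also carry headers, lengths and connection state.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical request, as a server might hand it to the application. */
struct http_request {
    const char *method;   /* e.g. "GET" */
    const char *path;     /* e.g. "/hello" */
    const char *body;     /* may be NULL */
};

/* Hypothetical response the application fills in for the server to send. */
struct http_response {
    int status;                 /* e.g. 200 */
    const char *content_type;   /* e.g. "text/plain" */
    char body[256];
};

/* Application callback type: returns 0 on success, non-zero on failure. */
typedef int (*http_handler_fn)(const struct http_request *req,
                               struct http_response *res);

/* Example application handler. */
static int hello_handler(const struct http_request *req,
                         struct http_response *res)
{
    res->status = 200;
    res->content_type = "text/plain";
    snprintf(res->body, sizeof res->body, "Hello from %s %s",
             req->method, req->path);
    return 0;
}

/* Stand-in for the server side: calls the handler, falls back to a 500. */
static void server_dispatch(http_handler_fn handler,
                            const struct http_request *req,
                            struct http_response *res)
{
    assert(req != NULL && res != NULL);
    memset(res, 0, sizeof *res);
    if (handler == NULL || handler(req, res) != 0) {
        res->status = 500;
        res->content_type = "text/plain";
        snprintf(res->body, sizeof res->body, "Internal Server Error");
    }
}
```

The server owns the sockets and the parsing; the application only ever sees the two structs, which is what makes it possible to swap the server underneath later.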

I had the very same requirements for my job, so I evaluated a number of solutions: mongoose, libmicrohttpd, libevent. And I also was thinking about writing nginx modules. Here is the summary of my findings:
nginx
nginx project page
I love this server and use it a lot. Its performance and resource usage is much better than that of Apache, which I also still use but plan migrating to nginx.
Very good tunable performance. Rich functionality. Portability.
Module API is not documented and seems to be very verbose. See this nginx hello world module as example.
Nginx does not use threads but uses multiple processes. This makes writing modules harder, need to learn nginx API for shared memory, etc.
mongoose
mongoose project page
All server's code is in single mongoose.c file (about 130K), no dependencies. This is good.
One thread per connection, so if you need concurrency you've got to configure lots of threads, i.e. high RAM usage. Not too good.
Performance is good, although not exceptional.
API is simple but you have to compose all response HTTP headers yourself, i.e. learn the HTTP protocol in detail.
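To give an idea of what composing the response headers yourself involves, here is a small sketch in plain C. format_http_response is a hypothetical helper, not a mongoose function; it only shows the wire format (status line, headers, blank line, body) that you end up writing by hand.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Writes a complete HTTP/1.1 response (status line, headers, blank line,
 * body) into buf. Returns the number of bytes written, or -1 if buf is
 * too small. Hypothetical helper for illustration only.
 */
static int format_http_response(char *buf, size_t buflen,
                                int status, const char *reason,
                                const char *content_type,
                                const char *body)
{
    assert(buf != NULL && reason != NULL);
    assert(content_type != NULL && body != NULL);

    size_t body_len = strlen(body);
    int n = snprintf(buf, buflen,
                     "HTTP/1.1 %d %s\r\n"
                     "Content-Type: %s\r\n"
                     "Content-Length: %zu\r\n"   /* must match the body exactly */
                     "Connection: keep-alive\r\n"
                     "\r\n"                      /* blank line ends the headers */
                     "%s",
                     status, reason, content_type, body_len, body);
    if (n < 0 || (size_t)n >= buflen)
        return -1;                               /* truncated or failed */
    return n;
}
```

Getting Content-Length wrong is the classic mistake here: with keep-alive connections the client uses it to find where one response ends and the next begins.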
libmicrohttpd
libmicrohttpd project page
Official GNU project.
Verbose API, seems awkward to me, although much more simple than writing nginx modules.
Good performance in keep-alive mode (link to my benchmarks below), not so good without keep-alive.
libevent
libevent project page
Libevent library has built-in web server called evhttp.
It is event based, uses libevent for that.
Easy API. Constructs HTTP headers automatically.
Officially single-threaded. This is a major disadvantage. I've found a hack which makes several instances of evhttp run simultaneously, accepting connections from the same socket. Not sure if it is all safe and robust.
Performance of single-threaded evhttp is surprisingly poor. Multi-threaded hack works better, but still not good.
G-WAN
G-WAN project is not open source, but I'd like to say a few words about it.
Very good performance, low memory usage, 150 KB executable.
Very convenient 'servlet' deployment: just copy .c file into csp directory, and running server automatically compiles it. Code modifications also compiled on the fly.
Simple API. Although constrained in some ways. Rich functionality (json, key-value store, etc.).
Unstable. I had segfaults on static files. Hangs on some sample scripts. (Experienced on clean install. Never mixed files of different versions).
Only 32-bit binary (not anymore).
So as you can see, none of the existing alternatives fully satisfied me. So I have developed my own server, which is ...
NXWEB
NXWEB project page
Feature highlights:
Very good performance; see benchmarks on project page
Can serve tens of thousands concurrent requests
Small memory footprint
Multi-threaded model designed to scale
Exceptionally light code base
Simple API
Decent HTTP protocol handling
Keep-alive connections
SSL support (via GNUTLS)
HTTP proxy (with keep-alive connection pooling)
Non-blocking sendfile support (with configurable small file memory cache; gzip pre-encoded file serving)
Modular design for developers
Can be run as daemon; relaunches itself on error
Open source
Limitations:
Depends on libev library (not anymore)
Only tested on Linux

I would suggest to write a FastCGI executable that can be used with many high performance web servers (even closed source ones).

mongoose: one file. simple and easy to use. not async io, but perfect for embedded and particular purposes.
gwan. excellent. no crashes. ultra well planned configuration. very smart and easy for c/c++ development in other words, very clean sensible api compared to nginx. provides a thread per core. or whatever you specify. a great choice. largest disadvantage (maybe im lacking in this area): cannot step thru code.
libevent: single thread is not a disadvantage on a single core machine. after all, its point is async i/o. does have multithreads for other cores.
nginx: no personal experience. gaining serious ground on a-patchy server. (terribly confusing api)
boost asio: a c++ library for async io (asio). awesome. needs a friendly higher-level api for simpletons like myself. and others who come from php, java, javascript, node.js and other web languages.
python bottle: awesome 1 file lib (framework/system) that makes it easy to build python web apps. has/is a built in httpd server, like libevent and node.js
node.js: javascript async io server. an excellent selection. unfortunately, have to program in javascript, which does become tedious. while there is something to be said for getting the job done; there is also something to be said for enjoying yourself during the process. hopefully no one comes up with node.php

I'm going to suggest the same thing as Axel Gneiting - but have provided an answer with my reasons for taking this approach:
1) HTTP is not trivial as a protocol - writing your own server or amending an off-the-shelf solution is a very complex task - a lot more complex than using the available APIs for implementing a separate processing engine
2) Using (an unmodified) mainstream webserver should provide you with more functionality than you require (so you've got growing room).
3) Using (an unmodified) mainstream webserver will usually mean that it has been far more extensively tested and documented than a homebrew system.
4) .. and its more likely to be secure and stable.
5) Using fastCGI you can use all sorts of languages to develop your back-end processing in - including C++ and C. There are standard toolkits available to facilitate this.
6) alternatively many webservers provide support for running interpreter engines in-process (e.g. mod_php, mod_perl). I'd advise against running your own native code as a module though.
It cannot be file-based.
Eh? What does that mean?

I'm an avid nginx user; nginx is written in C; nginx seems like it could work for you. If you want the very best speed out of nginx, I would make a nginx module. Here are 3rd party modules which you can examine to get an idea of what it requires.
As for the long polling requirement, you might want to have a look at the http push modules.

Porting Autodesk Animator Pro to be cross platform

a previous relevant question from me is here Reverse Engineering old paint programs
I have set up my base of operations here: http://animatorpro.org
wiki coming soon.
Okay, so now I have a 300,000 line legacy MSDOS codebase. It's sort of a "be careful what you wish for" situation. I am not an experienced C programmer. I'm not entirely inexperienced either, but for all intents and purposes I'm a noob to the language and in particular the intricacies of its libraries. I am especially ignorant of the vagaries of the differences between C programs written specifically for MSDOS and programs that are cross platform. However I have been studying this code base for over a year now, and this is what I know about Animator Pro:
Compilers and tools used:
Watcom C compiler
tcmake (make program from Turbo C)
386asm, a specialised assembler for the Phar Lap dos extender
and of course, the Phar Lap dos extender itself.
a selection of obscure dos utilities
Much of the compilation seems to be driven by batch files. Though I have obtained copies of all these tools, I have not yet succeeded at compiling it (though I have compiled its older brother, the original Autodesk Animator).
It's got a plugin system that replicates DLL before DLL's were available, based on REX. The plugin system handles:
Video Drivers (with a plethora of included VESA drivers)
Input drivers (including wacom tablets, and keyboards)
Drawing Tools
Inks (Like photoshop's filters, or blending modes)
Scripting Addons (essentially compiled scripts)
File formats
It's got its own script interpreter named POCO, based on the C language- The scripting language has enough power to do virtually all the things the plugin system can do- Just slower.
Given this information, this is my development plan. Please criticise this. The source code is available in the link above, so you can easily, if you are so inclined, assess the situation yourself.
1) Compile with its original tools.
2) Switch to using DJGPP, and make the necessary changes to get it to compile with that, plus the original assembler.
3) Include the Allegro.cc "Game" library, and switch over as much functionality to that library as possible- Perhaps by simply writing new video and input drivers that use the Allegro API. I'm thinking allegro rather than SDL because: there is a DOS version of Allegro, and fascinatingly, one of its core functions is the ability to play Animator Pro's native format FLIC.
4) Hopefully after 3, I will have eliminated most or all of the Assembler in the project. I say hopefully, because it's in an obscure dialect that doesn't assemble in any modern free assembler without significant modification. I have tried them all. Whatever is left gets converted to assemble in NASM, or to C code if I can define the assembler's actual function.
5) Switch the dos extender from Phar Lap to HX Dos http://www.japheth.de/HX.html, Which promises to replicate as much of the WIN32 api as possible. Then make all the necessary code changes for that to work.
6) Switch to the win32 version of Allegro.cc, assuming that the win32 version can run on top of HXDos. Make any further necessary changes
7) Modify the plugin system to use some kind of standard cross platform plugin library. What this would be, I have no idea. Maybe you can offer some suggestions? I talked to the developer who originally wrote the plugin system, and he said some of the things it does aren't possible on modern OS's because of segmentation restrictions. I'm not sure what this means, but I'm guessing it means all the plugins will need to be rewritten almost from scratch.
8) Magically, I got all the above done, and we can try and make it run in windows, osx, and linux, whilst dealing with other cross platform niggles like long file names, and things I haven't thought of.
Anyone got a problem with any of this? Is allegro a good choice? If not, why? What would you do about this plugin system? What would you do differently? Is this whole thing foolish, and should I just rewrite it from scratch, using the original as inspiration? (it would apparently take the original developer "About a month" to do that)
One thing I haven't covered above is the text/font system. Not sure what to do about that, but Animator Pro has its own custom font format, but also is able to use Postscript Type 1 fonts, and some other formats.

My biggest concern with your plan, in a nutshell: Your approach seems to be to attempt to keep the whole enormous thing working at all times, tweaking the environment ever-further away from DOS. During each tweak to the environment, that means you will have approximately a billion subtle assumptions that might have broken at once, none of which you necessarily understand yet. Untangling them all at once will be incredibly painful.
If I were doing the port, my approach would be to disable as much code as possible to get SOMETHING running in a modern environment, and bring the parts back online, one piece at a time. Write a simple test harness program that loads a display driver and draws some stuff, and compile it for DOS to make sure you understand the interface. Then write some C code that implements the same interface, but with Allegro (or SDL or SFML), and make that program work under Windows or Linux. When the output differs, you have a simple test case to work from.
Your entire job on this port is swapping out implementations of various interfaces and functions with completely new ones. This is a job that unit testing excels at. Don't write any new code without a test of some kind that runs on the old code under DOS! Make your potential problems as small and simple as you possibly can. Port assembly code instead of rewriting it only if you're reasonably confident that it will actually make your job easier (ie, algorithmic stuff that compiles fine with few tweaks under NASM). Don't bite off a bigger piece than you can comfortably fit in your brain at once.
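As a rough illustration of the "same interface, swapped implementation, compare the output" idea, here is a sketch in C. The video_driver struct is a hypothetical, much-simplified stand-in (the real Animator Pro driver interface is different and far larger), and the memory-backed driver stands in for whichever implementation is under test. The same draw_test_scene and screen_checksum would be run against the old DOS driver and against a new Allegro/SDL-backed one, and the checksums compared.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SCREEN_W 320
#define SCREEN_H 200

/* Hypothetical driver interface; the real Animator Pro driver API differs. */
struct video_driver {
    void    (*put_pixel)(void *ctx, int x, int y, uint8_t color);
    uint8_t (*get_pixel)(void *ctx, int x, int y);
    void *ctx;
};

/* One implementation: a plain 8-bit back buffer in main memory. */
struct mem_screen { uint8_t pixels[SCREEN_W * SCREEN_H]; };

static void mem_put_pixel(void *ctx, int x, int y, uint8_t color)
{
    struct mem_screen *s = ctx;
    if (x >= 0 && x < SCREEN_W && y >= 0 && y < SCREEN_H)
        s->pixels[y * SCREEN_W + x] = color;
}

static uint8_t mem_get_pixel(void *ctx, int x, int y)
{
    struct mem_screen *s = ctx;
    if (x < 0 || x >= SCREEN_W || y < 0 || y >= SCREEN_H)
        return 0;
    return s->pixels[y * SCREEN_W + x];
}

/* Clears the screen and returns a driver bound to it. */
static struct video_driver mem_driver(struct mem_screen *s)
{
    memset(s, 0, sizeof *s);
    struct video_driver d = { mem_put_pixel, mem_get_pixel, s };
    return d;
}

/* The test scene: drawn only through the interface. */
static void draw_test_scene(const struct video_driver *d)
{
    assert(d != NULL);
    for (int x = 10; x < 50; x++)
        d->put_pixel(d->ctx, x, 20, 7);   /* horizontal line */
    for (int y = 20; y < 40; y++)
        d->put_pixel(d->ctx, 10, y, 9);   /* vertical line */
}

/* FNV-1a checksum of the whole screen, read back through the driver. */
static uint32_t screen_checksum(const struct video_driver *d)
{
    uint32_t h = 2166136261u;
    for (int y = 0; y < SCREEN_H; y++)
        for (int x = 0; x < SCREEN_W; x++) {
            h ^= d->get_pixel(d->ctx, x, y);
            h *= 16777619u;
        }
    return h;
}
```

When the checksums differ between the old and new driver, you have a tiny, repeatable failing case instead of a 300,000-line program that merely looks wrong.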
I, for one, look forward to seeing your progress! I think what you're attempting to do is great. Thanks for doing it.

Hmmm - I might approach it by writing an OpenGL video "driver" for it. Today's machines are fast enough, with tons of RAM, that you could do all the pixel-specific algorithms on the main CPU into a back buffer and it would work. As the "generic" VGA driver just mapped the video buffer to a pointer, this would be a place to start. There was a zoom mode in the UI, so you can look at the pixels on a high-res display.

It is often very difficult to take an existing non-trivial code base that wasn't written with portability in mind and then try to make it portable. You mention a few of the problems, and there will be a lot more on the way. It is probably a better idea to start from scratch and rewrite the code using the existing code as reference only. If you start from scratch, you can leverage an existing portable UI solution like Qt in your new project.

Virtual Instance of a C compiler on client browser

Is there a way I can create a virtual instance of gcc compiler on the client browser when the client opens my website??
By doing so, I can directly pass the user .c file as argument to my compiler instance and then execute it without having to make a POST call to server and execute the file there???

Originally I understood your question to be targeting the native platform on which the browser is running:
Consider that browsers may be running on many different platforms, operating systems and processor architectures. Compiling C in the way you describe might be technically doable, but practically infeasible.
I was basing "practically infeasible" on the difficulty of supporting the plethora of widely used browser platforms.
Now I understand that you are thinking more on the lines of targeting a virtual environment. I'll amend practically infeasible to "a large amount of work".
If I understand your intent, it is to run a C compiler which emits, shall we say, x86 compiled code and executes it. So to do that we need an emulation of the x86 environment in, say, JavaScript. What's more, I think your intent is that the compiler itself executes in this environment, so that you can re-use gcc. So you'll need to emulate a file system too. It's "obvious" that this could be done, but it really is a lot of work. Is it really worth it?
Competition code is small (I guess). Even with lots of programmers, the number of simultaneous compiles can't be so huge. With a decent queued request system, a touch of Ajax, and a bit of back-end scaling, how costly is it to support the expected population? What's the ratio of developers to back-end systems?
Anyway, if I were to address this problem I'd go for taking the code for an opensource browser and melding in the gcc code. Produce a compiler/browser hybrid. Give that to the developers and tell them "Use this and get zippy compilation speeds, or use your own browser and join the queue."

You're not going to use GCC as it is written for this. AT BEST, you could accomplish something similar if you had a compiler written in Java that targeted the JVM and could be run as an applet. I don't know what it would take to get something like this working, but I suspect it would take a bit of work to get it up and going. As far as I know, nothing currently exists that does this.

Perhaps using a jsLinux in background? There the making process can run in the virtual machine. Communication could be done by extending the clipboard transfer, perhaps into multiple pipes...
I would be interested in javascript based gcc solutions, too.

Best way to implement plugin framework - are DLLs the only way (C/C++ project)?

Introduction:
I am currently developing document classifier software in C/C++ and I will be using a Naive-Bayesian model for classification. But I wanted the users to be able to use any algorithm that they want (or that I want in the future), hence I decided to separate the algorithm part of the architecture into a plugin that will be attached to the main app at app start-up. Hence any user can write his own algorithm as a plugin and use it with my app.
Problem Statement:
The way I am intending to develop this is to have each of the algorithms that user wants to use to be made into a DLL file and put into a specific directory. And at the start, my app will search for all the DLLs in that directory and load them.
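To make "the functions mandated by the plugin framework" concrete, here is a hedged sketch in C of what such a contract and a host-side sanity check could look like. Every name (classifier_plugin, PLUGIN_API_VERSION, register_plugin, classify_with, toy_plugin) is hypothetical. In a DLL-based host, the pointer to each plugin's table would come from LoadLibrary/GetProcAddress on Windows or dlopen/dlsym on POSIX; that loading step is left out so the sketch stays self-contained. Note that a check like this only catches incompatible or broken plugins, not malicious ones.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PLUGIN_API_VERSION 1

/* Hypothetical contract every classifier plugin would export. */
struct classifier_plugin {
    int api_version;                       /* must equal PLUGIN_API_VERSION */
    const char *name;                      /* human-readable plugin name */
    int (*classify)(const char *document); /* returns a category id >= 0 */
};

/* Host-side check run before any plugin function is called. */
static int plugin_is_usable(const struct classifier_plugin *p)
{
    return p != NULL
        && p->api_version == PLUGIN_API_VERSION
        && p->name != NULL
        && p->classify != NULL;
}

/* Host-side registry of accepted plugins. */
#define MAX_PLUGINS 16
static const struct classifier_plugin *registry[MAX_PLUGINS];
static size_t registry_count;

/* Returns 0 if the plugin was registered, -1 if rejected or registry full. */
static int register_plugin(const struct classifier_plugin *p)
{
    if (!plugin_is_usable(p) || registry_count == MAX_PLUGINS)
        return -1;
    registry[registry_count++] = p;
    return 0;
}

/* Looks a plugin up by name and runs it; returns -1 if there is none. */
static int classify_with(const char *plugin_name, const char *document)
{
    assert(plugin_name != NULL && document != NULL);
    for (size_t i = 0; i < registry_count; i++)
        if (strcmp(registry[i]->name, plugin_name) == 0)
            return registry[i]->classify(document);
    return -1;
}

/* Toy plugin: category 1 if the text mentions "invoice", else 0. */
static int toy_classify(const char *document)
{
    return strstr(document, "invoice") != NULL ? 1 : 0;
}

static const struct classifier_plugin toy_plugin = {
    PLUGIN_API_VERSION, "toy", toy_classify
};
```

Exporting one versioned table per plugin, rather than a set of loose functions, means the host needs to resolve only a single symbol and can reject plugins built against an older interface.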
My Questions:
(1) What if malicious code is made into a DLL (one that has the same functions mandated by the plugin framework) and put into my plugins directory? In that case, my app will think that it's a plugin, pick it up and call its functions, so the malicious code can easily bring my entire app down (in the worst case it could turn my app into a malicious code launcher!!!).
(2) Is using DLLs the only way available to implement plugin design pattern? (Not only for the fear of malicious plugin, but its a generic question out of curiosity :) )
(3) I think a lot of software is written with a plugin model for extensibility; if so, how does it defend against such attacks?
(4) In general what do you think about my decision to use plugin model for extendability (do you think I should look at any other alternatives?)
Thank you
-MicroKernel :)

(1) Do not worry about malicious plugins. If somebody managed to sneak a malicious DLL into that folder, they probably also have the power to execute stuff directly.
(2) As an alternative to DLLs, you could hook up a scripting language like Python or Lua, and allow scripted plugins. But maybe in this case you need the speed of compiled code?
For embedding Python, see here. The process is not very difficult. You can link statically to the interpreter, so users won't need to install Python on their system. However, any non-builtin modules will need to be shipped with your application.
However, if the language does not matter much to you, embedding Lua is probably easier because it was specifically designed for that task. See this section of its manual.
(3) See 1. They don't.
(4) Using a plugin model sounds like a fine solution, provided that a lack of extensibility really is a problem at this point. It might be easier to hard-code your current model, and add the plugin interface later, if it turns out that there is actually a demand for it. It is easy to add, but hard to remove once people started using it.

Malicious code is not the only problem with DLLs. Even a well-meaning DLL might contain a bug that could crash your whole application or gradually leak memory.
Loading a module in a high-level language somewhat reduces the risk. If you want to learn about embedding Python for example, the documentation is here.
Another approach would be to launch the plugin in a separate process. It does require a bit more effort on your part to implement, but it's much safer. The separate process approach is used by Google's Chrome web browser, and they have a document describing the architecture.
The basic idea is to provide a library for plugin writers that includes all the logic for communicating with the main app. That way, the plugin author has an API that they use, just as if they were writing a DLL. Wikipedia has a good list of ways for inter-process communication (IPC).
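Here is a minimal sketch of the separate-process idea in C, assuming a POSIX system (fork and pipe; on Windows the equivalent would be built on CreateProcess and pipes instead). The function names are hypothetical, the "plugin" is a toy that runs in a forked child rather than a separate executable, and messages are limited to one short string each way. The point it illustrates is that a crash in the plugin shows up in the host as an error return, not as a crash of the host.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MSG_MAX 1023   /* sketch only handles short, single-read messages */

/* Toy "plugin" body, run only in the child: labels the document. */
static const char *plugin_classify(const char *doc)
{
    return strstr(doc, "invoice") != NULL ? "finance" : "other";
}

/*
 * Runs the plugin in a child process. The document goes to the child over
 * one pipe and the label comes back over another. Returns 0 on success,
 * -1 on failure (including a plugin that crashed or exited non-zero).
 */
static int classify_out_of_process(const char *doc, char *label, size_t label_len)
{
    assert(doc != NULL && label != NULL && label_len > 0);
    size_t doc_len = strlen(doc);
    int to_child[2], from_child[2];

    if (doc_len > MSG_MAX || pipe(to_child) != 0)
        return -1;
    if (pipe(from_child) != 0) {
        close(to_child[0]); close(to_child[1]);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(to_child[0]); close(to_child[1]);
        close(from_child[0]); close(from_child[1]);
        return -1;
    }

    if (pid == 0) {                          /* child: the plugin side */
        char buf[MSG_MAX + 1];
        close(to_child[1]);
        close(from_child[0]);
        ssize_t got = read(to_child[0], buf, MSG_MAX);
        if (got < 0)
            _exit(1);
        buf[got] = '\0';
        const char *result = plugin_classify(buf);
        if (write(from_child[1], result, strlen(result)) < 0)
            _exit(1);
        _exit(0);
    }

    /* parent: the host application side */
    close(to_child[0]);
    close(from_child[1]);
    ssize_t sent = write(to_child[1], doc, doc_len);
    close(to_child[1]);                      /* signals end of input */

    ssize_t n = read(from_child[0], label, label_len - 1);
    close(from_child[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (sent < 0 || n < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    label[n] = '\0';
    return 0;
}
```

A library shipped to plugin authors would hide the child-side read/write loop behind an ordinary function-call API, which is the arrangement described above.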

1) If there is a malicious dll in your plugin folder, you are probably already compromised.
2) No, you can load assembly code dynamically from a file, but this would just be reinventing the wheel, just use a DLL.
3) Firefox extensions don't, not even with its javascript plugins. Everything else I know uses native code from dynamic libraries, and is therefore impossible to guarantee safety. Then again Chrome has NaCL which does extensive analysis on the binary code and rejects it if it can't be 100% sure it doesn't violate bounds and what not, although I'm sure they will have more and more vulnerabilities as time passes.
4) Plugins are fine, just restrict them to trusted people. Alternatively, you could use a safe language like Lua, Python, Java, etc., and load a file into that language but restrict it to a subset of the API that won't harm your program or environment.

(1) Can you use OS security facilities to prevent unauthorized access to the folder where the DLL's are searched or loaded from? That should be your first approach.
Otherwise: run a threat analysis - what's the risk, what are known attack vectors, etc.
(2) Not necessarily. It is the most straightforward if you want compiled plugins - which is mostly a question of performance, access to OS functions, etc. As mentioned already, consider scripting languages.
(3) Usually by writing "to prevent malicious code execution, restrict access to the plugin folder".
(4) There's quite some additional cost - even when using a plugin framework you are not yet familiar with. it increases cost of:
the core application (plugin functionality)
the plugins (much higher isolation)
installation
debugging + diagnostics (bugs that occur only with a certain combination of plugins)
administration (users must know of, and manage plugins)
That pays only if
installing/updating the main software is much more complex than updating the plugins
individual components need to be updated individually (e.g. a user may combine different versions of plugins)
other people develop plugins for your main application
(There are other benefits of moving code into DLL's, but they don't pertain to plugins as such)

What if a malicious code is made as a DLL
Generally, if you do not trust a DLL, you can't load it, one way or another.
This would be correct for almost any other language, even if it is interpreted.
Java and some other languages work very hard to limit what the user can do, and this works only because they run in a virtual machine.
So no. DLL-loaded plug-ins can come from a trusted source only.
Is using DLLs the only way available to implement plugin design pattern?
You may also embed some interpreter in your code; for example, GIMP allows writing plugins in Python.
But be aware of the fact that this would be much slower because of the nature of any interpreted language.

We have a product very similar in that it uses modules to extend functionality.
We do two things:
We use BPL files, which are DLLs under the covers. This is a specific technology from Borland/Codegear/Embarcadero within C++ Builder. We take advantage of some RTTI-type features to publish a simple API similar to main(argv[]), so any number of parameters can be pushed onto the stack and popped off by the DLL.
We also embed Perl into our application for things that are more business logic in nature.
Our software is an accounting/ERP suite.

Have a look at existing plugin architectures and see if there is anything that you can reuse. http://git.dronelabs.com/ethos/about/ is one link I came across while googling glib + plugin. glib itself might make it easier to develop a plugin architecture. GStreamer uses glib and has a very nice plugin architecture that may give you some ideas.

Coding a website in C?

I was just reading the http://www.meebo.com/ About Us page, and read this line :
"plus, we're one of the few still around using C!"
Considering that meebo is an online chat client, how do they work with C? How can they use C for the backend? How does it interact with the frontend? For example, let's say a user creates a new account, and new directory is to be made, how does the information go from the front end to the back end?
I'm sorry if it's an invalid question.
Thank you
Edit 1: The Intro Tutorial to CGI was great. Any good books I can pick up from my library regarding this?
Thanks a lot for the quick response guys!

I don't know how meebo does it, but given that it's chat software they probably have a custom server written in C to handle the actual message traffic.
However, Apache and most other HTTP servers have always been able to call C programs just as they can call PHP, CGI and other languages for certain requests. Some websites are even written in Lisp.
The backend has to be compiled each time, unlike an interpreted language, but that happens at rollout and is part of the build/production scripts.
The permissions given and the user account that the C program runs under must be carefully chosen, and of course a C website suffers from the same issues any other C program can fall for, such as buffer overruns, segfaults, stack overflows, etc. As long as you run it with reduced permissions you are better protected, and it's no worse than any other language/platform/architecture.
For servers, however, it's still used widely - the gold standard, I suppose. You can find plenty of servers written in Java, C++, and every other language, but C just seems to stick around.
-Adam
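For anyone wondering what "the HTTP server calls a C program" looks like in practice, here is a minimal CGI-style sketch. Under CGI the server passes the part of the URL after "?" in the QUERY_STRING environment variable, and whatever the program writes to standard output (headers, a blank line, then the body) becomes the response. The function name write_cgi_response is my own; the escaping is deliberately minimal and only there to show that user input must not be echoed raw.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Writes a minimal CGI response to out: a header block, a blank line, then
 * an HTML body that echoes query_string (may be NULL) with '<', '>' and '&'
 * escaped so that user input cannot inject markup.
 */
static void write_cgi_response(FILE *out, const char *query_string)
{
    assert(out != NULL);
    const char *q = query_string != NULL ? query_string : "";

    fputs("Content-Type: text/html\r\n\r\n", out);
    fputs("<html><body><p>Query: ", out);
    while (*q != '\0') {
        size_t safe = strcspn(q, "<>&");  /* run needing no escaping */
        fwrite(q, 1, safe, out);
        q += safe;
        if (*q == '<')      fputs("&lt;", out);
        else if (*q == '>') fputs("&gt;", out);
        else if (*q == '&') fputs("&amp;", out);
        if (*q != '\0')
            q++;
    }
    fputs("</p></body></html>\n", out);
}
```

A real CGI executable would simply call write_cgi_response(stdout, getenv("QUERY_STRING")) from main (with stdlib.h included for getenv) and be placed wherever the web server is configured to run CGI programs.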

I've rolled non-blocking HTTP 1.1 servers in as little as 50 lines of code (sparse) or a few hundred (better), up to about 5k (safe). The servers would load dynamic shared objects as modules to handle specific kinds of requests.
The parent code would handle connection tracking, keep alives, GET/POST/HEAD requests and feed them off to handlers that were loaded on start up. I did this when I was working with VERY little elbow room on embedded devices that had to have some kind of web based control panel .. specifically a device that controlled power outlets.
The entry point to each DSO was defined by the URL and method used (i.e. /foo behaved differently depending on the type of request it was serving).
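A hedged sketch of that kind of dispatch in C: a table keyed by method and URL path that maps to a handler function. In the server described above, the table entries would be filled in from the shared objects loaded at start-up; here they are static, and all names (route, dispatch, outlet_status, outlet_toggle) are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Handler signature; returns an HTTP status code. Hypothetical. */
typedef int (*request_handler)(const char *body);

struct route {
    const char *method;        /* "GET", "POST", ... */
    const char *path;          /* exact URL path, e.g. "/outlet" */
    request_handler handler;
};

/* Two toy handlers for the same path: reading state vs. changing it. */
static int outlet_status(const char *body) { (void)body; return 200; }
static int outlet_toggle(const char *body) { return body != NULL ? 204 : 400; }

/* Static here for brevity; a real server would build this at start-up. */
static const struct route routes[] = {
    { "GET",  "/outlet", outlet_status },
    { "POST", "/outlet", outlet_toggle },
};

/*
 * Finds the handler for (method, path) and runs it. Returns 404 if the
 * path is unknown and 405 if the path exists but not for this method.
 */
static int dispatch(const char *method, const char *path, const char *body)
{
    assert(method != NULL && path != NULL);
    int path_known = 0;
    for (size_t i = 0; i < sizeof routes / sizeof routes[0]; i++) {
        if (strcmp(routes[i].path, path) != 0)
            continue;
        path_known = 1;
        if (strcmp(routes[i].method, method) == 0)
            return routes[i].handler(body);
    }
    return path_known ? 405 : 404;
}
```

With only a handful of routes a linear scan is fine; on a device with very little elbow room it also avoids pulling in a hash table.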
My little server did quite well, could handle about 150 clients without forks or threads and even had a nice little template system so the UI folks could modify pages without needing hand-holding.
I would most decidedly not use this kind of setup on any kind of production site, even your basic hello world home page with a guest book.
Now, if all I have to do is listen on port 80/443, accept requests with a small POST payload, sanitize them and forward them along to other clients ... it's a little different. But that's a task-specific server that pretends to be a web server; it's not using C to generate dynamic pages.

Meebo uses a custom Lighttpd module called mod_meebo. It doesn't fully answer your question, but I thought you might be interested.

A lot of server-side programs can be done in C, not to mention CGI programming. They could also be using C with MySQL, which is very possible. But without access to their source code, we have no way of knowing just how much C they are using.
Claiming that they are "one of the few around still using C" was probably just a joke. With stats like this at least I would hope so.
-John

You can see a good example of a web site in C with source code: fossil.
It uses SQLite for the back end.
