I'm trying to create an XMPP library (and later a server) from scratch in Go (although the language itself is irrelevant) as a means to learn what I can about the XMPP protocol and server software development in general.
As many of you know, XMPP is a messaging protocol based on XML that involves exchanging an enormous number of short but frequent XML stanzas over long-lived streams. I'm thinking that for such an application an event-based XML parser should be better, because I won't need a DOM and all that (correct me if I'm wrong). Please keep in mind that this library is intended for servers, so there might be many instances running at once.
Which one of the two has better performance and memory usage for that use case, libxml2 or expat?
There is a whole project devoted to answering the question of XML performance called XML Benchmark.
The short answer, in my opinion, is to use libxml2, though I have other considerations beyond pure performance, such as platform availability. That said, it is generally faster than Expat according to the latest numbers, though the two are fairly close in the grand scheme of things.
And yes, you want to use the SAX parser, not the DOM parser.
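To make the event-based approach concrete, here is a minimal sketch using Expat's callback API (libxml2's SAX interface is shaped much the same way). The handlers just print element names where a real server would dispatch on message, presence, and iq stanzas; the sample chunk is made up:

    #include <stdio.h>
    #include <string.h>
    #include <expat.h>

    /* Called once per opening tag; an XMPP server would dispatch on
       stanza names such as "message", "presence", and "iq" here. */
    static void XMLCALL on_start(void *data, const XML_Char *name,
                                 const XML_Char **attrs)
    {
        (void)data; (void)attrs;
        printf("open:  %s\n", name);
    }

    static void XMLCALL on_end(void *data, const XML_Char *name)
    {
        (void)data;
        printf("close: %s\n", name);
    }

    int main(void)
    {
        XML_Parser p = XML_ParserCreate(NULL);
        XML_SetElementHandler(p, on_start, on_end);

        /* An XMPP stream never ends, so feed the parser incrementally
           and pass 0 for the "is final" flag on every network read. */
        const char *chunk =
            "<stream:stream xmlns:stream='http://etherx.jabber.org/streams'>"
            "<message><body>hi</body></message>";

        if (XML_Parse(p, chunk, (int)strlen(chunk), 0) == XML_STATUS_ERROR) {
            fprintf(stderr, "parse error: %s\n",
                    XML_ErrorString(XML_GetErrorCode(p)));
        }
        XML_ParserFree(p);
        return 0;
    }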
My situation
I really want to use Neo4j for a webapp I'm writing, but I'm writing the app in Perl.
Besides using the REST API, what are my options for prepared statements? Ideally, I don't want to have to do any forking, and I certainly don't want to have to call an external program.
Why
I'm using prepared statements for security reasons, and a database backend for real-time efficiency + speed + ease of use. As a result, most of these solutions are, at face value, unacceptable for my needs. While I recognize that the 4j part of neo4j means "support outside of Java is probably a pipe dream", I still maintain some hope.
EDIT:
(I'll put what I find here.)
So far, I've found:
REST::Neo4p (which has documentation on prepared statement use here, if you'd rather not comb through that first link). It's also worth noting that there's a DBD connector written to run on top of it, DBD::Neo4p.
At the moment, I've provisionally decided to use the DBD connector built atop REST::Neo4p, because it looks easy to use and pretty safe, although it probably isn't as efficient as I'd like: under the hood, the returned JSON will carry a bunch of long link strings, and there'll be HTTP headers with each request/response.
So, for right now, I've decided to use this solution. But I'm leaving the answer unaccepted because I would welcome more lightweight alternatives; unless the JDBC driver uses the REST API under the hood (which I doubt), in which case I suppose it wouldn't matter.
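For the record, what makes the REST route behave like a prepared statement is that the Cypher endpoint takes the query and its parameters as separate JSON fields, so user input never gets spliced into the query text. A rough libcurl sketch of the request; the /db/data/cypher path and the {name} placeholder syntax are the legacy Neo4j REST conventions (adjust for your server version), and the same JSON shape applies from any Perl HTTP client:

    #include <stdio.h>
    #include <curl/curl.h>

    int main(void)
    {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL *curl = curl_easy_init();
        if (!curl) return 1;

        /* The query and its parameters travel separately, so user
           input is never concatenated into the Cypher text itself. */
        const char *body =
            "{\"query\": \"MATCH (n) WHERE n.name = {name} RETURN n\","
            " \"params\": {\"name\": \"Alice\"}}";

        struct curl_slist *hdrs = NULL;
        hdrs = curl_slist_append(hdrs, "Content-Type: application/json");
        hdrs = curl_slist_append(hdrs, "Accept: application/json");

        curl_easy_setopt(curl, CURLOPT_URL,
                         "http://localhost:7474/db/data/cypher");
        curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
        curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);

        CURLcode rc = curl_easy_perform(curl);
        if (rc != CURLE_OK)
            fprintf(stderr, "request failed: %s\n", curl_easy_strerror(rc));

        curl_slist_free_all(hdrs);
        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return 0;
    }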
I am an integration consultant and tend to use C and Lua in my spare time; unfortunately, it is not my day job ;-(
Anyway, I tend to believe that a mixture of C and Lua is perfect for many "product" developments. I currently have an "adapter engine" built in pure C, but would like to actually move the adapter code to Lua....
For example, coding an EMAIL adapter in Lua is far easier than in C...yet I like the "engine speed of C"....
But now there is the big question of the security risk: the user can potentially add whatever he or she wants to the Lua scripts in production... obviously we could chmod the files... but is that really secure?
Ideally I want the C/Lua combination here... but do I literally embed the Lua code in the C application as a char*, or do I issue a lua_dofile()?
Thanks for the help
Lynton
First, one of the drawbacks to using C/Lua in production is that it tends to be harder to find developers for these languages. C++ and JavaScript programmers are typically easier to find.
In terms of security, the key here is to follow leading practices. Security is about risk reduction; there is no expectation that one can achieve perfect security, so you need to mitigate risk.
Here are my suggestions:
As with all middleware, you need to use a hardened server. This is the first step: if the server is compromised, you are in trouble no matter the platform. Middleware should NOT be in the DMZ.
You want to store the Lua code external to the compiled code (otherwise you lose the advantage of using Lua). Make that storage as secure as you can. chmod is good; a secure DB server is better. The more secure the script store, the more secure the system.
You can encrypt the Lua source. This is a trade-off, since it makes it a little harder to keep the advantage of easy updates and modification, and you will probably need to implement caching of decrypted scripts for performance.
Your security is as strong as your weakest link. If you provide a way to modify the Lua source via external access this will be an attack vector. Avoid this design if you can.
You should consider putting change-management checks in place, for example a separate place in the system where a checksum for each Lua file is stored. Then, if an unauthorized change is made to a script, you can halt execution until the security breach is mitigated.
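As a sketch of how that last suggestion can combine with the earlier char*-vs-lua_dofile question: read the script into memory yourself, verify its digest against the separately stored value, and only then hand the buffer to Lua with luaL_loadbuffer, so the file on disk is never trusted directly. The lookup_expected_digest helper is a placeholder for your secure store; the SHA-256 here comes from OpenSSL:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <lua.h>
    #include <lauxlib.h>
    #include <openssl/sha.h>

    /* Placeholder: fetch the trusted digest for this script from your
       secure store (DB, read-only config, ...). */
    extern int lookup_expected_digest(const char *name,
                                      unsigned char out[SHA256_DIGEST_LENGTH]);

    static int run_verified_script(lua_State *L, const char *path,
                                   const char *name)
    {
        FILE *f = fopen(path, "rb");
        if (!f) return -1;
        fseek(f, 0, SEEK_END);
        long len = ftell(f);
        rewind(f);

        char *buf = malloc((size_t)len);
        if (!buf || fread(buf, 1, (size_t)len, f) != (size_t)len) {
            fclose(f); free(buf); return -1;
        }
        fclose(f);

        unsigned char got[SHA256_DIGEST_LENGTH], want[SHA256_DIGEST_LENGTH];
        SHA256((unsigned char *)buf, (size_t)len, got);
        if (lookup_expected_digest(name, want) != 0 ||
            memcmp(got, want, sizeof got) != 0) {
            free(buf);
            return -2;  /* tampered or unknown script: refuse to run it */
        }

        /* Load from the verified in-memory buffer, not from the file. */
        int rc = luaL_loadbuffer(L, buf, (size_t)len, name) ||
                 lua_pcall(L, 0, 0, 0);
        free(buf);
        return rc;
    }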
Other than the drawback I mentioned above, I don't think there is anything fundamentally flawed in your plan. If it can aid in making a good middleware system I would say go for it. Just mitigate the risk of your adapter scripts getting compromised as much as possible.
To expand on Donal's comment: given the popularity of Node, I would say that JavaScript is the leading practice in scripted middleware right now. If you can handle learning a new scripting language, it would be a good idea given the support, popularity, and tools available.
Your primary requirement in terms of security is to ensure that the server cannot evaluate anything sent by clients through any mechanism (not just direct evaluation, but also through supplied filenames). A lesser requirement is that clients should not have any mechanism to produce a message that causes other clients to evaluate unexpected things (i.e., avoid XSS trouble). If you can satisfy these requirements, you've got a safe server, and the language(s) it is written in won't matter; using multiple languages is in fact a good idea, as it lets you leverage the best of each.
It's also a good idea to use a carefully configured firewall, plenty of privilege restriction, some kind of DMZ proxy system to at least verify basic syntactic legality of messages, etc. These things are all just good practice. (Aim to configure things so the server can only just manage to do the service you want to provide.)
With sending email, there are a few other things to beware of. In particular, you do not want to be a conduit for spam, so you need to ensure that arbitrary email headers cannot be constructed from client input and that the data formats you send out are non-executable (or that the data is constructed in a way that is non-evil). Rate limiting is also a good idea; unless your site is insanely popular, you shouldn't need to send more than a few messages a second across all clients. If you're ever sending only to a small set of addresses (e.g., to a fixed contact address), then you can relax these restrictions a bit (but still be careful of header injection). In all cases, route all email through a specialist email-handling server instead of doing the routing yourself, as this avoids a whole lot of configuration difficulties.
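On header injection specifically: the classic attack is a client slipping a CR or LF into a value that ends up in a header, smuggling in extra lines such as Bcc:. The minimal defence is to reject any header-bound value containing those characters; a trivial sketch:

    #include <stdbool.h>
    #include <string.h>

    /* Returns true only if the client-supplied value is safe to place
       in an email header: no CR or LF means no way to start a new
       header line. */
    static bool header_value_is_safe(const char *value)
    {
        return strpbrk(value, "\r\n") == NULL;
    }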
I wasn't sure if one was better to use than another for reading RSS feeds, i.e. Java, PHP, or Perl.
The best one is the one you are most comfortable working with.
It doesn't really matter, as long as you are using the right tools to do the job.
You need to consider where you are deploying your application (web versus desktop), the time you want to spend learning a new technology/language, and availability of libraries for parsing RSS and/or XML and/or HTML. The three languages that you named are all good candidates, though.
RSS files are just formatted XML that you obtain over the internet. All you need from a language is that it can make an HTTP request and has a way to parse the XML.
The framework code can be in anything, but consider using XSL transforms (or XPath queries) to get the XML into a more palatable format, especially if you're looking for small subsets of the data or individual values.
It's hardly "scraping" if the source data was meant to be machine-parsed in the first place. :)
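As a sketch of how little code the XPath route takes (shown here with libxml2 in C; each of the languages named has an equivalent binding), pulling every item title out of a fetched RSS document is a one-expression job. The feed.xml filename stands in for whatever you downloaded:

    #include <stdio.h>
    #include <libxml/parser.h>
    #include <libxml/xpath.h>

    int main(void)
    {
        xmlDocPtr doc = xmlReadFile("feed.xml", NULL, 0);
        if (!doc) return 1;

        xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
        /* One expression selects every item title in the channel. */
        xmlXPathObjectPtr res = xmlXPathEvalExpression(
            (const xmlChar *)"/rss/channel/item/title", ctx);

        if (res && res->nodesetval) {
            for (int i = 0; i < res->nodesetval->nodeNr; i++) {
                xmlChar *text = xmlNodeGetContent(res->nodesetval->nodeTab[i]);
                printf("%s\n", text);
                xmlFree(text);
            }
        }

        xmlXPathFreeObject(res);
        xmlXPathFreeContext(ctx);
        xmlFreeDoc(doc);
        xmlCleanupParser();
        return 0;
    }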
If you are stronger with one particular technology and you have a deadline (or other constraints), then go with that technology, as all of them are capable.
If this is not the case, then it comes down to the requirements of the project you are undertaking, and also whether you want to (and are able to) learn a new technology.
PHP is the most naturally web-based technology of the three, and you can use a library like the Simple HTML DOM Parser (it supports XML as well) to get quick results, as well as to delve deeper into the complexities of web scraping, which PHP supports too.
Java has a nice project called Web Harvest which I have used in the past with good results (although you have to learn a non-standard XML syntax, it's similar to XSLT), and once your system is set up, your web scraping can be easily modified.
Perl is the strongest when it comes to regexes (Java, and especially PHP, can become a bit messy when working with regexes, I find), and regex is a nice skill to have, so depending on what you want to do with your information this is a reasonable option as well.
If you are writing a server application that needs to run often and aggregate content across a large number of sites, then performance should be a significant criterion for you. This would mean a language capable of processing a large volume of data quickly.
If you just need a program to run occasionally and pick out bits of data from many pages, then you can consider a specialized language. The product TestPlan offers a very simple language that would let you grab RSS content quickly and expose it in a simple fashion.
I've used it in some significant scraping projects. While not blazingly fast, the scripts are extremely easy to maintain.
I want to develop a web application using ANSI C. I want the application to be faster than the alternatives, and it should support all the kinds of operations that PHP, Python, or any other scripting language provides. If you have an idea for faster database access than C offers, please recommend that instead.
If anyone has ideas, please share tutorials on how to get started.
I'm not aware of any C web application frameworks, so if you did wish to write your application in C, you would need to handle all communication between your application/framework and the web server through a web server interface. The easiest starting point for understanding this is probably to read up on CGI; once you understand how CGI works, move on to FastCGI instead, because although FastCGI is more complex, CGI is notoriously slow.
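For a flavour of how small the FastCGI side can be, here is the canonical response loop from the FastCGI development kit (its fcgi_stdio.h wraps stdio so the code reads like plain CGI, but the process stays resident between requests instead of forking one per hit):

    #include "fcgi_stdio.h"  /* FastCGI dev kit wrapper around stdio */

    int main(void)
    {
        /* Unlike plain CGI, the process stays alive and loops,
           accepting one request per iteration. */
        while (FCGI_Accept() >= 0) {
            printf("Content-Type: text/plain\r\n\r\n");
            printf("Hello from a C FastCGI responder\n");
        }
        return 0;
    }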
However, I strongly recommend that you don't bother unless you are attempting this for academic purposes:
The path you are suggesting involves low-level stuff. It's interesting, but a lot of work to achieve things that can be done incredibly easily in any half-decent web application framework.
With web applications, the thing that matters is throughput (the number of requests you can handle in a given time period), not latency (the time it takes to process a single request). It might seem that a web site written in C would be much faster, but in reality the execution speed of C counts for incredibly little versus (for example) caching and other optimisations.
Other frameworks already exist that are proven and lightning fast!
The end result is that anything that you come up with will almost certainly be more work and slower than using "slow" scripting languages.
Any kind of 'scripting' won't give you the 'raw speed' it seems you might be looking for.
I would generally strongly discourage this whole train of thought, though. There are plenty of web frameworks out there where you produce code that runs very efficiently. Even 'scripted' web frameworks often cache the scripts and reduce much of the initial slowdown involved in parsing and executing.
And frameworks that use compiled bytecode/IL can be quite fast once loaded/JIT'ed.
If you plan to write your own HTTP engine in C, though, I doubt you would be able to get something remotely close to as fast as anything else out there until you are very familiar with what already exists: how the existing servers work, all the variations in the protocols involved, and so on.
I've heard a lot of good things about FastCGI. Maybe you should try that?
You should check out G-WAN by TrustLeap. It allows you to write servlets in ANSI C, taking care of all the nitty-gritty of the HTTP protocol.
http://www.trustleap.com/
A long time ago I figured out that bcp is just a little C program that calls a special bit of the Sybase client API to do mass data moving into the database. It lies, cheats, and steals, and skips check constraints, all in the name of speed.
Great, I'm all for it.
In Sybase 12 I noticed that the API was exposed in the C client library, but not the Java one.
I've been looking, but I haven't found anything that says they've implemented it in the Sybase 15 Java client library yet.
Does anybody know if this is available or not in sybase 15?
I disagree with your comments on Java using a BCP API. While I agree about the limitations of Java and ODBC/JDBC, that doesn't mean there aren't advantages to a Java BCP API. We have a system with a lot of Java, and it's not practical or very effective to shell out from Java and run the bcp command-line utility.
Running the command-line utility doesn't give very good error reporting or deadlock retries.
It also requires writing the data to a file, which increases the number of operations and slows down the whole process. Sometimes we can't even write a file, as the code runs on a grid which doesn't have a file system and tmp is too small.
As for speed: JBCP is slower than the native API, but it is acceptable, and certainly faster than issuing repeated insert commands.
Mwillett (author of JBCP)
I am thinking not; it may be more an issue of fitting that operation into the JDBC spec.
I do see a JBCP project out on SourceForge, but don't have any experience with it.
If you don't mind your Java program not being portable anymore, you can link to any C library via JNI. That is preferable to having to rewrite your application, or to calling a separate task to bcp the data.
I assume you'd rather not rewrite your whole application in C++ ;-)
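To show the shape of the JNI route: the native side is an ordinary C function following JNI's naming convention. Everything below (the package, class, and the bulk-copy call it would wrap) is hypothetical, purely for illustration:

    #include <jni.h>

    /* Hypothetical native method for a Java class declared as:
           package com.example;
           class BulkLoader { native int bcpIn(String table); }   */
    JNIEXPORT jint JNICALL
    Java_com_example_BulkLoader_bcpIn(JNIEnv *env, jobject self, jstring table)
    {
        (void)self;
        const char *name = (*env)->GetStringUTFChars(env, table, NULL);

        /* ...call into the vendor's bulk-copy library here... */

        (*env)->ReleaseStringUTFChars(env, table, name);
        return 0;
    }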
The answer is NO.
But why on Earth would you want to move masses of data from Java to the server? (1) Java isn't designed for that, so it will be very slow; (2) native bcp, or C+bcp, or Perl+bcp, or any shell command+bcp, would scream circles around it and displace it anyway. It's like wanting to run bcp via ODBC or JDBC.
We need to move away from Maslow's Hammer and Use the Right Tool for the Job.
Further detail, responding to comments:
[1] An ordinary PROGRAM that connects to the ASE server (client-server style) uses the provided Open Client Library; this is native, and moves the TDS packets efficiently. The connection is a universally available one-inch garden hose. PROGRAMS written in C, C++, COBOL, Perl, and PowerBuilder use this transport.
[2] ODBC (and thus JDBC, because it is built on top of ODBC) is a simple method of connecting to the server using a one-millimetre hose. While this is quite adequate for tasks such as using Excel to draw charts directly from ASE tables, where data transfer speed is not relevant, it is quite inadequate for moving data of any substantial volume, or for normal app access to a data server (except where the "programmer" is ignorant of the fact that [1] is available).
Java does not have [1] and is limited to this [2] transport.
[3] bcp is a utility PROGRAM (it exists on its own) supplied by the vendor that connects to the server much more tightly. It is not a "special bit of the client API". There is no "lying and cheating" involved; all constraints are directed by the DBA performing the task. The connection is a two-and-a-half-inch fire hose, not generally available to the public. It is designed to move large data volumes at speed. If used on the same host as the server, since the hose is not reticulated through the network, it moves data even faster.
[4] Much later, the vendor made the bcp capability available as a library (an API, in your terms), which can therefore be invoked from any reasonably architected compiler. C, C++, COBOL, and Perl are such, and produce PROGRAMS, and therefore provide access to this library directly from your code. The connection is the same two-and-a-half-inch fire hose, but due to the additional layers it runs at the maximum speed of a two-inch fire hose.
(Note to readers who expect this to be a complete list: There are two other options which I have not detailed here, because they are server-server and not relevant to this thread).
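For readers curious what calling the library in [4] looks like from C, the Open Client Bulk-Library flow is roughly as follows. This is a sketch of the call sequence only: error checking and the cs_ctx_alloc/ct_connect boilerplate that every CT-Library program needs are omitted, and the table and column details are made up:

    #include <string.h>
    #include <ctpublic.h>
    #include <bkpublic.h>   /* Bulk-Library (blk_*) declarations */

    /* 'conn' is an already-established CS_CONNECTION; acquiring one
       takes the usual context/connection setup not shown here. */
    void bulk_in_one_row(CS_CONNECTION *conn, CS_INT value)
    {
        CS_BLKDESC *blk;
        CS_DATAFMT fmt;
        CS_INT datalen = sizeof value;
        CS_INT rows_done;

        blk_alloc(conn, BLK_VERSION_100, &blk);
        blk_init(blk, CS_BLK_IN, "mytable", CS_NULLTERM);

        memset(&fmt, 0, sizeof fmt);
        fmt.datatype = CS_INT_TYPE;
        fmt.maxlength = sizeof value;
        fmt.count = 1;
        blk_bind(blk, 1, &fmt, &value, &datalen, NULL);

        blk_rowxfer(blk);                    /* ship the bound row */
        blk_done(blk, CS_BLK_ALL, &rows_done);
        blk_drop(blk);
    }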
Since Java PROGRAMS are currently limited to the one-millimetre connection [2] to the ASE server, there is no use in providing the bcp API to Java (you would merely have a two-and-a-half-inch fire hose reticulated through the network, with a FLOW of one millimetre); it is an absurd enterprise. There is a reason why, despite the millions many organisations have poured into Java during its rather long, um, progression, none of them have spent money on providing a fire hose that moves drips and drops.
You cannot obtain the speed of a greyhound from a dachshund, there is no use giving it racetrack training. You can stop whispering promises in its ear.
Second, Java cannot handle large (source) data sets efficiently; it was not designed for that. Therefore, even if the JDBC strangulation were lifted (e.g., a native Open Client Library were implemented), it still could not move data as fast as C, C++, COBOL, Perl, or PB; it would move data at a trickle (a quarter inch?) in the one-inch hose. Therefore, even then, it would be absurd to provide the bcp capability to the Java library; that would be worthwhile if and when Java (which was designed with other priorities in mind) is anointed with large-data-transfer capability.
It may help to move out of your Java, Java, and nothing but Java mindset, and Use the Right Tool (PROGRAM) for the Job. If you are a Java "programmer", then at least you need to familiarise yourself with the capability and limitations of your programming language and the libraries available. The original question demonstrates complete ignorance of that, hence I had to supply it in my revised post.
Programmers who are not limited to Java, think about where the large data source is located; minimise data transfers across networks; think about what PROGRAMS are already written and that can be used (as opposed to writing their own in isolation from the rest of the planet); and use them.
Finally, for understanding, even if you did obtain the bcp capability in some Library, and implemented it, when you place the "program" in the real world, any reasonable DBA will dismiss it due to its trickle speed data transport, and use bcp with its fire hose instead.