Stress testing with pycassa - benchmarking

I've been trying to write a stress tester for a rather large Cassandra database. At first I was doing it from scratch, and then I found stress.py, which allows you to stress test your cluster. However, like all benchmarks, its test data is unrepresentative of the loads this database will be seeing, so I decided to modify it to be more realistic for my usage pattern.
I'm using pycassa for most of this project. However, stress.py uses the lower-level Thrift interface directly, which I find rather cumbersome. Are there any projects out there which stress test Cassandra using pycassa? Thanks!

I'm not aware of any existing general-purpose stress tests that make use of pycassa; I'd also love to hear about them if there are any.
In the past, I've modified stress.py to make use of pycassa. I believe I set it up to use one small ConnectionPool per process and I was pretty happy with the result; modifying the Operation class and get_client was the main chunk of work here.
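To give a rough idea, a minimal pycassa-based insert operation might look something like the sketch below. This is not my actual stress.py modification; the keyspace, column family, and sizes are placeholders you'd swap for your own schema.

    import time
    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    # Placeholder keyspace/column family -- use your own schema here.
    pool = ConnectionPool('Keyspace1', server_list=['localhost:9160'], pool_size=5)
    cf = ColumnFamily(pool, 'Standard1')

    def insert_op(n_rows, n_cols):
        """One 'operation': insert n_rows rows of n_cols columns each, timed."""
        start = time.time()
        for i in range(n_rows):
            columns = dict(('col%d' % j, 'value%d' % j) for j in range(n_cols))
            cf.insert('key%d' % i, columns)
        return time.time() - start

    if __name__ == '__main__':
        elapsed = insert_op(1000, 10)
        print('1000 rows in %.2fs (%.0f rows/s)' % (elapsed, 1000 / elapsed))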
It's hard to give more specific details about this without knowing what you want to do, so feel free to ask more detailed questions if you need to.

Related

Consensus algorithm for Node.js

I'm trying to implement a collaborative canvas on which many people can draw freehand or with specific shape tools.
The server has been developed in Node.js and the client with AngularJS 1 (and I am pretty new to both).
I must use a consensus algorithm so that it always shows the same thing to all users.
I'm seriously in trouble with it, since I cannot find a proper tutorial on its use. I have been looking at and studying Paxos implementations, but it seems like Raft is much more widely used in practice.
Any suggestions? I would really appreciate it.
Writing a distributed system is not an easy task[1], so I'd recommend using an existing strongly consistent system instead of implementing one from scratch. The usual suspects are ZooKeeper, Consul, etcd, and Atomix/Copycat. Some of them offer Node.js clients:
https://github.com/alexguan/node-zookeeper-client
https://www.npmjs.com/package/consul
https://github.com/stianeikeland/node-etcd
I've personally never used any of them with Node.js though, so I won't comment on the maturity of the clients.
If you insist on implementing consensus on your own, then Raft should be easier to understand; the paper is surprisingly accessible: https://raft.github.io/raft.pdf. There are also some Node.js implementations of it, but again, I haven't used them, so it is hard to recommend any particular one. The Gaggle readme contains an example, and Skiff has an integration test which documents its usage.
Taking a step back, I'm not sure that distributed consensus is what you need here. It seems like you have multiple clients and a single server, so you can probably use a centralized data store. The problem domain is not really that distributed either: shapes can be overlaid one on top of the other in the order they are received by the server, i.e. FIFO (imagine multiple people writing on the same whiteboard: the last one wins). The challenge is with concurrent modifications of existing shapes, but maybe you can fall back to last/first change wins or something like that.
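To make that concrete, here's a minimal sketch of the centralized approach (in Python rather than Node.js, purely for illustration; all names are made up): the server applies draw operations strictly in the order it receives them, and the last write to a given shape wins.

    import itertools

    class CanvasState:
        """Server-side canvas: operations are applied strictly in arrival (FIFO) order."""
        def __init__(self):
            self.shapes = {}              # shape_id -> latest shape data
            self.seq = itertools.count()  # monotonically increasing version counter

        def apply(self, shape_id, shape_data):
            # Last write wins: a later operation simply overwrites the shape.
            version = next(self.seq)
            self.shapes[shape_id] = {'data': shape_data, 'version': version}
            return version  # broadcast to clients so they can discard stale state

    canvas = CanvasState()
    canvas.apply('rect-1', {'x': 10, 'y': 20, 'w': 100, 'h': 50})
    canvas.apply('rect-1', {'x': 15, 'y': 20, 'w': 100, 'h': 50})  # concurrent edit: last one wins
    print(canvas.shapes['rect-1'])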
Another interesting avenue to explore here would be Conflict-free Replicated Data Types (CRDTs). The folks at GitHub used them to implement collaborative "pair" programming in Atom. See the Atom Teletype blog post; their implementation may also be useful, as collaborative editing seems to be exactly the problem you are trying to solve.
Hope this helps.
[1] Take a look at the Jepsen series at https://jepsen.io/analyses, where Kyle Kingsbury tests various failure conditions of distributed data stores.
Try reading Understanding Paxos. It's geared towards software developers rather than an academic audience. For this particular application you may also be interested in the Multi-Paxos Example Application referenced by the article. It's intended to help illustrate the concepts behind the consensus algorithm, and it sounds like it's almost exactly what you need for this application. Raft and most Multi-Paxos designs tend to get bogged down with an overabundance of accumulated history that generates a new set of problems to deal with beyond simple consistency. An initial prototype could easily handle sending the full state of the drawing on each update and ignore the history issue entirely, which is what the example application does. Later optimizations could be made to reduce network overhead.

Neo4j non-Java frontends

My situation
I really want to use Neo4j for a webapp I'm writing, but I'm writing the app in perl.
Besides using the REST API, what are my options for prepared statements? Ideally, I don't want to have to do any forking, and I certainly don't want to have to call an external program.
Why
I'm using prepared statements for security reasons, and a database backend for real-time efficiency + speed + ease of use. As a result, most of these solutions are, at face value, unacceptable for my needs. While I recognize that the 4j part of neo4j means "support outside of Java is probably a pipe dream", I still maintain some hope.
EDIT:
(I'll put what I find here.)
So far, I've found:
REST::Neo4p (which has documentation on prepared statement use here, if you'd rather not comb through that first link). It's also worth noting that there's a DBD connector written to run on top of it, DBD::Neo4p.
At the moment, I've provisionally decided to use the DBD connector built atop REST::Neo4p, because it looks easy to use and pretty safe, although it probably isn't as efficient as I'd like: under the hood, the returned JSON will contain a bunch of long link strings, and there'll be HTTP headers with each request/response.
So, for right now, I've decided to use this solution, but I'm leaving the answer unaccepted because I would welcome more lightweight alternatives (unless the JDBC driver uses the REST API under the hood, which I doubt, in which case I suppose it wouldn't matter).

What is a good balance in an MVC model to have efficient data access?

I am working on a few PHP projects that use MVC frameworks, and while they all have different ways of retrieving objects from the database, it always seems that nothing beats writing your SQL queries by hand when it comes to speed and cutting down on the number of queries.
For example, one of my web projects (written by a junior developer) executes over 100 queries just to load the home page. The reason is that in one place, a method will load an object, but later on deeper in the code, it will load some other object(s) that are related to the first object.
This leads to the other part of my question: what do people do in situations where one part of the code only needs the values of a few columns from a table, and another part needs something else? Right now (in the same project), there is one get() method for each object, and it does a "SELECT *" (or lists all the columns in the table explicitly) so that anytime you need the object for any reason, you get the whole thing.
So, in other words, you hear all the talk about how SELECT * is bad, but if you try to use an ORM class that comes with the framework, it usually wants to do just that. Are you stuck choosing between an ORM with SELECT * and writing the specific SQL queries by hand? It just seems to me that we're stuck between convenience and efficiency, and if I hand-write the queries, then when I add a column I'm most likely going to have to add it in several places in the code.
Sorry for the long question, but I'm explaining the background to get some perspectives from other developers rather than a specific solution. I know that we can always use something like Memcached, but I would rather optimize what we can before getting into that.
Thanks for any ideas.
First, assuming you are proficient at SQL and schema design, there are very few instances where any abstraction layer that removes you from the SQL statements will exceed the efficiency of writing the SQL by hand. More often than not, you will end up with suboptimal data access.
There's no excuse for 100 queries just to generate one web page.
Second, if you are using the Object Oriented features of PHP, you will have good abstractions for collections of objects, and the kinds of extended properties that map to SQL joins. But the important thing to keep in mind is to write the best abstracted objects you can, without regard to SQL strategies.
When I write PHP code this way, I always find that I'm able to map the data requirements for each web page to very few, very efficient SQL queries if my schema is proper and my classes are proper. And not only that, but my experience is that this is the simplest and fastest way to implement. Putting framework stuff in the middle between PHP classes and a good solid thin DAL (note: NOT embedded SQL or dbms calls) is the best example I can think of to illustrate the concept of "leaky abstractions".
I got a little lost with your question, but if you are looking for a way to do database access, you can do it a couple of ways. Your MVC can use the Zend Framework, which comes with database access abstractions.
Also keep in mind that you should design your system to avoid contention in the database: if your queries are scattered across the PHP pages, they may lock tables, causing the overall web application to deteriorate in performance and become slower over time.
That is why it is sometimes preferable to use stored procedures, as the logic is in one place and can be tuned when needed, though others may argue that it is easier to debug if query statements are on the front-end.
No ORM framework will even get close to hand-written SQL in terms of speed. Although 100 queries seems unrealistic (maybe you are exaggerating a bit), even if you had the creator of the ORM framework writing the code, it would always be far from the speed of good old SQL.
My advice is to look at the whole picture, not only speed:
Does the framework improve code readability?
Is your team comfortable with writing SQL and mixing it with code?
Do you really understand how to optimize the framework's queries? (I think a get() for each object is not the optimal way of retrieving them.)
Do the queries (after optimization) of the framework present a bottleneck?
I've never developed anything with PHP, but I think you could mix both approaches (ORM and plain SQL): after a thorough profiling of the app you can determine the real bottlenecks and only then replace that ORM code with hand-written SQL. (Usually in Ruby you use ActiveRecord, then you profile the application with something like New Relic, and finally, if you have a complicated AR query, you replace it with some SQL.)
Regards
Trust your experience.
To avoid repeating yourself so much in the code, you could write some simple model functions with your own SQL. This is what I do all the time, and I am happy with it.
Much of the "convenience" stuff was written for people who need magic because they cannot do it by hand or just don't have the experience.
And after all, it's a question of style.
Don't hesitate to add your own layer, or to exchange or extend a given layer with your own stuff. Keep it clean, make a good design and some documentation, so you feel at home when you come back later.

Is there a framework for running unit tests on Apache C modules?

I am about to make some changes to an existing Apache C module to fix some possible security flaws and general bad practices. However, the functionality of the code must remain unchanged (except in cases where it's fixing a bug). Standard regression testing stuff seems to be in order. I would like to know if anyone knows of a good way to run some regression unit tests against the code. I'm thinking of something along the lines of using CUnit, but with all the tie-ins to the Apache APR and status structures, I was wondering if there is a good way to test this. Are there any pre-built frameworks that can be used with CUnit, for example?
Thanks
Peter
I've been thinking of answering this for a while, but figured someone else might come up with a better answer, because mine is rather unsatisfactory: no, I'm not aware of any such unit testing framework.
I think your best bet is to try to refactor your C module such that its dependencies on the httpd code base are contained in a very thin glue layer. I wouldn't worry too much about dependencies on APR; that can easily be linked into your unit test code. It's things like using the request record that you should try to abstract out a little bit.
I'll go as far as to suggest that such a refactoring is a good idea if the code is suspected to contain security flaws and bad practices. It's just usually a pretty big job.
What you might also consider is running integration tests rather than unit tests (ideally both), i.e. come up with a set of requests and expected responses from the server, and run a program to compare the actual to expected responses.
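For example, a very small harness along those lines might look like the following (this assumes the Python requests library; the paths and expected bodies are hypothetical and would be replaced with requests your module actually handles):

    import requests

    # Hypothetical request/response pairs -- replace with ones your module handles.
    CASES = [
        ('/my-module/health', 200, 'OK'),
        ('/my-module/echo?msg=hello', 200, 'hello'),
    ]

    def run_cases(base_url='http://localhost:8080'):
        failures = 0
        for path, want_status, want_body in CASES:
            resp = requests.get(base_url + path)
            if resp.status_code != want_status or resp.text.strip() != want_body:
                failures += 1
                print('FAIL %s: got %d %r' % (path, resp.status_code, resp.text[:80]))
            else:
                print('PASS %s' % path)
        return failures

    if __name__ == '__main__':
        raise SystemExit(run_cases())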
So, not the answer you were looking for, and you probably thought of something along these lines yourself. But at least I can tell you from experience that if the module can't be replaced with something new for business reasons, then refactoring it for testability will likely pay off in the longer term.
I spent some time looking around the interwebs for you, as it's a question I was curious about myself. I came across a wiki article stating that
http://cutest.sourceforge.net/
was used for testing the Apache Portable Runtime. It might be worth checking that out.

What's the best database access pattern for testability?

I've read about and dabbled with a few, including Active Record, Repository, and Data Transfer Objects. Which is best?
'Best' questions are not really valid. The world is filled with combinations and variations. You should start with the question that you have to answer: what problem are you trying to solve? After you answer that, look at the tools that work best for the issue.
While I agree "best" questions are not the greatest form (since they are so arbitrary), they're not totally irrelevant either.
In the world constructed here at S.O. where developers vote on what's "best", why not have best questions? "Best questions" prompt discussion and differing opinions.
Eventually, when someone "googles" 'data access pattern' they should come to this page and see a plethora of answers then, right?
Repository is probably the best pattern for testability, since it allows you to replace a repository with a mock when you need to test. ActiveRecord ties your models to the database (convenient sometimes, but generally more difficult to test).
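A minimal Python sketch of why that helps (all names made up): the code under test depends only on the repository interface, so the test can hand it an in-memory fake instead of a real database.

    import unittest

    class InMemoryUserRepository:
        """Test double: same interface as the real repository, backed by a dict."""
        def __init__(self, users):
            self._users = users

        def get_by_id(self, user_id):
            return self._users.get(user_id)

    def greeting_for(repo, user_id):
        """Domain logic depends only on the repository interface, not on the database."""
        user = repo.get_by_id(user_id)
        if user is None:
            return 'Hello, stranger!'
        return 'Hello, %s!' % user['name']

    class GreetingTest(unittest.TestCase):
        def test_known_user(self):
            repo = InMemoryUserRepository({1: {'name': 'Ada'}})
            self.assertEqual(greeting_for(repo, 1), 'Hello, Ada!')

        def test_unknown_user(self):
            self.assertEqual(greeting_for(InMemoryUserRepository({}), 2), 'Hello, stranger!')

    if __name__ == '__main__':
        unittest.main()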
It really depends on your task. At the very least, you should know and understand the common database access patterns so you can choose the one most suitable for the problem at hand.
This is a good question which should provoke some thought.
Database access is often not subject to rigorous testing - particularly not automated testing, and I would certainly like to increase the amount of testing on my database.
I'm using the MbUnit test framework running from inside Visual Studio to do some testing.
Our application uses stored procedures wherever possible, and the tests I have written set up the database for testing, call a stored procedure, and check the results.
For a collection of related stored procedures, we have a C# file with tests for those stored procs. (However, our coverage is probably about 1% so far!).
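The shape of such a test, sketched here in Python for illustration (our real tests are MbUnit/C#; the pyodbc DSN, table, and stored procedure names below are hypothetical):

    import unittest
    import pyodbc  # assumed driver; any DB-API connection works the same way

    CONN_STR = 'DSN=TestDb'  # hypothetical test-database DSN

    class GetActiveUsersProcTest(unittest.TestCase):
        def setUp(self):
            # Arrange: put the database into a known state before each test.
            self.conn = pyodbc.connect(CONN_STR, autocommit=True)
            cur = self.conn.cursor()
            cur.execute("DELETE FROM users")
            cur.execute("INSERT INTO users (id, name, active) VALUES (1, 'Ada', 1)")
            cur.execute("INSERT INTO users (id, name, active) VALUES (2, 'Bob', 0)")

        def test_returns_only_active_users(self):
            # Act: call the stored procedure under test, then assert on the result set.
            rows = self.conn.cursor().execute("{CALL get_active_users}").fetchall()
            self.assertEqual([row.name for row in rows], ['Ada'])

        def tearDown(self):
            self.conn.close()

    if __name__ == '__main__':
        unittest.main()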
Active record is an attractive option because of Ruby's built-in emphasis on automated testing. If I were starting over, that would be a point for using active record.
