Use Google Go's Goroutines To Create A Bayes Network - google-app-engine

I have a large dataset of philosophical arguments, each of which connects to other arguments as proof or disproof of a given statement. A root statement can have many proofs and disproofs, each of which may also have proofs and disproofs. Statements can also be used in multiple graphs, and graphs can be analyzed under a "given context" or assumption.
I need to construct a Bayesian network of related arguments, so that each node propagates influence fairly and accurately to its connected arguments. I need to be able to calculate the probability of chains of connected nodes concurrently, with each node requiring datastore lookups that must block to get results; the process is mostly I/O bound, and my datastore connection can run asynchronously in java, go and python {google appengine}. Once each lookup completes, it propagates the effects to all other connected nodes until the probability delta drops below a threshold of irrelevance {currently 0.1%}. Each node of the process must calculate chains of connections, then sum up all the results across all queries to adjust validity results, with results chained outward to any connected arguments.
In order to avoid recursing infinitely, I was thinking of using an A*-like process in goroutines to propagate updates to the argument maps, with a heuristic based on compounding influence which ignores nodes once the probability of influence dips below, say, 0.1%. I had tried to set up the calculations with SQL triggers, but it got complex and messy way too fast. Then I moved to google appengine to take advantage of asynchronous nosql, and it was better, but still too slow. I need to run the updates fast enough to get a snappy UI, so when a user creates or votes for or against a proof or disproof, they can see the results reflected in the UI immediately.
I think Go is the language of choice to support the concurrency I need, but I'm open to suggestions. The client is a monolithic javascript app that just uses XHR and websockets to push and pull argument maps {and their updates} in real time. I have a java prototype that can compute large chains in 10-15s, but performance monitoring shows that most of my runtime is wasted on synchronization and overhead from ConcurrentHashMap.
If there are other highly-concurrent languages worth trying out, please let me know. I know java, python, go, ruby and scala, but will learn any language if it suits my needs.
Similarly, if there are open source implementations of huge Bayesian networks, please leave a suggestion.

I think it's a bit difficult to tell what you are asking about. Maybe you can elaborate on your question.
Goroutines are quite cheap, and are a perfect match for modern web applications which use XHR or WebSockets heavily (and other I/O bound applications which have to wait for database responses and the like). Additionally, the Go runtime is able to execute those goroutines in parallel, so Go is also a good fit for CPU-bound tasks which should take advantage of multiple cores and the speed of a natively compiled language.
But you should also keep in mind that goroutines and channels aren't free. They still require some amount of memory, and each synchronization point (e.g. a channel send or receive) comes at a cost. That's normally not a problem, since the synchronization is extremely cheap in comparison to, say, a database query, but it might not be suited for building efficient Bayesian networks, especially if the actual work of each goroutine / node is negligible in comparison to the synchronization overhead.
Your primary goal for every concurrent program should be to avoid shared mutability as far as possible. So a Bayesian network modeled with goroutines and channels might be a good educational example and a great way to measure the performance of Go's channel implementation, but it's probably not the best fit for your problem.
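To make that trade-off concrete, here is a minimal, hypothetical Go sketch. It is not real Bayesian inference, just the kind of threshold-cut influence propagation the question describes; the node and propagate names, the edge weights and the 0.1% cutoff are illustrative assumptions only:

    // Sketch of goroutine-based influence propagation through an argument
    // graph, cutting off once the probability delta becomes irrelevant.
    package main

    import (
        "fmt"
        "sync"
    )

    const threshold = 0.001 // 0.1% cutoff below which updates are ignored

    // node is a hypothetical argument node; edge weights are positive for
    // proofs and negative for disproofs.
    type node struct {
        id    string
        prob  float64
        edges map[*node]float64
        mu    sync.Mutex
    }

    // propagate applies a delta to n and fans the attenuated delta out to its
    // neighbours in fresh goroutines, stopping once the delta is irrelevant.
    func propagate(n *node, delta float64, wg *sync.WaitGroup) {
        defer wg.Done()
        if delta < threshold && delta > -threshold {
            return
        }
        n.mu.Lock()
        n.prob += delta
        n.mu.Unlock()
        for nb, w := range n.edges {
            wg.Add(1)
            go propagate(nb, delta*w, wg)
        }
    }

    func main() {
        root := &node{id: "root", prob: 0.5, edges: map[*node]float64{}}
        proof := &node{id: "proof-1", prob: 0.6, edges: map[*node]float64{}}
        root.edges[proof] = 0.4 // the proof supports the root with weight 0.4
        proof.edges[root] = 0.4 // influence flows both ways, so cycles exist

        var wg sync.WaitGroup
        wg.Add(1)
        go propagate(root, 0.05, &wg) // a user vote nudges the root by 5%
        wg.Wait()

        fmt.Printf("%s=%.4f %s=%.4f\n", root.id, root.prob, proof.id, proof.prob)
    }

Every lock acquisition and goroutine spawn in this sketch is exactly the kind of synchronization point described above; if the per-node arithmetic stays this small, that overhead, not the math, will dominate the runtime.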

Related

Is SageMaker multi-node Spot-enabled GPU training an anti-pattern?

Is it an anti-pattern to do multi-node Spot-enabled distributed GPU training on SageMaker?
I'm afraid that several issues will slow things down or even make them infeasible:
the interruption detection lag
the increased probability of interruption (N instances)
the need to re-download data at every interruption
the need to start/stop whole clusters instead of just replacing interrupted nodes
the fact that SageMaker doesn't support variable-size clusters
Additionally, the EC2 Spot documentation deters users from using Spot in multi-node workflows where nodes are tightly coupled (which is the case in data-parallel and model-parallel training): "Spot Instances are not suitable for workloads that are inflexible, stateful, fault-intolerant, or tightly coupled between instance nodes."
Anybody here have experience doing Spot-enabled distributed GPU training on SageMaker happily?
The short answer is that Spot training works well when the instance type you need, in the region you need, has enough free capacity at a particular time. Otherwise you won't be able to start the job, or you'll get too frequent interruptions.
Why not just try it for yourself? Once you have a working on-demand training job, you can enable spot training by adding 3 relevant parameters to the job's Estimator definition, and implement checkpoint save/load (good to have anyway). Then if it works well, great! If not, switch back.

Write once read many in memory key value store

I have a particular use case for multiple in-memory key value maps that need very fast lookup time. They are just set once a day, so they can be considered immutable for all practical purposes. Redis is not an option since it gets CPU-throttled when multiple threads access it. Multi-instance Redis takes up too much memory because of data replication. The important thing to consider here is that the read rate is very high in bursts: around 10 million requests in bursts from around 40-50 workers simultaneously.
I was thinking of creating a simple client-server architecture with multiple readers connecting to a server to read from shared memory maps. However, I wonder if such an architecture already exists and has been tested thoroughly for this use case, in which case I should not be reinventing the wheel.
So, to sum up: what is my best alternative? TIA.
It might not be suitable for you, but you could try RBLDNSD and store your values in DNS. It's high-performance, results will be cached, and it's easy to read the values from pretty much any programming environment. To write values to it you'll need to write directly to its zone files, but the format is simple and easy to write.
You don't mention the size of your maps, but given that performance is so critical, it sounds like you may want to consider keeping copies of your 'multiple in memory key value maps' with each worker.
You could then implement a simple mechanism to notify each worker that it's time to refresh their maps (e.g. Redis PUBLISH, or any other pubsub type framework).
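If you do keep copies of the maps with each worker, the daily refresh can be a plain pointer swap, so readers never take a lock. A minimal Go sketch under that assumption (the store type and the pub/sub trigger are hypothetical; the notification would arrive via Redis PUBLISH or whatever framework you choose):

    package main

    import (
        "fmt"
        "sync/atomic"
        "time"
    )

    // store holds a read-only snapshot of the key/value map. Readers never
    // lock; the daily refresh builds a brand-new map and swaps the pointer.
    type store struct {
        snapshot atomic.Value // holds map[string]string
    }

    func newStore(initial map[string]string) *store {
        s := &store{}
        s.snapshot.Store(initial)
        return s
    }

    // Get reads from the current immutable snapshot, so bursts of lookups
    // from many goroutines never contend on a lock.
    func (s *store) Get(key string) (string, bool) {
        m := s.snapshot.Load().(map[string]string)
        v, ok := m[key]
        return v, ok
    }

    // Refresh replaces the snapshot wholesale, e.g. when the daily pub/sub
    // notification arrives.
    func (s *store) Refresh(next map[string]string) {
        s.snapshot.Store(next)
    }

    func main() {
        s := newStore(map[string]string{"answer": "42"})

        // Simulated burst of reads.
        done := make(chan struct{})
        go func() {
            for i := 0; i < 1_000_000; i++ {
                s.Get("answer")
            }
            close(done)
        }()

        // Simulated daily refresh arriving mid-burst.
        time.Sleep(time.Millisecond)
        s.Refresh(map[string]string{"answer": "43"})

        <-done
        v, _ := s.Get("answer")
        fmt.Println("current value:", v)
    }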
At the risk of running afoul of the Stack Overflow self-promotion police :-) eXtremeDB might be a consideration. It's not schema-less, but your schema can simply define a key-value pair. It supports MVCC (optimistic, non-blocking) concurrency, so even the relatively infrequent writes won't get in the way of readers, and you'll be able to utilize all the CPU cores.

In GAE, is the number of write operations for writing a single entity and for core values the same or not?

What is the difference between storing 10 entities separately and storing 10 entities as core values (i.e., Lists or Sets) in the google appengine datastore?
Do both take the same number of write operations, or does storing lists reduce the write-operation count?
You can have a look with the new Cloud Monitoring service, see https://cloud.google.com/monitoring/docs; among other things it can show you graphs of RPC delays encountered by your app.
Different app-side "bridges" to storage may have different behavior; for example, ndb (for Python apps) strives "behind the curtains" to transparently use caching on your behalf, and "bunch up" RPC calls to the datastore (it also lets you do the latter explicitly with _multi and _async methods and functions). In other languages, or if using the old db in Python, you may observe different performance characteristics on the app side because they may employ such optimizations to different extents, all the way to "none at all".
At the lowest abstraction level, however, "bunched" and "async" behavior will always have a performance advantage versus writing each entity singly and synchronously.
If the back-end of the datastore gets fewer RPC calls, each requesting a substantial amount of writes, it can organize its own operations better than if it gets many RPC calls, each about writing a single entity.
The "async" behavior doesn't affect that at the lowest abstraction level (except that it may let ndb do more "bundling" behind the curtains): the datastore back-end must do the same amount of work whether the app side is blocked waiting for the results or just has a future watching asynchronously. But it can still improve your app's performance, since things can "overlap" and your app is able to do something else before waiting for the future to deliver.
So each "bridge" should document exactly what it's doing on your behalf, but even if it does, things are complex enough (esp. in a multi-instance, multi-threaded app) that it's worth experimenting and using Cloud Monitoring to check the actual performance effects of different approaches.
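For what it's worth, the same "fewer, bigger RPCs" idea is visible from Go as well. A sketch assuming the cloud.google.com/go/datastore client and a made-up Argument kind, batching ten writes into one PutMulti call instead of ten synchronous Put round trips:

    package main

    import (
        "context"
        "log"

        "cloud.google.com/go/datastore"
    )

    // Argument is a hypothetical entity kind used only for illustration.
    type Argument struct {
        Text  string
        Score float64
    }

    func main() {
        ctx := context.Background()
        client, err := datastore.NewClient(ctx, "my-project") // hypothetical project ID
        if err != nil {
            log.Fatal(err)
        }
        defer client.Close()

        args := make([]*Argument, 10)
        keys := make([]*datastore.Key, 10)
        for i := range args {
            args[i] = &Argument{Text: "argument", Score: 0.5}
            keys[i] = datastore.IncompleteKey("Argument", nil)
        }

        // One batched RPC for all ten entities ...
        if _, err := client.PutMulti(ctx, keys, args); err != nil {
            log.Fatal(err)
        }

        // ... instead of ten synchronous round trips like this:
        // for i := range args {
        //     if _, err := client.Put(ctx, keys[i], args[i]); err != nil { ... }
        // }
    }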

What is the Speed Difference Between Database and Web Service Calls?

All things being equal, and in the most simple form, which is faster?
1.) A call to a web service method
2.) A call to a database
For example, assume that you have a simple web service that just returns an integer that is calculated in X time. You also have a database that, when queried in the right way, also takes X time to calculate the answer. (So the compute time is the same in both cases.) In both cases, assume the amount of data in both directions is the same, say, a single 32-bit integer, for simplicity.
Thus far, the calculation times of both the web service and the database are exactly the same.
The environment is 1 application server, where the app resides, and 1 other server that is holding both the web service and the database. There is nothing else going on in the environment other than the application calling either the web service or database repeatedly. This all within one single LAN, so any network latency is equal.
From an application, which will be faster, the call to the database, or the call to the web service?
What I am trying to isolate, I guess, is which is more heavy-weight. Does the set up, open, close, tear down of a database connection end up slower than that for a web service, or is it the same? Additionally, if there are other things, such as parsing the result from a web service, how do they affect the speed?
O(1) doesn't refer to any length of time. A single operation could take .001 ms on a webservice and 100 seconds in a database and they both could be using O(1) functions:
http://en.wikipedia.org/wiki/Big_O_notation
It's hard to know quite what you're asking. If you're asking whether accessing a local database is generally faster than accessing a similar service over the internet, then I expect that, generally, the answer is that the local database will be faster. The call over the internet to the web service has a lot of overhead, and communication over the internet is relatively slow. Even on a slow computer, a database can perform many thousands of simple queries per second. Contrast that with access over the internet, where you'd be lucky to get 50 round-trip requests per second, not even accounting for the time it takes to perform the requested operation on the server.
If you're asking whether a server on the web can serve data faster by avoiding a database and calculating results directly, then the answer is it depends. The call to the database in this case adds unnecessary overhead if the data in it can be easily calculated in a stand-alone function. The answer to this question doesn't really have anything to do with a "web service". Is it faster to calculate an answer in a function or to access the answer using a query on a database? As I said, the answer would depend on the complexity of the particular function you had to use, and weighing its computation time against the overhead of accessing the answer (or part of the answer) directly from a database.
In short, the answer to your question depends on what exactly you're asking. It would also probably help to know why you're asking the question. I have a suspicion that the real answer is that this probably isn't something you need to worry about, not really a practical concern unless you have a particular situation requiring optimization.
If you're concerned about comparing speed when the web service and database are both on a LAN, I'm pretty sure the overhead of the db is less than that of the web service. The application typically maintains stateful connection(s) to the db, while requests to a web service are via HTTP, which is stateless, has relatively higher overhead, and is slower. Could be wrong, though. The best answer would be to whip up a simple web service and query, and (1) measure the time it takes to retrieve results using both methods and compare, and/or (2) create an app that opens a lot of threads and do some load testing.
A caveat: If your app doesn't maintain an open connection or have access to a pool of connections with the db, then the db alternative may well be slower. Initial creation of a db connection can be relatively slow. But that shouldn't figure into things, since you should write your app so that an open connection is always maintained.
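A rough sketch of that "measure both and compare" suggestion in Go, assuming a hypothetical endpoint and connection string on the same LAN, and reusing connections in both cases so per-call setup cost is not what gets measured:

    package main

    import (
        "database/sql"
        "fmt"
        "log"
        "net/http"
        "time"

        _ "github.com/lib/pq" // hypothetical driver choice; any driver works
    )

    // timeIt runs fn n times and returns the average duration per call.
    func timeIt(n int, fn func() error) (time.Duration, error) {
        start := time.Now()
        for i := 0; i < n; i++ {
            if err := fn(); err != nil {
                return 0, err
            }
        }
        return time.Since(start) / time.Duration(n), nil
    }

    func main() {
        const n = 1000

        db, err := sql.Open("postgres", "host=server dbname=test sslmode=disable") // hypothetical DSN
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()
        httpClient := &http.Client{} // keep-alive connections are reused

        dbAvg, err := timeIt(n, func() error {
            var x int
            return db.QueryRow("SELECT 1").Scan(&x) // trivial query, like the question assumes
        })
        if err != nil {
            log.Fatal(err)
        }

        wsAvg, err := timeIt(n, func() error {
            resp, err := httpClient.Get("http://server/answer") // hypothetical endpoint
            if err != nil {
                return err
            }
            return resp.Body.Close()
        })
        if err != nil {
            log.Fatal(err)
        }

        fmt.Printf("db: %v/call   web service: %v/call\n", dbAvg, wsAvg)
    }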
Based on practical experience, I would say that the database call is significantly faster.
It all depends on the network topology and languages you're using. If you're talking C#...my money would be on the database call being faster almost every time.
Your calls to the database server are going to be made over the native protocol. Everything is going to be optimized.
If you're calling a web service, you're going to need some mechanism to send the request to the web server, wait for the web server to respond, and then something to parse the result of the web service call back into your code.
One could say that generally, latency of the network in a web service (which will typically be over the internet) is going to be slower than the call to a database (which is typically on a LAN or something, which is faster than one's connection to the internet).
Of course, this makes a LOT of assumptions about setups/software/etc, etc which effectively reduces it to an apples and oranges comparison, which there is never a good answer for.
O(1) doesn't specify the speed, it specifies the 'growth' in time required as the underlying data gets larger. The constants are dropped from the equation. What this means is that an O(N^2) algorithm can be faster than an O(N) one for some really small N.
A web service is a way to connect to some functionality. Besides the network latency, the real time is bound by what the service is actually doing. There could be a database underneath, for example. If it is something that just returns an integer, the computational time is mostly trivial; the request is bound by the network.
A database needs to parse the query, build a query tree, optimize it, then apply some search algorithms against a series of caches and files. If you just plopped an integer into a trivial table, or made a tableless SQL call, then fetching the data is probably trivial; it's the whole transactional packaging that will eat CPU.
Can you get a packet back and forth to a server before you can parse trivial SQL and punch back a tabled result? Mostly, these days, I'd say it's a toss-up. Some networks are faster than others, while some databases and servers are pretty good. Nothing is certain.
In general, is a web service faster than a database? Yes, if and only if the service is trivial (if it's hiding a database, then it's obviously just additional time). Databases are big, bulky engines, and while they've gotten much faster over the years, their base level of transactional integrity demands an awful lot of minimum CPU usage. They're slower because they are doing so much more work. Contrast that with some explicit minimal computation hidden behind network access. A fiber or gigabit network can rapidly move data. It's just so much less work to get accomplished.
Of course, the reason we don't replace databases with custom-written web services is time. It takes too long to write, and then keep up to date. Way more effort than just slamming it into a database and accepting its performance.
Paul.
IMHO I would say the database call would be faster, hands down. I say this because there is much less overhead. With the verbosity of the HTTP protocol and SOAP markup incurred, you have a lot more bloat in your data. This bloat has extra cost for packaging and un-packaging. With a stored procedure call you could use an output parameter to return a single int instead of a result set, to make it even lighter.
Algorithmic complexity is just one variable that impacts the overall performance of a system. Other factors might include network latency or network bandwidth, especially when the size of the returned data is different.
If you run the same O(1) algorithm on a local machine, you will get the results faster than if you run the algorithm on a machine on another continent and need the same results sent over the network.
Other factors might include raw CPU speed if the calls are done on physically different machines.
That's why premature optimisation is the root of all evil.
EDIT:
I'd say it depends even more now on the details of the system, i.e., what database software you are using, or whether your web service is reading data from a static web page or dynamically generating the data.
But I am beginning to lose sight of why you are asking the question. You seem to say that both methods take the same amount of time. So if they take the same amount of time, how can you ask which is faster? Clearly they are equally fast. You need to tell us more about how and when they stop taking the same amount of time.
If we are assuming that you are communicating to a different server for both the web and database calls, wouldn't they be pretty much the same, since both requests are transferred through TCP/IP? The only thing then that could be compared is how big the actual results are that are sent back in terms of bits across the wire.

What advice can you give me for writing a meaningful benchmark?

I have developed a framework that is used by several teams in our organisation. Those "modules", developed on top of this framework, can behave quite differently, but they are all pretty resource-consuming, even though some are more so than others. They all receive data as input, analyse and/or transform it, and send it further.
We planned to buy new hardware and my boss asked me to define and implement a benchmark based on the modules in order to compare the different offers we have got.
My idea is to simply start each module sequentially with a well-chosen bunch of data as input.
Do you have any advice? Any remarks on this simple procedure?
Your question is pretty broad, so unfortunately my answer will not be very specific either.
First, benchmarking is hard. Do not underestimate the effort necessary to produce meaningful, repeatable, high-confidence results.
Second, what is your performance goal? Is it throughput (transactions or operations per second)? Is it latency (the time it takes to execute a transaction)? Do you care about average performance? Do you care about worst-case performance? Do you care about the absolute worst case, or do you care that 90%, 95% or some other percentile get adequate performance?
Depending on which goal you have, you should design your benchmark to measure against that goal. So, if you are interested in throughput, you probably want to send messages / transactions / input into your system at a prescribed rate and see if the system is keeping up.
If you are interested in latency, you would send messages / transactions / input and measure how long it takes to process each one.
If you are interested in worst-case performance, you will add load to the system up to whatever you consider "realistic" (or whatever the system design says it should support).
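Whichever of these goals you pick, the measurement loop itself is small. A Go sketch (processOne is a placeholder for running one transaction through a module) that collects per-transaction latencies and reports percentiles rather than just an average:

    package main

    import (
        "fmt"
        "math/rand"
        "sort"
        "time"
    )

    // processOne stands in for running one transaction through a module.
    func processOne() {
        time.Sleep(time.Duration(rand.Intn(5)) * time.Millisecond)
    }

    // percentile returns the p-th percentile (nearest rank) of sorted durations.
    func percentile(sorted []time.Duration, p float64) time.Duration {
        idx := int(float64(len(sorted))*p/100) - 1
        if idx < 0 {
            idx = 0
        }
        return sorted[idx]
    }

    func main() {
        const runs = 200
        latencies := make([]time.Duration, 0, runs)

        for i := 0; i < runs; i++ {
            start := time.Now()
            processOne()
            latencies = append(latencies, time.Since(start))
        }

        sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
        fmt.Println("p50:", percentile(latencies, 50))
        fmt.Println("p95:", percentile(latencies, 95))
        fmt.Println("p99:", percentile(latencies, 99))
    }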
Third, you do not say if these modules are going to be CPU bound or I/O bound, whether they can take advantage of multiple CPUs/cores, etc. As you are trying to evaluate different hardware solutions, you may find that your application benefits more from a great I/O subsystem than from a huge number of CPUs.
Fourth, the best benchmark (and the hardest) is to put realistic load into the system. Meaning, you record data from a production environment and put the new hardware solution through this data. Getting this done is harder than it sounds; often, it means adding all kinds of measurement points in the system to see how it behaves (if you do not have them already), modifying the existing system to add record/playback capabilities, modifying the playback to run at different rates, and getting a realistic (i.e., similar to production) environment for testing.
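The "play back at different rates" part can be as simple as a ticker-driven generator. A rough Go sketch, where handle stands in for feeding one recorded message into a module, that offers load at a fixed rate and reports whether the system kept up:

    package main

    import (
        "fmt"
        "time"
    )

    // handle stands in for pushing one recorded message into the system.
    func handle(msg int) {
        time.Sleep(2 * time.Millisecond) // pretend work
    }

    func main() {
        const targetRate = 200 // messages per second the benchmark tries to sustain
        const duration = 3 * time.Second

        ticker := time.NewTicker(time.Second / targetRate)
        defer ticker.Stop()
        deadline := time.After(duration)

        sent, done := 0, make(chan struct{}, 1024)
        completed := 0

    loop:
        for {
            select {
            case <-ticker.C:
                sent++
                go func(n int) { // fire-and-forget so a slow system cannot slow the generator
                    handle(n)
                    done <- struct{}{}
                }(sent)
            case <-done:
                completed++
            case <-deadline:
                break loop
            }
        }

        // Drain whatever is still in flight for a short grace period.
        grace := time.After(500 * time.Millisecond)
    drain:
        for completed < sent {
            select {
            case <-done:
                completed++
            case <-grace:
                break drain
            }
        }

        fmt.Printf("offered %d msgs at %d/s, completed %d, keeping up: %v\n",
            sent, targetRate, completed, completed == sent)
    }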
The most meaningful benchmark is to measure how your code performs under everyday usage. That will obviously provide you with the most realistic numbers.
Choose several real-life data sets and put them through the same processes your org uses every day. For extra credit, talk with the people that use your framework and ask them to provide some "best-case", "normal", and "worst-case" data. Anonymize the data if there are privacy concerns, but try not to change anything that could affect performance.
Remember that you are benchmarking and comparing two sets of hardware, not your framework. Treat all of the software as a black box and simply measure the hardware performance.
Lastly, consider saving the data sets and using them to similarly evaluate any later changes you make to the software.
If your system is supposed to be able to handle multiple clients all calling at the same time, then your benchmark should reflect this. Note that some calls will not play well together. For example, having 25 threads post the same bit of information at the same time could lead to locks on the server end, thus skewing your results.
From a nuts-and-bolts point of view, I've used Perl and its Benchmark module to gather the information I care about.
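If your modules happen to be callable from Go rather than Perl, the standard library's testing package covers the same nuts-and-bolts role; a minimal, hypothetical sketch (transform is a stand-in for one module):

    // module_bench_test.go, run with: go test -bench=. -benchmem
    package module

    import "testing"

    // transform stands in for one of the framework's modules.
    func transform(in []byte) []byte {
        out := make([]byte, len(in))
        copy(out, in)
        return out
    }

    func BenchmarkTransform(b *testing.B) {
        input := make([]byte, 64*1024) // a hypothetical 64 KiB payload
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            transform(input)
        }
    }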
If you're comparing differing hardware, then measuring the cost per transaction will give you a good comparison of the trade offs of hardware for performance. One configuration may give you the best performance, but costs too much. A less expensive configuration may give you adequate performance.
It's important to emulate the "worst case" or "peak hour" of load. It's also important to test with "typical" volumes. It's a balancing act to get good server utilization, that doesn't cost too much, that gives the required performance.
Testing across hardware configurations quickly becomes expensive. Another viable option is to first measure on the configuration you have, then simulate that behavior across virtual systems using a model.
If you can, try to record some operations users (or processes) are doing with your framework, ideally using a clone of the real system. That gives you the most realistic data. Things to consider:
Which functions are most often used?
How much data is transferred?
Do not assume anything. If you think "that is going to be fast/slow", don't bet on it. In 9 out of 10 cases, you're wrong.
Create a top ten for 1+2 and work from that.
That said: If you replace old hardware with new hardware, you can expect roughly 10% faster execution for each year that has passed since you bought the first set (if the systems are otherwise pretty equal).
If you have a specialized system, the numbers may be completely different, but usually new hardware doesn't change much. For example, adding a useful index to a database can reduce the runtime of a query from two hours to two seconds. Hardware will never give you that.
As I see it, there are two kinds of benchmarks when it comes to benchmarking software. First, microbenchmarks, where you try to evaluate a piece of code in isolation or see how a system deals with a narrowly defined workload: compare two sorting algorithms written in Java, or compare how fast two web browsers can perform some DOM manipulation operation. Second, there are system benchmarks (I just made the name up), where you try to evaluate a software system under a realistic workload: compare my Python-based backend running on Google Compute Engine and on Amazon AWS.
When dealing with Java and the like, keep in mind that the VM needs to warm up before it can give you realistic performance. If you measure time with the time command, the JVM startup time will be included. You almost always want to either ignore start-up time or keep track of it separately.
Microbenchmarking
During the first run, CPU caches are getting filled with the necessary data. The same goes for disk caches. During a few subsequent runs the VM continues to warm up, meaning the JIT compiles what it deems helpful to compile. You want to ignore these runs and start measuring afterwards.
Make a lot of measurements and compute some statistics: mean, median, standard deviation; plot a chart. Look at it and see how much it changes. Things that can influence the result include GC pauses in the VM, frequency scaling on the CPU, another process starting a background task (like a virus scan), or the OS deciding to move the process to a different CPU core; if you have a NUMA architecture, the results would be even more marked.
In the case of microbenchmarks, all of this is a problem. Kill whatever processes you can before you begin. Use a benchmarking library that can do some of it for you, like https://github.com/google/caliper.
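For a hand-rolled version of the same idea (here in Go, where there is no JIT warm-up but caches and frequency scaling still matter; workload is a placeholder), discard the warm-up runs and summarize the rest:

    package main

    import (
        "fmt"
        "math"
        "sort"
        "time"
    )

    // workload stands in for the code being microbenchmarked.
    func workload() {
        s := 0
        for i := 0; i < 1_000_000; i++ {
            s += i
        }
        _ = s
    }

    func main() {
        const warmup, measured = 10, 50

        // Warm-up runs fill caches (and, on a JIT'd runtime, trigger
        // compilation); their timings are thrown away.
        for i := 0; i < warmup; i++ {
            workload()
        }

        samples := make([]float64, measured)
        for i := range samples {
            start := time.Now()
            workload()
            samples[i] = float64(time.Since(start).Nanoseconds())
        }

        sort.Float64s(samples)
        mean := 0.0
        for _, v := range samples {
            mean += v
        }
        mean /= float64(len(samples))

        variance := 0.0
        for _, v := range samples {
            variance += (v - mean) * (v - mean)
        }
        stddev := math.Sqrt(variance / float64(len(samples)))
        median := samples[len(samples)/2]

        fmt.Printf("mean %.0f ns, median %.0f ns, stddev %.0f ns\n", mean, median, stddev)
    }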
System benchmarking
In the case of benchmarking a system under a realistic workload, these details do not really interest you, and your problem is "only" to know what a realistic workload is, how to generate it, and what data to collect. It is always best if you can instrument a production system and collect data there. You can usually do that, because you are measuring end-user characteristics (how long did a web page take to render), and these are I/O bound, so the code gathering the data does not slow down the system. (The page needs to be shipped to the user over the network; it does not matter if we also log a few numbers in the process.)
Be mindful of the difference between profiling and benchmarking. Benchmarking can give you the absolute time spent doing something; profiling gives you the relative time spent doing something compared to everything else that needed doing. This is because profilers run heavily instrumented programs (a common technique is to stop the world every few hundred ms and save a stack trace), and the instrumentation slows everything down significantly.
