I have a Camel context configured to do some manipulation of input data in order to build RDF triples.
There's a final route with a processor that, using the Sesame Client API, talks to a separate Sesame instance (running on Tomcat with 3 GB of RAM) and sends add commands (each command contains about 5-10 statements).
The processor is running as a singleton and the corresponding "from" endpoint has 10 concurrentConsumers (I tried with 1, then 5, then 10 - more or less the same behaviour).
I'm using HTTPRepository from my processor for sending add commands and, while running, I observe a (rapid and) progressive degradation of indexing performance. The overall process starts indexing triples very quickly, but after a short while the number of committed statements grows very slowly.
On the Sesame side I used both MemoryStore and NativeStore, but the performance behaviour seems more or less the same.
The questions:
Which kind of store is recommended if I want to speed up the indexing phase?
Does Repository.getConnection do some kind of connection pooling? In other words, can I open and close a connection each time the "add" processor does its work?
Given that I first need to create a store with all those triples, is it preferable to create a "local" Sail store instead of one managed by a remote Sesame server (so I wouldn't use an HTTPRepository)?
I am assuming that you're adding in transactions of 4 or 5 statements for good reason, but if you have a way to do larger transactions, that will significantly boost speed. The ideal (and quickest) approach would be to send all 300,000 triples to the store in a single transaction.
Your questions, in order:
If you're only storing 300,000 statements the choice of store is not that important, as both the native and the memory store can easily handle this kind of scale at good speed. I would expect the memory store to be slightly more performant, especially if you have configured it to use a non-zero sync delay for persistence, but the native store has a lower memory footprint and is of course more robust.
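For what it's worth, a non-zero sync delay can be set in code when the store is created. A minimal sketch, assuming Sesame 2.x (org.openrdf) and a hypothetical data directory:

    import java.io.File;

    import org.openrdf.repository.Repository;
    import org.openrdf.repository.sail.SailRepository;
    import org.openrdf.sail.memory.MemoryStore;

    public class MemoryStoreSetup {
        public static void main(String[] args) throws Exception {
            // Hypothetical data directory; omit it for a purely transient store.
            MemoryStore store = new MemoryStore(new File("/tmp/sesame-data"));
            store.setSyncDelay(5000L); // sync to disk at most every 5 seconds instead of on every commit
            Repository repo = new SailRepository(store);
            repo.initialize();
            // ... add statements ...
            repo.shutDown();
        }
    }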
HTTPRepository.getConnection does not pool the actual RepositoryConnection itself, but it does pool resources internally (so the HttpConnections that Sesame uses under the hood are pooled). So getConnection is relatively cheap, and opening and closing multiple connections is fine - though you might consider reusing the same connection for multiple adds, so that you can batch those adds in a single transaction.
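To make that concrete, here is a rough sketch (not a drop-in solution) of reusing one connection and batching many adds into a single transaction. The server URL and repository id are placeholders, and it assumes a Sesame version with the begin()/commit() transaction API (older 2.x releases used setAutoCommit(false) for the same purpose):

    import org.openrdf.model.Statement;
    import org.openrdf.repository.RepositoryConnection;
    import org.openrdf.repository.http.HTTPRepository;

    public class BatchedAdds {

        public static void addInOneTransaction(Iterable<Statement> statements) throws Exception {
            // Placeholder server URL and repository id.
            HTTPRepository repo = new HTTPRepository("http://localhost:8080/openrdf-sesame", "myRepo");
            repo.initialize();
            RepositoryConnection conn = repo.getConnection();
            try {
                conn.begin();                 // one transaction for the whole batch
                for (Statement st : statements) {
                    conn.add(st);             // all adds become part of the same transaction
                }
                conn.commit();                // committed to the remote store in one go
            } catch (Exception e) {
                conn.rollback();
                throw e;
            } finally {
                conn.close();
                repo.shutDown();
            }
        }
    }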
Whether to store locally or on a remote server really depends on you. Obviously a local store will be quicker because you eliminate network latency as well as the cost of (de)serializing, but the downside is that a local store is not easily made available outside your own application.
I have a particular use case for multiple in-memory key-value maps that need very fast lookup times. They are set just once a day, so they can be considered immutable for all practical purposes. Redis is not an option since it gets CPU-throttled when multiple threads access it. Multi-instance Redis takes up too much memory because of data replication. The important thing to consider here is that the read rate is very high in bursts: around 10 million requests in bursts from around 40-50 workers simultaneously.
I was thinking of creating a simple client-server architecture with multiple readers connecting to a server to read from shared-memory maps. However, I wonder whether such an architecture already exists and has been tested thoroughly for this use case, in which case I should not be reinventing the wheel.
So, to sum up, what is my best alternative? TIA.
Might not be suitable for you but you could try RBLDNSD and store your values in DNS. It's high performance and results will be cached, and it's easy to read the values from pretty much any programming environment. To write values to it you'll need to write directly to its zone files, but the format is simple and easy to write.
You don't mention the size of your maps, but given that performance is so critical, it sounds like you may want to consider keeping copies of your 'multiple in-memory key-value maps' with each worker.
You could then implement a simple mechanism to notify each worker that it's time to refresh its maps (e.g. Redis PUBLISH, or any other pub/sub-type framework).
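As a rough sketch of that notification mechanism (using the Jedis client here purely as an example; the channel name and reload logic are hypothetical):

    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.JedisPubSub;

    public class MapRefreshListener {

        public static void main(String[] args) {
            Jedis subscriber = new Jedis("localhost", 6379);
            // Blocks while subscribed; in a real worker, run this on its own thread.
            subscriber.subscribe(new JedisPubSub() {
                @Override
                public void onMessage(String channel, String message) {
                    reloadLocalMaps(); // rebuild this worker's in-memory maps
                }
            }, "maps-refresh");
        }

        // Publisher side, run once a day after the new data is ready:
        //   new Jedis("localhost", 6379).publish("maps-refresh", "reload");

        private static void reloadLocalMaps() {
            // Load the day's key-value data from wherever it is produced.
        }
    }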
At the risk of running afoul of the Stack Overflow self-promotion police :-) eXtremeDB might be a consideration. It's not schema-less, but your schema can simply define a key-value pair. It supports MVCC (optimistic, non-blocking) concurrency, so even the relatively infrequent writes won't get in the way of readers, and you'll be able to utilize all the CPU cores.
I'm deploying the Apache Solr web app on two redundant Tomcat 6 servers to provide redundancy and improved availability. At this point, scalability is not an issue.
I have a load balancer that can dynamically route traffic to one server or the other or both.
I know that Solr supports a master/slave configuration, but that requires manual recovery if the slave receives updates during a master outage (which it will in my use case).
I'm considering a simpler approach using the ability to reload a core:
- only one of the two servers is receiving traffic at any time (the "active" instance), but both are running,
- both instances share the same index data and
- before re-routing traffic due to an outage, the now active instance is told to reload the index core(s)
Limited testing of failovers with both index reads and writes has been successful. What implications/issues am I missing?
Your thoughts and opinions welcomed.
The simple approach to redundancy you're considering seems reasonable, but you will not be able to use it for disaster recovery unless you can share the data/index to/from a different physical location using your NAS/SAN.
Here are some suggestions:
Make backups for disaster recovery and test that those backups actually restore, because an index could conceivably become corrupted and there is no internal checksumming in Solr/Lucene. An index could get wiped, or some records could get deleted and merged away without you knowing it; backups can be useful for recovering those records/docs later if you need to perform an investigation.
Before you re-route traffic to the second instance, I would run some queries to load the caches and also to test and confirm that the current index works before it goes online (a small sketch of this follows at the end of this answer).
Isolate the updates to a single location, process, and thread to ensure transactional integrity in the event of a cutover; consistency can be difficult to manage because Solr does not use a vector clock to synchronize updates the way some databases do. I would personally keep an ordered copy of all updates in some other store, separate from Solr, just in case a small time window needs to be replayed.
In general, my experience with Solr has been excellent as long as you are not using cutting-edge features and plugins. I have one instance that currently has 40 million docs and an uptime of well over a year with no issues. That doesn't mean you won't have issues, but it gives you an idea of how stable it can be.
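As a rough illustration of the reload-and-warm step mentioned above, the failover hook could boil down to a few HTTP calls before the load balancer flips traffic. The host, core name, and warming query are hypothetical placeholders; the URLs use the standard CoreAdmin RELOAD command and the /select handler:

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class SolrFailoverPrep {

        public static void main(String[] args) throws Exception {
            String solr = "http://standby-host:8080/solr";
            hit(solr + "/admin/cores?action=RELOAD&core=core0");     // pick up the shared index files
            hit(solr + "/core0/select?q=*:*&rows=0");                // confirm the index opens and warm caches
            hit(solr + "/core0/select?q=some+common+query&rows=10"); // warm with a representative query
        }

        private static void hit(String url) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            if (conn.getResponseCode() != 200) {
                throw new IllegalStateException("Unexpected response " + conn.getResponseCode() + " from " + url);
            }
            try (InputStream in = conn.getInputStream()) {
                while (in.read() != -1) {
                    // drain the response
                }
            }
        }
    }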
I hardly know anything about Solr, so I don't know the answers to some of the questions that need to be considered with this sort of setup, but I can provide some things for consideration. You will have to consider what sorts of failures you want to protect against and why and make your decision based on that. There is, after all, no perfect system.
Both instances are using the same files. If the files become corrupt or unavailable for some reason (hardware fault, software bug), the second instance is going to fail the same as the first.
On a similar note, are the files stored and accessed in such a way that they are always valid when the inactive instance reads them? Will the inactive instance try to read the files when the active instance is writing them? What would happen if it does? If the active instance is interrupted while writing the index files (power failure, network outage, disk full), what will happen when the inactive instance tries to load them? The same questions apply in reverse if the 'inactive' instance is going to be writing to the files (which isn't particularly unlikely if it wasn't designed with this use in mind; it might for example update some sort of idle statistic).
Also, reloading the indices sounds like it could be a rather time-consuming operation, and service will not be available while it is happening.
If the active instance needs to complete an orderly shutdown before the inactive instance loads the indices (perhaps due to file validity problems mentioned above), this could also be time-consuming and cause unavailability. If the active instance can't complete an orderly shutdown, you're gonna have a bad time.
I have the following design. There's a pool of identical worker processes (max 64 of them, on average 15) that uses a shared database for reading only. The database is about 25 MB. Currently it's implemented as a MySQL database, and all the workers connect to it. This works for now, but I'd like to:
- eliminate cross-process data transfer - i.e. execute SQL in-process
- keep the data completely in memory at all times (I mean, 25 MB!)
- not load said 25 MB separately into each process (i.e. keep it in shared memory somehow)
Since it's all reading, concurrent access issues are nonexistent and locking is not necessary. Data refreshes happen from time to time, but they are infrequent and I'm willing to shut down the whole shebang for those.
Access is performed via pretty vanilla SQL SELECTs. No subqueries, no joins. LIKE conditions are the fanciest feature ever used. Indices, however, are very much needed.
Question - can anyone think of a database library that would meet the goals outlined above?
You can use SQLite with its in-memory database.
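For example, a minimal sketch using the xerial sqlite-jdbc driver (my assumption; any SQLite binding works similarly), with a hypothetical table:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class InMemorySqlite {
        public static void main(String[] args) throws Exception {
            Class.forName("org.sqlite.JDBC"); // not needed with newer driver versions
            try (Connection conn = DriverManager.getConnection("jdbc:sqlite::memory:");
                 Statement st = conn.createStatement()) {
                st.executeUpdate("CREATE TABLE lookup (k TEXT PRIMARY KEY, v TEXT)");
                st.executeUpdate("CREATE INDEX idx_v ON lookup (v)");
                st.executeUpdate("INSERT INTO lookup VALUES ('key1', 'value1')");
                try (ResultSet rs = st.executeQuery("SELECT v FROM lookup WHERE k LIKE 'key%'")) {
                    while (rs.next()) {
                        System.out.println(rs.getString("v"));
                    }
                }
            }
        }
    }

One caveat: a plain :memory: database is private to the connection that opened it, so each process would load its own copy; a file-backed SQLite database that stays hot in the OS page cache gets most of the in-memory speed without that duplication.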
I would look at treating it like a cache. memcached is easy and very fast, as it's all in memory. I'm a fan of MongoDB; it or something similar would also be fast, although it is disk-based.
I am developing an application which involves multiple-user interactivity in real time. It basically involves lots of AJAX POST/GET requests from each user to the server - which in turn translate to database reads and writes. The real-time result returned from the server is used to update the client-side front end.
I know optimisation is quite a tricky, specialised area, but what advice would you give me to get maximum speed of operation here - speed is of paramount importance, but currently some of these POST requests take 20-30 seconds to return.
One way I have thought about optimising it is to club POST requests together and send them to the server in groups of 8-10, instead of firing individual requests. I am not currently using caching on the database side, and don't really know much about what it is or whether it would be beneficial in this case.
Also, do the AJAX POST and GET requests incur the same overhead in terms of speed?
Rather than continuously hitting the database, cache frequently used data items (with an expiry time based upon how infrequently the data changes).
Can you reduce your communication with the server by caching some data client side?
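For the server-side caching suggestion, a minimal sketch of an expiring cache, here using Guava's CacheBuilder (my choice of library, not something from the question); the TTL and loader are placeholders:

    import java.util.concurrent.TimeUnit;

    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.CacheLoader;
    import com.google.common.cache.LoadingCache;

    public class QueryCache {

        private final LoadingCache<String, String> cache = CacheBuilder.newBuilder()
                .maximumSize(10000)
                .expireAfterWrite(5, TimeUnit.MINUTES)   // expiry tuned to how often the data changes
                .build(new CacheLoader<String, String>() {
                    @Override
                    public String load(String key) {
                        return queryDatabase(key);       // the database is hit only on a cache miss
                    }
                });

        public String get(String key) {
            return cache.getUnchecked(key);
        }

        private String queryDatabase(String key) {
            // Run the real SELECT here.
            return "value-for-" + key;
        }
    }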
The purpose of GET is as its name implies - to GET information. It is intended to be used when you are reading information to display on the page. Browsers will cache the result from a GET request and if the same GET request is made again then they will display the cached result rather than rerunning the entire request. This is not a flaw in the browser processing but is deliberately designed to work that way so as to make GET calls more efficient when the calls are used for their intended purpose. A GET call is retrieving data to display in the page and data is not expected to be changed on the server by such a call and so re-requesting the same data should be expected to obtain the same result.

The POST method is intended to be used where you are updating information on the server. Such a call is expected to make changes to the data stored on the server and the results returned from two identical POST calls may very well be completely different from one another since the initial values before the second POST call will be different from the initial values before the first call because the first call will have updated at least some of those values. A POST call will therefore always obtain the response from the server rather than keeping a cached copy of the prior response.
Ref.
The optimization tricks you'd use are generally the same tricks you'd use for a normal website, just with a faster turnaround time. Some things you can look into doing are:
Prefetch GET requests that have high odds of being loaded by the user
Use a caching layer in between, as Mitch Wheat suggests. Depending on your technology platform, you can look into memcache; it's quite common and there are libraries for just about everything (a minimal client sketch follows below)
Look at denormalizing data that is going to be queried at a very high frequency. Assuming that reads are more common than writes, you should get a decent performance boost if you move the workload to the write portion of the data access (as opposed to adding database load via joins)
Use delayed inserts to give priority to writes and let the database server optimize the batching
Make sure you have intelligent indexes on the table and figure out what benefit they're providing. If you're rebuilding the indexes very frequently due to a high write:read ratio, you may want to scale back the queries
Look at retrieving data in more general queries and filtering the data when it makes it to the business layer of the application. MySQL (for instance) uses a very specific query cache that matches against a specific query. It might make sense to pull all results for a given set, even if you're only going to be displaying x%.
For writes, look at running asynchronous queries to the database if it's possible within your system. Data synchronization doesn't have to be instantaneous, it just needs to appear that way (most of the time)
Cache common pages on disk/memory in a fully formatted state so that the server doesn't have to do much processing of them
All in all, there are lots of things you can do (and they generally come down to general development practices on a more bite-sized scale).
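As mentioned in the caching point above, here is a minimal memcached client sketch using spymemcached; the host, key, and TTL are hypothetical:

    import java.net.InetSocketAddress;

    import net.spy.memcached.MemcachedClient;

    public class MemcacheExample {

        public static void main(String[] args) throws Exception {
            MemcachedClient client = new MemcachedClient(new InetSocketAddress("localhost", 11211));

            Object profile = client.get("user:42:profile");
            if (profile == null) {
                profile = loadFromDatabase();                 // expensive query runs only on a miss
                client.set("user:42:profile", 300, profile);  // cache for 5 minutes
            }

            client.shutdown();
        }

        private static Object loadFromDatabase() {
            return "profile-data"; // stand-in for the real query result
        }
    }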
The common tuning tricks would be:
- use more indexing
- use less indexing
- use more or less caching on filesystem, database, application, or content
- provide more bandwidth, more CPU power, or more memory on any of your components
- minimize the overhead in any kind of communication
Of course an alternative would be to:
0 develop a set of tests, preferably automated, that can determine whether your application works correctly.
1 measure the 'speed' of your application.
2 determine how fast it has to become
3 identify the source of the performance problems:
typical problems are: network throughput, file I/O, latency, locking issues, insufficient memory, CPU
4 fix the problem
5 make sure it is actually faster
6 make sure it is still working correctly (hence the tests above)
7 return to 1
Have you tried profiling your app?
Not sure what framework you're using (if any), but frankly from your questions I doubt you have the technical skill yet to just eyeball this and figure out where things are slowing down.
Bluntly put, you should not be messing around with complicated ways to try to solve your problem, because you don't really understand what the problem is. You're more likely to make it worse than better by doing so.
What I would recommend you do is time every step. Most likely you'll find that either
you've got one or two really long running bits or
you're running a shitton of queries because of an n+1 error or the like
When you find what's going wrong, fix it. If you don't know how, post again. ;-)
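To make "time every step" concrete, here is one minimal way to do it (the labels and wrapped calls are hypothetical):

    import java.util.concurrent.Callable;

    public class StepTimer {

        public static <T> T timed(String label, Callable<T> step) throws Exception {
            long start = System.nanoTime();
            try {
                return step.call();
            } finally {
                long elapsedMs = (System.nanoTime() - start) / 1000000;
                System.out.println(label + " took " + elapsedMs + " ms");
            }
        }
    }

    // Usage, wrapping the pieces of a slow request handler:
    //   List<Row> rows = StepTimer.timed("main query", () -> runMainQuery(userId));
    //   String html   = StepTimer.timed("render",     () -> render(rows));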
All things being equal, and in the simplest form, which is faster?
1.) A call to a web service method
2.) A call to a database
For example, assume that you have a simple web service that just returns an integer that is calculated in X time. You also have a database that, when queried in the right way, also takes X time to calculate the answer. (So the compute time is the same in both cases.) In both cases, assume the amount of data in both directions is the same - say, a single 32-bit integer, for simplicity.
Thus far, the calculation times of both the web service and the database are exactly the same.
The environment is 1 application server, where the app resides, and 1 other server that is holding both the web service and the database. There is nothing else going on in the environment other than the application calling either the web service or database repeatedly. This all within one single LAN, so any network latency is equal.
From an application, which will be faster, the call to the database, or the call to the web service?
What I am trying to isolate, I guess, is which is more heavy-weight. Does the setup, open, close, and teardown of a database connection end up slower than that of a web service, or is it the same? Additionally, if there are other things, such as parsing the result from a web service, how do they affect the speed?
O(1) doesn't refer to any length of time. A single operation could take .001 ms on a webservice and 100 seconds in a database and they both could be using O(1) functions:
http://en.wikipedia.org/wiki/Big_O_notation
It's hard to know quite what you're asking. If you're asking whether accessing a local database is generally faster than accessing a similar service over the internet, then I expect that, generally, the answer is that the local database will be faster. The call over the internet to the web service has a lot of overhead, and communication over the internet is relatively slow. Even on a slow computer a database can perform many thousands of simple queries per second. Contrast that with access over the internet, where you'd be lucky to get 50 round-trip requests per second, not even accounting for the time it takes to perform the requested operation on the server.
If you're asking whether a server on the web can serve data faster by avoiding a database and calculating results directly, then the answer is it depends. The call to the database in this case adds unnecessary overhead if the data in it can be easily calculated in a stand-alone function. The answer to this question doesn't really have anything to do with a "web service". Is it faster to calculate an answer in a function or to access the answer using a query on a database? As I said, the answer would depend on the complexity of the particular function you had to use, and weighing its computation time against the overhead of accessing the answer (or part of the answer) directly from a database.
In short, the answer to your question depends on what exactly you're asking. It would also probably help to know why you're asking the question. I have a suspicion that the real answer is that this probably isn't something you need to worry about, not really a practical concern unless you have a particular situation requiring optimization.
If you're concerned about comparing speed when the web service and database are both on a LAN, I'm pretty sure the overhead of the db is less than that of the web service. The application typically maintains stateful connection(s) to the db, while requests to a web service go over HTTP, which is stateless, has relatively higher overhead, and is slower. Could be wrong, though. The best answer would be to whip up a simple web service and query, and (1) measure the time it takes to retrieve results using both methods and compare, and/or (2) create an app that opens a lot of threads and do some load testing.
A caveat: If your app doesn't maintain an open connection or have access to a pool of connections with the db, then the db alternative may well be slower. Initial creation of a db connection can be relatively slow. But that shouldn't figure into things, since you should write your app so that an open connection is always maintained.
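For illustration, keeping a pool of open connections might look like the sketch below, using Apache Commons DBCP purely as an example; the driver, JDBC URL, and credentials are placeholders:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    import org.apache.commons.dbcp.BasicDataSource;

    public class PooledDbCall {

        private static final BasicDataSource POOL = new BasicDataSource();
        static {
            POOL.setDriverClassName("com.mysql.jdbc.Driver"); // placeholder driver
            POOL.setUrl("jdbc:mysql://dbserver/mydb");        // placeholder URL
            POOL.setUsername("app");
            POOL.setPassword("secret");
            POOL.setInitialSize(5);   // connections opened up front
            POOL.setMaxActive(20);    // upper bound under load
        }

        public static int fetchAnswer() throws Exception {
            try (Connection conn = POOL.getConnection();   // borrowed from the pool, not newly created
                 PreparedStatement ps = conn.prepareStatement("SELECT answer FROM results WHERE id = 1");
                 ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getInt(1) : -1;
            }
        }
    }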
Based on practical experience, I would say that the database call is significantly faster.
It all depends on the network topology and languages you're using. If you're talking C#...my money would be on the database call being faster almost every time.
Your calls to the database server are going to be made over the native protocol. Everything is going to be optimized.
If you're calling a web service, you're going to need some mechanism to send the request to the web server, wait for the web server to respond, and then something to parse the result of the web service call back into your code.
One could say that, generally, the network latency of a call to a web service (which will typically be over the internet) is going to be higher than that of a call to a database (which is typically on a LAN or something similar, which is faster than one's connection to the internet).
Of course, this makes a LOT of assumptions about setups/software/etc, etc which effectively reduces it to an apples and oranges comparison, which there is never a good answer for.
O(1) doesn't specify the speed, it specifies the 'growth' in time required as the underlying data gets larger. The constants are dropped from the equation. What this means is that an O(N^2) algorithm can be faster than an O(N) one for some really small N - for example, N^2 is smaller than 100*N whenever N < 100.
A web service is a way to connect to some functionality. Besides the network latency, the real time is bound by what the service is actually doing. There could be a database underneath for example. If it is something that just returns an Integer, the computational time is mostly trivial, the request is bounded by the network.
A database needs to parse the query, build a query tree, optimize it, then apply some search algorithms against a series of caches and files. If you just plopped an Integer into a trivial table, or made a table-less SQL call, then fetching the data is probably trivial; it's the whole transactional packaging that will eat CPU.
Can you get a packet back and forth to a server before you can parse trivial SQL and punch back a tabled result? Mostly, these days, I'd say it's a toss-up. Some networks are faster than others, while some databases and servers are pretty good. Nothing is certain.
In general, is a web service faster than a database? Yes, if and only if the service is trivial (if it's hiding a database, then it's obviously just additional time). Databases are big, bulky engines, and while they've gotten much faster over the years, their base level of transactional integrity imposes an awful lot of minimum CPU usage. They're slower because they are doing so much more work. Contrast that with some explicit minimal computation hidden behind network access. A fiber or gigabit network can rapidly move data. It's just so much less work to get accomplished.
Of course, the reason we don't replace databases with custom-written web services is time. It takes too long to write them and keep them up to date. Way more effort than just slamming the data into a database and accepting its performance.
Paul.
IMHO the database call would be faster, hands down. I say this because there is much less overhead. With the verbosity of the HTTP protocol and the SOAP markup incurred, you have a lot more bloat in your data, and that bloat has extra packaging and unpacking costs. With a stored procedure call you could use an output parameter to return a single int instead of a result set, to make it even lighter.
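As a rough sketch of that last point, a stored procedure with an output parameter via JDBC's CallableStatement; the procedure name is hypothetical:

    import java.sql.CallableStatement;
    import java.sql.Connection;
    import java.sql.Types;

    public class OutputParamCall {

        public static int callCalc(Connection conn) throws Exception {
            try (CallableStatement cs = conn.prepareCall("{call calc_answer(?)}")) {
                cs.registerOutParameter(1, Types.INTEGER);
                cs.execute();
                return cs.getInt(1); // a single int comes back, no result set to package up
            }
        }
    }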
Algorithmic complexity is just one variable that impacts the overall performance of a system. Other factors might include network latency or network bandwidth, especially when the size of the returned data is different.
If you run the same O(1) algorithm on a local machine, you will get the results faster than if you run the algorithm on a machine on another continent and need the same results sent over the network.
Other factors might include raw CPU speed if the calls are done on physically different machines.
That's why premature optimisation is the root of all evil.
EDIT:
I'd say it depends even more now on the details of the system, i.e., what database software you are using, or whether or not your web service is reading data from a static web page or dynamically generating the data.
But I am beginning to lose sight of why you are asking the question. You seem to say that both methods take the same amount of time. So if they take the same amount of time, how can you ask which is faster? Clearly they are equally fast. You need to tell us more about how and when they stop taking the same amount of time.
If we are assuming that you are communicating to a different server for both the web and database calls, wouldn't they be pretty much the same, since both requests are transferred through TCP/IP? The only thing then that could be compared is how big the actual results are that are sent back in terms of bits across the wire.