NTP server setup config - ntp

The most important thing is that my 2 machines are synced with each other. Of course all else being equal, I would like the shared time to be synced with some public server(s).
Is it ever prudent to avoid syncing the system with the outside world, because this results in worse sync between the machines?
Assuming I do wish to sync the system with the outside world, is it best to simply have the two machines serve each other, or should I concern myself with the peer configuration setting? I don't find the docs clear at all when it comes to the prefer keyword either. Doesn't it already prefer servers higher up the list?

1 - That depends on your specific situation. In general, if you can, you should always sync to a valid set of time sources. These don't strictly speaking have to be 'external'; you could deploy your own stratum 0 solution by way of a GPS/MSF/DCF/CDMA receiver and hard-wire that into a host you physically control. But generally speaking I can't think of any reason why two hosts would want to sync to each other but not to anything 'valid', when there is usually some option that allows it.
2 - Since it is only two machines, then depending on the physical locations of the boxes, it would be better to give them both their own valid config with at least 3 and ideally 5 external low-stratum servers to connect to.
In a simple set-up like yours, where you're not providing a timing domain to an entire network or collection of hosts, you can safely ignore the peer command.
For more info check the docs. If you had a setup as shown below then you could/would peer servers A & B. This is just one example and in reality would include additional peer statements and cross connections.
GPS (s0)                   DCF77 (s0)
    |                          |
    v                          v
Server A (s1) <---peer---> Server B (s1)
    |                          |
    v                          v
Server C (s2)              Server D (s2)
    |                          |
    v                          v
London Clients             Dutch Clients
As for prefer, you should check the docs; it's a tricky one:
Marks the server as preferred. All other things being equal, this host
will be chosen for synchronization among a set of correctly operating
hosts. See the Mitigation Rules and the prefer Keyword page for
further information.
From the Mitigation Rules:
In the prefer scheme the clustering algorithm is modified so that the
prefer peer is never discarded; on the contrary, its potential removal
becomes a termination condition. If the original algorithm were about
to toss out the prefer peer, the algorithm terminates immediately. The
prefer peer can still be discarded by the sanity checks and
intersection algorithms, of course, but it will always survive the
clustering algorithm.
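Putting that together, a minimal ntp.conf sketch for Server A in the diagram above might look something like this. The pool hostnames, the Server B address and the driftfile path are placeholders, not taken from the question:

    # Sketch only: hostnames below are placeholders.
    # Three to five external low-stratum sources; iburst speeds up the
    # initial sync after a restart.
    server 0.pool.ntp.org iburst
    server 1.pool.ntp.org iburst
    server 2.pool.ntp.org iburst prefer   # marked preferred, all else being equal

    # Peer with Server B so the two boxes stay in agreement with each other
    # even if the upstream servers become unreachable.
    peer serverb.example.com

    # Allow queries but not runtime reconfiguration.
    restrict default kod nomodify notrap nopeer noquery
    restrict serverb.example.com nomodify notrap

    # Remember the clock drift between restarts.
    driftfile /var/lib/ntp/ntp.drift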

Related

Peer to peer replication of local databases

I have a program in C that monitors traffic and records the URLs visited by the user. Currently, I am maintaining this in a hash table. My key is the src-IP address and the result is a data-structure with a linked list of URLs. I am currently maintaining 50k to 100k records in a hash table. When the user logs out, the record can get deleted.
The program runs independently on an Active-Standby pair. I want to replicate this database to the other machine so that, if my primary machine crashes (the 2 systems act as Client and Server), the standby can continue recording what is associated with the user.
The hard way is to write code for sending this information to the peer, and on the peer system to receive and store it. The issue is that it will add lots of code (and bugs!). For the data replication and data store, here are a few prerequisites:
I want data-record replication between these machines. I am NOT looking at adding another machine/cluster unless required.
Prefer a library so that queries are quick. If not, another process on the same machine that I can reach over IPC.
Add, update and delete operations should be supported.
An in-memory database is a must.
Support multiple such databases with different keys.
Something that has publish/subscribe.
Resync capability if the backup dies and comes back again.
Interface should be in C
Possible options I looked at were ZooKeeper, Redis, Memcached, SQLite, Berkeley DB.
ZooKeeper - Needs an odd number of systems for tie-breaking. Not suitable for 1-to-1.
Redis - Looks to fit my requirements, with hiredis for the C interface. Separate process though.
Memcached - I don't have any caching requirements.
SQLite - Embedded database with a C interface.
Berkeley DB - Embedded database for better scale.
So Redis, SQLite and Berkeley DB look like my options to go forward. I would appreciate any help/thoughts on the DBs I should research more for my requirements, or on any other DBs I should look at. I apologize if my question is very generic. If the question does not belong here, please point me to the right forum.
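For what it's worth, if Redis ends up being the pick, the hiredis calls involved are small; below is a minimal sketch. The address, key and payload are made up, and the actual copying between the two boxes would come from Redis's own master/replica replication rather than from this code:

    /* Minimal hiredis sketch: store a record and publish a change
     * notification. Address, key names and payload are placeholders. */
    #include <stdio.h>
    #include <hiredis/hiredis.h>

    int main(void)
    {
        redisContext *c = redisConnect("127.0.0.1", 6379);
        if (c == NULL || c->err) {
            fprintf(stderr, "connect failed: %s\n", c ? c->errstr : "out of memory");
            return 1;
        }

        /* Add/update: key is the source IP, value is the serialized URL list. */
        redisReply *r = redisCommand(c, "SET urls:%s %s",
                                     "10.1.2.3", "example.com,example.org");
        if (r) freeReplyObject(r);

        /* Publish/subscribe: tell any listener (e.g. the standby) what changed. */
        r = redisCommand(c, "PUBLISH record-updates %s", "10.1.2.3");
        if (r) freeReplyObject(r);

        redisFree(c);
        return 0;
    }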

Indexing about 300,000 triples in Sesame using Camel

I have a Camel context configured to do some manipulation of input data in order to build RDF triples.
There's a final route with a processor that, using the Sesame Client API, talks to a separate Sesame instance (running on Tomcat with 3GB of RAM) and sends add commands (each command contains about 5 - 10 statements).
The processor is running as a singleton and the corresponding "from" endpoint has 10 concurrentConsumers (I tried with 1, then 5, then 10 - more or less the same behaviour).
I'm using HTTPRepository from my processor for sending add commands and, while running, I observe a (rapid and) progressive degradation of indexing performance. The overall process starts indexing triples very quickly, but after a little while the committed statements grow very slowly.
On the Sesame side I used both a MemoryStore and a NativeStore, but the (performance) behaviour seems more or less the same.
The questions:
Which kind of store is recommended if I want to speed up the indexing phase?
Does Repository.getConnection do some kind of connection pooling? In other words, can I open and close a connection each time the "add" processor does its work?
Given that I first need to create a store with all those triples, is it preferable to create a "local" Sail store instead of having it managed by a remote Sesame server (so I wouldn't use an HTTPRepository)?
I am assuming that you're adding using transactions of 4 or 5 statements for good reason, but if you have a way to do larger transactions, that will significantly boost speed. Ideal (and quickest) would be to just send all 300,000 triples to the store in a single transaction.
Your questions, in order:
If you're only storing 300,000 statements the choice of store is not that important, as both native and memory can easily handle this kind of scale at good speed. I would expect the memory store to be slightly more performant, especially if you have configured it to use a non-zero sync delay for persistence, but native has a lower memory footprint and is of course more robust.
HTTPRepository.getConnection does not pool the actual RepositoryConnection itself, but internally pools resources (so the actual HttpConnections that Sesame uses internally are pooled). So getConnection is relatively cheap, and opening and closing multiple connections is fine - though you might consider reusing the same connection for multiple adds, so that you can batch multiple adds in a single transaction.
Whether to store locally or on a remote server really depends on you. Obviously a local store will be quicker because you eliminate network latency as well as the cost of (de)serializing, but the downside is that a local store is not easily made available outside your own application.

couchdb replication on a lot of servers

I am currently looking at CouchDB and I understand that I have to specify all the replications by hand. If I want to use it on 100 nodes, how would I do the replication?
Doing 99 "replicate to" and 99 "replicate from" on each node
It feels like it would be overkill since a node replication includes all the other nodes replications to it
Doing 1 replicate to the next one to form a circle (like A -> B -> C -> A)
Would work until one crash, then all wait until it comes back
The latency would be big for replicating from the first to the last
Isn't there a way to say: "here are 3 IPs on the full network. Connect to them and share with everyone as you see fit like an independent P2P" ?
Thanks for your insight
BigCouch won't provide the cross data-center stuff out of the box. Cloudant DBaaS (based on BigCouch) does have this setup already across several data-centers.
BigCouch is a sharded "Dynamo-style" fork of Apache CouchDB--it is to be merged into the "mainline" Apache CouchDB in the future, fwiw. The shards live across nodes (servers) in the same data-center. "Classic" CouchDB-style replication is used (afaik) to keep the BigCouches in the various data-centers in sync.
CouchDB-style replication (n-master) is change-based, so replication only includes the latest changes.
You would need to set up to/from pairs of replication for each node/database combination. However, if all of your servers are intended to be identical, replication won't actually happen that often--it will only happen if needed.
If A gets a change, replication ships it to B and C (etc). However, if B--having just got that change--replicates it to C before A gets the chance to--due to network latency, etc--then when A does finally try, it will realize the data is already there and not bother sending the change again.
If this is a standard part of your setup (i.e., every time you make a db you want it replicated everywhere else), then I'd highly recommend automating the setup.
Also, check out the _replicator database. It makes it much easier to manage what's going on:
https://gist.github.com/fdmanana/832610
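For reference, a _replicator entry is just a JSON document; a continuous pull from one other node could look roughly like this (host and database names are placeholders):

    {
        "_id": "pull-mydb-from-node-a",
        "source": "http://node-a.example.com:5984/mydb",
        "target": "mydb",
        "continuous": true
    }

One such document per source node keeps the replications visible and restartable, instead of being fire-and-forget _replicate calls.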
Hope something in there is useful. :)

What are good algorithms to keep consistency across multiple files in a network?

What are good algorithms to keep consistency in multiple files?
This is a school project. I have to implement in C, some replication across a network.
I have 2 servers,
Server A1
Server A2
Both servers have their own file called "data.txt"
If I write something to one of them, I need the other to be updated.
I also have another scenario, with 3 Servers.
Server B1
Server B2
Server B3
I need these to do pretty much the same.
While this would be fairly simple to implement, if one or two of the servers were to go down, they would have to update themselves when coming back up.
I'm sure there are algorithms that solve this efficiently. I know what I want, I just don't know exactly what I'm looking for!
Can someone point me in the right direction please?
Thank you!
The fundamental issue here is known as the 'CAP theorem', which defines three properties that a distributed system can have:
Consistency: Reading data from the system always returns the most up-to-date data.
Availability: Every request either succeeds or fails (it doesn't just keep waiting until things recover)
Partition tolerance: The system can operate when its servers are unable to communicate with each other (a server being down is one special case of this)
The CAP theorem states that you can only have two of these. If your system is consistent and partition tolerant, then it loses the availability condition - you might have to wait for a partition to heal before you get a response. If you have consistency and availability, you'll have downtime when there's a partition, or enough servers are down. If you have availability and partition tolerance, you might read stale data, or have to deal with conflicting writes.
Note that this applies separately to reads and writes - you can have an Available and Partition-tolerant system for reads, but a Consistent and Partition-tolerant system for writes. This is basically a master-slave system; in a partition, writes might fail (if they're on the wrong side of a partition), but reads will work (although they might return stale data).
So if you want to be Available and Partition Tolerant for reads, one easy option is to just designate one host as the only one that can do writes, and sync from it (eg, using rsync from a cron script or something - in your C project, you'd just copy the file over using some simple network code periodically, and do an extra copy just after modifying it).
If you need partition tolerance for writes, though, it's more complex. You can have two servers that can't talk to each other both doing writes, and later have to figure out what data wins. This basically means you'll need to compare the two versions when syncing and decide what wins. This can just be as simple as 'let the highest timestamp win', or you can use vector clocks as in Dynamo to implement a more complex policy - which is appropriate here depends on your application.
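To make the 'highest timestamp wins' idea concrete, a resync step in the C project could be as simple as comparing modification times. This sketch assumes the remote copy has already been fetched to a local temporary file; both file names are placeholders:

    /* Sketch of a "highest timestamp wins" decision during resync.
     * data.txt is the local copy, data.remote.txt a fetched copy of the
     * peer's file; both names are placeholders. */
    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
        struct stat local, remote;

        if (stat("data.txt", &local) != 0 || stat("data.remote.txt", &remote) != 0) {
            perror("stat");
            return 1;
        }

        /* Whichever copy was written most recently wins the conflict. */
        if (remote.st_mtime > local.st_mtime)
            printf("remote copy is newer: replace data.txt with it\n");
        else
            printf("local copy is newer (or same age): keep data.txt\n");

        return 0;
    }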
Check out rsync and how Dropbox works.
With every write to server A, fork a process to write the same content to server B, so that all writes to server A are replicated to server B. If you have multiple servers, have the forked process write to all the backup servers.
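A bare-bones sketch of that fork-and-forward approach is below; the backup address, port and payload are placeholders, there is no retry logic, and the catch-up path for a backup that was down is left out:

    /* Sketch: append an update locally, then fork a child that forwards the
     * same line to the backup server over TCP. BACKUP_HOST, BACKUP_PORT and
     * "data.txt" are placeholders. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define BACKUP_HOST "192.0.2.2"   /* placeholder address of server B */
    #define BACKUP_PORT 9000          /* placeholder port */

    /* Append one line to the local data file. */
    static int write_local(const char *line)
    {
        FILE *f = fopen("data.txt", "a");
        if (!f)
            return -1;
        fprintf(f, "%s\n", line);
        fclose(f);
        return 0;
    }

    /* Child process: send the same line to the backup, then exit. */
    static void forward_to_backup(const char *line)
    {
        struct sockaddr_in addr;
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        if (fd < 0)
            _exit(1);

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(BACKUP_PORT);
        inet_pton(AF_INET, BACKUP_HOST, &addr.sin_addr);

        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
            write(fd, line, strlen(line));
            write(fd, "\n", 1);
        }
        close(fd);
        _exit(0);
    }

    int main(void)
    {
        const char *update = "10.1.2.3 example.com";   /* placeholder record */
        pid_t pid;

        if (write_local(update) != 0) {
            perror("write_local");
            return 1;
        }

        pid = fork();
        if (pid == 0)
            forward_to_backup(update);   /* child never returns */
        else if (pid > 0)
            waitpid(pid, NULL, 0);       /* reap the child; a real monitor
                                            would do this asynchronously */
        return 0;
    }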

simple Solr deployment with two servers for redundancy

I'm deploying the Apache Solr web app in two redundant Tomcat 6 servers,
to provide redundancy and improved availability. At this point, scalability is not an issue.
I have a load balancer that can dynamically route traffic to one server or the other or both.
I know that Solr supports master/slave configuration, but that requires manual recovery if the slave receives updates during the master outage (which it will in my use case).
I'm considering a simpler approach using the ability to reload a core:
- only one of the two servers is receiving traffic at any time (the "active" instance), but both are running,
- both instances share the same index data and
- before re-routing traffic due to an outage, the now active instance is told to reload the index core(s)
Limited testing of failovers with both index reads and writes has been successful. What implications/issues am I missing?
Your thoughts and opinions welcomed.
The simple approach to redundancy you're considering seems reasonable, but you will not be able to use it for disaster recovery unless you can share the data/index to/from a different physical location using your NAS/SAN.
Here are some suggestions:
Make backups for disaster recovery and test that those backups work, since an index could conceivably become corrupted; there are no checksums happening internally in Solr/Lucene. An index could get wiped, or some records could get deleted and merged away without you knowing it, and backups can be useful for recovering those records/docs at a later time if you need to perform an investigation.
Before you re-route traffic to the second instance, I would run some queries to warm the caches and to test and confirm that the current index works before it goes online.
Isolate the updates to one location, process and thread to ensure transactional integrity in the event of a cutover; it could be difficult to manage consistency, as Solr does not use a vector clock to synchronize updates like some databases do. I personally would keep an ordered copy of all updates separately from Solr in some other store, just in case a small time window needs to be replayed.
In general, my experience with Solr has been excellent as long as you are not using cutting-edge features and plugins. I have one instance that currently has 40 million docs and an uptime of well over a year with no issues. That doesn't mean you won't have issues, but it gives you an idea of how stable it can be.
I hardly know anything about Solr, so I don't know the answers to some of the questions that need to be considered with this sort of setup, but I can provide some things for consideration. You will have to consider what sorts of failures you want to protect against and why and make your decision based on that. There is, after all, no perfect system.
Both instances are using the same files. If the files become corrupt or unavailable for some reason (hardware fault, software bug), the second instance is going to fail the same as the first.
On a similar note, are the files stored and accessed in such a way that they are always valid when the inactive instance reads them? Will the inactive instance try to read the files when the active instance is writing them? What would happen if it does? If the active instance is interrupted while writing the index files (power failure, network outage, disk full), what will happen when the inactive instance tries to load them? The same questions apply in reverse if the 'inactive' instance is going to be writing to the files (which isn't particularly unlikely if it wasn't designed with this use in mind; it might for example update some sort of idle statistic).
Also, reloading the indices sounds like it could be a rather time-consuming operation, and service will not be available while it is happening.
If the active instance needs to complete an orderly shutdown before the inactive instance loads the indices (perhaps due to file validity problems mentioned above), this could also be time-consuming and cause unavailability. If the active instance can't complete an orderly shutdown, you're gonna have a bad time.
