I'm loosely considering using Google App Engine for some Java server hosting; however, I've come across what seems to be a bit of a problem whilst reading some of the docs. Most servers I've ever written, and certainly the one I have in mind, require some form of memory-based storage that persists between sessions, and GAE seems to provide no mechanism for this.
Data can be stored as static objects but an app may use multiple servers and the data cannot be shared between the servers.
There is memcache, which is shared, but since this is a cache it is not reliable.
This leaves only the datastore, which would work perfectly, but is far too slow.
What I actually need is a high-performance (i.e. memory-based) store that is accessible to, and consistent for, all client requests. In this case it is to provide a specialized locking and synchronization mechanism that sits in front of the datastore.
It seems to me that there is a big gap in the functionality here. Or maybe I am missing something?
Any ideas or suggestions?
Static data (data you upload along with your app) is visible, read-only, to all instances.
To share data between instances, use the datastore. Where low latency is important, cache in memcache. Those are the basic options. Reading from the datastore is pretty fast; it's only writes you need to concern yourself with, and those can be mitigated by making sure that any entity properties you don't need to query against are unindexed.
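For a concrete picture, here is a minimal sketch of that pattern using the GAE Java low-level APIs: memcache acts as a read-through cache in front of the datastore, and the payload is stored as an unindexed property because it is never queried against. The "Config" kind and the key names are made up for the example.

```java
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.memcache.Expiration;
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;

public class SharedState {
    private static final MemcacheService cache = MemcacheServiceFactory.getMemcacheService();
    private static final DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

    // Read-through: try memcache first, fall back to the datastore on a miss.
    public static String get(String name) throws EntityNotFoundException {
        String cached = (String) cache.get(name);
        if (cached != null) {
            return cached;
        }
        Entity e = datastore.get(KeyFactory.createKey("Config", name));
        String value = (String) e.getProperty("value");
        cache.put(name, value, Expiration.byDeltaSeconds(60));
        return value;
    }

    // Write to the datastore (the source of truth), then refresh the cache.
    public static void put(String name, String value) {
        Entity e = new Entity("Config", name);
        e.setUnindexedProperty("value", value);   // never queried on, so keep it unindexed
        datastore.put(e);
        cache.put(name, value, Expiration.byDeltaSeconds(60));
    }
}
```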
Another option, if it fits your budget, is to run your own cache in an always-on backend server.
Related
Can etcd be used as a reliable database replacement? Since it is distributed and stores key/value pairs in a persistent way, it would be a great alternative NoSQL database. In addition, it has a great API. Can someone explain why this is not a thing?
etcd
etcd is a highly available key-value store which Kubernetes uses for persistent storage of all of its objects, such as deployment, pod and service information.
etcd has strict access control: it can be accessed only through the API server on the master node. Nodes in the cluster other than the master do not have access to the etcd store.
nosql database
There are currently more than 255 NoSQL databases, which can be broadly classified into key-value based, column based, document based and graph based. Considering etcd as a key-value store, let's look at the available NoSQL key-value data stores.
Redis, memcached and memcacheDB are popular key-value stores. These are general-purpose distributed memory caching systems, often used to speed up dynamic database-driven websites by caching data and objects in memory.
Why etcd not an alternative
etcd cannot be stored in memory (RAM); it can only be persisted to disk storage, whereas Redis can be cached in RAM and can also be persisted to disk.
etcd does not have various data types; it is made to store only Kubernetes objects. But Redis and other key-value stores have data-type flexibility.
etcd guarantees only high availability, but does not give you fast querying and indexing. All the NoSQL key-value stores are built with the goal of fast querying and searching.
Even though it may seem obvious that etcd cannot be used as an alternative NoSQL database, I think the above explanation proves it is not a suitable alternative.
From the ETCD.IO site:
etcd is a strongly consistent, distributed key-value store that provides a reliable way to store data that needs to be accessed by a distributed system or cluster of machines. It gracefully handles leader elections during network partitions and can tolerate machine failure, even in the leader node.
It has a simple interface using HTTP and JSON. It is NOT just for Kubernetes; Kubernetes is just an example of a critical application that uses it.
You are right, it should be a thing. It is a nice, reliable data store with an easy-to-use API and a built-in way of telling you when things change (watches, on top of the Raft protocol). This is great for feature toggles and other items where everything needs to know, and it is much better than things like putting a trigger in an SQL database and getting it to send an event to an external application, or really horrible polling.
So if you are writing something like the Kubernetes use case, it is perfect: a well-proven store for a distributed application.
If you are writing something very different from the Kubernetes use case, then you are comparing it with all the other NoSQL databases. It is very different from something like MongoDB, so it may be a better fit when MongoDB or similar does not work for you.
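To make the HTTP and JSON interface mentioned above concrete, here is a small, hedged example using Java's built-in HttpClient. It assumes an etcd v3 server whose gRPC-JSON gateway is reachable at http://localhost:2379; the key name is made up, and the gateway path is /v3/kv/... on recent releases (older releases used /v3beta or /v3alpha).

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class EtcdHttpDemo {
    private static final String ETCD = "http://localhost:2379";

    // The v3 JSON gateway expects base64-encoded keys and values.
    static String b64(String s) {
        return Base64.getEncoder().encodeToString(s.getBytes());
    }

    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // Put a key.
        String putBody = String.format("{\"key\":\"%s\",\"value\":\"%s\"}",
                b64("feature/new-checkout"), b64("on"));
        HttpRequest put = HttpRequest.newBuilder()
                .uri(URI.create(ETCD + "/v3/kv/put"))
                .POST(HttpRequest.BodyPublishers.ofString(putBody))
                .build();
        System.out.println(http.send(put, HttpResponse.BodyHandlers.ofString()).body());

        // Read it back with a range request on the same key.
        String rangeBody = String.format("{\"key\":\"%s\"}", b64("feature/new-checkout"));
        HttpRequest range = HttpRequest.newBuilder()
                .uri(URI.create(ETCD + "/v3/kv/range"))
                .POST(HttpRequest.BodyPublishers.ofString(rangeBody))
                .build();
        System.out.println(http.send(range, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```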
Other example users
M3, a large-scale metrics platform for Prometheus created by Uber, uses etcd for rule storage and other functions
Consistency
There is a nice comparison of NoSQL database consistency by Jepsen at https://jepsen.io/analyses
The etcd team sums up their results at https://etcd.io/blog/jepsen-343-results/
The only answers I've come to see are those between our ears. I guess we need to show first that it can be done, and what the benefits are.
My colleagues seem to shy away from it because "it's for storing secrets, and common truth". The etcd v3 revision made etcd capable of much more, but the news simply hasn't rippled down yet.
Let's make some showcases, some success stories. Personally, I like etcd because of the reasons you mentioned, and because of its focus on dependable performance.
First, no: etcd is not the next NoSQL replacement. But there are some scenarios where it can come in handy.
Let's imagine you have (configuration) data that is mostly static but may change at runtime. Maybe your frontend needs to know the backend endpoints based on the customer's country to comply with legal requirements, and you know the worldwide rollout is done in phases.
So you could just use a Kubernetes ConfigMap to store the array of data (country -> endpoint) and let your backend watch this ConfigMap for changes.
On change, the application just reads in the list and provides a repository to allow access to the data from your service layer.
All operations need to be implemented in the repository (search, get, update, ...), but your data will be in memory (probably a linked hash map), so it will be very quick to retrieve, like a local cache (see the sketch below).
If the data gets changed by the application, just serialize the list and patch the ConfigMap. Any other application watching the ConfigMap will update its internal state.
However, there is no locking, so quick changes may result in race conditions.
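As a rough sketch of that in-memory repository (the class and field names are made up, and it assumes the watch callback has already parsed the serialized list into country -> endpoint pairs):

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class EndpointRepository {
    // Replaced wholesale on every change notification; reads need no locking.
    private volatile Map<String, String> endpointsByCountry = Collections.emptyMap();

    // Called from the ConfigMap/etcd watch callback whenever the config changes.
    public void reload(Map<String, String> latest) {
        endpointsByCountry = Collections.unmodifiableMap(new LinkedHashMap<>(latest));
    }

    // The service layer only ever hits this in-memory map (like a local cache).
    public String endpointFor(String country) {
        return endpointsByCountry.get(country);
    }
}
```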
etcd allows values of roughly 1 MB to be stored. That's enough for mostly static data.
Another application might be feature toggles. They do not change that much, but when they do, every application needs to know quickly, and polling sucks.
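For the feature-toggle case, a watch avoids polling entirely. Here is a hedged sketch using the jetcd client (io.etcd.jetcd); it assumes a recent jetcd release, an etcd endpoint at http://localhost:2379, and a made-up key name. Method names may differ slightly between client versions.

```java
import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;
import io.etcd.jetcd.Watch;
import io.etcd.jetcd.watch.WatchEvent;

import java.nio.charset.StandardCharsets;

public class ToggleWatcher {
    public static void main(String[] args) throws InterruptedException {
        Client client = Client.builder().endpoints("http://localhost:2379").build();
        ByteSequence key = ByteSequence.from("feature/new-checkout", StandardCharsets.UTF_8);

        // The listener fires on every change to the key; no polling loop required.
        Watch.Listener listener = Watch.listener(response -> {
            for (WatchEvent event : response.getEvents()) {
                String value = event.getKeyValue().getValue().toString(StandardCharsets.UTF_8);
                System.out.println(event.getEventType() + " -> " + value);
                // e.g. flip the in-memory toggle here
            }
        });

        try (Watch.Watcher watcher = client.getWatchClient().watch(key, listener)) {
            Thread.sleep(60_000);   // keep the demo alive while changes arrive
        }
        client.close();
    }
}
```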
See if this checklist of limitations of etcd compared to a more full-featured database will work for you:
Your database size is going to be within 2 GB (extensible to max 8 GB)
No sharding, and hence none of the data scalability that NoSQL db clusters (Mongo, Redis, ...) provide
Meant for simple value stores with payloads limited to 1.5 MB. Can be increased but impacts other queries. Most dbs can store large BLOBs. Redis can store a value of 512 MB.
No query language for more complex searches beyond key prefix. Other databases provide more complex data types like document, graph storage with querying and indexing. Even key-value db Redis supports more complex types through modules along with querying and search capabilities
No ACID transactions (only etcd's single-request, compare-and-swap style Txn primitive)
When you have a hammer, everything may look like a potential nail. You need to make sure it really is one.
Note: I have investigated CouchDB for some time and need some actual experiences.
I have an Oracle database for a fleet tracking service, and here are some of its stats:
100 GB db
Huge insertion/sec (our received messages)
Reliable replication (via Oracle streams on 4 servers)
Heavy complex queries.
Now the question: Can CouchDB be used in this case?
Note: Why I thought of CouchDB?
I have read about its ability to scale horizontally very well. That's very important in our case.
Since it's schema-free, we can handle changes more easily; we have a lot of changes across different tables and stored procedures.
Thanks
Edit I:
I need transactions too, but I can tolerate other solutions. And if there is a little delay in replication, that would be no problem, as long as it is guaranteed.
You are enjoying the following features with your database:
Using it in production
The data is naturally relational (related to itself)
Huge insertion rate (no MVCC concerns)
Complex queries
Transactions
These are all reasons not to switch to CouchDB.
Of course, the story is not so simple. I think you have discovered what many people never learn: complex problems require complex solutions. We cannot simply replace our database and take the rest of the month off. Sure, CouchDB (and BigCouch) supports excellent horizontal scaling (and cross-datacenter replication too!) but the cost will be rewriting a production application. That is not right.
So, where can CouchDB benefit you?
I suggest that you begin augmenting your application with CouchDB applications. Deploy CouchDB, import your data into it, and build non mission-critical applications. See where it fits best.
For your project, these are the key CouchDB strengths:
It is a small, simple tool—easy for you to set up on a workstation or server
It is a web server. It integrates very well with your infrastructure and security policies.
For example, if you have a flexible policy, just set it up on your LAN
If you have a strict network and firewall policy, you can set it up behind a VPN, or with your SSL certificates
With that step done, it is very easy to access. Just make HTTP or HTTPS requests. Whether you are importing data from Oracle with a custom tool, or using your web browser, it's all the same.
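As a quick illustration of how simple those HTTP requests are, here is a hedged sketch using Java's built-in HttpClient. It assumes CouchDB at http://localhost:5984, a database named fleet, and admin:secret credentials; the document id and fields are made up.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class CouchDemo {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        String auth = "Basic " + Base64.getEncoder().encodeToString("admin:secret".getBytes());

        // Create (or update) a position document: PUT /fleet/{docId}
        String doc = "{\"truck\":\"TRK-42\",\"lat\":35.7,\"lon\":51.4,\"ts\":\"2011-05-01T10:00:00Z\"}";
        HttpRequest put = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:5984/fleet/position-0001"))
                .header("Authorization", auth)
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(doc))
                .build();
        System.out.println(http.send(put, HttpResponse.BodyHandlers.ofString()).body());

        // Read it back: GET /fleet/{docId}
        HttpRequest get = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:5984/fleet/position-0001"))
                .header("Authorization", auth)
                .build();
        System.out.println(http.send(get, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```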
Yes! CouchDB is an app server too! It has a built-in administrative app, to explore data, change the config, etc. (like a built-in phpmyadmin). But for you, the value will be building admin applications and reports as simple, traditional HTML/Javascript/CSS applications. You can get as fancy or as simple as you like.
As your project grows and becomes valuable, you are in a great position to grow, using replication
Either expand the core with larger CouchDB clusters
Or, replicate your data and applications into different data centers, or onto individual workstations, or mobile phones, etc. (The strategy will be more obvious when the time comes.)
CouchDB gives you a simple web server and web site. It gives you a built-in web services API to your data. It makes it easy to build web apps. Therefore, CouchDB seems ideal for extending your core application, not replacing it.
I don't agree with this answer.
I think CouchDB suits the fleet tracking use case especially well, due to its distributed nature. Moreover, the unreliable nature of the GPRS connections used for transmitting position data makes the offline-first paradigm of CouchApps the perfect partner for your application.
For uploading data from the trucks, the insertion rate can benefit hugely from CouchDB replication and bulk inserts, especially if performed on SSD-based CouchDB hosting.
For downloading data to the trucks, CouchDB provides filtered replication, allowing each truck to download only the data it really needs, instead of the whole database.
Regarding complex queries, NoSQL databases are more flexible and can perform much faster than relational databases; it's only a matter of structuring and querying your data sensibly.
I'm trying to use AppEngine as sort of a RESTful web service. The service is supposed to do simple finds and puts to the Datastore, so Objectify seems good for covering that part. It also does a few lookups to other services if data is not available in the Datastore; I'm using Redstone XMLRPC for that part.
Now, I have a couple of questions about how to design the serving part in view of AppEngine's quotas (one should think about efficiency in most cases anyway, but AppEngine's billing makes more people think about it).
First, let's say I use plain servlets. In this case, I see two options: either I create a number of servlets, each providing a different service, with JSON passed to each of them, or I use a single servlet (or a smaller number of them) and determine the action to perform based on a parameter passed with the JSON. Will either design have any significant effect on the instance hours, etc. clocked by AppEngine?
What is the cost difference if I use a RESTful framework such as Restlet or RESTEasy, as opposed to the barebones approach?
This question is something of a follow up to : Creating Java Web Service using Google AppEngine
It's not that important: most of the cost goes to the datastore, so frontend micro-optimisation doesn't matter much.
You might save a few cents by choosing plain servlets, but... is that your goal? It's much more important to design good data structures, prepare all required data in the background, have a good caching strategy, etc.
I agree with @Igor.
However, there is an additional thing to consider: HTTP sessions.
GAE supports HTTP sessions. Since GAE is a distributed system, sessions are stored in the Datastore (and cached in Memcache for efficient reading). The session is updated on every request (to support expiration), so the Datastore is accessed on every request.
Sessions are not required for REST and should be turned off (in the Java runtime, set sessions-enabled to false in appengine-web.xml).
I'm referring to the Datastore Admin tool explained here: http://code.google.com/appengine/docs/adminconsole/datastoreadmin.html
One way of backing up locally is by using bulkloader.py, but I like this solution better, as your data stays in Google's cloud and can easily be transferred from one app to another, entity by entity, using a button in the Admin Console. I'm thinking of having two apps: one that I can manually back up to every week, and another that actually serves users. The backup app might incur some storage costs, but overall costs would be minimal, as no frontend/backend instances would be used except as needed for the backups...
I certainly don't recommend backing up one GAE app's data to another app. To me, there are a handful of reasons to back up:
To safeguard against a catastrophic outage on Google's part.
To safeguard against a programming error on your part that results in significant data loss.
To safeguard against your Google account being revoked.
Backing up to another GAE app does relatively little on each of these.
If you use the High Replication datastore, you're already distributing your data all across Google's cloud, so doing that again just seems redundant -- but not in the high availability sense of the word. The only way Google is going to lose your data is through some catastrophic disaster, in which case both of your apps may be in peril.
If you backup your data locally, you can store historic snapshots. Whereas if you're simply backing up to another app, you aren't storing historic data, so you have little protection against programmer error unless you catch it in between the time the error happens and when you're going to do your next backup.
In the event that Google, for whatever reason, kills your account, you lose both apps and all your data.
Ultimately, by backing up to another GAE, you still have all your eggs in one basket. You've just partitioned your basket. If your data is important enough to be backed up outside of your app, it's important enough to backup locally or to another provider entirely. That's my opinion anyway.
We back up our data periodically to another application to use as our development environment, but as others pointed out, that's not really protecting your data against a major App Engine catastrophe (as unlikely as that is...).
The best solution I've found for archiving data for disaster recovery is to pull it down, either onto EC2 or local disk, using the Python scripts Google provides.
I am trying to figure out the best way to deploy a single Google App Engine application across multiple regions.
The same code is to be used, but the stored data is specific to each region. Motivating examples are hyperlocal review sites, like yelp.com or urbanspoon, where restaurants and other businesses to review are specific to a region (e.g. boston.app.com, seattle.app.com).
A couple options include:
Create multiple GAE applications, and duplicate the code across them.
Create a single GAE application, and store all data for all regions in the same Datastore, with a region identifier field on each model delimiting the relevant region.
Some of the trade-offs:
Option 2 seems like it will be increasingly inefficient (space: replicating a region identifier for each record of every model; time: filtering/indexing on the identifier for every query).
Option 1 requires an app ID for every region, while GAE only allows 10 apps per account. Moreover, deploying the code across every region, as well as Datastore migration, seems like it could be a pain to manage.
In the ideal world, I would have a single application instance. From that instance, I could route between subdomains (like here), as well as have a separate Datastore for each subdomain. But I believe GAE only allows a single datastore per application.
Does anyone have ideas on the best way to solve this problem? Or options that I am not considering?
Thanks for your time!
I would recommend your approach #2. Storage space is cheap (and region codes are short), and datastore performance does not degrade with size, unlike most databases. Using a single app also makes for easier management and upgrades, and avoids any issues with the TOS (which prohibit sharding your app to avoid billing charges).
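To illustrate approach #2, here is a hedged sketch using the GAE Java low-level datastore API: the region is derived from the subdomain in the Host header, and every query simply adds an equality filter on an indexed region property. The Restaurant kind, property names, and subdomain scheme are assumptions for the example.

```java
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Query;
import java.util.List;
import javax.servlet.http.HttpServletRequest;

public class RegionQueries {
    // Derive the region from the subdomain, e.g. "boston.app.com" -> "boston".
    static String regionOf(HttpServletRequest req) {
        return req.getServerName().split("\\.")[0];
    }

    // Fetch up to 20 restaurants for the caller's region.
    static List<Entity> restaurantsFor(HttpServletRequest req) {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        Query q = new Query("Restaurant")
                .setFilter(new Query.FilterPredicate(
                        "region", Query.FilterOperator.EQUAL, regionOf(req)));
        return ds.prepare(q).asList(FetchOptions.Builder.withLimit(20));
    }
}
```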
If you use source code revision control, then it is not too bad to push identical code into multiple apps. You could set a policy whereby only full-fledged tags are allowed to be pushed up to GAE. Another option is to make your application version the same as the revision number.
With App Engine, I (and I believe most others) always migrate data from within my model code. You can't easily do bulk migrations in GAE and the usual solution is to migrate data as you come across it in code. In this way, you can keep your models pretty much identical across applications.
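For example, a "migrate as you load" hook might look like this with Objectify (mentioned earlier in the thread). This is a sketch assuming an Objectify 5.x-style API; the Review entity, its new region field, and the default value are all made up.

```java
import com.googlecode.objectify.annotation.Entity;
import com.googlecode.objectify.annotation.Id;
import com.googlecode.objectify.annotation.Index;
import com.googlecode.objectify.annotation.OnLoad;

@Entity
public class Review {
    @Id Long id;
    @Index String region;      // added in schema version 2
    int schemaVersion = 2;     // new entities are written at the current version

    // Runs every time the entity is loaded; upgrades old records in memory.
    // The upgraded copy is persisted the next time the entity is saved.
    @OnLoad void migrate() {
        if (schemaVersion < 2) {
            if (region == null) {
                region = "boston";   // hypothetical default for pre-region data
            }
            schemaVersion = 2;
        }
    }
}
```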
Having said that, I would probably still go with a unified application. It's more future-proof. What if users want to join their L.A. identity and their New York identity? Or what if an advertiser offers you a sweet deal for you to run some marketing reports on your own data?
Finally, a few bytes of data doesn't matter so much on App Engine. As your site grows, you will very quickly discover that you will always be bumping into ceilings. GAE limits are extremely small compared to a traditional web server and so you will have to work within those limits anyway. For example, you can only fetch 1,000 records at a time. So your architecture will already support a piecemeal paging solution. So don't worry too much about an extra field or two in your record.