What exactly does the "Highly Scalable" property of Google Datastore mean?

Simple question: does it mean it can handle a billion or more entities (rows, in the MySQL sense) in a kind (table, in the MySQL sense) without any sharding and without compromising performance?

Yes, you can handle billions of entities with no sharding.
The performance of datastore queries does not depend on the number of entities that you have; it depends on the number of entities that you want to retrieve. In other words, fetching 100 entities takes roughly the same time whether you have only 100 entities or 1 billion.
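A minimal ndb sketch to illustrate the point (the Event model and its properties are made up): the fetch below costs roughly the same whether the kind holds a hundred entities or a billion, because only 100 are retrieved.

```python
from google.appengine.ext import ndb

# Hypothetical model, just for illustration.
class Event(ndb.Model):
    user = ndb.StringProperty()
    created = ndb.DateTimeProperty(auto_now_add=True)

# The cost of this query is driven by the 100 entities it returns,
# not by how many Event entities exist in the kind overall.
recent = Event.query().order(-Event.created).fetch(100)
```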

Yes, it can handle billions of entities in a kind without compromising performance. However, "without sharding" is questionable. By default, all your entities are available for Google to "shard" however they see fit to meet the demands of your app. When I say "shard" here, I mean spread your entities across machines or datacenters as they see fit. Sharding is not something you ever need to manage yourself.
You can, however, restrict sharding (in this sense) by putting multiple entities in the same entity group (i.e. by giving multiple entities the same parent). This is something you should avoid when possible, so that you do not restrict how Google can optimize your data with sharding. However, if you need to access many entities within a single transaction, you may need to make entity groups. More information on why and when you'd want to is available here.
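For illustration, a hedged ndb sketch of what "putting entities in the same entity group" looks like in code (Account, Purchase, and the key name are invented): giving entities the same parent key places them in one entity group, which is what enables the single-transaction access described above.

```python
from google.appengine.ext import ndb

class Account(ndb.Model):
    pass

class Purchase(ndb.Model):
    amount = ndb.FloatProperty()

# 'alice' is a made-up key name.
account_key = ndb.Key(Account, 'alice')

@ndb.transactional
def record_purchases():
    # Both entities share the same parent, so they live in one entity
    # group and can be written atomically in a single transaction, at
    # the cost of limiting how the datastore can spread them out.
    ndb.put_multi([
        Purchase(parent=account_key, amount=9.99),
        Purchase(parent=account_key, amount=4.50),
    ])

record_purchases()
```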
By the way, Google may also make multiple copies of your data in multiple locations around the world to increase read throughput, if that's what their algorithms determine is most optimal.

Related

How to implement sharding?

First world problems: We've got a production system that is growing rapidly, and we are aiming to grow our user base even more. At peak times our DB is flatlining at 100% CPU, which I take as an indication that it's pretty much stretched to the limit. Being an AWS instance, we could always throw some more hardware at it, but long term, it seems we will need to implement sharding.
I've Googled all over and found lots of explanations of what sharding is, why it is a good idea under certain circumstances, what the design considerations are, etc., but not a word on the practicalities of how to actually do it.
What are the practical steps to shard a database? How do you redirect queries to the appropriate shard? And how do you run reports that require data from all shards?
The first thing you'll want to decide is whether or not you want to take on the complexity of routing queries in your application. If you decide to roll your own implementation, there are a number of complexities that you'll need to deal with over time.
You'll need a scheme to distribute data and queries evenly across the cluster. You'll also need to ensure that this scheme is forward-compatible with a larger cluster: if your data is already big enough to require a sharded architecture, it's likely that you'll need to add more servers later.
The problem with sharding schemes is that they force you to make tradeoffs that you wouldn't have to make with a single-server database. For example, if you are sharding by user_id, any query which spans multiple users will need to be sent to all servers (or a subset of servers) and the results must be accumulated in your client application. This is especially complex if you are using aggregate queries that rely on the ordering of the data, such as MAX(), or any histogram computation.
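A rough sketch of both ideas, assuming a hypothetical run_query(shard, sql) helper and made-up shard names: single-user queries are routed by hashing the user_id, while an aggregate such as MAX() has to fan out to every shard and be combined in the client.

```python
import hashlib

# Hypothetical shard names; run_query(shard, sql) is assumed to exist.
SHARDS = ['shard-0', 'shard-1', 'shard-2', 'shard-3']

def shard_for(user_id):
    """Pick a shard for a user with a stable hash, so single-user
    queries can be routed to exactly one server."""
    digest = hashlib.md5(str(user_id).encode('utf-8')).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

def global_max_order_total(run_query):
    """An aggregate like MAX() must query every shard and be
    accumulated in the client application."""
    partials = [run_query(s, "SELECT MAX(total) FROM orders") for s in SHARDS]
    return max(p for p in partials if p is not None)
```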
All of this complexity isn't meant to scare you, but it's something you'll need to pay attention to. There are tools out there that can help you (disclosure: my company makes a tool called dbShards) but you can definitely put together your own solution, especially if your application is mature and the query patterns are quite predictable.

App Engine: Can pull queue tasks handle arbitrary tags?

https://cloud.google.com/appengine/docs/python/taskqueue/overview-pull
Is there a limit on taskqueue pull task tag names, or can there be millions of arbitrary task tags?
There's no documented hard limit, and I wouldn't expect there to be one; a couple of reasons come to mind:
Internally, tasks are stored in Bigtable like everything else, so one could imagine that tags are indexed just as our own data is, and there's no limit there.
The database is designed to find indexed data very efficiently, and they purposely denied us any methods to group tags, so we can't use them to fan in data; that means no merge-joins and thus guaranteed performance that scales indefinitely :)
In this thread people discuss how reliable the queue is when you push its limits, and this quote is interesting:
We were using many different tags (basically regrouping events per user
with several million users).
So at least this one guy just went with it and used millions of tags, with no issues directly related to the practice.
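For context, a hedged sketch of the pattern that answer describes (the queue name, tag value, and handle_event() are made up): tasks are added to a pull queue with a per-user tag and later leased back by that tag.

```python
from google.appengine.api import taskqueue

# 'events-pull' is assumed to be a pull queue defined in queue.yaml.
queue = taskqueue.Queue('events-pull')

# Tag each task with an arbitrary per-user value.
queue.add(taskqueue.Task(payload='clicked_buy',
                         method='PULL',
                         tag='user-12345'))

# Later, a worker leases just that user's events.
tasks = queue.lease_tasks_by_tag(3600, 100, tag='user-12345')
for task in tasks:
    handle_event(task.payload)  # hypothetical handler
queue.delete_tasks(tasks)
```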

Google App Engine ndb dynamic indexes alternative

Background
I'm creating an application that allows users to define their own datasets with custom properties (and property types).
A user could, while interacting with the application, define a dataset that has the following columns:
Name: String
Location: Geo
Weight: Float
Notes: Text (not indexed)
How many: Int
etc...
While there will be restrictions on the total number of properties (say 10-20 or something), there are no restrictions on the property types.
Google's ndb datastore allows this to happen and will auto-generate simple indexes for searches involving combinations of equality operators and no sorts, or only sorts.
Ideal
Multiple sorts
Equality and sorts
Combinations of inequalities
I'm trying to determine if I should use NDB at all, or switch to something else (SQL seems extremely expensive, comparatively, which is one of the reasons I'm hesitant).
For the multiple sorts, I could write server-side code that queries for the first, then sorts in memory by the second, third, etc. I could also query for the data and do the sorting on the client side.
For the combinations of inequalities, I could do the same (more or less).
These solutions are obviously not performant, and won't scale if there are a large number of items that match the first query.
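To make the in-memory approach above concrete, a small sketch (Record and its properties are hypothetical): the datastore handles the first sort, and the remaining sort keys are applied in memory, which is exactly what stops scaling once the first query matches many entities.

```python
from google.appengine.ext import ndb

# Hypothetical model for one user-defined dataset; the property
# names are invented for illustration.
class Record(ndb.Model):
    name = ndb.StringProperty()
    weight = ndb.FloatProperty()
    how_many = ndb.IntegerProperty()

# Let the datastore handle the first sort, then apply the remaining
# sort keys in memory. This only works while every matching entity
# fits in a single fetch, which is why it doesn't scale.
rows = Record.query().order(Record.weight).fetch(1000)
rows.sort(key=lambda r: (r.weight, r.how_many, r.name))
```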
BaaS providers like Kinvey (which runs on GAE unless I'm quite mistaken) use schemaless databases and allow you to both create them on the fly and make compound, complicated queries over the data.
Sanity Check:
Trying to force NDB into what I want seems like a bad idea, unless there's something I'm overlooking (possible) that would make this more doable. My solutions would work, but wouldn't scale well (though I'm not sure how far. Would they work for 10k objects? 100k? 1M?).
Options I've investigated:
Kinvey, which charges by user and by data stored (since they just changed their pricing model), and ends up costing quite a bit.
Stackmob is also nice, but cloud code is crazy expensive ($200/month), and hosting and such all just costs more. Tasks cost more. The price looks very high.
Question:
I've done a fair bit of investigating, but there are just so many options. Assuming that my sanity check is correct (if it's not, and doing in-memory operations is sort-of-scalable, then fantastic!), are there other options out there that are inexpensive (BaaS providers get quite expensive once the applications scale), fast, and easily scalable that would solve my problem? Being able to run custom code in the cloud easily (and cheaply) and have API calls and bandwidth cost next to nothing is one of the reasons I've been investigating GAE (full hosting provider, any code I want in the cloud, etc).

Amazon Cloudsearch (or Solr, ElasticSearch) best practice for result contents?

I have read that it is best practice to only return an ID when querying for results, and then populate metadata from the database. Is this true? I am worried about performance.
In my opinion, it is almost always best to store and return the fewest fields possible — preferably just the ID, unless you explicitly need a feature such as highlighting.
Storing a lot of data in your index can have a negative impact on your search performance as your index grows. There is no data that loads faster than no data. Plus, looking up objects by their IDs should be a very cheap operation in your primary data store of choice.
Most importantly, if your application is using an ORM to interact with its data store, then the sheer utility of reusing all your domain modeling consistently throughout your application would be hard to overstate.
Returning values straight from your search engine can be useful. But, short of using the search engine as a primary data store, I would need a very compelling reason to fragment my domain logic by foregoing an ORM.
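A small sketch of the ID-only pattern, assuming pysolr plus a hypothetical SQLAlchemy-style Product model and session (the core URL, query, and module path are placeholders): the search engine returns only ids, and the full objects are hydrated from the primary data store through the ORM.

```python
import pysolr
# Product and session are stand-ins for whatever ORM you already use
# (e.g. a SQLAlchemy model and session).
from myapp.models import Product, session

solr = pysolr.Solr('http://localhost:8983/solr/products')

# Ask the search engine for ids only.
hits = solr.search('wool socks', fl='id', rows=20)
ids = [hit['id'] for hit in hits]

# Hydrate full domain objects from the primary data store and keep
# the relevance order that Solr returned.
products = session.query(Product).filter(Product.id.in_(ids)).all()
products.sort(key=lambda p: ids.index(str(p.id)))
```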
IMO, being able to retrieve the search results and the data within a single call would be a huge boost to performance compared with getting just the IDs and then making a DB call to retrieve the metadata for them.
Also, Solr/ES provide built-in caching, so responses to subsequent queries will be faster; for the DB you would have to add a caching solution or some other option yourself.
This all depends on your specific scenario.
In some cases, what you say might be true. For instance, Etsy does exactly that (or at least used to): their rationale was that they had a very capable MySQL cluster which they knew very well how to manage and which was very fast, so Solr returning only the id was enough for them.
But, you might be in a totally different scenario, and maybe calling the db will take longer than storing everything needed in Solr and hitting just Solr.
In my experience Solr performs badly at retrieving results when you either have highlighting on, or the fields you retrieve are very large and the network serialization/deserialization overhead increases. If that is the case, you might be better off asynchronously retrieving those fields from the DB.

What are the best ways to mitigate database I/O bottlenecks for large web sites?

For large (traffic-wise) web sites that have a lot of incoming reads and updates that end up as database I/O, what are the best ways to mitigate the performance impact? One solution I can think of: for writes, cache and then do a delayed write (using a separate job); for reads, use something like memcached. Are there any better solutions?
Here are the most common solutions to database performance:
Caching (Memcache, etc)
Add memory to your database
More database servers (master/slave or sharding)
Use a different database type (NoSQL, Redis, etc)
Indexes to speed up read perf. (careful, too many will affect write performance)
SSDs (fast SSDs will help a lot)
RAID
Optimize/tune SQL queries
Don't forget to optimize your queries. Most of the time it is not the disk I/O but poorly written queries that turn out to be the bottleneck.
You can also cache query results and also entire web pages if the content isn't going to change too often.
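A minimal caching sketch with the python-memcached client (the DB call and cache key are stand-ins): cache the result of an expensive query for a few minutes so repeated reads never hit the database.

```python
import json
import memcache  # python-memcached client

mc = memcache.Client(['127.0.0.1:11211'])

def top_articles(db):
    """Serve an expensive query from cache when possible."""
    cached = mc.get('top_articles')
    if cached is not None:
        return json.loads(cached)
    rows = db.fetch_top_articles()  # hypothetical expensive query
    mc.set('top_articles', json.dumps(rows), time=300)  # cache for 5 minutes
    return rows
```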
It very much depends on the usage pattern and data type. There are really different things to do depending on whether transactions are going to be supported, whether you are interested in full consistency or "eventual consistency", how big the data is (will it all fit in memory?), how complex the data and queries are, and so on. There are lots of variables, and only after listing all the constraints/requirements will you be able to make a proper decision. Two general pieces of advice, though:
Use SSDs
Use a distributed architecture with a distributed "NoSQL" (key/value) approach (only if you do not need complex relations and transactions)
10 years ago, the standard answer - besides optimizing your particular database - was to scale out using MySQL, in two ways.
Reads can be scaled out in two ways. The first is through caching, which introduces possible inconsistencies and creates a separate cache layer. Reads can also be scaled in MySQL by creating "read replicas", where any replica can serve a read query. Every write must be applied to all servers, so replication doesn't help write throughput.
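A toy read/write-splitting sketch (the connection names are placeholders): reads can go to any replica, but every write still goes to the primary, which is why replication helps read throughput and not write throughput.

```python
import random

# Placeholder connection handles: one primary for writes and a pool
# of replicas that can each serve any read.
PRIMARY = 'mysql-primary'
REPLICAS = ['mysql-replica-1', 'mysql-replica-2', 'mysql-replica-3']

def connection_for(sql):
    """Naive read/write split: reads go to a random replica, but every
    write still has to reach the primary (and be replicated)."""
    if sql.lstrip().upper().startswith('SELECT'):
        return random.choice(REPLICAS)
    return PRIMARY
```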
Writes are scaled through sharding. For example, imagine all users with the last name 'a' are assigned to a certain server. Now imagine a more complicated shard algorithm, where a particular row's primary ID is hashed using a hash function, and distributed to one of a pool of servers.
Facebook is one of the most advanced proponents of a sharded MySQL architecture. You can have individual tables "joined", but you have to write custom code, because you might have to hop from server to server - imagine you want to get your friends' timeline posts: you can't simply join them, you have to write application code.
Once you shard your database you can't do joins, and range lookups become difficult. The subset of operations you're left with (simple key-based reads and writes) is sometimes called CRUD, and for that MySQL is overkill. Many Chinese social networks realized this and use sharded Redis (which is much quicker than MySQL), having written their own shard layer and application logic layers.
Imagine the next problem in sharding - you want to add a new server, and start assigning some users to that new server.
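To see why that's painful, a tiny sketch: with naive modulo placement, going from 4 servers to 5 forces roughly 80% of users onto a different shard, which is why schemes designed to limit data movement (consistent hashing, a directory service) exist.

```python
# With naive modulo placement, adding a fifth server remaps most
# users, which is the rebalancing headache described above.
users = range(10000)

def shard(user_id, n_servers):
    return user_id % n_servers

moved = sum(1 for u in users if shard(u, 4) != shard(u, 5))
print(moved / float(len(users)))  # 0.8: about 80% of users would move
```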
Another approach is to use a distributed database, which generally comes under the names NoSQL or NewSQL, and these take a variety of approaches. Some, like MongoDB, have a sharding system to manage this mapping but require manual steps to add servers. Cassandra has a more flexible clustering scheme, a Chord-like ring architecture. Systems like Couchbase and Aerospike use a random distribution mechanism that removes the need for a shard layer. Some of these databases can exceed 100,000 to 200,000 requests per second per server, with the horizontal scalability of simply adding new servers - enough for very large operations. With this style of clustering, you can often get a higher level of redundancy and reliability.
Other distributed approaches represent data in a more efficient way, like a graph database. If you have a problem that is better represented as a graph, then a clustered graph database may be more appropriate.

Resources