Background
I'm creating an application that allows users to define their own datasets with custom properties (and property types).
A user could, while interacting with the application, define a dataset that has the following columns:
Name: String
Location: Geo
Weight: Float
Notes: Text (not indexed)
How many: Int
etc.
While there will be restrictions on the total number of properties (say 10-20 or something), there are no restrictions on the property types.
Google's NDB datastore supports this and will auto-generate simple indexes for queries involving combinations of equality filters with no sorts, or only sorts.
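For example, here's roughly what I mean, using a fixed NDB model purely for illustration (in my app the properties are user-defined, so I'd really be using something like ndb.Expando; the model and property names below are made up):

```python
from google.appengine.ext import ndb

class Record(ndb.Model):
    name = ndb.StringProperty()
    weight = ndb.FloatProperty()
    how_many = ndb.IntegerProperty()

# Served by the auto-generated built-in indexes:
equal_only = Record.query(Record.name == 'foo', Record.how_many == 3).fetch()
sort_only = Record.query().order(Record.weight).fetch()

# NOT served by built-in indexes -- these need composite indexes in index.yaml,
# which I can't pre-declare because the properties are user-defined:
# Record.query(Record.name == 'foo').order(Record.weight)   # equality + sort
# Record.query().order(Record.weight, Record.how_many)      # multiple sorts
```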
Ideally, I'd also like to support:
Multiple sorts
Equality and sorts
Combinations of inequalities
I'm trying to determine if I should use NDB at all, or switch to something else (SQL seems extremely expensive, comparatively, which is one of the reasons I'm hesitant).
For the multiple sorts, I could write server-side code that queries on the first sort property, then sorts the results in memory by the second, third, etc. I could also fetch the data and do the sorting on the client side.
For the combinations of inequalities, I could do the same (more or less).
These solutions are obviously not performant, and won't scale if there are a large number of items that match the first query.
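Continuing the Record sketch above, the workarounds would look something like this (a rough sketch only):

```python
# Multiple sorts: let the datastore order by the first property (built-in
# single-property index), then re-sort in memory so the other properties
# act as tie-breakers.
results = Record.query().order(Record.weight).fetch(1000)
results.sort(key=lambda r: (r.weight, r.how_many, r.name))

# Combinations of inequalities: the datastore allows an inequality filter on
# only one property per query, so apply the second inequality in Python.
heavy = Record.query(Record.weight > 10.0).fetch(1000)
heavy = [r for r in heavy if r.how_many < 5]
```

Both of these pull a potentially large result set into memory just to throw most of it away, which is exactly the scaling problem.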
BaaS providers like Kinvey (which runs on GAE unless I'm quite mistaken) use schemaless databases and allow you to both create them on the fly and make compound, complicated queries over the data.
Sanity Check:
Trying to force NDB into what I want seems like a bad idea, unless there's something I'm overlooking (possible) that would make this more doable. My workarounds would work, but I'm not sure how far they'd scale. Would they work for 10k objects? 100k? 1M?
Options I've investigated:
Kinvey charges by user and by data stored (they just changed their pricing model) and ends up costing quite a bit.
Stackmob is also nice, but custom cloud code is very expensive ($200/month), and hosting, tasks, and everything else cost more as well. Overall, the price looks very high.
Question:
I've done a fair bit of investigating, but there are just so many options. Assuming my sanity check is correct (and if it's not, and in-memory operations are sort-of scalable, then fantastic!), are there other options out there that are inexpensive (BaaS providers get quite expensive once an application scales), fast, and easily scalable, and that would solve my problem? Being able to run custom code in the cloud easily and cheaply, with API calls and bandwidth costing next to nothing, is one of the reasons I've been investigating GAE (full hosting provider, any code I want in the cloud, etc.).
Related
I know this is a topic that's been addressed ad nauseam, but I also know there are people who enjoy opining about databases, so I figured I'd just go ahead and ask the question again.
I'm building out a web application that, on a very basic level, displays a list of objects that meet user-defined search criteria. The primary function of the application will be to provide an interface by which a user can perform realtime faceted searches on a large number of object properties, including ranges of data, location data, and probably related data.
Of course there will be ancillary information too: user accounts, lookup tables etc.
My background is entirely in relational database development, primarily SQL Server with a little bit of MySQL. However, I'm intrigued by the possible applicability of an object-relational approach or even a full-on document database. Without the experience of working in those paradigms I'm not sure what I might be getting myself into.
Here are some further considerations that may affect the decision:
The schema will likely evolve considerably over time as more properties and search options are added, creating the typical versioning/deployment challenges. This is the primary reason why I would consider a document database.
The application itself will likely be written in Node/Express with an Angular or React front-end using TypeScript, so the code will be interacting with data in JSON format. In other words, regardless of what comes back from the db server, we want JSON at the code level. (Another case for a doc database.)
There is the potential for a large number of search parameters and a large amount of data, so indexing will be key and performance will be a huge potential gotcha. This would seem to me to be a strong case against a document db.
A potential use case would involve a user adjusting a slider control (let's say it controls high and low price parameters or a distance range). The selected parameters would then be packaged as a json object and sent to a search controller, which would then pass these parameters to the db server on change and expect a list of objects in return. In other words, the user would generally not be pushing a button to refine search criteria. The search update would happen each time they change a parameter.
I don't know the extent to which this is a thing or not, but it would also be great if there were some way to leverage technology that could cache search results and then search within those results if the search were narrowed, thus performing the second search only on the smaller subset returned by the first search rather than on the entire universe of available objects.
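Roughly the idea I have in mind, where run_query is just a stand-in for whatever actually hits the database (not real code from my app):

```python
# The first search hits the database; its results are kept around.
first = run_query({'price_min': 100, 'price_max': 500})  # run_query is hypothetical

# The user then drags the slider in to 200-400 -- a strict narrowing of the
# cached search -- so we filter within the cached results instead of issuing
# a second database query.
narrowed = [item for item in first if 200 <= item['price'] <= 400]
```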
I guess while I'm at it I should ask about ORMs. That's also something I'm generally not very experienced with (I've used Entity Framework a bit), but I'm wondering if I should expand my horizons.
Thanks and I look forward to your opinions!
I think that your requirement that "there is the potential for a large number of search parameters and a large amount of data, so indexing will be key and performance will be a huge potential gotcha" makes a strong case for using a relational database to store the data.
Leveraging an ORM that can support data in JSON format would seem to be ideal in your use case. Schema evolution for a system in production would certainly be a challenge (though not insurmountable), but it would be nice to use an ORM product that can at least easily support schema evolution during the development stage, when things are more likely to change and evolve rapidly.
Given the kind of queries you would typically be issuing (e.g., adjusting a slider control), an ORM that supports prepared statements that can have range criteria would be more efficient.
Also, given your need to "perform realtime faceted searches on a large number of object properties, including ranges of data, location data, and probably related data", an ORM product that can easily support one-to-one, one-to-many, and many-to-many relationships and path-expressions in search criteria should simplify your development process.
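To make that concrete, here's a rough sketch of the kind of model and query I have in mind. Your stack is Node/TypeScript, but the shape is the same in any ORM that supports range criteria and relationships; this uses Python/SQLAlchemy purely for illustration, and all model and field names are made up:

```python
from sqlalchemy import Column, Float, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Category(Base):
    __tablename__ = 'categories'
    id = Column(Integer, primary_key=True)
    name = Column(String)

class Listing(Base):
    __tablename__ = 'listings'
    id = Column(Integer, primary_key=True)
    price = Column(Float, index=True)        # indexed facet/range columns
    latitude = Column(Float, index=True)
    longitude = Column(Float, index=True)
    category_id = Column(Integer, ForeignKey('categories.id'))
    category = relationship(Category)        # many-to-one, queried via joins

def faceted_search(session, params):
    """params is the JSON object sent by the slider/facet controls."""
    q = session.query(Listing)
    if 'price_min' in params:
        q = q.filter(Listing.price >= params['price_min'])
    if 'price_max' in params:
        q = q.filter(Listing.price <= params['price_max'])
    if 'category' in params:
        q = q.join(Listing.category).filter(Category.name == params['category'])
    return q.limit(100).all()
```

Each branch adds a range or equality criterion only when the corresponding control is set, and the database's indexes do the heavy lifting.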
A simple doubt: does this mean it can handle a billion+ entities (rows in the MySQL sense) in a kind (a table in the MySQL sense) without any sharding, and without compromising performance?
Yes, you can handle billions of entities with no sharding.
The performance of the datastore queries is not dependent on the number of entities that you have. It depends on the number of entities that you want to retrieve. In other words, you will get 100 entities in the same time whether you only have 100 entities or 1 billion entities.
Yes, it can handle billions of entities in a kind without compromising performance. However, "without sharding" is questionable. By default, all your entities are available for Google to "shard" however they see fit to meet the demands of your app. When I say "shard" here, I mean spread your entities across machines or datacenters as they see fit. Sharding is not something you ever need to manage yourself.
You can, however, restrict sharding (in this sense) by putting multiple entities in the same entity group (i.e. by giving multiple entities the same parent). This is something you should avoid when possible, so that you do not restrict how Google can optimize your data with sharding. However, if you need to access many entities within a single transaction, you may need to make entity groups. More information on why and when you'd want to is available here.
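For concreteness, here's a minimal NDB sketch (Python; the kinds and property names are made up) of what putting entities in the same entity group looks like:

```python
from google.appengine.ext import ndb

class Account(ndb.Model):
    pass

class Purchase(ndb.Model):
    amount = ndb.IntegerProperty()

account_key = ndb.Key(Account, 'alice')

# Giving every Purchase the account as its parent places them all in one
# entity group, so they can be read and written together in a transaction...
@ndb.transactional
def record_purchase(amount):
    Purchase(parent=account_key, amount=amount).put()

# ...but it also ties those entities together, which limits how freely they
# can be spread across machines and caps the group's write throughput.
record_purchase(10)
```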
By the way, Google may also make multiple copies of your data in multiple locations around the world to increase read throughput, if that's what their algorithms determine is most optimal.
We have ~1 TB of user profiles and need to perform two types of operations on them:
random reads and writes (~20k profile updates per second)
queries on predefined dimensions (e.g. for reporting)
For example, if we encounter a user in a transaction, we want to update their profile with the URL they came from. At the end of the day we want to see all users who visited a particular URL. We don't need joins, aggregations, etc., only filtering by one or several fields.
We don't really care about latency, but need high throughput.
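Just to make the two access patterns concrete, here's roughly what they look like in MongoDB/pymongo terms (MongoDB is only one of the options discussed below, and the collection/field names are made up):

```python
from pymongo import ASCENDING, MongoClient

profiles = MongoClient()['tracking']['profiles']
profiles.create_index([('visited_urls', ASCENDING)])  # secondary index for reporting

user_id, url = 'u123', 'http://example.com/landing'   # placeholders

# 1) Random read/write path (~20k/sec): point update keyed by user id.
profiles.update_one({'_id': user_id},
                    {'$addToSet': {'visited_urls': url}},
                    upsert=True)

# 2) End-of-day reporting: filter by one or several fields.
visitors = profiles.find({'visited_urls': url})
```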
Most databases we looked at belong to one of two categories - key-value DBs with fast random access or batch DBs optimized for querying and analytics.
Key-value storages
Aerospike can store terabyte-scale data and is very well-optimized for fast key-based lookup. However, queries on secondary index are deadly slow, which makes it unsuitable for our purposes.
MongoDB is pretty flexible, but requires too much hardware to handle our load. In addition, we encountered particular issues with massive exports from it.
HBase looks attractive since we already have a Hadoop cluster. Yet, it's not really clear how to create a secondary index for it or what its performance will be.
Cassandra - may be an option, but we don't have experience with it (if you do, please share it)
Couchbase - may be an option, but we don't have experience with it (if you do, please share it)
Analytic storages
Relational DBMSs (e.g. Oracle, PostgreSQL) provide both random access and efficient queries, but we have doubts that they can handle terabyte-scale data.
HDFS / Hive / SparkSQL - excellent for batch processing, but they don't support indexing. The closest thing is partitioning, but it's not applicable given many-to-many relations (e.g. many users visit many URLs). Also, to our knowledge, none of the HDFS-backed tools except HBase supports updates, so you can only append new data and read the latest version, which is not very convenient.
Vertica has very efficient queries, but updates boil down to rewriting the whole file, so they are terribly slow.
(Because of our limited experience, some of the information above may be subjective or wrong; please feel free to comment on it.)
Do any of the mentioned databases have useful options that we missed?
Are there any other databases optimized for a use case like ours? If not, how would you address this task?
First world problems: We've got a production system that is growing rapidly, and we are aiming to grow our user base even more. At peak times our DB is flatlining at 100% CPU, which I take as an indication that it's pretty much stretched to the limit. Being an AWS instance, we could always throw some more hardware at it, but long term, it seems we will need to implement sharding.
I've Googled all over and found lots of explanations of what sharding is, why it is a good idea under certain circumstances, what design considerations, etc... but not a word on the practicality of how to do it.
What are the practical steps to shard a database? How do you redirect queries to the appropriate shard? And how do you run reports that require data from all shards?
The first thing you'll want to decide is whether or not you want to take on the complexity of routing queries in your application. If you decide to roll your own implementation, there are a number of complexities that you'll need to deal with over time.
You'll need a scheme to distribute data and queries evenly across the cluster. You'll also need to ensure that this scheme is forward-compatible with a larger cluster: if your data is already big enough to require a sharded architecture, it's likely that you'll need to add more servers later.
The problem with sharding schemes is that they force you to make tradeoffs that you wouldn't have to make with a single-server database. For example, if you are sharding by user_id, any query which spans multiple users will need to be sent to all servers (or a subset of servers) and the results must be accumulated in your client application. This is especially complex if you are using aggregate queries that rely on the ordering of the data, such as MAX(), or any histogram computation.
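As a concrete (if very simplified) sketch of what that routing and scatter/gather can look like in application code - Python here, with connect() and its query() method standing in for whatever database driver you actually use:

```python
import hashlib

# connect()/query() are hypothetical -- replace with your real driver or pool.
SHARDS = [connect('db-shard-0'), connect('db-shard-1'), connect('db-shard-2')]

def shard_for(user_id):
    """Hash-based routing: the same user_id always maps to the same shard.
    A production scheme should use consistent hashing or a lookup table so
    that adding servers later doesn't reshuffle every key."""
    h = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

# Single-user query: routed to exactly one shard.
def orders_for_user(user_id):
    return shard_for(user_id).query(
        "SELECT * FROM orders WHERE user_id = %s", (user_id,))

# Cross-user aggregate: scatter to every shard, gather and combine in the app.
def max_order_total():
    partials = [s.query("SELECT MAX(total) AS m FROM orders")[0]['m']
                for s in SHARDS]
    return max(p for p in partials if p is not None)
```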
All of this complexity isn't meant to scare you, but it's something you'll need to pay attention to. There are tools out there that can help you (disclosure: my company makes a tool called dbShards) but you can definitely put together your own solution, especially if your application is mature and the query patterns are quite predictable.
A couple quick questions related to GAE search and datastore:
(1) Why is it that I can inequality filter on more than one property using the search service, but that I can only inequality filter on at most one property when querying the datastore? It seems odd that this limitation would exist in one service but not the other.
(2) I intend to use google app engine search to query very many objects (thousands or hundreds of thousands, maybe more). I plan to be doing many inequalities, for example: "time created" before x, "price" greater than y, "rating" less than z, "latitude" between a and b, "longitude" between c and d etc. This seems like a lot of filters and potentially expensive. Is App Engine Search an appropriate solution for this?
Thanks so much.
1) The SearchService basically gives you an API to perform the sorts of things you can't do with the datastore. If you could do them on the datastore, you wouldn't really need the SearchService. That's not a very satisfactory answer, but many of the common operations you might do with a traditional RDBMS were not really even possible before the Search API was available.
2) is a bit harder. Currently the Search API doesn't handle failure conditions very well; usually you'll get a SearchServiceException without a meaningful message. The team seems to have been improving this over the last year or so, although fixes in this space have been coming very slowly.
From the tickets I've raised, failures are usually a result of queries running too long, which usually means queries that are too complex. You can actually tune queries quite a lot through combinations of the query string and the parameters you apply to your search request. The downside is that it's all a black box; I haven't seen any guides or tools for optimising queries. When they fail, they just fail.
The App Engine Search API is designed to solve the problems you describe; whether it does in your case may be hard to determine. You could set up some sample queries and deploy to a test environment to see if it even basically works for your typical set of data. I would expect that it will work fine for the example you gave. I have successfully been running similar searches in large-scale production environments.
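For example, a query of the shape you describe looks roughly like this with the Python Search API (the index name and fields are made up, and "created" is assumed to be stored as a NumberField timestamp):

```python
from google.appengine.api import search

index = search.Index(name='listings')

# Multiple inequality filters over number fields in a single query string.
query_string = ('created < 1500000000 AND price > 25 AND rating < 3 AND '
                'latitude >= 40.5 AND latitude <= 40.9 AND '
                'longitude >= -74.3 AND longitude <= -73.7')

results = index.search(search.Query(
    query_string=query_string,
    options=search.QueryOptions(limit=100)))

for doc in results:
    print(doc.doc_id)
```

If something like this starts failing or timing out at your data volumes, that's the signal to simplify the query string or restructure the documents.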