I am considering ways of organizing data for my application.
One data model I am considering would entail having entities where each entity could contain up to roughly 100 repeated StructuredProperties. The StructuredProperties would be mostly read and updated only very infrequently. My question is - if I update any of those StructuredProperties, will the entire entity get deleted from Memcache and will the entire entity be reread from the ndb? Or is it just the single StructuredProperty that will get reread? Is this any different with LocalStructuredProperty?
More generally, how are StructuredProperties organized internally? In situations where I could use multiple Float or Int properties - and I am using a StructuredProperty instead just to make my model more readable - is this a bad idea? If I am reading an entity with 100 StructuredProperties, will I have to make 100 RPC calls, or are the properties retrieved in bulk as part of the original entity?
StructuredProperty values belong to the entity that contains them - so your assumption is correct:
updating a single StructuredProperty will invalidate the memcache entry for the whole entity.
LocalStructuredProperty has the same behavior - the difference is that a
LocalStructuredProperty is serialized into an opaque binary blob, so the datastore
has no idea about the structure of a LocalStructuredProperty. (There is probably a deserialization
cost attached to these properties - but that depends a lot on the amount
of data they contain, I imagine.)
By contrast, StructuredProperty actually makes its child properties available for
query indexing in most cases - allowing you to perform more complicated lookups.
Keep in mind that you should be calling put() on the containing entity, not on each
StructuredProperty or LocalStructuredProperty - so you should see a single RPC
call when updating that parent entity, regardless of how many repeated properties it contains.
I would advise using a StructuredProperty that contains an ndb.IntegerProperty(repeated=True), rather
than maintaining 'parallel lists' of integers and floats - that adds more complexity to your Python
model, and is exactly the kind of pattern ndb.StructuredProperty strives to replace.
Related
Every Cloud Datastore query computes its results using one or more indexes, which contain entity keys in a sequence specified by the index's properties and, optionally, the entity's ancestors. The indexes are updated incrementally to reflect any changes the application makes to its entities, so that the correct results of all queries are available with no further computation needed.
Generally, I would like to know if
datastore.get(List<Key> listOfKeys);
is faster or slower than a query with the index file prepared (with the same results).
Query q = new Query("Kind").setFilter(someFilter);
My current problem:
My data consists of Layers and Points. Points belong to only one unique layer and have unique ids within a layer. I could load the points in several ways:
1) Have points with a "layer name" property and query with a filter.
- Here I am not sure whether the datastore would have the results prepared, because the layer name changes dynamically.
2) Use only keys. The layer would have to store point ids.
KeyFactory.createKey("Layer", "layer name");
KeyFactory.createKey("Point", "layer name"+"x"+"point id");
3) Use queries without filters: I don't actually need the general kind "Point" and could be more specific: the kind would be ("layer name"+"point id")
- What are the costs of creating more kinds? Could this be the fastest way?
Can you actually find out how the datastore works in detail?
faster or slower than a query with the index file prepared (with the same results).
Fundamentally a query and a get by key are not guaranteed to have the same results.
Queries are eventually consistent, while getting data by key is strongly consistent.
Your first challenge, before optimizing for speed, is probably ensuring that you're showing the correct data.
The docs are good at explaining eventual vs strong consistency. It sounds like you have the option of using an ancestor query, which can be strongly consistent. I would also strongly recommend against using the layer name - which is dynamic - as the entity's key name; that will cause you an excessive amount of grief.
Edit:
In the interests of being specifically helpful, one option for a working solution based on your description would be (a short sketch follows the list):
Give a unique id (a uuid probably) to each layer, store the name as a property
Include the layer key as the parent key for each point entity
Use an ancestor query when fetching points for a layer (which is strongly consistent)
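A minimal sketch of that layout, written in Python ndb for brevity (the question uses the Java API, where the same structure is built with KeyFactory and an ancestor-filtered Query; the Layer and Point names are illustrative):

import uuid
from google.appengine.ext import ndb

class Layer(ndb.Model):
    name = ndb.StringProperty()   # display name, free to change at any time

class Point(ndb.Model):
    x = ndb.FloatProperty()
    y = ndb.FloatProperty()

# Key the layer on a stable uuid rather than on its mutable name.
layer_key = ndb.Key(Layer, uuid.uuid4().hex)
Layer(key=layer_key, name='layer name').put()

# Each point is created with the layer key as its parent (ancestor).
Point(parent=layer_key, x=1.0, y=2.0).put()

# Ancestor queries are strongly consistent.
points = Point.query(ancestor=layer_key).fetch()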
An alternative option is to store points as embedded entities and only have one entity for the whole layer - depends on what you're trying to achieve.
I have the id of an entity from which I only need a single field. Is there a way to get that projection or must I fetch the whole entity? Here is the code that I thought should do it.
bookKey = OfyService.ofy().load().type(Page.class).id(pageId).project("bookKey").now();
The datastore is a key-value store which loads objects whole, not field-by-field. This is quite different from how you work with a relational database.
There is an exception to this which allows you to load data directly out of an index (projection queries), however it is a performance optimization with very limited and specific use. In general, if you don't understand the fairly exotic detail of how projections work, you should not be using them - it's a premature optimization.
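For reference only, this is roughly what a projection query looks like; the sketch below uses Python ndb rather than Objectify, and the Page model and book_key field are assumptions taken from the question. It reads the projected property straight out of an index, and it is a query, not a get-by-id:

from google.appengine.ext import ndb

class Page(ndb.Model):
    book_key = ndb.KeyProperty()      # assumed field; must be indexed to be projectable
    title = ndb.StringProperty()

# Returns Page objects with only book_key populated, served from the index.
pages = Page.query().fetch(projection=[Page.book_key])
for page in pages:
    print(page.book_key)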
TL;DR
I have an architecture issue which boils down to filtering entities by a predefined set of common filters. The input is a set of products. Each product has details. I need to design a filtering engine so that I can (easily and quickly) resolve the task:
"Filter out collection of products with specified details"
Requirements
The user may specify any filtering that is possible, with support for precedence and nested filters. A bare example is (weight=X AND (color='red' OR color='green')) OR price<1000. The requests should go via HTTP / REST, but that is insignificant (it only adds the issue of translating filters from a URI into some internal model). Any comparison operators should be supported (equality, inequality, less than, etc.).
Specifics
Model
There is no fixed model definition - in fact I am free to choose one. To make it simpler I am using simple key=>value pairs for details. So at the very minimum it comes down to:
class Value extends Entity implements Arrayable
{
protected $key;
protected $value;
//getters/setters for key/value here
}
for simple value for product detail and something like
class Product extends Entity implements Arrayable
{
protected $id;
/**
* @var Value[]
*/
protected $details;
//getters/setters, more properties that are omitted
}
for the product. Now, regarding the data model, there is a first question: how to design the filtering model? My simple idea is to implement it as, let's say, a recursive iterator over a regular tree structure built from the incoming user request. The difficulties which I certainly need to solve here are:
Quickly build the model structure out from user request
Possibility for easy modification of the structure
Easy translation of the chosen filter data model to the chosen storage (see below)
The last point in the list above is probably the most important, as the storage routines will be the most time-consuming part, so the filter data model has to fit that storage structure. That means the storage always has higher priority: if the data model cannot fit a storage design that resolves the issue, the data model should be changed.
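A minimal sketch of such a filter model as a recursive expression tree - written in Python for brevity; the class names and the to_sql() translation step are purely illustrative and map one-to-one onto PHP classes:

class Condition(object):
    """Leaf node: a single comparison against one detail key."""
    def __init__(self, key, op, value):
        self.key, self.op, self.value = key, op, value

class Group(object):
    """Inner node: combines children with AND / OR, nestable to any depth."""
    def __init__(self, logic, children):
        self.logic = logic            # 'AND' or 'OR'
        self.children = children      # Conditions and/or nested Groups

# (weight=X AND (color='red' OR color='green')) OR price<1000
filter_tree = Group('OR', [
    Group('AND', [
        Condition('weight', '=', 'X'),
        Group('OR', [
            Condition('color', '=', 'red'),
            Condition('color', '=', 'green'),
        ]),
    ]),
    Condition('price', '<', 1000),
])

def to_sql(node):
    # Walk the tree and emit a WHERE fragment against a JSON 'details' column.
    # Values are inlined here only for readability; a real version must use
    # bound parameters to avoid SQL injection.
    if isinstance(node, Condition):
        return "details->>%r %s %r" % (node.key, node.op, node.value)
    return "(%s)" % (" %s " % node.logic).join(to_sql(c) for c in node.children)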
Storage
As storage I want to use a NoSQL + RDBMS hybrid, for example Postgres 9.4, which allows storing the details as JSON. I do not want to use EAV in any case, which is why a pure relational DBMS isn't an option (see here why). There is one important thing - products may contain stocks, which leaves me with basically two ways:
If I design products as a single entity together with their stocks (pretty logical), then I cannot go with the "storage" + "indexer" approach, because that produces stale state: the indexer (such as SOLR) needs time to update and reindex the data.
Design with separate entities. That means separating whatever can be cached from whatever cannot. The cacheable part can then go to the indexer (and the details can probably go there too, so we can filter by them), while the non-cacheable part goes somewhere else.
And the question for storage part would be, of course: which one to chose?
The good thing about the first approach is that the internal API is simple and the internal structures are simple and scalable, because they can easily be abstracted from the storage layer. The bad thing is that I then need a "magic solution" which allows using "just storage" instead of "storage + indexer". "Magic" here means somehow designing indexes or additional data structures in the storage (I was thinking about hashing, but it doesn't help with range queries) that will resolve the filtering requests.
On the other hand, the second solution allows the search engine to resolve the filtering task by itself, but it introduces a window during which the indexed data is outdated. And of course the data layer then needs to know which part of the model goes to which storage (stocks to one storage, details to another, etc.).
Summary
What can be a proper data model to design filtering?
Which approach should be used to resolve the issue at the storage level: storage + indexer with a split product model, or storage only with a monolithic product model? Or maybe something else?
If I go with the storage-only approach - is it possible to design the storage so that products can easily be filtered by any set of details?
If I go with the indexer - which one fits this problem better? (There is a good comparison between Solr and Sphinx here, but it is from '09 and it's '15 now, so it is surely outdated.)
Any links, related blogposts or articles are very welcome.
As a P.S.: I did a search across SO but have found only barely-relevant suggestions/topics so far (for example this). I am not expecting a silver bullet here, as it always boils down to some trade-off, but the question looks fairly standard, so there should already be good insights out there. Please guide me - I tried to "ask google" with some luck, but that was not enough yet.
P.P.S. Feel free to edit the tags or redirect the question to a more appropriate SE site if SO is not a good fit for this kind of question. And I am not asking for a language-specific solution, so if you are not using PHP it does not matter - the design has nothing to do with the language.
My preferred solution would be to split the entities - your second approach. The stable data would be held in Cassandra (or Solr or Elastic etc), while the volatile stock data would be held in (ideally) an in-memory database like Redis or Memcache that supports compare-and-swap / transactions (or Dynamo or Voldemort etc if the stock data won't fit in memory). You won't need to worry too much about the consistency of the stable data since presumably it changes rarely if ever, so you can choose a scalable but not entirely consistent database like Cassandra; meanwhile you can choose a less scalable but more consistent database for the volatile stock data.
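For the volatile stock side, a sketch of the compare-and-swap idea using redis-py optimistic locking (the stock:<id> key layout and the reserve_stock() helper are assumptions, not part of any existing API):

import redis

r = redis.Redis()

def reserve_stock(product_id, quantity=1):
    """Atomically decrement stock, retrying if a concurrent writer races us."""
    key = 'stock:%s' % product_id            # hypothetical key layout
    with r.pipeline() as pipe:
        while True:
            try:
                pipe.watch(key)              # optimistic lock on the stock key
                available = int(pipe.get(key) or 0)
                if available < quantity:
                    pipe.unwatch()
                    return False             # not enough stock
                pipe.multi()
                pipe.decrby(key, quantity)
                pipe.execute()               # raises WatchError if the key changed
                return True
            except redis.WatchError:
                continue                     # another writer won; retry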
I'm searching for the best practice for storing a large number of Comment entities which have a one-to-many relationship to another entity.
I have read a lot about the limitations of the datastore and don't know how to solve this.
I can't store them as structured properties due to the 1MB Entity Limitation.
Also, Guido van Rossum answered a question about repeated properties with "if you have more than 100-1000 values", do not use repeated properties.
So repeated properties are no solution for my comments either.
Final question: what is the best practice to solve this problem? Are ancestors an option?
Edit: In this question about ancestor or reference properties Nick Johnson mentioned that "Every entity with the same parent will be in the same entity group, and writes to entity groups are serialized, so using ancestors here will slow things down if you're writing multiple entities concurrently. Since all the entities in a group are 'owned' by the user that forms the root of the group in your instance, though, this shouldn't be a problem - and in fact, what you're doing is actually a recommended design pattern."
What exactly does "writing multiple entities concurrently" mean? When different users comment on that entity at the same time?
It depends on how much you read / write, which is what you get billed for.
You can store references to more than 1000 entities (up to a limit that depends on the key size and how you reference them) as JSON in compressed, unindexed properties. But take care with referencing and dereferencing that many keys: the overhead and the amount of data you transfer on each request will be large. You don't want to be doing operations on 1,000,000 compressed entity keys on the server just to serve a simple request. If you go this way, optimize the approach by doing as much of the work on the client, as smartly as you can.
Or go for ancestors, and/or relax your consistency requirements where you can (e.g. it doesn't matter if a comment is not shown immediately), and page through the results with iterators or query cursors.
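A sketch of the ancestor approach in ndb with cursor-based paging (the Comment and Post models and the fetch_comments() helper are illustrative assumptions):

from google.appengine.ext import ndb
from google.appengine.datastore.datastore_query import Cursor

class Comment(ndb.Model):
    author = ndb.StringProperty()
    text = ndb.TextProperty()
    created = ndb.DateTimeProperty(auto_now_add=True)

# Each comment is its own entity, parented on the thing being commented on,
# so there is no 1MB entity limit or repeated-property limit on the count.
post_key = ndb.Key('Post', 12345)          # hypothetical parent entity
Comment(parent=post_key, author='alice', text='hi').put()

def fetch_comments(post_key, websafe_cursor=None, page_size=20):
    """Return one page of comments plus a cursor for the next page."""
    cursor = Cursor(urlsafe=websafe_cursor) if websafe_cursor else None
    query = Comment.query(ancestor=post_key).order(-Comment.created)
    comments, next_cursor, more = query.fetch_page(page_size, start_cursor=cursor)
    return comments, next_cursor.urlsafe() if (more and next_cursor) else None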
The objective is to reduce the CPU cost and response time for a piece of code that runs very often and must db.get() several hundred keys each time.
Does this even work? Can I expect the API time of a db.get() with several hundred keys to reduce roughly linearly as I reduce the size of the entity?
Currently the entity has the following data attached: 9 String, 9 Boolean, 8 Integer, 1 GeoPt, 2 DateTime, 1 Text (avg size ~100 bytes FWIW), 1 Reference, 1 StringList (avg size 500 bytes). The goal is to move the vast majority of this data to related classes so that the core fetch of the main model will be quick.
If it does work, how is it implemented?
After a refactor, will I still incur the same high cost fetching existing entities? The documentation says that all properties of a model are fetched simultaneously. Will the old unneeded properties still transfer over RPC on my dime and while users wait? In other words: if I want to reduce the load time of my entities, is it necessary to migrate the old entities to ones with the new definition? If so, is it sufficient to re-put() the entity, or must I save under a wholly new key?
Example
Consider:
class Thing(db.Model):
    text = db.TextProperty()
    strings = db.StringListProperty()
    num = db.IntegerProperty()

thing = Thing(key_name='thing1', text='x' * 10240,
              strings=['y' * 500 for i in range(10)], num=23)
thing.put()
Let's say I re-define Thing to be streamlined and push up a new version:
class Thing(db.Model):
    num = db.IntegerProperty()
And I fetch it again:
thing_again = Thing.get_by_key_name('thing1')
Have I reduced the fetch time for this entity?
To answer your questions in order:
Yes, splitting up your model will reduce the fetch time, though probably not linearly. For a relatively small model like yours, the differences may not be huge. Large list properties are the leading cause of increased fetch time.
Old properties will still be transferred when you fetch an entity after the change to the model, because the datastore has no knowledge of models.
However, the deleted properties will also still be stored even once you call .put(). Currently, there are two ways to eliminate the old properties: replace all the existing entities with new ones, or use the lower-level api.datastore interface, which is dict-like and makes it easy to delete keys.
To remove properties from an entity, you can change your Model to an Expando, and then use delattr. It's documented in the App Engine docs here:
http://code.google.com/intl/fr/appengine/articles/update_schema.html
Under the heading "Removing Deleted Properties from the Datastore"
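A short sketch of that documented pattern (switching the model to db.Expando so old properties appear as dynamic attributes you can delattr; the property names match the Thing example above):

from google.appengine.ext import db

class Thing(db.Expando):          # temporarily an Expando, same kind name
    num = db.IntegerProperty()

thing = Thing.get_by_key_name('thing1')
if thing is not None:
    # Old values show up as dynamic attributes; delete the ones we no longer want.
    for prop in ('text', 'strings'):
        if hasattr(thing, prop):
            delattr(thing, prop)
    thing.put()                   # re-save without the removed properties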
if I want to reduce the size of my entities, is it necessary to migrate the old entities to ones with the new definition?
Yes. The GAE datastore is just a big key-value store that doesn't know anything about your model definitions. So the old values will be the old values until you put new values in!