When would using a ComputedProperty (ndb) in google app engine give you a distinct advantage over just computing when needed, on the backend (such as in a handler), without the datastore being involved?
Everything I'm reading seems to indicate that it's mostly useless and would just slow things down (the put operation, if nothing else).
Thoughts?
I did see this:
"Note: Use ComputedProperty if the application queries for the computed value. If you just want to use the derived version in Python code, define a regular method or use Python's #property built-in."
but that doesn't really explain any advantage (why query if you can derive?)
The documentation is quite clear in that regard; I'll cite the Computed Properties section again for reference:
Note: Use ComputedProperty if the application queries for the computed value. If you just want to use the derived version in Python code, define a regular method or use Python's @property built-in.
When to use it? When you need to query on derived data: the value has to be written to the datastore so that it gets indexed.
First example that comes to mind: you're already storing the birthday of a user but also need to filter by actual age; adding a property to derive that value might be the easiest and most efficient solution:
age = ndb.ComputedProperty(lambda self: calc_age(self.birthday))
Of course you could just have a function that returns the age, but that's only useful after you have fetched the entity; you can't use it in queries.
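To make the query angle concrete, here is a minimal sketch of such a model and query (the User model and the calc_age helper body are my own illustration, not from the docs):

from datetime import date
from google.appengine.ext import ndb

def calc_age(birthday):
    # hypothetical helper: whole years between birthday and today
    today = date.today()
    return (today.year - birthday.year
            - ((today.month, today.day) < (birthday.month, birthday.day)))

class User(ndb.Model):
    birthday = ndb.DateProperty()
    # recomputed on every put(), stored and indexed like any other property
    age = ndb.ComputedProperty(lambda self: calc_age(self.birthday))

# this is what ComputedProperty buys you: the filter runs against the index
adults = User.query(User.age >= 18).fetch()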
I'm trying to find a good way to combine a Flink keyed WindowedStream locally for a Flink application. The idea is similar to a combiner in MapReduce: to combine partial results in each partition (or mapper) before the data (which is still a keyed WindowedStream) is sent to a global aggregator (or reducer). The closest function I found is aggregate, but I wasn't able to find a good example of its usage on a WindowedStream.
It looks like aggregate doesn't allow a WindowedStream output. Is there any other way to solve this?
There have been some initiatives to provide pre-aggregation in Flink. You have to implement your own operator; in the stream environment, that means extending the class AbstractStreamOperator.
KurtYoung implemented a BundleOperator. You can also use the Table API on top of the stream API; the Table API already provides a local aggregation. I also have an example of a pre-aggregate operator that I implemented myself. Usually, the drawback of all those solutions is that you have to set the number of items to pre-aggregate, or a timeout. If you don't, you can run out of memory, or you never shuffle items (if the threshold number of items is never reached). In other words, they are rule-based. What I would like to have is something cost-based and more dynamic, something that adjusts those parameters at run time. A generic sketch of the rule-based pattern is below.
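This is not Flink API code, but to illustrate the rule-based pattern those operators share, here is a generic sketch in Python (all names are mine): partial results are combined per key and flushed downstream when either an item-count threshold or a timeout is reached.

import time
from collections import defaultdict

class PreAggregator(object):
    """Rule-based combiner: buffers partial sums per key, flushes on a
    count threshold or a timeout (the two parameters mentioned above)."""
    def __init__(self, emit, max_items=1000, timeout_s=1.0):
        self.emit = emit            # callable shipping (key, partial) downstream
        self.max_items = max_items
        self.timeout_s = timeout_s
        self.buffer = defaultdict(int)
        self.count = 0
        self.last_flush = time.time()

    def add(self, key, value):
        self.buffer[key] += value   # combine locally (here: a running sum)
        self.count += 1
        if (self.count >= self.max_items
                or time.time() - self.last_flush >= self.timeout_s):
            self.flush()

    def flush(self):
        for key, partial in self.buffer.items():
            self.emit(key, partial) # partial aggregate goes to the global reducer
        self.buffer.clear()
        self.count = 0
        self.last_flush = time.time()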
I hope these links can help you. And if you have ideas for the cost-based solution, please come talk to me =).
Is there a neat way to save a Date into a Datomic attribute of type db.type/instant? For instance, there are the d/tempid and d/squuid functions for producing a tempid and a squuid.
Datomic doesn't provide an API endpoint for generating dates, as opposed to the cases of tempid (something Datomic makes specific use of) and squuid (the generated value differs from a standard UUID and will leak time information, which precludes some secure uses but allows for better indexing performance).
In Clojure code you can use the #inst reader literal or (java.util.Date.). You can obviously use the java.util.Date constructor in Java code as well (or use a library that generates the same type).
TL;DR
I have an architecture issue which boils down to filtering entities by a predefined set of common filters. The input is a set of products; each product has details. I need to design a filtering engine so that I can (easily and quickly) resolve the task:
"Filter out a collection of products with specified details"
Requirements
The user may specify any filter, with support for precedence and nested filters. A bare example: (weight=X AND (color='red' OR color='green')) OR price<1000. The requests will go via HTTP / REST, but that's insignificant (it only adds the issue of translating filters from the URI into some internal model). All comparison operators should be supported (equality, inequality, less than, etc.).
Specifics
Model
There is no fixed model definition; in fact I am free to choose one. To make it simpler I am using simple key => value pairs for details. So at the very minimum it comes down to:
class Value extends Entity implements Arrayable
{
    protected $key;
    protected $value;

    //getters/setters for key/value here
}
for a simple product detail value, and something like
class Product extends Entity implements Arrayable
{
    protected $id;

    /**
     * @var Value[]
     */
    protected $details;

    //getters/setters, more properties that are omitted
}
for the product. Now, regarding the data model, there is a first question: how to design the filtering model? My simple idea is to implement it as, let's say, a recursive structure: a tree built from the incoming user request. The difficulties I certainly need to solve here are:
Quickly building the model structure from the user request
Easy modification of the structure
Easy translation of the chosen filter data model to the chosen storage (see below)
The last point in the list above is probably the most important, as storage routines will be the most time-consuming, so the filter data model should fit the storage structure. That means storage always has higher priority: if the data model cannot fit a storage design that resolves the issue, the data model should be changed. A sketch of the tree idea follows.
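For what it's worth, here is a minimal sketch of such a tree (in Python only because the question is language-agnostic; all class and method names are my own): comparison leaves combined by AND/OR nodes, which is straightforward to build recursively from a parsed request and to walk when translating to a storage query.

class Comparison(object):
    """Leaf: <detail key> <operator> <value>, e.g. color = 'red'."""
    OPS = {'=': lambda a, b: a == b, '!=': lambda a, b: a != b,
           '<': lambda a, b: a < b, '>': lambda a, b: a > b}

    def __init__(self, key, op, value):
        self.key, self.op, self.value = key, op, value

    def matches(self, details):
        return self.key in details and self.OPS[self.op](details[self.key], self.value)

class Node(object):
    """Inner node: combines child filters with AND / OR; nesting gives precedence."""
    def __init__(self, logic, children):
        self.logic, self.children = logic, children

    def matches(self, details):
        results = (child.matches(details) for child in self.children)
        return all(results) if self.logic == 'AND' else any(results)

# (weight=X AND (color='red' OR color='green')) OR price<1000, with X = 10
tree = Node('OR', [
    Node('AND', [Comparison('weight', '=', 10),
                 Node('OR', [Comparison('color', '=', 'red'),
                             Comparison('color', '=', 'green')])]),
    Comparison('price', '<', 1000),
])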
Storage
As storage I want to use NoSQL+RDBMS, for example Postgres 9.4, which allows storing the details as JSON. I do not want to use EAV in any case, which is why a purely relational DBMS isn't an option (see here why). There is one important thing: products may have stocks, which leaves me with basically two ways:
If I design products as a single entity together with their stocks (pretty logical), then I cannot take the "storage" + "indexer" approach, because it produces stale state while the indexer (such as Solr) updates and reindexes the data
Design with separate entities. That means separating whatever can be cached from whatever cannot. The first part can then go to the indexer (details can probably go there too, so we can filter by them) and the non-cacheable part will go somewhere else.
And the question for the storage part would be, of course: which one to choose?
The good thing about the first approach is that the internal API is simple and the internal structures are simple and scalable, because they can easily be abstracted from the storage layer. The bad thing is that I then need a "magic solution" that allows using "just storage" instead of "storage + indexer". "Magic" here means designing indexes or some additional data structures (I was thinking about hashing, but it doesn't help with range queries) in the storage that can resolve the filtering requests. One such option is sketched below.
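One concrete (non-magic) option for the first approach, sketched with psycopg2 (table and column names are my own): Postgres 9.4 can index jsonb details with a GIN index and answer equality filters through containment, while range filters need a cast on the extracted key.

import psycopg2  # assumes a products table with a jsonb column "details"

conn = psycopg2.connect('dbname=shop')
cur = conn.cursor()

# one-time setup: a GIN index so containment filters don't scan the table
cur.execute("CREATE INDEX idx_product_details "
            "ON products USING GIN (details jsonb_path_ops)")

# equality filters translate to jsonb containment (@>), which uses that index
cur.execute("SELECT id FROM products WHERE details @> %s", ['{"color": "red"}'])

# range filters need a cast on the extracted value (add an expression
# index on (details->>'price')::numeric if this has to be fast)
cur.execute("SELECT id FROM products WHERE (details->>'price')::numeric < %s", [1000])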
On the other hand, the second solution allows the search engine to resolve the filtering task by itself, at the cost of a window during which the data there is outdated. And of course the data layer then needs to know which part of the model goes to which storage (stocks to one storage, details to another, etc.).
Summary
What would be a proper data model for the filtering?
Which approach should be used to resolve the issue at the storage level: storage + indexer with a separate-entities product model, or storage only with a monolithic product model? Or maybe something else?
If we go with the storage-only approach: is it possible to design the storage so that products can easily be filtered by any set of details?
If we go with the indexer: which one fits this issue best? (There is a good comparison between Solr and Sphinx here, but it was made in '09 and it's '15 now, so it is surely outdated.)
Any links, related blogposts or articles are very welcome.
As a P.S.: I searched across SO but found only barely-relevant suggestions/topics so far (for example this). I am not expecting a silver bullet here, as it always boils down to some trade-off, but the question looks very standard, so there should be good insights already. Please guide me; I tried to "ask Google" with some luck, but that was not enough yet.
P.P.S. Feel free to edit tags or redirect the question to a more appropriate SE resource if SO is not a good fit for this kind of question. And I am not asking for a language-specific solution, so if you are not using PHP it does not matter; the design has nothing to do with the language.
My preferred solution would be to split the entities - your second approach. The stable data would be held in Cassandra (or Solr or Elastic etc), while the volatile stock data would be held in (ideally) an in-memory database like Redis or Memcache that supports compare-and-swap / transactions (or Dynamo or Voldemort etc if the stock data won't fit in memory). You won't need to worry too much about the consistency of the stable data since presumably it changes rarely if ever, so you can choose a scalable but not entirely consistent database like Cassandra; meanwhile you can choose a less scalable but more consistent database for the volatile stock data.
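To make the compare-and-swap point concrete, here is a minimal sketch with redis-py (the key layout is my own invention): an optimistic stock decrement that retries if another client touched the key mid-transaction.

import redis

r = redis.StrictRedis()

def reserve_stock(product_id, quantity):
    """Atomically decrement stock via WATCH/MULTI (optimistic compare-and-swap)."""
    key = 'stock:%s' % product_id   # hypothetical key layout
    with r.pipeline() as pipe:
        while True:
            try:
                pipe.watch(key)     # invalidates the MULTI if the key changes
                available = int(pipe.get(key) or 0)
                if available < quantity:
                    pipe.unwatch()
                    return False    # not enough stock
                pipe.multi()
                pipe.set(key, available - quantity)
                pipe.execute()      # raises WatchError if we lost the race
                return True
            except redis.WatchError:
                continue            # stock changed under us; re-read and retry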
I tried using the Sitecore.Search namespace and it seems to handle the basic stuff. I am now evaluating the AdvancedDatabaseCrawler module by Alex Shyba. What are some of the advantages of using this module instead of writing my own crawler and search functions?
Thanks
Advantages:
You don't have to write anything.
It handles a lot of the code you need to write to even query Sitecore, e.g. basic search, basic search with field-level sorting, field-level searches, relation searches (GUID matches for lookup fields), multi-field searches, numeric range and date range searches, etc.
It handles combined searches with logical operators.
You can access the code.
This video shows samples of the code and front-end running various search types.
Disadvantages:
None that I can think of, because if you find an issue or a way to extend it, you have full access to the code and can amend it to your needs. I've done this before, by implementing the GetHashCode() and Equals() methods on the SkinnyItem class.
First of all, the "old" way of accessing the Lucene index was very simple, but unfortunately it's deprecated as of Sitecore 6.5.
The "new" way of accessing the Lucene index is very complex as the possibilities are endless. Alex Shyba's implementation is the missing part that makes it sensible to use the "new" way.
Take a look at this blog post: http://briancaos.wordpress.com/2011/10/12/using-the-sitecore-open-source-advanceddatabasecrawler-lucene-indexer/
It's a 3 part description on how to configure the AdvancedDatabaseCrawler, how to make a simple search and how to make a multi field search. Without Alex's AdvancedDatabaseCrawler, these tasks would take almost 100 lines of code. With the AdvancedDatabaseCrawler, it takes only 7 lines of code.
So if you are in need of an index solution, this is the solution to use.
What is the proper way to perform mass updates on entities in a Google App Engine Datastore? Can it be done without having to retrieve the entities?
For example, what would be the GAE equivalent of something like this in SQL:
UPDATE dbo.authors
SET city = replace(city, 'Salt', 'Olympic')
WHERE city LIKE 'Salt%';
There isn't a direct translation. The datastore really has no concept of updates; all you can do is overwrite old entities with a new entity at the same address (key). To change an entity, you must fetch it from the datastore, modify it locally, and then save it back.
There's also no equivalent of the LIKE operator. Matching with a trailing wildcard ('Salt%') is possible with some tricks (see below), but if you wanted to match '%Salt%' you'd have to read every single entity into memory and do the string comparison locally.
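For reference, the trailing-wildcard trick looks something like this in the old db API (Author is a stand-in model name): filter on a half-open range that covers every string starting with the prefix.

# 'Salt%' as a range: >= 'Salt' and < 'Salt' plus a maximal trailing character
salty = Author.all().filter('city >=', 'Salt').filter('city <', u'Salt' + u'\ufffd')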
So it's not going to be quite as clean or efficient as SQL. This is a tradeoff with most distributed object stores, and the datastore is no exception.
That said, the mapper library is available to facilitate such batch updates. Follow the example and use something like this for your process function:
from mapreduce import operation as op  # the mapper library's operation module

def process(entity):
    if entity.city.startswith('Salt'):
        entity.city = entity.city.replace('Salt', 'Olympic')
        yield op.db.Put(entity)  # queued and written back in batches
There are other alternatives besides the mapper. The most important optimization tip is to batch your updates; don't save back each updated entity individually. If you use the mapper and yield puts, this is handled automatically.
No, it can't be done without retrieving the entities.
There's no such thing as a '1000 max record limit', but there is of course a timeout on any single request, and if you have a large number of entities to modify, a simple iteration will probably fall foul of that. You could manage this by splitting the work into multiple operations and keeping track with a query cursor, or potentially by using the MapReduce framework. A sketch of the cursor approach is below.
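A rough sketch of the cursor approach (my own; it assumes a db model named Author and uses the deferred library to chain batches across requests):

from google.appengine.ext import db, deferred

BATCH_SIZE = 100  # small enough to finish well within the request deadline

def update_batch(cursor=None):
    query = Author.all()
    if cursor:
        query.with_cursor(cursor)    # resume where the previous batch stopped
    batch = query.fetch(BATCH_SIZE)
    for entity in batch:
        entity.city = entity.city.replace('Salt', 'Olympic')
    db.put(batch)                    # one batched write per request
    if len(batch) == BATCH_SIZE:     # probably more to do; chain the next batch
        deferred.defer(update_batch, query.cursor())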
You could use the Query class: http://code.google.com/appengine/docs/python/datastore/queryclass.html
# emulate LIKE 'Salt%' with a range filter, then write each change back
query = authors.all().filter('city >=', 'Salt').filter('city <', u'Salt' + u'\ufffd')
for record in query:
    record.city = record.city.replace('Salt', 'Olympic')
    record.put()