Tagging schema for AppEngine - google-app-engine

Hey,
I'm using App Engine for an application that I'm writing, and I need to assign tags to each object. I wanted to know what the best way of doing this is.
Should I create a space-separated string of tags and then query with something like %search_tag% (I'm not sure you can do that in JDOQL)?
What other options do I have?
Should I create another class which will map every object to a tag?
Which would be the best from the point of view of scalability, performance and ease of use?
Thanks

First, '%search_tag%'-style LIKE queries do not work on App Engine's datastore. The best you can do is a prefix search.
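For reference, the usual prefix-search workaround is a half-open range filter on the property. Here's a minimal JDO sketch, assuming a hypothetical Post entity with a String title property:

    import java.util.List;
    import javax.jdo.PersistenceManager;
    import javax.jdo.Query;

    public class PrefixSearch {
        // Returns all posts whose title starts with the given prefix.
        @SuppressWarnings("unchecked")
        public static List<Post> byTitlePrefix(PersistenceManager pm, String prefix) {
            Query q = pm.newQuery(Post.class, "title >= p && title < limit");
            q.declareParameters("String p, String limit");
            // "\ufffd" sorts after every other character, so the half-open
            // range [prefix, prefix + "\ufffd") covers exactly the prefix matches.
            return (List<Post>) q.execute(prefix, prefix + "\ufffd");
        }
    }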
It is difficult to answer very general questions like this. The best solution will depend on several factors: How many tags do you expect per entity? Is there a limit on the number of tags? How will you use the tags - for searching, or for display only? The answers to all of these questions affect how you should design your models.
One general solution for tagging is to use a multi-valued property, such as a list of tags.
http://code.google.com/appengine/docs/java/datastore/dataclasses.html#Collections
Be aware that if you have many tags on your entities it will add overhead at write time, since the index writes take time too. Also, you should avoid using a multi-valued property more than once (or several multi-valued properties) in queries with inequality filters or sort orders. That can lead to 'exploding indexes', since one index row gets written for every combination of the indexed values.
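To make the list-property approach concrete, here's a minimal JDO sketch of a tagged entity and a tag query (class and field names are illustrative). An equality filter on a multi-valued property matches if any element equals the value:

    import java.util.List;
    import javax.jdo.PersistenceManager;
    import javax.jdo.Query;
    import javax.jdo.annotations.IdGeneratorStrategy;
    import javax.jdo.annotations.PersistenceCapable;
    import javax.jdo.annotations.Persistent;
    import javax.jdo.annotations.PrimaryKey;
    import com.google.appengine.api.datastore.Key;

    @PersistenceCapable
    public class Post {
        @PrimaryKey
        @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
        private Key key;

        @Persistent
        private String title;

        // Multi-valued property: the datastore writes one index row per element.
        @Persistent
        private List<String> tags;

        // Finds every post carrying the given tag.
        @SuppressWarnings("unchecked")
        public static List<Post> byTag(PersistenceManager pm, String tag) {
            Query q = pm.newQuery(Post.class, "tags == t");
            q.declareParameters("String t");
            return (List<Post>) q.execute(tag);
        }
    }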

Related

Elastic Search: Parent child vs Nested Document

P.S: We are using Elastic 6.x
As Elasticsearch has been upgraded, a few breaking changes have popped up as well. We have some relational data which needs to be managed in either nested or parent/child mode.
For the final decision, I was wondering about the following questions:
How many nested documents/array size can I save in one field?
We have to manipulate the fields often, so what's the recommendation if we use the nested field type?
What are the limitations of parent/child if we use 4 types of relations?
I believe answers to the above questions can help me decide on the field type; let me know if there is anything else I should consider.
Thanks in advance
How many nested documents/array size can I save in one field?
By default (the index.mapping.nested_fields.limit setting), you can have a maximum of 50 nested fields defined per index. In each of those nested field arrays, you may store any number of elements.
We have to manipulate the fields often, so what's the recommendation if we use the nested field type?
That's where nested fields fall short: whenever a nested document changes, you either have to reindex the whole parent document or figure out via scripting which nested document to update, which can quickly get quite convoluted.
What are the limitations of parent/child if we use 4 types of relations?
From ES 6.x onwards, you're limited to a single join field per index.
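That single join field can still declare several relations, though. A minimal sketch of such a mapping for ES 6.x, sent via the low-level Java REST client (index, field and relation names are illustrative):

    import org.apache.http.HttpHost;
    import org.elasticsearch.client.Request;
    import org.elasticsearch.client.Response;
    import org.elasticsearch.client.RestClient;

    public class JoinMappingSketch {
        public static void main(String[] args) throws Exception {
            RestClient client = RestClient.builder(
                    new HttpHost("localhost", 9200, "http")).build();

            // One join field modelling four document types:
            // question -> (answer, comment), answer -> vote.
            Request request = new Request("PUT", "/qa");
            request.setJsonEntity(
                "{"
              + "  \"mappings\": {"
              + "    \"_doc\": {"
              + "      \"properties\": {"
              + "        \"relation\": {"
              + "          \"type\": \"join\","
              + "          \"relations\": {"
              + "            \"question\": [\"answer\", \"comment\"],"
              + "            \"answer\": \"vote\""
              + "          }"
              + "        }"
              + "      }"
              + "    }"
              + "  }"
              + "}");
            Response response = client.performRequest(request);
            System.out.println(response.getStatusLine());
            client.close();
        }
    }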
By the looks of it, neither nested fields nor parent/child would work well in your case... Maybe there's another possible design if you are willing to denormalize your data a little more, but it's hard to say without more detailed information about your precise use case.
Choosing Parent/Child vs Nested Document

Ancestor relation in datastore

I have three entities: user, post and comment. A user may have multiple posts and a post may have multiple comments.
I know I can add ancestor relations like this:
user (grandparent) → post (parent) → comment (child)
I'm a little bit confused about ancestors. I read in the documentation and in searches that ancestors are used for transactions, that all ancestors are in the same entity group, and that entity groups are stored on the same datastore node, which makes them less scalable. Is this right?
Is creating user as parent of posts and post as parent of comments a good thing?
Rather than this, we could add one extra property to the post entity, like user_id, as shown in the example, and filter by it.
Which is better/more scalable: filtering posts by ancestor, or adding an extra user_id property to the post entity and filtering by it?
I know both approaches can get the same results, but I want to know which one is better in performance and scalability.
Sorry, I'm new to the datastore.
Update 11/4/2017
A large number of users is using this app. It's quite possible that more than one post is created per second overall. A single user cannot create more than one post per second, but multiple users together may. As described in the documentation, the maximum entity group write rate is 1/s. Is it still possible to use ancestors?
The same goes for comments: multiple users can add comments in the same entity group, so more than one comment per second is quite possible.
Are ancestor queries faster?
I have read in many places that ancestor queries are much faster than others.
As I understand it, the reason they are fast is that an entity group stores related data on the same node, so it takes less time to fetch data from a single node than from multiple nodes.
For example: if a post is stored on a node in Asia and its comments on a node in Europe, and I want to get posts and comments, the datastore API needs to reach two nodes to complete the request, which makes it slow. Whereas if I create an ancestor relation and form an entity group, I get better performance.
But what if I don't need to get post and comment data at the same time? If I show posts on one web page and comments on another, the datastore API only needs to reach one node at a time, so it doesn't matter whether the data lives on a single node or on multiple nodes. What about query performance - can ancestors make it faster in this case?
Yes, you are correct: all ancestry-related entities are in the same entity group, which raises two scalability issues: data contention and the maximum entity group write rate of 1/s. See the somewhat related Is there an Entity Group Max Size?
There are advantages of using ancestries and some may be willing to sacrifice scalability for them (see What would be the purpose of putting all datastore entities in a single group?), but IMHO not for your kind of app: I think you'll agree that it's not really critical to see every new user/post/comment in random searches immediately after it is created (i.e. strong consistency) - the fact that it eventually appears is IMHO good enough.
Simply having no ancestry at all and adding additional model properties (entity keys or even just entity key IDs for entities which never have ancestors) to allow cross-referencing entities is the more scalable approach and IMHO fits well with your app.
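To put the two options side by side, here's a minimal sketch using the low-level Java datastore API (kind and property names are illustrative):

    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Key;
    import com.google.appengine.api.datastore.KeyFactory;
    import com.google.appengine.api.datastore.PreparedQuery;
    import com.google.appengine.api.datastore.Query;
    import com.google.appengine.api.datastore.Query.FilterOperator;
    import com.google.appengine.api.datastore.Query.FilterPredicate;

    public class PostQueries {
        private final DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

        // Option A: ancestor query - strongly consistent, but every post of a
        // user lives in that user's entity group (1 write/s limit per group).
        public PreparedQuery postsByAncestor(String userId) {
            Key userKey = KeyFactory.createKey("User", userId);
            return ds.prepare(new Query("Post").setAncestor(userKey));
        }

        // Option B: cross-reference property - eventually consistent, but posts
        // are root entities, so writes scale independently of the user.
        public PreparedQuery postsByProperty(String userId) {
            Query q = new Query("Post")
                    .setFilter(new FilterPredicate("userId", FilterOperator.EQUAL, userId));
            return ds.prepare(q);
        }
    }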
I think the question to ask is: Are you expecting:
A user to create posts more than once per second (I doubt it :)
People to comment on a post more than once per second (could happen)
If not, then ancestor queries will be faster than normal queries. So it depends on your use case. I'd go for query speed unless you know you will have thousands of comments on posts.

Solr documents with multiple parents

I'm currently trying to figure out if Solr is the right tool for me. I have the following setup:
There is the primary document type "blog". Then there are two additional document types "user" and "category". Both of these are parents of the "blog" document type.
Now when searching the "blog" documents, I not only want to search those fields (e.g. title and content), but also the parent fields (user>name and category>name).
Of course, I could just flatten that down to a single document for Solr, which would ease the search a lot. The downside, though, is that when e.g. a user updates their name, I have to run through all of their blog posts and update those documents in Solr, instead of just updating a single document.
This becomes even worse when the user has another parent, on which I need to search as well.
Do you have any recommendations on how to handle this use case? Maybe my Google foo is just not good enough, but what I found (block joins, etc.) doesn't seem to do the trick.
The most performant and easiest solution by far is to flatten everything into a single document. It turns out that these relations aren't updated as often as people think, and that searches are performed more often than documents are updated. And even if a value that is identical across a large set of documents changes, reindexing starting from the most recent documents (for a blog) and working backwards will appear rather performant to most users. This assumes that you actually have to search the values and don't just need them for display - in which case you could look them up from secondary storage when displaying an item (and just store the never-changing id in the document).
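As a rough illustration of the flattened approach, here's a minimal SolrJ sketch that copies the parent names onto each blog document (field and collection names are illustrative):

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class FlattenedIndexer {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/blogs").build();

            // The parent fields are denormalized onto the blog document, so a
            // single search covers title, content, user and category names.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "blog-1");
            doc.addField("title", "My first post");
            doc.addField("content", "Hello world...");
            doc.addField("user_id", "u42");        // stable id, never changes
            doc.addField("user_name", "Alice");    // denormalized, reindex on rename
            doc.addField("category_name", "Tech"); // denormalized as well
            client.add(doc);
            client.commit();
            client.close();
        }
    }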
Another option is to divide this into a multi-search problem: one collection for blog posts, one for users and one for categories. You then search each of the collections for the relevant data and merge the results in your search model. You can also use Streaming Expressions to hand most of this processing off to the Solr cluster.
The reason I always recommend flattening when possible is that most features in Solr (and Lucene) are written for a flat document structure, and a flat structure lets you fully leverage the features available. Since Lucene is by design a flat document store, most other features require special care to support block joins and parent/child relationships, and you end up experimenting a lot to get the queries and feature set you want (if possible at all). If the documents are flat, it just works.

How to best use Solr for relational records?

Given the following relational data schema:
Posts can have 0 to many tags
Tags have 0 to many aliases
In Sunspot/Solr, what would be the best practice for searching for posts by tag (or a tag alias)?
One approach is to index posts with an array of the full text of all associated tags and their aliases. This will work but seems wasteful in terms of storage resources and possibly not optimally performant.
To add to the above answer, you can use the synonyms feature to store aliases.
The synonyms do not have to be static: the latest versions of Solr make it very simple to manage synonyms programmatically.
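For instance, recent Solr versions expose managed synonyms over the Managed Resources REST API. A hedged sketch in plain Java, assuming a managed synonym filter named "tagaliases" is already wired into the field type of a "posts" core (all names are illustrative):

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class ManagedSynonyms {
        public static void main(String[] args) throws Exception {
            // PUT new mappings into Solr's managed synonyms resource.
            URL url = new URL(
                "http://localhost:8983/solr/posts/schema/analysis/synonyms/tagaliases");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("PUT");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);
            // Each key expands to its aliases during analysis.
            String body = "{\"js\": [\"javascript\", \"ecmascript\"]}";
            try (OutputStream os = conn.getOutputStream()) {
                os.write(body.getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("HTTP " + conn.getResponseCode());
            // Note: the core must be reloaded before the new synonyms take effect.
        }
    }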

How to handle frequently changing multivalue string fields in SOLR?

I have a Solr (or rather Heliosearch 0.07) core on a single EC2 instance. It contains about 20M documents and takes about 50GB on disk. The core is quite fixed/frozen and performs quite well once everything is warmed up.
The problem is a multi-valued string field: that field contains assigned categories, which change quite frequently for large parts of the 20M documents. After a commit, the warm-up takes far too long to be usable in production.
The field is used only for faceting and filtering. My idea was to store the categories outside Solr and inject them somehow using custom code. I checked quite a few approaches in various JIRA issues and blogs, but I could not find a working solution. Item 2 of this issue suggests that there is a solution, but I don't get what he's talking about.
I would appreciate any solution which allows me to update my category field without having to re-warmup my caches again afterwards.
I'm not sure that JIRA issue will help you: it is an advanced topic and, most importantly, it is still unresolved, so not yet available.
Partial document updates are not useful here because a) they require that everything in your schema is stored, and b) behind the scenes they reindex the whole document again anyway.
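(For reference, this is roughly what such a partial, or "atomic", update looks like in SolrJ; both caveats above still apply, and the core and field names are illustrative:)

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class AtomicUpdateSketch {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycore").build();

            // Atomic update: replace only the 'categories' field of one document.
            // Solr still rewrites the whole document internally, and the schema
            // fields must be stored for this to work.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-123");
            Map<String, Object> setOp = new HashMap<>();
            setOp.put("set", Arrays.asList("news", "sports"));
            doc.addField("categories", setOp);
            client.add(doc);
            client.commit();
            client.close();
        }
    }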
From what you say, it seems you have one monolithic index: have you considered splitting the index using sharding or SolrCloud? That way each "portion" would be smaller, and the autowarm shouldn't be as big a problem.
