Google App Engine - NDB - INDEX Questions - google-app-engine

I have the following questions about GAE NDB indexes.
I assume you can specify an index via index.yaml or within the model definition using the property option indexed=True. Am I correct? If yes, is one preferred over the other?
Is there a way to add or drop an index during the life cycle of the data objects?
Can I specify an index on a structured property field?
If so, can you please let me know the syntax for this?
Thanks in advance

By default, the properties that can be indexed (i.e. those that aren't variants of Blob) are indexed, which means you can filter or sort by them on their own. Adding single-property indexes to index.yaml would be unusual. Setting indexed=False for a property will mean fewer write-operations when saving entities, but will mean filtering or sorting by the property is no longer possible. I'd suggest reading the documentation on indexes.
If you want to filter or sort (in combination) by more than one property, then you need to include them in index.yaml. However, as you run code in the development server, if it requires an index that hasn't yet been specified, then index.yaml will be modified to contain a suitable index for the query being run. Adding indexes manually isn't necessarily something you'll ever have to do.
You can't index an entire StructuredProperty. The properties of a StructuredProperty are individually indexed, and you don't need to think about them any differently than regular properties. If you want to manually specify a multi-property index that includes a sub-property, then you should be able to do so by using 'property.subproperty' (e.g. 'address.city').
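As a rough illustration of the 'property.subproperty' form, the following sketch renders an index.yaml entry from Python. The kind name Contact, the property names, and the helper render_index are all hypothetical; only the general index.yaml layout (kind, properties, name, direction) is assumed from the App Engine configuration format.

```python
def render_index(kind, properties):
    """Return index.yaml text for one composite index.

    `properties` is a list of (name, direction) pairs, where direction is
    "asc" or "desc". Hypothetical helper, not part of any SDK.
    """
    lines = ["indexes:", "- kind: %s" % kind, "  properties:"]
    for name, direction in properties:
        lines.append("  - name: %s" % name)
        if direction == "desc":  # ascending is the default, so it is omitted
            lines.append("    direction: desc")
    return "\n".join(lines) + "\n"


# Assumed example: a Contact kind whose `address` is a StructuredProperty.
INDEX_YAML = render_index("Contact", [("address.city", "asc"), ("name", "desc")])
print(INDEX_YAML)
```

The printed text could be pasted into index.yaml by hand; in practice the development server usually writes such entries for you, as described above.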

1) Yes, you can set certain properties as being indexed. Some property types do not allow indexing at all. It's preferable to set the indexes programmatically within each model definition.
2) Although you can drop the index programmatically (i.e. remove indexed=True), I would not recommend it. It will leave your data store in an inconsistent state.
3) It's not possible to set an index on a structured property; however, you can set a Key relationship between your model and the models in the structured property.
See:
https://developers.google.com/appengine/docs/python/ndb/entities
https://developers.google.com/appengine/docs/python/ndb/properties
"You can specify the usual property options for structured properties
(except indexed)."

As I discovered the hard way (see GAE python NDB projection query working in development but not in production), there's a big difference between having an index (and therefore needing an entry in index.yaml) and marking properties as indexed or not indexed. These things are there for different purposes:
Having an index allows you to search, sort, or do projection queries.
Marking properties as indexed or not tells an index which entities to include in the index and which to ignore.
Yes, absolutely, you can add or drop an index at any time:
Update your index.yaml.
Then run one of the gcloud commands, such as '$ gcloud datastore indexes create index.yaml' or '$ gcloud datastore indexes cleanup index.yaml'.
No, you can't create an index on a structured property. See more info here https://cloud.google.com/appengine/docs/standard/python/ndb/entity-property-reference#structured

Related

Alfresco 5.0 transactional queries

I have some questions about how indexing in Alfresco One works with transactional queries.
We use Alfresco 5.0.2, and in the documentation I read this: "When you are upgrading the database, you can add optional indexes in order to support the metadata query feature."
Suppose that in my model.xml I add a custom property like this:
<type name="doc:myDoc">
    <title>Document</title>
    <parent>cm:content</parent>
    <properties>
        <property name="doc:level">
            <title>Level</title>
            <type>d:text</type>
            <mandatory>true</mandatory>
            <index enabled="true">
                <atomic>true</atomic>
                <stored>false</stored>
                <tokenised>both</tokenised>
            </index>
        </property>
        ...
    </properties>
</type>
And I have these settings in my alfresco-global.properties:
solr.query.cmis.queryConsistency=TRANSACTIONAL_IF_POSSIBLE
solr.query.fts.queryConsistency=TRANSACTIONAL_IF_POSSIBLE
system.metadata-query-indexes.ignored=false
My first question is: how does Alfresco know which properties I want to index in the DB? Does it read my model.xml and index only the properties that I mark as indexed there? Does it index all the custom properties? Or do I need to create a script to add these new indexes?
I read the script metadata-query-indexes.sql, but I don't understand how to rewrite it in order to add a new index for my property. If this script is necessary, could you give me an example with the doc:myDoc property that I wrote before, please?
Another question is about query syntax that isn't supported by the DB and goes directly to SOLR.
I read that you must not use PATH, SITE, ANCESTOR, OR, or any d:content, d:boolean or d:any properties (among others) in your query, or it will not be executable against the DB. But I don't understand what d:content is exactly.
For example, is a query (based on my custom property above) like TYPE:whatever AND #doc\:level:"value" considered d:content? Is this query supported by the DB, or does it go to SOLR?
I read also this:
"Any property checks must be expressed in a form that means "identical value check" as querying the DB does not provide the same tokenization / similarity capabilities as the SOLR index. E.g. instead of my:property:"value" you'd have to use =my:property:"value" and "value" must be written in the proper case the value is stored in the DB."
Does this mean that if I use the =, for example =#doc\:level:"value", the query isn't accepted by the DB and goes to SOLR? Can't I search for an exact value in the DB?
I've been researching TMQs recently. I'm assuming that you need transactionality, which is why TMQ queries are interesting. Queries via SOLR are eventually consistent, but TMQs will immediately return the change. There are certain applications where eventual consistency is a huge problem, so I'm assuming this is why you are looking into them.
Alfresco says that they use TMQs by default, and in my limited testing (200k documents), I found no appreciable performance difference between a solr and TMQ query. I can't imagine they are horrible for performance if Alfresco set it up to be the default style, but I need to do further testing with millions of documents to be sure. It will of course depend on your database load. If your database is a bottleneck and you don't need the transactionality, you could consider using # syntax in metadata searches to avoid them, or you could disable them via properties configuration.
1) How Alfresco knows which properties I want to index on DB? Read my model.xml and index only the indexed properties that I specify there? Index all the custom properties? Or I need to create a script to add these new indexes?
When you execute a query using a syntax that is compatible with a TMQ, Alfresco will do so. The default behavior is "TRANSACTIONAL_IF_POSSIBLE":
http://docs.alfresco.com/4.2/concepts/intrans-metadata-configure.html
You do not have to have the field marked as indexable in the model for this to work. This is unclear from the documentation but I've tried disabling indexing for the field in the model and these queries still work. You don't even have to have solr running!
2) Another question is about query syntax that isn't supported by DB and goes directly to SOLR.
Your example of TYPE and an attribute does not go to solr. It's things like PATH that must go to SOLR.
3) "Any property checks must be expressed in a form that means "identical value check" as querying the DB does not provide the same tokenization / similarity capabilities as the SOLR index. E.g. instead of my:property:"value" you'd have to use =my:property:"value" and "value" must be written in the proper case the value is stored in the DB."
What they are saying is that you must use the = operator, not the default or # operator. The # operator depends on tokenization, but TMQs go straight to the database. However, you can use * in an attribute if you omit the "", like so:
=cm\:title:Startswith*
Works for me on 5.0.2 via TMQ. You can absolutely search for an exact value as well, however.
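If you want a quick pre-check before sending a query, a small heuristic along the following lines may help. It only looks for the constructs named in the question (PATH, SITE, ANCESTOR and OR) and assumes operators are written in upper case. The function name is made up, the list of constructs is incomplete by the question's own admission ("among others"), and Alfresco itself makes the real decision, so treat this as a sketch rather than a description of Alfresco's logic.

```python
import re

# Constructs the question lists as not executable against the database.
# The list is explicitly incomplete, so this is only a heuristic.
_SOLR_ONLY = re.compile(r"\b(?:PATH|SITE|ANCESTOR)\s*:|\bOR\b")


def uses_solr_only_construct(query):
    """Return True if the query contains PATH, SITE, ANCESTOR or OR.

    Quoted phrases are blanked first so that a value such as "a OR b"
    inside quotes does not trigger a match. Hypothetical helper.
    """
    without_phrases = re.sub(r'"[^"]*"', '""', query)
    return _SOLR_ONLY.search(without_phrases) is not None


print(uses_solr_only_construct('TYPE:"doc:myDoc" AND =doc:level:"value"'))
print(uses_solr_only_construct('PATH:"/app:company_home//*"'))
```

A False result only means that none of the listed constructs were found; it does not guarantee that the query runs as a TMQ.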
I hope this cleared it up for you. I highly recommend setting solr.query.fts.queryConsistency=TRANSACTIONAL to always force TMQs in a test environment, and testing different queries, if you still have questions about what syntax is supported.
Regards
A nice explanation can be found here.
https://community.alfresco.com/people/andy1/blog/2017/06/19/explaining-eventual-consistency
When changes are made to the repository they are picked up by SOLR via
a polling mechanism. The required updates are made to the Index Engine
to keep the two in sync. This takes some time. The Index Engine may
well be in a state that reflects some previous version of the
repository. It will eventually catch up and be consistent with the
repository - assuming it is not forever changing.

Properties vs Categories on an Aspect in Alfresco

I'm using Alfresco 4.1.6 and Solr 1.4.
I'm reading about the possibility of classifying nodes using a property of type d:category in an aspect in the content model.
Fast search times are the most important thing in our project, which is why I am trying to design the best option possible. Our repository has over 2 million documents, spread over directories, where each user (we have approx. 3000 users) has their own root path.
For the queries (FTS_ALFRESCO), we currently use TYPE (we have 5 distinct node types defined in our model) and custom properties (all of the ones we use in queries are indexed).
My question is: imagine I change my model and turn one of our properties into a category. I delete the property and create an aspect with a d:category for it. Will the search be more efficient and quicker if I search by TYPE, property and category? Does Alfresco give better performance if I search this value as a category rather than as a normal indexed property? Or is it really the same? What are the benefits of using it as a category?
Categories and properties have different uses.
The main difference is:
Property: you can have a different value of the same property for each piece of content.
Category: you have the same category, which can be associated with multiple pieces of content.
So, based on your requirements, you have to choose which one you want to use. As far as performance is concerned, I guess category-based search will be faster (I haven't really tried it, though).

Indexes and indexes entries limits in Google App Engine Datastore

I'm having some trouble understanding how indexes work in the GAE Datastore; in particular, the limits related to indexes are really unclear to me.
From what I understand, one can create custom indexes in the datastore-indexes.xml file, and additionally the Datastore will generate some automatic indexes to match the user queries.
A first question is: does the "Number of indexes" quota limit defined in the quotas page (https://cloud.google.com/appengine/docs/quotas#Datastore) refer only to the custom indexes defined in datastore-indexes.xml, or does it also apply to automatically generated indexes?
Another concept eluding me is the "index entry for a single query".
Assume I don't have multi-valued properties (i.e. no lists) and I have some entities of kind "KindA". Then I define two groups of entity properties:
- Group1: properties with arbitrary name and boolean value
- Group2: properties with arbitrary name and double value
In my world any KindA entity can have at most N properties of Group1 and N properties of Group2. For any property P an index table is created, and each entity having that P set will add a row to the P index table (right?). Thus initially any KindA entity will have 1 entry for each of the max. 2N properties (thus in total max. 2N index entries per entity), right?
If this is correct, then it follows that I can only create an entity with a limited number of properties. However, this is strange, since I've always read that an entity can have unlimited properties (without taking into account the size limit).
However, let's assume now that my application allows users to query for KindA entities using an arbitrarily long sequence of AND filters on properties of Group1 (the boolean ones). Thus one can query something like:
find all entities in KindA where prop1=true AND prop2=true ... AND propM = true
This is a situation in which query only contains equalities and thus no custom indexes are required (https://cloud.google.com/appengine/docs/python/datastore/indexes#Index_configuration).
But what if I want to order using properties of Group2? In this case I need an index for each different query, right (different in terms of the combination of filtered property names)?
In my development server I tried without specifying any custom index, and GAE generates them for me (however, any time I restart, the previously generated indexes get removed). In this case, how many index entries does a single KindA entity have in a single query index? I say 1 because of what the GAE docs say:
The property may also be included in additional, custom indexes declared in your index configuration file (index.yaml). Provided that an entity has no list properties, it will have at most one entry in each such custom index (for non-ancestor indexes) or one for each of the entity's ancestors (for ancestor indexes)
Thus, in theory, if N is limited I'm safe with respect to the "Maximum number of index entries for an entity" (https://cloud.google.com/appengine/docs/java/datastore/#Java_Quotas_and_limits). Is that right?
But what about receiving over 200 different queries? Does that lead GAE to automatically generate over 200 custom indexes (one per distinct query)? If yes, do these automatically generated indexes count toward the index number limit (which is 200)?
If yes, then it follows that I can't let users run these (IMHO very basic) queries. Am I misunderstanding something?
First of all, I was trying to understand your question, which I find difficult to follow.
The 200 index limit only counts the composite indexes that you define (or that the dev app server defines for you automatically, based on your queries). This means that the indexes created on their own for your indexed properties do not count toward this limit.
You are correct about the 2N automatic indexes, one created for each indexed property.
You can have any number of properties indexed in any entity as long as you don't go over the 1MB limit per entity. But this really depends on the content of the properties stored.
For the indexes created for you on your indexed properties, you don't really have an actual limit, but rather an increasing cost, as your writes and storage per entity put will increase for each added property.
When using sort orders, you are limited to one sort order when using automatic indexes. More sort orders will require a composite index (your custom index). Thus, if you are also using an equality filter, you need a custom index anyway.
So, yes, in your example the dev app server will create a composite index for each query you execute. However, you can reduce these indexes manually by deleting the ones not needed. The query planner can find your results at query time by merging different indexes, as explained here:
https://cloud.google.com/appengine/articles/indexselection
Yes, every index definition in your index.yaml will count toward the 200 limit.
I find that you really don't use composite indexes too much once you know how GAE apps can be programmed. You need to balance what users need to be able to do and what they don't. Also balance between doing the work in the query, or just querying everything and filtering in code (it really depends on the maximum number of entities you can have in that particular kind).
However, if you're trying to make complex queries available to your users, then maybe the datastore is not the right choice.
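To make the counting rule quoted in the question concrete, here is a small sketch that estimates how many entries one entity contributes to a single non-ancestor composite index. It assumes that the entry count is the product of the number of values each indexed property holds, so an entity with no list properties contributes exactly one entry. The function name and the sample entity are made up for illustration, and ancestor indexes are deliberately left out.

```python
def composite_index_entries(entity, index_properties):
    """Estimate entries one entity adds to a non-ancestor composite index.

    `entity` maps property names to a single value or a list of values.
    An entity that lacks any of the indexed properties adds no entry.
    """
    total = 1
    for name in index_properties:
        if name not in entity:
            return 0
        value = entity[name]
        # A list property contributes one factor per stored value.
        count = len(value) if isinstance(value, list) else 1
        total *= count
    return total


kind_a = {"prop1": True, "prop2": True, "score": 0.75, "tags": ["a", "b", "c"]}
print(composite_index_entries(kind_a, ["prop1", "prop2", "score"]))
print(composite_index_entries(kind_a, ["prop1", "tags"]))
```

With single-valued properties only, the first call yields one entry per index, which is why the number of composite indexes, rather than the entries per entity, tends to be the limit you hit first in the scenario described.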

How can I discover if a property of a stored Entity is indexed or unindexed?

I have several entities in datastore, but I don't know if some of their properties are indexed or unindexed.
How can I discover (with the admin console or programmatically) if a property of a stored Entity is indexed or unindexed?
By default each property is indexed (unless it's a TextProperty or BlobProperty); you need to (and should) set the property's indexed option to False if you don't want it to be indexed (to improve performance and reduce entity write costs).
There is no indication in the admin console of whether a property is indexed or not. You can try to execute "select * from EntityType order by Property" in the GQL box of the datastore viewer and see if it fails.
If you've been flipping between indexed=True and indexed=False on some properties over time, and have a set of entities written under both regimes, then you'll have some properties that are indexed and some that aren't. Is this the situation you're in?
If you don't have reliable history on your code, trying to determine if you're in this situation is a bit tricky, depending on how many entities you have. You can determine if you're in an inconsistent state by noting if a keys-only query on an Entity returns a different number of keys than a query that filters on the suspect property. A filter won't find unindexed properties. If you've got a lot of entities, you'll have to shard the counting somehow (to avoid timing out on a long query that returns lots of entities).
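The count comparison can be pictured with a toy in-memory model. This is not Datastore API code: the stored list, the indexed_props field, and both helper functions are invented purely to show why the two counts differ when some entities were written with the property unindexed.

```python
def count_all(entities):
    """Stand-in for a keys-only query: every stored entity is counted."""
    return len(entities)


def count_filterable(entities, prop):
    """Stand-in for a query that filters or sorts on `prop`.

    Only entities written while `prop` was indexed are visible to it.
    """
    return sum(1 for e in entities if prop in e["indexed_props"])


# Toy data: two entities written with `num` indexed, one written without.
stored = [
    {"key": 1, "indexed_props": {"num", "name"}},
    {"key": 2, "indexed_props": {"name"}},
    {"key": 3, "indexed_props": {"num", "name"}},
]

missing = count_all(stored) - count_filterable(stored, "num")
print("entities invisible to queries on num:", missing)
```

A non-zero difference corresponds to the inconsistent state described above; against the real datastore you would obtain the two counts from actual queries, sharded if the kind is large.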
If you determine that you do have inconsistent indexing and want to repair your entities to be consistent, the usual approach is to write a mapreduce that touches all of your unstable entities and issues puts on the necessary properties.
Take a look at the "Datastore Indexes" interface; the link for it is located in the left navigation menu of the App Engine dashboard.
There you'll see the list of indexes and the specific properties to which each index has been applied.
For composite indexes (i.e. the one defined in datastore-indexes.xml or index.yaml), you could use the low-level API to get the list of indexes that are present in your app's datastore.
In GAE/J, you would need to invoke DatastoreServiceFactory.getDatastoreService().getIndexes(), while in Python, the same function is provided by db.get_indexes().

Are indexes really required in the datastore?

I'm a bit confused by some of the GAE documentation. While I intend to add indexes to optimize performance of my application, I wanted to get some clarification on if they are only suggested for this purpose or if they are truly required.
Queries can't find property values that aren't indexed. This includes properties that are marked as not indexed, as well as properties with values of the long text value type (Text) or the long binary value type (Blob).
A query with a filter or sort order on a property will never match an entity whose value for the property is a Text or Blob, or which was written with that property marked as not indexed. Properties with such values behave as if the property is not set with regard to query filters and sort orders.
from http://code.google.com/appengine/docs/java/datastore/queries.html#Introduction_to_Indexes
The first paragraph leads me to believe that you simply cannot sort or filter on unindexed properties. However, the second paragraph makes me think that this limitation is only confined to Text or Blob properties or properties specifically annotated as unindexed.
I'm curious about the distinction because I have some numeric and string fields that I am currently sorting/filtering against in a production environment which are unindexed. These queries are being run in a background task that mostly doesn't care about performance (I would rather optimize for size/cost in this situation). Am I somehow just lucky that these are returning the right data?
In the GAE datastore, single property indexes are automatically created for all properties that are not unindexable (explicitly marked, or of those types).
The language in that doc, I suppose, is a tad confusing.
You only need to explicitly define indexes when you want to index by more than one property (say, for sorting by two different properties.)
In GAE, unfortunately, if the property is marked as unindexed:
num = db.IntegerProperty(required=True, indexed=False)
then it is impossible to include it in a custom index. This is counterproductive (most built-in indexes are never used by my code, but take lots of space). But it is how GAE currently works.
Datastore Indexes - Unindexed properties:
Note: If a property appears in an index composed of multiple properties, then setting it to unindexed will prevent it from being indexed in the composed index.
Never add a property to a model without EXPLICITLY entering either indexed=True or indexed=False. Indices take substantial resources: space, write op costs, and latency increases when doing put()s. We never, never add a property without explicitly stating its indexed value, even if it is indexed=False. This saves costly oversights, and forces one to always think a bit about whether or not to index. (You will at some point find yourself cursing the fact that you forgot to override the default of True.) GAE engineers would do a great service by not allowing this to default to True, imho. I would simply not provide a default if I were them. HTH. -stevep
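One way to enforce that habit is a small source check. The sketch below scans Python source text for db or ndb property declarations that do not pass indexed= explicitly. It is a regex-based approximation written for this discussion, not a tool that ships with App Engine, and it does not handle arguments that contain nested parentheses.

```python
import re

# Matches calls such as db.StringProperty(...) or ndb.IntegerProperty(...).
# Arguments containing nested parentheses are not handled by this sketch.
_PROPERTY_CALL = re.compile(r"\b(?:db|ndb)\.(\w+Property)\(([^()]*)\)")


def properties_missing_indexed(source):
    """Return (line_number, property_class) for calls without indexed=."""
    findings = []
    for number, line in enumerate(source.splitlines(), start=1):
        for match in _PROPERTY_CALL.finditer(line):
            if not re.search(r"\bindexed\s*=", match.group(2)):
                findings.append((number, match.group(1)))
    return findings


SAMPLE = """\
class Foobar(db.Model):
    num = db.IntegerProperty(required=True, indexed=False)
    name = db.StringProperty()
"""
print(properties_missing_indexed(SAMPLE))
```

Property types that can never be indexed, such as TextProperty and BlobProperty, would also be reported by this check; you may prefer to skip them.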
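You must define a custom index if you want to combine two or more filters in a single query in a way the built-in indexes cannot serve, for example an equality filter on one property together with an inequality filter on another.
e.g.:
Foobar.all().filter('foo =', foo).filter('bar >', bar)
If you query with just one filter, there is no need to define an index; it is generated automatically.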
For Blob and Text, you can't generate an index, even if you specify one in index.yaml, and you can't filter on them either.
e.g.
from google.appengine.ext import db

class Foobar(db.Model):
    content = db.TextProperty()

Foobar.all().filter('content =', content)
The code above will fail, because a TextProperty can't be given an index and therefore can't be matched by a filter.
