Create Subpath Indexes - basex

I am storing a vast number of mathematical formulas as Content MathML in BaseX databases. Since many subpaths are similar and repeated, it would be nice to have (self-defined) indexes that cover not just a single element node but whole subpaths, in order to speed up lookups.
E.g. index not only elements which are accessible via
//apply
but whole (sub)paths like
//apply/plus
or
//apply/plus/ci
Is there any way to create such an index? I could not find a way in the documentation.

Related

Indexes and indexes entries limits in Google App Engine Datastore

I'm having trouble understanding how indexes work in the GAE Datastore; in particular, the limits related to indexes are really unclear to me.
From what I understand, one can define custom indexes in the datastore-indexes.xml file, and the Datastore additionally generates some automatic indexes to match the user's queries.
A first question: does the "Number of indexes" quota limit defined on the quotas page (https://cloud.google.com/appengine/docs/quotas#Datastore) refer only to the custom indexes defined in datastore-indexes.xml, or does it also apply to automatically generated indexes?
Another concept that eludes me is the "index entry for a single query".
Assume I don't have multi-valued properties (i.e. no lists) and I have some entities of kind "KindA". Then I define two groups of entity properties:
- Group1: properties with arbitrary name and boolean value
- Group2: properties with arbitrary name and double value
In my world any KindA entity can have at most N properties of Group1 and N properties of Group2. For any property P an index table is created, and each entity that has P set adds a row to the P index table (right?). Thus initially any KindA entity will have 1 entry for each of its at most 2N properties (so at most 2N index entries per entity in total), right?
If this is correct, then it follows that I can only create an entity with a limited number of properties. This is strange, since I've always read that an entity can have unlimited properties (leaving the size limit aside).
However, let's assume now that my application allows users to query for KindA entities using an arbitrarily long sequence of AND filters on properties of Group1 (the boolean ones). Thus one can query something like:
find all entities in KindA where prop1=true AND prop2=true ... AND propM = true
This is a situation in which query only contains equalities and thus no custom indexes are required (https://cloud.google.com/appengine/docs/python/datastore/indexes#Index_configuration).
But what if I want to order by properties of Group2? In this case I need an index for every different query, right (different in terms of the combination of filtered property names)?
On my development server I tried without specifying any custom index and GAE generates them for me (although every time I restart, the previously generated indexes are removed). In this case, how many index entries does a single KindA entity have in a single query index? I would say 1, because of what the GAE docs say:
The property may also be included in additional, custom indexes declared in your index configuration file (index.yaml). Provided that an entity has no list properties, it will have at most one entry in each such custom index (for non-ancestor indexes) or one for each of the entity's ancestors (for ancestor indexes)
Thus in theory, if N is limited, I'm safe with respect to the "Maximum number of index entries for an entity" (https://cloud.google.com/appengine/docs/java/datastore/#Java_Quotas_and_limits), is that right?
But what about receiving over 200 different queries? Does that lead GAE to automatically generate over 200 custom indexes (one per distinct query)? If so, do these automatically generated indexes count towards the limit on the number of indexes (which is 200)?
If yes, then it follows that I can't let users run these (IMHO very basic) queries. Am I misunderstanding something?
First of all, I was trying to understand your question, which I find difficult to follow.
The 200-index limit counts only the composite indexes defined through your queries, whether you define them yourself or the dev app server defines them for you automatically. The indexes created automatically for each of your indexed properties do not count towards this limit.
You are correct about the 2N automatic indexes, one created for every indexed property.
You can have any number of indexed properties in an entity as long as you stay under the 1MB limit per entity. However, this really depends on the content of the stored properties.
For the indexes created automatically on your indexed properties, there is no hard limit, only an increasing cost: the writes and storage for each entity put grow with every added property.
When using sort orders, you are limited to one sort order when using automatic indexes. More sort orders will require a composite index (your custom index). Thus, if you are already using an equality filter and add a sort order, you need a custom index anyway.
So yes, in your example the dev app server will create a composite index for each distinct query you execute. However, you can reduce the number of indexes manually by deleting the ones that are not needed. The query planner can then find your results at query time by merging different indexes, as explained here:
https://cloud.google.com/appengine/articles/indexselection
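To illustrate the idea of answering a chain of equality filters by merging several single-property indexes, here is a minimal sketch in plain Python. It is not Datastore code: the sorted key lists, the `merge_join` helper and the entity keys are all hypothetical, and the real query planner is more sophisticated than this.

```python
def merge_join(*sorted_key_lists):
    """Return the keys that appear in every sorted list (simplified merge join)."""
    if not sorted_key_lists:
        return []
    positions = [0] * len(sorted_key_lists)
    result = []
    # Stop as soon as any one index is exhausted.
    while all(pos < len(keys) for pos, keys in zip(positions, sorted_key_lists)):
        current = [keys[pos] for pos, keys in zip(positions, sorted_key_lists)]
        highest = max(current)
        if all(key == highest for key in current):
            # Every index points at the same entity key: it matches all filters.
            result.append(highest)
            positions = [pos + 1 for pos in positions]
        else:
            # Advance each index that is still behind the highest candidate.
            positions = [
                pos + 1 if key < highest else pos
                for pos, key in zip(positions, current)
            ]
    return result


# Hypothetical per-property indexes: sorted keys of entities where propX = true.
prop1_true = ["e1", "e3", "e4", "e7"]
prop2_true = ["e2", "e3", "e7", "e9"]
prop3_true = ["e3", "e5", "e7"]

matches = merge_join(prop1_true, prop2_true, prop3_true)
print(matches)
```

Each additional AND filter only adds one more sorted list to the merge, which is why a query made purely of equality filters can get by without a dedicated composite index.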
Yes, every index definition in your index.yaml counts towards the 200 limit.
I have found that you don't really need composite indexes very often once you know how GAE apps can be programmed. You need to balance what users need to do against what they don't. You also need to balance doing the work in the query against simply querying everything and filtering in code (this really depends on the maximum number of entities you can have in that particular kind).
However, if you're trying to make complex queries available to your users, then maybe the Datastore is not the right choice.

C Database Design, Sortable by Multiple Fields

If memory is not an issue for my particular application (entry, lookup, and sort speed being the priorities), what kind of data structure/concept would be the best option for a multi-field rankings table?
For example, let's say I want to create a Hall of Fame for a game, sortable by top score (independent of username), username (with all scores by the same user placed together before ranking users by their highest scores), or level reached (independent of score or name). In this example, if I order a linked list, vector, or any other sequential data structure by the top score of each player, it makes searching for the other fields -- like level and non-highest scores -- more iterative (i.e. iterate across all looking for the stage, or looking for a specific score-range), unless I conceive some other way to store the information sorted when I enter new data.
The question is whether there is a more efficient (albeit complicated and memory-consumptive) method or database structure in C/C++ that might be primed for this kind of multi-field sort. Linked lists seem fine for simple score rankings, and I could even organize a hashtable by hashing on a single field (player name, or level reached) to sort by a single field, but then the other fields take O(N) to find, worse to sort. With just three fields, I wonder if there is a way (like sets or secondary lists) to prevent iterating in certain pre-desired sorts that we know beforehand.
Do it the same way databases do it: using index structures. You have your main data as a number of records (structs), perhaps ordered according to one of your sorting criteria. Then you have index structures, each one ordered according to one of your other sorting criteria, but these index structures don't contain copies of all the data, just pointers to the main data records. (Think "index" like the index in a book, with page numbers "pointing" into the main data body.)
Using ordered linked lists for your index structures will give you a fast and simple way to go through the records in order, but it will be slow if you need to search for a given value, and similarly slow when inserting new data.
Hash tables will have fast search and insertion, but (with normal hash tables) won't help you with ordering at all.
So I suggest some sort of tree structure. Balanced binary trees (look for AVL trees) work well in main memory.
But don't forget the option to use an actual database! Database managers such as MySQL and SQLite can be linked with your program, without a separate server, and let you do all your sorting and indexing very easily, using SQL embedded in your program. It will probably execute a bit slower than if you hand-craft your own main-memory data structures, or if you use main-memory data structures from a library, but it might be easier to code, and you won't need to write separate code to save the data on disk.
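As a rough sketch of the index-structure idea in this answer, the following Python example keeps the records once and maintains three sorted indexes that hold only record positions. The `Entry` and `HallOfFame` names are hypothetical, sorted lists maintained with `bisect` stand in for the balanced trees suggested above (list insertion is O(N), so a real tree would scale better), and the per-user ordering is simplified to "alphabetical by username, best score first".

```python
import bisect
from dataclasses import dataclass


@dataclass
class Entry:
    username: str
    score: int
    level: int


class HallOfFame:
    def __init__(self):
        self.records = []   # main data, stored once, in insertion order
        self.by_score = []  # (-score, record_id): highest score first
        self.by_level = []  # (-level, record_id): highest level first
        self.by_user = []   # (username, -score, record_id)

    def add(self, entry):
        record_id = len(self.records)
        self.records.append(entry)
        # Each index stores a sort key plus a "pointer" (the record id).
        bisect.insort(self.by_score, (-entry.score, record_id))
        bisect.insort(self.by_level, (-entry.level, record_id))
        bisect.insort(self.by_user, (entry.username, -entry.score, record_id))

    def ranked(self, index):
        """Follow the pointers of one index back to the main records."""
        return [self.records[key[-1]] for key in index]


hall = HallOfFame()
hall.add(Entry("ann", 900, 3))
hall.add(Entry("bob", 1200, 2))
hall.add(Entry("ann", 1500, 5))

top_scores = [e.score for e in hall.ranked(hall.by_score)]
top_levels = [e.level for e in hall.ranked(hall.by_level)]
per_user = [(e.username, e.score) for e in hall.ranked(hall.by_user)]
```

Adding a new sort order only means adding one more index; the records themselves are never copied or reordered.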
So, you already know how to store your data and keep it sorted with respect to a single field. Assuming the values of the fields for a single entry are independent, the only way you'll be able to get what you want is to keep three different lists (using the data structure of your choice), each of which is sorted by a different field. This uses three times as much memory for pointers as a single list.
As for which data structure each list should be, a binary max heap can be effective. Insertion is O(lg N) and reading the top entry is O(1); note, however, that listing all entries in rank order means popping them one at a time at O(lg N) each, so O(N lg N) in total rather than O(N). If in some of these lists the entries need to be sub-sorted by another field, just handle that in the comparison function.
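A minimal sketch of the heap option using Python's standard `heapq` module, which implements a min-heap, so scores are negated to get max-heap behaviour. The sample scores are made up for illustration. It suggests that peeking at the best entry is cheap, while producing a full ranking takes one pop per entry.

```python
import heapq

scores = [900, 1200, 1500, 700]

# heapq is a min-heap, so store negated scores to keep the maximum on top.
heap = [-s for s in scores]
heapq.heapify(heap)

# Reading the best score does not modify the heap.
top = -heap[0]

# A full ranking requires popping every entry, one at a time.
ranking = []
while heap:
    ranking.append(-heapq.heappop(heap))
```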

How is it possible to build database index on top of key/value store?

I was reading about LevelDB and found out that:
Upcoming versions of the Chrome browser include an implementation of the IndexedDB HTML5 API that is built on top of LevelDB
IndexedDB is also a simple key/value store that has the ability to index data.
My question is: how is it possible to build an index on top of a key/value store? I know that an index is, at its lowest level, an n-ary tree, and I understand how data is indexed in a database. But how can a key/value store like LevelDB be used to create a database index?
The vital feature is not that it supports custom comparators but that it supports ordered iteration through keys and thus searches on partial keys. You can emulate fields in keys just using conventions for separating string values. The many scripting layers that sit on top of leveldb use that approach.
The dictionary view of a Key-Value store is that you can only tell if a key is present or not by exact match. It is not really feasible to use just such a KV store as a basis for a database index.
As soon as you can iterate through keys starting from a partial match, you have enough to provide the searching and sorting operations for an index.
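The following sketch shows how ordered key iteration is enough to emulate a secondary index. It does not use LevelDB: `OrderedKV` is a hypothetical in-memory stand-in that keeps its keys sorted, and the `user:` / `idx:city:` key conventions are assumptions made up for this example.

```python
import bisect


class OrderedKV:
    """Toy key/value store whose only special feature is ordered iteration."""

    def __init__(self):
        self._keys = []    # kept sorted at all times
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values[key]

    def scan_prefix(self, prefix):
        """Yield (key, value) pairs in key order, starting from a partial key."""
        start = bisect.bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            yield key, self._values[key]


def put_user(store, user_id, name, city):
    primary_key = f"user:{user_id}"
    store.put(primary_key, name)
    # Index entry: the indexed field is encoded into the key itself.
    store.put(f"idx:city:{city}:{user_id}", primary_key)


def users_in_city(store, city):
    return [store.get(pk) for _, pk in store.scan_prefix(f"idx:city:{city}:")]


store = OrderedKV()
put_user(store, 1, "alice", "paris")
put_user(store, 2, "bob", "rome")
put_user(store, 3, "carol", "paris")

paris_users = users_in_city(store, "paris")
```

The lookup never inspects the stored values; it only seeks to the first key with the given prefix and reads forward until the prefix no longer matches.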
Just a couple of things: LevelDB supports sorting of data using a custom comparator. From the page you linked to:
According to the project site the key features are:
Keys and values are arbitrary byte arrays.
Data is stored sorted by key.
Callers can provide a custom comparison function to override the sort order.
....
So LevelDB can contain data that is sorted/indexed according to one sort order.
If you needed several indexable fields, you could just add your own B-Tree that works on top of LevelDB. I would imagine that this is the type of approach that the Chrome browser takes, but I'm just guessing.
You can always look through the Chrome source.

In the python version of Google App Engine, how do I find the quartile values of a model with an index on a specific property?

In Google App Engine I have a model with 10K entities with an index on the property foo. What is the most efficient way to find the 1st quartile, 2nd quartile (the median), and the 3rd quartile entities? I can fetch the sorted list of keys and find the three quartile keys programmatically, but downloading all the keys won't scale. What is the more elegant approach?
sortedValues = MyModel.all(keys_only=True).order('foo').fetch(limit=10000)
Have you tried .fetch(1, offset=2500), .fetch(1, offset=5000), and .fetch(1, offset=7500)? The first argument is the limit and the offset is passed as the second argument.
The documentation says the following, however, so this approach won't give you O(1) performance.
Note: The query has performance characteristics that correspond linearly with the offset amount plus the limit amount.
From here.
Since quartiles are defined in terms of entity ordering, there's unfortunately no way to determine them other than iterating over them. As cheeken points out, you can speed things up a little by not fetching the intermediate results by using an offset argument.
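As a small illustration of the offset arithmetic, here is a sketch in plain Python. The `quartile_offsets` helper is hypothetical, and the sorted list only simulates an index ordered by foo; in the real application each offset would be passed to a one-entity query instead of indexing into a list.

```python
def quartile_offsets(count):
    """Zero-based offsets of the 1st, 2nd and 3rd quartile entities."""
    return [count * k // 4 for k in (1, 2, 3)]


offsets = quartile_offsets(10000)

# Simulated index on `foo`: 10K values already in sorted order.
sorted_foo = list(range(0, 20000, 2))
quartile_values = [sorted_foo[offset] for offset in offsets]
```

Only three entities are read back this way, although the datastore still has to skip over the preceding entries, which is where the linear cost quoted above comes from.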

Should I denormalize properties to reduce the number of indexes required by App Engine?

One of my queries can take a lot of different filters and sort orders depending on user input. This generates a huge index.yaml file of 50+ indexes.
I'm thinking of denormalizing many of my boolean and multi-choice (string) properties into a single string list property. This way, I will reduce the number of query combinations because most queries will simply add a filter to the string list property, and my index count should decrease dramatically.
It will surely increase my storage size, but this isn't really an issue as I won't have that much data.
Does this sound like a good idea or are there any other drawbacks with this approach?
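To make the proposal concrete, here is a minimal sketch of folding boolean and multi-choice properties into a single list of strings. It is plain Python, not Datastore code, and the property names, the `name:value` tag convention and both helper functions are assumptions made up for this example.

```python
def to_tags(entity, boolean_props, choice_props):
    """Fold boolean and multi-choice properties into one list of string tags."""
    # A true boolean becomes a tag named after the property.
    tags = [name for name in boolean_props if entity.get(name)]
    # A choice property becomes a "name:value" tag.
    tags += [
        f"{name}:{entity[name]}"
        for name in choice_props
        if entity.get(name) is not None
    ]
    return sorted(tags)


def matches_all(tags, required):
    """Emulate a query that adds one equality filter per required tag."""
    return set(required) <= set(tags)


listing = {"has_pool": True, "has_garage": False, "kind": "house"}
tags = to_tags(listing, ["has_pool", "has_garage"], ["kind"])
```

With this layout, every user-selected filter turns into another equality filter on the same list property rather than a filter on a different property.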
As always, this depends on how you want to query your entities. For most of the sorts of queries you could execute against a list of properties like this, App Engine will already include an automatically built index, which you don't have to specify in index.yaml. Likewise, most of the queries you'd want to execute that require a composite index either couldn't be done with a list property or would require an 'exploding' index on that list property.
If you tell us more about the sort of queries you typically run on this object, we can give you more specific advice.
Denormalizing your data to cut back on the number of indices sounds like a good tradeoff. Reducing the number of indices you need means fewer indices to update (though your one index will receive more updates); it is unclear how this will affect performance on GAE. Size will of course be larger if you leave the original fields in place (since you're copying data into the string list property), but this might not be too significant unless your entity was quite large already.
This is complicated a little bit since the index on the list will contain one entry for each element in the list on each entity (rather than just one entry per entity). This will certainly impact space, and query performance. Also, be wary of creating an index which contains multiple list properties or you could run into a problem with exploding indices (multiple list properties => one index entry for each combination of values from each list).
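The growth described here can be sketched by enumerating the value combinations a composite index would need for one entity. This is a simplified model in plain Python, not the Datastore's actual accounting, and the entity and property names are hypothetical.

```python
from itertools import product


def composite_index_entries(entity, indexed_properties):
    """List the value combinations one entity contributes to a composite index."""
    value_lists = []
    for name in indexed_properties:
        value = entity[name]
        # A single-valued property behaves like a list with one element.
        value_lists.append(value if isinstance(value, list) else [value])
    return list(product(*value_lists))


entity = {
    "tags": ["red", "large", "sale"],
    "colors": ["crimson", "scarlet"],
    "price": 10,
}

one_list = composite_index_entries(entity, ["tags", "price"])
two_lists = composite_index_entries(entity, ["tags", "colors", "price"])
```

With one list property the entry count equals the list length; with two list properties it is the product of both lengths, which is what makes such indexes "explode".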
Try experimenting and see how it works in practice for you (use AppStats!).
"It will surely increase my storage size, but this isn't really an issue as I won't have that much data."
If this is true then you have no reason to denormalize.
