I want to test a new entity I created in Google Datastore. I'm trying to run a GQL query with an inequality to retrieve some entities in the Datastore web interface:
SELECT * FROM UserMatchingIndex WHERE age < 25 AND wants_male = false AND wants_dating = false AND wants_friendship = false AND wants_age = 20
But I always get an error: "GQL query error: Your Datastore does not have the composite index (developer-supplied) required for this query." even though I have defined the required composite indexes!
UserMatchingIndex: age ▲ wants_male ▲ wants_dating ▲ wants_friendship ▲ wants_age ▲ Serving
UserMatchingIndex: age ▲ wants_female ▲ wants_dating ▲ wants_friendship ▲ wants_age ▲ Serving
These are defined as follows in index.yaml:
- kind: UserMatchingIndex
  ancestor: no
  properties:
  - name: age
  - name: wants_male
  - name: wants_dating
  - name: wants_friendship
  - name: wants_age
- kind: UserMatchingIndex
  ancestor: no
  properties:
  - name: age
  - name: wants_female
  - name: wants_dating
  - name: wants_friendship
  - name: wants_age
I really don't see what could possibly be wrong... I've done this many times for other entities. Any clues are welcome.
This issue was submitted to Google Cloud support and it seems to be an issue on their side, or at least a restriction that is not documented yet. Changing the order of the properties as follows:
- kind: UserMatchingIndex
  ancestor: no
  properties:
  - name: wants_male
  - name: wants_dating
  - name: wants_friendship
  - name: wants_age
  - name: age
makes the query work.
EDIT: Answer from Google Support
"The rows of an index table are sorted first by ancestor and then by property values, in the order specified in the index definition. The perfect index for a query, which allows the query to be executed most efficiently, is defined on the following properties, in order:
Properties used in equality filters
Property used in an inequality filter (of which there can be no more than one)
Properties used in sort orders
This ensures that all results for every possible execution of the query appear in consecutive rows of the table."
This restriction should be documented soon.
For a Facebook-style social networking app, a high-performing database structure is required for storing data in Firebase (Cloud Firestore, NoSQL).
Data to be stored:
- User info (name, email, etc.)
- Friends
- Posts
- Comments on posts.
I am unsure which of the following two DB structures gives better query performance (if the database becomes huge).
(Ref: C_xxx is a collection, D_xxx is a document)
Structure 1
C_AllData
- D_UserID-1
    name: xxxx,
    email: yyy,
    friends: [UserID-3, UserID-4]
  - C_Posts
    - D_PostId-1
        Text: hhh
        Date: zzz
      - C_Comments
        - D_CommentId-1
            UserID: 3
            Text: kkk
        - D_CommentId-2
            UserID: 4
            Text: kkk
    - D_PostId-2
        Text: hhh
        Date: zzz
      - C_Comments
        - D_CommentId-3
            UserID: 3
            Text: kkk
        - D_CommentId-4
            UserID: 4
            Text: kkk
- D_UserID-2
    name: xxxx,
    email: yyy
    friends: [UserID-5, UserID-7]
  - C_Posts
    - D_PostId-3
        Text: hhh
        Date: zzz
      - C_Comments
        - D_CommentId-5
            UserID: 5
            Text: kkk
        - D_CommentId-6
            UserID: 7
            Text: kkk
Structure 2
C_AllUsers
- D_UserID-1
    name: xxxx,
    email: yyy
    friends: [UserID-3, UserID-4]
- D_UserID-2
    name: xxxx,
    email: yyy
    friends: [UserID-5, UserID-7]

C_AllPosts
- D_PostId-1
    UserID: 1
    Text: hhh
    Date: zzz
  - C_Comments
    - D_CommentId-1
        UserID: 3
        Text: kkk
    - D_CommentId-2
        UserID: 4
        Text: kkk
- D_PostId-3
    UserID: 2
    Text: hhh
    Date: zzz
  - C_Comments
    - D_CommentId-5
        UserID: 5
        Text: kkk
    - D_CommentId-6
        UserID: 7
        Text: kkk
My question is: what are the pros and cons of the two approaches?
Some points I could think of are below; please correct me if I am wrong.
Structure 1 :
Is getting all the posts of a given user faster in Structure 1, since we are pinpointing the exact collection (AllData/{UserID}/Posts/)?
Since the entire DB is under one collection, does scalability suffer?
Structure 2 :
Divided DB -> better scalability?
Divided DB -> better performance?
Less nesting -> better performance?
AllPosts under one collection -> slow querying?
Or if you can suggest a better model, that would be great too.
In Firebase a rule of thumb is to keep separate entity types in separate branches. This is especially important because:
(Note: Here firebase is firebase realtime database)
Firebase always loads complete nodes, and
once you grant a user read access to a node, they have access to all data under that node.
For example, in your first data structure, to load a list of friends you will have to load all the posts of all friends, and all comments on all those posts too. That's a lot more data than is strictly needed if all you want to do is show a list of the friends' names.
In your second data structure you are one step closer, since now you can first load the friends' names and only then load their posts.
But even in that structure you have the same problem. If you want to display the list of post titles for a friend (or for all friends), you are going to have to load the entire posts and all their comments. That is again way more data than is needed to show a list of post titles. So you'll definitely want to store the comments in a separate top-level list too, using the same post key to identify and group them.
C_AllPosts
- D_PostId-1
    UserID: 1
    Text: hhh
    Date: zzz
- D_PostId-3
    UserID: 2
    Text: hhh
    Date: zzz

C_AllComments
- D_PostId-1
  - D_CommentId-1
      UserID: 3
      Text: kkk
  - D_CommentId-2
      UserID: 4
      Text: kkk
- D_PostId-3
  - D_CommentId-5
      UserID: 5
      Text: kkk
  - D_CommentId-6
      UserID: 7
      Text: kkk
Now if you want to display a post and its comments, you will have to read two nodes. If you do this for multiple posts, you'll end up with a lot of reads, essentially performing the NoSQL equivalent of a SQL JOIN. This is quite normal: it's a client-side join, and it is not nearly as slow as you may think, because Firebase pipelines the requests.
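To make the read pattern concrete, here is a minimal sketch of that client-side join, using plain Python dicts as stand-ins for the two top-level nodes above (the actual Firebase SDK calls are omitted; node and field names mirror the example):

```python
# In-memory stand-ins for the two top-level nodes from the structure above.
all_posts = {
    "PostId-1": {"UserID": 1, "Text": "hhh", "Date": "zzz"},
    "PostId-3": {"UserID": 2, "Text": "hhh", "Date": "zzz"},
}
all_comments = {
    "PostId-1": {
        "CommentId-1": {"UserID": 3, "Text": "kkk"},
        "CommentId-2": {"UserID": 4, "Text": "kkk"},
    },
    "PostId-3": {
        "CommentId-5": {"UserID": 5, "Text": "kkk"},
        "CommentId-6": {"UserID": 7, "Text": "kkk"},
    },
}

def load_post_with_comments(post_id):
    """Client-side join: one read per node, keyed by the shared post id."""
    post = dict(all_posts[post_id])                   # first read: the post
    post["Comments"] = all_comments.get(post_id, {})  # second read: its comments
    return post

post = load_post_with_comments("PostId-1")
print(post["Text"], len(post["Comments"]))  # the post plus its 2 comments
```

With a real SDK the two lookups become two network reads, but since they are independent they can be pipelined rather than issued strictly one after the other.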
For some more introduction on this type of data modeling, I recommend:
this article on NoSQL data modeling
the Firebase blog post Denormalization is normal
this video series Firebase for SQL developers
And these answers to previous questions:
Many to Many relationship in Firebase
How would you model a collection of users and friends in Firebase?
Firebase data structure and url
https://stackoverflow.com/questions/16421179/whats-the-best-way-of-structuring-data-on-firebase/16423051#16423051
https://stackoverflow.com/questions/30693785/how-to-write-denormalized-data-in-firebase/30699277#30699277
https://stackoverflow.com/questions/43830610/how-to-denormalize-normalize-data-structure-for-firebase-realtime-database/43832677#43832677
I was experimenting with datastore indexes, and I noticed that I can order the properties in an index multiple ways:
IsItemActive ▲ + Rating ▲
- or -
Rating ▲ + IsItemActive ▲
What is the difference between the two indexes above? One allows me to query SELECT * FROM Items WHERE Rating > 3 AND IsItemActive = FALSE but the other does not.
Datastore relies very heavily on the ordering of index properties in order to enforce its rule that every query must scale with the size of the result set.
In order to answer a query, all the results for that query must appear sequentially in the index.
So, consider the two indexes:
Index(IsItemActive, Rating)
Item(Rating=3, IsItemActive=False) <----
Item(Rating=4, IsItemActive=False) <----
Item(Rating=3, IsItemActive=True)
Item(Rating=4, IsItemActive=True)
Item(Rating=5, IsItemActive=True)
Index(Rating, IsItemActive)
Item(Rating=3, IsItemActive=False) <----
Item(Rating=3, IsItemActive=True)
Item(Rating=4, IsItemActive=False) <----
Item(Rating=4, IsItemActive=True)
Item(Rating=5, IsItemActive=True)
In order for your query SELECT * FROM Items WHERE Rating > 3 AND IsItemActive = FALSE to have all results next to each other, it must use the Index(IsItemActive, Rating) index. The other index does not have all the results you need next to each other.
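One way to see this is to sort the same entities under each index ordering and check whether the matching rows sit next to each other. A small sketch (the five entities from the tables above, plus one extra Item(Rating=5, IsItemActive=False) added so the difference is visible):

```python
# Entities to be indexed; one extra inactive Rating=5 item added for illustration.
items = [
    {"Rating": 3, "IsItemActive": False},
    {"Rating": 4, "IsItemActive": False},
    {"Rating": 5, "IsItemActive": False},
    {"Rating": 3, "IsItemActive": True},
    {"Rating": 4, "IsItemActive": True},
    {"Rating": 5, "IsItemActive": True},
]

def index_rows(order):
    # An index is just the entities sorted by the listed properties, in order.
    return sorted(items, key=lambda e: tuple(e[p] for p in order))

def matches(e):
    # WHERE Rating > 3 AND IsItemActive = FALSE
    return e["Rating"] > 3 and e["IsItemActive"] is False

def results_are_contiguous(order):
    flags = [matches(e) for e in index_rows(order)]
    first = flags.index(True)
    last = len(flags) - 1 - flags[::-1].index(True)
    return all(flags[first:last + 1])

print(results_are_contiguous(["IsItemActive", "Rating"]))  # True: usable index
print(results_are_contiguous(["Rating", "IsItemActive"]))  # False: results split
```

With the equality property first, the inequality selects a contiguous suffix within the IsItemActive=False block; with the inequality property first, non-matching rows are interleaved between the matches.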
Here is an article about how index selection works. Also, I would highly recommend the Google I/O talk (2008) on how Datastore works under the covers.
I'm working with facets in Solr and I have the concept of facet groups that each contain a number of facets.
Say I have a structure like this
Product Type
- Chairs (50)
- Tables (20)
- Mirrors (5)
Color
- Yellow (5)
- Black (50)
- Red (10)
- Orange (10)
I have an OR relationship between facets within a facet group and an AND relationship between the groups.
So if I choose Chairs as a facet I get 50 products. Using the standard faceting in Solr (and assuming that each product can have exactly one product type and one color) it will now give:
Product Type
- Chairs (50)
- Tables (0)
- Mirrors (0)
Color
- Yellow (5)
- Black (30)
- Red (5)
- Orange (10)
However, what I really want is that the facet counts within Product Type stay the same as that would reflect what would happen if one of them was chosen.
Can this be done with Solr in one query?
This can be implemented using tagged filters, which are then excluded when computing the facet counts.
From the referenced page:
To implement a multi-select facet for doctype, a GUI may want to still display the other doctype values and their associated counts, as if the doctype:pdf constraint had not yet been applied. Example:
=== Document Type ===
[ ] Word (42)
[x] PDF (96)
[ ] Excel(11)
[ ] HTML (63)
To return counts for doctype values that are currently not selected, tag filters that directly constrain doctype, and exclude those filters when faceting on doctype.
q=mainquery&fq=status:public&fq={!tag=dt}doctype:pdf&facet=on&facet.field={!ex=dt}doctype
Filter exclusion is supported for all types of facets. Both the tag and ex local params may specify multiple values by separating them with commas.
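For reference, the same request can be assembled programmatically. A small sketch that builds the parameter list for a multi-select facet (the field name, filter value, and q/fq values are the ones from the quoted example):

```python
# Build Solr parameters for multi-select faceting: tag the doctype filter,
# then exclude that tag when faceting on the same field.
def multi_select_params(field, selected, tag="dt"):
    return [
        ("q", "mainquery"),
        ("fq", "status:public"),
        ("fq", "{!tag=%s}%s:%s" % (tag, field, selected)),
        ("facet", "on"),
        ("facet.field", "{!ex=%s}%s" % (tag, field)),
    ]

params = multi_select_params("doctype", "pdf")
# Reconstructs the query string from the example above (URL-encoding omitted):
print("&".join("%s=%s" % kv for kv in params))
```

In a real client you would pass these pairs to your HTTP library so the special characters in the local params get URL-encoded properly.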
At what rate do indexes "explode" in GAE's big table?
The excerpt from their documentation below explains that for collection values, indexes can "explode" exponentially.
Does this mean that for an object with two collection values, there is an index entry for each subset of values in the first collection paired with each subset in the second collection? Or is there only an index entry for each possible pair of values?
Example:
Entity:
widget: {
    mamas_list: ['cookies', 'puppies'],
    papas_list: ['rain', 'sun']
}
Index entry for each subset of values in the first collection paired with each subset in the second collection:
cookies rain
cookies puppies rain
cookies puppies rain sun
cookies sun
cookies rain sun
puppies rain
puppies sun
puppies rain sun
Only an index entry for each possible pair of values:
cookies rain
cookies sun
puppies rain
puppies sun
Exploding indexes excerpt:
Source: https://developers.google.com/appengine/docs/python/datastore/indexes#Index_Limits
an entity that can have multiple values for the same property requires
a separate index entry for each value; again, if the number of
possible values is large, such an entity can exceed the entry limit.
The situation becomes worse in the case of entities with multiple
properties, each of which can take on multiple values. To accommodate
such an entity, the index must include an entry for every possible
combination of property values. Custom indexes that refer to multiple
properties, each with multiple values, can "explode" combinatorially,
requiring large numbers of entries for an entity with only a
relatively small number of possible property values.
Chris,
You'll only have an 'exploding index' problem when you explicitly add an index.yaml entry that covers multiple repeated properties, and the objects saved have many values in those properties.
In the example, does your index.yaml add this index?
- kind: widget
  properties:
  - name: mamas_list
  - name: papas_list
If you save the sample object to the datastore:
widget(mamas_list=['a', 'b'], papas_list=['c', 'd']).put()
there will be 4 different index entries saved:
['a', 'c'] ['a', 'd'] ['b', 'c'] ['b', 'd']
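The count is a plain cross product, one entry per combination of one value from each repeated property, which can be checked directly (a small sketch, not GAE code):

```python
from itertools import product

mamas_list = ['a', 'b']
papas_list = ['c', 'd']

# One index entry per combination of one value from each repeated property.
entries = [list(combo) for combo in product(mamas_list, papas_list)]
print(entries)       # [['a', 'c'], ['a', 'd'], ['b', 'c'], ['b', 'd']]
print(len(entries))  # 2 * 2 = 4 entries, not one per subset
```

So the growth is multiplicative in the number of values per property (len(a) * len(b) * ...), not exponential in subsets of the values.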
The whole purpose of adding this index is to allow querying by these two properties:
widget.query().filter(widget.mamas_list == 'a').filter(widget.papas_list == 'd').fetch()
You can often avoid an exploding index (not present in this sample case) by relying on zig-zag merge join queries:
http://www.google.com/events/io/2010/sessions/next-gen-queries-appengine.html
I have a products table in my database, and a table with features of these products. The features table has 3 columns: id, type and value. id is a foreign key from products.
An example of the data in my tables:
Table Products:

ID | Description
01 | Computer A
02 | Car
03 | Computer B

Table Features:

ID | Type      | Value
01 | Processor | Phenom X3
01 | Memory    | 2GB
01 | HDD       | 500GB
02 | Color     | Blue
02 | Mark      | Ford
03 | Processor | Phenom X3
03 | Memory    | 3GB
I want the best way to index this so that, for example, when someone searches for "computer", the faceting shows:
Phenom X3 (2)
Memory 2GB (1)
Memory 3GB (1)
HDD 500GB (1)
And so on, related to the query string. If I make a query with the string "processor", it will list Phenom X3 (1) only if those products (with "processor" in the description) have a feature like Processor: Phenom X3. There are a lot of product types, so we can't create static columns for all features and pass them to Solr…
I hope my question is clear, thanks in advance!
Use the DataImportHandler to index the data: http://wiki.apache.org/solr/DataImportHandler
You can define the products table as the main entity and features as a sub-entity, so that each product with its features is indexed as a single document.
For indexing:
- Define the description field as indexed="true".
- As you want to facet on type and value, define a new field type_value of type string and concatenate the type and value columns in data-config.xml.
- type_value will be a multivalued field.
For searching:
- Make the product description field searchable, e.g. q=description:computers
- You can configure this in solrconfig.xml with proper field boosts.
- Facet on the new field with facet.field=type_value
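To make the sub-entity idea concrete, a possible data-config.xml could look like this (a sketch only: the JDBC connection details are placeholders, and the CONCAT syntax shown is MySQL's, to be adapted to your database):

```xml
<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb" user="user" password="pass"/>
  <document>
    <!-- Main entity: one Solr document per product -->
    <entity name="product" query="SELECT id, description FROM products">
      <field column="id" name="id"/>
      <field column="description" name="description"/>
      <!-- Sub-entity: concatenate type and value into one facetable field -->
      <entity name="feature"
              query="SELECT CONCAT(type, ' ', value) AS type_value
                     FROM features WHERE id = '${product.id}'">
        <field column="type_value" name="type_value"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Since a product has several features, remember to declare type_value as a string field with multiValued="true" in schema.xml.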
I hope this gives a fair idea.