At what rate do indexes "explode" in GAE's big table?
The excerpt from their documentation below explains that for collection values, indexes can "explode" exponentially.
Does this mean that for an object with two collection values, there is an index entry for each subset of values in the first collection paired with each subset in the second collection? Or is there only an index entry for each possible pair of values?
Example:
Entity:
widget:{
mamas_list: ['cookies', 'puppies']
papas_list: ['rain', 'sun']
}
Index entry for each subset of values in the first collection paired with each subset in the second collection:
cookies rain
cookies puppies rain
cookies puppies rain sun
cookies sun
cookies rain sun
puppies rain
puppies sun
puppies rain sun
Only an index entry for each possible pair of values:
cookies rain
cookies sun
puppies rain
puppies sun
Exploding indexes excerpt:
Source: https://developers.google.com/appengine/docs/python/datastore/indexes#Index_Limits
an entity that can have multiple values for the same property requires
a separate index entry for each value; again, if the number of
possible values is large, such an entity can exceed the entry limit.
The situation becomes worse in the case of entities with multiple
properties, each of which can take on multiple values. To accommodate
such an entity, the index must include an entry for every possible
combination of property values. Custom indexes that refer to multiple
properties, each with multiple values, can "explode" combinatorially,
requiring large numbers of entries for an entity with only a
relatively small number of possible property values.
Chris,
You'll only have an 'exploding index problem' if you explicitly add an index.yaml entry that covers multiple repeated properties, and the entities saved to the table hold many values in those properties.
In the example, does your index.yaml add this index?
- kind: widget
properties:
- name: mamas_list
- name: papas_list
If you save the sample object to the datastore:
widget(mamas_list=['a', 'b'], papas_list=['c', 'd']).put()
There will be 4 different index entries saved:
['a', 'c'] ['a', 'd'] ['b', 'c'] ['b', 'd']
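To make the count concrete, here is a minimal sketch that models the entries of a composite index over repeated properties as the Cartesian product of their values. It only illustrates how the number of entries grows; the function name `composite_index_entries` is hypothetical and this is not how the datastore stores anything internally.

```python
from itertools import product


def composite_index_entries(*value_lists):
    """Return one tuple per combination, taking one value from each list."""
    return list(product(*value_lists))


# The widget from the question: 2 values x 2 values -> 4 entries.
entries = composite_index_entries(['cookies', 'puppies'], ['rain', 'sun'])
for entry in entries:
    print(entry)
print(len(entries))

# The count is the product of the list lengths, so it grows quickly:
# three repeated properties with 10 values each give 10 * 10 * 10 entries.
print(len(composite_index_entries(range(10), range(10), range(10))))
```

So the answer to the question is pairs of values, not pairs of subsets: the growth is multiplicative in the list lengths.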
The whole purpose of adding this index would be to allow querying by these 2 properties:
widget.query().filter(widget.mamas_list == 'a').filter(widget.papas_list == 'd').fetch()
You could always avoid an exploding index (which does not occur in this sample case) by relying on the zig-zag merge join algorithm instead of a custom index:
http://www.google.com/events/io/2010/sessions/next-gen-queries-appengine.html
Related
I have an index for songs, and each song has a collection of customers.
Filtering for songs that contain a specific customer works great, but the result includes all the customers for each song.
I would need to get the song with only the customer I'm filtering in the collection.
My index is something like:
{
SongId: 1,
Title: "My song",
Artist: "Artist",
Customers: [
{
CustomerId: 1,
...more customer data
}
{
CustomerId: 2,
...more customer data
}
]
}
I need to get the song, filtering by title, and get only customer 1's data, or no customer at all if customer 1 is not in the collection for that song.
Is that possible?
This is not possible today. You'd have to filter out the complex collection elements on the client-side.
Also, you didn't mention this in your question, but it seems the cardinality of the relationship from songs to customers could be quite high. If that's the case, you should consider a different data model, because there is a hard limit on the number of elements you can have in complex collections per document. Even without that limit, there are practical limits to complex collections, for example around document size and the inability to incrementally update them during indexing.
If your model will have at most a few hundred customers per song, you're probably fine, but if it's in the thousands or higher, you should reconsider your design.
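As an illustration of the client-side filtering mentioned in the answer, here is a minimal sketch that trims the Customers collection of a returned document down to one customer. It assumes the search result has already been deserialized into a plain dict shaped like the document in the question; the helper name `keep_only_customer` is hypothetical.

```python
def keep_only_customer(song, customer_id):
    """Return a copy of the song whose Customers list holds only customer_id.

    The list is empty when that customer is not attached to the song.
    """
    filtered = dict(song)  # shallow copy, the original result is left untouched
    filtered["Customers"] = [
        c for c in song.get("Customers", []) if c.get("CustomerId") == customer_id
    ]
    return filtered


song = {
    "SongId": 1,
    "Title": "My song",
    "Artist": "Artist",
    "Customers": [{"CustomerId": 1}, {"CustomerId": 2}],
}
print(keep_only_customer(song, 1)["Customers"])
print(keep_only_customer(song, 3)["Customers"])
```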
I have an array of key value pairs. A date and a quantity, but many have the same date, and I need to combine the quantities for each specific date. I've tried a million different for loops and pushing to an array in every way I can think of using vanilla JS, then moved to lodash and still no avail. This is what I currently have.
`data = [
[February 3rd 2020: 1]
[February 4th 2020: 1]
[February 3rd 2020: 3]
[February 5th 2020: 2]
]`
I can get an array of unique dates, and an array of unique values, but I'm stuck when it comes to getting the values combined in any form so that the quantities for each date can be manipulated into simple arithmetic... sum, average...
Appreciate it.
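One way to approach this is a single pass that groups the quantities under their date, after which sums and averages are simple. Below is a minimal sketch in Python; it assumes the data can be read as (date, quantity) pairs, which is a guess at the intended structure, and the same grouping carries over to a JavaScript object or Map.

```python
from collections import defaultdict

# Assumed shape of the data: one (date, quantity) pair per entry.
data = [
    ("February 3rd 2020", 1),
    ("February 4th 2020", 1),
    ("February 3rd 2020", 3),
    ("February 5th 2020", 2),
]


def group_quantities(pairs):
    """Collect every quantity under its date, keeping first-seen date order."""
    grouped = defaultdict(list)
    for date, quantity in pairs:
        grouped[date].append(quantity)
    return dict(grouped)


grouped = group_quantities(data)
totals = {date: sum(q) for date, q in grouped.items()}
averages = {date: sum(q) / len(q) for date, q in grouped.items()}
print(totals)
print(averages)
```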
I have two Apache Solr collections: the first contains information from the past year, and the second contains older information, up to one year ago.
My problem is running a sorted search across the two collections.
For example, I want to search for data from 300 to 400 days ago in sorted order, and I do not know the most accurate and fastest way to do it.
Create a Collection alias that spans both collections - i.e. give both collection names in the collections list when creating the alias:
/admin/collections?action=CREATEALIAS&name=name&collections=uptopastyear,pastyeartonow
You can then query this collection with a regular range filter:
&fq=datetime:[<timestamp 400 days ago> TO <timestamp 300 days ago>]
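As a rough sketch of what the full request could look like, the snippet below builds a select URL against the alias, using Solr date math (`NOW-400DAYS`) for the range and a `sort` parameter for the ordering. The base URL, the alias name `name`, and the helper `build_range_query` are assumptions for illustration; the `datetime` field name is taken from the filter above.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_range_query(base_url, alias, field="datetime",
                      oldest_days=400, newest_days=300):
    """Build a sorted range query URL against a Solr collection alias."""
    params = {
        "q": "*:*",
        # Solr date math: from oldest_days ago up to newest_days ago.
        "fq": f"{field}:[NOW-{oldest_days}DAYS TO NOW-{newest_days}DAYS]",
        "sort": f"{field} asc",
    }
    return f"{base_url}/{alias}/select?{urlencode(params)}"


# Hypothetical host and alias name.
url = build_range_query("http://localhost:8983/solr", "name")
print(url)
# Decode the query string again to show the parameters Solr would receive.
print(parse_qs(urlparse(url).query))
```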
I've got a business case where I need to check whether a search query is about businesses
eg: q="night clubs new york"
I've got a list of countries, states, cities and regions in my database (3 million+ records), and I've got a list of business categories.
All I want to do is check whether the query contains a business category (night clubs) and the name of a city, state or country (new york). So I'm checking the number of results returned for the query below. If I get 2 numResults, then this is a business query, and I then query my Solr index to search for businesses.
query: places_ss:(night clubs new york) OR categories_ss:(night clubs new york)
Speed question: how should I store the list of cities, states and countries in Solr to get maximum search speed?
- Have one document with id:places and add the distinct cities, states and countries to one array, places_ss?
- Have multiple documents with different ids, each holding 100,000 place names in an array?
- Have one or more documents with a place_s string (not an array), with the places separated by spaces and each space inside a place name replaced by an underscore, e.g. new york becomes new_york?
And during query time I will get multiple combinations of night clubs new york
eg: night night_clubs night_clubs_new night_clubs_new_york clubs_new clubs_new_york new_york york and query for place.
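For reference, the combinations described above are the contiguous word runs of the query joined by underscores. A minimal sketch of generating them, assuming whitespace tokenization and lowercasing (the function name `underscore_shingles` is hypothetical):

```python
def underscore_shingles(query):
    """Return every contiguous run of words in the query, joined by '_'."""
    words = query.lower().split()
    return [
        "_".join(words[start:end])
        for start in range(len(words))
        for end in range(start + 1, len(words) + 1)
    ]


candidates = underscore_shingles("night clubs new york")
print(candidates)
print(len(candidates))
```

A query of n words yields n * (n + 1) / 2 candidates, so the list stays small for typical search phrases.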
Would it be a good idea to have a separate core just for above place documents to increase speed ?
Is this a good solution ?
Document organisation:
It is better to use a document approach with separate fields for:
- location
- activity
- anything else you need
location
You should store your location like this:
country:state:city:suburb... so that you can search for usa:new york:new york*
or ::new york
No need for _
Avoid the underscores; they are not needed.
activity
The activity should be stored in a separate field, for search precision and speed.
I am looking for a way to retrieve the "surrounding" rows in a NHibernate query given a primary key and a sort order?
E.g. I have a table with log entries and I want to display the entry with primary key 4242 and the previous 5 entries as well as the following 5 entries ordered by date (there is no direct relation between date and primary key). Such a query should return 11 rows in total (as long as we are not close to either end).
The log entry table can be huge and retrieving all to figure it out is not possible.
Is there such a concept as a row number that can be used from within NHibernate? The underlying database is going to be either SQLite or Microsoft SQL Server.
Edited Added sample
Imagine data such as the following:
Id Time
4237 10:00
4238 10:00
1236 10:01
1237 10:01
1238 10:02
4239 10:03
4240 10:04
4241 10:04
4242 10:04 <-- requested "center" row
4243 10:04
4244 10:05
4245 10:06
4246 10:07
4247 10:08
When requesting the entry with primary key 4242 we should get the rows 1237, 1238 and 4239 to 4247. The order is by Time, Id.
Is it possible to retrieve the entries in a single query (which obviously can include subqueries)? Time is a non-unique column, so several entries have the same value, and in this example it is not possible to change the resolution in a way that makes it unique!
"there is no direct relation between date and primary key" means, that the primary keys are not in a sequential order?
Then I would do it like this:
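// Load the requested "center" row first
Item middleItem = Session.Get<Item>(id);
// The 5 entries at or before the center row's time, newest first
IList<Item> previousFiveItems = Session.CreateCriteria(typeof(Item))
    .Add(Expression.Le("Time", middleItem.Time))
    .AddOrder(Order.Desc("Time"))
    .SetMaxResults(5)
    .List<Item>();
// The 5 entries after the center row's time, oldest first
IList<Item> nextFiveItems = Session.CreateCriteria(typeof(Item))
    .Add(Expression.Gt("Time", middleItem.Time))
    .AddOrder(Order.Asc("Time"))
    .SetMaxResults(5)
    .List<Item>();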
There is the risk of having several items with the same time.
Edit
This should work now.
Item middleItem = Session.Get<Item>(id);
IList<Item> previousFiveItems = Session.CreateCriteria(typeof(Item))
    .Add(Expression.Le("Time", middleItem.Time)) // less or equal
    .Add(Expression.Not(Expression.IdEq(middleItem.Id))) // but not the middle
    .AddOrder(Order.Desc("Time"))
    .SetMaxResults(5)
    .List<Item>();
IList<Item> nextFiveItems = Session.CreateCriteria(typeof(Item))
    .Add(Expression.Gt("Time", middleItem.Time)) // greater
    .AddOrder(Order.Asc("Time"))
    .SetMaxResults(5)
    .List<Item>();
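The tie risk mentioned above (several items with the same Time) can be handled by comparing on the full sort key (Time, Id) rather than on Time alone. The following is a minimal sketch of that idea in plain SQL, run through Python's sqlite3 module against the sample data from the question. It is not NHibernate code, it uses three small queries rather than one, and the table name `log` and function name `surrounding_ids` are assumptions.

```python
import sqlite3

ROWS = [
    (4237, "10:00"), (4238, "10:00"), (1236, "10:01"), (1237, "10:01"),
    (1238, "10:02"), (4239, "10:03"), (4240, "10:04"), (4241, "10:04"),
    (4242, "10:04"), (4243, "10:04"), (4244, "10:05"), (4245, "10:06"),
    (4246, "10:07"), (4247, "10:08"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (Id INTEGER PRIMARY KEY, Time TEXT NOT NULL)")
conn.executemany("INSERT INTO log (Id, Time) VALUES (?, ?)", ROWS)


def surrounding_ids(conn, center_id, n=5):
    """Return the ids of up to n rows before and after center_id, ordered by (Time, Id)."""
    row = conn.execute("SELECT Time FROM log WHERE Id = ?", (center_id,)).fetchone()
    if row is None:
        return []
    t = row[0]
    # Rows strictly before the center in (Time, Id) order, nearest first.
    before = conn.execute(
        "SELECT Id FROM log WHERE Time < ? OR (Time = ? AND Id < ?) "
        "ORDER BY Time DESC, Id DESC LIMIT ?",
        (t, t, center_id, n),
    ).fetchall()
    # Rows strictly after the center in (Time, Id) order, nearest first.
    after = conn.execute(
        "SELECT Id FROM log WHERE Time > ? OR (Time = ? AND Id > ?) "
        "ORDER BY Time ASC, Id ASC LIMIT ?",
        (t, t, center_id, n),
    ).fetchall()
    return [r[0] for r in reversed(before)] + [center_id] + [r[0] for r in after]


print(surrounding_ids(conn, 4242))
```

Because Id breaks ties between rows with equal Time, rows sharing the center row's timestamp fall on exactly one side of it and are never counted twice or skipped.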
This should be relatively easy with NHibernate's Criteria API:
IList<LogEntry> logEntries = session.CreateCriteria(typeof(LogEntry))
    .Add(Expression.InG<int>(Projections.Property("Id"), listOfIds))
    .AddOrder(Order.Desc("EntryDate"))
    .List<LogEntry>();
Here your listOfIds is just a strongly typed list of integers representing the ids of the entries you want to retrieve (integers 4242-5 through 4242+5 ).
Of course you could also add Expressions that let you retrieve Ids greater than 4242-5 and smaller than 4242+5.
Stefan's solution definitely works, but a better way exists using a single select and nested subqueries:
ICriteria crit = NHibernateSession.CreateCriteria(typeof(Item));
// Subquery: the Time of the requested "center" row
DetachedCriteria dcMiddleTime =
    DetachedCriteria.For(typeof(Item)).SetProjection(Property.ForName("Time"))
        .Add(Restrictions.Eq("Id", id));
// Subquery: ids of the 5 rows after the center row, ordered so the limit keeps the nearest ones
DetachedCriteria dcAfterTime =
    DetachedCriteria.For(typeof(Item)).SetMaxResults(5).SetProjection(Property.ForName("Id"))
        .Add(Subqueries.PropertyGt("Time", dcMiddleTime))
        .AddOrder(Order.Asc("Time"));
// Subquery: ids of the 5 rows before the center row, nearest first
DetachedCriteria dcBeforeTime =
    DetachedCriteria.For(typeof(Item)).SetMaxResults(5).SetProjection(Property.ForName("Id"))
        .Add(Subqueries.PropertyLt("Time", dcMiddleTime))
        .AddOrder(Order.Desc("Time"));
crit.AddOrder(Order.Asc("Time"));
// The center row itself, or any row selected by either subquery
crit.Add(Restrictions.Eq("Id", id) || Subqueries.PropertyIn("Id", dcAfterTime) ||
    Subqueries.PropertyIn("Id", dcBeforeTime));
return crit.List<Item>();
This is NHibernate 2.0 syntax but the same holds true for earlier versions where instead of Restrictions you use Expression.
I have tested this on a test application and it works as advertised