Query with filter on a Key field and another (string) field very slow on App Engine Datastore

I have an entity Kind A with around 1.3 million entities. It has several fields, two of which I use as filters in a Datastore query - K (which is a Key) and S (which is a string).
I perform the following query, and it takes almost 8 seconds to complete:
SELECT * FROM A WHERE K=KEY('...') AND S='...' LIMIT 1
(I do this by constructing a query with filters in Java, but it is just as slow in GQL.)
Is this reasonable? Can this be improved in any way?
Replacing the Key filter with filters on other fields (or removing it completely) speeds things up. Does it make sense to store the key field as a string and query on that instead of on the original Key-type field?
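For reference, this is roughly how such a query is built with the low-level Java Datastore API (a sketch only; the helper name and the way the Key is obtained are illustrative, not from the original question):

    import com.google.appengine.api.datastore.*;
    import java.util.List;

    static Entity findOne(DatastoreService ds, Key k, String s) {
        // Equality filters on K (a Key property) and S (a string property).
        Query q = new Query("A").setFilter(Query.CompositeFilterOperator.and(
                new Query.FilterPredicate("K", Query.FilterOperator.EQUAL, k),
                new Query.FilterPredicate("S", Query.FilterOperator.EQUAL, s)));
        // LIMIT 1
        List<Entity> results = ds.prepare(q).asList(FetchOptions.Builder.withLimit(1));
        return results.isEmpty() ? null : results.get(0);
    }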

Can you please post your index.yaml file, or at least paste all the indexes for Kind A? My initial feeling is that you don't have a proper composite index for K and S, so the query is either doing a full scan or merging other indexes to produce the results. Check out https://developers.google.com/appengine/docs/python/datastore/indexes for an explanation.
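If that composite index is indeed missing, it would look something like this in index.yaml (a sketch using the property names from the question; for a Java app the equivalent entry goes in datastore-indexes.xml):

    indexes:
    - kind: A
      properties:
      - name: K
      - name: S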
If you're still banging your head, I would suggest turning Appstats on and diagnosing the slowness from the visuals on your local dev server (https://developers.google.com/appengine/docs/python/tools/appstats).
Finally, I wanted to point out that single-property lookups are fast in App Engine, so either a K= or an S= lookup alone would be fast. Together, they require special attention (indexing, memcache, etc.).

Related

Indexes and index entry limits in Google App Engine Datastore

I'm having some trouble understanding how indexes work in the GAE Datastore; in particular, something really unclear to me is the limits related to indexes.
From what I understand, one can create custom indexes in the datastore-indexes.xml file, and additionally the Datastore will generate some automatic indexes to match the user's queries.
A first question is: does the "Number of indexes" quota limit defined on the quotas page (https://cloud.google.com/appengine/docs/quotas#Datastore) refer only to the custom indexes defined in datastore-indexes.xml, or does it also apply to automatically generated indexes?
Another concept eluding me is the "index entry for a single query".
Assume I don't have multi-valued properties (i.e. no list properties) and I have some entities of kind "KindA". Then I define two groups of entity properties:
- Group1: properties with arbitrary names and boolean values
- Group2: properties with arbitrary names and double values
In my world, any KindA entity can have at most N properties of Group1 and N properties of Group2. For any property P an index table is created, and each entity having that P set adds a row to the P index table (right?). Thus initially any KindA entity will have 1 entry for each of its at most 2N properties (in total, at most 2N index entries per entity), right?
If this is correct, then it follows that I can only create an entity with a limited number of properties. However, this is strange, since I've always read that an entity can have unlimited properties... (leaving aside the size limit).
However, let's assume now that my application allows users to query for KindA entities using an arbitrarily long sequence of AND filters on properties of Group1 (the boolean ones). Thus one can query something like:
find all entities in KindA where prop1=true AND prop2=true ... AND propM = true
This is a situation in which the query contains only equality filters, and thus no custom indexes are required (https://cloud.google.com/appengine/docs/python/datastore/indexes#Index_configuration).
But what if I want to order the results using properties of Group2? In that case I need a custom index for every different query, right (different in terms of the combination of filtered property names)?
On my development server I tried without specifying any custom index, and GAE generated them for me (however, every time I restart, the previously generated indexes are removed). In this case, how many index entries does a single KindA entity have in a single query index? I'd say 1, because of what the GAE docs say:
The property may also be included in additional, custom indexes declared in your index configuration file (index.yaml). Provided that an entity has no list properties, it will have at most one entry in each such custom index (for non-ancestor indexes) or one for each of the entity's ancestors (for ancestor indexes)
Thus, in theory, if N is limited I'm safe with respect to the "Maximum number of index entries for an entity" limit (https://cloud.google.com/appengine/docs/java/datastore/#Java_Quotas_and_limits), right?
But what about receiving over 200 different queries? Does that lead GAE to automatically generate over 200 custom indexes (one per distinct query)? If yes, do these automatically generated indexes count toward the index number limit (which is 200)?
If yes, then it follows that I can't let users run these (IMHO very basic) queries. Am I misunderstanding something?
First of all, I found your question difficult to follow, so let me address it point by point.
The 200-index limit only counts the composite indexes you define (or that the dev app server defines for you automatically, based on the queries you run). The indexes created for your indexed properties alone are not counted toward this limit.
You are correct about the 2N automatic index entries created for every indexed property.
You can have any number of indexed properties in an entity as long as you don't exceed the 1MB limit per entity. But this really depends on the content of the stored properties.
For the built-in indexes created on your indexed properties, there is no hard limit, just an increasing cost: the write and storage cost of each entity put grows with every property you add.
When relying on automatic indexes, you are limited to a single sort order; more sort orders require a composite index (your custom index), and combining a sort order with an equality filter requires a custom index anyway.
So yes, in your example the dev app server will create a composite index for each query you execute. However, you can reduce these indexes manually by deleting the ones that are not needed: at query time, the planner can produce your results by merging different indexes, as explained here:
https://cloud.google.com/appengine/articles/indexselection
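To illustrate the distinction with the low-level Java API (a sketch; the kind and boolean property names are from the question, the "score" sort property is hypothetical):

    import com.google.appengine.api.datastore.Query;

    // Equality-only filters: the planner can serve this by merging the
    // built-in single-property indexes, so no composite index is needed.
    Query equalityOnly = new Query("KindA").setFilter(
            Query.CompositeFilterOperator.and(
                    new Query.FilterPredicate("prop1", Query.FilterOperator.EQUAL, true),
                    new Query.FilterPredicate("prop2", Query.FilterOperator.EQUAL, true)));

    // The same filters plus a sort order on a Group2 (double) property:
    // this combination requires a composite index.
    Query withSort = new Query("KindA").setFilter(
            Query.CompositeFilterOperator.and(
                    new Query.FilterPredicate("prop1", Query.FilterOperator.EQUAL, true),
                    new Query.FilterPredicate("prop2", Query.FilterOperator.EQUAL, true)))
            .addSort("score", Query.SortDirection.DESCENDING);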
Yes, every index definition in your index.yaml counts toward the 200-index limit.
I've found that you really don't use composite indexes much once you know how GAE apps can be programmed. You need to balance what users can do and what they can't, and also balance between doing the work in the query versus fetching everything and filtering in code (it really depends on the maximum number of entities that particular kind can hold).
However, if you're trying to offer complex queries to your users, then maybe the Datastore is not the right choice.

Query for greater/less than in combination with another field asks for an already existing index

I got an index like this for my entity Word:
changed ▲ + type ▲
I've added one entity to the datastore and can successfully query it in the console like this:
KIND Word
changed is a number greater than 0
However, if I add a second field to the filter, like this:
KIND Word
changed is a number greater than 0
type is a string that is noun
It will fail with the error "You need an index to execute this query."
It seems to me that the exact index it's asking for is already present. Can anyone shed some light on this?
In order to query on two fields you need a combined (composite) index covering both, not individual indexes on each of them. Note also that the order of the properties in a composite index matters: properties used in equality filters (type) must come before the property used in inequality filters (changed), so your existing changed + type index cannot serve this query; you need type + changed.
Please review this for more information about Datastore indexes:
https://developers.google.com/appengine/articles/indexselection
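A sketch of the index this query actually needs, in index.yaml syntax (the same property ordering applies in datastore-indexes.xml for Java apps):

    indexes:
    - kind: Word
      properties:
      - name: type
      - name: changed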
The best way to get ALL the indexes you need is to run the queries on your dev server, making sure you hit EVERY query you could run in production.
When you're in development, the dev server will create all the indexes your application needs in your xml/yaml file (Java/Python). If you didn't hit a particular query on your dev server, its index is not in your xml/yaml and you'll need to add it manually.
It's a paradigm shift when you're working with the Datastore, but you have to understand that, for scalability reasons, the Datastore CANNOT do ANY computation at query time. It only returns scans of its tables. When you save one entity, behind the scenes the Datastore will actually save it multiple times, once for each index, already sorted for that index. So even if you have an index for property 1 and a separate index for property 2, the composite index for properties 1 & 2 is not necessarily created. If you don't declare that composite index in your xml/yaml, you "force" the Datastore to do a join between its tables, which it unfortunately cannot do.
I took it as a best practice to hit EVERY query my code can run on the dev server and then look at the indexes created. You just have to be careful not to create tons of indexes, since they count against your quota and can end up costing you (if your app has billing enabled, of course).

Lucene - few or a lot of indexes

Is it better in Lucene to use
a lot of indexes (e.g. one for every user, as your application allows that)
or just one, holding every document
...if you think about:
performance
disk space
health
I am using Elasticsearch, and therefore Lucene.
In Elasticsearch, based on the information given, I would use 1 index. My understanding is that users only search their own documents, and the documents seem to be relatively similar.
Performance - when searching, you can use a filtered query to restrict results to the documents matching the user. The user-id filter is cacheable, and fast.
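For example, a sketch with the pre-2.0 Elasticsearch Java client, where filtered queries were the idiom (the index name and field names are hypothetical):

    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.index.query.FilterBuilders;
    import org.elasticsearch.index.query.QueryBuilders;

    static SearchResponse searchUserDocs(Client client, String userId, String text) {
        // The term filter on user_id is cached by Elasticsearch,
        // so repeated searches by the same user stay fast.
        return client.prepareSearch("documents")
                .setQuery(QueryBuilders.filteredQuery(
                        QueryBuilders.matchQuery("body", text),
                        FilterBuilders.termFilter("user_id", userId)))
                .execute().actionGet();
    }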
Scalability - in Elasticsearch, you control sharding and replication at the index level. Elasticsearch can handle large numbers of indexes, but I think configuring appropriate shards and replicas for the single whole index could be valuable.
In a single index, you can still easily wipe away data (see delete-by-query), and there should be little concern about seeing other users' data unless you write your queries wrong. A filtered query that restricts results to those associated with a user id is very simple - similar in complexity to searching a different index per user.
Your exact needs might fit a different approach better, but based on what I have so far, I would choose one index.

Best way to create a Lucene index with fields that can be updated frequently, and to filter the results by those fields

I use Lucene to index and search my documents. Currently I have 800k documents indexed in Lucene. Those documents have some fields:
Id: a numeric field identifying the document
Name: a textual field, stored and analyzed
Description: like Name
Availability: a numeric field used to filter results. This field can be updated frequently, every day.
My question is: what's the best way to create a filter for availability?
1 - Add this information to the index and use a Lucene filter.
With this approach I have to update the document (remove and re-add it, because Lucene 3.0.2 has no in-place update support) every time the availability changes. What is the cost of reindexing?
2 - Don't add this information to the index, and filter the results with a DB select.
This approach will require a lot of selects, because I'd need to look up every id returned from the index in the database to check its availability.
3 - Create a separate index with id and availability.
I don't know if this is a good solution, but I could keep one index with the static information and another with the frequently updated information. I think that is better than updating whole documents just because some fields changed.
I would stay away from 2. If you can handle the search entirely in Lucene instead of searching in Lucene + the DB, do it. I deal with this case (Lucene search + DB search) in my project, but only because there is no way around it.
Internally, the cost of an update is:
delete the old doc
insert the new doc (with the new field).
I would just try approach number 1 (as it is the simplest); if the performance is good enough, stick with it. If not, look for ways to optimize it or try 3.
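For approach 1, the delete-and-add is already packaged for you: Lucene 3.0.2 has no field-level update, but IndexWriter.updateDocument() performs exactly the delete/add pair described above atomically. A sketch (field names follow the question; note the whole document must be rebuilt, not just the changed field):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    static void updateAvailability(IndexWriter writer, String id, int availability)
            throws Exception {
        // Rebuild the full document: every stored/indexed field must be
        // added again, because the old document is replaced wholesale.
        Document doc = new Document();
        doc.add(new Field("Id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("Availability", String.valueOf(availability),
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        // ... re-add Name and Description here as well ...

        // Deletes any document whose "Id" term matches, then adds the new one.
        writer.updateDocument(new Term("Id", id), doc);
        writer.commit();
    }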
Answer provided from lucene-groupmail:
How often is "frequently"? How many updates do you expect to do in
a day? And how quickly must those updates be reflected in the search
results?
800K documents isn't all that many. I'd go with the simple approach first
and monitor the results, #then# go to a more complex solution if you
see a problem arising. Just update (delete/add) the documents when
the value changes.
Well, the cost to reindex is just about what the cost to index it originally
is. The old version of the document is marked deleted and the new one
is added. It's essentially the same cost as to index a new document.
This leaves some gaps in your index, that is the deleted docs are still in
there, but the next optimize will compact them.
From which you may infer that optimizing is the expensive part. I'd do that,
say, once daily (or even weekly).
HTH
Erick

Creating an efficient search capability using SQL Server (and/or ColdFusion)

I am trying to visualize how to create a search for an application that we are building. I would like a suggestion on how to approach 'searching' through large sets of data.
For instance, this particular search would be on a table of at least 750k records of product SKUs, sizing, material type, create date, etc.
Is anyone aware of a 'plugin' solution for ColdFusion to do this? I envision a Google-like single-entry search where a customer can type in the part number, the sizing, etc., and get hits on any or all relevant results.
Currently, if I run a 'LIKE' comparison query, it seems to take ages (OK, a few seconds, but that's still too long), at times making a user sit and wait up to 10 seconds for queries and page loads.
Or are there any SQL techniques to help accomplish this? I want to use a proven method to search the data, not just a simple SQL LIKE or = comparison.
So this is a multi-approach question: should I attack this at the SQL level (as it ultimately looks like I should), or is there a plugin/module for ColdFusion that will give me speedy, advanced search capability?
You could try indexing your DB records with a Verity (or Solr, if CF9) search.
I'm not sure it would be faster, and whether trying it would even be worthwhile depends a lot on how often you update the records you need to search. If you update them rarely, you could run a Verity index update whenever you change them. If you update the records constantly, that will be a drag on the webserver and will certainly offset any possible gains in search speed.
I've never indexed a database via Verity, but I've indexed large collections of PDFs, Word docs, etc., and I recall the search being pretty fast. I don't know if it will help your current situation, but it might be worth further research.
If your slowdown is specifically in the search of textual fields (as I surmise from your mention of LIKE), the best solution is building an index table (not to be confused with DB table indexes, which are also part of the answer).
Build an index table mapping the unique ID of the records in your main table to a set of words (one word per row) from the textual fields. If it matters, add the field of origin as a third column in the index table, and if you want "relevance" features you may want to record a word count as well.
Populate the index table either with a trigger (using string splitting) or from your app - the latter might be better: simply call a stored proc with both the actual data to insert/update and the list of words already split up, as in the sketch below.
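A sketch of that app-side path in Java/JDBC (the stored procedure name and its parameters are hypothetical):

    import java.sql.CallableStatement;
    import java.sql.Connection;

    // Hypothetical proc that upserts the main-table row and its word
    // rows in the index table in a single round trip.
    static void saveProduct(Connection conn, long id, String description)
            throws Exception {
        try (CallableStatement cs =
                conn.prepareCall("{call UpsertProductIndex(?, ?, ?)}")) {
            cs.setLong(1, id);
            cs.setString(2, description);
            // Words already split up app-side, as suggested above.
            cs.setString(3, String.join(",", description.toLowerCase().split("\\W+")));
            cs.execute();
        }
    }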
This will immediately and drastically speed up textual search, as it no longer uses LIKE, and it will be able to use indexes on the index table (no pun intended) without interfering with the indexing of SKU and the like on the main table.
Also, ensure that all the relevant fields are fully indexed - not necessarily in the same compound index (SKU, sizing, etc.) - and any field that is searched as a range (sizing or date) is a good candidate for a clustered index (as long as records are inserted in approximate order of that field's increase, or you don't care as much about insert/update speed).
For anything more detailed, you will need to post your table structure, existing indexes, the queries that are slow, and the query plans you currently have for those slow queries.
Another item is to ensure that as few of the fields as possible are textual, especially ones that are "decodable" - your comment mentioned "is it boxed" in the set of text fields. If so, I assume the values are "yes"/"no" or some other very limited data set; in that case, simply store a numeric code for the valid values, do the en/de-coding in your app, and search by the numeric code. Not a tremendous speed improvement, but still an improvement.
I've done this using SQL Server's full-text indexes. It requires very few application changes and no changes to the database schema except for the addition of the full-text index.
First, add the full-text index to the table. Include in it all of the columns the search should run against. I'd also recommend having the index auto-update; this shouldn't be a problem unless your SQL Server is already heavily taxed.
Second, to do the actual search, you need to convert your query into a full-text search. The first step is to convert the search string into a full-text search string. I do this by splitting the search string into words (using the Split method) and then building a search string formatted as:
"Word1*" AND "Word2*" AND "Word3*"
The double quotes are critical; they tell the full-text index where each word begins and ends.
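A sketch of that conversion in Java (the original likely used ColdFusion or .NET string splitting; the quote-stripping here is a simplification, not proper escaping):

    // Turns raw input such as 'blue widget 42' into the full-text
    // search string '"blue*" AND "widget*" AND "42*"'.
    static String toFullTextQuery(String input) {
        StringBuilder sb = new StringBuilder();
        for (String word : input.trim().split("\\s+")) {
            if (word.isEmpty()) continue;
            if (sb.length() > 0) sb.append(" AND ");
            sb.append('"').append(word.replace("\"", "")).append("*\"");
        }
        return sb.toString();
    }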
Next, to actually execute the full-text search, use the CONTAINSTABLE function in your query:
SELECT *
FROM CONTAINSTABLE(Bugs, *, '"Word1*" AND "Word2*" AND "Word3*"')
This will return two columns:
KEY - the column identified as the unique key of the full-text index
RANK - a relative rank of the match (1-1000, with a higher rank meaning a better match).
I've used approaches similar to this many times and I've had good luck with it.
If you want a truly plug-in solution, you should just go with Google itself. It sounds like you're running some kind of e-commerce or commercial site (given the use of the term 'SKU'), so you probably have a catalog of some kind with product pages. If you have consistent markup, you can configure a Google appliance or service to do exactly what you want. It will send a bot in to index your pages and find your fields. No SQL, little coding; it will not be dependent on your database, or even on ColdFusion. It will also be quite fast and familiar to customers.
I was able to do this with a ColdFusion site in about 6 hours. The only thing to watch out for is that Google's index is limited to what the bot can see, so if you want to limit access based on a user's role, permissions, or group, it may not be the solution for you (although you can configure a permission service for Google to check with).
Because SQL Server is where your data lives, that is where your search performance issue will be. Make sure you have indexes on the columns you are searching on. Note that with LIKE, an index cannot be used for a pattern with a leading wildcard, as in SELECT * FROM TABLEX WHERE last_name LIKE '%FR%'.
But an index can be used if you write it like this: SELECT * FROM TABLEX WHERE last_name LIKE 'FR%'. The key is to keep as many of the leading characters as possible out of the wildcards.
Here is a link to a site with some general tips. https://web.archive.org/web/1/http://blogs.techrepublic%2ecom%2ecom/datacenter/?p=173
