App Engine datastore query, too many indexes - google-app-engine

I have a form with 7 input fields. Each of these fields should query for a greater and/or smaller value, and any field may be left empty. As inequality filters work on only one property, I wanted to add an int-list property containing up to 20 integers and query it with a 'Property =' filter to check whether the value is present in the list. However, this gives me the message 'too many indexed properties'.
I'm a bit lost, as I can't use inequality filters on more than one property, and list properties create too many indexes.
Can somebody point me in the right direction?

You probably want to use Cloud SQL or Full Text Search for this kind of complex querying. In particular, you're going to face a lot of challenges if you need ordering in these queries, as you'll then need explicit multi-property indexes, and you'll run into a combinatorial explosion in the number of indexes required.
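For what it's worth, the Full Text Search API does allow range filters on several numeric fields in one query, which is exactly what the datastore refuses. A minimal sketch, assuming your seven fields are numeric (the index name and field names here are invented for illustration):

from google.appengine.api import search

index = search.Index(name='products')  # hypothetical index name

def add_item(doc_id, width, weight):
    # store each form field as a NumberField on a search document
    index.put(search.Document(doc_id=doc_id, fields=[
        search.NumberField(name='width', value=width),
        search.NumberField(name='weight', value=weight),
    ]))

# unlike the datastore, several range filters can be combined:
results = index.search('width >= 10 AND width <= 20 AND weight < 5')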

Related

CouchBase view get for multiple ranges

I'm evaluating CouchBase for an application, and trying to figure out something about range queries on views. I know I can do a view get for a single key, multiple keys, or a range. Can I do a get for multiple ranges? i.e. I want to retrieve items with view key 0-10, 50-100, 5238-81902. I might simultaneously need 100 different ranges, so having to make 100 requests to the database seems like a lot of overhead.
As far as I know, there is no way in Couchbase to get values from multiple ranges with one view. There may be (or may eventually be) features for this in Couchbase N1QL, but I haven't worked with it.
To answer your question: 100 requests will not be a big overhead. Couchbase is quite fast and is designed to handle a lot of operations per second. Also, if your view is correctly designed, it will not be "recalculated" on each query.
Also, there is another way:
1. Determine the minimum and maximum values across your ranges (0..81902 in your example).
2. Query the view so that it returns only document IDs and the value the range is based on, without including the full documents in the result.
3. On the client side, filter the array of results from the previous step according to your ranges (0-10, 50-100, 5238-81902), and then use getMulti with the document IDs that remain.
I don't know your data structure, so try both ways, test them, and choose whichever better fits your needs.
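A rough sketch of steps 2 and 3 above with the Python SDK (the bucket, design document and view names are placeholders, and the exact calls depend on your client version):

from couchbase.bucket import Bucket

bucket = Bucket('couchbase://localhost/items')  # hypothetical bucket
ranges = [(0, 10), (50, 100), (5238, 81902)]

# step 2: the view emits the range value as the key; fetch ids only,
# not the full documents
rows = bucket.query('items_design', 'by_value')

# step 3: keep ids whose key falls in one of the ranges, then fetch
# those documents with a single multi-get
ids = [row.docid for row in rows
       if any(lo <= row.key <= hi for lo, hi in ranges)]
docs = bucket.get_multi(ids)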

Advanced database queries - appengine datastore

I have a fairly simple CRM-like application with a lot of contacts and associated tags.
A user can search with many criteria (search-items), such as:
updated_time in last 10 days
tags in xxx
tags not in xxx
first_name starts with xxx
first_name not in 'Smith'
I understand indexing and how filters (not in) cannot work on more than one property.
Since reporting is mostly done in a cron job, I can iterate through all records and process them. However, I would like to know the most optimized way of doing this.
I am hoping that instead of querying for ALL records, I can get close with a query that runs within the App Engine design limits, and then match the remaining search-items manually.
One way of doing it is to start with the first search-item and get a count, then add the next search-item and get a count again. At the point where the query bails out, I process the matching records against the remaining search-items manually.
The questions are:
Is there a way to know beforehand, programmatically, whether a query is valid, without doing a count?
How do you determine the best set of search-items that do not collide (e.g. not-in does not work across many filters)?
The only way I see is to put all the equality filters into one query, add the first inequality (or 'in') filter, execute it, and then iterate over the resulting entities.
Is there a library which can help me ;)
I understand indexing and how filters (not in) cannot work on more than one property.
This is not strictly true. You can create a "composite index", which allows you to filter on multiple fields; these consume additional index storage.
You can also build your own equivalent of a composite index by maintaining a "composite field" that you query against.
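For example, with ndb you might maintain the composite field as a computed property (the model and property names here are invented):

from google.appengine.ext import ndb

class Contact(ndb.Model):
    tag = ndb.StringProperty()
    first_name = ndb.StringProperty()
    # composite field: a single equality filter on this stands in
    # for a two-property composite index
    tag_name = ndb.ComputedProperty(
        lambda self: '%s|%s' % (self.tag, self.first_name))

matches = Contact.query(Contact.tag_name == 'customer|Smith').fetch()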
Is there a way to know beforehand, programmatically, whether a query is valid, without doing a count?
I'm not sure I understand what kind of validity you're referring to.
How do you determine the best set of search-items that do not collide (e.g. not-in does not work across many filters)?
A "not in" filter is not trivial. One way is to create two arrays (repeated fields). One with all the tagged entries and one with not all the tags. This would allow you to easily find all the entities with and without the tag. The only issue is that once you create a new tag, you have to sweep across the entities adding a "not in" entry for all the entities.

Google App Engine storing as list vs JSON

I have a model called User, and a user has a property relatedUsers, which, in its general format, is an array of integers. Now, there will be times when I want to check if a certain number exists in a User's relatedUsers array. I see two ways of doing this:
Use a standard Python list with indexed values (or maybe not) and just run an IN query and see if that number is in there.
Having the key to that User, get back the value for property relatedUsers, which is an array in JSON string format. Decode the string, and check if the number is in there.
Which one is more efficient? Would option 1 cost more reads than option 2? Would option 1 cost more writes than option 2, since indexing each value costs a write? And what if I don't index -- which solution would be better then?
Here are your costs versus capabilities, option by option:
Putting the values in an indexed list will be far more expensive. You incur the cost of one write for each value in the list, which can explode depending on how many friends your users have, and the explosion can be worse if you have certain kinds of composite indexes. The upside is that you get to run queries on this information: you can query for a list of users who are friends with a particular user, for example.
No extra index or write costs here. The problem is that you lose querying functionality.
If you know that you're only ever going to check against the current user's list of friends, by all means go with option 2. Otherwise you might have to look at your design a little more carefully.
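To make the trade-off concrete, a sketch of both options on one model (names are illustrative, not from your code):

import json
from google.appengine.ext import ndb

class User(ndb.Model):
    # option 1: indexed repeated property; one index write per value,
    # but it can be queried
    related_users = ndb.IntegerProperty(repeated=True)
    # option 2: unindexed JSON string; cheap writes, no queries
    related_json = ndb.TextProperty()

# option 1: membership check via a query
is_related = User.query(User.related_users == 42).get() is not None

# option 2: fetch by key, decode, check in memory
user = ndb.Key(User, 'some-user-id').get()
is_related = 42 in json.loads(user.related_json or '[]')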

Should I denormalize properties to reduce the number of indexes required by App Engine?

One of my queries can take a lot of different filters and sort orders depending on user input. This generates a huge index.yaml file of 50+ indexes.
I'm thinking of denormalizing many of my boolean and multi-choice (string) properties into a single string list property. This way, I will reduce the number of query combinations because most queries will simply add a filter to the string list property, and my index count should decrease dramatically.
It will surely increase my storage size, but this isn't really an issue as I won't have that much data.
Does this sound like a good idea or are there any other drawbacks with this approach?
As always, this depends on how you want to query your entities. For most of the queries you could execute against a list property like this, App Engine already includes an automatically built index, which you don't have to specify in index.yaml. Likewise, most queries you'd want to execute that require a composite index either can't be done with a list property at all, or would require an 'exploding' index on that list property.
If you tell us more about the sort of queries you typically run on this object, we can give you more specific advice.
Denormalizing your data to cut back on the number of indices sounds like a good tradeoff. Reducing the number of indices means fewer indices to update (though your one index will see more updates); it is unclear how this will affect performance on GAE. Size will of course be larger if you leave the original fields in place (since you're copying data into the string list property), but this may not be significant unless your entity was already quite large.
This is complicated a little by the fact that the index on the list will contain one entry for each element of the list on each entity (rather than just one entry per entity). This will certainly impact space and query performance. Also, be wary of creating an index that contains multiple list properties, or you can run into the problem of exploding indices (multiple list properties => one index entry for each combination of values from the lists).
Try experimenting and see how it works in practice for you (use AppStats!).
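For concreteness, one way the denormalization being discussed might look in ndb (property names invented). The query becomes a set of equality filters on a single list property, which is served by the built-in index alone:

from google.appengine.ext import ndb

class Listing(ndb.Model):
    is_active = ndb.BooleanProperty(indexed=False)
    color = ndb.StringProperty(indexed=False)
    # denormalized filter tokens; equality filters on this one list
    # replace many composite indexes
    filters = ndb.StringProperty(repeated=True)

    def _pre_put_hook(self):
        self.filters = ['active:%s' % self.is_active,
                        'color:%s' % self.color]

q = Listing.query(Listing.filters == 'active:True',
                  Listing.filters == 'color:red')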
"It will surely increase my storage size, but this isn't really an issue as I won't have that much data."
If this is true then you have no reason to denormalize.

Creating an efficient search capability using SQL Server (and/or coldfusion)

I am trying to visualize how to create a search for an application that we are building. I would like a suggestion on how to approach 'searching' through large sets of data.
For instance, this particular search would be on a table of at least 750k records of product SKUs, sizing, material type, create date, etc.
Is anyone aware of a 'plugin' solution for ColdFusion to do this? I envision a Google-like single-entry search where a customer can type in the part number, the sizing, etc., and get hits on any or all relevant results.
Currently, if I run a LIKE comparison query, it seems to take ages (OK, a few seconds, but still), and that is too long: at times a user sits there waiting up to 10 seconds for queries and page loads.
Or are there any SQL techniques to help accomplish this? I want to use a proven method to search the data, not just a simple SQL LIKE or = comparison.
So this is a multi-approach question: should I attack this at the SQL level (as it ultimately looks like I should), or is there a plug-in/module for ColdFusion that will give me speedy, advanced search capability?
You could try indexing your db records with a Verity (or Solr, if CF9) search.
I'm not sure it would be faster, and whether it's even worth trying depends a lot on how often you update the records you need to search. If you update them rarely, you could run a Verity index update whenever you update them. If you update the records constantly, that's going to be a drag on the web server, and will likely cancel out any gains in search speed.
I've never indexed a database via Verity, but I've indexed large collections of PDFs, Word Docs, etc, and I recall the search being pretty fast. I don't know if it will help your current situation, but it might be worth further research.
If your slowdown is specifically in searching textual fields (as I gather from your mention of LIKE), the best solution is building an index table (not to be confused with DB table indexes, which are also part of the answer).
Build an index table mapping the unique ID of your records in the main table to a set of words (one word per row) from the textual field. If it matters, add the field of origin as a third column in the index table, and if you want "relevance" features you may want to track word counts as well.
Populate the index table either with a trigger (using string splitting) or from your app; the latter might be better: simply call a stored proc with both the actual data to insert/update and the list of words already split up.
This will immediately and drastically speed up textual search, as it no longer relies on LIKE, and it can use indexes on the index table (no pun intended) without interfering with the indexing of SKU and the like on the main table.
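A rough sketch of the app-side population (the table and column names are invented; your app would do this from CFML, but the logic is the same):

import re
import pyodbc

conn = pyodbc.connect('DSN=products')  # hypothetical DSN

def index_text(product_id, field_name, text):
    # one row per distinct word, tagged with the field it came from
    words = set(re.findall(r'\w+', text.lower()))
    cur = conn.cursor()
    cur.execute("DELETE FROM product_word_index "
                "WHERE product_id = ? AND source_field = ?",
                product_id, field_name)
    cur.executemany("INSERT INTO product_word_index "
                    "(product_id, word, source_field) VALUES (?, ?, ?)",
                    [(product_id, w, field_name) for w in words])
    conn.commit()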
Also, ensure that all the relevant fields are fully indexed, though not necessarily in the same compound index (SKU, sizing, etc.). Any field searched as a range (sizing or date) is a good candidate for a clustered index, as long as records are inserted in approximately increasing order of that field, or you don't care much about insert/update speed.
For anything more detailed, you will need to post your table structure, existing indexes, the queries that are slow, and the query plans you currently get for those slow queries.
Another item: ensure that as few of the fields as possible are textual, especially ones that are "decodable". Your comment mentioned "is it boxed" among the text fields; if so, I assume the values are "yes"/"no" or some other very limited set. In that case, store a numeric code for the valid values, do the en/de-coding in your app, and search by the numeric code. Not a tremendous speed improvement, but still an improvement.
I've done this using SQL Server's full-text indexes. It requires very few application changes and no changes to the database schema except for the addition of the full-text index.
First, add the full-text index to the table. Include in the index all of the columns the search should run against. I'd also recommend having the index auto-update; this shouldn't be a problem unless your SQL Server is already heavily taxed.
Second, to do the actual search, you need to convert your query to a full-text search. The first step is to convert the search string into a full-text search string. I do this by splitting the search string into words (using the Split method) and building a search string formatted as:
"Word1*" AND "Word2*" AND "Word3*"
The double-quotes are critical; they tell the full text index where the words begin and end.
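The split-and-join step might look like this (a sketch; the original presumably does the same in CFML or .NET):

def to_fulltext_query(search_string):
    # wrap each word in double quotes with a prefix wildcard
    return ' AND '.join('"%s*"' % w for w in search_string.split())

to_fulltext_query('red widget box')  # '"red*" AND "widget*" AND "box*"'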
Next, to actually execute the full-text search, use CONTAINSTABLE in your query:
SELECT *
FROM CONTAINSTABLE(Bugs, *, '"Word1*" AND "Word2*" AND "Word3*"')
This will return two columns:
Key - the column identified as the primary key in the full-text index
Rank - a relative rank of the match (0 - 1000, with a higher rank meaning a better match)
I've used approaches similar to this many times and I've had good luck with it.
If you want a truly plug-in solution, then you should just go with Google itself. It sounds like you're building some kind of e-commerce or commercial site (given the use of the term 'SKU'), so you probably have a catalog with product pages. If you have consistent markup, you can configure a Google appliance or service to do exactly what you want. It will send a bot to index your pages and find your fields. No SQL, little coding, and it will not depend on your database, or even on ColdFusion. It will also be quite fast and familiar to customers.
I was able to do this with a ColdFusion site in about 6 hours, done! The only thing to watch out for is that Google's index is limited to what the bot can see, so if you need to limit access based on a user's role, permissions or group, it may not be the solution for you (although you can configure a permission service for Google to check with).
Because SQL Server is where your data lives, that is where any search performance issue will be. Make sure you have indexes on the columns you are searching on. Note that a LIKE cannot use an index if you write SELECT * FROM TABLEX WHERE last_name LIKE '%FR%', but it can use an index if you write SELECT * FROM TABLEX WHERE last_name LIKE 'FR%'. The key is to leave as many of the leading characters as possible free of wildcards.
Here is a link to a site with some general tips. https://web.archive.org/web/1/http://blogs.techrepublic%2ecom%2ecom/datacenter/?p=173
