Tricky 1-to-1 unowned relations query - google-app-engine

I am currently developing a location based service on GAE/Java. I am quite new to this and I need your help with the JDO query part.
I have two persistent classes, Client and ClientGeolocation. The first one is for storing the client attributes (Key clientId, String name, String settings, etc.) and the second is for storing its geolocation updates (Key clientGeolocationId, Key clientId, Long timestamp, Double latitude, Double longitude). Since one client has thousands of geolocation records (one for each location update) over time, I decided to use 1-to-1 unowned relationship between ClientGeolocation and Client classes.
The service lets the user see whether another user is within range (e.g. within 5 minutes' walking distance). Making this happen with JDO queries for each request would be far too resource-consuming and slow, so I keep the last geolocation of each user in memcache and do the checking from there. So far so good.
The problem is when the app cold starts and memcache is empty: I want to fill memcache with data from storage (using a JDO query), and I simply do not know how to query for "the last geolocation record of each user who has at least one record not older than 180 minutes".
The best solution I can come up with at the moment is to do this in two parts. First, query the clientId keys of users who have records within the last 180 minutes (this will return distinct clientIds, I hope), then execute a query for each clientId to fetch its last geolocation record (top 1, ordered by timestamp descending). This means that if the first query returns 10,000 users, I will run 10,000 queries for the last geolocation records. I have a feeling that there is a better solution for this in GAE :) .
Can you please help me write this query in a proper way?
Thank you very much for your help!

Could this be helpful?
http://www.datanucleus.org/products/accessplatform_3_0/jdo/jdoql_subquery.html

Related

Cloud Firestore better structure for the case

I am developing an app where users (Firebase Auth) register their expenses and are notified (OneSignal) each Sunday about the expenses that will expire during the week.
My firestore structure is:
-users (collection)
---xxXXxxX (user document)
-----email
-----OneSignal ID
-----expenses (collection)
-------yyYYYyY (expense document)
---------dueDate
---------value
---------userId
-------aaAAaaA (expense document)
---------dueDate
---------value
---------userId
---bBBbbBB (another user document)
-----email
-----OneSignal ID
-----expenses (collection)
-------wwWWwwW (expense document)
---------dueDate
(...)
Based on this structure, every Sunday Google Cloud will run a scheduled function that queries all the expenses that expire during the week (a collection group query, returning a list of expenses that can contain more than one expense per user).
With this list, still in the function, I will manually extract the userId from each expense, creating a second list with one entry per user. With that second list the function will get the OneSignal ID of each user (more queries on Firebase, one per user in the list) and register a notification with the OneSignal service for every user.
P.S.: The OneSignal ID can change, which is why I can't save the OneSignal ID on the expense.
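Purely as a rough sketch of that weekly job, here is what it might look like with the Python Firebase Admin SDK (the same idea applies to a Node Cloud Function). The dueDate, userId and "OneSignal ID" field names come from the structure above; send_onesignal_notification is a hypothetical helper:

from datetime import datetime, timedelta

import firebase_admin
from firebase_admin import firestore

firebase_admin.initialize_app()
db = firestore.client()

def send_onesignal_notification(player_id):
    # Hypothetical stub: call the OneSignal REST API here.
    pass

def notify_expiring_expenses():
    now = datetime.utcnow()
    week_end = now + timedelta(days=7)

    # Collection group query over every 'expenses' subcollection
    # (needs a collection-group index on dueDate).
    expiring = (db.collection_group('expenses')
                  .where('dueDate', '>=', now)
                  .where('dueDate', '<=', week_end)
                  .stream())

    # Collapse to one entry per user.
    user_ids = {doc.get('userId') for doc in expiring}

    for user_id in user_ids:
        user = db.collection('users').document(user_id).get()
        onesignal_id = user.to_dict().get('OneSignal ID')  # field from the structure above
        send_onesignal_notification(onesignal_id)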
I guess that this structure will work, but it appears that this is not the best solution, because many queries run in the background and this can become costly in the future.
Does anyone have a better suggestion for this case? Maybe another structure on firestore...
I hope that I explained the "problem" well. English is not my first language.
Thank you!
From what I've read in the documentation, you're doing it right, and here's why (anyone, correct me if I'm wrong):
Firebase charges you when users download more than X GB of data. As far as I know you won't be charged for doing queries and filtering, so you're good in this respect.
Firestore query time depends solely on the number of results you get, not on the structure. So if you're fine with this structure, stick with it. I see no problem at all.
EDIT : I just re-read the docs and found this:
When you use Cloud Firestore, you are charged for the following:
The number of reads, writes, and deletes that you perform.
The amount of storage that your database uses, including overhead for metadata and indexes.
The amount of network bandwidth that you use.
So you will be charged when querying, apparently. In that case a better way to structure the db might be to flatten the expenses tree. You could have something like this:
-users (collection)
---xxXXxxX (user document)
-----email
-----OneSignal ID
-expenses (collection)
---yyYYYyY (expense document)
-----userId
-----dueDate
-----value
This way you could filter expenses by user (and by due date) in a single query, for example as sketched below.
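As an illustration only, a query against that flattened layout might look like this in Python (the placeholder values are hypothetical, and combining the userId equality with a dueDate range needs a composite index):

from datetime import datetime, timedelta
from google.cloud import firestore

db = firestore.Client()

some_user_id = 'xxXXxxX'                     # placeholder user document ID
end_of_week = datetime.utcnow() + timedelta(days=7)

# All expenses for one user that fall due before the end of the week.
expenses = (db.collection('expenses')
              .where('userId', '==', some_user_id)
              .where('dueDate', '<=', end_of_week)
              .stream())

for doc in expenses:
    print(doc.id, doc.to_dict())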
https://cloud.google.com/firestore/docs/query-data/queries#:~:text=Cloud%20Firestore%20provides%20powerful%20query,Data%20and%20Get%20Realtime%20Updates.

google app engine query optimization

I am trying to do my reads and writes for GAE as efficiently as possible and I was wondering which is the best of the following two options.
I have a website where users are able to post different things and right now whenever I want to show all posts by that user I do a query for all posts with that user's user ID and then I display them. Would it be better to store all of the post IDs in the user entity and do a get_by_id(post_ID_list) to return all of the posts? Or would that extra space being used up not be worth it?
Is there anywhere I can find more information like this to optimize my web app?
Thanks!
The main reason you would want to store the list of IDs is so that you can get each entity separately for better consistency: entity gets by ID are strongly consistent with the latest version in the datastore, while queries are eventually consistent.
Check datastore costs and optimize for cost:
https://developers.google.com/appengine/docs/billing
Getting entities by key wouldn't be any cheaper than querying all the posts. The query makes use of an index.
If you use projection queries, you can reduce your costs quite a bit.
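For illustration, a projection query with ndb might look like the sketch below; the Post model and its properties are hypothetical stand-ins for your own:

from google.appengine.ext import ndb

class Post(ndb.Model):
    author = ndb.KeyProperty()
    title = ndb.StringProperty()
    created = ndb.DateTimeProperty(auto_now_add=True)

user_key = ndb.Key('UserProfile', 'some-user-id')  # hypothetical user key

# Only the projected properties are read back, and each result is
# billed as a small operation rather than a full entity read.
titles = Post.query(Post.author == user_key,
                    projection=[Post.title, Post.created]).fetch(10)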
There are several cases.
First, if you keep track of all the IDs of a user's posts, you must use an entity group for consistency. That means the write rate to the datastore would be about 1 entity per second, and the cost is 1 read for the object holding the IDs plus 1 read per entity.
Second, if you just use a query, you do not need (and do not get) strong consistency. The cost is 1 read plus 1 read per entity retrieved.
Third, if you query only keys and fetch the entities afterwards, the cost is 1 read plus 1 small operation per key retrieved. See: Keys-Only Queries. This is equal to a projection query in cost.
And if you have many results and use pagination, then you should use Query Cursors. That prevents wasteful datastore usage.
The most economical solution is the third case. See: Batch Operations.
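A minimal sketch of the third case and of cursor-based pagination, reusing the hypothetical Post model and user_key from the sketch above:

from google.appengine.ext import ndb

# Keys-only query (1 read + 1 small operation per key), then a batch get.
keys = Post.query(Post.author == user_key).fetch(100, keys_only=True)
posts = ndb.get_multi(keys)

# For large result sets, page through with a query cursor instead.
page, cursor, more = Post.query(Post.author == user_key).fetch_page(20)
while more:
    # process 'page' here, then fetch the next one
    page, cursor, more = (Post.query(Post.author == user_key)
                          .fetch_page(20, start_cursor=cursor))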
In case you have a list of IDs because they are stored with your entity, a call to ndb.get_multi (if you are using NDB, but it would be similar with any other framework that uses memcache to cache single entities) would save you further datastore calls if all (or most) of the entities corresponding to the keys are already in memcache.
So in the best possible case (everything is in the memcache), the datastore wouldn't be touched at all, while using a query would.
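A small sketch of that path, assuming a hypothetical UserProfile entity that stores a post_keys list:

from google.appengine.ext import ndb

class UserProfile(ndb.Model):
    post_keys = ndb.KeyProperty(repeated=True)  # hypothetical list of post keys

user = UserProfile.get_by_id('some-user-id')
# get_multi consults ndb's in-context cache and memcache first and only
# issues datastore gets for the keys that miss.
posts = ndb.get_multi(user.post_keys)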
See this issue for a discussion and caveats: http://code.google.com/p/appengine-ndb-experiment/issues/detail?id=118.

Google App Engine: efficient large deletes (about 90000/day)

I have an application that has only one Model with two StringProperties.
The initial number of entities is around 100 million (I will upload those with the bulk loader).
Every 24 hours I must remove about 70000 entities and add 100000 entities. My question is now: what is the best way of deleting those entities?
Is there any way to avoid fetching the entity before deleting it? I was unable to find a way of doing something like:
DELETE from xxx WHERE foo1 IN ('bar1', 'bar2', 'bar3', ...)
I realize that App Engine offers an IN clause (albeit with a maximum length of 30, because of the maximum number of individual requests per GQL query [1]), but to me that still seems strange because I will have to get the entities and then delete them again (making two RPC calls per entity).
Note: the entity should be ignored if not found.
EDIT: Added info about problem
These entities are simply domains. The first string is the SLD and the second the TLD (no subdomains). The application can be used to perform a request like http://[...]/available/stackoverflow.com . The application will return a True/False JSON object.
Why do I have so many entities? Because the datastore contains all registered domains (.com for now). I cannot perform a whois request in every case because of TOSs and latency. So I initially populate the datastore with an entire zone file and then daily add/remove the domains that have been registered/dropped... The problem is that these are pretty big quantities and I have to figure out a way to keep costs down while adding/removing 2 * ~100,000 domains per day.
Note: there is hardly any computation going on as an availability request simply checks whether the domain exists in the datastore!
[1]: 'A maximum of 30 datastore queries are allowed for any single GQL query.' (http://code.google.com/appengine/docs/python/datastore/gqlreference.html)
If you are not doing so already, you should be using key_names for this.
You'll want a model something like:
class UnavailableDomain(db.Model):
    pass
Then you will populate your datastore like:
UnavailableDomain.get_or_insert(key_name='stackoverflow.com')
UnavailableDomain.get_or_insert(key_name='google.com')
Then you will query for available domains with something like:
is_available = UnavailableDomain.get_by_key_name('stackoverflow.com') is None
Then when you need to remove a bunch of domains because they have become available, you can build a big list of keys without having to query the database first like:
free_domains = ['stackoverflow.com', 'monkey.com']
db.delete(db.Key.from_path('UnavailableDomain', name) for name in free_domains)
I would still recommend batching up the deletes into something like 200 per RPC, if your free_domains list is really big
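A minimal sketch of that batching, using the free_domains list from the snippet above:

from google.appengine.ext import db

BATCH_SIZE = 200  # roughly the batch size suggested above

free_domains = ['stackoverflow.com', 'monkey.com']  # as above
keys = [db.Key.from_path('UnavailableDomain', name) for name in free_domains]
for start in range(0, len(keys), BATCH_SIZE):
    db.delete(keys[start:start + BATCH_SIZE])  # one delete RPC per batch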
Have you considered the appengine-mapreduce library? It comes with the pipeline library, and you could utilise both to:
Create a pipeline for the overall task that you will run via cron every 24hrs
The 'overall' pipeline would start a mapper that filters your entities and yields the delete operations
After the delete mapper completes, the 'overall' pipeline could call an 'import' pipeline to start running your entity creation part.
The pipeline API can then send you an email to report on its status.

Paged results when selecting data from 2 databases

Hi
I have one web service connected to one db that has a table called clients which has some data.
I have another web service connected to another db that has a table called clientdetails which has some other data.
I have to return a paged list of clients and every client object contains the information from both tables.
But I have a problem.
The search criteria has to be applied on both tables.
So basically in the clients table I can have the properties:
cprop1, cprop2
in the clientdetails table I can have cdprop1,cdprop2
and my search criteria can be cprop1 = something, cdprop2 = somethingelse.
I call the first web service and send it the criterion cprop1 = something.
It returns some info, and then I call the method in the second web service. But if I have to return, say, 10 items on a page, and the criterion of the second web service (cdprop2 = somethingelse) is applied to the 10 items selected by the first web service, then I may be left with 8 items, or none at all.
So what do I do in this case?
How can I make sure I always get the right number of items (that is, as many as the user says he wants on a page)?
Until you have both responses you don't know how many records you are going to have to display.
You don't say what kind of database access you are using; you imply that you ask for "N records matching criterion X", where you have N set to 10. In some DB access mechanisms you can ask for all matching records and then advance a "cursor" through the set, hence you don't need to set any upper bound; we assume that the DB takes care of managing resources efficiently for such a query.
If you can't do that, then you need to be able to revisit the first database asking for the next 10 records, and repeat until you have a full page or no more records can be found. This requires that you have some way to specify a "next 10" query.
You need the ability to get to all records matching the criteria in some efficient way, either by some cursor mechanism offered by your DB or by your own "paged" queries, without that capability I don't see a way to guarantee to give an accurate result.
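As a hedged sketch of that repeat-until-full approach (the clients_service and clientdetails_service calls are hypothetical wrappers around the two web services):

def fetch_page(page_size, cprop1_value, cdprop2_value, clients_service, clientdetails_service):
    page, cursor = [], None
    while len(page) < page_size:
        # Ask the first service for the next batch matching its criterion.
        batch, cursor = clients_service.query(cprop1=cprop1_value,
                                              limit=page_size, cursor=cursor)
        if not batch:
            break  # the first database has no more candidates
        # Ask the second service which of those clients also match its criterion.
        details = clientdetails_service.query(
            client_ids=[c.id for c in batch], cdprop2=cdprop2_value)
        matching_ids = {d.client_id for d in details}
        page.extend(c for c in batch if c.id in matching_ids)
    return page[:page_size]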
I found that in instances like this it's better not to use identity primary keys, but primary keys with generated values in the second database (generated in the first database).
As for searching, you should fetch the first 1000 items that fit your criteria from the first database, intersect them with the first 1000 that match the given criteria from the second database, and return the needed number of items from this intersection.
Your queries should never return an unlimited number of items anyway, so 1000 should do. The number could be bigger or smaller, of course.
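A short sketch of that intersection approach, again with hypothetical service wrappers:

LIMIT = 1000      # the upper bound suggested above; tune as needed
page_size = 10    # items per page

ids_a = set(clients_service.query_ids(cprop1='something', limit=LIMIT))
ids_b = set(clientdetails_service.query_ids(cdprop2='somethingelse', limit=LIMIT))

matching_ids = ids_a & ids_b              # clients satisfying both criteria
page = sorted(matching_ids)[:page_size]   # one page of the intersection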

What Database / Technology should be used to calculate unique visitors in time scope

I've got a problem with the performance of my reporting database (tables with millions of records, 50+) when I want to calculate a distinct count on the column that indicates visitor uniqueness, let's say some hashkey.
For example:
I have these columns:
hashkey, name, surname, visit_datetime, site, gender, etc...
I need to get the distinct count over a time span of 1 year, in less than 5 seconds:
SELECT COUNT(DISTINCT hashkey) FROM table WHERE visit_datetime BETWEEN 'YYYY-MM-DD' AND 'YYYY-MM-DD'
This query is fast for short time ranges, but if the range is bigger than one month it can take more than 30 seconds.
Is there a better technology to calculate something like this than relational databases?
I'm wondering what Google Analytics uses to calculate its unique visitors on the fly.
For reporting and analytics, the type of thing you're describing, these sorts of statistics tend to be pulled out, aggregated, and stored in a data warehouse or something similar. They are stored in a fashion optimized for read performance, in lieu of the nice relational storage techniques optimized for OLTP (online transaction processing). This pre-aggregated technique is called OLAP (online analytical processing).
You could have another table store the count of unique visitors for each day, updated daily by a cron function or something.
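For illustration, a nightly rollup into such a table could look like the sketch below (standard DB-API, with illustrative table and column names). Note that per-day counts can only approximate uniqueness across a multi-day range, since the same hashkey may appear on several days:

import sqlite3  # stand-in for whatever RDBMS holds the reporting tables

def rollup_day(conn, day):
    """Store one pre-aggregated row of unique visitors for the given day."""
    conn.execute(
        "INSERT INTO daily_uniques (day, unique_visitors) "
        "SELECT ?, COUNT(DISTINCT hashkey) FROM visits "
        "WHERE visit_datetime >= ? AND visit_datetime < date(?, '+1 day')",
        (day, day, day),
    )
    conn.commit()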
Google Analytics uses a first-party cookie, which you can see if you log Request Headers using LiveHTTPHeaders, etc.
All GA analytics parameters are packed into the Request URL, e.g.,
http://www.google-analytics.com/_utm.gif?utmwv=4&utmn=769876874&utmhn=example.com&utmcs=ISO-8859-1&utmsr=1280x1024&utmsc=32-bit&utmul=en-us&utmje=1&utmfl=9.0%20%20r115&utmcn=1&utmdt=GATC012%20setting%20variables&utmhid=2059107202&utmr=0&utmp=/auto/GATC012.html?utm_source=www.gatc012.org&utm_campaign=campaign+gatc012&utm_term=keywords+gatc012&utm_content=content+gatc012&utm_medium=medium+gatc012&utmac=UA-30138-1&utmcc=__utma%3D97315849.1774621898.1207701397.1207701397.1207701397.1%3B...
Within that URL is a piece keyed to __utmcc; these are the GA cookies. Within __utmcc is a string keyed to __utma, which is comprised of six fields, each delimited by a '.'. The second field is the Visitor ID, a random number generated and set by the GA server after looking for GA cookies and not finding them:
__utma%3D97315849.1774621898.1207701397.1207701397.1207701397.1
In this example, 1774621898 is the Visitor ID, intended by Google Analytics as a unique identifier of each visitor.
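For example, pulling the Visitor ID out of a decoded __utma value is just a matter of splitting on the dots:

def visitor_id_from_utma(utma_value):
    """Return the second dot-delimited field of a __utma cookie value."""
    return utma_value.split('.')[1]

print(visitor_id_from_utma('97315849.1774621898.1207701397.1207701397.1207701397.1'))
# -> 1774621898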
So you can see the flaws of this technique for identifying unique visitors: entering the site using a different browser, or a different device, or after deleting the cookies will cause you to appear to GA as a new unique visitor (i.e., it looks for its cookies, doesn't find any, and so sets them).
There is an excellent article by the EFF on this topic, i.e., how uniqueness can be established, with what degree of certainty, and how it can be defeated.
Finally, one technique I have used to determine whether someone has visited our site before (assuming the hard case, which is that they have deleted their cookies, etc.) is to examine the client's requests for our favicon. The directories that store favicons are quite often overlooked, whether during a manual sweep or programmatically by a script.
