Query Google DataStore - google-app-engine

I have the following Objectify entity to store data in Google Datastore.
public class Record implements Serializable {
    private static final long serialVersionUID = 201203171843L;
    @Id
    private Long id;
    private String name; // John Smith
    private Integer age; // 43
    private String gender; // Male/Female
    private String eventName; // Name of the marathon/event
    private String eventCityName; // City of the event
    private String eventStateName; // State of the event
    private Date eventDate; // event date
    // Getters & Setters
}
My question is: how can I query the Datastore to get a count of Records for a given eventName, or for a given event City+State? And how can I get a list of all City+Name combinations?

On App Engine, counting is very expensive: you basically need to run a query with a condition (eventName = something) and then count all the results. The cost is that of a keys-only query (1 read + 1 small operation) and increases with the number of entities counted. For example, counting 1 million entities would cost $0.8.
What is normally done is to keep the count as a property inside a dedicated entity: increase the property value when the count goes up (entity added) and decrease it when the count goes down (entity deleted).
If you plan to do this at a larger scale, be aware that there is a write/update limit of about 5 writes/s per entity (per entity group, actually). See sharded counters for how to work around this.
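As a rough illustration of the sharded counter idea, here is a plain-Java, in-memory sketch with no Datastore calls. Each slot of the array stands in for one counter-shard entity; the class name `ShardedCounter` and its methods are made up for this example. Writes go to a randomly chosen shard, which spreads the write load, and reading the count means summing all shards.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLongArray;

// In-memory stand-in for a sharded counter. In a real app each shard would be
// a separate Datastore entity (its own entity group), updated in a transaction.
class ShardedCounter {
    private final AtomicLongArray shards;

    ShardedCounter(int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        this.shards = new AtomicLongArray(numShards);
    }

    // Call when an entity is added: bump one randomly chosen shard.
    void increment() {
        shards.incrementAndGet(randomShard());
    }

    // Call when an entity is deleted.
    void decrement() {
        shards.decrementAndGet(randomShard());
    }

    // Reading the count means summing every shard.
    long total() {
        long sum = 0;
        for (int i = 0; i < shards.length(); i++) {
            sum += shards.get(i);
        }
        return sum;
    }

    private int randomShard() {
        return ThreadLocalRandom.current().nextInt(shards.length());
    }
}
```

More shards allow more writes per second, at the price of a slightly more expensive read.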

Related

Objectify queries are very slow (Google Datastore)

After some refactoring, we are having issues with the Objectify queries used in our application. The strange thing is that the problem persists even if we revert to the original code.
When the application starts, 250 books are fetched from the Datastore using Objectify. Caching is enabled and seems to be working.
The problem is that it takes around 50 to 60 seconds to get the result, and for this reason the HTTP request is sometimes killed. We never had this issue before and we can't find an answer to it.
When I ran a query like "select * from BookEntity order by creationDate desc limit 250" in the Google Datastore console, it took 5 to 7 seconds at most.
Before the refactoring, the book entity looked something like this:
@Index
@Entity
@Cache
public class BookEntity {
    @Index
    public String title_name;
    @Index
    public String author_name;
    public String isbn;
    public int number_of_pages;
    public Ref<PdfEntity> book_pdf;
}
Now it's like this:
@Index
@Entity
@Cache
public class BookEntity {
    @Index
    @AlsoLoad("title_name")
    private String titleName;
    @Index
    @AlsoLoad("author_name")
    private String authorName;
    private String isbn;
    @AlsoLoad("number_of_pages")
    private int numberOfPages;
    @AlsoLoad("book_pdf")
    private Ref<PdfEntity> bookPdf;
    // getters and setters for the fields because now they are private
}
This is just an example; in reality the entity has around 20 fields.
To migrate the schema to the new field names, I ran a task in GAE which loaded and then re-saved all the BookEntity entities.
The same applies to all the entities used in the application, but the book is the worst performing one. Even though nothing has changed in the query, and we are talking about a basic query which fetches the newest 250 books by creationDate, it takes a lifetime to get the actual result. Any idea how I can investigate this issue further?
Problem found. We were persisting some information in the no-arg constructor of BookEntity. Objectify calls that constructor for every entity it loads, so for every book fetched from the datastore, 3 save operations were made for other entities referenced by the book.
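To make the effect concrete, here is a plain-Java sketch that mimics a mapper instantiating one object per fetched row through its no-arg constructor. `FakeDatastore`, `BadBook`, `GoodBook` and `Loader` are hypothetical names, and the save counter only stands in for real Datastore writes.

```java
import java.lang.reflect.Constructor;
import java.util.ArrayList;
import java.util.List;

// Stand-in for the datastore: only counts how many saves were issued.
class FakeDatastore {
    static int saves = 0;

    static void save(Object entity) {
        saves++;
    }
}

// Anti-pattern: the no-arg constructor persists related entities.
class BadBook {
    BadBook() {
        FakeDatastore.save("related-1");
        FakeDatastore.save("related-2");
        FakeDatastore.save("related-3");
    }
}

// Fix: the constructor has no side effects; persistence happens explicitly elsewhere.
class GoodBook {
    GoodBook() {
    }
}

class Loader {
    // Creates 'count' instances via the no-arg constructor, as a mapper would on load.
    static <T> List<T> load(Class<T> type, int count) {
        List<T> result = new ArrayList<>();
        try {
            Constructor<T> ctor = type.getDeclaredConstructor();
            ctor.setAccessible(true);
            for (int i = 0; i < count; i++) {
                result.add(ctor.newInstance());
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
        return result;
    }
}
```

Loading 250 `BadBook` instances issues 750 saves, while loading 250 `GoodBook` instances issues none.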

datastore - using queries and transactions

I'm working with Objectify.
In my app, I have Employee entities and Task entities. Every task is created for a given employee. The Datastore generates the ids for Task entities.
But I have a restriction: I can't save a task for an employee if that task overlaps with another, already existing task.
I don't know how to implement that. I don't even know where to start. Should I use a transaction?
@Entity
class Task {
    @Id
    private Long id;
    @Index
    private long startDate; // since epoch
    @Index
    private long endDate;
    @Index
    private Ref<User> createdFor;
    public Task(String id, long startDate, long endDate, User assignedTo) {
        this.id = null;
        this.startDate = startDate;
        this.endDate = endDate;
        this.createdFor = Ref.create(assignedTo);
    }
}
@Entity
class Employee {
    @Id
    private String id;
    private String name;
    public Employee(String id, String name) {
        this.id = id;
        this.name = name;
    }
}
You can't do it with the entities you've set up, because between the time you query for tasks and the time you insert the new task, you can't guarantee that someone else hasn't already inserted a task that conflicts with yours. Even if you use a transaction, any concurrently added, conflicting tasks won't be part of your transaction's entity group, so there's the potential for violating your constraint.
Can you modify your architecture so that, instead of each task having a ref to the employee it's created for, every Employee contains a collection of the tasks created for that Employee? That way, when you're querying the Employee's tasks for conflicts, the Employee would be timestamped in your transaction's entity group, and if someone else put a new task into it before you finished putting yours, a concurrent modification exception would be thrown and you would then retry. But yes, have both your query and your put in the same transaction.
Read here about Transactions, Entity Groups and Optimistic Concurrency: https://code.google.com/p/objectify-appengine/wiki/Concepts#Transactions
As far as ensuring your tasks don't overlap, you just need to check whether either the start or the end date of your new task falls within the date range of any previous Task for the same employee. You also need to check that you're not setting a new task that starts before and ends after a previous task's date range. I suggest using a composite "and" filter for each of the tests, and then combining those three composite filters in a composite "or" filter, which will be the one you finally apply. There may be a more succinct way, but this is how I figure it:
Note these filters would not apply in the new architecture I'm suggesting. Maybe I'll delete them.
// Limit to tasks assigned to the same employee
Filter sameEmployeeTask =
    new FilterPredicate("createdFor", FilterOperator.EQUAL, thisCreatedFor);
// Check if the new startDate is in range of the prior task
Filter newTaskStartBeforePriorTaskEnd =
    new FilterPredicate("endDate", FilterOperator.GREATER_THAN, thisStartDate);
Filter newTaskStartAfterPriorTaskStart =
    new FilterPredicate("startDate", FilterOperator.LESS_THAN, thisStartDate);
Filter newTaskStartInPriorTaskRange =
    CompositeFilterOperator.and(sameEmployeeTask, newTaskStartBeforePriorTaskEnd, newTaskStartAfterPriorTaskStart);
// Check if the new endDate is in range of the prior task
Filter newTaskEndBeforePriorTaskEnd =
    new FilterPredicate("endDate", FilterOperator.GREATER_THAN, thisEndDate);
Filter newTaskEndAfterPriorTaskStart =
    new FilterPredicate("startDate", FilterOperator.LESS_THAN, thisEndDate);
Filter newTaskEndInPriorTaskRange =
    CompositeFilterOperator.and(sameEmployeeTask, newTaskEndBeforePriorTaskEnd, newTaskEndAfterPriorTaskStart);
// Check if this Task overlaps the prior one on both sides
Filter newTaskStartBeforePriorTaskStart =
    new FilterPredicate("startDate", FilterOperator.GREATER_THAN_OR_EQUAL, thisStartDate);
Filter newTaskEndAfterPriorTaskEnd =
    new FilterPredicate("endDate", FilterOperator.LESS_THAN_OR_EQUAL, thisEndDate);
Filter priorTaskRangeWithinNewTaskStartEnd =
    CompositeFilterOperator.and(sameEmployeeTask, newTaskStartBeforePriorTaskStart, newTaskEndAfterPriorTaskEnd);
// Combine them to test if any of the three returns one or more tasks
Filter newTaskOverlapPriorTask =
    CompositeFilterOperator.or(newTaskStartInPriorTaskRange, newTaskEndInPriorTaskRange, priorTaskRangeWithinNewTaskStartEnd);
// Proceed
Query q = new Query("Task").setFilter(newTaskOverlapPriorTask);
PreparedQuery pq = datastore.prepare(q);
If the query returns no results, then you don't have any overlaps, so go ahead and save the new task.
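One caveat worth flagging: as far as I know, the Datastore has traditionally allowed inequality filters on only one property per query, so filters that combine inequalities on both startDate and endDate may be rejected. If the tasks for one employee are loaded into memory instead, the three cases collapse into a single condition: two ranges overlap exactly when each one starts before the other ends. A minimal sketch, where `TaskOverlap` is a hypothetical helper and not part of Objectify or the Datastore API:

```java
// Hypothetical helper for an in-memory overlap check.
class TaskOverlap {
    // Returns true if the new task's range shares any time with the prior task's range.
    // Assumes start < end for both ranges (times in ms since epoch).
    // Ranges that merely touch (one ends exactly when the other starts) do not overlap.
    static boolean overlaps(long newStart, long newEnd, long priorStart, long priorEnd) {
        return newStart < priorEnd && priorStart < newEnd;
    }
}
```

This covers a new task starting inside a prior one, ending inside a prior one, and fully enclosing a prior one.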
OK, I hope I can be a bit more helpful. I've attempted to edit your question and change your entities to the right architecture. I've added an embedded collection of tasks and an attemptAdd method to your Employee. I've added a detectOverlap method to both your Task and your Employee. With these in place, you could use something like the transaction below. You will need to deal with the case where your task doesn't get added because there's a conflicting task, and also the case where the add fails due to a ConcurrentModificationException. But you could make another question out of those, and you should have the start you need in the meantime.
Adapted from: https://code.google.com/p/objectify-appengine/wiki/Transactions
Task myTask = new Task(startDate, endDate, description);
// Work<Boolean> is used instead of VoidWork so the transaction can return a result
public boolean assignTaskToEmployee(final Key<Employee> employeeKey, final Task myTask) {
    return ofy().transact(new Work<Boolean>() {
        public Boolean run() {
            Employee assignee = ofy().load().key(employeeKey).now();
            boolean taskAdded = assignee.attemptAdd(myTask);
            ofy().save().entity(assignee);
            return taskAdded;
        }
    });
}
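The answer mentions attemptAdd and detectOverlap without showing them. One possible shape is sketched below in plain Java, without Objectify annotations so that it runs on its own. The field names, the two-argument `Task` constructor and the `taskCount` helper are assumptions for illustration; in the real app, Employee would embed its tasks so that both live in one entity group.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the entities; annotations and persistence are left out.
class Task {
    final long startDate; // ms since epoch
    final long endDate;   // ms since epoch, must be after startDate

    Task(long startDate, long endDate) {
        if (endDate <= startDate) {
            throw new IllegalArgumentException("endDate must be after startDate");
        }
        this.startDate = startDate;
        this.endDate = endDate;
    }

    // True if the two tasks share any time; tasks that merely touch do not overlap.
    boolean detectOverlap(Task other) {
        return startDate < other.endDate && other.startDate < endDate;
    }
}

class Employee {
    private final List<Task> tasks = new ArrayList<>();

    // True if the candidate overlaps any task already assigned to this employee.
    boolean detectOverlap(Task candidate) {
        for (Task existing : tasks) {
            if (existing.detectOverlap(candidate)) {
                return true;
            }
        }
        return false;
    }

    // Adds the task only if it does not conflict; returns whether it was added.
    boolean attemptAdd(Task candidate) {
        if (detectOverlap(candidate)) {
            return false;
        }
        tasks.add(candidate);
        return true;
    }

    int taskCount() {
        return tasks.size();
    }
}
```

Because the check and the add both operate on the Employee, running them inside one transaction keeps the constraint within a single entity group.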

Google app engine JPA ancestor queries

I was wondering if there is any cost/performance difference in using ancestor queries.
Query q = em.createQuery("SELECT FROM File f WHERE f.parentID = :parentID AND f.someOtherNumber > :xx");
q.setParameter("parentID", KeyFactory.createKey("User", 2343334443334L));
q.setParameter("xx", 233);
// File class with ancestors
@Entity
class File {
    @Id
    @....
    public Key ID;
    @Extension(vendorName = "datanucleus", key = "gae.parent-pk", value = "true")
    public Key parentID;
};
OR
Query q = em.createQuery("SELECT FROM File f WHERE f.parentID = :parentID AND f.someOtherNumber > :xx");
q.setParameter("parentID", 2343334443334L);
q.setParameter("xx", 233);
// File class without ancestors
@Entity
class File {
    @Id
    @....
    public Key ID;
    public long parentID;
};
I was testing some stuff, and if I use the ancestor query, my index doesn't include parentID (it says "with ancestors"); in the non-ancestor version it does.
Is there a difference in index or datastore read/write cost?
The writing costs might be slightly lower (one fewer indexed property), but the storage costs might be slightly higher (a key for each child entity includes all of its ancestors).
In either case, the differences are insignificant unless you have a billion records. You will face more serious performance/cost differences depending on your data access patterns (i.e. how you access the data most of the time).

Most efficient way to do this select in JPA 2?

I have an Entity that looks like this:
@Entity
public class Relationship
{
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Key key;
    @Basic
    private UUID from;
    @Basic
    private UUID to;
}
Now I can have arbitrary levels of indirection here like so:
final Relationship r0 = new Relationship(a,b);
final Relationship r1 = new Relationship(b,c);
final Relationship r2 = new Relationship(c,d);
final Relationship rN = new Relationship(d,e);
What I want to do, as efficiently as possible, is: given a, get back e, where rN is N levels deep.
If I were writing regular SQL, I would do something like the following pseudocode:
SELECT r.to
FROM relationship r
WHERE r.from = 'a' AND
r.to NOT IN ( SELECT r.from FROM relationship r)
The only thing I can find online is references to passing a List as a parameter to a CriteriaBuilder.In, but I don't have the list; I need to use a sub-select as the list.
Also, this is using the Datastore in Google App Engine, which is restricted in what it supports via JPA 2.
Am I going to have to resort to the low-level Datastore API?
In the datastore, there's no way to issue a single query to get 'e' from 'a'. In fact, the only way to get e is to query each Relationship in turn, following the chain link by link, so you'll need to do four queries.
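A plain-Java sketch of that linear walk, with a `Map` standing in for the Datastore: each lookup corresponds to one query for the Relationship whose `from` matches. `ChainResolver` and the `visited` guard against cycles are assumptions added for illustration.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

class ChainResolver {
    // Follows from -> to links until no further link exists and returns the
    // last node reached. Each map lookup stands in for one Datastore query.
    static UUID resolve(Map<UUID, UUID> links, UUID start) {
        Set<UUID> visited = new HashSet<>();
        UUID current = start;
        while (links.containsKey(current)) {
            if (!visited.add(current)) {
                throw new IllegalStateException("cycle detected at " + current);
            }
            current = links.get(current);
        }
        return current;
    }
}
```

Being iterative, this walk does not grow the call stack, however long the chain is.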
You can pass in a list as a parameter, but that's only for an IN query. NOT IN queries are not available, and neither are JOINs.
(Aside: you could use a combination of the from and to properties to create a key, in which case you could just fetch the entity instead of query).
Usually, the GAE datastore way of doing things is to denormalize, i.e. write extra data that will enable your queries. (This is a pain, because it also means that when you update an entity, you need to be careful to update the denormalized data as well, and it can be hard to keep the two in sync. The datastore is designed for web-type traffic, where reads occur much more frequently than writes.)
This is a potential solution:
@Entity
public class Relationship
{
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Key key;
    @Basic
    private UUID from;
    @Basic
    private UUID to;
    @ElementCollection
    private Collection<UUID> reachable;
}
In this case you would simply query
WHERE from = 'a' and reachable = 'e'
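The price of this design is keeping `reachable` up to date on every write. A rough in-memory sketch of that bookkeeping follows; it uses `String` ids instead of UUIDs for brevity, assumes the chain has no cycles, and `Link` and `LinkStore` are hypothetical names rather than anything from JPA or the Datastore.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class Link {
    final String from;
    final String to;
    // Denormalized data: every node reachable by following this link.
    final Set<String> reachable = new HashSet<>();

    Link(String from, String to) {
        this.from = from;
        this.to = to;
        this.reachable.add(to);
    }
}

class LinkStore {
    private final List<Link> links = new ArrayList<>();

    // Adds from -> to and updates the reachable sets of the affected links.
    void add(String from, String to) {
        Link added = new Link(from, to);
        // The new link reaches whatever the links starting at 'to' already reach.
        for (Link l : links) {
            if (l.from.equals(to)) {
                added.reachable.addAll(l.reachable);
            }
        }
        // Every existing link that reaches 'from' now reaches all of that too.
        for (Link l : links) {
            if (l.reachable.contains(from)) {
                l.reachable.addAll(added.reachable);
            }
        }
        links.add(added);
    }

    // In-memory equivalent of: WHERE from = :from AND reachable = :target
    boolean canReach(String from, String target) {
        for (Link l : links) {
            if (l.from.equals(from) && l.reachable.contains(target)) {
                return true;
            }
        }
        return false;
    }
}
```

Each add may touch many existing links, which is the write-time cost traded for a single-query read.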
Solution
Surprisingly enough this recursive method doesn't error out with a StackOverflow even with 1000 levels of indirection, at least not on my local development server.
public UUID resolve(@Nonnull final UUID uuid)
{
    final EntityManager em = EMF.TRANSACTIONS_OPTIONAL.createEntityManager();
    try
    {
        final String qs = String.format("SELECT FROM %s a WHERE a.from = :from ", Alias.class.getName());
        final TypedQuery<Alias> q = em.createQuery(qs, Alias.class);
        q.setParameter("from", uuid);
        final Alias r;
        try
        {
            r = q.getSingleResult();
            final Key tok = KeyFactory.createKey(Alias.class.getSimpleName(), r.getTo().toString());
            if (em.find(Alias.class, tok) == null)
            {
                return r.getTo();
            }
            else
            {
                return this.resolve(r.getTo());
            }
        }
        catch (final NoResultException e)
        {
            /* this is expected when there are no more aliases */
            return uuid;
        }
    }
    finally
    {
        em.close();
    }
}
The stress test code I had is timing out on the actual GAE service, but I am not worried about it: in practice I won't be creating more than one level of indirection at a time, there won't be more than a handful of indirections, and it will all get hoisted up into Memcache in the final version anyway.

How to select out of a one to many property

I have an App Engine app which has been running for about a year now. I have mainly been using JDO queries until now, but I am trying to collect stats and the queries are taking too long. I have the following entity (Device):
public class Device implements Serializable {
    ...
    @Persistent
    private Set<Key> feeds; // Key for the Feed entity
    ...
}
I want to get a count of how many Devices have a certain Feed. I was previously doing it in JDOQL like this (using javax.jdo.Query):
Query query = pm.newQuery("select from Device where feeds.contains(:feedKey)");
Map<String, Object> paramsf = new HashMap<String, Object>();
paramsf.put("feedKey",feed.getId());
List<Device> results = (List<Device>) query.executeWithMap(paramsf);
This code now times out, though. I tried using the Datastore API so that I could set the chunk size, etc. to see if I could speed the query up or use a cursor, but I am unsure how to search in a Set field. I was trying this (using com.google.appengine.api.datastore.Query):
Query query = new Query("Device");
query.addFilter("feeds", FilterOperator.IN, feed.getId());
query.setKeysOnly();
final FetchOptions fetchOptions = FetchOptions.Builder.withPrefetchSize(100).chunkSize(100);
QueryResultList<Entity> results = dss.prepare(query).asQueryResultList(fetchOptions);
Essentially, I am unsure how to search in the one-to-many field (feeds) for a single key. Is it possible to index it somehow?
hope it makes sense....
Lists (and other things that are implemented as lists, like sets) are indexed individually. As a result, you can simply use an equality filter in your query, the same as if you were filtering on a single value rather than a list. A record will be returned if any of the items in the list match.
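In the low-level API code from the question, that means filtering with FilterOperator.EQUAL rather than FilterOperator.IN. The matching rule itself can be sketched in plain Java. This is an in-memory stand-in, not Datastore code: `String` replaces `Key`, and this `Device` class and the `FeedStats` helper are made up for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class Device {
    final String name;
    final Set<String> feeds; // stand-in for Set<Key>

    Device(String name, String... feeds) {
        this.name = name;
        this.feeds = new HashSet<>(Arrays.asList(feeds));
    }
}

class FeedStats {
    // Mirrors the semantics of an equality filter on a multi-valued property:
    // a device matches if ANY of its feed keys equals the requested one.
    static long countDevicesWithFeed(List<Device> devices, String feedKey) {
        return devices.stream()
                .filter(d -> d.feeds.contains(feedKey))
                .count();
    }
}
```

Against the real Datastore the same count would come from a keys-only query, so no entity bodies need to be fetched.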

Resources