How to ignore errors in datastore.Query.GetAll()? - google-app-engine

I just started developing a GAE app with the Go runtime, so far it's been a pleasure. However, I have encountered the following setback:
I am taking advantage of the flexibility that the datastore provides by having several different structs with different properties being saved with the same entity name ("Item"). The Go language datastore reference states that "the actual types passed do not have to match between Get and Put calls or even across different App Engine requests", since entities are actually just a series of properties, and can therefore be stored in an appropriate container type that can support them.
I need to query all of the entities stored under the entity name "Item" and encode them as JSON all at once. Using that entity property flexibility to my advantage, it is possible to store queried entities into an arbitrary datastore.PropertyList, however, the Get and GetAll functions return ErrFieldMismatch as an error when a property of the queried entities cannot be properly represented (that is to say, incompatible types, or simply a missing value). All of these structs I'm saving are user generated and most values are optional, therefore saving empty values into the datastore. There are no problems while saving these structs with empty values (datastore flexibility again), but there are when retrieving them.
It is also stated in the datastore Go documentation, that it is up to the caller of the Get methods to decide if the errors returned due to empty values are ignorable, recoverable, or fatal. I would like to know how to properly do this, since just ignoring the errors won't suffice, as the destination structs (datastore.PropertyList) of my queries are not filled at all when a query results in this error.
Thank you in advance, and sorry for the lengthy question.
Update: Here is some code
query := datastore.NewQuery("Item") // here I use some Filter calls, as well as a Limit call and an Order call
items := make([]datastore.PropertyList, 0)
_, err := query.GetAll(context, &items) // context has been obviously defined before
if err != nil {
// something to handle the error, which in my case, it's printing it and setting the server status as 500
}
Update 2: Here is some output
If I use make([]datastore.PropertyList, 0), I get this:
datastore: invalid entity type
And if I use make(datastore.PropertyList, 0), I get this:
datastore: cannot load field "Foo" into a "datastore.Property": no such struct field
And in both cases (the first one I assume can be discarded) in items I get this:
[]

According to the following post the go datastore module doesn't support PropertyList yet.
Use a pointer to a slice of datastore.Map instead.

Related

NDB Queries Exceeding GAE Soft Private Memory Limit

I currently have a an application running in the Google App Engine Standard Environment, which, among other things, contains a large database of weather data and a frontend endpoint that generates graph of this data. The database lives in Google Cloud Datastore, and the Python Flask application accesses it via the NDB library.
My issue is as follows: when I try to generate graphs for WeatherData spanning more than about a week (the data is stored for every 5 minutes), my application exceeds GAE's soft private memory limit and crashes. However, stored in each of my WeatherData entities are the relevant fields that I want to graph, in addition to a very large json string containing forecast data that I do not need for this graphing application. So, the part of the WeatherData entities that is causing my application to exceed the soft private memory limit is not even needed in this application.
My question is thus as follows: is there any way to query only certain properties in the entity, such as can be done for specific columns in a SQL-style query? Again, I don't need the entire forecast json string for graphing, only a few other fields stored in the entity. The other approach I tried to run was to only fetch a couple of entities out at a time and split the query into multiple API calls, but it ended up taking so long that the page would time out and I couldn't get it to work properly.
Below is my code for how it is currently implemented and breaking. Any input is much appreciated:
wDataCsv = 'Time,' + ','.join(wData.keys())
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc),start=start_date,end=end_date)
for acct in qry.fetch():
d = [acct.time.strftime(date_string)]
for attr in wData.keys():
d.append(str(acct.dict_access(attr)))
wData[attr].append([acct.time.strftime(date_string),acct.dict_access(attr)])
wDataCsv += '\\n' + ','.join(d)
# Children Entity - log of a weather at parent location
class WeatherData(ndb.Model):
# model for data to save
...
# Function for querying data below a given ancestor between two optional
# times
#classmethod
def time_ordered_query(cls, ancestor_key, start=None, end=None):
return cls.query(cls.time>=start, cls.time<=end,ancestor=ancestor_key).order(-cls.time)
EDIT: I tried the iterative page fetching strategy described in the link from the answer below. My code was updated to the following:
wDataCsv = 'Time,' + ','.join(wData.keys())
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc),start=start_date,end=end_date)
cursor = None
while True:
gc.collect()
fetched, next_cursor, more = qry.fetch_page(FETCHNUM, start_cursor=cursor)
if fetched:
for acct in fetched:
d = [acct.time.strftime(date_string)]
for attr in wData.keys():
d.append(str(acct.dict_access(attr)))
wData[attr].append([acct.time.strftime(date_string),acct.dict_access(attr)])
wDataCsv += '\\n' + ','.join(d)
if more and next_cursor:
cursor = next_cursor
else:
break
where FETCHNUM=500. In this case, I am still exceeding the soft private memory limit for queries of the same length as before, and the query takes much, much longer to run. I suspect the problem may be with Python's garbage collector not deleting the already used information that is re-referenced, but even when I include gc.collect() I see no improvement there.
EDIT:
Following the advice below, I fixed the problem using Projection Queries. Rather than have a separate projection for each custom query, I simply ran the same projection each time: namely querying all properties of the entity excluding the JSON string. While this is not ideal as it still pulls gratuitous information from the database each time, generating individual queries of each specific query is not scalable due to the exponential growth of necessary indices. For this application, as each additional property is negligible additional memory (aside form that json string), it works!
You can use projection queries to fetch only the properties of interest from each entity. Watch out for the limitations, though. And this still can't scale indefinitely.
You can split your queries across multiple requests (more scalable), but use bigger chunks, not just a couple (you can fetch 500 at a time) and cursors. Check out examples in How to delete all the entries from google datastore?
You can bump your instance class to one with more memory (if not done already).
You can prepare intermediate results (also in the datastore) from the big entities ahead of time and use these intermediate pre-computed values in the final stage.
Finally you could try to create and store just portions of the graphs and just stitch them together in the end (only if it comes down to that, I'm not sure how exactly it would be done, I imagine it wouldn't be trivial).

Batch collecting in Solr PostFilter

What I want to do is exactly like this:
Solr: How to perform a batch request to an external system from a PostFilter?
and the approach I took is similar:
-don't call super.collect(docId) in the collect method of the PostFilter but store all docIds in an internal map
-call the external system in the finish() then call super.collect(docId) for all the docs that pass the external filtering
The problem I have: docId exceeds maxDoc "(docID must be >= 0 and < maxDoc=100000 (got docID=123456)"
I suspect I am storing local docIds and when Reader is changed, docBase is also changed so the global docId, which I believe is constructed in super.collect(docId) using the parameter docId and docBase, becomes incorrect. I've tried storing super.delegate.getLeafCollector(context) along with docId and call super.delegate.getLeafCollector(context).collect() instead of super.collect() but this doesn't work either (got a null pointer exception)
Look at the code for the CollapsingQParserPlugin in the Solr codebase, particularly CollapsingScoreCollector.finish.
The docId's you receive in the collect call are not globally unique. The Collapsing collector makes them unique by adding the docBase from the context to the local docId to create a globalDoc during the collect() phase.
Then in the finish() phase, you must find the context containing the doc in question and set the reader/leafDelegate depending on what version of Solr your running. Specifying the right docId with the wrong context will throw Exceptions. For the Collapsing collector, you iterate through the contexts until you find the first docBase smaller than the globalDoc.
Finally, if you added docBase in collect(), don't forget to subtract docBase in finish() when you call collect() on the appropriate DelegationCollector object, as the author may or may not have done the first time.

Neo4j output format

After working with neo4j and now coming to the point of considering to make my own entity manager (object manager) to work with the fetched data in the application, i wonder about neo4j's output format.
When i run a query it's always returned as tabular data. Why is this??
Sure tables keep a big place in data and processing, but it seems so strange that a graph database can only output in this format.
Now when i want to create an object graph in my application i would have to hydrate all the objects and this is not really good for performance and doesn't leverage true graph performace.
Consider MATCH (A)-->(B) RETURN A, B when there is one A and three B's, it would return:
A B
1 1
1 2
1 3
That's the same A passed down 3 times over the database connection, while i only need it once and i know this before the data is fetched.
Something like this seems great http://nigelsmall.com/geoff
a load2neo is nice, a load-from-neo would also be nice! either in the geoff format or any other formats out there https://gephi.org/users/supported-graph-formats/
Each language could then implement it's own functions to create the objects directly.
To clarify:
Relations between nodes are lost in tabular data
Redundant (non-optimal) format for graphs
Edges (relations) and vertices (nodes) are usually not in the same table. (makes queries more complex?)
Another consideration (which might deserve it's own post), what's a good way to model relations in an object graph? As objects? or as data/method inside the node objects?
#Kikohs
Q: What do you mean by "Each language could then implement it's own functions to create the objects directly."?
A: With an (partial) graph provided by the database (as result of a query) a language as PHP could provide a factory method (in C preferably) to construct the object graph (this is usually an expensive operation). But only if the object graph is well defined in a standard format (because this function should be simple and universal).
Q: Do you want to export the full graph or just the result of a query?
A: The result of a query. However a query like MATCH (n) OPTIONAL MATCH (n)-[r]-() RETURN n, r should return the full graph.
Q: you want to dump to the disk the subgraph created from the result of a query ?
A: No, existing interfaces like REST are prefered to get the query result.
Q: do you want to create the subgraph which comes from a query in memory and then request it in another language ?
A: no i want the result of the query in another format then tabular (examples mentioned)
Q: You make a query which only returns the name of a node, in this case, would you like to get the full node associated or just the name ? Same for the edges.
A: Nodes don't have names. They have properties, labels and relations. I would like enough information to retrieve A) The node ID, it's labels, it's properties and B) the relation to other nodes which are in the same result.
Note that the first part of the question is not a concrete "how-to" question, rather "why is this not possible?" (or if it is, i like to be proven wrong on this one). The second is a real "how-to" question, namely "how to model relations". The two questions have in common that they both try to find the answer to "how to get graph data efficiently in PHP."
#Michael Hunger
You have a point when you say that not all result data can be expressed as an object graph. It reasonable to say that an alternative output format to a table would only be complementary to the table format and not replacing it.
I understand from your answer that the natural (rawish) output format from the database is the result format with duplicates in it ("streams the data out as it comes"). I that case i understand that it's now left to an alternative program (of the dev stack) to do the mapping. So my conclusion on neo4j implementing something like this:
Pro's - not having to do this in every implementation language (of the application)
Con's - 1) no application specific mapping is possible, 2) no performance gain if implementation language is fast
"Even if you use geoff, graphml or the gephi format you have to keep all the data in memory to deduplicate the results."
I don't understand this point entirely, are you saying that these formats are no able to hold deduplicated results (in certain cases)?? So infact that there is no possible textual format with which a graph can be described without duplication??
"There is also the questions on what you want to include in your output?"
I was under the assumption that the cypher language was powerful enough to specify this in the query. And so the output format would have whatever the database can provide as result.
"You could just return the paths that you get, which are unique paths through the graph in themselves".
Useful suggestion, i'll play around with this idea :)
"The dump command of the neo4j-shell uses the approach of pulling the cypher results into an in-memory structure, enriching it".
Does the enriching process fetch additional data from the database or is the data already contained in the initial result?
There is more to it.
First of all as you said tabular results from queries are really commonplace and needed to integrate with other systems and databases.
Secondly oftentimes you don't actually return raw graph data from your queries, but aggregated, projected, sliced, extracted information out of your graph. So the relationships to the original graph data are already lost in most of the results of queries I see being used.
The only time that people need / use the raw graph data is when to export subgraph-data from the database as a query result.
The problem of doing that as a de-duplicated graph is that the db has to fetch all the result data data in memory first to deduplicate, extract the needed relationships etc.
Normally it just streams the data out as it comes and uses little memory with that.
Even if you use geoff, graphml or the gephi format you have to keep all the data in memory to deduplicate the results (which are returned as paths with potential duplicate nodes and relationships).
There is also the questions on what you want to include in your output? Just the nodes and rels returned? Or additionally all the other rels between the nodes that you return? Or all the rels of the returned nodes (but then you also have to include the end-nodes of those relationships).
You could just return the paths that you get, which are unique paths through the graph in themselves:
MATCH p = (n)-[r]-(m)
WHERE ...
RETURN p
Another way to address this problem in Neo4j is to use sensible aggregations.
E.g. what you can do is to use collect to aggregate data per node (i.e. kind of subgraphs)
MATCH (n)-[r]-(m)
WHERE ...
RETURN n, collect([r,type(r),m])
or use the new literal map syntax (Neo4j 2.0)
MATCH (n)-[r]-(m)
WHERE ...
RETURN {node: n, neighbours: collect({ rel: r, type: type(r), node: m})}
The dump command of the neo4j-shell uses the approach of pulling the cypher results into an in-memory structure, enriching it and then outputting it as cypher create statement(s).
A similar approach can be used for other output formats too if you need it. But so far there hasn't been the need.
If you really need this functionality it makes sense to write a server-extension that uses cypher for query specification, but doesn't allow return statements. Instead you would always use RETURN *, aggregate the data into an in-memory structure (SubGraph in the org.neo4j.cypher packages). And then render it as a suitable format (e.g. JSON or one of those listed above).
These could be a starting points for that:
https://github.com/jexp/cypher-rs
https://github.com/jexp/cypher_websocket_endpoint
https://github.com/neo4j-contrib/rabbithole/blob/master/src/main/java/org/neo4j/community/console/SubGraph.java#L123
There are also other efforts, like GraphJSON from GraphAlchemist: https://github.com/GraphAlchemist/GraphJSON
And the d3 json format is also pretty useful. We use it in the neo4j console (console.neo4j.org) to return the graph visualization data that is then consumed by d3 directly.
I've been working with neo4j for a while now and I can tell you that if you are concerned about memory and performances you should drop cypher at all, and use indexes and the other graph-traversal methods instead (e.g. retrieve all the relationships of a certain type from or to a start node, and then iterate over the found nodes).
As the documentation says, Cypher is not intended for in-app usage, but more as a administration tool. Furthermore, in production-scale environments, it is VERY easy to crash the server by running the wrong query.
In second place, there is no mention in the docs of an API method to retrieve the output as a graph-like structure. You will have to process the output of the query and build it.
That said, in the example you give you say that there is only one A and that you know it before the data is fetched, so you don't need to do:
MATCH (A)-->(B) RETURN A, B
but just
MATCH (A)-->(B) RETURN B
(you don't need to receive A three times because you already know these are the nodes connected with A)
or better (if you need info about the relationships) something like
MATCH (A)-[r]->(B) RETURN r

Projection query with new fields/properites ignores entries that haven't set those properties yet

I have an Article type structured like this:
type Article struct {
Title string
Content string `datastore:",noindex"`
}
In an administrative portion of my site, I list all of my Articles. The only property I need in order to display this list is Title; grabbing the content of the article seems wasteful. So I use a projection query:
q := datastore.NewQuery("Article").Project("Title")
Everything works as expected so far. Now I decide I'd like to add two fields to Article so that some articles can be unlisted in the public article list and/or unviewable when access is attempted. Understanding the datastore to be schema-less, I think this might be very simple. I add the two new fields to Article:
type Article struct {
Title string
Content string `datastore:",noindex"`
Unlisted bool
Unviewable bool
}
I also add them to the projection query, since I want to indicate in the administrative article list when an article is publicly unlisted and/or unviewable:
q := datastore.NewQuery("Article").Project("Title", "Unlisted", "Unviewable")
Unfortunately, this only returns entries that have explicitly included Unlisted and Unviewable when Put into the datastore.
My workaround for now is to simply stop using a projection query:
q := datastore.NewQuery("Article")
All entries are returned, and the entries that never set Unlisted or Unviewable have them set to their zero value as expected. The downside is that the article content is being passed around needlessly.
In this case, that compromise isn't terrible, but I expect similar situations to arise in the future, and it could be a big deal not being able to use projection queries. Projections queries and adding new properties to datastore entries seem like they're not fitting together well. I want to make sure I'm not misunderstanding something or missing the correct way to do things.
It's not clear to me from the documentation that projection queries should behave this way (ignoring entries that don't have the projected properties rather than including them with zero values). Is this the intended behavior?
Are the only options in scenarios like this (adding new fields to structs / properties to entries) to either forgo projection queries or run some kind of "schema migration", Getting all entries and then Puting them back, so they now have zero-valued properties and can be projected?
Projection queries source the data for fields from the indexes not the entity, when you have added new properties pre-existing records do not appear in those indexes you are performing the project query on. They will need to be re-indexed.
You are asking for those specific properties and they don't exist hence the current behaviour.
You should probably think of a projection query as a request for entities with a value in a requested index in addition to any filter you place on a query.

What ways does Go have to easily convert data into bytes or strings

I've been developing a couple apps using Google App Engine Go SDK that use Memcache as a buffer for loading data from the Datastore. As the Memcache is only able to store data as []byte, I often find myself creating functions to encode the various structures as strings, and also functions to reverse the process. Needless to say, it is quite tedious when I need to do this sort of thing 5 times over.
Is there an easy way to convert any arbitrary structure that can be stored in Datastore into []byte in order to store it in Memcache and then load it back without having to create custom code for various structures in GAE Golang?
http://golang.org/pkg/encoding/gob or http://golang.org/pkg/encoding/json can turn arbitrary datatypes into []byte slices given certain rules apply to the datastructures being encoded. You probably want one of them gob will encode to smaller sizes but json is more easily shareable with other languages if that is a requirement.
I found myself needing the same thing. So I created a package called:
AEGo/ds
Documentation | Source
go get github.com/scotch/aego/ds
It uses the same API as the "appengine/datastore" so It will work as a drop in replacement.
import "github.com/scotch/aego/v1/ds"
u = &User{Name: "Bob"}
key := datastore.NewKey(c, "User", "bob", 0, nil)
key, err := ds.Put(c, key, u)
u = new(User)
err = ds.Get(c, key, u)
By default it will cache all Puts and Gets to memcache, but you can modify this behavior by calling the ds.Register method:
ds.Register("User", true, false, false)
The Register method takes a string representing the Kind and 3 bool - userDatastore, useMemcache, useMemory. Passing a true value will cause AEgo/ds to persist the record to that store. The Memory store is useful for records that you do not expect to change, but could contain stale data if you have more then one instance running.
Supported methods are:
Put
PutMulti
Get
GetMulti
Delete
DeleteMulti
AllocateIDs
Note: Currently cashing only occures with Get. GetMulti pulls from the datastore.
AEGo/ds is a work in progress, but the code is well tested. Any feedback would be appreciated.
And to answer you question here's how I serialized the entities to gob for the memcache persistence.

Resources