GAE Go size of datastore - database

Is there some function one can call to get the number of entries in the GAE Go Datastore of an app, without querying the whole database and counting the results?

c := appengine.NewContext(r)

var result struct {
    Bytes     int64          `datastore:"bytes"`
    Count     int64          `datastore:"count"`
    Timestamp datastore.Time `datastore:"timestamp"`
}

// The datastore statistics are exposed as special __Stat_*__ entities;
// __Stat_Total__ holds the totals for the whole app.
if _, err := datastore.NewQuery("__Stat_Total__").Run(c).Next(&result); err != nil {
    c.Errorf("stats query failed: %v", err)
    return
}
c.Infof("count: %d", result.Count)

You can view the size of all entities in the admin console under Data > Datastore Statistics.
These stats can be queried programmatically from Python or Java; I couldn't find a documented equivalent for Go.

Related

How to effectively delete Google App engine Search API index

I have found some similar questions here, but no solid answer:
How to delete or reset a search index in Appengine
how to delete search index in GAE Search API
How to delete a search index on the App Engine using Go?
How to delete a search Index itself
I have seen a Googler suggest that:
You can effectively delete an index by first using index.delete() to remove all of the documents from an index, and then using index.delete_schema() to remove the type mappings from the index [1].
Unfortunately, the Go SDK does not have an "index.delete_schema()" API.
I can only delete documents one by one, by getting the document ID list from the index. And we got a surprising billing status in the dashboard:
Resource: Search API Simple Searches
Usage: 214,748.49 10K Ops
Billable: 214,748.39
Price: $0.625 / 10K Ops
Cost: $134,217.74
Can someone tell me how to effectively delete a Google App Engine Search API index without it costing so much?
Unfortunately there is no simple operation that allows you to delete an entire large search index without incurring substantial cost, short of deleting the entire app (which, actually, could be an effective approach in certain circumstances).
The short answer is NO.
There is no perfectly efficient way in GCP to drop a full search index in one go.
The only efficient way they themselves suggest in their "Best Practices" is to delete in batches of 200 documents per index.delete() call (in the Java and Python App Engine SDKs).
To add to the disappointment, the Go SDK does not even support that, and allows only one document deletion per call. What miserable support from GCP!
So if your indexes have grown to a good few GB, you are forced to spend your dollars and days, or rather clean up the mess left by GCP, at your own cost. Mind you, it costs a lot with a giant index (>10 GB).
Now, how to do it in the Go runtime:
Do not do it with the Go runtime.
Better, write a micro-service in Java or Python under the same project ID and use those runtimes with their SDKs/client libraries to delete the index the only efficient way GCP offers (200 documents per call). So it is a very limited and essentially cost-bearing solution with App Engine. You have to live with it :)
PS: I created a bug/issue about this a year back. No action has been taken yet :)
As you mentioned, deleting an index is only available for Java 8 at the moment.
Since you're using Go, there is currently no way to delete an index itself, but you can remove the documents that are part of it to reduce the cost.
To delete a document from an index you can follow this example:
func deleteHandler(w http.ResponseWriter, r *http.Request) {
    ctx := appengine.NewContext(r)
    index, err := search.Open("users")
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    id := "PA6-5000"
    err = index.Delete(ctx, id)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    fmt.Fprint(w, "Deleted document: ", id)
}
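If you need to empty a whole index from Go, the best you can do is loop over the document IDs and delete them individually. A minimal sketch, assuming the google.golang.org/appengine/search package; "users" is just the example index name from above, every Delete call is still a billed operation, and a large index would have to be processed in chunks (e.g. from a task queue) to stay within the request deadline:

func clearIndexHandler(w http.ResponseWriter, r *http.Request) {
    ctx := appengine.NewContext(r)
    index, err := search.Open("users")
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    // Collect the document IDs first (an IDs-only search is cheaper than
    // fetching full documents), then delete them one at a time, since the
    // Go SDK only deletes a single document per call.
    var ids []string
    for it := index.Search(ctx, "", &search.SearchOptions{IDsOnly: true}); ; {
        id, err := it.Next(nil)
        if err == search.Done {
            break
        }
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
        ids = append(ids, id)
    }
    for _, id := range ids {
        if err := index.Delete(ctx, id); err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
    }
    fmt.Fprintf(w, "Deleted %d documents", len(ids))
}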

NDB Queries Exceeding GAE Soft Private Memory Limit

I currently have an application running in the Google App Engine Standard Environment, which, among other things, contains a large database of weather data and a frontend endpoint that generates graphs of this data. The database lives in Google Cloud Datastore, and the Python Flask application accesses it via the NDB library.
My issue is as follows: when I try to generate graphs for WeatherData spanning more than about a week (the data is stored for every 5 minutes), my application exceeds GAE's soft private memory limit and crashes. However, stored in each of my WeatherData entities are the relevant fields that I want to graph, in addition to a very large json string containing forecast data that I do not need for this graphing application. So, the part of the WeatherData entities that is causing my application to exceed the soft private memory limit is not even needed in this application.
My question is thus as follows: is there any way to query only certain properties in the entity, such as can be done for specific columns in a SQL-style query? Again, I don't need the entire forecast json string for graphing, only a few other fields stored in the entity. The other approach I tried to run was to only fetch a couple of entities out at a time and split the query into multiple API calls, but it ended up taking so long that the page would time out and I couldn't get it to work properly.
Below is my code for how it is currently implemented and breaking. Any input is much appreciated:
wDataCsv = 'Time,' + ','.join(wData.keys())
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc), start=start_date, end=end_date)
for acct in qry.fetch():
    d = [acct.time.strftime(date_string)]
    for attr in wData.keys():
        d.append(str(acct.dict_access(attr)))
        wData[attr].append([acct.time.strftime(date_string), acct.dict_access(attr)])
    wDataCsv += '\n' + ','.join(d)

# Child entity - a log of the weather at the parent location
class WeatherData(ndb.Model):
    # model for data to save
    ...

    # Query for data below a given ancestor between two optional times
    @classmethod
    def time_ordered_query(cls, ancestor_key, start=None, end=None):
        return cls.query(cls.time >= start, cls.time <= end, ancestor=ancestor_key).order(-cls.time)
EDIT: I tried the iterative page fetching strategy described in the link from the answer below. My code was updated to the following:
wDataCsv = 'Time,' + ','.join(wData.keys())
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc), start=start_date, end=end_date)
cursor = None
while True:
    gc.collect()
    fetched, next_cursor, more = qry.fetch_page(FETCHNUM, start_cursor=cursor)
    if fetched:
        for acct in fetched:
            d = [acct.time.strftime(date_string)]
            for attr in wData.keys():
                d.append(str(acct.dict_access(attr)))
                wData[attr].append([acct.time.strftime(date_string), acct.dict_access(attr)])
            wDataCsv += '\n' + ','.join(d)
    if more and next_cursor:
        cursor = next_cursor
    else:
        break
where FETCHNUM=500. In this case, I am still exceeding the soft private memory limit for queries of the same length as before, and the query takes much, much longer to run. I suspect the problem may be with Python's garbage collector not deleting the already used information that is re-referenced, but even when I include gc.collect() I see no improvement there.
EDIT:
Following the advice below, I fixed the problem using projection queries. Rather than have a separate projection for each custom query, I simply ran the same projection each time: querying all properties of the entity except the JSON string. While this is not ideal, as it still pulls gratuitous information from the database each time, generating an individual projection for each specific query is not scalable due to the exponential growth of the necessary indices. For this application, since each additional property adds negligible memory (aside from that JSON string), it works!
You can use projection queries to fetch only the properties of interest from each entity. Watch out for the limitations, though. And this still can't scale indefinitely.
You can split your queries across multiple requests (more scalable), but use bigger chunks (you can fetch 500 at a time) together with cursors, not just a couple of entities at a time. Check out the examples in How to delete all the entries from google datastore?
You can bump your instance class to one with more memory (if not done already).
You can prepare intermediate results (also in the datastore) from the big entities ahead of time and use these intermediate pre-computed values in the final stage.
Finally, you could try to create and store just portions of the graphs and stitch them together at the end (only if it comes down to that; I'm not sure exactly how it would be done, and I imagine it wouldn't be trivial).

FetchOptions withLimit() does not reduce query execution time (Google App Engine)

Problem
Running a datastore query with or without FetchOptions.Builder.withLimit(100) takes the same execution time! Why is that? Isn't the limit method intended to reduce the time to retrieve results!?
Test setup
I am locally testing the execution time of some datastore queries with Google's App Engine. I am using the Google Cloud SDK Standard Environment with the App Engine SDK 1.9.59.
For the test, I created an example entity with 5 indexed properties and 5 unindexed properties. I filled the datastore with 50,000 entities of this test kind. I run the following method to retrieve 100 of these entities using the withLimit() method.
public List<Long> getTestIds() {
    List<Long> ids = new ArrayList<>();
    FetchOptions fetchOptions = FetchOptions.Builder.withLimit(100);
    Query q = new Query("test_kind").setKeysOnly();
    for (Entity entity : datastore.prepare(q).asIterable(fetchOptions)) {
        ids.add(entity.getKey().getId());
    }
    return ids;
}
I measure the time before and after calling this method:
long start = System.currentTimeMillis();
int size = getTestIds().size();
long end = System.currentTimeMillis();
log.info("time: " + (end - start) + " results: " + size);
I log the execution time and the number of returned results.
Results
When I do not use the withLimit() FetchOptions for the query, I get the expected 50,000 results in about 1740 ms. Nothing surprising here.
If I run the code as displayed above with withLimit(100), I get the expected 100 results. However, the query still takes about the same 1740 ms!
I tested with different numbers of datastore entries and different limits. Every time, the queries with and without withLimit(100) took the same time.
Question
Why is the query still fetching all entities? Surely the query is not supposed to fetch all entities when the limit is set to 100, right? What am I missing? Is there some datastore configuration for that? After testing and searching the web for 4 days I still can't find the problem.
FWIW, you shouldn't expect meaningful results from datastore performance tests performed locally, using either the development server or the datastore emulator - they're just emulators, they don't have the same performance (or even the 100% equivalent functionality) as the real datastore.
See for example Datastore fetch VS fetch(keys_only=True) then get_multi (including comments)

Extracting result from Google BigQuery to cloud storage golang

I am using the following GoLang package: https://godoc.org/cloud.google.com/go/bigquery
My app runs in Google App Engine
If I have understood the documentation correctly, it should be possible to extract the result of a job/query to Google Cloud Storage using a job. I don't think the documentation is very clear, and I was wondering if anyone has example code or other help.
TL;DR
How do I get access to the temporary table when using Go instead of the command line?
How do I extract the result of my BigQuery query to GCS?
EDIT: The solution I used
I created a temporary table, set it as the Dst (destination) of the query result, and created an export job with it.
dataset_result.Table(table_name).Create(ctx, bigquery.TableExpiration(time.Now().Add(1*time.Hour)))
Update 2018:
https://github.com/GoogleCloudPlatform/google-cloud-go/issues/547
To get the table name:
q := client.Query(...)
job, err := q.Run(ctx)
// handle err
config, err := job.Config()
// handle err
tempTable := config.(*bigquery.QueryConfig).Dst
How do I extract the result of my BigQuery query to GCS?
You cannot directly write the results of a query to GCS. You first need to run the query, save the results to a permanent table, and then kick off an export job to GCS.
https://cloud.google.com/bigquery/docs/exporting-data
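For illustration, once the query results are in a table you control (e.g. the Dst table created above), the export job can be kicked off roughly like this. This is a sketch against cloud.google.com/go/bigquery; the dataset, table, bucket, and object names are placeholders:

// Sketch: export a BigQuery table to Google Cloud Storage.
// "my_dataset", "my_table", "my-bucket" and "result.csv" are placeholders.
gcsRef := bigquery.NewGCSReference("gs://my-bucket/result.csv")
gcsRef.DestinationFormat = bigquery.CSV

extractor := client.Dataset("my_dataset").Table("my_table").ExtractorTo(gcsRef)
job, err := extractor.Run(ctx)
if err != nil {
    // handle err
}
status, err := job.Wait(ctx)
if err != nil {
    // handle err
}
if err := status.Err(); err != nil {
    // the extract job itself failed
}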
How do I get access to the temporary table when using Go instead of the command line?
You can use the jobs API, or look in the query history if using the web UI. See here and here.
https://cloud.google.com/bigquery/querying-data#temporary_and_permanent_tables

What ways does Go have to easily convert data into bytes or strings

I've been developing a couple of apps using the Google App Engine Go SDK that use Memcache as a buffer for loading data from the Datastore. As Memcache is only able to store data as []byte, I often find myself writing functions to encode the various structures as strings, and also functions to reverse the process. Needless to say, it is quite tedious when I need to do this sort of thing five times over.
Is there an easy way to convert any arbitrary structure that can be stored in Datastore into []byte in order to store it in Memcache and then load it back without having to create custom code for various structures in GAE Golang?
http://golang.org/pkg/encoding/gob or http://golang.org/pkg/encoding/json can turn arbitrary datatypes into []byte slices, provided certain rules apply to the data structures being encoded. You probably want one of these two: gob encodes to smaller sizes, but JSON is more easily shareable with other languages if that is a requirement.
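For example, a minimal gob round trip could look like this (a sketch; Weather is just a placeholder struct, and the same helpers work for any gob-encodable type):

package main

import (
    "bytes"
    "encoding/gob"
    "fmt"
)

// Weather is a placeholder for whatever struct you want to cache.
type Weather struct {
    Station string
    TempC   float64
}

// toBytes encodes any value into a []byte suitable for memcache.
func toBytes(v interface{}) ([]byte, error) {
    var buf bytes.Buffer
    if err := gob.NewEncoder(&buf).Encode(v); err != nil {
        return nil, err
    }
    return buf.Bytes(), nil
}

// fromBytes decodes a []byte produced by toBytes back into dst.
func fromBytes(b []byte, dst interface{}) error {
    return gob.NewDecoder(bytes.NewReader(b)).Decode(dst)
}

func main() {
    b, err := toBytes(&Weather{Station: "KSFO", TempC: 14.5})
    if err != nil {
        panic(err)
    }
    var w Weather
    if err := fromBytes(b, &w); err != nil {
        panic(err)
    }
    fmt.Printf("%+v\n", w)
}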
I found myself needing the same thing. So I created a package called:
AEGo/ds
Documentation | Source
go get github.com/scotch/aego/ds
It uses the same API as "appengine/datastore", so it will work as a drop-in replacement.
import "github.com/scotch/aego/v1/ds"
u = &User{Name: "Bob"}
key := datastore.NewKey(c, "User", "bob", 0, nil)
key, err := ds.Put(c, key, u)
u = new(User)
err = ds.Get(c, key, u)
By default it will cache all Puts and Gets to memcache, but you can modify this behavior by calling the ds.Register method:
ds.Register("User", true, false, false)
The Register method takes a string representing the Kind and three bools: useDatastore, useMemcache, useMemory. Passing a true value will cause AEGo/ds to persist the record to that store. The Memory store is useful for records that you do not expect to change, but it could contain stale data if you have more than one instance running.
Supported methods are:
Put
PutMulti
Get
GetMulti
Delete
DeleteMulti
AllocateIDs
Note: currently caching only occurs with Get; GetMulti pulls from the datastore.
AEGo/ds is a work in progress, but the code is well tested. Any feedback would be appreciated.
And to answer your question, here's how I serialized the entities to gob for the memcache persistence.
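A minimal sketch of that kind of gob-backed caching, assuming the classic appengine/memcache package and its built-in Gob codec (which spares you the manual []byte conversion); this is illustrative and not necessarily what AEGo/ds does internally:

// Sketch: cache an entity in memcache via the built-in Gob codec.
// User is the struct from the example above; the cache key is up to you.
func cacheUser(c appengine.Context, cacheKey string, u *User) error {
    return memcache.Gob.Set(c, &memcache.Item{Key: cacheKey, Object: u})
}

func cachedUser(c appengine.Context, cacheKey string) (*User, error) {
    u := new(User)
    if _, err := memcache.Gob.Get(c, cacheKey, u); err != nil {
        return nil, err // memcache.ErrCacheMiss if the key is not cached
    }
    return u, nil
}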
