I am new to Salesforce Marketing Cloud and journey builder.
https://developer.salesforce.com/docs/marketing/marketing-cloud/guide/creating-activities.html
We are building journey builder's custom activity in which it will use a data extension as the source and when the journey builder is invoked, it will fetch a row and send this data to our company's internal endpoint. The team got that part working. We are using the postmonger.js.
I have a couple of questions:
Is there a way to retrieve the data from data extension in bulk so that we can call our company's internal bulk endpoint? Calling the endpoint for each record in the data extension for our use case would not be efficient enough and won't work.
When the journey is invoked and an entry in the data extension is retrieved and that data is sent to our internal endpoint, is there a machanism to mark this entry as already sent such that next time the journey is run, it won't process the entry that's already sent?
Here is a snippet of our customActivity.js in which this is populating one record. (I changed some variable names.). Is there a way to populate multiple records such that when "execute" is called, it is passing a list of payloads as input to our internal endpoint.
function save() {
try {
var TemplateNameValue = $('#TemplateName').val();
var TemplateIDValue = $('#TemplateID').val();
let auth = "{{Contact.Attribute.Authorization.Value}}"
payload['arguments'].execute.inArguments = [{
"vendorTemplateId": TemplateIDValue,
"field1": "{{Contact.Attribute.DD.field1}}",
"eventType": TemplateNameValue,
"field2": "{{Contact.Attribute.DD.field2}}",
"field3": "{{Contact.Attribute.DD.field3}}",
"field4": "{{Contact.Attribute.DD.field4}}",
"field5": "{{Contact.Attribute.DD.field5}}",
"field6": "{{Contact.Attribute.DD.field6}}",
"field7": "{{Contact.Attribute.DD.field7}}",
"messageMetadata" : {}
}];
payload['arguments'].execute.headers = `{"Authorization":"${auth}"}`;
payload['configurationArguments'].stop.headers = `{"Authorization":"default"}`;
payload['configurationArguments'].validate.headers = `{"Authorization":"default"}`;
payload['configurationArguments'].publish.headers = `{"Authorization":"default"}`;
payload['configurationArguments'].save.headers = `{"Authorization":"default"}`;
payload['metaData'].isConfigured = true;
console.log(payload);
connection.trigger('updateActivity', payload);
} catch(err) {
document.getElementById("error").style.display = "block";
document.getElementById("error").innerHtml = err;
}
console.log("Template Name: " + JSON.stringify(TemplateNameValue));
console.log("Template ID: " + JSON.stringify(TemplateIDValue));
}
});
Any advise or idea is highly appreciated!
Thank you.
Grace
Firstly, i implore you to not proceed with the design pattern of fetching data for each subscriber, from Marketing Cloud, that gets sent through the custom activity, for arguments sake i'll list two big issues.
You have no way of limiting the configuration of data extensions columns or column names in SFMC (Salesforce Marketing Cloud). If any malicious user or by human error would delete a column or change a column name your service would stop receiving that value.
Secondly, Marketing Cloud has 2 sets of API limitations, yearly and minute by minute. Depending on your licensing, you could run into the yearly limit.
The problem you have with limitation on minutes (2500 for REST and 2000 for SOAP) is that each usage of the custom activity in journey builder would multiple the amount of invocations per minute. Hitting this limit would cause issues for incremental data flows into SFMC.
I'd also suggest not retrieving any data from Marketing Cloud when a customer gets sent through a custom activity. Users should pick which corresponding rows/data that should be sent to the custom activity in their segmentation.
The eventDefinitionKey can be picked up from postmonger after requestedTriggerEventDefinition in the eventDefinitionModel function. eventDefinitionKey can then be used to programmatically populate SFMC's POST call with data from the Journey Data model, thus allowing marketers to select what data to be sent with the subscriber.
Following is some code to show how it would work in your customActivity.js
connection.on(
'requestedTriggerEventDefinition',
function (eventDefinitionModel) {
var eventKey = eventDefinitionModel['eventDefinitionKey'];
save(eventKey);
}
);
function save(eventKey) {
// subscriberKey fetched directly from Contact model
// columnName is populated from the Journey Data model
var params = {
subscriberKey: '{{Contact.key}}',
columnName: '{{Event.' + eventKey + '.columnName}}',
};
payload['arguments'].execute.inArguments = [params];
}
Using the Watson Assistant V2 API it is necessary to create a session handle first (create_session(assistantid)) which returns the session ID to use in the individual call to message(assistantid,sessionid,request). The session maintains the conversation state and therefore is the equivalent to the context id parameter of the V1 API.
Unfortunately it seems that there's a 5 minute session timeout by default. The response includes the following header attribute:
{...,"x-watson-session-timeout": [
"x-watson-session-timeout",
"session_timeout=300"
],...}
Any attempt to change this parameter by using the set_default_headers() method of the assistant object or by adding the optional header parameter to the create_session() call seems to have no effect. As I have not found any documentation of how to update this parameter correctly I just tried several alternatives:
1) self.assistant.set_default_headers({'x-watson-session-timeout':"['x-watson-session-timeout','session_timeout=3600']"})
2) self.assistant.set_default_headers({'x-watson-session-timeout':"'x-watson-session-timeout','session_timeout=3600'"})
3)self.assistant.set_default_headers({'x-watson-session-timeout':"session_timeout=3600"})
4)self.assistant.set_default_headers({'x-watson-session-timeout':"3600"})
5)self.assistant.set_default_headers({'session_timeout':"3600"})
Nothing is effective. The value of the parameter in the header of the response is still 300.
Do I use incorrect dict pairs to update the parameter? Is there another way to maintain the conversation state longer than 5 minutes using the V2 API? Is it not possible to change it at all?
The value of the session timeout is not under the control of the caller, and is in fact related to the Assistant plan you are using. For the free and standard the timeout is indeed 5 minutes. For the other plans the timeout is larger.
See Retaining information across dialog turns
The current session lasts for as long a user interacts with the assistant, and then up to 60 minutes of inactivity for Plus or Premium plans (5 minutes for Lite or Standard plans).
You can call watson assistant for an other session and resend your message. Keep your context...
Or just increase timeout limit in assistant setting on IBM Cloud with the right plan.
function createSession(end) {
assistant.createSession({
assistantId: watsonID }).then(res => {
sessionId=res.result.session_id;
if(end){
console.log("\x1b[32m%s\x1b[0m","new session "+sessionId);
}else{
console.log("session id :"+ sessionId);
console.log("http://"+host+":"+port);
}
});
}
createSession();
function callWatsonClient(payload,res) {
assistant.message(payload,function(err, data) {
if(data == null){
createSession(true);
//this not keep the context
var data ={result:{context:"",output:{generic:[{text:"session expirée, renvoyez le message"}]}}};
res.send(data);
}else{
//normal job
console.log("\x1b[33m%s\x1b[0m" ,JSON.stringify(data.result.output));
}
I have a server-side-rendered reactjs app using firebase firestore.
I have an area of my site that server-side-renders content that needs to be retrieved from firestore.
Currently, I am using firestore rules to allow anyone to read data from these particular docs
What worries me is that some bad person could setup a script to just continuously hit my database with reads and rack up my bills (since we are charged on a per-read basis, it seems that it's never wise to allow anyone to perform reads.)
Current Rule
// Allow anonymous users to read feeds
match /landingPageFeeds/{pageId}/feeds/newsFeed {
allow read: if true;
}
Best Way Forward?
How do I allow my server-side script to read from firestore, but not allow anyone else to do so?
Keep in mind, this is an initial action that runs server-side before hydrating the client-side with the pre-loaded state. This function / action is also shared with client-side for page-to-page navigation.
I considered anonymous login - which worked, however, this generated a new anonymous user with every page load - and Firebase does throttle new email/password and anonymous user accounts. It did not seem practical.
Solution
Per Doug's comment, I thought about the admin SDK more. I ended up creating a separate API in firebase functions for anonymous requests requiring secure firestore reads that can be cached.
Goals
Continue to deny public reads of my firestore database
Allow anonymous users to trigger firestore reads for server-side-rendered reactjs pages that require data from Firestore database (like first-time visitors, search engines).
Prevent "read spam" where a third party could hit my database with millions of reads to drive up my cloud costs by using server-side CDN cache for the responses. (by invoking unnessary reads in a loop, I once racked up a huge bill on accident - I want to make sure strangers can't do this maliciously)
Admin SDK & Firebase Function Caching
The admin SDK allows me to securely read from firestore. My firestore security rules can deny access to non-authenticated users.
Firebase functions that are handling GET requests support server caching the response. This means that subsequent hits from identical queries will not re-run all of my functions (firebase reads, other function invocations) - it will just instantly respond with the same data again.
Process
Anonymous client visits a server-side rendered reactjs page
Initial load rendering on server triggers a firebase function (https trigger)
Firebase function uses Admin SDK to read from secured firestore database
Function caches the response for 3 hours res.set('Cache-Control', 'public, max-age=600, s-maxage=10800');
Subsequent requests from any client anywhere for the next 3 hours are served from the cache - avoiding unnecessary reads or additional computation / resource usage
Note - caching does not work on local - must deploy to firebase to test caching effect.
Example Function
const functions = require("firebase-functions");
const cors = require('cors')({origin: true});
const { sendResponse } = require("./includes/sendResponse");
const { getFirestoreDataWithAdminSDK } = require("./includes/getFirestoreDataWithAdminSDK");
const cachedApi = functions.https.onRequest((req, res) => {
cors(req, res, async () => {
// Set a cache for the response to limit the impact of identical request on expensive resources
res.set('Cache-Control', 'public, max-age=600, s-maxage=10800');
// If POST - response with bad request code - POST requests are not cached
if(req.method === "POST") {
return sendResponse(res, 400);
} else {
// Get GET request action from query
let action = (req.query.action) ? req.query.action : null;
console.log("Action: ", action);
try {
// Handle Actions Appropriately
switch(true) {
// Get Feed Data
case(action === "feed"): {
console.log("Getting feed...");
// Get feed id
let feedId = (req.query.feedId) ? req.query.feedId : null;
// Get feed data
let feedData = await getFirestoreDataWithAdminSDK(feedId);
return sendResponse(res, 200, feedData);
}
// No valid action specified
default: {
return sendResponse(res, 400);
}
}
} catch(err) {
console.log("Cached API Error: ", err);
return sendResponse(res, 500);
}
}
});
});
module.exports = {
cachedApi
}
I'm trying to schedule a PySpark Job. I followed the GCP documentation and ended up deploying a little python script to App Engine which does the following :
authenticate using a service account
submit a job to a cluster
The problem is, I need the cluster to be up and running otherwise the job won't be sent (duh !) but I don't want the cluster to always be up and running, especially since my job needs to run once a month.
I wanted to add the creation of a cluster in my python script but the call is asynchronous (it makes an HTTP request) and thus my job is submitted after the cluster creation call but before the cluster is really up and running.
How could I do ?
I'd like something cleaner than just waiting for a few minutes in my script !
Thanks
EDIT : Here's what my code looks like so far :
To launch the job
class EnqueueTaskHandler(webapp2.RequestHandler):
def get(self):
task = taskqueue.add(
url='/run',
target='worker')
self.response.write(
'Task {} enqueued, ETA {}.'.format(task.name, task.eta))
app = webapp2.WSGIApplication([('/launch', EnqueueTaskHandler)], debug=True)
The job
class CronEventHandler(webapp2.RequestHandler):
def create_cluster(self, dataproc, project, zone, region, cluster_name):
zone_uri = 'https://www.googleapis.com/compute/v1/projects/{}/zones/{}'.format(project, zone)
cluster_data = {...}
dataproc.projects().regions().clusters().create(
projectId=project,
region=region,
body=cluster_data).execute()
def wait_for_cluster(self, dataproc, project, region, clustername):
print('Waiting for cluster to run...')
while True:
result = dataproc.projects().regions().clusters().get(
projectId=project,
region=region,
clusterName=clustername).execute()
# Handle exceptions
if result['status']['state'] != 'RUNNING':
time.sleep(60)
else:
return result
def wait_for_job(self, dataproc, project, region, job_id):
print('Waiting for job to finish...')
while True:
result = dataproc.projects().regions().jobs().get(
projectId=project,
region=region,
jobId=job_id).execute()
# Handle exceptions
print(result['status']['state'])
if result['status']['state'] == 'ERROR' or result['status']['state'] == 'DONE':
return result
else:
time.sleep(60)
def submit_job(self, dataproc, project, region, clusterName):
job = {...}
result = dataproc.projects().regions().jobs().submit(projectId=project,region=region,body=job).execute()
return result['reference']['jobId']
def post(self):
dataproc = googleapiclient.discovery.build('dataproc', 'v1')
project = '...'
region = "..."
zone = "..."
clusterName = '...'
self.create_cluster(dataproc, project, zone, region, clusterName)
self.wait_for_cluster(dataproc, project, region, clusterName)
job_id = self.submit_job(dataproc,project,region,clusterName)
self.wait_for_job(dataproc,project,region,job_id)
dataproc.projects().regions().clusters().delete(projectId=project, region=region, clusterName=clusterName).execute()
self.response.write("JOB SENT")
app = webapp2.WSGIApplication([('/run', CronEventHandler)], debug=True)
Everything works until the deletion of the cluster. At this point I get a "DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded." Any idea ?
In addition to general polling either through list or get requests on the Cluster or the Operation returned with the CreateCluster request, for single-use clusters like this you can also consider using the Dataproc Workflows API and possibly its InstantiateInline interface if you don't want to use full-fledged workflow templates; in this API you use a single request to specify cluster settings along with jobs to submit, and the jobs will automatically run as soon as the cluster is ready to take it, after which the cluster will be deleted automatically.
You can use the Google Cloud Dataproc API to create, delete and list clusters.
The list operation can be (repeatedly) performed after create and delete operations to confirm that they completed successfully, since it provides the ClusterStatus of the clusters in the results with the relevant State information:
UNKNOWN The cluster state is unknown.
CREATING The cluster is being created and set up. It is not ready for use.
RUNNING The cluster is currently running and healthy. It is ready for use.
ERROR The cluster encountered an error. It is not ready for use.
DELETING The cluster is being deleted. It cannot be used.
UPDATING The cluster is being updated. It continues to accept and process jobs.
To prevent plain waiting between the (repeated) list invocations (in general not a good thing to do on GAE) you can enqueue delayed tasks in a push task queue (with the relevant context information) allowing you to perform such list operations at a later time. For example, in python, see taskqueue.add():
countdown -- Time in seconds into the future that this task should run or be leased. Defaults to zero. Do not specify this argument if
you specified an eta.
eta -- A datetime.datetime that specifies the absolute earliest time at which the task should run. You cannot specify this argument if
the countdown argument is specified. This argument can be time
zone-aware or time zone-naive, or set to a time in the past. If the
argument is set to None, the default value is now. For pull tasks, no
worker can lease the task before the time indicated by the eta
argument.
If at the task execution time the result indicates the operation of interest is still in progress simply enqueue another such delayed task - effectively polling but without an actual wait/sleep.
Does anyone know how to delete all datastore in Google App Engine?
If you're talking about the live datastore, open the dashboard for your app (login on appengine) then datastore --> dataviewer, select all the rows for the table you want to delete and hit the delete button (you'll have to do this for all your tables).
You can do the same programmatically through the remote_api (but I never used it).
If you're talking about the development datastore, you'll just have to delete the following file: "./WEB-INF/appengine-generated/local_db.bin". The file will be generated for you again next time you run the development server and you'll have a clear db.
Make sure to clean your project afterwards.
This is one of the little gotchas that come in handy when you start playing with the Google Application Engine. You'll find yourself persisting objects into the datastore then changing the JDO object model for your persistable entities ending up with obsolete data that'll make your app crash all over the place.
The best approach is the remote API method as suggested by Nick, he's an App Engine engineer from Google, so trust him.
It's not that difficult to do, and the latest 1.2.5 SDK provides the remote_shell_api.py out of the shelf. So go to download the new SDK. Then follow the steps:
connect remote server in your commandline: remote_shell_api.py yourapp /remote_api
The shell will ask for your login info, and if authorized, will make a Python shell for you. You need setup url handler for /remote_api in your app.yaml
fetch the entities you'd like to delete, the code looks something like:
from models import Entry
query = Entry.all(keys_only=True)
entries =query.fetch(1000)
db.delete(entries)
\# This could bulk delete 1000 entities a time
Update 2013-10-28:
remote_shell_api.py has been replaced by remote_api_shell.py, and you should connect with remote_api_shell.py -s your_app_id.appspot.com, according to the documentation.
There is a new experimental feature Datastore Admin, after enabling it in app settings, you can bulk delete as well as backup your datastore through the web ui.
The fastest and efficient way to handle bulk delete on Datastore is by using the new mapper API announced on the latest Google I/O.
If your language of choice is Python, you just have to register your mapper in a mapreduce.yaml file and define a function like this:
from mapreduce import operation as op
def process(entity):
yield op.db.Delete(entity)
On Java you should have a look to this article that suggests a function like this:
#Override
public void map(Key key, Entity value, Context context) {
log.info("Adding key to deletion pool: " + key);
DatastoreMutationPool mutationPool = this.getAppEngineContext(context)
.getMutationPool();
mutationPool.delete(value.getKey());
}
EDIT:
Since SDK 1.3.8, there's a Datastore admin feature for this purpose
You can clear the development server datastore when you run the server:
/path/to/dev_appserver.py --clear_datastore=yes myapp
You can also abbreviate --clear_datastore with -c.
If you have a significant amount of data, you need to use a script to delete it. You can use remote_api to clear the datastore from the client side in a straightforward manner, though.
Here you go: Go to Datastore Admin, and then select the Entity type you want to delete and click Delete. Mapreduce will take care of deleting!
There are several ways you can use to remove entries from App Engine's Datastore:
First, think whether you really need to remove entries. This is expensive and it might be cheaper to not remove them.
You can delete all entries by hand using the Datastore Admin.
You can use the Remote API and remove entries interactively.
You can remove the entries programmatically using a couple lines of code.
You can remove them in bulk using Task Queues and Cursors.
Or you can use Mapreduce to get something more robust and fancier.
Each one of these methods is explained in the following blog post:
http://www.shiftedup.com/2015/03/28/how-to-bulk-delete-entries-in-app-engine-datastore
Hope it helps!
The zero-setup way to do this is to send an execute-arbitrary-code HTTP request to the admin service that your running app already, automatically, has:
import urllib
import urllib2
urllib2.urlopen('http://localhost:8080/_ah/admin/interactive/execute',
data = urllib.urlencode({'code' : 'from google.appengine.ext import db\n' +
'db.delete(db.Query())'}))
Source
I got this from http://code.google.com/appengine/articles/remote_api.html.
Create the Interactive Console
First, you need to define an interactive appenginge console. So, create a file called appengine_console.py and enter this:
#!/usr/bin/python
import code
import getpass
import sys
# These are for my OSX installation. Change it to match your google_appengine paths. sys.path.append("/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine")
sys.path.append("/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/lib/yaml/lib")
from google.appengine.ext.remote_api import remote_api_stub
from google.appengine.ext import db
def auth_func():
return raw_input('Username:'), getpass.getpass('Password:')
if len(sys.argv) < 2:
print "Usage: %s app_id [host]" % (sys.argv[0],)
app_id = sys.argv[1]
if len(sys.argv) > 2:
host = sys.argv[2]
else:
host = '%s.appspot.com' % app_id
remote_api_stub.ConfigureRemoteDatastore(app_id, '/remote_api', auth_func, host)
code.interact('App Engine interactive console for %s' % (app_id,), None, locals())
Create the Mapper base class
Once that's in place, create this Mapper class. I just created a new file called utils.py and threw this:
class Mapper(object):
# Subclasses should replace this with a model class (eg, model.Person).
KIND = None
# Subclasses can replace this with a list of (property, value) tuples to filter by.
FILTERS = []
def map(self, entity):
"""Updates a single entity.
Implementers should return a tuple containing two iterables (to_update, to_delete).
"""
return ([], [])
def get_query(self):
"""Returns a query over the specified kind, with any appropriate filters applied."""
q = self.KIND.all()
for prop, value in self.FILTERS:
q.filter("%s =" % prop, value)
q.order("__key__")
return q
def run(self, batch_size=100):
"""Executes the map procedure over all matching entities."""
q = self.get_query()
entities = q.fetch(batch_size)
while entities:
to_put = []
to_delete = []
for entity in entities:
map_updates, map_deletes = self.map(entity)
to_put.extend(map_updates)
to_delete.extend(map_deletes)
if to_put:
db.put(to_put)
if to_delete:
db.delete(to_delete)
q = self.get_query()
q.filter("__key__ >", entities[-1].key())
entities = q.fetch(batch_size)
Mapper is supposed to be just an abstract class that allows you to iterate over every entity of a given kind, be it to extract their data, or to modify them and store the updated entities back to the datastore.
Run with it!
Now, start your appengine interactive console:
$python appengine_console.py <app_id_here>
That should start the interactive console. In it create a subclass of Model:
from utils import Mapper
# import your model class here
class MyModelDeleter(Mapper):
KIND = <model_name_here>
def map(self, entity):
return ([], [entity])
And, finally, run it (from you interactive console):
mapper = MyModelDeleter()
mapper.run()
That's it!
You can do it using the web interface. Login into your account, navigate with links on the left hand side. In Data Store management you have options to modify and delete data. Use respective options.
I've created an add-in panel that can be used with your deployed App Engine apps. It lists the kinds that are present in the datastore in a dropdown, and you can click a button to schedule "tasks" that delete all entities of a specific kind or simply everything. You can download it here:
http://code.google.com/p/jobfeed/wiki/Nuke
For Python, 1.3.8 includes an experimental admin built-in for this. They say: "enable the following builtin in your app.yaml file:"
builtins:
- datastore_admin: on
"Datastore delete is currently available only with the Python runtime. Java applications, however, can still take advantage of this feature by creating a non-default Python application version that enables Datastore Admin in the app.yaml. Native support for Java will be included in an upcoming release."
Open "Datastore Admin" for your application and enable Admin. Then all of your entities will be listed with check boxes. You can simply select the unwanted entites and delete them.
This is what you're looking for...
db.delete(Entry.all(keys_only=True))
Running a keys-only query is much faster than a full fetch, and your quota will take a smaller hit because keys-only queries are considered small ops.
Here's a link to an answer from Nick Johnson describing it further.
Below is an end-to-end REST API solution to truncating a table...
I setup a REST API to handle database transactions where routes are directly mapped through to the proper model/action. This can be called by entering the right url (example.com/inventory/truncate) and logging in.
Here's the route:
Route('/inventory/truncate', DataHandler, defaults={'_model':'Inventory', '_action':'truncate'})
Here's the handler:
class DataHandler(webapp2.RequestHandler):
#basic_auth
def delete(self, **defaults):
model = defaults.get('_model')
action = defaults.get('_action')
module = __import__('api.models', fromlist=[model])
model_instance = getattr(module, model)()
result = getattr(model_instance, action)()
It starts by loading the model dynamically (ie Inventory found under api.models), then calls the correct method (Inventory.truncate()) as specified in the action parameter.
The #basic_auth is a decorator/wrapper that provides authentication for sensitive operations (ie POST/DELETE). There's also an oAuth decorator available if you're concerned about security.
Finally, the action is called:
def truncate(self):
db.delete(Inventory.all(keys_only=True))
It looks like magic but it's actually very straightforward. The best part is, delete() can be re-used to handle deleting one-or-many results by adding another action to the model.
You can Delete All Datastore by deleting all Kinds One by One.
with google appengine dash board. Please follow these Steps.
Login to https://console.cloud.google.com/datastore/settings
Click Open Datastore Admin. (Enable it if not enabled.)
Select all Entities and press delete.(This Step run a map reduce job for deleting all selected Kinds.)
for more information see This image http://storage.googleapis.com/bnifsc/Screenshot%20from%202015-01-31%2023%3A58%3A41.png
If you have a lot of data, using the web interface could be time consuming. The App Engine Launcher utility lets you delete everything in one go with the 'Clear datastore on launch' checkbox. This utility is now available for both Windows and Mac (Python framework).
For the development server, instead of running the server through the google app engine launcher, you can run it from the terminal like:
dev_appserver.py --port=[portnumber] --clear_datastore=yes [nameofapplication]
ex: my application "reader" runs on port 15080. After modify the code and restart the server, I just run "dev_appserver.py --port=15080 --clear_datastore=yes reader".
It's good for me.
Adding answer about recent developments.
Google recently added datastore admin feature. You can backup, delete or copy your entities to another app using this console.
https://developers.google.com/appengine/docs/adminconsole/datastoreadmin#Deleting_Entities_in_Bulk
I often don't want to delete all the data store so I pull a clean copy of /war/WEB-INF/local_db.bin out source control. It may just be me but it seems even with the Dev Mode stopped I have to physically remove the file before pulling it. This is on Windows using the subversion plugin for Eclipse.
PHP variation:
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
define('DATASTORE_SERVICE', DatastoreServiceFactory::getDatastoreService());
function get_all($kind) {
$query = new Query($kind);
$prepared = DATASTORE_SERVICE->prepare($query);
return $prepared->asIterable();
}
function delete_all($kind, $amount = 0) {
if ($entities = get_all($kind)) {
$r = $t = 0;
$delete = array();
foreach ($entities as $entity) {
if ($r < 500) {
$delete[] = $entity->getKey();
} else {
DATASTORE_SERVICE->delete($delete);
$delete = array();
$r = -1;
}
$r++; $t++;
if ($amount && $amount < $t) break;
}
if ($delete) {
DATASTORE_SERVICE->delete($delete);
}
}
}
Yes it will take time and 30 sec. is a limit. I'm thinking to put an ajax app sample to automate beyond 30 sec.
for amodel in db.Model.__subclasses__():
dela=[]
print amodel
try:
m = amodel()
mq = m.all()
print mq.count()
for mw in mq:
dela.append(mw)
db.delete(dela)
#~ print len(dela)
except:
pass
If you're using ndb, the method that worked for me for clearing the datastore:
ndb.delete_multi(ndb.Query(default_options=ndb.QueryOptions(keys_only=True)))
For any datastore that's on app engine, rather than local, you can use the new Datastore API. Here's a primer for how to get started.
I wrote a script that deletes all non-built in entities. The API is changing pretty rapidly, so for reference, I cloned it at commit 990ab5c7f2063e8147bcc56ee222836fd3d6e15b
from gcloud import datastore
from gcloud.datastore import SCOPE
from gcloud.datastore.connection import Connection
from gcloud.datastore import query
from oauth2client import client
def get_connection():
client_email = 'XXXXXXXX#developer.gserviceaccount.com'
private_key_string = open('/path/to/yourfile.p12', 'rb').read()
svc_account_credentials = client.SignedJwtAssertionCredentials(
service_account_name=client_email,
private_key=private_key_string,
scope=SCOPE)
return Connection(credentials=svc_account_credentials)
def connect_to_dataset(dataset_id):
connection = get_connection()
datastore.set_default_connection(connection)
datastore.set_default_dataset_id(dataset_id)
if __name__ == "__main__":
connect_to_dataset(DATASET_NAME)
gae_entity_query = query.Query()
gae_entity_query.keys_only()
for entity in gae_entity_query.fetch():
if entity.kind[0] != '_':
print entity.kind
entity.key.delete()
continuing the idea of svpino it is wisdom to reuse records marked as delete. (his idea was not to remove, but mark as "deleted" unused records). little bit of cache/memcache to handle working copy and write only difference of states (before and after desired task) to datastore will make it better. for big tasks it is possible to write itermediate difference chunks to datastore to avoid data loss if memcache disappeared. to make it loss-proof it is possible to check integrity/existence of memcached results and restart task (or required part) to repeat missing computations. when data difference is written to datastore, required computations are discarded in queue.
other idea similar to map reduced is to shard entity kind to several different entity kinds, so it will be collected together and visible as single entity kind to final user. entries are only marked as "deleted". when "deleted" entries amount per shard overcomes some limit, "alive" entries are distributed between other shards, and this shard is closed forever and then deleted manually from dev console (guess at less cost) upd: seems no drop table at console, only delete record-by-record at regular price.
it is possible to delete by query by chunks large set of records without gae failing (at least works locally) with possibility to continue in next attempt when time is over:
qdelete.getFetchPlan().setFetchSize(100);
while (true)
{
long result = qdelete.deletePersistentAll(candidates);
LOG.log(Level.INFO, String.format("deleted: %d", result));
if (result <= 0)
break;
}
also sometimes it useful to make additional field in primary table instead of putting candidates (related records) into separate table. and yes, field may be unindexed/serialized array with little computation cost.
For all people that need a quick solution for the dev server (as time of writing in Feb. 2016):
Stop the dev server.
Delete the target directory.
Rebuild the project.
This will wipe all data from the datastore.
I was so frustrated about existing solutions for deleting all data in the live datastore that I created a small GAE app that can delete quite some amount of data within its 30 seconds.
How to install etc: https://github.com/xamde/xydra
For java
DatastoreService db = DatastoreServiceFactory.getDatastoreService();
List<Key> keys = new ArrayList<Key>();
for(Entity e : db.prepare(new Query().setKeysOnly()).asIterable())
keys.add(e.getKey());
db.delete(keys);
Works well in Development Server
You have 2 simple ways,
#1: To save cost, delete the entire project
#2: using ts-datastore-orm:
https://www.npmjs.com/package/ts-datastore-orm
await Entity.truncate();
The truncate can delete around 1K rows per seconds
Here's how I did this naively from a vanilla Google Cloud Shell (no GAE) with python3:
from google.cloud import datastore
client = datastore.Client()
query.keys_only()
for counter, entity in enumerate(query.fetch()):
if entity.kind.startswith('_'): # skip reserved kinds
continue
print(f"{counter}: {entity.key}")
client.delete(entity.key)
This takes a very long time even with a relatively small amount of keys but it works.
More info about the Python client library: https://googleapis.dev/python/datastore/latest/client.html
As of 2022, there are two ways to delete a kind from a (largeish) datastore to the best of my knowledge. Google recommends using a Dataflow template. The template will basically pull each entity one by one subject to a GQL query, and then delete it. Interestingly, if you are deleting a large number of rows (> 10m), you will run into datastore troubles; as it will fail to provide enough capacity, and your operations to the datastore will start timing out. However, only the kind you are mass deleting from will be effected.
If you have less than 10m rows, you can just use this go script:
import (
"cloud.google.com/go/datastore"
"context"
"fmt"
"google.golang.org/api/option"
"log"
"strings"
"sync"
"time"
)
const (
batchSize = 10000 // number of keys to get in a single batch
deleteBatchSize = 500 // number of keys to delete in a single batch
projectID = "name-of-your-GCP-project"
serviceAccount = "path-to-sa-file"
table = "kind-to-delete"
)
func min(a, b int) int {
if a < b {
return a
}
return b
}
func deleteBatch(table string) int {
ctx := context.Background()
client, err := datastore.NewClient(ctx, projectID, option.WithCredentialsFile(serviceAccount))
if err != nil {
log.Fatalf("Failed to open client: %v", err)
}
defer client.Close()
query := datastore.NewQuery(table).KeysOnly().Limit(batchSize)
keys, err := client.GetAll(ctx, query, nil)
if err != nil {
fmt.Printf("%s Failed to get %d keys : %v\n", table, batchSize, err)
return -1
}
var wg sync.WaitGroup
for i := 0; i < len(keys); i += deleteBatchSize {
wg.Add(1)
go func(i int) {
batch := keys[i : i+min(len(keys)-i, deleteBatchSize)]
if err := client.DeleteMulti(ctx, batch); err != nil {
// not a big problem, we'll get them next time ;)
fmt.Printf("%s Failed to delete multi: %v", table, err)
}
wg.Done()
}(i)
}
wg.Wait()
return len(keys)
}
func main() {
var globalStartTime = time.Now()
fmt.Printf("Deleting \033[1m%s\033[0m\n", table)
for {
startTime := time.Now()
count := deleteBatch(table)
if count >= 0 {
rate := float64(count) / time.Since(startTime).Seconds()
fmt.Printf("Deleted %d keys from %s in %.2fs, rate %.2f keys/s\n", count, table, time.Since(startTime).Seconds(), rate)
if count == 0 {
fmt.Printf("%s is now clear.\n", table)
break
}
} else {
fmt.Printf("Retrying after short cooldown\n")
time.Sleep(10 * time.Second)
}
}
fmt.Printf("Total time taken %s.\n", time.Since(globalStartTime))
}