Azure Search SDK for Blob Storage - Deleting Files

I have created an application that lists all the documents in an Azure storage container, and lets the user mark specific files to delete.
This is an Azure Search application, so the process is to add a "deleted" metadata property to the selected files, run the indexer to remove that information from the index, and then physically delete the files.
Here's the code for that process:
serviceClient.Indexers.Run(documentIndexer);
var status = serviceClient.Indexers.GetStatus(documentIndexer).LastResult.Status;
// Loop until the indexer is done
while (status == IndexerExecutionStatus.InProgress)
{
    status = serviceClient.Indexers.GetStatus(documentIndexer).LastResult.Status;
}
// If successful, delete the flagged files
if (status == IndexerExecutionStatus.Success)
{
    DeleteFlagged();
}
Everything works fine, but only if I put a breakpoint on the DeleteFlagged() call, effectively forcing a delay between running the indexer and deleting the files.
Without the pause, the indexer comes back as successful, and I delete the files, but the file contents haven't been removed from the index - they still show up in search results (the files have been physically deleted).
Is there something else I need to check before deleting?

When you Run an indexer, it doesn't instantly transition into InProgress state - in fact, depending on how many indexers are running in your service, there may be a significant delay before the indexer is scheduled to run. So, when you call GetStatus before the loop, the indexer may not be InProgress yet, and you end up deleting blobs too early.
A more reliable approach would be to wait for the indexer to complete this particular run (e.g., by looking at the LastResult's StartTime/EndTime).
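As a rough illustration of that idea, here is a minimal Python sketch (not the .NET SDK). It assumes you record the previous run's start time before triggering the indexer, then wait until the last result reports a different start time and a non-empty end time. The `get_last_result` callable and the dict keys are hypothetical stand-ins for whatever your SDK returns as LastResult.

```python
import time


def wait_for_new_run(get_last_result, previous_start, poll_seconds=2.0,
                     max_polls=300, sleep=time.sleep):
    """Wait until a run newer than `previous_start` has finished.

    `get_last_result` is a hypothetical callable returning a dict with
    'status', 'start_time' and 'end_time' keys, or None if the indexer
    has never run.
    """
    for _ in range(max_polls):
        result = get_last_result()
        # The run we triggered is visible once the start time has changed.
        is_new_run = result is not None and result["start_time"] != previous_start
        # An end time means that run is no longer in progress.
        if is_new_run and result["end_time"] is not None:
            return result["status"]
        sleep(poll_seconds)
    raise RuntimeError("indexer run did not complete in time")
```

Comparing against the previously observed start time, rather than the local clock, avoids depending on the client and the service agreeing on the current time.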

Exceeded soft memory limit of 512 MB with 532 MB after servicing 3 requests total. Consider setting a larger instance class in app.yaml

We are on the Google App Engine standard environment, F2 instance (generation 1, Python 2.7). We have a reporting module that follows this flow.
Worker Task is initiated in a queue.
task = taskqueue.add(
    url='/backendreport',
    target='worker',
    queue_name='generate-reports',
    params={
        "task_data": task_data
    })
In the worker class, we query Google Datastore and write the data to a Google Sheet. We paginate through the records to find additional report elements. When we find an additional page, we call the same task again to spawn another write, so it can fetch the next set of report elements and write them to the Google Sheet.
In backendreport.py we have the following code:
class BackendReport():
    # Query google datastore to find the records(paginated)
    result = self.service.spreadsheets().values().update(
        spreadsheetId=spreadsheet_Id,
        range=range_name,
        valueInputOption=value_input_option,
        body=resource_body).execute()
    # If pagination finds additional records
    task = taskqueue.add(
        url='/backendreport',
        target='worker',
        queue_name='generate-reports',
        params={
            "task_data": task_data
        })
We run the same BackendReport (with pagination) as a front end job (not as a task). The pagination works without any error - meaning we fetch each page of records and display to the front end. But when we execute the tasks iteratively it fails with the soft memory limit issue. We were under the impression that every time a task is called (for each pagination) it should act independently and there shouldn't be any memory constraints. What are we doing wrong here?
Why doesn't GCP automatically spin up a different instance when the soft memory limit is reached (our instance class is F2)?
The error message says soft memory limit of 512 MB reached after servicing 3 requests total - does this mean that the backendreport module spun up 3 requests - does it mean there were 3 tasks calls (/backendreport)?
"Why doesn't GCP spin a different instance when the soft memory limit is reached"
One of the primary mechanisms App Engine uses to decide when to spin up a new instance is max_concurrent_requests. You can check out all of the automatic_scaling parameters you can configure here:
https://cloud.google.com/appengine/docs/standard/python/config/appref#scaling_elements
"does this mean that the backendreport module spun up 3 requests - does it mean there were 3 tasks calls (/backendreport)?"
I think so. To be sure, you can open the Logs Viewer, find the log entry where this was printed, and filter your logs by that instance ID to see all the requests it handled that led to that point.
You're creating multiple tasks in Cloud Tasks, but there's no limit on how the queue dispatches them, so when the queue dispatches several tasks at the same time, the instance reaches the memory limit. The limit you want to set is max_concurrent_requests, not for the instances in app.yaml but for queue dispatching in queue.yaml, so that only one task is dispatched at a time:
queue:
- name: generate-reports
  rate: 1/s
  max_concurrent_requests: 1

Creating a cluster before sending a job to dataproc programmatically

I'm trying to schedule a PySpark job. I followed the GCP documentation and ended up deploying a small Python script to App Engine which does the following:
authenticate using a service account
submit a job to a cluster
The problem is, I need the cluster to be up and running, otherwise the job won't be sent (duh!), but I don't want the cluster to always be up and running, especially since my job only needs to run once a month.
I wanted to add the creation of a cluster to my Python script, but the call is asynchronous (it makes an HTTP request), so my job is submitted after the cluster creation call but before the cluster is really up and running.
How can I handle this?
I'd like something cleaner than just waiting for a few minutes in my script!
Thanks
EDIT: Here's what my code looks like so far:
To launch the job:
class EnqueueTaskHandler(webapp2.RequestHandler):
    def get(self):
        task = taskqueue.add(
            url='/run',
            target='worker')
        self.response.write(
            'Task {} enqueued, ETA {}.'.format(task.name, task.eta))

app = webapp2.WSGIApplication([('/launch', EnqueueTaskHandler)], debug=True)
The job:
class CronEventHandler(webapp2.RequestHandler):
    def create_cluster(self, dataproc, project, zone, region, cluster_name):
        zone_uri = 'https://www.googleapis.com/compute/v1/projects/{}/zones/{}'.format(project, zone)
        cluster_data = {...}
        dataproc.projects().regions().clusters().create(
            projectId=project,
            region=region,
            body=cluster_data).execute()

    def wait_for_cluster(self, dataproc, project, region, clustername):
        print('Waiting for cluster to run...')
        while True:
            result = dataproc.projects().regions().clusters().get(
                projectId=project,
                region=region,
                clusterName=clustername).execute()
            # Handle exceptions
            if result['status']['state'] != 'RUNNING':
                time.sleep(60)
            else:
                return result

    def wait_for_job(self, dataproc, project, region, job_id):
        print('Waiting for job to finish...')
        while True:
            result = dataproc.projects().regions().jobs().get(
                projectId=project,
                region=region,
                jobId=job_id).execute()
            # Handle exceptions
            print(result['status']['state'])
            if result['status']['state'] == 'ERROR' or result['status']['state'] == 'DONE':
                return result
            else:
                time.sleep(60)

    def submit_job(self, dataproc, project, region, clusterName):
        job = {...}
        result = dataproc.projects().regions().jobs().submit(projectId=project, region=region, body=job).execute()
        return result['reference']['jobId']

    def post(self):
        dataproc = googleapiclient.discovery.build('dataproc', 'v1')
        project = '...'
        region = "..."
        zone = "..."
        clusterName = '...'
        self.create_cluster(dataproc, project, zone, region, clusterName)
        self.wait_for_cluster(dataproc, project, region, clusterName)
        job_id = self.submit_job(dataproc, project, region, clusterName)
        self.wait_for_job(dataproc, project, region, job_id)
        dataproc.projects().regions().clusters().delete(projectId=project, region=region, clusterName=clusterName).execute()
        self.response.write("JOB SENT")

app = webapp2.WSGIApplication([('/run', CronEventHandler)], debug=True)
Everything works until the deletion of the cluster. At this point I get a "DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded." Any ideas?
You can poll, either through list or get requests on the Cluster or on the Operation returned by the CreateCluster request. For single-use clusters like this, you can also consider the Dataproc Workflows API, and possibly its InstantiateInline interface if you don't want to use full-fledged workflow templates. With this API, a single request specifies the cluster settings along with the jobs to submit. The jobs run automatically as soon as the cluster is ready to take them, and the cluster is deleted automatically afterwards.
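A rough sketch of what such an inline workflow request body might look like, built as a plain Python dict. The field names (placement.managedCluster, jobs[].stepId, pysparkJob.mainPythonFileUri) are written from memory of the Dataproc v1 REST reference and should be checked against the current documentation; the function name and the step ID are hypothetical, and the cluster config and script URI are placeholders you would supply.

```python
def build_inline_workflow(cluster_name, cluster_config, main_python_file_uri):
    """Build a workflow template body: one managed cluster, one PySpark step.

    Field names are assumed from the Dataproc v1 REST reference; verify them
    before relying on this.
    """
    return {
        "placement": {
            # A managed cluster is created for the workflow and deleted
            # once the workflow finishes.
            "managedCluster": {
                "clusterName": cluster_name,
                "config": cluster_config,
            }
        },
        "jobs": [
            {
                "stepId": "monthly-pyspark",  # hypothetical step name
                "pysparkJob": {"mainPythonFileUri": main_python_file_uri},
            }
        ],
    }


# With a discovery client like the one in the question, the call would
# presumably look like this (assumption, not verified here):
# dataproc.projects().regions().workflowTemplates().instantiateInline(
#     parent='projects/{}/regions/{}'.format(project, region),
#     body=build_inline_workflow(...)).execute()
```

The request returns a long-running operation, so the App Engine handler no longer has to stay alive while the cluster starts, the job runs, and the cluster is torn down.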
You can use the Google Cloud Dataproc API to create, delete and list clusters.
The list operation can be (repeatedly) performed after create and delete operations to confirm that they completed successfully, since it provides the ClusterStatus of the clusters in the results with the relevant State information:
UNKNOWN: The cluster state is unknown.
CREATING: The cluster is being created and set up. It is not ready for use.
RUNNING: The cluster is currently running and healthy. It is ready for use.
ERROR: The cluster encountered an error. It is not ready for use.
DELETING: The cluster is being deleted. It cannot be used.
UPDATING: The cluster is being updated. It continues to accept and process jobs.
To avoid simply sleeping between the (repeated) list invocations (in general not a good thing to do on GAE), you can enqueue delayed tasks in a push task queue (with the relevant context information), allowing you to perform such list operations at a later time. For example, in Python, see taskqueue.add():
countdown -- Time in seconds into the future that this task should run or be leased. Defaults to zero. Do not specify this argument if you specified an eta.
eta -- A datetime.datetime that specifies the absolute earliest time at which the task should run. You cannot specify this argument if the countdown argument is specified. This argument can be time zone-aware or time zone-naive, or set to a time in the past. If the argument is set to None, the default value is now. For pull tasks, no worker can lease the task before the time indicated by the eta argument.
If, at task execution time, the result indicates that the operation of interest is still in progress, simply enqueue another such delayed task. This is effectively polling, but without an actual wait/sleep.
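The re-enqueue pattern described above can be sketched as follows. This is a minimal, framework-free illustration: `get_state`, `enqueue`, and `on_ready` are hypothetical callables. On GAE, `enqueue` would wrap something like taskqueue.add(url=..., params=..., countdown=...), and `get_state` would wrap the clusters list or get call.

```python
def poll_cluster(get_state, enqueue, on_ready, context,
                 delay_seconds=60, max_attempts=30):
    """Run one polling step inside a task handler, without sleeping.

    All three callables are hypothetical stand-ins:
      get_state(cluster_name) -> state string such as 'CREATING' or 'RUNNING'
      enqueue(context, countdown=seconds) -> schedules this step again later
      on_ready(context) -> continues the work, e.g. submits the job
    """
    state = get_state(context["cluster_name"])
    if state == "RUNNING":
        on_ready(context)
        return "ready"
    if state == "ERROR" or context["attempt"] >= max_attempts:
        return "failed"
    # Still in progress: schedule the next check instead of waiting here.
    next_context = dict(context, attempt=context["attempt"] + 1)
    enqueue(next_context, countdown=delay_seconds)
    return "rescheduled"
```

Each invocation returns quickly, so no single request comes close to the request deadline, and the attempt counter carried in the context keeps the polling from continuing forever.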

How to ensure all users are being sent only one daily message using GAE and deferred task queues

I am using the deferred task queues library with GAE. Every day I need to send a piece of text to all users connected to a certain page in my app. My app has multiple pages connected, so for each page, I want to go over all users, and send them a daily message. I am using a cursor to iterate over the table of Users in batches of 800. If there are more than 800 users, I want to remember where the cursor left off, and start another task with the other users.
I just want to make sure that with my algorithm I am going to send all users only one message. I want to make sure I won't miss any users, and that no user will receive the same message twice.
Does this look like the proper algorithm to handle my situation?
def send_news(page_cursor=None, page_batch_size=1,
              user_cursor=None, user_batch_size=800):
    p_query = PageProfile.query(PageProfile.subscribed==True)
    all_pages, next_page_cursor, page_more = p_query.fetch_page(page_batch_size,
                                                                start_cursor=page_cursor)
    for page in all_pages:
        if page.page_news_url and page.subscribed:
            query = User.query(User.subscribed==True, User.page_id == page.page_id)
            all_users, next_user_cursor, user_more = query.fetch_page(user_batch_size, start_cursor=user_cursor)
            for user in all_users:
                user.sendNews()
            # If there are more users on this page, remember the cursor
            # and get the next 800 users on this same page
            if user_more:
                deferred.defer(send_news, page_cursor=page_cursor, user_cursor=next_user_cursor)
    # If there are more pages left, use another deferred queue to
    # send the daily news to users in that page
    if page_more:
        deferred.defer(send_news, page_cursor=next_page_cursor)
    return "OK"
You could wrap your user.sendNews() call in another deferred task with a specific name, which ensures that it's created only once.
import time

from google.appengine.api import taskqueue
from google.appengine.ext import deferred

# Number of whole days since the epoch; the value changes once every 24 hours
interval_num = int(time.time()) // (60 * 60 * 24)
args = ('positional_arguments_for_object',)
kwargs = {'param': 'value'}
task_name = '_'.join([
    'user_name',
    'page_name',
    str(interval_num)
])
# With the interval in the name, the task name for the same page and the same user stays the same for 24 hours
try:
    deferred.defer(obj, _name=task_name, _queue='my-queue', _url='/_ah/queue/deferred', *args, **kwargs)
except taskqueue.TaskAlreadyExistsError:
    # A task with this name already exists and likely hasn't been executed yet
    pass
except taskqueue.TombstonedTaskError:
    # A task with this name was created recently, so the name isn't available for reuse yet
    # this should reset once a week or so
    pass
Note that, as far as I remember, App Engine does not guarantee that a task will be executed only once. In some edge cases it could be executed two or more times, so ideally tasks should be idempotent. If such edge cases matter to you, you could transactionally read/write a flag in the datastore for each task, and check whether that entity exists before executing the task, cancelling the execution if it does.
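A minimal sketch of that "check a flag before sending" idea, using an in-memory set as a stand-in for the datastore. The function names and the key format are hypothetical. A plain set is not transactional, so on App Engine the check and the write would need to happen inside a datastore transaction.

```python
import time

SECONDS_PER_DAY = 60 * 60 * 24


def daily_key(user_id, page_id, now=None):
    """Build a key that stays the same for one user/page pair within a UTC day."""
    now = time.time() if now is None else now
    day_number = int(now) // SECONDS_PER_DAY
    return "{}_{}_{}".format(user_id, page_id, day_number)


def send_once(sent_keys, user_id, page_id, send, now=None):
    """Call `send` only if no message was recorded for this user/page today.

    `sent_keys` is an in-memory set standing in for a datastore entity
    that would be read and written in a transaction.
    """
    key = daily_key(user_id, page_id, now)
    if key in sent_keys:
        return False  # already sent today, e.g. the task was retried
    send(user_id, page_id)
    sent_keys.add(key)
    return True
```

Recording the flag after sending means a crash between the two steps can still produce a duplicate. Recording it before sending trades that for a possible missed message, so the order depends on which failure is less harmful.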

PyroCMS / Codeigniter : too many session entries in db

I'm using the PyroCMS / CodeIgniter combo for a small website.
After adding some content, I checked the db and saw that:
Is this normal behaviour? Multiple session_ids for one user with the same IP?
I can't imagine that this is correct.
My session config looks like:
$config['sess_cookie_name'] = 'pyrocms' . (ENVIRONMENT !== 'production' ? '_' . ENVIRONMENT : '');
$config['sess_expiration'] = 14400;
$config['sess_expire_on_close'] = true;
$config['sess_encrypt_cookie'] = true;
$config['sess_use_database'] = true;
// don't change anything but the 'ci_sessions' part of this. The MSM depends on the 'default_' prefix
$config['sess_table_name'] = 'default_ci_sessions';
$config['sess_match_ip'] = true;
$config['sess_match_useragent'] = true;
$config['sess_time_to_update'] = 300;
I did not change one line of code affecting the session class or anything like that.
The Red Hat rows belong to a 15-minute cron job. This is fine, I think.
Every time I refresh the page, two or three new session entries are added...
Yes, this is normal. The CI session class automatically generates a new ID periodically. (Every 5 minutes, by default.) This is part of the security inherent in using CI sessions instead of native PHP sessions. Garbage collection will take care of this, you do not need to do anything.
You can read more about the session id behavior in the CI manual. This is an excerpt copied from that page.
The user's unique Session ID (this is a statistically random string
with very strong entropy, hashed with MD5 for portability, and
regenerated (by default) every five minutes)
This behavior is by design. There is nothing to fix. The session class has built-in garbage collection that deletes old entries as needed. I have run many projects on CodeIgniter over several years. This is what it does.
If it really bothers you, you can alter the timeout in the main CI config file. Change the line
$config['sess_time_to_update'] = 300 (the 5 minute refresh period)
to a number greater than
$config['sess_expiration'] (default 7200)
This will cause the session to time out before it is regenerated. This is less secure in theory, but unless you are transacting sensitive data, it is probably irrelevant in practice.
But again, this is by design, as one of the many layers of CI sessions. These and other features are what make it better than PHP native sessions. You can turn on profiling and see that the overhead for these queries is negligible, especially in light of all the other optimizations the framework provides.

Trying to get a list of groups that have permission to view a file in Google Drive

Okay, so I'm writing a Google Apps Script for our intranet, and I want to be able to display a list of files from a folder on Google Drive. However, I only want to display files that the user has access to.
There is a method, getViewers, that will return a list of strings:
https://developers.google.com/apps-script/class_file#getViewers
The problem with that is, although it returns email addresses for individuals who are on the permissions list, it returns group names for groups. This is less than ideal, since there's no way to get the group object from GroupsManager with a name -- it only takes the group ID.
There are a few things I could do in spite of this. One thing I tried was this:
var files = DocsList.getFolderById('0B_Zfq-SOMETHINGIJUSTMADEUP').getFiles();
for (var f = 0; f < files.length; f++) {
    var viewers = files[f].getViewers();
    var flag = false;
    // userGroups is the list of group objects, from this session's user
    for (var i = 0; i < userGroups.length; i++) {
        var groupName = userGroups[i].getName();
        if (viewers.indexOf(groupName) > -1) {
            flag = true;
        }
    }
    if (flag) {
        // print the link to file within the HTML template
    }
}
But that takes horribly long to load the page, for obvious reasons. It loads in like 5 minutes. What I really need is to be able to get a list of group email addresses from the getViewers method. It seems really strange that it returns emails for individual users, but group names for groups. Does anyone know any solution or workaround for this?
Your best bet will probably be to use a cache with the group IDs mapped to group names, and to use that instead of the GroupsManager service; otherwise it will take an age every time the script runs. Depending on how 'live' the docs list is for the intranet site, you could also speed things up by using a cache for the directory map.
If the file permissions and the directory listing are VERY changeable, then the cache could be pre-populated by a helper script running on a time-based trigger to suit your needs.
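The caching idea can be sketched in a language-neutral way. The Python below is only an illustration, not Apps Script. `TtlCache`, `visible_files`, and the dict layout of a file are hypothetical. The `loader` passed in stands in for the slow group lookup, and the expiry time controls how stale the cached names may become.

```python
import time


class TtlCache(object):
    """Tiny cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh, skip the slow lookup
        value = loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value


def visible_files(files, user_group_names):
    """Keep files whose viewer list contains one of the user's group names."""
    names = set(user_group_names)
    return [f for f in files if names.intersection(f["viewers"])]
```

Resolving the user's group names once through the cache, and then checking each file with a set intersection, replaces the per-file, per-group name lookups in the nested loop.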
This is a good suggestion to add on the issue tracker as a Feature Request.