I use blobstore.create_upload_url to upload a file to Cloud Storage with a fairly long directory name. The returned blob key exceeds 500 bytes and causes the error shown below.
The workaround is to use a shorter directory name.
What is the general guidance on directory and file name length so that the generated key stays under 500 bytes?
Is it a bug that create_upload_url generates a key that exceeds the maximum length?
Context: Python, GAE standard environment.
Thanks in advance.
The error message:
ValueError: Key name strings must be non-empty strings up to 500 bytes; received AMIfv94PcVablhJ0hpvUKZgaEX3w9Ysm9RLvXRogYwLU373p-DTD6kFkVCJgwfFgq1UIM-xlG2M4GwPWC5lH4XIetUGlD_JoipLmaps9XvTK_1ZnWjUrww86Y6izXLhU-boKHl4G9YxJFi1rNU-9JjJnJ_smXmGp2Aa9OHeNd8imQrxAHjT3bOEQAvoI8MQM3KBlqnh4kVgre7Lf0AQPtb0wiPI42WbyqETQ6QD--BS-ofel0XGt_picz1SN5ECpqXfPctfuE0s40Wq72rzRsSb-UPukdbVDCrdCJOb7ZRnHSGuYtHBzJJR_ilUY9uuMsCPbo4NPOScSfovo2pfcwxjfEs-oFdHLOXu8CRzwLnnzsoNKvGy3VE6mLuDbr-R7cQefybaMQSiKL4VXzEXEVLKP3Yg_1SHeqIRD5xq1pbt1yZcplpJ5jkV-5dVdgBdI9e6NbghwOXhTQQbp7JodYgcdf5bBgemrpIn2ZhMgMrYAEYEe64DeoUBuQNDpmCUVM1z5wxFzrUMSNhayJzMfebMgFJATnppusA
at _ReferenceFromPairs (/base/alloc/tmpfs/dynamic_runtimes/python27/54c5883f70296ec8_unzipped/python27_lib/versions/1/google/appengine/ext/ndb/key.py:766)
at _ConstructReference (/base/alloc/tmpfs/dynamic_runtimes/python27/54c5883f70296ec8_unzipped/python27_lib/versions/1/google/appengine/ext/ndb/key.py:673)
at positional_wrapper (/base/alloc/tmpfs/dynamic_runtimes/python27/54c5883f70296ec8_unzipped/python27_lib/versions/1/google/appengine/ext/ndb/utils.py:160)
at reference (/base/alloc/tmpfs/dynamic_runtimes/python27/54c5883f70296ec8_unzipped/python27_lib/versions/1/google/appengine/ext/ndb/key.py:546)
at key_to_pb (/base/alloc/tmpfs/dynamic_runtimes/python27/54c5883f70296ec8_unzipped/python27_lib/versions/1/google/appengine/ext/ndb/model.py:682)
at async_get (/base/alloc/tmpfs/dynamic_runtimes/python27/54c5883f70296ec8_unzipped/python27_lib/versions/1/google/appengine/datastore/datastore_rpc.py:1627)
at _get_tasklet (/base/alloc/tmpfs/dynamic_runtimes/python27/54c5883f70296ec8_unzipped/python27_lib/versions/1/google/appengine/ext/ndb/context.py:344)
at _help_tasklet_along (/base/alloc/tmpfs/dynamic_runtimes/python27/54c5883f70296ec8_unzipped/python27_lib/versions/1/google/appengine/ext/ndb/tasklets.py:430)
at get (/base/alloc/tmpfs/dynamic_runtimes/python27/54c5883f70296ec8_unzipped/python27_lib/versions/1/google/appengine/ext/ndb/context.py:760)
at _help_tasklet_along (/base/alloc/tmpfs/dynamic_runtimes/python27/54c5883f70296ec8_unzipped/python27_lib/versions/1/google/appengine/ext/ndb/tasklets.py:427)
at blob_info_async (/base/data/home/apps/b~suiqui-dev-170002/checklist:checklist-dev.406185416109946132/suiqui/file/models.py:20)
UPDATE:
The exception was raised at blobstore.BlobInfo.get_async()
@ndb.tasklet
def blob_info_async(self):
    blobinfo = yield blobstore.BlobInfo.get_async(self.blob_key)
    raise ndb.Return(blobinfo)
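For illustration only, a minimal defensive sketch (an assumption on my part, not a fix from the SDK): skip the datastore lookup when the blob key string would exceed the 500-byte key-name limit seen in the error above.

@ndb.tasklet
def blob_info_async(self):
    # Hedged guard: the datastore rejects key names longer than 500 bytes,
    # so bail out early instead of letting get_async() raise ValueError.
    if len(str(self.blob_key)) > 500:
        raise ndb.Return(None)
    blobinfo = yield blobstore.BlobInfo.get_async(self.blob_key)
    raise ndb.Return(blobinfo)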
Related
I have a working flow to upload files from SharePoint to Salesforce, and it works fine.
I recently identified an issue: there is a maximum batch limit on the Salesforce end, i.e. a single batch can contain at most 10,000,000 characters. When a large file contains more than 10,000,000 characters, the upload to Salesforce fails with the error below.
Is there any other way to upload more than 10,000,000 characters into Salesforce in a single batch? I cannot send half of the file in one batch and the other half in another batch because sync-up issues arise.
Code:
<enricher target="#[flowVars['jobInfo_upload']]" doc:name="Enricher jobId insert">
<sfdc:create-job config-ref="SFA_NOL_SHAREPOINT" type="ContentVersion" concurrencyMode="Parallel" contentType="XML" operation="insert" doc:name="Create Job"/>
</enricher>
<expression-component doc:name="Expression to save jobid"><![CDATA[sessionVars.jobInfo_upload = flowVars.jobInfo_upload.id]]></expression-component>
<dw:transform-message metadata:id="1dec8ccb-75ec-4be9-933e-06eb92354eba" doc:name="Transform Message">
<dw:set-payload><![CDATA[%dw 1.0
%output application/java
---
[{
Title: flowVars.filename,
PathOnClient: flowVars.path,
TagCsv: "Sharepoint Version: " ++ flowVars.MajorVersion ++ "." ++ flowVars.MinorVersion,
VersionData: payload,
FirstPublishLocationId: flowVars.FirstPublishLocationId
}]]]></dw:set-payload>
</dw:transform-message>
<sfdc:create-batch config-ref="SFA_NOL_SHAREPOINT" doc:name="Insert">
<sfdc:job-info ref="#[flowVars.jobInfo_upload]"/>
<sfdc:objects ref="#[payload]"/>
</sfdc:create-batch>
Error:
09:51:01.348 10/13/2017 Worker-0 [apl-sfa-sharepoint-interface].batch-upload-simpleBatchFlow.stage1.17 ERROR
********************************************************************************
Message : Failed to invoke createBatch. Message payload is of type: ArrayList
Type : org.mule.api.MessagingException
Code : MULE_ERROR-29999
JavaDoc : http://www.mulesoft.org/docs/site/current3/apidocs/org/mule/api/MessagingException.html
Payload : [{Title=Customer Presentation Deck_Template (October 2017).pptx, PathOnClient=/Users/dangnguyen/sharepoint/ Deck_Template.pptx, TagCsv=Sharepoint Version: 1.0, VersionData=UEsDBBQABgAIAAAAIQD9wUeU7QUAADV9AAATAAgCW0NvbnRlbnRfVHlwZXNdLnhtbCCiBAIooAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADMnU1v20YQhu8F+h8EXQuLH5IoKrCdQ...
********************************************************************************
Exception stack is:
1. ClientInputError : Failed to read request. Exceeded max size limit of 10000000 (com.sforce.async.AsyncApiException)
com.sforce.async.BulkConnection:180 (null)
2. ClientInputError : Failed to read request. Exceeded max size limit of 10000000 (org.mule.modules.salesforce.exception.SalesforceException)
There is a limit on the Bulk API itself.
https://www.salesforce.com/us/developer/docs/api_asynch/Content/asynch_api_concepts_limits.htm
Batches for data loads can consist of a single CSV or XML file that can be no larger than 10 MB.
A batch can contain a maximum of 10,000 records.
A batch can contain a maximum of 10,000,000 characters for all the data in a batch.
A field can contain a maximum of 32,000 characters.
A record can contain a maximum of 5,000 fields.
A record can contain a maximum of 400,000 characters for all its fields.
A batch must contain some content or an error occurs.
So you should split the input into batches that stay within these limits.
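For illustration, a rough Python sketch (not part of the Mule flow above) of splitting already-serialized records into batches that respect the character and record limits; records, MAX_BATCH_CHARS, and MAX_BATCH_RECORDS are names assumed here:

# Sketch only: split serialized record strings into Bulk-API-sized batches.
MAX_BATCH_CHARS = 10000000   # characters per batch (Bulk API limit)
MAX_BATCH_RECORDS = 10000    # records per batch (Bulk API limit)

def split_into_batches(records):
    batches, current, current_chars = [], [], 0
    for record in records:
        too_big = current_chars + len(record) > MAX_BATCH_CHARS
        too_many = len(current) >= MAX_BATCH_RECORDS
        if current and (too_big or too_many):
            batches.append(current)
            current, current_chars = [], 0
        current.append(record)
        current_chars += len(record)
    if current:
        batches.append(current)
    return batches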
In my GAE app I just do a datastore backup of all Kinds.
When I try to load it into BigQuery, almost all Kinds are loaded successfully; however, with one Kind I get this error:
invalid - Invalid field name "rows.pedido_automatico". Fields must
contain only letters, numbers, and underscores, start with a letter or
underscore, and be at most 128 characters long.
My GAE Kind:
class StockRow(ndb.Model):
    pedido_automatico = ndb.StringProperty(default="N", choices=set(["S", "N"]))

class Stock(ndb.Model):
    rows = ndb.StructuredProperty(StockRow, repeated=True)
Is this a known bug?
This is a known bug that we've got a fix for internally; it should be in our next release.
I'm using the blobstore to back up and recover entities in CSV format. The process works well for all of my smaller models. However, once I start working on models with more than 2K entities, I exceed the soft memory limit. I'm only fetching 50 entities at a time and then writing the results out to the blobstore, so I'm not clear why my memory usage would be building up. I can reliably make the method fail just by increasing the "limit" value passed in below, which results in the method running just a little longer to export a few more entities.
Any recommendations on how to optimize this process to reduce memory consumption?
Also, the files produced are only <500 KB in size. Why would the process use 140 MB of memory?
Simplified example:
file_name = files.blobstore.create(mime_type='application/octet-stream')
with files.open(file_name, 'a') as f:
    writer = csv.DictWriter(f, fieldnames=properties)
    for entity in models.Player.all():
        row = backup.get_dict_for_entity(entity)
        writer.writerow(row)
Produces the error:
Exceeded soft private memory limit with 150.957 MB after servicing 7 requests total
Simplified example 2:
The problem seems to be with using files and the with statement in Python 2.5. Factoring out the csv stuff, I can reproduce almost the same error by simply trying to write a 4,000-line text file to the blobstore.
from __future__ import with_statement
import StringIO
from google.appengine.api import files
from google.appengine.ext.blobstore import blobstore

file_name = files.blobstore.create(mime_type='application/octet-stream')
myBuffer = StringIO.StringIO()
# Put 4000 lines of text in myBuffer
with files.open(file_name, 'a') as f:
    for line in myBuffer.getvalue().splitlines():
        f.write(line)
files.finalize(file_name)
blob_key = files.blobstore.get_blob_key(file_name)
Produces the error:
Exceeded soft private memory limit with 154.977 MB after servicing 24 requests total
Original:
def backup_model_to_blobstore(model, limit=None, batch_size=None):
    file_name = files.blobstore.create(mime_type='application/octet-stream')
    # Open the file and write to it
    with files.open(file_name, 'a') as f:
        # Get the fieldnames for the csv file.
        query = model.all().fetch(1)
        entity = query[0]
        properties = entity.__class__.properties()
        # Add ID as a property
        properties['ID'] = entity.key().id()
        # For debugging rather than try and catch
        if True:
            writer = csv.DictWriter(f, fieldnames=properties)
            # Write out a header row
            headers = dict((n, n) for n in properties)
            writer.writerow(headers)
            numBatches = int(limit / batch_size)
            if numBatches == 0:
                numBatches = 1
            for x in range(numBatches):
                logging.info("************** querying with offset %s and limit %s", x * batch_size, batch_size)
                query = model.all().fetch(limit=batch_size, offset=x * batch_size)
                for entity in query:
                    # This just returns a small dictionary with the key-value pairs
                    row = get_dict_for_entity(entity)
                    # Write out a row for each entity.
                    writer.writerow(row)
    # Finalize the file. Do this before attempting to read it.
    files.finalize(file_name)
    blob_key = files.blobstore.get_blob_key(file_name)
    return blob_key
The error looks like this in the logs
......
2012-02-02 21:59:19.063
************** querying with offset 2050 and limit 50
I 2012-02-02 21:59:20.076
************** querying with offset 2100 and limit 50
I 2012-02-02 21:59:20.781
************** querying with offset 2150 and limit 50
I 2012-02-02 21:59:21.508
Exception for: Chris (202.161.57.167)
err:
Traceback (most recent call last):
.....
blob_key = backup_model_to_blobstore(model, limit=limit, batch_size=batch_size)
File "/base/data/home/apps/singpath/163.356548765202135434/singpath/backup.py", line 125, in backup_model_to_blobstore
writer.writerow(row)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 281, in __exit__
self.close()
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 275, in close
self._make_rpc_call_with_retry('Close', request, response)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 388, in _make_rpc_call_with_retry
_make_call(method, request, response)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 236, in _make_call
_raise_app_error(e)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 179, in _raise_app_error
raise FileNotOpenedError()
FileNotOpenedError
C 2012-02-02 21:59:23.009
Exceeded soft private memory limit with 149.426 MB after servicing 14 requests total
You'd be better off not doing the batching yourself, but just iterating over the query. The iterator will pick a batch size (probably 20) that should be adequate:
q = model.all()
for entity in q:
    row = get_dict_for_entity(entity)
    writer.writerow(row)
This avoids re-running the query with ever-increasing offset, which is slow and causes quadratic behavior in the datastore.
An oft-overlooked fact about memory usage is that the in-memory representation of an entity can use 30-50 times the RAM compared to the serialized form of the entity; e.g. an entity that is 3KB on disk might use 100KB in RAM. (The exact blow-up factor depends on many factors; it's worse if you have lots of properties with long names and small values, even worse for repeated properties with long names.)
In What is the proper way to write to the Google App Engine blobstore as a file in Python 2.5 a similar problem was reported. In an answer there it is suggested that you should try inserting gc.collect() calls occasionally. Given what I know of the files API's implementation I think that is spot on. Give it a try!
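A hedged sketch of that suggestion, reusing file_name, properties, model, and get_dict_for_entity from the question's code; the 500-row interval is arbitrary:

import gc

# Sketch: periodically force garbage collection while streaming rows out.
with files.open(file_name, 'a') as f:
    writer = csv.DictWriter(f, fieldnames=properties)
    for i, entity in enumerate(model.all()):
        writer.writerow(get_dict_for_entity(entity))
        if i % 500 == 0:
            gc.collect()  # release buffers the files API may be holding on to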
I can't speak for the memory use in Python, but considering your error message, the error most likely stems from the fact that a blobstore-backed file in GAE can't be kept open for more than around 30 seconds, so you have to close and reopen it periodically if your processing takes longer.
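Roughly, and only as a sketch (the 20-second threshold and the serialize_row() helper are assumptions for illustration, not part of the files API):

import time

# Sketch: close and reopen the blobstore file before the ~30-second limit hits.
start = time.time()
f = files.open(file_name, 'a')
for entity in model.all():
    f.write(serialize_row(entity))  # serialize_row() is a placeholder
    if time.time() - start > 20:
        f.close()
        f = files.open(file_name, 'a')
        start = time.time()
f.close()
files.finalize(file_name)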
It could possibly be a time-exceeded error, due to the 30-second request limit. In my implementation, in order to bypass it, instead of having a webapp handler for the operation I fire a task on the default queue. The cool thing about the queue is that it takes one line of code to invoke it, it has a 10-minute time limit, and if a task fails it retries before the time limit. I am not really sure if it will solve your problem, but it is worth a try.
from google.appengine.api import taskqueue
...
taskqueue.add(url="the url that invokes your method")
you can find more info about the queues here.
Or consider using a backend for serious computations and file operations.
I'm trying to use an App Engine User object's user_id (returned by the User.user_id() method) as a key_name in my own User class. The problem is that it keeps telling me that it's an invalid key_name. I've tried SHA-2'ing it, using the digest() as well as the hexdigest() method to reduce the number of possible characters, but still no good result. Is this because the value is too long, or because key names can't contain certain characters? And also, how can I modify a user_id in such a way that it stays unique, but is also usable as a key_name for an entity? Extra bonus if it uses a hash so that the user_id can't be guessed.
Here is the code where the error occurred:
def get_current_user():
    return User.get(db.Key(hashlib.sha1(users.get_current_user().user_id()).hexdigest()))
I'm now doing some more testing, considering suggestions from the comments and answer.
I'm not sure why it isn't working for you; the following has no issues when I run it in the dev console.
from google.appengine.ext import db
from google.appengine.api import users
user = users.get_current_user()
name = user.user_id()
print db.Key.from_path('User', name)
However, if you are hashing it (which it sounds like you may be), be aware that you may get a collision. I would avoid using a hash and would consider some other means of anonymization if you are giving the key to clients, such as another model whose key you can give away, with the user's key stored in it. Another method would be to encrypt the id (using the same key for all users) rather than hash it.
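As a rough sketch of that "separate model" idea (the model and helper names here are hypothetical, not something from your code):

import os
import base64
from google.appengine.ext import db

class PublicUserAlias(db.Model):
    # key_name is a random token that is safe to hand to clients;
    # the real user key stays server-side.
    user_key = db.ReferenceProperty()

def make_alias(user_entity):
    token = base64.urlsafe_b64encode(os.urandom(16))
    return PublicUserAlias(key_name=token, user_key=user_entity).put()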
If you are doing something that generates binary data (encryption / hash digest), App Engine (at least the SDK) has issues, so you need to encode it first and use that as the key_name.
import base64
import hashlib

name = user.user_id()
hashed_name = hashlib.sha1(name).digest()
encoded_name = base64.b64encode(hashed_name)
db.Key.from_path('User', encoded_name)
The App Engine docs mention a 1 MB limit on both entity size and batch get requests (db.get()):
http://code.google.com/appengine/docs/python/datastore/overview.html
Is there also a limit on the total size of all entities returned by a query for a single fetch() call?
Example query:
db.Model.all().fetch(1000)
Update: As of 1.4.0 batch get limits have been removed!
Size and quantity limits on datastore batch get/put/delete operations have been removed. Individual entities are still limited to 1 MB, but your app may batch as many entities together for get/put/delete calls as the overall datastore deadline will allow for.
There's no longer a limit on the number of entities that can be returned by a query, but the same entity size limit applies when you are actually retrieving / iterating over the entities. This only applies to a single entity at a time, though; it is not a limit on the total size of all entities returned by the query.
Bottom line: as long as you don't have a single entity that is > 1 MB, you should be OK with queries.
I tried it out on production and you can indeed exceed 1 MB total for a query. I stopped testing at around 20 MB total response size.
from app import models

# generate 1Mb string
a = 'a'
while len(a) < 1000000:
    a += 'a'

# text is a db.TextProperty()
c = models.Comment(text=a)
c.put()

for c in models.Comment.all().fetch(100):
    print c
Output:
<app.models.Comment object at 0xa98f8a68a482e9f8>
<app.models.Comment object at 0xa98f8a68a482e9b8>
<app.models.Comment object at 0xa98f8a68a482ea78>
<app.models.Comment object at 0xa98f8a68a482ea38>
....
Yes, there is a size limit; the quotas and limits section explicitly states that there is a 1 megabyte limit on db API calls.
You will not be able to db.get(list_of_keys) if the total size of the entities in the batch is over 1 megabyte. Likewise, you will not be able to put a batch if the total size of the entities in the batch is over 1 megabyte.
The 1,000 entity limit has been removed, but (at present) you will need to ensure the total size of your batches is less than 1 megabyte yourself.
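For example, a hedged sketch of keeping a batch put under the limit by estimating each entity's serialized size with db.model_to_protobuf(); the 1,000,000-byte budget is a rough stand-in for the 1 megabyte limit:

from google.appengine.ext import db

MAX_BATCH_BYTES = 1000000  # rough budget for one batch call

def put_in_batches(entities):
    # Sketch: flush the batch whenever adding the next entity would push the
    # estimated serialized size of the batch past the budget.
    batch, batch_bytes = [], 0
    for entity in entities:
        size = len(db.model_to_protobuf(entity).Encode())
        if batch and batch_bytes + size > MAX_BATCH_BYTES:
            db.put(batch)
            batch, batch_bytes = [], 0
        batch.append(entity)
        batch_bytes += size
    if batch:
        db.put(batch)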