I ran into an issue where the data being uploaded to a db.Text property was over 1 MB, so I compressed the information using zlib. The bulkloader by default didn't support the Unicode data being uploaded, so I patched the source code to use the unicodecsv module rather than Python's built-in csv module. The problem I'm running into now is that Google App Engine's bulkloader still can't handle the Unicode characters (even though the db.Text property is Unicode).
[ERROR ] [Thread-12] DataSourceThread:
Traceback (most recent call last):
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/bulkloader.py", line 1611, in run
self.PerformWork()
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/bulkloader.py", line 1730, in PerformWork
for item in content_gen.Batches():
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/bulkloader.py", line 542, in Batches
self._ReadRows(key_start, key_end)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/bulkloader.py", line 452, in _ReadRows
row = self.reader.next()
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/bulkload/csv_connector.py", line 219, in generate_import_record
for input_dict in self.dict_generator:
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/unicodecsv/__init__.py", line 188, in next
row = csv.DictReader.next(self)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 108, in next
row = self.reader.next()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/unicodecsv/__init__.py", line 106, in next
row = self.reader.next()
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/bulkload/csv_connector.py", line 55, in utf8_recoder
for line in codecs.getreader(encoding)(stream):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 612, in next
line = self.readline()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 527, in readline
data = self.read(readsize, firstline=True)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 474, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c in position 29: invalid start byte
I know that for my local testing I could modify the Python files to use the unicodecsv module instead, but that doesn't solve the problem for GAE's Datastore in production. Is anyone aware of an existing solution to this problem?
Solved this the other week: you just need to base64-encode the results, and then you won't have any issues with the bulkloader. Base64 increases the size by 30-50%, but since zlib had already compressed my data to 10% of the original, this wasn't too bad.
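For illustration, a minimal sketch of that round-trip, assuming the property holds raw bytes (the helper names pack_text/unpack_text are mine, not from the original answer):

import base64
import zlib

def pack_text(raw_bytes):
    # Compress first, then base64-encode so the stored value is
    # ASCII-safe for the bulkloader's CSV handling.
    return base64.b64encode(zlib.compress(raw_bytes))

def unpack_text(packed):
    # Reverse the steps: base64-decode, then decompress.
    return zlib.decompress(base64.b64decode(packed))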
How many records can we get from Google App Engine in a single query, so that we can display a count to the user? And can we increase the timeout limit from 3 seconds to 5 seconds?
In my experience, ndb cannot pull more than 1000 records at a time. Here is an example of what happens if I try to use .count() on a kind that contains ~500,000 records.
s~project-id> models.Transaction.query().count()
WARNING:root:suspended generator _count_async(query.py:1330) raised AssertionError()
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/usr/local/Caskroom/google-cloud-sdk/latest/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/utils.py", line 160, in positional_wrapper
return wrapped(*args, **kwds)
File "/usr/local/Caskroom/google-cloud-sdk/latest/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/query.py", line 1287, in count
return self.count_async(limit, **q_options).get_result()
File "/usr/local/Caskroom/google-cloud-sdk/latest/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/tasklets.py", line 383, in get_result
self.check_success()
File "/usr/local/Caskroom/google-cloud-sdk/latest/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/tasklets.py", line 427, in _help_tasklet_along
value = gen.throw(exc.__class__, exc, tb)
File "/usr/local/Caskroom/google-cloud-sdk/latest/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/query.py", line 1330, in _count_async
batch = yield rpc
File "/usr/local/Caskroom/google-cloud-sdk/latest/google-cloud-sdk/platform/google_appengine/google/appengine/ext/ndb/tasklets.py", line 513, in _on_rpc_completion
result = rpc.get_result()
File "/usr/local/Caskroom/google-cloud-sdk/latest/google-cloud-sdk/platform/google_appengine/google/appengine/api/apiproxy_stub_map.py", line 614, in get_result
return self.__get_result_hook(self)
File "/usr/local/Caskroom/google-cloud-sdk/latest/google-cloud-sdk/platform/google_appengine/google/appengine/datastore/datastore_query.py", line 2910, in __query_result_hook
self._batch_shared.conn.check_rpc_success(rpc)
File "/usr/local/Caskroom/google-cloud-sdk/latest/google-cloud-sdk/platform/google_appengine/google/appengine/datastore/datastore_rpc.py", line 1377, in check_rpc_success
rpc.check_success()
File "/usr/local/Caskroom/google-cloud-sdk/latest/google-cloud-sdk/platform/google_appengine/google/appengine/api/apiproxy_stub_map.py", line 580, in check_success
self.__rpc.CheckSuccess()
File "/usr/local/Caskroom/google-cloud-sdk/latest/google-cloud-sdk/platform/google_appengine/google/appengine/api/apiproxy_rpc.py", line 157, in _WaitImpl
self.request, self.response)
File "/usr/local/Caskroom/google-cloud-sdk/latest/google-cloud-sdk/platform/google_appengine/google/appengine/ext/remote_api/remote_api_stub.py", line 308, in MakeSyncCall
handler(request, response)
File "/usr/local/Caskroom/google-cloud-sdk/latest/google-cloud-sdk/platform/google_appengine/google/appengine/ext/remote_api/remote_api_stub.py", line 362, in _Dynamic_Next
assert next_request.offset() == 0
AssertionError
To bypass this, you can do something like:
objs = []
q = None  # page cursor
more = True
while more:
    _objs, q, more = models.Transaction.query().fetch_page(300, start_cursor=q)
    objs.extend(_objs)
But even that will eventually hit memory/timeout limits.
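If all you need is the count, a keys-only variant of the same loop (my assumption, not part of the original answer) keeps memory usage low, though wall-clock time still grows with the size of the kind:

total = 0
cursor = None
more = True
while more:
    # keys_only avoids deserializing full entities; we only tally them.
    keys, cursor, more = models.Transaction.query().fetch_page(
        500, start_cursor=cursor, keys_only=True)
    total += len(keys)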
Currently I use Google Dataflow to pre-compute these values and store the results in Datastore as the DaySummaries & StatsPerUser models.
EDIT:
snakecharmerb is correct. I was able to use .count() in the production environment, but the more entities it has to count, the longer it seems to take. Here's a screenshot of my logs viewer, where it took ~15 seconds to count ~330,000 records.
When I tried adding a filter to that query, which returned a count of ~4,500, it took about a second to run instead.
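For illustration, a filtered count might look like the line below (the user_id property is hypothetical; substitute whatever your query filters on):

# Counting a narrow, filtered slice is much cheaper than counting the whole kind.
models.Transaction.query(models.Transaction.user_id == user_id).count()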
EDIT #2:
OK, I had another App Engine project with a kind with ~8,000,000 records. I tried to do .count() on that in my HTTP request handler, and the request timed out after running for 60 seconds.
When trying to back up the datastore from the Datastore Admin page, backups fail with an error for both Blobstore and Cloud Storage targets:
Call stack for Cloud Storage:
ApplicationError: 1
Traceback (most recent call last):
File "/base/data/home/runtimes/python/python_lib/versions/1/google/appengine/ext/datastore_admin/backup_handler.py", line 642, in _ProcessPostRequest
10)
File "/base/data/home/runtimes/python/python_lib/versions/1/google/appengine/ext/datastore_admin/backup_handler.py", line 492, in _perform_backup
gs_bucket_name = validate_and_canonicalize_gs_bucket(gs_bucket_name)
File "/base/data/home/runtimes/python/python_lib/versions/1/google/appengine/ext/datastore_admin/backup_handler.py", line 1803, in validate_and_canonicalize_gs_bucket
verify_bucket_writable(bucket_name)
File "/base/data/home/runtimes/python/python_lib/versions/1/google/appengine/ext/datastore_admin/backup_handler.py", line 1763, in verify_bucket_writable
test_file = files.open(files.gs.create(file_name), 'a', exclusive_lock=True)
File "/base/data/home/runtimes/python/python_lib/versions/1/google/appengine/api/files/gs.py", line 331, in create
return files._create(_GS_FILESYSTEM, filename=filename, params=params)
File "/base/data/home/runtimes/python/python_lib/versions/1/google/appengine/api/files/file.py", line 650, in _create
_make_call('Create', request, response)
File "/base/data/home/runtimes/python/python_lib/versions/1/google/appengine/api/files/file.py", line 255, in _make_call
_raise_app_error(e)
File "/base/data/home/runtimes/python/python_lib/versions/1/google/appengine/api/files/file.py", line 183, in _raise_app_error
raise ApiTemporaryUnavailableError(e)
ApiTemporaryUnavailableError: ApplicationError: 1
Call stack for Blobstore:
ApplicationError: 1
Traceback (most recent call last):
File "/base/data/home/runtimes/python/python_lib/versions/1/google/appengine/ext/webapp/_webapp25.py", line 716, in call
handler.post(*groups)
File "/base/data/home/runtimes/python/python_lib/versions/1/google/appengine/ext/mapreduce/base_handler.py", line 147, in post
self.handle()
File "/base/data/home/runtimes/python/python_lib/versions/1/google/appengine/ext/mapreduce/handlers.py", line 1391, in handle
state)
File "/base/data/home/runtimes/python/python_lib/versions/1/google/appengine/ext/mapreduce/handlers.py", line 1539, in _schedule_shards
mr_state.writer_state)
File "/base/data/home/runtimes/python/python_lib/versions/1/google/appengine/ext/mapreduce/output_writers.py", line 726, in create
acl=acl)
File "/base/data/home/runtimes/python/python_lib/versions/1/google/appengine/ext/mapreduce/output_writers.py", line 640, in _create_file
return files.blobstore.create(mime_type, filename)
File "/base/data/home/runtimes/python/python_lib/versions/1/google/appengine/api/files/blobstore.py", line 75, in create
return files._create(_BLOBSTORE_FILESYSTEM, params=params)
File "/base/data/home/runtimes/python/python_lib/versions/1/google/appengine/api/files/file.py", line 650, in _create
_make_call('Create', request, response)
File "/base/data/home/runtimes/python/python_lib/versions/1/google/appengine/api/files/file.py", line 255, in _make_call
_raise_app_error(e)
File "/base/data/home/runtimes/python/python_lib/versions/1/google/appengine/api/files/file.py", line 183, in _raise_app_error
raise ApiTemporaryUnavailableError(e)
ApiTemporaryUnavailableError: ApplicationError: 1
This seems to be a problem with the underlying Files API, which is out of our control. Has anyone come across this and found a solution or workaround?
I think this may have to do with the fact that the Files API is deprecated.
I had the same error trying to push a MapReduce pipeline to the blobstore (the same issue George-Bogdan is having). I decided to finally make the switch I had been avoiding: moving to the GCS client library. Once I finished the switch, my tests were conclusive: this works properly.
It seems to be a transient issue on Google's side, but since the Files API is deprecated, I feel safer using the library they actually suggest using now.
(answer copied from my own answer on another question)
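For reference, a minimal sketch of what the switch looks like, assuming the GCS client library (the cloudstorage package) is vendored into the app; the bucket and object names below are placeholders:

import cloudstorage as gcs

def write_output(data):
    # Paths are /bucket/object; gcs.open handles retries and finalization.
    filename = '/my-bucket/mapreduce-output/part-0001'
    with gcs.open(filename, 'w', content_type='text/plain') as fh:
        fh.write(data)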
I'm writing an app that produces a PDF file containing text with Unicode characters. On the GAE dev server it works fine, but after deployment it can't load the font file (it crashes in add_font() (pyfpdf)).
The code is:
# -*- coding: utf-8 -*-
def fun1():
    from gluon.contrib.pyfpdf import FPDF, HTMLMixin

    class MyFPDF(FPDF, HTMLMixin):
        pass

    pdf = MyFPDF()
    pdf.add_font('DejaVu', '', 'DejaVuSansCondensed.ttf', uni=True)
    pdf.add_page()
    pdf.set_font('DejaVu', '', 16)
    pdf.write(10, 'test-ąśł')
    response.headers['Content-Type'] = 'application/pdf'
    return pdf.output(dest='S')
The font files (along with a DejaVuSansCondensed.pkl file generated after the first run on the web2py server) are in /gluon/contrib/fpdf/font. I didn't add anything to routers.py (I'm using the pattern-based system), and app.yaml is unchanged. And I get this:
In FILE: /base/data/home/apps/s~myapp/web2py-04.369240954601780983/applications/app3/controllers/default.py
Traceback (most recent call last):
File "/base/data/home/apps/s~myapp/web2py-04.369240954601780983/gluon/restricted.py", line 212, in restricted
exec ccode in environment
File "/base/data/home/apps/s~myapp/web2py-04.369240954601780983/applications/app3/controllers/default.py", line 674, in <module>
File "/base/data/home/apps/s~myapp/web2py-04.369240954601780983/gluon/globals.py", line 194, in <lambda>
self._caller = lambda f: f()
File "/base/data/home/apps/s~myapp/web2py-04.369240954601780983/applications/app3/controllers/default.py", line 493, in fun1
pdf.add_font('DejaVu', '', 'DejaVuSansCondensed.ttf', uni=True)
File "/base/data/home/apps/s~myapp/web2py-04.369240954601780983/gluon/contrib/fpdf/fpdf.py", line 432, in add_font
font_dict = pickle.load(fh)
File "/base/data/home/runtimes/python27p/python27_dist/lib/python2.7/pickle.py", line 1378, in load
return Unpickler(file).load()
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/pickle.py", line 858, in load
dispatch[key](self)
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/pickle.py", line 966, in load_string
raise ValueError, "insecure string pickle"
ValueError: insecure string pickle
As I said, locally (both web2py/Rocket and the GAE dev server) it works well. After deployment, only something like this works:
pdf = MyFPDF()
pdf.add_page()
pdf.set_font('Arial', '', 16)
pdf.write(10, 'testąśł')
But without "unusual" characters...
The best solution would be to add my own font files (like DejaVu), but basically I need Unicode characters in any font... maybe some "half-solution" using "generic GAE Unicode" fonts... if something like that exists...
Thanks for the suggestion, Tim!
I found a solution... it isn't the best one, but it works...
The problem is with using pickle on GAE. The best solution would probably be to override/rewrite the add_font() function for GAE so that it writes to the datastore instead of the filesystem. Additionally, the ValueError: insecure string pickle error can still occur; I tried b64 encoding according to this, but I still got errors. So my solution is to override the add_font() function with these parts commented out/deleted:
if os.path.exists(unifilename):
    fh = open(unifilename)
    try:
        font_dict = pickle.load(fh)
    finally:
        fh.close()
else:
and
try:
    fh = open(unifilename, "w")
    pickle.dump(font_dict, fh)
    fh.close()
except IOError, e:
    if e.errno != errno.EACCES:
        raise  # Not a permission error.
Because of this, the function recalculates a little more every time instead of just reading the data from the pickle... but it works on GAE.
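A possible refinement (my suggestion, not part of the original workaround): cache the computed font dictionary in memcache instead of the read-only filesystem, so repeated requests skip the recalculation. The names follow fpdf.py's unifilename/font_dict; memcache values must stay under 1 MB:

from google.appengine.api import memcache

def load_font_dict(unifilename, compute_font_dict):
    # Try the cache first; fall back to the expensive TTF parsing step.
    font_dict = memcache.get(unifilename)
    if font_dict is None:
        font_dict = compute_font_dict()
        memcache.set(unifilename, font_dict)
    return font_dict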
Has anyone been successful in backing up large datastore kinds to Cloud Storage? This is an experimental feature, so support is pretty sketchy on the Google end.
The kind we want to back up to Cloud Storage (ultimately with the goal of ingesting it from Cloud Storage into BigQuery) is currently 1.2 TB in size.
- description: BackUp
  url: /_ah/datastore_admin/backup.create?name=OurApp&filesystem=gs&gs_bucket_name=OurBucket&queue=backup&kind=LargeKind
  schedule: every day 00:00
  timezone: America/Regina
  target: ah-builtin-python-bundle
We keep running into the following error message:
Traceback (most recent call last):
File "/base/data/home/apps/s~steprep-prod-hrd/prod-339.366560204640641232/lib/mapreduce/handlers.py", line 182, in handle
input_reader, shard_state, tstate, quota_consumer, ctx)
File "/base/data/home/apps/s~steprep-prod-hrd/prod-339.366560204640641232/lib/mapreduce/handlers.py", line 263, in process_inputs
entity, input_reader, ctx, transient_shard_state):
File "/base/data/home/apps/s~steprep-prod-hrd/prod-339.366560204640641232/lib/mapreduce/handlers.py", line 318, in process_data
output_writer.write(output, ctx)
File "/base/data/home/apps/s~steprep-prod-hrd/prod-339.366560204640641232/lib/mapreduce/output_writers.py", line 711, in write
ctx.get_pool("file_pool").append(self._filename, str(data))
File "/base/data/home/apps/s~steprep-prod-hrd/prod-339.366560204640641232/lib/mapreduce/output_writers.py", line 266, in append
self.flush()
File "/base/data/home/apps/s~steprep-prod-hrd/prod-339.366560204640641232/lib/mapreduce/output_writers.py", line 288, in flush
f.write(data)
File "/python27_runtime/python27_lib/versions/1/google/appengine/api/files/file.py", line 297, in __exit__
self.close()
File "/python27_runtime/python27_lib/versions/1/google/appengine/api/files/file.py", line 291, in close
self._make_rpc_call_with_retry('Close', request, response)
File "/python27_runtime/python27_lib/versions/1/google/appengine/api/files/file.py", line 427, in _make_rpc_call_with_retry
_make_call(method, request, response)
File "/python27_runtime/python27_lib/versions/1/google/appengine/api/files/file.py", line 250, in _make_call
rpc.check_success()
File "/python27_runtime/python27_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 570, in check_success
self.__rpc.CheckSuccess()
File "/python27_runtime/python27_lib/versions/1/google/appengine/api/apiproxy_rpc.py", line 133, in CheckSuccess
raise self.exception
DeadlineExceededError: The API call file.Close() took too long to respond and was cancelled.
There seems to be an undocumented time limit of 30 seconds for write operations from GAE to Cloud Storage. This also applies to write operations made on a backend, so the maximum file size you can create from GAE in Cloud Storage depends on your throughput. Our solution is to split the file: each time the writer task approaches 20 seconds, it closes the current file and opens a new one, and then we join these files locally. For us this results in files of about 500 KB (compressed), so this might not be an acceptable solution for you...
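A rough sketch of that rotation strategy under the (now deprecated) Files API; the bucket name and chunking are placeholders, and error handling is omitted:

import time
from google.appengine.api import files

ROTATE_AFTER = 20  # seconds; stay well below the observed ~30 s limit

def write_rotating(chunks, bucket='my-bucket'):
    part, names = 0, []
    started = time.time()
    name = files.gs.create('/gs/%s/part-%05d' % (bucket, part),
                           mime_type='application/octet-stream')
    f = files.open(name, 'a')
    for chunk in chunks:
        if time.time() - started > ROTATE_AFTER:
            # Close and finalize the current file, then start a fresh one.
            f.close()
            files.finalize(name)
            names.append(name)
            part += 1
            started = time.time()
            name = files.gs.create('/gs/%s/part-%05d' % (bucket, part),
                                   mime_type='application/octet-stream')
            f = files.open(name, 'a')
        f.write(chunk)
    f.close()
    files.finalize(name)
    names.append(name)
    return names  # join these parts locally afterwards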
I have an index with many accented words (e.g. São Paulo, José, etc.).
The Search API works fine, but when I try to run some test queries in the development console, I can't access the index data.
This error only occurs in the development environment. On production GAE everything works fine.
Below is the traceback:
Traceback (most recent call last):
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/_webapp25.py", line 701, in __call__
handler.get(*groups)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/admin/__init__.py", line 1704, in get
'values': self._ProcessSearchResponse(resp),
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/admin/__init__.py", line 1664, in _ProcessSearchResponse
value = TruncateValue(doc.fields[field_name].value)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/admin/__init__.py", line 158, in TruncateValue
value = str(value)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc1' in position 5: ordinal not in range(128)