BadRequestError while uploading data using bulk loader - google-app-engine

Hello, I have created a sample Greeting application in Google App Engine.
Now I am trying to upload data using the bulk loader, but it gives a BadRequestError.
Here is the command and its output:
D:\Study\M.Tech\Summer\Research\My Work\Query Transformation\Experiment\Tools\Bulkloader\bulkloader test>appcfg.py create_bulkloader_config --url=http://bulkex.appspot.com/remote_api --application=bulkex --filename=config.yml
Creating bulkloader configuration.
[INFO ] Logging to bulkloader-log-20111008.175810
[INFO ] Throttling transfers:
[INFO ] Bandwidth: 250000 bytes/second
[INFO ] HTTP connections: 8/second
[INFO ] Entities inserted/fetched/modified: 20/second
[INFO ] Batch Size: 10
[INFO ] Opening database: bulkloader-progress-20111008.175810.sql3
[INFO ] Opening database: bulkloader-results-20111008.175810.sql3
[INFO ] Connecting to bulkex.appspot.com/remote_api
Please enter login credentials for bulkex.appspot.com
Email: shyam.rk22@gmail.com
Password for shyam.rk22@gmail.com:
[INFO ] Downloading kinds: ['__Stat_PropertyType_PropertyName_Kind__']
[ERROR ] [WorkerThread-3] WorkerThread:
Traceback (most recent call last):
File "C:\Program Files\Google\google_appengine\google\appengine\tools\adaptive
_thread_pool.py", line 176, in WorkOnItems
status, instruction = item.PerformWork(self.__thread_pool)
File "C:\Program Files\Google\google_appengine\google\appengine\tools \bulkloader.py",line 764, in PerformWork transfer_time = self._TransferItem(thread_pool)
File "C:\Program Files\Google\google_appengine\google\appengine\tools\bulkload
er.py", line 1170, in _TransferItem
self, retry_parallel=self.first)
File "C:\Program Files\Google\google_appengine\google\appengine\tools\bulkload
er.py", line 1471, in GetEntities
results = self._QueryForPbs(query)
File "C:\Program Files\Google\google_appengine\google\appengine\tools\bulkload
er.py", line 1442, in _QueryForPbs
raise datastore._ToDatastoreError(e)
BadRequestError: app s~bulkex cannot access app bulkex's data
[INFO ] [WorkerThread-0] Backing off due to errors: 1.0 seconds
[INFO ] An error occurred. Shutting down...
[ERROR ] Error in WorkerThread-3: app s~bulkex cannot access app bulkex's data
[INFO ] Have 0 entities, 0 previously transferred
[INFO ] 0 entities (6466 bytes) transferred in 25.6 seconds

Note the warning under --application in http://code.google.com/appengine/docs/python/tools/uploadingdata.html and use --url instead.

I was having the same issue. I removed the --application=APPID parameter from the command, and it executed and built the config.yml file with all my Kinds from the Datastore!
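Based on these answers, the corrected command (same URL and filename as in the question, with the --application flag dropped) would presumably look like this:
appcfg.py create_bulkloader_config --url=http://bulkex.appspot.com/remote_api --filename=config.yml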

Related

Run SageMaker Batch transform failed on loading model

I am trying to run a batch transform job with the HuggingFace class, a fine-tuned model, and a custom inference file.
The job fails on loading the model, but I can load it locally.
I need a custom inference file because I need to keep the input file as is, so I had to change the input key read from the input JSON file.
Here is the exception:
PredictionException(str(e), 400)
2022-05-08 16:49:45,499 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - mms.service.PredictionException: Can't load config for '/.sagemaker/mms/models/model'. Make sure that:
2022-05-08 16:49:45,499 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -
2022-05-08 16:49:45,499 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - - '/.sagemaker/mms/models/model' is a correct model identifier listed on 'https://huggingface.co/models'
2022-05-08 16:49:45,499 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -
2022-05-08 16:49:45,500 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - - or '/.sagemaker/mms/models/model' is the correct path to a directory containing a config.json file
2022-05-08 16:49:45,500 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -
2022-05-08 16:49:45,500 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - : 400
I am running in script mode:
from sagemaker.huggingface.model import HuggingFaceModel

hub = {
    # 'HF_MODEL_ID':'cardiffnlp/twitter-roberta-base-sentiment',
    'HF_TASK': 'text-classification',
    'INPUT_TEXTS': 'Description'
}

huggingface_model = HuggingFaceModel(
    model_data='../model/model.tar.gz',
    role=role,
    source_dir="../model/pytorch_model/code",
    transformers_version="4.6",
    pytorch_version="1.7",
    py_version="py36",
    entry_point="inference.py",
    env=hub
)

batch_job = huggingface_model.transformer(
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    output_path=output_s3_path,  # we are using the same s3 path to save the output with the input
    strategy='SingleRecord',
    accept='application/json',
    assemble_with='Line'
)

batch_job.transform(
    data=s3_file_uri,
    content_type='application/json',
    split_type='Line',
    # input_filter='$[1:]',
    join_source='Input'
)
Custom inference.py
import json
import os
from transformers import pipeline
import torch

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

def model_fn(model_dir):
    model = pipeline(task=os.environ.get('HF_TASK', 'text-classification'), model=model_dir, tokenizer=model_dir)
    return model

def transform_fn(model, input_data, content_type, accept):
    input_data = json.loads(input_data)
    input_text = os.environ.get('INPUT_TEXTS', 'inputs')
    inputs = input_data.pop(input_text, None)
    parameters = input_data.pop("parameters", None)
    # pass inputs with all kwargs in data
    if parameters is not None:
        prediction = model(inputs, **parameters)
    else:
        prediction = model(inputs)
    return json.dumps(
        prediction,
        ensure_ascii=False,
        allow_nan=False,
        indent=None,
        separators=(",", ":"),
    )
I think the issue is with the "model_data" parameter. It should point to an S3 object (model.tar.gz).
Then the transform job will download the model file from S3 and load it.
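As an illustration of what that might look like (the S3 URI below is a placeholder, not taken from the original post; role and hub are assumed to be defined as in the question):
from sagemaker.huggingface.model import HuggingFaceModel

huggingface_model = HuggingFaceModel(
    model_data='s3://my-bucket/model/model.tar.gz',  # placeholder S3 URI for the packaged model
    role=role,
    source_dir="../model/pytorch_model/code",
    transformers_version="4.6",
    pytorch_version="1.7",
    py_version="py36",
    entry_point="inference.py",
    env=hub
)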
The solution is to change the "task" in the pipeline to "sentiment-analysis":
hub = {
    # 'HF_MODEL_ID':'cardiffnlp/twitter-roberta-base-sentiment',
    'HF_TASK': 'sentiment-analysis',
    'INPUT_TEXTS': 'Description'
}

Can't start clickhouse service, too many files in ../data/default/<TableName>

I have a strange problem with my standalone clickhouse-server installation. The server had been running for some time with a nearly default config, except that the data and tmp directories were moved to a separate disk:
cat /etc/clickhouse-server/config.d/my_config.xml
<?xml version="1.0"?>
<yandex>
<path>/data/clickhouse/</path>
<tmp_path>/data/clickhouse/tmp/</tmp_path>
</yandex>
Today the server stopped responding with a connection refused error. It was rebooted, and after that the service couldn't finish starting:
2018.05.28 13:15:44.248373 [ 2 ] <Information> DatabaseOrdinary (default): 42.86%
2018.05.28 13:15:44.259860 [ 2 ] <Debug> default.event_4648 (Data): Loading data parts
2018.05.28 13:16:02.531851 [ 2 ] <Debug> default.event_4648 (Data): Loaded data parts (2168 items)
2018.05.28 13:16:02.532130 [ 2 ] <Information> DatabaseOrdinary (default): 57.14%
2018.05.28 13:16:02.534622 [ 2 ] <Debug> default.event_5156 (Data): Loading data parts
2018.05.28 13:34:01.731053 [ 3 ] <Information> Application: Received termination signal (Terminated)
Actually, I stopped the process at 57% because it was taking too long to start (maybe it would have finished in an hour or two, but I didn't wait).
The log level is "trace" by default, but it didn't show any reason for this behavior.
I think the problem is the number of files in /data/clickhouse/data/default/event_5156.
There are now 626023 directories in it, and ls -la does not work properly in that directory; I have to use find to count the files:
# time find . -maxdepth 1 | wc -l
626023
real 5m0.302s
user 0m3.114s
sys 0m24.848s
I have two questions:
1) Why did ClickHouse server generate so many files and directories with a default config?
2) How can I start the service in a reasonable time without data loss?
The issue was in the data update method. I was using a script with a JDBC connector and had been sending one row per request. After switching to batch updates, the issue was solved.
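The poster's script used JDBC; as a rough sketch of the same batching idea in Python (using the third-party clickhouse-driver package and a made-up table, not the poster's actual setup), the point is to send many rows per INSERT so ClickHouse creates one data part per batch instead of one part per row:
from clickhouse_driver import Client  # pip install clickhouse-driver

client = Client('localhost')

# Made-up example data; in the real script these would be the rows to load.
rows = [(i, 'event_%d' % i) for i in range(10000)]

# One INSERT carrying all 10000 rows, instead of 10000 single-row INSERTs.
client.execute('INSERT INTO default.my_events (id, name) VALUES', rows)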

Can't authenticate when using download_data on Google App Engine

I'm trying to download all my app data using appcfg.py download_data as follows:
(venv)awp$ ../google_appengine/appcfg.py download_data --application=s~app-name --url=http://app-name.appspot.com/_ah/remote_api --filename=dev-datastore/data.csv
I have included the following in app.yaml:
builtins:
- remote_api: on
I also tried using service account credentials but got the same exception. I have triple-checked that my email, password, and app name are correct, and I have granted myself all possible permissions in the IAM pane of the Cloud dashboard, but still no luck...
(venv)awp$ GOOGLE_APPLICATION_CREDENTIALS=../app-name-51362728f4a9.json ../google_appengine/appcfg.py download_data --authenticate_service_account --application=s~app-name --url=http://app-name.appspot.com/_ah/remote_api --filename=dev-datastore/data.csv --noisy
04:46 PM Downloading data records.
[INFO ] Logging to bulkloader-log-20161206.164650
[INFO ] Throttling transfers:
[INFO ] Bandwidth: 250000 bytes/second
[INFO ] HTTP connections: 8/second
[INFO ] Entities inserted/fetched/modified: 20/second
[INFO ] Batch Size: 10
[INFO ] Opening database: bulkloader-progress-20161206.164650.sql3
[INFO ] Opening database: bulkloader-results-20161206.164650.sql3
[DEBUG ] [WorkerThread-0] WorkerThread: started
[DEBUG ] [WorkerThread-1] WorkerThread: started
[DEBUG ] [WorkerThread-2] WorkerThread: started
[DEBUG ] [WorkerThread-3] WorkerThread: started
[DEBUG ] [WorkerThread-4] WorkerThread: started
[DEBUG ] [WorkerThread-5] WorkerThread: started
[DEBUG ] [WorkerThread-6] WorkerThread: started
[DEBUG ] [WorkerThread-7] WorkerThread: started
[DEBUG ] [WorkerThread-8] WorkerThread: started
[DEBUG ] [WorkerThread-9] WorkerThread: started
[DEBUG ] Configuring remote_api. url_path = /_ah/remote_api, servername = app-name.appspot.com
[DEBUG ] Bulkloader using app_id: s~app-name
[INFO ] Connecting to app-name.appspot.com/_ah/remote_api
Please enter login credentials for app-name.appspot.com
Email: anthony@app-name.com
Password for anthony@app-name.com:
[ERROR ] Exception during authentication
Traceback (most recent call last):
File "/Users/Ant/Documents/google_appengine/google/appengine/tools/bulkloader.py", line 3466, in Run
self.request_manager.Authenticate()
File "/Users/Ant/Documents/google_appengine/google/appengine/tools/bulkloader.py", line 1329, in Authenticate
remote_api_stub.MaybeInvokeAuthentication()
File "/Users/Ant/Documents/google_appengine/google/appengine/ext/remote_api/remote_api_stub.py", line 889, in MaybeInvokeAuthentication
datastore_stub._server.Send(datastore_stub._path, payload=None)
File "/Users/Ant/Documents/google_appengine/google/appengine/tools/appengine_rpc.py", line 441, in Send
self._Authenticate()
File "/Users/Ant/Documents/google_appengine/google/appengine/tools/appengine_rpc.py", line 582, in _Authenticate
super(HttpRpcServer, self)._Authenticate()
File "/Users/Ant/Documents/google_appengine/google/appengine/tools/appengine_rpc.py", line 313, in _Authenticate
auth_token = self._GetAuthToken(credentials[0], credentials[1])
File "/Users/Ant/Documents/google_appengine/google/appengine/tools/appengine_rpc.py", line 252, in _GetAuthToken
response = self.opener.open(req)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 437, in open
response = meth(req, response)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 550, in http_response
'http', request, response, code, msg, hdrs)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 475, in error
return self._call_chain(*args)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 558, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 404: Not Found
[ERROR ] Authentication Failed: Incorrect credentials or unsupported authentication type (e.g. OpenId).
Any help would be greatly appreciated! Thanks.
Try exporting GOOGLE_APPLICATION_CREDENTIALS so that it points to your credentials file:
export GOOGLE_APPLICATION_CREDENTIALS=./app-name-51362728f4a9.json
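Combined with the service-account invocation from the question, that would presumably look like:
export GOOGLE_APPLICATION_CREDENTIALS=./app-name-51362728f4a9.json
../google_appengine/appcfg.py download_data --authenticate_service_account \
    --application=s~app-name \
    --url=http://app-name.appspot.com/_ah/remote_api \
    --filename=dev-datastore/data.csv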

Error enabling AD authentication in GPFS filesystem

I've created a filesystem on a two-node Linux GPFS cluster (both nodes running RHEL 7). I am trying to enable AD authentication but am receiving an error and have been unable to find a fix. Here is the process I am following on the manager node:
./spectrumscale file auth ad
Answer Yes to edit the template.
I fill in the template with the following info:
[file_ad]
servers = bdtestdc01 <--- my test AD server
netbios_name = gpfscluser <--- the name I gave the cluster during setup. Is this field looking for another name?
idmap_role = master
bind_username = administrator
bind_password = the domain password of the administrator account
unixmap_domains = bdtest.subdomain.company.com
I save the template and set the password. I then run:
./spectrumscale deploy
It fails at the "Installing Authentication" step. The log file says:
Error executing action run on resource 'execute[Configure file authentication]
2015-12-21 10:45:31,440 [ TRACE ] bdgpfs01.subdomain.company.com Chef Client failed. 1 resources updated in 3.641691552 seconds
2015-12-21 10:45:31,456 [ TRACE ] bdgpfs01.subdomain.company.com [2015-12-21T10:45:31-08:00] ERROR: execute[Configure file authentication] (auth::auth_file_configure line 22) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'
2015-12-21 10:45:31,456 [ TRACE ] bdgpfs01.subdomain.company.com ---- Begin output of /usr/lpp/mmfs/bin/mmuserauth service create --data-access-method file --type ad --servers 'bdtestbluedc01' --netbios-name 'gpfscluster' --idmap-role 'master' --user-name 'administrator' --password XXXXXX --unixmap-domains 'bdtest.subdomain.company.com' --idmap-range '10000000-299999999' --idmap-range-size '1000000' --enable-nfs-kerberos ----
2015-12-21 10:45:31,456 [ TRACE ] bdgpfs01.subdomain.company.com STDOUT:
2015-12-21 10:45:31,457 [ TRACE ] bdgpfs01.subdomain.company.com STDERR: mmuserauth service create: Syntax error. The correct syntax is:
2015-12-21 10:45:31,457 [ TRACE ] bdgpfs01.subdomain.company.com --unixmap-domains domain(lower value-higher value)
2015-12-21 10:45:31,457 [ TRACE ] bdgpfs01.subdomain.company.com mmuserauth service create: Command failed. Examine previous error messages to determine cause.
2015-12-21 10:45:31,438 [ TRACE ] bdgpfs01.subdomain.company.com mmuserauth service create: Command failed. Examine previous error messages to determine cause.
The domain field expects a value with no whitespace and an ID map range.
Instead of using the configuration template, I ran the following command, which enabled authentication successfully...
mmuserauth service create --type ad --data-access-method file --netbios-name bdtestnode --user-name administrator --idmap-role master --servers myADserver --password Passwr0rd --idmap-range-size 1000000 --idmap-range 10000000-299999999
I then ran the following command to test:
id "testdomain\administrator"
It returned the proper groups and IDs.
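For reference, the error output gives the expected syntax as --unixmap-domains domain(lower value-higher value), so keeping the unixmap setting would presumably require embedding the ID map range in the value, along these lines (the short domain name and range here are placeholders, not tested values):
mmuserauth service create --type ad --data-access-method file \
    --netbios-name bdtestnode --user-name administrator --idmap-role master \
    --servers myADserver --password 'XXXXXX' \
    --idmap-range-size 1000000 --idmap-range 10000000-299999999 \
    --unixmap-domains 'BDTEST(10000000-299999999)'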

How to load data from the online GAE datastore into the local development server?

I have previously used the approach described in the GAE docs to download backups of my entities from the live datastore.
Currently, I have a CSV file per entity kind, which I got by writing a bulkloader.yaml and using this command:
appcfg.py download_data --config_file=bulkloader.yaml --filename=users.csv --kind=Permission --url=http://your_app_id.appspot.com/_ah/remote_api
I also have a sql3 dump file that I got using the command:
appcfg.py download_data --kind=<kind> --url=http://your_app_id.appspot.com/_ah/remote_api --filename=<data-filename>
Now if I try this command:
appcfg.py upload_data --url=http://your_app_id.appspot.com/_ah/remote_api --kind=<kind> --filename=<data-filename>
Replacing the URL with localhost:8080, it asks me for a username/password. Even if I provide a mock username (test@example.com) at http://localhost:8080/_ah/remote_api and check the "admin" checkbox, it always gives me an authentication error.
The other alternative mentioned in the docs is using this:
appcfg.py upload_data --config_file=album_loader.py --filename=album_data.csv --kind=Album --url=http://localhost:8080/_ah/remote_api <app-directory>
I wrote a loader and tried it out; it also asks for a username and password, but it accepts anything here. The output is as follows:
/usr/local/google_appengine/google/appengine/api/search/search.py:232: UserWarning: DocumentOperationResult._code is deprecated. Use OperationResult._code instead.
'Use OperationResult.%s instead.' % (name, name))
/usr/local/google_appengine/google/appengine/api/search/search.py:232: UserWarning: DocumentOperationResult._CODES is deprecated. Use OperationResult._CODES instead.
'Use OperationResult.%s instead.' % (name, name))
Application: knowledgetestgame
Uploading data records.
[INFO ] Logging to bulkloader-log-20121113.210613
[INFO ] Throttling transfers:
[INFO ] Bandwidth: 250000 bytes/second
[INFO ] HTTP connections: 8/second
[INFO ] Entities inserted/fetched/modified: 20/second
[INFO ] Batch Size: 10
[INFO ] Opening database: bulkloader-progress-20121113.210613.sql3
Please enter login credentials for localhost
Email: test@example.com
Password for test@example.com:
[INFO ] Connecting to localhost:8080/_ah/remote_api
[INFO ] Starting import; maximum 10 entities per post
[ERROR ] [WorkerThread-4] WorkerThread:
Traceback (most recent call last):
File "/usr/local/google_appengine/google/appengine/tools/adaptive_thread_pool.py", line 176, in WorkOnItems
status, instruction = item.PerformWork(self.__thread_pool)
File "/usr/local/google_appengine/google/appengine/tools/bulkloader.py", line 764, in PerformWork
transfer_time = self._TransferItem(thread_pool)
File "/usr/local/google_appengine/google/appengine/tools/bulkloader.py", line 933, in _TransferItem
self.content = self.request_manager.EncodeContent(self.rows)
File "/usr/local/google_appengine/google/appengine/tools/bulkloader.py", line 1394, in EncodeContent
entity = loader.create_entity(values, key_name=key, parent=parent)
File "/usr/local/google_appengine/google/appengine/tools/bulkloader.py", line 2728, in create_entity
(len(self.__properties), len(values)))
AssertionError: Expected 17 columns, found 18.
[INFO ] [WorkerThread-5] Backing off due to errors: 1.0 seconds
[INFO ] Unexpected thread death: WorkerThread-4
[INFO ] An error occurred. Shutting down...
[ERROR ] Error in WorkerThread-4: Expected 17 columns, found 18.
[INFO ] 980 entities total, 0 previously transferred
[INFO ] 0 entities (278 bytes) transferred in 5.9 seconds
[INFO ] Some entities not successfully transferred
I have ~4000 entities in total; it says here that 980 were transferred, but when I actually check the local datastore I find none of them.
Below is the loader I use (I used NDB for the Guess entity):
import datetime
from google.appengine.ext import db
from google.appengine.tools import bulkloader
from google.appengine.ext.ndb import key

class Guess(db.Model):
    pass

class GuessLoader(bulkloader.Loader):
    def __init__(self):
        bulkloader.Loader.__init__(self, 'Guess',
            [('selectedAssociation', lambda x: x.decode('utf-8')),
             ('suggestionsList', lambda x: x.decode('utf-8')),
             ('associationIndexInList', int),
             ('timeEntered',
              lambda x: datetime.datetime.strptime(x, '%m/%d/%Y').date()),
             ('rank', int),
             ('topicName', lambda x: x.decode('utf-8')),
             ('topic', int),
             ('player', int),
             ('game', int),
             ('guessString', lambda x: x.decode('utf-8')),
             ('guessTime',
              lambda x: datetime.datetime.strptime(x, '%m/%d/%Y').date()),
             ('accountType', lambda x: x.decode('utf-8')),
             ('nthGuess', int),
             ('score', float),
             ('cutByRoundEnd', bool),
             ('suggestionsListDelay', int),
             ('occurrences', float)
            ])

loaders = [GuessLoader]
Edit: I just noticed this part of the error message: [ERROR ] Error in WorkerThread-0: Expected 17 columns, found 18. I went through the whole CSV file and made sure that every line has 18 columns. I checked the loader and found that I was missing the key column; I gave it type int, but this doesn't work.
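If handling the key inside the loader doesn't work out, one workaround would be to strip the key column from the CSV before uploading, so the file matches the 17 properties the loader declares. A minimal sketch with the standard csv module (assuming the key is the first column, which may not match the actual export; filenames are placeholders):
import csv

# Drop the first column (assumed to be the datastore key) from each row
# so the CSV ends up with the 17 property columns GuessLoader expects.
with open('guess_18cols.csv', 'rb') as src, open('guess_17cols.csv', 'wb') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        writer.writerow(row[1:])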
If you have problems with the authentication, put the following in your appengine_config.py:
import os

if os.environ.get('SERVER_SOFTWARE', '').startswith('Development'):
    remoteapi_CUSTOM_ENVIRONMENT_AUTHENTICATION = (
        'REMOTE_ADDR', ['127.0.0.1'])
then run
appcfg.py download_data --url=http://APPNAME.appspot.com/_ah/remote_api --filename=dump --kind=EntityName
appcfg.py upload_data --url=http://localhost:8080/_ah/remote_api --filename=dump --application=dev~APPNAME
Try just pressing Enter (no username/password). This seemed to do the trick for me. My command (wrapped in a bash script to prevent import errors that I occasionally received) is:
#!/bin/bash
# Modify path
export PYTHONPATH=$PYTHONPATH:.
# Load data
python /path/to/app/config/appcfg.py upload_data \
    --config_file=<my_loader.py> \
    --filename=<output.csv> \
    --kind=<kind> \
    --application=dev~<application_id> \
    --url=http://localhost:8088/_ah/remote_api ./
When prompted for the Email, I hit enter and all is uploaded to the dev server. I am not using NDB in this case, although I do not believe that should make a difference.
