I am attempting to run a local version of the schema.org app so I can write a proposal for an addition to the ontology. I followed the tutorial at http://dataliberate.com/2016/02/10/evolving-schema-org-in-practice-pt1-the-bits-and-pieces/, which had me set up Google App Engine and download a forked version of schema.org using Git.
Unfortunately, I cannot get the schema.org app to run on my machine. Sample GAE apps work fine, but whenever I start the schema.org app I get the following error:
Traceback (most recent call last):
File "C:\Users\Kevin\Desktop\Ontology\schemaorg\lib\rdflib\plugins\parsers\pyRdfa\__init__.py", line 580, in graph_from_source
if not rdfOutput : raise f
rdflib.plugins.parsers.pyRdfa.FailedSource
ERROR2016-09-29 14:54:39,825 wsgi.py:263]
Traceback (most recent call last):
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\runtime\wsgi.py", line 240, in Handle
handler = _config_handle.add_wsgi_middleware(self._LoadHandler())
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\runtime\wsgi.py", line 299, in _LoadHandler
handler, path, err = LoadObject(self._handler)
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\runtime\wsgi.py", line 85, in LoadObject
obj = __import__(path[0])
File "C:\Users\Kevin\Desktop\Ontology\schemaorg\sdoapp.py", line 2585, in <module>
read_schemas(loadExtensions=ENABLE_HOSTED_EXTENSIONS)
File "C:\Users\Kevin\Desktop\Ontology\schemaorg\api.py", line 1055, in read_schemas
apirdflib.load_graph('core',file_paths)
File "C:\Users\Kevin\Desktop\Ontology\schemaorg\apirdflib.py", line 118, in load_graph
g.parse(file=open(full_path(f),"r"),format=format)
File "C:\Users\Kevin\Desktop\Ontology\schemaorg\lib\rdflib\graph.py", line 1037, in parse
parser.parse(source, self, **args)
File "C:\Users\Kevin\Desktop\Ontology\schemaorg\lib\rdflib\plugins\parsers\structureddata.py", line 145, in parse
check_lite=check_lite
File "C:\Users\Kevin\Desktop\Ontology\schemaorg\lib\rdflib\plugins\parsers\structureddata.py", line 176, in _process
processor.graph_from_source(orig_source, graph=graph, pgraph=processor_graph, rdfOutput=False)
File "C:\Users\Kevin\Desktop\Ontology\schemaorg\lib\rdflib\plugins\parsers\pyRdfa\__init__.py", line 662, in graph_from_source
if not rdfOutput : raise b
FailedSource
INFO 2016-09-29 10:54:39,951 module.py:788] default: "GET /_ah/warmup HTTP/1.1" 500-
The problem is occurring when it tries to parse the RDF, but I suspect the lack of RDF output is being caused by the 500 error. I have done an extensive search and found plenty of examples of the 500 error with GAE, but none of the suggested fixes has worked (e.g., increasing the TIMEOUT setting, rolling back to SDK 1.36).
I am running the app on localhost:9080. I get a 500 error whenever I try to access it from the browser. I can, however, access the admin at localhost:8001. For some reason, it shows two instances running.
Any help would be greatly appreciated. Let me know if you need more information.
This problem has now been fixed with a Windows specific patch to the Schema.org code line as referenced in Git Issues (#1384) and (#1412)
A pull of the latest code from the repository should clear the problem.
I'm trying to download needed PDFs related to a researcher.
But the downloaded PDFs can't be opened, saying that the files may be damaged or in wrong format. While another URL used in test resulted in normal PDF files. Do you have any suggestion?
import requests
from bs4 import BeautifulSoup
def download_file(url, index):
local_filename = index+"-"+url.split('/')[-1]
# NOTE the stream=True parameter
r = requests.get(url, stream=True)
with open(local_filename, 'wb') as f:
for chunk in r.iter_content(chunk_size=1024):
if chunk: # filter out keep-alive new chunks
f.write(chunk)
f.flush()
return local_filename
# For Test: http://ww0.java4.datastructures.net/handouts/
# Can't open: http://flyingv.ucsd.edu/smoura/publications.html
root_link="http://ecal.berkeley.edu/publications.html#journals"
r=requests.get(root_link)
if r.status_code==200:
soup=BeautifulSoup(r.text)
# print soup.prettify()
index=1
for link in soup.find_all('a'):
new_link=root_link+link.get('href')
if new_link.endswith(".pdf"):
file_path=download_file(new_link,str(index))
print "downloading:"+new_link+" -> "+file_path
index+=1
print "all download finished"
else:
print "errors occur."
Your code has a comment saying:
# Can't open: http://flyingv.ucsd.edu/smoura/publications.html
Looks like what you can't open is a HTML file. So no wonder that a PDF reader will complain about it...
For any real PDF link that I had a problem with I would proceed as follows:
Download file with a different method (wget, curl, a browser, ...).
Can you download it even? Or is there some password hoop to be jumped through?
Is the download fast + complete?
Does it then open in a PDF viewer?
If so, compare to the file your script downloaded.
What are the differences?
Could they be caused by your script?
Are there no differences within the first few hundred lines, but differences later? End of the file being a bunch of nul-bytes? Then your download didn't complete...
If not so, still compare the differences. If there are none, your script is not at fault. The PDF may really be corrupted...
What does it look like, when opened in a text editor?
I've been having some trouble with the memcache viewer after updating the Python Dev Appserver to Google App Engine 1.7.6 (with Python 2.7).
It appears that my memcache isn't updated or isn't readable. I have tried to view memcache with the app engine memcache viewer but when I input the memcache key I get an error.
when I flush the cache everything proceeds as normal until memcache needs to be read again...
The hit ratio and memcache size increases as normal, so there is something in the cache. Also when I revert back to app engine 1.7.5 everything works just fine. Perhaps someone else has had this issue?
When I input the memcache key I get the following:
Traceback (most recent call last):
File "C:\Program Files (x86)\Google\google_appengine\lib\webapp2-2.5.1\webapp2.py", line 1536, in __call__
rv = self.handle_exception(request, response, e)
File "C:\Program Files (x86)\Google\google_appengine\lib\webapp2-2.5.1\webapp2.py", line 1530, in __call__
rv = self.router.dispatch(request, response)
File "C:\Program Files (x86)\Google\google_appengine\lib\webapp2-2.5.1\webapp2.py", line 1278, in default_dispatcher
return route.handler_adapter(request, response)
File "C:\Program Files (x86)\Google\google_appengine\lib\webapp2-2.5.1\webapp2.py", line 1102, in __call__
return handler.dispatch()
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\tools\devappserver2\admin\admin_request_handler.py", line 80, in dispatch
super(AdminRequestHandler, self).dispatch()
File "C:\Program Files (x86)\Google\google_appengine\lib\webapp2-2.5.1\webapp2.py", line 572, in dispatch
return self.handle_exception(e, self.app.debug)
File "C:\Program Files (x86)\Google\google_appengine\lib\webapp2-2.5.1\webapp2.py", line 570, in dispatch
return method(*args, **kwargs)
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\tools\devappserver2\admin\memcache_viewer.py", line 145, in get
values['value'], values['type'] = self._get_memcache_value_and_type(key)
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\tools\devappserver2\admin\memcache_viewer.py", line 74, in _get_memcache_value_and_type
except (pickle.UnpicklingError, AttributeError, EOFError, ImportError,
NameError: global name 'pickle' is not defined
I tried including an "import pickle" in my main.py but this was in vain.
I've included some samples of my code but hopefully this isn't necessary I hope its something more to do with the app engine update than my code...
some of my main.py file:
#import pickle
from google.appengine.api import memcache
from google.appengine.ext import db
and a sample function for how I handle memcache:
def mc_get(key):
a = memcache.get(key)
if a:
val = a
else:
val = None
return val
def mc_set(key, val):
memcache.set(key, val)
and if I want to query the users in my db I use:
def get_users(update=False):
mc_key = 'USERS'
entries = mc_get(mc_key)
if update or entries is None:
a = User.all()
logging.error('DB---Q - Users')
entries = list(a)
memcache.set(mc_key, entries)
return entries
UPDATE:
I added "import pickle" to the memcache_viewer.py file in Google\google_appengine\google\appengine\tools\devappserver2\admin\memcache_viewer.py
(is this a bug??)
and now when I type in a memcache key I get the following error under the memcache key input field:
Error fetching USERS: Failed to retrieve value from cache: No module named main
Any help would be greatly appreciated, thanks in advance.
I changed from the old datastore API to NDB(a bit of a chore changing the code). The automatic caching seems to have solved the problem, this may suggest that the problem was with my code, but still doesn't explain why everything worked fine when using app engine 1.7.5 and not with 1.7.6.
I'll remove this answer if anyone has an alternative, just thought I'd post my progress in case anyone else is having the same problem.
Problem:
19/06/10 Update: More evidence problem is server-side. Receiving this error on Windows 7 command line (see below for full traceback):
URLError: <urlopen error [Errno 10054] An existing connection was forcibly closed by the remote host>
abort: error: An existing connection was forcibly closed by the remote host
When attempting to push a changeset that contains 6 large files (.exe, .dmg, etc) to my remote server my client (MacHG) is reporting the error:
"Error During Push. Mercurial reported
error number 255: abort: HTTP Error
404: Not Found"
What does the error even mean?! The only thing unique (that I can tell) about this commit is the size, type, and filenames of the files. How can I determine which exact file within the changeset is failing? How can I delete the corrupt changeset from the repository? In a different post, someone reported using "mq" extensions to effectively delete an erroneous changeset from the history within a repository, but mq looks overly complicated for what I'm trying to solve.
Background:
I can push and pull the following: source files, directories, .class files and a .jar file to and from the server, using both MacHG and toirtoise HG.
I successfully committed to my local repository the addition for the first time the 6 large .exe, .dmg etc installer files (about 130Mb total).
In the following commit to my local repository, I removed ("untracked" / forget) the 6 files causing the problem, however the previous (failing) changeset is still queued to be pushed to the server (i.e. my local host is trying to push the "add" and then the "remove" to the remote server - and keep aligned with the "keep everything in history" philosophy of the source control system).
I can commit .txt .java files etc using TortoiseHG from Windows PCs. I haven't actually testing committing or pushing the same large files using TortoiseHG.
Please help!
Setup:
Client applications = MacHG v0.9.7 (SCM 1.5.4), and TortoiseHG v1.0.4 (SCM 1.5.4)
Server = HTTPS, IIS7.5, Mercurial 1.5.4, Python 2.6.5, setup using these instructions:
http://www.jeremyskinner.co.uk/mercurial-on-iis7/
In IIS7.5 the CGI handler is configured to handle ALL verbs (not just GET, POST and HEAD).
My hgweb.cgi file on the server is as follows:
#!/usr/bin/env python
#
# An example hgweb CGI script, edit as necessary
# Path to repo or hgweb config to serve (see 'hg help hgweb')
#config = "/path/to/repo/or/config"
# Uncomment and adjust if Mercurial is not installed system-wide:
#import sys; sys.path.insert(0, "/path/to/python/lib")
# Uncomment to send python tracebacks to the browser if an error occurs:
#import cgitb; cgitb.enable()
from mercurial import demandimport; demandimport.enable()
from mercurial.hgweb import hgweb, wsgicgi
application = hgweb('C:\inetpub\wwwroot\hg\hgweb.config')
wsgicgi.launch(application)
My hgweb.config file on the server is as follows:
[collections]
C:\Mercurial Repositories = C:\Mercurial Repositories
[web]
baseurl = /hg
allow_push = usernamea
allow_push = usernameb
Output from the command line from my macbook (both Mercurial and MacHG installed) using -v and --trackback flags:
macbook15:hgrepos coderunner$ hg -v --traceback push
pushing to https://coderunner:***#hg.mydomain.com.au/hg/hgrepos
searching for changes
3 changesets found
Traceback (most recent call last):
File "/Library/Python/2.6/site-packages/mercurial/dispatch.py", line 50, in _runcatch
return _dispatch(ui, args)
File "/Library/Python/2.6/site-packages/mercurial/dispatch.py", line 471, in _dispatch
return runcommand(lui, repo, cmd, fullargs, ui, options, d)
File "/Library/Python/2.6/site-packages/mercurial/dispatch.py", line 341, in runcommand
ret = _runcommand(ui, options, cmd, d)
File "/Library/Python/2.6/site-packages/mercurial/dispatch.py", line 522, in _runcommand
return checkargs()
File "/Library/Python/2.6/site-packages/mercurial/dispatch.py", line 476, in checkargs
return cmdfunc()
File "/Library/Python/2.6/site-packages/mercurial/dispatch.py", line 470, in <lambda>
d = lambda: util.checksignature(func)(ui, *args, **cmdoptions)
File "/Library/Python/2.6/site-packages/mercurial/util.py", line 401, in check
return func(*args, **kwargs)
File "/Library/Python/2.6/site-packages/mercurial/commands.py", line 2462, in push
r = repo.push(other, opts.get('force'), revs=revs)
File "/Library/Python/2.6/site-packages/mercurial/localrepo.py", line 1491, in push
return self.push_unbundle(remote, force, revs)
File "/Library/Python/2.6/site-packages/mercurial/localrepo.py", line 1636, in push_unbundle
return remote.unbundle(cg, remote_heads, 'push')
File "/Library/Python/2.6/site-packages/mercurial/httprepo.py", line 235, in unbundle
heads=' '.join(map(hex, heads)))
File "/Library/Python/2.6/site-packages/mercurial/httprepo.py", line 134, in do_read
fp = self.do_cmd(cmd, **args)
File "/Library/Python/2.6/site-packages/mercurial/httprepo.py", line 85, in do_cmd
resp = self.urlopener.open(req)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 389, in open
response = meth(req, response)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 502, in http_response
'http', request, response, code, msg, hdrs)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 427, in error
return self._call_chain(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 361, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 510, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 404: Not Found
abort: HTTP Error 404: Not Found
macbook15:hgrepos coderunner$
Output from Windows 7 host (has only TortoiseHG installed) attempting to push the same files to the server (different changset, but contains the same 6 file additions as the changeset being pushed from the macbook)
c:\repositories\hgrepos>hg -v --traceback push
pushing to https://coderunner:***#hg.mydomain.com.au/hg/hgrepos
searching for changes
1 changesets found
Traceback (most recent call last):
File "mercurial\dispatch.pyo", line 50, in _runcatch
File "mercurial\dispatch.pyo", line 471, in _dispatch
File "mercurial\dispatch.pyo", line 341, in runcommand
File "mercurial\dispatch.pyo", line 522, in _runcommand
File "mercurial\dispatch.pyo", line 476, in checkargs
File "mercurial\dispatch.pyo", line 470, in <lambda>
File "mercurial\util.pyo", line 401, in check
File "mercurial\commands.pyo", line 2462, in push
File "mercurial\localrepo.pyo", line 1491, in push
File "mercurial\localrepo.pyo", line 1636, in push_unbundle
File "mercurial\httprepo.pyo", line 235, in unbundle
File "mercurial\httprepo.pyo", line 134, in do_read
File "mercurial\httprepo.pyo", line 85, in do_cmd
File "urllib2.pyo", line 389, in open
File "urllib2.pyo", line 407, in _open
File "urllib2.pyo", line 367, in _call_chain
File "mercurial\url.pyo", line 523, in https_open
File "mercurial\keepalive.pyo", line 259, in do_open
URLError: <urlopen error [Errno 10054] An existing connection was forcibly closed by the remote host>
abort: error: An existing connection was forcibly closed by the remote host
c:\repositories\hgrepos>
It is a keep-alive issue? Is IIS7.5 at fault? Python 2.6.5 at fault?
Went through the same pain points...
With the default settings on the IIS server, you will not be able to push large repositories to the server, as IIS has a default maximum request length of only 4 MB, and a timeout for CGI scripts of 15 min, making it impossible to upload large files.
To enable the uploading of large files (and this is not easy to find on the web…), do the following:
1. In IIS Manager, click on the web site node, and click the Limits… link.
2. Then specify a connection time-out sufficiently large (I chose 1 hour here, or 3600 seconds)
3. Next, click the node containing hg (as per the installation procedure), then double-click CGI
4. Specify a sufficiently-long time out for CGI scripts (e.g., 10 hours)
Now, edit C:\inetpub\wwwroot\hg\web.config, so that it has a new <security> section under <system.webserver>, and a <httpRuntime> specification under <system.web>:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<system.webServer>
[…]
<security>
<requestFiltering>
<requestLimits maxAllowedContentLength ="2147482624" />
</requestFiltering>
</security>
</system.webServer>
<system.web>
<httpRuntime
executionTimeout="540000" maxRequestLength="2097151"/>
</system.web>
</configuration>
This specifies an http timeout of a bit more than 6 days, and a maximum upload limit of about 2 GB.
Had the same issue using IIS 7 as server. Tried the solution above which resolved the error 255 issue, but still got Errorno 10054 with larger files. I then increased the Connection Time-out in IIS which worked.
To change: Web Site -> Manage Web Site -> Advanced Settings -> Connection Limits -> Connection Time-out. The default is 2 minutes. Changed mine to 20 minutes and it worked.
Not sure why this works but seems that Mercurial makes a connection to the server, takes a while to process larger files, then only sends a request. By that time IIS has disconnected the client.
Ok, your solution did it!
I already had a requestLimits tag like this:
<requestLimits maxUrl="16384" maxQueryString="65536" />
so I added maxAllowedContentLength ="524288000" to it like this:
<requestLimits maxUrl="16384" maxQueryString="65536" maxAllowedContentLength ="524288000" />
And that did it!
I'm just posting this for anyone else coming into this thread from a search.
There's currently an issue using the largefiles extension in the mercurial python module when hosted via IIS. See this post if you're encountering issues pushing large changesets (or large files) to IIS via TortoiseHg.
The problem ultimlately turns out to be a bug in SSL processing introduced Python 2.7.3 (probably explaining why there are so many unresolved posts of people looking for problems with Mercurial). Rolling back to Python 2.7.2 let me get a little further ahead (blocked at 30Mb pushes instead of 15Mb), but to properly solve the problem I had to install the IISCrypto utility to completely disable transfers over SSLv2.