Why are 404 codes returned from Twitter when using aiohttp and Python 3.8? - aiohttp

I've written a Python script to validate URLs in a website. The script has been used for many years and uses aiohttp to check multiple links in parallel.
Recently (like the last few days), checks against our websites have reported 404 errors when checking links to Twitter like https://twitter.com/linaroorg. The same links work with curl and Postman.
I've extracted a minimal amount of the relevant code from the larger link-checker script in order to figure out what is happening. What I've found is that if I switch from aiohttp to requests-async, the code works. Since neither my script nor the installed copy of aiohttp has changed, I suspect that Twitter changed something at their end that requests-async copes with (somehow) but aiohttp doesn't.
import asyncio
import aiohttp


async def async_url_validation(session, url):
    """ Validate the URL. """
    async with session.get(url) as response:
        return response.status


async def async_check_web(session, links):
    """ Check all external links. """
    results = await asyncio.gather(
        *[async_url_validation(session, url) for url in links]
    )
    # That gets us a collection of the responses, matching up to each of
    # the tasks, so loop through the links again and the index counter
    # will point to the corresponding result.
    i = 0
    for link in links:
        print(link, results[i])
        i += 1


async def check_unique_links():
    UNIQUE_LINKS = [
        "https://twitter.com/linaroorg",
        "https://www.linaro.org"
    ]
    async with aiohttp.ClientSession() as session:
        await async_check_web(session, UNIQUE_LINKS)


loop = asyncio.get_event_loop()
cul_result = loop.run_until_complete(check_unique_links())
loop.close()
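For comparison, here is a minimal sketch of the requests-async variant mentioned above (assuming the requests-async package is installed; its API mirrors requests, but the calls are awaitable):

import asyncio
import requests_async as requests


async def async_url_validation(url):
    """ Validate the URL with requests-async instead of aiohttp. """
    response = await requests.get(url)
    return response.status_code


async def check_unique_links():
    links = ["https://twitter.com/linaroorg", "https://www.linaro.org"]
    results = await asyncio.gather(*[async_url_validation(url) for url in links])
    for link, status in zip(links, results):
        print(link, status)


asyncio.get_event_loop().run_until_complete(check_unique_links())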
What I do find intriguing is that if I print the full response within async_url_validation then the response from Twitter includes this:
'Content-Length': '1723'
suggesting that Twitter is replying successfully and that it might be something within aiohttp that is triggering the 404 response code.
Versions:
aiohttp: 3.8.1
multidict: 6.0.2
yarl: 1.7.2
Python: 3.8.10
Ubuntu: 20.04
Interestingly, if I install Python 3.10.4 onto the same system, the script then works ... but I don't know why and I don't know what has happened to cause just Twitter links to break the code. The reason I picked Python 3.10.4 is because I tried my test script on Ubuntu 22.04 and it worked ... and that seemed to be the principal difference.
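One way to narrow this down (a debugging sketch, not part of the original script) is to print the exact request headers aiohttp sent along with the status and response headers it got back, and compare them with what curl sends:

import asyncio
import aiohttp


async def inspect(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            # Headers aiohttp actually sent for this request
            print("request headers:", dict(response.request_info.headers))
            # Status and headers Twitter returned
            print("status:", response.status)
            print("response headers:", dict(response.headers))


asyncio.run(inspect("https://twitter.com/linaroorg"))

Any difference between the two clients in headers such as User-Agent or Accept is a good candidate for what Twitter is keying on.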

Related

Flink Rest API : /jars/upload returning 404

Following is my code snippet for uploading a jar to Flink. I am getting a 404 response for this POST request; the output for the request is below. I also tried updating the URL to /v1/jars/upload, but got the same response. All of the jar-related APIs give me the same response. I am running this code inside an AWS Lambda that is in the same VPC as the EMR cluster running my Flink job. APIs like /config and /jobs work from this Lambda; only APIs like upload jar and submit job fail with 404.
<Response [404]> {"errors":["Not found: /jars/upload"]}
I also tried the same thing by logging directly into the job manager node and running a curl command, but got the same response. I am using Flink 1.14.2 on an EMR cluster.
curl -X POST -H "Expect:" -F \
  "jarfile=@/home/hadoop/test-1.0-global-14-dyn.jar" \
  http://ip-10-0-1-xxx:8081/jars/upload

{"errors":["Not found: /jars/upload"]}
import json
import os

import boto3
import botocore
import requests


def lambda_handler(event, context):
    config = dict(
        service_name="s3",
        region_name="us-east-1"
    )
    s3_ = boto3.resource(**config)
    bucket = "dv-stream-processor-na-gamma"
    prefix = ""
    file = "Test-1.0-global-14-dyn.jar"
    path = "/tmp/" + file
    try:
        s3_.Bucket(bucket).download_file(f"{file}", "/tmp/" + file)
    except botocore.exceptions.ClientError as e:
        print(e.response['Error']['Code'])
        if e.response['Error']['Code'] == "404":
            print("The object does not exist.")
    print(os.path.isfile('/tmp/' + file))
    response = requests.post(
        "http://ip-10-0-1-xx.ec2.internal:8081/jars/upload",
        files={
            "jarfile": (
                os.path.basename(path),
                open(path, "rb"),
                "application/x-java-archive"
            )
        }
    )
    print(response)
    print(response.text)
The reason uploading the jar was not working for me was that I was using Flink's "Per Job" cluster mode, which does not allow submitting jobs via the REST API. I updated the cluster mode to "Session" mode and it started working.
References for Flink cluster mode information:
https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/overview/
Documentation you can refer to for starting a cluster in session mode: https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/yarn/#starting-a-flink-session-on-yarn
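With the cluster running in session mode, here is a minimal sketch of driving the REST API from Python (the job manager address and jar path below are placeholders):

import requests

JOBMANAGER = "http://ip-10-0-1-xxx:8081"  # placeholder: your job manager address
JAR_PATH = "/home/hadoop/test-1.0-global-14-dyn.jar"  # placeholder: path to the job jar

# Upload the jar to the session cluster
with open(JAR_PATH, "rb") as jar:
    upload = requests.post(
        JOBMANAGER + "/jars/upload",
        files={"jarfile": (JAR_PATH.split("/")[-1], jar, "application/x-java-archive")},
    )
upload.raise_for_status()

# The returned "filename" ends with the jar id needed by the run endpoint
jar_id = upload.json()["filename"].split("/")[-1]

# Submit a run of the uploaded jar
run = requests.post(JOBMANAGER + "/jars/" + jar_id + "/run")
print(run.status_code, run.text)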

Consume SQS tasks from App Engine

I'm attempting to integrate with a third party that is posting messages on an Amazon SQS queue. I need my GAE backend to receive these messages.
Essentially, I want the following script to launch and always be running
import boto3

sqs_client = boto3.client('sqs',
                          aws_access_key_id=KEY,
                          aws_secret_access_key=SECRET,
                          region_name=REGION)

while True:
    msgs_response = sqs_client.receive_message(QueueUrl=QUEUE_URL, WaitTimeSeconds=60)
    for message in msgs_response.get('Messages', []):
        deferred.defer(process_and_delete_message, message)
My main App Engine web app is on Automatic Scaling (with the 60-second & 10-minute task timeouts), but I'm thinking of setting up a micro-service set to either Manual Scaling or Basic Scaling because:
Requests can run indefinitely. A manually-scaled instance can choose to handle /_ah/start and execute a program or script for many hours without returning an HTTP response code. Task queue tasks can run up to 24 hours.
https://cloud.google.com/appengine/docs/standard/python/an-overview-of-app-engine
Apparently both Manual & Basic Scaling also allow "Background Threads", but I am having a hard-time finding documentation for it and I'm thinking this may be a relic from the days before they deprecated Backends in favor of Modules (although I did find this https://cloud.google.com/appengine/docs/standard/python/refdocs/modules/google/appengine/api/background_thread/background_thread#BackgroundThread).
Is Manual or Basic Scaling suited for this? If so, what should I use to listen on sqs_client.receive_message()? One thing I'm concerned about is this task/background thread dying and not relaunching itself.
This may be a possible solution:
Try to use a Google Compute Engine micro instance to run that script continuously and send a REST call to your app engine app. Easy Python Example For Compute Engine
OR:
I have used modules that run instance type B2/B1 for long-running jobs and have never had any trouble, but those jobs do start and stop. I use basic scaling with max_instances set to 1. The jobs I run take around 6 hours to complete.
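A minimal sketch of the first suggestion above (a small script on a Compute Engine instance that polls SQS and relays each message to an App Engine endpoint; the endpoint path /task/sqs-process/ and the placeholder credentials are assumptions):

import boto3
import requests

# Placeholders: fill in your own credentials, queue URL and endpoint
KEY = "<MY-KEY>"
SECRET = "<MY-SECRET>"
REGION = "<MY-REGION>"
QUEUE_URL = "<MY-QUEUE_URL>"
APP_ENGINE_ENDPOINT = "https://my-project-id.appspot.com/task/sqs-process/"  # assumed handler path

sqs_client = boto3.client('sqs',
                          aws_access_key_id=KEY,
                          aws_secret_access_key=SECRET,
                          region_name=REGION)

while True:
    msgs_response = sqs_client.receive_message(QueueUrl=QUEUE_URL, WaitTimeSeconds=20)
    for message in msgs_response.get('Messages', []):
        # Relay the message body to App Engine; delete from the queue only on success
        resp = requests.post(APP_ENGINE_ENDPOINT, data={'message': message['Body']})
        if resp.status_code == 200:
            sqs_client.delete_message(QueueUrl=QUEUE_URL,
                                      ReceiptHandle=message['ReceiptHandle'])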
I ended up creating a manual-scaling App Engine standard micro-service for this. The micro-service has a handler for /_ah/start that never returns and runs indefinitely (many days at a time), and when it does get stopped, App Engine restarts it immediately.
Requests can run indefinitely. A manually-scaled instance can choose
to handle /_ah/start and execute a program or script for many hours
without returning an HTTP response code. Task queue tasks can run up
to 24 hours.
https://cloud.google.com/appengine/docs/standard/python/an-overview-of-app-engine
My /_ah/start handler listens to the SQS queue, and creates Push Queue tasks that my default service is set up to listen for.
I was looking into the Compute Engine route as well as the App Engine Flex route (which is essentially Compute Engine managed by app engine), but there were other complexities like not getting access to ndb and the taskqueue sdk and I didn't have time to dive into that.
Below are all of the files for this micro-service; not included is my lib folder, which contains the source code for boto3 and some other libraries I needed.
I hope this is helpful for someone.
gaesqs.yaml:
application: my-project-id
module: gaesqs
version: dev
runtime: python27
api_version: 1
threadsafe: true

manual_scaling:
  instances: 1

env_variables:
  theme: 'default'
  GAE_USE_SOCKETS_HTTPLIB: 'true'

builtins:
- appstats: on #/_ah/stats/
- remote_api: on #/_ah/remote_api/
- deferred: on

handlers:
- url: /.*
  script: gaesqs_main.app

libraries:
- name: jinja2
  version: "2.6"
- name: webapp2
  version: "2.5.2"
- name: markupsafe
  version: "0.15"
- name: ssl
  version: "2.7.11"
- name: pycrypto
  version: "2.6"
- name: lxml
  version: latest
gaesqs_main.py:
#!/usr/bin/env python
import json
import logging

import appengine_config

try:
    # This is needed to make local development work with SSL.
    # See http://stackoverflow.com/a/24066819/500584
    # and https://code.google.com/p/googleappengine/issues/detail?id=9246 for more information.
    from google.appengine.tools.devappserver2.python import sandbox
    sandbox._WHITE_LIST_C_MODULES += ['_ssl', '_socket']
    import sys
    # this is socket.py copied from a standard python install
    from lib import stdlib_socket
    socket = sys.modules['socket'] = stdlib_socket
except ImportError:
    pass

import boto3
import os
import webapp2
from webapp2_extras.routes import RedirectRoute
from google.appengine.api import taskqueue

app = webapp2.WSGIApplication(debug=os.environ['SERVER_SOFTWARE'].startswith('Dev'))  # , config=webapp2_config)

KEY = "<MY-KEY>"
SECRET = "<MY-SECRET>"
REGION = "<MY-REGION>"
QUEUE_URL = "<MY-QUEUE_URL>"


def process_message(message_body):
    queue = taskqueue.Queue('default')
    task = taskqueue.Task(
        url='/task/sqs-process/',
        countdown=0,
        target='default',
        params={'message': message_body})
    queue.add(task)


class Start(webapp2.RequestHandler):
    def get(self):
        logging.info("Start")
        for loggers_to_suppress in ['boto3', 'botocore', 'nose', 's3transfer']:
            logger = logging.getLogger(loggers_to_suppress)
            if logger:
                logger.setLevel(logging.WARNING)
        logging.info("boto3 loggers suppressed")

        sqs_client = boto3.client('sqs',
                                  aws_access_key_id=KEY,
                                  aws_secret_access_key=SECRET,
                                  region_name=REGION)
        while True:
            msgs_response = sqs_client.receive_message(QueueUrl=QUEUE_URL, WaitTimeSeconds=20)
            logging.info("msgs_response: %s" % msgs_response)
            for message in msgs_response.get('Messages', []):
                logging.info("message: %s" % message)
                process_message(message['Body'])
                sqs_client.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message['ReceiptHandle'])


_routes = [
    RedirectRoute('/_ah/start', Start, name='start'),
]

for r in _routes:
    app.router.add(r)
appengine_config.py:
import os

from google.appengine.ext import vendor
from google.appengine.ext.appstats import recording

appstats_CALC_RPC_COSTS = True

# Add any libraries installed in the "lib" folder.
# Use pip with the -t lib flag to install libraries in this directory:
# $ pip install -t lib gcloud
# https://cloud.google.com/appengine/docs/python/tools/libraries27
try:
    vendor.add('lib')
except:
    print "Unable to add 'lib'"


def webapp_add_wsgi_middleware(app):
    app = recording.appstats_wsgi_middleware(app)
    return app


if os.environ.get('SERVER_SOFTWARE', '').startswith('Development'):
    print "gaesqs development"
    import imp
    import os.path
    import inspect
    from google.appengine.tools.devappserver2.python import sandbox
    sandbox._WHITE_LIST_C_MODULES += ['_ssl', '_socket']
    # Use the system socket.
    real_os_src_path = os.path.realpath(inspect.getsourcefile(os))
    psocket = os.path.join(os.path.dirname(real_os_src_path), 'socket.py')
    imp.load_source('socket', psocket)
    os.environ['HTTP_HOST'] = "my-project-id.appspot.com"
else:
    print "gaesqs prod"
    # Doing this on dev_appserver/localhost seems to cause outbound https requests to fail
    from lib import requests
    from lib.requests_toolbelt.adapters import appengine as requests_toolbelt_appengine
    # Use the App Engine Requests adapter. This makes sure that Requests uses
    # URLFetch.
    requests_toolbelt_appengine.monkeypatch()
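The files above cover the gaesqs module itself. The default service also needs a handler for the push-queue tasks that process_message enqueues at /task/sqs-process/; a minimal webapp2 sketch of such a handler (the class name and routing below are assumptions, not part of the original files):

import logging
import webapp2


class SqsProcessHandler(webapp2.RequestHandler):
    """Hypothetical default-service handler for tasks enqueued by the gaesqs module."""

    def post(self):
        message_body = self.request.get('message')
        logging.info("Processing SQS message: %s" % message_body)
        # ... application-specific processing goes here ...
        # Returning 200 marks the task as done; any other status causes a retry.


app = webapp2.WSGIApplication([
    ('/task/sqs-process/', SqsProcessHandler),
], debug=False)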

Testing Google Cloud PubSub push endpoints locally

Trying to figure out the best way to test PubSub push endpoints locally. We tried with ngrok.io, but you must own the domain in order to whitelist (the tool for doing so is also broken… resulting in an infinite redirect loop). We also tried emulating PubSub locally. I am able to publish and pull, but I cannot get the push subscriptions working. We are using a local Flask webserver like so:
@app.route('/_ah/push-handlers/events', methods=['POST'])
def handle_message():
    print request.json
    return jsonify({'ok': 1}), 200
The following produces no result:
client = pubsub.Client()
topic = client.topic('events')
topic.create()
subscription = topic.subscription('test_push', push_endpoint='http://localhost:5000/_ah/push-handlers/events')
subscription.create()
topic.publish('{"test": 123}')
It does not complain when we attempt to create a subscription to an HTTP endpoint (whereas live PubSub will if you do not use HTTPS). Perhaps this is by design? Pull works just fine… Any ideas on how to best develop PubSub push endpoints locally?
Following the latest PubSub library documentation at the time of writing, the following example creates a subscription with a push configuration.
Requirements
I have tested with the following requirements:
Google Cloud SDK 285.0.1 (for the PubSub local emulator)
Python 3.8.1
Python packages (requirements.txt):
flask==1.1.1
google-cloud-pubsub==1.3.1
Run PubSub emulator locally
export PUBSUB_PROJECT_ID=fake-project
gcloud beta emulators pubsub start --project=$PUBSUB_PROJECT_ID
By default, the PubSub emulator starts on port 8085.
The project argument can be anything; it does not matter.
Flask server
Consider the following server.py:
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route('/_ah/push-handlers/events', methods=['POST'])
def handle_message():
    print(request.json)
    return jsonify({'ok': 1}), 200


if __name__ == "__main__":
    app.run(port=5000)
Run the server (it starts on port 5000):
python server.py
PubSub example
Consider the following pubsub.py:
import sys

from google.cloud import pubsub_v1

if __name__ == "__main__":
    project_id = sys.argv[1]

    # 1. create topic (events)
    publisher_client = pubsub_v1.PublisherClient()
    topic_path = publisher_client.topic_path(project_id, "events")
    publisher_client.create_topic(topic_path)

    # 2. create subscription (test_push with push_config)
    subscriber_client = pubsub_v1.SubscriberClient()
    subscription_path = subscriber_client.subscription_path(
        project_id, "test_push"
    )
    subscriber_client.create_subscription(
        subscription_path,
        topic_path,
        push_config={
            'push_endpoint': 'http://localhost:5000/_ah/push-handlers/events'
        }
    )

    # 3. publish a test message
    publisher_client.publish(
        topic_path,
        data='{"test": 123}'.encode("utf-8")
    )
Finally, run this script:
PUBSUB_EMULATOR_HOST=localhost:8085 \
PUBSUB_PROJECT_ID=fake-project \
python pubsub.py $PUBSUB_PROJECT_ID
Results
Then, you can see the results in the Flask server's log:
{'subscription': 'projects/fake-project/subscriptions/test_push', 'message': {'data': 'eyJ0ZXN0IjogMTIzfQ==', 'messageId': '1', 'attributes': {}}}
127.0.0.1 - - [22/Mar/2020 12:11:00] "POST /_ah/push-handlers/events HTTP/1.1" 200 -
Note that you can retrieve the message that was sent, encoded here in base64 (message.data):
$ echo "eyJ0ZXN0IjogMTIzfQ==" | base64 -d
{"test": 123}
Of course, you can also do the decoding in Python.
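For example, a minimal sketch of doing the decoding in Python (the envelope below mirrors the push payload shown in the log above):

import base64
import json

# Inside the Flask handler, 'envelope' would simply be request.json
envelope = {'message': {'data': 'eyJ0ZXN0IjogMTIzfQ==', 'messageId': '1', 'attributes': {}}}

payload = json.loads(base64.b64decode(envelope['message']['data']).decode('utf-8'))
print(payload)  # {'test': 123}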
This could be a known bug (fix forthcoming) in the emulator where push endpoints created along with the subscription don't work. The bug only affects the initial push config; modifying the push config for an existing subscription should work. Can you try that?
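A hedged sketch of what that could look like with the same 1.x client (assuming the SubscriberClient.modify_push_config method; exact signatures vary between client versions):

from google.cloud import pubsub_v1

subscriber_client = pubsub_v1.SubscriberClient()
subscription_path = subscriber_client.subscription_path("fake-project", "test_push")

# Assumes the subscription was first created without a push config (i.e. as a pull
# subscription); switch it to push delivery afterwards.
subscriber_client.modify_push_config(
    subscription_path,
    {'push_endpoint': 'http://localhost:5000/_ah/push-handlers/events'}
)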
I failed to get the PubSub emulator to work in my local environment (it fails with various Java exceptions). I didn't even get to try features like push with auth, etc. So I ended up using ngrok to expose my local dev server and used the public HTTPS URL from ngrok in the PubSub subscription.
I had no issue with whitelisting or redirects like those described in the question.
So this might be helpful for anyone else.

"ImportError: No module named _ssl" with dev_appserver.py from Google App Engine

Background
"In the Python runtime, we've added support for the Python SSL
Library, so you can now open secure connections to remote services
such as Apple's Push Notification service."
This quote is taken from a recent post on the Google App Engine blog.
Implementation
If you want to use native python ssl, you must enable it using the libraries configuration in your application's app.yaml file where you specify the library name "ssl" . . .
These instructions are provided for developers through the Google App Engine documentation.
The following lines have been added to the app.yaml file:
libraries:
- name: ssl
  version: latest
This much is in line with the advice provided through the Google App Engine documentation.
Problem
I have tried running my project in three different configurations. Two are working, and one is not.
Working ...
After I upload my application to Google App Engine, and run my project through the live server, everything works fine.
Working ...
When I run my project with manage.py runserver and include the Google App Engine SKD in my PYTHONPATH, everything works fine.
Not Working ...
However, when I run my project with dev_appserver.py, I get the following error:
ImportError at /
No module named _ssl
Request Method: GET
Request URL: http://localhost:8080/
Django Version: 1.4.3
Exception Type: ImportError
Exception Value:
No module named _ssl
Exception Location: /usr/local/lib/google_appengine_1.7.7/google/appengine/tools/devappserver2/python/sandbox.py in load_module, line 856
Python Executable: /home/rbose85/Code/venvs/appserver/bin/python
Python Version: 2.7.3
Python Path:
['/home/rbose85/Code/product/site',
'/usr/local/lib/google_appengine_1.7.7',
'/usr/local/lib/google_appengine_1.7.7/lib/protorpc',
'/usr/local/lib/google_appengine_1.7.7',
'/usr/local/lib/google_appengine_1.7.7',
'/usr/local/lib/google_appengine_1.7.7/lib/protorpc',
'/usr/local/lib/google_appengine_1.7.7',
'/usr/local/lib/google_appengine_1.7.7/lib/protorpc',
'/home/rbose85/Code/venvs/appserver/lib/python2.7',
'/home/rbose85/Code/venvs/appserver/lib/python2.7/lib-dynload',
'/usr/lib/python2.7',
'/usr/local/lib/google_appengine',
u'/usr/local/lib/google_appengine_1.7.7/lib/django-1.4',
u'/usr/local/lib/google_appengine_1.7.7/lib/ssl-2.7',
u'/usr/local/lib/google_appengine_1.7.7/lib/webapp2-2.3',
u'/usr/local/lib/google_appengine_1.7.7/lib/webob-1.1.1',
u'/usr/local/lib/google_appengine_1.7.7/lib/yaml-3.10']
Server time: Wed, 24 Apr 2013 11:23:49 +0000
For the current GAE version (1.8.0, at least until 1.8.3), if you want to be able to debug SSL connections in your development environment, you will need to tweak the GAE sandbox a little bit:
Add "_ssl" and "_socket" keys to the dictionary _WHITE_LIST_C_MODULES in /path-to-gae-sdk/google/appengine/tools/devappserver2/python/sandbox.py
Replace the socket.py file provided by Google in /path-to-gae-sdk/google/appengine/dist27 with the socket.py file from your Python installation.
IMPORTANT: Tweaking the sandbox environment might leave you with functionality that works on your local machine but not in production (for example, GAE only supports outbound sockets in production). I recommend restoring your sandbox when you are done developing that specific part of your app.
The solution by jmg works, but instead of changing the SDK files, you could monkey patch the relevant modules.
Just put something like this at the beginning of your project setup.
# Just taking flask as an example
app = Flask('myapp')

if environment == 'DEV':
    import sys
    from google.appengine.tools.devappserver2.python import sandbox
    sandbox._WHITE_LIST_C_MODULES += ['_ssl', '_socket']
    # lib/copy_of_stdlib_socket.py is a copy of socket.py from a standard Python install
    from lib import copy_of_stdlib_socket as patched_socket
    sys.modules['socket'] = patched_socket
    socket = patched_socket
I had to use a slightly different approach to get this working in CircleCI (unsure what peculiarity about their venv config caused this):
appengine_config.py
import os

if os.environ.get('SERVER_SOFTWARE', '').startswith('Development'):
    import imp
    import os.path
    import inspect
    from google.appengine.tools.devappserver2.python import sandbox
    sandbox._WHITE_LIST_C_MODULES += ['_ssl', '_socket']
    # Use the system socket.
    real_os_src_path = os.path.realpath(inspect.getsourcefile(os))
    psocket = os.path.join(os.path.dirname(real_os_src_path), 'socket.py')
    imp.load_source('socket', psocket)
I had this problem because I wasn't vendoring ssl in my app.yaml file. I know the OP did that, but for those landing here for the OP's error, it's worth making sure lines like the following are in your app.yaml file:
libraries:
- name: ssl
  version: latest
I stumbled upon this thread trying to work with Apple's Push Notification service and App Engine... I was able to get this working without any monkey patching by adding the SSL library in my app.yaml, as recommended in the official docs. Hope that helps someone else :)
I added the code to appengine_config.py as listed by Spain Train, but also had to add the following code to get this to work:
phttplib = os.path.join(os.path.dirname(real_os_src_path), 'httplib.py')
imp.load_source('httplib', phttplib)
You can test whether ssl is available on your local system by opening a Python shell and typing import ssl. If no error appears, the problem is something else; otherwise you don't have the relevant libraries installed on your system. If you are using a Linux operating system, try sudo apt-get install openssl openssl-devel or the relevant instructions for your operating system to install them locally. If you are using Windows, these are the instructions.
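A minimal sketch of that check, run with the same interpreter that dev_appserver.py uses:

# An ImportError here means the underlying _ssl extension module is missing
# from this Python build.
try:
    import ssl
except ImportError as e:
    print("ssl not available: %s" % e)
else:
    print("ssl available: %s" % ssl.OPENSSL_VERSION)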

Unit testing OAuth JS with Mocha

I'm working on a JS based project that runs off GAE and part of the code gets the user's avatar using OAuth from Facebook, Twitter or Google. I'm trying to write tests in Mocha in order to test this but I'm running into some problems.
The code works when I test it in the front end, and the way I envisaged it to work would be to use ZombieJS to run the app on GAE's dev_appserver.py, fire the OAuth functions, fill in the appropriate auth stuff and then complete the test by returning the image URL.
However the first hurdle I've got is that it appears that NodeJS's server is not allowing GAE's server to run on the same IP address. For example:
exec 'dev_appserver.py .', ->
  console.log arguments
This returns the error 'Address already in use'. How can I get around this apart from running it on a different machine? Is it possible to tell NodeJS to not reserve the whole IP and just a port? I'm running GAE on 8080 and it works fine when it isn't invoked by NodeJS.
The second problem is ZombieJS. I'm trying to figure out a way I can listen to when new windows are opened and, essentially, tail the console of the browser. I've started two discussions on the Google group but no one has responded yet (https://groups.google.com/forum/?hl=en#!topic/zombie-js/cJklyMbwxRE and https://groups.google.com/forum/?hl=en#!topic/zombie-js/tOhk_lZv5eA)
While the latter isn't as important as I can find ways around it (I hope), the former is the main issue, so I'd greatly appreciate any direction on how to resolve this address conflict.
Here's my NodeJS script:
exec = ( require 'child_process' ).exec
fs = require 'fs'
should = require 'should'
yaml = require 'yaml'
Zombie = require 'zombie'
common = require '../../static/assets/js/common'

url = 'ahmeds.local'
browser = new Zombie()
config = null
consoleCb = 'function consoleSuccess(){console.log("success",arguments)}function consoleFailure(){console.log("failure",arguments)}'

browser.debug = true
browser.silent = false

fs.readFile '../../config.yaml', (error, data) ->
  config = yaml.eval data.toString 'ascii'
  exec 'cd ../../ && dev_appserver.py -a ' + url + ' .', ->
    console.log arguments
    # browser.visit config.local.url, ->
    browser.visit 'http://' + url + ':8080', ->
      browser.evaluate consoleCb
      browser.evaluate 'profileImage("facebook",consoleSuccess,consoleFailure)'
      console.log browser.window.console.output
I have only limited familiarity with NodeJS, but I just tested running a NodeJS server and App Engine local dev server on the same machine — it works just fine. Without seeing your NodeJS code, I'm guessing you're also trying to run NodeJS on port 8080, and so the App Engine server complains when it's started (8080 is the default, and you noted it's the port you are using).
Try passing --port=8081 (or some other port) to your invocation of dev_appserver.py and it should resolve the conflict.
Nothing in the code you've shown (other than the invocation of dev_appserver) should even be listening on any port (unless zombie implements a "server" for remote debugging or something like that). It looks like the port conflict is coming from somewhere else.
Note that zombie's own Mocha test framework does set up an express server, so if you're using it or code lifted from it, that might be doing it.
What does netstat have to say about who's binding to what port?
