How can I scrape this page with selenium and chromedriver in python language? - screen-scraping

I'm trying to scrape data from the website “http://www.nmpa.gov.cn/” using selenium and chromedriver. When I ran the code, chromedriver entered the URL but could not load the page and displayed a blank page instead. When I switched the target website to google.com, scraping succeeded. I concluded that the target website's server detected selenium and refused to send back data. So how can I scrape the data from this website with selenium and chromedriver in Python? I'm quite a Python beginner; thank you in advance for your kind help. Here is my simple code:
from selenium import webdriver
my_driver_path = r"C:\python chrome driver\chromedriver.exe"
driver = webdriver.Chrome(executable_path=my_driver_path)
driver.get('http://www.nmpa.gov.cn/')

The issue here has more to do with HTML than with Python.
If you check the source code of the page (you can do this by adding print(driver.page_source)), you will see that it contains a meta tag with the http-equiv attribute set to "refresh":
<HTML><HEAD><title>NMPA</title></HEAD>
<body>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312"><meta http-equiv="refresh" content="0;URL=/WS04/CL2042/">
</HTML>
What this tag does is direct the browser to go to the given URL (/WS04/CL2042/, in a badly formatted attribute that the browser is luckily able to understand). So, instead of scraping http://www.nmpa.gov.cn/, you have to scrape http://www.nmpa.gov.cn/WS04/CL2042/.
If you change your code to access this other link, you will see that you get the whole page. You can either hard-code the new link or safely join the first link with the "refresh" destination using a function like urllib.parse.urljoin(): https://docs.python.org/3.7/library/urllib.parse.html#urllib.parse.urljoin.
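To illustrate the second option, here is a minimal sketch that pulls the "refresh" destination out of the page source and joins it with the original URL. The helper name resolve_meta_refresh and the regular expression are my own assumptions, not part of selenium; the regex only covers simple tags like the one shown above, so treat it as a starting point rather than a robust HTML parser.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: matches a tag such as
# <meta http-equiv="refresh" content="0;URL=/some/path"> (case-insensitive).
META_REFRESH = re.compile(
    r'<meta[^>]*http-equiv=["\']?refresh["\']?[^>]*'
    r'content=["\']?\s*\d+\s*;\s*url=([^"\'>\s]+)',
    re.IGNORECASE,
)


def resolve_meta_refresh(base_url, html):
    """Return the absolute URL a meta refresh points to, or None if absent."""
    match = META_REFRESH.search(html)
    if match is None:
        return None
    # urljoin resolves a relative destination against the page it came from.
    return urljoin(base_url, match.group(1))


# Possible use with the driver from the question (needs selenium + chromedriver):
# target = resolve_meta_refresh(driver.current_url, driver.page_source)
# if target is not None:
#     driver.get(target)
```

With the source shown above and the base URL http://www.nmpa.gov.cn/, this should give http://www.nmpa.gov.cn/WS04/CL2042/.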

Related

Amazon S3 Image Issue - Share with Facebook - Meta Tag "og:image" - NextJS

I am writing code to share pages of my website on Facebook. I have used meta tags for Open Graph. Previously, the images for the pages were stored on the server itself, and I supplied the link to that image in the meta tag as:
<meta
key="og:image"
property="og:image"
content={coverImage} // link to the image
/>
When sharing on Facebook, everything worked fine. The image was displayed properly along with the title and description.
Recently, I moved the image store to Amazon S3 instead of storing images on the server. Now coverImage holds the S3 link. I supply this link in the Open Graph image tag, but Facebook does not preview the image. I cleared the cache and scraped the URL again, but Facebook shows the warning:
"Provided og:image, https://***.s3.ap-south-1.amazonaws.com/......*.jpg could not be downloaded. This can happen due to several different reasons such as your server using unsupported content-encoding. The crawler accepts deflate and gzip content encodings."
The image is previewed everywhere else, and I have used ContentType: 'image/jpeg', ContentEncoding: 'base64'. I am not really sure why it is being blocked on Facebook.
I would be very grateful if anyone could help me resolve this.
Thanks in advance.
So, I added the S3 base URL to the App Domain section on "Facebook Developer Console" and it started working.

How to modify HTMLElement in index.html before page gets returned to requestor

Based on the custom URL parameters I process, I am trying to dynamically modify a meta tag that I have given an id in index.html, like so:
<meta name="og:image" content="http://example.com/someurl.jpg" id="ogImage"/>
The code below in my home.ts seems to be working
document.getElementById('ogImage').setAttribute("content", Media.ImageURL) ;
I can verify that it is via the browser dev console/elements.
However, when I view it from Facebook via their Open Graph object debugger at
https://developers.facebook.com/tools/debug/og/object/
It appears to see the default
http://example.com/someurl.jpg
as if index.html is shipped before my home.ts gets a chance to make the update.
Perhaps my understanding is flawed and there is a better way to do this.
Thank you.
Note1: initially, I was thinking I had to make some angular binding between index.html and one of my services but I could not locate any sample code, the closest I came to was this post
How can I update meta tags in AngularJS?
But I don't know how to apply it for my ionic2/3 code, so I opted for the document.get approach.
Note2: the ultimate goal here is to share a link on social media (web or app) such as Facebook, or in a messenger such as Viber or Skype, and have it resolve to a meaningful image, title and description that drive the visit back to the site via the browser, or via the app if the user clicking the link is on a mobile device with my app version of the site installed.
Note3: if you decide to point me to ionic deeplinking, please provide code that matches the above, because I could not understand how to apply it to my case.
If you are trying to implement dynamic Open Graph meta tag values in your pages, you will need a server-side scripting language like PHP. Such a script runs on the server and updates the pages as needed; the pages are then served to the requesting site or application.
Client-side scripting (i.e. JavaScript) is usually ignored when a site or app merely visits your link in order to extract (that is, scrape or parse from the HTML) information such as that provided by the Open Graph meta tags (og:title, og:description, og:image...).
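To make the idea concrete, here is a rough sketch of filling in the tags on the server before the HTML is sent. It uses Python rather than the PHP mentioned above, purely for illustration; the names PAGE and render_page are hypothetical, and how the function gets wired to your URL parameters depends on whatever server framework you use.

```python
from html import escape
from string import Template

# Hypothetical page skeleton; a real index.html would contain much more.
PAGE = Template("""<!DOCTYPE html>
<html>
<head>
<title>$title</title>
<meta property="og:title" content="$title"/>
<meta property="og:description" content="$description"/>
<meta property="og:image" content="$image" id="ogImage"/>
</head>
<body></body>
</html>
""")


def render_page(title, description, image_url):
    """Build the HTML with the Open Graph values already in place."""
    # Escape the values so quotes or ampersands cannot break the attributes.
    return PAGE.substitute(
        title=escape(title, quote=True),
        description=escape(description, quote=True),
        image=escape(image_url, quote=True),
    )
```

Because the crawler receives the finished HTML, it sees the per-page image without having to run any JavaScript.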

Title and Meta Description built using AngularJS doesn't work in Social Media

Whenever I share a link to my website on Facebook, Twitter or anywhere else, I get the following:
{{title}}
{{metadescription}}
When I inspect the element using Chrome, I can see the title and meta description correctly.
Could someone shed some light on this and explain how to fix it?
Do I have to install PhantomJS, SlimerJS, etc.? I heard PhantomJS uses a lot of server memory and processing power.

Get description of meta tag from a URL in Angular

I am setting up a simple app to save links of interesting articles. I'd like to be able to retrieve the description of the meta tag from the website of the URL or at least the first headline or text that appears in the page that is accessed when accessing that URL, to display this next to the URL:
This is the app so far:
http://ux.machinas.com/mux-feed/
Is there a way to do this in angular?
I am facing a similar problem. I get a cross-origin script error when trying to get page info with $http.get(), e.g. $http.get('http://www.stackoverflow.com').
I think the best solution is to get the page info on the server and then send it to the browser, for example:
$http.get('./api/getPageInfo.php?url=www.myUrl.com/whatever').then....
This thread explains how to get page info in PHP: Get Facebook meta tags with PHP
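As a sketch of what the server-side part might do once it has downloaded the page, the snippet below extracts the title and meta description from an HTML string. It is written in Python with the standard library's html.parser instead of PHP; PageInfoParser and get_page_info are made-up names, and fetching the URL itself is left out.

```python
from html.parser import HTMLParser


class PageInfoParser(HTMLParser):
    """Collects the <title> text and the first description meta tag."""

    def __init__(self):
        super().__init__()
        self.description = None
        self._title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and self.description is None:
            # Accept both <meta name="description"> and og:description.
            key = (attrs.get("name") or attrs.get("property") or "").lower()
            if key in ("description", "og:description"):
                self.description = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)

    @property
    def title(self):
        text = "".join(self._title_parts).strip()
        return text or None


def get_page_info(html):
    """Return the title and description found in an HTML string."""
    parser = PageInfoParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "description": parser.description}
```

The server endpoint would then return this dictionary as JSON for the Angular app to display next to the link.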

Google plus one button not showing

I'm trying to add a Google +1 button to my website.
I have followed the instructions here:
http://www.google.com/intl/en/webmasters/+1/button/index.html
This is the code for my webpage:
<html>
<head>
<title>
Why won't it appear?
</title>
<!-- Place this tag in your head or just before your close body tag -->
<script type="text/javascript" src="https://apis.google.com/js/plusone.js"></script>
</head>
<body>
<h1>
Example title
</h1>
<!-- Place this tag where you want the +1 button to render -->
<g:plusone size="tall" href="http://www.example.com/"></g:plusone>
</body>
</html>
As you can see, I've followed their instructions exactly, and yet it does not appear. I've tried it on Chrome, Firefox and IE8 (all on Windows XP). I'm just opening the webpage from my local system.
Interestingly I can see it working here http://www.satinbow.co.uk/xxtest.html
Can anyone solve the mystery?
Update / clues
When the page is stored on my system locally, it doesn't work (hard refreshing didn't fix it either.)
But I've put the page here: dl.dropbox.com/u/6920023/test2.html and it seems to work there.
It would be really cool to know what's going on :)
I think it's because when the page is local (not on a web server), the browser blocks the externally hosted JS script to prevent a security breach. That's why it doesn't work.
Link: http://ejohn.org/blog/tightened-local-file-security/
Another thing to check for is whether you have any ad blockers active. These can disable the +1 button and move the iframe containing the button out of the screen.
While working on my open source project, http://code.google.com/p/gwt-socialmedia,
I discovered another reason that can cause the +1 button not to render: you forgot to define the "URL to +1". It must be a valid URL to an accessible website (so http://localhost won't work, for example).
Indeed, the PlusOne API seems to connect to the site URL, in order to get some metadata about it (like the description, title, etc)
If you don't define the URL, Google will send you an error HTTP 400 (Bad Request), with the internal message : "The requested URL was not found on this server. "
and the button will not appear...
Hope it helps!
Due to browser security (as mentioned in one of the answers), the button is not displayed. To display Google +1 buttons when your file is local, use a local web server (WAMP/XAMPP), or use PHP's built-in server https://www.sitepoint.com/taking-advantage-of-phps-built-in-server/ to host the file on your computer, and you will see the button displayed.
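If neither of those is installed, Python's standard library can also serve a folder over HTTP. This is a swapped-in alternative to the WAMP/XAMPP and PHP options above, not something the answers here tested with the +1 button; make_local_server and the default port 8000 are illustrative choices. It needs Python 3.7 or later.

```python
import functools
import http.server


def make_local_server(directory, port=8000):
    """Build an HTTP server that serves files from `directory` on 127.0.0.1."""
    # SimpleHTTPRequestHandler accepts a `directory` argument since Python 3.7.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)


# Example: serve the current folder until interrupted with Ctrl+C.
# make_local_server(".").serve_forever()
```

With the server running, open http://localhost:8000/yourpage.html (a placeholder file name) in the browser instead of opening the file directly from disk.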

Resources