Is Google News an example of HTML scraping? - screen-scraping

I need to make a web app similar to Google News.
Do I need to learn HTML scraping for that, or are there other techniques as well?

Most of the stuff Google News shows is available as RSS/Atom feeds. It's far easier to get a website's content through RSS feeds than by scraping.
Other than that, if you can use Java, you can scrape HTML yourself using the excellent library Goose. It is similar to what Flipboard/Instapaper use.

The easiest solution would be to get the RSS or Atom feed of the website you are trying to get data from.
Those are well-known formats, and extracting information from such XML feeds is much easier than getting it from an HTML page: with RSS/Atom, you just have to parse the XML feed and pull out the tags that contain the information you're interested in.
Not sure which language you're working with, but chances are you can find a library that will help you with that.
If the website doesn't export an RSS/Atom feed... well, you'll probably have to fall back to HTML scraping; good luck with that, as HTML is not nearly as well structured as RSS/Atom: you'll have to find out, for each website, where in the page the relevant information sits.
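For illustration, here is a minimal sketch of the feed-parsing approach in Python, assuming the third-party feedparser package is installed; the feed URL is only an example:

import feedparser

# Parse a feed and print each entry's headline and link.
# Substitute whatever feed you actually want to aggregate.
feed = feedparser.parse("https://news.google.com/rss")
for entry in feed.entries:
    print(entry.title, entry.link)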

Related

2sxc - Getting URL path from DNN link parameter / Tab ID

I am working on integrating a 2sxc content WebAPI feed into a ReactJS application.
I have managed to get a JSON feed of data into the application, and am in the process of mapping out the data.
I'm wondering what the best practice would be to "resolve" a URL which is coming through as a DNN page / tab ID.
Below are the various points where this is referenced.
First, the setup of the entity / data types.
Then an example entry with the data filled out; the page link / URL is set up to point to another internal page on the DNN website.
Finally, the data item comes through as a JSON feed via the 2sxc API.
What is the best way to convert this piece of data into a URL which can be used in a SPA type application?
There isn't any "server-side" code going on, just reading a JSON feed on the client side...
My initial idea would be to parse this piece of data in JS to extract the number, then build a URL like one of these:
http://www.dotnetnuke.com/tabid/85/default.aspx
http://www.dotnetnuke.com/default.aspx?tabid=85
I was hoping someone with more experience would be able to suggest a better / cleaner approach.
Thanks in advance
If you were server-side in Razor you'd be doing something like this:
@using DotNetNuke.Common
<a href="@Globals.NavigateURL(XXXX)">View List</a>
where XXXX is Dnn.Tab.TabID, or define a string with the tab id you want
I seem to have a vague memory that I saw somewhere that Daniel (2sxc) has a way to use Globals.NavigateUrl() or similar on the client side, but I have no idea where or if I did see that.
The Default.aspx?tabid=xx format will certainly work, as it's the oldest DNN convention and is still used as a fallback. The URLs aren't nice, but it's OK.
The reason you're seeing this is that the query doesn't perform the automatic lookup that AsDynamic(...) does for you. There is an endpoint to look these up, but it isn't official, so it could change and therefore I don't want to suggest that you use it.
So if you really want a nicer URL, you should either see whether DNN has a REST API for this, or create a small 2sxc API endpoint of your own (in the api folder) just to do that lookup, then use NavigateURL. It would be cool if you shared your work.
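For what it's worth, the fallback-URL idea is just string handling, so here is a minimal sketch of it in Python rather than JS; the "page:85" input is only an assumed example of what the link value in the feed might look like, and the host is taken from the URLs in the question:

import re

# Pull the numeric tab id out of the link value and build the old-style
# Default.aspx?tabid=NN fallback URL.
def tab_url(link_value, host="http://www.dotnetnuke.com"):
    match = re.search(r"\d+", str(link_value))
    if not match:
        return None
    return "{0}/default.aspx?tabid={1}".format(host, match.group(0))

print(tab_url("page:85"))  # -> http://www.dotnetnuke.com/default.aspx?tabid=85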

Getting a dataset of Photos and hashtags of Instagram

I am trying to put together a dataset of public photos from Instagram along with some of the hashtags attached to them. Is there an API for that?
Also, is there a dataset with a list of hashtags or an object vocabulary?
Best
Yes! You need to look at the Instagram API; it provides data based on your search queries and much more. At first you may have to sign up and follow a few of the tutorials to understand what features it provides. The following is a good read: Tutorial
Apart from that, if you are not interested in the API and just want a dataset, have a look at this Social Media Repository
Cheers!

angularjs sitemap SEO

I don't see any updated answer on similar topics (hopefully something has changed with the latest crawler releases), which is why I'm coming up with a specific question.
I have an AngularJS website, which lists products that can be added or removed (the links are clearly updated). URLs have the following format:
http://example.com/#/product/564b9fd3010000bf091e0bf7/published
http://example.com/#/product/6937219vfeg9920gd903bg03/published
The product's ID (6937219vfeg9920gd903bg03) is retrieved by our back-end.
My problem is that Google doesn't list them, probably because I don't have a sitemap.xml file on my server.
On any given day a page can be added (and therefore a new URL to include) or removed.
How can I manage this?
Do I have to manually (or by batch) edit the file each time?
Is there a smart way to tell Google: "Hey my friend, look at this page"?!
Generally you can create a JavaScript/AngularJS sitemap, and according to this guidance from Google:
https://webmasters.googleblog.com/2015/10/deprecating-our-ajax-crawling-scheme.html
they will crawl it.
You can also use Fetch as Google to validate that the pages render correctly.
There is also a study about Google's execution of JavaScript:
http://searchengineland.com/tested-googlebot-crawls-javascript-heres-learned-220157
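To avoid editing the sitemap by hand, it can be regenerated from the same back end that already knows the product IDs. Below is a minimal Python sketch of that idea; get_published_product_ids() is a hypothetical placeholder, and the URLs simply mirror the format from the question:

import xml.etree.ElementTree as ET

def get_published_product_ids():
    # Hypothetical placeholder: in a real app this would query the back end.
    return ["564b9fd3010000bf091e0bf7", "6937219vfeg9920gd903bg03"]

def build_sitemap(base_url="http://example.com"):
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for pid in get_published_product_ids():
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = "{0}/#/product/{1}/published".format(base_url, pid)
    return ET.tostring(urlset, encoding="unicode")

# Rebuild the file whenever products are added or removed (or on a schedule).
with open("sitemap.xml", "w") as f:
    f.write(build_sitemap())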

AngularJS and snapshot pages

I have read up on Angular and on generating snapshots for crawlers.
I just want to know whether the snapshot needs to include CSS, or whether plain markup and img tags are enough for Facebook, Google, etc. to work.
Are there any good examples of how to implement the framework with ExpressionEngine or WordPress?
As I understand it, you would actually create two versions of every page. Is that correctly understood?
Cheers, looking forward to trying it out.
As I understand it, it's enough to modify the .htaccess file to serve the snapshots instead of the Angular pages when the request comes from Facebook or Google, etc.
Thanks

How to dynamically replace contents of HTML tags with Python on Google App Engine with lxml?

I am using Google App Engine and Python to build an application. I am incredibly new to Python as well as GAE. I have an index.html file with the basic template for my site. However, I would like to replace the contents of a few tags depending on the URL, for example updating the title tag for each individual page. From what I can tell, the recommended way to do this is using the lxml library.
And so... tonight is the first time I have ever worked with lxml, and I am having a really hard time wrapping my head around it. I have been fooling around with several permutations of the basic syntax and have not had much success understanding how it works. I have looked for tutorials, but the documentation is few and far between.
When I try the following code I get a 'lxml.etree._ElementTree' object has no attribute 'find_class' error; however, according to the documentation here: http://lxml.de/lxmlhtml.html#parsing-html it sure looks like it should have that method.
Am I on the right path? Is this the most efficient/best way to replace the content of HTML tags?
import os
import webapp2
import lxml.html

# Parse the template, try to find the element(s) with class "title",
# and write the document back out.
doc = lxml.html.parse('index.html')
doc.find_class("title") == 'About Page'  # this is the line that raises the error above
self.response.write(lxml.html.tostring(doc))
This is definitely not the way to do that on Google App Engine. You should use a template framework like Jinja2 or Django templates to achieve your goal.
But before all that, make sure you have completed the Getting Started tutorial, where you can see these things in action.
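As a rough illustration of the template approach (a minimal sketch only, assuming the jinja2 library is available to the app and that index.html contains a {{ title }} placeholder; the handler and route names are made up):

import os
import jinja2
import webapp2

# Load templates from the application directory.
JINJA_ENV = jinja2.Environment(
    loader=jinja2.FileSystemLoader(os.path.dirname(__file__)),
    autoescape=True)

class AboutPage(webapp2.RequestHandler):
    def get(self):
        template = JINJA_ENV.get_template('index.html')
        # index.html would contain, e.g., <title>{{ title }}</title>
        self.response.write(template.render(title='About Page'))

app = webapp2.WSGIApplication([('/about', AboutPage)])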
