Is there an open-source version of Facebook's Linter?

When you post a link to Facebook, it grabs the article title, description and relevant images. Most major sites have the required OG tags, making it easy to grab this info, but FB is also able to handle websites that don't have them (you can try it here).
Clearly they've got a system in place for grabbing this info in the absence of OG tags. Does anyone know if there's an open-source version?
I'm thinking it would need (in order of preference for each section):
Title:
Check for og:title tag.
Check for regular meta "title" tag.
Check for h1 tag.
Description:
Check for og:description tag.
Check for regular meta "description" tag.
Check for div or p tags with sufficient content to indicate a body paragraph
Images:
Check for og:image tags
Check for images over a certain size (say 100x100) and give priority to those that come first.
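The preference order above could be sketched roughly as follows. This is only an illustration: the helper names (firstMatch, metaPattern, extractPreview) are made up for this example, the regular title element stands in for the "meta title" step, attributes are assumed to be double-quoted with content coming last, and the image step simply takes the first img tag without any size check. Regexes on raw HTML are brittle, so a real implementation would more likely use an HTML parser.

```javascript
// Return the first non-empty capture group among the patterns, in order of preference.
function firstMatch(html, patterns) {
  for (const pattern of patterns) {
    const match = pattern.exec(html);
    if (match && match[1].trim() !== '') {
      return match[1].trim();
    }
  }
  return null;
}

// Build a regex for <meta ATTR="VALUE" ... content="...">.
// Assumption: double-quoted attributes, with content after the property/name attribute.
function metaPattern(attr, value) {
  return new RegExp(
    '<meta\\b[^>]*\\b' + attr + '="' + value + '"[^>]*\\bcontent="([^"]*)"',
    'i'
  );
}

// Hypothetical helper applying the fallback order from the list above.
function extractPreview(html) {
  const title = firstMatch(html, [
    metaPattern('property', 'og:title'),
    /<title\b[^>]*>([^<]*)<\/title>/i,
    /<h1\b[^>]*>([^<]*)<\/h1>/i,
  ]);
  const description = firstMatch(html, [
    metaPattern('property', 'og:description'),
    metaPattern('name', 'description'),
    // First paragraph with at least 80 characters of plain text (arbitrary threshold).
    /<p\b[^>]*>([^<]{80,})<\/p>/i,
  ]);
  const image = firstMatch(html, [
    metaPattern('property', 'og:image'),
    // First image on the page; checking its dimensions would require fetching it.
    /<img\b[^>]*\bsrc="([^"]*)"/i,
  ]);
  return { title, description, image };
}

const sample =
  '<html><head><title>Page Title</title>' +
  '<meta name="description" content="Meta description"></head>' +
  '<body><h1>Heading</h1><img src="/a.png"></body></html>';
console.log(extractPreview(sample));
```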
Thanks a lot!

https://github.com/Anonyfox/node-htmlcarve
The htmlcarve module for Node.js does most of what you're after. Here's the code to run it against this page:
const htmlcarve = require('htmlcarve');
htmlcarve.fromUrl('https://scotch.io/tutorials/using-mongoosejs-in-node-js-and-mongodb-applications', function (error, data) {
  // Stop early if the page could not be fetched or parsed.
  if (error) return console.error(error);
  console.log(JSON.stringify(data, null, 2));
});
This produces:
{
"source": {
"html_meta": {
"title": "Easily Develop Node.js and MongoDB Apps with Mongoose ⥠Scotch",
"summary": "",
"image": "/wp-content/themes/thirty/img/scotch-logo.png",
"language": "en-US",
"feed": "https://scotch.io/feed",
"favicon": "https://scotch.io/wp-content/themes/thirty/img/icons/favicon-57.png",
"author": "Chris Sevilleja"
},
"open_graph": {
"title": "Easily Develop Node.js and MongoDB Apps with Mongoose",
"summary": "",
"image": "https://scotch.io/wp-content/uploads/2014/11/mongoosejs-node-mongodb-applications.png"
},
"twitter_card": {
"title": "Easily Develop Node.js and MongoDB Apps with Mongoose",
"summary": "",
"author": "sevilayha"
}
},
"result": {
"title": "Easily Develop Node.js and MongoDB Apps with Mongoose",
"summary": "",
"image": "https://scotch.io/wp-content/uploads/2014/11/mongoosejs-node-mongodb-applications.png",
"author": "sevilayha",
"language": "en-US",
"feed": "https://scotch.io/feed",
"favicon": "https://scotch.io/wp-content/themes/thirty/img/icons/favicon-57.png"
},
"links": {
"deep": "https://scotch.io/tutorials/using-mongoosejs-in-node-js-and-mongodb-applications",
"shallow": "https://scotch.io/tutorials/using-mongoosejs-in-node-js-and-mongodb-applications",
"base": "https://scotch.io"
}
}
If you've got Node.js installed, you can install htmlcarve globally using
npm i -g htmlcarve
and then run it directly from the command line.

Locale ignored in APLA Alexa Developer Console

I'm new to developing skills with Alexa. I've followed the "Build Multi-turn Skills with Alexa Conversations" tutorial up to module 3.
Because I want to develop a skill only for German users, I've changed the language settings of my skill in the Alexa developer console to support only German.
Using "edit audio response", I changed the APLA code from the tutorial to this:
{
"type": "APLA",
"version": "0.8",
"mainTemplate": {
"parameters": [
"payload"
],
"item": {
"type": "Selector",
"strategy": "randomItem",
"items": [
{
"type": "Speech",
"contentType": "text",
"when": "${environment.alexaLocale == 'de-DE'}",
"content": "Willkommen bei meiner App"
},
{
"type": "Speech",
"contentType": "text",
"when": "${environment.alexaLocale == 'de-DE'}",
"content": "Willkommen."
},
{
"type": "Speech",
"contentType": "text",
"when": "${environment.alexaLocale == 'en-US'}",
"content": "Welcome."
}
]
}
}
}
At the bottom of the console I see that my locale is set to German, but when I preview the APLA document above, the audio player always says "Welcome." in the English voice. The other two options are never triggered. What am I missing here?
The audio response tool doesn't take into account the language set on the website.
There is no way to test the environment.alexaLocale condition in this tool.
To test it, update the code of your skill and test it either on the Test tab of your skill in the developer console or directly on a real device. I just tested your code and it works perfectly, just not in the audio tool.
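As a rough mental model only (this is not Alexa's implementation), a Selector with the randomItem strategy can be thought of as dropping the items whose when condition is false and picking one of the remaining items at random. The sketch below assumes the audio tool evaluates the document as if environment.alexaLocale were en-US; that is an inference from the behaviour described in the question, not documented behaviour. The selectRandomItem function and the simplified item shape are hypothetical.

```javascript
// Simplified stand-ins for the three Speech items; "locale" replaces the parsed "when" expression.
const items = [
  { locale: 'de-DE', content: 'Willkommen bei meiner App' },
  { locale: 'de-DE', content: 'Willkommen.' },
  { locale: 'en-US', content: 'Welcome.' },
];

// Keep only the items whose condition holds for the given locale, then pick one at random.
// The random source is injectable so the choice can be made deterministic in tests.
function selectRandomItem(candidates, alexaLocale, random = Math.random) {
  const eligible = candidates.filter((item) => item.locale === alexaLocale);
  if (eligible.length === 0) {
    return null;
  }
  return eligible[Math.floor(random() * eligible.length)];
}

// With the locale fixed to en-US, only one item is ever eligible.
console.log(selectRandomItem(items, 'en-US').content); // always "Welcome."

// With de-DE, either of the two German items may be chosen.
console.log(selectRandomItem(items, 'de-DE').content);
```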

I tried inserting a document in mongodb but I then,

I tried inserting a document into MongoDB, but I received an error saying "Insert not permitted while document contains errors", and I still can't find where the error is in my code. Please help.
I don't know whether there's a new version of MongoDB or a new way of writing code to insert documents, but I've tried everything I can to locate the error in this code and still couldn't.
I also checked whether I used the curly or square brackets incorrectly. They look fine to me, but I'm unsure. Hopefully someone can help me check it.
[
{
"images": [
{
"public_id": "nextjs_media/pb8fnxyickqqe9krov82",
"url": "https://res.cloudinary.com/devatchannel/image/upload/v1605263280/nextjs_media/pb8fnxyickqqe9krov82.jpg"
},
{
"public_id": "nextjs_media/irfwxjz56x4xa6pdwoks",
"url": "https://res.cloudinary.com/devatchannel/image/upload/v1605263281/nextjs_media/irfwxjz56x4xa6pdwoks.jpg"
}
],
"checked": false,
"inStock": 500,
"sold": 0,
"title": "animal",
"price": 5,
"description": "How to and tutorial videos of cool CSS effect, Web Design ideas,JavaScript libraries, Node.",
"content": "Welcome to our channel Dev AT. Here you can learn web designing, UI/UX designing, html css tutorials, css animations and css effects, javascript and jquery tutorials and related so on.",
"category": "5faa35a88fdff228384d51d8"
},
{
"images": [
{
"public_id": "nextjs_media/jdi9qo0oiinwik8uxzxn",
"url": "https://res.cloudinary.com/devatchannel/image/upload/v1605278590/nextjs_media/jdi9qo0oiinwik8uxzxn.jpg"
},
{
"public_id": "nextjs_media/k2pjwtpzolcieioacnu2",
"url": "https://res.cloudinary.com/devatchannel/image/upload/v1605278591/nextjs_media/k2pjwtpzolcieioacnu2.jpg"
},
{
"public_id": "nextjs_media/qbh6auephsy5leaapsu1",
"url": "https://res.cloudinary.com/devatchannel/image/upload/v1605278592/nextjs_media/qbh6auephsy5leaapsu1.jpg"
},
{
"public_id": "nextjs_media/gnsgrxorl5utlnxygjn6",
"url": "https://res.cloudinary.com/devatchannel/image/upload/v1605278594/nextjs_media/gnsgrxorl5utlnxygjn6.jpg"
},
{
"public_id": "nextjs_media/w8qj2rlrhh1es8wxhcui",
"url": "https://res.cloudinary.com/devatchannel/image/upload/v1605278596/nextjs_media/w8qj2rlrhh1es8wxhcui.jpg"
}
],
"checked": false,
"inStock": 300,
"sold": 10,
"title": "wedding invitation",
"price": 5,
"description": "How to and tutorial videos of cool CSS effect, Web Design ideas,JavaScript libraries, Node.",
"content": "Welcome to our channel Dev AT. Here you can learn web designing, UI/UX designing, html css tutorials, css animations and css effects, javascript and jquery tutorials and related so on.",
"category": "5faa35b58fdff228384d51da"
},
{
"images": [
{
"public_id": "nextjs_media/u8qltexka25minj2rj46",
"url": "https://res.cloudinary.com/devatchannel/image/upload/v1605318879/nextjs_media/u8qltexka25minj2rj46.jpg"
},
{
"public_id": "nextjs_media/wb5osprab71emsxp3ibm",
"url": "https://res.cloudinary.com/devatchannel/image/upload/v1605318910/nextjs_media/wb5osprab71emsxp3ibm.jpg"
},
{
"public_id": "nextjs_media/nelvbtwdbk1vjvhufort",
"url": "https://res.cloudinary.com/devatchannel/image/upload/v1605318911/nextjs_media/nelvbtwdbk1vjvhufort.jpg"
},
{
"public_id": "nextjs_media/bnyeto9vaz40yfts92we",
"url": "https://res.cloudinary.com/devatchannel/image/upload/v1605318913/nextjs_media/bnyeto9vaz40yfts92we.jpg"
}
],
"checked": false,
"inStock": 153,
"sold": 5,
"title": "laptop",
"price": 25,
"description": "How to and tutorial videos of cool CSS effect, Web Design ideas,JavaScript libraries, Node.",
"content": "Welcome to our channel Dev AT. Here you can learn web designing, UI/UX designing, html css tutorials, css animations and css effects, javascript and jquery tutorials and related so on.",
"category": "5faa35a88fdff228384d51d8"
}
]

Where to put knowledge base deployments details in QnA bot sdk4?

I'm following instructions for migrating my knowledge base from https://learn.microsoft.com/en-us/azure/cognitive-services/qnamaker/tutorials/migrate-knowledge-base.
Point 9 says I have to use the endpoint (shown in the image below that point in the instructions) in my bot. I have created a Web App Bot in the Azure portal.
With SDK v3, I am able to set this endpoint information on my Web App Bot and get the KB to work. With SDK v4, however, I can't do the same.
How do I migrate my knowledge base to an SDK v4 Web App Bot (QnA Maker)?
There is a good sample of QnA Maker bot with SDK v4 available here in the official samples:
C#: https://github.com/Microsoft/BotBuilder-Samples/tree/master/samples/csharp_dotnetcore/11.qnamaker
JS: https://github.com/Microsoft/BotBuilder-Samples/blob/master/samples/javascript_nodejs/11.qnamaker
In these samples, the endpoint (hostname) information is stored in the .bot file, named qnamaker.bot here, which looks like this:
{
"name": "qnamaker",
"description": "",
"services": [
{
"type": "endpoint",
"name": "development",
"endpoint": "http://localhost:3978/api/messages",
"appId": "",
"appPassword": "",
"id": "25"
},
{
"type": "qna",
"name": "qnamakerService",
"kbId": "",
"subscriptionKey": "",
"endpointKey": "",
"hostname": "",
"id": "227"
}
],
"padlock": "",
"version": "2.0"
}
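The sample code reads these values from the .bot file to connect to your knowledge base, so this is where your KB's deployment details go.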
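As a rough illustration of how such values could be read, here is a minimal sketch that parses a .bot file as plain JSON and picks out the entry of type qna. It deliberately does not use the SDK's own configuration loader, which the official samples rely on. The getQnaSettings function, the returned property names and the placeholder values are all assumptions made for this example.

```javascript
// Example content mirroring the structure of qnamaker.bot; every value is a placeholder.
const botFileText = JSON.stringify({
  name: 'qnamaker',
  services: [
    { type: 'endpoint', name: 'development', endpoint: 'http://localhost:3978/api/messages', id: '25' },
    {
      type: 'qna',
      name: 'qnamakerService',
      kbId: '<your-kb-id>',
      endpointKey: '<your-endpoint-key>',
      hostname: '<your-hostname>',
      id: '227',
    },
  ],
});

// Find the service entry of type "qna" and return the values needed to query the knowledge base.
function getQnaSettings(botFileJson) {
  const config = JSON.parse(botFileJson);
  const qna = (config.services || []).find((service) => service.type === 'qna');
  if (!qna) {
    throw new Error('No service of type "qna" found in the .bot file');
  }
  return {
    knowledgeBaseId: qna.kbId,
    endpointKey: qna.endpointKey,
    host: qna.hostname,
  };
}

console.log(getQnaSettings(botFileText));
```

In a real bot you would read the file from disk instead of building the string inline, and pass the resulting values to whatever QnA client the sample constructs.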

What JSON-LD structured data to use for a multi-paragraph, multi-image blog post?

I have created the following JSON-LD for a blog post in my blog:
{
  "@context": "http://schema.org",
  "@type": "BlogPosting",
  "mainEntityOfPage": {
    "@type": "WebPage",
    "@id": "https://www.example.com"
  },
  "headline": "My Headline",
  "articleBody": "blablabla",
  "articleSection": "bla",
  "description": "Article description",
  "inLanguage": "en",
  "image": "https://www.example.com/myimage.jpg",
  "dateCreated": "2019-01-01T08:00:00+08:00",
  "datePublished": "2019-01-01T08:00:00+08:00",
  "dateModified": "2019-01-01T08:00:00+08:00",
  "author": {
    "@type": "Organization",
    "name": "My Organization",
    "logo": {
      "@type": "ImageObject",
      "url": "https://www.example.com/logo.jpg"
    }
  },
  "publisher": {
    "@type": "Organization",
    "name": "My Organization",
    "logo": {
      "@type": "ImageObject",
      "url": "https://www.example.com/mylogo.jpg"
    }
  }
}
Now, I have some blog posts that contain multiple paragraphs, and each paragraph is accompanied by an image. Any ideas how I can represent such a structure with JSON-LD?
Background
I have created a simple blog which uses a JSON file for 2 purposes: (a) to feed the blog with posts instead of using a DB (by using XMLHttpRequest and JSON.parse) and (b) to add JSON-LD structured data to the code for SEO purposes.
When I read the JSON file I have to know which image belongs to which paragraph of the text in order to display it correctly.
Note: As you seem to need this only for internal purposes, and as there is typically no need to publicly provide data about this kind of structure, I think it would be best not to provide public Schema.org data about it. So you could, for example, use it to build the page, and then remove it again (or whatever works for your case). Then it would also be possible to use a custom vocabulary (under your own domain) for this, if it better fits your needs.
You could use the hasPart property to add a WebPageElement for each paragraph+image block.
Each WebPageElement can have text and image (and, again, hasPart, if you need to nest them).
Note that JSON-LD arrays are unordered by default. You can use @list to make them ordered.
"hasPart": { "@list":
  [
    {
      "@type": "WebPageElement",
      "text": "plain text",
      "image": "image-1.png"
    },
    {
      "@type": "WebPageElement",
      "text": "plain text",
      "image": "image-2.png"
    }
  ]
}
For the blog posting’s header/footer, you could use the more specific WPHeader/WPFooter instead of WebPageElement.
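Since the blog is fed from a JSON file anyway, the ordered hasPart value could be generated from the same data. The following is a minimal sketch under that assumption; the sections array is a hypothetical internal format (one entry per paragraph with its image), and buildHasPart is a made-up helper name.

```javascript
// Hypothetical internal format: one entry per paragraph together with its image.
const sections = [
  { text: 'First paragraph.', image: 'https://www.example.com/image-1.png' },
  { text: 'Second paragraph.', image: 'https://www.example.com/image-2.png' },
];

// Wrap each paragraph+image pair in a WebPageElement and preserve the order with @list.
function buildHasPart(parts) {
  return {
    '@list': parts.map((part) => ({
      '@type': 'WebPageElement',
      text: part.text,
      image: part.image,
    })),
  };
}

const blogPosting = {
  '@context': 'http://schema.org',
  '@type': 'BlogPosting',
  headline: 'My Headline',
  hasPart: buildHasPart(sections),
};

console.log(JSON.stringify(blogPosting, null, 2));
```

The same sections array can then drive the page rendering, so the pairing of paragraph and image is defined in one place.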

Best practice for large site

I am working on a large site and want to implement JSON-LD. The site has a large social media following and a lot of artist profiles and articles.
This is what I currently have (the following code is from Google's guidelines):
Front page
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Organization",
  "name": "Organization name",
  "url": "http://www.your-site.com",
  "sameAs": [
    "http://www.facebook.com/your-profile",
    "http://instagram.com/yourProfile",
    "http://www.linkedin.com/in/yourprofile",
    "http://plus.google.com/your_profile"
  ]
}
</script>
Content pages
<script type='application/ld+json'>
{
  "@context": "http://www.schema.org",
  "@type": "WebSite",
  "name": "About us",
  "url": "http://www.your-site.com/about-us"
}
</script>
Profile pages of each artist:
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "NewsArticle",
  "mainEntityOfPage": {
    "@type": "WebPage",
    "@id": "https://google.com/article"
  },
  "headline": "Article headline",
  "image": [
    "https://example.com/photos/1x1/photo.jpg",
    "https://example.com/photos/4x3/photo.jpg",
    "https://example.com/photos/16x9/photo.jpg"
  ],
  "datePublished": "2015-02-05T08:00:00+08:00",
  "dateModified": "2015-02-05T09:20:00+08:00",
  "author": {
    "@type": "Person",
    "name": "John Doe"
  },
  "publisher": {
    "@type": "Organization",
    "name": "Google",
    "logo": {
      "@type": "ImageObject",
      "url": "https://google.com/logo.jpg"
    }
  },
  "description": "A most wonderful article"
}
</script>
Do I add one script tag per entity, or do I put all the JSON-LD for a page under one script tag? On the front page I have the "Organization" tag and show the social media links. Do I add this on all pages?
You may have multiple script JSON-LD data blocks on a page, but using one script element makes it easier to connect the structured data entities: you can nest entities instead of having to reference their URIs.
What to connect? Your NewsArticle can
provide the WebPage¹ entity as value for the mainEntityOfPage property, and
provide the Organization entity as value for the publisher property.
This is only one possibility. Another one: You could provide the WebPage entity as top-level item and provide the NewsArticle entity as value for the mainEntity property.
If you have to duplicate data (for example, because the Organization is author and publisher, or because it’s the publisher of both, the WebPage and the NewsArticle), you can mix nesting and referencing. Give each entity an @id and wherever you provide this entity as value, also provide its @id.
¹ You are using WebSite, but you probably mean WebPage. Also note that the @context should be http://schema.org, not http://www.schema.org.
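To make the mix of nesting and referencing concrete, here is a small sketch that builds one structure for a single script element. The URLs and the ORG_ID value are placeholders chosen for this example; any stable, unique URI would do as an @id. The Organization is described in full once (nested under the WebPage) and only referenced by its @id from the NewsArticle.

```javascript
// Placeholder identifier for the organization; only needs to be a stable, unique URI.
const ORG_ID = 'http://www.your-site.com/#organization';

// Full description of the organization, given once.
const organization = {
  '@type': 'Organization',
  '@id': ORG_ID,
  name: 'Organization name',
  url: 'http://www.your-site.com',
};

const data = {
  '@context': 'http://schema.org',
  '@type': 'WebPage',
  '@id': 'http://www.your-site.com/artists/john-doe',
  // Nested: the full entity, including its @id.
  publisher: organization,
  mainEntity: {
    '@type': 'NewsArticle',
    headline: 'Article headline',
    // Referenced: only the @id, pointing at the entity described above.
    publisher: { '@id': ORG_ID },
  },
};

// Emit a single script element containing the whole structure.
const scriptTag =
  '<script type="application/ld+json">\n' + JSON.stringify(data, null, 2) + '\n</script>';
console.log(scriptTag);
```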
