How to automate download of generated PDFs - screen-scraping

Scenario:
We are required to enter data daily into a government database in a European country. We suddenly have a need to retrieve some of that data, but the only format they will allow is PDFs generated from the data, hundreds of them. We would like to avoid sitting in front of a web browser clicking link after link.
The generated links look like this:
<a href='javascript:viajeros("174814255")'>
<img src="img/pdf.png">
</a>
I have almost no experience with JavaScript, so I don't know whether I can install a routine as a bookmarklet to loop through the DOM, find all the links, and call the function. Nor, if that's possible, how to write it.
The ID numbers can't be predicted, so I can't write another page or curl/wget script to do it. (And if I could, it would still fail as mentioned below.)
The 'viajeros' function is simple:
function viajeros(id){
    var idm = document.forms[0].idioma.value;
    window.open("parteViajeros.do?lang="+idm+"&id_fichero=" + id);
}
but feeding that URI to curl or wget fails. Apparently they check either a cookie or the Referer header and generate an error.
Besides, with each link putting the PDF in a browser tab instead of in the downloads directory, we would still have to do two clicks (tab and save) hundreds of times.
What should I do instead?
For what it's worth, this is on macOS 10.13.4. I normally use Safari, but I also have available Opera and Firefox. I could install Chrome, but that's the last resort. No, that's second to last: we also have a (shudder) Windows 10 laptop. THAT'S last resort.
(Note: I looked at the four suggested duplicates that seemed promising, but each either had no answer or instructed the asker to modify the code that generates the PDF.)

document.querySelectorAll("img[src=\"img/pdf.png\"]")
    .forEach((el, i) => {
        let id = el.parentElement.href.split("\"")[1];
        let url = "parteViajeros.do?lang=" + document.forms[0].idioma.value +
            "&id_fichero=" + id;
        setTimeout(() => {
            downloadURI(url, id);
        }, 1500 * i);
    });
This selects all of the PDF-icon images, then looks at each one's parent for the link target. The ID is extracted from that href and used to build the path to the file to be downloaded, the same URL that 'viajeros' constructs but without the window.open. The URL is then passed to downloadURI, which performs the download.
This uses the downloadURI function from another Stack Overflow answer. You can download a URL by creating a link with the download attribute set and then clicking it, which is implemented as follows. This is only tested in Chrome.
function downloadURI(uri, name) {
    var link = document.createElement("a");
    link.download = name;
    link.href = uri;
    document.body.appendChild(link);
    link.click();
    document.body.removeChild(link);
}
Open the page with the links and open the console. Paste the downloadURI function first, then the code above to download all the links.
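If the download attribute is ignored and the PDFs still open in a tab, a variant that fetches each file with the page's own session cookies and saves the response as a blob may work better. This is only a sketch, assuming parteViajeros.do is on the same origin as the page you are logged into; the saveBlob helper is a made-up name, not part of any library:
function saveBlob(blob, filename) {
    var link = document.createElement("a");
    link.href = URL.createObjectURL(blob);
    link.download = filename;
    document.body.appendChild(link);
    link.click();
    document.body.removeChild(link);
    setTimeout(() => URL.revokeObjectURL(link.href), 10000); // free the blob once the save has started
}

document.querySelectorAll("img[src=\"img/pdf.png\"]").forEach((el, i) => {
    let id = el.parentElement.href.split("\"")[1];
    let url = "parteViajeros.do?lang=" + document.forms[0].idioma.value +
        "&id_fichero=" + id;
    setTimeout(() => {
        fetch(url, { credentials: "same-origin" })      // send the session cookie with the request
            .then(resp => resp.blob())
            .then(blob => saveBlob(blob, id + ".pdf"));
    }, 1500 * i);
});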

I had a similar situation, where I had to download all the (invoice) PDFs that were generated in a day or over the past week.
After some research I was able to do the scraping using PhantomJS, and later I discovered CasperJS, which made the job much easier.
PhantomJS and CasperJS are headless browsers.
Since you have little experience with JS: if you are a C# person, CefSharp may help you instead.
Some useful links to get started with PhantomJS, CasperJS, and CefSharp:
PhantomJS
CasperJS
CefSharp
Try reading their documentation on downloading files.
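For example, a CasperJS script along these lines could collect the ids from the page in the question above and save each PDF. It's only a sketch: the start URL is a placeholder, you would add your own login step, and casper.download reuses PhantomJS's cookie jar, which is often enough to satisfy a cookie check:
// Minimal sketch, run with: casperjs download-pdfs.js
var casper = require('casper').create();

casper.start('https://example.gov/listado.do');   // placeholder: the page that lists the links

casper.then(function () {
    var lang = this.evaluate(function () {
        return document.forms[0].idioma.value;
    });
    var ids = this.evaluate(function () {
        return Array.prototype.map.call(
            document.querySelectorAll("img[src='img/pdf.png']"),
            function (el) { return el.parentElement.href.split('"')[1]; });
    });
    ids.forEach(function (id) {
        casper.download('https://example.gov/parteViajeros.do?lang=' + lang +
            '&id_fichero=' + id, id + '.pdf');
    });
});

casper.run();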

Href opening link in http://localhost:3000/LINK rather than opening the link separately [duplicate]

I have just created a primitive HTML page. Here it is: example
And here is its markup:
<a href="www.google.com">www.google.com</a>
<br/>
<a href="http://www.google.com">http://www.google.com</a>
As you can see, it contains two links. The first one's href doesn't have the 'http' prefix, and when I click it the browser redirects me to the non-existing page https://fiddle.jshell.net/_display/www.google.com. The second one's href has this prefix and the browser produces the correct URL http://www.google.com/. Is it possible to use hrefs such as www.something.com, without http(s) prefixes?
It's possible, and indeed you're doing it right now. It just doesn't do what you think it does.
Consider what the browser does when you link to this:
href="index.html"
What then would it do when you link to this?:
href="index.com"
Or this?:
href="www.html"
Or?:
href="www.index.com.html"
The browser doesn't know what you meant, it only knows what you told it. Without the prefix, it resolves the href relative to the current address, just like any other relative path. The prefix is what tells it that it needs to start at a new root address entirely.
Note that you don't need the http: part, you can do this:
href="//www.google.com"
The browser will use whatever the current protocol is (http, https, etc.) but the // tells it that this is a new root address.
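If you want to see exactly how the browser resolves these, the URL constructor follows the same rules (the base address below is just the fiddle page from the question):
new URL("www.google.com", "https://fiddle.jshell.net/_display/").href
// -> "https://fiddle.jshell.net/_display/www.google.com"  (treated as a relative path)
new URL("//www.google.com", "https://fiddle.jshell.net/_display/").href
// -> "https://www.google.com/"  (same protocol, new root address)
new URL("http://www.google.com", "https://fiddle.jshell.net/_display/").href
// -> "http://www.google.com/"  (fully absolute)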
You can omit the protocol by using // in front of the path. Here is an example:
<a href="//www.google.com">Google</a>
By using //, you can tell the browser that this is actually a new (full) link, and not a relative one (relative to your current link).
I've created a little function in a React project that could help you:
const getClickableLink = link => {
    return link.startsWith("http://") || link.startsWith("https://")
        ? link
        : `http://${link}`;
};
And you can implement it like this:
const link = "google.com";
<a href={getClickableLink(link)}>{link}</a>
Omitting the protocol by just using // in front of the path is a very bad idea in terms of SEO.
OK, most modern browsers will handle it fine. On the other hand, many robots get into trouble scanning your site: Majestic will not count the flow from those links, and audit tools like SEMrush will not be able to do their jobs.

Get Page url on block types

I am working on generating a report in Episerver in the form of a scheduled job. This report is basically used to get all the content (page types, blocks, media) within Episerver.
There are some good NuGet packages for content usage, but to have more control and to be able to tweak and extend it further, I am creating a custom one rather than using the available 3rd-party packages.
The scheduled job is similar to the one described at
https://www.codeart.dk/blog/2018/12/content-report-generator/.
The article helped me a lot to get my report working, with some modifications for my requirements.
The one thing I am struggling with here is getting the URL of the page where a block is used. Now, this is a point of debate.
I am aware that blocks, being shared in nature, can be used anywhere in a site, but the point is that they are still used in pages, or in other blocks which in turn are used on a page. What I am trying to say is that, directly or indirectly, they are part of a page. So is there a way to get the page URL of a block, irrespective of how many pages it is used in?
Every forum I have looked at only covers the page URL of PageData or MediaData, and has nothing on BlockData. Even the 3rd-party NuGet packages I have looked at do not have page URLs for block types.
I do understand that there is nothing out of the box here. Is there a way to achieve this, i.e., get the page URL of a specific block type, which can be a list of page URLs if the block is used on multiple pages?
Recursive function to reach the page:
private string GetPublicUrl(IContentRepository contentRepository, IContentSoftLinkRepository contentSoftLinkRepository, ContentReference contentReference)
{
    var publicUrl = string.Empty;
    var content = contentRepository.Get<IContent>(contentReference);
    var referencingContentLinks = contentSoftLinkRepository.Load(content.ContentLink, true)
        .Where(link => link.SoftLinkType == ReferenceType.PageLinkReference && !ContentReference.IsNullOrEmpty(link.OwnerContentLink))
        .Select(link => link.OwnerContentLink);
    foreach (var referencingContentLink in referencingContentLinks)
    {
        publicUrl = UrlResolver.Current.GetUrl(referencingContentLink.GetPublicUrl()) ?? GetPublicUrl(contentRepository, contentSoftLinkRepository, referencingContentLink);
    }
    return publicUrl;
}
I have written this recursive function to reach the page, but it works only when there is a single level, for instance a 2Col block on a page.
If I have a block, say a Download block, which is on a 2Col block which in turn is on a page, then the URL is empty.
Any input is appreciated.
With IContentSoftLinkRepository you can find where the blocks are used. You can check whether the SoftLink points to a page with SoftLink.SoftLinkType == PageLinkReference and then use IUrlResolver to get the page's URL.

Typo3 Transform DB bodytext to frontend html

I am trying to import an old TYPO3 v4 site into v10 and I'm using the external_importer extension for the job. Along the way I would like to download the internal files, such as PDFs, and relink them in the bodytext.
The idea would be to transform the saved content to real HTML, check whether the hyperlinks contain relative PDF links, and in that case trigger the download and rebuild the link to the file.
How would I proceed in this case?
I tried the following
$parseObj = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance(ContentObjectRenderer::class);
$html = $parseObj->stdWrap_HTMLparser($htmlStr, []);
DebugUtility::debug($html);
but the hyperlink still remains as <link http://someurl.com>
I had a similar problem. If the solution is the same as mine, you are halfway there. You are missing the reference, i.e., the instructions that tell TYPO3 how to process the text. Here is what worked for me.
TYPO3 render full t3:// links from bodytext in utility files
First, use parseFunc and not stdWrap_HTMLparser. Then use this reference: lib.parseFunc. In the end you should have something like this:
$parseFuncTSPath = 'lib.parseFunc';
$html = $parseObj->parseFunc($htmlStr, [], '< ' . $parseFuncTSPath);
DebugUtility::debug($html);
And since you are using TYPO3 10, I would recommend using DI (dependency injection). You can basically copy and paste the code from the SO answer linked above.
Best regards

Can I download an image if it's in base64 format?

I'm developing a React App, and I have a backend in NodeJS.
In my Mongo schema I have an array that stores multiple strings; these strings are images.
I saved them as base64. Now I want to display them in my app, which works perfectly fine with the src of an img tag, but I want to create a button that allows the user to download those pictures. Is there any solution to this? Can I convert that string back and make it downloadable? Thank you very much for your time, I'm waiting for your ideas!
Note: The examples in the snippets will not work live because Stack Overflow sandboxes snippets without allow-downloads, but they should work on your page.
Depending on your exact use case, you have different options. The easiest one would be using an <a> tag with the download attribute instead of a button, like this:
<a download="myImage.gif" href="data:image/gif;base64,R0lGODdhEAAQAMwAAPj7+FmhUYjNfGuxYYDJdYTIeanOpT+DOTuANXi/bGOrWj6CONzv2sPjv2CmV1unU4zPgISg6DJnJ3ImTh8Mtbs00aNP1CZSGy0YqLEn47RgXW8amasW7XWsmmvX2iuXiwAAAAAEAAQAAAFVyAgjmRpnihqGCkpDQPbGkNUOFk6DZqgHCNGg2T4QAQBoIiRSAwBE4VA4FACKgkB5NGReASFZEmxsQ0whPDi9BiACYQAInXhwOUtgCUQoORFCGt/g4QAIQA7">Download GIF</a>
If you need to keep using a button and you want to trigger the download programmatically, you can create an <a> tag (without displaying it) and trigger a click:
const a = document.createElement('a')
a.download = 'myImage.gif'
a.href = 'data:image/gif;base64,R0lGODdhEAAQAMwAAPj7+FmhUYjNfGuxYYDJdYTIeanOpT+DOTuANXi/bGOrWj6CONzv2sPjv2CmV1unU4zPgISg6DJnJ3ImTh8Mtbs00aNP1CZSGy0YqLEn47RgXW8amasW7XWsmmvX2iuXiwAAAAAEAAQAAAFVyAgjmRpnihqGCkpDQPbGkNUOFk6DZqgHCNGg2T4QAQBoIiRSAwBE4VA4FACKgkB5NGReASFZEmxsQ0whPDi9BiACYQAInXhwOUtgCUQoORFCGt/g4QAIQA7'
a.click()
(If you need to support older browsers, you may have to temporarily insert the tag into the DOM and trigger the click in a setTimeout(..., 0).)
You can also use object URLs, as shown in the sketch below, but it's probably easier to go the data URI route since you already have such a URI.
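For reference, here is a small sketch of that object-URL route: decode the base64 payload into a Blob and download the blob instead of the raw data URI (which can help with very large images, since extremely long data URIs may hit browser limits). It assumes the usual data:<mime>;base64,<payload> shape, and downloadDataUri is just a name made up for this example:
function downloadDataUri(dataUri, filename) {
    const [header, payload] = dataUri.split(',');            // "data:image/gif;base64" and the payload
    const mime = header.match(/data:(.*?);base64/)[1];
    const bytes = atob(payload);                              // base64 -> binary string
    const buffer = new Uint8Array(bytes.length);
    for (let i = 0; i < bytes.length; i++) buffer[i] = bytes.charCodeAt(i);

    const url = URL.createObjectURL(new Blob([buffer], { type: mime }));
    const a = document.createElement('a');
    a.download = filename;
    a.href = url;
    a.click();
    setTimeout(() => URL.revokeObjectURL(url), 10000);        // release the blob after the download starts
}

// Usage: downloadDataUri('data:image/gif;base64,R0lGODdh...', 'myImage.gif');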

HTML Bridge not working with cross-domain Silverlight XAP

I've got a complex Silverlight app that uses the HTML bridge functionality quite extensively (in both directions). The app runs fine when the hosting page is from the same domain as the XAP source. Unfortunately, I can't get the HTML bridge functionality to work when the hosting page is on a different domain.
Now, I know the various tricks normally required to get this to work, i.e., everything that's documented here: http://msdn.microsoft.com/en-us/library/cc645023(VS.95).aspx. I've even put together my own simplified cross-domain repro that I was hoping would highlight the problem, but unfortunately, my "repro" works, i.e., both JS->SL and SL->JS functionality work just fine in it, even if the XAP is hosted on a different domain.
Here's what I've tried so far to narrow down the problem:
On my production solution (where I'm having the problem):
Confirmed that "EnableHtmlAccess" is set to true in the <object> tag.
Confirmed that "ExternalCallersFromCrossDomain" is set to "ScriptableOnly" in the AppManifest.xml file.
On my repro solution (where I can't get it to have the problem):
Added multiple libraries with multiple registered scriptable objects.
Added events to the registered objects.
On both:
Tried it with a static <object> tag and with a dynamically created <object> tag (via Silverlight.js).
Tried it with and without specifying handlers for onSourceDownloadProgressChanged, onSourceDownloadComplete, onError, and onLoad.
Tried it with and without a splashscreen.
I'm kinda running out of ideas. Anyone have any suggestions for other troubleshooting steps?
Well, so far I haven't been able to track down the precise difference between the working and the non-working versions. But I came up with a workaround that's sufficient for my needs. As it turns out, only the JS->SL functionality was broken; any calls from SL->JS still worked. So what I did was to register the scriptable SL objects from within Silverlight. In my controlling JavaScript class, I created a function with a unique name, and registered it with the window object:
var mLoadingController;
var mAppId = 'alantaClient_' + Alanta.makeId();
var mSetLoadingControllerId = mAppId + '_SetLoadingController';
window[mSetLoadingControllerId] = function (value) {
    mLoadingController = value;
    onLoad();
};
And then I pass in the name of the function as a part of the Silverlight app's InitParams:
var initParams = 'setLoadingControllerId=' + mSetLoadingControllerId;
Silverlight.createObject(mSource, mAppHost, mAppId, params, events, initParams);
And then I call that registration function from within Silverlight, like so:
// Do everything necessary to make the LoadingController scriptable.
HtmlPage.RegisterScriptableObject("LoadingController", LoadingController.Instance);
string setLoadingControllerId;
if (e.InitParams.TryGetValue(LoaderConstants.SetLoadingControllerIdReference, out setLoadingControllerId))
{
    HtmlPage.Window.Invoke(setLoadingControllerId, LoadingController.Instance);
}
And then I can call it from JS, like so:
mLoadingController.GoToRoom();
Kinda hacky, but it works. Close enough for now.
