Scrapy: continue process result from parse function

I am trying to parse page A, download the files listed on the page to local disk, replace the URLs in page A with the paths of the files I saved, and finally save page A to local disk.
I tried the files pipeline but could not get it to work. The URLs in page A look like http:...php?id=1234, so the built-in file_path() returns an error. Overriding file_path() just stops the pipeline from working, without any debug output.
So I found this post:
Answer I referred to
After applying it, I found that the callback does not change the data I passed in meta. My code looks like this:
def ParseClientCaseNote(self,response):
    # The function is to download all attachments and replace URL inside pointing to local files
    TestMeta='this is to test meta argu'
    for a in AttachmentList:
        yield scrapy.Request(a,callback=self.DownClientCaseNoteAttach,meta={'test':TestMeta})
        self.logger.info('ParseClientCaseNote: after call DownClientCaseNoteAttach, testmeta is: ' + TestMeta)
    return

def DownClientCaseNoteAttach(self,response):
    TestArg=response.meta['test']
    self.logger.info('DownClientCaseNoteAttach: test meta')
    self.logger.info(TestArg)
    TestArg='this is revised from DownClientCaseNoteAttach'
    with open(AbsPath,'wb') as f:
        f.write(response.body)
    return
I got the following result in the log:
2018-09-29 09:26:13 [debug] INFO: ParseClientCaseNote: after call DownClientCaseNoteAttach, testmeta is: this is to test meta argu
2018-09-29 09:26:17 [debug] INFO: DownClientCaseNoteAttach: test meta
2018-09-29 09:26:17 [debug] INFO: this is to test meta argu
It seems the callback is deferred: it runs only after ParseClientCaseNote has already logged the value. How can I get the revised value back correctly?
Thanks
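A note on the log above: two separate things appear to be going on. Scrapy schedules each yielded Request and runs its callback later, which is why the ParseClientCaseNote message is logged first. In addition, assigning a new string to TestArg inside the callback only rebinds a local name, so nothing stored in meta changes. The sketch below imitates this in plain Python without Scrapy; callback, meta and result are stand-in names for illustration, not Scrapy API, and it assumes that a mutable object placed in meta reaches the callback by reference, as an ordinary dict value does.

```python
def callback(meta):
    """Stand-in for a Scrapy callback that reads response.meta."""
    test_arg = meta["test"]
    # Rebinding the local name: the string stored in meta is unchanged.
    test_arg = "this is revised from the callback"
    # Mutating a shared object: visible to everyone holding a reference to it.
    meta["result"]["status"] = "this is revised from the callback"
    return test_arg


test_meta = "this is to test meta argu"
result = {"status": "not downloaded yet"}      # hypothetical shared state
meta = {"test": test_meta, "result": result}

# In Scrapy this call would happen later, once the download has finished.
callback(meta)
```

After the call, test_meta and meta["test"] still hold the original string, while result["status"] carries the value written by the callback. Even with a shared object, the parse function cannot read the result right after the yield, because the callback has not run yet at that point.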

I used a workaround to address this. On page A I read each file's name as shown on the web page, pass that name to my own download function, and change the URL in the page to point to a local file with that name.
In the download function I check the file name from response.headers['Content-Disposition'].decode(response.headers.encoding) to make sure it matches the one I found on page A before saving the file.
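The check described above could be sketched roughly as follows. This is a minimal illustration using only the standard library; filename_from_content_disposition and matches_expected are hypothetical helper names, and the header value is assumed to arrive either as bytes (as Scrapy headers do) or as an already decoded string.

```python
import os
from email.message import Message


def filename_from_content_disposition(raw_header, encoding="utf-8"):
    """Return the file name announced in a Content-Disposition header, or None."""
    if raw_header is None:
        return None
    if isinstance(raw_header, bytes):
        raw_header = raw_header.decode(encoding)
    msg = Message()
    msg["Content-Disposition"] = raw_header
    name = msg.get_filename()
    # Keep only the last path component so the name cannot point outside
    # the directory the file is saved to.
    return os.path.basename(name) if name else None


def matches_expected(raw_header, expected_name):
    """True if the server-announced name equals the name found on page A."""
    return filename_from_content_disposition(raw_header) == expected_name
```

In the download callback, the expected name would be the one passed along from page A, and the file would only be written when matches_expected returns True.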

Related

Show full URL in the squid access.log log

I'm running a Squid proxy and want to see the full requested URL in the /var/log/squid/access.log log.
The only thing that is shown is:
1607565964.095 64 XXX.XXX.126.82 TCP_TUNNEL/200 13744 CONNECT www.mediamarkt.de:443 XXX HIER_DIRECT/XXX.XXX.208.234 -
I requested a full link from mediamarkt, not only the main homepage.
So is there a way to see the full requested link?
For requests against my own server it works:
1607565951.279 0 XXX.XXX.122.16 NONE/400 3947 GET /boaform/admin/formLogin?username=XXXXX&psd=XXXXX - HIER_NONE/- text/html

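To log the full requested URL for HTTPS traffic, I need to use ssl_bump. Without it, the client only sends Squid a CONNECT request for host:port, and the rest of the URL travels inside the encrypted tunnel where Squid cannot see it.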
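To make the difference between the two log lines concrete, here is a small sketch that splits an access.log line into fields and flags CONNECT tunnels. It assumes the default native Squid log format (timestamp, elapsed time, client, result code, size, method, URL, ...); AccessLogEntry, parse_access_log_line and is_opaque_tunnel are names made up for this example.

```python
from typing import NamedTuple


class AccessLogEntry(NamedTuple):
    timestamp: float
    elapsed_ms: int
    client: str
    result_code: str
    size: int
    method: str
    url: str


def parse_access_log_line(line):
    """Split one access.log line (native format assumed) into named fields."""
    fields = line.split()
    return AccessLogEntry(
        timestamp=float(fields[0]),
        elapsed_ms=int(fields[1]),
        client=fields[2],
        result_code=fields[3],
        size=int(fields[4]),
        method=fields[5],
        url=fields[6],
    )


def is_opaque_tunnel(entry):
    """CONNECT entries only carry host:port; the real URL is encrypted."""
    return entry.method == "CONNECT"
```

For the first line in the question, the url field is just www.mediamarkt.de:443 and is_opaque_tunnel returns True; for the plain HTTP request, the url field contains the full path and query string.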

Apache Camel route with no "to" endpoint

I am using Apache Camel to assist with capturing message data emitted by a third party software package. In this particular instance, I only need to capture what is produced by the software, there is no receiver on the other end (really no "end" to go to).
So, I tried to set up a route with just the "from" endpoint and no "to" endpoint. Apparently this is incorrect usage as I received the following exception:
[2018-08-15 11:08:03.205] ERROR: string.Launcher:191 - Exception
org.apache.camel.FailedToCreateRouteException: Failed to create route route1 at: >>> From[mina:udp://localhost:9877?sync=false] <<< in route: Route(route1)[[From[mina:udp://localhost:9877?sync=false]] -... because of Route route1 has no output processors. You need to add outputs to the route such as to("log:foo").
at org.apache.camel.model.RouteDefinition.addRoutes(RouteDefinition.java:1063)
at org.apache.camel.model.RouteDefinition.addRoutes(RouteDefinition.java:196)
at org.apache.camel.impl.DefaultCamelContext.startRoute(DefaultCamelContext.java:974)
at org.apache.camel.impl.DefaultCamelContext.startRouteDefinitions(DefaultCamelContext.java:3301)
at org.apache.camel.impl.DefaultCamelContext.doStartCamel(DefaultCamelContext.java:3024)
at org.apache.camel.impl.DefaultCamelContext.access$000(DefaultCamelContext.java:175)
at org.apache.camel.impl.DefaultCamelContext$2.call(DefaultCamelContext.java:2854)
at org.apache.camel.impl.DefaultCamelContext$2.call(DefaultCamelContext.java:2850)
at org.apache.camel.impl.DefaultCamelContext.doWithDefinedClassLoader(DefaultCamelContext.java:2873)
at org.apache.camel.impl.DefaultCamelContext.doStart(DefaultCamelContext.java:2850)
at org.apache.camel.support.ServiceSupport.start(ServiceSupport.java:61)
at org.apache.camel.impl.DefaultCamelContext.start(DefaultCamelContext.java:2819)
at {removed}.Launcher.startCamel(Launcher.java:189)
at {removed}.Launcher.main(Launcher.java:125)
Caused by: java.lang.IllegalArgumentException: Route route1 has no output processors. You need to add outputs to the route such as to("log:foo").
at org.apache.camel.model.RouteDefinition.addRoutes(RouteDefinition.java:1061)
... 13 more
How do I set up a camel route that allows me to intercept (capture) the message traffic coming from the source, and not send it "to" anything? There is no need for a receiver. What would be an appropriate "to" endpoint that just drops everything it receives?
The exception suggests to("log:foo"). What does this do?

You can see if the Stub component can help
http://camel.apache.org/stub.html
Example:
from("...")
.to("stub:nowhere");

The exception suggestion of to("log:foo"). What does this do?
It sends the route's messages to an endpoint of the log component (http://camel.apache.org/log.html), which dumps the message contents (body and/or headers and/or properties) to your log file under the given log category.
If you just want to drop everything received, it's a good choice:
to("log:com.company.camel.sample?level=TRACE&showAll=true&multiline=true")

Apparently, if you're on Linux or Unix, you can also redirect to /dev/null as in this example:
to( "file:/dev?fileName=null")
I don't think this can be used on Windows, but I am not sure.
Note that the syntax to( "file:/dev/null") does not work, because it points to a directory called null; with the fileName option it works.

REACT PDF: Cannot use the packages to show static PDF file

I am trying to show a static pdf in React app. I have tried a lot of packages:
react-pdf
react-pdf-js
react-pdf-js-infinite
simple-react-pdf
pdfjs-dist
react-pdf-pages
Their docs often say that we can easily pass a URL or a PDF file as the prop for the PDF component, but I cannot get either to work.
I had two main errors.
First, since I want to pass myPDF as the prop for the component, I write this:
import myPDF from 'path/to/pdf_file';
then, render_some_component pdf:{myPDF}
Here is the error:
ModuleParseError in
Module parse failed: Unexpected token (1:0)
You may need an appropriate loader to handle this file type.
(Source code omitted for this binary file)
(When I comment that line, this kind of error disappears)
I used file-loader in the webpack config and tried many different ways, but all of them failed.
Second, I pass the path to the PDF file directly as the prop, like this:
render_some_component pdf:{'path/to/pdf_file'}
In the Console:
Warning: Setting up fake worker.
11:23:55.962 pdf.worker.js:349 Warning: Ignoring invalid character "33" in hex string
11:23:55.963 pdf.worker.js:349 Warning: Ignoring invalid character "79" in hex string
...
There are a lot of 'Ignoring invalid character' warnings like that, and it always ends with:
localhost/:1 Uncaught (in promise) InvalidPDFException {name: "InvalidPDFException", message: "Invalid PDF structure"}
In the Network, Headers, I see:
Request URL:http://localhost:3000/myPdfFile.pdf
Request Method:GET
Status Code:200 OK
Remote Address:127.0.0.1:3000
but in the Network tab, under Response, I see just the HTML layout.
I think the PDF file is loaded correctly but the package cannot recognize its PDF structure.
Besides these two main errors, I had another error related to the Worker used in the packages, but I don't know how to fix it:
Uncaught DOMException: Failed to construct 'Worker'
(This seems to be related to Chrome; people say Chrome does not allow a Worker on a local server.)
Any help is highly appreciated, as I have been stuck on this for 4 days already.

Can you please clarify what your main task is?
If I understood it right, you want to display a PDF file that already exists in some part of your application, and you don't want to create a new PDF with JavaScript.
If you just want to show a PDF, have you tried using an iframe?
Something like this:
<iframe
    title="file"
    style={{ width: '100%', height: '100%' }}
    src={downloadURL}
/>
Here you can also use a path relative to the location of your component, or a full URL to the file.

Non-interactive auto-refresh stale OAuth Token with Googlesheets package

I'm trying to automatically run an R script to download a private Google Sheet every hour. It always works fine when I'm using R interactively. It also works fine during the first hour after I automate the script with launchd.
It stops working an hour after I start automating it with launchd. I think the problem is that after one hour the access token changes, and the non-interactive version isn’t waiting for the auto refreshing of the OAuth token. Here is the error that I get from the error report:
Auto-refreshing stale OAuth token.
Error in gzfile(file, mode) : cannot open the connection
Calls: gs_auth ... -> -> cache_token -> saveRDS -> gzfile
In addition: Warning message:
In gzfile(file, mode) :
cannot open compressed file '.httr-oauth', probable reason 'Permission denied'
Execution halted
I'm using Jenny Bryan's googlesheets package. Here is the code that I initially use to register the sheet, and then save the oAuth token:
gToken <- gs_auth() # Run this the first time to get the oAuth information
saveRDS(gToken, "/Users/…/gToken.rds") # Save the oAuth information for non-interactive use
I then use the following script in the file that I automate with launchd:
gs_auth(token = "/Users/…/gToken.rds")
How can I avoid this error when running the script automatically with launchd?

I don't know about launchd, but I had the same problem when I wanted to run an R script automatically from the Windows Task Scheduler. Changing the 'cache' argument to FALSE did the trick for me (screenshot: https://i.stack.imgur.com/pprlC.png).
You can find the solution here: https://github.com/jennybc/googlesheets/issues/262
To authenticate once in the browser and get a token file, I did this:
token_file <- gs_auth(new_user = TRUE, cache = FALSE)
saveRDS(token_file, "googlesheets_token.rds")
Automatic login afterwards via:
gs_auth(token = paste0(path_scripts, "googlesheets_token.rds"),
        verbose = TRUE, cache = FALSE)

URL returned non 200 response code

I am indexing documents using the SolrPhpClient library. When I make a POST request to Solr using the extract function, Solr replies with:
"URL http://localhost/moodledemo/pluginfile.php/99/course/overviewfiles/pre bio- data.docx returned non 200 response code".
This happens only if the file name includes a space. If the file name has no spaces, it works fine.
I can't work out why it returns a non-200 response for files that have spaces in their names, when the same path works in a browser.

Using rawurlencode() solved the issue.
Maybe the Jetty server was responding with a non-200 response code because of the raw file name:
How are you.doc
After rawurlencode():
How%20are%20you.doc
Thanks :)
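For comparison, the same percent-encoding can be reproduced in Python with urllib.parse.quote, which with safe="" behaves much like PHP's rawurlencode() for a single file name. The helper encode_url_path is a hypothetical name for this sketch; it assumes the URL path has not been encoded already, since an existing % would be encoded a second time.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_file_name(name):
    """Percent-encode a single file name (spaces become %20)."""
    return quote(name, safe="")


def encode_url_path(url):
    """Percent-encode only the path of a URL, keeping the slashes."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=quote(parts.path)))


print(encode_file_name("How are you.doc"))
print(encode_url_path(
    "http://localhost/moodledemo/pluginfile.php/99/course/overviewfiles/pre bio- data.docx"
))
```

Browsers do this encoding silently before sending the request, which would explain why the same path works there while the raw URL passed to Solr does not.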
