Nutch 1.4 and Solr 3.6 - Nutch not crawling 301/302 redirects

I am having an issue where the initial page is crawled, but the redirect target is not being crawled or indexed.
I have the http.redirect.max property set to 5; I have also tried the values 0, 1, and 3.
<property>
<name>http.redirect.max</name>
<value>5</value>
<description>The maximum number of redirects the fetcher will follow when
trying to fetch a page. If set to negative or 0, fetcher won't immediately
follow redirected URLs, instead it will record them for later fetching.
</description>
</property>
I have also tried clearing out most of what is in regex-urlfilter.txt and crawl-urlfilter.txt. Other than the website being crawled, these are the only other rules in those files:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|PDF|pdf|js|JS|swf|SWF|ashx|css|CSS|wmv|WMV)$
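To sanity-check which URLs those rules reject, here is a minimal Python sketch. It is not Nutch code and only mimics the two "-" skip rules above with a trimmed suffix list, not Nutch's full first-match-wins filter chain:
import re

# Hypothetical, trimmed stand-ins for the two skip rules listed above.
skip_rules = [
    re.compile(r"^(file|ftp|mailto):"),
    re.compile(r"\.(gif|jpg|jpeg|png|ico|css|js|pdf|swf|zip)$", re.IGNORECASE),
]

def passes_filters(url):
    # A URL survives only if none of the "-" rules match it.
    return not any(rule.search(url) for rule in skip_rules)

print(passes_filters("http://example.com/build/"))    # True: nothing rejects it
print(passes_filters("http://example.com/logo.png"))  # False: image suffix rule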
Also, it seems like Nutch is crawling and pushing only pages that have querystring parameters.
Looking at the output:
http://example.com/build Version: 7
Status: 4 (db_redir_temp)
Fetch time: Fri Sep 12 00:32:33 EDT 2014
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 2700 seconds (0 days)
Score: 0.04620983
Signature: null
Metadata: _pst_: temp_moved(13), lastModified=0: http://example.com/build/
There is a default IIS redirect occurring that throws a 302 to add the trailing slash. I have made sure this slash is already present on all pages, so I am unsure why the URL is being redirected.
Just a bit more information, here are some parameters I have tried.
depth=5 (tried 1-10)
threads=30 (tried 1 - 30)
adddays=7 (tried 0, 7)
topN=500 (tried 500, 1000)

Try running Wireshark on the web server to see exactly what is being served, and on the machine Nutch runs on to see what is being requested. If they're on the same server, even better. Capture the traffic and add "http" to the filter box.
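If installing Wireshark is not practical, a lighter-weight check is to request the URL directly and look at the status line and Location header. A minimal Python sketch, with example.com/build standing in for the real URL (http.client never follows redirects, so the 301/302 itself stays visible):
import http.client

conn = http.client.HTTPConnection("example.com")
conn.request("GET", "/build")              # no trailing slash, as in the CrawlDb entry above
resp = conn.getresponse()
print(resp.status, resp.reason)            # expect 301/302 if IIS adds the slash
print(resp.getheader("Location"))          # where the redirect points
conn.close()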

Related

Remove L parameter in request URL

I'm using the Solr extension with TYPO3 9.5.3 and I can't index the pages; I get this error: https://imgur.com/1e6LfIy
Failed to execute Page Indexer Request. Request ID: 5d78d130b8b4d
When I look at the Solr log, I see that TYPO3 adds &L=0 to the request URL, and the pages with &L=0 return a '404 page not found' error:
request url => 'http://example.com/index.php?id=5&L=0' (43 chars)
I added the following code to my TypoScript setup, but that did not work and the request URL still ends with &L=0:
plugin.tx_solr.index.queue.pages.fields.url.typolink.additionalParams >
I'm not sure that's the only reason Solr doesn't index the pages (news records can be indexed without any problem), but first: how can I solve the problem and remove &L=0 from the request URL in Solr?
Can you check your TypoScript to see whether you have a configuration like
config.defaultGetVars.L = 0
or whether other old language settings exist?
I'm not sure, but do you have an older language configuration somewhere in which you define the language parameter?

How to prevent crawling external links with apache nutch?

I want to crawl only specific domains with Nutch. For this I set db.ignore.external.links to true, as suggested in this FAQ link.
The problem is that Nutch only crawls the links in the seed list. For example, if I put "nutch.apache.org" into seed.txt, it only finds that same URL (nutch.apache.org).
I get this result by running the crawl script with a depth of 200. It finishes after one cycle and generates the output below.
How can I solve this problem?
I'm using Apache Nutch 1.11.
Generator: starting at 2016-04-05 22:36:16
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
Best Regards
You want to fetch only pages from a specific domain.
You already tried db.ignore.external.links, but this restricts the crawl to nothing but the seed.txt URLs.
You should try conf/regex-urlfilter.txt, as in the example from the Nutch 1.x tutorial:
+^http://([a-z0-9]*\.)*your.specific.domain.org/
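As a quick way to see what that accept rule keeps, here is a small Python sketch with nutch.apache.org (from the question) substituted for your.specific.domain.org; it only checks the single "+" pattern, not the whole filter file:
import re

# The accept rule above, with nutch.apache.org substituted as an example domain.
accept = re.compile(r"^http://([a-z0-9]*\.)*nutch\.apache\.org/")

for url in ["http://nutch.apache.org/downloads.html",
            "http://wiki.apache.org/nutch/",
            "https://nutch.apache.org/"]:
    print(url, bool(accept.match(url)))
# The last URL fails only because the rule is anchored to "http://";
# add an https variant if the site serves HTTPS.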
Are you using the "crawl" script? If yes, make sure you give a number of rounds greater than 1. If you run something like "bin/crawl seedfoldername crawlDb http://solrIP:solrPort/solr 1", it will crawl only the URLs listed in seed.txt.
And to crawl a specific domain you can use the regex-urlfilter.txt file.
Add the following property in nutch-site.xml:
<property>
<name>db.ignore.external.links</name>
<value>true</value>
<description>If true, outlinks leading from a page to external hosts will be ignored. This is an effective way to limit the crawl to include only initially injected hosts, without creating complex URLFilters. </description>
</property>

WebPageTest complaining about not caching static resources even though I have caching enabled

I am testing my website on webpagetest.org. It gives me a D for caching static assets and then goes on to give this list:
Leverage browser caching of static assets: 63/100
WARNING - (2.0 hours) - http://stats.g.doubleclick.net/dc.js
WARNING - (5.5 days) - http://www.bookmine.net/css/images/ui-bg_highlight-soft_100_eeeeee_1x100.png
WARNING - (5.5 days) - http://www.bookmine.net/favicon.ico
WARNING - (5.5 days) - http://www.bookmine.net/js/index.min.js
WARNING - (5.5 days) - http://www.bookmine.net/js/jquery-ui-1.8.13.custom.min.js
WARNING - (5.5 days) - http://www.bookmine.net/css/index.css
WARNING - (5.5 days) - http://www.bookmine.net/js/jquery.form.min.js
WARNING - (5.5 days) - http://www.bookmine.net/css/jquery-ui-1.8.13.custom.css
The funny thing is that it does recognize I have caching enabled (set to 5.5 days, as reported above), so what is it complaining about? I have also verified that I have default_expiration: "5d 12h" set in my app.yaml, and from this link:
default_expiration
Optional. The length of time a static file served by a static file
handler ought to be cached by web proxies and browsers, if the handler
does not specify its own expiration. The value is a string of numbers
and units, separated by spaces, where units can be d for days, h
for hours, m for minutes, and s for seconds. For example, "4d 5h"
sets cache expiration to 4 days and 5 hours after the file is first
requested. If omitted, the production server sets the expiration to 10
minutes.
For example:
application: myapp
version: alpha-001
runtime: python27
api_version: 1
threadsafe: true
default_expiration: "4d 5h"
handlers:
Important: The expiration time will be sent in the Cache-Control and Expires HTTP response headers, and therefore, the files are likely
to be cached by the user's browser, as well as intermediate caching
proxy servers such as Internet Service Providers. Once a file is
transmitted with a given expiration time, there is generally no way to
clear it out of intermediate caches, even if the user clears their own
browser cache. Re-deploying a new version of the app will not reset
any caches. Therefore, if you ever plan to modify a static file, it
should have a short (less than one hour) expiration time. In most
cases, the default 10-minute expiration time is appropriate.
I even verified the response my website is returning in Fiddler:
HTTP/200 responses are cacheable by default, unless Expires, Pragma,
or Cache-Control headers are present and forbid caching. HTTP/1.0
Expires Header is present: Sat, 26 Sep 2015 08:14:56 GMT
HTTP/1.1 Cache-Control Header is present: public, max-age=475200
public: This response MAY be cached by any cache. max-age: This
resource will expire in 132 hours. [475200 sec]
HTTP/1.1 ETAG Header is present: "74YGeg"
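(For what it's worth, that max-age value does line up with the 5d 12h default_expiration; a quick check:)
seconds = 475200
print(seconds / 3600)   # 132.0 hours
print(seconds / 86400)  # 5.5 days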
So why am I getting a D?
Adding some useful links:
- http://www.learningtechnicalstuff.com/2011/01/static-resources-and-cache-busting-on.html
- http://www.codeproject.com/Articles/203288/Automatic-JS-CSS-versioning-to-update-browser-cach
- https://developers.google.com/web/fundamentals/performance/optimizing-content-efficiency/http-caching#invalidating-and-updating-cached-responses
- https://developers.google.com/speed/docs/insights/LeverageBrowserCaching
- https://stackoverflow.com/a/7671705/147530
- http://www.particletree.com/notebook/automatically-version-your-css-and-javascript-files/
WebPagetest gives a warning if the cache expiration is set for less than 30 days. You can view that detail by clicking on the "D" grade in your test results and viewing the glossary for "Cache Static". You can also find that info here.
If you need to modify a cached static JavaScript file, you can add a version number to the file path or to a querystring.
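One way to do that in a Python app, sketched under the assumption that the file is readable from a local static/ directory (the helper name versioned_url is hypothetical): derive a short content hash and append it as a query parameter, so the URL changes whenever the file changes.
import hashlib

def versioned_url(local_path, public_url):
    # Append a short content hash so browsers see a new URL whenever the file changes.
    with open(local_path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()[:8]
    return "{}?v={}".format(public_url, digest)

# e.g. render this into the template instead of a bare /js/index.min.js reference
print(versioned_url("static/js/index.min.js", "/js/index.min.js"))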

Solr Query Max Condition

I am using Solr 4.3.0 for my web site search. I want to do something with Solr, but when I query, I get an error. In my situation I have 40,000 products, and I want to exclude 1,500 products with the query. This is my query:
-brand-slug:reebok OR -brand-slug:nike AND
-skuCode:(01-117363 01-117364 01-117552 01-119131 01-119166 01-1J622 01-1J793 01-1M4434 01-1M9691 01-1Q279 01-1T405 01-1T865 01-2109830 01-2111116 01-2111186 01-21J625 01-21J794 01-21V019 01-2M9691 01-2M9696 01-33J793 01-519075 01-M4431 01-M7652 01-M9160 01-M9165 01-M9166 01-M9613 01-M9622 01-M9697 01200CY0001N00 01211SU0141M00 01212KU0009N00 01212KU0010N00 01212KU0025N00 01212KU0027N00 01212KU0038N00 01212KW0019N00 01212KW0020N00
....thousands of skuCodes)
If I put 670 skuCodes in there it works fine, but with 1500 skuCodes I get an error like:
Solr HTTP error: OK (400)
How could I solve this problem? Thanks
What a night :) I solved my problem. Actually there were two problems in my system. The first problem was in my Tomcat server: I increased the allowed request size by changing maxHttpHeaderSize="65536". (You could change your web server's buffer size instead; I also changed my nginx conf.) The other problem was in the Solr config: I got an error like 'too many boolean clauses'. If you get this error, you can raise maxBooleanClauses in solrconfig.xml. After restarting my Tomcat server everything was OK.
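A complementary approach that keeps the long exclusion list out of the request URL and headers entirely is to send the parameters in a POST body rather than the query string. A minimal Python sketch, assuming a local Solr core named products and using a short sample of the SKU codes from the question (note that maxBooleanClauses still limits how many terms can go inside the parentheses):
import urllib.parse
import urllib.request

# Hypothetical values: a local Solr core and a short sample of the SKU codes to exclude.
solr_select = "http://localhost:8983/solr/products/select"
sku_codes = ["01-117363", "01-117364", "01-119131"]  # ...up to ~1500 in practice

params = {
    "q": "*:*",
    "fq": "-skuCode:({})".format(" ".join(sku_codes)),  # exclusion filter query
    "wt": "json",
}

# POSTing the form-encoded parameters avoids the container's URL/header size limit.
data = urllib.parse.urlencode(params).encode("utf-8")
with urllib.request.urlopen(urllib.request.Request(solr_select, data=data)) as resp:
    print(resp.status)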

Timeout on bigquery v2 from GAE

I am doing a query to BigQuery from my app in Google App Engine, and sometimes I receive a weird result from BQ (discovery#restDescription). It took me some time to understand that the problem occurs only when the amount of data I am querying is high, which somehow makes my query time out within 10 seconds.
I found a good description of my problem here:
Bad response to a BigQuery query
After reading the GAE docs again, I found out that HTTP requests should be handled within a few seconds. So I guess, and this is only a guess, that BigQuery might also be limiting itself in the same way, and therefore has to respond to my queries "within seconds".
If this is the case, first of all, I will be a bit surprised, because my BigQuery requests are certainly going to take more than a few seconds... But anyway, I did a test by forcing a timeout of 1 second on my query, and then getting the query result by polling the getQueryResults API call.
The outcome is very interesting. BigQuery returns something within roughly 3 seconds (not 1 as I asked), and then I get my results later on, within 26 seconds, by polling. This looks like a way of circumventing the 10-second timeout issue.
But I can hardly see myself doing this trick in production.
Has anyone encountered the same problem with BigQuery? What am I supposed to do when the query takes more than "a few seconds"?
Here is the code I use to query:
query_config = {
    'timeoutMs': 1000,
    "defaultDataset": {
        "datasetId": self.dataset,
        "projectId": self.project_id
    },
}
query_config.update(params)
result_json = (self.service.jobs()
               .query(projectId=project,
                      body=query_config)
               .execute())
And to retrieve the results, I poll with this:
self.service.jobs().getQueryResults(projectId=project,jobId=jobId).execute()
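For reference, that polling call can be wrapped in a simple loop that waits until the response reports jobComplete; a minimal sketch along the lines of the code above (the sleep interval and retry cap are arbitrary):
import time

def wait_for_results(service, project, job_id, interval=5, max_tries=60):
    # Poll getQueryResults until BigQuery marks the job as complete.
    for _ in range(max_tries):
        result = (service.jobs()
                  .getQueryResults(projectId=project, jobId=job_id)
                  .execute())
        if result.get('jobComplete'):
            return result
        time.sleep(interval)
    raise RuntimeError('query did not complete after polling')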
And these are the logs of what happens when calling BigQuery:
2012-12-03 12:31:19.835 /api/xxxxx/ 200 4278ms 0kb Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1
xx.xx.xx.xx - - [03/Dec/2012:02:31:19 -0800] "GET /api/xxxxx/ HTTP/1.1" 200 243 ....... ms=4278 cpu_ms=346 cpm_usd=0.000426 instance=00c61b117c1169753678c6d5dac736b223809b
I 2012-12-03 12:31:16.060
URL being requested: https://www.googleapis.com/discovery/v1/apis/bigquery/v2/rest?userIp=xx.xx.xx.xx
I 2012-12-03 12:31:16.061
Attempting refresh to obtain initial access_token
I 2012-12-03 12:31:16.252
URL being requested: https://www.googleapis.com/bigquery/v2/projects/xxxxxxxxxxxx/queries?alt=json
I 2012-12-03 12:31:19.426
URL being requested: https://www.googleapis.com/bigquery/v2/projects/xxxxxxxx/jobs/job_a1e74a6769f74cb997d998623b1b6b2e?alt=json
I 2012-12-03 12:31:19.500
This is what my query API call returns. In the metadata, the status is 'RUNNING':
{u'kind': u'bigquery#queryResponse', u'jobComplete': False, u'jobReference': {u'projectId': u'xxxxxxxxxxx', u'jobId': u'job_a1e74a6769f74cb997d998623b1b6b2e'}}
With the jobId I am able to retrieve the results 26 seconds later, when they are ready.
There must be another way! What am I doing wrong?
