Allowing a host name with a port number in Nutch - solr

I am using Nutch to crawl data and index it into Solr. In Nutch, I have added abc.com:85 to domain-urlfilter.txt and +^http://abc\.com\:85 to regex-urlfilter.txt.
The problem is that Nutch is not indexing any data and reports this message: Total number of urls rejected by filters:1
The port number is required in the URL, and the configuration above is meant to allow it.
Could you please let me know how to make Nutch work with the port number :85 in the URL?

The problem is the syntax: +^http://abc\.com\:85 is not correct. Please check the syntax here: Nutch regex-urlfilter syntax
Hope this helps,
Le Quoc Do

Related

solr 8.11 Field Types docs contradiction. Any guidance?

I'm setting up my first Solr server via Docker using solr:8.11.1-slim. I am going to use the Schema API to set up the schema for my core, which is named 'products'.
While reading the docs, I found what looks like contradictory information about field types:
https://solr.apache.org/guide/8_11/field-types-included-with-solr.html
vs.
https://solr.apache.org/guide/8_11/schema-api.html
I followed the first guide to see which field types I can specify, and I am sending requests based on the second doc, such as this:
{ 'add-field': { "name":"latlong", "type":"LatLongPointSpatialField", "multiValued":False, "stored":True, 'indexed': True } },
but Solr gives me back errors such as:
org.apache.solr.api.ApiBag$ExceptionWithErrObject: error processing commands, errors: [{add-field={name=latlong, type=LatLongPointSpatialField, multiValued=false, stored=true, indexed=true}, errorMessages=[Field 'latlong': Field type 'LatLongPointSpatialField' not found
So what gives? Am I misreading the docs or are they wrong or is something wrong with the solr 8.11.1 image in docker? Why does it not accept the field types I'm providing?
Thanks for your help ahead of time.

Remove L parameter in request URL

I'm using the Solr extension with TYPO3 9.5.3 and I can't index the pages. I get this error: https://imgur.com/1e6LfIy
Failed to execute Page Indexer Request. Request ID: 5d78d130b8b4d
When I look at the Solr log, I see that TYPO3 adds &L=0 to the request URL, and pages requested with &L=0 return a '404 page not found' error:
request url => 'http://example.com/index.php?id=5&L=0' (43 chars)
I added the following code to my TS setup, but that did not work; the request URL still ends with &L=0
plugin.tx_solr.index.queue.pages.fields.url.typolink.additionalParams >
I'm not sure that's the only reason Solr doesn't index the pages (news can be indexed without any problem), but first, how can I remove &L=0 from the request URL in Solr?
Can you check your TypoScript for a configuration like
config.defaultGetVars.L = 0
or whether other old language settings exist?
I'm not sure, but do you have an older language configuration where you define the language parameter?

Solr Query Max Condition

I am using Solr 4.3.0 for my website search. When I run a particular query, I get an error. I have 40000 products, and I want to exclude 1500 of them in the query. This is my query:
-brand-slug:reebok OR -brand-slug:nike AND
-skuCode:(01-117363 01-117364 01-117552 01-119131 01-119166 01-1J622 01-1J793 01-1M4434 01-1M9691 01-1Q279 01-1T405 01-1T865 01-2109830 01-2111116 01-2111186 01-21J625 01-21J794 01-21V019 01-2M9691 01-2M9696 01-33J793 01-519075 01-M4431 01-M7652 01-M9160 01-M9165 01-M9166 01-M9613 01-M9622 01-M9697 01200CY0001N00 01211SU0141M00 01212KU0009N00 01212KU0010N00 01212KU0025N00 01212KU0027N00 01212KU0038N00 01212KW0019N00 01212KW0020N00
....thousands of skuCodes)
If I put 670 skuCodes in there, it works fine, but with 1500 skuCodes I get an error like
Solr HTTP error: OK (400)
How could I solve this problem? Thanks
What a night :) I solved my problem. There were actually two problems in my system. The first was in my Tomcat server: I increased the maximum request size by setting maxHttpHeaderSize="65536". (You may also need to change your web server's buffer size; I changed my nginx conf.) The other problem was in the Solr config: I got an error like 'too many boolean clauses'. If you get this error, you can raise maxBooleanClauses in solrconfig.xml. After restarting my Tomcat server, everything was OK.
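As a rough illustration of why 670 skuCodes could pass while 1500 fail, here is a minimal Python sketch that builds the exclusion clause and compares it against two limits. The limit values are assumptions (8192 bytes is the commonly cited Tomcat default for maxHttpHeaderSize, and 1024 the usual default for maxBooleanClauses); check your own server.xml and solrconfig.xml. The function names are hypothetical, and counting one boolean clause per skuCode is a simplification.

```python
from urllib.parse import urlencode

# Assumed defaults, not read from any real server; verify them in your setup.
ASSUMED_MAX_HEADER_BYTES = 8192      # Tomcat maxHttpHeaderSize (assumption)
ASSUMED_MAX_BOOLEAN_CLAUSES = 1024   # Solr maxBooleanClauses (assumption)


def build_exclusion_query(sku_codes):
    """Build a negative clause in the same shape as the query in the question."""
    return "-skuCode:(" + " ".join(sku_codes) + ")"


def check_query(sku_codes,
                max_header_bytes=ASSUMED_MAX_HEADER_BYTES,
                max_clauses=ASSUMED_MAX_BOOLEAN_CLAUSES):
    """Report whether the query is likely to exceed either limit."""
    query = build_exclusion_query(sku_codes)
    # URL-encode the query the way it would appear in a GET request.
    encoded = urlencode({"q": query})
    return {
        "clauses": len(sku_codes),
        "encoded_bytes": len(encoded),
        "too_many_clauses": len(sku_codes) > max_clauses,
        "too_long_for_get": len(encoded) > max_header_bytes,
    }


if __name__ == "__main__":
    # Made-up skuCodes, used only to get a realistic query length.
    codes = ["01-%06d" % i for i in range(1500)]
    print(check_query(codes))
```

With 1500 made-up codes of this length, both checks come back true, which matches the two separate errors described above: the request is rejected for its size first, and once that is fixed Solr complains about the number of boolean clauses.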

Nutch didn't crawl all URLs from the seed.txt

I am new to Nutch and Solr. I would like to crawl a website whose content is
generated by ASP. Since the content is not static, I created a seed.txt which
contains all the URLs I would like to crawl. For example:
http://us.abc.com/product/10001
http://us.abc.com/product/10002
http://jp.abc.com/product/10001
http://jp.abc.com/product/10002
...
The regex-urlfilter.txt has this filter:
# accept anything else
#+.
+^http://([a-z0-9]*\.)*abc.com/
I used this command to start the crawling:
/bin/nutch crawl urls -solr http://abc.com:8983/solr/ -dir crawl -depth 10 -topN 10
The seed.txt contains 40,000+ URLs. However, I found that the content of many of the URLs
cannot be found in Solr.
Question:
Is this approach workable for a large seed.txt?
How can I check whether a URL was crawled?
Does seed.txt have a size limitation?
Thank you !
Check out the property db.max.outlinks.per.page in the nutch configuration files.
The default value for this property is 100, and hence only 100 URLs will be picked up from seed.txt and the rest will be skipped.
Change this value to a higher number to have all the urls scanned and indexed.
topN indicates how many of the generated links should be fetched. You could have 100 links which have been generated, but if you set topN to 12, then only 12 of those links will get fetched, parsed and indexed.
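To put numbers on the topN point using the command from the question: each crawl round fetches at most topN URLs, and -depth sets the number of rounds, so -depth 10 -topN 10 can fetch at most 100 pages no matter how many seeds were injected. A minimal Python sketch of that arithmetic (the function name is hypothetical and not part of Nutch):

```python
def max_pages_fetched(depth, top_n, seed_count):
    """Upper bound on pages fetched by a crawl with the given settings.

    Each of the `depth` rounds fetches at most `top_n` URLs, and a crawl
    that never follows outlinks cannot fetch more pages than were seeded.
    """
    return min(depth * top_n, seed_count)


if __name__ == "__main__":
    # Values taken from the question: -depth 10 -topN 10, about 40,000 seeds.
    print(max_pages_fetched(depth=10, top_n=10, seed_count=40000))
```

This is only an upper bound: later rounds may also spend part of their topN budget on newly discovered outlinks instead of the remaining seed URLs.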

Map static field between nutch and solr

I use Nutch 1.4 and I would like to map a static field to Solr.
I know there is the index-static plugin. I configured it in nutch-site.xml like this :
<property>
<name>index-static</name>
<value>field:value</value>
</property>
However, the value is not sent to Solr.
Does anyone have a solution ?
It looks like the entry in nutch-default.xml is wrong.
According to the plugin source, "index.static" rather than "index-static" is the right name for the property.
String fieldsString = conf.get("index.static", null);
After using that in my nutch-site.xml I was able to send multiple fields to my solr server.
Also make sure that the plugin is added to the list of included plugins in the "plugin.includes" property.

Resources