Nutch - Crawl a page for links, but don't index - solr

I would like to point the crawler at a site's index.html page to get it started, but I don't want index.html itself to appear in my search results; I only want the child pages to appear. Is there a way to exclude specific pages?

Just as an update: the only solution I found to my problem was to remove those URLs from the collection after it was crawled, as a secondary step.
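That secondary cleanup step can be done with a Solr delete-by-query against the update handler. A minimal sketch, assuming a core named `nutch` and that the page URL is stored in a field named `url` (both names are assumptions; adjust to your schema):

```shell
# Assumptions: Solr core is "nutch" and the page URL is indexed in a "url"
# field - adjust both to your install. Deletes the seed page, then commits.
curl "http://localhost:8983/solr/nutch/update?commit=true" \
  -H "Content-Type: text/xml" \
  --data-binary '<delete><query>url:"http://www.example.com/index.html"</query></delete>'
```

This leaves the child pages discovered from index.html in the index while dropping the seed page itself.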

Related

Drupal 9 - Two different URLs work for the same page

I have a Drupal 9.x site whose page URLs look like http://www.example.com/general,
but we need that URL to also work as http://www.example.com/something/general.
In other words, every page of my site should work at both URLs, with or without a prefix word before the page name.
Reason: I need my website's pages to load inside another website's pages. That is, I am planning to display the Drupal 9.x content in the other website via an iframe. When a URL is called with the prefix word, the header, footer, and sidebar of the page will be hidden.
Without the prefix word, the header, footer, and sidebar will be displayed normally.
Both URLs should work.
I have about 550+ content pages, and many of them come from custom modules and views.
I tried URL aliases, but that approach would mean adding aliases for all 550+ content pages, which will not work for me.
Is there any other way I can achieve this with minimal effort? Any suggestions would be very helpful.

Crawling websites with Nutch 2.3.1 skips product links but crawls other links

So, I am trying to crawl men's shoes from jabong.com.
My seed url is:
http://www.jabong.com/men/shoes/
I am making sure Nutch does not skip ? and = by using this rule in regex-urlfilter.txt:
-[*!#]
This is my plugin.includes in nutch-site.xml:
protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-solr
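For reference, in nutch-site.xml that value sits inside a `plugin.includes` property element (the property name below is the standard Nutch one; the value is copied from above):

```xml
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-solr</value>
</property>
```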
It crawls links like the following and I can search them in solr:
http://www.jabong.com/men/shoes/andrew-hill/
http://www.jabong.com/men/shoes/?sh_size=40
http://www.jabong.com/all-products/?promotion=app-10-promo&cmpgp=takeover5
But it is not crawling the product pages I actually want. The product links look like:
http://www.jabong.com/Alberto-Torresi-Black-Sandals-2024892.html?pos=2
http://www.jabong.com/Clarks-Un-Walk-Brown-Formal-Shoes-874785.html?pos=11
This is weird because these links are on the same page as the seed URL, but they are not getting crawled. I did a wget to fetch the page and saw the links are there, so no JavaScript is involved.
What am I doing wrong?
Make sure your page navigation doesn't depend on cookies. Try dumping the crawlDB and the segments, and check whether the expected URLs have been fetched or not. If they have, check what content was fetched from those URLs.
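The dump the answer suggests can be done with Nutch's command-line tools. A sketch, assuming your crawl data lives under `crawl/` and using a placeholder segment name (both are assumptions; substitute your own paths):

```shell
# Dump the crawl database to readable text files
bin/nutch readdb crawl/crawldb -dump crawldb_dump

# Dump one fetched segment (replace the timestamp with a real segment name)
bin/nutch readseg -dump crawl/segments/20160101000000 segment_dump

# Then search the dumps for one of the missing product URLs
grep -r "Alberto-Torresi" crawldb_dump segment_dump
```

If the product URLs appear in the crawlDB but their fetched content is empty or a redirect, the problem is on the fetch side (e.g. cookies) rather than in the URL filters.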

How do I fix a 1/2 broken DNN path?

Somehow my DNN install has a broken path: a particular URL serves pages, but delete and the advanced options don't work (they deliver a 404 error).
The TabUrls table has no entries for this tab.
If I change the url through the tabs table then I'm able to delete the page without issue.
I tried accessing the page options via tabid, but the tabid gets converted to the friendly name and then 404s.
I tried turning off friendly URLs in my web.config, but I may have done it wrong, since the entire site would not load (yellow screen of death).
I'm wondering where DNN is storing this path that is breaking the advanced options of whatever page is at the path.
How do I fix this url so it displays page options and lets me delete pages?
The first thing I would try is the Admin/Page Management screen: can you make the changes you need via that interface? If so, after making the changes, are you able to access the page and all of its features/options correctly?
If that doesn't work, check the TabPath column in the TABS table to see if there are any bad paths in there for specific pages that you are having problems with.
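To inspect the paths the answer mentions, a query along these lines can help (assuming the standard DNN schema, where TabPath values start with `//` and use `//` as the level separator):

```sql
-- List tabs whose stored path looks malformed; in a healthy DNN
-- database every TabPath begins with // (e.g. //Home//SubPage).
SELECT TabID, TabName, TabPath
FROM Tabs
WHERE TabPath NOT LIKE '//%';
```

Any row this returns is a candidate for the broken path that is 404ing the advanced options.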

Why is my angularjs site not completely crawlable?

I have created my first AngularJS website. I have set up pushState (HTML5 mode), added the fragment meta tag, created a sitemap in Google, and tested the "google fetch" functionality. After a few days, my website is still not completely indexed by Google: only 1 URL is indexed instead of 4 (my sitemap contains 4 URLs). My website is Tom IT. The main page is indexed, but a subpage that is also in the sitemap (you can find my sitemap at sitemap.xml in the root of my domain tom-it.be) does not appear in search results. I also added a robots.txt.
Google's crawlers can parse pages generated by a SPA and show them in the SERPs, but not immediately; it may take several days. In my experience, AngularJS sites can take about 3 days and EmberJS sites about 7 days.
If you want your website to be crawled completely, the important information should be present in the HTML, or you should use other techniques: for example, preparing separate pages for crawlers, server-side pre-rendering, or PhantomJS.

Apache Nutch Crawl Dynamic Products

Currently we are using Apache Solr as the search engine and Apache Nutch as the crawler. Now we have created a site that contains products which are generated dynamically.
The current setup searches within the content field, so whenever we search for a dynamic product, it does not come up in the search results.
Can you please guide me on how to crawl and index dynamic products on a page into Apache Solr? Can we do this using sitemap.xml? If yes, how?
Thanks!
One possible solution is this:
Step 1) Put the description of each dynamic product on its own page, e.g. http://domain/product?id=xxx (or a friendlier URL such as http://domain/product-x).
Step 2) You need a page or several pages that list the URLs of these products. The sitemap.xml you mentioned is one choice, but a simple HTML page also suffices. So, for instance, you can dynamically generate a page named products_list which contains entries like this: Product x.
Step 3) You should either add the URL of the products_list page to your Nutch seed file or include a link to it in one of the pages already being crawled.
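Step 2 can be as simple as a generated HTML page whose only job is to link every product. A sketch, keeping the source's placeholder domain (the product IDs are illustrative):

```html
<!-- Hypothetical products_list page: one link per dynamic product,
     so Nutch can discover all of them from a single seed-able URL. -->
<html>
  <body>
    <a href="http://domain/product?id=1">Product 1</a>
    <a href="http://domain/product?id=2">Product 2</a>
    <a href="http://domain/product?id=3">Product 3</a>
  </body>
</html>
```

Note that if you keep the `?id=` form, Nutch's default regex-urlfilter.txt ships with a rule that skips URLs containing `?` and `=`, so that rule must be relaxed or the product links will be filtered out before fetching.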
