How can I post HTML web pages to a Solr index as I download them with wget? How could I modify the following example so that the page gets indexed at the same time?
wget -P /var/myserver/archive http://www.somesite/products.html
I can't spot an obvious example in the Solr documentation and would be grateful for any pointers.
You can check out Apache Nutch, which is an open-source web crawler.
You can provide Nutch with a seed page and it will index that page as well as the links within it.
Nutch integrates with Solr, so the pages will be indexed by Solr and become searchable.
However, if it's just a couple of pages and you don't need spidering capabilities, you can simply download the HTML pages and feed them to Solr through client code.
Solr has HTML filters which will help extract the content from these pages and index it as text.
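For example, a minimal sketch of that second approach using Solr's ExtractingRequestHandler (the core name mycore, the literal.id value and the paths are placeholders, and /update/extract must be enabled in your solrconfig.xml, as it is in the sample configs):

# download the page, then post it to Solr's extracting handler
wget -P /var/myserver/archive http://www.somesite/products.html
curl "http://localhost:8983/solr/mycore/update/extract?literal.id=products&commit=true" \
     -F "file=@/var/myserver/archive/products.html"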
I work with TYPO3 10.4.18, solr_file_indexer 2.3.1 and tika 6.0.0.
For Tika I have the solr server as host.
The indexing of the pages, extensions and documents works flawlessly. The search index contains the content of the documents.
Now I want to display the search results for documents in the same way as the page result list. But I can't find a variable that contains the content extracted by Tika which could be used in the document.html template of the solr extension for the frontend.
Is there any additional configuration needed here?
With help from @swilking I found the simple answer to my question:
In the file document.html of the extension solr write something like
<f:if condition="{document.type} == 'sys_file_metadata'">
    <div>{s:document.highlightResult(resultSet: resultSet, document: document, fieldName: 'content')}</div>
</f:if>
to get the file content with the highlight feature.
So, I am trying to crawl men's shoes from jabong.com.
My seed url is:
http://www.jabong.com/men/shoes/
I am making sure Nutch does not skip ? and = by using this in regex-urlfilter.txt:
-[*!#]
This is my plugin.includes in nutch-site.xml:
protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-solr
It crawls links like the following and I can search them in solr:
http://www.jabong.com/men/shoes/andrew-hill/
http://www.jabong.com/men/shoes/?sh_size=40
http://www.jabong.com/all-products/?promotion=app-10-promo&cmpgp=takeover5
But it is not crawling the product pages that I actually want. The product links look like:
http://www.jabong.com/Alberto-Torresi-Black-Sandals-2024892.html?pos=2
http://www.jabong.com/Clarks-Un-Walk-Brown-Formal-Shoes-874785.html?pos=11
This is weird because these links are present on the same page as the seed URL, but they are not getting crawled. I did a wget of the page and saw that the links are there, so no JavaScript is involved.
What am I doing wrong?
Make sure your page navigation doesn't depend on cookies. Try dumping the crawlDB and the segments and check whether the expected URLs have been fetched or not. If they have, check what content was actually retrieved from those URLs.
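A rough sketch of how to inspect that, assuming your crawl data lives under crawl/ (adjust the paths and the segment timestamp to your own setup):

# dump the crawlDB to see the status of every URL Nutch knows about
bin/nutch readdb crawl/crawldb -dump crawldb_dump

# dump a fetched segment to see what content was actually retrieved
bin/nutch readseg -dump crawl/segments/20150101120000 segment_dump

# check whether the product URLs ever made it into the crawl
grep "Alberto-Torresi" crawldb_dump/part-00000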
I have created my first AngularJS website. I have set up pushState (html5 mode), added the fragment meta tag, created a sitemap in Google and tested the Google Fetch functionality. After a few days, my website is still not completely indexed by Google. Google has indexed only 1 URL instead of 4 (my sitemap contains 4 URLs). My website is Tom IT. The main page is indexed, but a subpage that is also in the sitemap (you can find the sitemap at sitemap.xml in the root of my domain tom-it.be) does not appear in the search results. I also added a robots.txt.
Google's crawlers can parse pages generated by an SPA and show them in the SERPs, but not immediately; it may take several days. In my experience, an AngularJS site may need about 3 days and an EmberJS site about 7 days.
If you want your website to be crawled completely, the important information should be put directly in the HTML, or you can use other techniques, for example preparing a separate page for crawlers, server-side pre-rendering, or PhantomJS.
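As an illustration of the "separate page for crawlers" idea under the escaped-fragment scheme Google supported at the time (a sketch only, not a full setup; the route URL is a placeholder):

<!-- in the <head> of the AngularJS pages: asks the crawler to request
     the same URL with ?_escaped_fragment_= appended -->
<meta name="fragment" content="!">

<!-- the server must then answer a request such as
     http://tom-it.be/some-page?_escaped_fragment_=
     with a static, pre-rendered HTML snapshot of that route
     (for example one produced by PhantomJS) -->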
Currently we are using Apache Solr as the search engine and Apache Nutch as the crawler. We have now created a site which contains products that are generated dynamically.
The current setup searches within the content field, so whenever we search for a dynamically generated product it does not come up in the search results.
Can you please guide me on how to crawl and index dynamic products on a page into Apache Solr? Can we do this using sitemap.xml? If yes, please suggest how.
Thanks!
One possible solution is this:
Step 1) Put the description of each dynamic product on its own page, e.g. http://domain/product?id=xxx (or with a friendlier URL such as http://domain/product-x).
Step 2) You need a page, or several pages, that list the URLs of these products. The sitemap.xml you mentioned is one choice, but a simple HTML page also suffices. So, for instance, you can dynamically generate a page named products_list which contains link entries such as <a href="http://domain/product-x">Product x</a> (see the sketch after these steps).
Step 3) You should either add the URL of the products_list page to your Nutch seed file or include a link to it on one of the pages that is already being crawled.
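A minimal sketch of such a products_list page (the product URLs are placeholders following the pattern from step 1):

<!-- products_list: generated dynamically, one link per product so that
     Nutch can discover and fetch every product page -->
<html>
  <body>
    <a href="http://domain/product-1">Product 1</a>
    <a href="http://domain/product-2">Product 2</a>
    <a href="http://domain/product-3">Product 3</a>
  </body>
</html>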
I have used the arachnode.net crawler to crawl a website. The crawl data has resulted in a database of over 100 GB!
I have looked around the arachnode.net database and found the "webpages" table to be the culprit. When I crawl a website I do not download images, media or anything alike; I only download the HTML code. However, in this case I can see that the HTML pages contain a huge amount of hidden view data and JavaScript.
So I need to do the crawl again, and this time strip out the hidden view data and the JavaScript code before saving to the webpages table.
Does anyone have an idea of how to achieve this?
Thanks.
Yes, you can write a plugin which modifies the CrawlRequest.Data and CrawlRequest.DecodedHtml before the data is inserted into the database.
Create a PostRequest CrawlAction as shown here: http://arachnode.net/Content/CreatingPlugins.aspx