Apache Nutch Crawl Dynamic Products

We currently use Apache Solr as our search engine and Apache Nutch as our crawler. We have now created a site that contains products which are generated dynamically.
Our current setup searches only within the content field, so when we search for a dynamic product, it does not appear in the search results.
Can you please guide me on how to crawl and index the dynamic products on a page into Apache Solr? Can we do this using sitemap.xml? If yes, please suggest how.
Thanks!

One possible solution is this:
Step 1) Put the description of each dynamic product on its own page, e.g. http://domain/product?id=xxx (or with a friendlier URL such as http://domain/product-x).
Step 2) You need one or more pages that list the URLs of these products. The sitemap.xml you mentioned is one choice, but a simple HTML page also suffices. For instance, you can dynamically generate a page named products_list that contains link entries like this: Product x.
Step 3) Either add the URL of the products_list page to your Nutch seed file or link to it from a page that is already being crawled.
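
As a rough illustration of Step 2, here is a minimal Python sketch that builds either a sitemap.xml or a plain products_list HTML page from a list of product ids. The function names, the `base_url` default (taken from the example URL in Step 1), and the sample ids are assumptions for illustration only; your site would generate this from its own product data.

```python
import html
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(product_ids, base_url="http://domain/product?id="):
    """Return sitemap XML (as a string) with one <url><loc> per product."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for pid in product_ids:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = f"{base_url}{pid}"
    return ET.tostring(urlset, encoding="unicode")


def build_products_list(product_ids, base_url="http://domain/product?id="):
    """Return a plain HTML page that links to every product page."""
    links = "\n".join(
        f'<li><a href="{html.escape(base_url + str(pid))}">'
        f"Product {html.escape(str(pid))}</a></li>"
        for pid in product_ids
    )
    return f"<html><body><ul>\n{links}\n</ul></body></html>"


if __name__ == "__main__":
    # Hypothetical product ids; replace with ids from your own catalogue.
    print(build_sitemap(["1", "2"]))
    print(build_products_list(["1", "2"]))
```

The HTML variant is the one Step 3 refers to: once the page is served, its URL goes into the seed file (or is linked from a crawled page) so the crawler can follow the links to each product.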

Related

Crawling websites with Nutch 2.3.1 skips product links but crawls other links

So, I am trying to crawl men's shoes from jabong.com.
My seed url is:
http://www.jabong.com/men/shoes/
I am making sure Nutch does not skip URLs containing ? and = by using this in regex-urlfilter.txt:
-[*!#]
This is my plugin.includes in nutch-site.xml:
protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-solr
It crawls links like the following, and I can search them in Solr:
http://www.jabong.com/men/shoes/andrew-hill/
http://www.jabong.com/men/shoes/?sh_size=40
http://www.jabong.com/all-products/?promotion=app-10-promo&cmpgp=takeover5
But it is not crawling the products that I actually want to crawl. Product links are:
http://www.jabong.com/Alberto-Torresi-Black-Sandals-2024892.html?pos=2
http://www.jabong.com/Clarks-Un-Walk-Brown-Formal-Shoes-874785.html?pos=11
This is weird because these links are on the same page as the seed URL, but they are not getting crawled. I fetched the page with wget and saw that the links are there, so no JavaScript is involved.
What am I doing wrong?

Make sure your page navigation doesn't depend on cookies. Try dumping the crawlDB and segments and check whether the expected URLs were visited and, if so, what content was fetched from them.
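
Another quick thing to rule out is the URL filter itself. The sketch below is a small Python approximation of how regex-urlfilter.txt rules are evaluated: the first rule whose pattern is found in the URL decides, and a URL that matches no rule is ignored. It is not Nutch code; `load_rules` and `accepts` are hypothetical helper names, and the `+.` catch-all line is an assumption about the rest of the file, since the question shows only the `-[*!#]` rule.

```python
import re


def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First rule whose pattern occurs in the URL wins; no match means skip."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


# The rule from the question, followed by an ASSUMED accept-everything rule.
RULES_TEXT = """
-[*!#]
+.
"""

if __name__ == "__main__":
    rules = load_rules(RULES_TEXT)
    for url in (
        "http://www.jabong.com/Alberto-Torresi-Black-Sandals-2024892.html?pos=2",
        "http://www.jabong.com/men/shoes/?sh_size=40",
    ):
        print(url, "->", "accepted" if accepts(rules, url) else "rejected")
```

If the product URLs pass a check like this against your full rule file, the filter is probably not the cause, and the cookie and crawlDB checks above are the next place to look.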

SEO: Does Google recognize dynamically inserted text via Angular?

I've created an Angular SPA that dynamically inserts text all over the place; more specifically, it has a service that dynamically changes the title tag for each page. Is Google able to index this properly?
See ps101.com for it in action.

You will find some information here:
http://mono.software/2016/02/18/SEO-for-javascript-applications/
http://searchengineland.com/can-now-trust-google-crawl-ajax-sites-235267#.Vq9umCaPP-Y.twitter
@eywu on Twitter also ran some tests at www.jscrawlability.com.
As you can see, the site is not properly indexed by Google.
I would recommend SEO4Ajax to get your website indexed reliably. You can test it; it's free for small sites.

Database search field

I'm looking for a way to create a search box in WordPress where visitors can search for a number in the database. Is this possible? I have several package numbers in my database. I want to give my visitors the ability to search for their package number and retrieve the information that comes with that number.

What you want to do can be done.
I suggest a different approach than using wp-exec. (I just looked at the wp-exec website; that plugin was created for WordPress 1.5, which means it hasn't been updated in about 5 years.)
The content you want to display exists entirely outside of WordPress. I suggest you use a custom page template - see
http://codex.wordpress.org/Pages#Creating_Your_Own_Page_Templates
In this case you would not use WordPress posts, pages, or custom post types. On the custom page template you would write (or have written, if you don't have the know-how to do it yourself) PHP code to extract the info from the database and display it on a page.
For pages like that you would be using WordPress only as a container within which to display the results. The custom page would appear in the site nav, and the page of results would use the site's theme, so it looks like the rest of the site.
But the code to display data from the database would not use the WordPress loop. It would be PHP / MySQL data retrieval and display code.
I really doubt you will find a plugin that lets you display results from an external database, formatted the way you want them to appear. The reason is that every external database is different, with different tables and table structures, and no two sites will want the external data displayed in the same way. So there is little to generalize and encapsulate in a plugin, as everyone wants something different.
I've created pages on some sites along the lines of what you want to do, so I know it can be done. But it requires writing custom code.
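
To give a feel for the retrieval logic only, here is a language-neutral sketch written in Python with an in-memory SQLite database. In the actual page template this would be PHP / MySQL, as described above. The table name `packages`, its columns, and the sample row are all assumptions for illustration. The point it shows is the parameterized query, which keeps visitor input out of the SQL text.

```python
import sqlite3


def find_package(conn, package_number):
    """Look up one package by number; return a dict, or None if not found."""
    row = conn.execute(
        # The '?' placeholder passes the visitor's input as data, not as SQL.
        "SELECT package_number, status FROM packages WHERE package_number = ?",
        (package_number,),
    ).fetchone()
    if row is None:
        return None
    return {"package_number": row[0], "status": row[1]}


def demo_connection():
    """Create a throwaway database with one hypothetical package row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE packages (package_number TEXT PRIMARY KEY, status TEXT)"
    )
    conn.execute("INSERT INTO packages VALUES (?, ?)", ("PKG-1001", "in transit"))
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    print(find_package(conn, "PKG-1001"))
    print(find_package(conn, "PKG-9999"))
```

The same shape carries over to the page template: read the number from the search form, run one parameterized query, and render either the matching row or a "not found" message inside the theme.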

Searching information on pages created by views in Drupal 7

I'm using Drupal Commerce and creating pages with those products through views.
I would like to be able to search for any of the products and any of their descriptions and have the search results link to those views pages.
I'm currently exploring Search by Page and it's kind of working, but it only searches the page title and doesn't match substrings. I'm downloading and trying everything! Maybe I just need the right combo.
Has anyone dealt with this?
Thanks!

You can use the "Search: search terms" option in the view's filter settings. For this option you have to enable the core Search module.
It searches the title and description, and you can also set a "contains" criterion to match any part of the data.
Cheers!!!

How to configure solr search only for a branch of pages in Typo3?

I configured Solr search for TYPO3. I want to search only a certain branch of pages, but right now I get search results from all pages.
I want to get search results from only one branch of the page tree.
How can I achieve this?

Within the TypoScript setup you can set
config.index_enable = 0
to disable indexing for all the subpages (see solr at forge).
