Nutch crawler not indexing HTML content - solr

I am trying to develop a search functionality where I enter a city name and it gives me the weather conditions for that city.
I have set up Nutch 1.3 and Solr 3.4.0 on my system. I am crawling the IMD weather website with Nutch and passing the index to Solr for searching. Now, I want to retrieve the information displayed on the Delhi forecast page (http://www.imd.gov.in/section/nhac/distforecast/delhi.htm) when querying for delhi.
How can I achieve this? Does it require any plugin to be written?
This is the document Solr returns for the Delhi page (note that the content and title fields are empty):

<doc>
  <float name="score">1.0</float>
  <float name="boost">0.1879294</float>
  <str name="content"/>
  <str name="digest">d41d8cd98f00b204e9800998ecf8427e</str>
  <str name="id">http://www.imd.gov.in/section/nhac/distforecast/delhi.htm</str>
  <str name="segment">20111118153543</str>
  <str name="title"/>
  <date name="tstamp">2011-11-18T10:06:45.604Z</date>
  <str name="url">http://www.imd.gov.in/section/nhac/distforecast/delhi.htm</str>
</doc>

Nutch basically crawls by following links on the pages it fetches.
However, there are no links on the India page that lead to the Delhi page you mentioned,
so Nutch won't be able to navigate down to that page.
You can create your own dummy HTML page that acts as the start URL for the crawl and contains all the links you want Nutch to index.
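For example, a minimal seed page might look like this (the file name and the list of links are placeholders; add one link per city page you actually need):

<!-- seed.html: a local start page whose only job is to link to the pages Nutch should fetch -->
<html>
  <body>
    <a href="http://www.imd.gov.in/section/nhac/distforecast/delhi.htm">Delhi forecast</a>
    <!-- one link per city page you want indexed -->
  </body>
</html>

Host it somewhere Nutch can reach, point your seed URL list at it, and make sure your Nutch URL filters don't exclude the imd.gov.in pages it links to.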
What's the default search field in your schema?
Usually it's the text field, and querying for delhi would look in that field for matches.
Since *:* returns the Delhi result and delhi does not, the query isn't matching the indexed tokens in the field it is searching on.
What's the field type defined for url in the schema?
You can copy the field to another field with text analysis, which would produce the delhi token; querying for url_copy:delhi should then return the result.
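A minimal sketch of how that copy could be set up in schema.xml; the url_copy name comes from the answer above, but the text_url type and its analyzer are illustrative assumptions:

<!-- schema.xml: analyzed copy of the url field so that "delhi" becomes a searchable token -->
<fieldType name="text_url" class="solr.TextField">
  <analyzer>
    <!-- splits on non-letters and lowercases, so the URL yields tokens like delhi, htm, imd, gov -->
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
  </analyzer>
</fieldType>
<field name="url_copy" type="text_url" indexed="true" stored="false"/>
<copyField source="url" dest="url_copy"/>

After re-indexing, a query like q=url_copy:delhi should match the document shown above.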

Related

Elements getting added in Solr index but not able to search elements as desired

I'm working with solr to store web crawling search results to be used in a search engine. The structure of my documents in solr is the following:
{
word: The word received after tokenizing the body obtained from the html.
url: The url where this word was found.
frequency: The no. of times the word was found in the url.
}
When I go to the Solr dashboard on my system, which is http://localhost:8983/solr/#/CrawlerSearchResults/query, I'm able to find a word, say "Amazon", with the query "word: Amazon", but on directly searching for Amazon I get no results. Could you please help me out with this issue?
Thanks,
Nilesh.
In your second example, the value is searched against the default search field (since you haven't provided a field name). This is by default a field named _text_.
To support just typing a query into the q parameter without field names, you can either set the default field to search with df=word in your URL, or use the edismax query parser (defType=edismax) and the qf parameter (query fields). qf allows multiple fields and lets you give them weights, but in your case it'd just be qf=word.
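For example, assuming the CrawlerSearchResults core from your dashboard URL, either of these should find the document:

http://localhost:8983/solr/CrawlerSearchResults/select?q=Amazon&df=word
http://localhost:8983/solr/CrawlerSearchResults/select?q=Amazon&defType=edismax&qf=word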
Second - what you're doing seems to replicate what Lucene is doing internally, so I'm not sure why you'd do it this way (each word is what's called a "token", and each count is what's called a term frequency). You can write a custom similarity to add custom scoring based on these parameters.

Result missing in solr for specific search term

I am seeing a problem in a solr search where a result is not returned for a specific search term. The page has definitely been indexed by solr and is returned for other searches, including for a single word in that phrase.
eg:
search 'abc 123' - page IS NOT in the results
search 'abc' - page IS in the results
The phrase 'abc 123' is in the fields that are indexed for the page: the title / h1 tag, the main content of the page, and the page URL.
These results have been observed by querying the solr box directly.
I'm a relative n00b to Solr; any help is much appreciated.

Searching wiki URLs using Solr

I am trying to index and search a wiki on our intranet using Solr. I have it more-or-less working using edismax but I'm having trouble getting main topic pages to show up first in the search results. For example, suppose I have some URLs in the database:
http://whizbang.com/wiki/Foo/Bar
http://whizbang.com/wiki/Foo/Bar/One
http://whizbang.com/wiki/Foo/Bar/Two
http://whizbang.com/wiki/Foo/Bar/Two/Two_point_one
I would like to be able to search for "foo bar" and have the first link returned as the top result because it is the main page for that particular topic in the wiki. I've tried boosting the title and URL field in the search but the fieldNorm value for the document keeps affecting the scores such that sub-pages score higher. In one particular case, the main topic page shows up on the 2nd results page.
Is there a way to make the first URL score significantly higher than the sub-pages so that it shows up in the top 5 search results?
One possible approach to try:
Create a copyField with your url
Extract path only (so, no host, no wiki)
Split on / and maybe space
Lowercase
Boost on phrase or bigram or something similar.
If you have a lot of levels, maybe you want a multivalued field, with each depth (starting from the end) getting a separate entry. That way a perfect match will get a better score. From here, you should start experimenting with your real searches.
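A rough schema.xml sketch of such a field, assuming the /wiki/ prefix from the example URLs (the field and type names are made up):

<!-- strips the scheme, host and /wiki/ prefix, then splits the remaining path on / and whitespace -->
<fieldType name="url_path" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="https?://[^/]+/wiki/" replacement=""/>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[/\s]+"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="url_path" type="url_path" indexed="true" stored="false"/>
<copyField source="url" dest="url_path"/>

With edismax you could then weight it via qf, or as a phrase field with something like pf=url_path^10, so that "foo bar" matching the whole path of /Foo/Bar lifts the main page above its sub-pages.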

how to Index URL in SOLR so I can boost results after website

I have thousands of documents indexed in my SOLR which represents data crawled from different websites. One of the fields of a document is SourceURL which contains the url of a webpage that I crawled and indexed into this Document.
I want to boost results from a specific website using boost query.
For example, I have 4 documents whose SourceURL fields contain the following data:
https://meta.stackoverflow.com/page1
http://www.stackoverflow.com/page2
https://stackoverflow.com/page3
https://stackexchange.com/page1
I want to boost all results that are from stackoverflow.com and not its subdomains (in this case results 2 and 3).
Do you know how I can index the URL field and then use a boost query to identify all the documents from a specific website, as in the case above?
One way would be to parse the URL prior to index time and flag whether it is on the primary domain (a primarydomain boolean field in your schema.xml, for example).
Then you can boost on the primarydomain field in your query. See the DisMaxQParserPlugin page on the Solr Wiki for an example of how to boost fields at query time.
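As a minimal sketch, with the flag computed by your own indexing code before documents are posted to Solr (the field name is just an example):

<!-- schema.xml: true when the host of SourceURL is exactly stackoverflow.com (or www.stackoverflow.com) -->
<field name="primarydomain" type="boolean" indexed="true" stored="true"/>

At query time, with the dismax/edismax parser, a boost query such as
&bq=primarydomain:true^10
would lift those documents without filtering out the rest.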

Get text snippet from search index generated by solr and nutch

I have just configured Nutch and Solr to successfully crawl and index text on a web site, by following the getting started tutorials. Now I am trying to make a search page by modifying the example velocity templates.
Now to my question. How can I tell solr to provide a relevant text snippet of the content of the hits? I only get the following fields associated with each hit:
score, boost, digest, id, segment, title, date, tstamp and url.
The content really is indexed, because I can search for words that I know appear only in the full text, but I still don't get the full text back with the hit.
Don't forget: indexed is not the same as stored.
You can search for words in a document if the fields are indexed, even if no field is stored.
To get the content of a specific field back in the response, that field must also have stored="true" in schema.xml.
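For example, if the full text goes into the mytext field mentioned below, its definition would need to look something like this (the type name depends on your schema):

<field name="mytext" type="text" indexed="true" stored="true"/>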
If your fulltext field is stored, then probably the default field-list settings just don't include it.
You can add this by using the fl parameter:
http://<solr-url>:port/select/?......&fl=mytext,*
...in this example, the full text is stored in a field called mytext.
Finally, if you would like only a snippet of the text containing the searched words (not the whole text), look at the highlighting component of Solr/Lucene.
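A minimal example of the highlighting parameters, again assuming the mytext field (the exact URL layout depends on your setup):

http://<solr-url>:port/select/?q=your+words&hl=true&hl.fl=mytext&hl.snippets=1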
