Post-processing of pages crawled using Nutch - Solr

I have a set of pages crawled using Nutch, and I understand that these crawled pages are saved as segments. I want to extract certain key values from these pages and feed them to Solr as XML.
A sample situation: I have crawled a shopping website with many product listings. I want to extract key information such as the name, price and specs of each product and ignore the rest of the data, so that I can provide Solr with some XML containing just those fields.
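For illustration, a minimal Solr add document for one product might look like the following; the field names (name, price, specs) are assumptions for the sketch, and the Solr schema would need to define price as a numeric field for sorting to work:

  <add>
    <doc>
      <field name="name">Example Phone X100</field>
      <field name="price">199.99</field>
      <field name="specs">5.5 inch display, 64 GB storage, dual SIM</field>
    </doc>
  </add>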
This is so that, using Solr, I can sort the different product listings by price.
Now, how can this extraction part be done? Does MapReduce come into the picture anywhere?

Turning raw web pages into information is not a trivial task. One tool used for this job is Boilerpipe. However, it won't give you a solution on a plate.
If you are working on a fixed target, you might just write your own procedural code to find the data you need. If you need to find this sort of thing in arbitrary HTML, you are facing a very hard problem with no off-the-shelf solutions.
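If the target site is fixed, that procedural code can be as simple as parsing each fetched page with an HTML parser and pulling fields out by known selectors. A minimal sketch using BeautifulSoup (the CSS selectors are assumptions about the target site's markup):

  from bs4 import BeautifulSoup

  def extract_product(html):
      # Parse one crawled product page and pull out the fields we care about.
      # The selectors below are placeholders; they must match the real markup.
      soup = BeautifulSoup(html, "html.parser")
      name = soup.select_one("h1.product-title")
      price = soup.select_one("span.price")
      specs = soup.select_one("div.specs")
      return {
          "name": name.get_text(strip=True) if name else None,
          "price": price.get_text(strip=True) if price else None,
          "specs": specs.get_text(strip=True) if specs else None,
      }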

Related

How to develop search-box auto-completion from a database?

I have seen many e-commerce websites that provide a search box for finding products. In those search features, most of the search fields offer auto-completion: if we enter a letter in the field, it shows suggestions from the database that include that letter. I know the basics of developing that functionality.
But what if the database contains a huge amount of data?
For example, e-commerce websites like Flipkart and Amazon have a huge number of products in their databases. So if a user enters a letter in the search field, the site has to search all the data in the database for entries containing that letter and display the matches as suggestions. These websites do it almost instantly. I wonder how they achieve that functionality; I can't work out which technologies they are using.
As a learner, I want to know the functional flow and, if possible, see a demo of that feature.
I think your question can be divided into two parts: 1) how to design the database for search, and 2) how to implement an effective search. Both belong to the field of search-engine technology.
For question 1, you can create a table that stores the search keywords, and in that table it helps to add a column (or a similar mechanism) describing each keyword's "search weight". A view is also a practical way to speed up access to the data.
For question 2, search-engine techniques are no longer mysterious; open-source projects such as Apache Lucene implement them.
More discussion:
On the front end (an ASP/JSP page or even a plain HTML page), you should use some scripting, e.g. Ajax, to pop up the drop-down suggestion list. Plain DOM JavaScript with a DIV can do it too, of course, but jQuery or another library makes it easier. The backend then only has to answer the suggestion requests.
To reduce the load on the server and the network bandwidth required, the front-end JavaScript should only activate the autocomplete once the user has typed at least three characters.
Bear in mind that in a real application your server has limited computing capacity and the client page usually contains many elements, all of which can hurt responsiveness. Do not make the request and response too complex.
An alternative is to add a FIFO-style cache: keep the most common search keywords in a cache (or a temporary table/view), so the amount of data that has to be scanned is reduced.
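A rough Python sketch of those last two ideas together, a minimum query length plus a small cache of popular keywords in front of the full keyword table (the keyword data and weights are made up for the example):

  # Toy autocomplete backend: check a small cache of popular keywords first,
  # fall back to the full keyword table, and ignore queries under 3 characters.
  POPULAR = ["iphone", "ipad", "laptop", "lenovo"]          # cache of common searches
  ALL_KEYWORDS = {"iphone": 90, "ipad": 70, "ice cream maker": 10,
                  "laptop": 80, "lenovo": 60, "lamp": 20}   # keyword -> search weight

  def suggest(prefix, limit=5):
      if len(prefix) < 3:                  # don't hit the backend for 1-2 characters
          return []
      prefix = prefix.lower()
      hits = [k for k in POPULAR if k.startswith(prefix)]
      if not hits:                         # cache miss: scan the full table, rank by weight
          hits = sorted((k for k in ALL_KEYWORDS if k.startswith(prefix)),
                        key=lambda k: -ALL_KEYWORDS[k])
      return hits[:limit]

  print(suggest("ip"))   # [] - too short to trigger a lookup
  print(suggest("len"))  # ['lenovo']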
There are many possible solutions; these are just the tricks that come to mind at the moment.
regards

Entity extraction on large documents

I need to extract entities from Word and PDF documents. Documents can be in the range of 10 to 20 pages. Are there scalable libraries/APIs available that we can plug into our processing pipeline? Any comparative study of different solutions would be helpful.
Take a look at Watson Natural Language Understanding (you'll need to get an IBM ID and then log in to see the content; don't worry, the cost is $0). With Watson Natural Language Understanding, you will want to look at the API Explorer to find the correct API syntax for the results you are looking for.
I also noticed that you mention Word/PDF documents. You will need to convert those using the Watson Discovery service, and then you can pass the converted documents to Watson Natural Language Understanding, which accepts JSON, text or HTML input.
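For reference, a minimal entity-extraction call with the ibm-watson Python SDK might look roughly like this; the API key, service URL and version date are placeholders, and the exact SDK surface may differ from what is sketched here:

  from ibm_watson import NaturalLanguageUnderstandingV1
  from ibm_watson.natural_language_understanding_v1 import Features, EntitiesOptions
  from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

  # Placeholder credentials and service URL - substitute your own.
  authenticator = IAMAuthenticator("YOUR_API_KEY")
  nlu = NaturalLanguageUnderstandingV1(version="2021-08-01", authenticator=authenticator)
  nlu.set_service_url("https://api.us-south.natural-language-understanding.watson.cloud.ibm.com")

  # Analyze plain text (e.g. the text converted out of a Word/PDF document)
  # and ask only for entities.
  response = nlu.analyze(
      text="IBM opened a new research lab in Nairobi in 2013.",
      features=Features(entities=EntitiesOptions(limit=10)),
  ).get_result()

  for entity in response.get("entities", []):
      print(entity["type"], entity["text"], entity.get("relevance"))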

Web data extraction and data mining; Scraping vs Injection and how to get data.. like yesterday

I feel like I should almost give a friggin synopsis for this/these lengthy question(s).
I apologize if all of these questions have been answered in a previous question/answer post, but I have been unable to locate any that specifically addresses all of the following queries.
This question involves data extraction from the web (i.e. web scraping, data mining, etc.). I have spent almost a year researching these fields and how they can be applied to a certain industry. I have also familiarized myself with PHP and MySQL/phpMyAdmin.
In a nutshell, I am looking for a way to extract information from a site (probably several gigs' worth) as fast and efficiently as possible. I have tried web scraping programs like Scrapy and WebHarvey, and I have also experimented with programs like HTTrack. All have their strengths and weaknesses. I have found that WebHarvey works pretty well, yet it has its limitations when scraping images that are stored in gallery widgets. I also find that many of the sites I am extracting from use other methods to make mining the data a pain. It would take months to extract the data using WebHarvey, which I can't really complain about, given that I'd be extracting millions of rows' worth of data exported in CSV format into Excel. But again, images and certain AJAX widgets throw the program off when it tries to extract image files.
So my questions are as follows:
Are there any quicker ways to extract said data?
Is there any way to get around WebHarvey's image limitations (i.e. only being able to extract one image within a gallery widget, and not being able to follow sub-page links on sites that embed things oddly and try to get cute with their coding)?
Are there any ways to bypass site search-form parameters that limit the number of search results (i.e. obtaining all business listings within an entire state instead of being limited to one county per search by the form's restrictions)?
Also, this is public information, so it cannot be copyrighted; anybody can take it :) (case in point: Feist Publications v. Rural Telephone Service). Extracting information is extracting information; it's legal to extract as long as we are talking about facts/public information.
So with that said, wouldn't the most efficient method (grey area here) of extracting this "public" information (assuming vulnerabilities existed) be through the use of SQL injection?... If one were so inclined? :)
As a side question, just how effective is Tor at obscuring one's IP address? Lol
Any help, feedback, suggestions or criticism would be greatly appreciated. I am by no means an expert in any of the above mentioned fields. I am just a motivated individual with a growing interest in programming and automation who has a lot of crazy ideas. Thank you.
You may be better off writing your own Linux command-line scraping program using either a headless browser library like PhantomJS (JavaScript), or a test framework like Selenium WebDriver (Java).
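As a small illustration of the Selenium route (using Selenium's Python bindings with headless Chrome rather than the Java bindings or PhantomJS; the URL and selectors are placeholders):

  from selenium import webdriver
  from selenium.webdriver.chrome.options import Options
  from selenium.webdriver.common.by import By

  # Run Chrome headless so this works on a command-line-only cloud server.
  options = Options()
  options.add_argument("--headless")
  driver = webdriver.Chrome(options=options)

  try:
      driver.get("https://example.com/listings?page=1")   # placeholder URL
      # Because the browser executes JavaScript, gallery widgets and
      # AJAX-loaded images are present in the DOM by the time we query it.
      for img in driver.find_elements(By.CSS_SELECTOR, ".gallery img"):
          print(img.get_attribute("src"))
  finally:
      driver.quit()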
Once you have your scrape program completed, you can scale it up by installing it on a cloud server (e.g. Amazon EC2, Linode, Google Compute Engine or Microsoft Azure) and duplicating the server image across as many servers as are required.

Trying to understand the idea of Search Documents on the Google App Engine

I'm trying to understand the concept of Documents on Google App Engine's Search API. The concept I'm having trouble with is the idea behind storing documents. So for example, say in my database I have this:
class Business(ndb.Model):
    name = ndb...
    description = ndb...
For each business, I am storing a document so I can do full-text searches on the name and description.
My questions are:
Is this right? Does this mean we are essentially storing each entity TWICE, in two different places, just to make it searchable?
If the answer to above is yes, is there a better way to do it?
And again, if the answer to number 1 is yes, where do the documents get stored? In the high-replication datastore?
I just want to make sure I am thinking about this concept correctly. Storing entities in docs means I have to maintain each entity in two separate places... doesn't seem very optimal just to keep it searchable.
You have it worked out already.
Full Text Search Overview
The Search API allows your application to perform Google-like searches over structured data. You can search across several different types of data (plain text, HTML, atom, numbers, dates, and geographic locations). Searches return a sorted list of matching text. You can customize the sorting and presentation of results.
Since you don't get to search "inside" the contents of models in the datastore, the Search API provides that ability for text and HTML.
So to link a searchable text document (e.g. a product description) to a model in the datastore (e.g. that product's price), you have to "manually" make the link between the documents and the datastore objects they relate to. The Search API and the datastore can also be used totally independently of each other, so you have to build that linkage yourself; AFAIK there is no automatic linkage between them.
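A minimal sketch of that manual link with the Python App Engine Search API: the search document's doc_id is derived from the entity's key, so a search hit can be mapped back to the datastore entity (field and index names are illustrative):

  from google.appengine.api import search
  from google.appengine.ext import ndb

  class Business(ndb.Model):          # the model from the question
      name = ndb.StringProperty()
      description = ndb.TextProperty()

  def index_business(business):
      # Store a second, searchable copy of the entity in a search index.
      # Using the urlsafe key as doc_id is the "manual" link back to the datastore.
      doc = search.Document(
          doc_id=business.key.urlsafe(),
          fields=[
              search.TextField(name="name", value=business.name),
              search.TextField(name="description", value=business.description),
          ],
      )
      search.Index(name="Business").put(doc)

  def search_businesses(query_string):
      results = search.Index(name="Business").search(query_string)
      # Map each hit back to its datastore entity via the doc_id.
      keys = [ndb.Key(urlsafe=doc.doc_id) for doc in results]
      return ndb.get_multi(keys)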

Best solution to store crawled sites in a database

I want to store crawled sites (HTML code) in a database. There will be millions of sites, and I will be searching those sites for particular strings.
Right now I am using PostgreSQL, but I have doubts about whether a relational database is the right fit. Maybe some NoSQL solution?
What solution do you recommend?
I have used Apache Nutch for the same purpose (crawling, storing and searching millions of sites) with success. It is based on Lucene and it scales (thanks to Hadoop).
It does the work out of the box.
http://nutch.apache.org/
http://lucene.apache.org/
After you fetch a web page, you need to strip out the low-value content (ads, unrelated text, ...). Using this strategy, you decrease the page size you have to store in the database and make your search results more relevant.
I suggest you write a program that extracts the valuable information and stores it in the database (if you don't need the original page); after that you can build a Lucene index on top of it to search your information.
If you want more accurate results, you can analyze each page and store some signals (content direction, category, links to external resources, the ratio of valuable information to all text, ...) to compute a rank for the page; these are text-mining techniques.
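A small sketch of that pre-processing step in Python (BeautifulSoup stands in for whatever HTML cleaner you use; the tag list and the text-to-markup ratio are just illustrative signals):

  from bs4 import BeautifulSoup

  STRIP_TAGS = ["script", "style", "nav", "footer", "aside", "iframe"]  # likely boilerplate

  def clean_page(html):
      soup = BeautifulSoup(html, "html.parser")
      for tag in soup(STRIP_TAGS):          # drop obvious non-content elements
          tag.decompose()
      text = " ".join(soup.get_text(separator=" ").split())
      # Crude "valuable text to all text" signal: how much of the raw page was real text.
      ratio = len(text) / max(len(html), 1)
      return text, ratio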
