Best solution to store crawled sites in database - database

I want to store crawled sites (their HTML code) in a database. There will be millions of sites, and I will be searching them for particular strings.
I am currently using PostgreSQL, but I doubt whether a relational database is the right fit. Maybe a NoSQL solution?
What solution do you recommend?

I have used Apache Nutch for the same purpose (crawling, storing and searching millions of sites) with success. It is based on Lucene and it scales (thanks to Hadoop).
It does the job out of the box.
http://nutch.apache.org/
http://lucene.apache.org/

After you fetch a web page, strip out the information that has no value to you (ads, unrelated text, ...). This reduces the size of each page you store in the database and makes your search results more relevant.
I suggest writing a program that extracts the valuable information and stores it in the database (if you don't need the original page). After that you can build a Lucene index on top of it to search your information.
If you want more accurate results, you can analyze each page and store some features (content direction, category, links to external resources, ratio of valuable information to total text, ...) to compute a rank for the page; these are text-mining techniques.
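As a rough illustration of the "extract before you store" step, here is a minimal sketch that keeps only the visible text of a page, using Python's standard `html.parser`. It is a crude stand-in: it drops `script`, `style` and `noscript` content but does not detect ads or navigation, which needs site-specific rules or a dedicated tool. The names `VisibleTextExtractor` and `extract_text` are made up for this example.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text that is not inside script/style/noscript elements."""

    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside the skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The returned string is what you would store and index instead of the full HTML, assuming the original markup is not needed later.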

Related

How to develop search box auto completion from database?

I have seen many e-commerce websites that provide a search box for finding products. Most of these search fields auto-complete: when we type a letter in the field, it shows suggestions from the database that include that letter. I know the basics of developing that functionality.
But what if the database contains a huge amount of data?
For example, e-commerce websites like Flipkart and Amazon have a lot of products in their databases. So when a user types a letter in the search field, the site has to search all the data in the database for entries that include that letter and display them as suggestions. These websites do this within a fraction of a second. I wonder how they achieve that? I can't work out which technologies they are using.
As a learner I want to know the functional flow and, if possible, see a demo of the feature.
I think your question can be divided into two parts: 1) how to design the database for search, and 2) how to implement an effective search. Both belong to the field of search engine technology.
For Q1, you can create a table that stores the search keywords, and in that table you had better add a column (or a similar mechanism) that describes the "search weight". As is well known, a view is a practical way to speed up access to the data.
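A minimal sketch of such a keyword table, using SQLite from the Python standard library so it runs anywhere. The table name `search_keyword`, the sample rows and the `suggest` helper are assumptions made for this example, and matching on the start of the keyword (a prefix match) is one possible choice, not something the answer prescribes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE search_keyword ("
    " keyword TEXT PRIMARY KEY,"
    " weight INTEGER NOT NULL DEFAULT 0)"
)
# Hypothetical sample data: a higher weight means a more popular keyword.
conn.executemany(
    "INSERT INTO search_keyword (keyword, weight) VALUES (?, ?)",
    [("laptop", 90), ("laptop bag", 40), ("lamp", 70), ("phone", 95)],
)


def suggest(prefix, limit=5):
    """Return up to `limit` keywords starting with `prefix`, heaviest first."""
    # Escape LIKE wildcards so user input is matched literally.
    escaped = prefix.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = conn.execute(
        "SELECT keyword FROM search_keyword"
        " WHERE keyword LIKE ? ESCAPE '\\'"
        " ORDER BY weight DESC, keyword LIMIT ?",
        (escaped + "%", limit),
    )
    return [row[0] for row in rows]
```

With the sample rows, `suggest("la")` returns the three keywords starting with "la", ordered by weight. On a large table the database needs an index it can use for prefix matches, which depends on the engine and its collation settings.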
For Q2, search engine technology is no longer mysterious: some open source projects, such as Apache Lucene, provide search engine features. See the Apache Lucene site.
More discussion:
On your front end in particular (for example an ASP/JSP page or even a simple HTML page), you should use some scripting, e.g. Ajax, to show the pop-up or drop-down list. Of course, plain DOM JavaScript plus a DIV can do it too, but jQuery or another library makes it easier. Here is an example.
Here is the backend system example
To reduce the load on the host and the network bandwidth required, the front-end JavaScript should activate the autocomplete feature only once more than three characters have been typed.
In your actual application, please keep in mind that your server has limited computing capacity and that the client page usually has many elements; both reduce user friendliness. Do not make the request and response too complex.
An alternative approach is FIFO logic: keep the most common search keywords in a "cache" or a temp table/view, so that the amount of data to search is reduced.
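One way to read the FIFO idea is a small bounded cache in front of the slow lookup, where the oldest cached prefix is evicted first. The sketch below is only an illustration under that reading; `FifoSuggestionCache` and its `lookup` callback are hypothetical names, and `lookup` stands in for whatever database query produces the suggestions.

```python
from collections import OrderedDict


class FifoSuggestionCache:
    """Bounded cache that evicts the oldest inserted prefix first."""

    def __init__(self, capacity, lookup):
        self._capacity = capacity
        self._lookup = lookup  # slow function: prefix -> list of suggestions
        self._entries = OrderedDict()
        self.misses = 0

    def get(self, prefix):
        # Serve repeated prefixes from memory without touching the database.
        if prefix in self._entries:
            return self._entries[prefix]
        self.misses += 1
        result = self._lookup(prefix)
        self._entries[prefix] = result
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # drop the oldest entry
        return result
```

A real deployment would also need some expiry so that cached suggestions do not go stale when the product catalogue changes.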
There are many possible solutions; these are the tricks I can think of at the moment.
regards

Neo4J: Binary File storage and Text Search "stack"

I have a project I would like to work on which I feel is a beautiful case for Neo4j. But there are aspects about implementing this that I do not understand enough to succinctly list my questions. So instead, I'll let the scenario speak for itself:
Scenario: Put simply, I want to build an application for users who receive files of various types (documents, Excel, Word, images, audio clips and even videos, although not so many videos) and that lets them upload and categorize these files.
With each file they will enter any and all associations. Examples:
If Joe authors a PDF, Joe is associated with the PDF.
If a DOC says that Sally is Mary's mother, Sally is associated with Mary.
If Bill sent an email to Jane, Bill is associated with Jane (and the email).
If company X sends an invoice (Excel grid) to company Y, X is associated with Y.
and so on...
So the basic goal at this point would be to:
Have users load in files as they receive them.
Enter the associations that each file contains.
Review associations holistically, in order to predict or take some action.
Generate a report of the associations of interest, including the files that the associations are based on.
The value of this project is in the associations, which in reality would grow much more complex than the above examples and should produce interesting conclusions. However, if the user is asked "How did you come to that conclusion?", they need to be able to produce a summary of the associations as well as any files that these associations are based on, i.e. the PDF or Excel file or whatever.
Initial thoughts...
I should also add that this application would be hosted internally and probably used by approx. 50 users, so I probably don't need the fastest, most scalable, highest-availability solution possible. The data being loaded could get rather large though, maybe up to a terabyte in a year? (Not the associations but the actual files.)
Wouldn't it be great if Neo4J just did all of this! Obviously it should handle the graph aspects of this very nicely, but I figure that the file storage and text search is going to need another player added to the mix.
Some combinations of solutions I know of would be:
Store EVERYTHING including files as binary in Neo4J.
Would be wrestling Neo4J into something it's not built for.
How would I search text?
Store only associations and meta data in Neo4J and uploaded file on File system.
How would I do text searches on files that are stored on file server?
Store only associations and meta data in Neo4J and uploaded file in Postgres.
Not so confident about having all my files inside the DB. I feel more comfortable having all my files accessible in folders.
Everyone says it's great to put your files in the DB. Everyone says it's not great to put your files in the DB.
Get to the bloody questions..
Can anyone suggest a good "stack" that would suit the above?
Please give a basic outline on how you would implement your suggestion, ie:
Have the application store the data into Neo4J, then use triggers to update Postgres.
Or have the files loaded into Postgres and triggers update Neo4J.
Or have the application load data into Neo4J and then have the application load data into Postgres.
etc
How you would tie these together is probably what I am really trying to grasp.
Thank you very much for any input on this.
Cheers.
p.s. What a ramble! If you feel the need to edit my question or title to simplify, go for it! :)
Here are my recommendations:
Never store binary files in the database. Store them in the filesystem or in a service like AWS S3 instead, and reference the file in your data model.
I would store the file in S3 first and then a reference to it in your primary database (Neo4j?).
If you want to be able to search for any word in a document, I would recommend a full-text search engine like Elasticsearch. Elasticsearch can scan multiple document formats, such as PDF, using Tika.
You can probably also use Elasticsearch/Tika to search for relationships in the document and surface them in order to update your graph.
Suggested Stack:
Neo4j
ElasticSearch
AWS S3 or some other redundant filesystem to avoid data loss
Bonus: See this SO question/answer for best practices on indexing files in multiple formats using ES.
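The "store the file outside the database and keep only a reference" idea can be sketched as follows. This is an illustration, not a full design: it uses a local directory instead of S3 so that it runs without extra libraries, it names each stored file by its SHA-256 digest (my assumption, chosen so identical uploads are stored once), and `store_file` plus the keys of the returned record are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def store_file(source, store_root):
    """Copy `source` into `store_root` under its SHA-256 digest and return a reference."""
    source = Path(source)
    hasher = hashlib.sha256()
    with source.open("rb") as handle:
        # Hash in 1 MiB chunks so large audio/video files are not loaded at once.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            hasher.update(chunk)
    digest = hasher.hexdigest()

    target = Path(store_root) / digest[:2] / digest
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        shutil.copyfile(source, target)

    # Only this small record would go into the database, e.g. as node properties.
    return {
        "original_name": source.name,
        "sha256": digest,
        "stored_path": str(target),
        "size_bytes": target.stat().st_size,
    }
```

The application would write the returned record to the graph database and send the extracted text to the search engine, so the application itself ties the stores together rather than database triggers.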

Storing 100k map markers in App Engine

I'm designing yet another "Find Objects near my location" web site and mobile app.
My requirements are:
Store up to 100k objects;
Query for objects that are close to the point (my location, city, etc). And other search criteria (like object type);
Display results on the Google Maps with smooth performance.
Let user filter objects by object time.
I'm thinking about using Google App Engine for this project.
Could you recommend the best data storage option for this?
And a couple of words about a dynamic data loading strategy would help too.
I feel kind of overwhelmed with options at the moment and am looking for hints on where I should continue my research.
Thanks a lot!
I'm going to assume that you are using the datastore. I'm not familiar with Google Cloud SQL (which I believe aims to offer MySQL-like features in the cloud), so I can't say whether it can do geospatial queries.
I've been looking into the whole "get locations in proximity of a location" problem for a while now. I have some good and some bad news for you, unfortunately.
The best way to do a proximity search in the Google environment is via the Search Service (https://developers.google.com/appengine/docs/python/search/ or find the Java link). The reason is that it supports a "Geopoint Field" and lets you query on it.
OK, cool, so there is support, right? However, "A query is complex if its query string includes the name of a geopoint field or at least one OR or NOT boolean operator". The free quota for Complex Search Queries is 100/day. Beyond that, it costs 60 cents per 10,000 queries. Depending on your application, this may be an issue.
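To get a feel for what those numbers mean, here is a back-of-the-envelope estimate. It uses only the figures quoted above (100 free complex queries per day, 60 cents per 10,000 beyond that) and assumes billing is strictly proportional to the number of billable queries, which is my simplification; check current pricing before relying on it.

```python
FREE_QUERIES_PER_DAY = 100          # free quota quoted in the answer
PRICE_PER_10K_QUERIES_USD = 0.60    # price quoted in the answer


def daily_cost_usd(queries_per_day):
    """Estimated daily cost, assuming linear billing above the free quota."""
    billable = max(0, queries_per_day - FREE_QUERIES_PER_DAY)
    return billable / 10_000 * PRICE_PER_10K_QUERIES_USD
```

For example, 50,100 proximity queries in a day leave 50,000 billable queries, i.e. five blocks of 10,000 at 60 cents each, or about 3 USD per day under these assumptions.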
I'm not too familiar with the Google Maps API, but you might be able to pull off something like this: https://developers.google.com/maps/articles/phpsqlsearch_v3
My current project/problem involves moving locations, not "static" ones (stores, landmarks, etc.). I've decided to go with Amazon's DynamoDB, which has a library that supports geospatial indexing: http://aws.amazon.com/about-aws/whats-new/2013/09/05/announcing-amazon-dynamodb-geospatial-indexing/

Searching over documents stored in Hadoop - which tool to use?

I'm lost in: Hadoop, Hbase, Lucene, Carrot2, Cloudera, Tika, ZooKeeper, Solr, Katta, Cascading, POI...
When you read about one of them, you can be fairly sure that each of the other tools is going to be mentioned.
I don't expect you to explain every tool to me, certainly not. If you could help me narrow this set down for my particular scenario, it would be great. So far I'm not sure which of the above will fit, and it looks like (as always) there is more than one way of doing what's to be done.
The scenario is: 500 GB to ~20 TB of documents stored in Hadoop. Text documents in multiple formats: email, doc, pdf, odt. Metadata about those documents is stored in a SQL db (sender, recipients, date, department etc.). The main source of documents will be Exchange Server (emails and attachments), but not the only one. Now to the search: the user needs to be able to do complex full-text searches over those documents. Basically, he'll be presented with a search-config panel (a Java desktop application, not a webapp) where he'll set the date range, document types, senders/recipients, keywords etc., fire the search, and get the resulting list of documents (and, for each document, info on why it's included in the search results, i.e. which keywords were found in it).
Which tools should I take into consideration and which not? The point is to develop such a solution with only the minimal required "glue" code. I'm proficient in SQL dbs but quite uncomfortable with Apache-and-related technologies.
Basic workflow looks like this: ExchangeServer/other source -> conversion from doc/pdf/... -> deduplication -> Hadoop + SQL (metadata) -> build/update an index <- search through the docs (and do it fast) -> present search results
Thank you!
Going with Solr is a good option. I have used it for a scenario similar to the one you describe. You can use Solr for really huge data sets, as it's a distributed index server.
But to get the metadata for all of these document formats you will need some other tool. Basically your workflow will be this:
1) Use the Hadoop cluster to store the data.
2) Extract the data in the Hadoop cluster using map/reduce.
3) Do document identification (identify the document type).
4) Extract metadata from these documents.
5) Index the metadata in the Solr server, and store other ingestion information in the database.
6) The Solr server is a distributed index server, so for each ingestion you could create a new shard or index.
7) When a search is required, search across all the indexes.
8) Solr supports all the complex searches, so you don't have to build your own search engine.
9) It also does paging for you.
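For steps 7 to 9, a search request to Solr is an HTTP call to the `select` handler. The sketch below only builds the request URL, so it runs without a Solr server. `q`, `df`, `fq`, `hl`, `hl.fl`, `start`, `rows` and `wt` are standard Solr query parameters; the field names `content`, `sender` and `sent_date`, the helper `build_search_url` and any core name in the base URL are assumptions for illustration. Keywords are passed through unescaped, so this assumes they contain no Solr special characters.

```python
from urllib.parse import urlencode


def build_search_url(base_url, keywords, sender=None, date_from=None, date_to=None,
                     start=0, rows=20):
    """Build a Solr select URL for a keyword search with optional filters."""
    params = [
        ("q", " AND ".join(keywords) if keywords else "*:*"),
        ("df", "content"),        # default field to search (hypothetical name)
        ("hl", "true"),           # highlighting shows which keywords matched
        ("hl.fl", "content"),
        ("start", str(start)),    # paging: offset of the first result
        ("rows", str(rows)),      # paging: page size
        ("wt", "json"),
    ]
    # Filter queries narrow the result set without affecting relevance scoring.
    if sender:
        params.append(("fq", 'sender:"%s"' % sender))
    if date_from or date_to:
        params.append(("fq", "sent_date:[%s TO %s]" % (date_from or "*", date_to or "*")))
    return base_url.rstrip("/") + "/select?" + urlencode(params)
```

The highlighting section of the response is what would give the user the "why is this document in the results" information the question asks for.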
We've done exactly this for some of our clients by using Solr as a "secondary indexer" to HBase. Updates to HBase are sent to Solr, and you can query against it. Typically folks start with HBase, and then graft search on. Sounds like you know from the get go that search is what you want, so you can probably embed the secondary indexing in from your pipeline that feeds HBase.
You may find though that just using Solr does everything you need.
Another project to look at is Lily, http://www.lilyproject.org/lily/index.html, which has already done the work of integrating Solr with a distributed database.
Also, I do not see why you would not want to use a browser for this application. You are describing exactly what faceted search is. While you certainly could set up a desktop app that communicates with the server (parses JSON) and displays the results in a thick client GUI, all of this work is already done for you in the browser. And, Solr comes with a free faceted search system out of the box: just follow along the tutorial.
Going with Solr (http://lucene.apache.org/solr) is a good solution, but be ready to have to deal with some non-obvious things. First is planning your indexes properly. Multiple terabytes of data will almost definitely need multiple shards on Solr for any level of reasonable performance and you'll be in charge of managing those yourself. It does provide distributed search (doing the queries off multiple shards), but that is only half the battle.
ElasticSearch (http://www.elasticsearch.org/) is another popular alternative, but I don't have much experience with it regarding scale. It uses the same Lucene engine, so I'd expect the search feature set to be similar.
Another type of solution is something like SenseiDB - open sourced from LinkedIn - which gives the full-text search functionality (also Lucene-based) as well as proven scale for large amounts of data:
http://senseidb.com
They've definitely done a lot of work on search over there and my casual use of it is pretty promising.
Assuming all your data is already in Hadoop, you could write some custom MR jobs that pull the data in a consistent schema-friendly format into SenseiDB. SenseiDB already provides a Hadoop MR indexer which you can look at.
The only caveat is that it is a little more complex to set up, but it will save you from the scaling issues many times over, especially around indexing performance and faceting functionality. It also provides clustering support if HA is important to you, which is still in alpha for Solr (Solr 4.x is alpha atm).
Hope that helps and good luck!
Update:
I asked a friend who is more versed in ElasticSearch than me and it does have the advantage of clustering and rebalancing based on the # of machines and shards you have. This is a definite win over Solr - especially if you're dealing with TBs of data. The only downside is the current state of documentation on ElasticSearch leaves a lot to be desired.
As a side note, you can't really say the documents are stored in Hadoop; they are stored in a distributed file system (most probably HDFS, since you mentioned Hadoop).
Regarding searching/indexing: Lucene is the tool to use for your scenario. You can use it for both indexing and searching. It's a Java library. There is also an associated project, Solr, which lets you access the indexing/searching system through web services. So you should also take a look at Solr, as it handles different types of documents (Lucene puts the responsibility of interpreting the document (PDF, Word, etc.) on your shoulders, but you can probably already do that).

Post processing of pages crawled using nutch

I have a set of pages crawled using Nutch, and I understand that these crawled pages are saved as segments. I want to extract certain key values from these pages and feed them to Solr as XML.
A sample situation: I have crawled a shopping website with many product listings. I want to extract key info like the Name, Price and Specs of each product and ignore the rest of the data, so that I can provide Solr with some XML like
qwerty123qwerty
This is so that, using Solr, I can sort the different product listings by price.
Now, how can this extraction part be done? Does MapReduce come into the picture anywhere?
Turning raw web pages into information is not a trivial task. One tool used for this job is Boilerpipe. However, it won't give you a solution on a plate.
If you are working on a fixed target, you might just write your own procedural code to find the data you need. If you need to find this sort of thing in arbitrary HTML, you are facing a very hard problem with no off-the-shelf solutions.
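For the fixed-target case, "your own procedural code" might look like the sketch below. Everything about the page layout is an assumption: it supposes each product sits in an element with class `product`, with child elements carrying the classes `name`, `price` and `specs`. The output uses Solr's XML update format (`<add><doc><field name="...">`); the field names are hypothetical and would have to match your Solr schema.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Collects name/price/specs text from a page with a known, fixed layout."""

    FIELDS = {"name", "price", "specs"}

    def __init__(self):
        super().__init__()
        self.products = []
        self._current_field = None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "product" in classes:
            self.products.append({})  # start a new product record
        for field in self.FIELDS:
            if field in classes and self.products:
                self._current_field = field

    def handle_endtag(self, tag):
        self._current_field = None

    def handle_data(self, data):
        if self._current_field and data.strip():
            product = self.products[-1]
            product[self._current_field] = product.get(self._current_field, "") + data.strip()


def to_solr_add_xml(products):
    """Serialize product dicts into a Solr <add> update message."""
    add = ET.Element("add")
    for product in products:
        doc = ET.SubElement(add, "doc")
        for name in ("name", "price", "specs"):
            if name in product:
                field = ET.SubElement(doc, "field", {"name": name})
                field.text = product[name]
    return ET.tostring(add, encoding="unicode")
```

Feeding a page to `ProductParser` and passing `parser.products` to `to_solr_add_xml` yields a string that could be posted to Solr's update handler. A real schema would normally also require a unique key field, which this sketch leaves out.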

Resources