Crawler/parser for Xapian - solr

I would like to implement a search engine that crawls a set of web sites, extracts specific information from the pages, and creates a full-text index of that information.
It seems to me that Xapian could be a good choice for the search engine library.
What are the options for a crawler/parser to integrate with Xapian?
Would Solr be a better choice than Xapian to integrate with open source crawlers/parsers?

Here's a little comparison between Xapian and Solr.
But if you want to build a crawler, take a look at Nutch. It's extensible with plugins, so you could write a plugin that analyzes the information that you're looking for.
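To make the extract-then-index step concrete, here is a minimal sketch in plain Python. It is not Nutch, Xapian, or Solr code: the `TitleExtractor` class, the choice of `<h1>` as the "specific information", and the sample pages are all assumptions made for illustration. A real setup would hand the extracted text to the search library instead of the toy inverted index used here.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects text inside <h1> elements (a stand-in for the 'specific information')."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        # Keep only non-empty text seen while inside an <h1>.
        if self._in_h1 and data.strip():
            self.chunks.append(data.strip())


def extract(html):
    """Return the extracted text of one page as a single string."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def build_index(pages):
    """Map each lowercased word to the set of URLs whose extracted text contains it."""
    index = defaultdict(set)
    for url, html in pages.items():
        for word in extract(html).lower().split():
            index[word].add(url)
    return index


# Hypothetical, already-crawled pages (URL -> HTML).
pages = {
    "http://example.com/a": "<html><h1>Xapian search</h1><p>ignored</p></html>",
    "http://example.com/b": "<html><h1>Solr search</h1><p>ignored</p></html>",
}
index = build_index(pages)
```

Only the text inside `<h1>` reaches the index, so the paragraph content is never searchable. That is the same effect a crawler plugin would aim for when it filters pages down to the fields of interest.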

Flax may provide some of what you're looking for.

Related

How to do FTS within Google Cloud Platform

Does Google Cloud Platform have a product to do full-text search via an API with non-web data (such as JSON or XML documents)? This may seem like a pretty silly question, but the only options I have come across are:
Search inside of Google App Engine (only available for Python 2, not Python 3) -- https://cloud.google.com/appengine/training/fts_intro/.
Related to web search only: https://developers.google.com/custom-search/docs/tutorial/introduction
Using a managed Elasticsearch: https://console.cloud.google.com/marketplace/details/google/elasticsearch.
Cloud Firestore explicitly states it doesn't offer that and suggests using Algolia (and gives details on integrating): https://cloud.google.com/firestore/docs/solutions/search
Is there something I'm missing? I'm basically looking to index and search about a million documents in a sort of free-form type of search. Is this offered as a product from Google outside of App Engine? If so, how can I access it?
You have pretty much covered it there. There is currently no specific Google service for full-text search. As you mentioned, the App Engine Search API is available for Python 2.7 (which will stop being maintained after January 2020) but not for Python 3.
There is one more option you could consider, which is using Lucene for GAE. I found this blog where several possibilities are studied; it could be interesting reading for you.
To sum up, I would recommend Elasticsearch or Algolia, but for the latter you need a Firebase project.

How to integrate search engine of a wiki with a search engine of a QA system?

In a knowledge management project I need to use a wiki engine (I'm thinking about DokuWiki) and a question-and-answer system (I'm thinking about Question2Answer), and I need to create a search functionality that searches both systems (wiki and QA) and returns what has been found (like a Google for the two systems at the same time).
Can anyone point me in the right direction for doing this properly?
I don't know about the wiki side, but you can write your own search plugin for Question2Answer that lets you integrate your own search results into those of Q2A. See the docs for more info.
If your wiki software has an API or functions you can call to find search results, then in Q2A you can use the process_search method in your plugin, call your wiki search function and add the results to the Q2A results.
Having said all that, you might find it easier to integrate Google Custom Search Engine into your site!
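The plugin idea above boils down to querying both systems and merging the two result lists. The sketch below shows only that merge step, in Python rather than Q2A's PHP, and it is not Q2A or DokuWiki code: the result dictionaries, their field names, and the per-source score normalization are assumptions for illustration. Scores from two different engines are generally not comparable, so each list is scaled by its own top score before sorting.

```python
def normalize(results):
    """Scale scores within one source to the 0..1 range using the top score."""
    top = max((r["score"] for r in results), default=0)
    if top <= 0:
        return [dict(r, score=0.0) for r in results]
    return [dict(r, score=r["score"] / top) for r in results]


def merge_results(*result_lists, limit=10):
    """Merge results from several sources, best normalized score first, without duplicate URLs."""
    combined = [r for results in result_lists for r in normalize(results)]
    combined.sort(key=lambda r: r["score"], reverse=True)

    seen = set()
    merged = []
    for r in combined:
        if r["url"] not in seen:
            seen.add(r["url"])
            merged.append(r)
    return merged[:limit]


# Hypothetical results, as if returned by the wiki search and the Q2A search.
wiki_results = [
    {"title": "Install guide", "url": "/wiki/install", "score": 8.0},
    {"title": "FAQ", "url": "/wiki/faq", "score": 2.0},
]
qa_results = [
    {"title": "How do I install?", "url": "/qa/12", "score": 0.9},
    {"title": "How do I upgrade?", "url": "/qa/15", "score": 0.45},
]

merged = merge_results(wiki_results, qa_results)
```

Normalizing by the top score is a crude choice. If one system returns only weak matches, its best hit still ranks at the top, so a shared index over both systems may give better ranking than merging after the fact.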

Choose Lucene or Solr

We need to integrate a search engine into our platform, a catalog management application in SharePoint. The information is stored in multiple databases and in a file store (doc, ppt, pdf, ...). Our dev platform is ASP.NET, and we have done some preliminary work with Lucene and found it to be good. However, we just learned about Solr.
We need to continue using Lucene, but we have to justify that choice over Solr.
Any help is appreciated.
And sorry for my English.
Lucene is a full-text search library used to provide search functionalities to an application. It can't be used as an application by itself. Solr is a complete search engine built around Lucene providing its search functionalities and others. Solr is a web application that can be used by itself without any development around it.
If you need a search engine that your application can call, I recommend using Solr.
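The practical difference is that Lucene is called in-process as a library, while Solr is reached over HTTP. As a rough illustration of the Solr side, the sketch below only builds a URL for Solr's standard `select` handler. The host, the default port 8983, the core name `catalog`, and the field name `title` are assumptions; nothing is sent over the network.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, core, query, rows=10):
    """Build a query URL for Solr's /select handler, asking for a JSON response."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/solr/{core}/select?{params}"


# Hypothetical core and field names.
url = solr_select_url("http://localhost:8983", "catalog", "title:pdf")
```

Any platform that can issue an HTTP GET, including ASP.NET, can run the same query against that URL. That is why Solr can be used without writing code around Lucene.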

How to make a search engine with Nutch and Cassandra?

I am trying to implement a website search engine in Java as an applet. I have used Nutch as the web crawler and Cassandra as my database. I have to use a NoSQL database (because my teacher wants me to). My question is: what should I do next to complete my search engine?
I have googled a lot, but most of the sites are about Nutch and Solr, and they build search engines by integrating these two, because Solr itself is somewhat like a database. I don't know what I should do. Do I have to use Solr too to complete my search engine? Is it wise to use two databases (Solr and Cassandra)? Or should I do something else?
Please remember that I have to use Cassandra.
And please first explain whether I have misunderstood things before giving me a downvote. :D
I will be really thankful for your help; I have gotten somewhat confused.
By the way, does Solr count as a NoSQL database? Excuse me, I am new to all of these.
Check out Solr's Data Import Handler and see if you feel it would work. It allows you to query your database and store the results in Solr, which Solr can then manipulate. Nutch also has very good integration with Solr should you choose to use it.
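If the Data Import Handler does not fit, another route is to read rows from the database yourself and push them to Solr's JSON update endpoint. The sketch below shows only that push step and is not the Data Import Handler. The rows are hard-coded stand-ins for data read from Cassandra, and the core name `pages` and the field names `id`, `title`, and `content` are assumptions that would have to match your Solr schema. The request is built but not sent.

```python
import json
from urllib.request import Request


def build_update_request(base_url, core, rows):
    """Build a POST request that adds the given rows to a Solr core as JSON documents."""
    # Field names here are assumptions; they must match the core's schema.
    docs = [
        {"id": row["url"], "title": row["title"], "content": row["content"]}
        for row in rows
    ]
    return Request(
        f"{base_url}/solr/{core}/update?commit=true",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical rows, as if fetched from the crawl data stored in Cassandra.
rows = [
    {"url": "http://example.com/", "title": "Home", "content": "Welcome to the site"},
]
request = build_update_request("http://localhost:8983", "pages", rows)
# urllib.request.urlopen(request) would send it to a running Solr instance.
```

In this arrangement Cassandra remains the system of record and Solr holds only a searchable copy. That is a common way to run the two side by side.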

Search Engine that can use SKOS?

I am currently working on a project where we want to take SKOS and plug it into a search engine to make the search results better. An example of this would be something like Semaphore Smartlogic (closed, not free, too big to partner with).
Searchblox is a very good, free, configurable Lucene/Solr search engine, but it does not have SKOS abilities and is not open source.
Constellio is similar to Searchblox (not quite as good), and claims to be working on accepting SKOS, but I can't get it to function properly.
Before I go and build this: Does anyone know of an existing free search engine that has the ability to accept SKOS? Or does anyone know of an open source Lucene/Solr search engine like Searchblox that I could add this functionality to quickly?
You know Solr is a search engine on its own? Check http://wiki.apache.org/solr/ for more info.
A Google search led me to http://code.google.com/p/lucene-skos/wiki/HowTo
Not the most active project, but I guess a good start.
It shouldn't be too hard to combine the two into the solution you need.
I am not sure if SIREn supports SKOS, but it is a semantic Lucene plugin that may be worth checking out.
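One common way SKOS can improve results is query expansion: a term that matches any label of a concept is expanded to all of that concept's labels. The sketch below is a plain-Python illustration of the idea and does not use lucene-skos or SIREn. The concept data is made up, a real thesaurus would be loaded from a SKOS file, and the sketch assumes single-word labels to keep tokenization trivial.

```python
def build_label_map(concepts):
    """Map every label (preferred or alternative) to all labels of its concept."""
    label_map = {}
    for concept in concepts:
        labels = [concept["prefLabel"], *concept["altLabels"]]
        for label in labels:
            label_map[label.lower()] = labels
    return label_map


def expand_query(query, label_map):
    """Replace each known term with an OR group of its concept's labels."""
    parts = []
    for term in query.lower().split():
        labels = label_map.get(term, [term])
        if len(labels) > 1:
            parts.append("(" + " OR ".join(labels) + ")")
        else:
            parts.append(labels[0])
    return " ".join(parts)


# Hypothetical concepts mirroring skos:prefLabel and skos:altLabel.
concepts = [
    {"prefLabel": "automobile", "altLabels": ["car", "motorcar"]},
]
label_map = build_label_map(concepts)
expanded = expand_query("car rental", label_map)
# expanded == "(automobile OR car OR motorcar) rental"
```

The expanded string uses the `OR` grouping syntax that Lucene and Solr query parsers accept, so it could be passed to either as the query text. Expanding at index time instead is the other usual design, trading a larger index for simpler queries.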
