I'm currently working on a web archiving project. Basically, what we try to do is archive a collection of websites (using heritrix crawler) and provide access to the archived contents through a web interface.
We also offer full-text search throughout the archives. Currently, the index is generated using nutchwax (a customised version of apache Nutch, tailored to index .warc files, as generated by heritrix). Nutchwax dumps out a Lucene index and for using it in Solr, all that has to be done is to generate a correct schema.
This is all done and its running like it should, however the archive is not static and there are new .warc files generated periodically.
What I can do now, is to generate a new index, merge it with the existing one and import it back into Solr. However, to do that Solr has to be restarted.
It would be great if the index could be updated "on the fly" as this is usually the case (when updating the index via http requests)
Does anyone have an idea, how this can be done? My first shot at this was generating .xml files out of the Lucene index file and posting them to Solr. Is this worth a try or are there more elegant solutions?
You could probably leverage the use of multiple cores to accomplish what you need. See the Solr Wiki - CoreAdmin for more details. I think you could leverage the MergeIndexes capability or the ability to Swap cores for a better experience in your scenario.
Related
I am working on implementing a research web application or portal that integrates different research portal or website using an open source platform called search kit. The web application will act as a central point of access to research publications on different research portals. To do this, I also need to implement a third party system that does the following:
Searches for documents based on user query on the other different research portals and presents or displays the results to the users on my web application.
Index the documents
Should be used by system administrators to configure the web application. Whereby system administrators can add,remove or modify the URL of the website Solr is pulling documents from
Displays the results to the user in one standard format.
My question is, can apache solr be used to implement the third party system? if not, what open source platform or way would you recommend I used to implement the third party system?
In general, Solr seems like a good fit here, but you might need some custom code (apart from configuration) here and there. To go through the points:
Querying is one of the main features of Solr, so this is definitely possible.
Indexing is handled by Solr.
There was a component for Solr called "Data Import Handler" that supported indexing from URLs (see the docs). However, this was removed from the main Solr distribution, and was moved to a separate package. This package doesn't seem to be actively maintained though, so you will probably run into some problems if you decide to use it. The alternative is to develop your document-pulling code yourself.
Solr can display the results in multiple formats, but it still might not support the exact format you would like it to be. In this case, you need to build your transformation based on the result from Solr.
ElasticSearch has percolator for prospective search. Does SOLR have a similar feature where you define your query upfront? If not, is there an effective way of implementing this myself on top of the existing SOLR features?
besides what BunkerMentality said, it is not hard to build your own percolator, what you need:
Are the queries you want to run easy to model on Lucene only syntax? if so you are good, if not, you need to convert them to Lucene only. Built them, and keep them in memory as Lucene queries
When a doc arrives:
build a MemoryIndex containing only that single doc
run all your queries on the index
I have done this for a system ingesting millions docs a day and it worked fine.
It's listed as an open new feature, SOLR-4587, on Solr JIRA but it doesn't seem like any work has started on it yet.
There is a link in the comments there to a separate project called Luwak that seems to implement some features similar to percolator.
If it is still relevant, you can use this
It's SOLR Update Processor that based on Luwak
I am relatively new to Apache SOlr and have recently been working with DIH, specifically the XPathEntityProcessor. I need a way to periodically index new XML files, however, it appears the delta-import command is only supported by the sqlEntityProcessor [1].
I am working with an increasingly large dataset of XML files and was hoping solr could determine new files and index them...
A potential solution that came to mind is to possibly do a full-import from a staging area consisting of documents that have not been previously index, before moving the documents to their respective permanent locations.
Is there a workaround to mimicking delte-import using XPathEntityProcessor?
What sort of approaches do people using XPathEntityProcessor use to index newer documents?
[1] http://wiki.apache.org/solr/DataImportHandler#Using_delta-import_command-1
I've resorted to using the UpdateRequestHandler; it's perfect for what I want to do.
[1] http://wiki.apache.org/solr/XsltUpdateRequestHandler
We are using Liferay (6.1.20 EE) with Solr search engine.
Now Solr indexes everything. Can we somehow set up Solr (or Liferay) to prevent one Site from being indexed?
It means all articles documents present on that Site would not be indexed and would not be present in Solr.
1) Should this be done with Solr configurations/schema filters before Index starts?
OR
2) Should it be customized in Liferay Indexer classes (with help of Hooks or EXT) to skip content being indexed.
Thanks for your thoughts and suggestions.
Regards,
Kris
You could create a custom version of the solr-web WAR file that you need to install to make the Liferay/SOLR integration work. In the WAR file you'll find SolrIndexWriterImpl. This is the place that everything passes through that will be indexed in SOLR. You could create your own custom implementation of this class that uses the information in the SearchContext parameter, that's passed into each method, to decide if something should be indexed or not.
The latest code for solr-web can be found here: http://svn.liferay.com/repos/public/plugins/trunk/webs/solr-web/
Based on this code I was also able to create a solr-web.war that works on the more recent SOLR versions instead of the ancient 1.4.1 version Liferay uses by default.
Recently I got involved in a task, and part of it require to use Apache Solr ( for Document Search) ,and Apache Tika ( to Extract the meta-text or plain text from documents)
I have n't integrated Solr and tika yet ,But I have worked with both of them individually I might have set of questions related to Apache Solr and Apache Tika , It might be at beginners level or average.
Following types of practical I did with Solr e.g. created a dummy database, wrote a program, configured - schema.xml things, ran Solr sever, and program which fetches documents from database and store in Solr Document Index , Made a Simple client to fetch data from Solr via JSON Interface, Made a Program which keeps MySQL Database to sync with Apache’s Solr document Index.
Following types of practical I did with tika e.g. compiled and Installed Tika, understood its document parsing capablities.
..
My Sample Task statement:
Part of my project require to store around 100,000 of documents (Data of these 100,000 (Doc,PDF,Txt) docs are fetched by Apache tika and pushed to MySql’s Database and later that pushed to apache Solr’s Document Database)for Full Text Search and search them those via a client interface (Browser)
In simple programmatical level this task will get done,
I would like to understand the challenges related to managing the index or something else in Solr e.g.
** In advanced level does it require optimizing the Solr’s Open Source Code?
** While Solr works in proper way, does it provide any specific challenges?
** What Key things need to consider initially so that, Solr should work in a proper way.
** Do you think any extra tool to developed to monitor Solr’s working ?
Hope you got the idea related to questions I have ?
** Also I would like to know If you have any experience of using apache Tika with apache Solr, and any challenges or key things to consider ?
Would you like to recommend and specific sources Or If you have any document or anything which you feel to be helpful.