I need to do some transformations on docs before indexing them in Solr, but the texts come from various sources and it's difficult to do the transformations beforehand because I would have to adapt several programs to parse the files. I'm thinking of indexing them in Solr first, extracting the text fields, doing the transformations, and then reindexing.
I tried:
curl 'http://localhost:8983/solr/collection1/select?q=*&rows=20000&wt=xml&indent=true'
but the output is a results XML file, while I'm looking for some way to extract the docs with their fields in the same format used for posting. Is this possible? How should I do it?
Thanks
I would recommend using one of the Solr Clients listed on the Integrating Solr page. This will allow you to use your programming language of choice to extract and transform the Solr documents and then reload them into the index.
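For example, with SolrJ (Solr's Java client, 6.x-style API here) the extract/transform/reload loop might look roughly like the sketch below. The "text" field, the core name, and the transformation itself are placeholders, and keep in mind that you can only get back fields that are stored:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class ReindexWithTransform {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/collection1").build();

        SolrQuery q = new SolrQuery("*:*");
        q.setRows(20000);                                   // or page through with start/rows or cursorMark
        QueryResponse rsp = solr.query(q);

        for (SolrDocument d : rsp.getResults()) {
            SolrInputDocument in = new SolrInputDocument();
            for (String f : d.getFieldNames()) {
                if (!"_version_".equals(f)) {               // don't copy the internal version field
                    in.addField(f, d.getFieldValue(f));
                }
            }
            Object text = d.getFieldValue("text");          // hypothetical field to transform
            if (text != null) {
                in.setField("text", transform(text.toString()));
            }
            solr.add(in);                                   // re-adding with the same id replaces the old doc
        }
        solr.commit();
        solr.close();
    }

    private static String transform(String s) {
        return s.trim().toLowerCase();                      // stand-in for the real transformation
    }
}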
Related
I want to use Solr to index a library of books in fb2 format.
In fact, fb2 is just XML with its own XSD schema.
But post.jar ignores *.fb2 files, and I don't understand how to map values in an fb2 file to index fields, e.g. how to map:
<book-title>some book</book-title>
...to "book-title" field in index.
Should I create a plug-in, or something else?
You should look at the Solr Data Import Handler (DIH).
https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler
In the Solr examples folder you have an RSS import example. If you look in the rss-data-config.xml file you will see how they use the XPathEntityProcessor to map from XML to the Solr fields.
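Adapted to your fb2 case, such a mapping might look roughly like this; the xpaths and column names are illustrative and need to match your fb2 layout and your schema.xml, and for a whole directory of files you would typically wrap this in a FileListEntityProcessor:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="book"
            processor="XPathEntityProcessor"
            url="/path/to/books/mybook.fb2"
            forEach="/FictionBook"
            stream="true">
      <!-- hypothetical mappings: adjust the xpaths to your fb2 layout and
           the column names to the fields declared in schema.xml -->
      <field column="book-title" xpath="/FictionBook/description/title-info/book-title"/>
      <field column="author"     xpath="/FictionBook/description/title-info/author/last-name"/>
    </entity>
  </document>
</dataConfig>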
Here is some more information: http://www.andornot.com/blog/post/Sample-Solr-DataImportHandler-for-XML-Files.aspx
I have also written Tika parsers in the past to work with specific file formats.
https://lucidworks.com/blog/2010/06/18/extending-apache-tika-capabilities/
For more flexibility you can just read your files using your favorite programming language and send the data to Solr using an API. We had to do this for a recent application as the DIH wasn't flexible enough for what we wanted to achieve.
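If you go that route, a bare-bones sketch in Java, using SolrJ and the JDK's XML parser, might look like this; the core name, field names, and file name are placeholders:

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.w3c.dom.Document;

public class Fb2Indexer {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/books").build();   // core name is a placeholder

        Document xml = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("mybook.fb2"));                // fb2 is plain XML, so any XML parser works

        // Grab the first <book-title> element, wherever it sits in the file.
        String title = xml.getElementsByTagName("book-title")
                .item(0).getTextContent();

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "mybook.fb2");                      // field names must exist in schema.xml
        doc.addField("book-title", title);

        solr.add(doc);
        solr.commit();
        solr.close();
    }
}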
Hey, so I started researching Solr and have a couple of questions about how it works. I know the schema defines what is stored and indexed in the Solr application, but I'm confused as to how Solr knows that "content" is the content of the site or that the url is the url.
My main goal is to extract phone numbers from websites, and I want Solr to nicely spit out 1234567890.
You need to define this in Solr's schema.xml by declaring all the fields and their field types. You can then query Solr on any of those fields.
Refer to this: http://wiki.apache.org/solr/SchemaXml
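For example, in schema.xml you might declare something like this; the field names and types are illustrative ("string" and "text_general" are types shipped with the example schema):

<field name="url"     type="string"       indexed="true" stored="true"/>
<field name="content" type="text_general" indexed="true" stored="true"/>
<field name="phone"   type="string"       indexed="true" stored="true" multiValued="true"/>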
Solr will not automatically index content from a website. You need to tell it how to index your content. Solr only knows the content you tell it to know. Extracting phone numbers sounds pretty simple so writing an update script or finding one online should not be an issue. Good luck!
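For instance, the extraction step itself could be as small as a regex pass over the crawled text before you send the document to Solr. A rough sketch, matching North-American-style numbers only:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhoneExtractor {
    // Rough pattern; real-world phone numbers need something stricter.
    private static final Pattern PHONE =
            Pattern.compile("\\(?\\d{3}\\)?[-.\\s]?\\d{3}[-.\\s]?\\d{4}");

    /** Returns the digits-only form of every phone-looking string in the text. */
    public static List<String> extract(String text) {
        List<String> numbers = new ArrayList<>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            numbers.add(m.group().replaceAll("\\D", ""));   // e.g. "1234567890"
        }
        return numbers;
    }

    public static void main(String[] args) {
        System.out.println(extract("Call us at (123) 456-7890 or 987.654.3210"));
        // prints [1234567890, 9876543210]
    }
}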
I am new to Solr and have a couple of questions for more experienced people:
I am able to get the example running, but what exactly is start.jar?
I know that by running "java -jar start.jar" I can start Solr. But do I run this command after I index my own data, rather than the given sample data? If not, what should I do to run my own Solr instance with my own indexed data?
I need to index my own data, not related to the given Solr example at all. How exactly should I do it? Should I copy the example directory and then modify the fields in schema.xml? Should I then run post.sh accordingly to index the data, like I did to set up the example?
Thanks a lot for your help!
Steps:
Decide what the document structure you store in Solr will be (somewhat like creating the schema of a relational DB for one table).
Remove the example core and create your own core with that schema.
Once the schema works with no errors (check the logs of the server that hosts the Solr app), you can start feeding your data into Solr. You POST it via HTTP in a specific structure, which is documented in the Solr wiki; see the example below. Various frameworks have classes to handle that.
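For example, with the JSON update format it can be as simple as the following; the core name "mycore" and the "title" field are placeholders and must match your schema:

curl 'http://localhost:8983/solr/mycore/update?commit=true' \
     -H 'Content-Type: application/json' \
     -d '[{"id": "1", "title": "My first document"}]'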
Marked as Wiki as this is too broad an answer for someone who did not bother to RTFM...
Custom indexing is not a difficult task; I worked on it just a few days ago. First you need to write your document in XML, CSV, or JSON (formats supported by Solr), containing fields according to your schema.xml, then run the following command in example/exampledocs.
For a document mydoc.xml:
./post.sh mydoc.xml
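where mydoc.xml, in Solr's XML update format, might look something like this (the field names are illustrative and must exist in your schema.xml):

<add>
  <doc>
    <field name="id">doc1</field>
    <field name="title">My first document</field>
  </doc>
</add>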
If the status value in the output is 0, the indexing was successful and you can search your document in Solr.
Reference: http://www.solrtutorial.com/solr-in-5-minutes.html
Though the question is old, I am writing for new visitors with the same issue. The question can't be answered in a few words. You must understand what Solr is, what the Solr Admin UI is, and why we need Solr instead of a relational database. Then you can understand how to import sample data. I have recently published two articles, Solr Introduction and Importing Sample Data, which might be helpful for you:
http://www.devtrainings.com/2017/03/apache-solr-introduction-and-server.html
http://www.devtrainings.com/2017/03/apache-solr-index-data-and-run-search.html
I am comparing Lucene/Solr, Whoosh, Sphinx and Xapian for searching documents in DOC, DOCX, HTML and PDF. Only Solr is documented to have a document parser (Tika) which directly indexes documents. So it seems a clear winner.
But to level the playing field, I'd like to consider the alternatives. Do the others have direct document indexing (which I may have missed)? If not, can it be implemented easily? Or is Solr the overwhelming choice?
With Sphinx you're able to convert files using a PHP script through the xmlpipe_command option. Since PHP has a Tika wrapper, writing the script and the setup itself aren't hard.
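For reference, whatever the script does internally, its output has to follow Sphinx's xmlpipe2 format, roughly like this (the field and attribute names here are just an illustration):

<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
  <sphinx:schema>
    <sphinx:field name="content"/>
    <sphinx:attr name="filename" type="string"/>
  </sphinx:schema>
  <sphinx:document id="1">
    <filename>report.pdf</filename>
    <content>text extracted from the file by Tika</content>
  </sphinx:document>
</sphinx:docset>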
I'm looking into a search solution that will identify strings (company names) and use these strings for search and facets in Solr.
I'm new to Nutch and Solr, so I wonder whether this is best done in Nutch or in Solr. One solution would be to write a parser in Nutch that identifies the strings in question and then indexes the name of the company, later mapped to a Solr field. I'm not sure how, but I guess this could also be done inside Solr directly from the text?
Does it make sense to do this string identification in Nutch or in Solr, and is there some functionality in either that could help me here?
Thanks.
You could embed a NER library (see OpenNLP, LingPipe, GATE) into a custom parser, generate new fields, and create an indexing filter accordingly. This is not particularly difficult, and the advantage compared to doing this on the Solr side is that you'd gain from the scalability of MapReduce (NLP tasks are often CPU-hungry).
See Behemoth for an example of how to embed GATE in MapReduce.
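As a rough idea of what the NER step itself looks like with OpenNLP (the model file and sample text are placeholders, and the Nutch parser/indexing-filter plumbing around it is not shown):

import java.io.FileInputStream;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.util.Span;

public class CompanyNameFinder {
    public static void main(String[] args) throws Exception {
        // "en-ner-organization.bin" is one of the pre-trained OpenNLP models;
        // for company names you would usually train your own model or add gazetteers.
        TokenNameFinderModel model =
                new TokenNameFinderModel(new FileInputStream("en-ner-organization.bin"));
        NameFinderME finder = new NameFinderME(model);

        String[] tokens = SimpleTokenizer.INSTANCE.tokenize(
                "Apache Solr is developed by the Apache Software Foundation.");
        Span[] spans = finder.find(tokens);

        for (String name : Span.spansToStrings(spans, tokens)) {
            System.out.println(name);   // candidate organization names
        }
    }
}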
Nutch works with Solr by indexing the crawled data into Solr via the Solr HTTP API. You trigger the indexing by calling the solrindex command. See this page for details on how to set this up.
To be able to extract the company names, I would add the necessary code in Solr. I would use an UpdateRequestProcessor. It allows you to add an extra step in the indexing process that adds extra fields to the document being indexed. Your UpdateRequestProcessor would examine the document sent to Solr by Nutch, extract the company names from the text, and add them as new fields in the document. Solr would then index the document plus the fields that you added.
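For illustration, a skeleton of such a processor might look like the sketch below; the "content" and "company" field names are assumptions, and the factory still has to be registered in an updateRequestProcessorChain in solrconfig.xml:

import java.io.IOException;
import java.util.Collections;
import java.util.List;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class CompanyExtractorProcessorFactory extends UpdateRequestProcessorFactory {

    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                              SolrQueryResponse rsp,
                                              UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                SolrInputDocument doc = cmd.getSolrInputDocument();
                Object content = doc.getFieldValue("content");    // field filled by Nutch, adjust as needed
                if (content != null) {
                    for (String company : extractCompanies(content.toString())) {
                        doc.addField("company", company);          // assumed multiValued field in schema.xml
                    }
                }
                super.processAdd(cmd);                             // hand the doc to the rest of the chain
            }
        };
    }

    // Placeholder for the real extraction logic (dictionary lookup, NER, regex, ...).
    private static List<String> extractCompanies(String text) {
        return Collections.emptyList();
    }
}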