How WCS's SolR can index Magnolia's content - solr

I'm trying to find a way to index Magnolia's content with a Solr from an other CMS.
(It's not realy a CMS but it's like. I'm talking about websphere commece 7 FEP5).
I've a Solr engine, plugged in a DB2 database, and already configured to index the content of that database.
Now, I have to plug a Magnolia CMS, (with is own database). And one of the requirement is to be able to display search results (provided from my Solr), and theses search result must include Magnolia's content.
Does anybody have any ideas to do that?
Using a second Solr system (the one into Magnolia) is not really acceptable.
Any ideas will be appreciated.
Regards,
Dekx.

This sounds like you should be able to do what you want using the WCS unstructured content crawler.
Start here:
http://pic.dhe.ibm.com/infocenter/wchelp/v7r0m0/topic/com.ibm.commerce.developer.doc/concepts/csdmanagesearchsitecontent.htm

Related

how to make a search engine with nutch and cassandra?

I am tring to implement a website search engine with java as an applet,I have used nutch as web crawler and cassandra as my database,I have to use a nosql database(because my teacher wants me to do),now my question is what should I do next to complete my search engine?
I have googled a lot,but all of the sites are mostly about nutch and solr,and they build search engines with integration of these two,cause solr itself is somehow a database,I don't know what should I do,do I have to use solr too to complete my search engine?is it wise to use two databases(solr and cassandra)?or I should do some thing else?
please remember I have to use cassandra.
and please first explain me if I have understood things in a wrong way and then give me a minus mark,:D
I will be really really thankfull for your help,I have got somehow confused.
by the way does solr counted as a nosql database?excuse me,I am new to them all.
Check out Solr's Data Import Handler and see if you feel it would work. It allows you to query your database and store the results with Solr to which then Solr can manipulate the reuslts. Nutch also has very good integration with Solr should you choose to use it.

Developing an web directory search engine for enterprises information, what's better? use a database or files?

I want to develop an web app for storing enterprises' information, so this info can be searched by keywords as by category, but principally by keywords, because the interface it's going to be as simple a Google. The doubt I have is, is it better to store this info in a database or in text files?
If you want full text search, probably neither. You should look into a search index such as Elasticsearch (http://www.elasticsearch.org/overview/). A search index stores data in a way that is optimized for searching.

Solr's schema and how it works

Hey so I started researching about Solr and have a couple of questions on how Solr works. I know the schema defines what is stored and indexed in the Solr application. But I'm confuse as to how Solr knows that the "content" is the content of the site or that the url is the url?
My main goal is I'm trying to extract phone numbers from websites and I want Solr to nicely spit out 1234567890.
You need to define it in Solr schema.xml by declaring all the fields and its field type. You can then query Solr for any field to search.
Refer this: http://wiki.apache.org/solr/SchemaXml
Solr will not automatically index content from a website. You need to tell it how to index your content. Solr only knows the content you tell it to know. Extracting phone numbers sounds pretty simple so writing an update script or finding one online should not be an issue. Good luck!

Index my own data in Solr

I am new to Solr and have a couple of questions to ask help from more experienced people:
I am able to get example running, however what is exactly the start.jar?
I know by running "java -jar start.jar", i can start solr. But do i run this command after i index my own data, not the given sample data? if not, what should i do to run my own solr instance with my own indexed data?
I do need to index my own sample data, not related to the given example solr thing at all. How exactly should i do it? Should i copy the example directory then modify the fields in sechema.xml? should i then run the post.sh accordingly to index the data like what i did to set up the example solr?
Thanks a lot for your help!
Steps:
Decide what will be the document structure u store in SOLR. (Somewhat like creating the schema of a relational DB for one table).
remove the example core and create your own core with that schema
once the schema works with no errors (you check the server logs that hosts the SOLR app) You can start feed the data you have into SOLR. You POST it via HTTP in a specific structure which is documented in the SOLR Wiki. Various frameworks have some classes to handle that.
Marked as Wiki as this is too broad an answer for someone who did not bother to RTFM...
Dear custom indexing is not a difficult task as I have worked on it just a few days ago. First you need to write your documnet is xml,csv or json( format supported in solr) containing fields according to your schema.xml, then run following command in example/exampledocs
For a document mydoc.xml
./post.sh mydoc.xml
if in output, status value is 0 then indexing is successful and you can search your document in solr
Reference:http://www.solrtutorial.com/solr-in-5-minutes.html
Though the question is old, but I am writing for new visitors with same issue. The question can't be answered in few words. You must understand what Solr is, whats Solr Admin UI, why we need Solr instead a relational database. Then you can understand how to import sample data. I have recently published two articles i.e. Solr Introduction and Importing Sample Data, these might be helpful for you.
http://www.devtrainings.com/2017/03/apache-solr-introduction-and-server.html
http://www.devtrainings.com/2017/03/apache-solr-index-data-and-run-search.html

updating Solr from Lucene Index

I'm currently working on a web archiving project. Basically, what we try to do is archive a collection of websites (using heritrix crawler) and provide access to the archived contents through a web interface.
We also offer full-text search throughout the archives. Currently, the index is generated using nutchwax (a customised version of apache Nutch, tailored to index .warc files, as generated by heritrix). Nutchwax dumps out a Lucene index and for using it in Solr, all that has to be done is to generate a correct schema.
This is all done and its running like it should, however the archive is not static and there are new .warc files generated periodically.
What I can do now, is to generate a new index, merge it with the existing one and import it back into Solr. However, to do that Solr has to be restarted.
It would be great if the index could be updated "on the fly" as this is usually the case (when updating the index via http requests)
Does anyone have an idea, how this can be done? My first shot at this was generating .xml files out of the Lucene index file and posting them to Solr. Is this worth a try or are there more elegant solutions?
You could probably leverage the use of multiple cores to accomplish what you need. See the Solr Wiki - CoreAdmin for more details. I think you could leverage the MergeIndexes capability or the ability to Swap cores for a better experience in your scenario.

Resources