How do I index PDF files in Apache Solr (version 8) with XML documents?
example:
<add>
<doc>
<field name="id">filePath</field>
<field name="title">the title</field>
<field name="description">description of the pdf file</field>
<field name="Creator">jhone doe</field>
<field name="Language">English</field>
<field name="Publisher">Publisher_name</field>
<field name="tags">some_tag</field>
<field name="is_published">true</field>
<field name="year">2002</field>
<field name="file">path_to_the_file/file_name.pdf</field>
</doc>
</add>
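If you generate these XML update documents from code rather than by hand, a small helper keeps the structure and escaping correct. A minimal sketch in Python's standard library (the field names mirror the example above; `build_add_xml` is an illustrative helper, not a Solr API):

```python
# Build a Solr <add><doc> XML update message from a dict of fields.
# Field names mirror the example above; values are illustrative.
import xml.etree.ElementTree as ET

def build_add_xml(fields):
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")

xml_doc = build_add_xml({
    "id": "filePath",
    "title": "the title",
    "year": 2002,
    "file": "path_to_the_file/file_name.pdf",
})
print(xml_doc)
```

Using ElementTree rather than string concatenation means special characters in titles or descriptions are escaped for you.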
UPDATE
How do I set literal.id to filePath?
OK, this is what I did.
I am using the Solr DIH (Data Import Handler).
In solrconfig.xml:
<requestHandler name="/dataimport_fromXML" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data.import.xml</str>
<str name="update.chain">dedupe</str>
</lst>
</requestHandler>
and the data.import.xml file
<dataConfig>
<dataSource type="BinFileDataSource" name="data"/>
<dataSource type="FileDataSource" name="main"/>
<document>
<!-- url: the URL of the XML file that holds the metadata -->
<entity name="rec" processor="XPathEntityProcessor" url="${solr.install.dir:}solr/solr_core_name/filestore/docs_metaData/metaData.xml" forEach="/docs/doc" dataSource="main" transformer="RegexTransformer,DateFormatTransformer">
<field column="resourcename" xpath="//resourcename" name="resourceName" />
<field column="title" xpath="//title" name="title" />
<field column="subject" xpath="//subject" name="subject"/>
<field column="description" xpath="//description" name="description"/>
<field column="comments" xpath="//comments" name="comments"/>
<field column="author" xpath="//author" name="author"/>
<field column="keywords" xpath="//keywords" name="keywords"/>
<!-- baseDir: path to the folder that contains the files (pdf | doc | docx | ...) -->
<entity name="files" dataSource="null" rootEntity="false" processor="FileListEntityProcessor" baseDir="${solr.install.dir:}solr/solr_core_name/filestore/docs_folder" fileName="${rec.resourcename}" onError="skip" recursive="false">
<field column="fileAbsolutePath" name="filePath" />
<field column="resourceName" name="resourceName" />
<field column="fileSize" name="size" />
<field column="fileLastModified" name="lastModified" />
<!-- for each file, extract metadata if it is not in the xml metadata file -->
<entity name="file" processor="TikaEntityProcessor" dataSource="data" format="text" url="${files.fileAbsolutePath}" onError="skip" recursive="false">
<field column="title" name="title" meta="true"/>
<field column="subject" name="subject" meta="true"/>
<field column="description" name="description" meta="true"/>
<field column="comments" name="comments" meta="true"/>
<field column="Author" name="author" meta="true"/>
<field column="Keywords" name="keywords" meta="true"/>
</entity>
</entity>
</entity>
</document>
</dataConfig>
After that, all you have to do is create the XML file (metaData.xml):
<docs>
<doc>
<resourcename>fileName.pdf</resourcename>
<title></title>
<subject></subject>
<description></description>
<comments></comments>
<author></author>
<keywords></keywords>
</doc>
</docs>
and put all your files in one folder:
"${solr.install.dir:}solr/solr_core_name/filestore/docs_folder"
where ${solr.install.dir:} is the Solr home folder.
For the update in the question (how to set literal.id to filePath):
in data.import.xml, map the fileAbsolutePath to the id:
<field column="fileAbsolutePath" name="id" />
One last thing: in this example the id is auto-generated. I am using
<updateRequestProcessorChain name="dedupe">
which creates a unique id based on a hash of the content, to avoid duplication.
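For reference, a dedupe chain of that kind is typically declared in solrconfig.xml with SignatureUpdateProcessorFactory. A sketch under the assumption that the hash is computed over a content field (adjust the field list to your schema):

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- the generated signature is written to this field -->
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">true</bool>
    <!-- fields used to compute the hash; an assumption, adjust to your schema -->
    <str name="fields">content</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

With overwriteDupes set to true, a re-indexed file with identical content replaces the earlier document instead of creating a duplicate.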
Related
How do I index files over FTP?
The FTP repo contains all my documents in different formats. I am able to do this task for a filesystem folder, but it doesn't work with FTP.
I have this configuration via DIH:
<dataConfig>
<dataSource type="BinFileDataSource" />
<dataSource type="BinURLDataSource" name="binSource" baseUrl="ftp://localhost:21/" onError="skip" user="solr_ftp" password="solr_ftp_pass" />
<document>
<!-- baseDir: path to the folder that contains the files (pdf | doc | docx | ...) -->
<entity name="files" dataSource="binSource" baseDir="ftp://localhost" rootEntity="false" processor="FileListEntityProcessor" fileName=".*\.(doc)|(pdf)|(docx)|(txt)|(rtf)|(html)|(htm)" onError="skip" recursive="true">
<field column="fileAbsolutePath" name="filePath" />
<field column="resourceName" name="resourceName" />
<field column="fileSize" name="size" />
<field column="fileLastModified" name="lastModified" />
<!-- tika -->
<entity name="documentImport" processor="TikaEntityProcessor" url="${files.fileAbsolutePath}" format="text">
<field column="title" name="title" meta="true"/>
<field column="subject" name="subject" meta="true"/>
<field column="description" name="description" meta="true"/>
<field column="comments" name="comments" meta="true"/>
<field column="Author" name="author" meta="true"/>
<field column="Keywords" name="keywords" meta="true"/>
<field column="category" name="category" meta="true"/>
<field column="xmpTPg:NPages" name="Page-Count" meta="true"/>
<field column="text" name="content"/>
</entity>
</entity>
</document>
</dataConfig>
Error:
failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' value: ftp://localhost is not a directory Processing Document # 1
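A likely cause of that error: FileListEntityProcessor walks a local directory, so baseDir must be a real filesystem path, not an ftp:// URL. One workaround is to mirror the FTP folder to a local directory first and point baseDir at it. A minimal sketch with Python's ftplib; the host, credentials, and paths are assumptions:

```python
# Mirror a flat FTP directory to a local folder so that DIH's
# FileListEntityProcessor can index it from a real filesystem path.
import os
from ftplib import FTP

def local_path_for(remote_name, base_dir):
    # Map a remote file name to its local destination path.
    return os.path.join(base_dir, os.path.basename(remote_name))

def mirror_ftp(host, user, password, remote_dir, base_dir):
    os.makedirs(base_dir, exist_ok=True)
    ftp = FTP(host)
    ftp.login(user, password)
    ftp.cwd(remote_dir)
    for name in ftp.nlst():
        with open(local_path_for(name, base_dir), "wb") as out:
            ftp.retrbinary("RETR " + name, out.write)
    ftp.quit()

# Usage (illustrative -- host, credentials, and paths are assumptions):
# mirror_ftp("localhost", "solr_ftp", "solr_ftp_pass", "/",
#            "solr/solr_core_name/filestore/docs_folder")
```

After mirroring, the existing BinFileDataSource/FileListEntityProcessor configuration from the first example works unchanged against the local copy.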
I've installed Solr 4.6.0 and followed the tutorial available on Solr's home page. Everything was fine, until I needed to do the real job. I have to get fast access to Wikipedia content, and I was advised to use Solr. I was trying to follow the example at http://wiki.apache.org/solr/DataImportHandler#Example:_Indexing_wikipedia, but I couldn't get it to work. I am a newbie, and I don't know what data_config.xml means!
<dataConfig>
<dataSource type="FileDataSource" encoding="UTF-8" />
<document>
<entity name="page"
processor="XPathEntityProcessor"
stream="true"
forEach="/mediawiki/page/"
url="/data/enwiki-20130102-pages-articles.xml"
transformer="RegexTransformer,DateFormatTransformer"
>
<field column="id" xpath="/mediawiki/page/id" />
<field column="title" xpath="/mediawiki/page/title" />
<field column="revision" xpath="/mediawiki/page/revision/id" />
<field column="user" xpath="/mediawiki/page/revision/contributor/username" />
<field column="userId" xpath="/mediawiki/page/revision/contributor/id" />
<field column="text" xpath="/mediawiki/page/revision/text" />
<field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
<field column="$skipDoc" regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
</entity>
</document>
</dataConfig>
I couldn't find it in the Solr home directory. I also tried some questions related to mine, How to index wikipedia files in .xml format into solr and Indexing wikipedia dump with solr, but they didn't resolve my doubt.
I think I need something more basic, guiding me step by step, because the tutorial is confusing when it deals with indexing Wikipedia.
Any advice giving some directions to follow would be nice.
For the data_config.xml:
Each Solr instance is configured using three main files: solr.xml, solrconfig.xml, and schema.xml. The data_config.xml file defines the data source when you use the DIH component; this URL may be useful for you: DIH.
About the Solr home directory:
You should start from here: https://cwiki.apache.org/confluence/display/solr/Running+Solr
Well, I've read many things on the Web and tried to collect as much information as possible. This is how I found the solution.
Here is my solrconfig.xml:
...
<!-- ****** Data import handler -->
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>
...
<lib dir="../../../dist/" regex="solr-dataimporthandler-.*\.jar" />
Here is my data-config.xml (important: it must be in the same folder as solrconfig.xml):
<dataConfig>
<dataSource type="FileDataSource" encoding="UTF-8" />
<document>
<entity name="page"
processor="XPathEntityProcessor"
stream="true"
forEach="/mediawiki/page/"
url="/Applications/solr-4.6.0/example/exampledocs/simplewikiSubSet.xml"
transformer="RegexTransformer,DateFormatTransformer"
>
<field column="id" xpath="/mediawiki/page/id" />
<field column="title" xpath="/mediawiki/page/title" />
<field column="revision" xpath="/mediawiki/page/revision/id" />
<field column="user" xpath="/mediawiki/page/revision/contributor/username" />
<field column="userId" xpath="/mediawiki/page/revision/contributor/id" />
<field column="text" xpath="/mediawiki/page/revision/text" />
<field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
<field column="$skipDoc" regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
</entity>
</document>
</dataConfig>
Attention: The last line is very important!
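That last $skipDoc line drops redirect pages before they are indexed. The regex it uses is easy to sanity-check outside Solr with a small Python sketch:

```python
# Sanity-check the $skipDoc regex from the data-config:
# it should match redirect page text and nothing else.
import re

redirect = re.compile(r"^#REDIRECT .*")

print(bool(redirect.match("#REDIRECT [[Other page]]")))  # True
print(bool(redirect.match("A normal article body")))     # False
```

Any page whose text matches sets $skipDoc to true, so redirects never reach the index.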
My schema.xml:
...
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="string" indexed="true" stored="false"/>
<field name="revision" type="int" indexed="true" stored="true"/>
<field name="user" type="string" indexed="true" stored="true"/>
<field name="userId" type="int" indexed="true" stored="true"/>
<field name="text" type="text_en" indexed="true" stored="false"/>
<field name="timestamp" type="date" indexed="true" stored="true"/>
<field name="titleText" type="text_en" indexed="true" stored="true"/>
...
<uniqueKey>id</uniqueKey>
...
<copyField source="title" dest="titleText"/>
...
And it's done. That's all folks!
I'm successfully able to index pdf, doc, ppt, etc. files using the Data Import Handler in Solr 4.3.0.
My data-config.xml looks like this -
<dataConfig>
<dataSource name="bin" type="BinFileDataSource" />
<document>
<entity name="f" dataSource="null" rootEntity="false"
processor="FileListEntityProcessor"
baseDir="C:\Users\aroraarc\Desktop\Impdo"
fileName=".*\.(DOC)|(PDF)|(pdf)|(doc)|(docx)|(ppt)|(pptx)|(xls)|(xlsx)|(txt)" onError="skip"
recursive="true">
<field column="fileAbsolutePath" name="path" />
<field column="fileSize" name="size" />
<field column="fileLastModified" name="lastmodified" />
<field column="file" name="fileName"/>
<entity name="tika-test" dataSource="bin" processor="TikaEntityProcessor"
url="${f.fileAbsolutePath}" format="text" onError="skip">
<field column="Author" name="author" meta="true"/>
<field column="title" name="title" meta="true"/>
<field column="text" name="content"/>
</entity>
</entity>
</document>
</dataConfig>
However, in the fileName field I want to insert the pure file name without the extension. E.g. instead of 'HelloWorld.txt' I want only 'HelloWorld' to be inserted in the fileName field. How do I achieve this?
Thanks in advance!
Check the ScriptTransformer to replace or change the value before it is indexed.
Example -
Data Config - Add custom field -
<script><![CDATA[
function changeFileName(row){
    var fileName = row.get('fileName');
    // Remove the extension: keep everything before the last '.'
    var dotIndex = fileName.lastIndexOf('.');
    if (dotIndex > 0) {
        fileName = fileName.substring(0, dotIndex);
    }
    row.put('fileName', fileName);
    return row;
}
]]></script>
Entity mapping -
<entity name="f" transformer="script:changeFileName" ....>
......
</entity>
I have an existing collection to which I want to add an RSS importer. I've copied what I could glean from the example-DIH/solr/rss code.
The details are below, but the bottom line is that everything seems to run, yet it always says "Fetched: 0" (and I get no documents). There are no exceptions in the Tomcat log.
Questions:
Is there a way to turn up debugging on rss importers?
Can I see solr's actual request and response?
What would cause the request to succeed, but no rows to be fetched?
Is there a tutorial for adding an RSS DIH to an existing collection?
Thanks!
My solrconfig.xml file contains the requestHandler:
<requestHandler name="/dataimport"
class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">rss-data-config.xml</str>
</lst>
</requestHandler>
And rss-data-config.xml:
<dataConfig>
<dataSource type="URLDataSource" />
<document>
<entity name="slashdot"
pk="link"
url="http://rss.slashdot.org/Slashdot/slashdot"
processor="XPathEntityProcessor"
forEach="/rss/channel | /rss/item"
transformer="DateFormatTransformer">
<field column="source_name" xpath="/rss/channel/title" commonField="true" />
<field column="title" xpath="/rss/item/title" />
<field column="link" xpath="/rss/item/link" />
<field column="body" xpath="/rss/item/description" />
<field column="date" xpath="/rss/item/date" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss" />
</entity>
</document>
</dataConfig>
and from schema.xml:
<fields>
<field name="title" type="text_general" required="true" indexed="true" stored="true"/>
<field name="link" type="string" required="true" indexed="true" stored="true"/>
<field name="source_name" type="text_general" required="true" indexed="true" stored="true"/>
<field name="body" type="text_general" required="false" indexed="false" stored="true"/>
<field name="date" type="date" required="true" indexed="true" stored="true" />
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
</fields>
When I run the dataimport from the admin web page, it all seems to go well. It shows "Requests: 1" and there are no exceptions in the tomcat log:
Mar 12, 2013 9:02:58 PM org.apache.solr.handler.dataimport.DataImporter maybeReloadConfiguration
INFO: Loading DIH Configuration: rss-data-config.xml
Mar 12, 2013 9:02:58 PM org.apache.solr.handler.dataimport.DataImporter loadDataConfig
INFO: Data Configuration loaded successfully
Mar 12, 2013 9:02:58 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
Mar 12, 2013 9:02:58 PM org.apache.solr.handler.dataimport.SimplePropertiesWriter readIndexerProperties
INFO: Read dataimport.properties
Mar 12, 2013 9:02:59 PM org.apache.solr.handler.dataimport.DocBuilder execute
INFO: Time taken = 0:0:0.693
Mar 12, 2013 9:02:59 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [articles] webapp=/solr path=/dataimport params={optimize=false&clean=false&indent=true&commit=false&verbose=true&entity=slashdot&command=full-import&debug=true&wt=json} {} 0 706
Your problem here is due to your rss-data-config.xml and the XPaths defined in it.
If you open the URL http://rss.slashdot.org/Slashdot/slashdot in Internet Explorer and hit F12 for the developer tools, it will show you the structure of the feed.
You can see that the node <item> is a child of <channel> and not of <rss>, so your config should look as follows:
<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
<dataSource type="URLDataSource" />
<document>
<entity name="slashdot"
pk="link"
url="http://rss.slashdot.org/Slashdot/slashdot"
processor="XPathEntityProcessor"
forEach="/rss/channel | /rss/channel/item"
transformer="DateFormatTransformer">
<field column="source_name" xpath="/rss/channel/title" commonField="true" />
<field column="title" xpath="/rss/channel/item/title" />
<field column="link" xpath="/rss/channel/item/link" />
<field column="body" xpath="/rss/channel/item/description" />
<field column="date" xpath="/rss/channel/item/date" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss" />
</entity>
</document>
</dataConfig>
Which Solr version are you using?
For 3.X you have the debug feature with DIH, which will help you debug step by step.
It's missing in 4.X; check SOLR-4151.
The following data-config.xml file does the work for Slashdot (Solr 4.2.0)
<dataConfig>
<dataSource type="HttpDataSource" />
<document>
<entity name="slashdot"
pk="link"
url="http://rss.slashdot.org/Slashdot/slashdot"
processor="XPathEntityProcessor"
forEach="/rss/channel/item"
transformer="DateFormatTransformer">
<field column="title" xpath="/rss/channel/item/title" />
<field column="link" xpath="/rss/channel/item/link" />
<field column="description" xpath="/rss/channel/item/description" />
<field column="creator" xpath="/rss/channel/item/creator" />
<field column="item-subject" xpath="/rss/channel/item/subject" />
<field column="slash-department" xpath="/rss/channel/item/department" />
<field column="slash-section" xpath="/rss/channel/item/section" />
<field column="slash-comments" xpath="/rss/channel/item/comments" />
<field column="date" xpath="/rss/channel/item/date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
</entity>
</document>
</dataConfig>
Notice the extra 'Z' on the dateTimeFormat, which is necessary according to "schema.xml"
Quoting schema.xml
The format for this date field is of the form 1995-12-31T23:59:59Z, and
is a more restricted form of the canonical representation of dateTime
http://www.w3.org/TR/xmlschema-2/#dateTime
The trailing "Z" designates UTC time and is mandatory.
Optional fractional seconds are allowed: 1995-12-31T23:59:59.999Z
All other components are mandatory.
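A conforming value uses a 24-hour clock and a literal trailing Z for UTC. A small Python sketch of the canonical form (the helper name is illustrative):

```python
# Format a UTC timestamp in Solr's canonical date form:
# yyyy-MM-dd'T'HH:mm:ss'Z' (24-hour clock, literal trailing Z).
from datetime import datetime, timezone

def to_solr_date(dt):
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

stamp = to_solr_date(datetime(1995, 12, 31, 23, 59, 59, tzinfo=timezone.utc))
print(stamp)  # 1995-12-31T23:59:59Z
```

Note the 24-hour %H; a 12-hour pattern would silently produce wrong afternoon times.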
Update your rss-data-config.xml as below:
<dataConfig>
<dataSource type="URLDataSource" />
<document>
<entity name="slashdot"
pk="link"
url="http://rss.slashdot.org/Slashdot/slashdot"
processor="XPathEntityProcessor"
forEach="/RDF/channel | /RDF/item"
transformer="DateFormatTransformer">
<field column="source" xpath="/RDF/channel/title" commonField="true" />
<field column="source-link" xpath="/RDF/channel/link" commonField="true" />
<field column="subject" xpath="/RDF/channel/subject" commonField="true" />
<field column="title" xpath="/RDF/item/title" />
<field column="link" xpath="/RDF/item/link" />
<field column="description" xpath="/RDF/item/description" />
<field column="creator" xpath="/RDF/item/creator" />
<field column="item-subject" xpath="/RDF/item/subject" />
<field column="slash-department" xpath="/RDF/item/department" />
<field column="slash-section" xpath="/RDF/item/section" />
<field column="slash-comments" xpath="/RDF/item/comments" />
<field column="date" xpath="/RDF/item/date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
</entity>
</document>
</dataConfig>
It worked for me
This is my data-config.xml
<dataConfig>
<dataSource name="a" type="URLDataSource" encoding="UTF-8" connectionTimeout="5000" readTimeout="10000"/>
<document name="products">
<entity name="images" dataSource="a"
url="file:///abc/1299.xml"
processor="XPathEntityProcessor"
forEach="/imagesList/image"
>
<field column="id" xpath="/imageList/image/productId" />
<field column="image_array" xpath="/imageList/image/imageUrlString" />
</entity>
</document>
</dataConfig>
This is the schema.xml
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="image_array" type="text" indexed="true" stored="true" multivalued="true"/>
But when I try a delta-import, none of the documents get added.
Any help will be highly appreciated.
Well first off, your XPath says imageList and your XML says imagesList ...
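The mismatch is easy to reproduce outside Solr. A small Python sketch (the sample XML below is an assumption reconstructed from the element names in the question) shows that a path spelled imageList matches nothing in a document whose root is imagesList:

```python
# Demonstrate the element-name mismatch from the question: the
# document root is <imagesList>, but the xpath says /imageList/...
import xml.etree.ElementTree as ET

sample = """
<imagesList>
  <image>
    <productId>1299</productId>
    <imageUrlString>http://example.com/1299.jpg</imageUrlString>
  </image>
</imagesList>
"""

root = ET.fromstring(sample)
print(root.tag)  # imagesList, not imageList

# Correct spelling matches; the misspelled path finds nothing.
print([e.text for e in root.findall("image/productId")])
print(root.findall("imageList/image/productId"))
```

The fix is simply to make the xpath attributes in data-config.xml use the same element names as the source file.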