I searched the documentation and cannot find where it stores all the data.
I want to access all crawled data in order to do my own processing.
The StartStopListener file sets up the index directories: look for the values of the environment variables OPENSEARCHSERVER_DATA, OPENSEARCHSERVER_MULTIDATA, or OPENSHIFT_DATA_DIR.
Now, whether you'll be able to parse the files easily and correctly is another matter: I have never tried to open a search server's indexes by hand, and I don't know whether the index format is well documented.
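For example, here is a small sketch of how your own code could resolve that data directory from the environment; the order of precedence below is an assumption, so check StartStopListener for the real logic:

    // Sketch: locate the OpenSearchServer data directory from the environment.
    // The precedence order is an assumption; StartStopListener holds the real logic.
    public class DataDirLocator {
        public static void main(String[] args) {
            String dir = firstNonNull(
                    System.getenv("OPENSEARCHSERVER_DATA"),
                    System.getenv("OPENSEARCHSERVER_MULTIDATA"),
                    System.getenv("OPENSHIFT_DATA_DIR"));
            System.out.println(dir != null
                    ? "Index data lives under: " + dir
                    : "None of the data directory variables are set");
        }

        private static String firstNonNull(String... values) {
            for (String v : values) {
                if (v != null) return v;
            }
            return null;
        }
    }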
By default, the crawled data is not stored; only the extracted text is. It is possible to store the crawled data. Here is the process:
Create a new field, setting its "stored" parameter to "yes" or "compressed".
Go to the Schema / Parser List.
Edit the HTML parser.
In the "Field Mapping" tab, link the parser field "htmlSource" to the new field.
Restart the indexation process. From then on, all crawled data will be copied to this field. Don't forget to add it as a returned field in your query.
I want to index the entire content of a CSV file to one field.
I am having trouble with this: in a schemaless setup, Solr wants to create an enormous number of fields to index the CSV file, and in the standard manual schema setup (which I'd prefer), it produces many "field not found" errors.
I thought that the answer would lie in the CSV update parameters (https://solr.apache.org/guide/8_11/uploading-data-with-index-handlers.html#csv-update-parameters), but I haven't had success.
Any help is appreciated!
I have data in two formats, CSV and TEXT.
1) The CSV file contains metadata, i.e. ModifyScore, Size, fileName, etc.
2) The actual text is in TEXT folders containing files like a.txt, b.txt, etc.
Is it possible to index such data into Solr in a single core through DIH, or in some other way?
Given your use case, I would proceed with a custom indexing app.
Apparently you want to build each Solr document by fetching some fields from the CSV and another field (the content) from the TXT files.
Using Java, for example, it is going to be quite simple:
You can use SolrJ, fetch the data from the CSV and TXT files, build each Solr document, and then index it, as in the sketch below.
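A minimal sketch of that approach, assuming SolrJ, a local core called mycore, a CSV laid out as fileName,modifyScore,size with no header row, and a text folder holding the TXT files; all of those names and fields are assumptions to adapt to your schema:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class CsvTxtIndexer {
        public static void main(String[] args) throws Exception {
            try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
                Path csv = Paths.get("metadata.csv");   // assumed CSV location
                Path txtDir = Paths.get("text");        // assumed folder with a.txt, b.txt, ...

                for (String line : Files.readAllLines(csv)) {
                    String[] cols = line.split(",");
                    String fileName = cols[0].trim();

                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", fileName);              // metadata from the CSV
                    doc.addField("modifyScore", cols[1].trim());
                    doc.addField("size", cols[2].trim());

                    Path txt = txtDir.resolve(fileName);       // content from the matching TXT file
                    if (Files.exists(txt)) {
                        doc.addField("content", new String(Files.readAllBytes(txt)));
                    }
                    solr.add(doc);
                }
                solr.commit();
            }
        }
    }

With a large number of files you would also batch the add calls (e.g. send a list of documents at a time), but the structure stays the same.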
I would use the DIH if you can move the data into a DB (even 2 tables are fine, as DIH supports joins).
Out of the box, you may be interested in using the ScriptTransformer [1].
Using it in combination with your different data sources could work.
You need to play a little bit with it as it's not a direct solution to your problem.
[1] https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler#UploadingStructuredDataStoreDatawiththeDataImportHandler-TheScriptTransformer
Just to mention a couple more possibilities:
Use DIH to index the txt files into collectionA and the /update handler to ingest the csv directly into collectionB, then use Streaming Expressions to merge both into a third collection, which is the one you want to keep. The main advantage is that everything happens in Solr, with no external code.
Use DIH to index the files (or /update to index the csv) and write an Update Request Processor that intercepts docs before they are indexed, looks up the info from the other source, and adds it to the doc (see the sketch below).
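As a rough illustration of the second option, here is a sketch of an Update Request Processor that enriches each incoming document with metadata looked up from the CSV; the class name, the fileName/modifyScore/size fields, and the in-memory lookup are assumptions, not a finished implementation:

    import java.io.IOException;
    import java.util.Map;

    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class CsvEnrichProcessorFactory extends UpdateRequestProcessorFactory {

        @Override
        public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                                  UpdateRequestProcessor next) {
            return new UpdateRequestProcessor(next) {
                @Override
                public void processAdd(AddUpdateCommand cmd) throws IOException {
                    SolrInputDocument doc = cmd.getSolrInputDocument();
                    String fileName = (String) doc.getFieldValue("fileName");

                    // metadata would be parsed from the CSV once and cached (omitted here)
                    Map<String, String> row = csvMetadata(fileName);
                    if (row != null) {
                        doc.setField("modifyScore", row.get("modifyScore"));
                        doc.setField("size", row.get("size"));
                    }
                    super.processAdd(cmd);   // pass the enriched doc down the chain
                }
            };
        }

        // placeholder lookup; a real implementation would load the CSV at startup
        private Map<String, String> csvMetadata(String fileName) {
            return null;
        }
    }

The factory would then be registered in an updateRequestProcessorChain in solrconfig.xml and attached to the update handler you index with.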
Yes, it is possible. For information and code on how to index data from multiple heterogeneous data sources, see "why the tikaEntityProcessor does not index the Text field in the following data-config file?"
I have a Solr instance that already has a large amount of data uploaded. I want to create a new copyField that is a concatenation of two existing fields.
Do I need to repopulate my data?
Yes. From the Solr documentation:
Fields are copied before analysis is done
By "analysis" in the copyField context they mean the index analyzer, which is executed when a document is indexed.
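If every field in your documents is stored, one hedged way to repopulate from SolrJ is to page through the index and re-add each document, which runs the new copyField at index time; the core URL, the uniqueKey "id", the page size, and the premise that the copyField destination is not itself stored are all assumptions:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;

    public class RepopulateCopyField {
        public static void main(String[] args) throws Exception {
            try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
                int rows = 500;                       // assumed page size
                int start = 0;
                while (true) {
                    SolrQuery q = new SolrQuery("*:*");
                    q.setSort("id", SolrQuery.ORDER.asc);
                    q.setStart(start);
                    q.setRows(rows);
                    QueryResponse rsp = solr.query(q);
                    if (rsp.getResults().isEmpty()) break;

                    for (SolrDocument d : rsp.getResults()) {
                        SolrInputDocument in = new SolrInputDocument();
                        for (String f : d.getFieldNames()) {
                            if ("_version_".equals(f)) continue;   // let Solr assign a new version
                            in.addField(f, d.getFieldValue(f));
                        }
                        solr.add(in);                 // re-adding runs the new copyField
                    }
                    start += rows;
                }
                solr.commit();                        // commit once at the end so paging stays stable
            }
        }
    }

If some fields are not stored, or if auto soft commits make re-added documents visible while you are still paging, it is safer to reindex from the original data source instead.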
I'm trying to index some data from a database. There are some linked documents for each page, represented in a database table.
I noticed that indexing generally works, but the field 'text' from Tika is completely ignored and not fetched at all, without any reasonable exception in the logs.
My data config: http://pastebin.com/XdwenPTE, my schema: http://pastebin.com/zXEuFTHE, my Solr config: http://pastebin.com/qLiuT0tq
Can you look at my configs and tell me if I omitted anything? When I query the indexed data, the field 'text' is not even present - why?
[edit]
I changed the file path passed to Tika to:
url="${page_resource_list.FILE_PATH}"
But the file content is still not indexed at all. Any ideas? I get some exceptions about files not being found (which is fine, because some files are missing), but there are no exceptions about any problems with the existing files. And Tika didn't index anything.
It seems to be the same problem as described here: Solr's TikaEntityProcessor not working - but is this really not fixed yet?
The entity reference for FILE_PATH is ${page_resource_list.FILE_PATH}, not ${page_content.FILE_PATH} (which only has CONTENT defined as a column).
You also have a LogTransformer available that can give you better debug information about the actual content of your fields while indexing.
I have a schema with 10 fields. One of the fields is text (the content of a file); all the other fields are custom metadata. The document doesn't change, but the metadata changes frequently.
Is there any way to skip the document text while re-indexing? Can I index only the custom metadata? If I skip the text during re-indexing, does it update the index by removing the text field from the indexed document?
To my knowledge there's no way to selectively update specific fields. An update operation performs a complete replace of all document data. Since Solr is open source, it's possible that you could produce your own component for this if really desired.
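As a small illustration of that behaviour, with assumed field names: re-adding a document with only the metadata fields leaves it without the previously indexed text, so a metadata-only update has to re-send the text too.

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class MetadataOnlyUpdate {
        public static void main(String[] args) throws Exception {
            try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-1");
                doc.addField("author", "someone");   // updated metadata only, no "text" field
                solr.add(doc);                       // a plain add replaces the whole document
                solr.commit();
                // after this commit, "doc-1" no longer contains the text that was indexed earlier
            }
        }
    }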