I want to index the entire content of a CSV file to one field.
I am having trouble with this: in a schemaless setup, Solr wants to create an enormous number of fields to index the CSV file, and in the standard manual schema setup (which I'd prefer), it produces many "field not found" errors.
I thought that the answer would lie in the CSV update parameters (https://solr.apache.org/guide/8_11/uploading-data-with-index-handlers.html#csv-update-parameters), but I haven't had success.
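What I'm effectively trying to end up with is something like this SolrJ sketch, where the whole file lands in a single catch-all field (the core URL and field names are just placeholders):

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexWholeCsv {
        public static void main(String[] args) throws Exception {
            SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

            // Read the whole CSV as one string instead of letting Solr split it into columns.
            String csvContent = new String(Files.readAllBytes(Paths.get("data.csv")), StandardCharsets.UTF_8);

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "data.csv");
            doc.addField("content", csvContent);   // single catch-all text field

            solr.add(doc);
            solr.commit();
            solr.close();
        }
    }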
Any help is appreciated!
In CKAN, it's possible that a dataset may have multiple resources with various file extensions (e.g. CSV, JSON, etc.).
However, when I query Solr from CKAN with a request that filters on a specific format (e.g. <CKAN_SITE_URL>/dataset/?res_format=CSV), the result contains datasets whose resources have formats other than CSV as well, whereas what I expect is a result containing only datasets that have at least one CSV resource.
Does anyone know which field in the Solr schema I should adjust to solve this problem?
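For reference, the behaviour I'm expecting corresponds to a plain filter query on the format field, roughly like this SolrJ sketch (the core URL is a placeholder, and res_format is the field name I believe CKAN uses for resource formats):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class FormatFilterCheck {
        public static void main(String[] args) throws Exception {
            // CKAN's Solr core URL is an assumption; adjust to your installation.
            SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/ckan").build();

            SolrQuery query = new SolrQuery("*:*");
            // res_format holds the formats of a dataset's resources.
            query.addFilterQuery("res_format:\"CSV\"");

            System.out.println(solr.query(query).getResults().getNumFound());
            solr.close();
        }
    }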
Thanks!
I have data in two formats, CSV and TEXT.
1) The CSV file contains metadata, i.e. ModifyScore, Size, fileName, etc.
2) The actual text is in text folders containing files like a.txt, b.txt, etc.
Is it possible to index such data into a single Solr core through DIH or in some other way?
Given your use case, I would proceed with a custom indexing app.
Apparently you want to build each Solr document by fetching some fields from the CSV and another field (the content) from the TXT.
Using Java, for example, it is going to be quite simple:
You can use SolrJ, fetch the data from the CSV and TXT, build each Solr Document and then index it.
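A minimal sketch of that approach, assuming the CSV has a header row and that each row's first column names the matching TXT file (field names, file names and the core URL are placeholders):

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class CsvPlusTxtIndexer {
        public static void main(String[] args) throws Exception {
            SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

            // CSV columns assumed: fileName,modifyScore,size
            List<String> lines = Files.readAllLines(Paths.get("metadata.csv"));
            for (String line : lines.subList(1, lines.size())) {   // skip the header row
                String[] cols = line.split(",");

                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", cols[0]);
                doc.addField("modifyScore", cols[1]);
                doc.addField("size", cols[2]);

                // Fetch the matching text file, e.g. text/a.txt for the row "a.txt"
                String text = new String(Files.readAllBytes(Paths.get("text", cols[0])), StandardCharsets.UTF_8);
                doc.addField("content", text);

                solr.add(doc);
            }
            solr.commit();
            solr.close();
        }
    }

From there you can add batching (client.add accepts a collection of documents) and more robust CSV parsing (e.g. Commons CSV) as needed.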
I would use the DIH if I can move the data in a DB ( even 2 tables are fine, as DIH supports joins).
Out of the box, you may be interested in using the ScriptTransformer [1].
Using it in combination with your different data sources could work.
You need to play a little bit with it as it's not a direct solution to your problem.
[1] https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler#UploadingStructuredDataStoreDatawiththeDataImportHandler-TheScriptTransformer
Just to mention a couple more possibilities:
Use DIH to index the txt files into collectionA, use the /update handler to ingest the CSV directly into collectionB, then use Streaming Expressions to merge both into a third collection, which is the one you want to keep. The main advantage is that everything stays in Solr, with no external code.
Use DIH to index the files (or /update to index the CSV) and write an Update Request Processor that intercepts docs before they are indexed, looks up the info from the other source, and adds it to each doc (see the sketch below).
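A rough sketch of such an Update Request Processor (the metadata lookup itself is only stubbed out here, since it depends on where the second source lives):

    import java.io.IOException;

    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class CsvMetadataLookupProcessorFactory extends UpdateRequestProcessorFactory {

        @Override
        public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                                  UpdateRequestProcessor next) {
            return new UpdateRequestProcessor(next) {
                @Override
                public void processAdd(AddUpdateCommand cmd) throws IOException {
                    SolrInputDocument doc = cmd.getSolrInputDocument();
                    String id = (String) doc.getFieldValue("id");

                    // Hypothetical lookup into the other source (CSV, DB, etc.)
                    // to enrich the document before it is indexed.
                    doc.addField("modifyScore", lookupModifyScore(id));

                    super.processAdd(cmd);
                }
            };
        }

        private static Object lookupModifyScore(String id) {
            // Placeholder: read the value from the CSV/DB keyed by id.
            return 0;
        }
    }

The factory then needs to be registered in an updateRequestProcessorChain in solrconfig.xml, and that chain referenced from the /update handler.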
Yes, it is possible. For information and code on how to index data from multiple heterogeneous data sources, see: why does the TikaEntityProcessor not index the Text field in the following data-config file?
I searched the OpenSearchServer documentation and cannot find where it stores all the data.
I want to access all crawled data in order to do my own processing.
The index directories are set up in the StartStopListener class: look for the values of the environment variables OPENSEARCHSERVER_DATA, OPENSEARCHSERVER_MULTIDATA, or OPENSHIFT_DATA_DIR.
Now, whether you'll be able to parse the files easily/correctly is another debate: I haven't ever tried to directly open a search server's indexes by hand, and I don't know that the index format is well documented.
By default, the crawled data is not stored; only the extracted text is. It is possible to store the crawled data as well; here is the process:
1) Create a new field and set its "stored" parameter to yes or to compressed.
2) Go to the Schema / Parser List.
3) Edit the HTML parser.
4) In the "Field Mapping" tab, link the parser field "htmlSource" to the new field.
5) Restart the indexation process. All crawled data will now be copied to this field. Don't forget to add it as a returned field in your query.
I have a schema with 10 fields. One of the fields is text (the content of a file); all the other fields are custom metadata. The document itself doesn't change, but the metadata changes frequently.
Is there any way to skip the document text while re-indexing? Can I index only the custom metadata? If I skip the document text when re-indexing, does it update the index by removing the text field from the indexed document?
To my knowledge there's no way to selectively update specific fields. An update operation performs a complete replace of all document data. Since Solr is open source, it's possible that you could produce your own component for this if really desired.
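To illustrate what the complete replace means in practice, here is a hedged SolrJ sketch of re-sending a whole document with changed metadata (it only works if all fields are stored; the core URL and field names are placeholders):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;

    public class ReindexMetadata {
        public static void main(String[] args) throws Exception {
            SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

            // Fetch the currently indexed document (only possible for stored fields).
            SolrDocument existing = solr.query(new SolrQuery("id:doc1")).getResults().get(0);

            // Rebuild the full document: every field must be re-sent,
            // because an update replaces the whole document.
            SolrInputDocument doc = new SolrInputDocument();
            for (String field : existing.getFieldNames()) {
                if (!"_version_".equals(field)) {          // skip internal fields
                    doc.addField(field, existing.getFieldValues(field));
                }
            }
            doc.setField("modifyScore", 42);               // the changed metadata

            solr.add(doc);
            solr.commit();
            solr.close();
        }
    }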
I have stored a number of binary files in a SQL Server table. I created a full-text index on that table which also indexes the binary field containing the documents. I installed the appropriate iFilters so that SQL Server can also read .doc, .docx and .pdf files.
Using the function DATALENGTH I can retrieve the length/size of the complete document, but this also includes layout and other useless information. I want to know the length of the text of the documents.
Using the iFilters, SQL Server is able to retrieve only the text of such "complicated" documents, but can this also be used to determine the length of just the text?
As far as I know (which isn't much), there is no way to query document properties via FTS. I would get the word count before inserting the document into the database, then insert the count along with it, into another column in the table. For Word documents, you can use the Document.Words.Count property; I don't know what the equivalent mechanism is for PDF documents.
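As a hedged sketch of that "count before insert" idea from code rather than the Word object model (the table and column names are made up, and Apache POI is used for the text extraction), it could look roughly like this:

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;

    public class InsertDocWithWordCount {
        public static void main(String[] args) throws Exception {
            Path file = Paths.get("report.docx");
            byte[] bytes = Files.readAllBytes(file);

            // Extract the plain text and count the words before storing the binary.
            String text;
            try (XWPFWordExtractor extractor =
                     new XWPFWordExtractor(new XWPFDocument(Files.newInputStream(file)))) {
                text = extractor.getText();
            }
            int wordCount = text.trim().isEmpty() ? 0 : text.trim().split("\\s+").length;

            try (Connection con = DriverManager.getConnection(
                     "jdbc:sqlserver://localhost;databaseName=Docs", "user", "password");
                 PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO Documents (FileName, Content, WordCount) VALUES (?, ?, ?)")) {
                ps.setString(1, file.getFileName().toString());
                ps.setBytes(2, bytes);
                ps.setInt(3, wordCount);
                ps.executeUpdate();
            }
        }
    }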