Solr - can't parse files using tika nested entity

I'm trying to index some data from a database. There are some linked documents for each page represented in a database table.
I noticed that indexing generally works, but the 'text' field from Tika is completely ignored and not fetched at all, with no meaningful exception in the logs.
My data config: http://pastebin.com/XdwenPTE, my schema: http://pastebin.com/zXEuFTHE, my solr config: http://pastebin.com/qLiuT0tq
Can you look at my configs and tell me if I omitted anything? When I query the indexed data, the 'text' field is not even present - why?
[edit]
I changed file path passed to tika to:
url="${page_resource_list.FILE_PATH}"
But the file content is still not indexed at all. Any ideas? I get some exceptions about files not being found (which is expected, since some files are missing), but there are no exceptions about problems with the existing files, and Tika didn't index anything.
It seems to be the same problem as described here: Solr's TikaEntityProcessor not working - but has this really not been fixed yet?

The entity reference for FILE_PATH is ${page_resource_list.FILE_PATH}, not ${page_content.FILE_PATH} (which only has CONTENT defined as a column).
You can also use a LogTransformer to get better debug information about the actual content of your fields while indexing.
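A minimal sketch of how the nested Tika entity might look in the data-config, assuming the parent entity is named page_resource_list, exposes a FILE_PATH column, and a BinFileDataSource named "bin" is defined (entity and field names are taken from the question; the query and data source are illustrative assumptions):

```xml
<entity name="page_resource_list"
        query="SELECT FILE_PATH FROM page_resource_list WHERE PAGE_ID='${page.ID}'"
        transformer="LogTransformer"
        logTemplate="Processing file ${page_resource_list.FILE_PATH}"
        logLevel="info">
  <!-- The nested Tika entity must reference the PARENT entity's column -->
  <entity name="file" processor="TikaEntityProcessor"
          url="${page_resource_list.FILE_PATH}"
          dataSource="bin" format="text">
    <field column="text" name="text"/>
  </entity>
</entity>
```

The logTemplate line prints each resolved file path during the import, which makes it easy to spot a variable that resolves to an empty string.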

Related

Is there anything instead of "add-field" for schema-default.json file in solr when we want to check if field exists?

I am new to Solr and I am struggling to find something like "add-field-if-not-exists" in the documentation. According to the documentation:
"The add-field command adds a new field definition to your schema. If a field with the same name exists an error is thrown."
So any time I use my updated schema with some new field, I get a bunch of errors logged for the fields that are already there. Is there some kind of alternative to the "add-field" command? Or should Solr work with different versions of the schema-default.json file, so that only the latest changes are applied on each schema update?
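The Schema API has no built-in "add-field-if-not-exists", but you can first fetch the current fields (GET /schema/fields), then send add-field only for the missing ones. A sketch of the selection logic, with the HTTP calls left out (the field definitions below are illustrative):

```python
def fields_to_add(existing_names, desired_fields):
    """Return only the field definitions that are safe to send as
    add-field commands, i.e. those whose name is not yet in the schema."""
    existing = set(existing_names)
    return [f for f in desired_fields if f["name"] not in existing]

# Example: the schema already contains "id" and "title"
desired = [
    {"name": "title", "type": "text_general"},
    {"name": "body", "type": "text_general"},
]
missing = fields_to_add(["id", "title"], desired)
# missing now holds only the "body" definition; POST it to /schema
# as {"add-field": missing} and no duplicate-field error is thrown
```

For fields that already exist but whose definition changed, the Schema API's replace-field command is the matching operation.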

Solr: Index content of CSV to one field

I want to index the entire content of a CSV file to one field.
I am having trouble with this: Solr wants to create an enormous number of fields to index the CSV file (in a schemaless setup), and in the standard manual-schema setup (preferred) it produces many field-not-found errors.
I thought that the answer would lie in the CSV update parameters (https://solr.apache.org/guide/8_11/uploading-data-with-index-handlers.html#csv-update-parameters), but I haven't had success.
Any help is appreciated!
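One workaround is to bypass the CSV handler entirely: read the file yourself and put its raw text into a single field of a JSON document, so only that one field needs to exist in the schema. A sketch (the field names id and content_txt are placeholders, not anything Solr requires):

```python
import json

def csv_as_single_field_doc(doc_id, csv_text):
    """Wrap an entire CSV file's text in one Solr document,
    so the CSV columns are never split into separate fields."""
    return {"id": doc_id, "content_txt": csv_text}

csv_text = "name,size\na.txt,10\nb.txt,20\n"
doc = csv_as_single_field_doc("file-1", csv_text)
payload = json.dumps([doc])
# POST payload to /update with Content-Type: application/json
# instead of sending the raw CSV to the CSV update handler
```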

Can we index in Solr single core data from two different formats i.e. from csv and text?

I have data in two formats, CSV and TEXT.
1) The CSV file contains metadata, i.e. ModifyScore, Size, fileName, etc.
2) The actual text is in Text folders containing files like a.txt, b.txt, etc.
Is it possible to index such data in Solr in a single core, through DIH or another possible way?
Given your use case, I would proceed with a custom indexing app.
Apparently you want to build your Solr document by fetching some fields from the CSV and another field (the content) from the TXT files.
Using Java, for example, it is going to be quite simple:
You can use SolrJ: fetch the data from the CSV and TXT files, build each Solr document, and then index it.
I would use the DIH only if I could move the data into a DB (even 2 tables are fine, as DIH supports joins).
Out of the box, you may be interested in the Script transformer [1].
Using it in combination with your different data sources could work.
You will need to play with it a little, as it's not a direct solution to your problem.
[1] https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler#UploadingStructuredDataStoreDatawiththeDataImportHandler-TheScriptTransformer
Just to mention a couple more possibilities:
Use DIH to index the txt files into collectionA, use the /update handler to ingest the csv directly into collectionB, then use Streaming Expressions to merge both into a third collection, which is the one you keep. The main advantage: everything stays in Solr, with no external code.
Use DIH to index the files (or /update to index the csv) and write an Update Request Processor that intercepts docs before they are indexed, looks up the info from the other source, and adds it to the doc.
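The custom-app route suggested above boils down to: read the CSV metadata, read each referenced text file, and merge the two into one document per file before sending it to Solr. A sketch of the merge step, assuming the CSV has a fileName column as described in the question:

```python
import csv
import io

def build_docs(csv_text, txt_contents):
    """Join CSV metadata rows with the matching text-file contents.
    txt_contents maps a file name (the CSV's 'fileName' column)
    to the text read from that file."""
    docs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        doc = dict(row)  # metadata fields: ModifyScore, Size, fileName...
        doc["content"] = txt_contents.get(row["fileName"], "")
        docs.append(doc)
    return docs

meta = "fileName,Size\na.txt,10\nb.txt,20\n"
texts = {"a.txt": "hello", "b.txt": "world"}
docs = build_docs(meta, texts)
# each doc now carries its CSV metadata plus a 'content' field;
# index the list with a single POST to /update (or via SolrJ/pysolr)
```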
Yes, it is possible. For information and code on how to index data from multiple heterogeneous data sources, see: why the TikaEntityProcessor does not index the Text field in the following data-config file?

conversion of DateField to TrieDateField in Solr

I'm using Apache Solr for powering the search functionality in my Drupal site using a contributed module for drupal named ApacheSolr Search Integration. I'm pretty novice with Solr and have a basic understanding of it, hence wish to convey my apologies in advance if this query sounds outrageous.
I have a date field added through one of drupal's hooks named ds_myDate which I initially used for sorting the search results. I decided to use a date boosting, so that the search results are displayed based on relevancy and boosted by their date rather than merely being displayed by the descending order of date. Once I had updated my hook to implement the same by adding a boost field as recip(ms(NOW/HOUR,ds_myDate),3.16e-11,1,1) I got a HTTP 400 error stating
Can't use ms() function on non-numeric legacy date field ds_myDate
Googling for the same suggested that I use a TrieDateField instead of the Legacy DateField to prevent this error. Adding a TrieDate field named tds_myDate following the suggested naming convention and implementing the boost as recip(ms(NOW/HOUR,tds_myDate),3.16e-11,1,1) did effectively achieve the boosting. However this requires me to reindex all the content (close to 500k records) to populate the new TrieDate field so that I may be able to use it effectively.
I'd like to know if there's a more effective workaround than re-indexing all my content, such as converting my ds_myDate field to a TrieDate field in place, like running an ALTER query on a MySQL table column to change its type. Since I'm unfamiliar with how Solr works internally, I'd like to know whether such an option is feasible and what the right thing to do would be in this case.
You may be able to achieve it by doing a partial update, but for that you need to be on Solr 4+ and storing all indexed fields.
Here is how I would go with this:
Make sure version of Solr is 4+
Make sure all indexed fields are stored (requirement for partial updates)
If the above two conditions are met, write a script (PHP, for example) which does the following:
1) Iterate through the full Solr index, and for each doc:
----a) read the value stored in the ds_myDate field
----b) convert it to the TrieDateField format
----c) push it to Solr via a partial update of only the tds_myDate field (see sample query)
Sample query:
curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '[{"id":"$id","tds_myDate":{"set":"$converted_Val"}}]'
For more details on partial updates: http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/
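Steps b) and c) can be sketched as follows. Both the legacy DateField and TrieDateField use the same ISO-8601 wire format (e.g. 2012-07-09T10:00:00Z), so the "conversion" is mostly re-parsing the stored string to validate it before building the atomic-update body (the doc id and field names follow the question):

```python
import json
from datetime import datetime, timezone

def partial_update_payload(doc_id, stored_value):
    """Build the atomic-update body that sets tds_myDate from the
    value read out of the stored ds_myDate field."""
    # Validate/normalize the stored ISO-8601 value (assumed format)
    dt = datetime.strptime(stored_value, "%Y-%m-%dT%H:%M:%SZ")
    iso = dt.replace(tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return json.dumps([{"id": doc_id, "tds_myDate": {"set": iso}}])

payload = partial_update_payload("42", "2012-07-09T10:00:00Z")
# POST payload to localhost:8983/solr/update?commit=true
# with Content-Type: application/json, as in the sample query above
```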
Unfortunately, once a document has been indexed a certain way and you change the schema, you cannot have the new schema changes be applied to existing documents until those documents are re-indexed.
Please see this previous question - Does Schema Change need Reindex - for additional details.

how to access raw data in opensearchserver?

I searched the documentation and cannot find where it stores all the data.
I want to access all the crawled data in order to do my own processing.
The StartStopListener class sets up the index directories: look for the values of the environment variables OPENSEARCHSERVER_DATA, OPENSEARCHSERVER_MULTIDATA, and OPENSHIFT_DATA_DIR.
Now, whether you'll be able to parse the files easily/correctly is another debate: I haven't ever tried to directly open a search server's indexes by hand, and I don't know that the index format is well documented.
By default, the crawled data is not stored; only the extracted text is stored. It is possible to store the crawled data. Here is the process:
1) Create a new field, setting the "stored" parameter to yes or to compressed.
2) Go to the Schema / Parser List.
3) Edit the HTML parser.
4) In the "Field Mapping" tab, link the parser field "htmlSource" to the new field.
5) Restart the indexation process. Now all crawled data will be copied to this field. Don't forget to add it as a returned field in your query.