Solr 4 - Delete fields from FieldsInfo file - solr

I am running Solr 4. I created about a million fields in Solr using a script. GC activity became very high after adding these fields, because the fields are loaded every time a searcher is opened.
Now I want to get back to the state my Solr cluster was in before I added those fields. Even though I deleted the documents that contain those fields, the cluster has not recovered, because the fields are not removed from the fieldsInfo file.
Is there a way to explicitly tell Solr to delete the fields from the fieldsInfo file?

There is a documented Schema API that can delete a field. However, I don't know whether it is already available in Solr 4. You should try it and see whether it works.
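For reference, here is a minimal sketch of what such a request could look like. It assumes a Solr release whose Schema API accepts the delete-field command and a core that uses a managed schema, which may well not be the case on Solr 4. The base URL, collection name and field name are placeholders, and the helper only builds the request; it does not send it.

```python
import json


def delete_field_request(base_url, collection, field_name):
    """Build (url, body) for a Schema API delete-field command.

    Assumption: the target Solr version supports "delete-field";
    this is not confirmed for Solr 4.
    """
    url = f"{base_url.rstrip('/')}/{collection}/schema"
    body = json.dumps({"delete-field": {"name": field_name}})
    return url, body


# Placeholder values, for illustration only.
url, body = delete_field_request(
    "http://localhost:8983/solr", "collection1", "tmp_field_1"
)
print(url)
print(body)
```

The body would be sent as an HTTP POST with Content-Type application/json. As far as I know, a schema change like this does not rewrite existing index segments on its own, so it may not reduce what the searcher loads.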

Related

Searching PDF files stored in database using SOLR

I have a lot of PDF files stored in a database (MSSQL) that I need to search. They are stored as BLOBs. I need a walkthrough on how to search them using SOLR.
I have a DB, let's call it "fred". Inside fred is a table, which we'll call pdffiles. pdffiles has a column named pdfdata, of type BLOB.
The pdfs are stored in this table, with the binary data stored in the column. What steps do I take to get SOLR to extract this data and index it?
I'm guessing it involves the TikaEntityProcessor but having the pdfs stored in the database rather than just being regular files adds a level of complexity. I have previously worked with SOLR and have it running in production.
Sample dataconfig and schema files would be very useful.
What steps do I take to get SOLR to extract this data and index it?
Create a new file called tika-data-config.xml, which holds the database configuration and the query that fetches the data.
Open solrconfig.xml in a text editor and add the following within the config tags:
Reference the libraries needed by the DataImportHandler.
Provide the JAR file for your database driver.
Update schema.xml to declare your field. Choose a fieldType that suits your search requirements.
Once the setup is ready, you can ask Solr to index
using http://localhost:8983/solr/collection1/dataimport?command=full-import
For more details, please refer to the Solr documentation: Configure DIH
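The import request above can also be built from a script. The following is only a sketch: the helper name is made up for illustration, the core URL is the one from the answer, and nothing is actually sent to Solr.

```python
from urllib.parse import urlencode


def dataimport_url(core_url, command, **params):
    """Build a DataImportHandler URL for the given command.

    Extra keyword arguments become additional query parameters.
    """
    query = urlencode({"command": command, **params})
    return f"{core_url.rstrip('/')}/dataimport?{query}"


core = "http://localhost:8983/solr/collection1"  # URL from the answer
print(dataimport_url(core, "full-import"))
print(dataimport_url(core, "status"))
```

Fetching the printed URL (for example with urllib.request.urlopen) would trigger the import; the status command can be polled the same way to see how far the import has progressed.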

Solr shows data in search only for some time; after a while the results disappear

Hi, I am facing a problem with Solr. Earlier everything was fine, but after a server restart Solr shows only the documents that were indexed before the restart. Whenever I create a new document, it gets indexed, the commit works, and Solr shows it in the results, but after a few hours it disappears from the search results. This did not happen before the server restart. I don't know why it is behaving like this now.
I watched the index data: when I index newly created data, the index folder grows and the document appears in the results; after some time the folder returns to its earlier size and the document disappears from the results.
I am using Tomcat 6.0, the plugin tx_Solr 2.8.0, and Solr 4.2.0.
Can anyone tell me why it is now deleting only the newly created index entries? The devlog shows that a delete query is fired, but I don't know why this is happening.
Any help would be highly appreciated.

Solr - not returning all the rows

I am new to Solr.
I have the following two issues:
I am using Tomcat 6, Oracle 10g as the database, and Solr 4. When I deploy solr.war in Tomcat, I get an exception in the Tomcat console saying that the DataImportHandler class was not found. I have specified the Solr home, and a lib directory in my Solr home contains all the jars.
Why does solr.war still expect the jars to be in the solr.war lib folder?
I have indexed the entity with a full import.
I have a simple database table in Oracle with the typical columns emp_id, emp_name, and emp_dept.
I have defined data-config.xml, which currently has only one document and one entity. I have updated schema.xml accordingly.
When I run a /select query, I get only emp_id in the XML/JSON output.
How do I specify which fields I want in the response?
I have 2222222 rows in the database, but I get only 10 rows; only when I edit the config XML do I get the number of rows I specify. My database table can grow, so how do I get all the rows?
The problem is that I cannot state the number of rows required, which would not make sense anyway since the row count keeps increasing as transactions happen.
thanks,
1. Check that the lib directory in Solr home contains the apache-solr-dataimporthandler-x.y.z.jar and apache-solr-dataimporthandler-extras-x.y.z.jar files. Also check that the lib directory is configured correctly in solrconfig.xml.
2. Add stored=true to the field definitions in schema.xml. If you don't set stored=true, your fields will not be shown in the output. To return only certain fields, use the fl=fieldName query option.
3. When you set the rows parameter, Solr returns that many results, but the response also contains a numFound field that shows the total number of documents matching the query.
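Points 2 and 3 can be combined to page through a growing result set without hard-coding a row count. The sketch below is an illustration, not tested against a live server: it assumes you first read numFound from one response and then request fixed-size pages using the standard start and rows parameters. The function name, the page size, and the emp_* field names (taken from the question) are illustrative.

```python
from urllib.parse import urlencode


def select_urls(core_url, num_found, page_size, fields):
    """Yield one /select URL per page, covering num_found documents."""
    for start in range(0, num_found, page_size):
        query = urlencode({
            "q": "*:*",               # match all documents
            "fl": ",".join(fields),   # fields to return (must be stored)
            "start": start,           # offset of the first row in this page
            "rows": page_size,        # number of rows per page
            "wt": "json",
        })
        yield f"{core_url.rstrip('/')}/select?{query}"


# num_found would come from the numFound value of an earlier response.
for url in select_urls("http://localhost:8983/solr/collection1",
                       num_found=2500, page_size=1000,
                       fields=["emp_id", "emp_name", "emp_dept"]):
    print(url)
```

Because the number of pages is derived from numFound at request time, the client keeps working as the table grows.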

SOLR 3.6.0, After a full re-index of a bunch of entities, some of my items are not making it into the SOLR index, but no logs are being generated

Using a StreamingUpdateSolrServer, I used the following algorithm to re-index my huge dataset into SOLR.
Initialize StreamingUpdateSolrServer server = new StreamingUpdateSolrServer(solrServerUrl, numDocsToAddInBatch, numOfThreads);
For each Item…
-->Create document
-->server.add(document)
When all finished,
server.commit();
server.optimize();
The problem:
Some of my items are not making it into the SOLR index, but no logs are being generated to tell me what happened.
I was able to find most of the documents, but some were missing. There were no errors in any logs, and I have substantial try/catch blocks with logging around all SOLRJ exceptions on the client side.
Verify logging is not being hidden for the SOLR WAR
You will definitely want to verify that the SOLR server log settings are not hiding the fact that documents are failing to be added to the index.
Because SOLR uses the SLF4J API, your SOLR server could be overriding the log settings that would otherwise let you see an error message when a document fails to be indexed.
If you have a custom {solr-war}/WEB-INF/classes/logging.properties, make sure its settings are not hiding the error messages.
By default, errors in adding an item should be shown automatically. So if you did not change your SOLR log settings at any point, you should see any indexing errors in your server log file.
Troubleshoot why Documents are failing to be indexed
To investigate this, it is helpful to run the following verification step any time after the indexing is complete:
Initialize new log log_fromsolr
Initialize new log log_notfound
For each Item…
-->Search SOLR for the item. If SOLR has the object, log the item’s fields on a single line in log_fromsolr. This should include the uniqueKey for your document if you have one.
-->If the document cannot be found in SOLR for this item, write a line to log_notfound with all the fields of the object from the database, with the uniqueKey first.
Once the verification step has completed, log_notfound contains a list of all documents that failed to be added to the index.
You can compare a line from log_fromsolr with a line from log_notfound to see how the fields of an item that made it into the index differ from those of one that did not.
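The verification step above can be sketched roughly as follows. This is an illustration under assumptions: the database items are plain dicts, the set of IDs present in SOLR has already been fetched, and "id" stands in for your uniqueKey. The function names are made up for the example.

```python
def split_by_presence(db_items, solr_ids, key="id"):
    """Split database items into (found, not_found) by uniqueKey."""
    solr_ids = set(solr_ids)
    found, not_found = [], []
    for item in db_items:
        (found if item[key] in solr_ids else not_found).append(item)
    return found, not_found


def log_line(item, key="id"):
    """Format one item as a single log line, uniqueKey first."""
    rest = [f"{k}={item[k]}" for k in sorted(item) if k != key]
    return "\t".join([f"{key}={item[key]}"] + rest)


db_items = [
    {"id": "1", "title": "first"},
    {"id": "2", "title": "second"},
]
found, not_found = split_by_presence(db_items, solr_ids=["1"])

# In a real run these lines would be written to log_fromsolr / log_notfound.
for item in found:
    print("fromsolr:", log_line(item))
for item in not_found:
    print("notfound:", log_line(item))
```

Writing the fields in a stable order, with the key first, keeps the two logs easy to compare with a diff tool later.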
Verify it is not an intermittent issue
Sometimes it is not the same items that fail to be added to the index each time you index.
If you find objects in log_notfound, back up the current notfound log and run the indexing process again from scratch. Use a diff tool to see the differences between the first notfound log and the second notfound log.
An intermittent problem is evident when you see large numbers of differences between these files (note: some differences are to be expected if new objects are created in the database between the first and second re-indexing).
If your problem is intermittent, it almost certainly points at application code that is not committing your SOLR transactions correctly.
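If a diff tool is not handy, the comparison of the two notfound logs can be approximated in a few lines. This sketch assumes you have already extracted the uniqueKey of every line in each log; the function name and sample IDs are illustrative.

```python
def compare_runs(first_missing, second_missing):
    """Compare the missing IDs of two indexing runs."""
    first, second = set(first_missing), set(second_missing)
    return {
        "missing_both": sorted(first & second),  # consistently missing
        "only_first": sorted(first - second),    # missing in run 1 only
        "only_second": sorted(second - first),   # missing in run 2 only
    }


result = compare_runs(["a", "b", "c"], ["b", "c", "d"])
print(result)
```

A large "missing_both" group with small "only_*" groups suggests the same documents fail every time; the opposite pattern suggests an intermittent problem.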
The same documents consistently come up missing each time it indexes
At this point we have to compare documents that are found in the SOLR index with documents that are not getting into the Lucene index. Usually a field-by-field comparison of the objects will start turning up some suspicious values that may be causing issues when the document is added to the index.
Try eliminating all the suspicious fields and then re-indexing the entire thing again. See if the documents are still failing to be indexed. If this worked, you will want to start re-introducing the fields that you removed and see if you can pinpoint the one that is the issue.

SOLR 3.1 Indexing Issue

We are facing some issues with SOLR search.
We are using SOLR 3.1 with Jetty. We have set up the schema according to our requirements, and we have set up data-config.xml to import records into the collection (core) from our database (SQL Server 2005).
There are 320,000 records in the database that we need to import.
After the import finished, I tried to search for all records in the SOLR admin
http://localhost:8983/solr/Collection_201/admin/
It shows a total of 290,000 records found, so 30,000 records are missing.
Now I have the following questions:
How can I find out which records were not properly indexed or are missing? To find out, I tried a trick: I thought I would add a field to the database to mark which records have been imported into the SOLR collection and which have not. But the big question is how to update this database field during the import from data-config.xml, because the tag only allows search queries, in other words something that returns a result. So I had another idea for updating that database field: I created a stored procedure in my database that contains an update query to update the field, followed by a select query that simply returns 1 record to fulfill the requirement. But when I ran DIH with it, it returned the error message "Index Failed. Rollback all the changes" and nothing was imported. When I commented out the update query in the stored procedure, it worked. So it would not let me run an update query even from a stored procedure. I tried really hard to find a way to update the database from DIH but failed to find anything, so I abandoned the idea of updating the database.
I cleared the index and started importing the data again. This time I ran the import manually from the Solr admin import page, 5,000 records at a time. In the end, records were somehow still missing.
Is it possible that the data is not committed properly? I read in the documentation that the import page (http://localhost:8983/solr/Collection_201/dataimport?command=full-import&clean=false) automatically commits the imported data. But I have noticed that sometimes it does and sometimes it does not, which is really driving me crazy.
Now I am fully frustrated and have started wondering whether the way I am using SOLR is right. If I am right, is it reliable? If I am wrong, please tell me what my mistake is.
Please guide me on how we can easily sync the collection with our database and make sure it is 100% synced.
What field are you using for your IDs in Solr and the database? The id field needs to be unique, so if you have 30,000 records that have the same ID as some 30,000 other records, the later records will overwrite the earlier ones.
Also, when you run the data import handler, you can query it for status (?command=status), and that should tell you the total number of records imported on the last run.
The first thing I would do is check for non-unique IDs in your database with respect to the Solr id field.
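A quick way to run that check, sketched under the assumption that you can export the column used as the Solr id into a list (the function names and sample values are illustrative):

```python
from collections import Counter


def duplicate_ids(ids):
    """Return {id: count} for every id that occurs more than once."""
    counts = Counter(ids)
    return {i: n for i, n in counts.items() if n > 1}


def lost_to_overwrites(ids):
    """Number of records that would be overwritten by a same-id record."""
    ids = list(ids)
    return len(ids) - len(set(ids))


ids = ["a", "b", "a", "c", "b", "a"]
print(duplicate_ids(ids))
print(lost_to_overwrites(ids))
```

If lost_to_overwrites on the real export comes out close to the number of missing documents, duplicate IDs are the likely explanation.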
Also be aware that when one record in a batch is wrong, the whole batch gets rolled back. So if that happened 3 times and you are indexing 10K docs per batch, that would explain the 30,000 missing records.
At the time, I solved it with this: https://github.com/romanchyla/montysolr/blob/master/contrib/invenio/src/java/org/apache/solr/handler/dataimport/NoRollbackDataImporter.java
but there should be a better, more elegant solution than that. I don't know how to find the missing records in your case. But if you have indexed the IDs, you can compare the indexed IDs with the external source and find the gaps.
