Solr not returning the complete rows

I am new to Solr.
I have the two issues below:
I am using Tomcat 6 and Oracle 10g as the database, with Solr 4. When I deploy solr.war in Tomcat I get an exception in the Tomcat console saying that the DataImportHandler class is not found. I have specified a Solr home, and a lib directory inside that Solr home contains all the jars.
So why does the solr.war still expect the jars to be in its own lib folder?
I have indexed the entity with full import.
I have a simple database table in Oracle with the typical emp_id, emp_name, emp_dept columns.
I have defined data-config.xml and currently have only one document and entity. I have updated schema.xml accordingly.
When I do a /select query, I get only emp_id in the xml/json output.
How do I say which fields I want in the response?
I have 2222222 rows in the database, but I get only 10 rows back, and I only get more rows if I edit the config xml. My database table can grow, so how do I get the complete set of rows?
The problem is that I cannot hard-code the number of rows required, which does not make sense since the rows keep increasing as transactions happen.
thanks,

1. Check whether the lib directory under your Solr home contains the apache-solr-dataimporthandler-x.y.z.jar and apache-solr-dataimporthandler-extras-x.y.z.jar files. Also check in solrconfig.xml whether that lib directory is configured correctly (see the sketch after this list).
2. Add stored="true" to the field definitions in schema.xml. If a field is not stored, it will not be shown in the output. If you want to return only certain fields, use the fl query parameter, e.g. fl=fieldName (see the sketch after this list).
3. The rows parameter controls how many results come back in a single response, but the numFound field in the result shows the total number of rows that matched the query. Rather than trying to return every row at once, page through the results by combining rows with the start parameter.
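For the first point, a minimal sketch of the solrconfig.xml entries involved (the dir path and exact jar names are assumptions that depend on your Solr 4 layout; adjust them to wherever the DIH jars actually live):

<!-- solrconfig.xml: load the DataImportHandler jars; dir is relative to the core's instanceDir -->
<lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />
<!-- register the DIH request handler and point it at the DIH config file -->
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>

Loading the jars this way removes the need to copy them into the webapp's own lib folder.

For the second point, the field definitions could look like this (the field names come from the question; the types are assumptions):

<!-- schema.xml: index and store all three columns so they can be searched and returned -->
<field name="emp_id"   type="string"       indexed="true" stored="true" required="true"/>
<field name="emp_name" type="text_general" indexed="true" stored="true"/>
<field name="emp_dept" type="string"       indexed="true" stored="true"/>

After re-indexing, a request such as /select?q=*:*&fl=emp_id,emp_name,emp_dept returns exactly those fields.

For the third point, a request like /select?q=*:*&rows=50&start=0 returns at most 50 documents, but the response still reports the full match count, roughly like this (trimmed):

<result name="response" numFound="2222222" start="0">
  <!-- only the first 50 documents appear here -->
</result>

To walk the whole table, page through by increasing start until it reaches numFound, rather than trying to set rows to the total row count.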

Related

Searching PDF files stored in database using SOLR

I have a lot of PDF files stored in a database (MSSQL) that I need to search. They are stored as BLOBs. I need a walk-through on how to search them using SOLR.
I have a DB, let's call it "fred". Inside fred is a table, we'll call it pdffiles. pdffiles has a column named pdfdata, of type BLOB.
The pdfs are stored in this table, with the binary data stored in the column. What steps do I take to get SOLR to extract this data and index it?
I'm guessing it involves the TikaEntityProcessor, but having the pdfs stored in the database rather than as regular files adds a level of complexity. I have previously worked with SOLR and have it running in production.
Sample dataconfig and schema files would be very useful.
What steps do I take to get SOLR to extract this data and index it?
Create a new file called tika-data-config.xml which will contain the database configuration and the query used to fetch the data.
Update solrconfig.xml in a text editor and register a data import handler for that file within the config tags (a sketch is given below).
You also need to reference the libs related to the data import handler and the Tika extraction contrib.
Provide the JDBC jar for your database as well.
Make the changes in schema.xml by defining your fields, adding the proper fieldType for each field depending on your search requirements.
Once the setup is ready you can ask Solr to start indexing
using http://localhost:8983/solr/collection1/dataimport?command=full-import
For more detail, please refer to the Solr documentation: Configure DIH.
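A minimal sketch of what those two files could look like for the "fred" database described above. The driver class, connection URL, credentials, the id column and the content field name are assumptions; only the pdffiles table and its pdfdata column come from the question.

<!-- tika-data-config.xml: read rows from the database and stream the BLOB column to Tika -->
<dataConfig>
  <dataSource name="db" driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://localhost;databaseName=fred"
              user="solr_user" password="secret"/>
  <dataSource name="blob" type="FieldStreamDataSource"/>
  <document>
    <entity name="pdffiles" dataSource="db" query="select id, pdfdata from pdffiles">
      <field column="id" name="id"/>
      <entity name="pdf" dataSource="blob" processor="TikaEntityProcessor"
              dataField="pdffiles.pdfdata" format="text">
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>

<!-- solrconfig.xml: the handler plus the jars it needs (paths depend on your installation layout) -->
<lib dir="../../dist/" regex="solr-dataimporthandler-.*\.jar"/>
<lib dir="../../contrib/extraction/lib" regex=".*\.jar"/>
<lib dir="/path/to/jdbc" regex="sqljdbc.*\.jar"/>
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">tika-data-config.xml</str>
  </lst>
</requestHandler>

In schema.xml, id would then be the uniqueKey and content a stored text field holding the extracted PDF body.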

Solr 4 - Delete fields from FieldsInfo file

I am running Solr 4. I created a million fields in Solr using a script. I saw GC activity go very high after adding these fields, because every time a searcher is opened these fields are loaded.
Now I want to go back to the state my Solr cluster was in before adding those fields. Even though I delete the documents that have those fields, the cluster does not come back to what it was, because the fields are not removed from the FieldsInfo file.
Is there a way to explicitly tell Solr to delete the fields from the FieldsInfo file?
There is a documented Schema API that can delete a field. However, I don't know whether it is already available in Solr 4; you should try it and see whether it works.

How to identify which files are missing in Solr indexing

I am using Solr 4.1 on my computer. The core dashboard shows Num Docs: 10961 and Max Doc: 10961. Looking at the source folder on the computer, there are 10965 files, i.e. 4 files more than were indexed. There should be an equal number of files in Solr and in the folder on my computer. Now I have the task of identifying the missing files and re-POSTing them for indexing. The indexed document id (always unique for each document) is the full path of the physical location, such as
"id":"E:\ABCFolder\1\test file.pdf"
I need someone's help for the following question:
What is the way to find out which of the posted source files are missing from the index?
Note: one tedious method I know of is to search for each file name from the source folder in the Solr query window and see whether the file is present.

lucene indexing security files

I'll try to briefly describe my problem and task.
My task is to create a search engine for different types of files (only text-based file types): pdf, word, odf, xml, but not html.
I have a little experience with Lucene; about a year ago I wrote a simple full-text search using Lucene and Hibernate Search. That was a simple project, but now I have a much more difficult search task.
We are using Java 1.7 and GlassFish 3, and I have to concentrate only on the server-side approach, not the client UI. These are my three major problems:
1) All files are stored on a WebDAV server, but information about each file (name, id, file type, etc.) is stored in a database (PostgreSQL), so when I create the index I need to use both. As the result of a query I only need to return the file id from the database. In summary: the content of a file is stored on the server, but the information about the file is stored in the database, so we must retrieve both.
2) The second problem is that each file has a level of secrecy, and the major complication is that this level is calculated dynamically. When calculating the security level for a file we consider several properties: the static properties are the file's location and the folder the file is in, but there is also dynamic information such as user profiles, user roles and departments. So when user "Maggie" is logged in she can search only the files "test.pdf", "test2.doc", etc., but if user "Stev" is logged in he has a different profile than Maggie, so he can only search for some phrase in the files "broken.pdf", "mybook.odt", "test2.doc", etc. I think that when a user searches for the phrase "lucene +solr", for example, we could search all indexed documents and filter the results afterwards, but I don't think that solution is very efficient. What if the result contains 100 files; do we then filter each file step by step? I do not see any other solution, so maybe you can help me: does Lucene or Solr have a mechanism to help?
3) The last problem is that some files are encrypted, so those files must be indexed only once, before encryption! But I think that if we index secured files we create a security issue, because every word from the file is tokenized.
I have no idea how to secure the Lucene documents and the index datastore. Is it possible?
I also wonder whether I need to use Solr for my search engine, or only Lucene and write my own search engine on top of it. As you can see, my problem is not with indexing and searching but with securing files and their secrecy levels.
Thanks for any hints and the time you spend on me.
For indexing both the file and the metadata of the file from the DB, check ExtractingRequestHandler.
You can pass the metadata attributes and the file to be indexed in a single request, and they will be stored as a single document in the Lucene index.
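A hedged sketch of how ExtractingRequestHandler might be wired up in solrconfig.xml (the lib paths and target field names are assumptions):

<!-- solrconfig.xml: Solr Cell / ExtractingRequestHandler for pushing files plus metadata -->
<lib dir="../../contrib/extraction/lib" regex=".*\.jar"/>
<lib dir="../../dist/" regex="solr-cell-\d.*\.jar"/>
<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">content</str>  <!-- extracted body text goes into an assumed "content" field -->
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>      <!-- unknown Tika metadata lands in ignored_* dynamic fields, assumed to exist in schema.xml -->
  </lst>
</requestHandler>

The database metadata (file id, name, type, etc.) can then be sent on the same request as literal.<field> parameters, e.g. literal.id=42&literal.filename=test.pdf, so file content and metadata end up in one document.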
For security, one option is to store the users/roles who have access to each file/document within the Solr index.
You can then always filter the results by the current user/role so that only the permitted results are retrieved (see the sketch below).
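One way to sketch that in schema.xml is a multiValued field holding the allowed roles (the field name acl_roles is an assumption):

<!-- schema.xml: hypothetical ACL field listing the roles/users allowed to see each document -->
<field name="acl_roles" type="string" indexed="true" stored="false" multiValued="true"/>

At query time the server-side code appends a filter query built from the logged-in user's roles, e.g. fq=acl_roles:(maggie OR sales_dept), so Solr only ever returns documents that user may see. Since your security level is computed dynamically, the roles stored in this field have to be re-indexed whenever the computation changes.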
Make your Solr URL secured so that users don't have direct access to the documents.
Also check SOLR-1872.
For encryption, Solr and the underlying parser Tika do provide handling for encrypted files when given additional parameters.
Apache Solr uses Apache Tika, which uses the Bouncy Castle generic encryption libraries for extracting text content and metadata from encrypted PDF files. See http://www.bouncycastle.org/ for more details on Bouncy Castle.
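Depending on the Solr/Tika version, Solr Cell can also be told the passwords of protected files; a hedged sketch (parameter availability varies by version, and the file path is an assumption):

<!-- solrconfig.xml: map file-name patterns to passwords so Tika can decrypt protected files -->
<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="passwordsFile">/path/to/passwords.properties</str>
  </lst>
</requestHandler>

A per-request resource.password parameter serves the same purpose for a single file.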

SOLR 3.1 Indexing Issue

We are facing some issues with SOLR search.
We are using SOLR 3.1 with Jetty. We have set up the schema according to our requirements, and we have set up data-config.xml to import records into the collection (core) from our database (SQL Server 2005).
There are 320,000 records in the database which we need to import.
After the import finished, when I search across all the records from the SOLR admin at
http://localhost:8983/solr/Collection_201/admin/
it shows a total of 290,000 found. So 30,000 records are missing.
Now the following questions are on my mind:
How can I know which records were not properly indexed, or which records are missing? To find out, I tried a trick: I thought I should add a field to the database marking which records have been imported into the SOLR collection and which have not. But the big question is how to update that database field during the import from data-config.xml, because the entity query only allows select statements, in other words something that returns rows. So I had another idea to still update that field: I created a stored procedure in my database which contains an update query that updates the field, followed by a select query that simply returns one record to satisfy the requirement. But when I tried to run the DIH with that, it returned an "Index Failed. Rollback all the changes" error message and nothing was imported. When I commented out the update query in the stored procedure, it worked. So it would not let me run an update query even from a stored procedure. I tried really hard to find a way to update the database from the DIH, but failed to find anything, so I gave up on the idea of updating the database.
I cleared the index and started importing the data again. This time I ran the import manually from the Solr admin import page, 5,000 records per run. At the end, some records were still missing.
Is it possible the data is not being committed properly? I read in the documentation that the import page (http://localhost:8983/solr/Collection_201/dataimport?command=full-import&clean=false) automatically commits the imported data, but I have personally noticed that sometimes it does and sometimes it does not, which is really driving me crazy.
Now I am thoroughly frustrated and have started wondering whether the way I am using SOLR is right at all. If I am doing it right, is it reliable? If I am doing it wrong, please point out my mistake.
Please guide me on how we can easily keep the collection in sync with our database and make sure it is 100% synced.
What field are you using for your IDs in Solr and in the database? The id field needs to be unique, so if you have 30,000 records that share IDs with some other 30,000 records, the later data will overwrite those records.
Also, when you run the data import handler, you can query it for status (?command=status) and that should tell you the total number of records imported on the last run.
The first thing I would do is check for non-unique IDs in your database with respect to the Solr id field (the uniqueKey, sketched below).
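For context, this is the schema.xml declaration that makes documents with the same id overwrite each other (the field name "id" here stands in for whatever your schema's uniqueKey actually is):

<!-- schema.xml: the uniqueKey; re-adding a document with the same id replaces the old one -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<uniqueKey>id</uniqueKey>

If the database column mapped to that field is not truly unique across all 320,000 rows, the duplicates silently collapse into single documents, which would account for the drop to 290,000.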
Also be aware that when one record in a batch is wrong, the whole batch gets rolled back. So if that happened 3 times and you were indexing 10K docs per batch, it would explain the missing 30,000.
At the time, I solved it: https://github.com/romanchyla/montysolr/blob/master/contrib/invenio/src/java/org/apache/solr/handler/dataimport/NoRollbackDataImporter.java
but there should be a better/more elegant solution than that. I don't know how to get the missing records in your case, but if you have indexed the ids, you can compare the indexed ids with the external source and find the gaps.
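A lighter-weight alternative to a custom importer, if partial imports are acceptable, is DIH's own onError attribute on the entity (the entity, table and column names below are placeholders):

<!-- data-config.xml: skip rows that fail instead of aborting and rolling back the whole import -->
<entity name="records" onError="skip"
        query="select id, name, dept from my_table">
  <field column="id" name="id"/>
</entity>

With onError="skip" a document that throws an exception is dropped and the import continues; "continue" logs the error and carries on, while the default "abort" stops the whole run and rolls back.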
