How to identify which files are missing in Solr indexing

I am using Solr 4.1 on my computer. The core dashboard shows Num Docs: 10961 and Max Doc: 10961. However, the source folder on my computer contains 10965 files, i.e. 4 more files than were indexed. There should be an equal number of files in Solr and in the folder on my computer. My task now is to identify the missing files and re-post them for indexing. Each indexed document's id (always unique per document) is the full path of the physical file, such as
"id":"E:\ABCFolder\1\test file.pdf"
I need help with the following question:
How can I find out which of the posted source files are missing from the index?
Note: One tedious method that I know of is to search for each file name from the source folder in the Solr query window and see whether the file is present.
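One way to avoid that manual lookup is to script the comparison: pull every indexed id from Solr, list every file under the source folder, and take the difference. Below is a minimal sketch in Python, assuming Solr runs at http://localhost:8983/solr/collection1 and that the id field holds the full file path as shown above; the core name and folder path are illustrative and need adjusting.

    import os
    import requests  # pip install requests

    SOLR_SELECT = "http://localhost:8983/solr/collection1/select"  # adjust to your core
    SOURCE_DIR = r"E:\ABCFolder"                                    # adjust to your source folder

    # 1. Collect every indexed id (the id is the full file path).
    indexed_ids = set()
    start, rows = 0, 1000
    while True:
        resp = requests.get(SOLR_SELECT, params={
            "q": "*:*", "fl": "id", "wt": "json",
            "start": start, "rows": rows,
        }).json()
        indexed_ids.update(doc["id"] for doc in resp["response"]["docs"])
        start += rows
        if start >= resp["response"]["numFound"]:
            break

    # 2. Collect every file path under the source folder.
    disk_paths = set()
    for root, _dirs, files in os.walk(SOURCE_DIR):
        for name in files:
            disk_paths.add(os.path.join(root, name))

    # 3. The difference is the set of files that were never indexed.
    for path in sorted(disk_paths - indexed_ids):
        print(path)

The printed paths are the files that still need to be re-posted for indexing.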

Related

Searching PDF files stored in database using SOLR

I have a lot of PDF files stored in a database (MSSQL) that I need to search. They are stored as BLOBs. I need a walkthrough on how to search them using SOLR.
I have a DB, let's call it "fred". Inside fred is a table, we'll call it pdffiles. pdffiles has a column named pdfdata, of type BLOB.
The pdfs are stored in this table, with the binary data stored in the column. What steps do I take to get SOLR to extract this data and index it?
I'm guessing it involves the TikaEntityProcessor but having the pdfs stored in the database rather than just being regular files adds a level of complexity. I have previously worked with SOLR and have it running in production.
Sample dataconfig and schema files would be very useful.
What steps do I take to get SOLR to extract this data and index it?
Create a new file called tika-data-config.xml which will contain the database configuration and the query that fetches the data.
You need to update solrconfig.xml in a text editor and register the data import handler (pointing at tika-data-config.xml) within the config tags.
You need to reference the libs related to the data import handler.
Provide the JDBC driver jar for your database.
Make the changes in the schema.xml file by declaring your fields. Add the proper fieldType for each field depending on your search requirements.
Once the setup is ready you can request Solr to start indexing
using http://localhost:8983/solr/collection1/dataimport?command=full-import (a scripted example follows below).
Please refer to the Solr documentation for more details: Configure DIH
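The import can also be triggered and monitored from a script instead of the browser. A minimal sketch, assuming the core is named collection1 and the dataimport handler is registered as above:

    import time
    import requests  # pip install requests

    DIH = "http://localhost:8983/solr/collection1/dataimport"  # adjust to your core

    # Kick off a full import (same effect as opening the URL in a browser).
    requests.get(DIH, params={"command": "full-import", "wt": "json"})

    # Poll the handler until it is no longer busy, then print its status messages.
    while True:
        status = requests.get(DIH, params={"command": "status", "wt": "json"}).json()
        if status.get("status") != "busy":
            print(status.get("statusMessages", {}))
            break
        time.sleep(5)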

how to implement file search functionality?

I am working on file search engine functionality. I need your suggestions for designing my application.
I am using Elasticsearch as the framework to implement my functionality.
My primary feature is to enable file search based on file name, file type, size and date of creation. I also need to enable searching based on the content of the files.
Please suggest the best possible way to do the indexing and extract the file data.
Also, since files can be deleted/updated, I would need to regenerate the index at some time interval, so how can I monitor any change in the directory?
I am using SAMBA as my file storage system.
To be able to search file content you need to index the files into an Elasticsearch index.
Look into the Mapper Attachments plugin; it will help you index the files and make them searchable.
Step01: Install the plugin into your Elasticsearch cluster.
Step02: Convert the files to byte[] and send them to the Elasticsearch index (see the sketch below).
Step03: Now you can search the file content using normal queries.
Note: This will work only for text-based files like PDF, Word (doc, docx) and plain text files. If a PDF contains text in images, that text will not be searchable.
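A minimal sketch of Step02, assuming Elasticsearch is running locally with the Mapper Attachments plugin installed and an index named files whose mapping has a content field of type attachment plus the metadata fields; all index, type and field names here are illustrative:

    import base64
    import os
    import requests  # pip install requests

    ES = "http://localhost:9200"
    path = "/mnt/samba/share/report.pdf"  # illustrative file on the Samba share

    # The attachment type expects the raw bytes base64-encoded inside the JSON document.
    with open(path, "rb") as fh:
        encoded = base64.b64encode(fh.read()).decode("ascii")

    stat = os.stat(path)
    doc = {
        "file_name": os.path.basename(path),
        "file_type": os.path.splitext(path)[1].lstrip("."),
        "size": stat.st_size,
        "created": stat.st_ctime,
        "content": encoded,          # field mapped as type "attachment"
    }

    # Index the document; the plugin extracts the text server-side via Tika.
    requests.put(ES + "/files/document/1", json=doc)

    # Later, search the extracted content with a normal match query.
    resp = requests.get(ES + "/files/document/_search",
                        json={"query": {"match": {"content": "quarterly report"}}})
    print(resp.json()["hits"]["total"])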

lucene indexing security files

I'll try to briefly describe my problem and task.
My task is to create a search engine for different types of files (text-based file types only): pdf, word, odf, xml, but not html.
I have a little experience with Lucene: about a year ago I wrote a simple full-text search using Lucene and Hibernate Search. That was a simple project. But now I have a very difficult search task.
We are using Java 1.7 and GlassFish 3, and I have to concentrate only on the server-side approach, not the client UI. These are my three major problems:
1) All files are stored on a WebDAV server, but information about each file (name, id, file type etc.) is stored in a database (PostgreSQL), so when I create the index I need to use both pieces of information. As a result of a query I only need to return the file id from the database. In summary: the content of a file is stored on the server, but information about the file is stored in the database, so we must retrieve both.
2) The second problem is that each file has a level of secrecy, and the major issue is that this level is calculated dynamically. When calculating the security level of a file we consider several properties. The static properties are the file's location and the folder the file is in, but there is also dynamic information: user profiles, user roles and departments. So when user "Maggie" is logged in she can search only the files "test.pdf", "test2.doc" etc., but if user "Stev" is logged in he has a different profile than Maggie, so he can only search for some phrase in the files "broken.pdf", "mybook.odt", "test2.doc" etc. I think that when, for example, a user searches for the phrase "lucene +solr" we could search all indexed documents and filter the results afterwards. But I think that solution is not very efficient. What if the result contains 100 files, do we then filter each file step by step? I do not see any other solution. Maybe you can help me; perhaps Lucene or Solr has a mechanism to help.
3) The last problem is that some files are encrypted, so those files must be indexed only once, before encryption! But I think that if we index secured files we get a security issue, because every word from such a file is tokenized.
I have no idea how to secure Lucene documents and the index datastore. Is that even possible?
I also wonder whether I need to use Solr for my search engine, or only Lucene and write my own search engine. So, as you can see, my problem is not with indexing or searching but with securing files and their secrecy levels.
Thanks for any hints and the time you spend on this.
For indexing both the file and the metadata of the file from the DB, check the ExtractingRequestHandler.
You can pass the metadata attributes and the file to be indexed in a single request, and they will be stored as a single document in the Lucene index (a sketch follows below).
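A minimal sketch of such a request, assuming a core named collection1 with the extract handler enabled; the metadata field names (id, file_type) are illustrative and must exist in your schema:

    import requests  # pip install requests

    EXTRACT = "http://localhost:8983/solr/collection1/update/extract"

    # Metadata pulled from the database is passed as literal.* parameters;
    # Tika extracts the text from the uploaded file itself.
    params = {
        "literal.id": "42",            # the database file id to return in results
        "literal.file_type": "pdf",
        "commit": "true",
    }

    with open("test.pdf", "rb") as fh:
        requests.post(EXTRACT, params=params,
                      files={"file": ("test.pdf", fh, "application/pdf")})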
For security, one of the options is to store the users/roles who have access to the files/documents within the Solr index.
So you can always filter the results by the user/role to retrieve only the documents that user is allowed to see (see the filter query sketch below).
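For example, a filter query on a hypothetical multi-valued allowed_roles field, applied on top of whatever the user actually searched for:

    import requests  # pip install requests

    SELECT = "http://localhost:8983/solr/collection1/select"

    # The user's query, restricted to documents visible to the logged-in user's roles.
    resp = requests.get(SELECT, params={
        "q": "lucene solr",
        "fq": "allowed_roles:(accounting OR management)",  # roles of the current user
        "fl": "file_id",   # only the database file id is returned
        "wt": "json",
    }).json()
    print(resp["response"]["numFound"])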
Make your Solr URL secured so that users don't have direct access to the documents.
Also check SOLR-1872.
For encryption, Solr and the underlying parser Tika do provide handling for encrypted files through additional parameters.
Apache Solr uses Apache Tika, which uses the Bouncy Castle generic encryption libraries for extracting text content and metadata from encrypted PDF files. See http://www.bouncycastle.org/ for more details on Bouncy Castle.

How do I set up a new Solr core using data from an existing core?

I saw that there was a similar question asked 3 years ago, but I figure it's OK to duplicate as 1) the existing q is 3 years old and 2) I have different problems and a different version of Solr.
Here's the story. I was given a copy of the "Index" directory of an existing Solr core by a collaborator. I am trying to set up my own core locally using that index. The existing core was from a Solr 4.1.0 installation. (I have tried, and failed, to set up both Solr 4.3.1 and Solr 4.1.0.) I'm running Solr with Jetty.
What's the problem, you ask? Well, I replace the config files (schema.xml and solrconfig.xml) in the default example core with the ones my collaborator gave me. And then I run Jetty. This creates a new Index folder. I delete the contents of that Index folder and copy in the contents of the Index folder I was given.
The result is that Solr gives me an error indicating that "segments" files cannot be found. So I noticed that there are two files (segments.gen and segments_1) that are created with the initial Index folder. I experimented with leaving those in the Index folder but replacing everything else. Now Solr seems to be working (the browser interface is working) but it reports "Num docs: 0" and a *:* query gives me 0 results.
Anyone have any ideas? I'm happy to provide more info. Thanks in advance.
You have to use the segments.gen and segments_1 files from the original index. Ask your collaborator to give you those files as well. But since you mentioned that the collaborator gave you a copy of the index folder, you should already have them.
Note that segments_1 is not necessarily the file present in your original index copy; it can be segments_N. Whatever segments_ file is in the original copy, copy that into the new index and restart Jetty.
segments.gen records the current generation (the _N in segments_N) in the index, as a fallback in case directory listing of the files fails to locate the segments_N file (eg on filesystems, like NFS, where the directory listing may come from a stale cache)

Solr - not returning me the complete rows

I am new to Solr.
I have the below 2 issues:
I am using Tomcat 6, Oracle 10g as the database, and Solr 4. When I deploy solr.war in Tomcat I get an exception in the Tomcat console that the DataImportHandler class is not found. I have specified the Solr home and a lib directory in my Solr home that contains all the jars.
Still, why does the solr.war expect the jars to be put in its lib folder?
I have indexed the entity with full import.
I have a simple database table in Oracle with the typical columns emp_id, emp_name, emp_dept.
I have defined data-config.xml and currently have only one document and entity. I have updated schema.xml accordingly.
When I do a /select query, I get only emp_id in the XML/JSON output.
How do I say what fields I want in the response?
I have 2222222 rows in the database, but I get only 10 rows back, and only if I edit the config XML do I get the specified number of rows. My database table can grow, so how do I get all the rows?
The problem is, I cannot specify the number of rows required, which does not make sense since the rows keep increasing as transactions happen.
thanks,
1. Check whether the lib directory in the Solr home contains the apache-solr-dataimporthandler-x.y.z.jar and apache-solr-dataimporthandler-extras-x.y.z.jar files. Also check in the solrconfig.xml file whether the lib directory is configured correctly.
2. Add stored="true" to the field definitions in the schema.xml file. If you don't set stored="true" then your fields will not be shown in the output. If you want to return only certain fields, you can use the fl=fieldName query option.
3. When you set the rows parameter it will return that many results, but you can also find a numFound field in the result that shows the total number of rows found for the given query (see the paging sketch below).
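To fetch every row no matter how large the table grows, page through the results with start and rows until numFound is reached. A minimal sketch, assuming a core named collection1 and that the emp_* fields are stored:

    import requests  # pip install requests

    SELECT = "http://localhost:8983/solr/collection1/select"

    all_docs = []
    start, rows = 0, 500
    while True:
        resp = requests.get(SELECT, params={
            "q": "*:*",
            "fl": "emp_id,emp_name,emp_dept",  # fields to return (must be stored)
            "start": start,
            "rows": rows,
            "wt": "json",
        }).json()
        all_docs.extend(resp["response"]["docs"])
        start += rows
        if start >= resp["response"]["numFound"]:
            break

    print(len(all_docs), "documents fetched")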
