I am working on file search engine functionality and need your suggestions for designing my application.
I am using Elasticsearch as the framework to implement this functionality.
My primary feature is to enable file search based on file name, file type, size, and date of creation. I also need to enable searching based on the content of the file.
Please suggest the best possible way to do the indexing and extract the file data.
Also, since files can be deleted or updated, I will need to regenerate the index at some interval, so how can I monitor changes in a directory?
I am using Samba as my file storage system.
To enable search over file content you need to index the files into an Elasticsearch index.
Look into the Mapper Attachments plugin; it will help you index the files and make them searchable.
Step 01: install the plugin into your Elasticsearch cluster.
Step 02: convert each file to a byte[] and send it to the Elasticsearch index.
Step 03: now you can search the file content using normal queries (see the sketch below).
Note: this will work only for text-based files such as PDF, Word (doc, docx) and plain text files. If a PDF contains text inside images, that text will not be searchable.
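A minimal sketch of steps 02 and 03 using the Python client against the ES 2.x-era API that the Mapper Attachments plugin targets; the index name `files`, type name `file`, and the attachment field layout are assumptions:

```python
# Rough sketch: index a file through the Mapper Attachments plugin
# and search the extracted text. Index/type/field names are assumed.
import base64
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def index_file(path, doc_id):
    # Step 02: send the file bytes base64-encoded; the plugin runs
    # Apache Tika server-side to extract the text.
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode("ascii")
    es.index(index="files", doc_type="file", id=doc_id,
             body={"filename": path, "file": data})

index_file("report.pdf", "1")

# Step 03: a normal match query; depending on the plugin version the
# extracted text lives in "file" itself or in "file.content".
hits = es.search(index="files", body={
    "query": {"match": {"file.content": "quarterly revenue"}}
})
print(hits["hits"]["total"])
```

Note the plugin only extracts what you send it: re-sending updated files and deleting index entries for removed files remains your job.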
I'm trying to use Azure Cognitive Search to search CSV content stored in an Azure blob. The problem is that I have multiple files in the blob container; for example, I have files a, b, c, d, and so on. Is there a way to search only in file a? I'm thinking about adding one more column to my CSV files to store the file name, but I'd like to know if there is an easier way to do that.
The blob indexer can automatically pull certain metadata about your file, including the file name, and include it in your Azure Search document, as long as those fields are included in your index definition:
https://learn.microsoft.com/en-us/azure/search/search-howto-indexing-azure-blob-storage#indexing-blob-metadata
If you make those fields filterable, you'll be able to issue a search query with a filter that only includes the desired file(s).
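For illustration, a minimal sketch with the `azure-search-documents` Python SDK, assuming the indexer has mapped `metadata_storage_name` into a filterable index field; the endpoint, key, and index name are placeholders:

```python
# Query a single blob's content by filtering on the file-name
# metadata captured by the blob indexer.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

client = SearchClient(
    endpoint="https://<service>.search.windows.net",
    index_name="csv-index",
    credential=AzureKeyCredential("<query-key>"),
)

results = client.search(
    search_text="some term",
    filter="metadata_storage_name eq 'a.csv'",
)
for doc in results:
    print(doc)
```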
Scenario:
Blob storage: contains PDF, Word, and image files (about 70 files)
I used the default fields and predefined skills to create an Azure Search instance through the Azure Portal.
But the results when querying any text in these files are not very good. I made content and key phrases searchable and retrievable. I tried the Lucene analyzers, but they were not a great help.
The main concern is that if I type even a single letter, for example "u", in the Search explorer, it returns the file. As far as I understand, there is no such word in my files, so what is it matching?
How do I refine the search, and how do I manipulate the results?
I am not an expert in document processing, so I am using the unstructured documents in the blob instead of JSON-formatted documents.
Another thing: how do I define a field in the index, say chapter name or title name, that relates to the PDF chapters/titles?
Please suggest some ideas or example links. I am using .NET Core to develop this.
Use a custom skillset to extract the fields you require, and make sure those fields are defined in the index.
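As a sketch of the custom-skill side (shown here as a Python Azure Function; the values/recordId/data shape is the documented custom-skill contract, while the first-line title heuristic is just a placeholder assumption):

```python
# Minimal custom skill: receives enriched documents from the
# skillset and returns a "title" field extracted from the content.
import json
import azure.functions as func

def main(req: func.HttpRequest) -> func.HttpResponse:
    body = req.get_json()
    results = []
    for record in body.get("values", []):
        content = record["data"].get("content", "")
        # Placeholder heuristic: first non-empty line as the title.
        title = next(
            (ln.strip() for ln in content.splitlines() if ln.strip()), ""
        )
        results.append({
            "recordId": record["recordId"],
            "data": {"title": title},
            "errors": None,
            "warnings": None,
        })
    return func.HttpResponse(
        json.dumps({"values": results}),
        mimetype="application/json",
    )
```

Map the skill's output to an index field via the indexer's output field mappings, and mark that field searchable/retrievable like any other.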
I get 15+ PDFs a day that I have to enter into a database. They are generated from a table where the "blanks" are filled in from specific table fields. Are there any tools or Python code examples I could use to develop a means of extracting the data from the PDFs, either writing it to the database or creating a table to import into the database? The database is currently an Access .mdb.
Thanks
There are a number of approaches that will work.
One simple approach is to print the PDF file out to a text file and then have Access import that text. All recent versions of Windows let you install a "text" printer that outputs the printing of a document to a text file. You can have Access "process" a folder of PDFs, print them to text, and then import those text files. You might need some VBA to remove "pages" and some extra lines before you import the data into Access.
Another approach is to use Word (automated from Access) to open a PDF. When Word opens a PDF, it converts it to a Word document. This approach will even format rows as a Word table. You can then pluck out that table data and send it to Access. You can likely pull that text out without writing the data to a text file, or just use Word's "save-as" to a text file (you can automate this process from Access).
Another approach is to use the free Ghostscript library, which can extract text from a PDF (I would consider this if you did not have Word at your disposal).
So which solution is best will depend largely on the software installed on the computer running Access. Opening the PDF files with Word would be my first choice to test.
At my old job we used Cogniview, which converted PDF to Excel spreadsheets quite quickly. If you want to use Python, a quick search yielded this, which seems straightforward enough: PDF to XLS with Python.
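A rough Python sketch of that extract-and-load pipeline, assuming the PDFs contain real text (not scans) and that pdfplumber's table detection fits your layout; the driver string, paths, table, and column names are all placeholders:

```python
# Pull table rows out of a generated PDF and insert them into an
# Access table over ODBC.
import pdfplumber
import pyodbc

conn = pyodbc.connect(
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=C:\data\records.mdb;"
)
cur = conn.cursor()

with pdfplumber.open(r"C:\inbox\form_0001.pdf") as pdf:
    for page in pdf.pages:
        for table in page.extract_tables():
            for row in table[1:]:  # skip the header row
                cur.execute(
                    "INSERT INTO Records (FieldA, FieldB) VALUES (?, ?)",
                    row[0], row[1],
                )

conn.commit()
conn.close()
```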
I'll try to briefly describe my problem and task.
My task is to create a search engine for different types of files (only text-based types): pdf, word, odf, xml, but not html.
I have a little experience with Lucene: about a year ago I wrote a simple full-text search using Lucene and Hibernate Search. That was a simple project, but now I have a much more difficult search task.
We are using Java 1.7 and GlassFish 3, and I have to concentrate only on the server-side approach, not the client UI. Here are my three major problems:
1) All files are stored on a WebDAV server, but information about the file (name, id, file type, etc.) is stored in a database (PostgreSQL), so when I create the index I need to use both sources. As the result of a query I need to return only the file id from the database. In summary: the content of a file is stored on the server, but the information about the file is stored in the database, so we must retrieve both.
2) The second problem is that each file has a level of secrecy, and the major complication is that this level is calculated dynamically. When calculating the security level of a file we consider several properties: static ones such as the file's location and the folder the file is in, but also dynamic information such as user profiles, user roles, and departments. So when user "Maggie" is logged in she can search only files "test.pdf", "test2.doc", etc., but user "Stev" has a different profile than Maggie, so he can search for phrases only in "broken.pdf", "mybook.odt", "test2.doc", etc. I think that when, for example, a user searches for the phrase "lucene +solr", we could search all indexed documents and then filter the results, but that solution is not very efficient. What if the result contains 100 files; do we then filter each file step by step? I do not see any other solution. Maybe you can help me, and Lucene or Solr has a mechanism for this.
3) The last problem is that some files are encrypted, so those files must be indexed only once, before encryption! But I think that if we index secured files we get a security issue, because every word from such a file is tokenized.
I have no idea how to secure Lucene documents and the index datastore. Is it even possible?
I also wonder whether I need to use Solr for my search engine, or should I use only Lucene and write my own? As you can see, my problem is not with indexing or searching, but with securing files and their secrecy levels.
Thanks for any hints and for the time you spend on this.
For indexing both the file and the file's metadata from the DB, check the ExtractingRequestHandler.
You can pass the metadata attributes and the file to be indexed in a single request, and they will be stored as a single document in the Lucene index (see the sketch below).
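A minimal sketch of such a request in Python; the core name, field names, and literal values are assumptions:

```python
# Post a file plus its DB metadata to Solr's ExtractingRequestHandler
# in one request; Tika extracts the content server-side.
import requests

params = {
    "literal.id": "42",             # file id from PostgreSQL
    "literal.filename": "test.pdf",
    "literal.filetype": "pdf",
    "commit": "true",
}
with open("test.pdf", "rb") as f:
    resp = requests.post(
        "http://localhost:8983/solr/files/update/extract",
        params=params,
        files={"file": ("test.pdf", f, "application/pdf")},
    )
resp.raise_for_status()
```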
For security, one option is to store the users/roles who have access to the files/documents within the Solr index.
That way you can always filter the results by user/role so that only permitted results are retrieved.
Make your Solr URL secured so that users don't have direct access to the documents.
Also check for SOLR-1872
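A minimal sketch of that role-based filtering with pysolr; the field name `allowed_roles` and the role values are assumptions, and since your secrecy levels are computed dynamically, the stored roles would need to be reindexed whenever they change:

```python
# Filter search results to documents whose allowed_roles field
# intersects the current user's roles; return only the DB file id.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/files")

def search_as(user_roles, phrase):
    fq = "allowed_roles:(%s)" % " OR ".join(
        '"%s"' % role for role in user_roles
    )
    return solr.search(phrase, fq=fq, fl="id")

for doc in search_as(["sales", "managers"], "lucene solr"):
    print(doc["id"])
```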
For encryption, Solr and the underlying parser Tika do provide handling for encrypted files via additional parameters.
Apache Solr uses Apache Tika, which uses the Bouncy Castle generic encryption libraries for extracting text content and metadata from encrypted PDF files. See http://www.bouncycastle.org/ for more details on Bouncy Castle.
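For example, assuming the resource.password parameter that the Solr Reference Guide documents for password-protected PDF/OOXML files (core and file names are placeholders):

```python
# Supply the document password so Tika can decrypt and extract
# the text during indexing.
import requests

with open("secret.pdf", "rb") as f:
    requests.post(
        "http://localhost:8983/solr/files/update/extract",
        params={
            "literal.id": "43",
            "resource.password": "s3cret",  # password for this PDF
            "commit": "true",
        },
        files={"file": ("secret.pdf", f, "application/pdf")},
    ).raise_for_status()
```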
I am using the CakePHP framework. I would like to retrieve the contents of an uploaded file and store it in the database, so that I can search the content of the uploaded file.
I was made aware that file_get_contents works for plain .txt files, but all my documents are .docx and .pdf files. Is there any solution I could use?
I appreciate any help.
Thanks
A quick Google search got me this:
http://davidwalsh.name/read-pdf-doc-file-php
It appears you can search those files, but it requires some additional packages.
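To illustrate the extraction step itself (shown in Python for brevity; the linked post covers PHP-side options, and the package choice here, python-docx and pdfplumber, is an assumption):

```python
# Extract plain text from .docx, .pdf, or .txt so it can be stored
# in a database column and searched there.
import pdfplumber
from docx import Document

def extract_text(path: str) -> str:
    lower = path.lower()
    if lower.endswith(".docx"):
        doc = Document(path)
        return "\n".join(p.text for p in doc.paragraphs)
    if lower.endswith(".pdf"):
        with pdfplumber.open(path) as pdf:
            return "\n".join(
                page.extract_text() or "" for page in pdf.pages
            )
    with open(path, encoding="utf-8", errors="ignore") as f:
        return f.read()
```

Once the text is in a column, a simple LIKE query or a full-text index on that column gives you the search.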