Indexing document files - sql-server

I want to store and index document files like .doc, .pdf, .txt and than search this files over a basic web application and let users to download them.
I have found 2 ways. First one is store these files in MSSQL (filestream) and search them with the power of full-text search but this way is scared me because the backup file will be getting bigger and bigger. Second one is indexing these files with Windows Search Service and search them with remote query but this way lacks of full-text search power.
What is the best way to do this, is there any better alternative?
Thanks in advance.

You can process the build in functionality for fulltext search or build your own index. I would recommend building your own one as an Inverted Index for document search capabilities. A quite similar question was asked here: Building Big Search Egnine. Hopefully it helps you to further design what you want.

Related

Matching article text against pre-existing list of categories

I'm new to Azure Cognitive Services, and while I'm pretty sure it can help me solve my problem, I don't quite understand which part of it to use for it...
Here's what I want to do:
We have blog posts, say ~1k, and those blog posts all have categories and tags (multiple each). What I want to do, is to "guess" the right categories/tags for each article based on the content, and then present that to the editor as a suggestions at the time of input ("looks like this article is about: health, well-being, ..."). The ~1k articles we already have in the system are currently correctly tagged/categorized, so I'd like to use these a data source for this "guessing".
I've used Azure Search before, and it seems like some combination of EntityRecognition and KeyPhraseExtraction might be a way in the right direction? Azure Cognitive Services also seems to have an API that supports TextAnalytics that would do something similar. I'm a bit confused about why these are two different things (or are they not?)
This also seems like an entirely common problem (matching text against pre-defined categories based on other text that is categorized), so I'm wondering if I'm just missing an obvious solution here?
Thanks in advance.
I think the Azure Cognitive Text Analytics API is your best bet as you are looking for real-time analysis prior to tagging/categorizing for storage.
Text Analytics could return a list of named entities that you could map to your available tags/categories and present to the user.
Azure Cognitive Search requires an indexer and skillset to process target text with an end result of storing the processed results to an index specifically for searching.

lucene indexing security files

I'll try to briefly describe my problem and task.
My task is to create search engine for different types of file (only text file types) pdf, word, odf, xml but not html.
I have got little experience with lucene about year ago i wrote simple full text search using lucene and hibernate search. That was simple project. But now i have got very difficult task with searching.
We are using java 1.7 and glassfish 3 and i have to concentrate only server side approach not client ui. Ther is my three major problem :
1) All files is stored on webdav server, but information about file name , id file typ etc are stored into database (postgresql) so when i creating index i need to use both information. As a result of query i need only return file id from database. Summary content of file is stored in server but information about file is stored in database so we must retrieve both.
2) Secondary problem it that each file has a level of secrecy. But major problem is that this level is calculated dynamically. When calculating level of security for file we considering several properties. The static properties is files location, the folder in which the file is, but also dynamic information user profiles user roles and departments . So when user "Maggie" is logged she can search only files "test.pdf" , "test2.doc" etc but if user "Stev" is logged he have got different profiles such a Maggie so he can only search some phase in file "broken.pdf", "mybook.odt". test2.doc etc ..... . I think that when for example user search phase "lucene +solr" we search in all indexed documents and after that filtered result. But i think that solution is is not very efficient. What if results count 100 files , so what next we filtered step by step each files ? But i do not see any other solution. Maybe you can help me and lucene or solr have got mechanism to help.
3) Last problem is that some files are encrypted. So that files must be indexed only once before encryption ! But i think that if we indexed secure files so we get security issue. Because all word from that file is tokenized.
I have not got any idea haw to secure lucene documents and index datastore ? its possible ...
Also i have got question that i need to use Solr for my serarch engine or using only lucene and write own search engine ? So as you can see i have not got problem with indexing , serching but with security files and files secured levels.
Thanks for any hints and time you spend for me.
For Indexing both the File and Metadata of the file from DB check ExtractRequestHandler
You can pass the metadata attributes and the file to be indexed as a single request and it would be stored as a single document in lucene index.
For Security, One of the options is to store the Users/Roles who have access to the Files/Documents within the Solr index.
So you can always filter the results with the user/role to retrieve only the those results.
Make you Solr url secured so that Users don't have a direct access to the documents.
Also check for SOLR-1872
For encryption, Solr and underlying Parser Tika does provide handling for the Encrypted files by providing additional parameters.
Apache Solr uses the Apache Tika which uses the Bouncy Castle generic encryption libraries for extracting text content and metadata from encrypted PDF files. See http://www.bouncycastle.org/ for more details on Bouncy Castle.

appengine objectify text search

Maybe I am missing something but is there any way to use the new text search features as described in the 2011 presentation http://www.youtube.com/watch?v=7B7FyU9wW8Y (approx. 30min mark) with Objectify, Entities, and Java? I realize it is an experimental release but the text search features that are present don't seem to cover the full extent they discussed in the presentation. I don't want to have to write my own code to manage the creation, updates to documents. But I don't currently see another way??
Currently, you can't use Full Text Search to search through entities in Datastore; you'll need to create search documents in a search index to use the Full Text Search API, as described in these docs.

Firefox awesomebar database

I want to implement awesomebar like search functionality in my java application.
I like the way it searches the text patterns in between the words. Does it uses lucene library or performs like query on the database.
Which database do i choose for such type of pattern searching ?
A lot of Firefox is backed up by SQLite databases, including bookmarks and search history which are stored in places.sqlite. Although I don't have the source to check (which you could just download and have a look at how they do it), I suspect that since it's SQLite then there is probably something along the lines of a LIKE in there to do the sub-string matching.

Webapps: Storing and searching through user submitted blocks of text

Background:
I'm building a poetry site with user submitted content. The relevant user actions for my questions are that users can:
a. Go to fancysitename.com/view to see all poems so far
b. Go to fancysitename.com/submit to submit your own poem.
c. Go to fancysitename.com/apoemid to view a particular poem you've bookmarked before.
d. Go to fancysitename.com/search to enter a word to search for in all the poems.
All the poems are stored as text fields in a database and referenced by a poem id. So the "apoemid" in step c will be the primary key of the tuple and I'll just pull up the text after getting the key from the url.
Question:
The poems exist nowhere except in a database. My webapp is literally 4 html files. Will this approach affect my search engine rankings?
Is there a more efficient way to do 'd' rather than do a Select * on the db and manually parsing the text on the server? Each poem will be at the most 10 lines long, so I would imagine using a full text search engine like Lucerne will probably be overkill.
Caveat
I'm running this on the google app engine for now, so my database customization options are pretty limited. So while I'd certainly be interested in hearing about the ideal way to do this, this is a pet side project so my budget is limited :(
Thanks!
Edit: Apparently I don't google so well at 7am. I've since found a solution for question 2 here so please disregard question 2.
AppEngine currently doesnt support full text indexing, they do have a better than nothing SearchableModel.
Some details of SearchableModel can be found here:
http://groups.google.com/group/google-appengine/browse_thread/thread/f64eacbd31629668/8dac5499bd58a6b7?lnk=gst&q=searchablemodel
Regarding search engine ranking, yes having all your poems in the datastore can affect your ranking. This is generally overcome through the use of a sitemap. Here is an article about how StackOverflow uses a sitemap to help its search ranking.
http://www.codinghorror.com/blog/archives/001174.html
In most database engines, you can accomplish this kind of searching. For example MysQL does have full text searching. I am not sure how app engine works but you can always have a stored procedure does this search.
Where you store your data will not affect your site's ranking, only how you serve it up (on what URLs, etc). There's absolutely no way for an arbitrary search spider to tell where you store your data, and no reason for it to care, either.
Regardless of the length of your text, you will need full-text searching if you want to search inside a string. As Sam points out, SearchableModel ought to work just fine for that.

Resources