Lucene.NET: search only the first page of a file (WPF)

I have a requirement where I need to search only the first page of a file. Currently I am using Lucene.NET with WPF to create the indexes and search the entire content of the file, and I am able to return results successfully. Now I need to search only the first page of the file: each document follows a standard proforma with a specific location where keywords are assigned. Can someone please guide me on this?

You may index different parts of the document into different fields, and then restrict your search to a specific field by name.
See this document explaining fields in Lucene.
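The idea can be sketched with a toy per-field inverted index (plain Python, no Lucene; the field names like firstPage and the document contents are made up for illustration). You would index the first page of each file under its own field and then search only that field:

```python
from collections import defaultdict

# Toy per-field inverted index: field name -> term -> set of doc ids.
# In Lucene.NET you would instead add a separate Field("firstPage", ...)
# to each Document and query that field by name.
index = defaultdict(lambda: defaultdict(set))

def add_document(doc_id, fields):
    """fields maps a field name (e.g. 'firstPage', 'body') to its text."""
    for field, text in fields.items():
        for term in text.lower().split():
            index[field][term].add(doc_id)

def search(field, term):
    """Search a single named field, ignoring all other content."""
    return sorted(index[field].get(term.lower(), set()))

add_document("doc1", {"firstPage": "invoice number 42", "body": "lots of other text"})
add_document("doc2", {"firstPage": "delivery note", "body": "invoice mentioned later"})

# Only doc1 has 'invoice' on its first page, even though doc2's body contains it.
print(search("firstPage", "invoice"))  # ['doc1']
```

The same query against the body field would match doc2 instead, which is exactly the separation you want between first-page keywords and full content.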

Related

How to associate subsection meta-data in SOLR searchable text

I'd like to make the text of a book searchable in SOLR, and I'd like to include the page number(s) where the matching text can be found in the original book.
I'm wondering what mechanisms SOLR might have to associate a page number with the searchable words of text? (To be clear, I'm talking about the page number of the original source text, not anything to do with SOLR result pagination.)
So in essence I need structured text, whereby each searchable word (ideally each letter, actually, because my real use case is more of a giant substring that may start anywhere within a word) has some associated metadata. I could put this information in an external datastore if necessary, but I wondered whether Solr has a way to do it natively.
If not, is there another tool better suited to this purpose than SOLR?
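One workable approach (a common pattern, not a built-in Solr feature for word-level metadata) is to index each page as its own document carrying a page-number field, so every hit already knows which page it came from. A toy sketch with made-up field names:

```python
# Toy sketch: treat each page as its own searchable unit carrying a page
# number, the way you might index one Solr document per page of the book.
# (Field names like 'page_number' are illustrative, not a Solr API.)
book_pages = [
    {"page_number": 1, "text": "Call me Ishmael some years ago"},
    {"page_number": 2, "text": "never mind how long precisely"},
    {"page_number": 3, "text": "some years before the mast"},
]

def pages_containing(word):
    """Return the page numbers of every page whose text contains the word."""
    word = word.lower()
    return [p["page_number"] for p in book_pages
            if word in p["text"].lower().split()]

print(pages_containing("years"))  # [1, 3]
```

For sub-word granularity this coarse scheme is not enough; finer-grained per-token metadata is more involved, and an external datastore keyed by page/offset may end up simpler.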

Difference between full text and free text search in solr (other search db)

New to search databases and working with one. What is the difference between full text and free text search/index?
They are essentially the same; more precisely, they are synonyms.
Both terms describe techniques search engines use to find results in a database.
Solr uses the Lucene project for its search engine. It is used when you have a large number of documents to search and cannot use LIKE queries against a normal RDBMS for performance reasons.
It mainly involves two stages: indexing and searching. The indexing stage scans the text of all the documents and builds a list of search terms. In the search stage, when performing a specific query, only the index is consulted, rather than the text of the original documents.
Suppose you typed John and Ryan: the query will return all the documents that contain either "John" or "Ryan". Order and case sensitivity do not matter.
In a nutshell, unless you are using the terms in a specific context, they are just different names for the same thing.
Call him Cristiano or CR7, he is the same person :)
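The two stages described above can be sketched with a minimal inverted index in plain Python (a toy, not how Lucene actually stores things): the indexing stage scans every document once and builds a term list; the search stage then consults only that list, never the original text.

```python
def build_index(docs):
    """Indexing stage: scan all documents once, building term -> doc ids."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query):
    """Search stage: OR the query terms together, consulting only the index."""
    hits = set()
    for term in query.lower().split():
        hits |= index.get(term, set())
    return sorted(hits)

docs = {
    "d1": "John went home",
    "d2": "Ryan stayed out",
    "d3": "neither of them",
}
idx = build_index(docs)
print(search(idx, "John Ryan"))  # ['d1', 'd2'] - order and case do not matter
```

Lowercasing at both index and query time is what makes the search case-insensitive, mirroring what a typical analyzer does.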

Why does Solr store the original/pre-analysis content of a field rather than just its index?

This question kind of makes it clear that I am new to Solr and all of its wonderful features. I apologise for my noobness.
But why does Solr store the original content in addition to the index? It just seems wasteful. I do realise that it stores the original content only if the field has the property stored="true".
Where does it store the original content? Does it reference the actual document somehow?
Also, Is there any way to directly view the index files saved by Solr for each collection?
Links will be appreciated.
If Solr didn't store the text, it wouldn't be able to return the text it found, making it impossible to do things like highlighting or to build an application that uses Solr's results directly. You'd have to look up the actual content somewhere else for each and every result, which might not be what you want. That content might not even be available; for example, if you're building a search engine, it wouldn't be effective to re-retrieve each page in a search result just to get the relevant information.
You can read up on the index file format in the API documentation for the Lucene60 codec; the stored fields are written using the stored fields format and live in the .fdt files in your index directory.
The index files are usually available in the data/index/ directory under the collection / core on disk:
data/index$ ls
_zq.fdt _zr.fdx _zs.si
...
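The contrast between what the inverted index knows and what a stored field can give back can be sketched like this (toy Python, nothing to do with Solr's actual on-disk format):

```python
# Toy contrast between 'indexed' and 'stored': the inverted index alone can
# tell you WHICH documents match, but only the stored copy can give the
# original text back for display or highlighting.
stored = {}   # doc id -> original text (what stored="true" keeps)
index = {}    # term -> doc ids (what the inverted index keeps)

def add(doc_id, text, store=True):
    if store:
        stored[doc_id] = text
    for term in set(text.lower().split()):
        index.setdefault(term, set()).add(doc_id)

def search(term):
    ids = sorted(index.get(term.lower(), set()))
    # Without a stored copy we can only return the id, not the content.
    return [(i, stored.get(i)) for i in ids]

add("a", "the quick brown fox", store=True)
add("b", "the quick red fox", store=False)
print(search("quick"))  # [('a', 'the quick brown fox'), ('b', None)]
```

Document "b" still matches the query, but its content is gone: that is the trade-off stored="false" buys you in exchange for a smaller index.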

How can we retrieve tokens of a particular property from search engine?

Community version. When content is added, the Alfresco search engine tokenizes properties (name, description) and stores the tokens in its indexes. I would like to know whether there is a way to retrieve the list of tokens associated with a particular piece of content.
E.g., fetch the tokens from the "Name" of the "abc.txt" content.
I see there are APIs exposed by Solr to get the overall status of the indexes and to fix transactions, but nothing that meets my needs.
I had a similar need: I had to find out what the tokenizer was doing to the indexes because a particular file name was not being found during search.
I finally used Luke, the Lucene index toolbox, which is described as:
Luke is a handy development and diagnostic tool, which accesses
already existing Lucene indexes and allows you to display and modify
their content in several ways:
browse by document number, or by term
view documents / copy to clipboard
retrieve a ranked list of most frequent terms
execute a search, and browse the results
analyze search results
selectively delete documents from the index
reconstruct the original document fields, edit them and re-insert to the index
optimize indexes
open indexes consisting of multiple parts, and/or located on Hadoop filesystem
and much more...
Simply open the index files and you can peek at how properties and data were tokenized.
As reported in this post, it can easily be used for Solr indexes as well.
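As a rough illustration of why an exact file name may not be found, here is a toy analyzer in the spirit of what a simple Lucene analysis chain might do to a name property (the actual Alfresco/Solr chain is configurable and differs):

```python
import re

def toy_analyze(value):
    """Lowercase and split on non-alphanumeric characters: the rough shape
    of a simple Lucene analysis chain (the real chain is configurable)."""
    return [t for t in re.split(r"[^a-z0-9]+", value.lower()) if t]

print(toy_analyze("abc.txt"))         # ['abc', 'txt']
print(toy_analyze("Report_2014-v2"))  # ['report', '2014', 'v2']
```

Searching for the literal "abc.txt" can fail while "abc" matches, because only the tokens are in the index; that is exactly the kind of thing Luke lets you confirm against the real index.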

Using Solr to store user specified information in documents

I have an application that contains a set of text documents that users can search for. Every user must be able to search based on the text of the documents. What is more, users must be able to define custom tags and associate them to a document. Those tags are used in two ways:
1) Users must be able to search for documents based on specific tag ids.
2) There must be facets available for the tags.
My solution was to add a multivalued field to each document to act as an array containing the tag ids that the document has been tagged with. So far so good: I was able to perform queries based on text and tag ids (for example text:hi AND tagIds:56).
My question is: would that solution work in production in an environment where users add but also remove tags from documents? Remember, the data has to be available in real time, so whenever a user removes or adds a tag I have to reindex that document and commit immediately. If that's not a good solution, what would be an alternative?
Stack Overflow uses Solr, in case you doubt Solr's abilities in production.
And although I couldn't find much information on how they implemented tags, I don't think your approach sounds wrong. Yes, tagged documents will have to be reindexed (which means a slight delay), but other than that I don't see anything wrong with it.
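The multivalued-tag setup can be sketched like this (toy Python with made-up names; in Solr the counts would come from faceting on the tagIds field rather than being computed by hand):

```python
from collections import Counter

docs = [
    {"id": 1, "text": "hi there", "tagIds": [56, 7]},
    {"id": 2, "text": "hi again", "tagIds": [56]},
    {"id": 3, "text": "something else", "tagIds": [7]},
]

def search(text_term, tag_id=None):
    """A text:hi AND tagIds:56 style query over the toy collection."""
    hits = [d for d in docs if text_term in d["text"].split()]
    if tag_id is not None:
        hits = [d for d in hits if tag_id in d["tagIds"]]
    return [d["id"] for d in hits]

def facet_tags(text_term):
    """Facet counts for tagIds over the documents matching the text query."""
    counts = Counter()
    for d in docs:
        if text_term in d["text"].split():
            counts.update(d["tagIds"])
    return dict(counts)

print(search("hi", 56))  # [1, 2]
print(facet_tags("hi"))  # {56: 2, 7: 1}
```

Adding or removing a tag here just means updating one document's tagIds list and re-searching, which mirrors the reindex-and-commit cycle described above.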
