Oracle Text - Index a BLOB Field (which contains PDF data) - database

Do any of you have any experience with using Oracle Text to search for content inside PDF files?
I have a table with a field called FILEDATA (BLOB).
I would like to do the following query:
SELECT id FROM ttc.contract_attachment WHERE CONTAINS(filedata, 'EXAMPLE') > 0;
However, I'm not too sure about the type of index to add to it.
I found the following code:
begin
ctx_ddl.create_preference('doc_lexer', 'BASIC_LEXER');
ctx_ddl.set_attribute('doc_lexer', 'printjoins', '_-');
end;
/
create index idxContentMgmtBinary on CMDEMO.CONTENT_INVENTORY(TEXT) indextype is ctxsys.context
parameters ('lexer doc_lexer sync (on commit)');
Ref: http://www.devx.com/dbzone/Article/21563/1954
I have no idea what BASIC_LEXER is. I'm at a bit of a loss. I shall endeavour to continue searching for an answer. Any help would be great.
Thanks.

I've used Oracle Text to index not only PDFs but other data like XML structures. Oracle has the concept of lexers, which take content and parse, tokenize and index the tokens. The BASIC_LEXER handles English text; there are other lexers for Chinese, Japanese, Korean, etc. The printjoins attribute allows you to index characters that are normally excluded, such as hyphens, quotes, etc.
The index you have defined above will work. Keep in mind that Oracle Text indexing is normally an asynchronous process: the commit occurs, and the document is indexed sometime later, so you need to synchronize the index as part of a scheduled job or the like. With the option "sync (on commit)" on your index, the document is indexed as part of the transaction instead. This is noteworthy mainly if you are indexing sizable PDF documents, since the commit then has to wait for the indexing to finish.
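If you stick with the default asynchronous behaviour instead, you have to call CTX_DDL.SYNC_INDEX yourself, typically from a scheduled job. A minimal sketch (the index and job names below are hypothetical):
begin
  -- bring the index up to date with rows committed since the last sync
  ctx_ddl.sync_index('idx_contract_attachment_filedata');
end;
/
-- or schedule it, e.g. every five minutes, with DBMS_SCHEDULER
begin
  dbms_scheduler.create_job(
    job_name        => 'sync_contract_attachment_idx',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'begin ctx_ddl.sync_index(''idx_contract_attachment_filedata''); end;',
    repeat_interval => 'FREQ=MINUTELY;INTERVAL=5',
    enabled         => TRUE);
end;
/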
I would recommend using progressive relaxation for any search you run, as it can begin with a restrictive search and expand out to a more generic one, giving the user results in decreasing order of relevancy. For instance:
<query>
<textquery lang="ENGLISH" grammar="CONTEXT"> cat dog
<progression>
<seq><rewrite>transform((TOKENS, "{", "}", " "))</rewrite></seq>
<seq><rewrite>transform((TOKENS, "{", "}", "AND"))</rewrite></seq>
<seq><rewrite>transform((TOKENS, "{", "}", "ACCUM"))</rewrite></seq>
</progression>
</textquery>
<score datatype="INTEGER" algorithm="COUNT"/>
</query>
The above query tokenizes the search keywords "cat dog" and attempts to find them as a phrase, then any document containing cat AND dog (not necessarily beside each other), then any document containing cat OR dog; documents containing both words are scored higher than documents containing just one. Furthermore, the structure automatically dedups the results as it returns them.
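To run a template like that against the table from your question, the XML is passed as the text query inside CONTAINS. A sketch (table and column names are taken from your question; the search terms are the "cat dog" example above):
SELECT id, SCORE(1)
  FROM ttc.contract_attachment
 WHERE CONTAINS(filedata, '<query>
   <textquery lang="ENGLISH" grammar="CONTEXT"> cat dog
   <progression>
   <seq><rewrite>transform((TOKENS, "{", "}", " "))</rewrite></seq>
   <seq><rewrite>transform((TOKENS, "{", "}", "AND"))</rewrite></seq>
   <seq><rewrite>transform((TOKENS, "{", "}", "ACCUM"))</rewrite></seq>
   </progression>
   </textquery>
   <score datatype="INTEGER" algorithm="COUNT"/>
   </query>', 1) > 0
 ORDER BY SCORE(1) DESC;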
All of that being said, you could simply define your index as:
create index idxContentMgmtBinary on CMDEMO.CONTENT_INVENTORY(TEXT)
indextype is ctxsys.context
parameters ('sync (on commit)');
and it would probably work very well for your needs. You would only need to change the behavior of the lexer if you have a need for doing so. I hope this helps.

Related

Implement morphological search using Solr

I am trying to implement morphological search using Solr.
Here's a quick intro to morphological search:
It means that the search algorithm considers all grammar forms of words when creating the search index and searching for the requested phrases.
For example, when indexing the word child, the system adds both child and children to the index. Similar rule applies to verbs: for bring, the system adds bringing, brought etc. Consequently, if a user searches for a phrase "children bring", the system will display all results containing child, children, bring, bringing, brought etc.
Here are my two options:
1) Lemmatize each token and use that at index time as well as do the same with the query string at search time.
I DON'T WANT to use this approach, since it would make my index inconsistent once I start supporting morphological search: documents indexed earlier would lack the lemma tokens. I don't want to reindex either.
2) Only at query time, find all variants of the lemma (e.g. the lemma of 'brought' is 'bring') and generate these as additional tokens through my Token Filter. This would serve morphological search without having to index/reindex anything.
Question:
Are there any good Java libraries which would give me the variants/inflections of a lemma (or of the root word, e.g. the lemma of 'brought' is 'bring')?
Something close to your requirement is using the Solr synonym dictionary and synonym filter. There you can add a base word like child and add variants like kid, children, baby.
A collection reload is required each time the dictionary is edited. A search for "kid" would then be performed on every variant of child.
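A minimal sketch of that setup (the field type name and file name are hypothetical; the filter classes are the ones stock Solr ships):
<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
with a line like this in synonyms.txt:
child, children, kid, baby
Reload the collection after each edit to synonyms.txt.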

Solr OR query on a text field

How to perform a simple query on a text field with an OR condition? Something like name:ABC OR name:XYZ so the resulting set would contain only those docs where name is exactly "XYZ" or "ABC"
Dug tons of manuals, cannot figure this out.
I use Solr 5.5.0
Update: Upgraded to Solr 6.6.0, still cannot figure it out. [The original post illustrated the issue with screenshots: several query forms returned results as expected, but one did not.]
There are many ways to perform an OR query. Below I have listed some of them; you can use whichever you prefer.
[Simple Query]
q=name:(XYZ OR ABC)
[Lucene Query Parser]
q={!lucene q.op=OR df=name v="XYZ ABC"}
Your syntax is right, but what you're asking for isn't what text fields are made for. A text field is tokenized (split into multiple tokens), and each token is searched by itself. So if the text inserted is "ABC DEF GHI", it will be split into three separate tokens, namely "ABC", "DEF" and "GHI". So when you're searching field:ABC, you're really asking for any document that has the token "ABC" somewhere.
Since you want to perform an exact match, you want to query against a field that is defined as a string field, as this will keep the value verbatim (including casing, so the matching will be case sensitive). You can tell Solr to index the same content into multiple fields by adding a copyField instruction, telling it to take the content submitted for field foo and also copy it into field bar, allowing you to perform both an exact match if needed and a more general search if necessary.
If you need to perform exact, but case insensitive, searches, you can use a KeywordTokenizer - the KeywordTokenizer does nothing, keeping the whole string as a single token, before allowing you to add filters to the analysis chain. By adding a LowercaseFilter you tell Solr to lowercase the string as well before storing it (or querying for it).
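A minimal sketch of such a field type and the copyField wiring (the field and type names here are made up):
<fieldType name="string_lowercase" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="name_exact" type="string_lowercase" indexed="true" stored="false"/>
<copyField source="name" dest="name_exact"/>
With this in place, q=name_exact:(ABC OR XYZ) matches only documents whose name is exactly "ABC" or "XYZ" (ignoring case).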
You can use the "Analysis" page under the Solr admin page to experiment and see how content for your field is being processed for each step.
After that, querying as string_field:ABC OR string_field:XYZ should do what you want (or string_field:(ABC OR XYZ), or a few other ways to express the same thing).

SQL Server Full Text Index Contains search exact match containing "it"

I'm fairly new to Full Text Index in SQL Server. It has been working really well for me; however, recently someone did an exact-match search for "IT Manager" and the "IT" part of the search seems to be ignored.
e.g.
SELECT * FROM CONTAINSTABLE(vCandidateSearch, SearchText, '"it manager"')
and
SELECT * FROM CONTAINSTABLE(vCandidateSearch, SearchText, '"manager"')
return the same results. What am I doing wrong?
The problem is that the fulltext engine sees "it" as a "noise" - or stop - word, and ignores it.
Assuming you're using SQL 2008+, then see the documentation here on stoplists and stopwords: https://msdn.microsoft.com/en-us/library/ms142551(v=sql.100).aspx
These are lists containing various "filler" words (e.g. "a" "the" "it" etc) in various languages, that are generally not useful in fulltext searches and are ignored.
My experience is that these default lists are great for searching larger bodies of text, but often not so useful for things like product (or indeed job) titles that need to be more specific.
You can create your own stoplists containing (or not) whatever stopwords are appropriate for your particular need.
For a job title search it may well be appropriate to use no stopwords at all for that particular column. You can choose which stoplist (containing stopwords) is associated with a particular fulltext index when the index is created. You can create an empty list if need be, and use it in an index on one column only (although you would have to adjust your queries to take this into account).
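A sketch of that approach for SQL 2008+ (the table, key index, and stoplist names here are hypothetical, and a fulltext catalog is assumed to exist already):
-- An empty stoplist: nothing is treated as a noise word, so "it" gets indexed.
CREATE FULLTEXT STOPLIST JobTitleStoplist;

-- Build the fulltext index for the job-title column against that stoplist.
CREATE FULLTEXT INDEX ON dbo.CandidateSearch (SearchText)
    KEY INDEX PK_CandidateSearch
    WITH STOPLIST = JobTitleStoplist;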
In the unlikely event you're on SQL 2005 or below, it uses a much more primitive system of "noise words" that are just held in a text file: https://msdn.microsoft.com/en-us/library/ms142551(v=sql.90).aspx
"" doesn't mean an exact match. It just looks for that phrase in the text.
If I have a value
The big red house
Example matches
"big red house"
"big"
"house"
"red house"
Example of a non match
"the big yellow"
If you need that only "The big red house" matches, then you might be better off creating a non-clustered index on that column and using a regular = predicate.
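A sketch of that alternative (names again hypothetical; the column has to fit within the index key size limit):
CREATE NONCLUSTERED INDEX IX_CandidateSearch_SearchText
    ON dbo.CandidateSearch (SearchText);

SELECT *
FROM dbo.CandidateSearch
WHERE SearchText = 'The big red house';  -- whole-value, exact match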

Multi-word synonym search in Solr

I'm trying to use a synonym filter to search for a phrase.
peter=> spider man, spiderman, Mary Jane, .....
I use the default configuration. When I put these synonyms into synonym.txt and restart Solr it seems to work only partially: It starts to search for "spider", "man", "spiderman", "Mary" and "Jane" but what I want to search for are the meaningful combinations - like "spider man", "Mary Jane" and "spiderman".
Yes sadly this is a well known problem due to how the Solr query parser breaks up on whitespace before analyzing. So instead of seeing "spider" before "man" in the token stream, you instead simply see each word on its own. Just "spider" with nothing before/after and just "man" with nothing before/after.
This is because most Solr query forms see a space as basically an "OR": they search for "spider OR man" instead of looking at the full text, analyzing it to generate synonyms, and then generating a query from that.
For more background, there's this blog post
There's a large number of solutions to this problem, including the following:
hon-lucene-synonyms. This plugin runs an analyzer before generating an edismax query over multiple fields. It's a bit of a black box, and I've found it can produce some complex query forms that cause weird performance and relevance bugs.
Lucidworks' autophrase query parser. By selectively autophrasing, this plugin lets you specify key phrases ("spider man") that should not be broken into OR queries and can have synonym expansion applied.
OpenSource Connections' Match query parser. It searches a single field using a query-specified analyzer that runs before the field is searched, and it also searches multi-word synonyms as phrases. My favorite, but disclaimer: I'm the author :)
Rene Kriegler's Querqy -- Querqy is a Solr plugin for query preprocessing rules. These rules can identify your key phrases and rewrite the query to non-multiterm form.
Roll your own: Learn to write your own query parser plugin and handle the problem however you want.
My usual strategy for this kind of problem is to use the synonym filter not to expand a search to include all of the possible synonyms, but to normalize them to a single form. I do this in both my index and query field analysis.
For example, with this line in my fieldType/analyzer block in schema.xml:
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
(Note the expand="false")
...and this line in my synonyms.txt:
spiderman, spider man, Mary Jane => peter
This way I make sure that any of these four values will be indexed and searched as "peter". For example, if the source document mentions "The Amazing Spider Man" it will be indexed as "The Amazing peter". When a user searches for "Mary Jane" it will search for "peter" instead, so it will match.
The important thing here is that because "Mary" is not one of the comma-separated synonyms, it won't be changed if it appears without "Jane" following. So searching for "Mary is amazing" will actually search for "Mary is amazing", and it will not match the document.
One important detail is that I chose a normalized form (e.g. "peter") that is only one word. I could have organized it this way:
peter, spiderman, spider man => Mary Jane
but because Mary Jane is two words, it may (depending on other features of my search) match the two words separately as well as together. By choosing a single-word form to normalize into, I make sure that my tokenizer won't try to break it up.
It's a known limitation within Solr / Lucene. Essentially you would have to provide an alternative form of tokenization so that specific space delimited words (i.e. phrases) are treated as single words.
One way of achieving this is to do this client side - i.e. in your application that is calling Solr, when indexing, keep a list of synonym phrases and find / replace those phrase values with an alternative (for example removing the spaces or replacing it with a delimiter that isn't treated as a token boundary).
E.g. if you have "Hello There" as a phrase you want to use in a synonym, then replace it with "HelloThere" when indexing.
Now in your synonyms.txt file you can have (for example):
Hi, HelloThere, Wotcha => Hello
Similarly, when you search, replace any occurrences of "Hello There" in the query string with HelloThere, and then they will be matched as a synonym of Hello.
Alternatively, you could use the AutoPhraseTokenFilter that LucidWorks created, available on github. This works by maintaining a token stream so that it can work out if a combination of two or more sequential tokens matches one of the synonym phrases, and if it doesn't, it throws away the first token as not matching the phrase. I'm not sure how much overhead this adds, but it seems a good approach - would be nice to have by default in Solr as part of the SynonymFilter.

Lucene search for a filename, using WordDelimiterFilterFactory

I'm indexing some data, including filenames. If I search for toto.pdf, a token "pdf" is created for the search.
What I want is, for an indexed filename like:
MySupercool123girlfriend.jpg
to be able to search for it with:
supercool
supercool123
123
girlfriend
jpg
So at index time it's pretty easy to use WordDelimiterFilterFactory so that tokens like these are created:
my
supercool
mysupercool
mysupercool123
supercool123
123
girlfriend
jpg
girlfriend.jpg
etc...
The matter is that at search time, I don't really know what I should do.
If I use WordDelimiterFilterFactory at search time, MySupercool123girlfriend.jpg would match even with toto.jpg because in both cases a token jpg is created.
toto.jpg should not be in the result list at all, so it's not a solution for me to have both results with the appropriate one having a better scoring
Have you any recommendation to index and search for filenames?
For this specific example of yours, i.e. if the search is for MySupercool123girlfriend.jpg and you want it to return only documents that contain the entire string, you can keep a copyField, say named filename_str, whose fieldType is string. String matching will ensure that you get an exact match. This could be a first-level "exact match" search you do.
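A sketch of that copyField wiring in the schema (assuming the tokenized field is named filename, as referenced below):
<field name="filename_str" type="string" indexed="true" stored="true"/>
<copyField source="filename" dest="filename_str"/>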
However, I am guessing that you would want a search for 123girlfriend.jpg to return the document containing MySupercool123girlfriend.jpg. You can do a 2nd level search for this. Beginning Solr 4.0 you can do a regex search like
q=filename_str:/.*123girlfriend.jpg/
(This regex query should also work for filename field itself, if you are using preserveOriginal=1 in WordDelimiterFilterFactory at index time.)
Else you can do a leading wild-card search, which works in earlier Solr versions too.
If you also want MySupercool.jpg to match MySupercool123girlfriend.jpg, then I guess you would have to manually do the work of the WordDelimiterFilterFactory and construct a regex query like
q=filename_str:/.*My.*Supercool.*.jpg/
Another issue is that jpg is going to match a lot of documents, so you may want to split the filename and the extension and keep them as separate fields.
Can you come up with a DisMax mm parameter that is meaningful for your use case?
See http://wiki.apache.org/solr/DisMaxQParserPlugin#mm_.28Minimum_.27Should.27_Match.29
E.g.
mm=100% and "MySupercool123girlfriend.jpg" would match only filenames that have all ["my", "supercool", "123", "girlfriend", "jpg"] terms in them
You can find a less strict expression that still gives relevant results. See http://lucene.apache.org/solr/4_1_0/solr-core/org/apache/solr/util/doc-files/min-should-match.html
