I'd like to perform a partial text / phrase search against a Datastore record field using Ruby.
I've figured out how to do it with a pair of inequality filters (>= the prefix, and < the prefix with "\ufffd" appended), but that only matches from the beginning of the field.
This works: querying for "Ener" returns "Energizer AA Batteries", but querying for "AA" does not.
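Here is the shape of what I have, sketched with the Python client for comparison (the kind and property names are placeholders; the Ruby client's chained where calls behave the same way):

from google.cloud import datastore

client = datastore.Client()

# Prefix match: values >= "Ener" and < "Ener" + the largest Unicode
# character are exactly the values that start with "Ener".
query = client.query(kind="Product")  # hypothetical kind
query.add_filter("name", ">=", "Ener")
query.add_filter("name", "<", "Ener" + "\ufffd")

results = list(query.fetch())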
The docs for the Python client's Search API describe the ability to manually create indexes that allow both atomic and partial word searches.
https://cloud.google.com/appengine/docs/standard/python/search/ says:
Tokenizing string fields
When an HTML or text field is indexed, its contents are tokenized. The string is split into tokens wherever whitespace or special characters (punctuation marks, hash sign, etc.) appear. The index will include an entry for each token. This enables you to search for keywords and phrases comprising only part of a field's value. For instance, a search for "dark" will match a document with a text field containing the string "it was a dark and stormy night", and a search for "time" will match a document with a text field containing the string "this is a real-time system".
In the docs for Ruby and PHP, I cannot find such an API reference to enable me to do the same thing. Is it possible to perform this type of query in Ruby / PHP with Cloud Datastore?
If so, can you point me to the docs? And if not, is there a workaround, for example creating indexes with the Python Search API and then configuring the PHP/Ruby client to execute its queries against those indexes?
I have a field containing short texts (a few tokens). I index it as Text rather than String because I need to search within the text.
However, I also need String-style search (matching the entire field).
For example, suppose a field contains "Google Search Engine". I currently find the row by searching "search engine"; while preserving this behavior, I need another option that matches the row only when the search term is "google search engine".
I believe this is possible with a regex, but that would presumably be slow.
I wonder if there is a standard way to do so or if I need to add another field of the same content but with the String type.
Use multiple fields - the definition of the second field will differ based on whether you want the search to be case sensitive or not. If you're OK with having a case sensitive field (i.e. "Google" and "google" are different terms), then string is the correct choice.
If you want the field to be case insensitive, use a TextField with a KeywordTokenizer (which keeps the input as a single, large token) with a LowercaseFilter attached (which lowercases the content).
You can then search both fields by using qf - query fields - with the edismax/dismax query parsers and score them differently. If you only need explicit searching (you choose yourself whether to match the whole string or just words within it), using the field name in the regular way will work.
Use a copyField instruction to index the same content into both fields without changing your indexing pipeline. You'll need to reindex your core / collection for the new field to get any values.
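A sketch of those pieces through the Schema API, in case it helps (the core name and field names here are made up; adjust to your schema):

import requests

SCHEMA = "http://localhost:8983/solr/mycore/schema"  # hypothetical core

requests.post(SCHEMA, json={
    # Case-insensitive whole-string type: KeywordTokenizer keeps the
    # value as one token, LowerCaseFilter lowercases it.
    "add-field-type": {
        "name": "string_ci",
        "class": "solr.TextField",
        "analyzer": {
            "tokenizer": {"class": "solr.KeywordTokenizerFactory"},
            "filters": [{"class": "solr.LowerCaseFilterFactory"}],
        },
    },
    # Second field holding the whole-string view of the content.
    "add-field": {"name": "title_exact", "type": "string_ci",
                  "indexed": True, "stored": False},
    # Index the same content into both fields.
    "add-copy-field": {"source": "title", "dest": "title_exact"},
}).raise_for_status()

After the reindex, a query like q=title_exact:"google search engine" will only match when the whole field matches.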
And no, you can't do this with a regex, since the regex is applied against the tokens. The tokens are already split up into smaller parts, so /foo bar/ has no foo bar token to match against - just foo and bar - and neither matches the regex.
I used the icu_tokenizer in a custom analyzer to create a search index for Japanese words, and the index was created successfully. I chose the icu_tokenizer because it works better for Asian languages than the default Azure Search tokenizer.
Now when I query for a string, e.g. 赤城, I see multiple search results (131 in total) from the index. But when I use a wildcard search with the same word, e.g. 赤城* (adding * at the end of the word) or /赤城.*/ (a regex search query), I see 0 search results. The weird part is that * seems to work with a single Japanese character: 赤* gives the same number of search results as 赤. But as soon as the word grows beyond one Japanese character, wildcard queries with * stop working and return 0 search results. I am testing all of these queries in Search explorer on the Azure portal with queryType=full (Lucene query syntax).
In my application search terms are normally used as prefix searches, so we append * to the end of the search string to fetch results, but these Lucene wildcard queries just do not work with Japanese characters. Any idea how I can make these prefix queries (with a wildcard * at the end of the search string) work when the search string is in Japanese?
Any quick help will be much appreciated!!
I tested this on my installation and can confirm that wildcards only work with Japanese content when you use a Japanese analyzer.
In my example I set up one index with a property Body that does not have a specific analyzer defined. Then I set up another index where Body uses the ja.microsoft language analyzer. The content in both indexes is identical. I then tried to search for 自動車 (automobile) with a trailing wildcard.
自動車* returns multiple hits from the index using the Japanese analyzer. No hits are returned from the index without a specific analyzer defined.
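For reference, this is roughly how I defined such an index with the Python azure-search-documents SDK (the service endpoint, key, index name, and fields are placeholders):

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex, SearchableField, SimpleField, SearchFieldDataType,
)

# Hypothetical service endpoint and admin key.
client = SearchIndexClient(
    endpoint="https://<service>.search.windows.net",
    credential=AzureKeyCredential("<admin-key>"),
)

index = SearchIndex(
    name="docs-ja",
    fields=[
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        # The Japanese analyzer is what makes trailing-wildcard
        # queries like 自動車* behave as expected.
        SearchableField(name="body", type=SearchFieldDataType.String,
                        analyzer_name="ja.microsoft"),
    ],
)
client.create_or_update_index(index)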
Sorry for the late reply.
Have you tried using one of the Japanese language analyzers, for example ja.microsoft?
Also, if you want prefix search, you can try experimenting with the suggester feature, which is designed to be efficient for this scenario.
While working with a search definition that looks like
search music {
    document music {
        field title type string {
            indexing: summary | attribute | index
        }
    }
}
if I use my own custom logic for tokenizing strings by developing a document processor (I save the processed tokens in the context of Processing), how do I store those tokens in the base index, and how are they mapped back to the original content of the field at recall time for a particular query? Do we solve this with ProcessingEndPoint? If yes, how?
First, you should almost certainly drop "attribute" for this field - "attribute" means the text will be stored in a forward store in memory in addition to creating an index for searching. That may be useful for structured data for sorting, grouping and ranking, but not for a free-text field.
Unnecessary details:
You can perform your own document processing by adding document processor components: http://docs.vespa.ai/documentation/docproc-development.html. Token information for indexing is stored as annotations over the text, which are consumed by the indexer: http://docs.vespa.ai/documentation/annotations.html
The code doing this in Vespa (called by a document processor) is https://github.com/vespa-engine/vespa/blob/master/indexinglanguage/src/main/java/com/yahoo/vespa/indexinglanguage/linguistics/LinguisticsAnnotator.java, and the annotations it adds, which are consumed during indexing are https://github.com/vespa-engine/vespa/blob/master/document/src/main/java/com/yahoo/document/annotation/AnnotationTypes.java. You'd also need to do the same tokenization at the query side, in a Searcher: http://docs.vespa.ai/documentation/searcher-development.html
However, there is a much simpler way to do this: You can plug in your own tokenizer as described here: http://docs.vespa.ai/documentation/linguistics.html: Create your own component subclassing SimpleLinguistics and override getTokenizer to return your implementation. This will be executed by Vespa as needed both on the document processing and query side.
The reason for doing this is usually to provide linguistics for languages other than English. If you do this, please consider contributing your linguistics code back to Vespa.
Do any of you have any experience with using Oracle Text to search for content inside PDF files?
I have a table, with a field called FILEDATA(blob).
I would like to do the following query:
SELECT id FROM ttc.contract_attachment WHERE CONTAINS(filedata, 'EXAMPLE') > 0;
However, I'm not too sure what type of index to add to it.
I found the following code:
begin
  ctx_ddl.create_preference('doc_lexer', 'BASIC_LEXER');
  ctx_ddl.set_attribute('doc_lexer', 'printjoins', '_-');
end;
/

create index idxContentMgmtBinary on CMDEMO.CONTENT_INVENTORY(TEXT)
  indextype is ctxsys.context
  parameters ('lexer doc_lexer sync (on commit)');
Ref: http://www.devx.com/dbzone/Article/21563/1954
I have no idea what BASIC_LEXER is. I'm at a bit of a loss. I shall endeavour to continue searching for an answer. Any help would be great.
Thanks.
I've used Oracle Text to index not only PDFs but other data like XML structures. Oracle has the concept of lexers, which take content and parse, tokenize, and index the tokens. The basic lexer handles English words; there are other lexers for Chinese, Japanese, Korean, etc. The printjoins attribute allows you to index characters that are normally excluded, such as hyphens, quotes, etc.
The index you have defined above will work. Keep in mind that Oracle Text indexing is normally an asynchronous process: the commit occurs, and the document is indexed sometime later, so you would need to synchronize the index as part of a scheduled job or the like. With the "sync (on commit)" option on your index, the document is instead indexed as part of the transaction. This is noteworthy mainly if you are indexing sizable PDF documents.
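If you go the scheduled-job route instead of "sync (on commit)", the job only needs to call ctx_ddl.sync_index periodically. A minimal sketch with the python-oracledb driver (connection details are placeholders; the index name is the one from your create statement):

import oracledb

# Hypothetical credentials and DSN.
conn = oracledb.connect(user="ctxuser", password="<password>",
                        dsn="dbhost/orclpdb1")
with conn.cursor() as cur:
    # Index any documents committed since the last sync.
    cur.execute("begin ctx_ddl.sync_index('idxContentMgmtBinary'); end;")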
I would recommend utilizing progressive relaxation for any search you may want to run, as it can begin with a restrictive search and expand out to a more generic one, providing the user with results of decreasing relevancy. For instance:
<query>
  <textquery lang="ENGLISH" grammar="CONTEXT"> cat dog
    <progression>
      <seq><rewrite>transform((TOKENS, "{", "}", " "))</rewrite></seq>
      <seq><rewrite>transform((TOKENS, "{", "}", "AND"))</rewrite></seq>
      <seq><rewrite>transform((TOKENS, "{", "}", "ACCUM"))</rewrite></seq>
    </progression>
  </textquery>
  <score datatype="INTEGER" algorithm="COUNT"/>
</query>
The above query tokenizes the search keywords "cat dog" and attempts to find them as a phrase, then any document containing cat AND dog (not necessarily beside each other), then any document containing cat OR dog; documents containing both words are scored higher than documents containing just one. Furthermore, the structure automatically dedups the results as it returns them.
All of that being said, you could simply define your index as:
create index idxContentMgmtBinary on CMDEMO.CONTENT_INVENTORY(TEXT)
indextype is ctxsys.context
parameters ('sync (on commit)');
and it would probably work very well for your needs. You would only need to change the behavior of the lexer if you have a need for doing so. I hope this helps.
I'm just starting with Python on Google App Engine, building a contact database. What is the best way to implement a wildcard search?
For example, can I do something like query('name =', '%ewman%')?
Unfortunately, Google App Engine can't do partial text matches.
From the docs:
Tip: Query filters do not have an explicit way to match just part of a string value, but you can fake a prefix match using inequality filters:
db.GqlQuery("SELECT * FROM MyModel WHERE prop >= :1 AND prop < :2", "abc", u"abc" + u"\ufffd")
This matches every MyModel entity with a string property prop that begins with the characters abc. The unicode string u"\ufffd" represents the largest possible Unicode character. When the property values are sorted in an index, the values that fall in this range are all of the values that begin with the given prefix.
App Engine can't do 'like' queries, because it can't do them efficiently. Nor can your SQL database, though: A 'foo LIKE "%bar%"' query can only be executed by doing a sequential scan over the entire table.
What you need is an inverted index. Basic fulltext search is available in App Engine with SearchableModel. Bill Katz has written an enhanced version here, and there's a commercial solution for App Engine (with a free version) available here.
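If you want to roll a simple inverted index yourself instead, a minimal sketch with the old db API (the Contact model and its properties are made up) is to store a list of lowercased tokens next to the record and filter on that:

from google.appengine.ext import db

class Contact(db.Model):
    name = db.StringProperty()
    # Inverted index: one entry per lowercased word in name.
    name_tokens = db.StringListProperty()

def save_contact(name):
    contact = Contact(name=name, name_tokens=name.lower().split())
    contact.put()

# Matches "Paul Newman" for the whole token "newman", though still
# not for the substring "ewman".
results = Contact.all().filter('name_tokens =', 'newman').fetch(20)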