How to implement auto suggestion (auto complete) functionality in GAE - google-app-engine

I want to implement auto suggest functionality in Google App Engine (GAE/GWT).
The client side of the implementation works fine with GWT SuggestBox and RPC.
My main issue is the server side of the implementation. I tried the Google Search API, but it seems to be limited to 250MB of total indexed data, and it matches only complete words, not parts of words.
How should I approach this? I read that lucene or solr is not supported in GAE.
I would appreciate your thoughts on this.

You can achieve a basic text search using the techniques described here: http://googlecode.blogspot.com.br/2010/05/google-app-engine-basic-text-search.html
In short:
Build a query using content >= yourQuery && content < yourQuery + "\ufffd", where the content property of your entity can be a String or a List of Strings.

I've taken this approach and it works fine for me:
Split up the text into separate words. Get rid of duplicates, special characters and short words (in, of, and, etc.).
Add this list of words to the entity as a list property.
Search via text range query: listProperty >= wordPart && listProperty < wordPart + "\ufffd"
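
The steps above can be sketched in plain Python, with a sorted in-memory list standing in for the datastore index. This is only an illustration of the range filter, not datastore code: the names STOP_WORDS, index_words and prefix_search are hypothetical, and the range is evaluated with bisect instead of a real query.

```python
import bisect
import re

# Hypothetical stop-word list; the answer above only names "in, of, and" as examples.
STOP_WORDS = {"in", "of", "and", "the", "a", "to"}

def index_words(text, min_length=3):
    """Split text into lowercase words, dropping duplicates, punctuation, short words and stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return sorted({w for w in words if len(w) >= min_length and w not in STOP_WORDS})

def prefix_search(sorted_words, word_part):
    """Return every word w with word_part <= w < word_part + U+FFFD (the range filter from the answer)."""
    lo = bisect.bisect_left(sorted_words, word_part)
    hi = bisect.bisect_left(sorted_words, word_part + "\ufffd")
    return sorted_words[lo:hi]

words = index_words("The Google App Engine datastore supports range queries on strings")
print(prefix_search(words, "da"))  # words that start with "da"
```

The upper bound works because "\ufffd" sorts after any character likely to appear in a word, so the range covers exactly the strings that begin with the typed prefix.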

Related

Parsing paragraphs into separate documents in Solr using script

I would like to crawl through a list of sites using Nutch, then break up each document into paragraphs and send them to Solr for indexing.
I have been using the following script to automate the process of crawling/fetching/parsing/indexing:
bin/crawl -i -D solr.server.url=http://localhost:8983/solr/#/nutch -s ./urls/ Crawl 2
My idea is to attach a script in the middle of this workflow (probably the parsing stage of Nutch?) that would break up the paragraphs, like paragraphs.split(). How could I accomplish this?
Additionally, I need to add fields to each paragraph that show its numerical position in the document and the chapter it belongs to. The chapter is an h2 tag in the document.
Currently, there is no easy answer to your question. To accomplish this you need custom code. Nutch has two plugins for parsing HTML, parse-html and parse-tika, but both focus on extracting text content rather than structured data from the HTML document.
You would need a custom parser plugin (HtmlParserPlugin) that treats paragraph nodes within your HTML document in a custom way, extracting both the content and the positional information.
The other component you would need is a data model in Solr. Since you need to keep the position of each paragraph within its document, you also need to send this data in a way that is searchable in Solr, perhaps using nested documents (this really depends on how you plan to use the data).
For instance, you may take a look at this plugin, which implements custom logic for extracting data from the HTML using arbitrary XPath expressions.
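
The extraction logic such a parser would need can be sketched outside Nutch. The following is not a Nutch plugin, only a minimal Python illustration using the standard html.parser module; the class name ParagraphSplitter and the field names position, chapter and text are hypothetical.

```python
from html.parser import HTMLParser

class ParagraphSplitter(HTMLParser):
    """Collect <p> text together with its position and the most recent <h2> heading."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []      # one dict per paragraph, ready to send as a separate document
        self._chapter = None      # text of the latest <h2> seen so far
        self._current_tag = None  # "h2" or "p" while inside one of those tags
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "p"):
            self._current_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current_tag:
            return  # closing an inline tag such as </b>; keep collecting
        text = " ".join("".join(self._buffer).split())
        if tag == "h2":
            self._chapter = text
        elif text:
            self.paragraphs.append({
                "position": len(self.paragraphs) + 1,
                "chapter": self._chapter,
                "text": text,
            })
        self._current_tag = None

splitter = ParagraphSplitter()
splitter.feed("<h2>Intro</h2><p>First paragraph.</p><p>Second one.</p>"
              "<h2>Usage</h2><p>Third paragraph.</p>")
for doc in splitter.paragraphs:
    print(doc)
```

Each resulting dict could become one Solr document (or one child of a nested document), with position and chapter indexed as ordinary fields.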

Azure suggester returning all content

I'm trying to implement an Azure suggester feature into our pilot Azure search app and running into issues. The content I'm indexing are PDF files, so my suggester definition is based on the content field itself which can be thousands of lines of text. Following examples online, when I implement the suggester, I'm returned the entire content of the body of text from the PDF file. What I'd really like to do is return just a phrase found in the text.
For instance, suppose I'm indexing a Harry Potter book and I type "Dum" into my search field. I'd like to see suggested results like "Dumbledore", "Dementor", etc., rather than the whole book. Is this possible?
Thanks
If you want to search for words sharing the same prefix, Autocomplete is the right API for the job: https://learn.microsoft.com/en-us/rest/api/searchservice/autocomplete
In contrast, the Suggester API helps users find the documents containing words with that prefix. It returns text snippets containing those words.
If you still believe suggester api does not behave as expected and autocomplete is not suitable, let me know your source document, query and expected results.

Solr - Bringing back snippets from indexed data

I have a Solr/Lucene set up where I have indexed a set of documents (MS Word files) and can happily search the content of these documents. However I would like to return a snippet from within the content of the document which shows where the matching line (+/- 5 words from the match term) is. I have tried to follow a range of Google hits but my indexing does not seem to have a direct access to the "content".
Can anyone give me some basic and simple pointers to where I might have made any errors on this - I have based all my work so far on the guidance and examples of the Solr Reference Guide - so I am not sure if the issue is in the search parameters or the original index.
I am doing this to create a clear set of user requirements for building an end solution rather than creating the end solution myself, so I am no expert on the tools and do not need to become one, just need to evidence what is possible with this tool set.
As MatsLindh noted above, the issue was that the config was not carrying the actual content of the Tika parse over into a specific field, so there was no full text to display and highlight.
To resolve this I followed the link (https://lucene.apache.org/solr/guide/7_1/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler) to the guidance documents, reviewed the part on fmap, and used the example given for Last Modified Date as a guide on what to apply.
I then went to the solrconfig.xml file in the relevant core folder and added the following line beneath an existing fmap entry:
<str name="fmap.content">testcontent</str>
I had previously set up the testcontent field under the Solr web interface in my core. I then re-ran my indexing line via a command prompt, and that seemed to do the trick in terms of pulling out the basic content and wrapping it with a basic emphasis.
Thanks, all, for the input on this. There is still a lot more I want to test to help develop a clear requirement set, but this really helps prove that some of the basics are not complicated.
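
Once the content is in its own field, snippets can be requested with Solr's standard highlighting parameters. The sketch below only builds the request URL; it assumes a local Solr on the default port and a core named mycore (hypothetical), with testcontent being the field from the fmap line above.

```python
from urllib.parse import urlencode

# Assumptions: local Solr on the default port and a core called "mycore" (hypothetical).
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"

def highlight_query_url(term, field="testcontent", snippets=3, fragsize=80):
    """Build a select URL that asks Solr to return highlighted fragments of `field`."""
    params = {
        "q": f"{field}:{term}",
        "hl": "true",             # turn highlighting on
        "hl.fl": field,           # field to highlight; it has to be stored
        "hl.snippets": snippets,  # maximum number of fragments per document
        "hl.fragsize": fragsize,  # approximate fragment size in characters
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"

print(highlight_query_url("budget"))
```

Note that hl.fragsize is measured in characters, not words, so a window of roughly five words either side of the match can only be approximated by tuning this value.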

GAE Full Text Search: can only match exact word? how to search like contains(...)?

I just tried GAE (1.7.7, Java) Full Text Search and found that if the search string is work, it surprisingly does not match working, worked, hardworking, or homework. I'd like to know whether I am missing something in the API. I read the tutorial but did not find any documentation about this, except for plural matching.
Thanks.
P.S. I tried this in a unit test for the search service, not in a working environment.
Tucked away in the docs (but unfortunately not in the table of operators), there is a '~' operator:
To search for plural variants of an exact query, use the ~ operator:
~"car" # searches for "car" and "cars"
Not sure how far that will get you. Unfortunately, that's about it.
See https://developers.google.com/appengine/docs/java/search/overview#Queries_on_Fields
There is very little documentation on this, but having just tried it, it only works on plurals.
One approach would be to do your own stemming on the words in the document (though you wouldn't return that as the text ;-). Then you could perform stemming on your search term and be able to match worked, working, etc.
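
The stemming approach described above might look roughly like this. It is a deliberately naive suffix stripper written for illustration, not a real stemming algorithm, and the function names are hypothetical.

```python
def naive_stem(word):
    """Very rough suffix stripping; a real application would use a proper stemmer."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words such as "red" are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stemmed_tokens(text):
    """Stems to store in an extra document field next to the original text."""
    return sorted({naive_stem(w) for w in text.split()})

index_field = stemmed_tokens("She worked while he was working")
query = naive_stem("works")
print(query in index_field)  # the stemmed query is compared against the stemmed field
```

Stemming of this kind covers worked and working, but not hardworking or homework, which need substring matching rather than suffix removal.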
This is a late answer, but to follow up on the previous answer: what you want to do is not possible with the basic API functions. The Search API works on full-text searching principles, so it matches whole tokens. To get around this you can tokenise your searchable data before indexing and store the tokens in a field of the relevant document.
See: Partial matching GAE search API
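
A minimal sketch of that pre-index tokenisation, assuming you want "contains" style matching: every substring of each word becomes its own token. The function name partial_tokens and the minimum length of three are illustrative choices, not part of the Search API.

```python
def partial_tokens(text, min_length=3):
    """Return every substring of each word that is at least min_length characters long."""
    tokens = set()
    for word in text.lower().split():
        for start in range(len(word)):
            for end in range(start + min_length, len(word) + 1):
                tokens.add(word[start:end])
    return tokens

# The space-joined tokens would go into an extra text field of the indexed document.
tokens = partial_tokens("hardworking homework")
print("work" in tokens)
print(len(tokens))
```

The number of tokens grows quadratically with word length, which matters given the index size limits mentioned earlier; storing only prefixes is a cheaper variant if matching from the start of a word is enough.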

Stemming with GAE Full Text Search

In the new GAE API for Full Text Search, I can't find any option to activate stemming. I have tried searching for singular/plural words in my application, and searching for "document" does not return the same result set as searching for "documents". The same goes for accented characters: searching for "vehicule" or "véhicule" does not return the same result set.
Is there an option somewhere, either in the API or in the query language syntax, that I can use to activate stemming? Or do I have to build my own stemming by pre-processing the query and translating, for example, "document" into "(document OR documents)"?
This other SO question discusses the same issue. You should use the now-documented ~ operator.
you should assume charset type.
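
If you do end up pre-processing the query yourself, as the question suggests, a rough sketch could look like the following. The helper names are hypothetical, the plural rule only handles a trailing "s", and the same accent folding would have to be applied to the indexed text for the two sides to match.

```python
import unicodedata

def fold_accents(text):
    """Decompose characters and drop combining marks, so accented letters lose their accents."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def expand_query(term):
    """Rewrite one term into '(singular OR plural)', the manual expansion from the question."""
    folded = fold_accents(term.lower())
    singular = folded[:-1] if folded.endswith("s") else folded
    return f"({singular} OR {singular}s)"

print(expand_query("document"))
print(expand_query("Véhicules"))
```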
