appengine objectify text search

Maybe I am missing something, but is there any way to use the new text search features described in the 2011 presentation http://www.youtube.com/watch?v=7B7FyU9wW8Y (approx. 30min mark) with Objectify, entities, and Java? I realize it is an experimental release, but the text search features that are present don't seem to cover the full extent of what was discussed in the presentation. I don't want to have to write my own code to manage the creation of and updates to documents, but I don't currently see another way?

Currently, you can't use Full Text Search to search through entities in Datastore; you'll need to create search documents in a search index to use the Full Text Search API, as described in these docs.
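To make that concrete, here is a minimal sketch of maintaining a search document alongside an entity with the Java Search API (the index name, field name, and the idea of calling indexEntity from an Objectify @OnSave hook are illustrative assumptions, not part of the original answer):

import com.google.appengine.api.search.Document;
import com.google.appengine.api.search.Field;
import com.google.appengine.api.search.Index;
import com.google.appengine.api.search.IndexSpec;
import com.google.appengine.api.search.Results;
import com.google.appengine.api.search.ScoredDocument;
import com.google.appengine.api.search.SearchServiceFactory;

public class EntitySearchSketch {
    // The index name "entities" is an illustrative assumption.
    private static final Index INDEX = SearchServiceFactory.getSearchService()
            .getIndex(IndexSpec.newBuilder().setName("entities").build());

    // Call this whenever the entity is saved, e.g. from an Objectify @OnSave
    // hook, to keep the search document in sync with the entity by hand.
    public static void indexEntity(long entityId, String text) {
        Document doc = Document.newBuilder()
                .setId("entity-" + entityId)  // reuse the datastore id so hits map back
                .addField(Field.newBuilder().setName("body").setText(text))
                .build();
        INDEX.put(doc);
    }

    public static void search(String query) {
        Results<ScoredDocument> results = INDEX.search(query);
        for (ScoredDocument hit : results) {
            System.out.println(hit.getId());
        }
    }
}

As the answer says, nothing wires this up for you: keeping the document in sync with the entity on create, update, and delete is code you have to write yourself.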

Related

Matching article text against pre-existing list of categories

I'm new to Azure Cognitive Services, and while I'm pretty sure it can help me solve my problem, I don't quite understand which part of it to use for this...
Here's what I want to do:
We have blog posts, say ~1k, and those blog posts all have categories and tags (multiple each). What I want to do is "guess" the right categories/tags for each article based on the content, and then present them to the editor as suggestions at the time of input ("looks like this article is about: health, well-being, ..."). The ~1k articles we already have in the system are currently correctly tagged/categorized, so I'd like to use these as the data source for this "guessing".
I've used Azure Search before, and it seems like some combination of EntityRecognition and KeyPhraseExtraction might be a step in the right direction? Azure Cognitive Services also seems to have an API that supports TextAnalytics that would do something similar. I'm a bit confused about why these are two different things (or are they not?).
This also seems like an entirely common problem (matching text against pre-defined categories based on other text that is categorized), so I'm wondering if I'm just missing an obvious solution here?
Thanks in advance.
I think the Azure Cognitive Text Analytics API is your best bet as you are looking for real-time analysis prior to tagging/categorizing for storage.
Text Analytics could return a list of named entities that you could map to your available tags/categories and present to the user.
Azure Cognitive Search requires an indexer and skillset to process the target text, with the end result of storing the processed results in an index built specifically for searching.
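As a rough sketch of that real-time flow, using the azure-ai-textanalytics Java client (the endpoint, key, and sample text are placeholders; mapping the results onto your own tag taxonomy is left out):

import com.azure.ai.textanalytics.TextAnalyticsClient;
import com.azure.ai.textanalytics.TextAnalyticsClientBuilder;
import com.azure.ai.textanalytics.models.CategorizedEntity;
import com.azure.core.credential.AzureKeyCredential;

public class TagSuggester {
    public static void main(String[] args) {
        // Endpoint and key are placeholders for your Cognitive Services resource.
        TextAnalyticsClient client = new TextAnalyticsClientBuilder()
                .credential(new AzureKeyCredential("<api-key>"))
                .endpoint("https://<resource>.cognitiveservices.azure.com/")
                .buildClient();

        String article = "Regular exercise and a balanced diet improve well-being...";

        // Key phrases are often closer to blog-style tags than named entities are.
        for (String phrase : client.extractKeyPhrases(article)) {
            System.out.println("key phrase: " + phrase);
        }

        // Named entities come with a category you could map to your taxonomy.
        for (CategorizedEntity entity : client.recognizeEntities(article)) {
            System.out.println(entity.getText() + " -> " + entity.getCategory());
        }
    }
}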

how to add custom fields in azure search

Scenario:
Blob storage: contains pdf, word, image files (about 70 files)
I used default fields and predefined skills to create an Azure search instance through the Azure Portal.
But the results for querying any text in these files are not very good. I made content and key phrases searchable and retrievable. I tried to use Lucene analyzers, but that was not a great help.
The main concern is that if I type even a single letter, for example "u", in the search explorer, it returns the file. As per my understanding, there is no such word in my files, so what is it doing?
How do I refine the search? And how do I manipulate the results?
I am not an expert in document processing, so I am using the unstructured documents in blob storage instead of JSON-formatted documents.
Another thing: how do I define a field in the index, say chapter name or title, that maps to the PDF's chapter/title names?
Please suggest some ideas or some example links. I am using .NET Core to develop this.
Use a custom skillset to extract the fields you require, and make sure those fields are defined in the index.
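A sketch of the index-definition side with the azure-search-documents Java client (the index name and the chapter_name/title fields are assumptions based on the question; a custom skill in your skillset still has to populate them during enrichment):

import java.util.Arrays;

import com.azure.core.credential.AzureKeyCredential;
import com.azure.search.documents.indexes.SearchIndexClient;
import com.azure.search.documents.indexes.SearchIndexClientBuilder;
import com.azure.search.documents.indexes.models.SearchField;
import com.azure.search.documents.indexes.models.SearchFieldDataType;
import com.azure.search.documents.indexes.models.SearchIndex;

public class CreateIndexExample {
    public static void main(String[] args) {
        SearchIndexClient client = new SearchIndexClientBuilder()
                .endpoint("https://<service>.search.windows.net")
                .credential(new AzureKeyCredential("<admin-key>"))
                .buildClient();

        // chapter_name and title are the custom fields the question asks about;
        // your custom skill must write values into them during enrichment.
        SearchIndex index = new SearchIndex("docs-index", Arrays.asList(
                new SearchField("id", SearchFieldDataType.STRING).setKey(true),
                new SearchField("content", SearchFieldDataType.STRING).setSearchable(true),
                new SearchField("chapter_name", SearchFieldDataType.STRING).setSearchable(true),
                new SearchField("title", SearchFieldDataType.STRING).setSearchable(true)));

        client.createOrUpdateIndex(index);
    }
}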

Generate a series of documents based on SQL table

I am trying to formulate a proposal for an application that allows a user to print a batch of documents based on data stored in a SQL table. The SQL table indicates which documents are due and also contains all demographic information. This is outside of what I normally do, and I am trying to see if there is a platform/application that already exists to do such a task.
For example
List of all documents: Document #1 - Document #10
Person 1 is due for document #: 1,5,7,8
Person 2 is due for document #: 2,6
Person 3 is due for document #: 7,8,10
etc
Ideally, what I would like is for the user to be able to push a button and get a printed stack of documents that have been customized for each person, including basic demographic info like name, DOB, etc.
Like I said at the top, I already have all of the needed information in a database; I am just trying to figure out the best approach to move that information onto a document.
I have done some research and found that some people have used mail merge in Word or Access as a front end, but I don't know if this is the best way. I've also found this document. Any advice would be greatly appreciated.
If I understand your problem correctly, it is two-fold: first, you need to find a way to generate documents based on data (mail merge), and second, you might need to print them too.
For document generation you have two basic approaches: template-based, or programmatically from scratch. I suppose that you will opt for a template-based approach, which basically means that you design (in MS Word) a template document (Word, RTF, ...) that contains placeholders and other tags designating the »dynamic« parts of the document. Then, at document generation time, you need a .NET library/processor to which you pass this template document and the data; the processor will populate the template with the data and return the resulting document.
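To make the template-based idea concrete, here is a deliberately minimal sketch (plain-text {{placeholder}} tags, shown in Java for brevity; the same few lines translate directly to .NET, and a real Word/RTF template requires one of the processors discussed below):

import java.util.Map;

public class MergeExample {
    // Replace {{placeholder}} tags in a template with values from one data row.
    static String merge(String template, Map<String, String> row) {
        String result = template;
        for (Map.Entry<String, String> field : row.entrySet()) {
            result = result.replace("{{" + field.getKey() + "}}", field.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String template = "Dear {{name}}, our records show your date of birth as {{dob}}.";
        System.out.println(merge(template, Map.of("name", "Person 1", "dob", "1980-01-01")));
    }
}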
One way to achieve this functionality would be to employ MS Word's native mail merge, but you should know that this involves Office COM and Word Application Automation, which should almost always be avoided.
Another option is to build such a system on top of the Open XML SDK. This is a valid option, but it will be a pretty demanding task and will most probably cost you much more than buying a commercial .NET library that does mail merge out of the box – been there, done that. But of course, the good side here is that you will be able to tailor the solution to your needs. If you go down this road, I recommend that you use Content Controls for tagging documents/templates. The solution with CCs will be much easier to implement than a solution with bookmarks.
I'm not very familiar with the open source solutions, and I'm not sure how many there are that can do mail merge. One I know of is FlexDoc (on CodePlex), but its problem is that it uses a construct (XmlControl) for tagging that is deprecated in Word 2010+.
Then there are commercial solutions. Again, I don't know them in detail, but I know that the majority of them are general-purpose document processing libraries. Our company has been using this document generation toolkit for some time now, and I can say it covers all our »template-based document generation« needs. It doesn't require MS Word at doc generation time, has a really helpful add-in for MS Word, and you only need several lines of code to integrate it into your project. Templating is very powerful, and you can set up a template in a very short time. While templates are Word documents, you can generate PDF or XPS docs as well. XPS is useful because you can use the .NET/WPF printing framework, which works with XPS docs, to print documents. This is a very high-end solution, but of course the downside is that it is not free.

Index content of PDFs with Solr and Tika

The problem briefly: I would like Sitecore to index the contents of PDFs using Solr's built-in functionality (supplied by Tika). I'm not sure how to configure Sitecore's indexing to use this feature in Solr (Tika). (I think I need to write a custom indexer.)
I'm working with Sitecore 7 (7.1 Update 1) and want to index content from PDFs (or other rich media types). I'd like to index this data for search purposes.
I have Solr (4.6.1) installed and working with Sitecore 7. When I index my site it saves all of the documents to the correct Solr core, and I can successfully retrieve these documents for display.
Using curl, I can send a PDF to my Solr instance and get it indexed.
curl "http://localhost:8983/solr/update/extract?literal._id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=#sample.pdf"
This works, and I can read this content in my Sitecore web project and display it in views, so I know I can get access to this data. However, I would like the data to be attached to the items that I have uploaded in Sitecore.
I'd like something like this to happen when I upload a PDF to the Sitecore Media Library and publish the item, or at least when I re-index the site.
I'm currently walking through the following tutorial to learn some things about writing custom indexing (here is a link to part 1):
http://www.sitecore.net/Community/Technical-Blogs/Getting-to-Know-Sitecore/Posts/2013/04/Sitecore-7-Search-Provider-Part-1-Manually-Triggered-Indexing.aspx
Thanks for your patience.
For Sitecore, when handling media data, Lucene and Solr needed to index the content in a consistent way (so that you could switch between them if needed and still get data indexed in the same way). As Tika integration is very much a Solr thing, it was decided that both should use the general Windows concept of IFilters for indexing (http://en.wikipedia.org/wiki/IFilter).
This means that as long as you have the correct IFilter for that MIME type installed on the machine that is doing the indexing, the '_content' computed field will be populated with the output.
This doesn't mean you can't use the Solr Tika integration but it is not supported by default and would be a customization.
It would be very simple to:
Disable the '_content' computed field
Set up a publish pipeline processor that looks at each item being published
Check if it is a media item
Check to see if it is a PDF
Issue a command to push the content to your Solr server for indexing by Tika (see the SolrJ sketch just below).
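That push is essentially the same /update/extract call shown with curl earlier in this thread. A sketch with SolrJ (the core URL and literal id are assumptions; from Sitecore code you would more likely use SolrNet's equivalent ExtractCommand, mentioned at the end of this thread):

import java.io.File;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class PushPdfToSolr {
    public static void main(String[] args) throws Exception {
        // Core URL and document id are illustrative.
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/media").build();

        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("sample.pdf"), "application/pdf");
        req.setParam("literal.id", "doc1");           // key the document by the media item id
        req.setParam("uprefix", "attr_");             // prefix unknown Tika fields
        req.setParam("fmap.content", "attr_content"); // map the body text to attr_content
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

        solr.request(req);
        solr.close();
    }
}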
You may want to see what results you get by using an IFilter; if they are close enough to what you want, then you can go with that. If Tika produces better results for you, you should be able to switch to it, although you would probably have your media content indexed in a separate Solr core, so you would lose any Sitecore-specific metadata around the document.
Some blog posts that might be helpful:
http://www.samjgriffin.com/blog/2013/11/06/sitecore-7-pdf-and-document-content-search/
http://www.sitecore.net/Community/Technical-Blogs/John-West-Sitecore-Blog/Posts/2013/04/Sitecore-7-Indexing-Media-with-IFilters.aspx
I second the recommendation to use Sitecore's built-in MediaItemContentExtractor + IFilter approach, unless you've already ruled that out for some reason (IFilter difficulties, perhaps). If an IFilter is not an option, or if you're interested in the other approach regardless, I would integrate Tika a little differently than Stephen suggested, though.
The tutorial you referenced deals with writing your own search provider - i.e. replacing the built-in provider entirely. You should be able to leverage Sitecore's Solr provider and accomplish what you need with something lighter: a computed index field. The built-in media extractor mentioned above uses this approach, which enables you to put just about anything into your index during the normal indexing process. Here is a blog post from John West that walks through creating a basic computed index field: Sitecore 7: Computed Index Fields.
In short, write a class that implements IComputedIndexField and represents content extracted from PDFs or other rich documents. In your implementation of the ComputeFieldValue method:
Call GetMediaStream() on the document.
Pass the stream to Solr in an extract-only command and capture the result (see the sketch after this list).
Return that result to store it in the computed index field.
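For step 2, the extract-only variant is the same request with extractOnly=true, which makes Solr return Tika's output instead of indexing it. A SolrJ sketch (where exactly the extracted body sits in the response is an assumption to verify against your Solr version):

import java.io.File;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;

public class ExtractOnlySketch {
    // Returns Tika's extraction of the file without committing anything to the index.
    static String extract(SolrClient solr, File file) throws Exception {
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(file, "application/pdf");
        req.setParam("extractOnly", "true"); // do not index, just return the extracted content
        NamedList<Object> response = solr.request(req);
        // The extracted body typically comes back keyed by the content
        // stream's name (the file name) -- an assumption to verify.
        return (String) response.get(file.getName());
    }
}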
If you need the media content to be in a particular existing index field, then configure the computed field as a copy field (see 3.3.3) into the existing field. Otherwise, configure your search to reference the computed field.
The main drawback here is the expense of passing the extracted content back and forth, rather than committing it directly to the index in a single step. Depending on your index size and contents, that may not be an issue for you.
One other potential option would be a post-rebuild task to add media content to the existing indexed documents. I am not certain this would work. It depends on knowing the IDs of the media items' documents and committing the rich document content in partial document updates, which this person was unsuccessful in attempting. If you try this, be sure to execute it in the indexing:end event prior to HTML cache clearing.
Whatever approach you take, if you want to work with Tika on a higher level than cURL, have a look at SolrNet's implementation of ExtractCommand and related classes.
If you can upgrade your site to Sitecore 7.2, media item content is indexed automatically and there is no need to install the related IFilter. You should read the following:
Media Content Indexing in Sitecore 7.2

Webapps: Storing and searching through user submitted blocks of text

Background:
I'm building a poetry site with user submitted content. The relevant user actions for my questions are that users can:
a. Go to fancysitename.com/view to see all poems so far
b. Go to fancysitename.com/submit to submit your own poem.
c. Go to fancysitename.com/apoemid to view a particular poem you've bookmarked before.
d. Go to fancysitename.com/search to enter a word to search for in all the poems.
All the poems are stored as text fields in a database and referenced by a poem id. So the "apoemid" in step c will be the primary key of the tuple and I'll just pull up the text after getting the key from the url.
Question:
The poems exist nowhere except in a database. My webapp is literally 4 html files. Will this approach affect my search engine rankings?
Is there a more efficient way to do 'd' than doing a Select * on the db and manually parsing the text on the server? Each poem will be at most 10 lines long, so I would imagine using a full-text search engine like Lucene would probably be overkill.
Caveat
I'm running this on Google App Engine for now, so my database customization options are pretty limited. So while I'd certainly be interested in hearing about the ideal way to do this, this is a pet side project and my budget is limited :(
Thanks!
Edit: Apparently I don't google so well at 7am. I've since found a solution for question 2 here so please disregard question 2.
App Engine currently doesn't support full-text indexing, but it does have a better-than-nothing SearchableModel.
Some details of SearchableModel can be found here:
http://groups.google.com/group/google-appengine/browse_thread/thread/f64eacbd31629668/8dac5499bd58a6b7?lnk=gst&q=searchablemodel
Regarding search engine ranking: yes, having all your poems in the datastore can affect your ranking. This is generally overcome through the use of a sitemap. Here is an article about how Stack Overflow uses a sitemap to help its search ranking.
http://www.codinghorror.com/blog/archives/001174.html
Most database engines can accomplish this kind of searching; for example, MySQL does have full-text searching. I am not sure how App Engine works, but you can always have a stored procedure that does this search.
Where you store your data will not affect your site's ranking, only how you serve it up (on what URLs, etc). There's absolutely no way for an arbitrary search spider to tell where you store your data, and no reason for it to care, either.
Regardless of the length of your text, you will need full-text searching if you want to search inside a string. As Sam points out, SearchableModel ought to work just fine for that.
