Matching article text against pre-existing list of categories

I'm new to Azure Cognitive Services, and while I'm pretty sure it can help me solve my problem, I don't quite understand which part of it to use for it...
Here's what I want to do:
We have blog posts, say ~1k, and those blog posts all have categories and tags (multiple each). What I want to do is "guess" the right categories/tags for each article based on its content, and then present them to the editor as suggestions at the time of input ("looks like this article is about: health, well-being, ..."). The ~1k articles we already have in the system are correctly tagged/categorized, so I'd like to use these as a data source for this "guessing".
I've used Azure Search before, and it seems like some combination of EntityRecognition and KeyPhraseExtraction might be a way in the right direction? Azure Cognitive Services also seems to have an API that supports TextAnalytics that would do something similar. I'm a bit confused about why these are two different things (or are they not?)
This also seems like an entirely common problem (matching text against pre-defined categories based on other text that is categorized), so I'm wondering if I'm just missing an obvious solution here?
Thanks in advance.

I think the Azure Cognitive Text Analytics API is your best bet as you are looking for real-time analysis prior to tagging/categorizing for storage.
Text Analytics could return a list of named entities that you could map to your available tags/categories and present to the user.
Azure Cognitive Search requires an indexer and skillset to process target text with an end result of storing the processed results to an index specifically for searching.
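To make the mapping step concrete, here is a minimal sketch in Python. It assumes you already have the entity strings back from the service as a plain list; `suggest_tags`, the sample entities, and the sample tags are all hypothetical, and the call to the Text Analytics API itself is not shown.

```python
def suggest_tags(entities, known_tags):
    """Return the known tags that match an extracted entity (case-insensitive)."""
    # Map lowercase tag -> tag as stored, so matching ignores case.
    lookup = {tag.lower(): tag for tag in known_tags}
    suggestions = []
    for entity in entities:
        tag = lookup.get(entity.strip().lower())
        # Keep the first occurrence only, preserving the order entities arrived in.
        if tag is not None and tag not in suggestions:
            suggestions.append(tag)
    return suggestions


# Hypothetical entity strings, standing in for whatever the service returns.
entities = ["Health", "yoga", "London", "health"]
known_tags = ["health", "well-being", "yoga", "travel"]
print(suggest_tags(entities, known_tags))
```

Exact matching like this will miss synonyms and related terms, so treat it as a starting point for the suggestion list rather than a classifier.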

Related

Airports, points of interest and cities Auto complete

I'm currently working on a travel booking application.
I have two questions related to the same topic.
I need to know where sites like Priceline, Expedia or CheapOair get their autocomplete search data from, such as airports, points of interest, and cities/states. Do these sites rely on the Google Places API for their search autocomplete?
I was thinking about getting this data using Google Places autocomplete. Would this be a wise way to go about it? Or would I be better off finding a JSON file with all this data, storing it on my own server, and querying the JSON file directly?

Did you try these?
- https://community.algolia.com/places/
- https://demos.algolia.com/geo-search-demo/ [Search for airports]
- and check the guide that goes with the demo https://www.algolia.com/doc/guides/geo-search/geo-search-overview
Having your own database would give you more flexibility. Also, users would query your data, so there would probably be fewer searches with no results (with Google search, users could type queries that are not related to any of your content).
Does that make sense?
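If you do go with your own data, a prefix lookup over a sorted list is one simple way to serve autocomplete. The sketch below is only an illustration under that assumption; `PrefixIndex` and the sample place names are made up, and a real data set would be loaded from your own file or database.

```python
import bisect


class PrefixIndex:
    """Case-insensitive prefix lookup over a fixed list of place names."""

    def __init__(self, names):
        # Sort by lowercase key so all names sharing a prefix are adjacent.
        self._entries = sorted((name.lower(), name) for name in names)
        self._keys = [key for key, _ in self._entries]

    def complete(self, prefix, limit=5):
        prefix = prefix.lower()
        # Binary search for the first key that is >= the prefix.
        start = bisect.bisect_left(self._keys, prefix)
        results = []
        for key, name in self._entries[start:]:
            if not key.startswith(prefix) or len(results) >= limit:
                break
            results.append(name)
        return results


# Hypothetical sample data.
places = ["Los Angeles (LAX)", "London Heathrow (LHR)", "Lisbon (LIS)", "Boston (BOS)"]
index = PrefixIndex(places)
print(index.complete("lo"))
```

This only matches from the start of the name, so a query like "heathrow" would not find "London Heathrow (LHR)"; a hosted search service handles that kind of matching for you.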

How is ElasticSearch supposed to work in CakePHP 3?

I've been trying my very best not to ask any nosy question here in stackoverflow, but it has been almost one week since I got stuck in this problem and I couldn't find any solution.
I already have my working website built with CakePHP 3.2. What the website basically does is scrape Twitter for tweets containing a given search term, check if it's already in my database, and store it if it doesn't yet exist. Twitter's JSON response has this "tweet_id" property, and I've been using that value to check for whether I should ignore or append a specific tweet to my DB. While this might be okay while my database is small, I suspect it's going to slow things down considerably when my tables grow bigger. Thus my need for ElasticSearch.
My ElasticSearch server is running on my Arch Linux install, and I've configured my app to point to the said server. Also, I have my "Type" object named the same way as my "Tweets" table (I followed the documentation until the overview part http://book.cakephp.org/3.0/en/elasticsearch.html). This craps out with an "Unknown method 'alias'" error, and subsequent Google searches led me to create an alternate pagination class, since that was what some found to be the cause of the error (https://github.com/lorenzo/audit-stash/issues/4), but that still doesn't fix things.
I'm not sure if I got this right. I installed the ElasticSearch plugin with the assumption that all I have to do is name the Types the same name as my tables, since to me the documentation "implies" that this should be done on top of the Blog Tutorial they did to "improve query performance".
TLDR, how is this supposed to work? Is my above assumption right? Do I name the Types differently and index everything myself? I'm not sure if there's just too much automagic, or I'm just poor at this sort of thing. And yes, I'm new to frameworks (but not PHP, among other languages).
Thanks in advance!

Generate a series of documents based on SQL table

I am trying to formulate a proposal for an application that allows a user to print a batch of documents based on data stored in a SQL table. The SQL table indicates which documents are due and also contains all demographic information. This is outside of what I normally do, and I am trying to see if there is a platform/application that already exists to do such a task.
For example
List of all documents: Document #1 - Document #10
Person 1 is due for document #: 1,5,7,8
Person 2 is due for document #: 2,6
Person 3 is due for document #: 7,8,10
etc
Ideally, what I would like is for the user to be able to push a button and get a printed stack of documents that have been customized for each person, including basic demographic info like name, DOB, etc.
Like I said at the top, I already have all of the needed information in a database. I am just trying to figure out the best approach to move that information onto a document.
I have done some research and found that some people have used mail merge in Word, or Access as a front end, but I don't know if this is the best way. I've also found this document. Any advice would be greatly appreciated.

If I understand your problem correctly, it is two-fold: firstly, you need to find a way to generate documents based on data (mail-merge), and secondly, you might need to print them too.
For document generation you have two basic approaches: template-based and programmatically from scratch. I suppose that you will opt for a template-based approach, which basically means that you design (in MS Word) a template document (Word, RTF, ...) that contains placeholders and other tags that designate »dynamic« parts of the document. Then, at document generation time, you need a .NET library/processor to which you pass this template document and the data; the processor populates the template with the data and returns the resulting document.
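The template idea itself is independent of Word or .NET. As a rough, language-neutral illustration (Python here, plain-text templates instead of Word documents; `TEMPLATES`, `render_batch`, and the sample person are all hypothetical), the processor boils down to something like:

```python
from string import Template

# One template per document type; $name and $dob are the placeholders.
TEMPLATES = {
    1: Template("Document 1 for $name (DOB: $dob)"),
    5: Template("Document 5: reminder for $name"),
}


def render_batch(people, templates):
    """Yield one filled-in document per (person, due document) pair."""
    for person in people:
        for doc_id in person["due"]:
            # substitute() fills the placeholders; unused keyword values are ignored.
            yield templates[doc_id].substitute(name=person["name"], dob=person["dob"])


# Hypothetical row, standing in for what the SQL table would provide.
people = [{"name": "Person 1", "dob": "1980-01-01", "due": [1, 5]}]
for document in render_batch(people, TEMPLATES):
    print(document)
```

A real mail-merge library does the same thing against Word templates and also takes care of formatting, tables, and output to printable formats.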
One way to achieve this functionality would be employing MS Word's native mail-merge, but you should know that this would involve using Office COM and Word Application Automation, which should almost always be avoided.
Another option is to build such a system on top of the Open XML SDK. This is a valid option, but it will be a pretty demanding task and will most probably cost you much more than buying a commercial .NET library that does mail-merge out of the box (been there, done that). But of course, the good side here is that you will be able to tailor the solution to your needs. If you go down this road, I recommend that you use Content Controls for tagging documents/templates. The solution with CCs will be much easier to implement than the solution with bookmarks.
I'm not very familiar with the open source solutions and I'm not sure how many there are that can do mail-merge. One I know is FlexDoc (on CodePlex), but its problem is that it uses a construct (XmlControl) for tagging that is deprecated in Word 2010+.
Then there are commercial solutions. Again, I don't know them in detail, but I know that the majority of them are general-purpose document processing libraries. Our company has been using this document generation toolkit for some time now and I can say it covers all our »template-based document generation« needs. It doesn't require MS Word at document generation time, has a really helpful add-in for MS Word, and needs only several lines of code to integrate into your project. Templating is very powerful and you can set up a template in a very short time. While templates are Word documents, you can generate PDF or XPS docs as well. XPS is useful because you can use the .NET/WPF printing framework, which works with XPS docs, to print documents. This is a very high-end solution, but of course the downside is that it is not free.

Existent geolocations list

I need a simple list of verified, existing cities and states (streets and sights too if possible, but not necessary). I tried to find some dictionary on the internet, like a plain text file or a web page, but it's hard to make a good search query for that.
Or maybe there's a way to get this list from some maps API?

It is indeed hard to make a good search query for that. I just clicked a related questions link and this might help:
http://www.geonames.org/
(copied from an answer on Cities/Province/Cantons list with coords?)
They offer files for download and web services.

Webapps: Storing and searching through user submitted blocks of text

Background:
I'm building a poetry site with user submitted content. The relevant user actions for my questions are that users can:
a. Go to fancysitename.com/view to see all poems so far
b. Go to fancysitename.com/submit to submit your own poem.
c. Go to fancysitename.com/apoemid to view a particular poem you've bookmarked before.
d. Go to fancysitename.com/search to enter a word to search for in all the poems.
All the poems are stored as text fields in a database and referenced by a poem id. So the "apoemid" in step c will be the primary key of the tuple and I'll just pull up the text after getting the key from the url.
Question:
The poems exist nowhere except in a database. My webapp is literally 4 html files. Will this approach affect my search engine rankings?
Is there a more efficient way to do 'd' rather than doing a Select * on the db and manually parsing the text on the server? Each poem will be at most 10 lines long, so I would imagine using a full-text search engine like Lucene will probably be overkill.
Caveat
I'm running this on the google app engine for now, so my database customization options are pretty limited. So while I'd certainly be interested in hearing about the ideal way to do this, this is a pet side project so my budget is limited :(
Thanks!
Edit: Apparently I don't google so well at 7am. I've since found a solution for question 2 here so please disregard question 2.

App Engine currently doesn't support full-text indexing, but it does have a better-than-nothing SearchableModel.
Some details of SearchableModel can be found here:
http://groups.google.com/group/google-appengine/browse_thread/thread/f64eacbd31629668/8dac5499bd58a6b7?lnk=gst&q=searchablemodel
Regarding search engine ranking, yes having all your poems in the datastore can affect your ranking. This is generally overcome through the use of a sitemap. Here is an article about how StackOverflow uses a sitemap to help its search ranking.
http://www.codinghorror.com/blog/archives/001174.html

Most database engines can do this kind of searching. For example, MySQL has full-text search. I am not sure how App Engine works, but you could always have a stored procedure that does this search.

Where you store your data will not affect your site's ranking, only how you serve it up (on what URLs, etc). There's absolutely no way for an arbitrary search spider to tell where you store your data, and no reason for it to care, either.
Regardless of the length of your text, you will need full-text searching if you want to search inside a string. As Sam points out, SearchableModel ought to work just fine for that.
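For a sense of what full-text searching does under the hood, here is a minimal inverted-index sketch in plain Python. It is not how SearchableModel is implemented; `tokenize`, `build_index`, `search`, and the sample poems are hypothetical names and data used only for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(poems):
    """Map each word to the set of poem ids that contain it."""
    index = defaultdict(set)
    for poem_id, text in poems.items():
        for word in tokenize(text):
            index[word].add(poem_id)
    return index


def search(index, word):
    """Return the ids of poems containing the word, sorted for stable output."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical sample data keyed by poem id.
poems = {"p1": "The rose is red", "p2": "A red, red sun", "p3": "Blue is the sea"}
index = build_index(poems)
print(search(index, "Red"))
```

The point is that the index is built once when a poem is stored, so a search becomes a single lookup instead of a scan over every poem's text.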
