Best data cleansing practices for IBM Personality Insights - ibm-watson

I am testing out Personality Insights, and I am curious whether I need to do any data cleansing before sending a string of a Twitter profile's timeline across to IBM.
For example, should I remove URLs included in the tweets, and other Twitter features such as hashtags or profile names?
I am currently not removing any data; I am just concatenating tweets with a full stop and a space, using text += ". " + tweetfulltext.

You don't need to, but since URLs and hashtags don't count towards the personality profile, running the text through a cleanup module (if you already have one) will help with the word count. You will also want to filter out retweets.
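For illustration, here's a minimal cleanup sketch in Python; the regexes and the retweet check are assumptions about what you want to strip, and the tweets argument stands in for wherever your tweet text comes from:

    import re

    def clean_tweet(text):
        # Strip URLs and @profile-names; keep the hashtag word but drop the '#'.
        text = re.sub(r"https?://\S+", "", text)
        text = re.sub(r"@\w+", "", text)
        text = text.replace("#", "")
        # Collapse the whitespace left behind.
        return re.sub(r"\s+", " ", text).strip()

    def build_profile_text(tweets):
        parts = []
        for full_text in tweets:
            if full_text.startswith("RT @"):  # skip retweets
                continue
            cleaned = clean_tweet(full_text)
            if cleaned:
                parts.append(cleaned)
        # Same full-stop-plus-space join as in the question.
        return ". ".join(parts)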

Related

Matching article text against a pre-existing list of categories

I'm new to Azure Cognitive Services, and while I'm pretty sure it can help me solve my problem, I don't quite understand which part of it to use...
Here's what I want to do:
We have blog posts, say ~1k, and those blog posts all have categories and tags (multiple of each). What I want to do is to "guess" the right categories/tags for each article based on the content, and then present those to the editor as suggestions at the time of input ("looks like this article is about: health, well-being, ..."). The ~1k articles we already have in the system are currently correctly tagged/categorized, so I'd like to use these as a data source for this "guessing".
I've used Azure Search before, and it seems like some combination of EntityRecognition and KeyPhraseExtraction might be a step in the right direction? Azure Cognitive Services also seems to have a Text Analytics API that would do something similar. I'm a bit confused about why these are two different things (or are they not?).
This also seems like a fairly common problem (matching text against pre-defined categories based on other text that is already categorized), so I'm wondering if I'm just missing an obvious solution here?
Thanks in advance.
I think the Azure Cognitive Text Analytics API is your best bet as you are looking for real-time analysis prior to tagging/categorizing for storage.
Text Analytics could return a list of named entities that you could map to your available tags/categories and present to the user.
Azure Cognitive Search requires an indexer and a skillset to process the target text, with the end result of storing the processed output in an index built specifically for searching.
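As a rough sketch of that flow, assuming the azure-ai-textanalytics Python SDK (the endpoint, key, and KNOWN_TAGS set are placeholders for your own values):

    from azure.ai.textanalytics import TextAnalyticsClient
    from azure.core.credentials import AzureKeyCredential

    KNOWN_TAGS = {"health", "well-being", "travel"}  # your existing tags/categories

    client = TextAnalyticsClient(
        endpoint="https://<your-resource>.cognitiveservices.azure.com/",
        credential=AzureKeyCredential("<your-key>"),
    )

    def suggest_tags(article_text):
        # Pull key phrases and named entities from the article body,
        # then keep only those that match an existing tag/category.
        phrases = client.extract_key_phrases([article_text])[0].key_phrases
        entities = [e.text for e in client.recognize_entities([article_text])[0].entities]
        candidates = {c.lower() for c in phrases + entities}
        return sorted(candidates & KNOWN_TAGS)

In practice you'd probably want fuzzier matching than exact string equality, but the shape of the solution stays the same.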

Airports, points of interest and cities autocomplete

I'm currently working on a travel booking application.
I have two questions related to the same topic.
I need to know where sites like Priceline, Expedia, or CheapOair get their autocomplete search data from, such as airports, points of interest, and cities/states. Do these sites use the Google Places API for their search autocomplete?
I was thinking about getting this data using Google Places autocomplete. Would this be a wise way to go about it? Or would I be better off finding a JSON file with all this data, storing it on my own server, and querying the JSON file directly?
Did you try these out?
- https://community.algolia.com/places/
- https://demos.algolia.com/geo-search-demo/ [Search for airports]
- and check the guide that goes with the demo https://www.algolia.com/doc/guides/geo-search/geo-search-overview
Having your own database would give you more flexibility. Also, users would query your data, so there'd probably be fewer searches with no results (with Google search, users could type queries that are not related to any of your content).
Does that make sense?
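To give an idea of the "own data" option, here is a minimal sketch in Python; the airports.json file and its name/city/code fields are hypothetical:

    import json

    # airports.json: [{"name": "Heathrow", "city": "London", "code": "LHR"}, ...]
    with open("airports.json") as f:
        PLACES = json.load(f)

    def autocomplete(prefix, limit=10):
        # Case-insensitive prefix match over name, city and IATA code.
        p = prefix.lower()
        hits = [place for place in PLACES
                if place["name"].lower().startswith(p)
                or place["city"].lower().startswith(p)
                or place["code"].lower().startswith(p)]
        return hits[:limit]

    print(autocomplete("lon"))  # London-area airports, for example

For production use you'd put an index (or a hosted service like Algolia) in front of this rather than scanning the whole list on every keystroke.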

Generate a series of documents based on a SQL table

I am trying to formulate a proposal for an application that allows a user to print a batch of documents based on data stored in a SQL table. The SQL table indicates which documents are due and also contains all demographic information. This is outside of what I normally do, and I am trying to see if there is a platform/application that already exists for such a task.
For example
List of all documents: Document #1 - Document #10
Person 1 is due for document #: 1, 5, 7, 8
Person 2 is due for document #: 2, 6
Person 3 is due for document #: 7, 8, 10
etc
Ideally, what I would like is for the user to be able to push a button and get a printed stack of documents that have been customized for each person, including basic demographic info like name, DOB, etc.
Like I said at the top, I already have all of the needed information in a database; I am just trying to figure out the best approach to move that information onto a document.
I have done some research and found that some people have used mail merge in Word or used Access as a front end, but I don't know if this is the best way. I've also found this document. Any advice would be greatly appreciated.
If I understand your problem correctly, it is two-fold: firstly, you need to find a way to generate documents based on data (mail merge), and secondly, you might need to print them too.
For document generation you have two basic approaches: template-based, and programmatically from scratch. I suppose that you will opt for a template-based approach, which basically means that you design (in MS Word) a document (Word, RTF, ...) that acts as a template and contains placeholders and other tags that designate the "dynamic" parts of the document. Then, at document generation time, you need a .NET library/processor to which you pass this template document and the data; the processor populates the template with the data and returns the resulting document.
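Just to make that flow concrete, here is a minimal sketch using the Python docxtpl package as a stand-in for whichever .NET processor you end up choosing; the template paths, field names, and rows are made up:

    from docxtpl import DocxTemplate  # pip install docxtpl

    # One row per (person, document) pair, e.g. the result of joining the
    # "who is due for what" table with the demographics table in SQL.
    rows = [
        {"name": "Person 1", "dob": "1980-01-01", "template": "doc1.docx"},
        {"name": "Person 1", "dob": "1980-01-01", "template": "doc5.docx"},
        {"name": "Person 2", "dob": "1975-06-15", "template": "doc2.docx"},
    ]

    for i, row in enumerate(rows):
        # Each template contains placeholders such as {{ name }} and {{ dob }}.
        tpl = DocxTemplate(row["template"])
        tpl.render({"name": row["name"], "dob": row["dob"]})
        tpl.save(f"output_{i}.docx")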
One way to achieve this functionality would be employing MS Word's native mail merge, but you should know that this involves using Office COM and Word Application Automation, which should almost always be avoided.
Another option is to build such a system on top of the Open XML SDK. This is a valid option, but it will be a pretty demanding task and will most probably cost you much more than buying a commercial .NET library that does mail merge out of the box; been there, done that. But of course, the good side here is that you will be able to tailor the solution to your needs. If you go down this road, I recommend that you use Content Controls for tagging documents/templates. The solution with Content Controls will be much easier to implement than the solution with bookmarks.
I'm not very familiar with the open source solutions, and I'm not sure how many there are that can do mail merge. One I know of is FlexDoc (on CodePlex), but its problem is that it uses a construct (XmlControl) for tagging that is deprecated in Word 2010+.
Then there are commercial solutions. Again, I don't know them in detail, but I know that the majority of them are general-purpose document processing libraries. Our company has been using this document generation toolkit for some time now, and I can say it covers all our "template-based document generation" needs. It doesn't require MS Word at document generation time, it has a really helpful add-in for MS Word, and you only need several lines of code to integrate it into your project. Templating is very powerful, and you can set up a template in a very short time. While templates are Word documents, you can generate PDF or XPS documents as well. XPS is useful because you can use the .NET/WPF printing framework, which works with XPS documents, to print them. This is a very high-end solution, but of course, the downside is that it is not free.

How to programmatically create an index from a list of URLs in Java

I want to create a search engine, so I used Nutch and Solr to develop it.
But Nutch is not able to crawl each and every URL of the website, and the search results are not as good as Google's. So I started using JCrawler to get a list of URLs.
Now I have a list of URLs, but I have to index them.
Is there any way I can index a list of URLs stored line by line in a file,
and show the results via Lucene, Solr, or any other Java API?
How you programmatically do something really depends on which language you plan on writing your code in - fetching content from a URL and making sense of that content before indexing will be largely dependent on the libraries available for your programming language of choice.
You can still use Nutch with the Solr backend - give it the list of URLs as input and set --depth to 1 (so that it doesn't spider anything further).
There are also other "ready" options, such as Crawl Anywhere (which has a Solr backend) and Scrapy.
"Not as good as Google" is not a good description of what you want to accomplish and how to approach that (keep in mind that Search is a core product for Google and they have a very, very large set of custom technologies for handling search). If you have specific issues with your own data and how to display that (usually you can do more useful results as you have domain knowledge of the task you're trying to solve), ask concrete, specific questions.
You can use the Data Import Handler (DIH) to load the list of URLs from a file and then read and index them.
You'd need to use a nested entity, with the outer entity having the rootEntity flag set to false.
You'd need to practice a little bit with DIH, so I recommend that you first learn how to import just the URLs into individual Solr documents, and then enhance it with actual parsing of the URL content.
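If you'd rather skip DIH altogether, here is a minimal alternative sketch in Python (swap in your HTTP client of choice if you stay in Java) that posts each fetched page to Solr's standard JSON update endpoint; the core name, field names, and urls.txt file are assumptions:

    import requests

    SOLR_UPDATE = "http://localhost:8983/solr/mycore/update/json/docs"

    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    for url in urls:
        page = requests.get(url, timeout=10)
        doc = {
            "id": url,
            "url": url,
            "content": page.text,  # raw HTML; strip the markup before indexing in practice
        }
        # commit=true on every document is slow, but fine for a small list.
        requests.post(SOLR_UPDATE, json=doc, params={"commit": "true"}).raise_for_status()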

Web application for managing an election campaign

I’m trying to help a friend in his election campaign.
We mainly need a tool to manage a list of possible voters. We need to be able to:
1. Easily update details about the voters, and
2. Query for voters according to various parameters, and show and print the resulting lists
To enable campaigners to work from multiple workstations, we would like the system to be distributed, probably web based.
We would also like that to be in Hebrew, if possible.
Is there any existing tool that easily enables it?
If not, can you recommend an easy way to implement such a tool?
(I have a solid programming knowledge, but not much time to devote to that)
You can achieve this easily with iFreeTools Creator. Just create the entities and attributes for voters, and add campaigners as users by providing their Google email IDs.
Regarding your requirements..
* This app is web-based. It runs on Google App Engine.
* The interface is English-only, but the data can be in Unicode. Entity names and attribute names are also "data", so they can be in Unicode too.
Other related features which might be useful in this context..
* You can import the voter list using CSV files.
* Campaigners can search for voters near their workstation by filtering records based on proximity to a geo-location.
(Disclosure: I wrote the code for this web app. Hope you like it. Feedback welcome.)
Some possible answers might be found in the same question I asked in the web apps forum.
