Slack - how to optimise linked site for multilingual unfurling?

Does Slack provide any particular support for multilingual links pasted into a Slack discussion, specifically when it discovers the Open Graph data from the linked page?
For example, Facebook supports this via the og:locale and og:locale:alternate tags and will then send an 'X-Facebook-Locale' header, per the description here: http://qnimate.com/open-graph-protocol-in-facebook/
Slack does read Open Graph tags when introspecting pages, but I am not sure whether it goes as far as handling the same page being available in multiple languages.
Does Slack provide any equivalent support? I looked through the documentation but could not find any reference to this.
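For reference, the Facebook-style markup being asked about looks like this; whether Slack's unfurler honours these tags is exactly the open question (the locale values here are just examples):

<!-- Open Graph locale hints, as consumed by Facebook (see the article above). -->
<meta property="og:title" content="Example article" />
<meta property="og:locale" content="en_GB" />
<meta property="og:locale:alternate" content="fr_FR" />
<meta property="og:locale:alternate" content="de_DE" />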

Related

How to conduct a web crawl for a specific topic via Apache Nutch?

I'm new to this field, and as students we have to create a web portal for a specific topic. As a first step we have to crawl the web (or part of it) to gather links for this topic, before we index and rank them, with the final purpose of feeding them into a database for our portal.
The thing is that I cannot work out the right methodology. Let's say the theme of our portal is "health insurance".
What are the steps I have to follow, and which tools do I need?
Is there a way to guide Nutch towards specific content?
Should I fill my seeds.txt with a wide range of links, parse a lot of pages, and then filter the content?
You can describe the steps at a high level and I'll research how to implement them.
Introduction
What you are trying to build is a so-called focused crawler or topical crawler, which only collects data that falls within your specific domain of interest.
There are a lot of different (scientific) approaches to developing such a system. It often involves statistical methods or machine learning to estimate the similarity of a given Web page to your topic. Next, the selection of seed points is crucial for this approach. I would recommend using a search engine to collect high-quality seeds for your domain of interest. As an alternative you could use pre-classified URLs from Web directories such as curlie.org.
A good literature review on this topic, with some in-depth explanation of the different approaches, is the journal paper by Kumar et al.
Process in Short
In short, the process of implementing such a system would be:
1. Build a relevance model that can decide whether a given Web page belongs to your domain of interest / topic (e.g. a text classifier; a minimal sketch follows below).
2. Evaluate your domain-specific relevance model. If you are not satisfied, go back to step 1.
3. Feed your high-quality seed points into the system and start the crawl.
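Before investing in a real classifier, step 1 can start as something very simple. Here is a minimal sketch in JavaScript, where the keyword list and threshold are assumptions you would tune for your "health insurance" topic; a trained text classifier would eventually replace this:

// Minimal relevance scorer: a crude stand-in for a real text classifier.
const keywords = ['health insurance', 'premium', 'deductible', 'coverage', 'policyholder'];

function relevanceScore(pageText) {
  const text = pageText.toLowerCase();
  // Fraction of the domain keywords that occur at least once in the page.
  const hits = keywords.filter((k) => text.includes(k)).length;
  return hits / keywords.length; // 0.0 (off-topic) .. 1.0 (strongly on-topic)
}

function isRelevant(pageText, threshold = 0.4) {
  return relevanceScore(pageText) >= threshold;
}

// Example: decide whether a fetched page should be kept.
console.log(isRelevant('Compare health insurance premiums, deductibles and coverage.')); // true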
Architecture
A more or less general (focused) crawler architecture (on a single server/PC) looks like this:
[Figure omitted: diagram of a single-server focused crawler architecture.]
Apache Nutch
Sadly, Apache Nutch cannot do this by default; you have to implement the additional logic as a plugin. One source of inspiration might be Anthelion, a focused-crawler plugin for Nutch, although it is no longer actively maintained.
By default Nutch only cares about which links to crawl next (either in the current or next crawl cycle). The concept of "next URL" is controlled within Nutch by a scoring plugin.
Since NUTCH-2039 was merged, Nutch supports "relevance-based scoring". This means that you can define a gold standard (your ideal page) and have the crawler score each candidate URL based on how similar the new link is to your ideal case. This provides, to some extent, a topic-based crawler.
You can take a look at https://cwiki.apache.org/confluence/display/nutch/SimilarityScoringFilter to see how to enable/configure this plugin.
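As a sketch, enabling it in nutch-site.xml looks roughly like this. The property names below follow that wiki page, but treat them as assumptions and verify them against your Nutch version:

<!-- Sketch only: append scoring-similarity to your existing plugin.includes. -->
<property>
  <name>plugin.includes</name>
  <value>...|scoring-similarity</value>
</property>
<property>
  <name>scoring.similarity.model</name>
  <value>cosine</value>
</property>
<property>
  <name>cosine.goldstandard.file</name>
  <!-- Plain-text file containing your "ideal page" content -->
  <value>goldstandard.txt</value>
</property>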
Nutch also comes with a built-in NaiveBayesParseFilter. You have to add the following properties in nutch-site.xml and create a training file as described below.
In my experience it performs well even with a handful of training documents; of course, the more the merrier.
<property>
  <name>plugin.includes</name>
  <!-- Append parsefilter-naivebayes to the plugins already listed in your config -->
  <value>parsefilter-naivebayes</value>
</property>
<property>
  <name>parsefilter.naivebayes.trainfile</name>
  <!-- Path to your training file -->
  <value></value>
  <description>Set the name of the file to be used for Naive Bayes training.
The format is: each line contains two tab-separated parts:
1. "1" or "0" ("1" for a relevant, "0" for an irrelevant document).
2. Text (the text that will be used for training).
Each row is treated as a new "document" by the classifier.
CAUTION: Set parser.timeout to -1, or to a value larger than 30, when using this classifier.
</description>
</property>
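For example, a tiny trainfile in that format could look like this (made-up contents; a single tab separates the label from the text on each line):

1	health insurance premiums deductibles and coverage options explained
1	how to choose a health insurance policy for your family
0	top ten pasta recipes for busy weeknights
0	latest football transfer rumours and match reports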

How to publish AIML embedded with javascript?

I've written an AIML file for a chat bot and I'd like to build an interactive web application which allows me to chat with the bot in the web browser.
Is it possible to achieve this with HTML & JavaScript?
There is no short answer on how to write a web application which allows a user to interact with your AIML. Writing such an application from scratch will be much more work than compiling the AIML was.
The easiest option would be to use a pre-built service like Pandorabots, which allows you to upload AIML files and interact with them in the web browser. The explorer part of the website is free to use. They also have paid developer options which generate an API to bridge your AIML script and any applications you might want to build. It can easily be connected to common chat apps like Google Talk, etc.
If you decide to build everything from scratch you might want to check out the AIML Interpreter library for nodejs.
UPDATE: Here is a node.js-based interpreter that you might find useful: https://github.com/mrchimp/surly2
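To make the from-scratch route concrete: the usual shape is a thin HTTP layer between the browser and the interpreter. Here is a minimal sketch using Express, where the bot object is a stub standing in for whichever AIML interpreter you pick (the real library's API will differ):

// server.js -- the browser POSTs a message, the server returns the bot's reply.
const express = require('express');
const app = express();
app.use(express.json());
app.use(express.static('public')); // serve a simple chat page from ./public

// Stub bot: replace with your AIML interpreter of choice (e.g. surly2).
const bot = { reply: (message) => `You said: ${message}` };

app.post('/chat', (req, res) => {
  res.json({ reply: bot.reply(req.body.message) });
});

app.listen(3000, () => console.log('Chat bridge on http://localhost:3000'));

The browser side is then just an HTML page that POSTs the user's message to /chat (e.g. with fetch) and appends the JSON reply to the conversation.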
I was looking at AIML too and had similar questions. I just found RiveScript, and it looks like it fits your need to run JavaScript based on a match. It is not AIML, but it is very close, and there is at least one tool to convert from AIML to RiveScript, so I would say it fits your needs within those constraints.

How to crawl entire Wikipedia?

I need a sitemap that can help both people and Google discover the pages.
I've tried the WebSphinx application.
I realize that if I put wikipedia.org as the starting URL, it will not crawl further.
Hence, how do I actually crawl the entire Wikipedia? Can anyone give me some guidelines? Do I need to specifically go and find those URLs and put in multiple starting URLs?
Does anyone have suggestions for a good website with a tutorial on using WebSphinx's API?
Crawling Wikipedia live is a bad idea. It is hundreds of TBs of data uncompressed. I would suggest crawling offline instead, using the various dumps provided by Wikimedia. Find them here: https://dumps.wikimedia.org/
You can create a sitemap for Wikipedia using the page metadata, external links, interwiki links, and redirects databases, to name a few.
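As an illustration of the dump-based approach, here is a sketch that streams the all-titles-in-ns0 dump (one article title per line, gzipped) and emits sitemap <url> entries. The filename follows the usual dump naming, but check the dumps index for the current one; note also that real sitemap files are capped at 50,000 URLs each, so you would split the output:

// sitemap-from-dump.js -- stream a Wikipedia titles dump into sitemap entries.
const fs = require('fs');
const zlib = require('zlib');
const readline = require('readline');

const lines = readline.createInterface({
  input: fs.createReadStream('enwiki-latest-all-titles-in-ns0.gz')
           .pipe(zlib.createGunzip()),
});

console.log('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">');
lines.on('line', (title) => {
  console.log(`  <url><loc>https://en.wikipedia.org/wiki/${encodeURIComponent(title)}</loc></url>`);
});
lines.on('close', () => console.log('</urlset>'));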

Publish one product to multiple sites

Is there a way to have one product definition and have it published to multiple sites? I am looking for this ability specifically in DNN or Umbraco, with either free or paid extensions. I installed both platforms, played with the free extensions, and looked for any extension offering such functionality, but did not find one. Any links or pointers are highly appreciated!
I had looked for this info in many places before reaching out to the expert pool here, hoping to get some hints.
In Umbraco there is the built-in /base extension (http://our.umbraco.org/wiki/reference/umbraco-base), which enables you to access product data that is maintained in Umbraco from other websites. /base is REST-ish and the implementation is well documented; you can access the data as XML or JSON (see "Returning Json instead of XML with Umbraco Base").
Also, as the implementation is REST-ish, the other websites that consume the content maintained in the core site could be written in anything that can consume a REST feed, e.g. HTML & JavaScript.
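For instance, a consuming site could pull product data client-side like this (assuming the central site allows cross-origin requests). The extension alias and method name (ProductService/GetProduct) are hypothetical; they must match the /base class you register in the central Umbraco site:

// Fetch product data from the central Umbraco site's /base endpoint.
fetch('https://products.example.com/base/ProductService/GetProduct/123.aspx')
  .then((res) => res.text()) // /base returns XML by default; JSON is also possible (see above)
  .then((payload) => {
    console.log('Product payload:', payload);
    // ...render the product on this site...
  });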
It's not 100% clear to me what setup you're after, but if you're looking to set up a traditional Authoring/Delivery configuration, one of the few paid offerings Umbraco has is called Courier. It's a very reasonably priced (~$135 USD / 99 EUR) deployment manager that handles syncing content between two sites, i.e. an Authoring and a Delivery server.
It's a very smart tool that manages content, configuration, and dependencies. It's neat and also supports a great open-source project!
If you're looking to set up something more like a centralized product database that is used by many sites, amelvin is on a good track with /base. It has a nice API, and you can also set up your own web service (beyond the built-in web service functionality!).
If you need this centralized product data to notify the other sites to update their caches, I encourage you to look into the 'distributedCall' functionality.
There's a bit of documentation on distributed calls in this load-balancing tutorial that may help you understand the concept a bit better.
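As a rough sketch, the classic configuration lives in umbracoSettings.config and looks something like the following; the server names are placeholders, and you should confirm the exact element names against the tutorial above for your Umbraco version:

<!-- umbracoSettings.config (sketch): notify delivery servers of content changes. -->
<distributedCall enable="true">
  <user>0</user> <!-- id of the Umbraco user the distributed calls run as -->
  <servers>
    <server>delivery1.example.com</server>
    <server>delivery2.example.com</server>
  </servers>
</distributedCall>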
...Hope this helps get you pointed in the right direction.

Access to webcam in browser?

For an internal project we would like to play with building a video conferencing system. We are able to decide the browser that the user has to use and can install plugins.
The only requirement is that the browser and plugins must be free and work over Linux and Mac. (Don't care about Windows)
What is the best way to access the webcam and mic streams from a user for sending to a server?
Ideally I would like to do this plugin-free, but I can see no implementation of the device tag in HTML5 in any browser yet, unless someone knows different.
If it's Flash/Silverlight, are there any quick examples of capturing and sending to a server?
Also, any examples of streaming video from a server to a client would be useful, so we can stick it all together. I know this can be done in HTML5, so that would be the preference.
The client connection part is all I need, as we are building the server ourselves; that is the internal challenge.
Basically I'm looking for good examples and best practices for sending and receiving this information.
Edit: As I have discovered from some groups, the device tag is nowhere near completion. So answers will have to be Flash/Silverlight (does that work on Linux??).
See a demo of the device tag done on a custom WebKit build: https://labs.ericsson.com/blog/beyond-html5-conversational-voice-and-video-implemented-webkit-gtk
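For reference, the markup such demos implemented looked roughly like this (reconstructed from the WHATWG draft of the time; no shipping browser supported it, hence the custom WebKit build):

<!-- Draft <device> element: the user picks a camera, the stream feeds a <video>. -->
<p>Select a camera: <device type="media" onchange="update(this.data)"></p>
<video autoplay></video>
<script>
  function update(stream) {
    // The draft exposed the captured stream on the element's .data property.
    document.querySelector('video').src = stream.url;
  }
</script>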
Check out the Red5 project. I think that it is what you're looking for. The examples are quite good.
http://red5.org
