I have a recipe website and am trying to put the JSON-LD in the head of the page. The whole website is in Indonesian, and I was wondering if it's allowed to use a language other than English in JSON-LD. So instead of writing:
"recipeCategory": "Breakfast",
"recipeCuisine": "Indonesian",
can I write:
"recipeCategory": "Sarapan",
"recipeCuisine": "Indonesia",
Or do I need to define the language first?
Thanks
Yes, there are no restrictions on language, but you may want to indicate the default language for strings using @language in your context. Note that any vocabulary you use might seem to be in English, but the terms are actually just shortcuts for URIs, and can always be aliased to Indonesian if you like.
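For example, a minimal sketch of a context that makes Indonesian the default language for strings (the name value is just an illustration):

{
  "@context": {
    "@vocab": "https://schema.org/",
    "@language": "id"
  },
  "@type": "Recipe",
  "name": "Nasi Goreng Spesial",
  "recipeCategory": "Sarapan",
  "recipeCuisine": "Indonesia"
}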
Depending on your consumers, you may want to be multi-lingual, and might consider using language maps.
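For instance, a minimal sketch of a language map (the kategori alias is a made-up term for illustration):

{
  "@context": {
    "kategori": {
      "@id": "https://schema.org/recipeCategory",
      "@container": "@language"
    }
  },
  "kategori": {
    "en": "Breakfast",
    "id": "Sarapan"
  }
}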
JSON-LD is based on the RDF data model, and there is no bias towards any particular language for the data. Due to JSON (and RDF as well), you are restricted to UTF-8, though.
New in JSON-LD 1.1 is the ability to indicate text direction as well, although I don't think this is necessarily useful for Indonesian.
I'm trying to build a VoiceXML interface to a machine translation system. Most of the menu design is simple enough, but when the user actually says the phrase to be translated, I need to be able to take in whatever text comes from the ASR without trying to match it to a finite grammar. Is there a standard way to do this in VoiceXML?
If by standard way you mean VoiceXML with SRGS/SISR, you could build a grammar that had every word of the target language and use the semantic interpretation (SI) to reassemble the content into a slot. Not a practical solution, but a possible one within the specification's constraints.
If you are just looking at VoiceXML, only building the capability into a browser would be a constraint, as VoiceXML doesn't provide any relevant restrictions for how $lastresult is populated.
Knowing your implementation constraints and what you are trying to achieve might help in creating a practical solution.
Standard VoiceXML does not allow you to capture free text (because you always use a grammar with strict rules), so you are planning to go outside the initial scope of the specification.
If you can control your VoiceXML interpreter implementation, you can use the same method as we do. With our Voximal VoiceXML interpreter we solve this by using a built-in grammar:
<field name="text" type="text"> : it uses builtin:grammar/text
You can extend it by adding parameters like "text?lang=en-US" or "text?model=MyWatsonModel".
The text result is placed in the field variable, and you can add extra values in the shadow variables.
All this is platform dependent, and out of the scope of the VoiceXML standard. But I think it is the best way to integrate speech-to-text into VoiceXML.
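A minimal sketch of such a dialog (again, the text builtin is Voximal-specific, not standard VoiceXML, and the prompt wording is illustrative):

<form id="dictate">
  <field name="text" type="text?lang=en-US">
    <prompt>Say the phrase to translate.</prompt>
    <filled>
      <!-- The recognized free text ends up in the field variable. -->
      <log>Recognized: <value expr="text"/></log>
    </filled>
  </field>
</form>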
I'm working on a multi-lingual online store for customers in Ireland who speak English, Latvian and Russian.
PayPal is available in English and Russian (and others) but not Latvian.
I would like to have my form sent to PayPal setting it to show in English by default, or Russian if the customer is reading the store in Latvian or Russian.
The problem is that the API code for this, LOCALECODE, requires both the country and the language. So for example, ie_EN would be Ireland-localised English (which PayPal doesn't support), ie_LV would be Ireland-localised Latvian (again not supported) and ie_RU would be Ireland-localised Russian (again, etc).
Is there a generic way of saying "just use the language, please", without needing to hard-code a list of available languages?
You used to be able to send LOCALECODE=EN, but that has now been superseded by the more traditional LOCALECODE=en_US.
If it's just two languages you're worried about, I'm not sure what the problem is with a simple if-statement to set the correct language.
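For example, a minimal sketch in C# (the helper name and fallback choices are illustrative; check PayPal's list of supported locales):

static string PayPalLocale(string storeLanguage)
{
    // PayPal has no Latvian locale, so fall back to Russian for
    // Latvian and Russian readers, and English for everyone else.
    switch (storeLanguage)
    {
        case "lv":
        case "ru":
            return "ru_RU";
        default:
            return "en_US";
    }
}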
I do not know whether you have figured out this issue yet.
You can do the following: add "locale.x=<language value allowed by PayPal>" to the URL.
For example: https://www.paypal.com/cgi-bin/webscr?locale.x=zh_HK
I followed this and it works.
I run a website that allows users to write blog posts. I would really like to summarize the written content and use it to fill the <meta name="description" .../> tag, for example.
What methods can I employ to automatically summarize/describe the contents of user generated content?
Are there any (preferably free) methods out there that have solved this problem?
(I've seen other websites just copy the first 100 or so words but this strikes me as a sub-optimal solution.)
Think of the task of summarization as a challenge to 'select the most important sentences' from the document.
The Automatic Creation of Literature Abstracts by H.P. Luhn (1958) describes a naive method that actually performs quite well. Try giving it a shot.
If your website is in Python, coding this algorithm using NLTK (the Natural Language Toolkit) is a fun task.
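A minimal sketch of Luhn's idea with NLTK (assuming the punkt and stopwords data are downloaded; the top-25 cutoff and the exact scoring formula are simplifications of the paper):

from collections import Counter

import nltk
from nltk.corpus import stopwords

def summarize(text, num_sentences=3):
    sentences = nltk.sent_tokenize(text)
    stop = set(stopwords.words("english"))
    words = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]
    # Luhn's "significant" words: frequent content words.
    freq = Counter(w for w in words if w not in stop)
    significant = {w for w, _ in freq.most_common(25)}

    def score(sentence):
        tokens = [w.lower() for w in nltk.word_tokenize(sentence) if w.isalpha()]
        hits = [i for i, w in enumerate(tokens) if w in significant]
        if not hits:
            return 0.0
        # Density of significant words over the span they cover.
        span = hits[-1] - hits[0] + 1
        return len(hits) ** 2 / span

    ranked = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Return the selected sentences in document order.
    return " ".join(s for s in sentences if s in ranked)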
Make it predictable.
From a user's perspective, simply using the first paragraph is not bad at all.
Any automation is bound to fall flat in some cases, so I suggest displaying the first paragraph (perhaps truncated at some point) as the summary, and offering the ability to override it via an optional field.
I might try using Mechanical Turk or any number of other crowdsourcing options.
Another item to check out: the AutoSummary Semantic Analysis Engine, a SourceForge project.
Not a trivial task... You should look for articles or books on "extractive summarization"
A few starters could be:
Books:
Natural Language Processing with Python
Foundations of Statistical Natural Language Processing
Articles:
Language independent extractive summarization
Extractive summarization: how to identify the gist of a text
Extractive Summarization using Inter- and Intra- Event Relevance
Yahoo has a free API for this:
http://developer.yahoo.com/search/content/V1/termExtraction.html
Apple's patent 6424362, "Auto-summary of document content", contains sample code which might be useful...
This borders on artificial intelligence so there's not going to be an "easy" solution out there, but there are products that target this problem.
Check out Copernic Summarizer, for one.
Noun phrases typically tend to be important elements of a sentence. Picking sentence(s) with a high density of noun phrases could yield a good summary. You could get noun phrases using a POS tagger.
For a good summary, it is desirable that it be a meaningful sentence; reading a broken sentence is slightly jarring.
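A minimal sketch of that idea with NLTK's POS tagger (assuming the punkt and averaged_perceptron_tagger data are downloaded; counting NN* tags is a rough stand-in for proper noun-phrase chunking):

import nltk

def noun_density(sentence):
    tokens = nltk.word_tokenize(sentence)
    if not tokens:
        return 0.0
    # Fraction of tokens tagged as nouns (NN, NNS, NNP, NNPS).
    tags = nltk.pos_tag(tokens)
    return sum(1 for _, tag in tags if tag.startswith("NN")) / len(tokens)

def pick_summary(text, num_sentences=2):
    sentences = nltk.sent_tokenize(text)
    ranked = sorted(sentences, key=noun_density, reverse=True)[:num_sentences]
    # Keep the chosen sentences in their original order.
    return " ".join(s for s in sentences if s in ranked)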
Alternatively, when the author posts the article, the author can highlight the keywords to be used in the description, which can then be automatically put in the meta description tag.
Say I come up with some super-duper way of representing some data that I think would be useful for other people to know about and use. Assume I have a 'spec' in some form, even if it might not be a completely formal one: i.e., I already know how this file format will work.
How would I then go about releasing this spec to get comments and feedback based on it? How would I get it 'standardised' in some form?
Specifying file formats is difficult. If the data you want to store is trivial, the specification tends to be trivial too. In general, however, this is hardly the case. You can use the RFC structure and keywords, but I have always found specifying a file format in prose a slow, difficult and boring task, partly because reading it is likewise difficult.
My suggestion, if you want to follow this route, is to focus on blocks of information. Most of the difficulty lies in entities that are optional, and present only if another condition holds, so try to exploit this when partitioning your data.
The best spec, IMHO, is real code with an uber-perfect test suite.
As for standardization: if enough people use it, it becomes a de-facto standard. You don't need an official stamp for it, although once the format is used widely enough, you could benefit from an official MIME type.
To talk about it, well, it depends. I have found it useful to talk in terms of "object-oriented" entities, and also in terms of relationships. Database-like diagrams are very useful in this respect.
Finally, try to find a decent, already-standard alternative first, or at least try not to deal with the raw bits. There are a lot of perfectly good container formats out there that free you from many annoying tasks. The choice of container depends on the actual type of file format (e.g. whether you need encryption, interleaving, transactions, etc.).
There are a couple of ways I'd go about it, I think.
First, determine if there's a standards body (like W3C, or IEEE) that might be related to your file format. If there is, pitch it to them. I have no idea how receptive they'd be though.
Second, a standard is useless if nobody is using it. Get some momentum behind it. Write a blog post, tweet about it, and make a website about it. Link it on programming.reddit.com and Slashdot. Describe it to your friends and colleagues. Post it here on SO and ask for feedback.
HTH.
So we are sure that we will be taking our product internationally and will eventually need to internationalize it. How much internationalizing would you recommend we do as we go along?
I guess in other words, is there any internationalization that is easy now but can be much worse if we let the code base mature and that won't slow us down very much if we choose to start doing it now?
Tech used: C#, WPF, WinForms
Prepare it now, before you write all the strings in the codebase itself.
Everything after now will be too late. It's now or never!
It's true that it is a bit of extra effort to prepare well now, but not doing it will end up being a lot more expensive.
If you won't follow all the guidelines in the links below, at least heed points 1, 2 and 7 of the summary, which are very cheap to do now and which cause the most pain afterwards in my experience.
Check these guidelines and see for yourself why it's better to start now and get everything prepared.
Developing world ready applications
Best practices for developing world ready applications
Little extract:
Move all localizable resources to separate resource-only DLLs. Localizable resources include user interface elements such as strings, error messages, dialog boxes, menus, and embedded object resources. (Moving the resources to a DLL afterwards will be a pain)
Do not hardcode strings or user interface resources. (If you don't prepare, you know you will hardcode strings)
Do not put nonlocalizable resources into the resource-only DLLs. This causes confusion for translators.
Do not use composite strings that are built at run time from concatenated phrases. Composite strings are difficult to localize because they often assume an English grammatical order that does not apply to all languages. (After the interface design, changing phrases gets harder)
Avoid ambiguous constructs such as "Empty Folder" where the strings can be translated differently depending on the grammatical roles of the strings' components. For example, "empty" can be either a verb or an adjective, and this can lead to different translations in languages such as Italian or French. (Same issue)
Avoid using images and icons that contain text in your application. They are expensive to localize. (Use text rendered over the image)
Allow plenty of room for the length of strings to expand in the user interface. In some languages, phrases can require 50-75 percent more space. (Same issue, if you don't plan for it now, redesign is more expensive)
Use the System.Resources.ResourceManager class to retrieve resources based on culture (see the sketch after this list).
Use Microsoft Visual Studio .NET to create Windows Forms dialog boxes, so they can be localized using the Windows Forms Resource Editor (Winres.exe). Do not code Windows Forms dialog boxes by hand.
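To illustrate the ResourceManager point above, a minimal sketch (the "MyApp.Strings" base name and "Greeting" key are hypothetical; adjust them to your own resource files):

using System.Globalization;
using System.Resources;

class Greeter
{
    static readonly ResourceManager Resources =
        new ResourceManager("MyApp.Strings", typeof(Greeter).Assembly);

    static string Greeting()
    {
        // Falls back through the resource hierarchy (e.g. "fr-FR"
        // -> "fr" -> neutral) based on the given culture.
        return Resources.GetString("Greeting", CultureInfo.CurrentUICulture);
    }
}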
IMHO, a claim that something is going to happen "in a few years" literally translates to "we hope one day", which really means "never". Although I would still skim over various tutorials to make sure you don't make any horrendous mistakes. Doing correct internationalization support now will mean less work in the future, and once you get used to it, it won't have any real effect on today's productivity. But if you can measure the goal in years, maybe it's not worth doing at all right now.
I have worked on two projects that did internationalization: a C# ASP.NET app (which existed before I joined the project) and a PHP app (for which I homebrewed my own method using a free internationalization control and my own management app).
You should store all the text (labels, button text, etc.) as data inside a database. Reference these with keys (I prefer to use the first 4 words, made uppercase, with spaces converted to underscores and non-alphanumerics stripped out), and when you have a duplicate, append a number to the end. The benefit of this key scheme is that the programmer gets a pretty strong understanding of the content of the text just by looking at the key.
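A minimal sketch of that key scheme in C# (the duplicate-numbering step is left to the code that stores the keys):

using System;
using System.Linq;
using System.Text.RegularExpressions;

static class KeyMaker
{
    public static string MakeKey(string text)
    {
        // First four words, non-alphanumerics stripped, spaces
        // converted to underscores, made uppercase.
        var words = Regex.Replace(text, "[^A-Za-z0-9 ]", "")
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Take(4);
        return string.Join("_", words).ToUpperInvariant();
    }
}

For example, MakeKey("Please enter your shipping address.") gives "PLEASE_ENTER_YOUR_SHIPPING".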
Write a utility to extract the data and build .NET resource files that you add into your project for compilation. Create a separate resource file for each language. In your code, use the key to point to the proper entry.
I would skim over the MS documents on the subject:
http://www.microsoft.com/globaldev/getwr/dotneti18n.mspx
Some basic things to avoid:
never ever ever use translation software; hire a pro, or an intern taking that language at a local college
never try to create text by appending two existing entries, because grammar differs greatly in each language; this will never work. So if you have a string that says "Click" and want one that says "Click Now", do not try to create a setup that merges two entries, or, during translation, copy the word for click and translate the word now. Treat every string as a totally new translation from scratch, as sketched below.
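A minimal sketch of that pitfall (the dictionary stands in for real resource lookups, and the keys are made up):

using System.Collections.Generic;

class CompositeStrings
{
    static readonly Dictionary<string, string> Strings = new Dictionary<string, string>
    {
        ["CLICK"] = "Click",
        ["NOW"] = "Now",
        ["CLICK_NOW"] = "Click Now",
    };

    // Avoid: assumes English word order, which breaks in many grammars.
    static string Bad() => Strings["CLICK"] + " " + Strings["NOW"];

    // Prefer: translate the whole phrase as a single unit.
    static string Good() => Strings["CLICK_NOW"];
}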
I will add: store and manipulate string data as Unicode (NVARCHAR in MS SQL).
Some questions to think about…
How much can you afford to delay the shipment of the English version of your application to save a bit of the cost of internationalizing later?
Will you still be trading if you don’t get the cash flow from shipping the English version quickly?
How will you get the UI right, if you don’t get feedback quickly from some customers about it?
How often will you rewrite the UI before you have to internationalize it?
Do your English customers wish to be able to customize strings in the UI? E.g., not everyone calls a "shipping note" the same thing.
As a large part of the pain of internationalizing is making sure you don't break the English version, is automated system testing of the UI a better investment?
The only thing I think I will always do is: "Do not use composite strings that are built at run time from concatenated phrases", and if you do, don't spread the code that builds up a single string over lots of methods.
Having your UI automatically resize (and lay out) to cope with the length of labels etc. will save you lots of time over the years if you can do it cheaply. There are lots of 3rd-party control sets for Windows Forms that let you label text boxes etc. without having to put the labels on as separate controls.
I am just starting to internationalize a WinForms application; we hope to mostly be able to use the "name" of each control as the lookup key, without having to move lots into resource files etc. It is not always as hard as you think at first…
You could use NGettext.Wpf (it can be installed from NuGet, and yes I am the author, but I made it out of the frustrations listed in the other answers).
It is hosted in this GitHub repository, and here is the getting started section at the time of writing:
NGettext.Wpf is intended to work with dependency injection. You need to call the following at the entry point of your application:
NGettext.Wpf.CompositionRoot.Compose("ExampleDomainName");
The "ExampleDomainName" string is the domain name. This means that when the current culture is set to "da-DK" translations will be loaded from "Locale\da-DK\LC_MESSAGES\ExampleDomainName.mo" relative to where your WPF app is running (You must include the .mo files in your application and make sure they are copied to the output directory).
Now you can do something like this in XAML:
<Button CommandParameter="en-US"
Command="{StaticResource ChangeCultureCommand}"
Content="{wpf:Gettext English}" />
Which demonstrates two features of this library. The most important is the Gettext markup extension which will make sure the Content is set to the translation of "English" with respect to the current culture, and update it when the current culture is changed. The other feature it demonstrates is the ChangeCultureCommand which changes the current culture to the given culture, in this case "en-US".
I also highly recommend reading Preparing Strings from the gettext utilities manual.
Internationalization will let your product be usable in other countries; it's easy and should be done from the start (that way, English-speaking people all over the world can use your software). These 3 rules will get you most of the way there:
Support international characters - use only Unicode data types in files and databases.
Support international date, time and number formats - use CultureInfo.InvariantCulture when storing data to a file or other computer-readable storage, use CultureInfo.CurrentCulture when displaying data or parsing user input, never do your own parsing, and never use any other culture objects (see the sketch after this list).
Textual data entered by the user should be considered a black box; don't try to break it up into words or letters, especially when displaying it to the user - different languages have different rules, and the OS knows how to display left-to-right text, you don't.
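A minimal sketch of the second rule (the method names are illustrative):

using System;
using System.Globalization;

class CultureRules
{
    // Storage: culture-invariant, round-trippable format.
    static string ToStorage(DateTime value) =>
        value.ToString("o", CultureInfo.InvariantCulture);

    // Display: formatted for the user's culture.
    static string ToDisplay(decimal price) =>
        price.ToString("N2", CultureInfo.CurrentCulture);

    // Parsing user input: also the user's culture, never your own parser.
    static decimal ParseUserInput(string text) =>
        decimal.Parse(text, NumberStyles.Number, CultureInfo.CurrentCulture);
}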
Localization is translating the software into different languages. This is difficult and expensive; a good start is to never hard-code strings and never build sentences out of smaller strings.
If you use test data, use non-English (e.g.: Russian, Polish, Norwegian etc) strings.
Encoding rears its ugly little head at every corner, if not in your own libraries then in external ones.
I personally favor Russian because, although I don't speak a word of Russian (despite my name's origin), it has foreign characters in it and takes way more space than English, and therefore tests your spacing too.
I don't know if that is something language-specific, or just because our Russian translator likes verbose strings.