How to export text from all pages of a MediaWiki? - export

I have a MediaWiki running which represents a dictionary of German terms and their translation to a local dialect. Each page holds one term, its translation and a number of additional infos.
Now, for a printable version of the dictionary, I need a full export of all terms and their translation. Since this is an extract of a page's content, I guess I need a complete export of all pages in their newest version in a parsable format, e.g. xml or csv.
Has anyone done that or can point me to a tool?
I should mention, that I don't have full access to the server, e.g. no command line, but I am able to add MediaWiki extensions or access the MySQL database.

You can export the page content directly from the database. It will be the raw wiki markup, as when using Special:Export. But it will be easier to script the export, and you don't need to make sure all your pages are in some special category.
Here is an example:
SELECT page_title, page_touched, old_text
FROM revision,page,text
WHERE revision.rev_id=page.page_latest
AND text.old_id=revision.rev_text_id;
If your wiki uses Postgresql, the table "text" is named "pagecontent", and you may need to specify the schema. In that case, the same query would be:
SET search_path TO mediawiki,public;
SELECT page_title, page_touched, old_text
FROM revision,page,pagecontent
WHERE revision.rev_id=page.page_latest
AND pagecontent.old_id=revision.rev_text_id;

This worked very well for me. Notice I redirected the output to the file backup.xml. From a Windows Command Processor (CMD.exe) prompt:
cd \PATH_TO_YOUR_WIKI_INSTALLATION\maintenance
\PATH_OF_PHP.EXE\php dumpBackup.php --full > backup.xml

I'm not completely satisfied with the solution, but I ended up specifying a common category for all pages and then I can add this category and all of the containing page names in the Special:Export box. It seems to work, allthough I'm not sure if it will still work when I reach a few thousand pages.

Export
cd maintenance
php5 ./dumpBackup.php --current > /path/wiki_dump.xml
Import
cd maintenance
php5 ./importDump.php < /path/wiki_dump.xml

It looks less than simple. http://meta.wikimedia.org/wiki/Help:Export might help, but probably not.
If the pages are all structured in the same way, you might be able to write a web scraper with something like Scrapy

You can use the special page, Special:Export to export to XML; here is Wikipedia's version.
You might also consider Extension:Collection if you want it eventually human readable (e.g. PDF) form.

You can set https://www.mediawiki.org/wiki/Manual:$wgExportAllowAll to true, then export all pages from Special:Export.

Related

How can I keep changes in the index when I use DIH fullimport?

I'm using Solr 6.5 to index files from multiples ftp files into multiples cores (having one core for each type of document, like audio file, image, software, video and documents).
The situation is that I'm doing this to populate an app that in its front end has a social networking approach in which every user can add new tags or modify other metadata without restriction.
So when I execute again data import handler to add new files to my application, it erase the index that previosly was modified for the user and set up with the data-config default configuration.
My question: is there a way to tell DIH, if the id exists, continues without importing and just adds the files which don't have an id in the index?
If this is not possible, can I do something similar in a different way?
Thanks for everything!
Sounds like you are doing a full import with default settings. One of them is clean, which defaults to true and deletes the whole index before the import.
Try setting it to false and also look at preImportDeleteQuery and postImportDeleteQuery for even more precision.

Dynamic locale name in multi-language site

I am developing an application in cakephp 2.3.4, Which is multi-language.
Admin can add any number of new languages.
My question is, When admin decides to add a new language, how resulting locale name should be defined.
Can a locale name be any arbitrary name, given by admin or it should be a dropdown containg all languages code according to language.
Unfortunately, your question is a bit 'vague', i.e., will administrators be able to add GNU-locale files (*.po), or are you talking about adding translations inside the database.
In any case, CakePHP uses locales according to the ISO 639-3 standard see here and here for more information. A complete list of those locales can be found inside the I10n class.
Since you probably also want to switch the locale of PHP itself when switching locales, so that, for example, date, money and time-formats will follow the right format for the locale, it's best to stick with those locales and not 'invent' your own locales.
See setlocale(). Be aware though, that PHP may use slightly different locale-codes than CakePHP uses. And it will depend on what locales are installed on your server.
To get a list of locales installed on your server, use locale -a on the command line. See this page for more information: https://wiki.archlinux.org/index.php/Locale
Which techniques to use for localization
A quick summary of techniques to use;
Short messages (interface/UI)
In general, locale files are used for short pieces of text. Locale-files are therefore mostly used for fixed strings,
for example, strings that are used in the interface (like 'are you sure you want to delete this file?' => 'weet u zeker dat u dit bestand wilt verwijderen?).
Longer (fixed) text
For longer pieces of text in your application, that are not part of the 'content' (not the blog-post, but for example a fixed page with a disclaimer),
it's best to use separate views for translated content, for example;
app/Views/MyController/disclaimer_eng.ctp
app/Views/MyController/disclaimer_deu.ctp
app/Views/MyController/disclaimer_fre.ctp
Content
For the content of your website (the part of your website that is managed by the 'user' of the website),
put translations inside the database. This data may be updated frequently and all translations should be updated as well.
How to implement this, is really up to you and depends on your situation. CakePHP offers a Translate behavior that you can use (http://book.cakephp.org/2.0/en/core-libraries/behaviors/translate.html), but in most of my situations that behavior didn't really fit our needs (IMO it is not very efficient, because it stores translations per-field, per-model).

Difficulty with filename and filemime when using Migrate module

I am using the Drupal 7 Migrate module to create a series of nodes from JPG and EPS files. I can get them to import just fine. But I notice that when I am done importing them if I look at the nodes it creates, none of the attached filefield and thumbnail files contain filename information.
Upon inspecting the file_managed table I see that both the filename and filemime fields are empty for ONLY the files that I attached via the migrate module. This also creates an issue with downloading the files.
Now I think the problem has to do with the fact that I am using "file_link" instead of "file_copy" as the file operation I specify. The problem is I am importing around 2TB (thats Terabytes) of image files. We had to put in a special request with Rackspace just to get access to that much disk space on our server. So I can't go around copying from one directory to the next because of space issues. So "file_link" seems like the obvious choice.
Now you probably want to see how I am doing this exactly, so here is the code snippet:
$jpg_arguments = MigrateFileFieldHandler::arguments(NULL,
'file_link', FILE_EXISTS_RENAME, 'en', array('source_field' => 'jpg_name'),
array('source_field' => 'jpg_filename'), array('source_field' => 'jpg_filename'));
$this->addFieldMapping('field_image', 'jpg_uri')
->arguments($jpg_arguments);
As you can see I am specifying no base path (just like the beer.inc example file does). I have set file_link, the language, and the source fields for the description, title, and alt.
It is able to generate thumbnails from the JPGs. But still missing those columns of data in the db table. I traced through the functions the best I could but I don't see what is causing this. I tried running the uri in the table through the functions that generate the filename and the filemime and they output just fine. It is like something is removing just those segments of data.
Does anyone have any idea what this could be? I am using the Drupal 7 Migrate module version 2.2. It is running on Drupal 7.8.
Thanks,
Patrick
Ok, so I have found the answer to yet another question of mine. This is actually an issue with the migrate module itself. The issue is documented here. I will be repealing this bounty (as soon as I figure out how).

Diff tool which generates documentation?

I have tried Beyond Compare, and it seems to be a good tool.
But I haven't found a way to export an overview of the differences.
The format of it should be one that most people can read. Doc, Rtf, Pdf, Html...
What I need is to display the differences of two folder. And it would be enough to display which files has been changed. But it would also be nice if it would be possible to, in the documentation, go deeper and actually see which rows in a file has been changed.
Are there any tools that can do this?
Beyond Compare has some functions to do that.
For example, in the folder diff view, select the files you want to report and then select Actions->File Compare Report. HTML is one of the output formats supported there.
Araxis Merge covered all of my needs.
Simple to use
Generated a nice overview of files in folder structure
Could click on changed files to see the changes in the content
The colors could be better, but that can be solved by inserting a custom CSS-file. :)
This script to colorize diff output to HTML might be useful. There are many other tools, one more is difftool.
On a relatively different note, I had used a code coverage tool that also generated HTML code views from gcov coverage information. Its called lcov.

How do you build a multi-language web site?

A friend of mine is now building a web application with J2EE and Struts, and it's going to be prepared to display pages in several languages.
I was told that the best way to support a multi-language site is to use a properties file where you store all the strings of your pages, something like:
welcome.english = "Welcome!"
welcome.spanish = "¡Bienvenido!"
...
This solution is ok, but what happens if your site displays news or something like that (a blog)? I mean, content that is not static, that is updated often... The people that keep the site have to write every new entry in each supported language, and store each version of the entry in the database. The application loads only the entries in the user's chosen language.
How do you design the database to support this kind of implementation?
Thanks.
Warning: I'm not a java hacker, so YMMV but...
The problem with using a list of "properties" is that you need a lot of discipline. Every time you add a string that should be output to the user you will need to open your properties file, look to see if that string (or something roughly equivalent to it) is already in the file, and then go and add the new property if it isn't. On top of this, you'd have to hope the properties file was fairly human readable / editable if you wanted to give it to an external translation team to deal with.
The database based approach is useful for all your database based content. Ideally you want to make it easy to tie pieces of content together with their translations. It only really falls down for all the places you may want to output something that isn't out of a database (error messages etc.).
One fairly old technology which we find still works really well, is to use gettext. Gettext or some variant seems to be available for most languages and platforms. The basic premise is that you wrap your output in a special function call like so:
echo _("Please do not press this button again");
Then running the gettext tools over your source code will extract all the instances wrapped like that into a "po" file. This will contain entries such as:
#: myfolder/my.source:239
msgid "Please do not press this button again"
msgstr ""
And you can add your translation to the appropriate place:
#: myfolder/my.source:239
msgid "Please do not press this button again"
msgstr "s’il vous plaît ne pas appuyer sur le bouton ci-dessous à nouveau"
Subsequent runs of the gettext tools simply update your po files. You don't even need to extract the po file from your source. If you know you may want to translate your site down the line, then you can just use the format shown above (the underscored function) with all your output. If you don't provide a po file it will just return whatever you put in the quotes. gettext is designed to work with locales so the users locale is used to retrieve the appropriate po file. This makes it really easy to add new translations.
Gettext Pros
Doesn't get in your way while coding
Very easy to add translations
PO files can be compiled down for speed
There are libraries available for most languages / platforms
There are good cross platform tools for dealing with translations. It is actually possible to get your translation team set up with a tool such as poEdit to make it very easy for them to manage translation projects
Gettext Cons
Solves your site "furniture" needs, but you would usually still want a database based approach for your database driven content
For more info on gettext see this wikipedia page
They way I have designed the database before is to have an News-table containing basic info like NewsID (int), NewsPubDate (datetime), NewsAuthor (varchar/int) and then have a linked table NewsText that has these columns: NewsID(int), NewsText(text), NewsLanguageID(int). And at last you have a Language-table that has LanguageID(int) and LanguageName(varchar).
Then, when you want to show your users the news-page you do:
SELECT NewsText FROM News INNER JOIN NewsText ON News.NewsID = NewsText.NewsID
WHERE NewsText.NewsLanguageID = <<Session["UserLanguageID"]>>
That Session-bit is a local variable where you store the users language when they log in or enters the site for the first time.
Java web applications support internationalization using the java standard tag library.
You've really got 2 problems. Static content and dynamic content.
for static content you can use jstl. It uses java ResourceBundles to accomplish this. I managed to get a Databased backed bundle working with the help of this site.
The second problem is dynamic content.
To solve this problem you'll need to store the data so that you can retrieve different translations based on the user's Locale. (Locale includes Country and Language).
It's not trivial, but it is something you can do with a little planning up front.
#Auron
thats what we apply it to. Our apps are all PHP, but gettext has a long heritage.
Looks like there is a good Java implementation
Tag libraries are fine if you're using JSP, but you can also achieve I18N using a template-based technology such as FreeMarker.

Resources