Download Wikipedia in one or more files

I would like to load data from Wikipedia for a task in Hadoop. I found some links: http://www.kiwix.org/wiki/Main_Page#Wikipedia_files, https://archive.org/details/enwiki-20160113. But I am not sure what format the data will be in or how to work with it. So the question is: does anybody know if it is possible to download Wikipedia as one or more txt files?

Well, you can download the most recent complete dumps of Wikipedia content (another dump is in progress, dated 20161101) here: https://dumps.wikimedia.org/enwiki/20161020/
Note that I don't think this includes the media files themselves, and this example is only the English-language site - the other sites are available there too.
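Those dumps are bzip2-compressed MediaWiki XML (e.g. pages-articles.xml.bz2), not plain txt: each article is a `<page>` element with its wikitext inside `<revision><text>`. A minimal, stdlib-only Python sketch for streaming (title, wikitext) pairs out of such a file (the dump filename in the comment is just an example):

```python
import io
import xml.etree.ElementTree as ET

def local(tag):
    # Strip the XML namespace, e.g. '{http://www.mediawiki.org/...}page' -> 'page'
    return tag.rsplit('}', 1)[-1]

def iter_pages(xml_stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    for _, elem in ET.iterparse(xml_stream):
        if local(elem.tag) == 'page':
            title, text = None, ''
            for child in elem.iter():
                if local(child.tag) == 'title':
                    title = child.text
                elif local(child.tag) == 'text':
                    text = child.text or ''
            yield title, text
            elem.clear()  # free memory; a full dump is tens of GB uncompressed

# Tiny inline sample; for a real dump you would do something like:
#   import bz2
#   with bz2.open('enwiki-20161020-pages-articles.xml.bz2', 'rb') as f:
#       for title, wikitext in iter_pages(f): ...
sample = (b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
          b'<page><title>Foo</title><revision><text>Hello</text></revision></page>'
          b'</mediawiki>')
print(list(iter_pages(io.BytesIO(sample))))  # [('Foo', 'Hello')]
```

From there you can feed the wikitext into whatever Hadoop input format you need; stripping wiki markup down to plain text is a separate step (tools like WikiExtractor exist for that).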

Related

Exporting Trac wiki to Confluence using the UWC tool

I want to create an export of a Trac wiki and import it into Confluence using the UWC tool.
All Trac wiki pages and attachments are stored in a Postgres database.
Can anyone please let me know how I can create a Trac export that UWC can read? I am also confused about the Trac Environment - where can I find it?
Thank you,
Akash
Last thing first: if you don't know what a Trac Environment is, you'll hardly be fit for the task you're aiming at right now. Reading a bit about Trac, which is largely self-documenting, would help - no matter from where, be it the 'TracEnvironment' page in your own Trac wiki or the authoritative resource.
OTOH I'm not familiar at all with UWC, but a bit of research revealed that the Trac wiki is supported by a user-contributed module in UWC. So I guess that its documentation page is the natural place to start from. From what I've read so far, the export/import is file-based and in line with the recommendations under 'Exporter' on that same page.
First, read https://migrations.atlassian.net/wiki/display/UWC/UWC+Trac+Notes.
Second, for attachments, you may need to use this as a reference: http://l33t.peopleperhour.com/blog/2014/07/10/converting-a-trac-wiki-to-confluence/ - in any case you will need to adapt it to the actual DBMS behind your Trac.
Note that the current UWC-Trac integration is not as polished as one might expect, so be prepared to download the UWC source, debug it, and patch it.

MongoDb and Symfony2.4 file is not stored in Gridfs

I have implemented the file upload functionality with reference to this link:
http://www.slideshare.net/mongodb/mongo-db-bangalore-2012-15070802
But the file is not stored in GridFS.
I have done some research on this, also with reference to this blog post:
http://php-and-symfony.matthiasnoback.nl/2012/10/uploading-files-to-mongodb-gridfs-2/
But unfortunately I have been stuck on this issue for the last 15 days.
Please help.
Please take a look at KnpLabs/Gaufrette and the related KnpLabs/KnpGaufretteBundle
The Gaufrette bundle provides a level of abstraction around filesystems, and it helped me get file-oriented operations up and running quickly. I found it very useful; in fact, the Symfony CMS package leverages this bundle. It may help you out as well.

Google App Engine PDF converter

I'm looking for a good, open source, PDF generator/library that will convert html (with styling etc.) into a PDF file.
Requirement:
Must be Java or Python and run on Google App Engine.
Must be Free, open-source.
Must be easy to use/consume.
Yes, I have tried searching for this myself - I've tried many "solutions" found via Google and elsewhere, but none has satisfied me yet. Many seem incomplete, buggy, or don't work well on GAE. So I figured I would appeal to the Stack Overflow community for opinions or suggestions.
For HTML/image to PDF I use the Python library http://www.xhtml2pdf.com/, which builds on Pisa, ReportLab, pyPdf, and html5lib and runs on GAE. I have been using it to generate very nice article PDFs with embedded images, and once I figured out how to get the page size correct, I found it to be a very good library.
You will need the xhtml2pdf library and its dependencies:
https://github.com/chrisglass/xhtml2pdf
I threw together some example Python code and put it in this pastebin:
http://pastebin.com/FFEZjNs3
The pdf_data you get at the end is the binary PDF file data. The html_data you give to pisa is really any string containing an HTML document.
There are some recommended things to include in your HTML to get well-formatted PDF output. Here is an example HTML document similar to the base template I use. Note the author meta field and the @page CSS rule:
http://pastebin.com/q1wRm9nJ
Here are the docs about the compatible CSS and HTML:
https://github.com/chrisglass/xhtml2pdf/blob/master/doc/usage.rst#supported-css-properties
You can include images using either the URL of an external image or a data URI; xhtml2pdf has a function for creating the latter, "pisa.makeDataURI()".
Hopefully that helps.

CMSMS File Download Problems

I'm using the CMS Made Simple platform, which I'm not very familiar with!
The site has a secure frontend, which contains a document library for members. Files are stored outside the document root and links are generated by the CMS, so you should only be able to get the documents if you're logged in.
At first glance the setup works fine; however, certain PDFs uploaded in this fashion are corrupt on download, and line endings in text files aren't preserved.
Sorry if this is a bit vague; I'm hoping someone has come across a similar problem, but any ideas would be greatly appreciated.
Regards,
Rich
You should check which module/tag is handling the download and upload.
Are you sure the files are intact when uploaded?
Check the headers and the content-length calculation, and try different browsers and different methods of forcing the download.
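One way to pin down where the damage happens is to compare a checksum of the file as stored on the server against the downloaded copy: if they differ, the download path (headers, newline translation of binary data) is the culprit rather than the upload. A stdlib sketch, with placeholder paths:

```python
import hashlib

def file_digest(path, chunk_size=8192):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

# Compare the stored file against a freshly downloaded copy
# (both paths are hypothetical examples):
# if file_digest('/srv/docs/report.pdf') != file_digest('downloads/report.pdf'):
#     print('modified in transit - check headers / newline translation')
```

Line endings changing in text files is a classic sign that the file is being served or transferred in text mode somewhere; the same mechanism will silently corrupt PDFs.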

What's the easiest way to convert an SO data dump from HTML back to Markdown?

I've just got my hands on a Stack Overflow data dump, and I'm disappointed to see that the Body field of the posts is in HTML rather than Markdown. I suspect there's Markdown in the original database, because that's what I see if I try to edit an answer.
I want to recover Markdown from a large set of answers. I will be processing hundreds of entries in batch mode, using either command-line tools or some kind of Lua or C library, so an interactive tool like the WMD Markdown editor is not suitable. Can people say what tools are available to help me recover Markdown from a Stack Overflow data dump?
(Related question, not a duplicate: Convert HTML back to Markdown within wmd.)
Markdownify converts HTML to Markdown.
See Also: MetaSO / Can Markdown be recovered from the SO data dump?
Take a look at pandoc: http://johnmacfarlane.net/pandoc/
There is an html2markdown tool included with pandoc that works pretty well, and the program is run from the command line, making batch conversion quite convenient.
Here is the man page: http://johnmacfarlane.net/pandoc/html2markdown.1.html
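To get a feel for the kind of transformation pandoc or Markdownify performs, here is a toy stdlib-only converter that handles just a handful of tags. It is only an illustration; real SO post bodies (code blocks, lists, entities) need the full tools named above:

```python
from html.parser import HTMLParser

class MiniMarkdown(HTMLParser):
    """Toy HTML -> Markdown converter covering a few common inline tags."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == 'strong':
            self.out.append('**')
        elif tag == 'em':
            self.out.append('*')
        elif tag == 'code':
            self.out.append('`')
        elif tag == 'p':
            self.out.append('\n\n')  # paragraphs become blank-line breaks

    def handle_endtag(self, tag):
        # The inline markers are symmetric, so closing repeats them.
        if tag == 'strong':
            self.out.append('**')
        elif tag == 'em':
            self.out.append('*')
        elif tag == 'code':
            self.out.append('`')

    def handle_data(self, data):
        self.out.append(data)

def to_markdown(html):
    parser = MiniMarkdown()
    parser.feed(html)
    return ''.join(parser.out).strip()

print(to_markdown('<p>Use <code>pandoc</code> for <strong>batch</strong> runs.</p>'))
# Use `pandoc` for **batch** runs.
```

For the actual dump, a shell loop over pandoc's command-line invocation handles the batch-mode requirement directly.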