Get all articles in category from Wikipedia - export

I'm trying to get a .csv of all living people listed on Wikipedia, preferably with biographic columns for birth date, nationality, etc. I've tried the Special Export page to export the records as an .xml file, but it seems there are too many records for a proper export (there are about 800,000 living people on Wikipedia). I also tried PetScan, but that only generates a .csv with the titles of the Wikipedia pages and no actual content.
What would be the best way to go about exporting this data? I'm ok with exporting as an .xml file and converting this to a .csv manually.
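One hedged approach, not from the original thread: instead of Special Export, page through the category with the MediaWiki API and build the CSV yourself. The sketch below only collects page titles; biographic columns such as birth date or nationality would still need a follow-up lookup (for example against Wikidata). The endpoint and categorymembers parameters are standard MediaWiki API, but treat this as a starting point rather than a finished exporter.

```python
# Minimal sketch: page through Category:Living people via the MediaWiki API
# and write the titles to a CSV. Biographic fields would need a second lookup.
import csv
import requests

API = "https://en.wikipedia.org/w/api.php"

def dump_category_members(category, out_path):
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": "500",        # maximum batch size for anonymous requests
        "format": "json",
    }
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title"])
        while True:
            data = requests.get(API, params=params).json()
            for member in data["query"]["categorymembers"]:
                writer.writerow([member["title"]])
            if "continue" not in data:
                break
            params.update(data["continue"])  # cmcontinue token for the next batch

dump_category_members("Category:Living people", "living_people.csv")
```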

Related

Does storing file paths in a database defeat the purpose of data independence?

I'm building an application that contains a database of movies. Movies have various fields such as a title, directors, actors, etc. I also want to include a movie cover with each row in the movies database. I currently have a column in my database called "cover_path" which contains the absolute file path to an image (the movie cover). I will similarly have one called "movie_path" which contains the path to the actual movie.
Are there better ways of storing files in a database, and by storing the file path to an image in the database, am I defeating the purpose of data independence?
Using file paths has more advantages than storing binary data in the database; there are good discussions of this subject elsewhere.
Though I'm not sure in which way this would defeat the purpose of data independence on any level.
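As a rough illustration of the path-in-the-database approach (a sketch only; the table and column names just mirror the question):

```python
# Sketch of "store the path, not the blob". Table/column names (movies,
# cover_path, movie_path) mirror the question and are illustrative only.
import sqlite3

conn = sqlite3.connect("movies.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS movies (
        id          INTEGER PRIMARY KEY,
        title       TEXT NOT NULL,
        director    TEXT,
        cover_path  TEXT,   -- path to the cover image on disk
        movie_path  TEXT    -- path to the movie file itself
    )
""")
conn.execute(
    "INSERT INTO movies (title, director, cover_path, movie_path) VALUES (?, ?, ?, ?)",
    ("Blade Runner", "Ridley Scott",
     "/media/covers/blade_runner.jpg", "/media/movies/blade_runner.mkv"),
)
conn.commit()

# The application resolves the path at display time; the DB never stores the bytes.
cover = conn.execute("SELECT cover_path FROM movies WHERE title = ?",
                     ("Blade Runner",)).fetchone()[0]
```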

Database | Google Product Feeds Field Length

Trying to construct a database to hold a bunch of product data. The plan is to programmatically generate TXT (tab-delimited text) files from the data in the database and upload them to Google, Amazon, and Bing (all use the Google Product Feeds file format).
The problem I'm having right now is that I seem to have to guess at the maximum field length for most of the available fields. I found several places telling me which fields are required/optional and what they are used for, but nowhere can I find how long they can be. This is critical to know in a fully automated product feed system, since we don't want to pull data from our warehouse system and INSERT it into our product feed database if it is invalid for the Google product feeds (too long).
Please advise.
Google Shopping will allow up to 72 characters for the title. Description fields are usually unlimited. For anything else, you usually wouldn't have anything too long anyway.
There are programs out there that do this already.
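A hedged sketch of what that length check could look like before the tab-delimited feed is written; the 72-character title limit is taken from the answer above, and the other limits are placeholders you would confirm against Google's current spec:

```python
# Sketch of a pre-insert length check for feed fields. The limits are
# illustrative: 72 chars for title comes from the answer above; verify the
# rest against the current product feed documentation.
import csv

FIELD_LIMITS = {"title": 72, "link": 2000, "id": 50}   # assumed values

def validate_row(row):
    """Return a list of (field, length, limit) violations for one product row."""
    problems = []
    for field, limit in FIELD_LIMITS.items():
        value = row.get(field, "")
        if len(value) > limit:
            problems.append((field, len(value), limit))
    return problems

def write_feed(rows, out_path):
    fields = ["id", "title", "description", "link", "price"]
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields, delimiter="\t")
        writer.writeheader()
        for row in rows:
            if validate_row(row):       # skip (or log) rows that would be rejected
                continue
            writer.writerow({k: row.get(k, "") for k in fields})
```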

Database or file type for containing modular page contents?

I'm building a personal website and I have a problem planning my DB structure.
I have a portfolio page and each artwork has its own description page.
So I need to save contents in a file or a database.
The description page follows some guidelines,
but the length, the number, and the order of elements are free.
For example, a page may have just one paragraph or several,
and in each paragraph, many footage embed codes and text blocks can be mixed.
The question is (I made a simple diagram to describe my needs):
What data format can I choose to save that content structure while maintaining the order?
I'm used to XML and I know XML could be one choice, but if the content is big, it will be hard to read and slow.
I've heard that JSON is replacing XML these days, but from what I've searched, JSON cannot maintain the order of elements, can it?
Waiting for clever recommendations:)
Thanks.
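On the ordering worry: JSON object keys carry no guaranteed order, but JSON arrays do preserve order, so a common pattern (sketched below with made-up field names) is to store each page as an ordered array of typed blocks:

```python
# Sketch: represent a description page as an ordered list of blocks.
# JSON arrays preserve element order across a round trip; only object
# *keys* are unordered.
import json

page = {
    "title": "Artwork #12",
    "blocks": [                                  # order is significant and preserved
        {"type": "text",    "value": "Opening paragraph..."},
        {"type": "footage", "embed": "<iframe src='...'></iframe>"},
        {"type": "text",    "value": "Closing remarks..."},
    ],
}

restored = json.loads(json.dumps(page))
assert [b["type"] for b in restored["blocks"]] == ["text", "footage", "text"]
```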

Storing website content: database or file?

I'm building a website, and I'm planning to publish various kinds of posts, like tutorials, articles, etc. I'm going to manage it with PHP, but when it comes to storing the content of each post, the very text that will be displayed, what's the better option: using a separate text file or adding it as a field for each entry in the database?
I don't see a reason not to use the database directly to store the content, but this is the first time I've used a DB and that feels kind of wrong.
What's your experience in this matter?
OK friends, I am revisiting this question for the benefit of those who will read this answer. After a lot of trial and error I have reached the conclusion that keeping text in the database is a lot more convenient and easier to manipulate, so all my data is now in the database. Previously I had some details in the database and the text part in files, but now I have moved everything to the database.
The only problem is that when editing posts, fields like title, tags, or subject are changed on a simple HTML form, but for the main content I have created a text area. However, I just cut and copy it from the text area to my favorite text editor, and after editing, copy and paste it back.
Some benefits that convinced me to put everything in the database are:
EASY SEARCH: you can run queries like MySQL LIKE on your text (especially the main content).
EASY ESCAPING: you can easily run functions on your data to escape special characters and make it suitable for display, etc.
GETTING INPUT FROM THE USER: if you want the user to give you input, it makes sense to save that input in the database, escape it, and manipulate it as and when required.
Operations like moving tables, backups, merging two records, arranging posts with similar content in sequential order, etc. are all easier in a database than in the file system.
In the file system there is always the problem of missing files, inconsistent file names, the wrong file shown for a given title, and so on.
I do not escape user input before adding it to the database, only just before display; this way no permanent changes are stored in the text (I don't know if that's OK or not).
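A minimal sketch of that pattern (store raw, escape only at display time); the table and column names are invented for the example:

```python
# Sketch of "store raw, escape on output". Parameterized queries keep the
# insert safe, and html.escape is applied only when the post is rendered,
# so no permanent changes are stored in the text.
import html
import sqlite3

conn = sqlite3.connect("site.db")
conn.execute("CREATE TABLE IF NOT EXISTS posts (id INTEGER PRIMARY KEY, "
             "title TEXT, content TEXT)")

def save_post(title, content):
    conn.execute("INSERT INTO posts (title, content) VALUES (?, ?)",
                 (title, content))          # raw text goes in unchanged
    conn.commit()

def search_posts(term):
    # "EASY SEARCH": a LIKE query over the main content.
    return conn.execute("SELECT title FROM posts WHERE content LIKE ?",
                        (f"%{term}%",)).fetchall()

def render_post(post_id):
    row = conn.execute("SELECT content FROM posts WHERE id = ?",
                       (post_id,)).fetchone()
    return html.escape(row[0]) if row else ""   # escape only for display
```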
In fact, I am doing something like you. However, I have reached the conclusion explained below (almost the same as in the answer above me). I expect you have made your decision by now, but I will still explain it so that it is useful in the future.
My solution: I have a table called content_table that contains details about each and every article, post, or anything else that I write. The main text portion of the articles/posts is placed in a directory as a .php or .txt file. When a user clicks on an article to read, a view of the article is created dynamically using the information in the database and then pulling the text part (I call it the main content) from the .txt file. The database contains information like _content_id_, creation date, author, and category (most of this becomes meta tags).
The two major benefits are:
performance, since there is less load on the database
editing the text content is easy.
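A rough sketch of that hybrid layout (paths, table, and column names are illustrative only):

```python
# Sketch of the hybrid approach: metadata lives in content_table, the main
# content lives on disk, and the view is assembled at request time.
import sqlite3
from pathlib import Path

CONTENT_DIR = Path("content")            # holds <content_id>.txt files
conn = sqlite3.connect("site.db")
conn.execute("CREATE TABLE IF NOT EXISTS content_table ("
             "content_id INTEGER PRIMARY KEY, title TEXT, author TEXT, "
             "category TEXT, created TEXT)")

def build_article_view(content_id):
    meta = conn.execute(
        "SELECT title, author, category, created FROM content_table "
        "WHERE content_id = ?", (content_id,)).fetchone()
    body = (CONTENT_DIR / f"{content_id}.txt").read_text(encoding="utf-8")
    return {"meta": meta, "main_content": body}
```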
I am giving comments based on my experience:
Except for attachments, you can store things in the DB, because managing content, backup, restore, querying, and searching (especially full-text search) will be easier.
Store attached files in some folder and keep the path in DB tables.
Moreover, if you are willing to implement search inside attachments, you can go for a search engine like Lucene, which is efficient at searching static content.
Keeping attachments in the DB or in the file system comes down to how important the files are.
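As a small sketch of "file in a folder, path in the DB" for attachments (folder layout, table, and column names are placeholders):

```python
# Sketch: write the uploaded attachment to a folder and record only its path.
import sqlite3
import uuid
from pathlib import Path

UPLOAD_DIR = Path("uploads")
UPLOAD_DIR.mkdir(exist_ok=True)
conn = sqlite3.connect("site.db")
conn.execute("CREATE TABLE IF NOT EXISTS attachments ("
             "id INTEGER PRIMARY KEY, post_id INTEGER, "
             "original_name TEXT, path TEXT)")

def save_attachment(post_id, original_name, data: bytes):
    # A generated name avoids collisions and "wrong file for this title" mix-ups.
    stored_name = f"{uuid.uuid4().hex}_{original_name}"
    path = UPLOAD_DIR / stored_name
    path.write_bytes(data)
    conn.execute("INSERT INTO attachments (post_id, original_name, path) "
                 "VALUES (?, ?, ?)", (post_id, original_name, str(path)))
    conn.commit()
    return str(path)
```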

How do I index different sources in Solr?

How do I index text files, web sites, and a database in the same Solr schema? All 3 sources are a requirement and I'm trying to figure out how to do it. I did some examples and they work fine while they're separate from each other; now I need them all in one schema, since the user will be searching across all 3 data sources.
How should I proceed?
You should sketch up a few notes for each of your content sources:
What meta-data is available
How is the information accessed
How do I want to present the information
Once that is done, determine which meta-data you want to make searchable. Some of it might be very specific to just one of the content sources (such as author on web pages, or any given field in a DB row), while others will be present in all sources (such as unique ID, title, text content). Use copy-fields to consolidate fields as needed.
Meta-data will vary greatly from project to project, but yes -- things like update date, filename, and any structured data you can parse out of the text files will surely help you improve relevance. Beyond that, it varies a lot from case to case. Maybe the file paths hint at a (possibly informal) taxonomy you can use as metadata. Maybe filenames contain metadata themselves (such as year, keyword, product names, etc).
Be prepared to use different fields for different sources when displaying results. A source field goes a long way in terms of creating result tiles -- and it might turn out to be your most used facet.
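A small sketch of what "one schema, three sources" can look like when posting documents to Solr's JSON update endpoint; the core name (combined) and the exact field names are assumptions, not part of the answer:

```python
# Sketch: push documents from the three sources into one core with a shared
# set of fields plus a "source" discriminator field.
import requests

SOLR_UPDATE = "http://localhost:8983/solr/combined/update?commit=true"

docs = [
    {"id": "file-001", "source": "textfile", "title": "notes.txt",
     "content": "raw text pulled from the file..."},
    {"id": "web-042",  "source": "website",  "title": "Landing page",
     "content": "crawled page body...", "author": "webmaster"},
    {"id": "db-7",     "source": "database", "title": "DB row title",
     "content": "concatenated row fields..."},
]

requests.post(SOLR_UPDATE, json=docs).raise_for_status()
```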
An alternative (and probably preferred) approach to using copy-fields extensively, is using the DisMax/EDisMax request handlers, to facilitate searching in several fields.
Consider using a mix of copy-fields and (e)dismax. For instance, copy all fields into a catch-all text field (which need not be stored) and include it in searches with a low boost value, while also including highly weighted fields (such as title, headings, keywords, or filename) in the search. There are a lot of parameters to tweak in dismax, but it's definitely worth the effort.
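And a companion sketch of the (e)dismax query side, with heavily weighted fields plus a low-boost catch-all and a facet on source; again, the field and core names are assumptions:

```python
# Sketch of an edismax query across the consolidated fields, faceted by source.
import requests

params = {
    "q": "solar panels",
    "defType": "edismax",
    "qf": "title^10 keywords^5 filename^3 catch_all^0.2",  # assumed field names
    "facet": "true",
    "facet.field": "source",
    "fl": "id,title,source",
    "rows": 10,
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/combined/select", params=params)
results = resp.json()["response"]["docs"]
```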
