How does Wikipedia avoid duplicate entries? - database

How do websites as big as Wikipedia weed out duplicate entries?
I need to know the exact procedure, starting from the moment a user creates the duplicate entry. If you don't know Wikipedia's procedure but know another method, please share it.
----update----
Suppose there is wikipedia.com/horse and somebody later creates wikipedia.com/the_horse. This is a duplicate entry! It should be deleted, or maybe redirected to the original page.

It's a manual process
Basically, sites such as Wikipedia and Stack Overflow rely on their users/editors not to make duplicates, or to merge/remove them when they have been created by accident. There are various features that make this process easier and more reliable:
* Establish good naming conventions ("the horse" is not a well-accepted name, one would naturally choose "horse") so that editors will naturally give the same name to the same subject.
* Make it easy for editors to find similar articles.
* Make it easy to flag articles as duplicates or delete them.
* Put sensible restrictions in place so that vandals can't misuse these features to remove genuine content from your site.
Having said this, you still find a lot of duplicate information on Wikipedia, but the editors clean it up about as quickly as it is added.
It's all about community (update)
Community sites (like Wikipedia or Stack Overflow) develop their procedures over time. Take a look at Wikipedia:About, the Stack Overflow FAQ, or meta.stackoverflow. You can spend weeks reading about all the little (but important) details of how a community builds a site together and how it deals with the problems that arise. Much of this is about rules for your contributors, but as you develop your rules, many of their details will be built into the code of your site.
As a general rule, I would strongly suggest starting a site with a simple system and a small community of contributors who agree on a common goal, are interested in reading the content of your site, like to contribute, and are willing to compromise and to correct problems manually. At this stage it is much more important for your community to have an "identity" and mutual help than to have many visitors or contributors. You will have to spend a lot of time and care dealing with problems as they arise and delegating responsibility to your members. Once the site has a basis and a commonly agreed direction, you can slowly grow your community. If you do it right, you will gain enough supporters to share the additional work amongst the new members. If you don't take enough care, spammers or trolls will take over your site.
Note that Wikipedia grew slowly over many years to its current size. The secret is not "get big" but "keep growing healthily".
Having said that, Stack Overflow seems to have grown at a faster rate than Wikipedia. You may want to consider the different trade-offs that were made here: Stack Overflow is much more restrictive about letting one user change another user's contribution. Bad information is often simply pushed down to the bottom of a page (low ranking). Hence, it will not produce articles like Wikipedia's, but it is easier to keep problems out.

I can add one to Yaakov's list:
* Wikipedia makes sure that after merging the information, "The Horse" points to "Horse", so that the same wrong title can not be used a second time.

EBAGHAKI, responding to your last question in the comments above:
If you're trying to design your own system with these features, the key one is:
Make the namespace itself editable by the community that is identifying duplicates.
In MediaWiki's case, this is done with the special "#REDIRECT" command -- an article created with only "#REDIRECT [[new article title]]" on its first line is treated as a URL redirect.
The rest of the editorial system used in MediaWiki is depressingly simple -- every page is essentially treated as a block of text, with no structure, and with a single-stream revision history that any reader can add a new revision to. Nothing automatic about any of this.
When you try to create a main page, you are shown a long message encouraging you to search for the page title in various ways to see whether an existing page is already there -- many sites have similar processes. Digg is a typical example of one with an aggressive, automated search to try to convince you not to post duplicates -- you have to click through a screen listing potential duplicates and affirm that yours is different, before you are allowed to post.
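The redirect mechanism described above can be sketched in a few lines. This is only an illustration of the idea, not MediaWiki's actual implementation: the `pages` dictionary, `parse_redirect`, and `resolve` are hypothetical names, and the hop limit is an assumption added to guard against redirect loops.

```python
import re

# Hypothetical in-memory page store: title -> page text.
pages = {
    "Horse": "An article about horses.",
    "The Horse": "#REDIRECT [[Horse]]",
}

# Matches a redirect directive at the very start of a page's text.
REDIRECT_RE = re.compile(r"^#REDIRECT\s*\[\[([^\]]+)\]\]", re.IGNORECASE)


def parse_redirect(text):
    """Return the redirect target if text starts with #REDIRECT, else None."""
    match = REDIRECT_RE.match(text)
    return match.group(1).strip() if match else None


def resolve(title, max_hops=5):
    """Follow redirects from title; return the final title, or None."""
    seen = set()
    for _ in range(max_hops):
        if title not in pages or title in seen:
            return None  # missing page or redirect loop
        seen.add(title)
        target = parse_redirect(pages[title])
        if target is None:
            return title  # a normal article
        title = target
    return None  # too many hops (assumed limit)
```

Because the redirect is itself just a page, the community can create, change, or remove it with the same editing tools it uses for articles.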

I assume they have a procedure that removes extraneous words such as 'the' to create a canonical title and, if that title matches an existing page, does not allow the entry.
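That guess could be sketched as follows. This illustrates the idea in this answer only; it is not a description of what Wikipedia does (the other answers describe a manual process). The stop-word list and both function names are assumptions.

```python
import re

# Assumed list of leading words to ignore; a real site would tune this.
LEADING_STOP_WORDS = {"the", "a", "an"}


def canonical_title(title):
    """Lowercase, split on spaces/underscores, and drop leading articles."""
    words = [w for w in re.split(r"[\s_]+", title.strip().lower()) if w]
    # Keep at least one word so a title like "The" does not become empty.
    while len(words) > 1 and words[0] in LEADING_STOP_WORDS:
        words = words[1:]
    return "_".join(words)


def find_duplicate(new_title, existing_titles):
    """Return the existing title that collides with new_title, or None."""
    wanted = canonical_title(new_title)
    for existing in existing_titles:
        if canonical_title(existing) == wanted:
            return existing
    return None
```

A purely automatic rule like this will produce false positives (a band name that starts with "The" is not the same subject as the bare word), so it is probably better used to suggest likely duplicates to the editor than to reject the entry outright.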


Reusing tables for purposes other than the intended

We need to implement some new functionality for some clients. The functionality is essentially an EULA acceptance interface for users. When users open our app, they will be presented with the corresponding EULA (which varies from client to client). The system needs to store different versions of the EULA for the same client, and it also needs to record which users have accepted which version. If a new version is saved, it will be presented to users the next time they log in.
I've written a document suggesting that we add two tables, EULAs and UserAcceptedEULA. That would allow us to store different EULAs and keep track of which ones have been accepted, both current and previous.
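For illustration, here is a minimal sketch of what those two tables could look like, using Python's built-in `sqlite3` module. Only the table names EULAs and UserAcceptedEULA come from the proposal; every column name, the integer `Version` scheme, and the `pending_eula` helper are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EULAs (
    EulaID    INTEGER PRIMARY KEY,
    ClientID  INTEGER NOT NULL,
    Version   INTEGER NOT NULL,          -- assumed: increasing per client
    Body      TEXT NOT NULL,
    CreatedAt TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (ClientID, Version)
);
CREATE TABLE UserAcceptedEULA (
    UserID     INTEGER NOT NULL,
    EulaID     INTEGER NOT NULL REFERENCES EULAs(EulaID),
    AcceptedAt TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (UserID, EulaID)
);
""")


def pending_eula(conn, user_id, client_id):
    """Return (EulaID, Body) of the client's newest EULA, or None if the
    user has already accepted it or the client has no EULA."""
    latest = conn.execute(
        "SELECT EulaID, Body FROM EULAs WHERE ClientID = ? "
        "ORDER BY Version DESC LIMIT 1",
        (client_id,),
    ).fetchone()
    if latest is None:
        return None
    accepted = conn.execute(
        "SELECT 1 FROM UserAcceptedEULA WHERE UserID = ? AND EulaID = ?",
        (user_id, latest[0]),
    ).fetchone()
    return None if accepted else latest
```

Old acceptances stay in UserAcceptedEULA when a new version is saved, so the history of who accepted what is preserved.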
My problem is with how some people at the company want to do the implementation. They suggest using a table ConstantGroups (which contains ConstantGroupID, Timestamp, ClientID and Name) that we use for grouping constants, whose values are stored in another table. For example, a ConstantGroup would be Quality, and its values would be High, Medium, Low.
To me this is a horrible, incredibly wrong way to do it. They're suggesting it because we already have an endpoint where you pass the ClientID and get back a string, so it "does what we need".
I wrote the document explaining the whole solution, covering DB changes, the APIs needed and UI modifications, but they still don't want to accept it because they think their way will save us time.
How do I make them understand how horribly wrong they are?
This depends somewhat on your assumptions about "good" design.
Many software folk have adopted the SOLID principles as being "good" (I am one of them). While the original thinking is about object-oriented design, I think many of them apply to databases too.
The first of those principles is "Single responsibility". A table should do one thing, and one thing only. Your colleagues are trying to get a single entity to manage different concepts; the ConstantGroups table would suddenly be responsible for both "constants" and "EULA acceptance".
That leads to "Open for extension, closed for modification": if you later need to change the implementation of either "constants" or "EULAs", you first have to untangle it from the other. So any time you (might) save now will cost you later.
The second principle I like (especially in database design) is the Principle of Least Astonishment. Imagine a new developer joining the team and having to figure out how EULAs work. They would naturally look for some kind of "EULA" and "Acceptance" tables, and would be astonished to learn that this is actually managed in a thing called "constants". Any time you save now will be paid back, with interest, when onboarding new people (or indeed when reminding yourself in two years' time, when you have to fix a bug).

Database Structure (zero-to-one-to-many relationships)

I am currently working on creating a database for a community partnership program for educational purposes. The structure of the DB should be simple but, as the title suggests, the data tends to overlap in various ways. There are four main categories: Internships, Jobs, Summer/Yearly Programs, and Other, followed by an address book/contacts list.
This is the part where the data is difficult to structure. The employer relates to the "employment posting", which in turn relates to the school's academic departments (there are 6), but some employers require more than one department. This data is then followed by: how many openings, posting date, follow-up contact date, whether a student was hired (and if so, the student evaluation), and notes.
I'm not asking how to create the DB, but how I should organize and structure such a complex data collection. I have managed DBs (entering information) and I know how to build one from scratch as needed, but now I have been tasked with structuring something like this.
Here is an image of the information I need to collect (more or less).
If you are stuck at the "How do I get started?" stage, I suggest that you start at a very high level (the conceptual data model), then refine only a bit to the logical data model, then to physical data model. Here is a short explanation of the 3 different kinds of data model. (Don't worry that it appears to be about data warehouses - these bits aren't specific to data warehouses.)
For a bit more detail, there is another article on data modeling - again, don't worry that it appears in the context of Agile - this is generally useful stuff even if you're not using Agile.
Another two things that might help are these questions (in this order):
What questions do I need the database to answer?
What information does it need to provide a home for? (Why? If it's not covered by part of the answer to the first question, challenge why it's needed.)
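As a rough illustration of where the conceptual and logical steps might end up, here is one possible physical model for part of the data described in the question, sketched with Python's `sqlite3` module. All table and column names are assumptions, and only some of the listed fields are covered. The junction table is what lets a posting relate to zero, one, or many departments.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE employer (
    employer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE department (
    department_id INTEGER PRIMARY KEY,
    name          TEXT NOT NULL UNIQUE
);
CREATE TABLE posting (
    posting_id   INTEGER PRIMARY KEY,
    employer_id  INTEGER NOT NULL REFERENCES employer(employer_id),
    category     TEXT NOT NULL CHECK (category IN
                   ('Internship', 'Job', 'Summer/Yearly Program', 'Other')),
    openings     INTEGER,
    posted_on    TEXT,
    follow_up_on TEXT,
    notes        TEXT
);
-- Junction table: one row per (posting, department) pair.
CREATE TABLE posting_department (
    posting_id    INTEGER NOT NULL REFERENCES posting(posting_id),
    department_id INTEGER NOT NULL REFERENCES department(department_id),
    PRIMARY KEY (posting_id, department_id)
);
""")


def departments_for(conn, posting_id):
    """Return the names of all departments linked to a posting, sorted."""
    rows = conn.execute(
        "SELECT d.name FROM department d "
        "JOIN posting_department pd ON pd.department_id = d.department_id "
        "WHERE pd.posting_id = ? ORDER BY d.name",
        (posting_id,),
    )
    return [name for (name,) in rows]
```

Each query like `departments_for` corresponds to one of the "questions the database needs to answer"; writing those out first is a quick check on whether the model is sufficient.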

FAQ section - Algorithm for showing

I'm writing a website with an FAQ section. I've been fiddling a bit with how the different questions are scored, in order to display the questions that most people wonder about at the top.
At the moment, I have a counter that is incremented each time someone clicks a question, and the list of questions is sorted by that counter in descending order.
My question is: is there a better way to score these FAQ questions? For example, a combination of counter and rating, or something like that?
The rating should play no role in the FAQ algorithm. The rating is feedback for you: if a question/answer pair is repeatedly rated poorly, you should improve its quality.
The FAQ questions should be the ones that your visitors ask most often, so a simple hit counter is a good place to start. You should also consider ways to include questions that you often receive via phone or email or other feedback methods.
A lot of sites rely on manual maintenance of the FAQ, but I prefer an automated or partially automated approach for several reasons:
* Staff is not always involved in the question/answer process, especially if you have a robust set of questions/answers online.
* Manually-maintained lists become tedious and therefore receive less attention than they should.
* Automated metrics often reveal customer behavior that isn't noticed or expected by staff.
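A minimal sketch of the hit-counter approach, extended so that questions arriving by phone or email feed the same ranking. The class and method names are hypothetical, and weighting an external question the same as a click is an assumption you may want to change.

```python
from collections import Counter


class FaqRanker:
    """Rank FAQ entries by how often they are asked, across all channels."""

    def __init__(self):
        self.hits = Counter()

    def record_view(self, question_id):
        """Count one click on a question on the website."""
        self.hits[question_id] += 1

    def record_external(self, question_id, count=1):
        """Count questions received by phone, email, or other channels."""
        self.hits[question_id] += count

    def top(self, n=10):
        """Return up to n question ids, most asked first."""
        return [qid for qid, _ in self.hits.most_common(n)]
```

Ratings are deliberately left out of the score, in line with the point above that they are feedback on answer quality rather than a measure of how often a question is asked.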
What if you let the support staff arrange the FAQs? They are the ones who get the extra phone calls, so they know the most popular questions.
So I would prefer the FAQs to be arranged manually by the support staff.

Game Design and Architecture Advice for Text Adventures

I am trying to create an old-school Text Adventure Game. I'm a bit stuck on creating the World Map and rooms.
Should the room descriptions be part of the source code or should it be separated out? I was thinking of placing all such descriptions and room properties in a MySQL database and then have code to organize the logic of each room; putting each room description in with the actual source code seems a bit untidy.
Is this the preferred method of organising Descriptions in an adventure game? I was also thinking that this might be preferable since I could then query the database to find common properties about the data.
Any comments would be appreciated.
No, don't include level/room descriptions in the code; that way they are not dynamic.
Many development frameworks now favour separating code from data. So, in the usual case, we put the room data in files and read those to build the level. This may also let users construct new levels of their own by creating new files that carry the room data.
I work at a company that builds games, and they keep the rooms separate from the code, in MySQL. The items that go in each room are also in a table, and there is a further table that records which item is in which room at that moment.
Besides, if you want to expand your game or gather statistics about it, that is much easier with a database.
I will address two issues here. First, you are right to keep the data that defines the game away from the engine that uses it. That way you don't have to recompile everything to fix a typo or the like in a text-based game.
Second, though, I would question the use of MySQL. If you are making a DOS-style game that is to be installed on people's systems, you don't want "Install MySQL" to be a prerequisite, hehe. There is a small, free library written in C called SQLite that would suit your needs much better. If, on the other hand, the web is the medium for releasing this text-based game, then have at it :)
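To give an idea of how little setup SQLite needs, here is a minimal sketch using Python's built-in `sqlite3` module. It follows the three-table layout mentioned in an earlier answer (rooms, items, and where each item currently is); all table, column, and function names, and the sample data, are assumptions.

```python
import sqlite3

# ":memory:" keeps the sketch self-contained; use a file path to persist.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE room (
    room_id     INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    description TEXT NOT NULL
);
CREATE TABLE item (
    item_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
-- Which item is in which room right now; one row per item.
CREATE TABLE item_location (
    item_id INTEGER PRIMARY KEY REFERENCES item(item_id),
    room_id INTEGER NOT NULL REFERENCES room(room_id)
);
INSERT INTO room VALUES (1, 'Cellar', 'A damp cellar. Stairs lead up.');
INSERT INTO item VALUES (1, 'lantern');
INSERT INTO item_location VALUES (1, 1);
""")


def describe(conn, room_id):
    """Return the room description followed by the items currently in it."""
    row = conn.execute(
        "SELECT description FROM room WHERE room_id = ?", (room_id,)
    ).fetchone()
    if row is None:
        return "There is nothing here."
    description = row[0]
    items = [
        name
        for (name,) in conn.execute(
            "SELECT i.name FROM item i "
            "JOIN item_location l ON l.item_id = i.item_id "
            "WHERE l.room_id = ? ORDER BY i.name",
            (room_id,),
        )
    ]
    if items:
        description += " You see: " + ", ".join(items) + "."
    return description
```

Fixing a typo in a room description is then an `UPDATE` on the data file, with no change to the game code.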
You could just use a system like ADRIFT, then all you need to worry about are the descriptions and logic.
"Should the room descriptions be part of the source code or should it be separated out?"
Separated out.
Try the Prolog language.
Its fact database is similar to an SQL database (it actually consists of logical predicates).
With some skill you may be able to check, after each change, whether your adventure can still be finished.
You can easily write the room descriptions as logical predicates, if you don't mind them being very "computer-like".
You can find examples of Prolog text adventures with a simple Google search.
I suggest using engines that already have a vibrant community around them. That way, your source code is only that; the source code of the game. I'd go with either TADS 3 or Inform 7
I would construct such a game as an interpreter which reads in room data, and based on the room data, allows for a set of valid commands (move, take, drop, change...). For movement you would have a pre-built graph with nodes being rooms and edges being allowed moves.
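A minimal sketch of that interpreter idea: the world is a graph held as plain data, and a single `step` function applies one command to the game state. The room names, item placement, command wording, and state layout are all hypothetical.

```python
# World graph as data: room -> {direction: neighbouring room}.
EXITS = {
    "cellar": {"up": "kitchen"},
    "kitchen": {"down": "cellar", "east": "garden"},
    "garden": {"west": "kitchen"},
}

# Items lying in each room at the start of the game.
ITEMS = {"cellar": {"lantern"}, "kitchen": set(), "garden": set()}


def step(state, command):
    """Apply one command to state ({'room': str, 'inventory': set})
    and return the message to show the player."""
    words = command.lower().split()
    if not words:
        return "Say something."
    verb, args = words[0], words[1:]
    room = state["room"]
    if verb in ("go", "move") and args:
        target = EXITS[room].get(args[0])
        if target is None:
            return "You can't go that way."
        state["room"] = target
        return f"You are in the {target}."
    if verb == "take" and args:
        if args[0] not in ITEMS[room]:
            return "You don't see that here."
        ITEMS[room].remove(args[0])
        state["inventory"].add(args[0])
        return f"Taken: {args[0]}."
    if verb == "drop" and args:
        if args[0] not in state["inventory"]:
            return "You aren't carrying that."
        state["inventory"].remove(args[0])
        ITEMS[room].add(args[0])
        return f"Dropped: {args[0]}."
    return "I don't understand."
```

Because `EXITS` and `ITEMS` are plain data, they could just as well be loaded from a file or a database instead of being written in the source.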
I would separate the descriptions from the code, having an object Room, that owns an object Description that calls a "database" through some Facade, so that you may use a file or a database or anything you wish. It would also eventually allow you to add some scripting to the room itself, like having objects in your description that have behaviors.
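One way to read that design, as a sketch: `Room` owns a `Description`, and `Description` only ever talks to an abstract store, so the backing storage can be swapped without touching the game logic. All class names are assumptions, and the two concrete stores (a dictionary and a JSON file) merely stand in for "a file or a database or anything you wish".

```python
import json
from abc import ABC, abstractmethod


class DescriptionStore(ABC):
    """Facade: the only storage interface the game logic sees."""

    @abstractmethod
    def load(self, room_id):
        """Return the description text for room_id."""


class DictStore(DescriptionStore):
    """Keeps descriptions in memory; handy for tests."""

    def __init__(self, data):
        self.data = data

    def load(self, room_id):
        return self.data[room_id]


class JsonFileStore(DescriptionStore):
    """Reads descriptions from a JSON file mapping room ids to text."""

    def __init__(self, path):
        with open(path, encoding="utf-8") as handle:
            self.data = json.load(handle)

    def load(self, room_id):
        return self.data[room_id]


class Description:
    """Fetches its text through the store on demand."""

    def __init__(self, room_id, store):
        self.room_id = room_id
        self.store = store

    def text(self):
        return self.store.load(self.room_id)


class Room:
    def __init__(self, room_id, store):
        self.room_id = room_id
        self.description = Description(room_id, store)

    def describe(self):
        return self.description.text()
```

Switching from `DictStore` to `JsonFileStore`, or to a database-backed store, changes only the object passed to `Room`.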

Is it highly necessary to record the registration date of new website users?

What are the advantages and disadvantages?
That depends on what your site is, and how you use that information. On StackOverflow, you are awarded a "yearling" badge once a full year elapses from the time you registered. Clearly here that information is necessary.
If I were you, I'd save it. It's a small piece of information that may become useful eventually. It's better to have it and not need it than to need it and not have it. It would be rather difficult to extrapolate an accurate registration date retrospectively if you don't store it to begin with.
Advantage:
You avoid a migration horror when you need it at some point. For existing records you often cannot reconstruct this value afterwards. You could fake it with MODIFICATION_DATE, but that is often inaccurate and later than the real registration date (e.g. when the profile can be edited by the user).
Disadvantage:
If you never need this information, you have wasted space (though one more small column shouldn't be a problem). Furthermore, you have a permanently unused field, which can confuse new developers ("what is this column for, I cannot see where it is used...?").
As mentioned, the registration date is most likely valuable information, so I would add it from the start. When designing persistent data and its model, you sometimes have to think further ahead.
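To show how small the cost is, here is a minimal sketch using Python's `sqlite3` module: one column with a default value, filled in automatically on insert. The table, column, and function names are assumptions; the one-year check is only an example of a feature that needs the stored date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE users (
    user_id       INTEGER PRIMARY KEY,
    username      TEXT NOT NULL UNIQUE,
    -- Filled in by SQLite (UTC) when the row is inserted.
    registered_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
)
""")
conn.execute("INSERT INTO users (username) VALUES (?)", ("alice",))


def registered_over_a_year_ago(conn, username):
    """Return True if the user registered at least one year ago."""
    row = conn.execute(
        "SELECT registered_at <= datetime('now', '-1 year') "
        "FROM users WHERE username = ?",
        (username,),
    ).fetchone()
    return bool(row[0]) if row else False
```

The application code that creates users does not need to know the column exists, which keeps the "unused field" cost low until the date is actually needed.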
