representing play in relational db - database

I run a project that deals with plays, mostly Shakespeare. Right now, for storage, we just parse them out into a JSON array formatted like this:
[0: {"title": "Hamlet", "author": ["Shakespeare", "William"], "noActs": 5}, | info
1: [null, ["Scene 1 Setting", "Stage direction", ["character", ["char line 1", "2"]] | act 1
...]
and save them to a file.
We'd like to now move to a relational database for a variety of reasons (foremost currently search) but have no idea how to represent things.
I'm looking for an outline of the best way to do things?

the topic is 'normalization'
begin by identifying your main classifications, such as PLAY, AUTHOR, ACT, SCENE
then add attributes - specifically, a primary key like play_id, act_id, etc.
then add still more attributes like NAME or other identifying information.
then finally, add some relationships between these object bby creating more tables like PLAY_AUTHOR which includes PLAY_ID and AUTHOR_ID.

Not much of a theater goer myself, but this may give you some ideas should you choose a relational DB.

Related

Neo4j or MongoDB for relative attributes/relations

I wish to build a database of objects with various types of relations between them. It will be easier for me to explain by giving an example.
I wish to have a set of objects, each is described by a unique name and a set of attributes (say, height, weight, colour, etc.), but instead of values, these attributes may contain values which are relative to other objects. For example, I might have two objects, A and B, where A has height 1 and weight "weight of B + 2", and B has height "height of A + 3" and weight 4.
Some objects may have completely other attributes; for example, object C may represent a box, and objects A and B will be related to C by the relations "I appear x times in C".
Queries may include "what is the height of A/B" or what is the total weight of objects appearing in C with multiplicities.
I am a bit familiar with MongoDB, and fond of its simplicity. I heard of Neo4j, but never tried working with it. From its description, it sounds more suitable for my need (but I can't tell it is capable of the task). But is MongoDB, with its simplicity, suitable as well? Or perhaps a different database engine?
I am not sure it matters, but I plan to use python as the engine which processes the queries and their outputs.
Either can do this. I tend to prefer neo4j, but either way could work.
In neo4j, you'd create a graph consisting of a node (A) and its "base" (B). You could then connect them like this:
(A:Node { weight: "base+2" })-[:base]->(B:Node { weight: 2 })
Note that modeling in this way would make it possible to change the base relationship to point to another node without changing anything about A. The downside is that you'd need a mini calculator to expand expressions like "base+2", which is easy but in any case extra work.
Interpreting your question another way, you're in a situation where you'd probably want a trigger. Here's an article on neo4j triggers, how graphs handle this. If parsing that expression "base+2" at read time isn't what you want, and you want to actually set the value on A to be b.weight + 2, then you want a trigger. This would let you define some other function to be run when the graph gets updated in a certain way. In this case, when someone inserts a new :base relationship in the graph, you might check the base value (endpoint of the relationship) and add 2 to its weight, and set that new property value on the source of the relationship.
Yes, you can use either DBMS.
To help you decide, this is an example of how to support your uses cases in neo4j.
To create your sample data:
CREATE
(a:Foo {id: 'A'}), (b:Foo {id: 'B'}), (c:Box {id: 123}),
(h1:Height {value: 1}), (w4:Weight {value: 4}),
(a)-[:HAS_HEIGHT {factor: 1, offset: 0}]->(h1),
(a)-[:HAS_WEIGHT {factor: 1, offset: 2}]->(w4),
(b)-[:HAS_WEIGHT {factor: 1, offset: 0}]->(w4),
(b)-[:HAS_HEIGHT {factor: 1, offset: 3}]->(h1),
(c)-[:CONTAINS {count: 5}]->(a),
(c)-[:CONTAINS {count: 2}]->(b);
"A" and "B" are represented by Foo nodes, and "C" by a Box node. Since a given height or weight can be referenced by multiple nodes, this example data model uses shared Weight and Height nodes. The HAS_HEIGHT and HAS_WEIGHT relationships have factor and offset properties to allow adjustment of the height or weight for a particular Foo node.
To query "What is the height of A":
MATCH (:Foo {id: 'A'})-[ra:HAS_HEIGHT]->(ha:Height)
RETURN ra.factor * ha.value + ra.offset AS height;
To query "What is the ratio of the heights of A and B":
MATCH
(:Foo {id: 'A'})-[ra:HAS_HEIGHT]->(ha:Height),
(:Foo {id: 'B'})-[rb:HAS_HEIGHT]->(hb:Height)
RETURN
TOFLOAT(ra.factor * ha.value + ra.offset) /
(rb.factor * hb.value + rb.offset) AS ratio;
Note: TOFLOAT() is used above to make sure integer division, which would truncate, is never used.
To query "What is the total weight of objects appearing in C":
MATCH (:Box {id: 123})-[c:CONTAINS]->()-[rx:HAS_WEIGHT]->(wx:Weight)
RETURN SUM(c.count * (rx.factor * wx.value + rx.offset));
I have not used Mongo and decided not to after studying it. So filter my opinion with that in mind; users may find my comments easy to overcome. Mongo is not a true graph database. The user must create and manage the relationships. In Neo4j, relationships are "native" and robust.
There is a head to head comparison at this site:
[https://db-engines.com/en/system/MongoDB%3bNeo4j]
see also: https://stackoverflow.com/questions/10688745/database-for-efficient-large-scale-graph-traversal
There is a distinction between NoSQL (e.g., Mongo) and a true graph database. Many seem to assume that if it is not SQL then it's a graph database. This is not true. Most NoSQL data bases do not store relationships. The free book describes this in chapter 2.
Personally, I'm sold on Neo4j. It makes relationships, transerving graphs and collecting lists along the path easy and powerful.

Is there a name for "tables generations" in relational modeling?

I mean, when I'm designing DBs, I find tables with no parents, let say "free tables", and others that are their childs, and childs of the childs...
Let me try to be more formal:
- Let say -> means "references to..."
- Let say [tableName] is a table
Consider this:
[cell]->[organism]->[community]->[region]->[planet]
According with my first approach, [planet] is a "free table", but [cell] has a certain amount of "references to" in secuence, 4 acctually. [planet] has 0, [community] has 2.
I guess it is kind of a grade... Here born my question, Is there a formal name for that number?
Thanks and sorry for my poor English.
There is no name for that number, as many tables reference multiple tables and so have multiple numbers. From a system I'm currently working on:
StructureElement -> Generator ->PlugIn -> PlugInType
PlugInType is only a sort of description identifying what sort of work the PlugIn's of that type do.
StructureElement -> Calendar -> CalendarType
What calendar it's using.
Also, you get hierarchical relationships. A StructureElement can be part of another StructureElement, so you have
StructureElement -> StructureElement
That could conceptually go on for infinity, but my software limits it to 10 levels. What is the number of StructureElement?
Cheers -

How do I think in "IndexedDB"?

Let's say I have 3 entries in my store
{
category: "Science",
author: "Charles Darwin",
content: "Lorem ipsum dolor..."
}
{
category: "Science",
author: "Albert Einstein",
content: "sit amet..."
}
{
category: "Philosophy",
author: "Albert Einstein",
content: "consectetur adipisicing elit..."
}
Is it possible to get all the entries "where category = 'Science'", without creating an index on category?
Is it possible to get all the entries "where category = 'Science' AND author = 'Albert Einstein'"?
Is it possible to get all the entries that contain the word "Lorem" is the content field?
Or should I use a different database for these kinds of queries
Is it possible to get all the entries "where category = 'Science'", without creating an index on category?
Yes: just iterate through the entire object store and manually look for objects with that category. Even if you write some convenience function to do this easily, it'll still be pretty slow compared to using an index. So in practice, you would want to use an index there.
Is it possible to get all the entries "where category = 'Science' AND author = 'Albert Einstein'"?
Yes, you can use a "compound index" as is described in this question here. However, the big caveat is that it's not supported in IE10. If that is a problem for your application, then you can only index on individual fields. For any other constraints besides one indexed field, you'll have to iterate through all the results and manually compare. There are various libraries built on top of IndexedDB that aim to make this easier, but I haven't used any of them so I can't help you there. Either way, it's going to be pretty slow if you can't use compound indexes.
Is it possible to get all the entries that contain the word "Lorem" is the content field?
You might be noticing a pattern here... Yes: just iterate through the entire object store and manually look for objects with "Lorem" in the content field. There is no special support built in for full text searching. Theoretically one could imagine a full text search API built on top of IndexedDB (just keep track of all the words), but I'm not sure if anyone has built that yet (and if they have built it, I'm not sure how performance would be).
Or should I use a different database for these kinds of queries
If you need to do a lot of queries like this (or, God forbid, even more complex queries) and you have the option of using a different database, then use a different database. But you might not have that option, depending on what you're trying to accomplish.

Best way to store user-submitted item names (and their synonyms)

Consider an e-commerce application with multiple stores. Each store owner can edit the item catalog of his store.
My current database schema is as follows:
item_names: id | name | description | picture | common(BOOL)
items: id | item_name_id | picture | price | description | picture
item_synonyms: id | item_name_id | name | error(BOOL)
Notes: error indicates a wrong spelling (eg. "Ericson"). description and picture of the item_names table are "globals" that can optionally be overridden by "local" description and picture fields of the items table (in case the store owner wants to supply a different picture for an item). common helps separate unique item names ("Jimmy Joe's Cheese Pizza" from "Cheese Pizza")
I think the bright side of this schema is:
Optimized searching & Handling Synonyms: I can query the item_names & item_synonyms tables using name LIKE %QUERY% and obtain the list of item_name_ids that need to be joined with the items table. (Examples of synonyms: "Sony Ericsson", "Sony Ericson", "X10", "X 10")
Autocompletion: Again, a simple query to the item_names table. I can avoid the usage of DISTINCT and it minimizes number of variations ("Sony Ericsson Xperia™ X10", "Sony Ericsson - Xperia X10", "Xperia X10, Sony Ericsson")
The down side would be:
Overhead: When inserting an item, I query item_names to see if this name already exists. If not, I create a new entry. When deleting an item, I count the number of entries with the same name. If this is the only item with that name, I delete the entry from the item_names table (just to keep things clean; accounts for possible erroneous submissions). And updating is the combination of both.
Weird Item Names: Store owners sometimes use sentences like "Harry Potter 1, 2 Books + CDs + Magic Hat". There's something off about having so much overhead to accommodate cases like this. This would perhaps be the prime reason I'm tempted to go for a schema like this:
items: id | name | picture | price | description | picture
(... with item_names and item_synonyms as utility tables that I could query)
Is there a better schema you would suggested?
Should item names be normalized for autocomplete? Is this probably what Facebook does for "School", "City" entries?
Is the first schema or the second better/optimal for search?
Thanks in advance!
References: (1) Is normalizing a person's name going too far?, (2) Avoiding DISTINCT
EDIT: In the event of 2 items being entered with similar names, an Admin who sees this simply clicks "Make Synonym" which will convert one of the names into the synonym of the other. I don't require a way to automatically detect if an entered name is the synonym of the other. I'm hoping the autocomplete will take care of 95% of such cases. As the table set increases in size, the need to "Make Synonym" will decrease. Hope that clears the confusion.
UPDATE: To those who would like to know what I went ahead with... I've gone with the second schema but removed the item_names and item_synonyms tables in hopes that Solr will provide me with the ability to perform all the remaining tasks I need:
items: id | name | picture | price | description | picture
Thanks everyone for the help!
The requirements you state in your comment ("Optimized searching", "Handling Synonyms" and "Autocomplete") are not things that are generally associated with an RDBMS. It sounds like what you're trying to solve is a searching problem, not a data storage and normalization problem. You might want to start looking at some search architectures like Solr
Excerpted from the solr feature list:
Faceted Searching based on unique field values, explicit queries, or date ranges
Spelling suggestions for user queries
More Like This suggestions for given document
Auto-suggest functionality
Performance Optimizations
If there were more attributes exposed for mapping, I would suggest using a fast search index system. No need to set aliases up as the records are added, the attributes simply get indexed and each search issued returns matches with a relevance score. Take the top X% as valid matches and display those.
Creating and storing aliases seems like a brute-force, labor intensive approach that probably won't be able to adjust to the needs of your users.
Just an idea.
One thing that comes to my mind is sorting the characters in the name and synonym throwing away all white space. This is similar to the solution of finding all anagrams for a word. The end result is ability to quickly find similar entries. As you pointed out, all synonyms should converge into one single term, or name. The search is performed against synonyms using again sorted input string.

Database of common name aliases / nicknames of people

I'm involved with a SQL / .NET project that will be searching through a list of names. I'm looking for a way to return some results on similar first names of people. If searching for "Tom" the results would include Thom, Thomas, etc. It is not important whether this be a file or a web service. Example Design:
Table "Names" has Name and NameID
Table "Nicknames" has Nickname, NicknameID and NameID
Example output:
You searched for "John Smith"
You show results Jon Smith, Jonathan Smith, Johnny Smith, ...
Are there any databases out there (public or paid) suited to this type of task to populate a relationship between nicknames and names?
I'm adding another source for anyone who comes across this question via Google. This project provides a very good lookup for this purpose.
https://github.com/carltonnorthern/nickname-and-diminutive-names-lookup
It's somewhat simpler and less complete than pdNickName but on the other hand it's free and easy to use.
A google search on "Database of Nicknames" turned up pdNickName (for pay).
In addition, I think you only need a single table for this job, not two, with NameID, Name, and MasterNameID. All the nicknames go into the Name column. One name is considered the "canonical" one. All the nickname records use the MasterNameID column to point back to that record, with the canonical name pointing to itself.
Your two table schema contains no additional information and, depending on how you fill in the nickname table, you might need extra code to handle the canonical cases.
I just found this site.
It looks like you could script it pretty easily.
http://www.behindthename.com/php/extra.php?terms=steve&extra=r&gender=m
I just wish I could auto narrow this to english..
Another commercial name matching database is: http://www.basistech.com/name-indexer/
It looks quite professional (though potentially expensive).
They claim to support the following languages:
Arabic, Chinese (Simplified), Chinese (Traditional), Persian (Farsi / Dari), English, Japanese, Korean, Pashto, Russian, Urdu
Here is a github repo with csv of related names, and you can contribute back:
The first few lines show the format:
aaron,ron
abel,abe
abednego,bedney
abijah,ab,bige
abigail,ab,abbie,abby,gail
abner,ab,abbie,abby
abraham,abe,abram,bram
absalom,ab,abbie,app
There is a database out there called pdNicknames (found at http://www.peacockdata2.com/products/pdnickname/). It contains everything you need, at a cost of $500.
Similar format as Stan James's csv, but folded two ways for lookups:
Name to nickname: https://github.com/MrCsabaToth/SOEMPI/blob/master/openempi/conf/name_to_nick.csv
Nickname to name: https://github.com/MrCsabaToth/SOEMPI/blob/master/openempi/conf/nick_to_name.csv
To select similar sounding name use: (see MSDN)
SELECT SOUNDEX ('Tom')
This is a good choice: https://github.com/onyxrev/common_nickname_csv
id, name, nickname
1, Aaron, Erin
2, Aaron, Ron
3, Aaron, Ronnie
4, Abel, Ab
5, Abel, Abe
6, Abel, Eb
7, Abel, Ebbie
8, Abiel, Ab
9, Abigail, Abby
10, Abigail, Gail

Resources