Database design: Splitting a blog entry into multiple pages - database

What is the best database strategy for paginating a blog entry or page content where some entries may be a single page and some may span multiple pages? Note: The content would be article-like rather than a list of items.
The method that I'm currently considering is storing all of the content in a single text field and using a page separator like {pagebreak}. Upon retrieval, the content would be split into an array by the page separator and then the page would display the appropriate index. Is this the best way to go about it, or is there a better approach?

I think your current idea would be the best option. Makes it a lot easier to move the page breaks if you ever want to, or to put them in when you originally compose the article. Also allows you to have a print page option, where the entire article is in 1 field.
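The split-on-retrieval idea can be sketched roughly as follows. This is a minimal illustration, assuming the {pagebreak} separator from the question; the function names are made up for the example.

```python
PAGE_BREAK = "{pagebreak}"  # separator proposed in the question


def split_pages(content: str) -> list[str]:
    """Split the stored article text into pages, dropping empty pages."""
    pages = [part.strip() for part in content.split(PAGE_BREAK)]
    return [page for page in pages if page]


def get_page(content: str, page_number: int) -> str:
    """Return the 1-based page, clamped to the valid range."""
    pages = split_pages(content)
    if not pages:
        return ""
    index = min(max(page_number, 1), len(pages)) - 1
    return pages[index]


article = "Intro text.{pagebreak}Second page.{pagebreak}Conclusion."
print(get_page(article, 2))
```

A "print page" view would simply skip the split and show the whole field with the separators removed.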

The easy way (for now, but you'll pay later) is to store the entire article in one text field, but you give up some display control because you might need to put some HTML in that text. If you put HTML in the text, you'll have a lot of data to fix if you ever change your web page's look and feel. This may not be an issue.
As a rule I try not to ever put html into the database. You might be better off using XML to define your article, and store that in one text field, so your application can properly render the contents in a dynamic way. You could store page breaks in the XML, or let the app read in the entire article and split it up dynamically based on your current look/feel.
You can use my "poor man's CMS" schema (below) if you don't want to use XML. It will give you more control over the formatting than the "all text in one field" method.
This is just a wild guess based on your question:
tables:
Articles
--------
ArticleID int --primary key
ArticleStatus char(1) --"A"ctive, "P"ending review, "D"eleted, etc..
ArticleAuthor varchar(100) --or int FK to a "people" table
DateWritten datetime
DateToDisplay datetime
etc...
ArticleContent
--------------
ArticleID int --primary key
Location int --primary key, will be the order to display the article content, 1,2,3,4
ContentType char(1) --"T"ext, "I"mage, "L"ink, "P"age break
ArticleContentText
------------------
ArticleID int --primary key
Location int --primary key
FormatStyle char(1) --"X"extra large, "N"ormal, "C"ode fragment
ArticleText text
ArticleContentImage
-------------------
ArticleID int --primary key
Location int --primary key
ArticleImagePath varchar(200)
ArticleImageName varchar(200)
You can still put the entire article in one field, but you can split it up if it contains different types of "things".
If you have an article about PHP code with examples, the "one field method" would force you to put HTML in the text to format the code examples. With this model, you store what is what, and let the application display it properly. You can add and expand different types, put page breaks in, or remove them. You can store your content in multiple "ArticleContentText" rows, each representing a page, or include "ArticleContent" rows that specify page breaks. You could also let the app read the entire article and then display only what it wants.
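As a rough sketch of how an application might turn those rows into pages: the ContentRow class and paginate function below are hypothetical names, and the rows are assumed to have already been loaded from the ArticleContent table.

```python
from dataclasses import dataclass


@dataclass
class ContentRow:
    location: int       # display order (ArticleContent.Location)
    content_type: str   # "T"ext, "I"mage, "L"ink, "P"age break
    payload: str = ""   # text or image path, depending on the type


def paginate(rows):
    """Group content rows into pages, starting a new page at each "P" row."""
    pages = [[]]
    for row in sorted(rows, key=lambda r: r.location):
        if row.content_type == "P":
            if pages[-1]:           # ignore repeated or leading page breaks
                pages.append([])
        else:
            pages[-1].append(row)
    if not pages[-1] and len(pages) > 1:
        pages.pop()                 # drop the empty page after a trailing break
    return pages


rows = [
    ContentRow(3, "P"),
    ContentRow(1, "T", "Intro"),
    ContentRow(2, "I", "diagram.png"),
    ContentRow(4, "T", "Details"),
]
pages = paginate(rows)
```

Because the page breaks are ordinary rows, moving one only means changing a Location value.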

I think the correct approach is what you've mentioned: the entry should be stored in the database as a single entry, and then you can use markup / the UI layer to determine where pagebreaks or other formatting should occur.
The database design shouldn't be influenced by the UI concepts - because you might decide to change how they are displayed down the road, and you need your database to be consistent.

You're much better off leaving formatting like this on the client side. Let the database hold your data and your application present it to the user in the correct format.

It seems to me like a good solution. This way you will have your article as one piece and have the possibility to paginate it when necessary.

Related

Best way to store comments with mentions (#FirstName) in database

Was wondering what is the best way to store comments in a database (sql) that allows mentioning of other users by a non-unique natural name?
E.g. This is a comment for #John.
The client application would also need to detect and link to corresponding user profile if his/her name was clicked.
My initial thought was to replace the user's first name with the id and some metadata and store that in the DB: This is a comment for <John_51/> where 51 is the id of that user. Clients can then parse that and display the appropriate user name and profile links.
Is this a good approach?
Some background:
What I would like to achieve is similar to facebook posts where it allows you to 'tag' a user by just mentioning their name (not the unique username) in a post. It doesn't have to be as complex as facebook as what I need it for isn't for a post, but just comments (which can only be text, as opposed to posts which could be text mixed with videos/images/etc).
The solution would affect the database side (how the comments are stored) and also the client side (how the comments are parsed and displayed to the user). The clients are mobile apps for iOS and Android but also looking to expand to a web application as well.
I don't think the language matters as much but for completeness sake, I'm using Python's Flask with SQLAlchemy frameworks on the backend.
Current DB schema for comments
COMMENT TABLE:
id (<PK>)
post_id (id of the post that the comment is for: <FK on a post object>)
author_id (id of the creator of the comment: <FK on a user object>)
text (comment text: <String>)
timestamp (comment date: <Date>)
Edit:
I ended up going with metadata in the comment. E.g.
Hey <mention userid="785" tagname="JohnnyBravo"/>!
I included the user's name (tagname) as well so that client application can extract the name directly from the comment text instead of adding another step to look up who user 785 is.
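A client could parse that embedded tag along these lines. This is only a sketch: the regular expression assumes the exact attribute order shown above, and the /users/{id} profile URL is a placeholder, not something from the question.

```python
import re
from html import escape

# Matches the tag format shown in the edit above, attributes in that order.
MENTION_RE = re.compile(r'<mention userid="(\d+)" tagname="([^"]*)"/>')


def extract_user_ids(text):
    """Return the ids of all users mentioned in a stored comment."""
    return [int(m.group(1)) for m in MENTION_RE.finditer(text)]


def render_html(text, profile_url="/users/{id}"):
    """Replace each mention tag with a profile link and escape everything else."""
    parts = []
    last = 0
    for m in MENTION_RE.finditer(text):
        parts.append(escape(text[last:m.start()]))
        url = profile_url.format(id=m.group(1))
        parts.append(f'<a href="{url}">{escape(m.group(2))}</a>')
        last = m.end()
    parts.append(escape(text[last:]))
    return "".join(parts)


comment = 'Hey <mention userid="785" tagname="JohnnyBravo"/>!'
print(render_html(comment))
```

The mobile clients would do the same split but build native link spans instead of HTML.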
The big problem here is if the username is not a stable reference, you need to abstract it to an id reference, while still keeping the the text reconstructable, but the references queryable.
Embedded collections and dynamic typing are a great option if you're using a NoSQL database. It would be fairly straightforward.
{
_id: ...,
text: [
"Wow ",
51,
", your selfie looks really great, even better than ",
72,
"'s does."
],
...
}
That way you could query references, while still easily reconstructing the content. BUT since you're using SQLAlchemy, that's a no go. Your methodology seems fine, but because you're doing magic in the string you'll need to escape your delimiters (as well as escape the escape character) if they exist in the text. Personally, I would use # as the delimiter since it's already a special character. You'd also need to identify the end of the id, in case the user sticks a bunch of numbers after the #mention, so
Wow #51#, your selfie looks really great, even better than #72#'s does. email me! john\#foo.com. Division time!!! with backslashes! 12\\4 = 3
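A small parser for that delimiter scheme might look like the following. It is a sketch under the stated convention (#id# marks a mention, backslash escapes the next character); the function names are illustrative only.

```python
def escape_text(text):
    """Escape user-typed text before storing it: backslash first, then '#'."""
    return text.replace("\\", "\\\\").replace("#", "\\#")


def parse_comment(stored):
    """Split a stored comment into str segments (literal text) and int segments (user ids)."""
    segments = []
    buf = []
    i = 0
    while i < len(stored):
        ch = stored[i]
        if ch == "\\" and i + 1 < len(stored):
            buf.append(stored[i + 1])      # escaped character is taken literally
            i += 2
        elif ch == "#":
            end = stored.find("#", i + 1)  # closing delimiter marks the end of the id
            if end == -1 or not stored[i + 1:end].isdecimal():
                raise ValueError(f"malformed mention at position {i}")
            if buf:
                segments.append("".join(buf))
                buf = []
            segments.append(int(stored[i + 1:end]))
            i = end + 1
        else:
            buf.append(ch)
            i += 1
    if buf:
        segments.append("".join(buf))
    return segments


stored = r"Wow #51#, better than #72#'s. Mail john\#foo.com, 12\\4 = 3"
segments = parse_comment(stored)
```

The integer segments are exactly what would be written to a junction table if mentions need to be queryable.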
If querying posts for references is also important to you, you'll also need to maintain a separate POST__USER junction table that stores a row for the post and for each user id, so that when you load an object into memory, you can construct a collection. You could decide to add the junction table later, but it would be a fairly expensive migration.
If #name is not unique, you have to somehow associate the non-unique name, via the session, with the unique owner of the natural name, and do this ideally before storing it in the database. Storing a non-unique name in the database, if it cannot be resolved to its unique owner, is not of much value.
Since you mention "sql" I assume you're using a relational database. If that is the case, once you have resolved #name to its unique owner, I would create a one-to-many relationship between posting or comment and userids; that would allow a comment or post to reference more than one user.
TABLE: COMMENT_MENTIONEDUSERS
commentid
userid
I would recommend storing the comment as markdown since it's now quite widespread. In your case, "This is a comment for [#John](/user/johnID)".
Markdown is pretty standard and you shouldn't have an issue finding a package for editing / viewing.

CakePHP need to know model ID before saving

My situation:
I have a form based on several models linked together.
In the background I have a routine for saving images, based on an ajax call to a particular controller which uploads the image and displays it in a textarea in real time.
My problem arises when I have to save the record for the first time, because the ajax routine needs to know the ID of the record to associate the image with, but the ID is created only after save() is called.
Is there a way to obtain an ID before the model saves, or some other workaround?
Thanks in advance to everyone
Rudy
I think @RichardAtHome provides a good input on this. I'll just add a third solution you may want to consider:
Use universally unique identifiers (UUID) for that model's primary key. CakePHP offers a way to generate them via the String utility. So what you would do is generate the id when the page loads with String::uuid() and then send it as part of the ajax requests.
A couple of solutions:
Change your workflow, so that the images can only be linked to the main model once the main model has been saved (and has an id).
Don't save your images to the database until the main model has been created. You could store them in a session, or just keep appending fields to the end of the form as the user adds more fields.
Option1:
You could try using UUIDs instead of INTs. That way, you can just use CakePHP to generate an ID: String::uuid();.
After you create it, you can populate an id field that would then make the item save with that id.
I'm not sure I'd rely on its uniqueness if I were doing banking software or something critical, but it's something like a 1 in a billion chance of a duplicate, and for normal websites/software, it seems like that would suffice.
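The general idea, independent of CakePHP, is sketched below in Python using the standard uuid module. The attach_image and save_record helpers and the in-memory dictionaries are stand-ins invented for the example, not part of any framework. For scale, a random (version 4) UUID carries 122 random bits.

```python
import uuid

uploads = {}   # record id -> image file names; stands in for the images table
records = {}   # record id -> saved row; stands in for the main model's table


def new_record_id() -> str:
    """Generate the primary key up front, before anything is saved."""
    return str(uuid.uuid4())


def attach_image(record_id, filename):
    """What the ajax upload handler would do: link the image to the known id."""
    uploads.setdefault(record_id, []).append(filename)


def save_record(record_id, title):
    """The later form submit: the row is created under the pre-generated id."""
    records[record_id] = {"title": title, "images": uploads.get(record_id, [])}


record_id = new_record_id()          # generated when the form page loads
attach_image(record_id, "photo.jpg") # upload happens before the first save
save_record(record_id, "My article")
```

Images whose record id never gets saved would need a periodic cleanup, since the form can be abandoned.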
Option2:
In a lot of our projects, we've found it helpful to have a very simple "create a Thing" page with just a title field, then, once saved, it takes them immediately to the more in-depth "edit" page where they can upload files, save extra data...etc etc. Almost like a pg1, pg2. That way, files, as well as any other dynamic ajax-driven data will have an ID to work with.
Probably not, because the ID will be set by the database, and not CakePHP, and the ID can only be known after the database has auto-incremented the ID field...
Not sure how your flow is, but I guess you can save a dummy/empty row to the database without all the associations when the controller gets the page (when $this->data is empty). You now have an ID to an empty row in the database that is like "12, NULL, NULL, NULL, NULL, NULL, ...". You can then store the ID of that dummy record (or do $this->set("id", $id) to pass it to the view to use in your ajax calls).
After the POST of your page you can then use the ID to store all other information in dummy row, or delete the dummy row from the database when something fails/user navigates away...

Suggestions for implementing a document management system in which some documents have runtime replaced fields

One of my apps is a Document Management System in which the documents are stored as blob fields in a db. This is not a language-specific question; I put Delphi in the tags since this is the community where I typically ask questions (and many people who use Delphi face these problems).
One feature I need to add is to programmatically add some data to the document. Here is a simple example to give the idea. One field is the date on which the document was created. For this the user will type a "tag", for example <DOCUMENT_DATE>, and the date will be automatically substituted when the document is extracted from the db.
So I have 2 main concerns. One is what to use as the "tag". The simplest thing is to use a text tag, so simply typing it into the document and then doing a Search & Replace on the text (using for example MS Word ActiveX). I already do this for other purposes. An alternative could be using bookmarks or another technique.
The other question is strictly related with the previous.
How do I store it? My first idea is to store the document in the DB with the "tags", so when it is "checked out" the user sees the tags, while when the user opens it (in read-only mode) he sees the substituted text (so in the first case he sees <DOCUMENT_DATE> and in the second "12 october 2011").
In this way I store the file once, but every time it is opened there is an overhead in processing it and doing the Search & Replace, which can also be relatively slow. This is why I asked about other techniques, like search and replace on bookmarks. The faster the better.
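For a text-based format, the substitute-on-open step is small. The sketch below assumes the document body is available as a plain string (which is not the case for a binary Word file) and uses a made-up render_document helper.

```python
def render_document(text, fields):
    """Return a read-only view of the text with each <TAG> replaced by its value."""
    for tag, value in fields.items():
        text = text.replace(f"<{tag}>", value)
    return text


stored = "This document was created on <DOCUMENT_DATE>."
readonly_view = render_document(stored, {"DOCUMENT_DATE": "12 October 2011"})
print(readonly_view)
```

The stored copy keeps its tags, so a checkout returns the original text unchanged.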
The alternative is to store the document twice: once with the "tags" and once as the "substituted version". This would be good for performance: no Search & Replace; when the document is opened in "checkout" mode I open the one with the tags, and when I open it in read-only mode I open the substituted one.
This of course takes more storage, for every document version (revision1, revision2, ...) I need to store 2 files.
I feel double storage is the best, because it won't affect performance at all; I mean it will be as fast as now, and only the check-in process will be slower since I need to save 2 files instead of one. Moreover, by not enabling this auto-substitution feature on all documents by default, I won't double the db size.
But anyway I would like to hear some comments, since it is a quite crucial decision.
It really does not make sense to store identical data twice.
In fact it is a really bad idea, mainly from a consistency point of view.
The way you do this is to store stuff in different tables and create links between the tables.
This is a process called normalization.
Here's an example loosely inspired by your post using MySQL:
TABLE document
--------------
id UNSIGNED INTEGER AUTO_INCREMENT PRIMARY KEY,
data BLOB
TABLE tag
------------
id UNSIGNED INTEGER AUTO_INCREMENT PRIMARY KEY,
tag VARCHAR(20)
TABLE tag_link
-------------------
tag_id UNSIGNED INTEGER,
reference_nr UNSIGNED INTEGER,
PRIMARY KEY (tag_id, reference_nr),
FOREIGN KEY (tag_id) REFERENCES tag(id) ON DELETE CASCADE ON UPDATE CASCADE,
FOREIGN KEY (reference_nr) REFERENCES post(reference_nr) ON DELETE CASCADE ON UPDATE CASCADE
TABLE post
----------------
reference_nr UNSIGNED INTEGER NOT NULL,
revision UNSIGNED INTEGER NOT NULL DEFAULT 1,
document_id UNSIGNED INTEGER,
title VARCHAR(255),
creation_date TIMESTAMP,
other_fields .....
PRIMARY KEY (reference_nr, revision),
FOREIGN KEY (document_id) REFERENCES document(id) ON DELETE SET NULL ON UPDATE CASCADE
Now you can add tags to a post, all revisions of a post share the same tags.
Revisions of a post can link to the same document, or to different documents no need to duplicate data.
If you want to get all the latest revisions of documents with certain tags, you use the following query:
SELECT p.title, d.data, GROUP_CONCAT(t.tag) AS tags
FROM post p
LEFT JOIN document d ON (p.document_id = d.id)
INNER JOIN tag_link tl ON (tl.reference_nr = p.reference_nr)
INNER JOIN tag t ON (tl.tag_id = t.id)
WHERE t.tag IN ('test','test2')
GROUP BY p.reference_nr /*only works in MySQL because other db's do not support ANSI SQL 2003*/
HAVING p.revision = MAX(p.revision)
ORDER BY p.creation_date DESC
I see two other possibilities worth considering.
1. Use RTF
If your document templates are Word documents, I'd rather store them as RTF.
RTF is just plain ASCII, and even if it is a proprietary format, it is well documented, and can be easily parsed. Word is able to save its content and read it as RTF. If you have pictures within, it can grow, but you can zip it before storing as BLOB in your database (and you may embed EMF pictures).
Then you can process those RTF content very fast in your code, changing all <DOCUMENT_DATE> using the latest version of the date field value.
I use this technique in several applications, and it gives very good results. See for instance how our SynProject tool generates Word documents from plain text, replacing tags, setting bookmarks or indexes on the fly. With RTF, you can do much more than just replacing a tag, but create a whole document easily.
For end-user input, you can use a basic TRichEdit or a more advanced (but not free) TRichView instead of Word.
You may consider using HTML instead of RTF, but it is much less printing-friendly.
2. Use a report engine
Another possibility could be to use a code-based report engine, then create PDF files.
Our Open Source units can be used from a simple reporting class to create easily the file content, preview it on screen and/or print/export as PDF. It is much easier than RTF to work with, but the layout has to be set in your code, or with text-based / wiki-like templates to be stored in your DB.

Custom Fields for a Form representing an object

I have an architectural question concerning custom fields in a view for an object. Let's say you have a User Object with some basic information like firstname, lastname, ... that can be used by all customers.
Now, we often get a request from a customer to add a couple of custom fields typical for their domain. Our solution now is an xml data column where key-value pairs are stored. This has been ok so far, but now we'll have to find a more architectural solution.
For instance, now, a customer wants a dropdown where it can select the value for its custom field. We could still store the selected value in the xml data column, but where do we store all those dropdown values...
I know that in sharepoint you can also add custom fields like dropdowns and I was wondering how to deal with this best. I want to avoid creating custom tables for customers, or having a table with 90 columns (10 basic and then 10 for each customer), ...
You get the idea, it should be generic and be able to deal with all sorts of problems in the future.
What I was thinking about is a table UserConfiguration where each record has a foreign key to the Customer (Channel in our database), then a column FieldName, a column FieldType and a column Values. The Values column should be an xml type column, because for a dropdown we'll need to add multiple values. Also, each value can have extra data attached to it (not just a name). The other problem then is how to store the selected value. I don't like the idea of having foreign keys to xml in my database (I read somewhere that Azure can't handle this all too well). Do you just store the name of the value (what if the value were to disappear from the xml?)?
Any documentation, links on this kind of problems would also be great. I'm trying to find a design pattern that deals with this kind of problem in the database.
I want to answer your question in two parts:
1) Implementing custom fields in a database server
2) Restricting custom fields to an enumeration of values
Although common solutions to 1) are discussed in the question referenced by @Simon, maybe you are looking for a bit of discussion on what the problem is and why it hasn't been solved for us already.
databases are great for structured, typed data
custom fields are inherently less structured
therefore, custom fields are more difficult to work with in a database
some or many of the advantages of using a database are lost
some queries may be more difficult or impossible
type safety may be lost (in the database)
data integrity may no longer be enforced (by the database)
it's a lot more work for the implementers and maintainers
As discussed in the other question, there's no perfect solution.
But these benefits/features still need to be implemented somewhere, and so often the application becomes responsible for data integrity and type safety.
For situations like these, people have created Object-Relation Mapping tools, although, as Jeff Atwood says, even using an ORM could create more problems than it solved. However, you mentioned that it 'should be generic and be able to deal with all sorts of problems in the future' -- this makes me think an ORM might be your best bet.
So, to sum up my answer, this is a known problem with known solutions, none of which are completely satisfactory (because it's so hard). Pick your poison.
To answer the second part of (what I think is) your question:
As mentioned in the linked question, you could implement Entity-Attribute-Value in your database for custom fields, and then add an extra table to hold the legal values for each entity. Then, the attribute/value of the EAV table is a foreign key into the attribute-value table.
For example,
CREATE TABLE `attribute_value` ( -- enumerations go in this table
`attribute` varchar(30),
`value` varchar(30),
PRIMARY KEY (`attribute`, `value`)
);
CREATE TABLE `eav` ( -- now the values of attributes are restricted
`entityid` int,
`attribute` varchar(30),
`value` varchar(30),
PRIMARY KEY (`entityid`, `attribute`),
FOREIGN KEY (`attribute`, `value`) REFERENCES `attribute_value`(`attribute`, `value`)
);
Of course, this solution isn't perfect or complete -- it's only supposed to illustrate the idea. For instance, it uses varchars, and lacks a type column. Also, who gets to decide what the possible values for each attribute are? Can these be changed at any time by the user?
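To see the constraint in action, here is a small runnable sketch using Python's sqlite3 module rather than MySQL. The shirt_size attribute, its values and the set_attribute helper are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE attribute_value (          -- enumerations go in this table
    attribute TEXT NOT NULL,
    value     TEXT NOT NULL,
    PRIMARY KEY (attribute, value)
);
CREATE TABLE eav (                      -- values are restricted by the foreign key
    entityid  INTEGER NOT NULL,
    attribute TEXT NOT NULL,
    value     TEXT NOT NULL,
    PRIMARY KEY (entityid, attribute),
    FOREIGN KEY (attribute, value) REFERENCES attribute_value (attribute, value)
);
INSERT INTO attribute_value VALUES ('shirt_size', 'S'), ('shirt_size', 'M');
""")


def set_attribute(entityid, attribute, value):
    """Store a custom field value; return False if it is not a legal option."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT OR REPLACE INTO eav VALUES (?, ?, ?)",
                (entityid, attribute, value),
            )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = set_attribute(1, "shirt_size", "M")
rejected = set_attribute(1, "shirt_size", "XXL")
```

The dropdown options for a field are then simply the rows of attribute_value for that attribute.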
I'm doing something similar for a customer. I've created a JSON FieldType which holds the entire JSON stream of a complex object and a String containing the FQTN (FullQualifiedTypeName) of my C# model class.
By using custom New-, Edit- and Display-Forms we've ensured that our custom objects are rendered the correct way for the best user experience.
To promote fields from the complex C# model to the SharePoint list, we've built something like what Microsoft did in InfoPath. Users are able to select Properties or MetaData from the complex C# type, which will be automatically promoted to the hosting SharePoint list.
The big advantage of JSON is that it's smaller than XML and easier to work with in the web world (JavaScript...).
When you let the users create the data models, I would recommend looking at a document database or 'NoSQL', since you want exactly that: to store schemaless data structures.
Also, SharePoint stores metadata the way you mentioned (10 columns for text, 5 for dates etc).
That said, in my current project (locked in SharePoint, so Framework 3.5 + SQL Server and all the constraints that follow) we use a somewhat similar structure as below:
Form
Id
Attribute (or Field)
Name
Type (enum) Text, List, Dates, Formulas etc
Hidden (bool)
Mandatory
DefaultValue
Options (for lists)
Readonly
Mask (for SSN etc)
Length (for text fields)
Order
Metadata
FormId
AttributeId
Text (the value for everything but dates)
Date (the value for dates)
Our formulas employ functions such as Increment: INC([attribute1][attribute2], 6) and this would produce something like 000999 for the 999th instance of the combined values for attribute 1 and attribute 2 for a form, this is stored as:
AttributeIncrementFormula
AttributeId
Counter
Token
Other 'formulas' (aka anything non-trivial) such as barcodes are stored as single metadata values. In the actual implementation, we would have something like this:
var form = formRepository.GetById(1);
form.Metadata["firstname"].Value
Value above is a readonly property that decides whether we should get the value from Text or Date and if some additional transform is required. Note that the database here is merely a storage, we hold all the domain complexity in the application.
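The read-only Value idea could be sketched like this. It is written in Python rather than the C# of the project above, and the class and field names are illustrative guesses, not the actual implementation.

```python
import datetime
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metadata:
    attribute_type: str                           # e.g. "Text", "List", "Dates"
    text_value: Optional[str] = None              # the Text column
    date_value: Optional[datetime.date] = None    # the Date column

    @property
    def value(self):
        """Pick the storage column that matches the attribute's type."""
        if self.attribute_type == "Dates":
            return self.date_value
        return self.text_value


first_name = Metadata("Text", text_value="Ada")
signed_on = Metadata("Dates", date_value=datetime.date(2011, 10, 12))
print(first_name.value, signed_on.value)
```

Any extra transform (masks, formulas) would go inside the property, which keeps the database as plain storage.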
We also let our customer decide which attribute is the form title for example, so if firstname is the form title, they'll set an in-memory param that spans the entire application to be something like Params.InMemory.TitleAttributeId = <user-defined-id>.
I hope this gives you some insight on a production impl of a similar scenario.
This is really more of a comment than an answer, but I need more space than SO will allow for comments, so here 'tis:
I think your UserConfiguration table approach is good, and would suggest only abstracting the "type" and "value" pieces of your design a bit more:
Since your application will need to validate user input, each notion of "type" will have an associated piece of evaluation logic. Obviously the more of this you can abstract into data the easier it will be to keep your code small. Enumerated lists are a good start, but if your "validator" logic can be extended to handle pattern matching for text strings and Boolean logical expressions (e.g. to describe/enforce constraints on input values), then you can express pretty much any "type" of input that your application may need to handle in terms of (relatively) simple "atoms" that you can map naturally to DB tables.
When storing a user-specified value, you can either store the "raw" data (e.g. in JSON) and a foreign key to the associated "type", or you can add a lookup/cache system that assigns an integer to each new value that is encountered by the system ("novelty" can be checked by checking a hash of the "raw" data, for example). The latter approach obviously scales better if you're expecting lots of data duplication (which of course you would in the case of a multiple-choice menu).

Indexing URL's in SQL Server 2005

What is the best way to deal with storing and indexing URL's in SQL Server 2005?
I have a WebPage table that stores metadata and content about Web Pages. I also have many other tables related to the WebPage table. They all use URL as a key.
The problem is URL's can be very large, and using them as a key makes the indexes larger and slower. How much I don't know, but I have read many times using large fields for indexing is to be avoided. Assuming a URL is nvarchar(400), they are enormous fields to use as a primary key.
What are the alternatives?
How much pain is there likely to be in using the URL as a key instead of a smaller field?
I have looked into the WebPage table having a identity column, and then using this as the primary key for a WebPage. This keeps all the associated indexes smaller and more efficient but it makes importing data a bit of a pain. Each import for the associated tables has to first lookup what the id of a url is before inserting data in the tables.
I have also played around with using a hash on the URL, to create a smaller index, but am still not sure if it is the best way of doing things. It wouldn't be a unique index, and would be subject to a small number of collisions. So I am unsure what foreign key would be used in this case...
There will be millions of records about webpages stored in the database, and there will be a lot of batch updating. There will also be quite a lot of activity reading and aggregating the data.
Any thoughts?
I'd use a normal identity column as the primary key. You say:
This keeps all the associated indexes smaller and more efficient
but it makes importing data a bit of a pain. Each import for the
associated tables has to first lookup what the id of a url is
before inserting data in the tables.
Yes, but the pain is probably worth it, and the techniques you learn in the process will be invaluable on future projects.
On SQL Server 2005, you can create a user-defined function GetUrlId that looks something like
CREATE FUNCTION GetUrlId (@Url nvarchar(400))
RETURNS int
AS BEGIN
DECLARE @UrlId int
SELECT @UrlId = Id FROM Url WHERE Url = @Url
RETURN @UrlId
END
This will return the ID for URLs already in your Url table, and NULL for any URL not already recorded. You can then call this function inline in your import statements, something like
INSERT INTO
UrlHistory(UrlId, Visited, RemoteIp)
VALUES
(dbo.GetUrlId('http://www.stackoverflow.com/'), @Visited, @RemoteIp)
This is probably slower than a proper join statement, but for one-time or occasional import routines it might make things easier.
Break up the URL into columns based on the bits you're concerned with and use the RFC as a guide. Reverse the host and domain info so an index can group like domains (Google does this).
stackoverflow.com -> com.stackoverflow
blog.stackoverflow.com -> com.stackoverflow.blog
Google has a paper that outlines what they do but I can't find it right now.
http://en.wikipedia.org/wiki/Uniform_Resource_Locator
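Reversing the host before storing it can be done with a few lines. This sketch uses Python's standard urllib.parse, and reversed_host is a name chosen for the example.

```python
from urllib.parse import urlsplit


def reversed_host(url):
    """Return the host with its labels reversed, e.g. for an index that groups domains.

    Returns an empty string when the URL has no scheme and therefore no parsed host.
    """
    host = urlsplit(url).hostname or ""
    return ".".join(reversed(host.split(".")))


print(reversed_host("http://blog.stackoverflow.com/2008/09/some-post"))
```

Stored this way, all hosts under the same domain sort next to each other in an index.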
I would stick with the hash solution. This generates a unique key with a fairly low chance of collision.
An alternative would be to create GUID and use that as the key.
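Since a short hash cannot be guaranteed unique, one way to use it is as a narrow, non-unique index that narrows the search, with the full URL compared afterwards. The sketch below models that with an in-memory dictionary; the UrlIndex class and the choice of a 32-bit slice of SHA-1 are assumptions made for illustration.

```python
import hashlib


def url_hash(url: str) -> int:
    """Derive a signed 32-bit hash, small enough for an INT column."""
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big", signed=True)


class UrlIndex:
    """Stand-in for a table with an identity key and a non-unique index on the hash."""

    def __init__(self):
        self._buckets = {}   # hash -> list of (page_id, url)
        self._next_id = 1

    def get_or_add(self, url):
        """Look up by hash, then confirm with the full URL to resolve collisions."""
        bucket = self._buckets.setdefault(url_hash(url), [])
        for page_id, stored_url in bucket:
            if stored_url == url:
                return page_id
        page_id = self._next_id
        self._next_id += 1
        bucket.append((page_id, url))
        return page_id


index = UrlIndex()
first = index.get_or_add("http://www.stackoverflow.com/")
again = index.get_or_add("http://www.stackoverflow.com/")
other = index.get_or_add("http://blog.stackoverflow.com/")
```

Related tables would still reference the small integer id, so a hash collision never affects a foreign key.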
I totally agree with Dylan. Use an IDENTITY column or a GUID column as a surrogate key in your WebPage table. That's a clean solution. The lookup of the id while importing isn't that painful, I think.
Using a big varchar column as key column is wasting much space and affects insert and query performance.
Not so much a solution. More another perspective.
Storing the total unique URI of a page perhaps defeats part of the point of URI construction. Each forward slash is supposed to refer to a unique semantic space within the domain (whether that space is actual or logical). Unless the URIs you intend to store are something along the line of www.somedomain.com/p.aspx?id=123456789 then really it might be better to break a single URI metatable into a table representing the subdomains you have represented in your site.
For example if you're going to hold a number of "News" section URIs in the same table as the "Reviews" URIs then you're missing a trick to have a "Sections" table whose content contains meta information about the section and whose own ID acts as a parent to all those URIs within it.
