How to design a database schema for a search engine? - database

I'm writing a small search engine in C with curl, libxml2, and mysql. The basic plan is to grab pages with curl, parse them with libxml2, then iterate over the DOM and find all the links. Then traverse each of those, and repeat, all while updating a SQL database that maintains the relationship between URLs.
My question is: how can I best represent the relationship between URLs?.

Why not have a table of base urls (ie www.google.com/) and a table of connections, with these example columns:
starting page id (from url table)
ending page id (from url table)
the trailing directory of the urls as strings in two more columns
This will allow you to join on certain urls and pick out information you want.
Your solution seems like it would be better suited to a non relational datastore, such as a column store.
Most search engine indices aren't stored in relational databases, but stored in memory as to minimize retrieval time.

Add two fields to table - 'id' and 'parent_id'.
id - unique identifier for URL
parent_id - link between URL's

If you want to have a single entry for each URL then you should create another table that maps the relationships.
You then lookup the URL table to see if it exists. If not create it.
The relationship table would have
SourceUrlId,
UrlId
Where the SourceUrlId is the page and the UrlId is the url it points to. That way you can have multiple relationships for the same URL and you won't need to have a new entry in the Url table for every link to that url. Will also mean only 1 copy of any other info you are storing.

Why are you interested in representing pages graph?
If you want to compute the ranking, then it's better to have a more succinct and efficient representation (e.g., matricial form if you want to compute something similar to PageRank).

Related

How can I design a table that stores subdomain metadata to avoid large partitions?

i try to design a table in cassandra, but im getting a lot of large partition messages.
Any ideas how i could improve this "design" to prevent overloading and still can use a query like this:
select * from analytics where domain='test' and tld='com'
CREATE TABLE analytics (
domain text,
tld text,
subdomain text,
a text,
PRIMARY KEY ((domain, tld), subdomain)
)
Also im loading this table with
update analytics set a='a' where domain='test' and tld='com' and subdomain='b';
Some partitions are over 1million rows
I must be naïve but I'm very surprised to hear that some domains can have a million subdomains. In any case, I suspect that a significant majority of domains would have less than 100 subdomains so for the most part, your current table schema is going to be fine and you just need to deal with the really "large" domains.
This is a common problem for social apps and in Graph Theory it is known as the supernode problem -- a vertex with an incredibly high number of edges. In simpler terms, it's Barack Obama (the vertex or node) with over 133M followers (edges) on Twitter, or Cristiano Ronaldo with over 506M followers on Instagram.
For apps that run into the supernode problem, they typically work around it by handling the supernodes separately from the rest. In your case, you need to implement some logic in your app to detect the "super domains" and store them in a separate table.
A possible table design uses the first 2 characters of the subdomain as a bucket. For example with domain sub.domainsr.us, we use the prefix su for bucketing to make the partitions smaller:
CREATE TABLE subdomains_by_domain_tld_prefix (
domain text,
tld text,
prefix text,
subdomain text,
a text,
PRIMARY KEY ((domain, tld, prefix), subdomain)
)
This is just an example so the prefix doesn't have to be limited to just the first 2 characters. You can adjust it depending on the dataset.
Also if it makes it simpler for your app, you can choose to use this table for all domains. Cheers!
👉 Please support the Apache Cassandra community by hovering over the cassandra tag above and click on Watch tag. 🙏 Thanks!
How unique is a? You can include whatever makes the most sense and would give you smaller partitions, then you could create a secondary index on whichever column you leave out of the original PK and need to query. Remember that whatever you include in the PK, you'll need to use when you query records, so only include or add a column that would make sense to include in queries and would give you smaller partitions.

Amazon DynamoDB Single Table Design For Blog Application

New to this community. I need some help in designing the Amazon Dynamo DB table for my personal projects.
Overview, this is a simple photo gallery application with following attributes.
UserID
PostID
List item
S3URL
Caption
Likes
Reports
UploadTime
I wish to perform the following queries:
For a given user, fetch 'N' most recent posts
For a given user, fetch 'N' most liked posts
Give 'N' most recent posts (Newsfeed)
Give 'N' most liked posts (Newsfeed)
My solution:
Keeping UserID as the partition key, PostID as the sort key, likes and UploadTime as the local secondary index, I can solve the first two query.
I'm confused on how to perform query operation for 3 and 4 (Newsfeed). I know without partition ket I cannot query and scan is not an effective solution. Any workaround for operatoin 3 and 4 ?
Any idea on how should I design my DB ?
It looks like you're off to a great start with your current design, well done!
For access pattern #3, you want to fetch the most recent posts. One way to approach this is to create a global secondary index (GSI) to aggregate posts by their creation time. For example, you could create a variable named GSI1PK on your main table and assign it a value of POSTS and use the upload_time field as the sort key. That would look something like this:
Viewing the secondary index (I've named it GSI1), your data would look like this:
This would allow you to query for Posts and sort by upload_time. This is a great start. However, your POSTS partition will grow quite large over time. Instead of choosing POSTS as the partition key for your secondary index, consider using a truncated timestamp to group posts by date. For example, here's how you could store posts by the month they were created:
Storing posts using a truncated timestamp will help you distribute your data across partitions, which will help your DB scale. If a month is too long, you could use truncated timestamps for a week/day/hour/etc. Whatever makes sense.
To fetch the N most recent posts, you'd simply query your secondary index for POSTS in the current month (e.g. POSTS#2021-01-00). If you don't get enough results, run the same query against the prior month (e.g. POSTS#2020-12-00). Keep doing this until your application has enough posts to show the client.
For the fourth access pattern, you'd like to fetch the most liked posts. One way to implement this access pattern is to define another GSI with "LIKES" as the partition key and the number of likes as the sort key.
If you intend on introducing a data range on the number of likes (e.g. most popular posts this week/month/year/etc) you could utilize the truncated timestamp approach I outlined for the previous access pattern.
When you find yourself "fetch most recent" access patterns, you may want to check out KSUIDs. KSUIDs, or K-sortable Universal Identifier, are unique identifiers that are sortable by their creation date/time/. Think of them as UUID's and timestamps combined into one attribute. This could be useful in supporting your first access pattern where you are fetching most recent posts for a user. If you were to use a KSUID for the Post ID, your table would look like this:
I've replaced the POST ID's in this example with KSUIDs. Because the KSUIDs are unique and sortable by the time they were created, you are able to support your first access pattern without any additional indexing.
There are KSUID libraries for most popular programming languages, so implementing this feature is pretty simple.
You could add two Global Secondary Indexes.
For 3):
Create a static attribute type with the value post, which serves as the Partition Key for the GSI and use the attribute UploadTime as the Sort Key. You can then query for type="post" and get the most recent items based on the sort key.
The solution for 4) is very similar:
Create another Global secondary index with the aforementioned item type as the partition key and Likes as the sort key. You can then query in a similar way as above. Note, that GSIs are eventually consistent, so it may take time until your like counters are updated.
Explanation and additional infos
Using this approach you group all posts in a single item collection, which allows for efficient queries. To save on storage space and RCUs, you can also choose to only project a subset of attributes into the index.
If you have more than 10GB of post-data, this design isn't ideal, but for a smaller application it will work fine.
If you're going for a Single Table Design, I'd recommend to use generic names for the Index attributes: PK, SK, GSI1PK, GSI1SK, GSI2PK, GSI2SK. You can then duplicate the attribute values into these items. This will make it less confusing if you store different entities in the table. Adding a type column that holds the entity type is also common.

Fetching by key vs fetching by filter in Google App Engine

I want to be as efficient as possible and plan properly. Since read and write costs are important when using Google App Engine, I want to be sure to minimize those. I'm not understanding the "key" concept in the datastore. What I want to know is would it be more efficient to fetch an entity by its key, considering I know what it is, than by fetching by some kind of filter?
Say I have a model called User and a user has an array(list) of commentIds. Now I want to get all this user's comments. I have two options:
The user's array of commentId's is an array of keys, where each key is a key to a Comment entity. Since I have all the keys, I can just fetch all the comments by their keys.
The user's array of commentId's are custom made identifiers by me, in this case let's just say that they're auto-incrementing regular integers, and each comment in the datastore has a unique commentIntegerId. So now if I wanted to get all the comments, I'd do a filtered fetch based on all comments with ID that is in my array of ids.
Which implementation would be more efficient, and why?
Fetching by key is the fastest way to get an entity from the datastore since it the most direct operation and doesn't need to go thru index lookup.
Each time you create an entry (unless you specified key_name) the app engine will generate a unique (per parent entity) numeric id, you should use that as ids for your comments.
You should design a NoSql database (= GAE Datastore) based on usage patterns:
If you need to get all user's comments at once and never need to get one or some of them based on some criteria (e.g. query them), than the most efficient way, in terms of speed and cost would be to serialize all comments as a binary blob inside an entity (or save it to Blobstore).
But I guess this is not the case, as comments are usually tied to both users and to posts, right? In this case above advice would not be viable.
To answer you title question: get by key is always faster then query by a property, because query first goes through index to satisfy the property condition, where it gets the key, then it does the get with this key.

Database design for a product aggregator

I'm trying to design a database for a product aggregator. Each product has information about where it comes from, what it costs, what type of thing it is, price, color, etc. Users need to able to search and filter results based on any of those product categories. I also expect to have a large number of users. My initial thought was having one big table with every product in it with a column for each piece of information and an index on anything I need to be able to search by but I think this might be inefficient with a lot of users pounding on this one table. My other thought was to organize the database to promote a tree-like navigation of tables but because you can search by anything I'm not sure how I would organize the tables.
Any thoughts on some good practices?
One table of products - databases are designed to have lots of users pounding on tables.
(from the comments)
You need to model your data. This comes from looking at the all the data you have, determining what is related to what (a table is called a relation because all the attributes in a row are related to a candidate key). You haven't really given enough information about the scope of what data (unstructured?) you have on these products and how it varies. Are you going to have difficulties because Shoes have brand, model, size and color, but Desks only have brand, model and finish? All this is going to inform your data model. Typically you have one products table, and other things link to it.
Some of those attributes will be foreign keys to lookup tables, others (price) would be simple scalars. Appropriate indexing and you'll be fine. For advanced analytics, consider a dimensionally modeled star-schema, but perhaps not for your live transaction system - depends what your data flow/workflow/transactions are. Or consider some benefits of its principles in your transactional database. Ralph Kimball is source of good information on dimensional modeling.
I dont see any need for the tree structure here. You can do with single table.
if you insist on tree structure with hierarchy here is an example to get you started.
For text based search, and ease of startup & design, I strongly recommend Apache SOLR. The SOLR API is easy to use (especially JSON). Databases do text search poorly, and I would instead recommend that you just make sure that they respond to primary/unique key queries properly, and those are the fields you should index.
One table for the products, and another table for the product category hierarchy (you don't specifically say you have this but "tree-like navigation of tables" makes me think you might).
I can see you might be concerned about over-indexing causing problems if you plan to index almost every column. In that case, it might be best to index on the top 5 or 10 columns you think users are likely to search for, unless it's possible for a user to search on ANY column. In that case you might want to look at building a data warehouse. Maybe you'll want to look into data cubes to see if those will help...?
For hierarchical data, you need a PRODUCT_CATEGORY table looking something like this:
ID
PARENT_ID
NAME
Some sample data:
ID PARENT_ID NAME
1 ROOT
2 1 SOCKS
3 1 HELICOPTER PARTS
4 2 ARGYLE
Some SQL engines (such as Oracle) allow you to write recursive queries to traverse the hierarchy in a single query. In this example, the root of the tree has a PARENT_ID of NULL, but if you don't want this column to be nullable, I've also seen -1 used for the same purposes.

Indexing URL's in SQL Server 2005

What is the best way to deal with storing and indexing URL's in SQL Server 2005?
I have a WebPage table that stores metadata and content about Web Pages. I also have many other tables related to the WebPage table. They all use URL as a key.
The problem is URL's can be very large, and using them as a key makes the indexes larger and slower. How much I don't know, but I have read many times using large fields for indexing is to be avoided. Assuming a URL is nvarchar(400), they are enormous fields to use as a primary key.
What are the alternatives?
How much pain would there likely to be with using URL as a key instead of a smaller field.
I have looked into the WebPage table having a identity column, and then using this as the primary key for a WebPage. This keeps all the associated indexes smaller and more efficient but it makes importing data a bit of a pain. Each import for the associated tables has to first lookup what the id of a url is before inserting data in the tables.
I have also played around with using a hash on the URL, to create a smaller index, but am still not sure if it is the best way of doing things. It wouldn't be a unique index, and would be subject to a small number of collisions. So I am unsure what foreign key would be used in this case...
There will be millions of records about webpages stored in the database, and there will be a lot of batch updating. Also there will be a quite a lot of activity reading and aggregating the data.
Any thoughts?
I'd use a normal identity column as the primary key. You say:
This keeps all the associated indexes smaller and more efficient
but it makes importing data a bit of a pain. Each import for the
associated tables has to first lookup what the id of a url is
before inserting data in the tables.
Yes, but the pain is probably worth it, and the techniques you learn in the process will be invaluable on future projects.
On SQL Server 2005, you can create a user-defined function GetUrlId that looks something like
CREATE FUNCTION GetUrlId (#Url nvarchar(400))
RETURNS int
AS BEGIN
DECLARE #UrlId int
SELECT #UrlId = Id FROM Url WHERE Url = #Url
RETURN #UrlId
END
This will return the ID for urls already in your URL table, and NULL for any URL not already recorded. You can then call this function inline your import statements - something like
INSERT INTO
UrlHistory(UrlId, Visited, RemoteIp)
VALUES
(dbo.GetUrlId('http://www.stackoverflow.com/'), #Visited, #RemoteIp)
This is probably slower than a proper join statement, but for one-time or occasional import routines it might make things easier.
Break up the URL into columns based on the bits your concerned with and use the RFC as a guide. Reverse the host and domain info so an index can group like domains (Google does this).
stackoverflow.com -> com.stackoverflow
blog.stackoverflow.com -> com.stackoverflow.blog
Google has a paper that outlines what they do but I can't find right now.
http://en.wikipedia.org/wiki/Uniform_Resource_Locator
I would stick with the hash solution. This generates a unique key with a fairly low chance of collision.
An alternative would be to create GUID and use that as the key.
I totally agree with Dylan. Use an IDENTITY column or a GUID column as surrogate key in your WebPage table. Thats a clean solution. The lookup of the id while importing isn't that painful i think.
Using a big varchar column as key column is wasting much space and affects insert and query performance.
Not so much a solution. More another perspective.
Storing the total unique URI of a page perhaps defeats part of the point of URI construction. Each forward slash is supposed to refer to a unique semantic space within the domain (whether that space is actual or logical). Unless the URIs you intend to store are something along the line of www.somedomain.com/p.aspx?id=123456789 then really it might be better to break a single URI metatable into a table representing the subdomains you have represented in your site.
For example if you're going to hold a number of "News" section URIs in the same table as the "Reviews" URIs then you're missing a trick to have a "Sections" table whose content contains meta information about the section and whose own ID acts as a parent to all those URIs within it.

Resources