Database table structure for a search engine for my website

I am trying to make a search engine for my website. How should I design the table that keeps the list of indexed words?
Earlier I thought of something like this:
Table: tbl_indexedwords has 2 columns: iw_wordid and iw_word.
Table: tbl_wordoccurrence has 4 columns: wo_occurrenceid, wo_wordid, wo_pageid, wo_numberofoccurrences.
Now, this design will not work well if the user enters more than one word in the search box, say foo bar. Even if foo and bar are both present in tbl_indexedwords and their details are in tbl_wordoccurrence, my search engine script would rank results by the maximum wo_numberofoccurrences for either foo or bar. It cannot tell whether foo and bar appear next to each other, because there is no column for the order in which the words occur. I hope I am being clear here.
Another idea is to make tbl_wordoccurrence a 3-column table: drop wo_numberofoccurrences and store each word occurrence in the page with a unique wo_occurrenceid. This would solve my problem because I would then know the order of occurrence of the words: if the wo_occurrenceid of one word is the wo_occurrenceid+1 or wo_occurrenceid-1 of another word, the two occur side by side.
The problem with this design is that it would take up a lot of space. I have a lot of content on my website, and I think this approach would also make it slow (not sure, though). Is there any other design that would help me, or will I have to go with the second one? I am sure the first one is not going to work, so I am discarding it.
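For what it's worth, here is a minimal sketch of that second, position-based design and an adjacency query; the wo_position column and the exact DDL are assumptions for illustration (MySQL syntax), not the schema described above.

CREATE TABLE tbl_wordoccurrence (
    wo_pageid   INT NOT NULL,
    wo_wordid   INT NOT NULL,
    wo_position INT NOT NULL,  -- position of the word within the page
    PRIMARY KEY (wo_pageid, wo_position),
    KEY ix_word (wo_wordid)
);

-- pages where 'foo' is immediately followed by 'bar'
SELECT a.wo_pageid
FROM tbl_wordoccurrence a
JOIN tbl_wordoccurrence b
  ON b.wo_pageid = a.wo_pageid
 AND b.wo_position = a.wo_position + 1
WHERE a.wo_wordid = (SELECT iw_wordid FROM tbl_indexedwords WHERE iw_word = 'foo')
  AND b.wo_wordid = (SELECT iw_wordid FROM tbl_indexedwords WHERE iw_word = 'bar');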

If the contents of your website are in the database (which I assume), creating a separate table would not even be necessary if you use a FULLTEXT index. If you are using MySQL, it has this capability; see the examples here and here. And if you are using MSSQL, it also has its own FULLTEXT indexing capability, as in the examples here and here.
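As a minimal sketch of the MySQL route, assuming the content lives in a hypothetical pages table with title and body columns:

-- hypothetical content table; the table and column names are assumptions
CREATE TABLE pages (
    page_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title   VARCHAR(255) NOT NULL,
    body    TEXT NOT NULL,
    FULLTEXT KEY ft_title_body (title, body)
) ENGINE=InnoDB;

-- relevance-ranked search: 'foo bar' matches pages containing either word,
-- with the best matches first
SELECT page_id, title,
       MATCH(title, body) AGAINST('foo bar') AS relevance
FROM pages
WHERE MATCH(title, body) AGAINST('foo bar')
ORDER BY relevance DESC;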
And if you insist on having a separate table for searching, then you would most likely need only one table, like:
Table : tbl_wordsoccurrence
Fields : words_id, words
(and if you like, you can also include number_of_occurrences and page_id fields)
In the table above you could store either a single word like programming or a phrase like php programming.
On the other hand, if your website is static, meaning the content is not saved in a database and changes therefore have to be made manually rather than coming from regular user input, then that's another story.

Related

Creating Index in Cloudant

Scenario.
I have a document in the database which has thousands of items in 'productList', as shown here.
All the objects in the 'productList' array have the same shape and the same fields, with different values.
Now I want to search in the following way:
When a user types 'c' in the 'Ingrediants' field, the list should show all 'Ingrediants' values that start with the letter 'c'.
When a user types 'A' in the 'brandName' field, the list should show all 'brandName' values that start with the letter 'A'.
Please give an example of how to search for this, whether by:
creating an index (json, text),
creating a Search index (design document), or
using views, etc.
Note: I don't want to create the index at run-time (I mean the index could be defined from the Cloudant dashboard); I just want to query it, using this library, in the application.
I have read the documentation and I get the concepts.
Now I want to implement this with the best approach, and I will use that approach to handle all such scenarios in the future.
Sorry if the question is stupid :)
thanks.
CouchDB isn't designed to do exactly what you're asking. You'd need one index for Ingredient and another for Brand Name, and it isn't particularly performant to query both at once. The best approach, I think, is to check out the Mango query feature (http://docs.couchdb.org/en/2.0.0/api/database/find.html), try the queries you're interested in, and then add indexes as required (it has an explain plan to help make this more efficient).

Implementing a database -- How to get started

I've been trying to learn programming for a while. I've studied Java and Python, and I'm comfortable with their syntax. Recently, I wanted to use what I've learnt to build a tangible piece of software from the ground up.
I want to implement a database engine, a sort of NoSQL database. I've put together a small document, a kind of specification to follow throughout my adventure of coding it. But all I know is a bunch of keywords; I don't know where to start.
Can someone help me figure out how to gather the knowledge I need for this kind of work, and in what order to learn things? I have searched for documents, but I feel like I'll end up finding unrelated or erroneous content, or starting from the wrong point, because implementing a complete database engine is (or seems to be) a truly complicated task.
I want to make clear that I'd prefer theses, whitepapers and (e)books to the code of other projects, because when I've asked this kind of question before, the answers are usually of the form "read project X's source code". I'm not at the level of comfortably reading and understanding source code.
First, you may have a look at the answers to How to write a simple database engine. While it focuses on a SQL engine, there is still a lot of good material in those answers.
Otherwise, a good project tutorial is Implementation of a B-Tree Database Class. The example code is in C++, but the description of what is done and why is probably what you'll want to look at anyway.
Also, there is Designing and Implementing Structured Storage (Database Engine) over at MSDN. Plenty of information there to help you in your learning project.
Because the accepted answer only offers (good) links to other resources, I thought I'd share my experience writing webdb, a small experimental database for browsers. I also invite you to read the source code; it's pretty small, and you should be able to read through it and get a basic understanding of what it's doing in a couple of hours. Warning: I am a n00b at this, and since writing it I have learned a lot more and can see that I did some things wrong. It can help you get started, though.
The basics: BTree
I started out by adapting an AVL tree to suit my needs. An AVL tree is a kind of self-balancing binary search tree. You store the key K and its related data (if any) in a node, then all items with key < K in the left subtree and all items with key > K in the right subtree. You can use an array to store the data items in a node if you want to support non-unique keys.
This tree gives you the basics: Create, Update, Delete and a way to quickly get an item by key, or all items with key < x, or with key between x and y, etc. It can serve as the index for our table.
A schema
As a next step I wrote code that lets the client code define a schema, with methods like createTable(). Schemas are typically associated with SQL, but even NoSQL databases sort of have a schema; they usually require you to mark the ID field and any other fields you want to search on. You can make your schema as fancy as you want, but you typically want to model at least which column(s) serve as the primary key and which fields will be searched frequently and therefore need an index.
Creating a data structure to store a table
I decided to use the tree I created in the first step to store my items, which were simple JS objects. Having defined which field contains the PK, I could simply insert each item into the tree using that field's value as the key. This gives me quick lookup by ID (range).
Next I added another tree for every column that needs an index. In these trees I did not store the full record, only the key. So to fetch a customer by last name, I would first use the index on last name to get the ID, then the primary key index to get the actual record. The reason I did not just store a (reference to the) actual object is that it makes set operations a little simpler (see the next step).
Querying
Now that we have a table with indexes for PK and search fields, we can implement querying. I did not take this very far as it becomes complicated quickly, but you can get some nice functionality with just some basics. WebDB does not implement joins; all queries operate only on a single table. But once you understand this you see a pretty clear (though long and winding) path to doing joins and other complicated stuff as well.
In WebDB, to get all customers with firstName = 'John' and city = 'New York' (assuming those are two search fields), you would write something like:
var webDb = ...
var johnsFromNY = webDb.customers.get({
  firstName: 'John',
  city: 'New York'
})
To solve it, we first do two lookups: we get the set X of all IDs of customers named 'John' and we get the set Y of all IDs of customers from New York. We then perform an intersection on these two sets to get all IDs of customers that are both named 'John' AND from New York. We then run through our set of resulting IDs, getting the actual record for each one and adding it to the result array.
Using the set operators like union and intersection we can perform AND and OR searches. I only implemented AND.
Doing joins would (I think) involve creating temporary tables in memory, then populating them as the query runs with the joined results, then applying the query criteria to the temp table. I never got there. I attempted some syncing logic next but that was too ambitious and it went downhill from there :)

How can I standardize user-entered data?

I have a table of data that I'm trying to "standardize". The data entered into the table wasn't static or standardized (like with drop-down lists of answers), leaving me with multiple variations of answers where I want a static, universal answer.
For instance, let's say that there's a column in the database called "Type of pet". Because user input wasn't standardized, people could enter variations of a specific type of pet rather than a generalized form of the pet. So instead of just entering "Dog", there are different kinds of dogs like "Collie", "Mutt", "Labrador", etc.
How do I go about transcribing these answers into their generalized form -- replacing Collie/Mutt/Labrador/etc answers in the table with just "Dog" (or "Cat", or "Bird", etc.)?
I realize there needs to be some form of a manually-entered "translation" function. My gut reaction is that a long-spanning list of stacked if-statements would be inefficient, as well as being tedious to control and expand.
Is there some kind of process or system for doing something like this? Like some type of lookup table system/matrix?
I'm assuming a foreach loop to iterate through the array of records would be most appropriate. Then, within each iteration of the loop, it would compare the pet value against some kind of list (which I would have created manually) -- but what would you use for this lookup table/list, or for this step of the process? Would you have it as some type of SQL database/table, an array, a CSV file, etc.?
Then, once this comparison is completed and the "translated" equivalent of the type of pet is determined, the foreach loop would update that specific record, either overwriting the old non-standardized value, or perhaps just tacking the new standardized equivalent onto a new column (for later verification).
My gut reaction is that a long-spanning list of stacked if-statements would be inefficient, as well as being tedious to control and expand.
100% correct, and because of this you really only have one option: manually go through the database and clean it up. Once that is done, you will need to restrict user input using drop-down lists rather than raw text input.
Depending on your users you might want to look at how Stackoverflow does tags - essentially allowing anyone to do the cleanup for you.
But if you have something like 150,000 records, running an SQL find-and-replace query might help clean up the data to start.
This sounds like a data normalization project to me, though I don't have a lot of experience with it in practice. In theory, you start with how the data is entered. For instance, free-text fields allow users to enter anything they want; you'd want to change that after scrubbing the data. It also pays to know how the data got in in the first place. Was it free text, a bullet, a drop-down menu, etc.?
You'd also want to create a data dictionary of all the standardized terms that will replace the multitude of variations.
Then you could create an update query that goes through the old data and replaces it with the new values, using wildcards where needed.
https://support.office.com/en-us/article/Use-the-Find-and-Replace-dialog-box-to-change-data-2eee8d02-5a40-4328-ba56-ec0406865680
This could be a more automated way of scrubbing the data rather than find and replace too.
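To make the lookup-table idea concrete, here is a rough MySQL-style sketch; the table and column names (pets, pet_type, pet_type_lookup) are hypothetical, not the actual schema.

-- manually maintained translation table: raw entry -> standardized value
CREATE TABLE pet_type_lookup (
    raw_value    VARCHAR(100) NOT NULL PRIMARY KEY,
    standardized VARCHAR(100) NOT NULL
);

INSERT INTO pet_type_lookup (raw_value, standardized) VALUES
    ('Collie', 'Dog'), ('Mutt', 'Dog'), ('Labrador', 'Dog'),
    ('Siamese', 'Cat'), ('Parakeet', 'Bird');

-- write the standardized value into a separate column so the original
-- answer is kept for later verification
UPDATE pets p
JOIN pet_type_lookup l ON l.raw_value = p.pet_type
SET p.pet_type_standardized = l.standardized;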
-Al

SQL Server Full Text: Human names which sound alike

I have a database with lots of customers in it. A user of the system wants to be able to look up a customer's account by name, amongst other things.
What I have done is create a new table called CustomerFullText, which just has a CustomerId and an nvarchar(max) field "CustomerFullText". In "CustomerFullText" I keep all the text I have for the customer concatenated together, e.g. First Name, Last Name, Address, etc., and I have a full-text index on that field, so that the user can just type into a single search box and get matching results.
I found this gave better results than trying to search data stored in lots of different columns, although I suppose I'd be interested in hearing if this in itself is a terrible idea.
Many people have names which sound the same but which have different spellings: Katherine and Catherine and Catharine, and perhaps someone whose record in the database is Katherine but who introduces themselves as Kate. Also, McDonald vs MacDonald, Liz vs Elisabeth, and so on.
Therefore, what I'm doing is, whilst storing the original name correctly, making a series of replacements before I build the full text. So ALL of Katherine and Catherine and so on are replaced with "KATE" in the full-text field. I apply the same transform to my search parameter before I query the database, so someone who types "Catherine" into the search box will actually run a query for "KATE" against the full-text index in the database, which will match Catherine AND Katherine and so on.
My question is: does this duplicate any part of existing SQL Server Full Text functionality? I've had a look, but I don't think that this is the same as a custom stemmer or word breaker or similar.
Rather than trying to phonetically normalize your data yourself, I would use the Double Metaphone algorithm, essentially a much better implementation of the basic SOUNDEX idea.
You can find an example implementation here: http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=13574, and more are listed in the Wikipedia link above.
It will generate two normalized code versions of your word. You can then persist those in two additional columns and compare them against your search text, which you would convert to Double Metaphone on the fly.
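A rough T-SQL sketch of that approach, assuming you have installed a scalar UDF named dbo.DoubleMetaphone (for instance from the SQLTeam thread linked above) and using hypothetical Customer column names:

-- store the phonetic code alongside the original, correctly spelled name
ALTER TABLE Customer ADD FirstNameMetaphone VARCHAR(10) NULL;

UPDATE Customer
SET FirstNameMetaphone = dbo.DoubleMetaphone(FirstName);

-- at search time, convert the search term the same way and compare codes
DECLARE @search NVARCHAR(100) = N'Catherine';
SELECT CustomerId, FirstName, LastName
FROM Customer
WHERE FirstNameMetaphone = dbo.DoubleMetaphone(@search);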

Creating an efficient search capability using SQL Server (and/or coldfusion)

I am trying to visualize how to create a search for an application that we are building. I would like a suggestion on how to approach 'searching' through large sets of data.
For instance, this particular search would be on a table of at least 750k records of product SKUs, sizing, material type, create date, etc.
Is anyone aware of a 'plugin' solution for ColdFusion to do this? I envision a Google-like single-entry search where a customer can type in the part number, or the sizing, etc., and get hits on any or all relevant results.
Currently, if I run a 'LIKE' comparison query, it seems to take ages (OK, a few seconds, but still), and that is too long, at times making a user sit there and wait up to 10 seconds for queries and page loads.
Or are there any SQL techniques to help accomplish this? I want to use a proven method to search the data, not just a simple SQL LIKE or = comparison.
So this is a multi-part question: should I attack this at the SQL level (as it ultimately looks like I should), or is there a plug-in/module for ColdFusion that will give me speedy, advanced search capability?
You could try indexing your db records with a Verity (or Solr, if CF9) search.
I'm not sure it would be faster, and whether even trying it would be worthwhile depends a lot on how often you update the records you need to search. If you update them rarely, you could do a Verity index update whenever you update them. If you update the records constantly, that's going to be a drag on the web server and will certainly mitigate any possible gains in search speed.
I've never indexed a database via Verity, but I've indexed large collections of PDFs, Word Docs, etc, and I recall the search being pretty fast. I don't know if it will help your current situation, but it might be worth further research.
If your slowdown is specifically in searching textual fields (as I surmise from your mention of LIKE), the best solution is building an index table (not to be confused with the DB table indexes that are also part of this answer).
Build an index table mapping the unique ID of the records in your main table to a set of words (one word per row) from the textual field. If it matters, add the field of origin as a third column in the index table, and if you want "relevance" features you may want to store a word count as well.
Populate the index table either with a trigger (using string splitting) or from your app - the latter is probably better: simply call a stored proc with both the actual data to insert/update and the list of words already split up.
This will immediately and drastically speed up textual search, as it no longer uses "LIKE", AND it can use indexes on the index table (no pun intended) without interfering with the indexes on SKU and the like on the main table.
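A minimal sketch of such an index table and a query against it; the table and column names here are assumptions, not your actual schema.

CREATE TABLE ProductWordIndex (
    ProductId   INT          NOT NULL,  -- ID from the main product table
    Word        VARCHAR(100) NOT NULL,  -- one word per row from the textual fields
    SourceField VARCHAR(50)  NULL,      -- optional: which column the word came from
    WordCount   INT          NOT NULL DEFAULT 1
);
CREATE INDEX IX_ProductWordIndex_Word ON ProductWordIndex (Word, ProductId);

-- an indexed equality match replaces LIKE '%term%' against the main table
SELECT p.*
FROM Product p
JOIN ProductWordIndex w ON w.ProductId = p.ProductId
WHERE w.Word = 'widget';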
Also, ensure that all the relevant fields are indexed fully - not necessarily in the same compound index (SKU, sizing, etc.) - and any field that is searched as a range (sizing or date) is a good candidate for a clustered index (as long as the records are inserted in approximately increasing order of that field, or you don't care as much about insert/update speed).
For anything more detailed, you will need to post your table structure, existing indexes, the queries that are slow, and the query plans you currently get for those slow queries.
Another item is to ensure that as few fields as possible are textual, especially ones that are "decodable" - your comment mentioned "is it boxed" among the text fields. If so, I assume the values are "yes"/"no" or some other very limited set; in that case, simply store a numeric code for the valid values, do the en/de-coding in your app, and search by the numeric code. Not a tremendous speed improvement, but still an improvement.
I've done this using SQL Server's full-text indexes. It requires only minor application changes and no changes to the database schema except for the addition of the full-text index.
First, add the full-text index to the table. Include in the full-text index all of the columns the search should run against. I'd also recommend having the index auto-update; this shouldn't be a problem unless your SQL Server is already highly taxed.
Second, to do the actual search, you need to convert your query to use a full-text search. The first step is to convert the search string into a full-text search string. I do this by splitting the search string into words (using the Split method) and then building a search string formatted as:
"Word1*" AND "Word2*" AND "Word3*"
The double-quotes are critical; they tell the full text index where the words begin and end.
Next, to actually execute the full text search, use the ContainsTable command in your query:
SELECT *
FROM CONTAINSTABLE(Bugs, *, '"Word1*" AND "Word2*" AND "Word3*"')
This will return two columns:
Key - the column identified as the primary key of the full-text index
Rank - a relative rank of the match (1 - 1000, with a higher ranking meaning a better match).
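To get the matching rows themselves, you typically join the result back to the source table on that key. A sketch, assuming Bugs has an integer primary key column named BugId:

SELECT b.*, ft.[RANK]
FROM CONTAINSTABLE(Bugs, *, '"Word1*" AND "Word2*" AND "Word3*"') AS ft
JOIN Bugs b ON b.BugId = ft.[KEY]
ORDER BY ft.[RANK] DESC;  -- best matches first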
I've used approaches similar to this many times and I've had good luck with it.
If you want a truly plug-in solution then you should just go with Google itself. It sounds like you're doing some kind of e-commerce or commercial site (given the use of the term 'SKU'), so you probably have a catalog of some kind with product pages. If you have consistent markup then you can configure a Google appliance or service to do exactly what you want. It will send a bot in to index your pages and find your fields. No SQL, little coding; it will not be dependent on your database, or even on ColdFusion. It will also be quite fast and familiar to customers.
I was able to do this with a ColdFusion site in about 6 hours, done! The only thing to watch out for is that Google's index is limited to what the bot can see, so if you have a situation where you want to limit access based on a user's role, permissions or group, then it may not be the solution for you (although you can configure a permission service for Google to check with).
Because SQL Server is where your data lives, that is where your search performance issue is likely to be. Make sure you have indexes on the columns you are searching on, and be aware that a LIKE with a leading wildcard cannot use an index: SELECT * FROM TABLEX WHERE last_name LIKE '%FR%'
But it can use an index if you write it like this: SELECT * FROM TABLEX WHERE last_name LIKE 'FR%'. The key here is to have as many of the leading characters as possible not be wildcards.
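As a small sketch (the index and table names are assumptions):

-- index on the searched column
CREATE INDEX IX_TABLEX_last_name ON TABLEX (last_name);

SELECT * FROM TABLEX WHERE last_name LIKE 'FR%';   -- can seek on the index
SELECT * FROM TABLEX WHERE last_name LIKE '%FR%';  -- forces a scan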
Here is a link to a site with some general tips. https://web.archive.org/web/1/http://blogs.techrepublic%2ecom%2ecom/datacenter/?p=173
