Filtering Functionality Similar to Ebay SQL Count Issue - sql-server

I am stuck on a database problem for a client, wandering if someone could help me out. I am currently trying to implement filtering functionality so that a user can filter results after they have searched for something. We are using SQL Server 2008. I am working on an electronics e-commerce site and the database is quite large (500,000 plus records). The scenario is this - user goes to our website and types in 'laptop' and clicks search. This brings up the first page of several thousand results. What I want to do is then
filter these results further and present the user with options such as:
Filter By Manufacturer
Dell (10,000)
Acer (2,000)
Lenovo (6,000)
Filter By Colour
Black (7000)
Silver (2000)
The main columns of the database are like this - the primary key is an integer ID
ID Title Manufacturer Colour
The key part of the question is how to get the counts in various categories in an efficient manner. The only way I currently know how to do it is with separate queries. However, should we wish to filter by further categories then this will become very slow - especially as the database grows. My current SQL is this:
select count(*) as ManufacturerCount, Manufacturer from [ProductDB.Product] GROUP BY Manufacturer;
select count(*) as ColourCount, Colour from [ProductDB.Product] GROUP BY Colour;
My question is if I can get the results as a single table using some-kind of join or union and if this would be faster than my current method of issuing multiple queries with the Count(*) function. Thanks for your help, if you require any further information please ask. PS I am wandering how on sites like ebay and amazon manage to do this so fast. In order to understand my problem better if you go onto ebay and type in laptop you will
see a number of filters on the left - this is basically what I am trying to achieve. I don't know how it can be done efficiently when there are many filters. E.g to get functionality equivalent to Ebay I would need about 10 queries and I'm sure that will be slow. I was thinking of creating an intermediate table with all the counts however the intermediate table would have to be continuously updated in order to reflect changes to the database and that would be a problem if there are multiple updates per minute. Thanks.

The "intermediate table" is exactly the way to go. I can guarantee you that no e-commerce site with substantial traffic and large number of products would do what you are suggesting on the fly at every inquiry.
If you are worried about keeping track of changes to products, just do all changes to the product catalog thru stored procs (my preferred method) or else use triggers.
One complication is how you will group things in the intermediate table. If you are only grouping on pre-defined categories and sub-categories that are built into the product hierarchy, then it's fairly easy. It sounds like you are allowing free-text search... if so, how will you manage multiple keywords that result in an unexpected intersection of different categories? One way is to save the keywords searched along with the counts and a time stamp. Then, the next time someone searches on the same keywords, check the intermediate table and if the time stamp is older than some predetermined threshold (say, 5 minutes), return your results to a temp table, query the category counts from the temp table, overwrite the previous counts with the new time stamp, and return the whole enchilada to the web app. Otherwise, skip the temp table and just return the pre-aggregated counts and data records. In this case, you might get some quirky front-end count behavior, like it might say "10 results" in a particular category but then when the user drills down, they actually find 9 or 11. It's happened to me on different sites as a customer and it's really not a big deal.
BTW, I used to work for a well-known e-commerce company and we did things like this.

Related

SQL save different versions of same product in table

I have a products table in postgres that stores all the data I need. What I need to work out the best way to do:
Each product has a different status on the way to completion - machined, painted, assembled, etc. For each status there is a letter that changes in the product id.
What would the most efficient way of saving the data? For each status of the product should there be 'another product' in the table? Or would doing join tables somewhere work?
Example:
111a1 for machined
111b1 for painted
Yet these are the same end product, just at different stages ...
It depends on what you want to be efficient: storage, ingestion, queries, maintainability...
Joins would work - you can join on some substring of the product id, so you need not have separate products for every stage of production.
But maintainability of your code is really important. Businesses change - perhaps to include sub-assemblies.
You might want to re-think this scheme of altering the product id to show the status. Product ID and work flow state are orthogonal concepts. So you probably want to have them in separate fields. You'll probably write far less code that way. The alternative will be becoming really well acquainted with substr() (depending on your SQL dialect), and all sorts of duplications elsewhere.

Finding unique products (never seen before by a user) in a datastore sorted by a dynamically changing value (i.e. product rating)

been trying to solve this problem for a week and couldn't come up with any solutions in all my research so I thought I'd ask you all.
I have a "Product" table and a "productSent" table, here's a quick scheme to help explain:
class Product(ndb.Model):
name = ndb.StringProperty();
rating = ndb.IntegerProperty
class productSent(ndb.Model): <--- the key name here is md5(Product Key+UUID)
pId = ndb.KeyProperty(kind=Product)
uuId = ndb.KeyProperty(kind=userData)
action = ndb.StringProperty()
date = ndb.DateTimeProperty(auto_now_add=True)
My goal is to show users the highest rated product that they've never seen before--fast. So to keep track of the products users have seen, I use the productSent table. I created this table instead of using Cursors because every time the rating order changes, there's a possibility that the cursor skips the new higher ranking product. An example: assume the user has seen products 1-24 in the db. Next, 5 users liked product #25, making it the #10 product in the database--I'm worried that the product will never be shown again to the user (and possibly mess things up on a higher scale).
The problem with the way I'm doing it right now is that, once the user has blown past the first 1,000 products, it really starts slowing down the query performance. Because I'm literally pulling 1,000+ results, checking if they've been sent by querying against the productSent table (doing a keyName lookup to speed things up) and going through the loop until 15 new ones have been detected.
One solution I thought of was to add a repeated property (listProperty) to the Product table of all the users who have seen a product. Or if I don't want to have inequality filters I could put a repeated property of all the users who haven't seen a product. That way when I query I can dynamically take those out. But I'm afraid of what happens when I have 1,000+ users:
a) I'll go through the roof on the limit of repeated properties in one entity.
b) The index size will increase size costs
Has anyone dealt with this problem before (I'm sure someone has!) Any tips on the best way to structure it?
update
Okay, so had another idea. In order to minimize the changes that take place when a rating (number of likes) changes, I could have a secondary column that only has 3 possible values: positive, neutral, negative. And sort by that? Ofcourse for items that have a rating of 0 and get a 'like' (making them a positive) would still have a chance of being out of order or skipped by the cursor--but it'd be less likely. What do y'all think?
Sounds like the inverse, productNotSent would work well here. Every time you add a new product, you would add a new productNotSent entity for each user. When the user wants to see the highest rated product they have not seen, you will only have to query over the productNotSent entities that match that user. If you put the rating directly on the productNotSent you could speed the query up even more, since you will only have to query against one Model.
Another idea would be to limit the number of productNotSent entities per user. So each user only has ~100 of these entities at a time. This would mean your query would be constant for each user, regardless of the number of products or users you have. The creation of new productNotSent entities would become more complex, though. You'd have to have a cron job or something that "tops up" a user's collection of productNotSent entities when they use some up. You also may want to double-check that products rated higher than those already within the user's set of productNotSent entities get pushed in there. These are a little more difficult and well require some design trade-offs.
Hope this helps!
I do not know your expected volumes and exact issues (only did a quick perusal of your question), but you may consider using Json TextProperty storage as part of your plan. Create dictionaries/lists and store them in records by json.dump()ing them to a TextProperty. When the client calls, simply send the TextProperties to the client, and figure everything out on the client side once you JSON.parse() them. We have done some very large array/object processing in JS this way, and it is very fast (particularly indexed arrays). When the user clicks on something, send a transaction back to update their record. Set up some pull or push queue processes to handle your overall product listing updates, major customer rec updates, etc.
One downside is higher bandwidth going out of you app, but I think this cost will be minimal given potential processing savings on GAE. If you structure this right, you may be able to use get_by_id() to replace all or most of your planned indices and queries. We have found json.loads() and json.dumps() to be very fast inside the app, but we only use simple dictionary/list structures.This approach will be, though, a big, big quantum measure lower than your planned use of queries. The other potential issue is that very large objects may run into soft memory limits. Be sure that your Json objects are fairly simple+lightweight to avoid this (e.g. do no include product description, sub-objects, etc. in the Json item, just the basics such as product number). HTH, -stevep

How to best combine data from key-value stores and databases

Let's assume we have a friend list table for a social network.
Most use cases will require the friend list table to be JOINed to another table where you hold the personal details, such as: Name, Age, City, Profile picture URL, Last login time, etc...
Once the friend list table is in the 100M rows range. Querying a JOIN like this can take a few seconds. If you introduce a few other WHERE conditions it can even be slower.
A key-value store systems can bring in the friend list very quickly.
Let's assume we would like to show the 10 most recently logged in friends of a user.
What is the best way to calculate this output? A few methods I've been thinking about are below. Do any of them make sense?
Shall we keep all data in the key-value store environment? Update the
key-value store with every new login?
Or shall we pull the friend list id's first. Then use a database command like "IN()" and query the database?
Merge the data at the client level? A javascript solution?
In your Users table you have a field to save a timestamp for last login. In your table were the friend-relationships are stored you have 1 row per relationship and that makes the table really long.
So joining these tables seems bad and we should optimize this process somehow? The answer is: No, not necessarily. The people who construct a DBMS have the same problems as you and they implement the tools to solve them. Every DBMS has some sort of query optimization which is smarter than you and me.
So there's no shame in joining long tables. If you want to try to optimize you may:
Get the IDs of the friends of the user.
Get the information you want of the first 10 friends sorted by last_login desc where the id fits (and other where conditions).
You don't need to join the tables, but you will use two queries, so maybe if your DBMS is smart a join is faster (Maybe run a test).
If you want to, you can use ajax to load this data after the page was loaded, this improve the experience for the user, but the traffic on the DB will be the same.
I hope this helped.
Edit: Oh yeah, if you already knew the friends IDs (you need them for other stuff) you wouldn't even need a join. You can pass the IDs over to the javascript which loads the last login list later via AJAX.

Django: efficient database search

I need an efficient way to search through my models to find a specific User, here's a list,
User - list of users, their names, etc.
Events - table of events for all users, on when they're not available
Skills - many-to-many relationship with the User, a User could have a lot of skills
Contracts - many-to-one with User, a User could work on multiple contracts, each with a rating (if completed)
... etc.
So I got a lot of tables linked to the User table. I need to search for a set of users fitting certain criteria; for example, he's available from next Thurs through Fri, has x/y/z skills, and has received an average 4 rating on all his completed contracts.
Is there some way to do this search efficiently while minimizing the # of times I hit the database? Sorry if this is a very newb question.
Thanks!
Not sure if this method will solve you issue for all 4 cases, but at least it should help you out in the first one - querying users data efficiently.
I usually find using values or values_list query function faster because it slims down the SELECT part of the actual SQL, and therefore you will get results faster. Django docs regarding this.
Also worth mentioning that starting with new dev version within values and values_list you can query any type of relationship, including many_to_one.
And finally you might find in_bulk also useful. If I do a complex query, you might try to query the ids first of some models using values or values_list and then use in_bulk to get the model instances faster. Django docs about that.

Creating an efficient search capability using SQL Server (and/or coldfusion)

I am trying to visualize how to create a search for an application that we are building. I would like a suggestion on how to approach 'searching' through large sets of data.
For instance, this particular search would be on a 750k record minimum table, of product sku's, sizing, material type, create date, etc;
Is anyone aware of a 'plugin' solution for Coldfusion to do this? I envision a google like single entry search where a customer can type in the part number, or the sizing, etc, and get hits on any or all relevant results.
Currently if I run a 'LIKE' comparison query, it seems to take ages (ok a few seconds, but still), and it is too long. At times making a user sit there and wait up to 10 seconds for queries & page loads.
Or are there any SQL formulas to help accomplish this? I want to use a proven method to search the data, not just a simple SQL like or = comparison operation.
So this is a multi-approach question, should I attack this at the SQL level (as it ultimately looks to be) or is there a plug in/module for ColdFusion that I can grab that will give me speedy, advanced search capability.
You could try indexing your db records with a Verity (or Solr, if CF9) search.
I'm not sure it would be faster, and whether even trying it would be worthwhile would depend a lot on how often you update the records you need to search. If you update them rarely, you could do an Verity Index update whenever you update them. If you update the records constantly, that's going to be a drag on the webserver, and certainly mitigate any possible gains in search speed.
I've never indexed a database via Verity, but I've indexed large collections of PDFs, Word Docs, etc, and I recall the search being pretty fast. I don't know if it will help your current situation, but it might be worth further research.
If your slowdown is specifically the search of textual fields (as I surmise from your mentioning of LIKE), the best solution is building an index table (not to be confiused with DB table indexes that are also part of the answer).
Build an index table mapping the unique ID of your records from main table to a set of words (1 word per row) of the textual field. If it matters, add the field of origin as a 3rd column in the index table, and if you want "relevance" features you may want to consider word count.
Populate the index table with either a trigger (using splitting) or from your app - the latter might be better, simply call a stored proc with both the actual data to insert/update and the list of words already split up.
This will immediately drastically speed up textual search as it will no longer do "LIKE", AND will be able to use indexes on index table (no pun intended) without interfering with indexing on SKU and the like on the main table.
Also, ensure that all the relevant fields are indexed fully - not necessarily in the same compund index (SKU, sizing etc...), and any field that is searched as a range field (sizing or date) is a good candidate for a clustered index (as long as the records are inserted in approximate order of that field's increase or you don't care about insert/update speed as much).
For anything mode detailed, you will need to post your table structure, existing indexes, the queries that are slow and the query plans you have now for those slow queries.
Another item is to enure that as little of the fields are textual as possible, especially ones that are "decodable" - your comment mentioned "is it boxed" in the text fields set. If so, I assume the values are "yes"/"no" or some other very limited data set. If so, simply store a numeric code for valid values and do en/de-coding in your app, and search by the numeric code. Not a tremendous speed improvement but still an improvement.
I've done this using SQL's full text indexes. This will require very application changes and no changes to the database schema except for the addition of the full text index.
First, add the Full Text index to the table. Include in the full text index all of the columns the search should perform against. I'd also recommend having the index auto update; this shouldn't be a problem unless your SQL Server is already being highly taxed.
Second, to do the actual search, you need to convert your query to use a full text search. The first step is to convert the search string into a full text search string. I do this by splitting the search string into words (using the Split method) and then building a search string formatted as:
"Word1*" AND "Word2*" AND "Word3*"
The double-quotes are critical; they tell the full text index where the words begin and end.
Next, to actually execute the full text search, use the ContainsTable command in your query:
SELECT *
from containstable(Bugs, *, '"Word1*" AND "Word2*" AND "Word3*"')
This will return two columns:
Key - The column identified as the primary key of the full text search
Rank - A relative rank of the match (1 - 1000 with a higher ranking meaning a better match).
I've used approaches similar to this many times and I've had good luck with it.
If you want a truly plug-in solution then you should just go with Google itself. It sounds like your doing some kind of e-commerce or commercial site (given the use of the term 'SKU'), So you probably have a catalog of some kind with product pages. If you have consistent markup then you can configure a google appliance or service to do exactly what you want. It will send a bot in to index your pages and find your fields. No SQl, little coding, it will not be dependent on your database, or even coldfusion. It will also be quite fast and familiar to customers.
I was able to do this with a coldfusion site in about 6 hours, done! The only thing to watch out for is that google's index is limited to what the bot can see, so if you have a situation where you want to limit access based on a users role or permissions or group, then it may not be the solution for you (although you can configure a permission service for Google to check with)
Because SQL Server is where your data is that is where your search performance is going to be a possible issue. Make sure you have indexes on the columns you are searching on and if using a like you can't use and index if you do this SELECT * FROM TABLEX WHERE last_name LIKE '%FR%'
But it can use an index if you do it like this SELECT * FROM TABLEX WHERE last_name LIKE 'FR%'. The key here is to allow as many of the first characters to not be wild cards.
Here is a link to a site with some general tips. https://web.archive.org/web/1/http://blogs.techrepublic%2ecom%2ecom/datacenter/?p=173

Resources