Improving performance for string matching - database

I am working for a startup that is building an iPhone app, and I would like to ask a few questions to improve an algorithm we use for string matching.
We have a database with a huge list of phone numbers along with the name of the user who owns each number. Let's say the database looks like this:
name phonenum
hari 1234
abc 3873
....
This database has a large number of rows (around 1 million). When the user opens the app, the app takes the list of phone numbers from the person's contacts and matches it against the database. We return all the phone numbers that are present in the database. Right now, what we do is very inefficient: we send the phone numbers from the contacts in batches of 20 and match each batch against the database, which gives a complexity of roughly O(number of contacts × n).
I thought of some improvements, like keeping the database rows sorted by phone number so that we can do a binary search. In addition, we could keep a hash table of some 10,000 phone numbers in an in-memory cache and check that cache first. Only on a miss would we go to the database and search it with O(log n) complexity using binary search.
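Roughly what I have in mind, sketched in Python (a plain list stands in for the sorted database column, and the numbers are made up):
import bisect

# Check a small in-memory cache of hot numbers first, then fall back to a
# binary search over the sorted numbers.
hot_cache = {"1234", "3873"}                 # the ~10,000 hottest numbers
sorted_numbers = ["1234", "2222", "3873"]    # sorted copy of the DB column

def is_known(phone):
    if phone in hot_cache:                   # O(1) cache hit
        return True
    i = bisect.bisect_left(sorted_numbers, phone)
    return i < len(sorted_numbers) and sorted_numbers[i] == phone  # O(log n)

print(is_known("2222"), is_known("9999"))    # True False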
Also, there is the question of how to send the phone numbers for matching: do I send them as-is or as hashed values? Will that matter in terms of performance?
Is there any other way of doing this thing?
I explained the whole scenario so that you have a better understanding of my needs.
thanks

If you already have an SQL Server database, let it take care of that. Create an index on the phone number column (if you don't have one already), send all the numbers in the contact list in one go (no need to split them into batches of 20), and match them against the database. SQL Server's indexing is probably much better than anything you could come up with, so it's going to be pretty fast.
Alternatively, you can try to insert the numbers into a temporary table and query against that, but I have no idea whether that would be faster.
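Either way, the shape of it is something like this minimal sketch (Python's sqlite3 stands in for SQL Server; table and column names are made up):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, phonenum TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("hari", "1234"), ("abc", "3873")])
# The index on the phone number column is what makes the lookups cheap.
conn.execute("CREATE INDEX idx_users_phonenum ON users (phonenum)")

def match_contacts(contacts):
    # One round trip for the whole contact list instead of batches of 20.
    placeholders = ",".join("?" * len(contacts))
    sql = "SELECT name, phonenum FROM users WHERE phonenum IN (%s)" % placeholders
    return conn.execute(sql, contacts).fetchall()

print(match_contacts(["1234", "9999"]))   # -> [('hari', '1234')]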

If you can represent phone numbers as numeric values instead of strings, then you can put an index on your database field that will make lookup operations very fast. Even if you have to represent them as strings, an index on the database field will make looking up values fast enough to be a non-issue in the grand scheme of things.
Your biggest performance problem is going to be with all the round trips between the application and your database. That is a performance bottleneck with any web-enabled program. If you are unlikely to have a high rate of success (maybe 2% of the user's contacts are in your database), you'll probably be better off sending the whole list of phone numbers at once, since you'll just be getting data back for a few of them.
If the purpose is to update the user's contact data with the data found in your database, you could create a hash out of the appropriate fields and send that along with the phone number. Have the database keep a hash of those fields on its side and do a comparison. If the hash matches, then you don't have to send any data back because the local and remote versions are the same.
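For example, a rough sketch in Python (the field list and the choice of SHA-256 are just illustrative):
import hashlib

def contact_fingerprint(name, phonenum, email):
    # Both sides must build the string the same way for the hashes to match.
    blob = "|".join([name.strip().lower(), phonenum, email.strip().lower()])
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

# The app sends (phonenum, fingerprint); the server compares it with the
# fingerprint stored next to its own record and only sends data back when
# the two differ.
print(contact_fingerprint("Hari", "1234", "hari@example.com"))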
A successful caching strategy would require a good understanding of how the data will be used, so I can't provide much guidance based on the information given. For example, if 90% of the phones using your app will have all of their matches fall within a small group of numbers in the database, then by all means put that small group into a hash table. But if users are likely to have phone numbers that aren't in that small group, you're going to have to do a database round trip. The key will be to construct a query that allows the database to return all of the data you need in one trip.

I'd split the phone number up into three parts,
for example 777.777.7777.
Each section can be stored as an int and used as a hash key.
This would mean that your data store becomes a series of hash tables.
Or you could force the whole number into an int and use that as your hash key, but for fast results you'd need more buckets.
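A tiny Python sketch of that idea (the dotted format and the two-level split are just illustrative):
from collections import defaultdict

store = defaultdict(dict)   # the data store becomes a series of hash tables

def add(number, name):
    area, prefix, line = (int(p) for p in number.split("."))
    store[(area, prefix)][line] = name

def lookup(number):
    area, prefix, line = (int(p) for p in number.split("."))
    return store.get((area, prefix), {}).get(line)

add("777.777.7777", "hari")
print(lookup("777.777.7777"))   # -> hari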
Cheers

Related

Firebase - Efficiently query users by geolocation + other filters

I am working on a Flutter app (similar to dating apps with swipeable cards) that shows one user at a time.
I have a Firestore collection called users from which I'd like to query the nearby users first (based on the current user's location). Every user has a GeoPoint (long + lat) and a geohash (8 characters) saved.
What I'd like:
Query the closest users first (20 or 30 users per query call) since I don't want the user to wait a long time.
The area in which users are queried grows bigger each time the web service is called again (since the close-by users have already been shown)
There are a few filters that are applied to the query like: only get users of a certain age group, or certain gender, etc.
I thought about Duncan Campbell's solution of having multiple variants of the GeoHash per user. For instance, every user would have a GeoHash of 2 characters, 3 characters, and 4 characters. And I'd search with a Where In clause containing the neighboring geohashes or the parents of the neighboring geohashes which should get me the users quickly.
However, this won't work for my purpose. Let's say we got all the close-by users by using the neighboring geohash4 (4 characters). Once done, if the user keeps swiping, we have to search with a larger-area geohash3 (3 characters), which will contain almost all the users that were searched previously. Yes, I can filter them afterward, but that is not efficient (and costly). Also, if we keep going, we'd reach geohash2, but then, in order to get the rest of the users, I'd have to search with a where-not-in clause containing all the neighboring geohash2 areas that we already searched. I do understand that this is a huge area to cover, but with the data that I currently have, users will cover it really quickly.
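Roughly what I mean, sketched in Python with an in-memory list standing in for Firestore (IDs and geohashes are made up); without tracking what has already been shown, the coarser prefixes keep returning the same users:
users = [
    {"id": 1, "geohash": "9q8yyk8y"},
    {"id": 2, "geohash": "9q8yzabc"},
    {"id": 3, "geohash": "9q9p1234"},
]

def next_batch(my_geohash, seen_ids, precision, limit=30):
    prefix = my_geohash[:precision]
    matches = [u for u in users
               if u["geohash"].startswith(prefix) and u["id"] not in seen_ids]
    return matches[:limit]

seen = set()
for precision in (4, 3, 2):          # widen the area as the user keeps swiping
    for user in next_batch("9q8yyk8y", seen, precision):
        seen.add(user["id"])         # filtering already-shown users client-side

print(sorted(seen))                  # -> [1, 2, 3]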
I also think that GeoFire doesn't do the job for me since it queries all the users in a given area at once which doesn't satisfy my requirements (as discussed in the linked article).
For me, it sounds like this solution is a bit complicated, and so I am seeking suggestions for making this more efficient and effective. Any ideas on how to tackle this issue?

Google App Engine sort costs for videogame's highscore tables

I'm considering creating my own GAE app to track players' highscores in my videogames. I've already created a simple app that allows me to send and recover Top 10 highscores (that is, just 10 scores are stored per game), but now I'm considering costs if things grow.
Say a game has thousands or millions of players (hehe, not mine). I've seen how applications like OpenFeint are able to sort your score and tell your exact rank in a highscore table with thousands of entries. You may be #19623, for example.
In order to keep things simple, I would create Top 100 score tables. But what if I truly wanted to store all scores and keep things sorted? Does it make sense to simply sort scores as they are queried from the database? I don't think so...
How are such applications implemented?
On GAE it's easy to return sorted queries as long as you index your fields. If your goal is just to find the top 100 scores, you can do an ordered query by score for 100 entities - you will get them in order.
https://developers.google.com/appengine/docs/python/datastore/queryclass#Query_order
The harder part is assigning the rank number. For the top 100, you'd basically go through the returned list of 100 entities and print a number beside each of them.
If you need to find a user at a particular rank, you can use a cursor to narrow your search to, say, whoever is at rank #19623.
What you won't be able to do efficiently with this is figure out the rank of a single entity. In order to figure out rankings using the built-in index, you'd have to query for all entities and find where that individual entity is in the list.
The laziest way to do the ranking would be something like: search for the top 100; if the user is in there, show their ranking; if not, tell them they are > 100. Another possibility is to occasionally do large queries to get score ranges, store those, and then give the user a less accurate answer (you are in the top 500, top 1000, etc.) without having the exact place.
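A minimal sketch of the ordered query, assuming the Python runtime with the ndb library and a hypothetical Score model:
from google.appengine.ext import ndb

class Score(ndb.Model):
    player = ndb.StringProperty()
    points = ndb.IntegerProperty()   # indexed by default, so ordering is cheap

def top_100():
    # One ordered query; the rank is just the position in the result list.
    entries = Score.query().order(-Score.points).fetch(100)
    return [(rank + 1, e.player, e.points) for rank, e in enumerate(entries)]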
Standard database indexing - both on App Engine and elsewhere - doesn't provide an efficient way to find the rank of a row/entity. One option is to go through the database at regular intervals and update the current rank. If you want ranks to be updated immediately, however, a tree-based solution is better. One is provided for App Engine in the app-engine-ranklist project.
We had the same problem with TyprX typing races (GWT + App Engine). The way we did it without going through millions of rows is to store high scores like this:
class User {
    // Period the stored bests refer to (updated as the player finishes each game)
    Integer day, month, year;
    Integer highscoreOfTheDay;
    Integer highscoreOfMonth;
    Integer highscoreOfTheYear;
}
Doing so, you can get a sorted list of daily, monthly, or yearly high scores with one query. The key is to update each user's record with their own best score for each period as they finish their games.
Then we save the result to memcache, and voilà.
Daniel
I'd think about using exception processing. How many of the thousands of results each day/hour will be a top-100 score? Keep a min/max top-100 range entity (memcached, of course). Each score that comes in goes one direction if it is within the range, and another direction (task queue?) if not. Why not shunt the 99% of non-relevant work to another process, and only deal with 100+1 records in whatever your final setup might be for changing the rankings?
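A rough sketch of that flow (names and the cutoff value are made up):
TOP_100_MIN = 4200   # cached cutoff, normally kept in memcache

def handle_score(score, rankings, backlog):
    # rankings: the small top-100 structure; backlog stands in for a task queue
    if score >= TOP_100_MIN:
        rankings.append(score)        # only ever deal with 100+1 records here
        rankings.sort(reverse=True)
        del rankings[100:]
    else:
        backlog.append(score)         # shunt the other 99% to another process

rankings, backlog = [], []
handle_score(9001, rankings, backlog)
handle_score(12, rankings, backlog)
print(rankings, backlog)              # -> [9001] [12]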

Google App Engine storing as list vs JSON

I have a model called User, and a user has a property relatedUsers, which, in its general format, is an array of integers. Now, there will be times when I want to check if a certain number exists in a User's relatedUsers array. I see two ways of doing this:
Use a standard Python list with indexed values (or maybe not) and just run an IN query and see if that number is in there.
Having the key to that User, get back the value for property relatedUsers, which is an array in JSON string format. Decode the string, and check if the number is in there.
Which one is more efficient? Would option 1 cost more reads than option 2? And would option 1's writes cost more than option 2's, since indexing each value costs a write? What if I don't index -- which solution would be better then?
Here are your costs vs. capabilities, option by option:
Option 1: Putting the values in an indexed list will be far more expensive. You will incur the cost of one write for each value in the list, which can explode depending on how many friends your users have. It's possible for this cost explosion to be worse if you have certain kinds of composite indexes. The good side is that you get to run queries on this information: you can query for a list of users who are friends with a particular user, for example.
Option 2: No extra index or write costs here. The problem is that you lose the querying functionality.
If you know that you're only going to be doing checks on the current user's list of friends, by all means go with option 2. Otherwise you might have to look at your design a little more carefully.
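A minimal sketch of the two layouts, assuming the Python ndb client (property names are illustrative):
import json
from google.appengine.ext import ndb

class User(ndb.Model):
    # Option 1: indexed repeated property -- one index write per value, but
    # queryable (equality on a repeated property means "contains").
    related_users = ndb.IntegerProperty(repeated=True)
    # Option 2: JSON blob -- no index writes, but no querying either.
    related_users_json = ndb.TextProperty()

# Option 1: the datastore does the membership check.
query = User.query(User.related_users == 42)

# Option 2: fetch the entity and do the check in application code.
def is_related(user, other_id):
    return other_id in json.loads(user.related_users_json or "[]")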

Flagging possible identical users in an account management system

I am working on a possible architecture for an abuse detection mechanism on an account management system. What I want is to detect possible duplicate users based on certain correlating fields within a table. To keep the problem simple, let's say I have a USER table with the following fields:
Name
Nationality
Current Address
Login
Interests
It is quite possible that one user has created multiple records within this table. There might be a certain pattern in which this user has created his/her accounts. What would it take to mine this table to flag records that may be possible duplicates? Another concern is scale. If we have, let's say, a million users, taking one user and matching it against the remaining users is computationally unrealistic. What if these records are distributed across various machines in various geographic locations?
What are some of the techniques, that I can use, to solve this problem? I have tried to pose this question in a technologically agnostic manner with the hopes that people can provide me with multiple perspectives.
Thanks
The answer really depends upon how you model your users and what constitutes a duplicate.
There could be a user that uses names from all the Harry Potter characters. Good luck finding that pattern :)
If you are looking for records that are approximately similar, try this simple approach:
Hash each word (shingle) in the record and keep the minimum hash value. Do this for k different hash functions and concatenate these min-hashes. What you have is a near-duplicate signature.
To be clear, let's say a record has words w_1, ..., w_n and your hash functions are h_1, ..., h_k.
Let m_i = min_j h_i(w_j),
and the signature is S = m_1 . m_2 . ... . m_k (the min-hashes concatenated).
The cool thing about this signature is that if two documents contain 90% of the same words, then each min-hash has roughly a 90% chance of matching, so there is a good chance that the two signatures will be identical. Hence, instead of looking for near duplicates, you look for exact duplicates among the signatures. If you want to increase the number of matches, decrease k; if you are getting too many false positives, increase k.
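A small Python sketch of the signature (a salted MD5 stands in for the k hash functions; each word is treated as a shingle):
import hashlib

def minhash_signature(text, k=4):
    # One min-hash per simulated hash function; salting MD5 with the
    # function index i gives k different hash functions.
    words = set(text.lower().split())
    mins = [min(int(hashlib.md5(("%d:%s" % (i, w)).encode()).hexdigest(), 16)
                for w in words)
            for i in range(k)]
    return ".".join(str(m) for m in mins)

# Records with mostly the same words tend to collide on the exact
# signature, so you group by signature instead of comparing every pair.
a = minhash_signature("harry potter fan from london interested in chess")
b = minhash_signature("harry potter fan from london interested in chess and go")
print(a == b)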
Of course, there is also the approach of using implicit features of users, such as their IP addresses, cookies, etc.

Creating an efficient search capability using SQL Server (and/or coldfusion)

I am trying to visualize how to create a search for an application that we are building. I would like a suggestion on how to approach 'searching' through large sets of data.
For instance, this particular search would be on a table with a minimum of 750k records, containing product SKUs, sizing, material type, create date, etc.
Is anyone aware of a 'plugin' solution for ColdFusion to do this? I envision a Google-like single-entry search where a customer can type in the part number, the sizing, etc., and get hits on any or all relevant results.
Currently, if I run a 'LIKE' comparison query, it seems to take ages (OK, a few seconds, but still too long), at times making a user sit there and wait up to 10 seconds for queries and page loads.
Or are there any SQL techniques to help accomplish this? I want to use a proven method to search the data, not just a simple SQL LIKE or = comparison operation.
So this is a multi-approach question: should I attack this at the SQL level (as it ultimately looks to be), or is there a plug-in/module for ColdFusion that I can grab that will give me speedy, advanced search capability?
You could try indexing your db records with a Verity (or Solr, if CF9) search.
I'm not sure it would be faster, and whether even trying it would be worthwhile would depend a lot on how often you update the records you need to search. If you update them rarely, you could do a Verity index update whenever you update them. If you update the records constantly, that's going to be a drag on the web server and will certainly offset any possible gains in search speed.
I've never indexed a database via Verity, but I've indexed large collections of PDFs, Word Docs, etc, and I recall the search being pretty fast. I don't know if it will help your current situation, but it might be worth further research.
If your slowdown is specifically in the search of textual fields (as I surmise from your mentioning of LIKE), the best solution is building an index table (not to be confused with DB table indexes, which are also part of the answer).
Build an index table mapping the unique ID of your records from the main table to a set of words (1 word per row) of the textual field. If it matters, add the field of origin as a 3rd column in the index table, and if you want "relevance" features you may want to consider word count.
Populate the index table either with a trigger (doing the splitting) or from your app. The latter might be better: simply call a stored proc with both the actual data to insert/update and the list of words already split up.
This will immediately and drastically speed up textual search, as it will no longer do LIKE comparisons AND will be able to use indexes on the index table (no pun intended) without interfering with the indexing on SKU and the like on the main table.
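A minimal sketch of the index-table idea using Python and sqlite3 (table and column names are made up; the same schema works on SQL Server):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, sku TEXT, description TEXT);
    -- The "index table": one row per (record, word), indexed on the word.
    CREATE TABLE product_words (product_id INTEGER, word TEXT);
    CREATE INDEX idx_product_words_word ON product_words (word);
""")

def insert_product(pid, sku, description):
    conn.execute("INSERT INTO products VALUES (?, ?, ?)", (pid, sku, description))
    # The app does the word splitting, as suggested above.
    conn.executemany("INSERT INTO product_words VALUES (?, ?)",
                     [(pid, w.lower()) for w in set(description.split())])

insert_product(1, "AB-100", "large boxed steel widget")
# Word lookups now use the index on product_words instead of a LIKE scan:
rows = conn.execute("""
    SELECT p.* FROM products p
    JOIN product_words w ON w.product_id = p.id
    WHERE w.word = ?
""", ("steel",)).fetchall()
print(rows)   # -> [(1, 'AB-100', 'large boxed steel widget')]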
Also, ensure that all the relevant fields are fully indexed, though not necessarily in the same compound index (SKU, sizing, etc.), and any field that is searched as a range field (sizing or date) is a good candidate for a clustered index (as long as the records are inserted in approximate order of that field's increase, or you don't care about insert/update speed as much).
For anything more detailed, you will need to post your table structure, existing indexes, the queries that are slow, and the query plans you currently have for those slow queries.
Another item is to ensure that as few of the fields as possible are textual, especially ones that are "decodable": your comment mentioned "is it boxed" in the text field set. If so, I assume the values are "yes"/"no" or some other very limited data set. If so, simply store a numeric code for the valid values and do the en/de-coding in your app, and search by the numeric code. Not a tremendous speed improvement, but still an improvement.
I've done this using SQL Server's full-text indexes. This will require very few application changes and no changes to the database schema except for the addition of the full-text index.
First, add the Full Text index to the table. Include in the full text index all of the columns the search should perform against. I'd also recommend having the index auto update; this shouldn't be a problem unless your SQL Server is already being highly taxed.
Second, to do the actual search, you need to convert your query to use a full text search. The first step is to convert the search string into a full text search string. I do this by splitting the search string into words (using the Split method) and then building a search string formatted as:
"Word1*" AND "Word2*" AND "Word3*"
The double-quotes are critical; they tell the full text index where the words begin and end.
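A quick sketch of that conversion (Python here, instead of the Split method):
def to_fulltext_query(user_input):
    # Double quotes mark word boundaries for the full-text engine, and the
    # trailing * turns each word into a prefix match.
    return " AND ".join('"%s*"' % w for w in user_input.split())

print(to_fulltext_query("steel box"))   # -> "steel*" AND "box*"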
Next, to actually execute the full text search, use the ContainsTable command in your query:
SELECT *
from containstable(Bugs, *, '"Word1*" AND "Word2*" AND "Word3*"')
This will return two columns:
Key - The column identified as the primary key of the full text search
Rank - A relative rank of the match (1 - 1000 with a higher ranking meaning a better match).
I've used approaches similar to this many times and I've had good luck with it.
If you want a truly plug-in solution, then you should just go with Google itself. It sounds like you're doing some kind of e-commerce or commercial site (given the use of the term 'SKU'), so you probably have a catalog of some kind with product pages. If you have consistent markup, then you can configure a Google appliance or service to do exactly what you want. It will send a bot in to index your pages and find your fields. No SQL, little coding, and it will not be dependent on your database, or even ColdFusion. It will also be quite fast and familiar to customers.
I was able to do this with a ColdFusion site in about 6 hours. Done! The only thing to watch out for is that Google's index is limited to what the bot can see, so if you have a situation where you want to limit access based on a user's role, permissions, or group, then it may not be the solution for you (although you can configure a permission service for Google to check with).
Because SQL Server is where your data is, that is where your search performance is going to be a possible issue. Make sure you have indexes on the columns you are searching on. If you use LIKE, the query can't use an index if you write it like this: SELECT * FROM TABLEX WHERE last_name LIKE '%FR%'
But it can use an index if you do it like this: SELECT * FROM TABLEX WHERE last_name LIKE 'FR%'. The key here is to keep as many of the leading characters as possible free of wildcards.
Here is a link to a site with some general tips. https://web.archive.org/web/1/http://blogs.techrepublic%2ecom%2ecom/datacenter/?p=173
