I'm really trying to understand an example on how to construct a good suffix table for a given pattern. The problem is, I'm unable to wrap my head around it. I've looked at numerous examples, but do not know where the numbers come from.
So here goes:
The following example is a demonstration of how to construct a Good Suffix Table given the pattern ANPANMAN:
Index | Mismatch | Shift | goodCharShift
-----------------------------------------------
0 | N| 1 | goodCharShift[0]==1
1 | AN| 8 | goodCharShift[1]==8
2 | MAN| 3 | goodCharShift[2]==3
3 | NMAN| 6 | goodCharShift[3]==6
4 | ANMAN| 6 | goodCharShift[4]==6
5 | PANMAN| 6 | goodCharShift[5]==6
6 | NPANMAN| 6 | goodCharShift[6]==6
7 | ANPANMAN| 6 | goodCharShift[7]==6
Any help on this matter is highly appreciated. I simply don't know how to get to these numbers. Thanks!
Row 1: no characters matched; the character read was not an N. The good-suffix length is zero. Since there are plenty of letters in the pattern that are also not N, we have minimal information here, so shifting by 1 is the least interesting result.
Row 2: we matched the N, and it was preceded by something other than A. Now look at the pattern starting from the end: where do we have an N preceded by something other than A? There are two other N's, but both are preceded by A. That means no part of the good suffix can be useful to us -- shift by the full pattern length, 8.
Row 3: we matched the AN, and it was preceded by something other than M. In the middle of the pattern there is an AN preceded by P, so it becomes the shift candidate. Shifting that AN to the right to line up with our match is a shift of 3.
Rows 4 & up: the matched suffixes do not match anything else in the pattern, but the trailing suffix AN matches the start of the pattern, so the shifts here are all 6.
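The row-by-row reasoning above can be checked with a small brute-force sketch. This is not the usual linear-time preprocessing; it simply tries every shift and keeps the smallest one that is consistent with the matched suffix and with the mismatched character. The function name `good_suffix_shifts` is made up for illustration.

```python
def good_suffix_shifts(pattern):
    """Return shifts[k] = shift to use after k characters matched from the right."""
    m = len(pattern)
    shifts = []
    for k in range(m):                  # k = number of characters already matched
        bad = m - k - 1                 # index of the pattern character that mismatched
        for s in range(1, m + 1):       # try the smallest shift first; s == m always fits
            # The matched suffix must agree with the pattern moved s places right
            # (positions that fall off the left end are unconstrained).
            suffix_ok = all(
                pattern[i - s] == pattern[i]
                for i in range(m - k, m)
                if i - s >= 0
            )
            # The character that lands on the mismatch position must differ,
            # otherwise the same mismatch would happen again.
            differs = bad - s < 0 or pattern[bad - s] != pattern[bad]
            if suffix_ok and differs:
                shifts.append(s)
                break
    return shifts


print(good_suffix_shifts("ANPANMAN"))
```

For ANPANMAN this reproduces the Shift column of the table in the question.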
This might help you: Good Suffix-Table.
Why didn't you try the last occurrence method? It is much easier than the good suffix table. I used the last occurrence method for my searching.
Although this is an old question and an answer has already been accepted, I just want to add the PDF from JHU, which explains the good suffix rules quite well.
http://www.cs.jhu.edu/~langmea/resources/lecture_notes/boyer_moore.pdf
This PDF made my life so much easier, so I hope it will help other people who are struggling to understand this algorithm as well.
Firstly, apologies for not using the correct terminology here. I don't actually know the correct terms, and thus have failed miserably to find a solution. Please accept my example as the question, and I'll update the question accordingly if somebody can enlighten me (or delete and read the actual solution where it exists).
So far my research has only yielded comparisons of single bits within the whole, rather than showing all.
Given a set of integers:
ELEMENT
-----------
Bricks 1
Plaster 2
Cement 4
Concrete 8
I have a result set that provides how these materials are used:
MIXTURE ELEMENTS
----------------------
MixtureFoo 3
MixtureBar 7
MixtureBaz 11
I need to show the final set of mixtures, but with each constituent element listed that is used in the respective mixture:
MIXTURE ELEMENTS ELEMENT
------------------------------
MixtureFoo 3 1
MixtureFoo 3 2
MixtureBar 7 1
MixtureBar 7 2
MixtureBar 7 4
MixtureBaz 11 1
MixtureBaz 11 2
MixtureBaz 11 8
You could use bitwise operations:
SELECT *
FROM t
JOIN ELEMENT e
ON t.ELEMENTS & e.w = e.w
ORDER BY MIXTURE, w;
db<>fiddle demo
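To see why the join condition works, here is a minimal sketch of the same test outside SQL. The dictionaries just mirror the two tables from the question; the function name `expand` is hypothetical.

```python
# Element flags and mixture bitmasks, copied from the question.
ELEMENTS = {"Bricks": 1, "Plaster": 2, "Cement": 4, "Concrete": 8}
MIXTURES = {"MixtureFoo": 3, "MixtureBar": 7, "MixtureBaz": 11}


def expand(mixtures, elements):
    """List (mixture, mask, element_value) for every flag set in the mask."""
    rows = []
    for name, mask in mixtures.items():
        for value in sorted(elements.values()):
            # Same test as the SQL join: the flag is present if masking
            # the mixture with it leaves the flag unchanged.
            if (mask & value) == value:
                rows.append((name, mask, value))
    return rows


for row in expand(MIXTURES, ELEMENTS):
    print(*row)
```

For example, 11 is 1011 in binary, so MixtureBaz yields 1, 2 and 8 but not 4.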
I intend to store a list of distances from all points to each other.
So if I had only 3 points (A-C) it would be like
| FROM | TO | DISTANCE
| A | B | 10 miles
| A | C | 15 miles
| B | C | 12 miles
Obviously you can infer that B to A = 10 miles since you know A to B = 10 miles. In terms of my queries I may be searching for A to B or B to A - I can't guarantee the order of the start and end point of the journey.
I have 1600 points which makes (1600^2 - 1600)/2 = 1.3m possible journeys. What is the best way to store that data for querying by either A to B or B to A?
Should I duplicate the rows for the reverse journeys, leading to 2.5m rows, and query on that?
Or should I make a composite clustered index on the two columns and search for both the A to B OR the B to A, knowing at least one exists?
Or something else clever?
This is a common enough problem surely and so I want to know from a DB expert if there is a common pattern or practice for solving it. I want facts, references, or specific expertise to answer this question not just vague opinions as I have plenty of them myself :)
This is on SQL Azure in case that makes a difference
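If I were to solve this, rather than naming the columns From and To, I would call them Point1 and Point2 and always make sure that Point2 is greater than Point1 (in your case C > B, B > A and C > A). Every pair is then stored exactly once, and a query for a journey in either direction simply orders the two points the same way before looking them up.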
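A minimal sketch of that idea, with an in-memory dictionary standing in for the table (an assumption for illustration only; the helper names `pair_key`, `store` and `lookup` are made up):

```python
def pair_key(a, b):
    """Order the two points so that (A, B) and (B, A) map to the same key."""
    return (a, b) if a <= b else (b, a)


distances = {}


def store(a, b, miles):
    distances[pair_key(a, b)] = miles


def lookup(a, b):
    return distances[pair_key(a, b)]


# Data from the question.
store("A", "B", 10)
store("A", "C", 15)
store("B", "C", 12)

print(lookup("B", "A"))  # same row as A to B
```

In the database the equivalent would be to normalise the pair the same way on insert and in the query, so a single index on (Point1, Point2) covers both directions.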
Hope this helps.
I'm trying to grasp the concept of extendible hashing, but I'm getting confused about the distribution of values to the buckets.
For example:
Say I want to insert 6 values from scratch: 17, 32, 14, 50, 35, 21
What would be wrong with this as a solution:
Global depth = 2
Bucket size = 2
00[] --> [][]
01[] --> [][]
10[] --> [][]
11[] --> [][]
Does this mean only one value for each hash value will be pointed to the bucket, so then you increment the global depth? Or would this work?
I understand the beginning of the process, I am just confused at this point.
There is nothing wrong with the solution that you've provided; the global depth simply does not need to be increased. The solution is perfectly compatible with the given global depth.
Assume that we choose the directory entry and the corresponding bucket using the 2 leftmost bits of a 6-bit representation.
In binary format the numbers look like the following, and the resulting directory is shown after them:
17 - 010001
32 - 100000
14 - 001110
50 - 110010
35 - 100011
21 - 010101
directory ------------- buckets
00-----------------------> 14 |
01-----------------------> 17 | 21
10-----------------------> 32 | 35
11-----------------------> 50 |
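The placement above can be reproduced with a short sketch. It assumes a fixed 6-bit width (enough for the six values here) and the "2 leftmost bits" convention from this answer; it does not handle bucket overflow or directory doubling. The function name `directory_for` is hypothetical.

```python
def directory_for(values, global_depth=2, width=6):
    """Group values by the leftmost `global_depth` bits of their `width`-bit form."""
    buckets = {
        format(i, f"0{global_depth}b"): [] for i in range(2 ** global_depth)
    }
    for v in values:
        bits = format(v, f"0{width}b")       # e.g. 17 -> "010001"
        buckets[bits[:global_depth]].append(v)
    return buckets


for prefix, bucket in directory_for([17, 32, 14, 50, 35, 21]).items():
    print(prefix, "->", bucket)
```

No bucket receives more than 2 values, which is why the global depth of 2 is sufficient for this input.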
Hope this helps.
You shouldn't increment global depth.
The whole idea of hashing is to select a function that puts items into the buckets more or less equally.
How well that works depends on the hash function.
You can use something as complex as md5 as the hash, and then you will typically get 1 element per bucket, but you are not guaranteed that there will be only 1.
So a general implementation should use binary search on the buckets and some other search inside a bucket. You can't and you shouldn't change the hash function on the fly.
As an over-simplified example I have a list of events that have a maximum attendance:
event | places
===================
event_A | 1
event_B | 2
event_C | 1
And a list of attendees with the distance to the events:
attendee | event_A dist | event_B dist | event_C dist
==========================================================
attendee_1 | 12 | 15 | 12
attendee_2 | 11 | 15 | 11
attendee_3 | 10 | 11 | 12
Can anyone suggest a simple method to produce a set of options providing the best case allocations based on shortest total distance and on shortest mean distance?
I currently have the data held in Oracle Spatial database, but I'm open to suggestions.
I currently understand your problem as follows:
Each attendee should be assigned to exactly one event
Each event has a limit as to how many attendees are assigned to it
Underfull or even empty events are no problem
Each assignment between an event and an attendee corresponds to a given distance
You want to minimize the overall distance for all assignments
You might want to print the result not using sums but using means
Based on this interpretation, I suggest the following algorithm:
Create a complete bipartite graph, with nodes in partition A for attendees and nodes in partition B for places in events. So every attendee corresponds to one node, and every event corresponds to as many nodes as it has places. All attendees are connected to all event nodes with the distance as the edge cost.
At this point your problem corresponds to a general assignment problem, with “agents” corresponding to your event places, and “tasks” corresponding to your attendees. Every attendee must be covered, but not every event place must be used.
Add dummy attendees to allow a perfect matching. Sum up all the places and subtract the number of actual attendees; create that many dummy attendee nodes, each with distance zero to all the event nodes.
By making both partitions equal in size, you are now in the domain of the more common linear assignment problem.
Use the Hungarian algorithm to compute a minimal cost assignment. Perhaps you can think of some simplifications which make use of the fact that you have many equivalent nodes, i.e. places for the same event and all those dummy attendees.
All of this should probably be done in application code, not in the database. So I'd rather tag this question as algorithm. You'll need to pull the full cost matrix from the database to provide costs for your edges.
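As a sanity check on the construction (place nodes plus dummy attendees), here is a small sketch using the data from the question. Note that it swaps the Hungarian algorithm for a brute-force search over all permutations, which is only feasible for tiny inputs; it also assumes there are at least as many places as attendees. The function name `best_assignment` is made up.

```python
from itertools import permutations

PLACES = {"event_A": 1, "event_B": 2, "event_C": 1}
DIST = {
    "attendee_1": {"event_A": 12, "event_B": 15, "event_C": 12},
    "attendee_2": {"event_A": 11, "event_B": 15, "event_C": 11},
    "attendee_3": {"event_A": 10, "event_B": 11, "event_C": 12},
}


def best_assignment(places, dist):
    # One node per place: an event with 2 places appears twice.
    slots = [event for event, n in places.items() for _ in range(n)]
    attendees = list(dist)
    # Dummy attendees (None) cost zero everywhere and make the matching perfect.
    padded = attendees + [None] * (len(slots) - len(attendees))
    best_cost, best = None, None
    for perm in permutations(range(len(slots))):
        pairs = [(a, slots[j]) for a, j in zip(padded, perm) if a is not None]
        cost = sum(dist[a][event] for a, event in pairs)
        if best_cost is None or cost < best_cost:
            best_cost, best = cost, dict(pairs)
    return best_cost, best


total, assignment = best_assignment(PLACES, DIST)
print(total, total / len(DIST), assignment)
```

Because the number of attendees is fixed, minimising the total distance also minimises the mean; the mean is just the total divided by the attendee count. Several assignments can tie for the minimum, and this sketch returns only the first one found.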
Given my current position (lat, long), I want to quickly find the nearest neighbor in a points of interest problem. Thus I intend to use an R-Tree database, which allows for quick lookup. However, the database must first be populated, of course. Therefore, I need to determine the rectangular regions that cover the area, where each region contains one point of interest.
My question is how do I preprocess the data, i.e. how do I subdivide the area into these rectangular sub-regions? (I want rectangular regions because they are easily added to the RTree - in contrast to more general Voronoi regions).
/John
Edit: The approach below works, but it ignores the critical feature of R-trees: the splitting behavior of R-tree nodes is well defined and maintains a balanced tree (through B-tree-like properties). So in fact, all you have to do is:
Pick the maximum number of rectangles per page
Create seed rectangles (use points furthest away from each other, centroids, whatever).
For each point, choose a rectangle to put it into
If it already falls into a single rectangle, put it in there
If it does not, extend the rectangle that needs to be extended least (different ways to measure "least" exist -- area works)
If multiple rectangles apply -- choose one based on how full it is, or some other heuristic
If overflow occurs -- use the quadratic split to move things around...
And so on, using R-tree algorithms straight out of a text book.
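The "extend the rectangle that needs to be extended least" step can be sketched as follows, using area enlargement as the measure. Rectangles are assumed to be `(min_x, min_y, max_x, max_y)` tuples, and breaking ties by the smaller current area is an assumption here (the list above suggests fullness or another heuristic). All function names are hypothetical.

```python
def area(rect):
    min_x, min_y, max_x, max_y = rect
    return (max_x - min_x) * (max_y - min_y)


def extend(rect, point):
    """Smallest rectangle that contains both `rect` and `point`."""
    min_x, min_y, max_x, max_y = rect
    x, y = point
    return (min(min_x, x), min(min_y, y), max(max_x, x), max(max_y, y))


def choose_rectangle(rects, point):
    """Index of the rectangle whose area grows least when `point` is added."""
    return min(
        range(len(rects)),
        key=lambda i: (area(extend(rects[i], point)) - area(rects[i]), area(rects[i])),
    )


rects = [(0, 0, 2, 2), (5, 5, 6, 6)]
print(choose_rectangle(rects, (1, 1)))  # already inside the first rectangle
print(choose_rectangle(rects, (6, 7)))  # second rectangle needs less growth
```

A point that already falls inside a rectangle has zero enlargement, so the first two rules of the list are covered by the same comparison.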
I think the method below is ok for finding your initial seed rectangles; but you don't want to populate your whole R-tree that way. Doing the splits and rebalancing all the time can be a bit expensive, so you will probably want to do a decent chunk of the work with the KD approach below; just not all of the work.
The KD approach: enclose everything in a rectangle.
If the number of points in the rectangle is > threshold, sweep in direction D until you cover half the points.
Divide into rectangles left and right of (or above and below) the splitting point.
Call the same procedure recursively on the new rectangles, with the next direction (if you were going left to right, you will now go top to bottom, and vice versa).
The advantage this has over the divide-into-squares approach offered by another poster is that it accommodates skewed point distributions better.
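A minimal sketch of the KD approach, under a few assumptions: points are `(x, y)` tuples, the input is non-empty, the "sweep until you cover half the points" step is done by sorting along the current axis and cutting at the median, and each leaf is reported as the bounding box of its points rather than as a slice of the enclosing rectangle. The function name `kd_rectangles` is made up.

```python
def kd_rectangles(points, threshold=1, axis=0):
    """Recursively split points at the median, alternating x (0) and y (1)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    box = (min(xs), min(ys), max(xs), max(ys))
    if len(points) <= threshold:
        return [box]                      # few enough points: keep this rectangle
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2               # half the points end up on each side
    next_axis = 1 - axis                  # switch sweep direction for the children
    return (kd_rectangles(ordered[:mid], threshold, next_axis)
            + kd_rectangles(ordered[mid:], threshold, next_axis))


pts = [(0, 0), (1, 5), (2, 2), (3, 6), (4, 1), (5, 4), (6, 3)]
for rect in kd_rectangles(pts, threshold=2):
    print(rect)
```

Because every cut is at the median, the recursion depth stays logarithmic in the number of points even when the points are heavily clustered.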
The Oracle Spatial Cartridge documentation describes a tessellation algorithm that can do what you want. In short:
enclose all your points in a square
if the square contains 1 point - index the square
if the square does not contain any points - ignore it
if the square contains more than 1 point
split the square into 4 equal squares and repeat the analysis for each new square
Result should be something like this:
http://download-uk.oracle.com/docs/cd/A64702_01/doc/cartridg.805/a53264/sdo_ina5.gif
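The tessellation steps can be sketched as a short recursive function. This is only an illustration of the steps listed in this answer, not Oracle's implementation: squares are `(x, y, size)` tuples with half-open edges, and the `min_size` cut-off (to stop on duplicate points) is an added assumption. The function name `tessellate` is hypothetical.

```python
def tessellate(points, x, y, size, min_size=1e-9):
    """Return the squares (x, y, size) that end up holding exactly one point."""
    inside = [p for p in points
              if x <= p[0] < x + size and y <= p[1] < y + size]
    if not inside:
        return []                          # empty square: ignore it
    if len(inside) == 1 or size <= min_size:
        return [(x, y, size)]              # single point: index this square
    half = size / 2
    tiles = []
    for dx in (0, half):                   # split into 4 equal squares
        for dy in (0, half):
            tiles += tessellate(inside, x + dx, y + dy, half, min_size)
    return tiles


pts = [(1, 1), (6, 6), (7, 7)]
print(tessellate(pts, 0, 0, 8))
```

Isolated points get large squares while clustered points force deeper subdivision, which is the behaviour the figure illustrates.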
I think you are missing something in the problem statement. Assume you have N points (x, y) such that every point has a unique x- and y-coordinate. You can then divide your area into N rectangles just by dividing it into N vertical columns. But that does not help you to solve the nearest POI problem easily, does it? So I think there is something about the rectangle structure that you have in mind but haven't articulated yet.
Illustration:
| | | | |5| | |
|1| | | | |6| |
| | |3| | | | |
| | | | | | | |
| |2| | | | | |
| | | | | | |7|
| | | |4| | | |
The numbers are POIs and the vertical lines show a subdivision into 7 rectangular areas. But clearly there isn't much "interesting" information in the subdivision. Is there some additional criterion on the subdivision which you haven't mentioned?