Can I define which word breakers to use when building a mssql fulltext index? - sql-server

I have created a fulltext catalog that stores the data from some of the columns in a table, but the contents seem to have been split apart by characters that I don't really want to be treated as word delimiters ("/", "-", "_", etc.).
I know that I can set the language for the word breaker, and http://msdn.microsoft.com/en-us/library/ms345188.aspx gives some idea of how to install new languages - but I need more direct control than that, because all of those languages still break on the characters I don't want to break on.
Is there a way to define my own language to use for finding word breakers?

Full text indexes treat only the characters _ and ` as part of a word while indexing. All other special characters are ignored and words get split wherever those characters occur. This is mainly because full text indexes are designed to index large documents, where only proper words are considered, to make the search more refined.
We faced a similar problem. To solve it we actually had a translation table, where characters like #, -, and / were replaced with special sequences like '`at`', '`dash`', '`slash`' etc. When searching the full text index, you have to replace the characters in the search string with these same special sequences and then search. This takes care of the special characters.
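For illustration, here is a rough sketch of that translation step in Scala (the token names and the helper are made up; the same idea applies to whatever layer does your reads and writes):

// Hypothetical translation table: special characters -> letter-only sequences
// that survive word breaking and are unlikely to collide with real content.
val translations = Map(
  "#" -> "zzhashzz",
  "-" -> "zzdashzz",
  "/" -> "zzslashzz"
)

// Apply the same translation before indexing and to every search string.
def translate(text: String): String =
  translations.foldLeft(text) { case (acc, (ch, token)) => acc.replace(ch, token) }

translate("foo/bar-baz")  // "foozzslashzzbarzzdashzzbaz" - what gets stored and indexed
translate("foo/bar")      // "foozzslashzzbar" - what you search for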

The ability to configure FTS indexing is fairly limited out of the box. I don't think that you can use languages to do this.
If you are up for a challenge and have some C++ knowledge, you can always write a custom IFilter implementation. It's not trivial, but not too difficult either. See here for IFilter resources.

Related

Is there a way to remove the last token from the WhitespaceTokenizerFactory in Solr?

In the index analyzer I'm tokenizing with the WhitespaceTokenizerFactory. Generally the strings are split into two tokens, and it turns out that the remaining steps of my analyzer are better fitted to just the first token rather than both.
Is there a way to keep this second token from being fed into the rest of the analyzer?
Thanks for any insight.
I'm not familiar with any filter that allows you to remove arbitrary tokens (although it shouldn't be too hard to write), but you can possibly work around it by using a PatternReplaceCharFilter.
If you have a common separator (i.e. a space / whitespace), you can remove anything after the separator, leaving just the first token present. This won't work if you need more advanced tokenization, but as long as you can express it as a regular expression, you should be OK.
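For example, a rough schema.xml sketch of that workaround (the field type name is invented; the char filter drops everything from the first whitespace onward, so the tokenizer only ever sees the first token):

<fieldType name="text_first_token" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- strip everything after the first whitespace before tokenizing -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\s.*" replacement=""/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- ...the rest of your index-time filters... -->
  </analyzer>
</fieldType>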

splitting JSON string using regex

I want to split a JSON document which has a pattern like [[[1,2],[3,4][5,6]]] using regex. The pairs represent x and y. What I want to do is take this string and produce a list with {"1,2", "3,4", "5,6"}. Eventually I want to split the pairs. I was thinking I could make a list of {"1,2", "3,4", "5,6"} and then use a for loop to split the pairs. Is this approach correct for getting x and y separately?
JSON is not a regular language but a context-free language, and as such it cannot be matched by a regular expression. You need a full JSON parser like the ones referenced in the comments to your question.
... but, if you are going to have a fixed structure, like only three levels of square brackets with exactly the layout you posted in your question, then there is a regexp that can parse it (it would cover a subset of the JSON grammar, not general enough to parse other JSON content):
You'll have numbers: ([+-]?[0-9]+)
Then you'll have brackets and separators: \[\[\[, ,, \],\[ and \]\]\]
and finally, put all this together:
\[\[\[([+-]?[0-9]+),([+-]?[0-9]+)\],\[([+-]?[0-9]+),([+-]?[0-9]+)\],\[([+-]?[0-9]+),([+-]?[0-9]+)\]\]\]
and if you want to permit spaces between symbols, then you need:
\s*\[\s*\[\s*\[\s*([+-]?\d+)\s*,\s*([+-]?\d+)\s*\]\s*,\s*\[\s*([+-]?\d+)\s*,\s*([+-]?\d+)\s*\]\s*,\s*\[\s*([+-]?\d+)\s*,\s*([+-]?\d+)\s*\]\s*\]\s*\]\s*
This regexp will have six matching groups that capture the corresponding integers in the matched string, as the following demo shows.
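To make that concrete, here is one way to apply the non-spaced version and pull out the six groups (a Scala sketch; any regex engine with capture groups works the same way):

// The fixed-structure pattern from above, with its six capturing groups.
val pairPattern =
  """\[\[\[([+-]?[0-9]+),([+-]?[0-9]+)\],\[([+-]?[0-9]+),([+-]?[0-9]+)\],\[([+-]?[0-9]+),([+-]?[0-9]+)\]\]\]""".r

"[[[1,2],[3,4],[5,6]]]" match {
  case pairPattern(x1, y1, x2, y2, x3, y3) =>
    // Rebuild the "x,y" strings the question asked for.
    println(List(s"$x1,$y1", s"$x2,$y2", s"$x3,$y3"))  // List(1,2, 3,4, 5,6)
  case _ =>
    println("no match")
}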
Clarification
Regular languages (the ones defined by regular grammars or regular expressions) form a class of languages with many practical properties, for example:
You can parse them efficiently in one pass with what is called a finite automaton
You can define the automaton to accept language sentences simply with a regular expression.
You can easily combine regexps (or automata) to build acceptors for more complex languages (the union, intersection, symmetric difference, concatenation, etc. of language sets).
You can easily decide whether the language defined by one regular expression is a subset, a superset, or neither of the language defined by another.
By contrast, this limits the power of the languages that can be defined:
you cannot define languages that allow nesting of subexpressions (like the bracketing you allow in JSON expressions or the tag nesting allowed in XML documents)
you cannot define languages which collect context and use it in another place of the sentence (for example, sentences that identify a number and have to match that same number in another place of the sentence)
But the point of my answer is that, if you bound the nesting depth (say, to three levels of brackets, like the example you posted), you can make your language regular and then parse it with a regular expression. It is not easy to do, because it often leads to complex expressions (as you have seen in my answer), but it is not impossible, and you gain the ability to identify parts of the sentence as submatches of the regular subexpressions embedded in the global one.
If you want to allow arbitrary nesting, you need to switch to context-free languages, which are defined with context-free grammars and are accepted by a more complex, stack-based automaton. Then you lose most of the operations you had:
You'll no longer be able to decide whether one language overlaps another (or is included in it).
You'll no longer be able, in general, to construct a language from the intersection or difference of other context-free languages (union and concatenation still work).
But you will be able to match unboundedly nested sentences. Normally, programming languages are defined with a context-free grammar plus a little extra work for context checking (for example, checking that an identifier being used is actually defined in the declaration section, or matching the starting and ending tag names at the same nesting level in an XML document).
For context free languages, see this.
For regular languages, see this.
Second clarification
Since your question didn't say whether you wanted to match real (decimal) numbers, I have modified the demo to allow fixed-point numbers (not general floating point with exponential notation; you'll need to work that out yourself, as an exercise). Just run some tests and modify the regexp to adapt it to your needs.
(well, if you want to see the solution, look at it)
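For reference (this is not the demo's exact expression, just one plausible form), a fixed-point number can be matched with something like:

[+-]?[0-9]+(\.[0-9]+)?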
Yeah, I tried using the regex in my code but it is not working, so I am trying a different approach now. I have an idea of how to approach it but it is not really working. First of all, let me be clearer about the question. What I am trying to do is parse a JSON document, like the image below. The file has strings with a [[[1,2],[3,4][5,6]]] pattern. What I am trying to get out of this is each pair as a list, so the list has x-y pairs.
the string structure
My approach: first replace the "[[" and "]]" at the beginning and at the end, so I have a string with the same pattern throughout, which gives me the string "[1,2],[3,4][5,6]". This is my code but it is not working. How do I fix it? The other thing I thought might be an issue is that the strings are not all the same length. So how do I replace just the beginning and the ending?
my code
Then I can use a regex split method to get a list of the form {"1,2", "3,4", "5,6"}. I am not really sure how to do this though.
Then I take the x and the y from each pair and add them to the list, so I get a list of x-y pairs. I would appreciate it if you could show me how to do this.
This is the approach I am working on, but if there is a better way of doing it I will be glad to see it.
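If it helps, here is a rough Scala sketch of that approach (strip the outer brackets, split on the pair separators, then split each pair on the comma); the same steps should translate directly to the language in your screenshots:

val input = "[[[1,2],[3,4][5,6]]]"                         // the pattern from the question

// 1. Strip the leading "[[[" and trailing "]]]" (works whatever the string length is).
val inner = input.stripPrefix("[[[").stripSuffix("]]]")    // "1,2],[3,4][5,6"

// 2. Split on the "],[" (or "][") separators to get the pair strings.
val pairStrings = inner.split("""\],?\[""").toList         // List("1,2", "3,4", "5,6")

// 3. Split each pair on the comma to get x and y.
val pairs = pairStrings.map { p =>
  val Array(x, y) = p.split(",")
  (x.toInt, y.toInt)
}                                                          // List((1,2), (3,4), (5,6))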

Algo - key-values expression to parse (C)

I'm working on a way to write key-value data in a string, like this:
{k1=v1__k2=v2__k3=v3}
Not a big deal to parse, but the problem became bigger when I wanted to add the possibility of writing a set of key-values as the value of a key, like this:
{k1=v1__k2={k21=v21__k22=v22}__k3=v3}
Also, it would be better to allow more depth in my structure, for example:
{k1=v1__k2=v2__k3={k31={k311=v311__k312=v312__k313={k3131=v3131}}__k32=v32}}
I tried (in C) to parse it, but it becomes hard to do in a simple way (splitting on the __ and {} characters), and I also tried a regex to split each key and value, but then I lose the hierarchy (depth) information...
Constraints of the problem:
The data structure can accept more than one level of depth
The special characters (or sets of characters) can be changed (to something other than __ or {})
Does anyone know a good algorithm?
I'm not sure, but the JSON format has the same constraints, am I wrong?
Many thanks all
Rather than attempt to write your own data representation, I suggest you use either XML or JSON. Both are more than up to the job. If you use XML, libxml is your friend. There are many, many JSON libraries, including (for example) libyajl. Both XML and JSON are tried and tested, and cope with escaping and so forth. XML also allows querying with XPath and (if you want it) DTD functionality. Believe me, this is better than reinventing the wheel.
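For comparison, the nested example from the question maps naturally onto JSON, which any of those libraries will parse for you (quoting added as JSON requires):

{"k1": "v1", "k2": {"k21": "v21", "k22": "v22"}, "k3": "v3"}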

SQL Server Full Text Search Most Common Word Pairs

I am looking for a way to query for the most common adjacent words, and/or the most common co-occurring words, across the set of documents that contain a given word.
For example, I would like a query that would accept 'windows' and return a list of words that are most commonly found in a document containing 'windows', like 'microsoft' or 'doors'.
I would like to find adjacent words, but I also see a potential need in my application for eventually knowing the most common words also present in the document. An example of that might be 'linux' or 'efficiency'. Those words might not be adjacent to 'windows' but they are likely to be in the same document.
I found this question which helps me part way, but that only gets me the most common words given all the documents, or a specific document, not a set of documents.

How to efficiently search large dataset for substrings?

I have a large set of short strings. What are some algorithms and indexing strategies for filtering the list on items that contain a substring? For example, suppose I have a list:
val words = List(
  "pick",
  "prepick",
  "picks",
  "picking",
  "kingly"
  ...
)
How could I find strings that contain the substring "king"? I could brute force the problem like so:
words.filter(_.indexOf("king") != -1) // yields List("picking", "kingly")
This is only practical for small sets; today I need to support 10 million strings, with a future goal in the billions. Obviously I need to build an index. What kind of index?
I have looked at using an ngram index stored in MySQL, but I am not sure if this is the best approach. I'm not sure how to optimally query the index when the search string is longer than the ngram size.
I have also considered using Lucene, but this is optimized around token matching, not substring matching, and does not seem to support the requirement of simple substring matching. Lucene does have a few classes related to ngrams (org.apache.lucene.analysis.ngram.NGramTokenFilter is one example), but these seem to be intended for spell check and autocomplete use cases, not substring matching, and the documentation is thin.
What other algorithms and indexing strategies should I consider? Are there any open source libraries that support this? Can the SQL or Lucene strategies (above) be made to work?
Another way to illustrate the requirement is with SQL:
SELECT word FROM words WHERE word LIKE CONCAT('%', ?, '%');
Where ? is a user provided search string, and the result is a list of words that contain the search string.
How big is the longest word?
If that's about 7-8 characters, you could generate all substrings of each and every string and insert those substrings into a trie (the kind used in Aho-Corasick - http://en.wikipedia.org/wiki/Aho-Corasick).
It will take some time to build the tree, but then searching for all occurrences will be O(length of the searched word).
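A small Scala sketch of that idea (illustrative only, and nowhere near memory-optimal for 10 million strings): insert every suffix of every word into a trie (every substring is a prefix of some suffix), recording at each node which word ids pass through it:

import scala.collection.mutable

class Node {
  val children = mutable.Map.empty[Char, Node]
  val wordIds  = mutable.Set.empty[Int]
}

class SubstringIndex(words: IndexedSeq[String]) {
  private val root = new Node

  // Build: walk every suffix of every word down the trie, tagging nodes with the word id.
  for ((w, id) <- words.zipWithIndex; start <- w.indices) {
    var node = root
    for (c <- w.substring(start)) {
      node = node.children.getOrElseUpdate(c, new Node)
      node.wordIds += id
    }
  }

  // Query: O(length of the query), independent of how many words are indexed.
  def search(sub: String): Set[Int] = {
    var node: Option[Node] = Some(root)
    for (c <- sub) node = node.flatMap(_.children.get(c))
    node.fold(Set.empty[Int])(_.wordIds.toSet)
  }
}

val dictionary = Vector("pick", "prepick", "picks", "picking", "kingly")
new SubstringIndex(dictionary).search("king").map(dictionary)  // Set(picking, kingly)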
Postgres has a module which does a trigram index
That seems like an interesting idea too: building a trigram index.
About a comment in your question regarding how to break down text searches greater than n-gram length:
Here's one approach which will work:
Say we have the search string "abcde", and we have built a trigram index. (You have strings of smaller lengths; this could hit a sweet spot for you.)
Let the search results be abc = S1, bcd = S2, cde = S3 (where S1, S2, S3 are sets of indexes).
Then the longest common substring of S1, S2, S3 will give the indexes that we want.
We can transform each set of indexes into a single string, separated by a delimiter (say a space), before doing the LCS.
After we find the LCS, we would still have to check those indexes for the complete pattern, since we have broken down the search term, i.e. we would have to prune results like "abc-XYZ-bcd-HJI-def".
The LCS of a set of strings can be found efficiently using suffix arrays or suffix trees.
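And a minimal Scala sketch of the trigram-index idea (illustrative; for the candidate step it uses plain set intersection of the posting lists rather than the LCS-over-index-strings transformation described above, followed by a verification pass):

class TrigramIndex(words: IndexedSeq[String]) {
  // Queries/words shorter than 3 characters would need separate handling in practice.
  private def trigrams(s: String): Seq[String] =
    if (s.length < 3) Seq(s) else s.sliding(3).toSeq

  // trigram -> set of word ids containing it
  private val postings: Map[String, Set[Int]] =
    words.zipWithIndex
      .flatMap { case (w, id) => trigrams(w).map(_ -> id) }
      .groupBy(_._1)
      .map { case (g, ps) => g -> ps.map(_._2).toSet }

  def search(sub: String): Set[Int] = {
    // Candidate step: every trigram of the query must occur in a matching word.
    val sets = trigrams(sub).map(postings.getOrElse(_, Set.empty[Int]))
    val candidates = if (sets.isEmpty) Set.empty[Int] else sets.reduce(_ intersect _)
    // Verification step: trigram overlap is necessary but not sufficient.
    candidates.filter(id => words(id).contains(sub))
  }
}

val sample = Vector("pick", "prepick", "picks", "picking", "kingly")
new TrigramIndex(sample).search("king").map(sample)  // Set(picking, kingly)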
