Custom WordBreaker for SQL Server Full-text

Does anyone have information on how to create a custom word breaker for SQL Server 2005? I'd prefer to write it in C#. I need to be able to search on terms such as 'c#', 'f#', etc., but the '#' character is a word breaker in the English (UK) word breaker component, and this can't be changed in any other way.
I have found the following article, which provides an incomplete sample (the IWordSink interface is missing) and references an article that is no longer available. It also doesn't provide any of the thread-checking code I'd expect to see.
http://bytes.com/topic/sql-server/answers/864969-custom-wordbreaker-sql-server-full-text
Alternatively, could someone point me to how to decompile the existing English word breaker, 'langwrbk.dll', so I could make the small change I need to the existing code?
Thanks
Kirk

The technology for word breakers and stemmers is common across all the Microsoft Search products, including SQL Server Full-text. The Search SDK is well documented; see Extending the Index and the Windows Search Developer's Guide.
Reverse engineering langwrbk.dll would violate the user license you agreed to, which stipulates very clearly that you cannot reverse engineer, decompile or disassemble the Software. Not to mention that the DLL is code signed, so you wouldn't be able to 'make a little change'...

Related

Is it possible to adapt an existing NLP tool for English to Swedish, and what's the best approach?

What's the best approach to using existing NLP tools for English with another language, e.g. Spanish?
That's an awfully broad question, and you'd need to provide some more pointers. However, if you're interested in general research on the topic, you can try Hana, Feldman, Brew (2004) "Tagging Russian using Czech morphology" and Resnik's 2004 "Using bilingual text for monolingual annotation" and start from there.
In general, you'd want to have a bicorpus (say, English/Swedish). Then establish mappings between the two sides using alignment (a common topic in machine translation with many established results).
You can then tag the English side and use the mapping to "translate" these tags onto the Swedish side. Then you can train the same tool that created the tags on the English side using the newly annotated Swedish corpus.
It goes without saying that you'll lose quite a bit of quality and that this technique only works for supervised methods. You should probably try to find properly annotated Swedish corpora and tools. There are a few out there.
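The projection step described above can be sketched roughly as follows. This is only an illustration: the function name `project_tags`, the tag labels, the toy sentence pair and the hand-written alignment are all assumptions made for the example. A real setup would obtain the alignment from a word-alignment tool and the English tags from an existing tagger.

```python
def project_tags(source_tags, alignment, target_length, default="UNK"):
    """Copy tags from source tokens onto the target tokens they align with.

    alignment is a list of (source_index, target_index) pairs.
    Target tokens with no aligned source token keep the default tag.
    """
    target_tags = [default] * target_length
    for src_idx, tgt_idx in alignment:
        target_tags[tgt_idx] = source_tags[src_idx]
    return target_tags


# Toy example (hypothetical data): "the dog sleeps" / "hunden sover".
english_tags = ["DET", "NOUN", "VERB"]
swedish = ["hunden", "sover"]
alignment = [(1, 0), (2, 1)]  # dog -> hunden, sleeps -> sover

projected = project_tags(english_tags, alignment, len(swedish))
for word, tag in zip(swedish, projected):
    print(word, tag)
```

The projected pairs would then serve as (noisy) training data for a Swedish tagger, which is where the quality loss mentioned above comes from.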

Generate a series of documents based on SQL table

I am trying to formulate a proposal for an application that allows a user to print a batch of documents based on data stored in a SQL table. The SQL table indicates which documents are due and also contains all demographic information. This is outside of what I normally do, and I am trying to see if there is a platform/application that already exists to do such a task.
For example
List of all documents: Document #1 - Document #10
Person 1 is due for document #: 1,5,7,8
Person 2 is due for document #: 2,6
Person 3 is due for document #: 7,8,10
etc
Ideally, what I would like is for the user to be able to push a button and get a printed stack of documents that have been customized for each person, including basic demographic info like name, DOB, etc.
Like I said at the top, I already have all of the needed information in a database; I am just trying to figure out the best approach to move that information onto a document.
I have done some research and found that some people have used mail merge in Word or used Access as a front end, but I don't know if this is the best way. I've also found this document. Any advice would be greatly appreciated.
If I understand correctly, your problem is two-fold: firstly, you need to find a way to generate documents based on data (mail merge), and secondly, you might need to print them too.
For document generation you have two basic approaches: template-based and programmatically from scratch. I suppose that you will opt for a template based approach which basically means that you design (in MS Word) a template document (Word, RTF, ...) that acts as a template and contains placeholders and other tags that designate »dynamic« parts of the document. Then, at document generation time, you need a .NET library/processor that you will pass this template document and the data, where the processor will populate the template with the data and return the resulting document.
One way to achieve this functionality would be to employ MS Word's native mail merge, but you should know that this would involve using Office COM and Word Application Automation, which should almost always be avoided.
Another option is to build such a system on top of the Open XML SDK. This is a valid option, but it will be a pretty demanding task and will most probably cost you much more than buying a commercial .NET library that does mail merge out of the box (been there, done that). The good side, of course, is that you will be able to tailor the solution to your needs. If you go down this road I recommend that you use Content Controls for tagging documents/templates. The solution with Content Controls will be much easier to implement than the solution with bookmarks.
I'm not very familiar with the open source solutions and I'm not sure how many there are that can do mail merge. One I know is FlexDoc (on CodePlex), but its problem is that it uses a construct (XmlControl) for tagging that is deprecated in Word 2010+.
Then there are commercial solutions. Again, I don't know them in detail, but I know that the majority of them are general-purpose document processing libraries. Our company has been using this document generation toolkit for some time now and I can say it covers all our »template-based document generation« needs. It doesn't require MS Word at document generation time, it has a really helpful add-in for MS Word, and you only need several lines of code to integrate it into your project. Templating is very powerful and you can set up a template in a very short time. While templates are Word documents, you can generate PDF or XPS documents as well. XPS is useful because you can use the .NET/WPF printing framework, which works with XPS documents, to print them. This is a very high-end solution, but of course the downside is that it is not free.
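To make the template-based idea concrete, here is a minimal sketch in Python that uses plain-text templates with placeholders as a stand-in for Word templates. The template texts, the field names (`name`, `dob`) and the function `generate_documents` are hypothetical; a real system would read the rows from the SQL table and hand a Word template to a document-processing library instead.

```python
from string import Template

# Hypothetical templates keyed by document number; $name and $dob are placeholders.
TEMPLATES = {
    1: Template("Document 1\nName: $name\nDate of birth: $dob\n"),
    5: Template("Document 5\nReminder for $name (born $dob)\n"),
}


def generate_documents(person, due_ids, templates):
    """Fill every due template with one person's data and return the texts."""
    return [
        templates[doc_id].substitute(person)
        for doc_id in due_ids
        if doc_id in templates  # skip document numbers that have no template
    ]


# Hypothetical row, as it might come back from the SQL table.
person = {"name": "Person 1", "dob": "1980-01-31"}
for text in generate_documents(person, [1, 5, 7], TEMPLATES):
    print(text)
```

The same loop, run once per person, yields the stack of documents to send to the printer. Only the placeholder-filling step changes when the templates are real Word documents.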

Pulling out Popular terms from a Solr core

I have an Apache Solr core from which I need to pull out the popular terms. I am already aware of Luke, facets, and Apache Solr stopwords, but I am not getting what I want. For example, when I use Luke to get the popular terms and apply the stopwords to the result set, I get a bunch of words like:
http, img, que ...etc
While what i really want is:
Obama, Metallica, Samsung ...etc
Is there any better way to implement this in Solr? Am I missing something that should be used to do this?
Thank You
Finding relevant words in a text is not easy. The first thing I would take a deeper look at is Natural Language Processing (NLP) with Solr. The article in Solr's wiki is a starting point for this. Reading the page, you will stumble over the Full Example, which extracts nouns and verbs; that probably already helps you.
In the process of getting this running you will need to install additional software (Apache's OpenNLP project), so after reading Solr's wiki, that project's home page may be the next step.
To get a feeling for what is possible with that, you should have a look at the demonstration by the searchbox guy. There you can paste a sample text and get relevant words and terms extracted from it.
There are several tutorials out there you may have a look at for further reading.
If you went down the path and the results are not as expected or not as good as required, you may go down that road even further and start thinking about text mining with Apache Mahout. There are again several tutorials out there to cross it with Solr.
In any case you should then search Stackoverflow or the web for tutorials and How-Tos you will certainly need.
Update about Arabic
If you are going to use OpenNLP for languages that are not supported, which unfortunately includes Arabic out of the box as of version 1.5, you will need to train OpenNLP for the language. The reference for this is found in the OpenNLP developer docs. There is probably already something out there from the Arabic community, but my Arabic google-fu is not that good.
Should you decide to do the work and train it for the Arabic language, why not share your training with the project?
Update about integration in Solr/Lucene
There is work going on to integrate it as a module. In my humble opinion this is as far as it will and should get. If you compare this problem field to stemming, stemming appears to be rather easy. But even stemming got complex when supporting different languages. Analysing a language to the level where you can extract nouns, verbs and so forth is so complex that a whole project evolved around it.
Having a module/contrib at hand, which you could simply copy to solr_home/lib, would already be very handy. There would then be no need to run a different installer and so forth.
Well, this is a bit open-ended.
First you will need to facet and find the "popular terms" in your index, then add all the non-useful items such as http, img, time, what, when, etc. to your stop word list and re-index to get the cream of the data you care about. I do not think there is an easier way of finding popular names unless you can check your data against a custom dictionary of nouns during indexing (that is an option, by the way). You can choose to index only names by writing a custom token filter (look at how the stopword filter works) with your own nouns.txt file to go with it; in that case only words in your dictionary are allowed into the index. This approach is only possible if you have a finite, known list of nouns.
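The dictionary idea can be tried outside Solr first to see whether it gives the kind of terms you want. The sketch below is plain Python, not a Solr token filter; the function name `popular_terms`, the sample documents and the allowed-term list (standing in for a nouns.txt file) are assumptions for illustration.

```python
from collections import Counter


def popular_terms(documents, allowed_terms, top_n=3):
    """Count only the tokens that appear in the allowed dictionary."""
    allowed = {term.lower() for term in allowed_terms}
    counts = Counter()
    for text in documents:
        for token in text.lower().split():
            token = token.strip(".,;:!?\"'()")  # drop surrounding punctuation
            if token in allowed:
                counts[token] += 1
    return counts.most_common(top_n)


# Hypothetical data: noisy documents and a small dictionary of names.
documents = [
    "Obama met Samsung http img",
    "Obama speaks, http que",
    "Metallica http img que",
]
nouns = ["Obama", "Metallica", "Samsung"]
print(popular_terms(documents, nouns))
```

Tokens such as http, img and que never reach the counter because they are not in the dictionary, which is the same effect the custom filter would have at index time.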

DB Comparison tool that I can schedule

I'm after a DB Comparison tool for SQL Server that allows me to do the following:
Schedule a comparison to happen on a recurring schedule
Email me the results (in a nice readable format and not the generated script)
Allow me to exclude/include certain object names (for example exclude table names containing %test%. That's not a real example but there is a good reason why that would come in useful.)
As well as the obvious:
Have the usual options for ignoring things like comments, identity seeds etc
Options for selecting different types of objects
If it were free, or at least didn't cost a fortune, that would be an extra bonus of course.
I have tried out RedGate's SQL Compare and also the built-in DB Comparison in Visual Studio, but neither seems able to do the first 3 points above. I also looked at other tools recommended in various threads on here, but again they don't mention the 3 points above in their features.
One option I found is RedGate's SQL Comparison SDK with which I think I could write something to do what I want.
I just wanted to investigate tools that might do all of the above out of the box.
Thank you!
SQL Compare Pro comes with a command line, which will be easier to set up than the SDK. If you call this via the Windows Scheduler or in an Agent Job you can achieve what you're looking for.
An example of how to invoke the command line from PowerShell can be found here:
http://www.simple-talk.com/sql/database-administration/auditing-ddl-changes-in-sql-server-databases/
This article also covers how to send an email in Powershell. SQL Compare can also be passed a filter using the /filter switch to exclude objects based on various rules.
http://www.red-gate.com/supportcenter/Content/SQL_Compare/help/10.0/sc_cl_Switches_in_the_cl
Do please email support@red-gate.com should you have trouble getting this working.
I don't think any tool would do all of this out of the box. Have you had a chance to look at sp_CompareDB? I had a similar requirement and ended up writing my own routine based on the same.
http://www.sql-server-performance.com/2001/database-comparison-sp/

Optical character recognition

I have to write a program that is able to recognize patterns, especially characters. I've implemented back-propagation in C# and now I want to use it for pattern recognition. I've also created a form application and used brush/graphics so that the user can write something with the mouse (just like the 'pencil tool' in MS Paint). So I need some material on how to implement a character recognition method in my application.
The material I find on the internet is mostly about back-propagation and software demos.
If your project is about something else and you just want OCR in it, you should search for third-party tools that do this. But if OCR is the project and you want to do it yourself, read this answer:
There are two ways of recognizing characters: online and offline.
The online approach uses the pen (or mouse) input data, while the offline approach uses just the pixels.
Your first step will be to choose one of them. The offline approach lacks the pen data, which is a useful feature, but it lets you recognize characters from image files (created in Paint and saved, or even scanned).
Second, you should preprocess the data (this step applies only to the offline approach). You should remove noise from it, scale it, and apply thinning to it.
Next, you should extract useful features from the preprocessed data (online or offline). For this, you can read some articles about optical character recognition and its feature extraction. There is a good PowerPoint presentation about preprocessing and feature extraction here. Adding the keyword pdf or filetype:pdf to the end of your Google search terms will also help.
Then you should use a neural network or something similar to recognize the character. The inputs should be the extracted features.
But remember, this project is not easy and may take some time! (This was my project for the Persian language.)
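As a rough illustration of the offline preprocessing and feature extraction steps, the sketch below crops a binary pixel grid to the drawn character, scales it to a fixed size, and computes simple zoning features (the fraction of ink in each zone) that could be fed to a neural network. It is written in Python for brevity. The function names, the grid sizes and the choice of zoning as the feature are assumptions, and noise removal and thinning are left out.

```python
def crop_to_ink(grid):
    """Cut away empty rows and columns around the drawn character."""
    rows = [r for r, row in enumerate(grid) if any(row)]
    cols = [c for c in range(len(grid[0])) if any(row[c] for row in grid)]
    if not rows:  # nothing drawn, return the grid unchanged
        return grid
    return [row[cols[0]:cols[-1] + 1] for row in grid[rows[0]:rows[-1] + 1]]


def scale(grid, size):
    """Nearest-neighbour resize of a 0/1 grid to size x size."""
    height, width = len(grid), len(grid[0])
    return [
        [grid[r * height // size][c * width // size] for c in range(size)]
        for r in range(size)
    ]


def zone_features(grid, zones):
    """Split a square grid into zones x zones blocks; return the ink ratio of each."""
    step = len(grid) // zones  # assumes the grid size is divisible by zones
    features = []
    for zr in range(zones):
        for zc in range(zones):
            cells = [
                grid[r][c]
                for r in range(zr * step, (zr + 1) * step)
                for c in range(zc * step, (zc + 1) * step)
            ]
            features.append(sum(cells) / len(cells))
    return features


def extract_features(grid, size=8, zones=4):
    """Crop, scale and compute zoning features for one character image."""
    return zone_features(scale(crop_to_ink(grid), size), zones)


# Hypothetical drawing: a diagonal stroke on a 5x5 canvas with an empty border.
drawing = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
print(extract_features(drawing))
```

With the default settings each character yields 16 numbers between 0 and 1, so the network would have 16 inputs regardless of how large the character was drawn.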

Resources