Extracting semantic data from webpages - screen-scraping

I'm interested in extracting semantic data (simple template stuff) from webpages and other sources that aren't currently semantically aware. I've written crawlers and manual parsers before in a bunch of different languages, but there always seems to be a lot of boilerplate and page-specific code, so I was wondering if you guys knew of any platforms or frameworks that simplify the process (open source only, please).
I'll be writing one if I can't find one, so links to similar systems or framework suggestions would also be appreciated.

The field is known as "automatic wrapper extraction" and is an active area of research, but I haven't seen a good open source toolkit. A company called Lixto makes a commercial tool that may be of interest to you. I'd love to see an open source project that tackles this problem.
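To make the "simple template stuff" from the question concrete, here is a minimal sketch of a declarative extractor, using only Python's standard library. It is not an existing framework and not automatic wrapper extraction: the rules are written by hand, and the names TemplateExtractor, extract and the (tag, class) rule format are assumptions made up for this illustration. The idea is only that the page-specific part shrinks to a small table of rules.

```python
from html.parser import HTMLParser


class TemplateExtractor(HTMLParser):
    """Collects the text of elements that match hand-written (tag, class) rules."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules                      # field name -> (tag, CSS class)
        self.result = {name: [] for name in rules}
        self._stack = []                        # open elements as (tag, field or None)

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        field = None
        for name, (rule_tag, rule_class) in self.rules.items():
            if tag == rule_tag and rule_class in classes:
                field = name
                self.result[name].append("")    # start a new value for this field
                break
        self._stack.append((tag, field))

    def handle_endtag(self, tag):
        # Ignore stray end tags; otherwise pop back to the matching open tag,
        # which also discards unclosed void elements such as <br>.
        if tag not in [open_tag for open_tag, _ in self._stack]:
            return
        while self._stack:
            open_tag, _ = self._stack.pop()
            if open_tag == tag:
                break

    def handle_data(self, data):
        # Text belongs to every matched element that is currently open.
        for _, field in self._stack:
            if field is not None:
                self.result[field][-1] += data


def extract(html, rules):
    """Apply the rules to an HTML string and return {field: [values]}."""
    parser = TemplateExtractor(rules)
    parser.feed(html)
    parser.close()
    return {name: [v.strip() for v in values] for name, values in parser.result.items()}
```

A page template is then just data, for example rules = {"title": ("h2", "title"), "price": ("span", "price")}, and the crawler code stays the same from site to site.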

Related

What are the advantages of using "_()" in fprintf? [duplicate]

Why should I use GNU gettext for internationalization? I am working on developing WordPress themes, and that's what is bundled with WordPress and obviously what is recommended.
However, I come from a game development background where localization is done differently, and I can't see the advantage of gettext over the methods I used to use.
I am struggling with the decision: should I use gettext, which is recommended for this technology, or use the methods I am more familiar with from my game dev days, which also provide more extensibility and flexibility than gettext?
Some of my internal debate undoubtedly comes from my inexperience with gettext and my struggles with the POeditor UI. If you all could help me understand why gettext is so commonly used, and why I should use it, it would be much appreciated.
First of all, you didn't provide any examples or advantages of your technologies.
So why GetText?
Easy to translate. Translators don't have to bother themselves with any development concepts: they just take a .po file, open it up and translate in an easy-to-use editor where they have only source language, destination language and, possibly, plurals pattern.
Easy to maintain both for developers and for translators. A developer makes a .pot catalog, gives it to all of his/her translators, they update their .po files (again, with a single command — «update from pot file») and have new strings above the translated ones, so they can quickly update the translations.
Easy to develop internationalizable applications. In the best case you only have to wrap your strings with a _ function.
It's a cross-platform solution having ports for many OSes and programming languages.
So why not GetText? Particularly if it's a default solution in WordPress? I believe your theme/plugin won't be accepted by the community if your i18n technique is different from the one used in this particular community.
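As an illustration of the "wrap your strings with a _ function" point, here is a minimal sketch using Python's standard gettext module. It is not WordPress code (WordPress has its own PHP wrappers); the domain name "mytheme", the "locale" directory and the German language code are assumptions chosen for the example. With fallback=True the lookup quietly returns the original strings when no compiled catalog is found, so the code runs before any translation exists.

```python
import gettext

# Assumed layout: locale/de/LC_MESSAGES/mytheme.mo (hypothetical catalog).
# fallback=True yields a NullTranslations object if that file is missing,
# which returns the source strings unchanged.
translation = gettext.translation(
    "mytheme", localedir="locale", languages=["de"], fallback=True
)
_ = translation.gettext
ngettext = translation.ngettext


def greeting(name, unread):
    """Build a greeting from two translatable strings, one with plural forms."""
    hello = _("Hello, %s!") % name
    mail = ngettext(
        "You have %d new message.",
        "You have %d new messages.",
        unread,
    ) % unread
    return hello + " " + mail
```

The source strings double as the keys of the catalog, which is why translators only ever see source text, target text and plural forms.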

How are you integrating help into your WPF application? Any recommendations?

The question says it all really. If you are writing a WPF application, how are you integrating the application help? What is the state of play in mid-2013?
It seems that there is no clear answer to this from an afternoon with a search engine, but several options:
Write your own fancy tooltip based help (but where are you getting your data from?)
Use .CHM files and the Windows Forms help system (seems archaic to me).
Use Microsoft Help Viewer 1.X or Microsoft Help 2.0.
There is some confusion as to which is more recent / approved of by MS. It appears Help Viewer 1.X might be the recommended option over Microsoft Help 2.0. It doesn't help that the names are so similar...
What is the status of 2.0? Should we use it? Was it ever fully deployed?
Use a third-party product to author your help files and link to them somehow - DocToHelp/NetHelp, NetAdvantage on-line help, etc...
Furthermore, what XAML based mark-up / attributes are you using to provide the necessary context? What is the recommended method?
It seems surprising there is no clear path for supporting application based help in WPF.
My current preference is to use a third-party help authoring system to generate HTML-based help.
We then use a WebBrowser to display this help as needed. The authoring system we use makes it fairly easy to extract out a single page from the main help (each "topic" is a single HTML file, and can be included with full contents or not as desired).
Granted, this definitely felt like a bit of a nasty hack at first - but once we wrote the basic plumbing (some attached properties for xaml to specify attributes for context location and add behavior to trigger help, etc), it's fairly clean.
One very nice advantage to this approach, however, is a single help system build works perfectly in all contexts - we can include the documentation online, expose it locally for use in a browser, and use it with context from within our application directly.

Ways to get past the Inner-platform effect while still building highly customized web apps?

Feel free to answer the question in the title as generally as I posed it, I offer some more details and specifics below.
Currently I develop and maintain a somewhat legacy business app (ASP/SQL) that is highly customizable allowing for moderate to full customization on: custom fields, forms, views, reports, actions, events, workflows, etc. This customization is necessary in the domain we develop for and has allowed us to build a niche.
I have been reading up on the inner-platform effect and ways of implementing high level user defined customization and have concluded that we do suffer from many of the inner-platform effect problems because essentially we have created a high level abstraction on top of the SQL. The organization of custom fields is implemented in a similar way to the approach found here
http://blog.springsource.com/arjen/archives/2008/01/24/storing-custom-fields-in-the-database/
We use something similar to the meta database method described in that article. All customization is built around this approach, and in many ways we suffer from having a database on top of a database.
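For readers who have not seen this kind of design, the following is a minimal sketch of a meta-table layout for custom fields, using Python's built-in sqlite3 module. It assumes the approach resembles the common entity-attribute-value pattern; the table names, the columns and the load_custom_fields helper are all made up for this illustration and are not taken from the linked article or from the application described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

    -- Meta table: one row per user-defined field (the "schema on top of the schema").
    CREATE TABLE custom_field (
        id        INTEGER PRIMARY KEY,
        entity    TEXT NOT NULL,      -- which real table the field extends
        name      TEXT NOT NULL,
        data_type TEXT NOT NULL       -- type information kept as data, not as a column type
    );

    -- Value table: every custom value is stored as text, whatever its declared type.
    CREATE TABLE custom_value (
        field_id INTEGER NOT NULL REFERENCES custom_field(id),
        row_id   INTEGER NOT NULL,
        value    TEXT,
        PRIMARY KEY (field_id, row_id)
    );
""")
conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Acme')")
conn.execute(
    "INSERT INTO custom_field (id, entity, name, data_type) "
    "VALUES (1, 'customer', 'credit_limit', 'integer')"
)
conn.execute("INSERT INTO custom_value (field_id, row_id, value) VALUES (1, 1, '5000')")


def load_custom_fields(conn, entity, row_id):
    """Return the custom fields of one row, cast back to their declared types."""
    rows = conn.execute(
        "SELECT f.name, f.data_type, v.value "
        "FROM custom_field f JOIN custom_value v ON v.field_id = f.id "
        "WHERE f.entity = ? AND v.row_id = ?",
        (entity, row_id),
    ).fetchall()
    casts = {"integer": int, "text": str}   # type checking redone in application code
    return {name: casts[data_type](value) for name, data_type, value in rows}
```

Even in this small form, typing, validation and joins that the database would normally provide are reimplemented by hand, which is the inner-platform symptom in question.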
The end result is something that looks fantastic on paper, yet the more features are added and the more custom coding is done for clients, the more of a mess everything becomes. The more I read, the more I realize this is somewhat of an anti-pattern, and the more I find that very little has been written on the topic. I am trying to learn modern approaches to this problem and to find more discussion and articles on it. Are database systems such as CouchDB relevant to this type of application?
My question is clearly pretty general. It seems like there is a lot against this kind of application in favor of just "knowing and defining your domain better". Are there any good/better ways to implement this kind of application? I'm not looking for black and white answers, and any further readings on the subject would be fantastic. Thanks for any help.
My answer is to be conscious and clear about what a plugin is meant to do and what is a user setting. In that case, your platform and your settings are different things. Your application provides basic services and is unabashedly a platform. It may also provide an application built on that platform.
So in that case you focus on programmer interfaces instead of implementation possibilities.
The standard advice in CS is to create another level of abstraction, but I'm not sure that isn't the problem here.
The only advice I could give is to push as much functionality onto the database, given it's the platform. SQL Server supports custom functions, fields and stored (SQL) procedures.
Either that or try to pull repeated functionality into separate functions in ASP.

How to write a Large WinForms application?

I'm going to write a rather big/complex WinForm application such as Paint.NET, SharpDevelop, etc. I think one of the most important things to build such an application is to structure the project properly to increase maintainability and control the complexity.
So what kind of patterns or practices should I use? Any blog posts, papers, or open source projects are welcome. I'm trying to learn something from SharpDevelop, but it's rather huge for me to step into.
PS: I'm an experienced programmer who formerly focused on web development (ASP.NET, Rails, etc.), so I know some design principles and how to use them when implementing business logic. Maybe what I really need now is a sample to get started with a WinForms application, so that I can see how to handle the menus, controls and so on. I've learnt something about the MVP pattern but still don't feel confident enough to start a large, complex application.
For big projects the methodology and the tools you are using are just as important as the architectural design. You need to set up a source control system (like SVN) from day one. Also, it is very good to have a standard build procedure and to perform builds on a daily basis. The build procedure should include running all tests, which you should also put some effort into implementing from the start.
Regarding the structure, I believe the single most important thing is to divide your project into building blocks with minimal dependencies on each other. This way you will be able to think about one small part of the system at a time and not have to face its full complexity. It will also help you delegate some work to a fellow programmer, if you have the chance.
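One way to get building blocks with few dependencies is the MVP pattern mentioned in the question. The sketch below is in Python rather than C#, purely to show the shape of the idea; the names CounterView, CounterModel, CounterPresenter and FakeView are invented for this example. The presenter depends only on a small view interface, so it can be developed and tested without any form or control.

```python
from typing import Protocol


class CounterView(Protocol):
    """The only thing the presenter knows about the UI."""

    def show_count(self, value: int) -> None: ...


class CounterModel:
    """Plain state and rules, with no UI dependency."""

    def __init__(self) -> None:
        self.count = 0

    def increment(self) -> None:
        self.count += 1


class CounterPresenter:
    """Reacts to UI events, updates the model, and tells the view what to show."""

    def __init__(self, model: CounterModel, view: CounterView) -> None:
        self.model = model
        self.view = view

    def on_increment_clicked(self) -> None:
        self.model.increment()
        self.view.show_count(self.model.count)


class FakeView:
    """Stand-in for a real form; records what it was asked to display."""

    def __init__(self) -> None:
        self.shown: list[int] = []

    def show_count(self, value: int) -> None:
        self.shown.append(value)
```

In a real application the form would implement the view interface and forward its click events to the presenter, while tests would keep using a fake like the one above.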
In order to get started, I recommend that you first implement something minimal as quickly as possible. Then work to make it better and add functionality. This will keep you motivated, as you will have something concrete to work with. It will also help you identify major design flaws and important issues early enough to correct them.
This is a good beginners guide from Microsoft itself:
http://msdn.microsoft.com/en-us/beginner/default.aspx
check the Windows track there.
After mastering the basics, and since you are an experienced developer, you can also check the book "patterns & practices Application Architecture Guide 2.0", also from Microsoft.
I would imagine that many of the techniques that make for successful web projects will translate to Winforms projects. Start small and grow the application incrementally. Try to keep the entire application building/working while you add features one at a time.

PowerPoint and WPF

I really need a way of loading a .ppt document in my wpf application. Can anyone give me a hint, code sample?
Check out the following discussion thread. Dr. WPF also has an interesting article that might help you: Hosting Office in WPF Application
However, consider that license costs will be quite high for your scenario...
According to this article, the DSO Framer is no longer supported, so you will have to look for something else.
You may need to elaborate a bit more on your particular need to get a practical answer.
I don't think hosting PowerPoint (ppt) is a good option because it requires ppt to be installed on the target machine.... and if the target machine has ppt then you can use its API to save the document as HTML and open it in a WebBrowser control.
If the target machine doesn't have powerpoint you may look into some online file conversion service and try hooking up there to convert to HTML and still use WebBrowser control.
I definitely don't recommend wasting your time with DSOFramer - it's very unstable at best and it will just feel like you're one step away from making it work for a while but it doesn't work.
Another option is of course to write your own parser for ppt files; the OfficeOpenXML version of the files is definitely "parseable". I've done that for Word docx and it's relatively easy to get the coarse data out of the document (say shapes, text...), but the devil there is in the details. There are a zillion little features to implement.
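As a rough illustration of how far the "coarse data" part gets you, here is a minimal sketch in Python that pulls the text runs out of an Office Open XML presentation. It applies only to .pptx files, not to the older binary .ppt format, and it relies on two facts of the format: the file is a zip archive with slides stored as ppt/slides/slideN.xml, and text runs are DrawingML a:t elements. The function name slide_texts is made up for this example, and layout, shapes, formatting and everything else are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of DrawingML, which holds the text runs (<a:t>) inside each slide.
DRAWINGML_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"


def slide_texts(pptx_file):
    """Return {slide part name: [text runs]} for a .pptx path or file-like object."""
    texts = {}
    with zipfile.ZipFile(pptx_file) as archive:
        # Note: names are sorted as strings, so slide10.xml sorts before slide2.xml.
        for name in sorted(archive.namelist()):
            if name.startswith("ppt/slides/slide") and name.endswith(".xml"):
                root = ET.fromstring(archive.read(name))
                texts[name] = [t.text or "" for t in root.iter(DRAWINGML_NS + "t")]
    return texts
```

Getting from this to something that renders like PowerPoint is where the zillion little features come in.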

Resources