I need a bit of guidance/advice. I have decided to build a web application but I’m having difficulty putting all the components together.
I’ve made basic websites in the past but have forgotten a lot of it. I studied JavaScript and Java in the past but I’m a little rusty so if you decide to reply please treat me like a person that’s new to all of this.
Basically I am having difficulty understanding the back end and front end of the whole web application and what exactly I will need. I've done some research and found out that I will need MySQL, a Tomcat or Apache server (I don't understand this part), Spring Tool Suite, knowledge of Java, and AngularJS for the front end. I have basic SQL knowledge. I'm having difficulty binding all this together.
The application I am trying to create is a prototype so not a full application. I’d like to be able to enter data into text boxes which is then visually represented on graphs/charts. I understand that AngularJS is capable of achieving this. Is AngularJS the same as JavaScript? Where does Java come in all of this? I thought a web programming language is needed so why is Java used? Is Java used for the backend? What will MySQL be used for and the Tomcat or Apache server also? I prefer to stick to JavaScript/Angular and Java if possible because I haven’t learned any other languages like PHP or C.
I am working on a Mac if anyone is wondering. I have the following already installed on my machine:
Latest JDK
Spring Tool Suite
MySQL
Could anyone please clarify this for me as best as possible? I don’t think it will take a lot for me to understand it all as I do have knowledge in computing but I’m a bit rusty since it’s been a while since I last used it all.
Any help is greatly appreciated!
There's a lot packed into this question. I'll try to answer it all, but bear with me. Because I hope to cover so much ground, keep in mind that a lot of what I say will be imprecise, and you should definitely read up some more on these topics. You're on the verge of entering a whole new world of programming possibilities if you enjoy it, so take your time and try to soak it all in at your own pace.
First, some basics on web technologies. Possibly the most fundamental thing to understand here is the separation between the client and the server. These terms are broadly used in software discussions to refer to the thing processing, storing, and providing data (the server) and the thing allowing the user to request, view, interact with, and make changes to that data (the client). What exactly the client and server are will vary widely based on what sort of context you're talking about, but in the world of web development:
The client is generally assumed to be a web browser. There are lots of different web browsers (e.g. Chrome, Firefox, Safari, Internet Explorer/Edge, Opera, ...) and they each have their own quirks, but by and large if you don't try anything too fancy and you stick to recent versions of browsers, your results will be basically the same no matter which browser you use to view your web page.
When talking about "the server", you can be referring to a few things depending on the sentence you use it in: the server-side code you write (more on this in a minute), the piece of software that handles your server-side code and serves your web pages (this would be Apache or Tomcat), or the physical collection of aluminum, silicon, and plastic on which the above software is running (that is, a computer, anything ranging from your laptop to a dedicated machine). For the purposes of this discussion we'll stick to the first two.
As I said earlier, the most important part of the client/server distinction is in the separation of concerns between presenting and enabling interaction with data on the client and processing and storing data on the server. As much as possible, try to keep those separate; it will make your life much easier in the long run. The client code shouldn't have any idea how the data is stored or processed; that's the server's job, and the client shouldn't need to know. Similarly, the code you write on the server shouldn't care in the least how the data is being formatted and presented to the user. Instead, it should try to provide data in a format that is easy to deal with in lots of different ways.
But I'm getting ahead of myself. First we need to talk about the technology questions you asked. So let's drill down a bit.
The Client
Usually, different programming languages are used on the server and the client[1]. There are three primary languages used in the browser, and each has a distinct purpose that complements the purpose of the other two. Forgive me (and feel free to skim or skip) if I tread ground you've already covered; I'm writing for the beginner here, so I want to make sure all the bases are covered.
HTML
HTML (the HyperText Markup Language) is the absolute foundation on which web pages are built. HTML's job is to describe the structure of the page and the data being presented in it. HTML separates things into sections, including headers, footers, articles, paragraphs, asides, and other such containers. HTML also tells the browser about images, videos, and Flash games, which text or picture should be a link to somewhere else on the web, and which text should be emphasized or where you're making a strong point. It can contain forms containing various sorts of input devices, and it can tell the browser where to send a form's content when you submit it. Put another way, it describes the composition of your document, the relationships between sections of your content, and, to some degree, the purpose of certain sections of the document[2].
HTML looks something like this:
<html>
  <head>
    <title>A Sample Web Page</title>
  </head>
  <body>
    <header>
      <h1>Sample Code FTW!</h1>
    </header>
    <main>
      <section id="introduction">
        <h1>Introduction</h1>
        <h2>Where it all begins...</h2>
        <p>
          This is the first paragraph of the intro.
        </p>
        <p>This is the second one.</p>
      </section>
      <section id="picture-show">
        <h1>Pictures</h1>
        <h2>They're fun!</h2>
        <p>Here is a cat:</p>
        <img src="https://i.imgur.com/MQHYB.jpg" />
        <p>Here are some more (the internet is full of these things):</p>
        <img src="https://i.imgur.com/sHQQJG5.gif" />
      </section>
    </main>
    <footer>
      <small>© KenB 2015; All rights reserved. (just kidding, do what you want with it)</small>
    </footer>
  </body>
</html>
Here's a link to see the above in action: http://codepen.io/kenbellows/pen/GZvmVy
As you saw if you clicked that link, you can technically make a whole web page with just HTML. The thing is, it's pretty boring. First, it's very plain; this is because HTML contains no formatting information. It has structure, but no layout, color, typeface, border, or any other presentation-related information; that's the job of CSS[3]. Second, it's entirely static; this is because HTML isn't made to describe dynamic content or behaviors. If you want anything to move, change, or react to user interaction, or if you want to display any dynamic content based on calculations made on the fly, that's where you need JavaScript.
Learn more about HTML here: https://developer.mozilla.org/en-US/Learn/Getting_started_with_the_web/HTML_basics
CSS
CSS (Cascading Style Sheets) solves the formatting problem. If you're writing a static page that just displays the same content every time you load it and doesn't require any user interaction, like a blog post, you probably only need HTML and CSS.
Here's just a little CSS for that HTML I posted up there:
body {
  font-family: "Helvetica Neue", Helvetica, Arial, sans-serif;
  color: #333;
  background-color: #ccc;
}

header, main, footer {
  margin: 1em auto;
  width: 60%;
  padding: 1em;
}

header {
  text-align: center;
}

header > h1 {
  font-size: 4em;
}

main {
  background-color: #fff;
}

main h1 {
  font-size: 2.5em;
  margin-bottom: 0;
}

main h2 {
  font-size: 1.5em;
  margin-top: 0;
}
And a link: http://codepen.io/kenbellows/pen/wGqewE
See what just a little styling can do for a document?
Learn more about CSS: https://developer.mozilla.org/en-US/docs/Web/Guide/CSS/Getting_started
JavaScript
JavaScript is what powers all the dynamic parts of a web page[4]. This is the most important section for the chart stuff you were talking about. JavaScript, unlike HTML and CSS, is a real, proper programming language, Turing-complete and everything, with all the usual branching, looping, function, and object-oriented constructs you would expect from a modern language. Because of this, it's a lot more complex than either HTML or CSS, and it can take as long to master as any other full-fledged programming language.
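To make that concrete, here's a tiny, self-contained taste of plain JavaScript; no HTML or browser required, and the data is invented just for the example. It shows the objects, loops, functions, and branching mentioned above:

```javascript
// An array of object literals holding some sample data
const scores = [
  { name: "Ada", points: 92 },
  { name: "Grace", points: 85 },
  { name: "Alan", points: 78 },
];

// A function that loops over the data and computes an average
function averagePoints(list) {
  let total = 0;
  for (const entry of list) {
    total += entry.points;
  }
  return total / list.length;
}

// Branching on the result
const avg = averagePoints(scores);
const verdict = avg >= 80 ? "passing" : "needs work";

console.log(`Average: ${avg} (${verdict})`); // Average: 85 (passing)
```

You can paste this into any browser console or run it with Node.js; no page, server, or framework involved.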
Learn more about JavaScript here: https://developer.mozilla.org/en-US/Learn/Getting_started_with_the_web/JavaScript_basics
Frameworks
Like any other modern language, JavaScript has plenty of libraries and frameworks written for it to help you handle a lot of the messier, boilerplatier parts of writing a web application. Angular.js, the one you mentioned, is a particularly popular framework at the moment, but you should know that it's only one of many, many frameworks. Don't get me wrong, though, it's a good one. But here's something to keep in mind: if you're just starting out with JavaScript as a language, it might not be the best idea to jump head first into a framework. For one, you're just compounding the amount of knowledge you'll need to get up and running; for another, and this is very important, you should probably learn the language in its own right before you start with a framework, to avoid becoming dependent on that framework. I've known a few too many developers who started with Angular right away, then a year later didn't have a solid grasp of why the code was behaving the way it was because they didn't take the time to get a firm understanding of the language fundamentals. Angular is wonderful, I use it all the time, but again, do yourself a favor and at least go through the MDN tutorial on JavaScript I linked above and try writing a couple toy applications before jumping into Angular.
A thought on your app idea
I actually think that for the purposes of your application, you don't need the server at all, other than for hosting it if you decide to publish it on the web or on your company's intranet or something. I'll talk about the server briefly below to answer your questions, but really, if all you need is to take in some user data and show it in real-time as a chart, you can do that with just some JavaScript. Angular would definitely be helpful since you sound like you want real-time updating charts. There are a few Angular-based chart/graph libraries out there that you should look into, e.g. https://jtblin.github.io/angular-chart.js/
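To show how little is actually required, here's a rough sketch of the "text boxes to chart" idea in plain JavaScript. The element ids (`data`, `chart`) and the inline-styled bars are placeholders I made up; a chart library like angular-chart.js would replace the rendering part:

```javascript
// Turn a string like "3, 7, 12" into an array of numbers, skipping junk
function parseValues(text) {
  return text
    .split(",")
    .map((s) => parseFloat(s.trim()))
    .filter((n) => !Number.isNaN(n));
}

// Scale each value to a bar width between 0 and 100 (percent of the widest)
function toBarWidths(values) {
  const max = Math.max(...values);
  return values.map((v) => Math.round((v / max) * 100));
}

// Browser-only wiring: assumes <input id="data"> and <div id="chart"> exist
if (typeof document !== "undefined") {
  const input = document.getElementById("data");
  input.addEventListener("input", () => {
    const widths = toBarWidths(parseValues(input.value));
    document.getElementById("chart").innerHTML = widths
      .map((w) => `<div style="width:${w}%;background:#36c;color:#fff">${w}</div>`)
      .join("");
  });
}
```

The two helper functions are pure logic, so they work anywhere; only the last block touches the page.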
The Server
Generally, the server handles data processing and storage. Instead of going into theory like I did for the client, let me answer your questions directly:
Tomcat and Apache - Tomcat and the Apache HTTP Server are two options for the web server piece. (These guys fit the second definition of "server" I gave above.) They're not quite interchangeable, though: Tomcat is a Java servlet container, meaning it actually runs your Java server-side code, while the Apache HTTP Server is a general-purpose web server; the two are often used together, with Apache sitting in front and handing Java requests off to Tomcat. Both are definitely popular choices.
Spring - Spring, including all its various modules and tooling (SpringMVC, Spring Security, Spring Tool Suite, etc), is a Java framework to help you write your server code (server definition #1). In my personal opinion (worth what you're paying for it), Spring is great for large, complex applications with lots of moving parts, but is unnecessarily complicated for a simple app like the one you described.
MySQL - MySQL is a relational database: it stores data in tables and provides an interface (the SQL language) for you to query that data in all sorts of convenient ways. Databases are great, and MySQL is a popular choice, but it's important to figure out whether you need a database for your project at all. Are you planning to store the data entered by the user for later use? If not, skip the database.
Other languages - Web server code can be written in any language you can run on the command line. If you like Java and you're good at it, stick with Java. If you want to be adventurous, maybe look into Node.js, a JavaScript server solution; you're learning JavaScript anyway, right? It's all about personal preference and what will get the job done for you. No need to learn PHP (and definitely no need to learn C, good god; please don't write your server code in C) just to write the backend for a simple app like the one you've got. Sounds like you're already learning a lot for this project; no need to add more to your plate.
1. (Might be best to read this footnote again after you're done with everything above.) One notable exception that's becoming more popular is the use of JavaScript on the server-side using Node.js. If you try JavaScript on the client-side and fall in love (like many of us do), maybe give Node a shot.
2. Read up on semantic markup once you've got the basics of how HTML works.
3. To be slightly more accurate, HTML shouldn't contain any of this information, although it technically can contain some of it. This is a holdover from the pre-CSS days of the late-90s/early-2000s, when a lot of sizing, font, and color information was stored in the markup. Please, do yourself a favor: leave formatting to CSS.
4. With the exception of some more advanced CSS rules you can use to get pretty basic hover-effects and simple animation. JavaScript is still necessary for anything involving calculations, iteration, logic, basically anything non-trivial.
What you are building is generally called a "full-stack" web application.
Apache and/or Tomcat handle the HTTP traffic between the front and back ends of the application. Specifically, Tomcat is where your Java code will live, in the form of Servlets and JSPs (JavaServer Pages). Java (like PHP) is only found on the back end, and is used for "gluing" together the web pages. The back end is responsible for connecting to and talking with the database (MySQL) for storing data, as well as moving data from one place to another.
Your front-end is all the HTML, CSS, and JavaScript of the application. It sends HTTP requests to the back-end in order to send and receive data. Usually, you only need to hit the back-end when a user submits a form, needs to load more information (using AJAX), or requests a whole new page.
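As a sketch of that request/response flow from the front end's side, here's what a form submission might look like with the browser's built-in fetch(). The /api/submit endpoint and the field names are invented for illustration; your back-end defines the real contract:

```javascript
// Build the JSON body the (hypothetical) back-end expects,
// keeping only the fields it knows about
function buildPayload(form) {
  return JSON.stringify({
    name: form.name,
    email: form.email,
  });
}

// POST the form data and parse the JSON reply
async function submitForm(form) {
  const response = await fetch("/api/submit", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: buildPayload(form),
  });
  return response.json();
}
```

On the Java side, a servlet (or Spring controller) would read that JSON body and send a JSON reply; the front end never knows or cares how the data is stored.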
AngularJS is a special creature. It IS JavaScript, so it lives on the front-end and does front-end type things, but it takes over much of the page-building work that traditionally happened on the back-end (the back-end still owns the data and talks to the database). AngularJS is a tool for creating a "single-page application", which means that where you would normally request whole new pages from the back-end, you instead request data and modify the current page.
Thanks in advance for any help offered, and for your patience with my limited web-coding experience.
Background:
I'm currently attempting to develop a web-based application for my family's business. There is a current version of this system I have developed in C#, but I want to make the system web-based and in the process learn CakePHP and the MVC pattern.
Current problem:
I'm currently stuck in a controller that's supposed to take care of a PurchaseTicket. This ticket will have an associated customer, line items, totals etc. I've been trying to develop a basic 'add()' function to the controller however I'm having trouble with the following:
I'm creating a view with everything on it: a button for searching customers, a button to add line items, and a save button. Since I'm used to developing desktop applications, I'm thinking that I might be trying to transfer the same logic to the web. Is this something that would be recommended or doable?
I'm running into basic problems like 'searching customer'. From the New Ticket page I'm redirecting to the customer controller, searching, and then putting the result in a session variable or posting it back; but as I continue the process with the rest of the required information, I'm ending up with a bit of "spaghetti" code. Should I do a multi-part form? If I do, I break the visual design of the application.
Right now I ended up instantiating my PurchaseTicket model and putting it in a session variable. I did this to save intermediate data; however, I'm not sure whether instantiating a Model like this conforms to CakePHP standards or the MVC pattern.
I apologize for the length, this is my first post as a member.
Thanks!
Welcome to Stack Overflow!
So it sounds like there's a few questions, all with pretty open-ended answers. I don't know if this will end up an answer as such, but it's more information than I could put in a comment, so here I go:
First and foremost, if you haven't already, I'd recommend doing the CakePHP Blog Tutorial to get familiar with Cake, before diving straight into a conversion of your existing desktop app.
Second, get familiar with CakePHP's bake console. It will save you a LOT of time if you use it to get started on the web version of your app.
I can't stress how important it is to get a decent grasp of MVC and CakePHP on a small project before trying to tackle something substantial.
Third, the UI for web apps is definitely different to desktop apps. In the case of CakePHP, nothing is 'running' permanently on the server. The entire CakePHP framework gets instantiated, and dies, with every single page request to the server. That can be a tricky concept when transitioning from desktop apps, where everything is stored in memory, and instances of objects can exist for as long as you want them to. With desktop apps, it's easier to have a user go and do another task (like searching for a customer), and then send the result back to the calling object, the instance of which will still exist. As you've found out, if you try and mimic this functionality in a web app by storing too much information in sessions, you'll quickly end up with spaghetti code.
You can use AJAX (google it if you don't already know about it) to update parts of a page only, and get a more streamlined UI, which it sounds like something you'll be needing to do. To get a general idea of the possibilities, you might want to take a look at Bamboo Invoice. It's not built with CakePHP, but it's built with CodeIgniter, which is another open source PHP MVC framework. It sounds like Bamboo Invoice has quite a few similar functionalities to what you're describing (an Invoice has line items, totals, a customer, etc), so it might help you to get an idea of how you should structure your interface - and if you want to dig into the source code, how you can achieve some of the things you want to do.
Bamboo Invoice uses Ajax to give the app a feel of 'one view with everything on it', which it sounds like you want.
Fourth, regarding the specific case of your Customer Search situation, storing stuff in a session variable probably isn't the way to go. You may well want to use an autocomplete field, which sends an Ajax request to the server each time a character is entered in the field, and displays the list of suggestions / matching customers that the server sends back. See an example here: http://jqueryui.com/autocomplete/. Implementing an autocomplete isn't totally straightforward, but there should be plenty of examples and tutorials all over the web.
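Stripped of any UI library, the core of an autocomplete looks something like the sketch below. The customer names are made up, and in a real app the matching would usually happen server-side behind that Ajax request; the debounce keeps you from firing a request on every keystroke:

```javascript
// Filter a customer list by a typed prefix (case-insensitive)
function matchCustomers(customers, typed) {
  const needle = typed.trim().toLowerCase();
  if (needle.length < 2) return []; // too short to be worth querying
  return customers.filter((c) => c.toLowerCase().startsWith(needle));
}

// Call fn only after the user has stopped typing for `delay` ms
function debounce(fn, delay) {
  let timer = null;
  return function (...args) {
    clearTimeout(timer);
    timer = setTimeout(() => fn.apply(this, args), delay);
  };
}

const customers = ["Acme Corp", "Acme Ltd", "Zenith LLC"];
// Browser wiring would look like:
// input.addEventListener("input", debounce(e => render(matchCustomers(customers, e.target.value)), 250));
```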
Lastly, I obviously don't know what your business does, but have you looked into existing software that might work for you, before building your own? There's a lot of great, flexible web-based solutions, at very reasonable prices, for a LOT of the common tasks that businesses have. There might be something that gives you great results for much less time and money than it costs to build your own solution.
Either way, good luck, and enjoy CakePHP!
Preface: I have broad college-level knowledge of a handful of languages (C++, VB, C#, Java, many web languages), so go with whichever you like.
I want to make an Android app that compares numbers, but in order to do that I need a database. I'm a one-man team, and the numbers get updated biweekly, so I want to grab those numbers off of a wiki that gets updated as well.
So my question is: how can I access information from a website using one of the languages above?
What I understand the problem to be: Some entity generates a data set (i.e. numbers) every other week and you have a need to download that data set for treatment (e.g. sorting).
Ideally, the web site maintaining the wiki would provide a Service, like a RESTful interface, to easily gather the data. If that were the case, I'd go with any language that provides easy manipulation of HTTP request & response, and makes your data manipulation easy. As a previous poster said, Java would work well.
If you are stuck with the wiki page, you have a couple of options. You can parse the HTML your browser receives (Perl comes to mind as a decent language for that). Or you can use tools built for that purpose such as the aforementioned Jsoup.
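Just to illustrate the "parse the HTML" option, here's a deliberately naive sketch in JavaScript. The sample markup is invented, and a regex like this only survives on very simple, predictable pages; for anything real, reach for a proper parser such as Jsoup (Java) or cheerio (Node):

```javascript
// Pull every numeric value out of the <td> cells of a fetched HTML string
function extractCellNumbers(html) {
  const numbers = [];
  const cellPattern = /<td[^>]*>([^<]*)<\/td>/g; // grabs each cell's text
  let match;
  while ((match = cellPattern.exec(html)) !== null) {
    const value = parseFloat(match[1].trim());
    if (!Number.isNaN(value)) numbers.push(value); // keep only the numeric cells
  }
  return numbers;
}

const sample = "<table><tr><td>Widgets</td><td>42</td><td>17.5</td></tr></table>";
console.log(extractCellNumbers(sample)); // [ 42, 17.5 ]
```

The equivalent Jsoup code in Java would select the cells with a CSS selector instead of a regex, which holds up much better when the wiki's markup changes.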
Your question also mentions some implementation details such as needing a database. Evidently, there isn't enough contextual information for me to know whether that's optimal, so I won't address this aspect of the problem.
http://jsoup.org/ is a great Java tool for accessing content on HTML pages.
Consider https://scraperwiki.com/ - it's a site where users can contribute scrapers. It's free as long as you let your scraper be public. The results of your scraper are exposed as CSV and JSON.
If you don't know what a "scraper" is, google "screen scraping" - it's a long and frustrating tradition for coders, who have dealt with the same problem you have since the beginning of networked computing.
You could check out :http://web-harvest.sourceforge.net/
For Python, BeautifulSoup is one of the most tolerant HTML parsers out there. The documentation also lists similar libraries in Ruby and Java, so you'll probably find something relevant there.
I am researching best practices for developing 'classic' style mobile sites, i.e., mobile sites that are delivered and experienced as mobile HTML pages vs. small JavaScript applications (jQuery Mobile, Sencha, etc.).
There are two prevailing approaches:
Deliver the same page structure (HTML) to all mobile devices, then use CSS media queries or JavaScript to improve the experience for more capable devices.
Deliver an entirely different page structure (and possibly content) to devices with enhanced capabilities.
I'm specifically interested in best practices for the second approach. Two good examples are:
MIT's mobile site: different for Blackberries and feature(less) phones than for iOS & Android devices, but available at the same URLs -- http://m.mit.edu/
CNN's mobile site: ditto -- http://m.cnn.com/
I'd like to hear from people here at SO who have actually worked on something like this and can explain the best practices for delivering this type of device-dependent structure/content/experience.
I don't need a primer on mobile user-agent detection, or WURFL, or any of the concepts covered in other (great) SO threads like this one. I've used jQuery Mobile and Sencha Touch and I'm familiar with most approaches for delivering the final mobile experience, so no pointers required there either thanks.
What I really would like to understand is: how these specific types of experiences are delivered in terms of server-side detection and delivery based on user-agent groups -- where there's one stripped down page structure (different HTML) delivered to one group of devices, and another richer type of HTML document delivered to newer devices, but both at the same sub-domain / URLs.
Hope that all makes sense. Many thanks in advance.
At NPR, we use a server side 'application' to serve up the correct html/css/etc depending on if the user is on a high-end device or a lower-tier phone.
So, when a mobile device pings an npr.org page, our servers use a user-agent detection method to point them to the corresponding m.npr.org. Once directed to the m.npr.org URL, the web app - which is written in Groovy, but I think could potentially be a number of things - sends back either the touch version of the site or the more simple, stripped-down content. The choice the web app makes is based at least somewhat on the WURFL data.
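The grouping step might look something like this in miniature. This is a toy sketch: NPR's actual logic is Groovy and backed by WURFL device data, and the substring checks below are my own simplification of that idea:

```javascript
// Map a raw User-Agent string to one of a few device tiers.
// A real system would consult WURFL or a maintained device database.
function deviceTier(userAgent) {
  const ua = (userAgent || "").toLowerCase();
  if (/iphone|android/.test(ua)) return "touch";      // richer, touch-oriented site
  if (/blackberry|midp|wap/.test(ua)) return "basic"; // stripped-down site
  return "desktop";                                   // fall through to the full site
}

// A server would then pick a template per tier, e.g.:
// const page = { touch: "touch.html", basic: "basic.html", desktop: "full.html" }[deviceTier(ua)];
```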
I don't have enough rep points to post a comparison with screenshots, so I'll have to point you to the sites themselves.
You can see this in your desktop browser by typing in m.npr.org to see the stripped down site. And you can override the default device detection by adding the parameter ?devicegate.client=iPhone_3_0 to see the touch version you would see if you just went to npr.org on your smartphone. If you view the source, you can see how different html & css is being served at the same subdomain.
Hope it helps seeing something like this in the wild. Does that make sense?
A common way to detect which format a mobile device needs is the Accept header:
application/xhtml+xml > XHTML
text/vnd.wap.wml > old WML WAP pages
...
On newer devices, which can handle all the usual desktop HTML formats, you can use the user-agent string instead.
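A minimal sketch of that Accept-header check, in JavaScript for concreteness (this ignores quality values like q=0.9, which real content negotiation must honour):

```javascript
// Pick a markup format from the raw Accept header string
function pickFormat(acceptHeader) {
  const accept = acceptHeader || "";
  if (accept.includes("text/vnd.wap.wml")) return "wml";      // old WAP phone
  if (accept.includes("application/xhtml+xml")) return "xhtml";
  return "html";                                              // safe default
}
```

On a Node server you'd call this with req.headers["accept"] and route to the matching view logic.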
Then you have to ask yourself what you want to do:
Switch to another Stylesheet (only works with newer devices).
Switch to another view logic, like building wml page templates.
Switch to a complete other page.
I think the second approach is the best one. Many web frameworks make it easy to switch to another view logic without rewriting the rest (the MVC pattern in all its glory).
I have two examples for you.
Read up on how Facebook achieves this using XHP to generate different output for different markups: One Mobile Site to Serve Thousands of Phones
There will be a lot of good stuff in their actual implementation which I wish was available.
I use a framework called HawHaw, which lets you write your app once (in PHP objects or XML files), and it outputs the correct markup to the device based on a few checks (Accept header, agent string, etc.).
I know it has great out-of-the-box features but is it easy to customize?
Like when I query stuff from the database or change CSS layouts.
Is it faster to create my own modules for it, or to just go ahead and write everything from scratch using a framework like Cake?
I'm currently working on an Elgg-based site and I absolutely hate it. The project was near completion when I stepped in, but the people who created were no longer available, so I took it over as a freelancer.
As a personal impression, you are much better off writing the app from scratch in a framework. I don't know if the people before me butchered it, but the code looks awful, the entity-based relationship model is weird to say the least, and debugging is horrendous. Also, from my point of view, it doesn't scale very well. If you were to have a sizable user base, I'd be really, really worried.
It keeps two global objects ($vars and $CONFIG) that have more than 5000(!) members loaded in memory on each page. This is a crap indicator.
I've worked extensively with cake. With Elgg, for about a month in a project that is on QA stage right now.
My advice is: if you need something quick with a lot of features and you only need to customize a little, go with Elgg.
If you're going to customize a lot and you can afford the development of all the forums, friends, invites, etc. features, go with Cake or any other MVC framework.
I have been working on an Elgg site for the past month or so. Its code is horrible, although it's not the worst I've seen :D. It's not built for programmers like Drupal is :D. But it's not too bad. Once I got a handle on the metadata functions and read most of the code, I was able to navigate it well and create custom modules and such.
What would help immensely would be some real documentation and explanation of the Elgg system. I don't think that's going to happen though :).
Out of the box there are a few problems; there are some bugs that haven't been fixed for a while, and I've had to go in and fix them myself. Overall, you can make it pretty and it has some cool functions, but I wouldn't dive in until I had read the main core code to get a handle on what's happening on the backend.
Oh, and there's massive use of globals for storing values, and a crap ton of DB calls (same with Drupal, though).
I wonder if storing everything, and I mean everything, for your site in globals will really hinder the server if you have a massive user load.
If you want to build a product based on a social networking platform/framework, then Elgg is definitely a good way to go. The code is not that bad if you actually look before leaping and do what Elgg expects. Go against its processes and structures and it will leave you beaten by the side of the road.
Developing modules/plugins or editing CSS is easy, and Elgg gives you great flexibility to basically build your own product on top of it. Dolphin, by comparison, does not allow you to do anything outside of what it expects you to do.
If you just need a framework (not primarily for social networking, etc.) with some user-based functionality, then I suggest Cake, or if your project is HUGE then maybe Symfony or Zend. They all have plugins you can download and use/hack, which would be easier to adjust for personalised needs.
To show what you can do with Elgg, here is a site, Mobilitate, we built with Elgg 1.7. This is a very complicated website and was built on top of Elgg.
We are starting a new project with Elgg 1.8. The new version is a major improvement: they have made a lot of elements easier, incorporated better JS and CSS implementation/structure, and have better commented their own code.
Elgg's database schema is horrific. They've essentially implemented a NoSQL database in SQL. It completely defeats the purpose of using a relational table structure.
If you can ignore this, and aren't doing much customization, you might be OK with Elgg. If not, STAY AWAY.
I've been working with Elgg for over a year. It is easier to customize than it would be to build something from scratch using a framework like CakePHP. I tried CakePHP and found it even more complicated than Elgg.
It is difficult to query the database due to the entity-based relationship model. You should use the built-in methods for accessing data. However, I have written many queries to double-check what is actually stored in the database.
You cannot change layouts using CSS alone. You have to deal with the various Elgg views. But CakePHP uses the same Model/View/Controller (MVC) concept, so that would be just as difficult.
I hear people writing these programs all the time and I know what they do, but how do they actually do it? I'm looking for general concepts.
Technically, screen scraping is any program that grabs the display data of another program and ingests it for its own use.
Quite often, screen scraping refers to a web client that parses the HTML pages of a targeted website to extract formatted data. This is done when a website does not offer an RSS feed or a REST API for accessing the data in a programmatic way.
One example of a library used for this purpose is Hpricot for Ruby, which is one of the better-architected HTML parsers used for screen scraping.
Lots of accurate answers here.
What nobody's said is don't do it!
Screen scraping is what you do when nobody's provided you with a reasonable machine-readable interface. It's hard to write, and brittle.
As an example, consider an RSS aggregator, then consider code that gets the same information by working through a normal human-oriented blog interface. Which one breaks when the blogger decides to change their layout?
Of course, sometimes you have no choice :(
In general a screen scraper is a program that captures output from a server program by mimicing the actions of a person sitting in front of the workstation using a browser or terminal access program. at certain key points the program would interpret the output and then take an action or extract certain amounts of information from the output.
Originally this was done with character/terminal output from mainframes, for extracting data or updating systems that were archaic or not directly accessible to the end user. In modern terms it usually means parsing the output of an HTTP request to extract data or to take some other action. With the advent of web services this sort of thing should have died away, but not all apps provide a nice API to interact with.
A screen scraper downloads the HTML page, and pulls out the data of interest either by searching for known tokens or by parsing it as XML or some such.
In the early days of PCs, screen scrapers would emulate a terminal (e.g. IBM 3270) and pretend to be a user in order to interactively extract and update information on the mainframe. In more recent times, the concept is applied to any application that provides its interface via web pages.
With the emergence of SOA, screen scraping is a convenient way to service-enable applications that aren't. In those cases, scraping the web pages is the more common approach taken.
Here's a tiny bit of screen scraping implemented in JavaScript, using jQuery (not a common choice, mind you, since scraping is usually a client-server activity):
//Show My SO Reputation Score
var repval = $('span.reputation-score:first');
alert('StackOverflow User "' + repval.prev().attr('href').split('/').pop() +
      '" has (' + repval.html() + ') Reputation Points.');
If you run Firebug, copy the above code and paste it into the Console and see it in action right here on this Question page.
If SO changes the DOM structure / element class names / URI path conventions, all bets are off and it may not work any longer - that's the usual risk in screen scraping endeavors where there is no contract/understanding between parties (the scraper and the scrapee [yes I just invented a word]).
Typically you have an HTML page that contains some data you want. What you do is write a program that will fetch that web page and attempt to extract that data. This can be done with XML parsers, but for simple applications I prefer to use regular expressions to match a specific spot in the HTML and extract the necessary data. It can be tricky to create a good regular expression, though, because the surrounding HTML appears multiple times in the document. You always want to match something unique, as close as you can to the data you need.
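A minimal sketch of that regex approach, in Python for illustration (the HTML fragment and the "Posts:" label are invented; the point is to anchor the pattern on something unique near the data rather than on generic markup that repeats elsewhere):

```python
import re

# Invented HTML fragment standing in for a fetched page
page = '<div id="stats"><b>Posts:</b> <span class="count">1,234</span></div>'

# Anchor the match on the unique "Posts:" label right next to the data,
# then capture the digits (commas included) inside the adjacent span.
match = re.search(r'Posts:</b>\s*<span class="count">([\d,]+)</span>', page)
number_of_posts = int(match.group(1).replace(',', ''))
print(number_of_posts)  # 1234
```

If the pattern were just `<span class="count">([\d,]+)</span>`, it could match any counter on the page; the label keeps it pinned to the right one.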
Screen scraping is what you do when nobody's provided you with a reasonable machine-readable interface. It's hard to write, and brittle.
Not quite true. I don't think I'm exaggerating when I say that most developers do not have enough experience to write decent APIs. I've worked with screen-scraping companies, and often the APIs are so problematic (ranging from cryptic errors to bad results), and so often fail to expose the full functionality the website provides, that it can be better to screen scrape (web scrape, if you will). The extranet/website portals are used by more customers/brokers than the API clients and are thus better supported. In big companies, changes to extranet portals etc. are infrequent, usually because the portal was originally outsourced and is now just maintained. I refer more to screen scraping where the output is tailored, e.g. a flight on a particular route and time, an insurance quote, a shipping quote, etc.
In terms of doing it, it can be as simple as using a web client to pull the page contents into a string and then using a series of regular expressions to extract the information you want.
string pageContents = new WebClient().DownloadString("http://www.stackoverflow.com");
int numberOfPosts = // regex match
Obviously in a large scale environment you'd be writing more robust code than the above.
A screen scraper downloads the HTML page, and pulls out the data of interest either by searching for known tokens or parsing it as XML or some such.
That is a cleaner approach than regexes... in theory. In practice it's not quite as easy, given that most documents will need to be normalized to XHTML before you can XPath through them; in the end we found the fine-tuned regular expressions were more practical.
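For comparison, here is what the path-expression approach looks like once a page really is well-formed XHTML, sketched with Python's standard-library `xml.etree.ElementTree` (the fragment, the `flights` table, and the `fare` class are all invented for illustration):

```python
import xml.etree.ElementTree as ET

# Invented, already well-formed XHTML fragment; real pages usually need
# a normalization pass (e.g. a tidy step) before an XML parser accepts them.
xhtml = """<html><body>
  <table id="flights">
    <tr><td class="route">LHR-JFK</td><td class="fare">420</td></tr>
    <tr><td class="route">LHR-SFO</td><td class="fare">510</td></tr>
  </table>
</body></html>"""

root = ET.fromstring(xhtml)
# ElementTree supports a limited XPath subset, enough for attribute predicates
fares = [td.text for td in root.findall('.//td[@class="fare"]')]
print(fares)  # ['420', '510']
```

The query is far more readable than the equivalent regex, but one stray unclosed tag in the source page and `fromstring` raises a ParseError, which is exactly the practical problem described above.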