I am interested in creating a web browser using C and the socket library (or any other library) under a Linux system.
The basic use of my web browser would be to render the HTML of a webpage into something readable to the user.
I just want someone to point me in the right direction. I also have a pretty good understanding of sockets and their system calls in C.
A pointer in the right direction, eh?
Well, a web browser consists of a whole mess of systems working together; even the most basic web browser must, at an absolute minimum, understand HTTP/1.1 and HTML.
It must be able to fetch pages from remote servers, parse the HTML into a DOM, render that to a viewport, capture mouse clicks, let them activate hyperlinks, and navigate to new pages.
But if it can only do that, it's a poor excuse for a web browser; even the simplest of web browsers should also be able to parse and apply CSS; display JPEG, PNG, and BMP images; parse XML; execute JavaScript; and deal with cookies, offline storage, plugins (such as Flash), and about a million other things.
The point I'm trying to make, of course, is that a web browser is in many ways a poor project for learning software development, because the overhead of even basic functionality is crippling.
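That said, since you mention C and sockets specifically, the fetch step on its own is quite approachable. Here is a minimal sketch of speaking HTTP/1.1 over a raw TCP socket; it's in Java rather than C, but the calls map almost one-to-one onto the C socket API, and the host is just a placeholder:

import java.io.*;
import java.net.Socket;

public class MiniFetch {
    public static void main(String[] args) throws IOException {
        // Open a TCP connection to the web server (port 80 = plain HTTP)
        try (Socket socket = new Socket("example.com", 80)) {
            PrintWriter out = new PrintWriter(socket.getOutputStream());
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream()));

            // Write a minimal HTTP/1.1 request by hand
            out.print("GET / HTTP/1.1\r\n");
            out.print("Host: example.com\r\n");
            out.print("Connection: close\r\n\r\n");
            out.flush();

            // Everything a browser must then parse arrives here:
            // the status line, the headers, then the HTML body
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}

Everything after this step - parsing, layout, rendering, scripting - is where the real overhead lives.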
Preface: I have broad college knowledge of a handful of languages (C++, VB, C#, Java, many web languages), so go with whichever you like.
I want to make an Android app that compares numbers, but in order to do that I need a database. I'm a one-man team, and the numbers get updated biweekly, so I want to grab them off of a wiki that gets updated as well.
So my question is: how can I access information from a website using one of the languages above?
What I understand the problem to be: some entity generates a data set (i.e. numbers) every other week, and you need to download that data set for processing (e.g. sorting).
Ideally, the web site maintaining the wiki would provide a service, such as a RESTful interface, to easily gather the data. If that were the case, I'd go with any language that provides easy manipulation of HTTP requests and responses and makes your data manipulation easy. As a previous poster said, Java would work well.
If you are stuck with the wiki page, you have a couple of options. You can parse the HTML your browser receives (Perl comes to mind as a decent language for that), or you can use tools built for that purpose, such as the aforementioned Jsoup.
Your question also mentions some implementation details such as needing a database. Evidently, there isn't enough contextual information for me to know whether that's optimal, so I won't address this aspect of the problem.
http://jsoup.org/ is a great Java tool for accessing content on HTML pages.
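For example, a rough sketch of pulling table cells off a wiki page with Jsoup; the URL and the CSS selector here are placeholders that you'd adjust to the actual structure of your wiki:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class WikiNumbers {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the page into a DOM (placeholder URL)
        Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Example").get();

        // Select every cell of every table row; tighten the CSS
        // selector once you know the real markup of your wiki page
        for (Element cell : doc.select("table tr td")) {
            System.out.println(cell.text());
        }
    }
}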
Consider https://scraperwiki.com/ - it's a site where users can contribute scrapers. It's free as long as you let your scraper be public, and the results of your scraper are exposed as CSV and JSON.
If you don't know what a "scraper" is, google "screen scraping" - it's a long and frustrating tradition for coders, who have dealt with the same problem you have since the beginning of networked computing.
You could check out http://web-harvest.sourceforge.net/.
For Python, BeautifulSoup is one of the most tolerant HTML parsers out there. The documentation also lists similar libraries in Ruby and Java, so you'll probably find something relevant there.
I am researching best practices for developing 'classic' style mobile sites, i.e., mobile sites that are delivered and experienced as mobile HTML pages vs. small JavaScript applications (jQuery Mobile, Sencha, etc.).
There are two prevailing approaches:
Deliver the same page structure (HTML) to all mobile devices, then use CSS media queries or JavaScript to improve the experience for more capable devices.
Deliver an entirely different page structure (and possibly content) to devices with enhanced capabilities.
I'm specifically interested in best practices for the second approach. Two good examples are:
MIT's mobile site: different for Blackberries and feature(less) phones than for iOS & Android devices, but available at the same URLs -- http://m.mit.edu/
CNN's mobile site: ditto -- http://m.cnn.com/
I'd like to hear from people here at SO who have actually worked on something like this and can explain the best practices for delivering this type of device-dependent structure/content/experience.
I don't need a primer on mobile user-agent detection, or WURFL, or any of the concepts covered in other (great) SO threads like this one. I've used jQuery Mobile and Sencha Touch and I'm familiar with most approaches for delivering the final mobile experience, so no pointers required there either, thanks.
What I really would like to understand is how these specific types of experiences are delivered in terms of server-side detection and delivery based on user-agent groups: one stripped-down page structure (different HTML) delivered to one group of devices, and another, richer type of HTML document delivered to newer devices, but both at the same sub-domain/URLs.
Hope that all makes sense. Many thanks in advance.
At NPR, we use a server-side 'application' to serve up the correct HTML/CSS/etc. depending on whether the user is on a high-end device or a lower-tier phone.
So, when a mobile device pings an npr.org page, our servers use user-agent detection to point it to the corresponding m.npr.org URL. Once directed there, the web app - which is written in Groovy, but I think could potentially be a number of things - sends back either the touch version of the site or the simpler, stripped-down content. The web app's choice is based at least in part on the WURFL data.
I don't have enough rep points to post a comparison with screenshots, so I'll have to point you to the sites themselves.
You can see this in your desktop browser by typing in m.npr.org to see the stripped down site. And you can override the default device detection by adding the parameter ?devicegate.client=iPhone_3_0 to see the touch version you would see if you just went to npr.org on your smartphone. If you view the source, you can see how different html & css is being served at the same subdomain.
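If it helps to see the shape of such a user-agent gate, here is a minimal sketch in Java servlet terms - not the actual Groovy app described above, and the device patterns and view paths are made up; a real implementation would consult WURFL rather than crude substring checks:

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.*;

public class DeviceGateServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String ua = req.getHeader("User-Agent");
        ua = (ua == null) ? "" : ua.toLowerCase();

        // Crude grouping, for illustration only
        if (ua.contains("iphone") || ua.contains("android")) {
            req.getRequestDispatcher("/views/touch.jsp").forward(req, resp);
        } else {
            req.getRequestDispatcher("/views/basic.jsp").forward(req, resp);
        }
    }
}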
Hope it helps seeing something like this in the wild. Does that make sense?
A common way to detect which format a mobile device needs is the Accept header:
application/xhtml+xml > XHTML
text/vnd.wap.wml > old WML WAP pages
...
On newer devices, which can handle all the desktop HTML formats, you can use the User-Agent header instead.
Then you have to ask yourself what you want to do:
Switch to another stylesheet (only works with newer devices).
Switch to another view logic, like building WML page templates.
Switch to a completely different page.
I think the second approach is the best one. Many web frameworks make it easy to switch to another view logic without rewriting the rest (the MVC pattern in all its glory).
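As a sketch of that second option, here is what the view switch might look like reduced to its core; the template names are made up, and a real framework would hang this off its routing or content-negotiation layer:

public class ViewChooser {
    // Pick a view template based on the Accept header. Note that many
    // desktop browsers also send application/xhtml+xml, which is why
    // the user agent is still needed for newer devices (see above).
    static String chooseView(String acceptHeader) {
        String accept = (acceptHeader == null) ? "" : acceptHeader;
        if (accept.contains("text/vnd.wap.wml")) {
            return "page.wml";    // old WAP phones
        } else if (accept.contains("application/xhtml+xml")) {
            return "page.xhtml";  // XHTML-capable browsers
        }
        return "page.html";       // everything else
    }

    public static void main(String[] args) {
        System.out.println(chooseView("application/xhtml+xml,text/html"));
    }
}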
I have two examples for you.
Read up on how Facebook achieves this, using XHP to abstract different output for different markups: One Mobile Site to Serve Thousands of Phones
There will be a lot of good stuff in their actual implementation, which I wish were available.
I use a framework called HawHaw, which lets you write your app once (in PHP objects or XML files); it outputs the correct markup to the device based on a few checks (Accept header, user-agent string, etc.).
I am lost on multi-language implementation. How should I handle it? Sessions, cookies, files, or some other way?
Overview
The website is a user-content site, like a social network. We will have system content controlled by us, and user content translated by users. Supported languages will be system-controlled; to start, there will be the top 20 supported languages. There are two user types (non-user and logged-in user), and both have pages, as not all pages are behind a log-in: non-users can still view many public pages or profile pages that are public.
Requirement
I want to access a public page in French (as an example) directly, without having to hit the site in English and then change the language to French. (optional)
For user content: if I want to translate a piece of English content into Italian (for example, a status update), I am looking to translate only that one piece of content, not the entire page. So the page is in English, but I can input Italian for that one item without converting the entire page into Italian.
Search for content based on language from one place. If I am reading reviews, I want to load only German reviews from the menu but not change other page content.
I want to view all wall posts that are in German. Can I do that straight from my profile by changing the language, or do I have to log out of that language session and log in with a new session for the new language, if this is session-based?
I am seeking to be able to change the language on any page, for any content, without requiring the user to log in or out.
I need to perform analytics for internal purposes based on language (like the number of wall posts by people in network X who posted content in Chinese), so I will need to track per language, per content item.
Other
I am still not sure whether the content will be database- or file-driven, but first I am looking into how I can best handle multi-language for scalability yet keep it user-friendly.
Suggestions?
This was probably not answered because of this section of the faq: "Your questions should be reasonably scoped. If you can imagine an entire book that answers your question, you’re asking too much."
https://stackoverflow.com/faq
Translation engine + search engine + user engine + analytics engine? Try learning and implementing one at a time. I'm going to answer just in case someone sees this and is still interested, but I'm not an expert in this area either, so I'll list what I did and what I think.
1st, create the language engine. A simple "Language" drop-down menu somewhere should be enough so far (for the visitors), with the database, cookies, session, and code correctly done. Create it as you like; what you listed is rather complex but perfectly achievable (see the sketch after this list).
2nd, add the user engine, including the database, log-in/log-out forms, code, and everything else needed, and put the two together. Each user needs a column in the "user" table with their preferred language. Slightly modify the language engine to support users. This should be easy to implement by now.
3rd (and this is still new for me), create the search engine.
4th, implement the analytics engine. I'd recommend using an external one, since it's much easier and more complete.
But, as stated, this is just my opinion.
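To make the first step a bit more concrete, here is a rough sketch of a language resolver in Java servlet terms (the question never names a stack, so treat the class name and the language list as made up): explicit choice first, then a saved cookie, then the browser's Accept-Language header.

import java.util.*;
import javax.servlet.http.*;

public class LanguageResolver {
    static final Set<String> SUPPORTED =
            new HashSet<>(Arrays.asList("en", "fr", "de", "it", "zh"));

    static String resolve(HttpServletRequest req) {
        // 1. Explicit choice from the language drop-down (?lang=fr)
        String lang = req.getParameter("lang");
        if (lang != null && SUPPORTED.contains(lang)) return lang;

        // 2. Previously saved choice in a cookie
        if (req.getCookies() != null) {
            for (Cookie c : req.getCookies()) {
                if ("lang".equals(c.getName()) && SUPPORTED.contains(c.getValue())) {
                    return c.getValue();
                }
            }
        }

        // 3. Fall back to the browser's preferred locale (Accept-Language)
        String fromHeader = req.getLocale().getLanguage();
        return SUPPORTED.contains(fromHeader) ? fromHeader : "en";
    }
}

Resolving the language per request like this is also what lets a visitor change language on any page without logging in or out.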
I hear about people writing these programs all the time, and I know what they do, but how do they actually do it? I'm looking for general concepts.
Technically, screen scraping is any program that grabs the display data of another program and ingests it for its own use.
Quite often, screen scraping refers to a web client that parses the HTML pages of a targeted website to extract formatted data. This is done when a website does not offer an RSS feed or a REST API for accessing the data programmatically.
One example of a library used for this purpose is Hpricot for Ruby, which is one of the better-architected HTML parsers used for screen scraping.
Lots of accurate answers here.
What nobody's said is don't do it!
Screen scraping is what you do when nobody's provided you with a reasonable machine-readable interface. It's hard to write, and brittle.
As an example, consider an RSS aggregator, then consider code that gets the same information by working through a normal human-oriented blog interface. Which one breaks when the blogger decides to change their layout?
Of course, sometimes you have no choice :(
In general, a screen scraper is a program that captures output from a server program by mimicking the actions of a person sitting at a workstation using a browser or terminal-access program. At certain key points, the program interprets the output and then takes an action or extracts certain amounts of information from it.
Originally this was done with character/terminal output from mainframes, for extracting data from or updating systems that were archaic or not directly accessible to the end user. In modern terms, it usually means parsing the output of an HTTP request to extract data or take some other action. With the advent of web services this sort of thing should have died away, but not all apps provide a nice API to interact with.
A screen scraper downloads the HTML page and pulls out the data of interest, either by searching for known tokens or by parsing it as XML or some such.
In the early days of PCs, screen scrapers would emulate a terminal (e.g. IBM 3270) and pretend to be a user in order to interactively extract or update information on the mainframe. In more recent times, the concept is applied to any application that provides an interface via web pages.
With the emergence of SOA, screen scraping is a convenient way to service-enable applications that aren't. In those cases, scraping the web page is the more common approach taken.
Here's a tiny bit of screen scraping implemented in JavaScript, using jQuery (not a common choice, mind you, since scraping is usually a client-server activity):
// Show my SO reputation score
var repval = $('span.reputation-score:first');
alert('StackOverflow User "' + repval.prev().attr('href').split('/').pop() +
      '" has (' + repval.html() + ') Reputation Points.');
If you run Firebug, copy the above code and paste it into the Console and see it in action right here on this Question page.
If SO changes the DOM structure / element class names / URI path conventions, all bets are off and it may not work any longer - that's the usual risk in screen scraping endeavors where there is no contract/understanding between parties (the scraper and the scrapee [yes I just invented a word]).
Typically, you have an HTML page that contains some data you want. What you do is write a program that will fetch that web page and attempt to extract the data. This can be done with XML parsers, but for simple applications I prefer to use regular expressions to match a specific spot in the HTML and extract the necessary data. Sometimes it can be tricky to create a good regular expression, though, because the surrounding HTML appears multiple times in the document. You always want to match a unique item as close as you can to the data you need.
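As a small illustration of that approach (the URL and pattern are examples only; real pages usually need tighter anchoring, as noted above), here's a Java sketch that pulls the <title> out of a fetched page:

import java.util.regex.*;

public class TitleScraper {
    public static void main(String[] args) throws Exception {
        // Fetch the raw HTML (placeholder URL)
        String html = new String(new java.net.URL("http://example.com")
                .openStream().readAllBytes());

        // Match the contents of the <title> tag
        Matcher m = Pattern.compile("<title>(.*?)</title>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(html);
        if (m.find()) {
            System.out.println("Page title: " + m.group(1).trim());
        }
    }
}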
"Screen scraping is what you do when nobody's provided you with a reasonable machine-readable interface. It's hard to write, and brittle."
Not quite true. I don't think I'm exaggerating when I say that most developers do not have enough experience to write decent APIs. I've worked with screen-scraping companies, and often the APIs are so problematic (ranging from cryptic errors to bad results) and so often lack the full functionality the website provides that it can be better to screen scrape (web scrape, if you will). The extranet/website portals are used by more customers/brokers than API clients and are thus better supported. In big companies, changes to extranet portals and the like are infrequent, usually because the portal was originally outsourced and is now just maintained. I refer more to screen scraping where the output is tailored, e.g. a flight on a particular route and time, an insurance quote, a shipping quote, etc.
In terms of doing it, it can be as simple as a web client that pulls the page contents into a string, plus a series of regular expressions to extract the information you want.
string pageContents = new WebClient().DownloadString("http://www.stackoverflow.com");
int numberOfPosts = ... // regex match against pageContents
Obviously in a large scale environment you'd be writing more robust code than the above.
"A screen scraper downloads the HTML page, and pulls out the data of interest either by searching for known tokens or parsing it as XML or some such."
That is a cleaner approach than regex... in theory. In practice, however, it's not quite so easy: most documents need to be normalized to XHTML before you can XPath through them. In the end, we found that fine-tuned regular expressions were more practical.