How to collect data from a website - database

Preface: I have a broad, college knowledge, of a handful of languages (C++, VB,C#,Java, many web languages), so go with which ever you like.
I want to make an android app that compares numbers, but in order to do that I need a database. I'm a one man team, and the numbers get updated biweekly so I want to grab those numbers off of a wiki that gets updated as well.
So my question is: how can I access information from a website using one of the languages above?

What I understand the problem to be: Some entity generates a data set (i.e. numbers) every other week and you have a need to download that data set for treatment (e.g. sorting).
Ideally, the web site maintaining the wiki would provide a Service, like a RESTful interface, to easily gather the data. If that were the case, I'd go with any language that provides easy manipulation of HTTP request & response, and makes your data manipulation easy. As a previous poster said, Java would work well.
If you are stuck with the wiki page, you have a couple of options. You can parse the HTML your browser receives (Perl comes to mind as a decent language for that). Or you can use tools built for that purpose such as the aforementioned Jsoup.
Your question also mentions some implementation details such as needing a database. Evidently, there isn't enough contextual information for me to know whether that's optimal, so I won't address this aspect of the problem.

http://jsoup.org/ is a great Java tool for accessing content on html pages

Consider https://scraperwiki.com/ - it's a site where users can contribute scrapers. It's free as long as you let your scraper be public. The results of your scraper are exposed as csv and JSON.
If you don't know what a "scraper" is, google "screen scraping" - it's a long and frustrating tradition for coders, who have dealt with the same problem you have since the beginning of networked computing.

You could check out :http://web-harvest.sourceforge.net/

For Python, BeautifulSoup is one of the most tolerant HTML parsers out there. The documentation also lists similar libraries in Ruby and Java, so you'll probably find something relevant there.

Related

Saving code in database, what are pitfall I should be careful about

I am designing a system which takes user submitted code and saves it in database. Code can be in any language, ruby, python, elixir, javascript, etc. There's no restriction on language. Code saved in database is never meant to be run. It will be displayed in blog article or converted into file for download. Similar example might be GitHub gist or Cacher, both takes user submitted code and displays on website.
How do I make sure User submitted code is sanitised and secure to be displayed on webpage with code highlighter?
What processing do I need to do on code such that I can safely display it? I don't want to impose strict restrictions on users.
Any gotcha I need to be aware?
Any idea how those website implement this feature?
I am using Elixir and Phoenix framework. Is there any pitfalls I should be careful about? I am thinking of using Phoenix.HTML module to escape codes. I just wanna be sure that my approach doesn't have known loop holes.
I think you are looking for this https://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet

SCORM authoring tool

Im starting a new project. The aim of the project is to create a e-authoring tool for building courses in SCORM Complaint. Im new to this domain and I have little idea on this. I have taken a view on authoring tool in Articulate, which my customer requires to do the same. I understood the content creation, but I am trying to understand How can I export this as SCORM compliant course? In between I learned about xAPI as well And understood it is a kind of enhanced SCORM.
Could any one guide me to understand this,
1) How can create content from my custom authoring tool and export as SCORM complaint
2) Is it better to use xAPI or SCORM.
3) How is the SCORM pacakge communicate with my custom made LMS?
4) Heard about LRS,
My custom authoring tool will be made in React and store would MondDB
Any help would be greatly appreciated. Thankyou!
That is a lot to take on, particularly all at once.
1) The SCORM spec is made up of multiple parts. There is a packaging portion and a runtime portion. The basics are that your package needs to be a zip file, and that zip needs to include specific files that indicate to the LMS what type of standard it is along with other metadata about the package. For SCORM this will be called an imsmanifest.xml file. For xAPI you are most likely going to use a cmi5.xml (see cmi5) or a tincan.xml file (what Articulate Storyline exports when it says "xAPI"). The other parts of the package will depend on what standard and version of that standard (for SCORM 1.2, 2004 2nd, 3rd, or 4th edition) you are targeting, realizing that different LMSs support different standards and different degrees of those standards.
Once you have a package constructed that will import, the content itself (usually an HTML file) will need to locate the JavaScript API provided by the SCORM player (from the LMS) and make specific calls depending on what the content is needing to store or read, this is the runtime portion. The calls will again depend on the standard and version. For xAPI based packages (either tincan.xml packages or cmi5 packages) the content will communicate directly to the LRS based on the information provided on the URL at launch time (there is no built in JavaScript API).
2) This entirely depends on what your customer base looks like and the types of data that you intend to capture. SCORM is a more mature landscape and has wider adoption and is more heavily specified, if the information you need to capture fits into its limited information model then it is still an excellent choice. If you need significant data portability and/or the information you need to capture goes beyond compliance data (pass/fail, complete, and score) and/or interaction data (questions + answers) then you should consider xAPI, specifically via cmi5.
3) The LMS must provide a JavaScript API (specified by the SCORM runtime) which the content will use as its interface. The storage/retrieval of data is implementation specific for the LMS beyond what is included in the specification for the JavaScript API.
4) You didn't really include a question here.
I would suggest familiarizing yourself with the two sets of standards via http://scorm.com and http://xapi.com. And although it is a plug for my company's product, you may want to consider the Rustici Driver as it is a product (library) specifically designed to make it easy for an authoring tool to export content as SCORM 1.2, 2004, AICC, cmi5 or Tin Can (the latter two being xAPI). Once you have your tool up and running with minimal standards support you should consider testing it on Rustici's SCORM Cloud (it is free for this purpose), see http://cloud.scorm.com.
The format is huge, there is no quick reference guides. And different authoring tools have different scorm-support depths. You should probably start with this document
Sounds like you're talking about designing editable content; and the content "framework" itself.
This is a massive effort! This is massive support! That said, people do it.
Having built a CMS system for many supporting subject matters I had to divide and conquer this task.
Few ways I'd think to digest this beast- data, data, data
Requirements on Activities (Interaction types)
Design (static/dynamic) on these interactions
The view/facade displaying can change. Tech moves at the speed of light. Need to come up with a super solid data model.
I'd think about how these can be generic, and how they can be extended to meet the customers goals/needs. All depends how much customization (if any) can happen.
I start mapping all this to SCORM CMI Object level calls. Scoring, Progress, Interactions, Objectives etc...
Get your self a wicked SCORM Content API library or write one yourself. You'll be re-using a lot of these calls, no sense baking them into all your interactions
Get up on SCORM Packaging .. much of this has to be defined at author time. Lots of reading, and a lot of features you need to pick thru if your customers even use. Don't dev in places that have .1% market need. The low hanging fruit get you to market.
Surround yourself with passionate great people. You'll need them.
As far as the standards go, it's all about portability. SCORM works directly with a LMS if thats where your customer goes. Others use a LRS which is coded to work with one they set at author time. You can even do both.
Aside from React and MongoDB, you'll need something that can do the lift and shift of all this content.

CakePHP Beginner: Advice needed, Everything on a single view or multi part forms

Thanks in advance for any help offered and patience for my current web-coding experience.
Background:
I'm currently attempting to develop an web based application for my family's business. There is a current version of this system I have developed in C#, however I want to get the system web-based and in the process learn cakephp and the MVC pattern.
Current problem:
I'm currently stuck in a controller that's supposed to take care of a PurchaseTicket. This ticket will have an associated customer, line items, totals etc. I've been trying to develop a basic 'add()' function to the controller however I'm having trouble with the following:
I'm creating a view with everything on it: a button for searching customer, a button to add line items, and a save button. Since I'm used to developing desktop applications, I'm thinking that I might be trying to transfer the same logic to web-based. Is this something that would be recommended or do'able?
I'm running into basic problems like 'searching customer'. From the New Ticket page I'm redirecting to the customer controller, searching and then putting result in session variable or posting it back, but as I continue my process with the rest of the required information, I'm ending up with a bit of "spaghetti" code. Should I do a multi part form? If I do I break the visual design of the application.
Right now I ended up instantiating my PurchaseTicket model and putting it in a session variable. I did this to save intermediate data however I'm not sure if instantiating a Model is conforming to cakephp standards or MVC pattern.
I apologize for the length, this is my first post as a member.
Thanks!
Welcome to Stack Overflow!
So it sounds like there's a few questions, all with pretty open-ended answers. I don't know if this will end up an answer as such, but it's more information than I could put in a comment, so here I go:
First and foremost, if you haven't already, I'd recommend doing the CakePHP Blog Tutorial to get familiar with Cake, before diving straight into a conversion of your existing desktop app.
Second, get familiar with CakePHP's bake console. It will save you a LOT of time if you use it to get started on the web version of your app.
I can't stress how important it is to get a decent grasp of MVC and CakePHP on a small project before trying to tackle something substantial.
Third, the UI for web apps is definitely different to desktop apps. In the case of CakePHP, nothing is 'running' permanently on the server. The entire CakePHP framework gets instantiated, and dies, with every single page request to the server. That can be a tricky concept when transitioning from desktop apps, where everything is stored in memory, and instances of objects can exist for as long as you want them to. With desktop apps, it's easier to have a user go and do another task (like searching for a customer), and then send the result back to the calling object, the instance of which will still exist. As you've found out, if you try and mimic this functionality in a web app by storing too much information in sessions, you'll quickly end up with spaghetti code.
You can use AJAX (google it if you don't already know about it) to update parts of a page only, and get a more streamlined UI, which it sounds like something you'll be needing to do. To get a general idea of the possibilities, you might want to take a look at Bamboo Invoice. It's not built with CakePHP, but it's built with CodeIgniter, which is another open source PHP MVC framework. It sounds like Bamboo Invoice has quite a few similar functionalities to what you're describing (an Invoice has line items, totals, a customer, etc), so it might help you to get an idea of how you should structure your interface - and if you want to dig into the source code, how you can achieve some of the things you want to do.
Bamboo Invoice uses Ajax to give the app a feel of 'one view with everything on it', which it sounds like you want.
Fourth, regarding the specific case of your Customer Search situation, storing stuff in a session variable probably isn't the way to go. You may well want to use an autocomplete field, which sends an Ajax request to server after each time a character is entered in the field, and displays the list list of suggestions / matching customers that the server sends back. See an example here: http://jqueryui.com/autocomplete/. Implementing an autocomplete isn't totally straight forward, but there should be plenty of examples and tutorials all over the web.
Lastly, I obviously don't know what your business does, but have you looked into existing software that might work for you, before building your own? There's a lot of great, flexible web-based solutions, at very reasonable prices, for a LOT of the common tasks that businesses have. There might be something that gives you great results for much less time and money than it costs to build your own solution.
Either way, good luck, and enjoy CakePHP!

Apache module FORM handling in C

I'm implementing an Apache 2.0.x module in C, to interface with an existing product we have. I need to handle FORM data, most likely using POST but I want to handle the GET case as well.
Nick Kew's Apache Modules book has a section on handling form data. It provides code examples for POST and GET, which return an apr_hash_t of the key+value pairs in the form. parse_form_from_POST marshalls the bucket brigade and flattens it into a buffer, while parse_form_from_GET can simply reference the URL. Both routines rely on a parse_form_from_string routine to walk through each delimited field and extract the information into the hash table.
That would be fine, but it seems like there should be an easier way to do this than adding a couple hundred lines of code to my module. Is there an existing module or routines within apache, apr, or apr-util to extract the field names and associated data from a GET or POST FORM into a structure which C code can more easily access? I cannot find anything relevant, but this seems like a common need for which there should be a solution.
I switched to G-WAN which offers a transparent ANSI C scripts interface for GET and POST forms (and many other goodies like charts, GIF I/O, etc.).
A couple of AJAX examples are available at the GWAN developer page
Hope it helps!
While, on it's surface, this may seem common, cgi-style content handlers in C on apache are pretty rare. Most people just use CGI, FastCGI, or the myriad of frameworks such as mod_perl.
Most of the C apache modules that I've written are targeted at modifying the particular behavior of the web server in specific, targeted ways that are applicable to every request.
If it's at all possible to write your handler outside of an apache module, I would encourage you to pursue that strategy.
I have not yet tried any solution, since I found this SO question as a result of my own frustration with the example in the "Apache Modules" book as well. But here's what I've found, so far. I will update this answer when I have researched more.
Luckily it looks like this is now a solved problem in Apache 2.4 using the ap_parse_form_data funciton.
No idea how well this works compared to your example, but here is a much more concise read_post function.
It is also possible that mod_form could be of value.

How do screen scrapers work? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
I hear people writing these programs all the time and I know what they do, but how do they actually do it? I'm looking for general concepts.
Technically, screenscraping is any program that grabs the display data of another program and ingests it for it's own use.
Quite often, screenscaping refers to a web client that parses the HTML pages of targeted website to extract formatted data. This is done when a website does not offer an RSS feed or a REST API for accessing the data in a programmatic way.
One example of a library used for this purpose is Hpricot for Ruby, which is one of the better-architected HTML parsers used for screen scraping.
Lots of accurate answers here.
What nobody's said is don't do it!
Screen scraping is what you do when nobody's provided you with a reasonable machine-readable interface. It's hard to write, and brittle.
As an example, consider an RSS aggregator, then consider code that gets the same information by working through a normal human-oriented blog interface. Which one breaks when the blogger decides to change their layout?
Of course, sometimes you have no choice :(
In general a screen scraper is a program that captures output from a server program by mimicing the actions of a person sitting in front of the workstation using a browser or terminal access program. at certain key points the program would interpret the output and then take an action or extract certain amounts of information from the output.
Originally this was done with character/terminal outputs from mainframes for extracting data or updating systems that were archaic or not directly accessible to the end user. in modern terms it usually means parsing the output from an HTTP request to extract data or to take some other action. with the advent of web services this sort of thing should have died away, but not all apps provide a nice api to interact with.
A screen scraper downloads the html page, and pulls out the data interested either by searching for known tokens or parsing it as XML or some such.
In the early days of PC's, screen scrapers would emulate a terminal (e.g. IBM 3270) and pretend to be a user in order to interactively extract, update information on the mainframe. In more recent times, the concept is applied to any application that provides an interface via web pages.
With emergence of SOA, screenscraping is a convenient way in which to services enable applications that aren't. In those cases, the web page scraping is the more common approach taken.
Here's a tiny bit of screen scraping implemented in Javascript, using jQuery (not a common choice, mind you, since scraping is usually a client-server activity):
//Show My SO Reputation Score
var repval = $('span.reputation-score:first'); alert('StackOverflow User "' + repval.prev().attr('href').split('/').pop() + '" has (' + repval.html() + ') Reputation Points.');
If you run Firebug, copy the above code and paste it into the Console and see it in action right here on this Question page.
If SO changes the DOM structure / element class names / URI path conventions, all bets are off and it may not work any longer - that's the usual risk in screen scraping endeavors where there is no contract/understanding between parties (the scraper and the scrapee [yes I just invented a word]).
Technically, screenscraping is any program that grabs the display data of another program and ingests it for it's own use.In the early days of PC's, screen scrapers would emulate a terminal (e.g. IBM 3270) and pretend to be a user in order to interactively extract, update information on the mainframe. In more recent times, the concept is applied to any application that provides an interface via web pages.
With emergence of SOA, screenscraping is a convenient way in which to services enable applications that aren't. In those cases, the web page scraping is the more common approach taken.
Quite often, screenscaping refers to a web client that parses the HTML pages of targeted website to extract formatted data. This is done when a website does not offer an RSS feed or a REST API for accessing the data in a programmatic way.
Typically You have an HTML page that contains some data you want. What you do is you write a program that will fetch that web page and attempt to extract that data. This can be done with XML parsers, but for simple applications I prefer to use regular expressions to match a specific spot in the HTML and extract the necessary data. Sometimes it can be tricky to create a good regular expression, though, because the surrounding HTML appears multiple times in the document. You always want to match a unique item as close as you can to the data you need.
Screen scraping is what you do when nobody's provided you with a reasonable machine-readable interface. It's hard to write, and brittle.
As an example, consider an RSS aggregator, then consider code that gets the same information by working through a normal human-oriented blog interface. Which one breaks when the blogger decides to change their layout.
One example of a library used for this purpose is Hpricot for Ruby, which is one of the better-architected HTML parsers used for screen scraping.
You have an HTML page that contains some data you want. What you do is you write a program that will fetch that web page and attempt to extract that data. This can be done with XML parsers, but for simple applications I prefer to use regular expressions to match a specific spot in the HTML and extract the necessary data. Sometimes it can be tricky to create a good regular expression, though, because the surrounding HTML appears multiple times in the document. You always want to match a unique item as close as you can to the data you need.
Screen scraping is what you do when nobody's provided you with a reasonable machine-readable interface. It's hard to write, and brittle.
Not quite true. I don't think I'm exaggerating when I say that most developers do not have enough experience to write decents APIs. I've worked with screen scraping companies and often the APIs are so problematic (ranging from cryptic errors to bad results) and often don't give the full functionality that the website provides that it can be better to screen scrape (web scrape if you will). The extranet/website portals are used my more customers/brokers than API clients and thus are better supported. In big companies changes to extranet portals etc.. are infrequent, usually because it was originally outsourced and now its just maintained. I refer more to screen scraping where the output is tailored, e.g. a flight on particular route and time, an insurance quote, a shipping quote etc..
In terms of doing it, it can be as simple as web client to pull the page contents into a string and using a series of regular expressions to extract the information you want.
string pageContents = new WebClient("www.stackoverflow.com").DownloadString();
int numberOfPosts = // regex match
Obviously in a large scale environment you'd be writing more robust code than the above.
A screen scraper downloads the html
page, and pulls out the data
interested either by searching for
known tokens or parsing it as XML or
some such.
That is cleaner approach than regex... in theory.., however in practice its not quite as easy, given that most documents will need normalized to XHTML before you can XPath through it, in the end we found the fine tuned regular expressions were more practical.

Resources