Need ideas on retrieving data from a website - screen-scraping

I'm stumped and need some ideas on how to do this or even whether it can be done at all.
I have a client who would like to build a website tailored to English-speaking travelers in a specific country (Thailand, in this case). The different modes of transportation (bus & train) have good web sites for providing their respective information. And both are very static in terms of the data they present (the schedules rarely change). Here's one of the sites I would need to get info from: train schedules The client wants to provide users the ability to search for a beginning and end location and determine, using the external website's information, how they can best get there, being provided a route with schedule times for the different modes of chosen transport.
Now, in my limited experience, I would think the way to do that would be to retrieve the original schedule info from the external site's server (via API or some other means) and retain the info in a database, which can be queried as needed. Our first thought was to contact the respective authorities to determine how/if this can be done, but this has proven to be problematic due to the language barrier, mainly.
My client suggested what is basically "screen scraping", but that sounds like it would be complicated at best, downloading the web page(s) and filtering through the HTML for relevant/necessary data to put into the database. My worry is that the info on these mainly static sites is so static, that the data isn't even kept in a database to build the page and the web page itself is updated (hard-coded) when something changes.
I could really use some help and suggestions here. Thanks!

Screen scraping is always problematic IMO as you are at the mercy of the person who wrote the page. If the content is static, then I think it would be easier to copy the data manually to your database. If you wanted to keep up to date with changes, you could then snapshot the page when you transcribe the info and run a job to periodically check whether the page has changed from the snapshot. When it does, it sends an email for you to update it.
The above method could also be used in conjunction with some sort of screen scaper which could fall back to a manual process if the page changes too drastically.
Ultimately, it is a case of how much effort (cost) is your client willing to bear for accuracy

I have done this for the following site: http://www.buscatchers.com/ so it's definitely more than doable! A key feature of a web scraping solution for travel sites is that it must send you emails if anything went wrong during the scraping process. On the site, I use a two day window so that I have two days to fix the code if the design changes. Only once or twice have I had to change my code, and it's very easy to do.
As for some examples. There is some simplified source code here: http://www.buscatchers.com/about/guide. The full source code for the project is here: https://github.com/nicodjimenez/bus_catchers. This should give you some ideas on how to get started.

I can tell that the data is dynamic, it's to well structured. It's not hard for someone who is familiar with xpath to scrape this site.

Related

will gatling actually perform the operation or will it check only the urls' response time?

I have a gatling test for an application that will answer a survey and upon answering this survey, the application will identify possible answers that may pose a risk and create what we call riskareas. These riskareas are normally created in the background as soon as the survey answering is finished. My question is I have a gatling test with ten users who will go and answer the survey and logout, I used recorder to record the test; now after these ten users are finished I do not see any riskareas being created in the application. Am I missing something--should the survey be really answered by gatling (like it does in selenium) user or is it just the urls that the gatling test will touch ?
I am new to gatling please help.
Gatling should be indistinguishable from a user in a web browser (or Selenium) as far as the server is concerned, so the end result should be exactly the same as if you'd gone through the process yourself. However, writing a Gatling script is a little more work than writing a Selenium script.
For performance reasons, Gatling operates at a lower level than Selenium. Gatling works with the actual data that is sent and received from the server (i.e, the actual GETs and POSTs sent to the server), rather than with user-level interactions (such as clicking links and filling forms).
The recorder will generally produce a relaitvely "dumb" script. It records the exact data that was sent to the server, and makes no attempt to account for things that may change from run to run. For example, the web application you are testing might have hidden form fields that contain session information, or the link addresses might contain a unique identifier or a session id.
This means that your script may not be doing what you think it's doing.
To debug the script, the first thing to do is to add checks on each of the requests, to validate that you are getting the response you expect (for example, check that when you submit page 1 of the survey, you are taken to page 2 - check for something that you'd only expect to find on page 2, like a specific question).
Once you know which requests are failing, look at what data was sent with the request, and try to figure out where it came from. You will probably find that there are session ids, view state, or similar, that must be extracted from the previous page.
It will help to enable request and response logging, as per the documentation.
To simplify testing of web apps, we wrote some helper functions to allow tests to be written in a more Selenium-like way. Once you understand what your application is doing, you may find that it simplifies scripting for you too. However, understanding why your current script doesn't work the way you expect should be your first step.

When exactly a site uses it's database for retrieval

my knowledge on how a database exactly works is close to zero and I'm trying to understand when exactly a site uses it's database to retrieve information. So for example the site retrieves all the information the moment i load the site(so when i choose for example "funny pictures" it doesn't have to retrieve anything from the database) or it retrieves information only when i make a specific choice? I hope you kind of understand my question, I'm sorry for my bad English.
It depends how it has been implemented and with whihc technology.
Most of time, It load a specific set of information only relevant of the current context.
It depends on SW technology used, settings and site code. Normally it is set to show something only when needed, but it is possible to change this behaviour.
There is one more possibility you haven't noticed, but often used - for example, you have a set of pictures. You see all their small icons on the page, but the whole picture - only if you click on an icon. Then while you are looking on the icons, or on one of the pictures, others pictures are downloaded at the background, to be quickly shown when needed. Sometimes even some primitive prediction system works, to guess what you'll look at the next time.
And all these behaviours work not only for data from databases, but for all data on the pages.
Again, the main thought - it is for you to choose.

what is the best way to debug vCloud client REST applications?

I'm building a vClould client application via the REST APIs, however, the documentation is inconsistent an in some cases just wrong and misleading.
All I really need is a solid debug tool or even a log file. Any recommendations?
You already mentioned you have access to the message stream, which is one of the first steps. Typically if I'm using the Apache HttpClient/HttpComponents I'll go increase the log level so it logs the full HTTP requests.
My next step is usually to cheat and to log into vCD as a system administrator and see what's going on. When vCD was designed there was a very deliberate decision to not reveal infrastructure level problems to tenants of the cloud (normal org users or org admins), as that would break the cloud abstraction. Sadly, that means as an org-level user you're often going to get "contact your cloud admin" error responses. We are aware that this isn't ideal and try to find ways to make it better when we can (IIRC the new 5.5 release that was announced last month does have some improvements in that area).
The last step is usually to cheat even more and to look at the server side logs (vcloud-container-debug.log, specifically). That usually gives me a better clue as to what went wrong. Of course, you may be unlucky and not have access to the vCD cell machine.
My workaround in the latter two cases is to try the operations via the vCD UI and see (1) if they work as expected and (2) if they do, to check the system state via the API and see if I'm sending the wrong request payloads, etc. because the doc or schema reference may not have been clear enough.
In regards to the documentation, please use the feedback links () found on individual doc pages to let us know! Our technical writer reviews all the feedback and tries to address them.
My final suggestion is that you might want to post API questions to the vCloud API community forum VMware has. There are a number of experts (both users and VMware employees) that monitor it and respond to questions.

CakePHP Beginner: Advice needed, Everything on a single view or multi part forms

Thanks in advance for any help offered and patience for my current web-coding experience.
Background:
I'm currently attempting to develop an web based application for my family's business. There is a current version of this system I have developed in C#, however I want to get the system web-based and in the process learn cakephp and the MVC pattern.
Current problem:
I'm currently stuck in a controller that's supposed to take care of a PurchaseTicket. This ticket will have an associated customer, line items, totals etc. I've been trying to develop a basic 'add()' function to the controller however I'm having trouble with the following:
I'm creating a view with everything on it: a button for searching customer, a button to add line items, and a save button. Since I'm used to developing desktop applications, I'm thinking that I might be trying to transfer the same logic to web-based. Is this something that would be recommended or do'able?
I'm running into basic problems like 'searching customer'. From the New Ticket page I'm redirecting to the customer controller, searching and then putting result in session variable or posting it back, but as I continue my process with the rest of the required information, I'm ending up with a bit of "spaghetti" code. Should I do a multi part form? If I do I break the visual design of the application.
Right now I ended up instantiating my PurchaseTicket model and putting it in a session variable. I did this to save intermediate data however I'm not sure if instantiating a Model is conforming to cakephp standards or MVC pattern.
I apologize for the length, this is my first post as a member.
Thanks!
Welcome to Stack Overflow!
So it sounds like there's a few questions, all with pretty open-ended answers. I don't know if this will end up an answer as such, but it's more information than I could put in a comment, so here I go:
First and foremost, if you haven't already, I'd recommend doing the CakePHP Blog Tutorial to get familiar with Cake, before diving straight into a conversion of your existing desktop app.
Second, get familiar with CakePHP's bake console. It will save you a LOT of time if you use it to get started on the web version of your app.
I can't stress how important it is to get a decent grasp of MVC and CakePHP on a small project before trying to tackle something substantial.
Third, the UI for web apps is definitely different to desktop apps. In the case of CakePHP, nothing is 'running' permanently on the server. The entire CakePHP framework gets instantiated, and dies, with every single page request to the server. That can be a tricky concept when transitioning from desktop apps, where everything is stored in memory, and instances of objects can exist for as long as you want them to. With desktop apps, it's easier to have a user go and do another task (like searching for a customer), and then send the result back to the calling object, the instance of which will still exist. As you've found out, if you try and mimic this functionality in a web app by storing too much information in sessions, you'll quickly end up with spaghetti code.
You can use AJAX (google it if you don't already know about it) to update parts of a page only, and get a more streamlined UI, which it sounds like something you'll be needing to do. To get a general idea of the possibilities, you might want to take a look at Bamboo Invoice. It's not built with CakePHP, but it's built with CodeIgniter, which is another open source PHP MVC framework. It sounds like Bamboo Invoice has quite a few similar functionalities to what you're describing (an Invoice has line items, totals, a customer, etc), so it might help you to get an idea of how you should structure your interface - and if you want to dig into the source code, how you can achieve some of the things you want to do.
Bamboo Invoice uses Ajax to give the app a feel of 'one view with everything on it', which it sounds like you want.
Fourth, regarding the specific case of your Customer Search situation, storing stuff in a session variable probably isn't the way to go. You may well want to use an autocomplete field, which sends an Ajax request to server after each time a character is entered in the field, and displays the list list of suggestions / matching customers that the server sends back. See an example here: http://jqueryui.com/autocomplete/. Implementing an autocomplete isn't totally straight forward, but there should be plenty of examples and tutorials all over the web.
Lastly, I obviously don't know what your business does, but have you looked into existing software that might work for you, before building your own? There's a lot of great, flexible web-based solutions, at very reasonable prices, for a LOT of the common tasks that businesses have. There might be something that gives you great results for much less time and money than it costs to build your own solution.
Either way, good luck, and enjoy CakePHP!

DNN module sugestions needed, form/reg, pyment, emailing

i need a module that is kind of a cross between a registration module and a form module.
it need to allow for custom form fields to be saved to the DB and work as part of a flow such that once data is entered by the users they click next and see the data to confirm it is correct. at this point they should have the option to edit the data if they notice an error or continue to a payment page.
the payment page needs to have a module that can integrat with payment gateways liek paypal and accept credit cards. once credit card data is entered and the transaction is complete a custom email with a unique userNumber needs to be sent to the user.
i figure im lookign at three separate modules for this typeof work flow. but i hope since this is a standard type of register, pay, email confirm operation there may be a single module i can confugure to meet my needs.
thoughts? suggestions?
Have you looked at DNM RAD by DotNet Mushroom?
http://www.dotnetmushroom.com/DNMRADGeneral/GeneralInformation/WhatisDNMRAD/tabid/2347/Default.aspx
I have not had a use for this yet, but it is a module that I have on my short list in case the need comes up. They do state that they can work with pament gateways.
Good luck.
You might have to be somewhat flexible with your work flow if you want to used 100% canned modules.
FormMaster is a pretty good form solution. You can write to existing database tables, structure SQL tables or just the default is an XML file. It doesn't go through a preview before saving though.
FormMaster Website
Searching snowcovered.com you can certainly find something that can process a payment. That one shouldn't be too difficult.
I'm thinking you may need to sling some code to get the exact experience you are looking for.

Resources