How to migrate resources from proprietary CMS? - screen-scraping

I need to migrate our website from a proprietary CMS that uses active server pages. Is there a tool or technique that will help download the resources from the existing site? I guess I'm looking for a tool that will crawl and scrape the entire site.
An additional challenge is that the site uses SSL and is protected with forms-based authentication. I have the necessary credentials and I can grab the cookie that validates the session but I'm not sure where to go from here and I don't want to reinvent the wheel if existing tools can help me.
EDIT - I'm using Windows OS

wget may be a good tool for you to use
wget --load-cookies cookies.txt --mirror --page-requisites http://example.com/
add --convert-links if you wish to make it more suitable for a local archive, rather than something you can re-upload somewhere.
A Windows version of wget is available from the GnuWin32 project on SourceForge:
http://gnuwin32.sourceforge.net/packages/wget.htm
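Since the site is behind forms-based authentication, you can often let wget perform the login itself and capture the session cookie. A minimal sketch, assuming the login form posts fields named user and pass to /login.asp (check your site's actual form field names and login URL):

wget --save-cookies cookies.txt --keep-session-cookies --post-data 'user=USERNAME&pass=PASSWORD' https://example.com/login.asp
wget --load-cookies cookies.txt --mirror --page-requisites --convert-links https://example.com/

If the site's SSL certificate is self-signed, add --no-check-certificate.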

wget --http-user=username --http-password=password -r http://yoursite.com
This will fetch the entire site (recursively). If you're on Windows, you'll want to install Cygwin or something similar to use it, though I believe there are native Windows builds of wget that you can download.

If you know Perl, you might like WWW::Mechanize. Depends on the level of automation you are trying to achieve – wget would probably do just fine for some cases.

You have a lot of options. One thing to consider is how complex the authentication is. Besides wget, you can look at curl (a very robust option with bindings for many different languages), Python's urllib, Apache HttpClient, WWW::Mechanize, etc.
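For instance, to fetch a single authenticated page with curl using the session cookie you already captured (the cookie name here is just an example; classic ASP sites often use something like ASPSESSIONID):

curl --cookie "ASPSESSIONID=your-session-id" -L -O https://example.com/page.asp

Note that curl doesn't crawl recursively on its own, so for full-site mirroring wget is still the simpler choice.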

Related

Deploying AngularJs + Sinatra to AWS

I have an AngularJS site consuming an API written in Sinatra.
I'm simply trying to deploy these 2 components together on an AWS EC2 instance.
How would one go about doing that? What tools do you recommend? What structure do you think is most suitable?
Cheers
This is based upon my experience of using the HashiCorp line of tools.
Manual: Launch an Ubuntu image, gem install sinatra, and deploy your code. Take a snapshot for safekeeping. This one-off approach is good for a development box, to iron out the configuration process. Write down the commands you run and any options you may need.
Automated: Use the Packer EC2 Builder and Shell Provisioner to automate your commands from the previous manual approach. This will give you a configured AMI that can be launched.
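The shell provisioner essentially replays the commands you wrote down in the manual step. A minimal sketch of what it might run on an Ubuntu build instance (the package names are assumptions about your setup):

# executed by Packer's shell provisioner on the temporary build instance
sudo apt-get update
sudo apt-get install -y ruby ruby-dev build-essential
sudo gem install sinatra
# copy in your app code and register it to start on boot

Running packer build on your template then produces the AMI.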
You can apply different methods of getting to an AMI using different toolsets. However, in the end, you want a single immutable image that can be deployed repeatedly.

How to transfer live WordPress site to Wamp?

I've got a WordPress site that I have been using for a year now and it is hosted with HostGator. I have a few tests I would like to run on the site, but I would like to test them offline using WAMP first before making them LIVE.
The problem is that previously I was always making changes to the LIVE site, usually at hours when I get little to no traffic. However, that has changed, and I now get traffic most hours throughout a 24hr day.
So my problem is:
How do I download my existing website to my laptop (WAMP) and make those changes with the new theme? (total newbie, sorry!)
I use Windows 7, so I'm not sure what I need to do to get the site working like a live site offline.
Once I have implemented the new changes, what is the best way to upload the updated site back to the HostGator server without having any downtime or errors for site visitors?
Is there anything else I need to install or do in order for this to work? I hope you can give me as much information as possible or any links to any guides or articles that explain how to do this.
Thanks so much for any help you can offer!!!
If you're using Hostgator, the process is simple:
Install XAMPP or WAMP on your computer;
Go to your cPanel, back up, and download your website;
Extract the backup to your computer, especially the homedir and the SQL dump;
Go to your local environment and access http://localhost/phpmyadmin;
Create a new database; the name doesn't matter, but for this example let's call it "database";
Into that database, import the SQL dump taken from the backup;
Create a new folder inside your htdocs with the name of your website, e.g. "example.com";
Extract the content of the homedir there;
Edit wp-config.php with the following data (DB_HOST, DB_USER, DB_PASSWORD):
Host: 'localhost'
Username: 'root'
Password: blank
Access http://localhost/example.com
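If you prefer the command line over phpMyAdmin for the import step, a sketch (assuming the dump from your backup is named backup.sql and the local root user has no password, as above):

mysql -u root database < backup.sql

Note that WordPress stores the site URL in the wp_options table (the siteurl and home options), so you may need to update those to point at http://localhost/example.com before the local copy behaves correctly.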
You can check a good tutorial about the subject here.
About putting the site live, I recommend using a Git repository; however, it's understandable that this might be a little complicated and perhaps too much work for what you're trying to achieve.
Try moving your files directly from your local to your live environment using FileZilla or WinSCP; the drag and drop should replace the live files, and the downtime should be minimal.
Instead of WAMP, you can always use VirtualBox to install CentOS or Ubuntu/Debian.
You can go one step further and install either CentminMod, to automate creating a LAMP stack, or a full panel like ISPConfig or Virtualmin.
That takes care of creating the environment.
Create a new account on the LAMP, using the same domain name.
You can FTP from Windows to get the files, but networking Windows and Linux is a pain. The better option is to use the command line (CLI) in the Linux VM to FTP the files from HostGator to the VM. This guide will help with that process: http://www.tldp.org/HOWTO/FTP-3.html
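A sketch of pulling the files from inside the VM (the host, username, and path are placeholders for your HostGator FTP details):

wget -r --user='cpanel_user' --password='cpanel_pass' ftp://ftp.example.com/public_html/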
Then your only concern is the MySQL database. And for this, you have several options.
For me, the easiest is to buy (or try!) SQLyog on Windows, and then copy the database from the Hostgator source to the localhost destination. Some mild networking is needed for Windows to see the Linux VM, but nothing as complex as file sharing (the FTP issue). SQLyog is far quicker than backing up the database, then restoring it -- especially since you can run into memory issues doing it this way. It fully depends on the size of the database.
The cheap/free backup>restore method is to use phpMyAdmin.
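If you have shell access at both ends, the CLI equivalent is mysqldump (the database names and credentials here are placeholders):

# on the HostGator side (e.g. via SSH):
mysqldump -u dbuser -p wp_database > wp_backup.sql
# locally, into an empty database:
mysql -u root wp_local < wp_backup.sql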
WordPress also has plugins, of varying cost, but you still have the possible backup>restore memory issue there as well.
When done, just copy it the other way, again using SQLyog and CLI ftp. You'll still have some downtime, but it will hopefully be minimal.
As a newbie, this probably seems like rocket science, but at least it gives you a good place to start. Welcome to the world of locally dev'ing sites!

Packaging database into application seamlessly for users

I want to create a desktop application that uses a relational database (such as Postgres; let's say my best-case scenario is to use Postgres in this application).
I want users to be unaware of the database. Currently, I have to install Postgres on my local computer and have my application communicate with that.
I am using Go.
How can I avoid this?
You're looking for an embedded database.
This isn't an ideal job for PostgreSQL, but you can use it that way with a bit of care.
Please don't bundle the installer and run it unattended. Users who later go to install PostgreSQL will be very confused when they see it's already on their computer but they don't know why, who installed it, or what the password is.
Instead, initdb a new data directory inside your app's %APPDATA% or (for multi-user shared installs) in %PROGRAMDATA%. Set a custom port (don't use the default 5432). Create a new service with pg_ctl register, running as NETWORKSERVICE, or just start/stop it on demand with pg_ctl. That way you won't get in the way of any existing or future PostgreSQL installs, and you'll have a private PostgreSQL just for your app.
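A sketch of that setup on Windows (the paths, service name, and port are examples, not required values):

# one-time: initialize a private data directory for your app
initdb -D "%APPDATA%\MyApp\pgdata" -U myapp -E UTF8
# either register it as a Windows service...
pg_ctl register -N MyAppPostgres -D "%APPDATA%\MyApp\pgdata" -o "-p 5433"
# ...or just start/stop it on demand
pg_ctl start -D "%APPDATA%\MyApp\pgdata" -o "-p 5433"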
Please offer users the option of supplying a connection string for an existing PostgreSQL instance instead, though. It's a pain when apps insist on using their own embedded copy and you don't want them to.
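For example, a typical libpq-style connection string your app could accept instead (all values are placeholders): postgresql://appuser:password@dbhost:5432/appdb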
Often it's better to look at using SQLite, H2, Derby, Firebird, or one of the other embedded DBs, though.
Short version: you really can't; your best bet is to use SQLite or similar.
Long version: if you really, really want to, you can create unattended installers for your database targeting each platform you support, embed them into your application, and run them on the first launch.
Now that is just ugly and most users (myself included) would outright never use it.
You can always mention that your software depends on X and Y and provide information about how to manually install the dependencies.

How to "explore" group of servers?

I need to check a group of servers (Unix, Linux) to know what kinds of services and software (including versions) are running there (check it once in a while and store the results in a database).
The idea is to always have fresh info about the whole environment - it's constantly changing. Perhaps you can suggest some solution that already exists?
Currently I am thinking about using Nagios or Cacti + plugins, but I am not sure if this solution will be optimal.
Nagios is a very powerful monitoring solution (the best, for me): open source, compatible with both Linux and Windows, reporting and notifications via email/SMS, a nice interface, many, many plugins, etc. I've already worked with it and was very satisfied.
Check Nico Largo's forum for installation instructions. If you are not familiar with the Linux command line, search for FAN (Fully Automated Nagios), an ISO image with Nagios already installed.
If you have any trouble during install or configuration post your questions there : https://serverfault.com/
Given that you want to poll for information on the system that can change dynamically, I would look at Check_MK.
It originally started as a plugin for Nagios that would poll a server for running services and generate the necessary configs for monitoring anything it discovered. Since then, it has evolved into a complete monitoring solution that provides its own complete UI (still based on the Nagios core), so you are safe running this if you are familiar with Nagios already.
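For example, with the classic Check_MK command line, discovering and registering a host's services might look like this (the hostname is a placeholder, and the exact flags vary between Check_MK versions):

check_mk -I myserver01   # inventory: auto-discover services on the host
check_mk -O              # generate the Nagios config and reload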
See the website: http://mathias-kettner.com/checkmk_monitoring_system.html
You may need to select that you wish to view the "English" perspective of the site on first visit.

How to download all datastore entities on Google App Engine?

I've read the GAE docs, and I can't seem to figure out how to download all my entity data.
What I'd love to do is download the whole thing as a big TSV file (or something I can easily munge into one), so I can import my various entities into a spreadsheet and fiddle with them.
But I'm stuck at the starting gate. I don't understand the first few bits of the docs: "This document applies to apps that use the master/slave datastore. If your app uses the High Replication datastore..." -- I'm not even sure which I have, or how I would tell.
Assuming I have the simpler master/slave, the docs continue: "...you can use the Python appcfg.py tool by installing the Java version of the remote_api handler..." but, again, I'm not quite sure what they mean or where I find this appcfg.py tool.
Sorry for such a n00b question, but is there some sort of walk-through? I just want to download my datastore!
Thanks!
Master/Slave is the default (for now), so that's almost certainly what you're using. You can confirm at https://appengine.google.com/ -> app-id -> Administration -> Application Settings -> Datastore Replication Options.
To download your data, first install the Java version of the remote_api handler, then use appcfg.py:
appcfg.py download_data --application=<app-id> --url=http://<appname>.appspot.com/[remote_api_path] --filename=<data-filename>
There is nothing you need to do other than follow Google's own documentation; there is no workaround or walkthrough needed. I am not sure about Java, but in Python the only thing you need to do is enable remote_api in your app.yaml.
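For the Python runtime, enabling it typically means adding the builtin to app.yaml; a minimal sketch (check the docs for your SDK version):

builtins:
- remote_api: on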
appcfg.py can be found in the root directory of the App Engine SDK.
