How best to screen scrape a password protected site on behalf of a 3rd party? - screen-scraping

I want to write a program that analyzes your fantasy baseball team and notifies you of recommended actions, possibly multiple times per day. The problem is, you aren't playing fantasy baseball on my site, you're playing on yahoo, or cbs, or espn, etc.
On the majority of these sites, fantasy teams and leagues are not public, so you must be logged in and a member of the league to see the teams in the league.
All that I need is the plain html for the team page on each of those sites to be sent to my server, where I can then parse and analyze the file and send user notifications.
The problem is that I need username/password combinations to easily get this data to my server when I need it, and I think there will be a lot of people who wouldn't want to entrust their yahoo/espn/cbs password to me.
I have come up with several possible ways to solve this problem:
The most obvious way is to ask for their credentials for the site on which their team is hosted. Then I could just programmatically log in and request the data I need. I'm guessing a number of people would be comfortable giving me their credentials, and a number of them not so much.
Write a desktop client, which the user then downloads. The client would require their credentials, but it could then basically do exactly the same thing that the server based version would do, log in, request the page, and send the page back to my server. The difference being that their password would never need to leave their desktop. Their computer would need to be on, and this program running for this method to work.
Write browser add-ons that navigate to the page I need, use the cookie that is saved from a previous login to login to the site, and send the page back to my server. This doesn't require my software to ever ask for their password, but if the cookie expires I am hosed, and I don't know much about browser add-ons besides.
I'm sure there are other options, but these are what I've come up with so far.
I have two questions:
1. What are the other possibilities for this type of task?
2. Am I over-estimating people's reluctance to give me their yahoo (for example) password? Is option (1) above the obvious choice?
It was suggested in the comments that I try yahoo pipes, and that looked like a promising suggestion so I explored it a bit. Having looked now at this, I don't think that is an option. So, it looks like I'll be going with option 1.

This is a problem I grappled with a couple of years ago when I wanted to do the same thing. Our site is http://benchcoach.com and the options we were considering were the following:
Original we considered getting the user's credentials and login. We would then log in and scrape their league and team info. The problem there is that after reading several of the various terms of service, this would definitely be violating the terms of service. On top of this, Yahoo! was definitely one of the sites we were considering and their users have email (where we could get access to sensitive data), and Yahoo! wallet. In addition, it would be pretty trivial for Yahoo/ESPN/CBS to block our programmatic logins by IP Address.
The solution we settled on (not 100% happy but it does seem to work) was asking our users to install a bookmarklet (like delicious, digg or reddit) which would post the current html page to our servers, where we could parse the data and load our database. If they were still logged into their Yahoo/ESPN/CBS account, we would direct them directly to the pages, otherwise, those sites would prompt for authentication. Clicking the bookmarklet once more, would post the page to our servers.
The pros of this approach was that we never collected anyone's credentials so any concern of security would have been alleviated. Secondly, it would make it impossible for Yahoo/ESPN/CBS to block access to our service since we would never be connecting directly to their servers but rather the user's browser would be posting the contents of their browser to our server.
The problems with this is that it takes 2 clicks to post a page to our site. For head to head leagues, we needed 3-4 pages so it would take our user 6-8 clicks to sync their league to our servers. We're still looking at options for this.
One important note is that I ran into the product manager of the Yahoo Fantasy Football site at a conference a year ago. We talked about how we were getting the Yahoo data, and he confirmed that getting credentials would violate their TOS and they may stop us. While I don't think they would have, it would have made it hard to invest time and energy to develop this only to have them block our site and pissing of users by closing their accounts.

A potentially more complicated answer could possibly be done with (for example) yahoo pipes.
Hypothetically, you create a pipe which prompts the user for their credentials and provides them with a url which contains their scraped data. They enter this URL in their site, and never have to provide their credentials directly. Even better, for the security-conscious, it would be possible to examine what the pipe was actually doing before entering any information.
The downside would be increased complexity (as well as you'd have to write and maintain the pipe). Having said that, you could provide a link directly to the published pipe from your site, to make things as easy as possible.

Option 1 is the obvious choice. People who trust your site will provide the details. There is no other way you can login to other site while screen scraping.

Related

User login in Django + React

I have looked through quite a few tutorials (e.g. this, this, and this) on user authentication in a full-stack Django + React web app. All of them simply send username and password received from the user to the backend using a POST request. It seems to me that, if the user leaves the computer unattended for a minute, anyone can grab his password from the request headers in network tools in the browser. Is this a valid concern that must be taken care of? If so, how should these examples be modified? A tutorial / example of the correct approach would be appreciated.
It seems to me that, if the user leaves the computer unattended for a minute, anyone can grab his password from the request headers in network tools in the browser
If the user leaves the computer unattended then what you are describing will probably be the least of his/her worries.
Authentication is a complex topic, if you really do not want to use existing libraries that handle this for you then you will need to spend quite some time to get things right (knowing that even then, risk 0 does not exist), the most basic thing being to never store plain text credentials on your DB and using https to transmit them over an encrypted connection. You can then start thinking about JWTs, avoiding local storage, CSRF and securing cookies, refresh tokens, etc.
You cannot do much however about cases like the one you describe of people giving away access to their computers or sharing their passwords with others except reminding them they should never do such a thing.
On a side note, if the user didn't have the network monitoring tool open when making the request to your website, opening it afterwards will not show the previously submitted plain text credentials (there are workarounds to this however)

Salesforce: How to automate report extraction as JSON/CSV

I am new to Salesforce, but am an experienced developer. I am provided a link to a Salesforce report, which mostly has the right filters (query). I would like to use an REST API to pull that information as CSV or JSON so that I can do further processing on it.
Here are my questions:
Do I need special permissions to make API calls? What are they?
Do I need to create an "app" with client-key & secret? Does my admin need to grant me permission for this too?
There are a lot of REST APIs from Salesforce, which one do I need to get the info from the report? Analytics?
How do I authenticate in code?
You'd have to work with the System Administrator on the security pieces. Anybody who knows how the company works, can all users see everything, is there Single Sign-On in place, how likely is the report to change...
You will need an user account to pull the data. You need to decide if it'll be some "system account" (you know username and password and have them stored in your app) or can it run for any user in this org. It might not matter much but reports are "fun". If there will be data visibility issues 6 months from now, you'll be asked to make sure the report shows only French data to French users etc... you can make it in report filters or have multiple reports - or you can just use current users access and then it's the sysadmin that has to set the sharing rules right. (would you ever think about packaging what you did and reusing in another SF instance? Making a mobile app out of it? Things like that, they may sound stupid now but will help you decide on best path)
The user (whether it'll be system account or human) needs Profile permissions like "API Enabled" + whatever else you'd need normally ("Run Reports" etc). If you're leaning towards doing it with system user - you might want to look at Password Policies and maybe set password to Never Expires. Now this is bit dangerous so there would be other things you might want to read up about: "API only user" (can't login to website), maybe even locking down the account so it can login only from certain IP ranges or at certain times when the job's supposed to be scheduled...
Connected App and OAUth2 stuff - it's a good idea to create one, yes. Technically you don't have to, you could use SOAP API to call login, get session id... But it's bit weak, OAuth2 would give you more control over security. If you have sandboxes - there's little-known trick. You can make connected app in production (or even totally unrelated Developer Edition) and use client id & secret from it to login to sandboxes. If you create app in sandbox and you refresh it - keys stop working.
(back to security piece - in connected app you can let any user allow/deny access or sysadmin would allow only say these 3 users to connect, "pre-authorize". Could be handy)
Login - there are few REST API ways to login. Depends on your decision. if you have 1 dedicated user you'll probably go with "web server flow". I've added example https://stackoverflow.com/a/56034159/313628 if you don't have a ready SF connection library in your programming language.
If you'll let users login with their own credentials there will be typical OAuth "dance" of going to the target page (Google login, LinkedIn, Twitter...) and back to your app on success. This even works if client has Single Sign-On enabled. Or you could let people type in their username and pass into your app but that's not a great solution.
Pull the actual report already
Once you have session id. Official way would be to use Reporting API, for example https://developer.salesforce.com/docs/atlas.en-us.api_analytics.meta/api_analytics/sforce_analytics_rest_api_get_reportdata.htm
A quick & dirty and officially not supported thing is to mimic what happens when user clicks the report export in UI. Craft a GET request with right cookie and you're golden. See https://stackoverflow.com/a/57745683/313628. No idea if this will work if you went with dedicated account and "API access only" permission.

Signup for email accounts from GAE app

I am faced with a rather strange request and there isn't much material online tackling that.
I am building a web app on GAE ... front end, back end, datastore, blob store, user accounts, the whole nine yards ...
Part of the requirements is to have a user communication system, (users sending messages to each other, just like Facebook) as user emails are not to be shared among other users, and the web app shall only send emails to the user sign up email strictly for security and administration purposes, and wont flood their inbox with notifications like some websites do.
I have narrowed narrowed it down to 4 options
Option 1:
Reinvent the wheel - Build this whole system form scratch on the Datastore and Blob store. However, not only is it expensive, but also I am not gonna go through all of that (just saying honestly)
Option 2:
Build a bouncing system ... User A sends message to app ... app bounces email to User B. Not very Elegant, impossible to create threads and conversations, eats up app Mail Quota used for Marketing and what not.
Option 3:
Host My own Email server onsite. Patch an API servlet and run the whole show through API. Very valid, except that the client doesn't want anything on site, and I wont be around to maintain it for him.
Option 4:(Best option if someone helps out)
Implement option 3 on a 3rd party email provider. Which brings us to the question, is there any respectable email provider that allows account sign up through API ?? I need to create a shadow email account on a 3rd party server(that the user will never know it exists) every time someone makes an account on my app. Then store all emails and their generated passwords in the Datastore, and when user logs in my web app, web app logs in 3rd party server, retrieves messages and serves it. When he wants to send a message, web app gets the message, sends an email using API as well. If someone knows how to do that on Gmail, I would be eternally grateful (but I highly doubt google allows that)
Note
I can implement the whole setup on xmpp/Jabber servers as well but these free servers keep changing all the time and they change their configurations ... bottom line they are not very reliable.
Thanks a lot guys !! I really appreciate any feed back and if you have any other suggestions please don't hesitate !! This is by no means a solid plan yet.

Access VisualForce Page without salesforce account

I'd like to create visualforce page that inserts a record into salesforce account object. However, I expect some of the page users won't have salesforce accounts. Can they still access it? If not, what are the alternatives that can be used to visualforce page in this case? (Please don't consider Web to Lead Forms).
Thanks,
Yes, it's possible. Go read about Salesforce Sites. For a start:
http://wiki.developerforce.com/page/Websites
http://wiki.developerforce.com/page/An_Introduction_to_Force.com_Sites
(of course it's also possible to write that page in say Java/.NET/PHP and use integration via SOAP or REST to talk to Salesforce... but these 2 main links will keep the whole solution within SF so no need to need to learn new language, have extra maintenance effort etc)
Sites are VF pages that expose a bit of your company's data without need to log in. You can use them to input data too, just remember that in theory anybody could learn the link and spam you (not too different from web2lead, inbound email handlers etc). You specify security in a way similar to Profiles, the records will have "Created By = {site name} Guest User".
I don't think there's anything out of the box to restrict visibility, they're open to whole world. So if you would want something similar to login IP ranges (so only sales reps from your office's network can enter data) - you might have to write some logic in the controller.

Email confirmation best practices for mobile apps

So I'm writing a mobile app and have reached a point where I need to allow users to register a username. I'm doing this by asking for an email address, username and password.
Typically, it's been normal to set this sort of thing up on the web by having the user confirm his email address by clicking on a link sent to his inbox.
Needless to say, on a mobile app this is a bit clunky as the user will be redirected out of your app and into his browser.
So I had a look at how other mobile apps are doing it (WP7) and was surprised to see that DropBox and Evernote both allow you to sign up without confirming your email address. The end result of this is that I was able to sign up with completely bogus email addresses and/or valid email addresses that don't belong to me.
I assume this is done on purpose.
Your thoughts?
I came across the same issue when writing a social networking style app. I chose to have the user create a username and then provide and email and password. I do not verify the email address and I've never attempted to send any email to them (yet).
What I would suggest would be alternate ways to validate a users email address. My app allows users to do Facebook Connect. All they have to do is log into Facebook, and the app talks to Facebook to confirm that they are using a valid email address. No need to verify it with a URL in an email.
I believe Twitter has a similar service and there may even be a few others that provide an API.
I've also discovered that a lot of people just want to tinker around in the app and not create an account at all. It's definitely a balancing act
I'd say it depends on your app and how important it is to ensure users have valid email addresses. In an app I'm creating now, we want to discourage users from signing up with multiple bogus accounts (because our system could be gamed that way) so we're not allowing users to log in until their email address if verified. On other sites however, it might not be such a big deal so why bother users with that extra step?
As for a mobile device, I don't see why you can't still send a verification email that sends them to your website to verify their email address. There are plenty of mobile apps that also have a website users can log into to manage their account.
Another option is have multiple "states" for users. Before they validate their email, they are in a "pending" state. Once they click it, they're in an "active" state. If you store the createDate for the user, you can periodically remove pending users older than 1 week (or however long).
The bonus is that you can easily add more states, such as suspended or deleted.
Personally, I wasn't too happy for users to create accounts with any old email address.
I think a few decent options are:
send a confirmation email with a link that uses a Custom Url Schema to redirect back to the app (although this is only good if they use the link on the same device)
send a short PIN in the email for them to enter back in the app.
send a confirmation email with a web link, have your server confirm the valid email/token, and have your app check the account status either periodically or with some sort of realtime tech like SignalR or Firebase.
I prefer the last one, although hardest to implement. A user might well have their phone in their hand and their laptop next to them, register in the app and try to click the link in the email that just showed up on their laptop. I like the idea of the app then just "knowing" that they've validated.
Do you have a web server? Write a web service that does the validation for you on the server side, and sends back the result.
Either you can use some platform, such as Facebook connect as #Brian replied above, or you may give users a reasonable timeframe to verify, for example, a few days or even a week. After that, the account gets removed.
You can even have your app issue notifications to remind the user to verify his account (such as every day, or on the last date of the verification.
Don't ask for email confirmation on mobile and allow the user to use the service. When the user is using a PC, then ask the user to confirm his email.
I won't defend my recommendation because most of the solutions here are valid. There isn't one correct way. You asked for ideas and here's one.
A good strategy is to allow people to use as much of your app as possible given the amount of data they've provided.
For example, in the case of a newsreader you might let someone browse your app without registering, then require an account for offline syncing, and a verified email for alerts. Always give people a good reason to take the next step, and build engagement first, then people will forgive you pestering them later.

Resources