The Problem
I use a tool at work that lets me do queries and get back HTML tables of info. I do not have any kind of back-end access to it.
A lot of this info would be much more useful if I could put it into a spreadsheet for sorting, averaging, etc. How can I screen-scrape this data to a CSV file?
My First Idea
Since I know jQuery, I thought I might use it to strip out the table formatting onscreen, insert commas and line breaks, and just copy the whole mess into Notepad and save it as a CSV. Any better ideas?
The Solution
Yes, folks, it really was as easy as copying and pasting. Don't I feel silly.
Specifically, when I pasted into the spreadsheet, I had to select "Paste Special" and choose the format "text." Otherwise it tried to paste everything into a single cell, even if I highlighted the whole spreadsheet.
Select the HTML table in your tool's UI and copy it to the clipboard (if that's possible).
Paste it into Excel.
Save as a CSV file.
However, this is a manual solution, not an automated one.
Using Python:
For example, imagine you want to scrape forex quotes in CSV form from a site like fxquotes. Then:
from BeautifulSoup import BeautifulSoup
import urllib, string, csv, sys, os
from string import replace

# date range and the pieces of the OANDA fxhistory query string
date_s = '&date1=01/01/08'
date_f = '&date=11/10/08'
fx_url = 'http://www.oanda.com/convert/fxhistory?date_fmt=us'
fx_url_end = '&lang=en&margin_fixed=0&format=CSV&redirected=1'
cur1, cur2 = 'USD', 'AUD'
fx_url = fx_url + date_f + date_s + '&exch=' + cur1 + '&exch2=' + cur1
fx_url = fx_url + '&expr=' + cur2 + '&expr2=' + cur2 + fx_url_end

# fetch the page and strip the CSV data out of its <pre> block
data = urllib.urlopen(fx_url).read()
soup = BeautifulSoup(data)
data = str(soup.findAll('pre', limit=1))
data = replace(data, '[<pre>', '')
data = replace(data, '</pre>]', '')

# write the result to disk (note the trailing slash on the directory)
file_location = '/Users/location_edit_this/'
file_name = file_location + 'usd_aus.csv'
file = open(file_name, "w")
file.write(data)
file.close()
Edit: to get values from a table:
Example from palewire:
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
mech = Browser()
url = "http://www.palewire.com/scrape/albums/2007.html"
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html)
table = soup.find("table", border=1)
for row in table.findAll('tr')[1:]:
    col = row.findAll('td')
    rank = col[0].string
    artist = col[1].string
    album = col[2].string
    cover_link = col[3].img['src']
    record = (rank, artist, album, cover_link)
    print "|".join(record)
This is my Python version using the (currently) latest version of BeautifulSoup, which can be obtained using, e.g.,
$ sudo easy_install beautifulsoup4
The script reads HTML from the standard input, and outputs the text found in all tables in proper CSV format.
#!/usr/bin/python
from bs4 import BeautifulSoup
import sys
import re
import csv

def cell_text(cell):
    return " ".join(cell.stripped_strings)

soup = BeautifulSoup(sys.stdin.read())
output = csv.writer(sys.stdout)

for table in soup.find_all('table'):
    for row in table.find_all('tr'):
        col = map(cell_text, row.find_all(re.compile('t[dh]')))
        output.writerow(col)
    output.writerow([])
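For example, assuming the script is saved as table2csv.py and marked executable, you could pipe a saved page through it (hypothetical filenames):
cat page.html | ./table2csv.py > tables.csv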
Even easier (because it saves it for you for next time) ...
In Excel
Data/Import External Data/New Web Query
will take you to a url prompt. Enter your url, and it will delimit available tables on the page to import. Voila.
Two ways come to mind (especially for those of us that don't have Excel):
Google Spreadsheets has an excellent importHTML function:
=importHTML("http://example.com/page/with/table", "table", index)
Index starts at 1
I recommend doing a copy and paste-as-values shortly after import
File -> Download as -> CSV
Python's superb Pandas library has handy read_html and to_csv functions
Here's a basic Python 3 script, a minimal sketch built on those two functions, that prompts for the URL, which table at that URL, and a filename for the CSV:
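import pandas as pd

url = input("Enter the URL of the page with the table: ")
index = int(input("Which table on that page (0-based index)? "))
filename = input("Filename for the CSV: ")

# read_html returns one DataFrame per <table> found on the page
tables = pd.read_html(url)
tables[index].to_csv(filename, index=False)
print("Wrote", filename)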
Quick and dirty:
Copy out of browser into Excel, save as CSV.
Better solution (for long term use):
Write a bit of code in the language of your choice that will pull the html contents down, and scrape out the bits that you want. You could probably throw in all of the data operations (sorting, averaging, etc) on top of the data retrieval. That way, you just have to run your code and you get the actual report that you want.
It all depends on how often you will be performing this particular task.
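For instance, a minimal Python sketch of that approach (the URL and the single-table layout are hypothetical placeholders):
import csv
import urllib.request
from bs4 import BeautifulSoup

# hypothetical page; swap in your tool's URL
html = urllib.request.urlopen('http://example.com/report').read()
soup = BeautifulSoup(html, 'html.parser')

# dump the first table's cells straight into a CSV
with open('report.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in soup.find('table').find_all('tr'):
        cells = [c.get_text(strip=True) for c in row.find_all(['th', 'td'])]
        writer.writerow(cells)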
Excel can open a http page.
Eg:
Click File, Open
Under filename, paste the URL (e.g., the URL of this very question).
Click ok
Excel does its best to convert the html to a table.
It's not the most elegant solution, but it does work!
Basic Python implementation using BeautifulSoup, also considering both rowspan and colspan:
from BeautifulSoup import BeautifulSoup

def table2csv(html_txt):
    csvs = []
    soup = BeautifulSoup(html_txt)
    tables = soup.findAll('table')

    for table in tables:
        csv = ''
        rows = table.findAll('tr')
        row_spans = []
        do_ident = False

        for tr in rows:
            cols = tr.findAll(['th', 'td'])
            for cell in cols:
                colspan = int(cell.get('colspan', 1))
                rowspan = int(cell.get('rowspan', 1))
                if do_ident:
                    do_ident = False
                    csv += ','*(len(row_spans))
                if rowspan > 1: row_spans.append(rowspan)
                csv += '"{text}"'.format(text=cell.text) + ','*(colspan)
            if row_spans:
                for i in xrange(len(row_spans)-1, -1, -1):
                    row_spans[i] -= 1
                    if row_spans[i] < 1: row_spans.pop()
            do_ident = True if row_spans else False
            csv += '\n'

        csvs.append(csv)
        #print csv

    return '\n\n'.join(csvs)
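A quick usage sketch (hypothetical input) showing how the rowspan/colspan handling plays out:
html = '''<table>
<tr><th rowspan="2">Name</th><th colspan="2">Score</th></tr>
<tr><td>Math</td><td>English</td></tr>
<tr><td>Ann</td><td>90</td><td>85</td></tr>
</table>'''
print table2csv(html)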
Here is a tested example that combines grequests and BeautifulSoup to download large quantities of pages from a structured website:
#!/usr/bin/python
from bs4 import BeautifulSoup
import sys
import re
import csv
import grequests
import time

def cell_text(cell):
    return " ".join(cell.stripped_strings)

def parse_table(body_html):
    soup = BeautifulSoup(body_html)
    for table in soup.find_all('table'):
        for row in table.find_all('tr'):
            col = map(cell_text, row.find_all(re.compile('t[dh]')))
            print(col)

def process_a_page(response, *args, **kwargs):
    parse_table(response.content)

def download_a_chunk(k):
    chunk_size = 10  # number of html pages per chunk
    x = "http://www.blahblah....com/inclusiones.php?p="
    x2 = "&name=..."
    # pages k*chunk_size .. (k+1)*chunk_size - 1
    URLS = [x + str(i) + x2 for i in range(k*chunk_size, (k+1)*chunk_size)]
    reqs = [grequests.get(url, hooks={'response': process_a_page}) for url in URLS]
    resp = grequests.map(reqs, size=10)

# download slowly so the server does not block you
for k in range(0, 500):
    print("downloading chunk ", str(k))
    download_a_chunk(k)
    time.sleep(11)
Have you tried opening it with Excel?
If you save a spreadsheet in Excel as HTML, you'll see the format Excel uses.
From a web app I wrote, I spit out this HTML format so the user can export to Excel.
If you're screen scraping and the table you're trying to convert has a given ID, you could always do a regex parse of the HTML along with some scripting to generate a CSV.
Related
I am a new front-end developer. I am trying to create buttons on the front end that can export material-ui table data with pagination to CSV, Excel, and PDF. Is there any library I can use to do this?
I have good experience using the ExcelJS library, which supports not only CSV output but also Excel and other formats. The API is incredibly easy to use, and it also saves you from figuring out how to escape the characters. ExcelJS should work well whether you need file generation in the browser or server-side.
Assuming that you know the source of your table data (an array of rows), there are libraries you can use to convert that array into a PDF or XLS. For CSV you could do something like:
const rows = [
["name1", "city1", "some other info"],
["name2", "city2", "more info"]
];
let csvContent = "data:text/csv;charset=utf-8,";
rows.forEach(function(rowArray) {
let row = rowArray.join(",");
csvContent += row + "\r\n";
});
Then you only have to "open" the generated data URI in a new window, like:
const encodedUri = encodeURI(csvContent);
window.open(encodedUri);
I am relatively new to coding and I have a few issues I don't quite understand how to solve yet. I'm trying to build code that produces graphs from a ticker list, with the data downloading from Yahoo Finance. Setting aside manually assigning stock1 (and so forth) a ticker for a moment...
I want to figure out how to loop over the data that goes into making each graph, i.e. TSLA and MSFT in my code. So far I have the code below, in which I have already changed dfs and stocks. I just don't understand how to make the loop. If anyone has some good resources on loops as well, let me know.
Later, I would like to save the graphs as a .png with file names corresponding to the stock being pulled from Yahoo, so extra points if someone knows how to change this code (savefig = dict(fname="tsla.png", bbox_inches="tight"), which goes after style = 'default'). Thanks for the help!
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import datetime as dt
import mplfinance as mpf
import yfinance as yf
#yahoo info
start = "2020-01-01"
end = dt.datetime.now()
stock1 = 'TSLA'
stock2 = 'MSFT'
df1 = yf.download(stock1, start, end)
df2 = yf.download(stock2, start, end)
stocks = [[stock1],[stock2]]
dfs = [[df1],[df2]]
changingvars = [[stocks],[dfs]]
#graph1
short_sma = 20
long_sma = 50
SMAs = [short_sma, long_sma]
for i in SMAs:
    dfs["SMA_" + str(i)] = dfs.iloc[:,4].rolling(window=i).mean()

graph1 = mpf.plot(dfs, type='candlestick', figratio=(16,6),
                  mav=(short_sma, long_sma),
                  volume=True, title=str(stocks),
                  style='default')
plt.show()
Not sure why you are calculating your own SMA's, and grouping your stocks and dataframes, if your goal is only to create multiple plots (one for each stock). Also, if you are using mplfinance, then there is no need to import and/or use matplotlib.pyplot (nor to call plt.show(); mplfinance does that for you).
That said, here is a suggestion for your code. I've added tickers for Apple and Alphabet (Google), just to demonstrate how this can be extended.
stocklist = ['TSLA','MSFT','AAPL','GOOGL']
start = "2020-01-01"
end = dt.datetime.now()
short_sma = 20
long_sma = 50
for stock in stocklist:
    df = yf.download(stock, start, end)
    filename = stock.lower() + '.png'
    mpf.plot(df, type='candlestick', figratio=(16,6),
             mav=(short_sma, long_sma),
             volume=True, title=stock, style='default',
             savefig=dict(fname=filename, bbox_inches="tight")
            )
The above code will not display the plots for each stock, but will save each one in its own .png file locally (where you run the script) for you to view afterwards.
Note also that it does not save the actual data; only plots the data and then moves on to the next stock, reassigning the dataframe variable (which automatically deletes the previous stock's data). If you want to save the data for each stock in a separate csv file, that is easy to do as well with Pandas' .to_csv() method.
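For example, one extra line inside the same loop would do it (a sketch):
    df.to_csv(stock.lower() + '.csv')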
Also, I am assuming you are calling yf.download() correctly. I am not familiar with that API so I just left that part of the code as you had it.
HTH. Let me know. --Daniel
I am working on an Information Retrieval project, for which I am using Google Colab. I am in the phase where I have computed some features ("input_features") and I have the labels ("labels") from a for loop, which took me about 4 hours to finish.
So at the end I have appended the results to an array:
input_features = np.array(input_features)
labels = np.array(labels)
So my question would be:
Is it possible to save those results in order to use them for future purposes when using Google Colab?
I have found 2 options that could maybe be applied, but I don't know where these files are created.
1) To save them as csv files. And my code would be:
from numpy import savetxt
# save to csv file
savetxt('input_features.csv', input_features, delimiter=',')
savetxt('labels.csv', labels, delimiter=',')
And in order to load them:
from numpy import loadtxt
# load array
input_features = loadtxt('input_features.csv', delimiter=',')
labels = loadtxt('labels.csv', delimiter=',')
# print the array
print(input_features)
print(labels)
But I still don't get anything back when I print.
2) Save the results of the arrays by using pickle, where I followed these instructions from here:
https://colab.research.google.com/drive/1EAFQxQ68FfsThpVcNU7m8vqt4UZL0Le1#scrollTo=gZ7OTLo3pw8M
from google.colab import files
import pickle
def features_pickeled(input_features, results):
    input_features = input_features + '.txt'
    pickle.dump(results, open(input_features, 'wb'))
    files.download(input_features)

def labels_pickeled(labels, results):
    labels = labels + '.txt'
    pickle.dump(results, open(labels, 'wb'))
    files.download(labels)
And to load them back:
def load_from_local():
    loaded_features = {}
    uploaded = files.upload()
    for input_features in uploaded.keys():
        unpickeled_features = uploaded[input_features]
        loaded[input_features] = pickle.load(BytesIO(data))
    return loaded_features

def load_from_local():
    loaded_labels = {}
    uploaded = files.upload()
    for labels in uploaded.keys():
        unpickeled_labels = uploaded[labels]
        loaded[labels] = pickle.load(BytesIO(data))
    return loaded_labes
#How do I print the pickled files to see if I have them ready for use???
When using python I would do something like this for pickle:
#Create pickle file
with open("name.pickle", "wb") as pickle_file:
    pickle.dump(name, pickle_file)

#Load the pickle file
with open("name.pickle", "rb") as name_pickled:
    name_b = pickle.load(name_pickled)
But the thing is that I don't see any files being created in my Google Drive.
Is my code correct or do I miss some part of the code?
A long description, but hopefully it explains in detail what I want to do and what I have done about this issue.
Thank you in advance for your help.
Google Colaboratory notebook instances are never guaranteed to have access to the same resources when you disconnect and reconnect because they are run on virtual machines. Therefore, you can't "save" your data in Colab. Here are a few solutions:
Colab saves your code. If the for loop operation you referenced doesn't take an extreme amount of time to run, just leave the code and run it every time you connect your notebook.
Check out np.save. This function allows you to save an array to a binary file. Then, you could re-upload your binary file when you reconnect your notebook. Better yet, you could store the binary file on Google Drive, mount your drive to your notebook, and reference it like that.
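For example, a minimal np.save / np.load round trip (a sketch; the filenames are placeholders):
import numpy as np

np.save('input_features.npy', input_features)    # writes a binary .npy file
input_features = np.load('input_features.npy')   # loads it back as an ndarray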
# Mount driver to authenticate yourself to gdrive
from google.colab import drive
drive.mount('/content/gdrive')
#---
# Import necessary libraries
import numpy as np
from numpy import savetxt
import pandas as pd
#---
# Create array
arr = np.array([1, 2, 3, 4, 5])
# save to csv file
savetxt('arr.csv', arr, delimiter=',') # You will see the result if you click the File icon (left panel)
And then you can load it again by:
# You can copy the path when you find your file in the file icon
arr = pd.read_csv('/content/arr.csv', sep=',', header=None) # You can also save your result as a txt file
arr
I am trying to iterate through all the cells of a CSV row (name, screen_name and image url). Different errors show up; I tried with pandas, but I am still unable to finish the job. My CSV looks like this:
screen_name,name,image_url_https
Jan,Jan,https://twimg.com/sticky/default_profile_images/default_profile_normal.png
greg,Gregory Kara,https://twimg.com/profile_images/60709109/Ferran_Adria_normal.jpg
hillheadshow,Hillhead 2020,https://twimg.com/profile_images/1192061150915178496/cF6jOCRV_normal.jpg
hectaresbe,Hectaresbe,https://twimg.com/profile_images/1190957150996226048/lJnRnFwi_normal.jpg
Sdzz,Sanne,https://twimg.com/profile_images/1159005129879801856/8p6KC1ei_normal.jpg
and the part of the code that I need to change is:
import json
import time, os
import codecs
import pandas as pd

screen_name = 'mylist'
file = pd.read_csv("news2.csv", header=0)
col = file.head(0)
columns = list(col)
fp = codecs.open(screen_name + '.csv', 'w', encoding="utf-8")
i = 0
while True:
    try:
        i += 1
        print(i)
        name = ['name']
        uname = ['screen_name']
        urlimage = ['image_url_https']
The values are OK with @Snake_py's code; next I am doing a request:
myrequest='https://requesturl.com/'+uname
#print(myrequest)
resp=requests.get(myrequest)
I get the following error:
raise InvalidSchema("No connection adapters were found for '%s'" % url)requests.exceptions.InvalidSchema: No connection adapters were found for '0 https://requesturl.com/Jan
Name: name, dtype: object'
timeout error caught.
The easiest way to iterate through a CSV with Python would be:
name = []
uname = []
urlimage = []

with open('news2.csv', 'r') as file:
    next(file)  # skip the header row
    for row in file:
        row = row.strip().split(",")
        uname.append(row[0])    # screen_name column
        name.append(row[1])     # name column
        urlimage.append(row[2]) # image_url_https column

print(name)
print(uname)
print(urlimage)
First I created three empty lists. Then I open the file and iterate over each row in the file; splitting the row returns a list, so you can use normal indexing with [] to get the needed part and append it to the matching list.
For the method above you might run into an encoding problem, so I would recommend method 2, although you actually do not iterate over the rows then.
Alternatively you could just do:
import pandas as pd

file = pd.read_csv("news2.csv", header=0)
name = file['name']
uname = file['screen_name']
urlimage = file['image_url_https']
For the second method, you need to make sure that your header has the correct spelling
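As a side note, the InvalidSchema error in the question comes from concatenating an entire pandas column (a Series) into the URL instead of a single value; looping over the values fixes it (a sketch, using the request URL from the question):
import requests

for u in uname:
    resp = requests.get('https://requesturl.com/' + u)
    # ... process resp here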
I'm fairly new to Python programming and am attempting to extract data from a JSON array. The code below results in an error for
js[jstring][jkeys]['5. volume'])
Any help would be much appreciated.
import urllib.request, urllib.parse, urllib.error
import json

def DailyData(symb):
    url = "https://www.alphavantage.co/queryfunction=TIME_SERIES_DAILY&symbol=MSFT&apikey=demo"
    stockdata = urllib.request.urlopen(url)
    data = stockdata.read().decode()
    try:
        js = json.loads(data)
    except:
        js = None
    jstring = 'Time Series (Daily)'
    for entry in js:
        i = js[jstring].keys()
        for jkeys in i:
            return (jkeys,
                    js[jstring][jkeys]['1. open'],
                    js[jstring][jkeys]['2. high'],
                    js[jstring][jkeys]['3. low'],
                    js[jstring][jkeys]['4. close'],
                    js[jstring][jkeys]['5. volume'])

print('volume', DailyData(symbol)[5])
It looks like the reason for the error is that the data returned from the URL is more hierarchical than you may realize. To see that, print out js (I recommend using a Jupyter notebook):
import urllib.request, urllib.parse, urllib.error
import ssl
import json
import sqlite3
url = "https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=MSFT&apikey=demo"
stockdata = urllib.request.urlopen(url)
data = stockdata.read().decode()
js = json.loads(data)
js
You can see that js (now a python dict) has a "Meta Data" key before the actual time series begins. You need to start operating on the dict at that key.
Having said that, to get the data into a table-like structure (for plotting, time series analysis, etc.), you can use the pandas package to read the dict key directly into a dataframe. The pandas DataFrame constructor accepts a dict as input. In this case, the data was transposed, so the T at the end rotates it (try with and without the T and you will see it).
import pandas as pd
df=pd.DataFrame(js['Time Series (Daily)']).T
df
Added edit... You could get the data into a dataframe with a single line of code:
import requests
import pandas as pd
url = "https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=MSFT&apikey=demo"
data = pd.DataFrame(requests.get(url).json()['Time Series (Daily)']).T
DataFrame: the constructor from pandas that turns data into a table-like structure
requests.get(): method from the requests library to fetch data
.json(): directly converts from JSON to a dict
['Time Series (Daily)']: pulls out the key from the dict that is the time series
.T: transposes the rows and columns.
Good luck!
The following code worked for me:
import urllib.request, urllib.parse, urllib.error
import json

def DailyData(symb):
    # Your code was missing the ? after query
    url = "https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol={}&apikey=demo".format(symb)
    stockdata = urllib.request.urlopen(url)
    data = stockdata.read().decode()
    js = json.loads(data)
    jstring = 'Time Series (Daily)'
    for entry in js:
        i = js[jstring].keys()
        for jkeys in i:
            return (jkeys,
                    js[jstring][jkeys]['1. open'],
                    js[jstring][jkeys]['2. high'],
                    js[jstring][jkeys]['3. low'],
                    js[jstring][jkeys]['4. close'],
                    js[jstring][jkeys]['5. volume'])

# query multiple times, just to print one item?
print('open', DailyData('MSFT')[1])
print('high', DailyData('MSFT')[2])
print('low', DailyData('MSFT')[3])
print('close', DailyData('MSFT')[4])
print('volume', DailyData('MSFT')[5])
Output:
open 99.8850
high 101.4300
low 99.6700
close 101.1600
volume 19234627
Without seeing the error, it's hard to know what exact problem you were having.