Is it possible to get data scraped from websites using Scrapy and save that data in a Microsoft SQL Server database?
If yes, are there any examples of this being done? Is it mainly a Python issue? That is, if I find some Python code that saves to a SQL Server database, will Scrapy be able to do the same?
Yes, but you'd have to write the code to do it yourself, since Scrapy does not provide an item pipeline that writes to a database.
Have a read of the Item Pipeline page from the Scrapy documentation, which describes the process in more detail (it includes a JsonWriterPipeline as an example). Basically, find some code that writes to a SQL Server database (using something like PyODBC) and you should be able to adapt it into a custom item pipeline that outputs items directly to a SQL Server database.
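As a rough sketch of what such a pipeline could look like (the SqlServerPipeline class, the quotes table with its text and author columns, and the connection details are all assumptions for illustration, not part of Scrapy):
import pyodbc

class SqlServerPipeline(object):
    # Hypothetical pipeline: writes each scraped item to an assumed
    # quotes(text, author) table; adjust the connection string and SQL
    # to match your own server and schema.

    def open_spider(self, spider):
        self.conn = pyodbc.connect(
            'DRIVER={ODBC Driver 17 for SQL Server};'
            'SERVER=myserver;DATABASE=mydb;UID=user;PWD=password')
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        self.cursor.execute(
            "INSERT INTO quotes (text, author) VALUES (?, ?)",
            item.get('text'), item.get('author'))
        return item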
Super late and complete self-promotion here, but I think this could help someone. I just wrote a little Scrapy extension to save scraped items to a database: scrapy-sqlitem.
It is super easy to use.
pip install scrapy_sqlitem
Define your Scrapy items using SQLAlchemy tables:
from sqlalchemy import Table, Column, Integer, String, MetaData
from scrapy_sqlitem import SqlItem

metadata = MetaData()

class MyItem(SqlItem):
    sqlmodel = Table('mytable', metadata,
        Column('id', Integer, primary_key=True),
        Column('name', String, nullable=False))
Add the following pipeline:
from sqlalchemy import create_engine

class CommitSqlPipeline(object):

    def __init__(self):
        # swap in your mssql+pyodbc URL here to target SQL Server
        self.engine = create_engine("sqlite:///")

    def process_item(self, item, spider):
        item.commit_item(engine=self.engine)
        return item  # return the item so later pipelines still receive it
Don't forget to add the pipeline to your settings file (see the sketch after the links below) and to create the database tables if they do not exist.
http://doc.scrapy.org/en/1.0/topics/item-pipeline.html#activating-an-item-pipeline-component
http://docs.sqlalchemy.org/en/rel_1_1/core/tutorial.html#define-and-create-tables
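To activate the pipeline, the settings entry would look something like this (the myproject.pipelines module path is an assumption about your project layout):
# settings.py -- replace myproject with your actual project package
ITEM_PIPELINES = {
    'myproject.pipelines.CommitSqlPipeline': 300,
}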
I've reached the writing-to-a-SQL-Server-database part of my data journey, and I hope someone is able to help.
I've been able to successfully connect to a remote Microsoft SQL Server database using PyODBC. This allows me to pass SQL queries into dataframes and to create reports.
I now want to automate the manual "select import" method. I've read many blogs, but I'm none the wiser about how it all works behind the scenes.
import pandas as pd
import pyodbc
server = r'Remote SQL Server'
database = 'mydB'
username = 'datanovice'
password = 'datanovice'
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=' + server + ';DATABASE='
                      + database + ';UID=' + username + ';PWD=' + password)
cursor = cnxn.cursor()
I'm able to run queries easily this way and pass the results into dataframes.
What's the best way to write to my MS SQL database, given that it's not local? I'm happy to pass this through SQLAlchemy, but I wasn't sure of the correct syntax.
Things to consider:
This is a mission-critical database, and some of the DataFrames must be written as delete queries
If this is an unsafe method and I need to go back and study more to understand proper database methodology, I'm very happy to do so
I'm not looking for someone to write or provide the code for me, but rather to point me in the right direction
I envisage this to be something like the following, but I'm not sure how to specify the correct table:
df.to_sql('my_df', con, chunksize=1000)
As you've seen from the pandas documentation, you need to pass a SQLAlchemy engine object as the second argument to the to_sql method. Then you can use something like:
df.to_sql("table_name", engine, if_exists="replace")
The SQLAlchemy documentation shows how to create the engine object. If you use an ODBC DSN then the statement will look something like this:
from sqlalchemy import create_engine
# ...
engine = create_engine("mssql+pyodbc://scott:tiger@some_dsn")
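Putting it together, and to address the "how do I specify the correct table" part: the first argument to to_sql is the table name, and the optional schema keyword targets a specific schema. A minimal sketch (the DSN, credentials, table, and schema names below are all placeholders):
import pandas as pd
from sqlalchemy import create_engine

# DSN, credentials, table and schema names are placeholders
engine = create_engine("mssql+pyodbc://scott:tiger@some_dsn")
df = pd.DataFrame({"id": [1, 2], "name": ["alice", "bob"]})
df.to_sql("my_table", engine, schema="dbo",
          if_exists="append", index=False, chunksize=1000)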
I have a question I was hoping to get some feedback on, since it is my first time doing something like this.
I am building an intranet site and need to display a SQL view from SQL Server. I've done some research online and seen some people say to use PHP to connect to the DB and then HTML to build a table, but those scenarios were for MySQL or Access databases, not SQL Server.
My question is: what is the best way to go about connecting to and displaying a SQL Server view on an HTML5 page? (I'm mainly asking for suggestions on where to start, good documentation to look at, etc., not for someone to code it for me or anything like that.)
First of all, you do not want to connect to the database directly from client-side code; direct database queries should always be done on the server side.
I'd recommend using PHP because it's fairly simple, lightweight and widely supported.
You can query SQL Server with PHP, although you may need to download and install the PHP PDO drivers from Microsoft in order to do this. (Google 'sql server php pdo driver' and you'll be able to find them for your particular platform)
Alternatively you may be able to use an ODBC driver instead, although I haven't tried that personally.
In PHP you can use PDO to open a connection to your database and select from the view into an associative array, using something like the following:
$pdo_object = new PDO($dsn, $user, $password);
$statement = $pdo_object->prepare("SELECT column1, column2 FROM view1");
$statement->execute();
$data = $statement->fetchAll(PDO::FETCH_ASSOC);
From here you can iterate over this data to render a table, or pass it into JavaScript and have it populate a data grid.
I created the method below to connect to SQL Server using SQLAlchemy and PyODBC.
import urllib
import sqlalchemy as sa

def getDBEngine(server, database):
    # Python 2's urllib.quote_plus; on Python 3 use urllib.parse.quote_plus
    Params = urllib.quote_plus("DRIVER={SQL Server};SERVER=" + server +
                               ";DATABASE=" + database + ";TRUSTED_CONNECTION=Yes")
    Engine = sa.create_engine("mssql+pyodbc:///?odbc_connect=%s" % Params)
    return Engine
I'm then able to use that engine to read and write data via pandas methods like to_sql, as below.
def toSQL(data, Server, Database, Tablename):
    writeEngine = getDBEngine(Server, Database)
    data.to_sql(Tablename, writeEngine, if_exists='append')
My question is whether there is a simple way to check the connection/status of the engine before actually using it to read or write data. What's the easiest way?
One pattern I've seen used at multiple engagements is essentially an "is alive" check that is effectively select 1 as is_alive;. There's no data access, so it just checks whether your connection is receptive to receiving commands from your application.
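A minimal sketch of that check with SQLAlchemy (the is_alive helper is an illustration, not a library function):
from sqlalchemy import text

def is_alive(engine):
    # Hypothetical helper: True if the engine accepts a trivial query
    try:
        with engine.connect() as conn:
            conn.execute(text("SELECT 1 AS is_alive"))
        return True
    except Exception:
        return False
You could call it right before your toSQL write and skip or retry the write if it returns False.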
Is there any chance of connecting OpenCart with MSSQL? Has anyone tried? If so, what is the procedure for doing that?
That should not be a big problem. You only need to:
create a /system/database/mssql.php class - the class should have the same methods, properties, and functionality as, e.g., the mysql.php one
rewrite all of the model classes' queries to use MS SQL / T-SQL syntax
in both config files (/config.php and /admin/config.php) set the proper DB_DRIVER - mssql
I am assuming you have already created the OpenCart database from the /install/opencart.sql file.
I guess nothing more should be done.
Anyway, what is the reason for switching to MS SQL?
EDIT: In /system/database/ there is a mmsql.php file which actually contains the MSSQL class, so it does not have to be implemented from scratch - it just needs to be renamed to mssql.php.
This is probably going to be an underspecified question, as I'm not looking for a specific fix:
I want to run a machine learning algorithm on some data in a SQL Server database. I'd like to use R to do the calculations -- which would involve using R to connect to the database, process the data, and write a table of results back to the database.
Is this possible? My guess is yes. Shouldn't be a problem using a client...
However, would it be possible to set this up on a Linux box as a cron job?
Yes to all!
Your choices for scripting are either Rscript or littler, as discussed in this previous post.
Having struggled with connecting to MSSQL databases from Linux, my recommendation is to use RJDBC for database connections to MSSQL. I used RODBC to connect from Windows, but I was never able to get it working properly on Linux. To get RJDBC working you will need to have Java installed properly on your Linux box, and you may need to change some environment variables (it seems I always have SOMETHING misconfigured with rJava). You will also need to download and install the JDBC drivers for Linux, which you can get directly from Microsoft.
Once you get RJDBC installed and the drivers installed, the code for pulling data from the database will look something like the following template:
require(RJDBC)
drv <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver",
"/etc/sqljdbc_2.0/sqljdbc4.jar")
conn <- dbConnect(drv, "jdbc:sqlserver://mySqlServer", "userId", "Password")
sqlText <- paste("
SELECT *
FROM SomeTable
;")
myData <- dbGetQuery(conn, sqlText)
You can write a table with something like
dbWriteTable(conn, "myData", SomeTable, overwrite=TRUE)
When I do updates to my DB, I generally use dbWriteTable() to create a temporary table on the database server, then issue a dbSendUpdate() that appends the temp table to my main table, followed by a second dbSendUpdate() that drops the temporary table. You might find that pattern useful.
The only "gotcha" I ran into was that I could never get a Windows domain/username to work in the connection sequence. I had to set up an individual SQL Server account (like sa).
You may just write a script containing R code and put this in the first line:
#!/usr/bin/env Rscript
Change the file permissions to allow execution, and add it to the crontab as you would a bash script.
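For example, assuming the script lives at /home/me/myscript.R (the path and schedule are placeholders):
chmod +x /home/me/myscript.R
# crontab -e entry: run the script every night at 02:00
0 2 * * * /home/me/myscript.R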