BeautifulSoup scraping book catalogue - screen-scraping

import urllib2
from bs4 import BeautifulSoup

for i in range(1, 1000000):
    url = "http://www.palgrave.com/products/title.aspx?pid=" + str(i)
    print url
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page)  # retrieve
    books = soup.findAll("div", {"id": "Title"})  # process
I need to crawl through the whole catalogue for a publisher.
I need to retrieve:
Book Image
Title
Edition
Publisher
PubDate
PriceCurrency
ISBN13
Description (within an ajax tab)

Use XPath to extract the content from these locations.
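As a minimal sketch of the XPath idea, the following uses Python's standard-library ElementTree (which supports only a limited XPath subset) on a made-up, well-formed snippet. The sample markup, the `ISBN13` element id, the ISBN value, and the `extract_fields` helper are assumptions for illustration; only the `Title` id comes from the question's code. The real pages would have to be inspected for the actual locations, and messy real-world HTML usually needs a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for one catalogue page.
SAMPLE_PAGE = """
<html>
  <body>
    <div id="Title">Example Book</div>
    <div id="ISBN13">9780000000000</div>
  </body>
</html>
"""

# Field name -> XPath expression (assumed locations, not the real site's).
FIELD_PATHS = {
    "Title": ".//div[@id='Title']",
    "ISBN13": ".//div[@id='ISBN13']",
}


def extract_fields(markup, field_paths):
    """Return a dict of field name -> text found at each XPath, or None."""
    root = ET.fromstring(markup.strip())
    result = {}
    for name, path in field_paths.items():
        node = root.find(path)
        has_text = node is not None and node.text
        result[name] = node.text.strip() if has_text else None
    return result


record = extract_fields(SAMPLE_PAGE, FIELD_PATHS)
print(record)
```

Keeping the field-to-XPath mapping in one dictionary means each new field (Edition, Publisher, and so on) is one extra entry once its location on the page is known.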


Retrieving DOI for a combination of author, title and date/year

I want to retrieve the DOI for a combination of a title, the first author, and the year. I wanted to use the query method, but it has no date or year field, and it does not let me query by title either. Here is the example article whose DOI I want to get:
title: "The generic name Mediocris (Cetacea: Delphinoidea: Kentriodontidae),
belongs to a foraminiferan"
author : Mark D Uhen
Year: 2006
I tried this, but it failed (and I could not find any field name for the year):
last_name='Uhen'
title = 'The generic name Mediocris (Cetacea: Delphinoidea: Kentriodontidae), belongs to a foraminiferan'
q = works.query(author=last_name, title = title)
Here is the error I got:
UrlSyntaxError: Field query title specified but there is no such field query for this route. Valid field queries for this route are: affiliation, author, bibliographic, chair, container_title, contributor, editor, event_acronym, event_location, event_name, event_sponsor, event_theme, funder_name, publisher_location, publisher_name, translator
I appreciate any help!
Thanks
In the browser, the link would be:
https://api.crossref.org/works?query.author=Uhen&query.bibliographic=The%20generic%20name&filter=from-pub-date:2006-01-01,until-pub-date:2006-12-31
(I don't know what software you are using, but changing the field name from title to bibliographic might be one step towards a solution.)
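A small sketch of how that URL could be assembled in Python with only the standard library. The `build_works_url` helper and its parameter names are made up for illustration; the query fields and the date filter are the ones from the link above. It only builds the URL and does not send a request.

```python
from urllib.parse import quote, urlencode

API_ROOT = "https://api.crossref.org/works"


def build_works_url(author, bibliographic, year):
    """Build a /works URL with field queries and a publication-year filter."""
    params = {
        "query.author": author,
        "query.bibliographic": bibliographic,
        "filter": f"from-pub-date:{year}-01-01,until-pub-date:{year}-12-31",
    }
    # quote (not quote_plus) encodes spaces as %20; ':' and ',' stay literal
    # so the filter reads the same as in the browser link.
    return API_ROOT + "?" + urlencode(params, quote_via=quote, safe=":,")


url = build_works_url("Uhen", "The generic name", 2006)
print(url)
```

The year is expressed as a date range because the link above restricts results with `from-pub-date` and `until-pub-date` rather than a dedicated year field.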

How to properly use LIKE and '%%' in python to search a database list for a partial response stored in a placeholder variable?

search = request.form.get("search")
book = db.execute("SELECT * FROM books WHERE title = :search OR author = :search OR isbn = :search OR title LIKE :search",{"search": search}).fetchall()
This is a portion of my function. The user's input is stored in "search", and I want the query to return every row that contains the search text, even if the title is incomplete. For example, if the user is looking for a book titled "the fisherman" but types only "the fisher", the query should still return "the fisherman".
book = db.execute(text("SELECT * FROM books WHERE title LIKE :search"),
                  {"search": f"%{search}%"}).fetchall()
The error was in my dictionary. For a partial match, the '%' wildcards must be added to the value passed in the dictionary, not written around the placeholder in the SQL string.
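The same idea can be tried in isolation with Python's built-in sqlite3 module. This is only a sketch: the in-memory table, the sample rows, and the `search_books` helper are invented for illustration, and it matches partial text in the title, author, or isbn columns rather than reproducing the original function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (isbn TEXT, title TEXT, author TEXT)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [
        ("0000000001", "the fisherman", "a. author"),  # made-up rows
        ("0000000002", "another story", "b. writer"),
    ],
)


def search_books(connection, search):
    """Return rows whose title, author, or isbn contains the search text."""
    pattern = f"%{search}%"  # wildcards go into the bound value, not the SQL
    return connection.execute(
        "SELECT isbn, title, author FROM books "
        "WHERE title LIKE :pattern OR author LIKE :pattern OR isbn LIKE :pattern",
        {"pattern": pattern},
    ).fetchall()


matches = search_books(conn, "the fisher")
print(matches)
```

Because the user input is still passed as a bound parameter, the query stays parameterized; only the value gains the wildcards.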

Best way to design app engine datastore and text search modelling

We have a Java application running on Google App Engine, with a kind called Contact. The following is a sample schema:
Contact
{
    long id
    String firstName
    String lastName
    ...
}
The above is the existing model. To support a few requirements, we store this object in both the datastore and text search.
Now we want to integrate contacts with their page view data.
Each contact can have thousands of page view records, or even millions for some contacts.
The following is a sample page visit object [Note: we don't have this object yet; it is only meant to give information about a page visit]:
PageVisit
{
    long id
    String url
    String refUrl
    int country
    String city
    ....
}
We have a requirement that needs a query on both a contact's core properties and their page visit data,
for example:
select * from Contact where firstName = 'abc' and url = 'cccccc.com';
select * from Contact where firstName = 'abc' or url = 'cccccc.com';
To write this kind of query, both the contact's core properties and their page visits need to be available in the Contact object itself. But a contact can have a huge number of page views, so this would exceed the maximum entity size limit.
How should the Contact model be designed in this situation, both in the datastore and in text search?
Thanks
Cloud Datastore doesn't support joins, so you will need to handle this in some manner from the client code.
Two possible ways to handle this are:
Denormalize the Contact properties you need to search on into PageVisit:
PageVisit
{
    long id
    String firstName // Denormalized from Contact
    String url
    String refUrl
    int country
    String city
    ....
}
This requires you to create a composite index:
- kind: PageVisit
  ancestor: no
  properties:
  - name: firstName
  - name: url
Or run multiple queries:
select id from Contact where firstName = 'abc'
select * from PageVisit where contactId={id} and url = 'cccccc.com';
select * from PageVisit where contactId={id} or url = 'cccccc.com';
This requires you to create a composite index:
- kind: PageVisit
  ancestor: no
  properties:
  - name: contactId
  - name: url
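The "multiple queries" route amounts to a join done in client code. Here is a rough sketch of that combining step, using plain Python lists as stand-ins for the two kinds; the sample data and helper names are hypothetical, and a real application would replace the list comprehensions with Datastore queries. It returns contact ids, using an intersection for the "and" case and a union for the "or" case.

```python
# In-memory stand-ins for the two kinds; a real app would run Datastore queries.
CONTACTS = [
    {"id": 1, "firstName": "abc"},
    {"id": 2, "firstName": "xyz"},
    {"id": 3, "firstName": "abc"},
]
PAGE_VISITS = [
    {"id": 10, "contactId": 1, "url": "cccccc.com"},
    {"id": 11, "contactId": 2, "url": "cccccc.com"},
    {"id": 12, "contactId": 3, "url": "other.com"},
]


def contact_ids_by_name(first_name):
    """First query: ids of contacts with the given first name."""
    return {c["id"] for c in CONTACTS if c["firstName"] == first_name}


def contact_ids_by_url(url):
    """Second query: ids of contacts that visited the given URL."""
    return {v["contactId"] for v in PAGE_VISITS if v["url"] == url}


def find_contacts(first_name, url, mode="and"):
    """Combine the two result sets on the client: intersection or union."""
    by_name = contact_ids_by_name(first_name)
    by_url = contact_ids_by_url(url)
    ids = by_name & by_url if mode == "and" else by_name | by_url
    return sorted(ids)


print(find_contacts("abc", "cccccc.com", mode="and"))
print(find_contacts("abc", "cccccc.com", mode="or"))
```

With result sets this small the set operations are trivial; for contacts with millions of page visits, the intermediate id sets are where the cost of this approach shows up.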
Final aside: Depending on how large your site is, it might be worth looking into Cloud Bigtable for the PageView data. It's a better solution for high write OLAP-style workloads.

How to match text in cmsplugin_text table to URL in Django CMS

We have created a site for a client using Django CMS and are approaching the launch date. There are a number of links to files on their old site. Doing a search of the cmsplugin_text table, I find 12 entries that contain the URL. There is no simple mapping to the new file download URL from the old download URL, so I need to find the pages these 12 entries appear on and tell our client so they can edit the page.
But the database is not easy to follow. So how do I go from the value of the cmsplugin_ptr_id column of the cmsplugin_text table to the URL of the page? I'm fairly sure that cmsplugin_ptr_id is meant to line up with the id of the cms_cmsplugin table. That table also has parent_id, tree_id and placeholder_id, but I got lost at this point.
I'm happy to use either the database commands directly, or to use manage.py shell to do this.
I should have tried a bit harder before asking.
The steps that worked were to look in cms_page_placeholders for rows with the placeholder_id and look up the corresponding page_id. I could then look up the page in the admin at http://mysite.com/en/admin/cms/page/page_id, and that page has a "View on site" link.
The SQL statement I used was:
SELECT cpp.page_id
FROM cmsplugin_text AS cpt
LEFT JOIN cms_cmsplugin AS ccp ON cpt.cmsplugin_ptr_id = ccp.id
LEFT JOIN cms_page_placeholders AS cpp ON ccp.placeholder_id = cpp.placeholder_id
WHERE cpt.body LIKE '%userfiles%';
Here, userfiles was part of the path to the files on the old site.
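For anyone who wants to try the join before running it against a live database, here is a sketch using Python's built-in sqlite3 module. The reduced table definitions and sample rows are assumptions (the real Django CMS tables have many more columns); the table names, join conditions, and admin URL pattern are taken from the steps above. `DISTINCT` is added so a page with several matching plugins is listed once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, hypothetical versions of the three tables with made-up rows.
conn.executescript("""
CREATE TABLE cms_cmsplugin (id INTEGER PRIMARY KEY, placeholder_id INTEGER);
CREATE TABLE cmsplugin_text (cmsplugin_ptr_id INTEGER, body TEXT);
CREATE TABLE cms_page_placeholders (page_id INTEGER, placeholder_id INTEGER);
INSERT INTO cms_cmsplugin VALUES (1, 100), (2, 200);
INSERT INTO cmsplugin_text VALUES
    (1, '<a href="/userfiles/old.pdf">old file</a>'),
    (2, '<p>no old links here</p>');
INSERT INTO cms_page_placeholders VALUES (7, 100), (8, 200);
""")

QUERY = """
SELECT DISTINCT cpp.page_id
FROM cmsplugin_text AS cpt
LEFT JOIN cms_cmsplugin AS ccp ON cpt.cmsplugin_ptr_id = ccp.id
LEFT JOIN cms_page_placeholders AS cpp ON ccp.placeholder_id = cpp.placeholder_id
WHERE cpt.body LIKE :pattern
"""

page_ids = [row[0] for row in conn.execute(QUERY, {"pattern": "%userfiles%"})]
# Turn each page id into the admin URL where the page can be opened.
admin_urls = [f"http://mysite.com/en/admin/cms/page/{pid}" for pid in page_ids]
print(admin_urls)
```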

Understanding ListProperty backend behavior in GAE

I'm trying to understand how you're supposed to access items in a GAE db.ListProperty(db.Key).
Example:
A Magazine db.Model entity has a db.ListProperty(db.Key) that contains the keys of 10 Article entities. I want to get the Magazine object and display the Article names and dates. Do I make 10 queries for the actual article objects? Do I do a batch query? What if there are 50 articles? (Don't batch queries rely on the IN operator, which is limited to 30 or fewer elements?)
So you are describing something like this:
from google.appengine.ext import db

class Magazine(db.Model):
    ArticleList = db.ListProperty(db.Key)

class Article(db.Model):
    ArticleName = db.StringProperty()
    ArticleDate = db.DateProperty()
In this case the simplest way to grab the listed articles is to use the Model.get() method, which accepts a list of keys and fetches the matching entities in one batch.
m = Magazine.all().get()  # grab the first Magazine entity
articles = Article.get(m.ArticleList)  # batch-fetch the Articles by key list
for a in articles:
    name = a.ArticleName
    date = a.ArticleDate
    # do something with this data
Depending on how you plan on working with the data you may be better off adding a Magazine reference property to your Article entities instead.
You should read "Modeling Entity Relationships", especially the part about one-to-many relationships.
