How to scrape web pages that are in different format/layouts? - screen-scraping

I need to scrape Form 10-K reports (i.e. annual reports of US companies) from the SEC website for a project.
The trouble is that companies do not all use the same format for filing this data. For example, real estate data for two different companies could be displayed as below:
1st company
Property name  State  City     Ownership  Year  Occupancy  Total Area
-------------  -----  -------  ---------  ----  ---------  ----------
ABC Mall       TX     Dallas   Fee        2007  97%        1,347,377
XYZ Plaza      CA     Ontario  Fee        2008  85%        2,252,117
2nd company
Property              % Ownership  % Occupancy  Rent   Square Feet
--------------------  -----------  -----------  -----  -----------
New York City
  ABC Plaza           100.0%       89.0%        38.07  2,249,000
  123 Stores          100.0%       50.0%        18.00  1,547,000
Washington DC Office
  12th street         .......
  2001, J Drive       .......
etc.
Likewise, the data layout could be entirely different for other companies.
I would like to know if there are better ways to scrape this type of heterogeneous data than writing complex regex searches.
I have the liberty to use Java, Perl, Python or Groovy for this work.

I'd be inclined to keep a library of meta files that describe the layout of each page you want to scrape, and consult the matching file when extracting the data.
That way you don't need complex regex commands, and if a site changes its design you only have to change one of your files.
How you create the meta file is up to you, but recording the pertinent class names or tags would be a good start, followed by a description of how to extract the data from each tag.
I'm not sure whether there is a tool out there that does all that.
The other, nicer, way might be to contact the owners of these sites and see if they provide a feed in the form of a WebService or something that you can use to get the data. Saves a lot of heartache I should think.
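As a rough illustration of the meta file idea, here is a minimal Python sketch. The layout names (company_1, company_2), the canonical field names, and the assumption that each table has already been split into a header row and lists of cells are all hypothetical; in practice the descriptors would live in separate files and would be combined with an HTML or text parser.

```python
# Hypothetical layout descriptors, one per filer. Each maps our own canonical
# field name to (column header used by that filer, converter function).
def to_int(text):
    return int(text.replace(",", ""))

LAYOUTS = {
    "company_1": {
        "property": ("Property name", str),
        "area_sqft": ("Total Area", to_int),
    },
    "company_2": {
        "property": ("Property", str),
        "area_sqft": ("Square Feet", to_int),
    },
}

def extract(layout_name, header, rows):
    """Convert already-tokenised table rows into uniform records."""
    layout = LAYOUTS[layout_name]
    records = []
    for row in rows:
        if len(row) != len(header):
            continue  # skip group headings such as "New York City"
        cells = dict(zip(header, row))
        records.append(
            {field: convert(cells[column]) for field, (column, convert) in layout.items()}
        )
    return records

header = ["Property", "% Ownership", "% Occupancy", "Rent", "Square Feet"]
rows = [["New York City"], ["ABC Plaza", "100.0%", "89.0%", "38.07", "2,249,000"]]
print(extract("company_2", header, rows))
```

Supporting a new filer, or a filer that changes its layout, then means editing one descriptor rather than the extraction code.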


Creating Excel database to track inventory

I will try to explain what I would like to achieve, and since I am not looking for a ready-made solution, I hope you can give me pointers on what to look for.
So, I have one sheet in Excel (Libre, Apache, whatever) where I want to keep track of the inventory in offices. I don't have many of them, so I have opted for something simpler than Access or any other database.
So, for example, in Office 122 I have Dell computer XZY.
Now, on the next sheet (sheet 2) I would like to keep the properties of that particular computer (as a table, for example):
Dell computer XZY | CPU Xenon | Nvidia 980 | RAM 16GB
Dell computer AAA | CPU I7 | AMD 290X | RAM 32GB
and so on
Now, on the first sheet I have columns
Office | Computer | Specs
I would like to be able to set the Computer column from a drop-down list of the computer names on sheet 2 (e.g. Dell computer XZY), and to have its specs filled in automatically in the Specs column from sheet 2, which holds the computer names and specs, so that it looks something like
Office 122 | Dell Computer AAA (this should be drop down selection) | I7, AMD 290X, 32GB
I hope I was clear enough :). As I said, I don't expect you to make the sheet for me (I wouldn't mind, but I'm not expecting it), just to tell me what I am looking for and where to search, since I don't have any experience with this kind of "database" in Excel.
Thanks in advance
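Part 1: make the dropdown
On sheet 2, create the columns that hold the computer names and specs.
Then restrict the cell on the first sheet to values from a list of items:
Select the cell.
Select Data Validation.
Under Allow, select List.
Select the range that holds the computer names.
That's it.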
http://blogs.technet.com/b/hub/archive/2011/06/09/restrict-data-entry-in-excel-with-lists.aspx
Part 2: look up the value
With the Excel function VLOOKUP you can look up a value, which must be unique, and choose which column of the lookup range to return. Something like this:
=VLOOKUP(B4,Sheet2!B:C,2,FALSE)

Relation on a multi-valued field in an Access database

Does someone know how I can put more than one value in one field to make a relation between two different records?
Google translation (from German):
Using multivalued fields
In most database management systems (DBMS), including earlier
versions of Microsoft Access, you can only store a single value in a
field. In Microsoft Office Access 2007, you can also create fields
that contain multiple values, such as a list of categories to which
you have assigned a condition. Multivalued fields are used in specific
situations, such as when you use Office Access 2007 to work with data
saved in a Windows SharePoint Services 3.0 list that contains a field
of one of the multivalued field types available in Windows SharePoint
Services.
This topic describes multivalued fields in Office Access 2007 and
Windows SharePoint Services, how to create and use multivalued
fields, and how to use multivalued fields in a query.
You can use multi-value Lookup fields in JOIN conditions (via their .Value property), but be aware that if there is such a field on both sides of the join then it will produce a match when any item matches on the joined fields, not when all items match. This may or may not be desirable, depending on the situation.
Case 1: Students with allergies
A school administrator needs to keep track of students with allergies and provide them with a list of meals that they should avoid when eating at the school cafeteria.
[Students]
ID  Student  Allergies
--  -------  ---------
 1  Alice    Eggs, Soy
 2  Bradley  Peanuts
 3  Carol
 4  Dennis   Soy
[Meals]
ID  Meal           Allergens
--  -------------  ---------
 1  Thai stir-fry  Peanuts
 2  Tofu omlette   Eggs, Soy
 3  Waffles        Eggs
The query
SELECT Students.Student, Students.Allergies, Meals.Meal, Meals.Allergens
FROM Students INNER JOIN Meals ON Students.Allergies.Value = Meals.Allergens.Value;
returns
Student  Allergies  Meal           Allergens
-------  ---------  -------------  ---------
Alice    Eggs, Soy  Tofu omlette   Eggs, Soy
Alice    Eggs, Soy  Waffles        Eggs
Bradley  Peanuts    Thai stir-fry  Peanuts
Dennis   Soy        Tofu omlette   Eggs, Soy
This is appropriate, since Alice should avoid meals that contain any ingredients to which she is allergic.
Case 2: Hotel requirements
[Travellers]
ID  Traveller  Requirements
--  ---------  -------------------------
 1  Gord       free WiFi, in-room coffee
[Hotels]
ID  Hotel         Amenities
--  ------------  ----------------------------
 1  Budget Motel  free WiFi, in-room coffee
 2  Fancy Hotel   in-room coffee, room service
The query
SELECT Travellers.Traveller, Travellers.Requirements, Hotels.Hotel, Hotels.Amenities
FROM Hotels INNER JOIN Travellers ON Hotels.Amenities.Value = Travellers.Requirements.Value;
returns
Traveller  Requirements               Hotel         Amenities
---------  -------------------------  ------------  ----------------------------
Gord       free WiFi, in-room coffee  Budget Motel  free WiFi, in-room coffee
Gord       free WiFi, in-room coffee  Fancy Hotel   in-room coffee, room service
The query returns both properties because they both offer in-room coffee. However, Fancy Hotel does not offer free WiFi, so I would prefer not to stay there. In this case the default join behaviour is not desirable (to me).
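To make the difference between the two matching rules concrete, here is a small Python sketch that models each multi-valued field as a set. It only illustrates the semantics using the hotel data above; it is not Access code, and the function names are made up for this example.

```python
# Each multi-valued field is modelled as a set of items.
travellers = {"Gord": {"free WiFi", "in-room coffee"}}
hotels = {
    "Budget Motel": {"free WiFi", "in-room coffee"},
    "Fancy Hotel": {"in-room coffee", "room service"},
}

def any_match(requirements, amenities):
    # Mirrors the join on .Value: a single shared item produces a match.
    return bool(requirements & amenities)

def all_match(requirements, amenities):
    # Stricter rule: every requirement must be among the amenities.
    return requirements <= amenities

def matching_hotels(traveller, rule):
    wanted = travellers[traveller]
    return sorted(name for name, offered in hotels.items() if rule(wanted, offered))

print(matching_hotels("Gord", any_match))  # ['Budget Motel', 'Fancy Hotel']
print(matching_hotels("Gord", all_match))  # ['Budget Motel']
```

The first rule reproduces the query result above; the second is the behaviour wanted in the hotel case.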

How to store sets of objects that have occurred together during events?

I'm looking for an efficient way of storing sets of objects that have occurred together during events, in such a way that I can generate aggregate stats on them on a day-by-day basis.
To make up an example, let's imagine a system that keeps track of meetings in an office. For every meeting we record how many minutes long it was and in which room it took place.
I want to get stats broken down both by person as well as by room. I do not need to keep track of the individual meetings (so no meeting_id or anything like that), all I want to know is daily aggregate information. In my real application there are hundreds of thousands of events per day so storing each one individually is not feasible.
I'd like to be able to answer questions like:
In 2012, how many minutes did Bob, Sam, and Julie spend in each conference room (not necessarily together)?
Probably fine to do this with 3 queries:
>>> query(dates=2012, people=[Bob])
{Board-Room: 35, Auditorium: 279}
>>> query(dates=2012, people=[Sam])
{Board-Room: 790, Auditorium: 277, Broom-Closet: 71}
>>> query(dates=2012, people=[Julie])
{Board-Room: 190, Broom-Closet: 55}
In 2012, how many minutes did Sam and Julie spend MEETING TOGETHER in each conference room? What about Bob, Sam, and Julie all together?
>>> query(dates=2012, people=[Sam, Julie])
{Board-Room: 128, Broom-Closet: 55}
>>> query(dates=2012, people=[Bob, Sam, Julie])
{Board-Room: 22}
In 2012, how many minutes did each person spend in the Board-Room?
>>> query(dates=2012, rooms=[Board-Room])
{Bob: 35, Sam: 790, Julie: 190}
In 2012, how many minutes was the Board-Room in use?
This is actually pretty difficult since the naive strategy of summing up the number of minutes each person spent will result in serious over-counting. But we can probably solve this by storing the number separately as the meta-person Anyone:
>>> query(dates=2012, rooms=[Board-Room], people=[Anyone])
865
What are some good data structures or databases that I can use to enable this kind of querying? Since the rest of my application uses MySQL, I'm tempted to define a string column that holds the (sorted) ids of each person in the meeting, but the size of this table will grow pretty quickly:
2012-01-01 | "Bob" | "Board-Room" | 2
2012-01-01 | "Julie" | "Board-Room" | 4
2012-01-01 | "Sam" | "Board-Room" | 6
2012-01-01 | "Bob,Julie" | "Board-Room" | 2
2012-01-01 | "Bob,Sam" | "Board-Room" | 2
2012-01-01 | "Julie,Sam" | "Board-Room" | 3
2012-01-01 | "Bob,Julie,Sam" | "Board-Room" | 2
2012-01-01 | "Anyone" | "Board-Room" | 7
What else can I do?
Your question is a little unclear: you say you don't want to store each individual meeting, but then how are you getting the current meeting stats (dates)? In addition, any table with the right indexes can be very fast, even with a lot of records.
You should be able to use a table like log_meeting. I imagine it could contain something like:
employee_id, room_id, date (as timestamp), time_in_meeting
where employee id is a foreign key to the employee table and room id is a foreign key to the room table.
If you create a multiple-column index on (employee id, room id, date), lookups should be pretty quick: MySQL uses multiple-column indexes from left to right, so searches can use the index for employee id alone, for employee id + room id, and for employee id + room id + timestamp. This is explained in more detail in the multiple-column index part of:
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
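The left-to-right behaviour can be observed with a small experiment. The sketch below uses Python's built-in sqlite3 module as a stand-in, because it needs no server; SQLite is not MySQL, and its planner output differs from MySQL's EXPLAIN, so treat this only as an illustration of the prefix rule. The index name and the meeting_date column name are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE log_meeting ("
    " employee_id INTEGER, room_id INTEGER,"
    " meeting_date TEXT, time_in_meeting INTEGER)"
)
# One composite index; column order matters.
conn.execute(
    "CREATE INDEX idx_emp_room_date"
    " ON log_meeting (employee_id, room_id, meeting_date)"
)

def plan(where_clause):
    """Return the plan text SQLite reports for a filtered SUM query."""
    sql = (
        "EXPLAIN QUERY PLAN SELECT SUM(time_in_meeting)"
        " FROM log_meeting WHERE " + where_clause
    )
    # The human-readable detail is the last column of each plan row.
    return " ".join(row[-1] for row in conn.execute(sql))

print(plan("employee_id = 1"))                  # filters on the leading column
print(plan("employee_id = 1 AND room_id = 2"))  # filters on a two-column prefix
print(plan("room_id = 2"))                      # skips the leading column
```

The first two plans should mention idx_emp_room_date, while the last one should fall back to scanning the table. If room-only queries matter, a second index that starts with room id would be needed.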
By refusing to store meetings (and related objects) individually, you are losing the original source of information.
You will not be able to compensate for this loss of data unless you regularly compute and store the full list of every daily (or weekly, or monthly, ...) aggregate that you might need to query later on!
Believe me, it's going to be a nightmare ...
If the number of people is constant and not very large, you can assign a column to each person to mark whether they were present, and store the room, date and time in three more columns. This removes the string-splitting problems.
Also, from the nature of your question, I think you first need to assign IDs to everything: rooms, people, etc. There is no need for long, repetitive strings in the DB. Try to reduce string operations and work with individual values in each column for better intersection performance. You can also store every combination of people in a table, assign each combination an ID, and then use one of those IDs in the actual date and time table. But all of these techniques require that something stays constant, either the people or the rooms.
I do not understand whether you know all the "questions" at design time, or whether new ones may be added during development or production. The latter would require keeping all the data all the time.
If you do know all your questions, this looks like a classic "banking system" that recalculates data on a daily basis.
Here is how I think about it.
It seems you have a limited number of rooms, people, days, etc.
Gather the logging data on a daily basis, one table per day: one event per database row, with all the fields you need.
Analyse the data with a cron script at "midnight".
Update the stats for people, rooms, etc. Just increment the number of hours Bob spent in room xyz, and so on, for everything your requirements need.
Because the analysed (compressed) data is limited and relatively small, your system can also support various queries, since the indexes will be relatively small too.
You could also use a scalable map/reduce algorithm.
You can't avoid storing the atomic facts as follows: (the meeting room, the people, the duration, the day), which is probably only a weak consolidation when the same people meet multiple times in the same room on the same day. Maybe that happens a lot in your office :).
Making groups comparable is an interesting problem, but as long as you always compose the member strings the same, you can probably do it with string comparisons. This is not "normal" however. To normalise you'll need a relation table (many to many) and compose a temporary table out of your query set so it joins quickly, or use an "IN" clause and a count aggregate to ensure everyone is there (you'll see what I mean when you try it).
I think you can derive the minutes the board room was in use as meetings shouldn't overlap, so a sum will work.
For storage efficiency, use integer keys for everything with lookup tables. Dereference the integers during the query parsing, or just use good old joins if you are feeling traditional.
That's how I would do it anyway :).
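The "IN" clause plus count aggregate mentioned above can be sketched as follows. This uses Python's sqlite3 module so it runs without a server; the table names (meeting, attendance), the sample rows, and the use of names instead of integer keys are assumptions made for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE meeting (id INTEGER PRIMARY KEY, day TEXT, room TEXT, minutes INTEGER);
    CREATE TABLE attendance (meeting_id INTEGER, person TEXT);
""")
# Made-up sample data: three meetings and who attended each.
conn.executemany("INSERT INTO meeting VALUES (?, ?, ?, ?)", [
    (1, "2012-01-01", "Board-Room", 30),
    (2, "2012-01-01", "Board-Room", 20),
    (3, "2012-01-02", "Broom-Closet", 15),
])
conn.executemany("INSERT INTO attendance VALUES (?, ?)", [
    (1, "Bob"), (1, "Sam"), (1, "Julie"),
    (2, "Sam"), (2, "Julie"),
    (3, "Sam"),
])

def minutes_together(people):
    """Minutes per room for meetings that ALL of the given people attended."""
    placeholders = ",".join("?" * len(people))
    sql = f"""
        SELECT m.room, SUM(m.minutes)
        FROM meeting AS m
        WHERE m.id IN (
            SELECT meeting_id FROM attendance
            WHERE person IN ({placeholders})
            GROUP BY meeting_id
            HAVING COUNT(DISTINCT person) = ?  -- everyone must be present
        )
        GROUP BY m.room
    """
    return dict(conn.execute(sql, [*people, len(people)]))

print(minutes_together(["Sam", "Julie"]))         # {'Board-Room': 50}
print(minutes_together(["Bob", "Sam", "Julie"]))  # {'Board-Room': 30}
```

The HAVING clause is what turns "any of these people" into "all of these people": a meeting only qualifies when the number of matching attendees equals the size of the requested group.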
You'll probably have to store individual meetings to get the data you need anyway.
However you'll have to make sure you aggregate and anonymise it properly before creating your reports. Make sure to separate concerns and access levels to stay within the proper legal limits on data.

Hyperlink Columns in the Form and List DotNetNuke Module

The Form and List module for DotNetNuke makes it fairly easy to include a hyperlink as one of the columns.
Unfortunately, it's not so easy to have a different caption for each.
For example, it's simple to get this:
Name Company
----- ---------
Sue http://www.apple.com
Fred http://www.google.com
Joe http://www.facebook.com
What I really want, however, is this (with hyperlinks in the Company column):
Name Company
----- ---------
Sue Apple
Fred Google
Joe Facebook
How can I make this happen?
You can generate the XSL and then modify it to give the layout you want.
See the explanation and sample XSL: F&L XSL

Get state/province from geonames data?

I downloaded these databases for US and CA from GeoNames. The data looks like this:
5881639 100 Mile House 100 Mile House 51.64982 -121.28594 P PPL CA 02 0 917 America/Vancouver 2006-01-18
5881640 101 Mile Lake 101 Mile Lake 51.66652 -121.30264 H LK CA 02 0 917 America/Vancouver 2006-01-18
5881641 101 Ponds 101 Ponds 47.811 -53.97733 H PNDS CA 05 0 18 America/St_Johns 2006-01-18
I want to use this data for a city-picker, but I want to display the province or state beside it. It doesn't look like this data contains that information. Is there some way to retrieve that? Or is there a better DB that includes that?
Use the admin code columns in your data: these are actually IDs that link to the admin codes table (separate data sets are available for the admin codes). It is very straightforward.
Check the Geonames forums for more info.
http://forum.geonames.org/
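A minimal Python sketch of that lookup, using two of the rows from the question. It assumes the admin codes data set is keyed as "<country code>.<admin1 code>", and the two entries shown here are an illustrative excerpt that should be checked against the actual admin codes file you download.

```python
# Rows from the question, reduced to (name, country code, admin1 code).
places = [
    ("100 Mile House", "CA", "02"),
    ("101 Ponds", "CA", "05"),
]

# Assumed excerpt of the admin codes table, keyed "<country>.<admin1 code>".
admin1_names = {
    "CA.02": "British Columbia",
    "CA.05": "Newfoundland and Labrador",
}

def label(name, country, admin1):
    """Build the city-picker label, falling back to the bare name."""
    region = admin1_names.get(f"{country}.{admin1}", "")
    return f"{name}, {region}" if region else name

for place in places:
    print(label(*place))
```

For a city-picker it is probably simplest to do this join once at import time and store the province or state name next to each city.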
Use the datasets here: geocoder.ca which include city name and state / province name in the same file.
If you want to stick with your data, you can use Google's Geocoding API, as in the first answer here:
Google Maps: how to get country, state/province/region, city given a lat/long value?
to get information based on latitude and longitude. This will be a lot of work, though, especially for a city-picker.
