web scraping tool or library that automatically finds text content without rules set - screen-scraping

Is there a web scraping tool or library that auto-detects repeating HTML blocks and scrapes the text content inside them, removing the need for a human to manually supply the rules (CSS selectors or XPath expressions) that locate the content?
This is based on the assumption that a modern content website is generated dynamically by a server-side language such as PHP or Python. The content is almost always rendered by a for loop in the template, so repeating HTML blocks can almost always be found. An example:
<div id="content">
<div class="blog entry">
<div class="title">
<h1>1st post</h1>
</div>
<div class="content">
<p>...</p>
</div>
</div>
<div class="blog entry">
<div class="title">
<h1>2nd post</h1>
</div>
<div class="content">
<p>...</p>
</div>
</div>
<div class="blog entry">
<div class="title">
<h1>3rd post</h1>
</div>
<div class="content">
<p>...</p>
</div>
</div>
</div>
Libraries like Beautiful Soup and Scrapy rely on a human to supply the rules before any scraping can be carried out. They are not what I want.

Haven't used it, but heard about scrapely:
Unlike most scraping libraries, Scrapely doesn't work with DOM trees
or xpaths so it doesn't depend on libraries such as lxml or libxml2.
Instead, it uses an internal pure-python parser, which can accept
poorly formed HTML. The HTML is converted into an array of token ids,
which is used for matching the items to be extracted.
Scrapely extraction is based upon the Instance Based Learning
algorithm and the matched items are combined into complex objects
(it supports nested and repeated objects), using a tree of parsers,
inspired by "A Hierarchical Approach to Wrapper Induction".

You might want to look at my scraping library. It doesn't work automatically, nor does it detect repeated parts. But it comes close, since it doesn't need rules at all and instead uses templates, which you can derive directly from the HTML you have.
E.g. with your example above, the template to read all the posts into two arrays is:
<div id="content">
<div class="blog entry">
<div class="title">
<h1>{title:=.}</h1>
</div>
<div class="content">
<p>{content:=.}</p>
</div>
</div>*
</div>

You may try HTQL:
import htql
a = htql.Browser()
p, b = a.goUrl('http://channel9.msdn.com/Blogs/Vector/Announcing-BUILD-2012')
htql.query(p, '&html_main_text')
p, b = a.goUrl('http://stackoverflow.com/questions/tagged/screen-scraping')
htql.query(p, '&html_main_text')
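For the repeated-block idea in the question itself, here is a rough sketch using only Python's standard library. It is not how Scrapely or HTQL work. It simply assumes that the content blocks are sibling elements sharing the same tag and class attribute, and it picks the largest such group. The names `Node`, `TreeBuilder`, `find_repeated_blocks`, and `SAMPLE` are made up for this illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class Node:
    """Element with its tag, class attribute, and children in document order."""

    def __init__(self, tag, cls="", parent=None):
        self.tag, self.cls, self.parent = tag, cls, parent
        self.items = []  # child Nodes and text strings, interleaved

    def elements(self):
        return [item for item in self.items if isinstance(item, Node)]

    def text(self):
        parts = (i.text() if isinstance(i, Node) else i.strip() for i in self.items)
        return " ".join(p for p in parts if p)


class TreeBuilder(HTMLParser):
    """Builds a loose element tree; tolerant of stray or mismatched end tags."""

    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs).get("class") or "", self.current)
        self.current.items.append(node)
        if tag not in self.VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore unmatched end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.items.append(data)


def find_repeated_blocks(html, min_repeats=2):
    """Return the text of the largest group of same-signature sibling elements."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best, stack = [], [builder.root]
    while stack:
        node = stack.pop()
        children = node.elements()
        stack.extend(children)
        counts = Counter((c.tag, c.cls) for c in children)
        if not counts:
            continue
        signature, n = counts.most_common(1)[0]
        if n >= min_repeats and n > len(best):
            best = [c for c in children if (c.tag, c.cls) == signature]
    return [block.text() for block in best]


SAMPLE = """
<div id="content">
  <div class="blog entry"><div class="title"><h1>1st post</h1></div>
    <div class="content"><p>first body</p></div></div>
  <div class="blog entry"><div class="title"><h1>2nd post</h1></div>
    <div class="content"><p>second body</p></div></div>
</div>
"""

if __name__ == "__main__":
    for text in find_repeated_blocks(SAMPLE):
        print(text)
```

On a real page the largest repeated group is often a navigation menu rather than the posts, so a heuristic like this would need extra filtering (for example, by the amount of text per block) before it could replace hand-written selectors.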

Related

How to create a complex table that holds lots of data and supports expanding rows and fixed columns using react-virtualized?

I'm creating an Excel-like table using React.
It has some essential requirements:
1. Support updating cells, including the cells of child rows.
2. If a head row's data can represent all of its children's data as well as its own, its cells are merged and shown as a single cell.
3. Support row expansion: when the user clicks a row, it expands and shows its children's data, which has the same columns as the parent. (Each row represents an aggregation of data, and clicking the row lets the user see the sub-data.)
4. Guarantee acceptable performance for more than 30,000 rows with 100 columns, including images.
5. Support filtering, sorting, searching, and hiding columns.
6. Provide the data on a single page (that is, infinite scroll rather than pagination).
7. Have a sticky table header.
8. Let the user fix some columns to the left side, so the data is easier to read while scrolling horizontally.
I succeeded with requirements 1 through 7 using react-virtualized, but requirement 8 (fixed columns) became a problem.
I searched the internet and concluded that react-virtualized is the only way to achieve this.
The basic idea is quite simple. I use List to render rows as the user scrolls (virtualization), and AutoSizer and CellMeasurer with CellMeasurerCache for dynamic-height rows (expanding rows). The rest of the code implements the table header and the cell updating, sorting, and filtering features.
The problem appeared when I needed to implement fixed columns. react-virtualized does not seem to allow customizing its features (its functionality seems quite static).
At first, I considered creating two tables and positioning them side by side, so that the left table acts as the fixed columns and the right table scrolls horizontally, with both sharing one CellMeasurerCache for the dynamic row heights. But I found https://stackoverflow.com/questions/45682063/react-virtualized-share-cellmeasurercache-for-multiple-grids which says that a CellMeasurerCache cannot be shared by more than one List.
I also considered using the other components react-virtualized provides. But each component has pros and cons, and I would have to give up one of the requirements above.
Can you give me an idea for implementing all of the requirements above?
I would like to share my code, but it is quite big (more than 1,000 lines for a single table, since it is an abstract table that is not hard-coded for a single purpose). If you want to see it, I can send it to you.
I can, however, provide the basic HTML markup. Requirement 8 is not implemented in it, but it has proven usable apart from some bugs and errors.
This is not the real HTML; I have omitted some of the markup.
(None of the data is calculated in the table; all of it is fetched from other servers.)
<div class="table">
<!--Head part-->
<div class="headers">
<div class="header" style="position: sticky; top: 0">
<div class="upper">Name</div>
<div class="bottom">
<button class="sort" style="background-image: url(sort_asc.png)"></button>
<button class="filter" style="background-image: url(btn_filter.png)"></button>
</div>
</div>
<div class="header" style="position: sticky; top: 0">
<div class="upper">Price</div>
<div class="bottom">
<button class="sort" style="background-image: url(sort_asc.png)"></button>
<button class="filter" style="background-image: url(btn_filter.png)"></button>
</div>
</div>
<div class="header" style="position: sticky; top: 0">
<div class="upper">Gross</div>
<div class="bottom">
<button class="sort" style="background-image: url(sort_asc.png)"></button>
<button class="filter" style="background-image: url(btn_filter.png)"></button>
</div>
</div>
</div>
<!--Content part-->
<div class="row"> <!-- This is an expanded row -->
<div class="column" style="display:flex; flex-direction:column; height:120px;">
<div class="headCell" style="height:120px">ChicagoCompany</div> <!-- This is a merged cell -->
</div>
<div class="column" style="display:flex; flex-direction:column; height:120px;">
<div class="headCell" style="height:40px">215</div> <!-- Displays the average value of the children -->
<div class="cell" style="height:40px">230</div>
<div class="cell" style="height:40px">200</div>
</div>
<div class="column" style="display:flex; flex-direction:column; height:120px;">
<div class="headCell" style="height:120px">440</div> <!-- This is a merged cell -->
</div>
</div>
<div class="row"> <!-- This row is not expanded -->
<div class="column" style="display:flex; flex-direction:column; height:40px">
<div class="headCell" style="height:40px">Seattle Corp</div> <!-- This is a merged cell -->
</div>
<div class="column" style="display:flex; flex-direction:column; height:40px">
<div class="headCell" style="height:40px">130</div> <!-- Displays the average value of the children -->
</div>
<div class="column" style="display:flex; flex-direction:column; height:40px">
<div class="headCell" style="height:40px">440</div> <!-- This is a merged cell -->
</div>
</div>
</div>
The two-table idea I described above would look like this:
<div class="tableSet"><div class="fixedTable">...</div><div class="flexibleTable">...</div></div>
(Two "table" elements placed side by side.)
I tried setting position: sticky on every element with class "column", but it does not work as I expected. (As you can see, the columns are not at the same level.)
I think I might be able to work around the restriction on CellMeasurerCache so that it can be shared by both tables, but I have no idea how. Can you give me some advice?
Is there any way to work around the restriction and implement all the requirements? Or is there any library, component, or approach that achieves this goal, even if the table needs to be rebuilt from scratch?
Yes, it looks like I want Excel on a web page.
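To make the two-table idea concrete: whichever components render the two panes, both have to agree on which rows are visible, and that can be derived from one shared list of measured row heights. Below is a minimal sketch of that arithmetic, written in Python purely for illustration. It is not react-virtualized code, and `visible_range`, `shared_heights`, and the pixel values are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def visible_range(row_heights, scroll_top, viewport_height):
    """Return (first, last) indices of the rows that intersect the viewport."""
    if not row_heights:
        return None
    # offsets[i] is the top edge of row i; the final entry is the total height.
    offsets = [0, *accumulate(row_heights)]
    last_index = len(row_heights) - 1
    first = bisect_right(offsets, scroll_top) - 1
    last = bisect_right(offsets, scroll_top + viewport_height - 1) - 1
    return min(max(first, 0), last_index), min(max(last, 0), last_index)


# One shared list of measured heights (hypothetical values in px):
# expanded rows of 120 and collapsed rows of 40.
shared_heights = [120, 40, 40, 120]

# Both panes ask the same question with the same inputs, so they stay aligned.
fixed_pane = visible_range(shared_heights, scroll_top=130, viewport_height=100)
scroll_pane = visible_range(shared_heights, scroll_top=130, viewport_height=100)
```

The point of the sketch is only that vertical alignment depends on a single source of row heights and a single vertical scroll position. How to get react-virtualized to expose or share that state is the open part of the question.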

how to use one ng-controller within another ng-app and controller in angular.js

I am new to Angular.js and want to use one ng-controller within another ng-app and ng-controller, like this, so that I can reuse code I have already written for other pages.
Please help me out and correct me if I am wrong anywhere.
<div id="divFriendList" class="container" ng-app="friendModule" ng-controller="friendController">
<div id="module2" ng-app="cardsModule" ng-controller="CardsController">
<div ng-repeat="card in cards></div>
</div>
</div>
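I am rather new as well, but I believe a way to do this would be to declare a dependency in your module. For example, angular.module('friendModule', ['cardsModule']);. Such an example can be found here. Keep in mind that AngularJS applications cannot be nested and only the first ng-app in a document is bootstrapped automatically, so the dependency is what makes CardsController available inside friendModule. For more about modules and a tutorial on Angular you can visit W3Schools.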
As for your code, you're missing the closing quotation in <div ng-repeat="card in cards>. Further, you are asking ng-repeat to iterate each card in cards, but you have not defined what cards is. Therefore, you'll want to change
<div id="module2" ng-app="cardsModule" ng-controller="CardsController">
to
<div id="module2" ng-app="cardsModule" ng-controller="CardsController as cards">.
Additionally, you will not get any information displayed if you don't ask for an output such as {{card}}. I would also consider declaring your ng-apps beforehand and not when they're needed, but I'm unsure as to the efficiency of this.
I would update it to:
<div id="divFriendList" class="container" ng-app="friendModule" ng-controller="friendController">
<div id="module2" ng-app="cardsModule" ng-controller="CardsController as cards">
<div ng-repeat="card in cards">{{ card }}</div>
</div>
</div>
Hope that helps.

Add another custom interpolator in Angularjs

I still want {{1+2}} to be evaluated as normal. But in addition to the normal interpolator, I want to create a custom one that I can program to do whatever I want.
e.g. <p>[[welcome_message]]</p> should be a shortcut for <p>{{'welcome_message' | translate}}</p>, or <p translate="welcome_message"></p>. That way, i18n apps would be much more convenient to write.
Is anything like that possible? I'm currently going through the angular.js source code, but the interpolation system looks pretty complicated(?). Any suggestions?
I created a directive that regex-find-replaces its innerHTML. Basically, I can rewrite any text into any other text. Here's how I did it:
How do I make angular.js reevaluate / recompile inner html?
Now all I have to do is to place my directive-attribute, "autotranslate", in one of the parent elements of where I want my interpolator to work, and it rewrites it however I want it! :D
<div class="panel panel-default" autotranslate>
<div class="panel-heading">[[WELCOME]]</div>
<div class="panel-body">
[[HELLO_WORLD]]
</div>
</div>
becomes
<div class="panel panel-default" autotranslate>
<div class="panel-heading"><span translate="WELCOME"></span></div>
<div class="panel-body">
<span translate="HELLO_WORLD"></span>
</div>
</div>
which does exactly what I wanted.
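The rewrite step itself is a single regular-expression substitution. Here is a sketch of it in Python, for illustration only: the actual directive would do the equivalent in JavaScript on the element's innerHTML and then recompile it. The pattern, the assumed key format (letters, digits, underscores), and the name `rewrite_placeholders` are assumptions, not the code from the linked answer.

```python
import re

# Matches [[KEY]], where KEY is letters, digits, or underscores (assumed key format).
PLACEHOLDER = re.compile(r"\[\[\s*([A-Za-z0-9_]+)\s*\]\]")


def rewrite_placeholders(html: str) -> str:
    """Replace every [[KEY]] with a <span translate="KEY"></span> element."""
    return PLACEHOLDER.sub(r'<span translate="\1"></span>', html)


before = '<div class="panel-heading">[[WELCOME]]</div>'
after = rewrite_placeholders(before)
```

Because the pattern only matches double square brackets, ordinary {{ ... }} interpolations pass through untouched.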
I don't think that's possible, but if you really want to save some characters you could create a function on your rootScope called t, then call it within your views:
<p>{{ t('welcome_message') }}</p>

Iterating over a list and calling directives depending upon the item types, in angular js

I have an HTML file that iterates over a list of objects, as shown below. Every object has a template (stored in the database) that it uses. I get "List" from a web service:
<ul>
<li ng-repeat="object in List" ng-include="object.TemplateName" > </li>
</ul>
Let object.TemplateName be "template1".
A sample template contains a specific directive with the attributes it needs, plus a few HTML tags. For example, "template1":
<directive1 s-web-service-path="object.WebServicePath" >
<h1>any html content</h1>
</directive1>
My directive calls a web service to get the content to display and has its own template. Instead of putting directives in a template and including it, can't I call my directive directly, depending on the type of each object I get in List?
Something like:
for Object.Type="1", call directive1 instead of template1
for Object.Type="2", call directive2 instead of template2
ngIf or ngSwitch might be helpful here, with a few extra wrapping elements within the ngRepeat, in order to dynamically choose what to include based on Object.Type. Using ngSwitch:
<ul>
<li ng-repeat="object in List">
<div ng-switch="object.Type">
<div ng-switch-when="1">
<div ng-include="object.TemplateName"></div>
</div>
<div ng-switch-when="2">
<directive1 s-web-service-path="object.WebServicePath" >
<h1>any html content</h1>
</directive1>
</div>
</div>
</li>
</ul>
The above is not tested, so there could potentially be an error. You might also be able to cut down on some of the DOM nesting by putting the ng-switch-when attributes directly on the directive1 / ng-include elements, but the way above makes the behaviour clear, and avoids any unexpected issues that might arise from having multiple directives work on the same element.

angularjs: how to prevent from processing additional directives after ng-hide

I have some code that is running twice even though AngularJS is rendering only one branch of it.
<div ng-Show="SomeCondition">
...
</div>
<div ng-Hide="SomeCondition">
...
</div>
AngularJS is correctly rendering only one of the divs; however, it is processing both. This leads to some performance degradation, as each section is quite big. Is there a way to stop one of the branches from being processed?
What you're looking for is ng-switch, which will 'Conditionally change the DOM structure.' This means that the other cases will not be run, unlike using ng-hide/ng-show, which just adjust CSS.
http://docs.angularjs.org/api/ng.directive:ngSwitch
<div ng-switch on="SomeCondition">
<div ng-switch-when="true">Example</div>
<span ng-switch-when="false">Example Two</span>
<span ng-switch-when="somethingElse">Example Three</span>
<span ng-switch-default>default</span>
</div>
