Extract HTML table content based on "thead"


Here is a basic HTML table:
<table>
<thead>
<td class="foo">bar</td>
</thead>
<tbody>
<td>rows</td>
…
</tbody>
</table>
Suppose there are several such tables in the source file. Is there an hxextract option, a CSS3 selector I could use with hxselect, or some other tool that would let me extract one particular table, based either on the content of its thead or on its class, if it has one? Or am I stuck with not-so-simple awk (or perhaps Perl, as I found before submitting) scripting?
Update:
For content-based extraction, Perl's HTML::TableExtract does the trick:
#!/usr/bin/env perl
use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';
use HTML::TableExtract;
# Extract tables based on header content; slice_columns => 0 helps with colspan issues
my $te = HTML::TableExtract->new( headers => ['Multi'], slice_columns => 0 );
$te->parse_file('mywebpage.html');
# Loop over all matching tables
foreach my $ts ( $te->tables ) {
    # Print table identification
    print "Table (", join( ',', $ts->coords ), "):\n";
    # Print table content
    foreach my $row ( $ts->rows ) {
        # Guard against undefined cells so that join does not warn
        print join( ':', map { $_ // '' } @$row ), "\n";
    }
}
However, in some cases a simple lynx -dump mywebpage.html coupled with awk or a similar tool can be just as effective.

This would require a parent selector or a relational selector, which does not yet exist (and by the time it does, hxselect may not implement it, because as of this writing it does not even fully implement the current standard). hxextract appears to retrieve an element only by its type and/or class name, so the best it could do is td.foo, which would return only the td, not its thead or table.
If you are processing this HTML from the command line, you will need a script.
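As an illustration of what such a script might look like, here is a minimal sketch using Python's standard html.parser module. It rests on several assumptions: the class name TableByHeader and its matches attribute are invented for this example, tables are assumed not to be nested, and "based on the content of thead" is interpreted as a plain substring test on the thead text.

```python
from html.parser import HTMLParser


class TableByHeader(HTMLParser):
    """Collect the cells of every table whose <thead> text contains `wanted`.

    Hypothetical helper for illustration; nested tables are not handled.
    """

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.matches = []       # one list of cell strings per matching table
        self._cells = None      # cells of the table being read, or None
        self._head = []         # text fragments seen inside <thead>
        self._in_head = False
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._cells, self._head = [], []
        elif tag == "thead":
            self._in_head = True
        elif tag in ("td", "th") and self._cells is not None:
            self._in_cell = True
            self._cells.append("")

    def handle_endtag(self, tag):
        if tag == "thead":
            self._in_head = False
        elif tag in ("td", "th"):
            self._in_cell = False
        elif tag == "table" and self._cells is not None:
            # Keep the table only if its header text contains the wanted string.
            if self.wanted in " ".join(self._head):
                self.matches.append(self._cells)
            self._cells = None

    def handle_data(self, data):
        if self._in_head:
            self._head.append(data)
        if self._in_cell and self._cells:
            self._cells[-1] += data.strip()
```

Feeding the page to TableByHeader("bar") with feed() and then reading matches would give one list of cell texts per table whose header mentions "bar". Selecting on the class instead would mean inspecting attrs in handle_starttag.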

Related

Mark doxygen comments and put them in a separate file / build documentation from source/comment blocks

My question is not easy to describe in a short caption, so I will try to explain it in more detail:
I want to generate documentation from comments in the source code. My favorite way would be a separately generated markdown file which contains the collected comment blocks.
Here is an example sourcecode of what I want to do:
/*!
* @brief Command management
*/
void DoCommands()
{
// \HTML Variant
/*!
This part is a html sequence \n
Here \b<a table> will appear \n
<table>
<tr><th> Command </th> <th> Function </th></tr>
<tr><td> ? </td><td> Dummy </td></tr>
*/
switch (Cmd)
{
/*!
<tr><td>v </td><td>Get version </td></tr>
*/
case 'v': SendVersion(); break;
/*!
<tr><td>q </td><td> Quit program </td></tr>
</table>
*/
case 'q': Quit(); break;
}
// XRefItem Variant
/*!
This part is a xrefitem sequence \n
Here \b<a table> will appear \n
\cmditem Command, Function
\cmditem ?, Dummy
*/
switch (Cmd)
{
/*!
\cmditem v, Get Version
*/
case 'v': SendVersion(); break;
/*!
\cmditem q, Quit program
*/
case 'q': Quit(); break;
}
// Markdown Variant
/*!
This part is a markdown sequence \n
Here **a table** will appear \n
Command | Function \n
----|-------
? | Dummy
*/
switch (Cmd)
{
/*!
v | Get Version
*/
case 'v': SendVersion(); break;
/*!
q | Quit program
*/
case 'q': Quit(); break;
}
}
The function handles different commands, and I want to put each command's description inside its case so that source code and documentation live in one place.
As you can see, I tried several ways to collect this information with Doxygen. The first solution, with HTML tags, works reasonably well. The resulting HTML looks like this:
But it's not perfect. Because the table is built from several blocks, an extra newline is added after "Dummy" and "Get version", so they are not aligned with the "?" and the "v".
My next attempt used \xrefitem. I defined an ALIAS in Doxygen
cmditem=\xrefitem cmditems "Commands" "Command overview" and used it to document the individual items. In addition to the function documentation, Doxygen generated an extra "Related pages" entry, which looks like this:
The problem is that I don't know how to format this nicely as a table...
Finally, I tried writing the table in Markdown syntax, but this works only for the first entry of the table. The second entry, which is in a separate comment block, was not added to the table.
Does anyone know how to get nicely formatted documentation from separate comment blocks for the different commands?
I need the command documentation to give to a customer, who doesn't need all the other material about the code and the function. So the "cherry on the cake" would be an instruction or command that tells Doxygen to take a comment block as it is and append it to a separate file.
In this example I'd like to have the Markdown comments in one file, which could look like this (of course after a little modification of the lines above):
This part is a markdown sequence
Here **a table** will appear
| Command | Function |
|----|-------|
|? | Dummy|
|v | Get Version|
|q | Quit program|

You are right: every comment block is a separate paragraph, and therefore Doxygen inserts <p> ... </p> in the table,
which causes the extra lines.
The generated HTML code looks like this:
</p><table class="doxtable">
<tr>
<th>Command </th><th>Function </th></tr>
<tr valign="top">
<td>? </td><td><p class="starttd">Dummy </p>
<p class="endtd"></p>
</td></tr>
<tr valign="top">
<td>v </td><td><p class="starttd">Get version </p>
<p class="endtd"></p>
</td></tr>
<tr valign="top">
<td>q </td><td><p class="starttd">Quit program </p>
<p class="endtd"></p>
</td></tr>
</table>
Your workaround helps to get a uniform layout of the table entries:
Thank you!

To understand the behavior, one has to know that different comment blocks are not directly appended to each other; each block is treated as a separate paragraph. The Markdown table entries are therefore separated by empty lines, which breaks the table.
Regarding the xrefitem possibility: these items are formatted with <dl>, <dd> and <dt> in the normal paragraph and simply as paragraphs on the related page. Hence it is not really possible to create a table out of them.
Regarding the HTML solution:
It is indeed a bit strange that there is an extra newline in the ? and v rows. This is also due to the extra new lines, though that is a bit strange to me, as the new lines come after closed rows (why this happens would have to be investigated in the doxygen code).
To get better output here, I would suggest:
moving the </table> out of the switch, so that the q row also gets the extra line
using the attribute valign="top" in the <tr> tag
So in general, for the HTML part we get:
/*!
* @brief Command management
*/
void DoCommands()
{
// \HTML Variant
/*!
This part is a html sequence \n
Here \b<a table> will appear \n
<table>
<tr valign="top"><th> Command </th> <th> Function </th></tr>
<tr valign="top"><td> ? </td><td> Dummy </td></tr>
*/
switch (Cmd)
{
/*!
<tr valign="top"><td>v </td><td>Get version </td></tr>
*/
case 'v': SendVersion(); break;
/*!
<tr valign="top"><td>q </td><td> Quit program </td></tr>
*/
case 'q': Quit(); break;
}
/*!
</table>
*/
}
I think this table looks a little better, as e.g. the ? is no longer in the middle of the cell. The extra empty line, however, is not easily removed. For people who know CSS, it would be possible to create an HTML_EXTRA_STYLESHEET with a setting such as:
p.endtd {
margin-bottom: -14px;
}
but this would influence all tables where the paragraph tag is used inside a cell (doxygen adds the paragraph tag).
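For the wish to take the comment blocks "as they are" and append them to a separate file, one option outside Doxygen is a small post-processing script. The following is only a sketch under assumptions: the function name collect_blocks is invented here, every /*! ... */ block in the file is collected (including the @brief block), blocks are assumed to have no leading * decoration worth stripping, and Doxygen's trailing \n command is simply dropped.

```python
import re

# Matches one /*! ... */ block; DOTALL lets "." span line breaks.
BLOCK = re.compile(r"/\*!(.*?)\*/", re.DOTALL)


def collect_blocks(source):
    """Return the text of all /*! ... */ blocks, joined without blank lines.

    Hypothetical helper: it does not understand Doxygen commands, it only
    concatenates the raw lines so that table rows end up adjacent.
    """
    lines = []
    for body in BLOCK.findall(source):
        for line in body.splitlines():
            line = line.strip()
            # Drop Doxygen's explicit line-break command at the end of a line.
            if line.endswith("\\n"):
                line = line[:-2].rstrip()
            if line:
                lines.append(line)
    return "\n".join(lines)
```

Because the blocks are joined without empty lines in between, the Markdown table rows from separate comment blocks would end up in one contiguous table. Writing the returned string to a .md file is left to the caller.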

How do I get gene features in FASTA nucleotide format from NCBI using Perl?

I am able to download a FASTA file manually that looks like:
>lcl|CR543861.1_gene_1...
ATGCTTTGGACA...
>lcl|CR543861.1_gene_2...
GTGCGACTAAAA...
by clicking "Send to" and selecting "Gene Features"; FASTA Nucleotide is the only option (which is fine, because that's all I want) on this page.
With a script like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Bio::DB::EUtilities;
my $factory = Bio::DB::EUtilities->new(-eutil => 'efetch',
-db => 'nucleotide',
-id => 'CR543861',
-rettype => 'fasta');
my $file = 'CR543861.fasta';
$factory->get_Response(-file => $file);
I get a file that looks like:
>gi|49529273|emb|CR543861.1| Acinetobacter sp. ADP1 complete genome
GATATTTTATCCACA...
with the whole genomic sequence lumped together. How do I get information like in the first (manually downloaded) file?
I looked at a couple of other posts:
how to download complete genome sequence in biopython entrez.esearch (this answer seemed relevant)
How can I download the entire GenBank file with just an accession number?
As well as this section from EUtilities Cookbook.
I tried fetching and saving a GenBank file (since the .gb file I get seems to have separate sequences for each gene), but when I work with it using Bio::SeqIO, I get only one large sequence.

With that accession number and return type, you are getting the complete genome sequence. If you want to get the individual gene sequences, specify that you want the complete genbank file, then parse out the genes. Here is an example:
#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;
use Bio::SeqIO;
use Bio::DB::EUtilities;
my $factory = Bio::DB::EUtilities->new(-eutil   => 'efetch',
                                       -email   => 'foo@bar.com',
                                       -db      => 'nucleotide',
                                       -id      => 'CR543861',
                                       -rettype => 'gb');
my $file = 'CR543861.gb';
# Save the complete GenBank record to disk
$factory->get_Response(-file => $file);
# Keep only the features whose primary tag is 'gene'
my @gene_features = grep { $_->primary_tag eq 'gene' }
                    Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures;
for my $feat_object (@gene_features) {
    # Note: this prints one record per tag, so a gene with several tags is printed several times
    for my $tag ($feat_object->get_all_tags) {
        # open a filehandle here for writing each to a separate file
        say ">", $feat_object->get_tag_values($tag);
        say $feat_object->spliced_seq->seq;
        # close it!
    }
}
As written, this prints every gene to STDOUT (redirect the output to collect them all in one file), but I indicated where you could make a small change to write them to separate files. Parsing GenBank can be a bit tricky at times, so it is always helpful to read the docs and, in particular, the excellent Feature Annotation HOWTO.
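To make the "parse out the genes" step more concrete, here is a toy sketch in plain Python of what cutting gene sequences out of a genome by feature coordinates amounts to. It is not BioPerl and does not parse GenBank: the function name gene_fasta, the tuple layout, and the sample data are all assumptions made for illustration, and multi-segment (joined) locations are ignored.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def gene_fasta(genome, genes):
    """Build FASTA text from (name, start, end, strand) tuples.

    start and end are 1-based and inclusive, as in a GenBank feature table;
    strand is +1 or -1. Hypothetical helper for illustration only.
    """
    records = []
    for name, start, end, strand in genes:
        seq = genome[start - 1:end]
        if strand == -1:
            # A gene on the reverse strand is read as the reverse complement.
            seq = seq.translate(COMPLEMENT)[::-1]
        records.append(f">{name}\n{seq}")
    return "\n".join(records)
```

With a made-up genome "ATGCCCTAA" and a forward gene at 1..3 plus a reverse gene at 4..9, this would yield one FASTA record per gene rather than one lumped sequence.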

HowTo break out of a Tapestry loop?

So, I am trying to break out of a Tapestry loop here.
This is my (more or less) simplified scenario:
<ul>
<t:loop source="firstSource" value="firstValue">
<li>
<t:loop source="firstValue" value="secondValue">
<p>${secondValue}</p>
</t:loop>
<t:loop source="secondSource" value="thirdValue">
<p>${thirdValue}</p>
</t:loop>
</li>
</t:loop>
</ul>
What I do not want to have is:
Tapestry loops through all entries in firstValue, then loops through all entries in secondSource. I do not want to iterate through secondSource inside the loop over firstValue, as this would iterate through all entries in secondSource, and I just want to do one iteration at a time.
What I want to have is:
Tapestry enters the loop for firstValue and does some printing or whatever, then breaks after the first iteration and jumps into secondSource to do its first iteration. After it has finished, it jumps back to firstValue and repeats these steps.
This is what "break;" would do in Java.
I did not find a clue in the Tapestry documentation on how to do this, nor in their forums.
But it has to be possible in some way. I can not imagine I am the only one trying to do this.

Just put an if statement around the logic, probably using an index variable:
<t:loop source="firstSource" value="firstValue">
<li>
<t:loop source="firstValue" value="secondValue" index="firstValueIndex">
<t:if test="firstCondition">
<p>${secondValue}</p>
</t:if>
</t:loop>
<t:loop source="secondSource" value="thirdValue">
<t:if test="secondCondition">
<p>${thirdValue}</p>
</t:if>
</t:loop>
</li>
</t:loop>
In the Java page:
@Property
private int firstValueIndex;
public boolean getFirstCondition() {
    // logic to determine whether to break out
    return firstValueIndex == 0;
}
public boolean getSecondCondition() {
    // logic
    return true; // placeholder so the sketch compiles; replace with your own condition
}

My guess is that you have three sources of data and are trying to output three columns, is this right?
Sometimes you have to transform your data a little bit: For example, you might need to do some work to convert one value from each of the three inputs into a single value:
public class Row {
    public Object col1, col2, col3;
}
In your Java code, you would build up a List of Row objects.
In your template, you iterate over the Row objects, rendering the col1, col2 and col3 properties.
(In Tapestry 5.3 and above, a public field can be treated as a property.)
I've used similar techniques to output a calendar, which can be very tricky to manage using conditionals and the like inside the template.
Remember the role of the Controller in MVC: its job is to mediate between the model and the view; sometimes that includes some simple transformations of the model data to fit in with the view.
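The transformation step described above, turning three separate sources into one list of rows, can be sketched in a language-neutral way. The Python snippet below is only an illustration: build_rows is an invented name, and padding shorter sources with None is an assumption about what should happen when the sources differ in length.

```python
from itertools import zip_longest


def build_rows(first, second, third):
    """Combine three sources into rows of (col1, col2, col3).

    Shorter sources are padded with None so that no value is dropped.
    """
    return list(zip_longest(first, second, third))
```

The template then needs only a single loop over the rows, with no break logic at all, which is the point of doing the transformation in the controller.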

Perl -- DBI selectall_arrayref when querying getting Not Hash Reference

I am very new to Perl (but come from a C# background), and I am trying to move some scripts to a Windows box.
Because some modules do not work easily on Windows, I have changed the way the script connects to the DB.
I have a SQL Server DB, and I had a loop reading each row in a table; within this loop, another query was sent to select different info.
I was getting the error that two statements can't be executed at once on the same connection.
As my connection object is global, I couldn't see an easy way round this, so I decided to store the first set of data in an array using:
my $query = shift;
my $aryref = $dbh->selectall_arrayref($query) || die "Could not select to array\n";
return($aryref);
(this is in a module file that is called)
I then do a foreach loop (where @$s_study is the $aryref returned above):
foreach my $r_study ( @$s_study ) {
~~~
my $surveyId = $r_study->{surveyid}; # <--- error on this line
~~~~
}
When I run this I get an error "Not a hash reference". I don't understand?!
Can anyone help!
Bex

You need to provide the { Slice => {} } parameter to selectall_arrayref if you want each row to be stored as a hash:
my $aryref = $dbh->selectall_arrayref($query, { Slice => {} });
By default, it returns a reference to an array containing a reference to an array for each row of data fetched.
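The same positional-versus-named distinction exists in other database APIs. As an analogy only (this is Python's standard sqlite3 module, not Perl DBI), the sketch below shows rows addressed by position by default and by column name once a row factory is set. The table name study and the name column are made up for the example; surveyid comes from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE study (surveyid INTEGER, name TEXT)")
conn.execute("INSERT INTO study VALUES (7, 'pilot')")

# Default: each row is a tuple, so columns are addressed by position,
# comparable to $row->[0] in the default selectall_arrayref result.
by_position = conn.execute("SELECT surveyid, name FROM study").fetchall()

# With a row factory, each row can be addressed by column name,
# comparable to $row->{surveyid} when Slice => {} is passed.
conn.row_factory = sqlite3.Row
by_name = conn.execute("SELECT surveyid, name FROM study").fetchall()
```

Here by_position[0][0] and by_name[0]["surveyid"] refer to the same value; only the access style differs.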

$r_study->{surveyid} dereferences $r_study as a hash reference.
$r_study->[0] dereferences it as an array reference.
That is your error: each row is an array reference,
so you should use the second form.

If you have a problem with a method, then a good first step is to read the documentation for that method. Here's a link to the documentation for selectall_arrayref. It says:
This utility method combines
"prepare", "execute" and
"fetchall_arrayref" into a single
call. It returns a reference to an
array containing a reference to an
array (or hash, see below) for each
row of data fetched.
So the default behaviour is to return a reference to an array which contains an array reference for each row. That explains your error. You're getting an array reference and you're trying to treat it as a hash reference. I'm not sure that the error could be much clearer.
There is, however, that interesting bit where it says "or hash, see below". Reading on, we find:
You may often want to fetch an array
of rows where each row is stored as a
hash. That can be done simple using:
my $emps = $dbh->selectall_arrayref(
    "SELECT ename FROM emp ORDER BY ename",
    { Slice => {} }
);
foreach my $emp ( @$emps ) {
    print "Employee: $emp->{ename}\n";
}
So you have two options. Either switch your code to treat each row as an array ref rather than a hash ref, or add the "{ Slice => {} }" option to the call, which makes each row a hash ref.
The documentation is clear. It's well worth reading it.

When you encounter something like "Not a hash reference" or "Not an array reference", you can always use Data::Dumper to dump out your variable, and you will quickly see what data you are dealing with: arrays of arrayrefs, hashes of something, etc.
As for reading the data, { Slice => {} } is a most valuable addition.

Hierarchical Database Driven Menu in MVC

I use the code below as an HtmlHelper that gets data from the database and loops over it to display a menu. This is fairly straightforward, as you can see. However, what if you have a database table using the adjacency list model of hierarchies, e.g. ID, ParentID, OrderID? It is easy to see what's going on, but recursion is needed to get this data out properly. Is writing a recursive C# function acceptable? If so, can someone help me with that? The expected output is something similar to this:
<ul>
<li>Item1
<ul>
<li>SubItem1</li>
</ul>
</li>
</ul>
SQL Server 2008 now has a hierarchyid data type, so I am not sure whether this will help?
I would also like some way of letting users decide what goes in a menu: for example, a list of items that can go in the menu, from which they choose items and their positions in the hierarchy. Once a save button is pressed, the hierarchy is stored in the database.
Am I asking too much? I'm sure this must be quite a common scenario.
Here is my HtmlHelper code if anyone wants to use it:
public static string Menu(this HtmlHelper helper, int MenuCat)
{
string menuHTML = "<ul id=\"menu\">";
var route = helper.ViewContext.RequestContext.RouteData;
string currentPageName = route.GetRequiredString("id");
DB db = DB.CreateDB();
//var result = from p in db.WebPages where p.CategoryID == 9 select p;
var result = from p in db.WebPages select p;
foreach (var item in result)
{
if (item.Name == currentPageName)
{
menuHTML += "\n\t<li>" + helper.ActionLink(item.Name, "Details", "Dinner", new { id = item.ID }, new { @class = "selected" }) + "</li>";
}
else
{
menuHTML += "\n\t<li>" + helper.ActionLink(item.Name, "Details", "Dinner", new { id = item.ID }, null) + "</li>";
}
}
menuHTML += "\n</ul>\n";
return menuHTML;
}

I would do two things here. First, don't bother rendering this yourself: use jQuery. If you Google "jquery menu" you'll find hundreds of links.
Next, put the ordering logic in your app. You don't need the DB to do this, as it soaks up cycles and (from what I've read) isn't terribly efficient. This is simple looping logic with a self-referencing join, which Linq is perfect for.
Hand this off to jQuery, and you're good to go without hard-coding HTML in code :)
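The in-app logic for an adjacency-list table is short. The sketch below uses Python rather than C#, purely to show the shape of the recursion: render_menu is an invented name, rows are assumed to be (ID, ParentID, OrderID, Name) tuples with ParentID set to None for top-level items, and the data is assumed to contain no cycles.

```python
from collections import defaultdict
from html import escape


def render_menu(rows):
    """Render (id, parent_id, order_id, name) rows as nested <ul> markup.

    Hypothetical helper; assumes parent_id is None for top-level items
    and that the parent links never form a cycle.
    """
    # Group the rows by parent so each level can be looked up directly.
    children = defaultdict(list)
    for item_id, parent_id, order_id, name in rows:
        children[parent_id].append((order_id, item_id, name))

    def render(parent_id):
        # Sorting the tuples orders siblings by order_id first.
        items = sorted(children.get(parent_id, []))
        if not items:
            return ""
        parts = ["<ul>"]
        for _order, item_id, name in items:
            # Recurse to place each item's children inside its <li>.
            parts.append(f"<li>{escape(name)}{render(item_id)}</li>")
        parts.append("</ul>")
        return "".join(parts)

    return render(None)
```

The whole table is read once and the nesting is built in memory, so no recursive queries are issued against the database.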

If you are using SQL Server 2005, take a look at Common Table Expressions (CTEs) (search for "CTE hierarchical data"). They allow you to create a view displaying the complete hierarchy.
But how many levels of depth are you displaying in the menu? Usually you only need to show the direct children and go further down the hierarchy as the user clicks the links. (No recursion needed.)

I always use recursive table-valued functions for fetching hierarchical data in SQL server.
See an example here:
blogs.conchango.com/christianwade/archive/2004/11/09/234.aspx
Unfortunately, there is a recursion limit (32 levels maximum) for SQL Server User Defined Functions (UDF) and Stored Procedures.
Note: If you use a table-valued function just drop it in your dbml file and you will be able to access it like any other table.
Another approach is to use the recursive query syntax (in the form of the WITH clause and Common Table Expressions, or CTEs) introduced in SQL Server 2005.
Take a look here:
www.eggheadcafe.com/articles/sql_server_recursion_with_clause.asp
An approach of mixing CTE with Linq-To-SQL is presented here:
stackoverflow.com/questions/584841/common-table-expression-cte-in-linq-to-sql
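To show the shape of such a recursive query, here is a small sketch that runs against an in-memory SQLite database through Python's sqlite3 module. SQLite is used only as a stand-in so the example is self-contained: its syntax requires WITH RECURSIVE, whereas SQL Server writes the same construct with plain WITH. The table layout follows the ID, ParentID, OrderID columns from the question; the sample rows and the Depth column are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE WebPages (ID INTEGER, ParentID INTEGER, OrderID INTEGER, Name TEXT);
    INSERT INTO WebPages VALUES (1, NULL, 1, 'Item1');
    INSERT INTO WebPages VALUES (2, 1, 1, 'SubItem1');
    INSERT INTO WebPages VALUES (3, NULL, 2, 'Item2');
""")

# Anchor member: top-level rows. Recursive member: children of rows found so far.
QUERY = """
WITH RECURSIVE tree(ID, Name, Depth) AS (
    SELECT ID, Name, 0 FROM WebPages WHERE ParentID IS NULL
    UNION ALL
    SELECT w.ID, w.Name, tree.Depth + 1
    FROM WebPages AS w JOIN tree ON w.ParentID = tree.ID
)
SELECT ID, Name, Depth FROM tree ORDER BY ID
"""
rows = conn.execute(QUERY).fetchall()
```

Each returned row carries its depth in the hierarchy, which is enough for the application to indent or nest the menu items.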
