See Also: Home Links Personal Site Blogroll  FriendFeed CV

Tags:

Topic Image

Project Ideas: Html Query

some notes from November 2001 about an idea to write a SQL like engine to be able to query content from HTML using DOM like paths

Aim:

Idea being to provde a human readable and writeable SQL like interface to the Perl modules that provide HTML parsing capability (something like XQuery for XML) without requiring any knowledge of Perl or the use or operation of the parser modules.

A SQL file would be written containing the querys required to process a source HTML file in some way, then the SQL/HTML engine would be called with a reference to the SQL file and the HTML file/s to be processed.

All output would head off to STDOUT land, and could be captured and written out to an XML file with and malformed (probably all) container HTML written as CDATA.

The XML output could then be run through an XSL transformation to re-produce the desired target format.

The SQL like layer would be interpreted by a script which itself would use a 3rd party non-validating HTML parser for fetching markup attributes and content from the source.

Issues:

As HTML is often not well structured/formed the results of a query will not always be as expected. The parser should be able to handle malformed markup, close containers that arent properly closed and act in a way similar to a tolerant and non-valdating web browser might do.

Language Design:

Could be modelled on a subset of the XQuery language from W3C

Should resemble SQL as closely as possible so as to be familiar to SQL writers, and easy to grasp for those with no experience. Each line terminated by ';' character so multiple lines able to be executed simultaneously, input able to be streamed in from SQL file, and output to STDOUT so can be piped elsewhere.

Standard SQL Language Elements/Verbs to be included...

SELECT AS FROM WHERE COUNT COLUMN FORMAT LIKE DESCRIBE

Operators...

= > < <= >= * ( ) % ? AND NOT OR ISNULL NOTNULL BETWEEN

Extended verbs:

CLASS VALUE ELEMENT ATTRIB TEXT

If no FROM clause given, assume is querying the whole document rather than a specific container (such as BODY or TABLE)

If the character <> follows a SELECT, assume that the text '' is a element name to be selected

Operation:

Could have the ability to parse a DTD looking for container elements so can build a library of expected containers for the given document type.

Example Queries


Fetch all meta-tags from the 1st HEAD container section

SELECT * FROM head

e.g.

<title>My Page</title>
<meta name='author' content='dean stringer'>
<meta name='keywords' content='test,meta,language,sql,query'>
<link rel='stylesheet' href='/css/w3t.css' type='text/css'>


Show (dump) page layout (containers only, not content)

DESCRIBE <body>

e.g.

<table><tr><td><h1></h1>
<p></p></td></tr></table>


Select paragraphs from the body with a specific class name

SELECT <p&gt FROM body WHERE CLASS = 'indent'

...or...

SELECT <p> WHERE CLASS = 'indent'   

e.g.

<p class='indent'>Here is an indented column</p>


Select paragraphs from the body with a specific ID

SELECT <p> FROM body WHERE ID = 'quote%'

e.g.

<p id='quote123'>Here is an indented column</p>


Fetch whole meta-tags from the 1st HEAD container section where any attribute names like '%author%'

SELECT * FROM head WHERE ATTRIB LIKE '%author%'   

e.g.

<meta name='author' content='dean stringer'>
<meta name='authoremail' content='deans@waikato.ac.nz'>


Fetch CONTENT attribute from meta-tags from the 1st HEAD container section where any attribute name = 'author.email'

SELECT VALUE FROM head WHERE ELEMENT = 'meta' and ATTRIB = 
'author.email'   

e.g.

"deans@waikato.ac.nz"


Fetch whole document TITLE tag

SELECT * FROM head WHERE ELEMENT = 'title'

e.g.

<title>My Page</title>


Fetch document TITLE text only

SELECT TEXT FROM head WHERE ELEMENT = 'title'

e.g.

"My Page"


Fetch all headings

SELECT * FROM body WHERE ELEMENT like 'h?'

e.g.

<h1>heading</h1>
<h2>sub heading</h2>


Fetch LINK tags from the HEAD container section where any attribute name = 'stylesheet'

SELECT link FROM head
SELECT * FROM head WHERE ELEMENT = 'link'
SELECT * FROM head WHERE ELEMENT = 'link' and 
HASATTRIBS ('rel','type') and 
HASVALUES 'stylesheet','text/css')

e.g.

<link rel='stylesheet' href='/css/uow2000emsie.css' type='text/css'>
<link rel='stylesheet' href='/css/w3t.css' type='text/css'>


Select title and address of links with specific url matching

SELECT href,title FROM <a> WHERE class='nav' AND 
href LIKE "%waikato%"

e.g.

'http://www.waikato.ac.nz/' , 'UOW Home page'


Fetch LINK tags href's only from the HEAD container section

SELECT VALUE FROM head WHERE ELEMENT = 'link' and ATTRIB='href'

e.g.

"/css/uow2000emsie.css"
"/css/w3t.css"


Select all meta tag names from the HEAD container section

SELECT name FROM head WHERE ELEMENT = 'meta';

e.g.

"author:authorname:keywords"


Select all meta tag names from the HEAD container section and use in following statement

SELECT name AS tagname FROM head WHERE ELEMENT = 'meta';


Fetch all tables (returned as array)

SELECT * FROM body WHERE ELEMENT = 'table'

e.g.

<table><tr><td>......</table>
<table border='1'><tr><td>......</table>


Fetch table cell 2 from the 3rd row of the 1st table

SELECT <td>[2] FROM <table>[1].<tr>[3]

e.g.

<table><tr><td>......</table>
<table border='1'><tr><td>......</table>


Fetch table rows from table 2 where column (cell) count = 3

SELECT <tr> FROM <table>[2] where 
<td>.count = 3

e.g.

<tr><td></td><td></td>
<td></td></tr>


Fetch table rows from table 2 where column (cell) count = 3

SELECT <table> FROM <body> where 
<tr>.count > 1

e.g.

<table><tr><td></td></tr>
<tr><td></td></tr></table>

XML representation:

<parse>
<seek>
   <element name='head'></seek>
   <select id='metaDataAuthor'><element name='meta'
      attrib='name' value='%author%' /></select>
   <select id='title'><element name='title' /></select>
<seek>
   <element name='body'>
</seek>
</parse>

References:

AnyDataNew Page::HTMLTable
http://www.vpservices.com/jeff/programs/AnyData/HTMLtable.html

HTML::TreeBuilder
http://search.cpan.org/doc/SBURKE/HTML-Tree-3.11/lib/HTML/TreeBuilder.pm

HTML::Element
http://search.cpan.org/doc/GAAS/libwww-perl-5.22/lib/HTML/Element.pm
http://search.cpan.org/doc/SBURKE/HTML-Tree-3.11/lib/HTML/Element.pm

HTML::TableExtract
http://www.mojotoad.com/sisk/projects/
http://search.cpan.org/search?dist=HTML-TableExtract
http://www.mojotoad.com/sisk/projects/HTML-TableExtract/tables.html
http://www.geocrawler.com/archives/3/1785/

XQuery - W3C
http://www.w3.org/TR/xquery/

XML Query Engine
http://www.fatdog.com/

SGML-QL
http://www.univ-tln.fr/~gect/simm/SgmlQL/

PQL - Plain Query Language
http://sal.kachinatech.com/H/1/PQL.html

VisualPQLNew Page (Visual Procedural Query Language
http://www.sir.com.au/help/visualpql_intro.htm


See Also: Project Ideas | Web Development | Notes Index