News, Events, Trends, Activities, Conferences and Notes to do with Web Culture, Development, New Media, Content Management, Mobile and PDA Access and Web Infrastructure
|
See Also: Home Links Personal Site Blogroll FriendFeed CV | |
Wiki Menu:Tags:
|
Project Ideas: Html Querysome notes from November 2001 about an idea to write a SQL like engine to be able to query content from HTML using DOM like paths
Aim:Idea being to provde a human readable and writeable SQL like interface to the Perl modules that provide HTML parsing capability (something like XQuery for XML) without requiring any knowledge of Perl or the use or operation of the parser modules. A SQL file would be written containing the querys required to process a source HTML file in some way, then the SQL/HTML engine would be called with a reference to the SQL file and the HTML file/s to be processed. All output would head off to STDOUT land, and could be captured and written out to an XML file with and malformed (probably all) container HTML written as CDATA. The XML output could then be run through an XSL transformation to re-produce the desired target format. The SQL like layer would be interpreted by a script which itself would use a 3rd party non-validating HTML parser for fetching markup attributes and content from the source.
Issues:As HTML is often not well structured/formed the results of a query will not always be as expected. The parser should be able to handle malformed markup, close containers that arent properly closed and act in a way similar to a tolerant and non-valdating web browser might do.
Language Design:Could be modelled on a subset of the XQuery language from W3C Should resemble SQL as closely as possible so as to be familiar to SQL writers, and easy to grasp for those with no experience. Each line terminated by ';' character so multiple lines able to be executed simultaneously, input able to be streamed in from SQL file, and output to STDOUT so can be piped elsewhere. Standard SQL Language Elements/Verbs to be included...
SELECT AS FROM WHERE COUNT COLUMN FORMAT LIKE DESCRIBE Operators...
= > < <= >= * ( ) % ? AND NOT OR ISNULL NOTNULL BETWEEN
Extended verbs:
CLASS VALUE ELEMENT ATTRIB TEXT If no FROM clause given, assume is querying the whole document rather than a specific container (such as BODY or TABLE) If the character <> follows a SELECT, assume that the text '' is a element name to be selected
Operation:Could have the ability to parse a DTD looking for container elements so can build a library of expected containers for the given document type.
Example Queries
Fetch all meta-tags from the 1st HEAD container section
SELECT * FROM head e.g.
<title>My Page</title> <meta name='author' content='dean stringer'> <meta name='keywords' content='test,meta,language,sql,query'> <link rel='stylesheet' href='/css/w3t.css' type='text/css'>
Show (dump) page layout (containers only, not content)
DESCRIBE <body> e.g. <table><tr><td><h1></h1> <p></p></td></tr></table>
Select paragraphs from the body with a specific class name
SELECT <p> FROM body WHERE CLASS = 'indent' ...or...
SELECT <p> WHERE CLASS = 'indent' e.g. <p class='indent'>Here is an indented column</p>
Select paragraphs from the body with a specific ID
SELECT <p> FROM body WHERE ID = 'quote%' e.g. <p id='quote123'>Here is an indented column</p>
Fetch whole meta-tags from the 1st HEAD container section where any attribute names like '%author%'
SELECT * FROM head WHERE ATTRIB LIKE '%author%' e.g.
<meta name='author' content='dean stringer'> <meta name='authoremail' content='deans@waikato.ac.nz'>
Fetch CONTENT attribute from meta-tags from the 1st HEAD container section where any attribute name = 'author.email'
SELECT VALUE FROM head WHERE ELEMENT = 'meta' and ATTRIB = 'author.email' e.g. "deans@waikato.ac.nz"
Fetch whole document TITLE tag
SELECT * FROM head WHERE ELEMENT = 'title' e.g. <title>My Page</title>
Fetch document TITLE text only
SELECT TEXT FROM head WHERE ELEMENT = 'title' e.g. "My Page"
Fetch all headings
SELECT * FROM body WHERE ELEMENT like 'h?' e.g.
<h1>heading</h1> <h2>sub heading</h2>
Fetch LINK tags from the HEAD container section where any attribute name = 'stylesheet'
SELECT link FROM head
SELECT * FROM head WHERE ELEMENT = 'link'
SELECT * FROM head WHERE ELEMENT = 'link' and
HASATTRIBS ('rel','type') and
HASVALUES 'stylesheet','text/css')
e.g.
<link rel='stylesheet' href='/css/uow2000emsie.css' type='text/css'> <link rel='stylesheet' href='/css/w3t.css' type='text/css'>
Select title and address of links with specific url matching
SELECT href,title FROM <a> WHERE class='nav' AND href LIKE "%waikato%" e.g. 'http://www.waikato.ac.nz/' , 'UOW Home page'
Fetch LINK tags href's only from the HEAD container section
SELECT VALUE FROM head WHERE ELEMENT = 'link' and ATTRIB='href' e.g.
"/css/uow2000emsie.css" "/css/w3t.css"
Select all meta tag names from the HEAD container section
SELECT name FROM head WHERE ELEMENT = 'meta'; e.g. "author:authorname:keywords"
Select all meta tag names from the HEAD container section and use in following statement
SELECT name AS tagname FROM head WHERE ELEMENT = 'meta';
Fetch all tables (returned as array)
SELECT * FROM body WHERE ELEMENT = 'table' e.g.
<table><tr><td>......</table> <table border='1'><tr><td>......</table>
Fetch table cell 2 from the 3rd row of the 1st table
SELECT <td>[2] FROM <table>[1].<tr>[3] e.g.
<table><tr><td>......</table> <table border='1'><tr><td>......</table>
Fetch table rows from table 2 where column (cell) count = 3
SELECT <tr> FROM <table>[2] where <td>.count = 3 e.g. <tr><td></td><td></td> <td></td></tr>
Fetch table rows from table 2 where column (cell) count = 3
SELECT <table> FROM <body> where <tr>.count > 1 e.g. <table><tr><td></td></tr> <tr><td></td></tr></table>
XML representation:
<parse>
<seek>
<element name='head'></seek>
<select id='metaDataAuthor'><element name='meta'
attrib='name' value='%author%' /></select>
<select id='title'><element name='title' /></select>
<seek>
<element name='body'>
</seek>
</parse>
References:
AnyData
HTML::TreeBuilder
HTML::Element
HTML::TableExtract
XQuery - W3C
XML Query Engine
SGML-QL
PQL - Plain Query Language
VisualPQL
See Also: Project Ideas | Web Development | Notes Index |