|
ActiveTcl User Guide
|
|
|
[ Main table Of Contents | Tcllib Table Of Contents | Tcllib Index ]
htmlparse(n) 0.3.1 "HTML Parser"
htmlparse - Procedures to parse HTML strings
package require Tcl 8.2
package require struct 1
package require cmdline 1.1
package require htmlparse ?0.3.1?
The htmlparse package provides commands that
allow libraries and applications to parse HTML in a string into a
representation of their choice.
The following commands are available:
- ::htmlparse::parse ?-cmd cmd? ?-vroot tag? ?-split n? ?-incvar var? ?-queue q? html
- This command is the basic parser for HTML. It takes an HTML
string, parses it and invokes a command prefix for every tag
encountered. It is not necessary for the HTML to be valid for this
parser to function. It is the responsibility of the command invoked
for every tag to check this. Another responsibility of the invoked
command is the handling of tag attributes and character entities
(escaped characters). The parser provides the un-interpreted tag
attributes to the invoked command to aid in the former, and the
package at large provides a helper command, ::htmlparse::mapEscapes, to aid in the handling of the
latter. The parser does ignore leading DOCTYPE
declarations and all valid HTML comments it encounters.
All information beyond the HTML string itself is specified via
options, these are explained below.
To help understand the options, some more background information
about the parser.
It is capable of detecting incomplete tags in the HTML string
given to it. Under normal circumstances this will cause the parser
to throw an error, but if the option -incvar is
used to specify a global (or namespace) variable, the parser will
store the incomplete part of the input into this variable instead.
This will aid greatly in the handling of incrementally arriving
HTML, as the parser will handle whatever it can and defer the
handling of the incomplete part until more data has arrived.
Another feature of the parser are its two possible modes of
operation. The normal mode is activated if the option -queue is not present on the command line invoking the
parser. If it is present, the parser will go into the incremental
mode instead.
The main difference is that a parser in normal mode will
immediately invoke the command prefix for each tag it encounters.
In incremental mode however the parser will generate a number of
scripts which invoke the command prefix for groups of tags in the
HTML string and then store these scripts in the specified queue. It
is then the responsibility of the caller of the parser to ensure
the execution of the scripts in the queue.
Note: The queue object given to the parser has to provide
the same interface as the queue defined in tcllib -> struct.
This means, for example, that all queues created via that tcllib
module can be immediately used here. Still, the queue doesn't have
to come from tcllib -> struct as long as the same interface is
provided.
In both modes the parser will return an empty string to the
caller.
The -split option may be given to a parser in
incremental mode to specify the size of the groups it creates. In
other words, -split 5 means that each of the generated scripts will
invoke the command prefix for 5 consecutive tags in the HTML
string. A parser in normal mode will ignore this option and its
value.
The option -vroot specifies a virtual root tag.
A parser in normal mode will invoke the command prefix for it
immediately before and after it processes the tags in the HTML,
thus simulating that the HTML string is enclosed in a <vroot>
</vroot> combination. In incremental mode however the parser
is unable to provide the closing virtual root as it never knows
when the input is complete. In this case the first script generated
by each invocation of the parser will contain an invocation of the
command prefix for the virtual root as its first command. The
following options are available:
- -cmd cmd
- The command prefix to invoke for every tag in the HTML string.
Defaults to ::htmlparse::debugCallback.
- -vroot tag
- The virtual root tag to add around the HTML in normal mode. In
incremental mode it is the first tag in each chunk processed by the
parser, but there will be no closing tags. Defaults to hmstart.
- -split n
- The size of the groups produced by an incremental mode parser.
Ignored when in normal mode. Defaults to 10. Values <= 0 are not
allowed.
- -incvar var
- The name of the variable where to store any incomplete HTML
into. This makes most sense for the incremental mode. The parser
will throw an error if it sees incomplete HTML and has no place to
store it to. This makes sense for the normal mode. Only incomplete
tags are detected, not missing tags. Optional, defaults to 'no
variable'.
- Interface to the command prefix
- In normal mode the parser will invoke the command prefix with
four arguments appended. See ::htmlparse::debugCallback for a description.
In incremental mode, however, the generated scripts will invoke
the command prefix with five arguments appended. The last four of
these are the same which were mentioned above. The first is a
placeholder string (\\win\\) for a clientdata
value to be supplied later during the actual execution of the
generated scripts. This could be a tk window path, for example.
This allows the user of this package to preprocess HTML strings
without committing them to a specific window, object, whatever
during parsing. This connection can be made later. This also means
that it is possible to cache preprocessed HTML. Of course, nothing
prevents the user of the parser from replacing the placeholder with
an empty string.
- ::htmlparse::debugCallback ?clientdata? tag slash param
textBehindTheTag
- This command is the standard callback used by the parser in ::htmlparse::parse if none was specified by the
user. It simply dumps its arguments to stdout. This callback can be
used for both normal and incremental mode of the calling parser. In
other words, it accepts four or five arguments. The last four
arguments are described below. The optional fifth argument contains
the clientdata value passed to the callback by a parser in
incremental mode. All callbacks have to follow the signature of
this command in the last four arguments, and callbacks used in
incremental parsing have to follow this signature in the last five
arguments.
The first argument, clientdata, is optional and
present only if this command is invoked by a parser in incremental
mode. It contains whatever the user of this package wishes.
The second argument, tag, contains the name of
the tag which is currently processed by the parser.
The third argument, slash, is either empty or
contains a slash character. It allows the callback to distinguish
between opening (slash is empty) and closing tags (slash contains a
slash character).
The fourth argument, param, contains the
un-interpreted list of parameters to the tag.
The fifth and last argument, textBehindTheTag,
contains the text found by the parser behind the tag named in tag.
- ::htmlparse::mapEscapes html
- This command takes a HTML string, substitutes all escape
sequences with their actual characters and then returns the
resulting string. HTML strings which do not contain escape
sequences are returned unchanged.
- ::htmlparse::2tree html tree
- This command is a wrapper around ::htmlparse::parse which takes an HTML string (in html) and converts it into a tree containing the
logical structure of the parsed document. The name of the tree is
given to the command as its second argument (tree). The command does not generate
the tree by itself but expects that the caller provided it with an
existing and empty tree. It also expects that the specified tree
object follows the same interface as the tree object in tcllib
-> struct. It doesn't have to be from tcllib -> struct, but
it must provide the same interface.
The internal callback does some basic checking of HTML validity
and tries to recover from the most basic errors. The command
returns the contents of its second argument. Side effects are the
creation and manipulation of a tree object.
- ::htmlparse::removeVisualFluff
tree
- This command walks a tree as generated by ::htmlparse::2tree and removes all the nodes which
represent visual tags and not structural ones. The purpose of the
command is to make the tree easier to navigate without getting
bogged down in visual information not relevant to the search. Its
only argument is the name of the tree to cut down.
- ::htmlparse::removeFormDefs tree
- Like ::htmlparse::removeVisualFluff this
command is here to cut down on the size of the tree as generated by
::htmlparse::2tree. It removes all nodes
representing forms and form elements. Its only argument is the name
of the tree to cut down.
html , parsing