atox-user Mailing List for Atox

Hi!

Just on the off chance that anyone actually cares <wink>, I thought
I'd give a brief update on the current state of affairs. After quite a
bit of pondering (and prototyping in C, of all things ;) I now think
I've gotten the major pieces of the Atox 1 puzzle together. Atox 1 is
the new architecture (the one I've mentioned earlier, but haven't
released yet) that will be used in Atox version 1.0 (whenever that
comes out ;)

The major functionality is now in place, and even though there are
lots of rough edges (and *tons* of "XXX"-comments in the code) and the
user interface is at present simply an undocumented (and as yet
sometimes a bit inconsistent) API, it does seem to work quite well.

What follows is a (possibly somewhat incoherent) summary of features
and the basic architecture.

In previous versions (at least some of them), I've worked with one
single "paradigm" for marking up the text: Context-sensitive,
recursive searching for regexp-based tag patterns. This is versatile
(I've even stretched it to marking up Python code, for example) but
unnecessarily consistent. (A foolish consistency, and all that...) It
also seemed unnecessarily slow... Once you got a lot of features that
you wanted to look for, the old Atox would grind to a halt, more or
less.

When thinking of how to improve this, one of my early ideas was to
chunk up the document in paragraphs/blocks, as I had done in earlier
systems. This can be done quickly, and if you can classify the blocks,
you may have something of an idea of what to look for (e.g., don't
look for anything inside a code block). Also, you no longer have to
*search* for the features that define, say, a heading -- you just
check whether a given block looks like a heading. That should be
cheaper.
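
A minimal sketch of the chunk-then-classify idea, not the actual Atox
code. The blank-line chink pattern and the heading rule (a short single
line that doesn't end in a period) are assumptions made up purely for
illustration; the names split_blocks and classify are hypothetical too.

```python
import re

# Assumed chink pattern: a newline followed by one or more lines that are
# empty or contain only spaces/tabs.
CHINK = re.compile(r'\n(?:[ \t]*\n)+')

def split_blocks(text):
    """Split text into blocks at runs of empty lines."""
    return [block for block in CHINK.split(text.strip()) if block]

def classify(block):
    """Guess a block type by inspecting the block itself (no searching).

    The heading rule here is a made-up heuristic, not Atox's rule.
    """
    lines = block.split('\n')
    if len(lines) == 1 and len(block) < 60 and not block.endswith('.'):
        return 'heading'
    return 'p'

text = "A Heading\n\nFirst paragraph,\nsecond line.\n\n  \nLast paragraph."
blocks = [(classify(block), block) for block in split_blocks(text)]
```

The point is only that each block is checked once against a few cheap
rules, instead of the whole document being searched for every feature.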

One of the features in the old recursive searching-thing was that it
used a (poorly implemented) LL(1) engine (I've since implemented
several replacements, but currently don't use any of them), which let
it automatically insert "wrapping" tags. So, if it found a list item,
and *knew* that that could only occur inside a list, the list would
automatically be inserted. How could I achieve something like this if
I used a chunking (or, rather, chinking -- looking for the chinks
between the chunks) approach?

Another issue to deal with was how to modularize the system. I wanted
to be able to combine several rather independent modules that could
affect the document -- including external programs and custom
plug-ins. (I'll get back to the issue of "fixing" in a little while :)
This issue was (after considering several similar options, such as
blackboard architectures) eventually resolved by using a simple *event
pipeline* architecture. I decided to go with PYX events, because they
are simple and somewhat standard. (Also, I was working in C at the
time, and wanted to stay *really* simple here.)

Each PYX event is simply a string, where the first character
represents the event type, and the rest represents the contents. By
using Unicode objects (after again switching to Python, of course),
encoding doesn't become an issue (internally).

The PYX event types are as follows:

  '(foo'       Start tag, <foo>
  ')foo'       End tag, </foo>
  'Afoo bar'   Attribute, foo="bar"
  '-foo'       Text, foo
  '?foo bar'   Processing instruction, <?foo bar?>

  (See http://www.xml.com/pub/a/2000/03/15/feature for more info)

In addition, I've added '\0' as an EOF-event (to allow flushing
within the same framework). A PYX event stream can easily be
serialized as a text stream, with one event per line (with newlines
encoded as '\\n', for example) for simple external text-processing
programs.
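
A small sketch of the one-event-per-line serialization, assuming that
backslashes are escaped as well as newlines (so the encoding can be
reversed); that detail, and the names encode_event and decode_event, are
assumptions and not something taken from the Atox code.

```python
def encode_event(event):
    """Serialize one PYX event as a single line of text."""
    # Escape backslashes first, so an encoded newline stays unambiguous.
    return event.replace('\\', '\\\\').replace('\n', '\\n')

def decode_event(line):
    """Undo encode_event: turn one line back into a PYX event."""
    out = []
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == '\\' and i + 1 < len(line):
            nxt = line[i + 1]
            out.append('\n' if nxt == 'n' else nxt)
            i += 2
        else:
            out.append(ch)
            i += 1
    return ''.join(out)

# <p class="intro">Hello,(newline)world!</p>, followed by the EOF event.
events = ['(p', 'Aclass intro', '-Hello,\nworld!', ')p', '\0']
stream = '\n'.join(encode_event(event) for event in events)
decoded = [decode_event(line) for line in stream.split('\n')]
```

An external program can then read the stream line by line and look at
the first character of each line to decide what kind of event it has.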

The Atox pipeline then consists simply of objects that support a
feed(event) method, and that will send events to the object in their
own sink attribute:

  somefilter = SomeFilter()
  somefilter.sink = outputfilterofsomekind
  somefilter.feed('-Hello, world!')

somefilter is here free to do what it wants, and can relay events to
outputfilterofsomekind as it pleases. (The sink attribute is the
mechanism by which filters are chained.) The Atox framework core is
then a set of such filters, that together make it possible (and
hopefully easy) to do most text-to-XML conversion without too much
programming. However, it's easy to add new filters, of course. For
example, let's say you want to replace all occurrences of '--' with
the Unicode emdash code point/character; you could do this by
implementing the following filter (unless you wanted to use some
built-in search/replace filter):

class EmdashThingy:

    def feed(self, event):
        # Only text events (those starting with '-') are rewritten.
        if event[0] == '-':
            # U+2014 is the em dash code point.
            event = '-' + event[1:].replace('--', u'\u2014')
        self.sink.feed(event)

(Note that I'm just *assuming* that the sink has been set here. A more
reasonable approach might, perhaps, be to check for it, and do nothing
if there is no sink.)
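
A sketch of that more defensive variant, with a tiny collecting sink so
the pipeline can be tried out. The SinkFilter base class and the
Collector are hypothetical helpers for this example, and are not meant
to show how Atox's own classes are written.

```python
class SinkFilter(object):
    """Hypothetical base class: relays events only if a sink is attached."""
    sink = None

    def emit(self, event):
        if self.sink is not None:
            self.sink.feed(event)

    def feed(self, event):
        # Default behaviour: pass every event through unchanged.
        self.emit(event)

class Emdash(SinkFilter):
    """Replaces '--' with an em dash (U+2014) in text events."""

    def feed(self, event):
        if event.startswith('-'):
            event = '-' + event[1:].replace('--', u'\u2014')
        self.emit(event)

class Collector(object):
    """End of the pipeline: simply remembers every event it is fed."""

    def __init__(self):
        self.events = []

    def feed(self, event):
        self.events.append(event)

emdash = Emdash()
emdash.feed('-no sink yet -- nothing happens')  # dropped, no error

collector = Collector()
emdash.sink = collector
for event in ['(p', '-Atox -- a summary', ')p']:
    emdash.feed(event)
```

Whether silently dropping events is better than failing loudly is a
design choice; for debugging a pipeline, an error may well be preferable.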

Now, about that "fixing" -- wrapping list items... I just happened to
come across the FMQ algorithm for fixing sentences based on LL(1)
algorithms (Fisher, Milton and Quiring, 1977, 'An efficient
insertion-only error-corrector for LL(1) parsers'), after trying to
figure out something similar myself (in a rather ad-hoc fashion). The
algorithm takes an arbitrary input and, only by inserting a minimum
number of tokens, fixes the input so that it conforms to a given LL(1)
grammar. Magic! (Only works on a subset of LL(1) grammars, but
still...) So, I implemented this (in C and in Python).

Later on, I realised that using LL(1) grammars wasn't really playing
nice with the user, who had to specify the grammar and make sure it
was LL(1) (and insertion-fixable, no less). Instead, I thought about
what I was really trying to describe, and came to the conclusion that
I only cared about which elements were allowed inside which others,
not in which order the elements were allowed. This led me to another
schema concept (of my own "invention"): The containment graph.

The nodes of the containment graph are elements, and the (directed)
edges represent legal containment relationships. So, for example, 'p'
and 'em' may be nodes, and there may be an edge from 'p' to 'em'
(because you can have 'em' inside 'p') but not vice versa (in this
case). On the other hand, there may be links both ways between 'span'
and 'em', for example. The user doesn't have to think of it as a
graph; it just makes it easier to deal with it internally (as you will
see in a minute). The user can just think of it as a grammar-like
thing without any order constraints.

For example, the user can simply specify the adjacency lists:

  schema = {
    'doc':   ['p'],
    'p':     ['em', 'span'],
    'em':    ['span', 'em'],
    'span':  ['em', 'span']
  }

The nice thing about this is that I can now use basic graph algorithms
instead of somewhat confusing grammar-based algorithms such as the
FMQ. For example, the Fixer filter that's included in the current Atox
code uses a slightly modified version of the Floyd-Warshall algorithm
(allowing some backward-edges at the beginning of the path,
representing end-events) to find the minimum set of events (i.e.,
tags) that must be inserted to fix the input at a certain point.

For example, consider the following event sequence:

  '(doc', '(em'

At this point, we know it doesn't conform to the schema. However
(ignoring the "backpedaling" part of the algorithm, which doesn't
apply here), the shortest path from the 'doc' node to the 'em' node
goes through the 'p' node, and thus the Fixer inserts a '(p' event
before the '(em' event, and Bob's your uncle.
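
To make the shortest-path idea concrete, here is a rough sketch. Note
that it uses a plain breadth-first search rather than the modified
Floyd-Warshall algorithm described above, and that it leaves out the
"backpedaling" (end-event) part entirely; the function name
missing_parents is made up for the example.

```python
from collections import deque

schema = {
    'doc':  ['p'],
    'p':    ['em', 'span'],
    'em':   ['span', 'em'],
    'span': ['em', 'span'],
}

def missing_parents(schema, parent, child):
    """Return the elements that must be opened between parent and child.

    The result is the interior of a shortest path from parent to child in
    the containment graph: [] if child is directly legal inside parent,
    None if child cannot occur anywhere below parent.
    """
    queue = deque([(parent, [])])
    seen = set([parent])
    while queue:
        node, path = queue.popleft()
        for nxt in schema.get(node, []):
            if nxt == child:
                return path
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None

# The '(doc', '(em' example: a '(p' event is needed before '(em'.
inserted = missing_parents(schema, 'doc', 'em')
fixed = ['(doc'] + ['(' + name for name in inserted] + ['(em']
```

A search per event like this is simple but repeats work; precomputing
all shortest paths once (as Floyd-Warshall does) avoids that.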

(Another nice application of using graphs as schemas is calculating
all legal descendants by calculating the transitive closure of the
graph. This is also used in the Atox code.)
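
As a sketch of the transitive-closure idea (a straightforward
reachability search per node, which is not necessarily how the Atox code
computes it; the function name descendants is an assumption):

```python
schema = {
    'doc':  ['p'],
    'p':    ['em', 'span'],
    'em':   ['span', 'em'],
    'span': ['em', 'span'],
}

def descendants(schema):
    """Map each element to the set of elements that may occur below it."""
    closure = {}
    for root in schema:
        reached = set()
        stack = list(schema[root])
        while stack:
            node = stack.pop()
            if node not in reached:
                reached.add(node)
                stack.extend(schema.get(node, []))
        closure[root] = reached
    return closure

closure = descendants(schema)
```

With this schema, 'doc' can (eventually) contain 'p', 'em' and 'span',
while nothing can ever contain 'doc'.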

So, what have we got so far...? We have a pipeline of events and a
Fixer filter that can make the event stream conform to a schema.
That's hardly enough...

The following is the collection of standard filters that I currently
see as the core of Atox (and that I have currently implemented):

  Reader: Reads partial or possibly ill-formed XML (which can simply
  be plain text with no XML, a full XML file, or anything in-between)
  and constructs a PYX stream. Accepts (almost) anything as input.

  Document: A filter to put at the end of the pipeline. Accepts any
  PYX stream and creates a well-formed XML document as output.

  Chinker: Splits text into chunks, by using a chink pattern (defaults
  to one or more empty lines with optional whitespace). Uses a highly
  complicated (IMO) algorithm to be able to deal with partial mark-up
  in its input stream. Should work the way you expect it to, though.

  Fixer: Inserts start- and end-element events to ensure that an event
  stream conforms to a given schema (containment graph).

  Tagger: Searches for patterns in text events (possibly restricted
  only to certain elements) and inserts start- or end-element events.
  (Which patterns to look for changes depending on the current parent
  element.) Used primarily for inline mark-up.

  Trafo: Gathers up the contents of certain elements (as specified)
  into tiny trees (possibly only containing text, if there was no
  sub-markup) and uses a user-defined callback to modify this tree,
  before the tree is again "exploded" back into the PYX stream.
  Primarily intended to deal with renaming and rewriting blocks based
  on their text contents.
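
To give a feel for what a Tagger-like filter does to the event stream,
here is a deliberately tiny sketch. It is not the Atox Tagger: it knows
a single hard-coded pattern (*starred* text becomes an 'em' element),
ignores the current parent element, and assumes its sink has been set.
The class names StarTagger and Collector are hypothetical.

```python
import re

class Collector(object):
    """End of the pipeline: simply remembers every event it is fed."""

    def __init__(self):
        self.events = []

    def feed(self, event):
        self.events.append(event)

class StarTagger(object):
    """Wraps *starred* spans of text events in 'em' start/end events."""

    pattern = re.compile(r'\*([^*\n]+)\*')
    sink = None

    def feed(self, event):
        if not event.startswith('-'):
            # Not a text event: relay it untouched.
            self.sink.feed(event)
            return
        text, pos = event[1:], 0
        for match in self.pattern.finditer(text):
            if match.start() > pos:
                # Plain text before the match.
                self.sink.feed('-' + text[pos:match.start()])
            self.sink.feed('(em')
            self.sink.feed('-' + match.group(1))
            self.sink.feed(')em')
            pos = match.end()
        if pos < len(text):
            # Plain text after the last match.
            self.sink.feed('-' + text[pos:])

tagger = StarTagger()
tagger.sink = collector = Collector()
for event in ['(p', '-This is *very* simple.', ')p']:
    tagger.feed(event)
```

One text event goes in, and a mix of text and element events comes out,
which later filters (such as a Fixer) can then check against the schema.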

In addition there are utility classes, such as Filter (and Dispatcher)
and Normalizer for dealing with low-level handling of the PYX stream,
as well as SearchCore, which encapsulates basic search functionality
with a simple pattern language (a subset of the Python re module
functionality), to make it easier to implement similar functionality
in other languages (if, for example, one wants to create a highly
optimized version in C). This SearchCore is used by both Chinker and
Tagger at the moment.

One important feature that is implicit in the description above is
that you can mix your custom markup with (partial, potentially
ill-formed, if you like) XML. This is partly a consequence of the fact
that every filter should accept a more or less arbitrary PYX stream,
and partly a design choice: By allowing explicit markup into the
"implicit/invisible" plain-text markup that Atox was designed to
handle, you can get the best of both worlds. For example, you could go
almost to the XML extreme, using, say, XHTML but just dropping the <p>
tags (letting Atox take care of those). Or, you could go in the other
direction, and use a plain-text Wiki-like markup, but use a <code> tag
for your code listings, to avoid having the chinker foul them up by
splitting them at empty lines. You could do this in other ways too,
but simply using XML may be the easiest in many cases.

And, because such filters as Chinker are markup-savvy, you can go from
something like

   This is a paragraph.

   <code>This

   is a listing</code> <em>Some</em> more text.

to...

   <p>This is a paragraph.</p>

   <code>This

   is a listing</code> <p><em>Some</em> more text.</p>

Because the Chinker knows (or -- has to be told ;) the legal
containment relationships, it knows that it should not do something
like this...

   <p>This is a paragraph.</p>

   <p><code>This

   is a listing</code> <em>Some</em> more text.</p>

There are, of course, many other ways in which this could go wrong. I
think the current approach will do the Right Thing (tm) in most (if
not all) cases.

There is quite a bit left to be done, certainly (for example, a more
user-friendly interface, more default functionality rather than only
the basic, generic filters, etc.), but there is at least a core that
works rather well, now, and that takes care of some pretty hefty
tasks.

As for practical matters... I currently use two external libraries:
PyXML (for the sgmlop module; I use this rather than effbot's current
version, because the PyXML sgmlop module can deal with Unicode input)
and the elementtree package (or cElementTree, if available).

There are still too many rough edges to release it yet, I think, but
I'll gladly send the code to anyone who wants a sneak peek. As before,
there's no telling what the rate of progress will be, but at least I'm
quite hopeful that there will be a release ;)

-- 
Magnus Lie Hetland                    Fall seven times, stand up eight
http://hetland.org                                  [Japanese proverb]
