atox-user Mailing List for Atox
Status: Pre-Alpha
Brought to you by:
mlh
This list is closed, nobody may subscribe to it.
2004 | Jan | Feb (9) | Mar (2) | Apr (12) | May (3) | Jun (4) | Jul | Aug | Sep | Oct | Nov | Dec |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2005 | Jan | Feb | Mar (1) | Apr | May (4) | Jun | Jul | Aug (1) | Sep | Oct | Nov | Dec |
From: Magnus L. H. <ma...@he...> - 2005-08-08 13:50:03
Hi!

Just on the off chance that anyone actually cares <wink>, I thought I'd give a brief update on the current state of affairs.

After quite a bit of pondering (and prototyping in C, of all things ;) I now think I've gotten the major pieces of the Atox 1 puzzle together. Atox 1 is the new architecture (the one I've mentioned earlier, but haven't released yet) that will be used in Atox version 1.0 (whenever that comes out ;)

The major functionality is now in place, and even though there are lots of rough edges (and *tons* of "XXX"-comments in the code) and the user interface is at present simply an undocumented (and as yet sometimes a bit inconsistent) API, it does seem to work quite well. What follows is a (possibly somewhat incoherent) summary of features and the basic architecture.

In previous versions (at least some of them), I've worked with one single "paradigm" for marking up the text: Context-sensitive, recursive searching for regexp-based tag patterns. This is versatile (I've even stretched it to marking up Python code, for example) but unnecessarily consistent. (A foolish consistency, and all that...) It also seemed unnecessarily slow... Once you got a lot of features that you wanted to look for, the old Atox would grind to a halt, more or less.

When thinking of how to improve this, one of my early ideas was to chunk up the document in paragraphs/blocks, as I had done in earlier systems. This can be done quickly, and if you can classify the blocks, you may have something of an idea of what to look for (e.g., don't look for anything inside a code block). Also, you no longer have to *search* for the features that define, say, a heading -- you just check whether a given block looks like a heading. That should be cheaper.
One of the features in the old recursive searching-thing was that it used a (poorly implemented) LL(1) engine (I've since implemented several replacements, but currently don't use any of them), which let it automatically insert "wrapping" tags. So, if it found a list item, and *knew* that that could only occur inside a list, the list would automatically be inserted. How could I achieve something like this if I used a chunking (or, rather, chinking -- looking for the chinks between the chunks) approach?

Another issue to deal with was how to modularize the system. I wanted to be able to combine several rather independent modules that could affect the document -- including external programs and custom plug-ins. (I'll get back to the issue of "fixing" in a little while :)

This issue was (after considering several similar options, such as blackboard architectures) eventually resolved by using a simple *event pipeline* architecture. I decided to go with PYX events, because they are simple and somewhat standard. (Also, I was working in C at the time, and wanted to stay *really* simple here.) Each PYX event is simply a string, where the first character represents the event type, and the rest represents the contents. By using Unicode objects (after again switching to Python, of course), encoding doesn't become an issue (internally). The PYX event types are as follows:

    '(foo'      Start tag, <foo>
    ')foo'      End tag, </foo>
    'Afoo bar'  Attribute, foo="bar"
    '-foo'      Text, foo
    '?foo bar'  Processing instruction, <?foo bar?>

(See http://www.xml.com/pub/a/2000/03/15/feature for more info)

In addition, I've added '\0' as an EOF-event (to allow flushing within the same framework). A PYX event stream can easily be serialized as a text stream, with one event per line (with newlines encoded as '\\n', for example) for simple external text-processing programs.
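As a rough illustration of the one-event-per-line serialization just described, here is a minimal Python 3 sketch. The function names (`escape`, `unescape`, `serialize`, `deserialize`) and the decision to escape backslashes as well as newlines are assumptions made for this example; nothing here is taken from the Atox code.

```python
EOF = '\0'  # the extra end-of-stream event mentioned in the message


def escape(event):
    """Make an event safe to store on one line (assumed escaping scheme)."""
    return event.replace('\\', '\\\\').replace('\n', '\\n')


def unescape(line):
    """Reverse escape(), scanning the line from left to right."""
    out, i = [], 0
    while i < len(line):
        if line[i] == '\\' and i + 1 < len(line):
            # A backslash escapes the next character; 'n' means newline.
            out.append('\n' if line[i + 1] == 'n' else line[i + 1])
            i += 2
        else:
            out.append(line[i])
            i += 1
    return ''.join(out)


def serialize(events):
    """Write one escaped event per line."""
    return ''.join(escape(e) + '\n' for e in events)


def deserialize(text):
    """Read the events back; the final split item is the empty tail."""
    return [unescape(line) for line in text.split('\n')[:-1]]


events = ['(p', 'Aclass intro', '-Hello,\nworld!', ')p', EOF]
text = serialize(events)
assert deserialize(text) == events
```

Splitting on '\n' only (rather than using `str.splitlines`) is deliberate here, so that other control characters inside a text event are left alone.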
The Atox pipeline then consists simply of objects that support a feed(event) method, and that will send events to the object in their own sink attribute:

    somefilter = SomeFilter()
    somefilter.sink = outputfilterofsomekind
    somefilter.feed('-Hello, world!')

somefilter is here free to do what it wants, and can relay events to outputfilterofsomekind as it pleases. (The sink attribute is the mechanism by which filters are chained.)

The Atox framework core is then a set of such filters, that together make it possible (and hopefully easy) to do most text-to-XML conversion without too much programming. However, it's easy to add new filters, of course. For example, let's say you want to replace all occurrences of '--' with the Unicode emdash code point/character; you could do this by implementing the following filter (unless you wanted to use some built-in search/replace filter):

    class EmdashThingy:
        def feed(self, event):
            # Only text events (type '-') carry text to rewrite.
            if event[0] == '-':
                # The em dash is U+2014, so the code point is given in hex.
                event = '-' + event[1:].replace('--', unichr(0x2014))
            self.sink.feed(event)

(Note that I'm just *assuming* that the sink has been set here. A more reasonable approach might, perhaps, be to check for it, and do nothing if there is no sink.)

Now, about that "fixing" -- wrapping list items... I just happened to come across the FMQ algorithm for fixing sentences based on LL(1) algorithms (Fischer, Milton and Quiring, 1977, 'An efficient insertion-only error-corrector for LL(1) parsers'), after trying to figure out something similar myself (in a rather ad-hoc fashion). The algorithm takes an arbitrary input and, only by inserting a minimum number of tokens, fixes the input so that it conforms to a given LL(1) grammar. Magic! (Only works on a subset of LL(1) grammars, but still...) So, I implemented this (in C and in Python). Later on, I realised that using LL(1) grammars wasn't really playing nice with the user, who had to specify the grammar and make sure it was LL(1) (and insertion-fixable, no less).
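The feed/sink pipeline described earlier in this message can be sketched in a few lines of Python 3. This is only an illustration under assumptions: `BaseFilter`, `Emdash`, `Collector` and `chain` are hypothetical names invented here, not Atox classes, and the sink check follows the "do nothing if there is no sink" idea mentioned above.

```python
class BaseFilter:
    """Hypothetical base class: relay each event to the sink, if any."""
    sink = None

    def emit(self, event):
        # Silently drop the event when no sink has been attached.
        if self.sink is not None:
            self.sink.feed(event)

    def feed(self, event):
        self.emit(event)


class Emdash(BaseFilter):
    """Replace '--' with an em dash (U+2014) in text events only."""

    def feed(self, event):
        if event.startswith('-'):
            event = '-' + event[1:].replace('--', '\u2014')
        self.emit(event)


class Collector(BaseFilter):
    """End of the pipeline: remember every event that arrives."""

    def __init__(self):
        self.events = []

    def feed(self, event):
        self.events.append(event)


def chain(*filters):
    """Connect filters left to right through their sink attributes."""
    for upstream, downstream in zip(filters, filters[1:]):
        upstream.sink = downstream
    return filters[0]


out = Collector()
head = chain(Emdash(), out)
for ev in ['(p', '-wait -- what?', ')p']:
    head.feed(ev)
assert out.events == ['(p', '-wait \u2014 what?', ')p']
```

Because every filter only knows its own sink, a filter can be inserted or removed by changing one attribute, which is presumably what makes external programs and plug-ins easy to slot in.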
Instead, I thought about what I was really trying to describe, and came to the conclusion that I only cared about which elements were allowed inside which others, not in which order the elements were allowed. This led me to another schema concept (of my own "invention"): The containment graph. The nodes of the containment graph are elements, and the (directed) edges represent legal containment relationships. So, for example, 'p' and 'em' may be nodes, and there may be an edge from 'p' to 'em' (because you can have 'em' inside 'p') but not vice versa (in this case). On the other hand, there may be links both ways between 'span' and 'em', for example.

The user doesn't have to think of it as a graph; it just makes it easier to deal with it internally (as you will see in a minute). The user can just think of it as a grammar-like thing without any order constraints. For example, the user can simply specify the adjacency lists:

    schema = {
        'doc': ['p'],
        'p': ['em', 'span'],
        'em': ['span', 'em'],
        'span': ['em', 'span']
    }

The nice thing about this is that I can now use basic graph algorithms instead of somewhat confusing grammar-based algorithms such as the FMQ. For example, the Fixer filter that's included in the current Atox code uses a slightly modified version of the Floyd-Warshall algorithm (allowing some backward-edges at the beginning of the path, representing end-events) to find the minimum set of events (i.e., tags) that must be inserted to fix the input at a certain point. For example, consider the following event sequence:

    '(doc', '(em'

At this point, we know it doesn't conform to the schema. However (ignoring the "backpedaling" part of the algorithm, which doesn't apply here), the shortest path from the 'doc' node to the 'em' node goes through the 'p' node, and thus the Fixer inserts a '(p' event before the '(em' event, and Bob's your uncle.
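The '(doc', '(em' example above can be reproduced with a small Python 3 sketch. Note the substitutions: it uses a plain breadth-first search per lookup instead of the modified Floyd-Warshall algorithm the message mentions, and it leaves out the "backpedaling" step entirely. The names `missing_parents` and `fix` are hypothetical, chosen for this illustration only.

```python
from collections import deque

schema = {
    'doc': ['p'],
    'p': ['em', 'span'],
    'em': ['span', 'em'],
    'span': ['em', 'span'],
}


def missing_parents(schema, parent, child):
    """Return the elements to open between parent and child, or None.

    Breadth-first search finds a shortest path in the containment graph.
    """
    queue = deque([(parent, [])])
    seen = {parent}
    while queue:
        node, path = queue.popleft()
        for nxt in schema.get(node, []):
            if nxt == child:
                return path
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None  # child cannot occur anywhere below parent


def fix(schema, events):
    """Insert start tags (and the matching end tags) where needed."""
    stack, out = [], []
    for ev in events:
        if ev[0] == '(':
            if stack:
                extra = missing_parents(schema, stack[-1], ev[1:])
                for name in extra or []:
                    out.append('(' + name)
                    stack.append(name)
            stack.append(ev[1:])
        elif ev[0] == ')':
            # Close whatever was opened automatically on the way down.
            while stack and stack[-1] != ev[1:]:
                out.append(')' + stack.pop())
            if stack:
                stack.pop()
        out.append(ev)
    return out


fixed = fix(schema, ['(doc', '(em', '-hi', ')em', ')doc'])
assert fixed == ['(doc', '(p', '(em', '-hi', ')em', ')p', ')doc']
```

When no path exists, this sketch simply passes the event through unfixed; a real fixer would presumably close some open elements first and try again.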
(Another nice application of using graphs as schemas is calculating all legal descendants by calculating the transitive closure of the graph. This is also used in the Atox code.)

So, what have we got so far...? We have a pipeline of events and a Fixer filter that can make the event stream conform to a schema. That's hardly enough... The following is the collection of standard filters that I currently see as the core of Atox (and that I have currently implemented):

Reader: Reads partial or possibly ill-formed XML (which can simply be plain text with no XML, a full XML file, or anything in-between) and constructs a PYX stream. Accepts (almost) anything as input.

Document: A filter to put at the end of the pipeline. Accepts any PYX stream and creates a well-formed XML document as output.

Chinker: Splits text into chunks, by using a chink pattern (defaults to one or more empty lines with optional whitespace). Uses a highly complicated (IMO) algorithm to be able to deal with partial mark-up in its input stream. Should work the way you expect it to, though.

Fixer: Inserts start- and end-element events to ensure that an event stream conforms to a given schema (containment graph).

Tagger: Searches for patterns in text events (possibly restricted only to certain elements) and inserts start- or end-element events. (Which patterns to look for changes depending on the current parent element.) Used primarily for inline mark-up.

Trafo: Gathers up the contents of certain elements (as specified) into tiny trees (possibly only containing text, if there was no sub-markup) and uses a user-defined callback to modify this tree, before the tree is again "exploded" back into the PYX stream. Primarily intended to deal with renaming and rewriting blocks based on their text contents.
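To give a feel for what the Chinker in the list above does in the simplest case, here is a Python 3 sketch that splits plain text at blank lines and wraps each chunk in an element. It is an assumption-laden toy: the pattern, the function name `chink` and the default element 'p' are made up for this example, and it ignores the hard part the message points out, namely markup already present in the input.

```python
import re

# A "chink": a newline followed by one or more lines that are empty
# or contain only spaces and tabs.
CHINK = re.compile(r'\n(?:[ \t]*\n)+')


def chink(text, element='p'):
    """Turn plain text into PYX events, one element per chunk."""
    events = []
    for chunk in CHINK.split(text.strip('\n')):
        if chunk.strip():  # skip chunks that are only whitespace
            events += ['(' + element, '-' + chunk, ')' + element]
    return events


events = chink('One.\n\n \nTwo,\nstill two.\n')
assert events == ['(p', '-One.', ')p', '(p', '-Two,\nstill two.', ')p']
```

A single newline inside a chunk is kept as part of the text event, so only genuinely empty lines separate paragraphs.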
In addition there are utility classes, such as Filter (and Dispatcher) and Normalizer for dealing with low-level handling of the PYX stream, as well as SearchCore, which encapsulates basic search functionality with a simple pattern language (a subset of the Python re module functionality), to make it easier to implement similar functionality in other languages (if, for example, one wants to create a highly optimized version in C). This SearchCore is used by both Chinker and Tagger at the moment.

One important feature that is implicit in the description above is that you can mix your custom markup with (partial, potentially ill-formed, if you like) XML. This is partly a consequence of the fact that every filter should accept a more or less arbitrary PYX stream, and partly a design choice: By allowing explicit markup into the "implicit/invisible" plain-text markup that Atox was designed to handle, you can get the best of both worlds. For example, you could go almost to the XML extreme, using, say, XHTML but just dropping the <p> tags (letting Atox take care of those). Or, you could go in the other direction, and use a plain-text Wiki-like markup, but use a <code> tag for your code listings, to avoid having the chinker foul them up by splitting them at empty lines. You could do this in other ways too, but simply using XML may be the easiest in many cases.

And, because such filters as Chinker are markup-savvy, you can go from something like

    This is a paragraph. <code>This is a listing</code> <em>Some</em> more text.

to...

    <p>This is a paragraph.</p>
    <code>This is a listing</code>
    <p><em>Some</em> more text.</p>

Because the Chinker knows (or -- has to be told ;) the legal containment relationships, it knows that it should not do something like this...

    <p>This is a paragraph.</p>
    <p><code>This is a listing</code> <em>Some</em> more text.</p>

There are, of course, many other ways in which this could go wrong.
I think the current approach will do the Right Thing (tm) in most (if not all) cases. There is quite a bit left to be done, certainly (for example, a more user-friendly interface, more default functionality rather than only the basic, generic filters, etc.), but there is at least a core that works rather well, now, and that takes care of some pretty hefty tasks.

As for practical matters... I currently use two external libraries: PyXML (for the sgmlop module; I use this rather than effbot's current, because the PyXML sgmlop module can deal with Unicode input) and the elementtree package (or cElementTree, if available). There are still too many rough edges to release it yet, I think, but I'll gladly send the code to anyone who wants a sneak peek.

As before, there's no telling what the rate of progress will be, but at least I'm quite hopeful that there will be a release ;)

--
Magnus Lie Hetland        Fall seven times, stand up eight
http://hetland.org        [Japanese proverb]
From: Magnus L. H. <ma...@he...> - 2005-05-31 17:57:00
Hi!

I just put a notice on the Web site on the current status of things, in case anyone is interested. In short: I have been working steadfastly on a new version for quite some time (another complete rewrite, completely new architecture that allows pluggability and interoperability with partial XML documents, among other things), and it seems to be coming together. There are lots of pieces to the puzzle, though, and I don't expect to finish a complete release for quite some time. (I haven't even checked it into SourceForge CVS yet.) If you want to have a look at the code, just email me.

--
Magnus Lie Hetland        Fall seven times, stand up eight
http://hetland.org        [Japanese proverb]
From: Calixto J. P. F. <cal...@te...> - 2005-05-27 18:13:09
Hi again:

I happily got the problem solved. You were right, Magnus. atox was looking for a new-line character that wasn't there. I added Java code to put a '\n' character at the end of the file before passing it to atox, and everything works fine now.

Thanks a lot!

Greetings
Calixto.
From: Magnus L. H. <ma...@he...> - 2005-05-27 17:17:45
Calixto J. Porras Fortes <cal...@te...>:
>
> Hi all,
>
> I'm trying to process with atox a very simple plain-text file. It is
> the output of an XSLT process,

So you actually have an XML source to begin with? Wouldn't it, perhaps, be useful to just write another XSLT stylesheet to create the XML you want, instead of going via plain text?

> and its content is the following:
>
> GW_NAME = gwRDSIH323 XXX
> NUM_CHANNELS = 2 XXX
>
> If I invoke atox processing on it, I get the following error message:
>
> Error in input text (line 2, col 1): pattern u'\\n' not found
>
> The odd thing is that if I edit the file, for example using vi, add a
> new line (at the beginning or between the two lines), and run atox
> again, everything goes smoothly. The XSLT processing is done in Java,
> and I've already tried adding the new line to the resulting file by
> code, but it does not work.

Well, Atox *is* looking for a newline that isn't there, it seems; you terminate your fields with <ax:del>\n</ax:del>, and if a field ends with EOF, then you'll get the error above. Without having tested it, I'd guess that using <ax:del>\n|\Z</ax:del> could help (\Z is the "end of string" pattern).

If this doesn't work, I can perhaps take a whack at it myself. If you could send me the input file you're having trouble with as an attachment (to make sure I get the exact contents) that would help.

--
Magnus Lie Hetland        Fall seven times, stand up eight
http://hetland.org        [Japanese proverb]
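The difference between the two terminator patterns discussed in this reply can be checked with Python's standard re module alone. The sketch below is Python 3 and does not use Atox; `read_field` is a hypothetical helper written for this illustration, and how Atox applies the pattern internally is not shown in the thread.

```python
import re


def read_field(text, start, terminator):
    """Return the text from start up to the terminator, or None."""
    match = re.compile(terminator).search(text, start)
    return None if match is None else text[start:match.start()]


# The sample from the report, with no newline after the last line.
data = 'GW_NAME = gwRDSIH323 XXX\nNUM_CHANNELS = 2 XXX'
second = data.index('NUM_CHANNELS')

first = read_field(data, 0, r'\n')           # ends at the newline
broken = read_field(data, second, r'\n')     # no newline before EOF
fixed = read_field(data, second, r'\n|\Z')   # \Z matches at end of string

assert first == 'GW_NAME = gwRDSIH323 XXX'
assert broken is None
assert fixed == 'NUM_CHANNELS = 2 XXX'
```

The failed search on the second line corresponds to the "pattern u'\\n' not found" error at line 2, col 1 in the report.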
From: Calixto J. P. F. <cal...@te...> - 2005-05-27 16:56:46
Hi all,

I'm trying to process a very simple plain-text file with atox. It is the output of an XSLT process, and its content is the following:

GW_NAME = gwRDSIH323 XXX
NUM_CHANNELS = 2 XXX

If I invoke atox processing on it, I get the following error message:

Error in input text (line 2, col 1): pattern u'\\n' not found

The odd thing is that if I edit the file, for example using vi, add a new line (at the beginning or between the two lines), and run atox again, everything goes smoothly. The XSLT processing is done in Java, and I've already tried adding the new line to the resulting file by code, but it does not work.

The markup file I'm using is the following:

----------------------------------
<?xml version="1.0"?>
<ax:format xmlns:ax="http://hetland.org/atox" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <fieldValue>
    <ax:alt maxOccur="inf">
      <ax:match name="comentarios"/>
      <field>
        <name>GW_NAME</name>
        <value> <ax:del>\s*=\s*</ax:del> <ax:del>\n</ax:del> </value>
      </field>
      <field>
        <name>NUM_CHANNELS</name>
        <value> <ax:del>\s*=\s*</ax:del> <ax:del>\n</ax:del> </value>
      </field>
      <field>
        <name>MSN_0</name>
        <value> <ax:del>\s*=\s*</ax:del> <ax:del>\n</ax:del> </value>
      </field>
      <field>
        <name>MSN_1</name>
        <value> <ax:del>\s*=\s*</ax:del> <ax:del>\n</ax:del> </value>
      </field>
      <field>
        <name>ADN_MSN_0</name>
        <value> <ax:del>\s*=\s*</ax:del> <ax:del>\n</ax:del> </value>
      </field>
      <field>
        <name>ADN_MSN_1</name>
        <value> <ax:del>\s*=\s*</ax:del> <ax:del>\n</ax:del> </value>
      </field>
      <field>
        <name>DEFAULT_NUMBER</name>
        <value> <ax:del>\s*=\s*</ax:del> <ax:del>\n</ax:del> </value>
      </field>
      <field>
        <name>NUMBER2ALIAS</name>
        <value> <ax:del>\s*=\s*</ax:del> <ax:del>\n</ax:del> </value>
        <value maxOccur="inf"> <ax:del>NUMBER2ALIAS\s*=\s*</ax:del> <ax:del>\n</ax:del> </value>
      </field>
      <field>
        <name>DEFAULT_TERMINAL</name>
        <value> <ax:del>\s*=\s*</ax:del> <ax:del>\n</ax:del> </value>
      </field>
      <field>
        <name>LISTEN_PORT</name>
        <value> <ax:del>\s*=\s*</ax:del> <ax:del>\n</ax:del> </value>
      </field>
      <field>
        <name>G711_ALAW_CODEC<ax:del>\s*=\s*</ax:del></name>
        <value> <ax:alt> <ax:del>n\s*\n</ax:del> y </ax:alt> </value>
      </field>
      <field>
        <name>G711_ULAW_CODEC<ax:del>\s*=\s*</ax:del></name>
        <value> <ax:alt> <ax:del>n\s*\n</ax:del> y </ax:alt> </value>
      </field>
      <field>
        <name>GSM_CODEC<ax:del>\s*=\s*</ax:del></name>
        <value> <ax:alt> <ax:del>n\s*\n</ax:del> y </ax:alt> </value>
      </field>
      <field>
        <name>LPC_CODEC<ax:del>\s*=\s*</ax:del></name>
        <value> <ax:alt> <ax:del>n\s*\n</ax:del> y </ax:alt> </value>
      </field>
      <field>
        <name>USE_MONITOR<ax:del>\s*=\s*</ax:del></name>
        <value> <ax:alt> <ax:del>n\s*\n</ax:del> y </ax:alt> </value>
      </field>
      <field>
        <name>MONITOR_FILE</name>
        <value> <ax:del>\s*=\s*</ax:del> <ax:del>\n</ax:del> </value>
      </field>
      <field>
        <name>TRACE_LEVEL</name>
        <value> <ax:del>\s*=\s*</ax:del> <ax:del>\n</ax:del> </value>
      </field>
      <field>
        <name>TRACE_FILE</name>
        <value> <ax:del>\s*=\s*</ax:del> <ax:del>\n</ax:del> </value>
      </field>
      <field>
        <name>TIMEOUT_ANSWER</name>
        <value> <ax:del>\s*=\s*</ax:del> <ax:del>\n</ax:del> </value>
      </field>
      <field>
        <name>JITTER</name>
        <value> <ax:del>\s*=\s*</ax:del> <ax:del>\n</ax:del> </value>
      </field>
      <field>
        <name>USE_GATEKEEPER<ax:del>\s*=\s*</ax:del></name>
        <value> <ax:alt> <ax:del>n\s*\n</ax:del> y </ax:alt> </value>
      </field>
      <field>
        <name>GATEKEEPER_ADDR</name>
        <value> <ax:del>\s*=\s*</ax:del> <ax:del>\n</ax:del> </value>
        <value maxOccur="inf"> <ax:del>GATEKEEPER_ADDR\s*=\s*</ax:del> <ax:del>\n</ax:del> </value>
      </field>
      <field>
        <name>PREFIX</name>
        <value> <ax:del>\s*=\s*</ax:del> <ax:del>\n</ax:del> </value>
        <value maxOccur="inf"> <ax:del>PREFIX\s*=\s*</ax:del> <ax:del>\n</ax:del> </value>
      </field>
      <field>
        <name>CALLBACK</name>
        <value> <ax:del>\s*=\s*</ax:del> <ax:del>\n</ax:del> </value>
      </field>
    </ax:alt>
  </fieldValue>
  <ax:def name="comentarios">
    <ax:del maxOccur="inf" greedy="true">#.*</ax:del>
  </ax:def>
</ax:format>
---------------------------------------

Thanks for the help

Best Regards,
Calixto
From: <df...@ms...> - 2005-03-01 03:19:13
China Antique Carpet Collection, a famous carpet company, deals in oriental tribal & temple antique carpets collected directly from origin. These long-history carpets are hand-made and natural dyeing which have genuine gentilitial style. Some of them are unique on the world.

● Offer antique carpets from Ningxia, Tibet, Inner Mongolia, Xinjiang and Beijing. We have a great quantity antique rugs. Wholesale service is acceptable.
● Offer handmade antique finished carpets, semi-antique carpets and decorative carpets, Customization and wholesale are acceptable.
● Offer expert appraisal service. Professional Carpet clean and repair
● Offer modern hand knotted, hand weave and hand tufted carpets in different qualities and designs, both stock and order is acceptable.
● Offer China arts and crafts.

All major credit cards are accepted
Operation Hours: 9:30am-5:30pm

China Antique Carpet Collection
2-401 Building 6, 2Qu, Ruihaijiayuan Xihongmen, Daxing District, Beijing 100076 China
Tel: +86-10-60249766
Mobile: +86-13601329012 1365115189
Fax: +86-10-66061588
www.china-antique-carpet.com
sa...@ch...
chi...@ya...
From: Magnus L. H. <ma...@he...> - 2004-06-16 21:26:53
I've managed to get the basic flex/bison mechanisms working, but I realise that making a full Atox version using this stuff is going to be quite an undertaking. It may be that I'll do it some day, but if I launch into it now, I suspect I'll just end up running out of steam. So, instead, I've uploaded the basic code for this experimental mechanism on the Atox Web site [1], so people can look at it if they're interested. (It is, in fact, functional enough that you can use it to convert text into XML -- and once you've created the compiled parser, it's *fast*!)

As for the standard release, I think 0.5 is, in fact, relatively complete in terms of the main functionality I wanted for Atox. It is, perhaps, not fast enough if you have complex formats, but without a solution such as the flex/bison implementation, I guess that's a price one has to pay for the generality Atox embodies. So, I won't be slapping on any new format language (such as Relax NG or the like) or other backward incompatible changes for now. In fact, there probably won't be any new major releases for quite some time (unless I come up with something Really Great(tm) ;) If there are bug reports (preferably with patches) there will (most likely) be bug fix releases, though.

(The version in CVS has some minor changes; the default input encoding is changed from 'iso8869-1' to 'iso-8859-1' to appease xsltproc, and the config variable tab_size has been added.)

[1] http://atox.sourceforge.net/atox-proto-20040616.tgz

--
Magnus Lie Hetland        "Wake up!"  - Rage Against The Machine
http://hetland.org        "Shut up!"  - Linkin Park
From: Magnus L. H. <ma...@he...> - 2004-06-07 23:52:35
Thomas Kalka <th...@co...>:
>
> Hi Magnus,
>
> I did not understand which question you would like to have
> commented.
>
> Is it: should I change the front-end before I know what the back-end
> will look like?

Just general comments on whether the use of something like Relax NG would be a good idea. Basically, the question is (I think): "If a future version of Atox is necessarily backward-incompatible, would it be a good idea to drop the current format completely and change to a more standard one, such as Relax NG?"

The most conservative alternative would, most likely, be to use the current format but with a different regexp language. It would still be backward incompatible -- if such a change is made at all. (Maybe I'll just implement it and use it myself, without releasing it ;)

> Is there a way to transform RELAX NG XML to the current atox XML
> format?

That would only be a matter of writing an XSLT style sheet, I guess. That should work the other way (from the current Atox format to Relax NG) as well. So that sort of compatibility ought to be possible -- but the regexp language would still be incompatible if Flex (or something similar) is to be used (for performance reasons). Including such XSLT transformations in the distribution wouldn't be a problem, I guess. (At least I should think this to be a simple matter -- I'm not 100% sure.)

The two issues are quite separate, though (the format language and the back-end/parser implementation). Both decisions/implementation will eventually be taken in concert, though, I guess.

Just for the record (again): Atox is not in any way "frozen" yet -- until it is (if ever) backward incompatibilities may well occur. I've tried to warn about that (among other places, on the Atox home page). The main reason for this is that I haven't quite found a form I'm satisfied with (both regarding the format and the implementation).
I guess to make this even clearer I could develop the current version to a 1.0 release and push the experimental stuff to a 2.0 release (or somesuch), and keep a bug fix branch open for 1.0...

--
Magnus Lie Hetland        "Wake up!"  - Rage Against The Machine
http://hetland.org        "Shut up!"  - Linkin Park
From: Thomas K. <th...@co...> - 2004-06-07 23:33:27
Hi Magnus,

I did not understand which question you would like to have commented.

Is it: should I change the front-end before I know what the back-end will look like?

Is there a way to transform RELAX NG XML to the current atox XML format?

--Thomas
From: Magnus L. H. <ma...@he...> - 2004-06-07 23:16:32
One of the technical things I'm pondering at the moment is the automatic placement of "fill" tokens in converting an Atox format to a (sort of) context-free grammar (as I've mentioned, I've been able to implement the fill behavior in itself, using the "follow-set" formalism of LL(1) parsers). The pause in development caused by this pondering has also let me reconsider the Atox format.

I've thought about several options; now, as in the past, I've basically thought about two different approaches:

  1. The current grammar-like sort of format
  2. An XSLT-like template format

The second type appeals to me because it would probably be quite easy to use, but I think it might be hard to implement if the underlying mechanism is some sort of standard parser (which is what I'm planning now). It could be that XSLT-like templates could be used as a grammar extension mechanism -- who knows.

The current kind of format is basically a relative of the kind of grammars you see in formal language theory -- and in XML schema languages...

The latter is an interesting point. I've previously been considering making the Atox format an extension of the DTD format or the W3C XML Schema language, but I've found them to be quite complex (and I'm sure there were several other reasons why I decided against this).

However: I just discovered *Relax NG* (http://www.relaxng.org). For those who don't know it, it's another XML schema language developed by *OASIS* (http://www.oasis-open.org). It seems to have pretty solid backing and well-known people (such as James Clark, who has even written an emacs mode that can perform Relax NG validation of a document natively) behind the format itself. The reason I'm excited by it is covered by its first two key features (as listed on relaxng.org): It's simple and it's easy to learn. Also, it "can partner with a separate datatyping language".

So: I'm thinking of returning to my idea of basing the Atox format on a schema language, namely Relax NG.
It is even possible that Relax NG itself (or a subset thereof) can be used, with a custom separate datatyping language for specifying text patterns (document structure is fully covered by Relax NG). A nifty thing about Relax NG is that it also has an alternative compact non-XML syntax, which is another thing I've contemplated (and implemented, in some of the prototypes) for Atox.

As a quick illustration, here is an example I've snipped from the two tutorials on the Relax NG site:

  <addressBook>
    <card>
      <name>John Smith</name>
      <email>js...@ex...</email>
    </card>
    <card>
      <name>Fred Bloggs</name>
      <email>fb...@ex...</email>
    </card>
  </addressBook>

This would normally be the input for a Relax NG processor -- and the output of Atox. Here is a Relax NG pattern for this, using the XML syntax:

  <element name="addressBook" xmlns="http://relaxng.org/ns/structure/1.0">
    <zeroOrMore>
      <element name="card">
        <element name="name">
          <text/>
        </element>
        <element name="email">
          <text/>
        </element>
      </element>
    </zeroOrMore>
  </element>

In a simple approach, all that would be needed is the addition of some form of pattern (regular expression) to each of the <text/> tags. Now here is the same pattern using the compact syntax:

  element addressBook {
    element card {
      element name { text },
      element email { text }
    }*
  }

I've really wanted to do this sort of thing for Atox, but I thought that going with a standard like XML for the format language was a Good Idea(tm). But since Relax NG is such a solid standard anyway, and it has an alternative XML syntax, I think this is an excellent candidate.

To get beyond the simplest block-level structures, we need some form of patterns (most likely regular expressions). How can these be integrated with Relax NG? To quote Sect. 5 from the tutorial on the compact format:

  RELAX NG allows patterns to reference externally-defined datatypes. RELAX NG implementations may differ in what datatypes they support. You can only use datatypes that are supported by the implementation you plan to use. The most commonly used datatypes are those defined by [W3C XML Schema Datatypes].

So: Atox could be a Relax NG implementation that supports application-specific datatypes. Again, quoting:

  A pattern consisting of a name qualified with a prefix matches a string that represents a value of a named datatype. The prefix identifies the library of datatypes being used and the rest of the name specifies the name of the datatype in that library. The prefix xsd identifies the datatype library defined by [W3C XML Schema Datatypes].

Further, datatypes may have parameters. An example using the standard W3C XML Schema datatype 'string' with minLength and maxLength:

  element email {
    xsd:string { minLength = "6" maxLength = "127" }
  }

This sort of construct makes it easy to conjure up a specific regexp-matching type for Atox (in fact, there are patterns in the W3C XML Schema datatypes -- I don't think they're entirely applicable here, but I'll have to look into it):

  element email {
    ax:string { pat = "[a-z]@([a-z]\.)+[a-z]+" }
  }

Or something like that. To define the datatype prefix 'ax' one would add something like

  datatypes ax = "http://hetland.org/atox"

to the beginning of the file.

I'm not 100% sure about the details of the datatype format, but since it's only a small part of the language, it should be possible to give it quite a lot of thought and still come to a conclusion. (As for parsing the compact format, there is a full grammar for it at the Relax NG web site, so that's not a problem.)

This will, of course, mean full incompatibility with previous versions, but that might have happened anyway, when transitioning to flex/bison (or something similar). By using something thoroughly standardized, there is some hope that future backward incompatibilities may be avoided.

So -- if anyone has actually read this far <wink>: Input would be appreciated.

--
Magnus Lie Hetland        "Wake up!"  - Rage Against The Machine
http://hetland.org        "Shut up!"  - Linkin Park
From: Magnus L. H. <ma...@he...> - 2004-05-31 21:45:05
As I've mentioned, I've been experimenting a bit behind the scenes lately, with a prototype (that I've so far called Dtox, but which will probably become the next major version of Atox) that uses Flex and Bison to do the parsing. I've been uncertain about whether I could make this new architecture work (the main motivation being increased speed) but now I have the core functionality in place, and it's *fast*. *Much* faster than the current Atox.

There is still quite a bit of wrapping and polishing to do before I'll even release it as a prototype (or perhaps an alpha), but at least things look promising.

If this new version turns out to be successful, it may be the first step toward stabilizing Atox, and freezing the format language and feature set. (Quite a bit left before we get there, though...)

- M

--
Magnus Lie Hetland        "Wake up!"  - Rage Against The Machine
http://hetland.org        "Shut up!"  - Linkin Park
From: <ben...@id...> - 2004-05-25 08:40:47
Dear Open Source developer,

I am doing a research project on "Fun and Software Development" in which I kindly invite you to participate. You will find the online survey under http://fasd.ethz.ch/qsf/. The questionnaire consists of 53 questions and you will need about 15 minutes to complete it.

With the FASD project (Fun and Software Development) we want to define the motivational significance of fun when software developers decide to engage in Open Source projects. What is special about our research project is that a similar survey is planned with software developers in commercial firms. This procedure allows the immediate comparison between the involved individuals and the conditions of production of these two development models. Thus we hope to obtain substantial new insights to the phenomenon of Open Source Development.

With many thanks for your participation,

Benno Luthiger

PS: The results of the survey will be published under http://www.isu.unizh.ch/fuehrung/blprojects/FASD/. We have set up the mailing list fa...@we... for this study. Please see http://fasd.ethz.ch/qsf/mailinglist_en.html for registration to this mailing list.

_______________________________________________________________________
Benno Luthiger
Swiss Federal Institute of Technology Zurich
8092 Zurich
Mail: benno.luthiger(at)id.ethz.ch
_______________________________________________________________________
From: Magnus L. H. <ma...@he...> - 2004-05-20 18:01:05
|
Hi! I'm looking into some alternative implementations (I've been discussing the use of Bison heavily on the help-bison list, for example), and I've been playing around with some prototype versions that automatically compile/run Flex and Bison parsers. I'm not sure whether I'll use that -- it's fast, but not entirely clean, and I'm not even sure if I can make it 100% correct. I'm considering introducing a leveled approach, where the block structure is dealt with separately from the inline tags/markup. This might make it harder to parse text that isn't block-oriented (i.e. where it isn't easy to separate blocks like paragraphs, headers, or list items from each other) but I think it may make the system more intuitive, more robust, and (hopefully) faster. I can't guarantee that I'll figure out a way to do this that is better than the current, but I guess I'll probably spend quite a bit more time mulling over the issue. Also, this may mean that future versions aren't backward-compatible -- Atox is still experimental software, after all (with a very small user base, I think ;) Just thought I'd mention it, though. - M -- Magnus Lie Hetland "Wake up!" - Rage Against The Machine http://hetland.org "Shut up!" - Linkin Park |
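The leveled approach described in the message above (block structure first, inline markup second) can be illustrated with a small sketch. This is not Atox code and not its API: the function names, the output tags (h, p, code, em) and the block rules (a heading is a line underlined with = or -, a code block is indented by four spaces) are all assumptions made up for the illustration.

```python
import re

def split_blocks(text):
    """Level 1: split the text into blocks separated by blank lines."""
    parts = re.split(r"\n\s*\n", text.strip("\n"))
    return [p.strip("\n") for p in parts if p.strip()]

def classify(block):
    """Guess the block type from its shape alone (hypothetical rules)."""
    lines = block.split("\n")
    # Assumed heading rule: a title line underlined with '=' or '-'.
    if len(lines) == 2 and lines[1] and set(lines[1]) <= set("=-"):
        return "heading"
    # Assumed code rule: every line is indented by four spaces.
    if all(line.startswith("    ") for line in lines):
        return "code"
    return "para"

def inline(text):
    """Level 2: mark up *emphasis* inside a single block."""
    return re.sub(r"\*([^*\n]+)\*", r"<em>\1</em>", text)

def convert(text):
    """Run both levels; inline markup is never applied to code blocks."""
    out = []
    for block in split_blocks(text):
        kind = classify(block)
        if kind == "heading":
            out.append("<h>%s</h>" % inline(block.split("\n")[0]))
        elif kind == "code":
            out.append("<code>%s</code>" % block)
        else:
            out.append("<p>%s</p>" % inline(block))
    return out
```

Because each block is classified once and inline patterns are only tried inside blocks that can contain them, nothing has to be searched for across the whole document. The sketch does no XML escaping, and it shares the drawback mentioned in the message: it only works for text where blocks are easy to tell apart.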
From: Magnus L. H. <ma...@he...> - 2004-04-27 20:02:37
|
I'm investigating the possibility of having Atox create a custom parser behind the scenes, using flex/bison. It might mean some slight restrictions in the format language (although the core functionality shouldn't suffer, I think) -- but I suspect enormous gains in performance. (I've just been playing around with flex for a while, and... Its speed is quite stunning now that I've been working on Atox for a while :) -- Magnus Lie Hetland "Wake up!" - Rage Against The Machine http://hetland.org "Shut up!" - Linkin Park |
From: Magnus L. H. <ma...@he...> - 2004-04-26 10:42:56
|
Gregory Serpolet <GRE...@BU...>: > > hi, > > thanks for your answers and precisions. I have already made some > tests with XSLT stylesheets. It works pretty well and it seems > that it could cover all my needs. Excellent! I've been trying to find ways of speeding up Atox, and I don't think it's going to be easy to gain a lot of speed without modifying the functionality somewhat (I may do that in release 2.0 or something -- who knows :). So if you need speed, I'd recommend trying to keep the number of elements and regular expressions to a minimum, and do as much of the necessary work as possible in XSLT (which is very fast). If you find that you have requirements that are not covered, please tell me, and I'll see if something can be done about it. - M -- Magnus Lie Hetland "Wake up!" - Rage Against The Machine http://hetland.org "Shut up!" - Linkin Park |
From: Magnus L. H. <ma...@he...> - 2004-04-23 22:54:50
|
I just released 0.5. :) -- Magnus Lie Hetland "Wake up!" - Rage Against The Machine http://hetland.org "Shut up!" - Linkin Park |
From: Magnus L. H. <ma...@he...> - 2004-04-23 15:28:23
|
Gregory Serpolet <GRE...@BU...>: > > Hello to all atox users, Hi! Not that many of us, yet ;) > I'm trying to parse an Ascii file into an XML one. My XML output > format has already been defined. In that case, Atox seems like a possible solution. (I should warn you that the current implementation is a bit slow if your format is reasonably complex and your files are sort of big... I *hope* to improve this in the future, but it might mean recoding large chunks in C or the like, so don't count on it :) > I whish to insert some contents of the ascii file into some specific > XML tag. Right. > The problem is that some of the tags are inserted at the beginning of > the text and the ascii file's structure doesn't follow the same > decomposition. I see. It seems like XSLT would be useful here. If you check out the upcoming 0.5 release (a working version with documentation can be found in CVS) you'll see that XSLT fragments can now be used inside Atox format files. [snip] > I'm just discovering you tool "atox" and i see that 's a top-down a > parser. Yes, the Atox parsing itself is mainly suited for making the structure that exists in the file explicit. However, once you've done that, XSLT can do almost anything. Atox has been designed with this in mind -- anything you can easily do in XSLT is not a priority in Atox. But, as I said, in 0.5 you can now put XSLT templates inside your format file, to keep everything in one file and make sure your Atox output is correct (i.e. you won't have to go through a semi-correct format before using XSLT to "fix it"). You can, of course, also use XSLT afterward to create various outputs from your XML format (e.g. an XHTML representation or the like). > Is there some subtle solutions which could perform this kind of > parsing. OK, I'm not sure I understand your file format 100% (e.g., are there more than one bug report in the ASCII file?) 
but I'll just give an example of how you can reorder stuff (using a simplified version of your format). I can try to work with your exact format if you give me some details :)

So, let's assume the ASCII file only contains the following (without the indent):

  Bug Report: blablabla
  Machine ID: 20234165
  Date: 31616561

Here is a possible format file (without the ax:format stuff):

  <Collect>
    <Bug>
      <ax:del>Bug Report:\s+</ax:del>
      (?=\n)
    </Bug>
    <MachineID>
      <ax:del>Machine ID:\s+</ax:del>
      (?=\n)
    </MachineID>
    <Date>
      <ax:del>Date:\s+</ax:del>
      (?=\n|\Z)
    </Date>
  </Collect>

  <xsl:template match="Collect">
    <Collect>
      <xsl:apply-templates select="MachineID"/>
      <xsl:apply-templates select="Date"/>
      <xsl:apply-templates select="Bug"/>
    </Collect>
  </xsl:template>

Note that my XSLT here probably isn't ideal -- it doesn't process the whitespace in the Collect element, so all three child-elements will be put on a single line -- but it demonstrates how this can be done, at least. (To use xsl templates in Atox, you'll have to add xmlns:xsl="http://www.w3.org/1999/XSL/Transform" to the ax:format tag, alongside the xmlns:ax declaration.)

Feel free to ask if you need clarifications, or if you need any features that aren't present.

-- Magnus Lie Hetland "Oppression and harassment is a small price to pay http://hetland.org to live in the land of the free." -- C. M. Burns |
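As an aside to the reordering example in the message above, the same idea (pick the fields out of the text, then emit them in a different order) can be sketched in plain Python with only the standard library. This is not how Atox works internally and it uses no Atox API. The names SAMPLE, FIELDS, OUTPUT_ORDER and collect are hypothetical, and the regular expression only approximates the ax:del and lookahead patterns of the format file.

```python
import re
import xml.etree.ElementTree as ET

# The simplified ASCII input from the message above.
SAMPLE = "Bug Report: blablabla\nMachine ID: 20234165\nDate: 31616561"

# (element name, label in the ASCII file), in input order.
FIELDS = [("Bug", "Bug Report"), ("MachineID", "Machine ID"), ("Date", "Date")]

# Desired child order in the output, as in the xsl:template.
OUTPUT_ORDER = ["MachineID", "Date", "Bug"]

def collect(text):
    """Extract the labeled fields and return a reordered <Collect> element."""
    values = {}
    for name, label in FIELDS:
        # Skip the label and following whitespace (the role ax:del plays),
        # then capture up to the end of the line or of the input
        # (the role of the (?=\n|\Z) lookahead).
        pattern = r"^%s:\s+(.*?)(?=\n|\Z)" % re.escape(label)
        match = re.search(pattern, text, re.M)
        if match:
            values[name] = match.group(1)
    root = ET.Element("Collect")
    for name in OUTPUT_ORDER:
        if name in values:
            ET.SubElement(root, name).text = values[name]
    return ET.tostring(root, encoding="unicode")

print(collect(SAMPLE))
```

Like the XSLT in the message, this puts all three child elements on a single line, since no whitespace is added between them.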
From: Gregory S. <GRE...@BU...> - 2004-04-23 12:57:01
|
Hello to all atox users,

I'm trying to parse an ASCII file into an XML one. My XML output format has already been defined. I wish to insert some contents of the ASCII file into some specific XML tags. The problem is that some of the tags are inserted at the beginning of the text and the ASCII file's structure doesn't follow the same decomposition.

Example:

ASCII file:

  ...
  Bug Report: blablabla...
  ...
  ...
  machine ID: 20234165
  Date : 31616561
  ...
  ...

XML file:

  <Collect>
  <MachineID>20234165</MachineID>
  <Date>31616561</Date>
  ...
  ...
  <Bug>blablabla</Bug>
  ...
  ...

I'm just discovering your tool "atox" and I see that it's a top-down parser. Is there some subtle solution that could perform this kind of parsing?

Best regards.

PS: I'm French, excuse me for my basic English. |
From: Magnus L. H. <ma...@he...> - 2004-04-18 20:21:06
|
I've now added support for non-greedy repetition. It required some semi-mind-boggling refactorings, but it seems to work well (and it is quite useful for some formats, such as the new DocBook format file I'm working on for the documentation :) - M -- Magnus Lie Hetland "Oppression and harassment is a small price to pay http://hetland.org to live in the land of the free." -- C. M. Burns |
From: Magnus L. H. <ma...@he...> - 2004-04-17 16:26:40
|
I've now added functionality to allow XSLT templates (mainly for small transformations that are independent of final output format) to be embedded in Atox format files, and for them to be executed "behind the scenes". This adds the ability to create attributes and namespaces in the output, among (many!) other things. This is useful if there are specific demands on the interchange format (i.e. the immediate XML output of Atox), for example, if you want to use something like DocBook. It can then be better to just add a simple template or two to the format file than to deal with another file in addition (which would mean that the format file wouldn't really be valid on its own). -- Magnus Lie Hetland "Oppression and harassment is a small price to pay http://hetland.org to live in the land of the free." -- C. M. Burns |
From: Magnus L. H. <ma...@he...> - 2004-04-16 20:33:42
|
Paul DuBois <pa...@ki...>: > > >Paul DuBois <pa...@ki...>: > >> > >> For atox 0.4, python setup.py install installs the main atox program in > >> /System/Library/Frameworks/Python.framework/Versions/2.3/bin. > > > >Yeah, I guess this is a consequence of how distutils (the installation > >software used) works. By putting it in the same bin-directory as > >Python, you're pretty much safe on all platforms, including (e.g.) > >Windows. > > I just had a further look at the setup here. python is indeed installed > in that /System...../bin directory, but there is also a symlink to it > in /usr/bin. I presume this is because no one would really add the > /System.../bin directory to their PATH. :-) Interesting. This ought to be reported as a distutils bug, I suppose... I mean -- if distutils places the scripts in a directory where they aren't on the user's PATH; not very useful :-) -- Magnus Lie Hetland "Oppression and harassment is a small price to pay http://hetland.org to live in the land of the free." -- C. M. Burns |
From: Magnus L. H. <ma...@he...> - 2004-04-16 18:04:51
|
Paul DuBois <pa...@ki...>: > > For atox 0.4, python setup.py install installs the main atox program in > /System/Library/Frameworks/Python.framework/Versions/2.3/bin. Yeah, I guess this is a consequence of how distutils (the installation software used) works. By putting it in the same bin-directory as Python, you're pretty much safe on all platforms, including (e.g.) Windows. > Any way to make it install in a more standard location like > /usr/local/bin? Sure, just use

  python setup.py install --install-scripts=/usr/local/bin

I'll add this (plus the mention of the --help option) to the README file. Thanks for mentioning this. -- Magnus Lie Hetland "Oppression and harassment is a small price to pay http://hetland.org to live in the land of the free." -- C. M. Burns |
From: Magnus L. H. <ma...@he...> - 2004-04-14 20:46:12
|
These are the changes I've made:

- Made the error handling slightly more user-friendly.
- Added some basic improvements to the command-line interface (the '-e', '-f' and '-o' switches, as well as the ability to use multiple input files or standard input). Note that the new calling convention is incompatible with the previous version, in that the format file is no longer supplied as an argument.
- Normalized newline-handling.
- Added the utility tags 'ax:block', 'ax:sob' (start-of-block) and 'ax:eob' (end-of-block).
- Fixed an important bug in the indentation code, which affected 'ax:indented'.
- Made empty sequences legal.
- Added support for config files.

-- Magnus Lie Hetland "Oppression and harassment is a small price to pay http://hetland.org to live in the land of the free." -- C. M. Burns |
From: Magnus L. H. <ma...@he...> - 2004-04-10 22:50:25
|
I just fixed an important (and incredibly stupid) bug in Indentation.py, which is related to dedents with a given level (as used in ax:indented). So -- ax:indented doesn't work as it should in 0.3. There will be a 0.3.1 soon. -- Magnus Lie Hetland "Oppression and harassment is a small price to pay http://hetland.org to live in the land of the free." -- C. M. Burns |
From: Magnus L. H. <ma...@he...> - 2004-04-09 19:33:44
|
Atox 0.3 is out, now with support for paired indent/dedent and for backtracking, as well as an improved glue mechanism (the glue attribute, which replaces the glued attribute -- sorry about the slight backward incompatibility; just use glue="" instead of glued="true"). Several new examples in the demo directory. -- Magnus Lie Hetland "Oppression and harassment is a small price to pay http://hetland.org to live in the land of the free." -- C. M. Burns |