Safe Haskell: Safe
Data.IterIO
Description
This is the main module to import for the IterIO package. It re-exports several other modules and mostly consists of documentation--first a high-level overview of the iteratee model, then a more detailed tutorial, and finally a discussion of the differences from other iteratee packages and acknowledgments.
See the Data.IterIO.Iter, Data.IterIO.Inum, and
Data.IterIO.ListLike modules for more detailed documentation of data
structures and functions. In addition, Data.IterIO.Trans (also
re-exported by this module) supplies functions that help you invoke
monad transformers from the mtl library from within the Iter
monad.
Several other potentially useful modules in the package are not exported by default:
- Data.IterIO.Parse includes parsec-like parsing combinators for iteratee input.
- Data.IterIO.Zlib provides zlib and gzip format compression and decompression.
- Data.IterIO.SSL provides support for SSL.
- Data.IterIO.Http provides support for parsing and formatting HTTP, including handling form and file uploads (which can be processed in constant space). This may be useful in conjunction with Data.IterIO.HttpRoute, which provides simple request routing support for web servers.
- Data.IterIO.Atto provides support for running attoparsec parsers on iteratee input (see http://hackage.haskell.org/package/attoparsec/).
- Data.IterIO.Extra provides debugging functions, as well as a loopback iteratee that can be used to test a protocol implementation against itself.
Documentation
module Data.IterIO.Iter
module Data.IterIO.Trans
module Data.IterIO.Inum
module Data.IterIO.ListLike
Overview
At a high level, an iteratee is a data sink that is fed chunks of
data. It may return a useful result, or its utility may lie in
monadic side-effects, such as storing received data to a file.
Iteratees are represented by the type Iter t m a. Here t is the type of
data that the iteratee receives as input. (t must be an instance of
ChunkData, such as String or lazy ByteString.) m is the Monad in which
the iteratee runs--for instance IO (or an instance of MonadIO) for the
iteratee to perform IO. a is the type that the iteratee will return when
it has consumed enough input to produce a result.
An enumerator is a data source that feeds data chunks to an
iteratee. Enumerators are also iteratees. We use the type
Inum tIn tOut m a to represent these iteratee-enumerators. As an
iteratee, an Inum sinks data of some input type, generally designated
tIn. As an enumerator, the Inum feeds data of a potentially different
type, tOut, to another iteratee. Thus, the Inum can be viewed as
transcoding data from type tIn to type tOut for consumption by another
iteratee.
Inums are generally constructed using the functions mkInum and mkInumM
in module Data.IterIO.Inum. The first function uses a simple Iter of
type Iter tIn m tOut to translate between input type tIn and output
type tOut. The second function, mkInumM, allows construction of more
complex Inums.
An important special kind of Inum is an outer enumerator, which is
just an Inum with the void input type (). Outer enumerators are
sources of data. Rather than transcode input data, they produce data
from monadic actions (or from pure data in the case of inumPure). The
type Onum t m a represents outer enumerators and is a synonym for Inum
with an input type of ().
To execute iteratee-based IO, you must apply an Onum to an Iter with
the |$ ("pipe apply") binary operator.
An important property of enumerators and iteratees is that they can be
fused. The |. ("fuse leftward") operator fuses two Inums together
(provided the output type of the first is the input type of the
second), yielding a new Inum that transcodes from the input type of
the first to the output type of the second. Similarly, the .| ("fuse
rightward") operator fuses an Inum to an Iter, yielding a new Iter
with a potentially different input type.
Enumerators of the same type can also be concatenated, using the cat
function. enum1 `cat` enum2 produces an enumerator whose effect is to
feed first enum1's data, then enum2's data, to an Iter.
Tutorial
The iterIO library performs IO by hooking up sources of data, called enumerators, to data sinks, called iteratees, in a manner reminiscent of Unix command pipelines. Compared to lazy IO, the enumerator/iteratee paradigm provides better error handling, referential transparency (which should, after all, be one of the big advantages of Haskell), and equally convenient composition of protocol layers and parsers without worrying about IO chunk boundaries.
Enumerators, implemented by the type Onum (short for outer enumerator,
for reasons that will become clear below), are so called because they
enumerate all data elements (e.g., bytes or packets) in some source
such as a file or socket. Hence, an enumerator should be viewed as a
source outputting chunks of data whose type is a Monoid. (Actually,
the input type must be of class ChunkData, which is a Monoid that
additionally has a method null to test whether a piece of data is
equal to mempty.)
Iteratees, implemented by the type Iter
, should be viewed as sinks
consuming data. When executing IO, the library iterates over all
data elements output by the source, using an iteratee to produce a
result. The source may output data in chunks whose boundaries do not
coincide with logical message units; iteratees handle this
transparently, simplifying programming.
Here is a simple example:
```haskell
-- Return the first line of a file
headFile :: FilePath -> IO String
headFile path = enumFile path |$ lineI
```
enumFile enumerates the contents of a file. lineI returns a line of
input (discarding the newline). |$ is the pipe apply operator that
applies an Onum to an Iter, returning the result of the Iter--in this
case the first line of the file named path.
An Iter's main purpose may not be to produce a result. Some Iters are
primarily useful for their side effects. For example, stdoutI writes
data to standard output; handleI similarly writes output to an
arbitrary file handle. Thus, the following function copies the
contents of a file to standard output:
```haskell
-- Copy file to standard output
catFile :: FilePath -> IO ()
catFile path = enumFile' path |$ stdoutI
```
enumFile' is like enumFile above, but type restricted to data in the
lazy ByteString format, which is more efficient than plain Strings.
(enumFile supports multiple types, but in this example there is not
enough information for Haskell to choose one of them, so we must use
enumFile' or use :: to specify a type explicitly.)
Once again, |$ is used to execute the IO actions, but, this time, the
return value is just (); the interesting action lies in the side
effects of writing data to standard output while iterating over the
input with stdoutI.
The real power of the iteratee abstraction lies in the fact that Iters
are monadic computations. One Iter may invoke another to make use of
the first one's results. Here is an example of a function that returns
the first two lines of a file:
```haskell
-- | Return first two lines of file
head2File :: FilePath -> IO (String, String)
head2File path = enumFile path |$ lines2I

-- | Iter that returns next two lines as a pair
lines2I :: (Monad m) => Iter String m (String, String)
lines2I = do line1 <- lineI
             line2 <- lineI
             return (line1, line2)
```
This example illustrates several points. First, consider the type of
the lines2I function: Iter String m (String, String). The Iter type
constructor takes three type arguments. The first, String in this
case, specifies the type of input expected by the iteratee. The last
type, (String, String) in this case, specifies the result type of the
iteratee. Finally, the middle type, m, is a monad, because Iter t (for
a given input type t) is a monad transformer (i.e., it is an instance
of the MonadTrans class). In this case, when head2File invokes
lines2I, m will be IO, because head2File is returning a result in the
IO monad. However, lines2I would work equally well with any other
monad.
Next, notice the functioning of Iter String m as a monad. The type of
lineI in the above example is Iter String m String. The lines2I
function executes lineI twice, using monadic do syntax to bind the
results to line1 and line2. The monadic bind operator hides the
details of IO chunk boundaries. If, for instance, lineI needs more
input because a newline character has not yet been read, lineI returns
to the containing enumerator asking for more data. If the first lineI
receives more than a line of input, it simply passes the residual
input to the next invocation of lineI. Both of these actions are
hidden by the syntax, making most code much easier to read and write.
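The residual-input behavior is easiest to see in a stripped-down model. The following self-contained sketch uses invented toy types (It, Step, lineIt, and run are not the library's real Iter and IterR--there is no EOF handling, no errors, and no inner monad) to show how monadic bind passes leftover input from one line-reader to the next, making chunk boundaries invisible:

```haskell
-- Invented toy model: an iteratee is a function from an input chunk to
-- either a result plus residual input, or a continuation wanting more.
data Step a = Done a String   -- finished: result plus residual input
            | More (It a)     -- needs another chunk

newtype It a = It { feed :: String -> Step a }

instance Functor It where
  fmap f (It g) = It $ \c -> case g c of
                               Done a r -> Done (f a) r
                               More it  -> More (fmap f it)

instance Applicative It where
  pure a  = It (Done a)
  f <*> x = f >>= \g -> fmap g x

instance Monad It where
  It g >>= k = It $ \c -> case g c of
                            Done a r -> feed (k a) r   -- residual flows on
                            More it  -> More (it >>= k)

-- In the spirit of lineI: accumulate input until a newline arrives.
lineIt :: It String
lineIt = go ""
  where go acc = It $ \c -> case break (== '\n') c of
                              (p, '\n':r) -> Done (acc ++ p) r
                              (p, _)      -> More (go (acc ++ p))

-- In the spirit of lines2I from the tutorial.
twoLines :: It (String, String)
twoLines = do line1 <- lineIt
              line2 <- lineIt
              return (line1, line2)

-- Feed a list of chunks until the iteratee finishes.
run :: It a -> [String] -> Maybe a
run it (c:cs) = case feed it c of
                  Done a _ -> Just a
                  More it' -> run it' cs
run _  []     = Nothing

main :: IO ()
main = print (run twoLines ["one\ntw", "o\nrest"])
-- The second line straddles a chunk boundary, yet twoLines never notices.
```

Here the second line straddles the boundary between the two chunks, and the bind in twoLines hands the leftover "tw" to the second lineIt automatically; running main yields Just ("one","two").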
That explains the iteratee type Iter. The enumerator type, Onum, has
the same three type arguments. Thus, the type of enumFile, as
instantiated in the above examples, is enumFile :: Onum String IO a.
Most Onum types are polymorphic in the last argument, so as to be able
to return whatever type the Iter is returning. (In fact, enumFile is
polymorphic in the first two arguments, too, so as to work with
multiple String-like types and any monad in the MonadIO class.)
Here is an example of an Iter
with side effects:
liftIOexampleI :: (MonadIO m) =>Iter
String m () liftIOexampleI = do line <-lineI
liftIO
$ putStrLn $ "First line is: " ++ line next <-takeI
40liftIO
$ putStrLn $ "And the next 40 bytes are: " ++ next
Unlike lines2I, liftIOexampleI does not return any interesting result,
but it uses the liftIO monad transformer method to output the first
line of the file, followed by the next 40 bytes. The takeI iteratee
returns a String (or ByteString) with exactly the requested number of
characters or bytes, unless an EOF (end-of-file) is encountered.
Of course, the real power of command pipelines is that you can hook multiple commands together. For instance, say you want to know how many words in the system dictionary files contain a double k and start with a lower-case letter. You could run a command like this:
```
cat /usr/share/dict/words /usr/share/dict/extra.words \
    | grep kk | grep '^[a-z]' | wc -l
```
Let's see how to do something equivalent with iteratees, starting with
the wc -l command, which counts lines. Here is an equivalent iteratee:

```haskell
lineCountI :: (Monad m) => Iter String m Int
lineCountI = count 0
    where count n = do line <- safeLineI
                       case line of
                         Just _  -> count (n+1)
                         Nothing -> return n
```
The safeLineI function is like lineI, but returns a Maybe String (or
Maybe ByteString), which is Nothing upon an EOF condition. (lineI
throws an exception on EOF.)
What about the grep command? grep sits in the middle of a pipeline, so
it acts both as a data sink and as a data source. This is why we call
such a pipeline stage an iteratee-enumerator, or Inum. Before defining
our grep equivalent, since multiple pipeline stages are going to be
considering the file one line at a time, let's first build an Inum to
separate input into lines:
```haskell
import Data.ByteString as S
import Data.ByteString.Char8 as S8

-- | Break input into lines of type S.ByteString, as this type
-- works most conveniently with regular expressions.  (Otherwise,
-- we would prefer lazy ByteStrings.)
inumToLines :: (Monad m) => Inum S.ByteString [S.ByteString] m a
inumToLines = mkInum $ do line <- lineI
                          return [line]
```
Inum takes four type arguments, compared to only three for Onum.
That's because an Inum is acting as both an iteratee and an
enumerator; it needn't be processing the same type of data in both
roles. In the above example, when acting as an iteratee, inumToLines
consumes data of type S.ByteString (the first type argument),
accepting one long stream of unstructured bytes. However, as an
enumerator, inumToLines produces output of type [S.ByteString] (the
second type argument), a list of strings, one per line of the file. In
general, the type Inum tIn tOut m a is an iteratee-enumerator taking
input of type tIn, producing output of type tOut, and feeding the
output to an iteratee of type Iter tOut m a.
In fact, an Onum is just a special kind of Inum with the void input
type (). The type Onum t m a is just a synonym for Inum () t m a.
Most operations on Inums can be used with Onums as well, since an Onum
is an Inum. The converse is not true, however. For example, the |$
operator requires an Onum, as it wouldn't know what data to feed to an
arbitrary Inum. (If you need it, however, there is a function run,
hidden by this module but exported by Data.IterIO.Iter, that executes
an iteratee computation of arbitrary input type by feeding it an EOF
as input.)
Iteratee-enumerators are generally constructed using either mkInum or
mkInumM, and by convention most Inums have names starting "inum...",
except that Onum names start "enum...". mkInum takes an argument of
type Iter tIn m tOut that consumes input of type tIn to produce output
of type tOut. (For inumToLines, tIn is S.ByteString and tOut is
[S.ByteString].) This is fine for simple stateless translation
functions, but sometimes one would like to keep state and use more
complex logic in an Inum. For that, the mkInumM function creates an
Inum out of a computation in a dedicated InumM monad. See the
Data.IterIO.Inum documentation for more information on mkInumM. In
inumToLines, we do not need to keep state. We are happy just to let
lineI throw an exception on EOF, which mkInum will catch and handle
gracefully.
Throwing an EOF exception--either implicitly by executing another
Iter, or explicitly with throwEOFI--is one of the standard ways to
exit an Inum created by mkInum. The other way is to return empty
input.
We similarly define an Inum to filter out lines not matching a regular
expression (using the Text.Regex.Posix.ByteString library), and a
simple Iter to count list elements (since lineCountI ::
Iter String m Int has input data type String, while after inumToLines
we need an Iter with input data type [S.ByteString]).
```haskell
inumGrep :: (Monad m) => String -> Inum [S.ByteString] [S.ByteString] m a
inumGrep re = mkInum $ do line <- headI
                          if line =~ packedRe
                            then return [line]
                            else return []
    where packedRe = S8.pack re

lengthI :: (Monad m) => Iter [t] m Int
lengthI = count 0
    where count n = do line <- safeHeadI
                       case line of
                         Just _  -> count (n+1)
                         Nothing -> return n
```
Now we are almost ready to assemble all the pieces. But recall that
the |$ operator applies one Onum to one Iter, yet now we have two
Onums (because we want to look through two files), and three Inums
that we want to compose into a pipeline. The library supports two
types of composition for pipeline stages: concatenation and fusing.
Two Inums (or Onums) of the same type can be concatenated with the cat
function, producing a new data source that enumerates all of the data
in the first Inum followed by all of the data in the second.
There are two fusing operators. The left-associative |. operator fuses
two Inums, provided the output type of the first is the input type of
the second. (Mnemonic: it produces a pipeline stage that is open on
the right hand side, as it still needs to be applied to an iteratee
with |$.) The right-associative .| operator fuses an Inum to an Iter,
producing a new Iter.
The fusing operators bind more tightly than the infix concatenation
functions, which in turn bind more tightly than |$. (Concatenation
operators can also be used through prefix function application, which
binds most tightly.) Hence, putting it all together, we produce the
following Haskell equivalent to the above Unix pipeline:
```haskell
grepCount :: IO Int
grepCount = enumFile "/usr/share/dict/words" |. inumToLines
            `cat` enumFile "/usr/share/dict/extra.words" |. inumToLines
            |$ inumGrep "kk"
            .| inumGrep "^[a-z]"
            .| lengthI
```
One often has a choice as to whether to fuse an Inum to the Onum, or
to the Iter. For example, grepCount could alternatively have been
implemented as:
```haskell
grepCount' :: IO Int
grepCount' = cat (enumFile "/usr/share/dict/words" |. inumToLines)
                 (enumFile "/usr/share/dict/extra.words" |. inumToLines)
             |. inumGrep "kk"
             |. inumGrep "^[a-z]"
             |$ lengthI
```
In this case, the two are essentially equivalent. However, for error
handling purposes, one should fuse together pipeline stages in which
errors have similar consequences. Often an Inum or Onum failure is
less serious than an Iter failure. For example, in the above example,
if enumFile fails because one of the files does not exist, we might
want to continue processing lines from the next file. Conversely, if
lengthI fails or one of the inumGrep stages fails (most likely because
the regular expression is illegal), there is not much point in
continuing the program. This is why the first example fused inumGrep
to lengthI, though this won't matter until we actually handle errors
(see below).
Another alternative would have been to swap the order of concatenation and fusing:
```haskell
grepCount'' :: IO Int
grepCount'' = cat (enumFile "/usr/share/dict/words")
                  (enumFile "/usr/share/dict/extra.words")
              |. inumToLines
              |$ inumGrep "kk"
              .| inumGrep "^[a-z]"
              .| lengthI
```
This last version changes the semantics of the counting slightly.
With grepCount'', if the first file has an incomplete last line, this
line will be merged with the first line of the second file, which is
probably not what you want. (For instance, if the incomplete last line
of the first file starts with a capital letter, then the first line of
the second file will not be counted even if it starts with a
lower-case letter and contains two "k"s.)
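This difference can be seen with plain String operations, no iteratees required. In the following sketch (file contents inlined as strings purely for illustration), joining raw contents before splitting into lines merges the incomplete last line of the first file into the first line of the second, while splitting first keeps them apart:

```haskell
-- Pure illustration of the two orderings of concatenation and
-- line-splitting; the inlined contents are invented example data.
fileA, fileB :: String
fileA = "alpha\nBeta"       -- no trailing newline: incomplete last line
fileB = "gamma\ndelta\n"

afterCat, beforeCat :: [String]
afterCat  = lines (fileA ++ fileB)      -- like grepCount'': lines merge
beforeCat = lines fileA ++ lines fileB  -- like grepCount: lines kept apart

main :: IO ()
main = do print afterCat    -- ["alpha","Betagamma","delta"]
          print beforeCat   -- ["alpha","Beta","gamma","delta"]
```

Note how "Beta" and "gamma" fuse into a single line in the first ordering, exactly the effect grepCount'' would exhibit.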
One limitation of all the grepCount variants shown so far is that if
the first file does not exist, the whole operation aborts. This might
or might not be reasonable when counting lines, but in other contexts
we may want to resume after failure. Suppose we want to implement a
function like the Unix grep command that searches for a string in a
bunch of files and prints all matching lines. If opening or reading a
file produces an error, the function should print the error message
and continue on with the next file.
Error handling is provided by the catchI and inumCatch functions,
which are roughly equivalent to the standard library catch function.
There is also a throwI function analogous to throwIO in the standard
library. Because catch only works in the IO monad, catchI and
inumCatch work by propagating synchronous exceptions through the Iter
monad. liftIO transforms IO errors into such synchronous exceptions.
Unfortunately, there is no way to handle asynchronous exceptions such
as those that arise in lazily evaluated pure code (e.g., divide by
zero) or those thrown by another thread using throwTo. Fortunately,
for our grep example, we only need to catch IO errors.
Here is the grep
code. We will analyze it below.
```haskell
grep :: String -> [FilePath] -> IO ()
grep re files
    | null files = enumStdin |. inumToLines |$ inumGrep re .| linesOutI
    | otherwise  = foldr1 cat (map enumLines files) |$ inumGrep re .| linesOutI
    where
      enumLines file = inumCatch (enumFile file |. inumToLines) handler
      handler :: IOError
              -> IterR () IO (IterR [S.ByteString] IO a)
              -> Iter () IO (IterR [S.ByteString] IO a)
      handler e result = do
          liftIO (hPutStrLn stderr $ show e)
          resumeI result
      linesOutI = do
          mline <- safeHeadI
          case mline of
            Just line -> do liftIO $ S.putStrLn line
                            linesOutI
            Nothing   -> return ()
```
There are two cases. If the list of files to search is null, grep
simply reads from standard input, in which case there is only one
input stream and we do not care about resuming. In the second case, we
use foldr1 cat to concatenate a list of Onums. Each Onum is generated
by the function enumLines, which fuses enumFile to our previously
defined inumToLines, but also wraps the exception handler function
handler around the enumerator using inumCatch.
Note that unlike catch, inumCatch expects an exception handler to have
two arguments. The first argument, e in this example, is the exception
itself. As with catch, the type of e determines which exceptions are
caught, which is why we must either specify an explicit type signature
for handler or somewhere specify e's type explicitly, for instance
with:
```haskell
... liftIO (hPutStrLn stderr $ show (e :: IOError)) ...
```
Note that IOError doesn't expose a type constructor, but for exception
types that do, it often suffices to define the function with the
exception constructor, as:
```haskell
handler e@(SomeException _) result = do ...
```
The second argument to handler, result, is the failed state of the
iteratee, which contains more information than just the exception. In
the case of an Inum failure, it contains the state of the Iter that
the Inum was feeding when it failed. The type of result is IterR--the
type returned by Iters when they are fed chunks of data. IterR takes
the same three type arguments as Iter. The function resumeI extracts
and returns an Iter [S.ByteString] IO a from this failed result. Thus,
the next enumerator in a concatenated series can continue feeding it
input. If, instead of resuming, you want to re-throw the error, it
suffices to re-execute the failed result with reRunIter. For instance,
suppose we want to continue executing grep when a named file does not
exist, but if some other error happens, we want to re-throw the
exception to abort the whole program. This could be achieved as
follows:
```haskell
handler e result =
    if isDoesNotExistError e
      then do liftIO (hPutStrLn stderr $ show e)
              resumeI result
      else reRunIter result
```
Because printing an exception is so common, there is a function
verboseResumeI that prints exceptions before resuming (also prefixing
the program name). Thus, we can simplify the above function to:
```haskell
handler e result = if isDoesNotExistError e
                     then verboseResumeI result
                     else reRunIter result
```
These last two handler functions also do away with the need for an
explicit type signature, because the isDoesNotExistError function has
argument type IOError, constraining the type of e to the type of
exceptions we want to catch.
Differences from other iteratee packages
The Iteratee approach was originally advocated by Oleg Kiselyov (see talk slides at http://okmij.org/ftp/Streams.html#iteratee). The main implementation by Kiselyov and John Lato is simply called iteratee (http://hackage.haskell.org/package/iteratee). Another realization of the iteratee concepts is the enumerator package (http://hackage.haskell.org/package/enumerator). IterIO is a re-implementation of these concepts from scratch. This section discusses the differences between previous packages and iterIO, both as a means for motivating iterIO's design and as a set of suggestions for improving other iteratee implementations.
- Base abstraction
The iterIO package represents an iteratee as a pure function from a
chunk of pending input data to an iteratee result of type IterR:

```haskell
newtype Iter t m a = Iter { runIter :: Chunk t -> IterR t m a }
```
An IterR
can yield a result and residual input, or it can ask for
more input, or it can request to have an action executed in the
underlying monad, or it can signal failure. The fact that all
iteratees are functions of input ensures that iteratees generally see
all pending input. Thus, iteratees can do things like measure the
length of buffered input to subtract it from the current file offset
and determine the effective position in a file.
IterR's division of iteratee results into different outcomes, such as
needing input or needing monadic actions, allows the library to
distinguish between pure iteratees and those with potential side
effects. The ability to know that a specific iteratee is a pure
function in many cases allows one to parse LL(*) grammars without
large amounts of input buffering for backtracking (see below).
In contrast, the iteratee package uses continuation passing style (CPS), in which an iteratee is a function taking two continuation functions--one to call when done, and a second to call when either requesting more input or failing:
```haskell
-- From the iteratee package:
newtype Iteratee s m a = Iteratee {
    runIter :: forall r.
               -- First the "onDone" function:
               (a -> Stream s -> m r) ->
               -- Next the "onCont" function:
               ((Stream s -> Iteratee s m a) -> Maybe SomeException -> m r) ->
               m r }
```
CPS has the advantage of exposing the bind operator of the underlying
monad, making lift
cheap and simple. Moreover, splitting into two
continuations saves the first and most common one (i.e., "onDone")
from the overhead of checking whether an error condition or request
for more input has occurred. See
http://haskell.org/haskellwiki/Performance/Monads#Use__Continuation_Passing_Style
for a good discussion of the advantages of CPS.
Because of CPS, iteratee should be capable of delivering the best
performance of the three iteratee packages. A disadvantage of
iterIO's approach is that every invocation of lift
must be
propagated all the way up the call chain, where a small amount of
overhead is added for each enclosing catchI
or similar call. While
iterIO can handle most successful IterR
outcomes and caught
exceptions locally without popping back up the call stack, there is
also potentially overhead from actually checking that the outcome was
successful at each bind site. (GHC's inliner may be able to avoid the
check in some cases.)
However, iteratee lacks several features of iterIO; offering these features would likely reduce the benefits of CPS and complicate code. For instance, there is no way to execute a pure iteratee without monadic actions (the benefit touted above and described below for LL(*) parsing). Moreover, iteratee's exception mechanism discards the current location in the input stream, making it unsuitable for failed parse alternatives. IterIO provides a general control mechanism to make arbitrary requests from enumerators (such as seek, tell, getpeername, get SSL information, etc.); iteratee instead overloads the exception mechanism for control purposes, which prevents control operations from returning values. Thus, while iteratee can implement seek, it cannot, for instance, implement tell.
The enumerator package's approach is closer to iterIO's, but makes
every iteratee into a monadic action in the underlying monad m:

```haskell
-- From the enumerator package:
newtype Iteratee a m b = Iteratee { runIteratee :: m (Step a m b) }
```
Here Step is similar to iterIO's IterR type, but the m wrapper
disallows iterIO's LL(*) parsing tricks. It also causes gratuitous
invocation of m's bind function, which can be expensive when using
stacks of monad transformers. Furthermore, enumerator discards the
input state on all errors, making it impossible to resume from
failures that leave the input in a known state (such as a parsing
lookahead failure).
- Uniformity of abstraction
IterIO's abstractions were refined over many iterations to become
minimal yet highly expressive and familiar to Unix shell users. Thus,
we have Iters, which are data sinks that consume input and produce a
result. Then we have Inums, which are also Iters. These two data types
can be combined through pipes (i.e., fusing) and concatenation, both
of which have direct analogues in the Unix | (pipe) operator and the
cat command.
Basing everything around these few concepts makes the library easier
to learn and use. For instance, because all Inums are Iters, there is
only one set of Iter building blocks to learn. Inum implementations
invoke the same Iters that are used to build other Iters. Moreover,
Inums and Iters use the same error handling mechanism. Finally,
because Onums are also Inums, one set of fusing and concatenation
operators works for both.
By contrast, both the iteratee and enumerator packages use enumerator types that are not iteratees. Hence, constructing enumerators is harder and requires a different error handing mechanism. The packages must introduce a third, hybrid "Enumeratee" type for inner pipeline stages, and fusing Enumerators to Enumeratees is a different function from fusing Enumeratees together.
Funneling everything through a small number of abstractions also
ensures that the right thing happens in corner cases. In particular,
all enumerator application happens through the pipe operator. Though
there are two pipe operators, a left associative one and a right
associative one, they internally use the same function:
a |. b = (a .|) . b. Similarly, the pipe application operators (|$ and
.|$) are defined in terms of .|.
.| guarantees that its right-hand argument will receive an EOF when
the left hand argument terminates (whether normally or through an
exception). This is crucial for managing resources such as file
descriptors, and works no matter how convoluted the control structure
of your program.
Consider the following realistic scenario of a web server constructed
as an Inum that translates from HTTP requests to HTTP responses.
(Such an Inum is provided by the function inumHttpServer in
Data.IterIO.Http.) The server's accept loop would resemble the
following:
```haskell
loop = do (sock, _) <- Net.accept listen_socket
          _ <- forkIO $ do
                 (iter, enum) <- iterStream sock
                 enum |$ inumHttpServer (ioHttpServer handler) .| iter
          loop
```
This code depends on the fact that iterStream closes sock after both
the iter has received an EOF and the enum has returned. One level
down, inumHttpServer uses mkInumM to construct an Inum, and has code
looking something like this:
```haskell
req <- httpReqI                              -- parse HTTP request
resp <- liftI $ inumHttpBody .| handler req  -- invoke handler
irun $ enumHttpResp resp Nothing             -- send response to client
```
The handler
gets run on the body of the message, and might decide to
process an HTTP POST request by saving an uploaded file to disk, for
instance with code like this:
```haskell
let saveFile _ field
      | ffName field == S8.pack "file" = do
          h <- liftIO $ openBinaryFile "upload" WriteMode
          handleI h `finallyI` liftIO (hClose h)
      | otherwise = return ()
 in foldForm req saveFile ()
```
foldForm
internally is invoking an Inum
that parses HTTP
multipart/form-data to pipe each field of the form to the saveFile
function.
Now suppose inumHttpBody fails (most likely because it receives an EOF
before reading the number of bytes specified in the Content-Length
header). Because inumHttpBody is fused to handler, the failure will
cause handler to receive an EOF, which will cause foldForm to fail,
which will cause handleI to receive an EOF and return, which will
ensure hClose runs and the file handle h is not leaked.
Once the EOFs have been processed, the exception will propagate
upwards, making inumHttpServer fail, which in turn will send an EOF to
iter. Then the exception will cause enum to fail, after which sock
will be closed. In summary, despite the complex structure of the web
server, because all the components are fused together with pipe
operators, corner cases like this just work, with no need to worry
about leaked file descriptors.
- Uniform error-handling and simplified monad transformers
The iterIO library provides a traditional throw and catch exception
mechanism using its own functions throwI and catchI, but keeping the
standard library exception hierarchy from Control.Exception. All of
the support routines are carefully crafted to ensure that this single
exception mechanism is the only one you ever need, so that you don't
end up having to integrate different components with different error
strategies, a situation summarized amusingly in the following blog
post:
http://www.randomhacks.net/articles/2007/03/10/haskell-8-ways-to-report-errors.
A key to uniform error handling is ensuring that errors can be
propagated cleanly across different monads and transformers. Thus,
for instance, the iterIO liftIO
function translates all uncaught IO
errors into Iter
errors.
More importantly, iterIO is designed to support the standard mtl monad
transformers while keeping Iter as the outermost monadic type. For
instance, if deep in the middle of some Iter t IO computation you need
a state transformer monad, you can invoke one with runStateTI, which
is the iterIO equivalent of runStateT. As seen by comparing their
effective types, runStateTI keeps the Iter monad on the outside, and
thus can cleanly propagate failures out of the StateT subcomputation:
runStateT  :: StateT s m a -> s -> m (a, s)
runStateTI :: Iter t (StateT s m) a -> s -> Iter t m (a, s)
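The practical difference can be seen with plain mtl transformers in
the following standalone sketch (not iterIO code): when the failure
monad stays outermost, as Iter does in runStateTI's result type, a
failure inside the StateT subcomputation can still be observed along
with the state, whereas stacking StateT outside the failure monad
discards the pending state.

```haskell
import Control.Monad.State
import Control.Monad.Except

-- Standalone mtl sketch, not iterIO code: with the failure monad on
-- the outside (as Iter is in runStateTI's result type), the final
-- state survives a failure in the subcomputation; with StateT outside
-- the failure monad, the state is lost along with the failure.

failureOutside :: (Either String Int, Int)
failureOutside = runState (runExceptT body) 0
  where body = do lift (put 42)
                  throwError "boom"

stateOutside :: Either String (Int, Int)
stateOutside = runExcept (runStateT body 0)
  where body = do put 42
                  throwError "boom"

main :: IO ()
main = do
  print failureOutside  -- state 42 survives the failure
  print stateOutside    -- pending state is discarded
```

The first stacking mirrors runStateTI's behavior; the second shows
what is lost when the ordering is inverted.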
Similarly, there is a function liftI that can be used to execute a
computation in which a level of monad transformer is stripped off the
inner monadic type:

liftI :: (MonadTrans t) => Iter s m a -> Iter s (t m) a
An equally important feature is the ability to distinguish Iter
failures from Inum failures, given that the former are often more
serious than the latter. As shown by the grep example in the tutorial
above, when one in a series of concatenated Inums fails, you often
want to keep going without losing the state of the Iter. The
enumerator package does not appear to support this distinction. The
iteratee package might, but it is not clear how to implement the
iteratee equivalent of the grep example above.

By contrast, iterIO's Inum mechanism was designed to be intuitive. If
you wrap a pipeline of Inums in an inumCatch statement, then you will
catch exactly the errors thrown by those Inums, not those thrown by
pipeline stages outside the scope of the inumCatch call.
It is because of this unified error handling mechanism that examples such as the HTTP server above can be guaranteed not to leak resources.
- Parser combinators for LL(*) grammars
IterIO's Data.IterIO.Parse module supports parsing of iteratee input
using combinators similar to those found in parsec. However, parsec
supports only LL(1) grammars, and can lead to confusing failures--for
instance, the parser string "foo" <|> string "for" would fail on input
"for". IterIO, by contrast, supports full LL(*) parsing, meaning a
parser can look arbitrarily far ahead before failing.
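The difference can be illustrated with only the base library, using
ReadP, another fully backtracking parser (this is a sketch, not
iterIO's combinators): the alternative that defeats parsec's
single-token lookahead succeeds under full backtracking.

```haskell
import Text.ParserCombinators.ReadP

-- Base-only sketch using ReadP, another backtracking parser, rather
-- than iterIO's combinators: string "foo" +++ string "for" succeeds
-- on input "for", because failure of the first branch rewinds input.
fooOrFor :: ReadP String
fooOrFor = string "foo" +++ string "for"

main :: IO ()
main = print [r | (r, "") <- readP_to_S fooOrFor "for"]
```

Under an LL(1) discipline, the shared prefix "fo" would have committed
the parser to the first branch; with backtracking, "for" parses.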
LL(*) parsers are generally disfavored because of their potential to consume arbitrarily large amounts of memory to remember input for backtracking. However, iterIO offers two mechanisms that mitigate the problem.
First, because Iters are constructed in such a way as to differentiate
requests for more input from execution of monadic actions, it is
possible to run multiple parsers in parallel. Consider a hypothetical
parser such as the following, designed to recognize the input format
and parse either XML or JSON data:

parser :: Iter ByteString m Value
parser = (string "<!DOCTYPE" >> parseXml)
     <|> (char '{' >> parseJson)
<|> is an infix synonym for the iterIO function multiParse, which
attempts to run two parsers concurrently on input as it arrives.
Because string and char are both pure parser combinators with no
monadic side effects, it is possible to run them both concurrently
without fear that the second rule--if it fails--will nonetheless have
produced side effects. In fact, at least one of the string or the
char actions will fail almost immediately, likely on the first chunk
of data. After one of the two has signaled a parse error, there is no
longer any need to store input for backtracking. Note this works even
if the subsequent functions parseXml and parseJson have monadic side
effects, because multiParse doesn't need to invoke those monadic
actions to determine that one of the two parsers has failed.
A second way to avoid large amounts of storage for backtracking is to
use iterIO's \/ operator, which is an infix synonym for ifNoParse.
The formulation iter \/ no $ yes splits a parser into three
components. iter is executed with backtracking enabled. If it
succeeds, then the saved data is discarded, iter's result is fed to
the function yes, and any further failures will not cause input to be
rewound. If, on the other hand, iter fails, then input is rewound and
no is executed. The \/ operator is very convenient for long folds
whose individual elements do not consume a lot of input. For example,
to parse and sum a list of numbers (given a parser number that skips
spaces then parses one number), you might do something like this:
parseAndSumIntegerList :: Iter String IO Int
parseAndSumIntegerList = loop 0
  where loop n = number \/ return n $ \n' -> loop (n + n')
Regardless of the length of the list of numbers being parsed,
parseAndSumIntegerList only ever needs to backtrack over the input
consumed by a single iteration of number, which is likely a small
amount of extra memory to keep around.
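The shape of that fold can be reproduced with base's ReadP in the
following sketch (not iterIO code), where the left-biased choice <++
plays the role of \/ : each iteration either parses one more number or
commits to the running total.

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP

-- Sketch with base's ReadP rather than iterIO: <++ is a left-biased
-- choice, so each step tries `number` and, on failure, falls back to
-- returning the running total -- the same shape as `number \/ return n`.
number :: ReadP Int
number = skipSpaces >> (read <$> munch1 isDigit)

sumNumbers :: ReadP Int
sumNumbers = loop 0
  where loop n = (number >>= \n' -> loop (n + n')) <++ return n

main :: IO ()
main = print [r | (r, "") <- readP_to_S sumNumbers "1 2 3"]
```

As in the iterIO version, each step only ever backtracks over the
input consumed by one attempt at number.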
If you do want an LL(1) parser combinator library, iterIO supports
seamless integration with the attoparsec package. The function atto
in Data.IterIO.Atto turns an attoparsec Parser into an Iter monad,
treating an attoparsec failure as an Iter exception that can be
handled in the usual way with ifParse or multiParse, or just caught
with catchI. (Attoparsec has the additional advantage of solving the
annoying string "foo" <|> string "for" issue by special-casing string
to have more lookahead.)
Preliminary testing suggests that attoparsec can be about three times
faster than Data.IterIO.Parse on parse-intensive workloads. The
limitation is that attoparsec parsers must be pure. A good compromise
may be to use iterIO for coarse-grained parsing, and attoparsec for
more complex data structures. For example, you might want to use
iterIO's parsing of HTTP multipart/form-data (so as to be able to pipe
files to disk in constant space), but for fields with JSON data, use
atto to pipe the contents to the excellent attoparsec-based aeson
package.
Acknowledgments
Daniel Giffin contributed numerous suggestions and improvements to
both the code and documentation. Deian Stefan and David Terei helped
with testing and improving the package, as well as understanding
various relevant aspects of Haskell and GHC. Mike Hamburg made the
key suggestion of defining Onums as type-restricted Inums. The author
is grateful to John Lato for helping him understand much of the
important design rationale behind the original iteratee package. This
work was funded by the DARPA Clean-Slate Design of Resilient,
Adaptive, Secure Hosts (CRASH) program, BAA-10-70.