Studies in Computational Intelligence 673

Wojciech Wieczorek

Grammatical
Inference
Algorithms, Routines and Applications
Studies in Computational Intelligence

Volume 673

Series editor
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
e-mail: [email protected]
About this Series

The series “Studies in Computational Intelligence” (SCI) publishes new develop-
ments and advances in the various areas of computational intelligence—quickly and
with a high quality. The intent is to cover the theory, applications, and design
methods of computational intelligence, as embedded in the fields of engineering,
computer science, physics and life sciences, as well as the methodologies behind
them. The series contains monographs, lecture notes and edited volumes in
computational intelligence spanning the areas of neural networks, connectionist
systems, genetic algorithms, evolutionary computation, artificial intelligence,
cellular automata, self-organizing systems, soft computing, fuzzy systems, and
hybrid intelligent systems. Of particular value to both the contributors and the
readership are the short publication timeframe and the worldwide distribution,
which enable both wide and rapid dissemination of research output.

More information about this series at http://www.springer.com/series/7092


Wojciech Wieczorek

Grammatical Inference
Algorithms, Routines and Applications

Wojciech Wieczorek
Institute of Computer Science,
Faculty of Computer Science and Materials Science,
University of Silesia
Sosnowiec, Poland

ISSN 1860-949X    ISSN 1860-9503 (electronic)
Studies in Computational Intelligence
ISBN 978-3-319-46800-6 ISBN 978-3-319-46801-3 (eBook)
DOI 10.1007/978-3-319-46801-3
Library of Congress Control Number: 2016952872

© Springer International Publishing AG 2017


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made.

Printed on acid-free paper

This Springer imprint is published by Springer Nature


The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

Grammatical inference, the main topic of this book, is a scientific area that lies at
the intersection of multiple fields. Researchers from computational linguistics,
pattern recognition, machine learning, computational biology, formal learning
theory, and many others have their own contribution. Therefore, it is not surprising
that the topic also goes by a few other names, such as grammar learning, automata
inference, grammar identification, or grammar induction. To locate the present
contribution, we can divide all books relevant to grammatical inference
into three groups: theoretical, practical, and applicable. For the greater part this book is
practical, though one can also find elements of learning theory, combinatorics
on words, the theory of automata and formal languages, plus some references to
real-life problems.
The purpose of this book is to present old and modern methods of grammatical
inference from the perspective of practitioners. To this end, the Python programming
language has been chosen as the vehicle for presenting all the methods. The included
listings can be copied and pasted directly into other programs; thus students,
academic researchers, and programmers should find this book a valuable source of
ready recipes and an inspiration for their further development.
A few issues should be mentioned regarding this book: the inspiration to write it,
the key for the selection of the described methods, arguments for selecting Python as
the implementation language, typographical conventions, and where the reader can send
any critical remarks about the content of the book (subject matter, listings, etc.).
There is a treasured book entitled “Numerical Recipes in C”, in which, along with
the description of selected numerical methods, listings in C language are provided.
The reader can copy and paste the fragments of the electronic version of the book in
order to produce executable programs. Such an approach is very useful. We can
find an idea that lies behind a method and immediately put it into practice. It is a
guiding principle that accompanied writing the present book.
For the selection of methods, we tried to keep a balance between importance and
complexity. This means that we introduced concepts and algorithms that are
essential to GI practice and theory, but omitted those that are too complicated or
too long to present as ready-to-use code. Thanks to that, the longest program
included in the book is no more than a few pages long.
As far as the implementation language is concerned, the following requirements
had to be taken into account: simplicity, availability, the property of being firmly
established, and allowing the use of a wide range of libraries. The Python and F#
programming languages were good candidates. We decided to choose IronPython
(an implementation of Python) mainly due to its integration with the optimization
modeling language. We use a monospaced (fixed-pitch) font for the listings of
programs, while the main text is written using a proportional font. In listings,
Python keywords are in bold.
The following persons have helped the author in preparing the final version of
this book by giving valuable advice. I would like to thank (in alphabetical order):
Prof. Z.J. Czech (Silesian University of Technology), Dr. P. Juszczuk, Ph.D. stu-
dent A. Nowakowski, Dr. R. Skinderowicz, and Ph.D. student L. Strak (University
of Silesia).

Sosnowiec, Poland                                        Wojciech Wieczorek
2016
Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 The Problem and Its Various Formulations . . . . . . . . . . . . . . . . . . 1
1.1.1 Mathematical Versus Computer Science Perspectives . . . . . 1
1.1.2 Different Kinds of Output . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.3 Representing Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.4 Complexity Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Assessing Algorithms’ Performance . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 Measuring Classifier Performance . . . . . . . . . . . . . . . . . . . . 7
1.2.2 McNemar’s Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.3 5 × 2 Cross-Validated Paired t Test . . . . . . . . . . . . . . . . . . 9
1.3 Exemplary Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.1 Peg Solitaire . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.2 Classification of Proteins . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4 Bibliographical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2 State Merging Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Evidence Driven State Merging . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Gold’s Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 Grammatical Inference with MDL Principle . . . . . . . . . . . . . . . . . . 27
2.4.1 The Motivation and Appropriate Measures . . . . . . . . . . . . . 28
2.4.2 The Proposed Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5 Bibliographical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3 Partition-Based Algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 The k-tails Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Grammatical Inference by Genetic Search . . . . . . . . . . . . . . . . . . . 37
3.3.1 What Are Genetic Algorithms? . . . . . . . . . . . . . . . . . . . . . . 37


3.3.2 Basic Notions of the Genetic Algorithm for GI. . . . . . . . . . 37


3.3.3 Our Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4 CFG Inference Using Tabular Representations . . . . . . . . . . . . . . . . 40
3.4.1 Basic Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4.2 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4.3 Our Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5 Bibliographical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4 Substring-Based Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.1 Error-Correcting Grammatical Inference . . . . . . . . . . . . . . . . . . . . . 47
4.1.1 The GI Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.1.2 Our Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Alignment-Based Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.1 Alignment Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2.2 Selection Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.3 Our Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3 Bibliographical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5 Identification Using Mathematical Modeling . . . . . . . . . . . . . . . . . . . . 57
5.1 From DFA Identification to Graph Coloring . . . . . . . . . . . . . . . . . . 57
5.1.1 Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.1.2 Our Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2 From NFA Identification to a Satisfiability Problem . . . . . . . . . . . . 61
5.2.1 Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2.2 Our Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3 From CFG Identification to a CSP . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.3.1 Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.3.2 Our Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.4 Bibliographical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6 A Decomposition-Based Algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.1 Prime and Decomposable Languages . . . . . . . . . . . . . . . . . . . . . . . 69
6.1.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.1.2 Cliques and Decompositions . . . . . . . . . . . . . . . . . . . . . . . . 70
6.2 CFG Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.2.1 The GI Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.2.2 Our Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.3 Bibliographical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7 An Algorithm Based on a Directed Acyclic Word Graph . . . . . . . . . 77
7.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
7.2 Constructing a DAWG From a Sample . . . . . . . . . . . . . . . . . . . . . 78

7.3 Our Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79


7.4 Bibliographical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
8 Applications of GI Methods in Selected Fields . . . . . . . . . . . . . . . . . . 83
8.1 Discovery of Generating Functions . . . . . . . . . . . . . . . . . . . . . . . . . 83
8.1.1 Generating Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
8.1.2 The Schützenberger Methodology . . . . . . . . . . . . . . . . . . . . 84
8.1.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
8.2 Minimizing Boolean Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
8.2.1 Background and Terminology . . . . . . . . . . . . . . . . . . . . . . . 94
8.2.2 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
8.2.3 Our Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
8.2.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
8.3 Use of Induced Star-Free Regular Expressions . . . . . . . . . . . . . . . . 100
8.3.1 Definitions and an Algorithm . . . . . . . . . . . . . . . . . . . . . . . 100
8.3.2 An Application in Classification of Amyloidogenic
Hexapeptides . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
8.3.3 An Application in the Construction of Opening Books . . . . 105
8.4 Bibliographical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Appendix A: A Quick Introduction to Python . . . . . . . . . . . . . . . . . . . . . 111
Appendix B: Python’s Tools for Automata, Networks, Genetic
Algorithms, and SAT Solving. . . . . . . . . . . . . . . . . . . . . . . . 129
Appendix C: OML and its Usage in IronPython . . . . . . . . . . . . . . . . . . . 139
Acronyms

CFG Context-free grammar


CGT Combinatorial game theory
CNF Chomsky normal form
CNF Conjunctive normal form
CSP Constraint satisfaction problem
DFA Deterministic finite automaton
DNF Disjunctive normal form
EDSM Evidence driven state merging
GA Genetic algorithm
GI Grammatical inference
GNF Greibach normal form
ILP Integer linear programming
LP Linear programming
MDL Minimum description length
MILP Mixed integer linear programming
NFA Non-deterministic finite automaton
NLP Non-linear programming
NP Non-deterministic polynomial time
OGF Ordinary generating function
OML Optimization modeling language
PTA Prefix tree acceptor
RPNI Regular positive and negative inference
SAT Boolean satisfiability problem
TSP Traveling salesman problem
XML Extensible markup language

Chapter 1
Introduction

1.1 The Problem and Its Various Formulations

Let us start by presenting how many variants of the grammatical inference
problem we may be faced with. Informally, we are given a sequence of words and
the task is to find a rule that lies behind it. Different models and goals arise from
the answers to the following questions. Is the sequence finite or infinite? Does the
sequence contain only examples (positive words) or also counter-examples (negative
words)? Is the sequence of the form: all positive and negative words up to a certain
length n? What is meant by the rule: are we satisfied with a regular acceptor, a context-
free grammar, a context-sensitive grammar, or some other tool? Among all the rules that
match the input, should the obtained one be of a minimum size?

1.1.1 Mathematical Versus Computer Science Perspectives

The main division of GI models comes from the size of a sequence. When it is infinite,
we deal with mathematical identification in the limit. The setting of this model is that
of on-line, incremental learning. After each new example, the learner (the algorithm)
must return some hypothesis (an automaton or a CFG). Identification is achieved
when the learner returns a correct answer and does not change its decision afterwards.
With respect to this model the following results have been achieved: (a) if we are
given examples and counter-examples of the language to be identified (learning from
informant), and each individual word is sure of appearing, then at some point the
inductive machine will return the correct hypothesis; (b) if we are given only the
examples of the target (learning from text), then identification is impossible for any
super-finite class of languages, i.e., a class containing all finite languages and at least
one infinite language. In this book, however, we only consider the situation when
the input is finite, which can be called a computer science perspective. We are going
to describe algorithms, some of which are based on examples only, while the others are based


[Fig. 1.1 A DFA accepting aa*bb*cc*, with states S, A, B, C]

on both examples and counter-examples. Sometimes we will demand the smallest
possible form of an output, but every so often we will be satisfied with an output that
is just consistent with the input. Occasionally, an algorithm gives the collection of
results that gradually represent the degree of generalization of the input.

1.1.2 Different Kinds of Output

The next point that should be made is how important it is to pinpoint the kind of
target. Consider the set of examples: {abc, aabbcc, aaabbbccc}. If a solution is
being sought in the class of regular languages, then one possible guess is presented in
Fig. 1.1. This automaton matches every word starting with one or more as, followed
by one or more bs, and followed by one or more cs. If a solution is being sought
in the class of context-free languages, then one possible answer is the following
grammar:

S → ABC      A → a | aAB
B → b        C → c | CC

It is clearly seen that the language accepted by this CFG is {a^m b^m c^n : m, n ≥ 1}.
Finally, if a solution is being sought in the class of context-sensitive languages, then
an even more precise conjecture can be made:

S → aBc      aA → aa      bA → Ab
B → AbBc     Bc → bc

Now, the language accepted by this grammar is {a^m b^m c^m : m ≥ 1}. It is worth
emphasizing that, of the three acceptors given above, the third cannot be described
as a CFG or a DFA, and the second cannot be described as a DFA (or an NFA).
In this book we will only consider the classes of regular and context-
free languages. The class of context-sensitive languages is very rarely an object of
investigations in the GI literature. Such identification is thought to be a very hard
task. Moreover, the decision problem that asks whether a certain word belongs to
the language of a given context-sensitive grammar is PSPACE-complete.
It should be noted that in the absence of counter-examples, we are at risk of
over-generalization. An illustration that suggests itself is the regular expression (a +
b + c)∗ , which represents all words over the alphabet {a, b, c}, as a guess for
the examples from the previous paragraph. In the present book, two approaches
to this problem have been proposed. First, a collection of different hypotheses
is suggested, and it is up to the user to select the most promising one. Second, the
minimum description length (MDL) principle is applied, in which not only the size
of an output matters but also the amount of information that is needed to encode
every example.
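The risk of over-generalization can be observed directly in code. The following sketch (using Python's re module; the variable names are ours) compares the tighter guess a+b+c+ with the all-accepting guess (a + b + c)* on the examples above:

```python
import re

examples = ["abc", "aabbcc", "aaabbbccc"]

# The guess corresponding to the regular-language conjecture of Sect. 1.1.2.
specific = re.compile(r"a+b+c+")
# The over-general guess (a + b + c)* matches every word over {a, b, c}.
general = re.compile(r"[abc]*")

# Both hypotheses are consistent with the examples alone ...
assert all(specific.fullmatch(w) for w in examples)
assert all(general.fullmatch(w) for w in examples)
# ... so without counter-examples nothing rules the general guess out:
assert general.fullmatch("cba")       # accepted, although intuitively wrong
assert not specific.fullmatch("cba")  # the tighter guess rejects it
```

A single counter-example such as cba would immediately eliminate the over-general hypothesis.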

1.1.3 Representing Languages

Many definitions specific to particular methods are put in relevant sections. Herein,
we give definitions that will help to understand the precise formulation of our GI
problem and its complexity. Naturally, we skip conventional definitions and notation
from set theory, mathematical logic, and discrete structures that are covered in
undergraduate courses.

Definition 1.1 Σ will be a finite nonempty set, the alphabet. A word (or sometimes
string) is a finite sequence of symbols chosen from an alphabet. For a word w, we
denote by |w| the length of w. The empty word λ is the word with zero occurrences of
symbols. Sometimes, to be coherent with external notation, we write epsilon instead
of the empty word. Let x and y be words. Then xy denotes the catenation of x and
y, that is, the word formed by making a copy of x and following it by a copy of y.
We denote, as usual, by Σ* the set of all words over Σ and by Σ+ the set Σ* − {λ}.
A word w is called a prefix (resp. a suffix) of a word u if there is a word x such that
u = wx (resp. u = xw). The prefix or suffix is proper if x ≠ λ. Let X, Y ⊂ Σ*.
The catenation (or product) of X and Y is the set XY = {xy | x ∈ X, y ∈ Y}. In
particular, we define


X^0 = {λ},   X^{n+1} = X^n X  (n ≥ 0),   X^{≤n} = ⋃_{i=0}^{n} X^i.     (1.1)
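The set operations of Definition 1.1 and Eq. (1.1) can be sketched directly in Python over sets of strings (the function names are ours, not part of the book's routines):

```python
def catenation(X, Y):
    """X·Y = {xy : x in X, y in Y}."""
    return {x + y for x in X for y in Y}

def power(X, n):
    """X^n, with X^0 = {""} (the empty word plays the role of lambda)."""
    result = {""}
    for _ in range(n):
        result = catenation(result, X)
    return result

def up_to(X, n):
    """X^{<=n}: the union of X^i for i = 0..n."""
    return set().union(*(power(X, i) for i in range(n + 1)))

assert catenation({"a"}, {"b", "c"}) == {"ab", "ac"}
assert power({"a", "b"}, 2) == {"aa", "ab", "ba", "bb"}
assert up_to({"a"}, 3) == {"", "a", "aa", "aaa"}
```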

Definition 1.2 Given two words u = u1 u2 . . . un and v = v1 v2 . . . vm, u < v accord-
ing to lexicographic order if u = v1 . . . vn (n < m) or if ui < vi for the minimal i
where ui ≠ vi. The quasi-lexicographic order on Σ* over an ordered alphabet
Σ orders words firstly by length, so that the empty word comes first, and then within
words of fixed length n, by lexicographic order on Σ^n.
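In Python, the quasi-lexicographic order of Definition 1.2 corresponds to sorting with a (length, word) key; a minimal sketch (the helper name is ours):

```python
def quasi_lex_key(w):
    # Order first by length, then lexicographically within each length.
    return (len(w), w)

words = ["ba", "a", "", "ab", "b", "aaa"]
assert sorted(words, key=quasi_lex_key) == ["", "a", "b", "ab", "ba", "aaa"]
```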

To simplify the representations for languages (i.e., sets of words), we define the
notion of regular expressions, finite-state automata, and context-free grammars over
alphabet Σ as follows.
Definition 1.3 The set of regular expressions (regexes) over Σ will be the set of
words R such that
1. ∅ ∈ R which represents the empty set.
2. Σ ⊆ R, each element a of the alphabet represents language {a}.

3. If rA and rB are regexes representing languages A and B, respectively, then
(rA + rB) ∈ R, (rA rB) ∈ R, (rA*) ∈ R, representing A ∪ B, AB, A*, respectively,
where the symbols (, ), +, * are not in Σ.
We will freely omit unnecessary parentheses from regexes, assuming that catena-
tion has higher priority than + and * has higher priority than catenation. If r ∈ R
represents language A, we will write L(r) = A.
The above definition of regular expressions has a minor drawback. When we
write, for instance, abc, it is unknown whether the word denotes itself or one of the
following regexes: ((ab)c), (a(bc)). In the present book, however, it will always be
clear from the context.

Definition 1.4 A non-deterministic finite automaton (NFA) is defined by a quintuple
(Q, Σ, δ, s, F), where Q is the finite set of states, Σ is the input alphabet, δ : Q ×
Σ → 2^Q is the transition function, s ∈ Q is the initial state, and F ⊆ Q is the set
of final states. When there is no transition at all from a given state on a given input
symbol, the proper value of δ is ∅, the empty set. We extend δ to words over Σ by
the following inductive definition:
1. δ(q, λ) = {q},
2. δ(q, ax) = ⋃_{r ∈ δ(q,a)} δ(r, x), where x ∈ Σ* and a ∈ Σ.
Having extended the transition function δ from the domain Q × Σ to the domain
Q × Σ*, we can formally define that a word x belongs to the language accepted by
automaton A, x ∈ L(A), if and only if δ(s, x) ∩ F ≠ ∅.
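The inductive extension of δ translates into a few lines of Python. In the sketch below the dictionary encoding of δ and the example automaton are our illustrative assumptions, not the book's routines:

```python
def delta_star(delta, q, x):
    """Extended transition function: the set of states reachable from q on word x."""
    states = {q}
    for a in x:
        states = set().union(*(delta.get((r, a), set()) for r in states))
    return states

def accepts(delta, s, F, x):
    # x is accepted iff delta*(s, x) ∩ F ≠ ∅.
    return bool(delta_star(delta, s, x) & F)

# A hypothetical NFA for words over {a, b} ending in "ab":
delta = {("q0", "a"): {"q0", "q1"},
         ("q0", "b"): {"q0"},
         ("q1", "b"): {"q2"}}
assert accepts(delta, "q0", {"q2"}, "aab")
assert not accepts(delta, "q0", {"q2"}, "aba")
```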

Definition 1.5 If in an NFA A, for every pair (q, a) ∈ Q × Σ, the transition function
satisfies |δ(q, a)| ≤ 1, then A is called a deterministic finite automaton (DFA).

The transition diagram of an NFA (as well as of a DFA) is an alternative way to
represent it. For A = (Q, Σ, δ, s, F), the transition diagram of A is a labeled
digraph (V, E) satisfying

V = Q,    E = {q −a→ p : p ∈ δ(q, a)},

where q −a→ p denotes an edge (q, p) with label a. Usually, final states are depicted
as double-circled nodes. A word x over Σ is accepted by A if there is a labeled path
from s to a state in F such that this path spells out the word x.

Definition 1.6 A context-free grammar (CFG) is defined by a quadruple G =


(V, Σ, P, S), where V is an alphabet of variables (or sometimes non-terminal sym-
bols), Σ is an alphabet of terminal symbols such that V ∩ Σ = ∅, P is a finite set of
production rules of the form A → α for A ∈ V and α ∈ (V ∪ Σ)∗ , and S is a special
non-terminal symbol called the start symbol. For the sake of simplicity, we will write
A → α1 | α2 | . . . | αk instead of A → α1, A → α2, . . . , A → αk. We call a word
x ∈ (V ∪ Σ)* a sentential form. Let u, v be two words in (V ∪ Σ)* and A ∈ V.
Then, we write u Av ⇒ uxv, if A → x is a rule in P. That is, we can substitute word
x for symbol A in a sentential form if A → x is a rule in P. We call this rewriting


a derivation. For any two sentential forms x and y, we write x ⇒∗ y, if there exists
a sequence x = x0 , x1 , . . . , xn = y of sentential forms such that xi ⇒ xi+1 for all
i = 0, 1, . . . , n − 1. The language L(G) generated by G is the set of all words over
Σ that are generated by G; that is, L(G) = {x ∈ Σ ∗ | S ⇒∗ x}.
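A brute-force generator of L(G) up to a length bound can illustrate Definition 1.6. This is only a sketch, not one of the book's routines; it assumes the grammar has no λ-productions and every symbol derives at least one terminal, so that sentential forms longer than the bound can be pruned:

```python
from collections import deque

def generate(rules, start, max_len):
    """All terminal words of length <= max_len derivable from `start`.
    `rules` maps each variable to a list of right-hand sides (tuples of symbols)."""
    words, seen = set(), set()
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        # Find the leftmost variable in the sentential form, if any.
        i = next((j for j, s in enumerate(form) if s in rules), None)
        if i is None:
            words.add("".join(form))
            continue
        for rhs in rules[form[i]]:
            new = form[:i] + rhs + form[i + 1:]
            # Sound pruning under the no-λ-productions assumption:
            # each symbol yields at least one terminal.
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return words

# The CFG for {a^m b^m c^n : m, n >= 1} from Sect. 1.1.2:
rules = {"S": [("A", "B", "C")],
         "A": [("a",), ("a", "A", "B")],
         "B": [("b",)],
         "C": [("c",), ("C", "C")]}
assert generate(rules, "S", 4) == {"abc", "abcc"}
```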
Definition 1.7 A sample S over Σ will be an ordered pair S = (S+ , S− ) where S+ ,
S− are finite subsets of Σ ∗ and S+ ∩ S− = ∅. S+ will be called the positive part of
S (examples), and S− the negative part of S (counter-examples). Let A be one of
the following acceptors: a regular expression, a DFA, an NFA, or a CFG. We call A
consistent (or compatible) with a sample S = (S+ , S− ) if and only if S+ ⊆ L(A)
and S− ∩ L(A) = ∅.
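The consistency condition of Definition 1.7 translates directly into Python. The sketch below (our own helper, with a regex membership predicate standing in for an arbitrary acceptor) checks S+ ⊆ L(A) and S− ∩ L(A) = ∅:

```python
import re

def consistent(accepts, S_plus, S_minus):
    """True iff the acceptor is consistent with the sample (S+, S-)."""
    return all(accepts(w) for w in S_plus) and not any(accepts(w) for w in S_minus)

# The regular guess from Sect. 1.1.2 as a membership predicate:
acceptor = re.compile(r"a+b+c+").fullmatch

assert consistent(acceptor, {"abc", "aabbcc"}, {"ab", "cba"})
# abcc belongs to L(a+b+c+), so a sample listing it as negative is inconsistent:
assert not consistent(acceptor, {"abc"}, {"abcc"})
```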

1.1.4 Complexity Issues

The following theorems are very important in view of the computational intractability
of GI in its greater part. These results will be given without proofs because of the
scope and nature of the present book. Readers who are interested in the details
may consult the works listed at the end of this chapter (Sect. 1.4).
Theorem 1.1 Let Σ be an alphabet, S be a sample over Σ, and k be a positive
integer. Determining whether there is a k-state DFA consistent with S is NP-complete.
It should be noted, however, that the problem can be solved in polynomial time if
S = Σ^{≤n} for some n. Another easy task is finding a DFA A such that L(A) = S+
and A has the minimum number of states.
Theorem 1.2 The decision version of the problem of converting a DFA to a minimal
NFA is PSPACE-complete.
Let us answer the question of whether NFA induction is a harder problem than
DFA induction. The search space for the automata induction problem can be assessed
by the number of automata with a fixed number of states. It has been shown that the
number of pairwise non-isomorphic minimal k-state DFAs over a c-letter alphabet is
of order k·2^{k−1}·k^{(c−1)k}, while the number of NFAs on k states over a c-letter alphabet
such that every state is reachable from the start state is of order 2^{ck²}. Thus, switching
from determinism to non-determinism increases the search space enormously. On the
other hand, for c, k ≥ 2, there are at least 2^{k−2} distinct languages L ⊆ Σ* such that:
(a) L can be accepted by an NFA with k states; and (b) the minimal DFA accepting L
has 2^k states. It is difficult to resist the conclusion that—despite its hardness—NFA
induction is extremely important and deserves exhaustive research.
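Plugging small values into the order terms above illustrates the gap. Note that the functions below evaluate the asymptotic order expressions, not exact automaton counts:

```python
def dfa_count(k, c):
    # Order of the number of pairwise non-isomorphic minimal k-state DFAs
    # over a c-letter alphabet: k * 2^(k-1) * k^((c-1)k).
    return k * 2 ** (k - 1) * k ** ((c - 1) * k)

def nfa_count(k, c):
    # Order of the number of NFAs on k states over a c-letter alphabet
    # with every state reachable from the start state: 2^(c*k^2).
    return 2 ** (c * k * k)

# Even for small sizes the NFA search space dwarfs the DFA search space:
assert nfa_count(5, 2) > dfa_count(5, 2)
assert nfa_count(5, 2) // dfa_count(5, 2) > 10 ** 9
```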
Theorem 1.3 Let Σ be an alphabet, S be a sample over Σ, and k be a positive
integer. Determining whether there is a regular expression r over Σ, that has k
or fewer occurrences of symbols from Σ, and such that r is consistent with S, is
NP-complete.
6 1 Introduction

Interestingly, the problem remains NP-complete even if r is required to be star-free (i.e., to contain no "∗" operations).
When moving up from the realm of regular languages to the realm of context-free
languages, we are faced with additional difficulties. The construction of a minimal
cover-grammar seems to be intractable, especially in view of the following facts: (a)
there is no polynomial-time algorithm for obtaining the smallest context-free gram-
mar that generates exactly one given word (unless P = NP); (b) context-free gram-
mar equivalence and even equivalence between a context-free grammar and a regular
expression are undecidable; (c) testing equivalence of context-free grammars gener-
ating finite sets needs exponential time; (d) the grammar can be exponentially smaller
than any word in the language.

1.1.5 Summary

Let Σ be an alphabet, and k, n be positive integers. We will denote by I (input)
a sample S = (S+, S−) over Σ for which either S+, S− ≠ ∅ (both examples and
counter-examples are given) or S+ ≠ ∅ and S− = ∅ (examples only).
We write O (output) for one of the following acceptors: a regex, a DFA, an NFA, or
a CFG. The notation |O| = k means that, respectively, a regex has k occurrences of
symbols from Σ, a DFA (or NFA) has k states, and a CFG has k variables.
From now on, every GI task considered in the book will be put into one of the
three formulations:
1. For a given I find a consistent O.
2. For given I and k > 0 find a consistent O such that |O| ≤ k.
3. For a given I find a consistent O such that |O| is minimal.
But not all combinations will be taken into account. For example, in the book there
is no algorithm for finding a minimum regex or for finding a CFG from examples only.
For technical or algorithmic reasons, we will sometimes demand that S does not include λ.
If we are given a huge I with positive and negative words along with an algorithm
A that works for positive words only,¹ then the following incremental procedure
can be applied. First, sort S+ in quasi-lexicographic order to (s1, s2, …, sn). Next,
set j = 2 (or a higher value), infer A(I = {s1, …, sj−1}), and check whether
L(A(I)) ∩ S− = ∅. If no counter-example is accepted, find the smallest k from the
indexes {j, j + 1, …, n} for which sk is not accepted by A(I), set j = k + 1, and
take the set I = I ∪ {sk} as the input for the next inference. Pursue this incremental
procedure until j = n + 1. Note that the proposed technique is a heuristic way of
reducing the running time of the inference process, with no guarantee of getting the
correct solution.
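A minimal sketch of this procedure follows; here `infer` and `accepts` are hypothetical stand-ins for a concrete GI algorithm working on positive words and its word-acceptance test.

```python
def incremental_infer(infer, accepts, S_plus, S_minus):
    """Heuristic incremental wrapper around a positive-words-only GI algorithm."""
    words = sorted(S_plus, key=lambda w: (len(w), w))  # quasi-lexicographic order
    I = set(words[:1])                                 # start with I = {s1}
    while True:
        A = infer(I)
        if any(accepts(A, w) for w in S_minus):
            return None            # a counter-example is accepted: give up
        missed = [w for w in words if not accepts(A, w)]
        if not missed:
            return A               # every example is already accepted
        I.add(missed[0])           # add the first rejected example, re-infer
```

With a trivial `infer` that just memorizes its input, the wrapper ends up feeding in every positive word one by one.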

¹ This technique can also be used for algorithms that work on S = (S+, S−), provided that their
computational complexity primarily depends on the size of S+.



1.2 Assessing Algorithms’ Performance

Suppose that we are given a sample S from a certain domain. How would we evaluate
a GI method, or compare several such methods? First, a proper measure should be
chosen. Naturally, it depends on the domain's characteristics. Sometimes precision
alone would be sufficient, but in other cases relying only on a single specific measure,
without calculating any general measure of the quality of binary classifications (like
the Matthews correlation coefficient), would be misleading.
After selecting the measure of error or quality, we have to choose between three
basic scenarios. (1) The target language is known; in the case of regular languages
we simply check the equivalence between minimal DFAs, while for context-free languages
we are forced to generate the first n words in quasi-lexicographic order from
the hypothesis and from the target, and then check the equality of the two sets. Random
sampling is also an option for verifying whether two grammars describe the same
language. When the target is unknown, we may: (2) randomly split S into two subsets,
the training and the test set (T&T), or (3) apply K-fold cross-validation (CV), and then
use a selected statistical test. Statisticians encourage us to choose T&T if |S| > 1000
and CV otherwise. In this section, McNemar's test for scenario (2) and the 5 × 2 CV
t test for scenario (3) are proposed.

1.2.1 Measuring Classifier Performance

Selecting one measure that will describe some phenomenon in a population is crucial.
In the context of an average salary, take for example the arithmetic mean and
the median of the sample: $55,000, $59,000, $68,000, $88,000, $89,000, and
$3,120,000. The mean is $579,833.33, but it seems that the median,² $78,000, is
better suited for describing our data. Such examples can be multiplied. When we
undertake an examination of binary classification efficiency for selected
real biological or medical data, misclassifying a positive object as
a negative one may cause more damage than incorrectly assigning an object to the group
of positives; take the ill as examples and the healthy as counter-examples for a good
illustration. Then recall (also called sensitivity, or the true positive rate) would be a better
suited measure than accuracy, as recall quantifies the avoidance of false negatives.
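The figures in the salary example can be checked directly (a quick sketch):

```python
salaries = [55000, 59000, 68000, 88000, 89000, 3120000]

mean = sum(salaries) / float(len(salaries))           # 579833.33...
s = sorted(salaries)
# even number of observations: the median is the mean of the two middle values
median = (s[len(s) // 2 - 1] + s[len(s) // 2]) / 2.0  # 78000.0
```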
By binary classification we mean mapping a word to one out of two classes by
means of inferred context-free grammars and finite-state automata. The acceptance
of a word by a grammar (resp. an automaton) means that the word is thought to
belong to the positive class. If a word, in turn, is not accepted by a grammar (resp. an
automaton), then it is thought to belong to the negative class. For binary classification
a variety of measures has been proposed. Most of them are based on the rates shown

2 The median is the number separating the higher half of a data sample, a population, or a probability

distribution, from the lower half. If there is an even number of observations, then there is no single
middle value; the median is then usually defined to be the mean of the two middle values.

Table 1.1 Confusion matrix for two classes

                                True class
Predicted class    Positive              Negative              Total
Positive           tp: true positive     fp: false positive    p′
Negative           fn: false negative    tn: true negative     n′
Total              p                     n                     N

in Table 1.1. There are four possible cases. For a positive object (an example), if
the prediction is also positive, this is a true positive; if the prediction is negative for
a positive object, this is a false negative. For a negative object, if the prediction is
also negative, we have a true negative, and we have a false positive if we predict
a negative object as positive. Different measures appropriate in particular settings
are given below:

• accuracy, ACC = (tp + tn)/N,
• error, ERR = 1 − ACC,
• balanced accuracy, BAR = (tp/p + tn/n)/2,
• balanced error, BER = 1 − BAR,
• precision, P = tp/p′,
• true positive rate (eqv. with recall, sensitivity), TPR = tp/p,
• specificity, SPC = tn/n,
• F-measure, F1 = 2 × P × TPR/(P + TPR),
• Matthews correlation coefficient, MCC = (tp × tn − fp × fn)/√(p × p′ × n × n′).

If the class distribution is not uniform among the classes in a sample (e.g. the set of
examples forms a minority class), the balanced accuracy should be applied so as to
avoid inflated measurement. It has been shown that BAR is equivalent to the AUC (the
area under the ROC curve) score in the case of binary (with 0/1 classes) classification
tasks. So BAR is considered the primary choice if a sample is highly imbalanced.
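All of the above measures can be computed from the four entries of the confusion matrix. The following sketch mirrors the definitions; the function name is ours.

```python
from math import sqrt

def binary_metrics(tp, fp, fn, tn):
    """All measures of Sect. 1.2.1 from the four confusion-matrix counts."""
    tp, fp, fn, tn = float(tp), float(fp), float(fn), float(tn)
    p, n = tp + fn, fp + tn        # column totals (true classes)
    pp, nn = tp + fp, fn + tn      # row totals (predicted classes): p', n'
    N = p + n
    acc = (tp + tn) / N
    bar = (tp / p + tn / n) / 2
    prec = tp / pp
    tpr = tp / p
    return {
        'ACC': acc, 'ERR': 1 - acc,
        'BAR': bar, 'BER': 1 - bar,
        'P': prec, 'TPR': tpr, 'SPC': tn / n,
        'F1': 2 * prec * tpr / (prec + tpr),
        'MCC': (tp * tn - fp * fn) / sqrt(p * pp * n * nn),
    }
```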

1.2.2 McNemar’s Test

Given a training set and a test set, we use two algorithms to infer two acceptors
(classifiers) on the training set, test them on the test set, and compute their errors.
The following natural numbers have to be determined:
• e1: the number of words misclassified by algorithm 1 but not by algorithm 2,
• e2: the number of words misclassified by algorithm 2 but not by algorithm 1.
Under the null hypothesis that the algorithms have the same error rate, we expect
e1 = e2. We have the chi-square statistic with one degree of freedom

    (|e1 − e2| − 1)² / (e1 + e2) ∼ χ²₁    (1.2)

and McNemar's test rejects the hypothesis that the two algorithms have the same
error rate at significance level α if this value is greater than χ²α,1. For α = 0.05,
χ²0.05,1 = 3.84.
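In code, the test reduces to a few lines (a sketch; 3.84 is the critical value χ²0.05,1 quoted above):

```python
def mcnemar_statistic(e1, e2):
    """Chi-square statistic of Eq. (1.2), with the continuity correction."""
    return (abs(e1 - e2) - 1) ** 2 / float(e1 + e2)

def mcnemar_rejects(e1, e2, chi2_crit=3.84):
    """True iff the null hypothesis is rejected at significance level 0.05."""
    return mcnemar_statistic(e1, e2) > chi2_crit
```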

1.2.3 5 × 2 Cross-Validated Paired t Test

This test uses training and test sets of equal size. We divide the dataset S randomly
into two parts, x1(1) and x1(2), which gives our first pair of training and test sets. Then
we swap the roles of the two halves and get the second pair: x1(2) for training and x1(1)
for testing. This is the first fold; xi(j) denotes the j-th half of the i-th fold. To get the
second fold, we shuffle S randomly and divide this new fold into two parts, x2(1) and x2(2).
We then swap these two halves to get another pair. We do this for three more folds.
Let pi(j) be the difference between the error rates (for the test set) of the two
classifiers (obtained from the training set) on fold j = 1, 2 of replication i =
1, …, 5. The average on replication i is p̄i = (pi(1) + pi(2))/2, and the estimated
variance is si² = (pi(1) − p̄i)² + (pi(2) − p̄i)². The null hypothesis states that the two
algorithms have the same error rate. We have the t statistic with five degrees of freedom

    p1(1) / √((s1² + s2² + ⋯ + s5²)/5) ∼ t5.    (1.3)

The 5 × 2 CV paired t test rejects the hypothesis that the two algorithms have the same
error rate at significance level α if this value is outside the interval (−tα/2,5, tα/2,5).
If the significance level equals 0.05, then t0.025,5 = 2.57.
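A sketch of the computation: the input is the 5 × 2 table of error-rate differences pi(j), and 2.57 is the critical value t0.025,5 quoted above.

```python
from math import sqrt

def cv52_t(p):
    """t statistic of Eq. (1.3); p is a list of five pairs (p_i(1), p_i(2))."""
    s2 = []
    for p1, p2 in p:
        m = (p1 + p2) / 2.0                       # average on replication i
        s2.append((p1 - m) ** 2 + (p2 - m) ** 2)  # estimated variance s_i^2
    return p[0][0] / sqrt(sum(s2) / 5.0)

def cv52_rejects(p, t_crit=2.57):
    """True iff the null hypothesis is rejected at significance level 0.05."""
    return abs(cv52_t(p)) > t_crit
```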

1.3 Exemplary Applications

As a concept, grammatical inference is a broad field that covers a wide range of
possible applications. We can find scientific research and practical implementations
in such fields as:
• Natural language processing: building syntactic parsers, language modeling, morphology,
phonology, etc. In particular, one of its founding goals is modeling language
acquisition.
• Bioinformatics and biological sequence analysis: automatically modeling RNA
and protein families of sequences.
10 1 Introduction

• Structural pattern recognition is a field in which grammatical inference has been
active since 1974.
• Program engineering and software testing: verification, testing, specification
learning.
• Music: classification, help in creating, data recovery.
• Others: malware detection, document classification, wrapper induction (XML),
navigation path analysis, and robotic planning.
Some bibliographical references associated with these topics have been reported
in Sect. 1.4. Further in this section, a selected CGT (combinatorial game theory)
problem as well as a problem from the domain of bioinformatics are solved as
illustrations of the usage of GI methods.

1.3.1 Peg Solitaire

The following problem has been formulated in one of the CGT books:

Find all words that can be reduced to one peg in one-dimensional Peg Solitaire.
(A move is for a peg to jump over an adjacent peg into an empty adjacent space,
and remove the jumped-over peg: for instance, 1101 → 0011 → 0100, where 1
represents a peg and 0 an empty space.) Examples of words that can be reduced
to one peg are 1, 11, 1101, 110101, 1(10)^k1. Georg Gunther, Bert Hartnell and
Richard Nowakowski found that for an n × 1 board with one empty space, n
must be even and the space must be next but one to the end. If the board is
cyclic, the condition is simply n even.
Let us solve this problem using a GI method. Firstly, it should be noticed that leading
and trailing 0s are unimportant, so we will only consider words over {0, 1} that
start and end with 1s. Secondly, for every word of length n ≥ 3, jumping outside, i.e.,
11dd···1 → 100dd···1, will always put us in a position that cannot be reduced
to one peg. Thus, among all moves that can be made from a word w (|w| ≥ 3),
we do not need to take into account those moves which increase the size of a word
(one-dimensional board).
Under the hypothesis that the solution being sought, say H, is a regular language,
the idea for solving the problem can be stated in three steps: (1) for some n generate
H ∩ {0, 1}≤n; (2) using the k-tails method (from Sect. 3.2), find a sequence of automata;
(3) see if any one of them passes a verification test.
Step 1
To the previously given remarks we can add another one: H is a subset of L(1(1 +
01 + 001)∗). In order to determine all words from H up to a certain length, dynamic
programming has been applied, using the fact that for any word with m 1s, a jump
leads to a word with m − 1 1s and the length of the word does not increase (for words
consisting of three or more digits). The algorithm based on dynamic programming
will work if as input we give I = L(1(1 + 01 + 001)∗) ∩ {0, 1}≤n sorted by the
number of 1s in ascending order. We can easily find out that |I| = 1104 for n = 12
and |I| = 6872 for n = 15. This leads us to the following algorithm.
from FAdo.fa import *
from FAdo.reex import *

def moves(w):
    result = set()
    n = len(w)
    if w[:3] == '110':
        result.add('1' + w[3:])
    for i in xrange(2, n-2):
        if w[i-2:i+1] == '011':
            result.add(w[:i-2] + '100' + w[i+1:])
        if w[i:i+3] == '110':
            result.add(w[:i] + '001' + w[i+3:])
    if w[-3:] == '011':
        result.add(w[:-3] + '1')
    return result

def pegnum(w):
    c = 0
    for i in xrange(len(w)):
        if w[i] == '1':
            c += 1
    return c

def generateExamples(n):
    """Generates all peg words of length <= n
    Input: n in {12, 15}
    Output: the set of examples"""
    rexp = str2regexp("1(1 + 01 + 001)*")
    raut = rexp.toNFA()
    g = EnumNFA(raut)
    numWords = {12: 1104, 15: 6872}[n]
    g.enum(numWords)
    words = sorted(g.Words, \
        cmp = lambda x, y: cmp(pegnum(x), pegnum(y)))
    S_plus = {'1', '11'}
    for i in xrange(4, numWords):
        if moves(words[i]) & S_plus:
            S_plus.add(words[i])
    return S_plus

Step 2
The second step is just the invocation of the k-tails algorithm. Because the algorithm
outputs a sequence of hypotheses, its invocation has been put into the for-loop
structure, as we can see in the listing in Step 3.
Step 3
The target language is unknown, which is why we have to propose a test for the probable
correctness of the obtained automata. To this end, we generate two sets of words, namely
a positive test set (Test_pos) and a negative test set (Test_neg). The former contains
all words from the set H ∩ {0, 1}≤15, the latter contains the remaining words over
{0, 1} up to length 15. An automaton is supposed to be correct if it accepts all
words from the positive test set and accepts no word from the negative test set.

Fig. 1.2 A DFA representing H

def allWords(n):
    """Generates all words over {0, 1} up to length n
    Input: an integer n
    Output: all w in (0 + 1)* such that 1 <= |w| <= n"""
    rexp = str2regexp("(0 + 1)(0 + 1)*")
    raut = rexp.toNFA()
    g = EnumNFA(raut)
    g.enum(2**(n+1) - 2)
    return set(g.Words)

Train_pos = generateExamples(12)
Test_pos = generateExamples(15)
Test_neg = allWords(15) - Test_pos

for A in synthesize(Train_pos):
    if all(A.evalWordP(w) for w in Test_pos) \
            and not any(A.evalWordP(w) for w in Test_neg):
        Amin = A.toDFA().minimalHopcroft()
        print Amin.Initial, Amin.Final, Amin.delta
        break

The resulting automaton is depicted in Fig. 1.2.

1.3.2 Classification of Proteins

One of the problems studied in bioinformatics is the classification of amyloidogenic
hexapeptides. Amyloids are proteins capable of forming fibrils instead of the functional
structure of a protein, and are responsible for a group of diseases called amyloidosis,
such as Alzheimer's, Huntington's disease, and type II diabetes. Furthermore,
it is believed that short segments of proteins, like hexapeptides consisting of 6-residue
fragments, can be responsible for amyloidogenic properties. Since it is not possible
to experimentally test all such sequences, several computational tools for predicting
amyloid chains have emerged, inter alia, based on physico-chemical properties or
using a machine learning approach.
Suppose that we are given two sets, training and test, of hexapeptides:
Train_pos = \
['STQIIE', 'STVIIL', 'SDVIIE', 'STVIFE', 'STVIIS', 'STVFIE', \
 'STVIIN', 'WIVIFF', 'YLNWYQ', 'SFQIYA', 'SFFFIQ', 'STFIIE', \
 'GTFFIN', 'ETVIIE', 'SEVIIE', 'YTVIIE', 'STVIIV', 'SMVLFS', \
 'STVIYE', 'VILLIS', 'SQFYIT', 'SVVIIE', 'STVIII', 'HLVYIM', \
 'IEMIFV', 'FYLLYY', 'FESNFN', 'TTVIIE', 'STVIIF', 'STVIIQ', \
 'IFDFIQ', 'RQVLIF', 'ITVIIE', 'KIVKWD', 'LTVIIE', 'WVFWIG', \
 'SLVIIE', 'STVTIE', 'STVIIE', 'GTFNII', 'VSFEIV', 'GEWTYD', \
 'KLLIYE', 'SGVIIE', 'STVNIE', 'GVNYFL', 'STLIIE', 'GTVLFM', \
 'AGVNYF', 'KVQIIN', 'GTVIIE', 'WTVIIE', 'STNIIE', 'AQFIIS', \
 'SSVIIE', 'KDWSFY', 'STVIIW', 'SMVIIE', 'ALEEYT', 'HYFNIF', \
 'SFLIFL', 'STVIIA', 'DCVNIT', 'NHVTLS', 'EGVLYV', 'VEALYL', \
 'LAVLFL', 'STSIIE', 'STEIIE', 'STVIIY', 'LYQLEN', 'SAVIIE', \
 'VQIVYK', 'SIVIIE', 'HGWLIM', 'STVYIE', 'QLENYC', 'MIENIQ']
Train_neg = \
['KTVIVE', 'FHPSDI', 'FSKDWS', 'STVITE', 'STVDIE', 'FMFFII', \
 'YLEIII', 'STVIDE', 'RMFNII', 'ETWFFG', 'NGKSNF', 'KECLIN', \
 'STVQIE', 'IQVYSR', 'AAELRN', 'EYLKIA', 'KSNFLN', 'DECFFF', \
 'STVPIE', 'YVSGFH', 'EALYLV', 'HIFIIM', 'RVNHVT', 'AEVLAL', \
 'PSDIEV', 'STVIPE', 'DILTYT', 'RETWFF', 'STVIVE', 'KTVIYE', \
 'KLLEIA', 'QPKIVK', 'EECLFL', 'QLQLNI', 'IQRTPK', 'YAELIV', \
 'KAFIIQ', 'GFFYTP', 'HPAENG', 'KTVIIT', 'AARRFF', 'STVIGE', \
 'LSFSKD', 'NIVLIM', 'RLVFID', 'STVSIE', 'LSQPKI', 'RGFFYT', \
 'YQLENY', 'QFNLQF', 'ECFFFE', 'SDLSFS', 'KVEHSD', 'STVMIE', \
 'QAQNQW', 'SSNNFG', 'TFWEIS', 'VTLSQP', 'STVIEE', 'TLKNYI', \
 'LRQIIE', 'STGIIE', 'YTFTIS', 'SLYQLE', 'DADLYL', 'SHLVEA', \
 'SRHPAE', 'KWDRDM', 'FFYTPK', 'STVIQE', 'GMFNIQ', 'HKALFW', \
 'LLWNNQ', 'GSHLVE', 'VTQEFW', 'NIQYQF', 'STMIIE', 'PTEKDE', \
 'TNELYM', 'LIAGFN', 'HAFLII', 'YYTEFT', 'EKNLYL', 'KTVLIE', \
 'FTPTEK', 'STPIIE', 'STVVIE', 'SGFHPS', 'LFGNID', 'SPVIIE', \
 'STVISE', 'EKDEYA', 'RVAFFE', 'FYTPKT', 'PKIQVY', 'DDSLFF', \
 'ERGFFY', 'PTVIIE', 'DIEVDL', 'STIIIE']
Test_pos = \
['FTVIIE', 'HQLIIM', 'ISFLIF', 'GTFFIT', 'YYQNYQ', 'HFVWIA', \
 'NTVIIE', 'SNVIIE', 'MLVLFV', 'YVEYIG', 'STVWIE', 'STVIIM', \
 'EYSNFS', 'SQVIIE', 'SYVIIE', 'FLVHSS', 'NQQNQY', 'QYFNQM', \
 'DTVIIE', 'VTSTFS', 'STVIIT', 'LIFLIV', 'SFVIIE', 'NYVWIV', \
 'NFGAIL', 'STVIID', 'VTVIIE', 'MTVIIE', 'STVLIE', 'LLYYTE', \
 'QTVIIE', 'KLFIIQ', 'ATVIIE', 'LVEALY', 'TYVEYI', 'RVFNIM', \
 'NQFIIS', 'STVEIE']
Test_neg = \
['STDIIE', 'LKNGER', 'KAILFL', 'NYFAIR', 'VKWDRD', 'KENIIF', \
 'WVENYP', 'WYFYIQ', 'VAQLNN', 'DLLKNG', 'HLVEAL', 'TAWYAE', \
 'STAIIE', 'STVGIE', 'ERIEKV', 'EVDLLK', 'STVIIP', 'AINKIQ', \
 'STTIIE', 'TYQIIR', 'MYFFIF', 'TEFTPT', 'NGERIE', 'AENGKS', \
 'ICSLYQ', 'YASEIE', 'VAWLKM', 'NLGPVL', 'RTPKIQ', 'EHSDLS', \
 'AEMEYL', 'NYNTYR', 'TAELIT', 'HTEIIE', 'AEKLFD', 'LAEAIG', \
 'STVIME', 'GERGFF', 'VYSRHP', 'YFQINN', 'SWVIIE', 'KGENFT', \
 'STVIWE', 'STYIIE', 'QTNLYG', 'HYQWNQ', 'IEKVEH', 'KMFFIQ', \
 'ILENIS', 'FFWRFM', 'STVINE', 'STVAIE', 'FLKYFT', 'FGELFE', \
 'STVILE', 'WSFYLL', 'LMSLFG', 'FVNQHL', 'STVIAE', 'KTVIIE', \
 'MYWIIF']

The biologist's question is which method, the one based on decision trees or the one
based on grammatical inference, will achieve better accuracy.
As regards the decision tree approach, classification and regression trees (CART),
a non-parametric decision tree learning technique, has been chosen. For this purpose
we took advantage of scikit-learn's³ optimized version of the CART algorithm:

³ Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent
interface in Python. It is licensed under a permissive simplified BSD license. Its web page is http://
scikit-learn.org/stable/.

from sklearn import tree
from sklearn.externals.six import StringIO
from functools import partial

Sigma = set(list("NMLKIHWVTSRQYGFEDCAP"))
idx = dict(zip(list(Sigma), range(len(Sigma))))

def findACC(f):
    score = 0
    for w in Test_pos:
        if f(w):
            score += 1
    for w in Test_neg:
        if not f(w):
            score += 1
    if score == 0:
        return 0.0
    else:
        return float(score)/float(len(Test_pos) + len(Test_neg))

def acceptsBy(clf, w):
    return clf.predict([map(lambda c: idx[c], list(w))])[0] == 1

X = []
Y = []
for x in Train_pos:
    X.append(map(lambda c: idx[c], list(x)))
    Y.append(1)
for y in Train_neg:
    X.append(map(lambda c: idx[c], list(y)))
    Y.append(0)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
print findACC(partial(acceptsBy, clf))

As for the GI approach, the induction of a minimal NFA was chosen, since the
sample is not very large.
from functools import partial
from FAdo.common import DFAsymbolUnknown

def acceptsBy(aut, w):
    try:
        return aut.evalWordP(w)
    except DFAsymbolUnknown:
        return False

S_plus, S_minus = set(Train_pos), set(Train_neg)

k = 1
while True:
    print k,
    A = synthesize(S_plus, S_minus, k)
    if A:
        print findACC(partial(acceptsBy, A))
        print A.dotFormat()
        break
    k += 1

The resulting automaton is depicted in Fig. 1.3. The ACC scores for the obtained deci-
sion tree and NFA are equal to, respectively, 0.616 and 0.677. It is worth emphasizing,
Fig. 1.3 An NFA consistent with the (Train_pos, Train_neg) sample

though, that McNemar's test does not reject the hypothesis that the two algorithms
have the same error rate at significance level 0.05 (the computed statistic was equal
to 0.543, so the null hypothesis could be rejected only if α ≥ 0.461).

1.4 Bibliographical Background

Different mathematical models of grammatical inference come from various papers.
Language identification in the limit was defined and studied by Gold (1967). Two
other models also dominate the literature. In the query learning model by Angluin
(1988), the learning algorithm is based on two query instructions relevant to the
unknown grammar G: (a) membership: the input is a string w and the output is
'yes' if w is generated by G and 'no' otherwise; (b) equivalence: the input is a
grammar G′ and the output is 'yes' if G′ is equivalent to G and 'no' otherwise. If the
answer is 'no', a string w in the symmetric difference of the language L(G) and the
language L(G′) is returned. In the probably approximately correct learning model
by Valiant (1984), we assume that random samples are drawn independently from
examples and counterexamples. The goal is to minimize the probability of learning
an incorrect grammar.
The complexity results devoted to GI from combinatorial perspective were dis-
covered by Gold (1978), for DFAs, and Angluin (1976), for regular expressions. The
hardness of NFA minimization was studied by Meyer and Stockmeyer (1972) and
Jiang and Ravikumar (1993). The information on the number of k-state DFAs and
NFAs is taken from Domaratzki et al. (2002). Some negative results about context-
free languages, which also are mentioned in Sect. 1.1.4, come from papers: Charikar
et al. (2005), Hunt et al. (1976), and books: Hopcroft et al. (2001), Higuera (2010).
An algorithm for finding the minimal DFA consistent with a sample S = Σ ≤n was
constructed by Trakhtenbrot and Barzdin (1973).
Beside regexes, DFAs, NFAs, and CFGs, there are other string-rewriting tools that
can be applied to the grammatical inference domain. Good books for such alternatives

are Book and Otto (1993) and Rozenberg and Salomaa (1997). Some positive GI
results in this context can be found in Eyraud et al. (2007).
Parsing with context-free grammars is an easy task and is analyzed in many books,
for example in Grune and Jacobs (2008). However, context-sensitive membership
problem is PSPACE-complete, which was shown by Kuroda (1964). In fact, the
problem remains hard even for deterministic context-sensitive grammars.
A semi-incremental method described at the end of Sect. 1.1 was introduced by
Dupont (1994). Imada and Nakamura (2009) also applied the similar approach of
the learning process with a SAT solver.
Statistical tests given in this chapter were compiled on Alpaydin (2010). Another
valuable book, especially when we need to compare two (or more) classifiers on
multiple domains, was written by Japkowicz and Shah (2011).
A list of practical applications of grammatical inference can be found in many
works; the reader can refer to Bunke and Sanfelieu (1990), de la Higuera (2005),
de la Higuera (2010), and Heinz et al. (2015) as good starting points on this topic. The
first exemplary application (peg solitaire) is stated as the 48th unsolved problem
in combinatorial games (Nowakowski 1996). This problem was solved by Moore
and Eppstein (2003), and we have verified that our automaton is equivalent to the
regular expression given by them. The second exemplary application is a hypothetical
problem, though the data are real and come from Maurer-Stroh et al. (2010).

References

Alpaydin E (2010) Introduction to machine learning, 2nd edn. The MIT Press
Angluin D (1976) An application of the theory of computational complexity to the study of inductive
inference. PhD thesis, University of California
Angluin D (1988) Queries and concept learning. Mach Learn 2(4):319–342
Book RV, Otto F (1993) String-rewriting systems. Springer, Text and Monographs in Computer
Science
Bunke H, Sanfelieu A (eds) (1990) Grammatical inference. World Scientific, pp 237–290
Charikar M, Lehman E, Liu D, Panigrahy R, Prabhakaran M, Sahai A, Shelat A (2005) The smallest
grammar problem. IEEE Trans Inf Theory 51(7):2554–2576
de la Higuera C (2005) A bibliographical study of grammatical inference. Pattern Recogn
38(9):1332–1348
de la Higuera C (2010) Grammatical inference: learning automata and grammars. Cambridge Uni-
versity Press, New York, NY, USA
Domaratzki M, Kisman D, Shallit J (2002) On the number of distinct languages accepted by finite
automata with n states. J Autom Lang Comb 7:469–486
Dupont P (1994) Regular grammatical inference from positive and negative samples by genetic
search: the GIG method. In: Proceedings of 2nd international colloquium on grammatical infer-
ence, ICGI ’94, Lecture notes in artificial intelligence, vol 862. Springer, pp 236–245
Eyraud R, de la Higuera C, Janodet J (2007) Lars: a learning algorithm for rewriting systems. Mach
Learn 66(1):7–31
Gold EM (1967) Language identification in the limit. Inf Control 10:447–474
Gold EM (1978) Complexity of automaton identification from given data. Inf Control 37:302–320
Grune D, Jacobs CJ (2008) Parsing techniques: a practical guide, 2nd edn. Springer

Heinz J, de la Higuera C, van Zaanen M (2015) Grammatical inference for computational linguistics.
Synthesis lectures on human language technologies. Morgan & Claypool Publishers
Hopcroft JE, Motwani R, Ullman JD (2001) Introduction to automata theory, languages, and com-
putation, 2nd edn. Addison-Wesley
Hunt HB III, Rosenkrantz DJ, Szymanski TG (1976) On the equivalence, containment, and covering
problems for the regular and context-free languages. J Comput Syst Sci 12:222–268
Imada K, Nakamura K (2009) Learning context free grammars by using SAT solvers. In: Proceed-
ings of the 2009 international conference on machine learning and applications, IEEE computer
society, pp 267–272
Japkowicz N, Shah M (2011) Evaluating learning algorithms: a classification perspective. Cam-
bridge University Press
Jiang T, Ravikumar B (1993) Minimal NFA problems are hard. SIAM J Comput 22:1117–1141
Kuroda S (1964) Classes of languages and linear-bounded automata. Inf Control 7(2):207–223
Maurer-Stroh S, Debulpaep M, Kuemmerer N, Lopez de la Paz M, Martins IC, Reumers J, Morris
KL, Copland A, Serpell L, Serrano L et al (2010) Exploring the sequence determinants of amyloid
structure using position-specific scoring matrices. Nat Methods 7(3):237–242
Meyer AR, Stockmeyer LJ (1972) The equivalence problem for regular expressions with squaring
requires exponential space. In: Proceedings of the 13th annual symposium on switching and
automata theory, pp 125–129
Moore C, Eppstein D (2003) One-dimensional peg solitaire, and duotaire. In: More games of no
chance. Cambridge University Press, pp 341–350
Nowakowski RJ (ed) (1996) Games of no chance. Cambridge University Press
Rozenberg G, Salomaa A (eds) (1997) Handbook of formal languages, vol 3. Beyond words.
Springer
Trakhtenbrot B, Barzdin Y (1973) Finite automata: behavior and synthesis. North-Holland Publish-
ing Company
Valiant LG (1984) A theory of the learnable. Commun ACM 27:1134–1142
Chapter 2
State Merging Algorithms

2.1 Preliminaries

Before we start analyzing how the state merging algorithms work, some basic functions
on automata as well as functions on sets of words have to be defined. We
assume that the routines given below are available throughout the whole book. Please
refer to Appendixes A, B, and C in order to familiarize yourself with the Python programming
language, its packages relevant to automata, grammars, and regexes, and some
combinatorial optimization tools. Please notice also that we follow the docstring convention.
A docstring is a string literal that occurs as the first statement in a function
(module, class, or method definition). Such string literals act as documentation.
from FAdo.fa import *

def alphabet(S):
    """Finds all letters in S
    Input: a set of strings: S
    Output: the alphabet of S"""
    result = set()
    for s in S:
        for a in s:
            result.add(a)
    return result

def prefixes(S):
    """Finds all prefixes in S
    Input: a set of strings: S
    Output: the set of all prefixes of S"""
    result = set()
    for s in S:
        for i in xrange(len(s) + 1):
            result.add(s[:i])
    return result

def suffixes(S):
    """Finds all suffixes in S
    Input: a set of strings: S
    Output: the set of all suffixes of S"""
    result = set()
    for s in S:
        for i in xrange(len(s) + 1):
            result.add(s[i:])
    return result
© Springer International Publishing AG 2017
W. Wieczorek, Grammatical Inference, Studies in Computational Intelligence 673,
DOI 10.1007/978-3-319-46801-3_2
Exploring the Variety of Random
Documents with Different Content
als ze den [4]Ouë strammig zagen sjokken zonder dat
er iets degelijks uit z’n handen poerde.

Kees lei paadjes om, tusschen gerooide bollen, en


klein helpertje wiedde onder de tuinboonen en erwten.
Overal graaiden bronzige geweldige handenparen
gekromd in arbeidsstuip.

Dirk liep met schoffel van de aal- en


kruisbessenboompjes waar ie rondgeschoffeld had,
naar kapucijnders, sla en kool. En ook bollen
wachtten, om op stelling gebracht en gepeld te
worden. Dat zou ie den Ouë toch eens ànsmeeren.

Ouë Gerrit was overstuur van ’t kwaadaardig


grommen en bitsen der kerels. Nou had Dirk ’m weer
gepakt. Hij zou nou maar wàt doen.—

Tegen zes uur ’s avonds stond ie op z’n rotduffe


schuur, bollen op stelling te smijten. Door ’t open luik
woei windstroom in. Uit zware zakken, waarin ’t
rooigoed samenbroeide, wurmde hij ’t over in
stortbakken, die Piet telkens op hoogste tree van
schuurtrap, voor ’m neerschoof. Tusschen de duistere
laag-stoffige stelling, vervuild en doorzaaid van
klonterige spinraggen, grabbelden Gerrits handen op ’t
hout geplank, rommelde ie de bollen met zwarte
wortelenaanhangsels, gelijk. Overal in donkerig duffe
hoeken, drukte ie de soorten plat, met z’n hoofd
gekneld onder laag schuurdak. Even verduisterd,
bogen zijn ooren uit jukken en donkere zijbalken van
stelling. Te blazen en te zweeten stond ie.—Snikheet
was ’t geweest, en veel had ie al gedaan, om de
kerels wat vrediger te stemmen. Toch voelde ie, dat
die loodzware zakkensjouw ’m den rug brak,
àfploeterde, dat ie ’r bij hijgde. Dat kon zoo niet meer,
kòn zoo niet meer.

Sweating, half-stifled, he stood in the dark dirty shed, wedged into the narrow grimy passages. Dust of sandy earth struck at his throat, dry and sourish. Again and again he knelt sorting the kinds, sucking the hot sand-damp into his lungs at every gasp of fatigue, when he had to rummage deep down in corners where, elbowing and twisting his body, he had to empty his bins. Only a faint ghostly light still glimmered round his face and hands. Time after time Piet came lugging a sack up the shed ladder, bellows-blowing with heaving chest, so tired and sweated through that the grease dripped in fat drops from his temples and bronzed forehead, down neck and throat. His shirt, torn open at the neck so that his bronzed hairy chest showed rough and bare, sucked soaking wet against his body. He panted as if a stroke were coming on him.

"Hey, old man… where are you?…" his voice came dull and broken.

"Here… yes… hoho! I'm coming," old Gerrit shouted back, muffled, from a dark corner of the racking, toneless and shaky.

"Three more sacks… and then… it's done!…" panted Piet on the ladder, jerking the load from his shoulders with a thud against the floor, so that sand-clouds hazed before the light-hatch. He stood still, his bronzed, grease-dripping sweat-head just peering above the ladder, staring at the light-hatch. Coolly he drank in the fresh stream of wind that blew to meet him.

"Now look… just look at that… that double-nose!… you throw too hard!… hoho! that one's broken his nose clean off!…"

"What of it!… it'll come right enough at the rent… do it yourself and you'll feel what a load is!… Now, old man, three more… then it's done!… I've still got lettuce to stick till dark.…"

"Fine… now send me up some more Imperators!… some La Reines!… some Yellow Princes.…"

"And… some ru… rub… rubrum maximas!" stuttered Piet, stepping down the ladder with his face turned to the treads, his red sweat-head slowly vanishing into the twilit depth of the stair-hole.

"Hey…" old Gerrit shouted down through the wind-hatch to the garden, his silver head in the frame of light, his locks fluttering and romping about his ears, "those are still far too white… not fit to deliver like that… they've had no sun!…"

"Well then, I'll let 'em sit," Piet screeched back, head tilted up towards the draught-corner, eyes squeezed shut against the sun that lanced light into them, "then I'll go stick some more lettuce… Kees has still got some turnip-tops… if the weather shoots up like this again, you'll have the whole blessed lot ripe at once." Piet had walked off down a bean-path, and Gerrit was still shouting after him:

"But that's not fair… you've got to mind the front as well as the back… hoho! hey!"

Kees squatted at the ditch-side on a rotten little plank, busy washing some very late turnip-tops. Bent forward, head low, shoulders in his blue smock among the grass-green that shone velvet-soft, he sloshed bunches of turnip-tops in broad sweeps through the clear, gently flowing water, raked the dirty roots clean with a steel comb in sharp tugs, as if combing out a tangled wig; rinsing away sand and earth-filth again and again under plashing curls of water-spirals. Green and yellowish-clean, the turnip-tops brightened in his hands. The tied bunches, rinsed and combed, Kees laid carefully beside him in a brown-painted trough in which he stacked his greens, dripping, silver-splashed green that sparkled cool-fresh and juicy with pearls of water. In the midst of green and the sweet colour-sway of reed and grass, flowery gold-yellow touches of bird's-foot trefoil and bedstraw, his body stayed bent till dark, his smock-blue darkening in the shadow-golden fall of evening, round the brook-stillness.

Dirk was still weeding, staring at the young purslane that came creeping reddish out of the ground.

On the growers' plots the toil ran late. At sun-sink, carts of vegetables rattled and jolted to the harbour, where there was an uproar of shabby screeching children's voices, furious barking of dogs, of overdriven, thrashed work-beasts, and the shouting of overdriven men, in mournful hoarseness of sound, screeching right through the bustle of boats and loading.

Every day, every hour, the French-bean leaf grew greener, and heavier curled the pea-vines, blossomed tender-white, as if a strewing of snowy butterflies had been poured out over them, between bronze-grey staking of rods and earth-green. The broad beans, in tender-hazy silver-down of leaf, bowed and plumed under the working hands of the weeders, who cleaned them of filth. The weeds lay rotting in heaps on the little paths, in the ditches.

"That's turning into stingy Job's work," Dirk grumbled as he passed Kees, "we'll damn well have to take on a man… Now I've got to get off my beans… still hoe the tops… of that tulip corner."

"Isn't it died down enough yet, then?"

"Yes… but it must be done… confound it… tomorrow the hay has to be stood up, onto the cocks, all at once… lordy!… you've no time to draw breath… may I choke!…"

A few days later a hotter driving began, a more nervous grumbling. From this weeding corner the fellows pulled, still crooked and bruised from kneeling, to that hoeing patch over there. Not a word! not a breath! They cursed that in winter they had nothing to eat, and in summer no time to eat it; no time, rush! rush! Old Gerrit drove along with them, egged them on, his sons and the two casual hands. One of them, a Zeekijker, had much to suffer when the boss was away. He was walking out with a Wiereland girl. That made them drunk with venom and jealousy.

"What's that snorer, that stingy Job, want here?" they shouted round him. On Saturday night there was fighting with him on the harbour; he got shoves, a hiding; there was a growl of threatening voices that, if he didn't keep off a Wiereland girl, more than any Seekaiker was rightly tolerated, he'd smart for it all his life.

"Would you tolerate a Seekaiker… marmot! or stay a bachelor-fellow, eh?… you twister, what are you doing here?"

So it raged round him on the plots, but he went on quietly weeding, bent-bodied, or planting out, sure every evening of his furtive little courtship through the dark lanes with his dear Wiereland girl.

More restlessly Gerrit drove them on. His brother already had strawberries at market, he as yet barely one scant little basket. He cursed and raged in silence, sometimes squinted slyly at his own rhubarb, which shot up broad-leaved and fresh-stalked above Uncle Hassel's.

"Men, tomorrow it's off to the polder! huhu! the hay must be stood up… next week there's no time at all… if rain comes we're stuck fast!… aren't we, mates?"

"Cluck… cluck… cluck?… cluck-cluck-cluck!… mother's laid another egg!" Piet jeered.

"Well yes! whether you like it or not… makes no odds!… it must, it must," growled the old man.

Everywhere among the growers was wild haste, sweating and panting of drudge and haul, without looking up. Everything stood in bloom at once, everything had to be done in one rush. So it went from mid-June to August, three months of toil that burned the blood under their nails, while the heat-sun seethed and scorched on their bodies. That was the great haul of strawberry, peas and beans, out of which the money for rent, land, and winter idleness had to come.

The old man glowed with anxious fever over how the harvest would turn out.… If things stood as late as now, that cursed fair would fall again right in the thick of the bean-time; you'd have the fellows drunk and broken on the wheel with carousing… and tomorrow he'd have to pitch in himself, Dirk was going out hawking… hoho! another fine little day for him!


II.

Sweltering heat scorched through the Wiereland polder-meadow when, that afternoon, Dirk, Kees and the little helper stepped out to their hay-dike. Endless, in a wide ring of steaming gold light, young tillage lay on one side of the road, veiled in damp haze towards the horizon, broken by patches of garden-ground. And far, endless, on the other side, still meadowland, with the jet-gleam of a lone cow, the tall bladed grass in rusty flaming of wild sorrel. Taut-blue, athrill with summer-burn, the sky spanned wide as the land, bowled over the polder. Straight through the fields, behind a dike-like rise, ran the glance-bright rippling of canal water, beside a dust-path that burned white in sunny, hot sandiness.

Masts there, with ships' hulls sunk away behind dike-green, shoved along as if over the far grass-plain, between sparkle and sway of reed. White and gold-bronze-burning rust-sails with cobweb rigging moved through the meadows of green-gold haze, strange-coloured, in fairy-tale caprice. At every bend of horizon round, in an endless ring, trembled hazy heat, misty gold-violet veils, tenderly nearing the brighter spheres of sun-burn that scorched the fields in bluish vapour. And sobbing, glow-heavy, the light trumpeted above the grass-blades, shivered above young, magic-tender grey-blue oat and rye fields, flamed and speared on the young golden mustard-fields, in light-dusting glow of blossom. Fierce as white fire, fever-glowing and scorching round through the air in a shudder of hot-burning rays and flames; like a ball of sheer raging white glow, the sun seared in the blue, with its ochring rings, its flamy violet, not to be stared against; light that burned workers' eyes, as if hot lime fell into their pupils. Everywhere round, in the polder, the light raced and fevered, set the little roofs of farm-cots aflame, a shrill fierce laughing red, till they trembled and glowed like fire-planes in sky-blue, the heaven-fire itself aquiver with heat-haze.

The polder road, with its hot-sandy cobbles and clinkers in front, lay blistered like a strip of white fire in blink-taut scorch. Heat steamed up out of the little stones. Slowly, loose sand-dust drifted after a lone farm-wagon, slowly, in sluggish heaviness, powdering down again over grass-green and reed. Nowhere a cooling shadow. Here and there, in the narrow ring of trees round a farmstead, only some short bluish silhouettes on hot-steaming grass-ground, half sucked away again, amid the fire-ring of gold-white heat-light.

Further on, where the fellows stepped, lay stretches of meadow sparked through with buttercup, a high gold-yellow spatter of little lights between grass-green and sorrel. Whole far corners melted into bronzy silvering of quaking-grass that waved and swayed softly. Light incensed silverly out of it, further on suddenly flamed through by a sorrel-sea, bronze-gold, over which a red sheen shivered when a wind-stream blew by, and a whispering murmur-wave came flowing through the whole polder, a dream-rustle going out of the growing of blades.

On high dikes, behind distant forts that rose green in the flat stillness, little groups of mowers figured, high in the blue, trembling heat-air, one behind another, in one swing, one rhythmic broad-fanning circle of arms, bodies lightly bowed, gold-flooded with light. And nothing else to be seen of them, far off, but the cadence of their arm-swing and the fine bend of the legs, as if hewn out in light against the endlessness of glitter-green and sky-blue.

Everywhere, on every side, under the polder-sky, wide as sea-azure, blissful and shining, the mowers figured in splendour of gesture. And nearer, on the flat land, their blink-scythes silver-flamed like leaping, round-slung lightnings. Now and then the sharp rasp of the blade sounded through the grass nodding down, their scythes swinging out in one rhythm-sweep, as if a high measure had entered their labour, rhythm into their hands, tempo and song into their sparkling tools.

When they called something to one another from afar, their voice-sound blew lightly away over the meadow, singing in the sun-drunken golden polder-space.

Again and again, far off, one out of the mowing-group raised his scythe high; it stood still then, sparkle-glowing like a silver flame, and suddenly, when the whetters keened along the steel, it trembled and sun-lightened, struck off sheet-flashes; singing peals wandered away through the blue air-sea.

On the sand-ground, sweltering, with flint pebbles in the middle that stippled glowing like little balls of white fire, crunched and ground Dirk's heavy clog-tramp, and sounded the dull step of Kees, who walked on the dusty side. Behind the fort fire-line lay their dike, not yet in sight round the bend of the road.

Hallowed, the stillness-music of the meadows murmured round their trudging bodies, and in scorching blaze the sun-oven cooked its heat-breath down on their faces, legs, arms and backs. They could hardly go on. Back and head dripped with sweat, and still the sun scorched on over the everywhere-seething polder, till the blood burned them under the skin.

At last, after two hours of heavy, weary going, knees bent, clog-clumping through stifling sand-drift, side by side, their dike rose into sight. Just before it, the young corn-blades swayed and gleamed in a shiver of life-joy, greenish grey, drenched in magic of light-dew; a flowing wave ran through the blades, cooling in soft whisper, soughing round the far stillness, growth-song of the earth, whisper-song and murmur, voice of the stillness, hallowing above the endless fruitful work of the farmers.

Close beside the dike where the Hassels were to set their hay in cocks stood five mowers, with one small one, sunk knee-deep in the tall, waving blond grass-sea. Evenly, in a felt measure, their bowed legs bent; their hissing cut-stroke keened and rasped through the blond-tall grass. Step by step they pressed forward, setting themselves in the gold-shining ring of their labour, one rhythm, the scythe shearing along the ground, side by side, as if they were slicing and shearing the earth itself. Before their breasts the shivering grass-blades, a sea of gold-bronze waving, flamed through with sorrel, snapping down at one keen cut, struck suddenly by flash and sparkle-stroke of sun-sickles. Little men, side by side, with just a scythe-stroke's room between the bodies, faces bronzed through, with shadow of small hats on bare bronzed necks. Little men before the grass-sea, with hot shirt-red of sleeve bulging out under smock, their fustian and black trousers half sunk in the blond grass, among silver quaking-grass and luxuriant finery of growth; or standing out more naked, full-length, on the bald shorn lighter green of mown meadow; so went their step, their steady gait, to meet the waving, breast-high. And full, through trembling heat-air and high-blue azure, rang scythe-song. Again and again, from more than one scythe, white lightning-light flash-streaked off through the meadow-endlessness; the mowers seemed flooded over and swallowed in boiling, down-scorching sun-fire, and the wave-flow of the grass-sea shuddered. But over and over, in ever other corners, the sea ebbed back under the sun-sickling stroke of the mowers, greening out close under their warm feet in a higher glow.

Behind the dike now, where the Hassels were going to work, on the far left polder-side, all round and round, mower ranked beside mower again. One little group in splendour of deep-red baize-burn of shirts, glowing slow-bleeding in the sun-gold, stood there, the left arm crooked round the grip, the right hand on the short handle-peg, the right leg just lifted, graceful and gentle, in light cadence and counter-swing of the left thigh, the arms far back each time in backward sweep, and rhythmically swinging forward again, flaming circles of scythe-stroke blinking round in wide-legged pull.

Blades toppled before breast and legs, in waving swath of gold, behind their burning heels. So, in sailing gesture of highest splendour, figural march of scythes, from that polder-corner too rang out the flashing blades, in circling sharpness, with sparking bow-lines of the steel, trembling lightnings snaking away from flickering scythe-point to scythe-point, to and fro; a rustling and shuffling ran like glowing hisses through the grass. There too the mowers broke the waving, turned the swell, the eternal, far-on-flowing grass-sea; swathed down the mown rings, stepped on, firmer, further and further and smaller in the scorching hot red of their shirt-baize, towards the huge horizon that hazed wind-blown with golden damps, steaming violet, flamed round now and again by the white-blue lightning of a lifted scythe; re-echoing the sweet-sounding clang-flourish of their whetstones. Splendid in golden waving, the mown swath kept ranking on behind them; fierce-hot and glossy-shimmering greened the low-shorn corners, from which a watch of starlings swarmed up, keen on upturned prey.

Dirk was the first to tear his smock loose at the neck, puffed-stifled and panting with heat. Greedily he reached for the drinking-can the little helper carried on his back. Hot, out of the scorching tin, he sucked up the lukewarm liquid. With relish Kees had walked a stretch into the little cool wind on the dike, which now and then breathed over the scorch-ground.

"Now we'd best rake it into rows, Dirk, and not begin where you're standing," called Kees.

Dirk thought that fine. That Kees, how that fellow could work. He licked them all into shape. Devil take him if he didn't do two Piets' and two Dirks' work on his own account. That he had to say, honest!

From under the mown hay Dirk pulled out two pitchforks. Kees had slowly stepped down the long high dike, which glowed baked-through with blast-furnace heat and blistered their feet; he began raking at the ditch-side, close by the reed, now dead-still, now suddenly nodding and swaying. Dirk and the little helper worked on the other dike-side in the hollow, beside a garden-patch of field peas and beans.

"Still wet, in heat like this, eh?…" shouted Dirk to Kees, flinging the hay high apart and slinging it up, clawing at the air with his white-shining wooden rake.

"That's so… we'll likely get no further today than turning it.…"

Low on Kees' side lay nothing but meadowland, on which figured only dark and light-swimming mowers, black-bronze, in smock-fierce red or dressed in white.

Their swinging circles ruled everything, between the soughing and shivering of blades, the blond-gold gleaming of stalks and reed, the deep endlessness of sky-blue all round, the giant bowl with its eternal stillness-murmur and play of light. All day long the Hassels' rakes kept churning in the brooding hay, in light up-slinging and out-rustling of the cut blades, which in the heat-scorch crackled ever drier under the tread of the men's feet. And full on their necks and heads, dripping and pearling with hot sweat, seared the glow-harpoons of the sun-ball, barbs with poison-points of boiling light stabbing into their bodies, till a mad glow seethed inside their veins. Sun-fire kept scorching down out of the deep azure, trembling hot blue, piking and torturing their skin, striking back in scorch from the ground onto their legs, hands, eyes, as though they worked in a flame-realm of cyclopes, in hell-punishment shut in everywhere between blast-furnaces that cratered lava and boiling light out of their scorching maws. As if there, behind the sun-fire-ball, a demon's maw bellows-blew the sun-burn up to a higher hell-white glow; as if the golden earth were melting away in heaven-smithies where the sun licked fire down, its flame-coils, its flame-lightnings of red-white, yellow-white, green-white light. And stiller after every wind-lull, thirsty in swelter and summer-burn, the crops stood motionless, the burning, gold-fierce mustard-fields in pealing blaze-gold, the young corn-fields hazy blue-grey with heavy heat, the white plumy caraway, scenting faint through the sweetish hay-air, in brood-sphere of warmth-mists.

Next day at ten o'clock the Hassels stood on the dike again. Scattered on the height hayed a group of growers who had to get the stuff in quickly. The early summer heat had blown off. The polder-sky, tremendous in wideness, clouded heavily overcast. Again the men feared thunder and rains before any hay could be in. Today it had to go onto cocks and, if it could, be carted into the barn. A week later they would be smothered in strawberries and peas; the heavy hawking-rush would come. Then nothing more could be hayed. The farmers, calmer in their work, had time, wrought more quietly at their haymaking. Kees raked again, and Dirk with the little helper swung out-pluming hay-streamers into the air like blond cloudage. Heavy purple-grey wrack-sky threatened above their heads, in tremendous space. Far, on the hazy horizon, sunk in mist, dreamed little spires, vague violet and dark-red roofing of villages and small farms. Altar-stillness trembled wide over the dark-shadowing meadow-green. Now and then the sweet dear fluting of a lark swayed up, swinging high jubilation into the air, that blessed down out of the silver-grey heaven-sea in swelling fall of sound, floating away over the plain's eternity. At moments the sun yellowed out; shadow sped in flight of eagle-wings over the colour-mourning meadows.

Light-green, cropped bald, the mown stretches patched and angled between the dark, mellow juiciness of growing grass. A deep swoon of sweet hay-scent blew out over the polder-sea, down all the wind-rings; breathed tingling now like the soft-warm smell of fresh-baked bread, now like rose-aroma and honey-dew. Scent-tingle flowed over the workers' heads, wheeling round, blowing away and flowing back again into the endless heaven-bowl and grass-sea of polder. Great and mighty out of the far horizon came processioning up broken cloud-gates, under which censers and silver cloud-lamps stood burning hazy glows of vapour. Gold-dull yellowed the mustard-fields, and grey-tender caraway-finery patched between dark clover and pea-vines.

In the shifting light of the wrack-heavy heaven-space stood the hayers and mowers, now dark as silhouettes, now lighting up in streams of silvery gleam, spouting down out of blue sky-gaps under the throng of gates. Broad, their mighty gesture sailed through the air, in rhythmic march of scythe-circling, as if they forced the earth to bear. And more inward, the after-ring of scythe-song pealed through the cloister-holy meadow-stillness when whetstones keened along the steel. Ever darker, their figures now giant-grew against the threat-sky under the shifting play of light, and broader swayed their circles; they stood high, conjuring away in devout mowing-gesture whatever struck the fields with barrenness.

Deeper and holier, every minute more towards evening, the stillness stared above the piety of their toil-heads; stiller, and in shiver again, the hands hovered in secret rings of labour, their steps shuffled, the scythes lifted themselves, and the rakes churned up hay, in the immeasurable unison stillness-murmur of earth-sea and heaven-sea. And again on other days, in other light, the hayers stood under wild purple cloud-play. Nights of purple fears, monstrous clouds drew past above their heads. Suddenly out from behind them, trembling summer-light again for a moment, blue-blooming and jubilant. Straight through it again, the still chase of silvering cloud-hosts, drawing up from far horizon-burn, hunting on, in darkening threat, and ringing dark the giant polder-bowl. Nothing trembled in the stillness but the raspy, rustling keen-cut of the scythes, the mowers suddenly giant-huddled on dike-green, dark and far, in dusky lordship over the restless earth-life.

Then the rustling, crackly romping of up-clouding hay, dull-blond streaming through the air-anxious wrack; deeper and more solemn within it, the great work-silence of the drudgers, lonely in the silver-grey of still heaven-sea.

In absorption Kees turned his rake about; Dirk worked, and further on all the hayers streamed scents into the wind, with the restless line of white rakes through the air. Suddenly a stronger sun broke through, the clouds steamed away in silvery hems; the caraway-bloom coloured up fiercer, snowy; the mustard sparkled, high pealing gold-yellow; the plots steamed in deep green; the light flowed, romped and frolicked again over the working bodies. Scythes lightened again in the air, and a sheen of holiness flowed again over the tremendous work-deed of mowers and hayers.

Days on end the haying went on along the dikes, in sun-thunder or silver-grey, the fellows in silence between the eternal stillness of endless land and sky. Stronger frolicked and swayed the gleams, broken and misting away, drenched in hay-scent, in deep blissful smells of blossoming young summer-joy, living dizzy and airy before brood-heat and sun-fire came scorching down harder on the polder; blazing summer-burn that would smite workers' feet, eyes and bodies murderously down, in breathless gasping. Great-mighty of gesture the drudgers stood in the damp, full-steaming light-bowl, panting and streaming, struck dumb in their labour, in their still stare at the earth, tremendous in their feverish chase, which bronzed the basalt earnest-faces, under the glowing, out-steaming heat of ground and heaven-space. In the midst of it they stood, in splendour and highest up-blooming of seed and crop, amid the first outflowing of the earth's summer-fevering growth-lust, which wanted to pour out, had to pour out, her fruit-heavy pregnancy, burning birth, under a sky-bell of unending breadth. Bronze the work-heads stood taut, like faces hewn in dark basalt, dusted over and grey-red flecked with sweat, and high in swing streamed the heavy tools. In endless tragedy of air and light-play, in fever-taut jubilation and sun-burn, they stood and drudged on, blind to the splendour of life, the singing green, the hot blue fire of the scorching sky.

High above their heads flowed round scents of sweetest fragrance, scents as of fruits and roses, foaming apple-juice and grape-aroma, intoxicating incense of meadow-wine and earth-wine, flowed out of wondrous golden garden-closes of a sun-paradise. Corn-leaf and hay, caraway and clover, grass-swaths and flower-honey poured out sweetest streams of scent-voluptuousness, deepest core of scent-glorious earth-life, in impetuous bubbling and foam of saps. High above their heads the honey-dew hazed out of sparkle-golden bowls of light, dew dripping on their hands, their eyes, streaming down to where the blink-bright scythes keened, under the sweet grass. The earth-scents smoked in the light like visible breath of heliotropes. There, in that tremendous green world-bowl, the eternal stillness bore the work-deed of the mowers, hayers and planters, the hallowing of their labour; round them stood ranked all the shapes of trembling, warm, quivering earth-fruitfulness, glowed-through passions of earth-existence. And amid heaven-splendour and earth-voluptuousness, the deed of their godlike labour tipped over into down-torturing drudgery and heat; wild woe and weeping loneliness was poured out round their shining figures. There, in the farthest forsakenness of the hunting life, went their black sweat-toil, panted their breasts; the splendour of work-gesture, the swelling of their never-sinking strengths, became a torturing, breathless swoon beside the fever-hot luxury-bloom of ripening crop. Hot scent-life dusted, spattered, sapped, brimmed over round them. Gleam and colours burned triumph of sun-glow round their feet, and the all-coloured tilled earth, in birth and in labour-pangs, poured out in rank swoon her fruit-glorious riches. Their hands mowed and sowed, their eyes watched, their bodies toiled, sighed, swelled, cramped under the work-load. And stillness, eternal stillness went on trembling under the endless vault of air. And stiller yet within it, their body-toil fevered and scorched away, their life parched and dried up; they stood beaten down in bronzing of heads and glow of splendid bodies, turned to stone in one cramp of labour; drudging on till evening, silence-heavy, broken on the wheel in pain of every limb.

Next day again they stood on the light-thundering sun-fields, as if shut in between blast-furnaces and