Intelligent Search on XML Data
Volume Editors
Henk Blanken
University of Twente, Faculty of EEMCS
Department of Computer Science
P.O. Box 217, 7500 AE Enschede, The Netherlands
E-mail: [email protected]
Torsten Grabs
Hans-Jörg Schek
Institute of Information Systems, ETH Zürich
8092 Zürich, Switzerland
E-mail: {grabs;schek}@inf.ethz.ch
Ralf Schenkel
Gerhard Weikum
Saarland University, Department of Computer Science
P.O. Box 151150, 66041 Saarbrücken, Germany
E-mail: {schenkel;weikum}@cs.uni-sb.de
A catalog record for this book is available from the Library of Congress.
ISSN 0302-9743
ISBN 3-540-40768-5 Springer-Verlag Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are
liable for prosecution under the German Copyright Law.
Springer-Verlag Berlin Heidelberg New York
a member of BertelsmannSpringer Science+Business Media GmbH
http://www.springer.de
© Springer-Verlag Berlin Heidelberg 2003
Printed in Germany
Typesetting: Camera-ready by author, data conversion by PTP Berlin GmbH
Printed on acid-free paper SPIN: 10950050 06/3142 543210
Contents
Part I Applications
Part V Evaluation
References  295
Index  315
References
12. Anders Berglund, Scott Boag, Don Chamberlin, Mary F. Fernandez, Michael
Kay, Jonathan Robie, and Jérôme Siméon (Eds.). XML Path Language (XPath) 2.0.
W3C Working Draft, http://www.w3.org/TR/xpath20, 2003.
13. T. Anderson, A. Berre, M. Mallison, H. Porter, and B. Schneider. The Hyper-
Model Benchmark. In Proc. of the Int. Conf. on Extending Database Technol-
ogy, volume 416 of Lecture Notes in Computer Science, pages 317–331, 1990.
14. P. M. G. Apers, P. Atzeni, S. Ceri, S. Paraboschi, K. Ramamohanarao, and
R. T. Snodgrass, editors. VLDB 2001, Proceedings of 27th International Con-
ference on Very Large Data Bases, September 11-14, 2001, Roma, Italy. Mor-
gan Kaufmann, 2001.
15. E. Appelt and D. Israel. Introduction to Information Extraction Technology.
A Tutorial Prepared for IJCAI 1999, 1999. http://www.ai.mit.edu/people/
jimmylin/papers/intro-to-ie.pdf.
16. M. P. Atkinson, M. E. Orlowska, P. Valduriez, S. B. Zdonik, and M. L. Brodie,
editors. VLDB’99, Proceedings of 25th International Conference on Very Large
Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK. Morgan Kauf-
mann, 1999.
17. R. A. Baeza-Yates and G. Navarro. Integrating contents and structure in text
retrieval. SIGMOD Record, 25(1):67–79, 1996.
18. R. A. Baeza-Yates and G. Navarro. XQL and Proximal Nodes. Journal of the
American Society for Information Science and Technology, 53(6):504–514,
2002.
19. R. A. Baeza-Yates and B. Ribeiro-Neto, editors. Modern Information Retrieval.
Addison Wesley, 1999.
20. C. F. Baker, C. J. Fillmore, and J. B. Lowe. The Berkeley FrameNet project.
In Proceedings of the 36th Annual Meeting of the Association for Computa-
tional Linguistics and the 17th International Conference on Computational
Linguistics (COLING-ACL), August 10-14, 1998, Montreal, Quebec, Canada,
pages 86–90. ACL / Morgan Kaufmann Publishers, 1998.
21. Z. Bar-Yossef, Y. Kanza, Y. Kogan, W. Nutt, and Y. Sagiv. Querying se-
mantically tagged documents on the WWW. In R. Y. Pinter and S. Tsur,
editors, Next Generation Information Technologies and Systems, 4th Interna-
tional Workshop, NGITS’99, Zikhron-Yaakov, Israel, July 5-7, 1999 Proceed-
ings, volume 1649 of Lecture Notes in Computer Science, pages 2–19. Springer,
1999.
22. Y. Batterywala and S. Chakrabarti. Mining themes from bookmarks. In ACM
SIGKDD Workshop on Text Mining, 2000.
23. R. Baumgartner, S. Flesca, and G. Gottlob. Supervised wrapper generation
with Lixto. In Apers et al. [14], pages 715–716.
24. C. Beeri and Y. Tzaban. SAL: An algebra for semistructured data and XML. In
Cluet and Milo [76], pages 37–42.
25. N. J. Belkin, C. Cool, J. Koenemann, K. B. Ng, and S. Park. Using relevance
feedback and ranking in interactive searching. In Proceedings of the 4th Text Re-
trieval Conference (TREC-4), pages 181–210, Gaithersburg, Maryland, USA,
Nov. 1995. National Institute of Standards and Technology (NIST).
26. N. J. Belkin, A. D. Narasimhalu, and P. Willet, editors. Proceedings of the 20th
Annual International ACM SIGIR Conference on Research and Development
in Information Retrieval, New York, 1997. ACM.
46. J.-M. Bremer and M. Gertz. XQuery/IR: Integrating XML document and data
retrieval. In M. F. Fernandez and Y. Papakonstantinou, editors, Proceedings
of the 5th International Workshop on the Web and Databases (WebDB), pages
1–6, June 2002.
47. S. Bressan, G. Dobbie, Z. Lacroix, M. Lee, Y. Li, and U. Nambiar. XOO7:
Applying the OO7 Benchmark to XML Query Processing Tools. In Paques et al.
[237], pages 167–174.
48. M. W. Bright, A. R. Hurson, and S. H. Pakzad. Automated resolution of
semantic heterogeneity in multidatabases. ACM Transactions on Database
Systems, 19(2):212–253, 1994.
49. S. Brin and L. Page. The anatomy of a large-scale hypertextual search engine.
Computer Networks and ISDN Systems, 30(1–7):107–117, Apr. 1998.
50. I. Bruder, A. Düsterhöft, M. Becker, J. Bedersdorfer, and G. Neumann. GET-
ESS: Constructing a linguistic search index for an Internet search engine. In
Proceedings of the 5th Conference of Applications of Natural Language to Data
Bases (NLDB), pages 227–238, 2000.
51. N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: Optimal XML
pattern matching. In Franklin et al. [120], pages 310–321.
52. A. Budanitsky and G. Hirst. Semantic distance in WordNet: An experimental,
application-oriented evaluation of five measures. In Proceedings of the Work-
shop on WordNet and Other Lexical Resources, Second meeting of the North
American Chapter of the Association for Computational Linguistics, 2001.
53. P. Buneman, S. Davidson, G. Hillebrand, and D. Suciu. A Query Language
and Optimization Techniques for Unstructured Data. In H. V. Jagadish and
I. S. Mumick, editors, Proceedings of the 1996 ACM SIGMOD International
Conference on Management of Data, Montreal, Quebec, Canada, June 4-6,
1996, pages 505–516. ACM Press, 1996.
54. P. Buneman, W. Fan, J. Siméon, and S. Weinstein. Constraints for semistruc-
tured data and XML. SIGMOD Record, 30(1):47–54, Mar. 2001.
55. P. Buneman, M. F. Fernandez, and D. Suciu. UnQL: A query language and
algebra for semistructured data based on structural recursion. VLDB Journal,
9(1):76–110, 2000.
56. P. Buneman, L. Libkin, D. Suciu, V. Tannen, and L. Wong. Comprehension
Syntax. In SIGMOD Record, 1994.
57. C. Burges. A tutorial on Support Vector Machines for pattern recognition.
Data Mining and Knowledge Discovery, 2(2), 1998.
58. S. Buxton and M. Rys. XQuery and XPath Full-Text Requirements. W3C
working draft 14 february 2003, World Wide Web Consortium, Feb. 2003.
http://www.w3.org/TR/xmlquery-full-text-requirements/.
59. M. E. Califf. Relational Learning Techniques for Natural Language Extraction.
PhD thesis, University of Texas at Austin, Aug. 1998.
60. M. Carey, D. DeWitt, and J. Naughton. The OO7 Benchmark. In P. Buneman
and S. Jajodia, editors, Proceedings of the 1993 ACM SIGMOD International
Conference on Management of Data, Washington, D.C., May 26-28, 1993,
pages 12–21. ACM Press, 1993.
61. M. Carey, D. DeWitt, J. Naughton, M. Asgarian, P. Brown, J. Gehrke, and
D. Shah. The BUCKY Object-Relational Benchmark. In J. Peckham, edi-
tor, SIGMOD 1997, Proceedings ACM SIGMOD International Conference on
Management of Data, May 13-15, 1997, Tucson, Arizona, USA, pages 135–146.
ACM Press, 1997.
121. H.-P. Frei, S. Meienberg, and P. Schäuble. The perils of interpreting recall and
precision values. In N. Fuhr, editor, Proceedings of the GI/GMD-Workshop
on Information Retrieval, volume 289 of Informatik-Fachberichte, pages 1–10,
Darmstadt, Germany, June 1991. Springer.
122. D. Freitag. Machine Learning for Information Extraction in Informal Domains.
PhD thesis, Carnegie Mellon University, 1998.
123. N. Fuhr. Probabilistic models in information retrieval. The Computer Journal,
35(3):243–255, 1992.
124. N. Fuhr. Towards Data Abstraction in Networked Information Retrieval Sys-
tems. Information Processing and Management, 35(2):101–119, 1999.
125. N. Fuhr, N. Gövert, G. Kazai, and M. Lalmas. INEX: Initiative for the evalu-
ation of XML retrieval. In Proceedings ACM SIGIR 2002 Workshop on XML
and Information Retrieval, Tampere, Finland, Aug. 2002. ACM.
126. N. Fuhr, N. Gövert, G. Kazai, and M. Lalmas, editors. INitiative for the
Evaluation of XML Retrieval (INEX). Proceedings of the First INEX Work-
shop. Dagstuhl, Germany, December 8–11, 2002, ERCIM Workshop Proceed-
ings, Sophia Antipolis, France, Mar. 2003. ERCIM.
127. N. Fuhr, N. Gövert, and T. Rölleke. DOLORES: A system for logic-based
retrieval of multimedia objects. In Proceedings of the 21st Annual Interna-
tional ACM SIGIR Conference on Research and Development in Information
Retrieval, Melbourne, Australia, pages 257–265. ACM Press, Aug. 1998.
128. N. Fuhr and K. Großjohann. XIRQL: A query language for information re-
trieval in XML documents. In Croft et al. [84], pages 172–180.
129. N. Fuhr and T. Rölleke. A Probabilistic Relational Algebra for the Integration
of Information Retrieval and Database Systems. Transactions on Information
Systems, 14(1):32–66, 1997.
130. Generalised architecture for languages, encyclopaedias and nomenclatures in
medicine. http://www.opengalen.org/about.html.
131. F. Gey, M. Hearst, and R. Tong, editors. SIGIR ’99: Proceedings of the 22nd
Annual International ACM SIGIR Conference on Research and Development
in Information Retrieval, August 15-19, 1999, Berkeley, CA, USA. ACM, 1999.
132. R. Goldman, J. McHugh, and J. Widom. From Semistructured Data to XML:
Migrating the Lore Data Model and Query Language. In Cluet and Milo [76],
pages 25–30.
133. R. Goldman and J. Widom. DataGuides: Enabling query formulation and opti-
mization in semistructured databases. In M. Jarke, M. J. Carey, K. R. Dittrich,
F. H. Lochovsky, P. Loucopoulos, and M. A. Jeusfeld, editors, VLDB’97, Pro-
ceedings of 23rd International Conference on Very Large Data Bases, August
25-29, 1997, Athens, Greece, pages 436–445. Morgan Kaufmann, 1997.
134. T. Grabs. Storage and Retrieval of XML Documents with a Cluster of Database
Systems. PhD thesis, Swiss Federal Institute of Technology (ETH) Zurich,
2003. Diss. ETH No. 15076.
135. T. Grabs, K. Böhm, and H.-J. Schek. PowerDB-IR - Scalable Information
Retrieval and Storage with a Cluster of Databases. Knowledge and Information
Systems. (to appear).
136. T. Grabs, K. Böhm, and H.-J. Schek. A parallel document engine built on top
of a cluster of databases – design, implementation, and experiences –. Technical
Report 340, Department of Computer Science, ETH Zurich, Apr. 2000.
137. T. Grabs, K. Böhm, and H.-J. Schek. PowerDB-IR – Information Retrieval on
Top of a Database Cluster. In Paques et al. [237], pages 411–418.
138. T. Grabs, K. Böhm, and H.-J. Schek. XMLTM: Efficient transaction man-
agement for XML documents. In C. Nicholas, D. Grossman, K. Kalpakis,
S. Qureshi, H. van Dissel, and L. Seligman, editors, Proceedings of the
11th International Conference on Information and Knowledge Management
(CIKM2002), November 4-9, 2002, McLean, VA, USA, pages 142–152. ACM
Press, 2002.
139. T. Grabs and H.-J. Schek. Generating vector spaces on-the-fly for flexible XML
retrieval. In XML and Information Retrieval Workshop - 25th Annual Interna-
tional ACM SIGIR Conference on Research and Development in Information
Retrieval, 2002.
140. T. Grabs and H.-J. Schek. Flexible Information Retrieval from XML with
PowerDB-XML. In Fuhr et al. [126], pages 35–40.
141. J. Gray. Database and Transaction Processing Performance Handbook. avail-
able at http://www.benchmarkresources.com/handbook/contents.asp, 1993.
142. R. Grishman. Information Extraction: Techniques and Challenges. In Infor-
mation Extraction: A Multidisciplinary Approach to an Emerging Information
Technology - International Summer School, volume 1299 of Lecture Notes in
Computer Science, pages 10–27. Springer, 1997.
143. D. A. Grossman, O. Frieder, D. O. Holmes, and D. C. Roberts. Integrating
structured data and text: A relational approach. Journal of the American
Society for Information Science (JASIS), 48(2):122–132, Feb. 1997.
144. P. Grosso, E. Maler, J. Marsh, and N. Walsh. XPointer framework. W3C
recommendation, 2003. http://www.w3.org/TR/xptr-framework/.
145. T. R. Gruber. Towards Principles for the Design of Ontologies Used for Know-
ledge Sharing. In N. Guarino and R. Poli, editors, Formal Ontology in Con-
ceptual Analysis and Knowledge Representation, pages 89–95, Deventer, The
Netherlands, 1993. Kluwer Academic Publishers.
146. T. Grust. Accelerating XPath location steps. In Franklin et al. [120], pages
109–120.
147. T. Grust, M. van Keulen, and J. Teubner. Staircase Join: Teach a Relational
DBMS to Watch its (Axis) Steps. In Proc. of the 29th Int’l Conference on Very
Large Data Bases (VLDB), Berlin, Germany, Sept. 2003.
148. N. Guarino. Formal ontology and information systems. In N. Guarino, edi-
tor, Proceedings of the 1st International Conference on Formal Ontologies in
Information Systems, FOIS’98, Trento, Italy, 6-8 June 1998, pages 3–15. IOS
Press, 1998.
149. N. Gupta, J. Haritsa, and M. Ramanath. Distributed query processing on the
Web. In Proceedings of the 16th International Conference on Data Engineering,
28 February - 3 March, 2000, San Diego, California, USA, page 84. IEEE
Computer Society, 2000.
150. A. Guttman. R-trees: A dynamic index structure for spatial searching. In
B. Yormark, editor, Proceedings of the 1984 ACM SIGMOD International Con-
ference on Management of Data, Boston, MA, pages 47–57. ACM Press, 1984.
151. A. Halevy et al. Crossing the structure chasm. In Proceedings of the First
Semiannual Conference on Innovative Data Systems Research (CIDR), 2003.
152. D. Harman. Relevance feedback revisited. In N. J. Belkin, P. Ingwersen,
and A. M. Pejtersen, editors, Proc. of the Int. ACM SIGIR Conf. on Research
and Development in Information Retrieval, pages 1–10, Copenhagen, Denmark,
June 1992. ACM.
202. Q. Li, P. Shilane, N. Noy, and M. Musen. Ontology acquisition from on-line
knowledge sources. In AMIA Annual Symposium, Los Angeles, CA, 2000.,
2000.
203. D. Lin. An information-theoretic definition of similarity. In J. W. Shavlik, edi-
tor, Proceedings of the Fifteenth International Conference on Machine Learning
(ICML 1998), Madison, Wisconsin, USA, July 24-27, 1998, pages 296–304.
Morgan Kaufmann, San Francisco, CA, 1998.
204. J. A. List and A. P. de Vries. CWI at INEX 2002. In Fuhr et al. [126].
205. C. Lundquist, D. A. Grossman, and O. Frieder. Improving relevance feedback
in the vector space model. In F. Golshani and K. Makki, editors, Proc. of
the Int. Conf. on Knowledge and Data Management, pages 16–23, Las Vegas,
Nevada, USA, Nov. 1997. ACM.
206. I. A. Macleod. A Query Language for Retrieving Information from Hierarchic
Text Structures. The Computer Journal, 34(3):254–264, 1991.
207. A. Maedche and S. Staab. Semi-automatic engineering of ontologies from text.
In Proceedings of the 12th International Conference on Software Engineering and Knowledge
Engineering, Chicago, USA. KSI, 2000.
208. A. Maedche and S. Staab. Learning ontologies for the semantic web. In
S. Decker, D. Fensel, A. P. Sheth, and S. Staab, editors, Proceedings of
the Second International Workshop on the Semantic Web - SemWeb’2001,
Hongkong, China, May 1, 2001, volume 40 of CEUR workshop proceedings,
2001. http://CEUR-WS.org/Vol-40/.
209. C. D. Manning and H. Schuetze. Foundations of Statistical Natural Language
Processing. The MIT Press, 1999.
210. I. Manolescu, D. Florescu, and D. Kossmann. Answering XML queries on het-
erogeneous data sources. In VLDB 2001, Proceedings of 27th International
Conference on Very Large Data Bases, September 11-14, 2001, Roma, Italy,
pages 241–250. Morgan Kaufmann, Sept. 2001.
211. M. L. McHale. A comparison of WordNet and Roget’s taxonomy for measuring
semantic similarity. In Proceedings of the Workshop on Content Visualization
and Intermedia Representations (CVIR’98), 1998.
212. J. McHugh and J. Widom. Query optimization for XML. In Atkinson et al.
[16], pages 315–326.
213. Medical subject headings. http://www.nlm.nih.gov/mesh/meshhome.html.
214. MedPICS certification and rating of trustful and assessed health information
on the Net. http://www.medcertain.org.
215. D. R. H. Miller, T. Leek, and R. M. Schwartz. A hidden Markov model infor-
mation retrieval system. In Gey et al. [131], pages 214–221.
216. G. A. Miller. WordNet: A lexical database for English. Communications of the
ACM, 38(11):39–41, Nov. 1995.
217. G. A. Miller and W. G. Charles. Contextual correlates of semantic similarity.
Language and Cognitive Processes, 6(1):1–28, 1991.
218. T. Milo and D. Suciu. Index structures for path expressions. In C. Beeri and
P. Buneman, editors, Database Theory - ICDT ’99, 7th International Con-
ference, Jerusalem, Israel, January 10-12, 1999, Proceedings, volume 1540 of
Lecture Notes in Computer Science, pages 277–295. Springer, 1999.
219. T. Mitchell. Machine Learning. McGraw Hill, 1996.
220. P. Mitra, G. Wiederhold, and M. L. Kersten. A graph-oriented model for
articulation of ontology interdependencies. In Zaniolo et al. [329], pages 86–
100.
239. M.-F. Plassard, editor. Functional Requirements for Bibliographic Records - Fi-
nal Report, volume 19 of UBCIM Publications New Series. K.G. Saur München,
1998. Available at http://www.ifla.org/VII/s13/frbr/frbr.htm.
240. Public medline. http://www.pubmed.org.
241. Y. Qiu and H.-P. Frei. Concept-based query expansion. In Proceedings of
SIGIR-93, 16th ACM International Conference on Research and Development
in Information Retrieval, pages 160–169, Pittsburgh, US, 1993.
242. Y. Qiu and H.-P. Frei. Improving the retrieval effectiveness by a similarity the-
saurus. Technical Report 225, Swiss Federal Institute of Technology, Zürich,
Switzerland, 1995.
243. L. Rabiner. A tutorial on hidden Markov models and selected applications
in speech recognition. In A. Waibel and K. Lee, editors, Readings in speech
recognition, pages 267–296. Morgan Kaufmann, 1990.
244. R. Rada, H. Mili, E. Bicknell, and M. Blettner. Development and applica-
tion of a metric on semantic nets. IEEE Transactions on Systems, Man, and
Cybernetics, 19(1):17–30, 1989.
245. V. V. Raghavan, P. Bollmann, and G. S. Jung. A critical investigation of recall
and precision as measures of retrieval system performance. ACM Transactions
on Office Information Systems, 7(3):205–229, 1989.
246. A. Rector. Conceptual knowledge: the core of medical information systems. In
Proceedings of Medical Informatics, pages 1420–1426, 1992.
247. P. Resnik. Using information content to evaluate semantic similarity in a
taxonomy. In Proceedings of the Fourteenth International Joint Conference
on Artificial Intelligence, IJCAI 95, Montréal, Québec, Canada, August 20-25
1995, volume 1, pages 448–453, 1995.
248. P. Resnik. Semantic similarity in a taxonomy: An information-based measure
and its application to problems of ambiguity in natural language. Journal of
Artificial Intelligence Research, 11:95–130, 1999.
249. R. Richardson, A. Smeaton, and J. Murphy. Using WordNet as a knowledge
base for measuring semantic similarity between words. In Proceedings of the
AICS Conference, 1994.
250. S. Robertson and K. Sparck-Jones. Relevance weighting of search terms. Jour-
nal of the American Society for Information Science, 27(3):129–146, 1976.
251. J. Rocchio Jr. Relevance Feedback in Information Retrieval, The SMART Re-
trieval System: Experiments in Automatic Document Processing, chapter 14,
pages 313–323. Prentice Hall, Englewood Cliffs, New Jersey, USA, 1971.
252. H. Rubenstein and J. B. Goodenough. Contextual correlates of synonymy.
Communications of the ACM, 8(10):627–633, 1965.
253. Y. Rui, T. Huang, and S. Mehrotra. Relevance feedback techniques in interac-
tive content-based image retrieval. In Proceedings of IS&T and SPIE Storage
and Retrieval of Image and Video Databases VI, pages 25–36, San Jose, Cali-
fornia, USA, Jan. 1998.
254. K. Runapongsa, J. M. Patel, H. V. Jagadish, and S. Al-Khalifa. The Michigan
benchmark: A micro-benchmark for XML query processing systems. Informal
Proceedings of EEXTT02, electronic version available at http://www.eecs.
umich.edu/db/mbench/, 2002.
255. S. Russell and P. Norvig. Artificial Intelligence - A Modern Approach. Prentice
Hall, 1995.
256. A. Sahuguet and F. Azavant. Web ecology: Recycling HTML pages as XML
documents using W4F. In Cluet and Milo [76].
257. A. Salminen and F. W. Tompa. PAT Expressions: an Algebra for Text Search.
Acta Linguistica Hungarica, 41(1-4):277–306, 1993.
258. G. Salton, editor. The SMART Retrieval System - Experiments in Automatic
Document Processing. Prentice Hall, Englewood Cliffs, New Jersey, 1971.
259. G. Salton and C. Buckley. Term-weighting approaches in automatic text re-
trieval. Information Processing & Management, 24(5):513–523, 1988.
260. G. Salton and C. Buckley. Improving retrieval performance by relevance feed-
back. Journal of the American Society for Information Science, 41(4):288–297,
1990.
261. G. Salton, E. A. Fox, and H. Wu. Extended Boolean Information Retrieval.
Communications of the ACM, 26(12):1022–1036, 1983.
262. G. Salton and M. J. McGill. Introduction to Modern Information Retrieval.
McGraw-Hill, first edition, 1983.
263. G. Salton, A. Wong, and C. S. Yang. A Vector Space Model for Automatic
Indexing. CACM, 18(11):613–620, 1975.
264. T. Saracevic. Evaluation of evaluation in information retrieval. In E. Fox,
P. Ingwersen, and R. Fidel, editors, Proceedings of the 18th Annual Interna-
tional ACM SIGIR Conference on Research and Development in Information
Retrieval, pages 138–146, New York, 1995. ACM. ISBN 0-89791-714-6.
265. S. Sarawagi. Automation in information extraction and data integration. In
S. Christodoulakis, D. L. Lee, and A. O. Mendelzon, editors, VLDB 2002,
Tutorial notes of the 28th International Conference on Very Large Data Bases,
August 20-23, Hong Kong, China, pages 1–28, 2002.
266. SAX (Simple API for XML). http://sax.sourceforge.net/.
267. P. Schäuble. Multimedia Information Retrieval, Content-based Information
Retrieval from Large Text and Audio Databases. Kluwer Academic Publishers,
Zurich, Switzerland, 1997.
268. T. Schlieder and M. Meuss. Result Ranking for Structured Queries against
XML Documents. In DELOS Workshop: Information Seeking, Searching and
Querying in Digital Libraries, Zurich, Switzerland, Dec. 2000. http://page.
inf.fu-berlin.de/~schlied/publications/delos2000.pdf.
269. A. Schmidt, M. Kersten, D. Florescu, M. Carey, I. Manolescu, and F. Waas.
The XML Store Benchmark Project, 2000. http://www.xml-benchmark.org.
270. A. Schmidt, M. Kersten, D. Florescu, M. Carey, I. Manolescu, and F. Waas.
XMark Queries, 2002. available at http://www.xml-benchmark.org/Assets/
queries.txt.
271. A. Schmidt, M. Kersten, M. Windhouwer, and F. Waas. Efficient Relational
Storage and Retrieval of XML Documents. In Suciu and Vossen [289], pages
47–52.
272. A. Schmidt, F. Waas, M. Kersten, M. Carey, I. Manolescu, and R. Busse.
XMark: A Benchmark for XML Data Management. In P. A. Bernstein, Y. E.
Ioannidis, R. Ramakrishnan, and D. Papadias, editors, VLDB 2002, Proceed-
ings of 28th International Conference on Very Large Data Bases, August 20-23,
Hong Kong, China, pages 974–985. Morgan Kaufmann, 2002.
273. A. Schmidt, F. Waas, M. Kersten, D. Florescu, I. Manolescu, M. Carey, and
R. Busse. The XML Benchmark Project. Technical Report INS-R0103, CWI,
Amsterdam, The Netherlands, April 2001.
274. H. Schöning. Tamino – a DBMS designed for XML. In Proceedings of the 17th
International Conference on Data Engineering, April 2-6, 2001, Heidelberg,
Germany, pages 149–154. IEEE Computer Society, 2001.
Editors
Gerhard Weikum
Universität des Saarlandes
Fachrichtung 6.2 Informatik
Postfach 151150
66041 Saarbrücken
Germany
[email protected]
Acknowledgements
In December 2002, a workshop on “Intelligent Search in XML Data” was held
at Schloss Dagstuhl (Germany). During this workshop the participants presented
their state-of-the-art work. This book documents those presentations.
The participants played a great part in writing and subsequently reviewing
chapters of the book. They had, without exception, a heavy workload. We
thank them for their productive and pleasant cooperation.
Several research projects described in this book have been (partially) supported
by the following agencies: the Deutsche Forschungsgemeinschaft (German Science
Foundation), the Netherlands Organisation for Scientific Research, the DELOS
Network of Excellence for Digital Libraries (http://delos-noe.org/), the IEEE
Computer Society (http://computer.org/), the Deutscher Akademischer
Austauschdienst (DAAD, http://www.daad.de/), and the British Council
(http://www.britishcouncil.org/) under the Academic Research Collaboration
(ARC) Programme.
1 Demand for Intelligent Search Tools in Medicine and Health Care

K.P. Pfeiffer, G. Göbel, and K. Leitner
1.1 Introduction
The high demand for medical knowledge poses a major challenge for information
technology: to offer user-friendly systems that help healthy citizens, patients
and health professionals find the right data, information and knowledge.
Medicine has a long history of structured or semi-structured documentation.
On the one hand, medical documentation of diagnoses has been performed using
ICD-10 (International Classification of Diseases, 10th revision [294]) or other
coding systems; on the other hand, scientific literature has been indexed using
keywords from MeSH (Medical Subject Headings [213]). Coding systems like ICD,
classifications and medical thesauri have been available for years. Scientifically
validated terminologies like SNOMED (Systematized Nomenclature of Medicine [291])
and standardised messaging formats like HL7 (Health Level 7 [155]) and DICOM
(Digital Imaging and Communications in Medicine [99]) have been facilitating
communication between computer systems and different modalities and have
achieved broad market acceptance within the healthcare industry. Medical queries
are among the most popular topics people search for in different databases and
knowledge sources. Yet, owing to the early development of medical domain
knowledge sources, most of the coding systems are available only in proprietary,
non-standardised structures or schemes.
Although few fields of domain knowledge have been penetrated as thoroughly by
thesauri, classifications and similar resources, it has taken a long time for XML
technologies to be accepted as a standard for meeting the challenges of medical
content management, data communication and medical knowledge representation.
In March 2003 the BMJ (British Medical Journal) Publishing Group launched
the first excerpt from BestTreatments, a website built for patients and their
doctors that looks at the effectiveness of treatments for chronic medical
conditions, officially based on “Clinical Evidence”, which is internationally
recognised as a gold standard for evidence-based information [226].
Table 1.1.

                                    Citizen, Patient               Medical Professional
General Medical Knowledge,          Health Information System,     Health Professional Information
Facts & Figures                     Patient Information System     Systems, EBM Sources
Personalised Medical Information    Personal Health Record         Electronic Health Care Record,
                                                                   Electronic Patient Record
From Table 1.1 it can easily be deduced that health professionals and patients
need the same evidence-based information, served up in parallel and drawn from
the same sources. Additionally, personalised medical and health information on
patients – allergies, medications, health plans and emergency contacts – should
be accessible online from any location via the internet, always with data safety
and security in mind. Sharing this information with doctors in case of emergency
and processing it with powerful health tools (e.g. an immunization planner,
health risk appraisal or personal health plan) are fascinating challenges for
scientists and industrial vendors.
Medical knowledge bases contain all the knowledge and experience to be in-
voked by a reasoning program to produce advice that supports a decision [301].
Generally, medical knowledge is retrievable from
• the medical literature (documented knowledge)
• experts in a specific domain (clinical experience).
Some authors distinguish between knowledge about terminology – conceptual
knowledge – and more general (medical) knowledge – general inferential
knowledge [246]. Neither the current medical literature nor the experience of
experts is usually processed in a comprehensive form that supports medical
decisions. One mechanism to transfer experience from one person to another is
the creation of knowledge bases, which can potentially provide health care
workers with access to large bodies of information. Knowledge, the next step of
complexity, must then be expressed in the terminology and semantics of the
knowledge base, and several methods must be designed to acquire knowledge from
experts. Van Bemmel [301] defines a medical knowledge base as a systematically
organised collection of medical knowledge that is accessible electronically and
interpretable by the computer [...]. Usually medical knowledge bases include a
lexicon (vocabulary or allowed terms) and specific relationships between the
terms.
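Purely as an illustration of this idea (the element names, identifiers and
relationship types below are hypothetical and not taken from any particular
knowledge base), such a lexicon entry with typed relationships between terms
could be serialized in XML along the following lines:

<!-- Hypothetical lexicon entry; names, identifiers and relationship types
     are illustrative only. -->
<Term id="T0001">
  <PreferredName>Myocardial infarction</PreferredName>
  <Synonym>Heart attack</Synonym>
  <Relationship type="is-a" target="T0047"/>
  <Relationship type="finding-site" target="T0102"/>
</Term>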
This definition does not specify a detailed formalism of how relationships
can express expert knowledge. Thus long term research projects like UMLS
1.2.1 UMLS
1.2.2 GALEN
and a prescription contains this insurance number, then the patient receives
drugs based on guidelines of that specific insurance.”
The following example from the clinical environment illustrates the potential
use of search agents to support physicians in routine patient care. The search
agent uses heterogeneous databases, automatically makes in-
ferences from the EHCR and advises the physician regarding specific medical
issues. This scenario demonstrates the use of a search agent during a con-
sultation with a woman experiencing typical symptoms of menopause who is
concerned about hormone replacement therapy (HRT), one option to success-
fully treat these symptoms.
Mary Jones is 49 years old. For the past 4 years she has been noticing
variations in the length of her monthly cycle and pattern of bleeding.
Moreover, she has been experiencing hot flushes, night sweats, vaginal
dryness, as well as joint pains. She suffers from unpredictable mood
swings. Mary has heard of HRT, but usually in the context of its asso-
ciated risk of breast cancer. On the other hand Anne, her neighbour,
has told her that she should start HRT, as it would greatly alleviate her
suffering, prevent her bones from fracturing in her old age and protect
her from cardiovascular disease. Moreover, Anne said that, according
to the latest research, HRT reduces the risk of Alzheimer’s disease.
Mary feels quite confused and decides to see Dr. Eleanor Trevor, her
local GP, about this issue. Dr. Trevor understands Mary’s concerns, as
HRT has been the subject of ongoing debate for many years. She knows
that the fear of breast cancer is one of the prime reasons for rejection
or discontinuation of HRT. And even though HRT had been promoted
for many years in relation to several health issues such as prevention
of osteoporosis and cardiovascular disease, Dr. Trevor is aware that
recent research suggests not to use HRT for prevention of osteoporosis
and that it may actually increase the risk of cardiovascular disease. She
knows that in addition there are several other organs affected by the
hormonal constituents used in HRT such as the endometrium, colon
and central nervous system. Moreover, it depends very strongly on the
individual person receiving HRT whether it is useful or may actually
be harmful. She wonders about Mary’s physical constitution (her age,
body mass index, parity, her age when Tim, her first child was born,
etc.) and risk factors (Mary is overweight and smokes). Dr. Trevor
lets her search agent support her in this issue. She is glad she has
this tool available because it is near-impossible to stay up to date with
the latest results in medical research. Before she had her agent, she
would have had to look for best evidence in databases such as CDSR
(Cochrane Database of Systematic Reviews [77]) or DARE (Database
of Abstracts of Reviews of Effects [88]), search biomedical databases
such as Medline or Embase [106], search the internet or even hand
search the literature. She was glad she had done a course in searching
for evidence, as she knew from a colleague who didn’t even try to treat
according to best practice, as he didn’t know how to find the evidence.
After finding the evidence herself she would have had to apply it to
the individual patient. She would have had to go through all the pa-
tient notes, call the hospital and other specialists for any additional
information needed, and the decisions would have been based mainly
on her own expertise and experience, weighing risks and benefits of
a particular treatment. This whole process became much more conve-
nient with her agent. Basically, the search agent performs all tasks of
information retrieval, integration with patient information, and knowl-
edge representation automatically, in a speedy, comprehensive, reliable
and safe manner. Dr. Trevor feels that it provides her with many ben-
efits such as saving her time, supporting her in her decisions, and
ultimately enabling her to offer better patient care. When she lets the
agent run over Mary’s particular case, it automatically searches for the
best evidence currently available in the field of HRT, retrieves Mary’s
online health record (a health record pulling information together from
all medical facilities Mary had been visiting), detects that Mary also
has high blood pressure and a positive family history of breast cancer,
which Dr. Trevor hadn’t been aware of, and independently determines
the overall risks (breast cancer, blood clots, stroke and coronary heart
disease) and benefits (fracture reduction and reduced risk of colorectal
cancer) HRT would have in Mary’s case. The agent presents its find-
ings to Dr. Trevor who is very satisfied with the feedback, comments
and helpful decision support. She tells Mary that firstly she should try
to alter her lifestyle – eat healthy, exercise regularly and quit smok-
ing. She also lets her know that there are several alternative therapies
around that may or may not be helpful in relieving menopausal symp-
toms but that in general, there is more research needed in that area.
She remarks that herbal therapies may have adverse side effects or ex-
hibit harmful interactions with other medications. She tells Mary that
HRT should be considered only a short-term option, as in the long run,
according to the best evidence currently available and in consideration
of Mary’s status the risks do outweigh the benefits.
Software Agents in medicine run without direct human control or con-
stant supervision to accomplish goals provided by medical experts. Agents
typically collect, filter and process information found in distributed heteroge-
neous data sources, sometimes with the assistance of other agents. It will be
a big challenge in the future to train these agents to find the appropriate and
very specific information for a patient with certain symptoms, diagnoses or
treatments.
1.4.4 Interactivity
Interactive multimedia can be used to provide information, to train, educate,
entertain, store collections of audiovisual material, as well as distribute multi-
media and allow for user input. The range of tools, including the well-known
PC-mouse or keyboard, will be extended by new interaction tools (e.g. haptic
or vocal tools). Nonlinear media structures will challenge the development of
powerful knowledge browsers controlled via voice interfaces.
1.4.6 Pro-activity
Electronic guides assist users during their work and try to detect and solve
problems beforehand. Additionally, users may receive notifications on updates
or new documents according to specified criteria. Already today there are
several rule-based systems that automatically analyse drug prescriptions to
avoid drug interactions, look for cheaper drugs, or produce automatic warnings
in the case of allergies.
1.10 Conclusions
documents available in free text, in order to achieve the precision and
granularity required for personal communication. In recent years a lot of work
has been done to extend structured documentation to replace free text and to
set up concepts for the representation of complex medical situations and
processes. This also applies to the representation of health-related knowledge
for patients and health professionals. The ever-increasing amount of scientific
medical literature available via the internet, as well as the countless web
pages with health-related information for patients, necessitates very special
tools to achieve specific and precise search results. One precondition for a
successful search is a certain degree of structure in the data, e.g. using
object-attribute-value triples and standardized terminologies. Standardized,
controlled medical terminologies like the ICD code, SNOMED or MeSH may
constitute a starting point for the identification of keywords in free text,
but they are not sufficient to represent the dynamic relationships between
different objects, making modifiers and semantic links necessary.
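A minimal, purely illustrative sketch of what such semi-structured documentation
could look like is given below; the element names, the modifier values and the
ICD-10 code shown are assumptions for the sake of the example, not taken from
any particular system.

<!-- Illustrative object-attribute-value style entry with modifiers and a semantic link. -->
<Diagnosis>
  <Concept codingSystem="ICD-10" code="I21.9">Acute myocardial infarction</Concept>
  <Modifier name="certainty">suspected</Modifier>
  <Modifier name="episode">first</Modifier>
  <Link relation="treated-by" href="#procedure-17"/>
</Diagnosis>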
There is also a long-standing tradition of standardized communication of data
in medicine, using e.g. HL7. Nevertheless, the next generation of health care
information systems must not only allow the exchange of data between
heterogeneous systems but also enable the representation of complex medical
content. Intelligent search engines, virtual agents and very specific data
analysis tools will process semi-structured data and will help make the latest,
quality-assured information available to health care professionals, financiers
and patients. Health information generators will search the web for individual,
person-centred information, using person-specific information from the EHCR,
and will design health promotion programs based on the latest evidence-based,
quality-assured knowledge. Structured routine data in a hospital or clinical
information system will be analysed for clinical studies looking at e.g.
long-term side effects, epidemiology, quality standards, cost effectiveness,
outcomes, etc.
The expected functionalities from intelligent search engines in medicine
can be summarized by the following points:
• Extraction of the most important facts from an electronic health care
record or personal health care record, which may be distributed across
various types of databases at different locations, using a master patient
index
• Expressing these facts as medical concepts and linking them using ontolo-
gies and specific medical grammar to represent the complex health status
of a patient in a semi-structured way
• Searching in quality-assured information systems for the latest informa-
tion, relevant specifically to the individual person
• Searching for the state-of-the-art treatment for a certain disease analysing
the outcomes of the latest clinical studies and using evidence-based
medicine databases
2 The Use of XML in a Video Digital Library

C. Gennaro, F. Rabitti, and P. Savino

2.1 Introduction
Video can be considered today a primary means of communication, due to its
rich informative content and its appeal. Indeed, the combination of audio and
video is an extremely important communication channel: it is considered that
approximately 50% of what is seen and heard simultaneously is retained. For all
these reasons, audio/video is particularly important in many different
application sectors, such as TV broadcasting, professional applications (e.g.
medicine, journalism, advertising, education), movie production, and historical
video archives. At the same time, most of the video material produced is
extremely difficult to access, due to several limitations: video documents are
extremely large, so archiving and transmission are expensive; and the content
of video documents, even if extremely rich, is difficult to extract in order to
support effective content-based retrieval.
The necessity to manage this type of information effectively and efficiently
will become more important with the forthcoming commercialization of
interactive digital television. In coming years, the use of digital technology
will promote a significant change in the television world, where the viewer
will move from a passive role to using the interactive facilities increasingly
offered by providers. Digital technology allows traditional audio-visual
content to be mixed with data, enabling the transmission of multimedia software
applications to be executed on a digital television or on an analog television
equipped with a powerful decoder or set-top box. These applications can be
synchronized with traditional content and provide interactivity to the user,
together with a return channel for communication with the provider. This will
be an excellent opportunity for television to become a privileged vehicle for
the development of a connected society.
In this chapter, we want to show how XML is becoming a key choice in the
management of audio/video information, which is of paramount importance in
applications such as interactive digital television. The role of XML spans
different levels, from the specification of metadata (e.g.,
may search all videos having in the first scene a description of Rome com-
posed of shots giving a panoramic view plus shots that describe in detail
some monuments).
Another important functionality of DLs in general, and of video DLs in
particular, consists in allowing interoperability with other
digital libraries. This means that it should be possible to exchange documents
(composed of data + metadata) among different libraries, and it should be
possible to query different DLs in a transparent way (the user should be able to
issue a query to several different DLs, even if they are using different metadata
formats).
Finally, in many cases it is necessary to exchange metadata among different
modules of the DL. For example, the module responsible for the storage of
metadata and the metadata editor, need to exchange metadata frequently.
The approach to managing video in existing DL systems varies widely. The
simplest approach consists in managing video as an unstructured data type.
Indexing is exclusively based on metadata manually associated with the entire
video, and this approach supports only simple retrieval based on these metadata
values. More advanced video archiving and retrieval systems, such as Virage
[307] and Informedia [312], base indexing mainly on automatic processing
techniques. Typical automatically extracted metadata are the transcripts of the
speech present in the documents, key frames, and faces and objects recognized
in the video; retrieval is based on this information. A third category consists
of systems that offer all the typical services of a DL plus sophisticated video
indexing, based on the integration of automatic processing techniques with
manual indexing. A single metadata model is used both for automatically
extracted features and for information provided manually. The metadata model
may also support a description of the video structure.
In this section we present the ECHO Metadata Model. We first give an abstract
view of the proposed model and illustrate how it has been implemented using XML
Schema. We then present the Metadata Editor, a tool used for browsing and
editing the audio/video metadata. Since the metadata database of the DL is
implemented in XML, at the end of this chapter we present two example queries
and show how they are processed by the system.
The ECHO Metadata Model [9] has been defined as an extension of the IFLA FRBR
model. The IFLA FRBR model is composed of four levels describing different
aspects of intellectual or artistic endeavour: work, expression, manifestation,
and item. The entities of the model are organized in a structure that reflects
the hierarchical order of the entities from the top level
[Figure: the ECHO metadata model, relating the entities AVDocument, Version,
Media, Audio, Video, Transcript and Storage through the relationships
ExpressedBy, IsVersionOf, ManifestedBy, SynchronizedWith, PartOf, HasChannel,
AvailableAs, FollowedBy and HasTranscript.]
document in MPEG format. More than one manifestation of the same Version,
e.g. MPEG, AVI, etc., may exist.
Nevertheless, the Media object does not refer to any physical implementation.
For instance, the MPEG version of the Italian version of the games can be
available on different physical supports, each one represented by a different
Storage object (e.g., video server, DVD, etc.).
It is worth noting that the entities of the model are grouped into two sets,
called “Bibliographic metadata” and “Time/Space related metadata”. The entities
of the former set concern purely cataloguing aspects of the document, without
going into the peculiarities of the type of multimedia object described. More
generally, from the perspective of the “Bibliographic metadata” the catalogued
object is a black box. The latter set of entities concerns the nature of the
catalogued object. In particular, in our model we focus on the time/space
aspects of the multimedia object, i.e., how the audio/video objects are divided
into scenes, shots, etc. In general, however, any kind of metadata which can
help the user to identify and retrieve the object can be used. For instance, if
the object is a computer video game, there could be entities for describing the
saved games, the characters of the players, etc.
In particular, once the user has found a relevant document, by means of the
video retrieval tool, the URI (Uniform Resource Identifier) of the document
is obtained and passed to the metadata editor. This URI is sent to the data
manager which returns the document metadata. The format chosen for ex-
changing document metadata is XML.
An important feature of the metadata editor is that it is not hard-wired to a
particular set of metadata attributes; instead, the metadata schema is defined
in a W3C XML Schema Definition (XSD) that the editor uses as a configuration
file for the metadata model. The advantage of this choice is that it is
possible to add or remove fields in the schema of the metadata of the
audiovisual document (see Figure 2.2). This is achieved by giving the editor
the ability to recognize a subset of the types available in XSD schemas, such
as strings, booleans, dates, integers, etc.
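As a sketch of what this means in practice (the field name and type below are
invented for illustration and are not part of the actual ECHO schema), adding a
field to the audiovisual metadata then only requires declaring it in the XSD
configuration:

<!-- Hypothetical fragment, to be placed inside the complex type describing the
     audiovisual document; the field name "ProductionDate" is an assumption. -->
<xsd:element name="ProductionDate" type="xsd:date" minOccurs="0"/>

Because the type is declared as xsd:date, the editor can recognize the new
field and offer appropriate editing support without any change to its code.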
[Figure: architecture of the editing workflow. The Metadata Editor exchanges
metadata in XML with the Data Manager, which accesses the underlying database.]
phase, since this tag cannot contain misspelled words. Figure 2.5 gives an
example of an XML Schema closed-list type definition.
Closed lists are automatically recognized by the editor and represented in the
editing interface as combo-box controls.
Since entity instances are stored as individual XML documents, the references
among them are expressed as URI pointers.
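As an illustration (the file names below are invented, while the element names
follow the figures shown later in this chapter), a Version instance and one of
its Media instances could be stored as two separate XML documents that
reference each other by URI:

<!-- Version instance, stored as its own XML document -->
<Version>
  <Title>English Version</Title>
  <ManifestedBy>http://EchoServer.it/media42.xml</ManifestedBy>
</Version>

<!-- Media instance referenced above, also stored as a separate XML document -->
<Media>
  <Format>MPEG</Format>
  <AvailableAs>http://EchoServer.it/storage7.xml</AvailableAs>
</Media>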
The interface of the editor is designed in such a way that it is possible to
browse the tree structure of an audio/video document. Figure 2.6 shows the
screenshot of the main window of the editor: it displays a document like a
folder navigation tool. At the top level of the tree, there is an icon
representing an AVDocument object (the work of the “Olympic Games of 1936” in our
<xsd:simpleType name="CutType">
  <xsd:restriction base="xsd:string">
    <xsd:enumeration value="HardCut"/>
    <xsd:enumeration value="FadeIn"/>
    <xsd:enumeration value="FadeOut"/>
    <xsd:enumeration value="FadeInFadeOut"/>
  </xsd:restriction>
</xsd:simpleType>
Fig. 2.5. Illustration of a Closed List Type definition.
example). Connected to the work object, the editor presents the three main
Versions that belong to the AVDocument. Moreover, by selecting an icon
representing a Version (the English Version in the figure), it is possible to
see the Media instances of the Version and, hence, the corresponding Storage
objects.
The navigation tool shows only the main expressions belonging to a document
(i.e., the expressions which correspond to the whole audio/video document). The
editor allows a single Version to be browsed (one at a time) by using a second
frame on the right side of the window. In this way it is possible to see the
existing Video, Audio and Transcript Expressions (at least one of them must
exist) of the document and, for each Expression, to browse its segmentation
into scenes, shots, etc.
XML Searching
[Figure: processing of the first query. The retrieved URI identifies an
AVDocument; its ExpressedBy and ManifestedBy references lead to the Version and
Media documents shown below.]

<AVDocument>
  <Title>The fall of the Berlin Wall</Title>
  <Genre>Documentary</Genre>
  <ExpressedBy>http://...</ExpressedBy>
  ...
</AVDocument>

<Version>
  <Title>English Version</Title>
  <ManifestedBy>http://...</ManifestedBy>
  <ManifestedBy>http://...</ManifestedBy>
  ...
</Version>

<Media>
  <Format>AVI</Format>
  ...
</Media>

<Media>
  <Format>MPEG</Format>
  ...
</Media>
The second query is much more complex. We first have to scan all the
Transcript entities and search for those which contain the word “jeep”.
Subsequently, we must save the URIs of the Transcript XML documents found.
Since we would like to select all the scenes connected to these Transcripts, we
also have to save the time codes at which the word “jeep” is spoken. The result
of this first query phase is a list of Transcript URIs, each associated with a
list of timecodes. For convenience, this result could be temporarily stored in
an XML file such as the following.
<QueryResult1>
<Transcript URI="http://EchoServer.it/tran25.xml">
<timecode>122424</timecode>
</Transcript>
<Transcript URI="http://EchoServer.it/tran128.xml">
<timecode>556332</timecode>
<timecode>223422</timecode>
</Transcript>
</QueryResult1>
In the second phase of the search we have to scan all the Video XML documents
which have in the field <HasTranscript> one of the URIs contained in
QueryResult1.xml and which are identified as scenes. This last check can be
done by looking at the tag <IndicationVideoUnit>, which indicates the
granularity of the Video object: if it contains the closed-list item “Scene”
(“Shot”), the Video object is actually a scene (shot) of the whole video. If
this check is positive, we select all the Video XML documents whose time
interval contains one of the timecodes found in the Transcripts. The result is
a list of Video URIs corresponding to XML files describing the scenes in which
the word “jeep” is spoken. These URIs are then shown in the retrieval
interface; they can be selected by the user and opened by means of the Metadata
Editor, which is able to open the selected scene and show the video together
with its associated metadata. Figure 2.9 illustrates the processing and a
possible result of this query, in which two Video documents are retrieved.
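The two phases could, in principle, also be expressed directly in XQuery. The
sketch below is only an illustration of the logic, not the ECHO implementation:
it assumes that the Transcript and Video instance documents are accessible
through hypothetical collections named "transcripts" and "videos", and it uses
the element names and closed-list values as described above (the actual schema
may differ, e.g. in the spelling of <IndicationVideoUnit> or the case of
"Scene").

(: Phase 1: for each Transcript, collect the timecodes at which "jeep" occurs :)
let $hits :=
  ( for $t in collection("transcripts")/Transcript
    let $tc := $t/TranscriptWordList/Word[. = "jeep"]/@timecode
    where exists($tc)
    return
      <Transcript URI="{ document-uri(root($t)) }">{
        for $c in $tc return <timecode>{ string($c) }</timecode>
      }</Transcript> )

(: Phase 2: scenes whose transcript is among the hits and whose time interval
   contains one of the reported timecodes :)
for $v in collection("videos")/Video
where $v/IndicationVideoUnit = "Scene"
  and (some $h in $hits[@URI = string($v/HasTranscript)], $tc in $h/timecode
       satisfies xs:integer($v/Time/@start) <= xs:integer($tc)
             and xs:integer($tc) <= xs:integer($v/Time/@end))
return document-uri(root($v))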
2.6 Conclusions
In this chapter we have shown how XML is becoming a key choice in the
management of audio/video information in the context of video digital
libraries. These advantages have been exemplified by the implementation of the
ECHO Video DL system.
Moreover, XML is gaining a key role in new audio-visual applications of
increasing importance, such as interactive digital television. Consider, for
example, the role of XML in the standardization effort for the Multimedia Home
Platform (MHP) protocol of Digital Video Broadcasting (DVB) [104], in the
standardization of video metadata in MPEG-7 [222], and in the standardization
of mixed audio/video and data/program structures at the implementation level
[154].
[Figure 2.9: processing of the second query.]

The query looking for the word “jeep” in the tag <Word> of the Transcript XML
files produces two XML documents:

http://EchoServer.it/tran25.xml:

<Transcript>
  <TranscriptWordList>
    ...
    <Word timecode="122011">the</Word>
    <Word timecode="122424">jeep</Word>
    <Word timecode="122534">was</Word>
    ...
  </TranscriptWordList>
  ...
</Transcript>

http://EchoServer.it/tran128.xml:

<Transcript>
  <TranscriptWordList>
    ...
    <Word timecode="556332">jeep</Word>
    ...
    <Word timecode="223422">jeep</Word>
    ...
  </TranscriptWordList>
  ...
</Transcript>

These produce the intermediate result

<QueryResult1>
  <Transcript URI="http://EchoServer.it/tran25.xml">
    <timecode>122424</timecode>
  </Transcript>
  <Transcript URI="http://EchoServer.it/tran128.xml">
    <timecode>556332</timecode>
    <timecode>223422</timecode>
  </Transcript>
</QueryResult1>

which in turn leads to the two retrieved Video documents:

<Video>
  <Title>Safari</Title>
  <IndicationUnit>scene</IndicationUnit>
  <Time start="95260" end="132011"/>
  <HasTranscript>http://EchoServer.it/tran25.xml</HasTranscript>
  ...
</Video>

<Video>
  <Title>The return</Title>
  <IndicationUnit>shot</IndicationUnit>
  <Time start="550310" end="559100"/>
  <HasTranscript>http://EchoServer.it/tran128.xml</HasTranscript>
  ...
</Video>
3 Full-Text Search with XQuery: A Status Report

Michael Rys
3.1 Introduction
XML [45] has become one of the most important data representation formats. One
of the major reasons for this success is that XML is well-suited not only to
representing marked-up documents – as its SGML heritage indicates – but also
highly structured hierarchical data such as object hierarchies or relational
data, and semistructured data (see Fig. 3.1). Even data that has traditionally
been represented in binary formats, such as graphics, is now being represented
using XML (e.g., SVG).
We will use the following definitions for the terms structured, semistructured
and marked-up data. Note that these definitions differ somewhat from the
definitions of data-centric XML and document-centric XML given elsewhere, in
that they focus less on the use of the data and more on its actual structure.
Often, data-centric XML represents some form of structured or semistructured
data, while document-centric XML often represents semistructured or marked-up
data, although sometimes it may represent structured data.
Definition 1. Structured data is data that easily fits into a predefined, ho-
mogeneous type structure such as relations or nested objects.
Definition 2. Semistructured data is data that is mainly structured but may
change its structure from instance to instance. For example, it has some
components that are one-off annotations appearing only on single instances, or
components of heterogeneous type structure (e.g., in one instance the address
data is a string, in another it is a complex object).
Definition 3. Marked-up data is data that represents mainly free-flowing text
with interspersed markup to call out document structure and highlighting.
The most important aspect of XML as a data representation format is
that it allows combining the different data formats in one single document,
thus providing a truly universal data representation format.
<PatientRecord pid="P1">
  <FirstName>Janine</FirstName>
  <LastName>Smith</LastName>
  <Address>
    <Street>1 Broadway Way</Street>
    <City>Seattle</City>
    <Zip>WA 98000</Zip>
  </Address>
  <Visit date="2002-01-05">
    Janine came in with a <symptom>rash</symptom>. We identified a
    <diagnosis>antibiotics allergy</diagnosis> and <remedy>changed her
    cold prescription</remedy>.
  </Visit>
</PatientRecord>

<PatientRecord>
  <pid>P2</pid>
  <FirstName>Nils</FirstName>
  <LastName>Soerensen</LastName>
  <Address>
    23 NE 40th Street, New York
  </Address>
</PatientRecord>
Fig. 3.1. XML fragment representing structured data (the FirstName and Last-
Name properties of the PatientRecord), semistructured data (the pid and Address
properties), and marked-up data (Visit)
Since more and more data is making use of XML’s capability to combine
different formats, being able to query and search XML documents becomes
an important requirement. Existing query languages such as SQL or OQL
are well-suited to deal with structured, hierarchical data, but are not directly
appropriate for semistructured data or even marked-up documents. The SQL
standard for example provides an interface to a sublanguage called SQL/MM
to perform searches over documents, mainly because the documents and their
markup are not part of the fundamental data model of SQL. This is in contrast
to XML, where markup is the primary structural component of the data
model. Thus, a new query language needed to be developed that can be used
to query XML data.
In order to understand the query requirements, one also needs to un-
derstand a fundamental difference between marked-up data on one side and
structured and semistructured data on the other side. In the case of marked-up
data, the basic information content of the data is being preserved, even if the
XML markup has been removed from the document. In the case of structured and semistructured data, removing one of the most important pieces of information, the semantic association, would often lose too much information to still be able to make use of the data.
For example, while removing the markup of the Visit element of Fig. 3.1
loses some information, the bare text still contains enough useful information
for the reader to understand the data. On the other hand, the markup in the
structured and semistructured parts of the PatientRecord provides the neces-
sary semantic information to interpret the raw data. Removing the markup
would leave us with data that has lost most of its meaning.
Thus a query language that queries data represented in XML needs the
capability to extract information from both the markup and the data itself.
This means that the language needs a normal navigational component to
deal with structured data as well as a text retrieval component to deal with
marked-up data.
In addition, the markup structure can be provided ad hoc without a schema (so-called well-formed XML) or it can be described using a schema language
such as the W3C XML Schema definition and validation language [310]. This
means that the query language needs to be able to query well-formed docu-
ments besides being able to utilize possibly available schema information.
XQuery [34] is a declarative query language for querying XML documents. It is currently being designed by a working group of the World Wide Web Consortium (W3C) and is expected to be released as a recommendation in late 2003 or early 2004. It is based on the functional programming paradigm with a type system based on the W3C XML Schema definition and validation language [310]. The first version of XQuery will provide a SQL-like language that combines the XPath [12] document navigation capabilities with a sequence iterator commonly referred to as a FLWR expression, XML constructors, and an extensive function library. For example, the following XQuery expression finds all patient visits that belong to people with first name Nils who have New York in their address (either as street or as city), ordered by visit date:
for $p in /PatientRecord, $v in $p/Visit
where $p/FirstName = "Nils"
  and fn:contains(fn:string($p/Address), "New York")
order by $v/@date
return
  <Visit date="{$v/@date}"
         Name="{$p/FirstName} {$p/LastName}">{
    $v/*
  }</Visit>
While the first version of XQuery does not support full-text search or in-
formation retrieval capabilities, such functionality is slated to be added as an
extension to XQuery 1.0. The XQuery working group has formed a task force
that has published a requirement document [58] and a set of use cases [10] and
is currently reviewing a couple of language proposals. Such full-text capabil-
ities allow us to search the marked-up part or the structured and semistruc-
tured parts of XML documents using information retrieval technology.
In the following, we will be reviewing the requirements and some of the use
cases and will discuss some possible approaches on how such a language could
be designed and integrated with XQuery and how they can provide intelligent
searching over XML data.
3.2.1 Definitions
Before we start looking at the requirements, let’s repeat some definitions used
by the requirement document.
It is interesting to note that the definition of score does not refer to full-
text search itself. The wider understanding of score and relevance will be an
important aspect in making XQuery the language for intelligent retrieval from
XML documents (see below).
The requirement document also defines the terms must, should and may
that describe the priority and importance of the requirements for the first
version of the language.
The following are some definitions that we use:
¹ The requirements themselves are Copyright © 2003 World Wide Web Consortium (Massachusetts Institute of Technology, European Research Consortium for Informatics and Mathematics, Keio University). All Rights Reserved. http://www.w3.org/Consortium/Legal/2002/copyright-documents-20021231.
The following section discusses the requirements. We will number the re-
quirements in order to reference them.
These two requirements provide the relationship between the Boolean and
relevance-based Full-Text Search. Unlike systems that only provide for scoring
expressions and then define a certain relevance value (normally 0) to represent
false and anything greater to denote true, the XQuery/XPath Full-Text lan-
guage provides a Boolean language that then also can be used in the context
of score expressions. For example, the Boolean language searches for specific tokens. A Boolean search expression is then used in a score expression that functions as a second-order operation to determine the relevance of the data with respect to the search expression. The design of the language also allows other expressions to be used in the context of score expressions.
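To make this relationship concrete, the following sketch shows how a Boolean full-text condition might be reused inside a score expression. The score clause and the ftcontains and ftand keywords are illustrative assumptions only, not syntax adopted by the working group; the element names come from Fig. 3.1.

(: hypothetical syntax, for illustration only :)
for $v in /PatientRecord/Visit
(: the Boolean condition is reused as the argument of a scoring expression :)
let score $s := $v ftcontains ("rash" ftand "allergy")
where $s > 0
order by $s descending
return <Hit score="{$s}">{ $v }</Hit>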
Score Algorithm
Extensibility
The following requirements deal with the ability of the Full-Text Language
to evolve. The first requirement will allow implementers (not only vendors)
to add additional functionality. This is especially important in order to use
the standardized part of the language as the foundation for research into
extending information retrieval over XML data beyond the current state.
(R14) XQuery/XPath Full-Text must be extensible by vendors.
(R15) XQuery/XPath Full-Text may be extensible by users.
The ability to extend the relevance algorithms and the linguistic search
functionality is often requested by the advanced user community. The may-
requirement above makes this a low priority for this version, but will keep it
as a possibility for future versions. Future versions of the language should be
built within the framework given by the first version as requirement (R16)
requires:
(R16) The first version of XQuery/XPath Full-Text must provide a
robust framework for future versions.
As a result of this requirement, we hope that the design of the Full-Text
Language will take possible future functionality into account and thus will be
designed in an extensible way.
XPath
Extensibility Mechanisms
Composability
Functionality
A normal full-text search for the token "comfortable" inside chapters will find the chapter above. However, one may want to exclude footnotes and other annotations when searching for a phrase or token. Thus, being able to ignore subtrees inside marked-up text seems like an important use case.
Search Scope
Attributes
XML provides two ways to represent text data: either as element content or as attribute values. Marked-up data rarely uses attributes for representing data; they are mainly used for representing meta-information. Nevertheless, Full-Text Search should be able to search within attribute values:
(R34) XQuery/XPath Full-Text must support Full-Text search within
attributes.
Markup
(R36) If XQuery/XPath Full-Text supports search within names of
elements and attributes, then it must distinguish between element
content and attribute values and names of elements and attributes in
any search.
Searching on XML documents still needs to heed the structure given by the
data model. This means that the Full-Text Language needs to clearly identify
where to find the data.
Element Boundaries
(R37) XQuery/XPath Full-Text must support search across element
boundaries, at least for NEAR.
Since XML markup can add structure and split contiguous text into separate parts, it becomes important for the Full-Text Search capabilities to search across element boundaries, especially in the context of marked-up data.
(R38) XQuery/XPath Full-Text must treat an element as a token
boundary. This may be user-defined.
This requirement is in my opinion not correctly motivated. Given our understanding of the different nature of XML markup for structured/semistructured and marked-up data, the requirement should really be "XQuery/XPath Full-Text should provide an option to treat element markup as a token boundary or to treat it as having no impact on tokenization." When tokenizing marked-up data, which most likely represents the major use case for Full-Text Search, the element markup should not impact tokenization because markup may be used within a single token. In the case of structured and semistructured data, the element markup indeed should become a token boundary, because each piece of markup delimits a semantic unit.
For certain use cases, it may even be necessary to make this distinction at the element level. For example, the data in Fig. 3.1 should use the markup as a token boundary for the structured and semistructured data, but not for the marked-up data.
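As a small, hypothetical illustration (the highlight element is not part of Fig. 3.1), markup inside running text may occur within a single token, whereas markup in structured data separates semantic units:

<!-- marked-up data: the highlight splits the token "antibiotics";
     here the markup should not act as a token boundary -->
<Visit date="2002-01-05">We identified an <highlight>anti</highlight>biotics allergy.</Visit>

<!-- structured data: each element is a semantic unit; here the markup
     should act as a token boundary, so that the two names are not
     tokenized as the single token "JanineSmith" -->
<FirstName>Janine</FirstName><LastName>Smith</LastName>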
Score
Given these design requirements, there are still many different approaches that can be taken to satisfy them. We would like to look at the following approaches in a bit more detail:
Sublanguage approach: Provide a minimal XQuery/XPath functional inter-
face to an existing full-text language such as SQL/MM. Especially for
relevance searches, the whole search expression is expressed in that sub-
language.
Function approach: Add XQuery functions that provide the required func-
tionality. A full text search expression is composed from these functions
with the normal XQuery operators such as and and or.
Syntactic approach: Add full-text functionality to the language by adding
many new statement operators to the XQuery and XPath languages.
These three approaches are of course not fully orthogonal but can be combined. For example, one may choose the functional approach for the Boolean search functionality and add syntax to provide the second-order functionality of calculating the relevance score.
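As a rough illustration of the difference, the same Boolean full-text condition is sketched below, first in a function style and then in a syntax-extension style. Both the function name fts:contains and the ftcontains/ftand keywords are assumptions made for this sketch, not proposals adopted by the working group.

(: function approach: full-text conditions are ordinary function calls,
   composed with the usual XQuery operators :)
for $p in /PatientRecord
where fts:contains($p/Visit, "rash") and fts:contains($p/Visit, "allergy")
return $p/LastName

(: syntactic approach: full-text conditions are new language operators :)
for $p in /PatientRecord
where $p/Visit ftcontains "rash" ftand "allergy"
return $p/LastName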
The following subsections give my critique of the different approaches. A
good language design needs to find a balance between the following dimen-
sions: number of functions, number of arguments, complexity of arguments,
number of additional operators. One of the tasks of the W3C working groups will be weighing the benefits and costs and placing the final design within these four dimensions.
The functional approaches on the other hand have none of the drawbacks
of the sublanguage approaches. They are easily parameterizable and compos-
able. They can easily reuse the XPath and XQuery expressions and can easily
be composed with other relevance search expressions, even though the ac-
tual relevance function may have to be different from the full-text relevance
function.
The syntactic approach, on the other hand, makes extensions harder, since any extension will change the grammar of XQuery and XPath. Finally, the parameterization of the operators with the linguistic information becomes a non-trivial undertaking, since the large number of possible parameters leads to a complex grammar.
Figure 3.2 gives the placement on the dimensional axes of some of the ap-
proaches discussed in this investigation.
The investigation of the three main approaches above clearly shows that
the sublanguage approaches are less suited for adding the full-text search
capabilities to XQuery/XPath than either a functional or syntactic approach.
A functional approach seems to be more flexible and less brittle in the long-
term than a syntactic approach, although the functional approaches have the
problem of finding the right balance between number of functions, number or
arguments and complexity of arguments.
It is my belief that the best course of action for XQuery is to take a
functional approach towards adding the Boolean full-text search capabilities
with either a functional or syntactic approach along the line of the outlined
score-clause to provide the second-order functionality of relevance search.
Many of these aspects are not covered by the W3C standards work, but they will be among the most important aspects of enabling intelligent information retrieval from XML documents.
3.5 Conclusions
We made the case that finding and extracting information from XML docu-
ments is an important aspect of future information discovery and retrieval and
that XQuery for structural queries, extended with full-text capabilities for in-
formation retrieval queries, is an important and appropriate tool to facilitate
this functionality. We then gave an overview of the current publicly available
state of the XQuery working group’s work on full-text search. In particular,
we reviewed the requirements document and took a look at the use cases. Fi-
nally, we reviewed some of the possible language approaches and investigated
some of the issues that any language proposal will have to address.
The work in the W3C XQuery working group will continue along the lines
outlined in the requirements and use cases document. We can expect that the
next step will be to review proposals and to investigate the language issues
to find the right language – at least in the opinion of the working group – to
satisfy the requirements in the context of XQuery and XPath.
Even more interesting will be how the full-text search framework provided by the working group will be used to extend XQuery for other approximate query functionality over the structural parts and types of data (such as binary image data in base64 or hex encoding).
Finally, there are several interesting research areas that need more investigation. They range from implementation problems, such as the performance and scaling aspects of the dynamic processing implied by the generality of several of the possible language approaches, through the design of the right APIs for external parameterization of the scoring algorithms, to the question of how to combine full-text and XQuery optimization techniques so that both the structural and the full-text parts of queries benefit.
4
A Query Language and User Interface for XML Information Retrieval
Norbert Fuhr, Kai Großjohann, and Sascha Kriewel
4.1 Introduction
As XML is about to become the standard format for structured documents,
there is an increasing need for appropriate information retrieval (IR) methods.
Since classical IR methods were developed for unstructured documents only,
the logical markup of XML documents poses new challenges.
Since XML supports logical markup of texts both at the macro level (struc-
turing markup for chapter, section, paragraph and so on) and the micro level
(e.g., MathML for mathematical formulas, CML for chemical formulas), re-
trieval methods dealing with both kinds of markup should be developed. At
the macro level, fulltext retrieval should allow for selection of appropriate
parts of a document in response to a query, such as by returning a section
or a paragraph instead of the complete document. At the micro level, specific
similarity operators for different types of text or data should be provided (such
as similarity of chemical structures, phonetic similarity for person names).
Although a large number of query languages for XML have been proposed
in recent years, none of them fully addresses the IR issues related to XML;
especially, the core XQuery proposal of the W3C working group [34] offers
no support for IR-oriented querying of XML sources; the discussion about
extensions for text retrieval has started only recently (see the requirements
document by [34] and the use cases by [10]). There are only a few approaches
that provide partial solutions to the IR problem, namely by taking into ac-
count the intrinsic imprecision and vagueness of IR; however, none of them
are based on a consistent model of uncertainty (see section 4.5).
In this chapter, we present the query language XIRQL which combines
the major concepts of XML querying with those from IR. XIRQL is based on
XPath, which we extend by IR concepts. We also provide a consistent model
for dealing with uncertainty.
For building a complete IR system, the query language and the model
are not enough. One also needs to deal with user interface issues. On the
input side, the question of query formulation arises: the query language al-
lows for combining structural conditions with content conditions, and the user
interface needs to reflect this. On the output side, we observe two kinds of
relationships between retrieval results. In traditional document retrieval, the
retrievable items (i.e., documents) are considered to be independent from each
other. This means that the system only needs to visualize the ordering im-
posed by the ranking. But in the case of retrieval from XML documents, two
retrieved items may have a structural relationship, if they come from the same
document: One could be the ancestor of another, or a sibling, and so on.
So in addition to the query language XIRQL, we describe graphical user
interfaces for interactive query formulation as well as for result presentation.
This chapter is structured as follows. In the following section, we discuss
the problem of IR on XML documents (section 4.2). Then we present the
major concepts of our new query language XIRQL (section 4.3). Our graphical
user interfaces are described in section 4.4. A survey on related work is given
in section 4.5, followed by the conclusions and the outlook.
4.3.1 Weighting
Figure 4.1 shows the index nodes of our example document; the corresponding disjoint text units are marked as dashed boxes.
Fig. 4.1. Example XML document tree. Dashed boxes indicate index nodes; bracketed numbers serve as identifiers.
For example, the occurrence of the term "syntax" in index node 5 is represented by the event [5, syntax]. For retrieval, we assume that different events are independent. That is, different terms are independent of each other.
Moreover, occurrences of the same term in different index nodes are also in-
dependent of each other. Following this idea, retrieval results correspond to
Boolean combinations of probabilistic events which we call event expressions.
For example, a search for sections dealing with the syntax of XQL could be
specified as //section[.//* cw "XQL" and .//* cw "syntax"]. Here, our
example document would yield the conjunction [5, XQL] ∧ [5, syntax]. In con-
trast, a query searching for this content in complete documents would have
to consider the occurrence of the term “XQL” in two different index nodes,
thus leading to the Boolean expression ([3, XQL] ∨ [5, XQL]) ∧ [5, syntax].
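To make the probabilistic interpretation concrete, assume (illustrative values only) P([3, XQL]) = 0.4, P([5, XQL]) = 0.7, and P([5, syntax]) = 0.5. With independent events, the document-level expression evaluates to

P\bigl(([3,XQL] \lor [5,XQL]) \land [5,syntax]\bigr) = \bigl(0.4 + 0.7 - 0.4 \cdot 0.7\bigr) \cdot 0.5 = 0.41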
For dealing with these Boolean expressions, we adopt the idea of event
keys and event expressions described by [129]. With the method described
there, we can compute the correct probability for any combination of inde-
pendent events (see also [128]). Furthermore, the method can be extended to
allow for query term weighting. Assume that the query for sections about XQL
syntax would be reformulated as //section[0.6 · .//* cw "XQL" + 0.4 ·
.//* cw "syntax"]. For each of the conditions combined by the weighted
sum operator, we introduce an additional event with a probability as speci-
fied in the query (the sum of these probabilities must not exceed 1). Let us
assume that we identify these events as pairs of an ID referring to the weighted
sum expression, and the corresponding term. Furthermore, the operator ‘·’ is
mapped onto the logical conjunction, and ‘+’ onto disjunction. For the last
section of our example document, this would result in the event expression
[q1 , XQL] ∧ [5, XQL] ∨ [q1 , syntax] ∧ [5, syntax]. Assuming that different query
conditions belonging to the same weighted sum expression are disjoint events,
this event expression is mapped onto the scalar product of query and docu-
ment term weights: P ([q1 , XQL])·P ([5, XQL])+P ([q1 , syntax])·P ([5, syntax]).
Above, we have described a method for combining weights and structural con-
ditions. In contrast, relevance-based search omits any structural conditions;
instead, we must be able to retrieve index objects at all levels. The index
weights of the most specific index nodes are given directly. For retrieval of
the higher-level objects, we have to combine the weights of the different text
units contained therein. For example, assume the following document struc-
ture, where we list the weighted terms instead of the original text:
<chapter> 0.3 XQL
<section> 0.5 example </section>
<section> 0.8 XQL 0.7 syntax </section>
</chapter>
A straightforward possibility would be the OR-combination of the different
weights for a single term. However, searching for the term “XQL” in this
example would retrieve the whole chapter in the top rank, whereas the sec-
ond section would be given a lower weight. It can be easily shown that this
strategy always assigns the highest weight to the most general element. This
result contradicts the structured document retrieval principle mentioned be-
fore. Thus, we adopt the concept of augmentation from [127]. For this purpose,
index term weights are downweighted (multiplied by an augmentation weight)
when they are propagated upwards to the next index object. In our example,
using an augmentation weight of 0.6, the retrieval weight of the chapter with
respect to the query “XQL” would be 0.3 + 0.6 · 0.8 − 0.3 · 0.6 · 0.8 = 0.636,
thus ranking the section ahead of the chapter.
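Spelled out, this value is the probabilistic OR-combination of the chapter's own weight 0.3 with the downweighted weight 0.6 · 0.8 = 0.48 propagated up from the second section:

1 - (1 - 0.3)(1 - 0.6 \cdot 0.8) = 0.3 + 0.48 - 0.3 \cdot 0.48 = 0.636 < 0.8

so the second section, whose own weight for "XQL" is 0.8, is indeed ranked ahead of the chapter.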
For similar reasons as above, we use event keys and expressions in order to
implement a consistent weighting process (so that equivalent query expressions
result in the same weights for any given document). [127] introduce augmen-
tation weights (i.e., probabilistic events) by means of probabilistic rules. In
our case, we can attach them to the root element of index nodes. Denoting these events by their index node number, the last retrieval example would result in the event expression [1, XQL] ∨ ([3] ∧ [3, XQL]).
In the following, paths leading to index nodes are denoted by ‘inode()’
and recursive search with downweighting is indicated via ‘. . . ’. As an exam-
ple, the query /document//inode()[... cw "XQL" and ... cw "syntax"]
searches for index nodes about “XQL” and “syntax”, thus resulting in the
event expression ([1, XQL] ∨ [3] ∧ [3, XQL]) ∧ [2] ∧ [2, syntax].
In principle, augmentation weights may be different for each index node.
A good compromise between these specific weights and a single global weight
may be the definition of type-specific weights, i.e., depending on the name of
the index node root element. The optimum choice between these possibilities will be subject to empirical investigations.
The XML standard itself only distinguishes between three data types,
namely text, integer and date. The XML Schema recommendation [111] ex-
tends these types towards atomic types and constructors (tuple, set) that are
typical for database systems.
For the document-oriented view, this notion of data types is of limited use.
This is due to the fact that most of the data types relevant for IR applications
can hardly be specified at the syntactic level (consider for instance names of
geographic locations, or English vs. French text). In the context of XIRQL,
data types are characterized by their sets of vague predicates (such as phonetic
similarity of names, English vs. French stemming). Thus, for supporting IR in
XML documents, there should be a core set of appropriate data types, and the
system should be designed in an extensible way so that application-specific
data types can be added easily.
We do not discuss implementation issues here, but it is clear that the
system needs to provide appropriate index structures, for structural conditions
and also for the (possibly vague) search predicates — both for the core and the
application-specific data types, of course. This problem is rather challenging,
as we suspect that separate index structures for the tree structure and for the
search predicates will not be sufficient; rather, they have to be combined in
some way.
Candidates for the core set are texts in different languages, hierarchical
classification schemes, thesauri and person names. In order to perform text
searches, some knowledge about the kind of text is necessary. Truncation
and adjacency operators available in many IR systems are suitable for west-
ern languages only (whereas XML in combination with unicode allows for
coding of most written languages). Therefore, language-specific predicates,
e.g., for dealing with stemming, noun phrases and composite words should
be provided. Since documents may contain elements in multiple languages,
the language problem should be handled at the data type level. Classification
schemes and thesauri are very popular now in many digital library applica-
tions; thus, the relationships from these schemes should be supported, perhaps
by including narrower or related terms in the search. Vague predicates for this
data type should allow for automatic inclusion of terms that are similar ac-
cording to the classification scheme. Person names often pose problems in
document search, as the first and middle names may sometimes be initials
only (therefore, searching for “Jack Smith” should also retrieve “J. Smith”,
with a reduced weight). A major problem is the correct spelling of names,
especially when transliteration is involved (e.g., “Chebychef”); thus, phonetic
similarity or spelling-tolerant search should be provided.
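As a purely hypothetical sketch (neither the predicate name sounds-like nor the exact syntax below are actual XIRQL operators), a person-name data type could expose a phonetic predicate next to the exact comparison:

(: exact comparison: misses transliteration variants :)
//article[author = "Chebychef"]

(: assumed data-type-specific vague predicate: would also match
   "Chebyshev" or "Tschebyscheff", each with a similarity weight :)
//article[author sounds-like "Chebychef"]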
Application-specific data types should support vague versions of the pred-
icates that are common in this area. For example, in technical texts, measure-
ment values often play an important role; thus, dealing with the different units,
the linear ordering involved (<) as well as similarity (vague equality) should
be supported (“show me all measurements taken at room temperature”). For texts describing chemical elements and compounds, it should be possible to search for specific compounds as well as for structurally similar ones.
Since typical queries in IR are vague, the query language should also sup-
port vagueness in different forms. Besides relevance-based search as described
above, relativism with respect to elements and attributes seems to be an
important feature. The XPath distinction between attributes and elements
may not be relevant for many users. In XIRQL, author searches an element,
1
Please note that we make no additional assumptions about the internal structure
of the text data type (and its subtypes), like representing text as set or list of
words.
68 N. Fuhr, K. Großjohann, and S. Kriewel
@author retrieves an attribute and =author is used for abstracting from this
distinction.
Another possible form of relativism is induced by the introduction of data
types. For example, we may want to search for persons in documents, without
specifying their role (author, editor, referenced author, subject of a biogra-
phy) in these documents. Thus, we provide a mechanism for searching for
certain data types, regardless of their position in the XML document tree.
For example, #persname searches for all elements and attributes of the data
type persname.
Currently, we are working on further generalizations of structural condi-
tions. One direction is based on ontologies over element names. For example,
assuming that region is a subproperty of the more general element named
geographic-area, which in turn has additional subproperties continent and
country, we would expand the original element name region into the disjunc-
tion region | country | continent. The sequence of elements in a path can
also be subject to vague interpretations (e.g., author = "Smith" should also
match vaguely author/name and author/name/lastname).
A screenshot of our interface can be seen in Figure 4.2. There are three areas:
On the left, the structure condition area enables users to formulate single
query conditions. On the right, the condition list area allows users to edit the
query conditions and to specify how to combine them to form the whole query.
At all times, a paraphrase of the current query in XIRQL syntax is kept up
to date in the paraphrase area at the bottom.
For formulating a single query condition, the main mechanism is Query
by Example. It comes in three variants. In the screenshot, the layout-oriented
variant is shown. The user can click on a word in that document and the
system derives from it a structural condition (candidate) and a value condition
(candidate). The structural condition describes the list of element names on
the path from the root node to the leaf node in the XML tree. From it, a
number of generalizations (using the // operator and the * wild card) are
produced and shown to the user (see the popup window in the lower right of
the screenshot). After selecting the structural condition, the query condition
is added to the condition list area, where additional changes can be made: The
comparison value (defaulting to the word the user selected) can be edited, and
a search predicate can be chosen for this condition.
In addition to the layout-oriented variant of Query by Example, we offer
a structure-oriented variant where people see an expandable tree of the XML
document, as well as a structure-oriented variant which shows a document
surrogate only. Finally, as an alternative to Query by Example, we offer a
DTD oriented method for specifying the structure condition which does not
rely on an example document.
The next step is to specify how the query conditions thus collected should
be combined to form the whole query. Here, we focus on the structural depen-
dence between the conditions. This is achieved by specifying a common prefix
for two query conditions. For example, in the third condition, /ARTICLE/BDY
is grayed out. This means the match for the second and third conditions must
be in the same BDY element (and hence within the same ARTICLE element).
The graying-out connects two adjacent conditions; by making it possible to move conditions up and down in the list, structural dependence between any two conditions can be expressed. In addition to the structural dependence,
the Boolean connectors between the conditions also need to be specified. We
do this in a simple manner, allowing the user to choose between and and or
between any two conditions, but we plan more elaborate support, possibly
based on Venn diagrams.
To test the usefulness of this approach, we performed a small preliminary
user study. Three retrieval tasks (against the INEX collection, [126]) were
given in natural language. Five users performed the tasks with the graphical
interface described here, two of them also used a command-line tool to directly
enter XIRQL queries. The results indicate that even people with no knowledge of XIRQL are able to pose queries using this interface. For more complex queries, the interface might speed up users who know XIRQL. The layout-oriented variant of Query by Example was popular with all users; the DTD-based method was rarely used.
In a second preliminary study, participants assessed retrieval results with three queries for the textual result representation, and three queries for each of our visualizations.
The results indicated that TextBars outperform the textual representation in terms of precision, and partial treemaps improve precision even further. On the other hand, the time used for the judgements was about the same for all three methods; participants reported that they had a closer look at the retrieval results and their relationships when using the graphical methods. Thus, it seems that the added information provided by the graphical methods improved the quality of the judgments.
Fig. 4.5. Result presentation with Partial Treemaps. Each element in the treemap
has a tool-tip with a summary about that element. In the bottom left, we show a
‘table of contents’ (tree showing certain elements) and in the bottom right, we show
the document itself, at the spot corresponding to the element in the treemap that
the user has clicked on.
XQuery offers higher expressiveness than XPath and XIRQL. The latter two offer only selection operators, thus results are always complete elements of the original documents. In contrast, XQuery also provides operators for restructuring results as well as for computing aggregations (count, sum, avg, max, min).
A typical XQuery expression has the following structure:
FOR PathExpression
WHERE AdditionalSelectionCriteria
RETURN ResultConstruction
Here, PathExpression may contain one or more path expressions following
the XPath standard, where each expression is bound to a variable. Thus, the
FOR clause returns ordered lists of tuples of bound variables. The WHERE
clause prunes these lists of tuples by testing additional criteria. Finally, the
RETURN clause allows for the construction of arbitrary XML documents by
combining constant text with the content of the variables.
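A minimal concrete instance of this pattern, assuming a document books.xml with chapter, section, and heading elements (the names are illustrative):

for $s in doc("books.xml")//chapter/section
where contains(string($s/heading), "syntax")
return <result>{ $s/heading }</result>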
Since XIRQL is based on XPath, it can be seen as an extension of a subset
of XQuery (i.e., only a FOR clause, with a single PathExpression) in order to
support IR.
The current version of XQuery supports querying for single words in texts
only. Recently, discussions about text retrieval extensions for XQuery have
started, which aim at providing restricted forms of weighting and ranking
(see the requirements [58] and the use cases [10]). However, most of the use
cases presented there do not take weighting into account and operate on the
syntactical level. (For example, proximity search is required to handle phrases,
rather than allowing for linguistic predicates.) Furthermore, ranking is only
applicable to full-text search predicates whereas we consider weighting and
ranking to be an important feature for other data types, as well, including
numbers.
In information retrieval, previous work on structured documents has fo-
cused on two major issues:
• The structural approach enriches text search by conditions relating to
the document structure, e.g., that words should occur in certain parts of
a document, or that a condition should be fulfilled in a document part
preceding the part satisfying another condition. [228] give a good survey
on work in this direction. However, all these approaches are restricted
to Boolean retrieval, so neither weighting of index terms nor ranking are
considered.
• Content-based approaches aim at the retrieval of the most relevant part
of a document with respect to a given query. In the absence of explicit
structural information, passage retrieval has been investigated by several researchers ([157]). Here the system determines a sequence of sentences from the original document that fits the query best.
Only a few researchers have dealt with the combination of explicit struc-
tural information and content-based retrieval. [223] use belief networks for this purpose.
4.6 Conclusions
In this chapter, we have described a query language for information retrieval in
XML documents. Current proposals for XML query languages lack most IR-
related features, which are weighting and ranking, relevance-oriented search,
data types with vague predicates, and structural relativism. We have presented
the new query language XIRQL which integrates all these features, and we
have described the concepts that are necessary in order to arrive at a consistent
model for XML retrieval.
In order to ease query formulation, we have developed a user interface
supporting formulation of syntactically and semantically correct queries. For
result presentation of XML retrieval, we have described a solution which vi-
sualizes also sizes of result elements and structural relationships between ele-
ments.
Based on the concepts described in this chapter, we have implemented a retrieval engine named HyREX (Hypermedia Retrieval Engine for XML). HyREX is designed as an extensible IR architecture. The whole system
is open source and can be downloaded from http://www.is.informatik.
uni-duisburg.de/projects/hyrex. For specific applications, new data types
can be added to the system, possibly together with new index structures.
5
Tamino – A Database System Combining Text
Retrieval and XML
Harald Schöning
5.1 Introduction
In 1999, Software AG released the first version of its native XML server
Tamino [276, 274, 275], which includes a native XML database. The term
native has become popular since then, being used with differing meanings.
While some sources, e.g. [10], define a native XML database system only by
its appearance to the user (“Defines a (logical) model for an XML document
. . . and stores and retrieves documents according to that model. . . . For exam-
ple, it can be built on a relational, hierarchical, or object-oriented database
. . .”), Software AG takes the definition further by requiring that a native XML
database system has been built and designed for the handling of XML, and
is not just a database system for an arbitrary data model with an XML layer
on top.
XML by itself leaves many choices for the modeling of data. Two modeling
approaches are contrasted by [42]: Data-centric documents have a regular
structure, order typically does not matter, and mixed content does not occur.
This is the type of information usually stored in a relational or object-oriented
database. Document-centric documents are characterized by a less regular
structure, often considerably large text fragments, the occurrence of mixed
content, and the significance of the order of the elements in the document. Of
course, all choices in between these two extremes are possible.
As a consequence, text retrieval functionality is essential for the efficient
search in XML documents, in particular for the document-centric parts of
documents. However, it is not sufficient to have this functionality only on the
level of the whole document. The structure imposed on documents by XML is significant for the semantics and has to be considered in text retrieval. In addi-
tion, it is desirable to combine retrieval with conditions on the well-structured
data-centric parts of an XML document. Consider a set of maintenance man-
uals in XML format. The query “Find the manuals which were written earlier
than 2003 and where in an element ’Caution’ the word ’alcohol’ is mentioned” combines both kinds of conditions.
[Fig.: Tamino data organization: a Tamino database contains collections; each collection contains doctypes; each doctype contains XML documents.]

Tamino derives a token value for each element node; three methods are available to compute it:
1. The token value is always derived from the string value of an element.
2. The token value is derived by replacing all sub-element tags by delimiters
before computing the string value of an element, and computing the token
value from this string value.
3. Flexible token value computation: if the element has mixed content,
method 1 is used, otherwise method 2 is used.
The choice among these three methods is orthogonal to the text retrieval
functions described in the following sections – these operate on token val-
ues, regardless how these have been computed. In the following we describe
how structure-aware text retrieval has been embedded in Tamino’s query lan-
guages.
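To illustrate the difference between the three methods before turning to the query languages, consider a hypothetical element with mixed content (the element and its content are invented for this example):

<description>A data<b>base</b> system with <b>full-text</b> search</description>

With method 1, the token value is the plain string value, "A database system with full-text search", so "database" forms one token. With method 2, the sub-element tags are replaced by delimiters before the string value is computed, yielding the tokens "A", "data", "base", "system", "with", "full-text", "search". With method 3, method 1 would be chosen here because the element has mixed content.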
This XPath-based approach (Tamino's ∼= text retrieval operator) has some drawbacks:
• The right-hand side uses a proprietary syntax to express proximity conditions. Adding further functionality means extending this syntax by further
proprietary constructs.
• The approach cannot be easily extended to introduce advanced function-
ality such as ranking, highlighting etc.
For these reasons, Software AG has chosen a different approach when inte-
grating text retrieval functionality into its XQuery implementation.
Starting with version 4.1, Tamino supports XQuery in addition to its “tradi-
tional” XPath-based query language. Of course, Tamino’s retrieval function-
ality must be available via XQuery as well. While it would have been easy
to adopt the XPath extension described above to XQuery (because XQuery
includes XPath’s path expressions), Software AG has chosen a different ap-
proach.
XQuery has the concept of functions. Built-in functions belong to a dedi-
cated XQuery name space, and user defined functions can be assigned to any
other name space. Software AG provides an additional package of built-in functions in a Software AG name space which includes Tamino's text retrieval functionality.
The function containsText accepts a node and a search string and returns a
Boolean value which indicates whether there is a match between the node’s
content and the search string. This function implements the functionality of
∼= except that the explicit adj and near operators are not supported. For example, the XPath-based query /DatabaseSystem[description ∼= "a* * model"] can be written as¹
declare namespace
tf="http://namespaces.softwareag.com/tamino/TaminoFunction"
for $d in input()/DatabaseSystem
where tf:containsText($d/description, "a* * model")
return $d
or, more compactly, as
declare namespace
tf="http://namespaces.softwareag.com/tamino/TaminoFunction"
input()/DatabaseSystem[tf:containsText(
description, "a* * model")]
¹ The function input() returns all documents of the context collection.
5.5.3 Highlighting
Two issues deserve special attention: the identification of tokens and the matching of tokens.
Text indexes can be defined on any node in an XML schema, regardless of its con-
tent model. Indexing is done on the token value of the node as defined above.
A text index is defined in the XML schema for a doctype. Tamino uses the
annotation mechanism of XML schema [297] to preserve standard conforming
XML schema documents while adding Tamino-specific information. The fol-
lowing fragment of an XML schema illustrates the definition of a text index
on the element description.
<?xml version = "1.0" encoding = "UTF-8"?>
<xs:schema
xmlns:tsd = "http://namespaces.softwareag.com/tamino/TaminoSchemaDefinition"
xmlns:xs = "http://www.w3.org/2001/XMLSchema">
<!-- left out other definitions -->
<xs:element name = "description" type = "xs:string">
<xs:annotation>
<xs:appinfo>
<tsd:elementInfo>
<tsd:physical>
<tsd:native>
<tsd:index>
<tsd:text/>
</tsd:index>
</tsd:native>
</tsd:physical>
</tsd:elementInfo>
</xs:appinfo>
</xs:annotation>
</xs:element>
The XQuery Working Group and the XSL Working Group of W3C have cre-
ated a draft of full-text requirements [58] that an XQuery extension for text
retrieval should fulfill and corresponding full-text use cases [10]. The Tamino
text-retrieval functionality matches all the requirements except that Tamino
does not provide a wildcard for a single character and does not support stop
words. The requirement to support stop words is questionable anyway as it
adds no power to the text retrieval functionality.
Oracle 9iR2 also contains some text retrieval functionality on values
of XMLType. The function CONTAINS accepts search strings that can re-
fer to XPath expressions. The expression CONTAINS(DatabaseSystem, ’XML
INPATH (//description)’) returns a numerical value which is zero if the
path //description does not point to a text containing the word “XML”,
and otherwise is a positive numeric value that increases with the number of
matches found. However, the CONTAINS function can only be applied if an
appropriate index has been defined on the referenced column. Furthermore,
it is not possible to combine multiple predicates on the same sub-tree of an
XML document. Oracle has also extended XPath by a function ora:contains
which does not need and does not use a text index. The following Oracle query
finds the name of all database systems whose description contains the word
“XML” or the word “HTML”.
SELECT D.DatabaseSystem.extractValue('//name') AS name
FROM DBMS D
WHERE D.DatabaseSystem.existsNode(
  '/DatabaseSystem[ora:contains(description, "XML" OR "HTML") > 0]');
In DB2 UDB, the Text Extender can be used for text search on XML. If a cor-
responding XML-aware index has been defined properly, the following expres-
sion returns true if the description of a database system contains the words
"XML" or "HTML": CONTAINS(DatabaseSystem, 'model DatabaseSystem sections (/DatabaseSystem/description) ("XML"|"HTML")'). This func-
tion does not search on sub-elements. Multiple predicates on a sub-tree cannot
be combined.
Progress Software’s (formerly eXcelon) eXtensible Information Server has
integrated two dedicated text retrieval functions into its XPath and XQuery
implementations: xln:contains-words and xln:icontains-words. The difference between these two is that xln:icontains-words is case insensitive, while xln:contains-words is not. These functions can search for a set of
words. An option specifies whether this set is treated as a sequence of con-
tiguous words, or a set of which any or all words have to be found in any
order. Furthermore, search can be restricted to the first level of child nodes.
Another option controls whether markup separates tokens or not. Wildcards
are supported, but only at the end of words. Tokenization can be switched to
Japanese. Text indexes are available, but optional.
In [46], XQuery is extended by a new keyword for text retrieval purposes.
This is an extension that modifies the XQuery syntax, while Software AG’s
approach integrates well with XQuery as defined by W3C.
5.9 Conclusions
As its first query language, Tamino has extended XPath to include a text re-
trieval operator. With this approach, it has been possible to specifically query
all levels of a document, and combine predicates on text with predicates on
structure and other content of a document. For its second query language, XQuery, Tamino provides its text retrieval functionality as a package of built-in functions in a Software AG name space, an approach that integrates well with XQuery as defined by the W3C.
6
Flexible Information Retrieval on XML Documents
Torsten Grabs and Hans-Jörg Schek
6.1 Introduction
XML – short for the Extensible Markup Language defined by the World Wide Web Consortium in 1998 – is very successful as a format for data interchange.
The reason is the high flexibility of the semistructured data model underlying
XML [3]. Therefore, XML documents are well suited for a broad range of
applications covering both rigidly structured data such as relations as well as
less rigorously structured data such as text documents. So far, research on
database systems has spent much effort on data-centric processing of rigidly
structured XML documents. However, the importance of document-centric
processing increases the more XML extends to application domains requiring
less rigorously structured data representation.
In the context of document-centric XML processing, this chapter focuses
on the problem of flexible ranked and weighted retrieval on XML docu-
ments [139]. Like information retrieval on text, XML retrieval aims to ef-
fectively and efficiently cover the information needs of users searching for
relevant content in a collection of XML documents. However, due to the flexi-
bility inherent to XML, conventional text retrieval techniques are not directly
applicable to ranked and weighted retrieval on XML documents.
First, the notion of a document collection must be refined in the context
of XML. This is because often a single large XML document comprises all the
content. Figure 6.1 illustrates such an XML document representing a collection
of books. A document collection in the context of conventional information
retrieval in contrast usually comprises many documents. For instance, global
collection-wide IR statistics such as document frequencies, i.e., the number
of documents a word occurs in, build on the conventional notion where a
collection comprises many documents.
Second, different parts of a single XML document may have content from
different domains. Figure 6.1 illustrates this with the different branches of
the bookstore – one for medicine books and one for computer science books.
Intuitively, the term ’computer’ is more significant for books in the medicine branch than for books in the computer science branch.
Fig. 6.1. Exemplary XML document with textual content represented as shaded boxes
The remainder of this chapter discusses these issues in more detail. The main innovation, presented in Sect. 6.3, is a retrieval model to dynamically derive the query-specific statistics from underlying ba-
sic statistics. We generalize previous work on indexing nodes for XML [128],
augmentation [127], and multi-category retrieval from flat documents [137] to
flexible information retrieval on XML documents. The section also discusses
the semantics of different query types under the vector space retrieval model.
Section 6.4 concludes the chapter and points to future work.
Our approach adopts some of these ideas and generalizes them such that consistent retrieval with arbitrary query granularities, i.e., arbitrary combinations of element types, is feasible. This makes the restriction of retrieval
granularity to indexing nodes obsolete and allows for flexible retrieval from
XML collections. Our approach to guarantee consistent ranking builds on our
own previous work on conventional retrieval with flat (unstructured) docu-
ments from different domains [137]. We extend this previous work here to
cover hierarchically structured documents such as XML documents as well.
Fig. 6.2. Basic indexing nodes of the XML document in Fig. 6.1. Edges are annotated with augmentation weights (0.7, 0.92, 1.0).
Single-Category Retrieval
Single-category retrieval with XML works on a basic indexing node. For ex-
ample, the path /bookstore/medicine/book/title defines a single category in Fig. 6.2. The granularity of retrieval is the individual element in the category.
The following discussion adopts the usual definition of the retrieval status value under the vector space retrieval model (cf. Definition 6.1): As usual, t denotes a term, and tf(t, e) is its term frequency in an element e. Let N_cat and ef_cat(t) denote the number of elements in the single category cat and the element frequency of term t in cat, respectively. In analogy to the inverted document frequency of conventional vector space retrieval, we define the inverted element frequency (ief) as

\mathit{ief}_{cat}(t) = \log \frac{N_{cat}}{\mathit{ef}_{cat}(t)} \qquad (6.2)
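For instance, with assumed values of N_cat = 1000 title elements of which ef_cat(t) = 10 contain the term t (these numbers are illustrative, not taken from the chapter), equation (6.2) gives

\mathit{ief}_{cat}(t) = \log \frac{1000}{10} = \log 100

i.e., 2 when base-10 logarithms are used.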
Multi-category Retrieval
As the subsequent definitions show, query processing first computes the statistics for each single category as defined in Definition 6.3 and then integrates them into the multi-category ones as follows. Let M_q denote the set of basic indexing nodes that the multi-category query q covers. The integrated inverted element frequency for multi-category retrieval is then

\mathit{ief}_{mcat}(t, M_q) = \log \frac{\sum_{cat \in M_q} N_{cat}}{\sum_{cat \in M_q} \mathit{ef}_{cat}(t)} \qquad (6.4)

where ef_cat(t) denotes the single-category element frequency of term t in category cat. Using again tf·idf ranking, the retrieval status value of an element e for a multi-category query q is

\mathit{rsv}(e, q) = \sum_{t \in \mathit{terms}(q)} \mathit{tf}(t, e) \cdot \mathit{ief}_{mcat}(t, M_q)^2 \cdot \mathit{tf}(t, q) \qquad (6.5)
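To see how this integration differs from reusing per-category statistics, assume (purely for illustration) two categories with N_cat = 1000, ef_cat(t) = 10 and N_cat = 200, ef_cat(t) = 90. Equation (6.4) then yields

\mathit{ief}_{mcat}(t, M_q) = \log \frac{1000 + 200}{10 + 90} = \log 12

whereas the two single-category values would be \log 100 and \log(200/90) \approx \log 2.2; the integrated statistic reflects how discriminative the term is over the combined scope of the query.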
Nested Retrieval
Another type of request operates on complete subtrees of the XML documents. The path expression /bookstore/medicine/book for instance defines such a subtree for the XML document in Fig. 6.1. However, there are the following three difficulties with this retrieval type:
• A path expression such as the one given above comprises different cat-
egories in its XML subtree. With the element types from Fig. 6.2 for
instance, these are the title and paragraph elements. Hence, retrieval over
the complete subtree must consider these element types in combination
to provide a consistent ranking.
• Terms that occur close to the root of the subtree are typically considered
more significant for the root element than ones on deeper levels of the
subtree. Intuitively: the larger the distance of a node from its ancestor is,
the less it contributes to the relevance of its ancestor. Fuhr et al. [127,
128] tackle this issue by so-called augmentation weights which downweigh
term weights when they are pushed upward in hierarchically structured
documents such as XML documents.
• Element containment is at the instance level, and not at the type level.
This is because some element may contain a particular sub-element while
others do not. Take the XML document from Fig. 6.1 for instance: some
book elements do not have an example chapter. Consequently, element
containment relations cannot be derived completely from the element type
nesting.
6 Flexible Information Retrieval on XML Documents 105
Let e ∈ cat denote an element from category cat where cat qualifies for
the path expression of the nested-retrieval query. Let SE (e) denote the set
of sub-elements of e including e, i.e., all elements in the sub-tree rooted at
e. For each se ∈ SE (e), l ∈ path(e, se) stands for a label along the path
from e to se, and aw_l ∈ [0.0, 1.0] is its augmentation weight as defined by the annotations of the edges in the XML structure (cf. Fig. 6.2). ief_nest(t) stands for the integrated inverted element frequency of term t with nested retrieval:

\mathit{ief}_{nest}(t) = \log \frac{N_{cat}}{\mathit{ef}_{cat}(t)} = \log \frac{N_{cat}}{\sum_{e \in cat} \chi(t, e)} \qquad (6.6)
where N_cat is again the number of elements in category cat and ef_cat(t) is the element frequency of term t in cat, i.e., the number of elements of cat that contain t in their subtree. To determine ef_cat(t), we define χ(t, e) as follows:
\chi(t, e) = \begin{cases} 1, & \text{if } \sum_{se \in SE(e)} \mathit{tf}(t, se) > 0 \\ 0, & \text{otherwise} \end{cases} \qquad (6.7)
Thus, χ(t, e) is 1 if e or at least one of its sub-elements contains t. The retrieval status value rsv of an element e ∈ cat under a nested-retrieval query q, using tf·idf ranking, is then:

\mathit{rsv}(e, q) = \sum_{se \in SE(e)} \sum_{t \in \mathit{terms}(q)} \Bigl( \prod_{l \in \mathit{path}(e, se)} aw_l \Bigr) \, \mathit{tf}(t, se) \, \mathit{ief}_{nest}(t)^2 \, \mathit{tf}(t, q) \qquad (6.8)
Definition 6.8 reverts to the common tf idf ranking for conventional re-
trieval on flat documents when all augmentation weights are equal to 1.0
and when the elements queried are the root nodes of the XML documents.
In the trivial case where a nested query only comprises one single-category,
Definition 6.8 equals Definition 6.3.
6.4 Conclusions
Flexible retrieval is crucial for document centric processing of XML. Flexi-
ble retrieval means that users may dynamically, i.e., at query time, define the
scopes of their queries. So far, consistent retrieval on XML collections has only
been feasible at fixed granularities [128, 296]. The difficulty is to treat statis-
tics such as document frequencies properly in the context of hierarchically
structured data with possibly heterogeneous contents. Our approach allows
for flexible retrieval over arbitrary combinations of element types. In this chap-
ter, we propose single-category retrieval, multi-category retrieval, and nested
retrieval for flexible retrieval from XML documents. To tackle the aforemen-
tioned difficulty, we rely on basic index and statistics data and integrate them
on-the-fly, i.e., during query processing, to query-specific statistics that prop-
erly reflect the scope of the query. Taking the vector space retrieval model for
7
Statistical Language Models for Intelligent XML Retrieval
Djoerd Hiemstra
7.1 Introduction
The XML standards that are currently emerging have a number of character-
istics that can also be found in database management systems, like schemas
(DTDs and XML schema) and query languages (XPath and XQuery). Fol-
lowing this line of reasoning, an XML database might resemble traditional
database systems. However, XML is more than a language to mark up data;
it is also a language to mark up textual documents. In this chapter we specif-
ically address XML databases for the storage of ‘document-centric’ XML (as
opposed to ‘data-centric’ XML [42]).
Document-centric XML is typically semi-structured, that is, it is charac-
terised by less regular structure than data-centric XML. The documents might
not strictly adhere to a DTD or schema, or possibly the DTD or schema might
not have been specified at all. Furthermore, users will in general not be inter-
ested in retrieving data from document-centric XML: They will be interested
in retrieving information from the database. That is, when searching for doc-
uments about “web information retrieval systems”, it is not essential that the
documents of interest actually contain the words “web”, “information”, “re-
trieval” and “systems” (i.e., they might be called “internet search engines”).
An intelligent XML retrieval system combines ‘traditional’ data retrieval
(as defined by the XPath and XQuery standards) with information retrieval.
Essential for information retrieval is ranking documents by their probability,
or degree, of relevance to a query. On a sufficiently large data set, a query
for “web information retrieval systems” will retrieve many thousands of doc-
uments that contain any, or all, of the words in the query. As users are in
general not willing to examine thousands of documents, it is important that
the system ranks the retrieved set of documents in such a way that the most
promising documents are ranked on top, i.e. are the first to be presented to
the user.
Unlike the database and XML communities, which have developed some
well-accepted standards, the information retrieval community does not have
1. IT magazines
2. +IT magazine* -MSDOS
3. "IT magazines"
4. IT NEAR magazines
5. (IT OR computer) (books OR magazines OR journals)
6. XML[0.9] IR[0.1] title:INEX site:utwente.nl
Figure 7.1 gives some example queries from these systems. The first query
is a simple “query by example”: retrieve a ranked list of documents about IT
magazines. The second query shows the use of a mandatory term operator
‘+’, stating that the retrieved document must contain the word IT,1 a wild
card operator ‘*’ stating that the document might match “magazine”, but
also “magazines” or “magazined” and the ‘-’ operator stating that we do
not prefer IT magazines about MSDOS. The third and fourth queries search
for documents in which “IT” and “magazines” occur adjacent or near to
each other, respectively. The fifth query shows the use of the ‘OR’ operator, stating
that the system might retrieve documents about “IT magazines”, “computer
magazines”, “IT journals”, “IT books”, etc. The sixth and last query shows
the use of structural information, very much like the kind of functionality that
is provided by XPath; so “title:INEX” means that the title of the document
should contain the word “INEX”. The last query also shows additional term
weighting, stating that the user finds “XML” much more important than “IR”.
An intelligent XML retrieval system should support XPath and all of the
examples above. For a more comprehensive overview of information retrieval
requirements, we refer to Chap. 3.
This chapter shows that statistical language models provide some inter-
esting alternative ways of thinking about intelligent XML search. The rest of
the chapter is organised as follows: Section 7.2 introduces the language mod-
elling approach to information retrieval, and shows how language modelling
concepts like priors, mixtures and translation models, can be used to model
intelligent retrieval from semi-structured data. Section 7.3 reports the exper-
1 Note that most retrieval systems do not distinguish upper case from lower case,
and confuse the acronym “IT” with the very common word “it”.
Note that the denominator on the right hand side does not depend on the XML
element X. It might therefore be ignored when a ranking is needed. The prior
P (X) however, should only be ignored if we assume a uniform prior, that is,
if we assume that all elements are equally likely to be relevant in absence of
a query. Some non-content information, e.g. the number of accesses by other
users to an XML element, or e.g. the length of an XML element, might be
used to determine P (X).
Let’s turn our attention to P (q1 , q2 , · · · , qn |X). The use of probability
theory might here be justified by modelling the process of generating a query
Q given an XML element as a random process. If we assume that the current
page in this book is an XML element in the data, we might imagine picking a
word at random from the page by pointing at the page with closed eyes. Such
a process would define a probability P (q|X) for each term q, which would be
defined by the number of times a word occurs on this page, divided by the
total number of words on the page. Similar generative probabilistic models
have been used successfully in speech recognition systems [243], for which
they are called “language models”.
The mechanism above suggests that terms that do not occur in an XML
element are assigned zero probability. However the fact that a term is never
observed does not mean that this term is never entered in a query for which
the XML element is relevant. This problem – i.e., events which are not ob-
served in the data might still be reasonable in a new setting – is called the
sparse data problem in the world of language models [209]. In general, zero
probabilities should be avoided. A standard solution to the sparse data prob-
lem is to interpolate the model P (q|X) with a background model P (q) which
assigns a non-zero probability to each query term. If we additionally assume
that query terms are independent given X, then:
\[
  P(q_1, q_2, \cdots, q_n \mid X) \;=\; \prod_{i=1}^{n} \bigl( (1-\lambda)\, P(q_i) + \lambda\, P(q_i \mid X) \bigr)
  \qquad (7.2)
\]
Equation 7.2 defines our basic language model if we assume that each
term is generated independently from previous terms given the relevant XML
element. Here, λ is an unknown mixture parameter, which might be set using
e.g. relevance feedback of the user. The probability P (qi ) is the probability of
the word qi in ‘general query English’. Ideally, we would like to train P (qi )
on a large corpus of queries. In practice however, we will use the document
collection to define these probabilities as the number of times the word oc-
curs in the database, divided by the size of the database, measured in the
total number of word occurrences. It can be shown by some simple rewriting
that Equation 7.2 can be implemented as a vector space weighting algorithm,
where λP(qi|X) resides on the ‘tf-position’ and 1/((1−λ)P(qi)) resides on the
‘idf-position’. The following ‘vector-space-like’ formula assigns zero weight to
words not occurring in an XML element, but ranks the elements in exactly the
same order as the probability measure of Equation 7.2 [163]:
\[
  P(q_1, q_2, \cdots, q_n \mid X) \;\propto\; \sum_{i=1}^{n} \log \Bigl( 1 + \frac{\lambda\, P(q_i \mid X)}{(1-\lambda)\, P(q_i)} \Bigr)
  \qquad (7.3)
\]
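As a concrete illustration of Equation 7.3, the following sketch ranks XML elements with the mixture model; the toy elements and the choice λ = 0.5 are assumptions made for the example, not values from the chapter.

import math
from collections import Counter

# Hypothetical toy collection: each XML element is represented only by its bag of words.
elements = {
    "article1/body": "web information retrieval systems for the web".split(),
    "article2/body": "database systems and query languages".split(),
}
collection = [w for words in elements.values() for w in words]
bg = Counter(collection)                       # background ('general query English') statistics
N = len(collection)

def score(element_words, query, lam=0.5):
    """Rank-equivalent form of Eq. 7.3: sum of log(1 + lam*P(q|X)/((1-lam)*P(q)))."""
    tf = Counter(element_words)
    s = 0.0
    for q in query:
        p_q_x = tf[q] / len(element_words)     # P(q|X): relative frequency in the element
        p_q = bg[q] / N                        # P(q): relative frequency in the collection
        if p_q > 0:                            # words unseen in the element contribute zero
            s += math.log(1 + (lam * p_q_x) / ((1 - lam) * p_q))
    return s

query = "web information retrieval systems".split()
ranking = sorted(elements, key=lambda e: score(elements[e], query), reverse=True)
print(ranking)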
Why would we prefer the use of language models over the use of e.g. a
vector space model with some tf.idf weighting algorithm as e.g. described
by [259]? The reason is the following: our generative query language model
gives a nice intuitive explanation of tf.idf weighting algorithms by means of
calculating the probability of picking at random, one at a time, the query
terms from an XML element. We might extend this by any other generating
process to model complex information retrieval queries in a theoretically sound
way that is not provided by a vector space approach.
For instance, we might calculate the probability of complex pro-
cesses like the following: What is the probability of sampling either “Smith”
or “Jones” from the author element, and sampling “software” and “engineer-
ing” from either the body element or from the title element? Probability
theory will provide us with a sound way of coming up with these probabili-
ties, whereas a vector space approach provides us with little clues on how to
combine the scores of words on different XML elements, or how to distinguish
between “Smith” or “Jones”, and “Smit” and “Jones”.
Instead of one unknown mixture parameter, we now have to set the value
of two unknown mixture parameters: α and β (where γ = 1 − α − β).
P (qi |X, title) would simply be defined by the number of occurrences of qi
in the descendant title of X divided by the total number of words in the de-
scendant title of X, and P (qi |X, abstract) would be defined similarly for the
descendant abstract.
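The equation for this three-component mixture lies outside the excerpt; by analogy with Equation 7.2 it plausibly takes a form like the following sketch, where the γ-weighted component is assumed to be the background model (an assumption, not a quote from the chapter):

\[
  P(q_i \mid X) \;\approx\; \alpha\, P(q_i \mid X, \mathit{title}) + \beta\, P(q_i \mid X, \mathit{abstract}) + \gamma\, P(q_i),
  \qquad \gamma = 1 - \alpha - \beta
\]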
In other words, the mixture expresses something similar to the logical OR:
if a word q should match either XML element X or a related XML element Y ,
then the probability is calculated by a mixture. Note that we cannot simply
add the probabilities without the mixture parameters, because the two events
are not disjoint, that is, a word might match both X and Y .
The unknown mixture parameters play a role that is similar to the aug-
mentation weights described in Chap. 4 and 6 of this book. Both are essen-
tially unknown parameters that determine the importance of XML elements
relative to some related XML elements. The main difference between the
augmentation weights and the mixture parameters of the language models,
is that the augmentation weights are propagated upwards from a leaf node
to its parent, whereas the language models might combine XML elements in
an ad-hoc way. Interestingly, as said above, a two-component mixture of an
element and the document root, behaves like a vector space approach with
tf.idf weights.
In this formula, q sums over all possible words, or alternatively over all words
for which P (ci |q) is non-zero. Given the example above, the sum would include
P (c|lecture) P (lecture|X), P (c|course) P (course|X), etc. Superficially, this
looks very similar to the mixture model. Like the mixtures, the translation
models also express something similar to the logical OR: if an element should
match either the word “lecture”, or the word “course”, then we can add the
probabilities weighted by the translation probabilities. Note however, that the
translation probabilities do not necessarily sum up to one, because they are
conditioned on different qs. Adding the probabilities is allowed because the
qs are disjoint, i.e., a single word occurrence can never be both “lecture” and
“course”. This is like adding the probabilities of tossing a 5 or a 6 with a fair
die: it is impossible to throw both a 5 and a 6 in a single toss, so we can add
the probabilities: 1/6 + 1/6 = 1/3.
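Equation 7.4 itself is not part of this excerpt, but the description above fixes its shape as a translation sum over all words q (written here as a sketch):

\[
  P(c_i \mid X) \;=\; \sum_{q} P(c_i \mid q) \, P(q \mid X)
\]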
Translation models might play a role in using ontologies for ‘semantic’
search of XML data as described in Chap. 8 by Schenkel, Theobald and
Weikum. They introduce a new operator to express semantic similarity search
conditions. As in cross-language retrieval, ontology-based search will retrieve
an element that matches words that are related, according to the ontology,
to the word in the query. If we follow the approach by Schenkel et al., the
ontology might define P (ci |q) in Equation 7.4 as the probability of a concept
ci , given a word q.
Maybe the easiest language modelling concept to experiment with is the XML
element prior P (X). The prior P (X) defines the probability that the user
likes the element X if we do not have any further information (i.e., no query).
An example of the usefulness of prior knowledge is the PageRank [49] algo-
rithm that analyses the hyperlink structure of the world wide web to come
up with pages to which many documents link. Such pages might be highly
recommended by the system: If we do not have a clue what the user is look-
ing for, an educated guess would be to prefer a page with a high pagerank
over e.g. the personal home page of the author of this chapter. Experiments
show that document priors can provide over 100 % improvement in retrieval
performance for a web entry page search task [192]. The usefulness of some
simple priors for XML search is investigated in Section 7.3.
7.2.6 Discussion
This section presented some interesting new ways of thinking about intelligent
XML retrieval. Whether these approaches perform well in practice, has to be
determined by experiments on benchmark test collections as e.g. provided by
INEX. Preliminary experiments are described in the next section.
However, experience with language models on other tasks looks promising.
Recent experiments that use translation models for cross-language retrieval
[162], document priors for web search [192], and mixture models for video
retrieval [315] have shown that language models provide top performance on
these tasks. Other systems that use language models for intelligent XML re-
trieval are described by Ogilvie and Callan [233], and by List and De Vries
[204].
The preliminary prototype should in principle support ‘all of XPath and all
of IR’. In order to support XPath, the system should contain a complete
representation of the XML data. The system should be able to reproduce any
part of the data as the result of the query. For XPath we refer to [29].
For our first prototype we implemented the XML relational storage scheme
proposed in Chap. 16 by Grust and Van Keulen. They suggest assigning two
identifiers (ids) to each instance node: one id is assigned in pre-order, and
the other in post-order. The pre and post order assignment of XML element
ids provides elegant support for processing XPath queries, forming an alter-
native to explicit parent-child relations which are often used to store highly
structured data in relational tables [116, 303, 271].2
2 Actually, Grust et al. store the id of the parent as well. Similarly, Schmidt et al.
[271] add a field to keep track of the order of XML elements; here we emphasise
different viewpoints.
Note that pre and post order assignment can be done almost trivially in
XML by keeping track of the order of respectively the opening and closing
tags. Since we are going to build a textual index for content-based retrieval,
we assign an id (or position) to each word in the XML text content as well.
The word positions are used in a term position index to evaluate phrasal
queries and proximity queries. Interestingly, if we number the XML data as a
linearised string of tokens (including the content words), we obey the pre/post
order id assignment, but we also allow the use of theory and practice of region
algebras (see Chap. 12). For a more detailed description of the storage scheme,
we refer to [161].
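A minimal sketch of the numbering idea, assuming the XML has already been linearised into a stream of opening tags, closing tags and content words (the exact relational scheme of Chap. 16 and [161], e.g. the additional parent id, is not reproduced here):

def number_tokens(tokens):
    """One left-to-right pass over the linearised XML: every token gets a running
    number; an element's pre id is the number of its opening tag, its post id the
    number of its closing tag, and every content word keeps its own position."""
    stack, elements, words = [], [], []
    for position, (kind, value) in enumerate(tokens, start=1):
        if kind == "open":
            stack.append((value, position))
        elif kind == "close":
            tag, pre_id = stack.pop()
            elements.append((tag, pre_id, position))   # (tag, pre, post)
        else:
            words.append((value, position))            # (word, position)
    return elements, words

tokens = [("open", "article"), ("open", "title"), ("word", "star"), ("close", "title"),
          ("open", "body"), ("word", "galaxy"), ("close", "body"), ("close", "article")]
print(number_tokens(tokens))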
INEX is the Initiative for the Evaluation of XML Retrieval. The initiative
provides a large testbed, consisting of XML documents, retrieval tasks, and
relevance judgements on the data. INEX identifies two tasks: the content-only
task, and the content-and-structure task.
The content-only task provides 30 queries like the following example:
//*[. =~ "computational biology"] (“XPath & IR” for: any element about
“computational biology”). In this task, the system needs to identify the most
appropriate XML element for retrieval. The task resembles users that want to
search XML data without knowing the schema or DTD.
The content-and-structure task provides 30 queries like the following:
//article[ author =~ "Smith|Jones" and bdy =~ "software engineering" ]
(“XPath & IR” for: retrieve articles written by either Smith or Jones about
software engineering). This task resembles users or applications that do know
the schema or DTD, and want to search some particular XML elements while
formulating restrictions on some other elements.
For each query in both tasks, quality assessments are available. XML ele-
ments are assessed based on relevance and coverage. Relevance is judged on a
four-point scale from 0 (irrelevant) to 3 (highly relevant). Coverage is judged
by the following four categories: N (no coverage), E (exact coverage), L (the
XML element is too large), and S (the XML element is too small).
In order to apply traditional evaluation metrics like precision and recall,
the values for relevance and coverage must be quantised to a single quality
value. INEX suggests the use of two quantisation functions: Strict and lib-
eral quantisation. The strict quantisation function evaluates whether a given
retrieval method is capable of retrieving highly relevant XML elements: it as-
signs 1 to elements that have a relevance value 3, and exact coverage. The
liberal quantisation function assigns 1 to elements that have a relevance value
of 2 and exact coverage, or, a relevance value of 3 and either exact, too small,
or too big coverage. An extensive overview of INEX is given in Chap. 19 of
this volume.
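The two quantisation functions can be written down directly from this description (a small sketch; INEX's official definitions may differ in presentation):

def strict(relevance, coverage):
    # 1 only for highly relevant elements with exact coverage
    return 1 if relevance == 3 and coverage == "E" else 0

def liberal(relevance, coverage):
    # 1 for fairly relevant elements with exact coverage, or highly relevant
    # elements with exact, too-small, or too-large coverage
    if relevance == 2 and coverage == "E":
        return 1
    if relevance == 3 and coverage in ("E", "S", "L"):
        return 1
    return 0

print(strict(3, "E"), strict(2, "E"), liberal(2, "E"), liberal(3, "S"))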
in between the two extremes. The prior is defined by P (X) = 100 + number of
tokens in the XML element. Of course, the priors should be properly scaled,
but the exact scaling does not matter for the purpose of ranking. We hy-
pothesise that the system using the length prior will outperform the baseline
system.
This section presents the evaluation results of three retrieval approaches (no
prior, ‘half’ prior, and length prior) on two query sets (content-only, and
content-and-structure), following two evaluation methods (strict and liberal).
We will report for each combination the precision at respectively 5, 10, 15,
20, 30 and 100 documents retrieved.
Strict Evaluation
Table 7.1 shows the results of the three experiments on the content-only
queries following the strict evaluation. The precision values are averages over
22 queries. The results show an impressive improvement of the length prior
on all cut-off values. Apparently, if the elements that need to be retrieved are
not specified in the query, users prefer larger elements over smaller elements.
Table 7.2 shows the results of the three experiments on the content-and-
structure queries following the strict evaluation. The precision values are av-
erages over 28 queries. The baseline system performs much better on the
content-and-structure queries than on the content-only queries. Surprisingly,
the length prior again leads to substantial improvement on all cut-off values
in the ranked list.
Liberal Evaluation
Table 7.3 shows the results of the three experiments on the content-only
queries using the liberal quantisation function defined above for evaluation.
The precision values are averages over 23 queries. Again, the results show a
significant improvement of the length prior on all cut-off values.
Table 7.4 shows the results of the three experiments on the content-and-
structure queries following the liberal evaluation. The precision values are
averages over 28 queries. The length prior again shows better performance
on all cut-off values. Note that the content-only task and the content-and-
structure task show practically equal performance if the liberal evaluation
procedure is followed.
7.4 Conclusions
In this chapter we described in some detail the ideas behind the language
modelling approach to information retrieval, and suggested several advanced
language modelling concepts to model intelligent XML retrieval. We presented
a preliminary implementation of a system that supports XPath and complex
information retrieval queries based on language models. From the experiments
we conclude that it is beneficial to assign a higher prior probability of relevance
to bigger fragments of XML data than to smaller XML fragments, that is, to
users, more information seems to be better information.
Whether the advanced modelling constructs presented in Section 7.2 will
in fact result in good retrieval performance will be evaluated in the CIRQUID
project (Complex Information Retrieval Queries in a Database). In this
project, which is run in cooperation with CWI Amsterdam, we will develop
a logical data model that allows us to define complex queries using advanced
language modelling primitives.
8
Ontology-Enabled XML Search
Ralf Schenkel, Anja Theobald, and Gerhard Weikum
8.1 Introduction
8.1.1 Motivation
XML is rapidly evolving towards the standard for data integration and ex-
change over the Internet and within intranets, covering the complete spec-
trum from largely unstructured, ad hoc documents to highly structured,
schematic data. However, established XML query languages like XML-QL [96]
or XQuery [34] cannot cope with the rapid growth of information in open en-
vironments such as the Web or intranets of large corporations, as they are
bound to boolean retrieval and do not provide any relevance ranking for the
(typically numerous) results. Recent approaches such as XIRQL [128] or our
own system XXL [295, 296] that are driven by techniques from information
retrieval overcome the latter problem by considering the relevance of each
potential hit for the query and returning the results in a ranked order, using
similarity measures like the cosine measure. But they are still tied to keyword
queries, which is no longer appropriate for highly heterogeneous XML data
from different sources, as is the case on the Web or in large intranets.
In such large-scale settings, both the structure of documents and the ter-
minology used in documents may vary. As an example, consider documents
about courses in computer science, where some authors talk about “lectures”
while others prefer to use “course”, “reading”, or “class”. Boolean queries
searching for lectures on computer science cannot find any courses or other
synonyms. Additionally, courses on database systems will not qualify for the
result set, even though database systems is a branch of computer science. So in
order to find all relevant information to a query, additional knowledge about
related terms is required that allows us to broaden the query, i.e., extend-
ing the query with terms that are closely related to the original query terms.
However, imprudent broadening of the query may be misleading in some cases,
when the extended query yields unwanted, irrelevant results. Consider a user
searching for lectures on stars and galaxies. When we extend the query using
related terms to “star”, we will add terms like “sun” and “milky way” that
help in finding better results, but also terms like “movie star” or “hollywood”
which are clearly misleading here. This can happen because words typically
have more than one sense, and it is of great importance to choose the right
sense for extending the query. Such information can be delivered by an ontol-
ogy, which models terms with their meanings and relationships between terms
and meanings.
corpus [247, 179, 203, 248]. A detailed comparison of similarity measures for
WordNet can be found in [52]; [211] and [175] compare measures based on
WordNet with similar measures for Roget’s Thesaurus [174].
Semi-automatic or automatic ontology construction is proposed in [207,
208, 186, 195, 188, 286, 50] and is mostly based on methods of text mining
and information extraction based on natural language processing using an
existing thesaurus or a text processor such as SMES [207] or GATE [87].
Merging ontologies across shared sub-ontologies is described in [27, 202]. Some
comprehensive systems for developing or using ontologies are OntoBroker [92],
Text-To-Onto [207, 208], GETESS [286, 50], Protégé 2000 [232], LGAccess [8],
KAON [43], Ontolingua [113], and FrameNet [20].
To our knowledge, the role of ontologies in searching semistructured data
has not yet been discussed in any depth. The unique characteristic of our
approach lies in the combination of ontological knowledge and information
retrieval techniques for semantic similarity search on XML data.
A widely accepted definition for an ontology is the one by Gruber and Guar-
ino [145, 148]: an ontology is a specification of a representational vocabulary of
words (or terms), including hierarchical relationships and associative relation-
ships between these words. It is used for indexing and investigation as well
as to support knowledge sharing and reuse. However, this definition is not
precise enough to be used for building search engines and information man-
agement applications; we need a formal apparatus to this end. In this section,
we develop a model for an ontology that, while still capturing the ideas of the
informal definition, is precise enough to be implemented in our XXL search
engine for ranked retrieval on XML data.
As a building block and information pool for building our ontology, we
make use of WordNet [114], an extensive electronic lexical database. Word-
Net captures the different senses of words and semantic relationships between
them, among them hypernymy, synonymy and holonymy. Given a word, Word-
Net returns the senses of this word (represented by a short phrase that explains
the sense), optionally together with related words for each sense. For example,
for the word “star” WordNet returns the following word senses, denoted by
their textual descriptions and ordered by descending usage frequency:
1. (astronomy) a celestial body of hot gases that radiates energy derived
from thermonuclear reactions in the interior
2. someone who is dazzlingly skilled in any field
3. a plane figure with 5 or more points; often used as an emblem
4. an actor who plays a principal role
5. a performer who receives prominent billing
and two further senses that are less commonly used. Another prominent ex-
ample is “Java” with three completely different senses (the programming lan-
guage, the island and the coffee).
In fact, most words are ambiguous, i.e., they have more than one word
sense. For example, WordNet currently covers about 75,000 different senses
of nouns, but about 138,000 noun-sense pairs, so each word has about two
different senses on average. The word alone is therefore not enough to uniquely
represent one of its senses, so we are using pairs of the form (word,sense) to
represent semantic concepts. More formally, we are considering words as terms
over a fixed alphabet Σ. A word w together with its word sense s (or sense
for short) forms a concept c = (w, s), i.e., the precise meaning of the word
when used in this sense. The set of all such concepts is called the universe U .
In order to determine which concepts are related, we introduce semantic
relationships between concepts. Among the most common relationships in on-
tologies are hypernymy and hyponymy: we say that a concept c is a hypernym
(hyponym) of another concept c′ if the sense of c is more general (more spe-
cific) than the sense of c′. We are also considering holonyms and meronyms,
i.e., c is a holonym (meronym) of c′ if c′ means something that is a part of
something meant by c (vice versa for meronyms). Finally, two concepts are
called synonyms when their senses are identical, i.e., their meaning is the
same. Note that these are exactly the relationships supported by WordNet.
There may be further semantic relationships between concepts that could be
easily integrated into our framework, but we restrict ourselves to hypernymy
and synonymy in this chapter.
Based on these definitions we now define the ontology graph which is a
data structure to represent concepts and relationships between them. This
graph has the concepts as nodes and an edge between two concepts whenever
there is a semantic relationship between them. Additionally, we label each
edge with the type of the underlying relationship of the edge. Figure 8.1
shows an example for an excerpt of the ontology graph around the first sense
for the term “star”, limited to hypernym, holonym and synonym edges for
better readability, and already augmented with edge weights that will be
explained in the next subsection. As each hypernym edge is accompanied
[Figure 8.1 (see caption below) depicts an ontology-graph excerpt: concept nodes such as “natural object”, “group”, “universe”, “celestial body”/“heavenly body”, “galaxy”, “milky way”, “star”, “sun”, and “beta centauri”, connected by hypernym, holonym, and synonym edges carrying weights between 0.2 and 1.0.]
Fig. 8.1. Excerpt of an ontology around the first sense of the term “star”, augmented
with edge weights
metric on vector spaces, e.g. the cosine measure. However, as such feature
vectors would typically be quite sparse, the distance of most concepts would
be close to or equal to zero, which is too restrictive.
A more promising approach is to apply probabilistic models using the
probability distribution of the concepts, i.e., the words in their selected sense,
in documents. If we manage to capture this distribution, we obtain an ap-
proximation of the similarity by computing the correlation of the concepts.
However, in our setting (with the Web as the source for documents), the con-
cept distribution is unknown, so we have to use some pragmatic approach for
collecting statistics. We could approximate the concept distribution using ap-
proximations of the frequency f (c) of a concept c in a very large text corpus,
e.g., the result of a topic-specific crawl or the entire Web. Here, f (c) means
how often the word of c and all the words from the textual representation of
its sense occur in pages in the corpus. In order to use the Web, we apply a
Web search engine like Google or Altavista to get approximations for the fre-
quencies. While the numbers that these engines return for the frequencies are
not meant to be exact in any way, we believe that the relationship between
the numbers is reasonable.
Based on the frequency values, we compute the correlation of the concepts
using correlation coefficients. Among the candidates are the Dice coefficient,
the Jaccard coefficient or the Overlap coefficient (see Table 8.1). We choose the
Dice coefficient for our statistics.

Table 8.1. Correlation coefficients:
    Dice coefficient:     2 · f(x ∩ y) / (f(x) + f(y))
    Jaccard coefficient:  f(x ∩ y) / f(x ∪ y)
    Overlap coefficient:  f(x ∩ y) / min{f(x), f(y)}

As an example, consider the semantic simi-
larity between the concepts “universe” and “galaxy” shown in Figure 8.1. To
compute their Dice coefficient, we execute Web queries of the form “universe
collection existing things”, “galaxy collection star systems”, and “universe col-
lection existing things galaxy star systems” to obtain the counts for f (x), f (y),
and f (x ∩ y), respectively. For synonyms, we always set the similarity to 1.
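A sketch of this statistics step; the frequency counts below are invented placeholders standing in for the numbers a Web search engine would return for the three queries:

def dice(f_x, f_y, f_xy):
    # Dice coefficient from Table 8.1: 2*f(x AND y) / (f(x) + f(y))
    return 2 * f_xy / (f_x + f_y)

def concept_similarity(f_x, f_y, f_xy, synonyms=False):
    # synonyms get similarity 1 by definition; otherwise use the Dice coefficient
    return 1.0 if synonyms else dice(f_x, f_y, f_xy)

# Invented example counts for "universe ..." (x), "galaxy ..." (y), and the combined query
f_universe, f_galaxy, f_both = 120_000, 95_000, 18_000
print(concept_similarity(f_universe, f_galaxy, f_both))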
However, while this gives good results for most concept pairs, it can still
be improved. The word itself and the textual representation may not always
contain enough words to discriminate the different senses of the word. As
an example, consider again the concepts “universe” and “galaxy”. When we
evaluate the query “universe collection existing things” to compute f (x), this
yields not only documents about astronomy, but many other, nonrelated doc-
uments, so the resulting frequency value will be too high. In order to make
frequency values more precise, it may be helpful to consider not only the
terms of the concept and its textual description, but also the terms of a lo-
cal context of the concept, i.e., other concepts that are closely related, like
synonyms, hypernyms, hyponyms, or siblings in the graph (i.e., successors of
a predecessor of a given node). When we apply such a context in the example,
the resulting query for f (x) could be “universe collection existing things star
body hot gases”, integrating terms from the holonym “star”.
The similarity sim(v, w) of the two nodes is then defined as the maximal
similarity over all paths between v and w:
\[
  sim(v, w) \;=\; \max_{p \,\in\, paths(v, w)} \{\, sim_p(v, w) \,\}
  \qquad (8.2)
\]
The rationale for this formula is that the length of a path has direct influ-
ence on the similarity score. The similarity score for a short path will typically
be better than the similarity score for a longer one, unless the path consists
only of synonyms that have similarity 1 by definition. However, the shortest
path does not need to be the path with the highest similarity, as the triangle
inequality does not necessarily hold for the ontological graph structure. So
in order to determine the similarity of two concepts, it is not sufficient to
calculate the similarity score along the shortest path between the concepts.
Instead, we need an appropriate algorithm that takes into account all possible
paths between the concepts, calculates the similarity scores for all paths and
chooses the maximum of the scores for the similarity of the concepts.
RELAX(u,v,weight)
  If w[v] < w[u]*weight(u,v) Then
    w[v] := w[u]*weight(u,v)

DIJKSTRA(s,V,E,weight)
  For Each vertex v In V Do
    w[v] := 0;              // current similarity estimate to the start node s
  w[s] := 1;
  PriorityQueue Q := V;     // priorities are the values in array w[]
  Set S := empty;
  While Q Is Not Empty Do   // main loop as described below
    u := ExtractMax(Q);
    S := S + {u};
    For Each neighbor v Of u Do
      RELAX(u,v,weight);
The array w[] holds the current estimation for the maximal distance of each node to the start node s,
which is initialized with 0 for all nodes except the start node s (that repre-
sents the given concept c). All nodes are inserted into a priority queue Q with
the priority being their value in the array w. Inside the main loop, the node
with the highest similarity is extracted from the priority queue and inserted
into the set S that keeps all nodes for which the computation is finished. All
nodes that are direct neighbors of the current node are then considered for
updating their similarity by the RELAX operation that assigns a new similarity
to a node if its current similarity is smaller than the similarity over the path
from the start node over the current node to the considered neighbor.
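For illustration, a compact Python version of this similarity propagation, assuming the ontology graph is given as an adjacency list of (neighbour, edge weight) pairs; heapq with negated priorities emulates the max-priority queue:

import heapq

def concept_similarities(graph, start):
    """Modified Dijkstra: w[v] is the maximal product of edge weights over any
    path from start to v; edge weights are the similarities in (0, 1]."""
    w = {v: 0.0 for v in graph}
    w[start] = 1.0
    heap = [(-1.0, start)]
    done = set()
    while heap:
        _, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, weight in graph[u]:
            if w[v] < w[u] * weight:            # the RELAX step
                w[v] = w[u] * weight
                heapq.heappush(heap, (-w[v], v))
    return w

# Tiny made-up excerpt of an ontology graph (undirected, weighted edges)
graph = {
    "star": [("celestial body", 0.86), ("sun", 0.8)],
    "sun": [("star", 0.8)],
    "celestial body": [("star", 0.86), ("natural object", 0.3)],
    "natural object": [("celestial body", 0.3)],
}
print(concept_similarities(graph, "star"))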
8.2.4 Disambiguation
We already discussed that words may have more than a single sense, so it is not
immediately clear in which sense a word is used in a query or in a document.
It is fundamental to disambiguate the word, i.e., determine its current sense,
in order to make use of the ontology graph to find related concepts, e.g. to
broaden or refine a query. In this subsection we show how this process of
disambiguation works.
Starting with word w that may be a keyword of the query or a term
in a document, we look up w in our ontology graph, i.e., we find one or
more candidate concepts c1 , . . . , cm in the graph that have w as their term,
and identify possible word senses s1 , . . . , sm . As an example, consider again
the word “star” for which we found seven different word senses before. Now
the key question is: which of the possible senses of w is the right one? Our
approach to answer this question is based on word statistics for local contexts
of the candidate concepts on one hand and the word itself in either the query
or the document on the other hand. As far as the word context con(w) is
concerned, we choose other words around the word in the document or the
complete set of query terms if w is a keyword of a query, because we think
that the keywords in the query give the best hints towards the actual topic of
the query. Note, however, that disambiguating query terms is typically harder
than finding the right concept for a term of a document because there are
far less keywords in queries than words in documents. For the context of a
candidate concept ci we consider not only the concept itself but also some
context of ci in the ontology graph. Candidates for such a context of c are
its synonyms, all other immediate neighbors, and also the hyponyms of the
hypernyms (i.e., the siblings of c in the ontology graph). For the concepts in
the context of ci , we form the union of their words and corresponding texts,
eliminate stopwords, and construct thus the local context con(ci ) of candidate
concept ci .
The final step towards disambiguating the mapping of a keyword onto a
word sense is to compare the query terms with the contexts of candidates
con(c1 ) through con(cm ) in terms of a similarity measure between bags of
words. The standard IR measure for this purpose would be the cosine simi-
larity between the set of keywords and con(cj), or alternatively the Kullback-
Leibler divergence between the two word frequency distributions (note that
the context construction may add the same word multiple times, and this
information is kept in the word bag). Our implementation uses the cosine
similarity for its simpler computation. Finally, we map the keyword w onto
the candidate concept ci whose context has the highest similarity to the set
of keywords.
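The sense-mapping step can be sketched as follows; the candidate contexts are hand-made stand-ins for the WordNet-derived contexts described above:

import math
from collections import Counter

def cosine(bag_a, bag_b):
    a, b = Counter(bag_a), Counter(bag_b)
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def disambiguate(word_context, candidate_contexts):
    # map the word onto the candidate concept whose context is most similar
    return max(candidate_contexts, key=lambda c: cosine(word_context, candidate_contexts[c]))

# Hand-made contexts for two senses of "star"
candidates = {
    ("star", "celestial body"): "celestial body hot gases sun galaxy astronomy".split(),
    ("star", "principal actor"): "actor principal role movie film performer".split(),
}
query_terms = "star galaxy lecture".split()
print(disambiguate(query_terms, candidates))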
However, sometimes the information in the keywords of the query will not
be sufficient for a disambiguation. For example, a user may only specify that
she wants to find documents that have a title with the word “star”. In such
cases, it may be helpful to the user to broaden the query using all possible
candidate concepts and present the result grouped by the word sense. In our
example, we may present separate lists for all 7 senses of the term “star”,
provided we got results for all of the broadened queries. While this is not as
good as an automatic disambiguation, it helps the user to get only the results
that she intended, especially when we allow her to manually refine her query
to search only in one of the seven senses of the word.
The Flexible XML Search Language XXL has been designed to allow SQL-
style queries on XML data. We have adopted several concepts from XML-QL,
XQuery and similar languages as the core, with certain simplifications and
resulting restrictions, and have added capabilities for ranked retrieval and
ontological similarity. As an example for a query in XXL, consider one of our
examples from the introduction where someone searches for lectures on stars
and galaxies. This query could be expressed using XXL as shown in Figure 8.3.
The SELECT clause of an XXL query specifies the output of the query, e.g., all
bindings of certain element variables. The FROM clause defines the search space,
which can be a set of URLs or the index structure that is maintained by the
XXL engine. The WHERE clause specifies the search condition; it consists of the
logical conjunction of path expressions, where a path expression is a regular
expression over elementary conditions and an elementary condition refers to
the name or content of a single element or attribute. Regular expressions are
formed using standard operators like ’.’ for concatenation, ’|’ for union,
and ’*’ for the Kleene star. The operator ’#’ stands for an arbitrary path
of elements. Each path expression can be followed by the keyword AS and
a variable name that binds the end node of a qualifying path (i.e., the last
element on the path and its attributes) to the variable, that can be used
later on within path expressions, with the meaning that its bound value is
substituted in the expression.
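Figure 8.3 is not reproduced in this excerpt; a query of roughly the following shape, using the operators just described, could express the search for lectures on stars and galaxies (the concrete element names and the FROM clause are assumptions, not the original figure):

SELECT L
FROM   INDEX
WHERE  #.lecture AS L AND
       L.#.description ~ "star" AND
       L.#.description ~ "galaxy"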
The evaluation of the search conditions in the WHERE clause consists of the
following two main steps:
• The XXL query is decomposed into subqueries. A global evaluation order
for evaluating the various subqueries and a local evaluation order in which
the components of each subquery are evaluated are chosen.
• For each subquery, subgraphs of the data graph that match the query
graph are computed, exploiting the various indexes to the best possible
extent. The subresults are then combined into the result for the original
query.
Query Decomposition
The WHERE clause of an XXL query is of the form "WHERE P1 AS V1 AND ...
AND Pn AS Vn" where each Pi is a regular path expression over elementary
conditions and the Vi are element variables to which the end node of a match-
ing path is bound. Each regular path expression corresponds to a subquery
and can be described by an equivalent non-deterministic finite state automa-
ton (NFSA).
We restrict XXL queries so that the dependency graph between binding
and usage of variables is acyclic. Furthermore, we estimate the selectivity of
each subquery using simple statistics about the frequency of element names
and search terms that appear as constants in the subquery. Then we choose to
evaluate subqueries and bind the corresponding variables in ascending order
of selectivity (i.e., estimated size of the intermediate result).
The local evaluation order for a subquery specifies the order in which it is
attempted to match the states of the subquery’s NFSA with elements in the
data graph. The XXL prototype supports two alternative strategies: in top-
down order the matching begins with the start state of the NFSA and then
proceeds towards the final state(s); in bottom-up order the matching begins
with the final state(s) and then proceeds towards the start state.
For each subquery, simple path expressions with element names and the wild-
card symbol # are looked up in the EPI. For example, all occurrences of
a pattern #.lecture.description or lecture.#.description can be re-
trieved from the EPI. Content conditions are evaluated by the ECI, a text
index on element and attribute contents. For semantic similarity conditions
such as description ∼ "star" the ECI yields approximate matches and
a similarity score based on IR-style tf*idf measures [19, 209] and semantic
distances between concepts in the ontology. Finally, for semantic similarity
8.4 Conclusions
Ontologies are increasingly seen as a key asset to further automation of infor-
mation processing. Although many approaches for representing and applying
ontologies have already been devised, they have not found their way into
search engines for querying XML data. In this chapter we have shown how
ontologies with quantified semantic relationships can help to increase both
the recall and precision for queries on semistructured data. This is achieved
by broadening the query with closely related terms, thus yielding more re-
sults, but only after disambiguating query terms, so only relevant results are
included in the result of the query.
9
Using Relevance Feedback in XML Retrieval
Roger Weber
9.1 Introduction
Information retrieval has a long tradition: in the early days, the main focus was
on the retrieval of plain text documents and on search systems for books and
structured documents in (digital) libraries. Often, users were assisted by well-
trained librarians or specialists to retrieve documents fitting their information
need. With the proliferation of the internet, retrieval systems for further media
types like images, video, audio and semi-structured documents have emerged.
But more importantly, an ever increasing number of untrained users deploy
retrieval systems to seek information. Since most users lack a profound
understanding of how retrieval engines work and of how to properly describe
an information need, the retrieval quality is often not satisfactory due to bad
query formulations. As an illustration of this, Jansen et al. [177] reported that
62% of queries submitted to the Excite web search engine consisted of less than
three query terms. Obviously, this is by far insufficient to accurately describe
an information need. But search systems often provide little or no support for
users to adjust their queries to improve retrieval effectiveness.
As a countermeasure for the query refinement problem, relevance feedback
was introduced in the late 1960’s [169, 251]. The basic idea is to model the
search as an iterative and interactive process (cf. Figure 9.1) during which
the system assists users with the task of query refinement. To this end, the
user has to assign relevance values to the retrieved documents. This feed-
back together with the original query is processed according to a feedback
model and yields a new query which, hopefully, returns new and more rele-
vant documents. This iteration can continue until the user is satisfied or the
retrieval process is aborted. The feedback process bears a number of design
options: 1) capturing of feedback (implicit vs. explicit, granularity, feedback
values), 2) reformulation of a query given the feedback (feedback model), and
3) provision of methods for users to accept/reject parts of the refined query.
In this chapter, we focus on relevance feedback techniques for XML-
retrieval. In this context, we describe and deploy a retrieval model that
[Figure residue: Fig. 9.1 (the feedback cycle: initial query, download/read, relevance assessment) and a document tree with root bookstore and children medicine and computer-science.]
and content. We refer to Chapter 6, which discusses the model in more detail.
Fig. 9.3. DTD of the global XML document illustrating the principle of augmen-
tation.
the set of elements fulfilling the type constraint (label path to elements) and
the content constraint (predicates on the instance’s data). This set is then
ordered according to a retrieval function that takes query part 3 and 4 into
account. For more details on query processing, we refer to Chapter 6.
In the context of XML documents, we must slightly adapt the notions of tra-
ditional TFIDF ranking. Instead of statistics (term frequencies and inverse
2 Sometimes, log2(N + 1) and log2(dfi + 1) are used to prevent numerical problems.
3 Note that the definitions of rsv and distances are converse: while large rsv-values
are better, small distances denote a better similarity match.
In its simplest form, feedback methods only require positive examples, i.e.,
they adjust queries only with the relevant documents. We may further take
non-relevant documents into account. Ide [169] reported that this seems to
raise the ranks of high-ranking relevant documents, but also to lower
the ranks of some low-ranking relevant documents. As a consequence of this,
Ide proposed a method that only considers the top-ranking non-relevant docu-
ment (Ide dec-hi) apart from the relevant ones. Instead of relevance assessments,
a system could query for preferences of the form ”document A fits better to
my information need than document B” [121].
Further, it is possible to capture feedback information at finer granular-
ities, e.g., users have to explicitly mark relevant portions in retrieved docu-
ments. This would greatly help to reduce the noise introduced by the non-
relevant areas of documents marked as relevant. Yet, this would also increase
the burden on users, which is unlikely to be accepted. White et al. [316]
have compared explicit capturing of feedback, i.e., users mark documents as
relevant, with implicit capturing of feedback, i.e., users’ behavior like mouse
pointer navigation and the time spent to view a document were interpreted
as relevance indications. Their empirical study revealed that users performed
equally well on the implicit and explicit system. However, more empirical
studies are needed to investigate how implicit evidence is best collected.
Intuitively, the relevant documents attract the query while the non-relevant
ones repel it. The parameters α, β, and γ have to be determined with extensive
experiments. [260, 267] reported that the settings α = 1, β = 0.75 and γ =
0.25 lead to best results. Ide [169] simplified Rocchio’s formula as follows:
\[
  \text{(Regular)} \quad q_r = q + \sum_{j=0}^{m_r} r_j - \sum_{j=0}^{m_n} s_j
  \qquad\qquad
  \text{(dec-hi)} \quad q_r = q + \sum_{j=0}^{m_r} r_j - s_0
  \qquad (9.7)
\]
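A sketch of these feedback formulas over sparse term-weight vectors; the Rocchio variant below uses the classical centroid form with the α, β, γ values quoted above, which may differ in normalisation from the chapter's own (omitted) equation:

def combine(*weighted_vectors):
    # sum of weighted sparse vectors (dicts from term to weight)
    out = {}
    for weight, vec in weighted_vectors:
        for term, value in vec.items():
            out[term] = out.get(term, 0.0) + weight * value
    return out

def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    r_centroid = combine(*[(1.0 / len(relevant), d) for d in relevant]) if relevant else {}
    n_centroid = combine(*[(1.0 / len(nonrelevant), d) for d in nonrelevant]) if nonrelevant else {}
    return combine((alpha, q), (beta, r_centroid), (-gamma, n_centroid))

def ide_dec_hi(q, relevant, top_nonrelevant):
    # Eq. 9.7 (dec-hi): add all relevant vectors, subtract only the top-ranked non-relevant one
    return combine((1.0, q), *[(1.0, d) for d in relevant], (-1.0, top_nonrelevant))

q = {"xml": 1.0, "retrieval": 1.0}
rel = [{"xml": 2.0, "ranking": 1.0}]
nonrel = [{"database": 3.0}]
print(rocchio(q, rel, nonrel))
print(ide_dec_hi(q, rel, nonrel[0]))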
However, it appears that fine tuning these relevance values heavily depends on
the document collection and the query requirements. Further, [152, 205] re-
port that best retrieval results were achieved if only 10 to 20 terms were added
to the original query. Hence, keeping the number of query terms small is not
only desirable from an efficiency point of view but also from an effectiveness
perspective.
In the vector space model and also in probabilistic retrieval models, a lot of
research addressed the core problem of how to determine the discrimination
power of terms. The presented idf -formulae in this chapter are only some
proposals among many others. Several evaluations, however, have shown that
there is no ”correct” or ”optimal” formula to compute how characteristic
a term is over a collection. Often, the discrimination power depends on the
query context of the user. With relevance feedback, we have the opportunity to
take this query context into account when selecting the idf -weighting scheme.
Following the ideas of the next subsection, we can select out of the many
proposals the one method that best separates relevant elements from non-
relevant elements.
Another approach is to use the term weighting schemes of probabilistic
models for vector space retrieval [152]. To this end, idf -weights are replaced
by the ci -weights of the BIR model after a feedback step. Terms that only
appear in relevant documents obtain higher weights than terms that appear
in both groups or only in the non-relevant documents.
Our approach for structural re-weighting works as follows: let E be one of the
element types selected by the path expression of the query, and S be one of its
sub-element types. The aim is to determine how characteristic the contents
of S is to distinguish between relevant and non-relevant elements of type E.
To this end, we determine a ranking for the last query by only consulting
the content of sub-elements of type S. If this ranking contains many relevant
documents at the top, we may say that S is important for the current query,
and, if relevant documents obtain low ranks, the contents of S-elements are not
well-suited. A number of measures are available to assess result quality, e.g.,
R-precision, normalized sums over the ranks of relevant documents [313], or
the usefulness measure if partial relevance ordering exists [121].
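One way to read this procedure as code: rank the feedback elements once per sub-element type and use a simple quality measure of that ranking as the structural weight. R-precision is used here purely for illustration; the chapter leaves the choice of measure open:

def r_precision(ranking, relevant):
    # fraction of relevant elements among the top-|relevant| ranked ones
    cutoff = len(relevant)
    return sum(1 for e in ranking[:cutoff] if e in relevant) / cutoff if cutoff else 0.0

def structural_weights(rank_by_subtype, relevant):
    """rank_by_subtype maps each sub-element type S to the ranking obtained when
    only the content of S-elements is consulted for the last query."""
    return {s: r_precision(ranking, relevant) for s, ranking in rank_by_subtype.items()}

# Invented feedback data for two sub-element types of E
rankings = {
    "title":    ["e1", "e2", "e4", "e3"],
    "abstract": ["e3", "e4", "e1", "e2"],
}
relevant = {"e1", "e2"}
print(structural_weights(rankings, relevant))   # title separates relevant elements better here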
A similar idea was implemented in the MARS image retrieval engine to
select and weight image features [253]. We may deploy their approach to
select an ”optimal” model from a set of different retrieval models. As with
structural weights, we determine the effectiveness for each retrieval model
given the relevance assessments of the user. Now, we have two options: either
we select the most effective model and run the refined query for that model.
Or, we select a number of retrieval models and combine their scores to obtain
an overall score for elements.
9.4 Conclusions
In this chapter, we have presented a number of relevance feedback techniques
adapted from classical information retrieval scenarios to an XML retrieval
scenario. Our retrieval model provides means to retrieve elements of arbitrary
types from a global XML document, and orders them according to their sim-
ilarity to the textual part of the query. In this model, we have identified five
dimensions for query refinement: 1) query expansion, 2) query term weight-
ing, 3) adjusting of the discrimination power of terms, 4) structural weighting,
and 5) selection of the retrieval model. For each of these dimensions, we have
described a number of feedback methods to refine the query according to the
relevance assessments of users. Although the feedback models were described
in isolation, we may apply a number (or all) of them at each stage of the
4 http://www.teoma.com/
5 http://www.vivisimo.com
10
Classification and Focused Crawling for Semistructured Data
Martin Theobald, Ralf Schenkel, and Gerhard Weikum
10.1 Introduction
Despite the great advances in XML data management and querying, the cur-
rently prevalent XPath- or XQuery-centric approaches face severe limitations
when applied to XML documents in large intranets, digital libraries, feder-
ations of scientific data repositories, and ultimately the Web. In such envi-
ronments, data has much more diverse structure and annotations than in a
business-data setting and there is virtually no hope for a common schema or
DTD that all the data complies with. Without a schema, however, database-
style querying would often produce either empty result sets, namely, when
queries are overly specific, or way too many results, namely, when search pred-
icates are overly broad, the latter being the result of the user not knowing
enough about the structure and annotations of the data.
An important IR technique is automatic classification for organizing
documents into topic directories based on statistical learning techniques
[218, 64, 219, 102]. Once data is labeled with topics, the combination of declar-
ative search, browsing, and mining-style analysis is the most promising ap-
proach to find relevant information, for example, when a scientist searches
for existing results on some rare and highly specific issue. The anticipated
benefit is a more explicit, topic-based organization of the information which
in turn can be leveraged for more effective searching. The main problem that
we address towards this goal is to understand which kinds of features of XML
data can be used for high-accuracy classification and how these feature spaces
should be managed by an XML search tool with user-acceptable responsive-
ness.
This work explores the design space outlined above by investigating fea-
tures for XML classification that capture annotations (i.e., tag-term pairs),
structure (i.e., twigs and tag paths), and ontological background information
(i.e., mapping words onto word senses). With respect to the tree structure of
XML documents, we study XML twigs and tag paths as extended features that
can be combined with text term occurrences in XML elements. XML twigs
are triples of the form (ancestor element, left sibling element, right sibling el-
ement) which allow a shallow structure-aware document representation, while
tag paths merely describe linear ancestor/descendant relationships with no re-
gard for siblings. Moreover, we show how to leverage ontological background
information, more specifically, the WordNet thesaurus, for the construction of
more expressive feature spaces.
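To make the three feature types concrete, the following sketch derives tag-term pairs, tag paths, and twigs from a small XML fragment; the exact feature encoding used by BINGO! is not specified in this excerpt:

import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<article><title>focused crawling</title>"
    "<body><sec>xml classification</sec><sec>feature spaces</sec></body></article>")

def features(root):
    tag_terms, tag_paths, twigs = set(), set(), set()

    def walk(node, path):
        path = path + [node.tag]
        tag_paths.add("/".join(path))                      # linear tag path
        for word in (node.text or "").split():
            tag_terms.add((node.tag, word.lower()))        # tag-term pair
        children = list(node)
        for left, right in zip(children, children[1:]):    # twig: (ancestor, left, right)
            twigs.add((node.tag, left.tag, right.tag))
        for child in children:
            walk(child, path)

    walk(root, [])
    return tag_terms, tag_paths, twigs

for feature_set in features(doc):
    print(sorted(feature_set))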
The various options for XML feature spaces are implemented within the
BINGO! [281] focused crawler (also known as thematic crawler [63]) for expert
Web search and automated portal generation. BINGO! was originally
designed for HTML pages (including various formats of unstructured text like
PDF, Word, etc.), and is now extended not only to extract contents from
XML documents but to exploit their structure for more precise document
representation and classification.
The BINGO! (Bookmark-Induced Gathering of !nformation) focused crawling toolkit consists of six main components that
are depicted in Figure 10.1: the multi-threaded crawler itself, an HTML docu-
ment analyzer that produces a feature vector for each document, the classifier
with its training data, the feature selection as a ”noise-reduction” filter for
the classifier, the link analysis module as a distiller for topic-specific author-
ities and hubs, and the training module for the classifier that is invoked for
periodic retraining.
Topic-specific bookmarks play a key role in the BINGO! system. The
crawler starts from a user’s bookmark file or some other form of personalized
or community-specific topic directory [22]. These intellectually classified doc-
uments serve two purposes: 1) they provide the initial seeds for the crawl (i.e.,
documents whose outgoing hyperlinks are traversed by the crawler), and 2)
they provide the initial contents for the user’s topic tree and the initial train-
ing data for the classifier. The classifier, which is the crucial filter component
of a focused crawler, detects relevant documents on the basis of these book-
mark samples, while it discards off-the-topic documents and prevents their
links from being pursued. The following subsections give a short overview of
the main components of BINGO!. For more details see [282, 281].
10.2.1 Crawler
The crawler processes the links in the URL queue using multiple threads. For
each retrieved document the crawler initiates some analysis steps that depend
on the document’s MIME type (e.g., HTML, XML, etc.) and then invokes the
classifier on the resulting feature vector. Once a crawled document has been
successfully classified, BINGO! extracts all links from the document and adds
them to the URL queue for further crawling. The ordering of links (priority)
in the crawler queue is based on the classification confidence provided by the
specific classification method that is used. This confidence measure is derived
from either statistical learning approaches (e.g., Naive Bayes [19, 209]) or
the result of regression techniques (e.g., Support Vector Machines [57, 306,
180]). All retrieved documents are stored in our database index, including their features, links, and available metadata such as URL(s), title, authors, etc.
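To make the priority handling concrete, the following minimal Python sketch (not the actual BINGO! code; the fetch, extract_links and classify callables are hypothetical placeholders) orders the crawl frontier by the classification confidence of the page from which a link was extracted:

import heapq

def focused_crawl(seed_urls, fetch, extract_links, classify, max_docs=1000):
    # Frontier entries are (-confidence, url); heapq is a min-heap, so negating
    # the confidence makes the most promising links come out first.
    frontier = [(-1.0, url) for url in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    accepted = []
    while frontier and len(accepted) < max_docs:
        _, url = heapq.heappop(frontier)
        doc = fetch(url)                      # returns document text or None
        if doc is None:
            continue
        topic, confidence = classify(doc)     # e.g., an SVM decision value
        if confidence <= 0.0:
            continue                          # off-topic: do not pursue its links
        accepted.append((url, topic, confidence))
        for link in extract_links(doc):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-confidence, link))
    return accepted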
10.2.4 Feature Selection
The feature selection algorithm provided by the BINGO! engine yields the
most characteristic features for a given topic; these are the features that are
used by the classifier for testing new documents. A good feature for this
purpose discriminates competing topics from each other, i.e., those topics
that are at the same level of the topic tree. Therefore, feature selection has to
be topic-specific; it is invoked for every topic in the tree individually.
We use the Mutual Information (MI) measure to build topic-specific feature spaces. This technique, which is a specialized case of the notions of cross-entropy or Kullback-Leibler divergence [209], is known as one of the most effective feature-selection methods [326, 325]; it slightly favors rare terms (i.e., those with a high idf value), which is an excellent property for classification.
Mutual information can be interpreted as a measure of how much the joint distribution of features and topics deviates from a hypothetical distribution in which features and topics are independent of each other (hence the remark about MI being a special case of the Kullback-Leibler divergence, which measures the difference between multivariate probability distributions in general).
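As an illustration, the following Python sketch computes MI scores for one topic from the 2x2 contingency table of term occurrence (document frequencies) versus topic membership; docs_by_topic is a hypothetical input structure, and the exact estimator and smoothing used in BINGO! may differ:

import math
from collections import Counter

def mutual_information(docs_by_topic, topic):
    # docs_by_topic: topic name -> list of per-document term sets (training data).
    # Returns term -> MI(term, topic), computed from the 2x2 contingency table of
    # term occurrence versus topic membership.
    n = sum(len(docs) for docs in docs_by_topic.values())
    n_c = len(docs_by_topic[topic])
    df_in = Counter(t for doc in docs_by_topic[topic] for t in doc)
    df_all = Counter(t for docs in docs_by_topic.values() for doc in docs for t in doc)
    mi = {}
    for term, n11 in df_in.items():
        n10 = df_all[term] - n11            # term present, competing topics
        n01 = n_c - n11                     # term absent, this topic
        n00 = n - n11 - n10 - n01
        score = 0.0
        for nij, t_marg, c_marg in ((n11, df_all[term], n_c),
                                    (n10, df_all[term], n - n_c),
                                    (n01, n - df_all[term], n_c),
                                    (n00, n - df_all[term], n - n_c)):
            if nij > 0:
                score += (nij / n) * math.log((n * nij) / (t_marg * c_marg))
        mi[term] = score
    return mi

# The k terms with the highest MI scores form the topic-specific feature space:
#   top_terms = sorted(mi, key=mi.get, reverse=True)[:k]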
The root node of the taxonomy tree holds the union of all topic-specific feature spaces and provides a simple dictionary data structure for a one-to-one mapping of features (i.e., terms) to dimensions (i.e., integers) in the vector space, which is used to generate the input vectors for the SVM.
10.2.6 Retraining
Building a reasonably precise classifier from a very small set of training data
is a challenging task. Effective learning algorithms for highly heterogeneous
environments like the Web would require a much larger training basis, yet
human users would rarely be willing to invest hours of intellectual work for
putting together a rich document collection that is truly representative of
their interest profiles. To address this problem we distinguish two basic crawl
strategies:
• The learning phase serves to automatically identify the most characteristic documents of a topic, coined archetypes, and to expand the classifier's knowledge base with those neighbors of the bookmark documents that have the highest classification confidence and the best authority scores.
• The harvesting phase then serves to effectively process the user’s informa-
tion demands with improved crawling precision and recall.
BINGO! repeatedly initiates retraining of the classifier when a certain
number of documents have been crawled and successfully classified with con-
fidence above a certain threshold. At such points, a new set of training docu-
ments is determined for each node of the topic tree. For this purpose, the best
archetypes are determined in two complementary ways. First, the link analy-
sis is initiated with the current documents of a topic as its base set. The best
authorities of a tree node are regarded as potential archetypes of the node.
The second source of topic-specific archetypes builds on the confidence of the
classifier’s yes-or-no decision for a given node of the ontology tree. Among the
automatically classified documents of a topic those documents whose yes deci-
sion had the highest confidence measure are selected as potential archetypes.
The intersection of the top authorities and the documents with the highest SVM confidence forms a new set of candidates for promotion to the training data.
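A compact Python sketch of this candidate selection could look as follows; top_k and the dictionary keys 'authority' and 'confidence' are illustrative assumptions, not the actual BINGO! data structures:

def select_archetypes(docs, top_k=50):
    # docs: list of dicts with keys 'id', 'authority' (link-analysis score) and
    # 'confidence' (SVM decision value) for one node of the topic tree.
    # Promotes the intersection of the top authorities and the top-confidence documents.
    by_authority = {d["id"] for d in sorted(docs, key=lambda d: d["authority"], reverse=True)[:top_k]}
    by_confidence = {d["id"] for d in sorted(docs, key=lambda d: d["confidence"], reverse=True)[:top_k]}
    return by_authority & by_confidence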
After successfully extending the training basis with additional archetypes, BINGO! retrains all topic-specific classifiers and switches to the harvesting phase, now putting the emphasis on recall (i.e., collecting as many documents as possible); the crawler is resumed with the best hubs from the link analysis.
Most Web data is still authored in plain HTML, largely because of the extra burden that XML (or some even more advanced Semantic-Web style representation such as RDF [320] or DAML+OIL [166]) would pose on non-trained users in terms of authoring and maintaining their Web pages.
So, unless XML authoring is significantly improved [151], simply typing text
and adding simple HTML-style markup is likely to remain the favorite format
on the Web. Ironically, most of the dynamically generated information that
can be obtained via Web portals (e.g., Amazon, eBay, CNN, etc.) is actually
stored in backend databases with structured schemas but portal query results
are still delivered in the form of almost unstructured HTML.
Following prior work on HTML wrapper generators and information ex-
traction tools (see, e.g., [256, 82, 23, 265]), we have developed a toolkit for
automatically transforming HTML documents into XML format. Our tool first
constructs a valid XML document for input in plain text, HTML, PDF, etc.,
and then uses rules based on regular-expression matching to generate more
meaningful tags. For example, keywords in table headings may become tags,
with each row or cell of the table becoming an XML element. Our framework
is currently being extended to also use machine-learning techniques such as Hidden Markov Models for more elaborate annotation. Since we are mostly interested in focused crawling and thematic portal generation, the tool has been designed for easy extensibility, so that domain-specific rules can be added quickly.
Using only text terms (e.g., words, word stems, or even noun composites) and
their frequencies (and other derived weighting schemes such as tf ∗ idf mea-
sures [209]) as features for automatic classification of text documents poses
inherent difficulties and often leads to unsatisfactory results because of the
noise that is introduced by the idiosyncratic vocabulary and style of document
authors. For XML data we postulate that tags (i.e., element names) will be
chosen much more consciously and carefully than the words in the element
contents. We do not expect authors to be as careful as if they designed a
database schema, but there should be high awareness of the need for mean-
ingful and reasonably precise annotations and structuring. Furthermore, we
expect good XML authoring tools to construct tags in a semi-automatic way,
for example, by deriving them from an ontology or a “template” library (e.g.,
for typical homepages) and presenting them as suggestions to the user.
So we view tags as high-quality features of XML documents. When we
combine tags with text terms that appear in the corresponding element con-
tents, we can interpret the resulting (tag, term) pairs almost as if they were
(concept, value) pairs in the spirit of a database schema with attribute names
and attribute values. For example, pairs such as (programming language,
Java) or (lines of code, 15000) are much more informative than the mere co-
occurrence of the corresponding words in a long text (e.g., describing a piece
of software for an open source portal). Of course, we can go beyond simple
tag-term pairs by considering entire tag paths, for example, a path “univer-
sity/department/chair” in combination with a term “donation” in the corre-
sponding XML element, or by considering structural patterns within some local
context such as twigs of the form “homepage/teaching ∧ homepage/research”
(the latter could be very helpful in identifying homepages of university pro-
fessors).
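The following Python sketch, using the standard xml.etree.ElementTree module, illustrates how such features could be derived from a document; it restricts tag paths to a configurable maximum length, builds twigs from adjacent sibling pairs only, uses a deliberately naive tokenizer, and the '#' separator is just an illustrative encoding, so this is a sketch of the idea rather than the actual feature extractor:

import re
import xml.etree.ElementTree as ET

def extract_features(xml_string, max_path_len=2):
    # Produces three kinds of structural features from one XML document:
    # (tag, term) pairs, bounded tag paths combined with terms, and sibling twigs
    # encoded as 'left$parent$right' (here built from adjacent siblings only).
    root = ET.fromstring(xml_string)
    features = []

    def walk(elem, path):
        path = (path + [elem.tag])[-max_path_len:]
        terms = re.findall(r"[a-z]+", (elem.text or "").lower())   # naive tokenizer
        for term in terms:
            features.append(elem.tag + "#" + term)                 # tag-term pair
            features.append("/".join(path) + "#" + term)           # tag path + term
        children = list(elem)
        for left, right in zip(children, children[1:]):
            features.append(left.tag + "$" + elem.tag + "$" + right.tag)
        for child in children:
            walk(child, path)

    walk(root, [])
    return features

# extract_features("<homepage><research>IR</research><teaching>XML</teaching></homepage>")
# contains 'research#ir', 'homepage/research#ir' and the twig 'research$homepage$teaching'.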
An even more far-reaching option is to map element names onto an onto-
logical knowledge base and take advantage of the semantics of a term within its
respective document context. This way, tags such as “university” and “school”
or “car” and “automobile” could be mapped to the same semantic concept,
thus augmenting mere words by their word senses. We can generalize this by
mapping words to semantically related broader concepts (hypernyms) or narrower concepts (hyponyms) if the synonymy relationship alone is not sufficient for
constructing strong features. And of course, we could apply such mappings
not just to the element names, but also to text terms that appear in element
contents.
some threshold; only the remaining terms, usually on the order of 10,000, are considered for (tag, term) feature selection.
Term-based features are quantified in the form of tf ∗ idf weights that are
proportional to the frequency of the term (tf ) in a given document and to the
(logarithm of the) inverse document frequency (idf ) in the entire corpus (i.e.,
the training data for one topic in our case). So the highest weighted features
are those that are frequent in one document but infrequent across the corpus.
For XML data we could compute tf and idf statistics either for tags and terms
separately or for tag-term pairs. Analogously to the arguments for feature
selection, we have chosen the option with combined tag-term statistics. The
weight wij(fi) of feature fi in document j is computed as wij(fi) = tfij · idfi, where idfi is the logarithm of the inverse element frequency of term ti. From
the viewpoint of an individual term this approach is equivalent to interpreting
every XML element as if it were a mini-document. This way, the idf part in
the weight of a term is implicitly computed for each element type separately
without extra effort.
For example, in a digital library with full-text publications the pair (jour-
nal title, transaction) would have low idf value (because of ACM Transactions
on Database Systems, etc.), whereas the more significant pair (content, trans-
action) would have high idf value (given that there have been relatively few
papers on transaction management in the recent past). Achieving the same
desired effect with separate statistics for tags and terms would be much less
straightforward to implement.
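A minimal Python sketch of this combined weighting, where each document is already represented as a list of (tag, term) occurrences, could look as follows; the exact normalisation and damping used in the actual system may differ:

import math
from collections import Counter

def tag_term_weights(corpus):
    # corpus: list of documents, each given as a list of (tag, term) pairs.
    # Computes w_ij = tf_ij * idf_i with combined tag-term statistics, so the idf of
    # a term is effectively kept per element type without extra bookkeeping.
    n_docs = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))               # document frequency of each tag-term pair
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({f: tf[f] * math.log(n_docs / df[f]) for f in tf})
    return weights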
Using tag paths as features gives rise to some combinatorial explosion. However, we expect this growth to be fairly modest: the number of different tags used even in a large data collection should be much smaller than the number of text terms, and real-life XML data should exhibit typical context patterns along tag paths rather than combining tags in an arbitrarily free manner. These characteristic patterns should help us to classify data that
comes from a large number of heterogeneous sources. Nevertheless, efficiency
reasons may often dictate that we limit tag-path features to path length 2,
just using (parent tag, tag) pairs or (parent tag, tag, term) triples.
Twigs are a specific way to split the graph structure of XML documents
into a set of small characteristic units with respect to sibling elements. Twigs
are encoded in the form “left child tag $ parent tag $ right child tag”; examples
are “research$homepage$teaching”, with “homepage” being the parent of the
two siblings “research” and “teaching”, or “author$journal paper$author” for
publications with two or more authors. The twig encoding suggests that our
features are sensitive to the order of sibling elements, but we can optionally
map twigs with different orders to the same dimension of the feature space,
thus interpreting XML data as unordered trees if this is desired. For tag
paths and twig patterns as features we apply feature selection to the complete
structural unit.
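The encoding itself is simple; the following small Python helper (an illustrative sketch) shows the '$' encoding together with the optional order normalisation:

def twig_feature(left_tag, parent_tag, right_tag, ordered=True):
    # Encodes a twig as 'left$parent$right'; with ordered=False the two sibling tags
    # are sorted, so mirrored twigs are mapped to the same feature-space dimension.
    if not ordered:
        left_tag, right_tag = sorted((left_tag, right_tag))
    return "%s$%s$%s" % (left_tag, parent_tag, right_tag)

# twig_feature("teaching", "homepage", "research", ordered=False) ==
# twig_feature("research", "homepage", "teaching", ordered=False)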
10.4.1 Mapping
A tag usually consists of a single word or a composite word with some special
delimiters (e.g., underscore) or a Java-style use of upper and lower case to
distinguish the individual words. Consider the tag word set {w1 , ..., wk }. We
look up each of the wi in an ontology database and identify its possible word senses si1, . . . , sim. For example, for a tag “goal” we would find the word senses:
1. goal, end – (the state of affairs that a plan is intended to achieve and that
(when achieved) terminates behavior to achieve it; “the ends justify the
means”)
2. goal – (a successful attempt at scoring; “the winning goal came with less
than a minute left to play”)
and two further senses. By looking up the synonyms of these word senses,
we can construct the synsets {goal, end, content, cognitive content, mental
object} and {goal, score} for the first and second meaning, respectively. For a
composite tag such as “soccer goal”, we look up the senses for both “soccer”
and “goal” and form the cross product of possible senses for the complete
tag and represent each of these senses by the union of the corresponding
two synsets. Obviously, this approach would quickly become intractable with
a growing number of words in a composite tag, but more than two words would
seem to be extremely unusual.
The final step towards disambiguating the mapping of a tag onto a word sense is to compare the tag context con(t) with the contexts of the candidate senses con(s1) through con(sp) in terms of a similarity measure between bags of words. The standard IR measure for this purpose would be the cosine similarity between con(t) and con(sj), or alternatively the Kullback-Leibler divergence between the two word frequency distributions (note that the context construction may add the same word multiple times, and this information is kept in the word bag). Our implementation uses the cosine similarity between the tf vectors of con(t) and con(sj) because it is simpler to compute. Finally, we map tag t onto the sense sj whose context has the highest similarity to con(t), which is similar to the disambiguation strategy that the XXL [296] engine applies. We denote this word sense as sense(t) and the set of all senses of tags that appear in the training data as senses_train = {sense(ti) | ti appears in the training data}.
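A minimal Python sketch of this context-based disambiguation, with con(t) and the con(sj) given as plain word bags (lists that may contain repeated words), could look as follows:

from collections import Counter
import math

def disambiguate(tag_context, sense_contexts):
    # tag_context: bag of words con(t); sense_contexts: sense_id -> bag of words con(s).
    # Returns the sense whose context is most cosine-similar to con(t), using plain
    # tf vectors (duplicate words in a bag simply increase the term frequency).
    def cosine(a, b):
        va, vb = Counter(a), Counter(b)
        dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
        norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
        return dot / norm if norm else 0.0
    return max(sense_contexts, key=lambda s: cosine(tag_context, sense_contexts[s]))

# disambiguate(["soccer", "shot", "goal"],
#              {"goal/end": ["plan", "intention", "end"],
#               "goal/score": ["soccer", "score", "shot"]})  ->  "goal/score"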
Putting everything together, a tag t in a test document d is first mapped to a word sense s := map(t); then we find the closest word sense s′ := argmax_{s′ ∈ F} sim(s′, s) that is included in the feature space F of the training data, and set the weight of s′ in d to sim(s′, map(t)).
The feature space constituted by the mapping of tags onto word senses is
appropriate when training documents and the test documents to be classified
have a major overlap in their word senses (i.e., the images of their tag-to-
word-sense mappings). In practice, this is not guaranteed even for test
documents that would indeed fall into one of the trained topics. Suppose,
for example, that we have used training documents with tags such as “goal”,
“soccer”, and “shot”, which are all mapped to corresponding word senses,
and then a previously unseen test document contains tags such as “Champi-
ons League”, “football”, and “dribbling”, which correspond to a disjoint set of
word senses. The classifier would have no chance to accept the test document
for topic “sports”; nevertheless we would intellectually rate the test document
as a good candidate for sports.
To rectify this situation, we define a similarity metric between word senses
of the ontological feature space, and then map the tags of a previously unseen
test document to the word senses that actually appeared in the training data
and are closest to the word senses onto which the test document’s tags would
be mapped directly. For a detailed discussion of how the semantic similarity sim(s, s′) of two concepts s and s′ in the ontology graph can be estimated, see Chapter 8.
If the training set itself yields rich feature spaces that already contain most of the classification-relevant terms we will ever encounter in our test documents, we cannot expect vast improvements from ontology lookups and thus leave out this step, using only the given tag-term pairs and twigs. With increasingly sparse training feature spaces (in comparison to the test set), we can stepwise add lookups for the strongest (most relevant) features (i.e., the tags) and/or terms, where infrequent nouns among the element contents would be the next source of potentially relevant features. Of course, a full lookup for each term using the ontology service with its disambiguation function is best suited for very small amounts of training data, but it also has the highest cost and decreases performance.
In the classification phase we aim to extend the test vector by finding approximate matches between the concepts in the feature space and the formerly unknown concepts of the test document. Like the training phase, the classification identifies synonyms and replaces all tags/terms with their disambiguated synset ids. 1) If there is a direct match between a test synset s′ and a synset in the training feature space (i.e., s′ ∈ senses_train), we are finished and put the respective concept-term pair into the feature vector that is generated for the test document. 2) If there is no direct match between the test synset s′ and any synset derived from the training data (i.e., s′ ∉ senses_train), we replace the actual test synset with its most similar match. The weight wij(fi) of feature fi in document j is now scaled by sim(s, s′), i.e., wij(fi) = sim(s, s′) · tfij · idfi, to reflect the concept similarities of approximate matches in the feature weights.
In practice, we limit the search for common hypernyms to a distance of 2 in the ontology graph. Concepts that are not connected within this threshold are considered dissimilar for classification and obtain a similarity value of
0. To improve performance over frequently repeated queries, all mappings of
similar synsets within the same document (i.e., the same context) and their
retrieved similarities are cached.
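The weighting of approximate matches can be sketched in Python as follows; the similarity function is assumed to return 0 for concepts without a common hypernym within distance 2, and the cache dictionary is a stand-in for the per-document cache just described:

def approximate_weight(test_sense, train_senses, similarity, tf, idf, cache=None):
    # Returns the training sense used and the (possibly scaled) feature weight.
    cache = {} if cache is None else cache
    if test_sense in train_senses:
        return test_sense, tf * idf          # direct match: plain tf*idf weight
    if test_sense not in cache:              # approximate match, computed once per document
        best = max(train_senses, key=lambda s: similarity(test_sense, s))
        cache[test_sense] = (best, similarity(test_sense, best))
    best, sim = cache[test_sense]
    return best, sim * tf * idf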
Unlike related techniques that use ontologies for query expansion, we do not change the length of the feature vectors (i.e., the number of a document's features). Our approach merely adjusts the weights of unknown test concepts according to their nearest match in the feature space with regard to the correlations defined by the ontology; adding new features to the vector could produce more matches in the classification step and would distort the result. This ontology-based evaluation can be regarded as a "similarity kernel" for the SVM that goes beyond syntactic matching.
10.5 Conclusions
Preliminary experiments were run on XML representations of the IMDB (In-
ternet Movie Database) collection and the Reuters-21578 data set taking into
consideration the XML structure using simple tag-term pairs (only regarding
a single ancestor tag) and twigs. These tests indicate a clear improvement
of the classifiers’ F-measure [19, 209] (the harmonic mean of precision and
recall) for the structure aware feature sets over a purely text-based feature
space. For the IMDB the structure-aware classifier improved the F-measure
from approximately 0.7 that the text classifier achieved to about 0.8 when very
few (5-20) documents were used for training. For the Reuters data mapping of
text terms onto ontological concepts led to a similar improvement when few
training documents were used. In both cases the text classifiers continuously converged to (but never outperformed) the F-measure of the structural classifiers as we fed more training documents (up to 500 per topic) into the classifiers' knowledge base; this led to a saturation effect in which ontology lookups could no longer replace formerly unknown training concepts.
These tests show the potential of structural features like tag-term pairs
and twigs towards a more precise document characterization and the compu-
tation of a structure-aware classifier. Ontology lookups are a promising way
for disambiguating terms in a given document context and for eliminating
polysemy of text terms.
Our approach is a first step towards understanding the potential of ex-
ploiting structure, annotation, and ontological features of XML documents
for automatic classification. Our experiments are fairly preliminary, but the
results indicate that the direction is worth exploring in more depth.
Our current and near-future work includes more comprehensive experimen-
tal studies and making the ontological mapping more robust. In particular,
we are working on incorporating other sources of ontological knowledge into
our system, to go beyond the information provided by WordNet. This ongoing
work is part of the BINGO! project in which we are building a next-generation
focused crawler and comprehensive toolkit for automatically organizing and
searching semistructured Web and intranet data.
11
Information Extraction and
Automatic Markup for XML Documents
11.1 Introduction
While XML is becoming the standard document format, there remains the legacy problem of large amounts of text (written in the past as well as today) that are not available in this format. In order to exploit the benefits of XML, these legacy texts must be converted into XML. In this chapter, we discuss the issues of automatic XML markup of documents. We give a survey of existing approaches, and we describe a specific system in some detail.
When talking about XML markup, we can roughly distinguish between
three types of markup:
• Macro-level markup deals with the global visual and logical structure of a document (e.g., part, chapter and section, down to the paragraph level).
• Micro-level markup is used for marking single words or word groups. For
example, in news, person and company names, locations and dates may
be marked up, possibly along with their roles in the event described (e.g.
a company merger).
• Symbol-level markup uses symbolic names as content of specific elements
in order to describe content that is not plain text (e.g. MathML for math-
ematical formulas and CML for chemical formulas). Since this type of con-
tent is usually represented in various formats in legacy documents, specific
transformation routines should be applied in order to convert these into
XML. We will not consider this type of markup in the remainder of this
chapter.
Micro and macro-level markup require different methods for performing auto-
matic markup: Whereas macro-level markup is mainly based on information
about the layout of a document, micro-level markup typically requires basic
linguistic procedures in combination with application-specific knowledge. We
will describe the details of these two approaches in the following sections.
Adding markup to a document increases its value by making its infor-
mation more accessible. Without markup, from a system’s point of view, a
document is just a long sequence of words, and thus the set of operations that
can be performed on such a document is rather limited. Once we have markup,
however, the system is able to exploit the implicit semantics of the markup
tags, thus allowing for operations that are closer to the semantic level. Here
we give a few examples:
• Markup at the macro level supports a user in navigating through the
logical structure of a document.
• Content-oriented retrieval aims at retrieving meaningful units for a given
query that refers only to the content, but not to the structure of the target
elements. Whereas classical passage retrieval [183] can only select text pas-
sages of a fixed size, XML-based retrieval is able to select XML elements
based on the explicit logical structure as represented in the macro-level
markup.
• When micro-level markup is used for specifying the data type of element
content (e.g. date, location, person name, company name), type-specific
search predicates may be used in retrieval, thus supporting high-precision
searches.
• Another dimension of micro-level markup is the role of element content
(e.g. author vs. editor, departure location vs. arrival location, starting
date vs. ending date). Here again, precision of retrieval can be increased
by referring to these elements; also, browsing through the values occurring
in certain roles may ease information access for a user.
• Text mining can extract the contents of specific elements and store them in a separate database in order to perform data-mining-like analysis operations.
The remainder of this chapter is structured as follows: In Section 11.2, we
briefly describe methods for macro-level markup. In Section 11.3, we give a survey of Information Extraction (IE) methods and discuss their application
for micro-level markup. In Section 11.4, we present a case study where a
toolkit for automatic markup, developed by our research group, is applied to
articles from encyclopedias of art. Finally, we conclude this chapter with some
remarks.
Optical character recognition (OCR) converts text on paper into a form that computers can manipulate: the paper is scanned, an image is produced, and the image is analysed to yield an electronic file.
In this way, the textual content of the document as well as its structure
can be extracted. The document structure can be expressed in two ways:
• Presentation oriented: how the document looks.
• Logically: how the document parts are related to each other.
Markup of these (macro) structures has different applications. Presentation markup can mainly be used for enhancing the layout of the document.
Southall mentions that presentation markup helps us to display a document’s
visual structure which contributes to the document’s meaning [284]. Logical
markup serves for a variety of purposes, as mentioned in the previous section.
Taghva et al. introduce a system that automatically marks up technical documents, based on information provided by an OCR device which, in addition to its main task, provides detailed information about page layout,
word geometry and font usage [292]. An automatic markup program uses
this information, combined with dictionary lookup and content analysis, to
identify structural components of the text. These include the document title,
author information, abstract, sections, section titles, paragraphs, sentences
and de-hyphenated words.
Moreover, the logical structure of a document can be extracted from its
layout. For this purpose, there are two approaches:
Top-down: starting with the presentation markup and segmenting pages into sections, sections into paragraphs, paragraphs into sentences, and sentences into words. This is the preferred approach in the literature.
Bottom-up: starting with words and grouping words into sentences, sentences into paragraphs, and paragraphs into sections.
Having detailed information about the presentation attributes of single words (their page, exact location on the page, and font), one can use these data to form sentences, paragraphs and sections [292].
Furthermore, Hitz et al. advocate the use of synthetic document images
as a basis for extracting the logical structure of a document, in order to deal
with different formats and document models [165].
Syntactic Analysis
The most natural approach to syntactic analysis would be the development of
a full parser. However, experiments have shown that such an approach results
in a very slow system, which is also error-prone. Thus, most approaches in this
area aim at shallow parsing, using a finite-state grammar. The justification
for this strategy lies in the fact that IE is directed toward extracting relatively
simple relationships among singular objects. These finite-state grammars fo-
cus mainly on noun groups and verb groups, since they contain most of the
relevant information. As attributes of these constituents, number and definiteness are extracted from the determiners of noun groups, and tense and
voice from verb groups. In a second parsing phase, prepositional phrases are
handled; here mainly the prepositions “of” and “for” are considered, whereas
treatment of temporal and locative adjuncts is postponed to the domain anal-
ysis phase.
Domain Analysis
Before the extraction of facts can start, first the problem of coreference must
be solved. Since text writers typically use varying notations for referring to
the same entity, IE systems struggle with the problem of resolving these coref-
erences (e.g. “IBM”, “International Business Machines”, “Big Blue”, “The
Armonk-based company”). Even person names already pose severe problems
(e.g. “William H. Gates”, “Mr. Gates”, “William Gates”, “Bill Gates”, “Mr.
Bill H. Gates”). In addition, anaphoric references (pronouns or discourse def-
inite references) must be resolved. Although there is rich literature on this
specific problem, most approaches assume full parsing and thus are not appli-
cable for IE.
In [15], a general knowledge engineering approach for coreference is de-
scribed: In the first step, for a candidate referring expression (noun phrase),
the following attributes are determined: sortal information (e.g. company vs.
location), number (single vs. plural), gender and syntactic features (e.g. name,
pronoun, definite vs. indefinite). Then, for each candidate referring expression,
the accessible antecedents are determined (e.g. for names the entire preceding
text, for pronouns only a small part of it), which are subsequently filtered with
a semantic / sortal consistency check (based on the attributes determined in
the first step), and the remaining candidates are filtered by dynamic syntactic
preferences (considering the relative location in the text).
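The cascade can be summarised by the following Python sketch; the mention dictionaries, the consistent check and the syntactic_preference ranking are hypothetical stand-ins for the knowledge-engineered components described in [15]:

def resolve_coreference(mention, preceding_mentions, consistent, syntactic_preference):
    # 1. Accessible antecedents: names may refer far back, pronouns only locally
    #    (the fixed window of 10 mentions is a crude stand-in for the real rule).
    if mention["type"] == "name":
        window = preceding_mentions
    else:
        window = preceding_mentions[-10:]
    # 2. Filter by semantic / sortal consistency (sort, number, gender, ...).
    candidates = [m for m in window if consistent(mention, m)]
    # 3. Choose the candidate preferred by dynamic syntactic preferences
    #    (e.g., relative location in the text).
    if not candidates:
        return None
    return max(candidates, key=lambda m: syntactic_preference(mention, m))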
Once there are solutions for all the problems described above, the core
task of IE can be addressed. As a prerequisite, an appropriate template form
must be defined: typically, users give an informal specification of the pieces of information they are interested in, for which an adequate and useful representation format must then be specified.
For filling this template, there are two knowledge engineering approaches:
• The molecular approach aims at filling the complete template in one step.
For this purpose, the knowledge engineer reads some texts in order to
identify the most common and most reliably indicative patterns in which
relevant information is expressed. For these patterns appropriate rules are
formulated, then one moves on to less common but still reliable patterns.
Thus, this approach aims initially at high precision, and then improves
recall incrementally.
• The atomic approach, in contrast, is based on the assumption that every
noun phrase of the right sort and every verb of the right type (indepen-
dently of the syntactic relations among them) indicates an event / relation-
ship of interest. Thus, one starts with high recall and low precision, with
incremental development of filters for false positives. This approach is only
feasible if entities in the domain have easily determined types, and there is
scarcely more than one template slot where an entity of a given type may
fit - as a negative example, a template for management changes would
contain at least two slots (predecessor / successor) where a person can be
filled in.
In most cases both approaches produce only partial descriptions, which must
be merged subsequently (e.g. in a template for management changes, one por-
tion of the text may mention company, position and the new person filling it,
whereas the predecessor is mentioned in a subsequent paragraph). This merg-
ing step is a specific type of unification. A knowledge engineering approach
for this problem is described in [15]: Starting from typed slots, type-specific
procedures are developed which compare two candidates for inconsistencies,
coreference and subsumption; in addition, application-specific heuristics are
necessary in most cases.
In general, major parts of an IE system are rather application-dependent,
and there is little experience with the development of portable systems. On
the other hand, experience from the MUC conference shows that approaches
based on general-purpose language analysis systems yield lower performance
than application-specific developments.
Above, we have described the general structure of IE tasks and the architec-
ture of IE systems. A more detailed analysis and categorisation of IE problems
is described in [83], where the authors distinguish between source
properties and extraction methods, and develop taxonomies for issues related
to these two subjects.
With respect to the source properties, the following aspects are considered
to be the more important ones:
• Structure can be free, tagged or even follow a specified schema.
• Topology distinguishes between single and multiple documents to be con-
sidered for filling a single template.
• Correctness refers to the amount and type (format, content) of errors that
may occur in the input.
The discussion above has focused on IE methods, and little has been said
about their relationship with the problem of Automatic Markup (AM). Cun-
ningham distinguishes five levels of IE tasks, which can also be used for char-
acterising different levels of automatic markup [86]:
• Named entity recognition extracts entities of one or more given types. For
automatic markup, this method can be used for assigning appropriate tags
to these names as they occur in the text.
• Coreference resolution recognises different notations for the same entity.
In XML, this fact could be marked by adding ID / IDREF attributes to
the tags.
Da Vinci, Leonardo, born in Anchiano, near Vinci, 15 April 1452, died in Am-
boise, near Tours, 2 May 1519. Italian painter, sculptor, architect, designer, the-
orist, engineer and scientist. He was the founding father of what is called the
High Renaissance style and exercised an enormous influence on contemporary and
later artists. His writings on art helped establish the ideals of representation and
expression that were to dominate European academies for the next 400 years. The
standards he set in figure draughtsmanship, handling of space, depiction of light
and shade, representation of landscape, evocation of character and techniques of
narrative radically transformed the range of art. A number of his inventions in
architecture and in various fields of decoration entered the general currency of
16th-century design.
Fig. 11.1. Example text of source documents
To arrive at a micro-level markup from the OCRed texts, the Vasari project
follows the knowledge engineering approach. The project presents a language
and a number of tools for this purpose. Rules for markup can be expressed in
the Vasari Language (VaLa). The Vasari tool serves the knowledge engineer in
the iterative process of developing the rules. Having defined the set of rules for
a given encyclopedia, the extraction tool VaLaEx (Vasari Language Extractor)
uses them in order to automatically mark up the plain texts. Furthermore, a
toolkit has been specified around VaLaEx which includes tools that can be
used to pre-/post-process the input/output of the VaLaEx extractor.
In the following we describe the Vasari project in more detail. In Sec-
tion 11.4.1 we give a survey on VaLa. The Vasari tool for developing VaLa descriptions and the VaLaEx markup tool are described in Section 11.4.2. In Section 11.4.3 we show how additional tools can be applied to enhance the result of the markup process. We end the description of the Vasari project in Section 11.4.4 with a brief discussion on how the results can be improved by fusing knowledge obtained from different sources.
1 Giorgio Vasari, who lived in the 16th century in Florence, Italy, was an Italian painter and architect. His most important work is an encyclopedia of artist biographies (“The biographies of the most famous architects, painters and sculptors”, published in an extended edition in 1568), which still belongs to the foundations of art history.
2 Although our work principally deals with German texts, here we give an English example from The Grove Dictionary of Art Online (http://www.groveart.com/).
<xs:simpleType name="tSurName">
<xs:annotation>
<xs:appinfo>
<vala:match type="regexp">[A-Za-z ]+</vala:match>
<vala:post type="regexp">,</vala:post>
</xs:appinfo>
</xs:annotation>
</xs:simpleType>
<xs:simpleType name="tGivenName">
<xs:annotation>
<xs:appinfo>
<vala:match type="regexp">[A-Za-z ]+</vala:match>
<vala:post type="regexp">,</vala:post>
</xs:appinfo>
</xs:annotation>
</xs:simpleType>
<xs:simpleType name="tBirthPlace">
<xs:annotation>
<xs:appinfo>
<vala:pre type="regexp">born[ \n]*in[ \n]*</vala:pre>
<vala:match type="regexp">[^,]+</vala:match>
</xs:appinfo>
</xs:annotation>
</xs:simpleType>
Fig. 11.2. An excerpt from a VaLa description for automatic markup of the artists’
data
Given a VaLa description for documents of a special type (in our case ar-
ticles from an encyclopedia) the VaLaEx tool for automatic markup applies
that description onto a set of source documents. Since a VaLa description
defines a tree structure, the approach taken by VaLaEx is based on the re-
cursive definition of trees. The source document is passed to the root of the
structure tree. By means of its filler rules, the root node selects the part of the document text that matches them. The matching part is then
passed to the root’s first child node, which in turn selects its matching text
part and provides the respective markup. The remaining text is passed to
the next child node, and so on. In case the filler rules within a child node
cannot be matched, alternative solutions are tried through backtracking. For
each child which receives text from its parent node, the algorithm is applied
recursively.
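The following Python sketch mimics this recursive matching for rule trees given as nested dictionaries with regular-expression filler rules ('pre', 'match', 'post'); it is only a simplified illustration of the VaLaEx algorithm, with backtracking reduced to first-match semantics and hypothetical tag names in the example:

import re

def markup(text, node):
    # node: {'tag': ..., 'pre': regex or None, 'match': regex, 'post': regex or None,
    #        'children': [...]}; returns (xml_fragment, remaining_text) or None if
    # the filler rules cannot be matched.
    pattern = (node.get("pre") or "") + "(" + node["match"] + ")" + (node.get("post") or "")
    m = re.search(pattern, text, re.DOTALL)
    if m is None:
        return None
    matched, rest = m.group(1), text[m.end():]
    inner, remaining = "", matched
    for child in node.get("children", []):
        result = markup(remaining, child)
        if result is None:
            return None                     # a full implementation would backtrack here
        fragment, remaining = result
        inner += fragment
    content = inner if node.get("children") else matched
    return "<%s>%s</%s>" % (node["tag"], content, node["tag"]), rest

# Hypothetical rules in the spirit of Figure 11.2:
# artist = {"tag": "artist", "match": r".+", "children": [
#     {"tag": "surname",    "match": r"[A-Za-z ]+", "post": ","},
#     {"tag": "givenname",  "match": r"[A-Za-z ]+", "post": ","},
#     {"tag": "birthplace", "pre": r"born[ \n]*in[ \n]*", "match": r"[^,]+"}]}
# markup("Da Vinci, Leonardo, born in Anchiano, near Vinci, ...", artist)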
Figure 11.3 displays the result of applying the VaLa description mentioned above.
To develop a VaLa description, the knowledge engineer first specifies the target format for the resulting XML documents. The filler rules can then be developed by means of the example documents. At any stage of the development process the knowledge engineer can check the result against the example documents. A VaLa description obtained in this way can now be improved iteratively.
In each iteration step the knowledge engineer gives feedback to the system
with regard to the result obtained up to then: The markup can be assessed
as being wrong or correct; missing markup can be introduced into the XML
documents. According to this kind of feedback, the VaLa description can then
be improved further. Whenever a new version of the description is finished,
the example documents are marked up using that version. Since feedback is
available from earlier versions already, part of the assessment of the new result
can be done by the system automatically and visualised to the user.
Figure 11.4 shows the main window of the Vasari user interface. The bot-
tom part of the window contains the VaLa description developed up to then
(see also Figure 11.2). The remaining upper part of the window is split into
three parts: The left-hand part contains an overview of the example source
documents. One of the source documents is displayed in the middle part, while
the result of the markup process (using the VaLa description shown in the
bottom part of the window) is displayed on the right-hand side. As can be seen, the artist's name and birth place have been marked up correctly. The knowledge engineer therefore marked these tags as correct. Whenever the markup of this part of the document is changed by a later version of the VaLa description, it is automatically marked as being wrong.
When developing means for markup of rather unstructured plain text docu-
ments, obtained from OCRed texts, we realized that there are some problems
which are not directly related to automatic markup. This includes the cor-
rection of systematic OCR errors, the detection of document boundaries and
the elimination of hyphenation of words in the pre-processing phase for the
source documents. Also, some features of the VaLa language required the use
of external tools, like SPPC to detect linguistic categories or entities. In order
not to burden Vasari and VaLaEx with these tasks we developed a toolkit
framework, of which VaLaEx is the core. Other tools for specific tasks in the
markup process can be added arbitrarily.
All tools in the toolkit comply with a simple standard interface: the input as well as the output always consist of (sets of) XML documents; on the input side, additional parameters (e.g., a VaLa description for the VaLaEx tool) may be provided. Hence a high degree of modularisation is achieved, and tools can be combined in an almost arbitrary way. The interface standard allows for
easy integration of external tools, e.g. SPPC, by means of wrappers.
Fig. 11.4. Vasari user interface for interactive and iterative development of VaLa
descriptions
A characteristic of the Vasari application is that source documents from different encyclopedias are available, the contents of which partly overlap. This
can be exploited to achieve an even better knowledge representation after the
automatic markup process is completed. Knowledge from different sources can
be fused, thus references implicitly available within the source documents can
be made explicit. For example, given that an artist is described within differ-
ent encyclopedias, the fused descriptions would lead to a more complete view
on that artist. Even contradictions could be detected, triggering e.g. manual
correction.
11.5 Conclusions
Automatic markup of (legacy) documents will remain a problem for a long
time. Using XML as the target format for automatic markup leads to pow-
erful search and navigational structures for effective knowledge exploration.
In this chapter we summarised approaches for automatic markup of macro and micro structures within rather unstructured documents. In a case study we showed how a toolkit for automatic markup, developed by our research group, is applied to articles from encyclopedias of art.
12
The Multi-model DBMS Architecture and XML IR
12.1 Introduction
Computer science has long distinguished between information retrieval and data retrieval: information retrieval entails the problem of ranking textual documents by their content (with the goal of identifying documents relevant to a user's information need), while data retrieval involves exact match, that is, checking a data collection for the presence or absence of
(precisely specified) items. But, now that XML has become a standard doc-
ument model that allows structure and text content to be represented in a
combined way, new generations of information retrieval systems are expected
to handle semi-structured documents instead of plain text, with usage scenar-
ios that require the combination of ‘conventional’ ranking with other query
constraints, based on the structure of the text documents, on information extracted from various media (or various media representations), or on additional information induced during the query process.
Consider for example an XML collection representing a newspaper archive,
and the information need ‘recent English newspaper articles about Willem-
Alexander dating Maxima’.1 This can be expressed as the following query
(syntax in the spirit of the XQuery-Fulltext working draft [58]): 2
FOR $article IN document("collection.xml")//article
WHERE $article/text() about ‘Willem-Alexander dating Maxima’
AND $article[@lang = ‘English’]
AND $article[@pdate between ‘31-1-2003’ and ‘1-3-2003’]
RETURN <result>$article</result>
The terms ‘recent’ and ‘English’ refer to metadata about the newspaper
articles, whereas the aboutness-clause refers to the news content.
1 Willem-Alexander is the Crown Prince of The Netherlands, who married Maxima Zorreguieta on 2-2-2002.
2 Assume an interpretation in which ‘recent’ is equivalent to ‘published during the last month’, and that language and pdate are attributes of the article tag. The between . . . and . . . construct does not exist in XQuery, but is used for simplicity.
Because only recent English articles will be retrieved by this request, precision at low recall levels is likely to be improved. Note that this capability to process queries that
combine content and structure is beneficial in ways beyond extending query-
ing textual content with constraints on rich data types like numeric attributes
(e.g., price), geographical information and temporal values. Egnor and Lord
[105] suggest that new generations of information retrieval systems could ex-
ploit the potentially rich additional information in semi-structured document
collections also for disambiguation of words through their tag context, and
use structural proximity as part of the ranking model. Also, combined query-
ing on content and structure is a necessary precondition for improving the IR
process when taking into account Mizzaro’s different notions of relevance (see
[221]).
12.1.1 Dilemma
When we want to optimize our systems for parallel and distributed computing and for memory access patterns, however, the usage of black-box abstractions to obtain flexibility becomes ever less desirable: it easily leads to inefficient systems, as we do not really understand what happens inside the ranking process.
The essence of our problem is that we are trapped in an impasse: gaining flexibility through abstraction incurs an efficiency penalty, which is felt most when we exploit this flexibility in new applications of IR or explore improvements upon existing models.
This problem is illustrated clearly in the processing of relevance feedback.
Retrieval systems typically rank the documents with the initial query in a
first pass and re-rank with an adapted query in a second pass. Jónsson et al.
have shown in [182] that the resulting retrieval system is not optimal with
respect to efficiency, unless we address buffer management while taking both
passes into account. So, the inner workings of the original system must be
changed for optimal performance of the full system. In other words, we must
break open the black-box. This, obviously, conflicts with our previously stated
desire for flexibility.
Another illustration of this dilemma appears when extending retrieval sys-
tems for multimedia data collections, strengthening our arguments against
the pragmatic engineering practice of coupling otherwise stand-alone retrieval
systems. In a multimedia retrieval system that ranks its objects using various
representations of content (such as the system described in [302]), the number
of independent black-box components that may contribute to the final rank-
ing equals the number of feature spaces used in the system. It seems unlikely
that computing these multiple rankings independently (i.e., without taking
intermediate results into account) is the most efficient approach.
We seek a way out of this impasse between flexibility and efficiency by fol-
lowing ‘the database approach’. Database technology provides flexibility by
expressing requests in high-level, declarative query languages at the concep-
tual level, independent from implementation details such as file formats and
access structures (thus emphasizing data independence). Efficiency is obtained
in the mapping process from declarative specification (describing what should
happen) into a query plan at the physical level (describing how it happens).
The query optimizer generates a number of logically equivalent query plans,
and selects a (hopefully) efficient plan using some heuristics.
There is not much consensus on how the integration of IR techniques in
general-purpose database management systems (DBMSs) should take place.
The typical system design couples two standalone black-box systems using
a shallow layer on top: an IR system for the article text and a DBMS for
the structured data. Their connection is established by using the same doc-
ument identifiers in both component systems. State-of-the-art database solu-
[Figure: logical extensions (Extension 1 … Extension n) layered on top of extensions of the physical algebra.]
Fig. 12.2. Possible article excerpt of an XML newspaper archive; the leaf nodes
contain index terms.
A text region a can be identified by its starting point sa and ending point ea
within the entire linearized string, where assignment of starting and ending
points is simply done by maintaining a token counter. Figure 12.3 visualizes
the start point and end point numbering for the example XML document and
we can see, for example, that the bdy-region can be identified with the closed
interval [5..24].
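A small Python sketch of this numbering scheme, using xml.etree.ElementTree and treating every start tag, index term and end tag as one position of the token counter, could look as follows (an illustration of the idea, not the prototype's implementation):

import xml.etree.ElementTree as ET

def number_regions(xml_string):
    # Pre-order linearisation: every start tag, every index term and every end tag
    # consumes one position of a running token counter (cf. Figure 12.3).
    counter = 0
    regions = []

    def visit(elem):
        nonlocal counter
        start = counter
        counter += 1                                  # position of the opening tag
        for term in (elem.text or "").split():
            regions.append((term, counter, counter))  # an index term is a region [p..p]
            counter += 1
        for child in elem:
            visit(child)
            for term in (child.tail or "").split():
                regions.append((term, counter, counter))
                counter += 1
        end = counter
        counter += 1                                  # position of the closing tag
        regions.append((elem.tag, start, end))

    visit(ET.fromstring(xml_string))
    return regions

# number_regions("<article><title>t1 t2</title><bdy>...</bdy></article>") yields,
# among others, ('title', 1, 4) and ('t1', 2, 2), matching the numbering scheme.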
[Fig. 12.3. The example document with region numbering: article:[0..25] (with attributes @language and @date-published), title:[1..4], bdy:[5..24], sec:[6..14], sec:[15..23]; index terms such as ‘...’:[2..2], ‘...’:[3..3] and ‘dating’:[17..17] each occupy a single position.]
At the physical level, our system stores these XML text regions as four-
tuples (region id, start, end, tag), where:
• region id denotes a unique node identifier for each region;
• start and end represent the start and end positions of each region;
• tag is the (XML) tag of each region.
The set of all XML region tuples is named the node index N . Index terms
present in the XML documents are stored in a separate relation called the
word index W. Index terms are considered text regions as well, but physically
the term identifier is re-used as both start and end position to reduce mem-
ory usage. Node attributes are stored in the attribute index A as four-tuples
(attr id, region id, attr name, attr val). Furthermore, we extended the phys-
ical layer with the text region operators, summarized in Table 12.1. Note that
we have put the text region operators in a relational context, delivering sets
or bags of tuples.
Table 12.1. Region and region set operators, in comprehension syntax [56]; sr and er denote the starting and ending positions of region r, and or its region id.
Operator      Definition
a ⊃ b         true ⇐⇒ sb > sa ∧ eb < ea
A ⋈⊃ B        {(oa, ob) | a ← A, b ← B, a ⊃ b}
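For illustration, a naive Python version of the containment predicate and the containment join of Table 12.1 over (region_id, start, end) tuples could be written as follows; real implementations use merge-based region algebra algorithms rather than nested loops:

def contains(a, b):
    # Region a strictly contains region b: s_b > s_a and e_b < e_a.
    # Regions are (region_id, start, end) tuples, as in the node index N.
    return b[1] > a[1] and b[2] < a[2]

def containment_join(A, B):
    # Naive nested-loop variant of the containment join, returning id pairs (o_a, o_b).
    return [(a[0], b[0]) for a in A for b in B if contains(a, b)]

# containment_join([("bdy", 5, 24)], [("dating", 17, 17)]) == [("bdy", "dating")]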
The remainder of this section describes the query processing employed in the prototype system to evaluate the example query. To keep things simple, we present the physical layer as a familiar SQL database; in the prototype implementation, however, we use the Monet Interface Language (MIL, [38]), gaining better control over the generated query plans.
The first part of the generated query plan focuses on the processing of struc-
tural constraints, and is handled in the logical XML extension. For the exam-
ple query, it identifies the document components in the collection that are sub-
sequently ranked by the IR extension, which implements the about function.
The XML processing extension produces its query plans based upon the region
indexing scheme outlined in Section 12.3, using the physical database schema
shown in Figure 12.4. It selects the collection of article components specified
by XPath expression //article/text() (a collection of bags of words), fil-
tered by the specified constraints on publication date and language attributes:
articles :=
SELECT n.region_id, start, end
FROM nodeindex n,
attributeindex al, attributeindex ap
WHERE n.tag = 'article'
AND al.region_id = n.region_id
AND al.attr_name = 'language' AND al.attr_val = 'English'   -- attribute predicates reconstructed from the example query
AND ap.region_id = n.region_id
AND ap.attr_name = 'pdate' AND ap.attr_val BETWEEN '31-1-2003' AND '1-3-2003';
mat_articles :=
SELECT a.region_id, w.position
FROM articles a, wordindex w
WHERE a.start < w.position AND w.position < a.end;
12.4.2 IR Processing
The next step in this discussion focuses on the logical extension for IR pro-
cessing; in our example query, this extension handles the ranking of article
components selected by the XML extension.
The prototype system uses Hiemstra’s statistical language modeling ap-
proach for the retrieval model underlying the about function (Chapter 7 of
this book). The selected XML sub-documents are thus ranked by a linear
combination of term frequency (tf ) and document frequency (df ). The lan-
guage model smoothes probability P (Ti |Dj ) (for which the tf statistic is a
maximum likelihood estimator) with a background model P (Ti ) (for which
the df statistic is a maximum likelihood estimator), computing the document
component’s retrieval status value by aggregating the independent scores of
each query term.
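A common presentation of this ranking formula is score(d) = Σ_i log(1 + λ·P(t_i|d) / ((1−λ)·P(t_i))), with λ the smoothing parameter; the following Python sketch (an illustration under this assumed formulation, not the prototype's MIL code) computes it from precomputed probability estimates:

import math

def rank_components(query_terms, tf, df, lam=0.15):
    # tf: component_id -> {term: P(term | component)} (normalised term frequencies);
    # df: term -> P(term) estimated on the background collection;
    # lam: smoothing parameter of the linear interpolation (an assumed default value).
    scores = {}
    for comp, probs in tf.items():
        score = 0.0
        for t in query_terms:
            if t in probs and df.get(t, 0.0) > 0.0:
                score += math.log(1.0 + (lam * probs[t]) / ((1.0 - lam) * df[t]))
        scores[comp] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)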
The IR processing extension at the logical level manipulates collections of
bag-of-words representations of the document components to be ranked. Let
us first consider the calculation of the term probabilities, which requires the normalized term frequencies of the query terms in each selected article component; the component lengths are kept in a separate relation mat_art_len:
ntf_ij :=
SELECT mat_articles.region_id, w.term,
(count(*) / mat_art_len.length) AS prob
FROM mat_articles, mat_art_len, wordindex w, query q
WHERE w.term = q.qterm
AND mat_articles.position = w.position
AND mat_articles.region_id = mat_art_len.region_id
GROUP BY mat_articles.region_id, w.term;
For generality, the computation of these probabilities has been assumed
completely dynamic, for the IR extension cannot predict what node sets in
the XML collection will be used for ranking. In practice however, when the
collection is mostly static and the same node sets are used repeatedly for rank-
ing (e.g., users ranking always subsets of //article/text()), the relations
storing term counts and component lengths should obviously be maintained
as materialized views.
Similar arguments hold for the estimation of P (Ti ), the term probability
in the background model. As explained in [89] however, the collection from
which the background statistics are to be estimated should be specified as
a parameter of the about operator (alternatively, the right scope could be
guessed by the system). Let the collection of all article nodes be appropriate in
the example query (and not the subset resulting from the attribute selections),
and the following queries compute the background statistics:
num_art :=
SELECT COUNT(*) FROM nodeindex WHERE tag=’article’;
art_qterm :=
SELECT DISTINCT n.region_id, w.term
FROM nodeindex n, wordindex w, query q
WHERE w.term = q.qterm
AND n.tag = ’article’
AND n.start < w.position AND w.position < n.end;
ndf_i :=
SELECT term, (count(*) / num_art) AS prob
FROM art_qterm GROUP BY term;
The final step computes the ranking function from the intermediate term
probabilities in document and collection:
ranks :=
SELECT ntf_ij.region_id,
  sum(log(1.0 + ((ntf_ij.prob / ndf_i.prob) *
    (lambda / (1.0 - lambda))))) AS score   -- completion assumed; lambda denotes the smoothing parameter
FROM ntf_ij, ndf_i
WHERE ntf_ij.term = ndf_i.term GROUP BY ntf_ij.region_id;
12.5 Discussion
The information retrieval models discussed so far have been straightforward,
ignoring semantic information from XML tags, as well as most of the logical
and conceptual structure of the documents. In spite of the simplicity of the
retrieval models discussed, these examples demonstrate the suitability of the
‘database approach’ for information retrieval applications. The next step in
our research is to determine what extra knowledge we need to add to increase
retrieval effectiveness. Development of new retrieval models (that exploit the
Explicitly presenting expectedly bad results to the user might very well speed
up the entire process as the negative user feedback on those results will rule
out significant parts of the search space for processing in further iterations.
Trading quality for speed is an interesting option for the first steps of the user.
12.6 Conclusions
We have identified two types of challenges for IR systems that are difficult to address with the current engineering practice of hard-coding the ranking
process in highly optimized inverted file structures. We propose that the trade-
off between flexibility and efficiency may be resolved by adopting a ‘database
approach’ to IR. The main advantage of adhering to the database approach
is that it provides a system architecture allowing to balance flexibility and
efficiency. Flexibility is obtained by declarative specification of the retrieval
model, and efficiency is addressed through algebraic optimization in the map-
ping process from specification to query plan.
Existing (relational) database system architectures are, however, inadequate for a proper integration of querying on content and structure. The Multi-Model DBMS architecture is proposed as an alternative design for extending database technology for this type of retrieval application. By discussing the query processing strategies for an example query combining content and structure, we have explained the main differences with existing black-box approaches for extending database technology.
The chapter concluded with a discussion of future directions in IR system implementation for which our proposed architecture is even more important. In particular, we claim that both fragmentation and optimization through quality prediction would benefit greatly from an open, extensible, layered approach, i.e., from the advantages of the Multi-Model DBMS architecture.
13
PowerDB-XML: Scalable XML Processing
with a Database Cluster
13.1 Introduction
The flexible data model underlying the W3C XML document format covers
a broad range of application scenarios. These scenarios can be categorized
into data-centric and document-centric ones. Data-centric processing stands
for highly structured XML documents, queries with precise predicates, and
workloads similar to those of online transaction processing. Document-
centric processing in turn denotes searching for relevant information in XML
documents in the sense of information retrieval (IR for short). With document-
centric scenarios, XML documents are typically less rigidly structured and
queries have vague predicates. Today, different systems address these needs,
namely database systems and information retrieval systems. XML however
offers the perspective to cover them with a single integrated framework, and
to make the above distinction obsolete at least in terms of the underlying
system infrastructure. The aim of the PowerDB-XML engine being developed
at ETH Zurich is to build an efficient and scalable platform for combined
data-centric and document-centric XML processing. The following overview
summarizes important requirements that a respective XML engine must cover
efficiently:
• lossless storage of XML documents,
• reconstruction of the XML documents decomposed into storage structures,
• navigation and processing of path expressions on XML document struc-
ture,
• processing of precise and vague predicates on XML content, and
• scalability in the number and size of XML documents.
To cover these requirements efficiently and in combination is challenging
since a semistructured data model underlies the XML format. Consequently,
both the data and its structure are defined in the XML documents. This makes
for instance optimization much more difficult than with rigidly structured data
where a clear distinction into schema and data exists.
None of the storage schemes for XML documents available so far cov-
ers all the aforementioned requirements in combination. Native XML storage
techniques lack standard database functionality such as transactions, buffer
management, and indexing. This functionality however comes for free with
approaches that map XML documents to databases. This makes relational
database systems attractive as storage managers for XML documents. Nev-
ertheless, relational databases currently fall short in supporting document-
centric XML processing. The reason for this is that the database mapping
techniques that have been proposed so far focus on data-centric processing
only. Therefore, database mapping techniques in isolation are not a viable
solution either in order to cover the requirements. Combinations of native
XML storage with database mappings may appear as an attractive alterna-
tive. However, they are not available yet with commercial database systems
or commercial XML engines.
In this chapter, we present the XML storage scheme of PowerDB-XML.
It combines native XML storage management with database mappings and
efficiently supports document-centric processing. Our approach builds on rela-
tional database systems as storage managers for XML documents. This covers
efficient data-centric processing. Moreover, special interest is paid to efficient
document-centric processing, in particular to flexible relevance-oriented search
on XML, as explained in Chap. 6. Note that the full-text search functional-
ity provided by database systems does not cover the requirements for flexible
relevance-oriented search on XML documents. This is another argument why
we must rule out commercial off-the-shelf approaches.
A further benefit of PowerDB-XML’s storage management is that it nicely
fits with a cluster of database systems as underlying infrastructure. A cluster
of databases is a cluster of workstations or personal computers (PCs) inter-
connected by a standard network. With the PowerDB project at ETH Zurich,
the idea is to use off-the-shelf components as much as possible regarding both
hardware and software. Therefore, each node of the cluster runs a commer-
cially available operating system. In addition, a relational database system
is installed and running on each node. Such a cluster of databases is attrac-
tive as an infrastructure for information systems that require a high degree
of scalability. This is because scaling-out the cluster is easy when higher per-
formance is required: one adds further nodes to the cluster. This provides
additional computational resources and storage capacities to the overall sys-
tem, and ideally the envisioned system re-arranges data and workloads such
that performance is optimal on the enlarged cluster. Figure 13.1 illustrates
XML processing with PowerDB-XML and a cluster of databases.
Storage schemes for XML documents can be characterized along two dimensions.
The first one focuses on the physical storage design. The second one relies on the
mapping function from XML documents to the actual storage structures.
Regarding the first dimension, so-called native storage techniques for XML
documents are distinguished from database storage for XML. Native XML
storage techniques aim at developing new storage structures that reflect the
semistructured data model of XML. Natix [160] and PDOM [167] are promi-
nent examples of native XML storage managers. The expectation is that spe-
cific storage schemes for XML provide superior performance as compared to
the following approach.
Database Storage for XML stores XML documents based on existing
data structures of the underlying database systems. The XML structures are
mapped onto the data model of the database system using mapping functions
discussed subsequently. Here, the expectation is that leveraging existing and
well-understood storage techniques such as today’s database systems yields
better performance for XML processing.
Document Table
Table xmldocument stores the texts of the XML fragments. The columns of the
table are the primary key columns docid and seqno and a column doctext.
The column docid stores internal system generated document identifiers. Col-
umn seqno in turn stores the sequence number of the fragment. For lack of
off-the-shelf native XML storage techniques with relational database systems,
doctext currently stores the fragments as character-large-objects (CLOBs).
However, as soon as more efficient extensions for XML become available with
commercial database systems, SFRAG can store them using native storage
schemes instead of CLOBs, which we expect to further improve performance.
This representation of documents is well-suited for fetching and updating
complete document texts with document-centric XML processing. Both re-
trieval and update of the full document text map to the row-operations of the
database system, which are efficient. In the following, we explain how to make
data-centric processing efficient as well.
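A minimal relational sketch of this document table (type names vary across
database systems) is:
CREATE TABLE xmldocument (
  docid   INTEGER NOT NULL,   -- internal, system-generated document identifier
  seqno   INTEGER NOT NULL,   -- sequence number of the XML fragment
  doctext CLOB,               -- fragment text, kept as a CLOB for now
  PRIMARY KEY (docid, seqno)
);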
Side Tables
Fig. 13.4. IR-descriptions for XML documents with the SFRAG mapping
in Fig. 13.4. The query searches for books with a description relevant to the
query ’space stars XML’ under tf·idf ranking and that cost less than $50.
Processing this query with PowerDB-XML logically comprises three steps:
the first step computes the ranking of book description elements using a SQL
statement over the tables IL bookdescription and S bookdescription. The sec-
ond step then eliminates those books that do not fulfill the predicate over
price using a SQL clause over the side table book. Finally, PowerDB-XML
loads the XML fragments/documents identified by the previous steps from
the xmldocument table. This is necessary when the XML documents comprise
content not covered by database mappings and IR descriptions. Figure 13.4
illustrates this with the unusual elements which must be returned as part of
the query result to guarantee lossless storage.
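For illustration, a sketch of the three steps in SQL. The column layouts are
assumptions made for this sketch: IL_bookdescription(docid, term, tf) as the
inverted list over book descriptions, S_bookdescription(term, df) for the
corresponding collection statistics, and book(docid, price) as the side table;
the actual schemas and the precise tf·idf weighting of PowerDB-XML may differ:
-- Steps 1 and 2: rank book descriptions by a simple tf-idf score and keep
-- only books cheaper than 50; :N is the assumed number of book descriptions.
SELECT   il.docid, SUM(il.tf * LOG(1.0 * :N / s.df)) AS score
FROM     IL_bookdescription il, S_bookdescription s, book b
WHERE    il.term = s.term
AND      il.term IN ('space', 'stars', 'XML')
AND      b.docid = il.docid
AND      b.price < 50
GROUP BY il.docid
ORDER BY score DESC;
-- Step 3: load the qualifying XML fragments from the document table
-- (executed once per qualifying docid).
SELECT doctext FROM xmldocument WHERE docid = ? ORDER BY seqno;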
Routing of XML requests depends on the data organization within the clus-
ter. With partitioning, PowerDB-XML first determines which document types
possibly qualify for the request. It then routes the request to the node groups
of these document types and processes it there. With replication in turn, a
retrieval request is processed at the node group with the smallest current
workload. This leads to a good distribution of the workload among the clus-
ter nodes. An update request, however, must run at all copies to keep replicas
consistent. This introduces additional overhead as compared to a single-copy
setting. With striping, the request is routed to each node of the striped node
group and processed there. Each node processes the request and contributes
Fig. 13.6. XML retrieval request performance with PowerDB-XML, cluster sizes
from 8 to 128 nodes, and database sizes from 80 GB to 1.3 TB (y-axis: response
time in seconds; x-axis: cluster nodes and database size (in 10 GB); one curve per
workload of 1, 5, 10, 15, 20, 30, and 40 concurrent streams)
Fig. 13.7. XML update request performance with PowerDB-XML, cluster sizes
from 8 to 128 nodes, and database sizes from 80 GB to 1.3 TB (axes and workloads
as in Fig. 13.6)
after the other without any think time. Requests distribute equally over XML
retrieval and XML updates. Each request loads about 0.1% of the XML doc-
uments of the collection. Note that our experimental evaluation reflects the
worst case where loading of the XML fragment from the document table is
necessary (third step in Example 2).
Figures 13.6 and 13.7 show the outcome of the experiments. As the curves
for the different workloads in the figures show, increasing workloads lead to
higher response times for both retrieval and updates with any cluster size
and database size. So far, this is what one would expect when more parallel
requests compete for shared resources. However, the figures illustrate an inter-
esting observation: both retrieval and update requests scale less than linearly
for high workloads when scaling out. For increasing collection sizes this means
that there is an increase in response times when doubling both collection size
and cluster size. For instance, average response times with a workload of 20
concurrent streams are about 50% higher with 128 cluster nodes and 1.3 TB
overall database size than with 8 nodes and 80 GB. This is in contrast to the
expectation of ideal scalability, i.e., that response times are constant when
increasing cluster size and collection size at the same rate. Nevertheless, con-
sider that the overall database size has grown by a factor of 16 while response
times have only increased by a factor of 1.5. Hence, the overall result is still
a positive one.
13.7 Conclusions
14
Web-Based Distributed XML Query Processing
14.1 Introduction
Web-based distributed XML query processing has gained in importance in
recent years due to the widespread popularity of XML on the Web. Unlike
centralized and tightly coupled distributed systems, Web-based distributed
database systems are highly unpredictable and uncontrollable, with a rather
unstable behavior in data availability, processing capability and data transfer
speed. As a consequence, a number of conspicuous problems need to be
addressed for Web-based distributed query processing in the novel
context of XML. Some major ones are listed below.
• High autonomy of participating sites. Data sources scattered on the Web
may be equipped either with powerful query engines, say those which can
support XPath, XQuery or some other XML query language, or with sim-
ple ones which offer limited processing capabilities, just like a plain Web
server returning whole XML files. In Web-based distributed database
systems, both the data sources and their associated processing capabilities
need to be modeled and used in query execution planning.
• XML streaming data is proliferating and flowing on the Web. As the size
of the streaming data is usually enormous, it is not efficient to first wait
for all data to arrive, store it locally and then query it. Instead, new
techniques must be deployed for querying streaming XML data.
• Unreliable response times on the Web arise because dozens of Internet
routers may separate the nodes participating in distributed processing.
High delays and complete data jams must be taken into account. When
congestion is detected by mechanisms such as timers, query processing
should activate alternative execution plans to keep the system busy with
other relevant tasks.
• Different expectations of query results. The classical way of querying sug-
gests the delivery of a complete and exact query result. In such systems,
users prefer to have the “time to last” result as short as possible. However,
According to the location where queries are planned and executed, we cate-
gorize query processing into the following three groups.
This is the simplest, and currently the most frequently used architecture in
distributed Web systems. In such a system, one node carries all the respon-
sibilities: it collects and warehouses XML data, plans and executes queries.
Other nodes are accessed in the off-line mode and are asked only to pro-
vide the raw data. Google, a keyword-based search engine, falls in this group
(though Google itself does not solely use XML data). Niagara system [227],
a combination of a search engine and a query processing engine, allows users
to raise arbitrary structured queries over the Internet. Niagara can perform
“on-demand” retrievals on the distributed documents if they are not available
in the local document repository, but are referenced by the full text search
engine. Xyleme system [75] provides users with integrated views of XML data
stored in its local warehouse.
When a node receives a user’s query over a virtual view of data sources,
it produces a complete set of instructions that will evaluate the query in
the distributed environment. These instructions will be sent to corresponding
nodes which will optimize the execution of their local subqueries. In this case,
we have a centralized distributed query processor, responsible for generating
a complete query execution plan. However, the query execution that follows
is delegated to respective local query processors.
from two data sources from the network. Each data entity is first placed in
the hash structure for that data source, and then the other hash is probed.
As soon as a match is found, the result tuple is produced. In order to cope
with memory overflow due to the hash size, other techniques involving
data storage on a secondary storage medium have been proposed. Some specifics
of joining XML elements are discussed in [299], demonstrating that the Lattice-Join
of XML documents can be implemented as a merge operation.
If data streams are continuous, like stock indexes, or simply too long, it
is desirable for a system to be able to show the current status of the results
being generated. A query processor has been designed to support partial result
generation for non-monotonic functions such as sort, average, and sum
[278].
document. Each node of such trees represents a path, leading from the root
to that node. Each such node is assigned an accumulated number of element
instances reached using the path described by the node. This number repre-
sents the selectivity for that path. It can be calculated in a single pass over
the XML document. The second technique is named “Markov tables”: a table
is constructed with a distinct row for each path in the data up to length m.
The calculation of the estimate for longer paths from the data
in the table is shown in [5]. Several summarizing techniques can be exploited
to reduce the size and the complexity of the data structures used in both
approaches.
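For illustration, assume the path counts up to length m = 2 are kept in a
relation pathcount(path, cnt); under the Markov assumption, the frequency of
a longer path is estimated from its overlapping subpaths, e.g., for a path A/B/C
(this merely illustrates the idea from [5], not its exact formulation):
-- count(A/B/C) is estimated as count(A/B) * count(B/C) / count(B)
SELECT (1.0 * ab.cnt * bc.cnt) / b.cnt AS est_count
FROM   pathcount ab, pathcount bc, pathcount b
WHERE  ab.path = 'A/B' AND bc.path = 'B/C' AND b.path = 'B';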
In the presence of integrated views, query processors can start with query
rewriting in such a way that a query over a global view is translated into a set
of queries over local data sources. Such rewriting and creation of a distributed
execution plan involves the techniques known as data-shipping and query-
shipping. [210] describes an architecture for integrating heterogeneous data
sources under an XML global schema, following the local-as-view approach,
where local sources’ schemas such as relational and tree-structured schemas
are described as views over the global XML schema. Users express their queries
against the global XML schema in XQuery, which are then translated into
one or several SQL queries over the local data sources. The advantage of
using a relational query model lies in the benefit from the relational query
capabilities that the relational or XML sources may have. The tuples resulting
from the SQL query execution are then structured into the desired XML result.
XML bears a close similarity to semi-structured data models [44, 54, 24]. One
pioneering work on distributed querying over semistructured data was done
by Dan Suciu [288], who defined the efficiency of a distributed query in terms
of the following two aspects.
1) The total number of communication steps between the data sources is
constant, i.e., independent of the data and of the query. A communication step
can be a broadcast or a gather, and can involve arbitrarily large messages.
2) The total amount of data transferred during query evaluation should
depend only on (a) the total number of links between data sources, and (b)
the size of the total result.
Suciu investigates distributed queries in a context where data sources are
distributed over a fixed number of nodes, and the edges linking the nodes
are classified into local (with both ends in the same node) and cross edges
(with ends in two distinct nodes). Efficient evaluation of regular path ex-
pression queries is reduced to efficient computation of transitive closure of a
distributed graph. For more complex queries, where regular path expressions
are intermixed freely with selections, joins, grouping, and data restructuring,
a collection of recursive functions can be defined accordingly. Those iterate on
the graph’s structure. The queries in this formalism form an algebra C, which
is a fragment of UnQL [53, 55]. By following an algebraic rather than an op-
erational approach, a query Q can be rewritten into Q′, called a decomposed
query, such that on a distributed database, Q can be evaluated by evaluating
Q′ independently at each node, computing the accessible part of all result
fragments, then shipping and assembling the separate result fragments at the
user site.
The proposed query evaluation algorithms provide minimal communica-
tion between data sources. Even if several logical ‘jumps’ (joins in queries)
between data sources exist, execution is planned in such a way that those
data sources exchange data between each other just once. This does not come
without a price. The centralized query planner has to know all the metadata
on the participating data sources to plan the query.
The algorithm and the systems described by Suciu fall in the category of
centralized planning and distributed evaluation architectures.
Since the autonomy and the dynamics of the data sources are quite high
on the Web, maintaining a central metadata repository incurs a high cost.
Some alternative approaches to query evaluation are thus presented in the
sequel.
14.4.2 WEBDIS
14.4.3 Xyleme
Users can query the documents in the repository through a predefined in-
tegrated view. The integrated views for specific thematic domains are defined
by domain experts, but the mapping between each integrated view and the
documents in the warehouse is established with the support of sophisticated
mapping algorithms. Apart from that, the main strengths of Xyleme lie in its
layered and clustered internal architecture. The architecture provides good
scalability in terms of both the number of users and the number of XML
documents stored in the system.
As illustrated, Xyleme falls in the group of Web-based databases with
centralized planning and centralized evaluation approaches.
Niagara [227] and Tukwila [171] are both data integration systems implement-
ing XML query processing techniques.
Niagara system is built as a two-component system with the first compo-
nent being a search engine and the second one being an XML query processing
engine. Niagara allows users to ask arbitrary structured queries over the Web.
Its search engine uses the full text index to select a set of the XML documents
that match the structured content specified in the query, while its XML query
engine is used to perform more complex actions on the selected documents and
to present the requested results. Niagara can perform on-demand retrievals of
XML documents if they are not available in the local document repository.
Still, all the XML query processing is performed centrally.
In comparison, Tukwila provides a mediated schema over a set of hetero-
geneous distributed databases. The system can intelligently process queries
over such a mediated schema, reading data across the network and responding
to data source sizes, network conditions, and other factors.
Both Tukwila and Niagara share a dynamic feature: their query process-
ing is adaptable to changes in the unstable Web environment. Adaptability
is defined here as the ability of the query processor to change the
execution plan of a query during the course of its execution in
response to unexpected environmental events. Both systems achieve adapt-
able query processing by implementing flexible operators within their query
engines. In Niagara, operators are built in such a way that they provide non-
blocking functioning. This means that they can process any data available at
their input at any time. Faced with data delays, the operators can switch to
process other arriving data, and resume the original task when data becomes
available.
In Tukwila, a re-optimization is done on the level of query execution frag-
ments - which are units of query execution. After each fragment is materi-
alized, Tukwila compares the estimated and the achieved execution perfor-
mance. If sufficiently divergent, the rest of the execution plan is re-optimized
using the previous performance sub-results [227]. In addition, a collector op-
erator is proposed for managing data sources with identical schemas. The
Feature              Suciu                  WEBDIS               Xyleme         Niagara        Tukwila
Data Source          semistructured         hyperlinked XML,     XML data       XML data       XML data
                     rooted labeled graphs  HTML documents
Query Planning       static, centralized    static, distributed  static,        dynamic,       dynamic,
                                                                 centralized    centralized    centralized
Query Execution      static, distributed    static, distributed  static,        dynamic,       static,
                                                                 centralized    centralized    centralized
Querying of          -                      -                    -              yes            yes
Streaming Data
Integrated View      -                      -                    yes            -              yes
Query Granularity    graph node             XML document         XML component  XML component  XML component
Query Language       UnQL                   DISQL                XQL            XML-QL         XQuery
Table 14.1 summarizes the different query processing schemes that the above
systems utilize, together with the query facilities they offer to users.
14.5 Conclusions
15
Combining Concept- with Content-Based Multimedia Retrieval
15.1 Introduction
The Internet forms today’s largest source of information, with public services
like libraries and museums digitizing their collections and making (parts of)
them available to the public. Likewise, the public digitizes private information,
e.g., holiday pictures and movies, and shares it on the World Wide Web (WWW).
Such document collections often have two aspects in common: they
contain a high density of multimedia objects, and their content is often seman-
tically related. The identification of relevant media objects in such a vast
collection poses a major problem that is studied in the area of multimedia
information retrieval.
A generic multimedia information retrieval system is sketched in Fig. 15.1
(based on [93]). The left-hand side depicts the interactive process, where the
user formulates her query, using relevance feedback. On the right-hand side
of the figure a librarian is annotating the document collection. In the earlier
stages of information retrieval, the annotation was done in a completely manual
fashion. Today, this process is often supported by automatic annotation, which
is also referred to as content-based retrieval.
The rise of XML as the information exchange format also has its impact
on multimedia information retrieval systems. The semistructured nature of
XML allows several more-or-less structured forms of multimedia annotations
to be integrated in the system, i.e., the database stores a collection of (in-
tegrated) XML documents. However, using the XML document structure di-
rectly for searching through the document collection is likely to lead to seman-
tic misinterpretations. For example, suppose an XML document contains the
structure company → director → name. When searching for the name of the
company without any knowledge of the semantics of the implied structure,
the name of the director can, mistakenly, be found.
With the focus on document collections where the content is semantically
related, it becomes feasible to use a conceptual schema that describes the
content of the document collection at a semantic level of abstraction. This
approach, defined in the Webspace Method [305], allows the user to formulate
complex conceptual queries that exceed the ‘boundary’ of a document. How-
ever, using a conceptual search alone is not satisfactory. The integration with
a content-based technique, such as a feature grammar, is essential.
A feature grammar is based on the formal model of a feature grammar
system and supports the use of annotation extraction algorithms to automat-
ically extract content-based information from multimedia objects [317].
This chapter describes the integration of the Webspace Method and fea-
ture grammars to realize a retrieval system for multimedia XML document
collections. To illustrate the usability of the combined system, fragments of
the Australian Open case study are used. This case study has been described
in more detail in [319] and [318].
The holy grail for automatic annotation is to extract all explicit and implicit
semantic meanings of a multimedia object, i.e., take over a large part of the
manual annotation burden. This ultimate goal may never be reached, but
for limited domains knowledge can be captured well enough to automatically
extract meaningful annotations. To realize this, the semantic gap between raw
sensor data and “real world” concepts has to be bridged. The predominant
approach to bridge this gap is the translation of the raw data into low-level
features, which are subsequently mapped into high-level concepts.
These steps are illustrated in Fig. 15.2, which shows the annotation of an
image of André Agassi. The image is classified by a boolean rule as a photo
on the basis of previously extracted feature values, e.g., the number and average
saturation of the colors. Then a neural net determines whether the photo
contains a human face, based on detected skin areas.
This example shows two kinds of dependencies: (1) output/input depen-
dencies, e.g., the color features are input for the photo decision rule; and
(2) context dependencies, e.g., the face classifier is only run when the image
is a photo. Notice that context dependencies are inherently different from
output/input dependencies. They are based on design decisions or domain
restrictions and are not enforced by the extraction algorithm.
Feature grammars are based on a declarative language, which supports
both output/input and context dependencies. Before introducing the lan-
guage, the next section describes its solid and formal basis.
%detector Graphic(Number,Prevalent,Saturation);
%detector Skin(Location);
%detector matlab::Color(Location);
%classifier bpnn::Faces(Skin);
This feature grammar describes declaratively the extraction process of Fig. 15.2.
Notice that some detectors, i.e., whitebox detectors, and the detector param-
eters take the form of XPath expressions [29].
The Feature Detector Engine (FDE) uses a feature grammar and its associated
detectors to steer an annotation extraction process. This process can now be
implemented by a specific parsing algorithm. The control flow in a feature
grammar system is top-down and favors leftmost derivations. The top-down
algorithm used is based on an exhaustive backtracking algorithm. Backtracking
indicates depth-first behavior: one alternative is chosen and followed until it
either fails or succeeds. Upon failure the algorithm backtracks until an untried
alternative is found and tries that one. The adjective exhaustive means that
the algorithm also backtracks when the alternative is successful, thus handling
ambiguity.
The FDE starts with validation of the start symbol, i.e., Image. The
declaration of this start symbol specifies that at least the Location token
should be available. In the example case the librarian provided the URL of the
Australian Open picture of André Agassi. The FDE continues with building
the parse tree until it encounters the first detector symbol: Color. The Color
detector function needs the Location information, which is passed on as a
parameter to the matlab plugin. This plugin connects to the matlab engine
and requests execution of the Color function. The output of this function is
a new sentence of three tokens: Number, Prevalent and Saturation. This
sentence is subsequently validated using the rules for the Color detector. The
FDE continues in this vein until the start symbol is proved valid.
The result of this process is the following XML document:
<?xml version="1.0"?>
<fg:forest xmlns:fg="http://.../fg" xmlns:Image="http://.../Image">
<fg:elementary idrefs="1@1" start="WWW:WebObject">
<Image:Image id="5479@0">
<Image:Location id="1@1">
<Image:url id="2@1">http://...</Image:url>
</Image:Location>
<Image:Color idref="5480@0"/>
<Image:Class id="9@1">
<Image:Photo idref="5481@0"/>
<Image:Skin idref="5482@0"/>
<Image:Faces idref="5483@0"/>
</Image:Class>
</Image:Image>
</fg:elementary>
<fg:auxiliary>
<Image:Color id="5480@0" idrefs="2@1">
<Image:Number id="3@1">
<fg:int id="4@1">14137</fg:int>
</Image:Number>
<Image:Prevalent id="5@1">
<fg:flt id="6@1">0.01</fg:flt>
</Image:Prevalent>
<Image:Saturation id="7@1">
<fg:flt id="8@1">0.36</fg:flt>
</Image:Saturation>
</Image:Color>
</fg:auxiliary>
<fg:auxiliary>
<Image:Photo id="5481@0" idrefs="3@1 5@1 7@1"/>
</fg:auxiliary>
<fg:auxiliary>
<Image:Skin id="5482@0" idrefs="1@1">
<Image:bitmap id="10@1"><![CDATA[00...]]></Image:bitmap>
</Image:Skin>
</fg:auxiliary>
<fg:auxiliary>
<Image:Faces id="5483@0" idrefs="10@1">
<fg:int id="11@1">1</fg:int>
</Image:Faces>
</fg:auxiliary>
</fg:forest>
This document contains the constructed parse tree (see Fig. 15.3), and hence
the desired annotation information. The database contains a collection of
these XML documents, thus storing the content-based data of the multimedia
collection.
Over time, the source data on which the stored annotation information is
based may change. Also, new or improved extraction algorithms may become
available. The declarative nature of a feature grammar gives the opportunity to
incrementally maintain the stored information. Based on the output/input
and context dependencies embedded in the feature grammar, a dependency
graph can be built (see Fig. 15.4). Using this graph, the Feature Detector
Scheduler (FDS) can localize changes and trigger incremental parses of the
FDE. In an incremental parse the FDE only validates a single grammar com-
ponent, thus only revalidating a partial parse tree. Following the links in the
dependency graph the FDS can trigger other incremental parses based on the
updated annotations.
The next section will describe the Webspace Method, which, combined
with feature grammars, provides an advanced architecture for multimedia
information retrieval.
The modeling stage of the Webspace Method consists of four steps that are
carried out to create a document collection. These four steps have been inte-
grated in the Webspace modeling tool, which functions as an authoring tool
for content management (see Fig. 15.5).
The first step is to identify concepts that adequately describe the content
contained in the document collection. Once the webspace schema is (partially)
defined, a view on the schema can be defined that describes the structure of the
document that the author wants to create. This structure is then exported to
an XML Schema Definition. Once the structure of the document is known, the
content can be added. For maintenance reasons the content is ideally stored
in a data management system, but the author can also choose to manually
add the content to the document.
The result is an XML document that is seen as a materialized view on the
webspace schema, since it contains both the data, and part of the conceptual
schema. The XML document by itself is not suitable for presentation to the
(Figure: the webspace class Player with attributes name, country, gender,
picture: Image, and history: Hypertext.)
1. Constructing the query skeleton. The first step of the query formula-
tion process involves the construction of the query skeleton. This skeleton
is created, using a visualization of the webspace schema. This visualiza-
tion consists of a simplified class diagram, and only contains the classes
and associations between the classes, as defined in the webspace schema.
The user simply composes the query skeleton, based on his information
need, by selecting classes and related associations from the visualization.
The (single) graph that is created represents the query skeleton.
Figure 15.7.a presents a fragment of the GUI of the webspace search engine,
showing the query skeleton (depicted in black-filled text boxes) that is used
for the formulation of the three example queries.
2. Formulating the constraints. In the second step of the query formu-
lation process, the constraints of the query are defined. In Fig. 15.7.b
another fragment of the GUI of the webspace search engine is presented,
showing the interface that is used for this purpose. For each class con-
tained in the query skeleton a tab is activated, which allows a user to
formulate the conceptual constraints of the query. As shown in the figure,
a row is created for each attribute. Each row contains two check boxes,
the name of the attribute, and either a text field or a button.
The first checkbox is used to indicate whether the attribute is used as a
constraint of the query. The second checkbox indicates whether the results
of the query should show the corresponding attribute. If the type of the
attribute is a BasicType, a text field is available that allows the user to
specify the value of the constraint, if the first checkbox is checked. If the
attribute is of type WebClass, a button is available, which, if pressed,
activates the interface that is used to query that particular multimedia
object.
Fig. 15.7.c shows the interface that is used to formulate queries over Hyper-
text-objects, i.e., define content-based constraints. The figure shows both
a low-level and advanced interface to the underlying feature grammar
15.5 Conclusions
Fig. 15.8. View on ‘Australian Open’ containing the result of example query 3.
knowledge in the automatic process and thus the quality of the extracted
annotations.
16
Tree Awareness for Relational DBMS Kernels:
Staircase Join
16.1 Introduction
Relational database management systems (RDBMSs) derive much of their
efficiency from the versatility of their core data structure: tables of tuples.
Such tables are simple enough to allow for an efficient representation on all
levels of the memory hierarchy, yet sufficiently generic to host a wide range
of data types. If one can devise mappings from a data type τ to tables and
from operations on τ to relational queries, an RDBMS may be a premier
implementation alternative. Temporal intervals, complex nested objects, and
spatial data are sample instances for such types τ .
The key to efficiency of the relational approach is that the RDBMS is
made aware of the specific properties of τ . Typically, such awareness can be
implemented in the form of index structures (e.g., R-trees [150] efficiently
encode the inclusion and overlap of spatial objects) or query operators (e.g.,
the multi-predicate merge join [331] exploits knowledge about containment of
nested intervals).
This chapter applies this principle to the tree data type with the goal of
turning RDBMSs into efficient XML and XPath processors [29]. The database
system is supplied with a relational [193] XML document encoding, the XPath
accelerator [146]. Encoded documents (1) are represented in relational tables,
(2) can be efficiently indexed using index structures native to the RDBMS,
namely B-trees, and (3) XPath queries may be mapped to SQL queries over
these tables. The resulting purely relational XPath processor is efficient [146]
and complete (supports all 13 XPath axes).
We will show that an enhanced level of tree awareness, however, can lead
to a query speed-up by an order of magnitude. Tree awareness is injected into
the database kernel in terms of the staircase join operator, which is tuned to
exploit the knowledge that the RDBMS operates over tables encoding tree-
shaped data. This is a local change to the database kernel: standard B-trees
suffice to support the evaluation of staircase join and the query optimizer may
treat staircase join much like other native join operators.
The XPath accelerator document encoding [146] preserves this region notion.
The key idea is to design the encoding such that the nodes contained in an
axis region can be retrieved by a relational query simple enough to be effi-
ciently supported by relational index technology (in our case B-trees). Equa-
tion (16.1) guarantees that all document nodes are indeed represented in such
an encoding.
Fig. 16.1. XPath axes induce document regions: shaded nodes are reachable from
context node f via a step along the (a) preceding, (b) descendant, (c) ancestor,
(d) following axes. Leaf nodes denote either empty XML elements, attributes, text,
comment, or processing instruction nodes; inner nodes represent non-empty elements
The actual encoding maps each node v to its preorder and postorder traver-
sal ranks in the document tree: v ↦ ⟨pre(v), post (v)⟩. In a preorder traversal,
a node v is visited and assigned its preorder rank pre(v) before its children are
recursively traversed from left to right. Postorder traversal is defined dually:
node v is assigned its postorder rank post (v) after all its children have been
¹ In the sequel, we will abbreviate such XPath step expressions as f/following.
traversed. For the XML document tree of Fig. 16.1, a preorder traversal enu-
merates the nodes in document order (a, . . . , j) while a postorder traversal
enumerates (c, b, d, g, h, f, j, i, e, a), so that we get ⟨pre(e), post (e)⟩ = ⟨4, 8⟩,
for instance.
Fig. 16.2. Pre/post plane for the XML document of Fig. 16.1 (• document node,
◦ context node). Dashed and dotted lines indicate the document regions as seen
from context nodes f and g, respectively
Figure 16.2 depicts the two-dimensional pre/post plane that results from en-
coding the XML instance of Fig. 16.1. A given context node f , encoded as
⟨pre(f ), post (f )⟩ = ⟨5, 5⟩, induces four rectangular regions in the pre/post
plane, e.g., in the lower-left partition we find the nodes f/preceding =
(b, c, d). This characterization of the XPath axes is much more accessible for an
RDBMS: an axis step can be evaluated in terms of a rectangular region query
on the pre/post plane. Such queries are efficiently supported by concatenated
(pre, post ) B-trees (or R-trees [146]).
The further XPath axes, like, e.g., following-sibling or ancestor-or-self,
determine specific supersets or subsets of the node sets computed
by the four partitioning axes. These are easily characterized if we addi-
tionally maintain parent node information for each node, i.e., use v ↦
⟨pre(v), post (v), pre(parent (v))⟩ as the encoding for node v. We will focus
on the four partitioning axes in the following.
Note that all nodes are alike in the XPath accelerator encoding: given
an arbitrary context node v, e.g., computed by a prior XPath axis step or
XQuery expression, we retrieve ⟨pre(v), post (v)⟩ and then access the nodes
in the corresponding axis region. Unlike related approaches [79], the XPath
accelerator has no bias towards the document root element. Please refer to
[146, 147] for an in-depth explanation of the XPath accelerator.
Inside the relational database system, the encoded XML document tree, i.e.,
the pre/post plane, is represented as a table doc with schema doc(pre, post , type).
Each tuple encodes a single node (with field type discriminating element, at-
tribute, text, comment, processing instruction node types). Since pre is unique
– and thus may serve as node identity as required by the W3C XQuery and
XPath data model [115] – additional node information is assumed to be hosted
in separate tables using pre as a foreign key.2 A SAX-based document loader
[266] can populate the doc table using a single sequential scan over the XML
input [146].
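As a minimal sketch (column types are illustrative), the doc table and the
concatenated (pre, post ) B-tree can be set up as follows:
CREATE TABLE doc (
  pre   INTEGER NOT NULL PRIMARY KEY,  -- preorder rank, also serves as node identity
  post  INTEGER NOT NULL,              -- postorder rank
  type  CHAR(1) NOT NULL               -- element/attribute/text/comment/proc. instr.
);
CREATE INDEX doc_pre_post ON doc (pre, post);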
The evaluation of an XPath path expression p = s1 /s2 / · · · /sn leads to
a series of n region queries where the node sequence output by step si is
the context node sequence for the subsequent step si+1 . The context node
sequence for step s1 is held in table context (if p is an absolute path, i.e.,
p = /s1 / · · · , context holds a single tuple: the encoding of the document root
node). XPath requires the resulting node sequence to be duplicate free as
well as being sorted in document order [29]. These inherently set-oriented,
or rather sequence-oriented, XPath semantics are implementable in plain SQL
(Fig. 16.3).
Fig. 16.3. Translating the XPath path expression s1 /s2 / · · · /sn (with context con-
text) into an SQL query over the document encoding table doc
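For illustration, the following sketch instantiates such a query for a two-step
path s1/s2, taking a descendant step for s1 and a following step for s2 as
examples, with the region predicates spelled out in place of the axis(si, vi, vi+1)
macro used in Fig. 16.3:
SELECT DISTINCT v2.pre, v2.post, v2.type
FROM     context c, doc v1, doc v2
WHERE    v1.pre > c.pre  AND v1.post < c.post    -- axis(s1, c, v1): descendant
AND      v2.pre > v1.pre AND v2.post > v1.post   -- axis(s2, v1, v2): following
ORDER BY v2.pre;                                 -- result in document order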
Note that we focus on the XPath core, namely location steps, here. Func-
tion axis(·), however, is easily adapted to implement further XPath con-
cepts, like node tests, e.g., with XPath axis α and node kind κ ∈ {text(),
comment(), . . . }:
Fig. 16.4. SQL equivalent for the XPath expression s1[s2]/s3 (note the exchange
of v1 for v2 in axis(s3, v1, v3), line 3).
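The corresponding query for the path s1[s2]/s3 differs only in the join pred-
icates; a sketch along the same lines (again with example axes, and with the
predicate step s2 merely testing for the existence of a qualifying node v2):
SELECT DISTINCT v3.pre, v3.post, v3.type
FROM     context c, doc v1, doc v2, doc v3
WHERE    v1.pre > c.pre  AND v1.post < c.post    -- axis(s1, c, v1)
AND      v2.pre > v1.pre AND v2.post < v1.post   -- axis(s2, v1, v2): predicate [s2]
AND      v3.pre > v1.pre AND v3.post > v1.post   -- axis(s3, v1, v3): step from v1, not v2
ORDER BY v3.pre;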
The structure of the generated SQL queries – a flat self-join of the doc table
using a conjunctive join predicate – is simple. An analysis of the actual query
plans chosen by the optimizer of IBM DB2 V7.1 shows that the system can
cope quite well with this type of query. Figure 16.5 depicts the situation for
a two-step query s1 /s2 originating in context sequence context.
(Fig. 16.5: the resulting query plan. Table context and two instances of doc,
v1 and v2, are combined by two merge joins, each fed by a (pre, post ) B-tree
index scan (ixscan); a sort on pre and a final duplicate elimination (unique)
complete the plan.)
The RDBMS maintains a B-tree over concatenated (pre, post ) keys. The
index is used to scan the inner (right) doc table join inputs in pre-sorted
order. The context is, if necessary, sorted by the preorder rank pre as well.
Both joins may thus be implemented by merge joins. The actual region query
evaluation happens in the two inner join inputs: the predicates on pre act as
index range scan delimiters while the conditions on post are fully sargable
[277] and thus evaluated during the B-tree index scan as well. The joins are
actually right semijoins, producing their output in pre-sorted order (which
matches the request for a result sequence sorted in document order in line 4
of the SQL query).
As reasonable as this query plan might appear, the RDBMS treats table
doc (and context) like any other relational table and remains ignorant of tree-
specific relationships between pre(v) and post (v) other than that both ranks
are paired in a tuple in table doc. The system thus gives away significant
optimization opportunities.
To some extent, however, we are able to make up for this lack of tree
awareness at the SQL level. As an example, assume that we are to take a
descendant step from context node v (Fig. 16.6). It is sufficient to scan the
(pre, post ) B-tree in the range from pre(v) to pre(v⁺) since v⁺ is the rightmost
Fig. 16.6. Nodes with minimum post (v⁺⁺) and maximum pre (v⁺) ranks in the
subtree below v
leaf in the subtree below v and thus has maximum preorder rank. Since the
pre-range pre(v)–pre(v⁺) contains exactly the nodes in the descendant axis
of v, we have³
pre(v⁺) = pre(v) + |v/descendant| . (16.2)
Additionally, for any node v in a tree t we have that
pre(v) − post (v) + |v/descendant| = level (v) , (16.3)
where level (v) denotes the length of the path from t’s root to v which is
obviously bound by h, the overall height of t.4 Equations (16.2) and (16.3)
provide us with a characterization of pre(v⁺) expressed exclusively in terms of
the current context node v:
pre(v⁺) = post (v) + level (v) ≤ post (v) + h .
A dual argument applies to leaf v⁺⁺, the node with minimum postorder rank
below context node v (Fig. 16.6). Taken together, we can use these observa-
tions to further delimit the B-tree index range scans to evaluate descendant
axis steps: a node v′ lies in the descendant axis of v if and only if
pre(v) < pre(v′) ≤ post (v) + h and pre(v) − h ≤ post (v′) < post (v).
Note that the index range scan is now delimited by the actual size of the
context nodes’ subtrees – modulo a small misestimation of maximally h which
is insignificant in multi-million node documents – and independent of the
document size. The benefit of using these shrunk descendant axis regions
is substantial, as Fig. 16.7 illustrates for a small XML instance. In [146], a
speed-up of up to three orders of magnitude has been observed for 100 MB
XML document trees.
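For illustration, the shrunk window can be stated directly as SQL range pred-
icates over the doc table; :pre_v, :post_v, and :h are host variables holding
pre(v), post (v), and the document height h:
SELECT   d.pre, d.post, d.type
FROM     doc d
WHERE    d.pre  > :pre_v       AND d.pre  <= :post_v + :h  -- delimits the B-tree range scan
AND      d.post >= :pre_v - :h AND d.post <  :post_v
ORDER BY d.pre;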
Nevertheless, as we will see in the upcoming section, the index scans and joins
in the query plan of Fig. 16.5 still perform a significant amount of wasted work,
³ We use |s| to denote the cardinality of set s.
⁴ The system can compute h at document loading time. For typical real-world XML
instances, we have h ≈ 10.
Fig. 16.7. Original (dark) and shrunk (light) pre and post scan ranges for a
descendant step to be taken from v
especially for large context sequences. Being uninformed about the fact that
the doc table encodes tree-shaped data, the index scans repeatedly re-read
regions of the pre/post plane only to generate duplicate nodes. This, in turn,
violates XPath semantics such that a rather costly duplicate elimination phase
(the unique operator in Fig. 16.5) at the top of the plan is required.
Real tree awareness, however, would enable the RDBMS to improve XPath
processing in important ways: (1) since the node distribution in the pre/post
plane is not arbitrary, the ixscans could actually skip large portions of the
B-tree scans, and (2) the context sequence induces a partitioning of the plane
that the system can use to fully avoid duplicates.
The necessary tree knowledge is present in the pre/post plane – and actu-
ally available at the cost of simple integer operations like +, < as we will now
see – but remains inaccessible for the RDBMS unless it can be made explicit
at the SQL level (like the descendant window optimization above).
16.3.1 Pruning
path’s shade, the more often are its nodes produced in the resulting node se-
quence – which ultimately leads to the need for a costly duplicate elimination
phase. Obviously, we could remove nodes e, f, i – which are located along a
path from some other context node up to the root – from the context node
sequence without any effect on the final result (a, d, e, f, h, i, j) (Fig. 16.8 (b)).
Such opportunities for the simplification of the context node sequence arise
for all axes.
Figure 16.9 depicts the scenario in the pre/post plane as this is the
RDBMS’s view of the problem (these planes show the encoding of a slightly
larger XML document instance). For each axis, the context nodes establish a
different boundary enclosing a different area. Result nodes can be found in
the shaded areas. In general, regions determined by context nodes can include
one another or partially overlap (dark areas). Nodes in these areas generate
duplicates.
c2, c4 for (a) the descendant and c3, c4 for (c) the following axis. The process of
identifying the context nodes at the cover’s boundary is referred to as prun-
ing and is easily implemented by a simple postorder rank comparison
(Fig. 16.14).
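For illustration, a declarative equivalent of pruning for the descendant axis
(not the algorithm of Fig. 16.14 itself) keeps exactly those context nodes that
do not lie in the subtree of another context node:
SELECT c.pre, c.post
FROM   context c
WHERE  NOT EXISTS (SELECT 1 FROM context c2
                   WHERE c2.pre < c.pre AND c2.post > c.post);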
After pruning for the descendant or ancestor axis, all remaining context
nodes relate to each other on the preceding/following axis as illustrated for
descendant in Fig. 16.10. The context establishes a boundary in the pre/post
plane that resembles a staircase.
Observe in Fig. 16.10 that the three darker subregions do not contain any
nodes. This is no coincidence. Any two nodes a, b partition the pre/post
plane into nine regions R through Z (see Fig. 16.11). There are two cases
to be distinguished regarding how both nodes relate to each other: (a) on
ancestor/descendant axis or (b) on preceding/following axis. In (a), re-
gions S, U are necessarily empty because an ancestor of b cannot precede
(region U ) or follow a (region S) if b is a descendant of a. Similarly, region Z
in (b) is empty, because a, b cannot have common descendants if b follows a.
The empty regions in Fig. 16.10 correspond to such Z regions.
A similar empty region analysis can be done for all XPath axes. The con-
sequences for the preceding and following axes are more profound. After
pruning for, e.g., the following axis, the remaining context nodes relate to
each other on the ancestor/descendant axis. In Fig. 16.11 (a), we see that
for any two remaining context nodes a and b, (a, b)/following = S ∪ T ∪ W .
Since region S is empty, (a, b)/following = T ∪ W = (b)/following. Conse-
quently, we can prune a from the context (a, b) without affecting the result. If
this reasoning is followed through, it turns out that all context nodes can be
pruned except the one with the maximum preorder rank in case of preceding
and the minimum postorder rank in case of following. For these two axes,
the context is reduced to a singleton sequence such that the axis step eval-
uation degenerates to a single region query. We will therefore focus on the
ancestor and descendant axes in the following.
Fig. 16.12. The partitions p0 –p1 , p1 –p2 , p2 –p3 of the ancestor staircase separate
the ancestor-or-self paths in the document tree
16.3.2 Partitioning
While pruning leads to a significant reduction of duplicate work, Fig. 16.8 (b)
exemplifies that duplicates still remain due to intersecting ancestor-or-self
paths originating in different context nodes. A much better approach results
if we separate the paths in the document tree and evaluate the axis step for
each context node in its own partition (Fig. 16.12 (a)).
Such a separation of the document tree is easily derived from the staircase
induced by the context node sequence in the pre/post plane (Fig. 16.12 (b)):
each of the partitions p0 –p1 , p1 –p2 , and p2 –p3 define a region of the plane
containing all nodes needed to compute the axis step result for context nodes
d, h, and j, respectively. Note that pruning reduces the number of these par-
titions. (Although a review of the details is outside the scope of this text, it
16.3.3 Skipping
The empty region analysis explained in Sect. 16.3.1 offers another kind
of optimization, which we refer to as skipping. Figure 16.13 illustrates this
for the XPath axis step (c1 , c2 )/descendant. An axis step can be evaluated
by scanning the pre/post plane from left to right and partition by partition
starting from context node c1 . During the scan of c1 ’s partition, v is the first
node encountered outside the descendant boundary and thus not part of the
result.
Note that no node beyond v in the current partition contributes to the
result (the light grey area is empty). This is, again, a consequence of the fact
that we scan the encoding of a tree data structure: node v is following c1
in document order so that both cannot have common descendants, i.e., the
empty region in Fig. 16.13 is a region of type Z in Fig. 16.11 (b).
This observation can be used to terminate the scan early which effectively
means that the portion of the scan between pre(v) and the successive context
node pre(c2 ) is skipped.
The effectiveness of skipping is high. For each node in the context, we
either (1) hit a node to be copied into the result, or (2) encounter a node of
type v which leads to a skip. To produce the result, we thus never touch more
than |result| + |context| nodes in the pre/post plane, a number independent of
the document size.
A similar, although slightly less effective skipping technique can be applied
to the ancestor axis: if, inside the partition of context node c, we encounter a
node v outside the ancestor boundary, we know that v as well as all descen-
dants of v are in the preceding axis of c and thus can be skipped. In such a
(Query plan figures: desc and anc denote the descendant and ancestor variants
of the staircase join, respectively; the plans combine context, σ::text() (doc), and
σ::n (doc) using these operators.)
There are, however, alternative query plans. Staircase join, like ordinary
joins, allows for selection pushdown, or rather node test pushdown: for any
location step α and node test κ, we have σ::κ (context ⋈α doc) = context ⋈α σ::κ (doc),
where ⋈α denotes the staircase join along α.
Figure 16.16 shows a query plan for the example query where both node tests
have been pushed down.
Observe that in the second query plan, the node test is performed on the
entire document instead of just the result of the location step. An RDBMS
already keeps statistics about table sizes, selectivity, and so on. These can be
used by the query optimizer in the ordinary way to decide whether or not the
node test pushdown makes sense. Physical database design does not require
exceptional treatment either. For example, in a setting where applications
mainly perform qualified name tests (i.e., few ‘::*’ name tests), it is beneficial
to fragment table doc by tag name. A pushed down name test σ::n (doc) can
then be evaluated by taking the appropriate fragment without the need for
any data processing.
The addition of staircase join to an existing RDBMS kernel and its query
optimizer is, by design, a local change to the database system. A standard
B-tree index suffices to realize the “tree knowledge” encapsulated in staircase
join. Skipping, as introduced in Sect. 16.3.3, is efficiently implemented by
following the pre-ordered chain of linked B-tree leaves, for example.
We have found staircase join to also operate efficiently on higher levels of
the memory hierarchy, i.e., in a main-memory database system. For queries
like the above example, the staircase join enhanced system processed 1 GB
XML documents in less than 1/2 second on a standard single processor host
[147].
16.5 Conclusions
The approach toward efficient XPath evaluation described in this chapter is
based on a relational document encoding, the XPath accelerator. A preorder
plus postorder node ranking scheme is used to encode the tree structure of an
XML document. In this scheme, XPath axis steps are evaluated via joins over
simple integer range predicates expressible in SQL. In this way, the XPath ac-
celerator naturally exploits standard RDBMS query processing and indexing
technology.
We have shown that an enhanced level of tree awareness can lead to a
significant speed-up. This can be obtained with only a local change to the
RDBMS kernel: the addition of the staircase join operator. This operator
encapsulates XML document tree knowledge by means of incorporating the
described techniques of pruning, partitioning, and skipping in its underlying join algorithm.
17
Processing XML Queries with Tree Signatures
17.1 Introduction
With the rapidly increasing popularity of XML for data representation, there
is a lot of interest in query processing over data that conforms to a labeled-tree
data model. A variety of languages have been proposed for this purpose, all
of which can be viewed as consisting of a pattern language and construction
expressions. Since the data objects are typically trees, tree pattern matching is
the central issue. The idea behind evaluating tree pattern queries, sometimes
called twig queries, is to find all the ways of embedding the pattern in the
data. Because this lies at the core of most languages for processing XML
data, efficient evaluation techniques for these languages require appropriate
indexing structures.
In query processing, signatures are compact (small) representations of im-
portant features extracted from actual documents in such a way that query
execution can be performed on the signatures instead of the documents. In
the past, see e.g. [298] for a survey, such a principle has been suggested as
an alternative to the inverted file indexes. Recently, it has been successfully
applied to indexing multi-dimensional vectors for similarity-based searching
[314], image retrieval [225], and data mining [224].
We define the tree signature as a sequence of tree-node entries, containing
node names and their structural relationships. Though other possibilities exist,
here we show how these signatures can be used for efficient tree navigation
and twig pattern matching.
the second case, element names are associated with references to the names’
occurrences in XML documents. In the following, we briefly discuss the most
important representatives.
APEX
Based on the idea of DataGuide, APEX (Adaptive Path indEx for XML data)
was defined in [72]. APEX does not keep all paths starting from the root,
but only utilizes frequently accessed paths determined in the query workload
by means of specialized mining algorithms. In addition to a graph structure,
similar to a DataGuide, APEX also applies a hash tree. The graph structure
represents a structural summary of XML data. The hash tree contains incom-
ing label paths to nodes of the structure graph. Nodes of the hash tree are
hash tables where entries may point to another hash table or to a node of
the structure graph. The key of an entry in a hash table is an element name.
Following a label path in the hash tree, the corresponding node in the struc-
ture graph is reached. The hash tree represents the frequently accessed paths
and all paths of length two, so any path expression can be evaluated by using
joins, without accessing the original data. A represented path need not start
from the root. By using the hash tree, path expressions containing wildcards
can also be processed efficiently, provided the corresponding label paths are
included in the hash tree.
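As a schematic illustration of the lookup just described (the nested-dictionary layout, the node identifiers, and the backward traversal of incoming label paths are assumptions of this sketch, not APEX's actual data structures):

# Schematic sketch of a hash tree mapping incoming label paths to nodes of the
# structure graph (layout and reverse traversal order are assumptions).
hash_tree = {
    "name":  {"buyer": {"invoice": "sg_node_3"}},   # incoming path invoice/buyer/name
    "buyer": {"invoice": "sg_node_2"},              # incoming path invoice/buyer
}

def lookup(tree, label_path):
    """Resolve a root-to-node label path by walking it from its last label backwards."""
    entry = tree
    for label in reversed(label_path):
        if not isinstance(entry, dict) or label not in entry:
            return None
        entry = entry[label]
    return entry

print(lookup(hash_tree, ["invoice", "buyer", "name"]))   # -> 'sg_node_3'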
Index Fabric
An Index Fabric [79] is a disk-based extension of the Patricia trie. It has the
same scaling property as the Patricia trie, but it is balanced and optimized for disk-based access.
In order to index XML documents by an Index Fabric, paths are encoded using
designators, which are special characters or character strings. Each element
name is associated with a unique designator. Text content of an element is
not modified. For instance, the XML fragment
<invoice>
<buyer><name>ABC Corp</name></buyer>
</invoice>
can be encoded with the string "IBNABC Corp", where the letters I, B,
and N represent, respectively, the element names invoice, buyer, and name.
Each designator encoded path is inserted in the Index Fabric, and designators
are treated as normal characters. A separate designator dictionary is used to
maintain the mapping between the designators and the element names. At-
tributes are considered as children of the corresponding element. In addition
to raw paths, i.e. paths from the root to the leaves occurring in XML docu-
ments, refined paths, i.e. specialized paths that optimize frequently occurring
access patterns, are also inserted in the Index Fabric. Refined paths can be
used to efficiently process specific queries with wildcards and alternates.
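A designator encoding of the raw path above can be sketched as follows; the dictionary mirrors the example in the text, while the function itself is only an illustration of the idea, not code from the Index Fabric.

# Sketch: encoding a root-to-leaf path with designators (I, B, N as in the text).
designator_dict = {"invoice": "I", "buyer": "B", "name": "N"}

def encode_path(element_names, text, ddict):
    """Replace each element name on the path by its designator, then append the text."""
    return "".join(ddict[name] for name in element_names) + text

key = encode_path(["invoice", "buyer", "name"], "ABC Corp", designator_dict)
print(key)   # -> 'IBNABC Corp', the string inserted into the Patricia-trie-based index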
XXL
The path index from [296] associates each element name appearing in the
indexed XML documents with a list of its occurrences. Each element name
is stored exactly once in the index. An occurrence consists of the URL of
the XML document, the unique identifier of the element name occurrence, a
pointer to the parent element, pointers to the children elements, along with
optional XLink and XPointer links. Attributes are uniformly treated as if
they were special children of the corresponding element. All path expressions
can be efficiently processed by using the explicitly maintained references to
parents and children. However, this path index was implemented as a red-
black tree, built on element names and maintained in main memory. For
query processing, the index must first be loaded from disk.
Join Techniques
XISS
[Tree of Fig. 17.1: root a with children b and f; b has children c and g; c has children d and e; f has child h; h has children o and p.]
pre  : a b c d e g f h o p
post : d e c g b o p h f a
rank : 1 2 3 4 5 6 7 8 9 10
Fig. 17.1. Preorder and postorder sequences of a tree with element ranks
The tree signature is a list of all the tree nodes obtained with a preorder
traversal of the tree. Apart from the node name, each entry also contains the
node’s position in the postorder rank.
Definition 1. Let T be an ordered labelled tree. The signature of T is a sequence, sig(T) = ⟨t1, post(t1); t2, post(t2); . . . ; tn, post(tn)⟩, of n = |T| entries, where ti is a name of the node with pre(ti) = i. The post(ti) value is the postorder value of the node named ti and the preorder value i.
For example, ⟨a, 10; b, 5; c, 3; d, 1; e, 2; g, 4; f, 9; h, 8; o, 6; p, 7⟩ is the signature of the tree from Figure 17.1.
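The signature of Definition 1 can be computed by two traversals, as in the following sketch; the adjacency-list encoding of the tree from Figure 17.1 and the function name are assumptions made for this illustration.

# Sketch: computing sig(T) for the tree of Fig. 17.1 (adjacency lists assumed).
tree = {
    "a": ["b", "f"], "b": ["c", "g"], "c": ["d", "e"],
    "f": ["h"], "h": ["o", "p"],
    "d": [], "e": [], "g": [], "o": [], "p": [],
}

def signature(tree, root):
    post_rank, counter = {}, [0]
    def assign_post(n):                 # assign postorder ranks
        for child in tree[n]:
            assign_post(child)
        counter[0] += 1
        post_rank[n] = counter[0]
    assign_post(root)
    sig = []
    def emit_pre(n):                    # emit entries in preorder
        sig.append((n, post_rank[n]))
        for child in tree[n]:
            emit_pre(child)
    emit_pre(root)
    return sig

print(signature(tree, "a"))
# [('a', 10), ('b', 5), ('c', 3), ('d', 1), ('e', 2),
#  ('g', 4), ('f', 9), ('h', 8), ('o', 6), ('p', 7)]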
Suppose the data tree T and the query tree Q are specified by signatures sig(T) = ⟨t1, post(t1); . . . ; tn, post(tn)⟩ and sig(Q) = ⟨q1, post(q1); . . . ; qm, post(qm)⟩, and call any subsequence of sig(T) whose entries bear the same node names, in the same order, as the entries of sig(Q) a sub-signature sub_sigQ(T).
[Fig. 17.2. Properties of the preorder and postorder ranks: relative to a node n in the pre/post plane, the regions A, D, P, and F contain its ancestors, descendants, preceding nodes, and following nodes, respectively.]
On the level of node names, any sub_sigQ(T) ≡ sig(Q), because qi = ti for all i, but the corresponding
entries may have different postorder values. It is important to understand
that, in general, the sequence positions of entries in sub-signatures do not
correspond to the preorder values of the entries in T.
Lemma 1. The query tree Q is included in the data tree T if the following two
conditions are satisfied: (1) on the level of node names, sig(Q) is sequence-included in sig(T), determining sub_sigQ(T); (2) for all pairs of entries i and
j in sig(Q) and sub_sigQ(T), with i, j = 1, 2, . . . , |Q| − 1 and i + j ≤ |Q|, whenever
post(qi+j) > post(qi) it is also true that post(ti+j) > post(ti).
Proof. Because the index i increases according to the preorder sequence, node
i + j must be either the descendant or the following node of i. If post(qi+j ) <
post(qi ), the node i + j in the query is a descendant of the node i, thus also
post(ti+j ) < post(ti ) is required. By analogy, if post(qi+j ) > post(qi ), the
node i + j in the query is a following node of i, thus also post(ti+j ) > post(ti )
must hold.
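For illustration, the two conditions of Lemma 1 can be checked by the following brute-force sketch; the exponential enumeration of candidate sub-signatures is used only to keep the example short, and the signature representation matches the sketch given above.

from itertools import combinations

def includes(sig_T, sig_Q):
    """Brute-force sketch of the inclusion test of Lemma 1."""
    n, m = len(sig_T), len(sig_Q)
    for pos in combinations(range(n), m):            # candidate sub-signatures
        if any(sig_T[p][0] != sig_Q[k][0] for k, p in enumerate(pos)):
            continue                                 # condition (1): names must match
        ok = True
        for i in range(m):
            for j in range(1, m - i):                # all pairs i, i + j
                q_larger = sig_Q[i + j][1] > sig_Q[i][1]
                t_larger = sig_T[pos[i + j]][1] > sig_T[pos[i]][1]
                if q_larger != t_larger:             # condition (2) violated
                    ok = False
        if ok:
            return True
    return False

# e.g. includes(signature(tree, "a"), [("h", 3), ("o", 1), ("p", 2)]) -> True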
For example, consider the data tree T in Figure 17.1 and suppose the query
tree Q consists of nodes h, o, and p structured as in Figure 17.3.
[Fig. 17.3. Query tree Q: node h with children o and p; sig(Q) = ⟨h, 3; o, 1; p, 2⟩.]
situations by simply distinguishing between node names and their unique oc-
currences. Leaf nodes in signatures are all nodes with a postorder smaller than
the postorder of the following node in the signature sequence – the last node
is always a leaf. We can also determine the level of leaf nodes, because the
level of a leaf node ti with index i is level(ti) = i − post(ti).
Extended Signatures
In extended signatures, each entry additionally carries two pointers: ff_i, the position of the first node following the node in position i, and fa_i, the position of its first ancestor, i.e., its parent (with fa_i = 0 for the root). The XPath axes can then be evaluated as follows; a small illustrative sketch follows this list.
Parent. The parent node is directly given by the pointer fa_i. The Ancestor
axis is just the recursive closure of Parent.
Following. The following nodes of the reference node in position i (when they
exist) start in position ff_i and include all nodes up to the end of the
signature sequence. All nodes following c (with i = 3) are in the suffix of
the signature starting in position ff_3 = 6.
Preceding. All preceding nodes are on the left of the reference node as a set of
intervals separated by the ancestors. Given a node with index i, fa_i points
to the first ancestor (i.e. the parent) of i, and the nodes (when they exist)
between i and fa_i precede i in the tree. If we recursively continue from
fa_i, we find all the preceding nodes of i. For example, consider node g with
i = 6: following the ancestor pointers, we get fa_6 = 2, fa_2 = 1, fa_1 = 0, so
the ancestor nodes are b and a, because fa_1 = 0 indicates the root. The
preceding nodes of g are only in the interval from i − 1 = 5 down to fa_6 + 1 = 3
(i.e. nodes c, d, and e), because the second interval, from fa_2 − 1 = 0 down to
fa_1 + 1 = 1, is empty.
Following-sibling. In order to get the following siblings, we just follow the
ff pointers while the following nodes exist and their fa pointers are the
same as fa_i. For example, given node c with i = 3 and fa_3 = 2, the
pointer ff_3 moves us to the node with index 6, that is node g. Node g is
the sibling following c, because fa_6 = fa_3 = 2. But this is also the last
following sibling, because ff_6 = 7 and fa_7 ≠ fa_3.
Preceding-sibling. All preceding siblings must be between the context node
with index i and its parent with index fa_i < i. The first node after the
i-th node's parent, which has the index fa_i + 1, is the first sibling. Then we use
the Following-sibling strategy up to the sibling with index i. Consider
the node f (i = 7) as the context node. The first sibling is b, determined
by the pointer value fa_7 + 1 = 2. Then the pointer ff_2 = 7 leads
us back to the context node f, so b is the only preceding sibling node of
f.
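The pointer chasing described above can be sketched as follows; the tuple layout (name, post, ff, fa), the 1-based indexing, and the sentinel values are assumptions of this illustration, and the extended signature shown is the one of the tree in Fig. 17.1.

# Sketch: axis evaluation on an extended signature (layout assumed).
# Entries are (name, post, ff, fa); fa = 0 marks the root, ff = 11 marks
# "no following node" for this ten-node tree.
ext = {
    1: ("a", 10, 11, 0), 2: ("b", 5, 7, 1),  3: ("c", 3, 6, 2),
    4: ("d", 1, 5, 3),   5: ("e", 2, 6, 3),  6: ("g", 4, 7, 2),
    7: ("f", 9, 11, 1),  8: ("h", 8, 11, 7), 9: ("o", 6, 10, 8),
    10: ("p", 7, 11, 8),
}
FF, FA = 2, 3   # tuple positions of the ff and fa pointers

def ancestors(i):
    """Follow the fa pointers up to the root."""
    out = []
    while ext[i][FA] != 0:
        i = ext[i][FA]
        out.append(i)
    return out

def following_siblings(i):
    """Follow the ff pointers while the parent pointer stays equal to fa_i."""
    out, parent, j = [], ext[i][FA], ext[i][FF]
    while j in ext and ext[j][FA] == parent:
        out.append(j)
        j = ext[j][FF]
    return out

print(ancestors(6))            # [2, 1]: nodes b and a
print(following_siblings(3))   # [6]: node g is the only following sibling of c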
The experimental evaluation in [330] confirms that implementations of the
axes on extended signatures are faster than on short signatures, and the larger
the signature, the better. The actual improvements depend on the axes – the
biggest advantage, amounting to factors in the hundreds, was observed for the ancestor axis,
above all when processing large, shallow trees. In general, the execution costs of
the axes depend on the shape of the tree and the position of the reference node
in it. But under no circumstances do the implementations on the short signatures
outperform the implementations on the extended signatures.
defined by the query, are satisfied. Query execution strategies determine the
ways the query’s predicates are evaluated. In principle, a predicate can be
decided either by accessing a specific part of the document or by means of an
index. So a specific strategy depends on the availability of indexes. We assume
that tree signatures are used to support the verification of required structural
relationships.
A query processor can also exploit tree signatures to evaluate set-oriented
primitives similar to the XPath axes. For instance, given a set of elements
R, the evaluation of P arent(R, article) returns the set of elements named
article, which are parents of elements contained in R. We suppose that
elements are identified by their preorder values, so sets of elements are in fact
sets of element identifiers.
Verifying structural relationships can easily be integrated with evaluating
content predicates. If indexes are available, a good strategy is to use these
indexes to obtain sets of elements which satisfy the predicates, and then verify
the structural relationships using signatures. Consider the following XQuery
[176] query:
for $a in //people
where
$a/name/first="John" and
$a/name/last="Smith"
return
$a/address
Suppose that content indexes are available on the first and last elements.
A possible efficient execution plan for this query is:
First, the content indexes are used to obtain R1 and R2 , i.e. the sets of
elements that satisfy the content predicates. Then, tree signatures are used
to navigate through the structure and verify structural relationships.
Now suppose that a content index is only available on the last element, the
predicate on the first element has to be processed by accessing the content
of XML documents. Though the specific technique for efficiently accessing the
content depends on the storage format of the XML documents (plain text files,
relational transformation, etc.), a viable query execution plan is as follows:
Here, the content index is first used to find R1 , i.e. the set of elements
containing Smith. The tree signature is used to produce R3 , that is the set of
the corresponding first elements. Then, these elements are accessed to verify
that their content is John. Finally, tree signatures are used again to verify the
remaining structural relationships.
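The two plans can be pictured with the following sketch; the primitive names (idx_first, idx_last, parent, child, content_of) are hypothetical stand-ins for the content indexes, the signature-based navigation primitives such as Parent(R, name), and the document access, and are not an API defined in this chapter. Elements are assumed to be represented by their preorder identifiers, so the intermediate results are plain sets.

# Sketch of the two execution plans; all primitives are passed in as callables
# and carry hypothetical names, not operations defined in the chapter.
def plan_both_indexes(idx_first, idx_last, parent, child):
    """Both content indexes available: filter by content, verify structure last."""
    r1 = idx_first("John")                             # <first> elements containing "John"
    r2 = idx_last("Smith")                             # <last> elements containing "Smith"
    names = parent(r1, "name") & parent(r2, "name")    # <name> parents common to both
    people = parent(names, "people")                   # qualifying <people> elements
    return child(people, "address")                    # their <address> children

def plan_last_index_only(idx_last, parent, child, content_of):
    """Only the index on last exists: the first predicate needs document access."""
    r1 = idx_last("Smith")                             # <last> elements containing "Smith"
    r3 = child(parent(r1, "name"), "first")            # corresponding <first> elements
    r3 = {e for e in r3 if content_of(e) == "John"}    # verify their content directly
    people = parent(parent(r3, "name"), "people")      # verify remaining structure
    return child(people, "address")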
While the signature strategy has to follow only one additional step for each qualifying
element, that is, to access one more entry in the signature, containment joins
have to merge potentially large lists of references.
17.6 Conclusions
Inspired by the success of signature files in several application areas, we pro-
pose tree signatures as an auxiliary data structure for XML databases. The
proposed signatures are based on preorder and postorder ranks and support
tree inclusion evaluation, respecting sibling and ancestor-descendant relation-
ships. Navigation operations, such as those required by the XPath axes, are
computed very efficiently. Query processing can also benefit from the appli-
cation of tree signature indexes. For highly selective queries, i.e. typical user
queries, query processing with the tree signature is about 10 times more effi-
cient, compared to the strategy with containment joins.
The proposed signature file approach also creates a good basis for dealing with dynamic XML collections. Even though the preorder and postorder
numbering scheme is affected by document updates – node ranks change when
inserting or deleting a node – the effects are always local within specific sig-
natures. So it is up to the database designer to choose a suitable signature
granularity, which should be quite small for very dynamic collections, while
relatively stable or static collections can use much larger signatures. This
locality property cannot be easily exploited with approaches based on con-
tainment joins or approaches like [146], where updates (as well as insertions
and deletions) usually require an extensive reorganization of the index.
18
A Look Back on the XML Benchmark Project
18.1 Introduction
Database vendors and researchers have been responding to the establishment of
XML [45] as the premier data interchange language for Internet applications
with the integration of XML processing capabilities into Database Manage-
ment Systems. The new features fall into two categories: XML-enabled inter-
faces allow the DBMS to speak and understand XML formats, whereas XML
extensions add novel primitives to the engine core. Both kinds of innovations
have the potential to impact the architecture of software systems, namely by
bringing about a complexity reduction in multi-tier systems. However, it is
often difficult to estimate the effect of these innovations. This is where the
XML Benchmark Project tries to help with XMark. By providing an appli-
cation scenario and a query workload, the benchmark suite can be used to
identify strengths and weaknesses of XML-enabled software systems.
The queries of the benchmark suite target different aspects of querying
of XML documents, both in isolation and in combination. We identify the
following areas of potential performance impacts:
• The topology of XML structures as found in the original document is a
potential candidate for queries; especially systems that implement doc-
ument order on top of an unordered data model may not be properly
prepared for this kind of challenge and have to turn rather simple queries
into complex operations. This is also tested in several benchmark queries.
• The document-oriented nature of XML makes strings the basic data type
applications have to deal with. Typing XML documents is therefore as
important a challenge to make data processing more robust as enforcing
other semantic constraints. Problems can also arise as the typing rules
of query languages may clash with the more complex type systems of
host programming languages. In addition, strings are often not efficient in
database systems since their length can vary greatly, putting additional
stress on the storage engine.
results interpretable we abstract from the systems engineering issues and con-
centrate only on the core ingredients: the query processor and its interaction
with the data store. We do not consider network overhead, communication
costs or transformations to the output. As for the choice of language, we
use XQuery [34] which is the result of incorporating experiences from various
research languages [40] for semi-structured data and XML into a standard.
The target audience of the benchmark could comprise three groups. First,
the framework presented here can help database vendors to verify and refine
their query processors by comparing them to other implementations. Second,
customers can be assisted in choosing between products by using our setting
as a simple case study or pilot project that yet provides essential ingredients
of the targeted system. For researchers, lastly, we provide example data and
a framework for helping to tailor existing technology for use in XML settings
and for refinement or design of algorithms.
[Figure 18.1. Entities of the auction scenario (person, item, category, open auction, closed auction, annotation) and their references, labelled interest, watch, author, bidder, seller, buyer, itemref, categoryref, and from/to.]
the same systems can handle efficiently documents that are more document-
centric [41], i.e., consisting mostly of natural language with mark-up only
interspersed with the result of irregular path structures. Converted to rela-
tional tables in a naive way, the data and query profile often do not match
the kind of pattern traditional database engines are optimized for.
auction, closed auction, item, and category on the one side and entities akin
to annotation on the other side. The relationships between the entities in the
first group are expressed through references, as depicted with arrows in Fig-
ure 18.1. The relationships between the entities of the second group, which
take after natural language text and are document-centric element structures,
are embedded into the sub-trees to which they semantically belong. An ER di-
agram can be found in [47]. The entities we just mentioned carry the following
semantics:
• Items are the objects that are on sale in an auction or that already have
been sold. Each item carries a unique identifier and bears properties like
payment (credit card, money order, . . . ), a reference to the seller, a de-
scription etc., all encoded as elements. Each item is assigned a world region
represented by the item’s parent element.
• Open auctions are auctions in progress. Their properties are the privacy
status, the bid history (i.e., increases or decreases over time) with refer-
ences to the bidders and the seller, the current bid, the time interval within
which bids are accepted, the status of the transaction and a reference to
the item being sold, among others.
• Closed auctions are auctions that are finished. Their properties are the
seller (a reference to a person), the buyer (a reference to a person), a
reference to the respective item, the price, the number of items sold, the
date when the transaction was closed, the type of transaction, and the
annotations that were made before, during and after the bidding process.
• Persons are characterized by name, email address, phone number, mail
address, homepage URL, credit card number, profile of their interests, and
the (possibly empty) set of open auctions they are interested in and get
notifications about.
• Categories feature a name and a description; they are used to implement
a classification scheme of items. A category graph links categories into a
network.
We emphasize that these entities constitute the relatively structured,
i.e., data-oriented part of the document. Their sub-element structure is fairly
regular on a per entity basis but there are predictable exceptions such as that
not every person has a homepage; in a relational DBMS, these exceptions
would typically be taken care of by NULL values. Another characteristic of
these entities is that, apart from occasional list types such as bidding histo-
ries, the order of the input is not particularly relevant. On the other hand, the
sub-elements of the document-centric part of the database, namely those of an-
notation and similar elements, do not accentuate the above aspects. Here the
length of strings and the internal structure of sub-elements varies greatly. The
markup now comprises itemized lists, keywords, and even visual formatting
instructions and character data, doing its best to imitate the characteristics
of natural language texts. This warrants that the benchmark database covers
the full range of XML instance incarnations, from marked-up data structures
to traditional prose.
The arrows in Figure 18.1 are mainly implemented as IDREFs that connect elements with IDs. Care has been taken that the references feature
diverse distributions, derived from uniformly, normally and exponentially dis-
tributed random variables. Also note that all references are typed, i.e., all
instances of an XML element point to the same type of XML element; for
example, references that model interests always refer to categories although
this constraint does not materialize in the DTD that accompanies XMark.
The XML Standard [45] defines constructs that are useful for producing
flexible markup but do not justify the definition of queries to challenge them
directly. Therefore, we only made use of a restricted set of XML features in
the data generator which we consider performance critical in the context of
XML processing in databases. We do not generate documents with Entities
or Notations. Neither do we distinguish between Parsed Character Data and
Character Data assuming that both are string types from the viewpoint of the
storage engine. Furthermore, we do not include namespaces in the queries. We
also restrict ourselves to the seven bit ASCII character set. A DTD and schema
information are provided to allow for more efficient mappings. However, we
stress that this is additional information that may be exploited.
to struggle with rather complex aggregations to select the bidder element with
index 1.
Q3: Return the first and current increases of all open auctions whose current
increase is at least twice as high as the initial increase.
This is a more complex application of array lookups. In the case of a
relational DBMS, the query can take advantage of set-valued aggregates on
the index attribute to accelerate the execution. Queries Q2 and Q3 are akin
to aggregations in the TPCD [141] benchmark.
Casting Strings are the generic data type in XML documents. Queries that
interpret strings will often need to cast strings to another data type that car-
ries more semantics. This query challenges the DBMS in terms of the casting
primitives it provides. Especially, if there is no additional schema information
or just a DTD at hand, casts are likely to occur frequently. Although other
queries include casts, too, this query is meant to challenge casting in isolation.
Q5: How many sold items cost more than 40?
Regular Path Expressions Regular path expressions are a fundamental build-
ing block of virtually every query language for XML or semi-structured data.
These queries investigate how well the query processor can optimize path
expressions and prune traversals of irrelevant parts of the tree.
Q6: How many items are listed on all continents?
A good evaluation engine or path encoding scheme should help realize
that there is no need to traverse the complete document tree to evaluate such
expressions.
Q7: How many pieces of prose are in our database?
Also note that COUNT aggregations do not require a complete traversal of
the document tree. Just the cardinality of the respective parts is queried.
Chasing References References are an integral part of XML as they allow
richer relationships than just hierarchical element structures. These queries
define horizontal traversals with increasing complexity. A good query opti-
mizer should take advantage of the cardinalities of the operands to be joined.
Q8: List the names of persons and the number of items they bought. (joins
person, closed auction)
Q9: List the names of persons and the names of the items they bought in
Europe. (joins person, closed auction, item)
In past database benchmarks, there have been two main routes to designing
a database. On the one hand, designers may lean towards databases that
exhibit properties close to what is found in real-world applications. This has
the advantage that queries feel natural and that it is hard to question the
usefulness of the scenario. On the other hand, it is often desirable to have
another property which often is hard to combine with naturalness, namely
predictable query behavior. If designers pursue predictability, they often go
for very regular designs so that they can exactly and reliably predict what
queries return. These designs are frequently based on mathematical models
which allow precise predictions – at times, however, with the trade-off that the
resulting databases ‘feel’ less natural.
It is hard to position XML between the two extremes. For one, XML is not
a pure machine format and therefore not exclusively consumed and produced
by applications but also absorbed by humans – at least occasionally. Therefore,
not only the semantics but also the documents themselves should still make
sense to humans while it is primarily machines that produce and consume
them. In XMark, we thus tried to reconcile the two competing goals as much
as possible but, in case of conflicts, our policy was to favor predictability of
queries and performance in the generation process.
We should mention that designers of other XML benchmarks had different
policies in mind. For example, the Michigan Benchmark [254] features a very
structured approach to database generation and aims to maximize predictabil-
ity on all levels and queries, much in the spirit of the Wisconsin Benchmark
described in [141]. A hybrid approach is taken by XBench [327] who classify
their documents according to a requirements matrix: their axes are Single-
Document vs. Multi-Document Databases and Text-Centric vs. Data-Centric
Databases, respectively. Other XML benchmarks like X007 [47] and XMach-
1 [36] are also based on certain considerations with respect to document de-
sign.
While most people agree that performance is an important goal in query
execution, it is equally important in data generation especially when it comes
to large databases, which bring about significant generation overhead. In
XMark, we pursued performance in that it was a design goal that the data
generator should be able to output several megabytes of XML text per sec-
ond, which we considered a necessary requirement if it is to be suitable for
deployment in large-scale scenarios. After we finished a first prototype of the
generator, we found out that a major performance bottleneck was random
number generation. At first, we had chosen a high quality random number
generator which turned out to be inadequate. In the sequel, we had to deal
with the trade-off between the quality of random variables in general and their
correlations in particular at one end of the scale and generation time at the
other end. What turned out to be a problem was that, when weak correlations were to be generated, the quality of the random number generator may become the limiting factor.
Since XML was still at an early stage in its development, the actual imple-
mentation of the benchmark on a number of systems was a non-trivial task.
The architectures and capabilities of query processors very much varied from
system to system. Some systems could only bulkload small documents at a
time; hence, we sometimes had to use the split feature of the data generator
and feed the benchmark document in small pieces; at other times we were
given the opportunity to specify (parts of) the XML-to-Database mapping
by hand. The benchmark queries (see [273] for a complete list) often had to
be translated to standard (SQL and XQuery) or proprietary query languages
and possibly annotated with execution plan hints. All in all, there were many
opportunities for hand-optimization which sometimes had to be taken advan-
tage of to make the benchmark work on a system. However, we think that the
technology has matured since we did the experiments and expect it to become
more robust, so that a detailed report of these experiences would probably already be
outdated. We therefore just mention some findings and refer to [272]
and [273] which contain more detailed material.
The benchmark has been a group-design activity of academic and industry
researchers and is known to be used with success to evaluate progress in
both commercial and research settings. The evaluation in this section here is
meant to present the highlights we encountered when running the benchmark.
Concerning the scaling factor, all mass storage systems were able to process
the queries at scaling factor 1.0. Note that it took the XML parser expat [108]
4.9 seconds (user time on the above Linux machine including system time
and disk I/O) to scan the benchmark document (this time only includes the
tokenization of the input stream and normalizations and substitutions as re-
quired by the XML standard and no user-specified semantic actions). The
bulkload times are summarized in Table 18.1: they range from 50 seconds
to 781 seconds. They are completed transactions and include the conversion
effort needed to map the XML document to a database instance. Note that
System C requires a DTD to derive a database schema; the time for this
derivation is not included in the figure, but is negligible anyway. The result-
ing database sizes are also listed in Table 18.1; we remark that some systems
which are not included in this comparison require far larger database sizes.
We now turn our attention to the running times and statistics as displayed
in Table 18.2 and present some insights. Since we do not have the space to
discuss all timings and experiments in detail, we only present a selection. In
most physical XML mappings found in the literature, Query Q1 [269] consists
of a table scan or index lookup and a small number of additional table look-
ups. It is mainly supposed to establish a performance baseline: At scaling
factor 1.0, the scan goes over 10000 tuples and is followed by two table look-
ups if a mapping like [271] is used.
Table 18.3. Compilation and execution characteristics of Q1 and Q2
Query  System  Compilation CPU  Compilation total  Execution CPU  Execution total
Q1     A       16%              25%                31%            75%
Q1     B       13%              51%                30%            49%
Q1     C        0%              29%                20%            71%
Q2     A        9%              13%                41%            87%
Q2     B       12%              20%                65%            80%
Q2     C        3%              16%                77%            84%
Queries Q2 and Q3 are the first ones to provide surprises. It turns out
that the parts of the query plans that compute the indices are quite com-
plex TPC/H-like aggregations: they require the computation of set-valued
attributes to determine the bidder element with the least index with respect
to the open auction ancestor. Therefore the complexity of the query plan is
higher than the rather innocent looking XQuery representation [269] of the
queries might suggest. Consequently, running times are quite high. Although
System A was able to find an execution plan for Q3 which was as good as
that of the other systems, it spent too much of its time on optimization. Ta-
ble 18.3 displays some interesting characteristics of Q1 and Q2 that can be
traced back to the physical mappings the systems use. System A basically
stores all XML data on one big heap, i.e., only a single relation. System B on
the other hand uses a highly fragmenting mapping. Consequently, System A
has to access fewer metadata to compile a query than System B, thus spend-
ing only half as much time on query compilation (including optimization) as
System B. However, this comes at a cost. Because the data mapping deployed
in System A has less explicit semantics, the actual cost of accessing the real
data is higher than in System B (75% vs 49%). System C as mentioned needs
a DTD to derive a storage schema; this additional information helps to get
favorable performance. Still in Table 18.3, we also find the detailed execution
times for Q2. They show that mappings that structure the data according to
their semantics can achieve significantly higher CPU usage (compare 77% of
System C and 65% of System B vs System A’s 41%). We remark that System
C also uses a data mapping in the spirit of [279] that results in comparatively
simple and efficient execution plans and thus outperforms all other systems
for Q2 and Q3.
Query Q5 tries to quantify the cost of casting or type-coercion opera-
tions such as those necessary for the comparisons in Q3. For all mass-storage
systems, the cost of this coercion is rather low with respect to the relative
complexity of Q3’s query execution plan and given the execution times of Q5.
In any case, Q5 does not exhibit great differences in execution times. We note
that all character data in the original document, including references, were
stored as strings and cast at runtime to richer data types whenever necessary
as in Queries 3, 5, 11, 12, 18, 20. We did not apply any domain-specific know-
ledge; neither did the systems use schema information nor pre-calculation or
caching of casting results.
Regular path expressions are the challenge presented by queries Q6 and
Q7. System D keeps a detailed structural summary of the database and can
exploit it to optimize traversal-intensive queries; this actually makes Q6 and
Q7 surprisingly fast. However, on systems without access to structural sum-
maries, which effectively play the role of an index or schema, these queries of-
ten are significantly more expensive to execute. The problem that Q7 actually
looks for non-existing paths is efficiently solved by exploiting the structural
summary in the case of System D. For some systems, the cost of accessing
schema information was very high and dominated query performance.
18.6 Conclusions
In this chapter we outlined the design of XMark, a benchmark to assess the
performance of query processors for XML documents. Based on an internet
auction site as an application scenario, XMark provides a suite of queries that target performance-critical aspects of XML query processing.
19
The INEX Evaluation Initiative
19.1 Introduction
The widespread use of the extensible Markup Language (XML) on the Web
and in Digital Libraries brought about an explosion in the development of
XML tools, including systems to store and access XML content. As the num-
ber of these systems increases, so does the need to assess their benefit to users.
The benefit to a given user depends largely on which aspects of the user’s in-
teraction with the system are being considered. These aspects, among others,
include response time, required user effort, usability, and the system’s ability
to present the user with the desired information. Users then base their decision
whether they are more satisfied with one system or another on a prioritised
combination of these factors.
The Initiative for the Evaluation of XML Retrieval (INEX) was set up at
the beginning of 2002 with the aim to establish an infrastructure and provide
means, in the form of a large XML test collection and appropriate scoring
methods, for the evaluation of content-oriented retrieval of XML documents.
As a result of a collaborative effort, with contributions from 36 participating
groups, INEX created an XML test collection consisting of publications of the
IEEE Computer Society, 60 topics and graded relevance assessments. Using
the constructed test collection and the developed set of evaluation metrics
and procedures, the retrieval effectiveness of the participating organisations’
XML retrieval approaches was evaluated and their results compared [126].
In this chapter we provide an overview of the INEX evaluation initiative.
Before we talk about INEX, we first take a brief look, in Section 19.2, at the
evaluation practices of information retrieval (IR) as these formed the basis of
our work in INEX. In our discussion of INEX we follow the requirements that
evaluations in IR are founded upon [264]. These include the specification of the
evaluation objective (e.g. what to evaluate) in Section 19.3, and the selection
of suitable evaluation criteria in Section 19.4. This is followed by an overview
of the methodology for constructing the test collection in Section 19.5. We
describe the evaluation metrics in Section 19.6. Finally we close with thoughts
for future work in Section 19.8.
Topical relevance, which is a criterion that reflects the extent to which the information contained in a document component satisfies the user's information need, e.g. measures the exhaustivity of the
topic within a component.
Component coverage, which is a criterion that considers the structural as-
pects and reflects the extent to which a document component is focused
on the information need, e.g. measures the specificity of a component with
regards to the topic.
The basic threshold for relevance was defined as a piece of text that men-
tions the topic of request [153]. A consequence of this definition is that con-
tainer components of relevant document components in a nested XML struc-
ture, albeit too large components, are also regarded as relevant. This clearly
shows that relevance as a single criterion is not sufficient for the evaluation
of content-oriented XML retrieval. Hence, the second dimension, component
coverage, is used to provide a measure with respect to the size of a component
by reflecting the ratio of relevant and irrelevant content within a document
component. In actual fact, both dimensions are related to component size. For
example, the more exhaustively a topic is discussed the more likely that the
component is longer in length, and the more focused a component the more
likely that it is smaller in size.
When considering the use of the above two criteria for the evaluation of
XML retrieval systems, we must also decide about the scales of measurements
to be used. For relevance, binary or multiple degree scales are known. In INEX,
we chose a multiple degree relevance scale as it allows the explicit representa-
tion of how exhaustively a topic is discussed within a component with respect
to its sub-components. For example, a section containing two paragraphs may
be regarded more relevant than either of its paragraphs by themselves. Binary
values of relevance cannot reflect this difference. We adopted the following
four-point relevance scale [185]:
Irrelevant (0): The document component does not contain any information
about the topic of request.
Marginally relevant (1): The document component mentions the topic of re-
quest, but only in passing.
Fairly relevant (2): The document component contains more information than
the topic description, but this information is not exhaustive. In the case
of multi-faceted topics, only some of the sub-themes or viewpoints are
discussed.
Highly relevant (3): The document component discusses the topic of request
exhaustively. In the case of multi-faceted topics, all or most sub-themes
or viewpoints are discussed.
For component coverage we used the following four-category nominal scale:
No coverage (N): The topic or an aspect of the topic is not a theme of the
document component.
Too large (L): The topic or an aspect of the topic is only a minor theme of
the document component.
Too small (S): The topic or an aspect of the topic is the main or only theme
of the document component, but the component is too small to act as a
meaningful unit of information when retrieved by itself.
Exact coverage (E): The topic or an aspect of the topic is the main or only
theme of the document component, and the component acts as a mean-
ingful unit of information when retrieved by itself.
According to the above definition of coverage it becomes possible to re-
ward XML search engines that are able to retrieve the appropriate (“exact”)
sized document components. For example, a retrieval system that is able to
locate the only relevant section in an encyclopaedia is likely to trigger higher
user satisfaction than one that returns a too large component, such as the
whole encyclopaedia. On the other hand, the above definition also allows the
classification of components as too small if they do not bear self-explaining in-
formation for the user and thus cannot serve as informative units [70]. Take as
an example, a small text fragment, such as the sentence “These results clearly
show the advantages of content-oriented XML retrieval systems.”, which, al-
though part of a relevant section in a scientific report, is of no use to a user
when retrieved without its context.
Only the combination of these two criteria allows the evaluation of systems
that are able to retrieve components with high relevance and exact coverage,
e.g. components that are exhaustive to and highly focused on the topic of
request and hence represent the most appropriate components to be returned
to the user.
19.5.1 Documents
The document collection consists of the fulltexts of 12 107 articles from
12 magazines and 6 transactions of the IEEE Computer Society’s publica-
tions, covering the period of 1995–2002, and totalling 494 megabytes in size.
Although the collection is relatively small compared with TREC, it has a
suitably complex XML structure (192 different content models in DTD) and
contains scientific articles of varying length. On average, an article contains
1 532 XML nodes, where the average depth of a node is 6.9.
All documents of the collection are tagged using XML conforming to one
common DTD. The overall structure of a typical article, shown in Figure 19.1,
consists of a front matter (<fm>), a body (<bdy>), and a back matter (<bm>).
The front matter contains the article’s metadata, such as title, author, publi-
cation information, and abstract. Following it is the article’s body, which con-
tains the content. The body is structured into sections (<sec>), sub-sections
(<ss1>), and sub-sub-sections (<ss2>). These logical units start with a title,
followed by a number of paragraphs. In addition, the content has markup for
references (citations, tables, figures), item lists, and layout (such as empha-
sised and bold faced text), etc. The back matter contains a bibliography and
further information about the article’s authors.
<article>
 <fm>
  ...
  <ti>IEEE Transactions on ...</ti>
  <atl>Construction of ...</atl>
  <au>
   <fnm>John</fnm>
   <snm>Smith</snm>
   <aff>University of ...</aff>
  </au>
  <au>...</au>
  ...
 </fm>
 <bdy>
  <sec>
   <st>Introduction</st>
   <p>...</p>
   ...
  </sec>
  <sec>
   <st>...</st>
   ...
   <ss1>...</ss1>
   <ss1>...</ss1>
   ...
  </sec>
  ...
 </bdy>
 <bm>
  <bib>
   <bb>
    <au>...</au>
    <ti>...</ti>
    ...
   </bb>
   ...
  </bib>
 </bm>
</article>
Fig. 19.1. Sketch of the overall structure of a typical article
19.5.2 Topics
The topics of the test collection were created by the participating groups. We
asked each organisation to create sets of content-only (CO), and content-and-
structure (CAS) candidate topics that were representative of what real users
might ask and the type of the service that operational systems may provide.
Participants were provided with guidelines to assist them in this four-stage
task [126].
During the first stage participants created an initial description of their
information need without regard to system capabilities or collection peculiar-
ities. During the collection exploration stage, using their own XML retrieval
that more than half requested facts to be returned to the user. Furthermore,
the majority of the CAS topics contained either only fact, or a mixture of
fact and content containment conditions, e.g. specifying the publication year
and/or the author, or specifying the author and the subject of some document
components.
Table 19.1. Statistics on CAS and CO topics in the INEX test collection
CAS CO
no of topics 30 30
avg no of <cw>/topic title 2.06 1.0
avg no of unique words/cw 2.5 4.3
avg no of unique words/topic title 5.1 4.3
avg no of <ce>/topic title 1.63 –
avg no of XML elements/<ce> 1.53 –
avg no of XML elements/topic title 2.5 –
no of topics with <ce> representing a fact 12 –
no of topics with <ce> representing content 6 –
no of topics with mixed fact and content <ce> 12 –
no of topics with <te> components 25 0
avg no of XML elements/<te> 1.68 –
no of topics with <te> representing a fact 13 –
no of topics with <te> representing content 12 –
no of topics with <te> representing articles 6 –
avg no of words in topic description 18.8 16.1
avg no of words in keywords component 7.06 8.7
19.5.3 Assessments
The final set of topics were distributed back to the participating groups, who
then used these topics to search the document collection. The actual queries
put to the search engines had to be automatically generated from any part
of the topics except the narrative. As a result of the retrieval sessions, the
participating organisations produced ranked lists of XML elements in answer
to each query. The top 100 result elements from all sixty sets of ranked lists
(one per topic) constituted the results of one retrieval run. Each group was
allowed to submit up to three runs. A result element in a retrieval run was
identified using a combination of file names and XPaths. The file name (and
file path) uniquely identified an article within the document collection, and
XPath allowed the location of a given node within the XML tree of the article.
Associated with a result element were its retrieval rank and/or its relevance
status value [126].
A total of 51 runs were submitted from 25 groups. For each topic, the re-
sults from the submissions were merged to form the pool for assessment [309].
The assessment pools contained between one and two thousand document com-
ponents from 300–900 articles, depending on the topic. The result elements
varied from author, title and paragraph elements through sub-section and
section elements to complete articles and even journals. The assessment pools
were then assigned to groups for assessment; either to the original topic au-
thors or when this was not possible, on a voluntary basis, to groups with
expertise in the topic’s subject area.
The assessments were done along the two dimensions of topical relevance
and component coverage. Assessments were recorded using an on-line assess-
ment system, which allowed users to view the pooled result set of a given topic
listed in alphabetical order, to browse the document collection and view ar-
ticles and result elements both in XML (i.e. showing the tags) and document
view (i.e. formatted for ease of reading). Other features included facilities such
as keyword highlighting, and consistency checking of the assessments [126].
Table 19.2 shows a summary of the collected assessments for CAS and CO
topics2 . The table shows a relatively large proportion of sub-components with
exact coverage compared with article elements, which indicates that for most
topics sub-components of articles were considered as the preferred units to be
returned to the user.
2 The figures are based on the assessments of 54 of the 60 topics; for the remaining six topics no assessments are available.
19.6 Evaluation Metrics
Due to the nature of XML retrieval, it was necessary to develop new evalua-
tion procedures. These were based on the traditional recall/precision and, in
particular, the metrics described in Section 19.2. However, before we could
apply these measures, we first had to derive a single relevance value based
on the two dimensions of topical relevance and component coverage. For this
purpose we defined a number of quantisation functions fquant : Relevance × Coverage → [0, 1].
Here, the set of relevance assessments is Relevance := {0, 1, 2, 3}, and the set
of coverage assessments is Coverage := {N, S, L, E}.
The rationale behind such a quantisation function is that the overall relevance
of a document component can only be determined using the combination of
relevance and coverage assessments. Quantisation functions can be selected ac-
cording to the desired user standpoint. For INEX 2002, two different functions
have been selected: fstrict and fgeneralised . The quantisation function fstrict is
used to evaluate whether a given retrieval method is capable of retrieving
highly relevant and highly focused document components:
fstrict(rel, cov) := 1 if rel = 3 and cov = E, and 0 otherwise.   (19.5)
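As a small illustration, fstrict can be written directly as a function of the two assessment dimensions; only the strict function is reproduced here, since the exact weights of fgeneralised are not given in this excerpt and are therefore omitted.

def f_strict(rel, cov):
    """Quantisation (19.5): full credit only for highly relevant, exact-coverage components."""
    return 1 if rel == 3 and cov == "E" else 0

# e.g. a component assessed as (3, 'E') quantises to 1, while (2, 'E') or (3, 'L')
# both quantise to 0 under the strict function.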
19.8 Conclusions
As a collaborative effort of research groups from 36 organisations worldwide,
the INEX evaluation initiative in 2002 created an infrastructure for evaluating
the effectiveness of content-oriented retrieval of XML documents. A document
collection with real life XML documents from the IEEE Computer Society’s
digital library has been set up, 60 topics created and assessments provided
for 54 of these topics. Based on the notion of recall and precision, metrics for
evaluating the effectiveness of XML retrieval have also been developed. These
were applied to evaluate the submitted retrieval runs of the participating
groups.
3 Another 11 organisations (not listed here) participated actively in the relevance assessment phase.
In the second round of INEX, commencing from April 2003, we aim to ex-
tend the test collection and develop alternative evaluation measures and met-
rics addressing the issue of overlapping result elements. We are also working
on a new topic format, which will allow the representation of vague structural
conditions. In the long term future of INEX we aim to extend the range of
tasks under investigation to include, in particular, interactive retrieval, which
will be based on new evaluation criteria reflecting typical user interaction with
structured documents.
19
The INEX Evaluation Initiative
19.1 Introduction
The widespread use of the extensible Markup Language (XML) on the Web
and in Digital Libraries brought about an explosion in the development of
XML tools, including systems to store and access XML content. As the num-
ber of these systems increases, so is the need to assess their benefit to users.
The benefit to a given user depends largely on which aspects of the user’s in-
teraction with the system are being considered. These aspects, among others,
include response time, required user effort, usability, and the system’s ability
to present the user with the desired information. Users then base their decision
whether they are more satisfied with one system or another on a prioritised
combination of these factors.
The Initiative for the Evaluation of XML Retrieval (INEX) was set up at
the beginning of 2002 with the aim to establish an infrastructure and provide
means, in the form of a large XML test collection and appropriate scoring
methods, for the evaluation of content-oriented retrieval of XML documents.
As a result of a collaborative effort, with contributions from 36 participating
groups, INEX created an XML test collection consisting of publications of the
IEEE Computer Society, 60 topics and graded relevance assessments. Using
the constructed test collection and the developed set of evaluation metrics
and procedures, the retrieval effectiveness of the participating organisations’
XML retrieval approaches were evaluated and their results compared [126].
In this chapter we provide an overview of the INEX evaluation initiative.
Before we talk about INEX, we first take a brief look, in Section 19.2, at the
evaluation practices of information retrieval (IR) as these formed the basis of
our work in INEX. In our discussion of INEX we follow the requirements that
evaluations in IR are founded upon [264]. These include the specification of the
evaluation objective (e.g. what to evaluate) in Section 19.3, and the selection
of suitable evaluation criteria in Section 19.4. This is followed by an overview
of the methodology for constructing the test collection in Section 19.5. We
H. Blanken et al. (Eds.): Intelligent Search on XML Data, LNCS 2818, pp. 279–293, 2003.
Springer-Verlag Berlin Heidelberg 2003
280 G. Kazai et al.
describe the evaluation metrics in Section 19.6. Finally we close with thoughts
for future work in Section 19.8.
satisfies the user’s information need, e.g. measures the exhaustivity of the
topic within a component.
Component coverage, which is a criterion that considers the structural as-
pects and reflects the extent to which a document component is focused
on the information need, e.g. measures the specificity of a component with
regards to the topic.
The basic threshold for relevance was defined as a piece of text that men-
tions the topic of request [153]. A consequence of this definition is that con-
tainer components of relevant document components in a nested XML struc-
ture, albeit too large components, are also regarded as relevant. This clearly
shows that relevance as a single criterion is not sufficient for the evaluation
of content-oriented XML retrieval. Hence, the second dimension, component
coverage, is used to provide a measure with respect to the size of a component
by reflecting the ratio of relevant and irrelevant content within a document
component. In actual fact, both dimensions are related to component size. For
example, the more exhaustively a topic is discussed the more likely that the
component is longer in length, and the more focused a component the more
likely that it is smaller in size.
When considering the use of the above two criteria for the evaluation of
XML retrieval systems, we must also decide about the scales of measurements
to be used. For relevance, binary or multiple degree scales are known. In INEX,
we chose a multiple degree relevance scale as it allows the explicit representa-
tion of how exhaustively a topic is discussed within a component with respect
to its sub-components. For example, a section containing two paragraphs may
be regarded more relevant than either of its paragraphs by themselves. Binary
values of relevance cannot reflect this difference. We adopted the following
four-point relevance scale [185]:
Irrelevant (0): The document component does not contain any information
about the topic of request.
Marginally relevant (1): The document component mentions the topic of re-
quest, but only in passing.
Fairly relevant (2): The document component contains more information than
the topic description, but this information is not exhaustive. In the case
of multi-faceted topics, only some of the sub-themes or viewpoints are
discussed.
Highly relevant (3): The document component discusses the topic of request
exhaustively. In the case of multi-faceted topics, all or most sub-themes
or viewpoints are discussed.
For component coverage we used the following four-category nominal scale:
No coverage (N): The topic or an aspect of the topic is not a theme of the
document component.
Too large (L): The topic or an aspect of the topic is only a minor theme of
the document component.
Too small (S): The topic or an aspect of the topic is the main or only theme
of the document component, but the component is too small to act as a
meaningful unit of information when retrieved by itself.
Exact coverage (E): The topic or an aspect of the topic is the main or only
theme of the document component, and the component acts as a mean-
ingful unit of information when retrieved by itself.
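To make these two assessment dimensions concrete, the following is a minimal Python sketch of how the four-point relevance scale and the four-category coverage scale could be represented in an assessment tool; the class and value names are illustrative and are not taken from the INEX assessment software.

  from enum import Enum, IntEnum

  class Relevance(IntEnum):
      """Four-point topical relevance scale (degree of exhaustivity)."""
      IRRELEVANT = 0           # no information about the topic of request
      MARGINALLY_RELEVANT = 1  # mentions the topic only in passing
      FAIRLY_RELEVANT = 2      # more than the topic description, not exhaustive
      HIGHLY_RELEVANT = 3      # discusses the topic exhaustively

  class Coverage(Enum):
      """Four-category component coverage scale (degree of specificity)."""
      NO_COVERAGE = "N"  # topic is not a theme of the component
      TOO_LARGE = "L"    # topic is only a minor theme of the component
      TOO_SMALL = "S"    # main theme, but too small to be a meaningful unit
      EXACT = "E"        # main theme and a meaningful unit on its own

  # An assessment pairs the two dimensions for one document component.
  assessment = (Relevance.HIGHLY_RELEVANT, Coverage.EXACT)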
The above definition of coverage makes it possible to reward XML search engines that retrieve document components of the appropriate (“exact”) size. For example, a retrieval system that is able to locate the only relevant section in an encyclopaedia is likely to trigger higher user satisfaction than one that returns a too large component, such as the whole encyclopaedia. On the other hand, the definition also allows components to be classified as too small if they do not carry self-explanatory information for the user and thus cannot serve as informative units [70]. Take as an example a small text fragment, such as the sentence “These results clearly show the advantages of content-oriented XML retrieval systems.”, which, although part of a relevant section in a scientific report, is of no use to a user when retrieved without its context.
Only the combination of these two criteria allows the evaluation of systems that are able to retrieve components with high relevance and exact coverage, i.e. components that cover the topic of request exhaustively, are highly focused on it, and hence represent the most appropriate units to be returned to the user.
19.5.1 Documents
The document collection consists of the fulltexts of 12 107 articles from
12 magazines and 6 transactions of the IEEE Computer Society’s publica-
tions, covering the period of 1995–2002, and totalling 494 megabytes in size.
Although the collection is relatively small compared with TREC, it has a suitably complex XML structure (192 different content models in its DTD) and contains scientific articles of varying length. On average, an article contains 1 532 XML nodes, and the average depth of a node is 6.9.
All documents of the collection are tagged using XML conforming to one
common DTD. The overall structure of a typical article, shown in Figure 19.1,
consists of a front matter (<fm>), a body (<bdy>), and a back matter (<bm>).
The front matter contains the article’s metadata, such as title, author, publi-
cation information, and abstract. Following it is the article’s body, which con-
tains the content. The body is structured into sections (<sec>), sub-sections
(<ss1>), and sub-sub-sections (<ss2>). These logical units start with a title,
followed by a number of paragraphs. In addition, the content is marked up for references (citations, tables, figures), item lists, and layout (such as emphasised and bold-faced text). The back matter contains a bibliography and
further information about the article’s authors.
<article>
  <fm>
    ...
    <ti>IEEE Transactions on ...</ti>
    <atl>Construction of ...</atl>
    <au>
      <fnm>John</fnm>
      <snm>Smith</snm>
      <aff>University of ...</aff>
    </au>
    <au>...</au>
    ...
  </fm>
  <bdy>
    <sec>
      <st>Introduction</st>
      <p>...</p>
      ...
    </sec>
    <sec>
      <st>...</st>
      ...
      <ss1>...</ss1>
      <ss1>...</ss1>
      ...
    </sec>
    ...
  </bdy>
  <bm>
    <bib>
      <bb>
        <au>...</au>
        <ti>...</ti>
        ...
      </bb>
      ...
    </bib>
  </bm>
</article>

Fig. 19.1. The overall structure of a typical article
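To illustrate how this structure can be processed programmatically, the following is a minimal sketch using Python's xml.etree.ElementTree; the tiny inline article is an invented stand-in for a real collection file, which would instead be loaded with ET.parse.

  import xml.etree.ElementTree as ET

  # Tiny stand-in for an article file, modelled on the structure in Figure 19.1;
  # a real article from the collection would be parsed with ET.parse(path).
  article_xml = """
  <article>
    <fm><atl>Construction of ...</atl></fm>
    <bdy>
      <sec><st>Introduction</st><p>...</p></sec>
      <sec><st>Results</st><ss1><p>...</p><p>...</p></ss1></sec>
    </bdy>
    <bm><bib><bb><au>...</au></bb></bib></bm>
  </article>
  """
  article = ET.fromstring(article_xml)

  # Article title from the front matter, if present.
  atl = article.find("./fm/atl")
  print("Title:", atl.text if atl is not None else "(none)")

  # Walk the body: sections and their paragraphs.
  for sec in article.findall("./bdy/sec"):
      st = sec.find("st")
      print("Section:", st.text if st is not None else "(untitled)",
            "-", len(sec.findall(".//p")), "paragraph(s)")

  # Total number of element nodes (cf. the reported average of 1 532 per article).
  print("Element nodes:", sum(1 for _ in article.iter()))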
19.5.2 Topics
The topics of the test collection were created by the participating groups. We
asked each organisation to create sets of content-only (CO) and content-and-structure (CAS) candidate topics that were representative of what real users might ask and of the type of service that operational systems might provide.
Participants were provided with guidelines to assist them in this four-stage
task [126].
During the first stage participants created an initial description of their
information need without regard to system capabilities or collection peculiar-
ities. During the collection exploration stage, using their own XML retrieval
that more than half requested facts to be returned to the user. Furthermore,
the majority of the CAS topics contained either fact conditions only, or a mixture of fact and content containment conditions, e.g. specifying the publication year
and/or the author, or specifying the author and the subject of some document
components.
Table 19.1. Statistics on CAS and CO topics in the INEX test collection

                                                      CAS    CO
  no of topics                                         30    30
  avg no of <cw>/topic title                         2.06   1.0
  avg no of unique words/cw                           2.5   4.3
  avg no of unique words/topic title                  5.1   4.3
  avg no of <ce>/topic title                         1.63     –
  avg no of XML elements/<ce>                        1.53     –
  avg no of XML elements/topic title                  2.5     –
  no of topics with <ce> representing a fact           12     –
  no of topics with <ce> representing content           6     –
  no of topics with mixed fact and content <ce>        12     –
  no of topics with <te> components                    25     0
  avg no of XML elements/<te>                         1.68     –
  no of topics with <te> representing a fact           13     –
  no of topics with <te> representing content          12     –
  no of topics with <te> representing articles          6     –
  avg no of words in topic description                18.8  16.1
  avg no of words in keywords component               7.06   8.7
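To indicate what the counts in Table 19.1 refer to, the sketch below parses a hypothetical CAS topic title containing <cw>, <ce> and <te> elements and computes some of the per-title statistics; the topic content and the <title> wrapper are invented for illustration and do not reproduce the actual INEX topic DTD.

  import xml.etree.ElementTree as ET

  # Invented example; the real INEX topic DTD is not reproduced here.
  topic_title = """
  <title>
    <te>article/bdy/sec</te>
    <cw>information retrieval evaluation</cw><ce>article/fm/abs</ce>
    <cw>XML</cw><ce>article/fm/kwd</ce>
  </title>
  """

  title = ET.fromstring(topic_title)
  cws = title.findall("cw")
  ces = title.findall("ce")
  te = title.find("te")

  print("content words per title:", len(cws))            # cf. avg 2.06 for CAS
  print("unique words per cw:",
        sum(len(set(cw.text.split())) for cw in cws) / len(cws))
  print("context elements per title:", len(ces))          # cf. avg 1.63 for CAS
  print("target element:", te.text if te is not None else "(none)")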
19.5.3 Assessments
The final set of topics was distributed back to the participating groups, who
then used these topics to search the document collection. The actual queries
put to the search engines had to be automatically generated from any part
of the topics except the narrative. As a result of the retrieval sessions, the
participating organisations produced ranked lists of XML elements in answer
to each query. The top 100 result elements from each of the sixty ranked lists (one per topic) constituted the results of one retrieval run. Each group was allowed to submit up to three runs. A result element in a retrieval run was identified by a combination of a file name and an XPath: the file name (and file path) uniquely identified an article within the document collection, and the XPath located a given node within the XML tree of that article.
Associated with a result element were its retrieval rank and/or its relevance
status value [126].
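The following sketch illustrates how such a file-name/XPath pair could be resolved against an article, assuming a simple tab-separated run format invented for this example; only the positional XPath steps shown here are handled by Python's ElementTree, and the file name is a placeholder.

  import xml.etree.ElementTree as ET

  # Toy article standing in for a collection file; real runs referenced actual
  # files such as "an/2002/a1005.xml" (the name here is a placeholder).
  article_xml = "<article><bdy><sec/><sec><p>relevant text</p></sec></bdy></article>"
  root = ET.fromstring(article_xml)

  # One line of a hypothetical run: rank, file name, XPath of the result element.
  run_line = "1\tan/2002/a1005.xml\t/article[1]/bdy[1]/sec[2]"
  rank, file_name, xpath = run_line.split("\t")

  # Drop the leading "/article[1]" step, since root already is <article>, and
  # resolve the remaining positional steps (a subset ElementTree understands).
  relative_path = "./" + "/".join(xpath.split("/")[2:])   # -> "./bdy[1]/sec[2]"
  element = root.find(relative_path)
  print(f"rank {rank}: <{element.tag}> from {file_name}" if element is not None
        else f"rank {rank}: {xpath} not found")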
A total of 51 runs were submitted from 25 groups. For each topic, the re-
sults from the submissions were merged to form the pool for assessment [309].
The assessment pools contained between one and two thousand document com-
ponents from 300–900 articles, depending on the topic. The result elements
varied from author, title and paragraph elements through sub-section and
section elements to complete articles and even journals. The assessment pools
were then assigned to groups for assessment: either to the original topic authors or, when this was not possible, on a voluntary basis to groups with expertise in the topic's subject area.
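A minimal sketch of the pooling step described above, assuming each run is represented as a list of (topic, file, XPath, rank) tuples; the pool depth of 100 follows the text, while the data structures themselves are invented for illustration.

  from collections import defaultdict

  POOL_DEPTH = 100  # top 100 result elements per run and topic

  def build_pools(runs):
      """Merge the top results of all submitted runs into one pool per topic.

      `runs` is an iterable of runs; each run is a list of
      (topic_id, file_name, xpath, rank) tuples.  The pool for a topic is the
      set of distinct document components contributed by any run.
      """
      pools = defaultdict(set)
      for run in runs:
          for topic_id, file_name, xpath, rank in run:
              if rank <= POOL_DEPTH:
                  pools[topic_id].add((file_name, xpath))
      return pools

  # Toy usage with two tiny runs.
  run_a = [(1, "an/2002/a1005.xml", "/article[1]/bdy[1]/sec[2]", 1)]
  run_b = [(1, "an/2002/a1005.xml", "/article[1]/bdy[1]/sec[2]", 1),
           (1, "tk/1999/k0712.xml", "/article[1]", 2)]
  pools = build_pools([run_a, run_b])
  print(len(pools[1]), "components in the pool for topic 1")  # -> 2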
The assessments were done along the two dimensions of topical relevance
and component coverage. Assessments were recorded using an on-line assess-
ment system, which allowed assessors to view the pooled result set of a given topic listed in alphabetical order, to browse the document collection, and to view articles and result elements both in XML view (i.e. showing the tags) and in document view (i.e. formatted for ease of reading). Other features included keyword highlighting and consistency checking of the assessments [126]. Table 19.2 shows a summary of the collected assessments for CAS and CO topics2. The table shows a relatively large proportion of sub-components with
exact coverage compared with article elements, which indicates that for most
topics sub-components of articles were considered as the preferred units to be
returned to the user.
2 The figures are based on the assessments of 54 of the 60 topics; for the remaining six topics no assessments are available.
Due to the nature of XML retrieval, it was necessary to develop new evaluation procedures. These were based on traditional recall/precision measures and, in particular, on the metrics described in Section 19.2. However, before we could apply these measures, we first had to derive a single relevance value from the two dimensions of topical relevance and component coverage. For this purpose we defined a number of quantisation functions

    fquant : Relevance × Coverage → [0, 1],

where the set of relevance assessments is Relevance := {0, 1, 2, 3} and the set of coverage assessments is Coverage := {N, S, L, E}.
The rationale behind such a quantisation function is that the overall relevance
of a document component can only be determined using the combination of
relevance and coverage assessments. Quantisation functions can be selected ac-
cording to the desired user standpoint. For INEX 2002, two different functions were selected: fstrict and fgeneralised. The quantisation function fstrict is
used to evaluate whether a given retrieval method is capable of retrieving
highly relevant and highly focused document components:
    fstrict(rel, cov) := 1 if rel = 3 and cov = E, and 0 otherwise.        (19.5)
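The strict quantisation of Equation 19.5 translates directly into code; the sketch below applies it to an invented assessed ranking to show how the quantised values could feed a standard precision computation.

  def f_strict(rel: int, cov: str) -> int:
      """Strict quantisation (Equation 19.5): rewards components that are
      highly relevant (rel = 3) and have exact coverage (cov = 'E')."""
      return 1 if rel == 3 and cov == "E" else 0

  # The assessments below are invented for illustration; quantising a ranked,
  # assessed result list turns the two assessment dimensions into single values
  # that standard recall/precision machinery can consume.
  assessed_ranking = [(3, "E"), (2, "E"), (3, "L"), (0, "N")]
  quantised = [f_strict(rel, cov) for rel, cov in assessed_ranking]
  print(quantised)  # -> [1, 0, 0, 0]

  # Precision at each rank under the strict quantisation.
  hits = 0
  for i, value in enumerate(quantised, start=1):
      hits += value
      print(f"P@{i} = {hits / i:.2f}")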
19.8 Conclusions
As a collaborative effort of research groups from 36 organisations worldwide,
the INEX evaluation initiative in 2002 created an infrastructure for evaluating
the effectiveness of content-oriented retrieval of XML documents. A document
collection with real-life XML documents from the IEEE Computer Society's digital library has been set up, 60 topics created, and assessments provided for 54 of these topics. Based on the notions of recall and precision, metrics for
evaluating the effectiveness of XML retrieval have also been developed. These
were applied to evaluate the submitted retrieval runs of the participating
groups.
3 Another 11 organisations (not listed here) participated actively in the relevance assessment phase.
In the second round of INEX, commencing in April 2003, we aim to ex-
tend the test collection and develop alternative evaluation measures and met-
rics addressing the issue of overlapping result elements. We are also working
on a new topic format, which will allow the representation of vague structural
conditions. In the longer term, we aim to extend the range of tasks under investigation to include, in particular, interactive retrieval, which
will be based on new evaluation criteria reflecting typical user interaction with
structured documents.