
  

Big Data

Computing

  


Edited by

  

Rajendra Akerkar

Western Norway Research Institute

Sogndal, Norway

CRC Press, Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2014 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Version Date: 20131028
International Standard Book Number-13: 978-1-4665-7838-8 (eBook - PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

No part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

Trademark notice: product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

  

To

All the visionary minds who have helped create a modern data science profession

  


Preface

In the international marketplace, businesses, suppliers, and customers create and consume vast amounts of information. Gartner predicts that enterprise data in all forms will grow up to 650% over the next five years.* According to IDC, the world's volume of data doubles every 18 months. Digital information is doubling every 1.5 years and will exceed 1000 exabytes next year, according to the MIT Centre for Digital Research. In 2011, medical centers held almost 1 billion terabytes of data—almost 2000 billion file cabinets' worth of information. This deluge of data, often referred to as Big Data, obviously creates a challenge to the business community and data scientists.

The term Big Data refers to data sets the size of which is beyond the capabilities of current database technology. It is an emerging field where innovative technology offers alternatives in resolving the inherent problems that appear when working with massive data, offering new ways to reuse and extract value from information.

Businesses and government agencies aggregate data from numerous private and/or public data sources. Private data is information that an organization exclusively stores and that is available only to that organization, such as employee data, customer data, and machine data (e.g., user transactions and customer behavior). Public data is information that is available to the public for a fee or at no charge, such as credit ratings and social media content (e.g., LinkedIn, Facebook, and Twitter). Big Data has now reached every sector in the world economy. It is transforming competitive opportunities in every industry sector including banking, healthcare, insurance, manufacturing, retail, wholesale, transportation, communications, construction, education, and utilities. It also plays key roles in trade operations such as marketing, operations, supply chain, and new business models. It is becoming rather evident that enterprises that fail to use their data efficiently are at a large competitive disadvantage compared with those that can analyze and act on their data. The possibilities of Big Data continue to evolve swiftly, driven by innovation in the underlying technologies, platforms, and analytical capabilities for handling data, as well as the evolution of behavior among its users as increasingly humans live digital lives.

It is interesting to know that Big Data is different from conventional data models (e.g., relational databases and data models, or conventional governance models). Thus, it is triggering organizations' concern as they try to separate information nuggets from the data heap. The conventional models of structured, engineered data do not adequately reveal the realities of Big Data. The key to leveraging Big Data is to realize these differences before expediting its use. The most noteworthy difference is that data are typically governed in a centralized manner, but Big Data is self-governing. Big Data is created either by a rapidly expanding universe of machines or by users of highly varying expertise. As a result, the composition of traditional data will naturally vary considerably from Big Data. The composition of traditional data serves a specific purpose and must be more durable and structured, whereas Big Data will cover many topics, but not all topics will yield useful information for the business, and thus they will be sparse in relevancy and structure.

* http://www.gartner.com/it/content/1258400/1258425/january_6_techtrends_rpaquet.pdf

The technology required for Big Data computing is developing at a satisfactory rate due to market forces and technological evolution. The ever-growing enormous amount of data, along with advanced tools of exploratory data analysis, data mining/machine learning, and data visualization, offers a whole new way of understanding the world.

Another interesting fact about Big Data is that not everything that is considered "Big Data" is in fact Big Data. One needs to explore the scientific aspects in depth, such as analyzing, processing, and storing huge volumes of data. That is the only way of using tools effectively. Data developers/scientists need to know about analytical processes, statistics, and machine learning. They also need to know how to use specific data to program algorithms. The core is the analytical side, but they also need the scientific background and in-depth technical knowledge of the tools they work with in order to gain control of huge volumes of data. There is no one tool that offers this per se.

As a result, the main challenge for Big Data computing is to find a novel solution, keeping in mind the fact that data sizes are always growing. This solution should be applicable for a long period of time. This means that the key condition a solution has to satisfy is scalability. Scalability is the ability of a system to accept increased input volume without impacting the profits; that is, the gains from the input increment should be proportional to the increment itself. For a system to be totally scalable, the size of its input should not be a design parameter. Pushing the system designer to consider all possible deployment sizes to cope with different input sizes leads to a scalable architecture without primary bottlenecks. Yet, apart from scalability, there are other requisites for a Big Data–intensive computing system.
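As a rough illustration of this proportionality condition (a hypothetical sketch, not taken from the book, using invented benchmark figures), one can check whether elapsed time stays roughly flat when resources are grown together with the input:

    # Hypothetical measurements: input volume (records) -> elapsed time (seconds),
    # taken on deployments whose resources grow proportionally with the input.
    measurements = {1_000_000: 50.0, 2_000_000: 52.0, 4_000_000: 55.0, 8_000_000: 61.0}

    def relative_efficiency(runs):
        """Ideal scalability keeps elapsed time flat as volume and resources grow;
        values drifting well below 1.0 reveal a bottleneck breaking proportionality."""
        base_time = runs[min(runs)]
        return {volume: base_time / time for volume, time in sorted(runs.items())}

    for volume, ratio in relative_efficiency(measurements).items():
        print(f"{volume:>9} records: relative efficiency {ratio:.2f}")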

Although Big Data is an emerging field in data science, there are very few books available in the market. This book provides authoritative insights and highlights valuable lessons learnt by the authors through experience.

Some universities in North America and Europe are doing their part to feed the need for analytics skills in this era of Big Data. In recent years, they have introduced master of science degrees in Big Data analytics, data science, and business analytics. Some contributing authors have been involved in developing a course curriculum in their respective institution and country. The number of courses on "Big Data" will increase worldwide, since Big Data is expected to underpin new waves of productivity growth, innovation, and consumer surplus, according to research by MGI and McKinsey's Business Technology Office.

The main features of this book can be summarized as follows:

  1. It describes the contemporary state of the art in a new field of Big Data computing.

  2. It presents the latest developments, services, and main players in this explosive field.

  3. Contributors to the book are prominent researchers from academia and practitioners from industry.

Organization

  This book comprises five sections, each of which covers one aspect of Big Data computing. Section I focuses on what Big Data is, why it is important, and how it can be used. Section II focuses on semantic technologies and Big Data. Section III focuses on Big Data processing—tools, technologies, and methods essential to analyze Big Data efficiently. Section IV deals with business and economic perspectives. Finally, Section V focuses on various stimulating Big Data applications. Below is a brief outline with more details on what each chapter is about.

Section I: Introduction

  Chapter 1 provides an approach to address the problem of “understanding” Big Data in an effective and efficient way. The idea is to make adequately grained and expressive knowledge representations and fact collections that evolve naturally, triggered by new tokens of relevant data coming along. The chapter also presents primary considerations on assessing fitness in an evolving knowledge ecosystem.

Chapter 2 then gives an overview of the main features that can characterize architectures for solving a Big Data problem, depending on the source of data, on the type of processing required, and on the application context in which it should be operated.

  


Chapter 3 discusses Big Data from three different standpoints: the business, the technological, and the social. This chapter lists some relevant initiatives and selected thoughts on Big Data.

Section II: Semantic Technologies and Big Data

  Chapter 4 presents foundations of Big Semantic Data management. The chapter sketches a route from the current data deluge, the concept of Big Data, and the need of machine-processable semantics on the Web. Further, this chapter justifies different management problems arising in Big Semantic Data by characterizing their main stakeholders by role and nature.

  A number of challenges arising in the context of Linked Data in Enterprise Integration are covered in Chapter 5. A key prerequisite for addressing these challenges is the establishment of efficient and effective link discovery and data integration techniques, which scale to large-scale data scenarios found in the enterprise. This chapter also presents the transformation step of Linked Data Integration by two algorithms.

Chapter 6 proposes steps toward the solution of the data access problem that end-users usually face when dealing with Big Data. The chapter discusses the state of the art in ontology-based data access (OBDA) and explains why OBDA is the superior approach to the data access challenge posed by Big Data. It also explains why the field of OBDA is currently not yet sufficiently complete to deal satisfactorily with these problems, and it finally presents thoughts on escalating OBDA to a level where it can be well deployed to Big Data.

  Chapter 7 addresses large-scale semantic interoperability problems of data in the domain of public sector administration and proposes practical solutions to these problems by using semantic technologies in the context of Web services and open data. This chapter also presents a case of the Estonian semantic interoperability framework of state information systems and related data interoperability solutions.

Section III: Big Data Processing

Chapter 8 presents a new way of query processing for Big Data where data exploration becomes a first-class citizen. Data exploration is desirable when new big chunks of data arrive speedily and one needs to react quickly. This chapter focuses on database systems technology.


  Chapter 9 explores the MapReduce model, a programming model used to develop largely parallel applications that process and generate large amounts of data. This chapter also discusses how MapReduce is implemented in Hadoop and provides an overview of its architecture.
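For readers new to the model, the following minimal word-count sketch in plain Python mirrors the roles that user-supplied map and reduce functions play; it is only an illustration and leaves out the distributed storage, shuffling, and fault tolerance that Hadoop itself provides.

    from collections import defaultdict

    def map_phase(document):
        # Emit (word, 1) pairs for every token, as a mapper would per input split.
        for word in document.lower().split():
            yield word, 1

    def shuffle(pairs):
        # Group intermediate values by key; the framework does this between phases.
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return grouped.items()

    def reduce_phase(key, values):
        # Aggregate all counts observed for one word.
        return key, sum(values)

    documents = ["big data computing", "big data analytics for big data"]
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    print(dict(reduce_phase(k, v) for k, v in shuffle(pairs)))
    # {'big': 3, 'data': 3, 'computing': 1, 'analytics': 1, 'for': 1}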

  A particular class of stream-based joins, namely, a join of a single stream with a traditional relational table, is discussed in Chapter 10. Two available stream-based join algorithms are investigated in this chapter.

Section IV: Big Data and Business

  Chapter 11 provides the economic value of Big Data from a macro- and a microeconomic perspective. The chapter illustrates how technology and new skills can nurture opportunities to derive benefits from large, constantly growing, dispersed data sets and how semantic interoperability and new licensing strategies will contribute to the uptake of Big Data as a business enabler and a source of value creation.

Nowadays businesses are enhancing their business intelligence practices to include predictive analytics and data mining. This combines the best of strategic reporting and basic forecasting with advanced operational intelligence and decision-making functions. Chapter 12 discusses how Big Data technologies, advanced analytics, and business intelligence (BI) are interrelated. This chapter also presents various areas of advanced analytic technologies.

Section V: Big Data Applications

The final section of the book covers application topics, starting in Chapter 13 with novel concept-level approaches to opinion mining and sentiment analysis that allow a more efficient passage from (unstructured) textual information to (structured) machine-processable data, in potentially any domain.

Chapter 14 introduces the spChains framework, a modular approach to support mastering of complex event processing (CEP) queries in an abridged, but effective, manner based on stream processing block composition. The approach aims at unleashing the power of CEP systems for teams having reduced insights into CEP systems.

Real-time electricity metering operated at subsecond data rates in a grid with 20 million nodes originates more than 5 petabytes daily. Chapter 15 discusses the requested decision-making timeframe in SCADA systems operating load shedding, the resulting optimization task, and the data management approach permitting a solution to the issue.
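The daily volume quoted above is plausible under simple back-of-envelope assumptions; the per-node message rate and message size below are illustrative guesses, not figures from the chapter:

    nodes = 20_000_000          # metering points in the grid
    readings_per_second = 10    # assumed subsecond metering rate per node
    bytes_per_reading = 300     # assumed size of one reading plus metadata
    seconds_per_day = 86_400

    daily_bytes = nodes * readings_per_second * bytes_per_reading * seconds_per_day
    print(f"{daily_bytes / 1e15:.1f} PB per day")   # ~5.2 PB under these assumptions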

Chapter 16 presents an innovative outlook to the scaling of geographical space using large street networks involving both cities and countryside. Given a street network of an entire country, the chapter proposes to decompose the street network into individual blocks, each of which forms a minimum ring or cycle such as city blocks and field blocks. The chapter further elaborates the power of the block perspective in reflecting the patterns of geographical space.

Chapter 17 presents the influence of recent advances in natural language processing on business knowledge life cycles and processes of knowledge management. The chapter also sketches envisaged developments and market impacts related to the integration of semantic technology and knowledge management.

Intended Audience

The aim of this book is to be accessible to researchers, graduate students, and application-driven practitioners who work in data science and related fields. This edited book requires no previous exposure to large-scale data analysis or NoSQL tools. Acquaintance with traditional databases is an added advantage.

This book provides the reader with a broad range of Big Data concepts, tools, and techniques. A wide range of research in Big Data is covered, and comparisons between state-of-the-art approaches are provided. This book can thus help researchers from related fields (such as databases, data science, data mining, machine learning, knowledge engineering, information retrieval, information systems), as well as students who are interested in entering this field of research, to become familiar with recent research developments and identify open research challenges on Big Data. This book can help practitioners to better understand the current state of the art in Big Data techniques, concepts, and applications.

The technical level of this book also makes it accessible to students taking advanced undergraduate level courses on Big Data or Data Science. Although such courses are currently rare, with the ongoing challenges that the areas of intelligent information/data management pose in many organizations in both the public and private sectors, there is a demand worldwide for graduates with skills and expertise in these areas. It is hoped that this book helps address this demand.

  In addition, the goal is to help policy-makers, developers and engineers, data scientists, as well as individuals, navigate the new Big Data landscape.


Acknowledgments

  The organization and the contents of this edited book have benefited from our outstanding contributors. I am very proud and happy that these researchers agreed to join this project and prepare a chapter for this book. I am also very pleased to see this materialize in the way I originally envisioned. I hope this book will be a source of inspiration to the readers. I especially wish to express my sincere gratitude to all the authors for their contribution to this project.

I thank the anonymous reviewers who provided valuable feedback and helpful suggestions. I also thank Aastha Sharma, David Fausel, Rachel Holt, and the staff at CRC Press (Taylor & Francis Group), who supported this book project right from the start.

Last, but not least, a very big thanks to my colleagues at Western Norway Research Institute (Vestlandsforsking, Norway) for their constant encouragement and understanding.

I wish all readers a fruitful time reading this book, and hope that they experience the same excitement as I did—and still do—when dealing with Data.

  Rajendra Akerkar

  


  

Rajendra Akerkar is professor and senior researcher at Western Norway Research Institute (Vestlandsforsking), Norway, where his main domain of research is semantic technologies, with the aim of combining theoretical results with high-impact real-world solutions. He also holds visiting academic assignments in India and abroad. In 1997, he founded and chaired the Technomathematics Research Foundation (TMRF) in India.

His research and teaching experience spans over 23 years in academia, including different universities in Asia, Europe, and North America. His research interests include ontologies, semantic technologies, knowledge systems, large-scale data mining, and intelligent systems.

He received a DAAD fellowship in 1990 and is also a recipient of the prestigious BOYSCASTS Young Scientist award of the Department of Science and Technology, Government of India, in 1997. From 1998 to 2001, he was a UNESCO-TWAS associate member at the Hanoi Institute of Mathematics, Vietnam. He was also a DAAD visiting professor at Universität des Saarlandes and University of Bonn, Germany, in 2000 and 2007, respectively.

Dr. Akerkar serves as editor-in-chief of the International Journal of Computer Science & Applications (IJCSA) and as an associate editor of the International Journal of Metadata, Semantics, and Ontologies (IJMSO). He is a co-organizer of several workshops and program chair of the international conferences ISACA, ISAI, ICAAI, and WIMS. He has co-authored 13 books, approximately 100 research papers, co-edited 2 e-books, and edited 5 volumes of international conferences. He has also been actively involved in several international ICT initiatives and research & development projects for more than 16 years.

  


Contributors

Rajendra Akerkar, Western Norway Research Institute, Sogndal, Norway
Mario Arias, Digital Enterprise Research Institute, National University of Ireland, Galway, Ireland
Sören Auer, Enterprise Information Systems Department, Institute of Computer Science III, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
Pierfrancesco Bellini, Department of Systems and Informatics, University of Florence, Firenze, Italy
Dario Bonino, Department of Control and Computer Engineering, Polytechnic University of Turin, Turin, Italy
Diego Calvanese, Department of Computer Science, Free University of Bozen-Bolzano, Bolzano, Italy
Erik Cambria, Department of Computer Science, National University of Singapore, Singapore
Giuseppe Caragnano, Istituto Superiore Mario Boella, Torino, Italy
Michael Cochez, Faculty of Information Technology, University of Jyväskylä, Jyväskylä, Finland
Fulvio Corno, Department of Control and Computer Engineering, Polytechnic University of Turin, Turin, Italy
Dipankar Das, Department of Computer Science, National University of Singapore, Singapore
Luigi De Russis, Department of Control and Computer Engineering, Polytechnic University of Turin, Turin, Italy
Mariano di Claudio, Department of Systems and Informatics, University of Florence, Firenze, Italy
Gillian Dobbie, Department of Computer Science, The University of Auckland, Auckland, New Zealand
Vadim Ermolayev, Department of Computer Science, Zaporozhye National University, Zaporozhye, Ukraine
Javier D. Fernández, Department of Computer Science, University of Valladolid, Valladolid, Spain
Philipp Frischmuth, Department of Computer Science, University of Leipzig, Leipzig, Germany
Martin Giese, Department of Computer Science, University of Oslo, Oslo, Norway
Claudio Gutiérrez, University of Chile, Santiago, Chile
Peter Haase, Fluid Operations AG, Walldorf, Germany
Hele-Mai Haav, Institute of Cybernetics, Tallinn University of Technology, Tallinn, Estonia
Ian Horrocks, Department of Computer Science, Oxford University, Oxford, United Kingdom
Stratos Idreos, Dutch National Research Center for Mathematics and Computer Science (CWI)
Yannis Ioannidis, Department of Computer Science, National and Kapodistrian University of Athens, Athens, Greece
Bin Jiang, Department of Technology and Built Environment, University of Gävle, Gävle, Sweden
Monika Jungemann-Dorner, Senior International Project Manager, Verband der Vereine Creditreform eV, Neuss, Germany
Jakub Klimek, University of Leipzig, Leipzig, Germany
Herald Kllapi, Department of Computer Science, National and Kapodistrian University of Athens, Athens, Greece
Manolis Koubarakis, Department of Computer Science, National and Kapodistrian University of Athens, Athens, Greece
Peep Küngas, Institute of Computer Science, University of Tartu, Tartu, Estonia
Maurizio Lenzerini, Department of Computer Science, Sapienza University of Rome, Rome, Italy
Xintao Liu, Department of Technology and Built Environment, University of Gävle, Gävle, Sweden
Miguel A. Martínez-Prieto, Department of Computer Science, University of Valladolid, Valladolid, Spain
Ralf Möller, TU Hamburg-Harburg, Hamburg, Germany
Lorenzo Mossucca, Istituto Superiore Mario Boella, Torino, Italy
M. Asif Naeem, Department of Computer Science, The University of Auckland, Auckland, New Zealand
Paolo Nesi, Distributed Systems and Internet Technology, Department of Systems and Informatics, University of Florence, Firenze, Italy
Axel-Cyrille Ngonga Ngomo, Department of Computer Science, University of Leipzig, Leipzig, Germany
Daniel Olsher, Department of Computer Science, National University of Singapore, Singapore
Özgür Özçep, Department of Computer Science, TU Hamburg-Harburg, Hamburg, Germany
Tassilo Pellegrini, Semantic Web Company, Vienna, Austria
Jordà Polo, Barcelona Supercomputing Center (BSC), Technical University of Catalonia (UPC), Barcelona, Spain
Dheeraj Rajagopal, Department of Computer Science, National University of Singapore, Singapore
Nadia Rauch, Department of Systems and Informatics, University of Florence, Firenze, Italy
Mariano Rodriguez Muro, Department of Computer Science, Free University of Bozen-Bolzano, Bolzano, Italy
Riccardo Rosati, Department of Computer Science, Sapienza University of Rome, Rome, Italy
Pietro Ruiu, Istituto Superiore Mario Boella, Torino, Italy
Rudolf Schlatte, Department of Computer Science, University of Oslo, Oslo, Norway
Michael Schmidt, Fluid Operations AG, Walldorf, Germany
Mikhail Simonov, Advanced Computing and Electromagnetics Unit, Istituto Superiore Mario Boella, Torino, Italy
Ahmet Soylu, Department of Computer Science, University of Oslo, Oslo, Norway
Marcus Spies, Ludwig-Maximilians University, Munich, Germany
Vagan Terziyan, Department of Mathematical Information Technology, University of Jyväskylä, Jyväskylä, Finland
Olivier Terzo, Advanced Computing and Electromagnetics Unit, Istituto Superiore Mario Boella, Torino, Italy
Arild Waaler, Department of Computer Science, University of Oslo, Oslo, Norway
Gerald Weber, Department of Computer Science, The University of Auckland, Auckland, New Zealand
Roberto V. Zicari, Department of Computer Science, Goethe University, Frankfurt, Germany

  

  


Toward Evolving Knowledge Ecosystems for Big Data Understanding

Vadim Ermolayev, Rajendra Akerkar, Vagan Terziyan, and Michael Cochez

Contents

Introduction
Motivation and Unsolved Issues
  Illustrative Example
  Demand in Industry
  Problems in Industry
  Major Issues
State of Technology, Research, and Development in Big Data Computing
  Big Data Processing—Technology Stack and Dimensions
  Big Data in European Research
  Complications and Overheads in Understanding Big Data
  Refining Big Data Semantics Layer for Balancing Efficiency and Effectiveness
    Focusing
    Filtering
    Forgetting
    Contextualizing
    Compressing
    Connecting
  Autonomic Big Data Computing
Scaling with a Traditional Database
  Large Scale Data Processing Workflows
Knowledge Self-Management and Refinement through Evolution
  Knowledge Organisms, their Environments, and Features
    Environment, Perception (Nutrition), and Mutagens
    Knowledge Genome and Knowledge Body
    Morphogenesis
    Mutation
    Recombination and Reproduction
  Populations of Knowledge Organisms
  Fitness of Knowledge Organisms and Related Ontologies
Some Conclusions
Acknowledgments
References

Introduction

Big Data is a phenomenon that leaves few information professionals indifferent these days. Remarkably, application demands and developments in the context of related disciplines have resulted in technologies that boosted data generation and storage at unprecedented scales in terms of volumes and rates. To mention just a few facts reported by Manyika et al. (2011): a disk drive capable of storing all the world's music could be purchased for about US $600, and 30 billion content pieces are shared monthly on Facebook alone. The exponential growth of data volumes is accelerated by a dramatic increase in social networking applications that allow nonspecialist users to create a huge amount of content easily and freely. Equipped with rapidly evolving mobile devices, a user is becoming a nomadic gateway boosting the generation of additional real-time sensor data. The emerging Internet of Things makes every thing a source of data or content, adding billions of additional artificial and autonomic sources of data to the overall picture. Smart spaces, where people, devices, and their infrastructure are all loosely connected, also generate data of unprecedented volumes and with velocities rarely observed before. The expectation is that valuable information will be extracted out of all these data to help improve the quality of life and make our world a better place.

Society is, however, left bewildered about how to use all these data efficiently and effectively. For example, a topical estimate of the need for data-savvy managers required to take full advantage of Big Data in the United States is 1.5 million (Manyika et al. 2011). A major challenge would be finding a balance between the two evident facets of the whole Big Data adventure: (a) the more data we have, the more potentially useful patterns it may include, and (b) the more data we have, the less the hope is that any machine-learning algorithm is capable of discovering these patterns in an acceptable time frame. Perhaps because of this intrinsic conflict, many experts consider that Big Data not only brings one of the biggest challenges, but also a most exciting opportunity of the next 10 years (cf. Fan et al. 2012b).

The avalanche of Big Data causes a conceptual divide in minds and opinions. Enthusiasts claim that, faced with massive data, the scientific approach ". . . hypothesize, model, test—is becoming obsolete. . . . Petabytes allow us to say: 'Correlation is enough.' We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot" (Anderson 2008). Skeptics counter that Big Data provides ". . . destabilising amounts of knowledge and information that lack the regulating force of philosophy" (Berry 2011). Indeed, being abnormally big does not yet mean being healthy and wealthy, and should be treated appropriately (Figure 1.1): a diet, exercise, medication, or even surgery (philosophy). Those data sets for which systematic health treatment is ignored in favor of correlations will die sooner—as useless. There is hope, however, that holistic integration of evolving algorithms, machines, and people, reinforced by research effort across many domains, will guarantee the required fitness of Big Data, assuring proper quality at the right time (Joseph 2012).

Mined correlations, though very useful, may hint at an answer to a "what" question, but not to a "why" one. For example, if Big Data about Royal guards and their habits had been collected in 1700s France, one could mine today that all musketeers who used to have red Burgundy regularly for dinner have not survived till now. Pity—red Burgundy was only one of many, and a very minor, problem. A scientific approach is needed to infer the real reasons—work currently done predominantly by human analysts.

Effectiveness and efficiency are the evident keys in Big Data analysis. Cradling the gems of knowledge extracted out of Big Data would only be effective if: (i) not a single important fact is left buried in the bulk of data—which means completeness, and (ii) these facts are faceted adequately for further inference—which means expressiveness and granularity. Efficiency may be interpreted as the ratio of spent effort to the utility of the result. In Big Data analytics, it could be straightforwardly mapped to timeliness. If a result is not timely, its utility (Ermolayev et al. 2004) may go down to zero, or even far below, in seconds to milliseconds for some important industrial applications such as technological process or air traffic control.

  Notably, increasing effectiveness means increasing the effort or making the analysis computationally more complex, which negatively affects efficiency.
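One simple way to picture this timeliness requirement (an illustration, not a formula from the chapter) is a utility function that decays with the delay of delivering a result and turns into a penalty once an application deadline has passed:

    def utility(value, delay_ms, half_life_ms=100.0, deadline_ms=500.0, penalty=1.0):
        """Illustrative time-discounted utility of an analysis result: the value
        halves every half_life_ms of delay and becomes negative past the deadline."""
        if delay_ms > deadline_ms:
            return -penalty
        return value * 0.5 ** (delay_ms / half_life_ms)

    for delay in (0, 100, 300, 600):
        print(delay, round(utility(1.0, delay), 3))   # 1.0, 0.5, 0.125, -1.0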

Figure 1.1 Evolution of data collections—dimensions (see also Figure 1.3) have to be treated with care.


  Finding a balanced solution with a sufficient degree of automation is the challenge that is not yet fully addressed by the research community.

One derivative problem concerns knowledge extracted out of Big Data as the result of some analytical processing. In many cases, it may be expected that the knowledge mechanistically extracted out of Big Data will also be big. Therefore, taking care of Big Knowledge (which has more value than the source data) would be at least of the same importance as resolving challenges associated with Big Data processing. Uplifting the problem to the level of knowledge is inevitable and brings additional complications such as resolving contradictory and changing opinions of everyone on everything. Here, an adequate approach in managing the authority and reputation of "experts" will play an important role (Weinberger 2012).

This chapter offers a possible approach in addressing the problem of "understanding" Big Data in an effective and efficient way. The idea is making adequately grained and expressive knowledge representations and fact collections evolve naturally, triggered by new tokens of relevant data coming along. Pursuing this way would also imply conceptual changes in the Big Data processing stack. A refined semantic layer has to be added to it for providing adequate interfaces to interlink horizontal layers and enable knowledge-related functionality coordinated in top-down and bottom-up directions.

The remainder of the chapter is structured as follows. The "Motivation and Unsolved Issues" section offers an illustrative example and the analysis of the demand for understanding Big Data. The "State of Technology, Research, and Development in Big Data Computing" section reviews the relevant research on using semantic and related technologies for Big Data processing and outlines our approach to refine the processing stack. The "Scaling with a Traditional Database" section focuses on how the basic data storage and management layer could be refined in terms of scalability, which is necessary for improving efficiency/effectiveness. The "Knowledge Self-Management and Refinement through Evolution" section presents our approach, inspired by the mechanisms of natural evolution studied in evolutionary biology. We focus on a means of arranging the evolution of knowledge, using knowledge organisms, their species, and populations, with the aim of balancing efficiency and effectiveness of processing Big Data and its semantics. We also provide our preliminary considerations on assessing fitness in an evolving knowledge ecosystem. Our conclusions are drawn in the "Some Conclusions" section.

  Motivation and Unsolved Issues

Practitioners, including systems engineers and Information Technology architects, invoke the phenomenon of Big Data in their dialog over means of improving sensemaking. The phenomenon remains a constructive way of introducing others, including nontechnologists, to new approaches such as the Apache Hadoop framework. Apparently, Big Data is collected to be analyzed. "Fundamentally, big data analytics is a workflow that distills terabytes of low-value data down to, in some cases, a single bit of high-value data. . . . The goal is to see the big picture from the minutia of our digital lives" (cf. Fisher et al. 2012). Evidently, "seeing the big picture" in its entirety is the key and requires making Big Data healthy and understandable in terms of effectiveness and efficiency for analytics.

In this section, the motivation for understanding Big Data in a way that improves the performance of analytics is presented and analyzed. It begins with presenting a simple example which is further used throughout the chapter. It continues with the analysis of the industrial demand for Big Data analytics. In this context, the major problems as perceived by industries are analyzed and informally mapped to unsolved technological issues.

Illustrative Example

  Imagine a stock market analytics workflow inferring trends in share price changes. One possible way of doing this is to extrapolate on stock price data. However, a more robust approach could be extracting these trends from market news. Hence, the incoming data for analysis would very likely be several streams of news feeds resulting in a vast amount of tokens per day. An illustrative example of such a news token is:

Posted: Tue, 03 Jul 2012 05:01:10-04:00 LONDON (Reuters) U.S. planemaker Boeing hiked its 20-year market forecast, predicting demand for 34,000 new aircraft worth $4.5 trillion, on growth in emerging regions and as airlines seek efficient new planes to counter high fuel costs.*

  Provided that an adequate technology is available, one may extract the knowledge pictured as thick-bounded and gray-shaded elements in Figure 1.2.

This portion of extracted knowledge is quite shallow, as it simply interprets the source text in a structured and logical way.

  

* (accessed July 5, 2012).
* The technologies for this are under intensive development currently, for example, wit.istc.

Figure 1.2 Semantics associated with a news data token.

Unfortunately, it does not answer several important questions for revealing the motives for Boeing to hike its market forecast:

Q1. What is an efficient new plane? How is efficiency related to the high fuel costs to be countered?
Q2. Which airlines seek efficient new planes? What are the emerging regions? How could their growth be assessed?
Q3. How are plane makers, airlines, and efficient new planes related to each other?

In an attempt to answer these questions, a human analyst will exploit his commonsense knowledge and look around the context for additional relevant evidence. He will likely find out that Q1 and Q3 could be answered using commonsense statements acquired from a foundational ontology, for example, CYC (Lenat 1995), as shown by the dotted-line-bounded items in Figure 1.2.

Answering Q2, however, requires looking for additional information, such as the fleet list of All Nippon Airways,* which was the first to buy B787 airplanes from Boeing (the rest of Figure 1.2), and a relevant list of emerging regions and growth factors (not shown in Figure 1.2). The challenge for a human analyst in performing this task is the low speed of data analysis. The available time slot for providing a recommendation is too small, given the effort to be spent per news token for deep knowledge extraction. This is one good reason for the growing demand for industrial-strength technologies to assist in analytical work on Big Data, increase quality, and reduce related efforts.

* For example, at airfleets.net/flottecie/All%20Nippon%20Airways-active-b787.htm.
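As a toy illustration of what machine-processable facts extracted from the news token could look like (the class and property names below are read off Figure 1.2 and are indicative only, not the chapter's formal model), the shallow knowledge can be written as subject-predicate-object triples and queried:

    # Shallow facts extracted from the news token, as (subject, predicate, object)
    # triples; names follow Figure 1.2 loosely and are illustrative only.
    facts = [
        ("Boeing", "isA", "PlaneMaker"),
        ("Boeing", "basedIn", "UnitedStates"),
        ("New20YMarketForecastByBoeing", "isA", "MarketForecast"),
        ("New20YMarketForecastByBoeing", "salesVolume", "4.5 trillion USD"),
        ("Boeing", "hikes", "New20YMarketForecastByBoeing"),
        ("AllNipponAirways", "isA", "Airline"),
        ("AllNipponAirways", "basedIn", "Japan"),
        ("AllNipponAirways", "owns", "B787-JA812A"),
        ("B787-JA812A", "isA", "EfficientNewPlane"),
        ("B787-JA812A", "builtBy", "Boeing"),
    ]

    def query(triples, predicate):
        # Tiny helper: all (subject, object) pairs linked by a given predicate.
        return [(s, o) for s, p, o in triples if p == predicate]

    print(query(facts, "isA"))       # class memberships
    print(query(facts, "builtBy"))   # part of Q3: how planes relate to plane makers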

Demand in Industry

Turning available Big Data assets into action and performance is considered a deciding factor by today's business analytics. For example, the report by Capgemini (2012) concludes, based on a survey of the interviews with more than 600 business executives, that Big Data use is highly demanded in industries. Interviewees firmly believe that their companies' competitiveness and performance strongly depend on the effective and efficient use of Big Data. In particular, on average,

• Big Data is already used for decision support 58% of the time, and 29% of the time for decision automation

      • It is believed that the use of Big Data will improve organizational performance by 41% over the next three years

        The report by Capgemini (2012) also summarizes that the following are the perceived benefits of harnessing Big Data for decision-making:

      • More complete understanding of market conditions and evolving business trends
      • Better business investment decisions
      • More accurate and precise responses to customer needs
      • Consistency of decision-making and greater group participation in shared decisions
      • Focusing resources more efficiently for optimal returns
      • Faster business growth
      • Competitive advantage (new data-driven services)
      • Common basis for evaluation (one true starting point)
      • Better risk management

Problems in Industry

Though the majority of business executives firmly believe in the utility of Big Data and analytics, doubts still persist about its proper use: ". . . speak of the Knowledge Economy or the Information Society. It's all data now: Data Economy and Data Society. This is a confession that we are no longer in control of the knowledge contained in the data our systems collect" (Greller 2012).

        Capgemini (2012) outlines the following problems reported by their interviewees:

• Unstructured data are hard to process at scale. Forty-two percent of respondents state that unstructured content is too difficult to interpret. Forty percent of respondents believe that they have too much unstructured data to support decision-making.
• Fragmentation is a substantial obstacle. Fifty-six percent of respondents across all sectors consider organizational silos the biggest impediment to effective decision-making using Big Data.
• Effectiveness needs to be balanced with efficiency in "cooking" Big Data. Eighty-five percent of respondents say the major problem is the lack of effective ability to analyze and act on data in real time.

The last conclusion by Capgemini is also supported by Bowker (2005, pp. 183–184), who suggests that "raw data is both an oxymoron and a bad idea; to the contrary, data should be cooked with care." This argument is further detailed by Bollier (2010, p. 13), who stresses that Big Data is a huge "mass of raw information." It needs to be added that this "huge mass" may change in time with varying velocity, is also noisy, and cannot be considered as self-explanatory. Hence, an answer to the question whether Big Data indeed represents a "ground truth" becomes very important—opening pathways to all sorts of philosophical and pragmatic discussions. One aspect of particular importance is interpretation, which defines the ways of cleaning Big Data. Those ways are straightforwardly biased because any interpretation is subjective.

        As observed, old problems of data processing that are well known for decades in industry are made even sharper when data becomes Big. Boyd and Crawford (2012) point out several aspects to pay attention to while “cooking” Big Data, hinting that industrial strength technologies for that are not yet in place:

• Big Data changes the way knowledge is acquired and even defined. As already mentioned above (cf. Anderson 2008), correlations mined from Big Data may hint about model changes and knowledge representation updates and refinements. This may require conceptually novel solutions for evolving knowledge representation, reasoning, and management.
• Having Big Data does not yet imply objectivity, or accuracy, on time. Here, the clinch between efficiency and effectiveness of Big Data interpretation shows up: restricting the analysis to a sample of an appropriate size for being efficient may bring bias and harm correctness and accuracy. Otherwise, analyzing Big Data in source volumes will definitely distort timeliness.

      • Therefore, Big Data is not always the best option. A question that requires research effort in this context is about the appropriate sample, size, and granularity to best answer the question of a data analyst.
• Consequently, taken out of context, Big Data is meaningless in interpretation. Indeed, choosing an appropriate sample and granularity may be seen as contextualization—circumscribing (Ermolayev et al. 2010) the part of the data which is potentially the best-fitted sample for the analytical query. Managing context and contextualization for Big Data at scale is a typical problem and is perceived as one of the research and development challenges.

One more aspect having indirect relevance to technology, but important in terms of socio-psychological perceptions and impact on industries, is ethics and the Big Data divide. Ethics is concerned with legal regulations and constraints of allowing a Big Data collector to interpret personal or company information without informing the subjects about it. Ethical issues become sharper when data are used for competition and lead to the emergence of, and separation into, the Big Data rich and poor, implied by accessibility to data sources at the required scale.

Major Issues

Applying Big Data analytics faces different issues related to the characteristics of data, the analysis process, and also social concerns. Privacy is a very sensitive issue and has conceptual, legal, and technological implications; this concern gains importance in the context of Big Data. Privacy is defined by the International Telecommunications Union as the "right of individuals to control or influence what information related to them may be disclosed" (Gordon 2005). Personal records of individuals are increasingly being collected by several government and corporate organizations. These records are usually used for the purpose of data analytics. To facilitate data analytics, such organizations publish "appropriately private" views over the collected data. However, privacy is a double-edged sword—there should be enough privacy to ensure that sensitive information about individuals is not disclosed, and at the same time there should be enough data to perform the data analysis. Thus, privacy is a primary concern that has widespread implications for someone desiring to explore the use of Big Data for development, in terms of data acquisition, storage, preservation, presentation, and use.
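The tension between "enough privacy" and "enough data" is commonly resolved by publishing generalized views; one widely used criterion, k-anonymity (named here for illustration, not prescribed by the chapter), requires every released combination of quasi-identifiers to be shared by at least k records. A minimal check over invented records:

    from collections import Counter

    def is_k_anonymous(records, quasi_identifiers, k):
        """True if every combination of quasi-identifier values occurs >= k times."""
        combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
        return all(count >= k for count in combos.values())

    released_view = [                 # already generalized: age bands, region only
        {"age": "30-39", "region": "West", "diagnosis": "flu"},
        {"age": "30-39", "region": "West", "diagnosis": "asthma"},
        {"age": "40-49", "region": "East", "diagnosis": "flu"},
        {"age": "40-49", "region": "East", "diagnosis": "diabetes"},
    ]
    print(is_k_anonymous(released_view, ["age", "region"], k=2))   # True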

Another concern is the access and sharing of information. Usually, private organizations are reluctant to share information about their clients and users, as well as about their own operations. Barriers may include legal considerations, a need to protect their competitiveness, a culture of confidentiality, and, largely, the lack of the right incentive and information structures. There are also institutional and technical issues, when data are stored in places and ways that make them difficult to be accessed and transferred.

One significant issue is to rethink security for information sharing in Big Data use cases. Several online services allow us to share private information, but outside record-level access control we do not comprehend what it means to share data and how the shared data can be linked.

Managing large and rapidly increasing volumes of data has been a challenging issue. Earlier, this issue was mitigated by processors getting faster, which provided the resources needed to cope with increasing volumes of data. However, there is a fundamental shift underway considering that data volume is scaling faster than computing resources. Consequently, extracting sense from data at the required scale is far beyond human capability. So we, the humans, increasingly ". . . require the help of automated systems to make sense of the data produced by other (automated) systems" (Greller 2012). These instruments produce new data at a comparable scale—kick-starting a new iteration in this endless cycle.

In general, given a large data set, it is often necessary to find elements in it that meet a certain criterion, and such searches typically recur. Scanning the entire data set to find suitable elements is obviously impractical. Instead, index structures are created in advance to permit finding the qualifying elements quickly.
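A minimal sketch of the idea, assuming the criterion is keyword containment: an inverted index built in one pass maps each term to the records containing it, so later queries avoid rescanning the whole data set.

    from collections import defaultdict

    def build_inverted_index(records):
        # One pass over the data set; afterwards lookups avoid full scans.
        index = defaultdict(set)
        for record_id, text in records.items():
            for term in set(text.lower().split()):
                index[term].add(record_id)
        return index

    def lookup(index, *terms):
        # Records matching all terms: intersect the per-term posting sets.
        sets = [index.get(t, set()) for t in terms]
        return set.intersection(*sets) if sets else set()

    records = {1: "Boeing hikes market forecast",
               2: "airlines seek efficient new planes",
               3: "market demand for new aircraft grows"}
    index = build_inverted_index(records)
    print(lookup(index, "market", "new"))   # {3}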

        Moreover, dealing with new data sources brings a significant number of analytical issues. The relevance of these issues will vary depending on the type of analysis being conducted and on the type of decisions that the data might ultimately inform. The big core issue is to analyze what the data are really telling us in an entirely transparent manner.

      State of Technology, Research, and Development in Big Data Computing

After giving an overview of the influence of Big Data on industries and society as a phenomenon, and outlining the problems in the Big Data computing context as perceived by technology consumers, we now proceed with the analysis of the state of development of those technologies. We begin by presenting the overall Big Data processing technology stack and point out how the different dimensions of Big Data affect the requirements to technologies, with understanding—in particular, semantics-based processing—in focus. We then review relevant European research and development projects and focus on what they do in advancing the state of the art in semantic technologies for Big Data processing. Further, we summarize the analysis by pointing out the observed complications and overheads in processing Big Data semantics. Finally, we outline a high-level proposal for the refinement of the Big Data semantics layer in the technology stack.

        Big Data Processing—Technology Stack and Dimensions

        At a high level of detail, Driscoll (2011) describes the Big Data processing technology stack comprising three major layers: foundational, analytics, and applications (upper part of Figure 1.3).

The foundational layer provides the infrastructure for storage, access, and management of Big Data. Depending on the nature of data, stream processing solutions (Abadi et al. 2003; Golab and Tamer Ozsu 2003; Salehi 2010), distributed persistent storage (Chang et al. 2008; Roy et al. 2009; Shvachko et al. 2010), cloud infrastructures (Rimal et al. 2009; Tsangaris et al. 2009; Cusumano 2010), or a reasonable combination of these (Gu and Grossman 2009; He et al. 2010; Sakr et al. 2011) may be used for storing and accessing data in response to the upper-layer requests and requirements.

        

Figure 1.3 Processing stack, based on Driscoll (2011), and the four dimensions of Big Data, based on Beyer et al. (2011).


The middle layer of the stack is responsible for analytics. Here, data warehousing technologies (e.g., Nemani and Konda 2009; Ponniah 2010; Thusoo et al. 2010) are currently exploited for extracting correlations and features (e.g., Ishai et al. 2009) from data and feeding classification and prediction algorithms (e.g., Mills 2011).

Focused applications or services are at the top of the stack. Their functionality is based on the use of more generic lower-layer technologies and is exposed to end users as Big Data products.

Some such services even leverage the collective behavior of users to improve their fraud predictions. Another company, called Klout, provides a genuine data service that uses social media activity to measure online influence.

LinkedIn's People You May Know feature is also a kind of focused service. This service is presumably based on graph theory, starting exploration of the graph of your relations from your node and filtering those relations according to what is called "homophily." The greater the homophily between two nodes, the more likely the two nodes will be connected.
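A hedged sketch of such a "people you may know" heuristic (the actual LinkedIn algorithm is not public): approximate homophily by the overlap of two users' connection sets, for example, their Jaccard similarity, and recommend the non-connected users with the highest scores.

    def jaccard(a, b):
        # Overlap of two connection sets as a crude homophily proxy.
        return len(a & b) / len(a | b) if a or b else 0.0

    graph = {                     # toy undirected friendship graph
        "alice": {"bob", "carol", "dave"},
        "bob":   {"alice", "carol"},
        "carol": {"alice", "bob", "erin"},
        "dave":  {"alice"},
        "erin":  {"carol"},
    }

    def suggestions(graph, user, top=3):
        candidates = set(graph) - graph[user] - {user}
        scored = [(jaccard(graph[user], graph[c]), c) for c in candidates]
        return sorted(scored, reverse=True)[:top]

    print(suggestions(graph, "dave"))   # users sharing neighbors with dave first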

According to its purpose, the foundational layer is concerned with being capable of processing as much data as possible (volume) as soon as possible. In particular, if streaming data are used, the faster the stream is (velocity), the more difficult it is to process the data in a stream window. Currently available technologies and tools for the foundational level do not cope equally well with the volume and velocity dimensions, which are, so to say, anticorrelated due to their nature. Therefore, hybrid infrastructures are in use for balancing processing efficiency aspects (Figure 1.3)—comprising solutions focused on taking care of volumes and, separately, of velocity. Some examples are given in the "Big Data in European Research" section.

For the analytics layer (Figure 1.3), the volume and velocity dimensions (Beyer et al. 2011) are also important and constitute the facet of efficiency—big volumes of data which may change swiftly have to be processed in a timely fashion. However, two more dimensions of Big Data become important—complexity and variety—which form the facet of effectiveness. Complexity is clearly about the adequacy of data representations and descriptions for analysis. Variety describes the degree of syntactic and semantic heterogeneity in distributed modules of data that need to be integrated or harmonized for analysis. A major conceptual complication for analytics is that efficiency is anticorrelated with effectiveness.

Big Data in European Research

Due to its huge demand, Big Data computing is currently on the hype as a field of research and development, producing a vast domain of work. To keep the size of this review observable for a reader, we focus on the batch of European Seventh Framework Programme (FP7) Information and Communication Technology (ICT) projects within this vibrant field. Big Data processing, including semantics, is addressed by several projects of FP7 ICT Call 5, which are listed in Table 1.1 and further analyzed below.

        

SmartVortex [Integrating Project (IP)] develops a technological infrastructure—a comprehensive suite of interoperable tools, services, and methods—for intelligent management and analysis of massive data streams. The goal is to achieve better collaboration and decision-making in large-scale collaborative projects concerning industrial innovation engineering.

        


        

LOD2 (IP) claims to deliver: industrial-strength tools and methodologies for exposing and managing very large amounts of structured information; a bootstrap network of multidomain and multilingual ontologies from sources such as Wikipedia and OpenStreetMap; machine learning algorithms for enriching, repairing, interlinking, and fusing data from Web resources; standards and methods for tracking provenance, ensuring privacy and data security, and assessing information quality; and adaptive tools for searching, browsing, and authoring Linked Data.

        

Tridec (IP) develops a service platform accompanied by next-generation work environments supporting human experts in decision processes for managing and mitigating emergency situations triggered by the earth (observation) system in complex and time-critical settings. The platform enables "smart" management of collected sensor data and of the facts inferred from these data with respect to crisis situations.

        

First [Small Targeted Research Project (STREP)] develops an information extraction, integration, and decision-making infrastructure for the financial domain with extremely large, dynamic, and heterogeneous sources of information.

        

iProd (STREP) investigates approaches to reducing product development costs through efficient use of large amounts of data, comprising the development of a software framework to support complex information management. Key aspects addressed by the project are handling heterogeneous information and semantic diversity using semantic technologies, including knowledge bases and reasoning.

        

Teleios (STREP) focuses on elaborating a data infrastructure for Earth Observation (EO): a scalable and adaptive environment for knowledge discovery from EO images and geospatial data sets, and a query processing and optimization technique for queries over multidimensional arrays and EO images.

TABLE 1.1
FP7 ICT Call 5 Projects and Their Contributions to Big Data Processing and Understanding

Acronym | Domain(s)/Industry(ies) | Contribution to Big Data Understanding
SmartVortex | Industrial innovation engineering |
LOD2 | Media and publishing, corporate data intranets, eGovernment | DM, QL, KD
Tridec | Crisis/emergency response, government, oil and gas |
First | Market surveillance, investment management, online retail banking and brokerage |
iProd | Manufacturing: aerospace, automotive, and home appliances |
Teleios | Civil defense, environmental agencies; use cases: a virtual observatory for TerraSAR-X data, real-time fire monitoring |
Khresmoi | Medical imaging in healthcare, biomedicine | IE, DLi, M-LS, MT
Robust | Online communities (internet, extranet, and intranet) addressing customer support, knowledge sharing, hosting services | AEP
Digital.me | Personal sphere |
Fish4Knowledge | Marine sciences, environment | DV (fish), SUM
Render | Information management (wiki), news aggregation (search engine), customer relationship management (telecommunications) | DS
PlanetData | Cross-domain |
LATC | Government |
Advance | Logistics |
Cubist | Market intelligence, computational biology, control centre operations | SDW, SBI, FCA
Promise | Cross-domain |
Dicode | Clinico-genomic research, healthcare, marketing | DM, OM

Legend: AEP, action extraction and prediction; DLi, data linking; DM, data mining; DS, diversity in semantics; DV, domain vocabulary; FCA, formal concept analysis; IE, information extraction; Int, integration; KD, knowledge discovery; M-LS, multi-lingual search; MT, machine translation; O, ontology; OM, opinion mining; QL, query language; R, reasoning; SBI, business intelligence over semantic data; SDW, semantic data warehouse (triple store); SUM, summarization.

Notes: IIM clustering information has been taken from the Commission's sources. The Big Data dimensions (volume, velocity, variety, complexity) follow the Gartner report on extreme information management (Gartner 2011). The contributions of the projects to the Big Data processing stack layers (fast access to/management of data at scale, fast analytics of data at scale, focused services) have been assessed based on their public deliverables.



        

Khresmoi (IP) develops an advanced multilingual and multimodal search and access system for biomedical information and documents. The advancements of Khresmoi comprise: automated information extraction from biomedical documents, reinforced by using crowdsourcing, active learning, automated estimation of trust level, and target user expertise; automated analysis and indexing for 2-, 3-, and 4D medical images; linking information extracted from unstructured or semistructured biomedical texts and images to structured information in knowledge bases; multilingual search, including multiple languages in queries and machine-translated pertinent excerpts; and visual user interfaces to assist in formulating queries and displaying search results.

        

Robust (IP) investigates models and methods for describing, understanding, and managing the users, groups, behaviors, and needs of online communities. The project develops a scalable cloud- and stream-based data management infrastructure for handling the real-time analysis of large volumes of community data. Understanding and prediction of actions is envisioned using simulation and visualization services. All the developed tools are combined under the umbrella of the risk management framework, resulting in a methodology for the detection, tracking, and management of opportunities and threats to online community prosperity.

        

Digital.me (STREP) integrates all personal data in a personal sphere at a single user-controlled point of access—a user-controlled personal service for intelligent personal information management. The software is targeted at integrating social web systems and communities and implements decentralized communication to avoid external data storage and undesired data disclosure.

        

Fish4Knowledge (STREP) develops methods for information abstraction and storage that reduce the amount of video data at a rate of 10^15 pixels to 10^12 units of information. The project also develops machine- and human-accessible vocabularies for describing fish. The framework also comprises a flexible data-processing architecture and a specialized query system tailored to the domain. To achieve these, the project exploits a combination of computer vision, video summarization, database storage, scientific workflow, and human–computer interaction methods.

        

Render (STREP) is focused on investigating the aspect of diversity of Big Data semantics. It investigates methods and techniques, develops software, and collects data sets that will leverage diversity as a source of innovation and creativity. The project also claims to provide support for designing novel algorithms that reflect diversity in the ways information is selected, ranked, aggregated, presented, and used.

        

PlanetData [Network of Excellence (NoE)] works toward establishing a sustainable European community of researchers that supports organizations in exposing their data in new and useful ways and develops technologies that are able to handle data purposefully at scale. The network also facilitates researchers' exchange, training, and mentoring, and event organization, based substantially on an open partnership scheme.

        

LATC (Support Action) creates an in-depth test-bed for data-intensive applications by publishing data sets produced by the European Commission, the European Parliament, and other European institutions as Linked Data on the Web and by interlinking them with other governmental data.

        

Advance (STREP) develops a decision support platform for improving strategies in logistics operations. The platform is based on the refinement of predictive analysis techniques to process massive data sets for long-term planning and to cope with huge amounts of new data in real time.

        

Cubist (STREP) elaborates methodologies and implements a platform that brings together several essential features of Semantic Technologies and Business Intelligence (BI): support for the federation of data coming from unstructured and structured sources; a BI-enabled triple store as a data persistency layer; data volume reduction and preprocessing using data semantics; enabling BI operations over semantic data; a semantic data warehouse implementing FCA; and applying visual analytics for rendering, navigating, and querying data.

        

Promise (NoE) establishes a virtual laboratory for conducting participative research and experimentation to carry out, advance, and bring automation into the evaluation and benchmarking of multilingual and multimedia information systems. The project offers an infrastructure for the access, curation, preservation, re-use, analysis, visualization, and mining of the collected experimental data.

        

Dicode (STREP) develops a workbench of interoperable services, in particular for: (i) scalable text and opinion mining; (ii) collaboration support; and (iii) decision-making support. The workbench is designed to reduce data intensiveness and complexity overload at critical decision points to a manageable level. It is envisioned that the use of the workbench will help stakeholders to be more productive and to concentrate on creative activities.

In summary, the contributions to Big Data understanding of all the projects mentioned above result in the provision of different functionality for a semantic layer—an interface between the Data and Analytics layers of the Big Data processing stack—as pictured in Figure 1.4.

However, these advancements remain somewhat insufficient with respect to the challenges outlined in the introduction of this chapter.

Figure 1.4
Contribution of the selection of FP7 ICT projects to technologies for Big Data understanding. [The figure shows the processing stack of Figure 1.3 with a Big Data semantics layer placed between the Big Data storage, access, and management layer and the Big Data analytics layer; the semantics layer covers knowledge representation, extraction/elicitation, integration, query formulation/transformation, reasoning, and storage.] Abbreviations are explained in the legend to Table 1.1.

Analysis of Table 1.1 reveals that none of the reviewed projects addresses all four dimensions of Big Data in a balanced manner. In particular, only two projects—Tridec and First—claim contributions addressing Big Data velocity and variety-complexity. This fact points out that the clinch between efficiency and effectiveness in Big Data processing still remains a challenge.

Complications and Overheads in Understanding Big Data

As observed, mankind collects and stores data through generations, without a clear account of the utility of these data. Out of the data at hand, each generation extracts a relatively small proportion of knowledge for their everyday needs. The knowledge is produced by a generation for their needs—to the extent they have to satisfy their "nutrition" requirement for supporting decision-making. Hence, knowledge is "food" for data analytics. An optimistic assumption usually made here is that the next generation will succeed in advancing tools for data mining, knowledge discovery, and extraction. So the data which the current generation cannot process effectively and efficiently is left as a legacy for the next generation in the hope that the descendants cope better. The truth, however, is that the developments of data and knowledge-processing tools fail to keep pace with the explosive growth of data in all four dimensions mentioned above. Suspending the understanding of Big Data until an advanced next-generation capability is at hand is therefore not a viable option.

Do today's state-of-the-art technologies allow us to understand Big Data with an attempt to balance effectiveness and efficiency?—probably not. Our brief analysis reveals that Big Data computing is currently developed toward more effective versus efficient use of semantics. It is done by adding the semantics layer to the processing stack (cf. Figures 1.3 and 1.4) with an objective of processing all the available data and using all the generated knowledge. Perhaps the major issue is the attempt to eat all we have on the table. Following the metaphor of "nutrition," it has to be noted that the "food" needs to be "healthy" in terms of all the discussed dimensions of Big Data.

Our perceptions of the consequences of not being selective with respect to consuming data for understanding are as follows. The major problem is the introduction of a new interface per se and in an improper way. The advent of semantic technologies aimed at breaking down data silos and simultaneously enabling efficient knowledge management at scale. Assuming that databases describe data using multiple heterogeneous labels, one might expect that annotating these labels using ontology elements as semantic tags enables virtual integration and provides immediate benefits for search, retrieval, reasoning, etc., without a need to modify existing code or data. Unfortunately, as noticed by Smith (2012), it is now too easy to create "ontologies." As a consequence, myriads of them are being created in ad hoc ways and with no respect to compatibility, which implies the creation of new semantic silos and, further, brings something like a "Big Ontology" challenge to the agenda. According to Smith (2012), the big reason is the lack of a rational (monetary) incentive for investing in reuse. Therefore, it is often accepted that a new "ontology" is developed for a new project. Harmonization is left as someone else's work—for the next generation. Therefore, the more successful semantic technology is in simplifying ontology creation, the more we fail to achieve our goals for interoperability and integration (Smith 2012).

It is worth noting here that there is still a way to start doing things correctly which, according to Smith (2012), would be "to create an incremental, evolutionary process, where what is good survives, and what is bad fails; create a scenario in which people will find it profitable to reuse ontologies, terminologies and coding systems which have been tried and tested; silo effects will be avoided and results of investment in Semantic Technology will cumulate effectively."

A good example of a collaborative effort going in this correct direction is the approach used by the Gene Ontology initiative, which follows the principles of the OBO Foundry. The Gene Ontology project is a major bioinformatics initiative with the aim of standardizing the representation of gene and gene product attributes across species and databases. The project provides a controlled vocabulary of terms for describing gene product characteristics and gene product annotation data, as well as tools to access and process these data. The mission of the OBO Foundry is to support the development of fully interoperable ontologies in the biomedical domain, following a common evolving design philosophy and implementation and ensuring a gradual improvement of the quality of ontologies.

Furthermore, adding a data semantics layer facilitates increasing effectiveness in understanding Big Data, but it also substantially increases the computational overhead for processing the representations of knowledge—decreasing efficiency. A solution is needed that harmonically and rationally balances the increase in the adequacy and completeness of Big Data semantics, on the one hand, against the increase in computational complexity, on the other hand. A straightforward approach is using scalable infrastructures for processing knowledge representations. A vast body of related work focuses on elaborating this approach (e.g., Broekstra et al. 2002; Wielemaker et al. 2003; Cai and Frank 2004; DeCandia et al. 2007).

The reasons for qualifying this approach only as a mechanistic solution are:

• Using distributed scalable infrastructures, such as clouds or grids, implies new implementation problems and computational overheads.
• Typical tasks for processing knowledge representations, such as reasoning, alignment, query formulation and transformation, etc., scale poorly (e.g., Oren et al. 2009; Urbani et al. 2009; Hogan et al. 2011)—more expressiveness implies harder problems in decoupling the fragments for distribution. Nontrivial optimization, approximation, or load-balancing techniques are required.

Another effective approach to balancing complexity and timeliness is maintaining history, or learning from the past. A simple but topical example in data processing is the use of previously acquired information for saving approximately 50% of comparison operations in sorting by selection (Knuth 1998, p. 141). In Distributed Artificial Intelligence software, agent architectures maintaining their states or history for more efficient and effective deliberation have also been developed (cf. Dickinson and Wooldridge 2003). In Knowledge Representation and Reasoning, maintaining history is often implemented as inference or query result materialization (cf. Kontchakov et al. 2010; McGlothlin and Khan 2010), which also does not scale well up to the volumes characterizing real Big Data.
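As a minimal illustration of "maintaining history," the sketch below materializes previously computed results of a toy inference task with Python's functools.lru_cache; the tiny subclass hierarchy is an assumption made for the example, and the sketch does not reproduce the cited materialization techniques.

from functools import lru_cache

# Hypothetical "inference" over a small knowledge base: which concepts
# a given concept is (transitively) a subclass of.
SUBCLASS_OF = {
    "Jetliner": "Aircraft",
    "Aircraft": "Vehicle",
    "Car": "Vehicle",
}

@lru_cache(maxsize=None)          # materialize results of earlier calls
def ancestors(concept):
    parent = SUBCLASS_OF.get(concept)
    return frozenset() if parent is None else frozenset({parent}) | ancestors(parent)

print(ancestors("Jetliner"))      # computed once: frozenset({'Aircraft', 'Vehicle'})
print(ancestors("Aircraft"))      # served from the materialized history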

Yet another way to find a proper balance is exploiting incomplete or approximate methods. These methods yield results of acceptable quality much faster than approaches aiming at building fully complete or exact, that is, ideal, results. Good examples of technologies for incomplete or partial reasoning and approximate query answering (Fensel et al. 2008) are elaborated in the FP7 LarKC project. Remarkably, some of the approximate querying techniques, for example, Guéret et al. (2008), are based on evolutionary computing.

Refining Big Data Semantics Layer for Balancing Efficiency-Effectiveness

As one may notice, the developments in the Big Data semantics layer are mainly focused on posing and appropriately transforming the semantics of queries all the way down to the available data, using networked ontologies.* At least two shortcomings of this, in fact, unidirectional approach need to be identified:

1. Scalability overhead implies insufficient efficiency. Indeed, executing queries at the data layer implies processing volumes at the scale of the stored data. Additional overhead is caused by the query transformation, distribution, and planning interfaces. Lifting the results up the stack and fusing them also implies similar computational overheads. A possible solution for this problem may be sought following the supposition that the volume of knowledge describing data adequately for further analyses is substantially smaller than the volume of this data. Hence, the down-lifting of queries for execution needs to be stopped at the layer of knowledge storage for better efficiency. However, the knowledge should be consistent enough with the data so that it can fulfill the completeness and correctness requirements specified in the contract of the query engine.

2. Having ontologies inconsistent with data implies effectiveness problems. Indeed, in the vast majority of cases, the ontologies containing knowledge about data are not updated consistently with the changes in the data. At best, these knowledge representations are revised in a sequence of discrete versions. So, they are not consistent with the data at an arbitrary point in time. This shortcoming may be overcome only if ontologies in a knowledge repository evolve continuously in response to data change. Ontology evolution will have a substantially lower overhead because the volume of changes is always significantly lower than the volume of data, though it depends on data velocity (Figure 1.3).

To sum up, relaxing the consequences of the two outlined shortcomings and, hence, balancing efficiency and effectiveness may be achievable if a bidirectional processing approach is followed. Top-down query answering has to be complemented by a bottom-up ontology evolution process, and the two should meet at the knowledge representation layer. In addition to a balance between efficiency and effectiveness, such an approach to processing huge data sets may help us ". . . find and see dynamically changing ontologies without having to try to prescribe them in advance. Taxonomies and ontologies are things that you might discover by observation, and watch evolve over time" (cf. Bollier 2010).

* Technologies for information and knowledge extraction are also developed and need to be regarded as bottom-up. However, these technologies are designed to work off-line for updating the existing ontologies in a discrete manner. Their execution is not coordinated with the changes in the data.

Further, we focus on outlining a complementary bottom-up path in the overall processing stack which facilitates existing top-down query answering frameworks by providing knowledge evolution in line with data change—as pictured in Figure 1.5. In a nutshell, the proposed bottom-up path is characterized by:

• Efficiently performing simple scalable queries on vast volumes of data or in a stream window for extracting facts and decreasing volumes (more details could be found in the "Scaling with a Traditional Database" section)
• Adding extracted facts to a highly expressive persistent knowledge base allowing the evolution of knowledge (more details on that could be seen in the "Knowledge Self-Management and Refinement through Evolution" section)
• Assessing the fitness of knowledge organisms and knowledge representations in the evolving knowledge ecosystem (our approach to that is also outlined in the "Knowledge Self-Management and Refinement through Evolution" section)

This will enable reducing the overheads of the top-down path by performing refined inference using highly expressive and complex queries over evolving (i.e., consistent with data) and linked (i.e., harmonized), but reasonably small, fragments of knowledge. Query results may also be materialized for further decreasing computational effort.

Figure 1.5
Refining Big Data semantics layer for balancing efficiency and effectiveness. [The figure shows the Big Data storage, access, and management layer (persistent distributed storage, data stream processing, data management, query planning and execution), a Big Data semantics layer (knowledge extraction, contextualization, evolution, change harmonization, distributed persistent knowledge storage, semantic query planning and answering), and the Big Data analytics layer, connected by a top-down path for query answering and a bottom-up path for knowledge evolution and management.]

        After outlining the abstract architecture and the bottom-up approach, we will now explain at a high level how Big Data needs to be treated along the way. A condensed formula for this high-level approach is “3F + 3Co” which is unfolded as

        3F: Focusing-Filtering-Forgetting

3Co: Contextualizing-Compressing-Connecting

Notably, both 3F and 3Co are not novel and are used in parts extensively in many domains and in different interpretations. For example, an interesting interpretation of 3F is offered by Dean and Webb (2011), who suggest this formula as a "treatment" for senior executives (CEOs) to deal with information overload and multitasking. Executives are offered to cope with the problem by focusing (doing one thing at a time), filtering (delegating so that they do not take on too many tasks or too much information), and forgetting (taking breaks and clearing their minds).

        Focusing

Following our Boeing example, let us imagine a data analyst extracting knowledge tokens from a business news stream and putting these tokens as missing bits into the mosaic of his mental picture of the world. A tricky part of his work, guided by intuition or experience in practice, is choosing the order in which the facts are picked up from the tokens. The order of focusing is very important as it influences the formation and saturation of different fragments in the overall canvas. Even if the same input tokens are given, different curves of focusing may result in different knowledge representations and analysis outcomes.

A similar aspect of proper focusing is of importance also for automated processing of Big Data or its semantics. One could speculate whether a processing engine should select data tokens or assertions in the order of their appearance, in a reversed order, or anyhow else. If data or assertions are processed in a stream window and in real time, the order of focusing is of lesser relevance. However, if all the data or knowledge tokens are in persistent storage, having some intelligence for optimal focusing may improve processing efficiency substantially. With smart focusing at hand, a useful token can be found or a hidden pattern extracted much faster and without making a complete scan of the source data. A complication for smart focusing is that the nodes on the focusing curve have to be decided upon on-the-fly because generally the locations of important tokens cannot be known in advance. The processing of each chosen portion of data should, therefore, yield not only the knowledge that is intended directly from this portion of data, but also a hint about the next point on the curve.

A weak point in such a "problem-solving" approach is that some potentially valid alternatives are inevitably lost after each choice made on the decision path. So, only a suboptimal solution is practically achievable. The evolutionary approach detailed further in the section "Knowledge Self-Management and Refinement through Evolution" follows, in fact, a similar approach of smart focusing, but uses a population of autonomous problem-solvers operating concurrently. Hence, it leaves a much smaller part of a solution space without attention, reduces the bias of each choice, and likely provides better results.

        Filtering

A data analyst who receives dozens of news posts at once has to focus on the most valuable of them and filter out the rest which, according to his informed guess, do not bring anything important in addition to those in his focus. Moreover, it might also be very helpful to filter out noise, that is, irrelevant tokens, irrelevant dimensions of data, or those bits of data that are unreadable or corrupted in any other sense. In fact, an answer to the question about what to trash and what to process needs to be sought based on the understanding of the objective (e.g., was the reason for Boeing to hike their market forecast valid?) and the choice of the proper context (e.g., should we look into the airline fleets or the economic situation in developing countries?).

A reasonable selection of features for processing, or otherwise a rational choice of the features that may be filtered out, may essentially reduce the volume as well as the variety/complexity of the data, which results in higher efficiency balanced with effectiveness.

Quite similar to focusing, a complication here is that for big heterogeneous data it is not feasible to expect a one-size-fits-all filter in advance. Even more, for deciding about an appropriate filtering technique and the structure of a filter to be applied, a focused prescan of the data may be required, which implies a decrease in efficiency. The major concern is again how to filter in a smart way so as to balance the intention to reduce processing effort (efficiency) against keeping the quality of results within acceptable bounds (effectiveness).

Our evolutionary approach presented in the section "Knowledge Self-Management and Refinement through Evolution" uses a system of environmental contexts for smart filtering. These contexts are not fixed but may be adjusted by several independent evolutionary mechanisms. For example, a context may become more or less "popular" among the knowledge organisms that harvest knowledge tokens in them because these organisms may migrate freely between contexts in search for better, more appropriate, healthier knowledge to collect. Another useful property we propose for knowledge organisms is their resistance to sporadic mutagenic factors, which may be helpful for filtering out noise.

Forgetting

A professional data analyst always keeps a record of the data he used in his work and the knowledge he created in his previous analyses. The storage for all these gems of expertise is, however, limited, so it has to be cleaned periodically. Such a cleaning implies trashing potentially valuable things that are never or very rarely used but whose loss causes doubts and, later, regrets. A similar thing happens when a Big Data storage overflows—some parts of it have to be trashed and so "forgotten." A question in this respect is which part of a potentially useful collection may be sacrificed. Is forgetting the oldest records reasonable?—perhaps not. Shall we forget the features that have been previously filtered out?—negative again. There is always a chance that an unusual task for analysis pops up and requires the features never exploited before. Are the records with minimal potential utility the best candidates for trashing?—this could be a rational way to go, but how would their potential value be assessed?

Practices in Big Data management confirm that forgetting following straightforward policies, like a fixed lifetime for keeping records, almost inevitably causes regret. For example, the Climate Research Unit (one of the leading institutions that study natural and anthropogenic climate change and collect climate data) admits that it threw away key data used in global warming calculations (Joseph 2012).

A better policy for forgetting might be to extract as much knowledge as possible out of data before deleting these data. It cannot be guaranteed, however, that future knowledge mining and extraction algorithms will not be capable of discovering more knowledge to preserve. Another potentially viable approach could be "forgetting before storing," that is, there should be a pragmatic reason to store anything. The approach we suggest in the section "Knowledge Self-Management and Refinement through Evolution" follows exactly this way. Though knowledge tokens are extracted from all the incoming data tokens, not all of them are consumed by knowledge organisms, but only those assertions that match their knowledge genome to a sufficient extent. This similarity is considered a good reason for remembering a fact. The rest remains in the environment and dies out naturally after its lifetime comes to an end, as explained in "Knowledge Self-Management and Refinement through Evolution".
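A minimal sketch of such a "forgetting before storing" policy is given below; the genome vocabulary, the overlap measure, and the threshold are assumptions made for illustration, not the mechanism elaborated later in this chapter.

# A hypothetical "forget before storing" filter: an incoming knowledge token
# is kept only if it overlaps the genome vocabulary strongly enough.
GENOME_TERMS = {"aircraft", "order", "forecast", "airline", "market"}

def relevance(token_terms, genome_terms=GENOME_TERMS):
    """Fraction of the token's terms that the genome already 'knows'."""
    token_terms = set(token_terms)
    return len(token_terms & genome_terms) / len(token_terms) if token_terms else 0.0

def store_or_forget(token_terms, threshold=0.5):
    return "store" if relevance(token_terms) >= threshold else "forget"

print(store_or_forget({"boeing", "forecast", "market"}))   # store (2/3 >= 0.5)
print(store_or_forget({"football", "score"}))              # forget (0/2 < 0.5)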

        Contextualizing

Our reflection of the world is often polysemic, so a pragmatic choice of a context is often needed for proper understanding. For example, "taking a mountain hike" and "hiking a market forecast" are different actions though the same lexical root is used in the words. An indication of the context, recreation or business in this example, would be necessary for making the statement explicit. To put it even more broadly, not only the sense of statements, but also judgments, assessments, attitudes, and sentiments about the same data or knowledge token may well differ in different contexts. When it comes to data, it might be useful to know:

1. The "context of origin"—the information about the source: who organized and performed the action; what were the objects; what features have been measured; what were the reasons or motives for collecting these data (transparent or hidden); when and where the data were collected; who were the owners; what were the license, price, etc.
2. The "context of processing"—formats, encryption keys, used preprocessing tools, predicted performance of various data mining algorithms, etc.; and
3. The "context of use"—potential domains, potential or known applications which may use the data or the knowledge extracted from it, potential customers, markets, etc.

Having circumscribed these three different facets of context, we may now say that data contextualization is a transformation process which decontextualizes the data from the context of origin and recontextualizes it into the context of use (Thomason 1998), if the latter is known. This transformation is performed via smart management of the context of processing.

Known data mining methods are capable of automatically separating the so-called "predictive" and "contextual" features of data instances (e.g., Terziyan 2007). A predictive feature stands for a feature that directly influences the result of applying a knowledge extraction instrument—knowledge discovery, prediction, classification, diagnostics, recognition, etc.—to the data:

RESULT = INSTRUMENT(Predictive Features).

Contextual features could be regarded as arguments to a meta-function that influences the choice of the appropriate (based on predicted quality/performance) instrument to be applied to a particular fragment of data:

INSTRUMENT = CONTEXTUALIZATION(Contextual Features).

Hence, a correct way to process each data token and benefit from contextualization would be to: (i) decide, based on the contextual features, which would be an appropriate instrument to process the token; and then (ii) process it using the chosen instrument, which takes the predictive features as an input. This approach to contextualization is not novel and is known in data mining and knowledge discovery as "dynamic" integration, classification, selection, etc. Puuronen et al. (1999) and Terziyan (2001) proved that the use of dynamic contextualization in knowledge discovery yields essential quality improvements.
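The two-step scheme above can be sketched in code as follows; the instrument registry, the contextual features, and the selection rule are hypothetical placeholders rather than the dynamic-integration methods cited.

# Hypothetical instruments (knowledge-extraction tools) keyed by data context.
def classify_text(features):
    return {"kind": "text-classification",
            "label": "positive" if features.get("tone", 0) > 0 else "negative"}

def detect_anomaly(features):
    return {"kind": "anomaly-detection", "anomalous": features.get("spike", 0) > 3.0}

INSTRUMENTS = {"news": classify_text, "sensor": detect_anomaly}

def contextualization(contextual_features):
    """Meta-function: choose an instrument based on the contextual features."""
    return INSTRUMENTS[contextual_features["source_type"]]

def process(token):
    # (i) choose the instrument from the contextual features ...
    instrument = contextualization(token["contextual"])
    # (ii) ... then apply it to the predictive features.
    return instrument(token["predictive"])

token = {"contextual": {"source_type": "news", "language": "en"},
         "predictive": {"tone": 0.7}}
print(process(token))   # {'kind': 'text-classification', 'label': 'positive'}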

Compressing

In the context of Big Data, having data in a compact form is very important for saving storage space or reducing communication overheads. Compressing is a process of data transformation toward making data more compact in terms of required storage space, but still preserving either fully (lossless compression) or partly (lossy compression) the essential features of these data—those potentially required for further processing or use.

Compression in general, and Big Data compression in particular, are effectively possible due to a high probability of the presence of repetitive, periodical, or quasi-periodical data fractions or visible trends within the data. Similar to contextualization, it is reasonable to select an appropriate data compression technique individually for different data fragments (clusters), also in a dynamic manner and using contextualization. Lossy compression may be applied if it is known how the data will be used, at least potentially, so that some data fractions may be sacrificed without losing the facets of semantics and the overall quality of data required for the known ways of its use. A relevant example of a lossy compression technique for data having quasi-periodical features and based on a kind of "meta-statistics" was reported by Terziyan et al. (1998).
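A minimal sketch contrasting the two options is given below, assuming the standard zlib module; downsampling stands in for a lossy technique and is not the meta-statistics approach of Terziyan et al. (1998).

import json
import zlib

# A quasi-periodic series (assumed example data): one year of hourly readings.
series = [round(100 + 10 * ((i % 24) / 24.0), 2) for i in range(24 * 365)]
raw = json.dumps(series).encode("utf-8")

# Lossless: every value can be recovered exactly.
lossless = zlib.compress(raw, 9)

# Lossy: keep every 6th sample; acceptable only if the known uses tolerate it.
downsampled = series[::6]
lossy = zlib.compress(json.dumps(downsampled).encode("utf-8"), 9)

print(len(raw), len(lossless), len(lossy))
assert json.loads(zlib.decompress(lossless)) == series   # exact reconstruction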

        Connecting

It is known that nutrition is healthy and balanced if it provides all the necessary components that are further used as building blocks in a human body. These components become parts of a body and are tightly connected to the rest of it. Big Data could evidently be regarded as nutrition for the knowledge economy, as discussed in "Motivation and Unsolved Issues". A challenge is to make this nutrition healthy and balanced for building an adequate mental representation of the world, which is Big Data understanding. Following the allusion of human body morphogenesis, understanding could be simplistically interpreted as connecting or linking new portions of data to the data that is already stored and understood. This immediately brings us to the concept of linked data (Bizer et al. 2009), where "linked" is interpreted as a sublimate of "understood." We have written "a sublimate" because having data linked is not yet sufficient, though necessary, for the further, more intelligent phase of building knowledge out of data. After data have been linked, data and knowledge mining, knowledge discovery, pattern recognition, diagnostics, prediction, etc. could be done more effectively and efficiently. For example, Terziyan and Kaykova (2012) demonstrated that executing business intelligence services on top of linked data is noticeably more efficient than without using linked data. Consequently, knowledge generated out of linked data could also be linked using the same approach, resulting in linked knowledge. It is clear from the Linking Open Data Cloud Diagram that knowledge modules (e.g., RDF or OWL modules) represented as linked data can be relatively easily linked to different public data sets, which creates a cloud of linked open semantic data.
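A minimal sketch of publishing and interlinking such a fragment of linked data is shown below, assuming the rdflib library is available; the local namespace is hypothetical, and the DBpedia URI merely exemplifies a link into a public data set.

from rdflib import Graph, Literal, Namespace, RDF, URIRef
from rdflib.namespace import FOAF, OWL

EX = Namespace("http://example.org/company/")   # hypothetical local namespace
g = Graph()
g.bind("ex", EX)
g.bind("foaf", FOAF)

# A local entity described as linked data ...
g.add((EX.Boeing, RDF.type, FOAF.Organization))
g.add((EX.Boeing, FOAF.name, Literal("Boeing")))
# ... and linked to a public data set (DBpedia) so that external facts can be reused.
g.add((EX.Boeing, OWL.sameAs, URIRef("http://dbpedia.org/resource/Boeing")))

print(g.serialize(format="turtle"))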

Mitchell and Wilson (2012) argue that the key to extracting value from Big Data lies in exploiting the concept of linked data. They believe that linked data potentially creates ample opportunities from numerous data sources. For example, using links between data as a "broker" brings more possibilities of extracting new data from the old, creating insights that were previously unachievable, and facilitating exciting new scenarios for data processing.

For developing an appropriate connection technology, the results of numerous research and development efforts are relevant, for example, Linking FOAF and Factual. These projects create structured and interlinked semantic content, in fact mashing up features from the Social and Semantic Web (Ankolekar et al. 2007). One strength of their approach is that the collaborative content development effort is propagated up to the level of the data-processing stack, which allows creating semantic representations collaboratively and in an evolutionary manner.

Autonomic Big Data Computing

The treatment offered in the "Refining Big Data Semantics Layer for Balancing Efficiency-Effectiveness" section requires a paradigm shift in Big Data computing. In seeking a suitable approach to building processing infrastructures, a look into Autonomic Computing might be helpful. Started by International Business Machines (IBM) in 2001, Autonomic Computing refers to the characteristics of complex computing systems allowing them to manage themselves without direct human intervention. A human, in fact, defines only general policies that constrain the self-management process. According to IBM, the four major functional areas of autonomic computing are: (i) self-configuration—automatic configuration of system components; (ii) self-optimization—automatic monitoring and ensuring the optimal functioning of the system within defined requirements; (iii) self-protection—automatic identification and protection from security threats; and (iv) self-healing—automatic fault discovery and correction. Other important capabilities of autonomic systems are: self-identity in a sense of being capable of knowing itself, its parts, and resources; situatedness and self-adaptation—sensing the influences from its environment and acting according to what happens in the observed environment and a particular context; being non-proprietary in a sense of not constraining itself to a closed world but being capable of functioning in a heterogeneous world of open standards; and being anticipatory in a sense of being able to automatically anticipate needed resources and seamlessly bridge user tasks to their technological implementations, hiding complexity.

However, having an autonomic system for processing Big Data semantics might not be sufficient. Indeed, even such a sophisticated system may once face circumstances which it would not be capable of reacting to by reconfiguration. So, the design objectives will not be met by such a system, and it should qualify itself as not useful for further exploitation and die. A next-generation software system will then be designed and implemented (by humans) which may inherit some valid features from the ancestor system but shall also have some principally new features. Therefore, it needs to be admitted that it is not always possible for even an autonomic system to adapt itself to a change within its lifetime. Consequently, self-management capability may not be sufficient for the system to survive autonomously—humans are required for giving birth to descendants. Hence, we come to the necessity of a self-improvement feature, which is very close to evolution. In that, we may seek inspiration in bio-social systems. Nature offers an automatic tool for adapting biological species across generations named genetic evolution. An evolutionary process could be denoted as the process of proactive change of the features in the populations of (natural or artificial) life forms over successive generations, providing diversity at every level of life organization. Darwin (1859) put the following principles in the core of his theory:

        • Principle of variation (variations of configuration and behavioral features);
        • Principle of heredity (a child inherits some features from its parents);
• Principle of natural selection (some features make some individuals more competitive than others in getting the resources needed for survival).

These principles may remain valid for evolving software systems, in particular, for Big Data computing. Processing knowledge originating from Big Data may, however, imply more complexity due to its intrinsic social features.
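A toy evolutionary loop along these three principles might look as follows; the trait vocabulary, fitness function, and operator probabilities are arbitrary placeholders and not the knowledge-ecosystem mechanics developed later in this chapter.

import random

TARGET = {"volume", "velocity", "variety", "complexity"}   # toy "environment"
VOCAB = list(TARGET | {"noise", "spam", "clutter"})

def fitness(genome):
    """How well a genome (a set of traits) matches the environment."""
    return len(genome & TARGET) - 0.25 * len(genome - TARGET)

def mutate(genome):
    """Principle of variation: randomly add or drop a trait."""
    g = set(genome)
    if random.random() < 0.5 and g:
        g.remove(random.choice(sorted(g)))
    else:
        g.add(random.choice(VOCAB))
    return g

def crossover(a, b):
    """Principle of heredity: a child inherits traits from both parents."""
    return {t for t in a | b if random.random() < 0.7}

population = [set(random.sample(VOCAB, 2)) for _ in range(20)]
for generation in range(30):
    # Principle of natural selection: the fitter half survives and reproduces.
    population.sort(key=fitness, reverse=True)
    survivors = population[:10]
    children = [mutate(crossover(random.choice(survivors), random.choice(survivors)))
                for _ in range(10)]
    population = survivors + children

print(sorted(population[0]), fitness(population[0]))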

          Knowledge is a product that needs to be shared within a group so that survivability and quality of life of the group members will be higher than those of any individual alone. Sharing knowledge facilitates collaboration and improves individual and group performance. Knowledge is actively consumed and also left as a major inheritance for future generations, for example, in the form of ontologies. As a collaborative and social substance, knowledge and cognition evolve in a more complex way for which additional facets have to be taken into account such as social or group focus of attention, bias, interpretation, explicitation, expressiveness, inconsistency, etc.

In summary, it may be admitted that Big Data is collected and supervised by different communities or cultures that have different cognition mechanisms, standards, objectives, etc. Big Data semantics is processed using naturally different ontologies. All these loosely coupled data and knowledge fractions in fact "live their own lives" based on very complex processes, that is, they evolve following the evolution of these cultures, their cognition mechanisms, standards, objectives, ontologies, etc. An infrastructure for managing and understanding such data straightforwardly needs to be regarded as an ecosystem of evolving processing entities. Below we propose treating ontologies (a key for understanding Big Data) as genomes and bodies of those knowledge-processing entities. For this, the basic principles by Darwin are applied to their evolution, aiming to get optimal or quasi-optimal (according to an evolving definition of quality) populations of knowledge species. These populations represent the evolving understanding of the respective islands of Big Data in their dynamics. This approach to knowledge evolution will require interpretation and implementation of concepts like "birth," "death," "morphogenesis," "mutation," "reproduction," etc., applied to knowledge organisms, their groups, and environments.

        Scaling with a Traditional Database

In some sense, "Big Data" is a term that is increasingly being used to describe very large volumes of unstructured and structured content—usually in amounts measured in terabytes or petabytes—that enterprises want to harness and analyze.

Traditional relational database management technologies, which use indexing for speedy data retrieval and complex query support, have been hard pressed to keep up with the data insertion speeds required for Big Data analytics. Once a database gets bigger than about half a terabyte, some database products' ability to rapidly accept new data starts to decrease.

There are two kinds of scalability, namely vertical and horizontal. Vertical scaling is just adding more capacity to a single machine. Fundamentally, every database product is vertically scalable to the extent that it can make good use of more central processing unit cores, random access memory, and disk space. With a horizontally scalable system, it is possible to add capacity by adding more machines. Beyond doubt, most database products are not horizontally scalable.

When an application needs more write capacity than it can get out of a single machine, it is required to shard (partition) its data across multiple database servers. This is how companies like Facebook have scaled their MySQL installations to massive proportions, and it is the closest one can get to horizontal scalability with such products.

Sharding is a client-side affair, that is, the database server does not do it for the user. In this kind of environment, when someone accesses data, the data access layer uses consistent hashing to determine which machine in the cluster a precise piece of data should be written to (or read from). Adding capacity to a sharded system is a process of manually rebalancing the data across the cluster. In a genuinely horizontally scalable system, by contrast, the database system itself takes care of rebalancing the data and guaranteeing that it is adequately replicated across the cluster. This is what it means for a database to be horizontally scalable.
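A minimal sketch of consistent hashing in such a data access layer is shown below; the shard names, the number of virtual nodes, and the MD5-based ring are assumptions for illustration, with replication and failure handling omitted.

import bisect
import hashlib

class HashRing:
    """A minimal consistent-hash ring for routing keys to database shards."""

    def __init__(self, nodes, vnodes=64):
        self.ring = []                       # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):          # virtual nodes smooth the balance
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()
        self._keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self._keys, h) % len(self._keys)
        return self.ring[idx][1]

ring = HashRing(["db-shard-1", "db-shard-2", "db-shard-3"])
for user in ["ann", "bob", "carl", "dina"]:
    print(user, "->", ring.node_for(user))
# Adding "db-shard-4" would remap only the keys falling on its ring segments,
# not (as with simple modulo hashing) most of the key space.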

In many cases, constructing Big Data systems on premise provides better data flow performance, but requires a greater capital investment. Moreover, one has to consider the growth of the data. While many model linear growth curves, the patterns of data growth within Big Data systems are, interestingly, closer to exponential. Therefore, both technology and costs should be modeled to match the expected growth of the database and of the data flows.

Structured data transformation is the traditional approach of changing the structure of the data found within the source system to the structure of the target system, for instance, a Big Data system. The advantage of most Big Data systems is that deep structure is not a requirement; without doubt, structure can typically be layered in after the data arrive at the target. However, it is a best practice to form the data within the target: it should be a good abstraction of the source operational databases in a structure that allows those who analyze the data within the Big Data system to effectively and efficiently find the data required. The issue to consider with scaling is the amount of latency that transformations cause as data moves from the source(s) to the target and the data are changed in both structure and content. One should therefore avoid complex transformations as data migrates from operational sources to the analytical targets. Once the data are contained within a Big Data system, the distributed nature of the architecture allows for the gathering of the proper result set. So, transformations that cause less latency are more suitable within the Big Data domain.

Large-Scale Data Processing Workflows

The overall infrastructure of many Internet companies can be represented as a pipeline with three layers: Ingestion, Storage & Processing, and Serving. The most vital among the three is the Storage & Processing layer. This layer can be represented as a stack of several sublayers, with a scalable file system such as the Google File System (Ghemawat et al. 2003) at the bottom, a framework for distributed sorting and hashing, for example, Map-Reduce (Dean and Ghemawat 2008), over the file system layer, a dataflow programming framework over the map-reduce layer, and a workflow manager at the top.
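The division of labor in the distributed sorting and hashing sublayer can be illustrated with an in-memory, single-process MapReduce-style word count; this is only a sketch of the programming model, not the Google implementation or the file system beneath it.

from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit (key, value) pairs from one input record."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group values by key (in a cluster, distributed sorting/hashing does this)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values observed for one key."""
    return key, sum(values)

documents = ["Boeing hikes market forecast",
             "Airbus market share grows",
             "market data market data"]
mapped = chain.from_iterable(map_phase(d) for d in documents)
result = dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())
print(result["market"])   # 4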

Debugging large-scale data in Internet firms is crucial because data passes through many subsystems, each having a different query interface, a different metadata representation, and a different underlying model. It is hard to maintain consistency, and it is essential to factor the debugging out of the subsystems. There should be a self-governing system that takes care of all the metadata management. All data-processing subsystems can dispatch their metadata to such a system, which absorbs all the metadata, integrates them, and exposes a query interface for all metadata queries. This can provide a uniform view to users, factors out the metadata management code, and decouples the metadata lifetime from the data/subsystem lifetime.
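A minimal sketch of such a self-governing metadata service is given below; the record fields and the query interface are assumptions made for illustration.

from collections import defaultdict

class MetadataRegistry:
    """Central store that absorbs metadata from all subsystems and answers queries."""

    def __init__(self):
        self.records = []                       # integrated metadata records
        self.by_dataset = defaultdict(list)     # dataset -> records (a simple index)

    def dispatch(self, subsystem, dataset, **attributes):
        """Called by any data-processing subsystem to report its metadata."""
        record = {"subsystem": subsystem, "dataset": dataset, **attributes}
        self.records.append(record)
        self.by_dataset[dataset].append(record)

    def query(self, dataset):
        """Uniform query interface, independent of the reporting subsystem."""
        return self.by_dataset.get(dataset, [])

registry = MetadataRegistry()
registry.dispatch("ingestion", "clickstream-2013-10", rows=10**9, format="log")
registry.dispatch("map-reduce", "clickstream-2013-10", job="sessionize",
                  outputs="sessions-2013-10")
print(registry.query("clickstream-2013-10"))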

Another stimulating problem is dealing with different data and process granularity. Data granularity can vary from a web page, to a table, to a row, to a cell. Process granularity can vary from a workflow, to a map-reduce program, to a map-reduce task. It is very hard to make an inference when the given relationship is in one granularity and the query is in another granularity, and therefore it is vital to capture provenance data across the workflow. While there is no one-size-fits-all solution, a good methodology could be to use the best granularity at all levels. However, this may cause a lot of overhead, and thus some smart domain-specific techniques need to be implemented (Lin and Dyer 2010; Olston 2012).

        Knowledge Self-Management and Refinement through Evolution

The world changes—and so do the beliefs and reflections about it. Those beliefs and reflections are the knowledge humans have about their environments. However, the nature of those changes is different. The world just changes in events. Observation or sensing (Ermolayev et al. 2008) of events invokes the generation of data—often in huge volumes and with high velocities. Humans evolve—they adapt themselves to become better fitted to their habitat.

Knowledge is definitely a product of some processes carried out by conscious living beings (for example, humans). Following Darwin's (1859) approach and terminology to some extent, it may be stated that knowledge, both in terms of scope and quality, makes some individuals more competitive than others in getting vital resources, or at least in improving their quality of life. The major role of knowledge as a required feature for survival is decision-making support. Humans differ in fortune and fate because they make different choices in similar situations, which is largely due to their possession of different knowledge. So, the evolution of conscious beings noticeably depends on the knowledge they possess. On the other hand, making a choice in turn triggers the production of knowledge by a human. Therefore, it is natural to assume that knowledge evolves triggered by the evolution of conscious beings, their decision-making needs and taken decisions, quality standards, etc. To put both halves in one whole, knowledge evolves in support of the needs of those who possess and use it, for example, to better interpret or explain the data generated when observing events, corresponding to the diversity and complexity of these data. This observation leads us to a hypothesis about the way knowledge evolves:

          

The mechanisms of knowledge evolution are very similar to the mechanisms of biological evolution. Hence, the methods and mechanisms for the evolution of knowledge could be spotted from the ones enabling the evolution of living beings.

In particular, investigating the analogies and developing the mechanisms for the evolution of formal knowledge representations—specified as ontologies—is of interest for the Big Data semantics layer (Figure 1.5). The triggers for ontology evolution in the networked and interlinked environments could be external influences coming bottom-up from external and heterogeneous information streams.

Recently, the role of ontologies as formal and consensual knowledge representations has become established in different domains where the use of knowledge representations and reasoning is an essential requirement. Examples of these domains range from heterogeneous sensor network data processing through the Web of Things to Linked Open Data management and use. In all these domains, distributed information artifacts change sporadically and intensively in reflection of the changes in the world. However, the descriptions of the knowledge about these artifacts do not evolve in line with these changes.

Typically, ontologies are changed semiautomatically or even manually and are available in a sequence of discrete revisions. This fact points out a serious disadvantage of ontologies built using state-of-the-art knowledge engineering and management frameworks and methodologies: expanding and amplified distortion between the world and its reflection in knowledge. It is also one of the major obstacles for a wider acceptance of semantic technologies in industries (see also Hepp 2007; Tatarintseva et al. 2011).

The diversity of domain ontologies is an additional complication for the proper and efficient use of dynamically changing knowledge and information artifacts for processing Big Data semantics. Currently, the selection of the best-suited one for a given set of requirements is carried out by a knowledge engineer using his/her subjective preferences. A more natural evolutionary approach for selecting the best-fitting knowledge representations promises enhanced robustness and transparency, and seems to be more technologically attractive.

Further, we elaborate a vision of a knowledge evolution ecosystem where agent-based software entities carry their knowledge genomes in the form of ontology schemas and evolve in response to the influences perceived from their environments. These influences are thought of as the tokens of Big Data (like news tokens in the "Illustrative Example" section) coming into the specific environments these entities inhabit, so that the evolving knowledge representations reflect the change in the world snap-shotted by Big Data tokens. Inspiration and analogies are taken from evolutionary biology.

Knowledge Organisms, Their Environments, and Features

          Evolving software entities are further referred to as individual Knowledge

          Organisms (KO). It is envisioned (Figure 1.6) that a KO:

          1. Is situated in its environment as described in “Environment, Perception (Nutrition), and Mutagens”

2. Carries its individual knowledge genome represented as a schema or Terminological Box (TBox; Nardi and Brachman 2007) of the respective ontology (see "Knowledge Genome and Knowledge Body")

3. Has its individual knowledge body represented as an assertional component (ABox; Nardi and Brachman 2007) of the respective ontology (see "Knowledge Genome and Knowledge Body")

          4. Is capable of perceiving the influences from the environment in the form of knowledge tokens (see “Environment, Perception (Nutrition), and Mutagens”) that may cause the changes in the genome (see “Mutation”) and body (see “Morphogenesis”)—the mutagens

          5. Is capable of deliberating about the affected parts of its genome and body (see “Morphogenesis” and “Mutation”)

Figure 1.6 A Knowledge Organism: functionality and environment (perception, deliberation, communication, morphogenesis, mutation, recombination, reproduction, and excretion acting on the genome (TBox) and body (ABox) within the KO environment). Small triangles of different transparency represent knowledge tokens in the environment—consumed and produced by KOs.


6. Is capable of consuming some parts of a mutagen for: (a) morphogenesis changing only the body (see "Morphogenesis"); (b) mutation changing both the genome and body (see "Mutation"); or (c) recombination—a mutual enrichment of several genomes in a group of KOs which may trigger reproduction—a recombination of body replicas giving "birth" to a new KO (see "Recombination and Reproduction")

7. Is capable of excreting the unused parts of mutagens or the "dead" parts of the body to the environment

The results of mainstream research in distributed artificial intelligence and semantic technologies suggest the following basic building blocks for developing a KO. The features of situatedness (Jennings 2000) and deliberation (Wooldridge and Jennings 1995) are characteristic of intelligent software agents, while the rest of the required functionality could be developed using the achievements in Ontology Alignment (Euzenat and Shvaiko 2007). Recombination involving a group of KOs could be thought of based on the known mechanisms for multi-issue negotiations on semantic contexts (e.g., Ermolayev et al. 2005) among software agents—the members of a reproduction group.
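To make these capabilities more tangible, here is a minimal, purely illustrative sketch of a KO as a software entity in Python. The class and method names (KnowledgeOrganism, perceive, deliberate, excrete) and the representation of the genome and body as plain sets and dictionaries are assumptions made for this sketch, not an implementation prescribed in this chapter.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeToken:
    """A small set of interrelated assertions (individual -> concept) extracted from data."""
    assertions: dict

@dataclass
class KnowledgeOrganism:
    """Hypothetical skeleton of a KO: a genome (TBox concept names) and a body (ABox assertions)."""
    genome: set = field(default_factory=set)    # concept names the KO "knows" (TBox)
    body: dict = field(default_factory=dict)    # individual -> concept (ABox)

    def perceive(self, token: KnowledgeToken) -> None:
        """Feature 4: perceive a knowledge token coming from the environment."""
        consumed, excreted = self.deliberate(token)
        self.body.update(consumed)               # features 6a/6b: grow the body (morphogenesis)
        self.excrete(excreted)                   # feature 7: return the unused parts

    def deliberate(self, token: KnowledgeToken):
        """Feature 5: split the token into parts matching the genome and parts that do not."""
        consumed = {i: c for i, c in token.assertions.items() if c in self.genome}
        excreted = {i: c for i, c in token.assertions.items() if c not in self.genome}
        return consumed, excreted

    def excrete(self, assertions: dict) -> None:
        """Feature 7: put unused assertions back into the environment (here simply reported)."""
        if assertions:
            print("excreted:", assertions)

# Usage: a KO whose genome covers only PlaneMaker and MarketForecast
ko = KnowledgeOrganism(genome={"PlaneMaker", "MarketForecast"})
ko.perceive(KnowledgeToken({"Boeing": "PlaneMaker", "Japan": "Country"}))
print("body:", ko.body)
```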

          Environment, Perception (Nutrition), and Mutagens

An environmental context for a KO could be thought of as an areal of its habitat. Such a context needs to be able to provide nutrition that is "healthy" for particular KO species, that is, matching their genome noticeably. The food for nutrition is provided by the Knowledge Extraction and Contextualization functionality (Figure 1.7) in the form of knowledge tokens. Hence, several and possibly overlapping environmental contexts need to be regarded in a hierarchy which corresponds to several subject domains of interest and a foundational knowledge layer. By saying this, we assume that there is a single domain or foundational ontology module schema per environmental context. Different environmental contexts corresponding to different subject domains of interest are pictured as ellipses in Figure 1.7.

Figure 1.7 Knowledge Extraction and Contextualization: information tokens from business news streams (e.g., the Reuters item of July 3, 2012, reporting that U.S. planemaker Boeing hiked its 20-year market forecast, predicting demand for 34,000 new aircraft worth $4.5 trillion) are transformed into knowledge tokens (e.g., Boeing:PlaneMaker, basedIn UnitedStates, hikes New20YMarketForecast, Old20YMarketForecast:MarketForecast) and sown into the corresponding environmental contexts (Plane Maker Business, Airline Business, and other domains).

Environmental contexts are sown with knowledge tokens that correspond to their subject domains. It might be useful to limit the lifetime of a knowledge token in an environment—those which are not consumed finally dissolve when their lifetime ends. Fresh and older knowledge tokens are pictured with different transparency in Figure 1.7.

KOs inhabit one or several overlapping environmental contexts based on the nutritional healthiness of the knowledge tokens sown there, that is, the degree to which these knowledge tokens match the genome of a particular KO. KOs use their perceptive ability to find and consume knowledge tokens for nutrition. A KO may decide to migrate from one environment to another based on the availability of healthy food there. Knowledge tokens that only partially match a KO's genome may cause both KO body and genome changes and are thought of as mutagens. Mutagens, in fact, deliver the information about the changes in the world to the environments of KOs.

Knowledge tokens are extracted from the information tokens either in a stream window or from the updates of the persistent data storage and are further sown in the appropriate environmental context. The context for placing a newly coming knowledge token is chosen by the contextualization functionality (Figure 1.7) based on the match ratio to the ontology schema characterizing the context in the environment. Those knowledge tokens that are not mapped well to any of the ontology schemas are sown in the environment without attributing them to any particular context.
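A possible reading of this contextualization step is sketched below, under the assumption that each environmental context can be summarized by the set of concept names in its ontology schema and that a fixed threshold decides when a token stays unattributed; both the schemas and the threshold value are illustrative.

```python
def match_ratio(token_concepts: set, context_schema: set) -> float:
    """Fraction of the token's concepts covered by the context's ontology schema."""
    if not token_concepts:
        return 0.0
    return len(token_concepts & context_schema) / len(token_concepts)

def contextualize(token_concepts: set, contexts: dict, threshold: float = 0.5):
    """Return the name of the best-matching environmental context, or None if no ratio is high enough."""
    best_name, best_ratio = None, 0.0
    for name, schema in contexts.items():
        ratio = match_ratio(token_concepts, schema)
        if ratio > best_ratio:
            best_name, best_ratio = name, ratio
    return best_name if best_ratio >= threshold else None

contexts = {
    "Plane Maker Business": {"PlaneMaker", "MarketForecast", "Country"},
    "Airline Business": {"Airline", "EfficientNewPlane", "Country"},
}
token = {"PlaneMaker", "MarketForecast", "Country", "Airline"}
print(contextualize(token, contexts))   # -> "Plane Maker Business" (match ratio 0.75)
```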

For this, existing shallow knowledge extraction techniques could be exploited, for example, Fan et al. (2012a). The choice of appropriate technique depends on the nature and modality of the data. Such a technique would extract several interrelated assertions from an information token and provide these as a knowledge token coded in a knowledge representation language of an appropriate expressiveness, for example, in a tractable subset of the Web Ontology Language (OWL) 2.0 (W3C 2009). Information* and knowledge tokens for the news item of our Boeing example are pictured in Figure 1.7.
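As an illustration of what such an extracted knowledge token might look like, the sketch below encodes a few assertions of the Boeing example as an RDF graph with the rdflib library. The namespace, property names, and the literal value are assumptions made for this example rather than part of any standard vocabulary.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/planemaker#")  # hypothetical namespace for the example
g = Graph()
g.bind("ex", EX)

# Class assertions (ABox) extracted from the news item
g.add((EX.Boeing, RDF.type, EX.PlaneMaker))
g.add((EX.UnitedStates, RDF.type, EX.Country))
g.add((EX.New20YMarketForecastbyBoeing, RDF.type, EX.MarketForecast))
g.add((EX.Old20YMarketForecastbyBoeing, RDF.type, EX.MarketForecast))

# Property assertions relating the extracted individuals
g.add((EX.Boeing, EX.basedIn, EX.UnitedStates))
g.add((EX.Boeing, EX.hikes, EX.New20YMarketForecastbyBoeing))
g.add((EX.New20YMarketForecastbyBoeing, EX.salesVolume, Literal("4.5 trillion USD")))

print(g.serialize(format="turtle"))
```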

* Unified Modeling Language (UML) notation is used for picturing the knowledge token in Figure 1.7 because it is more illustrative. Though not shown in Figure 1.7, it can be straightforwardly coded in OWL.

Knowledge Genome and Knowledge Body

          Two important aspects in contextualized knowledge representation for an outlined knowledge evolution ecosystem have to be considered with care (Figure 1.8):

        • A knowledge genome etalon for a population of KOs belonging to one species
• An individual knowledge genome and body for a particular KO

A knowledge genome etalon may be regarded as the schema (TBox) of a distinct ontology module which represents an outstanding context in a subject domain. In our proposal, the etalon genome is carried by a dedicated Etalon KO (EKO; Figure 1.8) to enable alignments with individual genomes and other etalons in a uniform way. The individual assertions (ABox) of this ontology module are spread over the individual KOs belonging to the corresponding species—forming their individual bodies.

The individual genomes of those KOs are the recombined genomes of the KOs that gave birth to this particular KO. At the beginning, the individual genomes may be replicas of the etalon genome. In any case, they evolve independently in mutations, because of the morphogenesis of an individual KO, or because of recombinations in reproductive groups.

Figure 1.8 Knowledge genomes and bodies: an EKO carrying the etalon genome of a species, and KOs a and b carrying individual genomes and bodies in their environmental contexts. Different groups of assertions in a KO body are attributed to different elements of its genome, as shown by dashed arrows. The more assertions relate to a genome element, the stronger (more dominant) that element is.

Different elements (concepts, properties, axioms) in a knowledge genome may possess different strengths, that is, be dominant or recessive. For example (Figure 1.8), concept C1 in the genome of KOa is quite strong because it is reinforced by a significant number of individual assertions attributed to this concept, that is, dominant. On the contrary, C1 in the genome of KOb is very weak—that is, recessive—as it is not supported by individual assertions in the body of KOb. Recessiveness or dominance values may be set and altered using techniques like spreading activation (Quillian 1967, 1969; Collins and Loftus 1975), which also appropriately affect the structural contexts (Ermolayev et al. 2005, 2010) of the elements in focus.
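A minimal sketch of how such dominance values could be derived is given below, assuming that direct support is simply the number of body assertions attributed to a genome element and that a single step of spreading activation passes a fraction of that support to related elements; the decay factor and data structures are illustrative assumptions.

```python
from collections import Counter

def dominance(body: dict, genome_links: dict, decay: float = 0.5) -> Counter:
    """
    body: individual -> concept it is attributed to (ABox).
    genome_links: concept -> related concepts in the genome (TBox structure).
    Returns activation scores: direct support plus one step of spreading activation.
    """
    direct = Counter(body.values())              # assertions directly supporting each concept
    spread = Counter(direct)
    for concept, count in direct.items():
        for neighbour in genome_links.get(concept, []):
            spread[neighbour] += decay * count   # pass a share of the support to neighbours
    return spread

body_a = {"Boeing": "PlaneMaker", "AirBus": "PlaneMaker",
          "New20YMarketForecastbyBoeing": "MarketForecast"}
links = {"PlaneMaker": ["MarketForecast"], "MarketForecast": ["PlaneMaker"]}
print(dominance(body_a, links))
# PlaneMaker ends up dominant; a concept with no attributed assertions would stay recessive (score 0)
```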

Recessive elements may be kept in the genome as parts of the genetic memory, but only as long as they do not contradict any dominant elements. For example, if a dominant property of the PlaneMaker concept in a particular period of time is PlaneMaker–hikes–MarketForecast, then a recessive property PlaneMaker–lessens–MarketForecast may die out soon with high probability, as contradictory to the corresponding dominant property.

The etalon genome of a species evolves in line with the evolution of the individual genomes. The difference, however, is that an EKO has no direct relationship (situatedness) to any environmental context. So, all evolution influences are provided to the EKO by the individual KOs belonging to the corresponding species via communication. If an EKO and KOs are implemented as agent-based software entities, techniques like agent-based ontology alignment are of relevance for evolving etalon genomes. In particular, the alignment settings are similar to a Structural Dynamic Uni-directional Distributed (SDUD) ontology alignment problem (Ermolayev and Davidovsky 2012). The problem could be solved using multi-issue negotiations on semantic contexts, for example, following the approach of Ermolayev et al. (2005) and Davidovsky et al. (2012). For assuring the consistency of the updated ontology modules after alignment, several approaches are applicable: incremental updates for atomic decompositions of ontology modules (Klinov et al. 2012); checking the correctness of ontology contexts using the ontology design patterns approach (Gangemi and Presutti 2009); evaluating formal correctness using formal (meta-)properties (Guarino and Welty 2001).

An interesting case would be if the individual genome of a particular KO evolves very differently to the rest of the KOs in the species. This may happen if such a KO is situated in an environmental context substantially different from the context where the majority of the KOs of this species are collecting knowledge tokens. For example, the dominance and recessiveness values in the genome of KOb (Figure 1.8) differ noticeably from those of the genomes of the KOs similar to KOa. A good reason for this may be that KOb is situated in an environmental context different to the context of KOa, so the knowledge tokens KOb may consume are different to the food collected by KOa. Hence, the changes to the individual genome of KOb will be noticeably different to those of KOa after some period of time. Such a genetic drift may cause the KO to fall outside the similarity boundaries of its species, that is, the group of KOs for which recombination gives ontologically viable posterity. A new knowledge genome etalon may, therefore, emerge if the group of the KOs with genomes drifted in a similar direction reaches a critical mass—giving birth to a new species.

The following are the features required to extend an ontology representation language to cope with the mentioned evolutionary mechanisms:

• A temporal extension that allows representing and reasoning about the lifetime and temporal intervals of validity of the elements in knowledge genomes and bodies. One relevant extension and reasoning technique is OWL-MET (Keberle 2009).
        • An extension that allows assigning meta-properties to ontological elements for verifying formal correctness or adherence to relevant design patterns. Relevant formalisms may be sought following Guarino and Welty (2001) or Gangemi and Presutti (2009).

          Morphogenesis

Morphogenesis in a KO could be seen as a process of developing the shape of a KO body. In fact, such a development is done by adding new assertions to the body and attributing them to the correct parts of the genome. This process could be implemented using an ontology instance migration technique (Davidovsky et al. 2011); however, the objective of morphogenesis differs from that of ontology instance migration. The task of the latter is to ensure correctness and completeness, that is, that, ideally, all the assertions are properly aligned with and added to the target ontology ABox. Morphogenesis requires that only the assertions that fit well to the TBox of the target ontology are consumed for shaping it out. Those below the fitness threshold are excreted. If, for example, a mutagen perceived by a KO is the one of our Boeing example* presented in Figures 1.2 or 1.7, then the set of individual assertions will be

{AllNipponAirways:Airline, B787-JA812A:EfficientNewPlane, Japan:Country, Boeing:PlaneMaker, New20YMarketForecastbyBoeing:MarketForecast, UnitedStates:Country, Old20YMarketForecastbyBoeing:MarketForecast}.   (1.1)

Let us now assume that the genome (TBox) of the KO contains only the concepts represented in Figure 1.2 as grey-shaded classes—{Airline, PlaneMaker, MarketForecast}—and thick-line relationships—{seeksFor–soughtBy}. Then only the assertions from (1.1) attributed to these concepts (AllNipponAirways:Airline, Boeing:PlaneMaker, New20YMarketForecastbyBoeing:MarketForecast, and Old20YMarketForecastbyBoeing:MarketForecast) could be consumed for morphogenesis by this KO, and the rest have to be excreted back to the environment. Interestingly, the ratio of mutagen ABox consumption may be used as a good metric for a KO in deliberations about its resistance to mutations, its desire to migrate to a different environmental context, or whether to start seeking reproduction partners.

* The syntax for representing individual assertions is similar to the UML syntax, for compatibility with the figures.
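The consume-or-excrete decision for the assertion set (1.1) can be made concrete with the small sketch below, which assumes that fitness to the genome is judged simply by whether the asserted concept occurs in the KO's TBox; a real KO would use ontology alignment rather than exact name matching.

```python
# Assertions from (1.1): individual -> asserted concept
token_abox = {
    "AllNipponAirways": "Airline",
    "B787-JA812A": "EfficientNewPlane",
    "Japan": "Country",
    "Boeing": "PlaneMaker",
    "New20YMarketForecastbyBoeing": "MarketForecast",
    "UnitedStates": "Country",
    "Old20YMarketForecastbyBoeing": "MarketForecast",
}
genome_tbox = {"Airline", "PlaneMaker", "MarketForecast"}   # grey-shaded classes of Figure 1.2

consumed = {i: c for i, c in token_abox.items() if c in genome_tbox}
excreted = {i: c for i, c in token_abox.items() if c not in genome_tbox}

# Consumption ratio: a possible input to the KO's deliberation about migration or reproduction
ratio = len(consumed) / len(token_abox)
print(consumed)                             # assertions used for morphogenesis
print(excreted)                             # assertions returned to the environment
print(f"consumption ratio: {ratio:.2f}")    # 4/7, approximately 0.57
```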

Another important case in a morphogenesis process is detecting a contradiction between a newly coming mutagenic assertion and an assertion that is already in the body of the KO. For example, let us assume that the body already comprises the property SalesVolume of the assertion named New20YMarketForecastbyBoeing with the value of 2.1 million. The value of the same property coming with the mutagen equals 4.5 million. So, the KO has to resolve this contradiction by either (i) deciding to reshape its body by accepting the new assertion and excreting the old one, or (ii) resisting and declining the change. Another possible behavior would be collecting and keeping at hand the incoming assertions until their dominance is proved by quantity. Dominance may be assessed using different metrics; for example, a relevant technique is offered by the Strength Value-Based Argumentation Framework (Isaac et al. 2008).
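One deliberately simple policy for the third behavior mentioned above (keeping competing values until dominance is proved by quantity) could look as follows; the class name and the vote-counting rule are assumptions for illustration, and a production system might instead use the argumentation framework cited above.

```python
from collections import Counter

class PropertySlot:
    """Keeps competing values for one property (e.g., SalesVolume) until one dominates by quantity."""
    def __init__(self, value):
        self.value = value                  # currently accepted value
        self.votes = Counter({value: 1})    # how often each value has been observed

    def offer(self, new_value):
        """Register an incoming (possibly contradictory) value; switch only when it dominates."""
        self.votes[new_value] += 1
        if self.votes[new_value] > self.votes[self.value]:
            self.value = new_value          # option (i): reshape the body, excrete the old value
        return self.value                   # otherwise option (ii): resist the change

slot = PropertySlot("2.1 million")
print(slot.offer("4.5 million"))   # still "2.1 million": one observation each, the KO resists
print(slot.offer("4.5 million"))   # now "4.5 million" dominates by quantity and is accepted
```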

          Mutation

Mutation of a KO could be understood as the change of its genome caused by the environmental influences (mutagenic factors) coming with the consumed knowledge tokens. Similar to biological evolution, a KO and its genome are resistant to mutagenic factors and do not change at once because of any incoming influence, but only because of those which could not be ignored because of their strength. Different genome elements may be differently resistant. Let us illustrate different aspects of mutation and resistance using our Boeing example. As depicted in Figure 1.9, the change of the AirPlaneMaker concept name (to PlaneMaker) in the genome did not happen, though a new assertion had been added to the body as a result of morphogenesis (Boeing: (PlaneMaker) AirPlaneMaker*). The reason the AirPlaneMaker concept resisted this mutation was that the assertions attributed to the concept of PlaneMaker were in the minority—so, the mutagenic factor has not yet been strong enough. This mutation will have a better chance to occur if similar mutagenic factors continue to come in and the old assertions in the body of the KO die out because their lifetime periods come to an end. More generally, the more individual assertions are attributed to a genome element at a given point in time, the more resistant this genome element is to mutations.
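A sketch of this resistance rule: a concept rename proposed by mutagens is accepted only when the assertions carrying the new name outnumber those still using the genome name. The majority criterion and the listed individuals (including Embraer) are illustrative assumptions.

```python
def rename_accepted(body: dict, old_name: str, new_name: str) -> bool:
    """
    body maps each individual to a pair: (concept name used in the KO genome,
    concept name observed in incoming knowledge tokens, or None if identical).
    The rename old_name -> new_name happens only if observations of the new name
    are in the majority among assertions attributed to old_name.
    """
    attributed = [observed for genome_name, observed in body.values() if genome_name == old_name]
    votes_new = sum(1 for observed in attributed if observed == new_name)
    votes_old = len(attributed) - votes_new
    return votes_new > votes_old

body = {
    "Boeing":  ("AirPlaneMaker", "PlaneMaker"),   # came in as PlaneMaker, stored under AirPlaneMaker
    "AirBus":  ("AirPlaneMaker", None),           # older assertions still use the genome name
    "Embraer": ("AirPlaneMaker", None),
}
print(rename_accepted(body, "AirPlaneMaker", "PlaneMaker"))   # False: the mutagenic factor is still too weak
```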

In contrast to the AirPlaneMaker case, the mutations brought by the hikes–hikedBy and successorOf–predecessorOf object properties did happen (Figure 1.9) because the KO did not possess any (strong) argument to resist them. Indeed, there were no contradictory properties in either the genome or the body of the KO before it accepted the grey-shaded assertions as a result of morphogenesis.

* UML syntax is used as basic. The name of the class from the knowledge token is added in brackets before the name of the class to which the assertion is attributed in the KO body. This is done for keeping the information about the occurrences of a different name in the incoming knowledge tokens. This historical data may further be used for evaluating the strength of such mutagenic factors.

Not all the elements of an incoming knowledge token could be consumed by a KO. In our example (Figure 1.9), some of the structural elements (Airline, EfficientNewPlane, seeksFor–soughtBy) were:
• Too different to the genome of this particular KO, so the similarity factor was too low and the KO did not find any match to its TBox. Hence, the KO was not able to generate any replacement hypotheses, also called propositional substitutions (Ermolayev et al. 2005).
• Too isolated from the elements of the genome—having no properties relating them to the genome elements. Hence, the KO was not able to generate any merge hypotheses.

These unused elements are excreted (Figure 1.9) back to the environment as a knowledge token. This token may further be consumed by another KO with a different genome comprising matching elements. Such a KO may migrate from a different environmental context (e.g., Airline Business).

Figure 1.9 Mutation in a KO (the Boeing example): the genome of the mutating KO (AirPlaneMaker, MarketForecast, Country, with the hikes–hikedBy, successorOf–predecessorOf, basedIn–baseOf, and SalesVolume properties) and its body before and after perceiving the knowledge token. Consumed assertions (e.g., Boeing: (PlaneMaker) AirPlaneMaker; New20YMarketForecastbyBoeing: MarketForecast with SalesVolume = 4.5 trillion; UnitedStates: Country) cause morphogenesis and mutations, while the elements irrelevant to the genome (Airline, EfficientNewPlane, seeksFor–soughtBy, and their assertions) are excreted back to the environment as a knowledge token.

            Similar to morphogenesis, mutation may be regarded as a subproblem of ontology alignment. The focus is, however, a little bit different. In contrast to morphogenesis which was interpreted as a specific ontology instance migration problem, mutation affects the TBox and is therefore structural ontology alignment (Ermolayev and Davidovsky 2012). There is a solid body of related work in structural ontology alignment. Agent-based approaches relevant to our context are surveyed, for example, in Ermolayev and Davidovsky (2012).

            In addition to the requirements already mentioned above, the following features extending an ontology representation language are essential for coping with the mechanisms of mutation:

• The information about the attribution of a consumed assertion to a particular structural element in the knowledge token needs to be preserved for future use in possible mutations. An example is given in Figure 1.9—Boeing: (PlaneMaker) AirPlaneMaker. The name of the concept in the knowledge token (PlaneMaker) is preserved and the assertion is attributed to the AirPlaneMaker concept in the genome.

            Recombination and Reproduction

Like mutation, recombination is a mechanism of adapting KOs to environmental changes. Recombination involves a group of KOs belonging to one or several similar species with partially matching genomes. In contrast to mutation, recombination is triggered and performed differently. Mutation is invoked by external influences coming from the environment in the form of mutagens. Recombination is triggered by a conscious intention of a KO to make its genome more resistant and therefore better adapted to the environment in its current state. Conscious in this context means that a KO first analyzes the strength and adaptation of its genome, detects weak elements, and then reasons about the necessity of acquiring external reinforcements for these weak elements. Weaknesses may be detected by:

• Looking at the proportion of consumed and excreted parts in the perceived knowledge tokens—reasoning about how healthy the food in its current environmental context is. If it is not healthy enough, then new elements extending the genome for increasing consumption and decreasing excretion may be desired to be acquired.
• Looking at the resistance of the elements in the genome to mutations. If weaknesses are detected, then it may be concluded that the assertions required for making these structural elements stronger are either nonexistent in the environmental context or are not consumable by the KO as it is. In the latter case, acquiring new genome elements through recombination may be useful. In the former case (nonexistence), the KO may decide to move to a different environmental context.

Recombination of KOs as a mechanism may be implemented using several available technologies. Firstly, a KO needs to reason about the strengths and weaknesses of the elements in its genome. For this, in addition to the extra knowledge representation language features mentioned above, it needs a simple reasoning functionality (pictured in Figure 1.6 as Deliberation). Secondly, a KO requires a means for getting in contact with other KOs and checking if they have similar intentions to recombine their genomes. For this, the available mechanisms for communication (e.g., Labrou et al. 1999; Labrou 2006), meaning negotiation (e.g., Davidovsky et al. 2012), and coalition formation (e.g., Rahwan 2007) could be relevant.
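The weakness-detection deliberation described by the two bullet points above could be sketched as follows, with two assumed thresholds: a KO that consumes too small a share of what it perceives considers migration, while poorly supported genome elements make it seek recombination partners. Both threshold values are illustrative.

```python
def deliberate_adaptation(consumed: int, excreted: int, support: dict,
                          min_consumption: float = 0.5, min_support: int = 2):
    """
    consumed / excreted: counts of assertion parts kept vs. returned to the environment.
    support: genome element -> number of assertions currently backing it.
    Returns a (decision, details) pair; the thresholds are illustrative assumptions.
    """
    total = consumed + excreted
    consumption_ratio = consumed / total if total else 0.0
    weak_elements = [e for e, n in support.items() if n < min_support]

    if consumption_ratio < min_consumption:
        return "migrate", {"consumption_ratio": consumption_ratio}
    if weak_elements:
        return "seek recombination", {"weak_elements": weak_elements}
    return "stay", {"consumption_ratio": consumption_ratio}

print(deliberate_adaptation(consumed=4, excreted=3,
                            support={"PlaneMaker": 3, "MarketForecast": 3, "Country": 1}))
# -> ('seek recombination', {'weak_elements': ['Country']})
```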

Reproduction is based on the recombination mechanism and its results, and goes further by combining the replicas of the bodies of those KOs who take part in the recombination group, resulting in the production of a new KO. A KO may intend to reproduce itself because its lifetime period comes to an end or because of other individual or group stimuli that remain to be researched.

            Populations of Knowledge Organisms

KOs may belong to different species—the groups of KOs that have similar genomes based on the same etalon carried by the EKO. KOs that share the same areal of habitat (environmental context) form the population which may comprise the representatives of several species. Environmental contexts may also overlap. So, the KOs of different species have possibilities to interact. With respect to species and populations, the mechanisms of (i) migration, (ii) genetic drift, (iii) speciation, and (iv) breeding for evolving knowledge representations are of interest.

Migration is the movement of KOs from one environmental context to another for the different reasons mentioned in the "Knowledge Organisms, Their Environments, and Features" section. Genetic drift is the change of genomes to a degree beyond the species tolerance (similarity) threshold, caused by the cumulative effect of a series of mutations, as explained in the "Knowledge Genome and Knowledge Body" section. The speciation effect occurs if genetic drift results in a distinct group of KOs capable of reproducing themselves with their recombined genomes.

If knowledge evolves in a way similar to biological evolution, the outcome of this process would best fit the KOs' desire for environmental mimicry, but perhaps not the requirements of ontology users. Therefore, for ensuring human stakeholders' commitment to the ontology, it might be useful to keep the evolution process under control. For this, constraints, or restrictions in another form, may be introduced for relevant environmental contexts and KO species in order to steer their evolution toward a desired goal. This artificial way of control over the natural evolutionary order of things may be regarded as breeding—a controlled process of sequencing desired mutations that causes the emergence of a species with the required genome features.

Fitness of Knowledge Organisms and Related Ontologies

It has been repeatedly stated in the discussion of the features of KOs in "Knowledge Organisms, Their Environments, and Features" that they exhibit proactive behavior. One topical case is that a KO would rather migrate away from the current environmental context instead of continuing to consume knowledge tokens which are not healthy for it in terms of structural similarity to its genome. It has also been mentioned that a KO may cooperate with other KOs to fulfill its evolutionary intentions. For instance, KOs may form cooperative groups for recombination or reproduction. They also interact with their EKOs for improving the etalon genome of the species.

Another valid case, though not mentioned in "Knowledge Organisms, Their Environments, and Features", would be if a certain knowledge token is available in the environment and two or more KOs approach it concurrently with an intention to consume it. If those KOs are cooperative, the token will be consumed by the one which needs it most—so that the overall "strength" of the species is increased. Otherwise, if the KOs are competitive, as often happens in nature, the strongest KO will get the token. All these cases require a quantification of the strength, or fitness, of KOs and knowledge tokens. Fitness is, in fact, a complex metric having several important facets.

Firstly, we summarize what the fitness of a KO means. We point out that their fitness is inseparable from (in fact, symmetric to) the fitness of the knowledge tokens that KOs consume from and excrete back to their environmental contexts. Then, we describe several factors which contribute to fitness. Finally, we discuss how several dimensions of fitness could be used to compare different KOs.

To start our deliberations about fitness, we have to map the high-level understanding of this metric to the requirements of Big Data processing as presented in the "Motivation and Unsolved Issues" and "State of Technology, Research, and Development in Big Data Computing" sections in the form of the processing stack (Figures 1.3 through 1.5). The grand objective of a Big Data computing system or infrastructure is providing a capability for data analysis with balanced effectiveness and efficiency. In particular, this capability subsumes facilitating decision-making and classification, providing adequate inputs to software applications, etc. An evolving knowledge ecosystem, comprising environmental contexts populated with KOs, is introduced in the semantics processing layer of the overall processing stack. The aim of introducing the ecosystem is to ensure a seamless and balanced connection between a user who operates the system at the upper layers and the data coming into the system at the lower layers.

Ontologies are the "blood and flesh" of the KOs and the whole ecosystem, as they are both the code registering a desired evolutionary change and the result of this evolution. From the data-processing viewpoint, the ontologies are consensual knowledge representations that facilitate improving data integration, transformation, and interoperability between the processing nodes in the infrastructure. A seamless connection through the layers of the processing stack is facilitated by the way ontologies are created and changed. As already mentioned in the introduction of the "Knowledge Self-Management and Refinement through Evolution" section, ontologies are traditionally designed beforehand and further populated by assertions taken from the source data. In our evolving ecosystem, ontologies evolve in parallel to data processing. Moreover, the changes in ontologies are caused by the mutagens brought by the incoming data. The knowledge extraction subsystem (Figure 1.7) transforms units of data into knowledge tokens. These in turn are sown in a corresponding environmental context by a contextualization subsystem and further consumed by KOs. KOs may change their body or even mutate due to the changes brought by consumed mutagenic knowledge tokens. The changes in the KOs are in fact changes in the ontologies they carry. So, ontologies change seamlessly and naturally in a way that best suits the substance brought in by the data. For assessing this change, judgments about the value and appropriateness of ontologies over time are important. Those should, however, be formulated accounting for the fact that an ontology is able to self-evolve.

A degree to which an ontology is reused is one more important characteristic to be taken into account. Reuse means that data in multiple places refers to this ontology and, when combined with interoperability, it implies that data about similar things is described using the same ontological fragments. When looking at an evolving KO, having a perfect ontology would mean that if new knowledge tokens appear in the environmental contexts of an organism, the organism can integrate all assertions in the tokens, that is, without a need to excrete some parts of the consumed knowledge tokens back to the environment. That is to say, the ontology which was internal to the KO before the token was consumed was already prepared for the integration of the new token. Now, one could turn the viewpoint by saying that the information described in the token was already described in the ontology which the KO had and thus that the ontology was reused in one more place. This increases the value, that is, the fitness of the ontology maintained by the KO.

Using similar argumentation, we can conclude that if a KO needs to excrete a consumed knowledge token, the ontology fits worse to describing the fragment of data to which the excreted token is attributed. Thus, in conclusion, we could say that the fitness of a KO is directly dependent on the proportion between the parts of knowledge tokens which it: (a) is able to consume for morphogenesis and possibly mutation; versus (b) needs to excrete back to the environment. Additionally, the age of the assertions which build up the body of a KO matters. If the proportion of very young assertions in the body is high, the KO might not be resistant to stochastic changes, which is not healthy. Otherwise, if only long-living assertions form the body, it means that the KO is either in a wrong context or too resistant to mutagens. Both are bad, as no new information is added, the KO ignores changes, and hence the ontology it carries may become irrelevant. Therefore, a good mix of young and old assertions in the body of a KO indicates high fitness—the KO's knowledge is overall valid and evolves appropriately.
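One possible way to fold the two indicators above (consumption proportion and the age mix of body assertions) into a single fitness score is sketched below; the equal weighting and the notion of "young" are assumptions made for illustration only.

```python
def ko_fitness(consumed: int, excreted: int, assertion_ages: list,
               young_age: float = 1.0, weight: float = 0.5) -> float:
    """
    consumed/excreted: how many assertion parts the KO kept vs. returned.
    assertion_ages: ages of the assertions currently forming the KO body.
    Fitness combines (a) the consumption ratio and (b) how balanced the mix of
    young (< young_age) and old assertions is; both terms lie in [0, 1].
    """
    total = consumed + excreted
    consumption = consumed / total if total else 0.0

    if not assertion_ages:
        return weight * consumption
    young = sum(1 for a in assertion_ages if a < young_age) / len(assertion_ages)
    balance = 1.0 - abs(young - 0.5) * 2     # 1.0 for a 50/50 mix, 0.0 for all-young or all-old
    return weight * consumption + (1 - weight) * balance

print(ko_fitness(4, 3, [0.2, 0.4, 2.5, 3.0]))   # balanced age mix -> relatively fit (about 0.79)
print(ko_fitness(4, 3, [0.1, 0.2, 0.3, 0.4]))   # only very young assertions -> less fit (about 0.29)
```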

Of course, stating that fitness depends only on the numbers of used and excreted assertions is an oversimplification. Indeed, incoming knowledge tokens that carry assertions may be very different. For instance, the knowledge token in our Boeing example contains several concepts and properties in its TBox: a Plane, a PlaneMaker, a MarketForecast, an Airline, a Country, SalesVolume, seeksFor—soughtBy, etc. Also, some individuals attributed to these TBox elements are given in the ABox: UnitedStates, Boeing, New20YMarketForecastByBoeing, 4.5 trillion, etc. One can imagine a less complex knowledge token which contains less information. In addition to size and complexity, a token also has other properties which are important to consider. One is the source where the token originates from. A token can be produced by knowledge extraction from a given channel or can be excreted by a KO. When the token is extracted from a channel, its value depends on the quality of the channel, relative to the quality of other channels in the system (see also the context of origin in the "Contextualizing" section). The quality of knowledge extraction is important as well, though random errors could be mitigated by statistical means. Further, a token could be attributed to a number of environmental contexts. A context is important, that is, adds more value to a token in the context, if there are a lot of knowledge tokens in that context or, more precisely, if many tokens have appeared in the context recently. Consequently, a token becomes less valuable along its lifetime in the environment.
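As an illustration only, the factors just listed (token size, channel quality, context recency, and lifetime decay) could be combined into a single token value along the following lines; the multiplicative form, weights, and half-life are assumptions, not a formula proposed in this chapter.

```python
import math

def token_value(n_assertions: int, channel_quality: float,
                context_recency: float, age: float, half_life: float = 10.0) -> float:
    """
    channel_quality and context_recency are assumed to be normalized to [0, 1];
    age and half_life are expressed in the same (arbitrary) time unit.
    The value grows with size and quality and decays exponentially over the token's lifetime.
    """
    decay = math.exp(-math.log(2) * age / half_life)   # halves every `half_life` time units
    return n_assertions * channel_quality * (0.5 + 0.5 * context_recency) * decay

print(round(token_value(7, channel_quality=0.9, context_recency=0.8, age=0.0), 2))   # fresh token
print(round(token_value(7, channel_quality=0.9, context_recency=0.8, age=20.0), 2))  # value decayed to a quarter
```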

Until now, we have been looking at the different fitness, value, and quality factors in isolation. The problem is, however, that there is no straightforward way to integrate these different factors. For this, an approach that addresses the problem of assessing the quality of an ontology as a dynamic optimization problem (Cochez and Terziyan 2012) may be relevant.

          Some Conclusions

For all those who use or process Big Data, a good mental picture of the world, dissolved in data tokens, may be worth petabytes of raw information and save weeks of analytic work. Data emerge reflecting a change in the world around us. Knowledge extracted from these data in an appropriate and timely way is an essence of an adequate understanding of the change in the world. In this chapter, we provided the evidence that numerous challenges stand in the way of understanding the sense, the trends dissolved in the petabytes of Big Data—extracting its semantics for further use in analytics. Among those challenges, we have chosen the problem of balancing between effectiveness and efficiency in understanding Big Data as our focus. For better explaining our motivation and giving the reader the key that helps follow how our premises are transformed into conclusions, we offered a simple walkthrough example of a news token.

We began the analysis of Big Data Computing by looking at how the phenomenon influences and changes industrial landscapes. This overview helped us figure out that the demand in industries for effective and efficient use of Big Data, if properly understood, is enormous. However, this demand is not yet fully satisfied by the state-of-the-art technologies and methodologies. We then looked at current trends in research and development in order to narrow the gaps between the actual demand and the state of the art. The analysis of the current state of research activities resulted in pointing out the shortcomings and offering an approach that may help understand Big Data in a way that balances effectiveness and efficiency.

The major recommendations we elaborated for achieving the balance are: (i) devise approaches that intelligently combine top-down and bottom-up processing of data semantics by exploiting "3F + 3Co" in dynamics, at run time; (ii) use a natural incremental and evolutionary way of processing Big Data and its semantics instead of following a mechanistic approach to scalability.

Inspired by the harmony and beauty of biological evolution, we further presented our vision of how these high-level recommendations may be approached. The "Scaling with a Traditional Database" section offered a review of possible ways to solve the scalability problem at the data-processing level. The "Knowledge Self-Management and Refinement through Evolution" section presented a conceptual-level framework for building an evolving ecosystem of environmental contexts with knowledge tokens and different species of KOs that populate environmental contexts and collect knowledge tokens for nutrition. The genomes and bodies of these KOs are ontologies describing the corresponding environmental contexts. These ontologies evolve in line with the evolution of KOs. Hence they reflect the evolution of our understanding of Big Data by collecting the refinements of our mental picture of the change in the world. Finally, we found out that such an evolutionary approach to building knowledge representations will naturally allow assuring the fitness of knowledge representations—as the fitness of the corresponding KOs to the environmental contexts they inhabit.

We also found out that the major technological components for building such evolving knowledge ecosystems are already in place and could be effectively used, if refined and combined as outlined in the "Knowledge Self-Management and Refinement through Evolution" section.

          Acknowledgments

This work was supported in part by the "Cloud Software Program" managed by TiViT Oy and the Finnish Funding Agency for Technology and Innovation (TEKES).

          References

            

Abadi, D. J., D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik. 2003. Aurora: A new model and architecture for data stream management. VLDB Journal 12(2): 120–139.
Anderson, C. 2008. The end of theory: The data deluge makes the scientific method obsolete. Wired Magazine 16(7).
Ankolekar, A., M. Krotzsch, T. Tran, and D. Vrandecic. 2007. The two cultures: Mashing up Web 2.0 and the Semantic Web. In Proc Sixteenth Int Conf on World Wide Web (WWW'07), 825–834. New York: ACM.
Berry, D. 2011. The computational turn: Thinking about the digital humanities. Culture Machine 12.
Beyer, M. A., A. Lapkin, N. Gall, D. Feinberg, and V. T. Sribar. 2011. 'Big Data' is only the beginning of extreme information management. Gartner Inc. (April). http://www.gartner.com/id=1622715 (accessed August 30, 2012).
Bizer, C., T. Heath, and T. Berners-Lee. 2009. Linked data—The story so far. International Journal on Semantic Web and Information Systems 5(3): 1–22.
Bollier, D. 2010. The promise and peril of big data. Report, Eighteenth Annual Aspen Institute Roundtable on Information Technology, the Aspen Institute.
Bowker, G. C. 2005. Memory Practices in the Sciences. Cambridge, MA: MIT Press.
Boyd, D. and K. Crawford. 2012. Critical questions for big data. Information, Communication & Society 15(5): 662–679.
Broekstra, J., A. Kampman, and F. van Harmelen. 2002. Sesame: A generic architecture for storing and querying RDF and RDF schema. In The Semantic Web—ISWC 2002, eds. I. Horrocks and J. Hendler, 54–68. Berlin, Heidelberg: Springer-Verlag, LNCS 2342.
Cai, M. and M. Frank. 2004. RDFPeers: A scalable distributed RDF repository based on a structured peer-to-peer network. In Proc Thirteenth Int Conf World Wide Web (WWW'04), 650–657. New York: ACM.
Capgemini. 2012. The deciding factor: Big data & decision making. Report. http://www.capgemini.com/services-and-solutions/technology/business-information-management/the-deciding-factor/ (accessed August 30, 2012).
Chang, F., J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. 2008. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems 26(2): article 4.
Cochez, M. and V. Terziyan. 2012. Quality of an ontology as a dynamic optimisation problem. In Proc Eighth Int Conf ICTERI 2012, eds. V. Ermolayev et al., 249–256. CEUR-WS vol. 848. http://ceur-ws.org/Vol-848/ICTERI-2012-CEUR-WS-DEIS-paper-1-p-249-256.pdf.
Collins, A. M. and E. F. Loftus. 1975. A spreading-activation theory of semantic processing. Psychological Review 82(6): 407–428.
Cusumano, M. 2010. Cloud computing and SaaS as new computing platforms. Communications of the ACM 53(4): 27–29.
Darwin, C. 1859. On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. London: John Murray.
Davidovsky, M., V. Ermolayev, and V. Tolok. 2011. Instance migration between ontologies having structural differences. International Journal on Artificial Intelligence Tools 20(6): 1127–1156.
Davidovsky, M., V. Ermolayev, and V. Tolok. 2012. Agent-based implementation for the discovery of structural difference in OWL DL ontologies. In Proc Fourth Int United Information Systems Conf (UNISCON 2012), eds. H. C. Mayr, A. Ginige, and S. Liddle. Berlin, Heidelberg: Springer-Verlag, LNBIP 137.
Dean, J. and S. Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Communications of the ACM 51(1): 107–113.
Dean, D. and C. Webb. 2011. Recovering from information overload. McKinsey Quarterly (January).
DeCandia, G., D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. 2007. Dynamo: Amazon's highly available key-value store. In 21st ACM Symposium on Operating Systems Principles, eds. T. C. Bressoud and M. Frans Kaashoek, 205–220. New York: ACM.
Dickinson, I. and M. Wooldridge. 2003. Towards practical reasoning agents for the semantic web. In Proc Second International Joint Conference on Autonomous Agents and Multiagent Systems, 827–834. New York: ACM.
Driscoll, M. 2011. Building data startups: Fast, big, and focused. O'Reilly Radar (9). http://radar.oreilly.com/2011/08/building-data-startups.html (accessed October 8, 2012).
Ermolayev, V. and M. Davidovsky. 2012. Agent-based ontology alignment: Basics, applications, theoretical foundations, and demonstration. In Proc Int Conf on Web Intelligence, Mining and Semantics (WIMS 2012), eds. D. Dan Burdescu, R. Akerkar, and C. Badica, 11–22. New York: ACM.
Ermolayev, V., N. Keberle, O. Kononenko, S. Plaksin, and V. Terziyan. 2004. Towards a framework for agent-enabled semantic web service composition. International Journal of Web Services Research 1(3): 63–87.
Ermolayev, V., N. Keberle, W.-E. Matzke, and V. Vladimirov. 2005. A strategy for automated meaning negotiation in distributed information retrieval. In Proc 4th Int Semantic Web Conference (ISWC'05), eds. Y. Gil et al., 201–215. Berlin, Heidelberg: Springer-Verlag, LNCS 3729.
Ermolayev, V., N. Keberle, and W.-E. Matzke. 2008. An ontology of environments, events, and happenings. In Proc 32nd Annual IEEE Int Computer Software and Applications Conference (COMPSAC '08), 539–546, July 28–Aug. 1, 2008.
Ermolayev, V., C. Ruiz, M. Tilly, E. Jentzsch, J.-M. Gomez-Perez, and W.-E. Matzke. 2010. A context model for knowledge workers. In Proc Second Workshop on Content, Information, and Ontologies (CIAO 2010), eds. V. Ermolayev, J.-M. Gomez-Perez et al. CEUR-WS.
Euzenat, J. and P. Shvaiko. 2007. Ontology Matching. Berlin, Heidelberg: Springer-Verlag.
Fan, W., A. Bifet, Q. Yang, and P. Yu. 2012a. Foreword. In Proc First Int Workshop on Big Data, Streams, and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, eds. W. Fan, A. Bifet, Q. Yang, and P. Yu. New York: ACM.
Fan, J., A. Kalyanpur, D. C. Gondek, and D. A. Ferrucci. 2012b. Automatic knowledge extraction from documents. IBM Journal of Research and Development 56(3.4): 5:1–5:10.
Fensel, D., F. van Harmelen, B. Andersson, P. Brennan, H. Cunningham, E. Della Valle, F. Fischer et al. 2008. Towards LarKC: A platform for web-scale reasoning. In Proc 2008 IEEE Int Conf on Semantic Computing, 524–529, 4–7 Aug. 2008. doi: 10.1109/ICSC.2008.41.
Fisher, D., R. DeLine, M. Czerwinski, and S. Drucker. 2012. Interactions with big data analytics. Interactions 19(3): 50–59.
Gangemi, A. and V. Presutti. 2009. Ontology design patterns. In Handbook on Ontologies, eds. S. Staab and R. Studer, 221–243. Berlin, Heidelberg: Springer-Verlag, International Handbooks on Information Systems.
Ghemawat, S., H. Gobioff, and S.-T. Leung. 2003. The Google file system. In Proc Nineteenth ACM Symposium on Operating Systems Principles (SOSP'03), 29–43. New York: ACM.
Golab, L. and M. Tamer Ozsu. 2003. Issues in data stream management. SIGMOD Record 32(2): 5–14.
Gordon, A. 2005. Privacy and ubiquitous network societies. In Workshop on ITU Ubiquitous Network Societies, 6–15.
Grell (accessed August 20, 2012).
Gu, Y. and R. L. Grossman. 2009. Sector and sphere: The design and implementation of a high-performance data cloud. Philosophical Transactions of the Royal Society 367(1897): 2429–2445.
Guarino, N. and C. Welty. 2001. Supporting ontological analysis of taxonomic relationships. Data and Knowledge Engineering 39(1): 51–74.
Guéret, C., E. Oren, S. Schlobach, and M. Schut. 2008. An evolutionary perspective on approximate RDF query answering. In Proc Int Conf on Scalable Uncertainty Management, eds. S. Greco and T. Lukasiewicz, 215–228. Berlin, Heidelberg: Springer-Verlag, LNAI 5291.
He, B., M. Yang, Z. Guo, R. Chen, B. Su, W. Lin, and L. Zhou. 2010. Comet: Batched stream processing for data intensive distributed computing. In Proc First ACM Symposium on Cloud Computing (SoCC'10), 63–74. New York: ACM.
Hepp, M. 2007. Possible ontologies: How reality constrains the development of relevant ontologies. IEEE Internet Computing 11(1): 90–96.
Hogan, A., J. Z. Pan, A. Polleres, and Y. Ren. 2011. Scalable OWL 2 reasoning for linked data. In Lecture Notes for the Reasoning Web Summer School, Galway, Ireland (August). http://aidanhogan.com/docs/rw_2011.pdf (accessed October 18, 2012).
Isaac, A., C. Trojahn, S. Wang, and P. Quaresma. 2008. Using quantitative aspects of alignment generation for argumentation on mappings. In Proc ISWC'08 Workshop on Ontology Matching, eds. P. Shvaiko, J. Euzenat, F. Giunchiglia, and H. Stuckenschmidt. CEUR-WS.
Ishai, Y., E. Kushilevitz, R. Ostrovsky, and A. Sahai. 2009. Extracting correlations. In Proc 50th Annual IEEE Symposium on Foundations of Computer Science (FOCS '09), 261–270, 25–27 Oct. 2009. doi: 10.1109/FOCS.2009.56.
Joseph, A. 2012. A Berkeley view of big data. Closing keynote of Eduserv Symposium 2012.
Keberle, N. 2009. Temporal classes and OWL. In Proc Sixth Int Workshop on OWL: Experiences and Directions (OWLED 2009), eds. R. Hoekstra and P. F. Patel-Schneider. CEUR-WS vol. 529. http://ceur-ws.org/Vol-529/owled2009_submission_27.pdf (online).
Kendall, E., R. Bell, R. Burkhart, M. Dutra, and E. Wallace. 2009. Towards a graphical notation for OWL 2. In Proc Sixth Int Workshop on OWL: Experiences and Directions (OWLED 2009), eds. R. Hoekstra and P. F. Patel-Schneider. CEUR-WS vol. 529. http://ceur-ws.org/Vol-529/owled2009_submission_47.pdf (online).
Klinov, P., C. del Vescovo, and T. Schneider. 2012. Incrementally updateable and persistent decomposition of OWL ontologies. In Proc OWL: Experiences and Directions Workshop, eds. P. Klinov and M. Horridge. CEUR-WS vol. 849. http://ceur-ws.org/Vol-849/paper_7.pdf (online).
Kontchakov, R., C. Lutz, D. Toman, F. Wolter, and M. Zakharyaschev. 2010. The combined approach to query answering in DL-Lite. In Proc Twelfth Int Conf on the Principles of Knowledge Representation and Reasoning (KR 2010), eds. F. Lin and U. Sattler, 247–257. North America: AAAI.
Knuth, D. E. 1998. The Art of Computer Programming. Volume 3: Sorting and Searching, Second Edition. Reading, MA: Addison-Wesley.
Labrou, Y. 2006. Standardizing agent communication. In Multi-Agent Systems and Applications, eds. M. Luck, V. Marik, O. Stepankova, and R. Trappl, 74–97. Berlin, Heidelberg: Springer-Verlag, LNCS 2086.
Labrou, Y., T. Finin, and Y. Peng. 1999. Agent communication languages: The current landscape. IEEE Intelligent Systems 14(2): 45–52.
Lenat, D. B. 1995. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM 38(11): 33–38.
Lin, J. and C. Dyer. 2010. Data-Intensive Text Processing with MapReduce. Morgan & Claypool Publishers.
Manyika, J., M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. Hung Byers. 2011. Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute.
McGlothlin, J. P. and L. Khan. 2010. Materializing inferred and uncertain knowledge in RDF datasets. In Proc Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10), 1951–1952. North America: AAAI.
Mills, P. 2011. Efficient statistical classification of satellite measurements. International Journal of Remote Sensing.
Mitchell, I. and M. Wilson. 2012. Linked Data. Connecting and Exploiting Big Data.
Nardi, D. and R. J. Brachman. 2007. An introduction to description logics. In The Description Logic Handbook, eds. F. Baader, D. Calvanese, D. L. McGuinness, D. Nardi, and P. F. Patel-Schneider. New York: Cambridge University Press.
Nemani, R. R. and R. Konda. 2009. A framework for data quality in data warehousing. In Information Systems: Modeling, Development, and Integration, eds. J. Yang, A. Ginige, H. C. Mayr, and R.-D. Kutsche, 292–297. Berlin, Heidelberg: Springer-Verlag, LNBIP 20.
Olston, C. 2012. Programming and debugging large scale data processing workflows. In First Int Workshop on Hot Topics in Cloud Data Processing (HotCDP'12), Switzerland.
Oren, E., S. Kotoulas, G. Anadiotis, R. Siebes, A. ten Teije, and F. van Harmelen. 2009. Marvin: Distributed reasoning over large-scale Semantic Web data. Journal of Web Semantics 7(4): 305–316.
Ponniah, P. 2010. Data Warehousing Fundamentals for IT Professionals. Hoboken, NJ: John Wiley & Sons.
Puuronen, S., V. Terziyan, and A. Tsymbal. 1999. A dynamic integration algorithm for an ensemble of classifiers. In Foundations of Intelligent Systems: Eleventh Int Symposium ISMIS'99, eds. Z. W. Ras and A. Skowron, 592–600. Berlin, Heidelberg: Springer-Verlag, LNAI 1609.
Quillian, M. R. 1967. Word concepts: A theory and simulation of some basic semantic capabilities. Behavioral Science 12(5): 410–430.
Quillian, M. R. 1969. The teachable language comprehender: A simulation program and theory of language. Communications of the ACM 12(8): 459–476.
Rahwan, T. 2007. Algorithms for coalition formation in multi-agent systems. PhD diss., University of Southampton. http://users.ecs.soton.ac.uk/nrj/download-files/lesser-award/rahwan-thesis.pdf (accessed October 8, 2012).
Rimal, B. P., C. Eunmi, and I. Lumb. 2009. A taxonomy and survey of cloud computing systems. In Proc Fifth Int Joint Conf on INC, IMS and IDC, 44–51. Washington, DC: IEEE CS Press.
Roy, G., L. Hyunyoung, J. L. Welch, Z. Yuan, V. Pandey, and D. Thurston. 2009. A distributed pool architecture for genetic algorithms. In Proc IEEE Congress on Evolutionary Computation (CEC '09), 1177–1184, 18–21 May 2009. doi: 10.1109/CEC.2009.4983079.
Sakr, S., A. Liu, D. M. Batista, and M. Alomari. 2011. A survey of large scale data management approaches in cloud environments. IEEE Communications Surveys & Tutorials 13(3): 311–336.
Salehi, A. 2010. Low Latency, High Performance Data Stream Processing: Systems Architecture, Algorithms and Implementation. Saarbrücken: VDM Verlag.
Shvachko, K., K. Hairong, S. Radia, and R. Chansler. 2010. The Hadoop distributed file system. In Proc IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 1–10, 3–7 May 2010. doi: 10.1109/MSST.2010.5496972.
Smith, B. 2012. Big data that might benefit from ontology technology, but why this usually fails. In Ontology Summit 2012, Track 3 Challenge: Ontology and Big Data.
Tatarintseva, O., V. Ermolayev, and A. Fensel. 2011. Is your Ontology a burden or a Gem?—Towards Xtreme Ontology engineering. In Proc Seventh Int Conf ICTERI 2011.
Terziyan, V. 2001. Dynamic integration of virtual predictors. In Proc Int ICSC Congress on Computational Intelligence: Methods and Applications (CIMA'2001), eds. L. I. Kuncheva et al., 463–469. Canada: ICSC Academic Press.
Terziyan, V. 2007. Predictive and contextual feature separation for Bayesian metanetworks. In Proc KES-2007/WIRN-2007, ed. B. Apolloni et al., 634–644. Berlin, Heidelberg: Springer-Verlag, LNAI 4694.
Terziyan, V. and O. Kaykova. 2012. From linked data and business intelligence to executable reality. International Journal on Advances in Intelligent Systems 5(1–2): 194–208.
Terziyan, V., A. Tsymbal, and S. Puuronen. 1998. The decision support system for telemedicine based on multiple expertise. International Journal of Medical Informatics 49(2): 217–229.
Thomason, R. H. 1998. Representing and reasoning with context. In Proc Int Conf on Artificial Intelligence and Symbolic Computation (AISC 1998), eds. J. Calmet and J. Plaza, 29–41. Berlin, Heidelberg: Springer-Verlag, LNAI 1476.
Thusoo, A., Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. S. Sarma, R. Murthy, and H. Liu. 2010. Data warehousing and analytics infrastructure at Facebook. In Proc 2010 ACM SIGMOD Int Conf on Management of Data, 1013–1020. New York: ACM.
Tsangaris, M. M., G. Kakaletris, H. Kllapi, G. Papanikos, F. Pentaris, P. Polydoras, E. Sitaridi, V. Stoumpos, and Y. E. Ioannidis. 2009. Dataflow processing and optimization on grid and cloud infrastructures. IEEE Data Engineering Bulletin 32(1): 67–74.
Urbani, J., S. Kotoulas, E. Oren, and F. van Harmelen. 2009. Scalable distributed reasoning using MapReduce. In Proc Eighth Int Semantic Web Conf (ISWC'09), eds. A. Bernstein, D. R. Karger, T. Heath, L. Feigenbaum, D. Maynard, E. Motta, and K. Thirunarayan, 634–649. Berlin, Heidelberg: Springer-Verlag.
W3C. 2009. OWL 2 web ontology language profiles. W3C Recommendation (October). http://www.w3.org/TR/owl2-profiles/.
Weinberger, D. 2012. Too Big to Know: Rethinking Knowledge Now that the Facts aren't the Facts, Experts are Everywhere, and the Smartest Person in the Room is the Room. First Edition. New York, NY: Basic Books.
Wielemaker, J., G. Schreiber, and B. Wielinga. 2003. Prolog-based infrastructure for RDF: Scalability and performance. In The Semantic Web—ISWC 2003, eds. D. Fensel, K. Sycara, and J. Mylopoulos, 644–658. Berlin, Heidelberg: Springer-Verlag, LNCS 2870.
Wooldridge, M. and N. R. Jennings. 1995. Intelligent agents: Theory and practice. The Knowledge Engineering Review 10(2): 115–152.

            

          This page intentionally left blank This page intentionally left blank

            

Tassonomy and Review of Big Data Solutions Navigation

Pierfrancesco Bellini, Mariano di Claudio, Paolo Nesi, and Nadia Rauch

          CONTENTS

Introduction
Main Requirements and Features of Big Data Solutions
    Infrastructural and Architectural Aspects
        Scalability
        High Availability
        Computational Process Management
        Workflow Automation
        Cloud Computing
        Self-Healing
    Data Management Aspects
        Database Size
        Data Model
        Resources
        Data Organization
        Data Access for Rendering
        Data Security and Privacy
    Data Analytics Aspects
        Data Mining/Ingestion
        Data Access for Computing
Overview of Big Data Solutions
    Couchbase
    eXist
    Google Map-Reduce
    Hadoop
    Hbase
    Hive
    MonetDB
    MongoDB
    Objectivity
    OpenQM
    RDF-3X
Comparison and Analysis of Architectural Features
Application Domains Comparison
Conclusions
References

          Introduction

Although the management of huge and growing volumes of data has been a challenge for many years, no long-term solution has been found so far. The term “Big Data” initially referred to huge volumes of data whose size is beyond the capabilities of current database technologies; consequently, “Big Data” problems came to denote problems that combine a large volume of data with the need to treat it in a short time. Once it is established that data have to be collected and stored at an impressive rate, it is clear that the biggest challenge is not only their storage and management, but also their analysis, the extraction of meaningful value, and the deduction of actions to be taken in reality. Big Data problems were mostly related to the presence of unstructured data, that is, information that either does not have a default schema/template or does not adapt well to relational tables; it is therefore necessary to turn to analysis techniques for unstructured data to address these problems.

Recently, Big Data problems have been characterized by a combination of the so-called 3Vs: volume, velocity, and variety; a fourth V, variability, has since been added. In essence, every day a large volume of information is produced, and these data need sustainable access, processing, and preservation according to the velocity of their arrival; the management of a large volume of data is therefore not the only problem. Moreover, the variety of data, metadata, access rights, associated computing, formats, semantics, and software tools for visualization, together with the variability in structure and data models, significantly increases the level of complexity of these problems. The first V, volume, describes the large amount of data generated by individuals, groups, and organizations. The volume of data being stored today is exploding. For example, in the year 2000 about 800,000 petabytes of data were generated and stored in the world (Eaton et al., 2012), and experts estimate that in the year 2020 about 35 zettabytes of data will be produced. The second V, velocity, refers to the speed at which Big Data are collected, processed, and elaborated, possibly as a constant flow of massive data that is impossible to process with traditional solutions. For this reason, it is important to consider not only “where” the data are stored, but also “how” they are stored. The third V, variety, is concerned with the proliferation of data types from social and mobile sources, machine-to-machine communication, and traditional sources; data have become complex, because they include raw, semistructured, and unstructured data from log files, web pages, search indexes, cross media, emails, documents, forums, and so on. Variety represents all types of data, and usually enterprises must be able to analyze all of them if they want to gain an advantage. Finally, variability, the last V, refers to data unpredictability and to how these data may change over the years following the implementation of the architecture. Moreover, the concept of variability can be attributed to the variable interpretations that can be assigned to the data and to the confusion created in Big Data analysis, referring, for example, to the different meanings in Natural Language that some data may have. These four properties can be considered orthogonal aspects of data storage, processing, and analysis, and it is also interesting that increasing variety and variability also increases the attractiveness of data and their potential for providing hidden and unexpected information/meanings.

Especially in science, new “infrastructures for global research data” that can achieve interoperability, overcoming the limitations related to language, methodology, and guidelines (policy), will be needed in a short time. To cope with these types of complexity, several different techniques and tools may be needed; they have to be composed, and new specific algorithms and solutions may also have to be defined and implemented. The wide range of problems and the specific needs make it almost impossible to identify unique architectures and solutions adaptable to all possible application areas. Moreover, not only the number of application areas, so different from each other, but also the different channels through which data are daily collected increases the difficulty for companies and developers in identifying the right way to achieve relevant results from the accessible data. Therefore, this chapter can be a useful tool for supporting researchers and technicians in making decisions about setting up a Big Data infrastructure and solution. To this end, it is very helpful to have an overview of Big Data techniques; it can be used as a sort of guideline to better understand the possible differences and the most relevant features needed and proposed by the products as the key aspects of Big Data solutions. These can be regarded as requirements and needs according to which the different solutions can be compared and assessed, in accordance with the case study and/or application domain.

To this end, and to better understand the impact of Big Data science and solutions, a number of examples describing major application domains taking advantage of Big Data technologies and solutions are reported in the following: education and training, cultural heritage, social media and social networking, health care, brain research, finance and business, marketing and social marketing, security, smart cities and mobility, etc.

Big Data technologies have the potential to revolutionize education. Educational data such as students’ performance, mechanics of learning, and answers to different pedagogical strategies can provide an improved understanding of the learning process. These data can also help identify clusters of students with similar learning styles or difficulties, thus defining a new form of customized education based on sharing resources and supported by computational models. The new models of teaching proposed in Woolf et al. (2010) try to take into account student profile and performance, pedagogical and psychological aspects, and learning mechanisms, in order to define personalized instruction courses and activities that meet the different needs of different individual students and/or groups. In fact, in the educational sector, the approach of collecting, mining, and analyzing large data sets has been consolidated, in order to provide new tools and information to the key stakeholders. This data analysis can provide an increasing understanding of students’ knowledge, improve the assessment of their progress, and help focus questions in education and psychology, such as the method of learning or how different students respond to different pedagogical strategies. The collected data can also be used to define models to understand what students actually know, understand how to enrich this knowledge, assess which of the adopted techniques can be effective in which cases, and finally produce a case-by-case action plan. In terms of Big Data, a large variety and variability of data is present, in order to take into account all events in the students’ career; the data volume is also an additional factor. Another sector of interest in this field is the e-learning domain, where two main kinds of users are defined: the learners and the learning providers (Hanna, 2004). All personal details of learners and the online learning providers’ information are stored in specific databases, so applying data mining to e-learning makes it possible to realize teaching programs targeted to particular interests and needs through efficient decision making.

For the management of large amounts of cultural heritage information, Europeana has been created, with over 20 million content items indexed that can be retrieved in real time. Earlier, each of them was modeled with a simple metadata model, ESE, while a new and more complete model called EDM (Europeana Data Model), with a set of semantic relationships, is going to be adopted in 2013 [Europeana]. A number of projects and activities are connected to the Europeana network to aggregate content and tools. Among them, ECLAP is a best practice network that collected not only content metadata for Europeana, but also real content files from over 35 different institutions having different metadata sets and over 500 file formats. A total of more than 1 million cross media items is going to be collected, with an average of some hundreds of metadata elements each, thus resulting in billions of information elements and multiple relationships among them to be queried, navigated, and accessed in real time by a large community of users [ECLAP] (Bellini et al., 2012a).

The volume of data generated by social networks is huge and shows a high variability in the data flow over time and space, due to the human factor; for example, Facebook receives 3 billion uploads per month, which corresponds to approximately 3600 TB/year. Around the data collected every day by search engine companies such as Google, a real business is developed, offering useful services to their users and companies in real time (Mislove et al., 2006). From the large amounts of data collected through social networks (e.g., Facebook, Twitter, MySpace), social media and Big Data solutions may estimate user-collective profiles and behavior, analyze product acceptance, evaluate market trends, keep track of user movements, extract unexpected correlations, evaluate models of influence, and perform different kinds of predictions (Domingos, 2005). Social media data can be exploited by considering geo-referenced information and Natural Language Processing for analyzing and interpreting urban living: massive folk movements, activities of the different communities in the city, movements due to large public events, assessment of the city infrastructures, etc. (Iaconesi and Persico, 2012). In a broader sense, from this information it is possible to extract knowledge and data relationships, improving the activity of query answering.

For example, in the healthcare/medical field, a large amount of information about patients’ medical histories, symptomatology, diagnoses, and responses to treatments and therapies is collected. Data mining techniques might be applied to derive knowledge from these data in order to identify new interesting patterns in infection control data or to examine reporting practices (Obenshain, 2004). Moreover, predictive models can be used as detection tools, exploiting the Electronic Patient Record (EPR) accumulated for each person of the area and taking into account the statistical data. Similar solutions can be adopted as decision support for specific triage and diagnosis, or to produce effective plans for chronic disease management, enhancing the quality of healthcare and lowering its cost. This activity may allow detecting the inception of critical conditions for the observed people over the whole population. In Mans et al. (2009), techniques for the fast access and extraction of information from the event logs of medical processes have been investigated, with the aim of producing easily interpretable models using partitioning, clustering, and preprocessing techniques. In the medical field, and especially in hospitals, run time data are used to support the analysis of existing processes. Moreover, taking into account genomic aspects and EPRs for millions of patients leads to coping with Big Data problems. For genome sequencing activities (HTS, high-throughput sequencing), which produce several hundreds of millions of small sequences, a new data structure for indexing called Gkarrays (Rivals et al., 2012) has been proposed, with the aim of improving classical indexing systems such as hash tables. The adoption of sparse hash tables is not enough to index huge collections of k-mers (a k-mer is a subword of a given length k in a DNA sequence, which represents the minimum unit accessed). Therefore, a new data structure has been proposed based on three arrays: the first stores the start position of each k-mer, the second, as an inverted array, allows finding any k-mer from a position in a read, and the last records the interval of positions of each distinct k-mer, in sorted order. This structure allows obtaining in constant time the number of reads that contain a k-mer. Another project applies machine learning techniques to the evaluation of large amounts of tomographic images generated by computer (Zinterhof, 2012); the idea is to apply proven machine learning techniques for image segmentation in the field of computer tomography.
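To make the idea behind such array-based k-mer indexes concrete, the following is a much simplified Python sketch, inspired by, but not equivalent to, the Gkarrays approach: all k-mer occurrences are enumerated and sorted so that identical k-mers form contiguous intervals, and a parallel array of read identifiers allows counting the reads that contain a given k-mer. The real Gkarrays structure achieves constant-time counting without hashing; this sketch uses binary search and is intended only as an illustration.

from bisect import bisect_left, bisect_right

def build_kmer_index(reads, k):
    # Enumerate every k-mer occurrence as (k-mer, read identifier).
    occurrences = []
    for read_id, read in enumerate(reads):
        for start in range(len(read) - k + 1):
            occurrences.append((read[start:start + k], read_id))
    # Sorting groups all occurrences of the same k-mer into one interval.
    occurrences.sort()
    kmers = [kmer for kmer, _ in occurrences]     # sorted k-mer column
    read_ids = [rid for _, rid in occurrences]    # parallel read identifiers
    return kmers, read_ids

def reads_containing(kmers, read_ids, kmer):
    # Locate the interval of the k-mer and count the distinct reads in it.
    lo, hi = bisect_left(kmers, kmer), bisect_right(kmers, kmer)
    return len(set(read_ids[lo:hi]))

reads = ["ACGTACGT", "CGTACGTT", "TTTTACGA"]
kmers, read_ids = build_kmer_index(reads, k=4)
print(reads_containing(kmers, read_ids, "ACGT"))  # prints 2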

In several areas of science and research, such as astronomy (automated sky surveys), sociology (web log analysis of behavioral data), and neuroscience (genetic and neuroimaging data analysis), the aim of Big Data analysis is to extract meaning from data and determine what actions to take. To cope with the large amount of experimental data produced by research experiments, the University of Montpellier started the ZENITH project [Zenith], which adopts a hybrid p2p/cloud architecture (Valduriez and Pacitti, 2005). The idea of Zenith is to exploit p2p to facilitate the collaborative nature of scientific data and centralized control, and to use the potential of computing, storage, and network resources in the cloud model to manage and analyze this large amount of data. The storage infrastructure used in De Witt et al. (2012) is called CASTOR and allows the management of metadata related to the scientific files of experiments at CERN. For example, the database of RAL (Rutherford Appleton Laboratory) uses a single table storing 20 GB (which reproduces the hierarchical structure of the files) that runs about 500 transactions per second on 6 clusters. With the increasing amount of digital scientific data, one of the most important challenges is digital preservation, and for this purpose the SCAPE (SCAlable Preservation Environment) project is in progress [SCAPE Project]. The platform provides an extensible infrastructure to achieve the conservation of workflow information for large volumes of data. The AzureBrain project (Antoniu et al., 2010) aims to explore cloud computing techniques for the analysis of data from the genetic and neuroimaging domains, both characterized by a large number of variables. The Projectome project, connected with the Human Brain Project (HBP), aims to set up a high-performance infrastructure for processing and visualizing neuroanatomical information obtained by using confocal ultramicroscopy techniques (Silvestri et al., 2012); the solution is connected with the modeling of knowledge of and information related to rat brains. Here, the single image scan of a mouse is more than 1 TB, and it is 1000 times smaller than a human brain.

The task of finding patterns in business data is not new; nowadays it is getting larger relevance because enterprises are collecting and producing a huge amount of data, including massive contextual information, thus taking into account a larger number of variables. Using data to understand and improve business operations, profitability, and growth is a great opportunity and an evolving challenge. The continuous collection of large amounts of data (business transactions, sales transactions, user behavior), the widespread use of networking technologies and computers, and the design of Big Data warehouses and data marts have created enormously valuable assets. An interesting possibility for extracting meaningful information from these data is the use of machine learning techniques in the context of mining business data (Bose and Mahapatra, 2001), for example, applying data mining to model classes of customers in client databases using fuzzy clustering and fuzzy decision making (Setnes et al., 2001). These data can be analyzed in order to make predictions about the behavior of users, to identify buying patterns of individual/group customers, and to provide new custom services (Bose and Mahapatra, 2001). Moreover, in recent years, the major market analysts have conducted their business investigations with data that are not stored within classic RDBMSs (Relational DataBase Management Systems), due to the increase of various new types of information. Analysis of web users’ behavior, customer loyalty programs, remote sensor technologies, comments in blogs, and opinions shared on the network are contributing to create a new business model called social media marketing, and companies must properly manage this information, with the corresponding potential for new understanding, to maximize the business value of the data (Domingos, 2005). In the financial field, instead, investment and business plans may be created thanks to predictive models derived using reasoning techniques and used to discover meaningful and interesting patterns in business data.

Big Data technologies have also been adopted to find solutions for logistics and mobility management and for the optimization of multimodal transport networks in the context of Smart Cities. A data-centric approach can also help in enhancing the efficiency and the dependability of a transportation system. In fact, through the analysis and visualization of detailed road network data and the use of a predictive model, it is possible to achieve an intelligent transportation environment. Furthermore, by merging high-fidelity geographical stored data and real-time data scattered by sensor networks, an efficient urban planning system that mixes public and private transportation can be built, offering people more flexible solutions. This new way of traveling has interesting implications for energy and the environment. The analysis of the huge amount of data collected from the metropolitan multimodal transportation infrastructure, augmented with data coming from sensors, GPS positions, etc., can be used to facilitate the movements of people via local public transportation solutions and private vehicles (Liu et al., 2009). The idea is to provide intelligent real-time information to improve the traveler experience and operational efficiency (see, for example, the solutions for the cities of Amsterdam, Berlin, Copenhagen, and Ghent). In this way, it is possible to use Big Data both as historical and as real-time data for the application of machine learning algorithms aimed at traffic state estimation/planning, and also to detect unpredicted phenomena in a sufficiently accurate way to support near real-time decisions.

In the security field, Intelligence, Surveillance, and Reconnaissance (ISR) define topics that are well suited to data-centric computational analyses. Using analysis tools for video and image retrieval, it is possible to establish alerts for activities and events of interest. Moreover, intelligence services can use these data to detect and combine special patterns and trends, in order to recognize threats and to assess capabilities and vulnerabilities.

In the field of energy resource optimization and environmental monitoring, the data related to the consumption of electricity are very important. The analysis of a set of load profiles and geo-referenced information with appropriate data mining techniques (Figueireido et al., 2005), and the construction of predictive models from those data, could define intelligent distribution strategies in order to lower costs and improve the quality of life in this field. Another possible solution is an approach based on the adoption of a conceptual model for smart grid data management built on the main features of a cloud computing platform, such as the collection and real-time management of distributed data, parallel processing for the search and interpretation of information, and multiple and ubiquitous access (Rusitschka et al., 2010).

From the above overview of some of the application domains for Big Data technologies, it is evident that, to cope with those problems, several different kinds of solutions and specific products have been developed. Moreover, the complexity and variability of the problems have been addressed with a combination of different open source or proprietary solutions, since presently there is not an ultimate solution to the Big Data problem that includes in an integrated manner data gathering, mining, analysis, processing, accessing, publication, and rendering. It would therefore be extremely useful to have a “map” of the hot spots to be taken into account during the design process and the creation of these architectures, which would help the technical staff to orient themselves in the wide range of products accessible on the Internet and/or offered by the market. To this aim, we have tried to identify the main features that can characterize architectures for solving a Big Data problem, depending on the source of the data, on the type of processing required, and on the application context in which they have to operate.

The chapter is organized as follows. In the “Main Requirements and Features of Big Data Solutions” section, the main requirements and features for Big Data solutions are presented by taking into account infrastructural and architectural aspects, data management, and data analytics aspects. The “Overview of Big Data Solutions” section reports a brief overview of existing solutions for Big Data and their main application fields. In the “Comparison and Analysis of Architectural Features” section, a comparison and analysis of the architectural features is presented; the analysis has made it possible to highlight the most relevant features of, and differences among, the different solutions. The “Application Domains Comparison” section describes the main application domains of Big Data technologies and includes our assessment of these application domains in terms of the identified features reported in the “Overview of Big Data Solutions” section; therefore, this section can be very useful to identify the major challenges of each domain and the most important aspects to be taken into account for each domain. This analysis allowed us to perform some comparisons and considerations about the most commonly adopted tools in the different domains. In the same section, the identified application domains are also crossed with the analyzed solutions, providing a shortcut to determine which products have already been applied to a specific field of application, that is, a hint for the development of future applications. Finally, in the “Conclusions” section, conclusions are drawn.

          Main Requirements and Features of Big Data Solutions

In this section, according to the short overview of Big Data problems reported above, we have identified a small number of main aspects that should be addressed by architectures for the management of Big Data problems. These aspects can be regarded as a collection of major requirements to cope with most of the issues related to Big Data problems. We have divided the identified main aspects into three main categories which, respectively, concern the infrastructure and the architecture of the systems that should cope with Big Data; the management of the large amount of data and the characteristics related to the type of physical storage; and the access to data and the techniques of data analytics, such as ingestion, log analysis, and everything else pertinent to the post-production phase of data processing. In some cases, the features are provided and/or inherited from the operating system or from the cloud/virtual infrastructure. Therefore, the specific Big Data solutions and techniques have to be capable of taking advantage of the underlying operating system and infrastructure.

Infrastructural and Architectural Aspects

Typical Big Data solutions are deployed on a cloud, exploiting the flexibility of the infrastructure. As a result, some of the features of Big Data solutions may depend on the architecture and infrastructure facilities from which the solution inherits/exploits its capabilities. Moreover, specific tools for data gathering, processing, rendering, etc. may be capable or incapable of exploiting a different range of cloud-based architectural aspects. For example, not all databases can be distributed on multiple servers, not all algorithms can be profitably remapped onto a parallel architecture, not all data access or rendering solutions may exploit multilayered caches, etc. To this end, in the following paragraphs, a set of main features is discussed, among them: scalability, multitiered memory, availability, parallel and distributed process management, workflow, self-healing, and data security and privacy. A summary map is reported in the “Overview of Big Data Solutions” section.

            Scalability

This feature may impact several aspects of a Big Data solution (e.g., storage and processing), which has to be capable of maintaining acceptable performance when moving from small to large problems. In most cases, scalability is obtained by using distributed and/or parallel architectures, which may be allocated on a cloud. Both computing and storage resources can be located over a network to create a distributed system in which the distribution of the workload is also managed.

As regards computational scalability, when processing a very huge data set it is important to optimize the workload, for example, with a parallel architecture, as proposed in the Zenith project [Zenith], which may perform several operations simultaneously (on an appropriate number of tasks), or by providing dynamic allocation of computational resources (i.e., a process releases a resource as soon as it is no longer needed, so that it can be assigned to another process), a technique used in the ConPaas platform (Pierre et al., 2011). Usually, traditional computational algorithms are not scalable, and thus specific restructurings of the algorithms have to be defined and adopted. On the other hand, not all algorithms can take advantage of parallel and/or distributed architectures for computing; specific algorithms have to be defined, provided that an efficient parallel and/or distributed solution exists. The evolution to distributed and parallel processing is just the first step, since processes have to be allocated and managed on some parallel architecture, which can be developed ad hoc or generally set up. Semantic grids and parallel architectures can be applied to the problem (Bellini et al., 2012b) [BlueGene].

Each system for Big Data should provide a scalable storage solution. In fact, the main problem could be to understand to what extent a storage solution has to be scalable to satisfy the worst operative cases or the most common cases (and, in general, the most expensive cases). Moreover, for large experiments, the data collection and processing may not be predictable with high precision in the long term, for example, in terms of storage size and cost. For example, it is not clear how much storage would be needed to collect genomic information and EHRs (Electronic Healthcare Records) for a unified European health system in 5 or 10 years. In any case, because EHRs contain a large amount of data, an interesting approach for their management could be the use of a solution based on HBase, which builds a distributed, fault-tolerant, and scalable database on clouds, on top of HDFS, with random real-time read/write access to big data, overcoming the design limits of traditional RDBMSs (Yang et al., 2011). Furthermore, to focus on this problem, a predictive model of how the need for storage space will increase should be made, although the complexity and costs of such a model are high. In most cases, it is preferable to have a pragmatic approach: first guess and work with the present problems by using cheap hardware and, if necessary, increase the storage on demand. This approach obviously cannot be considered completely scalable, since scalability is not just about the storage size, and the need thus remains to combine the presented solution with a system addressing the other dimensions of scalability.


A good solution to optimize the reaction time and to obtain a scalable solution at limited cost is the adoption of a multitiered storage system, including cache levels, where data pass from one level to another along a hierarchy of storage media having different response times and costs. In fact, a multitier approach to storage, utilizing arrays of disks for all backups with a primary storage and the adoption of an efficient file system, allows one both to provide backups and restores to online storage in a timely manner and to scale up the storage when the primary storage grows. Obviously, each specific solution does not have to implement all layers of the memory hierarchy, because the needs depend on the single specific case, together with the amount of information to be accessed per second, the depth of the cache memories, the binning of the different types of data into classes based on their availability and recoverability, or the choice of using a middleware to connect separate layers. The structure of the multitiered storage can be designed on the basis of a compromise between access velocity and overall storage cost. As a counterpart, multiple storage tiers create a large amount of maintenance costs.
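As a purely illustrative sketch of the multitier idea (the class, tier sizes, and record names below are hypothetical and not taken from any specific product), a two-tier store can be emulated with a bounded fast tier in front of a slower archive; reads are served from the fast tier when possible, and the least recently used items are demoted when the fast tier is full.

from collections import OrderedDict

class TieredStore:
    # Illustrative two-tier storage: a bounded fast tier over a slow archive.
    def __init__(self, fast_capacity=3):
        self.fast = OrderedDict()   # fast tier (e.g., RAM/SSD), LRU ordered
        self.slow = {}              # slow tier (e.g., disk/tape archive)
        self.fast_capacity = fast_capacity

    def put(self, key, value):
        self.fast[key] = value
        self.fast.move_to_end(key)
        while len(self.fast) > self.fast_capacity:
            old_key, old_value = self.fast.popitem(last=False)
            self.slow[old_key] = old_value      # demote least recently used

    def get(self, key):
        if key in self.fast:                    # fast hit
            self.fast.move_to_end(key)
            return self.fast[key]
        value = self.slow[key]                  # slow hit: promote to fast tier
        self.put(key, value)
        return value

store = TieredStore()
for i in range(5):
    store.put(f"record-{i}", {"payload": i})
print(store.get("record-0"))                    # served from the slow tier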

Scalability may take advantage of recent cloud solutions that implement techniques for dynamically bursting storage and processes from private to public clouds and among the latter. Private cloud computing has recently gained much traction from both commercial and open-source interests (Microsoft, 2012). For example, tools such as OpenStack [OpenStack Project] can simplify the process of managing virtual machine resources. In most cases, for small-to-medium enterprises, there is a trend to migrate multitier applications into public cloud infrastructures (e.g., Amazon), which are delegated to cope with scalability via elastic cloud solutions. A deep discussion of cloud computing is out of the scope of this chapter.

            High Availability

The high availability of a service (which may refer to a general service, to storage, to processing, or to the network) is a key requirement in an architecture that has to support the simultaneous use by a large number of users and/or computational nodes located in different geographical locations (Cao et al., 2009). Availability refers to the ability of the community of users to access a system and exploit its services. High availability leads to increased difficulty in guaranteeing data updates, preservation, and consistency in real time, and it is fundamental that a user perceives, during his session, the actual and proper reactivity of the system. To cope with these features, the design should be fault-tolerant, for example, with redundant solutions for data and computational capabilities, to make them highly available despite the failure of some hardware and software elements of the infrastructure. The availability of a system is usually expressed as the percentage of time (the nines method) that a system is up over a given period of time, usually a year. In cloud systems, for instance, the level of five nines (99.999% of the time, meaning high availability, HA) is typically required, corresponding to a yearly downtime of approximately 5 min; it is important to note, however, that this time does not always have the same value, since it depends on the organization to which the critical system refers. Present solutions obtain the HA score by using a range of cloud architecture techniques, such as fault-tolerant capabilities for virtual machines, redundant storage for distributed databases, balancing for the front end, and the dynamic movement of virtual machines.
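The nines method translates directly into a yearly downtime budget; the following minimal computation reproduces the figures mentioned above, including the roughly 5 minutes per year allowed by five nines.

def yearly_downtime_minutes(availability):
    # Minutes in a (non-leap) year not covered by the availability percentage.
    return (1 - availability / 100.0) * 365 * 24 * 60

for nines, availability in [(3, 99.9), (4, 99.99), (5, 99.999)]:
    print(f"{nines} nines ({availability}%): "
          f"{yearly_downtime_minutes(availability):.1f} min/year of downtime")
# 3 nines: ~525.6 min, 4 nines: ~52.6 min, 5 nines: ~5.3 min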

            Computational Process Management

Computational activities on Big Data may take a long time and may be distributed over multiple computers/nodes in some parallel architecture, in connection with some networking system. Therefore, most Big Data solutions have to cope with the need to control computational processes by: allocating them on a distributed system, putting them in execution on demand or periodically, killing them, recovering processing from failures, reporting errors, scheduling them over time, etc. Sometimes, the infrastructure that allows parallel computational processes to be put in execution can work as a service; thus it has to be accessible to multiple users and/or other multitier architectures and servers. This means that sophisticated solutions for parallel processing and scheduling are needed, including the definition of Service Level Agreements (SLA), as in classical grid solutions. Examples of solutions coping with these aspects are computational grids, media grids, semantic computing, and distributed processing such as the AXCP media grid (Bellini et al., 2012c) and general grids (Foster et al., 2002). A solution for parallel data processing has to be capable of dynamically exploiting the computational power of the underlying infrastructure, since most Big Data problems may be computationally intensive for limited time slots. Cloud solutions may help one to cope with the concept of the elastic cloud for implementing dynamic computational solutions.

            Workflow Automation

Big Data processes are typically formalized in the form of process workflows, from data acquisition to the production of results. In some cases, the workflow is programmed by using a simple XML (Extensible Markup Language) formalization or effective programming languages, for example, Java, JavaScript, etc. The related data may strongly vary in terms of dimension and data flow (i.e., variability): an architecture that handles well both limited and large volumes of data must be able to fully support the creation, organization, and transfer of these workflows, in single cast or broadcast mode. To implement this type of architecture, sophisticated automation systems are used. These systems work on different layers of the architecture through applications, APIs (Application Program Interfaces), and visual process design tools; however, they may not be suitable for processing a huge amount of data in real time, for formalizing stream processing, etc. In some Big Data applications, the high data flow and the timing requirements (soft real time) have made the traditional “store-then-process” paradigm inadequate, so that complex event processing (CEP) paradigms have been proposed (Gulisano et al., 2012): a CEP system processes a continuous stream of data (events) on the fly, without any storage. In fact, CEP can be regarded as an event-driven architecture (EDA), dealing with the detection of events and the production of reactions to them, which specifically has the task of filtering, matching, and aggregating low-level events into high-level events. Furthermore, by creating a parallel-distributed CEP, where data are partitioned across processing nodes, it is possible to realize an elastic system capable of adapting the processing resources to the actual workload, reaching the high performance of parallel solutions and overcoming the limits of scalability.
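A minimal sketch of the CEP idea is given below (the event fields, window size, and threshold are hypothetical): low-level events are consumed from a stream on the fly, filtered, and aggregated over a sliding window into a high-level event, without storing the raw stream.

from collections import deque

def detect_overload(events, window_size=5, threshold=3):
    # Emit a high-level "overload" event when too many errors occur
    # in the last window_size low-level events (no raw-event storage).
    window = deque(maxlen=window_size)
    for event in events:            # events is any (possibly unbounded) iterable
        window.append(event)
        errors = sum(1 for e in window if e["level"] == "error")
        if errors >= threshold:
            yield {"type": "overload", "errors": errors, "last": event["id"]}

stream = ({"id": i, "level": "error" if i % 2 == 0 else "info"} for i in range(10))
for alert in detect_overload(stream):
    print(alert)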

An interesting application example is the Large Hadron Collider (LHC), the most powerful particle accelerator in the world, which is estimated to produce 15 million gigabytes of data every year [LHC], then made available to physicists around the world thanks to the supporting infrastructure of the worldwide LHC computing grid (WLCG). The WLCG connects more than 140 computing centers in 34 countries with the main objective of supporting the collection and storage of data and the tools for processing, simulation, and visualization. The idea behind the operation is that the LHC experimental data are recorded on tape at CERN before being distributed to 11 large computer centers (called “Tier 1” centers) in Canada, France, Germany, Italy, the Netherlands, Scandinavia, Spain, Taiwan, the UK, and the USA. From these sites, the data are made available to more than 120 “Tier 2” centers, where specific analyses can be conducted. Individual researchers can then access the information using computer clusters or even their own personal computers.

            Cloud Computing

The cloud allows one to obtain seemingly unlimited storage space and computing power, which is the reason why the cloud paradigm is considered a very desirable feature of every Big Data solution (Bryant et al., 2008). It is a new business in which companies and users can rent, following the “as a service” paradigm, infrastructure, software, products, processes, etc., from providers such as Amazon [Amazon AWS], Microsoft [Microsoft Azure], and Google [Google Drive]. Unfortunately, these public systems are not enough for extensive computations on large volumes of data, due to the low bandwidth; ideally, a cloud computing system for Big Data should be geographically dispersed, in order to reduce its vulnerability in the case of natural disasters, but should also have a high level of interoperability and data mobility. In fact, there are systems that are moving in this direction, such as the OpenCirrus project [Opencirrus Project], an international test bed for cloud computing experiments.

Self-Healing

This feature refers to the capability of a system to autonomously solve its failure problems, for example, in the computational processes, in the database and storage, and in the architecture. For example, when a server or a node fails, it is important to have the capability of automatically solving the problem to avoid repercussions on the entire architecture. Thus, an automated recovery-from-failure solution, which may be implemented by means of fault-tolerant techniques, balancing, hot spares, etc., plus some intelligence, is needed. It is therefore an important feature for Big Data architectures, which should be capable of autonomously bypassing the problem; then, once informed about the problem and the action performed to solve it, the administrator may carry out an intervention. This is possible, for example, through techniques that automatically redirect to other resources the work that was planned to be carried out by the failed machine, which is automatically put offline. To this end, there are commercial products which allow setting up distributed and balanced architectures where data are replicated and stored in geographically dispersed clusters; when a node/storage fails, the cluster can self-heal by recreating the missing data of the damaged node in its free space, thus reconstructing the full capability of recovering from the next problem. Otherwise, throughput and capacity may decrease under the degraded conditions until the failed storage, processor, or resource is replaced (Ghosh et al., 2007).
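A schematic sketch of the re-replication step of self-healing is shown below (node names and the replication factor are hypothetical): when a node fails, the blocks it hosted are copied from surviving replicas onto the remaining nodes, so that the desired replication factor is restored.

def heal(cluster, failed_node, replication_factor=2):
    # cluster: dict mapping node name -> set of block ids hosted on it.
    lost_blocks = cluster.pop(failed_node)          # take the failed node offline
    for block in lost_blocks:
        holders = [n for n, blocks in cluster.items() if block in blocks]
        if not holders:
            print(f"block {block} is lost: no surviving replica")
            continue
        # Re-replicate from a surviving copy until the factor is restored.
        candidates = sorted(cluster, key=lambda n: len(cluster[n]))
        for node in candidates:
            if len(holders) >= replication_factor:
                break
            if node not in holders:
                cluster[node].add(block)            # copy from a surviving replica
                holders.append(node)
    return cluster

cluster = {"n1": {"a", "b"}, "n2": {"b", "c"}, "n3": {"a", "c"}}
print(heal(cluster, "n1"))   # blocks "a" and "b" are re-replicated on n2/n3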

Data Management Aspects

In the context of data management, a number of aspects characterize Big Data solutions, among them: the maximum size of the database, the data models, the capability of setting up distributed and clustered data management solutions, the sustainable rate of the data flow, the capability of partitioning the data storage to make it more robust and increase performance, the query model adopted, the structure of the database (relational, RDF (resource description framework), reticular, etc.), etc. Considering data structures for Big Data, there is a trend to find a solution using the so-called NoSQL (Not Only SQL) databases, even if there are good solutions that still use relational databases (Dykstra, 2012). On the market and among open source solutions, there are several different types of NoSQL databases and rational reasons to use them in different situations, for different kinds of data. There are many methods and techniques for dealing with Big Data, and in order to be capable of identifying the best choice in each case, a number of aspects have to be taken into account in terms of architecture and hardware solutions, because different choices can also greatly affect the performance of the overall system to be built. Related to database performance and data size, the so-called CAP theorem plays a relevant role (Brewer, 2001, 2012). The CAP theorem states that a distributed storage system cannot simultaneously guarantee all three of the following features: consistency, availability, and partition tolerance (Fox and Brewer, 1999).

The consistency property states that, after an operation, the data model is still in a consistent state, providing the same data to all its clients. The availability property means that the solution is robust with respect to some internal failures, that is, the service is still available. Partition tolerance means that the system continues to provide the service even when it is divided into disconnected subsets, for example, when a part of the storage cannot be reached. To cope with the CAP theorem, Big Data solutions try to find a trade-off between continuing to provide the service despite partitioning problems and, at the same time, attempting to reduce inconsistencies, thus supporting the so-called eventual consistency.

Furthermore, in the context of relational databases, the ACID (Atomicity, Consistency, Isolation, and Durability) properties describe the reliability of database transactions. This paradigm does not apply to NoSQL databases where, in contrast to the ACID definition, the data state provides the so-called BASE properties: Basically Available, Soft state, and Eventually consistent. Therefore, it is typically hard to guarantee an architecture for Big Data management in a fault-tolerant BASE way since, as Brewer’s CAP theorem says, there is no other choice than to make a compromise if you want to scale up. In the following, some of the above aspects are discussed and explained in more detail.
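A toy illustration of eventual consistency is given below (purely didactic and not modeled on any specific product): two replicas accept writes independently while partitioned, are temporarily inconsistent, and converge after an anti-entropy exchange based on a last-write-wins rule.

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}                     # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        self.data[key] = (timestamp, value)

    def read(self, key):
        return self.data.get(key, (None, None))[1]

    def sync(self, other):
        # Anti-entropy: keep, per key, the value with the newest timestamp.
        for key in set(self.data) | set(other.data):
            newest = max(self.data.get(key, (0, None)),
                         other.data.get(key, (0, None)))
            self.data[key] = other.data[key] = newest

a, b = Replica("A"), Replica("B")
a.write("x", "v1", timestamp=1)            # accepted by A while partitioned
b.write("x", "v2", timestamp=2)            # accepted by B while partitioned
print(a.read("x"), b.read("x"))            # temporarily inconsistent: v1 v2
a.sync(b)                                  # partition heals, replicas exchange state
print(a.read("x"), b.read("x"))            # eventually consistent: v2 v2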

            Database Size

In Big Data problems, the database size may easily reach magnitudes of hundreds of terabytes (TB), petabytes (PB), or exabytes (EB). The evolution of Big Data solutions has seen an increment in the amounts of data that can be managed. In order to exploit these huge volumes of data and to improve scientific productivity, new technologies and new techniques are needed. The real challenges of database size are related to indexing and to access to the data; these aspects are treated in the following.

            Data Model

To cope with huge data sets, a number of different data models are available, such as the relational model, object databases, XML databases, or the multidimensional array model, which extend database functionality as described in Baumann et al. (1998). Systems like Db4o (Norrie et al., 2008) or RDF-3X (Schramm, 2012) propose different solutions for data storage that can handle more or less structured information and the relationships among the data. The data model represents the main factor influencing the performance of data management; in fact, the performance of indexing represents in most cases the bottleneck of the processing. Alternatives may be solutions that belong to the so-called category of NoSQL databases, such as ArrayDBMS (Cattel, 2010), MongoDB [mongoDB], CouchDB [Couchbase], and HBase [Apache HBase], which provide alternatives to traditional relational database management systems. Within the broad category of NoSQL databases, large NoSQL families can be identified, which differ from each other in storage and indexing strategy:

• Key-value stores: highly scalable solutions, which allow one to obtain good speed in the presence of large lists of elements, such as stock quotes; examples are Amazon Dynamo [Amazon Dynamo] and Oracle Berkeley [Oracle Berkeley].
• Wide column stores (big tables): databases in which the columns are grouped and where keys and values can be composite (e.g., HBase [Apache HBase], Cassandra [Apache Cassandra]). They are very effective for coping with time series and with data coming from multiple sources, sensors, devices, and websites, where high speed is needed. Consequently, they provide good performance in read and write operations, while they are less suitable for data sets in which the relationships are as important as the data themselves.
• Document stores: aligned with object-oriented programming, from clustering to data access, they behave like key-value stores in which the value is the document content. They are useful when data are hardly representable with a relational model because of their high complexity; therefore, they are used with medical records or to cope with data coming from social networks. Examples are MongoDB [mongoDB] and CouchDB [Couchbase].
• Graph databases: they are suitable for modeling the relationships among data. The access model is typically transactional and therefore suitable for applications that need transactions. They are used in fields such as geospatial applications, bioinformatics, network analysis, and recommendation engines. The execution of traditional SQL queries is not simple. Examples are Neo4J [Neo4j], GraphBase [GraphBase], and AllegroGraph [AllegroGraph].

Other NoSQL database categories are: object databases, XML databases, multivalue databases, multimodel databases, multidimensional databases, etc. [NoSQL DB].

It is therefore important to choose the right NoSQL storage type during the design phase of the architecture to be implemented, considering the different features that characterize the different databases. In other words, it is very important to use the right tool for each specific project, because each storage type has its own weaknesses and strengths.
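As a schematic contrast between two of the above access models, the following sketch uses plain Python structures rather than any real product API: a key-value store only supports lookup by key, whereas a document store understands the structure of the stored documents and can therefore filter on their fields.

# Key-value style: opaque values, retrievable only by key.
kv_store = {
    "user:17": b'{"name": "Alice", "city": "Florence"}',
    "user:42": b'{"name": "Bob", "city": "Oslo"}',
}
print(kv_store["user:42"])                          # single lookup by key

# Document style: the store understands the structure of each document,
# so queries can select on any field (here: a simple linear scan).
doc_store = [
    {"_id": 17, "name": "Alice", "city": "Florence", "orders": 3},
    {"_id": 42, "name": "Bob", "city": "Oslo", "orders": 1},
]

def find(collection, **criteria):
    return [doc for doc in collection
            if all(doc.get(field) == value for field, value in criteria.items())]

print(find(doc_store, city="Florence"))             # query by document content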

            Resources

The main performance bottlenecks for NoSQL data stores correspond to the network, the disks, the memory, and the computational capabilities of the associated CPUs. Typically, Big Data stores are based on clustering solutions in which the whole data set is partitioned into clusters comprising a number of nodes (the cluster size). The number of nodes in each cluster affects the completion time of each job, because a greater number of nodes in a cluster corresponds to a lower completion time of the job. In this sense, the memory size and the computational capabilities of each node also influence the node performance (De Witt et al., 2008). Most NoSQL databases use persistent socket connections, while the disk is always the slowest component because of the inherent latency of non-volatile storage. Thus, any high-performance database needs to have some form of memory caching or memory-based storage to optimize memory performance. Another key point is related to the memory size and usage of the selected solution. Some solutions, such as HBase [Apache HBase], are considered memory-intensive, and in these cases a sufficient amount of memory on each server/node has to be guaranteed to cover the needs of the portion of the cluster located in its region of interest. When the amount of memory is insufficient, the overall performance of the system drastically decreases (Jacobs, 2009). The network capability is an important factor that affects the final performance of the entire Big Data management. In fact, network connections among clusters are used extensively during read and write operations, but there are also algorithms, like Map-Reduce, whose shuffle step makes heavy use of the network. It is therefore important to have a highly available and resilient network, which is also able to provide the necessary redundancy and which scales well, that is, which allows the number of clusters to grow.
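The network cost of the shuffle step can be seen in a minimal in-memory sketch of the Map-Reduce pattern (a didactic word count, not the Hadoop API): the shuffle phase regroups every intermediate pair by key, which in a real cluster implies moving data between nodes.

from collections import defaultdict

def map_phase(documents):
    for doc in documents:                      # runs locally on each node
        for word in doc.split():
            yield word, 1

def shuffle_phase(pairs):
    groups = defaultdict(list)                 # in a cluster: network transfer
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data computing", "big data solutions", "data"]
print(reduce_phase(shuffle_phase(map_phase(documents))))
# {'big': 2, 'data': 3, 'computing': 1, 'solutions': 1}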

            Data Organization

The data organization impacts the storage, access, and indexing performance of data (Jagadish et al., 1997). In most cases, a great part of the accumulated data is not relevant for estimating the results, and thus it could be filtered out and/or stored in compressed form, as well as moved into slower memory along the multitier architecture. To this end, a challenge is to define rules for arranging and filtering data in order to avoid/reduce the loss of useful information while preserving performance and saving costs (Olston et al., 2003). The distribution of data in different remote tables may be the cause of inconsistencies when the connection is lost and the storage is partitioned because of some fault. In general, it is not always possible to ensure that data are locally available on the node that has to process them. It is evident that if this condition is generally achieved, the best performance is obtained. Otherwise, the missing data blocks would need to be retrieved, transferred, and processed in order to produce the results, with a high consumption of resources on the node that requested them and on the node that owns them, and thus on the entire network; therefore, the completion time would be considerably longer.

Data Access for Rendering

The activity of data rendering is related to the access of data for representing them to the users, in some cases after performing some prerendering processing. The presentation of original or produced data results may be a relevant challenge when the data size is so huge that the processing needed to produce a representation can be highly computation-intensive, while most of the single data items are not relevant for the final presentation to the user. For example, representing at a glance the distribution of 1 billion economic transactions on a single image would in any case be limited to some thousands of points; the presentation of the distribution of people flows in a large city would be based on the analysis of several hundreds of millions of movements, while their representation would be limited to a map on an image of some Mbytes. A query on a huge data set may produce an enormous set of results. Therefore, it is important to know their size in advance and to be capable of analyzing Big Data results with scalable display tools that can produce a clear vision in a range of cases, from small to huge result sets. For example, the node-link representation of an RDF graph does not provide a clear view of the overall RDF structure: one possible solution to this problem is the use of a 3D adjacency matrix as an alternate visualization method for RDF (Gallego et al., 2011). Thanks to some graph display tools, it is possible to highlight specific data aspects.

Furthermore, it should be possible to guarantee efficient access, perhaps with the definition of standard interfaces, especially in business and medical applications with multichannel and multidevice delivery of results, without decreasing data availability. An additional interesting feature for data access can be saving the user experience in data access and navigation (the parameters and steps used for accessing and filtering data). The adoption of semantic queries in RDF databases is essential for many applications that need to produce heterogeneous results, and in those cases data rendering is very important for presenting them and their relationships. Other solutions for data access are based on the production of specific indexes, such as Solr [Apache Solr] or in NoSQL databases. An example is the production of faceted results, in which the query results are divided into multiple categories on which the user can further restrict the search results, by composing different facets/filters using "and"/"or". This important feature is present in solutions such as the RDF-HDT Library and the eXist project (Meier, 2003).
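As an illustration of the 3D adjacency matrix idea mentioned above, the following sketch (the triples are invented and NumPy is assumed; this is not the method of Gallego et al., 2011) builds a subject x predicate x object boolean array from a toy RDF graph and extracts a 2D slice that a visualization tool could render.

```python
import numpy as np

# A tiny, invented RDF graph as (subject, predicate, object) triples.
triples = [
    ("alice", "knows",    "bob"),
    ("alice", "authorOf", "doc1"),
    ("bob",   "knows",    "carol"),
]

# Assign one axis per RDF role: subjects, predicates, objects.
subjects   = sorted({s for s, _, _ in triples})
predicates = sorted({p for _, p, _ in triples})
objects    = sorted({o for _, _, o in triples})
s_idx = {t: i for i, t in enumerate(subjects)}
p_idx = {t: i for i, t in enumerate(predicates)}
o_idx = {t: i for i, t in enumerate(objects)}

# 3D boolean adjacency matrix: cell (s, p, o) is True if the triple exists.
adj = np.zeros((len(subjects), len(predicates), len(objects)), dtype=bool)
for s, p, o in triples:
    adj[s_idx[s], p_idx[p], o_idx[o]] = True

# A 2D slice along one predicate gives the classic adjacency matrix
# for that relation, which is what a display tool would render.
print(adj[:, p_idx["knows"], :].astype(int))
```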

            Data Security and Privacy

The problem of data security is very relevant in the case of Big Data solutions. The data to be processed may contain sensitive information such as EPR, bank data, general personal information such as profiles, and content under IPR (intellectual property rights) and thus under some licensing model; such data have to be managed in some coded protected format, for example, with some form of encryption. Solutions based on conditional access, channel protection, and authentication may still have sensitive data stored in the clear in the storage. They are called Conditional Access Systems (CAS) and are used to manage and control user access to services and data (normal users, administrators, etc.) without protecting each single data element via encryption. Most Big Data installations are based on web services models, with few facilities for countering web threats, whereas it is essential that data are protected from theft and unauthorized access. Moreover, most of the present Big Data solutions provide only conditional access methods based on credentials for accessing the data, and do not protect the data themselves with encrypted packages. On the other hand, content protection technologies are sometimes supported by Digital Rights Management (DRM) solutions, which allow defining and executing licenses that formalize the rights that can be exploited on a given content element, who can exploit those rights, and under which conditions (e.g., time, location, number of times, etc.). The control of user access rights is per se a Big Data problem (Bellini et al., 2013). DRM solutions use authorization, authentication, and encryption technologies to manage and enable the exploitation of rights by different types of users, enforcing logical control of each user with respect to each single piece of the huge quantities of data. The same technology can be used to contribute to safeguarding data privacy, allowing data to be kept encrypted until they are effectively used by authorized and authenticated tools and users. Therefore, access to data and content outside the permitted rights would be forbidden. Data security is a key aspect of an architecture for the management of such big quantities of data, and it is essential to define who can access what. This is a fundamental feature in some areas such as health/medicine, banking, media distribution, and e-commerce. In order to enforce data protection, some frameworks are available to implement DRM and/or CAS solutions exploiting different encryption and technical protection techniques (e.g., MPEG-21 [MPEG-21], AXMEDIS (Bellini et al., 2007), ODRL (Iannella, 2002)). In the specific case of EPR, several millions of patients with hundreds of elements have to be managed; for each of them, some tens of rights have to be controlled, thus resulting in billions of accesses and thus of authentications per day.
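As a minimal sketch of keeping sensitive records encrypted until they are used by an authorized tool, the following fragment assumes the third-party Python cryptography package and symmetric (Fernet) keys; it is only an illustration of encryption at rest, not a technique prescribed by the CAS/DRM frameworks cited above.

```python
from cryptography.fernet import Fernet  # third-party package: cryptography

# In a real deployment the key would come from a key-management service
# reachable only by authenticated, authorized components (the CAS/DRM side).
key = Fernet.generate_key()
cipher = Fernet(key)

record = b'{"patient_id": "12345", "diagnosis": "..."}'   # hypothetical EPR fragment

token = cipher.encrypt(record)      # what actually lands in the Big Data store
print(token[:32], b"...")

# Only a tool holding the key (i.e., granted the corresponding right)
# can turn the stored token back into usable clear text.
print(cipher.decrypt(token))
```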

Data Analytics Aspects

Data analysis aspects have to do with a large range of different algorithms for data processing. The analysis and review of the different data analytic algorithms for Big Data processing is not the focus of this chapter, which aims at analyzing the architectural differences and the most important features of Big Data solutions. On the other hand, the data analytic algorithms may range over data ingestion, crawling, verification, validation, mining, and processing, up to the estimation of relevant results such as the detection of unexpected correlations, detection of patterns and trends (for example, of events), estimation of collective intelligence, estimation of the inception of new trends, prediction of new events and trends, analysis of crowdsourcing data for sentiment/affective computing with respect to market products or personalities, identification of people and folk trajectories, estimation of similarities for producing suggestions and recommendations, etc. In most of these cases, the data analytic algorithms have to take into account user profiles, content descriptors, contextual data, collective profiles, etc.

The major problems of Big Data are related to how their "meanings" are discovered; usually this research occurs through complex modeling and analytics processes: hypotheses are formulated; statistical, visual, and semantic models are implemented to validate them; and then new hypotheses are formulated again to draw deductions, find unexpected correlations, and produce optimizations. In several of these cases, the specific data analytic algorithms are based on statistical data analysis; semantic modeling, reasoning, and queries; traditional queries; stream and signal processing; optimization algorithms; pattern recognition; natural language processing; data clustering; similarity estimation; etc.

In the following, the key aspects are discussed and explained in more detail.

            Data Mining/Ingestion

Data mining and ingestion are two key features in the field of Big Data solutions; in fact, in most cases there is a trade-off between the speed of data ingestion, the ability to answer queries quickly, and the quality of the data in terms of update, coherence, and consistency. This compromise impacts the design of the storage system (i.e., OLTP vs OLAP, On-Line Transaction Processing vs On-Line Analytical Processing), which has to be capable of storing and indexing the new data at the same rate at which they reach the system, also taking into account that a part of the received data could be irrelevant for the production of the requested results. Moreover, some storage and file systems are optimized for reading and others for writing, while workloads generally involve a mix of both operations. An interesting solution is GATE, a framework and graphical development environment to develop applications and engineering components for language processing tasks, especially for data mining and information extraction (Cunningham et al., 2002). Furthermore, the data mining process can be strengthened and completed by the usage of crawling techniques, now consolidated for the extraction of meaningful data from web pages with richer information, including complex structures and tags. The processing of a large amount of data can be very expensive in terms of resources used and computation time. For these reasons, it may be helpful to use a distributed approach of crawlers (with additional functionality) that work as a distributed system, with a central control unit that manages the allocation of the crawling tasks.

Another important feature is the ability to get advanced faceted results from queries on the large volumes of available data: this type of query allows the user to access the information in the store along multiple explicit dimensions, and after the application of multiple filters. This interaction paradigm is used in mining applications and allows analyzing and browsing data across multiple dimensions; faceted queries are especially useful in e-commerce websites (Ben-Yitzhak et al., 2008). In addition to the features already seen, it is important to take into account the ability to process data in real time: today, in fact, especially in business, we are in a phase of rapid transition; there is also the need for faster reactions, to be able to detect patterns and trends in a short time, in order to reduce the response time to customer requests. This increases the need to evaluate information as soon as an event occurs, that is, the company must be able to answer questions on the fly according to real-time data.
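The faceted-query interaction can be sketched in a few lines; the catalogue and facet names below are invented, and real engines (e.g., Solr or the eXist extensions mentioned earlier) compute facet counts inside the index rather than in application code.

```python
from collections import Counter

# Invented product catalogue; each facet is just a document field.
docs = [
    {"title": "Phone A", "brand": "Acme",  "color": "black", "price_band": "high"},
    {"title": "Phone B", "brand": "Acme",  "color": "white", "price_band": "low"},
    {"title": "Phone C", "brand": "Other", "color": "black", "price_band": "low"},
]

def facet_counts(results, field):
    """How many results fall under each value of a facet field."""
    return Counter(doc[field] for doc in results)

def apply_filters(results, filters):
    """'and' composition of facet filters: every selected facet must match."""
    return [d for d in results if all(d[f] == v for f, v in filters.items())]

results = docs                                   # pretend this is a query result set
print(facet_counts(results, "brand"))            # facets offered to the user
narrowed = apply_filters(results, {"brand": "Acme", "color": "black"})
print([d["title"] for d in narrowed])            # results after facet selection
```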

            Data Access for Computing

The most important enabling technologies are related to data modeling and to data indexing. Both aspects should be focused on fast access to and retrieval of data in a suitable format, to guarantee high performance in the execution of the computational algorithms used for producing results. The type of indexing may influence the speed of data retrieval operations at the only cost of an increased storage space. Couchbase [Couchbase] offers an incremental indexing system that allows efficient access to data at multiple points. Another interesting method is the use of the HFile (Aiyer et al., 2012) and the already mentioned Bloom filters (Borthakur et al., 2011); the HFile is an index-organized data file created periodically and stored on disk. However, in the Big Data context, there is often the need to manage irregular data, with a heterogeneous structure, that do not follow any predefined schema. For these reasons, the application of an alternative indexing technique suitable for semistructured or unstructured data, as proposed in McHugh et al. (1998), could be interesting. On the other hand, where data come from different sources, establishing relationships among data sets allows data integration and can lead to additional knowledge and deductions. Therefore, the modeling and management of data relationships may become more important than the data themselves, especially where relationships play a very strong role (social networks, customer management). This is the case of the new data types for social media, which are formalized as highly interrelated content for which the management of multidimensional relationships in real time is needed. A possible solution is to store relationships in specific data structures that ensure a good ability to access and extract them, in order to adequately support predictive analytics tools.

In most cases, in order to guarantee the demanded performance in the rendering and production of data results, a set of precomputed partial results and/or indexes can be estimated and maintained in cache stores as temporary data. Some kinds of data analytic algorithms create enormous amounts of temporary data that must be opportunely managed to avoid memory problems and to save time for the successive computations. In other cases, in order to produce statistics on the information that is accessed more frequently, it is possible to use techniques that create well-defined cache systems or temporary files to optimize the computational process. With the same aim, some incremental and/or hierarchical algorithms are adopted in combination with the above-mentioned techniques, for example, hierarchical clustering, k-means, and k-medoid for recommendation (Everitt et al., 2001; Xui and Wunsch, 2009; Bellini et al., 2012c).

A key element of Big Data access for data analysis is the presence of metadata as data descriptors, that is, additional information associated with the main data, which helps to recover and understand their meaning within the context. In the financial sector, for example, metadata are used to better understand customers, dates, and competitors, and to identify impactful market trends; it is therefore easy to understand that having an architecture that allows the storage of metadata also represents a benefit for the subsequent operations of data analysis. Structured metadata and organized information help to create a system with more easily identifiable and accessible information, and also facilitate the knowledge identification process through the analysis of available data and metadata. A variety of attributes can be applied to the data, which may thus acquire greater relevance for users: for example, keywords, temporal and geospatial information, pricing, contact details, and anything else that improves the quality of the information that has been requested. In most cases, the production of suitable data descriptors could be the way to save time in recovering the real full data, since the matching and the further computational algorithms are based on those descriptors rather than on the original data. For example, the identification of duplicated documents could be performed by comparing the document descriptors, and the production of user recommendations can be performed on the basis of collective user descriptors or on the basis of the descriptors representing the centers of the clusters.
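As a hedged illustration of precomputing cluster centers from user descriptors and reusing them for recommendations, the following sketch assumes scikit-learn (not a library discussed by the reviewed solutions) and invented descriptor values.

```python
import numpy as np
from sklearn.cluster import KMeans   # assumed third-party dependency

# Invented user descriptors, e.g., normalized interest scores per category.
user_descriptors = np.array([
    [0.9, 0.1, 0.0],   # mostly sports
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],   # mostly music
    [0.0, 0.8, 0.3],
])

# Cluster users once (offline); only the centers are kept as "partial results".
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(user_descriptors)

# At query time a new user is matched against the small set of centers,
# not against the full user base, which is the point of the precomputation.
new_user = np.array([[0.85, 0.15, 0.05]])
cluster = km.predict(new_user)[0]
print("closest cluster:", cluster, "center:", km.cluster_centers_[cluster])
```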

          Overview of Big Data Solutions

In this section, a selection of representative products for the implementation of different Big Data systems and architectures is analyzed and organized in a comparative table on the basis of the main features identified in the previous sections. To this end, the following paragraphs provide a brief overview of the considered solutions, as reported in Table 2.1 and described in the next section.

The ArrayDBMS extends database services with query support for multidimensional arrays, whose processing typically involves a high number of operations, each of which is applied to a large number of elements in the array. In these conditions, the execution time with a traditional database would be unacceptable. In the literature, and from real applications, a large number of examples are available that use various types of ArrayDBMS; among them, we can recall a solution based on the Rasdaman ArrayDBMS (Baumann et al., 1998). Different from the other types, the Rasdaman ArrayDBMS provides support for domain-independent arrays of arbitrary size and uses a general-purpose declarative query language, which is also associated with optimized internal execution, transfer, and storage. The conceptual model consists of arrays of any size, measures, and types of cells, which are stored in tables named collections that contain an OID (object ID) column and an array column. The RaSQL language offers expressions in terms of multidimensional arrays of content objects. Following the standard Select-from-where paradigm, the query process first gathers the inspected collections, then the "where" clause filters the arrays corresponding to the predicate, and finally the "Select" prepares the matrices derived from the initial query. Internally, Rasdaman decomposes each array object into "tiles" that form the memory access units, the querying units, and the processing units. These parts are stored as BLOBs (Binary Large Objects) in a relational database. The formal model of algebra for Rasdaman arrays offers a high potential for query optimization. In many cases, where phenomena are sampled or simulated, the results are data that can be stored, searched, and submitted as an array. Typically, the data arrays are outlined by metadata that describe them; for example, geographically referenced images may contain their position and the reference system in which they are expressed.
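The array-oriented query style described above can be approximated with NumPy; the following sketch (invented array and thresholds) only illustrates trimming, filtering, and deriving results from a multidimensional array, and is not RaSQL or the Rasdaman tiling engine.

```python
import numpy as np

# Invented 3D array, e.g., a time series of 2D sensor grids: (time, x, y).
data = np.random.default_rng(0).integers(0, 255, size=(10, 4, 4))

# "from": restrict to a spatial window and a time slice (like trimming an array).
window = data[2:8, 1:3, 1:3]

# "where": keep only the cells satisfying a predicate.
mask = window > 200

# "select": derive a result array, here the per-timestep count of hot cells.
hot_cells_per_step = mask.sum(axis=(1, 2))
print(hot_cells_per_step)
```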

            Couchbase

[Couchbase] is designed for real-time applications and does not support SQL queries. Its incremental indexing system is realized to be native to the JSON (JavaScript Object Notation) storage format. Thus, JavaScript code can be used to verify the document and select which data are used as index keys. Couchbase Server is an elastic and open-source NoSQL database that automatically distributes data across commodity servers or virtual machines and can easily accommodate changing data management requirements, thanks to the absence of a schema to manage. Couchbase is also based on Memcached, which is responsible for the optimization of network protocols and hardware, and allows obtaining good performance at the network level. Memcached [Memcached] is an open-source distributed caching system based on main memory, which is especially used in highly trafficked websites and high-performance demanding web applications. Moreover, thanks to Memcached, Couchbase can improve its online users' experience, maintaining low latency and a good ability to scale up to a large number of users. Couchbase Server allows managing system updates in a simple way, realizing a reliable and highly available storage architecture thanks to the multiple copies of the data stored within each cluster.

Table 2.1 Main Features of Reviewed Big Data Solutions. The table compares ArrayDBMS, CouchBase, Db4o, eXist, Google MapReduce, Hadoop, HBase, Hive, MonetDB, MongoDB, Objectivity, OpenQM, RdfHdt Library, and RDF-3X with respect to infrastructural and architectural aspects (distributed, high availability, process management, cloud, parallelism, transactional), data management aspects (data dimension, traditional/not traditional, SQL interoperability, data organization, data model, memory footprint, user access type, data access performance), and data analytics aspects (type of indexing, data relationships, visual rendering and visualization, faceted query, statistical analysis tools, log analysis, semantic query, indexing speed, real-time processing). Note: Y, supported; N, no info; P, partially supported; A, available but supported by means of a plug-in or external extension; NA, not available.


The db4o is an object-based database (Norrie et al., 2008) that provides support to make application objects persistent. It also supports various forms of querying over these objects, such as query expression trees, iterator query methods, and query-by-example mechanisms to retrieve objects. Its advantages are simplicity, speed, and a small memory footprint.

            eXist

eXist (Meier, 2003) is grounded on an open-source project to develop a native XML database system that can be integrated into a variety of possible applications and scenarios, ranging from web-based applications to documentation systems. The eXist database is completely written in Java and may be deployed in different ways, either running inside a servlet engine as a standalone server process, or directly embedded into an application. eXist provides schema-less storage of XML documents in hierarchical collections. It is possible to query a distinct part of the collection hierarchy, using an extended XPath syntax, or the documents contained in the database. eXist's query engine implements efficient, index-based query processing. According to path join algorithms, a large range of queries are processed using index information. This database is a suitable solution for applications that deal with both large and small collections of XML documents and frequent updates of them. eXist also provides a set of extensions that allow searching by keyword, by proximity to the search terms, and by regular expressions.
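A small example of path-based querying over XML is given below; it uses Python's standard xml.etree.ElementTree (which supports only a subset of XPath) purely to illustrate the kind of expression eXist evaluates against its indexed collections, and the document content is invented.

```python
import xml.etree.ElementTree as ET

# Invented XML document of the kind a documentation system might store.
xml_doc = """
<collection>
  <article year="2012"><title>Big Data stores</title><author>Rossi</author></article>
  <article year="2010"><title>XML indexing</title><author>Meier</author></article>
</collection>
"""

root = ET.fromstring(xml_doc)

# Path expression: titles of articles published in 2012.
for title in root.findall("./article[@year='2012']/title"):
    print(title.text)
```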

            Google Map-Reduce

Google Map-Reduce (Yang et al., 2007) is the programming model for processing Big Data used by Google. Users specify the computation in terms of a map and a reduce function. The underlying system parallelizes the computation across large-scale clusters of machines and is also responsible for handling failures, maintaining effective communication, and addressing the problem of performance. The Map function in the master node takes the inputs, partitions them into smaller subproblems, and distributes them to operational nodes. Each operational node could perform this again, creating a multilevel tree structure. The operational node processes the smaller problems and returns the response to its parent node. In the Reduce function, the root node takes the answers from the subproblems and combines them to produce the answer to the global problem that it is trying to solve. The advantage of Map-Reduce consists in the fact that it is intrinsically parallel and thus allows distributing the mapping and reduction operations. The Map operations are independent of each other and can be performed in parallel (with limitations given by the data source and/or the number of CPUs/cores near to that data); in the same way, the Reduce operations can be performed in parallel on separate keys. This makes it possible to execute queries or other highly distributable algorithms potentially in real time, which is a very important feature in some work environments.
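The map/reduce programming model can be illustrated with a self-contained word-count sketch; this is a single-process simulation of the model (a real framework distributes and parallelizes these phases across a cluster), not Google's implementation.

```python
from collections import defaultdict

documents = ["big data computing", "big data stores", "data"]  # invented input splits

def map_phase(doc):
    # Emit an intermediate (key, value) pair for every word.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Group intermediate values by key; in a cluster this is the network-heavy step.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine all values for a key into the final answer for that key.
    return key, sum(values)

intermediate = [pair for doc in documents for pair in map_phase(doc)]
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(result)   # {'big': 2, 'data': 3, 'computing': 1, 'stores': 1}
```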

            Hadoop

[Hadoop Apache Project] is a framework that allows managing distributed processing of Big Data across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each of them offering local computation and storage. The Hadoop library is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures. Hadoop was inspired by Google's Map-Reduce and the Google File System, GFS, and in practice it has been realized to be adopted in a wide range of cases. Hadoop is designed to scan large data sets to produce results through a distributed and highly scalable batch processing system. It is composed of the Hadoop Distributed File System (HDFS) and of the programming paradigm Map-Reduce (Karloff et al., 2010); thus, it is capable of exploiting the redundancy built into the environment. The programming model is capable of detecting failures and solving them automatically by running specific programs on various servers in the cluster. In fact, redundancy provides fault tolerance and the capability of self-healing of the Hadoop cluster. HDFS allows applications to be run across multiple servers, which usually have a set of inexpensive internal disk drives; the possibility of using commodity hardware is another advantage of Hadoop. A similar and interesting solution is HadoopDB, proposed by a group of researchers at Yale. HadoopDB was conceived with the idea of creating a hybrid system that combines the main features of two technological solutions: parallel databases, for performance and efficiency, and Map-Reduce-based systems, for scalability, fault tolerance, and flexibility. The basic idea behind HadoopDB is to use Map-Reduce as the communication layer above multiple nodes running single-node DBMS instances. Queries are expressed in SQL and then translated into Map-Reduce. In particular, the solution implemented involves the use of PostgreSQL as the database layer, Hadoop as the communication layer, and Hive as the translation layer (Abouzeid et al., 2009).

HBase

HBase (Aiyer et al., 2012) is a large-scale distributed database built on top of HDFS, mentioned above. It is a nonrelational database developed by means of an open-source project. Many traditional RDBMSs use a single mutating B-tree for each index stored on disk. HBase, on the other hand, uses a Log Structured Merge Tree approach: it first collects all updates into a special data structure in memory and then, periodically, flushes this memory to disk, creating a new index-organized data file, also called an HFile. These indices are immutable and are periodically merged. Therefore, by using this approach, writing to the disk is performed sequentially. HBase's performance is satisfactory in most cases and may be further improved by using Bloom filters (Borthakur et al., 2011). Both the HBase and HDFS systems have been developed by considering elasticity as a fundamental principle, and the use of low-cost disks has been one of the main goals of HBase. Therefore, scaling the system is easy and cheap, even if a certain fault tolerance capability has to be maintained in the individual nodes.
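A very reduced sketch of the Log Structured Merge idea follows: writes go to an in-memory structure that is periodically flushed as an immutable sorted segment, and reads check memory first and then the segments from newest to oldest. It is a toy model for illustration, not HBase's MemStore/HFile code.

```python
import bisect

class TinyLSM:
    def __init__(self, memtable_limit=3):
        self.memtable = {}          # recent writes, held in memory
        self.segments = []          # list of immutable, sorted (key, value) lists
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as a new sorted, immutable segment ("HFile"-like).
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:                 # newest data first
            return self.memtable[key]
        for segment in reversed(self.segments):  # then segments, newest to oldest
            keys = [k for k, _ in segment]
            i = bisect.bisect_left(keys, key)
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None

db = TinyLSM()
for i in range(5):
    db.put(f"row{i}", f"value{i}")
print(db.get("row1"), db.get("row4"))
```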

            Hive

[Apache Hive] is an open-source data warehousing solution built on top of Hadoop. Hive has been designed with the aim of analyzing large amounts of data more productively, improving the query capabilities of Hadoop. Hive supports queries expressed in an SQL-like declarative language—HiveQL—to extract data from sources such as HDFS or HBase. The architecture is divided into the Map-Reduce paradigm for computation (with the ability for users to enrich the queries with custom Map-Reduce scripts), metadata information for data storage, and a processing part that receives queries from users or applications for execution. The core in/out libraries can be expanded to analyze customized data formats. Hive is also characterized by the presence of a system catalog (Metastore) containing schemas and statistics, which is useful in operations such as data exploration, query optimization, and query compilation. At Facebook, the Hive warehouse contains tens of thousands of tables, stores over 700 TB of data, and is being used extensively for both reporting and ad-hoc analyses by more than 200 users per month (Thusoo et al., 2010).
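A flavor of HiveQL is sketched below; the table, partition, and column names are hypothetical, and the query is shown as a Python string only because it would normally be submitted through the Hive CLI, Beeline, or a JDBC/ODBC client rather than executed here.

```python
# Hypothetical HiveQL: daily top pages from a log table stored on HDFS.
hiveql = """
SELECT page, COUNT(*) AS hits
FROM web_logs                   -- hypothetical table managed by the Hive Metastore
WHERE dt = '2013-01-01'         -- partition column; pruning limits the scan
GROUP BY page
ORDER BY hits DESC
LIMIT 10
"""
print(hiveql)  # in practice: submitted to Hive, which compiles it into Map-Reduce jobs
```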

            MonetDB

MonetDB (Zhang et al., 2012) is an open-source DBMS for data mining applications. It has been designed for applications with large databases and queries, in the field of Business Intelligence and Decision Support. MonetDB has been built around the concept of bulk processing: simple operations applied to large volumes of data by using efficient hardware, for large-scale data processing. At present, two versions of MonetDB are available, working with different types of databases: MonetDB/SQL with a relational database and MonetDB/XML with an XML database. In addition, a third version is under development to introduce RDF and SPARQL (SPARQL Protocol and RDF Query Language) support. MonetDB provides a full SQL interface, while it does not allow high-volume transaction processing with its multilevel ACID properties. MonetDB allows performance improvements in terms of speed for both relational and XML databases thanks to innovations introduced at the DBMS level: a storage model based on vertical fragmentation, run-time query optimization, and a modular software architecture. MonetDB is designed to exploit hardware-conscious techniques for the efficient support of workloads. MonetDB represents relational tables using vertical fragmentation (column stores), storing each column in a separate table, called a BAT (Binary Association Table). The left column, usually the OID (object-id), is called the head, and the right column, which usually contains the actual attribute values, is called the tail.
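The vertical fragmentation (BAT) idea can be sketched as follows; this toy representation with plain Python lists and invented data only illustrates why a column scan touches a single attribute array, and is not MonetDB's internal format.

```python
# Row-store view of a table: one record object per row.
rows = [
    {"oid": 0, "name": "Alice", "age": 34},
    {"oid": 1, "name": "Bob",   "age": 51},
    {"oid": 2, "name": "Carol", "age": 29},
]

# Column-store (BAT-like) view: one (head=oid, tail=value) pair of arrays per attribute.
bat_name = {"head": [0, 1, 2], "tail": ["Alice", "Bob", "Carol"]}
bat_age  = {"head": [0, 1, 2], "tail": [34, 51, 29]}

# A scan such as "age > 30" reads only the age column, not whole rows.
qualifying_oids = [oid for oid, age in zip(bat_age["head"], bat_age["tail"]) if age > 30]

# Late materialization: fetch the other columns only for qualifying oids.
names = [bat_name["tail"][bat_name["head"].index(oid)] for oid in qualifying_oids]
print(names)   # ['Alice', 'Bob']
```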

            MongoDB

[MongoDB] is a document-oriented database that memorizes document data in BSON, a binary JSON format. Its basic idea consists in the usage of a more flexible model, the "document," to replace the classic concept of a "row." In fact, with the document-oriented approach, it is possible to represent complex hierarchical relationships with a single record, thanks to embedded documents and arrays. MongoDB is open-source and schema-free—that is, there are no fixed or predefined document keys—and it allows defining indices based on specific fields of the documents. In order to retrieve data, ad-hoc queries based on these indices can be used. Queries are created as BSON objects to make them more efficient and are similar to SQL queries. MongoDB supports MapReduce queries and atomic operations on individual fields within the document. It allows realizing redundant and fault-tolerant systems that can be easily horizontally scaled, thanks to sharding based on the document keys and the support of asynchronous and master–slave replications. Relevant advantages of MongoDB are the opportunity of creating data structures to easily store polymorphic data and the possibility of building elastic cloud systems, given its scale-out design, which increases ease of use and developer flexibility. Moreover, server costs are significantly low because a MongoDB deployment can use commodity and inexpensive hardware, and its horizontal scale-out architecture can also reduce storage costs.
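A minimal document-store interaction is sketched below with the PyMongo driver; it assumes a MongoDB instance reachable at the default local address, and the database, collection, and field names are invented.

```python
from pymongo import MongoClient, ASCENDING  # assumes the pymongo driver is installed

client = MongoClient("mongodb://localhost:27017")   # assumed local test instance
events = client["demo_db"]["events"]                # hypothetical database/collection

# Documents are schema-free: nested fields and arrays are stored as-is (BSON).
events.insert_one({
    "user": "alice",
    "action": "login",
    "device": {"type": "mobile", "os": "android"},
    "tags": ["test", "example"],
})

# A secondary index on a document field speeds up the ad-hoc query below.
events.create_index([("user", ASCENDING)])

for doc in events.find({"user": "alice"}, {"_id": 0, "action": 1, "device.type": 1}):
    print(doc)
```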

            Objectivity

[Objectivity Platform] is a distributed OODBMS (Object-Oriented Database Management System) for applications that require complex data models. It supports a large number of simultaneous queries and transactions and provides high-performance access to large volumes of physically distributed data. Objectivity manages data in a transparent way and uses a distributed database architecture that allows good performance and scalability. The main reasons for using a database of this type include the presence of complex relationships that suggest tree structures or graphs, and the presence of complex data, that is, when there are components of variable length and in particular multidimensional arrays. Other reasons are related to the presence of a database that must be geographically distributed and accessed via a processor grid, the use of more than one language or platform, and the use of workplace objects. Objectivity has an architecture that provides high performance in relation to the amount of data stored and the number of users. This architecture distributes tasks for computation and data storage in a transparent way across the different machines; it is also scalable and offers great availability.

            OpenQM

[OpenQM Database] is a DBMS that allows developing and running applications and includes a wide range of tools and advanced features for complex applications. Its database model belongs to the Multivalue family and therefore has many aspects in common with Pick-descended databases; it is also transactional. The development of Multivalue applications is often faster than with other types of database, which implies lower development costs and easier maintenance. This tool has a high degree of compatibility with other Multivalue database systems such as UniVerse [UniVerse], PI/open, D3, and others.

The RDF-HDT (Header-Dictionary-Triples) [RDF-HDT Library] is a new representation format that modularizes data and exploits the structure of large RDF graphs to reduce storage space; it is based on three main components: Header, Dictionary, and a set of Triples. The Header includes logical and physical data that describe the RDF data set, and it is the entry point to the data set. The Dictionary organizes all the identifiers in an RDF graph and provides a catalog of the amount of information in the RDF graph with a high level of compression. The set of Triples, finally, includes the pure structure of the underlying RDF graph and avoids the noise produced by long labels and repetitions. This design gains in modularity and compactness and addresses other important characteristics: it allows on-demand access to the RDF graph and is used to design specific RDF compression techniques (HDT-compress) able to outperform universal compressors. RDF-HDT introduces several advantages such as compactness and compression of stored data, using a small amount of memory space, communication bandwidth, and time. RDF-HDT uses low storage space thanks to the asymmetric structure of large RDF graphs, and its representation format consists of two primary modules, Dictionary and Triples. The Dictionary contains the mapping between elements and unique IDs, without repetition, thanks to which it achieves a high compression rate and speed in searches. Triples corresponds to the initial RDF graph in a compacted form where elements are replaced with the corresponding IDs. Thanks to two processes, HDT can also be generated from RDF (HDT encoder), and it can manage separate accesses to run queries, to access the full RDF, or to carry out management operations (HDT decoder).
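The Dictionary/Triples split can be illustrated with a tiny sketch in which every distinct term is mapped to an integer ID and the triples are kept only as ID tuples, which is the essence of the compactness gain; the graph below is invented and this is not the HDT binary format.

```python
# Invented RDF triples with long, repetitive terms (URIs/labels).
triples = [
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/knows", "http://example.org/bob"),
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/name",  "Alice"),
    ("http://example.org/bob",   "http://xmlns.com/foaf/0.1/name",  "Bob"),
]

# Dictionary component: each distinct term stored once, mapped to a numeric ID.
dictionary = {}
def term_id(term):
    return dictionary.setdefault(term, len(dictionary))

# Triples component: the graph structure as compact ID tuples, long strings removed.
encoded = [(term_id(s), term_id(p), term_id(o)) for s, p, o in triples]

print(len(dictionary), "distinct terms")
print(encoded)                      # [(0, 1, 2), (0, 3, 4), (2, 3, 5)]

# Decoding uses the inverse mapping, so no information is lost.
inverse = {i: t for t, i in dictionary.items()}
print(tuple(inverse[i] for i in encoded[0]))
```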

            

RDF-3X (Schramm, 2012) is an RDF store that implements SPARQL [SPARQL] following a RISC-style (Reduced Instruction Set Computer) architecture with efficient indexing and query processing. The design of the RDF-3X solution completely eliminates the process of tuning indices thanks to an exhaustive index of all permutations of subject–predicate–object triples and their unary and binary projections, resulting in highly compressed indices and in a query processor that can provide data results with excellent performance. The query optimizer can choose optimal join orders even for complex queries, using a cost model that includes statistical synopses for entire join paths. RDF-3X is able to provide good support for efficient online updates, thanks to a staging architecture.
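A SPARQL triple-pattern query of the kind an RDF store answers is sketched below using the rdflib library, which is an assumption of this example (it is not RDF-3X and does not implement its indexing); the data are invented.

```python
from rdflib import Graph, Literal, Namespace  # assumes rdflib is installed

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.alice, EX.knows, EX.bob))
g.add((EX.alice, EX.name, Literal("Alice")))
g.add((EX.bob, EX.name, Literal("Bob")))

# Subject-predicate-object patterns: who does alice know, and what is their name?
query = """
PREFIX ex: <http://example.org/>
SELECT ?friend ?name
WHERE {
    ex:alice ex:knows ?friend .
    ?friend  ex:name  ?name .
}
"""
for friend, name in g.query(query):
    print(friend, name)
```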

          Comparison and Analysis of Architectural Features

A large number of products and solutions have been reviewed in order to analyze the products most interesting for the readers and for the market, while a selection of solutions and products has been proposed in this chapter with the aim of representing all of them. The analysis performed has been very complex, since a multidisciplinary team has been involved in assessing the several aspects of multiple solutions, including correlated issues in the case of tools depending on other solutions (as reported in Table 2.1). This is due to the fact that the features of Big Data solutions are strongly intercorrelated, and thus it was impossible to identify orthogonal aspects to provide a simple and easy-to-read taxonomical representation. Table 2.1 can be used to compare different solutions in terms of: the infrastructure and the architecture of the systems that should cope with Big Data; data management aspects; and data analytics aspects, such as ingestion, log analysis, and everything else pertinent to the post-production phase of data processing. Some of the information related to specific features of products and tools has not been clearly identified; in those cases, we preferred to report that the information was not available.

          Application Domains Comparison

Nowadays, as reported in the introduction, Big Data solutions and technologies are currently used in many application domains with remarkable results and excellent future prospects to fully deal with the main challenges such as data modeling, organization, retrieval, and data analytics. A major investment in Big Data solutions can lay the foundations for the next generations of advances in medicine, science, research, education and e-learning, business and finance, healthcare, smart cities, security, info mobility, social media, and networking.

In order to assess different fields and solutions, a number of factors have been taken into account. In Table 2.2, the relevance of the main features of Big Data solutions with respect to the most interesting applicative domains is reported. When possible, each feature has been expressed for each applicative domain in terms of relevance: high, medium, or low relevance/impact. For some features, the assessment of a grading was not possible, and thus a comment including commonly adopted specific solutions and/or technologies has been provided.

Table 2.2 Relevance of the Main Features of Big Data Solutions with Respect to the Most Interesting Applicative Domains. For each domain—data analysis for scientific research (biomedical), education and cultural heritage, energy/transportation, financial/business, healthcare, security, smart cities and mobility, social media marketing, and social network internet services and web data—the table grades as high (H), medium (M), or low (L) the relevance of infrastructural and architectural aspects (distributed management, high availability, internal parallelism related to velocity), data management aspects (data dimension/volume, data replication, data organization, data relationships, SQL interoperability, data variety, data variability), data access aspects (data access performance, data access type, visual rendering and visualization, faceted query results, graph relationships navigation), and data analytics aspects (type of indexing, indexing speed, semantic query, statistical analysis tools in queries, CEP/active queries, log analysis, streaming processing); for data organization, data access type, and type of indexing, the commonly adopted solutions are reported instead of a grade.


The main features are strongly similar to those adopted in Table 2.1 (the features that did not present relevant differences across domains have been removed to make the table more readable). Table 2.2 can be considered a key for comparing Big Data solutions on the basis of their relevant differences. The assessment has been performed according to the state-of-the-art analysis of several Big Data applications in the domains and the corresponding solutions proposed. The application domains should be considered as macro-areas rather than specific scenarios. Despite the fact that the state of the art in this matter is in continuous evolution, the authors think that the work presented in this chapter can be used as a first step to understand which key factors allow identifying the suitable solutions to cope with a specific new domain. This means that the considerations have to be taken as examples and generalizations of the analyzed cases.

Table 2.2 can be read line by line. For example, considering the infrastructural aspect of supporting distributed architectures, "Distributed management," at first glance one could state that this feature is relevant for all the applicative domains. On the other hand, a lower degree of relevance has been noted for the educational domain. For example, student profile analysis for the purpose of personalized courses is typically locally based and accumulates a lower amount of information with respect to global market analysis, security, social media, etc. The latter cases are typically deployed as multisite, geographically distributed databases at the worldwide level, while educational applications are usually confined at regional and local levels. For instance, in higher education domains, a moderate amount of information is available and its use is often confined at the local level, for the students of the institute. This makes the advantage of geographically distributed services less important and interesting. Social networking applications, instead, typically need highly distributed architectures, since their users are also geographically distributed. Similar considerations can be applied to the demand for database consistency and perhaps to high availability, which could be less relevant for educational and social media applications with respect to the demands of the safety-critical situations of energy management and transportation. Moreover, the internal parallelism of the solution can be an interesting feature that can be fully exploited only in specific cases, depending on the data analytic algorithms adopted and when the problems can take advantage of a parallel implementation of the algorithm. This feature is strongly related to the reaction time; for example, in most of the social media applications, the analysis of users and content is something that is performed offline, updating values periodically but not in real time.

As regards the data management aspects, the amount of data involved is considerably huge in almost all the application domains (in the order of several petabytes and exabytes). On the other hand, this is not always true for the size of the individual files (with the exception of satellite images, medical images, or other multimedia files that can also be several gigabytes in size). The two aspects (number of elements to be indexed and accessed, and typical data size) are quite different and, in general, the former (number of elements) creates the major problems for processing and accessing data. For this reason, security, social media, and smart cities have been considered the applicative domains with the higher demand for Big Data solutions in terms of volume. Moreover, in many cases, the main problem is not the size of the data, but rather the management and the preservation of the relationships among the various elements; they represent the effective semantic value of the data set (the data model of Table 2.1 may help in comparing the solutions). Examples are the user profile (human relationships), traffic localization (service relationships, time relationships), patients' medical records (events and data relationships), etc. Data relationships are often stored in dedicated structures, and making specific queries and reasoning on them can be very important for some applications such as social media and networking. Therefore, for this aspect, the most challenging domains are again smart cities, social networks, and health care. In this regard, in these last two application domains, the use of graph relationship navigation constitutes a particularly useful support to improve the representation, research, and understanding of information and meanings not explicitly evident in the data itself.

In almost all domains, the usage and the interoperability of both SQL and NoSQL databases are very relevant, and some differences can be detected in the data organization. In particular, interoperability with former SQL systems is a very important feature in application contexts such as healthcare, social media marketing, business, and smart cities, owing to the widespread use of traditional RDBMS, rather than in application domains such as scientific research, social networks and web data, security, and energy, which are mainly characterized by unstructured or semistructured data. Some of the applicative domains intrinsically present a large variety and variability of data, while others present more standardized and regular information. A few of these domains present both variety and variability of data, such as scientific research, security, and social media, which may involve content-based analysis, video processing, etc.

Furthermore, the problem of data access is of particular importance in terms of performance and of the features provided for the rendering, representation, and/or navigation of produced results, such as visual rendering tools, presentation of results by using faceted facilities, etc. Most of the domains are characterized by the need for different custom interfaces for data rendering (built ad hoc on the main features) that provide safe access to the data. In some domains, it is also important to take into account issues related to concurrent access and thus data consistency, while in social media and smart cities it is important to provide on-demand and multidevice access to information, graphs, real-time conditions, etc. A flexible visual rendering (distributions, pies, histograms, trends, etc.) may be a strongly desirable feature for many scientific and research applications, as well as for financial data and health care (e.g., for reconstruction, trend analysis, etc.). Faceted query results can be very interesting for navigating in mainly text-based Big Data, as in the educational and cultural heritage application domains. Graph navigation among the resulting relationships can be an indispensable solution to represent the resulting data in smart cities and social media, and for presenting related implications and facts in financial and business applications. Moreover, in certain specific contexts, the data rendering has to be compliant with standards, for example, in health care.

In terms of data analytics aspects, several different features could be of interest in the different domains. The most relevant feature in this area is the type of indexing, which in turn characterizes the indexing performance. The indexing performance is very relevant in the domains in which a huge amount of small data has to be collected and needs to be accessed and elaborated in a short time, such as in finance, health care, security, and mobility. Otherwise, if the aim of the Big Data solution is mainly access and data processing, then fast indexing can be less relevant. For example, the use of HDFS may be suitable in contexts requiring complex and deep data processing, such as the evaluation of the evolution of a particular disease in the medical field, or the definition of specific business models. This approach, in fact, runs the processing function on a reduced data set, thus achieving the scalability and availability required for processing Big Data. In education, instead, the usage of ontologies, and thus of RDF databases and graphs, provides a rich semantic structure better than any other method of knowledge representation, improving the precision of search and access for educational contents, including the possibility of enforcing inference on the semantic data structure.

            The possibility of supporting statistical and logical analyses on data via specific queries and reasoning can be very important for some applications such as social media and networking. If this feature is structurally sup- ported, it is possible to realize direct operations on the data, or define and store specific queries to perform direct and fast statistical analysis: for exam- ple, for estimating recommendations, firing conditions, etc.

In other contexts, however, the continuous processing of data streams is very important, for example, to respond quickly to requests for information and services by the citizens of a "smart city," for real-time monitoring of the performance of financial stocks, or to report to medical staff unexpected changes in the health status of patients under observation. As can be seen from the table, in these contexts a particularly significant feature is the use of stream processing and CEP (active queries), which improves the user experience with a considerable increase in the speed of analysis; on the contrary, in the fields of education, scientific research, and transport, the speed of analysis is not a feature of primary importance, since in these contexts the most important thing is the storage of data to keep track of the results of experiments, phenomena, or situations that occurred in a specific time interval; the analysis is a step that can be realized at a later time.

Table 2.3 Application Domains in Relation to the Reviewed Big Data Solutions. The considered domains—data analysis for scientific research (biomedical), education and cultural heritage, energy/transportation, financial/business, healthcare, security, smart mobility and smart cities, social marketing, and social media—are marked against the reviewed products (ArrayDBMS, CouchBase, Db4o, eXist, Google MapReduce, Hadoop, HBase, Hive, MonetDB, MongoDB, Objectivity, OpenQM, RdfHdt Library, RDF-3X); an X indicates that a solution is adopted in the domain. Note: Y = frequently adopted.


Lastly, it is possible to observe Table 2.3, in which the considered application domains are shown in relation to the products examined in the previous section and to the reviewed scenarios. Among all the products reviewed, MonetDB and MongoDB are among the most flexible and adaptable to different situations and application contexts. It is also interesting to note that RDF-based solutions have been used mainly in social media and networking applications.

            Shown below are the most effective features of each product ana- lyzed and the main application domains in which their use is commonly suggested.

HBase over HDFS provides an elastic and fault-tolerant storage solution and provides strong consistency. Both HBase and HDFS are grounded on the fundamental design principle of elasticity. Facebook messages (Aiyer et al., 2012) exploit the potential of HBase to combine services such as messages and chats, storing approximately 14 TB of messages and 11 TB of chats each month. For these reasons, it is successfully used in the field of social media Internet services as well as in social media marketing.

RDF-3X is considered one of the fastest RDF representations, and it provides advantages in the handling of small data. The physical design of RDF-3X completely eliminates the need for index tuning, thanks to highly compressed indices for all permutations of triples and their binary and unary projections. Moreover, RDF-3X is optimized for queries and provides suitable support for efficient online updates by means of a staging architecture.

MonetDB achieves a significant speed improvement for both relational/SQL and XML/XQuery databases over other open-source systems; it introduces innovations at all layers of a DBMS: a storage model based on vertical fragmentation (column store), a modern CPU-tuned query execution architecture, automatic and self-tuning indexes, run-time query optimization, and a modular software architecture. MonetDB is primarily used for the management of large amounts of images, for example, in astronomy, seismology, and earth observation. These relevant features place MonetDB among the best tools in the field of scientific research and scientific data analysis, thus defining an interesting technology on which to develop scientific applications and create interdisciplinary platforms for the exchange of data in the worldwide community of researchers.

HDT has been proved by experiments to be a good tool for compacting data sets: they can be compacted more than 15 times with respect to standard RDF representations, thus improving parsing and processing while maintaining a consistent publication scheme. RDF-HDT therefore improves compactness and compression, using much less space and saving both storage space and communication bandwidth. For these reasons, this solution is especially suited to sharing data on the web, but also to contexts that require operations such as data analysis and visualization of results, thanks to the support for 3D visualization of the adjacency matrix of the RDF graph. eXist's query engine implements efficient index structures to collect data for scientific and academic research, educational assessments, and consumption in the energy sector, and its index-based query processing is needed to efficiently perform queries on large document collections. Experiments have, moreover, demonstrated the linear scalability of eXist's indexing, storage, and querying architecture. In general, search expressions using the full-text index perform better with eXist than the corresponding queries based on XPath.
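For instance, an XPath/full-text query of the kind mentioned above can be sent to eXist over its REST interface; the sketch below assumes a local eXist instance, an admin account with an empty password, and a hypothetical /db/research collection with a Lucene full-text index configured on the abstract element.

    import requests

    # XPath plus eXist's ft:query() full-text function; the index configuration
    # and collection layout are assumptions for this example.
    query = "//paper[ft:query(abstract, 'genome sequencing')]/title"

    resp = requests.get(
        "http://localhost:8080/exist/rest/db/research",   # assumed local instance
        params={"_query": query, "_howmany": "10"},
        auth=("admin", ""),                               # assumed credentials
    )
    print(resp.text)  # eXist returns matching nodes wrapped in an XML result element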

In scientific/technical applications, ArrayDBMS is often used in combination with complex queries, and therefore the optimization results are fundamental. ArrayDBMS may be used with both hardware and software parallelism, which makes possible the realization of efficient systems in many application fields.

Objectivity/DB guarantees complete support for ACID and can be replicated to multiple locations. Objectivity/DB is highly reliable, and thanks to the possibility of schema evolution, it provides advantages over other technologies that have a difficult time changing or updating a field. Thus, it has typically been used for building data-intensive systems or real-time applications that manipulate large volumes of complex data. Precisely because of these features, the main application fields are healthcare and financial services, respectively for the real-time management of electronic health records and for the analysis of the products with the highest consumption, together with the monitoring of sensitive information to support intelligence services.

OpenQM enables system development with reliability and also provides efficiency and stability. The choice of OpenQM is usually related to the need for speed, security, and reliability, and also to the ability to easily build excellent GUI interfaces on top of the database.

Couchbase is a high-performance and scalable data solution supporting high availability, fault tolerance, and data security. Couchbase may provide extremely fast response times. It is particularly suggested for applications developed to support citizens in the new model of smart urban cities (smart mobility, energy consumption, etc.). Thanks to its low latency, Couchbase is mainly used in the development of online gaming and, more generally, in applications where a significant performance improvement is very important, or where the extraction of meaningful information from the large amount of data constantly exchanged is mandatory, for example, in social networks such as Twitter, Facebook, Flickr, etc.

The main advantage of Hadoop is its ability to analyze huge data sets to quickly spot trends. In fact, most customers use Hadoop together with other types of software such as HDFS. The adoption of Google MapReduce provides several benefits: the indexing code is simpler, smaller, and easier to understand, and it guarantees fault tolerance and parallelization. Both Hadoop and Google MapReduce are preferably used in applications requiring large distributed computations. The New York Times, for example, uses Hadoop to process raw images and turn them into PDF format in an acceptable time (about 24 hours for 4 TB of images). Other big companies exploit the potential of these products: eBay, Amazon, Twitter, and Google itself, which uses MapReduce to regenerate Google's index, to update indices, and to run various types of analyses. Furthermore, this technology can be used in medical fields to perform large-scale data analysis with the aim of improving treatment and prevention of disease.

Hive significantly reduces the effort required for a migration to Hadoop, which makes it perfect for data warehousing, and it offers the ability to create ad hoc queries using a jargon similar to SQL. These features make Hive excellent for the analysis of large data sets, especially in social media marketing and web application business.
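The SQL-like jargon looks as follows when issued from Python; this is a hedged sketch assuming a HiveServer2 instance on localhost:10000 and a hypothetical page_views table partitioned by date, queried through the third-party PyHive client.

    from pyhive import hive

    conn = hive.Connection(host="localhost", port=10000, database="default")
    cur = conn.cursor()
    # Hive compiles this SQL-like statement into MapReduce jobs behind the scenes.
    cur.execute("""
        SELECT referrer, COUNT(*) AS hits
        FROM page_views
        WHERE dt = '2013-01-15'
        GROUP BY referrer
        ORDER BY hits DESC
        LIMIT 10
    """)
    for referrer, hits in cur.fetchall():
        print(referrer, hits)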

MongoDB provides relevant flexibility and simplicity, which may reduce development effort, and it is particularly suitable for applications requiring insertion and updating in real time, in addition to real-time query processing. It allows one to define the consistency level, which is directly related to the achievable performance. If high performance is not a necessity, it is possible to obtain maximum consistency by waiting until the new element has been replicated to all nodes. MongoDB uses internal memory to store the working set, thus allowing faster access to data. Thanks to these characteristics, MongoDB is easily usable in business and social marketing fields, and it is actually successfully used in gaming environments, thanks to its high performance for small read/write operations. As many other Big Data solutions, it is well suited for applications that handle high volumes of data where a traditional DBMS might be too expensive.
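The consistency/performance trade-off described above can be expressed directly in client code; the following is a sketch using pymongo, with host, database, and collection names made up for illustration (the "majority" write concern assumes a replica set).

    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern

    client = MongoClient("mongodb://localhost:27017")   # assumed local deployment
    db = client["game"]

    # Fast path: acknowledged by the primary only -- lower latency, weaker durability.
    fast_scores = db.get_collection("scores", write_concern=WriteConcern(w=1))
    fast_scores.insert_one({"player": "p42", "score": 1780})

    # Safe path: wait for a majority of replica-set members to hold the write.
    safe_scores = db.get_collection("scores", write_concern=WriteConcern(w="majority"))
    safe_scores.insert_one({"player": "p42", "score": 1795})

    print(fast_scores.find_one({"player": "p42"}, sort=[("score", -1)]))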

Db4o does not need a mapping function between the representation in memory and what is actually stored on disk, because the application schema corresponds to the data schema. This advantage allows one to obtain better performance and a good user experience. Db4o also permits database access directly from the programming language (Java, .NET, etc.), and thanks to its type safety, queries do not need to be checked against code injection. Db4o supports the CEP paradigm (see Table 2.2) and is therefore very suitable for medical applications, scientific research, and the analysis of financial and real-time data streams, in which the demand for this feature is very high.

          Conclusions

We have entered an era of Big Data. There is the potential for making faster advances in many scientific disciplines through better analysis of these large volumes of data, and also for improving the profitability of many enterprises. The need for this new generation of data management tools is being driven by the explosion of Big Data and by the rapidly growing volumes and variety of data that are collected today from alternative sources such as social networks like Twitter and Facebook.

NoSQL Database Management Systems represent a possible solution to these problems; unfortunately, they are not a definitive solution: these tools have a wide range of features that can be further developed to create new products more adaptable to this constantly growing stream of data and to its open challenges, such as error handling, privacy, unexpected correlation detection, trend analysis and prediction, timeliness analysis, and visualization. Considering this latter challenge, it is clear that, in a fast-growing market for maps, charts, and other ways to visually make sense of data, these larger volumes of data and analytical capabilities become the new coveted features; in the "Big Data world," static bar charts and pie charts are no longer enough, and what is needed are more dynamic, interactive tools and methods for line-of-business managers and information workers for viewing, understanding, and operating on the analysis of big data.

Each product compared in this review presents different features that may be needed in the different situations with which we are dealing. In fact, there is still no definitive, ultimate solution for the management of Big Data. The best way to determine on which product to base the development of your system may consist in analyzing the available data sets carefully and determining the requirements you cannot give up. Then, an analysis of the existing products is needed to determine the pros and cons, also considering other nonfunctional features such as the programming language, the integration aspects, the legacy constraints, etc.

          References

            

          Abouzeid A., Bajda-Pawlikowski C., Abadi D., Silberschatz A., Rasin A., HadoopDB:

          An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proceedings of the VLDB Endowment, 2(1), 922–933, 2009.

Aiyer A., Bautin M., Jerry Chen G., Damania P., Khemani P., Muthukkaruppan K., Ranganathan K., Spiegelberg N., Tang L., Vaidya M., Storage infrastructure behind Facebook messages using HBase at scale. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 35(2), 4–13, 2012.
AllegroGraph, http://www.franz.com/agraph/allegrograph/
Amazon AWS, http://aws.amazon.com/
Amazon Dynamo, http://aws.amazon.com/dynamodb/

Antoniu G., Bougè L., Thirion B., Poline J.B., AzureBrain: Large-scale Joint Genetic and Neuroimaging Data Analysis on Azure Clouds, Microsoft Research–Inria Joint Centre.
Apache Cassandra, http://cassandra.apache.org/
Apache HBase, http://hbase.apache.org/
Apache Hive, http://hive.apache.org/
Apache Solr, http://lucene.apache.org/solr/

          Baumann P., Dehmel A., Furtado P., Ritsch R., The multidimensional database sys-

          tem RasDaMan. SIGMOD’98 Proceedings of the 1998 ACM SIGMOD International

            Conference on Management of Data , Seattle, Washington, pp. 575–577, 1998, ISBN: 0-89791-995-5.

            

          Bellini P., Cenni D., Nesi P., On the effectiveness and optimization of information

          retrieval for cross media content, Proceedings of the KDIR 2012 is Part of IC3K 2012, International Joint Conference on Knowledge Discovery

            , Knowledge Engineering and Knowledge Management, Barcelona, Spain, 2012a.

Bellini P., Bruno I., Cenni D., Fuzier A., Nesi P., Paolucci M., Mobile medicine: Semantic computing management for health care applications on desktop and mobile devices.

            

          Bellini P., Bruno I., Cenni D., Nesi P., Micro grids for scalable media computing and

          intelligence in distributed scenarios. IEEE MultiMedia, 19(2), 69–79, 2012c.

          Bellini P., Bruno I., Nesi P., Rogai D., Architectural solution for interoperable content

          and DRM on multichannel distribution, Proc. of the International Conference on

            Distributed Multimedia Systems, DMS 2007 , Organised by Knowledge Systems Institute, San Francisco Bay, USA, 2007.

            

Bellini P., Nesi P., Pazzaglia F., Exploiting P2P scalability for grant authorization digital rights management solutions. Multimedia Tools and Applications, Springer, April 2013.

            

Ben-Yitzhak O., Golbandi N., Har'El N., Lempel R., Neumann A., Ofek-Koifman S., Sheinwald D., Shekita E., Sznajder B., Yogev S., Beyond basic faceted search, Proc. of the 2008 International Conference on Web Search and Data Mining, pp. 33–44, 2008.
BlueGene IBM project, http://www.research.ibm.com/bluegene/index.html

          Borthakur D., Muthukkaruppan K., Ranganathan K., Rash S., SenSarma J.,

            Spielberg N., Molkov D. et  al., Apache Hadoop goes realtime at Facebook, Proceedings of the 2011 International Conference on Management of Data , Athens, Greece, 2011.

            

          Bose I., Mahapatra R.K., Business data mining—a machine learning perspective.

            Information & Management , 39(3), 211–225, 2001.

            

          Brewer E., CAP twelve years later: How the rules have changed. IEEE Computer, 45(2),

          23–29, 2012.

          Brewer E., Lesson from giant-scale services. IEEE Internet Computing, 5(4), 46–55,

          2001.

          Bryant R., Katz R.H., Lazowska E.D., Big-data computing: Creating revolution-

          ary breakthroughs in commerce, science and society, In Computing Research

            Initiatives for the 21st Century, Computing Research Association, Ver.8 , 2008. http:// www.cra.org/ccc/files/docs/init/Big_Data.pdf

          Bryant R.E., Carbonell J.G., Mitchell T., From data to knowledge to action: Enabling

          advanced intelligence and decision-making for America’s security, Computing Community Consortium , Version 6, July 28, 2010.

            

          Cao L., Wang Y., Xiong J., Building highly available cluster file system based on

          replication, International Conference on Parallel and Distributed Computing, Applications and Technologies

            , Higashi Hiroshima, Japan, pp. 94–101, December 2009.

Cattel R., Scalable SQL and NoSQL data stores. ACM SIGMOD Record, 39(4), 12–27, 2010.
Couchbase, http://www.couchbase.com/

          Cunningham H., Maynard D., Bontcheva K., Tablan V., GATE: A framework and

          graphical development environment for robust NLP tools and applications, Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics

            , Philadelphia, July 2002.

          De Witt D.J., Paulson E., Robinson E., Naugton J., Royalty J., Shankar S., Krioukov A.,

          Clustera: an integrated computation and data management system. Proceedings of the VLDB Endowment , 1(1), 28–41, 2008.

            

De Witt S., Sinclair R., Sansum A., Wilson M., Managing large data volumes from scientific facilities. ERCIM News, 89, 2012.

Domingos P., Mining social networks for viral marketing. IEEE Intelligent Systems, 20(1), 80–82, 2005.
Dykstra D., Comparison of the frontier distributed database caching system to NoSQL databases, Computing in High Energy and Nuclear Physics (CHEP) Conference, New York, May 2012.
Eaton C., Deroos D., Deutsch T., Lapis G., Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw Hill Professional, New York, 2012, ISBN: 978-0071790536.
ECLAP, http://www.eclap.eu
Europeana Portal, http://www.europeana.eu/portal/
Everitt B., Landau S., Leese M., Cluster Analysis, 4th edition, Arnold, London, 2001.
Figueireido V., Rodrigues F., Vale Z., An electric energy consumer characterization framework based on data mining techniques. IEEE Transactions on Power Systems, 20(2), 596–602, 2005.
Foster I., Jeffrey M., Tuecke S., Grid services for distributed system integration. IEEE Computer, 35(6), 37–46, 2002.
Fox A., Brewer E.A., Harvest, yield, and scalable tolerant systems, Proceedings of the Seventh Workshop on Hot Topics in Operating Systems, Rio Rico, Arizona, pp. 174–178, 1999.
Gallego M.A., Fernandez J.D., Martinez-Prieto M.A., De La Fuente P., RDF visualization using a three-dimensional adjacency matrix, 4th International Semantic Search Workshop (SemSearch), Hyderabad, India, 2011.
Ghosh D., Sharman R., Rao H.R., Upadhyaya S., Self-healing systems—Survey and synthesis. Decision Support Systems, 42(4), 2164–2185, 2007.
Google Drive, http://drive.google.com
GraphBase, http://graphbase.net/
Gulisano V., Jimenez-Peris R., Patino-Martinez M., Soriente C., Valduriez P., A big data platform for large scale event processing. ERCIM News, 89, 32–33, 2012.
Hadoop Apache Project, http://hadoop.apache.org/
Hanna M., Data mining in the e-learning domain. Campus-Wide Information Systems, 21(1), 29–34, 2004.
Iaconesi S., Persico O., The co-creation of the city, re-programming cities using real-time user generated content, 1st Conference on Information Technologies for Performing Arts, Media Access and Entertainment, Florence, Italy, 2012.
Iannella R., Open digital rights language (ODRL), Version 1.1 W3C Note, 2002, http://www.w3.org/TR/odrl
Jacobs A., The pathologies of big data. Communications of the ACM—A Blind Person's Interaction with Technology, 52(8), 36–44, 2009.
Jagadish H.V., Narayan P.P.S., Seshadri S., Kanneganti R., Sudarshan S., Incremental organization for data recording and warehousing, Proceedings of the 23rd International Conference on Very Large Data Bases, Athens, Greece, pp. 16–25, 1997.
Karloff H., Suri S., Vassilvitskii S., A model of computation for MapReduce. Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 938–948, 2010.
LHC, http://public.web.cern.ch/public/en/LHC/LHC-en.html

Liu L., Biderman A., Ratti C., Urban mobility landscape: Real time monitoring of urban mobility patterns, Proceedings of the 11th International Conference on Computers in Urban Planning and Urban Management (CUPUM), Hong Kong, 2009.

          Mans R.S., Schonenberg M.H., Song M., Van der Aalst W.M.P., Bakker P.J.M.,

          Application of process mining in healthcare—A case study in a Dutch hospital.

            Biomedical Engineering Systems and Technologies, Communications in Computer and Information Science , 25(4), 425–438, 2009.

            

          McHugh J., Widom J., Abiteboul S., Luo Q., Rajaraman A., Indexing semistructured

          data, Technical report, Stanford University, California, 1998.

Meier W., eXist: An open source native XML database. Web, Web-Services, and Database Systems—Lecture Notes in Computer Science, 2593, 169–183, 2003.
Memcached, http://memcached.org/
Microsoft Azure, http://www.windowsazure.com/it-it/
Microsoft, Microsoft private cloud. Tech. rep., 2012.

Mislove A., Gummandi K.P., Druschel P., Exploiting social networks for Internet search, Record of the Fifth Workshop on Hot Topics in Networks: HotNets V, Irvine, CA, pp. 79–84, November 2006.
MongoDB, http://www.mongodb.org/
MPEG-21, http://mpeg.chiariglione.org/standards/mpeg-21/mpeg-21.htm
Neo4J, http://neo4j.org/

          Norrie M.C., Grossniklaus M., Decurins C., Semantic data management for db4o,

            Proceedings of 1st International Conference on Object Databases (ICOODB 2008) , Frankfurt/Main, Germany, pp. 21–38, 2008.

            NoSQL DB, http://nosql-database.org/

          Obenshain M.K., Application of data mining techniques to healthcare data, Infection

          Control and Hospital Epidemiology , 25(8), 690–695, 2004.

            Objectivity Platform, http://www.objectivity.com

Olston C., Jiang J., Widom J., Adaptive filters for continuous queries over distributed data streams, Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 563–574, 2003.
OpenCirrus Project, https://opencirrus.org/
OpenQM Database, http://www.openqm.org/docs/
OpenStack Project, http://www.openstack.org
Oracle Berkeley, http://www.oracle.com/technetwork/products/berkeleydb/

Pierre G., El Helw I., Stratan C., Oprescu A., Kielmann T., Schuett T., Stender J., Artac M., Cernivec A., ConPaaS: An integrated runtime environment for elastic cloud applications, ACM/IFIP/USENIX 12th International Middleware Conference, Lisboa, Portugal, December 2011.
RDF-HDT Library, http://www.rdfhdt.org

          Rivals E., Philippe N., Salson M., Léonard M., Commes T., Lecroq T., A scalable

          indexing solution to mine huge genomic sequence collections. ERCIM News, 89, 20–21, 2012.

          Rusitschka S., Eger K., Gerdes C., Smart grid data cloud: A model for utilizing cloud

          computing in the smart grid domain, 1st IEEE International Conference of Smart

            Grid Communications , Gaithersburg, MD, 2010.

            

Setnes M., Kaymak U., Fuzzy modeling of client preference from large data sets: An application to target selection in direct marketing. IEEE Transactions on Fuzzy Systems, 9(1), February 2001.
SCAPE Project, http://scape-project.eu/
Schramm M., Performance of RDF representations, 16th TSConIT, 2012.

Silvestri Ludovico (LENS), Alessandro Bria (UCBM), Leonardo Sacconi (LENS), Anna … (CINECA), Carlo Cavazzoni (CINECA), Giovanni Erbacci (CINECA), Roberta Turra (CINECA), Giuseppe Fiameni (CINECA), Valeria Ruggiero (UNIFE), Paolo Frasconi (DSI-UNIFI), Simone Marinai (DSI-UNIFI), Marco Gori (DiSI-UNISI), Paolo Nesi (DSI-UNIFI), Renato Corradetti (Neuroscience-UNIFI), Giulio Iannello (UCBM), Francesco Saverio Pavone (ICON, LENS), Projectome: Set up and testing of a high performance computational infrastructure for processing and visualizing neuro-anatomical information obtained using confocal ultra-microscopy techniques, Neuroinformatics 2012 5th INCF Congress, Munich, Germany, September 2012.

Snell A., Solving big data problems with private cloud storage, White paper, October 2011.
SPARQL at W3C, http://www.w3.org/TR/rdf-sparql-query/

          Thelwall M., A web crawler design for data mining. Journal of Information Science,

            27(1), 319–325, 2001.

          Thusoo A., Sarma J.S., Jain N., Zheng Shao, Chakka P., Ning Zhang, Antony S., Hao

          Liu, Murthy R., Hive—A petabyte scale data warehouse using Hadoop, IEEE

            26th International Conference on Data Engineering (ICDE) , pp. 996–1005, Long Beach, CA, March 2010.

            UniVerse, http://u2.rocketsoftware.com/products/u2-universe

          Valduriez P., Pacitti E., Data management in large-scale P2P systems. High

          Performance Computing for Computational Science Vecpar 2004—Lecture Notes in Computer Science , 3402, 104–118, 2005.

          Woolf B.P., Baker R., Gianchandani E.P., From Data to Knowledge to Action: Enabling

            Personalized Education . Computing Community Consortium, Version 9, Computing Research Association, Washington DC, 2 September 2010. http://

          www.cra.org/ccc/files/docs/init/Enabling_Personalized_Education.pdf

            Xui R., Wunsch D.C. II, Clustering, John Wiley and Sons, USA, 2009.

          Yang H., Dasdan A., Hsiao R.L., Parker D.S., Map-reduce-merge: simplified relational

          data processing on large clusters, SIGMOD’07 Proceedings of the 2007 ACM

            SIGMOD International Conference on Management of Data , Beijing, China, pp.

            1029–1040, June 2007, ISBN: 978-1-59593-686-8.

          Yang J., Tang D., Zhou Y., A distributed storage model for EHR based on HBase,

            International Conference on Information Management, Innovation Management and Industrial Engineering , Shenzhen, China, 2011.

            Zenith, http://www-sop.inria.fr/teams/zenith/

Zhang Y., Kersten M., Ivanova M., Pirk H., Manegold S., An implementation of ad-hoc array queries on top of MonetDB, TELEIOS FP7-257662 Deliverable D5.1, February 2012.
Zinterhof P., Computer-aided diagnostics. ERCIM News, 89, 46, 2012.

            


Roberto V. Zicari

CONTENTS

Introduction
The Story as it is Told from the Business Perspective
The Story as it is Told from the Technology Perspective
    Data Challenges
        Volume
        Variety, Combining Multiple Data Sets
        Velocity
        Veracity, Data Quality, Data Availability
        Data Discovery
        Quality and Relevance
        Data Comprehensiveness
        Personally Identifiable Information
        Data Dogmatism
        Scalability
    Process Challenges
    Management Challenges
    Big Data Platforms Technology: Current State of the Art
        Take the Analysis to the Data!
        What Is Apache Hadoop?
        Who Are the Hadoop Users?
        An Example of an Advanced User: Amazon
        Big Data in Data Warehouse or in Hadoop?
        Big Data in the Database World (Early 1980s Till Now)
        Big Data in the Systems World (Late 1990s Till Now)
        Enterprise Search
        Big Data "Dichotomy"
        Hadoop and the Cloud
        Hadoop Pros
        Hadoop Cons
    Technological Solutions for Big Data Analytics
        Scalability and Performance at eBay
        Unstructured Data
        Cloud Computing and Open Source
    Big Data Myth
    Main Research Challenges and Business Challenges
Big Data for the Common Good
    World Economic Forum, the United Nations Global Pulse Initiative
    What Are the Main Difficulties, Barriers Hindering Our Community to Work on Social Capital Projects?
    What Could We Do to Help Supporting Initiatives for Big Data for Good?
Conclusions: The Search for Meaning Behind Our Activities
Acknowledgments
References

          Introduction

            “Big Data is the new gold” (Open Data Initiative)

            Every day, 2.5 quintillion bytes of data are created. These data come from digital pictures, videos, posts to social media sites, intelligent sensors, pur- chase transaction records, cell phone GPS signals, to name a few. This is known as Big Data.

            There is no doubt that Big Data and especially what we do with it has the potential to become a driving force for innovation and value creation. In this chapter, we will look at Big Data from three different perspectives: the business perspective, the technological perspective, and the social good perspective.

          The Story as it is Told from the Business Perspective

            Now let us define the term Big Data. I have selected a definition, given by McKinsey Global Institute (MGI) [1]:

            

          “Big Data” refers to datasets whose size is beyond the ability of typical

          database software tools to capture, store, manage and analyze.

This definition is quite general and open ended, and it well captures the rapid growth of available data; it also shows the need for technology to "catch up" with it. The definition is intentionally not stated in terms of data size; in fact, data sets will keep increasing in the future! It also obviously varies by sector, ranging from a few dozen terabytes to multiple petabytes.

(Big) Data is in every industry and business function and is an important factor of production. MGI estimated that enterprises globally stored 7 exabytes of new data in 2010. Interestingly, more than 50% of IP traffic is nonhuman, and M2M will become increasingly important. So what is Big Data supposed to create? Value. But what "value" exactly? Big Data per se does not produce any value.

David Gorbet of MarkLogic explains [2]: "the increase in data complexity is the biggest challenge that every IT department and CIO must address. Businesses across industries have to not only store the data but also be able to leverage it quickly and effectively to derive business value."

Value comes only from what we infer from it. That is why we need Big Data Analytics. Werner Vogels, CTO of Amazon, describes Big Data Analytics as follows [3]: "in the old world of data analysis you knew exactly which questions you wanted to ask, which drove a very predictable collection and storage model. In the new world of data analysis your questions are going to evolve and change over time and as such you need to be able to collect, store and analyze data without being constrained by resources."

            According to MGI, the “value” that can be derived by analyzing Big Data can be spelled out as follows:

          • Creating transparencies;
          • Discovering needs, exposing variability, and improving performance;
          • Segmenting customers; and
• Replacing/supporting human decision-making with automated algorithms; and
• Innovating new business models, products, and services.

            “The most impactful Big Data Applications will be industry- or even organization-specific, leveraging the data that the organization consumes and generates in the course of doing business. There is no single set for- mula for extracting value from this data; it will depend on the application” explains David Gorbet.

            “There are many applications where simply being able to comb through large volumes of complex data from multiple sources via interactive que- ries can give organizations new insights about their products, customers, services, etc. Being able to combine these interactive data explorations with some analytics and visualization can produce new insights that would oth- erwise be hidden. We call this Big Data Search” says David Gorbet.

            Gorbet’s concept of “Big Data Search” implies the following:

          • There is no single set formula for extracting value from Big Data; it will depend on the application.
• There are many applications where simply being able to comb through large volumes of complex data from multiple sources via interactive queries can give organizations new insights about their products, customers, services, etc.

          • Being able to combine these interactive data explorations with some analytics and visualization can produce new insights that would otherwise be hidden.

            Gorbet gives an example of the result of such Big Data Search: “it was anal- ysis of social media that revealed that Gatorade is closely associated with flu and fever, and our ability to drill seamlessly from high-level aggregate data into the actual source social media posts shows that many people actually take Gatorade to treat flu symptoms. Geographic visualization shows that this phenomenon may be regional. Our ability to sift through all this data in real time, using fresh data gathered from multiple sources, both internal and external to the organization helps our customers identify new actionable insights.”

            

          Where Big Data will be used? According to MGI, Big Data can generate finan-

            cial value across sectors. They identified the following key sectors:

          • Health care (this is a very sensitive area, since patient records and, in general, information related to health are very critical)
          • Public sector administration (e.g., in Europe, the Open Data Initiative—a European Commission initiative which aims at open- ing up Public Sector Information)
          • Global personal location data (this is very relevant given the rise of mobile devices)
          • Retail (this is the most obvious, since the existence of large Web retail shops such as eBay and Amazon)
• Manufacturing

I would add two additional areas to the list:
          • Social personal/professional data (e.g., Facebook, Twitter, and the like)

            What are examples of Big Data Use Cases? The following is a sample list:

          • Log analytics
          • Fraud detection
          • Social media and sentiment analysis
          • Risk modeling and management


            Currently, the key limitations in exploiting Big Data, according to MGI, are

          • Shortage of talent necessary for organizations to take advantage of

            Big Data

          • Shortage of knowledge in statistics, machine learning, and data mining

Both limitations reflect the fact that the current underlying technology is quite difficult to use and understand. As with every new technology, Big Data Analytics technology will take time to reach a level of maturity and ease of use for enterprises at large. All the above-mentioned examples of value generated by analyzing Big Data, however, do not take into account the possibility that such derived "values" are negative.

In fact, if improperly used, the analysis of Big Data also poses issues, specifically in the following areas:

          • Access to data
          • Data policies
          • Industry structure
• Technology and techniques

This is outside the scope of this chapter, but it is for sure one of the most important nontechnical challenges that Big Data poses.

          The Story as it is Told from the Technology Perspective

            The above are the business “promises” about Big Data. But what is the reality today? Big data problems have several characteristics that make them techni-

            cally challenging .

            We can group the challenges when dealing with Big Data in three dimen- sions: data, process, and management. Let us look at each of them in some detail:

Data Challenges

Volume

The volume of data, especially machine-generated data, is exploding and is growing faster every year, with new sources of data emerging. For example, in the year 2000, 800,000 petabytes (PB) of data were stored in the world, and this is expected to reach 35 zettabytes (ZB) by 2020.

            Social media plays a key role: Twitter generates 7+ terabytes (TB) of data every day. Facebook, 10 TB. Mobile devices play a key role as well, as there were estimated 6 billion mobile phones in 2011. The challenge is how to deal with the size of Big Data.

            Variety, Combining Multiple Data Sets

            More than 80% of today’s information is unstructured and it is typically too big to manage effectively. What does it mean? David Gorbet explains [2]:

            It used to be the case that all the data an organization needed to run its operations effectively was structured data that was generated within the organization. Things like customer transaction data,

            ERP data, etc. Today, companies are looking to leverage a lot more data from a wider variety of sources both inside and outside the organization. Things like documents, contracts, machine data, sen- sor data, social media, health records, emails, etc. The list is endless really. A lot of this data is unstructured, or has a complex structure that’s hard to represent in rows and columns. And organizations want to be able to combine all this data and analyze it together in new ways. For example, we have more than one customer in different industries whose applications combine geospatial vessel location data with weather and news data to make real-time mission-critical decisions. Data come from sensors, smart devices, and social collaboration tech- nologies. Data are not only structured, but raw, semistructured, unstructured data from web pages, web log files (click stream data), search indexes, e-mails, documents, sensor data, etc.

            Semistructured Web data such as A/B testing, sessionization, bot detection, and pathing analysis all require powerful analytics on many petabytes of semistructured Web data. The challenge is how to handle multiplicity of types, sources, and formats.

            Velocity

            Shilpa Lawande of Vertica defines this challenge nicely [4]: “as busi- nesses get more value out of analytics, it creates a success problem— they want the data available faster, or in other words, want real-time analytics.

            And they want more people to have access to it, or in other words, high

            Big Data

            One of the key challenges is how to react to the flood of information in the time required by the application.

            Veracity, Data Quality, Data Availability

            Who told you that the data you analyzed is good or complete? Paul Miller [5] mentions that “a good process will, typically, make bad decisions if based upon bad data. E.g. what are the implications in, for example, a Tsunami that affects several Pacific Rim countries? If data is of high quality in one country, and poorer in another, does the Aid response skew ‘unfairly’ toward the well-surveyed country or toward the educated guesses being made for the poorly surveyed one?”

            There are several challenges: How can we cope with uncertainty, imprecision, missing values, mis- statements or untruths? How good is the data? How broad is the coverage? How fine is the sampling resolution? How timely are the readings? How well understood are the sampling biases? Is there data available, at all?

            Data Discovery

            This is a huge challenge: how to find high-quality data from the vast collec- tions of data that are out there on the Web.

            Quality and Relevance

            The challenge is determining the quality of data sets and relevance to par- ticular issues (i.e., the data set making some underlying assumption that ren- ders it biased or not informative for a particular question).

            Data Comprehensiveness

            Are there areas without coverage? What are the implications?

            Personally Identifiable Information

            Much of this information is about people. Partly, this calls for effective indus- trial practices. “Partly, it calls for effective oversight by Government. Partly— perhaps mostly—it requires a realistic reconsideration of what privacy really means”. (Paul Miller [5])

            Can we extract enough information to help people without extracting so

            Big Data Computing Data Dogmatism

            Analysis of Big Data can offer quite remarkable insights, but we must be wary of becoming too beholden to the numbers. Domain experts—and com- mon sense—must continue to play a role.

            For example, “It would be worrying if the healthcare sector only responded to flu outbreaks when Google Flu Trends told them to.” (Paul Miller [5])

            Scalability

            Shilpa Lawande explains [4]: “techniques like social graph analysis, for instance leveraging the influencers in a social network to create better user experience are hard problems to solve at scale. All of these problems combined create a perfect storm of challenges and opportunities to create faster, cheaper and better solutions for Big Data analytics than traditional approaches can solve.”

            Process Challenges

            “It can take significant exploration to find the right model for analysis, and the ability to iterate very quickly and ‘fail fast’ through many (possible throw away) models—at scale—is critical.” (Shilpa Lawande)

            According to Laura Haas (IBM Research), process challenges with deriv- ing insights include [5]:

          • Capturing data
          • Aligning data from different sources (e.g., resolving when two objects are the same)
          • Transforming the data into a form suitable for analysis
          • Modeling it, whether mathematically, or through some form of simulation
          • Understanding the output, visualizing and sharing the results, think for a second how to display complex analytics on a iPhone or a mobile device

            Management Challenges

            “Many data warehouses contain sensitive data such as personal data. There are legal and ethical concerns with accessing such data.

            So the data must be secured and access controlled as well as logged for audits.” (Michael Blaha) The main management challenges are

          • Data privacy


• Governance
• Ethical

The challenges are: ensuring that data are used correctly (abiding by its intended uses and relevant laws), tracking how the data are used, transformed, derived, etc., and managing its lifecycle.

            Big Data Platforms Technology: Current State of the art

            The industry is still in an immature state, experiencing an explosion of dif- ferent technological solutions. Many of the technologies are far from robust or enterprise ready, often requiring significant technical skills to support the software even before analysis is attempted. At the same time, there is a clear shortage of analytical experience to take advantage of the new data. Nevertheless, the potential value is becoming increasingly clear.

            In the past years, the motto was “rethinking the architecture”: scale and performance requirements strain conventional databases.

            

          “The problems are a matter of the underlying architecture. If not built

          for scale from the ground-up a database will ultimately hit the wall—

          this is what makes it so difficult for the established vendors to play in

          this space because you cannot simply retrofit a 20+year-old architecture

          to become a distributed MPP database over night,” says Florian Waas of

          EMC/Greenplum [6].

          “In the Big Data era the old paradigm of shipping data to the applica-

          tion isn’t working any more. Rather, the application logic must ‘come’ to

          the data or else things will break: this is counter to conventional wisdom

          and the established notion of strata within the database stack. With tera-

          bytes, things are actually pretty simple—most conventional databases

          scale to terabytes these days. However, try to scale to petabytes and it’s a

          whole different ball game.” (Florian Waas)

            This confirms Gray’s Laws of Data Engineering, adapted here to Big Data:

            Take the Analysis to the Data!

            In order to analyze Big Data, the current state of the art is a parallel database or NoSQL data store, with a Hadoop connector. Hadoop is used for process- ing the unstructured Big Data. Hadoop is becoming the standard platform for doing large-scale processing of data in the enterprise. Its rate of growth far exceeds any other “Big Data” processing platform.

            What Is Apache Hadoop?

Hadoop provides a new open source platform to analyze and process Big Data. It was inspired by Google's MapReduce and Google File System (GFS). The ecosystem includes higher-level declarative languages for writing queries and data analysis pipelines, such as:

          • Pig (Yahoo!)—relational-like algebra—(used in ca. 60% of Yahoo!

            MapReduce use cases)

          • PigLatin
          • Hive (used by Facebook) also inspired by SQL—(used in ca. 90% of

            Facebook MapReduce use cases)

          • Jaql (IBM)
          • Several other modules that include Load, Transform, Dump and store, Flume Zookeeper Hbase Oozie Lucene Avro, etc.

            Who Are the Hadoop Users?

            A simple classification: • Advanced users of Hadoop.

            They are often PhDs from top universities with high expertise in analytics, databases, and data mining. They are looking to go beyond batch uses of Hadoop to support real-time streaming of content. Product recommendations, ad placements, customer churn, patient outcome predictions, fraud detection, and senti- ment analysis are just a few examples that improve with real- time information. How many of such advanced users currently exist? “There are only a few Facebook-sized IT organizations that can have 60 Stanford PhDs on staff to run their Hadoop infrastruc- ture. The others need it to be easier to develop Hadoop applica- tions, deploy them and run them in a production environment.” (JohnSchroeder [7]) So, not that many apparently.

          • New users of Hadoop They need Hadoop to become easier. Need it to be easier to develop

Hadoop applications, deploy them, and run them in a production environment. Organizations are also looking to expand Hadoop use cases to include business-critical, secure applications that easily integrate with file-based applications and products. With mainstream adoption comes the need for tools that do not require specialized skills and programmers. New Hadoop developments must make it easier to get data in and out. This includes direct access with standard protocols using existing tools and applications. Is there a real need for it? See also the Big Data Myth later.

            An Example of an Advanced User: Amazon

            “We chose Hadoop for several reasons. First, it is the only available frame- work that could scale to process 100s or even 1000s of terabytes of data and scale to installations of up to 4000 nodes. Second, Hadoop is open source and we can innovate on top of the framework and inside it to help our customers develop more performant applications quicker.

            Third, we recognized that Hadoop was gaining substantial popularity in the industry with multiple customers using Hadoop and many vendors innovating on top of Hadoop. Three years later we believe we made the right choice. We also see that existing BI vendors such as Microstrategy are willing to work with us and integrate their solutions on top of Elastic. MapReduce.” (Werner Vogels, VP and CTO Amazon [3])

            Big Data in Data Warehouse or in Hadoop?

            Roughly speaking we have:

          • Data warehouse: structured data, data “trusted”
          • Hadoop: semistructured and unstructured data. Data “not trusted” An interesting historical perspective of the development of Big Data comes from Michael J. Carey [8]. He distinguishes between:

            Big Data in the Database World (Early 1980s Till Now)

          • Parallel Databases. Shared-nothing architecture, declarative set- oriented nature of relational queries, divide and conquer parallelism (e.g., Teradata). Later phase re-implementation of relational data- bases (e.g., HP/Vertica, IBM/Netezza, Teradata/Aster Data, EMC/ Greenplum, Hadapt) and

            Big Data in the Systems World (Late 1990s Till Now)

          • Apache Hadoop (inspired by Google GFS, MapReduce), contributed by large Web companies. For example, Yahoo!, Facebook, Google


            The Parallel database software stack (Michael J. Carey) comprises

• SQL → SQL Compiler
• Relational Dataflow Layer (runs the query plans, orchestrates the local storage managers, delivers partitioned, shared-nothing storage services for large relational tables)
          • Row/Column Storage Manager (record-oriented: made up of a set of row-oriented or column-oriented storage managers per machine in a cluster)

            

          Note : no open-source parallel database exists! SQL is the only way into the

            system architecture. Systems are monolithic: Cannot safely cut into them to access inner functionalities.

            The Hadoop software stack comprises (Michael J. Carey):

          • HiveQL. PigLatin, Jaql script → HiveQL/Pig/Jaql (High-level languages)
          • Hadoop M/R job → Hadoop MapReduce Dataflow La
          • (for batch analytics, applies Map ops to the data in partitions of an

            HDFS file, sorts, and redistributes the results based on key values in the output data, then performs reduce on the groups of output data items with matching keys from the map phase of the job).

          • Get/Put ops → Hbase Key-value Store (accessed directly by client app or via Hadoop for analytics needs)
          • Hadoop Distributed File System (byte oriented file abstraction— files appears as a very large contiguous and randomly addressable sequence of bytes

            

          Note : all tools are open-source! No SQL. Systems are not monolithic: Can

          safely cut into them to access inner functionalities.
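The byte-stream file abstraction of HDFS in the stack above can be exercised from outside the cluster through WebHDFS; the sketch below is an illustration using the third-party HdfsCLI Python package, where the endpoint, user, and paths are assumptions (port 9870 applies to Hadoop 3, 50070 to Hadoop 2).

    from hdfs import InsecureClient   # the HdfsCLI package, a WebHDFS client

    client = InsecureClient("http://localhost:9870", user="hdfs")  # assumed endpoint

    # HDFS exposes files as large byte sequences: write then read back a small file.
    client.write("/tmp/example/notes.txt", data=b"line one\nline two\n", overwrite=True)
    with client.read("/tmp/example/notes.txt") as reader:
        print(reader.read().decode("utf-8"))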

            A key requirement when handling Big Data is scalability. Scalability has three aspects

          • data volume
          • hardware size
          • concurrency What is the trade-off between scaling out and scaling up? What does it mean in practice for an application domain?

            Chris Anderson of Couchdb explains [11]: “scaling up is easier from a soft- ware perspective. It’s essentially the Moore’s Law approach to scaling—buy a bigger box. Well, eventually you run out of bigger boxes to buy, and then


            Scaling out means being able to add independent nodes to a system. This is the real business case for NoSQL. Instead of being hostage to Moore’s Law, you can grow as fast as your data. Another advantage to adding independent nodes is you have more options when it comes to matching your workload. You have more flexibility when you are running on commodity hardware— you can run on SSDs or high-compute instances, in the cloud, or inside your firewall.”

            Enterprise Search

            Enterprise Search implies being able to search multiple types of data gener- ated by an enterprise. There are two alternatives: Apache Solr or implement- ing a proprietary full-text search engine.

            There is an ecosystem of open source tools that build on Apache Solr.

            Big Data “Dichotomy”

            The prevalent architecture that people use to analyze structured and unstructured data is a two-system configuration, where Hadoop is used for processing the unstructured data and a relational database system or an NoSQL data store is used for the structured data as a front end.

            NoSQL data stores were born when Developers of very large-scale user- facing Web sites implemented key-value stores:

          • Google Big Table • Amazon Dynamo • Apache Hbase (open source BigTable clone) • Apache Cassandra, Riak (open source Dynamo clones), etc.

            There are concerns about performance issues that arise along with the transfer of large amounts of data between the two systems. The use of con- nectors could introduce delays and data silos, and increase Total Cost of Ownership (TCO).

            Daniel Abadi of Hadapt says [10]: “this is a highly undesirable architecture, since now you have two systems to maintain, two systems where data may be stored, and if you want to do analysis involving data in both systems, you end up having to send data over the network which can be a major bottleneck.”

            Big Data is not (only) Hadoop. “Some people even think that ‘Hadoop’ and ‘Big Data’ are synonymous

(though this is an over-characterization). Unfortunately, Hadoop was designed based on a paper by Google in 2004 which was focused on use cases involving unstructured data (e.g., extracting words and phrases from Web pages in order to create Google's Web index). Since it was not originally designed for structured data, it misses many short-cuts in query processing, and its performance for processing relational data is therefore suboptimal" says Daniel Abadi of Hadapt.

            Duncan Ross of Teradata confirms this: “the biggest technical challenge is actually the separation of the technology from the business use! Too often people are making the assumption that Big Data is synonymous with Hadoop, and any time that technology leads business things become difficult. Part of this is the difficulty of use that comes with this.

            It’s reminiscent of the command line technologies of the 70s—it wasn’t until the GUI became popular that computing could take off.”

            Hadoop and the Cloud Amazon has a significant web-services business around Hadoop.

            But in general, people are concerned with the protection and security of their data. What about traditional enterprises? Here is an attempt to list the pros and cons of Hadoop.

            Hadoop Pros • Open source.

          • Nonmonolithic support for access to file-based external data.
          • Support for automatic and incremental forward-recovery of jobs with failed task.
          • Ability to schedule very large jobs in smaller chunks.
          • Automatic data placement and rebalancing as data grows and machines come and go.
          • Support for replication and machine fail-over without operation intervention.
          • The combination of scale, ability to process unstructured data along with the availability of machine learning algorithms, and recom- mendation engines create the opportunity to build new game chang- ing applications.
          • Does not require a schema first.
          • Provides a great tool for exploratory analysis of the data, as long as you have the software development expertise to write MapReduce programs.

            Hadoop Cons • Hadoop is difficult to use.

• Can give powerful analysis, but it is fundamentally a batch-oriented paradigm. The missing piece of the Hadoop puzzle is accounting for real-time data processing.

• The Hadoop Distributed File System (HDFS) has a centralized metadata store

            (NameNode), which represents a single point of failure without availability. When the NameNode is recovered, it can take a long time to get the Hadoop cluster running again.

          • Hadoop assumes that the workload it runs will belong running, so it makes heavy use of checkpointing at intermediate stages. This means parts of a job can fail, be restarted, and eventually complete successfully—there are no transactional guarantees.

            Current Hadoop distributions challenges

          • Getting data in and out of Hadoop. Some Hadoop distributions are limited by the append-only nature of the Hadoop Distributed File System (HDFS) that requires programs to batch load and unload data into a cluster.
          • The lack of reliability of current Hadoop software platforms is a major impediment for expansion.
          • Protecting data against application and user errors.
          • Hadoop has no backup and restore capabilities. Users have to con- tend with data loss or resort to very expensive solutions that reside outside the actual Hadoop cluster.

            There is work in progress to fix this from vendors of commercial Hadoop distributions (e.g., MapR, etc.) by reimplementing Hadoop components. It would be desirable to have seamless integration.

            

“Instead of stand-alone products for ETL, BI/reporting and analytics we have to think about seamless integration: in what ways can we open up a data processing platform to enable applications to get closer? What language interfaces, but also what resource management facilities can we offer? And so on.” (Florian Waas)

Daniel Abadi: “A lot of people are using Hadoop as a sort of data refinery.

Data starts off unstructured, and Hadoop jobs are run to clean, transform, and structure the data. Once the data is structured, it is shipped to SQL databases where it can be subsequently analyzed. This leads to the raw data being left in Hadoop and the refined data in the SQL databases. But it’s basically the same data—one is just a cleaned (and potentially aggregated) version of the other. Having multiple copies of the data can lead to all kinds of problems. For example, let’s say you want to update the data in one of the two locations—it does not get automatically propagated to the copy in the other silo. Furthermore, let’s say you are doing some analysis in the SQL database and you see something interesting and want to drill down to the raw data in Hadoop—this becomes highly nontrivial. Furthermore, data provenance is a total nightmare. It’s just a really ugly architecture to have these two systems with a connector between them.”

Michael J. Carey adds that it is:

• Questionable to layer a record-oriented data abstraction on top of a giant, globally sequenced byte-stream file abstraction. For example, HDFS is unaware of record boundaries, so fixed-length file splits produce “broken records,” i.e., a record with some of its bytes in one split and some in the next.

• Questionable to build a parallel data runtime on top of a unary operator model (map, reduce, combine), for example, when performing joins with MapReduce (a sketch of a reduce-side join follows this list).
• Questionable to build a key-value store layer with remote query access at the next layer; pushing queries down to data is likely to outperform pulling data up to queries.
• Lack of schema information is flexible today, but a recipe for future difficulties: for example, future maintainers of applications will likely have problems fixing bugs related to changes in, or assumptions about, the structure of data files in HDFS. (This was one of the very early lessons in the DB world.)
• Single-system performance is not addressed; the focus is solely on scale-out.
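To illustrate Carey’s point about joins on a unary map/reduce model, the following is a small, self-contained sketch of a reduce-side join written in plain Python; the relations, fields, and sample values are assumptions for the illustration. Each record must be tagged with its source relation in the map phase and regrouped by join key before the two sides can be combined, work that a relational engine performs with a dedicated join operator.

    # Reduce-side join sketch: join "orders" with "customers" on customer_id.
    # In a real Hadoop job the shuffle/sort between the two phases is done by the framework.
    from collections import defaultdict

    def map_phase(customers, orders):
        # Tag every record with its source relation so the reducer can tell them apart.
        for customer_id, name in customers:
            yield customer_id, ("customer", name)
        for order_id, customer_id, amount in orders:
            yield customer_id, ("order", (order_id, amount))

    def reduce_phase(tagged_records):
        groups = defaultdict(list)      # the framework's shuffle, emulated in memory
        for key, value in tagged_records:
            groups[key].append(value)
        for key, values in groups.items():
            names = [v for tag, v in values if tag == "customer"]
            orders = [v for tag, v in values if tag == "order"]
            for name in names:
                for order_id, amount in orders:
                    yield order_id, name, amount

    if __name__ == "__main__":
        customers = [(1, "alice"), (2, "bob")]
        orders = [(100, 1, 25.0), (101, 2, 10.0), (102, 1, 7.5)]
        for row in reduce_phase(map_phase(customers, orders)):
            print(row)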

Technological Solutions for Big Data Analytics

            There are several technological solutions available in the market for Big Data Analytics. Here are some examples:

            

A NoSQL Data Store (Couchbase, Riak, Cassandra, MongoDB, etc.) Connected to Hadoop

With this solution, a NoSQL data store is used as a front end to process selected data in real time, while Hadoop in the back end processes Big Data in batch mode.

            

“In my opinion the primary interface will be via the real time store, and the Hadoop layer will become a commodity. That is why there is so much competition for the NoSQL brass ring right now,” says J. Chris Anderson of Couchbase (a NoSQL datastore).

In some applications, for example, Couchbase (NoSQL) is used to enhance the batch-based Hadoop analysis with real-time information, giving the application an up-to-date view alongside the batch results. The process consists essentially of moving the data out of Couchbase into Hadoop when it cools off. Couchbase supplies a connector to Apache Sqoop (a Top-Level Apache project since March of 2012), a tool designed for efficiently transferring bulk data between Hadoop and relational databases.
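As a rough illustration of this front-end/back-end split, the sketch below ages records out of a real-time key-value store into a file destined for batch processing. The KeyValueStore and BatchSink classes are hypothetical stand-ins, not the actual Couchbase or Sqoop APIs; they only make the “hot data up front, cold data to Hadoop” flow explicit.

    # Hypothetical sketch: serve hot data from a key-value front end and periodically
    # move records that have "cooled off" to the batch layer (e.g., files bound for HDFS).
    import json
    import time

    COOL_OFF_SECONDS = 24 * 3600        # assumption: a record is "cold" after one day

    class KeyValueStore:                # stand-in for the NoSQL front end
        def __init__(self):
            self.docs = {}              # key -> (last_write_timestamp, document)
        def put(self, key, doc):
            self.docs[key] = (time.time(), doc)
        def pop_cold(self, now):
            cold = {k: d for k, (ts, d) in self.docs.items() if now - ts > COOL_OFF_SECONDS}
            for k in cold:
                del self.docs[k]
            return cold

    class BatchSink:                    # stand-in for the Hadoop-bound batch layer
        def append(self, path, records):
            with open(path, "a") as f:  # a local file stands in for an HDFS path
                for key, doc in records.items():
                    f.write(json.dumps({"key": key, "doc": doc}) + "\n")

    def age_out(store, sink, path="cold_data.jsonl"):
        cold = store.pop_cold(time.time())
        if cold:
            sink.append(path, cold)     # the batch layer later processes this in bulk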

A NewSQL Data Store for Analytics (HP/Vertica) Instead of Hadoop

            Another approach is to use a NewSQL data store designed for Big Data Analytics, such as HP/Vertica. Quoting Shilpa Lawande [4] “Vertica was designed from the ground up for analytics.” Vertica is a columnar database engine including sorted columnar storage, a query optimizer, and an execu- tion engine, providing standard ACID transaction semantics on loads and queries.

            With sorted columnar storage, there are two methods that drastically reduce the I/O bandwidth requirements for such Big Data analytics work- loads. The first is that Vertica only reads the columns that queries need. Second, Vertica compresses the data significantly better than anyone else. Vertica’s execution engine is optimized for modern multicore processors and we ensure that data stays compressed as much as possible through the query execution, thereby reducing the CPU cycles to process the query. Additionally, we have a scale-out MPP architecture, which means you can add more nodes to Vertica.

            All of these elements are extremely critical to handle the data volume chal- lenge. With Vertica, customers can load several terabytes of data quickly (per hour in fact) and query their data within minutes of it being loaded—that is real-time analytics on Big Data for you.

There is a myth that columnar databases are slow to load. This may have been true with older generation column stores, but in Vertica, we have a hybrid in-memory/disk load architecture that rapidly ingests incoming data into a write-optimized row store and then converts that to read-optimized sorted columnar storage in the background. This is entirely transparent to the user because queries can access data in both locations seamlessly. We have a very lightweight transaction implementation with snapshot isolation, so queries can always run without any locks.

            And we have no auxiliary data structures, like indices or material- ized views, which need to be maintained postload. Last, but not least, we designed the system for “always on,” with built-in high availability features. Operations that translate into downtime in traditional databases are online in Vertica, including adding or upgrading nodes, adding or modifying data- base objects, etc. With Vertica, we have removed many of the barriers to mon- etizing Big Data and hope to continue to do so.

“Vertica and Hadoop are both systems that can store and analyze large amounts of data on commodity hardware. The main differences are in how the data are organized and in what guarantees are provided. Also, from the standpoint of data access, Vertica’s interface is SQL and data must be designed and loaded into an SQL schema for analysis. With Hadoop, data is loaded AS IS into a distributed file system and accessed programmatically by writing Map-Reduce programs.” (Shilpa Lawande [4])
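As a toy illustration of why reading only the needed columns cuts I/O for analytic queries, the sketch below stores the same table row-wise and column-wise and counts how many values must be touched to answer a single-column aggregate. This is a didactic sketch under simplified assumptions, not a description of Vertica’s implementation.

    # Toy comparison of row-oriented vs column-oriented scans for one analytic query:
    # SELECT SUM(amount) FROM sales  -- only the "amount" column is actually needed.
    rows = [
        {"id": i, "customer": "c%d" % (i % 100), "region": "EU", "amount": float(i)}
        for i in range(10000)
    ]

    # Row store: the scan touches whole rows, i.e., every field of every record.
    touched_row_store = sum(len(r) for r in rows)

    # Column store: the same query touches a single column.
    columns = {name: [r[name] for r in rows] for name in rows[0]}
    touched_column_store = len(columns["amount"])

    print("values touched, row store:   ", touched_row_store)     # 40000
    print("values touched, column store:", touched_column_store)  # 10000
    print("SUM(amount) =", sum(columns["amount"]))

Compression (for instance, run-length or dictionary encoding of sorted columns) shrinks the scanned column further, which is the second effect described above.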

            

A NewSQL Data Store for OLTP (VoltDB) Connected with Hadoop or a Data Warehouse

With this solution, a fast NewSQL data store designed for OLTP (VoltDB) is connected to either a conventional data warehouse or Hadoop.

            “We identified 4 sources of significant OLTP overhead (concurrency con- trol, write-ahead logging, latching and buffer pool management). Unless you make a big dent in ALL FOUR of these sources, you will not run dramatically faster than current disk-based RDBMSs. To the best of my knowledge, VoltDB is the only system that eliminates or drastically reduces all four of these overhead components. For example, TimesTen uses conven- tional record level locking, an Aries-style write ahead log and conventional multi-threading, leading to substantial need for latching. Hence, they elimi- nate only one of the four sources.

            VoltDB is not focused on analytics. We believe they should be run on a companion data warehouse. Most of the warehouse customers I talk to want to keep increasing large amounts of increasingly diverse history to run their analytics over. The major data warehouse players are routinely being asked to manage petabyte-sized data warehouses. VoltDB is intended for the OLTP portion, and some customers wish to run Hadoop as a data warehouse plat- form. To facilitate this architecture, VoltDB offers a Hadoop connector.

            VoltDB supports standard SQL. Complex joins should be run on a com- panion data warehouse. After all, the only way to interleave ‘big reads’ with ‘small writes’ in a legacy RDBMS is to use snapshot isolation or run with a reduced level of consistency. You either get an out-of-date, but con- sistent answer or an up-to-date, but inconsistent answer. Directing big reads to a companion DW, gives you the same result as snapshot isolation. Hence, I do not see any disadvantage to doing big reads on a companion system.

            Concerning larger amounts of data, our experience is that OLTP problems with more than a few Tbyte of data are quite rare. Hence, these can easily fit in main memory, using a VoltDB architecture.

In addition, we are planning extensions of the VoltDB architecture to handle larger-than-main-memory data sets.” (Mike Stonebraker [13])

            A NewSQL for Analytics (Hadapt) Complementing Hadoop

An alternative solution is to use a NewSQL data store designed for analytics (Hadapt), which complements Hadoop.

Daniel Abadi explains: “At Hadapt, we’re bringing 3 decades of relational database research to Hadoop. We have added features like indexing, co-partitioned joins, broadcast joins, and SQL access (with interactive query response times) to Hadoop, in order to both accelerate its performance for queries over relational data and also provide an interface that third-party data processing and business intelligence tools are familiar with.

            Therefore, we have taken Hadoop, which used to be just a tool for super- smart data scientists, and brought it to the mainstream by providing a high performance SQL interface that business analysts and data analysis tools already know how to use. However, we’ve gone a step further and made it possible to include both relational data and non-relational data in the same query; so what we’ve got now is a platform that people can use to do really new and innovative types of analytics involving both unstructured data like tweets or blog posts and structured data such as traditional transactional data that usually sits in relational databases.

            What is special about the Hadapt architecture is that we are bringing data- base technology to Hadoop, so that Hadapt customers only need to deploy a single cluster—a normal Hadoop cluster—that is optimized for both struc- tured and unstructured data, and is capable of pushing the envelope on the type of analytics that can be run over Big Data.” [10]

A Combination of Data Stores: A Parallel Database (Teradata) and Hadoop

An example of this solution is the architecture for Complex Analytics at eBay (Tom Fastner [12]).

The use of analytics at eBay is rapidly changing, and analytics is driving many key initiatives like buyer experience, search optimization, buyer protection, or mobile commerce. eBay is investing heavily in new technologies and approaches to leverage new data sources to drive innovation.

eBay uses three different platforms for analytics:

1. “EDW”: dual systems for transactional (structured data); Teradata 6690 with 9.5 PB spinning disk and 588 TB SSD.

• The largest mixed-storage Teradata system worldwide; with spool, some dictionary tables and user data automatically managed by access frequency to stay on SSD. 10+ years of experience; very high concurrency; good accessibility; hundreds of applications.

            2. “Singularity”: deep Teradata system for semistructured data; 36 PB spinning disk;

• Lower concurrency than EDW, but can store more data; the biggest use case is User Behavior Analysis; the largest table is 1.2 PB with ~3 trillion rows.

            3. Hadoop: for unstructured/complex data; ~40 PB spinning disk;


          • Text analytics, machine learning, has the user behavior data and selected EDW tables; lower concurrency and utilization.

The main technical challenges for Big Data analytics at eBay are:

• I/O bandwidth: limited due to configuration of the nodes.

          • Concurrency/workload management: Workload management tools usu- ally manage the limited resource. For many years, EDW systems bottleneck on the CPU; big systems are configured with ample CPU making I/O the bottleneck. Vendors are starting to put mechanisms in place to manage I/O, but it will take sometime to get to the same level of sophistication.
          • Data movement (loads, initial loads, backup/restores): As new platforms are emerging you need to make data available on more systems chal- lenging networks, movement tools, and support to ensure scalable operations that maintain data consistency.

            Scalability and Performance at eBay

          • EDW: models for the unknown (close to third NF) to provide a solid physical data model suitable for many applications, which limits the number of physical copies needed to satisfy specific application requirements.

            A lot of scalability and performance is built into the database, but as any shared resource it does require an excellent operations team to fully leverage the capabilities of the platform

          • Singularity: The platform is identical to EDW, the only exception are limitations in the workload management due to configuration choices.

            But since they are leveraging the latest database release, they are exploring ways to adopt new storage and processing patterns. Some new data sources are stored in a denormalized form significantly simplifying data model- ing and ETL. On top they developed functions to support the analysis of the semistructured data. It also enables more sophisticated algorithms that would be very hard, inefficient, or impossible to implement with pure SQL.

            One example is the pathing of user sessions. However, the size of the data requires them to focus more on best practices (develop on small subsets, use 1% sample; process by day).

• Hadoop: The emphasis on Hadoop is on optimizing for access. The reusability of data structures (besides “raw” data) is very low.

Unstructured Data

            Unstructured data are handled on Hadoop only. The data are copied from the source systems into HDFS for further processing. They do not store any of that on the Singularity (Teradata) system.

            Use of Data management technologies:

          • ETL: AbInitio, home-grown parallel Ingest system
          • Scheduling: UC4
• Repositories: Teradata EDW; Teradata Deep system; Hadoop
• BI: Microstrategy, SAS, Tableau, Excel
• Data Modeling: Power Designer
• Ad hoc: Teradata SQL Assistant; Hadoop Pig and Hive
• Content Management: Joomla-based

Cloud Computing and Open Source

“We do leverage internal cloud functions for Hadoop; no cloud for Teradata.

            Open source: committers for Hadoop and Joomla; strong commitment to improve those technologies.” (Tom Fastner, Principal Architect at eBay)

            Big Data Myth

It is interesting to report here what Marc Geall, a research analyst at Deutsche Bank AG in London, writes about the “Big Data Myth” and predicts [9]:

            “We believe that in-memory/NewSQL is likely to be the prevalent data- base model rather than NoSQL due to three key reasons:

            1. The limited need of petabyte-scale data today even among the NoSQL deployment base,

2. The very low proportion of databases in corporate deployment which require more than tens of TB of data to be handled,

3. The lack of availability and high cost of highly skilled operators (often post-doctoral) to operate highly scalable NoSQL clusters.”

Time will tell us whether this prediction is accurate or not.

Main Research Challenges and Business Challenges

We conclude this part of the chapter by looking at three elements (data, platform, and analysis) with two quotes:

Werner Vogels: “I think that sharing is another important aspect to the mix. Collaborating during the whole process of collecting data, storing it, organizing it and analyzing it is essential. Whether it’s scientists in a research field or doctors at different hospitals collaborating on drug trials, they can use the cloud to easily share results and work on common datasets.”

            Daniel Abadi: “Here are a few that I think are interesting:

1. Scalability of non-SQL analytics. How do you parallelize clustering, classification, statistical, and algebraic functions that are not ‘embarrassingly parallel’ (and that have traditionally been performed on a single server in main memory) over a large cluster of shared-nothing servers?

            2. Reducing the cognitive complexity of ‘Big Data’ so that it can fit in the working set of the brain of a single analyst who is wrangling with the data.

            3. Incorporating graph data sets and graph algorithms into database management systems.

4. Enabling platform support for probabilistic data and probabilistic query processing.”

          Big Data for the Common Good

“As more data become less costly and technology breaks barriers to acquisition and analysis, the opportunity to deliver actionable information for civic purposes grows. This might be termed the ‘common good’ challenge for Big Data.” (Jake Porway, DataKind)

Very few people seem to look at how Big Data can be used for solving social problems. Most of the work, in fact, is not in this direction. Why is this? What can be done in the international research/development community to make sure that some of the most brilliant ideas do have an impact also for social issues?

            In the following, I will list some relevant initiatives and selected thoughts for Big Data for the Common Good.

World Economic Forum, the United Nations Global Pulse Initiative

The United Nations Global Pulse initiative is one example. Earlier this year at the 2012 Annual Meeting in Davos, the World Economic Forum published a white paper entitled “Big Data, Big Impact: New Possibilities for International Development.” The WEF paper lays out several of the ideas which fundamentally drive the Global Pulse initiative and presents in concrete terms the opportunities available today, and how researchers and policy-makers are beginning to realize the potential for leveraging Big Data to extract insights that can be used for Good, in particular, for the benefit of low-income populations.

            “A flood of data is created every day by the interactions of billions of peo- ple using computers, GPS devices, cell phones, and medical devices. Many of these interactions occur through the use of mobile devices being used by people in the developing world, people whose needs and habits have been poorly understood until now.

            Researchers and policymakers are beginning to realize the potential for channeling these torrents of data into actionable information that can be used to identify needs, provide services, and predict and prevent crises for the benefit of low-income populations. Concerted action is needed by gov- ernments, development organizations, and companies to ensure that this data helps the individuals and communities who create it.”

Three examples are cited in the WEF paper:

• UN Global Pulse: an innovation initiative of the UN Secretary-General, harnessing today’s new world of digital data and real-time analytics to gain a better understanding of changes in human well-being (www.unglobalpulse.org).
• Global Viral Forecasting: a not-for-profit whose mission is to promote understanding, exploration, and stewardship of the microbial world (www.gvfi.org).
• Ushahidi SwiftRiver Platform: a non-profit tech company that specializes in developing free and open source software for information collection, visualization, and interactive mapping (http://ushahidi.com).

What Are the Main Difficulties and Barriers Hindering Our Community from Working on Social Capital Projects?

            I have listed below some extracts from [5]:

• Alon Halevy (Google Research): “I don’t think there are particular barriers from a technical perspective. Perhaps the main barrier is ideas of how to actually take this technology and make social impact. These ideas typically don’t come from the technical community, so we need more inspiration from activists.”
• Laura Haas (IBM Research): “Funding and availability of data are two big issues here. Much funding for social capital projects comes from governments—and, as we know, these are but a small fraction of the overall budget. Further, the market for new tools and so on that might be created in these spaces is relatively limited, so it is not always attractive to private companies to invest. While there is a lot of publicly available data today, often key pieces are missing, or privately held, or cannot be obtained for legal reasons, such as the privacy of individuals, or a country’s national interests. While this is clearly an issue for most medical investigations, it crops up as well even with such apparently innocent topics as disaster management (some data about, e.g., coastal structures, may be classified as part of the national defense).”

• Paul Miller (Consultant): “Perceived lack of easy access to data that’s unencumbered by legal and privacy issues? The large-scale and long-term nature of most of the problems? It’s not as ‘cool’ as something else? A perception (whether real or otherwise) that academic funding opportunities push researchers in other directions? Honestly, I’m not sure that there are significant insurmountable difficulties or barriers, if people want to do it enough. As Tim O’Reilly said in 2009 (and many times since), developers should ‘work on stuff that matters.’ The same is true of researchers.”
• Roger Barga (Microsoft Research): “The greatest barrier may be social.

            Such projects require community awareness to bring people to take action and often a champion to frame the technical challenges in a way that is approachable by the community. These projects will likely require close collaboration between the technical community and those familiar with the problem.”

What Could We Do to Help Support Initiatives for Big Data for Good?

            I have listed below some extracts from [5]:

• Alon Halevy (Google Research): “Building a collection of high-quality data that is widely available and can serve as the backbone for many specific data projects. For example, data sets that include boundaries of countries/counties and other administrative regions, data sets with up-to-date demographic data. It’s very common that when a particular data story arises, these data sets serve to enrich it.”
• Laura Haas (IBM Research): “Increasingly, we see consortiums of institutions banding together to work on some of these problems. These Centers may provide data and platforms for data-intensive work, alleviating some of the challenges mentioned above by acquiring and managing data, setting up an environment and tools, bringing in expertise in a given topic, or in data, or in analytics, providing tools for governance, etc. My own group is creating just such a platform, with the goal of facilitating such collaborative ventures. Of course, lobbying our governments …”

            Big Data

          • Paul Miller (Consultant): “Match domains with a need to research- ers/companies with a skill/product. Activities such as the recent Big Data Week Hackathons might be one route to follow—encourage the organisers (and companies like Kaggle, which do this every day) to run Hackathons and competitions that are explicitly targeted at a ‘social’ problem of some sort. Continue to encourage the Open Data release of key public data sets. Talk to the agencies that are working in areas of interest, and understand the problems that they face. Find ways to help them do what they already want to do, and build trust and rapport that way.”
• Roger Barga (Microsoft Research): “Provide tools and resources to empower the long tail of research. Today, only a fraction of scientists and engineers enjoy regular access to high-performance and data-intensive computing resources to process and analyze massive amounts of data and run models and simulations quickly. The reality for most of the scientific community is that speed to discovery is often hampered as they have to either queue up for access to limited resources or pare down the scope of research to accommodate available processing power. This problem is particularly acute at the smaller research institutes which represent the long tail of the research community. Tier 1 and some Tier 2 universities have sufficient funding and infrastructure to secure and support computing resources while the smaller research programs struggle. Our funding agencies and corporations must provide resources to support researchers, in particular those who do not have access to sufficient resources.”

          Conclusions: The Search for Meaning Behind Our Activities

            I would like to conclude this chapter with this quote below which I find inspiring.

            

“All our activities in our lives can be looked at from different perspectives and within various contexts: our individual view, the view of our families and friends, the view of our company and finally the view of society—the view of the world. Which perspective means what to us is not always clear, and it can also change over the course of time. This might be one of the reasons why our life sometimes seems unbalanced. We often talk about work-life balance, but maybe it is rather an imbalance between the amount of energy we invest into different elements of our life and their meaning to us.”

            —Eran Davidson, CEO Hasso Plattner Ventures


          Acknowledgments

I would like to thank Michael Blaha, Rick Cattell, Michael Carey, Akmal Chaudhri, Tom Fastner, Laura Haas, Alon Halevy, Volker Markl, Dave Thomas, Duncan Ross, Cindy Saracco, Justin Sheehy, Mike O’Sullivan, Martin Verlage, and Steve Vinoski for their feedback on an earlier draft of this chapter.

            But all errors and missing information are mine.

          References

            

1. McKinsey Global Institute (MGI), Big Data: The next frontier for innovation, competition, and productivity, Report, June 2012.
2. Managing Big Data. An interview with David Gorbet. ODBMS Industry Watch, July 2, 2012. http://www.odbms.org/blog/2012/07/managing-big-data-an-interview-with-david-gorbet/
3.
4. On Big Data: Interview with Shilpa Lawande, VP of Engineering at Vertica. ODBMS Industry Watch, November 16, 2011.
5. “Big Data for Good”, Roger Barga, Laura Haas, Alon Halevy, Paul Miller, Roberto V. Zicari. ODBMS Industry Watch, June 5, 2012.
6. On Big Data Analytics: Interview with Florian Waas, EMC/Greenplum. ODBMS Industry Watch, February 1, 2012.
7. Next generation Hadoop—interview with John Schroeder. ODBMS Industry Watch, September 7, 2012.
8. Michael J. Carey, EDBT keynote 2012, Berlin.
9. Marc Geall, “Big Data Myth”, Deutsche Bank Report, 2012.
10. On Big Data, Analytics and Hadoop. Interview with Daniel Abadi. ODBMS Industry Watch, December 5, 2012.
11. Hadoop and NoSQL: Interview with J. Chris Anderson. ODBMS Industry Watch, September 19, 2012.
12. Analytics at eBay. An interview with Tom Fastner. ODBMS Industry Watch, October 6, 2011.
13. Interview with Mike Stonebraker. ODBMS Industry Watch, May 2, 2012.

Links: ODBMS.org, www.odbms.org; ODBMS Industry Watch, www.odbms.org/blog

            

            


Management of Big Semantic Data

Javier D. Fernández, Mario Arias, Miguel A. Martínez-Prieto, and Claudio Gutiérrez

CONTENTS
Big Data
What Is Semantic Data?
  Describing Semantic Data
  Querying Semantic Data
Web of (Linked) Data
  Linked Data
  Linked Open Data
Stakeholders and Processes in Big Semantic Data
  Participants and Witnesses
  Workflow of Publication-Exchange-Consumption
  State of the Art for Publication-Exchange-Consumption
An Integrated Solution for Managing Big Semantic Data
  Encoding Big Semantic Data: HDT
  Querying HDT-Encoded Data Sets: HDT-FoQ
Experimental Results
  Publication Performance
  Exchange Performance
  Consumption Performance
Conclusions and Next Steps
Acknowledgments
References

In 2007, Jim Gray preached about the effects of the Data deluge in the sciences (Hey et al. 2009). While experimental and theoretical paradigms originally led science, some natural phenomena were not easily addressed by analytical models. In this scenario, computational simulation arose as a new paradigm enabling scientists to deal with these complex phenomena. Simulation produced increasing amounts of data, particularly from the use of advanced exploration instruments (large-scale telescopes, particle colliders, etc.). In this data-intensive setting, scientists no longer worked only from theory, experiment, or simulation,


but used powerful computational configurations to analyze the data gathered from simulations or captured by instruments. Sky maps built from the Sloan Digital Sky Survey observations, or the evidence found about the Higgs boson, are just two success stories of this new paradigm, what Gray called the fourth paradigm: eScience.

            

eScience sets the basis for scientific data exploration and identifies the common problems that arise when dealing with data at large scale. It deals with the complexities of the whole scientific data workflow, from data creation and capture, through the organization and sharing of these data with other scientists, to the final processing and analysis of such data. Gray linked these problems to the way in which data are encoded, “because the only way that scientists are going to be able to understand that information is if their software can understand the information.” In this way, data representation emerges as one of the key factors in the process of storing, organizing, filtering, analyzing, and visualizing data at large scale, but also for sharing and exchanging them in the distributed scientific environment.

Despite its origins in science, the data deluge effects apply to many other fields. It is easy to find real cases of massive data sources, many of which are part of our everyday lives. Common activities, such as adding new friends on social networks, sharing photographs, buying something electronically, or clicking on any result returned from a search engine, are continuously recorded in increasingly large data sets. Data is the new “raw material of business”.

Although business is one of the major contributors to the data deluge, there are many other players that should not go unnoticed. The Open Government movement, around the world, is also converting public administrations into massive data generators. In recent years, they have released large data sets containing educational, political, economic, criminal, and census information, among many others. Besides, we are surrounded by a multitude of sensors which continuously report information about temperature, pollution, energy consumption, the state of the traffic, the presence or absence of a fire, etc. Any information, anywhere and at any time, is recorded in big and constantly evolving heterogeneous data sets that take part in the data deluge. If we add the scientific contributions, the data sets released by traditional and digital libraries, geographical data, or collections from mass media, we can see that the data deluge is definitely a ubiquitous revolution.

From the original eScience has evolved what has been called data science (Loukides 2012), a discipline that copes with this ubiquity and basically refers to the science of transforming data into knowledge. The acquisition of this knowledge strongly depends on the existence of an effective data linkage, which enables computers to integrate data from heterogeneous data sets.

We bump again into the question of how information is encoded for different kinds of automatic processing. Definitely, data and information standards are at the ground of this revolution. An algorithmic (and standardized) data encoding is crucial to enable computer exchange and understanding; for instance, this data representation must allow computers to resolve what a gene is, what a galaxy is, or what a temperature measurement is (Hey et al. 2009). Nowadays, the use of graph-oriented representations and rich semantic vocabularies is gaining momentum. On the one hand, graphs are flexible models for integrating data with different degrees of structure, and they also enable these heterogeneous data to be linked in a uniform way. On the other hand, vocabularies describe what data mean. The most practical trend, in this line, suggests the use of the Resource Description Framework, RDF (Manola and Miller 2004), a standard model for data encoding, and semantic technologies for publication, exchange, and consumption of this Big Semantic Data at universal scale.

            This chapter takes a guided tour to the challenges of Big Semantic Data management and the role that it plays in the emergent Web of Data. Section “Big Data” provides a brief overview of Big Data and its dimensions. Section “What is Semantic Data?” summarizes the semantic web foundations and introduces the main technologies used for describing and querying seman- tic data. These basics set the minimal background for understanding the notion of web of data. It is presented in section “The Web of (Linked) Data” along with the Linked Data project and its open realization within the Linked Open Data movement. Section “Stakeholders and Processes in Big Semantic Data” characterizes the stakeholders and the main data flows per- formed in this web of Data: publication, exchange, and consumption, defines them and delves not only into their potential for data interoperability, but also in the scalability drawbacks arising when Big Semantic Data must be processed and queried. Innovative compression techniques are introduced in section “An Integrated Solution for Managing Big Semantic Data,” show- ing how the three Big Data dimensions (volume, velocity, and variety) can be successfully addressed through an integrated solution, called HDT (Header-Dictionary-Triples). Section “Experimental Results” comprises our experimental results, showing that HDT allows scalability improvements to be achieved for storage, exchange, and query answering of such emerging data. Finally, section “Conclusions and Next Steps” concludes and devises the potential of HDT for its progressive adoption in Big Semantic Data management.

            Big Data

Much has been said and written these days about Big Data. News in relevant magazines (Cukier 2010; Dumbill 2012b; Lohr 2012), technical reports, emergent research works in newly established conferences, disclosure books (Dumbill 2012a), and more applied ones (Marz and Warren 2013) are flooding us with numerous definitions, problems, and solutions related to Big Data. It is, obviously, a trending topic in technological scenarios, but it is also producing political, economic, and scientific impact.

We will adopt in this chapter a simple Big Data characterization. We refer to Big Data as “the data that exceed the processing capacity of conventional database systems” (Dumbill 2012b). Thus, any of these huge data sets generated in the data deluge may be considered Big Data. It is clear that they are too big, they move too fast, and they do not fit, generally, the strictures of the relational model (Dumbill 2012b). Under these considerations, Big Data result from the convergence of the following three V’s:

            

Volume is the most obvious dimension because of the large amount of data continuously gathered and stored in massive data sets exposed for different uses and purposes. Scalability is the main challenge related to Big Data volume, considering that effective storage mechanisms are the first requirement in this scenario. It is worth noting that storage decisions influence data retrieval, the ultimate goal for the user, who expects it to be performed as fast as possible, especially in real-time systems.

            

Velocity describes how data flow, at high rates, in an increasingly distributed scenario. Nowadays, velocity increases in a similar way to volume. Streaming data processing is the main challenge related to this dimension, because selective storage is mandatory for practical volume management, but also for real-time response.

            

Variety refers to various degrees of structure (or lack thereof) within the source data (Halfon 2012). This is mainly because Big Data may come from multiple origins (e.g., sciences, politics, economy, social networks, or web server logs, among others) and each one describes its own semantics, hence data follow a specific structural modeling. The main challenge of Big Data variety is to achieve an effective mechanism for linking diverse classes of data differing in their inner structure.

While volume and velocity address physical concerns, variety refers to a logical question mainly related to the way in which data are modeled for enabling effective integration. It is worth noting that the more data are integrated, the more interesting knowledge may be generated, increasing the resulting data set value. Under these considerations, one of the main objectives in Big Data processing is to increase data value as much as possible by directly addressing the Big Data variety. As mentioned, the use of semantic technologies seems to be ahead in this scenario, leading to the publication of big semantic data sets.

            What Is Semantic Data?

Semantic data have been traditionally related to the concept of the Semantic Web. The Semantic Web enhances the current WWW by incorporating machine-processable semantics into its information objects (pages, services, data sources, etc.). Its goals are summarized as follows:

1. To give semantics to information on the WWW. The difference between the approach of information retrieval techniques (which currently dominate WWW information processing) and database ones is that in the latter data are structured via schemas that are essentially metadata. Metadata give the meaning (the semantics) to data, allowing structured queries, that is, querying data with logical meaning and precision.

            2. To make semantic data on the WWW machine-processable. Currently, on the WWW the semantics of the data is given by humans (either directly during manual browsing and searching, or indirectly via information retrieval algorithms which use human feedback entered via static links or logs of interactions). Although it is currently suc- cessful, this process has known limitations (Quesada 2008). For Big Data, it is crucial to automatize the process of “understanding” (giving meaning to) data on the WWW. This amounts to develop machine-processable semantics.

To fulfill these goals, the Semantic Web community and the World Wide Web Consortium (W3C) have developed (i) models and languages for representing the semantics and (ii) protocols and languages for querying it. We will briefly describe them in the next items.

            Describing Semantic Data

            Two families of languages sufficiently flexible, distributively extensible, and machine-processable have been developed for describing semantic data.

1. The Resource Description Framework (RDF) (Manola and Miller 2004). It was designed to have a simple data model, with a formal semantics, with an extensible URI-based vocabulary, and which allows anyone to make statements, in a distributed way, about any resource on the Web. In this regard, an RDF description turns out to be a set of URI triples, with the standard intended meaning. It follows the ideas of semantic networks and graph data specifications, based on universal identifiers. It gives basic tools for linking data, plus a lightweight machinery for coding basic meanings. It has two levels:

a. Plain RDF is the basic data model for resources and relations between them. It is based on a basic vocabulary: a set of properties, technically binary predicates. Formally, it consists of triples of the form (s,p,o) (subject–predicate–object), where s, p, and o are URIs that use distributed vocabularies. Descriptions are statements in the subject–predicate–object structure, where predicate and object are resources or strings. Both subject and object can be anonymous entities (blank nodes). Essentially, RDF builds graphs labeled with meaning (a small sketch of building such triples is given after this description).

b. RDFS adds over RDF a built-in vocabulary with a normative semantics, the RDF Schema (Brickley 2004). This vocabulary deals with inheritance of classes and properties, as well as typing, among other features. It can be thought of as a lightweight ontology.

            2. The Web Ontology Language (OWL) (McGuinness and van Harmelen 2004). It is a version of logic languages adapted to cope with the Web requirements, composed of basic logic operators plus a mechanism for defining meaning in a distributed fashion.

From a metadata point of view, OWL can be considered a rich vocabulary with high expressive power (classes, properties, relations, cardinality, equality, constraints, etc.). It comes in many flavors, but this gain in expressive power comes at the cost of scalability (complexity of evaluation and processing). In fact, using the semantics of OWL amounts to introducing logical reasoning among pieces of data, thus exploding in complexity terms.
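As a small illustration of these two levels, the snippet below builds a handful of (s,p,o) triples with the rdflib Python library, types a resource with the built-in RDF/RDFS vocabulary, and serializes the graph. The example.org URIs and the tiny vocabulary are assumptions made for the illustration; only the rdflib calls and the standard RDF/RDFS terms are real.

    # Build a tiny RDF graph of (subject, predicate, object) triples with rdflib.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, RDFS

    EX = Namespace("http://example.org/")           # assumed example vocabulary

    g = Graph()
    alice = EX["alice"]

    g.add((alice, RDF.type, EX.Person))             # typing via the built-in vocabulary
    g.add((alice, RDFS.label, Literal("Alice")))    # a literal-valued property
    g.add((alice, EX.worksAt, EX["ACME"]))          # a resource-valued property

    print(g.serialize(format="turtle"))             # the graph as Turtle text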

            Querying Semantic Data

If one has scalability in mind, due to complexity arguments, the expressive power of the semantics should stay at a basic level of metadata, that is, plain RDF. This follows from the W3C design principles of interoperability, extensibility, evolution, and decentralization.

As stated, RDF can be seen as a graph labeled with meaning, in which each triple (s,p,o) is represented as a directed edge from s to o labeled with p, that is, s –p→ o. The data model RDF has a corresponding query language, called SPARQL. SPARQL (Prud’hommeaux and Seaborne 2008) is the W3C standard for querying RDF. It is essentially a graph-pattern matching query language, composed of three parts:


            a. The pattern matching part, which includes the most basic features of graph pattern matching, such as optional parts, union of patterns, nesting, filtering values of possible matchings, and the possibility of choosing the data source to be matched by a pattern.

b. The solution modifiers, which, once the output of the pattern has been computed (in the form of a table of values of variables), allow these values to be modified by applying standard classical operators such as projection, distinct, order, and limit.

c. Finally, the output of a SPARQL query comes in three forms: (1) yes/no answers (ASK queries); (2) selections of values of the variables matching the patterns (SELECT queries); and (3) construction of new RDF data from these values, and descriptions of resources (CONSTRUCT queries).

A SPARQL query Q comprises a head and a body. The body is a complex RDF graph pattern expression comprising triple patterns (e.g., RDF triples in which each subject, predicate, or object may be a variable) with conjunctions, disjunctions, optional parts, and constraints over the values of the variables. The head is an expression that indicates how to construct the answer for Q. The evaluation of Q against an RDF graph G is done in two steps: (i) the body of Q is matched against G to obtain a set of bindings for the variables in the body, and then (ii) using the information in the head, these bindings are processed applying classical relational operators (projection, distinct, etc.) to produce the answer to Q.
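To make the head/body distinction concrete, the sketch below evaluates a small SELECT query with rdflib over an in-memory graph; the data and the example.org vocabulary are assumptions for the illustration. The WHERE clause is the body (triple patterns with variables), and the SELECT clause is the head that shapes the answer.

    # Evaluate a small SPARQL SELECT query over an in-memory RDF graph with rdflib.
    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX["alice"], EX.worksAt, EX["ACME"]))
    g.add((EX["bob"], EX.worksAt, EX["ACME"]))
    g.add((EX["alice"], EX.name, Literal("Alice")))

    query = """
    PREFIX ex: <http://example.org/>
    SELECT ?person ?name
    WHERE {
        ?person ex:worksAt ex:ACME .
        OPTIONAL { ?person ex:name ?name }
    }
    """

    for row in g.query(query):
        # Each row is one binding of the head's variables; ?name may be unbound.
        print(row.person, row.name)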

          Web of (Linked) Data

The WWW has enabled the creation of a global space comprising linked documents (Heath and Bizer 2011) that express information in a human-readable way. All agree that the WWW has revolutionized the way we consume information, but its document-oriented model prevents machines and automatic agents from directly accessing the raw data underlying any web page content. The main reason is that documents are the atoms in the WWW model and data lack an identity within them. This is not a new story: a “universal database,” in which all data can be identified at world scale, is a cherished dream in Computer Science.

The Web of Data (Bizer et al. 2009) emerges under all previous considerations in order to convert raw data into first-class citizens of the WWW. It materializes the Semantic Web foundations and enables raw data, from diverse fields, to be interconnected within a cloud of data-to-data hyperlinks that expose information at a finer level of granularity over the WWW infrastructure. It is worth noting that this idea does not break with the WWW as we know it. It only enhances the WWW with additional standards that enable data and documents to coexist in a common space. The Web of Data grows progressively according to the Linked Data principles.

Linked Data

The Linked Data project originated in leveraging the practice of linking data to the semantic level, following the ideas of Berners-Lee (2006). Its authors state that:

            

Linked Data is about using the WWW to connect related data that wasn’t previously linked, or using the WWW to lower the barriers to linking data currently linked using other methods. More specifically, Wikipedia defines Linked Data as “a term used to describe a recommended best practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using URIs (Uniform Resource Identifiers) and RDF.”

            The idea is to leverage the WWW infrastructure to produce, publish, and consume data (not only documents in the form of web pages). These pro- cesses are done by different stakeholders, with different goals, in different forms and formats, in different places. One of the main challenges is the meaningful interlinking of this universe of data (Hausenblas and Karnstedt 2010). It relies on the following four rules:

            1. Use URIs as names for things. This rule enables each possible real- world entity or its relationships to be unequivocally identified at universal scale. This simple decision guarantees that any raw data has its own identity in the global space of the Web of Data.

            2. Use HTTP URIs so that people can look up those names. This decision leverages HTTP to retrieve all data related to a given URI.

3. When someone looks up a URI, provide useful information, using standards. This rule standardizes processes in the Web of Data and agrees upon the languages spoken by stakeholders. RDF and SPARQL, together with the semantic technologies described in the previous section, define the standards mainly used in the Web of Data.

4. Include links to other URIs. This rule materializes the aim of data integration by simply adding new RDF triples which link data from two different data sets. This inter-data set linkage enables automatic browsing.


These four rules provide the basics for publishing and integrating Big Semantic Data into the global space of the Web of Data. They enable raw data to be simply encoded by combining the RDF model and URI-based identification, both for entities and for their relationships, adequately labeled over rich semantic vocabularies (a small sketch at the end of this subsection illustrates the rules in practice). Berners-Lee (2002) expresses the Linked Data relevance as follows:

            

Linked Data allows different things in different data sets of all kinds to be connected. The added value of putting data on the WWW is given by the way it can be queried in combination with other data you might not even be aware of. People will be connecting scientific data, community data, social web data, enterprise data, and government data from other agencies and organizations, and other countries, to ask questions not asked before.

            

Linked data is decentralized. Each agency can source its own data without a big cumbersome centralized system. The data can be stitched together at the edges, more as one builds a quilt than the way one builds a nuclear power station.

            

A virtuous circle. There are many organizations and companies which will be motivated by the presence of the data to provide all kinds of human access to this data, for specific communities, to answer specific questions, often in connection with data from different sites.

            The project and further information about linked data can be found in Bizer et al. (2009) and Heath and Bizer (2011).
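The following minimal sketch puts the four rules together: entities are named with HTTP URIs, described with RDF, and linked to an external data set via owl:sameAs. The example.org URIs are assumptions made for the illustration, while the DBpedia URI is an existing Linked Data resource.

    # Publish a few Linked Data triples: HTTP URIs as names (rules 1-2), RDF as the
    # standard description (rule 3), and an outgoing link to another data set (rule 4).
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import OWL, RDFS

    EX = Namespace("http://example.org/resource/")      # assumed local namespace
    DBPEDIA = Namespace("http://dbpedia.org/resource/")

    g = Graph()
    city = EX["Berlin"]

    g.add((city, RDFS.label, Literal("Berlin")))
    g.add((city, EX.locatedIn, EX["Germany"]))
    g.add((city, OWL.sameAs, DBPEDIA["Berlin"]))        # rule 4: inter-data set link

    print(g.serialize(format="turtle"))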

Linked Open Data

Although Linked Data does not prevent its application in closed environments (private institutional networks or any class of intranet), the most visible example of adoption and application of its principles runs openly. The Linked Open Data (LOD) movement sets semantic data to be released under open licenses which do not impede data reuse for free. Tim Berners-Lee also devised a “five-star” test to measure how well Open Data implements the Linked Data principles:

1. Make your stuff available on the web (whatever format).

            2. Make it available as structured data (e.g., excel instead of image scan of a table).

            3. Use nonproprietary format (e.g., CSV instead of excel).

            4. Use URLs to identify things, so that people can point at your stuff.

5. Link your data to other people’s data to provide context.

The LOD cloud has grown significantly since its origins in May 2007. The first report pointed out that 12 data sets were part of this cloud; 45 were acknowledged in September 2008, 95 data sets in 2009, 203 in 2010, and 295 different data sets in the last estimation (September 2011). These last statistics point out that more than 31 billion triples are currently published and more than 500 million links establish cross-relations between data sets. Government data are predominant in LOD, but other fields such as geography, life sciences, media, or publications are also strongly represented. It is worth emphasizing the existence of many cross-domain data sets comprising data from several diverse fields. These tend to be hubs because they provide data that may be linked from and to the vast majority of specific data sets. DBpedia is considered the nucleus of the LOD cloud (Auer et al. 2007). In short, DBpedia gathers the raw data underlying the Wikipedia web pages and exposes the resulting representation following the Linked Data rules. It is an interesting example of Big Semantic Data, and its management is considered within our experiments.

            Stakeholders and Processes in Big Semantic Data

Although we identify data scientists as one of the main actors in the management of Big Semantic Data, we also unveil potential “traditional” users when moving from a Web of documents to a Web of data, or, in this context, to a Web of Big Semantic Data. The scalability problems arising for data experts and general users cannot be the same, as these are supposed to manage the information under different perspectives. A data scientist can make strong efforts to create novel semantic data or to analyze huge volumes of data created by third parties. She can make use of data-intensive computing, distributed machines, and algorithms, and spending several hours computing the closure of a graph is perfectly acceptable. In contrast, a common user retrieving, for instance, all the movies shot in New York in a given year expects, if not an immediate answer, at least a reasonable response time. Although one could establish a strong frontier between the data (and their problems) of these two worlds, we cannot forget that establishing and discovering links between diverse data is beneficial for all parties. For instance, in life sciences it is important to have links between the bibliographic data of publications and the concrete genes studied in each publication, so that another researcher can look up previous findings on the genes they are currently studying.

The concern here is to address specific management problems while remaining within a general, open representation and publication infrastructure in order to leverage the full potential of Big Semantic Data. Under this premise, a first characterization of the involved roles and processes would allow researchers and practitioners to clearly focus their efforts on a particular area. This section provides an approach toward this characterization. We first establish a simple set of stakeholders in Big Semantic Data, from where we define a common data workflow in order to better understand the main processes performed in the Web of Data.

              Participants and Witnesses

One of the main breakthroughs after the creation of the Web was the consideration of the common citizen as the main stakeholder, that is, a party involved not only in the consumption but also in the creation of content. To emphasize this fact, the notion of Web 2.0 was coined, and its implications, such as blogging, tagging, or social networking, became one of the roots of our current sociability.

The Web of Data can be considered a complementary dimension to this successful idea, one which addresses the data problems of the Web. It focuses on representing knowledge through machine-readable descriptions (i.e., RDF), using specific languages and rules for knowledge extraction and reasoning. How this can be achieved by the general audience, and exploited for the general market, will determine its chances of success beyond the scientific community.

To date, neither the creation of self-described semantic content nor the linkage to other sources is a simple task for a common user. There exist several initiatives to bring semantic data creation to a wider audience, the most feasible being the use of RDFa (Adida et al. 2012). The problems of vocabulary and link discovery can also be mitigated through searching and recommendation tools (Volz et al. 2009; Hogan et al. 2011). However, in general terms, one could argue that the creation of semantic data is still almost as narrow as the original content creation in Web 1.0. In the LOD statistics previously reported, only 0.42% of the total data is user generated. It means that public organizations (governments, universities, digital libraries, etc.), researchers, and innovative enterprises are the main creators, whereas citizens are, at this point, just witnesses of a hidden yet increasing reality.

This reality shows that these few creators are able to produce huge volumes of RDF data, yet we will argue, in the next section, about the quality of these publication schemes (in agreement with empirical surveys; Hogan et al. 2012). In what follows, we characterize a minimum set of stakeholders interacting with this huge graph of knowledge with such an enormous potential. Figure 4.1 illustrates the main identified stakeholders within Big Semantic Data. Three main roles are present: creators, publishers, and consumers, with an internal subdivision by creation method or intended use. In parallel, we distinguish between automatic stakeholders, supervised processes, and human stakeholders. We define below each stakeholder, assuming that (i) this classification may not be complete, as it is intended to cover the minimum foundations to understand the managing processes in Big Semantic Data, and (ii) categories are not disjoint; an actor could participate with several roles.


Creator: one that generates a new RDF data set by, at least, one of these processes:

• Creation from scratch: the novel data set is not based on a previous model. Even if the data exist beforehand, the data modeling process is unbiased from the previous data format. RDF authoring tools* are traditionally used.

• Conversion from other data format: the creation phase is highly determined by the conversion of the original data source; potential mappings between source and target data could be used, for example, from relational databases (Arenas et al. 2012), as well as (semi-)automatic conversion tools.
            • Data integration from existing content: the focus moves to an efficient integration of vocabularies and the validation of shared entities (Knoblock et al. 2012).

Several tasks are shared among all three processes. Some examples of these commonalities are the identification of the entities to be modeled (but this task is more important in the creation from scratch, as no prior identification has been done) or the reuse of vocabularies (crucial in data integration, in which different ontologies could be aligned). A complete description of the creation process is out of the scope of this work (the reader can find a guide for Linked Data creation in Heath and Bizer 2011).

* A list of RDF authoring tools can be found at http://www.w3.org/wiki/AuthoringToolsForRDF

Figure 4.1 Stakeholder classification in Big Semantic Data management: creators, publishers, and consumers; orthogonally, automatic stakeholders, supervised processes, and human stakeholders.

              

Publisher: one that makes RDF data publicly available for different purposes and users. From now on, let us suppose that the publisher follows the Linked Data principles. We distinguish creators from publishers because, in many cases, the roles can strongly differ. Publishers do not have to create RDF content, but they are responsible for the published information, the availability of the offered services (such as querying), and the correct adherence to the Linked Data principles. For instance, a creator could be a set of sensors giving the temperature in a given area in RDF (Atemezing et al. 2013), while the publisher is the entity that publishes this information and provides entry points to it.

Consumer: one that makes use of published RDF data:

• Direct consumption: a process whose computation task mainly involves the publisher, without intensive processing at the consumer. Downloads of the total data set (or subparts), online querying, information retrieval, visualization, or summarization are simple examples in which the computation is focused on the publisher.
• Intensive consumer processing: processes with a nonnegligible consumer computation, such as offline analysis, data mining, or reasoning over the full data set or a subpart (live views; Tummarello et al. 2010).
            • Composition of data: those processes integrating different data sources or services, such as federated services over the Web of Data (Schwarte et al. 2011; Taheriyan et al. 2012) and RDF snippets in search engines (Haas et al. 2011).

As stated, we make an orthogonal classification of the stakeholders, attending to the nature of creators, publishers, and consumers. For instance, a sensor could directly create RDF data, but it could also consume RDF data.

              

Automatic stakeholders, such as sensors, Web processes (crawlers, search engines, recommender systems), RFID labels, smart phones, etc. Automatic RDF streaming, for instance, is becoming a hot topic, especially within the development of smart cities (De et al. 2012). Note that, although each piece of information could be particularly small, the whole system can also be seen as a big semantic data set.

              

Supervised processes, that is, processes with human supervision, such as semantic tagging and folksonomies within social networks (García-Silva et al. 2012).

Human stakeholders, who perform most of the tasks of creating, publishing, and consuming semantic data by themselves.


The following running example provides a practical review of this classification. Nowadays, an RFID tag could document a user context through RDF metadata descriptions (Foulonneau 2011). We devise a system in which RFID tags provide data about temperature and position. Thus, we have thousands of sensors providing RDF excerpts modeling the temperature in distinct parts of a city. Users can visualize and query this information online, establishing some relationships, for example, with special events (such as a live concert or sport matches). In addition, the RDF can be consumed by a monitoring system, for example, to alert the population in case of extreme temperatures.
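As a rough illustration of the kind of RDF excerpts such sensors could emit (a minimal C++ sketch; the URIs, vocabulary terms, and function names are invented for this example and do not correspond to any standard vocabulary):

#include <iostream>
#include <string>

// Serializes one hypothetical sensor reading as N-Triples (one triple per line).
std::string readingToNTriples(const std::string& sensorId, double celsius,
                              double lat, double lon) {
    const std::string s = "<http://example.org/sensor/" + sensorId + ">";
    return s + " <http://example.org/vocab#temperature> \"" + std::to_string(celsius) + "\" .\n" +
           s + " <http://example.org/vocab#latitude> \""    + std::to_string(lat)     + "\" .\n" +
           s + " <http://example.org/vocab#longitude> \""   + std::to_string(lon)     + "\" .\n";
}

int main() {
    // Thousands of such excerpts, streamed to a hub, add up to a big semantic data set.
    std::cout << readingToNTriples("rfid-042", 21.5, 61.23, 7.10);
}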

Following the classification, each sensor is an automatic creator, and altogether they conform a potentially huge volume of RDF data. While a sensor should be designed to take care of the RDF description (e.g., to follow a set of vocabularies and description rules and to minimize the size of descriptions), it cannot address publishing facilities (query endpoints, services to users, etc.). Alternatively, intermediate hubs would collect the data, and the authoritative organization will be responsible for its publication and for the applications and services over these data. This publication authority would be considered a supervised process solving the scalability issues of huge RDF data streams by collecting the information, filtering it (e.g., eliminating redundancy), and finally complying with Linked Data standards. Although these processes could be automatic, let us suppose that human intervention is needed to define links between data, for instance, linking positions to information about city events. Note also that intermediate hubs could be seen as supervised consumers of the sensors, yet the information coming from the sensors is not openly published but streamed to the appropriate hub. Finally, the consumers are humans, in the case of the online users (concerned with query resolution, visualization, summarization, etc.), or an automatic (or semiautomatic) process, in the case of monitoring (performing potentially complex inference or reasoning).

Workflow of Publication-Exchange-Consumption

The previous RFID network example shows the enormous diversity of processes and the different concerns of each type of stakeholder. In what follows, we consider the creation step out of the scope of this work, because our approach relies on the existence of big RDF data sets (without belittling those that may be created hereinafter). We focus on tasks involving large-scale management; for instance, the scalability issues of visually authoring a big RDF data set are comparable to those of RDF visualization by consumers, and the performance of RDF data integration from existing content depends on efficient access to the data and thus on existing indexes, a crucial issue also for query response.

Management processes for publishers and consumers are diverse and complex to generalize. However, it is worth characterizing a common workflow that puts their scalability issues in context. Figure 4.2 illustrates the identified workflow of Publication-Exchange-Consumption.

Figure 4.2 Publication-Exchange-Consumption workflow in the Web of Data: publishers expose dereferenceable URIs, RDF dumps, and SPARQL endpoints/APIs (1. Publication); data are transferred to consumers (2. Exchange); and applications and sensors perform indexing, quality/provenance assessment, and reasoning/integration (3. Consumption).

Publication refers to the process of making RDF data publicly available for diverse purposes and users, following the Linked Data principles. Strictly, the only obligatory "service" in these principles is to provide dereferenceable URIs, that is, related information of an entity. In practice, publishers complete this basic functionality by exposing their data through public APIs, mainly via SPARQL endpoints, a service which interprets the SPARQL query language. They also provide RDF dumps, files to fully or partly download the RDF data set.

Exchange is the process of information exchange between publishers and consumers. Although the information is represented in RDF, note that consumers could obtain different "views" and hence formats, some of them not necessarily in RDF. For instance, the result of an SPARQL query could be provided in a CSV file, or the consumer could request a summary with statistics of the data set in an XML file. As we are addressing the management of semantic data sets, we restrict exchange to RDF interchange. Thus, we rephrase exchange as the process of RDF exchange between publishers and consumers after an RDF dump request, an SPARQL query resolution, or another request or service provided by the publisher.

Consumption can involve, as stated, a wide range of processes, from direct consumption to intensive processing and composition of data sources. Let us simply define consumption as the use of potentially large RDF data for diverse purposes.

A final remark must be made. The workflow definition seems to restrict the management to large RDF data sets. However, we would like to open scalability issues to a wider range of publishers and consumers with more limited resources. For instance, similar scalability problems arise when managing RDF in mobile devices; although the amount of information could be potentially smaller, these devices have more restrictive requirements for transmission costs/latency and for postprocessing, due to their inherent memory and CPU constraints (Le-Phuoc et al. 2010). In the following, whenever we provide approaches for managing these processes in large RDF data sets, they should also be read with these more limited scenarios in mind.

State of the Art for Publication-Exchange-Consumption

              This section summarizes some of the current trends to address publication, exchange, and consumption at large scale.

              

Publication schemes: straightforward publication following the Linked Data principles presents several problems in large data sets (Fernández et al. 2010); a previous analysis of published RDF data sets reveals several undesirable features: the provenance and metadata about contents are barely present, and their information is neither complete nor systematic. Furthermore, the RDF dump files have neither internal structure nor a summary of their content. A massive empirical study of Linked Open Data sets in Hogan et al. (2012) draws similar conclusions: few providers attach human-readable metadata or licensing information to their resources. The same features apply to SPARQL endpoints, in which a consumer knows almost nothing beforehand about the content she is going to query. In general terms, except for the general Linked Data recommendations (Heath and Bizer 2011), few works address the publication of RDF at large scale.

The Vocabulary of Interlinked Data sets: VoiD (Alexander et al. 2009) is the nearest approximation to the discovery problem, providing a bridge between publishers and consumers. Publishers make use of a specific vocabulary to add metadata to their data sets, for example, to point to the associated SPARQL endpoint and RDF dump, to describe the total number of triples, and to connect to linked data sets. Thus, consumers can look up this metadata to discover data sets or to reduce the set of interesting data sets in federated queries over the Web of Data (Akar et al. 2012). Semantic Sitemaps (Cyganiak et al. 2008) extend the traditional Sitemap Protocol for describing RDF data. They include new XML tags so that crawling tools (such as Sindice) can discover and consume the data sets.

As a last remark, note that dereferenceable URIs can be served in a straightforward way, publishing one document per URI or set of URIs. However, the publisher commonly materializes the output by querying the data set at URI resolution time. This moves the problem to the underlying RDF store, which also has to deal with scalability problems (see "Efficient RDF Consumption" below). The empirical study in Hogan et al. (2012) also confirmed that publishers often do not provide locally known inlinks in the dereferenced response, which must be taken into account by consumers.

              

RDF Serialization Formats: as we previously stated, we focus on exchanging large-scale RDF data (or smaller volumes for stakeholders with limited resources). Under this consideration, the RDF serialization format directly determines the transmission costs and latency for consumption. Unfortunately, data sets are currently serialized in plain and verbose formats such as RDF/XML (Beckett 2004) or Notation3: N3 (Berners-Lee 1998), a more compact and readable alternative. Turtle (Beckett and Berners-Lee 2008) inherits N3 compactness, adding interesting extra features, for example, abbreviated RDF data sets. RDF/JSON (Alexander 2008) has the advantage of being coded in a language easier to parse and more widely accepted in the programming world. Although all these formats present features to "abbreviate" constructions, they are still dominated by a document-centric and human-readable view which adds an unnecessary overhead to the final data set representation.

In order to reduce exchange costs and delays on the network, universal compressors (e.g., gzip) are commonly used over these plain formats. In addition, specific interchange-oriented representations may also be used. For instance, the Efficient XML Interchange Format: EXI (Schneider and Kamiya 2011) may be used for representing any valid RDF/XML data set.

              

Efficient RDF Consumption: the aforementioned variety of consumer tasks hinders achieving a one-size-fits-all technique. However, some general concerns can be outlined. In most scenarios, the performance is influenced by (i) the serialization format, due to the overall data exchange time, and (ii) the RDF indexing/querying structure. In the first case, if compressed RDF has been exchanged, a prior decompression must be done. In this sense, the serialization format affects consumption through the transmission cost, but also through the ease of parsing. The latter factor affects the consumption process in different ways:

            • For SPARQL endpoints and dereferenceable URIs materialization, the response time depends on the efficiency of the underlying RDF indexes at the publisher.
            • Once the consumer has the data set, the most likely scenario is indexing it in order to operate with the RDF graph, for example, for intensive operation of inference, integration, etc.

Although the indexing at consumption could be performed once, the amount of resources required for it may be prohibitive for many potential consumers (especially for mobile devices with a limited computational configuration). In both cases, for publishers and consumers, an RDF store indexing the data sets is the main actor for efficient consumption.

Diverse techniques provide efficient RDF indexing, but there is still work to be done on scalable indexing and querying optimization (Sidirourgos et al. 2008; Schmidt et al. 2010). On the one hand, some RDF stores are built over relational databases and perform SPARQL queries through SQL, for example, Virtuoso. The most successful relational-based approach performs a vertical partitioning, grouping triples by predicate and storing them in independent 2-column tables (S, O) (Sidirourgos et al. 2008; Abadi et al. 2009). On the other hand, some stores, such as Hexastore (Weiss et al. 2008) or RDF-3X (Neumann and Weikum 2010), build indices for all possible combinations of elements in RDF (SPO, SOP, PSO, POS, OPS, OSP), allowing (i) all triple patterns to be directly resolved in the corresponding index and (ii) the first join step to be resolved through fast merge-join. Although this achieves globally competitive performance, the index replication largely increases spatial requirements. Other solutions take advantage of structural properties of the data model (Tran et al. 2012), introduce specific graph compression techniques (Atre et al. 2010; Álvarez-García et al. 2011), or use distributed nodes within a MapReduce infrastructure (Urbani et al. 2010).


              An Integrated Solution for Managing Big Semantic Data

When dealing with Big Semantic Data, each step in the workflow must be designed to address the three Big Data dimensions. While variety is managed through semantic technologies, this decision determines the way volume and velocity are addressed. As previously discussed, data serialization has a big impact on the workflow, as traditional RDF serialization formats are designed to be human readable instead of machine processable. They may fit smaller scenarios in which volume or velocity are not an issue, but under the presented premises, they clearly become a bottleneck of the whole process. We present, in the following, the main requirements for an RDF serialization format for Big Semantic Data.

              • It must be generated efficiently from another RDF input format. For instance, a data creator having the data set in a semantic database must be able to dump it efficiently into an optimized exchange format.
• It must be space efficient. The generated dump should be as small as possible, introducing compression for space savings. Bear in mind that big semantic data sets are shared on the Web of Data and they may be transferred through the network infrastructure to hundreds or even thousands of clients. Reducing size will not only minimize the bandwidth costs of the server, but also the waiting time of consumers who are retrieving the data set for any class of consumption.
• It must be ready to postprocess. A typical case is performing a sequential triple-to-triple scan for any postprocessing task. This can seem trivial, but it is clearly time-consuming when Big Semantic Data is postprocessed by the consumer. As shown in our experiments, just parsing a data set of 640 million triples, serialized in NTriples and gzip-compressed, wastes more than 40 min on a modern computational configuration.
• It must be easy to convert to other representations. The most usual scenario is loading the data set into an RDF store. Most of the solutions reviewed in the previous section use disk-resident variants of B-Trees, which keep a subset of the pages in the main memory. For instance, if data are already sorted, this process is more efficient than doing it on unsorted elements. Therefore, having the data pre-sorted can be a step ahead in these cases. Also, many stores keep several indices for the different triple orderings (SPO, OPS, PSO, etc.). If the serialization format enables data traversing to be performed in different orders, the multi-index generation process can be completed more efficiently.

• It should be able to locate pieces of data within the whole data set. It is desirable to avoid a full scan over the data set just to locate a particular piece of data. Note that this scan is a highly time-consuming process in Big Semantic Data. Thus, the serialization format must retain all possible clues, enabling direct access to any piece of data in the data set. As explained in the SPARQL query language, a basic way of specifying which triples to fetch is a triple pattern, where each component is either a constant or a variable. A desirable format should be ready to solve most of the combinations of triple patterns (possible combinations of constants and variables in subject, predicate, and object). For instance, a typical triple pattern provides a subject, leaving the predicate and object as variables (and therefore as the expected result). In such cases, we intend to locate all the triples that talk about a specific subject. In other words, this requirement contains a succinct intention; data must be encoded in such a way that "the data are the index."

Encoding Big Semantic Data: HDT

Our approach, HDT: Header–Dictionary–Triples (Fernández et al. 2010), considers all of the previous requirements, addressing a machine-processable RDF serialization format which enables Big Semantic Data to be efficiently managed within the common workflows of the Web of Data. The format formalizes a compact binary serialization optimized for storage or transmission over a network. It is worth noting that HDT is described and proposed for standardization as a W3C Member Submission (Fernández et al. 2011). In addition, a succinct data structure has been proposed (Martínez-Prieto et al. 2012a) to browse HDT-encoded data sets. This structure retains the compactness of such a representation and provides direct access to any piece of data, as described below.

HDT organizes Big Semantic Data in three logical components (Header, Dictionary, and Triples), carefully described to address not only RDF peculiarities, but also how these data are actually used in the Publication-Exchange-Consumption workflow.

                

Header. The Header holds, in plain RDF format, metadata describing a big semantic data set encoded in HDT. It acts as an entry point for a consumer, who can get a first idea of its content even before retrieving the whole data set. It enhances the VoID Vocabulary (Alexander et al. 2009) to provide a standardized binary data set description in which some additional HDT-specific properties are appended.*

                The Header component comprises four distinct sections:

• Publication Metadata provides information about the publication act, for instance, when the data set was generated, when it was made public, who the publisher is, where the associated SPARQL endpoint is, etc. Many properties of this type are described using the popular Dublin Core Vocabulary.
• Statistical Metadata provides statistical information about the data set, such as the number of triples, the number of different subjects, predicates, and objects, or even histograms. For instance, this class of metadata is very valuable for visualization software or federated query evaluation engines.
• Format Metadata describes how the Dictionary and Triples components are encoded. This allows one to have different implementations or representations of the same data in different ways. For instance, one could prefer to have the triples in SPO order, whereas other applications might need them in OPS order. Also, the dictionary could apply a very aggressive compression technique to minimize the size as much as possible, whereas another implementation could be focused on query speed and even include a full-text index to accelerate text searches. These metadata enable the consumer to check how an HDT-encoded data set can be accessed in the data structure.
• Additional Metadata. Since the Header contains plain RDF, the publisher can enhance it using any vocabulary. It allows specific data set/application metadata to be described. For instance, in life sciences a publisher might want to describe, in the Header, that the data set describes a specific class of proteins.

Since RDF enables data integration at any level, the Header component ensures that HDT-encoded data sets are not isolated and can be interconnected. For instance, it is a great tool for query syndication. A syndicated query engine could maintain a catalog composed of the Headers of different HDT-encoded data sets from many publishers and use it to know where to find more data about a specific subject. Then, at query time, the syndicated query engine can either use the remote SPARQL endpoint to query the third-party server directly or even download the whole data set and save it in a local cache. Thanks to the compact size of HDT-encoded data sets, both the transmission and storage costs are highly reduced.

* http://www.w3.org/Submission/2011/SUBM-HDT-Extending-VoID-20110330/

Dictionary. The Dictionary is a catalog comprising all the different terms used in the data set, such as URIs, literals, and blank nodes. A unique identifier (ID) is assigned to each term, enabling triples to be represented as tuples of three IDs which, respectively, reference the corresponding terms in the dictionary. This is the first step toward compression, since it avoids long terms being repeatedly represented. This way, each term occurrence is replaced by its corresponding ID, whose encoding requires fewer bits in the vast majority of cases. Furthermore, the catalog of terms within the dictionary may be encoded in many advanced ways focused on boosting querying or reducing size. A typical example is to use some kind of differential compression for encoding terms sharing long prefixes, for example, URIs.

The dictionary is divided into sections depending on whether the term plays the subject, predicate, or object role. Nevertheless, in semantic data it is quite common that a URI appears as the subject of one triple and as the object of another. To avoid representing those terms twice, in the subjects and in the objects sections, we extract them into a fourth section called shared Subject-Object.

Figure 4.3 depicts the 4-section dictionary organization and how IDs are assigned to the corresponding terms. Each section is sorted lexicographically and then correlative IDs are assigned to each term, from 1 to n. It is worth noting that, for subjects and objects, the shared Subject–Object section uses the lower range of IDs; for example, if there are m terms playing interchangeably as subject and object, all IDs x such that x ≤ m belong to this shared section.

HDT allows one to use different techniques of dictionary representation. Each one can handle its catalog of terms in different ways, but must always implement these basic operations:

              • Locate (term): finds the term and returns its ID
              • Extract (id): extracts the term associated to the ID
• NumElements (): returns the number of elements of the section

More advanced techniques might also provide these optional operations:
              • Prefix (p): finds all terms starting with the prefix p
• Suffix (s): finds all terms ending with the suffix s
• Substring (s): finds all terms containing the substring s
• Regex (e): finds all terms matching the specified regular expression e

For instance, these advanced operations are very convenient when serving query suggestions to the user, or when evaluating SPARQL queries that include REGEX filters.

Figure 4.3 The four-section dictionary organization (shared Subject–Object, Subjects, Objects, and Predicates, of sizes |sh|, |S|, |O|, and |P|) and its ID assignment.
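To make the four-section layout and the Locate/Extract operations more concrete, the following sketch (our own simplified C++ illustration, not the HDT implementation; the section contents are toy data) assigns IDs per section, with the shared Subject–Object terms taking the lowest subject and object IDs:

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Simplified four-section dictionary: shared Subject-Object, Subjects-only,
// Objects-only, and Predicates. Each section is sorted lexicographically and
// IDs are assigned correlatively; shared terms use the lowest subject/object IDs.
struct FourSectionDictionary {
    std::vector<std::string> shared, subjects, objects, predicates; // each sorted

    long locateSubject(const std::string& term) const {
        long id = find(shared, term);
        if (id != 0) return id;                      // shared IDs come first
        id = find(subjects, term);
        return id == 0 ? 0 : id + (long)shared.size();
    }
    long locateObject(const std::string& term) const {
        long id = find(shared, term);
        if (id != 0) return id;
        id = find(objects, term);
        return id == 0 ? 0 : id + (long)shared.size();
    }
    long locatePredicate(const std::string& term) const { return find(predicates, term); }

    const std::string& extractSubject(long id) const {
        return id <= (long)shared.size() ? shared[id - 1] : subjects[id - shared.size() - 1];
    }
    const std::string& extractObject(long id) const {
        return id <= (long)shared.size() ? shared[id - 1] : objects[id - shared.size() - 1];
    }

private:
    // Binary search in a sorted section; returns a 1-based ID or 0 if absent.
    static long find(const std::vector<std::string>& sec, const std::string& term) {
        auto it = std::lower_bound(sec.begin(), sec.end(), term);
        if (it == sec.end() || *it != term) return 0;
        return (it - sec.begin()) + 1;
    }
};

int main() {
    FourSectionDictionary d;
    d.shared     = {"http://example.org/A"};                 // subject and object
    d.subjects   = {"http://example.org/B"};                 // subject only
    d.objects    = {"\"Big Data\"", "http://example.org/C"}; // object only
    d.predicates = {"http://example.org/knows"};
    std::cout << d.locateSubject("http://example.org/A") << "\n"; // 1 (shared)
    std::cout << d.locateObject("http://example.org/C") << "\n";  // 3 (= |shared| + 2)
    std::cout << d.extractSubject(2) << "\n";                     // http://example.org/B
}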

We suggest a Front-Coding (Witten et al. 1999) based representation as the simplest way of dictionary encoding. It has been successfully used in many WWW-based applications involving URL management. It is a very simple yet effective technique based on differential compression. This technique applies to lexicographically sorted dictionaries by dividing them into buckets of b terms. By tweaking this bucket size, different space/time trade-offs can be achieved. The first term in the bucket is explicitly stored and the remaining b − 1 ones are encoded with respect to their predecessor: the common prefix length is first encoded and the remaining suffix is appended. More technical details about these dictionaries are available in Brisaboa et al. (2011).
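The following sketch illustrates the idea under simplifying assumptions (the buckets are kept as in-memory structures, and the compact binary encoding of lengths used by real dictionaries is omitted):

#include <iostream>
#include <string>
#include <vector>

// One front-coded bucket: the first term is stored explicitly; the remaining
// terms are stored as (length of prefix shared with the previous term, suffix).
struct Bucket {
    std::string head;
    std::vector<std::pair<size_t, std::string>> rest;
};

static size_t commonPrefix(const std::string& a, const std::string& b) {
    size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i]) ++i;
    return i;
}

// Encode a lexicographically sorted list of terms into buckets of b terms.
std::vector<Bucket> frontCode(const std::vector<std::string>& sorted, size_t b) {
    std::vector<Bucket> buckets;
    for (size_t i = 0; i < sorted.size(); ++i) {
        if (i % b == 0) buckets.push_back({sorted[i], {}});
        else {
            size_t lcp = commonPrefix(sorted[i - 1], sorted[i]);
            buckets.back().rest.push_back({lcp, sorted[i].substr(lcp)});
        }
    }
    return buckets;
}

// Decode one bucket back into full terms (differential decoding).
std::vector<std::string> decodeBucket(const Bucket& bk) {
    std::vector<std::string> out{bk.head};
    for (const auto& e : bk.rest)
        out.push_back(out.back().substr(0, e.first) + e.second);
    return out;
}

int main() {
    std::vector<std::string> uris = {  // already sorted lexicographically
        "http://example.org/city/Oslo", "http://example.org/city/Sogndal",
        "http://example.org/country/Norway", "http://example.org/country/Spain"};
    for (const auto& t : decodeBucket(frontCode(uris, 4)[0])) std::cout << t << "\n";
}

Because consecutive URIs share long prefixes, only the head of each bucket is stored in full; the rest reduce to a small integer plus a short suffix, which is where the space savings come from.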

The work of Martínez-Prieto et al. (2012b) surveys the problem of encoding compact RDF dictionaries. It reports that Front-Coding achieves good performance in a general scenario, but more advanced techniques can achieve better compression ratios and/or handle complex operations directly. In any case, HDT is flexible enough to support any of these techniques, allowing stakeholders to decide which configuration is better for their specific purposes.

                

Triples. As stated, the Dictionary component allows spatial savings to be achieved, but it also enables RDF triples to be compactly encoded as tuples of three IDs referring to the corresponding terms in the Dictionary. Thus, our original RDF graph is now transformed into a graph of IDs whose encoding can be carried out in a more optimized way.

We devise a Triples encoding that internally organizes the information in a way that exploits graph redundancy to keep data compact. Moreover, this encoding can be easily mapped into a data structure that allows basic retrieval operations to be performed efficiently.

Triple patterns are the SPARQL query atoms for basic RDF retrieval. That is, all triples matching a template (s, p, o) (where s, p, and o may be variables) must be directly retrieved from the Triples encoding. For instance, in the geographic data set Geonames, the triple pattern below searches for all the subjects whose feature code (the predicate) is "P" (the object), a shortcode for "country." In other words, it asks for all the URIs representing countries:

? <http://www.geonames.org/ontology#featureCode> <http://www.geonames.org/ontology#P>

Thus, the Triples component must be able to retrieve the subject of all those triples matching this pair of predicate and object.

HDT proposes a Triples encoding named BitmapTriples (BT). This technique needs the triples to be previously sorted in a specific order, such as subject–predicate–object (SPO). BT is able to handle all possible triple orderings, but we only describe the intuitive SPO order for explanation purposes.

Basically, BT transforms the graph into a forest containing as many trees as different subjects are used in the data set, and these trees are then ordered by subject ID. This way, the first tree represents all triples rooted by the subject identified as 1, the second tree represents all triples rooted by the subject identified as 2, and so on. Each tree comprises three levels: the root represents the subject, the second level lists all predicates related to the subject, and finally the leaves organize all objects for each pair (subject, predicate). Predicate and object levels are also sorted:

• All predicates related to the subject are sorted in an increasing way. As Figure 4.4 shows, predicates are sorted as {5, 6, 7} for the second subject.
• Objects follow an increasing order for each path in the tree. That is, objects are internally ordered for each pair (subject, predicate). As Figure 4.4 shows, the object 5 is listed first (because it is related to the pair (2, 5)), then 1, 3 (by considering that these are related to the pair (2, 6)), and 4 is the last object because of its relation to (2, 7).

Each triple in the data set is now represented as a full root-to-leaf path in the corresponding tree. This simple reorganization reveals many interesting features:

• The subject can be implicitly encoded given that the trees are sorted by subject and we know the total number of trees. Thus, BT does not perform a full triples encoding, but represents pairs (predicate, object). This is an obvious spatial saving.
• Predicates are sorted within each tree. This is very similar to a well-known problem: posting list encoding for Information Retrieval (Witten et al. 1999; Baeza-Yates and Ribeiro-Neto 2011). This allows many existing and optimized techniques to be applied to our problem. Besides, efficient search within predicate lists is enabled by assuming that the elements follow a known ordering.
• Objects are sorted within each path in the tree, so (i) they can be effectively encoded and (ii) they can also be efficiently searched.

BT encodes the Triples component level by level. That is, predicate and object levels are encoded in isolation. Two structures are used for predicates: (i) an ID sequence (S_p) concatenates the predicate lists following the tree ordering; (ii) a bitsequence (B_p) uses one bit per element in S_p: 1 bits mean that this predicate is the first one for a given tree, whereas 0 bits are used for the remaining predicates. Object encoding is performed in a similar way: S_o concatenates the object lists, and B_o tags each position in such a way that 1 bits represent the first object in a path, and 0 bits the remaining ones. The right part of Figure 4.4 illustrates all these sequences for the given example.

Figure 4.4 An example data set represented as ID-triples (left) and as BitmapTriples (right): the predicate sequence S_p with its bitsequence B_p, and the object sequence S_o with its bitsequence B_o.
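The sketch below (our own illustration, not the HDT library code) derives S_p, B_p, S_o, and B_o from a list of ID-triples already sorted in SPO order; the triples of subject 2 follow the example in the text, while the two triples of subject 1 are made up for illustration:

#include <cstdio>
#include <tuple>
#include <vector>

struct BitmapTriples {
    std::vector<int>  Sp, So;   // predicate and object ID sequences
    std::vector<bool> Bp, Bo;   // 1 = first element of its tree / path
};

// triples must be sorted by (subject, predicate, object); IDs start at 1.
BitmapTriples encode(const std::vector<std::tuple<int,int,int>>& triples) {
    BitmapTriples bt;
    int prevS = 0, prevP = 0;
    for (const auto& [s, p, o] : triples) {
        bool newTree = (s != prevS);
        bool newPath = newTree || (p != prevP);
        if (newPath) {                 // a new (subject, predicate) pair starts
            bt.Sp.push_back(p);
            bt.Bp.push_back(newTree);  // 1 only for the first predicate of the tree
        }
        bt.So.push_back(o);
        bt.Bo.push_back(newPath);      // 1 only for the first object of the path
        prevS = s; prevP = p;
    }
    return bt;
}

int main() {
    // Subject 2 reproduces the text example: (2,5,5), (2,6,1), (2,6,3), (2,7,4).
    std::vector<std::tuple<int,int,int>> spo = {
        {1,2,7}, {1,3,8}, {2,5,5}, {2,6,1}, {2,6,3}, {2,7,4}};
    BitmapTriples bt = encode(spo);
    for (size_t i = 0; i < bt.Sp.size(); ++i)
        std::printf("Sp[%zu]=%d Bp=%d\n", i + 1, bt.Sp[i], (int)bt.Bp[i]);
    for (size_t i = 0; i < bt.So.size(); ++i)
        std::printf("So[%zu]=%d Bo=%d\n", i + 1, bt.So[i], (int)bt.Bo[i]);
}

For this input the program prints S_p = {2, 3, 5, 6, 7} with B_p = 10100 and S_o = {7, 8, 5, 1, 3, 4} with B_o = 111101, which is the layout assumed in the traversal example of the next section.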

                Querying HDT-encoded Data Sets: HDT-FoQ

An HDT-encoded data set can be directly accessed once its components are loaded into the memory hierarchy of any computational system. Nevertheless, this can be tuned carefully by considering the volume of the data sets and the retrieval velocity needed by specific applications. Thus, we require a data structure that keeps the compactness of the encoding to load data at the higher levels of the memory hierarchy. Data in faster memory always means faster retrieval operations. We call this solution HDT-FoQ: HDT Focused on Querying.

                

Dictionary. The dictionary component must be directly mappable from the encoding into the computer memory because it must embed enough information to resolve the basic operations previously described. Thus, this component follows the idea of "the data are the index." We invite interested readers to review the paper of Brisaboa et al. (2011) for a more detailed description of how dictionaries provide indexing capabilities.

                

Triples. The previously described BitmapTriples approach is easy to map due to the simplicity of its encoding. The sequences S_p and S_o are loaded into two integer arrays using, respectively, log(|P|) and log(|O|) bits per element. The bitsequences can also be mapped directly, but in this case they are enhanced with an additional small structure (González et al. 2005) that ensures constant-time resolution of some basic bit operations.

                This simple idea enables efficient traversal of the Triples component. All these algorithms are described in Martínez-Prieto et  al. (2012a), but we review them in practice over the example in Figure 4.4. Let us suppose that we ask for the existence of the triple (2, 6, 1). It implies that the retrieval operation is performed over the second tree:

1. We retrieve the corresponding predicate list. It is the 2nd one in S_p, and it is found by simply locating the second 1 bit in B_p. In this case it is at position P_2 = 3, so the predicate list comprises all elements from S_p[3] until the end (because this is the last 1 bit in B_p). Thus, the predicate list is {5, 6, 7}.

2. The predicate 6 is searched in the list. We binary search it and find that it is the second element in the list. Thus, it is at position P_2 + 2 − 1 = 3 + 2 − 1 = 4 in S_p, so we are traversing the 4th path of the forest.

3. We retrieve the corresponding object list. It is the 4th one in S_o. We obtain it as before: we first locate the fourth 1 bit in B_o, at position O_4 = 4, and then retrieve all objects until the next 1 bit. That is, the list comprises the objects {1, 3}.

4. Finally, the object list is binary searched, locating the object 1 in its first position. Thus, we are sure that the triple (2, 6, 1) exists in the data set.

All triple patterns providing the subject are efficiently resolved with variants of this process. Thus, the data structure directly mapped from the encoding provides fast subject-based retrieval, but makes accessing by predicate and by object difficult. Both can easily be accomplished with a limited overhead on the space used by the original encoding. All fine-grain details about the following decisions are also explained in Martínez-Prieto et al. (2012a).
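The following sketch reproduces this access algorithm for patterns of the form (s, p, ?), using the S_p/B_p/S_o/B_o arrays produced by the previous sketch (a naive illustration: the select operation over the bitsequences is implemented here as a linear scan, whereas HDT-FoQ relies on succinct structures that answer it in constant time):

#include <algorithm>
#include <cstdio>
#include <vector>

// Position (1-based) of the k-th 1 bit; a succinct structure answers this in O(1).
static int select1(const std::vector<bool>& B, int k) {
    for (size_t i = 0; i < B.size(); ++i)
        if (B[i] && --k == 0) return (int)i + 1;
    return (int)B.size() + 1;            // "past the end" if there is no k-th 1
}

// Objects matching the pattern (s, p, ?); IDs are 1-based, as in the text.
std::vector<int> resolveSP(const std::vector<int>& Sp, const std::vector<bool>& Bp,
                           const std::vector<int>& So, const std::vector<bool>& Bo,
                           int s, int p) {
    // 1. Predicate list of tree s: from the s-th 1 bit of Bp up to the next 1 bit.
    int from = select1(Bp, s), to = select1(Bp, s + 1) - 1;
    // 2. Binary search p inside Sp[from..to] (the list is sorted).
    auto first = Sp.begin() + (from - 1), last = Sp.begin() + to;
    auto it = std::lower_bound(first, last, p);
    if (it == last || *it != p) return {};             // the pattern has no solution
    int path = from + (int)(it - first);                // position of (s, p) in Sp
    // 3. Object list of that path: from the path-th 1 bit of Bo to the next one.
    int oFrom = select1(Bo, path), oTo = select1(Bo, path + 1) - 1;
    return std::vector<int>(So.begin() + (oFrom - 1), So.begin() + oTo);
}

int main() {
    std::vector<int>  Sp = {2, 3, 5, 6, 7},  So = {7, 8, 5, 1, 3, 4};
    std::vector<bool> Bp = {1, 0, 1, 0, 0},  Bo = {1, 1, 1, 1, 0, 1};
    for (int o : resolveSP(Sp, Bp, So, Bo, 2, 6)) std::printf("%d ", o); // prints: 1 3
    std::printf("\n");
}

Running it for (2, 6, ?) returns the objects {1, 3}, following exactly the four steps of the walkthrough above.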

                

Enabling access by predicate. This retrieval operation demands direct access to the second level of the tree, that is, efficient access to the sequence S_p. However, the elements of S_p are sorted by subject, so locating all the occurrences of a predicate demands a full scan of this sequence, and this results in a poor response time.

Although accesses by predicate are uncommon in general (Arias et al. 2011), some applications could require them (e.g., extracting all the information described with a set of given predicates). Thus, we must address them by considering another data structure for mapping S_p. It must enable efficient predicate location, but without degrading basic access, because S_p is used in all operations by subject. We choose a structure called the wavelet tree.

The wavelet tree (Grossi et al. 2003) is a succinct structure which reorganizes a sequence of integers in a range [1, n] to provide some access operations to the data in logarithmic time. Thus, the original S_p is now loaded as a wavelet tree, not as an array. This means a limited additional cost (in space) which preserves HDT scalability for managing Big Semantic Data. In return, we can locate all predicate occurrences in time logarithmic in the number of different predicates used for modeling the data set. In practice, this number is small, which means efficient occurrence location within our access operations. It is worth noting that accessing any position in the wavelet tree now also has a logarithmic cost.

Therefore, access by predicate is implemented by first locating the predicate occurrences one by one and, for each of them, traversing the tree following steps comparable to those explained in the previous example.


Enabling access by object. The data structure designed for loading HDT-encoded data sets, considering a subject-based order, is not suitable for accessing by object. All the occurrences of an object are scattered throughout the sequence S_o, and we are not able to locate them unless we do a sequential scan. Furthermore, in this case a structure like the wavelet tree becomes inefficient; RDF data sets usually have few predicates, but they contain many different objects, and logarithmic costs result in very expensive operations.

We enhance HDT-FoQ with an additional index (called O-Index) that is responsible for solving accesses by object. This index basically gathers the positions where each object appears in the original S_o. Please note that each leaf is associated with a different triple, so given the index of an element in the lower level, we can infer the associated predicate and subject by traversing the tree upwards, processing the bitsequences in a similar way to that used for subject-based access.

In relative terms, this O-Index has a significant impact on the final HDT-FoQ requirements because it takes considerable space in comparison to the other data structures used for modeling the Triples component. However, in absolute terms, the total size required by HDT-FoQ is very small in comparison to that required by the other competitive solutions in the state of the art.

                All these results are analyzed in the next section.

                

Joining Basic Triple Patterns. All this infrastructure enables basic triple patterns to be resolved, in compressed space, at the higher levels of the memory hierarchy. As we show below, it guarantees efficient triple pattern resolution. Although this kind of query is massively used in practice (Arias et al. 2011), the SPARQL core is defined around the concept of Basic Graph Pattern (BGP) and its semantics to build conjunctions, disjunctions, and optional parts involving more than a single triple pattern. Thus, HDT-FoQ must provide more advanced query resolution to reach full SPARQL coverage. At this moment, it is able to resolve conjunctive queries by using specific implementations of the well-known merge and index join algorithms (Ramakrishnan and Gehrke 2000).
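As a rough sketch of the first of these two strategies (a generic merge-join over two lists of candidate bindings already sorted by the shared variable; the IDs below are hypothetical and not taken from any real data set):

#include <cstdio>
#include <vector>

// Merge-join of two sorted lists of candidate bindings for a shared variable
// (e.g., the subjects matching (?x, p1, o1) and the subjects matching (?x, p2, o2)).
std::vector<int> mergeJoin(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> out;
    size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) ++i;
        else if (b[j] < a[i]) ++j;
        else { out.push_back(a[i]); ++i; ++j; }   // the same binding satisfies both patterns
    }
    return out;
}

int main() {
    std::vector<int> left  = {1, 4, 7, 9, 12};
    std::vector<int> right = {2, 4, 9, 10, 12};
    for (int s : mergeJoin(left, right)) std::printf("%d ", s);  // prints: 4 9 12
    std::printf("\n");
}

Because BitmapTriples keeps each level sorted, such sorted binding lists come for free from triple pattern resolution, which is what makes merge-join attractive in this setting.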

              Experimental Results

                This section analyzes the impact of HDT for encoding Big Semantic Data within the Publication-Exchange-Consumption workflow described in the Web of Data. We characterize the publisher and consumer stakeholders of our experiments as follows:

• The publisher is devised as an efficient agent implementing a powerful computational configuration. It runs on an Intel Xeon E5645@2.4 GHz, hexa-core (6 cores, 12 siblings: 2 threads per core),


• The consumer is designed on a conventional configuration because it plays the role of any agent consuming RDF within the Web of Data. It runs on an AMD Phenom II X4 955@3.2 GHz, quad-core (4 cores, 4 siblings: 1 thread per core), 8 GB DDR2@800 MHz.

The network is regarded as an ideal communication channel: free of errors and any other external interferences. We assume a transmission speed of 2 Mbyte/s.

All our experiments are carried out over a heterogeneous data configuration of many colors and flavors. We choose a variety of real-world semantic data sets of different sizes and from different application domains (see Table 4.1). In addition, we join the three biggest data sets into a large mash-up of more than 1 billion triples to analyze performance issues in an integrated data set.

The prototype running these experiments is developed in C++ using the HDT library publicly available at the official RDF/HDT website.

                Publication Performance

As explained, RDF data sets are usually released in plain-text form (NTriples, Turtle, or RDF/XML), and their big volume is simply reduced using a traditional compressor. This way, volume directly affects the publication process because the publisher must, at least, process the data set to convert it to a suitable format for exchange. Attending to the current practices, we set gzip compression as the baseline, and we also include lzma because of its effectiveness. We compare their results against HDT, in plain form and also in conjunction with the same compressors. That is, HDT plain implements the encoding described in the section "Encoding Big Semantic Data: HDT", and HDT + X stands for the result of compressing HDT plain with the compressor X.

Table 4.1
Statistics of the Real-World Data Sets Used in the Experimentation

Data set      Triples          Plain NTriples Size (GB)   Available at
LinkedMDB     6,148,121        0.85                       http://queens.db.toronto.edu/~oktie/linkedmdb
DBLP          73,226,756       11.16                      http://DBLP.l3s.de/DBLP++.php
Geonames      119,316,724      13.79                      http://download.Geonames.org/all-Geonames-rdf.zip
DBpedia       296,907,301      48.62                      http://wiki.dbpedia.org/Downloads37
Freebase(a)   639,078,932      84.76                      http://download.freebase.com/datadumps/
Mashup        1,055,302,957    140.46                     Mashup of Geonames + Freebase + DBPedia

(a) Dump on 2012-07-26 converted to RDF using http://code.google.com/p/freebase-quad-rdfize/.

Figure 4.5 shows compression ratios for all the considered techniques. In general, HDT plain requires more space than traditional compressors. This is an expected result because both the Dictionary and the Triples use very basic approaches. Advanced techniques for each component enable significant improvements in space. For instance, our preliminary results using the technique proposed in Martínez-Prieto et al. (2012b) for dictionary encoding show a significant improvement in space. Nevertheless, if we apply traditional compression over the HDT-encoded data sets, the spatial requirements are largely diminished. As shown in Figure 4.5, the comparison changes when the HDT-encoded data sets are compressed with gzip and lzma. These results show that HDT + lzma achieves the most compressed representations, largely improving the effectiveness reported by traditional approaches. For instance, HDT + lzma only uses 2.56% of the original mash-up size, whereas compressors require 5.23% (lzma) and 7.92% (gzip).

Thus, encoding the original Big Semantic Data with HDT and then applying compression reports the best numbers for publication. It means that publishers using our approach require 2−3 times less storage space and bandwidth than using traditional compression. These savings are achieved at the price of spending some time to obtain the corresponding representations. Note that traditional compression basically requires compressing the data set, whereas our approach firstly transforms the data set into its HDT encoding and then compresses it. These publication times (in minutes) are depicted in Table 4.2.

Figure 4.5 Compression ratios (as a percentage of the original NTriples size) achieved by HDT, HDT + gz, HDT + lzma, NT + gz, and NT + lzma for each data set.

Table 4.2
Publication Times (Minutes)

Data set     gzip     lzma      HDT + gzip   HDT + lzma
LinkedMDB    0.19     14.71     1.09         1.52
DBLP         2.72     103.53    13.48        21.99
Geonames     3.28     244.72    26.42        38.96
DBPedia      18.90    664.54    84.61        174.12
Freebase     24.08    1154.02   235.83       315.34
Mash-up      47.23    2081.07   861.87       1033.0

Note: Bold values emphasize the best compression times.

Exchange Performance

In the ideal network regarded in our experiments, exchange performance is uniquely determined by the data size. Thus, our approach also appears as the most efficient because of its excellent compression ratios. Table 4.3 organizes processing times for all data sets and each task involved in the workflow. The Exchange column lists the exchange times required when lzma (in the baseline) and HDT + lzma are used for encoding.

For instance, the mash-up exchange takes roughly half an hour for HDT + lzma and slightly more than 1 h for lzma. Thus, our approach halves the exchange time and also saves bandwidth in the same proportion for the mash-up.
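As a back-of-the-envelope check of these figures (an approximation from the compression ratios and the assumed 2 Mbyte/s channel, so small rounding differences with Table 4.3 are expected): HDT + lzma keeps about 2.56% of the 140.46 GB mash-up, that is, roughly 3.6 GB (≈3680 Mbytes), which at 2 Mbyte/s takes about 1840 s, in line with the 1839.61 s reported for the HDT configuration; the lzma baseline keeps about 5.23%, roughly 7.3 GB, or about 3760 s, slightly more than 1 h.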

                Consumption Performance

In the current evaluation, consumption performance is analyzed from two complementary perspectives. First, we consider a postprocessing stage in which the consumer decompresses the downloaded data set and then indexes it for local consumption. Every consumption task directly relies on efficient query resolution, so querying performance is the second perspective considered.

Table 4.3
Overall Client Times (Seconds)

Data set     Config.    Exchange   Decomp.   Index      Total
LinkedMDB    Baseline   9.61       5.11      111.08     125.80
             HDT        6.25       1.05      1.91       9.21
DBLP         Baseline   164.09     70.86     1387.29    1622.24
             HDT        89.35      14.82     16.79      120.96
Geonames     Baseline   174.46     87.51     2691.66    2953.63
             HDT        118.29     19.91     44.98      183.18
DBPedia      Baseline   1659.95    553.43    7904.73    10118.11
             HDT        832.35     197.62    129.46     1159.43
Freebase     Baseline   1910.86    681.12    58080.09   60672.07
             HDT        891.90     227.47    286.25     1405.62
Mashup       Baseline   3757.92    1238.36   >24 h      >24 h
             HDT        1839.61    424.32    473.64     2737.57

Note: Bold values highlight the best times for each activity in the workflow. Baseline means that the file is downloaded in NTriples format, compressed using lzma, and indexed using RDF-3X. HDT means that the file is downloaded in HDT, compressed with lzma, and indexed using HDT-FoQ.

Both postprocessing and querying tasks require an RDF store enabling indexing and efficient SPARQL resolution. We choose three well-known stores for a fair comparison with respect to HDT-FoQ: RDF-3X (available at http://www.mpi-inf.mpg.de/~neumann/rdf3x/) was recently reported as the fastest RDF store (Huang et al. 2011); Virtuoso (available at http://www.openlinksw.com/) is a popular store built on a relational infrastructure; and Hexastore is a well-known memory-resident store.

                

Postprocessing. As stated, this task involves decompression and indexing in order to make the compressed data set retrieved from the publisher queryable. Table 4.3 also organizes postprocessing times for all data sets. It is worth noting that we compare our HDT + lzma against a baseline comprising lzma decompression and RDF3X indexing because it reports the best numbers. Cells containing ">24 h" mean that the process had not finished after 24 h. Thus, indexing the mash-up in our consumer is a very heavy task requiring a lot of computational resources and also wasting a lot of time.

HDT-based postprocessing largely outperforms RDF3X for all the original data sets in our setup. HDT performs decompression and indexing from ≈25 (DBPedia) to 114 (Freebase) times faster than RDF3X. This situation is due to two main reasons. On the one hand, HDT-encoded data sets are smaller than their counterparts in NTriples, which improves decompression performance. On the other hand, HDT-FoQ generates its additional indexing structures (see section "Querying HDT-Encoded Data sets: HDT-FoQ") over the original HDT encoding, whereas RDF3X first needs to parse the data set and then build its specific indices from scratch. Both features share an important fact: the most expensive processing was already done on the server side, and HDT-encoded data sets are clearly better suited for machine consumption.

Exchange and postprocessing times can be analyzed together because they add up to the total time that a consumer must wait until the data can be efficiently used in any application. Our integrated approach, built around HDT encoding and data structures, completes all the tasks 8−43 times faster than the traditional combination of compression and RDF indexing. It means, for instance, that the configured consumer retrieves and makes Freebase queryable in roughly 23 min using HDT, whereas it needs almost 17 h to complete the same process over the baseline. In addition, we can see that indexing is clearly the heaviest task in the baseline, whereas exchange is the longest task for us. However, in any case, we always complete the exchange faster due to our achievements in space.
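To make the speedup claim above easy to verify, the following short Python sketch recomputes the end-to-end factors from the total times reported in Table 4.3; the figures are copied from the table, and the mash-up row is omitted because its baseline did not finish within 24 h.

    # Total client times in seconds, copied from Table 4.3 (baseline vs. HDT).
    totals = {
        "LinkedMDB": (125.80, 9.21),
        "DBLP":      (1622.24, 120.96),
        "Geonames":  (2953.63, 183.18),
        "DBPedia":   (10118.11, 1159.43),
        "Freebase":  (60672.07, 1405.62),
    }

    for dataset, (baseline, hdt) in totals.items():
        # Speedup = baseline total time divided by HDT total time.
        print(f"{dataset}: {baseline / hdt:.1f}x faster with HDT")

Running this yields factors between roughly 9 (DBPedia) and 43 (Freebase), matching the 8−43 range reported above.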

                

Querying. Once the consumer has made the downloaded data queryable, the infrastructure is ready to support applications that issue SPARQL queries on top of it. The data volume emerges again as a key factor because it restricts the ways indices and query optimizers are designed and managed.

On the one hand, RDF3X and Virtuoso rely on disk-based indexes, which are selectively loaded into main memory. Although both are efficiently tuned for this purpose, these I/O transfers result in very expensive operations that hinder the final querying performance. On the other hand, Hexastore and HDT-FoQ always hold their indices in memory, avoiding these slow accesses to disk. Whereas HDT-FoQ enables all data sets in the setup to be managed in the consumer configuration, Hexastore is only able to index the smallest one, which shows its scalability problems when managing Big Semantic Data.

We obtain two different sets of SPARQL queries to compare HDT-FoQ against the state-of-the-art indexing solutions. On the one hand, 5000 queries are randomly generated for each triple pattern. On the other hand, we also generate 350 queries of each type of two-way join, subdivided into two groups depending on whether they have a small or a large amount of intermediate results. All these queries are run over Geonames in order to include both Virtuoso and RDF3X in the experiments. Note that both classes of queries are resolved without the need for query planning; hence, the results are clear evidence of how the different indexing techniques perform.
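For readers less familiar with the pattern notation, the sketch below spells out, as plain Python strings, what the two query classes look like in SPARQL. The URIs are illustrative placeholders (not actual Geonames terms); only the shape of the patterns matters, with lowercase letters denoting bound terms and V denoting variables, as in the labels of Figure 4.6.

    # Illustrative SPARQL shapes for the evaluated query classes.
    # <http://example.org/...> terms are placeholders, not real Geonames URIs.

    # Triple patterns: bound subject, bound predicate, or bound object.
    pattern_sVV = "SELECT * WHERE { <http://example.org/resource/R1> ?p ?o }"
    pattern_VpV = "SELECT * WHERE { ?s <http://example.org/ontology/population> ?o }"
    pattern_VVo = 'SELECT * WHERE { ?s ?p "Sogndal" }'

    # Two-way joins: two triple patterns sharing one variable.
    join_subject_subject = """
    SELECT * WHERE {
      ?s <http://example.org/ontology/population> ?pop .
      ?s <http://example.org/ontology/elevation>  ?ele .
    }"""

    join_subject_object = """
    SELECT * WHERE {
      ?s <http://example.org/ontology/parentFeature> ?parent .
      ?parent <http://example.org/ontology/population> ?pop .
    }"""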

Figure 4.6 summarizes these querying experiments. The X-axis lists all the different queries: the left subgroup contains the triple patterns, and the right one the different join classes. The Y-axis gives the number of times that HDT-FoQ is faster than its competitors.

Figure 4.6 Comparison of querying performance on Geonames: query evaluation time, reported as the number of times HDT-FoQ is faster than RDF-3X and Virtuoso for each type of query (triple patterns spV, sVo, sVV, Vpo, VpV, VVo and joins SSbig, SSsmall, SObig, SOsmall, OObig, OOsmall).

For instance, for the pattern (S, V, V), HDT-FoQ is faster than RDF3X and more than 11 times faster than Virtuoso. In general, HDT-FoQ always outperforms Virtuoso, whereas RDF3X is slightly faster for (V, P, V) and some join classes. Nevertheless, we remain competitive in all these cases, and our join algorithms are still open to optimization.

Conclusions and Next Steps

This chapter presents basic foundations of Big Semantic Data management.

First, we trace a route through the current data deluge, the concept of Big Data, and the need for machine-processable semantics on the WWW. The Resource Description Framework (RDF) and the Web of (Linked) Data naturally emerge in this well-grounded scenario. The former, RDF, is the natural codification language for semantic data, combining the flexibility of semantic networks with a graph data structure that makes it an excellent choice for describing metadata at Web scale. The latter, the Web of (Linked) Data, provides a set of rules to publish and link Big Semantic Data.

We justify the various management problems arising in Big Semantic Data by characterizing their main stakeholders by role (Creators/Publishers/Consumers) and nature (Automatic/Supervised/Human). Then, we define a common Publication-Exchange-Consumption workflow, which exists in most applications in the Web of Data. The scalability problems that the current state-of-the-art management solutions face within this scenario set the stage for our proposal, HDT.

HDT is designed as a binary RDF format to fulfill the requirements of portability (from and to other formats), compactness, parsing efficiency (readiness for postprocessing), and direct access to any piece of data in the data set. We detail the design of HDT and argue that HDT-encoded data sets can be directly consumed within the presented workflow. We show that lightweight indices can be created once the different components are loaded into the memory hierarchy at the consumer, allowing for more complex operations such as joining basic SPARQL triple patterns. Finally, this compact infrastructure, called HDT-FoQ (HDT Focused on Querying), is evaluated against a traditional combination of universal compression (for exchange) and RDF indexing (for consumption).

Our experiments show how HDT excels at almost every stage of the publish-exchange-consumption workflow. The publisher spends a bit more time to encode the Big Semantic data set, but in return, the consumer is able to retrieve it twice as fast, and the indexing time is largely reduced to just a few minutes for huge data sets. Therefore, the time from when a machine or human client discovers the data set until it is ready to start querying its content is reduced by up to 16 times by using HDT instead of the traditional approaches. Furthermore, the query performance is very competitive compared to state-of-the-art RDF stores: thanks to the size reduction, the machine can keep a vast amount of triples in main memory, avoiding slow I/O transfers.

There are several areas where HDT can be further exploited. We foresee a huge potential for HDT to support many aspects of the Publish-Exchange-Consume workflow. HDT-based technologies can emerge to provide supporting tools for both publishers and consumers. For instance, a very useful tool for a publisher is setting up a SPARQL endpoint on top of an HDT file. As the experiments show, HDT-FoQ is very competitive on queries, but there is still plenty of room for SPARQL optimization by leveraging efficient resolution of triple patterns, joins, and query planning. Another useful tool for publishers is configuring a dereferenceable URI materialization from a given HDT file. Here the experiments also show that performance will be very high because HDT-FoQ is really fast on queries with a fixed RDF subject.
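As a minimal sketch of the second tool mentioned above, the following Python snippet materializes dereferenceable URIs over a fixed-subject lookup. The lookup_spo function is a hypothetical stand-in for an HDT-FoQ (s, V, V) pattern resolution, not an actual HDT API, and the base URI is an assumption.

    from http.server import BaseHTTPRequestHandler, HTTPServer

    BASE = "http://example.org/resource/"

    def lookup_spo(subject):
        """Hypothetical stand-in for an HDT-FoQ (s, V, V) lookup."""
        # A real deployment would resolve the pattern against the in-memory
        # HDT-FoQ indexes; here we return a canned triple for illustration.
        return [(subject, "http://www.w3.org/2000/01/rdf-schema#label", '"Example"')]

    class DereferenceHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            subject = BASE + self.path.lstrip("/")
            triples = lookup_spo(subject)
            body = "\n".join(f"<{s}> <{p}> {o} ." for s, p, o in triples)
            self.send_response(200)
            self.send_header("Content-Type", "application/n-triples")
            self.end_headers()
            self.wfile.write(body.encode("utf-8"))

    if __name__ == "__main__":
        HTTPServer(("localhost", 8000), DereferenceHandler).serve_forever()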

              Acknowledgments

This work was partially funded by MICINN (TIN2009-14009-C02-02); Science Foundation Ireland: Grant No. SFI/08/CE/I1380, Lion-II; Fondecyt 1110287 and Fondecyt 1-110066. The first author is supported by Erasmus Mundus, the Regional Government of Castilla y León (Spain), and the European Social Fund. The third author is supported by the University of Valladolid.


              References

                

              Abadi, D., A. Marcus, S. Madden, and K. Hollenbach. 2009. SW-Store: A vertically

              partitioned DBMS for Semantic Web data management. The VLDB Journal 18, 385–406.

              Adida, B., I. Herman, M. Sporny, and M. Birbeck (Eds.). 2012. RDFa 1.1 Primer. W3C

              Working Group Note. http://www.w3.org/TR/xhtml-rdfa-primer/.

              Akar, Z., T. G. Hala, E. E. Ekinci, and O. Dikenelli. 2012. Querying the Web of

              Interlinked Datasets using VOID Descriptions. In Proc. of the Linked Data on the

                Web Workshop (LDOW), Lyon, France, Paper 6.

                

              Alexander, K. 2008. RDF in JSON: A Specification for serialising RDF in JSON. In

              Proc. of the 4th Workshop on Scripting for the Semantic Web (SFSW),

                Tenerife, Spain.

              Alexander, K., R. Cyganiak, M. Hausenblas, and J. Zhao. 2009. Describing linked

              datasets-on the design and usage of voiD, the “vocabulary of interlinked data- sets”. In Proc. of the Linked Data on the Web Workshop (LDOW), Madrid, Spain, Paper 20.

              Álvarez-García, S., N. Brisaboa, J. Fernández, and M. Martínez-Prieto. 2011.

Compressed k²-triples for full-in-memory RDF engines. In Proc. 17th Americas Conference on Information Systems (AMCIS), Detroit, MI, Paper 350.

                

Arenas, M., A. Bertails, E. Prud'hommeaux, and J. Sequeda (Eds.). 2012. A Direct Mapping of Relational Data to RDF. W3C Recommendation. http://www.w3.org/TR/rdb-direct-mapping/.

              Arias, M., J. D. Fernández, and M. A. Martínez-Prieto. 2011. An empirical study of

real-world SPARQL queries. In Proc. of 1st Workshop on Usage Analysis and the Web of Data (USEWOD), Hyderabad, India. http://arxiv.org/abs/1103.5043.

                

              Atemezing, G., O. Corcho, D. Garijo, J. Mora, M. Poveda-Villalón, P. Rozas, D. Vila-

              Suero, and B. Villazón-Terrazas. 2013. Transforming meteorological data into linked data. Semantic Web Journal 4(3), 285–290.

              Atre, M., V. Chaoji, M. Zaki, and J. Hendler. 2010. Matrix “Bit” loaded: A scalable

              lightweight join query processor for RDF data. In Proc. of the 19th World Wide

                Web Conference (WWW) , Raleigh, NC, pp. 41–50.

                

              Auer, S., C. Bizer, G. Kobilarov, J. Lehmann, and Z. Ives. 2007. Dbpedia: A nucleus

              for a web of open data. In Proc. of the 6th International Semantic Web Conference (ISWC) , Busan, Korea, pp. 11–15.

                

              Baeza-Yates, R. and B. A. Ribeiro-Neto. 2011. Modern Information Retrieval—the

              Concepts and Technology Behind Search (2nd edn.). Pearson Education Ltd.

              Beckett, D. (Ed.) 2004. RDF/XML Syntax Specification (Revised). W3C Recommendation.

              http://www.w3.org/TR/rdf-syntax-grammar/.

              Beckett, D. and T. Berners-Lee. 2008. Turtle—Terse RDF Triple Language. W3C Team

              Submission. http://www.w3.org/TeamSubmission/turtle/.

Berners-Lee, T. 2002. Linked Open Data. What is the idea? http://www.thenationaldialogue.org/ideas/linked-open-data (accessed October 8, 2012).


              Bizer, C., T. Heath, and T. Berners-Lee. 2009. Linked data—the story so far. International

              Journal on Semantic Web and Information Systems 5, 1–22.

              Brickley, D. 2004. RDF Vocabulary Description Language 1.0: RDF Schema. W3C

              Recommendation. http://www.w3.org/TR/rdf-schema/.

                

              Brisaboa, N., R. Cánovas, F. Claude, M. Martínez-Prieto, and G. Navarro. 2011.

                Compressed string dictionaries. In Proc. of 10th International Symposium on Experimental Algorithms (SEA) , Chania, Greece, pp. 136–147.

                

Cukier, K. 2010. Data, data everywhere. The Economist, Special Report (accessed October 8, 2012).

                

              Cyganiak, R., H. Stenzhorn, R. Delbru, S. Decker, and G. Tummarello. 2008. Semantic

              sitemaps: Efficient and flexible access to datasets on the semantic web. In Proc. of the 5th European Semantic Web Conference (ESWC) , Tenerife, Spain, pp. 690–704.

                

              De, S., T. Elsaleh, P. M. Barnaghi, and S. Meissner. 2012. An internet of things platform

              for real-world and digital objects. Scalable Computing: Practice and Experience 13(1), 45–57.

              Dijcks, J.-P. 2012. Big Data for the Enterprise. Oracle (white paper) (January). http://

              www.oracle.com/us/products/database/big-data-for-enterprise-519135.pdf

                (accessed October 8, 2012). Dumbill, E. 2012a. Planning for Big Data. O’Reilly Media, Sebastopol, CA.

              Fernández, J. D., M. A. Martínez-Prieto, and C. Gutiérrez. 2010. Compact represen-

              tation of large RDF data sets for publishing and exchange. In Proc. of the 9th

                

International Semantic Web Conference (ISWC), Shanghai, China, pp. 193–208.

                

              Fernández, J. D., M. A. Martínez-Prieto, C. Gutiérrez, and A. Polleres. 2011. Binary RDF

              Representation for Publication and Exchange (HDT) . W3C Member Submission. http://www.w3.org/Submission/2011/03/.

              Foulonneau, M. 2011. Smart semantic content for the future internet. In Metadata and

                Semantic Research , Volume 240 of Communications in Computer and Information Science , pp. 145–154. Springer, Berlin, Heidelberg.

                

              García-Silva, A., O. Corcho, H. Alani, and A. Gómez-Pérez. 2012. Review of the state

              of the art: Discovering and associating semantics to tags in folksonomies. The Knowledge Engineering Review 27(01), 57–85.

                

              González, R., S. Grabowski, V. Mäkinen, and G. Navarro. 2005. Practical implementa-

              tion of rank and select queries. In Proc. of 4th International Workshop Experimental and Efficient Algorithms (WEA) , Santorini Island, Greece, pp. 27–38.

                

              Grossi, R., A. Gupta, and J. Vitter. 2003. High-order entropy-compressed text indexes.

                In Proc. of 9th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), Baltimore, MD, pp. 841–850.

              Haas, K., P. Mika, P. Tarjan, and R. Blanco. 2011. Enhanced results for web search. In

                Proc. of the 34th International Conference on Research and Development in Information Retrieval (SIGIR) , Beijing, China, pp. 725–734.

                

              Hausenblas, M. and M. Karnstedt. 2010. Understanding linked open data as a web-

              scale database. In Proc. of the 1st International Conference on Advances in Databases (DBKDA) , 56–61.


              Heath, T. and C. Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space.

                Synthesis Lectures on the Semantic Web: Theory and Technology, Morgan & Claypool.

              Hey, T., S. Tansley, and K. M. Tolle. 2009. Jim Gray on eScience: A transformed scien-

              tific method. In The Fourth Paradigm. Microsoft Research.

              Hogan, A., A. Harth, J. Umbrich, S. Kinsella, A. Polleres, and S. Decker. 2011. Searching

              and browsing linked data with SWSE: The semantic web search engine. Journal of Web Semantics 9(4), 365–401.

                

              Hogan, A., J. Umbrich, A. Harth, R. Cyganiak, A. Polleres, and S. Decker. 2012. An

              empirical survey of linked data conformance. Web Semantics: Science, Services and Agents on the World Wide Web 14(0), 14–44.

                

              Huang, J., D. Abadi, and K. Ren. 2011. Scalable SPARQL querying of large RDF

              graphs. Proceedings of the VLDB Endowment 4(11), 1123–1134.

              Knoblock, C. A., P. Szekely, J. L. Ambite, S. Gupta, A. Goel, M. Muslea, K. Lerman,

              and P. Mallick. 2012. Semi-Automatically Mapping Structured Sources into the Semantic Web. In Proc. of the 9th Extended Semantic Web Conference (ESWC), Heraklion, Greece, pp. 375–390.

              Le-Phuoc, D., J. X. Parreira, V. Reynolds, and M. Hauswirth. 2010. RDF On the Go:

An RDF Storage and Query Processor for Mobile Devices. In Proc. of the 9th International Semantic Web Conference (ISWC), Shanghai, China.

                

Lohr, S. 2012. The age of big data. The New York Times, February 11, 2012.

                Loukides, M. 2012. What is Data Science? O’Reilly Media.

              Manola, F. and E. Miller (Eds.). 2004. RDF Primer. W3C Recommendation. www.

              w3.org/TR/rdf-primer/.

              Martínez-Prieto, M., M. Arias, and J. Fernández. 2012a. Exchange and consumption

              of huge RDF data. In Proc. of the 9th Extended Semantic Web Conference (ESWC),

                Heraklion, Greece, pp. 437–452.

              Martínez-Prieto, M., J. Fernández, and R. Cánovas. 2012b. Querying RDF dictionar-

              ies in compressed space. ACM SIGAPP Applied Computing Reviews 12(2), 64–77.

                

              Marz, N. and J. Warren. 2013. Big Data: Principles and Best Practices of Scalable Realtime

              Data Systems . Manning Publications.

              McGuinness, D. L. and F. van Harmelen (Eds.). 2004. OWL Web Ontology Language

                Overview . W3C Recommendation. http://www.w3.org/TR/owl-features/.

                

              Neumann, T. and G. Weikum. 2010. The RDF-3X engine for scalable management of

              RDF data. The VLDB Journal 19(1), 91–113.

              Prud’hommeaux, E. and A. Seaborne (Eds.). 2008. SPARQL Query Language for RDF.

              http://www.w3.org/TR/rdf-sparql-query/. W3C Recommendation.

              Quesada, J. 2008. Human similarity theories for the semantic web. In Proceedings of

              the First International Workshop on Nature Inspired Reasoning for the Semantic Web, Karlsruhe, Germany.

                

              Ramakrishnan, R. and J. Gehrke. 2000. Database Management Systems. Osborne/

              McGraw-Hill.

              Schmidt, M., M. Meier, and G. Lausen. 2010. Foundations of SPARQL query opti-

              mization. In Proc. of the 13th International Conference on Database Theory (ICDT),

                Lausanne, Switzerland, pp. 4–33.


              Schneider, J. and T. Kamiya (Eds.). 2011. Efficient XML Interchange (EXI) Format 1.0.

                W3C Recommendation. http://www.w3.org/TR/exi/.

              Schwarte, A., P. Haase, K. Hose, R. Schenkel, and M. Schmidt. 2011. FedX: Optimization

              techniques for federated query processing on linked data. In Proc. of the 10th

                International Conference on the Semantic Web (ISWC) , Bonn, Germany, pp. 601–616.

                

              Selg, E. 2012. The next Big Step—Big Data. GFT Technologies AG (technical report).

              Sidirourgos, L., R. Goncalves, M. Kersten, N. Nes, and S. Manegold. 2008. Column-

              store Support for RDF Data Management: not All Swans are White. Proc. of the

                VLDB Endowment 1(2), 1553–1563.

                

              Taheriyan, M., C. A. Knoblock, P. Szekely, and J. L. Ambite. 2012. Rapidly integrating

              services into the linked data cloud. In Proc. of the 11th International Semantic Web Conference (ISWC), Boston, MA, pp. 559–574.

                

              Tran, T., G. Ladwig, and S. Rudolph. 2012. Rdf data partitioning and query processing

              using structure indexes. IEEE Transactions on Knowledge and Data Engineering 99.

doi:10.1109/TKDE.2012.134.

              Tummarello, G., R. Cyganiak, M. Catasta, S. Danielczyk, R. Delbru, and S. Decker.

              2010. Sig.ma: Live views on the web of data. Web Semantics: Science, Services and Agents on the World Wide Web 8(4), 355–364.

              Urbani, J., J. Maassen, and H. Bal. 2010. Massive semantic web data compression with

              MapReduce. In Proc. of the 19th International Symposium on High Performance

                Distributed Computing (HPDC) 2010 , Chicago, IL, pp. 795–802.

                

              Volz, J., C. Bizer, M. Gaedke, and G. Kobilarov. 2009. Discovering and maintaining

              links on the web of data. In Proc. of the 9th International Semantic Web Conference (ISWC) , Shanghai, China, pp. 650–665.

                

              Weiss, C., P. Karras, and A. Bernstein. 2008. Hexastore: Sextuple indexing for semantic

              web data management. Proc. of the VLDB Endowment 1(1), 1008–1019.

              Witten, I. H., A. Moffat, and T. C. Bell. 1999. Managing Gigabytes: Compressing and

              Indexing Documents and Images . San Francisco, CA, Morgan Kaufmann.

                


Linked Data in Enterprise Integration

Sören Auer, Axel-Cyrille Ngonga Ngomo, Philipp Frischmuth, and Jakub Klimek

CONTENTS

Introduction
Challenges in Data Integration for Large Enterprises
Linked Data Paradigm for Integrating Enterprise Data
Runtime Complexity
    Preliminaries
    The HR3 Algorithm
        Indexing Scheme
        Approach
    Evaluation
        Experimental Setup
        Results
Discrepancy
    Preliminaries
    CaRLA
        Rule Generation
        Rule Merging and Filtering
        Rule Falsification
    Extension to Active Learning
    Evaluation
        Experimental Setup
        Results and Discussion
Conclusion
References

              Introduction

Data integration in large enterprises is a crucial, but at the same time costly, long-lasting, and challenging problem. While business-critical information is often already gathered in integrated information systems such as ERP, SCM, and CRM systems, the integration with the abundance of other information sources is still a major challenge. Large companies often operate hundreds or even thousands of different information systems and databases. This is especially true for large OEMs. For example, it is estimated that at Volkswagen there are approximately 5000 different information systems deployed. At Daimler—even after a decade of consolidation efforts—the number of independent IT systems still reaches 3000.

                After the arrival and proliferation of IT in large enterprises, various approaches, techniques, and methods have been introduced in order to solve the data integration challenge. In the last decade, the prevalent data integra- tion approaches were primarily based on XML, Web Services, and Service-

                

              Oriented Architectures (SOA) [9]. XML defines a standard syntax for data

                representation, Web Services provide data exchange protocols, and SOA is a holistic approach for distributed systems architecture and communication. However, we become increasingly aware that these technologies are not suf- ficient to ultimately solve the data integration challenge in large enterprises. In particular, the overheads associated with SOA are still too high for rapid and flexible data integration, which are a prerequisite in the dynamic world of today’s large enterprises.

                We argue that classic SOA architectures are well suited for transaction pro- cessing, but more efficient technologies are available that can be deployed for solving the data integration challenge. Recent approaches, for example, con- sider ontology-based data integration, where ontologies are used to describe data, queries, and mappings between them [33]. The problems of ontology- based data integration are the required skills to develop the ontologies and the difficulty to model and capture the dynamics of the enterprise. A related, but slightly different approach is the use of the Linked Data paradigm for integrating enterprise data. Similarly, as the data web emerged complement- ing the document web, data intranets can complement the intranets and SOA landscapes currently found in large enterprises.

                The acquisition of Freebase by Google and Powerset by Microsoft are the first indicators that large enterprises will not only use the Linked Data paradigm for the integration of their thousands of distributed information systems, but they will also aim at establishing Enterprise Knowledge Bases (EKB; similar to what Freebase now is for Google) as hubs and crystallization points for the vast amounts of structured data and knowledge distributed in their data intranets.

Examples of public LOD data sources that are highly relevant for large enterprises are OpenCorporates* (a knowledge base containing information about more than 50,000 corporations worldwide), LinkedGeoData [1] (a spatial knowledge base derived from OpenStreetMap containing precise information about all kinds of spatial features and entities), or Product Ontology (which comprises detailed classifications and information about more than 1 million products). For enterprises, tapping this vast, crowd-sourced knowledge that is freely available on the web is an amazing opportunity. However, it is crucial to assess the quality of such freely available knowledge, to complement and contrast it with additional nonpublic information available to the enterprise (e.g., enterprise taxonomies, domain databases, etc.), and to actively manage the life cycle of both—the public and the private data—being integrated and made available in an enterprise's data intranet.
* http://opencorporates.com/

                In order to make large enterprises ready for the service economy, their IT infrastructure landscapes have to be made dramatically more flexible. Information and data have to be integrated with substantially reduced costs and in extremely short-time intervals. Mergers and acquisitions further accelerate the need for making IT systems more interoperable, adaptive, and flexible. Employing the Linked Data approach for establishing enterprise data intranets and knowledge bases will facilitate the digital innovation capabilities of large enterprises.

                In this chapter, we explore the challenges large enterprises are still fac- ing with regard to data integration. These include, but are not limited to, the development, management, and interlinking of enterprise taxono- mies, domain databases, wikis, and other enterprise information sources (cf. “Challenges in Data Integration for Large Enterprises”). Employing the Linked Data paradigm to address these challenges might result in the emergence of enterprise Big Data intranets, where thousands of databases and information systems are connected and interlinked. Only a small part of the data sources in such an emerging Big Data intranet will actually be the Big Data itself. Many of them are rather small- or medium-sized data and knowledge bases. However, due to the large number of such sources, they will jointly reach a critical mass (volume). Also, we will observe on a data intranet a large semantic heterogeneity involving various schemas, vocabularies, ontologies, and taxonomies (variety). Finally, since Linked Data means directly publishing RDF from the original data representations, changes in source databases and information systems will be immediately visible on the data intranet and thus result in a constant evolution (velocity). Of particular importance in such a Big Data intranet setting is the creation of links between distributed data and knowledge bases within an enter- prise’s Big Data intranet. Consequently, we also discuss the requirements for linking and transforming enterprise data in depth (cf. “Linked Data Paradigm for Integrating Enterprise Data”). Owing to the number of link- ing targets to be considered and their size, the time efficiency of linking is a key issue in Big Data intranets. We thus present and study the com- plexity of the first reduction-ratio-optimal algorithm for link discovery (cf. “Runtime Complexity”). Moreover, we present an approach for reducing the discrepancy (i.e., improving the coherence) of data across knowledge bases

(cf. "Discrepancy").

Figure 5.1 Our vision of an Enterprise Data Web (EDW). The solid lines show how IT systems may be currently connected in a typical scenario. The dotted lines visualize how IT systems could be interlinked employing an internal data cloud. The EDW also comprises an EKB, which consists of vocabulary definitions, copies of relevant Linked Open Data, as well as internal and external link sets between data sets. Data from the LOD cloud may be reused inside the enterprise, but internal data are secured from external access just like in usual intranets.

                The introductory section depicts our vision of an Enterprise Data Web and the resulting semantically interlinked enterprise IT systems landscape (see Figure 5.1). We expect existing enterprise taxonomies to be the nucleus of linking and integration hubs in large enterprises, since these taxonomies already reflect a large part of the domain terminology and corporate and organizational culture. In order to transform enterprise taxonomies into comprehensive EKBs, additional relevant data sets from the Linked Open Data Web have to be integrated and linked with the internal taxonomies and knowledge structures. Subsequently, the emerging EKB can be used (1) for interlinking and annotating content in enterprise wikis, content management systems, and portals; (2) as a stable set of reusable concepts and identifiers; and (3) as the background knowledge for intranet, extranet, and site-search applications. As a result, we expect the current document-oriented intranets in large enterprises to be complemented with a data intranet, which facili- tates the lightweight, semantic integration of the plethora of information sys-

tems.

              Challenges in Data Integration for Large Enterprises

                We identified six crucial areas (Table 5.1) where data integration challenges arise in large enterprises. Figure 5.2 shows the Linked Data life cycle in con- junction with the aforementioned challenges. Each challenge may be related to a single or to multiple steps in the Linked Data life cycle.

                

              Enterprise Taxonomies. Nowadays, almost every large enterprise uses tax-

                onomies to provide a shared linguistic model aiming at structuring the large quantities of documents, emails, product descriptions, enterprise directives, etc. which are produced on a daily basis. Currently, terminology in large enterprises is managed in a centralized manner mostly by a dedicated and independently acting department (often referred to as Corporate Language

                

Management (CLM)). CLM is in charge of standardizing all corporate terms both for internal and external use. As a result, they create multiple dictionaries for different scopes that are not interconnected. An employee who aims at looking up a certain term needs to know which dictionary to use in that very context, as well as where to retrieve the currently approved version of it. The latter may not always be the case, especially for new employees. The former applies to all employees, since it might be unclear which dictionary should be used, resulting in a complicated look-up procedure or, worse, the abandonment of the search altogether.

Table 5.1 Overview of Data Integration Challenges Occurring in Large Enterprises

Information Integration Domain    Current State                        Linked Data Benefit
Enterprise Taxonomies             Proprietary, centralized, no         Open standards (e.g., SKOS),
                                  relationships between terms,         distributed, hierarchical,
                                  multiple independent                 multilingual, reusable in other
                                  terminologies (dictionaries)         scenarios
XML Schema Governance             Multitude of XML schemas,            Relationships between entities from
                                  no integrated documentation          different schemas, tracking/
                                                                       documentation of XML schema evolution
Wikis                             Text-based wikis for teams or        Reuse of (structured) information via
                                  internal-use encyclopedias           data wikis (by other applications),
                                                                       interlinking with other data sources,
                                                                       for example, taxonomies
Web Portal and Intranet Search    Keyword search over textual          Sophisticated search mechanisms
                                  content                              employing implicit knowledge from
                                                                       different data sources
Database Integration              Data warehouses, schema              Lightweight data integration through
                                  mediation, query federation          RDF layer
Enterprise Single Sign-On         Consolidated user credentials,       No passwords, more sophisticated
                                  centralized SSO                      access control mechanisms (arbitrary
                                                                       metadata attached to identities)

Figure 5.2 Linked Data life cycle supports four crucial data integration challenges arising in enterprise environments. Each of the challenges can relate to more than one life cycle stage. (Life cycle stages shown: extraction; storage/querying; manual revision/authoring; interlinking/fusing; classification/enrichment; quality analysis; evolution/repair; search/browsing/exploration.)

As a result, the main challenge in the area of enterprise taxonomies is the defragmentation of term definitions without centralizing taxonomy management. We propose to represent enterprise taxonomies in RDF employing the standardized and widely used SKOS [17] vocabulary, as well as publishing term definitions via the Linked Data principles.
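As a minimal sketch of this proposal, the snippet below uses the rdflib library and a hypothetical corporate namespace to express two taxonomy terms and their hierarchy in SKOS; the term names and URIs are illustrative assumptions.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, SKOS

    # Hypothetical corporate namespace for taxonomy terms.
    EX = Namespace("http://example-corp.com/taxonomy/")

    g = Graph()
    g.bind("skos", SKOS)

    vehicle = EX["Vehicle"]
    ecar = EX["ElectricCar"]

    g.add((vehicle, RDF.type, SKOS.Concept))
    g.add((vehicle, SKOS.prefLabel, Literal("Vehicle", lang="en")))
    g.add((vehicle, SKOS.prefLabel, Literal("Fahrzeug", lang="de")))  # multilingual labels

    g.add((ecar, RDF.type, SKOS.Concept))
    g.add((ecar, SKOS.prefLabel, Literal("Electric car", lang="en")))
    g.add((ecar, SKOS.broader, vehicle))  # hierarchical relation between terms

    print(g.serialize(format="turtle"))

Such term definitions can then be published as dereferenceable Linked Data resources instead of being locked into separate dictionaries.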

                XML Schema Governance.

The majority of enterprises to date use XML for message exchange, data integration, publishing, and storage, often in the form of Web Services and XML databases. To be able to process XML documents efficiently, it needs to be known what kind of data can be expected in them. For this purpose, an XML schema should be present for each XML format used. XML schemas describe the allowed structure of an XML document.

Several languages exist for defining XML

                schemas: the oldest and the simplest DTD [3], the popular XML Schema [31], the increasingly used Relax NG [4], and the rule-based Schematron [12]. In a typical enterprise, there are hundreds or even thousands of XML schemas in use, each possibly written in a different XML schema language.

                Moreover, as the enterprise and its surrounding environment evolve, the schemas need to adapt. Therefore, new versions of schemas are created, resulting in a proliferation of XML schemas. XML schema governance now is the process of bringing order into the large number of XML schemas being generated and used within large organizations. The sheer number of IT systems deployed in large enterprises that make use of the XML tech- nology bear a challenge in bootstrapping and maintaining an XML schema

                

              repository . In order to create such a repository, a bridge between XML sche-

                mata and RDF needs to be established. This requires in the first place the identification of XML schema resources and the respective entities that are defined by them. Some useful information can be extracted automatically from XML schema definitions that are available in a machine-readable for- mat, such as XML schemas and DTDs. While this is probably given for systems that employ XML for information exchange, it may not always be the case in proprietary software systems that employ XML only for data storage. In the latter case as well as for maintaining additional metadata (such as responsible department, deployed IT systems, etc.), a substantial amount of manual work is required. In a second step, the identified schema metadata needs to be represented in RDF on a fine-grained level. The chal- lenge here is the development of an ontology, which not only allows for the annotation of XML schemas, but also enables domain experts to establish

                

              semantic relationships between schemas. Another important challenge is to

                develop methods for capturing and describing the evolution of XML sche- mata, since IT systems change over time and those revisions need to be aligned with the remaining schemas.
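As a sketch of how such a repository could be bootstrapped, the snippet below extracts the globally declared element names from an XML Schema file with Python's standard library and records them as simple RDF-style statements. The vocabulary terms (ex:XMLSchema, ex:definesElement) are hypothetical placeholders for the schema-governance ontology that would have to be developed.

    import xml.etree.ElementTree as ET

    XSD_NS = "{http://www.w3.org/2001/XMLSchema}"

    def describe_schema(xsd_path, schema_uri):
        """Emit (subject, predicate, object) statements about an XSD file.

        The predicates use a hypothetical ex: vocabulary; a real deployment
        would replace them with the governance ontology discussed above."""
        root = ET.parse(xsd_path).getroot()
        triples = [(schema_uri, "rdf:type", "ex:XMLSchema")]
        for element in root.findall(f"{XSD_NS}element"):  # global declarations
            name = element.get("name")
            if name:
                triples.append((schema_uri, "ex:definesElement", name))
        return triples

    # Example usage with a hypothetical schema file:
    # for t in describe_schema("order.xsd", "http://example-corp.com/schema/order"):
    #     print(t)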

                

              Wikis. These have become increasingly common through the last years

                reaching from small personal wikis to the largest Internet encyclopedia Wikipedia. The same applies for the use of wikis in enterprises [16] too. In addition to traditional wikis, there is another category of wikis, which are called semantic wikis. These can again be divided into two categories: seman- tic text wikis and semantic data wikis. Wikis of this kind are not yet com- monly used in enterprises, but crucial for enterprise data integration since they make (at least some of) the information contained in a wiki machine- accessible. Text-based semantic wikis are conventional wikis (where text is still the main content type), which allow users to add some semantic annota- tions to the texts (e.g., typed links). The semantically enriched content can then be used within the wiki itself (e.g., for dynamically created wiki pages) or can be queried, when the structured data are stored in a separate data store. An example is Semantic MediaWiki [14] and its enterprise counterpart SMW+ [25]. Since wikis in large enterprises are still a quite new phenom-

enon, this challenge should

                relatively be easy to tackle. A challenge, however, is to train the users of such wikis to actually create semantically enriched information. For example, the value of a fact can either be represented as a plain literal or as a relation to another information resource. In the latter case, the target of the relation can be identified either by a newly generated URI or one that was introduced before (eventually already attached with some metadata). The more the users are urged to reuse information wherever appropriate, the more all the par- ticipants can benefit from the data. It should be part of the  design of the wiki application (especially the user interface), to make it easy for users to build quality knowledge bases (e.g., through autosuggestion of URIs within authoring widgets). Data in RDF are represented in the form of simple state- ments, information that naturally is intended to be stored in conjunction (e.g., geographical coordinates) is not visible as such per se. The same applies for information which users are accustomed to edit in a certain order (e.g., address data). A nonrational editing workflow, where the end-users are confronted with a random list of property values, may result in invalid or incomplete information. The challenge here is to develop a choreography of

                

              authoring widgets in order to provide users with a more logical editing work-

                flow. Another defiance to tackle is to make the deployed wiki systems avail- able to as many stakeholders as possible (i.e., cross department boundaries) to allow for an improved information reuse. Once Linked Data resources and potentially attached information are reused (e.g., by importing such data), it becomes crucial to keep them in synchronization with the original source. Therefore, mechanisms for syndication (i.e., propagation of changes) and synchronization need to be developed, both for intra- and extranet seman- tic wiki resources. Finally, it is also necessary to consider access control in this context. Semantic representations contain implicit information, which can be revealed by inferencing and reasoning. A challenge is to develop and deploy scalable access control mechanisms, which are aligned with existing access control policies in the enterprise and which are safe with regard to the hijacking of ontologies [10].

                

              Web Portal and Intranet Search. The biggest problem with enterprise

                intranets today is the huge difference in user experience when compared to the Internet [18]. When using the Internet, the user is spoiled by mod- ern technologies from, for example, Google or Facebook, which provide very comfortable environments, precise search results, auto-complete text boxes, etc. These technologies are made possible through large amounts of resources invested in providing comfort for the millions of users, custom- ers, beta testers, and by their large development team and also by the huge number of documents available, which increases the chances that a user will find what he is looking for. In contrast, in most enterprises, the intranet expe- rience is often poor because the intranet uses technologies from the previ- ous millennium. In order to implement search systems that are based on a Linked Data approach and that provide a substantial benefit in comparison

with the current keyword-based search, the creation of an initial

              set of high-quality RDF datasources needs to be tackled first. For example, as a

                prerequisite for linking documents to terms a hierarchical taxonomy should be created (see “Challenges in Data Integration for Large Enterprises”). Mechanisms then need to be established to automatically create high-quality links between documents and an initial set of terms (e.g., by crawling), since it is not feasible to manually link the massive amount of available documents. Furthermore, the process of semi-automatic linking of (a) terms that occur in documents but are not part of the taxonomy yet (as well as their placement in the taxonomy) and (b) terms that do not occur in documents but are related and thus useful in a search needs to be investigated and suitable tools should be developed to support responsible employees. To provide results beyond those that can be obtained from text-based documents directly, other data sets need to be transformed to RDF and queried. Finally, although a search engine that queries RDF data directly works, it results in suboptimal per- formance. The challenge here is to develop methods for improving perfor- mance to match traditional search engines, while keeping the advantages of using SPARQL directly. In an enterprise there exist at least two distinct areas where search technology needs to be applied. On the one hand, there is cor- porate internal search, which enables employees to find relevant information required for their work. On the other hand, all large enterprises need at least simple search capabilities on their public web portal(s), since otherwise the huge amounts of information provided may not be reachable for potential customers. Some dedicated companies (e.g., automotive companies) would actually have a need for more sophisticated query capabilities, since the com- plexity of offered products is very high. Nevertheless, in reality, search, both internal and external, is often solely based on keyword matching. We argue that by employing the Linked Data paradigm in enterprises the classical key- word-based search can be enhanced. Additionally, more sophisticated search mechanisms can be easily realized since more information is available in a uniform and machine-processable format.
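To illustrate how such a taxonomy-backed search can exploit implicit knowledge, the query below (held in a plain Python string) retrieves documents annotated with a search concept or with any concept transitively narrower than it. The dct:subject annotation property, the search label, and the endpoint are assumptions for the sketch, not part of a concrete system described in this chapter.

    # Hypothetical taxonomy-aware intranet search: documents annotated with the
    # concept labeled "Battery" or with any of its (transitively) narrower concepts.
    search_query = """
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX dct:  <http://purl.org/dc/terms/>

    SELECT DISTINCT ?doc ?title WHERE {
      ?concept skos:prefLabel "Battery"@en .
      ?doc dct:subject ?hit ;
           dct:title   ?title .
      ?hit skos:broader* ?concept .   # include all narrower concepts
    }
    """
    # The string can be sent to any SPARQL endpoint, for example via an HTTP POST
    # or a client library such as SPARQLWrapper.
    print(search_query)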

                

              Database Integration. Relational Database Management Systems (RDBMS)

                are the predominant mode of data storage in the enterprise context. RDBMS are used practically everywhere in the enterprise, serving, for example, in computer-aided manufacturing, enterprise resource planning, supply chain management, and content management systems. We, therefore, deem the inte- gration of relation data into Linked Data a crucial Enterprise Data Integration * technique. A primary concern when integrating relational data is the scalability

                

              and query performance . With our R2RML-based tool SparqlMap, we show that

                an efficient query translation is possible, thus avoiding the higher deployment costs associated with the data duplication inherent in ETL approaches. The challenge of closing the gap between triple stores and relational databases is also present in SPARQL-to-SQL mappers and drives research. A second chal- lenge for mapping relational data into RDF is a current lack of best practices


                and tool support for mapping creation. The standardization of the RDB to RDF

                

              Mapping Language (R2RML) by the W3C RDB2RDF Working Group establishes

                a common ground for an interoperable ecosystem of tools. However, there is a lack of mature tools for the creation and application of R2RML mappings. The challenge lies in the creation of user-friendly interfaces and in the estab- lishment of best practices for creating, integrating, and maintaining those mappings. Finally, for a read–write integration updates on the mapped data need to be propagated back into the underlying RDBMS. An initial solution is presented in [5]. In the context of enterprise data, an integration with granu-

                

              lar access control mechanisms is of vital importance. Consequently, semantic

                wikis, query federation tools, and interlinking tools can work with the data of relation databases. The usage of SPARQL 1.1 query federation [26] allows rela- tional databases to be integrated into query federation systems with queries spanning over multiple databases. This federation allows, for example, portals, which in combination with an EKB provide an integrated view on enterprise data.
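For concreteness, the following minimal R2RML mapping (held in a Python string so that it could be handed to an R2RML processor) exposes a hypothetical EMPLOYEE table as RDF. The table name, columns, and the ex: vocabulary are illustrative assumptions, not an existing corporate schema.

    # Minimal R2RML mapping for a hypothetical EMPLOYEE table.
    employee_mapping = """
    @prefix rr: <http://www.w3.org/ns/r2rml#> .
    @prefix ex: <http://example-corp.com/ontology/> .

    <#EmployeeMap>
        rr:logicalTable [ rr:tableName "EMPLOYEE" ] ;
        rr:subjectMap [
            rr:template "http://example-corp.com/employee/{EMP_ID}" ;
            rr:class ex:Employee
        ] ;
        rr:predicateObjectMap [
            rr:predicate ex:name ;
            rr:objectMap [ rr:column "FULL_NAME" ]
        ] ;
        rr:predicateObjectMap [
            rr:predicate ex:department ;
            rr:objectMap [ rr:column "DEPT" ]
        ] .
    """
    print(employee_mapping)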

                

              Enterprise Single Sign-On. As a result of the large number of deployed soft-

                ware applications in large enterprises, which are increasingly web-based,

                

              single sign-on (SSO) solutions are of crucial importance. A Linked Data-based

                approach aimed at tackling the SSO problem is WebID [30]. In order to deploy a WebID-based SSO solution in large enterprises, a first challenge is to transfer

                

              user identities to the Enterprise Data Web. Those Linked Data identities need to

                be enriched and interlinked with further background knowledge, while main- taining privacy. Thus, mechanisms need to be developed to assure that only such information is publicly (i.e., public inside the corporation) available, that is required for the authentication protocol. Another challenge that arises is related to user management. With WebID a distributed management of identi- ties is feasible (e.g., on department level), while those identities could still be used throughout the company. Though this reduces the likeliness of a single point of failure, it would require the introduction of mechanisms to ensure that company-wide policies are enforced. Distributed group management and authorization is already a research topic (e.g., dgFOAF [27]) in the area of social networks. However, requirements that are gathered from distributed social network use-cases differ from those captured from enterprise use-cases. Thus, social network solutions need a critical inspection in the enterprise context.

              Linked Data Paradigm for Integrating Enterprise Data

                Addressing the challenges from the previous section leads to the creation of a number of knowledge bases that populate a data intranet. Still, for this intranet to abide by the vision of Linked Data while serving the purpose of companies, we need to increase its coherence and establish links between

the knowledge bases it contains. Applications that build on several knowledge bases

                usually integrate them into a unified view by the means of the extract-trans- form-load (ETL) paradigm [13]. For example, IBM’s DeepQA framework [8] combines knowledge from DBpedia,* Freebase, and several other knowledge bases to determine the answer to questions with a speed superior to that of human champions. A similar view to data integration can be taken within the Linked Data paradigm with the main difference that the load step can be discarded when the knowledge bases are not meant to be fused, which is mostly the case. While the extraction was addressed above, the transfor- mation remains a complex challenge and has currently not yet been much addressed in the enterprise context. The specification of this integration pro- cesses for Linked Data is rendered tedious by several factors, including

                1. A great number of knowledge bases (scalability) as well as

2. Schema mismatches and heterogeneous conventions for property values across knowledge bases (discrepancy)

Similar issues are found in the Linked Open Data (LOD) Cloud, which consists of more than 30 billion triples distributed across more than 250 knowledge bases. In the following, we will use the Linked Open Data Cloud as a reference implementation of the Linked Data principles and present semi-automatic means that aim to ensure high-quality Linked Data Integration.

                The scalability of Linked Data Integration has been addressed in manifold previous works on link discovery. Especially, Link Discovery frameworks such as LIMES [21–23] as well as time-efficient algorithms such as PPJoin+ [34] have been designed to address this challenge. Yet, none of these manifold approaches provides theoretical guarantees with respect to their performance. Thus, so far, it was impossible to predict how Link Discovery frameworks would perform with respect to time or space requirements. Consequently, the deployment of techniques such as customized memory management [2] or time-optimization strategies [32] (e.g., automated scaling for cloud computing when provided with very complex linking tasks) was rendered very demanding if not impossible. A novel approach that addresses these drawbacks is the HR 3 algorithm [20]. Similar to the HYPPO algorithm [22] (on whose formalism it is based), HR 3 assumes that the property values that are to be compared are expressed in an affine space with a Minkowski distance. Consequently, it can be most naturally used to process the portion of link specifications that compare numeric values (e.g., temperatures, elevations, populations, etc.). HR 3 goes beyond the state of the art by being able to carry out Link Discovery tasks with any achievable reduc-

                

              tion ratio [6]. This theoretical guarantee is of practical importance, as it does

                not only allow our approach to be more time-efficient than the state of the art



                but also lays the foundation for the implementation of customized memory management and time-optimization strategies for Link Discovery.

                The difficulties behind the integration of Linked Data are not only caused by the mere growth of the data sets in the Linked Data Web, but also by large number of discrepancies across these data sets. In particular, ontology mis- matches [7] affect mostly the extraction step of the ETL process. They occur when different classes or properties are used in the source knowledge bases to express equivalent knowledge (with respect to the extraction process at hand). For example, while Sider * uses the class sider:side_effects to represent diseases that can occur as a side effect of the intake of certain medication, the more generic knowledge base DBpedia uses dbpedia:Disease. Such a mis- match can lead to a knowledge base that integrates DBpedia and Sider contain- ing duplicate classes. The same type of mismatch also occurs at the property level. For example, while Eunis uses the property eunis:binomialName to represent the labels of species, DBpedia uses rdfs:label. Thus, even if the extraction problem was resolved at class level, integrating Eunis and DBpedia would still lead to the undesirable constellation of an integrated knowledge base where instances of species would have two properties that serve as labels. The second category of common mismatches mostly affects the transformation step of ETL and lies in the different conventions used for equivalent property values. For example, the labels of films in DBpedia differ from the labels of films in LinkedMDB in three ways: First, they contain a language tag. Second, the extension “(film)” if another entity with the same label exists. Third, if another film with the same label exists, the production year of the film is added. Consequently, the film Liberty from 1929 has the label “Liberty (1929 film)@en” in DBpedia, while the same film bears the label “Liberty” in LinkedMDB. A similar discrepancy in naming persons holds for film directors (e.g., John Frankenheimer (DBpedia: John Frankenheimer@ en, LinkedMDB: John Frankenheimer (Director)) and John Ford (DBpedia: John Ford@en, LinkedMDB: John Ford (Director))) and actors. Finding a conform representation of the labels of movies that maps the LinkedMDB representation would require knowing the rules replace(“@en”, ε) and replace(“(*film)”,

ε), where ε stands for the empty string.
* http://sideeffects.embl.de/
† http://eunis.eea.europa.eu/
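The two replace rules above can be made concrete with a short Python sketch that applies them to a DBpedia-style film label so that it matches its LinkedMDB counterpart; the rule notation is simplified to plain regular-expression replacements, which is an assumption of this sketch rather than the notation used by any particular tool.

    import re

    # Conformation rules derived from the DBpedia vs. LinkedMDB example:
    # drop the language tag and the "(... film)" extension.
    rules = [
        (re.compile(r"@en$"), ""),              # replace("@en", eps)
        (re.compile(r"\s*\([^)]*film\)"), ""),  # replace("(*film)", eps)
    ]

    def conform(label):
        for pattern, replacement in rules:
            label = pattern.sub(replacement, label)
        return label.strip()

    dbpedia_label = "Liberty (1929 film)@en"
    linkedmdb_label = "Liberty"
    assert conform(dbpedia_label) == linkedmdb_label
    print(conform(dbpedia_label))   # -> "Liberty"

Director labels such as "John Frankenheimer (Director)" would require analogous rules on the LinkedMDB side, which is exactly the kind of transformation knowledge that must be learned or specified.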

              Runtime Complexity

The development of scalable algorithms for link discovery is of crucial importance to address the Big Data problems that enterprises are increasingly faced with. While the variety of the data is addressed by the extraction


                processes presented in the sections above, the mere volume of the data makes it necessary to have single linking tasks carried out as efficiently as possible. Moreover, the velocity of the data requires that link discovery is carried out on a regular basis. These requirements were the basis for the development 3 of HR [20], the first reduction-ratio-optimal link discovery algorithm. In the following, we present and evaluate this approach.

                Preliminaries

                In this section, we present the preliminaries necessary to understand the subsequent parts of this section. In particular, we define the problem of Link Discovery, the reduction ratio, and the relative reduction ratio for- mally as well as give an overview of space tiling for Link Discovery. The 3 subsequent description of HR relies partly on the notation presented in this section.

                Link Discovery.

                The goal of Link Discovery is to compute the set of pair of instances (s, t) ∈ S × T that are related by a relation R, where S and T are two not necessarily distinct sets of instances. One way of automating this discov- ery is to compare sS and t T based on their properties using a distance measure. Two entities are then considered to be linked via R if their distance is less or equal to a threshold θ [23].

                Definition 1: Link Discovery on Distances Given two sets S and T of instances

                , a distance measure δ over the properties of

                s S

                and t T and a distance threshold θ [0, [, the goal of Link Discovery is to

                compute the set

                M = {(s, t, δ(s, t)): s S t T δ(s, t) ≤ θ} Note that in this paper, we are only interested in lossless solutions, that is, solutions that are able to find all pairs that abide by the definition given above.

                Reduction Ratio

                . A brute-force approach to Link Discovery would execute a Link Discovery task on S and T by carrying out |S||T| comparisons. One of the key ideas behind time-efficient Link Discovery algorithms A is to reduce the number of comparisons that are effectively carried out to a num- ber C(A) < |S||T| [29]. The reduction ratio RR of an algorithm A is given by

                A = C ( ) A

                RR ( ) 1 .

                − (5.1)

                | S T |

                RR

                (A) captures how much of the Cartesian product |S||T| was not explored before the output of A was reached. It is obvious that even an optimal loss- less solution which performs only the necessary comparisons cannot achieve an RR of 1. Let C min be the minimal number of comparisons necessary to complete the Link Discovery task without losing recall, that is, C min = |M

                Big Data Computing

                minimal number of comparisons that was carried out by the algorithm A before it terminated. Formally, 1 ( C min /| S T |) | S T | C min

                − − A ( ) .

                RRR = =

                (5.2) A A 1 ( ( )/| C S T |) | S T | C ( )

                − −

                RRR

                (A) indicates how close A is to the optimal solution with respect to the number of candidates it tests. Given that C(A) ≥ C min , RRR(A) ≥ 1. Note that the larger the value of RRR(A), the poorer the performance of A with respect to the task at hand.

                The main observation that led to this work is that while most algorithms aim to optimize their RR (and consequently their RRR), current approaches to Link Discovery do not provide any guarantee with respect to the RR (and con- sequently the RRR) that they can achieve. In this work, we present an approach to Link Discovery in metric spaces whose RRR is guaranteed to converge to 1. 3 Space Tiling for Link Discovery. Our approach, HR , builds upon the same formalism on which the HYPPO algorithm relies, that is, space tiling. HYPPO addresses the problem of efficiently mapping instance pairs (s, t) ∈ S × T described by using exclusively numeric values in an n-dimensional metric space and has been shown to outperform the state of the art in the previ- * ous work [22]. The observation behind space tiling is that in spaces (Ω, δ ) with orthogonal (i.e., uncorrelated) dimensions, common metrics for

                Link Discovery can be decomposed into the combination of functions ϕ i,i {1. . .n} , which operate on exactly one dimension of Ω: δ = f 1 ,. . .,ϕ ). For n Minkowski distances of order p, ϕ i (x, ω) = |x − ω | for all values of i and p n p p i i ω = .

                δ ( , ) x φ ( , ) x ω ii = 1 A direct consequence of this observation is the inequality ϕ (x,ω) ≤ δ i

                (x, ω). The basic insight into this observation is that the hypersphere H(ω, θ

                ) = {x ∈ Ω:δ (x,  ω) ≤θ} is a subset of the hypercube V defined as V(ω, θ

                ) = {x ∈ Ω: ∀i ∈ {1. . .n}, ϕ (x , ω ) ≤ θ. Consequently, one can reduce the num- i i i ber of comparisons necessary to detect all elements of H(ω, θ ) by discarding all elements that are not in V (ω, θ) as nonmatches. Let Δ = θ/α, where α ∈ℕ is the granularity parameter that controls how fine-grained the space tiling should be (see Figure 5.3 for an example). We first tile Ω into the adjacent hypercubes (short: cubes) C that contain all the points ω such that n

                ωi { ... }, n c

                1 i i ( c i 1 ) with ( ,..., ) c 1 c n . (5.3) ∀ ∈ ∆ ≤ < ∆ ∈ +

                We call the vector (c , . . ., c ) the coordinates of the cube C. Each point 1 n *

                ω ω ∈ Ω lies in the cube C(ω) with coordinates ( / ) = . Given such a space   1... i i n

                

              Note that in all cases, a space transformation exists that can map a space with correlated

                Linked Data in Enterprise Integration (a) (b) (c)

                

              θ θ θ

              Figure 5.3

              Space tiling for different values of α. The colored squares show the set of elements that must be

              compared with the instance located at the black dot. The points within the circle lie within the

              distance θ of the black dot. Note that higher values of α lead to a better approximation of the

              hypersphere but also to more hypercubes.

                tiling, it is obvious that V(ω,θ ) consists of the union of the cubes such that ω α

                i { , 1 … , } :| n c c ( ) | i i .

                ∀ ∈ − ≤ Like most of the current algorithms for Link Discovery, space tiling does not provide optimal performance guarantees. The main goal of this paper is to build upon the tiling idea so as to develop an algorithm that can achieve 3 any possible RR. In the following, we present such an algorithm, HR . 3 The HR algorithm 3 The goal of the HR algorithm is to efficiently map instance pairs (s, t) ∈ S × T that are described by using exclusively numeric values in an n-dimensional metric space where the distances are measured by using any Minkowski 3 distance of order p ≥ 2. To achieve this goal, HR relies on a novel indexing

                scheme

                that allows achieving any RRR greater than or equal to than 1. In the following, we first present our new indexing scheme and show that we can discard more hypercubes than simple space tiling for all granularities p p

                α such that n(α − 1) > α . We then prove that by these means, our approach can achieve any RRR greater than 1, therewith proving the optimality of our

                indexing scheme with respect to RRR.

                Indexing Scheme

                Let ω ∈ Ω = ST be an arbitrary reference point. Furthermore, let δ be the Minkowski distance of order p. We define the index function as follows:

                 ∃ − ≤ ∈ if :| ( ) | ω 1 with { ,..., },

                1

                i c i c i i n

                 n = index ( , ) C ωp

                − −

                ω

                (| c i c ( ω ) | ) i 1 else, 

                ∑

                 = i 1

                Big Data Computing

                where C is a hypercube resulting from a space tiling and ω ∈ Ω. Figure 5.4 shows an example of such indices for p = 2 with α = 2 (Figure 5.4a) and α = 4 (Figure 5.4b).

                Note that the highlighted square with index 0 contains the reference point ω . Also note that our indexing scheme is symmetric with respect to C(ω). Thus, it is sufficient to prove the subsequent lemmas for hypercubes C such that c > c(ω) . In Figure 5.4, it is the upper right portion of the indexed space i i with the gray background. Finally, note that the maximal index that a hyper- p cube can achieve is n1) as max|cc (ω)| = α per construction of H(ω, θ). i i

                The indexing scheme proposed above guarantees the following:

                Lemma 1 ω ω p p δ

                → ∀ ∈ ∀ ∈ > ∆ Index C ( , ) x s C ( ), t C , ( , ) s t x . =

                Proof This lemma is a direct implication of the construction of the index.

                Index(C,ω) = x implies that n pω − =

                ( c i c ( ) i 1 ) x .

                ∑ i = 1 Now given the definition of the coordinates of a cube (Equation (5.3)), the

                following holds:

                

              ω ω

              s C ( ), t C , | s i t i | (| c i c ( ) | ) . i

                ∀ ∈ ∀ ∈ − ≥ − − 1 ∆ Consequently, n n p p p

                ω ω

              s C ( ), t C , | s i t i | (| c i c ( ) | ) i 1 ∆ .

                ∀ ∈ ∀ ∈ − ≥ − −

                

              ∑ ∑

              i

              = =

              1 i

              1 By applying the definition of the Minkowski distance of the index func- p p

              ω δ

              s C ( ), t C , ( , ) s t x

                tion, we finally obtained ∀ ∈ ∀ ∈ > Note that given that ω ∈ C(ω), the following also holds: p p

                

              ω δ ω

                = → ∀ ∈ > ∆ index( , ) C x t C : ( , ) t x . (5.5)

                Approach 3 The main insight behind HR is that in spaces with Minkowski distances,

                Linked Data in Enterprise Integration

                

              1

                1

                2

                1

                5

                10

                4

                9

                

              16

                

              9

                

              4

                

              1

                1

                

              4

                

              9

                17

                10

                20

                13

                5

                2

                8

                5

                1

                1

                4

                1

                9

                13 (a) (b)

                25

                20

                18

                13

                10

                8

                5

                9

                4

                10

                2

                4

                5

                13

                8

                13

                18

                17

                10

                16

                9

                5

                4

                2

                2

                1

                2

                1

                1

                2

                1

                1

                1

                2

                2

                1

                1

                1

                1

                2

                1

                1

                1

                2

                1

                1

                1

                1

                1

                1

                1

                4

                10

                5

                5

                10

                8

                13

                25

                18

                32

                25

                13

                20

                2

                17

                9

                16

                9

                10

                16

                17

                13

                20

                18

                25

                Figure 5.4

              Space tiling and resulting index for a two-dimensional example. Note that the index in both

              subfigures was generated for exactly the same portion of space. The black dot stands for the

              position of ω.

                Big Data Computing

                dismissing correct matches) discard more hypercubes than when using sim- ple space tiling. More specifically,

                Lemma 2 p s S index C s : ( , ) α implies that all tC are nonmatches.

                ∀ ∈ >

                Proof

                This lemma follows directly from Lemma 1 as p p p p p

                α δ α θ

                > → ∀ ∈ > ∆ = (5.6) index( , ) C s t C , ( , ) s t . For the purpose of illustration, let us consider the example of α = 4 and p = 2 in the two-dimensional case displayed in Figure 5.4b. Lemma 2 implies that any point contained in a hypercube C with index 18 cannot contain any ele- 18 ment t such that δ(s, t) ≤ θ. While space tiling would discard all black cubes in 3 Figure 5.4b but include the elements of C as candidates, HR discards them 18 and still computes exactly the same results, yet with a better (i.e., smaller) RRR. p p

                One of the direct consequences of Lemma 2 is that n(α − 1) > α is a neces- 3 sary and sufficient condition for HR to achieve a better RRR than simple space tiling. This is simply due to the fact that the largest index that can be n p p p p

                ∑

              α − = α

                assigned to a hypercube is = ( i 1 1 ) n ( 1 ) . Now, if n(α − 1) > α , then this cube can be discarded. For p = 2 and n = 2, for example, this condition is satisfied for α ≥ 4. Knowing this inequality is of great importance when decid- 3 ing on when to use HR as discussed in the “Evaluation” section. α ω α p

                H = ω C index C

                Let ( , ) { : ( , ) ≤ } . H(α, ω) is the approximation of the 3 hypersphere H(ω) = {ω′:δ (ω,ω′) ≤ θ} generated by HR . We define the volume of H(α, ω) as p

                H α ω H α ω ∆ =

              V ( ( , )) | ( , )| .

                (5.7) To show that given any r > 1, the approximation H(α, ω) can always achieve 3 an RRR(HR ) ≤ r, we need to show the following.

                Lemma 3 3

                lim RRR HR ( , ) α 1 . α →∞ =

                Proof

              3

              The cubes that are not discarded by HR (α) are those for which (|cc (ω)| − 1) p p i i

                Linked Data in Enterprise Integration

                → is exactly C min , which leads to the conclusion lim ( , ) α α →∞

                Finally, Δ → 0 when α → leads to (| | ) | | . x x i i p p i n i i p p i n

                − − ≤ ∧ → ∞ →

                − ≤

                = = ∑ ∑

                ω θ α ω θ1 1

                (5.10) This is exactly the condition for linking specified in Definition 1 applied to Minkowski distances of order p. Consequently, H(ω,∞) is exactly H(ω, θ) for any θ. Thus, the number of comparisons carried out by HR 3 (α) when α

                =

                ∆ ∆

                3 1.

                Our conclusion is illustrated in Figure 5.5, which shows the approxima- tions computed by HR 3 for different values of α with p = 2 and n = 2. The higher the α, the closer the approximation is to a circle. Note that these results allow one to conclude that for any RRR-value r larger than 1, there is a setting of HR 3 that can compute links with a RRR smaller or equal to r.

                evaluation

                In this section, we present the data and hardware we used to evaluate our approach. Thereafter, we present and discuss our results.

                Experimental Setup

                We carried out four experiments to compare HR 3 with LIMES 0.5’s HYPPO and SILK 2.5.1. In the first and second experiments, we aimed to deduplicate DBpedia places by comparing their names (rdfs:label), minimum elevation, elevation, and maximum elevation. We retrieved 2988 entities that possessed all four properties. We use the Euclidean metric on the last three values with the thresholds 49 and 99 m for the first and second experiments, respectively.

                ∆ 1 1 (5.9)

                ∑ ∑ ω α ω θ

                being single points. Each cube C thus contains a single point x with coordi- nates x i = c i Δ

                ≤

                . Especially, c i (ω) = ω. Consequently, (| ( )| )

                | | . c c

                x i i p p i n i i p i n p

                − − ≤ ↔

                − −  

                 

                = = ∑ ∑

                ≤ ↔ − − ≤ = =

                ω α ω α

                1 1 1

                ∆ (5.8)

                Given that θ = Δα, we obtain | | (| | ) .

                x x i i p i n p i i p p i n

                − −  

                 

                The third and fourth experiments aimed to discover links between Geonames

                Big Data Computing Figure 5.5 3 HR

                

              Approximation generated by for different values of α. The white squares are selected,

              whilst the colored ones are discarded. (a) α = 4, (b) α = 8, (c) α = 10, (d) α = 25, (e) α = 50, and

              (f) α = 100.

                and latitude of the instances. This experiment was of considerably larger scale than the first one, as we compared 74,458 entities in Geonames with 50,031 entities from LinkedGeoData. Again, we measured the runtime necessary to compare the numeric values when comparing them by using the Euclidean metric. We set the distance thresholds to 1 and 9° in experiments 3 and 4, respectively. We ran all experiments on the same Windows 7 Enterprise 64-bit computer with a 2.8 GHz i7 processor with 8 GB RAM. The JVM was allocated

                7 GB RAM to ensure that the runtimes were not influenced by swapping. Only one of the kernels of the processors was used. Furthermore, we ran each of the experiments three times and report the best runtimes in the following.

                Results

                We first measured the number of comparisons required by HYPPO and 3 HR to complete the tasks at hand (see Figure 5.6). Note that we could not carry out this section of evaluation for SILK 2.5.1 as it would have required 3 altering the code of the framework. In the experiments 1, 3, and 4, HR can reduce the overhead in comparisons (i.e., the number of unnecessary compar- isons divided by the number of necessary comparisons) from approximately 24% for HYPPO to approximately 6% (granularity = 32). In experiment 2, the overhead is reduced from 4.1 to 2%. This difference in overhead reduction

                Linked Data in Enterprise Integration

                having a radius between 49 and 99 m. Thus, running the algorithms with a threshold of 99 m led to only a small a priori overhead and HYPPO per- 3 forming remarkably well. Still, even on such data distributions, HR was able to discard even more data and to reduce the number of unnecessary computations by more than 50% relative. In the best case (experiment 4, 3 6

                α = 32, see Figure 5.6d), HR required approximately 4.13 × 10 less compari- sons than HYPPO for α = 32. Even for the smallest setting (experiment 1, see 3 6 Figure 5.6a), HR still required 0.64 × 10 less comparisons. 3 We also measured the runtimes of SILK, HYPPO, and HR . The best run- times of the three algorithms for each of the tasks is reported in Figure 5.7.

                Note that SILK’s runtimes were measured without the indexing time, as the data fetching and indexing are merged to one process in SILK. Also note that in the second experiment, SILK did not terminate due to higher memory requirements. We approximated SILK’s runtime by extrapolating approxi- mately 11 min it required for 8.6% of the computation before the RAM was filled. Again, we did not consider the indexing time.

                Because of the considerable difference in runtime (approximately two orders of magnitude) between HYPPO and SILK, we report solely HYPPO 3 3 and HR ‘s runtimes in the detailed runtimes Figures 5.8a,b. Overall, HR outperformed the other two approaches in all experiments, especially for α

                = 4. It is important to note that the improvement in runtime increases with 3 the complexity of the experiment. For example, while HR outperforms HYPPO by 3% in the second experiment, the difference grows to more than 7% in the fourth experiment. In addition, the improvement in runtime aug- ments with the threshold. This can be seen in the third and fourth experi- 3 ments. While HR is less than 2% faster in the third experiment, it is more 3 than 7% faster when θ = 4 in the fourth experiment. As expected, HR is slower than HYPPO for α < 4 as it carries out exactly the same comparisons but still has the overhead of computing the index. Yet, given that we know that 3 p p

                HR is only better when n(α − 1) > α , our implementation only carries out the indexing when this inequality holds. By these means, we can ensure that 3 HR is only used when it is able to discard hypercubes that HYPPO would not discard, therewith reaching superior runtimes both with small and large values α. Note that the difference between the improvement of the number 3 of comparisons necessitated by HR and the improvement in runtime over 3 all experiments is due to the supplementary indexing step required by HR . 3 Finally, we measured the RRR of both HR and HYPPO (see Figures 5.8c and d). In the two-dimensional experiments 3 and 4, HYPPO achieves an RRR 3 close to 1. Yet, it is still outperformed by HR as expected. A larger difference 3 between the RRR of HR and HYPPO can be seen in the three-dimensional experiments, where the RRR of both algorithms diverge significantly. Note that the RRR difference grows not only with the number of dimensions, but also with the size of the problem. The difference in RRR between HYPPO 3 and HR does not always reflect the difference in runtime due to the index- 3 3

                1

                9

                (a) 6 o ar co s n m p is 5 × 10 5 × 10 6 × 10 4 × 10 4 × 10 6 6 6 6 6 (b) o ar n s is p 8 × 10 8 × 10 6 6 o f Numb er 2 × 10 3 × 10 2 × 10 3 × 10 10 6 3 6 6 6 HYPPO co HR Minimum 3 f m Numb er o 7 × 10 7 × 10 6 6 Minimum HYPPO HR 3

                500 × 10 2 4 Granularity 8 16 32 6 × 10 6 2 4 Granularity 8 16 32 (d) (c) n s is o ar p 5 × 10 4 × 10 3 × 10 6 6 6 o p m 25 × 10 ar s 35 × 10 n is 40 × 10 30 × 10 6 6 6 6 m co o f er 2 × 10 6 HR er Minimum HYPPO 3 o u m f b co 20 × 10 15 × 10 6 6 6 HR HYPPO Minimum 3 Numb 10 6 N 10 × 10

                B 5 × 10 6 ig D at 2 4 Granularity 8 16 32 2 4 Granularity 8 16 32 a C om p u

                Figure 5.6 3 HR tin Number of comparisons for and HYPPO. g

                Linked Data in Enterprise Integration

                indexing runtime and comparison runtime (i.e., RRR) to outperform HYPPO in all experiments.

                In this section, we address the lack of coherence that comes about when integrating data from several knowledge data and using them within one application. Here, we present CaRLA, the Canonical Representation Learning

                

              Algorithm [19]. This approach addresses the discrepancy problem by learn-

                ing canonical (also called conform) representation of data-type property values. To achieve this goal, CaRLA implements a simple, time-efficient, and accurate learning approach. We present two versions of CaRLA: a batch learning and an active learning version. The batch learning approach relies on a training data set to derive rules that can be used to generate conform representations of property values. The active version of CaRLA (aCarLa) extends CaRLA by computing unsure rules and retrieving highly informa- tive candidates for annotation that allow one to validate or negate these can- didates. One of the main advantages of CaRLA is that it can be configured to learn transformations at character, n-gram, or even word level. By these means, it can be used to improve integration and link discovery processes based on string similarity/distance measures ranging from character-based (edit distance) and n-gram-based (q-grams) to word-based (Jaccard similar-

                10 4

                10 3

                10 2 R untime (s)

                10 1

                10 Exp. 1 Exp. 2 Exp. 3 Exp. 4 HYPPO SILK HR 3 Figure 5.7 Comparison of the runtimes of

                HR 3 , HYPPO, and SILK 2.5.1.

              Discrepancy

                180 180 2 9

                1

                (a) 100 120 140 160 (b)

              120

              140

              160

              100

              R untime (s) 80 60 (Exp. 2) 40 20 HYPPO (Exp. 2) HR HR HYPPO (Exp. 1) 3 3 (Exp. 1) untime (s) R 80 20 40 60 (Exp. 4) HYPPO (Exp. 4) HYPPO (Exp. 3) HR HR 3 3 (Exp. 3)

                2 4 Granularity 8 16 32

              1,005

              2 4 Granularity 8 16 32 (c) 1,8 1,7 1,5 1,6

                (d)

              1,003

              1,004

              R R R 1,2 1,4 1,3 HYPPO (Exp. 2) HR HR HYPPO (Exp. 1) R 3 R 3 (Exp. 1) (Exp. 2) R

              1,002

              1,001

              HYPPO (Exp. 3) HR HR HYPPO (Exp. 4) 3 3 (Exp. 3) (Exp. 4) 1,1 B 1 1 ig D 2 4 8 16 32 2 4 8 16 32 at Granularity Granularity a C om p

                Figure 5.8 3 u

                HR Comparison of runtimes and RRR of and HYPPO. (a) Runtimes for experiments 1 and 2, (b) runtimes for experiments 3 and 4, (c) RRR for experi- tin g ments 1 and 2, and (d) RRR for experiments 3 and 4.

                Linked Data in Enterprise Integration Preliminaries

                

              xx> with xA. We call two transformation rules r and rinverse to each

                φ R :Σ* → Σ* ∪ {ε} a transformation function when it maps s to a string φR(s) by applying all rules ri R to every token of token(s) in an arbitrary order.

                Definition 5: Transformation Function

              Given a set R of (weighted) transformation rules and a string s, we call the function

                mation rule is the pair (r,w(r)), where r ∈ Γ is a transformation rule.

                Γ be the set of all rules. Given a weight function w:Γ → ℝ , a weighted transfor-

                Definition 4: Weighted Transformation Rule Let

                other when r = <xy > and r′ = <yx>. Throughout this work, we will assume that the characters that make up the tokens of A belong to Σ ∪ {ε}, where ε stands for the empty character. Note that we will consequently denote deletions by rules of the form <x → ε > , where xA.

                <

                In the following, we define terms and notation necessary to formalize the approach implemented by CaRLA. Let s ∈ Σ

              • * be a string from an alphabet Σ.

                

              consequence of r. We call a transformation rule trivial when it is of the form

                In the following, we will denote transform rules by using an arrow nota- tion. For example, the mapping of the token “Alan” to “A.” will be denoted by <“Alan” → “A.” >. For any rule r = <xy > , we call x the premise and y the

                Definition 3: Transformation Rule

              A transformation rule is a function r: AA that maps a token from the alphabet A

              to another token of A .

                Note that string similarity and distance measures rely on a large number of different tokenization approaches. For example, the Levenshtein similar- ity [15] relies on a tokenization at character level, while the Jaccard similarity [11] relies on a tokenization at word level.

                → 2 A maps any string s ∈ Σ

              • * to a subset of the token alphabet A.

                Definition 2: Tokenization Function Given an alphabet A of tokens, a tokenization function token : Σ *

                We define a tokenization function as follows:

                For example, the set R = {<“Alan” → “A.”>} of transformation rules would

                Big Data Computing Carla

                The goal of CaRLA is two-fold: First, it aims to compute rules that allow to derive conform representations of property values. As entities can have several values for the same property, CaRLA also aims to detect a condition under which two property values should be merged during the integration process. In the following, we will assume that two source knowledge bases are to be integrated to one. Note that our approach can be used for any num- ber of source knowledge bases.

                Formally, CaRLA addresses the problem of finding the required transfor- mation rules by computing an equivalence relation ε between pairs of prop- erty values (p , p ), that is, such that ε(p , p ) holds when p and p should be 1 2 1 2 1 2 mapped to the same canonical representation p. CaRLA computes ε by generat- ing two sets of weighted transformation function rules R and R such that 1 2

                → σ ϕ ϕ θ for a given similarity function σ ε ( , p p 1 2 ) ( R ( ), p 1 1 R ( )) p 2 ≥ , where θ is 2 a similarity threshold. The canonical representation p is then set to ϕ R ( ). p 1 1

                ϕ θ

                The similarity condition σ ϕ ( R ( p R ), R ( )) p 1 1 2 ≥ 2 is used to distinguish between the pairs of properties values that should be merged.

                To detect R and R , CaRLA assumes two training data sets P and N, 1 2 of which N can be empty. The set P of positive training examples is com- posed of pairs of property value pairs (p , p ) such that ε(p , p ) holds. The 1 2 1 2 set N of negative training examples consists of pairs (p ,p ) such that ε(p , p ) 1 2 1 2 does not hold. In addition, CaRLA assumes being given a similarity func- tion σ and a corresponding tokenization function token. Given this input, CaRLA implements a simple three-step approach: It begins by computing the two sets R and R of plausible transformation rules based on the posi- 1 2 tive examples at hand (Step 1). Then it merges inverse rules across R and R 1 2 and discards rules with a low weight during the rule merging and filtering step. From the resulting set of rules, CaRLA derives the similarity condition

                εσ ϕ ϕ θ

                ( , p p 1 2 ) ( R ( ), p 1 1 R ( )) p 2 ≥ 2 . It then applies these rules to the negative examples in N and tests whether the similarity condition also holds for the negative examples. If this is the case, then it discards rules until it reaches a local minimum of its error function. The retrieved set of rules and the novel value of θ constitute the output of CaRLA and can be used to generate the canonical representation of the properties in the source knowledge bases.

                In the following, we explain each of the three steps in more detail. Throughout the explanation, we use the toy example shown in Table 5.2. In addition, we will assume a word-level tokenization function and the Jaccard similarity.

                Rule Generation

                The goal of the rule generation set is to compute two sets of rules R and R 1 2 that will underlie the transformation ϕ R and ϕ R , respectively. We begin by 1 2 tokenizing all positive property values p and p such that (p , p ) ∈ P. We call T i j i j 1

                Linked Data in Enterprise Integration

                > >    ) else.

                The computation of R 1 and R 2 can lead to a large number of inverse or improb- able rules. In our example, R 1 contains the rule <“van” → “Van”> while R 2 contains <“Van” → “van” >. Applying these rules to the data would conse- quently not improve the convergence of their representations. To ensure that

                Rule Merging and Filtering

                R 1 = {(<“van” → “Van”>, 2.08), (<“T.” → ε >, 2)} and R 2 = {(<“Van” → “van”>, 2.08), (<“(actor)” → ε >, 2)}.

                For the set P in our example, we obtain the following sets of rules:

                . To compute R 2 , we simply swap T 1 and T 2 , invert P (i.e., compute the set {(p j , p i ): (p i , p j ) ∈ P}) and run through the procedure described above.

                = < → ′> ′∈ argmax ( ) 2

                score x y y T final

                (i.e., r is not trivial) and y

                (5.12) Finally, for each token xT 1 , we add the rule r = < xy> to R 1 iff xy

                2 ε

                the set of all tokens p i such that (p i , p j ) ∈ P, while T 2 stands for the set of all p j . We begin the computation of R 1 by extending the set of tokens of each p jT 2 by adding ε to it. Thereafter, we compute the following rule score function score:

                / if

                x y x y x y y x y σ κ

                ≠ × < →

                < → > = < → > +

                ( ) ( , ) , , (

                score score score final ( )

                σ (x, y) > σ(x, y′). Given that σ is bound between 0 and 1, it is sufficient to add a fraction of σ(x, y) to each rule <xy> to ensure that the better rule is chosen. Our final score function is thus given by

                All tokens, x T 1 , always have a maximal co-occurrence with ε as it occurs in all tokens of T 2 . To ensure that we do not compute only deletions, we decrease the score of rules <x → ε > by a factor κ ∈ [0, 1]. Moreover, in the case of a tie, we assume the rule <xy> to be more natural than <xy′> if

                ∈ ∈ ∧ ∈ (5.11)

              score computes the number of co-occurrences of the tokens x and y across P.

                score x y p p P x token p y token p i j i j ( ) |{( , ) : ( ) ( )}|. < → > =

                TaBle 5.2 Toy Example Data Set Type Property Value 1 Property Value 2 ⊕ “Jean van Damne” “Jean Van Damne (actor)” ⊕ “Thomas T. van Nguyen” “Thomas Van Nguyen (actor)” ⊕

              “Alain Delon” “Alain Delon (actor)”

              ⊕ “Alain Delon Jr.” “Alain Delon Jr. (actor)” “Claude T. Francois” “Claude Francois (actor)” Note:

              The positive examples are of type ⊕ and the negative of type ⊖.

                Big Data Computing

                the transformation rules lead to similar canonical forms, the rule merging step first discards all rules <xy > ∈ R 2 such that <yx > ∈ R 1 (i.e., rules in R 2 that are inverse to rules in R 1 ). Then, low-weight rules are discarded. The idea here is that if there is not enough evidence for a rule, it might just be a random event. The initial similarity threshold θ for the similarity condition is finally set to θ

                σ ϕ ϕ = min ( ( ), ( )). ( , ) 1 2 1 2 1 R R 2 p p P p p

                (5.13) In our example, CaRLA would discard <“van” → “Van”> from R 2 . When assuming a threshold of 10% of P’s size (i.e., 0.4), no rule would be filtered out. The output of this step would consequently be R 1 = {(<“van” → “Van”>,

                2.08), (<“T.” → ε >, 2)} and R 2 = {(<“(actor)” → ε >, 2)}.

                Rule Falsification

                The aim of the rule falsification step is to detect a set of transformations that lead to a minimal number of elements of N having a similarity superior to θ via σ. To achieve this goal, we follow a greedy approach that aims to mini- mize the magnitude of the set

                E p p N p p p R R p p P R R

                = ∈ ≥ =

                ( , ) : ( ( ), ( )) min ( ( ), ( , ) 1 2 1 2 1 1 2 1 2 1 2

                σ ϕ ϕ θ σ ϕ ϕ

                (( )) . p 2

                { }

                (5.14) Our approach simply tries to discard all rules that apply to elements of E by ascending score. If E is empty, then the approach terminates. If E does not get smaller, then the change is rolled back and the next rule is tried. Else, the rule is discarded from the set of final rules. Note that discarding a rule can alter the value of θ and thus E. Once the set E has been computed, CaRLA concludes its computation by generating a final value of the thresh- old θ.

                In our example, two rules apply to the element of N. After discarding the rule <”T.” → ε >, the set E becomes empty, leading to the termination of the rule falsification step. The final set of rules are thus R 1 = {<“van” → “Van”>} and R 2 = {<“(actor)” → ε >}. The value of θ is computed to be 0.75. Table 5.3 shows the canonical property values for our toy example. Note that this threshold allows to discard the elements of N as being equivalent property values.

                It is noteworthy that by learning transformation rules, we also found an initial threshold θ for determining the similarity of property values using σ as similarity function. In combination with the canonical forms com- puted by CaRLA, the configuration (σ, θ) can be used as an initial configu-

                Linked Data in Enterprise Integration TaBle 5.3 Canonical Property Values for Our Example Data Set Property Value 1 Property Value 2 Canonical Value “Jean van Damne” “Jean Van Damne (actor)” “Jean Van Damne”

              “Thomas T. van Nguyen” “Thomas Van Nguyen (actor)” “Thomas T. Van Nguyen”

              “Alain Delon” “Alain Delon (actor)” “Alain Delon” “Alain Delon Jr.” “Alain Delon Jr. (actor)” “Alain Delon Jr.” “Claude T. Francois” “Claude T. Francois” “Claude Francois (actor)” “Claude Francois”

                smallest Jaccard similarity for the pair of property values for our exam- ple lies by 1/3, leading to a precision of 0.71 for a recall of 1 (F-measure: 0.83). Yet, after the computation of the transformation rules, we reach an

                F

              • measure of 1 with a threshold of 1. Consequently, the pair (σ, θ ) can be used for determining an initial classifier for approaches such as the RAVEN algorithm [24].

                extension to active learning

                One of the drawbacks of batch learning approaches is that they often require a large number of examples to generate good models. As our evaluation shows (see the “Evaluation” section), this drawback also holds for the batch version of CaRLA, as it can easily detect very common rules but sometimes fails to detect rules that apply to less pairs of property values. In the follow- ing, we present how this problem can be addressed by extending CaRLA to aCARLA using active learning [28].

                The basic idea here is to begin with small training sets P and N . In each iteration, all the available training data are used by the batch version of CaRLA to update the set of rules. The algorithm then tries to refute or vali- date rules with a score below the score threshold s min (i.e., unsure rules). For this purpose, it picks the most unsure rule r that has not been shown to be erroneous in a previous iteration (i.e., that is not an element of the set of banned rules B). It then fetches a set Ex of property values that map the left side (i.e., the premise) of r. Should there be no unsure rule, then Ex is set to the q property values that are most dissimilar to the already known prop- erty values. Annotations consisting of the corresponding values for the ele- ments of Ex in the other source knowledge bases are requested by the user and written in the set P. Property values with no corresponding values are written in N. Finally, the sets of positive and negative examples are updated and the triple (R 1 , R 2 , θ ) is learned anew until a stopping condition such as a maximal number of questions is reached. As our evaluation shows, this simple extension of the CaRLA algorithm allows it to detect efficiently the

                Big Data Computing evaluation Experimental Setup

                In the experiments reported in this section, we evaluated CaRLA by two means: First, we aimed to measure how well CaRLA could compute trans- formations created by experts. To achieve this goal, we retrieved transforma- tion rules from four link specifications defined manually by experts within the LATC project.* An overview of these specifications is given in Table 5.4. Each link specification aimed to compute owl:sameAs links between enti- ties across two knowledge bases by first transforming their property values and by then computing the similarity of the entities based on the similar- ity of their property values. For example, the computation of links between films in DBpedia and LinkedMDB was carried out by first applying the set of

                

              R = {<(film) → ε >} to the labels of films in DBpedia and R = {<(director) → ε >}

              1 2

                to the labels of their directors. We ran both CaRLA and aCaRLA on the prop- erty values of the interlinked entities and measured how fast CaRLA was able to reconstruct the set of rules that were used during the Link Discovery process.

                In addition, we quantified the quality of the rules learned by CaRLA. In each experiment, we computed the boost in the precision of the mapping of property pairs with and without the rules derived by CaRLA. The initial precision was computed as |P|/|M|, where

                M = {( , ) : ( , ) p p i j σ p p i j min ( p p , ) P σ ( , p p

              1

              2 )}. The precision after apply-

                ≥ 1 2 ∈ ing CaRLA’s results was computed as |P|/|M′|, where M′ = {(p ,p ): i j

                σ ϕ ϕ σ ϕ ϕ

                ( R ( ), p i R ( )) p j min ( p p , ) P ( R ( ), p 1 2 ≥ 1 2 ∈ 1 1 R ( ))}. p 2 2 Note that in both cases,

                ( , ) p p i j P : ( , ) σ p p i j min ( p p , ) P σ ( , p p 1 2 ) . In all the recall was 1 given that ∀ ∈ 1 2 ∈ experiments, we used the Jaccard similarity metric and a word tokenizer with κ = 0.8. All runs were carried on a notebook running Windows 7

                Enterprise with 3 GB RAM and an Intel Dual Core 2.2 GHz processor. Each of the algorithms was ran five times. We report the rules that were discov- ered by the algorithms and the number of experiments within which they were found.

                TaBle 5.4 Overview of the Data Sets

              Experiment Source Target Source Property Target Property Size

              rdfs:label rdfs:label

              Actors DBpedia LinkedMDB 1172

              rdfs:label rdfs:label

              Directors DBpedia LinkedMDB 7353

              rdfs:label rdfs:label

              Movies DBpedia LinkedMDB 9859

              • * rdfs:label rdfs:label

                Producers DBpedia LinkedMDB 1540

              •   Linked Data in Enterprise Integration Results and Discussion

                Table 5.5 shows the union of the rules learned by the batch version of CaRLA in all five runs. Note that the computation of a rule set lasted under 0.5 s

                  even for the largest data set, that is, Movies. The columns P give the prob- n ability of finding a rule for a training set of size n in our experiments. R 2 is not reported because it remained empty in all setups. Our results show that in all cases, CaRLA converges quickly and learns rules that are equiva- lent to those utilized by the LATC experts with a sample set of 5 pairs. Note that for each rule of the form <“@en” → y> with y ≠ ε that we learned, the experts used the rule <y → ε > while the linking platform automatically removed the language tag. We experimented with the same data sets with- out language tags and computed exactly the same rules as those devised by the experts. In some experiments (such as Directors), CaRLA was even able detect rules that where not included in the set of rules generated by human experts. For example, the rule <“(filmmaker)” → “(director)”> is not very frequent and was thus overlooked by the experts. In Table 5.5, we marked such rules with an asterix. The Director and the Movies data sets contained a large number of typographic errors of different sort (incl. misplaced hyphens, character repetitions such as in the token “Neilll”, etc.), which led to poor precision scores in our experiments. We cleaned the first 250 entries of these data sets from these errors and obtained the results in the rows labels Directors_clean and Movies_clean. The results of CaRLA on these data sets are also shown in Table 5.5. We also measured the improvement in precision that resulted from applying CaRLA to the data sets at hand (see Figure 5.9). For that the precision remained constant across the differ- ent data set sizes. In the best case (cleaned Directors data set), we are able to improve the precision of the property mapping by 12.16%. Note that we

                  TaBle 5.5 Overview of Batch Learning Results

                Experiment R P P P P P

                1 5 10 20 50 100

                  1

                  1

                  1

                  1

                  1 < Directors “@en” → “(director)”>

                  < Actors “@en” → “(actor)”>

                  1

                  1

                  1

                  1

                  1 < “(filmmaker)” → “(director)”>*

                  0.2 < Directors_clean “@en” → “(director)”>

                  1

                  1

                  1

                  1

                  1 < Movies “@en” → ε >

                  1

                  1

                  1

                  1

                  1 <

                  1

                  1

                  1

                  1

                  1 “(film)” → ε > <

                  0.6 “film)” → ε >* Movies_clean <

                  1

                  1

                  1

                  1

                  1 “@en” → ε > <

                  0.8

                  1

                  1

                  1 “(film)” → ε > <

                  1 “film)” → ε >* < Producers “@en” → (producer)>

                  1

                  1

                  1

                  1

                  1

                  Big Data Computing (a) 100%

                  80% 60% ion Baseline is ec CaRLA Pr

                  40% 20% 0%

                Actors Directors Directors_clean Movies Movies_clean Producers

                (b)

                  100% 80% 60% Baseline eshold CaRLA hr

                  T 40% 20% 0%

                Actors Directors Directors_clean Movies Movies_clean Producers

                  Figure 5.9

                Comparison of the precision and thresholds with and without CaRLA. (a) Comparison of the

                precision with and without CaRLA. (b) Comparison of the thresholds with and without CaRLA.

                  can improve the precision of the mapping of property values even on the noisy data sets.

                  Interestingly, when used on the Movies data set with a training data set size of 100, our framework learned low-confidence rules such as <

                  “(1999” → ε >, which were yet discarded due to a too low score. These are the cases where aCaRLA displayed its superiority. Thanks to its ability to ask for annotation when faced with unsure rules, aCaRLA is able to vali-

                  Linked Data in Enterprise Integration

                  1

                  1

                  1

                  1 < “film)” → ε >*

                  1 < “(2006” → ε >*

                  1 < “(199” → ε >*

                  1 Movies_clean < “@en” → ε >

                  1

                  1

                  1

                  1 < “(film)” → ε >

                  1

                  1

                  1

                  1

                  1 < “film)” → ε >*

                  1 Producers < “@en” → (producer)>

                  1

                  1

                  1

                  1

                  1

                  1

                  1 < “(film)” → ε >

                  aCaRLA is able to detect several supplementary rules that were overlooked by human experts. Especially, it clearly shows that deleting the year of cre- ation of a movie can improve the conformation process. aCaRLA is also able to generate a significantly larger number of candidate rules for the user’s convenience. For example, it detects a large set of low-confidence rules such as <“(actress)” → “(director)”>, <“(actor)” → “(director)”> and <“(actor/ director)” → “(director)”> on the Directors data set. Note that in one case aCARLA misses the rule <“(filmmaker)” → “(director)” > that is discovered by CaRLA with a low probability. This is due to the active learning pro- cess being less random. The results achieved by aCaRLA on the same data sets are shown in Table 5.6. Note that the runtime of aCaRLA lied between 50 ms per iteration (cleaned data sets) and 30 s per iteration (largest data set, Movies). The most time-expensive operation was the search for the prop- erty values that were least similar to the already known ones.

                  1

                  In this chapter, we introduced a number of challenges arising in the context of Linked Data in Enterprise Integration. A crucial prerequisite for address- ing these challenges is to establish efficient and effective link discovery and data integration techniques, which scale to large-scale data scenarios found in the enterprise. We addressed the transformation and linking steps of the Linked Data Integration by presenting two algorithms, HR 3 and CaRLA. We 3 TaBle 5.6

                  Overview of Active Learning Results Experiment R 1 P 5 P 10 P 20 P 50 P 100

                  Actors < “@en” → “(actor)”>

                  1

                  1

                  1

                  1

                  1 Directors < “@en” → “(director)”>

                  1

                  1

                  1

                  1

                  1 < “(actor)” → “(director)”>*

                  1 Directors_clean < “@en” → “(director)”>

                  1

                  1

                  1

                  1

                  1 Movies < “@en” → ε >

                  1

                  1

                  1

                Conclusion

                  Big Data Computing

                  that its RRR converges toward 1 when α converges toward . HR 3 aims to be the first of a novel type of Link Discovery approaches, that is, approaches that can guarantee theoretical optimality, while also being empirically usable. In the future work, more such approaches will enable superior memory and space management. CaRLA uses batch and active learning approach to dis- cover a large number of transformation rules efficiently and was shown to increase the precision of property mapping by up to 12% when the recall is set to 1. In addition, CaRLA was shown to be able to detect rules that escaped experts while devising specifications for link discovery.

                References

                  

                2. F. C. Botelho and N. Ziviani. External perfect hashing for very large key sets. In

                CIKM , pp. 653–662, 2007.

                  

                3. T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, and F. Yergeau. Extensible

                Markup Language (XML) 1.0 (Fifth Edition). W3C, 2008.

                  

                4. J. Clark and M. Makoto. RELAX NG Specification. Oasis, December 2001. http://

                www.oasis-open.org/committees/relax-ng/spec-20011203.html.

                  

                5. V. Eisenberg and Y. Kanza. D2RQ/update: Updating relational data via virtual

                RDF. In WWW (Companion Volume), pp. 497–498, 2012.

                  

                6. M. G. Elfeky, A. K. Elmagarmid, and V. S. Verykios. Tailor: A record linkage tool

                box. In ICDE, pp. 17–28, 2002.

                  

                1. S. Auer, J. Lehmann, and S. Hellmann. LinkedGeoData: Adding a spatial

                dimension to the Web of Data. The Semantic Web-ISWC 2009, pp. 731–746, 2009.

                  

                8. D. A. Ferrucci, E. W. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. Kalyanpur,

                A. Lally et al. Building Watson: An overview of the deepQA project. AI Magazine, 31(3):59–79, 2010.

                  

                9. A. Halevy, A. Rajaraman, and J. Ordille. Data integration: The teenage years. In

                Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB’06) , pp. 9–16. VLDB Endowment, 2006.

                  

                10. A. Hogan, A. Harth, and A. Polleres. Scalable authoritative OWL reasoning for

                the web. International Journal on Semantic Web and Information Systems (IJSWIS), 5(2):49–90, 2009.

                  

                11. P. Jaccard. Étude comparative de la distribution florale dans une portion des

                Alpes et des Jura. Bulletin del la Société Vaudoise des Sciences Naturelles, 37:547– 579, 1901.

                  

Scalable End-User Access to Big Data

Martin Giese, Diego Calvanese, Peter Haase, Ian Horrocks, Yannis Ioannidis, Herald Kllapi, Manolis Koubarakis, Maurizio Lenzerini, Ralf Möller, Mariano Rodriguez Muro, Özgür Özçep, Riccardo Rosati, Rudolf Schlatte, Michael Schmidt, Ahmet Soylu, and Arild Waaler

CONTENTS

Data Access Problem of Big Data
Ontology-Based Data Access
    Example
    Limitations of the State of the Art in OBDA
Query Formulation Support
Ontology and Mapping Management
Query Transformation
Time and Streams
Distributed Query Execution
Conclusion
References

This chapter proposes steps toward the solution to the data access problem that end-users typically face when dealing with Big Data:

• They need to pose ad-hoc queries to a collection of data sources, possibly including streaming sources.
• They are unable to query these sources on their own, but are dependent on assistance from IT experts.
                • The turnaround time for information requests is in the range of days, possibly weeks, due to the involvement of the IT personnel.
                • The volume, complexity, variety, and velocity of the underlying data sources put very high demands on the scalability of the solution.

We propose to approach this problem using ontology-based data access (OBDA), the idea being to capture end-user conceptualizations in an ontology and to use declarative mappings to connect the ontology to the underlying data sources. End-user queries are then posed in terms of the concepts of the ontology and are rewritten as queries against the sources.

The chapter is structured as follows. First, in the section "Data Access Problem of Big Data," we situate the problem within the more general discussion about Big Data. Then, in the section "Ontology-Based Data Access," we review the state of the art in OBDA, explain why we believe OBDA is a superior approach to the data access challenge posed by Big Data, and also explain why the field of OBDA is currently not yet sufficiently mature to deal satisfactorily with these problems. The rest of the chapter contains concepts for raising OBDA to a level where it can be successfully deployed to Big Data.

The ideas proposed in this chapter are investigated and implemented in the FP7 Integrated Project Optique—Scalable End-user Access to Big Data, which runs until the end of year 2016. The Optique solutions are evaluated on two comprehensive use cases from the energy sector with a variety of data access challenges related to Big Data.

                Data Access Problem of Big Data

The situation in knowledge- and data-intensive enterprises is typically as follows. Massive amounts of data, accumulated in real time and over decades, are spread over a wide variety of formats and sources. End-users operate on these collections of data using specialized applications, the operation of which requires expert skills and domain knowledge. Relevant data are extracted from the data sources using predefined queries that are built into the applications. Moreover, these queries typically access just some specific sources with identical structure. The situation can be illustrated like this:

[Illustration (simple case): the engineer works through an application with predefined queries over uniform sources.]

In these situations, the turnaround time, by which we mean the time from when the end-user delivers an information need until the data are there, will typically be in the range of minutes, maybe even seconds, and Big Data technologies can be deployed to dramatically reduce the execution time for queries.

Situations where users need to explore the data using ad hoc queries are considerably more challenging, since accessing relevant parts of the data typically requires in-depth knowledge of the domain and of the organization of data repositories. It is very rare that the end-users possess such skills themselves. The situation is rather that the end-user needs to collaborate with an IT-skilled person in order to jointly develop the query that solves the problem at hand, as illustrated below:

[Illustration (complex case): the engineer formulates an information need, which an IT expert translates into a specialized query over disparate sources.]

                  The turnaround time is then mostly dependent on human factors and is in the range of days, if not worse. Note that the typical Big Data technologies are of limited help in this case, as they do not in themselves eliminate the need for the IT expert.

                  The problem of end-user data access is ultimately about being able to put the enterprise data in the hands of the expert end-users. Important aspects of the problem are volume, variety, velocity, and complexity (Beyer et al., 2011), where by volume we mean the complete size of the data, by variety we mean the number of different data types and data sources, by velocity we mean the rate at which data streams in and how fast it needs to be processed, and by complexity we mean factors such as standards, domain rules, and size of database schemas that in normal circumstances are manageable, but quickly complicate data access considerably when they escalate.

Factors such as variety, velocity, and complexity can make data access challenging even with fairly small amounts of data. When, in addition to these factors, data volumes are extreme, the problem becomes seemingly intractable; one must then not only deal with large data sets but at the same time also cope with dimensions that to some extent are complementary. In Big Data scenarios, one or more of these dimensions go to the extreme, at the same time interacting with other dimensions.

Based on the ideas presented in this chapter, the Optique project implements a solution to the data access problem for Big Data in which all the above-mentioned dimensions of the problem are addressed. The goal is to enable expert end-users to access the data themselves, without the help of IT experts, as illustrated below:

[Illustration (the Optique solution): the engineer poses flexible, ontology-based queries through an application; the Optique platform translates them into queries over the disparate sources.]

Ontology-Based Data Access

We have observed that, in end-user access to Big Data, there exists a bottleneck in the translation of end-users' information needs into executable and optimized queries over data sources. An approach known as "ontology-based data access" (OBDA) has the potential to avoid this bottleneck by automating this query translation process. Figure 6.1 shows the essential components in an OBDA setup.

[Figure 6.1: The basic setup for OBDA. The end-user poses queries through an application; query answering uses the ontology and the mappings (maintained by the IT expert) and returns the results to the application.]

The main idea is to use an ontology, or domain model, that is a formalization of the vocabulary employed by the end-users to talk about the problem domain. This ontology is constructed entirely independently of how the data are actually stored. End-users can formulate queries using the terms defined by the ontology, using some formal query language. In other words, queries are formulated according to the end-users' view of the problem domain.

To execute such queries, a set of mappings is maintained which describes the relationship between the terms in the ontology and their representation(s) in the data sources. This set of mappings is typically produced by the IT expert, who previously translated end-users' queries manually.

                  It is now possible to give an algorithm that takes an end-user query, the ontology, and a set of mappings as inputs, and computes a query that can be executed over the data sources, which produces the set of results expected for the end-user query. As Figure 6.1 illustrates, the result set can then be fed into some existing domain-specific visualization or browsing application which presents it to the end-user.

In the next section, we will see an example of such a query translation process, which illustrates the point that including additional information about the problem domain in the ontology can be very useful for end-users. In general, this process of query translation is much more complex than a simple substitution of terms, and the translated queries can also, in some cases, become dramatically larger than the original ontology-based query formulated by the end-user.

                  The theoretical foundations of OBDA have been thoroughly investigated in recent years (Möller et  al., 2006; Calvanese et  al., 2007a,b; Poggi et  al., 2008). There is a very good understanding of the basic mechanisms for query rewriting, and the extent to which expressivity of ontologies can be increased, while maintaining the same theoretical complexity as is exhibited by standard relational database systems.

                  Also, prototypical implementations exist (Acciarri et al., 2005; Calvanese et al., 2011), which have been applied to minor industrial case studies (e.g., Amoroso et al., 2008). They have demonstrated the conceptual viability of the OBDA approach for industrial purposes.

                  There are several features of a successful OBDA implementation that lead us to believe that it is the right basic approach to the challenges of end-user access to Big Data:

                • It is declarative, that is, there is no need for end-users, nor for IT experts, to write special-purpose program code.
• Data can be left in existing relational databases. In many cases, moving large and complex data sets is impractical, even if the data owners were to allow it. Moreover, for scalability it is essential to exploit existing optimized data structures (tables), and to avoid increasing query complexity by fragmenting data. This is in contrast to, for example, data warehousing approaches that copy data: OBDA is more flexible and offers an infrastructure which is simpler to set up and maintain.
• It provides a flexible query language that corresponds to the end-user conceptualization of the data.
• The ontology can be used to hide details and introduce abstractions. This is significant in cases where there is a source schema which is too complex for the end-user.
• The relationship between the ontology concepts and the relational data is made explicit in the mappings. This provides a means for the DB experts to make their knowledge available to the end-user independently of specific queries.

Example

We will now present a (highly) simplified example that illustrates some of the benefits of OBDA and explains how the technique works. Imagine that an engineer working in the power generation industry wants to retrieve data about generators that have a turbine fault. The engineer is able to formalize this information need, possibly with the aid of a suitable tool, as a query of the form:

Q1(g) ← Generator(g) ∧ hasFault(g,f) ∧ TurbineFault(f)

Table 6.1  Example Ontology and Data for Turbine Faults (human readable / logic)

Ontology:
    Condenser is a CoolingDevice that is part of a Turbine  /  Condenser ⊑ CoolingDevice ⊓ ∃isPartOf.Turbine
    CondenserFault is a Fault that affects a Condenser  /  CondenserFault ≡ Fault ⊓ ∃affects.Condenser
    TurbineFault is a Fault that affects part of a Turbine  /  TurbineFault ≡ Fault ⊓ ∃affects.(∃isPartOf.Turbine)
Data:
    g1 is a Generator  /  Generator(g1)
    g1 has fault f1  /  hasFault(g1,f1)
    f1 is a CondenserFault  /  CondenserFault(f1)

                  which can be read as “return all g such that g is a generator, g has a fault f, and f is a turbine fault.” Now consider a database that includes the tuples given in the lower part of Table 6.1. If Q1 is evaluated over these data, then g1 is not returned in the answer, because f1 is a condenser fault, but not a turbine fault. However, this is not what the engineer would want or expect, because the engineer knows that the condenser is a part of the turbine and that a condenser fault is thus a kind of turbine fault.

The problem is caused by the fact that the query answering system is not able to use the engineer's expert knowledge of the domain. In an OBDA system, (some of) this knowledge is captured in an ontology, which can then be exploited in order to answer queries "more intelligently." The ontology provides a conceptual model that is more intuitive for users: it introduces familiar vocabulary terms and captures declaratively the relationships between terms.

In our running example, the ontology might include the declarative statements shown in the upper part of Table 6.1. These introduce relevant vocabulary, such as condenser, cooling device, affects, etc. and establish relationships between terms. The first axiom, for example, states that "every condenser is a cooling device that is part of a turbine." If we formalize these statements as axioms in a suitable logic, as shown in the right-hand side of Table 6.1, we can then use automated reasoning techniques to derive facts that must hold, but are not explicitly given by the data, such as TurbineFault(f1). This in turn means that g1 is recognized as a correct answer to the example query. Using an ontology and automated reasoning techniques, query answering can relate to the whole set of implied information, instead of only that which is explicitly stated.

Automated reasoning can, in general, be computationally very expensive. Moreover, most standard reasoning techniques would need to interleave operations on the ontology and the data, which may not be practically feasible if the data are stored in a relational database. OBDA addresses both these issues by answering queries using a two-stage process, first using the ontology to rewrite the query and then evaluating the rewritten query over the data (without any reference to the ontology). The rewriting step generates additional queries, each of which can produce extra answers that follow from a combination of existing data and statements in the ontology. Ensuring that this is possible for all possible combinations of data and ontology statements requires some restrictions on the kinds of statements that can be included in the ontology. The OWL 2 QL ontology language profile has been designed as a maximal subset of OWL 2 that enjoys this property.

                  Coming back to our example, we can easily derive from the ontology that a condenser fault is a kind of turbine fault, and we can use this to rewrite the query as

                  

Q2(g) ← Generator(g) ∧ hasFault(g,f) ∧ CondenserFault(f)

Note that there are many other possible rewritings, including, for example,

Q3(g) ← Generator(g) ∧ hasFault(g,f) ∧ Fault(f) ∧ affects(f,c) ∧ Condenser(c)

all of which need to be considered if we want to guarantee that the answer to the query will be complete for any data set, and this can result in the rewritten query becoming very large (in the worst case, exponential in the size of the input ontology and query).
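To make the rewriting step concrete, the following minimal sketch (in Python, purely illustrative and not taken from any of the systems cited in this chapter) produces such a union of conjunctive queries by replacing each concept atom with its known subconcepts; the subsumption CondenserFault ⊑ TurbineFault is assumed to have been derived from the ontology of Table 6.1 beforehand, and real OWL 2 QL rewriting algorithms handle far more (existential axioms, role atoms, query minimization):

    from itertools import product

    # Derived atomic subsumptions: concept -> concepts subsumed by it (including itself).
    # Here, CondenserFault is subsumed by TurbineFault by the axioms of Table 6.1.
    SUBCONCEPTS = {"TurbineFault": ["TurbineFault", "CondenserFault"]}

    def rewrite(query):
        # query is a list of (predicate, arguments) atoms, e.g. Q1 below
        options = []
        for pred, args in query:
            alternatives = SUBCONCEPTS.get(pred, [pred]) if len(args) == 1 else [pred]
            options.append([(alt, args) for alt in alternatives])
        # every combination of alternatives yields one conjunctive query of the union
        return [list(choice) for choice in product(*options)]

    q1 = [("Generator", ("g",)), ("hasFault", ("g", "f")), ("TurbineFault", ("f",))]
    for cq in rewrite(q1):
        print(cq)   # prints Q1 itself and the variant corresponding to Q2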

One final issue that needs to be considered is how these queries will be evaluated if the data are stored in a data store such as a relational database. So far, we have assumed that the data are just a set of ground tuples that use the same vocabulary as the ontology. In practice, however, we want to access data in some kind of data store, typically a relational database management system (RDBMS), and typically one whose schema vocabulary does not correspond with the ontology vocabulary. In OBDA, we use mappings to declaratively capture the relationships between ontology vocabulary and database queries. A mapping typically takes the form of a single ontology vocabulary term (e.g., Generator) and a query over the data sources that retrieves the instances of this term (e.g., "SELECT id FROM Generator"). Technically, this kind of mapping is known as global as view (GAV).

In our example, the data might be stored in an RDBMS using tables for generators and faults, and using a hasFault table to capture the one-to-many relationship between generators and faults, as shown in Table 6.2. Mappings from the ontology vocabulary to RDBMS queries can then be defined as follows:

Generator       ↦  SELECT id FROM Generator
CondenserFault  ↦  SELECT id FROM Fault WHERE type = 'C'
TurbineFault    ↦  SELECT id FROM Fault WHERE type = 'T'
hasFault        ↦  SELECT g-id, f-id FROM hasFault

Table 6.2  Database Tables

Generator          Fault          hasFault
id      Serial     id    Type     g-id    f-id
g1      1234       f1    C        g1      f1
g2      5678       f2    T        g2      f2
⋮       ⋮          ⋮     ⋮        ⋮       ⋮

When combined with Q2, these mappings produce the following query over the RDBMS:

SELECT Generator.id FROM Generator, Fault, hasFault
WHERE Generator.id = g-id AND f-id = Fault.id AND type = 'C'

The answer to this query will include g1. However, in order to ensure that all valid answers are returned, we also need to include the results of Q1 (the original query) and all other possible rewritings. In an SQL setting, this leads to a UNION query of the form:

SELECT Generator.id FROM Generator, Fault, hasFault
WHERE Generator.id = g-id AND f-id = Fault.id AND type = 'T'
UNION
SELECT Generator.id FROM Generator, Fault, hasFault
WHERE Generator.id = g-id AND f-id = Fault.id AND type = 'C'
UNION …
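The step from such a union of conjunctive queries to the SQL above can be pictured as an unfolding of the GAV mappings: each ontology atom is replaced by the source query of its mapping, shared variables become join conditions, and the members of the union are combined with UNION. The following Python sketch is purely illustrative (the query representation and helper names are invented, and real systems optimize the generated SQL far more carefully):

    # GAV mappings: ontology term -> (source query, column name per argument position).
    # Column names follow Table 6.2; an actual RDBMS would require quoting "g-id".
    MAPPINGS = {
        "Generator":      ("SELECT id FROM Generator",              ("id",)),
        "CondenserFault": ("SELECT id FROM Fault WHERE type = 'C'", ("id",)),
        "TurbineFault":   ("SELECT id FROM Fault WHERE type = 'T'", ("id",)),
        "hasFault":       ("SELECT g-id, f-id FROM hasFault",       ("g-id", "f-id")),
    }

    def unfold_cq(cq, answer_var):
        # one conjunctive query becomes a join of the mapped source queries
        views, occurrences = [], {}          # occurrences: variable -> ["alias.column", ...]
        for i, (pred, args) in enumerate(cq):
            sql, cols = MAPPINGS[pred]
            views.append(f"({sql}) AS a{i}")
            for var, col in zip(args, cols):
                occurrences.setdefault(var, []).append(f"a{i}.{col}")
        joins = [f"{occ[0]} = {other}" for occ in occurrences.values() for other in occ[1:]]
        where = " WHERE " + " AND ".join(joins) if joins else ""
        return f"SELECT {occurrences[answer_var][0]} FROM " + ", ".join(views) + where

    def unfold_ucq(ucq, answer_var="g"):
        # the union of conjunctive queries becomes a UNION of SQL queries
        return "\nUNION\n".join(unfold_cq(cq, answer_var) for cq in ucq)

    ucq = [  # the rewriting of Q1: the original query plus the variant Q2
        [("Generator", ("g",)), ("hasFault", ("g", "f")), ("TurbineFault", ("f",))],
        [("Generator", ("g",)), ("hasFault", ("g", "f")), ("CondenserFault", ("f",))],
    ]
    print(unfold_ucq(ucq))   # a UNION query analogous to the SQL shown above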

Limitations of the State of the Art in OBDA

As mentioned above, OBDA has been successfully applied to first industrial case studies. Still, realistic applications, where nontechnical end-users require access to large corporate data stores, lie beyond the reach of current technology in several respects:

1. The usability is hampered by the need to use a formal query language that makes it difficult for end-users to formulate queries, even if the vocabulary is familiar.
2. The prerequisites of OBDA, namely ontology and mappings, are in practice expensive to obtain.
3. The scope of existing systems is too narrow: they lack many features that are vital for applications.
4. The efficiency of both the translation process and the execution of the resulting queries is too low.

                  In the remainder of this chapter, we discuss possible approaches to overcome these shortcomings and how the state of the art will have to be advanced in order to realize them. Figure 6.2 shows a proposed architecture supporting this approach. In short terms, the ideas are as follows.

End-user acceptance depends on the usability for nontechnical users, for example, by providing a user-friendly Query Formulation front-end (see Figure 6.2) that lets the end-user navigate the vocabulary and presents a menu of possible refinements of a query (see section "Query Formulation Support"). Advanced users must have the possibility to switch back and forth as required between the navigational view and a more technical view where the query can be edited directly. This will make it possible for a nontechnical user to author large parts of a query, but receive help from a technical expert when required.

The second problem that needs to be addressed is providing and maintaining the prerequisites: ontology and mappings. In practice, these will have to be treated as evolving, dynamic entities which are updated as required for formalizing end-users' information requirements. An industrial-scale front-end needs to support both the semiautomatic derivation of an initial ontology and mappings in new deployments, and the extension of the ontology during query formulation, for example, by adding new technical terms or relationships that were not previously captured. In the architecture of Figure 6.2, this is accomplished by the Query Formulation and Ontology and Mapping Management front-end components.

[Figure 6.2: Proposed architecture. End-users and IT experts interact through the Query Formulation, Application, and Ontology and Mapping Management front-ends; queries pass through Query Transformation (using the ontology and mappings) and Query Planning to distributed Query Execution over streaming and stored sources (Sites A, B, C), with cross-component optimization.]

This mechanism of bootstrapping and query-driven ontology construction can enable the creation of an ontology that fits the end-users' needs at a moderate cost. The same Ontology and Mapping Management component can then also support the IT expert in maintaining a set of mappings that is consistent with the evolving ontology. The sections "Query Formulation Support" and "Ontology and Mapping Management" expand on the requirements for such a management component.

Providing a robust answer to the scope problem is difficult, because there is a trade-off between expressivity and efficiency: very expressive mechanisms in the ontology and mapping languages, which would guarantee applicability to virtually any problem that might occur in industrial applications, are known to preclude efficient query rewriting and execution (Brachman and Levesque, 1984; Artale et al., 2009; Calvanese et al., 2012). To ensure efficiency, a restricted set of features must be carefully chosen for ontology and mapping languages, with the aim of covering as many potential applications as possible.

Still, concrete applications will come with their own specific difficulties that cannot be covered by a general-purpose tool. This expressivity problem needs to be resolved by plugging application-specific modules into the query answering engine. These domain-specific plug-ins must take care of query translation and optimization in those cases where a generic declarative mechanism is not powerful enough for an application. A wide range of special-purpose vocabulary and reasoning could be covered by such domain-specific modules, such as, to name just a few,

                • Geological vocabulary in a petroleum industry application
                • Protein interactions and pathways in molecular biology
• Elementary particle interactions for particle physics

On the other hand, important features that occur in many applications need to be built into the core of any OBDA system. Notably, temporal aspects and the possibility of progressively processing data as it is generated (stream processing) are vital to many industrial applications. Fortunately, existing research on temporal databases, as well as time and streams in semantic technologies, can be integrated into a unified OBDA framework (see section "Time and Streams"). Another important domain that occurs in many applications is that of geospatial information, spatial proximity, containment, etc. Again, we expect that existing research about geospatial data storage, querying, and semantics can be integrated into the OBDA framework.

Other examples are aggregation (summation, averaging, etc.) and epistemic negation (questions about missing data), which have received little theoretical or practical attention, but which are important in many practical applications.

                  To address efficiency, we propose to decompose the “Query Answering” component into several layers, as shown in Figure 6.2:

                  1. Query Transformation using the ontology and mappings

                  2. Query Planning to distribute queries to individual servers

3. Query Execution using existing scalable data stores, or a massively parallelized (cloud) architecture

The implementation of the query transformation layer can take recent theoretical advances in query rewriting into account, which can lead to significantly improved performance (see section "Query Transformation"). The same holds for query execution, which can take advantage of research on massive parallelization of query execution, with the possibility of scaling orders of magnitude beyond a conventional RDBMS architecture (see section "Distributed Query Execution").

We surmise, however, that to gain a real impact on efficiency, a holistic, cross-component view on query answering is needed: current OBDA implementations leave query planning and execution to off-the-shelf database products, often leading to suboptimal performance on the kinds of queries produced by a rewriting component. The complete query answering stack needs to be optimized as a whole, so that the rewritten queries can capitalize on the strengths of the query execution machinery, and the query execution machinery is optimized for the queries produced by the rewriting component.

In the following sections, we give a detailed discussion of the state of the art in the mentioned aspects, and the necessary expansions for an industrial-scale OBDA tool.

                Query Formulation Support

Traditional database query languages, such as SQL, require some technical skills and knowledge about language syntax and domain schema. More precisely, they require users to recall relevant domain concepts and syntax elements and communicate their information need in a programmatic way. Such an approach makes information systems almost, if not completely, inaccessible to the end-users. Direct manipulation (Schneiderman, 1983) languages, which employ recognition (rather than recall) and direct manipulation objects (rather than a command language syntax), have emerged as a response to provide easy-to-use and intuitive interactive systems. In the database domain, Visual Query Systems (Catarci et al., 1997) follow the direct manipulation approach, in which the domain of interest and the information need are represented visually. Various visual representation and visualization paradigms, such as diagrams, forms, etc., have been employed (Epstein, 1991; Catarci et al., 1997) to enable end-users to easily formulate and construct their query requests. However, early approaches mostly missed a key notion, that is, usability (Catarci, 2000), whose concern is the quality of the interaction between the user and the software system rather than the functionality or the technology of the software product. Increasing awareness of the usability in the database domain is visible through a growing amount of research addressing end-user database access (e.g., Uren et al., 2007; Barzdins et al., 2008; Popov et al., 2011).

One of the key points for the success of a system, from the usability perspective, is its ability to clearly communicate the provided affordances for user interaction and the domain information that the user is expected to operate on. This concerns the representation and interaction paradigm employed by the system and the organization of the underlying domain knowledge. Concerning the former, researchers mostly try to identify the correlation between task (e.g., simple, complex, etc.) and user type (e.g., novice, expert, etc.) and the visual representation and interaction paradigm used (Catarci et al., 1997; Catarci, 2000; Popov et al., 2011). Regarding the latter, ontologies are considered a key paradigm for capturing and communicating domain knowledge with the end-users (Uren et al., 2007; Barzdins et al., 2008; Tran et al., 2011).

A key feature of any OBDA system is that the ontology needs to provide a user-oriented conceptual model of the domain against which queries can be posed. This allows the user to formulate "natural" queries using familiar terms and without having to understand the structure of the underlying data sources. However, in order to provide the necessary power and flexibility, the underlying query language will inevitably be rather complex. It would be unrealistic to expect all domain experts to formulate queries directly in such a query language, and even expert users may benefit from tool support that exploits the ontology in order to help them to formulate coherent queries. Moreover, the ontology may not include all the vocabulary expected or needed by a given user. Ideally, it should be possible for users with differing levels of expertise to cooperate on the same query, by allowing them to switch between more or less technical representations as required, and to extend the ontology on the fly as needed for the query being formulated.

Many existing applications today use navigation of simple taxonomic ontologies in order to search for information; a user of eBay, for example, can navigate from "electronics" to "cameras & photography" to "camcorders" in order to find items of interest. In some cases, additional attributes may also be specified; in the above example, attributes such as "brand," "model," and "price" can also be specified. This is sometimes called faceted search (Schneiderman, 1983; Suominen et al., 2007; Lim et al., 2009; Tunkelang, 2009), but the structure of the ontology is very simple, as is the form of the query, which effectively just retrieves the instances of a given concept/class. Faceted search is based on a series of orthogonal categories that can be applied in combination to filter the information space; these categories correspond to properties of the information elements. In an ontology-based system, identification of these properties is straightforward. An important benefit of faceted search is that it frees users from the burden of dealing with complex form-based interfaces and from the possibility of reaching empty result sets. This faceted search approach, however, in its most common form breaks down as soon as a "join" between information about several objects is required. Consider, for example, searching for camcorders available from sellers who also have a digital camera with ≥12 MP resolution on offer.

                  Similarly, ontology development tools such as Protégé may allow for the formulation of query concepts using terms from the ontology, but the query is again restricted to a single concept term. Specialized applications have sometimes used GUIs or form-based interfaces for concept formulation, for example, the Pen & Pad data entry tool developed in the GALEN project (Nowlan et al., 1990), but if used for querying this would again provide only for concept/class instance retrieval queries.

An essential part of any practically usable system must be an interface that supports technically less advanced users by some kind of "query by navigation" interface, where the user gradually refines the query by selecting more specific concepts and adding relationships to other concepts, with the ontology being used to suggest relevant refinements and relationships (Nowlan et al., 1990; Catarci et al., 2004). Work on ontology-supported faceted search (Suominen et al., 2007; Lim et al., 2009) is also relevant in this context. Owing to the rising familiarity of users with faceted search interfaces, a promising direction seems to be to extend faceted search with, among others, the following (a sketch of such an extended faceted query follows the list):

• The ability to select several pieces of information for output (querying instead of search)
• A possibility for adding restrictions on several objects connected through roles, in order to allow joins
• A possibility to specify aggregation, like summation or averaging
• A possibility to specify the absence of information (e.g., that a vendor has no negative reviews)
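To illustrate what an interface extended in this way might maintain internally, the following sketch (hypothetical Python; the class and property names are invented and do not come from any cited system) represents the user's selections as a small query object covering several connected objects (a join), facet restrictions, selected outputs, and an optional aggregate, and renders it as a conjunctive query for the earlier camcorder example:

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class Obj:
        name: str                                     # query variable
        cls: str                                      # ontology class
        facets: dict = field(default_factory=dict)    # property -> (operator, value)

    @dataclass
    class FacetedQuery:
        objects: list                                 # objects the user has introduced
        links: list                                   # joins: (source var, property, target var)
        output: list                                  # variables to return
        aggregate: Optional[str] = None               # e.g. "COUNT" or "AVG"

        def to_conjunctive_query(self):
            atoms = [f"{o.cls}({o.name})" for o in self.objects]
            atoms += [f"{p}({s},{t})" for s, p, t in self.links]
            for o in self.objects:
                atoms += [f"{p}({o.name}) {op} {val}" for p, (op, val) in o.facets.items()]
            head = ", ".join(self.output)
            if self.aggregate:
                head = f"{self.aggregate}({head})"
            return f"q({head}) <- " + " AND ".join(atoms)

    # Camcorders offered by sellers who also offer a digital camera with >= 12 MP:
    q = FacetedQuery(
        objects=[Obj("s", "Seller"), Obj("c", "Camcorder"),
                 Obj("d", "DigitalCamera", {"resolutionMP": (">=", 12)})],
        links=[("s", "offers", "c"), ("s", "offers", "d")],
        output=["c"])
    print(q.to_conjunctive_query())
    # q(c) <- Seller(s) AND Camcorder(c) AND DigitalCamera(d) AND offers(s,c)
    #         AND offers(s,d) AND resolutionMP(d) >= 12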

The amalgamation of faceted search and navigational search, so-called query by navigation (ter Hofstede et al., 1996), is of significance for the realization of the aforementioned objectives. The navigational approach exploits the graph-based organization of the information to allow users to browse the information space by iteratively narrowing the scope. Stratified hypermedia (Bruza and van der Weide, 1992), a well-known example of the navigational approach, is an architecture in which information is organized via several layers of abstraction. The base layer contains the actual information, while other layers contain the abstraction of this information and enable access to the base layer. In a document retrieval setting, for example, the abstraction layer typically consists of keywords. An indexing process is required to characterize the documents and to construct the abstraction layer. However, the characterization of information instances in an ontology-based system is simple and provided by the reference ontology (ter Hofstede et al., 1996). The query by navigation approach is particularly supportive at the exploration phase of the query formulation (Marchionini and White, 2007). Recent applications of query by navigation are available in the Semantic Web domain in the form of textual semantic data browsers (e.g., Popov et al., 2011; Soylu et al., 2012).

A particular approach that combines faceted search and a diagrammatic form of query by navigation is presented in Heim and Ziegler (2011). The approach is based on the hierarchical organization of facets, and hence allows joins between several information collections. The main problem with such diagrammatic approaches and with textual data browsers is that they do not support dealing with large complex ontologies and schemata well, mainly lacking balance between overview and focus. For instance, a diagrammatic approach is good at providing an overview of the domain; however, it has its limits in terms of information visualization and users' cognitive bandwidths. A textual navigation approach is good at splitting the task into several steps; however, it can easily cause users to lose the overview. Therefore, it is not enough to provide navigation along the taxonomy and relations captured in the ontology. In many cases, it turns out that accessing data is difficult even for end-users who are very knowledgeable in their domain, for two reasons: (a) not only because of the complexity of the data model (which can be hidden using an ontology and mappings), but also (b) because of the complexity of an accurate description of the domain. Often an ontology that accurately describes all relevant details of the domain will be more complicated than the way even experienced domain experts usually think about it in their daily work. This means that they approach the task of query construction without having complete knowledge of all the details of the domain model. It is therefore necessary to develop novel techniques to support users in formulating coherent queries that correctly capture their requirements. In addition to navigation, a query formulation tool should allow searching by name for properties and concepts the expert knows must be available in the ontology. The system should help users understand the ontology by showing how the concepts and properties relevant for a query are interconnected.

For instance, assume that the user would like to list all digital cameras with ≥12 MP resolution. This sounds like a reasonable question that should have a unique interpretation. But the ontology might not actually assign a "resolution" to a camera. Rather, it might say that a camera has at least one image sensor, possibly several,* each of which has an effective and a total resolution. The camera also may or may not support a variety of video resolutions, independent of the sensor's resolution. The system should let the users search for "resolution," help them find chains of properties from "Camera" to the different notions of "Resolution," and help them find out whether all sensors need to have ≥12 MP, or at least one of them, etc., and which kind of resolution is meant.

* For instance, front-facing and rear-facing on a mobile phone, two sensors in a 3D camcorder.
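One way a tool could compute this kind of help is sketched below (illustrative Python over an invented fragment of such a camera ontology): a breadth-first search over the class/property graph enumerates the chains of properties leading from Camera to the different notions of Resolution, which the query formulation interface could then offer to the user as candidate interpretations of "resolution":

    from collections import deque

    # Hypothetical ontology graph: class -> [(property, target class), ...]
    EDGES = {
        "Camera":      [("hasImageSensor", "ImageSensor"), ("supportsVideoMode", "VideoMode")],
        "ImageSensor": [("effectiveResolution", "Resolution"), ("totalResolution", "Resolution")],
        "VideoMode":   [("videoResolution", "Resolution")],
    }

    def property_chains(start, goal, max_len=4):
        # enumerate property chains from `start` to `goal` by breadth-first search
        chains, queue = [], deque([(start, [])])
        while queue:
            cls, path = queue.popleft()
            for prop, target in EDGES.get(cls, []):
                new_path = path + [prop]
                if target == goal:
                    chains.append(new_path)
                elif len(new_path) < max_len:
                    queue.append((target, new_path))
        return chains

    for chain in property_chains("Camera", "Resolution"):
        print(" / ".join(chain))
    # hasImageSensor / effectiveResolution
    # hasImageSensor / totalResolution
    # supportsVideoMode / videoResolution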

For complex queries, any intuitive user interface for nontechnical users will eventually reach its limits. It is therefore important to also provide a textual query interface for technically versed users, which allows direct editing of a query using a formal syntax such as the W3C SPARQL language. Ideally, both interfaces provide views on an underlying partially constructed query, and users can switch between views at will. Even in the textual interface, there should be more support than present-day interfaces provide, in the form of context-sensitive completion (taking account of the ontology), navigation support, etc. (as is done, e.g., in the input fields of the Protégé ontology editor; Knublauch et al., 2005).
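As a small illustration of such ontology-aware completion (again hypothetical Python with an invented vocabulary), the editor could restrict prefix matches to properties whose declared domain is compatible with the class of the entity currently being edited:

    # Hypothetical vocabulary: property -> declared domain class.
    PROPERTY_DOMAINS = {
        "hasFault": "Generator", "hasImageSensor": "Camera",
        "effectiveResolution": "ImageSensor", "totalResolution": "ImageSensor",
    }
    SUPERCLASSES = {"Generator": {"Equipment"}, "Camera": {"Equipment"}}   # assumed transitive closure

    def complete(prefix, current_class):
        # suggest properties starting with `prefix` whose domain fits `current_class`
        compatible = {current_class} | SUPERCLASSES.get(current_class, set())
        return sorted(p for p, dom in PROPERTY_DOMAINS.items()
                      if p.lower().startswith(prefix.lower()) and dom in compatible)

    print(complete("has", "Generator"))   # ['hasFault']
    print(complete("has", "Camera"))      # ['hasImageSensor']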

Finally, no ontology can be expected to cover a domain's vocabulary completely. The vocabulary is to a certain extent specific to individuals, projects, departments, etc. and subject to change. To adapt to changing vocabularies, cater for omissions in the ontology, and to allow a light-weight process for ontology development, the query formulation component should also support "on the fly" extension of the ontology during query formulation. This can be achieved by adapting techniques from ontology learning (Cimiano et al., 2005; Cimiano, 2006) in order to identify relevant concepts and relations, and adapting techniques from ontology alignment (aka matching) in order to relate this new vocabulary to existing ontology terms. In case such on-the-fly extensions are insufficient, users should also have access to the range of advanced tools and methodologies discussed in the "Ontology and Mapping Management" section, although they may require assistance from an IT expert in order to use such tools.

                Ontology and Mapping Management

The OBDA architecture proposed in this chapter depends crucially on the existence of suitable ontologies and mappings. In this context, the ontology provides a user-oriented conceptual model of the domain that makes it easier for users to formulate queries and understand answers. At the same time, the ontology acts as a "global schema" onto which the schemas of various data sources can be mapped.

Developing suitable ontologies from scratch is likely to be expensive. A more cost-effective approach is to develop tools and methodologies for semiautomatically "bootstrapping" the system with a suitable initial ontology and for extending the ontology "on the fly" as needed by a given application. This means that in this scenario, ontologies are dynamic entities that evolve over time, both to capture end-users' evolving vocabulary and to accommodate new data sources. In both cases, some way is needed to ensure that vocabulary and axioms are added to the ontology in a coherent way.

Regarding the ontology/data-source mappings, many of these will, like the ontology, be generated automatically from either database schemata and other available metadata or formal installation models. However, these initial mappings are unlikely to be sufficient in all cases, and they will certainly need to evolve along with the ontology. Moreover, new data sources may be added, and this again requires extension and adjustment of the mappings. The management of large, evolving sets of mappings must be seen as an engineering problem on the same level as that of ontology management.

Apart from an initial translation from structured sources like, for example, a relational database schema, present-day ontology management amounts to using interactive ontology editors like Protégé, NeOn, or TopBraid Composer. These tools support the construction and maintenance of complex ontologies, but they offer little support for the kind of ontology evolution described above.

The issue of representing and reasoning about schema mappings has been widely investigated in recent years. In particular, a large body of work has been devoted to studying operators on schema mappings relevant to model management, notably composition, merge, and inverse (Madhavan and Halevy, 2003; Fagin et al., 2005c, 2008b, 2009b; Kolaitis, 2005; Bernstein and Ho, 2007; Fagin, 2007; Arenas et al., 2009, 2010a,b; Arocena et al., 2010). In Fagin et al. (2005a,b), Arenas et al. (2004), Fuxman et al. (2005), and Libkin and Sirangelo (2008), the emphasis is on providing foundations for data interoperability systems based on schema mappings. Other works deal with answering queries posed to the target schema on the basis of both the data at the sources and a set of source-to-target mapping assertions (e.g., Abiteboul and Duschka, 1998; Arenas et al., 2004; Calì et al., 2004); see also the surveys by Ullman (1997), Halevy (2001), and Halevy et al. (2006).

Another active area of research is principles and tools for comparing both schema mapping languages and schema mappings expressed in a certain language. Comparing schema mapping languages aims at characterizing such languages in terms of both expressive power and complexity of mapping-based computational tasks (ten Cate and Kolaitis, 2009; Alexe et al., 2010). In particular, ten Cate and Kolaitis (2009) studied various relational schema mapping languages with the goal of characterizing them in terms of structural properties possessed by the schema mappings specified in these languages. Methods for comparing schema mappings have been proposed in Fagin et al. (2008a, 2009b), Gottlob et al. (2009), and Arenas et al. (2010a), especially in the light of the need of a theory of schema mapping optimization. In Fagin et al. (2009b) and Arenas et al. (2010a), schema mappings are compared with respect to their ability to transfer source data and avoid redundancy in the target databases, as well as their ability to cover target data. In Fagin et al. (2008a), three notions of equivalence are introduced. The first one is the usual notion based on logic: two schema mappings are logically equivalent if they are indistinguishable by the semantics, that is, if they are satisfied by the same set of database pairs. The other two notions, called data exchange and conjunctive query equivalence, respectively, are relaxations of logical equivalence, capturing indistinguishability for different purposes.

Most of the research mentioned above aims at methods and techniques for analyzing schema mappings. However, mapping management is a broader area, which includes methods for supporting the development of schema mappings, debugging such mappings, or maintaining schema mappings when some part of the specification (e.g., one of the schemas) changes. Although some tools are already available (e.g., CLIO; Fagin et al., 2009a), and some recent papers propose interesting approaches (e.g., Glavic et al., 2010), this problem is largely unexplored, especially in the realm of OBDA. Specifically, the following problems are so far unsolved in the area, but are crucial in dealing with complex scenarios:

                  1. Once a set of mappings has been defined, the designer often needs to analyze them, in order to verify interesting properties (e.g., minimality).

                  2. Mappings in OBDA systems relate the elements of the ontology to the data structures of the underlying sources. When the ontology changes, some of the mappings may become obsolete. Similarly, when the sources change, either because new sources are added, or because they undergo modifications of various types, the mappings may become obsolete.

3. Different types of mappings (LAV, GAV, etc.) have been studied in the literature. It is well known that the different types have different properties from the point of view of expressive power of the mapping language and computational complexity of mapping-based tasks. The ideal situation would be to use rich mapping languages during the design phase and then transform the mappings in such a way that efficient query answering is possible with them. This kind of transformation is called mapping simplification. Given a set of mappings M, the goal of simplification is to come up with a set of mappings M* that is expressed in a tractable class C and approximates M as closely as possible, that is, such that no set M′ of mappings in C exists which is "closer" to M than M*.

Regarding ontologies, the required management and evolution described above could be reached by a combination of different techniques, including ontology alignment (Shvaiko and Euzenat, 2005; Jiménez-Ruiz et al., 2009) and ontology approximation (Brandt et al., 2001; Pan and Thomas, 2007). Both the addition of user-defined vocabulary from a query formulation process and the incorporation of the domain model for a new data source are instances of such alignment tasks. However, the results of aligning new vocabulary or new knowledge coming from new data sources with the existing ontology do not necessarily fall within the constrained fragments required for efficient OBDA (such as OWL 2 QL; Calvanese et al., 2007b). This problem can be dealt with by an approximation approach, that is, transforming the ontology into one that is as expressive as possible while still falling within the required profile. In general, finding an optimal approximation may be costly or even undecidable, but effective techniques are known for producing good approximations (Brandt et al., 2001; Pan and Thomas, 2007).

Concerning mapping management, in order to be able to freely analyze schema mappings, one possibility is to define a specific language for querying schema mappings. The goal of the language is to support queries of the following types: return all mappings that map concepts that are subsets of concept C; or, return all mappings that access table T in the data source S. The basic step is to define a formal meta-model for mapping specification, so that queries over schema mappings will be expressions over this meta-model. A query language can thus be defined over such a meta-model: the general idea is to design the language in such a way that important properties (e.g., scalability of query answering) will be satisfied.
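The following sketch (hypothetical Python; the meta-model, the catalog, and the subconcept table are invented for illustration) shows how such a meta-model might represent mappings so that the two example queries above become simple lookups:

    from dataclasses import dataclass

    @dataclass
    class Mapping:                 # a toy meta-model for GAV mappings
        ontology_term: str         # the ontology term the mapping defines
        source: str                # the data source the query runs against
        tables: tuple              # tables accessed by the source query
        sql: str                   # the source query itself

    SUBCONCEPTS = {"Fault": {"Fault", "TurbineFault", "CondenserFault"}}   # assumed closure

    catalog = [
        Mapping("Generator",      "S1", ("Generator",), "SELECT id FROM Generator"),
        Mapping("CondenserFault", "S1", ("Fault",),     "SELECT id FROM Fault WHERE type = 'C'"),
        Mapping("hasFault",       "S2", ("hasFault",),  "SELECT g-id, f-id FROM hasFault"),
    ]

    def mappings_for_subconcepts_of(concept):
        # all mappings defining a concept subsumed by `concept`
        return [m for m in catalog if m.ontology_term in SUBCONCEPTS.get(concept, {concept})]

    def mappings_accessing(table, source):
        # all mappings whose source query accesses `table` in data source `source`
        return [m for m in catalog if m.source == source and table in m.tables]

    print([m.ontology_term for m in mappings_for_subconcepts_of("Fault")])   # ['CondenserFault']
    print([m.ontology_term for m in mappings_accessing("Fault", "S1")])      # ['CondenserFault']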

Based on this meta-model, reasoning techniques could be designed that support the evolution of schema mappings. The meta-model could also be used in order to address the issue of monitoring changes and reacting to them. Indeed, every change to the elements involved in the schema mappings may be represented as specific updates on the instance level of the meta-model. The goal of such a reasoning system is to specify the actions to perform, or the actions to suggest to the designer, when these update operations change the instances of the meta-model.

                Query Transformation

The OBDA architecture proposed in this chapter relies heavily on query rewriting techniques. The motivation for this is the ability of such techniques to separate ontology reasoning from data reasoning, which can be very costly in the presence of Big Data. However, although these techniques have been studied for several years, applying this technology to Big Data introduces performance requirements that go far beyond what can be obtained with simple approaches. In particular, emphasis must be put both on the performance of the rewriting process and on the performance of the evaluation of the queries generated by it. At the same time, meeting these performance requirements can be achieved by building on top of the experiences in the area of OBDA optimization, which we now briefly mention.

In the context of query rewriting with respect to the ontology only (i.e., without considering the mappings), recent results (e.g., Rosati and Almatelli, 2010; Kikot et al., 2011) have shown that performing query rewriting by means of succinct query expressions, for example, nonrecursive Datalog programs, can be orders of magnitude faster than the approaches that produce UCQs (unions of conjunctive queries) a priori (Calvanese et al., 2007b; Pérez-Urbina et al., 2008; Calì et al., 2009; Chortaras et al., 2011). Moreover, these succinct query representations are, in general, cheaper to deal with during optimization, since the structure that needs to be optimized is smaller. Complementary to these results are optimization results for OBDA systems in which the data are in control of the query answering engine, where dramatic improvements can be achieved when load-time precomputation of inferences is allowed. In particular, it has been shown in Rodríguez-Muro and Calvanese (2011) that full materialization of inferences is not always necessary to obtain these benefits, and that it is possible to capture most of the semantics of DL-Lite ontologies by means of simple and inexpensive indexing structures in the data-storage layer of the query answering system. These precomputations allow one to further optimize the rewriting process and the queries returned by this process.

In the context of query rewriting in the presence of mappings and where the data sources cannot be modified by the query answering system, a point of departure are recent approaches that focus on the analysis of the data sources and the mappings of the OBDA system (Rodríguez-Muro, 2010). Existing approaches (Rodríguez-Muro and Calvanese, 2011) focus on detecting the state of completeness of the sources with respect to the semantics of the ontology. The result of this analysis can be used for at least two types of optimization, namely: (i) optimization of the ontology and mappings used during query rewriting (offline optimization) and (ii) optimization of the rewritten queries (online optimization). For the former, initial work can be found in the semantic preserving transformations explored in Rodríguez-Muro and Calvanese (2011). For the latter, early experiences (Pérez-Urbina et al., 2008; Rodríguez-Muro, 2010; Calvanese et al., 2011; Rodríguez-Muro and Calvanese, 2012) suggest that traditional theory of Semantic Query Optimization (SQO; Grant et al., 1997) can be applied in the OBDA context as long as the chosen rewriting techniques generate queries that are cheap to optimize using SQO techniques (e.g., Rosati and Almatelli, 2010; Kikot et al., 2011). Complementarily, a first-order logics-based approach to SQO in the context of semantic data formats and reasoning has been proposed in Schmidt et al. (2010). Finally, previous experiences also suggest that in the context of query rewriting into SQL, obtaining high performance is not guaranteed by using an optimized DBMS system; instead, it has been shown (Rodríguez-Muro, 2010) that the form of the SQL query (e.g., use of subqueries, views, nested expressions, etc.) plays a critical role and that even in commercial DBMS engines, care must be taken to guarantee that the SQL queries are in a form that the DBMS can plan and execute efficiently.
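As a small, self-contained example of the kind of semantic query optimization referred to here (illustrative Python; the query and dependency representations are invented), a classical optimization removes a join atom that is redundant because an inclusion dependency, such as a foreign key from the hasFault table to the Generator table, guarantees that the join can never filter out answers:

    # Conjunctive query q(g) <- Generator(g), hasFault(g,f), TurbineFault(f)
    query = [("Generator", ("g",)), ("hasFault", ("g", "f")), ("TurbineFault", ("f",))]

    # Inclusion dependencies: (predicate, position) -> (predicate, position), stating that
    # every value in the first column also occurs in the second (here: a foreign key).
    INCLUSIONS = {("hasFault", 0): ("Generator", 0)}

    def remove_redundant_atoms(atoms):
        result = list(atoms)
        for atom in atoms:
            pred, args = atom
            if len(args) != 1:
                continue                     # this sketch only drops single-variable atoms
            others = [a for a in result if a is not atom]
            covered = any(INCLUSIONS.get((p, i)) == (pred, 0)
                          for p, vals in others for i, v in enumerate(vals) if v == args[0])
            if covered:                      # membership is already guaranteed by another atom
                result = others
        return result

    print(remove_redundant_atoms(query))
    # [('hasFault', ('g', 'f')), ('TurbineFault', ('f',))] -- the Generator atom was dropped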

Regarding systems, a lot of experience has been accumulated in the area. In particular, most of the aforementioned rewriting techniques, as well as optimization techniques, have been accompanied by prototypes that were used to benchmark and study the applicability of these techniques empirically. The first example of these systems is QuOnto (Acciarri et al., 2005), a system that implements the core algorithms presented in Calvanese et al. (2007b) and that seeded the idea of query answering through query rewriting in the context of Description Logic (DL) ontologies. QuOnto has also served as a platform for the implementation of the epistemic-query answering techniques proposed in Calvanese et al. (2007a) and served as a basis for the Mastro system (Calvanese et al., 2011), which implements OBDA-specific functionality. While these systems allowed for query answering over actual databases, initially they paid little attention to the performance issue. Because of this, subsequent prototypes focused strongly on the performance of the query rewriting algorithms; examples are Requiem (Pérez-Urbina et al., 2010), which implemented the resolution-based query rewriting techniques from Pérez-Urbina et al. (2008), and Presto (Rosati and Almatelli, 2010), which implemented a succinct query translation based on nonrecursive Datalog programs. Finally, the latest generation OBDA systems such as Quest (Rodríguez-Muro and Calvanese, 2012) and Prexto (Rosati, 2012) have focused on the exploitation of efficient rewriting techniques, SQO optimization, as well as the generation of efficient SQL queries.

At the same time, while these initial steps toward performance are promising, there are many challenges that arise in the context of industrial applications and Big Data that are not covered by current techniques. For example, optimizations of query rewriting techniques have only been studied in the context of rather inexpressive ontology and query languages such as OWL 2 QL/DL-Lite and UCQs; however, empirical evidence indicates that none of these languages is enough to satisfy industrial needs. Also, current proposals for optimization using constraints have considered only the use of a few classes of constraints, in particular, only simple inclusion dependencies, and little attention has been given to the use of functional dependencies and other forms of constraints that allow one to represent important features of the sources and that are relevant for query answering optimization.

                  Likewise, optimization of OBDA systems has so far only been considered either in a pure "on-the-fly" query rewriting context, in which sources are out of the scope of the query answering system, or in a context in which the data has been removed from the original source and transformed into an ABox. However, the experience that has been obtained experimenting with the current technology indicates that, in practice, a middle ground could give rise to a higher degree of optimization of the query answering process. It also appears that in the context of Big Data and the complex analytical queries that are often used in this context, good performance cannot be achieved otherwise and these hybrid approaches might be the only viable alternative. It has also become clear that declarative OBDA might not be the best choice for every task: for some computations, dedicated procedures can be more efficient, and hence an OBDA system should provide the means to define such procedures (e.g., by means of domain-specific plug-ins).

                Figure 6.3 Fine structure of the query transformation component. (The figure shows the query transformation component with its query rewriting, ontology/mapping optimization, and query planning modules, query answering and configuration plug-ins, a stream adapter, and query execution over several sites and auxiliary sources.)

                  To conclude, an optimal system for query answering through query rewriting in the context of Big Data must be approached in an integral way, including modules that handle and optimize each of the aspects of the query answering process, while trying to maximize the benefits that are obtained by separating reasoning with respect to the ontology vs. reasoning with respect to the data. The resulting architecture of such a system may look like the one proposed in this chapter and is depicted in Figure 6.3, where all optimization techniques previously mentioned are combined into a single framework that is expressive enough to capture industrial requirements, can understand the data sources (in the formal sense), and is able to identify the best way to achieve performance, being able to go from pure on-the-fly query answering to (partially) materialized query answering as needed.

                Time and Streams

                  Time plays an important role in many industrial applications. Hence, OBDA-based solutions for such applications have to provide means for efficiently storing and querying time-stamped data. If we recast these user requirements in terms of the components involved in the processing


                  of a query, we come to the conclusion that, first, the user query language should allow the reference to time (instances, intervals) and allow for ade- quate combinations with concepts of the ontology; that, second, the mapping language should allow the handling of time; and that, third, the back-end database should provide means for efficiently storing and retrieving tem- poral data, in particular, it should provide a temporal query language into which the user query will be transformed. One might also want to add a requirement for the ontology language such that it becomes possible to build temporal-thematic concepts in the user ontology; but regarding well-known unfeasibility results on temporal description logics (Artale et al., 2010), we will refrain from discussing any aspect concerning temporal constructors for ontology languages and rather focus on temporal query languages and temporal DBs.

                  While SQL provides built-in data types for times and dates, which can be used, for instance, for representing birthday data, representing validity of facts using, say, two attributes Start and End imposes severe problems for formulating queries in SQL. For instance, in a Big Data scenario involving possibly mobile sensors of one or more power plants, measurement values might be stored in a table Sensor with schema

                  

                Sensor (ID,Location,Value,Start,End),

                  and it might happen that the location changes while the value remains the same.

                  

                ID      Location   Value   Start   End
                ...     ...        ...     ...     ...
                S_42    Loc_1      16      15      20
                S_42    Loc_1      17      20      25
                S_42    Loc_2      17      25      30
                S_42    Loc_2      18      30      35
                ...     ...        ...     ...     ...
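                  For readers who wish to try the queries of this section, the following PostgreSQL-style sketch (ours, not part of the original example) creates and populates the relation shown above; note that End is a reserved word in SQL and therefore has to be quoted in most systems.

                  CREATE TABLE Sensor (
                    ID       VARCHAR(10),
                    Location VARCHAR(10),
                    Value    INTEGER,
                    Start    INTEGER,
                    "End"    INTEGER   -- quoted: END is a reserved word
                  );

                  INSERT INTO Sensor VALUES
                    ('S_42', 'Loc_1', 16, 15, 20),
                    ('S_42', 'Loc_1', 17, 20, 25),
                    ('S_42', 'Loc_2', 17, 25, 30),
                    ('S_42', 'Loc_2', 18, 30, 35);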

                  Now, querying for the (maximum) duration of a measurement with a par- ticular value 17 (and neglecting the location of the sensor) should return a relation {(S_42,17,20,30)}.

                  Although in principle one could specify an SQL query that maximizes the interval length to be specified in result tuples (see, e.g., Zaniolo et al. 1997 for pointers to the original literature in which solutions were developed), such a query is complex to formulate and will hardly be


                  optimized appropriately by standard SQL query engines. Even worse, if only (irregular) time points are stored for measurements, one has to find the next measurement of a particular sensor and time point by a minimization query, and the problem of maximizing validity intervals in output relations as described above remains. In addition, an attribute Timepoint might also refer to the insertion time (transaction time) of the tuple rather than to the valid time as we have assumed in the discussion above.
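                  To make the first of these problems concrete, the following sketch shows one way to coalesce value-equal intervals in plain SQL, in the spirit of the solutions surveyed by Zaniolo et al. (1997); it is our own formulation, restricted to the value 17 of the example and ignoring Location, and it is meant to illustrate how convoluted such queries become rather than to serve as a reference implementation.

                  SELECT DISTINCT F.ID, F.Value, F.Start, L."End"
                  FROM   Sensor F, Sensor L
                  WHERE  F.ID = L.ID AND F.Value = L.Value
                    AND  F.Value = 17                 -- the measurement value of the example
                    AND  F.Start < L."End"
                    -- no gap: every tuple of the group starting inside (F.Start, L."End"]
                    -- is preceded by an overlapping or meeting tuple of the same group
                    AND NOT EXISTS (
                          SELECT * FROM Sensor M
                          WHERE  M.ID = F.ID AND M.Value = F.Value
                            AND  F.Start < M.Start AND M.Start <= L."End"
                            AND NOT EXISTS (
                                  SELECT * FROM Sensor T1
                                  WHERE  T1.ID = F.ID AND T1.Value = F.Value
                                    AND  T1.Start < M.Start AND M.Start <= T1."End"))
                    -- the combined interval cannot be extended to the left or to the right
                    AND NOT EXISTS (
                          SELECT * FROM Sensor T2
                          WHERE  T2.ID = F.ID AND T2.Value = F.Value
                            AND ((T2.Start < F.Start AND F.Start <= T2."End")
                              OR (T2.Start <= L."End" AND L."End" < T2."End")));

                  Over the example data this yields exactly {(S_42,17,20,30)}, but the nesting depth already hints at why such formulations are problematic for query optimizers.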

                  In order to support users in formulating simpler queries for access- ing temporal information appropriately, extensions to relational database technology and query languages such as SQL have been developed (e.g., TSQL2; see Zaniolo et  al., 1997 for an overview). The time ontology usu- ally is defined by a linear time structure, a discrete representation of the real-time line, and proposals for language standards as well as implementa- tions provide data types for intervals or timestamps. A useful distinction adapted from constraint databases (Kuper et  al., 2000) is the one between abstract and concrete temporal databases (Chomicki and Toman, 2005). The representation- independent definitions of temporal databases relying on the infinite structures of the time ontology are called abstract temporal data- bases; these are the objects relevant for describing the intended semantics for query answering. Finite representations of abstract temporal databases are termed concrete temporal databases (Chomicki and Toman, 2005); these rely on compact representations by (time) intervals.

                  Temporal databases provide means of distinguishing between valid time and transaction time. Valid time denotes the time period during which a fact is considered to be true (i.e., to hold with respect to the real world). Transaction time denotes the time point (or time period) during which a fact is stored in the database. It might be the case that valid time has to be derived from transaction time (e.g., due to a sampling interval). In the case of transaction time points, valid time often has to be derived by retrieving the "next" entry, assuming an assertion is valid until the next one appears. Both kinds of time may also be stored in the database, leading to so-called bitemporal databases (Jensen et al., 1993). Using a temporal database, a query for checking which values the sensors indicate between time units 23 and 27 should be as easy as in the following example:

                  

                SELECT ID, Value FROM Sensor WHERE Start >= 23 and End <= 27;

                  with the intended result being a single tuple {(S_42,17)}.

                  The reason for expecting this result can be explained by the abstract vs. concrete distinction. The table with the mobile sensor from the beginning of this section is a concrete temporal database that


                  represents an abstract temporal database. The abstract temporal database holds relations of the form Sensor(ID,Location,Value,T) meaning that sensor ID, located in Location, has a value Value measured/ valid in time T for all T such that there is an entry in the concrete temporal database with Start and End values in between which T lies.
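                  In a system with set-returning functions, this correspondence can be made explicit; the following PostgreSQL-specific sketch (ours, assuming integer time points and validity intervals that include both endpoints) unfolds the concrete table into the abstract relation:

                  -- one tuple per time point T covered by a stored interval [Start, "End"]
                  SELECT ID, Location, Value, t AS T
                  FROM   Sensor, generate_series(Start, "End") AS t;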

                  Note that the resulting answer to the last query above is the empty set if the mobile sensor data are understood as being part of a pure (nontemporal) SQL DB.
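                  One way to approximate the intended semantics directly over the concrete table is to turn the condition into an interval-overlap test; this is our own sketch of one possible reading of the query, not a feature of the temporal SQL extensions discussed next:

                  -- values reported at some point between time units 23 and 27: the stored
                  -- validity interval [Start, "End"] must overlap the query window [23, 27]
                  SELECT DISTINCT ID, Value
                  FROM   Sensor
                  WHERE  Start <= 27 AND "End" >= 23;

                  Over the example data this returns {(S_42,17)}, as intended.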

                  One could extend the simple SQL query from above to a temporal SQL query that also retrieves the locations of the sensors.

                  

                SELECT ID, Location, Value FROM Sensor WHERE Start >= 20 and

                End <= 27;

                  The expected result with respect to the semantics of abstract temporal databases is {(S_42,Loc_1,17),(S_42,Loc_2,17)}. Again note that the resulting answer set would have been different for nontemporal SQL, namely

                  {(S_42,Loc_1,17)}. A third example of querying a temporal database is given by the following query, which retrieves start and end times of sensor readings. For the query

                  

                SELECT ID, Value, Start, End FROM Sensor WHERE Start <= 23 and

                End >= 27;

                  the expected result is {(S_42,17,20,30)}.

                  Index structures for supporting these kinds of queries have been developed, and add-ons to commercial products offering secondary-memory query answering services such as those sketched above are on the market. Although standards have been proposed (e.g., ATSQL), no agreement has been reached yet. Open source implementations for mapping ATSQL to SQL have been provided as well (e.g., TimeDB; Steiner, 1997).


                  For many application scenarios, however, only small “windows” of data are required, and thus, storing temporal data in a database (and in exter- nal memory) as shown above might cause a lot of unnecessary overhead in some application scenarios. This insight gave rise to the idea of stream- based query answering. In addition to window-based temporal data access, stream-based query answering adopts the view that multiple queries are registered and assumed to be answered “continuously.” For this kind of con- tinuous query answering, appropriate index structures and join algorithms have been developed in the database community (see, e.g., Cammert et al. 2003, 2005 for an overview). Data might be supplied incrementally by mul- tiple sources. Combining these sources defines a fused stream of data over which a set of registered queries is continuously answered. In stream-based query answering scenarios, an algebraically specified query (over multiple combined streams set up for a specific application) might be implemented by several query plans that are optimized with respect to all registered queries. An expressive software library for setting up stream-based processing sce- narios is described in Cammert et al. (2003).

                  Rather than by accessing the whole stream, continuous queries refer to only a subset of all assertions, which is defined by a sliding time window. Interestingly, the semantics of sliding windows for continuous queries over data streams is not easily defined appropriately, and multiple proposals exist in the literature (e.g., Zhang et al., 2001; Krämer and Seeger, 2009).

                  For event recognition, temporal aggregation operators are useful exten- sions to query languages, and range predicates have to be supported in a special way to compute temporal aggregates (Zhang et  al., 2001). In addi- tion, expectation on trajectories helps one to answer continuous queries in a faster way (Schmiegelt and Seeger, 2010). Moreover, it is apparent that the “best” query plan might depend on the data rates of various sources, and dynamic replanning might be required to achieve best performance over time (Krämer et al., 2006; Heinz et al., 2008).

                  While temporal query answering and stream-based processing have been discussed for a long time in the database community (e.g., Law et al., 2004), recently NOSQL data representation formats and query answering lan- guages have become more and more popular. Besides XML, for example, the Resource Description Format (RDF) has been investigated in temporal or stream-based application contexts. Various extensions to the RDF query language SPARQL have been proposed for stream-based access scenarios in the RDF context (Bolles et al., 2008; Barbieri et al., 2010b; Calbimonte et al., 2010). With the advent of SPARQL 1.1, aggregate functions are investigated in this context as well.

                  Streaming SPARQL (Bolles et  al., 2008) was one of the first approaches based on a specific algebra for streaming data. However, data must be provided "manually" in RDF in this approach. On the other hand, mappings for relating source data to RDF ontologies in an automated way have been investigated as well (e.g., Calbimonte


                  et al., 2010). In contrast to OBDA methods, nowadays these approaches require the materialization of structures at the ontology level (RDF) in order to pro- vide the input data for stream-based query systems. For instance, C-SPARQL queries are compiled to SPARQL queries over RDF data that was produced with specific mappings. C-SPARQL deals with entailments for RDFS or OWL

                  2 RL by relying on incremental materialization (Barbieri et  al., 2010a). See also Ren and Pan (2011) for an approach based on EL++.

                  SPARQLstream (Calbimonte et al., 2010) provides for mappings to ontology notions and translates to stream-based queries to SNEEQL (Brenninkmeijer et al., 2008), which is the query language for SNEE, a query processor for wireless sensor networks. Stream-based continuous query answering is often used in monitoring applications for detecting events, possibly in real time. EP-SPARQL (Anicic et al., 2011a), which is tailored for complex event processing, is translated to ETALIS (Event TrAnsaction Logic Inference System; Anicic et al., 2011b), a Prolog-based real-time event recognition sys- tem based on logical inferences.

                  While translation to SQL, SPARQL, or other languages is attractive with respect to reusing existing components in a black box approach, some infor- mation might be lost, and the best query execution plan might not be found. Therefore, direct implementations of stream-based query languages based on RDF are also investigated in the literature. CQELS (Phuoc et al., 2011) is a much faster “native” implementation (and does not rely on transformation to underlying non-stream-based query languages). In addition, in the latter stream-based querying approach, queries can also refer to static RDF data (e.g., linked open data). In addition, a direct implementation of temporal and static reasoning with ontologies has been investigated for media data inter- pretation in Möller and Neumann (2008) and Peraldi et al. (2011). Event rec- ognition with respect to expressive ontologies has been investigated recently (Wessel et al., 2007, 2009; Luther et al., 2008; Baader et al., 2009).

                  As we have seen, it is important to distinguish between temporal que- ries and window-based continuous queries for streams. Often the latter are executed in main memory before data are stored in a database, and much work has been carried out for RDF. However, temporal queries are still important in the RDF context as well. T-SPARQL (Grandi, 2010) applies techniques from temporal DBs (TSQL2, SQL/Temporal, TimeDB) to RDF querying (possibly also with mappings to plain SQL) to define a query language for temporal RDF. For data represented using the W3C standard RDF, an approach for temporal query answering has been developed in an industrial project (Motik, 2010). It is shown that ontology-based answering of queries with specific temporal operators can indeed be realized using a translation to SQL.

                  In summary, it can be concluded that there is no unique semantics for the kind of queries discussed above, that is, neither for temporal nor for stream-based queries. A combination of stream-based (or window-based), temporal, and static querying is rarely supported; usually only one of these modes is


                  provided at a time by most approaches. Early work on deductive event recog- nition (Neumann, 1985; Neumann and Novak, 1983, 1986; André et al., 1988; Kockskämper et al., 1994) already contains many ideas of recently published efforts, and in principle a semantically well-founded combination of quan- titative temporal reasoning with respect to valid time has been developed. However, scalability was not the design goal of these works.

                  While database-based temporal querying approaches or RDF-based tem- poral and stream-querying approaches as discussed above offer fast per- formance for large data and massive streams with high data rates, query answering with respect to (weakly expressive) ontologies is supported only with brute-force approaches such as materialization. It is very unlikely that this approach results in scalable query answering for large real-world ontol- ogies of the future due to the enormous blowup (be the materialization man- aged incrementally or not). Furthermore, reasoning support is quite limited, that is, the expressivity of the ontology languages that queries can refer to is quite limited. Fortunately, it has been shown that brute-force approaches involving materialization for ontology-based query answering are not required for efficiently accessing large amounts of data if recently developed OBDA techniques are applied.

                  A promising idea for scalable stream-based answering of continuous que- ries is to apply the idea of query transformation with respect to ontologies also for queries with temporal semantics. Using an ontology and mapping rules to the nomenclature used in particular relational database schemas, query formulation is much easier. Scalability can, for example, be achieved by a translation to an SQL engine with temporal extensions and native index structures and processing algorithms (e.g., as offered by Oracle).

                  As opposed to what current systems offer, stream-based processing of data usually does not give rise to the instantiation of events with absolute cer- tainty. Rather, data acquired (observations) can be seen as cues that have to be aggregated or accumulated in order to be able to safely infer that a cer- tain event has occurred. These events might be made explicit in order to be able to refer to them directly in subsequent queries (rather than recomputing them from scratch all the time). The central idea of Gries et al. (2010) is to use aggregation operators for data interpretation. Note that interpretation is more than mere materialization of the deductive closure: with interpre- tation, new and relevant data are generated to better focus temporal and stream-based query answering algorithms.

                  Distributed Query Execution

                  In the past, OBDA approaches simply assumed centralized query execution


                  (Savo et  al., 2010). However, this assumption does not usually hold in the real world where data are distributed over many autonomous, heteroge- neous sources. In addition, existing relational database systems, such as PostgreSQL, cannot scale when faced with TBs of data and the kinds of com- plex queries to be generated by a typical OBDA query translation component (Savo et al., 2010).

                  Relevant research in this area includes previous work on query process- ing in parallel, distributed, and federated database systems, which has been studied for a long time by the database community (Sheth, 1991; DeWitt and Gray, 1992; Özsu and Valduriez, 1999; Kossmann, 2000). Based on principles established in these pioneering works, recently also a variety of approaches for federated query processing in the context of semantic data processing have been proposed (see Görlitz and Staab, 2011 for a recent survey). Falling into this category, our own work on FedX (Schwarte et al., 2011) presents a federated query processing engine operating on top of autonomous seman- tic databases. The FedX engine enables the virtual integration of heteroge- neous sources and implements efficient query evaluation strategies, driven by novel join processing and grouping techniques to minimize the number of requests sent to the federation members. These techniques are based on innovative source selection strategies, pursuing the goal to identify mini- mal sets of federation members that can contribute answers to the respective subqueries. Coming with all these features, FedX can easily be leveraged to OBDA scenarios whenever the source systems scale with the amounts of data and queries to be processed in the concrete ODBA scenario.

                  For truly large scale, heterogeneous data stores, efficient evaluation of queries produced by the query translation component discussed in section "Query Transformation" requires massively parallel and distributed query execution. To cover such scenarios, cloud computing has attracted much attention in the research community and software industry. Thanks to virtualization, cloud computing has evolved over the years from a paradigm of basic IT infrastructures used for a specific purpose (clusters), to grid computing, and recently to several paradigms of resource provisioning services: depending on the particular needs, infrastructures (IaaS—Infrastructure as a Service), platforms (PaaS—Platform as a Service), and software (SaaS—Software as a Service) can be provided as services (Gonzalez et al., 2009). One of the important advantages of these newest incarnations of cloud computing is the cost model of resources. Clusters represent a fixed capital investment made up-front and a relatively small operational cost paid over time. In contrast, IaaS, PaaS, and SaaS clouds are characterized by elasticity (Kllapi et al., 2011), and offer their users the ability to lease resources only for as long as needed, based on a per-quantum pricing scheme, for example, one hour on Amazon EC2. Together with the lack of any up-front cost, this represents a major benefit of clouds over earlier approaches.


                  The ability to use computational resources that are available on demand challenges the way that algorithms, systems, and applications are imple- mented. Thus, new computing paradigms that fit closely the elastic compu- tation model of cloud computing were proposed. The most popular of these paradigms today is MapReduce (Dean and Ghemawat, 2008). The intuitive appeal of MapReduce, and the availability of platforms such as Hadoop (Apache, 2011), has recently fueled the development of Big Data platforms that aim to support the query language SQL on top of MapReduce (e.g., Hive; Thusoo et al. 2010 and HadoopDB; Bajda-Pawlikowski et al. 2011).
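                  To give a flavor of this approach, the following is a hypothetical HiveQL-style query (not taken from the cited works) over the sensor relation used earlier; a SQL-on-MapReduce engine compiles the GROUP BY into a map phase that emits the sensor ID as key and a reduce phase that aggregates the readings per key.

                  SELECT   ID, COUNT(*) AS readings, AVG(Value) AS avg_value
                  FROM     Sensor
                  GROUP BY ID;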

                  Our own work on massively parallel, elastic query execution for Big Data takes place in the framework of the Athena Distributed Processing (ADP) sys- tem (Tsangaris et al., 2009; Kllapi et al., 2011). Massively parallel query execu- tion is the ability to run queries with the maximum amount of parallelism at each stage of execution. Elasticity means that query execution is flexible; the same query can be executed with more or less resources, given the availabil- ity of resources for this query and the execution time goals. Making sure that these two properties are satisfied is a very hard problem in a federated data sources environment such as those discussed in this chapter. The current version of ADP, and its extensions planned for the near future, provides a framework with the right high-level abstractions and an efficient implemen- tation for offering these two properties.

                  ADP utilizes state-of-the-art database techniques: (i) a declarative query language based on data flows, (ii) the use of sophisticated optimization techniques for executing queries efficiently, (iii) operator extensibility to bring domain-specific computations into the database processing, and (iv) execution platform independence to insulate applications from the idiosyncrasies of execution environments such as local clusters, private clouds, or public clouds.

                Figure 6.4 shows the current architecture of ADP. The queries are expressed in a data flow language allowing complex graphs with operators

                  as nodes and with edges representing producer–consumer relationships.

                  Figure 6.4 The architecture of ADP (query optimizer, Registry, execution plans, the ART run time, the ARM resource mediator, and containers provisioned in the cloud).

                  Queries are optimized and transformed into execution plans that are exe- cuted in ART, the ADP Run Time. The resources needed to execute the queries (machines, network, etc.) are reserved or allocated by ARM, the ADP Resource Mediator. Those resources are wrapped into containers. Containers are used to abstract from the details of a physical machine in a cluster or a virtual machine in a cloud. The information about the operators and the state of the system is stored in the Registry. ADP uses state-of-the- art technology and well-proven solutions inspired by years of research in parallel and distributed databases (e.g., parallelism, partitioning, various optimizations, recovery).

                  Several services that are useful to the OBDA paradigm discussed in this chapter have already been developed on top of ADP: an SQL engine (AdpDB), a MapReduce engine (AdpMR), and a data mining library (AdpDM). Some core research problems have also been studied in depth. For example, in Kllapi et al. (2011), we have studied the problem of sched- uling data flows that involve arbitrary data-processing operators in the context of three different optimization problems: (1) minimize completion time given a fixed budget, (2) minimize monetary cost given a deadline, and (3) find trade-offs between completion time and monetary cost without any a priori constraints. We formulated these problems and presented an approximate optimization framework to address them that uses resource elasticity in the cloud. To investigate the effectiveness of our approach, we incorporate the devised framework into ADP and instantiate it with several greedy, probabilistic, and exhaustive search algorithms. Finally, through several experiments that we conducted with the prototype elastic optimizer on numerous scientific and synthetic data flows, we identified several interesting general characteristics of the space of alternative sched- ules as well as the advantages and disadvantages of the various search algorithms. The overall results are very promising and show the effective- ness of our approach.
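                  In a notation of our own (not that of Kllapi et al., 2011), with s ranging over candidate schedules of a data flow, T(s) its completion time, C(s) its monetary cost, B a budget, and D a deadline, the three problems can be summarized as:

                  \begin{align*}
                  \text{(1)}\quad & \min_{s}\, T(s) \ \text{ subject to } C(s) \le B,\\
                  \text{(2)}\quad & \min_{s}\, C(s) \ \text{ subject to } T(s) \le D,\\
                  \text{(3)}\quad & \text{compute the Pareto-optimal trade-offs between } T(s) \text{ and } C(s).
                  \end{align*}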

                  To maximize the impact of ADP to the OBDA Big Data paradigm discussed in this chapter, ADP will be extended as follows:

                • Tight integration with query transformation modules: We will develop query planning and execution techniques for queries produced by query translators such as the ones of section “Query Transformation” by integrating the ADP system tightly with Quest (Rodríguez-Muro and Calvanese, 2012). The integration will start by interfacing with SQL using the AdpDB service and continue using lower level data- flow languages and providing hints (e.g., the degree of parallelism) to the ADP engine in order to increase its scalability.
                • Federation: Building on the input of the query transformation modules, federation will be supported in ADP by scheduling the operators of the query to the different sites so that appropriate cost metrics are optimized, for example, by pushing the execution


                  of operators close to the appropriate data sources and moving data (when possible) to sites with more compute power.

                • Continuous and temporal query support: Continuous queries such as the ones discussed in section “Time and Streams” will be supported natively by data streaming operators. Similarly, temporal queries will be supported by special operators that can handle temporal semantics.

                Conclusion

                  Giving end-users with limited IT expertise flexible access to large corporate data stores is a major bottleneck in data-intensive industries. Typically, stan- dard domain-specific tools only allow users to access data using a predefined set of queries. Any information need that goes beyond these predefined que- ries will require the help of a highly skilled IT expert, who knows the data storage intimately and who knows the application domain sufficiently well to communicate with the end-users.

                  This is costly, not only because such IT experts are a scarce resource, but also because the time of both the experts and the end-users is taken away from their core tasks. We have argued how OBDA can provide a solution: by capturing the end-users’ vocabulary in a formal model (ontology) and maintaining a set of mappings from this vocabulary to the data sources, we can automate the translation work previously done by the IT experts.

                  OBDA has, in recent years, received a large amount of theoretical atten- tion, and there are also several prototypical implementations. But in order to apply the idea to actual industry data, a number of limitations still need to be overcome. In the “Limitations of the State of the Art in OBDA” section, we have identified the specific problems of usability, prerequisites, scope, and efficiency.

                  We have argued that these problems can be overcome by a novel combi- nation of techniques, encompassing an end-user oriented query interface, query-driven ontology construction, new ideas for scalable query rewriting, temporal and streaming data processing, and query execution on elastic clouds.

                  The ideas proposed in this chapter are now being investigated and implemented in the FP7 Integrating Project Optique—Scalable End-user Access to Big Data (see http://www.optique-project.eu/).


                References

                  

                Abiteboul, S. and O. Duschka. 1998, Complexity of answering queries using mate-

                rialized views. In Proc. of the 17th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS’98) , Seattle, WA, 254–265.

                  

                Acciarri, A., D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, M. Palmieri, and

                R. Rosati. 2005, QuOnto: Querying ontologies. In Proc. of the 20th Nat. Conf. on Artificial Intelligence (AAAI 2005) , Pittsburgh, PA, 1670–1671.

                  

                Alexe, B., P. Kolaitis, and W.-C. Tan. 2010, Characterizing schema mappings via data

                examples. In Proc. of the 29th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS 2010) , Indianapolis, IA, 261–271.

                  

                Amoroso, A., G. Esposito, D. Lembo, P. Urbano, and R. Vertucci. 2008, Ontology-

                based data integration with Mastro-i for configuration and data management at SELEX Sistemi Integrati. In Proc. of the 16th Ital. Conf. on Database Systems (SEBD 2008) , Mondello (PA), Italy, 81–92.

                  

                André, E., G. Herzog, and T. Rist 1988, On the simultaneous interpretation of real

                world image sequences and their natural language description: The system Soccer. In Proc. of the European Conference on Artificial Intelligence (ECAI 1988), Munich, Germany, 449–454.

                Anicic, D., P. Fodor, S. Rudolph, and N. Stojanovic. 2011a, EP-SPARQL: A unified lan-

                guage for event processing and stream reasoning. In Proc. of the 20th Int. World

                  Wide Web Conference (WWW 2011), Hyderabad, India.

                  

                Anicic, D., P. Fodor, S. Rudolph, R. Stühmer, N. Stojanovic, and R. Studer. 2011b, ETALIS:

                Rule-based reasoning in event processing. In Reasoning in Event-Based Distributed Systems (Sven Helmer, Alex Poulovassilis, and Fatos Xhafa, eds.), volume 347 of Studies in Computational Intelligence , 99–124, Springer, Berlin/Heidelberg.

                  Apache. 2011, Apache Hadoop, http://hadoop.apache.org/.

                Arenas, M., P. Barcelo, R. Fagin, and L. Libkin. 2004, Locally consistent transforma-

                tions and query answering in data exchange. In Proc. of the 23rd ACM SIGACT

                  SIGMOD SIGART Symp. on Principles of Database Systems (PODS 2004) , Paris, France, 229–240.

                  

                Arenas, M., R. Fagin, and A. Nash. 2010a, Composition with target constraints. In Proc. of

                the 13th Int. Conf. on Database Theory (ICDT 2010) , Lausanne, Switzerland, 129–142.

                  

                Arenas, M., J. Pérez, J. L. Reutter, and C. Riveros. 2010b, Foundations of schema map-

                ping management. In Proc. of the 29th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS 2010) , Indianapolis, IA, 227–238.

                  

                Arenas, M., J. Pérez, and C. Riveros 2009, The recovery of a schema mapping: Bringing

                exchanged data back. ACM Trans. on Database Systems, 34, 22:1–22:48.

                Arocena, P. C., A. Fuxman, and R. J. Miller. 2010, Composing local-as-view mappings:

                Closure and applications. In Proc. of the 13th Int. Conf. on Database Theory (ICDT

                  2010) , Lausanne, Switzerland, 209–218.

                  

                Artale, A., D. Calvanese, R. Kontchakov, and M. Zakharyaschev. 2009, The DL-Lite

                family and relatives. Journal of Artificial Intelligence Research, 36, 1–69.

                Artale, A., R. Kontchakov, V. Ryzhikov, and M. Zakharyaschev. 2010, Past and future

                of DL-Lite. In AAAI Conference on Artificial Intelligence, Atlanta, GA.

                Baader, F., A. Bauer, P. Baumgartner, A. Cregan, A. Gabaldon, K. Ji, K. Lee, D.

                  Rajaratnam, and R. Schwitter. 2009, A novel architecture for situation awareness

                  systems. In Proc. of the 18th Int. Conf. on Automated Reasoning with Analytic Tableaux and Related Methods (Tableaux 2009), Oslo, Norway, volume 5607 of Lecture Notes in Computer Science, 77–92, Springer-Verlag.

                  

                Bajda-Pawlikowski, K., D. J. Abadi, A. Silberschatz, and E. Paulson. 2011, Efficient

                processing of data warehousing queries in a split execution environment. In Proc. of SIGMOD, Athens, Greece.

                  

                Barbieri, D. F., D. Braga, S. Ceri, E. D. Valle, and M. Grossniklaus. 2010a, Incremental

                reasoning on streams and rich background knowledge. In Proc. of the 7th Extended Semantic Web Conference (ESWC 2010) , Heraklion, Greece, volume 1, 1–15.

                  

                Barbieri, D. F., D. Braga, S. Ceri, E. D. Valle, and M. Grossniklaus. 2010b, Querying

                RDF streams with C-SPARQL. SIGMOD Record, 39, 20–26.

                Barzdins, G., E. Liepins, M. Veilande, and M. Zviedris. 2008, Ontology enabled graph-

                ical database query tool for end-users. In Databases and Information Systems V—

                  Selected Papers from the Eighth International Baltic Conference, DB&IS 2008, June 2–5, 2008, Tallinn, Estonia (Hele-Mai Haav and Ahto Kalja, eds.), volume 187 of

                  Frontiers in Artificial Intelligence and Applications , 105–116, IOS Press, Netherlands.

                  

                Bernstein, P. A. and H. Ho. 2007, Model management and schema mappings: Theory

                and practice. In Proc. of the 33rd Int. Conf. on Very Large Data Bases (VLDB 2007), Vienna, Austria, 1439–1440.

                Beyer, M. A., A. Lapkin, N. Gall, D. Feinberg, and V. T. Sribar. 2011, ‘Big Data’ is only

                the beginning of extreme information management. Gartner report G00211490.

                Bolles, A., M. Grawunder, and J. Jacobi. 2008, Streaming SPARQL—Extending SPARQL

                to process data streams. In Proc. of the 5th European Semantic Web Conference

                  (ESWC 2008) , Tenerife, Canary Islands, Spain, 448–462. http://data.semantic- web.org/conference/eswc/2008/paper/3.

                  

                Brachman, R. J. and H. J. Levesque. 1984, The tractability of subsumption in frame-

                based description languages. In AAAI, 34–37, Austin, TX.

                Brandt, S., R. Küsters, and A-Y. Turhan. 2001, Approximation in description log-

                ics. LTCS-Report 01-06, LuFG Theoretical Computer Science, RWTH Aachen,

                  

                Brenninkmeijer, C., I. Galpin, A. Fernandes, and N. Paton. 2008, A semantics for a

                query language over sensors, streams and relations. In Sharing Data, Information and Knowledge, 25th British National Conference on Databases (BNCOD 25), Cardiff, UK (Alex Gray, Keith Jeffery, and Jianhua Shao, eds.), volume 5071 of Lecture Notes in Computer Science , 87–99, Springer, Berlin/Heidelberg.

                  

                Bruza, P. D. and T. P. van der Weide. 1992, Stratified hypermedia structures for infor-

                mation disclosure. Computer Journal, 35, 208–220.

                Calbimonte, J.-P., Ó. Corcho, and A. J. G. Gray. 2010, Enabling ontology-based access to

                streaming data sources. In Proc. of the 9th Int. Semantic Web Conf. (ISWC 2010), Bonn,

                  Germany, 96–111.

                Calì, A., D. Calvanese, G. De Giacomo, and M. Lenzerini. 2004, Data integration

                under integrity constraints. Information Systems, 29, 147–163.

                  

                Calì, A., G. Gottlob, and T. Lukasiewicz. 2009, A general datalog-based framework for

                tractable query answering over ontologies. In Proc. of the 28th ACM Symposium on Principles of Database Systems (PODS 2009) , Providence, RI, 77–86.

                  

                Calvanese, D., G. De Giacomo, D. Lembo, M. Lenzerini, A. Poggi, M. Rodríguez-

                Muro, R. Rosati, M. Ruzzi, and D. F. Savo. 2011, The MASTRO system for ontology-based data access. Semantic Web Journal, 2, 43–53.

                Calvanese, D., G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. 2007a, EQL-Lite: Effective first-order query processing in description logics. In Proc. of the 20th Int. Joint Conf. on Artificial Intelligence (IJCAI 2007), Hyderabad, India, 274–279.

                Calvanese, D., G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. 2007b, Tractable reasoning and efficient query answering in description logics: The DL-Lite family. Journal of Automated Reasoning, 39, 385–429.

                Calvanese, D., G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. 2012, Data complexity of query answering in description logics. Artificial Intelligence, 195, 335–360.

                Cammert, M., C. Heinz, J. Krämer, M. Schneider, and B. Seeger. 2003, A status report on XXL—a software infrastructure for efficient query processing. IEEE Data Eng. Bull., 26, 12–18.

                Cammert, M., C. Heinz, J. Krämer, and B. Seeger. 2005, Sortierbasierte Joins über Datenströmen. In BTW 2005, volume 65 of LNI, Karlsruhe, Germany, 365–384, GI, http://dblp.uni-trier.de/db/conf/btw/btw2005.html#CammertHKS05.

                Catarci, T. 2000, What happened when database researchers met usability. Information Systems.

                Catarci, T., M. F. Costabile, S. Levialdi, and C. Batini. 1997, Visual query systems for databases: A survey. Journal of Visual Languages & Computing, 8, 215–260.

                Catarci, T., P. Dongilli, T. Di Mascio, E. Franconi, G. Santucci, and S. Tessaris. 2004, An ontology based visual tool for query formulation support. In Proc. of the 16th Eur. Conf. on Artificial Intelligence (ECAI 2004), Valencia, Spain (Ramon López de Mántaras and Lorenza Saitta, eds.), 308–312, IOS Press, Netherlands.

                Chomicki, J. and D. Toman. 2005, Temporal databases. In Handbook of Temporal Reasoning in Artificial Intelligence (Michael Fisher, Dov M. Gabbay, and Lluis Vila, eds.), volume 1, 429–467, Elsevier B.V., Amsterdam, The Netherlands.

                Chortaras, A., D. Trivela, and G. B. Stamou. 2011, Optimized query rewriting for OWL 2 QL. In Proc. of the 23rd Int. Conf. on Automated Deduction (CADE-23), Wrocław, Poland.

                Cimiano, P. 2006, Ontology Learning and Population from Text: Algorithms, Evaluation and Applications. Springer, USA.

                Cimiano, P., A. Hotho, and S. Staab. 2005, Learning concept hierarchies from text corpora using formal concept analysis. Journal of Artificial Intelligence Research, 24, 305–339.

                Dean, J. and S. Ghemawat. 2008, MapReduce: simplified data processing on large clusters. Communications of the ACM, 51, 107–113.

                DeWitt, D. J. and J. Gray. 1992, Parallel database systems: The future of high performance database systems. Communications of the ACM, 35, 85–98.

                Epstein, R. G. 1991, The tabletalk query language. Journal of Visual Languages & Computing.

                Tunkelang, D. 2009, Faceted Search. Morgan and Claypool.

                  

                Fagin, R. 2007, Inverting schema mappings. ACM Transactions on Database Systems,

                32, 25:1–25:53.

                Fagin, R., L. M. Haas, M. A. Hernández, R. J. Miller, L. Popa, and Y. Velegrakis. 2009, Clio: Schema mapping creation and data exchange. In Conceptual Modeling: Foundations and Applications—Essays in Honor of John Mylopoulos (A. T. Borgida, V. K. Chaudhri, P. Giorgini, and E. S. Yu, eds.), 198–236, Springer, Berlin/Heidelberg.

                Fagin, R., P. G. Kolaitis, R. J. Miller, and L. Popa. 2005a, Data exchange: Semantics and

                query answering. Theoretical Computer Science, 336, 89–124.

                Fagin, R., P. G. Kolaitis, A. Nash, and L. Popa. 2008a, Towards a theory of schema-

                mapping optimization. In Proc. of the 27th ACM SIGACT SIGMOD SIGART

                  Symp. on Principles of Database Systems (PODS 2008) , Vancouver, Canada, 33–42.

                  

                Fagin, R., P. G. Kolaitis, and L. Popa. 2005b, Data exchange: Getting to the core. ACM

                Transactions on Database Systems

                , 30, 174–210.

                  

                Fagin, R., P. G. Kolaitis, L. Popa, and W.-C. Tan. 2005c, Composing schema mappings:

                Second-order dependencies to the rescue. ACM Transactions on Database Systems, 30, 994–1055.

                Fagin, R., P. G. Kolaitis, L. Popa, and W.-C. Tan. 2008b, Quasi-inverses of schema map-

                pings. ACM Transactions on Database Systems, 33, 1–52.

                Fagin, R., P. G. Kolaitis, L. Popa, and W.-C. Tan. 2009b, Reverse data exchange:

                Coping with nulls. In Proc. of the 28th ACM SIGACT SIGMOD SIGART Symp. on

                  Principles of Database Systems (PODS 2009) , Providence, RI, 23–32.

                  

                Fuxman, A., P. G. Kolaitis, R. Miller, and W.-C. Tan. 2005, Peer data exchange. In

                Proc. of the 24rd ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS 2005) , Baltimore, MD, 160–171.

                  

                Glavic, B., G. Alonso, R. J. Miller, and L. M. Haas. 2010, TRAMP: Understanding

                the behavior of schema mappings through provenance. PVLDB, 3, Singapore, 1314–1325.

                Gonzalez, L. M. V., L. Rodero-Merino, J. Caceres, and M. A. Lindner. 2009, A break

                in the clouds: Towards a cloud definition. Computer Communication Review, 39,

                  50–55.

                Görlitz, O. and S. Staab. 2011, Federated data management and query optimization

                for linked open data. In New Directions in Web Data Management (A. Vakali and

                  L. C. Jain), 1, 109–137, Springer-Verlag, Berlin/Heidelberg.

                Gottlob, G., R. Pichler, and V. Savenkov. 2009, Normalization and optimization

                of schema mappings. Proceedings of the VLDB Endowment, Lyon, France, 2,

                  1102–1113.

                Grandi, F. 2010, T-SPARQL: A TSQL2-like temporal query language for RDF. In Proc. of the

                  

                ADBIS 2010 Int. Workshop on Querying Graph Structured Data (GraphQ 2010)

                , Novi Sad, Serbia, 21–30.

                  

                Grant, J., J. Gryz, J. Minker, and L. Raschid. 1997, Semantic query optimization for

                object databases. In Proc. of the 13th IEEE Int. Conf. on Data Engineering (ICDE’97), Birmingham, UK, 444–453.

                Gries, O., R. Möller, A. Nafissi, M. Rosenfeld, K. Sokolski, and M. Wessel. 2010, A prob-

                abilistic abduction engine for media interpretation based on ontologies. In Proc. of the 4th Int. Conf. on Web Reasoning and Rule Systems (RR 2010),

                  Bressanone/ Brixen, Italy (J. Alferes, P. Hitzler, and Th. Lukasiewicz, eds.) Springer, Berlin/ Heidelberg.

                Halevy, A. Y. 2001, Answering queries using views: A survey. Very Large Database

                  Journal , 10, 270–294.

                  

                Halevy, A. Y., A. Rajaraman, and J. Ordille. 2006, Data integration: The teenage years.

                  In Proc. of the 32nd Int. Conf. on Very Large Data Bases (VLDB 2006), VLDB 2006,


                Heim, P. and J. Ziegler. 2011, Faceted visual exploration of semantic data. In

                  Proceedings of the Second IFIP WG 13.7 Conference on Human-Computer Interaction and Visualization , HCIV 2009, co-located with INTERACT 2009, Uppsala,

                Heinz, C., J. Krämer, T. Riemenschneider, and B. Seeger. 2008, Toward simulation-

                based optimization in data stream management systems. In Proc. of the 24th Int.

                  Conf. on Data Engineering (ICDE 2008) , Cancun, Mexico, 1580–1583.

                  

                Jensen, C. S., M. D. Soo, and R. T. Snodgrass. 1993, Unification of temporal data mod-

                els. In Proceedings of IEEE International Conference on Data Engineering, (ICDE 1993 ), Vienna, Austria, 262–271.

                  

                Jiménez-Ruiz, E., B. C. Grau, I. Horrocks, and R. B. Llavori. 2009, Logic-based ontology

                integration using ContentMap. In Proc. of XIV Jornadas de Ingeniería del Software y Bases de Datos (JISBD 2009),

                  San Sebastián, Spain (Antonio Vallecillo and Goiuria Sagardui, eds.), 316–319, Los autores. download/2009/JCHB09c.pdf.

                Kikot, S., R. Kontchakov, and M. Zakharyaschev. 2011, On (in)tractability of OBDA

                with OWL 2 QL. In Proc. of the 24th Int. Workshop on Description Logic (DL 2011),

                  Barcelona, Spain.

                Kllapi, H., E. Sitaridi, M. M. Tsangaris, and Y. E. Ioannidis. 2011, Schedule optimiza-

                tion for data processing flows on the cloud. In Proc. of SIGMOD (SIGMOD 2011),

                  Athens, Greece, 289–300.

                Knublauch, H., M. Horridge, M. A. Musen, A. L. Rector, R. Stevens, N. Drummond,

                P. W. Lord, N. F. Noy, J. Seidenberg, and H. Wang. 2005, The Protégé OWL expe- rience. In Proc. of the OWL: Experience and Directions Workshop (OWLED 2005), Galway, Ireland, volume 188 of CEUR (http://ceur-ws.org/).

                  

                Kockskämper, S., B. Neumann, and M. Schick. 1994, Extending process monitoring by

                event recognition. In Proc. of the 2nd Int. Conf. on Intelligent System Engineering (ISE’94) , Hamburg-Harburg, Germany, 455–460.

                  

                Kolaitis, P. G. 2005, Schema mappings, data exchange, and metadata management. In

                Proc. of the 24rd ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS 2005) , Baltimore, MD, 61–75.

                  

                Kossmann, D. 2000, The state of the art in distributed query processing. ACM

                Computing Surveys , 32, 422–469.

                Krämer, J. and B. Seeger. 2009, Semantics and implementation of continuous sliding win-

                dow queries over data streams. ACM Transactions on Database Systems, 34, 4:1–4:49.

                Krämer, J., Y. Yang, M. Cammert, B. Seeger, and D. Papadias. 2006, Dynamic

                plan migration for snapshot-equivalent continuous queries in data stream systems. In Proc. of EDBT 2006 Workshops (EDBT 2006), Munich, Germany, volume 4254 of Lecture Notes in Computer Science, 497–516, Springer, Berlin/ Heidelberg.

                Kuper, G. M., L. Libkin, and J. Paredaens, eds. 2000, Constraint Databases. Springer,

                Berlin Heidelberg.

                Law, Y.-N., H. Wang, and C. Zaniolo. 2004, Query languages and data models for

                database sequences and data streams. In Proc. of the 30th Int. Conf. on Very

                  Large Data Bases (VLDB 2004), Toronto, Canada, 492–503, VLDB Endowment. http://dl.acm.org/citation.cfm?id=1316689.1316733.

                Libkin, L. and C. Sirangelo. 2008, Data exchange and schema mappings in open and

                closed worlds. In Proc. of the 27th ACM SIGACT SIGMOD SIGART Symp. on


                Lim, S. C. J., Y. Liu, and W. B. Lee. 2009, Faceted search and retrieval based on semanti-

                cally annotated product family ontology. In Proc. of the WSDM 2009 Workshop on

                  Exploiting Semantic Annotations in Information Retrieval (ESAIR 2009) , Barcelona, Spain, 15–24. http://doi.acm.org/10.1145/1506250.1506254.

                  

                Luther, M., Y. Fukazawa, M. Wagner, and S. Kurakake. 2008, Situational reasoning for

                task-oriented mobile service recommendation. Knowledge Engineering Review, 23, 7–19. http://dl.acm.org/citation.cfm?id=1362078.1362080.

                Madhavan, J. and A. Y. Halevy. 2003, Composing mappings among data sources. In Proc.

                of the 29th Int. Conf. on Very Large Data Bases (VLDB 2003) , Berlin, Germany, 572–583.

                  

                Marchionini, G. and R. White. 2007, Find what you need, understand what you find.

                  International Journal of Human-Computer Interaction , 23, 205–237.

                  

                Möller, R. and B. Neumann. 2008, Ontology-based reasoning techniques for multime-

                dia interpretation and retrieval. In Semantic Multimedia and Ontologies: Theory and Applications

                  (Yiannis Kompatsiaris and Paola Hobson, eds.), Springer- Verlag, London.

                Möller, R., V. Haarslev, and M. Wessel. 2006, On the scalability of description logic

                instance retrieval. In 29. Deutsche Jahrestagung für Künstliche Intelligenz (C.

                  Freksa and M. Kohlhase, eds.), Bremen, Germany, Lecture Notes in Artificial Intelligence. Springer, Netherlands.

                Motik, B. 2010, Representing and querying validity time in RDF and OWL: A logic-

                based approach. In Proc. of the 9th Int. Semantic Web Conf. (ISWC 2010), Bonn,

                  Germany, volume 1, 550–565.

                Neumann, B. 1985, Retrieving events from geometrical descriptions of time-varying

                scenes. In Foundations of Knowledge Base Management—Contributions from Logic,

                  Databases, and Artificial Intelligence (J. W. Schmidt and Costantino Thanos, eds.), 443, Springer Verlag, Berling/Heidelberg.

                  

                Neumann, B. and H.-J. Novak. 1986, NAOS: Ein System zur natürlichsprachlichen

                Beschreibung zeitveränderlicher Szenen. Informatik Forschung und Entwicklung, 1, 83–92.

                Neumann, B. and H.-J. Novak. 1983, Event models for recognition and natural lan-

                guage description of events in real-world image sequences. In Proc. of the 8th Int.

                  Joint Conference on Artificial Intelligence (IJCAI’83) , Karlsruhe, Germany, 724–726.

                  

                Nowlan, W. A., A. L. Rector, S. Kay, C. A. Goble, B. Horan, T. J. Howkins, and A.

                  Wilson. 1990, PEN&PAD: A doctors’ workstation with intelligent data entry and summaries. In Proceedings of the 14th Annual Symposium on Computer Applications in Medical Care (SCAMC’90),

                  Washington, DC (R. A. Miller, ed.), 941–942. IEEE Computer Society Press, Los Alamitos, California.

                Özsu, M. T. and P. Valduriez. 1999, Principles of Distributed Database Systems, 2nd edi-

                tion. Prentice-Hall.

                  

                Pan, J. Z. and E. Thomas. 2007, Approximating OWL-DL ontologies. In Proc. of the

                22nd Nat. Conf. on Artificial Intelligence (AAAI-07)

                  , Vancouver, British Columbia, Canada, 1434–1439.

                Peraldi, I. S. E., A. Kaya, and R. Möller 2011, Logical formalization of multimedia

                interpretation. In Knowledge-Driven Multimedia Information Extraction and

                  Ontology Evolution , volume 6050 of Lecture Notes in Computer Science, 110–133, Springer, Berlin/Heidelberg.

                  

                Pérez-Urbina, H., B. Motik, and I. Horrocks. 2008, Rewriting conjunctive queries

                over  description logic knowledge bases. In Revised Selected Papers of the 3rd Int.

                  Workshop on Semantics in Data and Knowledge Bases (SDKB 2008) (K.-D. Schewe and B. Thalheim, eds.), volume 4925 of Lecture Notes in Computer Science, 199–214, Springer, Berlin/Heidelberg.

                Pérez-Urbina, H., B. Motik, and I. Horrocks. 2010, Tractable query answering

                and rewriting under description logic constraints. Journal of Applied Logic, 8,

                  186–209.

                Phuoc, D. L., M. Dao-Tran, J. X. Parreira, and M. Hauswirth. 2011, A native and adap-

                tive approach for unified processing of linked streams and linked data. In Proc. of the 10th Int. Semantic Web Conf. (ISWC 2011) , Boston, MA, 1, 370–388.

                  

                Poggi, A., D. Lembo, D. Calvanese, G. De Giacomo, M. Lenzerini, and R. Rosati. 2008,

                Linking data to ontologies. Journal on Data Semantics, X, 133–173.

                Popov, I. O., M. C. Schraefel, W. Hall, and N. Shadbolt. 2011, Connecting the

                dots: a multi-pivot approach to data exploration. In Proceedings of the 10th

                  International Conference on The Semantic web—Volume Part I , ISWC’11, Boston,

                  

                Ren, Y. and J. Z. Pan 2011, Optimising ontology stream reasoning with truth main-

                tenance system. In Proc. of the ACM Conference on Information and Knowledge Management (CIKM 2011), Glasgow, Scotland.

                  

                Rodríguez-Muro, M. 2010, Tools and Techniques for Ontology Based Data Access

                in Lightweight Description Logics . Ph.D. thesis, KRDB Research Centre for Knowledge and Data, Free University of Bozen-Bolzano.

                  

                Rodríguez-Muro, M. and D. Calvanese. 2011, Dependencies to optimize ontology

                based data access. In Proc. of the 24th Int. Workshop on Description Logic (DL 2011), Barcelona, Spain, volume 745 of CEUR (http://ceur-ws.org/).

                Rodríguez-Muro, M. and D. Calvanese 2012, Quest, an owl 2 ql reasoner for ontol-

                ogy-based data access. In Proc. of the 9th Int. Workshop on OWL: Experiences and

                  Directions (OWLED 2012) , Heraklion, Crete, volume 849 of CEUR Electronic Workshop Proceedings . http://ceur-ws.org/.

                  

                Rosati, R. 2012, Prexto: Query rewriting under extensional constraints in dl-lite. In

                Proc. of the 9th Extended Semantic Web Conference (ESWC 2012) , Heraklion, Crete, volume 7295 of LNCS, 360–374, Springer, Berlin/Heidelberg.

                  

                Rosati, R. and A. Almatelli. 2010, Improving query answering over DL-Lite ontolo-

                gies. In Proc. of the 12th Int. Conf. on the Principles of Knowledge Representation and Reasoning (KR 2010) , Toronto, Canada, 290–300.

                  

                Savo, D. F., D. Lembo, M. Lenzerini, A. Poggi, M. Rodríguez-Muro, V. Romagnoli, M.

                  Ruzzi, and G. Stella. 2010, Mastro at work: Experiences on ontology-based data access. In Proc. of the 23rd Int. Workshop on Description Logic (DL 2010), Waterloo, Canada, volume 573 of CEUR (http://ceur-ws.org/), 20–31.

                Schmidt, M., M. Meier, and G. Lausen. 2010, Foundations of sparql query optimiza-

                tion. In ICDT 2010, Lausanne, Switzerland, 4–33.

                Schmiegelt, P. and B. Seeger. 2010, Querying the future of spatio-temporal

                objects. In Proc. of the 18th SIGSPATIAL Int. Conf. on Advances in Geographic

                  

                Schneiderman, B. 1983, Direct manipulation: A step beyond programming languages.

                  Computer , 16, 57–69.

                  

                Schwarte, A., P. Haase, K. Hose, R. Schenkel, and M. Schmidt. 2011, Fedx: Optimization

                techniques for federated query processing on linked data. In International

                  Scalable End-User Access to Big Data

                Sheth, A. P. 1991, Federated database systems for managing distributed, heteroge-

                neous, and autonomous databases. In VLDB, Barcelona, Spain, 489.

                Shvaiko, P. and J. Euzenat. 2005, A survey of schema-based matching approaches.

                  Journal on Data Semantics , IV, 146–171.

                  

                Soylu, A., F. Modritscher, and P. De Causmaecker. 2012, Ubiquitous web navigation

                through harvesting embedded semantic data: A mobile scenario. Integrated Computer-Aided Engineering , 19, 93–109.

                  

                Steiner, A. 1997, A Generalisation Approach to Temporal Data Models and their

                Implementations

                  . Ph.D. thesis, Departement Informatik, ETH Zurich, Switzerland.

                Suominen, O., K. Viljanen, and E. Hyvänen. 2007, User-centric faceted search for

                semantic portals. In Proc. of the 4th European Semantic Web Conf. (ESWC 2007),

                  Innsbruck, Austria, 356–370. http://dx.doi.org/10.1007/978-3-540-72667-8_26.

                Tang, Y., L. Liang, R. Huang, and Y. Yu. 2003, Bitemporal extensions to non-temporal

                RDBMS in distributed environment. In Proc. of the 8th Int. Conf. on Computer

                  Supported Cooperative Work in Design , Xiamen, China, 370-374. DOI: 10.1109/ CACWD.2004.1349216.

                ten Cate, B. and P. G. Kolaitis. 2009, Structural characterizations of schema-mapping

                languages. In Proc. of the 12th Int. Conf. on Database Theory (ICDT 2009), Saint-

                  Petersburg, Russia, 63–72.

                ter Hofstede, A. H. M., H. A. Proper, and Th. P. van der Weide. 1996, Query formu-

                lation as an information retrieval problem. The Computer Journal, 39, 255–274. http://comjnl.oxfordjournals.org/content/39/4/255.abstract.

                Thusoo, A., J. Sen Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Anthony, H. Liu,

                and R. Murthy. 2010, Hive—a petabyte scale data warehouse using Hadoop. In

                  Proc. of the 26th IEEE Int. Conf. on Data Engineering (ICDE 2010) , Long Beach, CA, 996–1005.

                  

                Tran, T., D. M. Herzig, and G. Ladwig. 2011, Semsearchpro—using semantics through-

                out the search process. Web Semantics: Science, Services and Agents on the World

                  

                Tsangaris, M. M., G. Kakaletris, H. Kllapi, G. Papanikos, F. Pentaris, P. Polydoras,

                E.  Sitaridi, V. Stoumpos, and Y. E. Ioannidis. 2009, Dataflow processing and optimization on grid and cloud infrastructures. IEEE Data Eng. Bull., 32, 67–74.

                Ullman, J. D. 1997, Information integration using logical views. In Proc. of the 6th Int.

                  Conf. on Database Theory (ICDT’97) , volume 1186 of Lecture Notes in Computer

                  Science , Delphi, Greece, 19–40. Springer, London.

                  

                Uren, V., Y. Lei, V. Lopez, H. Liu, E. Motta, and M. Giordanino. 2007, The usability

                Wessel, M., M. Luther, and R. Möller. 2009, What happened to Bob? Semantic data

                mining of context histories. In Proc. of the 2009 Int. Workshop on Description Logics

                  (DL 2009),

                Oxford, UK. CEUR Workshop Proceedings, Vol. 477.

                  

                Wessel, M., M. Luther, and M. Wagner. 2007, The difference a day makes—

                Recognizing important events in daily context logs. In Proc. of the Int.

                  Workshop on Contexts and Ontologies: Representation and Reasoning C&O:RR 2007, collocated with CONTEXT 2007, Roskilde, Denmark. CEUR Workshop

                  Big Data Computing

                Zaniolo, C., S. Ceri, Chr. Faloutsos, R. T. Snodgrass, V. S. Subrahmanian, and R. Zicari.

                  1997, Advanced Database Systems, chapter Overview of Temporal Databases. Morgan Kaufmann, USA.

                Zhang, D., A. Markowetz, V. J. Tsotras, D. Gunopulos, and B. Seeger. 2001, Efficient

                computation of temporal aggregates with range predicates. In Proceedings of the

                  Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 2001) (Peter Buneman, ed.), May 21–23, 2001, Santa Barbara, CA.

Semantic Data Interoperability

Hele-Mai Haav and Peep Küngas

CONTENTS

Introduction
Data Interoperability Framework for Estonian e-Government
Semantic Data Interoperability Architecture for Estonian e-Government
Ontology Development
  Ontology Network
  Domain Expert Centric Ontology Development Methodology
    Overview
    Ontology Development Process
    Evaluation
Semantic Enrichment of Data and Services
  Semantic Descriptions in SAWSDL
  Annotation Patterns for Semantic Enrichment of Schemas
    Pattern 1—URIdtp
    Pattern 2—URIc URIdtp
    Pattern 3—URIop URIc URIdtp
    Pattern 4—URIc URIop URIc URIdtp
Common Faults in Semantic Annotation
Applications of Semantics
  Semantic Data Analysis: Redundancy Detection
  Data Interoperability with Semantic Data Services
  Linked Open Government Data
Conclusion and Future Work
Acknowledgments
References

Introduction

Data-intensive applications (social media sites, e-commerce, e-government, e-health, e-science, etc.) are common in our society. They utilize and generate huge amounts of data. According to a recent report on Big Data by the McKinsey Global Institute, enterprises worldwide stored more than 7 exabytes and consumers stored more than 6 exabytes of new data in 2010 (Manyika et al. 2011). The report also notes that the volume of data created globally is expected to grow exponentially in forthcoming years. As a consequence, the “Big Data” problem has emerged.

Currently, the term “Big Data” does not have a well-defined and commonly accepted meaning. Therefore, many researchers and practitioners use the term rather freely. However, there is a common understanding regarding the three characteristics of Big Data, which are as follows: volume, velocity, and variety (Gartner 2012). Volume usually indicates large volumes (e.g., petabytes or more) of data of different types. Taking into account the complexity of data, volumes of data may be smaller for highly complex data sets compared to simple ones. For example, taking the complexity dimension of data into account, 30 billion RDF (http://www.w3.org/RDF/) triples can be viewed as Big Data. Volume also depends on what sizes of data sets are common in a particular field, and it is difficult to draw a firm boundary beyond which data sets count as Big Data. Velocity characterizes the time sensitiveness of data for applications like fraud or trade event detection, where near real-time analysis of large volumes of data is needed. Variety refers to different types of data that need to be integrated and analyzed together. As a rule, these data originate from various data sources (e.g., SQL/NoSQL databases, textual and XML (http://www.w3.org/XML/) documents, web sites, audio/video streams, etc.). In many cases, the volume of data is not as big an issue as the meaningful interoperability of relevant data extracted from disparate sources.

Data interoperability as a term might be confused with the notions of data integration and data exchange well known from the database management field. In this chapter, we define semantic data interoperability as the ability of systems to automatically and accurately interpret the meaning of the exchanged data. Semantic interoperability in its broad sense covers data and process interoperability. Data interoperability is a precondition for process interoperability. In this chapter, we consider only semantic data interoperability (see also Stuckenschmidt 2012). For achieving semantic data interoperability, systems need not only to exchange their data, but also to exchange or agree on explicit models of these data (Harmelen 2008). These shared explicit models are known as ontologies that formally represent the knowledge used to interpret the data to be exchanged. Using stronger ontology representation languages increases the semantic interoperability among systems. For example, RDF enables less semantic interoperability than OWL DL, as it is a less expressive knowledge representation language.

In the context of Big Data, achieving meaningful (semantic) interoperability of data from heterogeneous sources is a challenging problem. This problem mainly addresses the variety characteristic of Big Data. Practical solutions to this problem are important for enterprises and the public sector.

There is not a single technology that could solve the data heterogeneity issue of Big Data, but combining different technologies may provide a solution. One possible way is to consider the semantic data interoperability problem of Big Data as a semantic data interoperability problem of large-scale distributed systems providing data or data analytics services. According to this approach, semantic technologies are combined with Web services technologies that are used to handle the complexity of networked systems. Semantic technologies allow adding meaning (semantics) to data and linking data provisioning services. The meaning is provided to data through semantic enrichment of data; this means linking each data element to a shared ontology component. Ontologies are used in this context to give a common interpretation of the terminology of different data sources. They are represented in a formal language (e.g., OWL DL), enabling machine readability and automatic reasoning. OWL DL representations of different modular ontologies can be linked together, and when a set of linked ontologies grows, it can by itself become complex Big Data.

                  One advantage of this approach is that such semantic enrichment of data provides meta-data that is machine-interpretable and linked to the data independent of any system that uses the data. This is in contrast to current Hadoop/MapReduce Big Data solutions that mostly embed meta-data into the code (Marshall 2012) and thus render data integration more complex.

In this chapter, we consider semantic enrichment of data combined with linked data and Web services technology in order to build an infrastructure for semantic data interoperability of Big Data. We show how large-scale semantic data interoperability problems can be solved in the domain of public sector administration by using a combination of SQL database technology, linked data, semantic technology, and Web services. We discuss several semantic data interoperability issues and share our personal experience derived from the design and implementation of the Estonian semantic interoperability framework of state information systems and related data interoperability solutions.

                Data Interoperability Framework for Estonian e-Government

We base our discussion in this chapter on a real case study from the field of information systems of public administration that covers different emerging semantic data interoperability issues. The case study is about the long-term development of the Estonian e-government infrastructure and the efforts to cope with data interchange between a wide variety of disparate data sources of government agencies. The case study presents Big Data challenges as a network of distributed heterogeneous state information systems data sources (about 600 data sources including more than 20,000 different data fields) by showing a need for semantic data interoperability.

Before providing an infrastructure for data sharing, the Estonian government faced common problems of disparity of data sources that could be characterized as follows: wide variety in data types and semantics; varying access and security policies; low degree of interoperability and reuse of information; unclear ownership of data. In order to overcome these problems, the Estonian government decided to build a distributed secure data exchange framework based on the Web services approach that allows implementation of platform-neutral data services to be accessed via standard Internet protocols. Since Web services can be used as wrappers around existing state information systems (mostly legacy systems), it is possible to connect distributed disparate information systems to the network while hiding their complexity. In 2001, the first version of the middleware platform X-Road (Kalja 2009) started to provide secure data transfer between governmental IS data sources and between individuals and governmental institutions.

Nowadays, the X-road platform is based on the SOAP standard (http://www.w3.org/TR/soap12-part1/) for data transfer, its data service descriptions are presented in the WSDL language (http://www.w3.org/TR/wsdl), and the services are registered in a UDDI register (X-road 2011). X-road services differ from ordinary Web services in that data are transferred through standardized security gateways (servers) and service providers need to implement a custom component called an adapter server in order to connect to X-road. The role of adapter servers is to convert an X-road query into a query acceptable by the data server of a particular information system (e.g., based on SQL, SOAP, etc. protocols) and to transform the results back into service responses. In order to guarantee secure data transfer, all complex security processes (e.g., authenticity, integrity, and confidentiality of the exchanged data) are performed by security gateways that are needed for each of the information systems connected to X-road (X-road 5.0. 2011).
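To make the role of the adapter server concrete, the following minimal sketch (all names, the table schema, and the parameter format are hypothetical and not taken from the X-road specification) shows the essential translation step: request parameters in, a local SQL query against the register's legacy database, and a structured response out.

    import sqlite3

    # Hypothetical adapter for one data service: given a national ID code,
    # return basic person data from the register's local (legacy) database.
    def handle_xroad_request(params: dict, conn: sqlite3.Connection) -> dict:
        # 1. Map the incoming request parameters onto a local SQL query.
        row = conn.execute(
            "SELECT first_name, last_name, birth_date FROM person WHERE id_code = ?",
            (params["idCode"],),
        ).fetchone()
        # 2. Transform the database result back into a service response.
        if row is None:
            return {"fault": "No such person"}
        return {"firstName": row[0], "lastName": row[1], "dateOfBirth": row[2]}

    # Demonstration with an in-memory database standing in for the register.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (id_code TEXT, first_name TEXT, last_name TEXT, birth_date TEXT)")
    conn.execute("INSERT INTO person VALUES ('38001010000', 'Mari', 'Maasikas', '1980-01-01')")
    print(handle_xroad_request({"idCode": "38001010000"}, conn))

In the real infrastructure, the SOAP envelope handling and all security processing remain with the security gateway; the adapter only bridges the service interface and the underlying data source.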

Currently, about 1500 e-government data services are registered and run on X-road. Data sources of more than 600 organizations are connected via the X-road middleware (X-road 2011, X-road 5.0. 2011). X-road services are operable 24/7 and some of them are accessible via dedicated e-government portals by any Estonian citizen. According to the statistics provided regularly by the Administration System for the State Information System (RIHA 2009), the estimated number of requests was 226 million in 2010 and 240 million in 2011.

In the X-road approach, data interoperability between disparate data sources is achieved by data services that reduce the effort needed for custom coding. In this case, the provided secure data services form an abstract layer between the data sources of state information systems and the consumers of that data, as shown in Figure 7.1. The services layer hides the complexity of accessing data sources and reduces the need for knowledge about the underlying data structures.

For example, the population register has published, among others, the most widely used data service, which, given the national ID code of a citizen, returns detailed data about the citizen such as first name, last name, date of birth, etc. This service in turn is used by other, more complex data services that run queries over different heterogeneous databases by combining different data services. For example, one can have an application that provides an opportunity to ask for the national ID code of the owner of a car given the vehicle’s registration number. In this case, the data services provided by the population register and the vehicle registration database are combined into one service. This web service composition is custom-coded by application/service providers and published in the X-road service register for further use. However, as we see in the following sections, there are attempts to automate the service composition process by using semantic technologies.
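The following minimal sketch illustrates the idea of such a custom-coded composition; the two service wrappers are hypothetical stubs standing in for the real X-road data services, and the data are invented.

    # Hypothetical wrappers around two X-road data services (stubs for illustration).
    def vehicle_owner_id(registration_number: str) -> str:
        """Vehicle registration service: registration number -> owner's national ID code."""
        return {"123ABC": "38001010000"}.get(registration_number, "")

    def person_data(national_id_code: str) -> dict:
        """Population register service: national ID code -> basic person data."""
        return {"38001010000": {"firstName": "Mari", "lastName": "Maasikas"}}.get(national_id_code, {})

    def owner_of_vehicle(registration_number: str) -> dict:
        # The composite service simply chains the two data services.
        id_code = vehicle_owner_id(registration_number)
        return {"idCode": id_code, **person_data(id_code)}

    print(owner_of_vehicle("123ABC"))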

Although the X-road approach provides means for data interoperability, it does not solve semantic data interoperability problems. Data semantics is hard-coded into the data services (i.e., X-road services). Currently, state information systems provide data services and corresponding WSDL descriptions that do not contain or refer to the meaning of the data used. It is sufficient for a person if a data object has, besides a label, a textual description in a natural language they can understand, but a software agent (e.g., a web service, a search engine, a service matchmaker, etc.) requires some formal description of the data object to interpret its label. For both humans and software agents to interpret the meaning of data objects in the same way, descriptions of data objects must be enriched with semantic descriptions, that is, references to the concepts used in a given ontology. The X-road data services use data objects as input–output parameters of services. Enriching the descriptions of X-road data services with semantic references to the components of the ontology of the relevant domain makes it possible to use software agents for maintaining X-road data services and for facilitating semantic data and service interoperability.

Figure 7.1 Secure data services as an abstract layer (layers shown: data consumers; secure data services; data sources of state IS).

Semantic Data Interoperability Architecture for Estonian e-Government

X-road middleware does not support semantic data interoperability.

Therefore, a strategic document on the semantic interoperability of state information technology (RISO 2007, Vallner 2006) was issued by the Estonian Ministry of Economic Affairs in 2005. Following this visionary document, the semantic interoperability architecture for Estonian state information systems was proposed by Haav et al. (2009). It is designed as a semantic layer on top of the X-road infrastructure, providing general principles for semantically describing objects of state information systems. Central components of this architecture are domain ontologies related to state information systems databases. OWL DL was chosen as the ontology description language; it is a W3C recommendation and a commonly used standard. The architecture concentrates on two types of objects: data objects and input/output parameters of operations of data services, which are to be semantically described by using domain ontologies. Semantic descriptions are seen as semantic enrichments of descriptions of data, providing a link to the meaning of the data elements in a domain ontology. Currently, data services of state information systems are described in WSDL, and data structures of state databases are presented, among other formats, by their XMI (http://www.omg.org/spec/XMI/) data model descriptions. SAWSDL has been chosen for the semantic enrichment of these descriptions, as it provides mechanisms for semantic annotations of WSDL and XMI descriptions.

                  As a repository of domain ontologies and semantic descriptions of data services (semantic data services), the Administration System for the State Information System (called RIHA in Estonian) is used. As explained by Parmakson and Vegmann (2009), RIHA is a secure web-based database that stores systematic and reliable meta-data, including semantic meta-data, about public information systems. It provides a user-friendly web interface for searching and browsing all the collected meta-data.

In addition, the semantic interoperability architecture suggests that ontology creation, as well as semantic enrichment, is supported by corresponding policies, guidelines, tools, educational, and promotional activities.

The architecture was set up by related legislation in Estonia in 2009, demanding that holders of state information systems create corresponding domain ontologies in OWL and semantic annotations of Web services in SAWSDL (RIHA 2009).

After that, the process of collecting semantic descriptions of domain ontologies and data services into RIHA started. However, in 2012, only 13 domain ontologies were published in RIHA. There are several reasons for that. First of all, there is a lack of knowledge about semantic technologies and ontology engineering in Estonia. Therefore, a number of training courses were provided to approximately 200 domain experts responsible for ontology creation in their respective domains. In 2010–2011, special hands-on training courses were offered to smaller groups of domain experts in order to develop ontology descriptions of their respective domains. This activity resulted in 22 new ontologies. The main problem was that not all ontologies were completed after the courses and published in the RIHA repository; some of them did not meet the quality requirements set for ontologies to be stored in RIHA and needed redesign. However, it was decided at the beginning of the ontology development that four critical ontologies, capturing the semantics of 80% of the data elements used by governmental data services, would have high priority. These are the ontologies of the population, business, and address registers as well as of the Estonian topographic database. By now, three of these ontologies have been completed and the population register ontology is under development. These four basic ontologies can be and are reused by many other domain ontologies, making further ontology development processes easier and faster. However, a lot of ontology engineering work is still to be done in order to support full semantic enrichment of the data elements of state information systems data sources.

On the other hand, feedback from the training courses has shown that general existing ontology engineering methodologies like METHONTOLOGY (Gómez-Pérez et al. 2004) are too technically demanding for domain experts, creating a need for more practical and domain-expert-centric approaches. In reality, public administration agencies do not have a large number of ontology engineers or knowledge engineers available for converting domain knowledge into a formal ontology. In view of this situation, a practical methodology for developing domain ontologies was created by Haav (2010a, 2011), and guidelines for the semantic enrichment of data services with domain ontologies were developed by Küngas (2010). We introduce these in the following sections of this chapter.

Ontology Development

Ontology Network

Domain ontologies of state information systems are not isolated ontologies; they form a network of interlinked ontologies. Modular ontologies fit this setting well: as domain experts themselves do the ontology engineering of their respective fields, the creation of modular and simple ontologies was our intention from the very beginning. Each of the domain ontologies is neither large nor of high complexity. As a rule, state information systems domain ontologies contain only descriptions of primitive classes. These are classes that do not have any sets of necessary and sufficient conditions; they may only have necessary conditions (Horrige 2011). A typical description of a state information systems domain ontology includes about 60 descriptions of classes, 25 object properties, and 100 data-type properties. Its Description Logics (DL) complexity is ALCQ(D) (Baader et al. 2003).
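As an illustration of what such a small, primitive-class-only ontology fragment looks like, the following sketch builds a few axioms with the rdflib library; the namespace and the class and property names are hypothetical examples, not taken from the actual RIHA ontologies.

    from rdflib import Graph, Namespace, BNode
    from rdflib.namespace import RDF, RDFS, OWL

    # Hypothetical namespace and names; real domain ontologies are stored in RIHA.
    EX = Namespace("http://example.org/notary-ontology#")
    g = Graph()
    g.bind("ex", EX)

    # A primitive class: only a necessary condition (subclass of a restriction),
    # no necessary-and-sufficient definition.
    g.add((EX.Notary, RDF.type, OWL.Class))
    restriction = BNode()
    g.add((restriction, RDF.type, OWL.Restriction))
    g.add((restriction, OWL.onProperty, EX.hasContactInfo))
    g.add((restriction, OWL.someValuesFrom, EX.ContactInfo))
    g.add((EX.Notary, RDFS.subClassOf, restriction))

    # An object property and a data-type property used for annotating data fields.
    g.add((EX.hasContactInfo, RDF.type, OWL.ObjectProperty))
    g.add((EX.ContactInfo, RDF.type, OWL.Class))
    g.add((EX.nationalIdCode, RDF.type, OWL.DatatypeProperty))
    g.add((EX.nationalIdCode, RDFS.domain, EX.Notary))

    print(g.serialize(format="turtle"))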

As mentioned above, state information systems in Estonia have over 20,000 data entity attributes that should be semantically enriched using corresponding ontology components (i.e., data-type properties). Consequently, the ontology network should, in the worst case, contain approximately the same number of data-type properties in addition to the number of concepts and object properties. By now, we are in the initial stage of the development of the ontology network of state information systems in Estonia.

Besides domain ontologies, the semantic interoperability framework has foreseen the creation of an Estonian Top Ontology. This ontology is a linguistic ontology that has been developed by converting EuroWordNet Estonian (EuroWordNet Estonian 1999) to a corresponding machine-readable OWL representation. The Estonian Top Ontology is available (not publicly) and it can be used for linking domain ontology concepts to corresponding linguistic concepts.

Domain Expert Centric Ontology Development Methodology

Overview

One of the main reasons why domain ontologies and semantic data services are not developed and completed in time is the complexity of ontology development and of the semantic enrichment process. Generally, most well-known ontology development methodologies (Staab et al. 2001, Gómez-Pérez et al. 2004, Haase et al. 2006) are technically demanding and are not easy to learn by employees of governmental agencies, who are domain experts rather than knowledge engineers. Therefore, it was important to reduce the complexity of ontology development by providing an easy-to-learn, domain expert centric ontology development methodology. The idea was to remove the roles of mediators (i.e., knowledge and ontology engineers) from the chain of ontology engineering foreseen by most current ontology development methodologies (Gómez-Pérez et al. 2004, Haase et al. 2006, Lavbiĉ and Krisper 2010) and to delegate the activities of these roles to a domain expert as the actual knowledge holder. A new ontology engineering methodology was provided and presented by Haav (2011), aimed at the development of lightweight domain ontologies in OWL by domain experts without any profound knowledge of ontology engineering. The aim is to make the ontology development process easier for domain experts by providing processes and guidelines that they can learn and use. Ontologies developed according to the methodology are intended to be used for the semantic enrichment of data and data services of the state information systems. At the meta-level, the methodology takes into account some of the proposals of widely accepted ontology development methodologies like METHONTOLOGY (Gómez-Pérez et al. 2004) and NeOn (Haase et al. 2006).

The domain expert centric ontology development methodology defines the ontology development process as a sequence of ontology development activities, their inputs, and outputs. The methodology is presented in more detail by Haav (2011). However, in this chapter, we briefly discuss some of its practical aspects and show its novel features.

As an input to the ontology development process, different reusable knowledge resources, including ontological and nonontological resources that are available in governmental agencies, are used. Reuse of these resources speeds up the ontology development process. For example, nonontological resources are conceptual schemas of databases, vocabularies, thesauri, regulatory documents of state information systems, databases, data service descriptions, etc. Ontological resources are primarily domain ontologies of state information systems collected into the RIHA repository, but they may also be ontologies available in other repositories.

The methodology defines the main activities of the ontology development process for the specification, conceptualization, and implementation phases of an ontology. Management and support activities are defined as in the METHONTOLOGY methodology.

                  For a lifecycle model, two levels of ontology development are identified and the corresponding lifecycle model was developed as follows:

• Domain ontology level. An iterative lifecycle model is used for the creation of ontology modules corresponding to domain ontologies of state information systems. During each of the iterations, domain ontologies are improved until all the requirements are met.
• Ontology network level. A method of evolutionary prototyping is proposed for the development of the entire ontology network. In the beginning, a partial network of state information systems domain ontologies meeting already known requirements is developed as a prototype. This is assessed by different applications, and after that the requirements are refined based on feedback from the application developers.

                  Ontology Development Process

The ontology development process of the proposed methodology was made as simple as possible for domain experts. This was achieved by merging the conceptualization and implementation activities and by applying an iterative lifecycle model to this merged process, called early implementation, as shown in Figure 7.2. The early implementation phase is composed of the following steps:

• Conceptualization. The middle-out conceptualization strategy is chosen as the most appropriate for domain experts. They can easily identify the central concepts of a domain and move from there toward more general or more specific concepts, if necessary. In contrast, the top-down approach (from legislation terminology to data fields) creates too many concept hierarchy levels in the ontology, and the bottom-up method (from data fields to domain concepts) makes generalization hard, as the differentiation of concepts and attributes is not easy. The conceptualization activity starts with the identification of 7–10 central concepts (classes) of a domain and the generalization and specialization of these concepts one level up or down in the concept hierarchy. After that, the main relationships (object properties) between these central concepts are defined.
• Implementation. At the first iteration, the conceptualization in the scope of the basic (central) domain concepts is implemented as an ontology represented in OWL. During each of the next iterations, new concepts and relationships are added and implemented.

Early implementation enables one to detect logical errors of the ontology description at an early stage of its implementation, as well as to evaluate how well it meets the requirements.

Attributes of individuals of concepts are added in the final implementation stage. After that, the ontology is evaluated according to the requirements, for example: are all input/output parameters of Web services covered by corresponding components of the respective domain ontology? Are all data objects covered by data-type properties? If not, then the missing components are added. The ontology editing tool Protégé (http://protege.stanford.edu) was used for the ontology implementation process. This tool enables the usage of DL reasoners (e.g., Pellet (http://www.clarkparsia.com/pellet), FaCT++, etc.) in order to automatically find inconsistencies in the ontology description presented in OWL. This activity was highly recommended to ontology developers as a part of the ontology implementation process.

Figure 7.2 Main activities of the ontology development process: specification, early implementation (conceptualization and implementation), and final implementation.
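The same kind of consistency check can also be scripted outside Protégé. The sketch below uses the Owlready2 library (which bundles the HermiT reasoner and requires a Java runtime) on a hypothetical local ontology file, merely to illustrate the step of detecting unsatisfiable classes; it is not the workflow prescribed by the methodology.

    from owlready2 import get_ontology, sync_reasoner, default_world

    # Load a locally saved domain ontology (hypothetical file path).
    onto = get_ontology("file:///tmp/population_register.owl").load()

    # Run a DL reasoner over the loaded ontology.
    with onto:
        sync_reasoner()

    # Classes inferred to be equivalent to owl:Nothing indicate modelling errors.
    unsatisfiable = list(default_world.inconsistent_classes())
    if unsatisfiable:
        print("Unsatisfiable classes:", unsatisfiable)
    else:
        print("No unsatisfiable classes found.")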

During the process of implementation, ontologies are commented in both Estonian and English in order to also provide human-understandable ontology descriptions.

                  Evaluation

The provided methodology has been iteratively developed and evaluated during 2010–2011. The methodology was widely used in the numerous training courses on ontology engineering provided for domain experts of state information systems. By now, the methodology has been accepted as the ontology development methodology for the creation of domain ontologies of state information systems in Estonia and has been made publicly available in RIHA (Haav 2010a). In addition, guidelines and requirements for the creation of domain ontologies of state information systems have been developed by Haav (2010b) in order to assure the quality of ontology descriptions loaded into the RIHA repository.

However, applying the methodology shows that its implementation activity is sometimes still too complex for domain experts, even when using simple ontology editors. Therefore, in the future this ontology development methodology will be improved by introducing simple intermediate representations of domain conceptualizations that can be automatically translated into a corresponding OWL ontology. In this case, the implementation activity of the methodology may become automatic or at least semiautomatic.

                Semantic Enrichment of Data and Services

Although ontologies facilitate the explicit description of domain knowledge, which can be effectively used in applications, a binding must be created between the application, data, and ontology elements in order to exploit the knowledge automatically. Depending on the application and its architecture, such bindings may include application-specific details, which aim to simplify the usage of ontologies in particular applications. In the case of the Estonian semantic interoperability framework, the complete set of applications has not been finally determined, since the involved parties have low expectations with respect to semantic technologies. In any case, the initial applications are artifact indexing and discovery, automated Web service composition, linking open data, and redundancy detection in Web services.

In fact, it has been indicated (Ventrone and Heiler 1994) that about 80% of data are redundant in large information systems, leading to high maintenance costs. Semantic annotation of software artifacts, that is, of descriptions of Web services and database schemata in XML form, leverages the exploration of dependencies between them, through which they can be optimized. Service composition is relevant in the Estonian context since the state information system is based on individual services and the overall intuition is that new value-added services can be formed by skillfully combining the existing ones. Finally, the integration of heterogeneous data is recognized as an issue since many decisions are currently made based on data collected by individual organizations solely for their own purposes. This, however, has limited the quality of decision-making, since a lot of decisions are based on partial or insufficient data.

Finally, it is important to note that linking meta-data to the artifacts themselves is preferred in order to make the sources as self-explanatory as possible.

On the basis of these applications, we have set the following requirements for annotations:

                • Minimal effort should be required to annotate artifacts.
• The annotation schemata should facilitate specification of the semantics as specifically as possible.
• The annotations should facilitate matching the artifacts and making the dependencies between them explicit.
                • Annotations should allow namespaces and it should be possible to attach to them additional descriptions (i.e., links to the Web resources).

                  We scoped annotation to Web services descriptions and realization-level data models and used the cost-effective schema annotation methodology described by Küngas and Dumas (2009) and inspired by UPON (Nicola et al. 2009).

Semantic Descriptions in SAWSDL

There are several languages for describing semantic Web services, like OWL-S (http://www.w3.org/Submission/OWL-S/), WSMO, and SAWSDL. In the context of the Estonian e-government case study, the semantics of Web services are presented in SAWSDL, since it provides mechanisms for embedding semantic annotations into existing WSDL and XSD descriptions of services. For expressing the semantics of schema elements, SAWSDL defines an extension to XSD and WSDL in terms of three attributes: modelReference, liftingSchemaMapping, and loweringSchemaMapping. Model reference attributes are used mainly for linking schema elements with any meta-model (in our case, an ontology), while the schema mapping attributes are meant to support schema transformations.

However, SAWSDL itself is far from being perfect for schema annotations.

Namely, SAWSDL is very general and too flexible concerning the style and meaning of semantic annotations as well as the type of concepts that are used for annotations. These issues are also recognized in the works devoted to SAWSDL-based semantic service matchmaking (Klusch et al. 2009, Schulte et al. 2010). Therefore, for the Estonian e-government case study, the following constraints have been imposed on the usage of SAWSDL:

• It is required that the semantic concepts are formally defined in the OWL DL ontology language—the SAWSDL specification, in general, allows heterogeneity in ontologies and ontology languages. The only constraint of the SAWSDL specification is that the resources (in our case, ontology elements) semantically describing WSDL elements should be identifiable via URI references. In general, this may create a problem with respect to the automatic interpretation of these concepts.

• References to multiple ontologies to describe the semantics of the same WSDL element are allowed, but their usage is strictly regulated—elements from different ontologies are allowed in two cases. First, if alternative semantics are described due to the usage of the semantics in a specific context. And second, when an annotation is refined by following the RDF graph of an ontology, which imports other ontologies.
• Bottom-level annotations are recommended—the SAWSDL specification does not set any restrictions on the style of annotations: both top- and bottom-level annotations are supported. A top-level annotation means that a complex type or element definition of a message parameter is described by a model reference as a whole. A bottom-level annotation enhances the parts of the definition of a complex type or element. In the Estonian semantic interoperability framework, a set of bottom-level annotation rules for SAWSDL service descriptions is provided by Küngas (2010) and applied. These rules specify that only the leaf nodes of the data structures of the message parts of service operations are annotated, and they give bottom-level schemas (patterns) for the annotation of different XSD types.

An example of bottom-level annotations, where WSDL elements “county” and “street” are enriched with data-type properties countyName and streetName from the land ontology using SAWSDL model reference statements, is presented as follows:

    <wsdl:types>
      . . .
      <complexType name="aadress1">
        <sequence>
          <element name="county" type="xsd:string" sawsdl:modelReference=". . ."/>
          <element name="street" type="xsd:string" sawsdl:modelReference=". . ."/>
        </sequence>
      </complexType>
      . . .
    </wsdl:types>
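Because SAWSDL annotations are ordinary XML attributes, they can be harvested with standard tooling. The following sketch (the file name is a hypothetical local copy of an annotated WSDL) collects each sawsdl:modelReference value together with the name of the annotated element.

    import xml.etree.ElementTree as ET

    SAWSDL_REF = "{http://www.w3.org/ns/sawsdl}modelReference"

    def model_references(wsdl_path: str) -> dict:
        """Map each annotated element name to the list of ontology URIs it references."""
        refs = {}
        for elem in ET.parse(wsdl_path).iter():
            annotation = elem.get(SAWSDL_REF)
            if annotation:
                # A modelReference may hold several whitespace-separated URIs
                # (see the annotation patterns described in the next section).
                refs[elem.get("name", elem.tag)] = annotation.split()
        return refs

    print(model_references("service.wsdl"))  # hypothetical local copy of an annotated WSDL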

Annotation Patterns for Semantic Enrichment of Schemas

                  We do not distinguish services and data definitions from annotation point of view. In fact, our main concern in services is their data semantics. This allows us to use the same annotation approach to both data and services.

We have defined four reification patterns for presenting the semantics of data attributes. In general, the patterns define a path in the RDF graph, which encodes the semantics of a particular data attribute. In the following, we write URIdtp, URIop, and URIc, respectively, to denote URIs of OWL data-type properties, object properties, and classes. The four basic patterns are usually sufficient for describing the semantics of well-structured data structures. Other constructs might be needed mostly in cases where data structures are either too generic, originate from legacy systems, or their usage has changed in time. Since such cases mostly originate from ignorance with respect to best practices in data model design, we assume that the data models will be aligned with the best practices before annotation.

                  Pattern 1—URIdtp

In case a schema element is used to encode a data attribute for a wide range of settings, that is, its domain is not restricted, a reference to a data-type property in an ontology would be sufficient for annotation. In the following example, a data-type property nationalIdCode, defined in namespace http://ws.soatrader.com/ontology/BaseOntology.owl, is used to refer to an attribute which can encode a national code of citizens from any country:

    <element xmlns:sawsdl="http://www.w3.org/ns/sawsdl"
      sawsdl:modelReference="http://ws.soatrader.com/ontology/BaseOntology.owl/nationalIdCode"
      name=". . ." type="string"/>

While writing such one-element annotations, one should bear in mind that the subject of a particular attribute should comply with the domain of the data-type property. For instance, if, in the preceding example, the national identification code attribute is used only for presenting the national identification codes of notaries in a data set and the domain of the national identification code property is not the notary, then this pattern is not applicable. In this case, a more specific pattern (such as Pattern 2 below) should be applied instead.

To summarize, this pattern should only be applied in cases where there is no need to distinguish classes in ontologies—either the semantics of a data type is general enough or the attribute to be annotated is domain-specific and appears only in very specific data sets and contexts.

                  Pattern 2—URIc URIdtp

In order to explicitly state the semantics of data entity attributes more precisely, one should combine the elements of the developed ontologies. For instance, in ontology http://www.soatrader.com/ontology/BaseOntology.owl, we have class Notary to encode the concept of a notary and a datatype property for referring to the semantics of any national identification code (nationalIdCode). By applying pattern “URIc URIdtp,” we can represent the semantics for a national identification code of a notary, given that class Notary is allowed to be used as a domain of the datatype property. The preceding is demonstrated in the following annotation:

    <element xmlns:sawsdl="http://www.w3.org/ns/sawsdl"
      sawsdl:modelReference="http://www.soatrader.com/ontology/BaseOntology.owl/Notary http://www.soatrader.com/ontology/BaseOntology.owl/nationalIdCode"
      name="notary_idcode" type="string"/>

By applying this pattern, we are able to semantically describe the majority of data attributes when the number of ontologies for the state information system is relatively small and their modularity is low.

                  Pattern 3—URIop URIc URIdtp

In the case of ontologies with a higher degree of modularity, such as the ones where property clumps (an ontology design anomaly) are resolved under specific classes, there is a need for richer patterns than the previously introduced ones. In the following example, we show how to specify the semantics of an e-mail address of an arbitrary person after the resolution of the property clump for contact information into class ContactInfo. This class is a domain for data-type properties encoding the semantics of contact info attributes such as e-mail addresses, street names, postal codes, etc. In the current case, there is a data-type property e-mailAddress for encoding the semantics of an e-mail address of an arbitrary entity. In order to bind the contact information to a specific subject, the ontology contains object property hasContactInfo, which encodes the semantics of having contact information. By applying pattern “URIop URIc URIdtp” to these three ontology elements, we can express the semantics of an e-mail address belonging to the contact information of an arbitrary subject:

    <element xmlns:sawsdl="http://www.w3.org/ns/sawsdl"
      sawsdl:modelReference="http://www.soatrader.com/ontology/BaseOntology.owl/hasContactInfo http://www.soatrader.com/ontology/BaseOntology.owl/ContactInfo http://www.soatrader.com/ontology/BaseOntology.owl/e-mailAddress"
      name="e-mail" type="string"/>

Since such a semantic description pattern does not state the subject explicitly (of the contact info in this particular case), it suits annotating data attributes that are not scoped to a particular subject type.

                  Pattern 4—URIc URIop URIc URIdtp

In the case of ontologies with a higher degree of modularity, there is also a need to describe relations between subjects and the relevant data objects. In such a case, it is not enough to apply pattern 3. In the following example, we exemplify the usage of pattern 4 in conjunction with explicitly expressing a subject–object relation within an annotation.

In ontology http://www.soatrader.com/ontology/BaseOntology.owl, we have defined classes Notary and ContactInfo. Class ContactInfo is a domain for data-type properties encoding the semantics of contact information details of organizations and individuals. One of such properties is e-mailAddress for representing the semantics of any e-mail address. In order to bind the contact information to a specific subject, the ontology contains object property hasContactInfo, which encodes the semantics of having contact information. By applying pattern 4 to these ontology elements, we are able to express, in the following example, the semantics of an e-mail address within notary contact information.

    <element xmlns:sawsdl="http://www.w3.org/ns/sawsdl"
      sawsdl:modelReference="http://www.soatrader.com/ontology/BaseOntology.owl/Notary http://www.soatrader.com/ontology/BaseOntology.owl/hasContactInfo http://www.soatrader.com/ontology/BaseOntology.owl/ContactInfo http://www.soatrader.com/ontology/BaseOntology.owl/e-mailAddress"
      name="notary_e-mail" type="string"/>
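The following sketch shows how an annotation could be classified mechanically against these four patterns once the kind of every referenced ontology element is known; the lookup table is a hypothetical stand-in for information that would normally be read from the OWL ontology itself.

    # Illustrative sketch: classify a modelReference (already split into URIs) into
    # one of the four patterns, given the kind of each referenced ontology element.
    KINDS = {  # hypothetical lookup; in practice derived from the OWL ontology
        "http://www.soatrader.com/ontology/BaseOntology.owl/Notary": "c",
        "http://www.soatrader.com/ontology/BaseOntology.owl/ContactInfo": "c",
        "http://www.soatrader.com/ontology/BaseOntology.owl/hasContactInfo": "op",
        "http://www.soatrader.com/ontology/BaseOntology.owl/e-mailAddress": "dtp",
        "http://www.soatrader.com/ontology/BaseOntology.owl/nationalIdCode": "dtp",
    }
    PATTERNS = {("dtp",): 1, ("c", "dtp"): 2, ("op", "c", "dtp"): 3, ("c", "op", "c", "dtp"): 4}

    def pattern_of(model_reference):
        kinds = tuple(KINDS.get(uri) for uri in model_reference.split())
        return PATTERNS.get(kinds)  # None if the annotation matches no pattern

    print(pattern_of("http://www.soatrader.com/ontology/BaseOntology.owl/Notary "
                     "http://www.soatrader.com/ontology/BaseOntology.owl/nationalIdCode"))  # -> 2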

                Common Faults in Semantic Annotation

                  Throughout our case study, we have experienced common faults, which appear in annotations. Here, we list the most prevalent ones as follows:

1. Precision of semantic annotation—people tend to think in terms of their own application domain, and they often provide annotations by considering only their applicability within a particular application of their own domain. In practice, it means that mostly pattern 1 is used in annotations. In such cases, in addition to incomplete semantic descriptions, there is also a risk that, even if an ontology is initially constructed for a single domain, in time it gets linked to others, and this implies that the semantic scope of annotations will change if not explicitly constrained.

2. Class vs data-type property in semantic annotations—people without proper training in knowledge engineering tend to mix the meanings/intentions of datatype properties and classes in ontologies.

3. Encoding vs meaning—people tend to get confused by the representation of data and often extend domain ontologies with encoding details. Therefore, we encourage them to design the data models and interfaces in such a way that the semantic descriptions will be kept simple.

                Applications of Semantics

                  We have built several applications that use semantically enriched data and services. In this section, we introduce applications that exploit semantic annotations for semantic data analysis as well as for interoperability of data and services.

Semantic Data Analysis: Redundancy Detection

We have used the developed annotations for analyzing redundancy (Küngas and Dumas 2010) in WSDL descriptions of information system interfaces (services). The intuition here is that redundancy in interfaces leads to redundancy in data management. However, since developers use different naming conventions and structuring styles in interface descriptions, we used semantic annotations first to identify links between the interfaces and then applied the developed metrics for redundancy detection. In Figure 7.3, a cluster map layout is used for visualizing the overlap in entity attributes (i.e., leaf nodes of data structures) from different information systems. The dots represent entity attributes with specific semantics, while the clusters visualize the information systems. An entity attribute appears in a cluster if it appears in the interface descriptions of the corresponding information system. One can see that there is a significant overlap in the entity attributes of different information systems.
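The metrics themselves are described by Küngas and Dumas (2010); as a simple illustration of the underlying idea only, the sketch below measures the pairwise overlap of annotation URIs harvested from the interface descriptions of three hypothetical information systems.

    from itertools import combinations

    # Hypothetical sets of annotation URIs (semantics of leaf-level entity attributes)
    # harvested from the interface descriptions of three information systems.
    systems = {
        "land_register":     {"ex:cadastralCode", "ex:countyName", "ex:ownerIdCode"},
        "business_register": {"ex:companyCode", "ex:ownerIdCode", "ex:countyName"},
        "health_insurance":  {"ex:personIdCode", "ex:insuranceStatus"},
    }

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    # Pairwise overlap hints at where the same data may be managed redundantly.
    for (name1, attrs1), (name2, attrs2) in combinations(systems.items(), 2):
        print(f"{name1} vs {name2}: {jaccard(attrs1, attrs2):.2f}")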

In the Estonian state information systems case study, we were able to automatically detect the redundancy with relatively high precision and recall, as well as to identify the primary location (an individual information system in the federated information system) of redundant attributes. This finding suggests that in the future, while developing new information systems, we have the capability of suggesting in which information systems new types of data should be managed and how to rearrange the data models and services of existing systems with respect to the new ones.

Figure 7.3 Cluster map representation of redundant entity attributes (the clusters shown include the health insurance information system (40), the land information system (83), and the business registry (new) (71)).

This case study unveiled that, although individual information systems might not have a lot of data redundancy, there can be considerable redundancy in the federated state information system as a whole. More specifically, we found that 79% of data items are redundant, which is consistent with the findings of Ventrone and Heiler (1994), who point to several cases where data model overlap in large federated information systems was up to 80%.

Data Interoperability with Semantic Data Services

One of the most studied applications of the semantics of Web services is the matchmaking of services (Klusch et al. 2009, Küngas and Dumas 2009, Schulte et al. 2010) and their (automated) composition (Rao et al. 2006, Maigre et al. 2013). Although there are tools available for facilitating automated composition, they are mostly exploited in scientific environments (Stevens et al. 2003), where workflows for scientific computation can be composed from existing Web services.

There are some approaches (Haav et al. 2007, Maigre et al. 2013) to automating semantic web service composition that take into account the requirements of the Estonian e-government case study. However, these methods have not been implemented in practical service composition tools.

The major benefits of using automated Web service composition include the following: (1) complex components can be managed by different organizations and still be used collectively, allowing the IT management costs to be shared, and (2) data-intensive computations can be performed close to the data, reducing network bottlenecks and increasing the performance of computations.

Such benefits are not well recognized in the context of the Estonian state information systems, where public sector service providers do not enrich their data services with semantic annotations in a timely manner.

Therefore, in order to demonstrate the capabilities of semantic annotations of services, we developed a prototype solution, which, instead of public sector services, uses the services publicly available in the Web. Furthermore, one of the aims of building the prototype solution was to demonstrate how the semantics of services can be used for automated service composition in end-user applications. Therefore, one of the objectives was to hide the complexities related to composition and semantic annotations from the user.

As a result, we developed a framework in which we fused natural language processing, ontology reasoning, service annotation, discovery, composition and execution, Web technologies, Web widgets, and on-the-fly semantic data aggregation into a deep web search engine. A flowchart of the framework platform is illustrated in Figure 7.4. The idea of the prototype was simple—after a user has entered a search query, a simple Web application is synthesized which, when executed, will answer the user’s query. For doing this, language technology is used to process the query and transform it into a semantic service composition task. This task is used to initiate composition and service selection. After the set of relevant services has been identified, suitable visual widgets with matching interfaces are selected. Finally, the Web application is synthesized from the discovered components, and the named entities in the original search query are used to initialize and execute the application. The latter visualizes the search results to the user. A sample application synthesized for a company background check query is presented in Figure 7.5.

Figure 7.4 Flowchart of the framework platform (steps shown: user search query; language processing; composition query construction; automated composition; visual widget selection; Web application synthesis; application execution).

Figure 7.5 An instance of a synthesized Web application.

Linked Open Government Data

Open data initiatives have been expected to positively impact the economy, increase the openness of societies, accelerate democracy, and affect other aspects that are important from a political point of view and are of concern to decision-makers. Here, Estonia is no exception and a thread of the open data initiative is managed at the national level. There are no guidelines yet on which level the openly provided governmental data should comply with, with respect to the five-star scale proposed by Tim Berners-Lee at his Bag of Chips talk. On this scale, one star means that data are available in any format and five stars means that data are available in RDF format and are linked with other data sets.

To understand the benefits, in 2011 a competition was launched in Estonia where four private sector service providers were granted funding for developing demonstrators for exposing and using open data. One of the demonstrators, which is currently under development, took the linked data approach. The main motivators for this choice were scalability and a reduction in the number of technologies needed for using the data set together with other data sets. While model-based data integration, which can easily be bound to linked data, provides scalability from the data integration point of view, reuse is simplified through the reduced number of different technologies needed for data processing.

Although ontologies are not a prerequisite for linked data, they provide extended possibilities for querying data sets with languages such as SPARQL or SERQL.* Ontologies in such a case provide a sort of view of the data and impose the domain-specific constraints and rules for automated inference. The latter simplifies the usage of linked data in the sense that you do not need to encode extensive domain knowledge into your applications.

The linked data demonstrator uses data provisioning services from the Estonian Register of Buildings to first download the records of all buildings (the database consists of about 900k records). Then these data are linked to the data from the Address Data System to bind addresses to the buildings. The addresses of companies are then linked to the data set as well to facilitate linking buildings with companies. Finally, the data set in a structural database is transformed into RDF triples (about 200M triples are expected for this data set) and stored into an RDF data store where it can be either downloaded or queried directly. More specifically, OpenLink Virtuoso Open Source Edition is used to store and expose the linked data from the Register of Buildings through its SPARQL endpoint. Besides RDF, JSON and other data formats are also supported by the platform.
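To illustrate how such an exposed SPARQL endpoint can be consumed programmatically, the following minimal Python sketch issues a query through the SPARQLWrapper library. The endpoint URL, the ontology class, and the property names are hypothetical placeholders and not the actual vocabulary of the Register of Buildings data set.

```python
# Minimal sketch of querying a Virtuoso-style SPARQL endpoint from Python.
# Endpoint URL and ontology terms below are hypothetical placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://example.org/sparql"  # hypothetical SPARQL endpoint

QUERY = """
SELECT ?building ?address
WHERE {
  ?building a <http://example.org/ontology#Building> ;            # hypothetical class
            <http://example.org/ontology#hasAddress> ?address .   # hypothetical property
}
LIMIT 10
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)

# Execute the query and print (building URI, address literal) pairs
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["building"]["value"], "->", row["address"]["value"])
```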

For updating the data set, an existing data service is regularly invoked to retrieve and store the updates. This approach has set additional requirements for services engineering: for such an open linked data creation and update mechanism, we need two kinds of services, namely services for retrieving changes or identifiers of changed records, and services for retrieving complete data records or changes. Furthermore, there is a need for a specific service for retrieving data records that exposes only public data. Currently, the majority of services return both confidential and public data, since the main service consumers have traditionally been public sector authorities and organizations.

* SPARQL: http://www.w3.org/TR/rdf-sparql-query/; SERQL: http://www.w3.org/2001/sw/wiki/SeRQL


Although for this solution the mappings from structured data to RDF format are currently provided manually, we see great potential in using the semantic annotations of data provisioning services to construct such mappings automatically.

                Conclusion and Future Work

                  We have discussed some practical semantic interoperability problems of data in the domain of public sector administration on the basis of the Estonian case study. We have provided e-government solutions that benefit from semantic technologies used together with Web services and open data technologies.

According to our experience, meta-data creation and semantic enrichment of data and services are hard problems that must be solved before successful semantic data interoperability solutions can be developed and deployed. For existing legacy systems, as in the Estonian case study, this is very hard. For new information systems, it can be included in the system development life cycle as a part of system analysis and implementation activities.

In order to make semantic technologies practically applicable, we have created a domain-expert-centric agile method for ontology development and a set of rules for semantic enrichment of data and services using OWL ontologies.

We have considered the application of semantics in detecting data redundancy in information systems, in achieving semantic data interoperability with semantic data services using automatic service composition, and in linking open government data. Although the case of the Estonian Register of Buildings, with an estimated 200M RDF triples, does not yet classify as a Big Data case, further activities in the context of the open data initiative are expected to grow this linked data set considerably over the following few years. Furthermore, we have learned that enhancing linked data with ontological structures makes data integration easier and provides benefits to (open) data publishing.

We are working on improving our ontology development methodology. We intend to provide (semi)automatic generation of machine readable meta-data (e.g., domain ontologies).

We continue our work on automatic composition of data provisioning services based on semantic annotations of service parameters. We also foresee that the semantic annotations of data provisioning services can be used for automatically constructing mappings from structured data to RDF to be used in the context of linked data.

According to our experience, manual creation of domain ontologies and semantic enrichment of data and data services is too resource-consuming to be feasible in the context of Big Data (possibly coming from distributed heterogeneous sources). Considering our experience with linked data and services, we suggest that some of the future challenges of large-scale semantic data interoperability are related to (semi)automatic generation of machine readable meta-data, ontology matching, and (semi)automatic semantic enrichment of data and data services. Solutions in these fields will facilitate building an infrastructure for meaningful interoperability of heterogeneous data sources and on-demand flexible integration of data.

                Acknowledgments

                  This research was supported by the target-financed theme no. SF0140007s12 of the Estonian Ministry of Education and Research, and by the European Regional Development Fund (ERDF) through the project no 3.2.1201.13-0026 and EXCS.

References

Baader, F., D. Calvanese, D. McGuiness, D. Nardi, and P. Patel-Schneider. 2003. The Description Logic Handbook: Theory, Implementation and Applications. Cambridge University Press, Cambridge, UK.
EuroWordNet Estonian. 1999. OLAC record of EuroWordNet Estonian. http://www.language-archives.org/item/oai:catalogue.elra.info:ELRA-M0022 (accessed October 1, 2012).
Gartner. 2012. Gartner Says Solving 'Big Data' Challenge Involves More Than Just Managing Volumes of Data. (accessed October 1, 2012).
Gómez-Pérez, A., M. Fernández-López, and O. Corcho. 2004. Ontological Engineering with Examples from the Areas of Knowledge Management, e-Commerce and the Semantic Web. Springer, Heidelberg.
Haase, P., S. Rudolph, Y. Wang et al. 2006. NeOn Deliverable D1.1.1: Networked Ontology Model. http://www.neon-project.org (accessed March 12, 2012).
Haav, H.-M. 2011. A practical methodology for development of a network of e-government domain ontologies. In Conference on e-Business, e-Services, and e-Society (I3E 2011), Revised Selected Papers, eds. T. Skersys, R. Butleris, L. Nemuraite, and R. Suomi, pp. 1–13, Springer, Heidelberg.
Haav, H.-M., T. Tammet, V. Kadarpik, K. Kindel et al. 2007. A semantic-based web service composition framework. In Advances in Information Systems Development: New Methods and Practice for the Networked Society [Proc. of 15th Int. Conf. on Information Systems Development, ISD 2006 (Budapest, Aug./Sept. 2006)], eds. G. Magyar et al., (1), pp. 379–391, Springer, New York.
Haav, H.-M., A. Kalja, P. Küngas, and M. Luts. 2009. Ensuring large-scale semantic interoperability: The Estonian public sector's case study. In Databases and Information Systems V, eds. H.-M. Haav and A. Kalja, pp. 117–129, IOS Press, Amsterdam, the Netherlands.
Harmelen, F. van. 2008. Semantic web technologies as the foundation for the information infrastructure. In Creating Spatial Information Infrastructures, ed. P. van Ooster, Wiley, New York.
Horrige, M. 2011. A Practical Guide to Building OWL Ontologies Using Protégé 4 and CO-ODE Tools. http://owl.cs.manchester.ac.uk/tutorials/protegeowltutorial (accessed January 9, 2013).
Kalja, A. 2009. New version of the x-road. In Information Society Yearbook 2009. Ministry of Economic Affairs and Communications of Estonia, Department of State Information Systems (RISO). (accessed August 3, 2012).
Klusch, M., P. Kapahnke, and I. Zinnikus. 2009. SAWSDL-MX2: A machine-learning approach for integrating semantic web service matchmaking variants. In Proceedings of the 2009 IEEE International Conference on Web Services (ICWS), Los Angeles, CA, pp. 335–342, IEEE Computer Society.
Küngas, P. and M. Dumas. 2009. Cost-effective semantic annotation of XML schemas and web service interfaces. In Proceedings of the IEEE International Conference on Services Computing (SCC 2009), Bangalore, India, pp. 372–379, IEEE Computer Society Press.
Küngas, P. and M. Dumas. 2010. Redundancy detection in service-oriented systems. In Proceedings of the 19th International Conference on World Wide Web (WWW'10), eds. M. Rappa, P. Jones, J. Freire, and S. Chakrabarti, pp. 581–590, ACM, New York.
Lavbiĉ, D. and M. Krisper. 2010. Rapid ontology development. In Proceedings of the 19th Conference on Information Modelling and Knowledge Bases XXI, Maribor, Slovenia, eds. T. Welzer Družovec, H. Jaakkola, Y. Kiyoki, T. Tokuda, and N. Yoshida, pp. 283–290, IOS Press.
Maigre, R., P. Grigorenko, H.-M. Haav, and A. Kalja. 2013. A semantic method of automatic composition of e-government services. In Databases and Information Systems VII: Selected Papers from 10th Int. Baltic Conf. on Databases and Information Systems, Baltic DB&IS 2012, Frontiers of Artificial Intelligence and Applications, eds. A. Caplinskas, G. Dzemyda, A. Lupeikene, and O. Vasilecas, (249):204–217, IOS Press, Amsterdam, the Netherlands.
Manyika, J., M. Chui, B. Brown, et al. 2011. Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute.
Marshall, P. 2012. What you need to know about big data. Government Computer News, February 7. http://gcn.com/articles/2012/02/06/feature-1-future-of-big-data.aspx (accessed October 1, 2012).
Nicola, A., M. Missikoff, and R. Navigli. 2009. A software engineering approach to ontology building. Information Systems 34(2):258–275.
Parmakson, P. and E. Vegmann. 2009. The administration system of the state information system (RIHA). In Information Society Yearbook 2009. Ministry of Economic Affairs and Communications of Estonia, Department of State Information Systems (RISO). (accessed August 3, 2012).
Rao, J., P. Küngas, and M. Matskin. 2006. Composition of semantic web services using linear logic theorem proving. Information Systems 31(4–5):340–360.
Schulte, S., U. Lampe, J. Eckert, and R. Steinmetz. 2010. LOG4SWS.KOM: Self-adapting semantic web service discovery for SAWSDL. In Proceedings of the 2010 IEEE 6th World Congress on Services (SERVICES-1), Miami, FL, pp. 511–518, IEEE Computer Society.
Staab, S., H.P. Schnurr, R. Studer, and Y. Sure. 2001. Knowledge processes and ontologies. IEEE Intelligent Systems 16(1):26–34.
Stevens, R.D., A.J. Robinson, and C.A. Goble. 2003. MyGrid: Personalised bioinformatics on the information grid. Bioinformatics 19:i302–i304.
Stuckenschmidt, H. 2012. Data semantics on the web. Journal of Data Semantics 1:1–9.
Vallner, U. 2006. Nationwide components of Estonia's state information system. Baltic IT&T Review 3(42):34–38.
Ventrone, V. and S. Heiler. 1994. Some advice for dealing with semantic heterogeneity in federated database systems. In Proceedings of the Database Colloquium, San Diego, CA, August 1994, Armed Forces Communications and Electronics Assc (AFCEA).
Villegas, M., N. Bel, S. Bel, and V. Rodríguez. 2010. A case study on interoperability for language resources and applications. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), ed. N. Calzolari et al., pp. 3512–3519, European Language Resources Association (ELRA), Paris, France.
Wassermann, B. and W. Emmerich. 2006. Reliable scientific service compositions. In Proceedings of the 4th International Conference on Service-oriented Computing (ICSOC'06), eds. D. Georgakopoulos, N. Ritter, B. Benatallah et al., pp. 14–25, Springer-Verlag, Berlin, Heidelberg.
X-road 5.0. 2011. Turvaserveri kasutusjuhend [Security server user guide]. Redaktsioon 5.05 (29.04.2011). http://ee.x-rd.net/docs/est/turvaserveri_kasutusjuhend.pdf (accessed October 1, 2012) (in Estonian).
X-road. 2011. Nõuded infosüsteemidele ja adapterserveritele [Requirements for information systems and adapter servers]. Versioon 8.0 (04.05.2011) (in Estonian).


Big Data Exploration

Stratos Idreos

CONTENTS

Introduction
    The Big Data Era
In Need for Big Data Query Processing
    Big Data in Businesses
    Big Data in Sciences
Big Data Challenges for Query Processing
    Existing Technology
    Big Data Challenges
    Because More is Different
Data Exploration
    Key Goals: Fast, Interactive, and Adaptive
    Metaphor Example
    Data Exploration Techniques
Adaptive Indexing
    Indexing
    Offline Indexing
    Big Data Indexing Problems
    Database Cracking
    Column Stores
    Selection Cracking Example
    Data Structures
    Continuous Adaptation
    Performance Examples
    Sideways Cracking
    Partial Cracking
    Updates
    Adaptive Merging
    Hybrids
    Robustness
    Concurrency Control
    Summary
Adaptive Loading
    The Loading Bottleneck
    External Files
    Adaptive Loading
    Selective Parsing
    Indexing
    Caching
    Statistics
    Splitting Files
    Data Vaults
    Summary
Sampling-Based Query Processing
    Sciborg
    Blink
    One-Minute DB Kernels
    dbTouch
    Summary
References

Introduction

The Big Data Era

We are now entering the era of data deluge, where the amount of data outgrows the capabilities of query processing technology. Many emerging applications, from social networks to scientific experiments, are representative examples of this deluge, where the rate at which data are produced exceeds any past experience. For example, scientific analysis such as astronomy is soon expected to collect multiple terabytes of data on a daily basis, while web-based businesses such as social networks or web log analysis are already confronted with a growing stream of large data inputs. Therefore, there is a clear need for efficient Big Data query processing to enable the evolution of businesses and sciences to the new era of data deluge.

                  In this chapter, we focus on a new direction of query processing for Big Data where data exploration becomes a first-class citizen. Data exploration is necessary when new big chunks of data arrive rapidly and we want to react quickly, that is, with little time to spare for tuning and set-up. In particular, our discussion focuses on database systems technology, which for several decades has been the predominant data processing tool.

In this chapter, we introduce the concept of data exploration and discuss a series of early techniques from the database community toward the direction of building database systems which are tailored for Big Data exploration, that is, adaptive indexing, adaptive loading, and sampling-based query processing. These directions focus on reconsidering fundamental assumptions and on designing next-generation database architectures for the Big Data era.

                In Need for Big Data Query Processing

                  Let us first discuss the need for efficient query processing techniques over Big Data. We briefly discuss the impact of Big Data both in businesses and in sciences.

                  Big Data in Businesses

For businesses, fast Big Data analysis translates to better customer satisfaction, better services, and in turn it may happen to be the catalyst in creating and maintaining a successful business. Examples of businesses in need for analyzing Big Data include any kind of web- and data-based IT business, ranging from social networks to e-commerce, news, emerging mobile data businesses, etc. The most typical example in this case is the need to quickly understand user behavior and data trends; this is necessary in order to dynamically adapt services to the user needs.

Businesses continuously monitor and collect data with regard to the way users interact with their systems, for example, in an e-commerce web site, in a social network, or in a GPS navigation system, and these data need to be analyzed quickly in order to discover interesting trends. Speed here is of the essence, as these businesses collect multiple terabytes of data on a daily basis and the kinds of trends observed might change from day to day or from hour to hour. For example, social networks and mobile data applications observe rapid changes in user interests; every single minute there are 700,000 status updates on Facebook and 700,000 queries on Google. This results in staggering amounts of data that businesses need to analyze as soon as possible and while it is still relevant.

                  Big Data in Sciences

For sciences, fast Big Data analysis can push scientific discovery forward. All sciences nowadays struggle with data management, for example, astronomy, biology, etc. At the same time, the expectation is that in the near future sciences will increase their ability to collect data even more. For example, the Large Synoptic Survey Telescope project in the USA expects a daily collection of 20 terabytes, while the Large Hadron Collider at CERN in Europe already creates an even bigger amount of data. With multiple terabytes of data on a daily basis, data exploration becomes essential in order to allow scientists to quickly focus on data parts where there is a good probability of finding interesting observations.

                Big Data Challenges for Query Processing

                  We continue the discussion by focusing on the challenges that Big Data bring for state-of-the-art data management systems.

Existing Technology

Data management technology has a tremendous and important history of achievements and numerous tools and algorithms to deal with scalable data processing. Notable recent examples include column-store database systems (Boncz et al. 2005; Stonebraker et al. 2005) and MapReduce systems (Dean and Ghemawat 2004) as well as recent hybrids that take advantage of both the structured database technology and the massively scalable MapReduce technology (Abouzeid et al. 2009; Hadapt 2012; Platfora 2012). All small and major organizations rely on data management technology to store and analyze their data. Sciences, on the other hand, rely on a mix of data management technologies and proprietary tools that accommodate the specialized query processing needs in a scientific environment.

                  Big Data Challenges

Regardless of the kind of technology used, the fundamental problem nowadays is that we cannot consume and make sense of all these data fast enough. This is a direct side effect of some of the assumptions that are inherent in modern data management systems.

First, state-of-the-art database systems assume that there is always enough workload knowledge and idle time to tune the system with the proper indices, with the proper statistics, and with any other data structure that is expected to speed up data access. With Big Data arriving quickly, unpredictably, and with the need to react fast, we do not have the luxury to spend considerable amounts of time in tuning anymore. Second, database systems are designed with the main assumption that we should always consume all data in an effort to provide a correct and complete answer. As the data grow bigger, this becomes a significantly more expensive task.

Overall, before being able to use a database system for posing queries, we first need to go through a complex and time-consuming installation process to (a) load data inside the database system and (b) tune the system. These steps require not only a significant amount of time (i.e., in the order of several hours), but also considerable workload knowledge. In other words, we need to know exactly what kind of queries we are going to pose so that we can tune the system accordingly. However, when we need to explore a Big Data pile, we do not necessarily know exactly what kind of queries we would like to pose before the exploration process actually progresses; the answer to one query leads to the formulation of the next query.

Attempts to "throw more silicon" at the problem, that is, with Big Data clusters, can allow for more scalability (until the data grow even bigger), but at the expense of wasted resources when consuming data that are not really necessary for the exploration path. This brings yet another critical side effect of Big Data into the picture, that is, energy consumption. Overall, high-performance computing and exploitation of large clusters are complementary to the approaches described in this chapter; to deal with Big Data, we need innovations on all fronts.

                  Because More is Different

We cannot use past solutions to solve radically new problems. The main observation is that with more data, the query-processing paradigm also has to change. Processing all data is not possible; in fact, often it is not even necessary. For example, a scientist in the astronomy domain is interested in studying parts of the sky at a time, searching for interesting patterns, maybe even looking for specific properties at a time. This means that the numerous terabytes of data brought every few hours by modern telescopes are not relevant all the time. Why should a scientist spend several hours loading all data in a database? Why should they spend several hours indexing all the data? Which data parts are of importance becomes apparent only after going over parts of the data and at least after partially understanding the trends. To make things worse, in a few hours, several more terabytes of data will arrive, that is, before we make sense of the previous batch of data.

Similarly, in a business analytics setting, changing the processing paradigm can be of critical importance. As it stands now, analysts or tools need to scan all data in search of interesting patterns. Yet in many emerging applications, there is no slack time to waste; answers are needed fast, for example, when trying to figure out user behavior or news trends, when observing traffic behavior, or when monitoring networks for fraud detection.

                  Data Exploration

With such overwhelming amounts of data, data exploration is becoming a new and necessary paradigm of query processing, that is, for cases when we are in search of something interesting without knowing in advance exactly what we are looking for. For example, an astronomer wants to browse parts of the sky to look for interesting effects, while a data analyst of an IT business browses daily data of monitoring streams to figure out user behavior patterns. What both cases have in common is a daily stream of Big Data, that is, in the order of multiple terabytes, and the need to observe "something interesting and useful."

Next-generation database systems should interpret queries by their intent, rather than as a contract carved in stone for complete and correct answers. The result of a user query should aid the user in understanding the database's content and provide guidance to continue the data exploration journey. Data analysts should be able to explore the database stepwise, deeper and deeper, and stop when the result content and quality reach their satisfaction point. At the same time, response times should be close to instant such that they allow users to interact with the system and explore the data in a contextualized way as soon as data become available.

With systems that support data exploration, we can immediately discard the main bottleneck that stops us from consuming Big Data today; instead of considering a Big Data set in one go with a slow process, exploration-based systems can incrementally and adaptively guide users along the path that their queries and the results indicate. This helps us avoid major inherent costs, which are directly affected by the amount of data input and thus are showstoppers nowadays. These costs include numerous procedures, steps, and algorithms spread throughout the whole design of modern data management systems.

Key Goals: Fast, Interactive, and Adaptive

For efficient data exploration to work, there are a few essential goals.

First, the system should be fast to the degree that it feels interactive, that is, the user poses a question and a few seconds later an answer appears. Any data that we load do not have to be complete. Any data structure that we build does not have to represent all data or all value ranges. The answer itself does not have to represent a correct and complete result but rather a hint of what the data look like and how to proceed further, that is, what the next query should be. This is essential in order to engage data analysts in a seamless way; the system is not the bottleneck anymore.

                  Second, the system and the whole query-processing procedure should be adaptive in the sense that it adapts to the user requests; it proceeds with actions that speed up the search toward eventually getting the full answer the user is looking for. This is crucial in order to be able to finally satisfy the user needs after having sufficiently explored the data.

Metaphor Example

The observations to be made about the data in this case resemble an initially incomplete picture that is revealed, pixel by pixel, through the queries that users pose to the system. The system makes sure it remembers all pixels in order to guide the user toward areas of the picture where interesting shapes start to appear. Not all the pictures have to be completed for interesting effects to be seen from a high-level point of view, while again not all the pictures are needed for certain areas to be completed and seen in more detail.

Data Exploration Techniques

In the rest of this chapter, we discuss a string of novel data exploration techniques that aim to rethink database architectures with Big Data in mind. We discuss (a) adaptive indexing to build indices on-the-fly as opposed to a priori, (b) adaptive loading to allow for direct access on raw data without a priori loading steps, and (c) database architectures for approximate query processing to work over dynamic samples of data.

                Adaptive Indexing

                  In this section, we present adaptive indexing. We discuss the motivation for adaptive indexing in dynamic Big Data environments as well as the main bottlenecks of traditional indexing approaches. This section gives a broad description of the state of the art in adaptive indexing, including topics such as updates, concurrency control, and robustness.

Indexing

Good performance in state-of-the-art database systems relies largely on proper tuning and physical design, that is, creating the proper accelerator structures, called indices. Indices are exploited at query-processing time to provide fast data access. Choosing the proper indices is a major performance parameter in database systems; a query may be several orders of magnitude faster if the proper index is available and is used properly. The main problem is that the set of potential indices is too large to be covered by default. As such, we need to choose a subset of the possible indices and implement only those.

In the past, the choice of the proper index collection was assigned to database administrators (DBAs). However, as applications became more and more complex, index selection became too complex for human administration alone. Today, all modern database systems ship with tuning advisor tools. Essentially, these tools provide suggestions regarding which indices should be created. A human DBA is then responsible for making and implementing the final choices.

Offline Indexing

The predominant approach is offline indexing. With offline indexing, all tuning choices happen up front, assuming sufficient workload knowledge and idle time. Workload knowledge is necessary in order to determine the appropriate tuning actions, that is, to decide which indices should be created, while idle time is required in order to actually perform those actions. In other words, we need to know what kind of queries we are going to ask and we need to have enough time to prepare the system for those queries.

                  Big Data indexing Problems

                  However, in dynamic environments with Big Data, workload knowledge and idle time are scarce resources. For example, in scientific databases, new data arrive on a daily or even hourly basis, while query patterns follow an exploratory path as the scientists try to interpret the data and understand the patterns observed; there is no time and knowledge to analyze and prepare a different physical design every hour or even every day; even a single index may take several hours to create.

                  Traditional indexing presents three fundamental weaknesses in such cases: (a) the workload may have changed by the time we finish tuning; (b) there may be no time to finish tuning properly; and (c) there is no indexing support during tuning.

                  Database Cracking

Recently, a new approach, called database cracking, was introduced to the physical design problem. Cracking introduces the notion of continuous, incremental, partial, and on-demand adaptive indexing. Thereby, indices are incrementally built and refined during query processing. The net effect is that there is no need for any upfront tuning steps. In turn, there is no need for any workload knowledge and idle time to set up the database system. Instead, the system autonomously builds indices during query processing, adjusting fully to the needs of the users. For example, as scientists start exploring a Big Data set, query after query, the system follows the exploration path of the scientist, incrementally building and refining indices only for the data areas that seem interesting for the exploration path. After a few queries, performance adaptively improves to the level of a fully tuned system. From a technical point of view, cracking relies on continuously physically reorganizing data as users pose more and more queries.

                  Every query is treated as a hint on how data should be stored.

                  Column Stores

Before we discuss cracking in more detail, we give a short introduction to column-store systems. Database cracking was developed in the context of modern column stores and thus it relies on a number of modern column-store characteristics. Column stores store data one column at a time in fixed-width dense arrays. This representation is the same both for disk and for main memory. The net effect compared to traditional row stores is that during query processing, a column store may access only the referenced data/columns. Similarly, column stores rely on bulk and vector-wise processing. Thus, a select operator typically processes a single column in one go or in a few steps, instead of consuming full tuples one at a time. Specifically for database cracking, the column-store design allows for efficient physical reorganization of arrays. In effect, cracking performs all physical reorganization actions efficiently in one go over a single column; it does not have to touch other columns.

Selection Cracking Example

We now briefly recap the first adaptive indexing technique, selection cracking, as it was introduced in Idreos et al. (2007a). The main innovation is that the physical data store is continuously changing with each incoming query q, using q as a hint on how data should be stored. Assume an attribute A stored as a fixed-width dense array in a column store. Say a query requests all values where A < 10. In response, a cracking DBMS clusters all tuples of A with A < 10 at the beginning of the respective column C, while pushing all tuples with A ≥ 10 to the end. In other words, it partitions column C on-the-fly and in-place, using the predicate of the query as a pivot. A subsequent query requesting A ≥ v1, where v1 ≥ 10, has to search and crack only the last part of C, where values A ≥ 10 reside. Likewise, a query that requests A < v2, where v2 < 10, searches and cracks only the first part of C. All crack actions happen as part of the query operators, requiring no external administration.

The terminology "cracking" reflects the fact that the database is partitioned (cracked) into smaller and manageable pieces.

                  The terminology “cracking” reflects the fact that the database is parti- tioned (cracked) into smaller and manageable pieces.

                  Data Structures

The cracked data for each attribute of a relational table are stored in a normal column (array). The very first query on a column copies the base column to an auxiliary column where all cracking happens. This step is used such that we can always retrieve the base data in its original form and order. In addition, cracking uses an AVL-tree to maintain partitioning information such as which pieces have been created, which values have been used as pivots, etc.
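To make the selection cracking example and these data structures concrete, the following is a minimal, illustrative Python sketch, not the actual implementation from the cracking papers. A plain Python list stands in for the fixed-width dense array, and a sorted list of (pivot, position) pairs stands in for the AVL-tree; the class name and method names are our own.

```python
# Illustrative sketch of selection cracking on one in-memory column.
# Assumptions: values are comparable scalars; a sorted list of
# (pivot, first_position_of_values_>=_pivot) pairs replaces the AVL-tree.
import bisect

class CrackedColumn:
    def __init__(self, base_values):
        # the first query copies the base column into an auxiliary
        # cracker column where all physical reorganization happens
        self.data = list(base_values)
        self.index = []                      # crack index: sorted (pivot, position)

    def _piece(self, pivot):
        """Bounds [lo, hi) of the piece whose value range contains `pivot`."""
        keys = [p for p, _ in self.index]
        i = bisect.bisect_left(keys, pivot)
        lo = self.index[i - 1][1] if i > 0 else 0
        hi = self.index[i][1] if i < len(self.index) else len(self.data)
        return lo, hi

    def _crack(self, pivot):
        """Physically reorganize the relevant piece in place so that values
        < pivot come first, and record the split point in the crack index."""
        if pivot in (p for p, _ in self.index):
            return                           # already cracked on this pivot
        lo, hi = self._piece(pivot)
        i, j = lo, hi - 1
        while i <= j:                        # in-place two-sided partition
            if self.data[i] < pivot:
                i += 1
            else:
                self.data[i], self.data[j] = self.data[j], self.data[i]
                j -= 1
        bisect.insort(self.index, (pivot, i))

    def select_range(self, low, high):
        """SELECT A FROM R WHERE low <= A < high; cracking is a side effect."""
        self._crack(low)
        self._crack(high)
        pos = dict(self.index)               # pivot -> first position >= pivot
        return self.data[pos[low]:pos[high]]

# Every query refines the physical organization a little more:
col = CrackedColumn([13, 16, 4, 9, 2, 12, 7, 1, 19, 3, 14, 11, 8, 6])
print(sorted(col.select_range(5, 10)))       # -> [6, 7, 8, 9]
print(col.index)                             # pieces accumulated so far
```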

Continuous Adaptation

The cracking actions continue with every query. In this way, the system reacts to every single query, trying to adjust the physical storage, continuously reorganizing the data as the workload evolves. The more queries arrive, the more performance improves. In essence, more queries introduce more partitioning, while pieces become smaller and smaller. Every range query, or more precisely every range select operator, needs to touch at most two pieces of a column, that is, those pieces that appear at the boundaries of the needed value range. With smaller pieces, future queries need less effort to perform the cracking steps and as such performance gradually improves.

To avoid the extreme situation where a column becomes completely sorted, cracking sets a threshold and stops cracking a column for pieces which are smaller than the L1 cache. There are two reasons for this choice. First, the AVL-tree, which maintains the partitioning information, grows significantly and causes random access when searching. Second, the benefit brought by cracking pieces that are already rather small is minimal. As such, if during a query a piece smaller than L1 is indicated for cracking, the system completely sorts this piece with an in-memory quicksort. The fact that this piece is sorted is marked in the AVL-tree. This way, if a future query needs to search within a piece p which happens to be fully sorted, then it simply performs a binary search on this piece as opposed to physically reorganizing it.

Performance Examples

In experiments with the Skyserver real query and data logs, a database system with cracking enabled finished answering 160,000 queries while a traditional system was still halfway through creating the proper indices and had not answered a single query (Halim et al. 2012). Similarly, in experiments with the business standard TPC-H benchmark, perfectly preparing a system with all the proper indices took ~3 h, while a cracking database system could answer all queries in a matter of a few seconds with zero preparation, while still reaching optimal performance, similar to that of the fully indexed system (Idreos et al. 2009).

                  Being able to provide this instant access to data, that is, without any tuning, while at the same time being able to quickly, adaptively, and incrementally approach optimal performance levels in terms of response times, is exactly the property that creates a promising path for data exploration. The rest of the chapter discusses several database architecture challenges that arise when trying to design database kernels where adaptive indexing becomes a first-class citizen.

                  Sideways Cracking

Column-store systems access one column at a time. They rely on the fact that all columns of the same table are aligned. This means that for each column, the value in the first position belongs to the first tuple, the one in the second position belongs to the second tuple, and so on. This allows for efficient query processing for queries that request multiple columns of the same table, since the qualifying values can be fetched from each column simply by position.

When cracking physically reorganizes one column, the rest of the columns of the same table remain intact; they are separate physical arrays. As a result, with cracking, columns of the same table are not aligned anymore. Thus, when a future query needs to touch more than one column of the same table, the system is forced to perform random access in order to reconstruct tuples on-the-fly. For example, assume a selection on a column A, followed by a projection on another column B of the same table. If column A has been cracked in the past, then the tuple IDs, which are the intermediate result of the select operator on A, are in a random order and lead to expensive accesses to fetch the qualifying values from column B.

                  One approach could be that every time we crack one column, we also crack in the same way all columns of the same table. However, this defeats the purpose of exploiting column stores; it would mean that every single query would have to touch all attributes of the referenced table as opposed to only touching the attributes which are truly necessary for the current query.

Sideways cracking solves this problem by working on pairs of columns at a time (Idreos et al. 2009) and by adaptively forwarding cracking actions across the columns of the same table. That is, for a pair of columns A and B, during the cracking steps on A, the B values follow this reorganization. The values of A and B are stored together in a binary column format, making the physical reorganization efficient. Attribute A is the head of this column pair, while attribute B is the tail. When more than two columns are used in a query, sideways cracking uses bit vectors to filter intermediate results, while working across multiple column-pairs of the same head. For example, in order to do a selection on attribute A and two aggregations, one on attribute B and one on attribute C, sideways cracking uses pairs AB and AC. Once both pairs are cracked in the same way using the predicates on A, they are fully aligned and they can be used in the same plans without tuple reconstruction actions.

Essentially, sideways cracking performs tuple reconstructions via incremental cracking and alignment actions as opposed to joins. For each pair, there is a log to maintain the cracking actions that have taken place in this pair as well as in other pairs that use the same head attribute. Two column-pairs of the same head are aligned when they have exactly the same history, that is, they have been cracked for the same bounds and exactly in the same order.
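Continuing the illustrative CrackedColumn sketch from the Data Structures section above, the following toy subclass hints at the sideways cracking idea for a single column pair: the head attribute drives the cracking and the tail attribute is swapped in lockstep so the pair stays aligned. This is a simplification of the technique in Idreos et al. (2009); names and methods are our own.

```python
# Toy sketch of a cracked column pair (A, B): A is the head, B is the tail.
import bisect

class CrackedColumnPair(CrackedColumn):          # reuses the sketch above
    def __init__(self, head_values, tail_values):
        super().__init__(head_values)
        self.tail = list(tail_values)            # B values, aligned with self.data (A)

    def _crack(self, pivot):
        if pivot in (p for p, _ in self.index):
            return
        lo, hi = self._piece(pivot)
        i, j = lo, hi - 1
        while i <= j:
            if self.data[i] < pivot:
                i += 1
            else:
                self.data[i], self.data[j] = self.data[j], self.data[i]
                self.tail[i], self.tail[j] = self.tail[j], self.tail[i]  # keep B aligned
                j -= 1
        bisect.insort(self.index, (pivot, i))

    def project_tail(self, low, high):
        """Return the B values of all tuples with low <= A < high."""
        self.select_range(low, high)             # cracks A and B together
        pos = dict(self.index)
        return self.tail[pos[low]:pos[high]]
```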

                  Partial Cracking

The pairs of columns created by sideways cracking can result in a large set of auxiliary cracking data. With Big Data, this is an important concern. Cracking creates those column pairs dynamically, that is, only what is needed is created and only when it is needed. Still, the storage overhead may be significant. Partial cracking solves this problem by introducing partially materialized cracking columns (Idreos et al. 2009). With partial cracking, we do not need to materialize complete cracking columns; only the value ranges touched by the current workload set are materialized in cracking columns. If missing values are requested by future queries, then the missing values are fetched from the base columns the first time they are requested.

With partial cracking, a single cracking column becomes a logical view of numerous smaller physical columns. In turn, each one of the small columns is cracked and accessed in the same way as described for the original database cracking technique, that is, it is continuously physically reorganized as we pose queries.

Users may set a storage budget and cracking makes sure it stays within the budget by continuously monitoring the access patterns of the various materialized cracking columns. Each small physical column of a single logical column is completely independent and can be thrown away and recreated at any time. For each column, cracking knows how many times it has been accessed by queries and it uses an LRU policy to throw away columns when space for a new one is needed.

Updates

Updates pose a challenge since they cause physical changes to the data which, in combination with the physical changes caused by cracking, may lead to significant complexity. The solution proposed in Idreos et al. (2007b) deals with updates by deferring update actions until relevant queries arrive. In the same spirit as the rest of the cracking techniques, cracking updates do not do any work until it is unavoidable, that is, until a query arrives which is affected by a pending update. In this way, when an update comes, it is simply put aside. For each column, there is an auxiliary delete column where all pending deletes are placed and an auxiliary insertions column where all pending inserts are placed. Actual updates are a combination of a delete and then an insert action.

Each query needs to check the pending deletes and inserts for pending actions that may affect it. If there are any, then those qualifying pending insertions and deletions are merged with the cracking columns on-the-fly. The algorithm for merging pending updates into cracking columns takes advantage of the fact that there is no strict order within a cracking column. Each piece in a cracking column contains values within a given value range, but once we know that a new insertion, for example, should go within this piece, we can place it in any position of the piece; within each cracking piece, there is no strict requirement for maintaining any order.
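The following toy extension of the CrackedColumn sketch hints at this lazy merging of pending insertions (pending deletes, which Idreos et al. (2007b) handle analogously with a separate pending-deletes column, are omitted). It is an illustrative simplification, not the actual merging algorithm of the paper.

```python
# Toy sketch of lazily merging pending inserts into a cracked column.
import bisect

class UpdatableCrackedColumn(CrackedColumn):     # reuses the sketch above
    def __init__(self, base_values):
        super().__init__(base_values)
        self.pending_inserts = []                # no work until a query needs it

    def insert(self, value):
        self.pending_inserts.append(value)

    def _merge_pending(self, low, high):
        still_pending = []
        pivots = [p for p, _ in self.index]
        for v in self.pending_inserts:
            if low <= v < high:                  # only merge what the query touches
                i = bisect.bisect_right(pivots, v)
                end = self.index[i][1] if i < len(self.index) else len(self.data)
                self.data.insert(end, v)         # any position inside the piece is fine
                self.index = [(p, q + 1) if q >= end else (p, q)
                              for p, q in self.index]
            else:
                still_pending.append(v)
        self.pending_inserts = still_pending

    def select_range(self, low, high):
        self._crack(low)
        self._crack(high)
        self._merge_pending(low, high)
        pos = dict(self.index)
        return self.data[pos[low]:pos[high]]
```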

Adaptive Merging

Cracking can be seen as an incremental quicksort where the pivots are defined by the query predicates. Adaptive merging was introduced as an incremental merge sort where the merging actions are defined by the query predicates (Graefe and Kuno 2010). The motivation is mainly toward disk-based environments and toward providing fast convergence to optimal performance.

The main design point of adaptive merging is that data are horizontally partitioned into runs. Each run is sorted in memory with a quicksort action. This preparation step is done with the first query and results in an initial column that contains the various runs. From there on, as more queries arrive, data are moved from the initial column to a results column where the final index is shaped. Every query merges into the results column only the data that are defined by its selection predicates and that are missing from the results column. If a query is covered fully by the results column, then it does not need to touch the initial runs. Data that are merged are immediately sorted in place in the results column; once all data are merged, the results column is a fully sorted column. With data pieces being sorted both in the initial column and in the results column, queries can exploit binary search both during merging and when accessing only the results column.
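The following minimal Python sketch illustrates the spirit of adaptive merging under simplifying assumptions (everything in memory, merged selection ranges tracked only coarsely); it is not the algorithm of Graefe and Kuno (2010), and all names are our own.

```python
# Toy sketch of adaptive merging: sorted runs feed a results column on demand.
import bisect

class AdaptiveMergeIndex:
    def __init__(self, values, run_size=1024):
        # first query: horizontally partition into runs, each sorted in memory
        self.runs = [sorted(values[i:i + run_size])
                     for i in range(0, len(values), run_size)]
        self.results = []                        # final index, kept sorted
        self.merged = []                         # (low, high) ranges already merged

    def _covered(self, low, high):
        return any(a <= low and high <= b for a, b in self.merged)

    def select_range(self, low, high):
        """SELECT A WHERE low <= A < high, merging that key range on demand."""
        if not self._covered(low, high):
            for run in self.runs:                # move only qualifying values
                i = bisect.bisect_left(run, low)
                j = bisect.bisect_left(run, high)
                for v in run[i:j]:
                    bisect.insort(self.results, v)   # sorted in place in results
                del run[i:j]
            self.merged.append((low, high))
        i = bisect.bisect_left(self.results, low)
        j = bisect.bisect_left(self.results, high)
        return self.results[i:j]
```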

                  Hybrids

Adaptive merging improves over plain cracking when it comes to convergence speed, that is, the number of queries needed to reach performance levels similar to that of a full index is significantly reduced. This behavior is mainly due to the aggressive sorting actions during the initial phase of adaptive merging; it allows future queries to access data faster. However, these sorting actions put a sizeable overhead on the initial phase of a workload, causing the very first query to be significantly slower. Cracking, on the other hand, has a much smoother behavior, making it more lightweight for individual queries. However, cracking takes much longer to reach the optimal index status (unless there is significant skew in the workload).

The study in Idreos et al. (2011a,b) presents these issues and proposes a series of techniques that blend the best properties of adaptive merging with the best properties of database cracking. A series of hybrid algorithms are proposed where one can tune how much initialization overhead and how much convergence speed is needed. For example, the crack–crack hybrid (HCC) uses the same overall architecture as adaptive merging, that is, using an initial column and a results column where data are merged based on query predicates. However, the initial runs are now not sorted; instead, they are cracked based on query predicates. As a result, the first query is not penalized as with adaptive merging. At the same time, the data placed in the results column are not sorted in place. Several combinations are proposed where one can crack, sort, or radix cluster the initial column and the result column. The crack–sort hybrid, which cracks the initial column while it sorts the pieces in the result column, brings the best overall balance between initialization overhead and convergence speed.

Robustness

Since cracking reacts to queries, its adaptation speed and patterns depend on the kind of queries that arrive. In fact, cracking performance crucially depends on the arrival order of queries. That is, we may run exactly the same set of queries twice in a slightly different order and the result may be significantly different in terms of response times, even though exactly the same cracking index will be created. To make this point more clear, consider the following example. Assume a column of 100 unique integers in [0, 99]. Assume a first query that asks for all values v where v < 1. As a result, cracking partitions the column into two pieces. In piece P1, we have all values in [0, 1) and in piece P2 we have all values in [1, 99]. The net effect is that the second piece still contains 99 values, meaning that the partitioning achieved by the first query is not so useful; any query falling within the second piece still has to analyze almost all values of the column. Now assume that the second query requests all values v, where v < 2. Then, the third query requests all values v, where v < 3, and so on. This sequence results in cracking having to continuously analyze large portions of the column as it always leaves behind big pieces. The net effect is that convergence speed is too slow and in the worst case cracking degrades to a performance similar to that of a plain scan for several queries, resulting in a performance which is not robust (Halim et al. 2012).

To solve the above problem, Halim et al. (2012) propose stochastic cracking. The main intuition is that stochastic cracking injects random cracking actions into the normal cracking actions that happen during query processing.

                  For example, when cracking a piece of a column for a pivot X, stochastic cracking adds an additional cracking step where this piece is also cracked for a pivot which is randomly chosen. As a result, the chances of leaving back big uncracked pieces become significantly smaller.
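The following Python sketch conveys the flavor of this idea under simplifying assumptions (the function name crack_piece and the strategy of always adding exactly one random pivot are illustrative and do not correspond to the specific variants studied by Halim et al. (2012)): every cracking request partitions the targeted piece on the query pivot and, additionally, on one randomly chosen pivot, so that large uncracked pieces keep shrinking even under adversarial query sequences.

import random

def crack_piece(piece, query_pivot):
    # Crack on the query pivot plus one randomly chosen pivot from the piece itself.
    pivots = {query_pivot}
    if piece:
        pivots.add(random.choice(piece))
    pieces = [piece]
    for p in sorted(pivots):
        cracked = []
        for part in pieces:
            left = [v for v in part if v < p]       # values below the pivot
            right = [v for v in part if v >= p]     # values at or above the pivot
            cracked.extend(x for x in (left, right) if x)
        pieces = cracked
    return pieces   # the piece boundaries would be recorded in the cracker index

print([len(p) for p in crack_piece(list(range(100)), query_pivot=1)])
# e.g., [1, 41, 58]: unlike plain cracking, no piece of 99 values is left behind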

                  Concurrency Control

Cracking is based on continuous physical reorganization of the data. Every single query might have side effects. This is in strong contrast with what normally happens in database systems where plain queries do not have side effects on the data. Not having any side effects means that read queries may be scheduled to run in parallel. Database systems heavily rely on this parallelism to provide good performance when multiple users access the system simultaneously. On the other hand, with cracking, every query might change the way data are organized and as a result it is not safe to have multiple queries working and changing the same data in parallel.

However, we would like to have both the adaptive behavior of database cracking, while still allowing multiple users to query Big Data simultaneously. The main trick to achieve this is to allow concurrent access on the various pieces of a cracking column: multiple queries may be reorganizing the same column as long as they do not touch the exact same piece simultaneously (Graefe et al. 2012). In this way, each query may lock a single piece of a cracking column at a time, while other queries may be working on the other pieces. As we create more and more pieces, there are more opportunities to increase the ability for multiple queries to work in parallel. This bonds well with the adaptive behavior of database cracking; if a data area becomes hot, then more queries will arrive to crack it into multiple pieces and subsequently more queries will be able to run in parallel because more pieces exist.

Contrary to concurrency control for typical database updates, with adaptive indexing during read queries, we change only the data organization; the data contents remain intact. For this reason, all concurrency mechanisms for adaptive indexing may rely on latching as opposed to full-fledged database locks, resulting in a very lightweight design (Graefe et al. 2012).
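A minimal sketch of the piece-level latching idea is shown below in Python (the class name, the use of threading.Lock objects as latches, and the simplified in-place partitioning are assumptions for illustration and omit most of the machinery of Graefe et al. (2012)): every piece of a cracked column has its own short-lived latch, so queries that crack different pieces proceed in parallel, while two queries that target the same piece briefly serialize.

import threading

class CrackedColumn:
    def __init__(self, pieces):
        self.pieces = pieces                                # one list per piece
        self.latches = [threading.Lock() for _ in pieces]   # one latch per piece

    def crack_with_latch(self, piece_id, pivot):
        # Only the targeted piece is latched; all other pieces remain available.
        with self.latches[piece_id]:
            piece = self.pieces[piece_id]
            left = [v for v in piece if v < pivot]
            right = [v for v in piece if v >= pivot]
            # Values < pivot now precede the rest; the boundary would go into the cracker index.
            self.pieces[piece_id] = left + right

col = CrackedColumn([[5, 1, 9], [30, 15, 20]])
t1 = threading.Thread(target=col.crack_with_latch, args=(0, 5))
t2 = threading.Thread(target=col.crack_with_latch, args=(1, 20))
t1.start(); t2.start(); t1.join(); t2.join()
print(col.pieces)   # [[1, 5, 9], [15, 30, 20]]: both pieces were cracked concurrently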

                  Summary

Overall, database cracking opens an exciting path towards database systems that inherently support adaptive indexing. As we do not require any workload knowledge or any tuning steps, we can significantly reduce the time it takes to query newly arrived data, assisting data exploration.

                Adaptive Loading

The previous section described the idea of building database kernels that inherently provide adaptive indexing capabilities. Indexing is one of the major bottlenecks when setting up a database system, but it is not the only one. In this section, we focus on another crucial bottleneck, that is, data loading. We discuss the novel direction of adaptive loading to enable database systems to bypass the loading overhead and immediately be able to query data before even being loaded in a database.

The Loading Bottleneck

Data loading is a necessary step when setting up a database system. Essentially, data loading copies all data inside the database system. From this point on, the database fully controls the data; it stores data in its own format and uses its own algorithms to update and access the data. Users cannot control the data directly anymore, but only through the database system. The reason to perform the loading step is to enable good performance during query processing; by having full control on the data, the database system can optimize how the data are stored and accessed. However, the costs of parsing and transforming all data are significant; it may take several hours to load a decent data size even with parallel loading.

As a result, in order to use the sophisticated features of a database system, users have to wait until their data are loaded (and then tuned). However, with Big Data arriving at high rates, it is not feasible anymore to reserve several hours for data loading as it creates a big gap between data creation and data exploitation.

External Files

One feature that almost all open source and commercial database products provide is external tables. External files are typically in the form of raw text-based files in CSV format (comma-separated values). With the external tables functionality, one can simply attach a raw file to a database without loading the respective data. When a query arrives for this file, the database system automatically goes back to the raw file to access and fetch the data on-the-fly. This is a useful feature in order to delay data loading actions, but unfortunately it is not a functionality that can be used for query processing. The reason is that it is too expensive to query raw files; there are several additional costs involved. In particular, parsing and tokenizing costs dominate the total query processing costs. Parsing and tokenizing are necessary in order to distinguish the attribute values inside raw files and to transform them into binary form. For this reason, the external tables functionality is not being used for query processing.

Adaptive Loading

The NoDB project recently proposed the adaptive loading direction (Alagiannis et al. 2012; Idreos et al. 2011a,b); the main idea is that loading actions happen adaptively and incrementally during query processing, driven by the actual query needs. Initially, no loading actions take place; this means that there is no loading cost and that users can immediately query their data. With every query, the system adaptively fetches any needed data from the raw data files. At any given time, only data needed by the queries are loaded. The main challenge of the adaptive loading direction is to minimize the cost to touch the raw data files during query processing, that is, to eliminate the reason that makes the external tables functionality unusable for querying.

The main idea is that as we process more and more queries, NoDB can collect knowledge about the raw files and significantly reduce the data access costs. For example, it learns how data reside in raw files in order to better look for them, if needed, in the future.

                  Selective Parsing

NoDB pushes selections down to the raw files in order to minimize the parsing and tokenizing costs paid for every single row of a data file. In a typical external files process, the system tokenizes and parses all attributes in each row of the file. Then, it feeds the data to the typical data flow inside the database system to process the query. This incurs a maximum parsing and tokenizing cost. NoDB removes this overhead by performing parsing and tokenizing selectively on a row-by-row basis, while applying the filtering predicates directly on the raw file. The net benefit is that as soon as any of the filtering predicates fails, NoDB can abandon the current row and continue with the next one, effectively avoiding significant parsing and tokenizing costs. To achieve all these steps, NoDB overloads the scan operator with the ability to access raw files in addition to loaded data.
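The following Python sketch conveys the flavor of this selective parsing under simplifying assumptions (the CSV layout, the predicate dictionary, and the function name selective_scan are illustrative; the actual NoDB prototype implements this at a much lower level inside an overloaded scan operator): attributes are tokenized only as far as the filtering predicates require, and a row is abandoned as soon as one predicate fails.

def selective_scan(path, predicates):
    # predicates maps an attribute position to a test function.
    positions = sorted(predicates)
    with open(path) as f:
        for line in f:
            # Tokenize only up to the right-most attribute referenced by a predicate.
            fields = line.rstrip("\n").split(",", positions[-1] + 1)
            # all() short-circuits: the row is dropped at the first failing test.
            if len(fields) > positions[-1] and all(predicates[p](fields[p]) for p in positions):
                yield line   # surviving rows continue into the normal query data flow

# Hypothetical usage: keep only rows whose third attribute is a value below 100.
# hits = list(selective_scan("data.csv", {2: lambda v: float(v) < 100}))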

Indexing

In addition, during parsing, NoDB creates and maintains an index to mark positions on top of the raw file. This index is called a positional map and its functionality is to provide future queries with direct access to a location of the file that is close to what they need. For example, if for a given row we know the position of the fifth attribute and the current query needs to analyze the seventh attribute, then the query only needs to start parsing from the position of the fifth attribute of that row. Of course, given that we cannot realistically assume fixed-length attributes, the positional map needs to maintain information on a row-by-row basis. Still, the cost is kept low, as only a small portion of a raw file needs to be indexed. For example, experiments in Alagiannis et al. (2012) indicate that once 15% of a raw file is indexed, performance reaches optimal levels.
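A toy version of such a positional map is sketched below in Python (the function names, the decision to always track attribute 0, and the plain-dictionary representation are assumptions made for illustration; NoDB stores the map far more compactly and fills it lazily during query processing): for every row we remember the byte offsets of a few attributes, and a later request starts tokenizing from the closest recorded position rather than from the beginning of the row.

def build_positional_map(path, tracked_attrs):
    # Per row, remember the byte offset at which each tracked attribute starts.
    tracked = set(tracked_attrs) | {0}                # attribute 0 is always known
    pmap = []
    with open(path, "rb") as f:
        row_offset = 0
        for line in f:
            starts = [0]
            for i, byte in enumerate(line):
                if byte == ord(","):
                    starts.append(i + 1)
            pmap.append({a: row_offset + starts[a] for a in tracked if a < len(starts)})
            row_offset += len(line)
    return pmap

def fetch_attribute(path, pmap, row, attr):
    # Start tokenizing from the closest recorded position instead of the row start.
    anchor = max(a for a in pmap[row] if a <= attr)
    with open(path, "rb") as f:
        f.seek(pmap[row][anchor])
        tail = f.readline().decode().rstrip("\n")
        return tail.split(",")[attr - anchor]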

                  Caching

The data fetched from the raw file are adaptively cached and reused if similar queries arrive in the future. This allows the hot workload set to always be cached and the need to fetch raw data appears only during workload shifts. The policy used for cache replacement is LRU in combination with adaptive loading specific parameters. For example, integer attributes have a priority over string attributes in the cache; fetching string attributes back from the raw file during future queries is significantly less expensive than fetching integer attributes. This is because the parsing costs for string attributes are very low compared to those for integer values.

                  Statistics

In addition, NoDB creates statistics on-the-fly during parsing. Without proper statistics, optimizers cannot make good choices about query plans. With adaptive loading, the system is initiated without statistics as no data is loaded up front. To avoid bad plans and to guarantee robustness, NoDB collects statistics the first time data from a raw file is requested by a query. This puts a small overhead at query time, but it allows us to avoid bad optimization choices.

                  Splitting Files

When accessing raw files, we are limited in exploiting the format of the raw files. Typically, data are stored in CSV files where each row represents an entry in a relational table and each file represents all data of a single relational table. As a result, every single query that needs to fetch data from raw files has to touch all data. Even with selective parsing and indexing, at the low level the system still needs to touch almost all of the raw file. If the data were a priori loaded and stored in a column-store format, then a query would need to touch only the data columns it really needs. NoDB proposed the idea of text cracking, where during parsing the raw file is separated into multiple files and each file may contain one or more of the attributes of the original raw file (Idreos et al. 2011a,b). This process works recursively, and as a result future queries on the raw file can significantly reduce the amount of data they need to touch by having to work only on smaller raw files.
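A simplified Python sketch of this splitting step follows (the file naming scheme and the choice to peel off one column per output file are assumptions for illustration; the actual approach works recursively and keeps track of which file holds which attributes): the attributes touched by a query are written to their own smaller files, so later queries on those attributes no longer scan the original raw file.

import csv, os

def split_columns(raw_path, requested_columns, out_dir):
    # Write each requested column of the raw CSV file into its own smaller file.
    os.makedirs(out_dir, exist_ok=True)
    outputs = {c: open(os.path.join(out_dir, "col_%d.csv" % c), "w", newline="")
               for c in requested_columns}
    writers = {c: csv.writer(f) for c, f in outputs.items()}
    with open(raw_path, newline="") as f:
        for row in csv.reader(f):
            for c in requested_columns:
                writers[c].writerow([row[c]])
    for handle in outputs.values():
        handle.close()

# A later query on column 2 now reads only out_dir/col_2.csv instead of the full raw file.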

                  Data Vaults

One area where adaptive loading can have a major impact is the sciences. In the case of scientific data management, several specialized formats already exist and have been in use for several decades. These formats store data in a binary form and often provide indexing information, for example, in the form of clustering data based on the date of creation. In order to exploit database systems for scientific data management, we would need to transform data from the scientific format into the database format, incurring a significant cost. The data vaults project provides a two-level architecture that allows exploiting the metadata in scientific data formats for adaptive loading operations (Ivanova et al. 2012). Given that the scientific data are already in a binary format, there are no considerations regarding parsing and tokenizing costs. During the initialization phase, data vaults load only the metadata information, resulting in a minimal set-up cost. During query processing time, the system uses the metadata to guide the queries to the proper files and to transform only the needed data on-the-fly. This way, without performing any a priori transformation of the scientific data, we can pose queries through the database system directly and selectively.

                  Summary

Loading represents a significant bottleneck; it raises a wall between users and Big Data. Adaptive loading directions provide a promising research path towards systems that can be used immediately, as soon as data arrive.


                Sampling-Based Query Processing

Loading and indexing are the two essential bottlenecks when setting up a database system. However, even after all installation steps are performed, there are more bottlenecks to deal with; this time bottlenecks appear during query processing. In particular, the requirements for correctness and completeness raise a significant overhead; every single query is treated by a database system as a request to find all possible and correct answers.

This inherent requirement for correctness and completeness has its roots in the early applications of database systems, that is, mainly in critical sectors such as banking and financial applications where errors cannot be tolerated. However, with modern Big Data applications and with the need to explore data, we can afford to sacrifice correctness and completeness in favor of improved response times. A query session that may consist of several exploratory queries can lead to exactly the same result, regardless of whether the full answer is returned every time; in an exploratory session, users are mainly looking for hints on what the next query should be, and a partial answer may already be informative enough.

In this section, we discuss a number of recent approaches to create database systems that are tailored for querying with partial answers, sacrificing correctness and completeness for improved response times.

                  Sciborg

Sciborg proposed the idea of working over data that is organized in a hierarchy of samples (Sidirourgos et al. 2011). The main idea is that queries can be performed over a sample of the data, providing a quick response time. Subsequently, the user may choose to ask for more detail and to query more samples. Essentially, this is a promising research path to enable interactive query processing. The main innovation in Sciborg is that samples of data are not simply random samples; instead, Sciborg creates weighted samples driven by past query-processing actions and based on the properties of the data. In this way, it can better follow the needs of the users by collecting relevant data together, such that users can infer interesting patterns using only a small number of samples.

                  Blink

Another recent project, Blink, proposes a system where data are also organized in multiple samples (Agarwal et al. 2012). The characteristic of Blink is its seamless integration with cloud technology, being able to scale to massive amounts of data and processing nodes.

Both the Blink and the Sciborg projects represent a vision to create database systems that inherently support query processing over samples. For example, the user does not have to create a sample explicitly and then query it, followed by the creation of a different sample, while repeating this process multiple times. In a database architecture that supports samples at its core, this whole process is transparent to the user and has the potential to be much more effective. For example, with tailored database kernels (a) the samples are created with minimal storage overhead, (b) they adapt continuously, and (c) query results over multiple samples can be merged dynamically by the system.

                  One-Minute DB Kernels

Another vision in the direction of exploration-based database kernels is the one-minute database kernels idea (Kersten et al. 2011). Similar to Sciborg and Blink, the main notion is that correctness and completeness are sacrificed in favor of performance; however, contrary to past approaches, this happens at a very low level, that is, at the level of database operators. Every decision in the design of database algorithms can be reconsidered to avoid expensive actions by sacrificing correctness. For example, a join operator may choose to drop data from the inner join input as soon as the size of the hash table exceeds the size of the main memory or even the size of the CPU cache. A smaller hash table is much faster to create and it is also much faster to probe, avoiding cache misses. Similar decisions can be made across the whole design of database kernels.
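As an illustration of this kind of design decision, the Python sketch below shows a hash join that simply stops inserting build-side tuples once a hypothetical budget (max_entries) is reached, keeping the hash table small and cache-friendly at the price of an incomplete join result; it is a toy rendition of the general idea, not the design of Kersten et al. (2011).

def bounded_hash_join(build, probe, key_build, key_probe, max_entries=1000):
    # Build a hash table on the inner input, but stop once the budget is reached.
    table = {}
    for row in build:
        if len(table) >= max_entries:
            break                          # sacrifice completeness for a small, cache-resident table
        table.setdefault(key_build(row), []).append(row)
    # Probe with the outer input; matches beyond the budget are silently missed.
    for row in probe:
        for match in table.get(key_probe(row), []):
            yield match + row

orders = [(i, "order-%d" % i) for i in range(10000)]
lineitems = [(i % 10000, i) for i in range(50000)]
partial = list(bounded_hash_join(orders, lineitems,
                                 key_build=lambda r: r[0], key_probe=lambda r: r[0]))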

Essentially, the one-minute database kernels approach is equivalent to the sample-based ideas. The difference is that it pushes the problem to a much lower level where possibly we may have better control of parameters that affect performance. One of the main challenges is to be able to provide quality guarantees for the query results.

                  dbTouch

One significant bottleneck when querying database systems is the need to be an expert user; one needs to be aware of the database schema and needs to be fluent in SQL. When it comes to Big Data exploration, we would like to render data accessible to more people and to make the whole process of discovering interesting patterns as easy as possible. dbTouch extends the vision of sample-based processing with the notion of creating database kernels which are tailored for touch-based exploration (Idreos and Liarou 2013). Data appear in a touch device in a visual form, while users can simply touch the data to query. For example, a relational table may be represented as a table shape and a user may slide a finger over the table to run a number of aggregations. dbTouch is not about formulating queries; instead, it proposes a new database kernel design which reacts instantly to touch. Users do not pose queries as in normal systems; in dbTouch, users point to interesting data and the system reacts to every gesture: a single touch corresponds to analyzing a single tuple (or at most a few tuples), while a slide gesture can be seen as multiple single touches and can be used to explore a given data area. As such, only a sample of the data is processed every time, while now the user has full control regarding which data are processed and when; by changing the direction or the speed of a slide gesture, users can control the exploration process, while observing running results as they are visualized by dbTouch.

The main challenge with dbTouch is in designing database kernels that can react instantly to every touch and provide quick response times even though the database no longer controls the order and the kind of data processed for every query session.

                  Summary

Overall, correctness and completeness pose a significant bottleneck during query time; with Big Data, this problem becomes a major showstopper as it becomes extremely expensive to consume big piles of data. The novel research directions described in this chapter make a first step towards a new era of database kernels where performance becomes more important than correctness and where exploration is the main query-processing paradigm.

                  In the presence of Big Data, query processing is facing significant new challenges. A particular aspect of those challenges has to do with the fact that there is not enough time and workload knowledge to properly prepare and tune database management systems. In addition, producing correct and complete answers by consuming all data within reasonable time bounds is becoming harder and harder. In this chapter, we discussed the research direction of data exploration where adaptive and incremental processing become first-class citizens in database architectures.

Adaptive indexing, adaptive loading, and sampling-based database kernels provide a promising path towards creating dedicated exploration systems. It represents a widely open research area as we need to reconsider every single aspect of database design established in the past.

References

Abouzeid, A., K. Bajda-Pawlikowski, D. J. Abadi, A. Rasin, and A. Silberschatz. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proceedings of the Very Large Databases Endowment (PVLDB) 2(1), 2009: 922–933.

Agarwal, S., A. Panda, B. Mozafari, A. P. Iyer, S. Madden, and I. Stoica. Blink and it's done: Interactive queries on very large data. Proceedings of the Very Large Databases Endowment (PVLDB), 2012.

Alagiannis, I., R. Borovica, M. Branco, S. Idreos, and A. Ailamaki. NoDB: Efficient query execution on raw data files. ACM SIGMOD International Conference on Management of Data, Scottsdale, AZ, 2012.

Boncz, P. A., M. Zukowski, and N. Nes. MonetDB/X100: Hyper-pipelining query execution. Biennial Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, 2005, pp. 225–237.

Dean, J. and S. Ghemawat. MapReduce: Simplified data processing on large clusters. USENIX Symposium on Operating Systems Design and Implementation (OSDI), San Francisco, CA, 2004, pp. 137–150.

Graefe, G., F. Halim, S. Idreos, S. Manegold, and H. A. Kuno. Concurrency control for adaptive indexing. Proceedings of the Very Large Databases Endowment (PVLDB) 5(7), 2012: 656–667.

Graefe, G. and H. A. Kuno. Self-selecting, self-tuning, incrementally optimized indexes. International Conference on Extending Database Technology (EDBT), Lausanne, Switzerland, 2010.

Hadapt. 2012. http://www.hadapt.com/

Halim, F., S. Idreos, P. Karras, and R. H. C. Yap. Stochastic database cracking: Towards robust adaptive indexing in main-memory column-stores. Proceedings of the Very Large Databases Endowment (PVLDB) 5(6), 2012: 502–513.

Idreos, S. and E. Liarou. dbTouch: Analytics at your Fingertips. International Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, 2013.

Idreos, S., I. Alagiannis, R. Johnson, and A. Ailamaki. Here are my data files. Here are my queries. Where are my results? International Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, 2011a.

Idreos, S., M. Kersten, and S. Manegold. Database cracking. International Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, 2007a.

Idreos, S., M. Kersten, and S. Manegold. Self-organizing tuple reconstruction in column-stores. ACM SIGMOD International Conference on Management of Data, Providence, RI, 2009.

Idreos, S., M. Kersten, and S. Manegold. Updating a cracked database. ACM SIGMOD International Conference on Management of Data, Beijing, China, 2007b.

Idreos, S., S. Manegold, H. Kuno, and G. Graefe. Merging what's cracked, cracking what's merged: Adaptive indexing in main-memory column-stores. Proceedings of the Very Large Databases Endowment (PVLDB) 4(9), 2011b: 585–597.

Ivanova, M., M. L. Kersten, and S. Manegold. Data vaults: A symbiosis between database technology and scientific file repositories. International Conference on Scientific and Statistical Database Management (SSDBM), Chania, Crete, Greece, 2012.

Kersten, M., S. Idreos, S. Manegold, and E. Liarou. The researcher's guide to the data deluge: Querying a scientific database in just a few seconds. Proceedings of the Very Large Databases Endowment (PVLDB) 4(12), 2011: 174–177.

Platfora. 2012. http://www.platfora.com/

Sidirourgos, L., M. L. Kersten, and P. A. Boncz. SciBORQ: Scientific data management with bounds on runtime and quality. International Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, 2011.

Stonebraker, M. et al. C-Store: A column-oriented DBMS. International Conference on Very Large Databases (VLDB), Trondheim, Norway, 2005, pp. 553–564.

Big Data Processing with MapReduce

Jordà Polo

CONTENTS
Introduction
MapReduce Model
    Comparison with Other Systems
        RDBMS
        Distributed Key-Value and Column-Oriented DBMS
        Grid Computing
        Shared-Memory Parallel Programming
    Examples and Uses of MapReduce
        Word Count: MapReduce's "Hello World!"
        Use Cases
    MapReduce Implementations
        Google MapReduce
        Hadoop
        Disco
        Skynet
        Dryad
Open-Source Implementation: Hadoop
    Project and Subprojects
    Cluster Overview
    Storage with HDFS
    Dataflow
Summary
References

                Introduction

Current trends in computer science drive users toward more service-oriented architectures such as the so-called cloud platforms. The cloud allows provisioning of computing and storage, converting physical centralized resources into virtual shared resources. The ideas behind it are not new, but thanks to the development of the underlying technologies, it is becoming much more efficient: cost-, maintenance-, and energy-wise.

At the same time, more businesses are becoming aware of the relevance of the data they are able to gather: from social websites to log files, there is a lot of hidden information ready to be processed and mined. Not so long ago, it was relatively difficult to work with large amounts of data, and so most of it was usually discarded. The problem was not hard drive capacity, which has increased a lot over the years, but access speed, which is improving at a much lower pace. However, new tools, most of which were originally designed and built around web-related technologies, are making things easier. Developers are finally getting used to the idea of dealing with large data sets.

Both of these changes are not coincidental and respond to certain needs. On the one hand, nowadays it is much easier for companies to become global, target a larger number of clients, and consequently deal with more data. On the other hand, there is a limit to the initial expense they are willing to incur. Another issue that these new trends help address is that the benefits may only arrive when dealing with sufficiently large data, yet the upfront cost and the maintenance of the large clusters required to process such data sets have usually been a hindrance that offsets those benefits.

Despite the availability of new tools and the shift to service-oriented computing, there is still room for improvement, especially with regard to the integration of the two sides of cloud computing: the applications that provide services and the systems that run these applications.

                  Developers still need to think about the requirements of the applications in terms of resources (CPU, memory, etc.) and will inevitably end up either under- or over-provisioning. In a cloud environment, it is easier to update the provisioning as needed, but for many applications this process is still manual and requires human intervention.

Moving away from the old style of managing resources is one of the major challenges of cloud computing. In a way, it can be thought of as the equivalent of the revolution that the introduction of time-sharing represented in the era of batch processing. Time-sharing allowed everyone to interact with computers as if they were the owners of the system. Likewise, freeing users from thinking about provisioning is the definite step in creating the illusion of the cloud as an unlimited source of computing resources.

The main obstacle, though, is that the cloud is not actually an infinite and free source of computing: maintaining it is not trivial, resources are limited, and providers need some way to prioritize services. If users are freed of the task of provisioning, then there must be some other mechanism to make both sharing and accounting possible.

On the other hand, some parts of these systems seem to be ready for this shift, especially the lower level components and the middleware. But the applications that run the services on top of cloud platforms seem to be lagging behind, and it is to be expected that not all applications are fully integrated. But it seems clear that these applications represent the next and most obvious target in order to consolidate cloud platforms. One example of this kind of application is the MapReduce programming framework. The MapReduce model allows developers to write massively parallel applications without much effort and is becoming an essential tool in the software stack of many companies that need to deal with large data sets. MapReduce fits well with the idea of dynamic provisioning, as it may run on a large number of machines and is already widely used in cloud environments.

                MapReduce Model

MapReduce [5] is a programming model used to develop massively parallel applications that process and generate large amounts of data. It was first introduced by Google in 2004 and has since become an important tool for distributed computing. It is especially suited to operate on large data sets on clusters of computers, as it is designed to tolerate machine failures.

Essentially, MapReduce divides the work into small computations in two major steps, map and reduce, which are inspired by similar primitives that can be found in LISP and other functional programming languages. The input is formed by a set of key-value pairs, which are processed using the user-defined map function to generate a second set of intermediate key-value pairs. Intermediate results are then processed by the reduce function, which merges values by key.
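To make the data flow tangible, here is a tiny single-machine simulation of the model in Python (the function run_mapreduce and its arguments are illustrative only; real frameworks distribute these steps across many machines and persist the intermediate data): map_fn is applied independently to every input pair, the intermediate pairs are grouped by key, and reduce_fn merges the values of each key.

from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input key-value pair.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)   # shuffle: group values by key
    # Reduce phase: merge the list of values collected for each intermediate key.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}

The word count example discussed later in this chapter maps directly onto map_fn and reduce_fn.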

While MapReduce is not something entirely new nor a revolutionary concept, it has helped us to standardize parallel applications. And even though its interface is simple, it has proved to be powerful enough to solve a wide range of real-world problems: from web indexing to image analysis to clustering algorithms.

                  MapReduce provides high scalability and reliability, thanks to the division of the work into smaller units. Jobs are submitted to a master node, which is in charge of managing the execution of applications in the cluster. After submitting a job, the master initializes the desired number of smaller tasks or units of work and puts them to run on worker nodes. First, during the map phase, nodes read and apply the map function to a subset of the input data. The map’s partial output is stored locally on each node and served to worker nodes executing the reduce function.

Input and output files are usually stored in a distributed file system, but in order to ensure scalability, the master tries to assign local work, meaning the input data are available locally. On the other hand, if a worker node fails to deliver the unit of work it has been assigned to complete, the master node is able to detect the failure and reschedule that unit of work on another worker.

Comparison with Other Systems

Analyzing and performing computations on massive data sets is not something new, but it is not easy to compare MapReduce to other systems since it is often used to do things in a way that simply was not possible before with standardized tools. But besides creating a new market, MapReduce is also drawing the attention of developers, who use it for a wide range of purposes. The following comparison describes some of the technologies that share some kind of functionality with MapReduce.

                  RDBMS

Relational Database Management Systems are the dominant choice for transactional and analytical applications, and they have traditionally been a well-balanced and good enough solution for most applications. Yet their design has some limitations that make it difficult to maintain compatibility and provide optimized solutions when aspects such as scalability are the top priority.

There is only a partial overlap of functionality between RDBMSs and MapReduce: relational databases are suited to do things for which MapReduce will never be the optimal solution, and vice versa. For instance, MapReduce tends to involve processing most of the data set, or at least a large part of it, while RDBMS queries may be more fine-grained. On the other hand, MapReduce works fine with semistructured data since it is interpreted while it is being processed, unlike RDBMSs, where well-structured and normalized data are the key to ensure integrity and improve performance. Finally, traditional RDBMSs are more suitable for interactive access, but MapReduce is able to scale linearly and handle larger data sets. If the data are large enough, doubling the size of the cluster will also make running jobs twice as fast, something that is not necessarily true of relational databases.

Another factor that is also driving the move toward other kinds of storage solutions is disks. Improvements in hard drives seem to be relegated to capacity and transfer rate only. But data access in an RDBMS is usually dominated by seek times, which have not changed significantly for some years. Solid-state drives may prove to be a good solution in the medium to long term [10], but they are still far from affordable compared to HDDs, and besides, databases still need to be optimized for them.

MapReduce has been criticized by some RDBMS proponents due to its low-level abstraction and lack of structure. But taking into account the different features and goals of relational databases and MapReduce, they can be seen as complementary rather than opposite models. So the most valid criticism is probably not related to the technical merits of MapReduce, but to the hype generated around it, which is pushing its use to solve problems for which it is not the most appropriate tool.

Distributed Key-Value and Column-Oriented DBMS

Alternative database models such as Distributed Key-Value and Column-oriented DBMS are becoming more widely used for similar reasons as MapReduce. These two different approaches are largely inspired by Amazon's Dynamo [6] and Google's BigTable [3]. Key-value storage systems have properties of databases and distributed hash tables, while column-oriented databases serialize data by column, making it more suitable for analytical processing.

Both models depart from the idea of a fixed schema-based structure and try to combine the best of both worlds: distribution and scalability of systems like MapReduce with a higher and more database-oriented level of abstraction. In fact, some of the most popular data stores actually use or implement some sort of MapReduce. Google's BigTable, for instance, uses Google MapReduce to process data stored in the system, and other column-oriented DBMSs such as CouchDB use their own implementations of MapReduce internally.

These kinds of databases also mark a new trend and make it clear that the differences between traditional databases and MapReduce systems are blurring as developers try to get the best of both worlds.

                  Grid Computing

Like MapReduce, Grid computing services are also focused on performing computations to solve a single problem by distributing the work across several computers. But these kinds of platforms are often built on a cluster with a shared file system, which works well for CPU-bound jobs but not so well for data-intensive jobs. And that is precisely one of the key differences between these kinds of systems: Grid computing does not place as much emphasis on data as MapReduce does, especially on doing the computation near the data.

Another distinction between MapReduce and Grid computing is the interface it provides to the programmer. In MapReduce, the programmer is able to focus on the problem that needs to be solved since only the map and reduce functions need to be implemented, and the framework takes care of the distribution, communication, fault tolerance, etc. In contrast, in Grid computing the programmer has to deal with lower-level mechanisms to control the data flow, checkpointing, etc., which makes it more powerful, but also more error-prone and difficult to write.

                  Shared-Memory Parallel Programming

Traditionally, many large-scale parallel applications have been programmed in shared-memory environments such as OpenMP. Compared to MapReduce, these kinds of programming interfaces are much more generic and provide solutions for a wider variety of problems. One of the typical use cases of these systems is parallel applications that require some kind of synchronization.


However, this comes at a cost: they may be more flexible, but the interfaces are also significantly lower level and more difficult to understand. Another difference between MapReduce and this model is the hardware for which each of these platforms has been designed. MapReduce is supposed to work on commodity hardware, while interfaces such as OpenMP are only efficient on shared-memory multiprocessor platforms.

Examples and Uses of MapReduce

MapReduce is currently being used for many different kinds of applications, from very simple helper tools that are part of a larger environment, to more complete and complex programs that may involve multiple, chained MapReduce executions.

                  This section includes a description of a typical MapReduce application, and what needs to be done to make it work, following the steps from the input to the final result. After the initial description, you will find a list of some of the problems MapReduce is able to solve, briefly explained. And finally, a more detailed study of how it is currently being used in production.

                  Word Count: MapReduce’s “Hello World!”

The goal of a word count application is to get the frequency of words in a very large collection of documents. Word count was the problem that exemplified MapReduce in the original paper [5] and has since become the canonical example to introduce how MapReduce works.

To compute the frequency of words, a sequential program would read all the documents, keeping a list of ⟨word, count⟩ pairs and incrementing the appropriate count value every time a word is found.

                  As you will see below, MapReduce’s approach is slightly different. First of all, the problem is divided into two stages known as map and reduce, named after the functions that are applied while they are in progress. The map() function is applied to every single element of the input, and since there is no need to do so in any particular order, it effectively makes it possible to parallelize all the work. For each element, map() emits key-value pairs to be worked on later during the reduce stage. The generated key-value pairs are grouped and processed by keys, so for every key there will be a list of values. The reduce() function is applied to these lists of values produced during the map stage and provides the final result.

Listings 9.1 and 9.2 show how these functions are implemented in an application such as word count. The map() is simple: it takes a line of the input, splits it into words, and for each word emits a ⟨word, count⟩ key-value pair, where count is the partial count and thus always 1. Note that in this example the input is split into lines, but it could have been split into some other identifiable unit.

Listing 9.1: Word count: map() function

//i: ignored in this example
//line: line contents
void map(string i, string line):
    for word in line:
        print word, 1

                  The reduce function takes ⟨key, list(values)⟩ pairs and goes through all the values to get the aggregated result for that particular key.

                  Listing 9.2: Word count: reduce() function

//word: the key
//partial_counts: a list of partial count values
void reduce(string word, list partial_counts):
    total = 0
    for c in partial_counts:
        total += c
    print word, total

                  A good exercise to understand how data are processed by MapReduce is to follow step by step how a small input evolves into the final output. For instance, imagine that the input of the word count program is as follows:

Hello World
Hello MapReduce

Since, in this example, the map() function is applied to every line and the input has two lines, it is possible to run two map() functions simultaneously.

                  Each function will produce a different output, but the format will be similar: ⟨ word, 1⟩ pairs for each word. For instance, the map() reading the first line will emit the following partial output:

Hello, 1
World, 1

During the reduce stage, the intermediate output is merged, grouping outputs by key. This results in new pairs formed by keys and lists of values:

⟨Hello, (1, 1)⟩, ⟨World, (1)⟩, and ⟨MapReduce, (1)⟩. These pairs are then processed by the reduce() function, which aggregates the lists and produces the final output:

Hello, 2
World, 1
MapReduce, 1


Word count is an interesting example because it is simple, and the logic behind the map() and reduce() functions is easy to understand. As can be seen in the following examples, MapReduce is able to compute a lot more than a simple word count, but even though it is possible to make these functions more complex, it is recommended to keep them as simple as possible to help distribute the computation. If need be, it is always possible to chain multiple executions, using the output of one application as the input of the next one.

On the other hand, MapReduce may seem a bit overkill for a problem like word counting. For one thing, it generates huge amounts of intermediate key-value pairs, so it may not be entirely efficient for small inputs. But it is designed with scalability in mind, so it begins to make sense as soon as the input is large enough. Besides, most MapReduce programs also require some level of tweaking on both the application itself and on the server side (block size, memory, etc.). Some of these refinements are not always obvious, and it is usually after a few iterations that applications are ready to be run in production.

                  It should also be noted that this example is focused on the MapReduce computation, and some steps such as input distribution, splitting, and reduce partitioning are intentionally omitted, but will be described in more detail later.

                  Use Cases

                  MapReduce is especially well suited to solve embarrassingly parallel problems, that is, problems with no dependencies or communication requirements in which it is easy to achieve a speedup proportional to the size of the problem when it is parallelized.

                  Below is the description of some of the main problems (not necessarily embarrassingly parallel) and areas where MapReduce is currently used.

                  Distributed Search and Sort

Besides the aforementioned word frequency counting application, searching and sorting are some of the most commonly used examples to describe the MapReduce model. All these problems also share the fact that they are helper tools, thought to be integrated into larger environments with other applications, very much like their pipeline-based UNIX-like equivalent tools: wc, grep, sort, etc. Moreover, knowing how these problems are implemented in MapReduce can be of great help to understand it, as they use different techniques.

                  A distributed version of grep is especially straightforward to implement using MapReduce. Reading line by line, maps only emit the current line if it matches the given pattern. And since the map’s intermediate output can be used as the final output, there is no need to implement the reduce() function.
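In the same spirit as Listings 9.1 and 9.2, a distributed grep could be sketched in Python as follows (the signatures and the hardcoded example pattern are illustrative only): the map emits matching lines unchanged, and the reduce, if implemented at all, is simply the identity.

import re

def grep_map(offset, line, pattern=re.compile("ERROR")):   # hypothetical pattern
    # Emit the line itself if it matches; the byte offset serves as the key.
    if pattern.search(line):
        yield offset, line

def grep_reduce(offset, lines):
    # Identity reduce: the map's intermediate output is already the final output.
    for line in lines:
        yield offset, line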

Sorting is different from searching in that the map stage only reads the input and extracts the key by which the records are to be sorted. Since the output is supposed to be sorted globally, the important part is how to get the appropriate key and partition the input so that all keys for a particular reducer N come before all the keys for the next reducer N + 1. This way the output of the reducers can be numbered and concatenated after they are all finished.
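A range partitioner that enforces this property could be sketched in Python as follows (the split points are hardcoded for illustration; in practice, as in the TeraSort benchmark, they would be chosen by sampling the input keys):

import bisect

def make_range_partitioner(split_points):
    # Keys below split_points[0] go to reducer 0, the next range to reducer 1, and so on.
    boundaries = sorted(split_points)
    def partition(key, num_reducers):
        return min(bisect.bisect_right(boundaries, key), num_reducers - 1)
    return partition

partition = make_range_partitioner(["g", "p"])        # three reducers: up to f, g-o, p onwards
for key in ["apple", "hadoop", "zebra"]:
    print(key, "->", partition(key, 3))               # 0, 1, 2: reducer outputs concatenate sorted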

Inverted Indexing and Search Engines

                  When Google’s original MapReduce implementation was completed, it was used to regenerate the index of their search engine. Keeping indices up to date is one of the top priorities of Internet search engines, but web pages are created and updated every day, so a scalable solution is a must.

Inverted indices are one of the typical data structures used for information retrieval. Basically, an inverted index contains a list of references to documents for each word. To implement an inverted index with MapReduce, the map reads the input and for each word emits a pair with the word and the document ID. The reduce then collects these pairs and outputs each word along with the list of documents in which it appears.
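Continuing with the same Python-style sketches (document identifiers and tokenization are simplified for illustration), the two functions could look as follows:

def index_map(doc_id, text):
    # Emit one (word, document ID) pair per word occurrence.
    for word in text.split():
        yield word.lower(), doc_id

def index_reduce(word, doc_ids):
    # Merge the postings of one word into a sorted, de-duplicated list.
    yield word, sorted(set(doc_ids))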

Other than Google, other major search engines such as Yahoo! are also based on MapReduce. The need to improve the scalability of the free, open-source software search engine Nutch also prompted the creation of Hadoop, one of the most widely used MapReduce implementations to date.

                  Log Analysis

Nowadays service providers generate large amounts of logs from all kinds of services, and the benefits of analyzing them are to be found when processing them en masse. For instance, if a provider is interested in tracking the behavior of a client during long periods of time, reconstructing user sessions, it is much more convenient to operate over all the logs.

Logs are a perfect fit for MapReduce for other reasons too. First, logs usually follow a certain pattern, but they are not entirely structured, so it is not trivial to use an RDBMS to handle them, and computing something new may require changes to the structure of the database. Second, logs represent a use case where scalability not only matters, but is also key to keeping the system sustainable. As services grow, so does the amount of logs and the need to get something out of them.

Companies such as Facebook and Rackspace [8] use MapReduce to examine log files on a daily basis and generate statistics and on-demand analysis.

                  Graph Problems

MapReduce is not perfectly fit for all graph problems, as some of them require walking through the vertices, which will not be possible if the mappers receive only a part of the graph, and it is not practical to receive the whole graph as it would be way too big to handle and require a lot of bandwidth to transfer. But there are ways to work around these issues [4], such as using multiple map and reduce iterations, along with custom, optimized representations of the graph.


A good example of an Internet-scale graph problem solved using MapReduce is PageRank, an algorithm that ranks interlinked elements. PageRank can be implemented as a chained MapReduce application that at each step iterates over all the elements, calculating their PageRank values until they converge.
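One iteration of such a chained job could be sketched as follows in Python (the damping factor, the way the adjacency lists are passed along, and the handling of dangling pages are simplified for illustration): the map distributes each page's current rank over its outgoing links, and the reduce recombines the incoming contributions into a new rank.

def pagerank_map(page, value):
    rank, links = value                               # current rank and outgoing links
    yield page, ("links", links)                      # forward the graph structure
    for target in links:
        yield target, ("rank", rank / len(links))     # share rank among outgoing links

def pagerank_reduce(page, values, damping=0.85):
    links, incoming = [], 0.0
    for kind, payload in values:
        if kind == "links":
            links = payload
        else:
            incoming += payload
    new_rank = (1 - damping) + damping * incoming
    yield page, (new_rank, links)                     # feeds the map of the next iteration

Chaining simply means running this pair of functions again on its own output until the ranks stop changing significantly.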

MapReduce Implementations

Google MapReduce

MapReduce is both the name of the programming model and of the original framework, designed and implemented by Jeff Dean and Sanjay Ghemawat at Google [5]. Even though it is only used internally at Google and its code is not freely available, it is known to be written in C++, with interfaces in Python and Java.

                  Google MapReduce is used in some of the largest MapReduce clusters to date. According to an interview with Jeff Dean, “The MapReduce software is increasing use within Google. It ran 29,000 jobs in August 2004 and 2.2 million in September 2007. Over that period, the average time to complete a job has dropped from 634 seconds to 395 seconds, while the output of MapReduce tasks has risen from 193 terabytes to 14,018 terabytes. On any given day, Google runs about 100,000 MapReduce jobs; each occupies about 400 servers and takes about 5 to 10 minutes to finish.”

                  In November 2008, Google reported that their MapReduce implementation was able to sort 1 TB of data on 1000 computers in 68 seconds, breaking the previous record of 209 seconds on 910 computers.

                  Hadoop

Hadoop is a popular and widely used open source MapReduce implementation. It has a large community base and is also backed and used by companies such as Yahoo!, IBM, Amazon, Facebook, etc.

Hadoop was originally developed by Doug Cutting to support distribution for the Nutch search engine. The first working version was available by the end of 2005, and soon after that, in early 2006, Doug Cutting joined Yahoo! to work on it full-time with a dedicated team of developers. In February 2008, Yahoo! announced that they were using a 10,000-core Hadoop cluster in production to generate their search index.

In April 2008, Hadoop was able to sort a terabyte of data on a 910-node cluster in 209 seconds [12]. That same year in November, Google managed to break that record by a wide margin with a time of 68 seconds on a 1000-node cluster. But in May 2009, Yahoo! reclaimed the record with a time of 62 seconds on a cluster with 1460 nodes running Hadoop [13].

Hadoop is now a top-level Apache project and hosts a number of subprojects such as HDFS, Pig, HBase, ZooKeeper, etc. For a more detailed description of how the Hadoop project is organized, see section "Open-Source Implementation: Hadoop."

                  Disco

Disco is another open source implementation of the MapReduce programming model, developed at Nokia Research Center as a lightweight framework for rapid scripting of distributed data processing.

                  The Disco core is written in Erlang, a functional language that is designed for building robust fault-tolerant distributed applications. MapReduce programs are typically written in Python, though, which lowers the entry barrier and makes it possible to write data-processing code in only a few lines of code.

Unlike Hadoop, Disco is only a minimal MapReduce implementation and does not include a customized file system. Instead, Disco supports POSIX-compatible distributed file systems such as GlusterFS.

                  Skynet

Skynet is an open source implementation of the MapReduce framework created at Geni. It is written in Ruby, and MapReduce programs are also written in the same language. It has gained some popularity, especially in the Ruby community, since it can be easily integrated into web development frameworks such as Rails.

As expected, Skynet claims to be fault-tolerant, but unlike other implementations, its administration is fully distributed and it does not have a single point of failure such as the master servers that can be found in Google MapReduce and Hadoop. It uses a peer recovery system in which workers watch out for each other. If a node dies or fails for any reason, another worker will notice and pick up that task.

Dryad

Dryad is an ongoing research project and Microsoft's response to MapReduce.

Dryad intends to be a more general-purpose environment to execute data parallel applications [9]. It is not exactly a new MapReduce implementation, but it subsumes other computation frameworks, including MapReduce. Instead of simply dividing applications into map and reduce, Dryad programs are expressed as directed acyclic graphs in which vertices are computations and edges are communication channels.

Dryad has been deployed at Microsoft since 2006, where it runs on various clusters of more than 3000 nodes and is used by more than 100 developers to process 10 PB of data on a daily basis. The current implementation is written in C++, but there are interfaces that make it possible to use higher-level languages.


                Open-Source Implementation: Hadoop

Since its first releases, Hadoop has been the standard Free software* MapReduce implementation. Even though there are other open source MapReduce implementations, they are not as complete and usually lack some component of the full platform (e.g., a storage solution). It is more difficult to compare to proprietary solutions, as most of them are not freely available, but judging from the results of the Terasort benchmark [12,13], Hadoop is able to compete even with the original Google MapReduce.

                  This section describes how MapReduce is implemented in Hadoop and provides an overview of its architecture.

                  Project and Subprojects

                  Hadoop is currently a top-level project of the Apache Software Foundation, a nonprofit corporation that supports a number of other well-known projects such as the Apache HTTP Server.

Hadoop is mostly known for its MapReduce implementation, which is in fact a Hadoop subproject, but there are also other subprojects that provide the required infrastructure or additional components. The core of Hadoop upon which most of the other components are built is formed by the following subprojects:

• Common: The common utilities and interfaces that support the other Hadoop subprojects (configuration, serialization, RPC, etc.).
• MapReduce: Software framework for distributed processing of large data sets on compute clusters of commodity hardware.
• HDFS: Distributed file system that runs on large clusters and provides high-throughput access to application data.

                  The remaining subprojects are simply additional components that are usually used on top of the core subprojects to provide additional features. Some of the most noteworthy are:

                  

• Pig: High-level data-flow language and execution framework for parallel computation [11]. Programs written in the high-level language are translated into sequences of MapReduce programs.
• HBase: Distributed, column-oriented database modeled after Bigtable [3] that supports structured data storage for large tables. It is built on top of HDFS and supports MapReduce computations.
• Hive: Data warehouse infrastructure that provides data summarization and ad-hoc querying and analysis of large files. It uses a language similar to SQL, which is automatically converted to MapReduce jobs.
• Chukwa: Data collection and monitoring system for managing large distributed systems [2]. It stores system metrics as well as log files into HDFS, and uses MapReduce to generate reports.

* Hadoop is licensed under the Apache License 2.0, a free software license that allows developers to modify the code and redistribute it. It is not a copyleft license, though, so distribution of modified versions under different license terms is permitted.

                  Cluster Overview

A typical Hadoop MapReduce cluster is formed by a single master, also known as the jobtracker, and a number of slave nodes, also known as tasktrackers. The jobtracker is in charge of processing the user's requests, and distributing and scheduling the work on the tasktrackers, which are in turn supposed to execute the work they have been handed and regularly send status reports back to the jobtracker.

In the MapReduce context, a job is the unit of work that users submit to the jobtracker (Figure 9.1) and involves the input data as well as the map() and reduce() functions and its configuration. Jobs are divided into two different kinds of tasks, map tasks and reduce tasks, depending on the operation they execute. Tasktrackers control the execution environment of tasks and are configured to run up to a certain number of slots of each kind. This defaults to two slots for map tasks and two slots for reduce tasks, but it can vary significantly depending on the hardware and the kind of jobs that are run in the cluster.
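To make the notions of job, map() function, reduce() function, and configuration more concrete, the sketch below assembles a minimal word-count-style job with the Hadoop Java API (org.apache.hadoop.mapreduce). It is an illustrative sketch, not code from this chapter; the class name and the input/output paths taken from the command line are placeholders.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountJob {

  // The map() function: run by map tasks, one call per input record.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);   // intermediate (key, value) pair
        }
      }
    }
  }

  // The reduce() function: run by reduce tasks on the grouped keys.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    // The job bundles input, map(), reduce(), and configuration,
    // and is then submitted to the master for scheduling.
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountJob.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // placeholder input
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // placeholder output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Each map slot on a tasktracker runs one instance of the mapper over one input split, and each reduce slot runs one instance of the reducer over its partition of the intermediate keys.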

Before assigning the first map tasks to the tasktrackers, the jobtracker divides the input data depending on its format, creating a number of virtual splits. The jobtracker then prepares as many map tasks as splits, and as soon as a tasktracker reports a free map slot, it is assigned one of the map tasks (along with its input split).

Figure 9.1 Job submission in a Hadoop MapReduce cluster: users submit jobs to the jobtracker, which queues and initializes them and assigns their tasks to free task slots on the tasktrackers.

                  The master continues to keep track of all the map tasks, and once all of them have been completed it is able to schedule the reduce tasks. Except for this dependency, for the jobtracker there is no real difference between kinds of tasks, so map and reduce tasks are treated similarly as the smallest scheduling unit.

Other than scheduling, the jobtracker must also make sure that the system is tolerant to faults. If a node fails or times out, the tasks the tasktracker was executing can be rescheduled by the jobtracker. Additionally, if some tasks make no apparent progress, it is also able to re-launch them as speculative tasks on different tasktrackers.

Note that Hadoop's master is not distributed and represents a single point of failure, but since it is aware of the status of the whole cluster, it also allows for some optimizations and reduces the complexity of the system.*

* Initial versions of Google's file system and MapReduce are also known to have had a single point of failure in their masters to simplify the design, but more recent versions are reported to have removed it.

                  Storage with HDFS

Hadoop MapReduce is designed to process large amounts of data, but it does so in a way that does not necessarily integrate perfectly well with previous tools, including file systems. One of the characteristics of MapReduce is that it moves computation to the data and not the other way around. In other words, instead of using an independent, dedicated storage, the same low-cost machines are used for both computation and storage. This means that the storage requirements are not exactly the same as for regular, general-purpose file systems.

The Hadoop Distributed File System (HDFS) [1] is designed to fulfill Hadoop's storage needs, and like the MapReduce implementation, it was inspired by a Google paper that described their file system [7]. HDFS shares many features with other distributed file systems, but it is specifically conceived to be deployed on commodity hardware and thus to be even more fault-tolerant.

Another feature that makes it different from other file systems is its emphasis on streaming data and achieving high throughput rather than low-latency access. POSIX semantics impose many requirements that are not needed for Hadoop applications, so in order to achieve its goals, HDFS relaxes some of the standard file system interfaces. Similarly, HDFS's coherency model is intentionally simple in order to perform as fast as possible, but everything comes at a cost: for instance, once a file is created, it is not possible to change it.

Like MapReduce, HDFS is also based on a client-server architecture. It consists of a single master node, also known as the namenode, and a number of slaves or clients known as datanodes. The namenode keeps all the metadata associated with the file system (permissions, file locations, etc.) and coordinates operations such as opening, closing, or renaming. Datanodes are spread throughout the cluster and are responsible for storing the data, serving read and write requests.

Figure 9.2 HDFS file creation: the user (1) creates the file at the namenode, (2) writes blocks directly to a datanode, and (3) the datanodes replicate the blocks among themselves.

As in other general-purpose file systems, files in HDFS are split into one or more blocks, the minimum unit used to store files on datanodes and to carry out internal operations. Since HDFS is designed to read and write very large files, its block size is likewise larger than that of other file systems, defaulting to 64 MB. Also, to ensure fault tolerance, files have a replication factor, which is used to enforce the number of copies of each block available in the cluster.

In order to create a new file, the client first requests it from the namenode, but upon approval it writes directly to the datanodes (Figure 9.2). This process is handled by the client and is transparent to the user. Similarly, replication is coordinated by the namenode, but data are transferred directly between datanodes. If a datanode fails or times out, the namenode goes through all the blocks that were stored on that datanode, issuing replication requests for all the blocks that have fallen behind the desired replication ratio.
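From the client's point of view this whole interaction hides behind a single call. The hedged sketch below uses the Hadoop FileSystem API to create a file; the namenode address, path, and replication value are placeholder assumptions, and the namenode/datanode exchange of Figure 9.2 happens underneath the create() call.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder settings: in a real cluster these usually come from
    // core-site.xml / hdfs-site.xml rather than being set in code.
    conf.set("fs.defaultFS", "hdfs://namenode:8020");
    conf.set("dfs.replication", "3");   // desired number of copies per block

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/data/example.txt");   // hypothetical path

    // The client asks the namenode to create the file; the returned stream
    // then writes blocks directly to the datanodes, which replicate them.
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("hello HDFS");
    }

    // Once written, the file cannot be modified in place (HDFS's simple
    // coherency model); it can only be read or deleted.
    fs.close();
  }
}
```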

                  Dataflow

                  The previous sections introduced how MapReduce and the file system work, but one of the keys to understanding Hadoop is to know how both systems are combined and how data flow from the initial input to the processing and final output.

Note that although the MapReduce model assumes that data are available in a distributed fashion, it does not directly deal with pushing and maintaining files across the cluster, which is the file system's job. A direct advantage of this distinction is that Hadoop's MapReduce supports a number of file systems with different features. In this description, though, as well as in the remaining chapters of this document, the cluster is assumed to be running HDFS (described in the previous section).

Figure 9.3 Local and remote reads from HDFS to MapReduce: on each slave node, the tasktracker (MapReduce) runs map tasks that read their splits either from the local datanode (HDFS) or remotely from a datanode on another slave.

MapReduce is able to start running jobs as soon as the required data are available in the file system. First of all, jobs are initialized by creating a series of map and reduce tasks. The number of map tasks is usually determined by the number of splits into which the input data are divided. Splitting the input is what makes it possible to parallelize the map phase and can have a great impact on the performance, so splits can also be thought of as the first level of granularity of the system, and it also shows how the file system and MapReduce are integrated. For instance, if the input consists of a single 6.25 GB file in an HDFS file system, using a block size (dfs.block.size) of 64 MB and the default input format, the job will be divided into 100 map tasks, one for each split.
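The split arithmetic behind that example can be checked with a few lines of code. This is an illustrative calculation only, not part of Hadoop itself:

```java
public class SplitCountExample {
  public static void main(String[] args) {
    long blockSize = 64L * 1024 * 1024;                 // 64 MB, the old HDFS default
    long fileSize = (long) (6.25 * 1024 * 1024 * 1024); // a single 6.25 GB input file

    // Roughly one split (and therefore one map task) per block,
    // rounding up for a possible partial block at the end of the file.
    long numSplits = (fileSize + blockSize - 1) / blockSize;

    System.out.println("map tasks = " + numSplits);     // prints 100
  }
}
```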

Each map task reads its share of the input directly from the distributed file system, meaning it can read either locally if the data are available, or remotely from another host if they are not (Figure 9.3). While reading and processing the input, the partial output is continuously written to a circular memory buffer. As can be observed in Figure 9.4, as soon as the buffer is filled beyond a certain threshold (which defaults to 80%), its contents are sorted and flushed to a temporary file on the local disk. After reading the input, if there is more than one temporary file, the map task will merge them and write the merged result again to disk. Optionally, if the number of spills is large enough, Hadoop will also perform the combine operation at this point in order to make the output smaller and reduce bandwidth usage.

Figure 9.4 Hadoop dataflow: map tasks buffer their partial output, sort and flush it to local spill files, and merge the spills; reduce tasks copy their partitions of the map outputs, sort them, and run the reduce phase.

Note that in the end it is always necessary to write the map's result to disk even if the buffer is not completely filled: each map task runs in its own JVM instance and is supposed to finish as soon as possible and not wait indefinitely for the reducers. So after writing to disk, the map's partial output is ready to be served to other nodes via HTTP.

The number of reduce tasks is determined by the user and the job's needs.* For example, if a job requires global sorting, a single reducer may be needed. Otherwise, any number of reducers may be used: using a larger number of reducers increases the overhead of the framework, but can also help improve the load balancing of the cluster.
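In the Java API this choice is a single setting on the job. A hedged sketch with an illustrative value (the class name is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "reducer count demo");

    // A single reducer yields one globally sorted output file...
    // job.setNumReduceTasks(1);

    // ...while more reducers add framework overhead but improve
    // load balancing; the right number depends on the cluster and job.
    job.setNumReduceTasks(16);
  }
}
```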

Reduce tasks consist of three phases: copy, sort, and reduce. Even though reduce tasks cannot be completed until all map tasks are, it is possible to run the first phase of reduce tasks at the same time as map tasks. During the copy phase (also known as shuffle), reduce tasks request their partitions of data from the nodes where map tasks have already been executed, via HTTP. As soon as data are copied and sorted in each reducer, they are passed to the reduce() function, and its output is written directly to the distributed file system.

                Summary

                  This chapter presented the MapReduce model and how it is currently used to process massive amounts of data. The simplicity of this model, which splits the work into two smaller steps, map and reduce, has become a standard for parallel data processing. MapReduce makes data processing easier for developers, but it is still powerful enough to solve a wide range of problems, from indexing to log analysis. At the same time, its scalability and reliability ensure its affinity with the goals of Big Data.

While there are a number of MapReduce implementations, this chapter focused on the architecture of Hadoop, a free and open source software version that has become one of the most widely used MapReduce frameworks* and is now a basic component of the Big Data toolset.

                  

* For some kinds of problems, there may be more efficient options, such as chaining multiple MapReduce jobs.


                References

                  

1.
2. J. Boulon, A. Konwinski, R. Qi, A. Rabkin, E. Yang, and M. Yang. Chukwa, a large-scale monitoring system. In Cloud Computing and Its Applications (CCA 08), pp. 1–5, Chicago, IL, October 2008.
3. F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. In OSDI'06: Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation, p. 15, Berkeley, CA, USA, USENIX Association, 2006.
4. J. Cohen. Graph twiddling in a MapReduce world. Computing in Science and Engineering, 11(4):29–41, 2009.
5. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI'04: Sixth Symposium on Operating System Design and Implementation, pp. 137–150, San Francisco, CA, December 2004.
6. G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. SIGOPS Oper. Syst. Rev., 41(6):205–220, 2007.
7. S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. SIGOPS Oper. Syst. Rev., 37(5):29–43, 2003.
8.
9. M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. SIGOPS Oper. Syst. Rev., 41(3):59–72, 2007.
10. S.-W. Lee, B. Moon, C. Park, J.-M. Kim, and S.-W. Kim. A case for flash memory SSD in enterprise database applications. In SIGMOD'08: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 1075–1086, New York, NY, USA, ACM, 2008.
11. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In SIGMOD'08: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 1099–1110, New York, NY, USA, ACM, 2008.
12.
13. O. O'Malley and A. Murthy. Winning a 60 second dash with a yellow elephant, 2009.
14. S. Quinlan and M. K. McKusick. GFS: Evolution on fast-forward. Queue, 7(7):10–20, 2009.
15. T. White. Hadoop: The Definitive Guide. 2nd edition, O'Reilly and Yahoo! Press, New York, 2009.

                  


Efficient Processing of Stream Data over Persistent Data

M. Asif Naeem, Gillian Dobbie, and Gerald Weber

CONTENTS

Introduction .......................................................................... 316
    Data Stream Processing .................................................. 316
    Stream-Based Joins .......................................................... 316
    Application Scenario ....................................................... 317
Existing Approaches and Problem Definition ........................ 318
Proposed Solution ................................................................... 321
    Execution Architecture .................................................... 321
    Algorithm .......................................................................... 323
    Asymptotic Runtime Analysis ......................................... 324
    Cost Model ........................................................................ 325
        Memory Cost ............................................................... 325
        Processing Cost ........................................................... 326
    Analysis of w with Respect to its Related Components ... 327
        Effect of the Size of the Master Data on w ................ 328
        Effect of the Hash Table Size on w ............................. 328
        Effect of the Disk Buffer Size on w ............................. 329
    Tuning ................................................................................ 329
Tests with Locality of Disk Access ......................................... 330
Experiments .............................................................................. 333
    Experimental Arrangement ............................................. 334
        Hardware Specifications ............................................. 334
        Data Specifications ...................................................... 334
        Measurement Strategy ................................................ 334
    Experimental Results ....................................................... 334
        Performance Comparison ........................................... 335
        Cost Validation ............................................................. 338
Summary ................................................................................... 338
References ................................................................................. 338


                Introduction

A data stream is a continuous sequence of items produced in real time. A stream can be considered to be a relational table of infinite size [1]. It is therefore considered impossible to maintain an order of the items in the stream with respect to an arbitrary attribute. Likewise, it is impossible to store the entire stream in memory. However, results of operations are expected to be produced as soon as possible. As a consequence, standard relational query processing cannot be straightforwardly applied, and online stream processing has become a new field of research in the area of data management. A number of common examples where online stream processing is important are network traffic monitoring [2–6], sensor data [7], web log analysis [8,9], online auctions [10], inventory and supply-chain analysis [11–13], as well as real-time data integration [14,15].

                  Data Stream Processing

                  Conventional Database Management Systems (DBMSs) are designed using the concept of persistent and interrelated data sets. These DBMSs are stored in reliable repositories, which are updated and queried frequently. But there are some modern application domains where data are generated in the form of a stream, and Data Stream Management Systems (DSMSs) are required to process the stream data continuously. A variety of stream processing engines have been described in the literature [1,16].

The basic difference between a traditional DBMS and a DSMS is the nature of query execution. In DBMSs, data are stored on disk and queries are performed over persistent data [17,18]. In DSMSs, in contrast, data items arrive online and stay in memory for short intervals of time. DSMSs need to work in nonblocking mode while executing a sequence of operations over the data stream [16,19–21]. The eight important requirements for processing real-time stream data are described by Stonebraker et al. [22]. To accommodate the execution of a sequence of operations, DSMSs often use the concept of a window. A window is basically a snapshot taken at a certain point in time and it contains a finite set of data items. When there are multiple operators, each operator executes and stores its output in a buffer, which is further used as an input for some other operator. Each operator needs to manage the contents of the buffer before it is overwritten.

                  Common operations performed by most DSMSs are filtering, aggregation, enrichment, and information processing. A stream-based join is required to perform these operations.

                  Stream-Based Joins

A stream-based join is an operation that combines information coming from more than one input, where at least one input is a data stream and the other may be disk-based. Stream-based joins are important components in modern system architectures, where just-in-time delivery of data is expected. There are a number of important examples that can be interpreted as stream joins, even if they might often be implemented with different methods. For example, in the field of networking, two streams of data packets can be joined using their packet ids to synchronize the flow of packets through routers [6]. Another example is an online auction system which generates two streams, one stream for opening an auction, while the other stream consists of bids on that auction [23,24]. A stream-based join can relate the bids with the corresponding opened auction in a single operation.

In this chapter, we consider a particular class of stream-based joins, namely a join of a single stream with a traditional relational table. This table is given in advance and considered to be so slowly changing that it can be considered constant for the discussion of the algorithm. The application scenario most widely considered in the literature is near-real-time data warehousing [14,15,25–28], as outlined in the following.

Application Scenario

                  In traditional data warehousing, the update tuples are buffered and joined when resources become available [29,30]. In contrast to this, in real-time data warehousing, these update tuples are joined when they are generated in the data sources.

In this application, the slowly changing table is typically a master data table. Incoming real-time sales data may comprise the stream. The stream-based join can be used, for example, to enrich the stream data with master data. The most natural type of join in this scenario would be an equijoin, performed, for example, on a foreign key in the stream data. In near-real-time data warehousing, stream-based joins can be used in the ETL (extract-transform-load) layer. Typical applications would be the detection of duplicate tuples, identification of newly inserted tuples, and the enriching of some new attribute values from master data. One common transformation is the key transformation. The key used in the data source may be different from that in the data warehouse and therefore needs to be transformed into the required value for the warehouse key. This transformation can be obtained by implementing a join operation between the update tuples and a lookup table. The lookup table contains the mapping between the source keys and the warehouse keys.

Figure 10.1 shows a graphical interpretation of such a transformation. In the figure, the attributes with column name id in both data sources DS1 and DS2 contain the source data keys, and the attribute with name warehouse key in the lookup table contains the warehouse key value corresponding to these data source keys. Before loading each transaction into the data warehouse, each source key is replaced by the warehouse key with the help of a join operator.
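The key transformation itself amounts to an equijoin of each incoming tuple with the lookup table. A minimal sketch of that step is given below, using an in-memory map as a stand-in for the disk-based lookup table and the example keys from Figure 10.1; the class and field names are illustrative, not taken from the chapter.

```java
import java.util.HashMap;
import java.util.Map;

public class KeyTransformation {

  // A stream tuple carrying the source key (e.g., "KBD_01").
  record SaleTuple(String sourceKey, int quantity) {}

  // Stand-in for the lookup (master data) table: source key -> warehouse key.
  private final Map<String, Integer> lookupTable = new HashMap<>();

  public KeyTransformation() {
    lookupTable.put("KBD_01", 101);
    lookupTable.put("M_02", 102);
    lookupTable.put("CPU_03", 103);
  }

  // Replace the source key with the warehouse key before loading the tuple
  // into the fact table; unmatched tuples would be handled separately.
  public String transform(SaleTuple t) {
    Integer warehouseKey = lookupTable.get(t.sourceKey());
    if (warehouseKey == null) {
      return null;   // no mapping yet, e.g., a newly inserted source record
    }
    return warehouseKey + "," + t.quantity();
  }

  public static void main(String[] args) {
    KeyTransformation etl = new KeyTransformation();
    System.out.println(etl.transform(new SaleTuple("M_02", 5)));   // prints 102,5
  }
}
```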

Figure 10.1 An example of a stream-based join: source tuples from data sources DS1 and DS2 (with ids such as KBD_01, M_02, and CPU_03) pass through the stream-based join operator, which uses the lookup table in the master data to replace each source id with the corresponding warehouse key (101, 102, 103) before the transaction is loaded into the fact table of the data warehouse.

One important factor related to the join is that both inputs of the join come from different sources. The input from the data sources is in the form of an update stream, which is fast, while the access rate of the lookup table is comparatively slow due to disk I/O cost. This creates a bottleneck in the join execution, and the research challenge is to minimize this bottleneck by amortizing the disk I/O cost on the fast update stream.

                Existing Approaches and Problem Definition

A novel stream-based equijoin algorithm, MESHJOIN (Mesh Join) [14,15], was described by Polyzotis et al. in 2008. MESHJOIN was designed to support streaming updates over persistent data in the field of real-time data warehousing. The MESHJOIN algorithm is in principle a hash join, where the stream serves as the build input and the disk-based relation serves as the probe input. The main contribution is a staggered execution of the hash table build and an optimization of the disk buffer for the disk-based relation. The algorithm reads the disk-based relation sequentially in segments. Once the last segment is read, it again starts from the first segment. The algorithm contains a buffer, called the disk buffer, to store each segment in memory one at a time, and has a number of memory partitions, equal in size, to store the stream tuples. These memory partitions behave like a queue, and their number is equal to the number of segments on the disk, while the size of each segment on the disk is equal to the size of the disk buffer. In each iteration, the algorithm reads one disk segment into the disk buffer and loads a chunk of stream tuples into the memory partition. After loading the disk segment into memory, it joins each tuple from that segment with all stream tuples available in the different partitions. Before the next iteration, the oldest stream tuples are expired from the join memory and all chunks of the stream are advanced by one step. In the next iteration, the algorithm replaces the current disk segment with the next one, loads a chunk of stream tuples into the memory partition, and repeats the above procedure. An overview of MESHJOIN is presented in Figure 10.2, where we consider only three partitions in the queue, with the same number of pages on disk. For simplicity, we do not consider the hash table at this point and assume that the join is performed directly with the queue.

The crux of the algorithm is that the total number of partitions in the stream queue must be equal to the total number of partitions on the disk, and that number can be determined by dividing the size of the disk-based relation R by the size of the disk buffer b (i.e., k = N_R/b). This constraint ensures that a stream tuple that enters the queue is matched against the entire disk relation before it expires.

As shown in the figure, for each iteration, the algorithm reads a partition of stream tuples, w_i, into the queue and one disk page p_j into the disk buffer. At any time t, for example, when the page p_3 is in memory, the status of the stream tuples in the queue can be explained. The w_1 tuples have already joined with the disk pages p_1 and p_2, and therefore after joining with the page p_3 they will be expired. The w_2 tuples have joined only with the page p_2 and therefore, after joining with page p_3, they will advance one step in the queue. Finally, the tuples w_3 have not joined with any disk pages, and they will also advance by one step in the queue after joining with page p_3. Once the algorithm completes the cycle of R, it again starts loading sequentially from the first page.

Figure 10.2 An overview of MESHJOIN: the input stream feeds a stream buffer and, via a hash function, a hash table; partitions w_3, w_2, w_1 occupy the queue, while pages p_1, p_2, p_3 of the disk-based relation are read one at a time into the disk buffer.

                  The MESHJOIN algorithm successfully amortizes the fast arrival rate of the incoming stream by executing the join of disk pages with a large number of stream tuples. However, there are still some further issues that exist in the algorithm. Firstly, due to the sequential access of R, the algorithm reads the unused or less used pages of R into memory with equal frequency, which increases the processing time for every stream tuple in the queue due to extra disk I/O(s). Processing time is the time that every stream tuple spends in the join window from loading to matching without including any delay due to the low arrival rate of the stream. The average processing time in the case of MESHJOIN can be estimated using the given formula.

Average processing time (s) = 1/2 (seek time + access time for the whole of R)

To determine the access rate of disk pages of R, we performed an experiment using a benchmark that is based on real market economics. The detail is available in the "Tests with Locality of Disk Access" section. In this experiment, we assumed that R is sorted in an ascending order with respect to the join attribute value, and we measured the rate of use for segments of the same size (each segment contains 20 pages) at different locations of R. From the results shown in Figure 10.3, it is observed that the rate of page use decreases towards the end of R. The MESHJOIN algorithm does not consider this factor and reads all disk pages with the same frequency.

Secondly, MESHJOIN cannot deal with bursty input streams effectively. In MESHJOIN, a disk invocation occurs when the number of tuples in the stream buffer is equal to or greater than the stream input size w. In the case of intermittent or low arrival rate (λ) of the input stream, the tuples already in the queue need to wait longer due to a disk invocation delay. This waiting time negatively affects the performance. The average waiting time can be calculated using the given formula:

Average waiting time (s) = w/λ

Index nested loop join (INLJ) is another join operator that can be used to join an input stream S with the disk-based relation R, using an index on the join attribute. In INLJ, for each iteration, the algorithm reads one tuple from S and accesses R randomly with the help of the index. However, this random access of R for each tuple of S makes the disk I/O cost dominant. This factor affects the ability of the algorithm to cope with the fast arrival stream of updates and eventually decreases the performance significantly.

                  In summary, the problems that we consider in this chapter are: (a) the minimization of the processing time and waiting time for the stream tuples by accessing the disk-based relation efficiently and (b) dealing with the true nature of skewed and bursty stream data.

Proposed Solution

In the previous section, we explained our observations related to the MESHJOIN and INLJ algorithms. As a solution to the stated problems, we propose a robust stream-based join algorithm called Hybrid Join (HYBRIDJOIN).

In this section, we describe the architecture, pseudo-code, and run-time analysis of our proposed algorithm. We also present the cost model that is used for estimating the cost of our algorithm and for tuning the algorithm.

Execution Architecture

The schematic execution architecture for HYBRIDJOIN is shown in Figure 10.4. The key components of HYBRIDJOIN are the disk buffer, hash table, queue, and stream buffer, together with the stream S and the disk-based relation R as inputs. In our algorithm, we assume that R is sorted and has an index on the join attribute. The disk page of size v_P from relation R is loaded into the disk buffer in memory. The component queue, based on a doubly linked list, is used to store the values of the join attributes, and each node in the queue also contains the addresses of its one-step neighbor nodes. Contrary to the queue in MESHJOIN, we implement an extra feature of random deletion in our HYBRIDJOIN queue. The hash table is an important component that stores the stream tuples and the addresses of the nodes in the queue corresponding to the tuples. The key benefit of this is that, when the disk page is loaded into memory using the join attribute value from the queue as an index, instead of only matching one tuple as in INLJ, the algorithm matches the disk page with all the matching tuples in the queue. This helps to amortize the fast arrival stream. In the case where there is a match, the algorithm generates that tuple as an output and deletes it from the hash table along with the corresponding node from the queue, while the unmatched tuples in the queue are dealt with in a similar way to the MESHJOIN strategy. The role of the stream buffer is to hold the fast stream if necessary. To deal with the intermittencies in the stream, for each iteration, the algorithm loads a disk page into memory and checks the status of the stream buffer. In the case where no stream tuples are available in the stream buffer, the algorithm does not stop but continues working until the hash table becomes empty. However, the queue keeps on shrinking continuously and will become empty when all tuples in the hash table are joined. On the other hand, when tuples arrive again, the algorithm resumes loading them and continues as normal.

Figure 10.3 Measured rate of page use at different locations of R while the size of total R is 16,000 pages (the last 20 pages out of 1000, 2000, 4000, 8000, and 16,000 pages show usage rates of 1939, 1336, 910, 630, and 471, respectively).

Figure 10.4 Architecture of HYBRIDJOIN: the stream S passes through the stream buffer and a hash function into the hash table and queue (the join window), while pages of the disk-based relation R are read into the disk buffer; matches are emitted as the join output.


                  In MESHJOIN, every disk input is bound to the stream input, while in HYBRIDJOIN we remove this constraint by making each disk invocation independent of the stream input.
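To make the interplay of queue and hash table more concrete, the sketch below models the two structures described above in plain Java: a multimap from join-attribute values to the waiting stream tuples, and a doubly linked queue whose nodes can be removed from arbitrary positions when their tuples are matched. It is an illustrative reconstruction under stated assumptions, not the authors' implementation; all names and types are placeholders.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

public class HybridJoinState {

  static final class StreamTuple {
    final long joinKey;      // foreign key into the master data R
    final String payload;
    StreamTuple(long joinKey, String payload) {
      this.joinKey = joinKey;
      this.payload = payload;
    }
  }

  // Hash table: join-attribute value -> all waiting stream tuples with that value
  // (a multimap, since several stream tuples may share one master-data key).
  private final Map<Long, List<StreamTuple>> hashTable = new HashMap<>();

  // Queue of join-attribute values in arrival order. LinkedList stands in for the
  // doubly linked list with random deletion; the chapter's design stores direct
  // node addresses in the hash table so that this deletion is constant time.
  private final LinkedList<Long> queue = new LinkedList<>();

  // Load one stream tuple into the join window (Algorithm 1, line 4).
  void enqueue(StreamTuple t) {
    hashTable.computeIfAbsent(t.joinKey, k -> new ArrayList<>()).add(t);
    queue.addLast(t.joinKey);
  }

  // Oldest key, used as the index into R for the next disk read (line 7).
  Long oldestKey() {
    return queue.peekFirst();
  }

  // Probe one master-data tuple against the hash table (lines 10-13):
  // emit all matches and delete them from both structures.
  int probe(long masterKey, String masterPayload) {
    List<StreamTuple> matches = hashTable.remove(masterKey);
    if (matches == null) {
      return 0;
    }
    for (StreamTuple t : matches) {
      System.out.println(t.payload + " | " + masterPayload);  // join output
    }
    queue.removeIf(k -> k == masterKey);   // drop the matched nodes from the queue
    return matches.size();                 // number of vacated hash-table slots
  }
}
```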

Algorithm

Once the memory has been distributed among the join components, HYBRIDJOIN starts its execution according to the procedure defined in Algorithm 1. Initially, since the hash table is empty, h_S is assigned to the stream input size w, where h_S is the total number of slots in the hash table H (line 1). The algorithm consists of two loops: one is called the outer loop, while the other is called the inner loop. The outer loop, which is an endless loop, is used to build the stream input in the hash table (line 2), while the inner loop is used to probe the disk tuples in the hash table (line 9). In each outer loop iteration, the algorithm examines the availability of stream input in the stream buffer. If stream input is available, the algorithm reads w tuples of the stream and loads them into the hash table while also placing their join attribute values in the queue. Once the stream input is read, the algorithm resets the value of w to zero (lines 3–6). The algorithm then reads the oldest value of a join attribute from the queue and loads a disk partition into the disk buffer, using that join attribute value as an index (lines 7 and 8). After the disk partition has been loaded into memory, the inner loop starts, and for each iteration of the inner loop the algorithm reads one disk tuple from the disk buffer and probes it into the hash table. In the case of a match, the algorithm generates the join output. Since the hash table is a multi-hash-map, there may be more than one match against one disk tuple. After generating the join output, the algorithm deletes all matched tuples from the hash table, along with the corresponding nodes from the queue. Finally, the algorithm increases w by the number of vacated slots in the hash table (lines 9–15).

Algorithm 1: HYBRIDJOIN
Input: A master data relation R with an index on the join attribute, and a stream of updates S
Output: S ⋈ R
Parameters: w tuples of S and a partition of R
Method:
1: w ← h_S
2: while (true) do
3:   if (stream available) then
4:     READ w tuples from the stream buffer and load them into hash table H, while enqueuing their join attribute values in queue Q
5:     w ← 0
6:   end if
7:   READ the oldest join attribute value from Q
8:   READ a partition of R into the disk buffer, using that join attribute value as an index
9:   for each tuple r in the chosen partition do
10:    if r ∈ H then
11:      OUTPUT r ⋈ H
12:      DELETE all matched tuples from H along with the related nodes from Q
13:      w ← w + number of matching tuples found in H
14:    end if
15:  end for
16: end while

Asymptotic Runtime Analysis

We compare the asymptotic runtime of HYBRIDJOIN with that of MESHJOIN and INLJ as throughput, that is, the time needed to process a stream section. The throughput is the inverse of the service rate. Consider the time for a concrete stream prefix s. We denote the time needed to process stream prefix s as MEJ(s) for MESHJOIN, as INLJ(s) for INLJ, and as HYJ(s) for HYBRIDJOIN. Every stream prefix represents a binary sequence, and by viewing this binary sequence as a natural number, we can apply asymptotic complexity classes to the functions above. Note therefore that the following theorems do not use functions on input lengths, but on concrete inputs. The resulting theorems imply analogous asymptotic behavior on input length, but are stronger than statements on input length. We assume that the setup for HYBRIDJOIN and for MESHJOIN is such that they have the same number h_S of stream tuples in the hash table, and in the queue accordingly.

Comparison with MESHJOIN
Theorem 1: HYJ(s) = O(MEJ(s))
Proof

To prove the theorem, we have to prove that HYBRIDJOIN performs no worse than MESHJOIN. The cost of MESHJOIN is dominated by the number of accesses to R. For asymptotic runtime, random access of disk pages is as fast as sequential access (seek time is only a constant factor). For MESHJOIN with its cyclic access pattern for R, every page of R is accessed exactly once after every h_S stream tuples. We have to show that for HYBRIDJOIN no page is accessed more frequently. For that we look at an arbitrary page p of R at the time it is accessed by HYBRIDJOIN. The stream tuple at the front of the queue has some position i in the stream. There are h_S stream tuples currently in the hash table, and the first tuple of the stream that is not yet read into the hash table therefore has position i + h_S. All stream tuples currently in the queue that match master data on p are joined against the disk-based master data tuples on p, and all matching tuples are removed from the queue. We now have to determine the earliest time that p could be loaded again by HYBRIDJOIN. For p to be loaded again, a stream tuple must be at the front of the queue and has to match a master data tuple on p. The first stream tuple that can do so is the aforementioned stream tuple with position i + h_S, because all earlier stream tuples that match data on p have been deleted from the queue. This proves the theorem.

Comparison with INLJ
Theorem 2: HYJ(s) = O(INLJ(s))
Proof

INLJ performs a constant number of disk accesses per stream tuple. For the theorem, it suffices to prove that HYBRIDJOIN performs no more than a constant number of disk accesses per stream tuple as well. We consider first those stream tuples that remain in the queue until they reach the front of the queue. For each of these tuples, HYBRIDJOIN loads a part of R and hence makes a constant number of disk accesses. For all other stream tuples, no separate disk access is made. This proves the theorem.

                  Cost Model

In this section, we derive the general formulas to calculate the cost of our proposed HYBRIDJOIN. We generally calculate the cost in terms of memory and processing time. Equation 10.1 describes the total memory used to implement the algorithm (except the stream buffer). Equation 10.3 calculates the processing cost for w tuples, while the average size of w can be calculated using Equation 10.2. Once the processing cost for w tuples is measured, the service rate μ can be calculated using Equation 10.4. The symbols used to measure the cost are specified in Table 10.1.

                  Memory Cost

In HYBRIDJOIN, the maximum portion of the total memory is used for the hash table H, while a comparatively smaller amount is used for the disk buffer and the queue. We can easily calculate the size of each of them separately:

Memory reserved for the disk buffer (bytes) = v_P
Memory reserved for the hash table (bytes) = α(M − v_P)
Memory reserved for the queue (bytes) = (1 − α)(M − v_P)

The total memory used by HYBRIDJOIN can be determined by aggregating all of the above:

M = v_P + α(M − v_P) + (1 − α)(M − v_P).   (10.1)

Table 10.1 Notations Used in Cost Estimation of HYBRIDJOIN

Symbol         Parameter Name
M              Total allocated memory (bytes)
λ              Stream arrival rate (tuples/s)
μ              Service rate (processed tuples/s)
w              Average stream input size (tuples)
v_S            Stream tuple size (bytes)
v_P            Size of disk buffer (bytes) = size of disk page
v_R            Size of disk tuple (bytes)
d = v_P/v_R    Size of disk buffer (tuples)
α              Memory weight for hash table
(1 − α)        Memory weight for queue
h_S            Size of hash table (tuples)
R_t            Size of disk-based relation R (tuples)
e              Exponent value for benchmark
c_I/O(v_P)     Cost to read one disk page into disk buffer (ns)
c_E            Cost of removing one tuple from the hash table and queue (ns)
c_S            Cost of reading one stream tuple into the stream buffer (ns)
c_A            Cost of appending one tuple into the hash table and queue (ns)
c_H            Cost of probing one tuple into the hash table (ns)
c_O            Cost to generate the output for one tuple (ns)
c_loop         Total cost for one loop iteration of HYBRIDJOIN (s)

                  Currently, we are not including the memory reserved for the stream buffer due to its small size (0.05 MB was sufficient in all our experiments).
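As a small illustration of Equation (10.1), the sketch below divides an assumed memory budget among the three components; the constants (M = 50 MB, a 64 KB disk buffer, and α = 0.9) are placeholder assumptions, not measurements from the chapter.

```java
public class MemoryBudget {
  public static void main(String[] args) {
    long M = 50L * 1024 * 1024;   // total allocated memory in bytes (assumed)
    long vP = 64L * 1024;         // disk buffer = one disk page, in bytes (assumed)
    double alpha = 0.9;           // memory weight for the hash table (assumed)

    long hashTableBytes = (long) (alpha * (M - vP));         // α(M − vP)
    long queueBytes     = (long) ((1 - alpha) * (M - vP));   // (1 − α)(M − vP)

    // Equation (10.1): the three parts add back up to M (up to rounding).
    System.out.println("disk buffer = " + vP);
    System.out.println("hash table  = " + hashTableBytes);
    System.out.println("queue       = " + queueBytes);
    System.out.println("total       = " + (vP + hashTableBytes + queueBytes));
  }
}
```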

                  Processing Cost

In this section, we calculate the processing cost for HYBRIDJOIN. To calculate the processing cost, it is necessary to calculate the average stream input size w first.

Calculate average stream input size w: In HYBRIDJOIN the average stream input size w depends on the following four parameters.

• Size of the hash table, h_S (in tuples)
• Size of the disk buffer, d (in tuples)
• Size of the master data, R_t (in tuples)
• Exponent value for the benchmark (stream data distribution), e

In our experiments, w is directly proportional to h_S and d (where d = v_P/v_R) and inversely proportional to R_t. Further details about these relationships can be found in the "Analysis of w with Respect to its Related Components" section. The fourth parameter represents the exponent value for the stream data distribution as explained in the section "Tests with Locality of Disk Access," and using an exponent value equal to 1 the 80/20 Rule [31] can be formulated approximately for market sales. Therefore, the formula for w is:

w ∝ (h_S · d)/R_t,   that is,   w = k · (h_S · d)/R_t,   (10.2)

where k is a constant influenced by system parameters. The value of k has been obtained from measurements. In this setup, it is 1.36.

On the basis of w, the processing cost can be calculated for one loop iteration. In order to calculate the cost for one loop iteration, the major components are:

Cost to read one disk partition = c_I/O(v_P)
Cost to probe one disk partition into the hash table = (v_P/v_R)·c_H
Cost to generate the output for w matching tuples = w·c_O
Cost to delete w tuples from the hash table and the queue = w·c_E
Cost to read w tuples from the stream S = w·c_S
Cost to append w tuples into the hash table and the queue = w·c_A

By aggregation, the total cost for one loop iteration is:

c_loop (s) = 10^-9 [ c_I/O(v_P) + (v_P/v_R)·c_H + w·c_O + w·c_E + w·c_S + w·c_A ].   (10.3)

Since the algorithm processes w tuples of the stream S in c_loop seconds, the service rate μ can be calculated by dividing w by the cost for one loop iteration:

μ = w/c_loop.   (10.4)
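A compact numerical sketch of how Equations (10.2)–(10.4) fit together is given below. The constant k = 1.36 is the measured value quoted above, while every other constant (the component sizes and the per-operation costs in nanoseconds) is an assumed placeholder chosen only for illustration.

```java
public class HybridJoinCostModel {
  public static void main(String[] args) {
    // Component sizes in tuples (illustrative values).
    double hS = 500_000;        // hash table size
    double d  = 4_000;          // disk buffer size in tuples (v_P / v_R)
    double Rt = 2_000_000;      // size of the master data R
    double k  = 1.36;           // measured constant from the chapter's setup

    // Equation (10.2): average stream input size per iteration.
    double w = k * (hS * d) / Rt;

    // Per-operation costs in nanoseconds (assumed placeholders).
    double cIO = 12_000_000;    // read one disk page into the disk buffer
    double cH  = 200;           // probe one tuple into the hash table
    double cO  = 400;           // generate output for one tuple
    double cE  = 150;           // remove one tuple from hash table and queue
    double cS  = 100;           // read one stream tuple into the stream buffer
    double cA  = 250;           // append one tuple into hash table and queue

    // Equation (10.3): cost of one loop iteration, converted from ns to s.
    double cLoop = 1e-9 * (cIO + d * cH + w * (cO + cE + cS + cA));

    // Equation (10.4): service rate in processed tuples per second.
    double mu = w / cLoop;

    System.out.printf("w = %.1f tuples, c_loop = %.4f s, mu = %.0f tuples/s%n",
        w, cLoop, mu);
  }
}
```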

Analysis of w with Respect to its Related Components

This section presents details of the experiments that have been conducted to analyze the effect of each of these components on w.

Effect of the Size of the Master Data on w

An experiment has been conducted to observe the effect of the size of the master data, denoted by R_t, on w. In this experiment, the value of R_t has been increased exponentially while keeping the values of the other parameters, h_S and d, fixed. The results of this experiment are shown in Figure 10.5a. It is clear that the increase in R_t affects w negatively. This can be explained as follows: increasing R_t decreases the probability of matching the stream tuples for the disk buffer. Therefore, the relationship of R_t with w is inversely proportional, represented mathematically as w ∝ 1/R_t.

                  Effect of the Hash Table Size on w

This experiment has been conducted to examine the effect of the hash table size h_S on w. In order to observe the individual effect of h_S on w, the values of the other parameters, R_t and d, have been assumed to be fixed. The value of h_S has been increased exponentially and w has been measured for each setting. The results of the experiment are shown in Figure 10.5b. It can be observed that w increases at an equal rate while increasing h_S. The reason for this is that with an increase in h_S more stream tuples can be accommodated in memory. Therefore, the matching probability for the tuples in the disk buffer with the stream tuples increases, and that causes w to increase. Hence, w is directly proportional to h_S, which can be described mathematically as w ∝ h_S.

Figure 10.5 Analysis of w while varying the size of necessary components: (a) effect of the size of R on w; (b) effect of the hash table size on w; (c) effect of the disk buffer size on w (all plots on a log-log scale).

                  Effect of the Disk Buffer Size on w

Another experiment has been conducted to analyze the effect of the disk buffer size d on w. Again, only the effect of d on w is observed, and the values of the other parameters, R_t and h_S, have been considered to be fixed. The size of the disk buffer has been increased exponentially and w has been measured against each setting. Figure 10.5c presents the results of this experiment. It is clear that increasing d results in w increasing at the same rate. The reason for this behavior is that, when d increases, more disk tuples can be loaded into the disk buffer. This increases the probability of matching for stream tuples with the tuples in the disk buffer, and eventually w increases. The relationship of w with d is directly proportional, that is, w ∝ d.

                  Tuning

Tuning of the join components is important to make efficient use of available resources. In HYBRIDJOIN, the disk buffer is the key component to tune in order to amortize the disk I/O cost on fast input data streams. From Equation (10.4), the service rate depends on w and the cost c_loop required to process these w tuples. In HYBRIDJOIN, for a particular setting (M = 50 MB), assuming the size of R and the exponent value are fixed (R_t = 2 million tuples and e = 1), from Equation (10.2) w then depends on the size of the hash table and the size of the disk buffer. Furthermore, the size of the hash table is also dependent on the size of the disk buffer, as shown in Equation (10.1). Therefore, using Equations (10.2)–(10.4), the service rate μ can be specified as a function of v_P, and the value of v_P at which the service rate is maximum can be determined by applying standard calculus rules. In order to explain it experimentally, Figure 10.6 shows the relationship between the I/O cost and the service rate. From the figure, it can be observed that in the beginning, for a small disk buffer size, the service rate is also small because there are fewer matching tuples in the queue. In other words, we can say w is also small. However, the service rate increases with an increase in the size of the disk buffer due to more matching tuples in the queue. After reaching a particular value of the disk buffer size, the trend changes and performance decreases with further increments in the size of the disk buffer. The plausible reason behind this decrease is the rapid increment in the disk I/O cost.
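The tuning step can also be pictured as a simple sweep over candidate disk buffer sizes, recomputing the service rate for each candidate and keeping the best one. The sketch below does exactly that; the cost constants, tuple sizes, and the toy I/O model are assumed placeholders, so with real measured costs the service rate first rises and then falls as in Figure 10.6, whereas here the sweep simply reports whichever candidate scores highest.

```java
public class DiskBufferTuning {
  public static void main(String[] args) {
    double M  = 50.0 * 1024 * 1024;  // total memory in bytes (as in the chapter)
    double vR = 120;                 // size of one disk tuple in bytes (assumed)
    double Rt = 2_000_000;           // master data size in tuples (as in the chapter)
    double k  = 1.36;

    // Assumed per-operation costs in nanoseconds (placeholders).
    double cH = 200, cO = 400, cE = 150, cS = 100, cA = 250;

    double bestVp = 0, bestMu = 0;
    for (double vP = 32 * 1024; vP <= 1024 * 1024; vP *= 2) {
      double d  = vP / vR;                    // disk buffer in tuples
      double hS = 0.9 * (M - vP) / vR;        // rough hash table capacity (assumed α and tuple size)
      double w  = k * (hS * d) / Rt;          // Equation (10.2)
      double cIO = 5_000_000 + 30 * vP;       // toy I/O cost: seek plus transfer (ns)
      double cLoop = 1e-9 * (cIO + d * cH + w * (cO + cE + cS + cA));
      double mu = w / cLoop;                  // Equation (10.4)
      if (mu > bestMu) { bestMu = mu; bestVp = vP; }
    }
    System.out.printf("best disk buffer ~ %.0f KB, service rate ~ %.0f tuples/s%n",
        bestVp / 1024, bestMu);
  }
}
```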

Tests with Locality of Disk Access

A crucial factor for the performance of HYBRIDJOIN is the distribution of master data foreign keys in the stream. If the distribution is uniform, then HYBRIDJOIN may perform worse than MESHJOIN, but by a constant factor, in line with the theoretical analysis. Note, however, that HYBRIDJOIN still has the advantage of being efficient for intermittent streams, while the original MESHJOIN would pause in intermittent streams and leave tuples unprocessed for an open-ended period.

                  It is also obvious that HYBRIDJOIN has advantages if R contains unused data, for example, if there are old product records that are currently accessed very rarely, that are clustered in R. HYBRIDJOIN would not access these areas of R, while MESHJOIN accesses the whole of R.

More interesting, however, is whether HYBRIDJOIN can also benefit from more general locality. Therefore, the question arises whether we can demonstrate a natural distribution where HYBRIDJOIN measurably improves over a uniform distribution, because of locality.

                  The popular types of distributions are Zipfian distributions, which exhibit a power law similar to Zipf’s law. Zipfian distributions are discussed as at least plausible models for sales [31], where some products are sold frequently while most are sold rarely. This kind of distribution can be modeled using Zipf’s law.

A generator for synthetic data has been designed that models a Zipfian distribution, and it has been used to demonstrate that HYBRIDJOIN performance increases through locality and that HYBRIDJOIN outperforms MESHJOIN.

In order to simplify the model, it has been assumed that the product keys are sorted in the master data table according to their frequency in the stream. This is certainly not the case in typical warehouse catalogues, but it provides a plausible locality behavior and makes the degree of locality very transparent.

Finally, in order to demonstrate the behavior of the algorithm under intermittence, a stream generator has been implemented that produces stream tuples with a timing that is self-similar. This bursty generation of tuples models a flow of sales transactions which depends upon fluctuations over several time periods, such as market hours, weekly rhythms, and seasons. The pseudo-code for the generation of the benchmark used here is shown in Algorithm 2. In the pseudo-code, STREAMGENERATOR is the main procedure, while GETDISTRIBUTIONVALUE and SWAPSTATUS are the subprocedures that are called from the main procedure.

According to the main procedure, a number of virtual stream objects (in this case 10), each representing the same distribution value obtained from the GETDISTRIBUTIONVALUE procedure, are inserted into a priority queue, which always keeps these objects sorted in ascending order (lines 5–7). Once all the virtual stream objects have been inserted into the priority queue, the topmost stream object is taken out (line 8). A loop is executed to generate an infinite stream (lines 9–18). In each iteration of the loop, the algorithm waits for a while (which depends upon the value of the variable oneStep) and then checks whether the current time is more than the time when that particular object was inserted. If the condition is true, the algorithm dequeues the next object from the priority queue and calls the SWAPSTATUS procedure (lines 11–14). The SWAPSTATUS procedure enqueues the current dequeued stream object again after updating its time interval and bandwidth status (lines 19–27). Once the value of the variable totalCurrentBandwidth has been updated, the main procedure generates the final stream tuple values as output, using the procedure GETDISTRIBUTIONVALUE (lines 15–17). Each call of the procedure GETDISTRIBUTIONVALUE returns a value based on Zipf's law (lines 28–31).

Algorithm 2: STREAMGENERATOR
 1: totalCurrentBandwidth ← 0
 2: timeInChosenUnit ← 0
 3: on ← false
 4: d ← GETDISTRIBUTIONVALUE()
 5: for i ← 1 to N do
 6:   PriorityQueue.enqueue(d; bandwidth ← Math.power(2, i), timeInChosenUnit ← currentTime())
 7: end for
 8: current ← PriorityQueue.dequeue()
 9: while (true) do
10:   wait(oneStep)
11:   if (currentTime() > current.timeInChosenUnit) then
12:     current ← PriorityQueue.dequeue()
13:     SWAPSTATUS(current)
14:   end if
15:   for j ← 1 to totalCurrentBandwidth do
16:     OUTPUT GETDISTRIBUTIONVALUE()
17:   end for
18: end while

procedure SWAPSTATUS(current)
19: current.timeInChosenUnit ← current.timeInChosenUnit + getNextRandom() × oneStep × current.bandwidth
20: if on then
21:   totalCurrentBandwidth ← totalCurrentBandwidth − current.bandwidth
22:   on ← false
23: else
24:   totalCurrentBandwidth ← totalCurrentBandwidth + current.bandwidth
25:   on ← true
26: end if
27: PriorityQueue.enqueue(current)
end procedure

procedure GETDISTRIBUTIONVALUE()
28: sumOfFrequency ← ∫(1/x) dx at x = max − ∫(1/x) dx at x = min
29: random ← getNextRandom()
30: distributionValue ← inverseIntegralOf(random × sumOfFrequency + ∫(1/x) dx at x = min)
31: RETURN distributionValue
end procedure
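As a companion to Algorithm 2, the following compact Java sketch illustrates its two main ideas: the inverse-integral sampling of GETDISTRIBUTIONVALUE for the special case e = 1, and the self-similar on/off switching of virtual stream objects with bandwidths 2^i. The class names, constants, and the use of milliseconds for timing are illustrative choices, not part of the chapter's implementation.

import java.util.PriorityQueue;
import java.util.Random;

// A compact, runnable sketch of the ideas in Algorithm 2: a Zipf-like key
// sampler (inverse-integral trick of lines 28-31 for exponent e = 1) and a
// self-similar on/off bandwidth mechanism built on a priority queue of
// "virtual stream objects". Names and constants are illustrative only.
public class StreamGeneratorSketch {
    static final Random RAND = new Random();
    static final double MIN_KEY = 1, MAX_KEY = 2_000_000;    // illustrative key range of R
    static final long ONE_STEP_MS = 10;                      // pacing of the main loop
    static final int N = 10;                                 // number of virtual stream objects

    // lines 28-31: invert the integral of 1/x to draw skewed keys
    static long getDistributionValue() {
        double sumOfFrequency = Math.log(MAX_KEY) - Math.log(MIN_KEY);
        return (long) Math.exp(RAND.nextDouble() * sumOfFrequency + Math.log(MIN_KEY));
    }

    // one virtual stream object: contributes 2^i tuples per step while "on"
    static class StreamObject implements Comparable<StreamObject> {
        long bandwidth;      // 2^i
        long nextSwapTime;   // when its on/off status flips
        boolean on = false;

        public int compareTo(StreamObject o) { return Long.compare(nextSwapTime, o.nextSwapTime); }
    }

    public static void main(String[] args) throws InterruptedException {
        PriorityQueue<StreamObject> queue = new PriorityQueue<>();
        long totalCurrentBandwidth = 0;

        for (int i = 1; i <= N; i++) {                        // lines 5-7
            StreamObject s = new StreamObject();
            s.bandwidth = 1L << i;
            s.nextSwapTime = System.currentTimeMillis();
            queue.add(s);
        }

        while (true) {                                        // lines 9-18
            Thread.sleep(ONE_STEP_MS);
            while (System.currentTimeMillis() > queue.peek().nextSwapTime) {   // line 11
                StreamObject current = queue.poll();          // line 12
                // SWAPSTATUS, lines 19-27: flip the object's status and reschedule it
                totalCurrentBandwidth += current.on ? -current.bandwidth : current.bandwidth;
                current.on = !current.on;
                current.nextSwapTime = System.currentTimeMillis()
                        + (long) (RAND.nextDouble() * ONE_STEP_MS * current.bandwidth);
                queue.add(current);
            }
            for (long j = 0; j < totalCurrentBandwidth; j++) { // lines 15-17
                System.out.println(getDistributionValue());    // emit one skewed foreign key
            }
        }
    }
}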

The experimental representation of the benchmark is shown in Figures 10.7 and 10.8, while the environment in which the experiments have been conducted is described in the "Experimental Arrangement" section. As described previously in this section, the benchmark is based on two characteristics: one is the frequency of sales of each product, while the other is the flow of these sales transactions. Figure 10.7 validates the first characteristic, that is, the Zipfian distribution of market sales. In the figure, the x-axis represents the variety of products, while the y-axis represents the sales. It can be observed that only a limited number of products (20%) are sold frequently, while the rest of the products are sold rarely.

The HYBRIDJOIN algorithm is well suited to this kind of benchmark, in which only a small portion of R is accessed again and again, while the rest of R is accessed rarely.

Figure 10.8 represents the flow of transactions, which is the second characteristic of the benchmark. It is clear that the flow of transactions varies with time.

Figure 10.7 A long-tail distribution using Zipf's law that implements the 80/20 rule: sales against products rank. (a) On plain scale; (b) on log–log scale.

Figure 10.8 An input stream having bursty and self-similar characteristics: input stream of updates against time (ns).

Experiments

We performed an extensive experimental evaluation of HYBRIDJOIN, proposed in the "Proposed Solution" section, on the basis of synthetic data sets. In this section, we describe the environment of our experiments and analyze the results.

Experimental Arrangement

                  In order to implement the prototypes of existing MESHJOIN, INLJ, and our proposed HYBRIDJOIN algorithms, we used the following hardware and data specifications.

                  Hardware Specifications

We carried out our experimentation on a Pentium IV 2 × 2.13 GHz machine with 4 GB of main memory. We implemented the experiments in Java using the Eclipse IDE. We also used built-in plugins provided by Apache and nanoTime(), provided by the Java API, to measure the memory and processing time, respectively.

Data Specifications

We analyzed the performance of each of the algorithms using synthetic data.

                  The relation R is stored on disk using a MySQL database, while the bursty type of stream data is generated at run time using our own benchmark algorithm.
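For illustration, here is a minimal JDBC sketch of how a segment of R might be read from the MySQL database into the disk buffer, starting from a given join key so that only the required part of R is touched. The table name, column names, connection URL, and credentials are hypothetical, and the chapter does not prescribe this exact query; the MySQL JDBC driver must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

// Minimal JDBC sketch: load one disk-buffer-sized segment of R from MySQL,
// starting at a given join key, so that only the needed part of R is read.
// Table/column names and credentials are hypothetical.
public class DiskBufferLoader {

    public static List<String> loadSegment(long startKey, int segmentSize) throws Exception {
        String url = "jdbc:mysql://localhost:3306/warehouse";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT product_key, description FROM master_data " +
                     "WHERE product_key >= ? ORDER BY product_key LIMIT ?")) {
            ps.setLong(1, startKey);
            ps.setInt(2, segmentSize);
            List<String> segment = new ArrayList<>();
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    segment.add(rs.getLong("product_key") + ":" + rs.getString("description"));
                }
            }
            return segment;
        }
    }

    public static void main(String[] args) throws Exception {
        // Load, for example, a 500-tuple segment of R starting at key 1000.
        System.out.println(loadSegment(1000L, 500).size() + " tuples loaded");
    }
}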

In transformation, a join is normally performed between the primary key (the key in the lookup table) and the foreign key (the key in the stream tuple), and therefore our HYBRIDJOIN supports joins for both one-to-one and one-to-many relationships. In order to implement the join for one-to-many relationships, it is necessary to store multiple values in the hash table against one key value. However, the hash table provided by the Java API does not support this feature, and therefore we used the Multi-Hash-Map provided by Apache as the hash table in our experiments. The detailed specification of the data set that we used for analysis is shown in Table 10.2.
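The following minimal sketch shows the multi-value hash table idea using only the standard Java API, keeping a list of stream tuples per join key. It is an illustration of the requirement described above rather than a drop-in replacement for the Apache Multi-Hash-Map used in the experiments; the class and method names are ours.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of a hash table that keeps several stream tuples per join key,
// which is what a one-to-many join requires. The chapter used Apache's
// Multi-Hash-Map for this; the standard-library version below is only illustrative.
public class MultiValueHashTable {
    // join key (foreign key of the stream tuple) -> all queued stream tuples with that key
    private final Map<Long, List<String>> table = new HashMap<>();

    public void put(long key, String streamTuple) {
        table.computeIfAbsent(key, k -> new ArrayList<>()).add(streamTuple);
    }

    // Called when a disk tuple with this primary key is loaded into the disk buffer:
    // every queued stream tuple with a matching foreign key can be joined and removed.
    public List<String> removeMatches(long key) {
        List<String> matches = table.remove(key);
        return matches != null ? matches : List.of();
    }

    public static void main(String[] args) {
        MultiValueHashTable ht = new MultiValueHashTable();
        ht.put(42L, "sale#1");
        ht.put(42L, "sale#2");                      // two stream tuples share the same product key
        System.out.println(ht.removeMatches(42L));  // [sale#1, sale#2]
    }
}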

                  Measurement Strategy

The performance, or service rate, of the join is measured as the number of tuples processed per second. In each experiment, the algorithm runs for 1 h; we start our measurements after 20 min and continue them for 20 min. For accuracy, we compute a 95% confidence interval for every result. Moreover, during the execution of the algorithm, no other applications run in parallel.
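A minimal sketch of this measurement idea is shown below: the service rate is obtained by counting tuples processed over a timed interval using System.nanoTime(), and a 95% confidence interval is computed over repeated runs using a normal approximation (z = 1.96). The stand-in workload and the exact statistical procedure are illustrative assumptions, since the chapter does not spell them out.

// Minimal sketch: measure tuples processed per second with System.nanoTime()
// and report a 95% confidence interval over repeated runs (normal
// approximation, z = 1.96). The per-tuple work below is a stand-in.
public class ServiceRateMeasurement {
    static double seed = 1;

    // Pretend to process one tuple; in the real join this would be a probe/match step.
    static void processOneTuple() { Math.sqrt(seed++); }

    static double measureServiceRate(long durationNanos) {
        long processed = 0;
        long start = System.nanoTime();
        while (System.nanoTime() - start < durationNanos) {
            processOneTuple();
            processed++;
        }
        return processed / (durationNanos / 1e9);   // tuples per second
    }

    public static void main(String[] args) {
        int runs = 5;
        double[] rates = new double[runs];
        for (int i = 0; i < runs; i++) rates[i] = measureServiceRate(1_000_000_000L); // 1 s each

        double mean = 0;
        for (double r : rates) mean += r;
        mean /= runs;

        double var = 0;
        for (double r : rates) var += (r - mean) * (r - mean);
        var /= (runs - 1);                           // sample variance

        double halfWidth = 1.96 * Math.sqrt(var / runs);
        System.out.printf("service rate = %.0f +/- %.0f tuples/s (95%% CI)%n", mean, halfWidth);
    }
}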

Experimental Results

We conducted our experiments in two dimensions. In the first dimension, we compare the performance of all three approaches, and in the second dimension we validate the cost model by comparing the measured cost with the predicted cost.

Table 10.2 Data Specification

Parameter                              Value

Disk-Based Data
  Size of disk-based relation R        0.5 million to 8 million tuples
  Size of each tuple                   120 bytes

Stream Data
  Size of each tuple                   20 bytes
  Size of each node in queue           12 bytes
  Stream arrival rate, λ               125–2000 tuples/s

                  Performance Comparison

As the source code for MESHJOIN is not openly available, we implemented the MESHJOIN algorithm ourselves. In our experiments, we compare the performance in two different ways. First, we compare HYBRIDJOIN with MESHJOIN with respect to time, both processing time and waiting time. Second, we compare the performance in terms of service rate against the other two algorithms.

                  Performance Comparisons with Respect to Time

In order to test the performance with respect to time, two different types of experiments have been conducted. The experiment shown in Figure 10.9a presents the comparisons with respect to processing time, while Figure 10.9b depicts the comparisons with respect to waiting time. The terms processing time and waiting time have already been defined in the "Proposed Solution" section. According to Figure 10.9a, the processing time in the case of HYBRIDJOIN is significantly smaller than that of MESHJOIN. The reason behind this is that HYBRIDJOIN uses a different strategy to access R. The MESHJOIN algorithm accesses all disk partitions with the same frequency, without considering the rate of use of each partition on the disk. In HYBRIDJOIN, an index-based approach that never reads unused disk partitions has been implemented to access R. The processing time for INLJ is not shown in this experiment because it remained constant as the size of R changed.

In the experiment shown in Figure 10.9b, the time that each algorithm waits has been compared. In the case of INLJ, since the algorithm works at the tuple level, it does not need to wait; instead, the delay appears in the form of a stream backlog that builds up when the incoming stream rate is faster than the processing rate. The amount of this delay increases further as the stream arrival rate increases. Turning to the other two approaches, the figure shows that the waiting time in MESHJOIN is greater than in HYBRIDJOIN.

Figure 10.9 Experimental results. (a) Processing time; (b) waiting time; (c) performance comparison: R varies; (d) performance comparison: M varies; (e) performance comparison: e varies.

In HYBRIDJOIN, since there is no constraint to match each stream tuple with the whole of R, each disk invocation is not synchronized with the stream input. However, for stream arrival rates of less than 150 tuples/s, the waiting time in HYBRIDJOIN is greater than that in INLJ. A plausible reason for this is the greater I/O cost in the case of HYBRIDJOIN when the size of the input stream is small.

Performance Comparisons with Respect to Service Rate


In this category of experiments, the performance of HYBRIDJOIN has been compared with that of the other two join algorithms in terms of the service rate, by varying both the total memory budget and the size of R with a bursty stream. In the experiment shown in Figure 10.9c, the total memory allocated to the join is fixed at 50 MB, while the size of R varies exponentially. It can be observed that for all sizes of R, the performance of HYBRIDJOIN is significantly better than that of the other join approaches.

In the second experiment of this category, the performance of HYBRIDJOIN has been analyzed using different memory budgets, while the size of R is fixed at 2 million tuples. Figure 10.9d depicts the comparisons of all three approaches. From the figure, it is clear that for all memory budgets the performance of HYBRIDJOIN is better than that of the other two algorithms.

Finally, the performance of HYBRIDJOIN has been evaluated by varying the skew in the input stream S. The value of the Zipfian exponent e is varied in order to vary the skew; in these experiments, it ranges from 0 to 1. At 0 the input stream S is uniform, and the skew increases as e increases. Figure 10.9e presents the results of the experiment. It is clear from Figure 10.9e that for all values of e except 0, HYBRIDJOIN performs considerably better than MESHJOIN and INLJ, and this improvement increases with an increase in e. The plausible reason for this better performance is that HYBRIDJOIN does not read unused parts of R into memory, which saves unnecessary I/O cost. Moreover, as e increases, the input stream S becomes more skewed and, consequently, the I/O cost decreases due to an increase in the size of the unused part of R. However, in the particular scenario where e is equal to 0, HYBRIDJOIN performs worse than MESHJOIN, but only by a constant factor.

                  Cost Validation

In this experiment, we validate the cost model for all three approaches by comparing the predicted cost with the measured cost. Figure 10.10 presents the comparison of both costs. The figure demonstrates that the predicted cost closely resembles the measured cost for every approach, which also reassures us that the implementations are accurate.

Figure 10.10 Cost validation: measured and calculated processing cost (s) against total memory (MB) for the three approaches, panels (a)–(c).

                Summary

In the context of real-time data warehousing, a join operator is required to perform a continuous join between a fast stream and the disk-based relation within limited resources. In this chapter, we investigated two available stream-based join algorithms and presented a robust join algorithm, HYBRIDJOIN. Our main objectives in HYBRIDJOIN are: (a) to minimize the stay of every stream tuple in the join window by improving the efficiency of access to the disk-based relation, and (b) to deal with the nonuniform nature of update streams. We developed a cost model and a tuning methodology in order to achieve the maximum performance within the limited resources. We designed our own benchmark to test the approaches according to current market economics. To validate our arguments, we implemented a prototype of HYBRIDJOIN that demonstrates a significant improvement in service rate under limited memory. We also validated the cost model for our algorithm.

References

1. D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, S. Zdonik, Aurora: A new model and architecture for data stream management, The VLDB Journal, 12(2), 120–139, 2003.
2. Proceedings of the Thirtieth International Conference on Very Large Data Bases, Toronto, Canada, 30, 480–491, 2004.
3. C. Cranor, Y. Gao, T. Johnson, V. Shkapenyuk, O. Spatscheck, Gigascope: High performance network monitoring with an SQL interface, Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, Wisconsin, pp. 623–623, 2002.
4. A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, M. J. Strauss, QuickSAND: Quick summary and analysis of network data, Technical Report 2001-43, DIMACS, 2001.
5. S. Madden, M. J. Franklin, Fjording the stream: An architecture for queries over streaming sensor data, Proceedings of the 18th IEEE International Conference on Data Engineering, San Jose, CA, pp. 555–566, 2002.
6. M. Sullivan, A. Heybey, Tribeca: A system for managing large databases of network traffic, Proceedings of the Annual Technical Conference on USENIX, Louisiana, pp. 2–2, 1998.
7. P. Bonnet, J. Gehrke, P. Seshadri, Towards sensor database systems, Proceedings of the Second International Conference on Mobile Data Management, Hong Kong, pp. 3–14, 2001.
8. C. Cortes, K. Fisher, D. Pregibon, A. Rogers, F. Smith, Hancock: A language for extracting signatures from data streams, Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, pp. 9–17, 2000.
9. A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, M. Strauss, Surfing wavelets on streams: One-pass summaries for approximate aggregate queries, Proceedings of the 27th International Conference on Very Large Data Bases, Rome, Italy, pp. 79–88, 2001.
10. A. Arasu, S. Babu, J. Widom, An abstract semantics and concrete language for continuous queries over streams and relations, Stanford InfoLab, 2002.
11. M. J. Franklin, S. R. Jeffery, S. Krishnamurthy, F. Reiss, S. Rizvi, E. Wu, O. Cooper, A. Edakkunni, W. Hong, Design considerations for high fan-in systems: The HiFi approach, Proceedings of the Second Biennial Conference on Innovative Data Systems Research (CIDR'05), California, pp. 290–304, 2005.

                  

                12. H. Gonzalez, J. Han, X. Li, D. Klabjan, Warehousing and analyzing massive