On Optimizing Operator Fusion Plans for Large-Scale Machine Learning in SystemML
Matthias Böhm (Graz University of Technology (TU Graz))
31.1.2019, 15:00, room T03, Dept. of Computer Sciences
Large-scale machine learning (ML) underpins many applications that profoundly transform our lives, but ML systems to execute these workloads are still in their infancy. In the first part of this talk, we give an overview of Apache SystemML as a representative ML system for declarative, large-scale ML. SystemML provides an R-like syntax and automatically compiles these high-level linear algebra programs into hybrid runtime plans of single-node, in-memory operations and distributed operations on Spark. In the second part, we present a selected research result on optimizing operator fusion plans. The opportunities for fused operators (fused chains of basic operators) are ubiquitous, and their benefits include fewer intermediates, scan sharing, and sparsity exploitation across operators. However, existing fusion heuristics struggle to find good plans for complex operator DAGs or hybrid plans. Therefore, we introduce an exact yet practical cost-based optimization framework for fusion plans, including techniques for candidate exploration, candidate selection, and code generation of local and distributed operations over dense, sparse, and compressed data. Finally, we share some lessons learned and ongoing work on properly supporting the entire end-to-end data science lifecycle.
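To illustrate the "fewer intermediates" and "sparsity exploitation" benefits named in the abstract, the following minimal Python sketch contrasts an unfused evaluation of sum(X * Y * Z) with a hand-fused single pass. The function names and the list-based data are assumptions made for illustration only; this is not SystemML's code generator or its optimizer.

```python
# Illustrative sketch only: hand-written fusion of sum(X * Y * Z) on plain lists.
# The function names are hypothetical and not part of SystemML.

def unfused_sum_product(x, y, z):
    # Each basic operator materializes a full intermediate vector.
    tmp1 = [a * b for a, b in zip(x, y)]      # intermediate 1: X * Y
    tmp2 = [t * c for t, c in zip(tmp1, z)]   # intermediate 2: (X * Y) * Z
    return sum(tmp2)

def fused_sum_product(x, y, z):
    # One pass over the inputs, no intermediate vectors.
    total = 0.0
    for a, b, c in zip(x, y, z):
        if a == 0:
            # Sparsity exploitation: a zero in X makes the product zero
            # (assuming finite values in Y and Z), so the cell is skipped.
            continue
        total += a * b * c
    return total

x = [1.0, 0.0, 3.0]
y = [2.0, 5.0, 4.0]
z = [0.5, 7.0, 2.0]

print(unfused_sum_product(x, y, z))
print(fused_sum_product(x, y, z))
```

Both functions compute the same aggregate; the fused variant avoids allocating the two temporary vectors and reads each input once.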
Content Recommendation for Viral Social Influence
Panagiotis Karras (Aarhus University)
18.12.2017, 14:00, room T04, Dept. of Computer Sciences
How do we select content that will become viral in a whole network after we share it with friends or followers? Significant research activity has been dedicated to the problem of strategically selecting a seed set of initial adopters so as to maximize a meme's spread in a network. This line of work assumes that the success of such a campaign depends solely on the choice of a tunable set of initiators, regardless of how users perceive the propagated meme, which is fixed. In many real-world settings, however, the opposite holds: a meme's propagation depends on users' perceptions of its tunable characteristics, while the set of initiators is fixed. We address the natural problem that arises in such circumstances: Suggest content, expressed as a limited set of attributes, for a creative promotion campaign that starts out from a given seed set of initiators, so as to maximize its expected spread over a social network. To our knowledge, no previous work addresses this problem. We find that the problem is NP-hard and inapproximable. As a tight approximation guarantee is not admissible, we design an efficient heuristic, Explore-Update, as well as a conventional Greedy solution. Our experimental evaluation demonstrates that Explore-Update selects near-optimal attribute sets with real data, achieves 30% higher spread than baselines, and runs an order of magnitude faster than Greedy.
Panagiotis Karras (Panos) is an Associate Professor in Computer Science at Aarhus University. His interests are in the confluence of data management, data mining, and database security. He earned a PhD in Computer Science from the University of Hong Kong and an MEng in Electrical and Computer Engineering from the National Technical University of Athens. He has held positions at Aalborg University, the Skolkovo Institute of Science and Technology, Rutgers Business School, the National University of Singapore, the University of Zurich, and the Technical University of Denmark. Panos' work has been published in over 50 research articles, awarded by the Hong Kong Institute of Science, and funded by the Lee Kuan Yew Endowment Fund and the Skolkovo Foundation. He regularly serves as a program committee member and referee for the major international conferences and journals in the above areas.
Unnesting Arbitrary Queries
Thomas Neumann (Technical University of Munich (TUM))
28.10.2016, 11:00, room T03, Dept. of Computer Sciences
SQL-99 allows for nested subqueries at nearly all places within a query. From a user's point of view, nested queries can greatly simplify the formulation of complex queries. However, nested queries that are correlated with the outer queries frequently lead to dependent joins with nested-loop evaluation and thus poor performance. Existing systems therefore use a number of heuristics to unnest these queries, i.e., de-correlate them. These unnesting techniques can greatly speed up query processing, but are usually limited to certain classes of queries. To the best of our knowledge, no existing system can de-correlate queries in the general case. We present a generic approach for unnesting arbitrary queries. As a result, the de-correlated queries allow for much simpler and much more efficient query evaluation.
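As a small illustration of what de-correlation means, the sketch below runs a correlated subquery and a manually unnested equivalent against an in-memory SQLite database. The schema (students, exams) and the data are assumptions chosen for the example. The rewrite shown is a classic hand rewrite for one simple query class, not the generic approach presented in the talk.

```python
import sqlite3

# Hypothetical example schema and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER, name TEXT);
    CREATE TABLE exams (sid INTEGER, grade REAL);
    INSERT INTO students VALUES (1, 'Ada'), (2, 'Bob');
    INSERT INTO exams VALUES (1, 1.0), (1, 2.0), (2, 3.0), (2, 1.5);
""")

# Correlated form: the subquery refers to s.id from the outer query,
# so it is conceptually re-evaluated for every outer row (dependent join).
correlated = """
    SELECT s.name, e.grade
    FROM students s JOIN exams e ON s.id = e.sid
    WHERE e.grade = (SELECT MIN(e2.grade) FROM exams e2 WHERE e2.sid = s.id)
    ORDER BY s.name
"""

# Unnested form: compute the minimum per student once with GROUP BY,
# then combine it with the outer query through a regular join.
unnested = """
    SELECT s.name, e.grade
    FROM students s
    JOIN exams e ON s.id = e.sid
    JOIN (SELECT sid, MIN(grade) AS best FROM exams GROUP BY sid) m
      ON m.sid = s.id AND e.grade = m.best
    ORDER BY s.name
"""

rows_correlated = conn.execute(correlated).fetchall()
rows_unnested = conn.execute(unnested).fetchall()
print(rows_correlated)
print(rows_unnested)
```

Both queries return each student's best (lowest) grade; the second form contains no reference from the inner query to the outer one, so it can be evaluated with ordinary join algorithms.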
Exploiting Knowledge Facets for Enhanced Information Search
Mouna Kacimi (Free University of Bolzano)
29.04.2016, 15:30, room T06, Dept. of Computer Sciences
Search results about a given query topic are typically unstructured, making it hard to understand the relationships between the different sources of information. Thus, there is a need for organizing search results to help users (1) gain more insight into query topics, and (2) easily access the information sources that interest them. This is particularly helpful for ambiguous queries or faceted topics that involve a variety of sub-topics, meanings, versions, arguments, opinions, and many other aspects. In this talk, I present techniques that exploit existing knowledge bases to enhance information search. I first show how to exploit Wikipedia for query expansion and search result diversification. Then, I proceed with the organization of information sources, allowing effective navigation through knowledge facets.
Keyword-Based Querying with Local Intent
Christian Jensen (Aalborg University)
Data Modeling in Application Development with NoSQL Databases
Stefanie Scherzinger (OTH Regensburg)
NoSQL databases are increasingly popular, especially in web development. Often the motivation is the large volume of data to be managed, but these systems are also attractive to agile development teams because of their schema flexibility. Since many NoSQL databases offer no support for defining, enforcing, and maintaining a global schema, classic tasks of the database management system shift into the application software. This talk gives an overview of concrete challenges that arise in practice when designing a data model for key-value and document databases. These include modeling that enables atomic updates, avoiding hot-spot data objects caused by high-frequency, parallel write accesses to the same object, and strategies for dealing with continuous schema evolution. The talk shows that the database community in particular, with its wealth of experience in schema management and its broad repertoire of formal methods, can make a valuable contribution here.
Yasin N. Silva (Arizona State University)
Many application scenarios can significantly benefit from the identification and processing of similarities in the data. Even though some work has been done to extend the semantics of some operators, e.g., join and selection, to be aware of data similarities, there has not been much study of the role and implementation of similarity-aware operations as first-class database operators. Furthermore, very little work has addressed the problem of evaluating and optimizing queries that combine several similarity operations. The focus of this presentation is the study of similarity queries that contain one or multiple first-class similarity database operators, e.g., Similarity Selection, Similarity Join, and Similarity Group-by. We will present implementation techniques for several similarity operators, a comprehensive conceptual evaluation model for similarity queries, and a rich set of transformation rules to extend cost-based query optimization to the case of similarity queries. We will also discuss techniques to implement similarity operators using the MapReduce framework to process massive datasets.
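To make the operators named in the abstract concrete, here is a minimal Python sketch of a range-based similarity selection and similarity join over one-dimensional numeric values. The function names, the absolute-difference distance, and the threshold parameter eps are assumptions for illustration; the talk's operators, evaluation model, and MapReduce implementations are not reproduced here.

```python
# Illustrative sketch only: naive similarity operators on numeric values.
# Names, distance function, and threshold semantics are hypothetical.

def similarity_selection(values, center, eps):
    """Return all values within distance eps of the query point center."""
    return [v for v in values if abs(v - center) <= eps]

def similarity_join(left, right, eps):
    """Return all pairs (l, r) with |l - r| <= eps, using nested loops."""
    return [(l, r) for l in left for r in right if abs(l - r) <= eps]

left = [1, 5, 10]
right = [2, 6, 20]

print(similarity_selection(left, 4, 1))
print(similarity_join(left, right, 1))
```

The nested-loop join compares every pair and is shown only to pin down the semantics; a practical implementation would use sorting, indexing, or partitioning to avoid the quadratic cost.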
Past DB Retreats
07.02.2019 - 09.02.2019 at Waldheim, Martell, Italy
23.02.2018 - 25.02.2018 at Zur Goldenen Rose, Karthaus, Italy
17.02.2017 - 19.02.2017 at Hotel Traube, Graun, Italy
14.02.2016 - 16.02.2016 at Glieshof, Matsch, Italy
04.02.2015 - 06.02.2015 at Zur Goldenen Rose, Karthaus, Italy
08.03.2014 - 10.03.2014 at Das Gerstl, Burgeis, Italy
03.02.2013 - 05.02.2013 at Hotel Rainer, Sterzing, Italy
04.03.2012 - 06.03.2012 at Hotel Villa Waldkönigin, St. Valentin auf der Haide, Italy
16.02.2011 - 18.02.2011 at Hotel Cevedale, Sulden, Italy