Aleksandr Kolotkov (Soft-Service LLC): Tutorial: Real-time big data handling
The problem of handling and analysing big data attracts growing attention across science and industry. Two important elements of big data handling are reporting and analytics on a dataset of interest. By a common textbook definition, reporting is “the process of organising data into informational summaries in order to monitor how different areas of a business are performing”. This includes identifying the intrinsic parameters of the data cloud (its core metrics), presenting them in, e.g., a spreadsheet or online dashboard, and aggregating these parameters according to their properties. Likewise, analytics is “the process of exploring data and reports in order to extract meaningful insights, which can be used to better understand and improve business performance”. For example, the global behaviour of some data parameter can often be better understood by breaking the dataset down to smaller scales, which is a form of analytics. Thus, both reporting and analytics are valuable forms of business intelligence: reporting provides the information, analytics gives the insights; reporting raises questions, analytics attempts to answer them. For big data consisting of, e.g., millions or billions of events, reporting and analytics are naturally complicated by the sheer amount of data, which raises two principal issues: 1) keeping the required computational resources reasonable; and 2) keeping the time needed for report making (reporting) and hypothesis testing (analytics) within adequate bounds.
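The reporting/analytics distinction can be made concrete with a pair of sketch queries in ClickHouse SQL. The table name `sales` and its columns (`region`, `month`, `revenue`) are purely hypothetical here, chosen only to illustrate the idea of a summary metric versus its drill-down:

```sql
-- Hypothetical table: sales(region String, month Date, revenue Float64)

-- Reporting: a core metric aggregated into an informational summary
SELECT region, sum(revenue) AS total_revenue
FROM sales
GROUP BY region;

-- Analytics: the same metric broken down to a finer scale,
-- to explore *why* a region's total behaves as it does
SELECT region, toStartOfMonth(month) AS m, sum(revenue) AS monthly_revenue
FROM sales
GROUP BY region, m
ORDER BY region, m;
```

The first query answers “how are the regions performing?”; the second helps answer the questions the first one raises.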
In this tutorial, I will discuss big data handling and one of the tools that allows these tasks to be solved effectively: ClickHouse, a new open-source column-oriented database management system developed by Yandex, Russia (it currently powers Yandex.Metrica, the world’s third-largest web analytics platform). The tutorial will cover the following features of ClickHouse: its performance, scalability, hardware efficiency, fault tolerance, and more. The installation procedure, the loading of a sample dataset from open sources, and querying will also be discussed. A sample dataset of USA civil flights from 1987 to 2015, taken from open sources (166 million rows, 63 GB of uncompressed data), will be used for the demonstration. As an example, we are going to obtain new knowledge by querying this dataset to find: the most popular destinations in 2015; the most popular cities of departure; the cities of departure offering the greatest variety of destinations; the dependence of flight delays on the day of the week; the cities of departure with the most frequent delays of 1 hour or longer; the flights of maximum duration; the distribution of arrival delays split by airline; the airlines that stopped operating flights; the most trending destination cities in 2015; and the destination cities whose popularity depends most strongly on the season.
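To give a flavour of what such queries look like, here is a ClickHouse SQL sketch for the first question, the most popular destinations in 2015. The table name `ontime` and the column names `Year` and `DestCityName` are assumptions modelled on a typical flight-data schema, not necessarily the exact names used in the tutorial:

```sql
-- Assumed schema: ontime(Year UInt16, DestCityName String, ...)
-- Top 10 destination cities by number of arriving flights in 2015
SELECT
    DestCityName,
    count() AS flights
FROM ontime
WHERE Year = 2015
GROUP BY DestCityName
ORDER BY flights DESC
LIMIT 10;
```

The remaining example questions follow the same pattern, varying the grouping key (day of week, airline, city of departure) and the aggregate (count, average delay, maximum duration).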
Reporting and analytics processes, and the tools capable of implementing them, including ClickHouse, are mainly positioned as an important part of business intelligence. However, one can consider the demonstrated features from an interdisciplinary point of view and readily apply them to data science. For example, ClickHouse has already been successfully deployed at CERN's LHCb experiment to store and process metadata on 10 billion events with over 1000 attributes per event.