Massive Data Sets and Database Design

Massive Data Sets

Research in high-frequency finance and economics builds models from very large amounts of sequential data. This rules out traditional relational databases, which are designed to store data records in an order-insensitive way and sort them when a query is evaluated; that sorting phase causes a significant performance loss. Our infrastructure is therefore being built to support fast insertion of new data and an efficient, rich query language for processing ordered data.
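To illustrate why keeping data in order at insertion time helps, here is a minimal Python sketch. It is not the system described on this page: the class name OrderedSeries and its methods are hypothetical, and the data lives in plain in-memory lists. The point is only that, once records are stored in time order, a range query needs two binary searches and no sort at query time.

```python
import bisect


class OrderedSeries:
    """Toy time-ordered store (hypothetical, for illustration only)."""

    def __init__(self):
        self._times = []   # timestamps, always kept in ascending order
        self._values = []  # values aligned with self._times

    def insert(self, t, value):
        # Fast path: ticks usually arrive in time order, so a plain append suffices.
        if not self._times or t >= self._times[-1]:
            self._times.append(t)
            self._values.append(value)
        else:
            # Late arrival: place it after any entries with the same timestamp.
            i = bisect.bisect_right(self._times, t)
            self._times.insert(i, t)
            self._values.insert(i, value)

    def range(self, start, end):
        """Return (t, value) pairs with start <= t < end, already in time order."""
        lo = bisect.bisect_left(self._times, start)
        hi = bisect.bisect_left(self._times, end)
        return list(zip(self._times[lo:hi], self._values[lo:hi]))


series = OrderedSeries()
for t, price in [(1, 100.0), (2, 100.5), (4, 101.0), (3, 100.7)]:
    series.insert(t, price)

# The late tick at t=3 was slotted into place on insert; no sort is needed here.
window = series.range(2, 4)
```

A production system would use an on-disk, append-friendly layout rather than Python lists, but the trade-off is the same: a little work at insertion in exchange for queries that read an already ordered slice.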

The crucial part of the infrastructure is our scalable database system. From the point of view of data retrieval, what mainly distinguishes it from other available sequential storage systems is its functional query language, which can express any computable function. The language supports not only data retrieval and filtering but also rich data processing and even complex applications, such as hypothesis testing, directly in the engine. Because the computation runs where the data is stored, we obtain a significant performance improvement over the traditional client-server model used in most other databases.

Sampling Algorithms

Our data processing infrastructure also supports sampling algorithms, that is, algorithms that operate on only a portion of the available input data. For more information, see the page about Statistical learning.
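As a concrete example of an algorithm in this family, the sketch below shows reservoir sampling (Algorithm R), which draws a uniform random sample of k items from a stream in a single pass without knowing the stream length in advance. This page does not say which sampling algorithms the infrastructure provides, so the choice of technique and the function name reservoir_sample are assumptions made for illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable, in one pass.

    Illustrative sketch of Algorithm R; not taken from the infrastructure itself.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # replacing a uniformly chosen slot.
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Sample 5 ticks from a long sequence without holding the sequence in memory.
sample = reservoir_sample(range(100_000), 5, random.Random(42))
```

Memory use is proportional to k rather than to the size of the input, which is what makes this kind of algorithm attractive for massive sequential data sets.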