
Data Analytics

I am interested in the automated engineering of distributed/polyglot data persistence and analytics solutions.

Polyglot Stream Processing

Crossflow is a distributed stream processing framework developed in our group that facilitates the model-driven design and implementation of polyglot (multi-language) data processing pipelines. Crossflow offers distinctive features such as job-level caching and opinionated workers, and has been used for distributed software repository mining in the context of the Crossminer H2020 project, and for spreadsheet analytics in the context of a Knowledge Transfer Partnership with IBM.
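To make these two features concrete, the sketch below shows, in Java, how an opinionated worker and a job-level cache might interact in such a framework. The Job, Worker, and CachingDispatcher types are illustrative inventions for this page, not Crossflow's actual API: workers may decline jobs they are not suited for, and jobs with the same identity are answered from the cache rather than re-executed.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of two ideas from the text: an "opinionated" worker
// that may decline jobs it is not suited for, and a job-level cache that
// lets the framework reuse the results of previously executed jobs.
interface Job {
    String cacheKey(); // stable identity of the job, e.g. a hash of its inputs
}

interface Worker<J extends Job, R> {
    boolean accepts(J job);        // opinionated: a worker may refuse a job
    R process(J job) throws Exception;
}

class CachingDispatcher<J extends Job, R> {
    private final Map<String, R> cache = new ConcurrentHashMap<>();
    private final Iterable<Worker<J, R>> workers;

    CachingDispatcher(Iterable<Worker<J, R>> workers) {
        this.workers = workers;
    }

    Optional<R> dispatch(J job) throws Exception {
        // Job-level caching: identical jobs are answered from the cache
        // instead of being re-executed by a worker.
        R cached = cache.get(job.cacheKey());
        if (cached != null) return Optional.of(cached);

        for (Worker<J, R> worker : workers) {
            if (worker.accepts(job)) {          // first willing worker wins
                R result = worker.process(job);
                cache.put(job.cacheKey(), result);
                return Optional.of(result);
            }
        }
        return Optional.empty(); // no worker volunteered for this job
    }
}
```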

Personal Data Vaults

In the context of a partnership with Maastricht University, we investigated the feasibility of moving personal (or otherwise sensitive) information away from centralised databases, which are vulnerable to large-scale data leaks, theft, and irresponsible data mining, and into data vaults stored on end-user devices that comply with the FAIR (Findable, Accessible, Interoperable, Reusable) data principles. While this would have been impractical a decade ago, contemporary end-user devices and growing mobile network speeds now make it an interesting proposition. Our work investigated solutions to several challenges raised by such a disruptive change in personal data persistence, including:

  • enabling third parties to be granted access to FAIR data vaults in a fully auditable manner
  • generating signatures for extracted information in support of provenance in data use (see the sketch after this list)
  • replicating data in a secure way to minimise the impact of device loss or theft
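To illustrate the second point, the following sketch uses the standard java.security API to sign a record released by a vault, so that downstream use of the data can later be verified against the vault's public key. The record contents and the on-the-fly key pair are purely illustrative; a real vault would hold a long-lived key pair and publish its public key.

```java
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;
import java.util.Base64;

// Sketch: a vault signs data it releases so that later use of the data
// can be traced back to a specific extraction (provenance).
public class VaultSigner {
    public static void main(String[] args) throws Exception {
        // Illustrative only: a real vault would use a long-lived key pair
        // and publish the public key, not generate keys on the fly.
        KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
        gen.initialize(2048);
        KeyPair keys = gen.generateKeyPair();

        byte[] extracted = "date_of_birth=1985-04-12"
                .getBytes(StandardCharsets.UTF_8);

        // Sign the extracted record with the vault's private key.
        Signature signer = Signature.getInstance("SHA256withRSA");
        signer.initSign(keys.getPrivate());
        signer.update(extracted);
        byte[] sig = signer.sign();
        System.out.println("signature: "
                + Base64.getEncoder().encodeToString(sig));

        // A third party can later verify the record against the signature
        // using the vault's public key.
        Signature verifier = Signature.getInstance("SHA256withRSA");
        verifier.initVerify(keys.getPublic());
        verifier.update(extracted);
        System.out.println("verified: " + verifier.verify(sig));
    }
}
```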

Big Data Polystores

The need for levels of availability and scalability beyond those supported by relational databases has led to the emergence of a new generation of purpose-specific databases grouped under the term NoSQL. In general, NoSQL databases are designed with horizontal scalability as a primary concern and deliver increased availability and fault-tolerance at the cost of temporary inconsistency and reduced durability of data. To balance the requirements for data consistency and availability, organisations increasingly migrate towards hybrid data persistence architectures comprising both relational and NoSQL databases. The consensus is that this trend will only become stronger in the future; critical data will continue to be stored in ACID (predominantly relational) databases while non-critical data will be progressively migrated to high-availability NoSQL databases.
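As a concrete (and purely illustrative) example of such a hybrid architecture, the sketch below stores a critical record, an order, transactionally in a relational database via JDBC, while a non-critical clickstream event goes to a MongoDB document store. The connection strings, credentials, and schema are assumptions made for the example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;

// Sketch of a hybrid persistence architecture: the critical record (an
// order) goes to an ACID relational store inside a transaction, while the
// non-critical record (a clickstream event) goes to a high-availability
// document store.
public class HybridPersistence {
    public static void main(String[] args) throws Exception {
        // Critical data: relational, transactional, durable.
        try (Connection db = DriverManager.getConnection(
                "jdbc:postgresql://localhost/shop", "shop", "secret")) {
            db.setAutoCommit(false);
            try (PreparedStatement insert = db.prepareStatement(
                    "INSERT INTO orders (customer_id, total) VALUES (?, ?)")) {
                insert.setLong(1, 42L);
                insert.setBigDecimal(2, new java.math.BigDecimal("19.99"));
                insert.executeUpdate();
            }
            db.commit(); // the order is either fully stored or not at all
        }

        // Non-critical data: document store, eventually consistent replicas.
        try (MongoClient mongo =
                MongoClients.create("mongodb://localhost:27017")) {
            mongo.getDatabase("shop").getCollection("events")
                 .insertOne(new Document("type", "page_view")
                         .append("customerId", 42L)
                         .append("path", "/checkout"));
        }
    }
}
```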

TYPHON was a European Commission H2020 project (2018-2020) of which I was the Technical Director. It provided a methodology and an integrated technical offering for designing, developing, querying, and evolving scalable architectures for the persistence, analytics, and monitoring of large volumes of hybrid (relational, graph-based, document-based, natural language, etc.) data, known as polystores. Our research in this area focuses on intercepting incoming polystore queries and outgoing query results to facilitate analytics and authorization orthogonally to the functionality of the applications that use the polystore.
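The sketch below illustrates the general interception idea in Java; the Polystore, Record, and InterceptingPolystore types are illustrative assumptions, not TYPHON's actual API. A wrapper around a polystore client runs an authorization check on every incoming query and an analytics hook on every outgoing result set, so neither concern leaks into application code.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical sketch of query/result interception: a wrapper around a
// polystore client sees every query on the way in (authorization) and
// every result on the way out (analytics), without the application
// changing. These types are illustrative, not TYPHON APIs.
interface Polystore {
    List<Record> execute(String query);
}

record Record(String source, String payload) {}

class InterceptingPolystore implements Polystore {
    private final Polystore delegate;
    private final Predicate<String> authorize;      // pre-query check
    private final Consumer<List<Record>> analytics; // post-result hook

    InterceptingPolystore(Polystore delegate,
                          Predicate<String> authorize,
                          Consumer<List<Record>> analytics) {
        this.delegate = delegate;
        this.authorize = authorize;
        this.analytics = analytics;
    }

    @Override
    public List<Record> execute(String query) {
        // Incoming interception: reject unauthorized queries up front.
        if (!authorize.test(query)) {
            throw new SecurityException("query rejected: " + query);
        }
        List<Record> results = delegate.execute(query);
        // Outgoing interception: feed results to analytics, e.g. to count
        // accesses per data source, before handing them to the caller.
        analytics.accept(results);
        return results;
    }
}
```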