Introduction
Recently, the world has seen increasing interest in sharing self-governed, non-public data across organisational structures [1, 2, 3, 4]. Central to this development is the concept of decentralization, which allows organizations to collaborate and share data without relying on a central authority. Systems designed to cover these needs are called decentralized data ecosystems or, more generally, decentralized environments. Noteworthy efforts include Solid [2], focusing on data interoperability between people, and both IDSA [3] and Gaia-X [4], focusing on corporate data interoperability. A critical feature of these systems is enabling authorized third parties to write to a restricted set of organisations’ data, though research on writing is still in its infancy.
In the landscape of these decentralized data systems, the Semantic Web has proven instrumental due to its experience in public data sharing, the ability of the RDF [5] data model to create references across organisational boundaries, and its ability to model data semantics explicitly. The Semantic Web has a long history of distributing usable open data. In that history, it became clear that serving data comes with various trade-offs [6, 7] that need to be considered for each use case. This has led to the creation of many interfaces, such as SPARQL endpoints [8], Linked Data Platform APIs [9], TPF [10], and others. A similar trend can be expected for decentralized environments, where diverse organizations with varied data converge. As such, embracing interface heterogeneity will be key for decentralized environments aiming for longevity, as it accommodates future technological advances and allows data providers to make use-case-dependent choices.
Heterogeneity does, however, increase the complexity of interacting with the data ecosystem, since data-consumers, like developers and data analysts, need to know how to interact with each interface. For read-only federated SPARQL [11] queries over heterogeneous sources, this complexity has been mitigated by introducing a query engine that serves as an abstraction layer over the different interaction methods [12, 13]. Contrary to read abstractions, abstracting the interaction alone is not sufficient for writing, since writes are not safe. For instance, when my dermatologist wants to send my test results to my general practitioner (GP) using a decentralized data ecosystem, my dermatologist would need to know exactly how to send this data to my GP, requiring them to know how my GP expects to receive it. This contrasts with reading data: reading my test results is as easy as querying all accessible data and filtering on what applies to me. As such, the solution of abstracting reads is not directly transferable to abstracting writes.
In this PhD, I will explore how to abstract writes over decentralized, heterogeneous, permissioned data sources. Specifically, I tackle the fundamental challenges of abstracting the complexities that arise from the unique characteristics of such systems, which 1. enable fine-grained permissions, 2. expose data through heterogeneous interfaces, and 3. have no central authority for managing updates. To abstract such heterogeneous interfaces, I will use the declarative SPARQL Update [14] query language, which is the standard for expressing structured update queries over RDF data. Additionally, I will develop scheduling algorithms for updates across decentralized, heterogeneous, and self-governed data stores, as existing algorithms are designed for centralized data stores or public data [15].
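As a minimal illustration of this abstraction (a TypeScript sketch; the store URL, patient IRI, and vocabulary are hypothetical), a data consumer would only express what should change using SPARQL Update, while the engine decides where and how to apply it. Here the engine happens to target a store exposing a SPARQL 1.1 Update endpoint; with heterogeneous interfaces, this final step differs per store:

```typescript
// A declarative SPARQL Update: states *what* should change, not where or how.
const update = `
  PREFIX ex: <http://example.org/ns#>
  INSERT DATA {
    <https://gp.example/patients/123#testResult> ex:value "negative" .
  }
`;

// Hypothetical final step for one store that exposes a SPARQL 1.1 Update
// endpoint. With heterogeneous interfaces, an engine must discover and pick
// the interaction method per store instead of hard-coding it like this.
async function pushToSparqlEndpoint(endpoint: string): Promise<void> {
  const response = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/sparql-update" },
    body: update,
  });
  if (!response.ok) {
    throw new Error(`Update rejected: ${response.status}`);
  }
}

pushToSparqlEndpoint("https://gp.example/sparql").catch(console.error);
```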
The next section discusses the state of the art. Section 3 explains the problem I am trying to solve, followed by Section 4, which outlines the methodology. Section 5 covers evaluation, Section 6 presents preliminary results, and Section 7 details the desired impact of this research.
State of the Art
The state of the art covers both Semantic Web research on querying decentralized data and the more general field of database management research.
Polyglot Systems
A single dataset can be exposed through multiple interfaces, creating a polyglot database [16]. A polyglot database allows data providers to serve multiple use cases. Interfaces may differ in structure, functionality, and feature trade-offs [6, 7]. For example, one interface may support transactions [17], while another does not. When the same data is exposed through multiple (partially overlapping) interfaces, it is essential that an agent can discover what data an interface exposes. This means interfaces should describe how they relate to each other by describing how they relate to the underlying data. From these descriptions, data consumers can identify which interfaces expose data of interest, and interact with the best interface based on properties like availability, the presence of certain query accelerators (indexes), and transaction support.
Decentralized Updates over Heterogeneous Interfaces
Update handling across distributed DBMSs assumes a homogeneous interface that is shared across all nodes in the distributed data environment [18]. Within a decentralized data ecosystem, however, interfaces can be heterogeneous. This heterogeneity of interfaces affects fundamental concepts within transaction handling and update planning compared to distributed DBMSs. Since executing read SPARQL queries across heterogeneous interfaces [12, 13] is an active research area, we can combine the formal model for executing read SPARQL queries [11] over heterogeneous interfaces with the transaction model for DBMSs [18]. Concretely, this will involve introducing the concept of interface capability discovery into update transactions, whereby interfaces describe their capabilities via hypermedia descriptions [19].
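A minimal client-side sketch of such discovery is shown below (TypeScript); the use of a ‘describedby’ link relation and the naive header parsing are illustrative assumptions rather than an agreed-upon protocol:

```typescript
// Hedged sketch: before planning an update transaction, the engine asks a
// store where its capability description lives, e.g. via a Link header.
async function discoverCapabilityDocument(storeRoot: string): Promise<string | undefined> {
  const response = await fetch(storeRoot, { method: "HEAD" });
  const link = response.headers.get("Link") ?? "";
  // Deliberately naive Link-header parsing, for illustration only.
  const match = link.match(/<([^>]+)>\s*;\s*rel="describedby"/);
  return match ? new URL(match[1], storeRoot).href : undefined;
}
```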
Scheduler
To convert declarative queries, which describe the desired result, into executable code, queries are first parsed and converted into a query plan [20]. To achieve performant plans, query planning relies on statistics like cardinality estimates [21]. While such statistics are also relevant for update scheduling, metrics like transaction size [22] are more critical there. In polyglot or decentralized systems, the scheduler must also decide which interfaces to consult for what data. While SPARQL query planning over decentralized environments is well-studied [23, 24, 10], update scheduling is not. To achieve performant updates, there is a need for scheduling algorithms for decentralized environments.
Distributed Update Transactions
ACID safety guarantees, achieved through transactions [17], are foundational in DBMSs. However, the rise of distributed databases prompted a reevaluation of the importance of these properties. According to the CAP theorem, systems must trade off Consistency (C), Availability (A), and Partition Tolerance (P) on a continuous scale [25]. Distributed databases typically require Partition Tolerance and are thus forced to choose between CP and AP, and consequently between ACID and BASE [26] properties, focusing on consistency or availability, respectively.
Most distributed databases under centralized management opt for the BASE properties, emphasizing availability and settling for eventual consistency. While distributed DBMS transactions are well-studied, decentralized transactions are less understood [27]. Insights from distributed transactions do not translate well into decentralized contexts.
In distributed data management, one entity manages the entire network, so each node has the same mission, and data is replicated across the network to ensure availability [18]. In contrast, decentralized data ecosystems consist of self-governed data stores, without systematic data replication or a unifying mission. Failure tolerance also differs: in a decentralized ecosystem, the unavailability of one organization’s data does not compromise the entire ecosystem, much like how the web continues to function even when individual links are broken [28]. As such, decentralized ecosystems can be understood as a collection of loosely coupled, self-governed datasets, each exposed through interfaces that individually position themselves in the CAP space. Consistency guarantees remain poorly defined in these environments, highlighting the need for novel transaction models and update algorithms tailored to decentralized ecosystems.
Problem Statement and Contributions
My research project focuses on abstracting data updates across a large number of small data stores, which are permissioned, heterogeneous, and decentralized. These characteristics exist in emerging decentralized data ecosystems [2, 3, 4], and contrast with distributed approaches that focus on one or a few large data stores managed with centralized techniques [29]. As such, I define my main research question as:
- RQ I: How to balance overall write throughput and server-side performance when updating data across a large network of permissioned, decentralized and heterogeneous RDF data stores?
I will focus my research on updating a small number of data stores (1-50) that are known at the start of query processing. I do this because the discovery of data stores is part of ongoing research [24] in the broader query processing domain that goes beyond my project’s scope. To achieve this, I will develop update query processing techniques focused on permissioned and heterogeneous data stores, using hypermedia descriptions of the update capabilities that client-side query engines discover in those data stores. This leads to the following main hypothesis:
- Hypothesis 1: In a decentralized network of 50 permissioned, heterogeneous data stores, each containing up to 10,000 RDF triples, multiple concurrent SPARQL update queries can be executed without a central coordinator, producing the same final data states as with a central coordinator in distributed environments, while keeping execution time overhead limited (≈ +100%).
This hypothesis addresses three open research challenges within Web decentralization and the Semantic Web. First, I focus on updates in decentralized environments, while most research focuses on handling updates in a centralized datastore or distributed data stores with a centralized coordinator. This is challenging, as my research focuses on handling updates without such a centralized coordinator. Second, I focus on permissioned data stores, meaning data cannot be freely exchanged across data stores without prior authentication. This is challenging, as authentication is usually applied outside existing centralized and distributed algorithms, while I will incorporate this within update algorithms. Third, the data stores I focus on are heterogeneous, which means that different stores may expose varied update capabilities due to different client and server requirements. This is innovative, as heterogeneity in server capabilities has been primarily investigated for reading data, but not yet for writing data.
Section 1 and Section 2 highlighted interface heterogeneity as a core feature of a decentralized data ecosystem. Our research thus starts with the sub-question:
- RQ II: What are the requirements for update interfaces in decentralized environments, and how do they differentiate themselves?
Interface choice is case-dependent, with each interface having strengths and weaknesses. Interfaces will thus not only differ across data stores; a single data store could also be exposed through multiple interfaces, serving as a polyglot system. As such, interfaces should describe their relation to the data, their interaction model, their functionality, and their feature trade-offs, where feature trade-offs describe the trade-offs made in the interface design, for example a ‘maximum-lock-time’.
- RQ III: How can interfaces describe themselves sufficiently such that automated agents can interact with them?
Once a fundamental understanding of update interfaces is established, I will focus on updating a single data store. As each interface has pros and cons, a single data store may be exposed through multiple interfaces to support more use cases. This leads to the following research questions:
- RQ IV: How can the scheduler discover the interfaces exposed by a single data store?
- RQ V: How can the scheduler select the interface most suitable for a given update query and user?
- RQ VI: How to schedule the execution of an update query over a single data store using the selected interface?
Finally, I will focus on updating multiple data stores.
- RQ VII: How can we schedule an update query over multiple data stores?
Additionally, I will investigate the support of ACID-transactions across multiple data stores. Supporting ACID-transactions will enable applications that require strong safety guarantees to migrate to decentralized data ecosystems, transforming these ecosystems into a mature data layer. To keep the workload of my PhD manageable, I will focus on the atomicity property, because typical RDF data does not have many consistency constraints.
- RQ VIII: How can we support the atomicity property of ACID-transactions across multiple data stores?
Research Methodology and Approach
The work of this PhD can be split into three main parts, each providing answers to a subset of RQ II-VIII. We start with a discussion on the study of interfaces (RQ II-III), after which we discuss query updates over a single data store (RQ IV-VI). Finally, we discuss updating multiple data stores (RQ VII-VIII). When all work is done, I will be able to provide an answer to RQ I in my dissertation.
Study of update interfaces in a personal decentralized data ecosystem
The functional choice of different interfaces (such as SPARQL [8], LDP [9], and TPF [10]) for different read/write granularities (SPARQL queries, arbitrary documents, triple patterns, etc.) results in different non-functional characteristics, such as 1. permission semantics, 2. varying safety guarantees, 3. high availability, 4. internal transaction support, and 5. cross-interface transaction support. Additionally, interfaces also deviate based on the way they structure data while interacting with agents. They can be modelled 1. symmetrically, meaning the write unit is the same as the read unit, or 2. asymmetrically, for example being query-based or event-based.
To answer RQ II and III, I will investigate how this variety of interfaces can semantically describe themselves, describing: 1. their structure in relation to the data, 2. their supported features, and 3. the feature trade-offs made. Previous research has mainly focused on the read structure of interfaces [30, 31, 32] and therefore lacks a description of why data is accessible the way it is and how similar data can be made accessible in the future. Existing approaches to self-descriptiveness [19] can serve as a starting point in this process.
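The sketch below (TypeScript; all property names are hypothetical and do not correspond to any existing vocabulary) illustrates the kind of information such a description would need to carry beyond read-oriented descriptions: how the write unit relates to the underlying data, which update features are supported, and which trade-offs were made:

```typescript
// Hypothetical shape of a write-oriented interface self-description.
type WriteUnit = "sparql-update" | "document" | "triple-pattern";

interface UpdateInterfaceDescription {
  endpoint: string;
  // 1. Structure in relation to the data: what a single write operates on,
  //    and whether the write unit mirrors the read unit (symmetric) or not.
  writeUnit: WriteUnit;
  symmetric: boolean;
  // 2. Supported features.
  supportsTransactions: boolean;
  supportsConditionalUpdates: boolean;
  // 3. Feature trade-offs made in the interface design.
  maximumLockTimeMs?: number;
  eventuallyConsistent?: boolean;
}

// Example instance for a store exposing a document-based, LDP-style interface.
const documentInterface: UpdateInterfaceDescription = {
  endpoint: "https://alice.example/pod/",
  writeUnit: "document",
  symmetric: true, // reads and writes both operate on whole documents
  supportsTransactions: false,
  supportsConditionalUpdates: true, // e.g. via HTTP preconditions
  eventuallyConsistent: false,
};
```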
Update query processing against a single permissioned data store
To answer RQ IV, V and VI, I start by focusing on a single data store. Although decentralization emphasizes heterogeneity between data stores, understanding single-store updates is essential before scaling up the complexity.
The optimal interface choice depends on the query type and load [6, 7]. Since the interface choice depends on the expected consumption, and a single data store might be consumed in various ways, it can be expected that a single data store could be exposed through multiple interfaces, serving diverse use cases. A single data store can thus be considered a polyglot system. As such, schedulers should discover the interfaces exposing the data store, and select the most suitable interface for a given query and user. This choice can be based on the query type, the expected load, and the functional and non-functional features of the interface, such as high availability, low server load, transaction support, or even dedicated functionality for certain data, such as dedicated indexes.
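A simplistic selection heuristic is sketched below (TypeScript; the candidate properties, the update features, and the scoring weights are all assumptions rather than a worked-out cost model):

```typescript
// Minimal, hypothetical view of an interface for selection purposes.
interface CandidateInterface {
  endpoint: string;
  supportsTransactions: boolean;
  supportsSparqlUpdate: boolean;
  observedAvailability: number; // 0..1
}

// Minimal, hypothetical characterization of the update to be executed.
interface UpdateRequest {
  needsAtomicity: boolean;   // e.g. a multi-operation update
  isComplexPattern: boolean; // e.g. DELETE/INSERT with a WHERE clause
}

// Score each candidate on how well its features match the update and pick
// the best one; a naive ranking, not a full cost model.
function selectInterface(
  candidates: CandidateInterface[],
  update: UpdateRequest,
): CandidateInterface | undefined {
  const score = (c: CandidateInterface): number =>
    (update.needsAtomicity && c.supportsTransactions ? 2 : 0) +
    (update.isComplexPattern && c.supportsSparqlUpdate ? 2 : 0) +
    c.observedAvailability;
  return [...candidates].sort((a, b) => score(b) - score(a))[0];
}
```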
After interpreting the interface descriptions of a data store and choosing the interfaces to work with, the scheduler dictates which operations will be executed in what order. The schedule should be optimized based on the (non-)functional features of the interfaces, as well as using traditional optimize-then-execute techniques. For example, many insert operations can be grouped together to reduce the required number of requests. Instead of creating and optimizing one big group, it might also be worthwhile to create multiple groups, so as to group operations that can be optimized on certain interfaces. The outcome of this task is at least one such scheduling algorithm that can handle updates across multiple interfaces in a single data store.
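As a sketch of this grouping idea (TypeScript; the operation shape is a simplifying assumption), consecutive operations of the same kind that target the same interface can be batched into a single request, while other operations form their own groups:

```typescript
// Hypothetical atomic operations produced by decomposing a SPARQL update.
interface UpdateOperation {
  kind: "insert" | "delete";
  triple: string;         // affected triple, serialized for brevity
  targetEndpoint: string; // interface chosen for this operation
}

// Group consecutive operations that share a kind and a target interface, so
// each group can be sent as one request (e.g. one INSERT DATA with many triples).
function groupOperations(ops: UpdateOperation[]): UpdateOperation[][] {
  const groups: UpdateOperation[][] = [];
  for (const op of ops) {
    const last = groups[groups.length - 1];
    if (last && last[0].kind === op.kind && last[0].targetEndpoint === op.targetEndpoint) {
      last.push(op);
    } else {
      groups.push([op]);
    }
  }
  return groups;
}
```

Whether such grouping pays off depends on the interface: a document-based interface may prefer per-document groups, whereas a SPARQL endpoint benefits most from a single batched request.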
Scheduling across heterogeneous interfaces
Update queries across decentralized environments span multiple data stores, each of which could be exposed through multiple interfaces. If the creation of data across data stores has a clear order, a scheduler should be able to recognize it. For example, creating a calendar event and inviting others involves first creating the agenda item in my data store and later sending invitations to the data stores of my invitees.
In decentralized data ecosystems, a single update query may cover different data stores. For instance, an event may refer only to a calendar item in one context, and to both the calendar item and the invitations in another. Updating the event in the latter context requires a scheduler to update resources across multiple data stores. The outcome of this task should serve as an answer to RQ VII.
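A minimal sketch of such cross-store ordering (TypeScript; the store URLs, payloads, and the strictly sequential execution strategy are assumptions for illustration) could track, per write, which stores must have been updated first:

```typescript
// Hypothetical per-store write produced by decomposing one update query.
interface StoreWrite {
  store: string;        // data store that must be updated
  payload: string;      // e.g. a SPARQL update or document body
  dependsOn?: string[]; // stores whose writes must succeed first
}

// Create the event in my own store first, then fan out the invitations.
const plan: StoreWrite[] = [
  { store: "https://me.example/pod/", payload: "create calendar event" },
  { store: "https://alice.example/pod/", payload: "create invitation", dependsOn: ["https://me.example/pod/"] },
  { store: "https://bob.example/pod/", payload: "create invitation", dependsOn: ["https://me.example/pod/"] },
];

// Execute writes in plan order, checking that dependencies already completed;
// a naive sequential pass rather than a real dependency-aware scheduler.
async function execute(
  plan: StoreWrite[],
  send: (write: StoreWrite) => Promise<void>,
): Promise<void> {
  const done = new Set<string>();
  for (const write of plan) {
    if (!(write.dependsOn ?? []).every((store) => done.has(store))) {
      throw new Error(`Dependencies not satisfied for ${write.store}`);
    }
    await send(write);
    done.add(write.store);
  }
}
```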
Although decentralized data ecosystems lack a shared mission across nodes, some nodes may support cross-data-store transactions as part of their interface choice. In the remainder of this PhD, I aim to provide an answer to RQ VIII: how can we support the atomicity property of ACID in a decentralized data ecosystem? Given the previous example, the atomicity property would assert that an update over the calendar item and invitations only succeeds if both the calendar item and all invitations accept the update. Additionally, support for the atomicity property would be foundational to future research.
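One candidate direction, sketched below in TypeScript, is a prepare-then-commit style interaction with those stores that advertise transaction support; the interface and the all-or-nothing voting are assumptions about a possible mechanism, not a settled design:

```typescript
// Hypothetical transactional interface a participating store could expose.
interface TransactionalStore {
  prepare(payload: string): Promise<boolean>; // stage the update, vote yes/no
  commit(): Promise<void>;                    // make the staged update visible
  abort(): Promise<void>;                     // discard the staged update
}

// Atomically apply one payload per store: commit only if every store voted yes,
// otherwise abort everywhere (e.g. the calendar item and all invitations).
async function atomicUpdate(
  stores: TransactionalStore[],
  payloads: string[],
): Promise<boolean> {
  const votes = await Promise.all(stores.map((store, i) => store.prepare(payloads[i])));
  if (votes.every(Boolean)) {
    await Promise.all(stores.map((store) => store.commit()));
    return true;
  }
  await Promise.all(stores.map((store) => store.abort()));
  return false;
}
```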
Evaluation Plan
While various distributed DBMS benchmarks exist [33], they are built for the SQL query language and assume homogeneous interfaces over relational databases. The SolidBench [24] benchmark for SPARQL queries over decentralized environments only focuses on read queries over immutable data stores. SolidBench groups data into a single data store per person, exposing their social media data, like posts and comments. Since this domain model allows for modifications and creations, I will extend SolidBench [24] to include update queries over data stores that are designed to allow updating, while being heterogeneous in terms of the data contained within data stores and the update capabilities across data stores. This benchmark will include the following evaluation metrics: 1. update execution time, 2. number of HTTP requests, 3. number and scope of transaction locks, 4. recovery time of a transaction rewind, 5. duration of inconsistent states, and 6. robustness against random server failures. Such server failures could be caused by the server rejecting access while our client assumes access is granted, or by the server becoming unreachable for a long period of time.
Preliminary Results
In prior work [34], I showed that update abstraction over Solid pods using LDP is feasible and can be implemented with limited overhead. I enabled automated agents to discover where new RDF-resources need to be created, where existing RDF-resources reside, and how to update them in a Solid pod. For this work, I interpreted Solid pods as a dataset exposed by various HTTP-interfaces, while they are typically interpreted as a collection of HTTP-resources containing RDF-data. The outcome was the Storage Guidance Framework (SGF) and the Storage Guidance Vocabulary (SGV). SGV generically describes the relation between RDF-resources in Solid pods and the HTTP interfaces exposing them. I then described the structure of Solid pods using this vocabulary and designed the algorithms for SGF, which describe how to consume and act according to the vocabulary. Lastly, I implemented a query engine using SGF. As a result of SGF, the engine could interpret a Solid pod as a collection of RDF data instead of a collection of documents containing RDF data, thereby eliminating the access path data dependency.
To measure the performance impact, I set up SolidBench-based Solid pods with different fragmentation strategies. One strategy created a new HTTP-resource for each post based on the post’s ID field. As such, modifying a post’s ID required the post to move to a different HTTP-resource. Fig. 1 shows the execution time for different operations, comparing SGF-based and oracle-based execution. In Fig. 1, operations O3 and O4 result in invalid states under the oracle implementation, as it does not detect required moves or invalid states. The original work concluded that the overhead of this abstraction is limited to four times the execution time, and twice the number of HTTP requests, compared to an oracle-based approach that knows which HTTP-resource should be updated and performs the update directly.
| Operation | Average Time SGF (ms) | Average Time Oracle (ms) |
| --- | --- | --- |
| Insert complete post (O1) | 91.851,395 | 76.672,217 |
| Update post; no move (O2) | 129.497,887 | 65.628,874 |
| Update post; and move (O3) | 177.991,908 | 80.729,940 |
| Update post; not allowed (O4) | 79.745,654 | 63.785,574 |
| Delete post (O5) | 130.769,232 | 69.931,699 |
Fig. 1: Average execution time (in ms) of various operations on a Solid pod with fragmented posts for SGF-based and oracle-based query execution over 100 runs.
This work provided insights into the challenges of abstracting updates over heterogeneous data sources, and the potential benefits of doing so. It will serve as a foundation for my PhD research, which will extend this work to more complex data ecosystems.
Conclusion
By the end of my PhD, I aim to have studied and developed query algorithms and techniques that allow an abstraction layer to update data across decentralized, heterogeneous, self-governed data stores. These advancements will allow data-consumers, like developers and data analysts, to interact with decentralized data ecosystems, while being shielded from the internal details of the underlying system. The increased ease of use will accelerate the adoption of decentralized data ecosystems.
Acknowledgements. Jitse De Smet is a predoctoral fellow of the Research Foundation – Flanders (FWO) (1SB8525N). Ph.D. advisors: dr. Ruben Taelman, and prof. dr. Ruben Verborgh.