Designing a Swift library with data-race safety
I cut an initial release (0.1.0-alpha) of the library automerge-repo-swift. A supplemental library to Automerge swift, it adds background networking for sync and storage capabilities. The library extends code I initially created in the Automerge demo app (MeetingNotes), and was common enough to warrant its own library. While I was extracting those pieces, I leaned into the same general pattern that was used in the Javascript library automerge-repo. That library provides largely the same functionality for Automerge in javascript. I borrowed the public API structure, as well as compatibility and implementation details for the Automerge sync protocol. One of my goals while assembling this new library was to build it fully compliant with Swift’s data-race safety. Meaning that it compiles without warnings when I use the Swift compiler’s strict-concurrency mode.
There were some notable challenges in coming up to speed with the concepts of isolation and sendability. In addition to learning the concepts, how to apply them is an open question. Not many Swift developers have embraced strict concurrency and talked about the trade-offs or implications for choices. Because of that, I feel that there’s relatively little available knowledge to understand the trade-offs to make when you protect mutable state. This post shares some of the stumbling blocks I hit, choices I made, and lessons I’ve learned. My hope is that it helps other developers facing a similar challenge.
Framing the problem
The way I try to learn and apply new knowledge to solve these kinds of “new fangled” problems is first working out how to think about the problem. I’ve not come up with a good way to ask other people how to do that. I think when I frame the problem with good first-principles in mind, trade-offs in solutions become easier to understand. Sometimes the answers are even self-obvious.
The foremost principle in strict-concurrency is “protect your mutable state”. The compiler warnings give you feedback about potential hazards and data-races. In Swift, protecting the state uses a concept of an “isolation domain”. My layman’s take on isolation is “How can the compiler verify that only one thread is accessing this bit of data at a time”. There are some places where the compiler infers the state of isolation, and some of them still changing as we progress towards Swift 6. When you’re writing code, the compiler knows what is isolated (and non-isolated) – either by itself or based on what you annotated. When the compiler infers an isolation domain, that detail is not (yet?) easily exposed to developers. It really only shows up when there’s a mismatch in your assumptions vs. what the compiler thinks and it issues a strict-concurrency warning.
Sendability is the second key concept. In my layman’s terms again, something that is sendable is safe to cross over thread boundaries. With Swift 5.10, the compiler has enough knowledge of types to be able to make guarantees about what is safe, and what isn’t.
The first thing I did was lean heavily into making anything and everything Sendable. In hindsight, that was a bit of a mistake. Not disastrous, but I made a lot more work for myself. Not everything needs to be sendable. Taking advantage of isolation, it is fine – sometimes notably more efficient and easier to reason about – to have and use non-sendable types within an isolation domain. More on that in a bit.
My key to framing up the problem was to think in terms of making explicit choices about what data should be in an isolation region along with how I want to pass information from one isolation domain to another. Any types I pass (generally) need to be Sendable, and anything that stays within an isolation domain doesn’t. For this library, I have a lot of mutable state: networking connections, updates from users, and a state machine coordinating it all. All of it is needed so a repository can store and synchronize Automerge documents. Automerge documents themselves are Sendable (I had that in place well before starting this work). I made the Automerge documents sendable by wrapping access and updates to anything mutable within a serial dispatch queue. (This was also needed because the core Automerge library – a Rust library accessed through FFI – was not safe for multi-threaded use).
Choosing Isolation
I knew I wanted to make at least one explicit isolation domain, so the first question was “Actor or isolated class?” Honestly, I’m still not sure I understand all the tradeoffs. Without knowing what the effect would be to start off with, I decided to pick “let’s use actors everywhere” and see how it goes. Some of the method calls in the design of the Automerge repository were easily and obviously async, so that seemed like a good first cut. I made the top-level repo an actor, and then I kept making any internal type that had mutable state also be it’s own actor. That included a storage subsystem and a network subsystem, both of which I built to let someone else provide the network or storage provider external to this project. To support external plugins that work with this library, I created protocols for the storage and network provider, as well as one that the network providers use to talk back to the repository.
The downside of that choice was two-fold – first setting things up, then interacting with it from within a SwiftUI app. Because I made every-darn-thing an actor, I hade to await a response, which meant a lot of potential suspension points in my code. That also propagated to imply even setup needed to be done within an async context. Sometimes that’s easy to arrange, but other times it ends up being a complete pain in the butt. More specifically, quite a few of the current Apple-provided frameworks don’t have or provide a clear path to integrate async setup hooks. The server-side Swift world has a lovely “set up and run” mechanism (swift-service-lifecycle) it is adopting, but Apple hasn’t provided a similar concept the frameworks it provides. The one that bites me most frequently is the SwiftUI app and document-based app lifecycle, which are all synchronous.
Initialization Challenges
Making the individual actors – Repo and the two network providers I created – initializable with synchronous calls wasn’t too bad. The stumbling block I hit (that I still don’t have a great solution to) was when I wanted to add and activate the network providers to a repository. To arrange that, I’m currently using a detached Task that I kick off in the SwiftUI App’s initializer:
public let repo = Repo(sharePolicy: .agreeable)
public let websocket = WebSocketProvider()
public let peerToPeer = PeerToPeerProvider(
    PeerToPeerProviderConfiguration(
        passcode: "AutomergeMeetingNotes",
        reconnectOnError: true,
        autoconnect: false
    )
)
@main
struct MeetingNotesApp: App {
    var body: some Scene {
        DocumentGroup {
            MeetingNotesDocument()
        } editor: { file in
            MeetingNotesDocumentView(document: file.document)
        }
        .commands {
            CommandGroup(replacing: CommandGroupPlacement.toolbar) {
            }
        }
    }
    init() {
        Task {
            await repo.addNetworkAdapter(adapter: websocket)
            await repo.addNetworkAdapter(adapter: peerToPeer)
        }
    }
}Swift Async Algorithms
One of the lessons I’ve learned is that if you find yourself stashing a number of actors into an array, and you’re used to interacting with them using functional methods (filter, compactMap, etc), you need to deal with the asynchronous access. The standard library built-in functional methods are all synchronous. Because of that, you can only access non-isolated properties on the actors. For me, that meant working with non-mutable state that I set up during actor initialization.
The second path (and I went there) was to take on a dependency to swift-async-algorithms, and use its async variations of the functional methods. They let you “await” results for anything that needs to cross isolation boundaries. And because it took me an embarrasingly long time to figure it out: If you have an array of actors, the way to get to an AsyncSequence of them is to use the async property on the array after you’ve imported swift-async-algorithms. For example, something like the following snippet:
let arrayOfActors: [YourActorType] = []
let filteredResults = arrayOfActors.async.filter(...)Rethinking the isolation choice
That is my first version of this library. I got it functional, then turned around and tore it apart again. In making everything an actor, I was making LOTS of little isolation regions that the code had to hop between. With all the suspension points, that meant a lot of possible re-ordering of what was running. I had to be extrodinarily careful not to assume a copy of some state I’d nabbed earlier was still the same after the await. (I still have to be, but it was a more prominent issue with lots of actors.) All of this boils down to being aware of actor re-entrancy, and when it might invalidate something.
I knew that I wanted at least one isolation region (the repository). I also want to keep mutable state in separate types to preserve an isolation of duties. One particular class highlighted my problems – a wrapper around NWConnection that tracks additional state with it and handles the Automerge sync protocol. It was getting really darned inconvenient with the large number of await suspension points.
I slowly clued in that it would be a lot easier if that were all synchronous – and there was no reason it couldn’t be. In my ideal world, I’d have the type Repo (my top-level repository) as an non-global actor, and isolate any classes it used to the same isolation zone as that one, non-global, actor. I think that’s a capability that’s coming, or at least I wasn’t sure how to arrange that today with Swift 5.10. Instead I opted to make a single global actor for the library and switch what I previously set up as actors to classes isolated to that global actor.
That let me simplify quite a bit, notably when dealing with the state of connections within a network adapter. What surprised me was that when I switched from Actor to isolated class, there were few warnings from the change. The changes were mostly warnings that calls dropped back to synchronous, and no longer needed await. That was quick to fix up; the change to isolated classes was much faster and easier than I anticipated. After I made the initial changes, I went through the various initializers and associated configuration calls to make more of it explicitly synchronous. The end result was more code that could be set up (initialized) without an async context. And finally, I updated how I handled the networking so that as I needed to track state, I didn’t absolutely have to use the async algorithsm library.
A single global actor?
A bit of a side note: I thought about making Repo a global actor, but I prefer to not demand a singleton style library for it’s usage. That choice made it much easier to host multiple repositories when it came time to run functional tests with a mock In-Memory network, or integration tests with the actual providers. I’m still a slight bit concerned that I might be adding to a long-term potential proliferation of global actors from libraries – but it seems like the best solution at the moment. I’d love it if I could do something that indicated “All these things need a single isolation domain, and you – developer – are responsible for providing one that fits your needs”. I’m not sure that kind of concept is even on the table for future work.
Recipes for solving these problems
If you weren’t already aware of it, Matt Massicotte created a GitHub repository called ConcurrencyRecipes. This is a gemstone of knowledge, hints, and possible solutions. I leaned into it again and again while building (and rebuilding) this library. One of the “convert it to async” challenges I encountered was providing an async interface to my own peer-to-peer network protocol. I built the protocol using the Network framework based (partially on Apple’s sample code), which is all synchronous code and callbacks. A high level, I wanted it to act similarly URLSessionWebSocketTask. This gist being a connection has an async send() and an async receive() for sending and receiving messages on the connection. With an async send and receive, you can readily assemble several different patterns of access.
To get there, I used a combination of CheckedContinuation (both the throwing and non-throwing variations) to work with what NWConnection provided. I wish that was better documented. How to properly use those APIs is opaque, but that is a digression for another time. I’m particular happy with how my code worked out, including adding a method on the PeerConnection class that used structured concurrency to handle a timeout mechanism.
Racing tasks with structured concurrency
One of the harder warnings for me to understand was related to racing concurrent tasks in order to create an async method with a “timeout”. I stashed a pattern for how to do this in my notebook with references to Beyond the basics of structured concurrency from WWDC23.
If the async task returns a value, you can set it up something like this (this is from PeerToPeerConnection.swift):
let msg = try await withThrowingTaskGroup(of: SyncV1Msg.self) { group in
    group.addTask {
        // retrieve the next message
        try await self.receiveSingleMessage()
    }
    group.addTask {
        // Race against the receive call with a continuous timer
        try await Task.sleep(for: explicitTimeout)
        throw SyncV1Msg.Errors.Timeout()
    }
    guard let msg = try await group.next() else {
        throw CancellationError()
    }
    // cancel all ongoing tasks (the websocket receive request, in this case)
    group.cancelAll()
    return msg
}There’s a niftier version available in Swift 5.9 (which I didn’t use) for when you don’t care about the return value:
func run() async throws {
    try await withThrowingDiscardingTaskGroup { group in
        for cook in staff.keys {
            group.addTask { try await cook.handleShift() }
        }
        group.addTask { // keep the restaurant going until closing time
            try await Task.sleep(for: shiftDuration)
            throw TimeToCloseError()
        }
    }
}With Swift 5.10 compiler, my direct use of this displayed a warning:
warning: passing argument of non-sendable type 'inout ThrowingTaskGroup<SyncV1Msg, any Error>' outside of global actor 'AutomergeRepo'-isolated context may introduce data races
guard let msg = try await group.next() else {
                          ^I didn’t really understand the core of this warning, so I asked on the Swift forums. VNS (on the forums) had run into the same issue and helped explain it:
It’s because
withTaskGroupaccepts a non-Sendableclosure, which means the closure has to be isolated to whatever context it was formed in. If yourtest()function isnonisolated, it means the closure isnonisolated, so callinggroup.waitForAll()doesn’t cross an isolation boundary.
The workaround to handle the combination of non-sendable closures and TaskGroup is to make the async method that runs this code nonisolated. In the context I was using it, the class that contains this method is isolated to a global actor, so it’s inheriting that context. By switching the method to be explicitly non-isolated, the compiler doesn’t complain about group being isolated to that global actor.
Sharing information back to SwiftUI
These components have all sorts of interesting internal state, some of which I wanted to export. For example, to provide information from the network providers to make a user interface (in SwiftUI). I want to be able to choose to connect to endpoints, to share what endpoints might be available (from the NWBrowser embedded in the peer to peer network provider), and so forth.
I first tried to lean into AsyncStreams. While they make a great local queue for a single point to point connection, I found they were far less useful to generally make a firehouse of data that SwiftUI knows how to read and react to. While I tried to use all the latest techniques, to handle this part I went to my old friend Combine. Some people are effusing that Combine is dead and dying – but boy it works. And most delightfully, you can have any number of endpoints pick up and subscribe to a shared publisher, which was perfect for my use case. Top that off with SwiftUI having great support to receive streams of data from Combine, and it was an easy choice.
I ended up using Combine publishers to make a a few feeds of data from the PeerToPeerProvider. They share information about what other peers were available, the current state of the listener (that accepts connections) and the browser (that looks for peers), and last a publisher that provides information about active peer to peer connctions. I feel that worked out extremely well. It worked so well that I made an internal publisher (not exposed via the public API) for tests to get events and state updates from within a repository.
Integration Testing
It’s remarkably hard to usefully unit test network providers. Instead of unit testing, I made a separate Swift project for the purposes of running integration tests. It sits in it’s own  directory in the git repository and references automerge-repo-swift as a local dependency. A side effect is that it let me add in all sorts of wacky dependencies that were handy for the integration testing, but that I really didn’t want exposed and transitive for the main package. I wish that Swift Packages had a means to identify test-only dependencies that didn’t propagate to other packages for situations like this. Ah well, my solution was a separate sub-project.
Testing using the Combine publisher worked well. Although it took a little digging to figure out the correct way to set up and use expectations with async XCTests. It feels a bit exhausting to assemble the expectations and fulfillment calls, but its quite possible to get working. If you want to see this in operation, take a look at P2P+explicitConnect.swift. I started to look at potentially using the upcoming swift-testing, but with limited Swift 5.10 support, I decided to hold off for now. If it makes asynchronous testing easier down the road, I may well adopt it quickly after it’s initial release.
The one quirky place that I ran into with that API setup was that expectation.fulfill() gets cranky with you if you call it more than once. My publisher wasn’t quite so constrained with state updates, so I ended up cobbling a boolean latch variable in a sink when I didn’t have a sufficiently constrained closure.
The other quirk in integration testing is that while it works beautifully on a local machine, I had a trouble getting it to work in CI (using GitHub Actions). Part of the issue is that the current swift test defaults to running all possible tests at once, in parallel. Especially for integration testing of peer to peer networking, that meant a lot of network listeners, and browsers, getting shoved together at once on the local network. I wrote a script to list out the tests and run them one at a time. Even breaking it down like that didn’t consistently get through CI. I also tried higher wait times (120 seconds) on the expectations. When I run them locally, most of those tests take about 5 seconds each.
The test that was a real challenge was the cross-platform one. Automerge-repo has a sample sync server (NodeJS, using Automerge through WASM). I created a docker container for it, and my cross-platform integration test pushes and pulls documents to an instance that I can run in Docker. Well… Docker isn’t available for macOS runners, so that’s out for GitHub Actions. I have a script that spins up a local docker instance, and I added a check into the WebSocket network provider test – if it couldn’t find a local instance to work against, it skips the test.
Final Takeaways
Starting with a plan for isolating state made the choices of how and what I used a bit easier, and reaching for global-actor constrained classes made synchronous use of those classes much easier. For me, this mostly played out in better (synchronous) intializers and dealing with collections using functional programming patterns.
I hope there’s some planning/thinking in SwiftUI to update or extend the app structure to accomodate async hooks for things like setup and initialization (FB9221398). That should make it easier for a developer to run an async initializer and verify that it didn’t fail, before continuing into the normal app lifecycle. Likewise, I hope that the Document-based APIs gain an async-context to work with documents to likewise handle asynchronous tasks (FB12243722). Both of these spots are very awkward places for me.
Once you shift to using asynchronous calls, it can have a ripple effect in your code. If you’re looking at converting existing code, start at the “top” and work down. That helped me to make sure there weren’t secondary complications with that choice (such as a a need for an async initializer).
Better yet, step back and take the time to identify where mutable state exists. Group it together as best you can, and review how you’re interacting it, and in what isolation region. In the case of things that need to be available to SwiftUI, you can likely isolate methods appropriately (*cough* MainActor *cough*). Then make the parts you need to pass between isolation domains Sendable. Recognize that in some cases, it may be fine to do the equivalent of “Here was the state at some recent moment, if you might want to react to that”. There are several places where I pass back a summary snapshot of mutable state to SwiftUI to use in UI elements.
And do yourself a favor and keep Matt’s Concurrency Recipes on speed-dial.
Before I finished this post, I listened to episode 43 of the Swift Package Index podcast. It’s a great episode, with Holly Bora, compiler geek and manager of the Swift language team, on as a guest to talk about the Swift 6. A tidbit she shared was that they are creating a Swift 6 migration guide, to be published on the swift.org website. Something to look forward to, in addition to Matt’s collection of recipes!