Distributed locks / semaphores, again

etki · Published in Ayte Labs · Aug 9, 2020 · 10 min read

Today I want to talk about managing distributed locks and semaphores. I've already written a small article on that, but now I want to go deeper and stay more abstract rather than focus on a particular implementation.

Locks and semaphores are concurrency primitives that limit simultaneous access to a resource to 1..N participants (processes): a classic lock may be held by a single participant only; a semaphore allows a configurable number N of participants to hold it at once, letting them work in parallel; finally, a read-write lock ensures a resource is either accessible for reading by many participants or for writing by a single one. All those primitives are usually implemented either at the OS level or in the standard library of a specific language, so the developer doesn't have to care about things like cleanup should a process die. In the distributed scenario, however, all the fault-tolerance-related work lands on the engineer's shoulders. I will continue with plain locks, since they are the easiest to explain, but in general all three primitives can be built on top of semaphores.
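
To make the terminology concrete, here is what these primitives look like inside a single process, using Java's standard java.util.concurrent classes; the runtime handles cleanup for us, which is exactly the luxury we lose in the distributed case:

import java.util.concurrent.Semaphore;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LocalPrimitives {
    public static void main(String[] args) throws InterruptedException {
        // Classic lock: at most one thread inside the critical section.
        ReentrantLock lock = new ReentrantLock();
        lock.lock();
        try {
            // exclusive work
        } finally {
            lock.unlock();
        }

        // Semaphore: at most N (here 3) threads inside at the same time.
        Semaphore semaphore = new Semaphore(3);
        semaphore.acquire();
        try {
            // up to 3 participants work here in parallel
        } finally {
            semaphore.release();
        }

        // Read-write lock: many readers or a single writer, never both.
        ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
        rwLock.readLock().lock();
        try {
            // shared read access
        } finally {
            rwLock.readLock().unlock();
        }
    }
}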

Primitives by themselves are not very useful; they usually exist to guard access to some resource(s). A nice example of such a resource, and of the need for a lock, is a classic bank transaction. Let's imagine that account X in bank A has to transfer some funds to account Y in bank B. Those banks take their transactions very seriously, so they want guarantees that a) under no condition will a transaction happen twice and b) the transaction will succeed in a reasonable time. In that case bank A, which runs several instances of each application, needs some notion of exclusive access so that only one of the instances performs this transaction, and if that instance fails for any reason, the next one should be able to pick up the required work and complete it: a distributed lock helps maintain exactly that kind of exclusive access. This small example also hides the two generals problem in bank B confirming the transaction, but that is out of the scope of this article, and some common approaches can address it (the TCP way, or timeboxing, where connectivity failures longer than some huge time T are considered an unavoidable source of data loss).

Problems & implementation

Usually all those primitives are implemented on top of some abstract storage that supports a compare-and-swap (CAS) operation. It is also possible to implement that very storage directly in the application instances, but that's one of the best ways to bust your kneecaps, since it usually takes several years to find all the bugs in such an implementation, and they are extremely hard to reproduce and understand (usually you just see that the lock was obtained twice, but what led to this?).
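
To give that idea some shape, here is a minimal sketch of what such a CAS-capable storage could look like from the application's point of view; the interface and the names are mine, for illustration only, not any particular product's API:

// A hypothetical record describing who holds the lock and until when.
record LockRecord(String owner, long leaseUntilEpochMillis) {}

// Minimal contract a lock implementation needs from its storage:
// read the current value and atomically replace it only if it still
// equals the value we read (compare-and-swap).
interface CasStorage {
    LockRecord get(String key);

    // Returns true only if the stored value still equals `expected`
    // (null meaning "no record") and was replaced with `replacement`.
    boolean compareAndSwap(String key, LockRecord expected, LockRecord replacement);
}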

The simplest implementation is just putting an instance identifier into some register, for example:

INSERT INTO distributed_locks (key, owner) VALUES ('account/X', 'server-1');

Obviously, after using a concurrency primitive the instance would release it:

DELETE FROM distributed_locks WHERE key = 'account/X' AND owner = 'server-1';

Or, in CAS terms:

server-1: read account/X
storage: null
server-2: read account/X
storage: null
server-1: account/X, replace null with {owner=server-1}
storage: succeeded
server-2: account/X, replace null with {owner=server-2}
storage: failed
server-1: account/X, replace {owner=server-1} with null
storage: succeeded
server-2: read account/X
storage: null
server-2: account/X, replace null with {owner=server-2}
storage: succeeded

This is where the complexity leaves its dark caves. The first problem is fault tolerance: should a node crash, all its primitives stay acquired until a new instance with the same id spins up, acquires the same primitives and releases them. In the worst case, if one uses a relational-like database as above with similar queries, even spinning up a new instance wouldn't solve the problem, because the INSERT query would always fail. I know only one solution to this: timebox primitive acquisition, just like participation in a cluster. In that case there is some safety period during which the owner is considered to be alive, and if a primitive stays acquired after that period, it is considered stale and becomes available for acquisition by other participants; the owner is allowed to periodically send updates (heartbeats) to prolong its acquisition:

DELETE FROM distributed_locks WHERE key = 'account/X' AND lease < NOW();
INSERT INTO distributed_locks (key, owner, lease) VALUES
('account/X', 'server-1', NOW() + 1min);
-- 10 seconds later
UPDATE distributed_locks SET lease = NOW() + 1min
WHERE key = 'account/X' AND owner = 'server-1';

CAS version:

server-1: read account/X
storage: null
server-2: read account/X
storage: null
server-1: update account/X, replace null with {owner=server-1, lease=1:01PM}
storage: succeeded
server-2: update account/X, replace null with {owner=server-2, lease=1:01PM}
storage: failed
server-2: <sleeps and periodically reads account/X>
server-1: update account/X, replace {owner=server-1, lease=1:01PM} with {owner=server-1, lease=1:02PM}
storage: succeeded
server-1: crashes
server-2: read account/X
storage: {owner=server-1, lease=1:02PM}
server-2, at 1:03PM: update account/X, replace {owner=server-1, lease=1:02PM} with {owner=server-2, lease=1:04PM}
storage: succeeded

In this way instances are guaranteed not to put the system into a deadlock, but there is a second problem, which makes the shit really fucked: time. We're using pseudo-SQL here, implying that the system is backed by a distributed store, which doesn't have a notion of atomic time the way a classic SQL implementation would. All those timestamps have to be generated by one of the nodes, and nobody guarantees there is no skew between their perceptions of the timeline:

node a: i'm acquiring this lock for 1 minute, okay?
node a: starts doing its work
node b, with a clock skewed two minutes ahead: hey, that lock expired a minute ago, i'm going to reacquire it
node b: does the very same work, even though the whole concept of locks was introduced precisely to prevent this possibility of non-exclusive access to the resource

Or, with a very simple example featuring those poor general fellas:

Generals of city A: there's an enemy army coming, which will attack one of our cities. One of us should stop them, while the other one goes around from behind, so we will surround them. It is crucial that we don't both go around, since otherwise we would just chase them while they go from one city to another.
General B: OK. You'll hold the fort while I go around. I'll leave my position at 1PM and you will see my army in the plains at around 1:30PM. If you don't see my army by 2PM, then something prevented me from leaving position and you take the lead.
Enemy spy: <slips into general A's camp, advances all clocks by one hour>
Next day, 1PM: both generals start at the same time, both trying to go around the enemy and leaving the road unguarded.

This is engineering suffering at its best, the thing that we hate and love in our field of work. I'll jump ahead and say that there's no good solution to this, see the FLP theorem (if I'm not mistaken): you simply have to set some time bounds on protocol steps, otherwise you may never get a result. In such a situation one has to go with a tradeoff, some reasonable threshold after which the system is considered to be clusterfucked, and explain the limitations of our universe to those who need these processes built. Usually that tradeoff is setting an expiration time for a primitive that is significantly higher than any plausible time skew (and if that skew is ever reached, it is more likely that the system is already completely fucked by some other disruption).
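
To make that tradeoff concrete, here is one hedged way to encode it on the holder's side: assume an upper bound on clock skew and treat your own lease as expired that much earlier than everyone else would. The constant, the names and the numbers below are mine, purely illustrative:

import java.time.Duration;
import java.time.Instant;

class LeaseClock {
    // Assumed upper bound on clock skew between participants; a guess that
    // has to be revisited for every deployment.
    static final Duration MAX_ASSUMED_SKEW = Duration.ofSeconds(30);

    // The holder considers its own lease expired earlier than everyone else
    // would, so even a skewed observer should not see two holders at once.
    static boolean safeToKeepWorking(Instant leaseExpiresAt, Instant now) {
        return now.plus(MAX_ASSUMED_SKEW).isBefore(leaseExpiresAt);
    }
}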

But wait, it's not only the OS clock. The timing problem is wider than you can imagine. Consider this:

Application: Hey database, please store this record for me. It says that i'm eligible to do that work for a minute.
Database: acknowledged, record written. You're good to go.
Application: kewl, i have the lock now.

At this point this may only cause some nervous laughter, but the idea that the application in fact holds a lock is a complete lie. The application only knows that some time ago it sent a message, that the message took some time to process, and that after some third time interval an acknowledgement was received. That may have taken single-digit milliseconds. It could have taken minutes. It could have taken months, if a buggy database was restarted and read its dusty WALs it had forgotten about for some reason.

You may think at this point that it would be safe to do this:

Application: Here's the record.
Database: Here's the acknowledgement.
Application: now i have (record.lease - clock.current) of time to do my work.

But no, this is a complete lie as well. I'm not even talking here about non-monotonic clocks and the time skew that may have crept in in between, which are reasons to worry as well. The real problem is that you can't guarantee when the next application instruction will be executed, especially if you're exploring the asynchronous world. Consider a standard JS event loop:

locks.acquire(key)
  .then(() => transaction.run());

The then() callback will be scheduled after the key acquisition, and there is no guarantee that scheduled turns into right now. This is the simplest example, which probably wouldn't hit such a limitation in most JS environments and would run one function right after the other, but real algorithms are usually much more complex, with several roundtrips and asynchronous calls here and there. And even if you're dealing with something really simple, there's always OS preemptive multitasking, which may halt your application for any amount of time at any point. Any memory allocation (and most of you work with languages that, if they don't manage memory themselves, at least don't give you 100% control over allocation) or any memory read may hit swap and slowly burn in the disk queue. Somebody may occupy a CPU core. If you're on a VM, somebody will steal your CPU time. It goes on and on, with more and more peculiar things that actually do happen.

The whole thing I'm trying to tell you here is that this scenario is simply inevitable. The only option one has here is to surrender.

That's it: you can't have reliable concurrency primitive guarantees in a distributed application. All you can do is a) rely on some safety bounds and consider any crossing of them an inevitable force majeure and b) architect your processes as a set of idempotent actions as much as possible, so that any process can pick the work up from any point, using locks just to minimize contention rather than to prevent non-exclusive access, and keep your fingers crossed for the non-idempotent actions, siding with either the at-least-once or the at-most-once model.

Implementation and precautions

Obviously you're still thinking that this situation either suits you or that there is a way to cheat your way around it. So, the implementation I came up with looks like this:

interface Step<I, T> {
    async T run(I input);
}

record Result<T> {
    boolean successful;
    T value;

    static <T> Result<T> successful(T value) {
        return new Result<>(true, value);
    }

    static <T> Result<T> failed() {
        return new Result<>(false, null);
    }
}

class Pipeline<T> {
    Lock lock;
    Task launcher;
    Task<Result<T>> tail;

    Pipeline(Lock lock) {
        this.lock = lock;
        this.launcher = new Task();
        this.tail = launcher.then(() -> new Result<>(true, null));
    }

    private Pipeline(Lock lock, Task launcher, Task<Result<T>> tail) {
        this.lock = lock;
        this.launcher = launcher;
        this.tail = tail;
    }

    // Append a step; it runs only if the previous step succeeded
    // and our lease is still considered valid.
    <V> Pipeline<V> extend(Step<T, V> step) {
        Task<Result<V>> next = tail.then(state -> {
            if (!state.successful || lock.hasExpired()) {
                return Result.failed();
            }
            return Result.successful(await step.run(state.value));
        });
        return new Pipeline<>(lock, launcher, next);
    }

    async Result<T> run() {
        // Acquire the lock, kick the chain off and wait for the last step.
        return await lock.exclusively(() -> {
            launcher.complete();
            return tail;
        });
    }
}

class Lock {
    private String id;
    private String fingerprint;

    async <T> Result<T> exclusively(TaskProducer<Result<T>> action) {
        var current = await storage.get(id);
        // Fail fast if somebody else holds a still-valid record.
        if (current != null && current.notAcquireable(fingerprint)) {
            return Result.failed();
        }
        if (!await storage.replace(current, createRecord())) {
            return Result.failed();
        }
        try {
            return await action.call();
        } finally {
            ensureReleased();
        }
    }

    ...
}

Somewhere there is a background thread that periodically calls lock.extendLease() (a sketch of such a heartbeat loop follows the list below). Here's what you must not forget about:

  • Check lock expiration at each operation.
  • If you find yourself having to check the lock inside every single operation, you probably need to split the work into finer-grained steps (and yes, this comes with a lot of code and boilerplate for tiny instructions).
  • DO NOT FORGET THAT FINALLY CLAUSE. Well, you may, if you enjoy manually issuing deletion queries against the database.
  • Log every outcome, since the logs will be the first place to show that something's off when a lock gets strangely reacquired by another node.
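
For completeness, here is a rough sketch of that heartbeat thread. ScheduledExecutorService is standard JDK; lock.extendLease(), lock.hasExpired() and the interval values are assumptions matching the pseudocode above, not a fixed API:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class LeaseHeartbeat {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    // Prolong the lease roughly every third of its duration, so a couple of
    // missed heartbeats still leave it valid.
    void keepAlive(Lock lock) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                if (!lock.hasExpired()) {
                    lock.extendLease();
                }
            } catch (Exception e) {
                // Log and keep trying: a single failed heartbeat is not fatal,
                // the next one may still land before the lease runs out.
            }
        }, 0, 20, TimeUnit.SECONDS);
    }

    void stop() {
        scheduler.shutdownNow();
    }
}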

But the code above still isn't reliable, and you have to pray for good timings in your datacenter(s).

Then what else to do?

As stated above, processes that can be split into idempotent steps should be split that way. In that case time simply doesn't matter anymore, because all the processes compete for is the next counter value, which gets incremented as soon as the next step completes, regardless of the clock; it also plays pretty well with my beloved event sourcing. The hyperbolic account example above could be refined this way:

  • Upon request, assign a unique ID to the transaction and store it in the transaction references table as pending.
  • In one operation, move the required amount from X's account balance to X's account hold, keyed by the ID of that transaction, and register the transaction ID in the pending transactions for that account.
  • Request bank B to confirm the transaction with that unique ID (implying that if it sees the ID for the first time, it either returns a negative response or makes the changes in its own records and follows up with a positive one, and subsequent calls return the very same response).
  • Set the confirmed status in the transaction references table.
  • In one operation, remove the transaction traces from the account's hold and pending transactions, and depending on the outcome either do nothing or replenish the account with the held amount.

In the real world this would take quite a few more steps (the transaction would be reflected not only in the account entity; it would certainly be saved in other places as well), but as you can see, all steps are idempotent and guarantee process progress. Any participant can look at this process at any point and help complete it without any problems.
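
As an illustration of why these steps can be retried safely, here is a hedged sketch of the second step, the move from balance to hold. The entity layout and the store interface are assumptions made for this example only; the important property is that the update is keyed by the transaction ID, so applying it twice changes nothing:

import java.util.HashSet;
import java.util.Set;

// Hypothetical account snapshot, just enough for the illustration.
record Account(String id, long balance, long hold, Set<String> pendingTransactions) {}

// Same compare-and-swap idea as before, this time over account snapshots.
interface AccountStore {
    Account get(String id);
    boolean compareAndSwap(String id, Account expected, Account updated);
}

class HoldStep {
    private final AccountStore accounts;

    HoldStep(AccountStore accounts) {
        this.accounts = accounts;
    }

    // Moves `amount` from balance to hold exactly once per transaction ID.
    // Re-running the step for a transaction that is already pending is a
    // no-op, which is what makes it safe for any participant to retry it.
    boolean moveToHold(String accountId, String transactionId, long amount) {
        Account current = accounts.get(accountId);
        if (current.pendingTransactions().contains(transactionId)) {
            return true; // already applied by us or by somebody else
        }
        Set<String> pending = new HashSet<>(current.pendingTransactions());
        pending.add(transactionId);
        Account updated = new Account(
                current.id(),
                current.balance() - amount,
                current.hold() + amount,
                pending
        );
        // Either this exact snapshot gets replaced, or another participant
        // progressed first and the caller simply re-reads and tries again.
        return accounts.compareAndSwap(accountId, current, updated);
    }
}

Whichever instance ends up re-running the step, it either sees the transaction already registered and does nothing, or loses the CAS to a faster peer; the clock never enters the picture.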

Conclusion

So, at this point I may have upset you about the state of things in the distributed world (to be honest, I hope so, a bit). However, the thing described above is just a solution with tradeoffs. Like any other mechanism that may shoot you, it eventually will, but depending on your particular situation losing one of your kneecaps may be okay. Unless you're dealing with very delicate things, you can usually tolerate some skew in the system, and, as we already know, your own bugs will do much more damage than such an implementation. Unless it's a bug in your own distributed lock implementation, ha-ha.
