Discussion:
Thread-safe smart pointers, hashtables without locking
Ray Whitmer
2006-05-03 22:20:44 UTC
Permalink
I found the boost libraries and templates as I was looking for thread-
safe smart pointers. I was unable to use shared_ptr because I needed
to be able to reconstitute a smart pointer from a raw pointer at any
time without losing the count; the count needs to be internal to the
object, not external, so I am using intrusive_ptr.

Since I need my intrusive_ptr to be thread-safe, I included:

#include "boost/detail/atomic_count.hpp"

and then I declared my internal count to be of type:

boost::detail::atomic_count

which I then increment and decrement using the defined thread-safe
operators from the add_ref and release methods, which should give me
a relatively-thread-safe smart pointer with an internal count without
having to do any actual locking.
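
A minimal sketch of the kind of counted base class this describes
(the names are mine; std::atomic from the later C++11 standard stands
in for boost::detail::atomic_count, which works the same way here):

```cpp
#include <atomic>
#include <cassert>

// Sketch of a counted base class with an internal, thread-safe count.
// Class and method names are illustrative, not from any library.
class RefCounted {
public:
    RefCounted() : count_(0) {}
    virtual ~RefCounted() {}

    void add_ref() { count_.fetch_add(1, std::memory_order_relaxed); }

    void release() {
        // acq_rel so the thread that performs the delete observes all
        // writes other threads made before their final release.
        if (count_.fetch_sub(1, std::memory_order_acq_rel) == 1)
            delete this;
    }

    long use_count() const { return count_.load(std::memory_order_relaxed); }

private:
    std::atomic<long> count_;
};

// Free functions boost::intrusive_ptr<T> would look up via ADL:
inline void intrusive_ptr_add_ref(RefCounted* p) { p->add_ref(); }
inline void intrusive_ptr_release(RefCounted* p) { p->release(); }
```

With these two hooks in place, boost::intrusive_ptr<RefCounted> can be
reconstituted from a raw pointer at any time without losing the count.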

This already seems like I may be going beyond the official interface
by explicitly relying on something in the boost detail directory. Is
it wrong to rely on this?

But what I need now goes a bit further. I need a hashtable for which
lookups are thread-safe without doing any locking. I only want to
lock for adding or removing to/from the table.

I already have a hashtable implementation that does this, which I
have used in Java for quite a while. In Java, I assumed the writing
of a 32-bit reference into the hashtable array (or a null) was fairly
atomic (which seemed reasonable on a 32-bit machine with normal
alignment, etc.). I made sure in the implementation that a writer who
locks and changes the hashtable never temporarily places the
hashtable into a state where it cannot be correctly read by any non-
locking reader.

I am hoping this will continue to work in C++ with the intrusive_ptr
array of a hashtable. intrusive_ptr seems to contain a single 32-bit
value which is likely to be atomic, at least as long as I avoid 16-
bit architectures. Is there any better way to guarantee this?
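
As an aside from a later vantage point: C++11's std::atomic<T*> is
the portable way to guarantee atomic loads and stores of a slot's
pointer. It was not available at the time of this thread, and the
names below are illustrative only:

```cpp
#include <atomic>
#include <cassert>

// Hypothetical hashtable entry; 'slot' stands in for one element of
// the table's pointer array.
struct Node { int key; int value; };

std::atomic<Node*> slot{nullptr};

// A writer publishes a fully constructed node; the release store
// keeps the node's fields from being reordered past the pointer store.
void publish(Node* n) { slot.store(n, std::memory_order_release); }

// A lock-free reader sees either null or a completely initialized node.
Node* lookup() { return slot.load(std::memory_order_acquire); }
```

On mainstream 32- and 64-bit targets these operations compile to plain
loads and stores plus the required ordering, with no locking.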

With the exception of the reference count, the objects being kept by
the hashtable are read-only, so this should allow me any number of
threads simultaneously operating on the same massive shared data
hashtables (which can get cleaned up when no longer in use, by
looking at ref counts).

I am just posting here to see if anyone knows of a missed trick or
bad assumption I have made here using these or other boost mechanisms.

Is the thread-safe hashtable that does not require locking for
reading something that many others would benefit from? It is quite a
simple implementation.

Thanks,

Ray Whitmer
j***@firstinvestors.com
2006-05-04 16:19:52 UTC
Permalink
Post by Ray Whitmer
which I then increment and decrement using the defined thread-safe
operators from the add_ref and release methods, which should give me
a relatively-thread-safe smart pointer with an internal count without
having to do any actual locking.
[snip]
Post by Ray Whitmer
I am just posting here to see if anyone knows of a missed trick or
bad assumption I have made here using these or other boost mechanisms.
Personally, I read this and said, "eeek!" (Well, not literally. My office mates would wonder what was up.) I think any design like this is going to be (at best) very non-portable, and (more likely) subject to breaking at the worst possible moment (i.e. when demonstrating to a customer). Is "relatively" thread-safe really good enough here?

As a constructive suggestion, if you want to reduce lock conflicts in hash tables, have you considered some form of finer-grained locking? You could, for example, have one lock per "slot" in your hash table. Computing the hash and finding the slot is thread-safe (since it doesn't depend on the state of the hash table), so you could then lock only the slot itself while performing list maintenance or whatever you need to do to read/write the data.
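
A rough sketch of the one-lock-per-slot idea, assuming a fixed bucket
count (all names here are mine, not from any library):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <mutex>
#include <utility>
#include <vector>

// One mutex per bucket: readers and writers only contend when they
// touch the same slot. The bucket count is fixed at construction.
template <class K, class V>
class SlotLockedMap {
public:
    explicit SlotLockedMap(std::size_t nbuckets) : buckets_(nbuckets) {}

    void insert(const K& k, const V& v) {
        Bucket& b = bucket_for(k);
        std::lock_guard<std::mutex> guard(b.m); // lock only this slot
        b.items.push_back(std::make_pair(k, v));
    }

    bool find(const K& k, V& out) {
        Bucket& b = bucket_for(k);
        std::lock_guard<std::mutex> guard(b.m);
        for (const std::pair<K, V>& p : b.items)
            if (p.first == k) { out = p.second; return true; }
        return false;
    }

private:
    struct Bucket {
        std::mutex m;
        std::list<std::pair<K, V> > items;
    };

    // Hashing touches no shared state, so it needs no lock.
    Bucket& bucket_for(const K& k) {
        return buckets_[std::hash<K>()(k) % buckets_.size()];
    }

    std::vector<Bucket> buckets_;
};
```

The fixed bucket count is what makes this safe: because the array is
never reallocated, finding the slot never races with a resize.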

This would be truly thread-safe but might perform better if you have lots of simultaneous readers/writers. I haven't tested this, but if you do, I'd be interested to hear the results.

Regards,
- James

PS - I'd be even more worried in Java, btw, since Java opcodes should map to CPU operations less well than a compiled binary. Not to mention that in Java, the portability assumption becomes really critical. Java classes are *supposed* to be portable.
Ray Whitmer
2006-05-05 15:48:30 UTC
Permalink
Post by j***@firstinvestors.com
Personally, I read this and said, "eeek!" (Well, not literally. My
office mates would wonder what was up.) I think any design like
this is going to be (at best) very non-portable, and (more likely)
subject to breaking at the worst possible moment (i.e. when
demonstrating to a customer). Is "relatively" thread-safe really
good enough here?
Let me make it clear: I do not want relatively thread-safe. I want
completely thread-safe.

It seems to me that boost already relies on an atomic counter for
shared_ptr. I do not think the solution would be less portable than
this. I certainly want it to run on Windows and Mac, 32 and 64 bit
as well as major Unix and Linux including certain 32-bit-or-more
embedded situations.

There may be (hopefully rare) C++ threading situations that are
difficult to recover from, and I will simply have to gather more
information on these and see whether they can be dealt with.
Post by j***@firstinvestors.com
As a constructive suggestion, if you want to reduce lock conflicts
in hash tables, have you considered some form of finer-grained
locking? You could, for example, have one lock per "slot" in your
hash table. Computing the hash and finding the slot is thread-safe
(since it doesn't depend on the state of the hash table), so you
could then lock only the slot itself while performing list
maintenance or whatever you need to do to read/write the data.
For the more-common reader of the hash table, it is not my intent to
reduce lock conflicts, but, rather, to generally eliminate them, and
replace them with something lighter-weight, such as the atomic
counter used by shared_ptr, that in best cases may only take a few
instructions to execute. I may have thousands of threads reading
info from the hashtable at once, and there may never be a natural
time when there are no readers so I can update the table. The
threads perform many operations that rely heavily on the hashtables
-- every other instruction, they are looking something up.

For what it is worth, finding the slot is not thread-safe, because
the table size and pointer can change. List maintenance involves
multiple slots and I do not think it is worthwhile doing finer-
grained locks because of the complexity, deadlock potential, and
likely worse performance it would introduce.
Post by j***@firstinvestors.com
This would be truly thread-safe but might perform better if you
have lots of simultaneous readers/writers. I haven't tested this,
but if you do, I'd be interested to hear the results.
I don't think it even begins to be thread-safe, because it doesn't
begin to deal with reallocation of the array itself. And if it did,
the fine-grained locks would become worthless, because you would need
a high-level lock protecting that allocation if you are going to use
locks for readers -- which seems like a bad idea from the start to
me. There is a reason shared_ptr does not use locks (at least on the
best-supported platforms), but uses an atomic count.
Post by j***@firstinvestors.com
PS - I'd be even more worried in Java, btw, since Java opcodes
should map to CPU operations less well than a compiled binary. Not
to mention that in Java, the portability assumption becomes really
critical. Java classes are *supposed* to be portable.
I do not want to get into Java wars here, but Java has to solve the
deallocation problem in a thread-safe manner. When the garbage
collector actually reclaims an allocation, the possibility cannot
exist that someone else still happens to be using it, or the
language is badly broken. This is the hard thing that must be solved
in C++ to make a thread-safe hashtable, and it does not have to be
dealt with in Java.

The deallocation problem is, from everything I have analyzed, the
big problem in C++. When you attempt to deallocate something, for
all you know another thread has just barely loaded its pointer from
shared memory and, as soon as it unstalls, will want to increment
the reference count -- but the count has not yet been incremented.
So even a smart pointer using an atomic reference count fails: the
object is sharable in a multithreaded environment, but the reference
itself is not.

Here is my first stab at a solution:

Extra data (for each Hashtable):

1. Array of two atomic counts, one active and one inactive.
2. Index that flops between 0 and 1, so there is an active and
inactive count.
3. A mutex that can be used for exclusive writers.

Each reader behaves as follows:

1. Fetch the index.
2. Increment the corresponding atomic count.
3. If the fetched index is not the current index, decrement the
atomic count and loop to step 1. This is rare and safely eliminates
rare bad cases.
4. Now the reader has free access to the entire hashtable structure,
and is guaranteed by cooperation of the writer that the pointers it
sees are not stale. It must only guarantee that any pointer
referenced in this step is either destroyed or has its ref count
incremented before step 5 -- this includes the pointer to the
allocated resizable array itself.
5. Decrement the same atomic count (of the fetched index)
incremented in step 2.

Note that in Java, none of this is necessary for readers, because
the language guarantees that a reference still in use is never
deallocated.

Each writer behaves as follows:

1. Obtain an exclusive lock on the hashtable. This lock only
excludes other writers, not other readers.
2. Perform changes to the hashtable in a manner that, given atomic
loads and stores of aligned pointers, never leaves the table in a
state that simultaneous readers cannot use. This is very easy
compared to the allocation issues, and I have been doing this in
Java for years. I would be happy to expand on this. This write
operation may include releasing objects with ref-count 1, if this is
supposed to be a weak hashtable that does not keep unused objects in
the index.
3. This is really part of 2: any step in (2) which destroys a
reference keeps a private local copy, to make sure it is not
deallocated before all reading threads are finished with it. If the
references are ref-counted, a ref-counted private reference is kept.
4. If there are no private local pointers/references being kept, go
to step 9.
5. Wait for inactive (1-index) atomic count to go to 0. This is rare
and safely eliminates rare bad cases.
6. Change index (index = 1 - index).
7. Wait for inactive (1-index) atomic count to go to 0. This means
that all readers who might have a copy of a stale pointer are finished.
8. Release all private local references with a reference count of 1.
Non-ref-counted references always get released, due to single owner.
If this is a weak hashtable, reinsert any references that have a
reference count greater than one since these were found and
incremented while we were waiting for readers to exit.
9. Release lock. Done.
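
The reader steps (1-5) and writer steps (5-7) above can be sketched
roughly as follows, using the later C++11 atomics (class and method
names are mine; the writer's exclusive mutex is omitted and the
spin-waits are left naive):

```cpp
#include <atomic>
#include <cassert>

// Sketch of the flip-gate: two atomic counts, one active and one
// inactive, plus an index that flops between 0 and 1.
class FlipGate {
public:
    // Reader steps 1-3: fetch the index, increment that side's count,
    // and retry if the index changed in between (rare).
    int reader_enter() {
        for (;;) {
            int i = index_.load(std::memory_order_acquire);
            count_[i].fetch_add(1, std::memory_order_acq_rel);
            if (index_.load(std::memory_order_acquire) == i)
                return i;                                      // safely inside
            count_[i].fetch_sub(1, std::memory_order_acq_rel); // rare: retry
        }
    }

    // Reader step 5: decrement the same count incremented on entry.
    void reader_exit(int i) {
        count_[i].fetch_sub(1, std::memory_order_acq_rel);
    }

    // Writer steps 5-7: drain the inactive side, flip the index, then
    // drain the side that was active. Afterwards, no reader that
    // started before the flip can still hold a stale pointer.
    void writer_flip() {
        int i = index_.load(std::memory_order_acquire);
        while (count_[1 - i].load(std::memory_order_acquire) != 0) {}
        index_.store(1 - i, std::memory_order_release);
        while (count_[i].load(std::memory_order_acquire) != 0) {}
    }

private:
    std::atomic<int> index_{0};
    std::atomic<long> count_[2] = {{0}, {0}};
};
```

The writer calls writer_flip() after unlinking candidates from the
table (while holding its exclusive lock), then releases any private
copies whose count dropped to 1.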

More notes:

Although we don't want to go out of our way to make the writer
inefficient, there is a single writer at any one time, holding an
exclusive lock. But we do not have to wait for all readers to exit
to perform the update; we only need to wait for the current readers
to exit, during which time new readers may enter. We do not care
about those, because we have already removed the candidates from
public places in the list. This includes the pointer to the array
itself, which is generally treated like any other allocated pointer,
but does not require a reference count.

There are many details I left out that did not directly pertain to
the allocation problem, but only with atomically updating the rest of
the table so that the reader view is always consistent.

Ray Whitmer
j***@firstinvestors.com
2006-05-05 19:11:10 UTC
Permalink
Post by Ray Whitmer
Post by j***@firstinvestors.com
This would be truly thread-safe but might perform better if you
have lots of simultaneous readers/writers. I haven't tested this,
but if you do, I'd be interested to hear the results.
I don't think it even begins to be thread-safe, because it doesn't
begin to deal with reallocation of the array itself...
That's certainly true. I guess I was thinking of a hashtable whose interface is that it is allocated with a fixed size and not reallocated (although it might be copy-constructed to a hashtable of a different size). If modern hashtables reallocate their storage, that would certainly have to be done in a thread-safe manner.
Post by Ray Whitmer
2. Perform changes to the hashtable in a manner that, given atomic
loads and stores of aligned pointers, never leaves the table in a
state that simultaneous readers cannot use. This is very easy
compared to the allocation issues, and I have been doing this in Java
for years. I would be happy to expand on this. This write operation
may include releasing objects with ref-count 1, if this is supposed
to be a weak hashtable that does not keep unused objects in the index.
[rest snipped]

The algorithm makes sense, and if all these claims hold then it should work, I think. I'm still concerned about step (2), however, if only because of the assumption that what you write as a load/store in C++ maps to a load/store in machine language. Will it? Always? I ask as a non-rhetorical question: does the C++ standard *guarantee* it? Because if it doesn't, then it seems like a potential problem, but maybe I'm being too careful.

Have you read any of the literature on lock-free data structures, btw? There were articles in the October and December '04 C/C++ Users Journal about them.

I realize that I haven't answered your original question about atomic_count, btw, and I'm sorry: the reason is simply that I don't know!

-
James Jones Administrative Data Mgmt.
(v)732-510-1806 375 Raritan Center Pkwy, Suite A
(f)732-510-1855 Edison, NJ 08837
Visit us on the web at http://www.firstinvestors.com/
Ray Whitmer
2006-05-05 20:09:33 UTC
Permalink
Post by j***@firstinvestors.com
The algorithm makes sense, and if all these claims hold then it
should work, I think. I'm still concerned about step (2), however,
if only because of the assumption that what you write as a load/
store in C++ maps to a load/store in machine language. Will it?
Always? I ask as
a non-rhetorical question: does the C++ standard *guarantee* it?
Because if it doesn't then it seems like a potential problem, but
maybe I'm being too careful.
If I could answer in the affirmative, I could prove it works and not
have to talk to anyone. I can make a bunch of lesser claims, that of
course fail to add up to the strong claim but make me feel better:

1. For years, people have written and relied on C/C++
characteristics that apparently were not in the standard until
later, if at all. If the popular platforms support a behavior, it
takes a strong reason to justify violating it.

2. Java, on any platform, would seem to strongly depend upon a
reference assignment being accomplished all-or-nothing as a machine
assignment. The alternative would be threads that see partial
references being assigned in different threads and all the safety of
references is out the window. This is such a common operation, it
would be difficult to believe that there is thread synchronization
going on around every pointer/reference assignment.

3. I have looked at the output of many C compilers and have never
seen a pointer assignment take multiple instructions. The exception
might be if someone used a poorly-written memcpy-like operation to
transfer the contents of a structure byte by byte; a well-written
one takes the respective alignments into account and transfers
aligned 32-bit quantities one by one. Since we are talking about
code in a controlled Hashtable class, it would have to be the "C"
compiler choosing to transfer a class consisting of a single address
field using something other than a single instruction. That doesn't
sound logical to me. It wouldn't surprise me completely to see it
happen, but I would report it as a compiler bug and expect it to be
addressed, even though the behavior is not in the standard.

4. These issues have to be addressed as multi-processor cores become
more prevalent, because they are basic to lock-free processing -- or
can C++ accept significant inferiority to Java with respect to lock-
free multiprocessing?

Sorry that my arguments in this case frequently refer to Java.
Post by j***@firstinvestors.com
Have you read any of the literature on lock-free data structures,
btw? There was articles in the October and December '04 C/C++ User
Journal about them.
I have read a number of articles on the subject, several specifically
on the topic of lock-free hashtables, several with implementations
they claim to be proven to be correct. My own first attempt is far
from a proven-correct implementation.

Many seem to arrive at similar conclusions, although each has its own
slightly-different take on the subject and quite different mechanisms.

They are also very often theoretical, and build their proofs of
correctness on features not apparently explicitly present in the "C"
standards. I certainly have yet to find an implementation I can just
plug in and use.

They often resort to garbage collection to solve the hard problems
(even if they are working in C) or if they solve the deallocation
problem, they only do so for the main array, not for the entries in
the hashtable itself.
Post by j***@firstinvestors.com
I realize that I haven't answered your original question about
atomic_count, btw, and I'm sorry: the reason is simply that I don't
know!
Thanks for the comments anyway, which were thought-provoking.

For what it is worth, if anyone else thinks the solution is worth a
generic contribution, or is of the opposite opinion that it would not
be generic enough, I would be interested in hearing. I suppose this
sort of question should go to the developers list. I think it would
be much more difficult to implement a table of shared_ptr than it is
for intrusive_ptr, because assigning a shared_ptr is much less likely
to be uninterrupted, leaving other threads with a broken visible
state. There is clearly more detail to discuss as well.
Ray Whitmer
Ray Whitmer
2006-05-06 12:16:56 UTC
Permalink
I'd also say the idea of moving a pointer out of the way, and then
using the flip-gate-count-thing to wait for any 'potentially-using'
readers to finish, definitely has 'potential'. You would need to
show detailed code to really be sure. And not to us - to the guys on
comp.programming.threads.
Thanks. I will plan on pursuing it, then.
Anyhow, if you haven't already, take a look at memory barriers, and,
as a canonical example, the broken 'double checked locking pattern'.
You can't do lock free without understanding this stuff.
I was somewhat aware of memory ordering and especially compiler
reordering, but I need to take another look at it and find the proper
set of macros for use in non-kernel code since volatile does not
(yet) guarantee hardware behavior between processors.
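
For reference, here is the canonical double-checked locking example
mentioned above, in the form that the (later) C++11 memory model
makes correct. Widget and instance() are illustrative names only:

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

struct Widget { int x = 7; };

std::atomic<Widget*> g_instance{nullptr};
std::mutex g_mutex;

Widget* instance() {
    // First check without the lock; the acquire load pairs with the
    // release store below, so a second thread can never observe the
    // pointer before the Widget it points to is fully constructed.
    Widget* p = g_instance.load(std::memory_order_acquire);
    if (!p) {
        std::lock_guard<std::mutex> lock(g_mutex);
        p = g_instance.load(std::memory_order_relaxed);
        if (!p) {                 // second check, now under the lock
            p = new Widget;
            g_instance.store(p, std::memory_order_release);
        }
    }
    return p;
}
```

Without the acquire/release pair (the pre-C++11 "broken" form), the
compiler or hardware may reorder the pointer store ahead of the
Widget's initialization, which is exactly the hazard being discussed.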
And you are correct, you can't make { reading the
location of a shared_ptr + incrementing its ref-count } atomic, it
least not
on most processors.
Actually, I was not concerned with incrementing the ref count. The
flip-gate-count-thing solves, I believe, the problem of reference
counts not being up to date all the time. I realize that I still
have to prove this, but I do not think it is hard.

What worries me more is whether the smart pointer has one or two
fields in it, because this affects whether readers find an incomplete/
broken smart pointer as they whiz by. With a single pointer field in
it, the load and store are atomic.
I don't think the boost implementation of intrusive_ptr's
operator=() is atomic, but it could be made to be.
shared_ptr has two pointers in it, (one to the shared count, one for
the pointer) whereas intrusive_ptr has a single pointer in it (since
the count is internal to the referenced class).

This means that the shallow bits of shared_ptr are not atomically
initialized, whereas those of intrusive_ptr are. An obvious way to
solve this would be to make shared_ptr contain a single pointer to a
wrapper class that contains both the shared count and a shared
pointer, but this adds an unacceptable level of indirection to basic
pointer operations.
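
The layout difference can be checked mechanically. Below,
std::shared_ptr (the standardized descendant of boost::shared_ptr,
with the same two-word design on mainstream implementations) is
compared against a one-word intrusive-style wrapper of my own naming:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// One-word wrapper, shaped like intrusive_ptr: only the object
// pointer; the count lives inside the pointee.
template <class T>
struct intrusive_like {
    T* p;
};

bool intrusive_is_one_word() {
    return sizeof(intrusive_like<int>) == sizeof(int*);
}

// Two words on libstdc++, libc++ and MSVC: the object pointer plus a
// pointer to the separately allocated count. The standard does not
// guarantee this layout, but it holds universally in practice.
bool shared_ptr_is_two_words() {
    return sizeof(std::shared_ptr<int>) >= 2 * sizeof(int*);
}
```

It is the second word that makes a shared_ptr assignment impossible
to perform as a single atomic store.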
Your wording of 'much less likely' is scary, and I think similar
wording made James apprehensive as well. There is no grey - it is
black or white, thread safe or not.
I will have to adjust my language.

I tend to use uncertain language both because I consider my
knowledge incomplete and because of the lack of a generic standard
that guarantees behavior; i.e., the fact that different Linux kernel
architectures need different memory barriers is a demonstration of
the uncertainty of a general C solution.

If I can rely on something like what the Linux kernel relies on, it
should not be scary. Behavior needs to be guaranteed on supported
platforms, and other platforms need to be detected and fail to
compile lock-free or revert to a locking fallback implementation, etc.

Ray Whitmer
Tomas Puverle
2006-05-08 14:13:31 UTC
Permalink
Ray,

with regard to resizing the hash map: you may want to look at a
technique called "Linear Hashing". It may be a nicer way to
implement growth for highly concurrent hashes.
Regards,

Tom
