Wouldn’t you like to define a data structure as a module, with the internals
hidden? Obviously you can, but because modules are not records the data
structure then can’t be used with protocols. You could define a record and in
the do
block put functions to manipulate it but then you’re making it easy for
your users to use the record directly, and trample on the invariants in the
process.
Record tags
Luckily there is a trick. It’s not very obvious but if you read the
documentation for Kernel.defrecordp/3
closely there is a tag field which is
used in an example to set the tag to the name of the enclosing module. What’s
the tag? Well in Erlang, and Elixir, records are stored as tuples with the first
field being a “tag” that distinguishes different records and the other fields
being the data of the record. With defrecord
the tag is always the name of the
record module, with defrecordp
it is the record name.
1 2 3 4 |
|
When defining implementations of protocols the name you pass to for:
is the
tag of the record. This makes perfect sense as the protocol doesn’t need to know
anything about your record except how to recognize it, and that’s what the tag
is for.
An abstract binary tree
Using record tags you can write an abstract binary tree module like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 |
|
The __MODULE__
is a special variable that expands to the name of the module. I
could have written BinTree
in it’s place, but __MODULE__
stands out and is a
good reminder of what trick is used here.
The module works as you’d expect:
1 2 3 4 |
|
We’re still showing the guts of the record through inspect
though, but that’s
easily changed.
1 2 3 |
|
1 2 |
|
Just remember that we’re only hiding the representation when pretty printed, if
you pass raw: true
to inspect you’ll still see the internal representation.
Due to how Erlang (and thus Elixir) works this is unavoidable, but the pretty
printing serves as a powerful reminder to the user that the internal
representation should not be relied upon.
Dispatching on the record tag
There is another trick I’d like to show you. If you look at the source of the
Set
module from the standard library most functions look like this:
1 2 3 |
|
Huh? Object-oriented programming. Nope, just plain old apply/3
(if the first
argument is not a literal symbol Elixir translates this to runtime apply call).
The magic is in the target
macro:
1 2 3 4 5 6 7 8 9 |
|
Remember that the first field of a record is the tag, so elem(rec, 0)
extracts
the record tag. If you call Set.difference(s1, s2)
(where s1
and s2
are
HashSets) target(s1).difference(s1, s2)
gets called, which expands to
(ignoring the tuple check) elem(s1, 0).difference(s1, s2)
, which effectively
translates to apply(HashSet, :difference, [s1, s2])
.
When I first saw that code I spent some time scratching my head over why they
didn’t use a protocol. I’m still not 100% sure, but I strongly suspect
performance is one of the main reasons. As a test I made a very simple benchmark
to compare the overhead of calling through a protocol vs calling through
apply
.
Running this a number of times gives a pretty consistent result after the first few runs: the protocol calls are about 14-15 times slower than the dispatch calls (which is how I named the “through apply” calls, not sure if it’s the right term).
Now protocol calls are known to be slow (that’s a big part of why Elixir has
reduce based collections, those only require a few protocol calls even for
large amounts of data). There are
ideas on how to fix that and
once those get implemented things may change for released code. For now however
indirect calls through apply
are significantly faster.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
|
If you look at Set
again notice that it’s a behaviour with callbacks of the
same types and arity for each exposed function. All the implementation modules
(just HashSet
in the standard library) implement the Set
behaviour and the
compiler makes sure that all the functions are indeed implemented. If you’re
going to do the same protocol replacing trick as Set
it’s a good idea to
define a behaviour as well.
Caveats
So what are the downsides of the abstract data structure technique? Well you
lose the nice record syntax and decomposition and have to work with the macros
defrecordp
generates for you. And it’s a bit more work getting everything set
up.
But let’s face it, those are minor issues. What you get is the ability to offer a data structure without exposing it’s representation. You get the ability to make a data structure that (ugly hacks aside) can only be manipulated through the API you provide, an API that maintains the internal invariants.
If you define a public record stop and think for a moment why you are defining the record that way, because often hiding it in a module results in cleaner, more maintainable code.