In Erlang, the generally accepted approach is to implement as much as possible on the BEAM. This gives every bit of code the wonderful fault-handling characteristics we love so much.
Preamble
However, there are some cases when we can’t do that, be it because we rely on directly interfacing with some library or low-level code, or because we’re totally bonkers. Still, the first rule of NIF-Club is: Do not write NIFs. So, if you ever wonder “should I write this as a NIF?” then please keep reading and answer your question with “Hell no!”.
Sadly, I’m totally bonkers and rarely heed my own advice, so for the 0.3.3 version of DalmatinerDB I implemented a library in C to handle the caching of metric writes before they get serialized to disk.
Before you get the rotten tomatoes and raw eggs out, give me a chance to defend my honor. The task of write caching is incredibly dependent on memory management and on mutating data at a very high rate. Erlang does not give you the best tools for either of those. I wrote it in Erlang. Then I wrote it in Erlang with some parts in C. And finally I decided to go all the way to C, and now that I have a working version the results are stunningly good.
Well, so let’s get to the actual topic, the NIF. Writing something in C is easy! Writing something in C that compiles is still doable. Doing so in a way that does what you want it to do is a lot harder. And finally, writing something in C that does what you think it does and does not randomly segfault or overwrite memory is close to impossible – at least so I blatantly claim without proof or citation other than an empirical study with a sample size of 1. Then again, this is my article, so I’m allowed to do that, especially if it serves as a storytelling device.
Testing with EQC
When testing my code, I rely heavily on QuickCheck. It saves me from coming up with test cases myself and instead lets me describe what I want to happen at a broader scope. I will not go into detail about the what or how, as there are other, better sources for this, but I’ll sum up my approach quickly.
I implemented the logic I want in Erlang in the most straightforward (and perhaps inefficient) manner I could think of. Now I let Erlang QuickCheck (EQC) generate a random sequence of operations on the cache, with random input parameters, and in the end see if the naïve implementation and the real cache produce the same outcome.
For optimizing a simple concept, I really like this approach, as the simple implementation is often rather easy to reason about, and even if it’s wrong, chances are that the simple and the optimized implementations are wrong in different ways.
Now, with that implementation, EQC sets off to do random crazy things to the code and see if something breaks. However, that is only half the story! Once it finds something that breaks, it will then try to simplify the events that led to the disaster and present me with a (hopefully) minimal test case that can trigger the problem.
And this works perfectly for Erlang code! It even works for C code, and it can find some memory corruption issues that way. However – yes, I know that is what everyone was waiting for, sorry it took me so long – the concept becomes completely useless when the C code segfaults and brutally murders the BEAM.
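To make the idea concrete, here is a minimal sketch of such a comparison property. It is not the mcache test suite: the module name, the put/delete operations, and the map standing in for the "real" cache are all made up for illustration, and it assumes EQC is installed so that eqc.hrl can be included.

```erlang
-module(model_cache_eqc).
-include_lib("eqc/include/eqc.hrl").
-export([prop_model_matches/0]).

%% Hypothetical operations on a small key/value cache.
op() ->
    oneof([{put, nat(), int()},
           {delete, nat()}]).

%% Naive model: an association list with at most one entry per key.
model_apply({put, K, V}, M) -> lists:keystore(K, 1, M, {K, V});
model_apply({delete, K}, M) -> lists:keydelete(K, 1, M).

%% Stand-in for the optimized implementation (in the article this
%% role is played by the C cache); here it is just a map.
real_apply({put, K, V}, C) -> maps:put(K, V, C);
real_apply({delete, K}, C) -> maps:remove(K, C).

%% Run the same random operation sequence through both
%% implementations and compare the final contents.
prop_model_matches() ->
    ?FORALL(Ops, list(op()),
            begin
                Model = lists:foldl(fun model_apply/2, [], Ops),
                Real = lists:foldl(fun real_apply/2, #{}, Ops),
                ?WHENFAIL(io:format(user, "Ops: ~p~n", [Ops]),
                          lists:sort(Model) == lists:sort(maps:to_list(Real)))
            end).
```

Running eqc:quickcheck(model_cache_eqc:prop_model_matches()) from a shell exercises the property. If the two sides ever disagree, the ?WHENFAIL action prints the operation sequence that caused it.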
Resolving the segfault problem
During the EUC last week I talked to Thomas Arts, one of the brilliant people behind EQC, about the problem. He suggested something that is totally obvious once you’re told about it, but that I’d never have thought of on my own. He said, in his wonderful accent, “Oh, that is not a problem, just execute the tests on a different node.” The simplicity of that blew me away. Of course, it’s Erlang: just run it on another node and don’t be bothered whether it explodes. It’s brilliant!
There are a few hurdles in the way, however. EQC, to my knowledge, has no built-in abstraction for remote execution. That said, it’s easy enough to build it with Erlang.
I use the rebar_eqc plugin to run my tests, and rebar3 has a bit of an issue when it comes to hostnames. So, before you do anything else, you need to be sure epmd is running on the machine you want to test on. The simplest way is just to start an erl shell in another window. Once that is done, you can start rebar with rebar3 as eqc eqc --sname eqc.
Erlang, or rather its Common Test framework, comes with a nice helper for starting another node for tests. So that part is easy: we can use ct_slave:start(eqc_client), which will give us a new node to test on.
Next up, the node is started without any code paths, so we’ll need to make sure it knows where to find the code to test. The simplest way I found is just to feed it the same path that the main node has.
Then, since EQC does not know about the second node, we extract the body of the test into its own function, pass it the generated values, and have it return just the info needed to decide whether it’s a failure or a success. If the remote node crashes, rpc:call returns {badrpc, nodedown}, the pattern match on the result fails, and the crash shows up as a test failure. And this is the big part: a segfault now goes from destroying our test system to being just another type of failure we can encounter in our testing process.
    %% Start the helper node (or reuse it) and give it our code path.
    maybe_client() ->
        case ct_slave:start(eqc_client) of
            {ok, Client} ->
                rpc:call(Client, code, set_path, [code:get_path()]),
                {ok, Client};
            {error, already_started, Client} ->
                {ok, Client};
            E ->
                E
        end.

    %% Run ?MODULE:Fn(Args...) on the helper node.
    remote_eval(Fn, Args) ->
        {ok, Client} = maybe_client(),
        rpc:call(Client, ?MODULE, Fn, Args).

    %% The test body; runs on the helper node and returns only plain terms.
    map_comp_body(Cache, MaxGap) ->
        {H, T, Ds} = eval(Cache),
        TreeKs = all_keys_t(T, MaxGap),
        CacheKs = all_keys_c(H, []),
        Ds1 = check_elements(Ds),
        {CacheKs, TreeKs, T, Ds1}.

    prop_map_comp() ->
        ?SETUP(
           fun setup/0,
           ?FORALL(
              {MaxSize, MaxGap, Opts}, {c_size(), nat(), opts()},
              ?FORALL(
                 Cache, cache(MaxSize, MaxGap, Opts),
                 begin
                     %% A crashed node yields {badrpc, nodedown}, which
                     %% fails this match and so fails the test.
                     {CacheKs, TreeKs, T, Ds1} =
                         remote_eval(map_comp_body, [Cache, MaxGap]),
                     ?WHENFAIL(io:format(user, "Cache: ~p~nTree:~p / ~p~nDs: ~p~n",
                                         [CacheKs, TreeKs, T, Ds1]),
                               CacheKs == TreeKs andalso Ds1 == [])
                 end))).
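The property above relies on a failed pattern match to report a dead node. If you would rather see the reason spelled out, one possible variation is to tag the result of rpc:call before matching on it. This is only a sketch: the module name remote_check and the classify/1 helper are hypothetical and not part of mcache. It also assumes the remote function never legitimately returns a {badrpc, _} tuple itself.

```erlang
-module(remote_check).
-export([remote_eval/4, classify/1]).

%% Run M:F(A) on Node and tag the outcome, so the caller can tell a
%% dead or unreachable node apart from an ordinary result.
remote_eval(Node, M, F, A) ->
    classify(rpc:call(Node, M, F, A)).

%% rpc:call/4 returns {badrpc, nodedown} when the node is gone (for
%% example after a segfault) and {badrpc, Reason} for other failures,
%% such as an exception in the called function.
classify({badrpc, nodedown}) -> {failed, node_down};
classify({badrpc, Reason})   -> {failed, {badrpc, Reason}};
classify(Result)             -> {ok, Result}.
```

Inside the property you could then match on {ok, {CacheKs, TreeKs, T, Ds1}} for the normal path and turn {failed, Why} into a ?WHENFAIL that prints Why and returns false.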
And really, that’s it! There are a few gotchas, like the fact that you can’t pass NIF references over RPC, or that sometimes, when canceling the test, you need to manually shut down the client node. Still, all in all, this worked incredibly well.
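For the manual shutdown, a small helper along these lines can be called from the shell of the main node. It is a sketch that assumes the client was started with ct_slave:start(eqc_client) as above; the module and function names are made up.

```erlang
-module(eqc_client_cleanup).
-export([stop_client/0]).

%% Stop the helper node started with ct_slave:start(eqc_client).
%% "It was not running" counts as success, so the call is safe to repeat.
stop_client() ->
    case ct_slave:stop(eqc_client) of
        {ok, _Node}                 -> ok;
        {error, not_started, _Node} -> ok;
        {error, Reason, Node}       -> {error, {Reason, Node}}
    end.
```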
The code and tests of the library can be found here: https://github.com/dalmatinerdb/mcache