Adding a new instruction to Michelson language#
From time to time a need arises to enrich Michelson language with an additional instruction. It might be an intimidating task, especially to someone who is not very familiar with the interpreter code. Thus, here is a quick tutorial outlining the most important steps required to add a new Michelson instruction. outlining the most important steps required to add a new Michelson instruction.
Adding a new instruction requires providing the following elements (detailed in the rest of this document):
an internal representation (or IR for short)
Once all of the above are provided, we may try compiling the protocol. We will most likely encounter some non-exhaustive pattern matching errors, but after the following steps are complete, these will usually be trivial to fix.
There are a lot of comments in the code base explaining further details about
the functions and data types described here. They are attached either to their
definitions in .ml
files or in corresponding .mli
files if definitions
are public.
For each function and data type mentioned on this page there is a pointer to the source code file where it can be found.
Michelson primitives#
The first thing we need to provide for creating a new instruction is a Michelson primitive. A primitive is a textual representation of our instruction in the Michelson code. Michelson primitives are organized into 5 namespaces:
keywords (lowercase words such as
parameter
,storage
,code
)types (lowercase words)
constructors (capitalised words)
constants (lowercase word
constant
followed by a string)instructions (all uppercase words)
Of these only the last group is of interest to us here. DUP
, CONS
and
ADD
are some examples of primitives representing instructions. By convention
constructor names for primitives representing instructions in the interpreter
implementation (as opposed to: in Michelson code) should start with capital
letter I
, for instance I_ADD
for ADD
instruction.
Primitives should be added with great care, though, as in binary representation they are currently encoded as single bytes, so there can be no more than 256 of them. Exceeding that number will force us to reserve more bytes for some primitives, which will incur some space and performance overheads for storing and executing contracts. Therefore, whenever possible we should prefer reusing existing primitives to adding new ones. Fortunately, we can take advantage of the fact that Michelson has a strong type system, ensuring that a same primitives can be used differently in several contexts. Thus, by allowing an existing primitive in a new context we can effectively add instructions without adding new primitives.
For instance suppose we have a MAP
instruction and primitive, which expects
a list at the top of the stack, and which recursively applies some given code
block to elements of the list, thus creating a new list, which is then put back
at the top of the stack. Suppose we now want a similar instruction for options,
instead of lists of elements. So far MAP
is only considered valid in a very
particular state of the stack. By adding a new translation rule (see
Script translator), we can allow the same primitive
to be used when there is an option instead of the list at the top of the stack,
thus effectively creating a new instruction. This may be seen as “overloading”
the primitive for different argument types.
When adding a new primitive, it should be mapped to the appropriate namespace in
namespace
function (in
src/proto_alpha/lib_protocol/michelson_v1_primitives.ml). In case of an
instruction, the appropriate namespace is Instr_namespace
. It is also necessary to map the primitive
constructor to the string representing it in Michelson code in
prim_of_string
and string_of_prim
in the same module.
See src/proto_alpha/lib_protocol/michelson_v1_primitives.ml for a complete list of primitives known to Michelson and their respective textual representations.
See also Micheline to learn more about the concrete syntax of Michelson language.
Michelson types#
Before we talk about internal representations, we need to have a brief look at
how the interpreter handles types of Michelson expressions internally. Types
ty
and stack_ty
are both defined in
src/proto_alpha/lib_protocol/script_typed_ir.ml. Type ty
enumerates
all types known to the Michelson interpreter and is parametrised by the
underlying OCaml type in which Michelson values are actually stored in memory.
Thus, when values of the ty type are pattern-matched on, their type parameters
help convince the OCaml compiler that the underlying Michelson expressions or
instructions are compatible (i.e. may be unified), and so that the execution
model for Michelson is type-safe.
stack_ty
is a collection of ty
values occurring in a particular order,
which represents the type of the whole stack. The outermost value corresponds to
the top of the stack, while the innermost one is always Bot_t
, which
corresponds to the bottom of the stack. Note that stack_ty
is parametrised
by two type parameters. The first one is the OCaml counterpart of the type of
the value at the top of the stack; the other combines the OCaml counterparts of
types of the remainder of the stack in the form of a cons list, i.e. a pair,
whose first element is a type and the other – another pair containing the
remainder of the list.
This is a recurring pattern within the interpreter: type parameters corresponding to types of Michelson stacks usually come in pairs. As we will shortly see, the first parameter in these pairs always corresponds to the type of element at the top of the stack; the other – to the remainder of the stack.
Internal representation (IR)#
A primitive is what represents our instruction in the Michelson script. Now we need an IR, which is a piece of data containing all the information necessary to actually execute the instruction. It should belong to the type:
type ('before_top, 'before, 'result_top, 'result) kinstr = (* ... *)
which is defined in lib_protocol/script_typed_ir.ml
. As the module’s name
suggests, this representation is already guaranteed to be well-typed. The type
kinstr
is the type of well-typed instruction sequences. Adding a new
instruction consists in defining a new constructor of the kinstr
type
representing sequences starting with the new instruction, providing a rule which
constructs them from the primitive corresponding to the new instruction and a
rule which interprets them when the script in executed.
Instructions are parametrised by 4 type parameters:
The type of the top element of the stack prior to execution
The type of the remainder of the stack prior to execution
The type of the top element of the stack following execution
The type of the remainder of the stack following execution
For instance ICar
instruction is defined as:
| ICar :
('a * 'b, 's) kinfo * ('a, 's, 'r, 'f) kinstr
-> ('a * 'b, 's, 'r, 'f) kinstr
The reason why the third parameter of the resulting kinstr
is 'r
and not
simply 'a
(which is the type of the first element of the pair at the top of the
initial stack) is because this constructor also contains the next instruction,
which produces a value of some arbitrary type 'r
. However, note
that this next instruction should expect 'a
at the top of its initial stack.
New instructions are added by extending the kinstr
type with additional
variants. The variant should contain all the information necessary to execute
the instruction and its type parameters must ensure that the shape of the stack
prior to execution will provide arguments for the instruction and that the stack
following the execution will contain its result.
Each IR should also contain an instance of type kinfo
, which holds the
information about the type of the stack prior to the instruction’s execution.
This information is mostly used for error reporting and logging. Another thing
each instruction’s IR should contain is a continuation, i.e., another instruction
that is going to be executed next. This way the whole script can be represented
as a single instruction constructed by sequencing many instructions together.
See the next section to learn how such a sequence is constructed in the process
of translation.
Instructions taking all their arguments from the stack will require nothing more
than the above data, as the constructor itself will inform the interpreter, what
to do. In more complex cases, especially when some control transfer is involved,
some additional information may be needed. For example a MAP
instruction
(regardless of the type it operates on) requires some more instructions (a
sub-program) to be executed to map one value into another. A PUSH
instruction needs to know the type and the value it should push on the stack,
and so on.
The translator#
Now that we have chosen a primitive to represent our instruction in the code and
an internal representation (IR), we need to provide a rule that translates the
former into the latter. parse_instr
function in
src/proto_alpha/lib_protocol/script_ir_translator.ml is responsible for
this. Notice that the function parse_instr
, despite what its name suggests,
matches on pre-parsed Micheline AST. Micheline parser is not a part of the
protocol and therefore must be run by the client before the script is submitted
to the node in binary-encoded form.
The translator iterates over the AST instruction after instruction, maintaining
the type of the stack after each operation. This way it can also type check the
script in a single run. Additionally this makes information about the “current”
type of the stack available when parsing each consecutive instruction. Notice
that this function not only matches on Michelson primitives themselves, but also
on the type of the stack implied by translating previous instructions. This is
what lets us distinguish between different contexts in which the same primitive
may have different meaning. For instance the MAP
instruction that we
mentioned before may perform mapping either on lists or on options, depending on
what the current type of the stack might be. Of course, the internal
representation will be different in each case. If the primitive and the current
stack type does not match any possibility, it’s a typing error and the whole
script is considered ill-typed.
Because the IR is well-typed by construction, it is necessary for
parse_instr
translates the script and also type check it at the same time.
Motivations for this are twofold. For one thing type checking gives us strong
static guarantees about the behaviour of the interpreter. In particular it’s
guaranteed to only produce well-typed Michelson values, which means each
instruction receives input that it expects. Secondly, the interpreter does not
have to check for types of data it finds on the stack (they’re guaranteed to be
correct), which makes for faster execution. For this reason it is essential that
each IR contains a value of type kinfo
(or an equivalent thereof), from
which the translator can obtain the type the stack should have after this
instruction is executed. Function kinfo_of_kinstr
in
src/proto_alpha/lib_protocol/script_typed_ir.ml is responsible for this
extraction.
An interesting situation occurs with instructions regulating control flow. These
usually receive one or more pieces of code (sub-programs) to execute depending
on some runtime conditions. MAP
, IF_LEFT
, IF_NONE
are all examples
of such instructions. These sub-programs must also be well-typed. Moreover, if
there’s more than one like in case of IF_LEFT
or IF_NONE
instructions, a
certain relationship between types of these sub-programs must hold. In
particular, they must ensure, that the entire instruction will always render a
resulting stack of the same type.
To ensure this, these sub-programs must be type-checked (and translated)
recursively. The aforementioned relationships between sub-programs’ types should
be guaranteed by the constructor of the kinstr
variant. However, in case of
multiple execution branches (sub-programs) their types must be unified before
proceeding. This is what merge_stacks
function is for. It accepts two stack
types and verifies if they’re equal. If so, the unified stack type is returned,
otherwise it results in a type error.
The precise return type of parse_instr
is judgement
defined in
src/proto_alpha/lib_protocol/script_ir_translator.ml:
type ('a, 's, 'b, 'u) cinstr = {
apply :
'r 'f. ('a, 's) kinfo -> ('b, 'u, 'r, 'f) kinstr -> ('a, 's, 'r, 'f) kinstr;
}
type ('a, 's, 'b, 'u) descr = {
loc : Script.location;
bef : ('a, 's) stack_ty;
aft : ('b, 'u) stack_ty;
instr : ('a, 's, 'b, 'u) cinstr;
}
(* ... *)
type ('a, 's) judgement =
| Typed : ('a, 's, 'b, 'u) descr -> ('a, 's) judgement
| Failed : {
descr : 'b 'u. ('b, 'u) stack_ty -> ('a, 's, 'b, 'u) descr;
}
-> ('a, 's) judgement
This is quite a bit involved, so we’ll explain these types and their role one at
a time. cinstr
is an intermediate representation of an instruction whose
continuation is not yet known. As we can see, it’s a function actually, which
given another instruction (the kinstr
argument), sequences it after the
current one, which is held in the function’s closure. A precise stack
description (a kinfo
) is also required, but note the initial type of the
stack is already predetermined by the type of cinstr
. The resulting type of
the instruction passed into the function, on the other hand, can be anything, as
witnessed by the forall annotation 'r 'f.
. The construction of
cinstr
is such that we can already create it without knowing what the next
instruction will be (as opposed to a kinstr
, which should already contain
instruction’s continuation).
A descr
is a complete description of an instruction. It combines a
cinstr
with precise descriptions of the stack type before and after the
operation. These stack types are typically assembled by inspecting the initial
stack type passed to parse_instr
(which we pattern match on in the said
function). Location is typically copied from the Micheline pattern as is and it
only serves the error-reporting purposes in case type-checking failed at a later
step.
Finally a successful typing judgement contains a full description of an
instruction, but is only parametrised by its input type. The return type is
existential and can only be retrieved by inspecting the aft
field of the
contained description. This is vital, because when calling the type-checking
procedure, we don’t yet know the return type of the program. In fact the whole
point of it is (in a sense) to learn that type. Typically when adding typing
rules for new instructions, we only need to be concerned with the successful
result (Typed
constructor). Failed
is returned when the instruction does
not produce an output stack. This is true for instance of FAILWITH
instruction, which immediately terminates the execution. Hence, it cannot be
given any sound type and therefore must be treated specially.
As already mentioned above, some instructions also receive pieces of code
(sub-programs) as their arguments. As these sub-programs must themselves be
well-typed, typically we will recursively call non_terminal_recursion
function on them. If it succeeds in type-checking the sub-program, we use its
description to convert it into the type of the whole instruction (which will
usually be slightly different than the type of the sub-program). If it fails,
however, the error will be transparently propagated up the call stack.
Failed
judgements are treated differently by different instructions. Some,
like MAP
convert them to typing errors, other unify them with any type the
other program branch might have, effectively treating them as if they had type
forall 'a. 'a
. In fact, this is precisely the type that Michelson
specification assigns to the FAILWITH
instruction.
The interpreter#
The interpreter is the heart of the Michelson implementation. This specific implementation
follows the small-step approach: at each execution step, the interpreter rewrites a machine
configuration made of a value stack and a continuation stack. Therefore, the interpreter takes a
script’s IR, a storage and an input to the script as arguments, generates the
initial stack containing the storage content and the input, and then executes
the script, returning the final content of the stack. It’s defined in
src/proto_alpha/lib_protocol/script_interpreter.ml by the execute
function.
The execute
function does some preliminary preparations and then passes control to
the step
function, which encodes the interpretation loop and where execution
rules for all instructions are given. The step
function accepts:
Context and step constants (see below)
Remaining gas
Instruction
Continuation
The top value on the stack (called accu)
The remainder of the stack
Typically this function computes the new value of the stack and then calls
itself recursively with a new instruction (already available in the kinstr
value). In some cases, however, some additional action may be required either
before or after the instruction is processed. These additional actions are
usually related to the control flow.
For instance, in the case of our MAP
instruction for options, if it finds
None
at the top of the stack, will leave it without a change and simply
proceed to the next instruction. If, however, it finds Some x
, it should
pass control over to the sub-program given as parameter. This sub-program does
not expect an option, though, it expects our x
unwrapped. Similarly it does
not return an option, but an arbitrary result which should be wrapped in a
Some
constructor after control returns to the main program. Without this
additional action, the types of stacks produced by the two branches would differ
and the program would be ill-typed. To remedy this and similar problems, the
interpreter also defines the continuation
type (defined in
src/proto_alpha/lib_protocol/script_typed_ir.ml). Whenever the control is
passed over to a sub-program, the next
function can be called to manage the
flow of control around the sub-program (for instance executing it multiple times
in case of a loop). Also, each Michelson program ends with a special instruction
IHalt
, which calls the same next
function.
The next
function occupies itself with continuations stack (argument no. 4
on the list above). Typically before transferring control to a sub-program, an
appropriate continuation is pushed on the continuation stack to manage its
outcome and resume execution of the main program. Only if there’s no
continuation to perform, the program is actually terminated. In the MAP
instruction example it would be wrapping the value at the top of the stack in
the Some
constructor to match the type of the other branch. We simply put an
appropriate continuation in the continuation stack before executing the
sub-program with the rest of the main program appended as continuation.
Sometimes adding an instruction may involve adding a new continuation as well.
However, continuations are completely internal to the interpreter. They neither
have a representation in the Michelson code nor are they ever involved in
translation. A continuation is a value of type ('a, 'b, 'c, 'd) continuation
defined in src/proto_alpha/lib_protocol/script_typed_ir.ml. Similarly to
an instruction, a new constructor of this type should contain all the
information required to execute the continuation. For instance KCons
continuation contains an instruction and a continuation which should be executed
after it. The special KNil
continuation marks the end of execution. After
the continuation is defined, it can be used freely in the interpreter.
The step constants passed to the function along with the context contain some
important information about the transaction itself, like the sender and the
target, the amount transferred and so on. See step_constants
type definition
in src/proto_alpha/lib_protocol/script_interpreter.ml for more details.
The Gas model#
Each Michelson instruction also has a corresponding gas model, which estimates how much gas should be consumed by the interpreter when executing the instruction. The main goal of this is to prevent scripts from falling into infinite loops and to protect bakers from abuse of their computing power. Of course, it’s impossible to make sure statically that a Michelson program always terminates, but its computing time can be checked dynamically and that’s the main purpose of gas.
The gas model is a function approximating the time and computational resources consumed by executing an instruction depending on the size of its argument. Typically the role of the programmer adding an instruction is to define the function for the instruction, but leaving any constant values in it abstract. Additionally, a benchmark needs to be defined in order to find the appropriate values for these constants. At a later point benchmarks are run on a dedicated machine and constant values are filled in based on the results. It is essential that these benchmarks are always run on the same hardware so that they return comparable results.
More information on the gas model and on benchmarking the interpreter can be
found on the pages dedicated to the snoop
library: Snoop and the Michelson Interpreter.
Documentation#
Last but not least, the new instruction needs to be documented. After all, nobody will ever use it otherwise!
Documenting the instruction may involve:
Editing the Michelson language page (only the page corresponding to the Alpha protocol needs be changed, because injected protocols cannot be extended with new instructions). The source for this page is docs/alpha/michelson.rst.
Updating the Michelson reference website, by modifying the associated repository.