When writing computer programs, certain patterns arise over and over again. For example, programs must often loop through the elements of arrays, increment or decrement the values of variables, and perform multiway conditionals based on numeric or character values. Programming language designers typically acknowledge this by including special-purpose syntactic constructs that handle the most common patterns. C, for instance, provides multiple looping constructs, multiple conditional constructs, and multiple constructs for incrementing or otherwise updating the value of a variable.[*]
[*] The C Programming Language, Second Edition, Brian W. Kernighan and Dennis M. Ritchie, Prentice Hall, 1988.
Some patterns are less common but may occur frequently in a certain class of programs, or perhaps just within a single program. These patterns may not even be anticipated by a language's designers, who in any case would typically choose not to incorporate syntactic constructs to handle such patterns in the language core.
Yet, recognizing that such patterns do arise and that special-purpose syntactic constructs can make programs both simpler and easier to read, language designers sometimes include a mechanism for syntactic abstraction, such as C's preprocessor macros or Common Lisp[] macros. When such facilities are absent or are inadequate for a specific purpose, an external tool, such as the m4 macro expander,[
] may be brought to bear.
[
] Common Lisp: The Language, Second Edition, Guy L. Steele Jr., Digital Press, 1990.
[
] The M4 Macro Processor, Brian W. Kernighan and Dennis M. Ritchie, 1979.
Syntactic abstraction facilities differ in several significant ways. C's preprocessor macros are essentially token-based, allowing the replacement of a macro call with a sequence of tokens; text passed to the macro call is substituted for the macro's formal parameters, if any. Lisp macros are expression-based, allowing the replacement of a single expression with another expression, computed in Lisp itself and based on the subforms of the macro call, if any.
In both cases, identifiers appearing within a macro-call subform are scoped where they appear in the output, rather than where they appear in the input, possibly leading to unintended capture of a variable reference by a variable binding.
For example, consider the simple transformation of Scheme's or form[§] into let and if in the following example:
[§] "Revised report on the algorithmic language Scheme," Richard Kelsey, William Clinger, and Jonathan Rees, editors, Higher-Order and Symbolic Computation, Vol. 11, No. 1, pp. 7–105, 1998. Also appeared in ACM SIGPLAN Notices, Vol. 33, No. 9, September 1998.
(or e1 e2)(let ([t e1]) (if t t e2))
An or form must return the value of its first subform, if it evaluates to a true (any non-false) value. The let expression is used to name this value so that it is not computed twice.
The previous transformation works fine in most cases, but it breaks down if the identifier t appears free in e2 (i.e., outside of any binding for t in e2), as in the following expression:
(let ([t #t]) (or #f t))
This should evaluate to the true value #t. With the simple transformation of or specified previously, however, the expression expands to:
(let ([t #t]) (let ([t #f]) (if t t t)))
which evaluates to the false value #f.
Once seen, this problem is easily addressed by using a generated identifier for the introduced binding:
(or e1 e2)(let ([g e1]) (if g g e2))
where g is a generated (fresh) identifier.
As Kohlbecker, Friedman, Felleisen, and Duba observe in their seminal paper on hygienic macro expansion[||] variable capture problems like this are insidious, because a transformation may work correctly for a large body of code only to fail sometime later in a way that may be difficult to debug.
[||] "Hygienic macro expansion," Eugene Kohlbecker, Daniel P. Friedman, Matthias Felleisen, and Bruce Duba, Proceedings of the 1986 ACM Conference on Lisp and Functional Programming, pp. 151–161, 1986.
While unintended captures caused by introduced identifier bindings can always be solved by using generated identifiers, no such simple solution is available for introduced identifier references, which may be captured by bindings in the context of the macro call. In the following expression, if is lexically bound in the context of an or expression:
(let ([if (lambda (x y z) "oops")]) (or #f #f))
With the second transformation for or, this expression expands into:
(let ([if (lambda (x y z) "oops")]) (let ([g #f]) (if g g #f)))
where g is a fresh identifier. The value of the expression should be #f, but will actually be "oops" because the locally bound procedure if is used in place of the original if conditional syntax.
Limiting the language by reserving the names of keywords such as let and if would solve this problem for keywords, but it would not solve the problem generally. For instance, the same situation can arise with the introduced reference to the user-defined variable add1 in the following transformation of increment:
(increment x)(set! x (add1 x))
Kohlbecker et al. invented the concept of hygienic macro expansion to solve both kinds of capturing problems, borrowing the term "hygiene" from Barendregt.[#] Barendregt's hygiene condition for the λ-calculus holds that the free variables of one expression substituted into another are assumed not to be captured by bindings in the other, unless such capture is explicitly required. Kohlbecker et al. adapted this into the following hygiene condition for macro expansion:
[#] "Introduction to the lambda calculus," H. P. Barendregt, Nieuw Archief voor Wisenkunde, Vol. 4, No. 2, pp. 337–372, 1984.
Generated identifiers that become binding instances in the completely expanded program must only bind variables that are generated at the same transcription step.
In practice, this requirement forces the expander to rename identifiers as necessary to avoid unintended captures. For example, with the original or transformation:
(or e1 e2)(let ([t e1]) (if t t e2))
the expression:
(let ([t #t]) (or #f t))
expands into the equivalent of:
(let ([t0 #t]) (let ([t1 #f]) (if t1 t1 t0)))
which properly evaluates to #t. Similarly, the expression:
(let ([if (lambda (x y z) "oops")]) (or #f #f))
expands into the equivalent of:
(let ([if0 (lambda (x y z) "oops")]) (let ([t #f]) (if t t #f)))
which properly evaluates to #f.
In essence, hygienic macro expansion implements lexical scoping with respect to the source code, whereas unhygienic expansion implements lexical scoping with respect to the code after expansion.
Hygienic expansion can preserve lexical scope only to the extent that the scope is preserved by the transformations it is told to perform. A transformation can still produce code that apparently violates lexical scoping. This can be illustrated with the following (incorrect) transformation of let:
(let ((x e)) body)(letrec ((x e)) body)
The expression e should appear outside the scope of the binding of the variable x, but in the output it appears inside, due to the semantics of letrec.
The hygienic macro expansion algorithm (KFFD) described by Kohlbecker et al. is both clever and elegant. It works by adding a timestamp to each variable introduced by a macro, and then uses the timestamps to distinguish like-named variables as it renames lexically bound variables. KFFD has some shortcomings that prevent its direct use in practice, however. The most serious are a lack of support for local macro bindings and quadratic overhead resulting from the complete rewrite of each expression as timestamping and renaming are performed.
These shortcomings are addressed by the syntax-rules system, developed by Clinger, Dybvig, Hieb, and Rees for the Revised Report on Scheme.[**] The simple pattern-based nature of the syntax-rules system permits it to be implemented easily and efficiently.[] Unfortunately, it also limits the utility of the mechanism, so that many useful syntactic abstractions are either difficult or impossible to write.
[**] "Revised report on the algorithmic language Scheme," William Clinger and Jonathan Rees, editors, LISP Pointers, Vol. 4, No. 3, pp. 1–55, July–September 1991.
[
] "Macros that work," William Clinger and Jonathan Rees, Conference Record of the Eighteenth Annual ACM Symposium on Principles of Programming Languages, pp. 155–162, January 1991.
The syntax-case system[] was developed to address the shortcomings of the original algorithm without the limitations of syntax-rules. The system supports local macro bindings and operates with constant overhead, yet allows macros to use the full expressive power of the Scheme language. It is upwardly compatible with syntax-rules, which can be expressed as a simple macro in terms of syntax-case, and it permits the same pattern language to be used even for "low-level" macros for which syntax-rules cannot be used. It also provides a mechanism for allowing intended captures—i.e., allowing hygiene to be "bent" or "broken" in a selective and straightforward manner. In addition, it handles several practical aspects of expansion that must be addressed in a real implementation, such as internal definitions and tracking of source information through macro expansion.
[
] "Syntactic abstraction in Scheme," R. Kent Dybvig, Robert Hieb, and Carl Bruggeman, Lisp and Symbolic Computation, Vol. 5, No. 4, pp. 295–326, 1993.
This all comes at a price in terms of the complexity of the expansion algorithm and the size of the code required to implement it. A study of a complete implementation in all its glory is therefore beyond the scope of this chapter. Instead, we'll investigate a simplified version of the expander that illustrates the underlying algorithm and the most important aspects of its implementation.
We'll start with a few brief syntax-case examples, adapted from the Chez Scheme Version 7 User's Guide (R. Kent Dybvig, Cadence Research Systems, 2005). Additional examples and a more detailed description of syntax-case are given in that document and in The Scheme Programming Language, Third Edition.
The following definition of or illustrates the form of a syntax-case macro definition:
(define-syntax or (lambda (x) (syntax-case x () [(_ e1 e2) (syntax (let ([t e1]) (if t t e2)))])))
The define-syntax form creates a keyword binding, associating the keyword or in this case with a transformation procedure, or transformer. The transformer is obtained by evaluating, at expansion time, the lambda expression on the righthand side of the define-syntax form. The syntax-case form is used to parse the input, and the syntax form is used to construct the output, via straightforward pattern matching. The pattern (_ e1 e2) specifies the shape of the input, with the underscore (_) marking where the keyword or appears, and the pattern variables e1 and e2 bound to the first and second subforms. The template (let([te1]) (iftte2)) specifies the output, with e1 and e2 inserted from the input.
The form (syntax template) may be abbreviated to #'template, so the previous definition may be rewritten as follows:
(define-syntax or (lambda (x) (syntax-case x () [(_ e1 e2) (syntax (let ([t e1]) (if t t e2))])))
Macros may also be bound within a single expression via letrec-syntax.
(letrec-syntax ([or (lambda (x) (syntax-case x () [(_ e1 e2) #'(let ([t e1]) (if t t e2))]))]) (or a b))
Macros can be recursive (i.e., expand into occurrences of themselves), as illustrated by the following version of or that handles an arbitrary number of subforms. Multiple syntax-case clauses are required to handle the two base cases and the recursion case:
(define-syntax or (lambda (x) (syntax-case x () [(_) #'#f] [(_ e) #'e] [(_ e1 e2 e3 ...) #'(let ([t e1]) (if t t (or e2 e3 ...)))])))
An input or output form followed by an ellipsis in the syntax-case pattern language matches or produces zero or more forms.
Hygiene is ensured for the definitions of or in this example, so that the introduced binding for t and the introduced references to let, if, and even or are scoped properly. If we want to bend or break hygiene, we do so with the procedure datum->syntax, which produces a syntax object from an arbitrary s-expression. The identifiers within the s-expression are treated as if they appeared in the original source where the first argument, the template identifier, appeared.
We can use this fact to create a simple method syntax that implicitly binds the name this to the first (object) argument:
(define-syntax method (lambda (x) (syntax-case x () [(k (x ...) e1 e2 ...) (with-syntax ([this (datum->syntax #'k 'this)]) #'(lambda (this x ...) e1 e2 ...))])))
By using the keyword k, extracted from the input, as the template variable, the variable this is treated as if it were present in the method form, so that:
(method (a) (f this a))
is treated as the equivalent of:
(lambda (this a) (f this a))
with no renaming to prevent the introduced binding from capturing the source-code reference.
The with-syntax form used in the definition of method creates local pattern-variable bindings. It is a simple macro written in terms of syntax-case:
(define-syntax with-syntax (lambda (x) (syntax-case x () [(_ ((p e0) ...) e1 e2 ...) #'(syntax-case (list e0 ...) () [(p ...) (begin e1 e2 ...)])])))
The datum->syntax procedure can be used for arbitrary expressions, as illustrated by the following definition of include:
(define-syntax include (lambda (x) (define read-file (lambda (fn k) (let ([p (open-input-file fn)]) (let f ([x (read p)]) (if (eof-object? x) (begin (close-input-port p) '()) (cons (datum->syntax k x) (f (read p)))))))) (syntax-case x () [(k filename) (let ([fn (syntax->datum #'filename)]) (with-syntax ([(e ...) (read-file fn #'k)]) #'(begin e ...)))])))
The form (include "filename") has the effect of treating the forms within the named file as if they were present in the source code in place of the include form. In addition to using datum->syntax, include also uses its inverse operator, syntax->datum, to convert the filename subform into a string it can pass to open-input-file.