Quenya – Beyond syntax

Edit – I have been looking at other alternatives to writing my own parser/language, although I did learn a lot when trying to write my own. Nemerle seems like the most likely candidate to me, given how extensible and C#-like it is. Among mainstream languages, I am also beginning to favor F#. Thanks to everyone in the comments section who pointed out various options to me.

I guess it’s finally time to let people know what I’ve been working on (and thinking about) for the past six months or so.

The driving force behind this new project is my increasing frustration with existing languages. While most of them have been crafted masterfully, there’s always something in each that isn’t available and prevents me from writing the most elegant or readable code for that compiler. With C#, my list of gripes is long and would eat up a whole post. Let’s just say that while I love C#, there are many features I wish it had. And most of the time, it’s not something that can’t be added, it’s just not in the compiler because

  • The feature probably wasn’t considered or was cut because it wasn’t an important use-case
  • The development team couldn’t add it in due to time constraints (that’s fair)
  • Or they believe it’s not something that the language needs (these are far more frustrating, because you know they’re never going to be in the language)

Again, I state these without mentioning what it is I want, which is indeed the point of this post. :)

Without further ado, I present a proposed implementation for my (.NET) compiler which I have dubbed Quenya for the time-being. I intended this compiler to be syntax agnostic (or at the very least, extremely extensible) and therefore the bulk of the idea revolves around a way for users to re-define source code and for the compiler to be able to parse it.

Before we begin, I would like to point you to this awesome post by Luke Hoban, who is one of my favorite blog authors ever. I have used the source code for his monadic parser to build Quenya’s parsing logic.

To allow Quenya to parse something, it is not enough to provide it with source code. Quenya needs a grammar/parser (or a reference to one) and the source to parse. The parser is then expected to return a set of known AST nodes when called upon to parse the source code (linked to that grammar/parser). The rest of the compiler phases then proceed as usual and the compiler spits out IL. The figure below shows an overview of the parsing process in Quenya.

quenya parsing overview

Quenya parsing overview

But it’s probably more cumbersome to write an entire parser each time you want to add or modify a single feature of the base grammar you’re using right? That’s where the monadic parser written in C# comes in. To illustrate, let’s take a simple example. We’ll write a simple parser that can parse words (alphabets) given a string and then extend it to have more features without rewriting the entire parser.

// A simpleton parser, one that doesn't understand the . (period) and that will fail if it encounters one.
public abstract class SimpleSentenceParser: CharParsers
{
    // Defines the set of all whitespace characters
	private Parser _whitespaceChars;
    public virtual Parser WhitespaceChars
    {
        get
        {
            return _whitespaceChars ??
                   (_whitespaceChars = Char(' ').OR(Char('\t').OR(Char('\n')).OR(Char('\r'))));
        }
    }
 
    // A whitespace is 0 or more repetitions of a valid Whitespace character
	private Parser _whitespace;
    public virtual Parser Whitespace
    {
        get
        {
            return _whitespace ??
                   (_whitespace = Rep(WhitespaceChars));
        }
    }
 
    // A word is multiple (alphabetic) characters that are not whitespaces
	private Parser _word;
    public virtual Parser Word
    {
        get
        {
            return _word ??
                   (_word = from w in Whitespace
                            from c in Char(char.IsLetter)
                            from word in Rep(Char(char.IsLetter))
                            select new WordTerm(word.Aggregate(new string(new[] {c}), (acc, ch) => acc + ch)));
        }
    }
 
    // A sentence is a set of words - simplistic, but will work for single sentences
	private Parser _sentence;
    public virtual Parser Sentence
    {
        get
        {
            return _sentence ??
                   (_sentence = from words in Rep(Word)
                                select new SentenceTerm(words));
        }
    }
 
    // Set of all sentences in the input (start symbol for this grammar)
	private Parser _all;
    public virtual Parser All
    {
        get
        {
            return _all ??
                   (_all = from sentence in Sentence
                           select (Term) sentence);
        }
    }
}
 
// A concrete implementation of the above parser that can accept string input
public class SimpleSentenceParser: SimpleSentenceParser
{
    public override Parser AnyChar // Pick one character at a time
    {
        get
        {
            return input => input.Length > 0 ? 
				new Result(input[0], input.Substring(1)) : 
				null;
        }
    }
}

Note that the code above is quite readable (as a grammar) due to the use of C# query expressions. Now let’s look at a parser that “extends” the capabilities of this parser.

// A slightly more intelligent parser, it can identify sentences properly
// Note that this parser does NOT redefine what a word means. It only redefines a sentence.
public abstract class BetterSentenceParser: SimpleSentenceParser
{
    private Parser _whitespaceChars;
    public override Parser WhitespaceChars // Adds to existing whitespace characters
    {
        get
        {
            return _whitespaceChars ??
                (_whitespaceChars = base.WhitespaceChars.OR(Char(',')).OR(Char(';')));
        }
    }
 
    // New, defines sentence terminators to allow this parser to understand them
	private Parser _sentenceTerminators;
    public virtual Parser SentenceTerminators
    {
        get
        {
            return _sentenceTerminators ??
                   (_sentenceTerminators = Char('.'));
        }
    }
 
    // Redefines what a sentence means and takes sentence terminators
	// into account
	private Parser _sentence;
    public override Parser Sentence
    {
        get
        {
            return _sentence ??
                   (_sentence = from words in Rep(Word)
                                from terminator in SentenceTerminators
                                select new SentenceTerm(words));
        }
    }
 
    // This now returns multiple sentences
	private Parser _all;
    public override Parser All // Actually selects multiple sentences
    {
        get
        {
            return _all ??
                   (_all = from sentences in Rep(Sentence)
                           select (Term)new ParagraphTerm(sentences));
        }
    }
}
 
// A concrete implementation of a parser that can accept string input
public class BetterSentenceParser : BetterSentenceParser
{
    public override Parser AnyChar
    {
        get
        {
            return input => input.Length > 0 ? new Result(input[0], input.Substring(1)) : null;
        }
    }
}

Ah, that looks fairly simple! The observant among you no doubt noticed that this parser extends (Sentence) and modifies existing behavior (WhitespaceChars) without having to do all the work by itself. Using the concrete version of these two parsers, we can now parse stuff!

internal class Program
{
    private const string SimpleText = @"Lorem ipsum dolor sit amet";
 
    private const string ComplexText =
        @"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus enim lorem," +
        @" tincidunt ac sollicitudin fringilla, vulputate a dolor. Sed eu viverra ligula.";
 
    private static void Main(string[] args)
    {
        // Will parse SimpleText into multiple words and one sentence
		{
			var parser = new SimpleSentenceParser();
			var result = parser.All (SimpleText);
		}
 
        // Will parse ComplexText partially, fail when it encounters the '.'
		{
			var parser = new SimpleSentenceParser();
			var result = parser.All (ComplexText);
		}
 
        // Will parse ComplexText into multiple words and multiple sentences
		{
			var parser = new ComplexSentenceParser();
			var result = parser.All (ComplexText);
		}
    }
}

Let’s now extend this concept given a fully functional C# parser definition. The user will now be able to “modify” the syntax of the language by defining his own parser that changes certain productions and/or terminals in the grammar. Add in a mechanism to link a given source file to a given grammar/parser (attribute on the parser = file extension of source) and you have Quenya, a language that encourages people to define and modify it.

Granted, this is probably tougher than just learning one language, but considering that the code you write using your DSL or modified base language will be much, much more suited to your purposes (if you use it wisely) and that most of your code will be written in the straight-up base grammar for your language makes this worth it (to me, at least – I’d be happy to hear your thought/comments). Also, people who don’t care about language extensibility can just use a vanilla grammar for the language of their choice and continue doing what they’ve been doing.

The next post on Quenya will deal with the implications of a language like this and how it can be used (or abused) in a real context.

Download the source code here: QuenyaParsing

11 comments to Quenya – Beyond syntax

Leave a Reply

  

  

  

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre lang="" line="" escaped="" cssfile="">

Recent comments