Edit – I have been looking at other alternatives to writing my own parser/language, although I did learn a lot when trying to write my own. Nemerle seems like the most likely candidate to me, given how extensible and C#-like it is. Among mainstream languages, I am also beginning to favor F#. Thanks to everyone in the comments section who pointed out various options to me.
I guess it’s finally time to let people know what I’ve been working on (and thinking about) for the past six months or so.
The driving force behind this new project is my increasing frustration with existing languages. While most of them have been crafted masterfully, there’s always something in each that isn’t available and prevents me from writing the most elegant or readable code for that compiler. With C#, my list of gripes is long and would eat up a whole post. Let’s just say that while I love C#, there are many features I wish it had. And most of the time, it’s not something that can’t be added, it’s just not in the compiler because
- The feature probably wasn’t considered or was cut because it wasn’t an important use-case
- The development team couldn’t add it in due to time constraints (that’s fair)
- Or they believe it’s not something that the language needs (these are far more frustrating, because you know they’re never going to be in the language)
Again, I state these without mentioning what it is I want, which is indeed the point of this post.
Without further ado, I present a proposed implementation for my (.NET) compiler which I have dubbed Quenya for the time-being. I intended this compiler to be syntax agnostic (or at the very least, extremely extensible) and therefore the bulk of the idea revolves around a way for users to re-define source code and for the compiler to be able to parse it.
Before we begin, I would like to point you to this awesome post by Luke Hoban, who is one of my favorite blog authors ever. I have used the source code for his monadic parser to build Quenya’s parsing logic.
To allow Quenya to parse something, it is not enough to provide it with source code. Quenya needs a grammar/parser (or a reference to one) and the source to parse. The parser is then expected to return a set of known AST nodes when called upon to parse the source code (linked to that grammar/parser). The rest of the compiler phases then proceed as usual and the compiler spits out IL. The figure below shows an overview of the parsing process in Quenya.
But it’s probably more cumbersome to write an entire parser each time you want to add or modify a single feature of the base grammar you’re using right? That’s where the monadic parser written in C# comes in. To illustrate, let’s take a simple example. We’ll write a simple parser that can parse words (alphabets) given a string and then extend it to have more features without rewriting the entire parser.
// A simpleton parser, one that doesn't understand the . (period) and that will fail if it encounters one. public abstract class SimpleSentenceParser: CharParsers { // Defines the set of all whitespace characters private Parser _whitespaceChars; public virtual Parser WhitespaceChars { get { return _whitespaceChars ?? (_whitespaceChars = Char(' ').OR(Char('\t').OR(Char('\n')).OR(Char('\r')))); } } // A whitespace is 0 or more repetitions of a valid Whitespace character private Parser _whitespace; public virtual Parser Whitespace { get { return _whitespace ?? (_whitespace = Rep(WhitespaceChars)); } } // A word is multiple (alphabetic) characters that are not whitespaces private Parser _word; public virtual Parser Word { get { return _word ?? (_word = from w in Whitespace from c in Char(char.IsLetter) from word in Rep(Char(char.IsLetter)) select new WordTerm(word.Aggregate(new string(new[] {c}), (acc, ch) => acc + ch))); } } // A sentence is a set of words - simplistic, but will work for single sentences private Parser _sentence; public virtual Parser Sentence { get { return _sentence ?? (_sentence = from words in Rep(Word) select new SentenceTerm(words)); } } // Set of all sentences in the input (start symbol for this grammar) private Parser _all; public virtual Parser All { get { return _all ?? (_all = from sentence in Sentence select (Term) sentence); } } } // A concrete implementation of the above parser that can accept string input public class SimpleSentenceParser: SimpleSentenceParser { public override Parser AnyChar // Pick one character at a time { get { return input => input.Length > 0 ? new Result(input[0], input.Substring(1)) : null; } } }
Note that the code above is quite readable (as a grammar) due to the use of C# query expressions. Now let’s look at a parser that “extends” the capabilities of this parser.
// A slightly more intelligent parser, it can identify sentences properly // Note that this parser does NOT redefine what a word means. It only redefines a sentence. public abstract class BetterSentenceParser: SimpleSentenceParser { private Parser _whitespaceChars; public override Parser WhitespaceChars // Adds to existing whitespace characters { get { return _whitespaceChars ?? (_whitespaceChars = base.WhitespaceChars.OR(Char(',')).OR(Char(';'))); } } // New, defines sentence terminators to allow this parser to understand them private Parser _sentenceTerminators; public virtual Parser SentenceTerminators { get { return _sentenceTerminators ?? (_sentenceTerminators = Char('.')); } } // Redefines what a sentence means and takes sentence terminators // into account private Parser _sentence; public override Parser Sentence { get { return _sentence ?? (_sentence = from words in Rep(Word) from terminator in SentenceTerminators select new SentenceTerm(words)); } } // This now returns multiple sentences private Parser _all; public override Parser All // Actually selects multiple sentences { get { return _all ?? (_all = from sentences in Rep(Sentence) select (Term)new ParagraphTerm(sentences)); } } } // A concrete implementation of a parser that can accept string input public class BetterSentenceParser : BetterSentenceParser { public override Parser AnyChar { get { return input => input.Length > 0 ? new Result(input[0], input.Substring(1)) : null; } } }
Ah, that looks fairly simple! The observant among you no doubt noticed that this parser extends (Sentence) and modifies existing behavior (WhitespaceChars) without having to do all the work by itself. Using the concrete version of these two parsers, we can now parse stuff!
internal class Program { private const string SimpleText = @"Lorem ipsum dolor sit amet"; private const string ComplexText = @"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus enim lorem," + @" tincidunt ac sollicitudin fringilla, vulputate a dolor. Sed eu viverra ligula."; private static void Main(string[] args) { // Will parse SimpleText into multiple words and one sentence { var parser = new SimpleSentenceParser(); var result = parser.All (SimpleText); } // Will parse ComplexText partially, fail when it encounters the '.' { var parser = new SimpleSentenceParser(); var result = parser.All (ComplexText); } // Will parse ComplexText into multiple words and multiple sentences { var parser = new ComplexSentenceParser(); var result = parser.All (ComplexText); } } }
Let’s now extend this concept given a fully functional C# parser definition. The user will now be able to “modify” the syntax of the language by defining his own parser that changes certain productions and/or terminals in the grammar. Add in a mechanism to link a given source file to a given grammar/parser (attribute on the parser = file extension of source) and you have Quenya, a language that encourages people to define and modify it.
Granted, this is probably tougher than just learning one language, but considering that the code you write using your DSL or modified base language will be much, much more suited to your purposes (if you use it wisely) and that most of your code will be written in the straight-up base grammar for your language makes this worth it (to me, at least – I’d be happy to hear your thought/comments). Also, people who don’t care about language extensibility can just use a vanilla grammar for the language of their choice and continue doing what they’ve been doing.
The next post on Quenya will deal with the implications of a language like this and how it can be used (or abused) in a real context.
Download the source code here: QuenyaParsing


Does this work with the DLR?
Sorry for the letdown, but it isn’t even a full compiler yet. This is only the prototype for its parsing system.
Sorry, do you plan to have it work with the DLR?
Perhaps not from the get-go, but I would like it to run in the DLR, yes.
Very cool. I’ve always wanted to bring Matlab/Octave-style programming to .NET – this could be a great tool to help me get started. Can’t wait to hear more!
Thanks! I am working on a basic C# 2.0 grammar and should have some bits up soon. I also welcome any offers for help on the grammar, parsing framework or unit tests.
I’ve written a DSL using parser combinators before; definitely a neat exercise although a bit of a pain to debug as the functions start nesting deeply. On the topic of unit testing–if you plan on trying to write a C# 2.0 grammer you could test with the syntax all-in-one file Kirill Osenkov posted in his blog (minus the C# 3.0+ features): http://blogs.msdn.com/b/kirillosenkov/archive/2010/10/14/c-syntax-all-in-one-file-attempt-3.aspx
It is such a shame. I am always saddened when I see talent so divided. http://xkcd.com/927/
Have you seen what they are doing in Nemerle 2.0? I have no doubt that the time you have spent has biased you to your code but why not why not join efforts instead of duplicating and fragmenting? http://rsdn.ru/article/nemerle/PegGrammar-en.xml
Thank you for the link! I’d all but forgotten about Nemerle. I will definitely look more into this – I hate duplicating efforts, too.
You should really look at Microsoft Roslyn. I think this will help you a lot, buy preventing you from reinventing the wheel.
http://msdn.microsoft.com/en-us/roslyn
Thanks, I have since dropped my plans in favor of using Nemerle or Roslyn. I really should edit the article to reflect that.