Use Scala parser combinator to parse CSV files

What you missed is whitespace. I threw in a couple bonus improvements.
    import scala.util.parsing.combinator._
    object CSV extends RegexParsers {
      override protected val whiteSpace = """[ \t]""".r
      def COMMA = ","
      def DQUOTE = "\""
      def DQUOTE2 = "\"\"" ^^ { case _ => "\"" }
      def CR = "\r"
      def LF = "\n"
      def CRLF … Read more

Difference between constituency parser and dependency parser

A constituency parse tree breaks a text into sub-phrases. Non-terminals in the tree are types of phrases, the terminals are the words in the sentence, and the edges are unlabeled. For a simple sentence “John sees Bill”, a constituency parse would be:
                  Sentence
                      |
        +-------------+------------+
        |                          |
   Noun Phrase                Verb Phrase
        |                          |
      John                 +-------+--------+
… Read more
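As a rough illustration of the structure described above, a constituency tree can be held as nested tuples, with phrase types as non-terminals and words as terminals. This is only a sketch: the excerpt is cut off below the Verb Phrase node, so the "Verb" and second "Noun Phrase" children are an assumption about how the tree continues, and the helper names `terminals` and `phrase_types` are made up for this example.

```python
# Non-terminal: (label, child, child, ...). Terminal: a plain string (a word).
# Edges carry no labels; they are just the nesting itself.
tree = (
    "Sentence",
    ("Noun Phrase", "John"),
    (
        "Verb Phrase",
        ("Verb", "sees"),          # assumed: not visible in the truncated excerpt
        ("Noun Phrase", "Bill"),   # assumed: not visible in the truncated excerpt
    ),
)


def terminals(node):
    """Collect the words (leaves) of the tree from left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(terminals(child))
    return words


def phrase_types(node):
    """Collect the non-terminal labels in pre-order."""
    if isinstance(node, str):
        return []
    labels = [node[0]]
    for child in node[1:]:
        labels.extend(phrase_types(child))
    return labels
```

Reading the leaves back with `terminals(tree)` gives the original sentence in order, which is the defining property of a constituency tree: every phrase covers a contiguous run of words.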

Looking for a clear definition of what a “tokenizer”, “parser” and “lexers” are and how they are related to each other and used?

A tokenizer breaks a stream of text into tokens, usually by looking for whitespace (tabs, spaces, new lines). A lexer is basically a tokenizer, but it usually attaches extra context to the tokens — this token is a number, that token is a string literal, this other token is an equality operator. A parser takes … Read more
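The tokenizer/lexer distinction can be sketched in a few lines of Python. The token categories and the names `tokenize`, `lex` and `TOKEN_TYPES` are invented for this example, and the whitespace-only splitting is deliberately naive (a string literal containing a space would be broken apart).

```python
import re


def tokenize(text):
    """Tokenizer: break the text into tokens by splitting on whitespace."""
    return text.split()


# Hypothetical token categories, tried in order.
TOKEN_TYPES = [
    ("NUMBER", re.compile(r"\d+(\.\d+)?")),
    ("STRING", re.compile(r'"[^"]*"')),
    ("EQUALS", re.compile(r"==")),
    ("IDENT", re.compile(r"[A-Za-z_]\w*")),
]


def lex(text):
    """Lexer: tokenize, then attach a type to each token."""
    tagged = []
    for tok in tokenize(text):
        for name, pattern in TOKEN_TYPES:
            if pattern.fullmatch(tok):
                tagged.append((name, tok))
                break
        else:
            # No category matched this token.
            tagged.append(("UNKNOWN", tok))
    return tagged
```

For the input `name == "bob"`, `tokenize` returns three bare strings, while `lex` returns the same three strings tagged as an identifier, an equality operator and a string literal.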

Handling extra operators in Shunting-yard

Aside from mismatched parentheses, expressions can be validated with a regular expression. (Mismatched parentheses will be caught by the shunting-yard algorithm, as indicated on the Wikipedia page, so I’m ignoring those.) The regular expression is as follows: PRE* OP POST* (INF PRE* OP POST*)* where: PRE is a prefix operator or ( POST is … Read more

How would you go about parsing Markdown? [closed]

The only Markdown implementation I know of that uses an actual parser is John MacFarlane’s peg-markdown. Its parser is based on a Parsing Expression Grammar parser generator called peg. EDIT: Mauricio Fernandez recently released his Simple Markup Markdown parser, which he wrote as part of his OcsiBlog Weblog Engine. Because the parser is written in … Read more

How does the ANTLR lexer disambiguate its rules (or why does my parser produce “mismatched input” errors)?

In ANTLR, the lexer is isolated from the parser, which means it will split the text into typed tokens according to the lexer grammar rules, and the parser has no influence on this process (it cannot say “give me an INTEGER now” for instance). It produces a token stream by itself. Furthermore, the parser doesn’t … Read more
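To make the isolation concrete, here is a small Python imitation of a lexer that runs with no input from any parser. It is not ANTLR itself; the rule names and the function `antlr_style_lex` are hypothetical. It follows the two disambiguation rules ANTLR lexers are known for: the rule matching the most characters wins, and on a tie the rule defined first wins.

```python
import re

# Hypothetical lexer rules, listed in definition order.
RULES = [
    ("INTEGER", re.compile(r"[0-9]+")),
    ("ID", re.compile(r"[A-Za-z0-9]+")),
    ("WS", re.compile(r"\s+")),
]


def antlr_style_lex(text):
    """Split text into typed tokens without consulting any parser."""
    pos = 0
    tokens = []
    while pos < len(text):
        best = None  # (rule name, end position) of the best match so far
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the earlier rule is kept.
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        name, end = best
        if name != "WS":  # whitespace is skipped, not emitted
            tokens.append((name, text[pos:end]))
        pos = end
    return tokens
```

With these rules, `123` is always typed INTEGER even though ID also matches it, because both match three characters and INTEGER is defined first. A parser rule that expects an ID at that position has no way to ask for one, which is the situation that surfaces as a "mismatched input" error.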

Parsing command line arguments in R scripts

There are three packages on CRAN:
    getopt: C-like getopt behavior
    optparse: a command line parser inspired by Python’s optparse library
    argparse: a command line optional and positional argument parser (inspired by Python’s argparse library). This package requires that a Python interpreter be installed with the argparse and json (or simplejson) modules.
Update:
    docopt: lets you … Read more