Chumsky: Fixing Parsing Issues With Choice And Filter
If you're diving into the world of parsing with Rust, you've likely come across the Chumsky crate. It's a fantastic tool for building parsers, but sometimes you might encounter unexpected issues. This article delves into a common problem: when combining choice and filter in Chumsky, parsing can break. We'll explore a real-world scenario, dissect the problem, and offer solutions to get your parser back on track.
The Parsing Predicament: Choice and Filter in Chumsky
When constructing a parser, you often need to handle multiple possibilities. The choice combinator in Chumsky allows you to try different parsing paths. The filter combinator lets you conditionally accept certain inputs. However, when these two are used together, you might face unexpected parsing failures. Let's understand with an example.
A Real-World Scenario
Consider a simplified math parser. The goal is to parse either numbers or words. Here’s a snippet of Rust code using Chumsky that demonstrates the issue:
use chumsky::{extra::ParserExtra, label::LabelError, prelude::*, text::{inline_whitespace, TextExpected}};
#[derive(Debug, PartialEq)]
pub enum MathTree {
Number(String),
Word(String),
}
impl MathTree {
fn number_parser<'a, E>() -> impl Parser<'a, &'a str, String, E>
where
E: ParserExtra<'a, &'a str>,
E::Error: LabelError<'a, &'a str, TextExpected<'a, &'a str>>
{
just('-')
.or_not()
.then(
inline_whitespace()
.ignore_then(
any().filter(|ch: &char| *ch == '.' || ch.is_ascii_digit())
)
.repeated()
.collect()
)
.map(|(sign, mut str)| {
if let Some(sign) = sign {
str = format!("{sign}{str}");
}
str
})
}
fn word_parser<'a, E>() -> impl Parser<'a, &'a str, String, E>
where
E: ParserExtra<'a, &'a str>
{
any()
//.filter(|ch: &char| ch.is_alphabetic())
.repeated()
.collect()
}
pub(super) fn parser<'a, E>() -> impl Parser<'a, &'a str, Self, E>
where
E: ParserExtra<'a, &'a str>,
E::Error: LabelError<'a, &'a str, TextExpected<'a, &'a str>>
{
let number_parser = Self::number_parser().map(Self::Number);
let word_parser = Self::word_parser().map(Self::Word);
choice((number_parser, word_parser))
}
}
#[cfg(test)]
mod tests {
use super::*;
use std::{borrow::Borrow, fmt::Debug};
type Err<'a> = chumsky::extra::Err<Rich<'a, char>>;
#[track_caller]
fn test_parse<'a, T, U, P>(parser: P, input: &'a str, value: &U)
where
T: Borrow<U>,
U: Debug + PartialEq + ?Sized,
P: Parser<'a, &'a str, T, Err<'a>>
{
assert_eq!(
parser.then_ignore(end()).parse(input).unwrap().borrow(),
value
);
}
#[test] // working
fn test_parse_simple_float() {
test_parse(MathTree::number_parser(), "1.2", "1.2");
test_parse(MathTree::parser(), "1.2", &MathTree::Number("1.2".into()));
}
#[test] // working
fn test_parse_negative_float() {
test_parse(MathTree::number_parser(), "-1.2", "-1.2");
test_parse(MathTree::parser(), "-1.2", &MathTree::Number("-1.2".into()));
}
#[test] // broken
fn test_parse_char() {
test_parse(MathTree::word_parser(), "x", "x");
test_parse(MathTree::parser(), "x", &MathTree::Word("x".into()));
}
}
In this setup, we have two parsers: number_parser and word_parser. The number_parser tries to parse numerical values, while the word_parser aims to parse words. The main parser combines these using choice. When testing this parser with the input "x", the test fails. Let’s investigate why.
Diagnosing the Issue
The error message indicates that the parser expected certain inputs ('-', inline whitespace, something else, or end of input) but found 'x'. This suggests the number_parser partially consumed the input before failing, preventing the word_parser from being attempted.
The problem lies in how filter interacts with repeated and choice. The number_parser uses filter to accept only digits and dots. When it encounters 'x', the filter causes the parser to fail. However, because repeated might have consumed some characters before encountering 'x', the choice combinator doesn't backtrack to try the word_parser.
Breaking Down the Code
number_parser: This parser looks for an optional minus sign, followed by digits and dots. Thefilterwithinany().filter(...).repeated()is the culprit.word_parser: This parser is designed to accept any character, but it never gets a chance when thenumber_parserfails after consuming part of the input.parser: This combinesnumber_parserandword_parserusingchoice. The issue arises here because of the behavior ofchoicewhen one of the parsers fails after partial consumption.
Solutions to the Parsing Problem
To resolve this, we need to ensure that a failed parse in number_parser doesn't prevent word_parser from running. Here are a couple of strategies:
1. Using rewind()
One effective solution is to use rewind() after the number_parser. This combinator ensures that if number_parser fails, the input stream is reset to its original position, allowing word_parser to attempt parsing.
use chumsky::Parser;
impl MathTree {
pub(super) fn parser<'a, E>() -> impl Parser<'a, &'a str, Self, E>
where
E: ParserExtra<'a, &'a str>,
E::Error: LabelError<'a, &'a str, TextExpected<'a, &'a str>>
{
let number_parser = Self::number_parser().map(Self::Number).rewind();
let word_parser = Self::word_parser().map(Self::Word);
choice((number_parser, word_parser))
}
}
By adding .rewind(), we instruct Chumsky to reset the parsing position if number_parser fails. This gives word_parser a fair chance to parse the input.
2. Refactoring the number_parser
Another approach is to modify number_parser to fail more cleanly. Instead of using filter within repeated, we can use filter on the entire collected string. This way, the parser either fully matches a number or fails without consuming part of the input.
use chumsky::Parser;
impl MathTree {
fn number_parser<'a, E>() -> impl Parser<'a, &'a str, String, E>
where
E: ParserExtra<'a, &'a str>,
E::Error: LabelError<'a, &'a str, TextExpected<'a, &'a str>>
{
just('-')
.or_not()
.then(
inline_whitespace()
.ignore_then(
any()
.repeated()
.collect::<String>()
.filter(|s: &String| s.chars().all(|ch| ch == '.' || ch.is_ascii_digit()))
)
)
.map(|(sign, str)| {
let mut str = str;
if let Some(sign) = sign {
str = format!("{sign}{str}");
}
str
})
}
}
In this refactored version, the filter is applied to the entire collected string, ensuring that the parser only succeeds if the whole string is a valid number. If not, it fails cleanly, allowing choice to try the word_parser.
Implementing the Solutions
Let's apply the rewind() solution to the original code. Here’s the modified parser function:
use chumsky::Parser;
impl MathTree {
pub(super) fn parser<'a, E>() -> impl Parser<'a, &'a str, Self, E>
where
E: ParserExtra<'a, &'a str>,
E::Error: LabelError<'a, &'a str, TextExpected<'a, &'a str>>
{
let number_parser = Self::number_parser().map(Self::Number).rewind();
let word_parser = Self::word_parser().map(Self::Word);
choice((number_parser, word_parser))
}
}
With this change, the test case test_parse_char should now pass. The rewind() ensures that if number_parser fails on "x", the word_parser gets a chance to parse it.
Alternatively, we can refactor the number_parser as shown above. This approach avoids the need for rewind() by ensuring that number_parser either fully parses a number or fails cleanly.
Testing the Fix
After applying either solution, it’s crucial to run the tests to ensure the parser now works correctly. The test_parse_char test, which previously failed, should now pass, and all other tests should continue to pass.
Here’s the expected output after applying the fix:
test parser::math::tests::test_parse_char ... ok
This confirms that the parsing issue has been successfully resolved.
Key Takeaways
- Combining
choiceandfilterin Chumsky can lead to unexpected parsing failures if one parser partially consumes input before failing. - Using
rewind()is a straightforward way to ensure thatchoicecan backtrack and try other parsers. - Refactoring parsers to fail more cleanly (e.g., by filtering the entire result instead of individual characters) can also resolve the issue.
- Always write tests to verify that your parsers behave as expected, especially when combining different combinators.
Conclusion
Parsing libraries like Chumsky offer powerful tools for building complex parsers. However, understanding how different combinators interact is essential for avoiding common pitfalls. By using rewind() or refactoring your parsers, you can effectively handle situations where choice and filter might cause parsing to break.
By understanding these nuances, you can leverage Chumsky to create robust and reliable parsers for various applications. Remember to test your parsers thoroughly to catch any unexpected behavior and ensure they meet your requirements.
For more information on Chumsky and parsing techniques, check out the official Chumsky documentation and other resources on Rust parsing. You can find valuable information and examples at the Chumsky GitHub repository. Happy parsing!