Skip to content

[libc] Add monadic functions to cpp::expected #66523

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

gchatelet
Copy link
Contributor

This patch adds support for monadic operations to cpp::expected.

@llvmbot
Copy link
Member

llvmbot commented Sep 15, 2023

@llvm/pr-subscribers-libc

Changes

This patch adds support for monadic operations to cpp::expected.


Patch is 26.85 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/66523.diff

5 Files Affected:

  • (modified) libc/src/__support/CPP/expected.h (+437-18)
  • (modified) libc/test/src/__support/CPP/CMakeLists.txt (+10)
  • (added) libc/test/src/__support/CPP/expected_test.cpp (+272)
  • (modified) utils/bazel/llvm-project-overlay/libc/BUILD.bazel (+2)
  • (modified) utils/bazel/llvm-project-overlay/libc/test/src/__support/CPP/BUILD.bazel (+10)
diff --git a/libc/src/__support/CPP/expected.h b/libc/src/__support/CPP/expected.h
index 2fdcc5246d47318..f0140f2073731e4 100644
--- a/libc/src/__support/CPP/expected.h
+++ b/libc/src/__support/CPP/expected.h
@@ -9,43 +9,462 @@
 #ifndef LLVM_LIBC_SUPPORT_CPP_EXPECTED_H
 #define LLVM_LIBC_SUPPORT_CPP_EXPECTED_H
 
+#include "src/__support/CPP/type_traits.h"
+#include "src/__support/CPP/type_traits/always_false.h"
+#include "src/__support/CPP/utility.h"
+
+// BEWARE : This implementation is not fully conformant as it doesn't take
+// `cpp::reference_wrapper` into account.
+// It also doesn't currently implement constructors with 'cpp::in_place_t' and
+// 'cpp::unexpect_t'. As a consequence, use of 'cpp::expected<T, E>' where 'T'
+// and 'E' are the same type is likely to fail.
+
 namespace __llvm_libc::cpp {
 
 // This is used to hold an unexpected value so that a different constructor is
 // selected.
-template <class T> class unexpected {
-  T value;
+template <class E> class unexpected {
+  E err;
 
 public:
-  constexpr explicit unexpected(T value) : value(value) {}
-  constexpr T error() { return value; }
+  constexpr unexpected(const unexpected &) = default;
+  constexpr unexpected(unexpected &&) = default;
+
+  constexpr explicit unexpected(const E &e) : err(e) {}
+  constexpr explicit unexpected(E &&e) : err(cpp::move(e)) {}
+
+  constexpr const E &error() const & { return err; }
+  constexpr E &error() & { return err; }
+  constexpr const E &&error() const && { return cpp::move(err); }
+  constexpr E &&error() && { return cpp::move(err); }
 };
 
+template <class E>
+constexpr bool operator==(unexpected<E> &x, unexpected<E> &y) {
+  return x.error() == y.error();
+}
+template <class E>
+constexpr bool operator!=(unexpected<E> &x, unexpected<E> &y) {
+  return !(x == y);
+}
+
+namespace detail {
+
+// Returns exp.value() or cpp::move(exp.value()) if exp is an rvalue reference.
+// It also retuns void iff E is an instance of cpp::expected<void, E>.
+template <typename E> auto value(E &&exp) {
+  using value_type = typename cpp::remove_cvref_t<E>::value_type;
+  if constexpr (cpp::is_void_v<value_type>)
+    return;
+  else if constexpr (cpp::is_rvalue_reference_v<decltype(exp)>)
+    return cpp::move(exp.value());
+  else
+    return exp.value();
+}
+
+// Returns exp.error() or cpp::move(exp.error()) if exp is an rvalue reference.
+template <typename E> auto error(E &&exp) {
+  if constexpr (cpp::is_rvalue_reference_v<decltype(exp)>)
+    return cpp::move(exp.error());
+  else
+    return exp.error();
+}
+
+// Returns the function return type for unary functions.
+template <typename T, class F>
+struct function_return_type
+    : cpp::type_identity<cpp::remove_cvref_t<cpp::invoke_result_t<F, T>>> {};
+
+// Returns the function return type for nullary functions.
+template <class F>
+struct function_return_type<void, F>
+    : cpp::type_identity<cpp::remove_cvref_t<cpp::invoke_result_t<F>>> {};
+
+// Helper type.
+template <typename T, class F>
+using function_return_type_t = typename function_return_type<T, F>::type;
+
+// Calls 'f' either directly when E = cpp::expected<void, ...> or with value()
+// otherwise.
+template <typename E, class F> decltype(auto) invoke(E &&exp, F &&f) {
+  using T = typename cpp::remove_cvref_t<E>::value_type;
+  if constexpr (is_void_v<T>)
+    return cpp::invoke(cpp::forward<F>(f));
+  else
+    return cpp::invoke(cpp::forward<F>(f), value(cpp::forward<E>(exp)));
+}
+
+// We implement all monadic calls in a single 'apply' function below.
+// This enum helps selecting the appropriate logic.
+enum class Impl { TRANSFORM, TRANSFORM_ERROR, AND_THEN, OR_ELSE };
+
+// Handles all flavors of 'transform', 'transform_error', 'and_then' and
+// 'or_else':
+// - 'exp' can be const or non-const.
+// - 'exp' can be an 'lvalue_reference' or an 'rvalue_reference'.
+//    Note: if 'exp' is an 'rvalue_reference' its value / error is moved.
+// - 'f' can take zero or one argument.
+// - 'f' can return void or another type.
+template <Impl impl, typename E, class F> decltype(auto) apply(E &&exp, F &&f) {
+  using value_type = decltype(value(cpp::forward<E>(exp)));
+  using UnqualE = cpp::remove_cvref_t<E>;
+  using T = typename UnqualE::value_type;
+  if constexpr (impl == Impl::TRANSFORM || impl == Impl::AND_THEN) {
+    // processing value
+    using U = function_return_type_t<value_type, F>;
+    using unexpected_u = typename UnqualE::unexpected_type;
+    if constexpr (impl == Impl::TRANSFORM) {
+      static_assert(!cpp::is_reference_v<U>,
+                    "F should return a non-reference type");
+      using expected_u = typename UnqualE::template rebind<U>;
+      if (exp.has_value()) {
+        if constexpr (is_void_v<U>) {
+          invoke(cpp::forward<E>(exp), cpp::forward<F>(f));
+          return expected_u();
+        } else {
+          return expected_u(invoke(cpp::forward<E>(exp), cpp::forward<F>(f)));
+        }
+      } else {
+        return expected_u(unexpected_u(error(cpp::forward<E>(exp))));
+      }
+    }
+    if constexpr (impl == Impl::AND_THEN) {
+      if (exp.has_value())
+        return invoke(cpp::forward<E>(exp), cpp::forward<F>(f));
+      else
+        return U(unexpected_u(error(cpp::forward<E>(exp))));
+    }
+  } else if constexpr (impl == Impl::OR_ELSE || impl == Impl::TRANSFORM_ERROR) {
+    // processing error
+    using error_type = decltype(error(cpp::forward<E>(exp)));
+    using G = function_return_type_t<error_type, F>;
+    if constexpr (impl == Impl::TRANSFORM_ERROR) {
+      static_assert(!cpp::is_reference_v<G>,
+                    "F should return a non-reference type");
+      using expected_g = typename UnqualE::template rebind_error<G>;
+      using unexpected_g = typename expected_g::unexpected_type;
+      if (exp.has_value())
+        if constexpr (is_void_v<T>)
+          return expected_g();
+        else
+          return expected_g(value(cpp::forward<E>(exp)));
+      else
+        return expected_g(unexpected_g(
+            cpp::invoke(cpp::forward<F>(f), error(cpp::forward<E>(exp)))));
+    }
+    if constexpr (impl == Impl::OR_ELSE) {
+      if (exp.has_value()) {
+        if constexpr (cpp::is_void_v<T>)
+          return G();
+        else
+          return G(value(cpp::forward<E>(exp)));
+      } else {
+        return cpp::invoke(cpp::forward<F>(f), error(cpp::forward<E>(exp)));
+      }
+    }
+  } else {
+    static_assert(always_false<E>, "");
+  }
+}
+
+template <typename E, class F> decltype(auto) transform(E &&exp, F &&f) {
+  return apply<Impl::TRANSFORM>(cpp::forward<E>(exp), cpp::forward<F>(f));
+}
+
+template <typename E, class F> decltype(auto) transform_error(E &&exp, F &&f) {
+  return apply<Impl::TRANSFORM_ERROR>(cpp::forward<E>(exp), cpp::forward<F>(f));
+}
+
+template <typename E, class F> decltype(auto) and_then(E &&exp, F &&f) {
+  return apply<Impl::AND_THEN>(cpp::forward<E>(exp), cpp::forward<F>(f));
+}
+
+template <typename E, class F> decltype(auto) or_else(E &&exp, F &&f) {
+  return apply<Impl::OR_ELSE>(cpp::forward<E>(exp), cpp::forward<F>(f));
+}
+
+} // namespace detail
+
 template <class T, class E> class expected {
+  static_assert(cpp::is_trivially_destructible_v<T> &&
+                    cpp::is_trivially_destructible_v<E>,
+                "prevents dealing with deletion of the union");
+
   union {
-    T exp;
-    E unexp;
+    T val;
+    E err;
   };
-  bool is_expected;
+  bool has_val;
+
+public:
+  using value_type = T;
+  using error_type = E;
+  using unexpected_type = cpp::unexpected<E>;
+  template <class U> using rebind = cpp::expected<U, error_type>;
+  template <class U> using rebind_error = cpp::expected<value_type, U>;
+
+  constexpr expected() : expected(T{}) {}
+  constexpr expected(T val) : val(val), has_val(true) {}
+  constexpr expected(const expected &other) { *this = other; }
+  constexpr expected(expected &&other) { *this = cpp::move(other); }
+  constexpr expected(const unexpected<E> &err)
+      : err(err.error()), has_val(false) {}
+  constexpr expected(unexpected<E> &&err)
+      : err(cpp::move(err.error())), has_val(false) {}
+
+  constexpr expected &operator=(const expected &other) {
+    if (other.has_value()) {
+      val = other.value();
+      has_val = true;
+    } else {
+      err = other.error();
+      has_val = false;
+    }
+    return *this;
+  }
+  constexpr expected &operator=(expected &&other) {
+    if (other.has_value()) {
+      val = cpp::move(other.value());
+      has_val = true;
+    } else {
+      err = cpp::move(other.error());
+      has_val = false;
+    }
+    return *this;
+  }
+  constexpr expected &operator=(const unexpected<E> &other) {
+    has_val = false;
+    err = other.error();
+  }
+  constexpr expected &operator=(unexpected<E> &&other) {
+    has_val = false;
+    err = cpp::move(other.error());
+  }
+
+  constexpr bool has_value() const { return has_val; }
+  constexpr operator bool() const { return has_val; }
+
+  constexpr T &value() & { return val; }
+  constexpr const T &value() const & { return val; }
+  constexpr T &&value() && { return cpp::move(val); }
+  constexpr const T &&value() const && { return cpp::move(val); }
+
+  constexpr const E &error() const & { return err; }
+  constexpr E &error() & { return err; }
+  constexpr const E &&error() const && { return cpp::move(err); }
+  constexpr E &&error() && { return cpp::move(err); }
+
+  constexpr T *operator->() { return &val; }
+  constexpr const T &operator*() const & { return val; }
+  constexpr T &operator*() & { return val; }
+  constexpr const T &&operator*() const && { return cpp::move(val); }
+  constexpr T &&operator*() && { return cpp::move(val); }
+
+  // value_or
+  template <class U> constexpr T value_or(U &&default_value) const & {
+    return has_value() ? **this
+                       : static_cast<T>(cpp::forward<U>(default_value));
+  }
+  template <class U> constexpr T value_or(U &&default_value) && {
+    return has_value() ? cpp::move(**this)
+                       : static_cast<T>(cpp::forward<U>(default_value));
+  }
+
+  // transform
+  template <class F> constexpr decltype(auto) transform(F &&f) & {
+    return detail::transform(*this, f);
+  }
+  template <class F> constexpr decltype(auto) transform(F &&f) const & {
+    return detail::transform(*this, f);
+  }
+  template <class F> constexpr decltype(auto) transform(F &&f) && {
+    return detail::transform(*this, f);
+  }
+  template <class F> constexpr decltype(auto) transform(F &&f) const && {
+    return detail::transform(*this, f);
+  }
+
+  // transform_error
+  template <class F> constexpr decltype(auto) transform_error(F &&f) & {
+    return detail::transform_error(*this, f);
+  }
+  template <class F> constexpr decltype(auto) transform_error(F &&f) const & {
+    return detail::transform_error(*this, f);
+  }
+  template <class F> constexpr decltype(auto) transform_error(F &&f) && {
+    return detail::transform_error(*this, f);
+  }
+  template <class F> constexpr decltype(auto) transform_error(F &&f) const && {
+    return detail::transform_error(*this, f);
+  }
+
+  // and_then
+  template <class F> constexpr decltype(auto) and_then(F &&f) & {
+    return detail::and_then(*this, f);
+  }
+  template <class F> constexpr decltype(auto) and_then(F &&f) const & {
+    return detail::and_then(*this, f);
+  }
+  template <class F> constexpr decltype(auto) and_then(F &&f) && {
+    return detail::and_then(*this, f);
+  }
+  template <class F> constexpr decltype(auto) and_then(F &&f) const && {
+    return detail::and_then(*this, f);
+  }
+
+  // or_else
+  template <class F> constexpr decltype(auto) or_else(F &&f) & {
+    return detail::or_else(*this, f);
+  }
+  template <class F> constexpr decltype(auto) or_else(F &&f) const & {
+    return detail::or_else(*this, f);
+  }
+  template <class F> constexpr decltype(auto) or_else(F &&f) && {
+    return detail::or_else(*this, f);
+  }
+  template <class F> constexpr decltype(auto) or_else(F &&f) const && {
+    return detail::or_else(*this, f);
+  }
+};
+
+template <class E> class expected<void, E> {
+  static_assert(cpp::is_trivially_destructible_v<E>);
+
+  E err;
+  bool has_val;
 
 public:
-  constexpr expected(T exp) : exp(exp), is_expected(true) {}
-  constexpr expected(unexpected<E> unexp)
-      : unexp(unexp.error()), is_expected(false) {}
+  using value_type = void;
+  using error_type = E;
+  using unexpected_type = cpp::unexpected<E>;
+  template <class U> using rebind = cpp::expected<U, error_type>;
+  template <class U> using rebind_error = cpp::expected<value_type, U>;
+
+  constexpr expected() : has_val(true) {}
+  constexpr expected(const expected &other) { *this = other; }
+  constexpr expected(expected &&other) { *this = cpp::move(other); }
+  constexpr expected(const unexpected<E> &err)
+      : err(err.error()), has_val(false) {}
+  constexpr expected(unexpected<E> &&err)
+      : err(cpp::move(err.error())), has_val(false) {}
+
+  constexpr expected &operator=(const expected &other) {
+    if (other.has_value()) {
+      has_val = true;
+    } else {
+      err = other.error();
+      has_val = false;
+    }
+    return *this;
+  }
+  constexpr expected &operator=(expected &&other) {
+    if (other.has_value()) {
+      has_val = true;
+    } else {
+      err = cpp::move(other.error());
+      has_val = false;
+    }
+    return *this;
+  }
+  constexpr expected &operator=(const unexpected<E> &other) {
+    has_val = false;
+    err = other.error();
+  }
+  constexpr expected &operator=(unexpected<E> &&other) {
+    has_val = false;
+    err = cpp::move(other.error());
+  }
+
+  constexpr bool has_value() const { return has_val; }
+  constexpr operator bool() const { return has_val; }
 
-  constexpr bool has_value() { return is_expected; }
+  constexpr void value() const & {}
+  constexpr void value() && {}
 
-  constexpr T value() { return exp; }
-  constexpr E error() { return unexp; }
+  constexpr const E &error() const & { return err; }
+  constexpr E &error() & { return err; }
+  constexpr const E &&error() const && { return cpp::move(err); }
+  constexpr E &&error() && { return cpp::move(err); }
 
-  constexpr operator bool() { return is_expected; }
+  constexpr void operator*() const {}
 
-  constexpr T &operator*() { return exp; }
-  constexpr const T &operator*() const { return exp; }
-  constexpr T *operator->() { return &exp; }
-  constexpr const T *operator->() const { return &exp; }
+  // transform
+  template <class F> constexpr decltype(auto) transform(F &&f) & {
+    return detail::transform(*this, f);
+  }
+  template <class F> constexpr decltype(auto) transform(F &&f) const & {
+    return detail::transform(*this, f);
+  }
+  template <class F> constexpr decltype(auto) transform(F &&f) && {
+    return detail::transform(*this, f);
+  }
+  template <class F> constexpr decltype(auto) transform(F &&f) const && {
+    return detail::transform(*this, f);
+  }
+
+  // transform_error
+  template <class F> constexpr decltype(auto) transform_error(F &&f) & {
+    return detail::transform_error(*this, f);
+  }
+  template <class F> constexpr decltype(auto) transform_error(F &&f) const & {
+    return detail::transform_error(*this, f);
+  }
+  template <class F> constexpr decltype(auto) transform_error(F &&f) && {
+    return detail::transform_error(*this, f);
+  }
+  template <class F> constexpr decltype(auto) transform_error(F &&f) const && {
+    return detail::transform_error(*this, f);
+  }
+
+  // and_then
+  template <class F> constexpr decltype(auto) and_then(F &&f) & {
+    return detail::and_then(*this, f);
+  }
+  template <class F> constexpr decltype(auto) and_then(F &&f) const & {
+    return detail::and_then(*this, f);
+  }
+  template <class F> constexpr decltype(auto) and_then(F &&f) && {
+    return detail::and_then(*this, f);
+  }
+  template <class F> constexpr decltype(auto) and_then(F &&f) const && {
+    return detail::and_then(*this, f);
+  }
+
+  // or_else
+  template <class F> constexpr decltype(auto) or_else(F &&f) & {
+    return detail::or_else(*this, f);
+  }
+  template <class F> constexpr decltype(auto) or_else(F &&f) const & {
+    return detail::or_else(*this, f);
+  }
+  template <class F> constexpr decltype(auto) or_else(F &&f) && {
+    return detail::or_else(*this, f);
+  }
+  template <class F> constexpr decltype(auto) or_else(F &&f) const && {
+    return detail::or_else(*this, f);
+  }
 };
 
+template <class T, class E>
+constexpr bool operator==(const expected<T, E> &lhs,
+                          const expected<T, E> &rhs) {
+  if (lhs.has_value() != rhs.has_value())
+    return false;
+  if (lhs.has_value()) {
+    if constexpr (cpp::is_void_v<T>)
+      return true;
+    else
+      return *lhs == *rhs;
+  }
+  return lhs.error() == rhs.error();
+}
+
+template <class T, class E>
+constexpr bool operator!=(const expected<T, E> &lhs,
+                          const expected<T, E> &rhs) {
+  return !(lhs == rhs);
+}
+
 } // namespace __llvm_libc::cpp
 
 #endif // LLVM_LIBC_SUPPORT_CPP_EXPECTED_H
diff --git a/libc/test/src/__support/CPP/CMakeLists.txt b/libc/test/src/__support/CPP/CMakeLists.txt
index 5772a83a65cad13..24e8746b3bbc795 100644
--- a/libc/test/src/__support/CPP/CMakeLists.txt
+++ b/libc/test/src/__support/CPP/CMakeLists.txt
@@ -86,6 +86,16 @@ add_libc_test(
     libc.src.__support.CPP.optional
 )
 
+add_libc_test(
+  expected_test
+  SUITE
+    libc-cpp-utils-tests
+  SRCS
+    expected_test.cpp
+  DEPENDS
+    libc.src.__support.CPP.expected
+)
+
 add_libc_test(
   span_test
   SUITE
diff --git a/libc/test/src/__support/CPP/expected_test.cpp b/libc/test/src/__support/CPP/expected_test.cpp
new file mode 100644
index 000000000000000..c00bfb6a5ebfdc1
--- /dev/null
+++ b/libc/test/src/__support/CPP/expected_test.cpp
@@ -0,0 +1,272 @@
+//===-- Unittests for Expected --------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "src/__support/CPP/expected.h"
+#include "test/UnitTest/Test.h"
+
+using __llvm_libc::cpp::expected;
+using __llvm_libc::cpp::unexpected;
+
+enum class error { _ = 0, B };
+enum class fatal_error { _ = 0, D };
+
+auto make_error() { return unexpected<error>{error::B}; }
+
+TEST(LlvmLibcExpectedTest, DefaultCtor) {
+  {
+    expected<int, error> init;
+    EXPECT_TRUE(init.has_value());
+    EXPECT_TRUE((bool)init);
+    EXPECT_EQ(init.value(), int{});
+    EXPECT_EQ(*init, int{});
+  }
+  {
+    expected<void, error> init;
+    EXPECT_TRUE(init.has_value());
+    EXPECT_TRUE((bool)init);
+  }
+}
+
+TEST(LlvmLibcExpectedTest, ValueCtor) {
+  expected<int, error> init(1);
+  EXPECT_TRUE(init.has_value());
+  EXPECT_TRUE((bool)init);
+  EXPECT_EQ(init.value(), 1);
+  EXPECT_EQ(*init, 1);
+}
+
+TEST(LlvmLibcExpectedTest, ErrorCtor) {
+  {
+    expected<int, error> init = make_error();
+    EXPECT_FALSE(init.has_value());
+    EXPECT_FALSE((bool)init);
+    EXPECT_EQ(init.error(), error::B);
+  }
+  {
+    expected<void, error> init = make_error();
+    EXPECT_FALSE(init.has_value());
+    EXPECT_FALSE((bool)init);
+    EXPECT_EQ(init.error(), error::B);
+  }
+}
+
+// value_or
+TEST(LlvmLibcExpectedTest, ValueOr_ReuseValue) {
+  expected<int, error> init(1);
+  EXPECT_EQ(init.value_or(2), int(1));
+}
+
+TEST(LlvmLibcExpectedTest, ValueOr_UseProvided) {
+  expected<int, error> init = make_error();
+  EXPECT_EQ(init.value_or(2), int(2));
+}
+
+// transform
+
+TEST(LlvmLibcExpectedTest, Transform_HasValue) {
+  { // int to int
+    expected<int, error> init(1);
+    expected<int, error> expected_next(2);
+    expected<int, error> actual_next = init.transform([](int) { return 2; });
+    EXPECT_TRUE(actual_next == expected_next);
+  }
+  { // void to int
+    expected<void, error> init;
+    expected<int, error> expected_next(2);
+    expected<int, error> actual_next = init.transform([]() { return 2; });
+    EXPECT_TRUE(actual_next == expected_next);
+  }
+  { // int to void
+    expected<int, error> init(1);
+    expected<void, error> expected_next;
+    expected<void, error> actual_next = init.transform([](int) {});
+    EXPECT_TRUE(actual_next == expected_next);
+  }
+  { // void to void
+    expected<void, error...
[truncated]

@michaelrj-google
Copy link
Contributor

Is there a use case for adding this? This is a rather large rewrite of expected unless there is a followup planned.

@gchatelet
Copy link
Contributor Author

Is there a use case for adding this? This is a rather large rewrite of expected unless there is a followup planned.

Yes absolutely there is a use case. A more modular approach to number parsing (POC https://godbolt.org/z/zssrGdffG).
This would allow us to compose parsers and support different use cases like char vs wchar, FILE* vs C string, and the difference in error reporting. It also makes them more testable as each piece can be tested individually,

Now, TBH I didn't think implementing the monadic interface for expected would require that amount of work so I totally understand the push back (or at least the question about its usefulness). That said I think the monadic approach is valuable (testability, composability) for code that needs to deal with error propagation.

@michaelrj-google
Copy link
Contributor

I like the idea of having a more modular number parser, but I'm not sure that this monadic pattern is the best option.

In the example you provided, it doesn't seem like there would be any difference between calling and_then on the next function and just calling that function on its own. None of the intermediate functions (e.g. consume_optional_sign) can fail, and they also don't return anything, so they'll always advance to the next one. Additionally I like the idea of limiting which functions interact with which pieces of state, but the final number parsing function always needs to interact with all of them and that seems to defeat the purpose.

What I do like a lot is the idea of templating the reading mechanism, and having a way for it to propagate errors. The current reading mechanism in scanf's int converter is very ugly to handle this, see https://github.com/llvm/llvm-project/blob/main/libc/src/stdio/scanf_core/int_converter.cpp#L96 . The problem is that each read first needs to check if the max width has been hit, leading to several repetitions of that "if (max_width > 1) then read a char, else return 0" pattern. By adding another templated reader layer we could use the same code for scanf and the simple strtol that doesn't need any error checking on read at all.

@gchatelet
Copy link
Contributor Author

gchatelet commented Sep 20, 2023

In the example you provided, it doesn't seem like there would be any difference between calling and_then on the next function and just calling that function on its own. None of the intermediate functions (e.g. consume_optional_sign) can fail, and they also don't return anything, so they'll always advance to the next one.

Not "all of them" can fail but some do : assert_not_empty, consume_digits and materialize_value. You can look for return std::unexpected to find them. These ones can fail and short circuit the parsing.

Additionally I like the idea of limiting which functions interact with which pieces of state, but the final number parsing function always needs to interact with all of them and that seems to defeat the purpose.

In this design the monadic operations are used to do error propagation and make the parsing steps stand out. Not returning a value is a deliberate choice as it helps run the steps in any order and keep them as isolated as possible. A pure functional approach would require that each call returns a different type and that the state gets augmented from type to type making the order of operations carved in stone. But we may want to skip prefix or sign parsing or add a mandatory prefix / suffix or consume heading or trailing whitepaces.

In the end, the pieces of state have to be aggregated somehow, this logic is tied to a particular parsing implementation (overflow strategy, error strategy : errno / ErrorOr / error code / no error). The reusable parts are all the functions above that interact with part of the state, these parts are testable in isolation but the final operation has to depend on all of the state. There's no way around it. This design is still useful (I believe) because only the state required to take the final decision is exposed. Also note that the needed state can be different for different parser instances. For instance, we may want a parser that only deals with positive decimal strings making the state (and code) much smaller.

What I do like a lot is the idea of templating the reading mechanism, and having a way for it to propagate errors. The current reading mechanism in scanf's int converter is very ugly to handle this, see https://github.com/llvm/llvm-project/blob/main/libc/src/stdio/scanf_core/int_converter.cpp#L96 . The problem is that each read first needs to check if the max width has been hit, leading to several repetitions of that "if (max_width > 1) then read a char, else return 0" pattern. By adding another templated reader layer we could use the same code for scanf and the simple strtol that doesn't need any error checking on read at all.

This is the reasoning behind LimitingReader, it hides the complexity of dealing with the length limitation in the parser itself.
Other benefits of templated readers:

  • char type can be a template parameter, this way we support wide char automatically.
  • reader can be instrumented for tests so we can make sure that the stream was not read more than strictly required.
  • compiler can optimize through the abstraction (different code for FILE* or C string).

Downside:

  • it can lead to code bloat for embedded systems.

@michaelrj-google
Copy link
Contributor

I'm still not sure that this is the best design for a string parser.

There are some steps that need to be executed conditionally. In your example, you always run the consume_optional_prefix step, which sets the base based on the prefix. It's an important step, but only relevant when the starting base is 0 or 16. For any other base, it's unnecessary, and as it's written right now it will always treat the base as if it's 0.

Additionally, there are some steps that may succeed before the final step. As an example, in string to float, there should be a step that checks for "inf" and "nan". If one of those special cases is found, it should return immediately, since no further parsing is necessary.

I'm certain that these specific problems could be fixed, but I'm worried that fixing them will create more complexity than it removes.

@gchatelet
Copy link
Contributor Author

There are some steps that need to be executed conditionally. In your example, you always run the consume_optional_prefix step, which sets the base based on the prefix. It's an important step, but only relevant when the starting base is 0 or 16. For any other base, it's unnecessary, and as it's written right now it will always treat the base as if it's 0.

Conditional code is not a problem at all. It's quite easy to have an early return in consume_optional_prefix that will drop some of the work if unneeded.

Additionally, there are some steps that may succeed before the final step. As an example, in string to float, there should be a step that checks for "inf" and "nan". If one of those special cases is found, it should return immediately, since no further parsing is necessary.

Now that's a fair point. The choice between "inf" / "nan" and a valid float cannot be expressed directly as a single sequence of continuations. The choice is usually done by

  1. having several parsers run in lockstep (through Deterministic Finite Automaton), or
  2. use a greedy approach where we test them one by one by looking ahead and backtracking (if parsing is unambiguous).

Small digression

This second method is what's currently implemented. Unfortunately it doesn't play well with the FILE* interface which only allows for a backtracking of exactly one char. To compensate for that, the current parser copies the characters over to a char buffer. The char buffer then offers random access and enables the greedy approach.
A direct consequence of the copy though is that it needs a buffer to write to, and since the string is potentially very large we have to depend on heap allocation. This adds "out of memory" as a possible outcome of parsing a number. I believe this is unfortunate (especially for embedded) so I would like to explore the first approach instead.


Back to the main discussion, I agree that we won't be able to express the choice concept with a sequence of monadic operations but I don't think it's a big deal, we can still compose the parser out of reusable pieces. The current code could already be written like so.

pseudo code:

cpp::expected<T, error> parse_float(Reader& reader, bool is_positive) {
  CharVector buffer;
  // fill buffer with reader
  if (buffer.starts_with("inf")) return is_positive ? std::numeric_limits<T>::infinity : -std::numeric_limits<T>::infinity;
  if (buffer.starts_with("nan")) return std::numeric_limits<T>::quiet_NaN;
  CStringReader r(buffer);
  return assert_not_empty(r)
         .and_then(....);
}

To move forward (and if you agree), I will put this patch on hold and start working on a full implementation (integer and floating point). This way I think we will be in a better position to discuss the approach. In case we don't reach an agreement, I think we could still include some of the parts you found useful (templated code, limiting reader, etc...)

@michaelrj-google
Copy link
Contributor

I think that working on a more full implementation sounds like the best path for moving forward. I will warn you that it's going to be more difficult than you expect. There are a lot of strange edge cases that you will run into, especially in the differences between scanf and strto*. If you're not sure quite how something works, definitely feel free to send me questions or check out the current code. I've tried to annotate the parts that tripped me up when I was writing it.

@gchatelet gchatelet marked this pull request as draft September 26, 2023 13:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants