Skip to content

RUST-871 Serialize directly to BSON bytes in insert operations #406

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Aug 5, 2021

Conversation

patrickfreed
Copy link
Contributor

RUST-871

This PR leverages the bson::to_vec function that was introduced in mongodb/bson-rust#279 to greatly increase the performance of inserts. It also introduces the framework for updating other operations to skip the Document step during serialization, though that is left as future work for once the raw document API is introduced.

@@ -444,7 +444,7 @@ fn parse_ids(matches: ArgMatches) -> Vec<bool> {
ids
}

#[cfg_attr(feature = "tokio-runtime", tokio::main)]
#[cfg_attr(feature = "tokio-runtime", tokio::main(flavor = "current_thread"))]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the benchmarks are single threaded, this ends up giving a nice boost by avoiding the tokio overhead of managing a thread pool.

@@ -164,128 +169,30 @@ where
.ok_or_else(|| D::Error::custom(format!("could not deserialize u64 from {:?}", bson)))
}

pub fn doc_size_bytes(doc: &Document) -> u64 {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These were no longer needed as we now just use the length of the serialized vec.

Some(val) => val,
None => break,
};
fn num_decimal_digits(mut n: usize) -> u64 {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the old implementation of this method was incorrect actually so I rewrote it.

/// If the BSON is not properly formatted, an internal error would be returned.
///
/// TODO: RUST-924 replace this with raw document API usage.
pub(crate) fn raw_get(doc: &[u8], key: &str) -> Result<Option<Bson>> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this function needs to live in the driver for now as we need to be able to comb through the serialized command for _id values, but it can go away shortly once RUST-924 is complete.

Error::from(ErrorKind::Internal {
message: "failed to read from serialized document".to_string(),
})
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documents queried here should always be valid as they're coming straight out of our serializer, so any error encountered here is mapped to an internal error.

@patrickfreed patrickfreed force-pushed the RUST-871/raw-serialize branch from f9597f7 to e2db5d7 Compare August 2, 2021 20:20

#[derive(Debug)]
pub(crate) struct Insert<T> {
pub(crate) struct Insert<'a, T> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously this T was always actually a reference. This now makes it required to be a reference, which allows us to clone the input.


let doc_size = bson_util::array_entry_size_bytes(i, &doc);
let mut doc = bson::to_vec(d)?;
let id = match bson_util::raw_get(doc.as_slice(), "_id")? {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we can't inspect the T for an _id without requiring it implement a trait, we scan through the raw document for it instead.

Some(b) => b,
None => {
let oid = ObjectId::new();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's not in there, we write the new element into the beginning of the document. Once we have the raw document API we could update this to use it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a TODO comment for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Ok(Command::new("insert".to_string(), self.ns.db.clone(), body))
}

fn serialize_command(&mut self, cmd: Command<Self::Command>) -> Result<Vec<u8>> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we already have raw BSON for the documents, we can't just use serde to serialize it, as it'll attempt to do so as a a Binary. Instead, we serialize the rest of the command and manually append the serialized documents. Again, this will be easier with the raw document API.

@patrickfreed patrickfreed marked this pull request as ready for review August 2, 2021 21:17
type Response = CommandResponse<WriteConcernOnlyBody>;

const NAME: &'static str = "abortTransaction";

fn build(&mut self, _description: &StreamDescription) -> Result<Command> {
fn build(&mut self, _description: &StreamDescription) -> Result<Command<Self::Command>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the addition of Self::Command here is necessary because the default T for Command is already Document -- can we revert this change or change the rest of the operations whose build signature wasn't updated (e.g. ListCollections) for consistency?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Some(b) => b,
None => {
let oid = ObjectId::new();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a TODO comment for this?

pub server_api: Option<ServerApi>,
}

fn default_uri() -> String {
DEFAULT_URI.clone()
}

pub fn deserialize_server_api<'de, D>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this custom implementation needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The server expects the fields to be named apiVersion, apiStrict, and apiDeprecationErrors, but the test format drops all of the api- prefixes. Now that we're using the Serialize implementation for ServerApi in the driver, we needed to introduce a custom one for deserializing from the test format.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is only used within this test file, but for clarity can we rename this serialization function? Maybe deserialize_server_api_for_testing or deserialize_server_api_test_format?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah in that case I think you should be able to change the rename attributes on the fields in ServerApi to #[serde(rename(serialize = "apiVersion"))] so that the renames only apply to serialization, and then add #[serde(rename_all = "camelCase", deny_unknown_fields)] to the top of the struct for deserialization.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh cool, I didn't know that was possible. Unfortunately, it looks like we'll still need the separate implementation though, since deny_unknown_fields can't be used with flatten (see https://serde.rs/attr-flatten.html). flatten is needed with server API since they're all top level fields of commands rather than in a subdocument.

I did update the fuinction to be more precisely named (went with the test_format suggestion)

Copy link
Contributor

@NBSquare NBSquare left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Only a few small things from me.


let serialized = op.serialize_command(cmd)?;
let raw_cmd = RawCommand {
name: cmd_name.clone(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell this is the only place cmd_name is being used, so this seems like a spurious clone to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought so too, but it's actually also used below when emitting the command succeeded / failed event.

pub server_api: Option<ServerApi>,
}

fn default_uri() -> String {
DEFAULT_URI.clone()
}

pub fn deserialize_server_api<'de, D>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is only used within this test file, but for clarity can we rename this serialization function? Maybe deserialize_server_api_for_testing or deserialize_server_api_test_format?

@patrickfreed patrickfreed merged commit 7f03ec5 into mongodb:master Aug 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants