Skip to content

Conversation

@tophtucker
Copy link
Contributor

Since 2020, file.csv and file.tsv have allowed you to pass {typed: true} to apply d3.autoType, which looks line by line and tries to guess the types. More recently, the data table cell has introduced an internal inferSchema function, which instead looks at a sample of rows and picks the most frequent type.

This introduces a new {typed: "auto"} option, which calls inferSchema and then coerces all rows to that schema, allowing FileAttachment calls to match the behavior of the data table cell. It also uses that new "auto" option in loadChartDataSource.

This duplicates a couple lines of the __table function…

const types = new Map(schema.map(({name, type}) => [name, type]));

source = source.map(d => coerceRow(d, types, schema));

…but, since that has some other logic for overriding types, I didn’t bother refactoring anything for now.

Note that, since the existing code only checks for the truthiness of typed, and “auto” is truthy, this could theoretically change the behavior of existing code, if anyone has somehow already said typed: "auto". But I think that’d be very rare!!

@tophtucker tophtucker requested review from lileeyuh and mbostock March 18, 2023 22:57
Copy link
Member

@mbostock mbostock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty good but a few suggestions!

: (array ? csvParseRows : csvParse));
if (typed === "auto") {
const source = parse(text);
const schema = inferSchema(source);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is missing one of the things we talked about:

Suggested change
const schema = inferSchema(source);
const schema = inferSchema(source, source.columns);

We already know the columns thanks to dsvParse, so we don’t have to scan them again.

Even better, you could change inferSchema itself so that it peeks at source.columns rather than always defaulting to getAllKeys in the default expression for columns. That might be nice because as it is today, everyone who calls inferSchema has to remember to pass in source.columns if it exists.

Copy link
Contributor Author

@tophtucker tophtucker Mar 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops yup I forgot about that! Changed inferSchema to default columns to source.columns || getAllKeys(source)

const source = parse(text);
const schema = inferSchema(source);
const types = new Map(schema.map(({name, type}) => [name, type]));
return source.map(d => coerceRow(d, types, schema));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling source.map doesn’t propagate the source.columns field. We could expose that here, too, but even better would be to expose the source.schema that we’ve already computed so that it doesn’t have to be inferred again. (Per the DatabaseClient spec, source.schema takes precedence over source.columns, so we don’t need to expose both.)

Also, it’d be nice to have a function for this operation, like say

function coerceRows(source, schema) {
    const types = new Map(schema.map(({name, type}) => [name, type]));
    return source.map(d => coerceRow(d, types, schema)); // TODO expose schema
}

or even folding in the schema inference if necessary…

function enforceSchema(source, schema = inferSchema(source)) {
    const types = new Map(schema.map(({name, type}) => [name, type]));
    return source.map(d => coerceRow(d, types, schema)); // TODO expose schema
}

Doesn’t really matter too much though, since it’ll presumably be only called in one place. It just might be nice since you could write nice standalone tests for it. Feels like a well-defined, self-contained operation, I mean.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice 👍 Feels more readable with enforceSchema pulled out : ) and assigned the schema property to the returned array

@tophtucker
Copy link
Contributor Author

tophtucker commented Mar 19, 2023

Thanks for the comments Mike! Mostly addressed I think; maybe a couple things left to do…

  • Add test of default columns to inferSchema tests
  • Add test of enforceSchema
  • Don't explicitly pass source.columns in __table, now that it defaults to it
    • I could go either way on whether to do this; on the one hand, it's most explicit to pass it in, even though it's now the default; on the other hand, it feels silly to not use the default… so I changed it

Since inferSchema is exported, I'll leave the getAllKeys behavior when no columns property is specified, even though it seems it's not used anywhere inside stdlib.

Copy link
Member

@mbostock mbostock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but I realize we’re missing some validation here, and therefore I rescind my prior suggestion on defaults. (The defaults were just for our own convenience but make it harder to implement the correct behavior.)

return response;
}

export function enforceSchema(source, schema = inferSchema(source)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know I suggested it (hastily—you’re always welcome to cast a skeptical eye and critique my suggestions!), but I don’t like that enforceSchema now potentially ignores a source.schema if it exists, instead by default inferring one from scratch; it makes it less clear what enforceSchema(source) does. Therefore I think it would be best to drop the default

Suggested change
export function enforceSchema(source, schema = inferSchema(source)) {
export function enforceSchema(source, schema) {

and then in the place we use this function, we’d say

enforceSchema(source, inferSchema(source, source.columns))

to make clear what exactly is happening: in our one usage of it (immediately following dsvParse) we know there can’t possibly be a source.schema, but there is a source.columns.

src/table.js Outdated
}

export function inferSchema(source, columns = getAllKeys(source)) {
export function inferSchema(source, columns = source.columns || getAllKeys(source)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for changing my mind, but I’ve realized there is a danger here and therefore I recommend removing source.columns from the default. The reason being: if inferSchema uses source.columns by default, it becomes inferSchema’s responsibility to check that source.columns is a valid query result set columns array (via isQueryResultSetColumns).

It turns out that we have an existing bug here! (See https://github.com/observablehq/observablehq/issues/11103.) For example this is considered a valid data source, but crashes the data table cell when it tries to iterate over the columns here:

data = Object.assign([{foo: "a"}, {foo: "b"}], {columns: 4})

When __table calls inferSchema, we should validate source.columns before passing it:

inferSchema(source, isQueryResultSetColumns(columns) ? columns : undefined)

On the other hand, when parsing DSV, it’s fine to pass through source.columns directly without checking since we know that dsvParse will return the correct type.

(Also, a smaller point: I don’t like using the loose || in public interfaces. It’s fine to do it internally when you know e.g. that a value can only be undefined or an array/object, but in public interfaces “all bets are off” and so we have to anticipate what would happen if source.columns is not just undefined, but null, false, zero, etc. I prefer to use strict tests against undefined, and that also matches the behavior of JavaScript’s default arguments.)

Copy link
Member

@mbostock mbostock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍👍

@tophtucker
Copy link
Contributor Author

😅 🙏 😌

Reverted adding the defaults! And it now fixes https://github.com/observablehq/observablehq/issues/11103:

image

@tophtucker tophtucker merged commit 97555d6 into main Mar 20, 2023
@tophtucker tophtucker deleted the toph/file-infer-schema branch March 20, 2023 18:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants