Add file.csv({typed: "auto"}) option that uses inferSchema #360
Conversation
Pretty good but a few suggestions!
src/fileAttachment.js (outdated)

    : (array ? csvParseRows : csvParse));
  if (typed === "auto") {
    const source = parse(text);
    const schema = inferSchema(source);
This is missing one of the things we talked about:
- const schema = inferSchema(source);
+ const schema = inferSchema(source, source.columns);
We already know the columns thanks to dsvParse, so we don’t have to scan them again.
Even better, you could change inferSchema itself so that it peeks at source.columns rather than always defaulting to getAllKeys in the default expression for columns. That might be nice because as it is today, everyone who calls inferSchema has to remember to pass in source.columns if it exists.
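For context, d3’s csvParse already exposes the header on a columns property of the array it returns, which is what this suggestion relies on. A quick illustration (not part of this PR):

```js
import {csvParse} from "d3-dsv";

// csvParse returns an array of row objects and attaches the header names as a
// `columns` property, so inferSchema can reuse it instead of scanning every row.
const source = csvParse("foo,bar\na,1\nb,2");
console.log(source.columns); // ["foo", "bar"]
console.log(source[0]); // {foo: "a", bar: "1"} (values are strings until typed)
```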
Oops yup I forgot about that! Changed inferSchema to default columns to source.columns || getAllKeys(source)
src/fileAttachment.js (outdated)

  const source = parse(text);
  const schema = inferSchema(source);
  const types = new Map(schema.map(({name, type}) => [name, type]));
  return source.map(d => coerceRow(d, types, schema));
Calling source.map doesn’t propagate the source.columns field. We could expose that here, too, but even better would be to expose the source.schema that we’ve already computed so that it doesn’t have to be inferred again. (Per the DatabaseClient spec, source.schema takes precedence over source.columns, so we don’t need to expose both.)
Also, it’d be nice to have a function for this operation, like say

function coerceRows(source, schema) {
  const types = new Map(schema.map(({name, type}) => [name, type]));
  return source.map(d => coerceRow(d, types, schema)); // TODO expose schema
}

or even folding in the schema inference if necessary…

function enforceSchema(source, schema = inferSchema(source)) {
  const types = new Map(schema.map(({name, type}) => [name, type]));
  return source.map(d => coerceRow(d, types, schema)); // TODO expose schema
}

Doesn’t really matter too much though, since it’ll presumably only be called in one place. It just might be nice since you could write standalone tests for it. Feels like a well-defined, self-contained operation, I mean.
Nice 👍 Feels more readable with enforceSchema pulled out : ) I also assigned the schema property to the returned array.
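At this point the extracted helper looks roughly like the following (a sketch pieced together from the snippets above; coerceRow and inferSchema are the existing stdlib helpers discussed in this thread):

```js
// Sketch: coerce every row to the schema and expose the schema on the returned
// array (via Object.assign) so downstream consumers don't have to re-infer it.
export function enforceSchema(source, schema = inferSchema(source)) {
  const types = new Map(schema.map(({name, type}) => [name, type]));
  return Object.assign(source.map(d => coerceRow(d, types, schema)), {schema});
}
```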
Thanks for the comments, Mike! Mostly addressed, I think.
Since inferSchema is exported, I’ll leave the getAllKeys behavior when no columns property is specified, even though it doesn’t seem to be used anywhere inside stdlib.
Sorry, but I realize we’re missing some validation here, and therefore I rescind my prior suggestion on defaults. (The defaults were just for our own convenience but make it harder to implement the correct behavior.)
src/fileAttachment.js (outdated)

  return response;
}

export function enforceSchema(source, schema = inferSchema(source)) {
I know I suggested it (hastily; you’re always welcome to cast a skeptical eye and critique my suggestions!), but I don’t like that enforceSchema now potentially ignores a source.schema if it exists, inferring one from scratch by default instead; it makes it less clear what enforceSchema(source) does. Therefore I think it would be best to drop the default:
- export function enforceSchema(source, schema = inferSchema(source)) {
+ export function enforceSchema(source, schema) {

and then in the place we use this function, we’d say

enforceSchema(source, inferSchema(source, source.columns))

to make clear what exactly is happening: in our one usage of it (immediately following dsvParse) we know there can’t possibly be a source.schema, but there is a source.columns.
src/table.js (outdated)

}

- export function inferSchema(source, columns = getAllKeys(source)) {
+ export function inferSchema(source, columns = source.columns || getAllKeys(source)) {
Sorry for changing my mind, but I’ve realized there is a danger here and therefore I recommend removing source.columns from the default. The reason being: if inferSchema uses source.columns by default, it becomes inferSchema’s responsibility to check that source.columns is a valid query result set columns array (via isQueryResultSetColumns).
It turns out that we have an existing bug here! (See https://github.com/observablehq/observablehq/issues/11103.) For example, this is considered a valid data source, but it crashes the data table cell when it tries to iterate over the columns:

data = Object.assign([{foo: "a"}, {foo: "b"}], {columns: 4})

When __table calls inferSchema, we should validate source.columns before passing it:

inferSchema(source, isQueryResultSetColumns(columns) ? columns : undefined)

On the other hand, when parsing DSV, it’s fine to pass through source.columns directly without checking, since we know that dsvParse will return the correct type.
(Also, a smaller point: I don’t like using the loose || in public interfaces. It’s fine to do it internally when you know e.g. that a value can only be undefined or an array/object, but in public interfaces “all bets are off” and so we have to anticipate what would happen if source.columns is not just undefined, but null, false, zero, etc. I prefer to use strict tests against undefined, and that also matches the behavior of JavaScript’s default arguments.)
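A small illustration of the difference between the two patterns (hypothetical code, not from this PR; getAllKeys stands in for stdlib’s key-scanning helper):

```js
// Stand-in for stdlib's helper that scans every row for its keys.
const getAllKeys = (rows) => Array.from(new Set(rows.flatMap((d) => Object.keys(d))));

// A malformed source: columns is present but not a valid array of names.
const source = Object.assign([{foo: "a"}, {foo: "b"}], {columns: 0});

// Loose ||: the falsy-but-present columns value is silently papered over.
console.log(source.columns || getAllKeys(source)); // ["foo"]

// Strict default parameter: only undefined falls through, so the bad value is
// passed along where it can be validated (or rejected) explicitly.
const pickColumns = (columns = getAllKeys(source)) => columns;
console.log(pickColumns(source.columns)); // 0
```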
👍👍
😅 🙏 😌 Reverted adding the defaults! And it now fixes https://github.com/observablehq/observablehq/issues/11103:

Description

Since 2020, file.csv and file.tsv have allowed you to pass {typed: true} to apply d3.autoType, which goes row by row and tries to guess the types. More recently, the data table cell has introduced an internal inferSchema function, which instead looks at a sample of rows and picks the most frequent type for each column.
This introduces a new {typed: "auto"} option, which calls inferSchema and then coerces all rows to that schema, allowing FileAttachment calls to match the behavior of the data table cell. It also uses that new "auto" option in loadChartDataSource.
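In notebook code, usage would look something like this (the file name is hypothetical):

```js
// Existing behavior: d3.autoType applied row by row.
const typedRows = await FileAttachment("temperatures.csv").csv({typed: true});

// New behavior: infer a schema from a sample of rows (as the data table cell
// does), then coerce every row to that schema.
const autoRows = await FileAttachment("temperatures.csv").csv({typed: "auto"});
```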
This duplicates a couple of lines of the __table function (stdlib/src/table.js, lines 621 and 630 at 89f2700), but since that has some other logic for overriding types, I didn’t bother refactoring anything for now.
Note that, since the existing code only checks for the truthiness of typed, and “auto” is truthy, this could theoretically change the behavior of existing code if anyone has somehow already said typed: "auto". But I think that’d be very rare!!
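A simplified sketch of that compatibility point (parseCsvOld and parseCsvNew are hypothetical wrappers, and inferAndCoerce is a placeholder for the inferSchema + enforceSchema path added in this PR):

```js
import {csvParse, autoType} from "d3-dsv";

// Placeholder for the new schema path (inferSchema + enforceSchema in the PR).
const inferAndCoerce = (source) => source;

// Previously, any truthy `typed` value selected autoType, so a value like
// "auto" would have behaved exactly like {typed: true}.
function parseCsvOld(text, {typed = false} = {}) {
  return csvParse(text, typed ? autoType : undefined);
}

// With this PR, "auto" is matched explicitly before the truthy fallback, so
// {typed: "auto"} now takes the schema-inference path instead.
function parseCsvNew(text, {typed = false} = {}) {
  if (typed === "auto") return inferAndCoerce(csvParse(text));
  return csvParse(text, typed ? autoType : undefined);
}
```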