Add file.csv({typed: "auto"}) option that uses inferSchema #360

tophtucker · 2023-03-18T22:57:36Z

Since 2020, file.csv and file.tsv have allowed you to pass {typed: true} to apply d3.autoType, which looks line by line and tries to guess the types. More recently, the data table cell has introduced an internal inferSchema function, which instead looks at a sample of rows and picks the most frequent type.

This introduces a new {typed: "auto"} option, which calls inferSchema and then coerces all rows to that schema, allowing FileAttachment calls to match the behavior of the data table cell. It also uses that new "auto" option in loadChartDataSource.
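To make the difference between the two strategies concrete, here is a minimal sketch. The helpers `guessType`, `typeRowByRow`, and `typeByMajority` are hypothetical simplifications written for this illustration only; they are not the real d3.autoType or inferSchema, which handle more cases than numbers and strings.

```javascript
// Hypothetical, simplified stand-ins; NOT the real d3.autoType or inferSchema.
function guessType(value) {
  const s = String(value).trim();
  if (s === "") return "other";
  return Number.isNaN(Number(s)) ? "string" : "number";
}

// Row-by-row: every value is converted on its own, so a column can end up mixed.
function typeRowByRow(rows) {
  return rows.map((row) =>
    Object.fromEntries(
      Object.entries(row).map(([name, value]) => [
        name,
        guessType(value) === "number" ? Number(value) : value
      ])
    )
  );
}

// Column-wise: pick the most frequent guessed type in a sample of rows,
// then coerce every row to that one type.
function typeByMajority(rows, sampleSize = 100) {
  const sample = rows.slice(0, sampleSize);
  const names = Object.keys(rows[0] ?? {});
  const types = new Map(
    names.map((name) => {
      const counts = new Map();
      for (const row of sample) {
        const type = guessType(row[name]);
        counts.set(type, (counts.get(type) ?? 0) + 1);
      }
      const [type] = [...counts].sort((a, b) => b[1] - a[1])[0];
      return [name, type];
    })
  );
  return rows.map((row) =>
    Object.fromEntries(
      names.map((name) => [
        name,
        types.get(name) === "number" ? Number(row[name]) : String(row[name])
      ])
    )
  );
}

const rows = [{zip: "02139"}, {zip: "94110"}, {zip: "N/A"}];
const perRow = typeRowByRow(rows);     // zip values: two numbers and one string
const perColumn = typeByMajority(rows); // zip values: all numbers, the last one NaN
```

In this sketch the row-by-row pass leaves the column with mixed types, while the majority pass gives every row the same type at the cost of turning the outlier into NaN.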

This duplicates a couple lines of the __table function…

stdlib/src/table.js

Line 621 in 89f2700

const types = new Map(schema.map(({name, type}) => [name, type]));

stdlib/src/table.js

Line 630 in 89f2700

source = source.map(d => coerceRow(d, types, schema));

…but, since that has some other logic for overriding types, I didn’t bother refactoring anything for now.

Note that, since the existing code only checks for the truthiness of typed, and “auto” is truthy, this could theoretically change the behavior of existing code, if anyone has somehow already said typed: "auto". But I think that’d be very rare!!
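The compatibility concern can be sketched as follows. Both functions are hypothetical illustrations of the branching logic, not the actual FileAttachment code; the returned strings are just labels for which path would be taken.

```javascript
// Hypothetical sketch of the old behavior: only the truthiness of `typed` matters.
function describeTypedBefore(typed) {
  return typed ? "autoType each row" : "leave values as strings";
}

// Hypothetical sketch of the new behavior: "auto" is checked strictly first,
// and any other truthy value still takes the autoType path.
function describeTypedAfter(typed) {
  if (typed === "auto") return "infer schema, then coerce rows";
  return typed ? "autoType each row" : "leave values as strings";
}

console.log(describeTypedBefore("auto")); // "autoType each row"
console.log(describeTypedAfter("auto"));  // "infer schema, then coerce rows"
console.log(describeTypedAfter(true));    // "autoType each row"
```

Only callers that already passed the string "auto" would see a change; `typed: true` and `typed: false` behave as before in this sketch.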

mbostock

Pretty good but a few suggestions!

mbostock · 2023-03-19T01:27:16Z

src/fileAttachment.js

+    : (array ? csvParseRows : csvParse));
+  if (typed === "auto") {
+    const source = parse(text);
+    const schema = inferSchema(source);


This is missing one of the things we talked about:

Suggested change

-    const schema = inferSchema(source);
+    const schema = inferSchema(source, source.columns);

We already know the columns thanks to dsvParse, so we don’t have to scan them again.

Even better, you could change inferSchema itself so that it peeks at source.columns rather than always defaulting to getAllKeys in the default expression for columns. That might be nice because as it is today, everyone who calls inferSchema has to remember to pass in source.columns if it exists.

Oops yup I forgot about that! Changed inferSchema to default columns to source.columns || getAllKeys(source)

mbostock · 2023-03-19T01:31:02Z

src/fileAttachment.js

+    const source = parse(text);
+    const schema = inferSchema(source);
+    const types = new Map(schema.map(({name, type}) => [name, type]));
+    return source.map(d => coerceRow(d, types, schema));


Calling source.map doesn’t propagate the source.columns field. We could expose that here, too, but even better would be to expose the source.schema that we’ve already computed so that it doesn’t have to be inferred again. (Per the DatabaseClient spec, source.schema takes precedence over source.columns, so we don’t need to expose both.)
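The point about `source.map` can be checked with a plain array. This is a minimal sketch: the `source` value imitates the shape of a parsed DSV result (an array with an extra `columns` property), and the `schema` literal is an illustrative assumption rather than real inferSchema output.

```javascript
// An array with an extra `columns` property, like the result of a DSV parse.
const source = Object.assign([{x: "1"}, {x: "2"}], {columns: ["x"]});

// Array.prototype.map builds a new array; extra own properties are not copied.
const mapped = source.map((d) => ({x: Number(d.x)}));
console.log(mapped.columns); // undefined

// Re-attaching metadata has to be done explicitly, for example a schema.
const schema = [{name: "x", type: "number"}]; // illustrative value
const withSchema = Object.assign(mapped, {schema});
console.log(withSchema.schema === schema); // true
```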

Also, it’d be nice to have a function for this operation, like say

function coerceRows(source, schema) {
  const types = new Map(schema.map(({name, type}) => [name, type]));
  return source.map(d => coerceRow(d, types, schema)); // TODO expose schema
}

or even folding in the schema inference if necessary…

function enforceSchema(source, schema = inferSchema(source)) {
  const types = new Map(schema.map(({name, type}) => [name, type]));
  return source.map(d => coerceRow(d, types, schema)); // TODO expose schema
}

Doesn’t really matter too much though, since it’ll presumably be only called in one place. It just might be nice since you could write nice standalone tests for it. Feels like a well-defined, self-contained operation, I mean.

Nice 👍 Feels more readable with enforceSchema pulled out : ) and assigned the schema property to the returned array
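A standalone version of the extracted function might look roughly like this. It is a sketch under assumptions: `coerceRowSketch` is a hypothetical stand-in for the real coerceRow and only handles "number" and "string", and `enforceSchemaSketch` is named differently from the real function to make clear it is not the merged code.

```javascript
// Hypothetical stand-in for coerceRow; handles only "number" and "string".
function coerceRowSketch(row, types) {
  const out = {};
  for (const [name, type] of types) {
    out[name] = type === "number" ? Number(row[name]) : String(row[name]);
  }
  return out;
}

// Coerce every row to the given schema and expose the schema on the result.
function enforceSchemaSketch(source, schema) {
  const types = new Map(schema.map(({name, type}) => [name, type]));
  const rows = source.map((d) => coerceRowSketch(d, types));
  rows.schema = schema; // so later consumers do not need to infer it again
  return rows;
}

const result = enforceSchemaSketch(
  [{a: "1", b: "x"}, {a: "2", b: "y"}],
  [{name: "a", type: "number"}, {name: "b", type: "string"}]
);
console.log(result[0].a + result[1].a); // 3
```

Because the function takes plain arrays in and returns a plain array out, it can be tested without touching FileAttachment at all.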

tophtucker · 2023-03-19T13:35:27Z

Thanks for the comments Mike! Mostly addressed I think; ~~maybe a couple things left to do…~~

- Add test of default columns to inferSchema tests
- Add test of enforceSchema
- Don't explicitly pass source.columns in __table, now that it defaults to it
  - I could go either way on whether to do this; on the one hand, it's most explicit to pass it in, even though it's now the default; on the other hand, it feels silly to not use the default… so I changed it

Since inferSchema is exported, I'll leave the getAllKeys behavior when no columns property is specified, even though it seems it's not used anywhere inside stdlib.

mbostock

Sorry, but I realize we’re missing some validation here, and therefore I rescind my prior suggestion on defaults. (The defaults were just for our own convenience but make it harder to implement the correct behavior.)

mbostock · 2023-03-19T22:35:15Z

src/fileAttachment.js

  return response;
 }

+export function enforceSchema(source, schema = inferSchema(source)) {


I know I suggested it (hastily—you’re always welcome to cast a skeptical eye and critique my suggestions!), but I don’t like that enforceSchema now potentially ignores a source.schema if it exists, instead by default inferring one from scratch; it makes it less clear what enforceSchema(source) does. Therefore I think it would be best to drop the default

Suggested change

-export function enforceSchema(source, schema = inferSchema(source)) {
+export function enforceSchema(source, schema) {

and then in the place we use this function, we’d say

enforceSchema(source, inferSchema(source, source.columns))

to make clear what exactly is happening: in our one usage of it (immediately following dsvParse) we know there can’t possibly be a source.schema, but there is a source.columns.

mbostock · 2023-03-19T22:56:38Z

src/table.js

 }

-export function inferSchema(source, columns = getAllKeys(source)) {
+export function inferSchema(source, columns = source.columns || getAllKeys(source)) {


Sorry for changing my mind, but I’ve realized there is a danger here and therefore I recommend removing source.columns from the default. The reason being: if inferSchema uses source.columns by default, it becomes inferSchema’s responsibility to check that source.columns is a valid query result set columns array (via isQueryResultSetColumns).

It turns out that we have an existing bug here! (See https://github.com/observablehq/observablehq/issues/11103.) For example this is considered a valid data source, but crashes the data table cell when it tries to iterate over the columns here:

data = Object.assign([{foo: "a"}, {foo: "b"}], {columns: 4})

When __table calls inferSchema, we should validate source.columns before passing it:

inferSchema(source, isQueryResultSetColumns(columns) ? columns : undefined)

On the other hand, when parsing DSV, it’s fine to pass through source.columns directly without checking since we know that dsvParse will return the correct type.
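The failure mode and the guard can be sketched like this. `looksLikeColumns` is a hypothetical stand-in for isQueryResultSetColumns, whose real implementation may check different things; here it is assumed to mean "an array of strings".

```javascript
// Hypothetical stand-in for isQueryResultSetColumns: an array of strings.
function looksLikeColumns(columns) {
  return Array.isArray(columns) && columns.every((name) => typeof name === "string");
}

// Pass columns along only when they are valid; otherwise let the callee decide.
function columnsOrUndefined(source) {
  return looksLikeColumns(source.columns) ? source.columns : undefined;
}

const data = Object.assign([{foo: "a"}, {foo: "b"}], {columns: 4});

// Iterating over the unchecked value throws, because a number is not iterable.
let threw = false;
try {
  for (const name of data.columns) void name;
} catch (error) {
  threw = error instanceof TypeError;
}
console.log(threw); // true

// With the guard, the invalid value is replaced by undefined before use.
console.log(columnsOrUndefined(data)); // undefined
```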

(Also, a smaller point: I don’t like using the loose || in public interfaces. It’s fine to do it internally when you know e.g. that a value can only be undefined or an array/object, but in public interfaces “all bets are off” and so we have to anticipate what would happen if source.columns is not just undefined, but null, false, zero, etc. I prefer to use strict tests against undefined, and that also matches the behavior of JavaScript’s default arguments.)
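The difference between the loose `||` and a default argument is easy to see side by side. Both functions below are hypothetical and exist only to show how each form treats falsy inputs.

```javascript
const fallback = () => ["computed"];

// Loose: any falsy value (undefined, null, false, 0, "") triggers the fallback.
function withLooseOr(columns) {
  return columns || fallback();
}

// Strict: a default argument applies only when the value is undefined.
function withDefault(columns = fallback()) {
  return columns;
}

console.log(withLooseOr(0));          // ["computed"]
console.log(withDefault(0));          // 0
console.log(withDefault(null));       // null
console.log(withDefault(undefined));  // ["computed"]
```

With the strict form, an unexpected value such as 0 or null is passed through unchanged, so it can be detected and rejected instead of being silently replaced.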

mbostock

👍👍

tophtucker · 2023-03-20T18:32:13Z

😅 🙏 😌

Reverted adding the defaults! And it now fixes https://github.com/observablehq/observablehq/issues/11103.

tophtucker added 3 commits March 18, 2023 18:21

add new typed: auto option that uses inferSchema and coerceRow

6e3ddac

matches d3 api slightly better i guess

9915457

update readme

e44db6d

tophtucker requested review from lileeyuh and mbostock March 18, 2023 22:57

mbostock requested changes Mar 19, 2023

tophtucker and others added 3 commits March 19, 2023 09:11

Don't do typed: auto if array option is passed

d997cd9

Co-authored-by: Mike Bostock <[email protected]>

check for columns in inferSchema; pull out enforceSchema function which propagates schema

6dfb459

Merge branch 'toph/file-infer-schema' of https://github.com/observablehq/stdlib into toph/file-infer-schema

4665697

tophtucker added 3 commits March 19, 2023 13:25

test that inferSchema looks at source.columns

d115ef4

test enforceSchema

8cab713

dont pass columns now that its the internal default of inferSchema

babf327

mbostock requested changes Mar 19, 2023

fewer defaults, more explicit

754523b

mbostock approved these changes Mar 20, 2023

tophtucker merged commit 97555d6 into main Mar 20, 2023

tophtucker deleted the toph/file-infer-schema branch March 20, 2023 18:32
