Get CSV row index #227

Closed
emily-coffin opened this issue Feb 14, 2025 · 9 comments · Fixed by #228

@emily-coffin

Is there a way to get the CSV row index to show in the transform declarations for each declaration object? This will help with future debugging if an issue occurs so we know what line within the source CSV file to refer to.

@jf-tech
Owner

jf-tech commented Feb 16, 2025

@emily-coffin No, there is currently no mechanism for passing metadata (such as a line number) from a csv2 reader to its transformation schema.

Assuming your CSV schema version is "csv2", I'm thinking about introducing a record-level _metainfo node into the idr.Node, which can contain reader-specific meta information such as the line number. To use an example:

Original schema: https://github.com/jf-tech/omniparser/blob/master/extensions/omniv21/samples/csv2/1_single_row.schema.json

Let's say we add a new 'line_num' field to the final output:

    "transform_declarations": {
        "FINAL_OUTPUT": { "xpath": ".[DATE != 'N/A']" ,"object": {
            "line_num": { "xpath": "_metainfo/record_starting_line_num" },
            "uv_index": {...},
            "date": {...},

So basically, the csv2 reader will automatically (implicitly) add a _metainfo node to the CSV idr.Node structure, and that _metainfo node will contain a sub-node called record_starting_line_num, or something like that.

Note that each format (csv2, fixedlength2, edi, etc.) will probably have different sub-nodes/fields underneath its own _metainfo node.

This design is flexible enough. The problem I don't like is discoverability: no one would know about this _metainfo node and its format-specific sub-nodes unless they read the docs very carefully, since they're not declared in the schema. The second problem is minor: a potential name collision, e.g. if someone names their own column _metainfo in the file_declaration. This isn't a big problem; it can be mitigated by schema validation that makes sure no customer-defined column names collide with built-in/system node names starting with "_".

What do you think? I need some time to think it over as well.

@emily-coffin
Author

Yes, this is exactly what I am looking for. We are using CSV2 and getting _metainfo would be super helpful.

I see what you are saying about potential conflicts. Would it help reduce conflicts if there was a way to enable/disable _metainfo? That might reduce the conflicts if someone has _metainfo as a column name. Just a thought.

@jf-tech
Owner

jf-tech commented Feb 21, 2025

@emily-coffin The PR is out. Can you clone the branch https://github.com/jf-tech/omniparser/tree/debug_setting and give it a test? Let me know if this is what you are looking for. Thanks!

@emily-coffin
Author

@jf-tech I just tested it out this morning and it works great! This is exactly what I am looking for. Thank you!

@jf-tech
Owner

jf-tech commented Feb 21, 2025

@emily-coffin Merged to master and issue closed. Please do consider a sponsorship; any size would be highly appreciated!

@emily-coffin
Author

@jf-tech I am pretty new to golang but I assume this change will not be available in the go mod until a new tag is released. Is this correct?

@jf-tech
Owner

jf-tech commented Feb 21, 2025

Not necessarily, you can always pull from the latest on master:

    go get github.com/jf-tech/omniparser@latest

@emily-coffin
Author

emily-coffin commented Feb 26, 2025

@jf-tech Sorry for taking so long to get back. I tried that and it did not work. I keep getting an error: parser_settings: Additional property debug is not allowed. I noticed the go.mod has the version at v1.0.5. That may be part of the issue, but I can't figure out how to keep it on the latest.

@emily-coffin
Author

@jf-tech I was able to get it to work. I had to use go get github.com/jf-tech/omniparser@master. Thank you for your quick responses and help on this!
