@topper-123
Contributor

pd.NA raises a TypeError if it is passed to a format string and a format spec is supplied. This differs from the behaviour of np.nan and makes converting arrays containing pd.NA to strings very brittle and annoying.

Examples:

>>> format(pd.NA)
'<NA>'  # master and PR, ok
>>> format(pd.NA, ".1f")
TypeError  # master
'<NA>'  # this PR
>>> format(pd.NA, ">5")
TypeError  # master
' <NA>'  # this PR, tries to behave like a string, then falls back to '<NA>', like np.nan

The new behaviour mirrors the behaviour of np.nan.

@jorisvandenbossche
Member

@topper-123 Thanks for looking into this!

Personally, instead of relying on a try/except of NaN to check what is supported, I would rather try to understand how and what works for NaN, and try to implement the same logic here.

For example, I suppose that format(pd.NA, ">10.1f") will fail on this branch? While for NaN this works.

Now, properly implementing __format__ manually might be too complicated though, and the "fallback" of formatting the string might already be useful anyway.

@jorisvandenbossche
Member

jorisvandenbossche commented Jun 13, 2020

Hmm, np.nan is just a float, so it uses the builtin float.__format__, I think, which is probably a bit complicated to replicate ...

Another idea: how robust would it be if we format some other value (eg np.nan), and then replace "nan" with "<NA>" in the result? We would need a bit of logic to potentially replace " nan" instead of "nan" if possible, but for the rest it might work in many cases?
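Since np.nan is a plain Python float, what NaN does with a format spec can be checked with float("nan"). The snippet below is only an illustrative sketch, not part of the PR. It also shows why the string "<NA>" cannot be formatted with a float-only spec directly:

```python
nan = float("nan")  # np.nan is the same kind of object: a plain Python float

plain = format(nan)             # 'nan'
fixed = format(nan, ".1f")      # 'nan' (the precision has no visible effect)
padded = format(nan, ">10.1f")  # right-aligned in a field of width 10

print(repr(plain), repr(fixed), repr(padded))

# A str rejects the float-only "f" type code, so formatting "<NA>"
# with ">10.1f" raises ValueError.
try:
    format("<NA>", ">10.1f")
    str_accepts_float_spec = True
except ValueError as exc:
    str_accepts_float_spec = False
    print("ValueError:", exc)
```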

@topper-123
Contributor Author

Another idea: how robust would it be if we format some other value (eg np.nan), and then replace "nan" with "<NA>" in the result?

Wouldn't work out of the box, e.g. "nantes_{}".format(np.nan). I don't think adding logic to get the correct "nan" is the right approach; it's too complicated IMO.

Another idea: pd.NA is supposed to work with all dtypes, not just floats, so it probably shouldn't be restricted to the format_specs accepted by float. How about a simple:

def __format__(self, format_spec):
    try:
        return self.__repr__().__format__(format_spec)
    except ValueError:
        return self.__repr__()

This would allow string format_specs to work (as they do for floats already) and make self.__repr__() a fallback that always works.
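To see what this does in practice, here is the same method on a small stand-in class. NAType is a hypothetical name used only for this illustration, not the pandas implementation:

```python
class NAType:
    """Hypothetical stand-in for pd.NA, used only for this illustration."""

    def __repr__(self) -> str:
        return "<NA>"

    def __format__(self, format_spec: str) -> str:
        try:
            # Treat the spec as a string spec applied to "<NA>".
            return self.__repr__().__format__(format_spec)
        except ValueError:
            # Specs that str rejects (e.g. ".1f") fall back to the bare repr.
            return self.__repr__()


NA = NAType()
print(repr(format(NA)))            # '<NA>'
print(repr(format(NA, ">5")))      # ' <NA>'
print(repr(format(NA, ".1f")))     # '<NA>'
print(repr(format(NA, ">10.1f")))  # '<NA>' (no error, but the padding is dropped)
```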

@jorisvandenbossche
Member

Wouldn't work out of the box, e.g. "nantes_{}".format(np.nan),

I don't fully know how the inner Python details of this method work, but I suppose the above would end up calling pd.NA.__format__("")? As long as the nan -> NA replacement happens inside the __format__ function, I would think the above would work fine.

How about a simple:

I think that is certainly better (avoiding only accepting the rules valid for float), but that still wouldn't work for the example I gave of format(pd.NA, ">10.1f") (I think).

(now, it's certainly already fixing a set of use cases, so could also be a good start)
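What str.format actually passes to __format__ can be checked with a small probe object. SpecProbe is a hypothetical helper written for this illustration. It records every format_spec it receives, which shows that the literal text around the braces never reaches the method:

```python
class SpecProbe:
    """Hypothetical helper that records each format_spec it is given."""

    def __init__(self) -> None:
        self.seen = []

    def __format__(self, format_spec: str) -> str:
        self.seen.append(format_spec)
        return "<NA>"


probe = SpecProbe()
first = "nantes_{}".format(probe)          # no spec: __format__ gets ""
second = "nantes_{:>10.1f}".format(probe)  # only ">10.1f" is passed along
third = f"{probe:.2f} nan"                 # f-strings go through the same hook

print(probe.seen)  # ['', '>10.1f', '.2f']
print(first)       # nantes_<NA>
```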

@jorisvandenbossche
Member

jorisvandenbossche commented Jun 13, 2020

Very quick try with

    def __format__(self, format_spec) -> str:
        res = format(np.nan, format_spec)
        return res.replace("nan", "<NA>")

works for the example you gave, and also for the example I gave:

In [1]: "nantes_{}".format(pd.NA)  
Out[1]: 'nantes_<NA>'

In [3]: format(pd.NA, ">10.1f")
Out[3]: '       <NA>'

Of course, the above still needs to 1) take the 1 char length difference into account in case there is whitespace (like the second example) and 2) still fall back to formatting with the string repr and finally the plain <NA> string repr (like your example impl at #34740 (comment)).
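Those two remaining points could be handled roughly as follows. This is only a sketch under stated assumptions: NAType is a hypothetical stand-in for pd.NA, only space padding is compensated for, and other fill characters, signs, and uppercase type codes are not treated specially.

```python
class NAType:
    """Hypothetical stand-in for pd.NA (not the pandas implementation)."""

    def __repr__(self) -> str:
        return "<NA>"

    def __format__(self, format_spec: str) -> str:
        try:
            # Let float formatting handle width, alignment and precision.
            res = format(float("nan"), format_spec)
        except ValueError:
            # Not a float spec: try it as a string spec, then give up.
            try:
                return format(repr(self), format_spec)
            except ValueError:
                return repr(self)
        # "<NA>" is one character longer than "nan", so absorb one
        # adjacent padding space when there is one.
        for old in (" nan", "nan ", "nan"):
            if old in res:
                return res.replace(old, "<NA>", 1)
        return res  # e.g. "NAN" from an uppercase type code; left as-is


NA = NAType()
print(repr(format(NA, ">10.1f")))  # '      <NA>' (total width stays 10)
print(repr(format(NA, "<10.1f")))  # '<NA>      '
print(repr(format(NA, "s")))       # '<NA>' via the string fallback
```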

@topper-123
Contributor Author

topper-123 commented Jun 13, 2020

Yeah, __format__ only works inside the brackets, so you're right there.

The length format spec would be one special case that would need to be handled, but are there others? I don't think so for floats, but there could be for other format_specs?
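One way to explore that question is to run a handful of float format specs against NaN and look at what ends up around the "nan" text. The specs below are arbitrary examples picked for this sketch, not an exhaustive list:

```python
nan = float("nan")

# Arbitrary sample of float format specs: sign, zero padding, percent,
# custom fill character, exponent notation, and the uppercase "F" type.
specs = ["+.1f", "010.1f", ".1%", "*^9", ".2e", "F"]

results = {spec: format(nan, spec) for spec in specs}
for spec, text in results.items():
    print(f"{spec!r:>10} -> {text!r}")

# Uppercase type codes produce "NAN", which a plain replace("nan", ...)
# would not catch.
needs_case_handling = [s for s, t in results.items() if "nan" not in t]
print(needs_case_handling)  # ['F']
```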

@topper-123
Contributor Author

I've made the simpler implementation that I suggested. I'm a bit worried that adding the special cases would make this too complex.

@jreback added the Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) and Output-Formatting (__repr__ of pandas objects, to_string) labels Jun 14, 2020
@jreback added this to the 1.1 milestone Jun 14, 2020
Contributor

@jreback jreback left a comment

Member

@jorisvandenbossche jorisvandenbossche left a comment

Yeah, I am fine with the simplest solution that at least fixes the basic formatting, for now. I still think it wouldn't be hard to support proper floating point / numeric formatting (with the NaN formatting and replace afterwards).

@jreback jreback merged commit 594dc2a into pandas-dev:master Jun 15, 2020
@jreback
Contributor

jreback commented Jun 15, 2020

thanks @topper-123

@topper-123 topper-123 deleted the format_na branch June 15, 2020 22:34