Skip to content

Allow reporting the type of arbitrary expressions on demand #50

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jwiegley opened this issue Apr 16, 2013 · 25 comments
Closed

Allow reporting the type of arbitrary expressions on demand #50

jwiegley opened this issue Apr 16, 2013 · 25 comments

Comments

@jwiegley
Copy link

The ghc-mod utility is able to use the GHC API to determine the type of any identifier or expression within a well-formed Haskell module. At present, the ide-backend only provides the type for identifiers.

We'd like to be able to request the type of arbitrary expressions, with the understanding that such information would only be gathered as it is requested.

Pinging @snoyberg @edsko @dcoutts

@ghost ghost assigned edsko Apr 16, 2013
@snoyberg
Copy link
Member

Just to be clear, this issue is just a fact-finding ticket right now; we'd like to get an idea of how difficult it would be to provide this support so we can decide if we want to extend our feature-set to include it.

@edsko
Copy link
Contributor

edsko commented Apr 16, 2013

@edsko
Copy link
Contributor

edsko commented Apr 16, 2013

Note that this either means keeping the typechecked AST in memory at all times (very expensive) or else re-typecheck the module on every invocation of this new function (also expensive).

@edsko
Copy link
Contributor

edsko commented Apr 17, 2013

It should be possible to typecheck an arbitrary expression, just based on a textual representation that the IDE provides us with, given that we already have all identifier information. That's the theory -- practice may be different :)

@chrisdone
Copy link
Contributor

In practice ghc-mod's API allows this.

I think it would have to be re-checked every time because there's no trivial way to know what's been changed in a file's types because of one small tweak, or even across different modules… So the only advantage to keeping the checked AST around would just be more or less if they didn't change any files yet.

How expensive is it to run the type-check function? If we're already doing a Flymake-style thing (which devs will expect) there should already be a GHC session open, so there's no (additional) GHC initialization overhead. So what's the actual type-checking overhead? Also, I wonder if we can take advantage of .hi files so we don't have to re-check all dependent files unless we need to.

FWIW we also get much more than just types but also exports, instances, breakpoints and more:

λ> :i ModuleInfo
data ModuleInfo
  = GHC.ModuleInfo {GHC.minf_type_env :: TypeEnv,
                    GHC.minf_exports :: NameSet,
                    GHC.minf_rdr_env :: Maybe RdrName.GlobalRdrEnv,
                    GHC.minf_instances :: [Instance],
                    GHC.minf_iface :: Maybe ModIface,
                    GHC.minf_modBreaks :: ModBreaks}
        -- Defined in `GHC'

Could be additionally useful elsewhere in the IDE. Also notice the definition of Var:

λ> :i Id
 type Id = Var.Var   -- Defined in `Var'
λ> :i Var.Var
 data Var.Var
   = Var.TyVar {Var.varName :: !Name,
                Var.realUnique :: FastTypes.FastInt,
                Var.varType :: Kind}
   | Var.TcTyVar {Var.varName :: !Name,
                  Var.realUnique :: FastTypes.FastInt,
                  Var.varType :: Kind,
                  Var.tc_tv_details :: TcType.TcTyVarDetails}
   | Var.Id {Var.varName :: !Name,
             Var.realUnique :: FastTypes.FastInt,
             Var.varType :: Type,
             Var.idScope :: Var.IdScope,
             Var.id_details :: IdInfo.IdDetails,
             Var.id_info :: IdInfo.IdInfo}
         -- Defined in `Var'

That could also be super useful for doing lookups of reverse-dependencies (“who calls this function?”) and doing refactoring (renaming of all instances of this variable), etc. It seems to me that at some point we will have to use this part of the API (or re-implement it ourselves), if we're going to provide features that Java/C++/C# IDEs do out of the box.

Just my two cents.

@edsko
Copy link
Contributor

edsko commented Apr 17, 2013

The overhead for checking an entire module would be quite large, however ghc does also provide means of checking the type of a single expression -- and that would suffice for this case, as long as we can pass in the right typing environment. (And yes, this assumes things haven't changed. When things change another type check cycle needs to be initiated anyway.)

@chrisdone
Copy link
Contributor

Do we get the typing environment from GHC's "load" function?

@edsko
Copy link
Contributor

edsko commented Apr 17, 2013

(With a patch) we get access to the same environment that you'd get from typecheckModule, that's to say, top-level identifiers only. However, we traverse the AST and collect type information for all identifiers so we can construct our own, more complete, environment.

@chrisdone
Copy link
Contributor

(With a patch) we get access to the same environment that you'd get from typecheckModule, that's to say, top-level identifiers only.

But we get more than top-level identifiers only with typecheckModule. Here is a single file implementation of using typecheckModule to get types for local expressions. Compile like this:

$ ghc --make TypeEnvDemo.hs -package ghc -package ghc-syb-utils -o type-env-demo
[1 of 1] Compiling Main             ( src/TypeEnvDemo.hs, src/TypeEnvDemo.o )
Linking type-env-demo ...

Make a X.hs file like this:

module X where
main = putStrLn (show (read "a" * 55))

And use like this:

$ ./type-env-demo X 2 25
((2,24,2,28),"String -> Integer") # read
((2,24,2,32),"Integer") # the int arith expr
((2,24,2,37),"Integer") # the int arith expr
((2,23,2,38),"Integer") # the int arith expr
((2,18,2,38),"String") # show
((2,17,2,39),"String") # putStrLn argument
((2,8,2,39),"IO ()") # putStrLn
((2,1,2,39),"IO ()") # declaration
  • Here is where the type environment is used for expressions, which is just from the TypecheckedModule type. It gets the expressions from the tree here.
  • Here for top-level bindings.
  • Here is pattern matches.

However, we traverse the AST and collect type information for all identifiers so we can construct our own, more complete, environment.

Don't we have everything we need above? We get types for more than just identifiers, but expressions.

Anyhoo, this is the backend so I guess it's not my area of interest/expertise.

@dcoutts
Copy link
Contributor

dcoutts commented Apr 17, 2013

Hi Chris, the code you've got there is pretty similar to what we're doing already, in that it's walking over the typechecked AST and pulling out the types and spans in various places. So far we have only needed the types at the identifier leaves, but we could also collect the spans and types at expression nodes. From that we could search up/down to find the various enclosing expressions and their types.

So difficulty: we would have to alter our traversal a bit and accumulate an extra structure for the spans & types of expression nodes. We would have to accumulate that into an appropriate search structure (probably an interval tree) so that we can quickly find the minimum enclosing source span and all the enclosing spans (& corresponding types). Obviously keeping this info around would increase our memory requirements, compared to just keeping it for the leaves.

So yes, totally doable, and I'd guess roughly two weeks work.

@snoyberg
Copy link
Member

@dcoutts and @edsko Thanks for looking into the feasibility of implementing this. I'll add this to the list of features we need to prioritize and get back to you when we know we'll be ready to use this feature.

@mgsloan
Copy link
Contributor

mgsloan commented Apr 19, 2013

The really ideal situation would be that we would have types for sub-expressions even in the presence of type errors. I've asked about this before, and got the answer that it wouldn't be easy. I think it might be good to revisit the question, though. @edsko @dcoutts How hard would it be to do this? @BartoszMilewski Do we want to do this anytime soon? IMHO, it would be good to start work on this soon, because this is one area where more partialness in compilation leads to the greatest user benefit. When we have that info, we can really start powerfully helping the user resolve their type errors. Afterall, much of what we're doing when we're figuring out a tricky type error is running typechecking in our head or asking GHCI questions about subexpressions.

Originally posted here: https://github.com/fpco/fpco/issues/637#issuecomment-16634545

@BartoszMilewski
Copy link

I'd really like the leaf type information to be polished and tested with users before we do any more advanced stuff. The ultimate goal is to be able to say "Everything GHCi can do, we can do better" so let's consider it in this larger context. I'd rather fist work out REPL functionality because I bet users will be clamoring for it.

@edsko
Copy link
Contributor

edsko commented Apr 25, 2013

As of fe64d22 the internal IdMap type is now a proper interval map, and we never expose it's actual concrete representation.

@mgsloan
Copy link
Contributor

mgsloan commented Apr 27, 2013

@BartoszMilewski Well, asking GHCI for the type of a sub-expression is something you do all the time. The difference is that usually you'd have to re-construct / pun the local bindings before doing so. So having more than just leaf type info fits right into "Everything GHCi can do, we can do better". I do agree a repl would be very nice. Maybe for beta, especially if our alpha users mention it.

edsko added a commit that referenced this issue Sep 20, 2013
We record the subexpression types for identifiers, application, lambda
abstraction, (overloaded) literals, and let expressions. The approach seems to
work well and is efficient (every node in the AST is considered only once).
Just left to do is to add support for more expressions, and the order of the
"dominators" seems somewhat haphazard at the moment.
@edsko
Copy link
Contributor

edsko commented Sep 30, 2013

@snoyberg, @dcoutts, @jwiegley This feature is now ready for testing on the experimental branch. Most expression types are supported, with two notable exceptions: arrow syntax (just not implemented, can add if desired), and TH. The most important problem with TH is that the typechecked tree contains no traces of any TH splices (but only their expanded forms), and moreover, every subterm of those expansions is assigned the location of the entire original TH splice. So, given something like

module A where

qComp x xs = [| x : xs |]

and

module B where

import A

t2 = $(qComp 'a' "bc")

then we report all of [Char], [Char] (again), Char -> [Char] -> [Char], a -> [a] -> [a] and Char all for the same location (the location of the splice). If this is deemed important enough then fixing this would either require hacking around the tree (somehow recording the location of splices and modifying the subtype collection accordingly), or else hacking ghc so that we retain some information about splices in the tree.

The other problem currently is that the order of the types appears a bit random at the moment (i.e., not in domination order). I'm not sure why that is, I will look into that now.

@edsko
Copy link
Contributor

edsko commented Sep 30, 2013

@snoyberg The order of the dominators comes straight from the fingertree package. Let me know if this is an issue and whether you'd like me to spend some time on it.

@mgsloan
Copy link
Contributor

mgsloan commented Oct 1, 2013

Note: Please wait for @snoyberg to sign off on my suggestions below before doing anything about them:

@edsko Awesome!! Using dominators is fine - we can just take the last one, because lexicographic order will give us the "innermost" one. Since this stuff is stored in an interval tree, it would be straightforward to add the following APIs, right? I'd appreciate having this flexibility for future uses, even if we don't use all of it now.

  • Last dominator - this is just a network traffic optimization. It's too bad that fingertree doesn't have a purpose specific function for this. If at some point it becomes a performance problem, we can write a better function for it.
  • All dominators.
  • The results of intersections.

I think it's /ok/ for now that TH has non-ideal results. However, I have a suggestion for a cheap way to fix this problem:

  • While scanning the typechecked AST, discard type results for spans that have the same location as their parent. This is also might be a nice optimization for cases where there are multiple AST nodes for the same thing (a var containing a qualified identifier, containing an identifier - we don't want 3 type spans).
  • A further optimization would halt recursion when there's a case of "my span is the same as my parent" that is impossible in an AST that doesn't come from TH. Let's not think about this very much, for now. I believe that the number of parent-child relationships where this can be true for Exps is fairly low, though.

This would give us the type of the splice / QQ, which is what we want. It would also avoid storing a ton of information for generated code.

@snoyberg
Copy link
Member

snoyberg commented Oct 2, 2013

Thanks @edsko, I've experimented with this briefly and it seems to be working nicely. I've set up a new isolation-runner command for getting this information, I'll open up a new issue in the fpco repo to get this integrated into the UI.

@edsko
Copy link
Contributor

edsko commented Oct 3, 2013

Problem reported in https://github.com/fpco/fpco/issues/2609, looking at that now.

@edsko
Copy link
Contributor

edsko commented Oct 3, 2013

@snoyberg, @mgsloan Note that the issue of overlapping ranges in the subexpression type extends beyond just TH. For instance, consider

type family TestTypeFamily a
type instance TestTypeFamily [a] = Maybe a

t2 :: TestTypeFamily [a]
t2 = Nothing

at the location of Nothing we will get all of

forall a1. Maybe a1
Maybe a
TestTypeFamily [a]

reported, all at the same location. The first type is the polymorphic type of Nothing (which will actually be reported as Maybe a1 because we don't print explicit foralls); the second is the "monomorphic" instantiation (with a type variable); the third is the type after the type cast (because of the type family).

@mgsloan
Copy link
Contributor

mgsloan commented Oct 4, 2013

@edsko Is the information about where these different types came from available at? It's awesome that all of this information is available, but it's difficult to leverage if there's a lot of it, without ways to discriminate between the different results.

These are coming from an annotated AST, right? So it seems like it would be possible to distinguish the two cases of "coincident types, on the same AST node", and "coincident AST nodes". I'm not very familiar with how this works, though, so I could easily be wrong about that.

@edsko
Copy link
Contributor

edsko commented Oct 14, 2013

They come from traversing the AST, yes -- not sure what kind of additional information you would like us to make available though?

@mgsloan
Copy link
Contributor

mgsloan commented Oct 15, 2013

@edsko It's not additional information, but less - I'd prefer that the types from TH-generated ASTs only reported the type(s) of the outermost AST node. My proposed scheme for doing this is to discard type results when they're from a child node that has the same span as its parent - see this comment: #50 (comment)

Anyway, I don't think this is the focus for now, just something to consider at some point.

@edsko
Copy link
Contributor

edsko commented Dec 19, 2013

Closing this issue as we now support getting types for subexpressions. Please open new issues for specific problems/enhancements (for instance, the generalization of types discussed in https://github.com/fpco/fpco/issues/3043).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants