Skip to content

Database.PostgreSQL.Simple.Internal.exec: resource vanished (failed to fetch file descriptor) #168

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
deem0n opened this issue Nov 5, 2015 · 11 comments

Comments

@deem0n
Copy link

deem0n commented Nov 5, 2015

Hello,

This issue was already discussed: #117
The conclusion was that it is due to some websocket related code which used pg socket and sent unexpected by PostgreSQL server ping packets.

I have quite simple Haskell program which just listen to messages and calls plpgsql procedure on each message. No websocket or any other additional socket related code.

processAuditMessage :: Connection -> Notification -> IO ()
processAuditMessage con n = do
  let json = notificationData n
  let sql = "SELECT process_message(?)"
  res <- catchSQLErr (query con sql (Only json) :: IO [Only Bool]) $ \e -> do
    System.IO.hPutStr stderr "SQL Error: "
    UTF8.hPutStrLn stderr $ sqlErrorMsg e
    UTF8.hPutStrLn stderr json
    return [Only False]
  return ()

runLoggerLoop :: Connection -> String -> IO ()
runLoggerLoop con ch = listen >> loop
  where
    listen = execute con "LISTEN ?" [Identifier (fromString ch)]
    loop = forever $
      getNotification con >>= processAuditMessage con

After about 30 hours of work it died with
Database.PostgreSQL.Simple.Internal.exec: resource vanished (failed to fetch file descriptor)

Any idea what may be a reason for the error and how to prevent it ?

@lpsmith
Copy link
Owner

lpsmith commented Nov 5, 2015

Well, typically it means that the connection vanished, without receiving an error message from the database backend. postgresql-simple will generate a IOError in this case, instead of a SQLError. How are you connecting to your database? Via unix sockets, loopback, or over a network? If it's over a network, do you know much about it? Is it a local network, or are you going across a wider-area network?

Any time you go over anything more than loopback, and especially a wide-area network, you probably need to write your logger to be tolerant to network hiccups leading to a loss of connection (and thus potentially also lost notifications that were sent while your logger was away, if that's important).

postgresql-simple can also produce a Database.PostgreSQL.Simple.Notification.getNotification: resource vanished (failed to fetch file descriptor) IOError, so your exception would seem to be coming from the SELECT process_message(?) query. If it's failing consistently there, that's somewhat curious. I suppose it's entirely possible that there is a bug in postgresql-simple that's swallowing a postgres protocol error message that is received shortly before the server hangs up.

I'd also check your server's logs for any messages, if you can. And, if this is happening consistently, then running a packet capture on the client and/or server can be very enlightening. (And wireshark has a postgresql protocol dissector, so I highly recommend using wireshark for analyzing the capture after the fact.)

@deem0n
Copy link
Author

deem0n commented Nov 9, 2015

Hello,
connection is via loopback interface on localhost (127.0.0.1) without SSL. I'm a haskell newbie, so it is quite hard for me to add verbose logging on IOError exceptions. How it would look like ?

Currently my code does some error checks (as shown in my previous message), but there was no SQL Error: part in the error log, so I assume that text
Database.PostgreSQL.Simple.Internal.exec: resource vanished (failed to fetch file descriptor)
was caused by some exception which I did not catch.

Also, haskell application did exit immediately after this error, which is something I want to avoid.

@deem0n
Copy link
Author

deem0n commented Nov 9, 2015

Here is my stack list-dependencies

aeson 0.8.0.2
array 0.5.1.0
attoparsec 0.12.1.6
base 4.8.1.0
binary 0.7.5.0
blaze-builder 0.4.0.1
blaze-textual 0.2.1.0
byteable 0.1.1
bytestring 0.10.6.0
case-insensitive 1.2.0.5
containers 0.5.6.2
cryptohash 0.11.6
deepseq 1.4.1.1
dlist 0.7.1.2
ghc-prim 0.4.0.0
hashable 1.2.3.3
integer-gmp 1.0.0.0
mtl 2.2.1
network-info 0.2.0.8
old-locale 1.0.0.7
postgresql-libpq 0.9.1.1
postgresql-simple 0.4.10.0
primitive 0.6.1.0
random 1.1
scientific 0.3.4.2
syb 0.5.1
template-haskell 2.10.0.0
text 1.2.1.3
time 1.5.0.1
transformers 0.4.2.0
unordered-containers 0.2.5.1
uuid 1.3.11
uuid-types 1.0.2
vector 0.10.12.3

PostgreSQL server is 9.3.9

@lpsmith
Copy link
Owner

lpsmith commented Nov 10, 2015

Well, what I'm saying is (if necessary) turn on logging on the postgresql server side of things, and check the error logs there too.

This particular exception is an IOError, which you can certainly catch yourself on the client side of things. In particular, IIRC, this is raised when your SELECT process_message(?) query eventually calls postgresql-simple's exec, which in turn calls PQsocket and this function returns an invalid file descriptor. Perhaps we need to check for a PQerrorMessage at this point.

@lpsmith
Copy link
Owner

lpsmith commented Nov 10, 2015

Also, while it's certainly a worthy goal to write a daemon that never crashes, in practice you probably still want to have a supervisor process that will detect when a daemon crashes and restarts it automatically. There are a number of options, but both upstart and systemd can natively handle simple single-process supervision like this.

@deem0n
Copy link
Author

deem0n commented Nov 11, 2015

Thanks, very helpful suggestions.

May be this discussion would be interesting and put some light on the problem.

@lpsmith
Copy link
Owner

lpsmith commented Nov 11, 2015

Well, I didn't read the whole thread carefully, but I don't think we have any evidence that the current issue is related to the psycopg one that you linked, and perhaps a bit of evidence that it's not.

But I do find this to be a slightly curious event over loopback connections. So:

  1. Has this issue recurred?
  2. In the instances where this recurred, has the exception been exactly the same? In particular, has this "failed to fetch file descriptor" exception ever originated from getNotification instead?
  3. Have you checked the postgresql server logs for any related error messages?
  4. Have you checked any system logs for related error messages? (e.g. if the postgres process that handles your connection was killed by the system, by say the OOM killer or due to a ulimit violation, then it could cause this exception in Haskell-land, and it may leave some evidence in the system logs.)
  5. Have you run a packet capture on the connection? A command along the following lines might be a reasonable place to get started, but I would highly recommend referring to the manpage:
tcpdump -i lo -n -s 65535 -C 10 -w pgdebug_%Y%m%dT%H%M%S%z.pcap 'tcp port 5432'

This will capture all loopback packets that have a source or destination port of 5432, which should capture every loopback connection to postgres. If there's a lot of other connections to postgres via loopback, you may want to refine your filter expression somehow.

@lpsmith
Copy link
Owner

lpsmith commented Nov 11, 2015

Also, in many distributions, SSL will by default be used in a situation like this; so if you do a packet capture, you may want to include sslmode=disable in your postgresql connection string, use the PGSSLMODE environment variable to do the same, or disable it in the postgresql server config (e.g. via pg_hba.conf). And of course, you should verify that you can actually read the packet captures before you wait for the problem to recur.

@deem0n
Copy link
Author

deem0n commented Nov 12, 2015

Hello

  1. I did not use SSL, it is disabled by pg_hba.conf for localhost
  2. I changed things slightly on the server to reduce amount of payload in postgres messages. haskell daemon worked for 4 days without any problem since then.
  3. I plan to do stress-testing sending a lot of short messages, I will post results&logs in case of reproducible errors

@deem0n
Copy link
Author

deem0n commented Nov 17, 2015

News out there: it crashed again with the same error.
Also, I restarted Postgres due to config file changes and haskell daemon crashed with these lines:

FATAL:  terminating connection due to administrator command
luxmsbi-postgres-sync-logger-exe: Database.PostgreSQL.Simple.Notification.getNotification: resource vanished (failed to fetch file descriptor)

I set up tcpdump traffic logging and will publish results after next crash.

@lpsmith
Copy link
Owner

lpsmith commented Nov 17, 2015

Regarding the postgresql restart, that's not surprising. Were you expecting something different?

The first message I suspect is a notice being printed by libpq. New enough versions of postgresql-libpq have the ability to turn off these messages, or programmatically capture them. The second message is an IOError exception being intentionally thrown by postgresql-simple, although it's likely possible to turn it into something else in this particular case. There was some discussion of these issues in #155

I have a little utility program that throws that particular exception on a regular basis for slightly different reasons: the program replicates and merges some event tables in near real time, and is connected via a VPN over a the public internet to a site fairly far away in real terms. This connection is somewhat flaky, so whenever the connection disappears, and the operating system finally decides the connection is gone and informs that program, then libpq does it's thing and informs my program by not returning a file descriptor, which causes this error to be raised.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants