Databases are usually treated as fairly dumb storage systems, but they are capable of much more. Take PostgreSQL and its PL/pgSQL procedural language (a descendant of Ada by way of Oracle's PL/SQL), which lets you perform significantly more complex operations than would be realistic with plain SQL. A case in point is the Kalman filter implemented by the folk over at Traconiq, which removes the need for an external filtering pipeline.
A Kalman filter is highly desirable when you're doing something like vehicle tracking that combines dead reckoning with GPS coordinates, as it filters out noise caused by e.g. GPS reception issues. As noted in the article, transferring the filter state from one row to the next requires a bit of lateral thinking, but it is doable with some creative SQL. Since PL/pgSQL is very similar to Oracle's PL/SQL, the same code should also work there without much porting.
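To make the idea concrete, here is a minimal one-dimensional sketch of how state can be carried from row to row in PL/pgSQL. This is not the Traconiq implementation: the gps_fix table, its ts and reading columns, and the process/measurement noise parameters q and r are all invented for illustration.

-- Illustrative 1-D Kalman filter: loop over rows in time order and
-- carry the filter state in local variables.
CREATE OR REPLACE FUNCTION kalman_1d(q double precision, r double precision)
RETURNS TABLE (ts timestamptz, filtered double precision)
LANGUAGE plpgsql AS $$
DECLARE
  rec RECORD;
  x   double precision;           -- state estimate (NULL until seeded)
  p   double precision := 1.0;    -- estimate covariance
  k   double precision;           -- Kalman gain
BEGIN
  FOR rec IN SELECT g.ts, g.reading FROM gps_fix g ORDER BY g.ts LOOP
    IF x IS NULL THEN
      x := rec.reading;           -- seed the filter with the first measurement
    ELSE
      p := p + q;                 -- predict step
      k := p / (p + r);           -- update step
      x := x + k * (rec.reading - x);
      p := (1 - k) * p;
    END IF;
    ts := rec.ts;
    filtered := x;
    RETURN NEXT;                  -- emit one filtered value per input row
  END LOOP;
END;
$$;

-- usage, e.g.: SELECT * FROM kalman_1d(0.001, 0.1);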
The code for the different implementations and the associated benchmarks can be found on GitHub, and the benchmark results make it abundantly clear that the most efficient approach is an offline, aggregate-based processing routine. This fits the pattern of other batch processing tasks a database server typically performs, e.g. to optimize storage, so it isn't entirely surprising.
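For a rough idea of what an aggregate-based formulation can look like (again a sketch rather than the repository's code, reusing the invented gps_fix table from above), the same one-dimensional filter can be expressed as a custom aggregate whose state transition function carries {x, p} from row to row, and which can then be used as a window function:

-- Transition function: advance the filter state by one measurement z.
CREATE OR REPLACE FUNCTION kalman_step(state double precision[],
                                       z double precision,
                                       q double precision,
                                       r double precision)
RETURNS double precision[]
LANGUAGE plpgsql IMMUTABLE AS $$
DECLARE
  x double precision := state[1];
  p double precision := state[2];
  k double precision;
BEGIN
  IF x IS NULL THEN
    RETURN ARRAY[z, 1.0];          -- seed with the first measurement
  END IF;
  p := p + q;                      -- predict
  k := p / (p + r);                -- update
  x := x + k * (z - x);
  p := (1 - k) * p;
  RETURN ARRAY[x, p];
END;
$$;

-- Final function: report only the state estimate.
CREATE OR REPLACE FUNCTION kalman_final(state double precision[])
RETURNS double precision
LANGUAGE sql IMMUTABLE AS 'SELECT $1[1]';

CREATE AGGREGATE kalman(double precision, double precision, double precision) (
  SFUNC     = kalman_step,
  STYPE     = double precision[],
  FINALFUNC = kalman_final,
  INITCOND  = '{NULL, NULL}'
);

-- Used as a window function it emits one filtered value per row:
SELECT ts,
       kalman(reading, 0.001, 0.1)
         OVER (ORDER BY ts ROWS UNBOUNDED PRECEDING) AS filtered
FROM gps_fix;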
It would seem to me that one wants to offload SQL servers as much as possible. Not only for performance, but because their compute time is usually the most expensive short of GPU farms. Hence one's wallet would suggest doing this kind of data massaging somewhere else.
Anyone have thoughts on this?
Exactly what I thought,…
… Though I am absolutely no database expert.
Depends
First: "it's usually the most expensive compute-time short of GPU farms" is not a universal truth; it depends on your system architecture.
Second: again depending on multiple factors, it may be better not to pull the whole dataset from the server over the network, process it, and then push the updates back, but instead to do it directly in the database.
For large datasets, row-by-row updates with a plain UPDATE command (when the values are calculated externally) are a no-go (too slow), so you need to do something like
COPY FROM STDIN
into a temp table and then updating from that (sketched below); it may well be easier to do the update in place with a custom function in the database itself.
My company runs on a cloud provider who charges in terms of reads/writes. As long as we can fit the compute into their 6h timeout, it's relatively free!
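For reference, a rough sketch of the COPY-into-a-temp-table update pattern described above; the gps_fix table and its id and filtered columns are invented for illustration:

BEGIN;
CREATE TEMP TABLE filtered_tmp (
  id       bigint PRIMARY KEY,
  filtered double precision
) ON COMMIT DROP;

-- Bulk-load the externally computed values here, e.g. via COPY FROM STDIN,
-- or from psql: \copy filtered_tmp FROM 'filtered.csv' WITH (FORMAT csv)

UPDATE gps_fix g
SET    filtered = t.filtered
FROM   filtered_tmp t
WHERE  g.id = t.id;
COMMIT;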
Since the filter depends not only on the current input but also on related (previous) values, there might be something to be said for implementing it on the server. It avoids having to fetch the previous data to a compute node, run the filter, and write the new data back to the server.
But still, I agree. Unless there is a lot of data and latency is an issue, I don't see the added value.
Still: It’s a pretty funny hack. :) Kudos for at least that.
So preoccupied with whether they could that they never stopped to ask whether they should :)
I take this sort of thing as a kind of demo of the capabilities of PostgreSQL. Honestly, I don't like it, for the reasons you stated and also because I view it as a kind of intrinsically harmful "embrace and extend." I want an SQL database to be an SQL database, not to have some novel language tacked onto it which does nothing but promote vendor lock-in and (usually) bad architectural choices. But a lot of projects are neat just for showing what's possible, in case it might trigger a thought in the reader about where this capacity might be genuinely useful.