-
Notifications
You must be signed in to change notification settings - Fork 4.6k
Add postgres tap bugfixes & integration tests #229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -100,14 +101,6 @@ public static void tearDown() { | |||
TARGET_PSQL.stop(); | |||
} | |||
|
|||
private static void runSqlScript(MountableFile file, PostgreSQLContainer db) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to a common helper
import org.testcontainers.containers.PostgreSQLContainer; | ||
import org.testcontainers.utility.MountableFile; | ||
|
||
public class PostgreSQLContainerHelper { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refactored this from other tests since we need to run this in many tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for now I prefer that each test does that until we figure out a way to expose is as a helper
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it sounds like the issue that it is coupled with the db module, in which case this would be better exposed in a separate test utils package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved this to a separate module
"required": ["type"], | ||
"properties": { | ||
"type": { | ||
"description": "record, schema, and state message: used to determine record type. used in all singer messages.", | ||
"type": "string", | ||
"enum": ["RECORD", "SCHEMA", "STATE"] | ||
"enum": ["RECORD", "SCHEMA", "STATE", "ACTIVATE_VERSION"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some context around ACTIVATE_VERSION
:
it seems to have only the version
field
- confused person on singer slack
- some comments in the source code of a random tap explaining it: link
- i opened an issue in Singer asking to document it link
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we hold off on implementing it? or does it break smth to not have it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it should break anything since this isn't an officially supported field yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
woof. yeah. i'd rather just ignore this type of message then add it into "our" protocol.
Looks like this broke tests elsewhere in the repo, fixing |
dataline-db/build.gradle
Outdated
@@ -7,6 +7,7 @@ dependencies { | |||
api 'org.jooq:jooq-meta:3.13.4' | |||
api 'org.jooq:jooq:3.13.4' | |||
api 'org.postgresql:postgresql:42.2.16' | |||
api 'org.testcontainers:postgresql:1.14.3' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree with that. We shouldn't have testcontainer be a dep for dataline-db. It should remain a testImplementation
import org.testcontainers.containers.PostgreSQLContainer; | ||
import org.testcontainers.utility.MountableFile; | ||
|
||
public class PostgreSQLContainerHelper { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for now I prefer that each test does that until we figure out a way to expose is as a helper
integrationTestImplementation project(':dataline-singer') | ||
} | ||
|
||
import com.bmuschko.gradle.docker.tasks.image.DockerBuildImage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't forget to add the imagename task (see bigquery gradle)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
"stream": "id_and_name", | ||
"record": { | ||
"id": 6, | ||
"name": "sushi" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yummy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not the edible kind I'm afraid 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🐱
"required": ["type"], | ||
"properties": { | ||
"type": { | ||
"description": "record, schema, and state message: used to determine record type. used in all singer messages.", | ||
"type": "string", | ||
"enum": ["RECORD", "SCHEMA", "STATE"] | ||
"enum": ["RECORD", "SCHEMA", "STATE", "ACTIVATE_VERSION"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment on the file -- it's an undocumented message type in singer that is widely used by DBs apparently
exceptionFormat "full" | ||
} | ||
mustRunAfter test | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're misssing the deps to buildImage here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -0,0 +1,2 @@ | |||
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not make in a allow list instead:
*
!Dockefile
!...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, wasn't familiar with that pattern. Done.
@@ -69,7 +69,7 @@ | |||
}, | |||
"SingerColumn": { | |||
"type": "object", | |||
"additionalProperties": false, | |||
"additionalProperties": true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cgardens I actually removed this since it is not needed; see comment below
@@ -43,8 +44,12 @@ | |||
"type": "string" | |||
} | |||
}, | |||
"version": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this isn't part of the singer spec? would it make sense to do what you did in catalog and just allow additional fields? my intuition is that we only explicitly add the field here is if it something that both source and destination need to know about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a compromise because JsonSchema2Pojo does not correctly generate annotations which indicate required fields (e.g by adding the @NotNull
annotation to the field). As a result, the jackson deserializer does not validate that required fields are present. This means that if additionalProperties=true
, Jackson will deserialize pretty much any valid JSON object as a SingerMessage
without validating required properties.
The only mitigating factor is that when setting additionalProperties
to false
, we effectively "whitelist" which fields are allowed to be present in the string that we deserialize to JsonMessage
. Definitely a shitty situation, but this seems like the less bad tradeoff to make unless I'm missing that there is a JsonSchema validator that is easy to use somewhere.
Related issues:
final ProcessBuilderFactory pbf, | ||
final StreamFactory streamFactory, | ||
final SingerDiscoverSchemaWorker discoverSchemaWorker) { | ||
@VisibleForTesting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just leave this public? nbd either way. just curious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very small thing, but it was confusing as an external consumer of this package (I was using it in integration tests) if I was supposed to pass in the stream factory. Seems like this is only useful for tests, so I removed it so that the confusion doesn't happen.
"required": ["type"], | ||
"properties": { | ||
"type": { | ||
"description": "record, schema, and state message: used to determine record type. used in all singer messages.", | ||
"type": "string", | ||
"enum": ["RECORD", "SCHEMA", "STATE"] | ||
"enum": ["RECORD", "SCHEMA", "STATE", "ACTIVATE_VERSION"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
woof. yeah. i'd rather just ignore this type of message then add it into "our" protocol.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking on the two questions in SingerMessage. after that g2g.
"type": "object", | ||
"existingJavaType": "com.fasterxml.jackson.databind.JsonNode" | ||
}, | ||
"schema": { | ||
"description": "schema message: the schema", | ||
"$ref": "SingerCatalog.json#/definitions/SingerColumnMap" | ||
"type": "object", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i thought the schema is a well defined object in the singer spec that matches the schema of the schema in the catalog jsonnot just a json blob. am i wrong about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the schema
field is a JsonSchema description. So we can either find a $ref
definition for a JsonSchema object, specify it exactly ourselves, or call it a JsonNode
. I think given that we do not introspect into the schema messages during sync, the most fault tolerant thing is to treat it as a black box. Should we want to introspect into it, we should specify it more closely. WDYT?
continue; | ||
} | ||
|
||
// We might not want to check that all fields are the same to pass a test e.g: time_extracted isn't |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we just overwrite certain fields like time_extracted
and then just compare the whole object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
.withStandardSync(syncConfig) | ||
.withSourceConnectionImplementation(sourceImpl); | ||
|
||
Stream<SingerMessage> singerMessageStream = singerTapFactory.create(tapConfig, jobRoot); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought for the integration-specific tests we wanted to just test that the config -> output works properly, and not test the tap factory/dataline runner infrastructure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think once we have comprehensive knowledge of all the things that can come out of a singer tap at compile-time, I agree with that approach. But initially (and maybe for a while), it's going to be hard to know that. Given this difficulty, IMO the better tradeoff to make is to have some duplication of test functionality for the benefit of providing higher confidence that all the machinery works well together, rather than the other way around.
For example: tap-postgres
produced a SCHEMA
message which failed our JsonSchema encoding of singer messages. It also produced a RECORD
message which also failed our JsonSchema encoding for a different reason. And it exposed that there is a different type of message, ACTIVATE_VERSION
that is defacto used by many taps, but isn't documented in the spec because it's still experimental. If we didn't test the TapFactory
in this test and used a processbuilderfactory etc.. we would have only found out those issues when running the full application.
So I think for now the approach should be for now: port any such findings into unit tests where applicable (e.g: add those funky records types into SingerTapFactoryTest
) and continue to use wider machinery than is necessary in integration tests until we develop enough confidence that our unit tests are comprehensive enough that we can use more barebones structs in our integration tests (which potentially will come once integrations are fully encapsulated in docker images e.g: when stdout of a docker image outputs dataline data models instead of singer data models that we then convert in the application).
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Talked offline. Sounds good.
discussed offline and we're all on same page now! |
…ostgres-tap-integration-tests
What
Adds integration tests for Postgres tap and fixes some related issues. The main issues were around how we encoded Singer messages:
RECORD
messages didn't have aversion
fieldschema
field onSCHEMA
messages was the wrong typeACTIVATE_VERSION
messages.Reading order
SingerMessage.json
SingerPostgresSourceTap