Skip to content

Add postgres tap bugfixes & integration tests #229

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 23 commits into from
Sep 11, 2020

Conversation

sherifnada
Copy link
Contributor

@sherifnada sherifnada commented Sep 11, 2020

What

Adds integration tests for Postgres tap and fixes some related issues. The main issues were around how we encoded Singer messages:

  • RECORD messages didn't have a version field
  • schema field on SCHEMA messages was the wrong type
  • We didn't support the undocumented ACTIVATE_VERSION messages.

Reading order

  1. SingerMessage.json
  2. SingerPostgresSourceTap
  3. everything else

@@ -100,14 +101,6 @@ public static void tearDown() {
TARGET_PSQL.stop();
}

private static void runSqlScript(MountableFile file, PostgreSQLContainer db)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to a common helper

import org.testcontainers.containers.PostgreSQLContainer;
import org.testcontainers.utility.MountableFile;

public class PostgreSQLContainerHelper {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored this from other tests since we need to run this in many tests

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for now I prefer that each test does that until we figure out a way to expose is as a helper

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it sounds like the issue that it is coupled with the db module, in which case this would be better exposed in a separate test utils package?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this to a separate module

@sherifnada sherifnada marked this pull request as ready for review September 11, 2020 17:14
"required": ["type"],
"properties": {
"type": {
"description": "record, schema, and state message: used to determine record type. used in all singer messages.",
"type": "string",
"enum": ["RECORD", "SCHEMA", "STATE"]
"enum": ["RECORD", "SCHEMA", "STATE", "ACTIVATE_VERSION"]
Copy link
Contributor Author

@sherifnada sherifnada Sep 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some context around ACTIVATE_VERSION:

it seems to have only the version field

  • confused person on singer slack
  • some comments in the source code of a random tap explaining it: link
  • i opened an issue in Singer asking to document it link

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we hold off on implementing it? or does it break smth to not have it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it should break anything since this isn't an officially supported field yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woof. yeah. i'd rather just ignore this type of message then add it into "our" protocol.

@sherifnada
Copy link
Contributor Author

Looks like this broke tests elsewhere in the repo, fixing

@@ -7,6 +7,7 @@ dependencies {
api 'org.jooq:jooq-meta:3.13.4'
api 'org.jooq:jooq:3.13.4'
api 'org.postgresql:postgresql:42.2.16'
api 'org.testcontainers:postgresql:1.14.3'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree with that. We shouldn't have testcontainer be a dep for dataline-db. It should remain a testImplementation

import org.testcontainers.containers.PostgreSQLContainer;
import org.testcontainers.utility.MountableFile;

public class PostgreSQLContainerHelper {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for now I prefer that each test does that until we figure out a way to expose is as a helper

integrationTestImplementation project(':dataline-singer')
}

import com.bmuschko.gradle.docker.tasks.image.DockerBuildImage
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't forget to add the imagename task (see bigquery gradle)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

"stream": "id_and_name",
"record": {
"id": 6,
"name": "sushi"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yummy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not the edible kind I'm afraid 😅

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐱

"required": ["type"],
"properties": {
"type": {
"description": "record, schema, and state message: used to determine record type. used in all singer messages.",
"type": "string",
"enum": ["RECORD", "SCHEMA", "STATE"]
"enum": ["RECORD", "SCHEMA", "STATE", "ACTIVATE_VERSION"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment on the file -- it's an undocumented message type in singer that is widely used by DBs apparently

exceptionFormat "full"
}
mustRunAfter test
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're misssing the deps to buildImage here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -0,0 +1,2 @@
build
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not make in a allow list instead:

*
!Dockefile
!...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, wasn't familiar with that pattern. Done.

@@ -69,7 +69,7 @@
},
"SingerColumn": {
"type": "object",
"additionalProperties": false,
"additionalProperties": true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cgardens I actually removed this since it is not needed; see comment below

@@ -43,8 +44,12 @@
"type": "string"
}
},
"version": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't part of the singer spec? would it make sense to do what you did in catalog and just allow additional fields? my intuition is that we only explicitly add the field here is if it something that both source and destination need to know about.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a compromise because JsonSchema2Pojo does not correctly generate annotations which indicate required fields (e.g by adding the @NotNull annotation to the field). As a result, the jackson deserializer does not validate that required fields are present. This means that if additionalProperties=true, Jackson will deserialize pretty much any valid JSON object as a SingerMessage without validating required properties.

The only mitigating factor is that when setting additionalProperties to false, we effectively "whitelist" which fields are allowed to be present in the string that we deserialize to JsonMessage. Definitely a shitty situation, but this seems like the less bad tradeoff to make unless I'm missing that there is a JsonSchema validator that is easy to use somewhere.

Related issues:

final ProcessBuilderFactory pbf,
final StreamFactory streamFactory,
final SingerDiscoverSchemaWorker discoverSchemaWorker) {
@VisibleForTesting
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just leave this public? nbd either way. just curious.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very small thing, but it was confusing as an external consumer of this package (I was using it in integration tests) if I was supposed to pass in the stream factory. Seems like this is only useful for tests, so I removed it so that the confusion doesn't happen.

"required": ["type"],
"properties": {
"type": {
"description": "record, schema, and state message: used to determine record type. used in all singer messages.",
"type": "string",
"enum": ["RECORD", "SCHEMA", "STATE"]
"enum": ["RECORD", "SCHEMA", "STATE", "ACTIVATE_VERSION"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woof. yeah. i'd rather just ignore this type of message then add it into "our" protocol.

Copy link
Contributor

@cgardens cgardens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking on the two questions in SingerMessage. after that g2g.

"type": "object",
"existingJavaType": "com.fasterxml.jackson.databind.JsonNode"
},
"schema": {
"description": "schema message: the schema",
"$ref": "SingerCatalog.json#/definitions/SingerColumnMap"
"type": "object",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i thought the schema is a well defined object in the singer spec that matches the schema of the schema in the catalog jsonnot just a json blob. am i wrong about this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the schema field is a JsonSchema description. So we can either find a $ref definition for a JsonSchema object, specify it exactly ourselves, or call it a JsonNode. I think given that we do not introspect into the schema messages during sync, the most fault tolerant thing is to treat it as a black box. Should we want to introspect into it, we should specify it more closely. WDYT?

continue;
}

// We might not want to check that all fields are the same to pass a test e.g: time_extracted isn't
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we just overwrite certain fields like time_extracted and then just compare the whole object

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

.withStandardSync(syncConfig)
.withSourceConnectionImplementation(sourceImpl);

Stream<SingerMessage> singerMessageStream = singerTapFactory.create(tapConfig, jobRoot);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought for the integration-specific tests we wanted to just test that the config -> output works properly, and not test the tap factory/dataline runner infrastructure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think once we have comprehensive knowledge of all the things that can come out of a singer tap at compile-time, I agree with that approach. But initially (and maybe for a while), it's going to be hard to know that. Given this difficulty, IMO the better tradeoff to make is to have some duplication of test functionality for the benefit of providing higher confidence that all the machinery works well together, rather than the other way around.

For example: tap-postgres produced a SCHEMA message which failed our JsonSchema encoding of singer messages. It also produced a RECORD message which also failed our JsonSchema encoding for a different reason. And it exposed that there is a different type of message, ACTIVATE_VERSION that is defacto used by many taps, but isn't documented in the spec because it's still experimental. If we didn't test the TapFactory in this test and used a processbuilderfactory etc.. we would have only found out those issues when running the full application.

So I think for now the approach should be for now: port any such findings into unit tests where applicable (e.g: add those funky records types into SingerTapFactoryTest) and continue to use wider machinery than is necessary in integration tests until we develop enough confidence that our unit tests are comprehensive enough that we can use more barebones structs in our integration tests (which potentially will come once integrations are fully encapsulated in docker images e.g: when stdout of a docker image outputs dataline data models instead of singer data models that we then convert in the application).

WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talked offline. Sounds good.

@cgardens
Copy link
Contributor

discussed offline and we're all on same page now!

@sherifnada sherifnada changed the title Add postgres tap integration tests Add postgres tap bugfixes & integration tests Sep 11, 2020
@sherifnada sherifnada merged commit 0db3012 into master Sep 11, 2020
@sherifnada sherifnada deleted the sherif/postgres-tap-integration-tests branch September 11, 2020 22:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants