Add WordCount template sample (GoogleCloudPlatform#1469)
1 parent 87000b1 commit a723c47

4 files changed

+475
-0
lines changed

dataflow/templates/README.md

+152
@@ -0,0 +1,152 @@
# Cloud Dataflow Templates

Samples showing how to create and run an [Apache Beam] template on [Google Cloud Dataflow].

## Before you begin

1. Install the [Cloud SDK].

1. [Create a new project].

1. [Enable billing].

1. [Enable the APIs](https://console.cloud.google.com/flows/enableapi?apiid=dataflow,compute_component,logging,storage_component,storage_api,bigquery,pubsub,datastore.googleapis.com,cloudresourcemanager.googleapis.com): Dataflow, Compute Engine, Stackdriver Logging, Cloud Storage, Cloud Storage JSON, BigQuery, Pub/Sub, Datastore, and Cloud Resource Manager.

1. Set up the Cloud SDK for your GCP project.

   ```bash
   gcloud init
   ```

1. [Create a service account key] as a JSON file.
   For more information, see [Creating and managing service accounts].

   * From the **Service account** list, select **New service account**.
   * In the **Service account name** field, enter a name.
   * From the **Role** list, select **Project > Owner**.

     > **Note**: The **Role** field authorizes your service account to access resources.
     > You can view and change this field later by using the [GCP Console IAM page].
     > If you are developing a production app, specify more granular permissions than **Project > Owner**.
     > For more information, see [Granting roles to service accounts].

   * Click **Create**. A JSON file that contains your key downloads to your computer.

1. Set your `GOOGLE_APPLICATION_CREDENTIALS` environment variable to point to your service account key file.

   ```bash
   export GOOGLE_APPLICATION_CREDENTIALS=path/to/your/credentials.json
   ```

1. Create a Cloud Storage bucket.

   ```bash
   gsutil mb gs://your-gcs-bucket
   ```

## Setup

The following instructions will help you prepare your development environment.

1. Download and install the [Java Development Kit (JDK)].
   Verify that the [JAVA_HOME] environment variable is set and points to your JDK installation.

1. Download and install [Apache Maven] by following the [Maven installation guide] for your specific operating system.

1. Clone the `java-docs-samples` repository.

   ```bash
   git clone https://github.com/GoogleCloudPlatform/java-docs-samples.git
   ```

1. Navigate to the sample code directory.

   ```bash
   cd java-docs-samples/dataflow/templates
   ```

## Templates

### WordCount

* [WordCount.java](src/main/java/com/example/dataflow/templates/WordCount.java)
* [WordCount_metadata](WordCount_metadata)

First, select the project and template location.

```bash
PROJECT=$(gcloud config get-value project)
BUCKET=your-gcs-bucket
TEMPLATE_LOCATION=gs://$BUCKET/dataflow/templates/WordCount
```

Then, create the template in the desired Cloud Storage location.

```bash
# Create the template.
mvn compile exec:java \
  -Dexec.mainClass=com.example.dataflow.templates.WordCount \
  -Dexec.args="\
    --isCaseSensitive=false \
    --project=$PROJECT \
    --templateLocation=$TEMPLATE_LOCATION \
    --runner=DataflowRunner"

# Upload the metadata file.
gsutil cp WordCount_metadata "$TEMPLATE_LOCATION"_metadata
```

> For more information, see [Creating templates].

Finally, you can run the template via `gcloud` or through the [GCP Console create Dataflow job page].

```bash
JOB_NAME=wordcount-$(date +'%Y%m%d-%H%M%S')
INPUT=gs://apache-beam-samples/shakespeare/kinglear.txt

gcloud dataflow jobs run $JOB_NAME \
  --gcs-location $TEMPLATE_LOCATION \
  --parameters inputFile=$INPUT,outputBucket=$BUCKET
```
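
`WordCount_metadata` also declares an optional `withSubstring` parameter. As a rough sketch, it could be passed alongside the parameters used above. The `join_parameters` helper and the value `king` are illustrative assumptions, not part of the sample, and the command is only printed here rather than submitted.

```bash
# Hypothetical helper: join key=value pairs with commas,
# the format used by the --parameters flag above.
join_parameters() {
  local IFS=,
  echo "$*"
}

BUCKET=your-gcs-bucket
TEMPLATE_LOCATION=gs://$BUCKET/dataflow/templates/WordCount
JOB_NAME=wordcount-$(date +'%Y%m%d-%H%M%S')

# "king" is an arbitrary example value for the optional filter.
PARAMETERS=$(join_parameters \
  "inputFile=gs://apache-beam-samples/shakespeare/kinglear.txt" \
  "outputBucket=$BUCKET" \
  "withSubstring=king")

# Print the command for review; drop the leading `echo` to submit the job.
echo gcloud dataflow jobs run "$JOB_NAME" \
  --gcs-location "$TEMPLATE_LOCATION" \
  --parameters "$PARAMETERS"
```

Building the parameter list in one place keeps the comma-separated format easy to inspect before the job is launched.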

> For more information, see [Executing templates].

You can check your submitted jobs in the [GCP Console Dataflow page].

## Cleanup

To avoid incurring charges to your GCP account, delete the resources used by this sample:

```bash
# Delete only the files created by this sample.
gsutil -m rm -rf \
  "gs://$BUCKET/dataflow/templates/WordCount*" \
  "gs://$BUCKET/dataflow/wordcount/"

# [optional] Remove the entire dataflow Cloud Storage directory.
gsutil -m rm -rf gs://$BUCKET/dataflow

# [optional] Remove the Cloud Storage bucket.
gsutil rb gs://$BUCKET
```

[Apache Beam]: https://beam.apache.org/
[Google Cloud Dataflow]: https://cloud.google.com/dataflow/docs/

[Cloud SDK]: https://cloud.google.com/sdk/docs/
[Create a new project]: https://console.cloud.google.com/projectcreate
[Enable billing]: https://cloud.google.com/billing/docs/how-to/modify-project
[Create a service account key]: https://console.cloud.google.com/apis/credentials/serviceaccountkey
[Creating and managing service accounts]: https://cloud.google.com/iam/docs/creating-managing-service-accounts
[GCP Console IAM page]: https://console.cloud.google.com/iam-admin/iam
[Granting roles to service accounts]: https://cloud.google.com/iam/docs/granting-roles-to-service-accounts

[Java Development Kit (JDK)]: https://www.oracle.com/technetwork/java/javase/downloads/index.html
[JAVA_HOME]: https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/envvars001.html
[Apache Maven]: http://maven.apache.org/download.cgi
[Maven installation guide]: http://maven.apache.org/install.html

[Creating templates]: https://cloud.google.com/dataflow/docs/guides/templates/creating-templates
[GCP Console create Dataflow job page]: https://console.cloud.google.com/dataflow/createjob
[Executing templates]: https://cloud.google.com/dataflow/docs/guides/templates/executing-templates
[GCP Console Dataflow page]: https://console.cloud.google.com/dataflow

dataflow/templates/WordCount_metadata

+38
@@ -0,0 +1,38 @@
// Copyright 2019 Google Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
{
  "name": "WordCount",
  "description": "An example pipeline that counts words in the input file.",
  "parameters": [
    {
      "name": "inputFile",
      "label": "Input GCS File Pattern",
      "help_text": "Google Cloud Storage file pattern glob of the file(s) to read from.",
      "regexes": ["^gs:\/\/[^\n\r]+$"],
      "is_optional": true
    },
    {
      "name": "outputBucket",
      "label": "Output GCS Bucket",
      "help_text": "Google Cloud Storage bucket to store the outputs.",
      "regexes": ["^[a-z0-9][-_.a-z0-9]+[a-z0-9]$"]
    },
    {
      "name": "withSubstring",
      "label": "With Substring",
      "help_text": "Filter only words containing the specified substring.",
      "is_optional": true
    }
  ]
}
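
The `regexes` entries constrain the values accepted for `inputFile` and `outputBucket`. A minimal Bash sketch for checking values locally before submitting a job, assuming the patterns are copied from the metadata with the JSON escapes (`\/`, `\n`, `\r`) expanded. The variable and function names are hypothetical, and this is not how Dataflow itself performs validation.

```bash
# Use the C locale so that [a-z] matches ASCII lowercase letters only.
export LC_ALL=C

# Patterns copied from WordCount_metadata. $'...' expands \n and \r so the
# bracket expression excludes real newline and carriage-return characters.
INPUT_FILE_RE=$'^gs://[^\n\r]+$'
OUTPUT_BUCKET_RE='^[a-z0-9][-_.a-z0-9]+[a-z0-9]$'

# Hypothetical helpers: succeed (exit status 0) only when the value matches.
is_valid_input_file() { [[ $1 =~ $INPUT_FILE_RE ]]; }
is_valid_output_bucket() { [[ $1 =~ $OUTPUT_BUCKET_RE ]]; }

# Report which candidate bucket names pass the outputBucket pattern.
for bucket in your-gcs-bucket Your_Bucket; do
  if is_valid_output_bucket "$bucket"; then
    echo "ok: $bucket"
  else
    echo "rejected: $bucket"
  fi
done
```

Note that `outputBucket` expects a bare bucket name without the `gs://` prefix, while `inputFile` requires it.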

dataflow/templates/pom.xml

+167
@@ -0,0 +1,167 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
  Copyright 2018 Google LLC

  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.example</groupId>
  <artifactId>dataflow-templates</artifactId>
  <version>1.0</version>

  <properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>

    <beam.version>2.13.0</beam.version>

    <maven-compiler-plugin.version>3.8.1</maven-compiler-plugin.version>
    <maven-exec-plugin.version>1.6.0</maven-exec-plugin.version>
    <maven-jar-plugin.version>3.1.2</maven-jar-plugin.version>
    <maven-shade-plugin.version>3.2.1</maven-shade-plugin.version>
    <slf4j.version>1.7.26</slf4j.version>
  </properties>

  <repositories>
    <repository>
      <id>apache.snapshots</id>
      <name>Apache Development Snapshot Repository</name>
      <url>https://repository.apache.org/content/repositories/snapshots/</url>
      <releases>
        <enabled>false</enabled>
      </releases>
      <snapshots>
        <enabled>true</enabled>
      </snapshots>
    </repository>
  </repositories>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>${maven-compiler-plugin.version}</version>
      </plugin>

      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <version>${maven-jar-plugin.version}</version>
        <configuration>
          <archive>
            <manifest>
              <addClasspath>true</addClasspath>
              <classpathPrefix>lib/</classpathPrefix>
              <mainClass>com.example.dataflow.templates.WordCount</mainClass>
            </manifest>
          </archive>
        </configuration>
      </plugin>

      <!--
        Configures `mvn package` to produce a bundled jar ("fat jar") for runners
        that require this for job submission to a cluster.
      -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>${maven-shade-plugin.version}</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <finalName>${project.artifactId}-bundled-${project.version}</finalName>
              <filters>
                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/LICENSE</exclude>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                  </excludes>
                </filter>
              </filters>
              <transformers>
                <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
              </transformers>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>

    <pluginManagement>
      <plugins>
        <plugin>
          <groupId>org.codehaus.mojo</groupId>
          <artifactId>exec-maven-plugin</artifactId>
          <version>${maven-exec-plugin.version}</version>
          <configuration>
            <cleanupDaemonThreads>false</cleanupDaemonThreads>
          </configuration>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>

  <dependencies>
    <dependency>
      <groupId>org.apache.beam</groupId>
      <artifactId>beam-sdks-java-core</artifactId>
      <version>${beam.version}</version>
    </dependency>

    <!--
      By default, the starter project has a dependency on the Beam DirectRunner
      to enable development and testing of pipelines. To run on another of the
      Beam runners, add its module to this pom.xml according to the
      runner-specific setup instructions on the Beam website:
      http://beam.apache.org/documentation/#runners
    -->
    <dependency>
      <groupId>org.apache.beam</groupId>
      <artifactId>beam-runners-direct-java</artifactId>
      <version>${beam.version}</version>
      <scope>runtime</scope>
    </dependency>

    <dependency>
      <groupId>org.apache.beam</groupId>
      <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
      <version>${beam.version}</version>
      <scope>runtime</scope>
    </dependency>

    <!-- slf4j API frontend binding with JUL backend -->
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-api</artifactId>
      <version>${slf4j.version}</version>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-jdk14</artifactId>
      <version>${slf4j.version}</version>
    </dependency>
  </dependencies>
</project>
