Commit 1f35dc6

docs: Re-organizing docs slightly and adding more code examples (feast-dev#3019)
* docs: Re-organizing docs slightly and adding more code examples in overview + feature retrieval + feature view concepts
* fix broken links
* fix missing table info
* fix empty spaces
* fix empty spaces
* fix nits
* remove broken faq

Signed-off-by: Danny Chiao <[email protected]>
1 parent 8a0dd4c commit 1f35dc6

30 files changed (+508 −243 lines)

docs/README.md

Lines changed: 1 addition & 1 deletion
@@ -65,7 +65,7 @@ Explore the following resources to get started with Feast:
 * [Quickstart](getting-started/quickstart.md) is the fastest way to get started with Feast
 * [Concepts](getting-started/concepts/) describes all important Feast API concepts
 * [Architecture](getting-started/architecture-and-components/) describes Feast's overall architecture.
-* [Tutorials](tutorials/tutorials-overview.md) shows full examples of using Feast in machine learning applications.
+* [Tutorials](tutorials/tutorials-overview/) shows full examples of using Feast in machine learning applications.
 * [Running Feast with Snowflake/GCP/AWS](how-to-guides/feast-snowflake-gcp-aws/) provides a more in-depth guide to using Feast.
 * [Reference](reference/feast-cli-commands.md) contains detailed API and design documents.
 * [Contributing](project/contributing.md) contains resources for anyone who wants to contribute to Feast.

docs/SUMMARY.md

Lines changed: 11 additions & 11 deletions
@@ -10,7 +10,7 @@
 * [Quickstart](getting-started/quickstart.md)
 * [Concepts](getting-started/concepts/README.md)
   * [Overview](getting-started/concepts/overview.md)
-  * [Data source](getting-started/concepts/data-source.md)
+  * [Data ingestion](getting-started/concepts/data-ingestion.md)
   * [Entity](getting-started/concepts/entity.md)
   * [Feature view](getting-started/concepts/feature-view.md)
   * [Feature retrieval](getting-started/concepts/feature-retrieval.md)
@@ -31,11 +31,11 @@
 
 ## Tutorials
 
-* [Overview](tutorials/tutorials-overview.md)
-* [Driver ranking](tutorials/driver-ranking-with-feast.md)
-* [Fraud detection on GCP](tutorials/fraud-detection.md)
-* [Real-time credit scoring on AWS](tutorials/real-time-credit-scoring-on-aws.md)
-* [Driver stats on Snowflake](tutorials/driver-stats-on-snowflake.md)
+* [Sample use-case tutorials](tutorials/tutorials-overview/README.md)
+  * [Driver ranking](tutorials/tutorials-overview/driver-ranking-with-feast.md)
+  * [Fraud detection on GCP](tutorials/tutorials-overview/fraud-detection.md)
+  * [Real-time credit scoring on AWS](tutorials/tutorials-overview/real-time-credit-scoring-on-aws.md)
+  * [Driver stats on Snowflake](tutorials/tutorials-overview/driver-stats-on-snowflake.md)
 * [Validating historical features with Great Expectations](tutorials/validating-historical-features.md)
 * [Using Scalable Registry](tutorials/using-scalable-registry.md)
 * [Building streaming features](tutorials/building-streaming-features.md)
@@ -50,12 +50,12 @@
 * [Load data into the online store](how-to-guides/feast-snowflake-gcp-aws/load-data-into-the-online-store.md)
 * [Read features from the online store](how-to-guides/feast-snowflake-gcp-aws/read-features-from-the-online-store.md)
 * [Running Feast in production](how-to-guides/running-feast-in-production.md)
-* [Upgrading from Feast 0.9](https://docs.google.com/document/u/1/d/1AOsr\_baczuARjCpmZgVd8mCqTF4AZ49OEyU4Cn-uTT0/edit)
 * [Upgrading for Feast 0.20+](how-to-guides/automated-feast-upgrade.md)
-* [Adding a customer provider](how-to-guides/creating-a-custom-provider.md)
-* [Adding a custom batch materialization engine](how-to-guides/creating-a-custom-materialization-engine.md)
-* [Adding a new online store](how-to-guides/adding-support-for-a-new-online-store.md)
-* [Adding a new offline store](how-to-guides/adding-a-new-offline-store.md)
+* [Customizing Feast](how-to-guides/customizing-feast/README.md)
+  * [Adding a custom batch materialization engine](how-to-guides/customizing-feast/creating-a-custom-materialization-engine.md)
+  * [Adding a new offline store](how-to-guides/customizing-feast/adding-a-new-offline-store.md)
+  * [Adding a new online store](how-to-guides/customizing-feast/adding-support-for-a-new-online-store.md)
+  * [Adding a custom provider](how-to-guides/customizing-feast/creating-a-custom-provider.md)
 * [Adding or reusing tests](how-to-guides/adding-or-reusing-tests.md)
 
 ## Reference

docs/getting-started/architecture-and-components/batch-materialization-engine.md

Lines changed: 1 addition & 2 deletions
@@ -4,7 +4,6 @@ A batch materialization engine is a component of Feast that's responsible for mo
 
 A materialization engine abstracts over the specific technologies or frameworks that are used to materialize data. It allows users to use a pure local serialized approach (the default LocalMaterializationEngine), or to delegate materialization to separate components (e.g. AWS Lambda, as implemented by the LambdaMaterializationEngine).
 
-If the built-in engines are not sufficient, you can create your own custom materialization engine. Please see [this guide](../../how-to-guides/creating-a-custom-materialization-engine.md) for more details.
+If the built-in engines are not sufficient, you can create your own custom materialization engine. Please see [this guide](../../how-to-guides/customizing-feast/creating-a-custom-materialization-engine.md) for more details.
 
 Please see [feature\_store.yaml](../../reference/feature-repository/feature-store-yaml.md#overview) for configuring engines.
-
docs/getting-started/architecture-and-components/offline-store.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 # Offline store
 
-An offline store is an interface for working with historical time-series feature values that are stored in [data sources](../../getting-started/concepts/data-source.md).
+An offline store is an interface for working with historical time-series feature values that are stored in [data sources](../../getting-started/concepts/data-ingestion.md).
 The `OfflineStore` interface has several different implementations, such as `BigQueryOfflineStore`, each of which is backed by a different storage and compute engine.
 For more details on which offline stores are supported, please see [Offline Stores](../../reference/offline-stores/).
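Not part of this diff: the offline store is what serves `get_historical_features` calls. A minimal sketch, assuming a local feature repo and a hypothetical `driver_hourly_stats` feature view:

```python
from datetime import datetime

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Entity rows with event timestamps; the offline store performs a
# point-in-time-correct join of feature values onto these rows.
entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002],  # hypothetical entity keys
        "event_timestamp": [datetime(2022, 5, 1), datetime(2022, 5, 2)],
    }
)

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:trips_today", "driver_hourly_stats:rating"],
).to_df()
```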

docs/getting-started/architecture-and-components/provider.md

Lines changed: 3 additions & 4 deletions

@@ -1,10 +1,9 @@
 # Provider
 
-A provider is an implementation of a feature store using specific feature store components \(e.g. offline store, online store\) targeting a specific environment \(e.g. GCP stack\).
+A provider is an implementation of a feature store using specific feature store components (e.g. offline store, online store) targeting a specific environment (e.g. GCP stack).
 
-Providers orchestrate various components \(offline store, online store, infrastructure, compute\) inside an environment. For example, the `gcp` provider supports [BigQuery](https://cloud.google.com/bigquery) as an offline store and [Datastore](https://cloud.google.com/datastore) as an online store, ensuring that these components can work together seamlessly. Feast has three built-in providers \(`local`, `gcp`, and `aws`\) with default configurations that make it easy for users to start a feature store in a specific environment. These default configurations can be overridden easily. For instance, you can use the `gcp` provider but use Redis as the online store instead of Datastore.
+Providers orchestrate various components (offline store, online store, infrastructure, compute) inside an environment. For example, the `gcp` provider supports [BigQuery](https://cloud.google.com/bigquery) as an offline store and [Datastore](https://cloud.google.com/datastore) as an online store, ensuring that these components can work together seamlessly. Feast has three built-in providers (`local`, `gcp`, and `aws`) with default configurations that make it easy for users to start a feature store in a specific environment. These default configurations can be overridden easily. For instance, you can use the `gcp` provider but use Redis as the online store instead of Datastore.
 
-If the built-in providers are not sufficient, you can create your own custom provider. Please see [this guide](../../how-to-guides/creating-a-custom-provider.md) for more details.
+If the built-in providers are not sufficient, you can create your own custom provider. Please see [this guide](../../how-to-guides/customizing-feast/creating-a-custom-provider.md) for more details.
 
 Please see [feature\_store.yaml](../../reference/feature-repository/feature-store-yaml.md#overview) for configuring providers.
-
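Not part of this diff: overriding a provider's default online store can be sketched programmatically with `RepoConfig`; the project name, bucket, and Redis address below are placeholders:

```python
from feast import FeatureStore, RepoConfig
from feast.infra.online_stores.redis import RedisOnlineStoreConfig
from feast.repo_config import RegistryConfig

# gcp provider, but with Redis swapped in as the online store
# (instead of the default Datastore).
repo_config = RepoConfig(
    project="my_project",  # placeholder
    provider="gcp",
    registry=RegistryConfig(path="gs://my-bucket/registry.pb"),  # placeholder bucket
    offline_store="bigquery",
    online_store=RedisOnlineStoreConfig(connection_string="localhost:6379"),
)
store = FeatureStore(config=repo_config)
```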

docs/getting-started/concepts/README.md

Lines changed: 2 additions & 2 deletions
@@ -4,8 +4,8 @@
 [overview.md](overview.md)
 {% endcontent-ref %}
 
-{% content-ref url="data-source.md" %}
-[data-source.md](data-source.md)
+{% content-ref url="data-ingestion.md" %}
+[data-ingestion.md](data-ingestion.md)
 {% endcontent-ref %}
 
 {% content-ref url="entity.md" %}
docs/getting-started/concepts/data-ingestion.md

Lines changed: 86 additions & 0 deletions

@@ -0,0 +1,86 @@
# Data ingestion

### Data source

The data source refers to the raw underlying data (e.g. a table in BigQuery).

Feast uses a time-series data model to represent data. This data model is used to interpret feature data in data sources in order to build training datasets or when materializing features into an online store.

Below is an example data source with a single entity (`driver`) and two features (`trips_today` and `rating`).

![Ride-hailing data source](<../../.gitbook/assets/image (16).png>)

Feast primarily supports **time-stamped** tabular data as data sources. There are many kinds of possible data sources:

* **Batch data sources:** ideally, these live in data warehouses (BigQuery, Snowflake, Redshift), but can also be in data lakes (S3, GCS, etc.). Feast supports ingesting and querying data across both.
* **Stream data sources:** Feast does **not** have native streaming integrations. It does, however, facilitate making streaming features available in different environments. There are two kinds of sources (see the push sketch after this list):
  * **Push sources** allow users to push features into Feast, and make them available for training / batch scoring ("offline"), for real-time feature serving ("online"), or both.
  * **\[Alpha] Stream sources** allow users to register metadata from Kafka or Kinesis sources. The onus is on the user to ingest from these sources, though Feast provides some limited helper methods to ingest directly from Kafka / Kinesis topics.
* **(Experimental) Request data sources:** This is data that is only available at request time (e.g. from a user action that needs an immediate model prediction response). This is primarily relevant as an input into **on-demand feature views**, which allow lightweight feature engineering and combining features across sources.

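A minimal sketch (not part of the committed file) of pushing features through a push source; the source name, entity key, and feature columns are hypothetical:

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Fresh feature values, e.g. emitted by a stream processor.
# "driver_stats_push_source" is a hypothetical PushSource registered in the repo.
event_df = pd.DataFrame(
    {
        "driver_id": [1001],
        "event_timestamp": [pd.Timestamp.utcnow()],
        "trips_today": [13],
        "rating": [4.8],
    }
)

store.push("driver_stats_push_source", event_df)
```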
### Batch data ingestion

Ingesting from batch sources is only necessary to power real-time models. This is done through **materialization**. Under the hood, Feast manages an _offline store_ (to scalably generate training data from batch sources) and an _online store_ (to provide low-latency access to features for real-time models).

A key command to use in Feast is `materialize_incremental`, which fetches the latest values for all entities in the batch source and ingests these values into the online store.

Materialization can be called programmatically or through the CLI:

<details>

<summary>Code example: programmatic scheduled materialization</summary>

This snippet creates a feature store object which points to the registry (which knows of all defined features) and the online store (DynamoDB in this case), and then materializes the latest feature values into the online store.

```python
# Imports assumed by this snippet
import datetime

from airflow.operators.python import PythonOperator
from feast import FeatureStore, RepoConfig
from feast.infra.online_stores.dynamodb import DynamoDBOnlineStoreConfig
from feast.repo_config import RegistryConfig

# Define a Python callable that materializes the latest feature values
def materialize():
    repo_config = RepoConfig(
        registry=RegistryConfig(path="s3://[YOUR BUCKET]/registry.pb"),
        project="feast_demo_aws",
        provider="aws",
        offline_store="file",
        online_store=DynamoDBOnlineStoreConfig(region="us-west-2"),
    )
    store = FeatureStore(config=repo_config)
    store.materialize_incremental(datetime.datetime.now())

# (In production) Use Airflow PythonOperator
materialize_python = PythonOperator(
    task_id='materialize_python',
    python_callable=materialize,
)
```

</details>

<details>

<summary>Code example: CLI-based materialization</summary>

#### How to run this in the CLI

```bash
CURRENT_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S")
feast materialize-incremental $CURRENT_TIME
```

#### How to run this on Airflow

```python
import datetime

from airflow.operators.bash import BashOperator

# Use BashOperator to shell out to the Feast CLI
materialize_bash = BashOperator(
    task_id='materialize',
    bash_command=f'feast materialize-incremental {datetime.datetime.now().replace(microsecond=0).isoformat()}',
)
```

</details>

### Stream data ingestion

Ingesting from stream sources happens either via a Push API or via a contrib processor that leverages an existing Spark context.

* To **push data into the offline or online stores**: see [push sources](../../reference/data-sources/push.md) for details.
* (experimental) To **use a contrib Spark processor** to ingest from a topic, see [Tutorial: Building streaming features](../../tutorials/building-streaming-features.md).

docs/getting-started/concepts/data-source.md

Lines changed: 0 additions & 16 deletions
This file was deleted.

docs/getting-started/concepts/entity.md

Lines changed: 3 additions & 6 deletions
@@ -33,14 +33,11 @@ At _training time_, users control what entities they want to look up, for exampl
 At _serving time_, users specify _entity key(s)_ to fetch the latest feature values for to power a real-time model prediction (e.g. a fraud detection model that needs to fetch the transaction user's features).
 
 {% hint style="info" %}
-**Q: Can I retrieve features for **_**all**_** entities in Feast?**
-
-Kind of. \
-
-
-In practice, this is most relevant for _batch scoring models_ (e.g. predict user churn for all existing users) that are offline only. For these use cases, Feast supports generating features for a SQL backed list of entities. There is an [open GitHub issue](https://github.com/feast-dev/feast/issues/1611) that welcomes contribution to make this a more intuitive API.&#x20;
+**Q: Can I retrieve features for all entities?**
 
+Kind of.
 
+In practice, this is most relevant for _batch scoring models_ (e.g. predict user churn for all existing users) that are offline only. For these use cases, Feast supports generating features for a SQL-backed list of entities. There is an [open GitHub issue](https://github.com/feast-dev/feast/issues/1611) that welcomes contribution to make this a more intuitive API.
 
 For _real-time feature retrieval_, there is no out of the box support for this because it would promote expensive and slow scan operations. Users can still pass in a large list of entities for retrieval, but this does not scale well.
 {% endhint %}
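Not part of this diff: the SQL-backed entity list mentioned above can be sketched with `get_historical_features`, where `entity_df` is a SQL query rather than a DataFrame (supported by warehouse offline stores such as BigQuery); the table and feature names are hypothetical:

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# For warehouse offline stores, entity_df may be a SQL query string,
# so the entity list never has to leave the warehouse.
training_df = store.get_historical_features(
    entity_df="""
        SELECT user_id, CURRENT_TIMESTAMP() AS event_timestamp
        FROM my_project.my_dataset.all_users  -- hypothetical table
    """,
    features=["user_features:days_since_signup", "user_features:lifetime_value"],
).to_df()
```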
