tonglin0325's homepage

Feature platform: Feast

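Feast is an open-source feature platform, originally developed by Gojek together with Google Cloud. It provides feature registration and management, plus an SDK for interacting with the feature store, the offline store, and the online store. Official documentation: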

https://docs.feast.dev/

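As of v0.24, the latest version at the time of writing, the supported offline stores include File, Snowflake, BigQuery, Redshift, Spark, PostgreSQL, Trino, and Azure Synapse, among others. See: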

https://docs.feast.dev/reference/offline-stores

Supported online stores include SQLite, Snowflake, Redis, Datastore, DynamoDB, PostgreSQL, and Cassandra, among others. See:

https://docs.feast.dev/reference/online-stores

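**provider** defines the environment Feast runs in. It supplies the implementation of the feature store on top of a given platform's components. There are currently four providers: local, gcp, aws, and azure.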

| provider | Supported offline stores | Supported online stores |
| --- | --- | --- |
| local | BigQuery, file | Redis, Datastore, SQLite |
| gcp | BigQuery, file | Datastore, SQLite |
| aws | Redshift, file | DynamoDB, SQLite |
| azure | MsSQL, file | Redis, SQLite |

See:

https://docs.feast.dev/getting-started/architecture-and-components/provider

**data source** defines where feature data comes from. Every batch data source is tied to one offline store; for example, SnowflakeSource can only be used with the Snowflake offline store.

The data source types are: File, Snowflake, BigQuery, Redshift, Push, Kafka, Kinesis, Spark, PostgreSQL, Trino, and Azure Synapse + Azure SQL.

| data source | offline store |
| --- | --- |
| FileSource | file |
| SnowflakeSource | Snowflake |
| BigQuerySource | BigQuery |
| RedshiftSource | Redshift |
| PushSource (can write features to both the online and offline store) | |
| KafkaSource (still experimental) | |
| KinesisSource (still experimental) | |
| SparkSource (supports Hive and Parquet files) | Spark |
| PostgreSQLSource | PostgreSQL |
| TrinoSource | Trino |
| MsSqlServerSource | Azure Synapse + Azure SQL |
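The pairing rule between batch data sources and offline stores can be written down as a simple lookup. The following is a minimal sketch in plain Python and is not part of the Feast SDK: `REQUIRED_OFFLINE_STORE` and `is_compatible` are hypothetical names, the mapping only restates the table, and treating push/stream sources as compatible with any store is a simplifying assumption.

```python
from typing import Dict, Optional

# Data source class name -> offline store it has to be paired with.
# None marks push/stream sources, which the table does not tie to one store.
REQUIRED_OFFLINE_STORE: Dict[str, Optional[str]] = {
    "FileSource": "file",
    "SnowflakeSource": "Snowflake",
    "BigQuerySource": "BigQuery",
    "RedshiftSource": "Redshift",
    "SparkSource": "Spark",
    "PostgreSQLSource": "PostgreSQL",
    "TrinoSource": "Trino",
    "PushSource": None,
    "KafkaSource": None,
    "KinesisSource": None,
}


def is_compatible(source: str, offline_store: str) -> bool:
    """Return True if `source` may be used with `offline_store`.

    Raises KeyError for a source name that is not in the mapping.
    """
    required = REQUIRED_OFFLINE_STORE[source]
    # Assumption: sources without a required store are accepted everywhere.
    return required is None or required.lower() == offline_store.lower()
```

For instance, `is_compatible("SnowflakeSource", "BigQuery")` is rejected, which mirrors the constraint described in the text.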

 

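Batch Materialization Engines load data from the offline store into the online store. They are configured under the batch_engine key in feature_store.yaml.

The default implementation is LocalMaterializationEngine; there is also LambdaMaterializationEngine, which is based on AWS Lambda.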

https://docs.feast.dev/getting-started/architecture-and-components/batch-materialization-engine
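Conceptually, materialization copies the most recent feature values for each entity from the offline store into the online store. The sketch below illustrates that idea with in-memory Python structures only. It is not how LocalMaterializationEngine is implemented; `materialize`, `offline`, and `online` are hypothetical names, and the sample rows are made up for illustration.

```python
from datetime import datetime
from typing import Any, Dict, List


def materialize(
    offline_rows: List[Dict[str, Any]],
    online_store: Dict[Any, Dict[str, Any]],
    start: datetime,
    end: datetime,
    entity_key: str = "driver_id",
    timestamp_field: str = "event_timestamp",
) -> None:
    """Write the newest row per entity within [start, end] into online_store."""
    for row in offline_rows:
        ts = row[timestamp_field]
        if not (start <= ts <= end):
            continue  # outside the materialization window
        key = row[entity_key]
        current = online_store.get(key)
        # Keep only the newest row for each entity key.
        if current is None or current[timestamp_field] < ts:
            online_store[key] = dict(row)


# Made-up offline rows: two for driver 1001, one for driver 1002.
offline = [
    {"driver_id": 1001, "event_timestamp": datetime(2022, 9, 1, 10), "conv_rate": 0.40},
    {"driver_id": 1001, "event_timestamp": datetime(2022, 9, 1, 12), "conv_rate": 0.55},
    {"driver_id": 1002, "event_timestamp": datetime(2022, 9, 1, 11), "conv_rate": 0.30},
]
online: Dict[Any, Dict[str, Any]] = {}
materialize(offline, online, datetime(2022, 9, 1), datetime(2022, 9, 2))
```

After the call, `online` holds one row per driver, and driver 1001 keeps the 12:00 row because it is the newest one in the window.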

Bytewax (used with Kubernetes) and Snowflake (when using SnowflakeSource) can also serve as the batch materialization engine.

You can also implement your own engine. See:

https://docs.feast.dev/how-to-guides/customizing-feast/creating-a-custom-materialization-engine

  

 

1. Installing Feast

https://docs.feast.dev/getting-started/quickstart

The installation below uses v0.23 as the example. Python 3.8 is recommended for v0.23, and Python 3.7 for v0.22.

pip install feast==0.23.0

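Because the offline store chosen here is Hive and the online store is Cassandra, the corresponding offline and online store plugins have to be installed as well: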

pip install feast-cassandra==0.1.3
pip install feast-hive==0.17.0

If installing feast-hive fails because thriftpy cannot be installed, install cython first:

pip install cython
pip install thriftpy
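To confirm which versions actually ended up in the environment, a small check with the standard library can help. This is a minimal sketch, assuming Python 3.8 or later (for `importlib.metadata`); `installed_versions` and `PACKAGES` are hypothetical helper names, not part of Feast.

```python
from importlib import metadata
from typing import Dict, List, Optional

# Distribution names as used in the pip commands above.
PACKAGES: List[str] = ["feast", "feast-cassandra", "feast-hive"]


def installed_versions(packages: List[str] = PACKAGES) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if it is missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```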

  

2. Creating a Feast project

feast init my_project


Creating a new Feast repository in /Users/lintong/coding/python/my_project.

(⎈ |docker-desktop:default)➜ /Users/lintong/coding/python $ tree -L 3 my_project
my_project
├── __init__.py
├── data
│   └── driver_stats.parquet
├── example.py
└── feature_store.yaml

1 directory, 4 files

feature_store.yaml is where the offline store and online store are configured. This file must sit in the root directory of the project. See:

https://docs.feast.dev/reference/feature-repository

The generated file looks like this:

project: my_project
registry: data/registry.db
provider: local
online_store:
    path: data/online_store.db
entity_key_serialization_version: 2

example.py defines the steps of the Feast pipeline: the data source for the features, the entity, the feature view registration, and the feature service. It looks like this:

# This is an example feature definition file

from datetime import timedelta

from feast import Entity, FeatureService, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Read data from parquet files. Parquet is convenient for local development mode. For
# production, you can use your favorite DWH, such as BigQuery. See Feast documentation
# for more info.
driver_hourly_stats = FileSource(
    name="driver_hourly_stats_source",
    path="/Users/lintong/coding/python/my_project/data/driver_stats.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

# Define an entity for the driver. You can think of entity as a primary key used to
# fetch features.
driver = Entity(name="driver", join_keys=["driver_id"])

# Our parquet files contain sample data that includes a driver_id column, timestamps and
# three feature column. Here we define a Feature View that will allow us to serve this
# data to our model online.
driver_hourly_stats_view = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    online=True,
    source=driver_hourly_stats,
    tags={},
)

driver_stats_fs = FeatureService(
    name="driver_activity", features=[driver_hourly_stats_view]
)
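Once these definitions are registered and the online store has been populated, the features can be read back through the Python SDK. The following is a minimal sketch, assuming Feast is installed, `feast apply` has been run in the repository, and data has been materialized into the online store. `feature_refs` and `fetch_driver_features` are hypothetical helper names; `FeatureStore` and `get_online_features` come from the Feast SDK.

```python
from typing import Any, Dict, List


def feature_refs(view_name: str, fields: List[str]) -> List[str]:
    """Build "<feature_view>:<field>" references for a feature view."""
    return [f"{view_name}:{field}" for field in fields]


def fetch_driver_features(repo_path: str, driver_ids: List[int]) -> Dict[str, List[Any]]:
    """Read online features for the given drivers from the repo at repo_path."""
    # Imported lazily so this module can be loaded without Feast installed.
    from feast import FeatureStore

    store = FeatureStore(repo_path=repo_path)
    response = store.get_online_features(
        features=feature_refs(
            "driver_hourly_stats", ["conv_rate", "acc_rate", "avg_daily_trips"]
        ),
        entity_rows=[{"driver_id": driver_id} for driver_id in driver_ids],
    )
    return response.to_dict()
```

A call such as `fetch_driver_features("my_project", [1001])` would return a dictionary keyed by feature name; the driver id here is only an example value.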

  

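3. Configuring and registering the store and features

The feature store configuration file is feature_store.yaml by default; you can also add your own.

The feature definitions live in example.py by default; you can also add your own files.

Once the files are written, run the feast apply command to register the store and features. A **.feastignore** file can be used to exclude paths that should not be picked up as store or feature definitions.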
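The idea behind pattern-based exclusion can be illustrated with the standard library. This is only a conceptual sketch built on `fnmatch`; Feast's own matching rules for .feastignore may differ, and `filter_ignored` as well as the file names are hypothetical.

```python
from fnmatch import fnmatch
from typing import Iterable, List


def filter_ignored(paths: Iterable[str], ignore_patterns: Iterable[str]) -> List[str]:
    """Return the paths that match none of the ignore patterns."""
    # Drop blank lines so an empty pattern never matches anything.
    patterns = [p.strip() for p in ignore_patterns if p.strip()]
    return [
        path
        for path in paths
        if not any(fnmatch(path, pattern) for pattern in patterns)
    ]


# Hypothetical repository contents and ignore patterns.
candidates = ["example.py", "venv/lib/site.py", "scratch/test_features.py"]
kept = filter_ignored(candidates, ["venv/*", "scratch/*"])
```

Here only `example.py` survives the filter, so it is the only file that would be scanned for definitions in this toy setup.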

 

If feast apply fails with the following error:

ImportError: cannot import name 'soft_unicode' from 'markupsafe'

it can be resolved by pinning markupsafe:

pip install markupsafe==2.0.1
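The error comes from MarkupSafe 2.1.0 removing `soft_unicode`, which is why pinning 2.0.1 helps. A quick way to check whether the current environment is affected is sketched below; `has_soft_unicode` is a hypothetical helper name.

```python
def has_soft_unicode() -> bool:
    """Return True if the installed markupsafe still exports soft_unicode."""
    try:
        from markupsafe import soft_unicode  # noqa: F401
    except ImportError:
        # Covers both a missing markupsafe and a version without soft_unicode.
        return False
    return True


if __name__ == "__main__":
    if has_soft_unicode():
        print("markupsafe provides soft_unicode")
    else:
        print("soft_unicode unavailable; consider: pip install markupsafe==2.0.1")
```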