Time-Series Data Analysis

PostgreSQL Materialized Views and Where to Use Them

Learn more about continuous aggregates, Timescale's alternative to PostgreSQL materialized views.

A time-series database (TSDB) is a computer system designed to store and retrieve data records that are part of a "time series," that is, a set of data points associated with timestamps.

The timestamp provides important context for how each data point relates to the others. Time-series data is often a continuous flow, such as sensor measurements or intraday stock prices. Any time-series database lets you store large amounts of timestamped data in a format that allows fast insertion and fast retrieval, supporting complex data analysis.

Incremental materialized views and continuous aggregates

Timescale is one of the most important players in this space, unlocking the power of PostgreSQL along with a set of features that support time-series management.

One of the most important and frequently used features is continuous aggregates (CAGGs).

Continuous aggregates speed up aggregate queries over large volumes of data. They can also be used to downsample data and reduce its granularity once it reaches a certain age, freeing up space while still allowing long-term analytics (a sketch of this pattern follows the feature list below).

You can think of a continuous aggregate as a more powerful PostgreSQL materialized view, with many extra features:

  • Real-time aggregation

  • Compression

  • Data retention

  • Incremental and automatic refresh
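
For instance, a common downsampling pattern combines a CAGG with a retention policy on the raw hypertable, so that old raw rows are dropped while the aggregates survive. A minimal sketch (the 30-day interval is an assumption):

-- Drop raw chunks older than 30 days; the aggregates materialized in the
-- CAGG remain available for long-term analytics.
SELECT add_retention_policy('conditions', drop_after => INTERVAL '30 days');

Just make sure the CAGG has been refreshed over a time range before the raw chunks covering it are dropped.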

Let's dive in and see how they work.

How incremental materialized views work in Timescale

Suppose we are storing data from sensors that measure temperature and humidity. We have a hypertable (conditions) storing this data and a regular PostgreSQL table (locations) storing the sensor metadata.
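
As a reference point, such a setup could have been created like this (a sketch: the column types follow the \d+ output below, while the chunk interval is an assumption):

CREATE TABLE conditions (
  time        TIMESTAMPTZ NOT NULL,
  device_id   TEXT,
  temperature NUMERIC,
  humidity    NUMERIC
);
-- Turn the regular table into a hypertable partitioned by time
SELECT create_hypertable('conditions', 'time', chunk_time_interval => INTERVAL '7 days');

CREATE TABLE locations (
  device_id   TEXT,
  location    TEXT,
  environment TEXT
);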

Conditions:

tsdb=> \d+ conditions
                                                 Table "public.conditions"
   Column    |           Type           | Collation | Nullable | Default | Storage  | Compression | Stats target | Description 
-------------+--------------------------+-----------+----------+---------+----------+-------------+--------------+-------------
 time        | timestamp with time zone |           | not null |         | plain    |             |              | 
 device_id   | text                     |           |          |         | extended |             |              | 
 temperature | numeric                  |           |          |         | main     |             |              | 
 humidity    | numeric                  |           |          |         | main     |             |              | 
Indexes:
   "conditions_device_id_time_idx" btree (device_id, "time" DESC)
   "conditions_time_idx" btree ("time" DESC)
Triggers:
   ts_insert_blocker BEFORE INSERT ON conditions FOR EACH ROW EXECUTE FUNCTION _timescaledb_internal.insert_blocker()
Child tables: _timescaledb_internal._hyper_1_100_chunk,
             _timescaledb_internal._hyper_1_101_chunk,
             _timescaledb_internal._hyper_1_102_chunk,
             _timescaledb_internal._hyper_1_103_chunk,
             .........................................
             _timescaledb_internal._hyper_1_97_chunk,
             _timescaledb_internal._hyper_1_98_chunk,
             _timescaledb_internal._hyper_1_99_chunk
Access method: heap

And locations:

tsdb=> \d+ locations  
                                        Table "public.locations"
   Column    | Type | Collation | Nullable | Default | Storage  | Compression | Stats target | Description 
-------------+------+-----------+----------+---------+----------+-------------+--------------+-------------
 device_id   | text |           |          |         | extended |             |              | 
 location    | text |           |          |         | extended |             |              | 
 environment | text |           |          |         | extended |             |              | 
Access method: heap

The conditions hypertable has 160 million records, and the locations table has 2,000 records.

Let's start by computing the average temperature and humidity, aggregated over a 15-minute window:

EXPLAIN ANALYZE
SELECT time_bucket(INTERVAL '15 minutes', time) AS bucket,
       device_id,
       avg(temperature),
       avg(humidity)
FROM conditions
WHERE time BETWEEN '2023-02-05' AND '2023-03-05'
GROUP BY bucket, device_id;

QUERY PLAN
----------------------------------------------------------------------
Finalize HashAggregate  (cost=3552434.85..3856581.26 rows=3996996 width=91) (actual time=29280.792..41746.314 rows=5336000 loops=1)
Group Key: (time_bucket('00:15:00'::interval, _hyper_7_195_chunk."time")), _hyper_7_195_chunk.device_id
Planned Partitions: 256  Batches: 1305  Memory Usage: 10937kB  Disk Usage: 1294032kB

->  Gather  (cost=1820403.27..3015026.25 rows=7993992 width=91) (actual time=12912.397..23014.709 rows=6641075 loops=1)
Workers Planned: 2
Workers Launched: 2

->  Partial HashAggregate  (cost=1819403.27..2214627.05 rows=3996996 width=91) (actual time=12905.728..22321.716 rows=2213692 loops=3)
Group Key: time_bucket('00:15:00'::interval, _hyper_7_195_chunk."time"), _hyper_7_195_chunk.device_id
Planned Partitions: 256  Batches: 257  Memory Usage: 10641kB  Disk Usage: 1031816kB
Worker 0:  Batches: 257  Memory Usage: 10641kB  Disk Usage: 1031576kB
Worker 1:  Batches: 257  Memory Usage: 10641kB  Disk Usage: 1031496kB

->  Result  (cost=0.43..961974.82 rows=16654149 width=50) (actual time=0.694..6307.600 rows=13333333 loops=3)

->  Parallel Append  (cost=0.43..753797.95 rows=16654149 width=50) (actual time=0.691..4067.675 rows=13333333 loops=3)

->  Parallel Index Scan using _hyper_7_195_chunk_conditions_time_idx on _hyper_7_195_chunk  (cost=0.43..23164.76 rows=382303 width=50) (actual time=0.043..959.387 rows=918000 loops=1)
Index Cond: (("time" >= '2023-02-05 00:00:00+00'::timestamp with time zone) AND ("time" <= '2023-03-05 00:00:00+00'::timestamp with time zone))

->  Parallel Index Scan using _hyper_7_190_chunk_conditions_time_idx on _hyper_7_190_chunk  (cost=0.43..36388.29 rows=601368 width=49) (actual time=2.014..532.092 rows=1440000 loops=1)
Index Cond: (("time" >= '2023-02-05 00:00:00+00'::timestamp with time zone) AND ("time" <= '2023-03-05 00:00:00+00'::timestamp with time zone))

->  Parallel Seq Scan on _hyper_7_189_chunk  (cost=0.00..163635.00 rows=4200000 width=50) (actual time=0.011..1951.004 rows=10080000 loops=1)
Filter: (("time" >= '2023-02-05 00:00:00+00'::timestamp with time zone) AND ("time" <= '2023-03-05 00:00:00+00'::timestamp with time zone))

->  Parallel Seq Scan on _hyper_7_188_chunk  (cost=0.00..163515.05 rows=4192737 width=50) (actual time=0.015..1933.225 rows=10080000 loops=1)
Filter: (("time" >= '2023-02-05 00:00:00+00'::timestamp with time zone) AND ("time" <= '2023-03-05 00:00:00+00'::timestamp with time zone))

->  Parallel Seq Scan on _hyper_7_187_chunk  (cost=0.00..163508.61 rows=4193574 width=50) (actual time=0.007..633.274 rows=3360000 loops=3)
Filter: (("time" >= '2023-02-05 00:00:00+00'::timestamp with time zone) AND ("time" <= '2023-03-05 00:00:00+00'::timestamp with time zone))

->  Parallel Seq Scan on _hyper_7_186_chunk  (cost=0.00..120315.50 rows=3084167 width=50) (actual time=0.015..1363.652 rows=7402000 loops=1)
Filter: (("time" >= '2023-02-05 00:00:00+00'::timestamp with time zone) AND ("time" <= '2023-03-05 00:00:00+00'::timestamp with time zone))

Planning Time: 12.377 ms

Execution Time: 42291.580 ms

(27 rows)

Time: 42332,603 ms (00:42,333)

It takes about 42 seconds. As you can see, Timescale is smart enough to exclude the chunks (the data partitions inside a hypertable) that do not contain data needed by this particular query.
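
You can list the chunks a time range maps to with show_chunks; for example, mirroring the query's time range:

SELECT show_chunks('conditions',
                   older_than => TIMESTAMPTZ '2023-03-05',
                   newer_than => TIMESTAMPTZ '2023-02-05');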

Creating a continuous aggregate over a 15-minute window

Now let's create a continuous aggregate that groups the data into 15-minute windows and computes the average temperature and humidity.

CREATE MATERIALIZED VIEW conditions_summary_15mins
WITH (timescaledb.continuous) AS
SELECT
  time_bucket(INTERVAL '15 minutes', time) AS bucket,
  device_id,
  AVG(temperature) AS avg_temp,
  MAX(temperature) AS max_temp,
  MIN(temperature) AS min_temp,
  AVG(humidity) AS avg_humidity,
  MAX(humidity) AS max_humidity,
  MIN(humidity) AS min_humidity
FROM conditions
GROUP BY bucket, device_id
WITH NO DATA;

The continuous aggregate is created empty (WITH NO DATA) and keeps the minimum and maximum temperature and humidity alongside the averages. Now let's work on a policy to refresh the CAGG.

Continuous aggregates, like materialized views, need to be refreshed periodically to keep query times low. Unlike materialized views, continuous aggregates offer an amazing feature called **real-time aggregation**, which returns results that are up to date with the last transaction on the hypertable.

How does it work?

Timescale is smart enough to retrieve the already-aggregated data from the continuous aggregate and fetch the not-yet-aggregated data from the hypertable. It then aggregates the latter, merges the two results, and returns them to the client. This naturally requires some extra steps (and time), so keeping the continuous aggregate up to date significantly improves query performance.
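
Conceptually, a real-time query against the CAGG behaves like the following UNION. This is a simplified sketch, not Timescale's actual internal rewrite; the internal functions shown here also appear in the execution plans below:

-- Materialized part: buckets already stored in the CAGG's internal hypertable
SELECT bucket, device_id, avg_temp
FROM _timescaledb_internal._materialized_hypertable_8
WHERE bucket < _timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(8))
UNION ALL
-- Real-time part: aggregate raw rows newer than the watermark on the fly
SELECT time_bucket(INTERVAL '15 minutes', time) AS bucket,
       device_id,
       AVG(temperature) AS avg_temp
FROM conditions
WHERE time >= _timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(8))
GROUP BY 1, 2;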

So how do you keep a continuous aggregate up to date? There is a handy function for that (shown here on an hourly aggregate as an example):

SELECT add_continuous_aggregate_policy('conditions_summary_hourly',
  start_offset => INTERVAL '1 day',
  end_offset => INTERVAL '1 hour',
  schedule_interval => INTERVAL '1 hour');

What does this mean? We are adding a refresh policy to the conditions_summary_hourly continuous aggregate. It refreshes an interval that starts one day in the past and ends one hour in the past, once per hour. If real-time aggregation is enabled, the last hour of data is fetched from the hypertable and merged in when a query needs it. start_offset and end_offset may be NULL, but we strongly advise against that, especially for end_offset on hypertables with heavy write traffic.
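
You can also refresh a specific window on demand with the refresh_continuous_aggregate procedure (the window below is illustrative):

-- One-off refresh of the 15-minute CAGG over February 2023
CALL refresh_continuous_aggregate('conditions_summary_15mins',
                                  '2023-02-05', '2023-03-05');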

Now let's run the same query against the continuous aggregate:

EXPLAIN ANALYZE SELECT bucket, device_id, avg_temp, avg_humidity
FROM conditions_summary_15mins
WHERE bucket BETWEEN '2023-02-05' AND '2023-03-05';

QUERY PLAN
-------------------------------------------------------------------------------

Append  (cost=0.43..377584.73 rows=5154662 width=50) (actual time=0.016..6323.419 rows=5338000 loops=1)

->  Subquery Scan on "*SELECT* 1"  (cost=0.43..303008.23 rows=5147365 width=50) (actual time=0.015..3458.459 rows=5144000 loops=1)

->  Result  (cost=0.43..251534.58 rows=5147365 width=178) (actual time=0.015..2779.848 rows=5144000 loops=1)

->  Custom Scan (ChunkAppend) on _materialized_hypertable_8  (cost=0.43..200060.93 rows=5147365 width=50) (actual time=0.014..2067.726 rows=5144000 loops=1)

Chunks excluded during startup: 0

->  Index Scan using _hyper_8_216_chunk__materialized_hypertable_8_bucket_idx on _hyper_8_216_chunk  (cost=0.43..5083.93 rows=127365 width=50) (actual time=0.013..34.494 rows=124000 loops=1)
Index Cond: ((bucket < COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(8)), '-infinity'::timestamp with time zone)) AND (bucket >= '2023-02-05 00:00:00+00'::timestamp with time zone) AND (bucket <= '2023-03-05 00:00:00+00'::timestamp with time zone))

->  Seq Scan on _hyper_8_217_chunk  (cost=0.00..38431.00 rows=988000 width=50) (actual time=0.012..283.673 rows=988000 loops=1)
Filter: ((bucket >= '2023-02-05 00:00:00+00'::timestamp with time zone) AND (bucket <= '2023-03-05 00:00:00+00'::timestamp with time zone) AND (bucket < COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(8)), '-infinity'::timestamp with time zone)))

->  Seq Scan on _hyper_8_218_chunk  (cost=0.00..52175.00 rows=1344000 width=50) (actual time=0.010..412.740 rows=1344000 loops=1)
Filter: ((bucket >= '2023-02-05 00:00:00+00'::timestamp with time zone) AND (bucket <= '2023-03-05 00:00:00+00'::timestamp with time zone) AND (bucket < COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(8)), '-infinity'::timestamp with time zone)))

->  Seq Scan on _hyper_8_219_chunk  (cost=0.00..52184.00 rows=1344000 width=50) (actual time=0.009..437.345 rows=1344000 loops=1)
Filter: ((bucket >= '2023-02-05 00:00:00+00'::timestamp with time zone) AND (bucket <= '2023-03-05 00:00:00+00'::timestamp with time zone) AND (bucket < COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(8)), '-infinity'::timestamp with time zone)))

->  Seq Scan on _hyper_8_220_chunk  (cost=0.00..52187.00 rows=1344000 width=50) (actual time=0.007..431.213 rows=1344000 loops=1)
Filter: ((bucket >= '2023-02-05 00:00:00+00'::timestamp with time zone) AND (bucket <= '2023-03-05 00:00:00+00'::timestamp with time zone) AND (bucket < COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(8)), '-infinity'::timestamp with time zone)))

->  Subquery Scan on "*SELECT* 2"  (cost=47717.46..48803.19 rows=7297 width=91) (actual time=964.021..2342.694 rows=194000 loops=1)

->  Finalize GroupAggregate  (cost=47717.46..48730.22 rows=7297 width=219) (actual time=964.021..2314.030 rows=194000 loops=1)

Group Key: (time_bucket('00:15:00'::interval, conditions."time")), conditions.device_id

->  Gather Merge  (cost=47717.46..48511.23 rows=6086 width=91) (actual time=963.992..1666.272 rows=413099 loops=1)

Workers Planned: 2
Workers Launched: 2

->  Partial GroupAggregate  (cost=46717.44..46808.73 rows=3043 width=91) (actual time=929.071..1400.633 rows=137700 loops=3)

Group Key: (time_bucket('00:15:00'::interval, conditions."time")), conditions.device_id

->  Sort  (cost=46717.44..46725.05 rows=3043 width=49) (actual time=929.053..1015.758 rows=484667 loops=3)

Sort Key: (time_bucket('00:15:00'::interval, conditions."time")), conditions.device_id

Sort Method: external merge  Disk: 28976kB

Worker 0:  Sort Method: external merge  Disk: 28240kB

Worker 1:  Sort Method: external merge  Disk: 28840kB

->  Result  (cost=0.44..46541.38 rows=3043 width=49) (actual time=0.027..300.722 rows=484667 loops=3)

->  Parallel Custom Scan (ChunkAppend) on conditions  (cost=0.44..46503.35 rows=3043 width=49) (actual time=0.026..212.001 rows=484667 loops=3)

Chunks excluded during startup: 5

->  Parallel Index Scan using _hyper_7_190_chunk_conditions_time_idx on _hyper_7_190_chunk  (cost=0.43..46490.03 rows=3038 width=49) (actual time=0.025..165.745 rows=484667 loops=3)

Index Cond: (("time" >= COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(8)), '-infinity'::timestamp with time zone)) AND ("time" >= '2023-02-05 00:00:00+00'::timestamp with time zone) AND ("time" <= '2023-03-05 00:15:00+00'::timestamp with time zone))

Filter: ((time_bucket('00:15:00'::interval, "time") >= '2023-02-05 00:00:00+00'::timestamp with time zone) AND (time_bucket('00:15:00'::interval, "time") <= '2023-03-05 00:00:00+00'::timestamp with time zone))

Planning Time: 2.943 ms
Execution Time: 6585.342 ms
(36 rows)

Time: 6618,197 ms (00:06,618)

Now it takes 6.6 seconds, not bad! As you can see from the execution plan, the query is split into two parts: "*SELECT* 1" and "*SELECT* 2".

The first runs against the CAGG, and the second against the most recent data in the hypertable. The rows are fetched, computed, and merged into the requested format.

Turning off real-time aggregation

If you don't care about catching the latest transactions and just need the data for analytics, you can disable this feature and retrieve rows only from the CAGG:

ALTER MATERIALIZED VIEW conditions_summary_15mins SET (timescaledb.materialized_only = true);

Running the same query again:

QUERY PLAN
----------------------------------------------

Append  (cost=0.43..187829.34 rows=5147365 width=50) (actual time=0.015..1760.471 rows=5144000 loops=1)
   ->  Index Scan using _hyper_8_216_chunk__materialized_hypertable_8_bucket_idx on _hyper_8_216_chunk  (cost=0.43..4765.52 rows=127365 width=50) (actual time=0.015..43.875 rows=124000 loops=1)
         Index Cond: ((bucket >= '2023-02-05 00:00:00+00'::timestamp with time zone) AND (bucket <= '2023-03-05 00:00:00+00'::timestamp with time zone))
   ->  Seq Scan on _hyper_8_217_chunk  (cost=0.00..31021.00 rows=988000 width=50) (actual time=0.011..239.407 rows=988000 loops=1)
         Filter: ((bucket >= '2023-02-05 00:00:00+00'::timestamp with time zone) AND (bucket <= '2023-03-05 00:00:00+00'::timestamp with time zone))
   ->  Seq Scan on _hyper_8_218_chunk  (cost=0.00..42095.00 rows=1344000 width=50) (actual time=0.010..343.470 rows=1344000 loops=1)
         Filter: ((bucket >= '2023-02-05 00:00:00+00'::timestamp with time zone) AND (bucket <= '2023-03-05 00:00:00+00'::timestamp with time zone))
   ->  Seq Scan on _hyper_8_219_chunk  (cost=0.00..42104.00 rows=1344000 width=50) (actual time=0.008..348.315 rows=1344000 loops=1)
         Filter: ((bucket >= '2023-02-05 00:00:00+00'::timestamp with time zone) AND (bucket <= '2023-03-05 00:00:00+00'::timestamp with time zone))
   ->  Seq Scan on _hyper_8_220_chunk  (cost=0.00..42107.00 rows=1344000 width=50) (actual time=0.010..330.572 rows=1344000 loops=1)
         Filter: ((bucket >= '2023-02-05 00:00:00+00'::timestamp with time zone) AND (bucket <= '2023-03-05 00:00:00+00'::timestamp with time zone))

Planning Time: 0.981 ms
Execution Time: 1985.474 ms
(13 rows)

Time: 2016,746 ms (00:02,017)

Down to two seconds now, even better! As you can see, the data is retrieved from the CAGG only.

Re-enabling it

Let's turn real-time aggregation back on:

ALTER MATERIALIZED VIEW conditions_summary_15mins SET (timescaledb.materialized_only = false);

People commonly aggregate data over different time windows for different kinds of analysis or dashboards. In finance, for example, you typically look at 15-minute, hourly, and daily views of the data. So we could create a new CAGG with a different aggregation window:

CREATE MATERIALIZED VIEW conditions_summary_4hrs
WITH (timescaledb.continuous) AS
SELECT
  time_bucket(INTERVAL '4 hours', time) AS bucket,
  device_id,
  AVG(temperature) AS avg_temp,
  MAX(temperature) AS max_temp,
  MIN(temperature) AS min_temp,
  AVG(humidity) AS avg_humidity,
  MAX(humidity) AS max_humidity,
  MIN(humidity) AS min_humidity
FROM conditions
GROUP BY bucket, device_id;

NOTICE:  refreshing continuous aggregate "conditions_summary_4hrs"
HINT:  Use WITH NO DATA if you do not want to refresh the continuous aggregate on creation.
CREATE MATERIALIZED VIEW
Time: 544078,223 ms (09:04,078)

The first refresh of the aggregated data takes nine minutes.

Hierarchical continuous aggregates (materialized views on materialized views)

Starting with version 2.9.0, Timescale allows you to create hierarchical continuous aggregates (also known as a continuous aggregate on top of another continuous aggregate).

Let's see how it works:

-- The 4-hour CAGG created above must be dropped first so the same name
-- can be reused for the hierarchical version
DROP MATERIALIZED VIEW conditions_summary_4hrs;

CREATE MATERIALIZED VIEW conditions_summary_4hrs
WITH (timescaledb.continuous) AS
SELECT
  time_bucket(INTERVAL '4 hours', bucket) AS bucket_4hrs,
  device_id,
  AVG(avg_temp) AS avg_temp,
  MAX(max_temp) AS max_temp,
  MIN(min_temp) AS min_temp,
  AVG(avg_humidity) AS avg_humidity,
  MAX(max_humidity) AS max_humidity,
  MIN(min_humidity) AS min_humidity
FROM conditions_summary_15mins
GROUP BY bucket_4hrs, device_id;

The new CAGG has a four-hour window and is built on top of the 15-minute CAGG.

NOTICE:  refreshing continuous aggregate "conditions_summary_4hrs"
HINT:  Use WITH NO DATA if you do not want to refresh the continuous aggregate on creation.
CREATE MATERIALIZED VIEW
Time: 70160,808 ms (01:10,161)

The result is a view refreshed in 70 seconds, which is great! This will also benefit the refresh policy, reducing refresh time and resource usage.
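
The hierarchical CAGG can get its own refresh policy. A sketch with illustrative offsets (the refresh window of a hierarchical CAGG should lag behind that of the CAGG it is built on):

SELECT add_continuous_aggregate_policy('conditions_summary_4hrs',
  start_offset => INTERVAL '2 days',
  end_offset => INTERVAL '4 hours',
  schedule_interval => INTERVAL '4 hours');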

Now let's look at the execution plan for a query on the hierarchical CAGG:

EXPLAIN ANALYZE SELECT bucket_4hrs, device_id, avg_temp, avg_humidity
FROM conditions_summary_4hrs
WHERE bucket_4hrs BETWEEN '2023-02-05' AND '2023-03-07';

QUERY PLAN
--------------------------------------------------------------------------

Append  (cost=0.00..25793.82 rows=350092 width=51) (actual time=0.010..319.095 rows=352000 loops=1)

->  Subquery Scan on "*SELECT* 1"  (cost=0.00..21655.56 rows=350000 width=51) (actual time=0.009..217.240 rows=350000 loops=1)

->  Result  (cost=0.00..18155.56 rows=350000 width=179) (actual time=0.009..175.949 rows=350000 loops=1)

->  Custom Scan (ChunkAppend) on _materialized_hypertable_11  (cost=0.00..14655.56 rows=350000 width=51) (actual time=0.007..131.150 rows=350000 loops=1)

Chunks excluded during startup: 0

->  Seq Scan on _hyper_11_244_chunk  (cost=0.00..3255.00 rows=84000 width=51) (actual time=0.007..24.047 rows=84000 loops=1)
Filter: ((bucket_4hrs >= '2023-02-05 00:00:00+00'::timestamp with time zone) AND (bucket_4hrs <= '2023-03-07 00:00:00+00'::timestamp with time zone) AND (bucket_4hrs < COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(11)), '-infinity'::timestamp with time zone)))

->  Seq Scan on _hyper_11_246_chunk  (cost=0.00..3255.00 rows=84000 width=51) (actual time=0.010..23.718 rows=84000 loops=1)
Filter: ((bucket_4hrs >= '2023-02-05 00:00:00+00'::timestamp with time zone) AND (bucket_4hrs <= '2023-03-07 00:00:00+00'::timestamp with time zone) AND (bucket_4hrs < COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(11)), '-infinity'::timestamp with time zone)))

->  Seq Scan on _hyper_11_251_chunk  (cost=0.00..3256.00 rows=84000 width=51) (actual time=0.009..23.184 rows=84000 loops=1)
Filter: ((bucket_4hrs >= '2023-02-05 00:00:00+00'::timestamp with time zone) AND (bucket_4hrs <= '2023-03-07 00:00:00+00'::timestamp with time zone) AND (bucket_4hrs < COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(11)), '-infinity'::timestamp with time zone)))

->  Index Scan using _hyper_11_252_chunk__materialized_hypertable_11_bucket_4hrs_idx on _hyper_11_252_chunk  (cost=0.30..1403.56 rows=8000 width=51) (actual time=0.024..4.620 rows=8000 loops=1)
Index Cond: ((bucket_4hrs < COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(11)), '-infinity'::timestamp with time zone)) AND (bucket_4hrs >= '2023-02-05 00:00:00+00'::timestamp with time zone) AND (bucket_4hrs <= '2023-03-07 00:00:00+00'::timestamp with time zone))

->  Seq Scan on _hyper_11_254_chunk  (cost=0.00..2406.00 rows=62000 width=51) (actual time=0.007..17.351 rows=62000 loops=1)
Filter: ((bucket_4hrs >= '2023-02-05 00:00:00+00'::timestamp with time zone) AND (bucket_4hrs <= '2023-03-07 00:00:00+00'::timestamp with time zone) AND (bucket_4hrs < COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(11)), '-infinity'::timestamp with time zone)))

->  Seq Scan on _hyper_11_259_chunk  (cost=0.00..1080.00 rows=28000 width=51) (actual time=0.007..7.673 rows=28000 loops=1)
Filter: ((bucket_4hrs >= '2023-02-05 00:00:00+00'::timestamp with time zone) AND (bucket_4hrs <= '2023-03-07 00:00:00+00'::timestamp with time zone) AND (bucket_4hrs < COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(11)), '-infinity'::timestamp with time zone)))

->  Subquery Scan on "*SELECT* 2"  (cost=2384.12..2387.80 rows=92 width=91) (actual time=60.909..71.279 rows=2000 loops=1)

->  GroupAggregate  (cost=2384.12..2386.88 rows=92 width=219) (actual time=60.909..71.024 rows=2000 loops=1)
Group Key: (time_bucket('04:00:00'::interval, "*SELECT* 1_1".bucket)), "*SELECT* 1_1".device_id

->  Sort  (cost=2384.12..2384.35 rows=92 width=52) (actual time=60.897..61.946 rows=18000 loops=1)
Sort Key: (time_bucket('04:00:00'::interval, "*SELECT* 1_1".bucket)), "*SELECT* 1_1".device_id
Sort Method: quicksort  Memory: 2452kB

->  Result  (cost=0.44..2381.12 rows=92 width=52) (actual time=2.047..41.311 rows=18000 loops=1)

->  Append  (cost=0.44..2379.97 rows=92 width=53) (actual time=2.046..38.692 rows=18000 loops=1)

->  Subquery Scan on "*SELECT* 1_1"  (cost=0.44..863.69 rows=86 width=50) (actual time=2.045..13.640 rows=16000 loops=1)

->  Result  (cost=0.44..862.83 rows=86 width=178) (actual time=2.045..11.803 rows=16000 loops=1)

->  Custom Scan (ChunkAppend) on _materialized_hypertable_8  (cost=0.44..861.97 rows=86 width=50) (actual time=2.044..9.681 rows=16000 loops=1)
Chunks excluded during startup: 5

->  Index Scan using _hyper_8_258_chunk__materialized_hypertable_8_bucket_idx on _hyper_8_258_chunk  (cost=0.43..848.61 rows=81 width=50) (actual time=2.043..8.155 rows=16000 loops=1)

Index Cond: ((bucket < COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(8)), '-infinity'::timestamp with time zone)) AND (bucket >= COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(11)), '-infinity'::timestamp with time zone)) AND (bucket >= '2023-02-05 00:00:00+00'::timestamp with time zone) AND (bucket <= '2023-03-07 04:00:00+00'::timestamp with time zone))

Filter: ((time_bucket('04:00:00'::interval, bucket) >= '2023-02-05 00:00:00+00'::timestamp with time zone) AND (time_bucket('04:00:00'::interval, bucket) <= '2023-03-07 00:00:00+00'::timestamp with time zone))

->  Subquery Scan on "*SELECT* 2_1"  (cost=1515.57..1515.81 rows=6 width=91) (actual time=17.266..23.544 rows=2000 loops=1)

->  GroupAggregate  (cost=1515.57..1515.75 rows=6 width=219) (actual time=17.265..23.291 rows=2000 loops=1)
Group Key: (time_bucket('00:15:00'::interval, conditions."time")), conditions.device_id

->  Sort  (cost=1515.57..1515.59 rows=6 width=50) (actual time=17.251..17.728 rows=8000 loops=1)
Sort Key: (time_bucket('00:15:00'::interval, conditions."time")), conditions.device_id
Sort Method: quicksort  Memory: 934kB

->  Result  (cost=0.44..1515.49 rows=6 width=50) (actual time=2.067..8.169 rows=8000 loops=1)

->  Custom Scan (ChunkAppend) on conditions  (cost=0.44..1515.42 rows=6 width=50) (actual time=2.066..6.806 rows=8000 loops=1)
Chunks excluded during startup: 5

->  Index Scan using _hyper_7_190_chunk_conditions_time_idx on _hyper_7_190_chunk  (cost=0.43..1507.46 rows=1 width=49) (actual time=2.065..6.109 rows=8000 loops=1)
Index Cond: ("time" >= COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(8)), '-infinity'::timestamp with time zone))
Filter: ((time_bucket('00:15:00'::interval, "time") >= '2023-02-05 00:00:00+00'::timestamp with time zone) AND (time_bucket('00:15:00'::interval, "time") <= '2023-03-07 04:00:00+00'::timestamp with time zone) AND (time_bucket('04:00:00'::interval, time_bucket('00:15:00'::interval, "time")) >= '2023-02-05 00:00:00+00'::timestamp with time zone) AND (time_bucket('04:00:00'::interval, time_bucket('00:15:00'::interval, "time")) <= '2023-03-07 00:00:00+00'::timestamp with time zone) AND (time_bucket('00:15:00'::interval, "time") >= COALESCE(_timescaledb_internal.to_timestamp(_timescaledb_internal.cagg_watermark(11)), '-infinity'::timestamp with time zone)))

Planning Time: 29.508 ms
Execution Time: 355.948 ms

Very cool: 355 milliseconds! You can see that the query now fetches data from three different hypertables and merges everything together:

->  Subquery Scan on "*SELECT* 1"  (cost=0.00..21655.56 rows=350000 width=51) (actual time=0.009..217.240 rows=350000 loops=1)

->  Result  (cost=0.00..18155.56 rows=350000 width=179) (actual time=0.009..175.949 rows=350000 loops=1)

->  Custom Scan (ChunkAppend) on _materialized_hypertable_11  (cost=0.00..14655.56 rows=350000 width=51) (actual time=0.007..131.150 rows=350000 loops=1)
…………………………………………………………………………………………
->  Subquery Scan on "*SELECT* 1_1"  (cost=0.44..863.69 rows=86 width=50) (actual time=2.045..13.640 rows=16000 loops=1)

->  Result  (cost=0.44..862.83 rows=86 width=178) (actual time=2.045..11.803 rows=16000 loops=1)

->  Custom Scan (ChunkAppend) on _materialized_hypertable_8  (cost=0.44..861.97 rows=86 width=50) (actual time=2.044..9.681 rows=16000 loops=1)
…………………………………………………………………………………………
->  Result  (cost=0.44..1515.49 rows=6 width=50) (actual time=2.067..8.169 rows=8000 loops=1)

->  Custom Scan (ChunkAppend) on conditions  (cost=0.44..1515.42 rows=6 width=50) (actual time=2.066..6.806 rows=8000 loops=1)

Adding JOINs to the mix

What about JOINs? Starting with version 2.10.0, CAGGs support JOINs, meaning you can create them by joining the hypertable with another regular table.

To see how this works, we'll join conditions and locations and aggregate over a 15-minute window.

CREATE MATERIALIZED VIEW conditions_summary_15mins_join
WITH (timescaledb.continuous) AS
SELECT
  time_bucket(INTERVAL '15 minutes', c.time) AS bucket,
  c.device_id,
  l.location,
  l.environment,
  AVG(c.temperature) AS avg_temp,
  MAX(c.temperature) AS max_temp,
  MIN(c.temperature) AS min_temp,
  AVG(c.humidity) AS avg_humidity,
  MAX(c.humidity) AS max_humidity,
  MIN(c.humidity) AS min_humidity
FROM conditions c
  JOIN locations l USING (device_id)
GROUP BY bucket, device_id, l.location, l.environment;

The newly created CAGG holds the already-joined data, as you can see below, so there's no need to use a JOIN when querying the CAGG.

tsdb=> SELECT * from conditions_summary_15mins_join limit 10;
        bucket         |     device_id      |   location    | environment |      avg_temp       |      max_temp      |     min_temp       
------------------------+--------------------+---------------+-------------+---------------------+--------------------+--------------
2023-02-01 23:45:00+00 | weather-pro-000000 | field-000000  | outside     | 89.2000000000000463 |  89.40000000000003 | 89.00000000000006  
2023-02-01 23:45:00+00 | weather-pro-000001 | office-000000 | inside      | 68.3000000000000050 |  68.50000000000001 |              68.1  
2023-02-01 23:45:00+00 | weather-pro-000002 | field-000001  | outside     | 85.7875000000002375 |  85.90000000000023 | 85.60000000000025  
2023-02-01 23:45:00+00 | weather-pro-000003 | arctic-000000 | outside     | 39.3499999999999605 | 39.499999999999964 | 39.19999999999996  
2023-02-01 23:45:00+00 | weather-pro-000004 | door-00000    | doorway     | 65.5250000000002563 |  65.60000000000025 | 65.40000000000026  
2023-02-01 23:45:00+00 | weather-pro-000005 | office-000001 | inside      | 71.4249999999999450 |  71.69999999999995 | 71.19999999999995  
2023-02-01 23:45:00+00 | weather-pro-000006 | field-000002  | outside     | 86.7750000000001850 |  86.90000000000018 | 86.70000000000019  
2023-02-01 23:45:00+00 | weather-pro-000007 | swamp-000000  | outside     | 87.8750000000001775 |  88.00000000000017 | 87.80000000000018  
2023-02-01 23:45:00+00 | weather-pro-000008 | field-000003  | outside     | 87.1500000000001588 |  87.30000000000015 | 87.00000000000017  
2023-02-01 23:45:00+00 | weather-pro-000009 | door-00001    | doorway     | 62.7750000000003226 | 62.900000000000325 | 62.70000000000032  
(10 rows)

Time: 65,128 ms

Bonus: What about compression?

Now that we've walked through some of the amazing features related to continuous aggregates, you might say, "Wait, all this comes at the cost of extra disk space, right?"

Yes, CAGGs significantly reduce the CPU, memory, and disk I/O needed at query time, but they do require additional disk space. To deal with that, Timescale brings you another exciting feature: compression.

You can compress data on both hypertables and CAGGs, and in most cases it performs remarkably well. Let's look at an example.

SELECT * FROM
hypertable_detailed_size('_timescaledb_internal._materialized_hypertable_8');
table_bytes | index_bytes | toast_bytes | total_bytes | node_name  
-------------+-------------+-------------+-------------+-----------
 2822823936 |  1946337280 |      155648 |  4769316864 |

The CAGG is 4.7 GB. Let's enable automatic refresh and compression.

SELECT add_continuous_aggregate_policy('conditions_summary_15mins',
  start_offset => INTERVAL '6 hours',
  end_offset => INTERVAL '1 hour',
  schedule_interval => INTERVAL '2 hours');

ALTER MATERIALIZED VIEW conditions_summary_15mins SET (timescaledb.compress = true);

SELECT add_compression_policy('conditions_summary_15mins', compress_after=>'7 day'::interval);

Two new jobs will be created, one for refreshing and one for compression. Compression kicks in automatically whenever there are chunks to compress. After a few minutes, you'll have a compressed CAGG. Let's check its size.

SELECT * FROM hypertable_detailed_size('_timescaledb_internal._materialized_hypertable_14');
table_bytes | index_bytes | toast_bytes | total_bytes | node_name  
-------------+-------------+-------------+-------------+----------
   61325312 |    13205504 |   850911232 |   925442048 |

Wow, 900 MB: an 80 percent compression ratio, which looks amazing!
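
If you want per-chunk details, chunk_compression_stats shows the before and after sizes (a sketch using the same internal hypertable name as above):

SELECT chunk_name,
       before_compression_total_bytes,
       after_compression_total_bytes
FROM chunk_compression_stats('_timescaledb_internal._materialized_hypertable_14');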

Learn more

Continuous aggregates are Timescale's version of automatically updated, incremental materialized views, allowing you to create a table that reflects the output of a time-based aggregate query.

That table is automatically updated at regular intervals as the source data changes. They are called real time because when you read from a continuous aggregate, the most recent data comes from the original dataset, while older data comes from the continuous aggregate itself.

CAGGs help by running time-based queries ahead of time, which speeds up all future runs. This feature relies on users knowing in advance the common queries or aggregations they want to build.

Learn how continuous aggregates can dramatically simplify real-time analytics with PostgreSQL.