Monday, March 18, 2024

Comparing Postgres and MySQL on the insert benchmark with a small server

My primary goal with the benchmarks I run has been to identify performance regressions, especially ones that can be fixed to make open source databases better. And so I focus on comparing old and new versions of one DBMS at a time to identify where things get better or worse. But here I compare Postgres with MySQL (InnoDB & MyRocks) to show that neither is the best for the Insert Benchmark -- all are good, but none are perfect.

The per-DBMS results are here for Postgres, InnoDB and MyRocks. Those posts also have links to the configurations and builds that I used. This post shares the same results but makes it easier to compare across DBMS. 

Results here are from a small server (8 cores) with a low concurrency workload (1 client, <= 3 concurrent connections). Results from a larger server are pending and might not be the same as what I share here.

Summary of throughput for the IO-bound workload

  • Initial load in key order (l.i0)
    • Postgres is fastest
  • Write-only with secondary index maintenance (l.i1, l.i2)
    • MyRocks is fastest
  • Range queries (qr100, qr500, qr1000)
    • Postgres is fastest
  • Point queries (qp100, qp500, qp1000)
    • MyRocks is fastest, Postgres failed to sustain the target write rate for qp1000
Summary of efficiency for the IO-bound workload
  • Space efficiency
    • MyRocks is best, Postgres/InnoDB used ~3.5X/~3X more space
  • Write efficiency
    • MyRocks is best: on the l.i1 benchmark step Postgres and InnoDB write ~9X and ~80X more KB to storage per insert than MyRocks does.
  • Read efficiency
    • MyRocks is the best and that might surprise people. Both InnoDB and Postgres do more read IO per query for both point and range queries. Bloom filters and less space amplification might explain this.
Summary of throughput over time
  • All DBMS have noise (variance) in some cases. Results for MyRocks aren't any worse than for Postgres or InnoDB.

Build + Configuration

Versions tested
  • pg162_def.cx9a2a_bee
    • Postgres 16.2 and the cx9a2a_bee config
  • my8036_rel.cz10a_bee
    • Upstream MySQL 8.0.36 with InnoDB and the cz10a_bee config
  • fbmy8028_rel_221222.cza1_bee
    • MyRocks 8.0.28 from code as of 2023-12-22 at git hash 2ad105fc, RocksDB 8.7.0 at git hash 29005f0b, cza1_bee config
    • Compression is enabled, which saves space at the cost of more CPU
The config files are here.

The Benchmark

The benchmark is run with 1 client. It is explained here and was run in two setups
  • cached - database has 30M rows and fits in memory
  • IO-bound - database has 800M rows and is larger than memory
The test server was named SER4 in the previous report. It has 8 cores, 16G RAM, Ubuntu 22.04 and XFS using 1 m.2 device.

The benchmark steps are:

  • l.i0
    • insert X rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. X is 30M for cached and 800M for IO-bound.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts Y rows and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. Y is 80M for cached and 4M for IO-bound.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and Y is 20M for cached and 1M for IO-bound.
    • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow.
  • qr100
    • use 3 connections/client. One does range queries for Z seconds and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for a fixed amount of time. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. Z is 3600 for cached and 1800 for IO-bound.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
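
To make the steps concrete, below is a minimal sketch of the statement shapes each step uses. It is an illustration rather than the actual Insert Benchmark client; the table name (t1), column names and constants are assumptions.

  # Minimal sketch of the statement shapes per benchmark step. This is NOT the
  # real Insert Benchmark client; the table name (t1), column names and
  # constants are hypothetical.

  # l.i0 - inserts in PK order into a table with a PK index and no secondary indexes
  insert_sql = "INSERT INTO t1 (pk_col, c1, c2, c3) VALUES (%s, %s, %s, %s)"

  # l.x - create 3 secondary indexes after the initial load
  create_index_sql = [
      "CREATE INDEX i1 ON t1 (c1)",
      "CREATE INDEX i2 ON t1 (c2)",
      "CREATE INDEX i3 ON t1 (c3)",
  ]

  # l.i1 / l.i2 - one connection inserts at the high end of the PK range while
  # another deletes from the low end at the same rate (50 rows/transaction for
  # l.i1, 5 for l.i2), so the table behaves like a queue
  delete_sql = "DELETE FROM t1 WHERE pk_col > %s AND pk_col < %s"

  # qr100/qr500/qr1000 - short range queries covered by a secondary index,
  # with background inserts and deletes at 100/500/1000 per second
  range_query_sql = "SELECT c1 FROM t1 WHERE c1 BETWEEN %s AND %s"

  # qp100/qp500/qp1000 - point queries on the PK index
  point_query_sql = "SELECT * FROM t1 WHERE pk_col = %s"
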
Results: throughput

The performance reports are here for cached and for IO-bound.

The summary has 3 tables. The first shows absolute throughput for each DBMS and benchmark step. The second has throughput relative to the DBMS on the first row of the table (Postgres here), which makes it easy to compare across DBMS. The third shows the background insert rate for benchmark steps with background inserts and whether the target rates were sustained.

Below I use relative QPS to explain how performance differs across DBMS. It is: (QPS for $me / QPS for $base) where $me is the DBMS being compared and $base is the base case. When relative QPS is > 1.0 then $me is faster than the base case. When it is < 1.0 then it is slower. The Q in relative QPS measures:
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.
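
As a worked example of the relative QPS arithmetic, the sketch below uses made-up numbers that are not from this report:

  # Worked example of relative QPS with made-up numbers (not from this report)
  qps_base = 4000.0   # QPS for $base, e.g. Postgres 16.2
  qps_me = 3000.0     # QPS for $me, the DBMS being compared
  relative_qps = qps_me / qps_base
  print(round(relative_qps, 2))   # 0.75 -> shown in red (<= 0.95), a regression vs the base case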

From the summary for cached:
  • the base case is Postgres 16.2, numbers in red mean Postgres is faster
  • comparing InnoDB and MyRocks with the base case
    • l.i0
      • InnoDB - relative QPS is 0.75
      • MyRocks - relative QPS is 0.77
    • l.x - I ignore this for now
    • l.i1, l.i2
      • InnoDB - relative QPS is 0.86, 1.63
      • MyRocks - relative QPS is 1.12, 1.47
    • qr100, qr500, qr1000
      • InnoDB - relative QPS is 0.40, 0.42, 0.41
      • MyRocks - relative QPS is 0.19, 0.16, 0.16
    • qp100, qp500, qp1000
      • InnoDB - relative QPS is 0.82, 0.81, 0.82
      • MyRocks - relative QPS is 0.71, 0.70, 0.69
From the summary for IO-bound:
  • the base case is Postgres 16.2, numbers in red mean Postgres is faster
  • comparing InnoDB and MyRocks with the base case
    • l.i0
      • InnoDB - relative QPS is 0.74
      • MyRocks - relative QPS is 0.77
    • l.x - I ignore this for now
    • l.i1, l.i2
      • InnoDB - relative QPS is 0.83, 18.35
      • MyRocks - relative QPS is 11.45, 73.55
    • qr100, qr500, qr1000
      • InnoDB - relative QPS is 0.42, 0.47, 0.55
      • MyRocks - relative QPS is 0.07, 0.06, 0.06
    • qp100, qp500, qp1000
      • InnoDB - relative QPS is 1.56, 1.46, 1.44
      • MyRocks - relative QPS is 2.15, 2.13, 2.21
      • Postgres failed to sustain the target write rate during qp1000. The target was ~1000/s and it sustained 927/s.
Results: efficiency

Here I focus on the results from the IO-bound workload. The efficiency section of the IO-bound perf report has a lot of information.

At test end (after qp1000.L6) the database size in GB is 192.6 for Postgres, 166.4 for InnoDB and 54.8 for MyRocks. Compared to MyRocks, Postgres uses ~3.5X more space and InnoDB uses ~3X more space. Compression is enabled for MyRocks which saves on space at the cost of more CPU.
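
The sizes above come from the benchmark scripts. As a hedged aside, database size can also be checked with queries like the ones below; the database name (test) is an assumption and these are not the exact queries the scripts use.

  # Hedged illustration of checking database size; the database name "test" is
  # an assumption and these are not the exact queries used by the benchmark.

  # Postgres
  pg_size_sql = "SELECT pg_size_pretty(pg_database_size('test'))"

  # MySQL (InnoDB or MyRocks): sum of data + index bytes for the schema
  mysql_size_sql = """
      SELECT ROUND(SUM(data_length + index_length) / POW(1024, 3), 1) AS size_gb
      FROM information_schema.tables
      WHERE table_schema = 'test'
  """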

Explaining l.i0 - load in key order
  • Data is here
  • Postgres uses the least CPU per statement (see cpupq, CPU per query). It is ~1.2X larger with InnoDB and MyRocks. CPU probably explains the perf difference.
  • MyRocks writes the least to storage per statement (see wkbpi, KB written per insert)
Explaining l.i1 - write-only, 50 rows/commit
  • Data is here
  • MyRocks does the fewest reads from storage per statement (see rpq, reads per query). The rate is ~278X larger for Postgres and ~859X larger for InnoDB. The ratio is so large because non-unique secondary index maintenance is read free for MyRocks (see the sketch after this list). The InnoDB change buffer provides a similar but less significant benefit (I enabled the change buffer for these tests). Alas, with Postgres the leaf pages for secondary indexes must undergo read-modify-write as the heap-only tuple optimization can't be used for this schema.
  • MyRocks uses the least CPU per statement (see cpupq, CPU per query). It is ~3X larger with Postgres and ~5X larger with InnoDB.
  • MyRocks has the best write efficiency (see wkbpi, KB written to storage per insert). It is ~9X larger for Postgres and ~80X larger for InnoDB.
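
To show why the read IO gap is so large, the sketch below contrasts how an LSM and a B-tree handle a secondary index insert. It is a conceptual toy rather than engine code, and it ignores the InnoDB change buffer mentioned above.

  # Conceptual toy (not engine code): why non-unique secondary index maintenance
  # is read free for an LSM but needs a leaf page read for a B-tree or for a
  # Postgres secondary index.

  def lsm_index_insert(memtable, index_key, row_id):
      # LSM: blindly append the new index entry; nothing is read because
      # compaction reconciles entries for the same key later
      memtable.append((index_key, row_id))

  def btree_index_insert(buffer_pool, disk_pages, leaf_of, index_key, row_id):
      # B-tree: the leaf page that owns index_key must be read (from cache if
      # lucky, from storage if not), modified, and written back later
      page_no = leaf_of(index_key)
      page = buffer_pool.get(page_no)
      if page is None:
          page = disk_pages[page_no]      # the extra read IO per index entry
          buffer_pool[page_no] = page
      page.append((index_key, row_id))

  # Toy usage: one leaf page covering all keys, initially not cached
  memtable, buffer_pool, disk_pages = [], {}, {0: []}
  lsm_index_insert(memtable, 42, 7)                                # no read
  btree_index_insert(buffer_pool, disk_pages, lambda k: 0, 42, 7)  # reads page 0
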
Explaining l.i2 - write-only, 5 rows/commit
  • Data is here
  • Results are similar to l.i1 above with one exception. The CPU overhead for Postgres was ~3X larger than MyRocks for l.i1 but here it is more than 20X larger because of the problem with the optimizer spending too much time in get_actual_variable_range.
Explaining range queries - qr100, qr500, qr1000
  • Data is here
  • The read IO overhead is similar for Postgres and MyRocks (see rpq, reads per query) while it is ~8X larger for InnoDB. A standard hand-waving analysis would predict that MyRocks wasn't going to be as read IO efficient as Postgres, but prefix bloom filters and less space amplification help it here.
  • Postgres has the smallest CPU overhead (see cpupq, CPU per query). It is ~2.6X larger for InnoDB and ~15X larger for MyRocks. I hope to explain why MyRocks uses so much more CPU.
Explaining point queries - qp100, qp500, qp1000
  • Data is here
  • MyRocks has the best read IO efficiency (see rpq, reads per query). It is ~2.2X and ~1.3X larger for Postgres and InnoDB. Bloom filters and less space amplification might explain this.
  • All DBMS have a similar CPU overhead (see cpupq, CPU per query).
Results: throughput over time

Explaining l.i0 - load in key order
  • Data is here for Postgres, InnoDB and MyRocks
  • Results are stable for all DBMS but MyRocks has the most noise
Explaining l.i1 - write-only, 50 rows/commit
  • Data is here for Postgres, InnoDB and MyRocks
  • The insert/s rate declines over time for Postgres which is expected but it grows over time for InnoDB from ~1000/s to ~3000/s. I assume that InnoDB initially suffers more from page splits and perf increases as that is less likely over time.
Explaining l.i2 - write-only, 5 rows/commit
  • Data is here for Postgres, InnoDB and MyRocks
  • The insert/s and delete/s rate for Postgres decreases slightly over time. I assume the issue is that the optimizer CPU overhead for delete statements grows over time which is apparent on the chart for max delete response time (start here and scroll down).
Explaining range queries - qr100, qr500, qr1000
Explaining point queries - qp100, qp500, qp1000

Sunday, March 17, 2024

Yet another Insert Benchmark result: MyRocks, MySQL and a small server

While trying to explain a Postgres performance problem I repeated the Insert Benchmark on a small server for MyRocks from MySQL 5.6 and 8.0. This post explains those results. The previous report for a cached workload is here.

tl;dr

  • Disclaimers
    • The low-concurrency results here are worse than the results from a bigger server with more concurrency because the results here depend more on CPU overheads, and MySQL keeps growing its code paths; on the bigger server the cost from new CPU overheads is offset by other improvements.
    • Some of the regressions here are similar to what I measure for InnoDB and the problem is likely code above the storage engine layer.
    • For MyRocks 8.0.28 compared to 5.6.35
      • Results for most benchmark steps aren't surprising and MyRocks 8.0.28 gets between 80% and 95% of the throughput of MyRocks 5.6.35
      • Results for the qr1000.L6 benchmark step with the IO-bound workload are odd. MyRocks 8.0.28 gets only 39% of the throughput compared to MyRocks 5.6.35. From the metrics I see that MyRocks 8.0.28 does ~2X more read IO/query (see rpq) and uses ~2X more CPU/query (see cpupq). I have yet to explain this.

Build + Configuration

This report has results for MyRocks 5.6.35 and 8.0.28. The cza1_bee config was used and the config files are here.

The builds tested are:
  • fbmy5635_rel_202203072101.cza1_bee
    • MyRocks 5.6.35 from code as of 2022-03-07 at git hash e7d976ee with RocksDB 6.28.2, cza1_bee config
  • fbmy5635_rel_20230529_850.cza1_bee
    • MyRocks 5.6.35 from code as of 2023-05-29 at git hash b739eac1 with RocksDB 8.5.0, cza1_bee config
  • fbmy8028_rel_20220829_752.cza1_bee
    • MyRocks 8.0.28 from code as of 2022-08-29 at git hash a35c8dfeab, RocksDB 7.5.2, cza1_bee config
  • fbmy8028_rel_20230619_831.cza1_bee
    • MyRocks 8.0.28 from code as of 2023-06-19 at git hash 6164cf0274, RocksDB 8.3.1, cza1_bee config
  • fbmy8028_rel_221222.cza1_bee
    • MyRocks 8.0.28 from code as of 2023-12-22 at git hash 2ad105fc, RocksDB 8.7.0 at git hash 29005f0b, cza1_bee config
  • fbmy8028_rel_231222_870.cza1_bee_cfx
    • MyRocks 8.0.28 from code as of 2023-12-22 at git hash 2ad105fc, RocksDB 8.7.0 at git hash 29005f0b, cza1_bee config, indexes use a separate column family
The Benchmark

The benchmark is run with 1 client. It is explained here and was run in two setups
  • cached - database has 30M rows and fits in memory
  • IO-bound - database has 800M rows and is larger than memory
The test server was named SER4 in the previous report. It has 8 cores, 16G RAM, Ubuntu 22.04 and XFS using 1 m.2 device.

The benchmark steps are:

  • l.i0
    • insert X rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. X is 30M for cached and 800M for IO-bound.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts Y rows and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. Y is 80M for cached and 4M for IO-bound.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and Y is 20M for cached and 1M for IO-bound.
    • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow.
  • qr100
    • use 3 connections/client. One does range queries for Z seconds and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for a fixed amount of time. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. Z is 3600 for cached and 1800 for IO-bound.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results

The performance reports are here for cached and for IO-bound.

The summary has 3 tables. The first shows absolute throughput for each DBMS tested and benchmark step. The second has throughput relative to the version on the first row of the table. The third shows the background insert rate for benchmark steps with background inserts and all systems sustained the target rates. The second table makes it easy to see how performance changes over time.

Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.

From the summary for cached:
  • the base case is fbmy5635_rel_202203072101 (MyRocks 5.6.35 from 2022)
  • comparing fbmy8028_rel_231222_870 (latest MyRocks 8.0.28) with the base case
    • l.i0
      • relative QPS is 0.72 in fbmy8028_rel_231222_870
    • l.x - I ignore this for now
    • l.i1, l.i2
      • relative QPS is 0.85, 0.82 in fbmy8028_rel_231222_870
    • qr100, qr500, qr1000
      • relative QPS is 1.06, 1.24, 1.06 in fbmy8028_rel_231222_870
    • qp100, qp500, qp1000
      • relative QPS is 0.96, 0.95, 0.96 in fbmy8028_rel_231222_870
From the summary for IO-bound:
  • the base case is fbmy5635_rel_202203072101 (MyRocks 5.6.35 from 2022)
  • comparing fbmy8028_rel_231222_870 (latest MyRocks 8.0.28) with the base case
    • l.i0
      • relative QPS is 0.72 in fbmy8028_rel_231222_870
    • l.x - I ignore this for now
    • l.i1, l.i2
      • relative QPS is 0.86, 0.80 in fbmy8028_rel_231222_870
    • qr100, qr500, qr1000
      • relative QPS is 0.80, 0.80, 0.39 in fbmy8028_rel_231222_870
      • the 0.39 value is an outlier. From the metrics I see that MyRocks 8.0.28 does ~2X more read IO/query (see rpq) and uses ~2X more CPU/query (see cpupq). I have yet to explain this.
    • qp100, qp500, qp1000
      • relative QPS is 0.94, 0.94, 0.94 in fbmy8028_rel_231222_870

Yet another Insert Benchmark result: MySQL, InnoDB and a small server

While trying to explain a Postgres performance problem I repeated the Insert Benchmark on a small server for InnoDB from MySQL 5.6, 5.7 and 8.0. This post explains those results. Previous reports are here for cached and IO-bound workloads and the results here are similar.

tl;dr

  • Disclaimer - the low-concurrency results here are worse than the results from a bigger server with more concurrency because the results here depend more on CPU overheads, and MySQL keeps growing its code paths; on the bigger server the cost from new CPU overheads is offset by other improvements.
  • There are significant regressions from 5.6 to 5.7 and again from 5.7 to 8.0
Build + Configuration

This report has results for InnoDB with MySQL 5.6.51, 5.7.44 and 8.0.36. The cz10a_bee config was used and the config files are here.

The Benchmark

The benchmark is run with 1 client. It is explained here and was run in two setups
  • cached - database has 30M rows and fits in memory
  • IO-bound - database has 800M rows and is larger than memory
The test server was named SER4 in the previous report. It has 8 cores, 16G RAM, Ubuntu 22.04 and XFS using 1 m.2 device.

The benchmark steps are:

  • l.i0
    • insert X rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. X is 30M for cached and 800M for IO-bound.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts Y rows and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. Y is 80M for cached and 4M for IO-bound.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and Y is 20M for cached and 1M for IO-bound.
    • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow.
  • qr100
    • use 3 connections/client. One does range queries for Z seconds and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for a fixed amount of time. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. Z is 3600 for cached and 1800 for IO-bound.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results

The performance reports are here for cached and for IO-bound.

The summary has 3 tables. The first shows absolute throughput for each DBMS tested and benchmark step. The second has throughput relative to the version on the first row of the table. The third shows the background insert rate for benchmark steps with background inserts and all systems sustained the target rates. The second table makes it easy to see how performance changes over time.

Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.

From the summary for cached:
  • the base case is MySQL 5.6.51
  • comparing 5.7.44 and 8.0.36 with 5.6.51 shows large regressions
    • l.i0
      • relative QPS is 0.84 in 5.7.44
      • relative QPS is 0.57 in 8.0.36
    • l.x - I ignore this for now
    • l.i1, l.i2
      • relative QPS is 1.11, 0.88 in 5.7.44
      • relative QPS is 0.91, 0.73 in 8.0.36
    • qr100, qr500, qr1000
      • relative QPS is 0.73, 0.72, 0.74 in 5.7.44
      • relative QPS is 0.63, 0.63, 0.63 in 8.0.36
    • qp100, qp500, qp1000
      • relative QPS is 0.83, 0.83, 0.82 in 5.7.44
      • relative QPS is 0.63, 0.61, 0.62 in 8.0.36
From the summary for IO-bound:
  • the base case is MySQL 5.6.51
  • comparing 5.7.44 and 8.0.36 with 5.6.51 shows large regressions
    • l.i0
      • relative QPS is 0.86 in 5.7.44
      • relative QPS is 0.59 in 8.0.36
    • l.x - I ignore this for now
    • l.i1, l.i2
      • relative QPS is 1.30, 1.26 in 5.7.44
      • relative QPS is 1.30, 1.16 in 8.0.36
    • qr100, qr500, qr1000
      • relative QPS is 0.76, 0.86, 0.94 in 5.7.44
      • relative QPS is 0.70, 0.81, 0.89 in 8.0.36
    • qp100, qp500, qp1000
      • relative QPS is 0.98, 0.99, 1.02 in 5.7.44
      • relative QPS is 0.94, 0.96, 1.02 in 8.0.36

Trying to tune Postgres for the Insert Benchmark: small server

Last year I spent much time trying to tune the Postgres configs I use to improve results for the Insert Benchmark. While this was a good education for me I wasn't able to get significant improvements. After writing about another perf problem with Postgres (optimizer spends too much time on DELETE statements in a special circumstance) I revisited the tuning but didn't make things significantly better.

The results here are from Postgres 16.2 and a small server (8 CPU cores) with a low concurrency workload. Previous benchmark reports for Postgres on this setup are here for cached and IO-bound runs.

tl;dr

  • I have yet to fix this problem via tuning
The Problem

The performance problem is explained here and here. The issue is that the optimizer spends too much time on DELETE statements under special circumstances. In this case the optimizer reads from the index to determine the true min or max value of the column referenced in the WHERE clause, and when there are too many deleted index entries that have yet to be removed by vacuum, that probe makes the optimizer too slow.

The problem shows up on the l.i2 benchmark step. The benchmark client sustains the same rate for inserts/s and delete/s so if deletes are too slow then the insert rate will also be too slow. The ratio of delete/s (and insert/s) for l.i2 relative to l.i1 is ~0.2 for the cached workload and ~0.05 for the IO-bound workload. 

The l.i1 benchmark step deletes more rows/statement so the optimizer overhead is more significant on the l.i2 step. The ratios are much larger for InnoDB and MyRocks (they have perf problems, just not this perf problem).

The circumstances are:
  • the table has a queue pattern (insert to one end, delete from the other)
  • the DELETE statements have WHERE pk_col > $low-const and pk_col < $high-const where $low-const and $high-const are integer constants and there is a PK on pk_col
This workload creates much MVCC garbage that is co-located in the PK index and that is a much bigger problem for Postgres than for InnoDB or MyRocks. 
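
As a hedged sketch (the exact statements in the benchmark client may differ), the delete side of the queue pattern looks roughly like this, and it is the probe for the real min value of pk_col on behalf of such statements that sends the optimizer into get_actual_variable_range:

  # Hedged sketch of the queue pattern from l.i1/l.i2; the exact statements in
  # the benchmark client may differ. Inserts land at the high end of the PK
  # range while deletes trim an equal number of rows from the low end.
  insert_sql = "INSERT INTO t1 (pk_col, c1) VALUES (%s, %s)"
  delete_sql = "DELETE FROM t1 WHERE pk_col > %s AND pk_col < %s"

  # While planning the DELETE, Postgres probes the PK index for the actual min
  # value of pk_col (get_actual_variable_range). When the low end of the index
  # is full of deleted-but-not-yet-vacuumed entries, that probe gets expensive.
  delete_args = (1000, 1006)   # hypothetical constants: removes pk_col 1001..1005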

I hope for a Postgres storage engine that provides MVCC without vacuum. In theory, more frequent vacuum might help, and the perf overhead from frequent vacuum might be OK for the heap table given the use of visibility bits. But when vacuum also has to do a full index scan (there are no visibility bits for indexes), that is a huge cost which limits vacuum frequency.

Build + Configuration

See the previous report for more details. I used Postgres 16.2.

The configuration files for the SER4 server are in subdirectories from here. Using the suffixes that distinguish the config file names, they are:
  • cx9a2_bee - base config
  • cx9a2a_bee - adds autovacuum_vacuum_cost_delay= 1ms
  • cx9a2b_bee - adds autovacuum_vacuum_cost_delay= 0
  • cx9a2c_bee - adds autovacuum_naptime= 1s
  • cx9a2e_bee - adds autovacuum_vacuum_scale_factor= 0.01
  • cx9a2f_bee - adds autovacuum_vacuum_insert_scale_factor= 0.01
  • cx9a2g_bee - adds autovacuum_vacuum_cost_limit= 8000
  • cx9a2acef_bee - combines cx9a2a, cx9a2c, cx9a2e, cx9a2f configs
  • cx9a2bcef_bee - combines cx9a2b, cx9a2c, cx9a2e, cx9a2f configs
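
The configs above are applied as conf.diff files at server startup. As an illustration only (not how the benchmark applies them), the same settings can also be changed from a client session:

  # Illustration only: the benchmark applies these settings via conf.diff files
  # at server startup, not via ALTER SYSTEM. Connection parameters are assumptions.
  import psycopg2

  conn = psycopg2.connect("dbname=test user=postgres")
  conn.autocommit = True   # ALTER SYSTEM cannot run inside a transaction block
  cur = conn.cursor()
  cur.execute("ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '1ms'")    # as in cx9a2a_bee
  cur.execute("ALTER SYSTEM SET autovacuum_naptime = '1s'")               # as in cx9a2c_bee
  cur.execute("ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.01")   # as in cx9a2e_bee
  cur.execute("SELECT pg_reload_conf()")   # reload so the new values take effect
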
The Benchmark

The benchmark is run with 1 client. It is explained here and was run in two setups
  • cached - database has 30M rows and fits in memory
  • IO-bound - database has 800M rows and is larger than memory
The test was run on two small servers that I have at home:
  • SER4 - Beelink SER4 with 8 cores, 16G RAM, Ubuntu 22.04 and XFS using 1 m.2 device
  • SER7 - Beelink SER7 with 8 cores, 32G RAM, Ubuntu 22.04 and XFS using 1 m.2 device. The CPU on the SER7 is a lot faster than the SER4.
The benchmark steps are:

  • l.i0
    • insert X rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. For SER4, X is 30M for cached and 800M for IO-bound. For SER7, X is 60M for cached and 800M for IO-bound.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts Y rows and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. Y is 80M for cached and 4M for IO-bound.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and Y is 20M for cached and 1M for IO-bound.
    • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow.
  • qr100
    • use 3 connections/client. One does range queries for Z seconds and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for a fixed amount of time. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. Z is 3600 for cached and 1800 for IO-bound.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results: SER4 server

The performance reports are here for cached and for IO-bound.

The summary has 3 tables. The first shows absolute throughput for each config tested and benchmark step. The second has throughput relative to the config on the first row of the table, which makes it easy to compare configs. The third shows the background insert rate for benchmark steps with background inserts and all configs sustained the target rates.

Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the config being tested and $base is the base config. When relative QPS is > 1.0 then the config improves performance relative to the base. When it is < 1.0 then it is a regression. The Q in relative QPS measures:
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.

From the summaries for cached and for IO-bound:
  • The base case uses the cx9a2_bee config
  • The different config files have no impact on performance for the l.i0 and l.x benchmark steps. They have a small impact for the qr* and qp* (read+write) benchmark steps. Because the impact is non-existent to small I ignore those to focus on l.i1 and l.i2.
For l.i1 and l.i2 with a cached workload the different config files have some impact
  • The relative QPS, where Q means delete (and insert), ranges from 0.76 to 1.34, meaning a few configs made things slower and the best improved the delete/s rate by ~1.34X
  • The delete/s ratio for l.i2 vs l.i1 is 0.221 for the base case and the best improvement might be from the cx9a2f_bee config where the ratio increases to 0.265. But I was hoping to improve the ratio to 0.5 or larger so I was disappointed.
For l.i1 and l.i2 with an IO-bound workload the different config files have no benefit
  • Postgres 16.2 does ~2000 delete/s for the l.i1 step vs ~100/s for the l.i2 step
Results: SER7 server

The performance reports are here for cached and for IO-bound. Results from the SER7 match results from the SER4 described above so I won't explain them.

Monday, February 19, 2024

Perf regressions in Postgres from 9.0 to 16 with sysbench and a small server

This post has results for sysbench vs Postgres on a small server. I have results for versions from 9.0 through 16. My last report only went back to Postgres 11. The goal is to document where things get faster or slower over time for a low-concurrency and CPU-bound workload. The focus is on CPU regressions.

My results here aren't universal, but you have to start somewhere:

  • The microbenchmarks here mostly measure CPU overheads
  • Things won't look the same with an IO-bound workload
  • Things won't look the same with a workload that has more concurrency 
  • Things won't look the same with a workload that has complex queries
Summaries

Sections after this explain how the microbenchmark results are grouped.

Comparing Postgres 16.2 with 9.0.23:
  • point query, part 1
    • Postgres 16.2 is faster than 9.0.23 for all but one microbenchmark
  • point query, part 2
    • Postgres 16.2 is faster than 9.0.23 for all microbenchmarks
  • range query, part 1 & part 2
    • About half of the microbenchmarks are ~20% slower in 16.2 vs 9.0.23
    • The big regression occurs between 9.0 and 9.1
    • For part 2 where aggregation is done the problem is worse for shorter range scans
  • writes
    • Postgres 16.2 is faster than 9.0.23 for all microbenchmarks

Comparing Postgres 16.2 with 10.23
  • Postgres 16.2 is faster than 10.23 for all microbenchmarks

Comparing Postgres 16.2 with 14.10
  • point query, part 1
    • Postgres 16.2 is at most 4% slower than 14.10
  • point query, part 2
    • Postgres 16.2 is at most 1% slower than 14.10
  • range query, part 1
    • Postgres 16.2 is at most 5% slower than 14.10
  • range query, part 2
    • Postgres 16.2 is as fast or faster than 14.10
  • writes
    • Postgres 16.2 is at most 1% slower than 14.10
Build + Configuration

I used these versions: 9.0.23, 9.1.24, 9.2.24, 9.3.25, 9.4.26, 9.5.25, 9.6.24, 10.23, 11.22, 12.17, 13.13, 14.10, 14.11, 15.5, 15.6, 16.1 and 16.2.

The configuration files are in the subdirectories named pg9, pg10, pg11, pg12, pg13, pg14, pg15 and pg16 from here. They are named conf.diff.cx9a2_bee.

Benchmarks

I used sysbench and my usage is explained here. There are 42 microbenchmarks and each tests ~1 type of SQL statement and is run for 1200 seconds.

Tests were run on a small server I have at home (see here). The server is an SER4 from Beelink with 8 cores, 16G of RAM and 1 m.2 storage device with XFS and Ubuntu 22.04. The test tables are cached by Postgres.

The benchmark is run with:
  • one connection
  • 30M rows and a database cached by Postgres
  • each microbenchmark runs for 1200 seconds
  • prepared statements were enabled
The command line was: bash r.sh 1 30000000 1200 1200 nvme0n1 1 1 1

Results

For the results below I split the microbenchmarks into 5 groups -- 2 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. Unfortunately, I included the full scan microbenchmark (scan_range=100) in part 2 but it doesn't do aggregation. The spreadsheet with all data and charts is here and is easier to read.

All of the charts have relative throughput on the y-axis where that is (QPS for $me) / (QPS for $base), $me is a version (for example 16.2) and $base is the base version. The base version is specified below and is one of 9.0.23, 10.23 and 14.10 depending on what I am comparing. The y-axis doesn't start at 0 to improve readability.

The legend under the x-axis truncates the names I use for the microbenchmarks and I don't know how to fix that other than sharing the link to the Google Sheet I used. Files I used to create the spreadsheets are here.

Results: from 9.0 through 16.2

Summary:
  • point query, part 1
    • Postgres 16.2 is faster than 9.0.23 for all but one microbenchmark
  • point query, part 2
    • Postgres 16.2 is faster than 9.0.23 for all microbenchmarks
  • range query, part 1 & part 2
    • About half of the microbenchmarks are ~20% slower in 16.2 vs 9.0.23
    • The big regression occurs between 9.0 and 9.1
    • For part 2 where aggregation is done the problem is worse for shorter range scans
  • writes
    • Postgres 16.2 is faster than 9.0.23 for all microbenchmarks
This table has summary statistics from Postgres 16.2 for each microbenchmark group. The numbers represent the relative QPS (relative to 9.0.23) and a value > 1 means that 16.2 is faster than 9.0.23.

          min   max   avg   median  stdev
point-1   0.92  1.32  1.18  1.20    0.12
point-2   1.07  1.18  1.12  1.13    0.04
range-1   0.77  1.68  1.09  1.00    0.37
range-2   0.78  1.34  1.01  0.85    0.25
writes    1.11  4.64  2.21  1.97    1.10
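
The summary statistics can be recomputed from the per-microbenchmark relative QPS values with something like the sketch below. The values are made up rather than the report's data, and the report may use a different stdev convention.

  # Sketch of computing the summary statistics for one microbenchmark group.
  # The relative QPS values below are made up; the real data is in the spreadsheet.
  from statistics import mean, median, stdev

  relative_qps = [0.95, 1.10, 1.20, 1.32, 1.05]   # one value per microbenchmark
  row = (min(relative_qps), max(relative_qps),
         round(mean(relative_qps), 2), median(relative_qps),
         round(stdev(relative_qps), 2))
  print(row)   # (0.95, 1.32, 1.12, 1.1, 0.14)
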
Results: from 10.23 through 16.2

Summary
  • Postgres 16.2 is faster than 10.23 for all microbenchmarks
This table has summary statistics from Postgres 16.2 for each microbenchmark group. The numbers represent the relative QPS (relative to 10.23) and a value > 1 means that 16.2 is faster than 10.23.

          min   max   avg   median  stdev
point-1   1.02  1.11  1.07  1.08    0.03
point-2   1.04  1.08  1.06  1.06    0.01
range-1   1.07  1.13  1.10  1.10    0.02
range-2   1.04  1.09  1.06  1.05    0.02
writes    1.02  1.15  1.08  1.06    0.04

Results: 14.10, 14.11, 15.5, 15.6, 16.1, 16.2

Summary
  • point query, part 1
    • Postgres 16.2 is at most 4% slower than 14.10
  • point query, part 2
    • Postgres 16.2 is at most 1% slower than 14.10
  • range query, part 1
    • Postgres 16.2 is at most 5% slower than 14.10
  • range query, part 2
    • Postgres 16.2 is as fast or faster than 14.10
  • writes
    • Postgres 16.2 is at most 1% slower than 14.10
This table has summary statistics from Postgres 16.2 for each microbenchmark group. The numbers represent the relative QPS (relative to 14.10) and a value > 1 means that 16.2 is faster than 14.10.

          min   max   avg   median  stdev
point-1   0.96  1.07  1.00  1.00    0.03
point-2   0.99  1.00  1.00  1.00    0.01
range-1   0.95  1.00  0.98  0.99    0.02
range-2   1.00  1.07  1.02  1.00    0.03
writes    0.99  1.04  1.01  1.02    0.01