Why does XFS use the lvm cache chunk size for sunit/swidth instead of the raid5 settings?

I have 4 disks available for testing on my VM: sdb, sdc, sdd and sde.

The first 3 disks are used for the RAID5 configuration and the last disk is used as the lvm cache drive.
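
The exact commands are not shown in the question; the following is only a minimal sketch of how such a stack might be built (VG name data as in the lvdisplay output below, /dev/sde as the cache disk per the description above, disk sizes assumed):

# assumed reconstruction of the setup, not the exact commands used
vgcreate data /dev/sdb /dev/sdc /dev/sdd
# raid5 LV across the three disks: 2 data stripes + 1 parity, 64KiB stripe size
lvcreate --type raid5 --stripes 2 --stripesize 64k -l 100%FREE -n data data
# add the cache disk, create a 50GB cache pool on it and attach it to the raid5 LV
vgextend data /dev/sde
lvcreate --type cache-pool -L 50G -n cache_data data /dev/sde
lvconvert --type cache --cachepool data/cache_data data/data
mkfs.xfs /dev/data/data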

What I don't understand is the following:

When I create a 50GB cache disk (chunk size 64KiB), xfs_info shows the following:

[vagrant@node-02 ~]$ xfs_info /data
meta-data=/dev/mapper/data-data isize=512    agcount=32, agsize=16777072 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=536866304, imaxpct=5
         =                       sunit=16     swidth=32 blks
naming   =version 2              bsize=8192   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=262144, version=2
         =                       sectsz=512   sunit=16 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

The sunit=16 and swidth=32 we see here appear to be correct and match the raid5 layout.
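
Converting these values (which are in 4KiB filesystem blocks, bsize=4096) back to bytes shows the match explicitly:

sunit  = 16 blks × 4 KiB = 64 KiB    → the RAID5 stripe size (chunk)
swidth = 32 blks × 4 KiB = 128 KiB   → 2 data disks × 64 KiB (the third disk holds parity)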

Output of lsblk -t:

[vagrant@node-02 ~]$ lsblk -t
NAME                         ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED    RQ-SIZE   RA WSAME
sda                                  0    512      0     512     512    1 deadline     128 4096    0B
├─sda1                               0    512      0     512     512    1 deadline     128 4096    0B
└─sda2                               0    512      0     512     512    1 deadline     128 4096    0B
  ├─centos-root                      0    512      0     512     512    1              128 4096    0B
  ├─centos-swap                      0    512      0     512     512    1              128 4096    0B
  └─centos-home                      0    512      0     512     512    1              128 4096    0B
sdb                                  0    512      0     512     512    1 deadline     128 4096   32M
├─data5-data5_corig_rmeta_0          0    512      0     512     512    1              128 4096   32M
│ └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
│   └─data5-data5                    0  65536 131072     512     512    1              128 4096    0B
└─data5-data5_corig_rimage_0         0    512      0     512     512    1              128 4096   32M
  └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
    └─data5-data5                    0  65536 131072     512     512    1              128 4096    0B
sdc                                  0    512      0     512     512    1 deadline     128 4096   32M
├─data5-data5_corig_rmeta_1          0    512      0     512     512    1              128 4096   32M
│ └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
│   └─data5-data5                    0  65536 131072     512     512    1              128 4096    0B
└─data5-data5_corig_rimage_1         0    512      0     512     512    1              128 4096   32M
  └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
    └─data5-data5                    0  65536 131072     512     512    1              128 4096    0B
sdd                                  0    512      0     512     512    1 deadline     128 4096   32M
├─data5-data5_corig_rmeta_2          0    512      0     512     512    1              128 4096   32M
│ └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
│   └─data5-data5                    0  65536 131072     512     512    1              128 4096    0B
└─data5-data5_corig_rimage_2         0    512      0     512     512    1              128 4096   32M
  └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
    └─data5-data5                    0  65536 131072     512     512    1              128 4096    0B
sde                                  0    512      0     512     512    1 deadline     128 4096   32M
sdf                                  0    512      0     512     512    1 deadline     128 4096   32M
├─data5-cache_data5_cdata            0    512      0     512     512    1              128 4096   32M
│ └─data5-data5                      0  65536 131072     512     512    1              128 4096    0B
└─data5-cache_data5_cmeta            0    512      0     512     512    1              128 4096   32M
  └─data5-data5                      0  65536 131072     512     512    1              128 4096    0B
sdg                                  0    512      0     512     512    1 deadline     128 4096   32M
sdh                                  0    512      0     512     512    1 deadline     128 4096   32M

lvdisplay -a -m data gives me the following:

[vagrant@node-02 ~]$ sudo lvdisplay -m -a data
  --- Logical volume ---
  LV Path                /dev/data/data
  LV Name                data
  VG Name                data
  LV UUID                MBG1p8-beQj-TNDd-Cyx4-QkyN-vdVk-dG6n6I
  LV Write Access        read/write
  LV Creation host, time node-02, 2019-09-03 13:22:08 +0000
  LV Cache pool name     cache_data
  LV Cache origin name   data_corig
  LV Status              available
  # open                 1
  LV Size                <2.00 TiB
  Cache used blocks      0.06%
  Cache metadata blocks  0.64%
  Cache dirty blocks     0.00%
  Cache read hits/misses 293 / 66
  Cache wrt hits/misses  59 / 41173
  Cache demotions        0
  Cache promotions       486
  Current LE             524284
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:9

  --- Segments ---
  Logical extents 0 to 524283:
    Type                cache
    Chunk size          64.00 KiB
    Metadata format     2
    Mode                writethrough
    Policy              smq


  --- Logical volume ---
  Internal LV Name       cache_data
  VG Name                data
  LV UUID                apACl6-DtfZ-TURM-vxjD-UhxF-tthY-uSYRGq
  LV Write Access        read/write
  LV Creation host, time node-02, 2019-09-03 13:22:16 +0000
  LV Pool metadata       cache_data_cmeta
  LV Pool data           cache_data_cdata
  LV Status              NOT available
  LV Size                50.00 GiB
  Current LE             12800
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto

  --- Segments ---
  Logical extents 0 to 12799:
    Type                cache-pool
    Chunk size          64.00 KiB
    Metadata format     2
    Mode                writethrough
    Policy              smq


  --- Logical volume ---
  Internal LV Name       cache_data_cmeta
  VG Name                data
  LV UUID                hmkW6M-CKGO-CTUP-rR4v-KnWn-DbBZ-pJeEA2
  LV Write Access        read/write
  LV Creation host, time node-02, 2019-09-03 13:22:15 +0000
  LV Status              available
  # open                 1
  LV Size                1.00 GiB
  Current LE             256
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:11

  --- Segments ---
  Logical extents 0 to 255:
    Type                linear
    Physical volume     /dev/sdf
    Physical extents    0 to 255


  --- Logical volume ---
  Internal LV Name       cache_data_cdata
  VG Name                data
  LV UUID                9mHe8J-SRiY-l1gl-TO1h-2uCC-Hi10-UpeEVP
  LV Write Access        read/write
  LV Creation host, time node-02, 2019-09-03 13:22:16 +0000
  LV Status              available
  # open                 1
  LV Size                50.00 GiB
  Current LE             12800
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:10

  --- Segments ---
  Logical extents 0 to 12799:
    Type                linear
    Physical volume     /dev/sdf
    Physical extents    256 to 13055

  --- Logical volume ---
  Internal LV Name       data_corig_rimage_2
  VG Name                data
  LV UUID                Df7SLj
  LV Write Access        read/write
  LV Creation host, time node-02, 2019-09-03 13:22:08 +0000
  LV Status              available
  # open                 1
  LV Size                1023.99 GiB
  Current LE             262142
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:8

  --- Segments ---
  Logical extents 0 to 262141:
    Type                linear
    Physical volume     /dev/sdd
    Physical extents    1 to 262142


  --- Logical volume ---
  Internal LV Name       data_corig_rmeta_2
  VG Name                data
  LV UUID                xi9Ot3-aTnp-bA3z-YL0x-eVaB-87EP-JSM3eN
  LV Write Access        read/write
  LV Creation host, time node-02, 2019-09-03 13:22:08 +0000
  LV Status              available
  # open                 1
  LV Size                4.00 MiB
  Current LE             1
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:7

  --- Segments ---
  Logical extents 0 to 0:
    Type                linear
    Physical volume     /dev/sdd
    Physical extents    0 to 0


  --- Logical volume ---
  Internal LV Name       data_corig
  VG Name                data
  LV UUID                QP8ppy-nv1v-0sii-tANA-6ZzK-EJkP-sLfrh4
  LV Write Access        read/write
  LV Creation host, time node-02, 2019-09-03 13:22:17 +0000
  LV origin of Cache LV  data
  LV Status              available
  # open                 1
  LV Size                <2.00 TiB
  Current LE             524284
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     768
  Block device           253:12

  --- Segments ---
  Logical extents 0 to 524283:
    Type                raid5
    Monitoring          monitored
    Raid Data LV 0
      Logical volume    data_corig_rimage_0
      Logical extents   0 to 262141
    Raid Data LV 1
      Logical volume    data_corig_rimage_1
      Logical extents   0 to 262141
    Raid Data LV 2
      Logical volume    data_corig_rimage_2
      Logical extents   0 to 262141
    Raid Metadata LV 0  data_corig_rmeta_0
    Raid Metadata LV 1  data_corig_rmeta_1
    Raid Metadata LV 2  data_corig_rmeta_2

We can clearly see the chunk size of 64KiB in the segments.

But when I create a cache disk of 250GB, lvm requires a chunk size of at least 288KiB to accommodate a cache pool of that size. And when I then run xfs_info, sunit/swidth suddenly match the values of the cache drive instead of the RAID5 layout.
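
Why 288KiB: lvm caps the number of chunks in a cache pool (by default roughly 1,000,000) and rounds the chunk size up to a multiple of 32KiB, so for a 250GiB pool the back-of-the-envelope calculation is presumably:

250 GiB / 1,000,000 chunks ≈ 262.1 KiB per chunk
rounded up to the next 32 KiB multiple → 288 KiB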

Output of xfs_info:

[vagrant@node-02 ~]$ xfs_info /data
meta-data=/dev/mapper/data-data isize=512    agcount=32, agsize=16777152 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=536866816, imaxpct=5
         =                       sunit=72     swidth=72 blks
naming   =version 2              bsize=8192   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=262144, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Suddenly we get a sunit and swidth of 72, which matches the 288KiB chunk size of the cache drive, as we can see with lvdisplay -m -a:
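
Again in 4KiB filesystem blocks:

sunit = swidth = 72 blks × 4 KiB = 288 KiB   → the cache chunk size, not the 64 KiB / 128 KiB RAID5 geometry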

[vagrant@node-02 ~]$ sudo lvdisplay -m -a data
  --- Logical volume ---
  LV Path                /dev/data/data
  LV Name                data
  VG Name                data
  LV UUID                XLHw3w-RkG9-UNh6-WZBM-HtjM-KcV6-6dOdnG
  LV Write Access        read/write
  LV Creation host, time node-2, 2019-09-03 13:36:32 +0000
  LV Cache pool name     cache_data
  LV Cache origin name   data_corig
  LV Status              available
  # open                 1
  LV Size                <2.00 TiB
  Cache used blocks      0.17%
  Cache metadata blocks  0.71%
  Cache dirty blocks     0.00%
  Cache read hits/misses 202 / 59
  Cache wrt hits/misses  8939 / 34110
  Cache demotions        0
  Cache promotions       1526
  Current LE             524284
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:9

  --- Segments ---
  Logical extents 0 to 524283:
    Type                cache
    Chunk size          288.00 KiB
    Metadata format     2
    Mode                writethrough
    Policy              smq


  --- Logical volume ---
  Internal LV Name       cache_data
  VG Name                data
  LV UUID                Ps7Z1P-y5Ae-ju80-SZjc-yB6S-YBtx-SWL9vO
  LV Write Access        read/write
  LV Creation host, time node-2, 2019-09-03 13:36:40 +0000
  LV Pool metadata       cache_data_cmeta
  LV Pool data           cache_data_cdata
  LV Status              NOT available
  LV Size                250.00 GiB
  Current LE             64000
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto

  --- Segments ---
  Logical extents 0 to 63999:
    Type                cache-pool
    Chunk size          288.00 KiB
    Metadata format     2
    Mode                writethrough
    Policy              smq


  --- Logical volume ---
  Internal LV Name       cache_data_cmeta
  VG Name                data
  LV UUID                k4rVn9-lPJm-2Vvt-77jw-NP1K-PTOs-zFy2ph
  LV Write Access        read/write
  LV Creation host, time node-2, 2019-09-03 13:36:39 +0000
  LV Status              available
  # open                 1
  LV Size                1.00 GiB
  Current LE             256
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:11

  --- Segments ---
  Logical extents 0 to 255:
    Type                linear
    Physical volume     /dev/sdf
    Physical extents    0 to 255


  --- Logical volume ---
  Internal LV Name       cache_data_cdata
  VG Name                data
  LV UUID                dm571W-f9eX-aFMA-SrPC-PYdd-zs45-ypLksd
  LV Write Access        read/write
  LV Creation host, time node-2, 2019-09-03 13:36:39 +0000
  LV Status              available
  # open                 1
  LV Size                250.00 GiB
  Current LE             64000
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:10

  --- Logical volume ---
  Internal LV Name       data_corig
  VG Name                data
  LV UUID                hbYiRO-YnV8-gd1B-shQD-N3SR-xpTl-rOjX8V
  LV Write Access        read/write
  LV Creation host, time node-2, 2019-09-03 13:36:41 +0000
  LV origin of Cache LV  data
  LV Status              available
  # open                 1
  LV Size                <2.00 TiB
  Current LE             524284
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     768
  Block device           253:12

  --- Segments ---
  Logical extents 0 to 524283:
    Type                raid5
    Monitoring          monitored
    Raid Data LV 0
      Logical volume    data_corig_rimage_0
      Logical extents   0 to 262141
    Raid Data LV 1
      Logical volume    data_corig_rimage_1
      Logical extents   0 to 262141
    Raid Data LV 2
      Logical volume    data_corig_rimage_2
      Logical extents   0 to 262141
    Raid Metadata LV 0  data_corig_rmeta_0
    Raid Metadata LV 1  data_corig_rmeta_1
    Raid Metadata LV 2  data_corig_rmeta_2

Output of lsblk -t:

[vagrant@node-02 ~]$ lsblk -t
NAME                         ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED    RQ-SIZE   RA WSAME
sda                                  0    512      0     512     512    1 deadline     128 4096    0B
├─sda1                               0    512      0     512     512    1 deadline     128 4096    0B
└─sda2                               0    512      0     512     512    1 deadline     128 4096    0B
  ├─centos-root                      0    512      0     512     512    1              128 4096    0B
  ├─centos-swap                      0    512      0     512     512    1              128 4096    0B
  └─centos-home                      0    512      0     512     512    1              128 4096    0B
sdb                                  0    512      0     512     512    1 deadline     128 4096   32M
├─data5-data5_corig_rmeta_0          0    512      0     512     512    1              128 4096   32M
│ └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
│   └─data5-data5                    0 294912 294912     512     512    1              128 4096    0B
└─data5-data5_corig_rimage_0         0    512      0     512     512    1              128 4096   32M
  └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
    └─data5-data5                    0 294912 294912     512     512    1              128 4096    0B
sdc                                  0    512      0     512     512    1 deadline     128 4096   32M
├─data5-data5_corig_rmeta_1          0    512      0     512     512    1              128 4096   32M
│ └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
│   └─data5-data5                    0 294912 294912     512     512    1              128 4096    0B
└─data5-data5_corig_rimage_1         0    512      0     512     512    1              128 4096   32M
  └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
    └─data5-data5                    0 294912 294912     512     512    1              128 4096    0B
sdd                                  0    512      0     512     512    1 deadline     128 4096   32M
├─data5-data5_corig_rmeta_2          0    512      0     512     512    1              128 4096   32M
│ └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
│   └─data5-data5                    0 294912 294912     512     512    1              128 4096    0B
└─data5-data5_corig_rimage_2         0    512      0     512     512    1              128 4096   32M
  └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
    └─data5-data5                    0 294912 294912     512     512    1              128 4096    0B
sde                                  0    512      0     512     512    1 deadline     128 4096   32M
sdf                                  0    512      0     512     512    1 deadline     128 4096   32M
├─data5-cache_data5_cdata            0    512      0     512     512    1              128 4096   32M
│ └─data5-data5                      0 294912 294912     512     512    1              128 4096    0B
└─data5-cache_data5_cmeta            0    512      0     512     512    1              128 4096   32M
  └─data5-data5                      0 294912 294912     512     512    1              128 4096    0B
sdg                                  0    512      0     512     512    1 deadline     128 4096   32M
sdh                                  0    512      0     512     512    1 deadline     128 4096   32M

This raises a few questions.

XFS can apparently auto-detect these settings, but why does it choose to use the chunk size of the cache drive? As we saw in the first example, it is able to auto-detect the RAID5 layout.

I know I can pass the su/sw options to mkfs.xfs to get the correct sunit/swidth values, but should I do that in this case?
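
For reference, overriding the detected geometry to match the RAID5 layout from the first example would look something like this (su = stripe unit, sw = number of data disks):

mkfs.xfs -d su=64k,sw=2 /dev/data/data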

http://xfs.org/index.php/XFS_FAQ#Q:_How_to_calculate_the_correct_sunit.2Cswidth_values_for_optimal_performance

I have googled for days and looked through the XFS source code, but I could not find any clue as to why XFS does this.

Hence the following questions:

  • Why does XFS behave this way?
  • Should I manually define su/sw when running mkfs.xfs?
  • Does the chunk size of the cache drive have an impact on the RAID5 setup, and should it be aligned in some way? (see the sketch after this list)
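
Regarding the last point, one hypothetical way to keep the layers consistent would be to choose the cache chunk size explicitly when creating the pool, so that it is a multiple of the full 128KiB RAID5 stripe instead of the 288KiB lvm picks automatically, e.g.:

# 512KiB = 4 full stripes, and keeps a 250GiB pool under lvm's default chunk-count limit
lvcreate --type cache-pool -L 250G --chunksize 512k -n cache_data data /dev/sde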

Answer 1

The optimal allocation policy is a complex matter, because it depends on how the various block layers interact with each other.

When determining the optimal allocation policy, mkfs.xfs makes use of libblkid. You can access the same information by issuing lsblk -t. It is very probable that mkfs.xfs uses the 288K allocation alignment simply because lvs (well, device-mapper really) passes that very value up the stack.

I have seen very similar behavior with thin provisioning, where mkfs.xfs aligns the filesystem to the thin chunk size.

EDIT: so, this is the output of lsblk -t ...

[vagrant@node-02 ~]$ lsblk -t
NAME                         ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED    RQ-SIZE   RA WSAME
sda                                  0    512      0     512     512    1 deadline     128 4096    0B
├─sda1                               0    512      0     512     512    1 deadline     128 4096    0B
└─sda2                               0    512      0     512     512    1 deadline     128 4096    0B
  ├─centos-root                      0    512      0     512     512    1              128 4096    0B
  ├─centos-swap                      0    512      0     512     512    1              128 4096    0B
  └─centos-home                      0    512      0     512     512    1              128 4096    0B
sdb                                  0    512      0     512     512    1 deadline     128 4096   32M
├─data5-data5_corig_rmeta_0          0    512      0     512     512    1              128 4096   32M
│ └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
│   └─data5-data5                    0 294912 294912     512     512    1              128 4096    0B
└─data5-data5_corig_rimage_0         0    512      0     512     512    1              128 4096   32M
  └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
    └─data5-data5                    0 294912 294912     512     512    1              128 4096    0B
sdc                                  0    512      0     512     512    1 deadline     128 4096   32M
├─data5-data5_corig_rmeta_1          0    512      0     512     512    1              128 4096   32M
│ └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
│   └─data5-data5                    0 294912 294912     512     512    1              128 4096    0B
└─data5-data5_corig_rimage_1         0    512      0     512     512    1              128 4096   32M
  └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
    └─data5-data5                    0 294912 294912     512     512    1              128 4096    0B
sdd                                  0    512      0     512     512    1 deadline     128 4096   32M
├─data5-data5_corig_rmeta_2          0    512      0     512     512    1              128 4096   32M
│ └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
│   └─data5-data5                    0 294912 294912     512     512    1              128 4096    0B
└─data5-data5_corig_rimage_2         0    512      0     512     512    1              128 4096   32M
  └─data5-data5_corig                0  65536 131072     512     512    1              128  384    0B
    └─data5-data5                    0 294912 294912     512     512    1              128 4096    0B
sde                                  0    512      0     512     512    1 deadline     128 4096   32M
sdf                                  0    512      0     512     512    1 deadline     128 4096   32M
├─data5-cache_data5_cdata            0    512      0     512     512    1              128 4096   32M
│ └─data5-data5                      0 294912 294912     512     512    1              128 4096    0B
└─data5-cache_data5_cmeta            0    512      0     512     512    1              128 4096   32M
  └─data5-data5                      0 294912 294912     512     512    1              128 4096    0B
sdg                                  0    512      0     512     512    1 deadline     128 4096   32M
sdh                                  0    512      0     512     512    1 deadline     128 4096   32M

As you can see, the data5-data5 device (the one on which you created the xfs filesystem) reports a MIN-IO and OPT-IO of 294912 bytes (288K, your cache chunk size), while the underlying devices report the RAID array chunk size (64K). This means device-mapper overrides the underlying I/O hints with the current cache chunk size.

mkfs.xfs simply uses what libblkid reports, which in turn depends on the specific cache device-mapper target in use.
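
A quick way to inspect the same I/O hints that libblkid (and therefore mkfs.xfs) consumes, besides lsblk -t, is for example:

# print the I/O limits as seen by libblkid
blkid -i /dev/mapper/data5-data5
# or read the kernel queue limits directly (dm-9 is just an example minor, check dmsetup ls)
cat /sys/block/dm-9/queue/minimum_io_size
cat /sys/block/dm-9/queue/optimal_io_size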
