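Requests issued by BPF do not pass through the file system or the block layer, so there is no easy place to enforce fairness or QoS guarantees among processes. However, the default Linux block-layer scheduler for NVMe devices is noop, and the NVMe specification supports command arbitration at the hardware queues when fairness is required [11]. Another challenge is that the NVMe layer could reissue I/O requests indefinitely. The eBPF verifier rejects unbounded loops [38], but we also need to prevent unbounded I/O loops at the NVMe hook.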
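To provide fairness and to prevent unbounded traversals, we plan to implement a per-process counter in the NVMe layer that tracks the number of submissions made by an I/O chain, and to set a bound on that counter. The counter's value is periodically passed to the BIO layer so that it can account for the number of requests.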
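The counter described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel design: the class `ChainAccounting`, its method names, the bound of 8, the flush interval of 4, and the choice to reset the counter whenever a new chain starts are all hypothetical, since the text specifies none of them.

```python
from collections import defaultdict

# Hypothetical values: the text gives neither a bound nor a reporting period.
MAX_CHAINED_SUBMISSIONS = 8
FLUSH_INTERVAL = 4


class ChainAccounting:
    """User-space model of a per-process counter for chained I/O submissions."""

    def __init__(self, bound=MAX_CHAINED_SUBMISSIONS, flush_interval=FLUSH_INTERVAL):
        self.bound = bound
        self.flush_interval = flush_interval
        self.chained = defaultdict(int)        # pid -> resubmissions in the current chain
        self.unreported = defaultdict(int)     # pid -> submissions not yet reported upward
        self.bio_accounted = defaultdict(int)  # pid -> submissions known to the BIO layer

    def start_chain(self, pid):
        """A fresh request from the application starts a new chain (assumption)."""
        self.chained[pid] = 0

    def try_resubmit(self, pid):
        """Model of the NVMe hook asking to reissue an I/O for this process.

        Returns True if the resubmission is allowed, False once the bound is reached.
        """
        if self.chained[pid] >= self.bound:
            return False  # bound reached: stop the traversal
        self.chained[pid] += 1
        self.unreported[pid] += 1
        if self.unreported[pid] >= self.flush_interval:
            self.flush(pid)
        return True

    def flush(self, pid):
        """Pass the pending count to the (modelled) BIO layer for accounting."""
        self.bio_accounted[pid] += self.unreported[pid]
        self.unreported[pid] = 0


if __name__ == "__main__":
    acct = ChainAccounting()
    acct.start_chain(pid=1234)
    allowed = sum(acct.try_resubmit(1234) for _ in range(20))
    print(allowed, acct.bio_accounted[1234])
```

Bounding the counter caps the work a single chain can trigger from the hook, and reporting the count upward in batches lets the higher layer charge those requests to the process without being called on every resubmission.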
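Conclusion
BPF has the potential to speed up dependent lookups on fast storage devices. It also raises corresponding challenges, chiefly because the loss of context makes it harder to operate deep in the kernel's storage stack. In this paper we focus mainly on index traversals. Even so, on today's fast NVMe devices, chaining even a small number of requests with BPF yields significant gains. We envision a BPF storage library that helps developers with other standard in-kernel storage operations, such as compaction, compression, deduplication, and scans. We also believe that the interaction of BPF with caching and scheduler policies will create exciting research opportunities.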
References
[1] 200Gb/s ConnectX-6 Ethernet Single/Dual-Port Adapter IC | NVIDIA. https://www.mellanox.com/products/ethernet-adapter-ic/connectx-6-en-ic.
[2] bcc. https://github.com/iovisor/bcc.
[3] bpftrace. https://github.com/iovisor/bpftrace.
[4] Cilium. https://github.com/cilium/cilium.
[5] Cloudflare architecture and how BPF eats the world. https://blog.cloudflare.com/cloudflare-architecture-and-how-bpf-eats-the-world/.
[6] eBPF. https://ebpf.io/.
[7] Efficient IO with io_uring. https://kernel.dk/io_uring.pdf.
[8] Intel® Optane™ SSD DC P5800X Series. https://ark.intel.com/content/www/us/en/ark/products/201859/intel-optane-ssd-dc-p5800x-series-1-6tb-2-5in-pcie-x4-3d-xpoint.html.
[9] MAC and Audit policy using eBPF. https://lkml.org/lkml/2020/3/28/479.
[10] NGD Systems Newport platform. https://www.ngdsystems.com/technology/computational-storage.
[11] NVMe base specification. https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4b-2020.09.21-Ratified.pdf.
[12] Open-sourcing katran, a scalable network load balancer. https://engineering.fb.com/2018/05/22/open-source/open-sourcing-katran-a-scalable-network-load-balancer/.
[13] Optimizing Software for the Next Gen Intel Optane SSD P5800X. https://www.intel.com/content/www/us/en/events/memory-and-storage.html?videoId=6215534787001.
[14] Percona tokudb. https://
[15] RocksDB iterator. https://github.com/facebook/rocksdb/wiki/Iterator.
[16] SmartSSD computational storage drive. https://www.xilinx.com/applications/data-center/computational-storage/smartssd.html.
[17] SQL server index architecture and design guide. https://docs.microsoft.com/en-us/sql/relational-databases/sql-server-index-design-guide.
[18] A thorough introduction to eBPF. https://lwn.net/Articles/740157/.
[19] Toshiba memory introduces XL-FLASH storage class memory solution. https://business.kioxia.com/en-us/news/2019/memory-20190805-1.html.
[20] Ultra-Low Latency with Samsung Z-NAND SSD. https:// global.semi.static/UltraLow_Latency_with_Samsung_ZNAND_SSD-0.pdf.
[21] XDP. https://www.iovisor.org/technology/xdp.
[22] Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, and Edouard Bugnion. IX: A protected dataplane operating system for high throughput and low latency. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 49–65, Broomfield, CO, October 2014. USENIX Association.
[23] Badrish Chandramouli, Guna Prasaad, Donald Kossmann, Justin Levandoski, James Hunter, and Mike Barnett. FASTER: A concurrent key-value store with in-place updates. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD '18, pages 275–290, New York, NY, USA, 2018. Association for Computing Machinery.
[24] J Chao. Graph databases for beginners: Graph search algorithm basics. https://neo4j.com/blog/graph-search-algorithm-basics/.
[25] Brian F Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM symposium on Cloud computing, pages 143–154, 2010.
[26] Siying Dong, Mark Callaghan, Leonidas Galanis, Dhruba Borthakur, Tony Savor, and Michael Strum. Optimizing space amplification in RocksDB. In CIDR, volume 3, page 3, 2017.
[27] Pekka Enberg, Ashwin Rao, and Sasu Tarkoma. Partition-aware packet steering using XDP and eBPF for improving application-level parallelism. In Proceedings of the 1st ACM CoNEXT Workshop on Emerging In-Network Computing Paradigms, ENCP '19, pages 27–33, New York, NY, USA, 2019. Association for Computing Machinery.
[28] Goetz Graefe. A Survey of B-Tree Locking Techniques. ACM Transactions on Database Systems, 35(3), July 2010.
[29] Brendan Gregg. BPF Performance Tools. Addison-Wesley Professional, 2019.