C++ uses twice the memory when moving elements from one deque to another

Problem description:

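In my project, I use pybind11 to bind C++ code to Python. Recently I have had to deal with very large data sets (70GB+) and needed to split the data in one std::deque between multiple std::deques. Since my dataset is so large, I expect the split not to have much memory overhead. I therefore went for a one-pop, one-push strategy, which in general should ensure that my requirements are met.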
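The one-pop, one-push strategy can be sketched in Python with collections.deque as a stand-in for std::deque. This is only an illustration of the idea: the function name and the round-robin distribution are assumptions (the question does not say how elements are assigned to the target deques), and Python's deque does not share std::deque's allocation behavior.

```python
from collections import deque


def split_round_robin(source, n_parts):
    """Drain `source` into `n_parts` new deques, one element at a time.

    Hypothetical helper: the round-robin assignment is an assumption made
    for illustration only.
    """
    parts = [deque() for _ in range(n_parts)]
    i = 0
    while source:
        # One pop from the source, one push into a target, so the total
        # number of live elements never exceeds the original count.
        parts[i % n_parts].append(source.pop())
        i += 1
    return parts


if __name__ == "__main__":
    numbers = deque(range(10))
    parts = split_round_robin(numbers, 3)
    print(len(numbers), [len(p) for p in parts])
```

Because every element leaves the source before the next one is moved, the element count stays constant throughout, which is why the split is expected to need no extra memory.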

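That is all in theory. In practice, my process got killed. I struggled with it for two days and eventually came up with the following minimal example that demonstrates the problem.

The minimal example creates a large amount of data in a deque (~11GB), returns it to Python, then calls into C++ again to move the elements. Simple as that. The moving part is done in an executor.

The interesting thing is that if I don't use an executor, memory usage is as expected. Also, when a virtual memory limit is imposed with ulimit, the program really respects the limit and doesn't crash.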

test.py

from test import _test 
import asyncio 
import concurrent.futures 

async def test_main(loop, executor): 
    numbers = _test.generate() 
    # moved_numbers = _test.move(numbers) # This works! 
    moved_numbers = await loop.run_in_executor(executor, _test.move, numbers) # This doesn't! 

if __name__ == '__main__': 
    loop = asyncio.get_event_loop() 
    executor = concurrent.futures.ThreadPoolExecutor(1) 

    task = loop.create_task(test_main(loop, executor)) 
    loop.run_until_complete(task) 

    executor.shutdown() 
    loop.close()

test.cpp

#include <cstdint> 
#include <deque> 
#include <iostream> 
#include <memory> 
#include <pybind11/pybind11.h> 
#include <pybind11/stl.h> 

namespace py = pybind11; 

PYBIND11_MAKE_OPAQUE(std::deque<uint64_t>); 
PYBIND11_DECLARE_HOLDER_TYPE(T, std::shared_ptr<T>); 

template<class T> 
void py_bind_opaque_deque(py::module& m, const char* type_name) { 
    py::class_<std::deque<T>, std::shared_ptr<std::deque<T>>>(m, type_name) 
    .def(py::init<>()) 
    .def(py::init<size_t, T>()); 
} 

PYBIND11_PLUGIN(_test) { 
    namespace py = pybind11; 
    pybind11::module m("_test"); 
    py_bind_opaque_deque<uint64_t>(m, "NumbersDequeue"); 

    // Generate ~11Gb of data. 
    m.def("generate", []() { 
     std::deque<uint64_t> numbers; 
     for (uint64_t i = 0; i < 1500 * 1000000; ++i) { 
      numbers.push_back(i); 
     } 
     return numbers; 
    }); 

    // Move data from one dequeue to another. 
    m.def("move", [](std::deque<uint64_t>& numbers) { 
     std::deque<uint64_t> numbers_moved; 

     while (!numbers.empty()) { 
      numbers_moved.push_back(std::move(numbers.back())); 
      numbers.pop_back(); 
     } 
     std::cout << "Done!\n"; 
     return numbers_moved; 
    }); 

    return m.ptr(); 
}

test/__init__.py

import warnings 
warnings.simplefilter("default")

Compile:

g++ -std=c++14 -O2 -march=native -fPIC -Iextern/pybind11 `python3.5-config --includes` `python3.5-config --ldflags` `python3.5-config --libs` -shared -o test/_test.so test.cpp

Observations:

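When the moving part is not done by the executor, so that we just call moved_numbers = _test.move(numbers), everything works as expected and the memory usage shown by htop stays around 11Gb. Great!
When the moving part is done in the executor, the program takes double the memory and crashes.
When a limit on virtual memory is introduced (~15Gb), everything works fine, which is probably the most interesting part.

ulimit -Sv 15000000 && python3.5 test.py >> Done!
When we increase the limit, the program crashes (150Gb > my RAM).

ulimit -Sv 150000000 && python3.5 test.py >> [1] 2573 killed python3.5 test.py
Using the deque method shrink_to_fit doesn't help (nor should it).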
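The ulimit -Sv runs above can also be driven from Python, which may be handy when scripting the experiment. The following is a minimal sketch, assuming a Linux system; run_with_vm_limit is a hypothetical helper, not part of the original example. It sets the soft RLIMIT_AS limit in a child process, which is the limit that ulimit -Sv controls in bash (expressed there in KiB).

```python
import resource
import subprocess
import sys


def run_with_vm_limit(cmd, limit_kib):
    """Run `cmd` in a child process with a soft address-space limit.

    Hypothetical helper mirroring `ulimit -Sv <limit_kib>`.
    """
    limit_bytes = limit_kib * 1024  # ulimit -v counts in KiB

    def apply_limit():
        # Runs in the child just before exec. The hard limit is left
        # unchanged, and the soft limit is clamped so it never exceeds it.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        if hard == resource.RLIM_INFINITY:
            soft = limit_bytes
        else:
            soft = min(limit_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

    return subprocess.run(
        cmd,
        preexec_fn=apply_limit,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )


if __name__ == "__main__":
    # Same limit as the first ulimit run above (about 15Gb).
    result = run_with_vm_limit([sys.executable, "test.py"], 15000000)
    print(result.returncode, result.stdout)
```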

Software used

Ubuntu 14.04 
gcc version 5.4.1 20160904 (Ubuntu 5.4.1-2ubuntu1~14.04) 
Python 3.5.2 
pybind11 latest release - v1.8.1

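Note

Please note that this example was made just to demonstrate the problem. Using asyncio and pybind11 is necessary for the problem to occur.

Any ideas on what might be going on are most welcome.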

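Answer

The problem turned out to be caused by data being created in one thread and then deallocated in another. This happens because of the malloc arenas in glibc. It can be nicely demonstrated by doing: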

executor1 = concurrent.futures.ThreadPoolExecutor(1) 
executor2 = concurrent.futures.ThreadPoolExecutor(1) 

numbers = await loop.run_in_executor(executor1, _test.generate) 
moved_numbers = await loop.run_in_executor(executor2, _test.move, numbers)

which will take twice the memory allocated by _test.generate, and

executor = concurrent.futures.ThreadPoolExecutor(1) 

numbers = await loop.run_in_executor(executor, _test.generate) 
moved_numbers = await loop.run_in_executor(executor, _test.move, numbers)

which won't.
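The difference between the two snippets above is which thread each call runs on. The following sketch makes that visible without the C++ extension by recording threading.get_ident() in place of _test.generate and _test.move. The helper names are hypothetical, and the sketch only shows thread identity, not memory usage.

```python
import asyncio
import concurrent.futures
import threading


async def worker_threads(loop, executor_a, executor_b):
    # Stand-ins for _test.generate and _test.move: each call reports the
    # identifier of the worker thread that ran it.
    first = await loop.run_in_executor(executor_a, threading.get_ident)
    second = await loop.run_in_executor(executor_b, threading.get_ident)
    return first, second


def thread_ids(executor_a, executor_b):
    """Return the thread ids used by two consecutive executor calls."""
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(
            worker_threads(loop, executor_a, executor_b))
    finally:
        loop.close()


if __name__ == "__main__":
    shared = concurrent.futures.ThreadPoolExecutor(1)
    other = concurrent.futures.ThreadPoolExecutor(1)
    print("one executor: ", thread_ids(shared, shared))
    print("two executors:", thread_ids(shared, other))
    shared.shutdown()
    other.shutdown()
```

With a single one-worker executor both calls land on the same thread, so allocation and deallocation happen in the same place. With two executors, or with the main thread plus one executor as in the question, they happen on different threads.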

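This problem can be solved either by rewriting the code so that it doesn't move elements from one container to another (my case), or by setting the environment variable export MALLOC_ARENA_MAX=1, which limits the number of malloc arenas to 1. The latter might have some performance implications, however (there is a good reason for having multiple arenas).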
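If exporting the variable in the shell is inconvenient, it can be passed to a child process from Python instead. This is a minimal sketch; run_with_single_arena is a hypothetical helper. It assumes that glibc reads MALLOC_ARENA_MAX when malloc is initialized, so the variable is set in the environment of a new process rather than changed inside an interpreter that is already running.

```python
import os
import subprocess
import sys


def run_with_single_arena(cmd):
    """Run `cmd` with MALLOC_ARENA_MAX=1 in its environment.

    Hypothetical helper: equivalent to `MALLOC_ARENA_MAX=1 cmd ...` in a shell.
    """
    # Copy the current environment and add the setting for the child only.
    env = dict(os.environ, MALLOC_ARENA_MAX="1")
    return subprocess.run(
        cmd,
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )


if __name__ == "__main__":
    result = run_with_single_arena([sys.executable, "test.py"])
    print(result.returncode, result.stdout)
```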
