Fastest way to read data from lots of ASCII files

Question:

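For a college exercise that I have already submitted, I needed to read a .txt file containing a lot of image names (one per line). Then I needed to open each image as an ASCII file, read its data (the images are in PPM format), and do a series of things with it. The thing is, I noticed my program was spending 70% of its time reading the data from the files, rather than in the other calculations I'm doing (finding the number of repetitions of each pixel with a hash table, finding the pixels that differ between two images, etc.), which I found quite odd, to say the least.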

This is what the PPM format looks like:

P3 //This value can be ignored when reading the file, because all images will be correctly formatted
4 4
255 //This value can also be ignored, it will always be 255.
0 0 0 0 0 0 0 0 0 15 0 15 
0 0 0 0 15 7 0 0 0 0 0 0 
0 0 0 0 0 0 0 15 7 0 0 0 
15 0 15 0 0 0 0 0 0 0 0 0

This is how I'm reading the data from the files:

ifstream fdatos; 
fdatos.open(argv[1]); //Open file with the name of all the images 

const int size = 128; 
char file[size]; //Where I'll get the image name 

Image *img; 
while (fdatos >> file) { //While there are still image names left, continue
    ifstream fimagen;
    fimagen.open(file); //Open image file
    img = new Image(fimagen); //Create new image object with its data file
    ………
    //Rest of the calculations with that image
    ………
    delete img; //Delete image object after done
    fimagen.close(); //Close image file after done
} 

fdatos.close();

And the Image object reads the data inside like this:

const int tallafirma = 100; 
char firma[tallafirma]; 
fich_in >> std::setw(100) >> firma; // Read the P3 part, can be ignored 

int maxvalue, numpixels; 
fich_in >> height >> width >> maxvalue; // Read the next three values 
numpixels = height*width; 
datos = new Pixel[numpixels]; 

int r,g,b; //Don't need to be ints, max value is 255, so an unsigned char would be ok.
for (int i=0; i<numpixels; i++) { 
    fich_in >> r >> g >> b; 
    datos[i] = Pixel(r, g ,b); 
} 
//This last part is the slow one.
//I think I should be able to read all this data in one single read
//into a buffer or something, which would be stored in an array of unsigned chars,
//and then I'd only need to do:
//buffer[0] -> //Pixel 1 - Red data 
//buffer[1] -> //Pixel 1 - Green data 
//buffer[2] -> //Pixel 1 - Blue data

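So, any ideas? I think I could improve this quite a bit by reading everything into an array in one single call, I just don't know how that's done.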

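Also, is it possible to know how many images there are in the "index file"? Is it possible to know the number of lines a file has? (Because there's one file name per line.)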

Thanks!

EDIT: This is how I measure the time.

#include <sys/time.h> 
#include <sys/resource.h> 
double get_time() 
{ 
    struct timeval t; 
    struct timezone tzp; 
    gettimeofday(&t, &tzp); 
    return t.tv_sec + t.tv_usec*1e-6; 
} 

double start = get_time(); 
//Everything to be measured here. 
double end = get_time(); 

cout << end-start << endl;

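Well, I'm one of the lecturers of this subject and also the organizer of the programming contest this student is taking part in. Feel free to help him, but the assumption is that students must solve the contest by themselves, or at most by reading different sources, not by actively asking the community. Anyway, just as I detected this query, any other participant can too, and copying is totally forbidden........ – 2011-03-06 10:11:10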

Answer

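You are allocating memory and deleting it in every iteration of the loop. If you're that concerned about performance, I don't think that's a good thing.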

So one improvement you can make is to allocate the memory once and reuse it throughout your program.

void *memory = malloc(sizeof(Image)); //reusable memory. 

//placement new to construct the object in the already allocated memory! 
img = new (memory) Image(fimagen); 

//... 

img->~Image(); //calling the destructor 

//when you're done free the memory 
free(memory); //use free, as we had used malloc when allocating!

Similarly, you can reuse memory in the Image class, especially at this line:

datos = new Pixel[numpixels];

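Lastly, reading the RGB values into local variables and then copying them into the image data is not that elegant, so a small improvement can be made here as well: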

//this is yours : temporaries, and copying! 
fich_in >> r >> g >> b; 
datos[i] = Pixel(r, g ,b); 

//this is mine : no temporaries, no copying. directly reading into image data! 
fich_in >> datos[i].r >> datos[i].g >> datos[i].b;

Other than that, I don't think there's much room left to improve the performance of your code.

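Hmm, I didn't know I could actually do that, I'm pretty new to all this. I'll try it and see if it helps. Also, it's not that I'm worried, it's just that I noticed this, and having seen people read a whole file into a buffer, I thought that would improve things since you'd only need to access the disk once?? I don't know. – asendra 2011-02-12 15:48:45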

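Well, @Nawaz, that was it! The temporaries and the copying of the RGB variables. Testing with a small set (30 images of 256x256), the time dropped from 0.23 to 0.098. Holy smokes xD – asendra 2011-02-12 16:22:47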
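
For reference, the "read the whole file into one buffer" idea raised in the question could look roughly like the sketch below. The function names slurp, parse_numbers and pixel_bytes are hypothetical, and the sketch assumes files with no comments and sample values no larger than 255. Note that std::ifstream is already buffered, so any gain would come mainly from cheaper parsing rather than from fewer disk accesses; whether it is actually faster would need to be measured.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Reads the entire file into memory. Returns an empty string if the file
// cannot be opened.
std::string slurp(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Parses every run of decimal digits in `text` into an unsigned value and
// skips everything else (whitespace, the letter in "P3", ...).
std::vector<unsigned> parse_numbers(const std::string& text) {
    std::vector<unsigned> out;
    std::size_t i = 0;
    while (i < text.size()) {
        if (text[i] < '0' || text[i] > '9') { ++i; continue; }
        unsigned v = 0;
        while (i < text.size() && text[i] >= '0' && text[i] <= '9') {
            v = v * 10 + static_cast<unsigned>(text[i] - '0');
            ++i;
        }
        out.push_back(v);
    }
    return out;
}

// Returns the RGB samples as bytes: buffer[0] is pixel 1 red, buffer[1] is
// pixel 1 green, buffer[2] is pixel 1 blue, and so on.
// Returns an empty vector if the image is malformed or truncated.
std::vector<unsigned char> pixel_bytes(const std::string& text) {
    const std::vector<unsigned> v = parse_numbers(text);
    // v[0] is the 3 of "P3", v[1] and v[2] are the dimensions, v[3] is the
    // maximum value; the samples start at v[4].
    if (v.size() < 4) return {};
    const std::size_t n = static_cast<std::size_t>(v[1]) * v[2] * 3;
    if (v.size() - 4 < n) return {};
    const auto first = v.begin() + 4;
    return std::vector<unsigned char>(first,
                                      first + static_cast<std::ptrdiff_t>(n));
}
```

A call such as pixel_bytes(slurp(file)) would then replace the per-value extraction loop for one image.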

Answer

It is not possible to count the number of lines in a text file without reading the whole file.

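For other optimizations, use the time command (if you're on Unix/Linux) to check whether the program uses a lot of "user" time (actual computation) compared to its "wallclock" time (the total time between the start and end of the process). If it doesn't, you are probably waiting on disk or network.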
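
The same comparison can be approximated from inside the program. This is only a rough sketch, not a replacement for the time command: the Timing struct and the measure function are hypothetical names, and std::clock reports processor time for the whole process (user plus system) on POSIX systems, while some platforms implement it differently.

```cpp
#include <chrono>
#include <ctime>

// CPU time consumed and wall-clock time elapsed, both in seconds.
struct Timing {
    double cpu_seconds;
    double wall_seconds;
};

// Runs `work` once and reports how much processor time and how much
// wall-clock time it took.
template <typename F>
Timing measure(F&& work) {
    const std::clock_t c0 = std::clock();
    const auto w0 = std::chrono::steady_clock::now();
    work();
    const std::clock_t c1 = std::clock();
    const auto w1 = std::chrono::steady_clock::now();
    return {static_cast<double>(c1 - c0) / CLOCKS_PER_SEC,
            std::chrono::duration<double>(w1 - w0).count()};
}
```

If wall_seconds is much larger than cpu_seconds for the file-reading section, the program is mostly waiting rather than computing.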
