Reading XML into a Dictionary gets slower over time

Problem description:

My C# application reads an XML file with the following structure. The 150 MB file contains about 250,000 words.

<word> 
    <name>kick</name> 
    <id>485</id> 
    <rels>12:4;4256:3;754:3;1452:2;86:2;125:2;</rels> 
</word>

I want to read the XML file into a dictionary. These are some of the members of my reader class.

private XmlReader Reader; 

public string CurrentWordName; 
public int CurrentWordId; 
public Dictionary<KeyValuePair<int, int>, int> CurrentRelations;

This is the main method of my reader class. It reads the next word from the file, gets its name and id, and stores its relations in a dictionary.

CurrentWordId = -1; 
CurrentWordName = ""; 
CurrentRelations = new Dictionary<KeyValuePair<int, int>, int>(); 

while (Reader.Read())
    if (Reader.NodeType == XmlNodeType.Element && Reader.Name == "word")
    {
        while (Reader.Read())
            if (Reader.NodeType == XmlNodeType.Element && Reader.Name == "name")
            {
                XElement Title = XElement.ReadFrom(Reader) as XElement;
                CurrentWordName = Title.Value;
                break;
            }
        while (Reader.Read())
            if (Reader.NodeType == XmlNodeType.Element && Reader.Name == "id")
            {
                XElement Identifier = XElement.ReadFrom(Reader) as XElement;
                CurrentWordId = Convert.ToInt32(Identifier.Value);
                break;
            }
        while (Reader.Read())
            if (Reader.NodeType == XmlNodeType.Element && Reader.Name == "rels")
            {
                XElement Text = XElement.ReadFrom(Reader) as XElement;
                string[] RelationStrings = Text.Value.Split(';');
                foreach (string RelationString in RelationStrings)
                {
                    string[] RelationsStringSplit = RelationString.Split(':');
                    if (RelationsStringSplit.Length == 2)
                        CurrentRelations.Add(
                            new KeyValuePair<int, int>(CurrentWordId, Convert.ToInt32(RelationsStringSplit[0])),
                            Convert.ToInt32(RelationsStringSplit[1]));
                }
                break;
            }
        break;
    }

if (CurrentRelations.Count < 1 || CurrentWordId == -1 || CurrentWordName == "") 
    return false; 
else 
    return true;

My Windows Form has a BackgroundWorker that reads all the words.

private void bgReader_DoWork(object sender, DoWorkEventArgs e) 
{ 
    ReadXML Reader = new ReadXML(tBOpenFile.Text); 

    Words = new Dictionary<int, string>(); 
    Dictionary<KeyValuePair<int, int>, int> ReadedRelations = new Dictionary<KeyValuePair<int, int>, int>(); 

    // Read every word and copy its relations into the combined dictionary.
    while (Reader.ReadNextWord())
    {
        Words.Add(Reader.CurrentWordId, Reader.CurrentWordName);

        foreach (KeyValuePair<KeyValuePair<int, int>, int> CurrentRelation in Reader.CurrentRelations)
        {
            ReadedRelations.Add(new KeyValuePair<int, int>(CurrentRelation.Key.Key, CurrentRelation.Key.Value), CurrentRelation.Value);
        }
    }
}

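While debugging, I noticed that the application starts out very fast and gets slower over time.

7 seconds for the first 10,000 words
30 minutes for the first 200,000 words
35 minutes for the first 220,000 words

I can't explain this behavior! I am sure the words in the XML file are about the same size on average. Maybe the Add() method gets slower as the dictionary grows.

How can I speed up my application?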

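Rather than reading node by node, take a look at LINQ to XML. – Lloyd 2012-03-10 10:19:36

Maybe it would help to use a nested dictionary and index twice, instead of indexing once with a two-part key. It depends on the data. – harold 2012-03-10 10:21:05

@Lloyd, how would that help improve performance? – svick 2012-03-10 10:21:07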
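
As a rough illustration of harold's nested-dictionary idea, here is a minimal sketch. The class name NestedRelations, the helper names, and the reading of the stored int as a "weight" are assumptions made for the example; they do not come from the question.

```csharp
using System;
using System.Collections.Generic;

static class NestedRelations
{
    // Outer key: source word id. Inner key: target word id.
    // Value: the number stored for that relation (called "weight" here as an assumption).
    public static void AddRelation(
        Dictionary<int, Dictionary<int, int>> relations, int source, int target, int weight)
    {
        Dictionary<int, int> targets;
        if (!relations.TryGetValue(source, out targets))
        {
            // First relation for this source word: create its inner dictionary.
            targets = new Dictionary<int, int>();
            relations.Add(source, targets);
        }
        targets.Add(target, weight);
    }

    // Looks up a relation by indexing twice: first by source, then by target.
    public static bool TryGetWeight(
        Dictionary<int, Dictionary<int, int>> relations, int source, int target, out int weight)
    {
        Dictionary<int, int> targets;
        if (relations.TryGetValue(source, out targets))
        {
            return targets.TryGetValue(target, out weight);
        }
        weight = 0;
        return false;
    }

    static void Main()
    {
        var relations = new Dictionary<int, Dictionary<int, int>>();
        AddRelation(relations, 485, 12, 4);
        AddRelation(relations, 485, 4256, 3);

        int weight;
        if (TryGetWeight(relations, 485, 4256, out weight))
        {
            Console.WriteLine(weight);
        }
    }
}
```

Both levels are keyed by plain int, so hashing relies on the built-in int implementation rather than on a composite struct key.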

Answer

EDIT: Okay, now that I've run the code, I think this is the problem:

foreach (KeyValuePair<KeyValuePair<int, int>, int> CurrentRelation in 
     Reader.CurrentRelations) 
{ 
    ReadedRelations.Add(new KeyValuePair<int, int>(CurrentRelation.Key.Key, 
     CurrentRelation.Key.Value), CurrentRelation.Value); 
}

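Without this loop, it works much faster... which makes me suspect that the fact that you're reading from XML is actually a red herring.

I suspect the problem is that KeyValuePair<,> doesn't override Equals and GetHashCode. I believe that if you create your own value type containing two int values and override GetHashCode and Equals (and implement IEquatable<RelationKey>), it will be a lot faster.

Alternatively, you could always use a long to store the two int values - it's a bit of a hack, but it works fine. I can't test this right now, but I'll give it a go when I have more time.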
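
To make the long idea concrete, here is a minimal sketch of packing two int values into one long key. The class and method names (RelationKeyPacking, Pack, UnpackSource, UnpackTarget) are invented for this example, and putting the source id in the high 32 bits is just one possible layout.

```csharp
using System;
using System.Collections.Generic;

static class RelationKeyPacking
{
    // Source id goes in the high 32 bits, target id in the low 32 bits.
    // The (uint) cast keeps a negative target from sign-extending into the high half.
    public static long Pack(int source, int target)
    {
        return unchecked(((long)source << 32) | (uint)target);
    }

    // The arithmetic shift brings the high 32 bits back down, sign included.
    public static int UnpackSource(long key)
    {
        return (int)(key >> 32);
    }

    // Truncating to int keeps only the low 32 bits.
    public static int UnpackTarget(long key)
    {
        return unchecked((int)key);
    }

    static void Main()
    {
        var relations = new Dictionary<long, int>();
        relations.Add(Pack(485, 12), 4);
        relations.Add(Pack(485, 4256), 3);

        long key = Pack(485, 4256);
        Console.WriteLine("{0} -> {1}: {2}", UnpackSource(key), UnpackTarget(key), relations[key]);
    }
}
```

With this layout the dictionary becomes a Dictionary<long, int>, and long already has a built-in hash and equality implementation, so no custom struct is needed.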

Even just changing your loop to:

foreach (var relation in Reader.CurrentRelations) 
{ 
    ReadedRelations.Add(relation.Key, relation.Value); 
}

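would be simpler and slightly more efficient...

EDIT: Here's a sample RelationKey struct. Just replace all occurrences of KeyValuePair<int, int> with RelationKey, and use the Source and Target properties instead of Key and Value: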

public struct RelationKey : IEquatable<RelationKey> 
{ 
    private readonly int source; 
    private readonly int target; 

    public int Source { get { return source; } } 
    public int Target { get { return target; } } 

    public RelationKey(int source, int target) 
    { 
     this.source = source; 
     this.target = target; 
    } 

    public override bool Equals(object obj) 
    { 
     if (!(obj is RelationKey)) 
     { 
      return false; 
     } 
     return Equals((RelationKey)obj); 
    } 

    public override int GetHashCode() 
    { 
     return source * 31 + target; 
    } 

    public bool Equals(RelationKey other) 
    { 
     return source == other.source && target == other.target; 
    } 
}
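
On newer compilers (C# 7.0 or later, with System.ValueTuple available), a tuple key may be a lighter-weight alternative to a hand-written struct, because System.ValueTuple implements IEquatable and overrides GetHashCode. This is not part of the original answer and has not been measured against the 250,000-word file; the class name TupleKeyDemo, the method name Build, and the sample values are made up for the sketch.

```csharp
using System;
using System.Collections.Generic;

static class TupleKeyDemo
{
    // Builds a small relation table keyed by a (Source, Target) value tuple.
    public static Dictionary<(int Source, int Target), int> Build()
    {
        var relations = new Dictionary<(int Source, int Target), int>();
        relations.Add((485, 12), 4);
        relations.Add((485, 4256), 3);
        return relations;
    }

    static void Main()
    {
        var relations = Build();

        // Tuple keys compare by value, so a freshly created tuple finds the entry.
        int value;
        if (relations.TryGetValue((485, 4256), out value))
        {
            Console.WriteLine(value);
        }
    }
}
```

The named tuple elements play the same role as the Source and Target properties of RelationKey, so the calling code reads much the same either way.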

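I think it's already running on a 'BackgroundWorker'; that's why the method is called 'bgReader_DoWork'. – svick 2012-03-10 10:29:55

@svick: Doh - not sure how I missed that before. Editing... – 2012-03-10 10:31:24

@JonSkeet: At first I forgot to clear 'CurrentRelations', but then I noticed it. After that, I posted this question. Sorry, I left out that important part. I have now updated the code in my question. – danijar 2012-03-10 10:34:06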
