How to optimize a simple LINQ query with two related tables?

Problem description:

I have two tables. The first is 'Patients':

{ Id = 1, Surname = Smith1 }
... 
{ Id = 1000, Surname = Smith1000 }

and the second is 'Receptions':

{ PatientId = 1, ReceptionStart = 3/3/2017 1:14:00 AM } 
{ PatientId = 1, ReceptionStart = 1/7/2016 1:14:00 AM } 
... 
{ PatientId = 1000, ReceptionStart = 1/23/2017 1:14:00 AM }

The tables do not come from a database; they are generated with the following sample code:

var rand = new Random();
var receptions = Enumerable.Range(1, 1000)
    .SelectMany(pid => Enumerable.Range(1, rand.Next(0, 10))
        .Select(rid => new { PatientId = pid, ReceptionStart = DateTime.Now.AddDays(-rand.Next(1, 500)) }))
    .ToList();
var patients = Enumerable.Range(1, 1000)
    .Select(pid => new { Id = pid, Surname = string.Format("Smith{0}", pid) })
    .ToList();

The question is: what is the best way to select the patients who had receptions before July 1, 2017?

Of course, I can write something like this:

var cured_receptions = (from r in receptions where r.ReceptionStart < new DateTime(2017, 7, 1) select r.PatientId).Distinct();
var cured_patients = from p in patients where cured_receptions.Contains(p.Id) select p;

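but it is not clear to me what the "cured_receptions.Contains(p.Id)" code actually does. Does it simply iterate over all the items searching for the Id, or does it use something like an index in a database? Could cured_receptions.ToDictionary() or something similar help here somehow?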

You can put a join between the two queries and do it in a single step –

Answer

Starting over, assuming everything is in memory only...

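1. Your cured_receptions is not evaluated until Contains calls it, and it is then re-evaluated for every patient, so it is much more efficient to put .ToList() at the end of that variable's definition (about 100 times faster in my timings).
2. LINQ does not "search"; Contains does the searching. If you want a binary search or, better, a hash table, you have to create it yourself. If you use a HashSet<int>, you get another 47x speedup. Dropping Distinct (since HashSet already handles duplicates) saves a further 15%.
3. Keeping constants in a variable rather than creating them on the fly (new DateTime ...) might save a little more. Even with greatly increased random data, the HashSet version does not run long enough to tell.
4. Using a join is faster than your initial query, but your query combined with a HashSet is the fastest.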

So the fastest code is:

var cured_receptions = new HashSet<int>((from r in receptions where r.ReceptionStart < endDateTime select r.PatientId)); 
var cured_patients = from p in patients where cured_receptions.Contains(p.Id) select p;

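Note: I used LINQPad to generate the timings and the sample data. I changed your date parameters because your values made most of the receptions match.

Here is my LINQPad code: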

var rand = new Random(); 
var begin = DateTime.Now; 
var receptions = Enumerable.Range(1, 100000).SelectMany(pid => Enumerable.Range(1, rand.Next(0, 100)).Select(rid => new { PatientId = pid, ReceptionStart = begin.AddDays(-rand.Next(1, 180)) })).ToList(); 
var patients = Enumerable.Range(1, 100000).Select(pid => new { Id = pid, Surname = string.Format("Smith{0}", pid) }).ToList(); 

var startTime = Util.ElapsedTime; 
var endDateTime = new DateTime(2017, 5, 1); 
//var cured_receptions = (from r in receptions where r.ReceptionStart < new DateTime(2017, 5, 1) select r.PatientId).Distinct().ToList(); 
//var cured_receptions = (from r in receptions where r.ReceptionStart < new DateTime(2017, 5, 1) select r.PatientId).Distinct(); 
//var cured_receptions = new HashSet<int>((from r in receptions where r.ReceptionStart < new DateTime(2017, 5, 1) select r.PatientId).Distinct()); 
//var cured_receptions = new HashSet<int>((from r in receptions where r.ReceptionStart < endDateTime select r.PatientId).Distinct()); 
//var cured_receptions = new HashSet<int>((from r in receptions where r.ReceptionStart < new DateTime(2017, 5, 1) select r.PatientId)); 
var cured_receptions = new HashSet<int>((from r in receptions where r.ReceptionStart < endDateTime select r.PatientId)); 
var cured_patients = from p in patients where cured_receptions.Contains(p.Id) select p; 

// var cured_patients = (from r in receptions 
//      where r.ReceptionStart < endDateTime 
//      join p in patients on r.PatientId equals p.Id 
//      select p).Distinct(); 

// var cured_patients = from p in patients 
//      join r in receptions on p.Id equals r.PatientId into rj 
//      where rj.Any(r => r.ReceptionStart < endDateTime) 
//      select p; 

cured_patients.Count().Dump(); 
var endTime = Util.ElapsedTime; 

(endTime - startTime).Dump("Elapsed");
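Util.ElapsedTime and Dump() are LINQPad helpers, so the script above does not run as a plain C# program. The following is only a rough sketch of the same comparison using System.Diagnostics.Stopwatch. The fixed seed, the smaller data size, the 90-day cutoff, and the variable names are assumptions made for illustration, and the timings it prints will vary by machine.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

var rand = new Random(12345); // fixed seed (an assumption) so runs are repeatable
var begin = DateTime.Now;
var cutoff = begin.AddDays(-90); // hypothetical cutoff, roughly mid-range of the data

// Smaller than the LINQPad data set so both variants finish quickly.
var receptions = Enumerable.Range(1, 2000)
    .SelectMany(pid => Enumerable.Range(1, rand.Next(0, 10))
        .Select(_ => new { PatientId = pid, ReceptionStart = begin.AddDays(-rand.Next(1, 180)) }))
    .ToList();
var patients = Enumerable.Range(1, 2000)
    .Select(pid => new { Id = pid, Surname = $"Smith{pid}" })
    .ToList();

// Variant 1: ids materialized into a List<int>; Contains is a linear scan.
var sw = Stopwatch.StartNew();
var curedList = (from r in receptions where r.ReceptionStart < cutoff select r.PatientId)
    .Distinct()
    .ToList();
int countList = patients.Count(p => curedList.Contains(p.Id));
var listTime = sw.Elapsed;

// Variant 2: ids stored in a HashSet<int>; Contains is a hash lookup.
sw.Restart();
var curedSet = new HashSet<int>(
    from r in receptions where r.ReceptionStart < cutoff select r.PatientId);
int countSet = patients.Count(p => curedSet.Contains(p.Id));
var setTime = sw.Elapsed;

Console.WriteLine($"List:    {countList} patients in {listTime}");
Console.WriteLine($"HashSet: {countSet} patients in {setTime}");
```

Both variants should report the same patient count; only the elapsed time differs.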

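My items are in memory, not in a database. – Dmitriano

I did not notice that in your comments. Changed the answer to explain the in-memory case. – NetMage

Good idea about HashSet! But I cannot figure out #1: what is the difference between .ToList() and the original cured_receptions? From my point of view, both are containers with O(N) search, aren't they? – Dmitriano

Answer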

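But it is not clear to me what the 'cured_receptions.Contains(p.Id)' code actually does. Does it simply iterate over all the items searching for the Id, or does it use something like an index in a database?

Case 1: Interacting with a database

If you are interacting with a database, no query is sent to the database until you execute the second query, either by calling ToList() on it or by iterating over the items in cured_patients. The query sent to the database will be something along the lines of: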

SELECT 
[Extent1].[Id] AS [Id], 
[Extent1].[Surname] AS [Surname] 
FROM [dbo].[Patients] AS [Extent1] 
WHERE EXISTS (SELECT 
    1 AS [C1] 
    FROM [dbo].[Receptions] AS [Extent2] 
    WHERE ([Extent2].[ReceptionStart] < 
    convert(datetime2, '2017-07-01 00:00:00.0000000', 121)) 
    AND ([Extent2].[PatientId] = [Extent1].[Id]) 
)

Will it use any indexes?

Yes. If PatientId, Id and ReceptionStart are indexed, the database server can use them.

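Case 2: Interacting with items in memory

For the first query, it will iterate over all receptions, find those whose ReceptionStart is before the given date, select the PatientId, and then remove any duplicate PatientId(s).

Then the second query, below: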

var cured_patients = 
    from p in patients 
    where cured_receptions.Contains(p.Id) 
    select p;

will iterate over every item in patients and check whether that item's Id is found in cured_receptions. Every item whose Id is found in cured_receptions is selected. Contains simply returns true or false.
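The in-memory behaviour described above can be spelled out as plain loops. This is only an illustrative sketch: the three-patient data set and the names cutoff, curedIds and curedSurnames are hypothetical, and it ignores deferred execution by building the id list once up front.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical miniature data in the same shape as the question's lists.
var patients = new[]
{
    new { Id = 1, Surname = "Smith1" },
    new { Id = 2, Surname = "Smith2" },
    new { Id = 3, Surname = "Smith3" },
};
var cutoff = new DateTime(2017, 7, 1);
var receptions = new[]
{
    new { PatientId = 1, ReceptionStart = new DateTime(2017, 3, 3) },
    new { PatientId = 1, ReceptionStart = new DateTime(2016, 1, 7) },
    new { PatientId = 3, ReceptionStart = new DateTime(2017, 8, 15) },
};

// First query, spelled out: filter by date, project the id, drop duplicates.
var curedIds = new List<int>();
foreach (var r in receptions)
{
    if (r.ReceptionStart < cutoff && !curedIds.Contains(r.PatientId))
        curedIds.Add(r.PatientId);
}

// Second query, spelled out: one linear scan of curedIds per patient.
var curedSurnames = new List<string>();
foreach (var p in patients)
{
    bool found = false;
    foreach (var id in curedIds)
    {
        if (id == p.Id) { found = true; break; }
    }
    if (found) curedSurnames.Add(p.Surname);
}

Console.WriteLine(string.Join(", ", curedSurnames)); // prints "Smith1"
```

The inner loop is what makes the lookup O(N) per patient; swapping curedIds for a HashSet<int> replaces that scan with a hash lookup.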

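Is it possible to optimize the query somehow when my items are in memory rather than in a database? Does LINQ operate on maps or hash tables? Why can't LINQ do a binary search in an ordered set? – Dmitriano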

@user2394762 Why do you want to optimize it? Is it slow, and is this code the bottleneck? .NET has [BinarySearch](https://msdn.microsoft.com/en-us/library/3f90y839(v=vs.110).aspx) – CodingYoshi
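For reference, a minimal sketch of the binary-search approach mentioned in the comment above, using List<T>.Sort and List<T>.BinarySearch. The id values and the names matchedIds, sortedIds and found are made up for illustration; they are not from the thread.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical ids of patients that passed the date filter (unsorted, with duplicates).
var matchedIds = new[] { 42, 7, 7, 19, 3, 42 };
var patientIds = new[] { 3, 4, 19, 100 };

// List<T>.BinarySearch only gives correct results on a sorted list.
List<int> sortedIds = matchedIds.Distinct().ToList();
sortedIds.Sort();

// BinarySearch returns a non-negative index when the value is present
// and a negative number when it is not.
List<int> found = patientIds
    .Where(id => sortedIds.BinarySearch(id) >= 0)
    .ToList();

Console.WriteLine(string.Join(", ", found)); // prints "3, 19"
```

Each lookup is O(log N) after a one-time sort, which sits between the linear scan of a List<int> and the hash lookup of a HashSet<int>.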

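Note that since 'cured_receptions' is an 'IEnumerable<int>', the variable definition does not iterate anything; it only builds a stack of LINQ 'IEnumerable' functions. 'Contains' will execute 'cured_receptions' for each 'p', recomputing the elements each time until it finds a match for 'p.Id'. – NetMage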
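The re-evaluation described in this comment can be made visible with a counter inside the query. The sketch below is an assumption-laden illustration, not code from the thread: the tiny arrays and the evaluated counter are hypothetical, and the exact number of evaluations in the deferred case may depend on the .NET version.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Small hypothetical data set, just for illustration.
var receptionPatientIds = new[] { 1, 1, 2, 3, 3, 3 };
var patientIds = new[] { 1, 2, 3, 4 };

int evaluated = 0; // counts how many source elements the query body touches

// Deferred: nothing runs here, the query is only described.
IEnumerable<int> deferred = receptionPatientIds
    .Where(id => { evaluated++; return true; })
    .Distinct();

Console.WriteLine(evaluated); // prints 0, the definition alone did no work

// Each Contains call runs the query again from the start.
int matchesDeferred = patientIds.Count(id => deferred.Contains(id));
int evaluatedDeferred = evaluated;

// Materialized once: later Contains calls only scan the stored list.
evaluated = 0;
List<int> materialized = deferred.ToList();
int matchesList = patientIds.Count(id => materialized.Contains(id));
int evaluatedList = evaluated;

Console.WriteLine($"deferred: {evaluatedDeferred} evaluations, list: {evaluatedList} evaluations");
```

Both variants find the same matches, but the deferred query touches the source elements repeatedly while the list version touches each of them once. That is the difference between the original cured_receptions and the .ToList() version asked about in the earlier comment.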
