RSATree: Distribution-Aware Data Representation of Large-Scale Tabular Datasets for Flexible Visual

论文传送门
视频

作者

浙江大学

  • Honghui Mei
  • Wei Chen
  • Yating Wei
  • Yuanzhe Hu
  • Shuyue Zhou
  • Bingru Lin

中南大学

  • Ying Zhao
  • Jiazhi Xia

RSATree: Distribution-Aware Data Representation of Large-Scale Tabular Datasets for Flexible Visual

摘要

分析师通常会调查从统计数据汇总中得出的数据分布,这些统计汇总由图表(如直方图和合并的散点图)表示,以可视化和分析大型数据集。聚合查询是通过这样的过程隐式执行的。数据集时常会非常大;因此,应通过计算预定义的数据立方体来加快响应时间。但是,查询仅限于预处理数据多维数据集的预定义合并方案。这种局限性阻碍了分析师灵活调整视觉规格,以有效地调查数据中的隐式模式。特别地,RSATree通过利用三种方案来实现任意查询和灵活的分箱策略,这三种方案分别是:基于R树的空间分区方案以捕获数据分布,局部敏感哈希技术以实现对数据项的保留局部性的随机访问以及一个求和的面积表示方案,以线性计算复杂度支持对聚合值的交互式查询。这项研究提出并实现了一个基于Web的可视查询系统,该系统支持视觉规范,查询和用户可调整粒度的大规模表格数据的浏览。我们通过在现实世界的数据集上进行各种实验并分析时间和空间复杂度来证明我们的方法的效率和实用性。

Introduction

Exploratory data analysis (EDA)

  • Large and high-dimensional dataset
  • Limited number of screen pixels
  • Aggregated charts

RSATree: Distribution-Aware Data Representation of Large-Scale Tabular Datasets for Flexible Visual
Motivation:
Enable efficient visual queries that generate such charts, and support interaction on them.

Statistical charts: aggregate query

  • Bins - separations on dimensions
  • Aggregate function
  • Filter

RSATree: Distribution-Aware Data Representation of Large-Scale Tabular Datasets for Flexible Visual

Goals:

  • Low response time
    • The efficiency to answer the query
  • High flexibility
    • The ability to answer different queries (Arbitrary bin widths, Flexible bining strategies)

Related work:
Response time

  • Data cubes for visual query
  • Designed according to the usage scenario
    • imMens
    • Nanocubes
    • Hashedcubes

Problem:
RSATree: Distribution-Aware Data Representation of Large-Scale Tabular Datasets for Flexible Visual
Original Intention
Design according to the usage scenario (EDA)

  • Sometimes prefer a fast but approximate answer
    E.g. dragging a slider
    • Instant preview approximate answers
    • Compute accurate values when stopped

Approach

  • Adaptive granularity
    • Distribution-aware partition
  • Approximate answering
    • Store statistical summaries
  • Random access
    • Locality-sensitive index

RSATree: Distribution-Aware Data Representation of Large-Scale Tabular Datasets for Flexible Visual

Distribution-aware partition

Granularity \leftrightarrow density
Use modified R*-tree strategy

  • Eliminate empty space
  • Minimize overlaps

Approximate answering

Integral Histograms (IH,积分直方图)

  • Useful data structure for aggregate query
  • Calculate the data distribution in a region
    • In discretized Cartesian space
    • Nearly constant time
  • Extend from Summed Area Table (SAT)

IH

  • Scalar value \rightarrow histogram
  • Calculate distribution

Random access

  • Locality-sensitive hashing (LSH)
  • Originally a KNN search (point-2-point)
  • Modified for random access of involved IHs (rectangle-2-rectangle)

Scale alignment

  • Integration with interface
  • Align computational grids with IH cells
  • Improve accuracy
    RSATree: Distribution-Aware Data Representation of Large-Scale Tabular Datasets for Flexible Visual
    RSATree: Distribution-Aware Data Representation of Large-Scale Tabular Datasets for Flexible Visual

Examples

  • Brushing & linking
  • Visual specification
    RSATree: Distribution-Aware Data Representation of Large-Scale Tabular Datasets for Flexible Visual

Conclusion

RSATree

  • Answer aggregate query of arbitrary ranges
  • Support flexible binning strategies
  • Low response time and storage cost
  • Acceptable accuracy loss

Future work

  • A better partition algorithm (e.g. Learn distribution using ML models)
  • Handle outliers
  • GPU parallelism