Title
Buckaroo: A Direct Manipulation Visual Data Wrangler
Abstract
Preparing datasets, a critical phase known as data wrangling, dominates data science development, consuming upwards of 80% of the total project time. This phase encompasses a myriad of tasks: parsing data, restructuring it for analysis, repairing inaccuracies, merging sources, eliminating duplicates, and ensuring overall data integrity. Traditional approaches, typically manual coding in languages such as Python or editing in spreadsheets, are not only laborious but also error-prone. The data issues they must handle range from missing entries and formatting inconsistencies to data type inaccuracies, all of which can affect the quality of downstream tasks if not properly corrected. To address these challenges, we present Buckaroo, a visualization system that highlights discrepancies in data and enables on-the-spot corrections through direct manipulation of visual objects. Buckaroo (1) automatically finds "interesting" data groups that exhibit anomalies compared to the rest of the groups and recommends them for inspection; (2) suggests wrangling actions that the user can choose to repair the anomalies; and (3) allows users to visually manipulate their data by displaying the effects of their wrangling actions and offering the ability to undo or redo these actions, which supports the iterative nature of data wrangling. A video companion is available at https://youtu.be/iXdCYbvpQVE
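To make the three capabilities listed in the abstract concrete, the following is a minimal sketch and not Buckaroo's actual implementation. It assumes that "anomaly" means an unusually high missing-value rate within a group, that a group is flagged when its rate exceeds the median rate across groups by a fixed threshold, and that undo and redo work over full snapshots of the data. All names (`missing_rate_by_group`, `flag_anomalous_groups`, `drop_missing`, `History`) and the sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import median


def missing_rate_by_group(rows, group_key, value_key):
    """Return {group: fraction of rows whose value_key is None}."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row[group_key]
        total[group] += 1
        if row.get(value_key) is None:
            missing[group] += 1
    return {g: missing[g] / total[g] for g in total}


def flag_anomalous_groups(rates, threshold=0.25):
    """Flag groups whose missing rate exceeds the median rate by more than threshold.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    baseline = median(rates.values())
    return sorted(g for g, r in rates.items() if r - baseline > threshold)


def drop_missing(value_key):
    """One possible wrangling action: build a new list without the incomplete rows."""
    return lambda rows: [r for r in rows if r.get(value_key) is not None]


class History:
    """Undo/redo over snapshots; actions must return new lists, not mutate."""

    def __init__(self, initial):
        self._past = [initial]
        self._future = []

    @property
    def current(self):
        return self._past[-1]

    def apply(self, action):
        self._past.append(action(self.current))
        self._future.clear()  # a new action invalidates the redo stack

    def undo(self):
        if len(self._past) > 1:
            self._future.append(self._past.pop())

    def redo(self):
        if self._future:
            self._past.append(self._future.pop())


# Hypothetical sample data: every "south" row lacks an income value.
rows = [
    {"region": "north", "income": 52000},
    {"region": "north", "income": 48000},
    {"region": "south", "income": None},
    {"region": "south", "income": None},
    {"region": "east", "income": 61000},
    {"region": "east", "income": 58000},
]

rates = missing_rate_by_group(rows, "region", "income")
flagged = flag_anomalous_groups(rates)  # step 1: groups recommended for inspection

history = History(rows)
history.apply(drop_missing("income"))   # step 2: the user picks a suggested repair
repaired_count = len(history.current)
history.undo()                          # step 3: the user reverts the repair
restored_count = len(history.current)
```

Snapshot-based history keeps undo and redo trivial at the cost of memory; a system handling large datasets would more plausibly store inverse operations or diffs, which this sketch does not attempt.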