Title
Buckaroo: A Direct Manipulation Visual Data Wrangler
Abstract
Preparing datasets, a critical phase known as data wrangling, constitutes the dominant phase of data science development, consuming upwards of 80% of the total project time. This phase encompasses a myriad of tasks: parsing data, restructuring it for analysis, repairing inaccuracies, merging sources, eliminating duplicates, and ensuring overall data integrity. Traditional approaches, typically manual coding in languages such as Python or editing in spreadsheets, are not only laborious but also error-prone. The data issues they must handle range from missing entries and formatting inconsistencies to data type inaccuracies, all of which can degrade the quality of downstream tasks if not properly corrected. To address these challenges, we present Buckaroo, a visualization system that highlights discrepancies in data and enables on-the-spot corrections through direct manipulation of visual objects. Buckaroo (1) automatically finds "interesting" data groups that exhibit anomalies compared to the rest of the groups and recommends them for inspection; (2) suggests wrangling actions that the user can choose to repair the anomalies; and (3) lets users visually manipulate their data by displaying the effects of their wrangling actions and offering the ability to undo or redo them, which supports the iterative nature of data wrangling. A video companion is available at https://youtu.be/iXdCYbvpQVE.