Data Cleansing

Fang Huang
Tetherless World Constellation, Rensselaer Polytechnic Institute, Troy, NY, USA

Synonyms

Data cleaning; data pre-processing; data tidying; data wrangling

Introduction

Data cleansing, also known as data cleaning, is the process of identifying and addressing problems in raw data to improve data quality (Fox 2018). Data quality is broadly defined as the precision and accuracy of data, and it can significantly influence the information interpreted from the data (Van den Broeck et al. 2005). Data quality issues usually involve inaccurate, imprecise, and/or incomplete data. Moreover, large amounts of data are produced every day, and the intrinsic complexity and diversity of those data result in many quality issues. To extract useful information, data cleansing is an essential step in the data life cycle.

Data Life Cycle

"A data life cycle represents the whole procedure of data management" (Ma et al. 2014), and data cleansing is one of the early stages in the cycle. The cycle consists of six main stages (modified from Ma et al. 2014):

(1) Conceptual model: Data science problems often require a conceptual model to define the target questions, research objects, and applicable methods, which helps define the type of data to be collected. Any change to the conceptual model will influence the entire data life cycle. This step is essential, yet often ignored.

(2) Collection: Data can be collected from various sources: surveys (part of a group), censuses (whole group), observation, experimentation, simulation, modeling, scraping (automated online data collection), and data retrieval (data storage and providers). Data checking is needed to reduce simple errors, missing values, and duplicated values.

(3) Cleansing: Raw data are examined, edited, and transformed into the desired form. This stage solves some of the existing data quality issues (see below). Data cleansing is an iterative task: if any data problems are discovered during stages 4–6, data cleansing must be performed again.

(4) Curation and sharing: The cleaned data should be saved, curated, and updated in local and/or cloud storage for future use. The data can also be published or distributed between devices for sharing. This step dramatically reduces the likelihood of duplicated effort. Moreover, in scientific research, open data is required by many journals and organizations for study integrity and reproducibility.

(5) Analysis and discovery: This is the main step for using data to gain insights. By applying appropriate algorithms and models, trends and patterns can be recognized in the data and used to guide decision-making.

(6) Repurposing: The analysis results are evaluated and, based on the discovered information, the whole process can be performed again for the same or a different target.

Data cleansing plays an essential role in the data life cycle. Data quality issues can cause extracted information to be distorted or unusable, a problem that can be mitigated or eliminated through data cleansing. Some issues can be prevented during data collection, but many must be dealt with in the data cleansing stage. Data quality issues include errors, missing values, duplications, inconsistent units, inaccurate data, and so on. Methods for tackling these issues are discussed in the next sections.

Data Cleansing Process

Data cleansing deals with data quality issues after data collection is complete. The process can be generalized into "3E" steps: examine, explore, and edit. Finding data issues through planning and examination is the most effective approach. Simple issues, such as inconsistent numbers and missing values, can be detected easily, but exploratory analysis is needed for more complicated cases. Exploratory analyses, such as scatter plots, boxplots, and distribution tests, can help identify patterns within a dataset, thereby making errors more detectable. Once detected, the errors can be edited to correct the data.
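Before going through each step in detail, here is a minimal sketch of one pass through the 3E loop using pandas; the file name, the columns, and the age threshold are hypothetical illustrations, not part of the original entry.

```python
# One pass through the "3E" loop: examine, explore, edit.
# "survey_raw.csv" and its columns are hypothetical examples.
import pandas as pd

df = pd.read_csv("survey_raw.csv")

# Examine: structure, data types, missing values, duplicates
df.info()
print(df.isna().sum())
print("duplicate rows:", df.duplicated().sum())

# Explore: a boxplot makes outliers in a numeric column easy to spot
df["age"].plot.box()

# Edit: address what was found, then examine again (the loop is iterative)
df = df.drop_duplicates()
df = df[df["age"].between(0, 120)]  # drop physically impossible ages
```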
Examine

It is always helpful to define questionable features in advance, including data type problems, missing or duplicate values, and inconsistency and conflicts. A simple reorganization and indexing of the dataset may reveal some of these data quality issues.

– Data type problems: In data science, the two major types of data are categorical and numeric. Categorical values are normally representations of qualitative features, such as job titles, names, and nationalities. Occasionally, categorical values need to be encoded with numbers to run certain algorithms, but these codes remain distinct from numeric values. Numeric values usually represent quantitative features and can be further divided into discrete and continuous types. Discrete numeric values are separate and distinct, such as the population of a country or the number of daily transactions in a stock market; continuous numeric values carry decimals, such as the index of a stock market or the height of a person. For example, the age column of a census contains discrete numeric values, and the name column contains categorical data.

– Missing or duplicate values: These two issues are easily detected by reorganizing and indexing the dataset but can be hard to repair. For duplicate values, simply removing the duplicates solves the problem. Missing data can be filled in by checking the original data records or the metadata, when available. Metadata are the supporting information of a dataset, such as the methods of measurement, environmental conditions, location, or spatial relationships of samples. If the required information is not in the metadata, some exploratory analysis algorithms may help fill in the missing values.

– Inconsistency and conflicts: These happen frequently when merging two datasets, because merging data that represent the same samples or entities in different formats can easily cause duplication and conflicts. Occasionally, an inconsistency may not be solvable at this stage; it is acceptable to flag the problematic data and address them after the "analysis and discovery" stage, once a better overview of the data is achieved.
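The sketch below illustrates these checks with pandas; the toy DataFrame and its columns are hypothetical stand-ins for the cases described above.

```python
# Examine step: detect type problems, missing values, and duplicates.
# The toy data below are hypothetical illustrations.
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Dee"],
    "nationality": ["US", "UK", "UK", None],
    "age": ["34", "29", "29", "41"],  # numeric data wrongly stored as text
})

print(df.dtypes)  # reveals that "age" is object (text), not numeric
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Categorical values can be encoded with numbers for certain algorithms,
# but the codes remain labels, not quantities
df["nationality_code"] = df["nationality"].astype("category").cat.codes

print(df.isna().sum())      # missing values per column
print(df[df.duplicated()])  # duplicated rows stand out after indexing
```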
Explore

This stage uses exploratory analysis methods to identify problems that are hard to find by simple examination. One of the most widely used exploratory tools is visualization, which improves our understanding of a dataset in a more direct way: it allows recognition of outliers as well as relationships within the data, including unusual patterns and trends for further analysis. There are several common methods for exploratory analysis of one-dimensional, two-dimensional, and multidimensional data, where dimension refers to the number of features in the data. One- and two-dimensional data can be analyzed directly, but multidimensional data are usually reduced to lower dimensions, because it is much easier to explore the relationships among data features on a two-axis plane. Below is a partial list of methods that can be used in the Explore step; many other visualization and nonvisualization methods exist in addition to these.

One-dimensional:

– Boxplot: A summary of the distribution of one numeric data series with five numbers (Fox 2018): the minimum, first quartile, median, third quartile, and maximum, which makes outlier data points easy to identify.

– Histogram: A representation of the distribution of one numeric data series. The numeric values are divided into bins (x-axis), and the number of points in each bin is counted (y-axis); the axes are interchangeable. Its shape changes with the bin size, offering more freedom than a boxplot.

Two-dimensional:

– Scatter plot: A graph of the relationship, whether linear or nonlinear, between two numeric data series.

– Bar graph: A chart presenting the characteristics of categorical data. One axis represents the categories, and the other axis the values associated with each category. Grouped and stacked bar graphs can show more complex information.

Multidimensional:

– Principal component analysis (PCA): A statistical algorithm for analyzing the correlations among multivariate numeric values (Fox 2018). The multidimensional data are reduced to two orthogonal components for plotting.

Edit

After the problems have been identified, researchers need to decide how to tackle them; there are multiple methods for editing the dataset (a sketch follows this list). (1) Data types need to be adjusted for consistency: for instance, revise wrongly inputted numeric data to the correct values, or convert numeric or categorical data to meet the requirements of the selected algorithm. (2) Fill in missing values and replace or delete duplicated values using the information in the metadata (see the Examine section). For example, a scientific project called "Census of Deep Life" collected microbial life samples from below the seafloor along with environmental condition parameters, but some pressure values were missing; the missing values were calculated using depth information recorded in the metadata. (3) For inconsistency and conflicts, data conversion is needed: for example, when two datasets use different units, they should be converted before merging. (4) Some problems cannot be solved with the previous techniques and should be flagged within the dataset, so that in later analyses those points can be noted and dealt with accordingly. For example, random forest, a type of machine learning algorithm, can impute missing values from the existing data and their relationships.
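A hedged sketch of these edits follows. The column names and the hydrostatic approximation of pressure from depth are illustrative assumptions inspired by the "Census of Deep Life" example above, not that project's actual procedure.

```python
# Edit step: fill a missing value from metadata, convert units, and flag.
# Columns and the seawater approximation (~0.0101 MPa per meter of depth)
# are illustrative assumptions, not the actual "Census of Deep Life" method.
import pandas as pd

df = pd.DataFrame({
    "depth_m": [1200.0, 2500.0, 800.0],
    "pressure_mpa": [12.1, None, 8.2],    # (2) one missing value
    "temperature_f": [39.0, 35.0, 42.0],  # (3) unit inconsistent with a
})                                        #     Celsius dataset to be merged

# (2) Fill the missing pressure from depth recorded in the metadata
missing = df["pressure_mpa"].isna()
df.loc[missing, "pressure_mpa"] = df.loc[missing, "depth_m"] * 0.0101

# (3) Convert units before merging with a dataset that uses Celsius
df["temperature_c"] = (df["temperature_f"] - 32) * 5 / 9

# (4) Flag estimated values so later analysis stages can treat them with care
df["pressure_estimated"] = missing
```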
Overview

Before and during the data cleansing process, some principles should be kept in mind for best results: (1) Planning and predefining targets are critical, as they give the data cleansing process clear goals. (2) Use proper data structures to keep data organized and improve efficiency. (3) Prevent data problems at the collection stage. (4) Use unique IDs to avoid duplication. (5) Keep a good record of metadata. (6) Always keep copies of the data from before and after cleansing. (7) Document all changes.

Tools

Many tools exist for data cleansing, of two primary types: data cleansing software and programming packages. Software is normally easier to use but less flexible; programming packages have a steeper learning curve, but they are free of cost and can be extremely powerful.

– Software: Well-known examples include OpenRefine, Trifacta (Data) Wrangler, Drake, TIBCO Clarity, and many others (Deoras 2018). They often have built-in workflows and can perform some statistical analysis.

– Programming packages: Packages written in free programming languages, such as Python and R, are becoming more and more popular in the data science industry. Python is powerful, easy to use, and runs on many different systems. The Python development community is very active and has created numerous data science libraries, including NumPy, SciPy, scikit-learn, pandas, Matplotlib, and so on. pandas and Matplotlib provide powerful, easy-to-use functions for analyzing and visualizing different data formats, while NumPy, SciPy, and scikit-learn are used for statistical analysis and machine learning. R is a statistical programming language, similar to Python, that also offers a variety of statistical packages. Some widely used R packages include dplyr, foreign, ggplot2, and tidyr, all of which are useful for data manipulation and visualization.

Conclusion

Data cleansing is essential to ensure the quality of the data entering analysis and discovery, which in turn extracts appropriate and accurate information for future plans and decisions. This is particularly important as large tech companies, such as Facebook and Twitter, 23andMe, Amazon, and Uber, and international collaborative scientific projects are producing huge amounts of social media, genetic, e-commerce, travel, and scientific data, respectively. Such data from various sources may have very different formats and quality, making data cleansing an essential step in many areas of science and technology.

Further Reading

Deoras, S. (2018). 10 best data cleaning tools to get the most out of your data. Retrieved Mar 2019, from https://www.analyticsindiamag.com/10-best-datacleaning-tools-get-data/

Fox, P. (2018). Data analytics course. Retrieved Mar 2019, from https://tw.rpi.edu/web/courses/DataAnalytics/2018

Kim, W., Choi, B. J., Hong, E. K., Kim, S. K., & Lee, D. (2003). A taxonomy of dirty data. Data Mining and Knowledge Discovery, 7(1), 81–99.

Ma, X., Fox, P., Rozell, E., West, P., & Zednik, S. (2014). Ontology dynamics in a data life cycle: Challenges and recommendations from a geoscience perspective. Journal of Earth Science, 25(2), 407–412.

Van den Broeck, J., Cunningham, S. A., Eeckels, R., & Herbst, K. (2005). Data cleaning: Detecting, diagnosing, and editing data abnormalities. PLoS Medicine, 2(10), e267.
