362
pages
English
Ebooks
2015
Published by
Publication date
5 May 2015
Number of reads
5
EAN13
9781629598024
Language
English
File size
18 MB
The correct bibliographic citation for this manual is as follows: Svolba, Gerhard. 2012. Data Quality for Analytics Using SAS. Cary, NC: SAS Institute Inc.
Data Quality for Analytics Using SAS
Copyright © 2012, SAS Institute Inc., Cary, NC, USA
ISBN 978-1-60764-620-4 (Hardcopy)
ISBN 978-1-62959-802-4 (EPUB)
ISBN 978-1-62959-803-1 (MOBI)
ISBN 978-1-61290-227-2 (PDF)
All rights reserved. Produced in the United States of America.
For a hard-copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.
For a Web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at the time you acquire this publication.
The scanning, uploading, and distribution of this book via the Internet or any other means without the permission of the publisher is illegal and punishable by law. Please purchase only authorized electronic editions and do not participate in or encourage electronic piracy of copyrighted materials. Your support of others' rights is appreciated.
U.S. Government License Rights; Restricted Rights: The Software and its documentation is commercial computer software developed at private expense and is provided with RESTRICTED RIGHTS to the United States Government. Use, duplication or disclosure of the Software by the United States Government is subject to the license terms of this Agreement pursuant to, as applicable, FAR 12.212, DFAR 227.7202-1(a), DFAR 227.7202-3(a) and DFAR 227.7202-4 and, to the extent required under U.S. federal law, the minimum restricted rights as set out in FAR 52.227-19 (DEC 2007). If FAR 52.227-19 is applicable, this provision serves as notice under clause (c) thereof and no other notice is required to be affixed to the Software or documentation. The Government's rights in Software and documentation shall be only those set forth in this Agreement.
SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513-2414
April 2012
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
For my three teenage sons and their permanent effort
in letting me share the many wonderful moments of their life,
and without whose help
this book would have probably been completed a year earlier.
You are the quality of my life.
Acknowledgments
Martina, for supporting me and even the crazy idea to write this book in that period of our family life, which is probably the busiest one.
My parents, for providing me with so many possibilities.
The following persons, who contributed to the book by spending time to discuss data quality topics. It is a pleasure to work and to discuss with you: Albert Tösch, Andreas Müllner, Bertram Wassermann, Bernadette Fabits, Christine Hallwirth, Claus Reisinger, Franz Helmreich, Franz König, Helmut Zehetmayr, Josef Pichler, Manuela Lenk, Mihai Paunescu, Matthias Svolba, Nicole Schwarz, Peter Bauer, Phil Hermes, Stefan Baumann, Thomas Schierer, and Walter Herrmann.
The reviewers, who took time to review my manuscript and provided constructive feedback and suggestions. I highly appreciate your effort: Anne Milley, David Barkaway, Jim Seabolt, Mihai Paunescu, Mike Gilliland, Sascha Schubert, and Udo Sglavo.
The nice and charming SAS Press team for their support throughout the whole process of the creation of this book: Julie Platt, Stacey Hamilton, Shelley Sessoms, Kathy Restivo, Shelly Goodin, Aimee Rodriguez, Mary Beth Steinbach, and Lucie Haskins.
The management of SAS Austria for supporting the idea to write this book: Dietmar Kotras and Robert Stindl.
August Ernest Müller, my great-grandfather, designed one of the very early construction plans of a helicopter in 1916 and was able to file a patent in the Austrian-Hungarian monarchy. However, he found no sponsor to realize his project during World War I. I accidentally found his construction plans and documents at the time I started writing this book. His work impressed and motivated me a lot.
Contents
Introduction
Part I Data Quality Defined
Chapter 1 Introductory Case Studies
1.1 Introduction
1.2 Case Study 1: Performance of Race Boats in Sailing Regattas
Overview
Functional problem description
Practical questions of interest
Technical and data background
Data quality considerations
Case 1 summary
1.3 Case Study 2: Data Management and Analysis in a Clinical Trial
General
Functional problem description
Practical question of interest
Technical and data background
Data quality considerations
Case 2 summary
1.4 Case Study 3: Building a Data Mart for Demand Forecasting
Overview
Functional problem description
Functional business questions
Technical and data background
Data quality considerations
Case 3 summary
1.5 Summary
Data quality features
Data availability
Data completeness
Inferring missing data from existing data
Data correctness
Data cleaning
Data quantity
Chapter 2 Definition and Scope of Data Quality for Analytics
2.1 Introduction
Different expectations
Focus of this chapter
Chapter 1 case studies
2.2 Scoping the Topic Data Quality for Analytics
General
Differentiation of data objects
Operational or analytical data quality
General data warehouse or advanced analytical analyses
Focus on analytics
Data quality with analytics
2.3 Ten Percent Missing Values in Date of Birth Variable: An Example
General
Operational system
Systematic missing values
Data warehousing
Usability for analytics
Conclusion
2.4 Importance of Data Quality for Analytics
2.5 Definition of Data Quality for Analytics
General
Definition
2.6 Criteria for Good Data Quality: Examples
General
Data and measurement gathering
Plausibility check: Relevancy
Correctness
Missing values
Definitions and alignment
Timeliness
Adequacy for analytics
Legal considerations
2.7 Conclusion
General
Upcoming chapters
Chapter 3 Data Availability
3.1 Introduction
3.2 General Considerations
Reasons for availability
Definition of data availability
Availability and usability
Effort to make data available
Dependence on the operational process
Availability and alignment in the time dimension
3.3 Availability of Historic Data
Categorization and examples of historic data
The length of the history
Customer event histories
Operational systems and analytical systems
3.4 Historic Snapshot of the Data
More than historic data
Confusion in definitions
Example of a historic snapshot in predictive modeling
Comparing models from different time periods
Effort to retrieve historic snapshots
Example of historic snapshots in time series forecasting
3.5 Periodic Availability and Actuality
Periodic availability
Actuality
3.6 Granularity of Data
General
Definition of requirements
3.7 Format and Content of Variables
General
Main groups of variable formats for the analysis
Considerations for the usability of data
Typical data cleaning steps
3.8 Available Data Format and Data Structure
General
Non-electronic format of data
Levels of complexity for electronically available data
Availability in a complex logical structure
3.9 Available Data with Different Meanings
Problem definition
Example
Consequences
3.10 Conclusion
Chapter 4 Data Quantity
4.1 Introduction
Quantity versus quality
Overview
4.2 Too Little or Too Much Data
Having not enough data
Having too much data
Having too many observations
Having too many variables
4.3 Dimension of Analytical Data
General
Number of observations
Number of events
Distribution of categorical values (rare classes)
The number of variables
Length of the time history
Level of detail in forecast hierarchies
Panel data sets and repeated measurement data sets
4.4 Sample Size Planning
Application of sample size planning
Sample size calculation for data mining?
4.5 Effect of Missing Values on Data Quantity
General
Problem description
Calculation
Summary
4.6 Conclusion
Chapter 5 Data Completeness
5.1 Introduction
5.2 Difference between Availability and Completeness
Availability
Completeness
Categories of missing data
Effort to get complete data
Incomplete data are not necessarily missing data
Random or systematic missing values
5.3 Random Missing Values
Definition
Handling
Consequences
Imputing random missing values
5.4 Customer Age Is Systematically Missing for Long-Term Customers
Problem definition
Systematic missing values
Example
Consequences
5.5 Completeness across Tables
Problem description
Completeness in parent-child relationships
Completeness in time series data