335 pages, English, Ebooks, 2006
Publication date: 27 November 2006
EAN13: 9781629597904
Language: English
File size: 17 MB
The correct bibliographic citation for this manual is as follows: Svolba, Gerhard. 2006. Data Preparation for Analytics Using SAS. Cary, NC: SAS Institute Inc.
Data Preparation for Analytics Using SAS
Copyright © 2006, SAS Institute Inc., Cary, NC, USA
ISBN 978-1-59994-047-2 (Hardcopy)
ISBN 978-1-62959-790-4 (EPUB)
ISBN 978-1-62959-791-1 (MOBI)
ISBN 978-1-59994-336-7 (PDF)
All rights reserved. Produced in the United States of America.
For a hard-copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.
For a web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at the time you acquire this publication.
The scanning, uploading, and distribution of this book via the Internet or any other means without the permission of the publisher is illegal and punishable by law. Please purchase only authorized electronic editions and do not participate in or encourage electronic piracy of copyrighted materials. Your support of others' rights is appreciated.
U.S. Government License Rights; Restricted Rights: The Software and its documentation is commercial computer software developed at private expense and is provided with RESTRICTED RIGHTS to the United States Government. Use, duplication or disclosure of the Software by the United States Government is subject to the license terms of this Agreement pursuant to, as applicable, FAR 12.212, DFAR 227.7202-1(a), DFAR 227.7202-3(a) and DFAR 227.7202-4 and, to the extent required under U.S. federal law, the minimum restricted rights as set out in FAR 52.227-19 (DEC 2007). If FAR 52.227-19 is applicable, this provision serves as notice under clause (c) thereof and no other notice is required to be affixed to the Software or documentation. The Government's rights in Software and documentation shall be only those set forth in this Agreement.
SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513-2414.
November 2009
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
To my family
Martina, Matthias, Jakob, and Clemens, for being the most valuable support I can imagine.
Table of Contents
Preface
Part 1 Data Preparation: Business Point of View
Chapter 1 Analytic Business Questions
1.1 Introduction
1.2 The Term Business Question
1.3 Examples of Analytic Business Questions
1.4 The Analysis Process
1.5 Challenging an Analytic Business Question
1.6 Business Point of View Needed
Chapter 2 Characteristics of Analytic Business Questions
2.1 Introduction
2.2 Analysis Complexity: Real Analytic or Reporting?
2.3 Analysis Paradigm: Statistics or Data Mining?
2.4 Data Preparation Paradigm: As Much Data As Possible or Business Knowledge First?
2.5 Analysis Method: Supervised or Unsupervised?
2.6 Scoring Needed: Yes/No?
2.7 Periodicity of Analysis: One-Shot Analysis or Re-run Analysis?
2.8 Need for Historic Data: Yes/No?
2.9 Data Structure: One-Row-per-Subject or Multiple-Rows-per-Subject?
2.10 Complexity of the Analysis Team
2.11 Conclusion
Chapter 3 Characteristics of Data Sources
3.1 Introduction
3.2 Operational or Dispositive Data Systems?
3.3 Data Requirement: Periodic Availability
3.4 Wording: Analysis Table or Analytic Data Mart?
3.5 Quality of Data Sources for Analytics
Chapter 4 Different Points of View on Analytic Data Preparation
4.1 Introduction
4.2 Simon, Daniele and Elias: Three Different Roles in the Analysis Process
4.3 Simon: The Business Analyst
4.4 Daniele: The Quantitative Expert
4.5 Elias: The IT and Data Expert
4.6 Who Is Right?
4.7 The Optimal Triangle
Part 2 Data Structures and Data Modeling
Chapter 5 The Origin of Data
5.1 Introduction
5.2 Data Origin from a Technical Point of View
5.3 Application Layer and Data Layer
5.4 Simple Text Files or Spreadsheets
5.5 Relational Database Systems
5.6 Enterprise Resource Planning Systems
5.7 Hierarchical Databases
5.8 Large Text Files
5.9 Where Should Data Be Accessed From?
Chapter 6 Data Models
6.1 Introduction
6.2 Relational Model and Entity Relationship Diagrams
6.3 Logical versus Physical Data Model
6.4 Star Schema
6.5 Normalization and De-normalization
Chapter 7 Analysis Subjects and Multiple Observations
7.1 Introduction
7.2 Analysis Subject
7.3 Multiple Observations
7.4 Data Mart Structures
7.5 No Analysis Subject Available?
Chapter 8 The One-Row-per-Subject Data Mart
8.1 Introduction
8.2 The One-Row-per-Subject Paradigm
8.3 The Technical Point of View
8.4 The Business Point of View: Transposing or Aggregating Original Data
8.5 Hierarchies: Aggregating Up and Copying Down
8.6 Conclusion
Chapter 9 The Multiple-Rows-per-Subject Data Mart
9.1 Introduction
9.2 Using Multiple-Rows-per-Subject Data Marts
9.3 Types of Multiple-Rows-per-Subject Data Marts
9.4 Multiple Observations per Time Period
9.5 Relationship to Other Data Mart Structures
Chapter 10 Data Structures for Longitudinal Analysis
10.1 Introduction
10.2 Data Relationships in Longitudinal Cases
10.3 Transactional Data, Finest Granularity, and Most Appropriate Aggregation Level
10.4 Data Mart Structures for Longitudinal Data Marts
Chapter 11 Considerations for Data Marts
11.1 Introduction
11.2 Types and Roles of Variables in a Data Mart
11.3 Derived Variables
11.4 Variable Criteria
Chapter 12 Considerations for Predictive Modeling
12.1 Introduction
12.2 Target Windows and Observation Windows
12.3 Multiple Target Windows
12.4 Overfitting
Part 3 Data Mart Coding and Content
Chapter 13 Accessing Data
13.1 Introduction
13.2 Accessing Data from Relational Databases Using SAS/ACCESS Modules
13.3 Accessing Data from Microsoft Office
13.4 Accessing Data from Text Files
13.5 Accessing Data from Hierarchical Text Files
13.6 Other Access Methods
Chapter 14 Transposing One- and Multiple-Rows-per-Subject Data Structures
14.1 Introduction
14.2 Transposing from a Multiple-Rows-per-Subject Data Set to a One-Row-per-Subject Data Set
14.3 Transposing from a One-Row-per-Subject Data Set to a Multiple-Rows-per-Subject Data Set
14.4 Transposing a Transactional Table with Categorical Entries
14.5 Creating Key-Value Tables
Chapter 15 Transposing Longitudinal Data
15.1 Introduction
15.2 Standard Scenarios
15.3 Complex Scenarios
Chapter 16 Transformations of Interval-Scaled Variables
16.1 Introduction
16.2 Simple Derived Variables
16.3 Derived Relative Variables
16.4 Time Intervals
16.5 Binning Observations into Groups
16.6 Transformations of Distributions
16.7 Replacing Missing Values
16.8 Conclusion
Chapter 17 Transformations of Categorical Variables
17.1 Introduction
17.2 General Considerations for Categorical Variables
17.3 Derived Variables
17.4 Combining Categories
17.5 Dummy Coding of Categorical Variables
17.6 Multidimensional Categorical Variables
17.7 Lookup Tables and External Data
Chapter 18 Multiple Interval-Scaled Observations per Subject
18.1 Introduction
18.2 Static Aggregation
18.3 Correlation of Values
18.4 Concentration of Values
18.5 Course over Time: Standardization of Values
18.6 Course over Time: Derived Variables
Chapter 19 Multiple Categorical Observations per Subject
19.1 Introduction
19.2 Absolute and Relative Frequencies of Categories
19.3 Concatenating Absolute and Relative Frequencies
19.4 Calculating Total and Distinct Counts of the Categories
19.5 Using ODS to Create Different Percent Variables
19.6 Business Interpretation of Percentage Variables
19.7 Other Methods
Chapter 20 Coding for Predictive Modeling
20.1 Introduction
20.2 Proportions or Means of the Target Variable
20.3 Interval Variables and Predictive Modeling
20.4 Validation Methods
20.5 Conclusion
Chapter 21 Data Preparation for Multiple-Rows-per-Subject and Longitudinal Data Marts
21.1 Introduction
21.2 Data Preparation for Association and Sequence Analysis
21.3 Enhancing Time Series Data
21.4 Aggregating at Various Hierarchical Levels
21.5 Preparing Time Series Data with SAS Functions
21.6 Using SAS/ETS Procedures for Data Preparation
Part 4 Sampling, Scoring, and Automation
Chapter 22 Sampling
22.1 Introduction
22.2 Sampling Methods
22.3 Simple Sampling and Reaching the Exact Sample Count or Proportion
22.4 Oversampling
22.5 Clustered Sampling
22.6 Conclusion
Chapter 23 Scoring and Automation
23.1 Introduction
23.2 Scoring Process
23.3 Explicitly Calculating the Score Values from Parameters and Input Variables
23.4 Using the Respective SAS/STAT Procedure for Scoring
23.5 Scoring with PROC SCORE of SAS/STAT
23.6 Us