DATA MINING FOR BUSINESS ANALYTICS

Concepts, Techniques, and Applications in R

Galit Shmueli

Peter C. Bruce

Inbal Yahav

Nitin R. Patel

Kenneth C. Lichtendahl, Jr.

 

 

This edition first published 2018

© 2018 John Wiley & Sons, Inc.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl Jr. to be identified as the authors of this work has been asserted in accordance with law.

Registered Offices John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

Editorial Office 111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty The publisher and the authors make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of fitness for a particular purpose. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for every situation. In view of on-going research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. The fact that an organization or website is referred to in this work as a citation and/or potential source of further information does not mean that the author or the publisher endorses the information the organization or website may provide or recommendations it may make. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. No warranty may be created or extended by any promotional statements for this work. Neither the publisher nor the author shall be liable for any damages arising herefrom.

Library of Congress Cataloging-in-Publication Data applied for

Hardback: 9781118879368

Cover Design: Wiley

Cover Image: © Achim Mittler, Frankfurt am Main/Gettyimages

Set in 11.5/14.5pt BemboStd by Aptara Inc., New Delhi, India

Printed in the United States of America.

10 9 8 7 6 5 4 3 2 1

 

 

The beginning of wisdom is this:

Get wisdom, and whatever else you get, get insight.

– Proverbs 4:7

 

 

 

Contents

Foreword by Gareth James xix

Foreword by Ravi Bapna xxi

Preface to the R Edition xxiii

Acknowledgments xxvii

PART I PRELIMINARIES

CHAPTER 1 Introduction 3

1.1 What Is Business Analytics? . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 What Is Data Mining? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.3 Data Mining and Related Terms . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.4 Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.5 Data Science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.6 Why Are There So Many Different Methods? . . . . . . . . . . . . . . . . . . . 8

1.7 Terminology and Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.8 Road Maps to This Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

Order of Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

CHAPTER 2 Overview of the Data Mining Process 15

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.2 Core Ideas in Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

Association Rules and Recommendation Systems . . . . . . . . . . . . . . . . . 16

Predictive Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

Data Reduction and Dimension Reduction . . . . . . . . . . . . . . . . . . . . 17

Data Exploration and Visualization . . . . . . . . . . . . . . . . . . . . . . . . 17

Supervised and Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . 18

2.3 The Steps in Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.4 Preliminary Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

Organization of Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

Predicting Home Values in the West Roxbury Neighborhood . . . . . . . . . . . 21

Loading and Looking at the Data in R . . . . . . . . . . . . . . . . . . . . . . 22

Sampling from a Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

Oversampling Rare Events in Classification Tasks . . . . . . . . . . . . . . . . . 25

Preprocessing and Cleaning the Data . . . . . . . . . . . . . . . . . . . . . . . 26

2.5 Predictive Power and Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . 33

Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

Creation and Use of Data Partitions . . . . . . . . . . . . . . . . . . . . . . . 35

2.6 Building a Predictive Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

Modeling Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

2.7 Using R for Data Mining on a Local Machine . . . . . . . . . . . . . . . . . . . 43

2.8 Automating Data Mining Solutions . . . . . . . . . . . . . . . . . . . . . . . . 43

Data Mining Software: The State of the Market (by Herb Edelstein) . . . . . . . . 45

Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

PART II DATA EXPLORATION AND DIMENSION REDUCTION

CHAPTER 3 Data Visualization 55

3.1 Uses of Data Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

Base R or ggplot? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

3.2 Data Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

Example 1: Boston Housing Data . . . . . . . . . . . . . . . . . . . . . . . . 57

Example 2: Ridership on Amtrak Trains . . . . . . . . . . . . . . . . . . . . . . 59

3.3 Basic Charts: Bar Charts, Line Graphs, and Scatter Plots . . . . . . . . . . . . . 59

Distribution Plots: Boxplots and Histograms . . . . . . . . . . . . . . . . . . . 61

Heatmaps: Visualizing Correlations and Missing Values . . . . . . . . . . . . . . 64

3.4 Multidimensional Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . 67

Adding Variables: Color, Size, Shape, Multiple Panels, and Animation . . . . . . . 67

Manipulations: Rescaling, Aggregation and Hierarchies, Zooming, Filtering . . . . 70

Reference: Trend Lines and Labels . . . . . . . . . . . . . . . . . . . . . . . . 74

Scaling up to Large Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

Multivariate Plot: Parallel Coordinates Plot . . . . . . . . . . . . . . . . . . . . 75

Interactive Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

3.5 Specialized Visualizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

Visualizing Networked Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

Visualizing Hierarchical Data: Treemaps . . . . . . . . . . . . . . . . . . . . . 82

Visualizing Geographical Data: Map Charts . . . . . . . . . . . . . . . . . . . . 83

3.6 Summary: Major Visualizations and Operations, by Data Mining Goal . . . . . . . 86

Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

Time Series Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

CHAPTER 4 Dimension Reduction 91

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

4.2 Curse of Dimensionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

 

 

4.3 Practical Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

Example 1: House Prices in Boston . . . . . . . . . . . . . . . . . . . . . . . 93

4.4 Data Summaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

Summary Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

Aggregation and Pivot Tables . . . . . . . . . . . . . . . . . . . . . . . . . . 96

4.5 Correlation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

4.6 Reducing the Number of Categories in Categorical Variables . . . . . . . . . . . 99

4.7 Converting a Categorical Variable to a Numerical Variable . . . . . . . . . . . . 99

4.8 Principal Components Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 101

Example 2: Breakfast Cereals . . . . . . . . . . . . . . . . . . . . . . . . . . 101

Principal Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

Normalizing the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

Using Principal Components for Classification and Prediction . . . . . . . . . . . 109

4.9 Dimension Reduction Using Regression Models . . . . . . . . . . . . . . . . . . 111

4.10 Dimension Reduction Using Classification and Regression Trees . . . . . . . . . . 111

Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

PART III PERFORMANCE EVALUATION

CHAPTER 5 Evaluating Predictive Performance 117

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

5.2 Evaluating Predictive Performance . . . . . . . . . . . . . . . . . . . . . . . . 118

Naive Benchmark: The Average . . . . . . . . . . . . . . . . . . . . . . . . . 118

Prediction Accuracy Measures . . . . . . . . . . . . . . . . . . . . . . . . . . 119

Comparing Training and Validation Performance . . . . . . . . . . . . . . . . . 121

Lift Chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

5.3 Judging Classifier Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 122

Benchmark: The Naive Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

Class Separation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

The Confusion (Classification) Matrix . . . . . . . . . . . . . . . . . . . . . . . 124

Using the Validation Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

Accuracy Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

Propensities and Cutoff for Classification . . . . . . . . . . . . . . . . . . . . . 127

Performance in Case of Unequal Importance of Classes . . . . . . . . . . . . . . 131

Asymmetric Misclassification Costs . . . . . . . . . . . . . . . . . . . . . . . . 133

Generalization to More Than Two Classes . . . . . . . . . . . . . . . . . . . . . 135

5.4 Judging Ranking Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 136

Lift Charts for Binary Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

Decile Lift Charts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

Beyond Two Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

Lift Charts Incorporating Costs and Benefits . . . . . . . . . . . . . . . . . . . 139

Lift as a Function of Cutoff . . . . . . . . . . . . . . . . . . . . . . . . . . .
