Tree-based methods divide the predictor space, that is, the set of possible values for X1, X2, …, Xp, into J distinct and non-overlapping regions, R1, R2, …, RJ. In theory, the regions could have any shape. However, we choose to divide the predictor space into high-dimensional rectangles, or boxes, for simplicity and for ease of interpretation of the resulting predictive model.
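The goal is to find boxes R1, R2, …, RJ that minimize the residual sum of squares (RSS), given by

$$\mathrm{RSS} = \sum_{j=1}^{J} \sum_{i \in R_j} \left( y_i - \hat{y}_{R_j} \right)^2,$$

where $\hat{y}_{R_j}$ is the mean response of the training observations in the jth box.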
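To make the objective concrete, here is a minimal sketch that computes the RSS of a given partition: for each region, it sums the squared deviations of the responses from that region's mean response. The function name `partition_rss`, the `region_ids` labelling, and the toy data are illustrative assumptions, not part of the original text.

```python
def partition_rss(y, region_ids):
    """RSS of a partition: squared deviations from each region's mean, summed over regions."""
    # Group the responses by the (hypothetical) region label of each observation.
    regions = {}
    for yi, r in zip(y, region_ids):
        regions.setdefault(r, []).append(yi)

    total = 0.0
    for values in regions.values():
        mean = sum(values) / len(values)
        total += sum((v - mean) ** 2 for v in values)
    return total


# Toy responses (assumed data): two clearly separated groups.
y = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

one_region = partition_rss(y, [0, 0, 0, 0, 0, 0])   # everything in a single box
two_regions = partition_rss(y, [0, 0, 0, 1, 1, 1])  # one box per group

print(one_region, two_regions)
```

Putting the two groups in separate boxes lowers the RSS sharply, which is exactly what the search for good boxes tries to achieve.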
Unfortunately, it is computationally infeasible to consider every possible partition of the feature space into J boxes. For this reason, we take a top-down, greedy approach that is known as recursive binary splitting. The approach is top-down because it begins at the top of the tree and then successively splits the predictor space; each split is indicated via two new branches further down on the tree.
It is greedy because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.
We first select the predictor Xj and the cutpoint s such that splitting the predictor space into the regions {X|Xj < s} and {X|Xj ≥ s} leads to the greatest possible reduction in RSS.
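A single greedy step can be sketched as an exhaustive search over every predictor and every candidate cutpoint. This is only an illustration under stated assumptions: the helper names `rss` and `best_split` are hypothetical, candidate cutpoints are taken as midpoints between consecutive distinct values (a common convention, not something the text specifies), and the data are made up.

```python
def rss(values):
    """Sum of squared deviations from the mean (0 for an empty region)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y):
    """Return (j, s, total_rss) for the split {Xj < s} / {Xj >= s} with the lowest RSS."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Assumed convention: try midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] < s]
            right = [yi for row, yi in zip(X, y) if row[j] >= s]
            total = rss(left) + rss(right)
            if best is None or total < best[2]:
                best = (j, s, total)
    return best


# Toy data (assumed): the first predictor separates the responses, the second is noise.
X = [[1, 5], [2, 3], [3, 8], [10, 4], [11, 7], [12, 6]]
y = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

j, s, total = best_split(X, y)
print(j, s, total)
```

The search picks the first predictor with a cutpoint between the two groups, because no other single split leaves a smaller total RSS.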
Next, we repeat the process, looking for the best predictor and best cutpoint in order to split the data further so as to minimize the RSS within each of the resulting regions.
However, this time, instead of splitting the entire predictor space, we split one of the two previously identified regions. We now have three regions. Again, we look to split one of these three regions further, so as to minimize the RSS. The process continues until a stopping criterion is reached; for instance, we may continue until no region contains more than five observations.
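The whole procedure can be sketched as a short recursive function. This is a simplified illustration, not a reference implementation: it splits every region independently until the stopping rule is met, the names `grow_tree`, `predict`, `leaf_sizes` and the `max_leaf_size` parameter are hypothetical, and the data are made up.

```python
def rss(values):
    """Sum of squared deviations from the mean (0 for an empty region)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y):
    """Return (j, s) minimizing the RSS of {Xj < s} / {Xj >= s}, or None if no split exists."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] < s]
            right = [yi for row, yi in zip(X, y) if row[j] >= s]
            total = rss(left) + rss(right)
            if best is None or total < best[2]:
                best = (j, s, total)
    return None if best is None else best[:2]


def grow_tree(X, y, max_leaf_size=5):
    """Split recursively until no region holds more than max_leaf_size observations."""
    split = best_split(X, y) if len(y) > max_leaf_size else None
    if split is None:
        # Leaf: predict the mean response of the region. Also reached when all
        # predictor values in the region are identical and no split is possible.
        return {"prediction": sum(y) / len(y), "n": len(y)}
    j, s = split
    left = [(row, yi) for row, yi in zip(X, y) if row[j] < s]
    right = [(row, yi) for row, yi in zip(X, y) if row[j] >= s]
    return {
        "feature": j,
        "cutpoint": s,
        "left": grow_tree([r for r, _ in left], [v for _, v in left], max_leaf_size),
        "right": grow_tree([r for r, _ in right], [v for _, v in right], max_leaf_size),
    }


def predict(tree, row):
    """Follow the branches down to a leaf and return its mean response."""
    while "prediction" not in tree:
        tree = tree["left"] if row[tree["feature"]] < tree["cutpoint"] else tree["right"]
    return tree["prediction"]


def leaf_sizes(tree):
    """Number of training observations in each leaf, left to right."""
    if "prediction" in tree:
        return [tree["n"]]
    return leaf_sizes(tree["left"]) + leaf_sizes(tree["right"])


# Toy data (assumed): one predictor, a step in the response at x = 6.
X = [[float(i)] for i in range(12)]
y = [0.0] * 6 + [10.0] * 6

tree = grow_tree(X, y, max_leaf_size=5)
print(tree["feature"], tree["cutpoint"], leaf_sizes(tree))
```

The first split lands on the step in the response; the later splits exist only because each half still holds more than five observations, which shows how the stopping criterion, rather than the RSS, can end up driving the final splits.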
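Example:
A split depends only on the ordering of a predictor's values, not on their magnitudes, so making an extreme predictor value even more extreme does not change which observations fall on each side of a cutpoint.
Hence, tree-based methods are relatively insensitive to outliers in the predictors. An outlier in the response, by contrast, still enters the RSS and can influence where splits are placed.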
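One way to check the claim about extreme predictor values is to find the best single split on a predictor, replace its largest value with a far more extreme one, and confirm that the same observations end up on each side. The sketch below does this on made-up data; `best_partition` is a hypothetical helper, and the check covers only outliers in the predictor, not in the response.

```python
def rss(values):
    """Sum of squared deviations from the mean (0 for an empty region)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_partition(x, y):
    """Indices of the observations in the left region of the best single split on x."""
    best_total, best_left = None, None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        s = (lo + hi) / 2
        left = frozenset(i for i, xi in enumerate(x) if xi < s)
        left_y = [y[i] for i in range(len(y)) if i in left]
        right_y = [y[i] for i in range(len(y)) if i not in left]
        total = rss(left_y) + rss(right_y)
        if best_total is None or total < best_total:
            best_total, best_left = total, left
    return best_left


# Toy data (assumed): the response jumps between the third and fourth observation.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9]

# Same data, but the largest predictor value becomes an extreme outlier.
x_outlier = x[:-1] + [6000.0]

same = best_partition(x, y) == best_partition(x_outlier, y)
print(sorted(best_partition(x, y)), same)
```

The partition is unchanged because the outlier keeps its rank among the predictor values; only the numeric position of the last candidate cutpoint moves.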
What are the data inputs and where do they come from?
What are the outputs and how are they consumed (online algorithm, a static report, etc.)?
Is this a revenue leakage (“saves us money”) or a revenue growth (“makes us money”) problem?
Use Cases By Function
Marketing
Predicting Lifetime Value (LTV)
what for: if you can predict the characteristics of high-LTV customers, this supports customer segmentation, identifies upsell opportunities and supports other marketing initiatives
usage: can be both an online algorithm and a static report showing the characteristics of high LTV customers
Wallet share estimation
working out the proportion of a customer’s spend in a category that accrues to a company allows that company to identify upsell and cross-sell opportunities
usage: can be both an online algorithm and a static report showing the characteristics of low wallet share customers
Churn
working out the characteristics of churners allows a company to make product adjustments, and an online algorithm allows them to reach out to likely churners
usage: can be both an online algorithm and a static report showing the characteristics of likely churners
Customer segmentation
If we can understand qualitatively different customer groups, then we can give them different treatments (perhaps even by different groups in the company). Answers questions like: what makes people buy, stop buying, etc.
usage: static report
Product mix
What mix of products offers the lowest churn? E.g., giving a combined policy discount for home + auto = low churn
usage: online algorithm and static report
Cross selling/Recommendation algorithms
Given a customer’s past browsing history, purchase history and other characteristics, what are they likely to want to purchase in the future?
usage: online algorithm
Up selling
Given a customer’s characteristics, what is the likelihood that they’ll upgrade in the future?
Identifying contractors who are regularly involved in poorly performing products
Design issue prediction
Predicting that a construction project is likely to have issues as early as possible
Life Sciences
Identifying biomarkers for boxed warnings on marketed products
Drug/chemical discovery & analysis
Crunching study results
Identifying negative responses (monitor social networks for early problems with drugs)
Diagnostic test development
Hardware devices
Software
Diagnostic targeting (CRM)
Predicting drug demand in different geographies for different products
Predicting prescription adherence with different approaches to reminding patients
Putative safety signals
Social media marketing on competitors, patient perceptions, KOL feedback
Image analysis or GCMS analysis in a high throughput manner
Analysis of clinical outcomes to adapt clinical trial design
COGS optimization
Leveraging molecule database with metabolic stability data to elucidate new stable structures
Hospitality/Service
Inventory management/dynamic pricing
Promos/upgrades/offers
Table management & reservations
Workforce management (also applies to lots of verticals)
Electrical grid distribution
Keep AC frequency as constant as possible
Seems like a very “online” algorithm
Manufacturing
Sensor data to look at failures
Quality management
Identifying out-of-bounds manufacturing
Visual inspection/computer vision
Optimal run speeds
Demand forecasting/inventory management
Warranty/pricing
Travel
Aircraft scheduling
Seat management, gate management
Air crew scheduling
Dynamic pricing
Customer complaint resolution (give points in exchange)
Call center stuff
Maintenance optimization
Tourism forecasting
Agriculture
Yield management (taking sensor data on soil quality, common in newer John Deere et al. truck models, and determining what seed varieties, seed spacing, etc. to use)
Mall Operators
Predicting tenants’ capacity to pay based on their sales figures and their industry
Predicting the best tenant for an open vacancy to maximise overall sales at a mall
Education
Automated essay scoring
Utilities
Optimise Distribution Network Cost Effectiveness (balance Capital & Operating Expenditure)
Predict Commodity Requirements
Other
Sentiment analysis
Loyalty programs
Sensor data
Alerting
What’s going to fail?
De-duplication
Procurement
Use Cases That Need Fleshing Out
Procurement
Negotiation & vendor selection
Are we buying from the best producer
Marketing
Direct Marketing
Response rates
Segmentations for mailings
Reactivation likelihood
RFM
Discount targeting
FinServ
Phone marketing
Generally as a follow-up to a DM or a churn predictor
Email Marketing
Offline
Call to action w/ unique promotion
Why are people responding? How do I adjust my buy (where, when, how)?
“I’m sure we are wasting half our money here, but the problem is we don’t know which ad”
Media Mix Optimization
Kantar Group and Nielsen are dominant
Hard part of this is getting to the data (good samples & response vars)
Healthcare
CRM & utilization optimization
Claims coding
Formulary determination and pricing
How do I get you to use my card for auto-pay? Paypal? etc. Unsolved.