Introduction

Marketing data matters because it helps executives understand the demographics of ticket buyers in the area around a venue. By studying the trends in a given dataset, an analyst can make well-grounded recommendations that help the marketing team advertise more effectively.

This study uses data from a local performing arts venue in Cleveland, Ohio. Although ticket buyers come from many US states and from Canada, this study focuses only on tickets purchased within the state of Ohio.

The goal of this case study is to understand which events are closely associated, based on each event’s ticket transactions. To achieve this goal, the report applies association rules mining, namely the Apriori and Eclat algorithms.

Import data

# Treat empty and blank cells as missing values.
# (Note: on R >= 4.0, add stringsAsFactors = TRUE to reproduce the factor
# summaries shown below; earlier R versions used that behavior by default.)
df <- read.csv('main.csv', na.strings=c(""," ","NA"))
summary(df)
##     Event.ID              Purchase.Date    Customer.ID  
##  Min.   : 1.00   10/21/2019 13:11:  177   Min.   :   1  
##  1st Qu.:10.00   8/27/2019 14:24 :  102   1st Qu.:1318  
##  Median :18.00   8/27/2019 15:06 :  102   Median :2931  
##  Mean   :22.93   8/27/2019 14:27 :  101   Mean   :2689  
##  3rd Qu.:38.00   8/27/2019 15:04 :  101   3rd Qu.:3751  
##  Max.   :44.00   10/21/2019 13:06:  100   Max.   :5262  
##                  (Other)         :21669                 
##                  City      State       Postal.Code    Sale.Location
##  Cleveland         :9352   OH:22352   Min.   :43004   POS: 6535    
##  Beachwood         :1375              1st Qu.:44106   Web:15817    
##  Cleveland Heights :1247              Median :44118                
##  Shaker Heights    :1179              Mean   :44125                
##  University Heights: 835              3rd Qu.:44122                
##  Chagrin Falls     : 529              Max.   :45891                
##  (Other)           :7835                                           
##   Delivery.Method 
##  E-Ticket :17353  
##  ID Scan  :    6  
##  Will Call: 4993  
##                   
##                   
##                   
## 

About the dataset

The imported dataset, saved in the RStudio environment as the dataframe df, has 22352 rows and 8 columns. Each row represents a ticket purchased between summer 2019 and summer 2020, and each column represents a feature of that ticket: which event the ticket was for, the date of purchase, the customer (encoded as Customer.ID), the customer’s address, the sale location, and the delivery method.

Although Event.ID, Customer.ID, and Postal.Code are read in as numbers, they are categorical variables, so it is important to change their data types from numeric to factor for grouping purposes.

df$Event.ID <- factor(df$Event.ID)
df$Customer.ID <- factor(df$Customer.ID)
df$Postal.Code <- factor(df$Postal.Code)
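A quick sanity check confirms the conversions took effect; all three columns should now report class "factor":

# Verify the new column types (base R, no extra packages needed).
sapply(df[c('Event.ID', 'Customer.ID', 'Postal.Code')], class)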

Before conducting any further analysis, I created a heatmap (a choropleth) of ticket sales across Ohio, divided by ZIP code.

# install.packages("devtools")
# install.packages('choroplethr')
# install_github('arilamstein/choroplethrZip@v1.5.0')
library(tidyverse)
library(devtools)
library(choroplethr)
library(choroplethrZip)
library(ggplot2)
map_df <- as.data.frame(table(df$Postal.Code))
colnames(map_df) <- c("region", "value")
zip_choropleth(map_df, num_colors = 1,
               state_zoom = "ohio",
               title = "Ticket Sales Heatmap - Ohio") + 
  coord_map() + 
  scale_fill_distiller(name="Tickets", palette="Reds", trans = "reverse", na.value = "Grey")

Method - Association Rules

Apriori

Proposed by Agrawal and Srikant in 1994, the Apriori algorithm is designed to operate on databases of transactions (originally, collections of items bought by customers) and to count item sets efficiently with the goal of finding related items. Like other association rule algorithms, Apriori finds all item sets whose support exceeds a set minimum support and all rules whose confidence exceeds a set minimum confidence. The lift of a rule is the ratio of its observed support to the support expected if the antecedent and consequent were independent. The figure below, taken from Dr. Saed Sayad’s website, illustrates these measures.
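For reference, with N total transactions: support(X => Y) is the fraction of transactions containing both X and Y; confidence(X => Y) is support(X and Y) divided by support(X); and lift(X => Y) is confidence(X => Y) divided by support(Y). The following minimal R sketch illustrates the three measures on a hypothetical toy set of transactions (not data from this study):

# Toy transactions (hypothetical): four baskets over items A, B, C.
transactions <- list(c("A", "B"), c("A", "B", "C"), c("A"), c("B", "C"))

# Proportion of transactions containing every item in 'items'.
supp <- function(items) mean(sapply(transactions, function(t) all(items %in% t)))

supp_AB <- supp(c("A", "B"))   # support of {A} => {B}: 2/4 = 0.50
conf_AB <- supp_AB / supp("A") # confidence: 0.50 / 0.75 = 0.667
lift_AB <- conf_AB / supp("B") # lift: 0.667 / 0.75 = 0.889 (< 1: negative association)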

Eclat

The Eclat algorithm, short for Equivalence Class Clustering and bottom-up Lattice Traversal, is a more efficient and scalable alternative derived from the Apriori algorithm. The main difference between the two is that, while Apriori performs a Breadth-First Search (BFS) over the itemset lattice, Eclat performs a Depth-First Search (DFS) over a vertical representation of the database. It is this vertical, depth-first approach that lets Eclat execute faster than Apriori. In addition, Eclat differs from Apriori in that it mines frequent itemsets using support alone, with no confidence threshold.
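To make the “vertical” idea concrete, the sketch below (hypothetical data, mirroring the toy example above) stores each item with its tid-list, the set of transaction IDs in which it appears; the support count of an itemset is simply the size of the intersection of its members’ tid-lists, which is what Eclat computes as it descends the lattice.

# Vertical (tid-list) representation: item -> IDs of transactions containing it.
tidlists <- list(A = c(1, 2, 3), B = c(1, 2, 4), C = c(2, 4))

# Support count of {A, B} = size of the intersection of the two tid-lists.
support_count_AB <- length(intersect(tidlists$A, tidlists$B))  # 2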

Manipulate the dataset

The dataframe ‘df’ is not in a suitable format for the Apriori algorithm. To fix this, I first created a new dataframe ardf that records, for each customer, every show that customer bought ticket(s) for. This transformation puts the data into the one-row-per-customer shape that the algorithm expects.

# One row per customer: the first column holds Customer.ID, the remaining
# 44 columns hold the Event.IDs of the shows that customer bought tickets for.
n_events    <- length(unique(df$Event.ID))      # 44
n_customers <- length(unique(df$Customer.ID))   # 5262
ardf <- data.frame(matrix(NA, ncol = n_events + 1, nrow = n_customers))
ardf$X1 <- sort(unique(df$Customer.ID))
colnames(ardf)[1] <- 'Customer.ID'

for (i in 1:n_customers){
  events <- sort(unique(df[df$Customer.ID == i, 'Event.ID']))
  events <- c(events, rep('', n_events - length(events)))  # pad row to fixed width
  ardf[i,] <- c(i, events)
}

head(ardf, 10)
##    Customer.ID X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19
## 1            1 13                                                             
## 2            2  4                                                             
## 3            3  2                                                             
## 4            4 41                                                             
## 5            5 10 13                                                          
## 6            6 11                                                             
## 7            7  6 10 25 32 35 44                                              
## 8            8  2                                                             
## 9            9  5  7  9 13 21 39 42                                           
## 10          10 28                                                             
##    X20 X21 X22 X23 X24 X25 X26 X27 X28 X29 X30 X31 X32 X33 X34 X35 X36 X37 X38
## 1                                                                             
## 2                                                                             
## 3                                                                             
## 4                                                                             
## 5                                                                             
## 6                                                                             
## 7                                                                             
## 8                                                                             
## 9                                                                             
## 10                                                                            
##    X39 X40 X41 X42 X43 X44 X45
## 1                             
## 2                             
## 3                             
## 4                             
## 5                             
## 6                             
## 7                             
## 8                             
## 9                             
## 10
# Export the per-customer event lists (without the Customer.ID column) as a CSV file
write.table(ardf[,-1], 'list.csv', sep = ',', col.names = FALSE, row.names = FALSE)
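As an aside, the same transactions object can be built without the intermediate CSV file. The sketch below (an untested alternative, not the route taken in this report) uses arules’ coercion from a list of per-customer item vectors:

# Alternative sketch: one character vector of Event.IDs per customer,
# coerced directly into an arules 'transactions' object.
library(arules)
event_lists <- lapply(split(df$Event.ID, df$Customer.ID),
                      function(x) unique(as.character(x)))
dataset_alt <- as(event_lists, "transactions")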

Packages ‘arules’ and ‘arulesViz’

Package ‘arules’ provides the infrastructure for the frequent itemset and association rule mining used in this report. Package ‘arulesViz’ provides various plotting techniques that are compatible with package ‘arules’.

# install.packages('arules')
# install.packages('arulesViz')
library(arules)
library(arulesViz)
# Read each CSV row as one transaction; rm.duplicates drops items repeated
# within a single transaction.
dataset <- read.transactions('list.csv', sep = ',', rm.duplicates = TRUE)
summary(dataset)
## transactions as itemMatrix in sparse format with
##  5262 rows (elements/itemsets/transactions) and
##  44 columns (items) and a density of 0.0379954 
## 
## most frequent items:
##      13      41      40      43       2 (Other) 
##     679     560     528     434     409    6187 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   14   15   16   20 
## 3926  645  202  102   55  244   35   14   10   11    3    7    1    2    2    1 
##   23   35 
##    1    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.672   2.000  35.000 
## 
## includes extended item information - examples:
##   labels
## 1      1
## 2     10
## 3     11

We can then plot the purchase frequencies of all 44 events:

itemFrequencyPlot(dataset, topN = 44)
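To read the exact counts behind the plot, itemFrequency() returns the same information numerically:

# Absolute ticket counts per event, highest first (matches the plot above).
head(sort(itemFrequency(dataset, type = "absolute"), decreasing = TRUE), 5)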

Training and Visualizing Apriori Rule - Training 1

Counting each customer as a ‘transaction’, following the original example in the ‘arules’ vignette, there is a total of 5262 transactions. In this report, I ran each association rule algorithm twice: once with higher support and confidence thresholds, and once with lower ones. This made it possible to capture both the groups of events that are most closely related and their weaker connections to other shows.

In this training, to set the support threshold, I targeted any itemset appearing in at least 200 of the 5262 transactions. Therefore, the minimum support = 200/5262 = 0.03800836. I also set the initial minimum confidence at 80% (or 0.8).
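Rather than hard-coding the ratio, the threshold can also be computed directly from the transactions object (a small sketch):

# Minimum support corresponding to at least 200 of the 5262 transactions.
min_support <- 200 / length(dataset)  # = 0.03800836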

I can now run the Apriori algorithm on the data:

apriori_rule <- apriori(data = dataset, parameter = list(support = 0.03800836, confidence = 0.8, minlen = 2, maxlen = 44))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime    support minlen
##         0.8    0.1    1 none FALSE            TRUE       5 0.03800836      2
##  maxlen target  ext
##      44  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 199 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[44 item(s), 5262 transaction(s)] done [0.00s].
## sorting and recoding items ... [20 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.00s].
## writing ... [171 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Top 10 associations (out of 171 rules) sorted by ‘lift’ (Apriori Rule):

inspect(head(sort(apriori_rule, by = 'lift'), 10))
##      lhs                rhs  support    confidence coverage   lift     count
## [1]  {25,44,6}       => {32} 0.04827062 1.0000000  0.04827062 19.27473 254  
## [2]  {10,25,44}      => {32} 0.04808058 1.0000000  0.04808058 19.27473 253  
## [3]  {25,35,44}      => {32} 0.04827062 1.0000000  0.04827062 19.27473 254  
## [4]  {10,25,44,6}    => {32} 0.04808058 1.0000000  0.04808058 19.27473 253  
## [5]  {25,35,44,6}    => {32} 0.04827062 1.0000000  0.04827062 19.27473 254  
## [6]  {10,25,35,44}   => {32} 0.04808058 1.0000000  0.04808058 19.27473 253  
## [7]  {10,35,44,6}    => {32} 0.04827062 1.0000000  0.04827062 19.27473 254  
## [8]  {10,25,35,44,6} => {32} 0.04808058 1.0000000  0.04808058 19.27473 253  
## [9]  {35,44,6}       => {32} 0.04865070 0.9961089  0.04884074 19.19973 256  
## [10] {25,35}         => {32} 0.04827062 0.9960784  0.04846066 19.19914 254
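Many of these top rules are variations of one another: different subsets of the same six events all implying event #32. For a more compact list, arules can drop redundant rules (a sketch; a rule is redundant if a more general rule with at least the same confidence exists):

# Keep only non-redundant rules before sorting by lift.
pruned <- apriori_rule[!is.redundant(apriori_rule)]
inspect(head(sort(pruned, by = 'lift'), 10))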

Grouped matrix visualization:

plot(apriori_rule, method = 'grouped')

plotly_arules(apriori_rule)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
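(Aside: in newer releases of arulesViz, plotly_arules() is deprecated; the same interactive scatter plot is produced by plot(apriori_rule, engine = 'plotly').)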

Training and Visualizing Apriori Rule - Training 2

In this training, I widened the net so that the Apriori algorithm considered any itemset appearing in at least 50 transactions, down from the Training 1 minimum of 200. Therefore, the current minimum support = 50/5262 = 0.00950209. In addition, I lowered the minimum confidence to 60% (or 0.6).

I can now run the Apriori algorithm on the data:

apriori_rule <- apriori(data = dataset, parameter = list(support = 0.00950209, confidence = 0.6, minlen = 2, maxlen = 44))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime    support minlen
##         0.6    0.1    1 none FALSE            TRUE       5 0.00950209      2
##  maxlen target  ext
##      44  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 49 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[44 item(s), 5262 transaction(s)] done [0.00s].
## sorting and recoding items ... [31 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.00s].
## writing ... [195 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Top 10 associations (out of 195 rules) sorted by ‘lift’ (Apriori Rule):

inspect(head(sort(apriori_rule, by = 'lift'), 10))
##      lhs             rhs  support    confidence coverage   lift     count
## [1]  {14,28}      => {33} 0.00950209 0.8620690  0.01102242 37.48931  50  
## [2]  {28,8}       => {33} 0.00950209 0.7692308  0.01235272 33.45200  50  
## [3]  {14,33}      => {28} 0.00950209 0.8928571  0.01064234 29.18146  50  
## [4]  {33,8}       => {28} 0.00950209 0.8771930  0.01083238 28.66950  50  
## [5]  {33}         => {28} 0.01463322 0.6363636  0.02299506 20.79842  77  
## [6]  {25,44,6}    => {32} 0.04827062 1.0000000  0.04827062 19.27473 254  
## [7]  {10,25,44}   => {32} 0.04808058 1.0000000  0.04808058 19.27473 253  
## [8]  {25,35,44}   => {32} 0.04827062 1.0000000  0.04827062 19.27473 254  
## [9]  {10,25,44,6} => {32} 0.04808058 1.0000000  0.04808058 19.27473 253  
## [10] {25,35,44,6} => {32} 0.04827062 1.0000000  0.04827062 19.27473 254

Grouped matrix visualization:

plot(apriori_rule, method = 'grouped')

plotly_arules(apriori_rule)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

Training and Visualizing Eclat Rule - Training 1

In this training, I set the minimum support to 0.03800836, the same value as in Apriori Training 1. I did so to allow an easy comparison between this training and Apriori Training 1.

eclat_rule <- eclat(data = dataset, parameter = list(support = 0.03800836, minlen = 2, maxlen = 44))
## Eclat
## 
## parameter specification:
##  tidLists    support minlen maxlen            target  ext
##     FALSE 0.03800836      2     44 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 199 
## 
## create itemset ... 
## set transactions ...[44 item(s), 5262 transaction(s)] done [0.00s].
## sorting and recoding items ... [20 item(s)] done [0.00s].
## creating sparse bit matrix ... [20 row(s), 5262 column(s)] done [0.00s].
## writing  ... [57 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].

Top 10 associations (out of 57 sets) sorted by support (Eclat Rule):

inspect(head(sort(eclat_rule, by = "support"), 10))
##      items      support    transIdenticalToItemsets count
## [1]  {35,44}    0.05112125 269                      269  
## [2]  {10,6}     0.05036108 265                      265  
## [3]  {10,35}    0.05036108 265                      265  
## [4]  {10,44}    0.05036108 265                      265  
## [5]  {44,6}     0.04998100 263                      263  
## [6]  {35,6}     0.04941087 260                      260  
## [7]  {10,32}    0.04922083 259                      259  
## [8]  {32,6}     0.04903079 258                      258  
## [9]  {25,44}    0.04903079 258                      258  
## [10] {10,35,44} 0.04903079 258                      258
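Unlike Apriori, Eclat returns frequent itemsets rather than rules. If rules are wanted for a direct comparison with Apriori Training 1, arules can induce them from the itemsets; a sketch:

# Induce association rules from Eclat's frequent itemsets, reusing the
# 0.8 confidence threshold from Apriori Training 1.
eclat_rules <- ruleInduction(eclat_rule, dataset, confidence = 0.8)
inspect(head(sort(eclat_rules, by = 'lift'), 5))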

Graph visualization:

plot(eclat_rule, method = "graph")

Training and Visualizing Eclat Rule - Training 2

As in Eclat Training 1, I set this training’s minimum support to 0.00950209, matching the value from Apriori Training 2.

eclat_rule <- eclat(data = dataset, parameter = list(support = 0.00950209, minlen = 2, maxlen = 44))
## Eclat
## 
## parameter specification:
##  tidLists    support minlen maxlen            target  ext
##     FALSE 0.00950209      2     44 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 49 
## 
## create itemset ... 
## set transactions ...[44 item(s), 5262 transaction(s)] done [0.00s].
## sorting and recoding items ... [31 item(s)] done [0.00s].
## creating sparse bit matrix ... [31 row(s), 5262 column(s)] done [0.00s].
## writing  ... [73 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].

Top 10 associations (out of 73 sets) sorted by support (Eclat Rule):

inspect(head(sort(eclat_rule, by = "support"), 10))
##      items      support    transIdenticalToItemsets count
## [1]  {35,44}    0.05112125 269                      269  
## [2]  {10,6}     0.05036108 265                      265  
## [3]  {10,35}    0.05036108 265                      265  
## [4]  {10,44}    0.05036108 265                      265  
## [5]  {44,6}     0.04998100 263                      263  
## [6]  {35,6}     0.04941087 260                      260  
## [7]  {10,32}    0.04922083 259                      259  
## [8]  {32,6}     0.04903079 258                      258  
## [9]  {25,44}    0.04903079 258                      258  
## [10] {10,35,44} 0.04903079 258                      258

Graph visualization:

plot(eclat_rule, method = "graph")

Discussion

Each of the four trainings provides information that builds on the others:

Apriori Training 1 established close relationships between events #6, #10, #25, #32, #35, and #44; each event in this group of six appears as the consequent of a set of others within the group. Specifically, events #25 and #32 appear to be the top two consequents by lift value.

After I lowered the support and confidence in Apriori Training 2, a new group consisting of events #8, #14, #28, and #33 emerged with even higher lift than the group identified in Training 1. These high lift values imply that, even though the events in this group do not have the highest support and confidence values, their tickets are bought together far more often than chance would predict, even more so than the events identified in Training 1.

Similarly to the grouped matrix from Apriori Training 1, the connected graph from Eclat Training 1 also shows a close relationship between events #6, #10, #25, #32, #35, and #44, further strengthening the conclusion that these events are very similar in type, if not the same.

Nonetheless, the data and graph generated by Eclat Training 2 provided the most information. First, the graph confirms the tight relationship among the six events of group 1, which cluster together. Second, the group of events #8, #14, #28, and #33 is related both pairwise and in triples. Third, new relationships between events #2, #9, and #39 and event #13 were discovered in this training. Lastly, events #41 and #42 are not only related to each other but are also related to event #13 as a pair. These findings are summarized and validated below, and can be spot-checked with the sketch that follows.
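These memberships can be double-checked by filtering the rule set. For example, the sketch below pulls every Training 2 rule whose consequent is event #13 (item labels are character strings):

# Inspect all rules that conclude in event 13.
inspect(subset(apriori_rule, subset = rhs %in% "13"))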

From all four trainings, there are four distinct groups of closely related events:

Group 1: events #6, #10, #25, #32, #35, and #44
Group 2: events #8, #14, #28, and #33
Group 3: events #2, #9, #13, and #39
Group 4: events #13, #41, and #42

Cross-checking these Event.IDs against the nature of each event, I can see a common theme within each group of events:

The Apriori algorithm is the better algorithm for bigger datasets. For the original dataset, however, the Eclat algorithm is more straightforward, yields more insight, and takes significantly less time to execute because it performs fewer calculations. The graph generated in Eclat Training 2 is especially intuitive because it shows clearly which groups of events are frequently bought together.
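The runtime claim is easy to check on this dataset; a sketch (timings vary by machine):

# Compare wall-clock time of the two algorithms at the Training 1 threshold.
system.time(apriori(dataset, parameter = list(support = 0.03800836, confidence = 0.8)))
system.time(eclat(dataset, parameter = list(support = 0.03800836)))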

On top of this result, events such as #1, #17, and #18 also feature works by classical instrumental artists, yet their relationship with the events of group 3 is not significant enough to appear in the rules. This leads me to believe that better-targeted advertising could encourage the classical music listeners who attended group 3 events to purchase tickets for events #1, #17, and #18 as well.

Acknowledgements

Great thanks to the developers of packages ‘arules’ and ‘arulesViz’ for providing the implementations of the Apriori and Eclat algorithms used in this report.

Also special thanks to Alia Basar for her help in compiling this case study.