Data mining for associated entities
"Data mining: the practice of examining large pre-existing databases in order to generate new information."
(definition via Google)
In a not-so-recent project for an e-commerce platform, I developed a workflow in which raw data is collected and then mined to find relationships within it. The resulting information is stored in a database to be used later on.
I used the Apriori algorithm to derive relationships between single entities or combinations of entities. Once the data is associated, it can be used to generate recommendations for end users based on their activity. For instance, if a user has added bread to the shopping basket, the system can recommend milk, because milk and bread appeared together often in historical transactions and are therefore treated as associated entities.
The Apriori algorithm generates rules, called association rules, for the items that appear in transactions. Let's look at the historical transaction data below, taken from [http://archive.ics.uci.edu/ml/machine-learning-databases/00352/]:
| Transaction | Item code | Description | Unit price | Customer ID |
| 1 | 85123A | WHITE T-LIGHT HOLDER | 2.50 | 17850 |
| 2 | 71053 | WHITE METAL LANTERN | 3.90 | 17850 |
| 3 | 84406B | HEARTS COAT HANGER | 2.75 | 17850 |
| 5 | 85123A | WHITE T-LIGHT HOLDER | 2.50 | 17998 |
| 6 | 71053 | WHITE METAL LANTERN | 3.90 | 17998 |
| 7 | 84406B | HEARTS COAT HANGER | 2.75 | 18768 |
Above is a subset of the transaction data. Transactions 1, 2 and 5, 6 contain the same pair of items, purchased by two different customers (17850 and 17998). This suggests that item 85123A and item 71053 are often purchased in combination, so they can be recommended together to future customers.
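The observation above can be sketched in a few lines of Python. This is only an illustration, not the implementation used in the project: it assumes that all rows belonging to one customer form a single basket, and the names `rows`, `baskets` and `pair_counts` are made up for this example.

```python
from collections import Counter, defaultdict
from itertools import combinations

# (transaction row, item code, customer id), copied from the sample rows above
rows = [
    (1, "85123A", 17850),
    (2, "71053", 17850),
    (3, "84406B", 17850),
    (5, "85123A", 17998),
    (6, "71053", 17998),
    (7, "84406B", 18768),
]

# Assumption: every row of the same customer belongs to one basket
baskets = defaultdict(set)
for _, item, customer in rows:
    baskets[customer].add(item)

# Count in how many baskets each pair of items appears together
pair_counts = Counter()
for items in baskets.values():
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(1))  # prints [(('71053', '85123A'), 2)]
```

The pair (71053, 85123A) shows up in two baskets, while every other pair shows up in only one.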
A sample data set of association rules:
The last row means that if items 85123A and 71053 are added to the basket, it is a good idea to recommend ITEM_XYZ to the customer.
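To show how such a rule could be used at recommendation time, here is a minimal sketch. The `rules` dictionary and the `recommend` function are hypothetical; the only rule in it is the one described above, and a real system would load its rules from the database.

```python
# Hypothetical rule store: item set in the basket -> item to recommend
rules = {
    frozenset({"85123A", "71053"}): "ITEM_XYZ",
}


def recommend(basket, rules):
    """Return the items whose rule is fully satisfied by the basket."""
    basket = set(basket)
    return sorted(
        consequent
        for antecedent, consequent in rules.items()
        # The rule fires only if every antecedent item is in the basket
        if antecedent <= basket and consequent not in basket
    )


print(recommend(["85123A", "71053"], rules))  # prints ['ITEM_XYZ']
print(recommend(["85123A"], rules))           # prints []
```

Storing the antecedent as a `frozenset` makes the lookup independent of the order in which items were added to the basket.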
Historical transaction data can be huge, and generating association rules from it may require batch processing.
An Apriori implementation lets the organization choose thresholds for what to count in the transaction data. The first is the number of transactions an item must appear in to be included in the candidate dataset, called support. The second is how often the items must appear in combination to generate a recommendation, called confidence (usually measured as the share of transactions containing one item set that also contain the other).
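To make the two thresholds concrete, here is a rough sketch that computes support and confidence using their usual textbook definitions as fractions. It is not a full Apriori implementation, and the three baskets are an assumption: the sample rows above grouped by customer.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    base = support(antecedent, transactions)
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base if base else 0.0


# Assumed baskets: the sample rows grouped by customer
transactions = [
    {"85123A", "71053", "84406B"},
    {"85123A", "71053"},
    {"84406B"},
]

print(round(support({"85123A", "71053"}, transactions), 2))       # prints 0.67
print(round(confidence({"85123A"}, {"71053"}, transactions), 2))  # prints 1.0
print(round(confidence({"84406B"}, {"71053"}, transactions), 2))  # prints 0.5
```

With a minimum confidence of, say, 0.8, the rule 85123A to 71053 would be kept and the rule 84406B to 71053 would be dropped.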
In the next post I will try to explain an implementation of the Apriori algorithm.
workflow reference: [http://ferozkhan.org/workflow-with-airflow]