Data mining for associated entities

Data mining: the practice of examining large pre-existing databases in order to generate new information.
– by Google

In a not-so-recent project for an e-commerce platform, I developed a workflow in which raw data is collected and mined to find relationships between entities. The resulting associations are stored in a database so they can be used later.

I used the Apriori algorithm to derive relationships between single entities or combinations of entities. Once the data is associated, it can be used to generate recommendations for end users based on their activity. For instance, if a user adds bread to the shopping basket, the system can recommend milk, because milk and bread appear together in many historical transactions and are therefore treated as associated entities.

The Apriori algorithm generates rules, called association rules, for the items that appear in transactions. Let's look at the historical transaction data below, grabbed from []

TID  StockCode  Description            Unit Price  CustomerID
1    85123A     WHITE T-LIGHT HOLDER   2.50        17850
2    71053      WHITE METAL LANTERN    3.90        17850
3    84406B     HEARTS COAT HANGER     2.75        17850
4    21777      METAL HEART            7.95        17850
5    85123A     WHITE T-LIGHT HOLDER   2.50        17998
6    71053      WHITE METAL LANTERN    3.90        17998
7    84406B     HEARTS COAT HANGER     2.75        18768

The table above is a small subset of the transaction data. Rows 1, 2 and rows 5, 6 contain the same pair of items, purchased by two different customers (17850 and 17998). If that pattern repeats across the full data set, item:85123A and item:71053 are frequently purchased together, and each can be recommended to future customers who buy the other.
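The observation above can be checked mechanically. Below is a minimal sketch that groups the sample rows by customer and counts how often each pair of items lands in the same basket. Treating all rows of one CustomerID as a single basket is an assumption made for illustration (the real data may have an invoice or order ID that is a better grouping key), and the names `rows`, `baskets` and `pair_counts` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

# (TID, StockCode, CustomerID) taken from the sample table above.
rows = [
    (1, "85123A", 17850),
    (2, "71053", 17850),
    (3, "84406B", 17850),
    (4, "21777", 17850),
    (5, "85123A", 17998),
    (6, "71053", 17998),
    (7, "84406B", 18768),
]

# Assumption: every row of one customer belongs to the same basket.
baskets = defaultdict(set)
for _tid, stock_code, customer_id in rows:
    baskets[customer_id].add(stock_code)

# Count each unordered pair of items once per basket it appears in.
pair_counts = Counter()
for items in baskets.values():
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common():
    print(pair, count)
```

On this sample, the pair (71053, 85123A) is the only one that shows up in more than one basket, which matches the reading of the table.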

A sample data set of association rules:

id  item            recommend
1   85123A          71053
2   71053           85123A
3   85123A, 71053   ITEM_XYZ

The last row means that if items 85123A and 71053 are both in the basket, it is a good idea to recommend ITEM_XYZ to the customer.
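To show how such a rule table could be used at recommendation time, here is a small sketch. It assumes the rules are loaded into memory as (item set, recommended item) tuples. The `RULES` list and the `recommend` function are hypothetical names for illustration, not the code from the project.

```python
# Each rule: (items that must be in the basket, item to recommend).
# The values mirror the sample rule table above.
RULES = [
    (frozenset({"85123A"}), "71053"),
    (frozenset({"71053"}), "85123A"),
    (frozenset({"85123A", "71053"}), "ITEM_XYZ"),
]


def recommend(basket):
    """Return items suggested by every rule whose item set is fully in the basket."""
    basket = set(basket)
    suggestions = []
    for required_items, recommended in RULES:
        already_known = recommended in basket or recommended in suggestions
        if required_items <= basket and not already_known:
            suggestions.append(recommended)
    return suggestions


print(recommend(["85123A"]))
print(recommend(["85123A", "71053"]))
```

Items that are already in the basket are skipped, so with both 85123A and 71053 in the basket only ITEM_XYZ is suggested.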

Historical transaction data can be huge, so generating association rules from it may require batch processing.

An Apriori implementation lets the organization choose thresholds for which items to count in the transaction data. The first is support: the minimum share of transactions in which an item (or item combination) must appear to be included in the candidate data set. The second is confidence: of the transactions that contain the items on the left side of a rule, the minimum share that must also contain the recommended item for the rule to generate a recommendation.
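As a rough illustration of how the two thresholds interact, here is a compact sketch of the level-wise idea behind Apriori. It is not the implementation used in the project. The baskets are the three customer baskets from the sample table (again assuming one basket per customer), and the function names and threshold values are made up for the example.

```python
from itertools import combinations

# One basket per customer, from the sample table (an assumption).
transactions = [
    frozenset({"85123A", "71053", "84406B", "21777"}),
    frozenset({"85123A", "71053"}),
    frozenset({"84406B"}),
]


def support(itemset, transactions):
    """Share of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Share of transactions with the antecedent that also hold the consequent."""
    antecedent = frozenset(antecedent)
    both = antecedent | frozenset(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def frequent_itemsets(transactions, min_support):
    """Grow itemsets one item at a time, keeping only the frequent ones."""
    items = {item for t in transactions for item in t}
    level = [frozenset({item}) for item in items]
    frequent = {}
    k = 1
    while level:
        kept = {}
        for candidate in level:
            value = support(candidate, transactions)
            if value >= min_support:
                kept[candidate] = value
        frequent.update(kept)
        k += 1
        # Join: combine frequent sets into candidates with k items.
        candidates = {a | b for a in kept for b in kept if len(a | b) == k}
        # Prune: every (k-1)-item subset of a candidate must be frequent.
        level = [
            c for c in candidates
            if all(frozenset(s) in kept for s in combinations(c, k - 1))
        ]
    return frequent


def association_rules(transactions, min_support, min_confidence):
    """Split each frequent itemset into 'antecedent -> consequent' rules."""
    rules = []
    for itemset, sup in frequent_itemsets(transactions, min_support).items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                antecedent = frozenset(antecedent)
                consequent = itemset - antecedent
                conf = confidence(antecedent, consequent, transactions)
                if conf >= min_confidence:
                    rules.append((antecedent, consequent, sup, conf))
    return rules


rules_found = association_rules(transactions, min_support=0.6, min_confidence=0.8)
for antecedent, consequent, sup, conf in rules_found:
    print(sorted(antecedent), "->", sorted(consequent), round(sup, 2), round(conf, 2))
```

With these thresholds the only surviving rules are 85123A -> 71053 and 71053 -> 85123A, because that pair is the only combination found in at least 60% of the baskets. This sketch rescans every transaction for each candidate, which is fine for a toy example but is exactly the cost that pushes real data sets toward batch processing.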

In the next post I will try to explain the Apriori algorithm implementation.

workflow reference: []

