This is a UCI data set from 2014 about clients' spending at a wholesale distributor. Based on the region names, the data appears to have been collected in Portugal. Each row records how many monetary units one client spent on each category of item in a year; spending was recorded for 440 clients. There are 8 attributes: channel (i.e., the client type: "1" for hotel/restaurant/cafe, which the data set calls "horeca" but which I will refer to as restaurant/hotel, and "2" for retail, such as a supermarket), region (Lisbon, Porto, or other, numbered 1, 2, and 3 respectively), fresh products, milk products, grocery products, frozen products, detergents/paper products, and delicatessen (deli products like cold cuts). The numbers under the product types are in monetary units (m.u.), a generic unit used in place of a specific currency. I will work with this data by examining relationships between certain features, performing clustering, and running a principal component analysis (PCA).
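
The attribute layout described above can be sketched as a small loader that turns the numeric channel and region codes into readable labels. This is only a minimal sketch: the column names (including the spelling "Delicassen") are assumptions about the CSV header and should be checked against the actual file, and the two sample rows are made up purely to show the layout.

```python
import csv
import io

# Code-to-label mappings, as described in the text above.
CHANNEL_LABELS = {1: "restaurant/hotel", 2: "retail"}
REGION_LABELS = {1: "Lisbon", 2: "Porto", 3: "Other"}

# Assumed column names; verify them against the header of the real file.
SPENDING_COLUMNS = [
    "Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen",
]


def load_clients(handle):
    """Read client rows from a CSV file object and decode the two codes."""
    clients = []
    for row in csv.DictReader(handle):
        client = {
            "channel": CHANNEL_LABELS[int(row["Channel"])],
            "region": REGION_LABELS[int(row["Region"])],
        }
        # Spending values are annual amounts in monetary units (m.u.).
        for column in SPENDING_COLUMNS:
            client[column] = float(row[column])
        clients.append(client)
    return clients


# Two made-up rows, used only to illustrate the layout (not real data).
sample = io.StringIO(
    "Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen\n"
    "1,3,100,20,30,40,5,6\n"
    "2,1,10,200,300,4,50,60\n"
)
clients = load_clients(sample)
print(clients[0]["channel"], clients[0]["region"])
```

With the real file, the same function could be called on an opened file handle instead of the in-memory sample.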

Problem Statement

As mentioned above, this data set comes from the UCI site, where it was donated in 2014. According to the data set's homepage, 77 clients are from Lisbon, 47 are from Porto, and 316 are from other regions; 298 are restaurants/hotels and 142 are retail. The main questions I seek to answer are, in general terms: "Is there a relationship between the kind of client and the type of goods it purchased most?", "Are certain product types correlated with each other, whether positively or negatively?", and "Can we cluster the clients into groups based on similar amounts purchased in certain categories?"
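
As a rough illustration of the second question, the sketch below first checks that the reported region and client-type counts each add up to the 440 clients, then computes a Pearson correlation coefficient between two product columns. The spending figures are made up for five hypothetical clients, and the choice of columns is arbitrary; this is one possible way to measure correlation, not necessarily the method used for the results on this page.

```python
from math import sqrt

# Counts reported on the data set's homepage, as quoted above.
REGION_COUNTS = {"Lisbon": 77, "Porto": 47, "Other": 316}
CHANNEL_COUNTS = {"restaurant/hotel": 298, "retail": 142}

# Both breakdowns should account for all 440 clients.
assert sum(REGION_COUNTS.values()) == 440
assert sum(CHANNEL_COUNTS.values()) == 440


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up spending figures for five hypothetical clients (not real data).
grocery = [10.0, 20.0, 30.0, 40.0, 50.0]
detergents = [1.0, 2.0, 2.5, 4.5, 5.0]

r = pearson(grocery, detergents)
print(f"r = {r:.3f}")
```

A value near +1 indicates that the two categories rise together, a value near -1 that one falls as the other rises, and a value near 0 that there is little linear relationship.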


Note: When colored by client type, blue is restaurant/hotel and yellow is retail.

This graph is colored by client type. It seems that retail clients as a whole buy less, while some restaurant/hotel clients buy a lot; the restaurant/hotel group has more outliers and more points in general.

In the next picture, we can see that clients buy more fresh food than frozen. Since most clients are hotels, restaurants, or cafes, this makes sense: they are less likely to stock frozen foods than retailers, because they intend to use the food they obtain quickly, while supermarkets and the like store food on shelves for days or weeks. Many points are packed at the bottom of the plot, for clients who buy mostly fresh and almost nothing frozen, which is why the line has the slope it does.
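
To make the remark about the slope concrete, here is a minimal sketch of an ordinary least-squares line fit. It assumes that fresh spending is on the horizontal axis and frozen spending on the vertical axis, and that the line in the picture is a least-squares fit; neither is confirmed by the plot itself. The points are made up, with most of them placed near zero frozen spending to mimic the packed region at the bottom.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points (not real data): fresh spending varies widely, while
# frozen spending stays close to zero for most hypothetical clients.
fresh = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0]
frozen = [5.0, 3.0, 8.0, 4.0, 6.0, 60.0]

slope, intercept = fit_line(fresh, frozen)
print(f"frozen = {slope:.3f} * fresh + {intercept:.3f}")
```

Because most of the points sit near zero on the frozen axis, the fitted slope stays small even though one client buys noticeably more frozen food.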