Preface: This article records the three homework assignments from the Fall 2024 semester of the Data Mining course; the answers are for reference only.
Assignment 1
1
Suppose a data warehouse contains four dimensions: date, product, vendor, location, and two measures: sales_volume and sales_cost.
1) Draw the star schema diagram for this data warehouse.
2) Starting from the base cuboid [date, product, vendor, location], what OLAP operations should be performed in order to list the sales_volume of each vendor in Los Angeles for each year?
roll up on product from product_key to all
roll up on location from location_key to city
roll up on date from date_key to year
slice for location = 'Los Angeles' (a pandas sketch of the equivalent aggregation follows below)
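As a cross-check on the operations above, the same query can be written as a relational aggregation. The sketch below assumes a hypothetical flattened view of the base cuboid called sales, with illustrative column names year, city, vendor, and sales_volume (these names are not taken from the assignment).

```python
import pandas as pd

# Hypothetical flat view of the base cuboid; column names are assumptions.
sales = pd.DataFrame(
    {
        "year":         [2023, 2023, 2024, 2024],
        "city":         ["Los Angeles", "New York", "Los Angeles", "Los Angeles"],
        "vendor":       ["V1", "V1", "V1", "V2"],
        "sales_volume": [120, 80, 150, 60],
    }
)

# slice: location rolled up to city, keep only Los Angeles
la = sales[sales["city"] == "Los Angeles"]

# roll up product to "all" and date to year: aggregate over everything except (year, vendor)
result = la.groupby(["year", "vendor"], as_index=False)["sales_volume"].sum()
print(result)
```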
3) Bitmap indexing is useful in data warehouses. Taking this cube as an example, briefly discuss the advantages and problems of using a bitmap index structure.
2
Design a data warehouse for a regional weather bureau. The weather bureau has about 1000 probes, which are scattered throughout various land and ocean locations in the region to collect basic weather data, including air pressure, temperature, and precipitation at each hour. All data are sent to the central station, which has collected such data for over 10 years. Your design should facilitate efficient querying and online analytical processing, and derive general weather patterns in multidimensional space. (note: please present the schema, the fact table(s) and the dimension tables with concept hierarchy)
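No answer is given in the original write-up for this question; below is one possible (hypothetical) sketch of a star schema for the weather warehouse, written as plain Python dictionaries only to make the fact table, dimension tables, and concept hierarchies explicit. It is not the official solution.

```python
# One possible design, not the official answer: a single fact table keyed by
# hourly time, probe, and location, with the three basic measurements as measures.
weather_star_schema = {
    "fact_table": {
        "name": "weather_fact",
        "dimension_keys": ["date_key", "probe_key", "location_key"],
        "measures": ["air_pressure", "temperature", "precipitation"],
    },
    "dimension_tables": {
        # concept hierarchies listed from the finest to the coarsest level
        "date": ["hour", "day", "month", "quarter", "year"],
        "probe": ["probe_id", "probe_type"],            # e.g. land vs. ocean probe
        "location": ["probe_location", "district", "city", "region"],
    },
}
print(weather_star_schema)
```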
3
Below are the sales data of supermarket product A over 20 consecutive months (in hundreds of yuan):
A:21, 16, 19, 24, 27, 23, 22, 21, 20, 17, 16, 20, 23, 22, 18, 24, 26, 25, 20, 26。
1)Calculate the mean, median, and standard deviation of the sales data.
Mean = 21.5; median = 21.5; standard deviation ≈ 3.22 (population standard deviation).
2)Draw the boxplot.
Min = 16, Q1 = 19, median = 21.5, Q3 = 24, Max = 27.
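A short NumPy check of the figures above. The 3.22 value corresponds to the population standard deviation (ddof = 0); note that NumPy's default percentile interpolation gives Q1 = 19.75 rather than 19, so the hand-computed quartiles follow a different (textbook) convention.

```python
import numpy as np

a = np.array([21, 16, 19, 24, 27, 23, 22, 21, 20, 17, 16, 20,
              23, 22, 18, 24, 26, 25, 20, 26])

print(a.mean())              # mean = 21.5
print(np.median(a))          # median = 21.5
print(a.std())               # population standard deviation ~= 3.22
print(a.min(), a.max())      # 16, 27 for the boxplot whiskers
print(np.percentile(a, [25, 75]))  # quartiles under NumPy's linear interpolation
```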
3) Normalize the values based on min-max normalization.
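No worked answer is given for this part; a minimal sketch of min-max normalization to the new range [0, 1], i.e. v' = (v - min) / (max - min) with min = 16 and max = 27:

```python
import numpy as np

a = np.array([21, 16, 19, 24, 27, 23, 22, 21, 20, 17, 16, 20,
              23, 22, 18, 24, 26, 25, 20, 26])

# Min-max normalization to the new range [0, 1]
a_norm = (a - a.min()) / (a.max() - a.min())
print(np.round(a_norm, 3))
```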
4) Suppose the sales data of product B over the same 20 consecutive months (in hundreds of yuan) are: 38, 24, 38, 45, 46, 44, 42, 34, 40, 30, 31, 40, 40, 32, 36, 42, 50, 47, 46, 50.
Calculate the correlation coefficient (Pearson’s product moment coefficient). Are these products positively or negatively correlated?
The computed Pearson correlation coefficient is 0.831, indicating that products A and B are positively correlated.
5)Draw the scatter plot for the sales data of the two products.
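A short NumPy/Matplotlib sketch that reproduces the 0.831 correlation and draws the scatter plot asked for in part 5):

```python
import numpy as np
import matplotlib.pyplot as plt

a = np.array([21, 16, 19, 24, 27, 23, 22, 21, 20, 17, 16, 20,
              23, 22, 18, 24, 26, 25, 20, 26])
b = np.array([38, 24, 38, 45, 46, 44, 42, 34, 40, 30, 31, 40,
              40, 32, 36, 42, 50, 47, 46, 50])

# Pearson product-moment correlation coefficient
r = np.corrcoef(a, b)[0, 1]
print(round(r, 3))  # ~0.831, i.e. positively correlated

# Scatter plot for part 5)
plt.scatter(a, b)
plt.xlabel("Product A sales (hundreds of yuan)")
plt.ylabel("Product B sales (hundreds of yuan)")
plt.show()
```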
4
Below are the sales data of supermarket product A over 20 consecutive months (in hundreds of yuan): 21, 16, 19, 24, 27, 23, 22, 21, 20, 17, 16, 20, 23, 22, 18, 24, 26, 25, 20, 26. Smooth the noise in these data using equal-depth binning with a depth of 5.
Answer:
First sort the 20 values; the sorted result is: 16, 16, 17, 18, 19, 20, 20, 20, 21, 21, 22, 22, 23, 23, 24, 24, 25, 26, 26, 27. Using equal-depth binning with a depth of 5, the bins are as follows:
Bin1: 16, 16, 17, 18, 19;
Bin2: 20, 20, 20, 21, 21;
Bin3: 22, 22, 23, 23, 24;
Bin4: 24, 25, 26, 26, 27;
1) Smoothing by bin medians:
The median of Bin1 is 17, so after smoothing Bin1: 17, 17, 17, 17, 17;
The median of Bin2 is 20, so after smoothing Bin2: 20, 20, 20, 20, 20;
The median of Bin3 is 23, so after smoothing Bin3: 23, 23, 23, 23, 23;
The median of Bin4 is 26, so after smoothing Bin4: 26, 26, 26, 26, 26;
2) Smoothing by bin boundaries:
After smoothing, the results are:
Bin1: 16, 16, 16, 19, 19;
Bin2: 20, 20, 20, 21, 21;
Bin3: 22, 22, 22, 22, 24 or 22, 22, 24, 24, 24;
Bin4: 24, 24, 27, 27, 27;
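A small Python sketch of both smoothing variants; ties such as the two 23s in Bin3 are sent to the lower boundary here, which matches the first of the two options listed above.

```python
import numpy as np

values = sorted([21, 16, 19, 24, 27, 23, 22, 21, 20, 17, 16, 20,
                 23, 22, 18, 24, 26, 25, 20, 26])
depth = 5
bins = [values[i:i + depth] for i in range(0, len(values), depth)]

# Smoothing by bin medians: every value in a bin is replaced by the bin's median.
by_median = [[int(np.median(chunk))] * len(chunk) for chunk in bins]

# Smoothing by bin boundaries: every value is replaced by the nearer boundary
# (ties go to the lower boundary in this sketch).
def to_boundaries(chunk):
    lo, hi = chunk[0], chunk[-1]
    return [lo if v - lo <= hi - v else hi for v in chunk]

by_boundary = [to_boundaries(chunk) for chunk in bins]
print(by_median)
print(by_boundary)
```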
Assignment 2
1
Given a data set below for attributes {Height, Hair, Eye} and two classes {C1, C2}.
1)Compute the Information Gain for Height, Hair and Eye.
2)Construct a decision tree with Information Gain.
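The training table itself is not reproduced in this write-up, so no numeric answer is shown here; the sketch below gives a generic information-gain routine that can be applied to the table once it is available. The two example rows at the bottom are purely hypothetical placeholders.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, labels):
    """Information gain of splitting `rows` (a list of dicts) on attribute `attr`."""
    n = len(rows)
    gain = entropy(labels)
    for value in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# Hypothetical two-row example only, since the actual table is not reproduced here:
rows = [{"Height": "Short", "Hair": "Blond", "Eye": "Blue"},
        {"Height": "Tall",  "Hair": "Dark",  "Eye": "Brown"}]
labels = ["C1", "C2"]
print(info_gain(rows, "Hair", labels))
```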
2
Classify the unknown sample Z based on the training data set in Q1:
Z = (Height = Short, Hair = blond, Eye = brown). Which class would a naïve Bayesian classifier assign to Z?
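For reference, the naïve Bayes decision rule applied to Z is the following; the prior and the class-conditional probabilities are estimated by counting in the Q1 training table (with a Laplacian correction if any count is zero):

$$
\hat{c}(Z) = \arg\max_{C_i \in \{C_1, C_2\}} P(C_i)\,P(\mathrm{Height{=}Short}\mid C_i)\,P(\mathrm{Hair{=}blond}\mid C_i)\,P(\mathrm{Eye{=}brown}\mid C_i)
$$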
3
Note: the answer to this question probably contains some errors.
1)Design a multilayer feed-forward neural network (one hidden layer) for the data set in Q1. Label the nodes in the input and output layers.
2)Using the neural network obtained above, show the weight values after one iteration of the backpropagation algorithm, given the training instance "(Tall, Red, Brown)". Indicate your initial weight values and biases and the learning rate used.
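No numeric answer is given here; for a network of sigmoid units, one iteration of backpropagation applies the standard updates below, where l is the learning rate, O_j the output of unit j, T_j the target value, and θ_j the bias:

$$
\begin{aligned}
I_j &= \sum_i w_{ij} O_i + \theta_j, \qquad O_j = \frac{1}{1+e^{-I_j}},\\
Err_j &= O_j(1-O_j)(T_j-O_j) \quad \text{(output unit)},\\
Err_j &= O_j(1-O_j)\sum_k Err_k\, w_{jk} \quad \text{(hidden unit)},\\
w_{ij} &\leftarrow w_{ij} + l\,Err_j\,O_i, \qquad \theta_j \leftarrow \theta_j + l\,Err_j.
\end{aligned}
$$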
4
Consider the data set shown in Table 1 (min_sup = 60%, min_conf = 70%).
1)Find all frequent itemsets using Apriori by treating each transaction ID as a market basket.
2)Use the results in part 1) to compute the confidence for the association rules {a, b}->{c} and {c}->{a, b}. Is confidence a symmetric measure?
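Since Table 1 is not reproduced here, only the definitions are shown; the confidence of the two rules is computed as

$$
\mathrm{conf}(\{a,b\}\Rightarrow\{c\}) = \frac{\mathrm{sup}(\{a,b,c\})}{\mathrm{sup}(\{a,b\})},\qquad
\mathrm{conf}(\{c\}\Rightarrow\{a,b\}) = \frac{\mathrm{sup}(\{a,b,c\})}{\mathrm{sup}(\{c\})}
$$

The numerators are identical but the denominators generally differ, so confidence is not a symmetric measure.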
3)List all of the strong association rules (with support s and confidence c) matching the following metarule, where X is a variable representing customers, and item_i denotes variables representing items (e.g. "A", "B", etc.)
5
Assume a supermarket would like to promote pasta. Use the data in “transactions” as training data to build a decision tree (C5.0 algorithm) model to predict whether the customer would buy pasta or not.
Build a decision tree using data set “transactions” that predicts pasta as a function of the other fields. Set the “type” of each field to “Flag”, set the “direction” of “pasta” as “out”, set the “type” of COD as “Typeless”, select “Expert” and set the “pruning severity” to 65, and set the “minimum records per child branch” to be 95. Hand-in: A figure showing your tree.
6
Use the model (the full tree generated by Clementine in step 1 above) to make a prediction for each of the 20 customers in the “rollout” data to determine whether the customer would buy pasta.
1)Hand-in: your prediction for each of the 20 customers. (10 points)
2)Hand-in: rules for positive (yes) prediction of pasta purchase identified from the decision tree (up to the fifth level. The root is considered as level 1). (10 points)
Assignment 3
1
Suppose that the data mining task is to cluster the following ten points (with (x, y, z) representing location) into three clusters:
A1(4,2,5), A2(10,5,2), A3(5,8,7), B1(1,1,1), B2(2,3,2), B3(3,6,9), C1(11,9,2), C2(1,4,6), C3(9,1,7), C4(5,6,7)
The distance function is Euclidean distance. Suppose initially we assign A2, B2, C2 as the center of each cluster, respectively. Use the K-Means algorithm to show only:
1)The three clusters' centers after the first round of execution
2)The final three clusters
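A NumPy sketch that carries out the required computation on the ten points, printing the centers after the first round and the clusters at convergence (the results are computed by the code rather than listed here):

```python
import numpy as np

points = {
    "A1": (4, 2, 5),  "A2": (10, 5, 2), "A3": (5, 8, 7),
    "B1": (1, 1, 1),  "B2": (2, 3, 2),  "B3": (3, 6, 9),
    "C1": (11, 9, 2), "C2": (1, 4, 6),  "C3": (9, 1, 7), "C4": (5, 6, 7),
}
X = np.array(list(points.values()), dtype=float)
centers = np.array([points["A2"], points["B2"], points["C2"]], dtype=float)

for it in range(1, 21):  # iterate until the centers stop moving (20 rounds is plenty here)
    # assign each point to the nearest center (Euclidean distance)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    new_centers = np.array([X[labels == k].mean(axis=0) for k in range(3)])
    if it == 1:
        print("centers after round 1:\n", new_centers)
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print("final centers:\n", centers)
print("final clusters:", {k: [name for name, lab in zip(points, labels) if lab == k]
                          for k in range(3)})
```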
2
Table 2 gives a User-Product rating matrix.
1)List the top 3 most similar users to User 2 based on Cosine Similarity
2)Predict User 2’s rating for Product 2
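Table 2 is not reproduced in this write-up, so the rating matrix below is purely hypothetical; the sketch only illustrates the cosine-similarity neighbourhood and one common prediction variant (a plain similarity-weighted average over the top-3 neighbours that rated the product).

```python
import numpy as np

# Hypothetical rating matrix (rows = users 1..4, columns = products 1..4, 0 = not rated);
# replace with the actual values from Table 2.
R = np.array([
    [4, 3, 0, 5],
    [5, 0, 4, 4],
    [3, 1, 2, 4],
    [4, 3, 4, 3],
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target = 1                      # index of User 2 (0-based)
sims = np.array([cosine(R[target], R[u]) for u in range(len(R))])
sims[target] = -1               # exclude the user itself
top3 = np.argsort(sims)[::-1][:3]
print("top-3 similar users:", top3 + 1)

# Predict User 2's rating for Product 2 as a similarity-weighted average
# over the top-3 neighbours that actually rated Product 2.
item = 1
rated = [u for u in top3 if R[u, item] > 0]
pred = sum(sims[u] * R[u, item] for u in rated) / sum(sims[u] for u in rated)
print("predicted rating:", round(pred, 2))
```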
3
The goal of this assignment is to introduce churn management using decision trees, logistic regression, and neural networks. You will try different combinations of the parameters to see their impact on the accuracy of your models for this specific data set. This data set contains summarized data records for each customer of a phone company. Our goal is to build a model so that this company can predict potential churners.
Two data sets are available, churn_training.txt and churn_validation.txt. Each data set has 21 variables. They are:
State:
Account_length: how long this person has been in this plan
Area_code:
Phone_number:
International_plan: this person has international plan=1, otherwise=0
Voice_mail_plan: this person has voice mail plan=1, otherwise=0
Number_vmail_messages: number of voice mails
Total_day_minutes:
Total_day_calls:
Total_day_charge:
Total_eve_minutes:
Total_eve_calls:
Total_eve_charge:
Total_night_minutes:
Total_night_calls:
Total_night_charge:
Total_intl_minutes:
Total_intl_calls:
Total_intl_charge:
Number_customer_service_calls:
Class: churn=1, did not churn=0
Each row in “churn_training” represents a customer record. The training data contains 2000 rows and the validation data contains 1033 records.
1)Perform decision tree classification on training data set. Select all the input variables except state, area_code, and phone_number (since they are only informative for this analysis). Set the “Direction” of class as “out”, “type” as “Flag”. Then, specify the “minimum records per child branch” as 40, “pruning severity” as 70, click “use global pruning”. Hand-in the confusion matrices for validation data.
Using the decision tree algorithm in Clementine with the settings described above, the confusion matrix computed on the validation data is shown in the figure below.
2)Perform neural network on training data set using default settings. Again, select all the input variables except state, area_code, and phone_number. Hand-in the confusion matrix for validation data.
Using the neural network algorithm in Clementine with the settings described above, the confusion matrix computed on the validation data is shown in the figure below.
3)Perform logistic regression on training data set using default settings. Again, select all the input variables except state, area_code, and phone_number. Hand-in the confusion matrix for validation data.
Using the logistic regression algorithm in Clementine with the settings described above, the confusion matrix computed on the validation data is shown in the figure below.
4) Hand-in your observations on the model quality for decision tree, neural network and logistic regression using the confusion matrices.
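The three Clementine models above are compared through their confusion matrices. As a rough, optional cross-check outside Clementine, the sketch below trains a plain CART decision tree with scikit-learn and prints the validation confusion matrix. The file format, column capitalization, and the mapping of Clementine's pruning settings onto min_samples_leaf are assumptions, and C5.0 itself is not available in scikit-learn, so the numbers will not match the figures above exactly.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

# Assumed to be plain delimited text with the 21 columns listed above; adjust sep if needed.
train = pd.read_csv("churn_training.txt", sep="\t")
valid = pd.read_csv("churn_validation.txt", sep="\t")

drop = ["State", "Area_code", "Phone_number", "Class"]
X_train, y_train = train.drop(columns=drop), train["Class"]
X_valid, y_valid = valid.drop(columns=drop), valid["Class"]

# min_samples_leaf=40 loosely mirrors "minimum records per child branch = 40".
model = DecisionTreeClassifier(min_samples_leaf=40, random_state=0)
model.fit(X_train, y_train)
print(confusion_matrix(y_valid, model.predict(X_valid)))
```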
4
Learn the use of market basket analysis for the purpose of making product purchase recommendations to the customers.
The data set contains transactions from a large supermarket. Each transaction is made by someone holding the loyalty card. We limited the total number of categories in this supermarket data to 20 categories for simplicity. The field value for a certain product in the transaction basket is 1 if the customer has bought it and 0 if he/she has not. The file named “Transactions” has data for 46243 transactions.
The data are available from the class web page.
Your submission should consist only of those deliverables marked by “Hand-in”.
The objective of market basket analysis is to discover individual products, or groups of products, that tend to occur together in transactions. The knowledge obtained from a market basket analysis can be used by a business to recognize products that are frequently sold together, in order to determine recommendations and cross-sell and up-sell opportunities. It can also be used to improve the efficiency of a promotional campaign.
Run Apriori on “transaction” data set. Set the “Type” of “COD” as “Typeless”, set the “direction” of all the other 20 categories as “Both”, set their “Type” as “Flag”. Set “Minimum antecedent support” to be 7%, “Minimum confidence” to be 45%, and “Maximum number of antecedents” to be 4 in the modeling node (Apriori node). In general you should explore by trying different values of these parameters to see what type of rules you get.
· Hand-in: The list of association rules generated by the model.
Sort the rules by lift, support, and confidence, respectively, to see the rules identified. Hand-in: For each case, choose the top 5 rules (note: make sure there are no redundant rules among the 5) and give 2-3 lines of comments. Many of the rules will be logically redundant and therefore will have to be eliminated after you think carefully about them.
Using Clementine and sorting by lift, support, and confidence in turn, the association rules obtained are shown in the figures above, and the selected top-5 rules are listed in the table below.
1) Lift: from panel (a) of the figure above, we first pick out the top-5 rules, as follows:
a) tomato source→pasta: customers who buy tomato sauce also buy pasta, which is fairly reasonable;
b) coffee, milk→pasta: customers who buy coffee and milk also buy pasta, which is not very reasonable, so this rule is excluded;
c) biscuits, pasta→milk: customers who buy biscuits and pasta also buy milk, fairly reasonable;
d) pasta, water→milk: customers who buy pasta and water also buy milk, fairly reasonable;
e) juices→milk: customers who buy juice also buy milk, fairly reasonable;
Since rule b) is not very reasonable, it is removed and the following rule is added in its place:
f) yoghurt→milk: customers who buy yoghurt also buy milk, fairly reasonable;
Therefore, the selected top-5 rules are those shown in the second column of the table above.
2) Support: from panel (b) of the figure above, the top-5 rules are pasta→milk, water→milk, biscuits→milk, brioches→milk, and yoghurt→milk. All five are fairly reasonable; for example, shoppers who buy water, yoghurt, or other drinks often buy milk as well, so these five rules are selected, as shown in the third column of the table above.
3) Confidence: from panel (c) of the figure above, the top-5 rules are biscuits, pasta→milk; water, pasta→milk; juices→milk; tomato source→pasta; and yoghurt→milk. These also agree reasonably well with common sense; for example, shoppers who buy tomato sauce are very likely to also buy pasta, so these five rules are selected, as shown in the fourth column of the table above.
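For reference, roughly the same association-rule analysis can be reproduced outside Clementine with pandas and mlxtend. The file name and separator are assumptions, and note that mlxtend's min_support applies to the whole itemset, whereas the 7% threshold above is Clementine's minimum antecedent support, so the rule sets may differ slightly.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Assumed file name/format; drop the customer code and keep the 20 flag columns as booleans.
baskets = pd.read_csv("transactions.csv").drop(columns=["COD"]).astype(bool)

frequent = apriori(baskets, min_support=0.07, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.45)
rules = rules[rules["antecedents"].apply(len) <= 4]   # at most 4 antecedents

# Sort by lift, support, and confidence in turn and inspect the top 5 rules each time.
for key in ("lift", "support", "confidence"):
    print(rules.sort_values(key, ascending=False)
               .head(5)[["antecedents", "consequents", "support", "confidence", "lift"]])
```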