Hands-On Analysis

March 29, 2020

Questions

Use the adult dataset for the following exercises.

  7. Apply the Kohonen clustering algorithm to the data set, being careful not to include the income field. Use a topology that is not too large, such as 3 X 3.

  8. Construct a scatter plot (with x/y agitation) of the cluster membership, with an overlay of income. Discuss your findings.

  9. Construct a bar chart of the cluster membership, with an overlay of income. Discuss your findings. Compare to the scatter plot.

  10. Construct a bar chart of the cluster membership, with an overlay of marital status. Discuss your findings.

  11. If your software supports this, construct a web graph of income, marital status, and the other categorical variables. Fine-tune the web graph so that it conveys good information.

  12. Generate numerical summaries for the clusters. For example, generate a cluster mean summary.

  13. Using the information above and any other information you can bring to bear, construct detailed and informative cluster profiles, complete with titles.

  14. Use the cluster membership as a further input to a CART decision tree model for classifying income. How important is clustering membership in classifying income?

  15. Use cluster membership as a further input to a C4.5 decision tree model for classifying income. How important is clustering membership in classifying income? Compare to the CART model.

Solution

IBM SPSS Modeler 18 was chosen to run the Kohonen algorithm on the adult data. The algorithm has three major characteristics: competition, cooperation, and adaptation. Once clustering has been performed, the results are described graphically with a scatter plot, bar charts, and a web graph. The scatter plot shows cluster membership with an overlay of income. Two bar charts show cluster membership, one with an overlay of income and one with an overlay of marital status. Additionally, the categorical variables are used to build the web graph.
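
The three characteristics named above can be sketched in a few lines of plain Python. This is only an illustrative toy, not the implementation used by IBM SPSS Modeler: the function names, the decay schedule, the Gaussian neighborhood, and the four sample records are all assumptions made for the example.

```python
import math
import random

def best_node(weights, row):
    """Competition: the grid node whose weights are closest to the record wins."""
    return min(weights, key=lambda node: math.dist(weights[node], row))

def train_som(rows, grid=(3, 3), epochs=20, lr0=0.5, radius0=1.5, seed=0):
    """Train a tiny Kohonen map and return {(x, y): weight_vector}."""
    rng = random.Random(seed)
    dim = len(rows[0])
    weights = {(x, y): [rng.random() for _ in range(dim)]
               for x in range(grid[0]) for y in range(grid[1])}
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                     # learning rate decays each epoch
        radius = max(radius0 * (1.0 - frac), 0.5)   # neighborhood shrinks, with a floor
        for row in rows:
            winner = best_node(weights, row)
            for node, w in weights.items():
                # Cooperation: nodes near the winner on the grid learn too.
                grid_dist = math.dist(node, winner)
                influence = math.exp(-(grid_dist ** 2) / (2 * radius ** 2))
                # Adaptation: move the node's weights toward the record.
                for i in range(dim):
                    w[i] += lr * influence * (row[i] - w[i])
    return weights

# Made-up records with two inputs, already scaled to the range [0, 1].
rows = [[0.10, 0.10], [0.15, 0.05], [0.90, 0.95], [0.85, 0.90]]
som = train_som(rows)
clusters = [best_node(som, r) for r in rows]  # (X, Y) grid position per record
```

Each record ends up labeled with the (X, Y) position of its winning node, which is the same kind of cluster label used in the rest of this paper.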

Figure 1. IBM SPSS Modeler 18 Stream

Web graphs are made up of web nodes that help visualize the strength of relationships between two or more variables. The connections are drawn with different types of lines that indicate relationships of increasing strength. Strong relationships are indicated by heavy lines, showing that the two variables are strongly related and should be analyzed further. Medium relationships are indicated by normal-weight lines. Weak relationships, on the other hand, are indicated by dotted lines. No line between two variables means that their interaction falls below the threshold required by the web node. In this case, the categorical variables are used to build the web graph. The categorical variables are education, income, native country, marital status, occupation, relationship, race, sex, and work class.
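
The idea behind the line styles can be sketched as a co-occurrence count followed by thresholding. The sketch below is an assumption-laden toy: the threshold values, the function names, and the six records are invented for illustration, and the web node in IBM SPSS Modeler may define its thresholds differently.

```python
from collections import Counter
from itertools import combinations

def link_counts(records):
    """Count how often each pair of category values occurs in the same record."""
    counts = Counter()
    for rec in records:
        values = sorted(f"{field}={value}" for field, value in rec.items())
        counts.update(combinations(values, 2))
    return counts

def link_style(count, minimum=2, medium=3, strong=4):
    """Map a co-occurrence count to a line style (thresholds are illustrative)."""
    if count < minimum:
        return None        # below the threshold: no line is drawn
    if count < medium:
        return "dotted"    # weak relationship
    if count < strong:
        return "normal"    # medium relationship
    return "heavy"         # strong relationship

# Made-up records using three of the categorical fields.
records = [
    {"sex": "Male", "income": "<=50K", "race": "White"},
    {"sex": "Male", "income": "<=50K", "race": "White"},
    {"sex": "Male", "income": ">50K", "race": "White"},
    {"sex": "Female", "income": "<=50K", "race": "White"},
    {"sex": "Female", "income": "<=50K", "race": "Black"},
    {"sex": "Male", "income": ">50K", "race": "White"},
]
counts = link_counts(records)
styles = {pair: link_style(n) for pair, n in counts.items()}
```

With these toy records, the pair (race=White, sex=Male) occurs four times and gets a heavy line, while pairs that occur only once get no line at all.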

Figure 1 above shows all of the analyses performed on the adult data set. In addition to the graphs, C4.5 and CART decision tree models were constructed. Decision trees are a statistical learning method that does not involve the estimation of parameters. They are used for both regression and classification, with the goal of creating a model that predicts the target variable through rules derived from the data set. The benefits of decision trees include the following: their simplicity makes them easy to read and interpret; they require little data preparation, whereas other techniques often need steps such as normalization; the cost of predicting is logarithmic in the number of data points used to train the tree; and they can handle both continuous and categorical data. Additionally, decision trees are a white-box model, can be validated with simple statistical tests, and perform well even when their assumptions are somewhat violated by the true model that generated the data (Satyanarayana, 2013). The two models used in this case are CART (Classification and Regression Trees) and C4.5. C4.5 converts the trained trees into sets of if-then rules, whereas CART is similar to C4.5 but additionally supports continuous target variables and does not compute rule sets.
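
As a rough illustration of how the two tree families score a candidate split, the sketch below computes Gini impurity, the measure commonly associated with CART, and entropy-based information gain, on which the gain ratio of C4.5 is built. Neither measure is described in this paper's Modeler output; the label lists and helper names are made up for the example, and the gain ratio normalization step of C4.5 is left out.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, larger for mixed nodes."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: 0 for a pure node, 1 for a 50/50 two-class node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_score(parent, children, impurity):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(ch) / n * impurity(ch) for ch in children)
    return impurity(parent) - weighted

# Made-up node with four records of each income class, split into two pure children.
parent = ["<=50K"] * 4 + [">50K"] * 4
left, right = ["<=50K"] * 4, [">50K"] * 4

gini_reduction = split_score(parent, [left, right], gini)        # 0.5
information_gain = split_score(parent, [left, right], entropy)   # 1.0
```

A tree builder would evaluate many candidate splits this way and keep the one with the highest score.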

IBM SPSS Modeler 18 is a software package that can be used to automate the data mining process by constructing models such as the CART and C4.5 tree models. It was chosen because it is powerful and versatile. The software makes it easy to visualize the data mining process, as shown in Figure 1. It produces very accurate predictions through the use of data assets such as business intelligence. The software comes packaged with a wide variety of advanced analytical functions, including the Kohonen algorithm, automated data preparation, and strong visualization capabilities. It also makes it easier to deploy models, predictions, and insights to anyone interested. The software combines predictive abilities, transformation capabilities, and reporting tools, making it an all-in-one package. In addition to IBM data mining, the software integrates warehouse tools that help provide further insight into the data (Satyanarayana, 2013).

Question 7: Kohonen Clustering Algorithm

Figure 2. Cluster Sizes

Table 1. Cluster Sizes

| Description | Value |
|---|---|
| Size of Smallest Cluster | 6 (0%) |
| Size of Largest Cluster | 6514 (26.1%) |
| Ratio of Sizes: Largest Cluster to Smallest Cluster | 1085.67 |

Figure 2 shows the clusters produced by applying the Kohonen clustering algorithm, with a small 3 X 3 topology, to the adult data file in IBM SPSS Modeler 18. The (X, Y) coordinates of the clusters are: (0, 0) for 1, (0, 1) for 2, (0, 2) for 3, (1, 0) for 4, (1, 1) for 5, (1, 2) for 6, (2, 0) for 7, (2, 1) for 8, and finally (2, 2) for 9. This classification of the clusters is used throughout this paper.

Table 1 shows the cluster sizes. The smallest cluster has 6 members, while the largest cluster has 6514 members. The remaining clusters fall between the two.
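
The ratio and percentage reported in Table 1 can be checked with simple arithmetic. The short sketch below assumes the data file holds 25000 records, the total reported later in this paper; the variable names are illustrative.

```python
smallest, largest = 6, 6514   # cluster sizes from Table 1
total = 25000                 # total number of records (assumed from later sections)

ratio = round(largest / smallest, 2)            # largest-to-smallest ratio
largest_share = round(100 * largest / total, 1) # percent of records in the largest cluster
smallest_share = round(100 * smallest / total, 1)

print(ratio, largest_share, smallest_share)  # 1085.67 26.1 0.0
```

A ratio this large means that at least one node of the map attracted almost no records.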

Question 8: Scatter Plot (With X/Y Agitation) of the Cluster Membership with an Overlay of Income

Figure 3. Scatter Plot of the Cluster Membership with an Overlay of Income

From Figure 3, it is evident that a majority of the clusters fall in the > 50K income level. Clusters 1, 2, 3, 4, 7, and 9 fall in the > 50K income level, while clusters 5, 6, and 8 fall in the <= 50K income level.
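
Because every record in a cluster shares the same (X, Y) grid position, the points would sit exactly on top of each other without agitation. A minimal sketch of the idea, assuming uniform random noise and a made-up list of memberships (the function name and the noise amount are not taken from Modeler):

```python
import random

def jitter(points, amount=0.3, seed=0):
    """Return a copy of the points with uniform noise in [-amount, amount] added to x and y."""
    rng = random.Random(seed)
    return [(x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
            for x, y in points]

# Made-up memberships: three records at grid position (0, 0) and two at (2, 1).
memberships = [(0, 0), (0, 0), (0, 0), (2, 1), (2, 1)]
plotted = jitter(memberships)  # positions to draw, colored by income in the real plot
```

Keeping the noise amount below 0.5 ensures that the agitated points of neighboring grid positions never overlap.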

Question 9: Bar Chart of the Cluster Membership with an Overlay of Income

Figure 4. Bar Chart of the Cluster Membership with an Overlay of Income

From Figure 4, the cluster members with an income level of <= 50K were the most numerous, at slightly more than 20,000, while the cluster members with an income level of > 50K were the fewest, at close to 10,000.
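
A bar chart with an overlay is, underneath, a cross-tabulation of cluster membership against the overlay field. The sketch below shows one way such counts could be produced; the helper names and the eight sample records are assumptions for illustration only.

```python
from collections import Counter

def overlay_counts(memberships, overlay):
    """Count records per (cluster, overlay value): the heights of the bar segments."""
    return Counter(zip(memberships, overlay))

def overlay_shares(counts):
    """Divide each segment by its cluster total so clusters of different sizes compare fairly."""
    totals = Counter()
    for (cluster, _), n in counts.items():
        totals[cluster] += n
    return {key: n / totals[key[0]] for key, n in counts.items()}

# Made-up cluster labels and income values for eight records.
clusters = [1, 1, 1, 2, 2, 2, 2, 9]
income = ["<=50K", "<=50K", ">50K", ">50K", ">50K", ">50K", "<=50K", "<=50K"]

counts = overlay_counts(clusters, income)
shares = overlay_shares(counts)
```

The same two functions work unchanged with marital status, or any other categorical field, as the overlay.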

Question 10: Bar Chart of the Cluster Membership with an Overlay of Marital Status

Figure 5. Bar Chart of the Cluster Membership with an Overlay of Marital Status

From Figure 5, cluster members with the married civilian spouse marital status were the most numerous, with slightly more than 20,000 entries, followed by the never-married marital status with close to 10,000 entries. The other marital statuses had negligible entries, especially widowed and married armed forces spouse.

Question 11: Web Graph of Income, Marital Status, and the Other Categorical Variables

Figure 6. Web Graph of Income, Marital Status, and the Other Categorical Variables

Figure 6 represents the web nodes generated by IBM SPSS Modeler 18 for the categorical variables. The graph shows the existing relationships between the variables: strong, normal, weak, and none. There is a strong connection between United States and white. There is a normal connection between United States and <= 50K, United States and private, United States and male, white and <= 50K, white and private, white and male, male and private, male and <= 50K, and private and <= 50K. There is a weak connection between married civilian spouse and husband, married civilian spouse and United States, married civilian spouse and white, married civilian spouse and male, husband and male, husband and white, and United States and husband. The other pairs of values did not meet the threshold required by the web graph and therefore have no connections drawn.

Question 12: Numerical Summaries for the Clusters

Table 2. Cluster Summary

| Item | Value |
|---|---|
| Algorithm | Kohonen |
| Inputs | 13 |
| Clusters | 9 |

Figure 7. Cluster Quality

From Table 2 and Figure 7, we can derive the numerical summaries for the clusters. The Kohonen algorithm used 13 inputs to produce 9 clusters. However, Figure 7 indicates that the cluster quality was poor.
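
A cluster mean summary of the kind requested in the exercise can be produced by grouping the records by cluster and averaging each numeric field. The sketch below is a minimal illustration; the function name, the field names, and the three records are hypothetical and do not come from the adult data set.

```python
from collections import defaultdict

def cluster_means(records, cluster_key, fields):
    """Return {cluster: {field: mean}} for the given numeric fields."""
    sums = defaultdict(lambda: defaultdict(float))
    sizes = defaultdict(int)
    for rec in records:
        cluster = rec[cluster_key]
        sizes[cluster] += 1
        for field in fields:
            sums[cluster][field] += rec[field]
    return {c: {f: sums[c][f] / sizes[c] for f in fields} for c in sizes}

# Made-up records with a cluster label and two numeric fields.
records = [
    {"cluster": 1, "age": 25, "hours": 40},
    {"cluster": 1, "age": 29, "hours": 30},
    {"cluster": 2, "age": 42, "hours": 45},
]
means = cluster_means(records, "cluster", ["age", "hours"])
print(means[1])  # {'age': 27.0, 'hours': 35.0}
```

For categorical fields, the analogous summary is the most frequent value in each cluster together with its share.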

Question 13: Cluster Profiles

Table 3. All Nine Cluster Profiles

| Attribute | Cluster 1 (X = 0, Y = 0) | Cluster 2 (X = 0, Y = 2) | Cluster 3 (X = 2, Y = 0) | Cluster 4 (X = 2, Y = 2) | Cluster 5 (X = 2, Y = 1) | Cluster 6 (X = 1, Y = 0) | Cluster 7 (X = 1, Y = 1) | Cluster 8 (X = 1, Y = 2) | Cluster 9 (X = 0, Y = 1) |
|---|---|---|---|---|---|---|---|---|---|
| Age | 27.05 | 41.8 | 44.85 | 46.88 | 43.31 | 37.12 | 36.47 | 43.15 | 30.92 |
| Education | | 34.4% (HS-grad) | 40.3% (HS-grad) | 27.1% (HS-grad) | 32.5% (HS-grad) | 36.7% (HS-grad) | 38.4% (HS-grad) | 84.4% (HS-grad) | 57.9% (HS-grad) |
| Hours per week | 36.65 | 44.04 | 39.15 | 43.33 | 39.16 | 39.64 | 40 | 42.67 | 41.91 |
| Marital status | | | 70.7% (Divorced) | | | | | | |
| Occupation | | 21.9% (Craft repair) | 21.1% (Adm clerical) | | | 22.4% (Adm clerical) | 39.5% (Craft repair) | 83% (Craft repair) | 50.5% (Craft repair) |
| Relationship | 47.6% (Own child) | 99.9% (Husband) | 51.1% (Not in family) | 88% (Husband) | 50% (Wife) | 49.3% (Not in family) | | 91.9% (Husband) | 61.1% (Own child) |
| Sex | 62.1% (Male) | 100% (Male) | 75.6% (Female) | 88.7% (Male) | 60.2% (Female) | 79.1% (Female) | 67.4% (Male) | 100% (Male) | 100% (Male) |
| Work class | 78.7% (Private) | 100% (Private) | 79% (Private) | | 49.9% (Private) | 67.9% (Private) | 95.3% (Private) | 34.1% (Local-gov) | 100% (Private) |
| Race (White) | 84.5% | 90.5% | 81.1% | 90.3% | 84.2% | 65.7% | 84.3% | 95.6% | 87.4% |
| Education num | 9.96 | 10.12 | 9.66 | 10.78 | 10.23 | 10.03 | 8.7 | 9.29 | 8.6 |
| Capital gain | 315.47 | 1618.92 | 608.86 | 2144.11 | 1496.92 | 748.47 | 350.97 | 1294.97 | 412.15 |
| Capital loss | 52.79 | 113.91 | 59.45 | 139.64 | 85.57 | 65.05 | 86.62 | 129.99 | 18.55 |
| Demographic weight | 195625.44 | 190095.14 | 186497.64 | 182469.64 | 180797.52 | 193074.63 | 221616.06 | 180708.54 | 213360.01 |
| Total size | 7270 (29.1%) | 6506 (26%) | 4172 (16.7%) | 3906 (15.6%) | 1530 (6.1%) | 1214 (4.9%) | 172 (0.7%) | 135 (0.5%) | 95 (0.4%) |

Table 3 above shows all nine clusters and the distribution of their variables:

  • Cluster 1 is characterized by a mean age of 27.05, 36.65 hours per week, 47.6% own-child relationships, 62.1% male, and 78.7% private work class.
  • Cluster 2 is characterized by a mean age of 41.8, 34.4% high school graduate education level, 44.04 hours per week, 21.9% craft repair occupation, 99.9% husband relationships, 100% male, and 100% private work class.
  • Cluster 3 is characterized by a mean age of 44.85, 40.3% high school graduate education level, 39.15 hours per week, 70.7% divorced marital status, 21.1% adm clerical occupation, 51.1% not-in-family relationships, 75.6% female, and 79% private work class.
  • Cluster 4 is characterized by a mean age of 46.88, 27.1% high school graduate education level, 43.33 hours per week, 88% husband relationships, and 88.7% male.
  • Cluster 5 is characterized by a mean age of 43.31, 32.5% high school graduate education level, 39.16 hours per week, 50% wife relationships, 60.2% female, and 49.9% private work class.
  • Cluster 6 is characterized by a mean age of 37.12, 36.7% high school graduate education level, 39.64 hours per week, 22.4% adm clerical occupation, 49.3% not-in-family relationships, 79.1% female, and 67.9% private work class.
  • Cluster 7 is characterized by a mean age of 36.47, 38.4% high school graduate education level, 40 hours per week, 39.5% craft repair occupation, 67.4% male, and 95.3% private work class.
  • Cluster 8 is characterized by a mean age of 43.15, 84.4% high school graduate education level, 42.67 hours per week, 83% craft repair occupation, 91.9% husband relationships, 100% male, and 34.1% local government work class.
  • Cluster 9 is characterized by a mean age of 30.92, 57.9% high school graduate education level, 41.91 hours per week, 50.5% craft repair occupation, 61.1% own-child relationships, 100% male, and 100% private work class.

The cells highlighted in orange have the most significance to the cluster distribution and are therefore used for cluster profiling as well.

Question 14: CART Decision Tree Model for Classifying Income

Table 4. CART Decision Tree Model for Classifying Income

Node 0

| Income category | % | n |
|---|---|---|
| <= 50K | 75.805 | 13247 |
| > 50K | 24.195 | 4228 |
| Total | 100.000 | 17475 |

Using cluster membership as an input, the CART model was able to predict 17475 out of 25000 records, indicating that cluster membership is important in classifying income (see Table 4 and Table 5).
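
The percentage column of Table 4 follows directly from the node counts. A minimal sketch of that arithmetic, using a hypothetical helper name:

```python
def class_percentages(counts):
    """Convert per-class record counts into percentages of the node total."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 3) for label, n in counts.items()}

# Node 0 counts from Table 4.
node0 = {"<=50K": 13247, ">50K": 4228}
percentages = class_percentages(node0)
print(sum(node0.values()), percentages)  # 17475 {'<=50K': 75.805, '>50K': 24.195}
```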

Question 15: C4.5 Decision Tree Model for Classifying Income

Table 5. C4.5 Decision Tree Model for Classifying Income

Node 0

| Income category | % | n |
|---|---|---|
| <= 50K | 76.064 | 19016 |
| > 50K | 23.936 | 5984 |
| Total | 100.000 | 25000 |

Using cluster membership as an input, the C4.5 decision tree model was able to predict 25000 out of 25000 records, indicating that cluster membership is important in classifying income. The C4.5 model is more effective than the CART model, since it predicted 25000 records versus 17475 for the CART model (see Table 4 and Table 5).

References

Satyanarayana, A. (2013, October). Software tools for teaching undergraduate data mining course. In American Society of Engineering Education Mid-Atlantic Fall Conference, Washington, DC.
