Hands on Analysis Project
Executive Summary
From North to South, East to West, and Europe to North America, researchers in the social sciences have used cluster analysis as part of a sequence of analyses: factor analysis, then cluster analysis, and finally discriminant analysis. To start with, factor analysis is introduced to standardize the variables in the data set, which in turn makes it simpler for researchers to run cluster analysis. Factor analysis also minimizes the effects of multicollinearity in the data set. Cluster analysis then identifies the groupings. Cluster analysis is the best means of coming up with uniform groups: members of a cluster share characteristics and are dissimilar from members of other clusters. Meanwhile, cluster analysis in most cases relies on discriminant analysis, because clustering does not in any way test measures of fit (Harrell, 2015). Discriminant analysis also aids the clustering by checking whether there are outliers in the data sets under research, and the variables should be standardized before analysis. Thus, in most cases discriminant analysis acts as a predictive model that allows the researcher to introduce new cases into the data analysis process.
In the current case we have two models in our data analysis, and it is therefore important to understand the various clustering methods when examining the standardized scores. First, K-means clustering is a method used with large data sets, in contrast to the hierarchical model. Social scientists and researchers must therefore consider all the variables in the data and decide how cases will be allocated to clusters before the clustering process starts. On the other hand, it is important to define the two-step cluster process, which runs a pre-clustering step at the start before embarking on the hierarchical methods (Kou, Peng & Wang, 2014). There is no holy grail in data analysis, so one needs to understand the underlying data set before doing any work in the process.
K-means and Kohonen clustering were used. First, the data was standardized before modeling was carried out. A model using K-means clustering was then developed. At first, four clusters were used. Cluster 2 was underrepresented, with only 15 cases. Each variable had a significant impact in determining cluster membership (p < 0.05). ANOVA of the clusters showed that the cluster means differed significantly across at least two of the variables, and the null hypothesis was rejected. Median income, longitude and latitude had a low relative weight (F), making them less important in the model. There were no significant differences between some clusters for median income, median age, longitude and latitude; hence the model was not the best. Another model was developed with three clusters and without median income, and it proved to be reliable and significant.
The training data cases were divided into two halves, and K-means clustering was carried out on each half. Median income was eliminated from the model, and the model was validated by ANOVA and proved to be reliable and suitable for each half. The cluster centroids of the two solutions were then compared and found to be similar, indicating that the model had a high degree of stability. In addition, different clustering procedures were run on the same data and yielded the same results, showing that the K-means clustering model was stable, reliable and valid. Besides, a neural network model was developed using Kohonen clustering, and a graphical representation of its output was drawn.
Cluster analysis is a form of data classification that works by splitting the data set into groups called clusters. It categorizes cases into groups by use of variables. In K-means clustering, the number of clusters is determined first and the data set is normalized by standardization. Statistical Package for the Social Sciences (SPSS) Version 23 was used for the analysis of data in this project. The data was imported from Excel into SPSS. For K-means cluster analysis, the training data was split into two halves and then analyzed. From the main menu of SPSS, the Analyze menu was clicked, followed by Classify, to get K-Means Cluster analysis. K-means clustering is more powerful as it is less affected by outside factors and unimportant variables.
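The K-means procedure described above can be sketched in a few lines of Python. This is only an illustration of the general algorithm (assign each case to its nearest center, then recompute the centers), not the SPSS implementation used in this project. The function names and the toy z-scores are assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two cases."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, iterations=100, seed=0):
    """Plain K-means on already standardized rows; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(rows, k)  # start from k randomly chosen cases
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: each case joins the cluster with the nearest center.
        labels = [min(range(k), key=lambda c: squared_distance(row, centers[c]))
                  for row in rows]
        # Update step: each center becomes the mean of its member cases.
        new_centers = []
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # centers stopped moving, so the solution has converged
        centers = new_centers
    return centers, labels

# Hypothetical z-scores for two variables (not taken from the project data).
rows = [[-0.9, -0.8], [-1.0, -1.1], [-0.8, -0.9],
        [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
centers, labels = kmeans(rows, k=2)
print(labels)
```

The cluster numbers returned are arbitrary labels, so only the grouping of cases is meaningful, not which group is called 0 or 1.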
Findings
Model using K-means clustering
To start with, the data was standardized using SPSS descriptive analysis so that all variables were on a common scale. See table below.
Table 1: Initial cluster centers
Initial Cluster Centers
                                            Cluster
Variable                          1          2          3          4
Zscore: median income         5.83896     .68233     .63195    1.90043
Zscore: total bedrooms        1.29152   10.35909    7.50326   11.93572
Zscore: housing median age    1.85542    1.16489    1.32385    1.95971
Zscore: households            1.31521   11.31932    7.70491   12.43226
Zscore: total rooms           1.22612   10.56149    3.44961   16.56719
Zscore: latitude              1.01064    1.06698     .80494     .81430
Zscore: longitude             1.46421    1.07192     .61262     .91216
Zscore: population            1.25250   30.52294    5.35399   13.09807
Thus, the cases were labeled by median house value. Also, the cluster centers changed between the initial and final solutions; hence the condition of randomness of observations in the various sets was met, and we can use ANOVA of the clusters to test hypotheses about the variable means.
Table 2: Final cluster centers
Final Cluster Centers
                                            Cluster
Variable                          1          2          3          4
Zscore: median income          .03142     .53193     .08320     .26679
Zscore: total bedrooms         .39172    9.95042     .94852    4.22428
Zscore: housing median age     .22554    1.65238     .66805    1.17468
Zscore: households             .39068   10.26927     .94508    4.20941
Zscore: total rooms            .37223   11.15571     .87217    4.28370
Zscore: latitude               .06082     .48144     .18961     .20392
Zscore: longitude              .06267     .41259     .19019     .27474
Zscore: population             .36428   11.91177     .87235    3.92858
It is important to note that there were no missing cases. In the first K-means clustering model, four clusters were used. It was discovered that cluster 2 was underrepresented, with only 15 cases. See table below.
Table 3: Cases in clusters
Number of Cases in each Cluster
Cluster 1      14075.000
Cluster 2         15.000
Cluster 3       4101.000
Cluster 4        349.000
Valid          18540.000
Missing             .000
Besides, the distance between the four clusters was determined as shown below. The distance between clusters 1 and 2 was the highest at 22.551, followed by the distance between clusters 2 and 3 at 19.922.
Table 4: Distance between clusters
Distances between Final Cluster Centers
Cluster          1          2          3          4
1                -     22.551      2.756      9.209
2           22.551          -     19.922     13.448
3            2.756     19.922          -      6.532
4            9.209     13.448      6.532          -
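A distance table of this kind is a matrix of pairwise Euclidean distances between the final cluster centers, taken over all clustering variables. The sketch below shows the calculation. The centers are hypothetical two-variable values chosen so that the arithmetic is easy to check by hand, and the function name is an assumption.

```python
import math

def center_distances(centers):
    """Symmetric matrix of Euclidean distances between cluster centers."""
    k = len(centers)
    return [[math.dist(centers[i], centers[j]) for j in range(k)]
            for i in range(k)]

# Hypothetical centers on two standardized variables.
centers = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
matrix = center_distances(centers)
for row in matrix:
    print(["%.3f" % d for d in row])
```

The matrix is symmetric with zeros on the diagonal, which is why the SPSS table leaves the diagonal blank and mirrors each value.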
Analysis of variance (ANOVA) was done to determine the impact of each variable on the determination of the clusters. Each variable had a significant impact (p < 0.05), with the p value of every variable reported as .000 (p < 0.001). Median income, longitude and latitude had a low relative weight (F), making them less important in the model. Households had the greatest impact on the formation of the clusters with F of 16906.378, followed by total bedrooms with F of 16837.680 and total rooms with F of 15856.410, while median income had the least impact with F of 23.877. See table below.
Table 5: Analysis of variances on clusters
ANOVA
                                   Cluster               Error
Variable                      Mean Square   df     Mean Square     df            F     Sig.
Zscore: median income              23.789    3            .996   18536       23.877    .000
Zscore: total bedrooms           4520.754    3            .268   18536    16837.680    .000
Zscore: housing median age       1022.916    3            .835   18536     1225.628    .000
Zscore: households               4525.691    3            .268   18536    16906.378    .000
Zscore: total rooms              4446.879    3            .280   18536    15856.410    .000
Zscore: latitude                   72.497    3            .988   18536       73.346    .000
Zscore: longitude                  77.504    3            .988   18536       78.476    .000
Zscore: population               4167.780    3            .326   18536    12799.585    .000

The F tests should be used only for descriptive purposes because the clusters have been chosen to maximize the differences among cases in different clusters. The observed significance levels are not corrected for this and thus cannot be interpreted as tests of the hypothesis that the cluster means are equal.
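Each F value in the ANOVA table is the ratio of the between-cluster mean square to the within-cluster (error) mean square, with k - 1 and n - k degrees of freedom. For four clusters and 18540 cases this gives 3 and 18536, as reported. The sketch below computes the ratio for one variable. The function name and the toy groups are assumptions for illustration.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for one variable split into groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-cluster sum of squares: spread of cluster means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-cluster (error) sum of squares: spread of cases around their own mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within, k - 1, n - k

# Hypothetical values of one standardized variable in three clusters.
groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
f_value, df_between, df_within = one_way_anova_f(groups)
print(f_value, df_between, df_within)
```

A large F therefore means that the clusters are far apart on that variable relative to the scatter inside each cluster, which is how the table is read in this report.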

Meanwhile, cluster 2 had higher values for total bedrooms, households, total rooms and population than the other clusters. Besides, housing median age has negative values for clusters 2, 3 and 4. See figures below.
Figure 1: A bar graph of final cluster center values for each cluster
Figure 2: Area graph for final cluster center values
The values for population were the highest, followed by total rooms, with housing median age being the lowest for cluster 2. The values for cluster 1 were concentrated together. See scatter plot below.
Figure 3: A scatter plot showing distribution of values
Model Validation
Meanwhile, multiple comparisons of means with the Bonferroni correction were used to validate this model. There were no significant differences between clusters 1 and 2, 2 and 3, and 2 and 4 for median income; between clusters 2 and 4 for housing median age; between clusters 1 and 2, 2 and 3, 2 and 4, and 3 and 4 for longitude; and between clusters 1 and 2, 2 and 3, 2 and 4, and 3 and 4 for latitude, as their p values were greater than 0.05. Hence the model was not the best. See table below.
Table 6: ANOVA for means
Multiple Comparisons 

Bonferroni 

Dependent Variable 
(I) Cluster Number of Case 
(J) Cluster Number of Case 
Mean Difference (I-J)
Std. Error 
Sig. 
95% Confidence Interval 

Lower Bound 
Upper Bound 

Zscore: median income 
1 
2 
.56335369 
.25785957 
.174 
1.2437266 
.1170192 
3 
.11461946^{*} 
.01771241 
.000 
.1613544 
.0678845 

4 
.29821146^{*} 
.05408833 
.000 
.4409257 
.1554972 

2 
1 
.56335369 
.25785957 
.174 
.1170192 
1.2437266 

3 
.44873423 
.25819318 
.493 
.2325189 
1.1299874 

4 
.26514224 
.26320246 
1.000 
.4293281 
.9596126 

3 
1 
.11461946^{*} 
.01771241 
.000 
.0678845 
.1613544 

2 
.44873423 
.25819318 
.493 
1.1299874 
.2325189 

4 
.18359200^{*} 
.05565703 
.006 
.3304453 
.0367387 

4 
1 
.29821146^{*} 
.05408833 
.000 
.1554972 
.4409257 

2 
.26514224 
.26320246 
1.000 
.9596126 
.4293281 

3 
.18359200^{*} 
.05565703 
.006 
.0367387 
.3304453 

Zscore: total bedrooms 
1 
2 
10.34214230^{*} 
.13385975 
.000 
10.6953367 
9.9889479 
3 
1.34024106^{*} 
.00919484 
.000 
1.3645020 
1.3159801 

4 
4.61599478^{*} 
.02807827 
.000 
4.6900804 
4.5419091 

2 
1 
10.34214230^{*} 
.13385975 
.000 
9.9889479 
10.6953367 

3 
9.00190124^{*} 
.13403293 
.000 
8.6482499 
9.3555526 

4 
5.72614753^{*} 
.13663334 
.000 
5.3656349 
6.0866601 

3 
1 
1.34024106^{*} 
.00919484 
.000 
1.3159801 
1.3645020 

2 
9.00190124^{*} 
.13403293 
.000 
9.3555526 
8.6482499 

4 
3.27575372^{*} 
.02889261 
.000 
3.3519880 
3.1995194 

4 
1 
4.61599478^{*} 
.02807827 
.000 
4.5419091 
4.6900804 

2 
5.72614753^{*} 
.13663334 
.000 
6.0866601 
5.3656349 

3 
3.27575372^{*} 
.02889261 
.000 
3.1995194 
3.3519880 

Zscore: housing median age 
1 
2 
1.87791317^{*} 
.23600780 
.000 
1.2551970 
2.5006293 
3 
.89359155^{*} 
.01621141 
.000 
.8508171 
.9363660 

4 
1.40021802^{*} 
.04950472 
.000 
1.2695978 
1.5308382 

2 
1 
1.87791317^{*} 
.23600780 
.000 
2.5006293 
1.2551970 

3 
.98432163^{*} 
.23631313 
.000 
1.6078434 
.3607998 

4 
.47769516 
.24089791 
.284 
1.1133141 
.1579238 

3 
1 
.89359155^{*} 
.01621141 
.000 
.9363660 
.8508171 

2 
.98432163^{*} 
.23631313 
.000 
.3607998 
1.6078434 

4 
.50662647^{*} 
.05094049 
.000 
.3722179 
.6410350 

4 
1 
1.40021802^{*} 
.04950472 
.000 
1.5308382 
1.2695978 

2 
.47769516 
.24089791 
.284 
.1579238 
1.1133141 

3 
.50662647^{*} 
.05094049 
.000 
.6410350 
.3722179 

Zscore: households 
1 
2 
10.65995545^{*} 
.13366042 
.000 
11.0126239 
10.3072870 
3 
1.33576367^{*} 
.00918115 
.000 
1.3599885 
1.3115388 

4 
4.60009077^{*} 
.02803646 
.000 
4.6740661 
4.5261154 

2 
1 
10.65995545^{*} 
.13366042 
.000 
10.3072870 
11.0126239 

3 
9.32419178^{*} 
.13383335 
.000 
8.9710671 
9.6773165 

4 
6.05986468^{*} 
.13642988 
.000 
5.6998889 
6.4198405 

3 
1 
1.33576367^{*} 
.00918115 
.000 
1.3115388 
1.3599885 

2 
9.32419178^{*} 
.13383335 
.000 
9.6773165 
8.9710671 

4 
3.26432710^{*} 
.02884959 
.000 
3.3404479 
3.1882063 

4 
1 
4.60009077^{*} 
.02803646 
.000 
4.5261154 
4.6740661 

2 
6.05986468^{*} 
.13642988 
.000 
6.4198405 
5.6998889 

3 
3.26432710^{*} 
.02884959 
.000 
3.1882063 
3.3404479 

Zscore: total rooms 
1 
2 
11.52794374^{*} 
.13680782 
.000 
11.8889167 
11.1669708 
3 
1.24440072^{*} 
.00939735 
.000 
1.2691960 
1.2196054 

4 
4.65592632^{*} 
.02869665 
.000 
4.7316436 
4.5802090 

2 
1 
11.52794374^{*} 
.13680782 
.000 
11.1669708 
11.8889167 

3 
10.28354302^{*} 
.13698481 
.000 
9.9221030 
10.6449830 

4 
6.87201742^{*} 
.13964249 
.000 
6.5035650 
7.2404698 

3 
1 
1.24440072^{*} 
.00939735 
.000 
1.2196054 
1.2691960 

2 
10.28354302^{*} 
.13698481 
.000 
10.6449830 
9.9221030 

4 
3.41152560^{*} 
.02952893 
.000 
3.4894389 
3.3336123 

4 
1 
4.65592632^{*} 
.02869665 
.000 
4.5802090 
4.7316436 

2 
6.87201742^{*} 
.13964249 
.000 
7.2404698 
6.5035650 

3 
3.41152560^{*} 
.02952893 
.000 
3.3336123 
3.4894389 

Zscore: longitude 
1 
2 
.47525902 
.25673210 
.385 
1.1526571 
.2021390 
3 
.25285598^{*} 
.01763496 
.000 
.2993866 
.2063254 

4 
.33740750^{*} 
.05385183 
.000 
.4794977 
.1953173 

2 
1 
.47525902 
.25673210 
.385 
.2021390 
1.1526571 

3 
.22240303 
.25706424 
1.000 
.4558714 
.9006775 

4 
.13785152 
.26205162 
1.000 
.5535823 
.8292853 

3 
1 
.25285598^{*} 
.01763496 
.000 
.2063254 
.2993866 

2 
.22240303 
.25706424 
1.000 
.9006775 
.4558714 

4 
.08455151 
.05541367 
.762 
.2307627 
.0616597 

4 
1 
.33740750^{*} 
.05385183 
.000 
.1953173 
.4794977 

2 
.13785152 
.26205162 
1.000 
.8292853 
.5535823 

3 
.08455151 
.05541367 
.762 
.0616597 
.2307627 

Zscore: latitude 
1 
2 
.54225836 
.25683740 
.209 
.1354175 
1.2199342 
3 
.25042989^{*} 
.01764220 
.000 
.2038802 
.2969795 

4 
.26473757^{*} 
.05387392 
.000 
.1225891 
.4068861 

2 
1 
.54225836 
.25683740 
.209 
1.2199342 
.1354175 

3 
.29182847 
.25716968 
1.000 
.9703811 
.3867242 

4 
.27752079 
.26215910 
1.000 
.9692382 
.4141966 

3 
1 
.25042989^{*} 
.01764220 
.000 
.2969795 
.2038802 

2 
.29182847 
.25716968 
1.000 
.3867242 
.9703811 

4 
.01430769 
.05543640 
1.000 
.1319635 
.1605789 

4 
1 
.26473757^{*} 
.05387392 
.000 
.4068861 
.1225891 

2 
.27752079 
.26215910 
1.000 
.4141966 
.9692382 

3 
.01430769 
.05543640 
1.000 
.1605789 
.1319635 

Zscore: population 
1 
2 
12.27605023^{*} 
.14741446 
.000 
12.6650093 
11.8870912 
3 
1.23663040^{*} 
.01012592 
.000 
1.2633480 
1.2099127 

4 
4.29286595^{*} 
.03092149 
.000 
4.3744535 
4.2112784 

2 
1 
12.27605023^{*} 
.14741446 
.000 
11.8870912 
12.6650093 

3 
11.03941983^{*} 
.14760518 
.000 
10.6499576 
11.4288821 

4 
7.98318429^{*} 
.15046891 
.000 
7.5861660 
8.3802026 

3 
1 
1.23663040^{*} 
.01012592 
.000 
1.2099127 
1.2633480 

2 
11.03941983^{*} 
.14760518 
.000 
11.4288821 
10.6499576 

4 
3.05623555^{*} 
.03181829 
.000 
3.1401894 
2.9722817 

4 
1 
4.29286595^{*} 
.03092149 
.000 
4.2112784 
4.3744535 

2 
7.98318429^{*} 
.15046891 
.000 
8.3802026 
7.5861660 

3 
3.05623555^{*} 
.03181829 
.000 
2.9722817 
3.1401894 

*. The mean difference is significant at the 0.05 level. 
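The Bonferroni correction used for these comparisons guards against false positives when many pairs are tested at once. In its usual form, each raw p value is multiplied by the number of comparisons and capped at 1, which is why several entries in the table read 1.000. The sketch below assumes that standard form. The raw p values are hypothetical and the function name is illustrative.

```python
from itertools import combinations

def bonferroni(p_values):
    """Multiply each p value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

clusters = [1, 2, 3, 4]
pairs = list(combinations(clusters, 2))  # 6 pairwise comparisons for 4 clusters
raw_p = [0.029, 0.0821, 0.0001, 0.4, 0.001, 0.2]  # hypothetical unadjusted values
adjusted = dict(zip(pairs, bonferroni(raw_p)))
for pair, p in adjusted.items():
    verdict = "significant" if p < 0.05 else "not significant"
    print(pair, round(p, 3), verdict)
```

With three clusters there are only three pairs, so the same raw p value is penalized less in the three-cluster models than in the four-cluster model.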
When latitude and longitude were correlated, controlling for cluster membership, there was a significant correlation between them (p < 0.001). See table below.
Correlations
Control variable: Cluster Number of Case
                                        latitude   longitude
latitude    Correlation                    1.000        .924
            Significance (2-tailed)            .        .000
            df                                 0       18537
longitude   Correlation                     .924       1.000
            Significance (2-tailed)         .000           .
            df                             18537           0
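Because the table lists cluster number as a control variable, the reported value is a partial correlation: the correlation between latitude and longitude after removing what each shares with cluster membership. Its degrees of freedom are n - 3, which matches 18540 - 3 = 18537. The sketch below uses the standard first-order partial correlation formula. The data and function names are hypothetical.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def partial_correlation(x, y, z):
    """Correlation of x and y with the linear effect of z removed from both."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / ((1 - rxz ** 2) * (1 - ryz ** 2)) ** 0.5

# Hypothetical values: x and y stand in for latitude and longitude,
# z for the cluster number of each case.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [6.1, 5.2, 3.9, 3.1, 1.8, 1.2]
z = [1, 1, 2, 2, 3, 3]
r_partial = partial_correlation(x, y, z)
print(round(r_partial, 3), "df =", len(x) - 3)
```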
Best model omitting median income variable
When median income is omitted, the model becomes more significant. The same procedure as in the first model was used to generate this model, with median income omitted. The final and initial cluster centers were different. See table below.
Table 7: Final cluster centers
Final Cluster Centers
                                       Cluster
Variable                          1          2          3
Zscore: total bedrooms         .33572    1.27822    5.45836
Zscore: housing median age     .17080     .74608    1.32810
Zscore: households             .33453    1.27103    5.47906
Zscore: total rooms            .31828    1.17847    5.67971
Zscore: latitude               .03790     .16913     .24076
Zscore: longitude              .03998     .17521     .30244
Zscore: population             .30959    1.15953    5.32390
Cluster 3 has the highest values for total bedrooms, households, total rooms and population, while cluster 1 has the lowest. The values vary from one cluster to another, and housing median age had negative values for clusters 2 and 3. It is evident from the graphs that cluster 1 had lower values than clusters 2 and 3. See figures below.
Figure 4: Bar graph of cluster values
Figure 5: Pie chart of cluster values
Figure 6: Line graph for cluster values
The relative weight (F) is high except for longitude and latitude. Total bedrooms had the greatest impact on the clusters with F of 21449.642 (p < 0.001), followed by households with F of 21325.417 (p < 0.001). Latitude and longitude had a small impact on the clusters with F of 61.885 and 69.948 respectively (both p < 0.001). See table below.
Table 8: ANOVA for influence of variables on clusters
ANOVA
                                   Cluster               Error
Variable                      Mean Square   df     Mean Square     df            F     Sig.
Zscore: total bedrooms           6472.639    2            .302   18537    21449.642    .000
Zscore: housing median age       1271.175    2            .863   18537     1473.044    .000
Zscore: households               6461.283    2            .303   18537    21325.417    .000
Zscore: total rooms              6257.875    2            .325   18537    19259.074    .000
Zscore: latitude                   61.481    2            .993   18537       61.885    .000
Zscore: longitude                  69.432    2            .993   18537       69.948    .000
Zscore: population               5744.055    2            .380   18537    15101.295    .000

The F tests should be used only for descriptive purposes because the clusters have been chosen to maximize the differences among cases in different clusters. The observed significance levels are not corrected for this and thus cannot be interpreted as tests of the hypothesis that the cluster means are equal.

When multiple comparisons of means were carried out, it was found that there were no significant differences between clusters 2 and 3 for latitude and longitude. See table below.
Table 9: ANOVA
Multiple Comparisons 

Bonferroni 

Dependent Variable 
(I) Cluster Number of Case 
(J) Cluster Number of Case 
Mean Difference (I-J)
Std. Error 
Sig. 
95% Confidence Interval 

Lower Bound 
Upper Bound 

Zscore: total bedrooms 
1 
2 
1.61394020^{*} 
.01080007 
.000 
1.6397977 
1.5880827 
3 
5.79408034^{*} 
.03853159 
.000 
5.8863326 
5.7018281 

2 
1 
1.61394020^{*} 
.01080007 
.000 
1.5880827 
1.6397977 

3 
4.18014013^{*} 
.03951790 
.000 
4.2747538 
4.0855265 

3 
1 
5.79408034^{*} 
.03853159 
.000 
5.7018281 
5.8863326 

2 
4.18014013^{*} 
.03951790 
.000 
4.0855265 
4.2747538 

Zscore: housing median age 
1 
2 
.91688962^{*} 
.01826377 
.000 
.8731626 
.9606167 
3 
1.49890004^{*} 
.06516000 
.000 
1.3428942 
1.6549059 

2 
1 
.91688962^{*} 
.01826377 
.000 
.9606167 
.8731626 

3 
.58201042^{*} 
.06682792 
.000 
.4220112 
.7420096 

3 
1 
1.49890004^{*} 
.06516000 
.000 
1.6549059 
1.3428942 

2 
.58201042^{*} 
.06682792 
.000 
.7420096 
.4220112 

Zscore: household 
1 
2 
1.60555210^{*} 
.01082197 
.000 
1.6314620 
1.5796422 
3 
5.81358174^{*} 
.03860974 
.000 
5.9060211 
5.7211424 

2 
1 
1.60555210^{*} 
.01082197 
.000 
1.5796422 
1.6314620 

3 
4.20802963^{*} 
.03959805 
.000 
4.3028352 
4.1132241 

3 
1 
5.81358174^{*} 
.03860974 
.000 
5.7211424 
5.9060211 

2 
4.20802963^{*} 
.03959805 
.000 
4.1132241 
4.3028352 

Zscore: total rooms 
1 
2 
1.49674893^{*} 
.01120705 
.000 
1.5235808 
1.4699170 
3 
5.99799149^{*} 
.03998361 
.000 
6.0937201 
5.9022628 

2 
1 
1.49674893^{*} 
.01120705 
.000 
1.4699170 
1.5235808 

3 
4.50124257^{*} 
.04100708 
.000 
4.5994216 
4.4030635 

3 
1 
5.99799149^{*} 
.03998361 
.000 
5.9022628 
6.0937201 

2 
4.50124257^{*} 
.04100708 
.000 
4.4030635 
4.5994216 

Zscore: longitude 
1 
2 
.21519906^{*} 
.01958784 
.000 
.2620962 
.1683019 
3 
.34242285^{*} 
.06988389 
.000 
.5097387 
.1751070 

2 
1 
.21519906^{*} 
.01958784 
.000 
.1683019 
.2620962 

3 
.12722380 
.07167273 
.228 
.2988224 
.0443748 

3 
1 
.34242285^{*} 
.06988389 
.000 
.1751070 
.5097387 

2 
.12722380 
.07167273 
.228 
.0443748 
.2988224 

Zscore: latitude 
1 
2 
.20703053^{*} 
.01959630 
.000 
.1601131 
.2539479 
3 
.27866551^{*} 
.06991409 
.000 
.1112774 
.4460536 

2 
1 
.20703053^{*} 
.01959630 
.000 
.2539479 
.1601131 

3 
.07163498 
.07170370 
.953 
.1000378 
.2433077 

3 
1 
.27866551^{*} 
.06991409 
.000 
.4460536 
.1112774 

2 
.07163498 
.07170370 
.953 
.2433077 
.1000378 

Zscore: population 
1 
2 
1.46911293^{*} 
.01212545 
.000 
1.4981437 
1.4400822 
3 
5.63348592^{*} 
.04326020 
.000 
5.7370594 
5.5299125 

2 
1 
1.46911293^{*} 
.01212545 
.000 
1.4400822 
1.4981437 

3 
4.16437298^{*} 
.04436754 
.000 
4.2705976 
4.0581484 

3 
1 
5.63348592^{*} 
.04326020 
.000 
5.5299125 
5.7370594 

2 
4.16437298^{*} 
.04436754 
.000 
4.0581484 
4.2705976 

*. The mean difference is significant at the 0.05 level. 
The training data was split into two parts and clustering was carried out on each part, as one way of validating the reliability and specificity of the model. Each half contained 9270 cases. When K-means clustering was run on the two halves, the following results were obtained.
First split (half)
The data was grouped into 3 clusters after standardization, which removes the effect of variables with very large or very small scales. The final cluster centers differed from the initial cluster centers.
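The split-half check can be sketched as follows: divide the cases into two halves, cluster each half separately, and then match the centers of one solution to the nearest centers of the other. Matching is needed because cluster numbers are arbitrary, so the largest cluster may be numbered differently in each half. The helper names and the example centers below are assumptions, and the clustering step itself is left out.

```python
import math
import random

def split_halves(rows, seed=0):
    """Shuffle a copy of the cases and cut it into two halves."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def match_centers(centers_a, centers_b):
    """For each center in A, return (index of nearest center in B, distance)."""
    matches = []
    for a in centers_a:
        j = min(range(len(centers_b)), key=lambda i: math.dist(a, centers_b[i]))
        matches.append((j, math.dist(a, centers_b[j])))
    return matches

first_half, second_half = split_halves(list(range(10)))

# Hypothetical centers from clustering each half; the numbering differs between halves.
centers_a = [[0.0, 0.0], [5.0, 5.0]]
centers_b = [[5.1, 4.9], [0.1, -0.1]]
matches = match_centers(centers_a, centers_b)
print(matches)
```

Small matched distances suggest that the two halves produce similar solutions, which is the sense of stability used in this report.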
Table 10: Final cluster centers
Final Cluster Centers
                                       Cluster
Variable                          1          2          3
Zscore: housing median age     .32462     .11186    1.10700
Zscore: total rooms            .98860     .37096    4.61962
Zscore: total bedrooms        1.08480     .39198    4.44786
Zscore: longitude              .21820     .06288     .23704
Zscore: households            1.07367     .38973    4.47512
Zscore: population             .97671     .36058    4.31996
Zscore: latitude               .25258     .07229     .25398
The distance between clusters was 2.887 for clusters 1 and 2, 6.918 for clusters 1 and 3, and 9.777 for clusters 2 and 3. See table below.
Table 11: Distance between clusters
Distances between Final Cluster Centers
Cluster          1          2          3
1                -      2.887      6.918
2            2.887          -      9.777
3            6.918      9.777          -
An ANOVA test was carried out to determine the impact of each variable on the determination of the clusters. Each variable had a significant impact (p < 0.05), with the p value of every variable reported as .000 (p < 0.001), although longitude, latitude and housing median age had the lowest impacts with F of 64.973, 86.229 and 266.146 respectively. See table below.
Table 12: ANOVA of means
ANOVA
                                   Cluster               Error
Variable                      Mean Square   df     Mean Square     df            F     Sig.
Zscore: housing median age        251.744    2            .946    9267      266.146    .000
Zscore: total rooms              3285.398    2            .291    9267    11283.719    .000
Zscore: total bedrooms           3395.088    2            .267    9267    12692.415    .000
Zscore: longitude                  64.088    2            .986    9267       64.973    .000
Zscore: households               3387.371    2            .269    9267    12585.213    .000
Zscore: population               3001.578    2            .352    9267     8517.135    .000
Zscore: latitude                   84.672    2            .982    9267       86.229    .000

The F tests should be used only for descriptive purposes because the clusters have been chosen to maximize the differences among cases in different clusters. The observed significance levels are not corrected for this and thus cannot be interpreted as tests of the hypothesis that the cluster means are equal. 
Cluster 2 had the highest number of cases (7,208), followed by cluster 1 with 1,887 cases, and cluster 3 had the lowest number of cases at 175. There were no missing values. See table below.
Table 13: Number of cases in each cluster
Number of Cases in each Cluster
Cluster 1       1887.000
Cluster 2       7208.000
Cluster 3        175.000
Valid           9270.000
Missing             .000
When the characteristics of the clusters were plotted, cluster 3 had the highest values for total rooms, total bedrooms, households and population, followed by cluster 1, while cluster 2 had the lowest values. See figures below.
Figure 7: Bar graph of cluster values
Figure 8: Area graph of cluster values
Multiple comparisons of means were carried out to test the significance of the model. There were no significant differences between clusters 1 and 3 for both longitude and latitude. However, there were significant differences between the remaining pairs of clusters. This model was valid. See table below.
Table 14: Analysis of variances
Multiple Comparisons 

Bonferroni 

Dependent Variable 
(I) Cluster Number of Case 
(J) Cluster Number of Case 
Mean Difference (I-J)
Std. Error 
Sig. 
95% Confidence Interval 

Lower Bound 
Upper Bound 

Zscore: housing median age 
1 
2 
.43647723^{*} 
.02514937 
.000 
.4966952 
.3762592 
3 
.78237679^{*} 
.07685258 
.000 
.5983599 
.9663937 

2 
1 
.43647723^{*} 
.02514937 
.000 
.3762592 
.4966952 

3 
1.21885402^{*} 
.07440619 
.000 
1.0406947 
1.3970133 

3 
1 
.78237679^{*} 
.07685258 
.000 
.9663937 
.5983599 

2 
1.21885402^{*} 
.07440619 
.000 
1.3970133 
1.0406947 

Zscore: total rooms 
1 
2 
1.35956129^{*} 
.01395327 
.000 
1.3261514 
1.3929712 
3 
3.63102557^{*} 
.04263902 
.000 
3.7331211 
3.5289301 

2 
1 
1.35956129^{*} 
.01395327 
.000 
1.3929712 
1.3261514 

3 
4.99058686^{*} 
.04128173 
.000 
5.0894324 
4.8917413 

3 
1 
3.63102557^{*} 
.04263902 
.000 
3.5289301 
3.7331211 

2 
4.99058686^{*} 
.04128173 
.000 
4.8917413 
5.0894324 

Zscore: total bedrooms 
1 
2 
1.47677428^{*} 
.01337400 
.000 
1.4447514 
1.5087972 
3 
3.36306019^{*} 
.04086888 
.000 
3.4609172 
3.2652032 

2 
1 
1.47677428^{*} 
.01337400 
.000 
1.5087972 
1.4447514 

3 
4.83983448^{*} 
.03956793 
.000 
4.9345765 
4.7450924 

3 
1 
3.36306019^{*} 
.04086888 
.000 
3.2652032 
3.4609172 

2 
4.83983448^{*} 
.03956793 
.000 
4.7450924 
4.9345765 

Zscore: population 
1 
2 
1.33728979^{*} 
.01535098 
.000 
1.3005332 
1.3740464 
3 
3.34325133^{*} 
.04691022 
.000 
3.4555738 
3.2309288 

2 
1 
1.33728979^{*} 
.01535098 
.000 
1.3740464 
1.3005332 

3 
4.68054112^{*} 
.04541696 
.000 
4.7892882 
4.5717941 

3 
1 
3.34325133^{*} 
.04691022 
.000 
3.2309288 
3.4555738 

2 
4.68054112^{*} 
.04541696 
.000 
4.5717941 
4.7892882 

Zscore: latitude 
1 
2 
.32487312^{*} 
.02562424 
.000 
.3862282 
.2635181 
3 
.00140139 
.07830370 
1.000 
.1860901 
.1888929 

2 
1 
.32487312^{*} 
.02562424 
.000 
.2635181 
.3862282 

3 
.32627451^{*} 
.07581112 
.000 
.1447513 
.5077978 

3 
1 
.00140139 
.07830370 
1.000 
.1888929 
.1860901 

2 
.32627451^{*} 
.07581112 
.000 
.5077978 
.1447513 

Zscore: households 
1 
2 
1.46339713^{*} 
.01341557 
.000 
1.4312747 
1.4955196 
3 
3.40144662^{*} 
.04099590 
.000 
3.4996078 
3.3032854 

2 
1 
1.46339713^{*} 
.01341557 
.000 
1.4955196 
1.4312747 

3 
4.86484375^{*} 
.03969092 
.000 
4.9598803 
4.7698072 

3 
1 
3.40144662^{*} 
.04099590 
.000 
3.3032854 
3.4996078 

2 
4.86484375^{*} 
.03969092 
.000 
4.7698072 
4.9598803 

Zscore: longitude 
1 
2 
.28108214^{*} 
.02568213 
.000 
.2195885 
.3425758 
3 
.01883427 
.07848062 
1.000 
.2067494 
.1690809 

2 
1 
.28108214^{*} 
.02568213 
.000 
.3425758 
.2195885 

3 
.29991640^{*} 
.07598242 
.000 
.4818498 
.1179830 

3 
1 
.01883427 
.07848062 
1.000 
.1690809 
.2067494 

2 
.29991640^{*} 
.07598242 
.000 
.1179830 
.4818498 

*. The mean difference is significant at the 0.05 level. 
Graphs of the mean longitude and mean latitude per cluster were drawn and are the inverse of each other. There was a significant correlation between them (p < 0.001). See figures below.
Figure 9: Graph of mean of latitudes against cluster number of cases
Figure 10: Graph of mean of longitudes against cluster number of cases
Group two
The second half of the cases was subjected to the same modeling as the first. The data was first standardized as shown in the table below.
Table 15: Standardization of values
Descriptive Statistics
Variable                    N     Minimum     Maximum        Mean    Std. Deviation
median income            9270         .50       15.00      3.6899           1.88551
household median age     9270       29.00       52.00     39.2012           7.05775
total rooms              9270        8.00    17738.00   1963.8910        1083.16930
Populations              9270        8.00    12427.00   1148.0916         678.46333
total bedrooms           9270        1.00     3114.00    418.9786         243.00163
Households               9270        1.00     2826.00    397.3413         225.50221
Latitudes                9270       32.57       41.79     35.5337           2.01081
Longitudes               9270      124.35      114.58    119.6333           1.96730
Valid N (listwise)       9270
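The descriptive statistics above are what standardization is built on: each z-score is the raw value minus the variable's mean, divided by its standard deviation. The sketch below assumes the sample standard deviation (n - 1 in the denominator) and uses hypothetical values. The function names are illustrative.

```python
import statistics

def describe(values):
    """N, minimum, maximum, mean and sample standard deviation of one variable."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
    }

def standardize(values):
    """Convert raw values to z-scores using the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical raw values for one variable.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
summary = describe(values)
z = standardize(values)
print(summary)
print([round(v, 3) for v in z])
```

After standardization every variable has mean 0 and standard deviation 1, so variables measured in thousands (total rooms) no longer dominate variables measured in single digits (median income) in the distance calculation.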
The data was grouped into 3 clusters as shown in the table below. The initial and final cluster centers were not similar.
Table 16: Final cluster centers
Final Cluster Centers
                                         Cluster
Variable                            1          2          3
Zscore: household median age     .55873     .27495     .15304
Zscore: total rooms             2.97925     .77059     .48799
Zscore: total bedrooms          3.47380     .77901     .51263
Zscore: populations             2.95267     .75766     .48078
Zscore: longitudes               .07241     .13084     .05869
Zscore: households              3.47159     .78529     .51550
Zscore: latitudes                .00474     .15056     .07121
Cluster 1 had the highest values for households, population, total bedrooms and total rooms, followed by cluster 2, with cluster 3 having the lowest. See figures below.
Figure 11: Area graph of cluster center values
Figure 12: Bar graph of cluster values
A scatter plot shows that cluster 1 has the highest values, followed by cluster 2 and finally cluster 3 with the lowest values. The scatter plot also shows the distribution of the variables. See figure below.
Figure 13: Scatter plot of cluster center values
The distance between clusters was 7.491 for clusters 1 and 3, 4.931 for clusters 1 and 2, and 2.597 for clusters 2 and 3. See table below.
Table 17: Distance between cluster centers
Distances between Final Cluster Centers
Cluster          1          2          3
1                -      4.931      7.491
2            4.931          -      2.597
3            7.491      2.597          -
All variables had a significant impact on the clusters, with longitude and latitude having the lowest impact, as was the case in group one. See table below.
Table 18: ANOVA of means
ANOVA
                                     Cluster               Error
Variable                        Mean Square   df     Mean Square     df            F     Sig.
Zscore: household median age        220.889    2            .953    9267      231.894    .000
Zscore: total rooms                2723.164    2            .413    9267     6601.553    .000
Zscore: total bedrooms             3225.989    2            .304    9267    10612.351    .000
Zscore: populations                2653.053    2            .428    9267     6204.013    .000
Zscore: longitudes                   35.945    2            .992    9267       36.219    .000
Zscore: households                 3247.277    2            .299    9267    10846.320    .000
Zscore: latitudes                    48.277    2            .990    9267       48.774    .000

The F tests should be used only for descriptive purposes because the clusters have been chosen to maximize the differences among cases in different clusters. The observed significance levels are not corrected for this and thus cannot be interpreted as tests of the hypothesis that the cluster means are equal (Varmuza & Filzmoser, 2016). 
Cluster 1 had the lowest number of cases (256). See table below.
Table 19: Number of cases in clusters
Number of Cases in each Cluster
Cluster 1        256.000
Cluster 2       2889.000
Cluster 3       6125.000
Valid           9270.000
Missing             .000
Multiple comparisons of means were carried out in order to test the significance of the model. As in the first half of the data, there were no significant differences between clusters 1 and 3 for both longitude and latitude. However, there were significant differences between the remaining pairs of clusters. Hence the model was significant, reliable and of greater specificity. See table below.
Table 20: ANOVA to show significance
Multiple Comparisons 

Bonferroni 

Dependent Variable 
(I) Cluster Number of Case 
(J) Cluster Number of Case 
Mean Difference (I-J)
Std. Error 
Sig. 
95% Confidence Interval 

Lower Bound 
Upper Bound 

Zscore: household median age 
1 
2 
.28377763^{*} 
.06364423 
.000 
.4361683 
.1313870 
3 
.71176993^{*} 
.06226067 
.000 
.8608478 
.5626921 

2 
1 
.28377763^{*} 
.06364423 
.000 
.1313870 
.4361683 

3 
.42799230^{*} 
.02202797 
.000 
.4807364 
.3752482 

3 
1 
.71176993^{*} 
.06226067 
.000 
.5626921 
.8608478 

2 
.42799230^{*} 
.02202797 
.000 
.3752482 
.4807364 

Zscore: total rooms 
1 
2 
2.20866020^{*} 
.04188229 
.000 
2.1083766 
2.3089438 
3 
3.46724089^{*} 
.04097182 
.000 
3.3691374 
3.5653444 

2 
1 
2.20866020^{*} 
.04188229 
.000 
2.3089438 
2.1083766 

3 
1.25858069^{*} 
.01449592 
.000 
1.2238714 
1.2932899 

3 
1 
3.46724089^{*} 
.04097182 
.000 
3.5653444 
3.3691374 

2 
1.25858069^{*} 
.01449592 
.000 
1.2932899 
1.2238714 

Zscore: population 
1 
2 
2.19500101^{*} 
.04264354 
.000 
2.0928947 
2.2971073 
3 
3.43344585^{*} 
.04171651 
.000 
3.3335592 
3.5333325 

2 
1 
2.19500101^{*} 
.04264354 
.000 
2.2971073 
2.0928947 

3 
1.23844484^{*} 
.01475940 
.000 
1.2031047 
1.2737850 

3 
1 
3.43344585^{*} 
.04171651 
.000 
3.5333325 
3.3335592 

2 
1.23844484^{*} 
.01475940 
.000 
1.2737850 
1.2031047 

Zscore: total bedrooms 
1 
2 
2.69479215^{*} 
.03595358 
.000 
2.6087044 
2.7808799 
3 
3.98642475^{*} 
.03517199 
.000 
3.9022084 
4.0706411 

2 
1 
2.69479215^{*} 
.03595358 
.000 
2.7808799 
2.6087044 

3 
1.29163259^{*} 
.01244393 
.000 
1.2618367 
1.3214285 

3 
1 
3.98642475^{*} 
.03517199 
.000 
4.0706411 
3.9022084 

2 
1.29163259^{*} 
.01244393 
.000 
1.3214285 
1.2618367 

Zscore: household 
1 
2 
2.68629128^{*} 
.03568084 
.000 
2.6008566 
2.7717260 
3 
3.98708535^{*} 
.03490518 
.000 
3.9035079 
4.0706628 

2 
1 
2.68629128^{*} 
.03568084 
.000 
2.7717260 
2.6008566 

3 
1.30079407^{*} 
.01234953 
.000 
1.2712242 
1.3303640 

3 
1 
3.98708535^{*} 
.03490518 
.000 
4.0706628 
3.9035079 

2 
1.30079407^{*} 
.01234953 
.000 
1.3303640 
1.2712242 

Zscore: longitudes 
1 
2 
.20324315^{*} 
.06496399 
.005 
.3587939 
.0476924 
3 
.01372156 
.06355174 
1.000 
.1658908 
.1384477 

2 
1 
.20324315^{*} 
.06496399 
.005 
.0476924 
.3587939 

3 
.18952160^{*} 
.02248475 
.000 
.1356838 
.2433594 

3 
1 
.01372156 
.06355174 
1.000 
.1384477 
.1658908 

2 
.18952160^{*} 
.02248475 
.000 
.2433594 
.1356838 

Zscore: lattudes 
1 
2 
.14581940 
.06487682 
.074 
.0095226 
.3011614 
3 
.07595036 
.06346647 
.694 
.2279154 
.0760147 

2 
1 
.14581940 
.06487682 
.074 
.3011614 
.0095226 

3 
.22176977^{*} 
.02245458 
.000 
.2755353 
.1680042 

3 
1 
.07595036 
.06346647 
.694 
.0760147 
.2279154 

2 
.22176977^{*} 
.02245458 
.000 
.1680042 
.2755353 

*. The mean difference is significant at the 0.05 level. 
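The confidence bounds in the multiple comparisons table follow from the mean difference and its standard error once the significance level is split across the three pairwise comparisons. The sketch below reproduces one interval approximately. It assumes a normal critical value as a large-sample stand-in for the t distribution, which is reasonable here only because the error degrees of freedom are 9267. The function name and defaults are illustrative.

```python
from statistics import NormalDist


def bonferroni_ci(mean_diff, std_error, n_comparisons=3, alpha=0.05):
    """Approximate Bonferroni-adjusted confidence interval.

    Assumption: a normal quantile replaces the t quantile, which is
    only appropriate for very large error degrees of freedom.
    """
    # Split alpha across the comparisons, then across the two tails.
    critical = NormalDist().inv_cdf(1 - alpha / (2 * n_comparisons))
    half_width = critical * std_error
    return mean_diff - half_width, mean_diff + half_width


# Cluster 1 versus cluster 2 on Zscore: total rooms, from the table above.
lower, upper = bonferroni_ci(2.20866020, 0.04188229)
print(f"approximate 95% interval: [{lower:.4f}, {upper:.4f}]")
```

An interval that excludes zero corresponds to a starred (significant) mean difference; the longitude and latitude comparisons between clusters 1 and 3 are the ones whose intervals straddle zero.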
Similarly, the longitude and latitude means were plotted, and their graphs were perpendicular to each other, as was the case in the first data set group. See the figures below.
Figure 14: Graph of longitudes
Figure 15: Graph of latitudes
K-net model using Kohonen clustering
A neural network K-net model using Kohonen clustering was developed. The model characteristics are summarized in the table below.
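For readers unfamiliar with Kohonen clustering, the idea can be sketched in a few lines: each observation is matched to its nearest unit, and that unit and its grid neighbours are nudged toward the observation. The code below is a toy one-dimensional map written for illustration. It is not the SPSS procedure used in this project, and the function names, toy data and parameter values are all assumptions.

```python
import math
import random


def best_matching_unit(weights, row):
    """Index of the unit whose weight vector is closest to the row."""
    return min(range(len(weights)), key=lambda j: math.dist(weights[j], row))


def train_som(data, n_units=3, epochs=20, lr0=0.5, radius0=1.0, seed=0):
    """Train a one-dimensional Kohonen map and return the unit weights."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Start each unit at a randomly chosen observation.
    weights = [list(rng.choice(data)) for _ in range(n_units)]
    for epoch in range(epochs):
        # Learning rate and neighbourhood radius shrink over time.
        frac = 1.0 - epoch / epochs
        lr = lr0 * frac
        radius = max(radius0 * frac, 1e-3)
        for row in rng.sample(data, len(data)):
            bmu = best_matching_unit(weights, row)
            for j, unit in enumerate(weights):
                # Gaussian neighbourhood on the 1-D grid of units.
                influence = math.exp(-((j - bmu) ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    unit[d] += lr * influence * (row[d] - unit[d])
    return weights


# Hypothetical toy data: two small groups of standardized points.
toy = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05), (5.0, 5.1), (5.1, 5.0), (4.95, 5.05)]
units = train_som(toy, n_units=2)
labels = [best_matching_unit(units, row) for row in toy]
print(labels)
```

In practice the inputs would be the z-scored housing variables, and the number of units plays the role of the number of clusters.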
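Model Summary

| Section | Item | Value |
|---|---|---|
| Specifications | Growing Method | CHAID |
| Specifications | Dependent Variable | median house value |
| Specifications | Independent Variables | median income, total rooms, total bedrooms, population, housing median age, households |
| Specifications | Validation | Split Sample |
| Specifications | Maximum Tree Depth | 3 |
| Specifications | Minimum Cases in Parent Node | 100 |
| Specifications | Minimum Cases in Child Node | 50 |
| Results | Independent Variables Included | median income, total bedrooms, total rooms, households, housing median age |
| Results | Number of Nodes | 81 |
| Results | Number of Terminal Nodes | 56 |
| Results | Depth | 3 |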
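The growth limits in the model summary (maximum depth 3, at least 100 cases in a parent node, at least 50 in a child node) act as stopping rules. The sketch below shows one way such limits could be checked for a candidate split. It is a simplified illustration, not the SPSS implementation, and it assumes the root node sits at depth 0.

```python
# Growth limits taken from the model summary above.
MAX_DEPTH = 3
MIN_PARENT_CASES = 100
MIN_CHILD_CASES = 50


def split_allowed(parent_n, child_ns, parent_depth):
    """Return True if a candidate split respects the size and depth limits.

    Assumption: the root node has depth 0, so a node at depth 3
    cannot be split any further.
    """
    if parent_depth >= MAX_DEPTH:
        return False
    if parent_n < MIN_PARENT_CASES:
        return False
    return all(n >= MIN_CHILD_CASES for n in child_ns)


print(split_allowed(144, [72, 72], parent_depth=2))   # both children large enough
print(split_allowed(144, [100, 44], parent_depth=2))  # one child below 50 cases
print(split_allowed(99, [50, 49], parent_depth=1))    # parent below 100 cases

# 81 nodes with 56 terminal nodes leaves 25 internal (split) nodes.
internal_nodes = 81 - 56
```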
The model was trained and the gain for each terminal node was computed for both the training and the test sample. See the table below.
Gain Summary for Nodes

| Node | Training N | Training Percent | Training Mean | Test N | Test Percent | Test Mean |
|---|---|---|---|---|---|---|
| 50 | 144 | 1.6% | 456486.7292 | 141 | 1.5% | 445946.7092 |
| 49 | 214 | 2.3% | 432810.2570 | 215 | 2.3% | 428220.4651 |
| 48 | 118 | 1.3% | 397222.3390 | 99 | 1.1% | 384917.4343 |
| 47 | 182 | 2.0% | 371833.6978 | 191 | 2.1% | 371647.9215 |
| 45 | 97 | 1.0% | 354176.4433 | 128 | 1.4% | 358446.9922 |
| 46 | 266 | 2.9% | 348348.2594 | 248 | 2.7% | 334943.6492 |
| 41 | 57 | 0.6% | 341333.5263 | 67 | 0.7% | 338229.9851 |
| 44 | 247 | 2.7% | 301736.0891 | 282 | 3.0% | 300809.6312 |
| 80 | 122 | 1.3% | 301226.3033 | 88 | 0.9% | 296329.6023 |
| 36 | 86 | 0.9% | 294130.3140 | 113 | 1.2% | 305520.4513 |
| 40 | 81 | 0.9% | 283325.9383 | 87 | 0.9% | 292325.3333 |
| 76 | 136 | 1.5% | 272350.7941 | 131 | 1.4% | 267161.1145 |
| 74 | 63 | 0.7% | 263246.0952 | 76 | 0.8% | 257203.9868 |
| 32 | 98 | 1.1% | 256446.9796 | 104 | 1.1% | 275903.8846 |
| 79 | 212 | 2.3% | 255040.1085 | 241 | 2.6% | 265156.4564 |
| 35 | 99 | 1.1% | 252505.1111 | 96 | 1.0% | 247090.6354 |
| 78 | 113 | 1.2% | 247254.0000 | 118 | 1.3% | 250344.9407 |
| 28 | 92 | 1.0% | 245225.0435 | 86 | 0.9% | 260660.5349 |
| 72 | 312 | 3.4% | 233365.4103 | 337 | 3.6% | 231815.1632 |
| 70 | 106 | 1.1% | 231300.0283 | 130 | 1.4% | 235446.9538 |
| 31 | 80 | 0.9% | 224991.2750 | 107 | 1.2% | 216281.3271 |
| 75 | 157 | 1.7% | 224963.0637 | 166 | 1.8% | 230427.1325 |
| 24 | 94 | 1.0% | 220278.7447 | 79 | 0.9% | 252520.3038 |
| 67 | 86 | 0.9% | 219804.6977 | 87 | 0.9% | 208370.1149 |
| 73 | 199 | 2.2% | 219731.1558 | 204 | 2.2% | 225751.4902 |
| 77 | 134 | 1.4% | 218754.4776 | 138 | 1.5% | 232133.3551 |
| 65 | 192 | 2.1% | 207707.8177 | 168 | 1.8% | 203676.8036 |
| 69 | 241 | 2.6% | 196215.7925 | 258 | 2.8% | 201098.8566 |
| 59 | 91 | 1.0% | 194401.1099 | 75 | 0.8% | 163721.3333 |
| 37 | 232 | 2.5% | 193883.1940 | 231 | 2.5% | 198149.7835 |
| 71 | 216 | 2.3% | 189830.1065 | 225 | 2.4% | 191274.2267 |
| 33 | 211 | 2.3% | 179987.6872 | 220 | 2.4% | 175555.4545 |
| 63 | 298 | 3.2% | 179324.5101 | 311 | 3.3% | 181234.0868 |
| 66 | 109 | 1.2% | 177673.3945 | 120 | 1.3% | 170980.8333 |
| 68 | 212 | 2.3% | 177281.6038 | 223 | 2.4% | 171317.9417 |
| 61 | 89 | 1.0% | 164487.6404 | 77 | 0.8% | 146348.0519 |
| 18 | 100 | 1.1% | 160734.0100 | 107 | 1.2% | 146353.2710 |
| 64 | 284 | 3.1% | 159382.0458 | 259 | 2.8% | 165456.7606 |
| 29 | 189 | 2.0% | 159001.5873 | 185 | 2.0% | 169588.6649 |
| 15 | 74 | 0.8% | 158702.7162 | 82 | 0.9% | 164240.2439 |
| 57 | 196 | 2.1% | 153523.9796 | 186 | 2.0% | 148725.8118 |
| 58 | 116 | 1.3% | 150485.3448 | 131 | 1.4% | 152106.8779 |
| 62 | 271 | 2.9% | 144630.2583 | 263 | 2.8% | 150906.4639 |
| 25 | 161 | 1.7% | 143958.3913 | 164 | 1.8% | 160893.2988 |
| 51 | 69 | 0.7% | 135492.7681 | 62 | 0.7% | 140474.2097 |
| 60 | 93 | 1.0% | 135360.2258 | 106 | 1.1% | 127970.7547 |
| 17 | 265 | 2.9% | 135296.6151 | 288 | 3.1% | 135512.5035 |
| 21 | 80 | 0.9% | 133578.7500 | 67 | 0.7% | 148083.5821 |
| 13 | 214 | 2.3% | 131295.3364 | 186 | 2.0% | 134641.9409 |
| 55 | 499 | 5.4% | 127691.5872 | 497 | 5.3% | 124478.0704 |
| 56 | 97 | 1.0% | 123121.6495 | 105 | 1.1% | 119066.6667 |
| 11 | 151 | 1.6% | 120988.0861 | 129 | 1.4% | 112517.0620 |
| 16 | 60 | 0.6% | 114671.6667 | 65 | 0.7% | 148372.3231 |
| 54 | 352 | 3.8% | 114566.1960 | 323 | 3.5% | 114648.6130 |
| 52 | 148 | 1.6% | 101083.7838 | 121 | 1.3% | 103327.2727 |
| 53 | 342 | 3.7% | 89952.6316 | 300 | 3.2% | 87604.6667 |

Growing Method: CHAID. Dependent Variable: median house value.
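One way to read the gain summary is to compare each node's mean house value in the training sample with its mean in the test sample: nodes whose means differ widely are less stable. The sketch below does this for a handful of nodes copied from the table. The 10% tolerance is a hypothetical threshold chosen for illustration, not a value from the original analysis.

```python
# (node, training mean, test mean) for a few terminal nodes from the gain summary.
nodes = [
    (50, 456486.7292, 445946.7092),
    (49, 432810.2570, 428220.4651),
    (59, 194401.1099, 163721.3333),
    (16, 114671.6667, 148372.3231),
    (53, 89952.6316, 87604.6667),
]

THRESHOLD = 0.10  # hypothetical tolerance for the train/test gap


def relative_gap(train_mean, test_mean):
    """Absolute train/test difference as a fraction of the training mean."""
    return abs(test_mean - train_mean) / train_mean


unstable = [node for node, train, test in nodes if relative_gap(train, test) > THRESHOLD]

for node, train, test in nodes:
    print(f"node {node}: gap {relative_gap(train, test):.1%}")
print("nodes above threshold:", unstable)
```

Among these five, the highest-value and lowest-value nodes (50, 49 and 53) replicate closely in the test sample, while the smaller nodes 59 and 16 shift noticeably.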
The model is summarized below. It had a relative error of 0.836 on the training sample and 0.840 on the testing sample.
Model Summary

| Sample | Measure | Value |
|---|---|---|
| Training | Sum of Squares Error | 5380.671 |
| Training | Relative Error | .836 |
| Training | Stopping Rule Used | 1 consecutive step(s) with no decrease in error |
| Training | Training Time | 0:08:15.91 |
| Testing | Sum of Squares Error | 2204.583 |
| Testing | Relative Error | .840 |

Dependent Variable: housing median age

a. Error computations are based on the testing sample.
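A relative error close to 1 means the network predicts little better than a constant guess. The sketch below assumes the common definition for a scale outcome: the model's sum of squared errors divided by the sum of squared errors of a baseline that always predicts the mean. Under that assumption, the baseline error implied by the table can be recovered by division. The function name is illustrative.

```python
def relative_error(actual, predicted):
    """Model sum of squared errors relative to a mean-only baseline.

    Assumption: the baseline predicts the mean of the actual values.
    """
    mean_actual = sum(actual) / len(actual)
    sse_model = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sse_baseline = sum((a - mean_actual) ** 2 for a in actual)
    return sse_model / sse_baseline


# Baseline sums of squares implied by the reported values.
implied_baseline_train = 5380.671 / 0.836
implied_baseline_test = 2204.583 / 0.840
print(f"implied baseline error, training: {implied_baseline_train:.1f}")
print(f"implied baseline error, testing:  {implied_baseline_test:.1f}")

# Share of variation accounted for under this definition.
explained_train = 1 - 0.836
explained_test = 1 - 0.840
```

Under this reading the model accounts for roughly 16% of the variation in housing median age in both samples, so the similar training and testing errors indicate consistency rather than strong predictive power.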
A simple graphical model was developed to relate the age of the household to the other variables, as shown in the figure below.
References
Harrell, F. (2015). Regression modeling strategies: With applications to linear models, logistic and ordinal regression, and survival analysis. Springer.
Kou, G., Peng, Y., & Wang, G. (2014). Evaluation of clustering algorithms for financial risk analysis using MCDM methods. Information Sciences, 275, 1-12.
Varmuza, K., & Filzmoser, P. (2016). Introduction to multivariate statistical analysis in chemometrics. CRC Press.