
      Clustering and Data Analysis

      2018-05-14 13:16:44 Ding Liren (丁立人)
      留學 (Study Abroad), 2018, Issue 19
      Keywords: age, topology, foundational disciplines

      1. Introduction

      Clustering is the process of sorting objects, elements, or data into groups according to their similarity or dissimilarity. This paper explains its topological foundation and several approaches to it.

      2. Definition

      In a set of data, a cluster is a group of elements that are more similar to each other than to the elements of other clusters. To measure similarity, we place the elements in a metric space and use a "distance" function: the smaller the distance between two elements, the more similar they are. Given a set $X$, a metric on $X$ is a function $d : X \times X \to \mathbb{R}$ such that

      1. $d(x, y) \ge 0$ for all $x, y \in X$, and $d(x, y) = 0$ iff $x = y$.

      2. $d(x, y) = d(y, x)$ for all $x, y \in X$.

      3. $d(x, y) \le d(x, z) + d(z, y)$ for all $x, y, z \in X$.

      The pair $(X, d)$ is called a metric space. To form clusters, we fix a threshold $R \in \mathbb{R}$ with $R \ge 0$ and first define a relation $\approx_R$ by

      $x \approx_R x'$ iff $d(x, x') \le 2R$,

      which says that the two elements are similar at scale $R$. This relation is reflexive and symmetric but not necessarily transitive, so we extend it to an equivalence relation $\sim_R$ as follows: $x \sim_R y$ if there exists a sequence of elements $x_0, \dots, x_n$ such that $x = x_0 \approx_R x_1, \dots, x_{n-1} \approx_R x_n = y$.
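      The chained relation above can be computed directly on a finite data set. The following is a minimal sketch, assuming the elements are points in Euclidean space; the function name `threshold_clusters`, the union-find helper, and the sample points are illustrative choices, not part of the original paper.

```python
from itertools import combinations
from math import dist


def threshold_clusters(points, R):
    """Partition points into the classes of the chained relation for threshold R."""
    parent = list(range(len(points)))  # union-find forest, one tree per cluster

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair that is directly related: d(x, x') <= 2R.
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= 2 * R:
            parent[find(i)] = find(j)

    # Collect the indices of the points in each equivalence class.
    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical sample points on the real line, written as 1-tuples.
sample = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
print(threshold_clusters(sample, R=0.5))  # [[0, 1, 2], [3, 4]]
```

      Points 0 and 2 are at distance 2, which exceeds 2R = 1, yet they land in the same cluster because each is within 2R of point 1. This is exactly the chaining step that turns the direct relation into an equivalence relation.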

      The equivalence classes of $\sim_R$ form a partition of the whole set, and each class is a cluster: its elements are more similar to each other than to elements outside the class. Different distance functions suit different types of input data. Data that can be quantified can be placed in $\mathbb{R}^n$, where the distance between two elements can be calculated directly. If the data cannot be quantified, then for a set of $C$ elements a symmetric $C \times C$ matrix can be built, with some function chosen to assign a similarity value to each pair.
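      For quantified data, the pairwise distances can be collected in a symmetric matrix and checked against the three metric axioms. The sketch below assumes Euclidean distance and treats each index as a distinct element; `distance_matrix` and `satisfies_metric_axioms` are hypothetical helper names introduced only for illustration.

```python
from itertools import product
from math import dist, isclose


def distance_matrix(points):
    """Return the symmetric matrix of pairwise Euclidean distances."""
    return [[dist(p, q) for q in points] for p in points]


def satisfies_metric_axioms(D, tol=1e-12):
    """Check the three metric axioms on a finite distance matrix D."""
    n = len(D)
    for i, j in product(range(n), repeat=2):
        if D[i][j] < 0 or (D[i][j] == 0) != (i == j):
            return False  # axiom 1: non-negative, zero exactly on the diagonal
        if not isclose(D[i][j], D[j][i], abs_tol=tol):
            return False  # axiom 2: symmetry
    for i, j, k in product(range(n), repeat=3):
        if D[i][j] > D[i][k] + D[k][j] + tol:
            return False  # axiom 3: triangle inequality
    return True


# Hypothetical sample points in the plane.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = distance_matrix(points)
print(satisfies_metric_axioms(D))

# A hand-made matrix that breaks the triangle inequality: 5 > 1 + 1.
broken = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]
print(satisfies_metric_axioms(broken))
```

      The same check applies to a hand-built similarity matrix for non-quantified data, which is one way to confirm that the chosen function actually behaves like a distance.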

      3. Clustering and data analysis

      Clustering is one of the most vital tasks in data analysis, because the clusters, and the process by which they form, can reveal important information and underlying patterns that other methods may not provide.

      4. Clustering algorithms

      All clustering methods divide elements into groups whose members are similar to each other according to some similarity criterion.

      4.1 Hierarchical Clustering

      When forming clusters, we find that different thresholds R give clusters of different sizes. If the threshold is 0, each cluster contains only one element (assuming the elements are distinct). As R increases, elements become connected and multiple clusters join into one. Informally, hierarchical clustering is the process of finding such a hierarchy of clusters within a set of elements. A dendrogram displays the hierarchy intuitively (Figure 1.1); each horizontal segment represents components being connected.

      A hierarchy built from the bottom up is called agglomerative clustering. We start from R = 0, where the number of connected components, and hence the number of clusters, equals the number of individual points (Figure 1.2). As R increases, points start to become connected (Figure 1.3). Finally, all elements of the data set are included in one cluster (Figure 1.1).
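      The bottom-up process can be sketched by listing the thresholds at which clusters join, which correspond to the heights of the horizontal segments in a dendrogram. This is a minimal illustration, assuming Euclidean points and the rule that two points connect when their distance is at most 2R; the name `merge_thresholds` and the sample points are hypothetical.

```python
from itertools import combinations
from math import dist


def merge_thresholds(points):
    """Return (R, clusters_remaining) for every merge as R grows from 0."""
    parent = list(range(len(points)))  # union-find forest, one tree per cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Visit pairs from closest to farthest; a pair at distance d links at R = d / 2.
    pairs = sorted(
        combinations(range(len(points)), 2),
        key=lambda ij: dist(points[ij[0]], points[ij[1]]),
    )
    merges, remaining = [], len(points)
    for i, j in pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # two different clusters join at this threshold
            parent[root_i] = root_j
            remaining -= 1
            merges.append((dist(points[i], points[j]) / 2, remaining))
    return merges


# Hypothetical sample points on the real line, written as 1-tuples.
sample = [(0.0,), (1.0,), (3.0,), (7.0,)]
for R, k in merge_thresholds(sample):
    print(f"R = {R}: {k} cluster(s)")
# R = 0.5: 3 cluster(s)
# R = 1.0: 2 cluster(s)
# R = 2.0: 1 cluster(s)
```

      Four points start as four clusters at R = 0, and the list ends when a single cluster remains, matching the progression described above.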

      4.2 K-means Clustering

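      K-means clustering is one of the most popular flat clustering algorithms. Unlike hierarchical clustering, which records the clusters for every threshold R, flat clustering produces a single partition; in the threshold picture this corresponds to finding one suitable value of R. K-means itself fixes the number of clusters k in advance and assigns each element to the nearest of k cluster centers.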
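      The paper does not spell out the K-means procedure, so the following is only a sketch of the standard iteration (often called Lloyd's algorithm), assuming Euclidean points and initial centers supplied by the caller. The function name `k_means` and the sample data are illustrative assumptions.

```python
from math import dist


def k_means(points, centers, iterations=100):
    """Lloyd's algorithm: alternate an assignment step and a centroid update."""
    centers = [tuple(c) for c in centers]
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: dist(p, centers[c]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[c]
            for c, g in enumerate(groups)
        ]
        if new_centers == centers:  # centers stopped moving, so we are done
            break
        centers = new_centers
    return centers, groups


# Hypothetical sample: two visibly separated groups and two starting centers.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centers, groups = k_means(points, centers=[(0.0, 0.0), (5.0, 5.0)])
print(centers)  # [(1.0, 1.5), (9.5, 9.0)]
```

      The result depends on the starting centers, which is why practical implementations usually try several initializations.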

      4.3 Which one is better?

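      It is hard to say that one method is better in general, since each has its own advantages. Hierarchical clustering does not require the number of clusters to be fixed in advance and shows the structure at every threshold, while K-means is typically cheaper to compute on large data sets but requires k to be chosen beforehand.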

      5. Clustering in Data Analysis Examples

      The example of clustering in data analysis that I use is the relation between GDP per capita and fertility rate.

      In this data set, some countries have populations too small for the data to be available. These entries should be filtered out first. Since all values are real numbers, we can map the data into a Euclidean space. Many points lie near the x-axis, and others lie near a fertility rate of 2. This shows that many countries with low GDP per capita have higher fertility rates, while countries with relatively high GDP per capita have fertility rates around 2 (Figure 3.1).
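      The filtering and mapping step might look like the sketch below. The record layout, the helper name `to_points`, and every value in `records` are invented placeholders for illustration; they are not the data used in this paper and are not real statistics.

```python
# Placeholder records: (name, GDP per capita, fertility rate).
# None marks a missing value. All names and numbers are made up.
records = [
    ("Country A", 1200.0, 4.6),
    ("Country B", None, 2.1),
    ("Country C", 41000.0, 1.8),
    ("Country D", 9500.0, None),
]


def to_points(rows):
    """Drop rows with a missing value; return labels and (GDP, fertility) points."""
    complete = [
        (name, g, f) for name, g, f in rows if g is not None and f is not None
    ]
    labels = [name for name, _, _ in complete]
    points = [(g, f) for _, g, f in complete]
    return labels, points


labels, points = to_points(records)
print(labels)  # ['Country A', 'Country C']
```

      Each remaining record becomes a point in the plane, so the Euclidean distance between two countries is well defined.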

      As the threshold increases, three clusters form, where F denotes fertility rate and G denotes GDP per capita. The first has F between 4 and 5 and G under $5,000. The second is located at the bottom-left corner of the graph, with F around 2 and G roughly around $10,000. The last has G from $30,000 to $50,000 and F around 2. In the first cluster, the Republic of the Congo, Ethiopia, Iraq, and South Sudan suffer from poverty or war and have high fertility rates with low income levels. The second cluster includes countries such as China and Russia that have grown rapidly in recent years. The third cluster consists mostly of more developed countries (MDCs), including the UK, France, and Canada. These countries are all highly developed, and most of them have fertility rates below 2. The pattern of these three clusters supports the theory of demographic transition.

      Figure 3.3 gives almost exactly the partition into developing and developed countries.

      6. Conclusion

      Clustering is a very effective method in data analysis. I believe its power is shown in the demographics example, in which clustering revealed three groups of countries, each at a different stage of the demographic transition.

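      Ding Liren

      Age: 17

      City: Beijing

      Grade: 12

      Intended majors: Mathematics, Computer Science

      During the month I spent at summer school, I found that applied topology is completely different from the mathematics I had learned in middle and high school. Applied topology and linear algebra, one of its foundational subjects, were a huge challenge for me. What impressed me most during my studies were clustering algorithms, which group data with similar features into corresponding clusters, and transformations of spaces, that is, using a function to map one vector space to another. In going from a basic understanding to being able to write this paper, my progress was by no means limited to knowledge of applied topology: I also developed the ability to do independent research and gained some appreciation of the more rigorous logic of higher mathematics.

      In this paper, I mainly introduce the connection between clustering algorithms and topology, and use an example from demography to present one clustering algorithm.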
