Statistics in Data Science

Date: 2022-07-05 | Source: 和你遇見 | Views: 241


1.1 Describing a Single Dataset

The simplest way to describe a dataset is to list its values: num_friends = [100, 99, 41, 25]

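For a small enough dataset this description is sufficient. For a larger dataset it is neither practical nor intuitive (staring at a list of a million numbers tells you very little), so we use statistics to extract and communicate the relevant features of the data. A first approach is to put the friend counts into a histogram using Counter and plt.bar, with Counter tallying how many times each value occurs:

    from collections import Counter
    import matplotlib.pyplot as plt

    num_friends = [100, 55, 99, 24, 24, 55]
    friend_counts = Counter(num_friends)   # value -> number of occurrences
    xs = range(101)                        # x-axis: friend counts 0..100
    ys = [friend_counts[x] for x in xs]    # y-axis: number of people with x friends
    plt.bar(xs, ys)
    plt.axis([0, 101, 0, 5])
    plt.title("Histogram of Friend Counts")
    plt.xlabel("# of friends")
    plt.ylabel("# of people")
    plt.show()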

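A chart like this is still hard to communicate to others, though, so we start computing some statistics, such as the sample size and the largest and smallest values.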
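As a minimal sketch of the summary statistics just mentioned, the following computes the sample size, the largest and smallest values, and two positional values from the sorted list. It reuses the num_friends data from the histogram example; the variable names (num_points, largest_value, and so on) are chosen here for illustration and do not come from the original article.

```python
# Same sample data as the histogram example.
num_friends = [100, 55, 99, 24, 24, 55]

num_points = len(num_friends)        # sample size
largest_value = max(num_friends)     # maximum
smallest_value = min(num_friends)    # minimum

# Sorting gives access to values at specific positions.
sorted_values = sorted(num_friends)
second_smallest = sorted_values[1]
second_largest = sorted_values[-2]

print(num_points, largest_value, smallest_value)  # 6 100 24
print(second_smallest, second_largest)            # 24 99
```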

1.2 Central Tendency

To get a sense of where the data is centered we usually use the mean. With two data points, the mean is the point halfway between them; as more points are added, the mean shifts depending on the value of each new point. Sometimes we are also interested in the median, which is the middle value (if the number of data points is odd) or the average of the two middle values (if it is even). A generalization of the median is the quantile, which is the value at a given percentage position in the sorted data (the median is the value at the 50% position):

    from typing import List

    def quantile(xs: List[float], p: float) -> float:
        """Return the value at the p-th quantile of xs."""
        p_index = int(p * len(xs))
        return sorted(xs)[p_index]

    print(quantile([1, 3, 4, 1, 2], 0.25))  # 1

The mode is the most common value or values:

    from collections import Counter
    from typing import List

    def mode(xs: List[float]) -> List[float]:
        """There may be more than one mode, so return a list."""
        counts = Counter(xs)
        max_count = max(counts.values())
        return [x_i for x_i, count in counts.items() if count == max_count]

    print(mode([1, 2, 3, 41, 1, 2]))  # [1, 2]
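The mean and median are described in this section but not shown in code. A minimal sketch of both, assuming a non-empty list of numbers, might look like the following; the function bodies are an illustration written for this purpose, not the original author's code.

```python
from typing import List

def mean(xs: List[float]) -> float:
    """Arithmetic mean: the sum of the values divided by their count."""
    return sum(xs) / len(xs)

def median(xs: List[float]) -> float:
    """Middle value for odd-length data, average of the two middle values otherwise."""
    sorted_xs = sorted(xs)
    n = len(sorted_xs)
    mid = n // 2
    if n % 2 == 1:
        return sorted_xs[mid]
    return (sorted_xs[mid - 1] + sorted_xs[mid]) / 2

print(mean([1, 2, 3, 4]))         # 2.5
print(median([1, 10, 2, 9, 5]))   # 5
print(median([1, 9, 2, 10]))      # 5.5
```

Note that the quantile function in this section, called with p = 0.5, returns a single element of the sorted data rather than averaging the two middle values, so for even-length data it need not agree with this median.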
1.3 Dispersion
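Dispersion measures how spread out the data is. Typically, a value near 0 means the data is hardly spread out at all, and a large value means it is very spread out. The simplest measure is the range, the difference between the largest and smallest values. The range ignores everything in between: a dataset whose points are all either 0 or 100 has the same range as the dataset [0, 50, ..., 50, 100], even though the first seems more spread out. A more sophisticated measure of dispersion is the variance. Because the variance is expressed in the square of the original units, it is hard to interpret, so it is customary to use the standard deviation (the square root of the variance) instead.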
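To make the point about the range concrete, here is a small sketch. The helper name data_range and the two sample lists are assumptions made for this illustration; statistics.stdev from the Python standard library is used only to show that the two datasets differ in spread even though their ranges match.

```python
from statistics import stdev
from typing import List

def data_range(xs: List[float]) -> float:
    """Range: difference between the largest and smallest values."""
    return max(xs) - min(xs)

# Hypothetical datasets: one made of extremes only, one mostly 50s.
two_values = [0, 100]
mostly_fifty = [0] + [50] * 8 + [100]

print(data_range(two_values))    # 100
print(data_range(mostly_fifty))  # 100

# The sample standard deviation tells the two datasets apart.
print(stdev(two_values) > stdev(mostly_fifty))  # True
```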

    from typing import List
    from statistics import mean
    import math

    def decline_mean(x: List[float]) -> List[float]:
        """Subtract the mean from every value, so the result has mean 0."""
        x_mean = mean(x)
        return [x_i - x_mean for x_i in x]

    def sum_of_squares(x: List[float]) -> float:
        list_squares = [x_i * x_i for x_i in x]
        return sum(list_squares)

    # Sample variance (divides by n - 1, so it needs at least two values)
    def variance(xs: List[float]) -> float:
        n = len(xs)
        deviations = decline_mean(xs)
        return sum_of_squares(deviations) / (n - 1)

    # Standard deviation: the square root of the variance
    def standard_variance(xs: List[float]) -> float:
        return math.sqrt(variance(xs))

    print(variance([1, 2, 3, 4]))           # 1.6666666666666667 (variance)
    print(standard_variance([1, 2, 3, 4]))  # 1.2909944487358056 (standard deviation)

Both the range and the standard deviation are sensitive to outliers. A more robust alternative is the difference between the 75th and 25th percentile values, which is not affected by a small number of outliers:

    from typing import List

    def quantile(xs: List[float], p: float) -> float:
        p_index = int(p * len(xs))
        return sorted(xs)[p_index]

    def interquartile_range(xs: List[float]) -> float:
        """Difference between the 75th and 25th percentile values."""
        return quantile(xs, 0.75) - quantile(xs, 0.25)

    print(interquartile_range([1, 3, 4, 1, 2]))  # 2
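The robustness claim can be checked with a small experiment. In this sketch, the two sample lists and the helper data_range are assumptions made for illustration, and statistics.stdev from the standard library stands in for the hand-written standard deviation. Replacing a single value with an extreme one changes the range and the standard deviation sharply but leaves the interquartile range untouched.

```python
from statistics import stdev
from typing import List

def quantile(xs: List[float], p: float) -> float:
    """Value at the p-th quantile, using the same index rule as the article."""
    p_index = int(p * len(xs))
    return sorted(xs)[p_index]

def interquartile_range(xs: List[float]) -> float:
    return quantile(xs, 0.75) - quantile(xs, 0.25)

def data_range(xs: List[float]) -> float:
    return max(xs) - min(xs)

# Hypothetical data: identical except for one extreme value.
clean = [1, 2, 3, 4, 5, 6, 7, 8]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 800]

print(data_range(clean), data_range(with_outlier))                    # 7 799
print(stdev(clean) < stdev(with_outlier))                             # True
print(interquartile_range(clean), interquartile_range(with_outlier))  # 4 4
```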
1.4 Correlation
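Suppose we want to investigate whether the amount of time users spend on the site is related to the number of friends they have there. We take a list called daily_minutes whose elements correspond to the elements of the earlier num_friends list, and use it to explore the relationship. We start with covariance, the paired analogue of variance: whereas variance measures how a single variable deviates from its mean, covariance measures how two variables vary in tandem from their means.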

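    from typing import List
    from statistics import mean

    Vector = List[float]

    def dot(v: Vector, w: Vector) -> float:
        """Sum of the element-wise products of two equal-length vectors."""
        assert len(v) == len(w), "vectors must have the same length"
        return sum(v_i * w_i for v_i, w_i in zip(v, w))

    def decline_mean(x: List[float]) -> List[float]:
        x_mean = mean(x)
        return [x_i - x_mean for x_i in x]

    def covariance(xs: List[float], ys: List[float]) -> float:
        assert len(xs) == len(ys), "must have same number of elements"
        return dot(decline_mean(xs), decline_mean(ys)) / (len(xs) - 1)

    print(covariance([1, 2], [2, 3]))  # 0.5

Covariance can be hard to interpret, for two reasons. First, its units are the product of the units of the inputs (here, friend-minutes-per-day), which is hard to make sense of. Second, if every user had twice as many friends but the same number of minutes, the covariance would be twice as large, yet in a sense the variables would be just as interrelated. For these reasons the correlation is more commonly used: it divides the covariance by the standard deviations of both variables. The correlation is unitless and always lies between -1 (perfect negative correlation) and 1 (perfect positive correlation); a value such as 0.25 represents a relatively weak positive correlation.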
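The correlation described above can be sketched as follows. This is an illustration under stated assumptions, not the original author's code: the sample data is hypothetical, statistics.stdev supplies the sample standard deviation (matching the n - 1 divisor used for covariance), and returning 0.0 when a variable has no variation is a convention chosen here.

```python
import math
from statistics import mean, stdev
from typing import List

def decline_mean(xs: List[float]) -> List[float]:
    x_mean = mean(xs)
    return [x_i - x_mean for x_i in xs]

def covariance(xs: List[float], ys: List[float]) -> float:
    assert len(xs) == len(ys), "must have same number of elements"
    products = (dx * dy for dx, dy in zip(decline_mean(xs), decline_mean(ys)))
    return sum(products) / (len(xs) - 1)

def correlation(xs: List[float], ys: List[float]) -> float:
    """Covariance divided by the product of the two standard deviations."""
    stdev_x = stdev(xs)
    stdev_y = stdev(ys)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(xs, ys) / (stdev_x * stdev_y)
    return 0.0  # assumption: treat "no variation" as zero correlation

# Hypothetical data, for illustration only.
num_friends = [1, 2, 3, 4, 5]
daily_minutes = [10, 20, 25, 40, 45]
doubled_friends = [2 * x for x in num_friends]

# Doubling one variable doubles the covariance...
print(covariance(num_friends, daily_minutes),
      covariance(doubled_friends, daily_minutes))  # 22.5 45.0

# ...but leaves the correlation unchanged.
print(math.isclose(correlation(num_friends, daily_minutes),
                   correlation(doubled_friends, daily_minutes)))  # True
```

Because the scale factor appears in both the covariance and the standard deviation, it cancels in the ratio, which is why the correlation is easier to compare across datasets than the raw covariance.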
