Understanding sklearn.datasets.load_iris(): Loading and Exploring the Iris Dataset

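sklearn.datasets.load_iris() is a function in the Python scikit-learn library that loads the classic Iris dataset. The dataset contains 150 samples from three iris species (Setosa, Versicolor, Virginica), 50 per species; each sample has 4 numeric features and one label (the species). It is widely used as an example or benchmark for supervised learning algorithms, classification in particular.

The sections below describe how to load the dataset with load_iris().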

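Function signature

sklearn.datasets.load_iris(*, return_X_y=False, as_frame=False)

Parameters

  • return_X_y (bool, default=False): Whether to return the feature matrix X and the target vector y separately. If False, a Bunch object holding the data together with its metadata is returned; if True, a tuple (X, y) is returned.
  • as_frame (bool, default=False): Whether to return the data as pandas objects. It works with either value of return_X_y and requires pandas to be installed. If True, the features are a pandas DataFrame and the target is a pandas Series; when a Bunch is returned, it also carries a frame attribute that combines features and target in one DataFrame.

Both parameters are keyword-only, and load_iris() accepts no other arguments. The data ships with scikit-learn, so nothing is downloaded; fetch_openml() is a separate function for retrieving datasets from OpenML.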
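As a rough illustration of the as_frame parameter, the sketch below loads the data as pandas objects and inspects the combined frame. It assumes pandas is installed alongside scikit-learn; the variable name iris is only for illustration.

```python
from sklearn.datasets import load_iris

# as_frame=True requires pandas to be installed.
iris = load_iris(as_frame=True)

# With as_frame=True the Bunch holds pandas objects instead of NumPy arrays.
print(type(iris.data).__name__)    # DataFrame
print(type(iris.target).__name__)  # Series

# iris.frame combines the four feature columns with a "target" column.
print(iris.frame.shape)
print(list(iris.frame.columns))
```

The frame attribute is convenient when the features and labels should travel together, for example for grouping by species.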

Return value

  • If return_X_y=False (the default), a Bunch object with the following attributes:
      • data (numpy.ndarray of shape (n_samples, n_features), or pandas DataFrame if as_frame=True): the feature data. For the Iris dataset, n_samples=150 and n_features=4.
      • target (numpy.ndarray of shape (n_samples,), or pandas Series if as_frame=True): the class label of each sample. Labels take the values 0, 1, and 2, which correspond to Setosa, Versicolor, and Virginica respectively.
      • feature_names (list of length n_features): the feature names, ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'].
      • target_names (array of length n_classes): the class names, ['setosa', 'versicolor', 'virginica'].
      • frame (pandas DataFrame): features and target combined; only present when as_frame=True.
      • DESCR (str): the full text description of the dataset.
  • If return_X_y=True, a tuple (X, y), where:
      • X (numpy.ndarray of shape (n_samples, n_features), or pandas DataFrame if as_frame=True): the feature data.
      • y (numpy.ndarray of shape (n_samples,), or pandas Series if as_frame=True): the target vector.
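The attributes listed above can be checked directly. The following is a minimal sketch that prints the array shapes, maps an integer label back to its species name, and confirms that the (X, y) tuple holds the same data as the Bunch; the variable names are illustrative.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Array shapes: 150 samples, 4 features, one label per sample.
print(iris.data.shape)    # (150, 4)
print(iris.target.shape)  # (150,)

print(iris.feature_names)
print(list(iris.target_names))

# Map the first sample's integer label back to its species name.
first_label = iris.target[0]
print(first_label, iris.target_names[first_label])

# A Bunch also supports dictionary-style access.
same_data = iris["data"]

# return_X_y=True yields the same arrays as a plain tuple.
X, y = load_iris(return_X_y=True)
```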

Example code

from sklearn.datasets import load_iris

# Load the Iris dataset as a Bunch object
iris_data = load_iris()

# Get the feature matrix and the target vector
X = iris_data.data
y = iris_data.target

# Get the feature names and the class names
feature_names = iris_data.feature_names
target_names = iris_data.target_names

# Print the dataset description
print(iris_data.DESCR)

# Alternatively, return the (X, y) tuple directly
X, y = load_iris(return_X_y=True)

# To get a pandas DataFrame and Series instead (pandas must be installed;
# no explicit import is needed here)
X_df, y_series = load_iris(return_X_y=True, as_frame=True)

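Output

The only output comes from print(iris_data.DESCR); the exact text can vary slightly between scikit-learn versions.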

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

|details-start|
**References**
|details-split|

- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
  Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
  Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
  (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
  Structure and Classification Rule for Recognition in Partially Exposed
  Environments".  IEEE Transactions on Pattern Analysis and Machine
  Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
  on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
  conceptual clustering system finds 3 classes in the data.
- Many, many more ...

|details-end|
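The figures in the description can be reproduced from the loaded arrays. The sketch below recomputes the class distribution and the per-feature minimum, maximum, mean, and sample standard deviation with NumPy; the use of ddof=1 for the standard deviation is an assumption about how the table was computed, and the printed means may differ from the table in the last digit because of rounding.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Number of samples in each of the three classes.
class_counts = np.bincount(y)
print("samples per class:", class_counts)

# Per-feature summary statistics (columns of X).
col_min = X.min(axis=0)
col_max = X.max(axis=0)
col_mean = X.mean(axis=0)
col_std = X.std(axis=0, ddof=1)  # sample standard deviation (assumed)

for name, lo, hi, mean, sd in zip(
    iris.feature_names, col_min, col_max, col_mean, col_std
):
    print(f"{name:<18} min={lo:.1f} max={hi:.1f} mean={mean:.2f} sd={sd:.2f}")
```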

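Once loaded with load_iris(), the Iris dataset can be used directly to train machine learning models such as support vector machines (SVM), decision trees, random forests, and k-nearest neighbors (KNN) for classifying iris species.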
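As one possible illustration of such a classification task, the sketch below trains a k-nearest neighbors classifier on a stratified train/test split. The split ratio, random_state, and n_neighbors values are arbitrary choices made for this example, not recommendations from the original text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples for testing; stratify keeps the three
# classes balanced in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Fit a KNN classifier and measure accuracy on the held-out samples.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

Any of the other estimators mentioned above (for example sklearn.svm.SVC or sklearn.tree.DecisionTreeClassifier) can be swapped in for KNeighborsClassifier, since they share the same fit and score interface.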


Archive: https://www.yuque.com/worthstudy/study/ehc6esdtmugl0wbq?singleDoc# ("sklearn.datasets.load_iris() function")
