Pandas and Sklearn

Using the pandas isnull function to check for missing data

pandas isnull sum with column headers

 

for col in main_df:
    # count the missing values in this column
    print(sum(pd.isnull(main_df[col])))

I get a list of the null count for each column:

0
1
100

What I’m trying to do is create a new dataframe which has the column header alongside the null count, e.g.

col1 | 0
col2 | 1
col3 | 100

 

 

To print every column name together with its null count:

 

nulls = df.isnull().sum().to_frame()
for index, row in nulls.iterrows():
    print(index, row[0])
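
If the goal is a new DataFrame rather than printed output, the Series returned by isnull().sum() can be reshaped directly. Here is a minimal sketch, assuming a small made-up frame; the sample data and the names null_counts, column, and null_count are illustrative and not from the original question:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a known number of missing values per column.
df = pd.DataFrame({
    "col1": [1.0, 2.0, 3.0],
    "col2": [1.0, np.nan, 3.0],
    "col3": [np.nan, np.nan, np.nan],
})

# isnull().sum() gives a Series indexed by column name; rename_axis and
# reset_index turn that index into an ordinary column of the new frame.
null_counts = (
    df.isnull()
      .sum()
      .rename_axis("column")
      .reset_index(name="null_count")
)
print(null_counts)
```

The result has one row per original column, so it can be sorted or filtered like any other DataFrame, for example null_counts[null_counts["null_count"] > 0].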

 

# print the distinct values found in each column
for col in df:
    print(df[col].unique())



Using pandas.get_dummies (One-Hot Encoding)

 

get_dummies is the pandas way to do one-hot encoding. See the official documentation for the full list of parameters.

pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False)

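Parameters:

  • data : array-like, Series, or DataFrame. The input data.
  • prefix : string, list of strings, or dict of strings, default None. The prefix for the column names that get_dummies produces.
  • columns : list-like, default None. The names of the columns to encode.
  • dummy_na : bool, default False. Add a column that indicates missing values; if False, missing values are ignored.
  • drop_first : bool, default False. Return k-1 indicator columns out of k categories by dropping the first category.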
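
The effect of these parameters is easiest to see on a tiny frame. This is a sketch only; the sample data (including the deliberately missing value) and the prefix "c" with separator "-" are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'color' has three categories and one missing value.
df = pd.DataFrame({
    "color": ["green", "red", "blue", np.nan],
    "class": ["A", "B", "A", "B"],
})

# columns: encode only 'color'; 'class' is passed through unchanged.
# prefix / prefix_sep: the new columns are named c-blue, c-green, c-red.
encoded = pd.get_dummies(df, columns=["color"], prefix="c", prefix_sep="-")
print(list(encoded.columns))

# dummy_na=True adds one extra indicator column for the missing values.
with_na = pd.get_dummies(df["color"], dummy_na=True)
print(with_na.shape)

# drop_first=True keeps k-1 of the k categories; the first one
# in sorted order ('blue') is dropped.
reduced = pd.get_dummies(df["color"], drop_first=True)
print(list(reduced.columns))
```

Without dummy_na=True, the row with the missing color simply gets 0 (or False, depending on the pandas version) in every indicator column.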

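Encoding a discrete feature falls into two cases:

1. The values have no meaningful order, for example color: [red, blue]. In that case use one-hot encoding.

2. The values have a meaningful order, for example size: [X, XL, XXL]. In that case map them to numbers, e.g. {X:1, XL:2, XXL:3}.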
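
The two cases can be sketched side by side. The frame below is made up for illustration, and size_order is a hypothetical name for the mapping given above:

```python
import pandas as pd

# Hypothetical data with one unordered and one ordered feature.
df = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "size": ["X", "XXL", "XL"],
})

# Ordered feature: map each label to a number that preserves the order.
size_order = {"X": 1, "XL": 2, "XXL": 3}
df["size"] = df["size"].map(size_order)

# Unordered feature: one indicator column per category.
df = pd.get_dummies(df, columns=["color"])
print(df)
```

Note that Series.map returns NaN for any label missing from the dictionary, so it is worth checking for nulls after the mapping.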

Example:

import pandas as pd

df = pd.DataFrame([  
            ['green' , 'A'],   
            ['red'   , 'B'],   
            ['blue'  , 'A']])  

df.columns = ['color',  'class'] 
pd.get_dummies(df)

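Before get_dummies (figure not preserved): the frame has the two original columns, color and class.

After get_dummies (figure not preserved): the frame has one indicator column per category, named color_blue, color_green, color_red, class_A, and class_B.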

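If you print df after running the code above, it still looks the same as before get_dummies. get_dummies returns a new DataFrame and does not modify df in place, so the result has to be assigned back: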

df = pd.get_dummies(df)

get_dummies can also be applied to a single column:

pd.get_dummies(df.color)

To apply get_dummies to a single column and merge the result back into the original data:

df = df.join(pd.get_dummies(df.color))
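
Joining this way keeps the original color column and names the new columns just blue, green, and red. If that is not wanted, one possible variant (a sketch, not part of the referenced article; the names dummies and merged are illustrative) adds a prefix and drops the source column:

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["green", "red", "blue"],
    "class": ["A", "B", "A"],
})

# Prefix the indicator columns so their origin stays visible.
dummies = pd.get_dummies(df["color"], prefix="color")

# Drop the original text column, then attach the indicators by index.
merged = df.drop(columns="color").join(dummies)
print(list(merged.columns))
```

The same result can be obtained in one step with pd.get_dummies(df, columns=["color"]); the explicit join is mainly useful when the indicator columns need further processing before merging.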

Reference: https://blog.csdn.net/maymay_/article/details/80198468

 
>>> train_filter.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1482 entries, 0 to 1481
Columns: 182 entries, SampleID to BS120
dtypes: float64(177), int64(2), object(3)
memory usage: 2.1+ MB
>>> train_filter.dtypes
SampleID object
Streptococcus Infection float64
Duration_of_gestation object
Gestation_age float64
Gestation_age_G1 float64
Gestation_age_G2 float64
GDM_HDP float64
Age int64
Age_group int64
Blood_type float64
Medication_use float64
Progesterone_use float64
Pregnancy_mode float64
Native_place float64
Combined_disease float64
Infection float64
Scar_uterus float64
Risk_rating float64
Anamnesis float64
Thalassemia float64
Ovary_disease float64
Hepatopathy float64
Allergic_history float64
Thyroid_disease float64
Hysteromyoma float64
Breast_disease float64
Weight_at_delivery object
Weight_before_pregnancy float64
Height float64
BMI_before_pregnancy float64
 ... 
B_A/G float64
B_r_GT_G float64
B_r_GT float64
B_TBA_G float64
B_TBA float64
B_ALT_G float64
B_ALT float64
B_AST_G float64
B_AST float64
B_TBIL_G float64
B_TBIL float64
B_DBIL_G float64
B_DBIL float64
B_IBIL_G float64
B_IBIL float64
B_Crea_G float64
B_Crea float64
B_CysC_G float64
B_CysC float64
B_UA_G float64
B_UA float64
B_Urea_G float64
B_Urea float64
B_GLU_G float64
B_GLU float64
HbA1c_G float64
HbA1c float64
BS float64
BS60 float64
BS120 float64
Length: 182, dtype: object



scikit-learn is a machine learning toolkit for Python.

 

 

http://www.scikitlearn.com.cn/
