A Simple Analysis of TCGA Gene Expression Data (1)

Yinski 4,238 2022-08-12

Downloading TCGA Data

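There are many ways to download TCGA data, so I won't go into detail here: you can use the R package TCGAbiolinks, download the files from the GDC website and process them yourself, or go through a third-party source such as Xena.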

Basic Processing of TCGA Data

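I personally prefer to solve problems in Python, so the steps below use Python to tidy up the expression matrix and make downstream analysis easier.
Note: the TCGA matrix used here has already been merged with TCGAbiolinks, and the gene IDs have already been replaced with gene names.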

Import libraries

import numpy as np
import pandas as pd
from glob import glob
import os

Load the data

tpm_matrix_raw = pd.read_csv('HNSC_exp_tpm.csv')

Look at the first three rows

tpm_matrix_raw.head(3)
id TCGA-CV-6942-01A-21R-2016-07 TCGA-CV-7406-11A-01R-2081-07 TCGA-CV-7406-01A-11R-2081-07 TCGA-UF-A71A-06A-11R-A39I-07 TCGA-UF-A71A-01A-22R-A34R-07 TCGA-QK-A8Z7-01A-11R-A39I-07 TCGA-CV-6960-01A-41R-2016-07 TCGA-CV-6960-11A-01R-2016-07 TCGA-CV-A45W-01A-11R-A24Z-07 ... TCGA-CV-6933-01A-11R-1915-07 TCGA-MT-A51X-01A-11R-A266-07 TCGA-CV-7236-01A-11R-2016-07 TCGA-CN-6995-01A-31R-2016-07 TCGA-P3-A5Q5-01A-11R-A28V-07 TCGA-QK-A6II-01A-11R-A31N-07 TCGA-D6-6823-01A-11R-1915-07 TCGA-CR-7374-01A-11R-2016-07 TCGA-CN-4735-01A-01R-1436-07 TCGA-T2-A6X0-01A-11R-A34R-07
0 TSPAN6 16.8589 14.4163 57.4243 46.0498 40.8026 68.0685 26.8010 50.9679 29.3844 ... 39.1086 19.0102 29.8943 18.7316 16.3261 12.8800 26.5914 51.5997 28.8314 24.5403
1 TNMD 0.1345 1.1367 0.0000 0.0659 0.1533 0.0000 0.0000 0.0000 0.0000 ... 0.0000 0.0000 0.0000 0.0000 0.0729 0.0000 0.0000 0.0000 0.0000 0.0715
2 DPM1 62.4801 61.4588 92.1788 136.5788 107.2250 123.1408 154.7088 79.8137 56.8436 ... 97.4329 151.8336 88.4926 103.8825 87.8406 49.2468 143.9244 110.2718 117.1813 107.3840

3 rows × 549 columns

Set id as the index

tpm_matrix_raw = tpm_matrix_raw.set_index(keys=['id'])
tpm_matrix_raw.head(3)
TCGA-CV-6942-01A-21R-2016-07 TCGA-CV-7406-11A-01R-2081-07 TCGA-CV-7406-01A-11R-2081-07 TCGA-UF-A71A-06A-11R-A39I-07 TCGA-UF-A71A-01A-22R-A34R-07 TCGA-QK-A8Z7-01A-11R-A39I-07 TCGA-CV-6960-01A-41R-2016-07 TCGA-CV-6960-11A-01R-2016-07 TCGA-CV-A45W-01A-11R-A24Z-07 TCGA-CV-7422-01A-21R-2081-07 ... TCGA-CV-6933-01A-11R-1915-07 TCGA-MT-A51X-01A-11R-A266-07 TCGA-CV-7236-01A-11R-2016-07 TCGA-CN-6995-01A-31R-2016-07 TCGA-P3-A5Q5-01A-11R-A28V-07 TCGA-QK-A6II-01A-11R-A31N-07 TCGA-D6-6823-01A-11R-1915-07 TCGA-CR-7374-01A-11R-2016-07 TCGA-CN-4735-01A-01R-1436-07 TCGA-T2-A6X0-01A-11R-A34R-07
id
TSPAN6 16.8589 14.4163 57.4243 46.0498 40.8026 68.0685 26.8010 50.9679 29.3844 49.3624 ... 39.1086 19.0102 29.8943 18.7316 16.3261 12.8800 26.5914 51.5997 28.8314 24.5403
TNMD 0.1345 1.1367 0.0000 0.0659 0.1533 0.0000 0.0000 0.0000 0.0000 0.0000 ... 0.0000 0.0000 0.0000 0.0000 0.0729 0.0000 0.0000 0.0000 0.0000 0.0715
DPM1 62.4801 61.4588 92.1788 136.5788 107.2250 123.1408 154.7088 79.8137 56.8436 111.9194 ... 97.4329 151.8336 88.4926 103.8825 87.8406 49.2468 143.9244 110.2718 117.1813 107.3840

3 rows × 548 columns

Transpose the matrix

tpm_matrix_T = tpm_matrix_raw.T
tpm_matrix_T.head(3)
id TSPAN6 TNMD DPM1 SCYL3 C1orf112 FGR CFH FUCA2 GCLC NFYA ... MIR4318 MIR4706 MIR3619 AC008133.1 SPDYE22P AC005838.2 AP001381.1 ATP6V1E1P2 RN7SL356P LINC01443
TCGA-CV-6942-01A-21R-2016-07 16.8589 0.1345 62.4801 3.0755 2.8034 13.6529 47.5832 47.8443 13.0953 12.6368 ... 0.0 0.0000 0.0 0.1188 0.5803 0.4174 0.4955 0.1848 0.2667 0.0000
TCGA-CV-7406-11A-01R-2081-07 14.4163 1.1367 61.4588 2.0197 0.6490 2.3741 19.8581 7.8279 5.0802 7.9977 ... 0.0 0.0000 0.0 0.0000 0.1001 0.1260 0.0000 0.0000 0.0000 0.0000
TCGA-CV-7406-01A-11R-2081-07 57.4243 0.0000 92.1788 10.0763 11.5054 12.0852 52.6491 70.4372 30.1331 33.7680 ... 0.0 0.7406 0.0 0.0000 0.0000 0.0000 0.7961 0.0000 0.0000 0.0427

3 rows × 46383 columns
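
The load, index, and transpose steps above can also be collapsed into a single expression. The sketch below is only an illustration: it reads a tiny in-memory CSV (the names `toy_csv` and `toy_T` are made up for this example, and the values are copied from the preview above) instead of the real HNSC_exp_tpm.csv.

```python
import io

import pandas as pd

# Toy stand-in for HNSC_exp_tpm.csv: genes in rows, samples in columns.
# Only two samples and three genes are kept, purely for illustration.
toy_csv = io.StringIO(
    "id,TCGA-CV-6942-01A-21R-2016-07,TCGA-CV-7406-11A-01R-2081-07\n"
    "TSPAN6,16.8589,14.4163\n"
    "TNMD,0.1345,1.1367\n"
    "DPM1,62.4801,61.4588\n"
)

# index_col="id" replaces the separate set_index call;
# .T then puts samples in rows and genes in columns.
toy_T = pd.read_csv(toy_csv, index_col="id").T

print(toy_T.shape)  # (2, 3): 2 samples x 3 genes
```

With the real file, the same pattern would be `pd.read_csv('HNSC_exp_tpm.csv', index_col='id').T`, assuming the first column is still named id.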

Record the TCGA IDs

TCGA_id = tpm_matrix_T.index.values

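Define the tumor sample type codes

The fourth field of a TCGA barcode (for example 01A) begins with a two-digit sample type code, and codes 01 to 09 denote tumor samples. We use this code to classify the samples.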

cancer_type = ['01','02','03','04','05','06','07','08','09']
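
The cancer_type list only covers the tumor codes. In the TCGA sample type code table, 10 to 19 are normal samples and 20 to 29 are control samples, so a slightly stricter classifier might look like the sketch below. The function name `classify_code` and the 'control' and 'unknown' labels are my own illustration and are not used elsewhere in this post.

```python
def classify_code(code):
    """Map a two-digit TCGA sample type code to a coarse label.

    Hypothetical helper: the ranges follow the TCGA sample type code
    table (01-09 tumor, 10-19 normal, 20-29 control).
    """
    n = int(code)
    if 1 <= n <= 9:
        return "cancer"
    if 10 <= n <= 19:
        return "normal"
    if 20 <= n <= 29:
        return "control"
    return "unknown"


for code in ["01", "06", "11", "20"]:
    print(code, classify_code(code))
```

For a dataset that contains only tumor and normal samples, this gives the same labels as a simple membership test against cancer_type.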

Inspect the sample TCGA barcodes

TCGA_id
array(['TCGA-CV-6942-01A-21R-2016-07', 'TCGA-CV-7406-11A-01R-2081-07',
       'TCGA-CV-7406-01A-11R-2081-07', 'TCGA-UF-A71A-06A-11R-A39I-07',
       'TCGA-UF-A71A-01A-22R-A34R-07', 'TCGA-QK-A8Z7-01A-11R-A39I-07',
       ...
       'TCGA-CV-6960-01A-41R-2016-07', 'TCGA-CV-6960-11A-01R-2016-07',
       'TCGA-D6-6823-01A-11R-1915-07', 'TCGA-CR-7374-01A-11R-2016-07',
       'TCGA-CN-4735-01A-01R-1436-07', 'TCGA-T2-A6X0-01A-11R-A34R-07'],
      dtype=object)

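Extract the sample type code from each TCGA barcode

type_id = []
for i in TCGA_id:
    # Characters 13-14 (0-based) are the two-digit sample type code,
    # e.g. '01' in 'TCGA-CV-6942-01A-21R-2016-07'
    typeid = i[13:15]
    type_id.append(typeid)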
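
The fixed slice works because every barcode here has the same layout. As a hedged alternative, the code can be taken from the fourth dash-separated field, which fails loudly on a malformed barcode instead of silently returning the wrong characters. `sample_type_code` is a hypothetical helper name used only for this sketch.

```python
def sample_type_code(barcode):
    """Return the two-digit sample type code of a TCGA barcode.

    Hypothetical helper: splits on '-' rather than relying on fixed
    character positions.
    """
    fields = barcode.split("-")
    if len(fields) < 4 or len(fields[3]) < 2:
        raise ValueError(f"not a sample-level TCGA barcode: {barcode!r}")
    # fields[3] is the sample field, e.g. '01A': type code plus vial letter
    return fields[3][:2]


barcodes = [
    "TCGA-CV-6942-01A-21R-2016-07",
    "TCGA-CV-7406-11A-01R-2081-07",
    "TCGA-UF-A71A-06A-11R-A39I-07",
]
codes = [sample_type_code(b) for b in barcodes]
print(codes)  # ['01', '11', '06']
```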

Map each sample type code to a sample type

typelist = []
for i in type_id:
    if i in cancer_type:
        typelist.append('cancer')
    else:
        # Any code outside 01-09 is labeled 'normal' here
        typelist.append('normal')
typelist
['cancer',
 'normal',
 'cancer',
 'cancer',
 'cancer',
 'cancer',
 'cancer',
 'normal',
 'cancer',
 'cancer',
 'cancer',
 'cancer',
  ...
 'cancer',
 'cancer',
 'cancer',
 'cancer']
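
The two loops above can also be written with pandas string methods. This is just an equivalent sketch on four of the barcodes shown earlier; `idx`, `codes`, and `sample_type` are names I made up for the example.

```python
import pandas as pd

cancer_type = ['01', '02', '03', '04', '05', '06', '07', '08', '09']

# A few barcodes from the matrix above, standing in for tpm_matrix_T.index
idx = pd.Index([
    "TCGA-CV-6942-01A-21R-2016-07",
    "TCGA-CV-7406-11A-01R-2081-07",
    "TCGA-CV-7406-01A-11R-2081-07",
    "TCGA-UF-A71A-06A-11R-A39I-07",
])

# Same slice as i[13:15], applied to every barcode at once
codes = idx.str.slice(13, 15)

# True where the code is a tumor code, then mapped to the two labels
sample_type = pd.Series(codes.isin(cancer_type), index=idx).map(
    {True: "cancer", False: "normal"}
)
print(sample_type.tolist())  # ['cancer', 'normal', 'cancer', 'cancer']
```

Because the result is indexed by barcode, it can be assigned to a DataFrame column and will align on the sample IDs rather than on position.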

Add the list to the matrix as a new column named sample_type

tpm_matrix_T['sample_type'] = typelist
tpm_matrix_T.head(5)
id TSPAN6 TNMD DPM1 SCYL3 C1orf112 FGR CFH FUCA2 GCLC NFYA ... MIR4706 MIR3619 AC008133.1 SPDYE22P AC005838.2 AP001381.1 ATP6V1E1P2 RN7SL356P LINC01443 sample_type
TCGA-CV-6942-01A-21R-2016-07 16.8589 0.1345 62.4801 3.0755 2.8034 13.6529 47.5832 47.8443 13.0953 12.6368 ... 0.0000 0.0000 0.1188 0.5803 0.4174 0.4955 0.1848 0.2667 0.0000 cancer
TCGA-CV-7406-11A-01R-2081-07 14.4163 1.1367 61.4588 2.0197 0.6490 2.3741 19.8581 7.8279 5.0802 7.9977 ... 0.0000 0.0000 0.0000 0.1001 0.1260 0.0000 0.0000 0.0000 0.0000 normal
TCGA-CV-7406-01A-11R-2081-07 57.4243 0.0000 92.1788 10.0763 11.5054 12.0852 52.6491 70.4372 30.1331 33.7680 ... 0.7406 0.0000 0.0000 0.0000 0.0000 0.7961 0.0000 0.0000 0.0427 cancer
TCGA-UF-A71A-06A-11R-A39I-07 46.0498 0.0659 136.5788 5.1121 6.1055 4.5697 4.1779 14.2252 174.9097 18.9757 ... 0.0000 0.0000 0.0000 0.0000 1.8397 0.1820 0.0000 0.7839 0.0000 cancer
TCGA-UF-A71A-01A-22R-A34R-07 40.8026 0.1533 107.2250 3.4297 5.2049 7.4706 13.0709 15.7682 120.1223 14.0312 ... 0.0000 0.9087 0.0000 0.0000 0.7138 0.7062 0.0000 0.9123 0.0000 cancer

5 rows × 46384 columns

With that, the matrix is in a convenient shape for the next step of the analysis.
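
As a quick sanity check before moving on, one might count the samples in each group and compare mean expression between them. The sketch below runs on a three-sample toy frame built from values shown above; `toy`, `counts`, and `mean_tpm` are illustrative names, not part of the original workflow.

```python
import pandas as pd

# Toy version of tpm_matrix_T: three samples, two genes, plus sample_type
toy = pd.DataFrame(
    {
        "TSPAN6": [16.8589, 14.4163, 57.4243],
        "TNMD": [0.1345, 1.1367, 0.0],
        "sample_type": ["cancer", "normal", "cancer"],
    },
    index=[
        "TCGA-CV-6942-01A-21R-2016-07",
        "TCGA-CV-7406-11A-01R-2081-07",
        "TCGA-CV-7406-01A-11R-2081-07",
    ],
)

# How many samples fall into each group
counts = toy["sample_type"].value_counts()
print(counts)

# Mean TPM per gene within each group
mean_tpm = toy.groupby("sample_type")[["TSPAN6", "TNMD"]].mean()
print(mean_tpm)
```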


# Python # Bioinformatics # Data processing