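Downloading TCGA data
There are many ways to download TCGA data, so I won't cover them in detail here: you can use the R package TCGAbiolinks, download the files from the GDC portal and process them yourself, or pull them from a third-party source such as Xena.
- Reference:
Downloading and tidying a TCGA expression matrix with R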
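Basic processing of TCGA data
I prefer to solve problems in Python, so the steps below use Python to lightly process the matrix and make downstream analysis easier.
Note: the TCGA matrix used here has already been merged with TCGAbiolinks, and the gene names have already been substituted in.
Import libraries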
import numpy as np
import pandas as pd
from glob import glob
import os
Load the data
tpm_matrix_raw = pd.read_csv('HNSC_exp_tpm.csv')
View the first three rows
tpm_matrix_raw.head(3)
| | id | TCGA-CV-6942-01A-21R-2016-07 | TCGA-CV-7406-11A-01R-2081-07 | TCGA-CV-7406-01A-11R-2081-07 | TCGA-UF-A71A-06A-11R-A39I-07 | TCGA-UF-A71A-01A-22R-A34R-07 | TCGA-QK-A8Z7-01A-11R-A39I-07 | TCGA-CV-6960-01A-41R-2016-07 | TCGA-CV-6960-11A-01R-2016-07 | TCGA-CV-A45W-01A-11R-A24Z-07 | ... | TCGA-CV-6933-01A-11R-1915-07 | TCGA-MT-A51X-01A-11R-A266-07 | TCGA-CV-7236-01A-11R-2016-07 | TCGA-CN-6995-01A-31R-2016-07 | TCGA-P3-A5Q5-01A-11R-A28V-07 | TCGA-QK-A6II-01A-11R-A31N-07 | TCGA-D6-6823-01A-11R-1915-07 | TCGA-CR-7374-01A-11R-2016-07 | TCGA-CN-4735-01A-01R-1436-07 | TCGA-T2-A6X0-01A-11R-A34R-07 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | TSPAN6 | 16.8589 | 14.4163 | 57.4243 | 46.0498 | 40.8026 | 68.0685 | 26.8010 | 50.9679 | 29.3844 | ... | 39.1086 | 19.0102 | 29.8943 | 18.7316 | 16.3261 | 12.8800 | 26.5914 | 51.5997 | 28.8314 | 24.5403 |
1 | TNMD | 0.1345 | 1.1367 | 0.0000 | 0.0659 | 0.1533 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | ... | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0729 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0715 |
2 | DPM1 | 62.4801 | 61.4588 | 92.1788 | 136.5788 | 107.2250 | 123.1408 | 154.7088 | 79.8137 | 56.8436 | ... | 97.4329 | 151.8336 | 88.4926 | 103.8825 | 87.8406 | 49.2468 | 143.9244 | 110.2718 | 117.1813 | 107.3840 |
3 rows × 549 columns
Set the id column as the index
tpm_matrix_raw=tpm_matrix_raw.set_index(keys=['id'])
tpm_matrix_raw.head(3)
id | TCGA-CV-6942-01A-21R-2016-07 | TCGA-CV-7406-11A-01R-2081-07 | TCGA-CV-7406-01A-11R-2081-07 | TCGA-UF-A71A-06A-11R-A39I-07 | TCGA-UF-A71A-01A-22R-A34R-07 | TCGA-QK-A8Z7-01A-11R-A39I-07 | TCGA-CV-6960-01A-41R-2016-07 | TCGA-CV-6960-11A-01R-2016-07 | TCGA-CV-A45W-01A-11R-A24Z-07 | TCGA-CV-7422-01A-21R-2081-07 | ... | TCGA-CV-6933-01A-11R-1915-07 | TCGA-MT-A51X-01A-11R-A266-07 | TCGA-CV-7236-01A-11R-2016-07 | TCGA-CN-6995-01A-31R-2016-07 | TCGA-P3-A5Q5-01A-11R-A28V-07 | TCGA-QK-A6II-01A-11R-A31N-07 | TCGA-D6-6823-01A-11R-1915-07 | TCGA-CR-7374-01A-11R-2016-07 | TCGA-CN-4735-01A-01R-1436-07 | TCGA-T2-A6X0-01A-11R-A34R-07 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TSPAN6 | 16.8589 | 14.4163 | 57.4243 | 46.0498 | 40.8026 | 68.0685 | 26.8010 | 50.9679 | 29.3844 | 49.3624 | ... | 39.1086 | 19.0102 | 29.8943 | 18.7316 | 16.3261 | 12.8800 | 26.5914 | 51.5997 | 28.8314 | 24.5403 |
TNMD | 0.1345 | 1.1367 | 0.0000 | 0.0659 | 0.1533 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | ... | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0729 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0715 |
DPM1 | 62.4801 | 61.4588 | 92.1788 | 136.5788 | 107.2250 | 123.1408 | 154.7088 | 79.8137 | 56.8436 | 111.9194 | ... | 97.4329 | 151.8336 | 88.4926 | 103.8825 | 87.8406 | 49.2468 | 143.9244 | 110.2718 | 117.1813 | 107.3840 |
3 rows × 548 columns
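The load and set-index steps can also be combined, since `pd.read_csv` accepts an `index_col` argument. The sketch below is only an illustration: it reads a tiny in-memory CSV (three genes and two samples copied from the preview above) instead of `HNSC_exp_tpm.csv`, and `csv_text` and `tpm_demo` are made-up names.

```python
import io
import pandas as pd

# Tiny stand-in for HNSC_exp_tpm.csv (illustration only; values copied from the preview).
csv_text = """id,TCGA-CV-6942-01A-21R-2016-07,TCGA-CV-7406-11A-01R-2081-07
TSPAN6,16.8589,14.4163
TNMD,0.1345,1.1367
DPM1,62.4801,61.4588
"""

# index_col="id" makes the gene-name column the row index while reading,
# which matches read_csv(...) followed by set_index("id").
tpm_demo = pd.read_csv(io.StringIO(csv_text), index_col="id")

print(tpm_demo.shape)
print(tpm_demo.index.tolist())
```

With the real file, the same idea would be `pd.read_csv('HNSC_exp_tpm.csv', index_col='id')`, assuming the first column is named `id` as in the preview.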
Transpose the matrix
tpm_matrix_T=tpm_matrix_raw.T
tpm_matrix_T.head(3)
id | TSPAN6 | TNMD | DPM1 | SCYL3 | C1orf112 | FGR | CFH | FUCA2 | GCLC | NFYA | ... | MIR4318 | MIR4706 | MIR3619 | AC008133.1 | SPDYE22P | AC005838.2 | AP001381.1 | ATP6V1E1P2 | RN7SL356P | LINC01443 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TCGA-CV-6942-01A-21R-2016-07 | 16.8589 | 0.1345 | 62.4801 | 3.0755 | 2.8034 | 13.6529 | 47.5832 | 47.8443 | 13.0953 | 12.6368 | ... | 0.0 | 0.0000 | 0.0 | 0.1188 | 0.5803 | 0.4174 | 0.4955 | 0.1848 | 0.2667 | 0.0000 |
TCGA-CV-7406-11A-01R-2081-07 | 14.4163 | 1.1367 | 61.4588 | 2.0197 | 0.6490 | 2.3741 | 19.8581 | 7.8279 | 5.0802 | 7.9977 | ... | 0.0 | 0.0000 | 0.0 | 0.0000 | 0.1001 | 0.1260 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
TCGA-CV-7406-01A-11R-2081-07 | 57.4243 | 0.0000 | 92.1788 | 10.0763 | 11.5054 | 12.0852 | 52.6491 | 70.4372 | 30.1331 | 33.7680 | ... | 0.0 | 0.7406 | 0.0 | 0.0000 | 0.0000 | 0.0000 | 0.7961 | 0.0000 | 0.0000 | 0.0427 |
3 rows × 46383 columns
Record the TCGA IDs
TCGA_id = tpm_matrix_T.index.values
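Define the tumor sample type codes
The fourth hyphen-separated field of a TCGA barcode holds the sample code. Its first two digits give the sample type (01-09 tumor, 10-19 normal, 20-29 control), so we use them to classify the samples.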
cancer_type = ['01','02','03','04','05','06','07','08','09']
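To make the barcode layout concrete, the sketch below splits one full aliquot barcode on hyphens and pulls out the sample field. The helper name `parse_barcode` and the dictionary keys are hypothetical names chosen for illustration. It assumes the full seven-field barcode format seen in this matrix.

```python
def parse_barcode(barcode):
    """Split a full TCGA aliquot barcode into named parts (hypothetical helper)."""
    # Assumes exactly seven hyphen-separated fields, as in the column names of this matrix.
    project, tss, participant, sample_vial, portion_analyte, plate, center = barcode.split("-")
    return {
        "project": project,
        "tss": tss,                          # tissue source site
        "participant": participant,
        "sample_type": sample_vial[:2],      # two-digit sample type code
        "vial": sample_vial[2:],
        "portion": portion_analyte[:2],
        "analyte": portion_analyte[2:],
        "plate": plate,
        "center": center,
    }

parts = parse_barcode("TCGA-CV-7406-11A-01R-2081-07")
print(parts["sample_type"])
```

Shorter barcodes (for example patient-level IDs) would not unpack into seven fields, so this helper is only meant for column names like the ones shown here.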
Inspect the sample TCGA barcodes
TCGA_id
array(['TCGA-CV-6942-01A-21R-2016-07', 'TCGA-CV-7406-11A-01R-2081-07',
'TCGA-CV-7406-01A-11R-2081-07', 'TCGA-UF-A71A-06A-11R-A39I-07',
'TCGA-UF-A71A-01A-22R-A34R-07', 'TCGA-QK-A8Z7-01A-11R-A39I-07',
...
'TCGA-CV-6960-01A-41R-2016-07', 'TCGA-CV-6960-11A-01R-2016-07',
'TCGA-D6-6823-01A-11R-1915-07', 'TCGA-CR-7374-01A-11R-2016-07',
'TCGA-CN-4735-01A-01R-1436-07', 'TCGA-T2-A6X0-01A-11R-A34R-07'],
dtype=object)
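Extract the sample type code from each TCGA barcode
type_id = []
for i in TCGA_id:
    # characters 13-14 (0-based) are the two-digit sample type code, e.g. '01' or '11'
    typeid=i[13:15]
    type_id.append(typeid)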
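The same codes can be extracted without an explicit loop. The sketch below uses the pandas string accessor on a small stand-in index (three barcodes taken from this dataset). Splitting on `-` is shown as an alternative to fixed character positions. Both variants assume the full barcode layout, and the variable names are illustrative.

```python
import pandas as pd

ids = pd.Index([
    "TCGA-CV-6942-01A-21R-2016-07",
    "TCGA-CV-7406-11A-01R-2081-07",
    "TCGA-UF-A71A-06A-11R-A39I-07",
])

# Option 1: fixed character positions, the same slice as the loop (i[13:15]).
codes_by_position = ids.str[13:15]

# Option 2: take the fourth hyphen-separated field, then its first two characters.
codes_by_field = ids.str.split("-").str[3].str[:2]

print(list(codes_by_position))
print(list(codes_by_field))
```

Option 2 does not depend on the earlier fields having a fixed width, which may make it a little more robust if the IDs are ever formatted differently.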
Classify each sample by its sample type code
typelist = []
for i in type_id:
    if i in cancer_type:
        typelist.append('cancer')
    else:
        # every code outside 01-09 is treated as normal here
        typelist.append('normal')
typelist
['cancer',
'normal',
'cancer',
'cancer',
'cancer',
'cancer',
'cancer',
'normal',
'cancer',
'cancer',
'cancer',
'cancer',
...
'cancer',
'cancer',
'cancer',
'cancer']
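Because the loop above labels every code outside `cancer_type` as `normal`, control samples (codes 20-29 in the TCGA convention) would be mislabeled if a cohort contained any. That is not a problem when only tumor and normal samples are present. If it might be, a stricter mapping could look like the sketch below. The function name `classify_sample` and the `control` and `unknown` labels are assumptions for illustration, not part of the original workflow.

```python
def classify_sample(code):
    """Map a two-digit TCGA sample type code to a coarse label (hypothetical helper)."""
    if not code.isdecimal():
        return "unknown"
    n = int(code)
    if 1 <= n <= 9:
        return "cancer"
    if 10 <= n <= 19:
        return "normal"
    if 20 <= n <= 29:
        return "control"
    return "unknown"

demo_codes = ["01", "11", "06", "20"]
demo_labels = [classify_sample(c) for c in demo_codes]
print(demo_labels)
```

Applied to the walkthrough, this would be `typelist = [classify_sample(c) for c in type_id]`. On a cohort with only tumor and normal samples it should give the same labels as the loop.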
Add the list to the matrix as a new column named sample_type
tpm_matrix_T['sample_type']=typelist
tpm_matrix_T.head(5)
id | TSPAN6 | TNMD | DPM1 | SCYL3 | C1orf112 | FGR | CFH | FUCA2 | GCLC | NFYA | ... | MIR4706 | MIR3619 | AC008133.1 | SPDYE22P | AC005838.2 | AP001381.1 | ATP6V1E1P2 | RN7SL356P | LINC01443 | sample_type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TCGA-CV-6942-01A-21R-2016-07 | 16.8589 | 0.1345 | 62.4801 | 3.0755 | 2.8034 | 13.6529 | 47.5832 | 47.8443 | 13.0953 | 12.6368 | ... | 0.0000 | 0.0000 | 0.1188 | 0.5803 | 0.4174 | 0.4955 | 0.1848 | 0.2667 | 0.0000 | cancer |
TCGA-CV-7406-11A-01R-2081-07 | 14.4163 | 1.1367 | 61.4588 | 2.0197 | 0.6490 | 2.3741 | 19.8581 | 7.8279 | 5.0802 | 7.9977 | ... | 0.0000 | 0.0000 | 0.0000 | 0.1001 | 0.1260 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | normal |
TCGA-CV-7406-01A-11R-2081-07 | 57.4243 | 0.0000 | 92.1788 | 10.0763 | 11.5054 | 12.0852 | 52.6491 | 70.4372 | 30.1331 | 33.7680 | ... | 0.7406 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.7961 | 0.0000 | 0.0000 | 0.0427 | cancer |
TCGA-UF-A71A-06A-11R-A39I-07 | 46.0498 | 0.0659 | 136.5788 | 5.1121 | 6.1055 | 4.5697 | 4.1779 | 14.2252 | 174.9097 | 18.9757 | ... | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.8397 | 0.1820 | 0.0000 | 0.7839 | 0.0000 | cancer |
TCGA-UF-A71A-01A-22R-A34R-07 | 40.8026 | 0.1533 | 107.2250 | 3.4297 | 5.2049 | 7.4706 | 13.0709 | 15.7682 | 120.1223 | 14.0312 | ... | 0.0000 | 0.9087 | 0.0000 | 0.0000 | 0.7138 | 0.7062 | 0.0000 | 0.9123 | 0.0000 | cancer |
5 rows × 46384 columns
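A quick sanity check at this point is to count the labels and split the matrix by group. The sketch below does this on a small stand-in frame (two genes and three samples copied from the preview). The names `demo`, `counts`, `tumor`, and `normal` are illustrative.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "TSPAN6": [16.8589, 14.4163, 57.4243],
        "TNMD": [0.1345, 1.1367, 0.0],
        "sample_type": ["cancer", "normal", "cancer"],
    },
    index=[
        "TCGA-CV-6942-01A-21R-2016-07",
        "TCGA-CV-7406-11A-01R-2081-07",
        "TCGA-CV-7406-01A-11R-2081-07",
    ],
)

# How many samples fall into each group.
counts = demo["sample_type"].value_counts()
print(counts)

# Expression-only sub-matrices per group (the label column is dropped).
tumor = demo[demo["sample_type"] == "cancer"].drop(columns="sample_type")
normal = demo[demo["sample_type"] == "normal"].drop(columns="sample_type")
print(tumor.shape, normal.shape)
```

On the real data the same calls would run on `tpm_matrix_T`. Dropping `sample_type` before numeric work avoids mixing a string column into the expression values.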
With this, the data matrix is ready for the next step of the analysis.