An article titled "Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data" states: "Our results demonstrate that the original Affymetrix probe set definitions are inaccurate, and many conclusions derived from past GeneChip analyses may be significantly flawed. It will be beneficial to re-analyze existing GeneChip data with updated probe set definitions". In other words, the original Affymetrix probe set definitions are inaccurate and outdated, so the probe sets should be redefined (Custom CDF).
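The redefinition formats, each named after the annotation source it is based on, include:
Name | Meaning |
ENTREZG | Entrez Gene |
REFSEQ | RefSeq |
ENSG | Ensembl Gene |
ENSE | Ensembl Exon |
ENST | Ensembl Transcript |
VEGAG | Vega Gene |
VEGAE | Vega Exon |
VEGAT | Vega Transcript |
TAIRG | TAIR Gene |
TAIRT | TAIR Transcript |
UG | UniGene |
MIRBASEF | miRBase family |
MIRBASEG | miRBase gene |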
The species include:
Scientific name | Common name | Abbreviation |
Anopheles gambiae | African malaria mosquito | Ag |
Arabidopsis thaliana | Thale cress | At |
Bos taurus | Cattle | Bt |
Caenorhabditis elegans | Nematode (roundworm) | Ce |
Canis familiaris | Dog | Cf |
Danio rerio | Zebrafish | Dr |
Drosophila melanogaster | Fruit fly | Dm |
Gallus gallus | Chicken (red junglefowl) | Gg |
Homo sapiens | Human | Hs |
Macaca mulatta | Rhesus macaque | MAmu |
Mus musculus | House mouse | Mm |
Oryza sativa | Rice | Os |
Rattus norvegicus | Brown rat | Rn |
Saccharomyces cerevisiae | Brewer's yeast | Sc |
Schizosaccharomyces pombe | Fission yeast | Sp |
Sus scrofa | Pig (wild boar) | Ss |
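As mentioned in "02. CDF Files", each chip model has a corresponding CDF package. A redefined CDF package is named as follows (the plus signs denote concatenation and are not part of the name):
original CDF name + lowercase species abbreviation + lowercase format name + cdf
For example, the HG-U133_Plus_2 array originally uses the CDF name hgu133plus2. If you choose Homo sapiens (human) and the ENSG format, the corresponding package is hgu133plus2hsensgcdf.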
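The naming rule can be sketched as a small helper. This is only an illustration: the function name `custom_cdf_name` and its argument names are hypothetical, and the trailing `cdf` suffix is inferred from the example package name hgu133plus2hsensgcdf rather than taken from any official specification.

```r
## Hypothetical helper: build a redefined CDF package name from its parts.
## array_cdf : original CDF name, e.g. "hgu133plus2"
## species   : species abbreviation from the table, e.g. "Hs"
## format    : format name from the table, e.g. "ENSG"
custom_cdf_name <- function(array_cdf, species, format) {
  ## Lowercase every part and append the "cdf" suffix (assumed from the example)
  paste0(tolower(array_cdf), tolower(species), tolower(format), "cdf")
}

pkg <- custom_cdf_name("hgu133plus2", "Hs", "ENSG")
print(pkg)
```

For the human ENSG example this reproduces the package name given in the text; other combinations follow the same pattern but should be checked against the packages that are actually available.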
To get the probe set names for hgu133plus2:
library(affy)  ## load the affy package
cdfname <- "hgu133plus2"
how <- getOption("BioC")$affy$probesloc  ## locations affy searches for the CDF environment
verbose <- FALSE
for (i in seq_along(how)) {
  cur <- how[[i]]
  envir <- switch(cur$what,
                  environment = cdfFromEnvironment(cdfname, cur$where, verbose),
                  libPath = cdfFromLibPath(cdfname, cur$where, verbose = verbose),
                  bioC = cdfFromBioC(cdfname, cur$where, verbose))
  if (is.environment(envir)) break  ## stop at the first location that yields the CDF environment
}
genenames <- ls(envir)  ## probe set names
> length(genenames)  ## number of probe sets
[1] 54675
> genenames[1:100]  ## print the first 100 probe set names
  [1] "1007_s_at" "1053_at" "117_at" "121_at"
  [5] "1255_g_at" "1294_at" "1316_at" "1320_at"
  [9] "1405_i_at" "1431_at" "1438_at" "1487_at"
 [13] "1494_f_at" "1552256_a_at" "1552257_a_at" "1552258_at"
 [17] "1552261_at" "1552263_at" "1552264_a_at" "1552266_at"
 [21] "1552269_at" "1552271_at" "1552272_a_at" "1552274_at"
 [25] "1552275_s_at" "1552276_a_at" "1552277_a_at" "1552278_a_at"
 [29] "1552279_a_at" "1552280_at" "1552281_at" "1552283_s_at"
 [33] "1552286_at" "1552287_s_at" "1552288_at" "1552289_a_at"
 [37] "1552291_at" "1552293_at" "1552295_a_at" "1552296_at"
 [41] "1552299_at" "1552301_a_at" "1552302_at" "1552303_a_at"
 [45] "1552304_at" "1552306_at" "1552307_a_at" "1552309_a_at"
 [49] "1552310_at" "1552311_a_at" "1552312_a_at" "1552314_a_at"
 [53] "1552315_at" "1552316_a_at" "1552318_at" "1552319_a_at"
 [57] "1552320_a_at" "1552321_a_at" "1552322_at" "1552323_s_at"
 [61] "1552325_at" "1552326_a_at" "1552327_at" "1552329_at"
 [65] "1552330_at" "1552332_at" "1552334_at" "1552335_at"
 [69] "1552337_s_at" "1552338_at" "1552340_at" "1552343_s_at"
 [73] "1552344_s_at" "1552347_at" "1552348_at" "1552349_a_at"
 [77] "1552354_at" "1552355_s_at" "1552359_at" "1552360_a_at"
 [81] "1552362_a_at" "1552364_s_at" "1552365_at" "1552367_a_at"
 [85] "1552368_at" "1552370_at" "1552372_at" "1552373_s_at"
 [89] "1552375_at" "1552377_s_at" "1552378_s_at" "1552379_at"
 [93] "1552381_at" "1552383_at" "1552384_a_at" "1552386_at"
 [97] "1552388_at" "1552389_at" "1552390_a_at" "1552391_at"
To get the probe set names for hgu133plus2_Hs_ENSG, set cdfname in the code above to "hgu133plus2hsensgcdf". The result is:
> length(genenames)  ## number of probe sets after redefinition
[1] 20009
> genenames[1:100]  ## first 100 probe set names after redefinition
  [1] "AFFX-BioB-3_at" "AFFX-BioB-5_at"
  [3] "AFFX-BioB-M_at" "AFFX-BioC-3_at"
  [5] "AFFX-BioC-5_at" "AFFX-BioDn-3_at"
  [7] "AFFX-BioDn-5_at" "AFFX-CreX-3_at"
  [9] "AFFX-CreX-5_at" "AFFX-DapX-3_at"
 [11] "AFFX-DapX-5_at" "AFFX-DapX-M_at"
 [13] "AFFX-HSAC07/X00351_3_at" "AFFX-HSAC07/X00351_5_at"
 [15] "AFFX-HSAC07/X00351_M_at" "AFFX-hum_alu_at"
 [17] "AFFX-HUMGAPDH/M33197_3_at" "AFFX-HUMGAPDH/M33197_5_at"
 [19] "AFFX-HUMGAPDH/M33197_M_at" "AFFX-HUMISGF3A/M97935_3_at"
 [21] "AFFX-HUMISGF3A/M97935_5_at" "AFFX-HUMISGF3A/M97935_MA_at"
 [23] "AFFX-HUMISGF3A/M97935_MB_at" "AFFX-HUMRGE/M10098_3_at"
 [25] "AFFX-HUMRGE/M10098_5_at" "AFFX-HUMRGE/M10098_M_at"
 [27] "AFFX-LysX-3_at" "AFFX-LysX-5_at"
 [29] "AFFX-LysX-M_at" "AFFX-M27830_3_at"
 [31] "AFFX-M27830_5_at" "AFFX-M27830_M_at"
 [33] "AFFX-PheX-3_at" "AFFX-PheX-5_at"
 [35] "AFFX-PheX-M_at" "AFFX-r2-Bs-dap-3_at"
 [37] "AFFX-r2-Bs-dap-5_at" "AFFX-r2-Bs-dap-M_at"
 [39] "AFFX-r2-Bs-lys-3_at" "AFFX-r2-Bs-lys-5_at"
 [41] "AFFX-r2-Bs-lys-M_at" "AFFX-r2-Bs-phe-3_at"
 [43] "AFFX-r2-Bs-phe-5_at" "AFFX-r2-Bs-phe-M_at"
 [45] "AFFX-r2-Bs-thr-3_s_at" "AFFX-r2-Bs-thr-5_s_at"
 [47] "AFFX-r2-Bs-thr-M_s_at" "AFFX-r2-Ec-bioB-3_at"
 [49] "AFFX-r2-Ec-bioB-5_at" "AFFX-r2-Ec-bioB-M_at"
 [51] "AFFX-r2-Ec-bioC-3_at" "AFFX-r2-Ec-bioC-5_at"
 [53] "AFFX-r2-Ec-bioD-3_at" "AFFX-r2-Ec-bioD-5_at"
 [55] "AFFX-r2-P1-cre-3_at" "AFFX-r2-P1-cre-5_at"
 [57] "AFFX-ThrX-3_at" "AFFX-ThrX-5_at"
 [59] "AFFX-ThrX-M_at" "AFFX-TrpnX-3_at"
 [61] "AFFX-TrpnX-5_at" "AFFX-TrpnX-M_at"
 [63] "ENSG00000000003_at" "ENSG00000000005_at"
 [65] "ENSG00000000419_at" "ENSG00000000457_at"
 [67] "ENSG00000000460_at" "ENSG00000000938_at"
 [69] "ENSG00000000971_at" "ENSG00000001036_at"
 [71] "ENSG00000001084_at" "ENSG00000001167_at"
 [73] "ENSG00000001460_at" "ENSG00000001461_at"
 [75] "ENSG00000001497_at" "ENSG00000001561_at"
 [77] "ENSG00000001617_at" "ENSG00000001626_at"
 [79] "ENSG00000001629_at" "ENSG00000001631_at"
 [81] "ENSG00000002016_at" "ENSG00000002079_at"
 [83] "ENSG00000002330_at" "ENSG00000002549_at"
 [85] "ENSG00000002586_at" "ENSG00000002587_at"
 [87] "ENSG00000002726_at" "ENSG00000002745_at"
 [89] "ENSG00000002746_at" "ENSG00000002822_at"
 [91] "ENSG00000002834_at" "ENSG00000002919_at"
 [93] "ENSG00000002933_at" "ENSG00000003056_at"
 [95] "ENSG00000003096_at" "ENSG00000003137_at"
 [97] "ENSG00000003147_at" "ENSG00000003249_at"
 [99] "ENSG00000003393_at" "ENSG00000003400_at"
The results show that hgu133plus2 has 54675 probe sets, while hgu133plus2hsensgcdf has only 20009. This is because some probe sets were merged, and some may also have been discarded.
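The redefined probe set list mixes AFFX control probe sets with gene-level probe sets, so it can help to separate the two before comparing counts. The following is a minimal sketch using only base R; the short `genenames` vector is a hand-copied excerpt of the output above (an assumption made so the example runs without the CDF package installed), and the variable names `is_control`, `gene_sets` and `ensembl_ids` are illustrative.

```r
## Excerpt of redefined probe set names, copied from the output above
genenames <- c("AFFX-BioB-3_at", "AFFX-hum_alu_at",
               "ENSG00000000003_at", "ENSG00000000005_at",
               "ENSG00000000419_at")

## Control probe sets start with "AFFX"
is_control <- grepl("^AFFX", genenames)

## Gene-level probe sets are everything else
gene_sets <- genenames[!is_control]

## Strip the trailing "_at" to recover the Ensembl gene IDs
ensembl_ids <- sub("_at$", "", gene_sets)

print(sum(is_control))  ## number of control probe sets in the excerpt
print(ensembl_ids)
```

Applied to the full `genenames` vector from a real session, the same three steps would give the number of control probe sets and the Ensembl gene IDs covered by the redefined CDF.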