
dataset - ID Swapping: Efficient use of a reference table to convert ID values



The rise of various bio-companies offering high-throughput sequencing technology, and of open-source bioinformatics software suites for analyzing the results, has led to an equivalent rise in reference conventions (often species-specific) for referring to genes, exons, transcripts, etc. The most useful of these conventions, to the biologist, is the common gene name (which itself likely has several aliases... an issue for another day).



For example, consider the gene KRAS: [image]


Thus it proves useful to convert a gene from its ID (in this case P01116 or ENSG00000133703) to its common name (KRAS). To drive this point home, consider IDs within a single system, ENSEMBL. The human ID for KRAS is ENSG00000133703, while the mouse ID for KRAS is ENSMUSG00000030265. Note that even after stripping away the species-specific prefix, the numerals differ. This situation arises when one tries to compare two gene lists. My previous question, on merging sets via ID, looks for efficient ways to combine different gene lists once both use the same ID convention.





The goal of this question is to find the most efficient way of scanning a reference file for a gene's given ID, acquiring its associated gene name, and then replacing the original ID entry with that gene name. If this is unclear, an example comes shortly below.
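As a minimal sketch of that scan-and-replace step (not necessarily the most efficient approach), the reference table can be held in an Association and queried with Lookup. The three ID/name pairs are taken from the demo list in this question; the last query ID is made up for illustration and is assumed not to appear in the table.

```mathematica
(* Toy reference table: three ID -> name pairs quoted in this question *)
ref = <|
   "ENSMUSG00000028334" -> "Nans",
   "ENSMUSG00000054079" -> "Utp18",
   "ENSMUSG00000027575" -> "Arfgap1"|>;

(* IDs to convert; the last one is a made-up ID that is not in the table *)
ids = {"ENSMUSG00000054079", "ENSMUSG00000028334", "ENSMUSG00000000000"};

(* The third argument of Lookup is the default returned for missing keys *)
Lookup[ref, ids, "N/A"]
(* {"Utp18", "Nans", "N/A"} *)
```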





For this question we will use an ENSEMBL reference file. These files are not large (ranging from ~30MB for yeast to ~200MB for humans). You can download such a file here: ENSEMBL BioMart.


To get the file do the following:




  1. On the drop down that states "- CHOOSE DATABASE -", select "Ensembl Genes 86"

  2. On the next drop down that states "- CHOOSE DATASET -" select whatever species you want. You could try "Mus musculus genes (GRCm38.p4)"

  3. On the left hand column, click "Attributes"

  4. On the new screen, click the box next to "GENE:"

  5. Add at least "Associated Gene Name" (you can also add other useful information such as "GENCODE basic annotation").

  6. At the top of the screen, click "Results"

  7. The export file type can be changed here if you wish; I prefer .CSV files.

  8. Click "Go" to download the file. This link may or may not preserve the selections above: (maybe good link)


Naturally if one works with gene lists often, it may be practical to get several different species.






Using SemanticImport we can see our reference file: [image]
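For readers without the download at hand, here is a hedged sketch of such an import. The temporary file name and its two rows are hypothetical stand-ins for the real BioMart export, and the column headers "Ensembl ID" and "gene" are assumed to match the ones used later in this question; a real export may name them differently.

```mathematica
(* Hypothetical stand-in for the BioMart export: a tiny CSV in the temp folder *)
file = FileNameJoin[{$TemporaryDirectory, "mart_export_demo.csv"}];
Export[file, {
    {"Ensembl ID", "gene"},
    {"ENSMUSG00000028334", "Nans"},
    {"ENSMUSG00000054079", "Utp18"}}, "CSV"];

(* Read both columns as plain strings; the first line holds the headers *)
ref = SemanticImport[file, {"String", "String"}, HeaderLines -> 1];

(* Pull one column out of the resulting Dataset as an ordinary list *)
Normal[ref[All, "gene"]]
```

Forcing both columns to "String" skips SemanticImport's automatic type detection, which is not needed for plain IDs and names.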


For example purposes one could use RandomChoice to get a few IDs. I am just pasting the output here:


demoIDs={
"ENSMUSG00000028334", "ENSMUSG00000054079", "ENSMUSG00000027575",
"ENSMUSG00000041528", "ENSMUSG00000030231", "ENSMUSG00000078606",
"ENSMUSG00000028559", "ENSMUSG00000028431", "ENSMUSG00000020253",
"ENSMUSG00000019818"}


The corresponding gene names are:


{"Nans", "Utp18", "Arfgap1", "Rnf123", "Plekha5", 
"Gm4070", "Osbpl9", "Ikbkap", "Ppm1m", "Cd164"}

Naturally the actual list may be quite large.
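Since list size matters here, the following sketch times two candidate lookup strategies on synthetic data: an Association queried with Lookup, and a Dispatch table applied with Replace. The IDs and names are generated placeholders, not real ENSEMBL entries, and the timings will vary by machine, so none are quoted.

```mathematica
n = 10^5;

(* Synthetic placeholder IDs and names, NOT real ENSEMBL entries *)
keys = "ENSMUSG" <> IntegerString[#, 10, 11] & /@ Range[n];
vals = "Gene" <> ToString[#] & /@ Range[n];

(* A large query list drawn (with repeats) from the known IDs *)
query = RandomChoice[keys, n];

(* Strategy 1: Association plus Lookup, with a default for unknown IDs *)
assoc = AssociationThread[keys, vals];
First@AbsoluteTiming[viaLookup = Lookup[assoc, query, "N/A"];]

(* Strategy 2: replacement rules optimized with Dispatch, applied at level 1 *)
rules = Dispatch[Thread[keys -> vals]];
First@AbsoluteTiming[viaRules = Replace[query, rules, {1}];]

(* Every query ID is present in the table, so both results should agree *)
viaLookup === viaRules
```

Note that the two strategies differ for unknown IDs: Lookup returns the supplied default, whereas Replace leaves an unmatched ID unchanged unless a catch-all rule is appended.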


Thus if this is our starting file: [image]


We want to end up with: [image]


Note, these were printed using TableForm, but they are actually Datasets.






I am providing my own answer, as I know of one way to do it. However, it is surely not the most efficient or elegant way. Thus I am looking for other answers which, given the following:



  • the file name

  • the starting ID column header (Ensembl ID)

  • the final ID column header (gene)

  • a gene list data set

  • and the header of the column in that set which contains the IDs (see below)


[image]


replace the IDs in the gene list with the converted gene names (or N/A if not found).
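To make the requested interface concrete, here is one possible sketch taking those five inputs. The function name swapIDs, the temporary demo file, the "ID" and "score" columns, and the score values are all hypothetical; the sketch assumes both headers exist in the first row of the CSV, and it is offered as a baseline rather than as the most efficient solution.

```mathematica
(* swapIDs is a hypothetical name; it assumes fromCol and toCol both appear
   in the first (header) row of the CSV file *)
swapIDs[file_String, fromCol_String, toCol_String,
   geneList_Dataset, idCol_String] :=
  Module[{raw, header, rows, iFrom, iTo, table},
   raw = Import[file, "CSV"];
   header = First[raw];
   rows = Rest[raw];
   (* Column positions of the two headers *)
   iFrom = First@FirstPosition[header, fromCol];
   iTo = First@FirstPosition[header, toCol];
   (* ID -> name lookup table *)
   table = AssociationThread[rows[[All, iFrom]], rows[[All, iTo]]];
   (* Rewrite the ID column row by row, using "N/A" for unknown IDs *)
   Dataset[
    Map[<|#, idCol -> Lookup[table, #[idCol], "N/A"]|> &,
     Normal[geneList]]]];

(* Demo with a tiny made-up reference file and gene list *)
file = FileNameJoin[{$TemporaryDirectory, "ref_demo.csv"}];
Export[file, {
    {"Ensembl ID", "gene"},
    {"ENSMUSG00000028334", "Nans"},
    {"ENSMUSG00000054079", "Utp18"}}, "CSV"];

genes = Dataset[{
    <|"ID" -> "ENSMUSG00000054079", "score" -> 1.2|>,   (* scores are made up *)
    <|"ID" -> "ENSMUSG00000000000", "score" -> 0.4|>}]; (* ID not in the file *)

swapIDs[file, "Ensembl ID", "gene", genes, "ID"]
```

In this demo the first row's ID should become "Utp18" and the second, which is absent from the reference file, should become "N/A". Building the Association once and reusing it across many gene lists would avoid re-reading the file on every call.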




