uORFdb documentation

Navigate on the website

The start page allows you to query the database. The menu on the left lets you access the functions of the website, regardless of whether you are on the start page. The most important features that can be accessed from the menu are introduced in the following sections.

Query the database

You can search for items in the database by entering your search term in the search bar (as shown in Figure 1 for the gene CEBPB) and clicking the search icon on the right. Your search term cannot be empty; this limit ensures good search performance for a large number of users. By convention, we use genomic sequences, so you must replace "U" with "T" in your nucleotide query. By default, we ignore special characters (such as umlauts) in searches for authors and publication titles: searching for an author named "Muller" finds both "Muller" and "Müller", and searching for "Müller" likewise finds both.
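If you prepare search terms in a script before pasting them into the search bar, the two conventions above can be sketched as follows. This is only an illustration and not the code used by uORFdb; the function names are made up for this example, and the diacritic folding relies on Unicode decomposition, which is one of several possible ways to make "Müller" and "Muller" compare equal.

```python
import unicodedata


def to_genomic_alphabet(query: str) -> str:
    """Replace uracil with thymine so an mRNA-style query matches genomic sequences."""
    return query.translate(str.maketrans("Uu", "Tt"))


def fold_diacritics(text: str) -> str:
    """Remove combining marks (illustrative; the site's own matching may differ)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(to_genomic_alphabet("AUGGCU"))  # ATGGCT
print(fold_diacritics("Müller") == fold_diacritics("Muller"))  # True
```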

Figure 1: The search interface.

Below the search bar, you can see grey panels, one for each view. We use the term "view" for a display of related items (publications, genes, transcripts, uORFs, ...). Each view has a set of defined searchable fields. By (de-)selecting the checkboxes in front of the fields, you can refine your search results. By default, you are searching in all searchable fields from all views.
The panels also show the number of hits in the selected searchable fields of each view (see Figure 1). In our example, we searched for "CEBPB", which only gave us hits in the gene and publication views, because the search term only matched fields in those two views. This does not mean that there are no entries related to CEBPB in the other views (for example, we have transcripts and uORFs for CEBPB).
For performance reasons, only the best 1000 hits per view are shown. By clicking the blue arrow button, you can jump directly to your search results. The arrow button only appears if hits were found in the respective view.

Navigate in the search results

If you choose to go from the search to the gene view, you will see the genes that matched your query. You can still navigate to any other view in the database by using the buttons at the top of the page. Buttons are greyed out if there are no items in that particular view. All items that you will see in the other views are related to your hits in the gene view. Thus, navigating from your gene hits to the transcript view is not the same as performing the original search for CEBPB on the transcript view.

Figure 2: Controls in the gene view.

You can select one or multiple rows by ticking the checkboxes in front of the rows and clicking "Apply" in the top panel. This will remove any other rows from your current view. If you navigate to any other view, all items will be related to your selected rows. You also have designated buttons to tick all checkboxes on the current page ("Page+"), to uncheck all boxes ("None") or to uncheck all boxes on the current page ("Page-"). The number of checked boxes is shown next to these buttons. Selections can be undone using "Undo" or by clicking on one of your actions in the "Selection History". This will restore the results of that particular action.
On the top right, there are the pagination controls ("<<": first, "<": one page backward, ">": one page forward, ">>": last page), followed by the current page and total page count.
Some rows contain fields with many lines. We only display the first lines of data for each of these fields. Click on ▼ in the respective field to expand the row. Click on ▲ to shrink the fields of that row.
In the uORF and the variant view, we only display the first nucleotides of the sequences (indicated by "..."). You can click on the sequences to view the full sequences and to download them in FASTA format. The same applies to the amino acid sequences.
Links to external resources in the views are highlighted using

Model

Our uORF model allows for a quick analysis by eye. It shows the full transcript leader sequence (TLS) and CDS of transcripts with uORFs. ATG uORFs are highlighted in orange. Start codons that are shared with a CDS (see column description: "Shared start codon") are indicated by "*" after the codon. You can access the model from the gene, transcript and uORF views (see Figure 3). Click on "Model" to see the results for all items in the current selection. However, there is a limit on how many items can be displayed at once. If you exceed the cutoff, you will get an error message and have to further limit your selection.
In the top menu, you can filter transcripts by NCBI ID and uORFs by start codon (the same principle as in the uORF view). By default, we don't include introns in the display and stop codons are hidden. You can change this behavior by selecting the respective checkboxes. To apply your selections, click "Filter".

Figure 3: The uORF model.

Each transcript has its own panel. Each panel shows the three reading frames of the transcript. Panels are independent. If you perform an action on one, the others will remain unaffected. If you move the cursor across the plot, the header of the respective panel will show you the current genomic position (0-based, half-open; see also column descriptions). To zoom in, click and hold the left mouse button and mark an area in the model: Mark a small area for a high magnification. Mark a large area for a more modest magnification of that area. Click "Reset" in the top right corner of the panel to reset the zoom. You can export your current view by using the "PNG" or "SVG" buttons in the top right corner of the panel.
If you want more information on a particular uORF, you can click on it to open an overlay window. To exit the model, click on any view button at the top of the page.

UCSC Genome Browser

In the gene, transcript and uORF views, you can switch to the UCSC Genome Browser by clicking on the links in the gene symbol or uORF ID columns. The link for a gene symbol is only present if we analyzed the gene for uORFs. You will then see all transcripts and all uORFs for the associated gene. If you entered the Genome Browser from the transcript or uORF view, the transcript or uORF that you clicked on will be highlighted. Using the Genome Browser, you can compare the uORFdb tracks to your own custom tracks or to tracks showing ribosome profiling data. A full list of features is beyond the scope of this tutorial, but the UCSC team has created a helpful manual.
If you want to continue working with the uORF sequences, you will find the track details page helpful. To access this page, click on one of our tracks in the Genome Browser. This will open a new page (see Figure 4) where we put direct links to the UCSC Table Browser. Using these links, you can easily create a custom download of the sequences or upload the sequences directly into the free GALAXY cloud service.

Figure 4: The track details page.

Export citations

You can export all publications in your selection into a single RIS file by clicking the cite button in the publication view (there is an upper limit on the number of citations that can be exported at once). RIS is a standard citation file format. Each record is annotated with keywords from the database (KW in RIS). The keywords are a short summary of the publication by our uORF experts. They are extracted from the boolean (+/-) columns of the database (see also column descriptions below). If a "+" or a "-" is reported for a publication, the column name is used as a keyword in the citation file.
Our RIS file complies fully with the RIS standard. Whether individual fields, especially the keyword fields, are recognized depends on your reference manager. In our tests, the citation files were processed as expected by Citavi, EndNote, and Zotero.
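If your reference manager does not pick up the keyword fields, you can inspect them yourself. The following minimal sketch collects the KW entries of each record from RIS text. It assumes the usual RIS line layout (a two-character tag, two spaces, a hyphen, and the value) and that each record is closed by an ER line. The sample record is invented for illustration and is not an actual uORFdb export; only the two keyword strings are taken from the column names described further below.

```python
import re

# Usual RIS layout: two-character tag, two spaces, hyphen, optional space, value.
RIS_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")


def read_ris_keywords(ris_text: str) -> list[dict]:
    """Return one dict per record with its title and its list of KW values."""
    records = []
    current = {"title": None, "keywords": []}
    for line in ris_text.splitlines():
        match = RIS_LINE.match(line)
        if not match:
            continue  # skip blank or continuation lines in this simple sketch
        tag, value = match.group(1), match.group(2).strip()
        if tag == "KW":
            current["keywords"].append(value)
        elif tag in ("TI", "T1") and current["title"] is None:
            current["title"] = value
        elif tag == "ER":  # end of record
            records.append(current)
            current = {"title": None, "keywords": []}
    return records


# Invented sample record (not real uORFdb output).
SAMPLE_RIS = (
    "TY  - JOUR\n"
    "TI  - Hypothetical example publication\n"
    "KW  - CDS repression\n"
    "KW  - Kozak consensus sequence\n"
    "ER  - \n"
)

for record in read_ris_keywords(SAMPLE_RIS):
    print(record["title"], record["keywords"])
```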

Get the latest publications

We publish the publications in uORFdb as an RSS feed that is updated once a day. RSS is a popular format for distributing new items to users (podcasts, new publications in journals, new publications matching a query in PubMed, ...). All you need is a feed reader, which is an app or program that can understand RSS feeds. Feed readers are available for virtually any operating system, both on your local PC and on your smartphone or tablet (non-exhaustive list). Click the button "Publication Updates" in the left panel of our website. This will open a new tab. Don't worry about the content of this page: it is the RSS feed in a machine-readable format that is not meant to be read by humans. Just copy the URL of this page from the address bar of your browser and paste it into your feed reader. How to do this depends on your app. After that, you have successfully subscribed to our feed (see Figure 5, left). Every time we add a new publication to the database, your app will notify you.
Some publication managers support RSS feeds, which we find particularly useful. The most mature implementation that we have seen so far is the one in Zotero. In Zotero, you can choose "New Library" --> "News Feed" --> "From URL" to open a dialog where you can paste the RSS URL.
From the feed view in Zotero, you can select a publication and inspect it. The abstract field contains the tags/keywords from uORFdb (see Export citations), followed by the actual abstract. From the top panel, you can choose to import the publication directly into your library (see Figure 5, right). For technical reasons, we cannot provide the keywords from uORFdb if you import an item from the RSS feed into your library. If you need the keywords, you can either add them manually or import the citation directly from the database, as explained in the previous section.
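If you would rather process the feed in a script than in a feed reader, a saved copy of the feed page can be read with Python's standard XML parser. The sketch below assumes a conventional RSS 2.0 layout (item elements with title and link children). The exact structure of the uORFdb feed is not described here, so the sample feed is invented and the element names are assumptions.

```python
import xml.etree.ElementTree as ET

# Invented sample in conventional RSS 2.0 layout (not the real uORFdb feed).
SAMPLE_FEED = """
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Hypothetical publication A</title>
      <link>https://example.org/a</link>
    </item>
    <item>
      <title>Hypothetical publication B</title>
      <link>https://example.org/b</link>
    </item>
  </channel>
</rss>
""".strip()


def list_feed_items(feed_xml: str) -> list[tuple[str, str]]:
    """Return (title, link) pairs for every item element in the feed."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        items.append((title, link))
    return items


for title, link in list_feed_items(SAMPLE_FEED):
    print(title, link)
```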

Figure 5: Left: A single publication from the RSS feed as displayed in the Feeder app on Android. Right: A single publication from the RSS feed as displayed in the Zotero reference manager.

Export uORF data

Click on "Excel" or "CSV" in the export panel at the top of the uORF view to export the rows in your current selection in the respective format (there is an upper limit on the number of uORFs that can be exported at once). This will export all columns except for the links to dbSNP and ClinVar. The exonic nucleotide sequences and the amino acid sequences are exported in full.

Bulk download

You can download all data that is stored in the database as TSV files. Click on "Downloads" in the main menu on the left of the website. In the download tab, you will find further information on the type of content that can be downloaded and the file format (see also README file). One word of warning: some of the exported files are huge and should not be opened with a program like LibreOffice Calc or Excel. Use the command line or specialized programs to analyze these files. If you want to explore the format of the downloads, we provide example files with only a few entries. These are generally safe to open with a graphical table viewer. All downloads can be verified using the MD5 checksum.
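As one way to follow this advice, the sketch below verifies a downloaded file against its MD5 checksum and then streams the TSV row by row instead of loading it into memory. The function names are made up for this example, and it assumes that the first row of the file is a header; please check the README for the actual layout of each download.

```python
import csv
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file in chunks (safe for huge files)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_md5: str) -> bool:
    """Compare the computed digest with the published checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()


def count_data_rows(path: str) -> int:
    """Stream a TSV file and count its rows, skipping an assumed header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        next(reader, None)  # assumption: the first row holds the column names
        return sum(1 for _ in reader)
```

Call verify_download with the path of the downloaded file and the checksum published next to it before you start any analysis.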

Column descriptions

In the following, you will find an in-depth description of all columns in the uORFdb web interface. The documentation is split into parts according to the views.

Please note the definitions of the most important technical terms:

0-based: The first position is zero and not one.
0-based, half-open: Count base positions from 0 and exclude the end position (details). This is also known as 0-based start, 1-based end.

If not indicated otherwise, the position names and the position coordinates themselves are always based on the "+" strand. The smallest genomic coordinate is always the start, and the largest genomic coordinate is always the end. This is intuitive on the "+" strand. On the "-" strand, however, it might be a little surprising: a uORF on the negative strand ends at its start position and starts at its end position. If you want to have the "-" strand coordinates, please follow the steps in this resource.
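The coordinate conventions above can be illustrated with a few lines of arithmetic. This is a sketch with invented function names and made-up coordinates. The last function reflects our reading of the "-" strand note, namely that the first base in transcription order is the one with the highest coordinate. It is not a replacement for the conversion steps in the linked resource.

```python
def to_one_based_closed(start: int, end: int) -> tuple[int, int]:
    """Convert a 0-based, half-open interval to a 1-based, closed interval."""
    return start + 1, end


def span_length(start: int, end: int) -> int:
    """Length of a 0-based, half-open interval (the end position is excluded)."""
    return end - start


def first_and_last_base(start: int, end: int, strand: str) -> tuple[int, int]:
    """0-based positions of the first and last base in transcription order."""
    if strand == "+":
        return start, end - 1
    if strand == "-":
        return end - 1, start
    raise ValueError(f"unknown strand: {strand!r}")


# Made-up interval: start=100, end=109 covers the nine bases 100..108.
print(to_one_based_closed(100, 109))       # (101, 109)
print(span_length(100, 109))               # 9
print(first_and_last_base(100, 109, "-"))  # (108, 100)
```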

By convention, we use genomic sequences and not mRNA sequences. This means that our sequences will not contain uracil, but thymine.

Genes

Taxon	The taxon for the gene as provided by NCBI.
Gene symbol	The default gene symbol for the gene as provided by NCBI. Clicking on the symbol opens the UCSC Genome Browser. In the Genome Browser, you can inspect all transcripts of the gene and, if applicable, all uORFs. Genes which we did not analyze for uORFs will not have a link to the Genome Browser.
NCBI ID	For most genes, we provide the gene ID in the NCBI gene database. If this ID is not available, the accession number in the NCBI nucleotide database is displayed. Clicking on the ID (or accession) will take you to the full entry on NCBI.
Symbol aliases	Unofficial/alternative symbols for the gene as provided by NCBI.
Names	Official and alias names for the gene as provided by NCBI.
Chromosome	The chromosome on which the gene resides.
Assembly	The assembly version of our sequence data for this gene. If no sequence data exists, it is the current assembly version on NCBI at the time of the insert.
# Transcripts	The number of RefSeq transcripts for this gene in our database.
# Publications	The number of publications for this gene in our database.

Transcripts

Gene symbol	The official gene symbol for the transcript's mother gene as provided by NCBI. Clicking on the symbol opens the UCSC Genome Browser. In the Genome Browser, you can inspect the currently selected transcript (highlighted) and compare it to the other transcripts of this gene. If applicable, you can also see all uORFs.
NCBI ID	The RefSeq accession and version of the transcript.
Chromosome	The chromosome of the transcript.
Genomic start	The start position of the transcript on the genome (0-based, half-open). This is always the lower coordinate, regardless of strand.
Genomic end	The end position of the transcript on the genome (0-based, half-open). This is always the higher coordinate, regardless of strand.
Strand	The strand of the transcript: "+" or "-".
Length [bp]	The length of the transcript without introns.
TLS length [bp]	The length of the TLS of the transcript without introns.
Kozak context	The Kozak consensus sequence of the CDS. It consists of the six nucleotides upstream of the start codon, the start codon (highlighted), and the following nucleotide.
Kozak strength	The translational efficacy of the Kozak context. All classifications consider the 3rd nucleotide upstream of the start codon (A) and the nucleotide directly downstream of the start codon (B). The strength is classified as "strong" (A: purine and B: guanine), "adequate" (A: purine or B: guanine, but not both) or "weak" (A: not a purine and B: not a guanine). The strength is not shown if the 3rd nucleotide upstream of the start codon is missing.
# ATG uORFs	The number of uORFs for this transcript in our database with an ATG start codon.
# aTIS uORFs	The number of uORFs for this transcript in our database with an alternative (non-ATG) start codon.
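The Kozak strength rule from the table above can be written down in a few lines. This is a minimal sketch of the rule as described, not the code used to fill the database. The function name and the way the sequences are passed in (the upstream nucleotides and the single downstream nucleotide as separate strings) are assumptions made for this example.

```python
from typing import Optional

PURINES = {"A", "G"}


def kozak_strength(upstream: str, downstream: str) -> Optional[str]:
    """Classify a Kozak context.

    upstream:   nucleotides directly upstream of the start codon (up to six);
                may be shorter if the start codon lies close to the transcript start.
    downstream: the nucleotide directly downstream of the start codon.
    Returns None if the 3rd upstream nucleotide is missing.
    """
    if len(upstream) < 3:
        return None  # position A is missing, so no strength is reported
    a_is_purine = upstream[-3].upper() in PURINES
    b_is_guanine = downstream[:1].upper() == "G"
    if a_is_purine and b_is_guanine:
        return "strong"
    if a_is_purine or b_is_guanine:
        return "adequate"
    return "weak"


print(kozak_strength("GCCACC", "G"))  # strong
print(kozak_strength("GCCTCC", "G"))  # adequate
print(kozak_strength("GCCTCC", "A"))  # weak
print(kozak_strength("CC", "G"))      # None
```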

uORFs

uORF ID	NCBI ID of transcript + "_" + start codon + "." + number. "number" is the consecutive number of uORFs on the same transcript with the same start codon. Clicking on the name will take you to the UCSC Genome Browser. The chosen uORF will be highlighted.
Chromosome	The chromosome of the uORF.
Genomic start	The start position of the uORF on the genome (0-based, half-open). This is always the lower coordinate, regardless of strand.
Genomic end	The end position of the uORF on the genome (0-based, half-open). This is always the higher coordinate, regardless of strand.
Strand	The strand of the uORF: "+" or "-".
Start codon	The start codon of the uORF. A star indicates that this start codon is shared with a CDS (see "Shared start codon").
Stop codon	The stop codon of the uORF.
uORF length [bp]	The length of the uORF without introns.
CDS distance [bp]	The distance from the uORF end position on the genome to the CDS start position on the genome. The value is negative if the uORF end position is located after the CDS start position.
5'-cap distance [bp]	The distance from the transcript start to the uORF start codon.
Kozak context	The Kozak consensus sequence of the uORF. It consists of the six nucleotides upstream of the start codon, the start codon (highlighted), and the following nucleotide.
Kozak strength	The translational efficacy of the Kozak context. All classifications consider the 3rd nucleotide upstream of the start codon (A) and the nucleotide directly downstream of the start codon (B). The strength is classified as "strong" (A: purine and B: guanine), "adequate" (A: purine or B: guanine, but not both) or "weak" (A: not a purine and B: not a guanine). The strength is not shown if the 3rd nucleotide upstream of the start codon is missing.
Type	"non-overlapping": uORF is completely upstream of CDS. "overlapping": uORF overlaps with CDS, but is not in the same frame. "N-terminal extension": uORF overlaps with CDS and is in the same frame. Thus, CDS and uORF share the same stop codon. "possible N-terminal extension": A special case of the "N-terminal extension". The CDS does not have a valid stop codon (possibly due to annotation errors). Thus we continued the search for a uORF stop codon in the 3'-UTR.
Reading frame	The reading frame (1-3) in which the uORF resides. By definition, the reading frame that matches the frame of the CDS is set to "1".
Exonic sequence	The sequence of the uORF excluding introns. By default, only the first nucleotides are shown in the table. By clicking on the sequence, you can view the whole sequence and optionally download it. The FASTA header has the following format: ">" + gene symbol + "_" + NCBI ID of the transcript + "_" + genomic start + "_" + genomic end + "_nucseq"
Amino acid sequence	The amino acid sequence of the uORF was generated by translating the exonic sequence using the standard genetic code. The stop codon is shown as a "*". By definition, the sequence starts with methionine (M), even if the nucleotide sequence starts with a non-canonical uSTART codon. Although there are rare examples of a different first amino acid being incorporated, methionine is used in the vast majority of translation events in eukaryotes. By default, only the first amino acids are shown in the table. By clicking on the sequence, you can view the whole sequence and optionally download it. The FASTA header has the following format: ">" + gene symbol + "_" + NCBI ID of the transcript + "_" + genomic start + "_" + genomic end + "_aminoseq"
Shared start codon	Is the uORF start codon shared by a CDS from another transcript of the same gene? In this case, we list the NCBI IDs of the transcripts that harbor the CDSs. To identify shared start codons, we compare the position of the biological start (not necessarily genomic start) and the start codon sequence.
Exon variants in dbSNP	Contains a link to dbSNP for human uORFs. dbSNP is queried for variants in the exonic uORF regions.
Exon variants in ClinVar	Contains a link to ClinVar for human uORFs. ClinVar is queried for variants in the exonic uORF regions.
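The naming schemes and the translation rule described in the table above can be reproduced with a short script, for example to match downloaded FASTA records back to table rows. This is a sketch under stated assumptions: the function names are invented, the identifiers and coordinates in the usage lines are placeholders rather than real database entries, and unknown codons are rendered as "X", which is a choice made here and not a documented uORFdb behavior.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def parse_uorf_id(uorf_id: str) -> tuple[str, str, int]:
    """Split 'transcript_startcodon.number' into its three parts."""
    transcript_id, rest = uorf_id.rsplit("_", 1)
    start_codon, number = rest.split(".")
    return transcript_id, start_codon, int(number)


def fasta_header(gene: str, transcript_id: str, start: int, end: int, kind: str) -> str:
    """Build a header; kind is 'nucseq' or 'aminoseq' as in the column descriptions."""
    return f">{gene}_{transcript_id}_{start}_{end}_{kind}"


def translate_uorf(exonic_sequence: str) -> str:
    """Translate codon by codon; the first residue is always written as M."""
    seq = exonic_sequence.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    if not codons:
        return ""
    rest = [CODON_TABLE.get(codon, "X") for codon in codons[1:]]  # "X" = unknown
    return "M" + "".join(rest)


# Placeholder identifiers and coordinates, for illustration only.
print(parse_uorf_id("NM_000000.0_CTG.2"))  # ('NM_000000.0', 'CTG', 2)
print(fasta_header("GENE1", "NM_000000.0", 100, 109, "nucseq"))
print(translate_uorf("CTGGCCTAA"))  # MA*
```

Note that the non-ATG start codon CTG in the last line is still written as methionine, in line with the definition given for the amino acid sequence column.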

Variants

The results shown here are in whole or part based upon data generated by the TCGA Research Network.

uORF ID	The name of the uORF which was affected by the variant (see also uORFs).
Genomic position	The start position of the mutation on the genome (left-aligned, normalized, 0-based, half-open).
Reference allele	The reference allele of the variant (left-aligned, normalized, always related to "+" strand).
Alternate alleles	The alternate allele(s) of the variant (left-aligned, normalized, always related to "+" strand).
Graph	The graph allows you to put the allele frequencies of the variant in cancer patients into context with the frequencies of the variant in large reference cohorts. The upper panel depicts the somatic allele frequencies of the reference and alternate allele(s) in the analyzed cancer types. We only show cancer types in which the variant has been identified according to our analysis. The lower panel shows the allele frequencies in up to three reference studies: gnomAD Genomes, ExAC, and TopMed version 2 or 3. The frequencies are based on dbSNP. We analyzed the following cancer types: BRCA = breast cancer, COAD = colon cancer, LAML = blood cancer, LUAD = lung cancer, PRAD = prostate cancer, and SKCM = skin cancer.
REF start codon	The reference start codon of the uORF.
Start codon effects	The direct effect(s) of the alternate allele(s) on the start codon of the uORF. If the start codon is lost, the effect is "loss". If the codon was an aTIS codon ("TTG", "GTG", "CTG", "AAG", "AGG", "ACG", "ATA", "ATT", "ATC") and the variant turns it into another aTIS codon, the effect is "aTIS->aTIS". If an ATG is turned into an aTIS codon or vice versa, the effects are "uAUG->aTIS" and "aTIS->uAUG", respectively. If the variant leads to the loss of the uSTART and additionally creates a new start codon in the next triplet, the effect is "changed position".
ALT start codons	The alternate start codon(s) of the uORF. Empty, if no change or loss (see Start codon effects).
REF stop codon	The reference stop codon of the uORF.
Stop codon effects	The direct effect(s) of the alternate allele(s) on the stop codon of the uORF. If the stop codon changes into another valid uSTOP ("TAG", "TAA", "TGA"), the effect is "uSTOP->uSTOP". If the variant leads to the loss of the uSTOP, the stop codon effect is either "downstream uSTOP", if there is another in-frame stop codon downstream, or "loss", if there is no downstream in-frame stop codon. A variant within the uORF sequence can give rise to a new in-frame uSTOP. This is indicated by the effect "upstream uSTOP".
ALT stop codons	The alternate stop codon(s) of the uORF. Empty, if no change or loss (see Stop codon effects).
REF Kozak context	The reference Kozak context of the uORF.
Kozak effects	The direct effect(s) of the alternate allele(s) on the Kozak context of the uORF. Either in the format [reference Kozak strength]->[alternative Kozak strength] or "altered sequence" for variants that affect the Kozak sequence, but not its strength. If the start codon is lost, the Kozak effect is "loss".
ALT Kozak contexts	The alternate Kozak context(s) of the uORF. Empty, if no change or loss (see Kozak effects).
Sequence effects	The direct effect(s) of the alternate allele(s) on the uORF sequence. Variants that affect the length of the sequence (e.g. loss of uSTOP) will cause one of the following sequence effects: "longer sequence" or "shorter sequence". If the sequence is changed, but the length is not affected, the effect will be "altered sequence". If the uSTART is lost, the sequence effect is "loss". As start and stop codons are part of the sequence, a variant in these regions will always have a sequence effect.
ALT nucleotide sequences	The alternate nucleotide sequence(s) of the uORF. By default, only the first nucleotides are shown in the table. By clicking on each sequence, you can view the whole sequence and optionally download it. The FASTA header has the following format: ">" + uORF ID + "_" + genomic position + "_variant_" + reference allele + "/" + alternate allele. Empty, if no change or loss (see Sequence effects).
Locations	The location(s) of the variant in the transcript that harbors the uORF.
Alternate CDS distances [bp]	The new CDS distance(s) of the uORF caused by the variant allele(s).
Alternate uORF lengths [bp]	The new length(s) of the uORF caused by the variant allele(s).
dbSNP IDs	ID(s) of the reference variant(s) in dbSNP that is/are located at the same position and has/have the same alleles.
Position-related variants in dbSNP	Contains a link to dbSNP for human variants. dbSNP is queried for further variants at the current variant position (regardless of alleles).
ClinVar IDs	ID(s) of the reference variant(s) in ClinVar that is/are located at the same position and has/have the same alleles.
Position-related variants in ClinVar	Contains a link to ClinVar for human variants. ClinVar is queried for further variants at the current variant position (regardless of alleles).
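To make the start and stop codon effect labels in the table above easier to follow, here is a simplified sketch that derives them from a reference codon and an alternate codon. It is not the annotation code behind the database: the function names are invented, and the cases that need the surrounding sequence ("changed position" for start codons, "upstream uSTOP" for stop codons) are deliberately left out. Whether an in-frame stop codon exists further downstream has to be supplied by the caller.

```python
from typing import Optional

ATIS_CODONS = {"TTG", "GTG", "CTG", "AAG", "AGG", "ACG", "ATA", "ATT", "ATC"}
STOP_CODONS = {"TAG", "TAA", "TGA"}


def start_codon_effect(ref: str, alt: str) -> Optional[str]:
    """Label the effect on the start codon; None means the codon is unchanged."""
    if alt == ref:
        return None
    if ref == "ATG" and alt in ATIS_CODONS:
        return "uAUG->aTIS"
    if ref in ATIS_CODONS and alt == "ATG":
        return "aTIS->uAUG"
    if ref in ATIS_CODONS and alt in ATIS_CODONS:
        return "aTIS->aTIS"
    return "loss"  # "changed position" is not modelled in this sketch


def stop_codon_effect(ref: str, alt: str, has_downstream_stop: bool) -> Optional[str]:
    """Label the effect on the stop codon; None means the codon is unchanged."""
    if alt == ref:
        return None
    if alt in STOP_CODONS:
        return "uSTOP->uSTOP"
    return "downstream uSTOP" if has_downstream_stop else "loss"


print(start_codon_effect("ATG", "CTG"))       # uAUG->aTIS
print(start_codon_effect("CTG", "GTG"))       # aTIS->aTIS
print(stop_codon_effect("TAA", "TAG", True))  # uSTOP->uSTOP
print(stop_codon_effect("TAA", "CAA", True))  # downstream uSTOP
```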

Publications

Many of the following columns are not filled with text and/or numbers, but with a plus or a minus sign. A plus indicates positive evidence for a feature and a minus indicates negative evidence.

PubMed ID	The publication's PubMed ID. Clicking on the ID will take you to the publication record in PubMed.
Authors	The author(s) of the publication.
Title	The title of the publication.
Taxa	The taxon or the taxa which are discussed in the publication.
Gene symbols	The gene symbol(s) for the gene(s) in the publication as provided by NCBI.
Gene name in paper	The name(s) of the gene(s) in the publication, if applicable. This name may be different from the current nomenclature. It may also be a general term which sums up multiple individual genes, for example "oncogenes".

Determinants of uORF presence or absence
Alternative promoters; Alternative splicing; Tissue-specific uORFs	Sequence analyses of the human transcriptome revealed that about 50% of mRNAs contain one or more upstream AUGs (uAUGs) between the 5'-cap-structure and the CDS. The general prevalence of uAUG is, although higher than initially anticipated, still lower than expected by normal distribution, arguing for an evolutionary negative selection. For stochastic reasons, the prevalence of uORFs increases with the length of the 5'-regulatory region, yet in specific cases the presence or absence of one or several uORF(s) is dependent on the transcript variant produced by transcription initiation from alternative promoters or due to alternative splicing. Some of these variants are specific to particular tissues.
Non-AUG uORFs	In a recent study using global translational initiation sequencing, 54% of human transcripts displayed one or more translational initiation site(s) preceding the CDS. Surprisingly, about three-fourths of upstream translation was initiated by near-cognate, non-AUG initiation codons, further relativizing the classical "first-AUG" rule. Nevertheless, uAUG codons appeared to be functionally most effective in repressing CDS translation.

Structural and sequence-dependent uORF properties
Number; Length; Distance from 5'-cap; Distance from uORF-STOP to CDS; CDS-overlap	Many publications investigated the importance of structural and sequence-dependent uORF properties in mediating translational regulation. The repression of downstream translation appears to be positively correlated with the number of uORFs per transcript, the length of the uORF, and the distance between the 5'-cap structure and the uORF initiation codon. Furthermore, translational repression correlates negatively with the distance between the uORF-STOP and the CDS initiation site and is even more profound when the uORF overlaps the CDS initiation codon. Taken together, current data suggest a dynamic regulatory model, where indispensable initiation cofactors detach gradually from ribosomes during the elongation phase of uORF translation but may be reloaded to allow reinitiation at the CDS.
RNA secondary structure	In eukaryotes, long GC-rich transcript leader sequences tend to form stable secondary structures that inhibit ribosome progression and CDS translation. Similarly, specific secondary structures within or in the vicinity of uORFs may affect translation efficiency.

Functional consequences of uORF-mediated translational control
CDS repression; CDS induction; Start site selection	Most uORFs analyzed to date repress translation of the subsequent initiation site(s) and inhibit/diminish translation of the main protein. Post-uORF initiation at the CDS initiation codon may occur from leaky scanning of ribosomes across the uORF start codon or from reinitiation, if the uORF stop codon precedes the CDS. Despite a generally repressive function on downstream translation, several exceptions have been described where translation of specific uORFs or a certain alignment of subsequent uORFs mediates enhanced CDS initiation. Furthermore, uORF-directed start site selection can result in the production of N-terminally distinct protein isoforms that harbor unique biological functions.
Nonsense-mediated decay; mRNA destabilization	Nonsense-mediated decay (NMD) of mRNA is activated when specific cellular surveillance mechanisms detect premature termination of protein translation. Such premature termination events may result from the use of nonsense codons that arise in mature transcripts due to mutations, incorrect splicing or aberrant initiation site selection. uORFs have been suggested to induce NMD by conferring additional termination codons to the 5'-leader sequence of certain transcripts. Similarly, another mode of termination-dependent RNA destabilization that is distinct and independent of the common NMD pathway has been reported in yeast.
Ribosome load; Ribosome pausing/stalling; Ribosome shunting	Artificial or mutational deletion of a uORF may result in increased ribosome load on a given transcript associated with increased translational activity. On the contrary, ribosome stalling at the uORF termination codon or pausing of ribosomes on inhibitory uORF structures may hamper CDS translation. Underlining the multiplicity of uORF-mediated translational control mechanisms, certain uORFs facilitate enhanced CDS translation by supporting a ribosome shunt across a highly structured and inhibitory 5'-transcript leader sequence.

Co-regulatory events affecting uORF functions
Kozak consensus sequence	Whether or not the ternary preinitiation complex recognizes an AUG or non-AUG triplet as a translational initiation codon is strongly influenced by the nucleotide context surrounding it. The optimal surrounding sequence for initiation is the Kozak consensus sequence. If the AUG codon is surrounded by a strong context, virtually all scanning ribosomes recognize the start codon and initiate translation. In an adequate or weak surrounding, a number of ribosomes scan through the initiation site and remain ready to recognize an initiation site located further downstream. Since the quality of the Kozak consensus sequence is not the only determinant of translation initiation efficiency, the mere evaluation of the surrounding nucleotides does not permit the precise prediction of initiation.
Translational status	Regulation through uORFs allows rapid integration of the overall translation status of a cell to adjust the translation rates of important regulatory proteins. The translational status is dependent on extracellular signals, environmental conditions, and nutrient supply and is mainly reflected by the abundance of initiation co-factors required to form a functional preinitiation complex (ternary complex). A number of studies in yeast and human transcripts precisely analyzed uORF-mediated regulation under changing translational conditions.
Termination context	The sequence context surrounding a uORF termination codon may determine the reinitiation efficiency at downstream initiation sites. In particular, stable interactions between the terminating ribosome and the RNA, or stable base pairing of the RNA alone may cause ribosome pausing or mediate premature mRNA decay.
uORF RNA/peptide sequence; Regulatory sequence motif; Co-factor/ribosome interaction	Altering the RNA- or peptide-sequence of a uORF frequently affects downstream translation. This suggests that either the uORF-encoded peptide or a specific RNA sequence mediates interaction with a co-factor or the translation machinery to regulate translation, or that specific secondary structure is functionally important.

Medical impact
Disease-related uORFs; Acquired mutations/SNPs	A defect in uORF-mediated translational control can be associated with the development of human disease. Although only a few unequivocal cases are known at this time, it is evident that uORF mutations may be involved in a wide variety of diseases, including malignancies, metabolic or neurologic disorders, and inherited syndromes. Considering that many important regulatory proteins, including cell surface receptors, tyrosine kinases, and transcription factors, act in a dose-dependent fashion and possess uORFs, a substantial number of as yet unexplained pathologies might be traced back to uORF mutations altering expression levels of such key regulatory genes.

Manuscript categories
Mouse models; Ribosome profiling; Bioinformatics/arrays/screens; Proteomics	The pathophysiological importance of uORFs has been demonstrated in genetically altered mouse models. Recent progress in computational and sequencing-based technologies, the development of the ribosome profiling method, and mass spectrometry approaches allow genome-wide studies of uORF function.
Methods; Review	Rather than describing individual transcripts, part of the bibliography on uORFs focuses on methods for their study or reviews particular aspects of the field of uORF research.

Authors

Name	The name of the author.
# Publications	The number of publications from this author in our database.

Methods and versions

The following sections list the versions of all major data types in the database and give a brief overview of our methods.

Publications and authors

We regularly scan PubMed for the latest publications in the field. The publication metadata (including author names) reflects the version that was available from PubMed at the time of insertion, and we regularly rescan PubMed for updated metadata.

Taxonomy

We use the NCBI taxonomy, which we downloaded from the NCBI FTP server on 06.05.2022.

Genes

The version of a gene for a publication is the current version of that gene on NCBI at the time the publication was inserted.
For genes with transcripts and/or uORFs, the gene metadata is based on the All_Data.gene_info.gz and gene2refseq.gz files, which we downloaded on 04.05.2022.

uORFs and transcripts

uORF predictions are based on the NM_* transcripts (in NCBI RefSeq Curated) and the soft-masked genomes, which we downloaded from UCSC on 07.04.2022. We called uORFs for the following species and genome versions using custom scripts.

Homo sapiens	hg38
Drosophila melanogaster	dm6
Mus musculus	mm39
Danio rerio	danRer11
Rattus norvegicus	rn7
Bos taurus	bosTau9
Xenopus laevis	xenLae2
Xenopus tropicalis	xenTro10
Gallus gallus	galGal6
Sus scrofa	susScr11
Pongo abelii	ponAbe3
Macaca mulatta	rheMac10
Pan troglodytes	panTro6
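
The custom scripts are not described here, so the following is only a minimal sketch of how uORF calling can work in principle, not the uORFdb implementation. It assumes ATG as the only start codon and TAA, TAG, and TGA as stop codons. The function name find_uorfs, its parameters, and the coordinate convention are hypothetical.

```python
# Hypothetical sketch of uORF calling on a single transcript sequence.
# Assumptions (not taken from uORFdb): ATG-only start codons, standard
# stop codons, 0-based coordinates with exclusive end positions.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(transcript, cds_start, start_codons=("ATG",)):
    """Return (start, end) pairs for ORFs whose start codon lies in the 5' UTR.

    cds_start is the 0-based index of the first base of the main start codon.
    """
    # Upper-casing discards soft-masking; U is mapped to T (genomic alphabet).
    seq = transcript.upper().replace("U", "T")
    uorfs = []
    # The start codon must end before the main coding sequence begins.
    for start in range(0, max(cds_start - 2, 0)):
        if seq[start:start + 3] not in start_codons:
            continue
        # Walk downstream codon by codon until the first in-frame stop codon.
        # The stop may lie beyond cds_start (a uORF overlapping the main ORF).
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                uorfs.append((start, pos + 3))
                break
    return uorfs
```

For example, find_uorfs("GGATGAAATAGCCATGGCCTAA", 13) reports one uORF spanning positions 2 to 11 (ATGAAATAG), upstream of the main start codon at position 13. Start codons without an in-frame stop codon in the transcript are skipped in this sketch.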

We then annotated the uORFs with gene metadata from the files All_Data.gene_info.gz and gene2refseq.gz (downloaded on 04.05.2022). We filtered out any transcripts and uORFs that belonged to pseudogenes or whose transcript accession had been suppressed, withdrawn, or renamed by NCBI. Duplicate transcript IDs (e.g. in the pseudoautosomal regions) were also removed.
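
The filtering rules above can be illustrated roughly as follows. This is a sketch under stated assumptions, not the actual annotation script: the record keys (accession, gene_type, status), the status labels, and the choice to drop every copy of a duplicated transcript ID are all hypothetical.

```python
from collections import Counter

# Hypothetical status labels standing in for "suppressed, withdrawn, or renamed".
EXCLUDED_STATUSES = {"suppressed", "withdrawn", "renamed"}


def filter_transcripts(records):
    """Keep transcripts that pass the three filters described in the text.

    Each record is a dict with the hypothetical keys
    'accession', 'gene_type', and 'status'.
    """
    counts = Counter(record["accession"] for record in records)
    kept = []
    for record in records:
        if record["gene_type"] == "pseudo":
            continue  # transcript belongs to a pseudogene
        if record["status"].lower() in EXCLUDED_STATUSES:
            continue  # accession no longer valid at NCBI
        if counts[record["accession"]] > 1:
            continue  # assumption: every copy of a duplicated ID is dropped
        kept.append(record)
    return kept
```

Whether duplicated IDs are removed entirely or reduced to a single copy is not specified in the text. The sketch removes them entirely.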

Variants

We analyzed whole-genome sequencing (WGS) BAM files from cancer patients provided by the TCGA Research Network, covering the following cancer types: Breast Invasive Carcinoma (TCGA-BRCA), Colon Adenocarcinoma (TCGA-COAD), Acute Myeloid Leukemia (TCGA-LAML), Lung Adenocarcinoma (TCGA-LUAD), Prostate Adenocarcinoma (TCGA-PRAD), and Skin Cutaneous Melanoma (TCGA-SKCM). The BAM files were downloaded from the GDC Legacy Archive between 28.05.2021 and 09.01.2022 and realigned to GRCh38.p13 (downloaded 15.06.2021). Realignment and subsequent quality control followed a custom workflow closely based on the GDC DNA-Seq Analysis Pipeline. After quality control, 677 patients remained and entered variant calling. Variants were called using Mutect2 (GATK v4.1.4.1, Java v1.10.11) with a custom pipeline based on the GATK best practices.
PASS variants were left-aligned, normalized, and annotated using custom scripts and BCFtools v1.11. The annotation included dbSNP and ClinVar identifiers and allele frequencies from ExAC, gnomAD Genomes, and TOPMed (version 2 or 3). The metadata was based on dbSNP (downloaded 13.08.2021) and ClinVar VCFs (downloaded 13.01.2022), as well as dbSNP JSON files (downloaded 23.02.2022). For the functional prediction, we included only variants whose reference and alternate alleles had the same length, which avoided frameshifts and issues with splice sites. Prediction of the variant effect on the uORFs was performed using custom scripts.
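
As an illustration of the selection criteria for the functional prediction, the following minimal sketch checks a single VCF data line. It is not the custom script used for uORFdb. The function name is hypothetical, and the sketch assumes one alternate allele per line (multi-allelic records already split).

```python
def is_prediction_candidate(vcf_line):
    """Return True for PASS variants whose REF and ALT alleles have equal length.

    Assumes a tab-separated VCF data line with the standard column order
    CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt, filter_value = fields[3], fields[4], fields[6]
    if filter_value != "PASS":
        return False
    if "," in alt:
        return False  # assumption: multi-allelic records were split beforehand
    # Equal allele lengths exclude insertions and deletions.
    return len(ref) == len(alt)
```

With this check, a single-nucleotide substitution such as REF "C" and ALT "T" is kept, while a deletion such as REF "CA" and ALT "C" is excluded.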

Cite

Manske F, Ogoniak L, Jürgens L, Grundmann N, Makałowski W, Wethmar K. The new uORFdb: integrating literature, sequence, and variation data in a central hub for uORF research. Nucleic Acids Res. 2023 Jan 6;51(D1):D328-D336. doi: 10.1093/nar/gkac899.