Phrases

Given a list of terms and a short string that appears in many of them, we seek a way to produce useful phrases that contain the sought string and can help to pinpoint one's intended term without listing all the matching terms.

So, suppose we have a list of terms like this:

2Fe-2S ferredoxin subdomain
5'3'-Exonuclease N- and I-domain
AAA-protein subdomain
Alanine dehydrogenase/PNT, C-terminal subdomain
Alanine dehydrogenase/PNT, N-terminal subdomain
Alpha-2-macroglobulin receptor-associated protein, domain 1
Alpha amylase, catalytic subdomain
Aspartate/ornithine carbamoyltransferase, carbamoyl-P binding domain
Autotransporter beta-domain
Bacterial extracellular solute-binding protein, family 1 domain
Bacterial membrane-flanked domain
Bacterial transcriptional activator domain
Blue (type 1) copper domain
Bromodomain
Bromodomain transcription factor
Carbohydrate binding domain, family 11
Carbohydrate binding domain, family 15
Carbohydrate binding domain, family 17/28
Carbohydrate-binding domain, family V/XII
Catalytic domain of components of various dehydrogenase complexes
Cell division protein 48, CDC48, domain 2
Chitin-binding, domain 3
Condensation domain
CpcD phycobilisome linker-like subdomain
Cysteine-rich small domain
Dullard-like phosphatase domain
Elongation factor G, domain IV
Elongation factor Tu, domain 2
Epidermal  growth-factor receptor (EGFR), L domain
Epoxide hydrolase N-terminal domain-like phosphatase
eRF1 domain 1
eRF1 domain 2
eRF1 domain 3
Fatty acid desaturase subdomain
F-box protein interaction domain
FCP1-like phosphatase, phosphatase domain
Fibrobacter succinogenes major paralogous domain
Fibronectin, type III subdomain
Gal4-like dimerisation domain
Glucan 1,4-alpha-glucosidase with starch-binding domain
Glycoside hydrolase family 2, immunoglobulin-like beta-sandwich domain
Glycoside hydrolase, family 2, TIM barrel domain
GTP1/OBG domain
GTP1/OBG subdomain
GTP-binding signal recognition particle SRP54, G-domain
Hedgehog/intein hint domain, C-terminal
Homeobox domain, ZF-HD class
Homeodomain-like
Homeodomain protein CUT
Homeodomain-related
Hpt, subdomain
Iroquois-class homeodomain protein
Legume lectin, beta domain
Metal-dependent phosphohydrolase, HD subdomain
Mo-co oxidoreductase dimerisation domain
MoeA, C-terminal, domain IV
MoeA, N-terminal, domain I and II
Molybdopterin binding domain
Mycoplasmal MG032/MG096/MG288 1 domain
Mycoplasmal MG032/MG096/MG288 2 domain
N-acetylglutamate kinase with DUF619 domain
Nitrite/sulfite reductase, flavoprotein alpha-component, domains 1 and 3
Paired-like homeodomain protein, OAR
PEA3-type ETS-domain transcription factor, N-terminal
Penicillin-binding protein, dimerisation domain
Peptidase, alpha-lytic prodomain
Peptidoglycan-binding domain 1
Phosphatidylinositol-specific phospholipase C, Y domain
Phosphoglucomutase/phosphomannomutase alpha/beta/alpha domain I
Phosphoglucomutase/phosphomannomutase alpha/beta/alpha domain II
Phosphoglucomutase/phosphomannomutase alpha/beta/alpha domain III
Plant-specific domain of unknown function 3588
Predicted aldehyde dehydrogenase with duplicated domain
Predicted bacteriophytochrome with receiver domain
Predicted kinase with amino acid kinase domain
Predicted prephenate dehydrogenase/arogenate dehydrogenase with a C-terminal regulatory domain
Putative zinc finger domain, LRP1
Quinohemoprotein amine dehydrogenase, alpha chain, domain 3
Relaxase/mobilization nuclease domain
Respiratory-chain NADH dehydrogenase domain, 51 kDa subunit
Response regulator with LytTR DNA-binding domain, AlgR/VirR/ComE type
Restriction modification system DNA specificity domain
RfaE bifunctional protein, domain I
RfaE bifunctional protein, domain II
RNA polymerase Rpb1, domain 1
RNA polymerase Rpb1, domain 3
RNA polymerase Rpb1, domain 4
RNA polymerase Rpb1, domain 5
RNA polymerase Rpb1, domain 6
RNA polymerase Rpb1, domain 7
RNA polymerase Rpb2, domain 2
RNA polymerase Rpb2, domain 3
RNA polymerase Rpb2, domain 4
RNA polymerase Rpb2, domain 5
RNA polymerase Rpb2, domain 6
RNA polymerase Rpb2, domain 7
RuvA domain 2-like
Saposin B subdomain
Sec8 exocyst complex component specific domain
Signal peptide binding (SRP54) M-domain
SLA1 homology domain 1, SHD1
S-layer-related duplication domain
Small GTP-binding protein domain
Sugar-specific permease, EIIA 1 domain
Thioredoxin domain 2
Toprim subdomain
Uncharacterised conserved protein with HAD-like hydrolase domain
Uncharacterized Cys-rich domain
Uncharacterized domain 2
Uncharacterized hydrophobic domain
Uncharacterized plant-specific domain
Uncharacterized plant-specific domain 01589
Uncharacterized plant-specific domain 01627

(This is a list of InterPro terms that contain the substring "domain".)

We define a phrase to be a sequentially occurring list of words (like "binding domain"). For example, the phrases of the term "foo bar fnord" are "foo", "foo bar", "foo bar fnord", "bar", "bar fnord" and "fnord".
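
As a minimal sketch of this definition (assuming words are separated by whitespace; the function name phrases is made up for illustration), every phrase is a contiguous slice of the term's word list:

```python
def phrases(term):
    """Return every contiguous run of words in term, grouped by starting word."""
    words = term.split()
    return [
        " ".join(words[start:end])
        for start in range(len(words))
        for end in range(start + 1, len(words) + 1)
    ]


print(phrases("foo bar fnord"))
# ['foo', 'foo bar', 'foo bar fnord', 'bar', 'bar fnord', 'fnord']
```

A term of n words has n(n+1)/2 phrases, so even long terms produce a manageable number.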

Given a string like "domain", which occurs often in this list (indeed in every item), it is desirable to produce a short list of frequently occurring phrases that contain "domain".

The process for doing this is twofold. First, we consider all phrases of each term and count their occurrences, discarding those that occur only once.
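
The counting step might look like the sketch below. It is not the code that produced the histograms on this page: the names count_phrases and min_count are invented here, and it assumes that every occurrence of a phrase is counted, including repeats within one term.

```python
from collections import Counter


def phrases(term):
    """Return every contiguous run of words in term."""
    words = term.split()
    return [
        " ".join(words[start:end])
        for start in range(len(words))
        for end in range(start + 1, len(words) + 1)
    ]


def count_phrases(terms, min_count=2):
    """Count all phrases over all terms and drop those seen fewer than min_count times."""
    counts = Counter()
    for term in terms:
        counts.update(phrases(term))
    return {phrase: n for phrase, n in counts.items() if n >= min_count}


print(count_phrases(["eRF1 domain 1", "eRF1 domain 2", "eRF1 domain 3"]))
# {'eRF1': 3, 'eRF1 domain': 3, 'domain': 3}
```

Phrases such as "eRF1 domain 1" or "domain 2" occur only once in this tiny input, so they are discarded.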

In this process, we also clean up every phrase encountered. First, we consider only the lower-case version of each phrase (assuming case-insensitivity for the entire process). Next, we remove the first and the last character if it is not a letter, a number, "(" or "[". Finally, we remove leading and trailing "stopwords". A stopword, in this case, is simply a frequently occurring word in the English language, like "the", "and" or "with", which does not contribute significantly if it appears at the beginning or end of a phrase. We have to be careful with such transformations so that we do not produce phrases which do not occur in the original list of terms (assuming case-insensitivity).
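
A rough sketch of the clean-up follows. The names clean_phrase, STOPWORDS and KEEP are made up for illustration, the stopword set holds only the three examples given above, and the rule about the first and last character is read literally (only "(" and "[" are exempt, at either end), which may not match the original program exactly.

```python
STOPWORDS = {"the", "and", "with"}  # only the examples from the text; a real list would be longer
KEEP = "(["  # non-alphanumeric characters that are allowed to stay at either end


def clean_phrase(phrase):
    """Lower-case a phrase, trim one odd character per end, then trim stopwords."""
    phrase = phrase.lower()
    # Remove at most one character from each end if it is not a letter, digit, "(" or "[".
    if phrase and not (phrase[0].isalnum() or phrase[0] in KEEP):
        phrase = phrase[1:]
    if phrase and not (phrase[-1].isalnum() or phrase[-1] in KEEP):
        phrase = phrase[:-1]
    # Remove stopwords from the beginning and the end, never from the middle.
    words = phrase.split()
    while words and words[0] in STOPWORDS:
        words.pop(0)
    while words and words[-1] in STOPWORDS:
        words.pop()
    return " ".join(words)


print(clean_phrase("with HAD-like hydrolase domain,"))
# had-like hydrolase domain
```

Because characters and words are only ever trimmed from the two ends, the cleaned phrase is still a run of words taken from the lower-cased term, which is the property the last sentence above asks for.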

If we look at the above list of terms, a phrase histogram of 95 phrases is produced:

Count:  85 Phrase: "domain"
Count:  13 Phrase: "subdomain"
Count:  12 Phrase: "rna"
Count:  12 Phrase: "rna polymerase"
Count:  12 Phrase: "protein"
Count:  12 Phrase: "polymerase"
Count:   6 Phrase: "uncharacterized"
Count:   6 Phrase: "rpb2, domain"
Count:   6 Phrase: "rpb2"
Count:   6 Phrase: "rpb1, domain"
Count:   6 Phrase: "rpb1"
Count:   6 Phrase: "rna polymerase rpb2, domain"
Count:   6 Phrase: "rna polymerase rpb2"
Count:   6 Phrase: "rna polymerase rpb1, domain"
Count:   6 Phrase: "rna polymerase rpb1"
Count:   6 Phrase: "polymerase rpb2, domain"
Count:   6 Phrase: "polymerase rpb2"
Count:   6 Phrase: "polymerase rpb1, domain"
Count:   6 Phrase: "polymerase rpb1"
Count:   6 Phrase: "domain 2"
Count:   6 Phrase: "binding"
Count:   5 Phrase: "domain 3"
Count:   5 Phrase: "domain 1"
Count:   5 Phrase: "dehydrogenase"
Count:   5 Phrase: "binding domain"
Count:   4 Phrase: "predicted"
Count:   4 Phrase: "plant-specific"
Count:   4 Phrase: "plant-specific domain"
Count:   4 Phrase: "n-terminal"
Count:   4 Phrase: "hydrolase"
Count:   4 Phrase: "factor"
Count:   4 Phrase: "domain,"
Count:   4 Phrase: "c-terminal"
Count:   3 Phrase: "uncharacterized plant-specific"
Count:   3 Phrase: "uncharacterized plant-specific domain"
Count:   3 Phrase: "protein, domain"
Count:   3 Phrase: "phosphoglucomutase/phosphomannomutase"
Count:   3 Phrase: "phosphoglucomutase/phosphomannomutase alpha/beta/alpha"
Count:   3 Phrase: "phosphoglucomutase/phosphomannomutase alpha/beta/alpha domain"
Count:   3 Phrase: "phosphatase"
Count:   3 Phrase: "homeodomain"
Count:   3 Phrase: "homeodomain protein"
Count:   3 Phrase: "erf1"
Count:   3 Phrase: "erf1 domain"
Count:   3 Phrase: "domain i"
Count:   3 Phrase: "dimerisation"
Count:   3 Phrase: "dimerisation domain"
Count:   3 Phrase: "carbohydrate"
Count:   3 Phrase: "carbohydrate binding"
Count:   3 Phrase: "carbohydrate binding domain,"
Count:   3 Phrase: "carbohydrate binding domain"
Count:   3 Phrase: "binding domain,"
Count:   3 Phrase: "bacterial"
Count:   3 Phrase: "alpha/beta/alpha"
Count:   3 Phrase: "alpha/beta/alpha domain"
Count:   3 Phrase: "1 domain"
Count:   2 Phrase: "type"
Count:   2 Phrase: "transcription"
Count:   2 Phrase: "transcription factor"
Count:   2 Phrase: "signal"
Count:   2 Phrase: "rfae"
Count:   2 Phrase: "rfae bifunctional"
Count:   2 Phrase: "rfae bifunctional protein, domain"
Count:   2 Phrase: "rfae bifunctional protein"
Count:   2 Phrase: "phosphatase domain"
Count:   2 Phrase: "mycoplasmal"
Count:   2 Phrase: "mycoplasmal mg032/mg096/mg288"
Count:   2 Phrase: "moea"
Count:   2 Phrase: "mg032/mg096/mg288"
Count:   2 Phrase: "kinase"
Count:   2 Phrase: "iii"
Count:   2 Phrase: "gtp1/obg"
Count:   2 Phrase: "gtp-binding"
Count:   2 Phrase: "glycoside"
Count:   2 Phrase: "glycoside hydrolase"
Count:   2 Phrase: "elongation"
Count:   2 Phrase: "elongation factor"
Count:   2 Phrase: "domain iv"
Count:   2 Phrase: "domain ii"
Count:   2 Phrase: "domain 7"
Count:   2 Phrase: "domain 6"
Count:   2 Phrase: "domain 5"
Count:   2 Phrase: "domain 4"
Count:   2 Phrase: "dehydrogenase/pnt"
Count:   2 Phrase: "catalytic"
Count:   2 Phrase: "bromodomain"
Count:   2 Phrase: "bifunctional"
Count:   2 Phrase: "bifunctional protein, domain"
Count:   2 Phrase: "bifunctional protein"
Count:   2 Phrase: "alpha"
Count:   2 Phrase: "alanine"
Count:   2 Phrase: "alanine dehydrogenase/pnt"
Count:   2 Phrase: "acid"
Count:   2 Phrase: "2"
Count:   2 Phrase: "1"

This list is a nice representation of the original list of terms, but it is still not what we want. For example, the string "domain" appears in several of these phrases: "domain", "rpb1, domain", "rpb2, domain", "polymerase rpb1, domain", "polymerase rpb2, domain", "binding domain" and "carbohydrate binding domain". Clearly, if we wish to recommend a useful subset of phrases, this list is confusing because it contains multiple phrases that originate from the same term. To fix this, we go to the second step.

The second step is to eliminate from this list those phrases that are themselves frequently occurring sub-phrases of other phrases in the list. That is, consider the list of phrases as a list of terms and apply the phrase-counting algorithm to it to find frequently occurring sub-phrases.
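
The second step can be sketched as follows, reusing the same counting routine on the keys of the first histogram. The function names are hypothetical and the phrase clean-up is left out to keep the example short, so this is an illustration of the idea, not the program behind the listings on this page.

```python
from collections import Counter


def phrases(term):
    """Return every contiguous run of words in term."""
    words = term.split()
    return [
        " ".join(words[start:end])
        for start in range(len(words))
        for end in range(start + 1, len(words) + 1)
    ]


def count_phrases(terms, min_count=2):
    """Count all phrases over all terms, keeping those seen at least min_count times."""
    counts = Counter()
    for term in terms:
        counts.update(phrases(term.lower()))
    return {phrase: n for phrase, n in counts.items() if n >= min_count}


def filter_phrases(histogram, min_count=2):
    """Drop every phrase that is itself a frequent sub-phrase of the phrase list."""
    sub_phrases = count_phrases(histogram.keys(), min_count)
    return {p: n for p, n in histogram.items() if p not in sub_phrases}


histogram = count_phrases(["eRF1 domain 1", "eRF1 domain 2", "eRF1 domain 3"])
print(filter_phrases(histogram))
# {'erf1 domain': 3}
```

In this small example "erf1" and "domain" each occur twice among the phrases "erf1", "erf1 domain" and "domain", so both are dropped and only the longer, more specific "erf1 domain" survives, keeping its original count of 3.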

We can apply this phrase-finding algorithm to the above list of found phrases to produce this histogram of sub-phrases:

Count:  33 Phrase: "domain"
Count:  10 Phrase: "polymerase"
Count:   7 Phrase: "protein"
Count:   6 Phrase: "rpb2"
Count:   6 Phrase: "rpb1"
Count:   6 Phrase: "rna"
Count:   6 Phrase: "binding"
Count:   6 Phrase: "bifunctional"
Count:   5 Phrase: "rna polymerase"
Count:   4 Phrase: "rfae"
Count:   4 Phrase: "polymerase rpb2"
Count:   4 Phrase: "polymerase rpb1"
Count:   4 Phrase: "plant-specific"
Count:   4 Phrase: "carbohydrate"
Count:   4 Phrase: "binding domain"
Count:   4 Phrase: "bifunctional protein"
Count:   4 Phrase: "alpha/beta/alpha"
Count:   3 Phrase: "uncharacterized"
Count:   3 Phrase: "rpb2, domain"
Count:   3 Phrase: "rpb1, domain"
Count:   3 Phrase: "rfae bifunctional"
Count:   3 Phrase: "protein, domain"
Count:   3 Phrase: "phosphoglucomutase/phosphomannomutase"
Count:   3 Phrase: "factor"
Count:   3 Phrase: "carbohydrate binding"
Count:   2 Phrase: "uncharacterized plant-specific"
Count:   2 Phrase: "transcription"
Count:   2 Phrase: "rna polymerase rpb2"
Count:   2 Phrase: "rna polymerase rpb1"
Count:   2 Phrase: "rfae bifunctional protein"
Count:   2 Phrase: "polymerase rpb2, domain"
Count:   2 Phrase: "polymerase rpb1, domain"
Count:   2 Phrase: "plant-specific domain"
Count:   2 Phrase: "phosphoglucomutase/phosphomannomutase alpha/beta/alpha"
Count:   2 Phrase: "phosphatase"
Count:   2 Phrase: "mycoplasmal"
Count:   2 Phrase: "mg032/mg096/mg288"
Count:   2 Phrase: "hydrolase"
Count:   2 Phrase: "homeodomain"
Count:   2 Phrase: "glycoside"
Count:   2 Phrase: "erf1"
Count:   2 Phrase: "elongation"
Count:   2 Phrase: "dimerisation"
Count:   2 Phrase: "dehydrogenase/pnt"
Count:   2 Phrase: "carbohydrate binding domain"
Count:   2 Phrase: "bifunctional protein, domain"
Count:   2 Phrase: "alpha/beta/alpha domain"
Count:   2 Phrase: "alanine"

These 48 sub-phrases occur frequently in the list of phrases. Any phrase that is itself one of these recurring sub-phrases (as opposed to merely containing one) ought to be removed. The list of filtered phrases is:

13 occurances of subdomain
6 occurances of rna polymerase rpb2, domain
6 occurances of rna polymerase rpb1, domain
6 occurances of domain 2
5 occurances of domain 3
5 occurances of domain 1
5 occurances of dehydrogenase
4 occurances of predicted
4 occurances of n-terminal
4 occurances of domain,
4 occurances of c-terminal
3 occurances of uncharacterized plant-specific domain
3 occurances of phosphoglucomutase/phosphomannomutase alpha/beta/alpha domain
3 occurances of homeodomain protein
3 occurances of erf1 domain
3 occurances of domain i
3 occurances of dimerisation domain
3 occurances of carbohydrate binding domain,
3 occurances of binding domain,
3 occurances of bacterial
3 occurances of 1 domain
2 occurances of type
2 occurances of transcription factor
2 occurances of signal
2 occurances of rfae bifunctional protein, domain
2 occurances of phosphatase domain
2 occurances of mycoplasmal mg032/mg096/mg288
2 occurances of moea
2 occurances of kinase
2 occurances of iii
2 occurances of gtp1/obg
2 occurances of gtp-binding
2 occurances of glycoside hydrolase
2 occurances of elongation factor
2 occurances of domain iv
2 occurances of domain ii
2 occurances of domain 7
2 occurances of domain 6
2 occurances of domain 5
2 occurances of domain 4
2 occurances of catalytic
2 occurances of bromodomain
2 occurances of alpha
2 occurances of alanine dehydrogenase/pnt
2 occurances of acid
2 occurances of 2
2 occurances of 1

Given such a list of phrases, those that contain "domain" are:

13 occurances of subdomain
6 occurances of rna polymerase rpb2, domain
6 occurances of rna polymerase rpb1, domain
6 occurances of domain 2
5 occurances of domain 3
5 occurances of domain 1
4 occurances of domain,
3 occurances of uncharacterized plant-specific domain
3 occurances of phosphoglucomutase/phosphomannomutase alpha/beta/alpha domain
3 occurances of homeodomain protein
3 occurances of erf1 domain
3 occurances of domain i
3 occurances of dimerisation domain
3 occurances of carbohydrate binding domain,
3 occurances of binding domain,
3 occurances of 1 domain
2 occurances of rfae bifunctional protein, domain
2 occurances of phosphatase domain
2 occurances of domain iv
2 occurances of domain ii
2 occurances of domain 7
2 occurances of domain 6
2 occurances of domain 5
2 occurances of domain 4
2 occurances of bromodomain
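
This last selection is a plain substring test on the filtered phrases. A small sketch, with a hypothetical function name suggest and a handful of counts copied from the filtered list above:

```python
def suggest(filtered, query):
    """Return (count, phrase) pairs whose phrase contains query, largest count first."""
    matches = [(n, phrase) for phrase, n in filtered.items() if query in phrase]
    # Sorting the pairs in reverse orders by count, then by phrase, both descending.
    return sorted(matches, reverse=True)


filtered = {"subdomain": 13, "domain 2": 6, "dehydrogenase": 5, "bromodomain": 2, "kinase": 2}
for count, phrase in suggest(filtered, "domain"):
    print(count, "occurrences of", phrase)
# 13 occurrences of subdomain
# 6 occurrences of domain 2
# 2 occurrences of bromodomain
```

Sorting ties by phrase in descending order happens to reproduce the ordering seen in the listings on this page, though that may be a coincidence of how the original output was generated.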

So "subdomain" is possibly what a user might be searching for. Or maybe it's "rna polymerase rpb2, domain", which occurs in six terms.

I've found a really good use for this. Maybe you will too!


This is http://michal.guerquin.com/phrases.html, updated 2006-02-20 03:11 EST

Contact: michalg at domain where domain is gmail.com