Tgrep & TGrep2
TGrep and its improved successor, TGrep2, are Unix-based tools that allow you to search syntactically parsed and POS-annotated corpora on AFS. The syntax takes some getting used to, but it is worth the effort, since the searches you can do with these tools are quite powerful. If you prefer a graphical interface, you can use TIGERSearch, which has the same search options. Below you find information on
Tip-1: Much of the information on this page is summarized in this handout on TGrep 1 by Tatiana Nikitina and Jeanette Pettibone (PDF file; ~130KB), which gives you a short intro and a condensed summary of the TGrep syntax.
Tip-2: New presentation available:
Setting up TGrep2
By Susanne Riehemann (ed. by Florian Jaeger and Liz Coppock, with input from Neal Snider)
To use TGrep2 you need to be logged in to a firebird or raptor computer: TGrep2 is compiled only for Linux and does not work on the elaines! Use Samson or any ssh software to connect to one of the firebirds, i.e. firebird1.stanford.edu through firebird15.stanford.edu.
If you want to be able to use the command "tgrep2" without typing its full path name, you need to add
to your PATH variable. You can do this for the current session only by entering:
> emacs .login
Once you are in emacs, add the following lines (the first line is a comment, to keep things organized):
setenv PATH /afs/ir/data/linguistic-data/bin/linux_2_4:$PATH
The files for the tgrep2able versions of the Brown, Switchboard, WSJ, NEGRA, and Chinese Treebank corpora are in
You can find out which corpora can be used with TGrep2 by typing:
> tgrep2 -c wsj_mrg.t2c.gz 'VP < NP'
> emacs .login
Once you are out of emacs, try this to test out TGrep2:
The first two outputs of that search are the phrases "all the brothers and sisters" and "freedom of speech in this country and everything":
(NP-SBJ (PDT all) (DT the) (NNS brothers) (CC and) (NNS sisters))
(NP-SBJ (NP (NP (NN freedom)) (PP (IN of) (NP (NN speech))) (PP-LOC (IN in) (NP (DT this) (NN country)))) (CC and) (NP (NN everything)))
As you can see, the output is hierarchically structured (think of it as a syntax tree lying on its side: say, a very sad syntax tree).
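To make the nesting in that bracketed output easier to see, here is a minimal Python sketch that reads the first tree shown above and prints it with one node per line. It is our own illustration, not part of TGrep2; the helper names `parse`, `indented` and `words` are hypothetical.

```python
import re

# First output tree from the sample search, copied as a plain string.
TREE_TEXT = "(NP-SBJ (PDT all) (DT the) (NNS brothers) (CC and) (NNS sisters))"

def parse(text):
    """Parse one bracketed tree into (label, children); words stay plain strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip the opening bracket
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip the closing bracket
        return (label, children)

    return node()

def indented(tree, depth=0):
    """Return one line per node, indented by its depth in the tree."""
    label, children = tree
    leaf_words = [c for c in children if isinstance(c, str)]
    lines = ["  " * depth + " ".join([label] + leaf_words)]
    for child in children:
        if not isinstance(child, str):
            lines.extend(indented(child, depth + 1))
    return lines

def words(tree):
    """Collect the words of the tree from left to right."""
    out = []
    for child in tree[1]:
        out.extend([child] if isinstance(child, str) else words(child))
    return out

tree = parse(TREE_TEXT)
print("\n".join(indented(tree)))
print(" ".join(words(tree)))
```

The last line recovers the plain phrase from the tree, which is a quick way to check that a match is the constituent you expected.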
If you want to save the output of your search into a text file, then redirect the output to a file in your home directory with the ">"-operator thus:
See also the TGrep syntax and some examples for more details, but here are some quick tips on TGrep and TGrep2 syntax:
Patterns. Suppose you want to find all of the subject noun phrases in your corpus. Not all subject NPs have exactly the same label, but all subject NPs carry the tag -SBJ. We can ask for all constituents whose label contains -SBJ by putting -SBJ between forward slashes, thus:
/-SBJ/
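The effect of a slash-delimited pattern can be imitated in a few lines of Python. This is only a sketch of the idea (a regular expression tested against each node label), run on the second sample tree shown earlier; it says nothing about how TGrep2 itself is implemented, and the function names are hypothetical. The sketch also assumes that a bare label such as NP matches only nodes labelled exactly NP.

```python
import re

# Second output tree from the sample search, copied as a plain string.
TREE = ("(NP-SBJ (NP (NP (NN freedom)) (PP (IN of) (NP (NN speech))) "
        "(PP-LOC (IN in) (NP (DT this) (NN country)))) "
        "(CC and) (NP (NN everything)))")

def node_labels(tree_text):
    """Return every node label: the token right after each opening bracket."""
    return re.findall(r"\(([^\s()]+)", tree_text)

def matching_labels(tree_text, regex):
    """Labels containing a match for regex (imitates a /regex/ node pattern)."""
    return [lab for lab in node_labels(tree_text) if re.search(regex, lab)]

def exact_labels(tree_text, name):
    """Labels identical to name (imitates a bare label; an assumption here)."""
    return [lab for lab in node_labels(tree_text) if lab == name]

subjects = matching_labels(TREE, "-SBJ")   # imitates /-SBJ/
plain_nps = exact_labels(TREE, "NP")       # imitates NP

print(subjects)        # the one subject NP in this tree
print(len(plain_nps))  # NPs without any function tag
```

The exact-match search misses the NP-SBJ node, which is why the regular-expression form is the one to use for subjects.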
Other operators you can use:
Order of operations (important!). Multiple operators are assumed to be related by an AND relation, each one applying to the first node in the pattern. So the following means, "an S that immediately dominates an NP and immediately dominates a VP":
S < NP < VP
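The AND reading can be checked on a toy tree. The Python sketch below is our own illustration: the tree is invented, the helper names are hypothetical, and the pattern strings in the comments (`S < NP < VP` and the parenthesized `S < (NP < VP)`) are our rendering of the two readings rather than output from TGrep2.

```python
# Invented toy tree as (label, children); words are plain strings.
TREE = ("S", [("NP", [("DT", ["the"]), ("NN", ["dog"])]),
              ("VP", [("VBD", ["barked"])])])

def subtrees(tree):
    """Yield the tree itself and every constituent below it."""
    yield tree
    for child in tree[1]:
        if not isinstance(child, str):
            yield from subtrees(child)

def children(tree, label):
    """Immediate children of tree that carry exactly this label."""
    return [c for c in tree[1] if not isinstance(c, str) and c[0] == label]

# AND reading, "S < NP < VP": both relations hold of the S node.
flat = [t for t in subtrees(TREE)
        if t[0] == "S" and children(t, "NP") and children(t, "VP")]

# Chained reading, "S < (NP < VP)": the VP would have to sit inside the NP.
nested = [t for t in subtrees(TREE)
          if t[0] == "S"
          and any(children(np, "VP") for np in children(t, "NP"))]

print(len(flat), len(nested))  # prints: 1 0
```

The toy sentence matches under the AND reading and fails under the chained one, so mixing the two up silently changes what a search returns.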
Formatting your output. By default tgrep2 returns the match for the left-most element in the search pattern. So, if you want the NP in the search above to be the output, you can regroup the pattern:
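One way to picture the difference is to apply the same structural constraint twice and change only which node is reported. The Python sketch below is an illustration under stated assumptions: the tree is invented, the helper names are hypothetical, and the regrouped pattern is written in the comments as `NP > (S < VP)`, which is one possible regrouping (with `>` read as "is immediately dominated by") and not necessarily the one the author had in mind.

```python
# Invented toy tree as (label, children); words are plain strings.
TREE = ("S", [("NP", [("DT", ["the"]), ("NN", ["dog"])]),
              ("VP", [("VBD", ["barked"])])])

def subtrees(tree):
    """Yield the tree itself and every constituent below it."""
    yield tree
    for child in tree[1]:
        if not isinstance(child, str):
            yield from subtrees(child)

def children(tree, label):
    """Immediate children of tree that carry exactly this label."""
    return [c for c in tree[1] if not isinstance(c, str) and c[0] == label]

def bracket(tree):
    """Render a tree back into bracketed notation."""
    if isinstance(tree, str):
        return tree
    label, kids = tree
    return "(" + label + " " + " ".join(bracket(k) for k in kids) + ")"

# "S < NP < VP": the left-most element is S, so the S is reported.
s_headed = [t for t in subtrees(TREE)
            if t[0] == "S" and children(t, "NP") and children(t, "VP")]

# "NP > (S < VP)": the left-most element is NP, so the NP is reported.
np_headed = [np for s in subtrees(TREE)
             if s[0] == "S" and children(s, "VP")
             for np in children(s, "NP")]

print(bracket(s_headed[0]))   # prints: (S (NP (DT the) (NN dog)) (VP (VBD barked)))
print(bracket(np_headed[0]))  # prints: (NP (DT the) (NN dog))
```

Both searches pick out the same configuration in the tree; only the constituent that ends up in the output differs.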
There is a manual for TGrep2, which can also be found in
See also the TGrep syntax and some examples, The Treebank Page, Switchboard Tags, and Jeanette Pettibone's page on Treebank tags. For more information on the tags and bracketing conventions used in Switchboard, try: The Penn Treebank: An overview (voted "most useful" by Florian Jaeger), and Bracketing Switchboard: An addendum to the Treebank II Bracketing Guidelines. There is also useful documentation on the Penn Part-Of-Speech Tagging conventions, Disfluency Annotations, and Predicate-Argument Structure annotations. Some of this information applies both to TGrep and to TGrep2.
Setting up TGrep
By Susanne Riehemann (ed. by Florian Jaeger)
To use TGrep you need to be logged in to an epic computer; this does not work on the elaines! Use Samson or any ssh software to connect to one of the epics, i.e. epic1.stanford.edu through epic28.stanford.edu.
Assuming you use csh or tcsh, you need to do the following to set everything up properly. (You can do these at the prompt, but if you want to avoid having to do this every time you use TGrep, put them at the end of the .login file in your home directory. The first time you do this you'll need to log in again.):
setenv PATH /afs/ir/data/linguistic-data/bin/sun4x_57:$PATH
setenv MANPATH /afs/ir/data/linguistic-data/man:$MANPATH
Then, you should be able to do:
which finds NPs that immediately dominate VPs, and have it work! You can find usage instructions in 'man tgrepdoc', while 'man tgrep' tells you about the command flags. See also the notes below on the TGrep syntax and some examples, The Treebank Page, Switchboard Tags, and Jeanette Pettibone's page on Treebank tags.
This sets things up for Switchboard (merged). On AFS there are now four TGrep indices, covering the parsed sections of Switchboard, the WSJ (two versions: one with POS tags, one without), and Brown (only a small fragment was treebanked). You can change the value of TGREP_CORPUS above appropriately, or specify a corpus on the command line.
setenv TGREP_CORPUS /afs/ir/data/linguistic-data/Treebank/tgrepabl/brown_mrg.crp
setenv TGREP_CORPUS /afs/ir/data/linguistic-data/Treebank/tgrepabl/wsj_mrg.crp
setenv TGREP_CORPUS /afs/ir/data/linguistic-data/Treebank/tgrepabl/wsj_skel.crp
Note that we now have TGrep2 as well!
The TGrep syntax and some examples
Searching for NPs that immediately dominate VPs:
If you want to look at this output as it is generated, use:
If you want to save the output to a file, use:
If you want to see the tree for the whole sentence in which the match occurred, use:
If there are multiple matches for the pattern in a sentence, you can find them all with:
These switches can be combined, e.g. if you want to see the whole sentence that was matched, use:
A < B    A immediately dominates B
Some examples using these operators:
To search for NPs that are coordinations of plural nouns:
If you've done any interesting TGrep searches for your research, please send the commands to (Corpus TA), so other people can learn by example.
Differences between TGrep2 and TGrep
The manual for TGrep2, locally stored at:
From that it should be clear that TGrep2 is a considerable improvement over TGrep. Another tool that allows searches referring to any kind of edge label is TIGERSearch. TGrep2 has also been improved in terms of the control it gives you over the form of the output, the speed of searches, and in that it now accepts search patterns from files as input rather than only from the command line. Note that some features of the old TGrep are no longer supported in TGrep2 (for a list, cf. the manual).