hugovk/gutengrep

gutengrep

Find whole sentences matching a regex in Project Gutenberg plain text files.
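As a rough illustration of the idea, the sketch below splits text into sentences and keeps those that match a regex. It is not gutengrep's own code: `split_sentences` and `find_sentences` are hypothetical names, and the naive punctuation-based splitter stands in for whatever sentence tokeniser the script really uses.

```python
import re


def split_sentences(text):
    """Naive splitter (an assumption): break after ., ! or ? plus whitespace."""
    # Collapse hard line wraps first, since plain-text books wrap mid-sentence.
    flat = " ".join(text.split())
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]


def find_sentences(pattern, text, ignore_case=False):
    """Return every whole sentence in which the regex matches somewhere."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    return [s for s in split_sentences(text) if regex.search(s)]


sample = (
    "Call me Ishmael. The whale rose\nslowly. "
    "And then it was gone! But why did it leave?"
)
matches = find_sentences("whale", sample, ignore_case=True)
print(matches)
```

The point of matching per sentence rather than per line is that the result is a complete sentence even when the source file wraps it across several lines.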

Example commands

gutengrep.py "^[^\w]*And then" "*.txt" --cache --sort --correct -o output/and-then.txt

gutengrep.py "^[^\w]*But why" "*.txt" --cache --sort --correct -o output/but-why.txt

gutengrep.py -i "whale" moby11.txt --sort --correct -o out\mobydick-whale.txt
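The pattern `^[^\w]*And then` in the commands above anchors the match to the start of a sentence while allowing leading non-word characters such as an opening quotation mark. The sketch below checks that behaviour with Python's `re` module on a few made-up sentences; it assumes that `-i` corresponds to a case-insensitive match, which the README does not spell out.

```python
import re

# ^        start of the sentence
# [^\w]*   any run of non-word characters (quotes, dashes, spaces)
# And then the literal words, case-sensitive by default
pattern = re.compile(r"^[^\w]*And then")

candidates = [
    "And then the rain came.",
    '"And then what?" she asked.',
    "He paused, and then he spoke.",
    "and then nothing.",
]

starts = [bool(pattern.search(s)) for s in candidates]

# Same pattern, ignoring case (assumed to be what -i does).
pattern_i = re.compile(r"^[^\w]*And then", re.IGNORECASE)
starts_i = [bool(pattern_i.search(s)) for s in candidates]

for sentence, hit, hit_i in zip(candidates, starts, starts_i):
    print(f"{hit!s:5} {hit_i!s:5} {sentence}")
```

Only sentences that begin with the phrase match; the third candidate contains "and then" mid-sentence and is rejected either way.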

Example output

| Name | Sorted | Regex | Input | Word count |
|---|---|---|---|---|
| But why? | But why? | ^[^\w]*But why | *.txt | 7,572 |
| And then! | And then! | [^\w]*And then | *.txt | 85,014 |
| The whale | The whale | whale | moby11.txt | 50,913 |
| Why | Why | [^\w]*Why | *.txt | 184,832 |
| Once upon a time | Once upon a time | -i once upon a time | *.txt | 6,195 |
| The End | The End | -i the end\. | *.txt | 142,94 |
| Happily ever after | Happily ever after | -i happily ever after | *.txt | 271 |
| Moonlit | Moonlit | -i moonlit | *.txt | 52,345 |
| Moonlight | Moonlight | -i moonlight | *.txt | 3,186 |

See also nanogenmo.md.

Tips

Download the Project Gutenberg August 2003 CD and put all the files in the same directory.

When working on the whole corpus, use --cache to cut down on file operations. The first run builds a cache file of all tokenised sentences. On my MacBook Pro, this first pass takes about 5 minutes to go through the 597 books of the Project Gutenberg CD and extract their 3,583,390 sentences. Subsequent runs using the cache take about 40 seconds.

If searching just a single file, or a subset of files, do not use --cache: the search would run against the cache file generated from the initial file spec, not against the files you name.
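The caching behaviour described in these tips can be sketched as follows. This is an assumption-laden illustration rather than gutengrep's real implementation: `load_sentences`, the pickle format and the `sentences.pickle` file name are all hypothetical, and the sentence splitter is a naive stand-in. The demo at the end reproduces the pitfall of reusing a cache with a narrower file spec.

```python
import glob
import os
import pickle
import re
import tempfile


def load_sentences(filespec, cache_path=None):
    """Return all sentences for filespec, reusing cache_path if it exists."""
    if cache_path and os.path.exists(cache_path):
        # The cache is trusted as-is; filespec is not consulted at all.
        with open(cache_path, "rb") as f:
            return pickle.load(f)

    sentences = []
    for path in sorted(glob.glob(filespec)):
        with open(path, encoding="utf-8", errors="replace") as f:
            flat = " ".join(f.read().split())
        # Naive split after ., ! or ? (a stand-in for a real tokeniser).
        sentences.extend(s for s in re.split(r"(?<=[.!?])\s+", flat) if s)

    if cache_path:
        with open(cache_path, "wb") as f:
            pickle.dump(sentences, f)
    return sentences


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.txt"), "w", encoding="utf-8") as f:
        f.write("One. Two.")
    with open(os.path.join(tmp, "b.txt"), "w", encoding="utf-8") as f:
        f.write("Three.")
    cache = os.path.join(tmp, "sentences.pickle")

    everything = load_sentences(os.path.join(tmp, "*.txt"), cache)  # builds cache
    stale = load_sentences(os.path.join(tmp, "a.txt"), cache)       # reuses it
    fresh = load_sentences(os.path.join(tmp, "a.txt"))              # no cache

print(everything, stale, fresh)
```

Because the cached list is returned without checking which files produced it, the second call yields sentences from both files even though only `a.txt` was requested, which is why a single-file search should skip the cache.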
