“find . -regex …” in Python or How to find files whose whole name (path + name) matches a regular expression?

From help(os.walk):

    When topdown is true, the caller can modify the dirnames list in-place (e.g., via del or slice assignment), and walk will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search...

So once a subdirectory (listed in dirnames) is determined to be inadmissible, it should be deleted from dirnames. This will produce the subtree-pruning you are looking for.

(Just be sure to del items from dirnames from the tail-end first, so you don't change the index of remaining items to be deleted.)

    import os
    import re

    def prune(regex, top='.'):
        sep = os.path.sep
        matcher = re.compile(regex)
        pieces = regex.split(sep)
        # One compiled regex per directory level: 'foo', 'foo/\w+', 'foo/\w+/bar', ...
        # list() is needed so the matchers can be indexed in Python 3.
        partial_matchers = list(map(
            re.compile,
            (sep.join(pieces[:i+1]) for i in range(len(pieces)))))
        for root, dirs, files in os.walk(top, topdown=True):
            # Iterate from the tail so deletions don't shift pending indexes.
            for i in reversed(range(len(dirs))):
                dirname = os.path.relpath(os.path.join(root, dirs[i]), top)
                dirlevel = dirname.count(sep)
                # print(dirname, dirlevel, sep.join(pieces[:dirlevel+1]))
                if (dirlevel >= len(partial_matchers)
                        or not partial_matchers[dirlevel].match(dirname)):
                    print('pruning {0}'.format(dirname))
                    del dirs[i]

            for filename in files:
                # Make the path relative to top, like the directory names above.
                filename = os.path.relpath(os.path.join(root, filename), top)
                # print('checking {0}'.format(filename))
                if matcher.match(filename):
                    print(filename)

    if __name__ == '__main__':
        prune(r'foo/\w+/bar/\d+-\w+\.dat')

Running the script with a directory structure like this:

    ~/test% tree .
    .
    |-- foo
    |   `-- baz
    |       |-- bad
    |       |   |-- bad1.txt
    |       |   `-- badbad
    |       |       `-- bad2.txt
    |       `-- bar
    |           |-- 1-good.dat
    |           `-- 2-good.dat
    `-- tmp
        |-- 000.png
        |-- 001.png
        `-- output.gif

yields

    pruning tmp
    pruning foo/baz/bad
    foo/baz/bar/2-good.dat
    foo/baz/bar/1-good.dat

If you uncomment the "checking" print statement, it is clear the pruned directories are not walked.
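The docstring quoted above also mentions slice assignment as a way to prune. As a minimal sketch of that variant (the helper names `walk_pruned`, `keep_dir` and `demo` are made up for this illustration and are not part of the answer above), the whole `dirs` list can be replaced in place instead of deleting entries one by one:

```python
import os
import tempfile

def walk_pruned(top, keep_dir):
    """Yield (root, files) from os.walk, descending only into accepted dirs.

    keep_dir is a hypothetical predicate taking the full path of a
    subdirectory and returning True if it should be explored.
    """
    for root, dirs, files in os.walk(top, topdown=True):
        # Slice assignment mutates the very list os.walk holds on to,
        # so rejected subdirectories are never visited.
        dirs[:] = [d for d in dirs if keep_dir(os.path.join(root, d))]
        yield root, files

def demo():
    """Build a small throwaway tree and return the visited relative paths."""
    with tempfile.TemporaryDirectory() as top:
        os.makedirs(os.path.join(top, 'foo', 'baz', 'bar'))
        os.makedirs(os.path.join(top, 'tmp', 'deep'))
        return sorted(
            os.path.relpath(root, top)
            for root, _ in walk_pruned(
                top, lambda p: os.path.basename(p) != 'tmp'))

if __name__ == '__main__':
    print(demo())
```

Rebinding the name (`dirs = [...]`) would not work here, because os.walk would keep using its original list; only in-place modification prunes the walk.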

I wrote a function select_walk() to search for and select files in a tree of directories. In the following example, the files searched for have the extensions .dat, .rtf or .jpeg and sit in directories whose paths match the following regex pattern:

    r'J:\\f[ruv]?o+\\\w+\\ba[er](\d+)?\\(?(1)TURI\1\d*|MONO\d+)'

Note the conditional pattern (?(1)TURI\1\d*|MONO\d+), whose group references (1) and \1 point to the number-matching group (\d+) in the elementary pattern ba[er](\d+)? .

1) Here's code to create the tree of directories used as the example (take care: it first deletes the directories 'foo\', 'fooo\', 'froooo\', 'faooo\' before creating them):

    import os
    from shutil import rmtree

    top = 'J:\\'

    for x in ('foo\\','fooo\\','froooo\\','faooo\\'):
        if os.path.isdir(top + x):
            rmtree(top + x)

    li = (('foo\\',('basil\\','poto%\\','tamata\\')),
          ('foo\\basil\\',('ber89','ber300')),
          ('foo\\basil\\ber89\\',('TURI850','TURI1023')),
          ('foo\\poto%\\',('ocean','earth')),
          ('foo\\tamata\\',('vahine',)),
          ('fooo\\',('york#\\','plain\\','atlantis\\')),
          ('fooo\\york#\\',('noto','nata')),
          ('fooo\\plain\\',('zx13ao','ws89rt','bar999')),
          ('fooo\\plain\\bar999\\',('TURI99905','TURI2227','MONO2')),
          ('fooo\\plain\\bar999\\TURI99905\\',('AERIAL','minidisc')),
          ('fooo\\plain\\bar999\\TURI99905\\AERIAL\\',('bumbum','corean')),
          ('fooo\\atlantis\\',('atlABC','atlDEFG')),
          ('fooo\\atlantis\\atlABC\\',('atlantis_sound','atlantis_image')),
          ('froooo\\',('one_dir\\','another_dir\\')),
          ('froooo\\one_dir\\',('bar25','ber')),
          ('froooo\\one_dir\\bar25\\',('TURI2501','TURI2502','TURI4813','MONO8')),
          ('froooo\\one_dir\\ber\\',('TURI30','TURI','MONO532')),
          ('froooo\\another_dir\\',('notseen','notseen2')),
          ('faooo\\',('somolo-\\','samala+\\')))

    for rep,several in li:
        #print top + rep
        if not os.path.isdir(top + rep):
            os.mkdir(top + rep)
        for name in several:
            #print top + rep + name
            os.mkdir(top + rep + name)

    for filepath in (top + 'foo\\kalaomi.xls',
                     top + 'foo\\basil\\ber89\\TURI850\\quetzal.jpeg',
                     top + 'foo\\basil\\ber89\\TURI850\\tehoi.txt',
                     top + 'foo\\poto%\\curcuma in poto%.txt',
                     top + 'foo\\poto%\\ocean\\file in ocean.rtf',
                     top + 'foo\\tamata\\vahine\\tahiti.jpeg',
                     top + 'fooo\\york#\\yorkshire.jpeg',
                     top + 'fooo\\plain\\bar999\\TURI99905\\galileo.jpeg',
                     top + 'fooo\\plain\\bar999\\TURI99905\\polynesia.dat',
                     top + 'fooo\\plain\\bar999\\TURI99905\\concrete.txt',
                     top + 'fooo\\plain\\bar999\\TURI2227\\Monroe.jpeg',
                     top + 'fooo\\plain\\bar999\\MONO2\\elastic.jpeg',
                     top + 'froooo\\one_dir\\photo in one_dir.jpeg',
                     top + 'froooo\\one_dir\\tabula.xls',
                     top + 'froooo\\one_dir\\bar25\\TURI2501\\matallelo.jpeg',
                     top + 'froooo\\one_dir\\bar25\\TURI2501\\italy.dat',
                     top + 'froooo\\one_dir\\bar25\\TURI2501\\beretta.xls',
                     top + 'froooo\\one_dir\\bar25\\TURI2501\\turi2501_ser.rtf',
                     top + 'froooo\\one_dir\\bar25\\TURI4813\\boaf_inTURI4813.jpeg',
                     top + 'froooo\\one_dir\\bar25\\TURI4813\\troui_in_TURI4813.txt',
                     top + 'froooo\\one_dir\\bar25\\MONO8\\in_mono8.dat',
                     top + 'froooo\\one_dir\\bar25\\MONO8\\in_mono8.rtf',
                     top + 'froooo\\one_dir\\bar25\\MONO8\\in_mono8.xls',
                     top + 'froooo\\one_dir\\bar25\\TURI2502\\adamante.jpeg',
                     top + 'froooo\\one_dir\\bar25\\TURI2502\\egyptic.txt',
                     top + 'froooo\\one_dir\\bar25\\TURI2502\\urubu.rtf',
                     top + 'froooo\\one_dir\\ber\\MONO532\\bacillus.jpeg',
                     top + 'froooo\\one_dir\\ber\\MONO532\\blueberry.dat',
                     top + 'froooo\\one_dir\\ber\\MONO532\\Perfume.doc',
                     top + 'faooo\\samala+\\kfaz.dat',
                     top + 'faooo\\somolo-\\ytek.rtf',
                     top + 'faooo\\123.txt',
                     top + 'faooo\\458.rtf',):
        with open(filepath,'w') as f:
            pass

This code creates the following tree:

    J:
    |
    |--foo
    |   |--basil
    |   |   |--ber89
    |   |   |   |--TURI850
    |   |   |   |   |--file quetzal.jpeg
    |   |   |   |   |--file tehoi.txt
    |   |   |   |--TURI1023
    |   |   |--ber300
    |   |--poto%
    |   |   |--ocean
    |   |   |   |--file in ocean.rtf
    |   |   |--earth
    |   |   |--file curcuma in poto%.txt
    |   |--tamata
    |   |   |--vahine
    |   |   |   |--file tahiti.jpeg
    |   |--file kalaomi.xls
    |
    |--fooo
    |   |--york#
    |   |   |--noto
    |   |   |--nata
    |   |   |--file yorkshire.jpeg
    |   |--plain
    |   |   |--zx13ao
    |   |   |--ws89rt
    |   |   |--bar999
    |   |   |   |--TURI99905
    |   |   |   |   |--AERIAL
    |   |   |   |   |   |--bumbum
    |   |   |   |   |   |--corean
    |   |   |   |   |--minidisc
    |   |   |   |   |--file galileo.jpeg
    |   |   |   |   |--file polynesia.dat
    |   |   |   |   |--file concrete.txt
    |   |   |   |--TURI2227
    |   |   |   |   |--file Monroe.jpeg
    |   |   |   |--MONO2
    |   |   |   |   |--file elastic.jpeg
    |   |--atlantis
    |   |   |--atlABC
    |   |   |   |--atlantis_sound
    |   |   |   |--atlantis_image
    |   |   |--atlDEFG
    |
    |--froooo
    |   |--one_dir
    |   |   |--bar25
    |   |   |   |--TURI2501
    |   |   |   |   |--file matallelo.jpeg
    |   |   |   |   |--file italy.dat
    |   |   |   |   |--file beretta.xls
    |   |   |   |   |--file turi2501_ser.rtf
    |   |   |   |--TURI2502
    |   |   |   |   |--file adamante.jpeg
    |   |   |   |   |--file egyptic.txt
    |   |   |   |   |--file urubu.rtf
    |   |   |   |--TURI4813
    |   |   |   |   |--file boaf_inTURI4813.jpeg
    |   |   |   |   |--file troui_in_TURI4813.txt
    |   |   |   |--MONO8
    |   |   |   |   |--file in_mono8.dat
    |   |   |   |   |--file in_mono8.rtf
    |   |   |   |   |--file in_mono8.xls
    |   |   |--ber
    |   |   |   |--TURI30
    |   |   |   |--TURI
    |   |   |   |--MONO532
    |   |   |   |   |--file bacillus.jpeg
    |   |   |   |   |--file blueberry.dat
    |   |   |   |   |--file Perfume.doc
    |   |   |--file photo in one_dir.jpeg
    |   |   |--file tabula.xls
    |   |--another_dir
    |   |   |--notseen
    |   |   |--notseen2
    |
    |--faooo
    |   |--somolo-
    |   |   |--file ytek.rtf
    |   |--samala+
    |   |   |--file kfaz.dat
    |   |--file 123.txt
    |   |--file 458.rtf

The pattern of the regex that matches the files is:

    r'J:\\f[ruv]?o+\\\w+\\ba[er](\d+)?\\(?(1)TURI\1\d*|MONO\d+)\\\w+\.(dat|rtf|jpeg)'

and the directories selectively explored to search for this kind of file will be the following ones:

    'J:\\fooo\\plain\\bar999\\TURI99905'
    'J:\\froooo\\one_dir\\bar25\\TURI2501'
    'J:\\froooo\\one_dir\\bar25\\TURI2502'
    'J:\\froooo\\one_dir\\ber\\MONO532'

2) As a preliminary demonstration, here's code (Python 2) showing how one part of select_walk() works: the part that builds the regexes needed to explore only selected directories during the walk through a tree and to return selected files:

    import re

    def compute_regexes(pat_file, displ = True):
        from os import sep

        splitted_pat = re.split(r'\\\\' if sep=='\\' else '/', pat_file)
        pat_parent_dir = (r'\\' if sep=='\\' else '/').join(splitted_pat[0:-1])

        if displ:
            print ('IN FUNCTION compute_regexes() :'
                   '\n\npat_file== %s'
                   '\n\nsplitted_pat :\n%s'
                   '\n\npat_parent_dir== %s\n') \
                   % (pat_file , '\n'.join(splitted_pat) , pat_parent_dir)

        dgr = {}
        for i,el in enumerate(splitted_pat):
            if re.search('\(.*?\)',el):
                dgr[len(dgr)+1] = i
        if displ:
            print 'dgr :'
            print '\n'.join('group(%s) is in splitted_pat[%s]' % (g,i)
                            for g,i in dgr.iteritems())

        def repl(mat, dgr = dgr):
            the = int(mat.group(1) if mat.group(1) else mat.group(2))
            return str(the + dgr[the])

        for i,el in enumerate(splitted_pat):
            splitted_pat[i] = re.sub(r'(?...
            # [the source is truncated here: the rest of this re.sub() call
            #  and the code that builds pat_dirs are missing]

        return (re.compile(pat_file), re.compile(pat_dirs), re.compile(pat_parent_dir) )

    pat_file = r'J:\\f[ruv]?o+\\\w+\\ba[er](\d+)?\\(?(1)TURI\1\d*|MONO\d+)\\\w+\.(dat|rtf|jpeg)'

    regx_file, regx_dirs, regx_parent_dir = compute_regexes(pat_file)

    print '\n\nEXAMPLES with regx_file :\n'
    print 'pat_file==',pat_file
    for filepath in ('J:\\fooo\\basil\\ber92\\TURI9258\\beru.rtf',
                     'J:\\froooooo\\ki_ki\\bar\\MONO47\\madrid.jpeg'):
        print filepath,bool(regx_file.match(filepath))

    print '\n\nEXAMPLES with regx_dirs :\n'
    for path in ('J:\\fooo',
                 'J:\\fooo\\basil',
                 'J:\\fooo\\basil\\ber92',
                 'J:\\fooo\\basil\\ber92\\TURI777',
                 'J:\\fooo\\basil\\ber92\\TURI9258',
                 'J:\\froooooo'     # no comma here, so this string is joined to the next one
                 'J:\\froooooo\\ki_ki',
                 'J:\\froooooo\\ki_ki\\bar',
                 'J:\\froooooo\\ki=ki\\bar',
                 'J:\\froooooo\\ki_ki\\bar\\MONO47'):
        print path,(" : ~~ this dir's name is OK ~~"
                    if path==''.join(regx_dirs.match(path).group())
                    else " : ## this dir's name doesn't match ##")

The function compute_regexes() first splits the original pat_file regex pattern into elements, each meant to match the name of one directory level in a path. Then it computes:

- a regex pattern pat_dirs that matches the successive levels of the directories containing a wanted file
- a regex pattern pat_parent_dir that matches any direct parent directory of a wanted file

The treatment involving dgr and the function repl() is a refinement that lets compute_regexes() take group references into account (that is, the special sequences \1, \2, etc.) and renumber them, so that they are still correct in pat_dirs relative to the parentheses added to build it.

Result of this code:

    IN FUNCTION compute_regexes() :

    pat_file== J:\\f[ruv]?o+\\\w+\\ba[er](\d+)?\\(?(1)TURI\1\d*|MONO\d+)\\\w+\.(dat|rtf|jpeg)

    splitted_pat :
    J:
    f[ruv]?o+
    \w+
    ba[er](\d+)?
    (?(1)TURI\1\d*|MONO\d+)
    \w+\.(dat|rtf|jpeg)

    pat_parent_dir== J:\\f[ruv]?o+\\\w+\\ba[er](\d+)?\\(?(1)TURI\1\d*|MONO\d+)

    dgr :
    group(1) is in splitted_pat[3]
    group(2) is in splitted_pat[4]
    group(3) is in splitted_pat[5]

    pat_dirs== J:(?=\\|\Z)(\\f[ruv]?o+(?=\\|\Z)(\\\w+(?=\\|\Z)(\\ba[er](\d+)?(?=\\|\Z)(\\(?(4)TURI\4\d*|MONO\d+))?)?)?)?


    EXAMPLES with regx_file :

    pat_file== J:\\f[ruv]?o+\\\w+\\ba[er](\d+)?\\(?(1)TURI\1\d*|MONO\d+)\\\w+\.(dat|rtf|jpeg)
    J:\fooo\basil\ber92\TURI9258\beru.rtf True
    J:\froooooo\ki_ki\bar\MONO47\madrid.jpeg True


    EXAMPLES with regx_dirs :

    J:\fooo : ~~ this dir's name is OK ~~
    J:\fooo\basil : ~~ this dir's name is OK ~~
    J:\fooo\basil\ber92 : ~~ this dir's name is OK ~~
    J:\fooo\basil\ber92\TURI777 : ## this dir's name doesn't match ##
    J:\fooo\basil\ber92\TURI9258 : ~~ this dir's name is OK ~~
    J:\frooooooJ:\froooooo\ki_ki : ## this dir's name doesn't match ##
    J:\froooooo\ki_ki\bar : ~~ this dir's name is OK ~~
    J:\froooooo\ki=ki\bar : ## this dir's name doesn't match ##
    J:\froooooo\ki_ki\bar\MONO47 : ~~ this dir's name is OK ~~

3) Finally, here's the function select_walk() that does the job of searching a tree for files whose names match a certain regex: it yields the triples (dirpath, dirnames, filenames) returned by the os.walk() function, but only those whose filenames list contains file names matching pat_file.

Of course, during the iteration, select_walk() doesn't explore the directories whose names guarantee that none of the files they contain can ever match the key regex pattern pat_file.

    import os
    import re

    def select_walk(pat_file,start_dir):
        from os import sep

        splitted_pat = re.split(r'\\\\' if sep=='\\' else '/', pat_file)
        pat_parent_dir = (r'\\' if sep=='\\' else '/').join(splitted_pat[0:-1])

        dgr = {}
        for i,el in enumerate(splitted_pat):
            if re.search('\(.*?\)',el):
                dgr[len(dgr)+1] = i

        def repl(mat, dgr = dgr):
            the = int(mat.group(1) if mat.group(1) else mat.group(2))
            return str(the + dgr[the])

        for i,el in enumerate(splitted_pat):
            splitted_pat[i] = re.sub(r'(?...
            # [the source is truncated here: the rest of this re.sub() call
            #  and the code that builds pat_dirs are missing]

        regx_file = re.compile(pat_file)
        regx_dirs = re.compile(pat_dirs)
        regx_parent_dir = re.compile(pat_parent_dir)

        start_dir = start_dir.rstrip(sep) + sep
        print '\nstart_dir == '+start_dir

        for dirpath,dirnames,filenames in os.walk(start_dir):
            dirpath = dirpath.rstrip(sep)
            print '\n'.join(('explored dirpath : %s   is_direct_parent: %s' \
                             % (dirpath,('NO','YES')[bool(regx_parent_dir.match(dirpath))]),
                             '        dirnames : %s' % dirnames,
                             '       filenames : %s' % filenames))

            if regx_parent_dir.match(dirpath):
                # Direct parent of wanted files: keep matching files, stop descending.
                filenames[:] = [filename for filename in filenames
                                if regx_file.match(dirpath + sep + filename)]
                dirnames[:] = []
                print '\n'.join(('         dirnames : not to be explored',
                                 'yielded filenames : %s\n' % filenames))
                yield (dirpath,dirnames,filenames)
            else:
                # Keep only the subdirectories whose full path is matched by regx_dirs.
                dirnames[:] = [dirname for dirname in dirnames
                               if regx_dirs.match(dirpath + sep + dirname).group()
                                  == dirpath + sep + dirname]
                print '\n'.join(('dirnames to explore : %s ' % dirnames,
                                 '          filenames : not to be yielded\n'))

    pat_file = r'J:\\f[ruv]?o+\\\w+\\ba[er](\d+)?\\(?(1)TURI\1\d*|MONO\d+)\\\w+\.(dat|rtf|jpeg)'

    print '\n\nSELECTED (dirpath, dirnames, filenames) :\n' + '\n'.join(map(repr, select_walk(pat_file,'J:\\')))

Result:

    pat_dirs== J:(?=\\|\Z)(\\f[ruv]?o+(?=\\|\Z)(\\\w+(?=\\|\Z)(\\ba[er](\d+)?(?=\\|\Z)(\\(?(4)TURI\4\d*|MONO\d+))?)?)?)?

    start_dir == J:\

    explored dirpath : J:   is_direct_parent: NO
            dirnames : ['Amazon', 'faooo', 'Favorites', 'foo', 'fooo', 'froooo', 'Python', 'RECYCLER', 'System Volume Information']
           filenames : ['image00.pfm', 'rep.py']
    dirnames to explore : ['foo', 'fooo', 'froooo']
              filenames : not to be yielded

    explored dirpath : J:\foo   is_direct_parent: NO
            dirnames : ['basil', 'poto%', 'tamata']
           filenames : ['kalaomi.xls']
    dirnames to explore : ['basil', 'tamata']
              filenames : not to be yielded

    explored dirpath : J:\foo\basil   is_direct_parent: NO
            dirnames : ['ber300', 'ber89']
           filenames : []
    dirnames to explore : ['ber300', 'ber89']
              filenames : not to be yielded

    explored dirpath : J:\foo\basil\ber300   is_direct_parent: NO
            dirnames : []
           filenames : []
    dirnames to explore : []
              filenames : not to be yielded

    explored dirpath : J:\foo\basil\ber89   is_direct_parent: NO
            dirnames : ['TURI1023', 'TURI850']
           filenames : []
    dirnames to explore : []
              filenames : not to be yielded

    explored dirpath : J:\foo\tamata   is_direct_parent: NO
            dirnames : ['vahine']
           filenames : []
    dirnames to explore : []
              filenames : not to be yielded

    explored dirpath : J:\fooo   is_direct_parent: NO
            dirnames : ['atlantis', 'plain', 'york#']
           filenames : []
    dirnames to explore : ['atlantis', 'plain']
              filenames : not to be yielded

    explored dirpath : J:\fooo\atlantis   is_direct_parent: NO
            dirnames : ['atlABC', 'atlDEFG']
           filenames : []
    dirnames to explore : []
              filenames : not to be yielded

    explored dirpath : J:\fooo\plain   is_direct_parent: NO
            dirnames : ['bar999', 'ws89rt', 'zx13ao']
           filenames : []
    dirnames to explore : ['bar999']
              filenames : not to be yielded

    explored dirpath : J:\fooo\plain\bar999   is_direct_parent: NO
            dirnames : ['MONO2', 'TURI2227', 'TURI99905']
           filenames : []
    dirnames to explore : ['TURI99905']
              filenames : not to be yielded

    explored dirpath : J:\fooo\plain\bar999\TURI99905   is_direct_parent: YES
            dirnames : ['AERIAL', 'minidisc']
           filenames : ['concrete.txt', 'galileo.jpeg', 'polynesia.dat']
             dirnames : not to be explored
    yielded filenames : ['galileo.jpeg', 'polynesia.dat']

    explored dirpath : J:\froooo   is_direct_parent: NO
            dirnames : ['another_dir', 'one_dir']
           filenames : []
    dirnames to explore : ['another_dir', 'one_dir']
              filenames : not to be yielded

    explored dirpath : J:\froooo\another_dir   is_direct_parent: NO
            dirnames : ['notseen', 'notseen2']
           filenames : []
    dirnames to explore : []
              filenames : not to be yielded

    explored dirpath : J:\froooo\one_dir   is_direct_parent: NO
            dirnames : ['bar25', 'ber']
           filenames : ['photo in one_dir.jpeg', 'tabula.xls']
    dirnames to explore : ['bar25', 'ber']
              filenames : not to be yielded

    explored dirpath : J:\froooo\one_dir\bar25   is_direct_parent: NO
            dirnames : ['MONO8', 'TURI2501', 'TURI2502', 'TURI4813']
           filenames : []
    dirnames to explore : ['TURI2501', 'TURI2502']
              filenames : not to be yielded

    explored dirpath : J:\froooo\one_dir\bar25\TURI2501   is_direct_parent: YES
            dirnames : []
           filenames : ['beretta.xls', 'italy.dat', 'matallelo.jpeg', 'turi2501_ser.rtf']
             dirnames : not to be explored
    yielded filenames : ['italy.dat', 'matallelo.jpeg', 'turi2501_ser.rtf']

    explored dirpath : J:\froooo\one_dir\bar25\TURI2502   is_direct_parent: YES
            dirnames : []
           filenames : ['adamante.jpeg', 'egyptic.txt', 'urubu.rtf']
             dirnames : not to be explored
    yielded filenames : ['adamante.jpeg', 'urubu.rtf']

    explored dirpath : J:\froooo\one_dir\ber   is_direct_parent: NO
            dirnames : ['MONO532', 'TURI', 'TURI30']
           filenames : []
    dirnames to explore : ['MONO532']
              filenames : not to be yielded

    explored dirpath : J:\froooo\one_dir\ber\MONO532   is_direct_parent: YES
            dirnames : []
           filenames : ['bacillus.jpeg', 'blueberry.dat', 'Perfume.doc']
             dirnames : not to be explored
    yielded filenames : ['bacillus.jpeg', 'blueberry.dat']


    SELECTED (dirpath, dirnames, filenames) :
    ('J:\\fooo\\plain\\bar999\\TURI99905', [], ['galileo.jpeg', 'polynesia.dat'])
    ('J:\\froooo\\one_dir\\bar25\\TURI2501', [], ['italy.dat', 'matallelo.jpeg', 'turi2501_ser.rtf'])
    ('J:\\froooo\\one_dir\\bar25\\TURI2502', [], ['adamante.jpeg', 'urubu.rtf'])
    ('J:\\froooo\\one_dir\\ber\\MONO532', [], ['bacillus.jpeg', 'blueberry.dat'])

Not a complete solution, but the search could probably be optimized a bit if you also check for the correct dirs. This can be done using two regexes, one for dirs and one for files. A modification of your code (not tested, sorry) would be:

    import os
    import re

    def find(dregex, fregex, top='.'):
        dmatcher = re.compile(dregex)
        fmatcher = re.compile(fregex)
        for dirpath, dirnames, filenames in os.walk(top):
            # Directory path relative to top; it never ends with a separator.
            d = os.path.relpath(dirpath, top)
            if not dmatcher.match(d):
                continue
            for f in filenames:
                if fmatcher.match(f):
                    yield os.path.join(d, f)

    if __name__ == "__main__":
        top = "."
        fregex = r"\d+-\w+\.dat"
        # No trailing slash: relpath() returns 'foo/baz/bar', not 'foo/baz/bar/'.
        dregex = r"foo/\w+/bar$"
        for f in find(dregex, fregex, top):
            print(f)

This still has to walk through all the directories, even if it's already clear from the top directory that the regex will never match. On the other hand, I don't see an easy way around this either, since the re module does not tell you whether at least a partial match could be obtained (unlike in Java). – Tim Pietzcker Jul 23 at 7:40.
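One way to approximate the missing "partial match" with the standard re module alone is to check a directory level by level. The sketch below rests on a loud assumption: the pattern is a separator-joined sequence of per-level regexes, with no construct (group, backreference, character class) spanning or containing a separator. The name `could_match` is hypothetical.

```python
import re

def could_match(dir_rel, pattern, sep='/'):
    """Return True if dir_rel might be a parent of a path matching pattern.

    Assumption: pattern splits cleanly on sep into one regex per level.
    """
    pieces = pattern.split(sep)
    parts = dir_rel.split(sep)
    # The last piece describes the file itself, so a directory must be
    # strictly shallower than the pattern.
    if len(parts) >= len(pieces):
        return False
    # Every level of the directory must fully match its own piece.
    return all(re.fullmatch(piece, part)
               for piece, part in zip(pieces, parts))

if __name__ == '__main__':
    pattern = r'foo/\w+/bar/\d+-\w+\.dat'
    for d in ('foo', 'foo/baz', 'foo/baz/bar', 'tmp', 'foo/baz/bad'):
        print(d, could_match(d, pattern))
```

A check like this could be used to filter dirnames inside the os.walk loop, so that a directory such as tmp is dropped at the top level instead of being walked in full.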
