Java multithreaded file system tree traversal?

The problem would be to find a good number of threads and distributing works as equally as possbible.

The problem would be to find a good number of threads and distributing works as equally as possbible. Assuming you don't know the number of files and subdirectories in each folder, that could become quite tricky. Here's an idea for a start: What you might do is create a number of threads that operate on a central folder list and spawn a thread per folder you encounter up to a certain maximum.

Each thread could then put the subfolders of the directory it works on to the central list and when it is done it might pick the next from that list. If a folder is put on the list and the maximum number of threads is not reached, a new thread is spawned immediately. If the thread has run out of work and the folder list is empty, it could either stop (requiring you to spawn a new one if needed) or wait until either there's a folder on the list or the application signals that all folders are processed.

Finally, don't forget to synchronize on the folder list. Hope that helps you to get started. Edit: (don't take the following too seriously :) ) You could also use another thread pool implementation that doesn't use the java.util.

Concurrent package :) Edit 2: Basically what I described above is a simple and task specific thread pool implementation. You might try and look for more information on building a thread pool yourself (in the context of your assignment a thread pool task would be scanning one folder).

Not sure what you mean operate on a central folder list. How is that created? – user384706 Sep 21 '11 at 12:36 @user384706 it's basically just a list/queue of either strings (paths) or File objects that represent the folders.

You'd put each sub folder of the folder you are processing into that list and let the next available thread take it from there (might be the same thread but you don't know). - With "central" I mean that each thread can access it either directly or indirectly, hence the need for synchronization. – Thomas Sep 21 '11 at 12:41 So you need to traverse the whole directory structure initially to create the list in the first place?

– user384706 Sep 21 '11 at 12:53 @user384706 no, you fill that list on-the-fly and add worker threads as needed. Whenever a thread encounters subdirectories it puts them on the list. At the beginning you'll have only one thread but as more and more subfolders are added you'll have to spawn additional worker threads until the maximum (normally defined by the number of cores on the machine) is reached.

– Thomas Sep 21 '11 at 13:28 +1. I think it is good idea but only as a sample for interview. I still hold the opinion that for tasks like this OS does it best.In any case the OP doesn't say if the search pattern is for the filename or inside the contents of the file.

– user384706 Sep 21 '11 at 14:40.

Off the top of my head I can think of a couple of approaches which I'll begin to outline here. If you need a more detailed explanation, I'd be glad to elaborate. Note that I have NOT attempted to code to test either of these (yet), so take them with a grain of salt.

The first approach relies on a central dispatcher (the main thread) to coordinate a pool of worker threads and a 'work' queue (of folders). The central dispatcher initializes the thread pool to its minimum size and assigns the root folder to the first worker thread, then signals it to wake up. Each worker thread starts in an idle state, looping until it receives a 'wake up' signal.It then checks if it's been assigned a folder to process, and begins iterating through that folder's contents.

The worker thread submits any files that match the pattern to the 'found' list (or just prints it to System. Out). Any subfolders are added back into the 'work' queue (and the main thread is signalled).

Afterwards, the worker thread reenters the 'idle' state (and the main thread is signalled again). Upon waking up, it should either be assigned a new folder which it starts processing the same way. Otherwise, the thread can terminate itself.

The main thread waits until it receives a signal from any worker thread. It first checks if the work queue is empty. If not, then it checks if there any worker threads that are idle.

While there are any idle worker threads (and the work queue is non-empty), pop the first folder off the work queue and assign it to that thread, and signal it to wake up. The main thread then goes back to waiting. If there are no idle threads, then check the current thread pool size against the maximum configured size.

If the pool is already at maximum, then the main thread goes back to sleep (basically waiting until a thread becomes idle). Otherwise, while the pool isn't at maximum, create a new thread, add it to the pool, and assign it the first folder in the work queue, then wake it up. If the main thread wakes up with the work queue empty, then first it checks if there are any idle threads and the worker thread pool is larger than the configured minimum.

If so, the main thread can elect to wake up idle threads (with no assigned folder, so they'll terminate) and remove them from the pool. If the thread pool is already at a minimum, then again the main thread just goes back to waiting. Note that this 'downsizing' step may be unnecessary optimization (since worker threads shouldn't be consuming CPU cycles waiting anyway).

If all threads are idle and the work queue is empty then the main thread knows it's done and can signal all remaining worker threads to wake up and terminate themselves. The trick here is all in the signaling between the main thread and the worker threads, and synchronization on the work queue (and possibly the 'found files' bin).(Also, to simplify things a bit, one can opt for a fixed-size thread pool. ) The alternative approach doesn't involve a central dispatcher, but uses a fixed pool of worker threads sleeping for random intervals, periodically waking up to check if either a) the work queue has an item, or b) all work has been done.

The main thread just initializes the thread pool, places the root folder at the head of the work 'queue' then starts up all worker threads.It then waits for a signal telling it that all work is done and it can clean up (interrupt all the remaining worker threads to wake them up and terminate themselves). Each worker thread starts up in a non-idle state and checks the work queue. The first thread should see the root folder, which it pops off the queue and begins processing it similar to the above.

All the other threads upon starting should see an empty queue and go to sleep. When a worker thread wakes up, it checks if the work queue has any folders remaining to be processed, and if so, works on it, setting its 'idle' flag to false. When it's done with the current folder, it checks the work queue again.

If the work queue is empty, it sets its 'idle' flag. When a worker thread finds an empty queue (either upon waking up or after processing), it checks all the other threads if they're 'idle'. If it finds at least one other thread still working then it goes to sleep for a random interval.

If all other threads are 'idle' and the work queue is empty, the thread can terminate itself. Before doing so, it interrupts or signals the main thread so the main thread can do the clean up for the rest of the other threads. This approach can also be adapted to use a flexible sized thread pool that can dynamically grow or shrink as needed, but that just puts more complex logic into the worker threads which I won't go into for now.

I think last option will not fit. This task actually mentions that this programm should be much faster than single-threaded programm with abstract disk controllers. I will try first option.

– Igor Konoplyanko Sep 23 '11 at 17:34.

You might want to consider using an ExecutorService to keep control of the number of threads created / running at any one time.

4 pretty tricky to do that "without using util. Concurrent package" – Sean Patrick Floyd Sep 21 '11 at 11:53 Dar! Guess I wouldn't get the job.

;) – John B Sep 21 '11 at 12:00.

You could use JNI and call the search process of the OS for the pattern. My assumption is that the OS would do it more efficiently than you could ever do it. Use multithreading just for spawing a the JNI and parse results.

JNI would require platform dependent solutions which kind of counters the purpose of using Java. I assume that there's no other meaning behind the task than to test the OP's multithreading skills, without actually requiring top performance. – Thomas Sep 21 '11 at 13:31 Well from my point of view in real life OS would do it best.

You are right it is not portable but in the end it depends on what you are trying to do. I mentioned JNI because in Windows platforms various COM bridge libraries in Java are not easy to use. So my point is delegate to OS which will do it best.

– user384706 Sep 21 '11 at 13:41.

I cant really gove you an answer,but what I can give you is a way to a solution, that is you have to find the anglde that you relate to or peaks your interest. A good paper is one that people get drawn into because it reaches them ln some way.As for me WW11 to me, I think of the holocaust and the effect it had on the survivors, their families and those who stood by and did nothing until it was too late.

Related Questions