Wednesday, September 1, 2010

Find and Replace

I came across a post on the Clojure users list the other day that discussed some code that read a file, removed a specified character from each line, and then wrote the file back out to disk. It got me thinking about how I might implement some basic file editing functions. I started with a find function. (Please note that, for brevity and clarity, the code snippets that follow only include the ns form when I introduce a function from a different namespace for the first time.) Here is the initial implementation.

(ns clj-sandbox.io
  (:use [clojure.contrib.duck-streams :only [read-lines]]))   

(defn find [file s]
  (filter #(.contains % s) (read-lines file)))

This returns a sequence of all lines in the file that contain the string s. So far so good. I would like to get the line numbers as well though. I was ready to implement a solution using recur, but then I started thinking that there is probably a more functional way of achieving what I want. I started reading through the chapter on functional programming in Programming Clojure (chapter 5) to spark some ideas. That led me to some of the example code which in turn led me to clojure.contrib.seq/indexed. And now I have this function,

(ns clj-sandbox.io
  (:use [clojure.contrib.duck-streams :only [read-lines]])
  (:use [clojure.contrib.seq :only [indexed]]))

(defn find [file s]
  (map (fn [match] {:line-number (inc (first match)) :text (second match)})
       (filter #(.contains (second %) s) (indexed (read-lines file)))))

which returns a lazy sequence of the lines that contain at least one occurrence of s, along with their line numbers. The sequence consists of maps with two keys, :line-number and :text. At this point I am pretty happy with my function and am ready to consider some additional enhancements. First, I want the capability to get back just the line numbers or just the text of each line. This is easily accomplished with a bit of refactoring.

(defn find
  ([file s opt] (map #(opt %) (find file s)))
  ([file s]
     (map (fn [match] {:line-number (inc (first match)) :text (second match)})
          (filter #(.contains (second %) s) (indexed (read-lines file))))))

Now I have a version that takes a third argument, opt, which should be one of the map keys, :line-number or :text. To get a sequence of just the line numbers I can write,

user> (find "myfile" "mystring" :line-number)
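To try the same idea at the REPL without a file on disk, here is a minimal sketch that works on a sequence of lines and uses only clojure.core. The name find-in-lines is made up for this example, and it swaps indexed plus filter for keep-indexed, so treat it as an illustration rather than a drop-in replacement for find.

```clojure
;; Hypothetical helper: same result shape as find, but it takes a
;; sequence of lines instead of a file name.
(defn find-in-lines
  ([lines s opt] (map opt (find-in-lines lines s)))
  ([lines s]
   ;; keep-indexed passes a zero-based index and drops nil results,
   ;; so numbering and filtering happen in one pass.
   (keep-indexed
    (fn [idx ^String line]
      (when (.contains line s)
        {:line-number (inc idx) :text line}))
    lines)))

(find-in-lines ["foo" "bar baz" "baz"] "baz")
;; => ({:line-number 2, :text "bar baz"} {:line-number 3, :text "baz"})

(find-in-lines ["foo" "bar baz" "baz"] "baz" :line-number)
;; => (2 3)
```

Since keywords are functions of maps, opt can be passed straight to map.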

At this point I am satisfied with find and ready to move onto a replace function. Here is a first cut at replace.

(ns clj-sandbox.io
  (:use [clojure.contrib.duck-streams :only [read-lines]])
  (:use [clojure.contrib.seq :only [indexed]])
  (:use [clojure.contrib.string :only [replace-str]]))

(defn replace [file s r] (map #(replace-str s r (:text %)) (find file s)))

This function replaces every occurrence of s with r, and the results are returned as a lazy sequence. I need to update the function to write the changes back to the file. Initially I consider,

(defn replace [f s r]
  (write-lines (str "." f) (map #(replace-str s r (:text %)) (find f s)))
  (copy (file (str "." f)) (file f)))

This implementation, however, is problematic. Changes are written to a copy of the file, and when that is done the original file is replaced with the copy. Only those lines that match the search string are written back to the file, so we wind up losing lines that should be left intact. The function needs to write every line back to the file, including those that have not been modified. I decide to take out the call to find, since it does not return all lines, and replace it with a call to read-lines.

(ns clj-sandbox.io
  (:use [clojure.contrib.duck-streams :only [read-lines write-lines copy]])
  (:use [clojure.contrib.seq :only [indexed]])
  (:use [clojure.contrib.string :only [replace-str]])
  (:use [clojure.java.io :only [file]]))

(defn replace [f s r]
  (write-lines (str "." f) (map #(replace-str s r %) (read-lines f)))
  (copy (file (str "." f)) (file f)))

This gives me the behavior that I want; however, I notice some duplication with creating the backup file name. That can easily be eliminated with a let binding.

(defn replace [f s r]
  (let [new-file (str "." f)]
    (write-lines new-file (map #(replace-str s r %) (read-lines f)))
    (copy (file new-file) (file f))))
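For readers on a Clojure version without clojure.contrib, here is a rough sketch of the same write-to-a-copy-then-overwrite approach using only clojure.java.io and clojure.string. The namespace and the name replace-in-file are made up for this example, and I am assuming literal (non-regex) replacement, as with replace-str. Unlike the version above, it builds the temporary file name from the file's own name so that paths containing directories still work, and it deletes the temporary file afterwards.

```clojure
(ns clj-sandbox.replace-sketch
  (:require [clojure.java.io :as io]
            [clojure.string :as str]))

;; Hypothetical name; mirrors the post's replace with core libraries only.
(defn replace-in-file [f s r]
  (let [target (io/file f)
        ;; Temporary file next to the original; the "." prefix follows the post.
        tmp    (io/file (.getParentFile (.getAbsoluteFile target))
                        (str "." (.getName target)))]
    (with-open [rdr (io/reader target)
                wtr (io/writer tmp)]
      ;; Every line is written back, changed or not.
      (doseq [line (line-seq rdr)]
        (.write wtr (str/replace line s r))
        (.newLine wtr)))
    ;; Overwrite the original with the edited copy, then clean up.
    (io/copy tmp target)
    (io/delete-file tmp)))
```

Note that this always ends the file with a line separator, even if the original did not have one.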

Now I have a couple of functions that can be used to perform a global find and replace. I spent some time working on functions that find or replace a specified number of matches. These need some additional effort, however, because I have to keep a running total of matches. Suppose I want the first 3 matches of some string. The first two occur on line 6, and the third occurs on line 9. With the find function discussed here, we can determine that there are matches on lines 6 and 9, but we cannot determine how many matches there are per line. I will try to revisit this in a future post. The source code for this can be found on GitHub.
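One possible shape for the running-total idea, purely as a sketch and not the code in the repository: count the matches on each line, then keep lines until the first n matches are covered. The names count-matches and find-first-n are made up for this example, and I am assuming matches are literal and non-overlapping and that s is non-empty.

```clojure
;; Hypothetical helper: number of non-overlapping literal occurrences of s in line.
(defn count-matches [line s]
  (count (re-seq (re-pattern (java.util.regex.Pattern/quote s)) line)))

;; Hypothetical helper: the lines that hold the first n matches of s,
;; each tagged with how many matches it contains.
(defn find-first-n [lines s n]
  (let [hits   (keep-indexed
                (fn [idx line]
                  (let [c (count-matches line s)]
                    (when (pos? c)
                      {:line-number (inc idx) :text line :matches c})))
                lines)
        ;; Matches already seen before each hit: 0, then the running total.
        before (cons 0 (reductions + (map :matches hits)))]
    (map first
         (take-while (fn [[_ seen]] (< seen n))
                     (map vector hits before)))))

(def sample ["a" "b" "c" "d" "e" "x and x" "g" "h" "x" "x"])

(map (juxt :line-number :matches) (find-first-n sample "x" 3))
;; => ([6 2] [9 1])
```

The last line returned may contain more matches than are needed to reach n, so a replace built on top of this would still have to limit how many occurrences it rewrites on that line.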
