cgrand / xforms

Extra transducers and reducing fns for Clojure(script)

Welford's for SD is better

rebcabin opened this issue

Bravo, btw. Reducible (online) statistics are great, and if you keep going down the same road, you will end up with Kalman filters and much much more (see references at the bottom).

Your algorithm for standard deviation squares first and then subtracts, which exposes it to catastrophic cancellation. Welford's fixes that: it is very similar to yours, but it subtracts first (a little cleverly), then squares. Here is a sketch of Welford's in Clojure. The Wikipedia reference is below.

(defn running-mean
  ([]
   {:mean 0, :count 0})
  ([{:keys [mean count]} new-datum]
   (let [new-count (inc count)]
     {:mean  (+ (/ new-datum new-count) (* mean (/ count new-count)))
      :count new-count})))

(defn running-stats
  ([]
   {:mean 0, :count 0, :ssr 0, :variance 0, :std-dev 0})
  ([{:keys [ssr mean] :as ostats} new-datum]
   (let [nrmean   (running-mean ostats new-datum)
         ;; Welford update: multiply the residual against the old mean
         ;; by the residual against the new mean.
         nssr     (+ ssr (* (- new-datum mean)
                            (- new-datum (:mean nrmean))))
         ncount   (:count nrmean)
         ;; Sample variance (n - 1 denominator); taken as 0.0 when n < 2.
         nvar     (if (> ncount 1) (/ nssr (dec ncount)) 0.0)
         nstd-dev (Math/sqrt nvar)]

     {:ssr      (double nssr),
      :mean     (double (:mean nrmean)),
      :count    (:count nrmean),
      :variance (double nvar),
      :std-dev  nstd-dev})))
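
For readers mapping the sketch above onto the usual presentation of Welford's method, the recurrence can be written as follows. The symbols are notation chosen here, not taken from the thread: $x_n$ is the new datum, $\bar{x}_n$ the running mean (`:mean`), $S_n$ the running sum of squared residuals (`:ssr`), and $s_n^2$ the sample variance (`:variance`). The mean update in `running-mean` is written differently but is algebraically the same.

```latex
\begin{aligned}
\bar{x}_n &= \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}
           = \frac{x_n}{n} + \bar{x}_{n-1}\,\frac{n-1}{n}, \\
S_n       &= S_{n-1} + (x_n - \bar{x}_{n-1})(x_n - \bar{x}_n), \\
s_n^2     &= \frac{S_n}{n-1} \quad (n > 1).
\end{aligned}
```

Starting from $\bar{x}_0 = 0$ and $S_0 = 0$ matches the zero-arity case of `running-stats`.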

https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm

Also see http://vixra.org/abs/1609.0044 and http://vixra.org/abs/1606.0328

Thanks, fixed by 6047563 and released in 0.8.1.

I didn't dig too deeply into it, but I tried a buffered variant of Welford (to amortize the division cost over several items); at least in CLJS, it was slower.

It's plausible that Welford's is slower, but I think it's demonstrably safer on data sets with wide dynamic range. When you square, big numbers (> 1) get bigger and small numbers (< 1) get smaller. You can get into a situation where the sum of squared data and the squared-mean term are both huge and nearly equal, so their difference keeps few correct digits. None of the formulas will give you any warning that this is happening, but Welford's will stave off disaster longer.
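
A minimal sketch of that failure mode, assuming double-precision inputs on the JVM. The names `naive-step`, `naive-variance`, `welford-step`, and `welford-variance` are hypothetical, made up for this illustration; they are not functions from xforms and not the code that was replaced in 0.8.1. The data is a small spread riding on a large offset, where the true sample variance is 30.

```clojure
;; Illustrative only: every name below is hypothetical, not part of xforms.

(defn naive-step
  "Square-first approach: accumulate count, sum, and sum of squares."
  [{:keys [n sum sumsq]} x]
  {:n (inc n), :sum (+ sum x), :sumsq (+ sumsq (* x x))})

(defn naive-variance
  "Sample variance from the naive accumulators: (sumsq - sum^2/n) / (n - 1)."
  [{:keys [n sum sumsq]}]
  (/ (- sumsq (/ (* sum sum) n)) (dec n)))

(defn welford-step
  "One Welford update: subtract first, then multiply the two residuals."
  [{:keys [n mean ssr]} x]
  (let [new-n    (inc n)
        delta    (- x mean)
        new-mean (+ mean (/ delta new-n))]
    {:n new-n, :mean new-mean, :ssr (+ ssr (* delta (- x new-mean)))}))

(defn welford-variance
  "Sample variance from the Welford accumulators: ssr / (n - 1)."
  [{:keys [n ssr]}]
  (/ ssr (dec n)))

;; Residuals are -6, -3, 3, 6, so the exact sample variance is 90/3 = 30.
(def data (mapv #(+ 1e9 %) [4.0 7.0 13.0 16.0]))

;; Both reducers take an explicit initial state.
(def naive
  (naive-variance (reduce naive-step {:n 0, :sum 0.0, :sumsq 0.0} data)))

(def welford
  (welford-variance (reduce welford-step {:n 0, :mean 0.0, :ssr 0.0} data)))

(println "naive:  " naive)    ; far from 30; the exact value depends on rounding
(println "welford:" welford)  ; 30.0
```

The naive accumulators reach roughly 4e18, where adjacent doubles are hundreds apart, so a difference of 90 cannot survive the subtraction. The Welford state only ever holds residual-sized products, which is why it stays exact on this input. Neither function signals that precision was lost.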