多次元尺度構成法によるグラフの作成

はじめに

階層型,K平均法でのクラスタリング手法では,デンドログラム等で結果表示がされてきました.
が,決してビジュアル的には分かりやすいとは言えませんでした.
可視化してみようということで,多次元尺度構成法(以下,MDS)により図示してみることにします.

ScaleDownクラスの作成

MDSを行うにあたり,主となるクラスのソースを示します.
ロジック自体はPythonで書かれた本とほぼ同様です.

module My
  class ScaleDown
    LOOP_MAX = 10_000
    GRID_ADJUSTMENT = 0.01
    ERRORS_FOR_COMPARISON = 2

    def initialize(counts)
      @counts = counts
      @similarities = {}
      @distances = {}
      @matrix = {}
      @errors = Array.new(ERRORS_FOR_COMPARISON, Float::MAX)
    end

    attr_reader :similarities, :matrix, :errors

    # ランダムに要素を配置してから,相関値に近づくよう座標を動かす
    # 全体の誤差が直近10個分より大きくなったら収束したと見なす
    def start
      calc_similarities
      random_attach
      calc_distance

      loop_counter = 0
      until converged? or (LOOP_MAX < loop_counter)
        adjust_elements
        calc_distance
        puts "Iteration #{loop_counter}" if loop_counter % 100 == 0
        loop_counter += 1
      end
    end

    # 座標をX:正の整数 Y:正の実数に正規化する
    def normalize
      normalized = {}
      offset_x = @matrix.map { |count, (x, y)| x }.min
      offset_y = @matrix.map { |count, (x, y)| y }.min

      @matrix.each do |count, (x, y)|
        normalized[count] = [((x - offset_x) * 100).round, ((y - offset_y) * 100)]
      end
      normalized
    end

    private

    # 全要素間の相関値を計算する
    # ただし,格納する際には1から減算して格納する
    def calc_similarities
      @counts.each_key do |count1|
        @counts.each_key do |count2|
          # 自分自身との相関は意味が無いので飛ばす
          next if count1 == count2
          pair = Pair.new [count1, count2]

          unless @similarities[pair]
            # valuesメソッドを使うと順序が保証されないので,別変数にvalueをそれぞれ格納する
            count1_values = []
            count2_values = []
            @counts[count1].each do |count, value|
            count1_values << value
            count2_values << @counts[count2][count]
            end
            @similarities[pair] = 1 - Pearson.calc(count1_values, count2_values)
          end
        end
      end
    end

    # ランダムに各要素にx,yの値を持たせる
    def random_attach
      @counts.each_key do |count|
        @matrix[count] = [rand, rand]
      end
    end

    # その時点の距離を計算し,差分を出す
    def calc_distance
      @matrix.each do |count1, (x1, y1)|
        @matrix.each do |count2, (x2, y2)|
          next if count1 == count2
          pair = Pair.new [count1, count2]
          @distances[pair] = Math::sqrt( ((x1 - x2) ** 2) + ((y1 - y2) ** 2) )
        end
      end
    end
    
    # 座標を目標に近づけるように少しづつ動かす
    def adjust_elements
      @matrix.each do |count1, (x1, y1)|
        @matrix.each do |count2, (x2, y2)|
          next if count1 == count2
          pair = Pair.new [count1, count2]
          numerator = (@similarities[pair] - @distances[pair])
          denominator = [@similarities[pair], @distances[pair]].max

          unless denominator == 0
            difference = numerator.quo(denominator)
            x1 += (x1 - x2).quo(@distances[pair]) * difference * GRID_ADJUSTMENT
            y1 += (y1 - y2).quo(@distances[pair]) * difference * GRID_ADJUSTMENT
            @matrix[count1] = [x1, y1]
          end
        end
      end
    end

    # 誤差の収束判定
    def converged?
      error = 0
      @matrix.each do |count1, (x1, y1)|
        @matrix.each do |count2, (x2, y2)|
          next if count1 == count2
          pair = Pair.new [count1, count2]

          numerator = (@similarities[pair] - @distances[pair]).abs
          denominator = [@similarities[pair], @distances[pair]].max
          error += numerator.quo(denominator) unless denominator == 0
        end
      end

      if @errors[(- ERRORS_FOR_COMPARISON)..-1].max < error
        puts error
        true
      else
        @errors << error
        false
      end
    end
  end
end

処理の流れ

startメソッド内でメインループを司る処理を行っています.
収束するかループ回数の最大値に至るまで,各要素の座標を動かしながら要素間の距離を求め続けます.

until converged? or (LOOP_MAX < loop_counter)
  adjust_elements
  calc_distance
  loop_counter += 1
end

収束判定

誤差の収束判定ですが,収束判定を過去の誤差複数個分指定できるようにしました.

if @errors[(- ERRORS_FOR_COMPARISON)..-1].max < error
  true
else
  @errors << error
  false
end

本書のPythonのものでは1度でも誤差が大きくなったら終了してしまいますが,
何度か振動しつつ収束していく場合もあるので,参照できる誤差の履歴数を指定できるようにしています.

動作確認

require 'pair'
require 'pearson'

scale_down = My::ScaleDown.new(blog_data_from('blogdata.txt'))
scale_down.start

データを読み出す"blog_data_from"は階層型クラスタ作成で使用したものと全く同じものです.
実行すると↓のようになります.(非表示にしてあります)

{"Micro Persuasion"=>[85, 13.5460025405192],
 "Read/WriteWeb"=>[97, 27.0068825598863],
 "Kotaku"=>[51, 86.2703302712075],
 "Michelle Malkin"=>[16, 74.2628592784252],
 "Deadspin"=>[64, 114.456757077053],
 "Schneier on Security"=>[11, 44.7189538498007],
 "Joystiq"=>[165, 73.8217240527541],
 "Bloglines | News"=>[64, 9.10014002381623],
 "ongoing"=>[85, 97.6338165714912],
 "Wonkette"=>[50, 111.837307270994],
 "Neil Gaiman's Journal"=>[74, 136.204262748484],
 "MAKE Magazine"=>[162, 110.132670143626],
 "Seth's Blog"=>[119, 116.568535734765],
 "Lifehacker"=>[101, 46.4111330562718],
 "Quick Online Tips"=>[127, 31.9027177670795],
 "Techdirt"=>[117, 52.235016526772],
 "NewsBusters.org - Exposing Liberal Media Bias"=>[16, 92.0928451363676],
 "Power Line"=>[20, 137.304035556552],
 "CoolerHeads Prevail"=>[170, 71.4781694170523],
 "Pharyngula"=>[135, 113.122148417766],
 "Bloggers Blog: Blogging the Blogsphere"=>[86, 60.6769487682272],
 "Sifry's Alerts"=>[104, 70.1607701938568],
 "Shoemoney - Skills to pay the bills"=>[48, 50.8009971240633],
 "Creating Passionate Users"=>[148, 80.3676603132655],
 "Publishing 2.0"=>[81, 29.4935145181916],
 "Signal vs. Noise"=>[104, 93.324253391559],
 "SpikedHumor"=>[42, 147.168033368196],
 "Oilman"=>[127, 142.701511001283],
 "Engadget"=>[153, 107.548994507299],
 "Blog Maverick"=>[134, 84.758063843039],
 "Copyblogger"=>[135, 58.0241211787316],
 "TechEBlog"=>[39, 8.93343477930175],
 "lifehack.org"=>[123, 135.449476282512],
 "Search Engine Roundtable"=>[111, 4.09280718648749],
 "Joho the Blog"=>[60, 53.025081409916],
 "Andrew Sullivan | The Daily Dish"=>[18, 108.71341854973],
 "Search Engine Watch Blog"=>[99, 1.64363803152656],
 "gapingvoid: \"cartoons drawn on the back of business cards\""=>
  [153, 136.81521420136],
 "Joi Ito's Web"=>[101, 108.269698835308],
 "Joel on Software"=>[115, 125.994299886354],
 "Instapundit.com"=>[18, 118.452458530633],
 "Scobleizer - Tech Geek Blogger"=>[86, 77.5751704222452],
 "Valleywag"=>[75, 61.8496977284291],
 "kottke.org"=>[88, 119.636735600429],
 "ProBlogger Blog Tips"=>[129, 70.1848910297151],
 "Download Squad"=>[161, 64.1416422555106],
 "Boing Boing"=>[96, 151.484250069419],
 "Google Operating System"=>[110, 13.2083872291924],
 "plasticbag.org"=>[77, 87.501466071832],
 "Derek Powazek"=>[38, 71.2168841354237],
 "Eschaton"=>[46, 141.462890962494],
 "A Consuming Experience (full feed)"=>[106, 29.1350983296166],
 "Think Progress"=>[3, 94.860724476481],
 "Daily Kos"=>[33, 118.99385815941],
 "The Viral Garden"=>[114, 88.7812343021926],
 "Talking Points Memo: by Joshua Micah Marshall"=>[8, 108.520329657382],
 "O'Reilly Radar"=>[71, 20.8208127955073],
 "Google Blogoscoped"=>[118, 9.82129398520128],
 "we make money not art"=>[66, 95.7297458831282],
 "Gothamist"=>[74, 116.75197713589],
 "Gizmodo"=>[94, 132.678605317872],
 "Topix.net Weblog"=>[44, 22.1149123295464],
 "Crooks and Liars"=>[23, 85.5889322323665],
 "Go Fug Yourself"=>[70, 156.959391554022],
 "Official Google Blog"=>[115, 21.6839280752944],
 "John Battelle's Searchblog"=>[94, 9.90238642644672],
 "Gawker"=>[59, 130.147978812488],
 "ScienceBlogs : Combined Feed"=>[128, 103.94055328366],
 "Cool Hunting"=>[154, 121.479142368014],
 "flagrantdisregard"=>[103, 155.629518767795],
 "Treehugger"=>[37, 50.9086840106847],
 "BuzzMachine"=>[60, 72.8761880159092],
 "Mashable!"=>[148, 25.6524725184407],
 "PaulStamatiou.com"=>[134, 39.3193818608509],
 "Autoblog"=>[141, 135.600533378354],
 "WWdN: In Exile"=>[25, 62.6314561994913],
 "Captain's Quarters"=>[25, 131.230589116509],
 "Slashdot"=>[55, 34.2531194075149],
 "43 Folders"=>[158, 42.6870468035957],
 "Little Green Footballs"=>[33, 98.1244525467569],
 "TechCrunch"=>[70, 0.0],
 "TMZ.com"=>[120, 156.635525868862],
 "The Superficial - Because You're Ugly"=>[63, 144.741676766823],
 "Signum sine tinnitu--by Guy Kawasaki"=>[31, 21.8169442333282],
 "456 Berea Street"=>[117, 64.9335996882467],
 "GigaOM"=>[133, 25.9477189384876],
 "Wired News: Top Stories"=>[36, 34.6746793694172],
 "Steve Pavlina's Personal Development Blog"=>[105, 132.808099853433],
 "The Huffington Post | Raw Feed"=>[52, 119.618742857073],
 "Online Marketing Report"=>[160, 91.7040920298846],
 "Hot Air"=>[48, 154.564400439014],
 "The Blotter"=>[0, 58.4745533214144],
 "MetaFilter"=>[82, 161.706389998442],
 "The Unofficial Apple Weblog (TUAW)"=>[169, 76.6749410179907],
 "Jeremy Zawodny's blog"=>[90, 143.525564378394],
 "SimpleBits"=>[26, 39.2904296405856],
 "Dave Shea's mezzoblue"=>[146, 53.6366515084854],
 "Matt Cutts: Gadgets, Google, and SEO"=>[74, 46.7826300524704],
 "PerezHilton.com"=>[81, 153.748332058125]}

これでは何だか分からないので,次回はグラフにします.