多次元尺度構成法によるグラフの作成
はじめに
階層型,K平均法でのクラスタリング手法では,デンドログラム等で結果表示がされてきました.
が,決してビジュアル的には分かりやすいとは言えませんでした.
可視化してみようということで,多次元尺度構成法(以下,MDS)により図示してみることにします.
ScaleDownクラスの作成
MDSを行うにあたり,主となるクラスのソースを示します.
ロジック自体はPythonで書かれた本とほぼ同様です.
module My class ScaleDown LOOP_MAX = 10_000 GRID_ADJUSTMENT = 0.01 ERRORS_FOR_COMPARISON = 2 def initialize(counts) @counts = counts @similarities = {} @distances = {} @matrix = {} @errors = Array.new(ERRORS_FOR_COMPARISON, Float::MAX) end attr_reader :similarities, :matrix, :errors # ランダムに要素を配置してから,相関値に近づくよう座標を動かす # 全体の誤差が直近10個分より大きくなったら収束したと見なす def start calc_similarities random_attach calc_distance loop_counter = 0 until converged? or (LOOP_MAX < loop_counter) adjust_elements calc_distance puts "Iteration #{loop_counter}" if loop_counter % 100 == 0 loop_counter += 1 end end # 座標をX:正の整数 Y:正の実数に正規化する def normalize normalized = {} offset_x = @matrix.map { |count, (x, y)| x }.min offset_y = @matrix.map { |count, (x, y)| y }.min @matrix.each do |count, (x, y)| normalized[count] = [((x - offset_x) * 100).round, ((y - offset_y) * 100)] end normalized end private # 全要素間の相関値を計算する # ただし,格納する際には1から減算して格納する def calc_similarities @counts.each_key do |count1| @counts.each_key do |count2| # 自分自身との相関は意味が無いので飛ばす next if count1 == count2 pair = Pair.new [count1, count2] unless @similarities[pair] # valuesメソッドを使うと順序が保証されないので,別変数にvalueをそれぞれ格納する count1_values = [] count2_values = [] @counts[count1].each do |count, value| count1_values << value count2_values << @counts[count2][count] end @similarities[pair] = 1 - Pearson.calc(count1_values, count2_values) end end end end # ランダムに各要素にx,yの値を持たせる def random_attach @counts.each_key do |count| @matrix[count] = [rand, rand] end end # その時点の距離を計算し,差分を出す def calc_distance @matrix.each do |count1, (x1, y1)| @matrix.each do |count2, (x2, y2)| next if count1 == count2 pair = Pair.new [count1, count2] @distances[pair] = Math::sqrt( ((x1 - x2) ** 2) + ((y1 - y2) ** 2) ) end end end # 座標を目標に近づけるように少しづつ動かす def adjust_elements @matrix.each do |count1, (x1, y1)| @matrix.each do |count2, (x2, y2)| next if count1 == count2 pair = Pair.new [count1, count2] numerator = (@similarities[pair] - @distances[pair]) denominator = [@similarities[pair], @distances[pair]].max unless denominator == 0 difference = numerator.quo(denominator) x1 += (x1 - x2).quo(@distances[pair]) * difference * GRID_ADJUSTMENT y1 += (y1 - y2).quo(@distances[pair]) * difference * GRID_ADJUSTMENT @matrix[count1] = [x1, y1] end end end end # 誤差の収束判定 def converged? error = 0 @matrix.each do |count1, (x1, y1)| @matrix.each do |count2, (x2, y2)| next if count1 == count2 pair = Pair.new [count1, count2] numerator = (@similarities[pair] - @distances[pair]).abs denominator = [@similarities[pair], @distances[pair]].max error += numerator.quo(denominator) unless denominator == 0 end end if @errors[(- ERRORS_FOR_COMPARISON)..-1].max < error puts error true else @errors << error false end end end end
処理の流れ
startメソッド内でメインループを司る処理を行っています.
収束するかループ回数の最大値に至るまで,各要素の座標を動かしながら要素間の距離を求め続けます.
until converged? or (LOOP_MAX < loop_counter) adjust_elements calc_distance loop_counter += 1 end
収束判定
誤差の収束判定ですが,収束判定を過去の誤差複数個分指定できるようにしました.
if @errors[(- ERRORS_FOR_COMPARISON)..-1].max < error true else @errors << error false end
本書のPythonのものでは1度でも誤差が大きくなったら終了してしまいますが,
何度か振動しつつ収束していく場合もあるので,参照できる誤差の履歴数を指定できるようにしています.
動作確認
require 'pair' require 'pearson' scale_down = My::ScaleDown.new(blog_data_from('blogdata.txt')) scale_down.start
データを読み出す"blog_data_from"は階層型クラスタ作成で使用したものと全く同じものです.
実行すると↓のようになります.(非表示にしてあります)
{"Micro Persuasion"=>[85, 13.5460025405192], "Read/WriteWeb"=>[97, 27.0068825598863], "Kotaku"=>[51, 86.2703302712075], "Michelle Malkin"=>[16, 74.2628592784252], "Deadspin"=>[64, 114.456757077053], "Schneier on Security"=>[11, 44.7189538498007], "Joystiq"=>[165, 73.8217240527541], "Bloglines | News"=>[64, 9.10014002381623], "ongoing"=>[85, 97.6338165714912], "Wonkette"=>[50, 111.837307270994], "Neil Gaiman's Journal"=>[74, 136.204262748484], "MAKE Magazine"=>[162, 110.132670143626], "Seth's Blog"=>[119, 116.568535734765], "Lifehacker"=>[101, 46.4111330562718], "Quick Online Tips"=>[127, 31.9027177670795], "Techdirt"=>[117, 52.235016526772], "NewsBusters.org - Exposing Liberal Media Bias"=>[16, 92.0928451363676], "Power Line"=>[20, 137.304035556552], "CoolerHeads Prevail"=>[170, 71.4781694170523], "Pharyngula"=>[135, 113.122148417766], "Bloggers Blog: Blogging the Blogsphere"=>[86, 60.6769487682272], "Sifry's Alerts"=>[104, 70.1607701938568], "Shoemoney - Skills to pay the bills"=>[48, 50.8009971240633], "Creating Passionate Users"=>[148, 80.3676603132655], "Publishing 2.0"=>[81, 29.4935145181916], "Signal vs. Noise"=>[104, 93.324253391559], "SpikedHumor"=>[42, 147.168033368196], "Oilman"=>[127, 142.701511001283], "Engadget"=>[153, 107.548994507299], "Blog Maverick"=>[134, 84.758063843039], "Copyblogger"=>[135, 58.0241211787316], "TechEBlog"=>[39, 8.93343477930175], "lifehack.org"=>[123, 135.449476282512], "Search Engine Roundtable"=>[111, 4.09280718648749], "Joho the Blog"=>[60, 53.025081409916], "Andrew Sullivan | The Daily Dish"=>[18, 108.71341854973], "Search Engine Watch Blog"=>[99, 1.64363803152656], "gapingvoid: \"cartoons drawn on the back of business cards\""=> [153, 136.81521420136], "Joi Ito's Web"=>[101, 108.269698835308], "Joel on Software"=>[115, 125.994299886354], "Instapundit.com"=>[18, 118.452458530633], "Scobleizer - Tech Geek Blogger"=>[86, 77.5751704222452], "Valleywag"=>[75, 61.8496977284291], "kottke.org"=>[88, 119.636735600429], "ProBlogger Blog Tips"=>[129, 70.1848910297151], "Download Squad"=>[161, 64.1416422555106], "Boing Boing"=>[96, 151.484250069419], "Google Operating System"=>[110, 13.2083872291924], "plasticbag.org"=>[77, 87.501466071832], "Derek Powazek"=>[38, 71.2168841354237], "Eschaton"=>[46, 141.462890962494], "A Consuming Experience (full feed)"=>[106, 29.1350983296166], "Think Progress"=>[3, 94.860724476481], "Daily Kos"=>[33, 118.99385815941], "The Viral Garden"=>[114, 88.7812343021926], "Talking Points Memo: by Joshua Micah Marshall"=>[8, 108.520329657382], "O'Reilly Radar"=>[71, 20.8208127955073], "Google Blogoscoped"=>[118, 9.82129398520128], "we make money not art"=>[66, 95.7297458831282], "Gothamist"=>[74, 116.75197713589], "Gizmodo"=>[94, 132.678605317872], "Topix.net Weblog"=>[44, 22.1149123295464], "Crooks and Liars"=>[23, 85.5889322323665], "Go Fug Yourself"=>[70, 156.959391554022], "Official Google Blog"=>[115, 21.6839280752944], "John Battelle's Searchblog"=>[94, 9.90238642644672], "Gawker"=>[59, 130.147978812488], "ScienceBlogs : Combined Feed"=>[128, 103.94055328366], "Cool Hunting"=>[154, 121.479142368014], "flagrantdisregard"=>[103, 155.629518767795], "Treehugger"=>[37, 50.9086840106847], "BuzzMachine"=>[60, 72.8761880159092], "Mashable!"=>[148, 25.6524725184407], "PaulStamatiou.com"=>[134, 39.3193818608509], "Autoblog"=>[141, 135.600533378354], "WWdN: In Exile"=>[25, 62.6314561994913], "Captain's Quarters"=>[25, 131.230589116509], "Slashdot"=>[55, 34.2531194075149], "43 Folders"=>[158, 42.6870468035957], "Little Green Footballs"=>[33, 98.1244525467569], "TechCrunch"=>[70, 0.0], "TMZ.com"=>[120, 156.635525868862], "The Superficial - Because You're Ugly"=>[63, 144.741676766823], "Signum sine tinnitu--by Guy Kawasaki"=>[31, 21.8169442333282], "456 Berea Street"=>[117, 64.9335996882467], "GigaOM"=>[133, 25.9477189384876], "Wired News: Top Stories"=>[36, 34.6746793694172], "Steve Pavlina's Personal Development Blog"=>[105, 132.808099853433], "The Huffington Post | Raw Feed"=>[52, 119.618742857073], "Online Marketing Report"=>[160, 91.7040920298846], "Hot Air"=>[48, 154.564400439014], "The Blotter"=>[0, 58.4745533214144], "MetaFilter"=>[82, 161.706389998442], "The Unofficial Apple Weblog (TUAW)"=>[169, 76.6749410179907], "Jeremy Zawodny's blog"=>[90, 143.525564378394], "SimpleBits"=>[26, 39.2904296405856], "Dave Shea's mezzoblue"=>[146, 53.6366515084854], "Matt Cutts: Gadgets, Google, and SEO"=>[74, 46.7826300524704], "PerezHilton.com"=>[81, 153.748332058125]}
これでは何だか分からないので,次回はグラフにします.