読者です 読者をやめる 読者になる 読者になる

kubo39's blog

ただの雑記です。

Crystalのprofiling的なsomething?

まずは環境。

( ՞ਊ ՞) :~/dev/crystal $ crystal --version
Crystal 0.19.4 [7f82f79] (2016-10-07)
( ՞ਊ ՞) :~/dev/crystal $ cat /etc/issue
Ubuntu 16.10 \n \l

( ՞ਊ ՞) :~/dev/crystal $ cat /proc/version
Linux version 4.8.0-26-generic (buildd@lgw01-58) (gcc version 6.2.0 20161005 (Ubuntu 6.2.0-5ubuntu12) ) #28-Ubuntu SMP Tue Oct 18 14:39:52 UTC 2016

こういう雑なコードを書く。

arr = [] of Int32
(0..1e7).each do |i|
  arr << i
end

-d つきでビルドしてDWARF吐くように。

( ՞ਊ ՞) :~/dev/crystal $ crystal build -d arr_push.cr
( ՞ਊ ՞) :~/dev/crystal $ crystal build --release -d arr_push.cr  # リリースビルド

time (bash組み込み)

まあ一番雑に

( ՞ਊ ՞) :~/dev/crystal $ time ./arr_push

real    0m0.198s
user    0m0.212s
sys     0m0.020s

こちらは releaseビルドの結果。

( ՞ਊ ՞) :~/dev/crystal $ time ./arr_push

real    0m0.105s
user    0m0.096s
sys     0m0.040s

/usr/bin/time

もう少し細かく。見方は man time 見ればなんとなくわかるはず。

( ՞ਊ ՞) :~/dev/crystal $ /usr/bin/time -v ./arr_push
        Command being timed: "./arr_push"
        User time (seconds): 0.20
        System time (seconds): 0.02
        Percent of CPU this job got: 115%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.19
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 108504
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 17448
        Voluntary context switches: 192
        Involuntary context switches: 9
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

perf

Linuxならなんだかんだこれが楽ですよね。

( ՞ਊ ՞) :~/dev/crystal $ perf record ./arr_push
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.053 MB perf.data (936 samples) ]
( ՞ਊ ՞) :~/dev/crystal $ perf report | head -20
[kernel.kallsyms] with build id 500e786445f3c205298601de61824e582ab15bef not found, continuing without symbols
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 17
#
# Samples: 936  of event 'cycles:ppp'
# Event count (approx.): 559710139
#
# Overhead  Command   Shared Object       Symbol
# ........  ........  ..................  ....................................................................
#
    16.59%  arr_push  arr_push            [.] GC_mark_from
    15.30%  arr_push  arr_push            [.] *Array(Int32)@Array(T)#push<Int32>:Array(Int32)
    12.96%  arr_push  arr_push            [.] *Array(Int32)@Array(T)#check_needs_resize:(Pointer(Int32) | Nil)
    12.06%  arr_push  arr_push            [.] __crystal_main
     7.65%  arr_push  arr_push            [.] *Pointer(Int32)@Pointer(T)#[]=<Int32, Int32>:Int32
     7.10%  arr_push  arr_push            [.] *Array(Int32)@Array(T)#<<<Int32>:Array(Int32)
     6.67%  arr_push  arr_push            [.] *Pointer(Int32)@Pointer(T)#+<Int32>:Pointer(Int32)
     4.70%  arr_push  [unknown]           [k] 0x00007ff8b3229f0a
     2.75%  arr_push  arr_push            [.] *Int32@Int#succ:Int32

GC_mark_from が一番時間かかっていることがわかる。

releaseビルドも同様。

( ՞ਊ ՞) :~/dev/crystal $ perf report | head -20
[kernel.kallsyms] with build id 500e786445f3c205298601de61824e582ab15bef not found, continuing without symbols
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 8
#
# Samples: 504  of event 'cycles:ppp'
# Event count (approx.): 279905518
#
# Overhead  Command   Shared Object      Symbol
# ........  ........  .................  ...................................................
#
    32.09%  arr_push  arr_push           [.] GC_mark_from
    23.87%  arr_push  arr_push           [.] *Array(Int32)@Array(T)#push<Int32>:Array(Int32)
     9.43%  arr_push  [unknown]          [k] 0x00007f8aa3ab5f0a
     7.99%  arr_push  arr_push           [.] main
     4.57%  arr_push  libc-2.24.so       [.] __memmove_avx_unaligned_erms
     3.10%  arr_push  [kernel.kallsyms]  [k] 0xffffffff9ea3c907
     1.44%  arr_push  [kernel.kallsyms]  [.] 0xffffffff9ee9fa57
     0.81%  arr_push  libc-2.24.so       [.] __memset_avx2_erms
     0.61%  arr_push  [kernel.kallsyms]  [k] 0xffffffff9e7e0114

check_needs_resize とかなくなってるけど、inline展開されたんだろうか。

( ՞ਊ ՞) :~/dev/crystal $ readelf -s ./arr_push| grep check_needs_resize
   419: 0000000000021400   197 FUNC    GLOBAL DEFAULT   15 *Array(Pointer(Void))@Array(T)#check_needs_resize:(Pointer(Pointer(Void)) | Nil)
   683: 000000000002c2f0   197 FUNC    GLOBAL DEFAULT   15 *Array(String)@Array(T)#check_needs_resize:(Pointer(String) | Nil)
   847: 0000000000032bf0   197 FUNC    GLOBAL DEFAULT   15 *Array(Proc(Int32, Nil))@Array(T)#check_needs_resize:(Pointer(Proc(Int32, Nil)) | Nil)
  1972: 0000000000034580   197 FUNC    GLOBAL DEFAULT   15 *Array(Int32)@Array(T)#check_needs_resize:(Pointer(Int32) | Nil)

いや、関数本体は残るのだっけ。objdump -drS でみてみよう。

0000000000021310 <*Array(Pointer(Void))@Array(T)#push<Pointer(Void)>:Array(Pointer(Void))>:
  #
  # a = ["a", "b"] of (Int32 | String)
  # a.push("c") # => ["a", "b", "c"]
  # a.push(1)   # => ["a", "b", "c", 1]
  # ```
  def push(value : T)
   21310:       41 57                   push   %r15
   21312:       41 56                   push   %r14
   21314:       41 54                   push   %r12
   21316:       53                      push   %rbx
   21317:       50                      push   %rax
   21318:       49 89 f6                mov    %rsi,%r14
   2131b:       48 89 fb                mov    %rdi,%rbx
    zip?(other) { |x, y| pairs << {x, y} }
    pairs
  end

  private def check_needs_resize
    double_capacity if @size == @capacity
   2131e:       8b 43 04                mov    0x4(%rbx),%eax
   21321:       3b 43 08                cmp    0x8(%rbx),%eax
   21324:       75 2d                   jne    21353 <*Array(Pointer(Void))@Array(T)#push<Pointer(Void)>:Array(Pointer(Void))+0x43>
  end

  private def double_capacity
    resize_to_capacity(@capacity == 0 ? 3 : (@capacity * 2))
   21326:       8d 0c 00                lea    (%rax,%rax,1),%ecx
   21329:       85 c0                   test   %eax,%eax
   2132b:       41 bf 03 00 00 00       mov    $0x3,%r15d
   21331:       44 0f 45 f9             cmovne %ecx,%r15d
  end

おお、なんかinline化されてる雰囲気だ。LLVMさまさままだな。

  • 追記

valgrind(callgrind) を失念していた。。

Rustのcargo profilerとかの裏側はこれ。

( ՞ਊ ՞) :~/dev/crystal $ valgrind --tool=callgrind ./arr_push 
==30942== Callgrind, a call-graph generating cache profiler
==30942== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al.
==30942== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info
==30942== Command: ./arr_push
==30942== 
==30942== For interactive control, run 'callgrind_control -h'.
==30942== brk segment overflow in thread #1: can't grow to 0x4bea000
==30942== (see section Limitations in user manual)
==30942== 
==30942== Events    : Ir
==30942== Collected : 1026463326
==30942== 
==30942== I   refs:      1,026,463,326

出力された*.outみたいなファイルをkcachegrindに食わせるとGUIないい感じのインターフェースが出てくる。

$ kcachegrind callgrind.out.30942