Apple's new Processor Trace instrument is incredible (victorwynne.com)

115 points by xdevweeknds 12 hours ago | 38 comments

Veserv 9 hours ago [-]

This is just standard instruction trace.

Intel has supported such capability via Intel Processor Trace (PT) since at least 2014 [1]. Here is a full trace recorder built by Jane Street feeding into standard program trace visualizers [2].

ARM has supported such capability via the standard CoreSight Program Trace Macrocell (PTM)[3]/Embedded Trace Macrocell (ETM)[4] since at least 2000.

If you pair it with standard data trace, which is less commonly available, then you have the prerequisites for a hardware trace time travel debugger as originally seen in the early 2000s [5]

You can get similar performance/function tracing entirely in software via software-instrumented instruction trace, and similar debugging information (though less granular performance information) via record-replay time travel debugger recordings.

[1] https://www.intel.com/content/www/us/en/support/articles/000...

[2] https://blog.janestreet.com/magic-trace/

[3] https://developer.arm.com/documentation/ihi0035/b/Program-Fl...

[4] https://developer.arm.com/documentation/ddi0158/d

[5] https://jakob.engbloms.se/archives/1564

sthomps 7 hours ago [-]

Yes, and Intel Processor Trace (IPT) can be used for more than just performance - we use it for very specific memory protection security (see more: https://info.preludesecurity.com/hubfs/Content/Closing%20the...)

wmf 9 hours ago [-]

Most Linux devs think Valgrind is a good profiler so if Apple can shame them into being only 10 years behind that's pretty good.

joshvm 5 minutes ago [-]

I was going to ask, isn't this basically callgrind? I used that a lot in grad school to optimise our group's code (on an iMac no less). Incredibly useful and there are some nice visualisation / inspection tools.

astrange 5 hours ago [-]

The most useful performance tool on macOS is spindump, which is just a straightforward whole-system CPU sampler. Second most useful is MallocStackLogging.

Other OSes have those too, but they're harder to use and the interfaces aren't as good.

viraptor 3 hours ago [-]

The most useful tool was dtrace... Until Apple killed it with M2.

astrange 1 hours ago [-]

That's overkill for most things since it relies on being able to patch running kernel code, which is also the definition of an RCE exploit.

I think it should work if you run `bputil -c`? Didn't try it though.

delta_p_delta_x 9 hours ago [-]

Linux devs in general seem to be quite content with absolutely atrocious tooling.

sitkack 7 hours ago [-]

Something something grug printf is all you need. People who LARP as Unix neckbeards rejoice in having nothing. Not content, actively seek out worse solutions.

Real Scottish craftspeople enjoy having amazing tooling. They even know how to use a debugger!

jiggawatts 6 hours ago [-]

Why use a proprietary steam hammer when you can bash things with open source rocks?

wmf 5 hours ago [-]

Nah, there are plenty of newer open source tools that people either resist using or don't even know about.

01HNNWZ0MV43FF 5 hours ago [-]

this but unironically

snihalani 7 hours ago [-]

I don't get why tho.

jesse__ 8 hours ago [-]

10/10 comment

MBCook 9 hours ago [-]

> The catch, as usual with new Apple features, is the hardware requirements. This only works on M4 chips and iPhone 16 devices, which means you’re out of luck if you’re still developing on older hardware. It’s frustrating but not surprising. Apple has a habit of using new developer tools to push hardware upgrades.

This seems unfair. Isn’t there a pretty good likelihood that the necessary performance counters in the CPU (or whatever) simply don’t exist in the production versions of the previous CPUs?

seliopou 8 hours ago [-]

Yes, there's no way that capturing all the information needed to reconstruct a useful trace would be possible without built-in hardware support.

bobmcnamara 9 hours ago [-]

Hardware-wise, this seems quite similar to many existing tracing systems from other CPU cores.

I know Arm and Xtensa have offered on-board trace buffers for ages so operating systems could record themselves.

What's neat here is that Apple has bundled this nicely into a polished developer tool rather than one more discrete tool.

jauntywundrkind 10 hours ago [-]

Longer term I sort of dream of doing computing from the inside out, using all this tracing data we've started gathering not just for observability but as a log and engine of compute: the record of what computing has been done as an event-source, for an event sourcing computing architecture.

ip26 9 hours ago [-]

The present opportunity, in my view, is to feed this tracing into the development of superior compilers. This is starting to happen with automated profiling by the compiler, but you can imagine the profiling expanding to an enormous degree, with the compiler tracing the program it is building in great detail.

layer8 8 hours ago [-]

The compiler often doesn't run on the same CPU model the program will later run on, so that will only be feasible/useful in limited circumstances.

imoverclocked 8 hours ago [-]

This is true partially because of the current landscape. With enough pressure (if the optimization is good enough), things might shift to accommodate.

alephnerd 6 hours ago [-]

The security industry is trending this way. Observability has been the name of the game for a couple years now, and a lot of really cool grassroots startups have taken off in the runtime observability space. Think XDR+SIEM+SOAR but unified and way less bloated.

stochastician 9 hours ago [-]

Is there anything like this for more commodity Arm cores (Neoverse V2), or do we think the insights from Apple silicon cores will generalize well to those other Arm architectures?

do_not_redeem 10 hours ago [-]

> Instead of statistical sampling like most profilers, you get a complete picture of your app’s execution flow.

Potentially interesting, but it's not really clear whether this is anything new or not. valgrind + kcachegrind does this too.

https://developer.apple.com/documentation/xcode/analyzing-cp...

These screenshots look a lot like kcachegrind with a slightly reimagined UI. Is there actually anything new here, or is this another case of Apple finally catching up to the open source world?

nkurz 10 hours ago [-]

As 'GeekyBear' implies in a sibling comment, valgrind works with an emulation of an ideal processor rather than directly on the actual CPU. Sometimes this gives you a good idea of how the program will actually run, and sometimes it doesn't. As processors became more complex, it got farther and farther from the truth. Personally, I started in the Valgrind era and stopped using it as soon as better tools using native instrumentation became available. If Apple's approach works as well as described, it is much better than anything from that era.

do_not_redeem 10 hours ago [-]

I've never found cachegrind inaccurate, but maybe I'm not doing hardcore enough performance work. You can also use perf and get your numbers straight from the hardware if that's what you need. Truth be told I mainly use cachegrind because I prefer kcachegrind's UI to hotspot.

(I even prefer cachegrind's approach since the numbers will be less distorted by other random background activity on the machine, but that could just be idealism on my part, who knows.)

If perf or the vendor-specific tools like VTune/uProf aren't sufficient for you, then I'm curious what you use.

nkurz 9 hours ago [-]

I switched from emulator tools like valgrind to tools with hardware support like perf, pmu-tools, and VTune. I generally found them sufficient, but sometimes buggy and difficult to use.

Cachegrind was occasionally inaccurate due to an inaccurate model, but the greater problem was that cache hit percentages only tell a fraction of the story. To be able to predict performance I often needed to be able to accurately measure things like the number of memory requests in flight.

Searching now for an example, I hit on a comment I made here a few years ago where this new tool probably would have been helpful: https://news.ycombinator.com/item?id=18442131

In general I have much greater faith in the on chip performance registers. That said, other than glancing at news stories like this I haven't been keeping up with recent advances. I guess it's possible that cachegrind and friends have improved since I was using them.

do_not_redeem 9 hours ago [-]

I've always reached for llvm-mca when I need stuff like that, but again it's all predicted/ideal numbers, not live off the hardware. And you need to start off with another tool first to pinpoint where to look.

I've never come across pmu-tools, thanks for the tip. I'll try it out next time I'm in the trenches.

GeekyBear 10 hours ago [-]

> Potentially interesting, but it's not really clear whether this is anything new or not. valgrind + kcachegrind does this too.

Looking at the kcachegrind homepage, it doesn't sound like they are pulling their data directly from the CPU core itself:

> Callgrind uses runtime instrumentation via the Valgrind framework for its cache simulation and call-graph generation.

https://kcachegrind.github.io/html/Home.html

Apple seems to have modified its core design so that it will stream data to a log file while the code is running.

> Recent Apple silicon devices can capture a processor trace where the CPU stores information about the code it runs, including the branches it takes and the instructions it jumps to. The CPU streams this information to an area on the file system so that you can analyze it with the Processor Trace instrument.

jauntywundrkind 10 hours ago [-]

Intel has a Performance Monitoring Unit on its core that has significant overlap.

I'm forgetting the details of this tool space, but at least some of these tools can make use of that hardware:

https://github.com/intel/pcm https://github.com/andikleen/pmu-tools

kaladin-jasnah 9 hours ago [-]

IIRC perf does it as well.

do_not_redeem 10 hours ago [-]

If you need data straight from the hardware you can use e.g. perf+hotspot, although I've heard that perf's tracing (not sampling!) supports fewer CPUs (but still more than just 1)

mananaysiempre 10 hours ago [-]

Finally, a processor manufacturer defects from the obfuscatory equilibrium. Granted, Apple’s processor people are not saints—I’ve yet to see even a full table of throughputs, latencies, and port loads from them, let alone an accurate CPU model—but I welcome anything that might maybe, hopefully, pretty please start a race of giving more accurate data to people doing low-level optimization.

touisteur 10 hours ago [-]

Intel Processor Trace was already pretty great. Built an MC/DC coverage tool with it. Used it for fine profiling, live program monitoring...

bri3d 10 hours ago [-]

What’s your beef with VTune and uProf?

urbandw311er 10 hours ago [-]

I feel like it probably would work on older hardware; this very much smacks of forced obsolescence. Just guessing though.

nozzlegear 10 hours ago [-]

Is forced obsolescence the right term for a somewhat obscure debug tool built for developers of macOS/iOS software? I don't imagine there are many people who would feel forced to upgrade their machines more quickly just to get access to this.

astrange 10 hours ago [-]

It would not. You could port cachegrind I suppose.

(Even if hardware support did exist earlier, you don't want to deal with errata for a new hardware feature. It's kind of amazing anything ever works.)
