
A little additional background: iOS has used 16KB pages since the 64-bit transition, and ARM Macs have inherited that design.


A more relevant bit of background is that 4KB pages lead to quite a lot of overhead due to the sheer number of mappings that need to be configured and cached. Using larger pages reduces overhead, in particular TLB misses, as fewer entries are needed to describe the same memory range.

While x86 chips mainly support 4K, 2M and 1G pages, ARM chips tend to also support the more practical 16K page size - a nice balance between performance and the memory wasted by coarser allocation granularity.
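
To make the scale concrete, here's a tiny back-of-the-envelope sketch in C (the 1 GiB region is just an illustration):

    #include <stdio.h>

    int main(void) {
        const unsigned long long region = 1ULL << 30;  /* 1 GiB mapping */
        const unsigned long long page_sizes[] = { 4096, 16384, 2ULL << 20 };
        for (int i = 0; i < 3; i++) {
            /* Each page needs its own PTE, and its own TLB entry when hot,
               so fewer, larger pages mean fewer entries for the same range. */
            printf("%8llu-byte pages: %6llu entries to map 1 GiB\n",
                   page_sizes[i], region / page_sizes[i]);
        }
        return 0;
    }

(262144 entries at 4K vs 65536 at 16K vs 512 at 2M.)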

Nothing in particular to do with Apple and iOS.


Makes me wonder how much performance Windows is leaving on the table with its primitive support for large pages. It does support them, but it doesn't coalesce pages transparently like Linux does, and explicitly allocating them requires special permissions and is very likely to fail due to fragmentation if the system has been running for a while. In practice it's scarcely used outside of server software which immediately grabs a big chunk of large pages at boot and holds onto them forever.
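
For illustration, a minimal sketch of what that explicit allocation looks like, assuming SeLockMemoryPrivilege has already been granted and enabled for the process; without it, or with fragmented physical memory, the call fails as described:

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        SIZE_T large = GetLargePageMinimum();   /* typically 2 MB on x64 */
        if (large == 0) {
            fprintf(stderr, "large pages not supported\n");
            return 1;
        }
        void *p = VirtualAlloc(NULL, large,
                               MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                               PAGE_READWRITE);
        if (p == NULL) {
            /* commonly ERROR_PRIVILEGE_NOT_HELD or ERROR_NO_SYSTEM_RESOURCES */
            fprintf(stderr, "VirtualAlloc failed: %lu\n", GetLastError());
            return 1;
        }
        printf("got a %zu-byte large page at %p\n", large, p);
        VirtualFree(p, 0, MEM_RELEASE);
        return 0;
    }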


A lot of low level stuff is a lot slower on Windows, let alone the GUI. There are also entire blogs cataloging an abundance of pathological performance issues.

The one I notice the most is the filesystem. Running Linux in VirtualBox, I got 7x the host speed for many small file operations. (On top of that Explorer itself has its own random lag.)

I think a better question is how much performance are they leaving on the table by bloating the OS so much. Like they could have just not touched Explorer for 20 years and it would be 10x snappier now.

I think the number is closer to 100x actually. Explorer on XP opens (fully rendered) after a single video frame... also while running virtualized inside Win10.

Meanwhile Win10 Explorer opens after a noticeable delay, and then spends the next several hundred milliseconds painting the UI elements one by one...


> The one I notice the most is the filesystem.

This is due to the extensible file system filter model in place; I'm not aware of another OS that implements this feature. It is primarily used for antivirus, but can be used by any developer for any purpose.

It applies to all file systems on Windows.

DevDrive[0] is Microsoft's current solution to this.

> Meanwhile Win10 Explorer opens after a noticeable delay

This could be, again, largely due to 3rd party hooks (or 1st party software that doesn't ship with Windows) into Explorer.

[0] https://devblogs.microsoft.com/visualstudio/devdrive/


> I'm not aware of another OS that implements this feature

I'm not sure this is exactly what you mean, but Linux has inotify and all sorts of BPF hooks for filtering various syscalls, for example file operations.
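
For example, a minimal inotify sketch (the watched path is just illustrative; note that inotify only observes events, it can't veto them the way a filter driver can):

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/inotify.h>

    int main(void) {
        int fd = inotify_init1(IN_CLOEXEC);
        if (fd < 0) { perror("inotify_init1"); return 1; }

        /* watch a directory for opens, writes and file creation */
        if (inotify_add_watch(fd, "/tmp", IN_OPEN | IN_MODIFY | IN_CREATE) < 0) {
            perror("inotify_add_watch"); return 1;
        }

        char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
        ssize_t n = read(fd, buf, sizeof buf);   /* blocks until events arrive */
        for (char *p = buf; p < buf + n; ) {
            struct inotify_event *ev = (struct inotify_event *)p;
            printf("mask=0x%x name=%s\n", ev->mask, ev->len ? ev->name : "");
            p += sizeof *ev + ev->len;
        }
        close(fd);
        return 0;
    }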


FS filters are basically custom kernel modules that can and will do anything they want on any filesystem access. (There are also network filters, which is how things like WinPcap get implemented.)

So yes, you could implement something similar in Linux, but there's not, last I looked, a prebuilt toolkit and infrastructure for them, just the generic interfaces you can use to hook anything.

(Compare the difference between writing a BPF module to hook all FS operations, and the limitations of eBPF, to having an InterceptFSCalls struct that you define in your custom kernel module to run your own arbitrary code on every access.)


I'm glad you mentioned that. I noticed when running a "Hello world" C program on Windows 10 that Windows performs over 100 reads of the Registry before running the program. Same thing when I right click a file...

A few of those are 3rd party, but most are not.


Remember that Win32 process creation is expensive[0]. And on NT, processes don't run, threads do.

The strategy of applications like olde-tymey Apache, using multiple processes to handle incoming connections, is fine on UN*X but terrible on Windows.

[0] https://fourcore.io/blogs/how-a-windows-process-is-created-p...


> The one I notice the most is the filesystem

I’m not sure it’s the file system per se; I believe the main reason is the security model.

The NT kernel has rather sophisticated security. Securable objects have security descriptors with many access control entries and auditing rules, which are inherited over the file system and other hierarchies according to some simple rules, e.g. allow+deny=deny. Trustees are members of multiple security groups, and security groups can include other security groups, so it’s not just a list, it’s a graph.

This makes access checks in NT relatively expensive. The kernel needs to perform an access check every time a process creates or opens a file, which is why the CreateFile API function is relatively slow.
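
A rough micro-benchmark sketch of that per-open cost (the path and iteration count are just illustrative; filter drivers and ACL depth will swing the number a lot):

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        LARGE_INTEGER freq, t0, t1;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&t0);
        for (int i = 0; i < 10000; i++) {
            /* each iteration pays the full open path: name lookup, filter
               drivers, and the security descriptor access check */
            HANDLE h = CreateFileW(L"C:\\Windows\\win.ini", GENERIC_READ,
                                   FILE_SHARE_READ, NULL, OPEN_EXISTING,
                                   FILE_ATTRIBUTE_NORMAL, NULL);
            if (h != INVALID_HANDLE_VALUE) CloseHandle(h);
        }
        QueryPerformanceCounter(&t1);
        printf("%.1f us per open/close\n",
               (t1.QuadPart - t0.QuadPart) * 1e6 / freq.QuadPart / 10000.0);
        return 0;
    }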


I've been trying to use auditing rules for a use case that seems completely in scope and obvious to prioritize from a security point of view (tracing access to EFS files and/or the keys allowing the access), and my conclusion was that you basically can't: the docs are garbage, the implementation is probably ad hoc with lots of holes, and MS probably hasn't prioritised the maintenance of this feature for decades (too busy adding ads to the Start menu, I guess).

The NT security descriptors are also so complex that they are probably a little useless in practice, because they're too hard to use correctly. On top of that, the associated Win32 API is also too hard to use correctly, to the point that I found an important bug in the usage model described in MSDN, meaning that the doc writer did not know how the function actually works (in tons of cases you probably won't hit it, but if you start digging into all the internal and external users, who knows what you could find...).

NT was full of good ideas but the execution is often quite poor.


From an NTFS auditing perspective, there’s no difference between auditing an EFS file and a non-EFS file. Having done it many times, I know that file auditing works just fine - what makes you say it doesn’t?


> The one I notice the most is the filesystem. Running Linux in VirtualBox, I got 7x the host speed for many small file operations. (On top of that Explorer itself has its own random lag.)

That’s a very old problem. In the early days of Subversion, the metadata for every directory lived in that directory. The rationale was that you could check out just a directory in svn. It was disastrously slow on Windows and the Subversion maintainers had no answer for it, except insulting ones like “turn off virus scanning”. Telling a Windows user to turn off virus scanning is equivalent to telling someone to play freeze tag in traffic. You might as well just tell them, “go fuck yourself with a rusty chainsaw”.

Someone reorganized the data so it all lived at the root of the checkout, and the CLI just searched upward until it found the single metadata file. If memory serves, that made large checkouts and updates about 2-3 times faster on Linux and 20x faster on Windows.


None of this has to do with page size.


Death by 1000 cuts


Quite a bit, but 2M is an annoying size and the transparent handling is suboptimal. Without userspace cooperating, the kernel might end up having to split the pages at random due to an unfortunate unaligned munmap/madvise from an application not realizing it was being served 2M pages.

Having Intel/AMD add 16-128K page support, or making it common for userspace to explicitly ask for 2M pages for their heap arenas, is likely better than the page merging logic. Less fragile.
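
As a sketch of that second option, an allocator arena can opt in explicitly on Linux instead of relying on transparent merging (the size is illustrative; keeping maps, unmaps and madvise calls 2MB-aligned is what avoids the splits described above):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    #define ARENA_SIZE (64UL << 20)   /* 64 MB, a multiple of 2 MB */

    int main(void) {
        void *arena = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (arena == MAP_FAILED) { perror("mmap"); return 1; }

        /* ask the kernel to back this range with 2 MB pages where it can */
        if (madvise(arena, ARENA_SIZE, MADV_HUGEPAGE) != 0)
            perror("madvise(MADV_HUGEPAGE)");

        printf("arena at %p\n", arena);
        munmap(arena, ARENA_SIZE);
        return 0;
    }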

1G pages are practically useless outside specialized server software as it is very difficult to find 1G contiguous memory to back it on a “normal” system that has been running for a while.


Intel's menu of page sizes is an artifact of its page table structure.

On x86 in 64-bit mode, page table entries are 64 bits each; the lowest level in the hierarchy (L1) is a 4K page containing 512 64-bit PTEs, which in total map 2M of memory, which is not coincidentally the large page size.

The L1 page table pages are themselves found via a PTE in an L2 page table; one L2 page table page maps 512*2M = 1G of virtual address space, which is, again not coincidentally, the huge page size.

Large pages are mapped by a L2 PTE (sometimes called a PDE, "page directory entry") with a particular bit set indicating that the PTE points at the large page rather than a PTE page. The hardware page table walker just stops at that point.

And huge pages are similarly mapped by an L3 PTE with a bit set indicating that the L3 PTE is a huge page.

Shoehorning in an intermediate size would complicate page table updates, walks, or probably both.
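
A small sketch of the walk described above, decomposing a 48-bit x86-64 virtual address into the four 9-bit indices plus the 12-bit offset (the example address is arbitrary):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint64_t va = 0x00007f1234567abcULL;   /* arbitrary example address */
        unsigned off = va         & 0xfff;     /* offset within a 4K page     */
        unsigned l1  = (va >> 12) & 0x1ff;     /* PT index: 512 PTEs cover 2M */
        unsigned l2  = (va >> 21) & 0x1ff;     /* PD index: 512 * 2M = 1G     */
        unsigned l3  = (va >> 30) & 0x1ff;     /* PDPT index */
        unsigned l4  = (va >> 39) & 0x1ff;     /* PML4 index */
        printf("PML4=%u PDPT=%u PD=%u PT=%u offset=0x%x\n", l4, l3, l2, l1, off);
        return 0;
    }

A large (2M) page simply stops the walk at the PD level, so the offset grows to 21 bits; a huge (1G) page stops at the PDPT level.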

Note that an OS can, of its own accord and independent of hardware, maintain allocations at a coarser granularity and sometimes get some savings out of this. For one historic example, the VAX had a tiny 512-byte page size; IIRC, BSD Unix pretended it had a 1K page size and always updated PTEs in pairs.


Hmm? Pretending the page size is larger than it is would not yield the primary performance benefit of reduced TLB misses. Unless I am missing something, that seems more like a hack to save a tiny bit of kernel memory on a constrained system by having two PTEs backed by the same internal page structure.

Unless we can change the size of the smallest page entry on Intel, I doubt there is room to do anything interesting there. If we could do like ARM and just multiply all the page sizes by 4, you would avoid any “shoehorning”.


The smallest page size tends to get entrenched in the rest of the system (for things like linker page sizes, IOMMU interfaces, etc.); growing the smallest page size might not be a viable option in existing systems, and it might be easier to introduce intermediate-size TLB entries, perhaps formed by consolidating adjacent contiguous PTEs.


Would a reasonable compromise be to change the base allocation granularity to 2MB, and transparently sub-allocate those 2MB blocks into 64KB blocks (the current Windows allocation granularity) when normal pages are requested? That feels like it should keep 2MB page fragmentation to a minimum without breaking existing software, but given they haven't done it there's probably some caveat I'm overlooking.



I've lost count of how many blog posts about poor performance ended with the punchline "so then we turned off page coalescing".


IIRC, 64-bit ARM can do 4K, 16K, 64K and 2M pages. But there are some special rules for the last one.

https://documentation-service.arm.com/static/64d5f38f4a92140...


It's a little weirder. At least one translation granule is required, but it is up to the implementation to choose which one(s) it wants. Many older Arm cores only support 4KB and 64KB, but newer ones support all three.

The size of the translation granule determines the size of the block entries at each level. So the 4K granule has super pages of 2MB and 1GB, the 16K granule has 32MB super pages, and the 64K granule has 512MB super pages.
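
A quick sketch of that arithmetic: each granule holds granule/8 eight-byte descriptors, so a block entry one level up maps (granule/8) * granule bytes:

    #include <stdio.h>

    int main(void) {
        unsigned long granules[] = { 4096, 16384, 65536 };
        for (int i = 0; i < 3; i++) {
            unsigned long entries = granules[i] / 8;   /* 8-byte descriptors */
            unsigned long long block = (unsigned long long)entries * granules[i];
            printf("%2luK granule: %4lu entries -> %3lluM block\n",
                   granules[i] >> 10, entries, block >> 20);
        }
        return 0;
    }

(2MB, 32MB and 512MB respectively; with the 4K granule the next level up gives the 1GB block as well.)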


Armv8-A also supports 4K pages: FEAT_TGran4K. So Apple did indeed make a choice to instead use 16K, FEAT_TGran16K. Microsoft uses 4K for AArch64 Windows.


How is this "additional background"? This was a post by Google regarding Android.


That this isn't the only 4K→16K transition in recent history? Some programs that assumed 4K had to be fixed as part of that transition, which can provide insights into the work required for Android.


As an Android dev: the work hours I spend talking with iOS colleagues are about the same as with Android ones. Usually you want to be sort of up to date with the other platform as well.



