CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2016 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper
Version Date: 20150504

International Standard Book Number-13: 978-1-4987-1607-9 (Pack - Book and Ebook)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

WebGL insights / editor, Patrick Cozzi.
    pages cm
Includes bibliographical references.
ISBN 978-1-4987-1607-9 (alk. paper)
1. Computer graphics--Computer programs. 2. WebGL (Computer program language) I. Cozzi, Patrick.
T386.W43W43 2016
006.6'633--dc23
2015011022

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

For Liz, Petey, and the Captain

Contents

Foreword
Preface
Acknowledgments
Website
Tips

Section I: WebGL Implementations
1. ANGLE: A Desktop Foundation for WebGL (Nicolas Capens and Shannon Woods)
2. Mozilla's Implementation of WebGL (Benoit Jacob, Jeff Gilbert, and Vladimir Vukicevic)
3. Continuous Testing of Chrome's WebGL Implementation (Kenneth Russell, Zhenyao Mo, and Brandon Jones)

Section II: Moving to WebGL
4. Getting Serious with JavaScript (Matthew Amato and Kevin Ring)
5. Emscripten and WebGL (Nick Desaulniers)
6. Data Visualization with WebGL: From Python to JavaScript (Cyrille Rossant and Almar Klein)
7. Teaching an Introductory Computer Graphics Course with WebGL (Edward Angel and Dave Shreiner)

Section III: Mobile
8. Bug-Free and Fast Mobile WebGL (Olli Etuaho)

Section IV: Engine Design
9. WebGL Engine Design in Babylon.js (David Catuhe)
10. Rendering Optimizations in the Turbulenz Engine (David Galeano)
11. Performance and Rendering Algorithms in Blend4Web (Alexander Kovelenov, Evgeny Rodygin, and Ivan Lyubovnikov)
12. Sketchfab Material Pipeline: From File Variations to Shader Generation (Cedric Pinson and Paul Cheyrou-Lagrèze)
13. glslify: A Module System for GLSL (Hugh Kennedy, Mikola Lysenko, Matt DesLauriers, and Chris Dickinson)
14. Budgeting Frame Time (Philip Rideout)

Section V: Rendering
15. Deferred Shading in Luma (Nicholas Brancaccio)
16. HDR Image-Based Lighting on the Web (Jeff Russell)
17. Real-Time Volumetric Lighting for WebGL (Muhammad Mobeen Movania and Feng Lin)
18. Terrain Geometry—LOD Adapting Concentric Rings (Florian Bösch)

Section VI: Visualization
19. Data Visualization Techniques with WebGL (Nicolas Belmonte)
20. hare3d—Rendering Large Models in the Browser (Christian Stein, Max Limper, Maik Thöner, and Johannes Behr)
21. The BrainBrowser Surface Viewer: WebGL-Based Neurological Data Visualization (Tarek Sherif)

Section VII: Interaction
22. Usability of WebGL Applications (Jacek Jankowski)
23. Designing Cameras for WebGL Applications (Diego Cantor-Rivera and Kamyar Abhari)

About the Contributors

Foreword

Since the first release of the WebGL specification in February 2011, a remarkable and passionate community has developed around it. Enthusiasts from varied domains, from the demoscene to medical research scientists and everyone in between, have created beautiful visual effects, artifacts, applications, and libraries. Perhaps most remarkably, these individuals and groups have shared their work for all to see and use on the World Wide Web. The web's culture of sharing dates back to its origins and the "view source" option in the earliest web browsers. Web page creators were encouraged to see how others achieved visual techniques, to copy them, and to help improve them. Since that time, the web has grown organically and exponentially. As web developers noticed missing functionality, new features were proposed in the HTML specification to bridge these gaps, and browsers evolved rapidly to incorporate them. A tremendous number of open-source libraries have sprung up that make it easier to create visually compelling web pages, user interfaces, and applications that scale well from a smartphone screen all the way up to oversized desktop displays. It's been exciting to witness WebGL's adoption into the web's sharing culture.
When I first began studying computer graphics, the web was in its infancy; leading-edge techniques were published in printed journals a few times a year, and source code was almost never released by the authors. Today, computer graphics researchers from around the world publish articles every day on their own websites and blogs, and include not only source code, but also live examples written using WebGL! This book and the accompanying website represent the epitome of the web's sharing culture: a curated collection of leading-edge techniques and insights, with online source code and live demonstrations. In this book, engine authors present their strategies for achieving high performance, good scalability, and leading-edge visuals. Educators share their experience in moving real-time computer graphics courses to the web and WebGL. Application developers and toolchain authors share their experiences both building large, new JavaScript code bases and bringing existing C++ code bases to the web. Graphics researchers show how to implement leading-edge rendering techniques in WebGL, allowing these techniques to be deployed seamlessly to hundreds of millions of devices and billions of people. Visualization researchers demonstrate how to render huge data sets with high performance and high impact. Interaction researchers provide insights into effective navigation and interaction paradigms for 3D applications. Finally, browser and GPU implementers give a look under the hood of WebGL implementations to help developers tune their code for best performance on a range of devices. This book contains a wealth of information and will be a treasured reference for years to come. I thank Patrick Cozzi for initiating and driving this project to completion, and for the opportunity to have been involved with it.

Ken Russell
Khronos WebGL Working Group Chair
Software Engineer, Google Chrome GPU Team

Preface

WebGL Insights is a celebration of our community's accomplishments and a snapshot of the state of the art in WebGL. There is WebGL support in every modern browser across every major platform, including Android and iOS. Given its ubiquity, plugin-free deployment, ease of development, tool support, and improved JavaScript performance, WebGL adoption from major developers and startups alike is on the rise. WebGL is used by Unity, Epic, Autodesk, Google, Esri, Twitter, Sony, The New York Times, and countless others. Likewise, many startups base their core technology on WebGL, such as Floored and Sketchfab, who share their experiences in this book. In addition, WebGL is playing a central role in future web technologies such as WebVR. In academia and education, WebGL is gaining momentum due to its low barrier to entry. The introductory graphics course at SIGGRAPH moved from OpenGL to WebGL, as did the popular introductory book, Interactive Computer Graphics: A Top-Down Approach. The Interactive 3D Graphics Udacity course, which teaches graphics using WebGL/three.js, has had more than 60,000 students sign up. In my own teaching at the University of Pennsylvania, I've moved to WebGL and often talk with other educators who are doing the same. The demand for skilled WebGL developers is high. As an educator, I receive more requests for WebGL developers than we can fill. As a practitioner, I find that our team of skilled WebGL developers gives us a unique advantage to develop quickly and implement solutions that are efficient and robust.
The advancement of the WebGL community, the demand for highly skilled WebGL developers, and our appetite for continuous learning have led to this book. Our community has the need to go beyond the basics of introductory WebGL books and to learn advanced techniques from developers with practical experience. In the spirit of OpenGL Insights, developers from the WebGL community have contributed chapters sharing their unique expertise based on their real-world experiences. This includes hardware vendors sharing performance and robustness advice for mobile; browser developers providing deep insight into WebGL implementations and testing; WebGL-engine developers presenting design and performance techniques for several of the most popular WebGL engines; application developers explaining how WebGL is reaching beyond games into areas such as neurological data visualization; visualization researchers presenting massive model rendering; and educators providing advice on migrating graphics courses to WebGL. Throughout the chapters we see many common themes, including:

• The move of desktop applications to the web either by writing new clients in JavaScript or by converting C/C++ to JavaScript using Emscripten. See Section II and Chapters 16, 17, and 21.
• Runtime or preprocessed shader pipelines to generate GLSL from higher level material descriptions and shader libraries. See Chapters 9, 11, 12, and 13.
• The use of Web Workers for compute or load pipelines to offload the main thread from CPU-intense work. See Chapters 4, 11, 14, 19, and 21.
• Algorithms that are light on the CPU and offload massively parallel work to the GPU. See Chapters 18 and 19.
• Understanding the CPU and GPU overhead of different WebGL API functions, and strategies for minimizing their cost without incurring too much CPU overhead in our application. See Chapters 1, 2, 8, 9, 10, 14, and 20.
• Incrementally streaming 3D scenes, often massive scenes, to quickly provide the user with a coarse representation followed by higher detail. See Chapters 6, 12, 20, and 21.
• Organization of large JavaScript code bases and GLSL shaders into modules. See Chapters 4 and 13.
• A focus on testing both WebGL implementations and WebGL engines built on top of them. See Chapters 3, 4, and 15.
• Open source. The web has a culture of openness, so it should come as no surprise that most of the WebGL implementations, tools, engines, and applications described in this book are open source.

I hope that WebGL Insights inspires you, teaches you, gives you new insights into your own work, and helps bring you to the next level in your adventures.

Patrick Cozzi

Acknowledgments

First, I would like to thank Christophe Riccio (Unity) who edited OpenGL Insights with me. The community focus of WebGL Insights is the direct result of the culture Christophe created with OpenGL Insights. Christophe set the bar very high for OpenGL Insights, and I believe we have continued the tradition in WebGL Insights. I also thank Christophe for his support of the WebGL Insights book proposal and chapter reviews. At SIGGRAPH 2014, I suggested the idea for WebGL Insights to Ed Angel (University of New Mexico), Eric Haines (Autodesk), Neil Trevett (NVIDIA), Ken Russell (Google), and several others. It was their support that got the project off the ground. I thank them for supporting the initial book proposal. I also thank Ken for writing the foreword. WebGL Insights is the story of 42 contributors sharing their experiences with WebGL and related technologies.
They made this book what it is and make the WebGL community the lively community that it is. I thank them for their dedication to and enthusiasm for this book. The quality of WebGL Insights is due to the work of the contributors and the 25 technical reviewers. Each chapter had at least two reviews; most had three to five and a couple of chapters received seven or more. I thank all the reviewers for volunteering their time: Won Chun (RAD Games Tools), Aleksandar Dimitrijevic (University of Niš), Eric Haines (Autodesk), Havi Hoffman (Mozilla), Nop Jiarathanakul (Autodesk), Alaina Jones (Sketchfab), Jukka Jylänki (Mozilla), Cheng-Tso Lin (University of Pennsylvania), Ed Mackey (Analytical Graphics, Inc.), Briely Marum (National ICT Australia), Jonathan McCaffrey (NVIDIA), Chris Mills (Mozilla), Aline Normoyle (University of Pennsylvania), Deron Ohlarik, Christophe Riccio (Unity), Fabrice Robinet (Plumzi, Inc.), Graham Sellers (AMD), Ishaan Singh (Electronic Arts), Traian Stanev (Autodesk), Henri Tuhola (Edumo Oy), Mauricio Vives (Autodesk), Luke Wagner (Mozilla), Corentin Wallez (Google), Alex Wood (Analytical Graphics, Inc.), and Alon Zakai (Mozilla). Eric Haines deserves special thanks. Not only did Eric review several chapters in detail, but he also introduced me to his Autodesk colleagues, Traian Stanev and Mauricio Vives, who provided exceptional reviews. In addition to external technical reviews, we had a great culture of contributor peer review. I especially thank Olli Etuaho (NVIDIA), Muhammad Mobeen Movania (DHA Suffa University), Tarek Sherif (BioDigital), and Jeff Russell (Marmoset) for going above and beyond. I thank Rick Adams, Judith Simon, Kari Budyk, and Sherry Thomas from CRC Press for all their work publishing WebGL Insights. Rick was very supportive of the project from the start and helped bring WebGL Insights from idea to proposal to contract in record time. I thank Norm Badler from the University of Pennsylvania, whose encouragement sparked my initial involvement in producing books in 2009. Editing WebGL Insights on top of my developer and teaching positions made evenings, weekends, and even holidays with friends and family few and far between. For their patience, understanding, and occasional copyediting, I thank Peg Cozzi, Margie Cozzi, Anthony Cozzi, Colleen Curran Cozzi, Cecilia Cozzi, Audrey Cozzi, Liz Dailey, Petey, and the Captain.

Website

The companion WebGL Insights website contains source and other supplements. It is also the place to find announcements about future volumes: www.webglinsights.com

Please e-mail me with your comments or corrections: pjcozzi@siggraph.org

Tips

WebGL Report (http://webglreport.com/) is a great way to get WebGL implementation details for the current browser, especially when debugging browser- or device-specific problems.

For performance, avoid object allocation in the render loop. Reuse objects and arrays where possible, and avoid built-in array methods such as map and filter. Each new object creates more work for the garbage collector, and in some cases, GC pauses can freeze an application for multiple frames every few seconds.

Save memory and improve performance by ensuring that contexts are created with the alpha, depth, stencil, antialias, and preserveDrawingBuffer options set to false, unless otherwise needed. Note that alpha, depth, and antialias are enabled by default and must be explicitly disabled.

For performance, query attribute and uniform locations only at initialization.
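The last two tips can be combined into a short startup sketch (a non-authoritative example; the u_modelView and a_position names are illustrative):

    // Request only the buffers the application actually needs.
    var gl = canvas.getContext('webgl', {
        alpha: false,        // on by default; disable if unused
        depth: false,        // on by default; disable if unused
        stencil: false,
        antialias: false,    // on by default; disable if unused
        preserveDrawingBuffer: false
    });

    // Query locations once at initialization, never in the render loop.
    var uModelView = gl.getUniformLocation(program, 'u_modelView');
    var aPosition = gl.getAttribLocation(program, 'a_position');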
The default precision qualifiers for int aren't the same in vertex and fragment shaders. This can lead to surprising visual differences when moving computation from one to the other.

For portability, keep space requirements of varyings and uniforms within the limits of the GLSL ES spec. Consider using vec4 variables instead of float arrays, as they potentially allow for tighter packing. See A.7 in the GLSL ES spec.

Non-power-of-two textures require linear or nearest filtering and clamp-to-edge wrapping. Mipmap filtering and repeat wrapping are not supported.

When we are using more than one draw buffer with the WEBGL_draw_buffers extension and we don't want to write to a given draw buffer, pass gl.NONE to the draw buffers parameter list. We must always provide all color attachments that our framebuffer has.

Always enable strict mode via the "use strict" directive. It slightly alters JavaScript semantics so that many silent errors turn into runtime exceptions and can even help the browser better optimize our code.

Code linters, such as JSHint (http://jshint.com/), are an invaluable tool for keeping JavaScript code clean and error free.

Create new textures, rather than changing the dimensions or format of old ones. (Chapter 1)

Avoid use of gl.TRIANGLE_FAN, as it may be emulated in software. (Chapter 1)

Flag buffer usage as gl.STATIC_DRAW where appropriate, to allow browsers and drivers to make use of optimizations for static data. (Chapter 1)

Make sure that one of the array attributes is bound (using gl.bindAttribLocation) to location 0. Otherwise, high overhead should be expected when running on non-ES OpenGL platforms such as Mac OS X and desktop Linux. (Chapter 2)

Pass data back and forth from Web Workers using transferable objects whenever possible. (Chapters 4 and 21)

Although typed arrays have performance advantages, using JS arrays in teaching allows students to write clearer JS code with the use of array methods. (Chapter 7)

Using mediump precision in fragment shaders provides the widest device compatibility, but risks corrupted rendering if the shaders are not properly tested. (Chapter 8)

Using only highp precision prevents corrupted rendering at the cost of losing some efficiency and device compatibility. Prefer highp, especially in vertex shaders. (Chapter 8)

To test device compatibility of shaders that use mediump or lowp precision, it is possible to use software emulation of lower precision. Use the --emulate-shader-precision flag in Chrome. (Chapter 8)

When using an RGB framebuffer, always implement a fallback to RGBA for when RGB is not supported. Use gl.checkFramebufferStatus. (Chapter 8)

To save a lot of API calls, use vertex array objects (VAOs) or interleave static vertex data. (Chapter 8)

For performance, do not update a uniform each frame; instead update it only when it changes. (Chapters 8, 10, and 17)

If shrinking the browser window results in massive speed gains, consider using a half-resolution framebuffer during mouse interaction. (Chapter 14)

Load time can be improved by amortizing slow tasks across several frames. (Chapter 14)

Be vigilant about using requestAnimationFrame—ensure that most, if not all, of your WebGL work lives inside it. (Chapter 14)

The textureProj GLSL function, vec4 color = textureProj(sampler, uv.xyw);, can be simulated with vec4 color = texture(sampler, uv.xy/uv.w);. (Chapter 17)

Avoid using common text-based 3D data formats, such as Wavefront OBJ or COLLADA, for asset delivery. Instead, use formats optimized for the web, such as glTF or SRC. (Chapter 20)
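As a sketch of the WEBGL_draw_buffers tip above (assuming a framebuffer with three color attachments; this is illustrative, not a complete setup):

    var ext = gl.getExtension('WEBGL_draw_buffers');
    if (ext) {
        // Write to attachments 0 and 2, but skip attachment 1. Every color
        // attachment the framebuffer has must still appear in the list.
        ext.drawBuffersWEBGL([
            ext.COLOR_ATTACHMENT0_WEBGL,
            gl.NONE,
            ext.COLOR_ATTACHMENT2_WEBGL
        ]);
    }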
Use OES_element_index_uint to draw large indexed models with a single draw call. (Chapter 21)

Smooth, cinematic camera transitions can be created by a cosine-based interpolation of the camera's position and orientation. Unlike nonlinear interpolation alternatives, the cosine interpolation is computationally cheap and easy to calculate. (Chapter 23)

Section I: WebGL Implementations

Knowing what goes on under the hood makes us more effective WebGL developers. It helps us understand which API calls are fast, which are slow, and why. This is particularly important for engine developers, whose users rely on the engine to make efficient use of WebGL. The stack under WebGL is involved. It can include API validation, driver workarounds, shader validation and translation, compositing, and interprocess communication, and it eventually calls the native graphics API, which itself has a stack of OS and GPU vendor code. In this section, developers from Google and Mozilla provide deep insight into what happens between our WebGL calls and the native graphics API. ANGLE is perhaps best known as an OpenGL ES 2.0 wrapper over Direct3D 9 used to implement WebGL in Chrome, Firefox, and Opera on Windows. However, it is much more. ANGLE now also has a Direct3D 11 backend and is not just used on Windows; its shader validator is used on Linux by Chrome and Firefox, and on OS X by Chrome, Firefox, and Safari. In Chapter 1, "ANGLE: A Desktop Foundation for WebGL," Nicolas Capens and Shannon Woods go under the hood of ANGLE and provide performance tips, such as why we should avoid TRIANGLE_FAN and wide lines, and debugging tips like how to step through ANGLE source code in Chrome. In Chapter 2, "Mozilla's Implementation of WebGL," Benoit Jacob, Jeff Gilbert, and Vladimir Vukicevic dive into the details of Mozilla's WebGL implementation by explaining what happens between when a WebGL function is called in JavaScript and when Firefox calls the native graphics API. This includes datatype conversion, error checking, state tracking, texture conversion, draw call validation, shader source transformation, and compositing. Knowledge of these details helps us optimize our WebGL code; for example, by knowing how data copying, data conversion, and error checking work in texImage2D and texSubImage2D, we can call these functions in such a way as to get their fastest path. WebGL support is consistent across an incredible combination of operating systems, browsers, GPUs, and drivers. As an engine developer, I can say with certainty that our WebGL engine has far fewer workarounds than our OpenGL engine (with that said, the state of modern OpenGL drivers is now also very good). The consistency and stability of WebGL is thanks to the testing performed by browser and hardware vendors. In Chapter 3, "Continuous Testing of Chrome's WebGL Implementation," Kenneth Russell, Zhenyao Mo, and Brandon Jones explain the continuous test environment used for Chrome. This includes the hardware and software infrastructure and many "gotchas" encountered in practice when running GPU tests on servers. The lessons learned are applicable for many graphics testing scenarios, not just WebGL implementations.
1 ANGLE: A Desktop Foundation for WebGL

Nicolas Capens and Shannon Woods

1.1 Introduction
1.2 Background
1.3 ANGLE Is Not an Emulator
1.4 ANGLE in WebGL App Development
1.5 Debugging (with) ANGLE
1.6 Additional Resources
Bibliography

1.1 Introduction

WebGL is a powerful tool from a web development perspective, providing a gateway to GPU-accelerated 3D graphics through a JavaScript API. Developers can author their applications once and expect them to run across a wide variety of hardware, both mobile and desktop, with the full assistance of the GPU. It is no simple task to support such a system seamlessly, removing the need for WebGL applications to handle differences in operating system, browser, GPU, or available driver. While applications don't need to handle these differences themselves, being familiar with the ways in which WebGL calls are validated, modified, translated, and finally issued to the hardware provides developers with the tools to create efficient WebGL applications across implementations and platforms. In this chapter, we discuss ANGLE, an open-source project used by several browsers as part of this seamless multi-platform support, and cover some tools and best practices developers can exercise to ensure WebGL applications perform as expected, everywhere. We conclude with helpful tips on examining the translated shader output generated by ANGLE and building and debugging ANGLE itself as a stand-alone library or as part of the Chrome browser.

1.2 Background

Browsers handle the calls made in WebGL by interpreting them and issuing graphics commands to the underlying hardware using a native graphics API. In mobile browsers, these native commands are extremely similar to WebGL, because the vast majority of mobile graphics drivers implement OpenGL ES, on which WebGL is directly based. On the desktop, things become slightly more complicated; OpenGL ES drivers are not available for some desktop operating systems. Under Linux and OS X, the support path is clear, as desktop OpenGL is the native 3D graphics API for those platforms and is widely and robustly supported. Windows, on the other hand, has its own challenges. While desktop OpenGL drivers exist for Windows, most games and other applications which make use of the GPU instead utilize Direct3D, Microsoft's 3D graphics API. Even if a user's machine is known to have WebGL-capable hardware, it cannot be guaranteed that the user has OpenGL drivers installed at all. By contrast, Direct3D drivers are installed with the operating system. Requiring OpenGL drivers, then, would be a barrier for a large number of Windows users, keeping them from easily being able to experience WebGL—a significant downside for an emerging web API. For this reason, Google initiated the ANGLE project. ANGLE began as an implementation of OpenGL ES 2.0, the native 3D API on which WebGL 1.0 is directly based, built on top of Direct3D 9. This implementation is used both by Google Chrome and Mozilla Firefox on Windows as the backend for WebGL. Chrome also uses ANGLE for hardware-accelerated rendering support across the entire browser, from page rendering to hardware-accelerated video. In addition, ANGLE's shader translator, used to translate shaders from the OpenGL ES Shading Language (ESSL) into Direct3D's High Level Shading Language (HLSL), also functions as a shader validator for WebGL and is used in that capacity not just on Windows, but also on Linux by Chrome and Firefox and on OS X by Chrome, Firefox, and Safari [Koch 12].
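Whether ANGLE is backing a given context can often be checked from JavaScript. A hedged sketch using the WEBGL_debug_renderer_info extension (availability and the exact string contents vary by browser):

    var dbgInfo = gl.getExtension('WEBGL_debug_renderer_info');
    if (dbgInfo) {
        // On ANGLE-backed browsers this string typically names ANGLE and the
        // Direct3D backend, e.g. "ANGLE (... Direct3D11 ...)".
        console.log(gl.getParameter(dbgInfo.UNMASKED_RENDERER_WEBGL));
    }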
ANGLE has continued to build upon this initial implementation. In 2013, a Direct3D 11 rendering backend was added to ANGLE, allowing WebGL implementations the ability to make use of a newer native API [ANGLE 13]. Direct3D 11 adds support for many of the texture and vertex formats included in OpenGL ES 2.0 and WebGL 1.0, which in the Direct3D 9 backend must be converted to natively supported formats. Direct3D 11 additionally provides all the necessary features for ANGLE to support OpenGL ES 3.0. ANGLE's rendering backends are selectable at runtime, meaning that a browser or other client application can choose to use the Direct3D 9 or 11 implementation depending on the particular needs of the application and the hardware on which it's running. ANGLE's Direct3D 11 backend targets a minimum feature level of 10_0, so hardware with Direct3D 10 support and above is able to make use of the new backend. However, certain features needed for ES 3.0 parity do not appear until feature level 11_0 or 11_1 and must be supported with software-side workarounds on lower feature level hardware, as discussed in Section 1.4.2. The addition of this runtime-selectable backend opens up new possibilities for ANGLE. By treating the implementation of each renderer as an encapsulated object with its own simple interface, we can enable choosing between not just different versions of the Direct3D API, but also other native graphics APIs. This would allow a client application to target OpenGL ES, while in actuality executing on Direct3D, OpenGL, or any other rendering API for which future ANGLE renderers are written. For Chrome in particular, this is a great benefit, as Chrome currently must perform validation of OpenGL ES calls within its own graphics process before either translating them to desktop OpenGL itself or forwarding them to ANGLE, which performs its own validation. If ANGLE were to handle translation for all platforms and APIs, Chrome would be able to delegate all OpenGL ES validation to ANGLE. The ANGLE team launched an engineering effort to perform additional refactoring necessary to become entirely platform agnostic in mid-2014 [ANGLE 14a] and will be adding a backend targeting desktop OpenGL. This will enable Chrome to use ANGLE as the OpenGL ES implementation on Windows, Mac, Linux, and even mobile, with ANGLE performing any necessary translation to the platform-specific API behind the scenes. For WebGL developers, this means that ANGLE will be playing a role in enabling our applications in a growing number of situations. Figure 1.1 illustrates where ANGLE fits in the overall architecture of a typical WebGL implementation.

[Figure 1.1. System architecture: a WebGL application runs on the JavaScript engine and browser core, which drive ANGLE's OpenGL ES renderer and shader translator; ANGLE in turn targets Direct3D 9, Direct3D 11, or OpenGL.]

1.3 ANGLE Is Not an Emulator

It is important to highlight that ANGLE is not an emulator in the classical sense. Emulators typically suffer from a great loss in performance from having to account for substantial architectural differences between the guest and host platform. ANGLE instead merely provides an implementation of the OpenGL ES libraries, and happens to use another underlying graphics API to accomplish this. Very little performance is lost,* because the hardware itself is designed to efficiently support any generic rasterization graphics pipeline. A native OpenGL ES driver can be provided by the hardware vendors, but in cases where no such driver is present, ANGLE fulfills the same role.

* Sometimes performance is gained due to using more optimal drivers, better cooperation with other APIs that use the GPU (e.g., Direct2D), or optimizations performed by ANGLE itself.
From a bird's-eye view, all rasterization graphics APIs consist of two types of operations: state setting commands and drawing commands. Setting state includes operations such as specifying the culling mode and attribute layout, and also includes setting shaders. After setting all the states which affect how rendering calls will be processed, the real processing work is started by issuing drawing commands such as glDrawArrays and glDrawElements. ANGLE efficiently sets equivalent states so that the drawing commands are processed equally fast. Therefore, it acts more like a translator than an emulator. Direct3D, as the name implies, is in many ways a lower level graphics API than OpenGL. Thus ANGLE performs many of the tasks a native OpenGL ES driver would also have to perform. That said, translating between OpenGL ES and Direct3D is not without major challenges. We won't delve too deeply into the implementation details, as these have already been covered by articles in OpenGL Insights (Direct3D 9) [Koch 12] and GPU Pro 6 (Direct3D 11) [Woods 15]. We will, however, discuss the major caveats from a WebGL development perspective. If ANGLE just maps an underlying API such as Direct3D to OpenGL ES for use by yet another API, WebGL, why not translate directly between WebGL and Direct3D? This actually wasn't a viable option in the early days of WebGL, because conformance tests were nonexistent except for OpenGL ES. Conformance tests were and still are critically important to ensuring WebGL's success, or the success of any web standard for that matter. Chapter 3 details the testing effort for the Chrome browser. These days we have a robust WebGL conformance test suite [Khronos 13b], but it is still valuable to have an intermediate API layer which can be used to determine if issues are caused above or below that layer. Aside from testability, ANGLE's drop-in libraries allow it to be used for any application which wishes to target OpenGL ES, not just browsers which aim to support WebGL. The vast majority of mobile applications use OpenGL ES, and thus ANGLE allows that code to be reused when porting these applications to the desktop. Building ANGLE for this purpose, or for deep debugging of WebGL issues, is covered in Section 1.5.

1.4 ANGLE in WebGL App Development

ANGLE is a fully conformant implementation of OpenGL ES 2.0 [Kokkevis 11] and passes an ever-increasing number of WebGL tests—new ones are added almost daily—with performance highly competitive to that of one based on a native desktop OpenGL implementation. ANGLE's aim is to be invisible to the developer. Still, with the majority of desktop WebGL implementations being based in whole or in part on ANGLE, insights into some of its implementation details can result in higher performance or faster bug resolution.

1.4.1 Recommended Practices

The most important thing WebGL developers can do to ensure their applications are robust and performant is to develop for and test with pre-stable release channels of the browser of their choice and report suspected implementation bugs as they are found.
When new features or optimizations are deployed in ANGLE, they will appear in the pre-stable channels of browsers which use them first, giving developers and early adopters time to encounter and report bugs and the ANGLE team time to address those bugs before most users encounter them. Chrome Dev and Firefox Developer Edition both track a regularly updated version of ANGLE. Additionally, developers can ensure peak performance by using the types, formats, and commands in WebGL which require the least intervention from ANGLE to deliver to the underlying API. Because the particular work that ANGLE must do to translate calls varies depending on which backing renderer is being used by a given user, best practice is to be aware of, and minimize use of, computationally intensive paths for any of ANGLE's supported platforms.

• Recommendation: Avoid Use of LINE_LOOP and TRIANGLE_FAN. LINE_LOOP does not exist in either of the Direct3D APIs, and TRIANGLE_FAN is not supported by Direct3D 11, so ANGLE must rewrite index buffers to support them. This is less of a problem for LINE_LOOP, as it can be easily represented by LINE_STRIP with an additional copy of the first index, but TRIANGLE_FAN must be represented internally as TRIANGLES, with a greatly expanded index buffer, when backed by Direct3D 11. The impact of this rewriting can be nontrivial in certain situations; one of our benchmarks demonstrates that a particularly pathological case results in TRIANGLE_FAN requiring an additional 5 ms of render time per frame on one test system as compared to TRIANGLES.*

* See benchmarking samples at https://chromium.googlesource.com/angle/angle/+/master/samples/angle/.

• Recommendation: Create New Textures, Rather Than Redefining Old Ones. The Direct3D API family requires that the format and dimensions of textures and their mipmap chains be fully specified on creation, and it has no notion of incomplete textures. To support textures being defined one mipmap level at a time, ANGLE maintains copies of those textures in system memory, creating the GPU-accessible textures only when they are finally used by a rendering command. Once that texture is created, altering the format or dimensions of any of its constituent levels involves more overhead than for a newly created texture still being maintained in system memory. This overhead can cost precious milliseconds at draw time: In a very simple benchmark, defining the levels of a newly created texture was 3 to 6 ms faster than using identical GL calls to redefine the dimensions of a texture that had already been used in a previous draw. To avoid this potential penalty, create new textures, rather than changing the format or dimensions of an already existing one. By contrast, if only the pixel data contained in a texture need to be updated, it is best to reuse the texture—the additional overhead is only incurred when updating texture format or dimensions, because these require redefinition of the mipmap chain.

• Recommendation: Do Not Perform Clears with Scissors or Masks Enabled. One of the subtle differences between the Direct3D APIs and the GL APIs is that in the former, clear calls ignore scissors and masks, while the latter applies both to clears [Koch 12]. This means that if a clear is performed with the scissors test enabled, or with a color or stencil mask in use, ANGLE must draw a quad with the requested clear values, rather than using clear. This introduces some state management overhead, as ANGLE must switch out all the cached state such as shaders, sampler and texture bindings, and vertex data related to the draw call stream. It then must set up all the appropriate state for the clear, perform the clear itself, and then reset all of the affected state back to its prior settings once the clear is complete. If multiple draw buffers are currently in use, using WEBGL_draw_buffers, then the performance implications of this emulated clear are compounded, as the draw must be performed once for each target. Clearing buffers without scissors or masks enabled avoids this overhead.
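A short sketch of the clear recommendation (illustrative; it assumes color and depth are being cleared):

    // Turn off scissoring and write masks before clearing, so ANGLE's
    // Direct3D backends can issue a native clear instead of drawing a quad.
    gl.disable(gl.SCISSOR_TEST);
    gl.colorMask(true, true, true, true);
    gl.depthMask(true);
    gl.clear(gl.COLOR_BUFFER_BIT | gl.DEPTH_BUFFER_BIT);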
• Recommendation: Render Wide Lines as Polygons. ANGLE does not support line widths greater than 1.0, commonly called "wide" lines. While this is sometimes a requested feature, wide lines are not a native primitive of any modern hardware, and therefore require emulation in either the driver or the application. There is much disagreement about how exactly they should be rendered; the handling of corners, endpoints, and overdraw/transparency varies widely between vendors, drivers, and APIs. See Figure 1.2 for a few common joint rendering implementations. Basically, there is no right answer; game developers mostly want speed, while CAD and mapping applications may prefer high-quality wide lines that can be slow to render. Many applications resort to using their own implementation to achieve consistent results, and this makes driver-based implementations unnecessary bloat, which comes with a minor performance cost from extra state tracking. For these reasons, wide lines are not supported by Direct3D and have also been deprecated from the desktop OpenGL 3.0 core profile [Khronos 10]. It is important to realize that any fully conformant WebGL implementation can have a maximum line width of 1.0. Therefore, there is no guarantee that when we get the desired result on one platform, it will look identical on another. ANGLE does not want to be part of the problem by giving the impression that wide lines are widely supported. One approach for rendering wide lines in the application can be found in Cesium [Bagnell 13].

[Figure 1.2. Some line joint variations.]

• Recommendation: Avoid Three-Channel Uint8Array/Uint16Array Data in Vertex Buffers. Direct3D has limited support for three-channel vertex formats. Only 32-bit three-channel formats (both integer and float) are supported natively [MSDN 14a]. Other three-channel formats are expanded by ANGLE to four-channel internally when using a Direct3D backend. If the vertex buffer usage is dynamic, this conversion will be performed each time the buffer is used in a draw. To avoid the expansion, use four-channel formats with 8- or 16-bit types.

• Recommendation: Avoid Uint8Array Data in Index Buffers. Neither Direct3D 9 nor Direct3D 11 supports 8-bit indices, so ANGLE supports these by converting them to 16 bits on those platforms. To avoid the conversion cost, supply 16-bit index values instead.

• Recommendation: Avoid 0xFFFF in 16-Bit Index Buffers. In Direct3D 11, index values with all bits set to 1 are considered a special sentinel value, indicating a triangle strip-cut [MSDN 14b]. In OpenGL, this feature is known as primitive restart. Primitive restart does not appear in OpenGL ES until the 3.0 specification. In OpenGL ES 3.0, primitive restart can be toggled on and off, while by contrast, the feature is always enabled in Direct3D 11. This caused ANGLE to inadvertently break some WebGL applications that use highly detailed geometry.* This bug reached Chrome's Stable channel before it was noticed by anyone. It caught both the ANGLE team and WebGL developers off guard—another reminder to always test applications with Beta or Dev versions of the browser. We fixed the bug by detecting this strip-cut index and converting the whole index buffer to use 32-bit values when it's present. This cost could be avoided by using the OES_element_index_uint extension [Khronos 07a], but developers should have a fallback for when it is not available. Alternatively, content authoring tools could avoid using the strip-cut index value in 16-bit index buffers by creating indexed triangle strip buffers with fewer than 65536 vertices. Triangle lists are unaffected by primitive restart, so this offers another alternative.

* ANGLE issue 708, http://code.google.com/p/angleproject/issues/detail?id=708
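A sketch of the extension check and fallback mentioned in the recommendation above (the fallback strategy itself is application specific):

    // Prefer 32-bit indices when available; otherwise stay within 16-bit
    // limits (and keep 0xFFFF out of 16-bit triangle strip index buffers).
    var uintIndices = gl.getExtension('OES_element_index_uint');
    if (uintIndices) {
        gl.drawElements(gl.TRIANGLES, count, gl.UNSIGNED_INT, 0);
    } else {
        // Fallback path: meshes were split to fewer than 65536 vertices each.
        gl.drawElements(gl.TRIANGLES, count, gl.UNSIGNED_SHORT, 0);
    }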
• Recommendation: Make Appropriate Use of Static Buffers and Flag Usage Correctly. Due to the limited set of vertex formats supported natively in Direct3D 9, ANGLE must convert much of this data before uploading it to the GPU on this platform. If the provided vertex data are not updated subsequent to their first use in a rendering command, the data need not be converted every time they are used. Static data should therefore be stored in separate, designated buffers when possible. Additionally, while ANGLE will track updates to a buffer and promote it internally to static if no updates are made, developers can avoid needless data conversions by designating STATIC_DRAW as the usage for these buffers.

• Recommendation: Always Specify the Fragment Shader Float Precision. GLSL ES 1.00 and 3.00 do not specify a default precision for floating-point values in the fragment shader. This makes it a compilation error not to explicitly specify their precision. Earlier versions of ANGLE did not adhere to this rule and did not produce an error. Desktop hardware typically supports high precision, so this never posed a problem for ANGLE itself. However, this lenience caused developers to forget to set the precision, because shaders ran properly without it on their systems. This caused their code not to run at all on various other platforms which do demand the precision to be specified. Precision matters, especially on mobile platforms. We therefore decided to strictly enforce the specification. This change broke a handful of applications, but they were quickly resolved after explaining the issue to the authors. Those developing mainly on an ANGLE-powered browser will now be met with a compilation error if the precision is not set. For everyone else, we highly recommend checking whether the browser enforces it and, if not, testing frequently with one that does. For more on shader precision, see Chapter 8.

• Recommendation: Do Not Use Rendering Feedback Loops. In the OpenGL APIs, attempting to write to and sample from the same texture or renderbuffer in a rendering operation is considered a rendering feedback loop, and the results of such an operation are undefined in desktop OpenGL and OpenGL ES [Khronos 14a]. For some varieties of graphics hardware, Direct3D 9 could actually provide results for these operations which appeared to be correct. In Direct3D 11, this ability disappeared—attempting to perform such rendering operations produced images with black pixels where the sampled values were used. Users, not realizing that this was an undefined behavior in OpenGL ES and WebGL, began to report this as an error in ANGLE. We decided to create consistent behavior in ANGLE and make clear that the operation is not intended to be supported, by disabling such renders in Direct3D 9 as well. Additionally, the WebGL specification has since been amended to require an error when a rendering feedback loop is in place, rather than leaving the behavior undefined [Khronos 14b].
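Returning to the float-precision recommendation above, a minimal fragment shader that satisfies the rule might look like this sketch (u_color is an illustrative name):

    // GLSL ES fragment shaders have no default float precision;
    // declare one explicitly or compilation must fail.
    var fragmentSource = [
        'precision mediump float;',
        'uniform vec4 u_color;',
        'void main() {',
        '    gl_FragColor = u_color;',
        '}'
    ].join('\n');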
1.4.2 Beyond WebGL: Recommendations for OpenGL ES, WebGL 2, and More

There are a number of features already supported in ANGLE which are not yet exposed in WebGL. These features also have some caveats and some performance benefits for developers. They're accessible to users of ANGLE's OpenGL ES interface, and many will become available with WebGL 2. We present our recommendations for these use cases next.

• Recommendation: Don't Use Extensions without Having a Fallback Path. It is understandably very tempting to rely on extensions when a quick test indicates that they are supported across a large swath of hardware. Unfortunately Murphy's law, and the huge number of extensions and hardware variants, are not in our favor. Even YouTube has fallen victim to this.* A single-character ANGLE bug caused the OES_texture_npot extension [Khronos 07b], which enables support for textures whose dimensions are not powers of two, not to be advertised on certain hardware that did support it. Our conformance tests don't test unavailable extensions, as an implementation without extensions is still completely conformant, so this regression went entirely unnoticed for some time until YouTube broke. Expecting NPOT textures to be present without having performed an extension check, the hardware-accelerated video decode path in Chrome attempted to create a pbuffer surface whose dimensions were not a power of two and encountered a failure. This was quickly remedied (the ANGLE bug in question being a single missing exclamation point in a double negation) once it was known, but some trouble could have been avoided by querying the extension string and providing a fallback path or an alert if the expected extension was not present. Issues with extensions continue to get more complicated over time with increasingly varying hardware features. We therefore recommend using them judiciously and frequently testing the fallback code path.

* ANGLE issue 799, http://code.google.com/p/angleproject/issues/detail?id=799
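A small sketch of the check-then-fall-back pattern (anisotropic filtering is just one example of an optional extension; the degree of 4.0 is arbitrary):

    var aniso = gl.getExtension('EXT_texture_filter_anisotropic') ||
                gl.getExtension('WEBKIT_EXT_texture_filter_anisotropic');
    if (aniso) {
        gl.texParameterf(gl.TEXTURE_2D, aniso.TEXTURE_MAX_ANISOTROPY_EXT, 4.0);
    }
    // Without the extension, ordinary trilinear filtering still works; the
    // application degrades gracefully instead of breaking.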
The ability to define an entire texture at creation time did later get introduced to OpenGL and its related APIs as immutable textures, which also enforce internal consistency and disallow changes to dimensions and ­format. Immutable ­textures came to OpenGL ES 2.0 with EXT_texture_storage [Khronos 13a], and they are included in the core OpenGL ES 3.0 specification and the WebGL 2 Editor’s Draft specification. When immutable textures are available via extension or core specification, some of ANGLE’s bookkeeping can be avoided by using the texStorage* commands to define textures. Recommendation: Use RED Textures instead of LUMINANCE In WebGL and unextended OpenGL ES 2.0, the only option developers have for expressing single-channel textures is the LUMINANCE format, and LUMINANCE_ALPHA for two-channel textures. The EXT_texture_rg extension [Khronos 11] adds the RED and RG formats, and these formats become core functionality in OpenGL ES 3.0. The formats also appear in the WebGL 2 Editor’s Draft specification. Meanwhile, Direct3D 11 has dropped all support for luminance textures, providing only red and red-green formats [MSDN 14a]. This may seem to be a trivial difference—a channel is a ­channel—but sampling from a luminance texture is performed differently than from textures of other formats. The single channel of a luminance texture is duplicated into the red, green, and blue channels when a sample is performed, while sampling from a RED texture populates only the red channel with data. Similarly, the second channel of a LUMINANCE_ALPHA and an RG texture will populate only the alpha and green channels in a sample, respectively. To support luminance formats against Direct3D 11, rather than alter the swizzle behavior in shaders, ANGLE instead expands the texture data to four channels. This expansion, and the associated additional memory and texture upload performance costs, can be avoided by developers keen for clock cycles by simply using RED textures in place of LUMINANCE and RG in place of LUMINANCE_ALPHA when using ANGLE with APIs that support them. Recommendation: Avoid Integer Cube Map Textures Cube maps with unnormalized integer formats are not supported by Direct3D 11 [MSDN 14c]. The ANGLE team hasn’t encountered any uses for it, which may be the reason it was left out of D3D11, but it is a feature of OpenGL ES 3.0 and gets tested by the conformance tests. ANGLE therefore must emulate it in ANGLE’s ESSL to HLSL translator. The cube texture is replaced by a sixlayer 2D array texture, and the face from which to sample, and at what location, is manually computed. Rather than unnormalized integer formats, we recommend using normalized integer formats for cube maps. If integer values are expected, multiply the sampled value by the maximum integer value, and round to the nearest integer. For example, for signed 16-bit integers: int i = int(round(32767 * f)); Recommendation: Avoid Full-Texture Swizzle Texture swizzling is an OpenGL ES 3.0 feature which allows a texture’s components to be sampled in a different order, using the TEXTURE_SWIZZLE_R, 1.4 ANGLE in WebGL App Development 11 •• •• TEXTURE_SWIZZLE_G, TEXTURE_SWIZZLE_B, and TEXTURE_ SWIZZLE_A texture parameters. This is most often used to read RGBA ­textures as BGRA, or vice versa, and can also be used to replicate components as with luminance textures. This feature is, however, not supported by Direct3D 11. 
Even though it appears a seemingly simple operation to perform during the shader translation, it is actually not feasible to determine which textures are sampled where, because samplers can be passed from function to function as parameters, and the same texture sampling function can be used to sample various different textures. ANGLE therefore swizzles the texture data itself. This consumes some memory and incurs some overhead at texture upload. These costs can be avoided by not changing the TEXTURE_SWIZZLE_R, TEXTURE_SWIZZLE_G, TEXTURE_SWIZZLE_B, and TEXTURE_SWIZZLE_A texture parameters from their defaults. If necessary, use multiple shader variants to account for different texture component orders. Recommendation: Avoid Uniform Buffer Binding Offsets Uniform buffer objects (UBOs), newly added in OpenGL ES 3.0, are bound objects which store uniform data for the use of GLSL shaders. UBOs offer benefits to developers, including the ability to share uniforms between programs and faster switching between sets of uniforms. OpenGL ES 3.0 also allows UBOs, much like other buffer objects, to be bound at an offset into the buffer, rather than just the buffer head. Direct3D, on the other hand, does not support referencing its analogous structure, constant buffers, until Direct3D 11.1, with the addition of the VSSetConstantBuffers1 method [MSDN 14d]. Offsets are supported with a software workaround on all hardware of lower feature levels. Developers can avoid any performance penalty associated with this workaround by binding UBOs at offset 0 only. Minor Recommendation: Beware of Shadow Lookups in 2D Array Textures Our final recommendation is a minor one, because the range of hardware affected is relatively small. Shadow comparison lookups are a feature introduced in OpenGL ES 3.0. These texture lookups can perform prefilter comparison of depth data contained in a texture against a provided reference value. ES 3.0 also introduces new texture types, including 2D texture arrays. Where these two features intersect, a caveat emerges. Direct3D 11 does support shadow lookups for 2D texture arrays—but not at feature level 10_0 [MSDN 14e]. For this reason, ANGLE must either exclude feature level 10_0 hardware from ES 3.0 support or implement a workaround, with potential performance penalties. If the latter approach is chosen, developers may encounter performance issues on Direct3D 10.0 hardware. If the former approach is chosen instead, then OpenGL ES 3.0 would not be available on this hardware at all. 1.5 Debugging (with) ANGLE 1.5.1 Debugging Translated Shaders ANGLE’s Direct3D backends require an intermediate translation of GLSL ES shaders into HLSL. This can complicate bug analysis because the HLSL compiler reports errors in the translated code, not the original code. This intermediate code can be inspected using the 12 1. ANGLE WEBGL_debug_shaders extension [Khronos 14c]. This extension also allows looking at the translated GLSL code for a desktop OpenGL-based implementation, and even GLSL ES for mobile implementations. The browser may apply workarounds and alterations to your shader code to improve safety—for example, preventing out-of-bounds accesses. A full example of the use of this extension can be found at http://www.ianww. com/2013/01/14/debugging-angle-errors-in-webgl-on-windows/ 1.5.2 Building ANGLE Libraries Since ANGLE binaries come as a set of drop-in dynamically linked libraries (DLLs), they are isolated from the browser. 
1.5.2 Building ANGLE Libraries

Since ANGLE binaries come as a set of drop-in dynamically linked libraries (DLLs), they are isolated from the browser. This makes it relatively straightforward to build your own version and replace the ones that came with the browser. While this isn't intended to be common practice, WebGL developers who want to push the envelope will run into implementation-dependent behavior sooner or later, and ANGLE provides a window into that. It may, for example, be useful to a developer who is experiencing unexpected behavior, or to those interested in writing tools to assist in the debugging or performance profiling of WebGL code, to build ANGLE's debug configuration so that a debugger may be attached and the code may be stepped through and examined in progress. We cover this process briefly next. The process of downloading and building ANGLE is covered in detail in the ANGLE wiki.* This process may change from time to time, so please consult the page for the latest step-by-step guide. ANGLE may additionally be built as a part of Chromium. Information on how to build a specific version of ANGLE within the larger Chromium build is included in our wiki as well.†

* http://code.google.com/p/angleproject/wiki/DevSetup
† http://code.google.com/p/angleproject/wiki/BuildingANGLEChromiumForANGLEDevelopment

1.5.3 Debugging ANGLE with Chrome

Once the ANGLE libraries are built, they can be dropped into Chrome and used in place of the libraries that shipped with the browser. On Windows, they must be copied to the folder containing the current version of Chrome's support DLLs, which is a numbered folder located alongside the Chrome executable. It's a good idea to move the original versions of the ANGLE DLLs out of the way first, because they'll be needed if something goes awry with ANGLE or to revert to the original state once debugging is complete. Finding the correct Chrome process to attach the debugger to might seem a daunting prospect, but Chrome provides some helpful command line switches to assist with this task. The most useful of these for the purposes of ANGLE debugging is --gpu-startup-dialog. With this argument, Chrome will spawn a dialog box when the GPU process starts, containing the process ID, and will pause until the box is dismissed. This provides the developer with the opportunity to attach a debugger to the identified process. Also helpful is --use-gl=desktop, which forces Chrome to translate calls to desktop OpenGL itself, rather than translating via ANGLE. This can be useful in identifying whether a problem is specific to ANGLE or occurs somewhere else in the Chrome graphics stack. A full listing of the currently available command line switches in Chrome is maintained at http://peter.sh/experiments/chromium-command-line-switches/

From time to time, changes are made to ANGLE or the Chromium codebase which break compatibility with previous versions of Chrome. If this occurs, the issue can be mitigated by working with the version of ANGLE corresponding to the Chrome release being used. Determining which ANGLE version corresponds to a Chrome release is very simple. First, choose "About Google Chrome" from the Chrome menu. This will open a tab that displays information about Chrome, including the software version. This will look something like "Version 39.0.2171.65 m." Make note of the third subset, which identifies the Chromium branch from which this version of Chrome was built. ANGLE maintains branches following the Chromium branch strategy, each named chromium/<branch number>, so once this branch number is known, the corresponding ANGLE branch can be checked out.
From time to time, changes are made to ANGLE or the Chromium codebase that break compatibility with previous versions of Chrome. If this occurs, the issue can be mitigated by working with the version of ANGLE corresponding to the Chrome release being used. Determining which ANGLE version corresponds to a Chrome release is very simple. First, choose "About Google Chrome" from the Chrome menu. This will open a tab that displays information about Chrome, including the software version. This will look something like "Version 39.0.2171.65 m." Make note of the third subset, which identifies the Chromium branch from which this version of Chrome was built. ANGLE maintains branches following the Chromium branch strategy, each named chromium/<branch number>, so once this branch number is known, the corresponding ANGLE branch can be checked out. For example, the ANGLE branch used by Chrome 39.0.2171.65 is checked out using the command git checkout chromium/2171.

The version of an already built ANGLE DLL can be checked by right-clicking on it in Windows Explorer, choosing Properties, and selecting the Details tab in the dialog that appears. The SHA listed as the Product Version is the git hash representing the ANGLE repository tree from which the DLL was built. To check out that version of ANGLE, use git checkout <SHA>.

Once attached to the GPU process, the debugger can break at points set within the ANGLE source code, internal variable values can be examined, and code can be stepped through incrementally. Developers may notice that some of the calls that come through ANGLE in Chrome do not correspond to their own WebGL calls. This is because both WebGL contexts and the Chrome compositor itself make use of ANGLE, so the calls are intermingled.

It may additionally be helpful to use a graphical debugging tool to see visual results of rendering commands as they are issued, examine the contents of textures and buffers, or monitor GPU usage. A step-by-step guide to starting Chromium with one such tool, NVIDIA Nsight, can be found on the Chromium developer wiki, at http://www.chromium.org/developers/design-documents/chromium-graphics/debugging-with-nsight.

1.6 Additional Resources

There are a number of avenues for communicating with and getting help from members of the ANGLE community and team. Bugs in ANGLE can be filed at http://code.google.com/p/angleproject/issues, or at http://crbug.com if they affect Chrome. Filing bugs helps the ANGLE team to maintain a high-quality, conformant project. Our forum and mailing list are excellent resources for finding answers to questions from ANGLE team members and other ANGLE users. They can be mailed at angleproject@googlegroups.com, or viewed online at http://groups.google.com/group/angleproject. Developers can discuss ANGLE in real time in our IRC channel, #ANGLEproject on FreeNode.

Bibliography

[ANGLE 13] ANGLE project. "ANGLE Development Update—June 18, 2013." https://code.google.com/p/angleproject/wiki/Update20130618, 2013.

[ANGLE 14a] ANGLE project. "M(ultiplatform)-ANGLE Effort." https://code.google.com/p/angleproject/wiki/MANGLE, 2014.

[Bagnell 13] Daniel Bagnell. "Robust Polyline Rendering with WebGL." http://cesiumjs.org/2013/04/22/Robust-Polyline-Rendering-with-WebGL/, 2013.

[Khronos 07a] The Khronos Group. "OES_element_index_uint." Contact Aaftab Munshi. https://www.khronos.org/registry/gles/extensions/OES/OES_element_index_uint.txt, 2007.

[Khronos 07b] The Khronos Group. "OES_texture_npot." Contact Bruce Merry. https://www.khronos.org/registry/gles/extensions/OES/OES_texture_npot.txt, 2007.

[Khronos 10] The Khronos Group. "The OpenGL Graphics System: A Specification (Version 3.3 (Core Profile)—March 11, 2010)." Edited by Mark Segal and Kurt Akeley. https://www.opengl.org/registry/doc/glspec33.core.20100311.pdf, 2010.

[Khronos 11] The Khronos Group. "EXT_texture_rg." Contact Benj Lipchak. https://www.khronos.org/registry/gles/extensions/EXT/EXT_texture_rg.txt, 2011.

[Khronos 13a] The Khronos Group. "EXT_texture_storage." Contacts Bruce Merry and Ian Romanick. https://www.khronos.org/registry/gles/extensions/EXT/EXT_texture_storage.txt, 2013.

[Khronos 13b] The Khronos Group. "Testing/Conformance." https://www.khronos.org/webgl/wiki/Testing/Conformance, 2013.
[Khronos 14a] The Khronos Group. "OpenGL ES Version 3.0.4." Edited by Benj Lipchak. https://www.khronos.org/registry/gles/specs/3.0/es_spec_3.0.4.pdf, 2014.

[Khronos 14b] The Khronos Group. "WebGL Specification." Edited by Dean Jackson. https://www.khronos.org/registry/webgl/specs/latest/1.0/, 2014.

[Khronos 14c] The Khronos Group. "WEBGL_debug_shaders." Contact Zhenyao Mo. https://www.khronos.org/registry/webgl/extensions/WEBGL_debug_shaders/, 2014.

[Koch 12] Daniel Koch and Nicolas Capens. "The ANGLE Project: Implementing OpenGL ES 2.0 on Direct3D." OpenGL Insights. Edited by Patrick Cozzi and Christophe Riccio. Boca Raton, FL: CRC Press, 2012.

[Kokkevis 11] Vangelis Kokkevis. "The Chromium Blog: OpenGL ES 2.0 Certification for ANGLE." http://blog.chromium.org/2011/11/opengl-es-20-certification-for-angle.html, 2011.

[MSDN 14a] Microsoft. "DXGI_FORMAT enumeration." http://msdn.microsoft.com/en-us/library/windows/desktop/bb173059, 2014.

[MSDN 14b] Microsoft. "Primitive Topologies." http://msdn.microsoft.com/en-us/library/windows/desktop/bb205124, 2014.

[MSDN 14c] Microsoft. "Load (DirectX HLSL Texture Object)." http://msdn.microsoft.com/en-us/library/windows/desktop/bb509694, 2014.

[MSDN 14d] Microsoft. "ID3D11DeviceContext1::VSSetConstantBuffers1 method." http://msdn.microsoft.com/en-us/library/windows/desktop/hh446795, 2014.

[MSDN 14e] Microsoft. "SampleCmp (DirectX HLSL Texture Object)." http://msdn.microsoft.com/en-us/library/windows/desktop/bb509696, 2014.

[Woods 15] Shannon Woods, Nicolas Capens, Jamie Madill, and Geoff Lang. "ANGLE: Bringing OpenGL to the Desktop." GPU Pro 6: Advanced Rendering Techniques. Edited by Wolfgang Engel. Boca Raton, FL: CRC Press, 2015.

2 Mozilla's Implementation of WebGL

Benoit Jacob, Jeff Gilbert, and Vladimir Vukicevic

2.1 Introduction
2.2 DOM Bindings
2.3 WebGL Method Implementations and State Machine
2.4 Texture Uploads and Conversions
2.5 Null and Incomplete Textures
2.6 Shader Compilation
2.7 Validation and Preparation of drawArrays Calls
2.8 Validation of drawElements Calls
2.9 The Swap-Chain and Compositing Process
2.10 Platform Differences
2.11 Extension Interactions
2.12 Closing Thoughts

2.1 Introduction

A typical implementation of WebGL in a browser has to implement this API on top of system graphics APIs such as OpenGL, OpenGL ES, or Direct3D. Thus, WebGL is effectively an additional layer on top of OpenGL or other similar APIs, potentially adding overhead. Some understanding of what kind of work takes place in a browser's WebGL implementation may help application developers write better code that suffers from less overhead.

Figure 2.1 gives a general overview of Mozilla's WebGL implementation. Over the course of this chapter, we will touch on each part.

2.2 DOM Bindings

The first step in our study of what happens in the browser below WebGL method calls is studying the mechanics of the call itself. Take, for example, the WebGL uniform4f method. In a JavaScript program, a call to uniform4f looks like this:

gl.uniform4f(location, x, y, z, w);

Here, the location, x, y, z, w parameters are JavaScript values. They could be anything: numbers, objects, arrays, strings…. The WebGL IDL* says that location should be a WebGLUniformLocation object, and x, y, z, w should be numbers. This means that the WebGL implementation must verify that the values passed for these parameters are of these types, or can be converted to these types; otherwise, the WebGL implementation must generate an exception.
In addition to that, the bit representation of these parameters is not the same on the JavaScript side and on the browser internal C++ side, so some conversion work needs to be done. This is very repetitive work that needs to be done for every parameter of every DOM API method that is internally implemented in C++.

This is automated. In Mozilla's implementation, a Python script parses all the web IDL and generates some C++ code for each method,† which we call its DOM binding. For each method, the generated code takes care of validating and converting the input JavaScript parameters, and then calling the corresponding C++ function in the Mozilla browser code. The execution flow looks like Figure 2.2.

Figure 2.1
Overview of Mozilla's WebGL implementation: a WebGL method call goes through the DOM binding to the WebGL method implementation and state machine, which uses the ANGLE shader compiler and issues OpenGL calls (via ANGLE on Windows) into the WebGL swap-chain, whose output is consumed by the browser compositor.

Figure 2.2
Mechanics of a WebGL method call: JavaScript code calling uniform4f calls the DOM binding for uniform4f (auto-generated C++ code), which calls the WebGL method implementation for uniform4f (manually written C++ code).

From the application developer's perspective, the main takeaway should be that one should expect the cost of calling a method in a DOM API to grow roughly linearly with the number of method parameters. Methods taking more parameters should be more expensive to call. However, this overhead is low enough that it only becomes noticeable for really cheap WebGL methods.

We confirmed with a microbenchmark‡ that uniform4f is slower than uniform1f in all browsers that we tried,§ as one would expect, since the DOM binding for uniform4f has more parameters to check and convert. The speed difference between these two methods varied significantly between different browsers, from 15%–20% in one browser to 40%–50% in another. However, as soon as one calls more expensive WebGL methods, this effect becomes negligible. For example, comparing uniform4f to uniform4fv instead of uniform1f, we found that uniform4fv is slower than uniform4f in all browsers that we tried, despite taking as few parameters as uniform1f does. This is due to uniform4fv having inherently more work to do than uniform4f, since it can handle arbitrary uniform array uploads.

* IDL stands for Interface Definition Language. The WebGL IDL is the part of the WebGL specification that formally defines the WebGL interfaces and methods, describing the types of methods' parameters and return values.
† In Mozilla's source tree, which can easily be searched online (try http://dxr.mozilla.org), the WebIDL parser is WebIDL.py, and the corresponding C++ code generator is Codegen.py. We don't give full paths with directories, as that should not be needed to find these files and is subject to change. The source Web IDL for WebGL is WebGLRenderingContext.webidl. An additional configuration file, Bindings.conf, helps the code generator find the C++ implementation for a given web IDL interface. The resulting generated C++ code is less easy to find online, but the interested reader could easily build Firefox from sources and then look, in the resulting object directory, for WebGLRenderingContextBinding.cpp.
‡ https://github.com/WebGLInsights/WebGLInsights-1/tree/master/02-Mozillas-Implementation-of-WebGL/
§ We tried Safari, Chrome, and Firefox on Mac OS X 10.10, in late 2014.
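Below is a minimal sketch of the kind of microbenchmark referenced above; it is not the chapter's actual benchmark (linked in the footnote), and the uniform names, iteration count, and timing method are illustrative only:

// Assumes gl is a WebGL context and program is a linked program whose
// shaders declare 'uniform float u_scalar;' and 'uniform vec4 u_vec;'.
gl.useProgram(program);
const scalarLoc = gl.getUniformLocation(program, 'u_scalar');
const vecLoc = gl.getUniformLocation(program, 'u_vec');

function time(fn, iterations) {
  const start = performance.now();
  for (let i = 0; i < iterations; i++) fn();
  return performance.now() - start;
}

const t1 = time(() => gl.uniform1f(scalarLoc, 1.0), 100000);
const t4 = time(() => gl.uniform4f(vecLoc, 1.0, 2.0, 3.0, 4.0), 100000);
console.log('uniform1f: ' + t1 + ' ms, uniform4f: ' + t4 + ' ms');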
2.3 WebGL Method Implementations and State Machine

Let us now describe what kind of work takes place inside the manual C++ implementations for WebGL methods. Take, for instance, texSubImage2D. It takes several parameters, one of which is the image data format, which is a GLenum, that is, a number. The DOM bindings work done so far has checked that the value passed for this parameter had an acceptable type, and then converted it to a native integer. But the DOM bindings don't know anything specific to WebGL, so they have no clue as to what the allowed range is for this integer parameter. Thus, the manually written C++ implementation for each WebGL method has the responsibility of finishing the checking of each parameter and generating the right WebGL errors.

One might still think that there is nothing here that a WebGL implementation has to do, as all these checks should be performed already by the underlying OpenGL implementation. This isn't so, if only because WebGL is not OpenGL: There are subtle differences of semantics and capabilities that mean that WebGL validation is not the same as OpenGL validation. Additionally, even when WebGL and OpenGL semantics agree, a WebGL implementation may not want to trust the OpenGL implementation to be bug-free, especially if security is at stake. Thus, implementations of WebGL methods have to implement most of the call validation themselves, regardless of whether some of this work is going to be duplicated in the underlying OpenGL implementation.

As we saw before with texSubImage2D's format parameter, call validation depends not only on parameter values but also on existing state. This means that the WebGL implementation needs very frequent access to current state, such as, in the case of texSubImage2D, the bound texture and its properties, such as its format.

The WebGL implementation accesses state in two different ways. First, the WebGL implementation can query state directly from OpenGL by getter calls such as glGetIntegerv. Second, some of the WebGL state is manually tracked by the WebGL implementation, either because it is WebGL-specific state without a direct OpenGL equivalent, or because the OpenGL API doesn't expose it in a practical and efficient way. For example, each WebGL texture is tracked as a WebGLTexture object, which contains the OpenGL id for this texture object as well as an array of structs tracking the width, height, format, etc., for each image stored on this texture object (there can be multiple images for mipmaps or cube maps). Much, though not all, of that state is redundant with OpenGL state, as the underlying OpenGL texture already stores internally attributes such as width, height, and format. However, at least in OpenGL ES 2.0, there isn't a practical way to query the width of a texture image.

At this point we can draw two conclusions. First, a cause of inevitable speed overhead in WebGL compared to OpenGL is the extra validation work that is needed. Second, in addition to speed overhead, there is moderate memory overhead due to having to duplicate large parts of the OpenGL state—though that is mostly negligible in practice.
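As a concrete illustration of WebGL-side validation, the sketch below provokes an error that is caught using the implementation's tracked state; the texture size is arbitrary:

// In WebGL 1, texSubImage2D's format must match the format of the
// texture image being updated; a mismatch yields INVALID_OPERATION.
gl.bindTexture(gl.TEXTURE_2D, gl.createTexture());
gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGB, 16, 16, 0, gl.RGB, gl.UNSIGNED_BYTE, null);
gl.texSubImage2D(gl.TEXTURE_2D, 0, 0, 0, 16, 16, gl.RGBA, gl.UNSIGNED_BYTE,
                 new Uint8Array(16 * 16 * 4));
console.log(gl.getError() === gl.INVALID_OPERATION); // logs true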
2.4 Texture Uploads and Conversions

So far we've talked about general considerations that apply across all of WebGL. Let's now turn to specific areas that are particularly complex and prone to causing overhead, starting with texture uploads (i.e., the texImage2D and texSubImage2D entry points).

Already in OpenGL, because of the way that these entry points are specified, they must perform a full copy of the input texture image before returning. Indeed, the contents of the texture must not be affected by any change to the input image data after tex[Sub]Image2D has returned. WebGL inherits this overhead from OpenGL, and then adds some more.

One cause of additional WebGL overhead that affects all texImage2D overloads, but does not affect texSubImage2D, is that, in WebGL, texImage2D must check whether the underlying call to glTexImage2D generated an error. Indeed, glTexImage2D typically has to allocate memory, and this could fail with a GL_OUT_OF_MEMORY error. We care because a WebGL implementation has to track state by itself, and when updating state, it has to keep its tracked state in sync with the actual OpenGL state. So if an OpenGL call that is supposed to update state fails to do so, the WebGL implementation needs to know. Unfortunately, in OpenGL, calling glGetError is costly, for two reasons. First, calling glGetError means that a larger part of glTexImage2D's work must be immediately executed instead of being run asynchronously later. Second, glGetError also requires the immediate execution of the entire OpenGL command stream that may have been accumulated so far, typically on mobile "deferred" GPUs. The overhead described before is specific to texImage2D and does not affect texSubImage2D. However, the other types of overhead described below equally affect texImage2D and texSubImage2D.

The tex[Sub]Image2D overloads taking an HTML element, such as an <img> element, typically require the entire image to be converted to a different format. Indeed, the pixel format of a decoded image is a browser implementation detail that isn't exposed in any API, and can vary at any time. So the WebGL specification had to solve the problem of how to prevent such nonstandardized details from leaking into the interface. The solution adopted by WebGL was to specify that, no matter what the original format of the decoded image is, it must be converted to the format described by the format parameters taken by tex[Sub]Image2D. This comes at the cost of introducing substantial overhead, as now tex[Sub]Image2D calls taking an HTML element or a canvas ImageData must convert it to the requested format. In particular, the pressure on memory bandwidth gets worse, because now there is an additional copy with format conversion being made in the WebGL implementation even before calling OpenGL.

The pixel format is only one of several factors that could require a conversion. Other factors are controlled by pixelStorei parameters. They are the stride (UNPACK_ALIGNMENT), the alpha channel premultiplication status (UNPACK_PREMULTIPLY_ALPHA_WEBGL), and the vertical flipping (UNPACK_FLIP_Y_WEBGL).

For the tex[Sub]Image2D overloads taking an HTML element or canvas ImageData, another case needs to be mentioned: the case of opting out of colorspace conversion. This is achieved by setting another pixelStorei parameter, UNPACK_COLORSPACE_CONVERSION_WEBGL, to the value NONE. This is implemented by re-decoding the image from its original stream, from scratch. This, of course, is costly.
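The sketch below shows how these unpack parameters are set from JavaScript; the particular combination of values is illustrative, and each nondefault setting can trigger the conversion work just described (image is assumed to be a fully loaded HTMLImageElement):

gl.bindTexture(gl.TEXTURE_2D, gl.createTexture());
gl.pixelStorei(gl.UNPACK_FLIP_Y_WEBGL, true);                   // flip rows vertically on upload
gl.pixelStorei(gl.UNPACK_PREMULTIPLY_ALPHA_WEBGL, true);        // premultiply alpha on upload
gl.pixelStorei(gl.UNPACK_COLORSPACE_CONVERSION_WEBGL, gl.NONE); // forces a costly re-decode for <img> sources
gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA, gl.UNSIGNED_BYTE, image);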
There is one particularly nasty corner case with UNPACK_PREMULTIPLY_ALPHA_WEBGL that may also require re-decoding an image from scratch. This is the case when the image source is an HTML <img> element that was already premultiplied in the browser's memory (which is very commonly done by browsers), and UNPACK_PREMULTIPLY_ALPHA_WEBGL has its default value of false. Un-premultiplying a previously premultiplied image doesn't exactly recover the original image, so the WebGL spec requires implementing this by re-decoding the image from scratch, as we described above for the case where UNPACK_COLORSPACE_CONVERSION_WEBGL is set to NONE.

For the tex[Sub]Image2D overloads taking an ArrayBufferView, things are a lot simpler: There is no possibility of format or stride or colorspace conversion, no possibility of un-premultiplication, and in the default state there is no conversion at all. Nondefault pixelStorei parameters can still require conversions: Setting UNPACK_FLIP_Y_WEBGL or UNPACK_PREMULTIPLY_ALPHA_WEBGL to true will still require a flipping or a premultiplication, respectively.

A practical takeaway from this discussion is that the default state of UNPACK_PREMULTIPLY_ALPHA_WEBGL, set to the value false, is best for tex[Sub]Image2D overloads taking an ArrayBufferView or other image sources that are known not to be premultiplied, such as canvas ImageData, but can be very painful for the overloads taking an HTML element, which typically is premultiplied. For those, setting UNPACK_PREMULTIPLY_ALPHA_WEBGL to true allows for cheaper texture uploads with more accurate results.

Table 2.1 summarizes the preceding discussion* of tex[Sub]Image2D overhead.

Table 2.1 Overhead of Texture Image Specification Calls

• Inherent OpenGL overhead (has to copy data immediately): Yes, for all four overloads (texImage2D and texSubImage2D, each with an ArrayBufferView or with an <img> element or ImageData).
• Has to call glGetError immediately: Yes for both texImage2D overloads, except perhaps when replacing an existing same-size image; No for both texSubImage2D overloads.
• Has to convert input image data before passing it to OpenGL: For the ArrayBufferView overloads, only if nondefault pixelStorei parameters require it; for the <img>/ImageData overloads, yes, except if the decoded format is exactly the same and no pixelStorei parameter requires a conversion.
• Has to re-decode the image from scratch: Not applicable to the ArrayBufferView overloads; for the <img>/ImageData overloads, only if pixelStorei parameters require it.

* For the reader interested in checking the Mozilla source code for this, a good approach would be to use a code search tool such as the online one at http://dxr.mozilla.org and search for identifiers such as UNPACK_PREMULTIPLY_ALPHA_WEBGL.

2.5 Null and Incomplete Textures

In some cases, differences between WebGL and OpenGL, or between different flavors of OpenGL, require a less straightforward approach to implementing WebGL tex[Sub]Image2D, which comes with specific overhead characteristics.

One such case, which we have avoided discussing in the preceding section, is when passing null for the input image data. In OpenGL, passing null image data means that the texture image must only be allocated, leaving its contents uninitialized. In WebGL, uninitialized texture data would be unacceptable for security and portability reasons. So WebGL avoids exposing uninitialized memory and instead specifies that null texture images must behave as if all bytes were set to 0 (i.e., as transparent black textures). The straightforward implementation would be to allocate a buffer, fill it with zeroes, and pass it to glTexImage2D. That works, but adds overhead, so Mozilla's implementation avoids this approach insofar as possible.
The key observation is that, in WebGL rendering, all such transparent black textures are indistinguishable from each other. Thus, one can use a single global transparent black texture, of size 1×1, and use it for all WebGL textures with null image data. One just has to carefully track this state and correctly revert to allocating the texture's own storage as needed, such as when a texSubImage2D call is made on it. In most such cases, however, the texSubImage2D call will replace the entire contents of the texture image, so at that point there is no need anymore to allocate a temporary buffer of zeroes! In summary, at least in Mozilla's current implementation, texImage2D with null is very fast, and a subsequent texSubImage2D runs at normal speed provided that it covers the entire image. The slow corner case is the first partial texSubImage2D call on a null texture image.

Null texture images are only one of two corner cases that are handled with dummy black textures in Mozilla's implementation. The other case is that of incomplete textures. In OpenGL and WebGL terminology, an incomplete texture is one that has an illegal combination of texture parameters and texture images. The precise rules vary between different flavors of OpenGL. WebGL follows the core OpenGL ES 2.0 rules. For example, in this setting, a texture with a non-power-of-two size and a REPEAT wrap mode is incomplete. Mozilla's WebGL implementation may run against any flavor of OpenGL, ES or regular, and must offer WebGL-conformant behavior everywhere. The only way to achieve that is by manually implementing all the texture-completeness logic. Thus, Mozilla's WebGL implementation tracks each texture's completeness status, and when it finds that drawing is about to sample from an incomplete texture, it implements the correct behavior for incomplete textures, which is to sample as opaque black (RGBA 0,0,0,255), by manually binding a dummy 1×1 opaque black texture.

This, at least, has no particularly surprising overhead behavior. Under typical circumstances, drawing doesn't actually sample from null or incomplete textures, so there is just a roughly constant amount of tracking overhead* on each operation that can affect or depend on texture completeness, such as draw calls. In the unusual case where drawing is actually about to sample from a null or incomplete texture, some additional work is done around each draw call to bind the dummy black textures. We will get back to that when we discuss draw-call overhead in a later section.

* Most of the code tracking null and incomplete textures can be seen in this file in Mozilla source code: WebGLTexture.cpp.
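Before moving on, here is a short sketch of the texture-upload corner case described earlier in this section (the sizes are arbitrary):

const tex = gl.createTexture();
gl.bindTexture(gl.TEXTURE_2D, tex);
// Fast path: the implementation can alias a shared 1x1 transparent
// black texture instead of allocating and zero-filling real storage.
gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, 256, 256, 0, gl.RGBA, gl.UNSIGNED_BYTE, null);
// Slow corner case: the first upload is only partial, so the
// implementation must now materialize the zero-filled 256x256 storage.
gl.texSubImage2D(gl.TEXTURE_2D, 0, 0, 0, 16, 16, gl.RGBA, gl.UNSIGNED_BYTE,
                 new Uint8Array(16 * 16 * 4));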
2.6 Shader Compilation

WebGL implementations must carry their own shading language compilers, for several different reasons. Mozilla's implementation uses the shader compiler from the ANGLE project† (Chapter 1).

There are many different flavors of the OpenGL shading language (GLSL). Non-ES OpenGL and OpenGL ES differ in this respect; their shading language dialects are referred to as GLSL and GLSL ES, respectively. Even within GLSL ES, the specification leaves room for variation. WebGL's shading language (WebGL GLSL) is specified by starting from GLSL ES 1.0 and tightening loose parts by mandating several restrictions that were left optional in GLSL ES. WebGL adopts restrictions on control flow, disallowing while loops and restricting the genericity of for loops. WebGL also adopts restrictions on the addressing of uniform arrays, so that in fragment shaders all array indices must be constant expressions.

In addition to validating and translating shaders, the shader compiler also prevents triggering certain classes of driver bugs and implementation-defined behavior. The most obvious way that it does so is by catching malformed shaders before they are submitted to the OpenGL implementation, but this is not the only way. For instance, a WebGL implementation needs to guard against out-of-bounds access to uniform arrays. The previously mentioned shading language restrictions help with that, but are not enough: Addressing uniform arrays using runtime indices is still allowed in vertex shaders. In the ANGLE shader compiler, which Mozilla uses, this is addressed by injecting clamp instructions so that, at the cost of some runtime overhead, no out-of-bounds access is possible.

Another possible way in which out-of-bounds access to uniform arrays could happen is in uniform array setters such as uniform4fv. The OpenGL specifications mandate that when the upload size exceeds the actual uniform array size, excess values must be ignored. But this is the kind of security-sensitive, memory-addressing corner case that browser implementers typically don't want to rely on drivers to implement correctly. So, while in theory this is not needed, Mozilla's implementation queries the uniform array sizes from the ANGLE shader compiler and uses them to clamp uniform array uploads, adding another layer of protection against out-of-bounds uniform array uploads.

Another way in which the ANGLE shader compiler helps to avoid triggering driver bugs is long identifier shortening. Early WebGL implementations were often running into driver bugs with shaders containing long identifiers; it appeared that the highest power-of-two identifier length that would be safe on all drivers in circulation was 32. Mozilla's WebGL implementation thus relies on the ANGLE shader compiler to replace long identifiers in shaders with shortened identifiers, not exceeding 32 characters in length, and obtains from ANGLE the correspondence between original and shortened identifiers.

Thus, compared to OpenGL, WebGL incurs at least one additional shader translation step,* which involves significant transformations even when the target is OpenGL ES. The cost of compiling shaders is thus significantly higher in WebGL than in a good OpenGL implementation. The shader translation diagram, at this point, looks like Figure 2.3. This is not accounting for the special case of running on Windows, which requires additional translation described later in the chapter.

Figure 2.3
Shader compilation steps: WebGL shader source → ANGLE shader compiler → OpenGL or OpenGL ES translated shader source → OpenGL implementation's shader compiler → compiled shader.

† ANGLE is not a Mozilla project. Its website is https://code.google.com/p/angleproject/
* The reader interested in seeing the Mozilla source code for this could use a code search tool such as http://dxr.mozilla.org and search for ShCompile, which is the function offered by ANGLE to compile a shader. Mozilla's source code tree contains its own copy of ANGLE.
2.7 Validation and Preparation of drawArrays Calls

The implementation of drawArrays needs to iterate over all vertex attributes and, for those that are arrays and are consumed by the current program, check that they have sufficient length for the parameters passed to drawArrays. The overhead of doing this on every drawArrays call was not negligible, so we implemented a further optimization: We keep track of when vertex attribute array fetching has already been verified and, if it has, we skip the verification step.

In addition to validation, drawArrays must also prepare some OpenGL state to account for differences between OpenGL and WebGL. As mentioned earlier, null and incomplete textures are swapped for 1×1 black textures. Furthermore, drawArrays must emulate the attribute 0 array. This is specific to the situation where WebGL has to run on top of non-ES OpenGL, as is the case on Mac OS X and on desktop Linux.

In WebGL and OpenGL ES, all vertex attributes are on the same footing; attribute 0 is not special. Like any other attribute, it can be either an array attribute (as enabled by enableVertexAttribArray) or a nonarray attribute (the default state, giving the same uniform value for all vertices). In non-ES OpenGL, attribute 0 cannot be a nonarray attribute giving the same value for all vertices. Instead, when attribute 0 is not array-enabled, it behaves with special semantics that are designed to allow implementing the old OpenGL 1 fixed-function vertex specification API.

Consequently, when a WebGL application tries to use attribute 0 as a nonarray, this forces the WebGL implementation to do expensive emulation work when running on non-ES OpenGL platforms. Specifically, the WebGL implementation must generate and upload an array buffer that contains N copies of the generic vertex attribute value, where N is long enough for the given draw call, and attach this buffer to attribute 0, so that it is once again an array attribute.

Takeaway performance tip: Make sure that one of the array attributes is bound (using bindAttribLocation) to location 0. Otherwise, high overhead should be expected when running on non-ES OpenGL platforms such as Mac OS X and desktop Linux.
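A sketch of the tip in practice; the attribute name and buffer are whatever your application uses:

// Before linking, force a known array attribute to location 0...
gl.bindAttribLocation(program, 0, 'a_position');
gl.linkProgram(program);
// ...and use it as an enabled array attribute when drawing.
gl.enableVertexAttribArray(0);
gl.bindBuffer(gl.ARRAY_BUFFER, positionBuffer);
gl.vertexAttribPointer(0, 3, gl.FLOAT, false, 0, 0);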
2.8 Validation of drawElements Calls

In OpenGL, glDrawElements has the same implementation-defined corner cases as glDrawArrays, plus one: glDrawElements chooses vertices indirectly via the bound element array, and if any of the indices read from the bound element array is out of bounds for the current vertex array attributes, the results are implementation defined. Here, WebGL differs from OpenGL in two ways. First, WebGL implementations must prevent the adverse effects such implementation-defined behavior can have on many drivers in circulation, which include crashing, recovering with severe lag, or dangerously exposing memory contents that shouldn't be accessible. Second, WebGL 1.0 specifically mandates an INVALID_OPERATION error to be generated in this case.

drawElements takes an offset and a count parameter and reads the corresponding contiguous subarray of the element array, starting at the given offset and containing the given count of elements. So drawElements must compute the maximum element in this contiguous subarray, and the rest is similar to drawArrays validation: One just needs to compare this value to the smallest size of all vertex attribute arrays that are to be consumed by the current shader program.

Thus, drawElements has to solve a fairly generic algorithmic problem: how to efficiently compute the maximum value in an arbitrary contiguous subarray of an array of integers. Mozilla's implementation* handles this efficiently enough to make it hardly matter at all, by constructing a binary tree storing precomputed maximums of parts of the array.

Let's illustrate this with an example. Suppose that the element array is of length 8, with the values shown in Figure 2.4. The corresponding tree is built by grouping array entries two by two and storing the maximum of these two values in a tree leaf (thus the tree needs to have four leaves) and continuing, at each level, by storing in each node the maximum value of its two children, as in Figure 2.5.

Figure 2.4
Example element array: 1 2 0 5 1 4 2 3.

Figure 2.5
Complete binary tree storing partial maximums of elements: leaves 2 5 4 3; next level 5 4; root 5.

We can always assume that the binary tree is complete, by rounding the element array size up to the next power of two. Since the tree is complete, we see that it really is a binary heap, and it can be stored compactly in a single array of integers whose size is twice the size of the leaf level, as follows: Leave the first entry of this array unused, store the tree root at index 1, then store the next level at indices [2..3], then the next level at indices [4..7], and so on. Level n is stored at indices [2^n..2^(n+1)−1]. Thus, our previous example is stored in an array of length 8, as in Figure 2.6.

Figure 2.6
Compact array storage of the complete binary tree: (unused) 5 5 4 2 5 4 3.

At this point, this data structure requires exactly the same amount of space as the element array itself. How can we reduce it? Notice that each tree level is half as large as the next level, so the bulk of the memory usage is in the last few tree levels, near the leaves. Now, notice that these last few tree levels are also the least useful, since each node in these levels corresponds to only a small number of element array entries. For example, half of the total memory overhead is caused by the leaf level, but each leaf only contains the maximum of two element array entries, so these leaves are not very useful. Mozilla's implementation actually stores in each tree leaf the maximum of eight consecutive values in the element array, instead of two as in the preceding naive approach. This reduces memory usage by a factor of four, which is usually enough to make it negligible—if not, we can simply increase the grouping.

It is easy to see that the resulting algorithm for drawElements validation takes O(log n) time for a drawElements call consuming n elements. Thus, just about any usage pattern is fine. The overhead can be substantial enough to appear at the bottom of a profile, but it should never be large. There is only one pathological use case that should be avoided, as it incurs unnecessary overhead: when the same element array buffer is interpreted as different types by different drawElements calls (i.e., UNSIGNED_BYTE versus UNSIGNED_SHORT versus UNSIGNED_INT). Using the same element array buffer with multiple index types requires the implementation to maintain a separate tree for each type; there are three possible types, so there can be up to three trees to maintain for a given element array buffer, which multiplies the memory usage and speed overhead by three. Just don't do it. There is no good reason to: Each drawElements call can only work with one index type anyway. Keep separate index types in separate element array buffers.

* The relevant Mozilla source code file for this section is WebGLElementArrayCache.cpp.
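The sketch below models the data structure described above in JavaScript; it is a simplified illustration, not Mozilla's actual code (which lives in WebGLElementArrayCache.cpp): it stores one element per leaf rather than grouping eight, and all names are illustrative.

// Build the complete binary tree of precomputed maximums; tree[1] is
// the root and the leaves occupy the second half of the array.
function buildMaxTree(elements) {
  let n = 1;
  while (n < elements.length) n *= 2; // round up to a power of two
  const tree = new Uint32Array(2 * n);
  for (let i = 0; i < elements.length; i++) tree[n + i] = elements[i];
  for (let i = n - 1; i >= 1; i--) tree[i] = Math.max(tree[2 * i], tree[2 * i + 1]);
  return tree;
}

// Maximum over elements[first..last] (inclusive), in O(log n) time.
function maxInRange(tree, first, last) {
  const n = tree.length / 2;
  let lo = n + first, hi = n + last, result = 0;
  while (lo <= hi) {
    if (lo % 2 === 1) result = Math.max(result, tree[lo++]); // lo is a right child
    if (hi % 2 === 0) result = Math.max(result, tree[hi--]); // hi is a left child
    lo = Math.floor(lo / 2);
    hi = Math.floor(hi / 2);
  }
  return result;
}

// Example from Figures 2.4 through 2.6:
const tree = buildMaxTree([1, 2, 0, 5, 1, 4, 2, 3]);
console.log(maxInRange(tree, 2, 5)); // max of 0, 5, 1, 4 -> 5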
2.9 The Swap-Chain and Compositing Process

In WebGL, the browser is entirely responsible for frame scheduling and compositing. The application merely provides a callback that signals when a new frame is ready to be drawn. This callback is typically called by requestAnimationFrame at the browser's whim, typically scheduling frames for smooth animation, synchronizing with the graphics hardware's refresh signal.

Any WebGL canvas has a backbuffer, which is the default framebuffer object receiving WebGL drawing, and a frontbuffer, which is what the compositor shows. When the compositor is ready to update its display of the canvas, if the backbuffer has received any drawing, it is "presented" (i.e., promoted to being the frontbuffer). However, due to asynchronicity of GPU APIs, the contents of the now-frontbuffer might not have actually finished rendering on the GPU. WebGL implementations must be sure that their presented frames are complete before letting the compositor sample from them.

Depending on browser details, and at least in Mozilla's case as of 2014, compositing involves sending frames from one thread or process, running the WebGL application producing frames, to another thread or process, running the browser compositor consuming frames. Thus, we typically have two threads or processes, both using OpenGL, that have to synchronize their OpenGL commands with each other to prevent one side from reading the other side's rendering until it is complete.

The most trivial method for guaranteeing frame completion is to call glFinish when the canvas presents its buffer to the compositor. However, this does not provide great performance, as it forces the CPU to wait until the GPU is completely done with the command stream. Instead, we merely need to establish a dependency relation on the GPU side alone, where we don't let the GPU read from the frontbuffer until it is done rendering it. Such GPU-side synchronization is extremely platform dependent, especially when the two sides may be in different processes. Many platforms have some variety of synchronization fence that can be inserted into the GPU command stream. On Mac, where we implement our buffers using IOSurface, we only need to call glFlush, while on other platforms glFlush does not guarantee any synchronization.

Once one has solved the problem of correctly compositing complete WebGL frames, the next problem is performance. Synchronization can harm performance by causing one side to wait idly for the other side. With only a backbuffer being rendered to and a frontbuffer being composited, synchronization would take as long as it takes for the other side to finish working.

The typical solution to this problem, which is what Mozilla's implementation does, is called triple buffering. It adds a third buffer called the "staging buffer." When a new frame has been produced by the WebGL application, the backbuffer is swapped with the staging buffer. When the compositor consumes a new frame, it swaps the frontbuffer with the staging buffer. Thus, the backbuffer and frontbuffer are never directly swapped with each other, which removes the need for expensive synchronization between the two sides. Some thread or process synchronization is still needed, but only to prevent two swaps from happening at the same time.
Because these swaps are fast, this is not a problem. Figure 2.7 summarizes triple buffering.

Figure 2.7
Triple-buffered swap-chain: the back buffer is swapped with the staging buffer when a new frame is produced by the application; the staging buffer is swapped with the front buffer when a new frame is composited; the browser compositor then displays the front buffer in the browser window.

The back/staging/frontbuffer stages in the diagram in Figure 2.7 are steps that are specific to WebGL. By contrast, a native OpenGL application would be the equivalent of the browser compositor here, so it would start right away at the last step in the figure's diagram. This means that WebGL incurs memory overhead due to having these additional frame buffers. Triple buffering ensures that there is no significant throughput overhead on the CPU side, but, depending on implementation details that are prone to changing at any time, there can be some latency overhead, and some approaches to triple buffering incur a latency of one frame's time (i.e., typically 1/60 of a second).

The first conclusion of this discussion is that if you don't need to update a frame, avoiding re-rendering it will save not only the time it takes to render it, but also a lot of internal compositing work and synchronization. So if you don't need to update a frame, don't touch it, don't even call clear.

The second conclusion is that the preserveDrawingBuffer context creation flag is best left at its default false value. Setting it to true essentially means that the browser can't just swap buffers anymore, and instead must copy buffers, which is expensive in terms of memory bandwidth.

The third conclusion is that there are additional considerations when evaluating the memory overhead of a WebGL canvas. Depth and stencil components are only needed for the backbuffer, but color and alpha components are needed for all three buffers participating in a triple-buffered swap-chain. For example, a 2-megapixel WebGL canvas with alpha, stored as 32-bit RGBA, pays for two additional RGBA buffers of 8 MB each, resulting in 16 MB of additional memory usage compared to an equivalent OpenGL application, regardless of depth or stencil components.

Finally, a commonly asked question is whether overlaying other HTML content on top of a WebGL context causes overhead. In Mozilla's implementation, WebGL rendering always goes through the stages described earlier, ending up as one layer in the compositing process, so overlaying more HTML content isn't a problem, as the main cost is paid anyway. Consider this a fair exchange for the previously described overhead.
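A sketch of the context creation choice from the second conclusion above; the alpha flag is shown as well, since the memory example assumes an alpha channel (whether an opaque canvas actually saves memory is implementation dependent):

const gl = canvas.getContext('webgl', {
  preserveDrawingBuffer: false, // the default: buffers may be swapped, not copied
  alpha: false,                 // opaque canvas; savings are implementation dependent
});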
2.10 Platform Differences

Probably the most platform-specific part of a WebGL implementation is the swap-chain, as described in the previous section. This is by itself the most typical cause of surprising WebGL performance differences between different platforms or browsers. Aside from that, platforms fall into three broad classes as far as Mozilla's WebGL implementation is concerned: mobile OpenGL ES platforms, desktop OpenGL platforms, and Windows.

2.10.1 Mobile OpenGL ES Platforms

Mobile platforms typically implement OpenGL ES, after which WebGL is directly modeled. This makes the implementation of WebGL relatively easy. The difficulty on mobile platforms is not so much for WebGL implementations, but rather for WebGL applications. Mobile GPUs tend to have very different performance characteristics than desktop GPUs, and WebGL applications need to adapt to them while the browser's WebGL implementation merely stays out of the way.

2.10.2 Desktop OpenGL Platforms

We have discussed many differences between desktop OpenGL and OpenGL ES in earlier sections. In addition to that, desktop OpenGL also presents WebGL implementations with a dilemma, as there are very important differences between OpenGL "core profiles" and "compatibility profiles." Core profiles remove functionality considered deprecated by the OpenGL specifications, including functionality that is part of WebGL. For instance, core profiles do not support such things as the default (zero) vertex array object, which exists in WebGL. This can make WebGL harder to implement on a core profile, and thus Mozilla's WebGL implementation generally runs on top of an OpenGL implementation's compatibility profile despite its typically increased overhead.

2.10.3 Windows

Windows is a completely different situation. On Windows, the first-class, low-level graphics API is Direct3D. While some GPU vendors provide very good OpenGL implementations on Windows, in practice the proportion of Windows machines with good OpenGL drivers is not high enough for browsers to rely on OpenGL exclusively. As a result, web browsers target Direct3D on Windows.

Along with the aforementioned shader compiler, the ANGLE project offers a full implementation of OpenGL ES on top of Direct3D (Chapter 1). On Windows, Google and Mozilla both implement WebGL primarily on top of ANGLE's OpenGL ES implementation.* Due to the large amount of work that ANGLE has to perform internally to implement OpenGL ES on top of Direct3D, certain operations have higher overhead on Windows than on other platforms. Most importantly, shader translation becomes more complex, as shown in Figure 2.8. Thus, on Windows, each WebGL shader compilation involves three compilation steps—compared to two on other platforms and one in native applications. That's overhead.

Figure 2.8
Shader compilation steps on Windows with the ANGLE renderer: WebGL shader source → ANGLE shader compiler → OpenGL ES translated shader source → ANGLE's OpenGL-ES-to-Direct3D translation layer → Direct3D HLSL shader source → Microsoft Direct3D compiler → Direct3D shader binary.

2.11 Extension Interactions

With OpenGL, any extension that is supported is readily available, without requiring an "enabling" mechanism. Thus, an application can blithely use extensions without checking for their support, accidentally relying on an extension that may only be present on some subset of implementations. Instead of such "always on" extension behavior, WebGL requires that applications explicitly enable any extensions they want to use. Before an extension is requested by the application, the WebGL implementation does not enable the extension's features, though it does advertise the availability of the extension via getSupportedExtensions. Only when an application explicitly requests an extension by calling getExtension is the extension's behavior activated.

While great for portability, this does cause issues for implementations: They must support any combination of supported extensions being enabled or not. This creates a much larger matrix of possible semantics to implement, especially in validation code. For instance, OES_texture_float allows creating textures with a type of FLOAT. However, these float textures are not allowed to be sampled with any filter other than NEAREST unless OES_texture_float_linear is enabled. The effects of explicit extension activation extend to shader compilation as well. Since WebGL extensions (such as EXT_frag_depth) can enable new GLSL functionality, the extensions active at shader compilation time can change how the shader is compiled.

* Browsing one's filesystem in the Firefox application directory, one can see a few large files corresponding to the translation layers described here. GLESv2.dll and EGL.dll are the ANGLE OpenGL-ES-to-Direct3D translation layer, and D3DCompiler_*.dll are copies of the Microsoft Direct3D compiler.
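A sketch of the explicit enabling step, using the float-texture pair discussed above; the texture size is arbitrary:

const floatExt = gl.getExtension('OES_texture_float');
const floatLinearExt = gl.getExtension('OES_texture_float_linear');
if (floatExt) {
  // FLOAT textures can now be created...
  gl.bindTexture(gl.TEXTURE_2D, gl.createTexture());
  gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, 64, 64, 0, gl.RGBA, gl.FLOAT, null);
  if (!floatLinearExt) {
    // ...but may only be sampled with NEAREST filtering.
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);
  }
}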
2.12 Closing Thoughts

While some types of overhead affect any implementation of WebGL, the resulting capabilities are relatively uncompromised. WebGL aims to strike a balance. If kept too portable, it would not be sufficiently capable. If it were to expose the full functionality and performance of the platform, it wouldn't be very portable.

WebGL's promise is to offer a standardized, portable platform for high-performance graphics on the web. Given implementations that handle driver differences sufficiently well, application developers can treat WebGL as a single platform, instead of having to worry about the minutiae of underlying implementation differences. However, there are basic limitations to the realization of this vision. The most important one is the very different performance characteristics of existing GPUs and drivers, especially once mobile devices are factored in. Because of that, the effective portability of any high-performance graphics API is limited by uneven performance characteristics. The WebGL specification and its implementations are designed to alleviate these performance issues as much as possible, and this chapter's primary goal was to complement that mission by helping application developers understand some of WebGL's more important performance-related specificities.

3 Continuous Testing of Chrome's WebGL Implementation

Kenneth Russell, Zhenyao Mo, and Brandon Jones

3.1 Introduction
3.2 Starting Points
3.3 Building Out the GPU Try Servers
3.4 Stamping Out Flakiness
3.5 Testing Your WebGL App against Chromium
Acknowledgments

3.1 Introduction

In Chrome 28 a bad WebGL bug—crbug.com/259994, which intermittently broke Google Maps and many other WebGL applications—shipped to Stable and, embarrassingly, the issue was reported to our team primarily via Twitter. The bug slipped through automated testing for two principal reasons. First, the majority of the automated testing in the open-source Chromium project, and the Chrome browser built on top of it, occurred on virtual machines, which did not exercise the GPU-accelerated rendering path that is taken on most end users' systems. The few machines with physical hardware and real GPUs had not received enough attention from the overall team and were not reliable enough to detect intermittent failures. Second, there were not enough "try servers"—the banks of machines that developers use to certify their code changes during development and check-in time—running on physical hardware with actual GPUs. The effect was that it was impossible to detect potential breakage of Chromium's graphics stack ahead of time, even though there were dozens of developers working directly on this code.

In response, we led an effort to overhaul Chromium's GPU bots and try servers. Working closely with Chrome's infrastructure, labs, and GPU teams, as well as the Chrome team in general, we respecified the hardware, changed OS configurations, and rewrote nearly all the software and tests running on the bots in order to eliminate dozens of sources of flakiness.
We built out a bank of GPU try servers: physical hardware that runs graphics tests against every incoming changelist, or CL, to the Chromium project and the Blink rendering engine, complementing Chromium's many pre-existing VM-based test bots. The new try servers and continuous testing, or waterfall, bots are among the most reliable on the Chromium project and typically run hundreds of consecutive builds with no errors. They test real-world web content, including Google Maps, and reliably detect breakage of not only Chromium's graphics stack, but also the browser itself. More recently, they have detected highly intermittent bugs in core browser features—requestAnimationFrame and video playback—affecting real web pages. They ensure that WebGL is treated as a mission-critical technology and will work as expected for the hundreds of millions of Chrome users.

This chapter describes the hardware, operating system, and software changes associated with bringing these machines to high reliability. It is our hope that this experience will help other teams replicate similar configurations for their own graphics testing. Nearly all of the code described in this article is open source and available as part of the Chromium project.

3.1.1 A Few Statistics

To help motivate the scale of the project, here are a few statistics gathered from the source code repositories of the Chromium and Blink projects.* Considering the 3 months between September 1, 2014, and November 30, 2014, there was an average of approximately 4,400 commits per month to Chromium—or about 144 commits per day—and approximately 1,600 commits per month to Blink, or about 51 per day. On a monthly basis there were approximately 720 unique committers to the Chromium project and 240 unique committers to the Blink project. According to OpenHub.net† there are approximately 13 million lines of code in Chromium.

Building Chromium from scratch takes approximately an hour. An incremental relink of the statically linked binary takes approximately 5 minutes. To speed development, Chromium's build system supports compiling the product into multiple shared libraries, which reduces incremental rebuild times to a few seconds. The shipped product is statically linked to reduce startup time.

3.1.2 Background of Chromium's Test Setup

Chromium and Chrome use the open-source Buildbot framework as the basis for the browser's continuous integration testing. Chrome's infrastructure team has written a tremendous amount of software that augments this basic setup, including integration with the Rietveld code review tool, a commit queue that automatically tests every incoming code change to the browser, systems for distributing and sharding test execution, and many others.

Developers interact with the automated test infrastructure in two primary ways: via the waterfalls and the commit queue. The waterfalls run on a continuous basis. See Figure 3.1 for a visual example of Chromium's main waterfall.

Figure 3.1
Chromium's waterfall view.

Whenever new code is checked in to the browser, the bots on the waterfall notice this, build the top-of-tree code, and run tests against it. It is often the case that the waterfalls group batches of commits together; there aren't enough bots to build and test the top-of-tree code at each commit.

* See the count-chromium-commits script in the WebGL Insights github repository.
† https://www.openhub.net/p/chrome
A consequence of this fact is that if a test begins to fail at a certain point, it's often necessary to scan through several, or sometimes dozens of, commits to find the one at fault, which is a painstaking process. It is for this reason that the best way to catch failures is with the commit queue, at code submission time.

The commit queue, in counterpoint, builds and tests each individual code change to the browser. Developers upload work in progress to the Rietveld code review tool. Once the code has been reviewed, the easiest way for the developer to submit it is to simply check the "commit" checkbox. This submits the changelist to the try servers: a large bank of machines, both physical and virtual, representing the majority of the operating systems and configurations on which Chrome runs. The try servers check out the top-of-tree code, apply the developer's patch to it, compile the product, and run automated tests. Only if all of the automated tests run successfully is the change committed to the code base. Developers may also trigger try jobs manually, without intending to also commit their changes to the code base, to reduce the amount of manual testing they must perform. Figure 3.2 shows the user interface for the code review tool on a representative CL.

Figure 3.2
Chromium's Rietveld review tool, showing the trybot results from sending the patch through the commit queue.

The try servers run all of their jobs in parallel. It is possible that two or more code changes, touching different files in the source base, actually conflict with each other and would cause test failures when both were applied. In the worst case, two such changes might be tested in parallel, each found to pass all of the tests, and be committed. The second of these changes would cause tests to start failing on the main waterfall and "close the tree": prevent further check-ins until the tests are fixed, a process requiring manual intervention. In practice, issues like this happen rarely, so the best automated way to avoid breaking tests and closing the tree is to send all incoming source changes through the commit queue.

The try servers represent the bulk of the machines in the Chromium project's automated testing infrastructure. For every configuration represented on the waterfall, such as Linux in Release (non-Debug) mode, typically 20 to 30 try servers are required in order to handle the load from developers testing their code changes. In order to save computing resources, it's critically important that the try servers be both reliable and fast. Tests that fail intermittently not only cause retries, which can double or triple the computing cycles needed to test a particular change, but also cause intermittent test failures on the main waterfall and ultimately destabilize the product. Eliminating flakiness in tests was a primary goal of the project to overhaul Chromium's GPU bots, as will be seen throughout this chapter.

3.2 Starting Points

At the start of the project, Chromium's GPU bots were a small subset of the browser's automated testing infrastructure. The configurations included Windows, Mac, and Linux, running a combination of GPUs from AMD, Intel, and NVIDIA. The GPU bots were mainly waterfall machines.
This essentially meant that no Chromium developers were using the GPU try servers. Tests failed intermittently on the waterfall, seemingly for different reasons on different platforms. The intermittent test failures prevented scaling up the number of try servers for the GPU bots, because tests wouldn’t run reliably on them. The first priority, then, was to diagnose and eliminate the causes of the intermittent test failures. 3.2.1 Hardware and OS Configuration Changes The GPU bots are necessarily physical machines with graphics cards installed, not virtual machines. Initially the bots were run headless: No monitor was plugged in, and remote desktop software (VNC on Linux, Mac OS’s built-in VNC-compatible screen sharing, and Remote Desktop on Windows) was used to log in, administer the machines, and debug test failures that were seen on the machines. Attempts had been made to trick some of the machines into thinking they had a monitor plugged in, in order to activate the GPU—a dummy VGA dongle with resistors shorting some pins together. We observed that logging on to the machines to debug test failures often caused the failures to disappear. We also found that running Chromium’s GPU-accelerated rendering path over remote desktop software typically caused incorrect rendering, including, but not limited to, garbled colors and failure of the GPU code paths to initialize properly. The first order of business was to make it possible to remotely log in to these machines without perturbing their execution. Chrome’s Labs team installed an IP KVM, which plugs into the target machine’s video output and USB ports, digitizes the video, and streams it to a client machine via a web browser. Because the IP KVM requires no software to be installed on the target machine, it not only avoids the interference of running remote desktop software, but also acts as a monitor, causing the GPU to activate properly. The specific model of IP KVM the Labs team used is a Raritan Dominion KX II. We evaluated other brands that claimed to speak the VNC protocol, which would avoid the need for any custom software on the client machine, but did not have success with these products. Chrome’s Labs team found it necessary to provide a fake monitor’s extended display identification data, or EDID, on a couple of the GPUs in order to make them use the correct screen resolution. The switch to the IP KVM was a crucial step, eliminating strange side effects of remote desktop software and a certain class of reliability problems. All of Chromium’s Linux and Windows GPU bots are now hooked up through a Raritan. On Mac OS, the built-in screen-sharing software works well. We don’t hook up the Mac bots to the Raritan because doing so is not necessary for our Mac Minis, and plugging Retina display MacBook Pros into the Raritan changes the display configuration, losing the high-DPI nature of the builtin display and defeating the purpose of testing on this hardware configuration. After switching to the IP KVM we still saw intermittent failures of our pixel tests on Linux. These tests read back the rendered image from the screen, ensuring that the browser’s entire end-to-end rendering path is working correctly. Occasionally these tests would capture a black screen, a symptom of a screen saver or power saver kicking in. The problem persisted despite disabling all power-saving features from the graphical user interface’s control panels. 
It turns out that the X server itself interfaces with the Display Power Management System (DPMS), providing a low-level power-saving mode that can't be configured from those control panels. We confirmed that this was the mechanism causing the screen to go blank by watching the test execution, waiting for the screen to blank, logging in via a new ssh terminal, and running the following command:

xset s reset

The monitor immediately awoke and the pixel tests started to pass again. After extensive web searching, we found and made the following changes to /etc/X11/xorg.conf:

In the "ServerLayout" section:
    Option "BlankTime" "0"
    Option "StandbyTime" "0"
    Option "SuspendTime" "0"
    Option "OffTime" "0"

In the "Monitor" section:
    Option "DPMS" "false"

After deploying this change to all of our Linux machines, they began to run the pixel tests reliably. To summarize, the key hardware and OS configuration changes made to our GPU bots in order to improve their reliability were

• Hook up all Linux and Windows machines to an IP KVM.
• Disable all power saving.
• Disable DPMS on X11 systems.

3.2.2 Test Harness Changes

Since their inception, Chromium's GPU bots ran the WebGL conformance tests against the browser's top-of-tree code on a continuous basis. Initially the tests were run in a specialized test harness called content_browsertests, which loads individual web pages in an environment similar, but not identical, to the full web browser's rendering pipeline. Unfortunately, intermittent test failures were seen in this environment that defied debugging attempts. On Windows in particular it looked like race conditions in the content_browsertests harness startup and shutdown code were causing intermittent crashes. Despite best attempts, debugging these races proved fruitless.

We decided to move away from the content_browsertests harness and start running the majority of the GPU bots' tests in the browser itself, driven by a test harness called Telemetry (http://www.chromium.org/developers/telemetry). Testing the browser rather than a specialized test harness carried several benefits. First and foremost, it enabled us to test what we ship, which is a central tenet in Chromium's automated testing philosophy. Second, it ran the tests in a more realistic environment: Where content_browsertests started up and shut down a portion of the browser for each test, Telemetry simply navigated the browser to the URLs of each of the WebGL conformance tests in turn, which is how they execute when launched manually. Third, the performance of the test suite actually improved when switching to Telemetry, since navigating a given tab from URL to URL is much faster than closing a tab, opening a new one, and navigating to a URL, which was effectively what content_browsertests was doing.

Switching to Telemetry worked around the race conditions in the content_browsertests harness and for the first time allowed the WebGL conformance tests to run reliably on Chromium's GPU bots for hundreds of runs at a time. The GPU bots run not only the WebGL conformance suite, but also a set of other tests which verify things like pixel accuracy of certain pages' rendering. These GPU tests must perform operations that aren't allowed in ordinary web browsers for security reasons, such as taking snapshots of the screen or querying the browser's internal state. The specialized test harnesses, including content_browsertests, have access to these facilities.
In order to improve these tests' reliability, they were ported from content_browsertests to Telemetry. It was necessary to expose certain privileged primitives in the Telemetry test harness, as well as to port the GPU tests that had previously been written in C++ to the new framework, which is written in Python. This work took a couple of months, but allowed content_browsertests to be removed entirely from the GPU bots. See the related bugs for details (http://crbug.com/278398, http://crbug.com/308675, http://crbug.com/365904).

The switch from content_browsertests to Telemetry on the GPU bots significantly increased their reliability, to the point at which we could consider scaling up the infrastructure and running this set of tests against all incoming code changes to the browser—building the GPU try servers.

3.2.3 Pixel Tests

One thing we learned early on in the testing of Chrome's rendering was that bugs can appear at any step. Even capturing a texture at the end of the rendering pipeline and making sure the content looks OK would not guarantee that it is presented to the screen correctly. Therefore, pixel tests are built upon OS-level screen capturing APIs. A few times, a failure turned out to be a message window from the OS popping up and intersecting with Chrome's window, but that doesn't happen often, and a quick inspection of the failed rendering followed by closing the offending window would fix the problem.

Reference images are stored in the cloud and indexed by OS, GPU, and other dimensions that affect rendering—for example, with or without antialiasing. We tried to generate reference images on local bots, but that solution turned out to be problematic. First, incorrect reference images can go undetected for a long time, and bugs may creep in. Second, a patch that changes rendering has to land and be integrated into a bot first, turning it red, before the reference images on that bot can be updated. Therefore, we switched to a cloud-based solution, where inspecting and updating reference images is easier.

To compare rendering results to reference images, we use a simple pixel-by-pixel exact matching method. This is necessary because even a one-pixel difference may indicate a bug in Chrome. For example, antialiasing in WebGL could be turned off incorrectly. In the future, we may look into smarter image comparison algorithms.

3.3 Building Out the GPU Try Servers

Once the GPU waterfall had achieved a good level of reliability, we found that code changes to the web browser caused some tests to fail nearly every day, requiring manual intervention and reverting of the associated changes. For efficiency reasons, most of Chromium's tests don't run against the full browser; the Telemetry-based GPU tests are some of the few on the waterfall that do. For this reason, a bug would occasionally slip through the commit queue and wouldn't be caught by anything else but the Telemetry tests running on the GPU bots, even though the bug was completely unrelated to graphics. One surprising failure (https://codereview.chromium.org/421643002/) turned out to be breakage of the unbranded Chromium build of the browser; the test harnesses and full Chrome browser worked fine with this particular change. The only way to stop the flood of breaking changes was to add the GPU tests to the default set of tests run against all incoming changelists to Chromium.

3.3.1 Recipes

Early in the overhaul of the GPU bots it became clear that many configuration changes would be needed in order to achieve reliability.
Buildbot's architecture is not set up to handle many incremental changes well. Normally, each change to the steps that are run on a particular bot—even something as small as changing the command line arguments passed to a particular test—requires all of the machines on that waterfall to be rebooted, which is not only a major inconvenience, but also painfully slow. In order to reduce the number of waterfall restarts needed to change machines' configurations, Chrome's infrastructure team developed a new framework called recipes; see the Pointers to Code and Documentation section at the end of the chapter for links. Recipes worked within the Buildbot framework, but moved the responsibility of deciding which steps to run, and how to run them, from the computer controlling a particular waterfall to the individual bots on that waterfall. Using the new recipe framework, when a change was checked in to the script containing the steps for a particular bot, it would take effect on the next build that that bot ran, without requiring any reboots.

The productivity improvements and flexibility afforded by recipes, compared to the way Buildbot is ordinarily used, can't be overstated. It's not an exaggeration to say that, without recipes, Chromium's GPU bots would not have been overhauled successfully. Before making any significant changes to the GPU bots, including the test harness switches described in the previous section, we converted the bots from the old-style Buildbot scripts to recipes (see http://crbug.com/286398 and related bugs). This made it possible to make and deploy smaller and more incremental changes, and made it easier to test those changes locally. During the course of this project, a significant percentage of the work involved improvements to the recipe framework, adding support for new features and diagnosing and fixing unexpected behaviors. This work proved useful to the larger Chromium project when converting other bots to recipes.

3.3.2 Builder/Tester Split

Chromium's GPU waterfall runs the same set of tests on multiple kinds of GPUs: on the desktop, typically those from AMD, Intel, and NVIDIA. The precise GPUs were originally selected "organically"—whatever was available in house at the time—though it was a specific design criterion to always use mid-range GPUs that were likely to be chosen by end users. Later, as the try servers were built out, it was necessary to standardize on a few GPU configurations, so that multiple machines could reliably produce the exact same pixel results. When a platform and a type of GPU are selected, usually the latest graphics driver available is installed. An exception was OS X Lion, where we had to fall back to an earlier but more stable version (10.7.5). Driver updates are performed only if a new driver, by fixing bugs, provides significantly more test coverage. Chromium's GPU try servers do not yet include mobile devices due to resource constraints, but some mobile devices are present on the waterfall.

At the beginning of the project, each machine on the waterfall both compiled the browser and ran the tests against it.
It's clearly not efficient to compile the same code multiple times just in order to test on multiple GPUs, so one of the first changes made to the waterfall's configuration was to split the bots into separate builders and testers (http://crbug.com/310926). Once this work was complete, the physical GPU testers could be changed to relatively low-end machines, since they no longer needed to perform the heavy-duty compilation step. It was also no longer necessary for the builders to contain GPUs since they ran no tests, so they could be replaced with virtual machines.

3.3.3 Isolates

Initially, the builders zipped up the binaries they'd compiled and copied them to a server, and the testers downloaded the binaries and unzipped them into an identical Chromium workspace before running them. The data files for the tests lived in the Chromium workspace and were not included in the archives copied from builder to tester. The data files were quite large—larger than the binaries in some cases—and it would have been inefficient to copy them between machines each time. Keeping an up-to-date copy of the Chromium workspace on the testers proved to be problematic. After the builder/tester split, the single largest percentage of the testers' cycle time was actually spent updating the Chromium workspace!

Fortunately, Chrome's infrastructure team had been developing a new mechanism for distributing tests among machines called isolates (http://www.chromium.org/developers/testing/isolated-testing). An isolate describes all of the files needed to run a particular test, including the binaries and any data files. Isolates differ significantly in implementation from a simple mechanism like a zip archive, however. Each file is uploaded and compressed separately, and if the contents of a particular file have been uploaded to the server recently, it is neither uploaded again by the builder nor downloaded again by the tester.

The GPU bots were converted from uploading and downloading zip archives of their binaries to using isolates in crbug.com/321878 and related bugs. This work involved significant rewrites of the tests to avoid storing any persistent data on the bots' local disk, since isolates' execution is designed to be transient. In particular, the results of the bots' pixel tests had to be placed in cloud storage (http://crbug.com/330053 and http://crbug.com/330774). While this work took some time to complete, the results were tremendous. First, the bots became much easier to maintain, since all of their results were stored in the cloud and easily accessible over the web. It was no longer necessary to log in to a particular bot to examine its most recent test results. Second, the cycle time decreased dramatically: The bots became between 1.19× faster (for one of the Mac testers) and 2.88× faster (for one of the Windows testers). The improved hardware utilization finally allowed the GPU try servers to be built.

3.3.4 Ramping Up the Try Servers

Chrome's Labs team did a tremendous amount of work purchasing and setting up the hardware for the GPU try servers. In order to handle the load of incoming changelists, 20 each of Windows, Linux, Mac, and MacBook Pro Retina machines were purchased. As described earlier, the Windows and Linux try servers were hooked up to an IP KVM. The Macs were left headless.
Ten beefy physical servers per platform were purchased as the builders for these testers, based on cycle time measurements on the waterfall. Chromium's commit queue (CQ) contains an experimental framework that helps with the addition of new machines. A percentage of the load of incoming changelists can be experimentally sent to a particular try server in order to see whether it tests each changelist successfully and can keep up with that load. We helped generalize this framework to support multiple concurrent experiments at different load levels, in order to allow the Linux, Windows, and Mac GPU try servers to be brought online at different times, and we began sending CLs to the new machines.

While the try servers reliably ran jobs to completion, we quickly discovered that the bottleneck in the system was not the physical GPU testers, but the builders. The cycle time estimates we'd used to compute the number of builders and testers were off by a large margin. We originally specified that we needed 10 builders for each 20 testers, but it became clear that 30 builders for each 20 testers were necessary. It also turned out that the beefy physical builders were not fast enough to justify the additional hardware cost and that using virtual machines for the builders was more effective. Chrome's Labs team came through once again in a pinch, reallocating the builders as virtual machines and providing 30 VMs per configuration (Windows, Linux, and Mac). In particular, the Chrome project was running low on capacity, and it was not possible to provision new Mac VMs at the time; the Labs team negotiated with other projects that were not using all of their capacity in order to fulfill our hardware requirements months ahead of schedule, for which we remain grateful. After the reprovisioning of the builders, the load testing of the GPU try servers otherwise went smoothly, and the GPU bots were finally made Chromium tree closers in crbug.com/327170.

3.3.5 Analyzing CL Dependencies

Due to the large number of changes submitted to the Chromium project on an ongoing basis, it's particularly important to utilize the automated testing machines efficiently. If an incoming CL can't possibly affect the tests on a particular platform, the tryjobs for that CL shouldn't cause tests to be run on that platform. For example, if a CL touches Windows-only code, it definitely should not cause tests to be rerun on the Mac trybots. Dependency analysis was added to Chromium's build system some time ago (http://crbug.com/109173), and support was added to the GPU trybots more recently, once they were stabilized (http://crbug.com/411991). Experience has shown that this dependency analysis reduced load on the try servers by about 20% (https://code.google.com/p/chromium/issues/detail?id=411991#c2). The cost savings of this technique are significant, and we recommend incorporating it into any project utilizing a system similar to Chromium's trybots and commit queue.

3.4 Stamping Out Flakiness

The primary goal of building out the GPU try servers was to eliminate the sort of intermittent bugs described at the beginning of this chapter. Once the GPU bots achieved a high level of reliability, we expected that it would be easy to maintain that quality level. While the try servers have unquestionably made it easier to keep Chromium's GPU code running correctly, a few interesting and surprising intermittent failures came to light that had to be diagnosed and fixed manually. The resolution of these bugs may be useful to other projects.
3.4.1 Glib-Related Timeouts

One day in October 2013, tests suddenly started timing out on Chromium's Linux GPU bots (http://crbug.com/309093). The symptom was that a subprocess of the web browser was deadlocked inside a call to malloc() inside the Glib library, which sits underneath Gtk+. Glib was attempting to make a connection to the desktop bus, or Dbus, daemon, which provides information like which theme is in use by the window manager. The bug was not reproducible either on our local workstations or when logged on to the affected bots. Members of Chrome's security team helped investigate and indicated that it looked like a typical bug with async-signal-unsafe code on POSIX. It's fairly difficult to find a concise description of the problem online. The bug is related to the use of malloc() between calls to fork() and exec() when spawning subprocesses. Specifically, what happens is this:

• The parent process calls fork(), and it happens that, at the moment fork() is called, another thread is in the middle of a call to malloc().
• The internal malloc lock is then held forever in the child process, even after the parent process completes its call to malloc(), because the child is a copy-on-write clone of the parent.
• Before the child process calls exec() to overwrite its image with the new subprocess, it calls malloc(). The internal malloc lock is held, and this call to malloc() never completes. The child is then deadlocked.

The rule of thumb is that a process must never call malloc(), or a function that calls malloc() internally, between calls to fork() and exec(). Unfortunately, Glib does this when it notices that it doesn't have a connection to the Dbus daemon; it calls opendir() to try to find the dbus-daemon program in order to spawn it, but it does this after calling fork(), and opendir() calls malloc() internally.

We assumed that a change to Chromium's initialization of Gtk+ had provoked this problem, and tried to move the Gtk+ initialization earlier, before Chromium had spawned any threads. Unfortunately, this didn't help. Studying Glib's code more deeply, it became clear that if it went down this code path at all—if Glib was initialized when the DBUS_SESSION_BUS_ADDRESS environment variable was not set—Glib first started threads internally, and then spawned a subprocess, in order to make a connection to the Dbus daemon. Glib inevitably raced against itself. This problem had been present all along, and only manifested when the timing of the browser's code changed subtly.

The reason this bug happened on the GPU bots and not on regular workstations, even though the bots have a real monitor and desktop session, is that the bots' environment for launching processes is slightly different from that of real users. When launching programs from a terminal window that is open on the desktop, the Dbus daemon is implicitly already started; the desktop itself is responsible for starting and connecting to it. However, the bots' scripts launch at startup, and while they properly set up an X display connection to be able to open windows, they don't make a connection to the Dbus daemon. The solution to this problem was to modify Chromium's testing scripts to manually launch the Dbus daemon, which causes the DBUS_SESSION_BUS_ADDRESS environment variable to be set, before starting any tests or launching Chromium.
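As a rough sketch of what such a launcher change looks like (the actual Chromium scripts differ; run_tests here is a hypothetical stand-in for the bot's test entry point):

eval $(dbus-launch --sh-syntax)   # exports DBUS_SESSION_BUS_ADDRESS and DBUS_SESSION_BUS_PID
run_tests                         # hypothetical test entry point
kill $DBUS_SESSION_BUS_PID        # shut down the daemon we started, so it does not leak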
This workaround solved the problem and was deployed across all of the Chromium project's Linux bots. This problem recently manifested again on the Google Maps team's automated test machines, so it is a fairly common issue that should be known to anyone performing automated testing of graphical programs on Linux. Note carefully that it is necessary to manually shut down these explicitly launched dbus-daemon processes, or they will leak!

3.4.2 Google Maps Timeouts

In July 2014, the automated test of Google Maps that runs on the GPU bots suddenly started intermittently timing out on the Windows try servers (http://crbug.com/395914). The symptom was that the test's JavaScript began to execute before all of the scripts on the page were loaded, leading to exceptions when undefined symbols were referenced. The regression window was quite large and included changes in the V8 JavaScript engine, Telemetry, and DevTools, upon which Telemetry relies. Fortunately, the problem was reproducible on one of the bots using binaries compiled on our local workstations. A manual bisect was done using git bisect and revealed that a change to Blink's Content Security Policy implementation had introduced the flakiness. This was quite surprising, as it seemed that this particular change did not alter the behavior of the code, but the results were indisputable. The change was reverted and the tests returned to their previous levels of reliability. This is another example of a change unrelated to graphics causing broader test failures in the browser, and this bug would have inevitably caused failures on end users' machines.

3.4.3 HTMLVideoElement and requestAnimationFrame Timeouts

In August 2014, intermittent timeouts of several WebGL conformance tests—including context_lost_restored, premultiplyalpha_test, and webgl_compressed_texture_size_limit—spontaneously appeared and were reported (http://crbug.com/407976). The failures happened rarely, and there were no obvious changes to the browser's code that could have made them start happening. They were not reproducible on our local workstations, and there were no apparent patterns in the failures. After some struggle, a crucial diagnosis was made: there was in fact a pattern among some of the timeouts. Several of the failing tests used the browser's requestAnimationFrame API, or rAF for short, to drive their execution. With this knowledge, it was possible to categorize the failures. We pushed to leave the failing tests in place and add logging to the browser, and two bugs were identified. First, rAF calls seemed to be intermittently ignored (http://crbug.com/393331). Second, HTMLVideoElements created on the fly from JavaScript seemed to occasionally never fire any of their event handlers (http://crbug.com/412876).

The investigation of the rAF issue dived deep into the browser's code, across the boundary between the Blink rendering engine and the embedder, into the browser's compositor and scheduler. Finally, a particular piece of code was identified that changed the handling of rAF callbacks depending on whether accelerated compositing was enabled. This code was vestigial, dating back to a time before GPU-accelerated rendering of web pages was introduced into Chromium, and it contained a long-standing race condition. Eliminating this behavioral difference fixed the bug. There is no good explanation for why this bug appeared when it did, except that the timing in the browser must have changed slightly, exposing the preexisting race condition.
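To see why a dropped rAF callback turns into a timeout rather than an explicit failure, consider the shape of a rAF-driven test (an illustrative sketch only; runStep and finishTest are hypothetical harness hooks, not the actual conformance suite code):

function runFrames(remaining) {
    requestAnimationFrame(function() {
        runStep();                // hypothetical per-frame test work
        if (remaining > 1) {
            runFrames(remaining - 1);
        } else {
            finishTest();         // hypothetical harness hook
        }
    });
}
runFrames(10);

If the browser ever fails to deliver one of the scheduled callbacks, finishTest is never reached, and the harness eventually kills the test as a timeout, which is exactly the symptom observed on the bots.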
Investigation of the HTMLVideoElement bug showed that the video element itself was being garbage collected before it should have been. The basic contract of the video element, as with all HTML elements, is that it may not be garbage collected until doing so would have no observable effects on the program. There was a long-standing race condition in the video element's computation of whether it had "pending activity": if a garbage collection was triggered between the time it had completed loading its network resources and the time it played its first frame, it would be prematurely collected and never trigger any of its callbacks. The affected tests would time out unless those callbacks were called.

The fixes for these two bugs addressed problems that could affect real web pages and that had been present for years, though, surprisingly, they had only shown up recently. It was highly coincidental that they began happening at nearly the same time, but it has been our experience that multiple failures tend to arrive together, whether due to Murphy's law or because multiple developers tend to focus on the same general area of the code at the same time, occasionally causing multiple breakages. The experience of diagnosing and fixing these bugs underscored to us the importance of tracking down and fixing flaky test failures urgently, instead of allowing them to pile up.

3.4.4 Shutdown Timeouts on Windows

In October 2014, intermittent test timeouts appeared on the Windows GPU bots; one of Chromium's subprocesses seemed to be failing to exit cleanly (http://crbug.com/424024). This problem was not easily reproducible on developers' machines, though it was more readily reproducible on the bots. The symptom was that a child process of Chromium's main "browser" process failed to launch successfully, and was stuck in the suspended state before it had run any code. Many hypotheses were put forward for why this behavior started happening. The code that managed Chromium's subprocesses had undergone some changes, but the first failure of this sort happened before the first change to that code had been committed. A couple of bugs were identified and fixed in the subprocess management code, but the stuck processes persisted.

For background, Chromium launches sandboxed processes on Windows using the following rough steps:

1. The target process is created in the suspended state.
2. The sandbox policy is applied to the target process.
3. The target process is added to a Windows job object, which will cause it to be automatically terminated when the parent process exits.
4. The target process is resumed and begins to run.

It is apparent that there is a brief race where, if the parent process is forcibly terminated between steps 2 and 3, the target process will not be automatically terminated. Logging was added to the browser that indicated that this was happening.

The Telemetry harness uses Python's subprocess management primitives to start up and shut down the web browser being tested. It turns out that Python's Popen.terminate() call uses Windows' TerminateProcess API to shut down the processes it launches; see https://docs.python.org/2/library/subprocess.html. The target process is forcibly terminated immediately, without running the process's usual exit code path.
The hypothesis was that Telemetry was accidentally terminating the browser exactly in the race condition window above, leading to the leak of the suspended subprocesses. The solution was to enumerate the target process's top-level windows from Python and send the WM_CLOSE message to the web browser's windows, allowing them to close cooperatively. This particular bug affected only the bots, not end users, because the browser is typically exited by either closing all windows or using the "Quit" menu option. It's once again surprising that it started occurring without any related code changes; the timing of the product simply changed in such a way that a preexisting race condition started to appear. Nonetheless, it was essential to fix the intermittent failures because they would otherwise mask real failures in the product.

3.4.5 Conclusions

On a project as large as Chromium, it's essential to catch as many bugs as possible before they reach the source tree. This is even more critical for WebGL in Chromium, because all graphics commands are sent to, and executed in, a separate GPU process, which adds extra dimensions of complexity. Seemingly unrelated code changes can break WebGL. The problem becomes even worse when the breakage is intermittent, happens only on certain hardware, and happens rarely. Therefore, having a GPU testing farm that covers the major platforms and GPUs is the only way to move forward.

It's been our experience that the GPU try servers have caught the majority of the failures that would otherwise have required manual diagnosis and reverting. We would encourage other projects to develop similar infrastructure at the beginning of the project. Delaying only increases the overall amount of work that will be required. The intermittent failures that have slipped past the try servers have been relatively few and far between, though some of them have been difficult to diagnose and fix. We perceive a need for more stress tests that will more reliably expose these race conditions, and hope to develop and deploy such tests once more computing resources are available. Fixing these intermittent problems as they have arisen has resulted in a more reliable browser and WebGL implementation.

3.5 Testing Your WebGL App against Chromium

It is possible that your WebGL application may break when a user's Chrome browser is auto-updated. While maintaining backwards compatibility is a primary goal, occasionally a backward-incompatible change is required. For example, the invariant rules for shader varyings were enforced in a Chromium update; WebGL applications breaking these rules had previously run fine on desktop systems but not on mobile devices, and enforcing the rules made these applications portable. Also, hopefully rarely, a bug might creep into Chromium and affect your application. Therefore, testing against Chromium builds and detecting a problem as early as possible is valuable for WebGL developers. Setting up a testing bot like one on Chromium's GPU waterfall requires the following steps:

1. Check out Chromium's Telemetry: https://chromium.googlesource.com/chromium/src.git/+/master/tools/telemetry/.
2. Write a test using Telemetry. You can find examples in https://chromium.googlesource.com/chromium/src.git/+/master/content/test/gpu/.
3. Write a Python script to check for new Chromium builds in https://commondatastorage.googleapis.com/chromium-browser-continuous/index.html, download a new build, and run your test against that build.

If a bug is detected in Chromium, please file it at crbug.com.

3.5.1 Pointers to Code and Documentation

Chromium's recipes can be found in the workspace https://chromium.googlesource.com/chromium/tools/build/. The reusable modules invoked in multiple recipes are in the directory scripts/slave/recipe_modules/. The GPU bots' recipes live in scripts/slave/recipes/gpu/. Documentation for recipes overall is in scripts/slave/README.recipes.md. The GPU bots are documented in http://www.chromium.org/developers/testing/gpu-testing, and the GPU recipe specifically is documented in http://www.chromium.org/developers/testing/gpu-recipe.

The tests that run on the GPU bots are in the Chromium workspace, https://chromium.googlesource.com/chromium/src/, under content/test/gpu/gpu_tests. Some tests use page sets, which live in content/test/gpu/page_sets/, and some reference data that are contained in content/test/data/gpu/. The tests also refer to the WebGL conformance tests, a snapshot of which is contained in third_party/webgl in a full Chromium checkout. See http://www.chromium.org/developers/how-tos/get-the-code for information on checking out Chromium's code.

Acknowledgments

This project would not have been possible without the help of dozens of teammates. We thank all of the members of Chrome's Infrastructure and Labs teams for their support and guidance on this project. We especially thank Robbie Iannucci for superb help, guidance, and leadership during the long effort to fully convert the GPU bots to recipes. We also specifically thank Bryce Albritton, Sergey Berezin, Aaron Gable, Paweł Hajdan, Chase Phillips, Marc-Antoine Ruel, Peter Schmidt, Vadim Shtayura, Mike Stipicevic, Ryan Tseng, and John Weathersby for their help diagnosing and fixing dozens of issues, getting the machines purchased and set up, and keeping them running smoothly. We thank Nat Duca, Tony Gentilcore, and David Tu from the Telemetry team for their help and willingness to accept changes to the harness to allow it to be better used for correctness testing. We thank Pavel Feldman and the rest of the DevTools team for their support in extending Telemetry's capabilities. We thank Brian Anderson, John Bauman, Philip Jägenstedt (Opera), Jorge Lucangeli Obes, Carlos Pizano, Julien Tinnes, Ricardo Vargas, Hendrik Wagenaar, and Adrienne Walker from the extended Chromium team for their help fixing some of the thorniest bugs. We thank John Abd-El-Malek for his help and encouragement to add the CL dependency analysis, for identifying flakiness on the trybots, and for his patience while bugs were being tracked down. We thank the members of Chrome's GPU team for their efforts in keeping the bots running smoothly. Finally, we thank all of the members of and contributors to Chromium for their help, support, and patience while this project was being completed.

Section II Moving to WebGL

Like many WebGL developers, I'm happy to say that WebGL brought me to the web. I developed with C++ and OpenGL for years, but the lure of being able to write 3D apps that run without a plugin across desktop and mobile was too great, and I quickly moved to WebGL when the specification was ratified in 2011.
In this section, developers, researchers, and educators share their stories on why and how to move to WebGL; similar themes run throughout this book.

When our team at AGI wanted to move from C++/OpenGL to JavaScript/WebGL, we weren't sure how well a large JavaScript codebase would scale. We were coming from a C++ codebase that is now seven million lines of code. Could we manage something even 1% of that size? Thankfully, the answer was a definite yes; today, our engine, Cesium, is more than 150,000 lines of JavaScript, HTML, and CSS. In Chapter 4, "Getting Serious with JavaScript," my collaborators, Matthew Amato and Kevin Ring, go into the details. They focus on modular design, performance, and testing. For modularity, they survey the asynchronous module definition (AMD) pattern, RequireJS, CommonJS, and ECMAScript 6. Performance topics include object creation, memory allocation, and efficiently passing data to and from Web Workers. Finally, they look at testing with Jasmine and Karma, including unit tests that call WebGL and aim to produce reliable results on a variety of browsers, platforms, and devices.

Many developers, myself included, have written new engines for WebGL. However, many companies with large, established graphics and game engines may not want to rewrite their runtime engine. They want to reuse their existing content pipeline and design tools, and simply target the web as another runtime. For this, Mozilla introduced Emscripten, which translates C/C++ to a fast subset of JavaScript called asm.js that Firefox can optimize. In Chapter 5, "Emscripten and WebGL," Nick Desaulniers from Mozilla explains how to use Emscripten, including a strategy for porting OpenGL ES 2.0 to WebGL, a discussion of handling third-party code, and a tour of the developer tools in Firefox.

When moving a codebase to WebGL, there are two extremes: Write a new codebase in JavaScript or translate the existing one. Between these extremes is a middle ground: hybrid client-server rendering, where the existing codebase generates commands or images on the server. In Chapter 6, "Data Visualization with WebGL: From Python to JavaScript," Cyrille Rossant and Almar Klein explain the design of VisPy, a Python data visualization library for scatter plots, graphs, 3D surfaces, etc. It has a layered design from low-level, OpenGL-oriented interfaces to high-level, data-oriented ones, with a simple declarative programming language, GL Intermediate Representation (GLIR). GLIR allows for visualization in pure Python apps as well as JavaScript apps. A Python server generates GLIR commands to be rendered with WebGL in the browser, in a closed-loop or open-loop fashion.

WebGL adoption goes beyond practitioners, hobbyists, and researchers. Given its low barrier to entry and cross-platform support, WebGL is finding itself a prominent part of computer graphics education. Edward Angel and Dave Shreiner are at the forefront of this movement, moving both their introductory book, Interactive Computer Graphics: A Top-Down Approach, and the SIGGRAPH course from OpenGL to WebGL. Ed is the person who motivated me to use WebGL in my teaching at the University of Pennsylvania. In 2011, WebGL was a special topic in my course; now, it is the topic. In Chapter 7, "Teaching an Introductory Computer Graphics Course with WebGL," Ed and Dave explain the why and the how of moving a graphics course from desktop OpenGL to WebGL, including walking through the HTML and JavaScript for a simple WebGL app.
4 Getting Serious with JavaScript
Matthew Amato and Kevin Ring

4.1 Introduction
4.2 Modularization
4.3 Performance
4.4 Automated Testing of WebGL Applications
Acknowledgments
Bibliography

4.1 Introduction

As we will see in Chapter 7, "Teaching an Introductory Computer Graphics Course with WebGL," the nature of JavaScript and WebGL makes it an excellent learning platform for computer graphics. Others have argued that the general accessibility and quality of the toolchain also make it great for graphics research [Cozzi 14]. In this chapter, we discuss what we feel is the most important use for JavaScript and WebGL: writing and maintaining real-world browser-based applications and libraries.

Most of our knowledge of JavaScript and WebGL comes from our experiences in helping to create and maintain Cesium (http://cesiumjs.org), an open-source WebGL-based engine for 3D globes and 2D maps (Figure 4.1). Before Cesium, we were traditional desktop software developers working in C++, C#, and Java. Like many others, the introduction of WebGL unexpectedly drew us into the world of web development. Since its release in 2012, the Cesium code base has grown to over 150,000 lines of JavaScript, HTML, and GLSL; has enjoyed contributions from dozens of developers; and has been deployed to millions of end users. While maintaining any large code base presents challenges, maintaining a large code base in JavaScript is even harder.

Figure 4.1 Watching the sun set over the Grand Canyon in Cesium.

In this chapter, we discuss our experiences with these challenges, and our strategies for solving or mitigating them. We hope to provide a good starting point for anyone developing a large-scale application for the browser in JavaScript, or in a closely related language like CoffeeScript.

First, the lack of a built-in module system means there's no one right way to organize our code. The common approaches used in smaller applications will become extremely painful as the application grows. We discuss solutions for modularization in Section 4.2.

Second, a lot of the features and flexibility that make JavaScript approachable and easy to use also make it easy to write nonperformant code. Different browser engines optimize for different use cases, so what is fast in one browser might not be fast in another. This is especially concerning for us as WebGL developers, because interactive, real-time graphics often have the highest performance requirements of any application on the web. We give some tips and techniques for writing performant JavaScript code in Section 4.3.

Finally, dynamically typed languages like JavaScript make automated testing even more important than usual. With JavaScript's lack of a compilation step and its dynamic, runtime-resolved symbol references, even basic refactorings are unnerving without a robust suite of tests that can be run quickly to ensure that the application still works. A good approach to testing is essential to building a large-scale application and enabling it to evolve over time. We discuss strategies for testing a large JavaScript application, especially one that uses WebGL, in Section 4.4.

4.2 Modularization

Small JavaScript applications often start their lives as a single JavaScript source file, included in the HTML page with a simple script tag. With RequireJS, the page instead loads the RequireJS loader itself, with a data-main attribute naming the application's entry point, such as scripts/main.js; scripts/main.js is itself an AMD module that explicitly specifies its dependencies:

Listing 4.2 An entry point AMD module with three dependencies.
require(['a', 'b', 'c'], function(a, b, c) {
    a(b(), c());
});

When RequireJS sees the data-main attribute, it attempts to load the specified module. Loading that module requires all of its dependencies, a, b, and c, to be loaded first. Attempting to load those modules will, in turn, cause their dependencies to be loaded first. This process continues recursively. Once a, b, and c and all of their dependencies are loaded, main's module function is called and the app is up and running.

With AMD, we get quick iteration, because no build is necessary; just reload the page! There's also no need to manage an ordered list of script tags.

RequireJS has a dizzying array of options, allowing us to control how module names are resolved, to specify paths to third-party libraries, to use different minifiers, and much more. RequireJS also has a large assortment of loader plugins (https://github.com/jrburke/requirejs/wiki/Plugins). One that is especially useful in WebGL applications is the text plugin, which makes it easy to load a GLSL file into a JavaScript string in order to pass it to the WebGL API. All the details can be found on the RequireJS website.

4.2.2 AMD Alternatives

The Cesium team has had great success with AMD, and has found RequireJS to be a robust and flexible tool. We don't hesitate to recommend its use in any serious application. However, there are some valid criticisms of AMD, most of which boil down to a distaste for its fundamental design goal: to create a module format that can be loaded in web browsers without a build step and without any preprocessing. To that end, AMD adopts a syntax for defining dependencies that is considered by many to be ugly and cumbersome. In particular, it requires us to maintain two parallel lists at the top of each module definition and to keep them perfectly in sync: the list of required modules and the list of parameters to the module creation function. If we accidentally let these lists get out of sync, perhaps by deleting a dependency from one list but forgetting to do so from the other, our parameter named Cartesian3 might actually be our Matrix4 module, which would certainly lead to unexpected behavior when we try to use it.

If we accept a build step, perhaps because our code needs to be built for other reasons anyway, there are better options than AMD for defining modules that are easy to read and write. After all, today's web browsers support source maps, so debugging transformed code, or even combined and minified code, can look and feel just like debugging the code we actually wrote by hand. By working incrementally, a build process can often be fast enough that it will be finished before we can switch back to our browser window and hit refresh. If we can use a simpler module pattern and make our development environment more like our production environment in the process, without sacrificing debuggability or iteration time, that's a big win. With that in mind, let's briefly survey some of the more promising alternatives to AMD.

4.2.3 CommonJS

The most popular direct alternative to AMD is the CommonJS module format (http://wiki.commonjs.org/wiki/Modules/1.1). CommonJS modules are not explicitly wrapped inside a function; the module-private scope is implied within each source file rather than expressed explicitly as a function. They also express their dependencies using a syntax that's a bit nicer and more difficult to get wrong:

Listing 4.3 Importing modules using CommonJS syntax.
var Cartesian3 = require('./Cartesian3');
var defaultValue = require('./defaultValue');

CommonJS is the module format used on the server for Node.js (http://nodejs.org/) modules. In Node.js, each of those calls to require loads a file off the local disk, so it is reasonable that it not return until the file is loaded and the module created. In the high-latency world of the browser, however, synchronous require calls would be much too slow. So, to use CommonJS modules in the browser, we must convert these modules into a browser-friendly form. One approach is to transform them to AMD modules before loading them in the browser. The r.js tool (https://github.com/jrburke/r.js) we used earlier to create minified builds can also be used to transform CommonJS modules. Another tool that is gaining traction, especially among the Node.js crowd, is Browserify. Browserify takes Node.js-style CommonJS modules and combines them all together into a single browser-friendly JavaScript source file that can be loaded with a simple script tag.

4.4.1 Jasmine

Our spec-main module requires-in all the spec modules and then executes the Jasmine environment:

Listing 4.12 An entry point module for executing Jasmine specs.

//spec-main.js
define([
    './Cartesian3Spec',
    './Matrix4Spec',
    './RaySpec'
], function() {
    var env = jasmine.getEnv();
    env.execute();
});

We don't actually need a parameter for each module in our spec-main function, because we never use them; we only need to ensure that the modules are loaded. Of course, it's unfortunate that we need to list every spec module in this way. The complete list of spec files has to be specified somehow, though, because the web browser can't inspect the local filesystem to determine the list of specs. In Cesium we use a simple build step to automatically generate the complete list of spec modules so that we don't have to maintain this list manually. With our SpecRunner.html set up properly, we only need to serve it up through any web server and visit it from any browser to run our tests in that browser.

4.4.2 Karma

Running tests in Jasmine is a manual process. We open a web browser, navigate to our SpecRunner.html file, wait for the tests to run, and look for any failures. Karma lets us automate this. With Karma, a single command can launch all the browsers on the system, run the tests in each of them, and report the results on the command line. This is critical for working with continuous integration (CI) because it gives us a way to turn test failures generated via a web browser into build process failures. Karma can also watch the tests for changes and automatically rerun them in all browsers, which is handy when practicing test-driven development or when otherwise focused on building out the test suite.

Compared to Jasmine, setting up Karma is pretty easy, even for use with AMD (http://karma-runner.github.io/0.12/plus/requirejs.html). Install it in your Node.js environment using npm, and then interactively build a config file for your application by running the following:

karma init

Karma has out-of-the-box support for Jasmine and several other test frameworks, and more can be added via plugins. Once Karma is configured, we can run the tests in all configured browsers by running

karma start

4.4.3 Testing JavaScript Code That Uses WebGL

Most of what we've discussed so far is applicable to testing just about any JavaScript application. What unique challenges does WebGL present?
We can test most of our graphics code without ever actually rendering anything. For example, we can validate our triangulation, subdivision, batching, and level-of-detail selection algorithms with standard unit tests that invoke these algorithms and assert that they produce the data structures and numbers that we expect. Inevitably, however, some portion of our rendering code—however small—is intimately tied to the WebGL API.

Purists might argue that our unit tests should never make calls into the WebGL API directly, instead calling into a testable abstraction layer of mocks and stubs. In this perfect world, unit testing our WebGL application would be no different from unit testing any other application. We'd write tests that drive our code and then assert that the correct pattern of WebGL functions was invoked, without ever actually invoking those functions. While we do appreciate that there is a place for this sort of testing, we also believe that every serious WebGL application will eventually have to step outside it. For one thing, mocking and stubbing the entire WebGL API—or at least a large enough subset of it to test a sophisticated piece of application logic—would be a significant undertaking. The bigger problem is that WebGL is a complicated API. If we only ever tested against a mocked version of it, we wouldn't have great confidence that our code would work against the real one. Or, perhaps we should say real ones, because in some sense each browser and GPU combination may have unique capabilities and bugs. We might choose to call these tests, which use the real API, integration tests rather than unit tests, but the fact remains that they're an important part of our testing picture.

With that in mind, we've purposely avoided discussing cloud-based JavaScript testing solutions such as Sauce Labs (https://saucelabs.com/) in previous sections. This is because none of these solutions, as of this writing, has reliable support for WebGL. It's unfortunate, because we'd love to be able to run our tests across a wide variety of operating systems and web browsers without maintaining any test infrastructure ourselves. But it's also understandable, because these solutions necessarily use virtualization, and GPU hardware acceleration in a virtualized environment is still in its infancy. Thus, our current approach is to use Karma to run tests on physical machines that we maintain, all driven by our CI process.

When we're writing a WebGL application, we may have hundreds or thousands of lines of code that all conspire to put a certain pattern of pixels on the screen. How can we write automated tests to confirm that the pattern of pixels is correct? There is no easy answer to this question. On previous projects, we wrote tests to draw a scene, take a screenshot, and compare it to a "known good" screenshot. This was extremely error prone. Differences between GPUs and even driver versions inevitably caused our test images to differ from the master images. Anytime a test failed, we immediately wondered what was wrong with the driver, operating system, or test, rather than asking ourselves what might be wrong with our code.
We used "fuzzy" image comparison to make our tests assert that an image "mostly" matched the master image, but even then it was a constantly frustrating balancing act between reporting failure on a new GPU where the code was actually working perfectly well, and reporting success even though something had actually gone wrong. We would not recommend this approach. Others have reported better success with comparing screenshots by maintaining and manually verifying a set of "known good" images for every combination of platform, GPU, and driver [Pranckevičius 2011]. While it's easy to see how this would be effective, it also strikes us as an extraordinarily costly approach to testing.

Instead, most Cesium tests that do actual rendering render a single pixel, and then assert that the pixel is correct. For example, here's a simplified test that asserts a polygon is drawn:

Listing 4.13 Single pixel sanity checking in Cesium unit tests.

it('renders', function() {
    var gl = createContext();
    setupCamera(gl);
    drawPolygon(gl);

    var pixels = new Uint8Array(4);
    gl.readPixels(0, 0, 1, 1, gl.RGBA, gl.UNSIGNED_BYTE, pixels);
    expect(pixels).not.toEqual([0, 0, 0, 0]);

    destroyContext(gl);
});

This test only asserts that the pixel is not black, which is typical of the single-pixel rendering tests in Cesium. Sometimes we may check for something a bit more specific, like nonzero in the red component or "full white." We generally don't check for a precise color, though, because differences between browsers and GPUs can make even that test unreliable.

While the preceding example test creates a unique WebGL context for the test, we try to avoid this in the Cesium tests. One reason is that context creation and setup take time, and we want our tests to run as quickly as possible. A more serious problem, though, is that web browsers don't expect applications to create and destroy thousands of contexts. We've seen bugs in multiple browsers where context creation would start failing midway through our tests. On the other hand, using a single context for all tests risks a test corrupting the context's state and causing later tests to fail. In Cesium, we've found a good balance by creating a context for each test suite. A test suite is a single source file that tests a closely related group of functionality, such as a single class, so it's usually easy to reason about the context state changes occurring in the suite.

A single-pixel rendering test like this is far from exhaustive, of course. Because it really just asserts that the polygon put something on the screen, there are plenty of things that could go wrong and still allow this test to pass. The opposite is not true: We should never see this test fail when the polygon, WebGL stack, and driver are working correctly.

4.4.4 Testing Shaders

Cesium has a library of reusable GLSL functions for use in vertex and fragment shaders. Some of these are fairly sophisticated, such as computing the intersections of a ray with an ellipsoid or transforming geodetic latitude to the Web Mercator coordinates commonly used in web mapping. We find it very beneficial to unit test these shader functions in much the same way we would unit test similar functions written in JavaScript. We can't run Jasmine on the GPU, however, so how do we test them? Our technique is straightforward: We write a fragment shader that invokes the function we wish to test, checks whatever condition we're testing, and outputs white to gl_FragColor if the condition is true.
For example, the test shader for the czm_transpose function, which transposes a 2×2 matrix, looks like this:

Listing 4.14 Testing reusable functions in GLSL.

void main() {
    mat2 m = mat2(1.0, 2.0, 3.0, 4.0);
    mat2 mt = mat2(1.0, 3.0, 2.0, 4.0);
    gl_FragColor = vec4(czm_transpose(m) == mt);
}

When czm_transpose computes the correct transpose, this shader sets gl_FragColor to white. If the transpose is incorrect, gl_FragColor is transparent black. We then invoke this test shader from a Jasmine spec. Our spec draws a single point with a trivial vertex shader and the fragment shader above. It then reads the pixel that the shader wrote, using gl.readPixels, and asserts that it is white. We've found this to be an easy, lightweight, and effective way to test shader functions.

Unfortunately, it is not possible to test entire vertex or fragment shaders in this way, nor is it straightforward to assert more than one condition per test. If either of those features is required, consider using a more full-featured GLSL testing solution such as GLSL Unit (https://code.google.com/p/glsl-unit/). For Cesium, however, we've found this to be unnecessary. By testing the building blocks of our shaders—the individual functions—and keeping the main() functions as simple as possible, we are able to have high confidence in our shaders without a complicated GLSL testing process.

4.4.5 Testing Is Hard

In Cesium, we have a few types of tests:

• Tests of underlying algorithms that verify the data structures and numbers that the algorithms produce. These tests don't do any rendering.
• Rendering smoke tests, as described in Section 4.4.3. These usually render a single pixel and verify that it is not wildly wrong. We also sometimes render a full scene and verify only that no exceptions were thrown during the process.
• Shader function tests, as described in Section 4.4.4. These test the reusable functions that compose our shaders by invoking them in a test fragment shader and asserting that it produces white.

We find these types of tests to be relatively easy to write, robust, and well worth the time investment it takes to write them. Unfortunately, we've found no way around having an actual human run the application once in a while, on a variety of systems and in a variety of browsers, to confirm that the rendered output is what we expect.

Acknowledgments

We'd like to thank everyone who reviewed this chapter and gave us valuable feedback: Jacob Benoit, Patrick Cozzi, Eric Haines, Briely Marum, Tarek Sherif, Ishaan Singh, and Traian Stanev. We'd also like to thank our families for their patience and understanding as we wrote this chapter, and our employers, Analytical Graphics, Inc. (AGI) and National ICT Australia (NICTA), for their flexibility. Finally, we'd like to thank Scott Hunter, who taught us virtually all of what we've written here, except for the parts that are wrong. NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program.

Bibliography

[Clifford 12] Daniel Clifford. "Breaking the JavaScript Speed Limit with V8." Google I/O, https://www.youtube.com/watch?v=UJPdhx5zTaw, 2012.
[Cozzi 14] Patrick Cozzi. "Why Use WebGL for Graphics Research?" http://www.realtimerendering.com/blog/why-use-webgl-for-graphics-research/, 2014.
[Hackett 12] Brian Hackett and Shu-yu Guo. "Fast and Precise Hybrid Type Inference for JavaScript." http://rfrn.org/~shu/drafts/ti.pdf, 2012.
[Jones 11] Brandon Jones.
[Jones 11] Brandon Jones. "The somewhat depressing state of Object.create performance." http://blog.tojicode.com/2011/08/somewhat-depressing-state-of.html, 2011.
[Pizlo 14] Filip Pizlo. "Introducing the WebKit FTL JIT." https://www.webkit.org/blog/3362/introducing-the-webkit-ftl-jit/, 2014.
[Pranckevičius 11] Aras Pranckevičius. "Testing Graphics Code, 4 Years Later." http://aras-p.info/blog/2011/06/17/testing-graphics-code-4-years-later/, 2011.

5 Emscripten and WebGL

Nick Desaulniers

5.1 Emscripten
5.2 asm.js
5.3 Hello World
5.4 Working with Third-Party Code
5.5 OpenGL ES Support
5.6 Porting OpenGL ES 2.0 to WebGL with Emscripten and asm.js
5.7 Texture Loading
5.8 Developer Tools
5.9 Emscripten in Production
5.10 More Help
Bibliography

5.1 Emscripten

I don't really know much about computer language theory. I just mess around with it in my spare time (weekends mostly). Here is one thing I've been thinking about: I want the speed of native code on the web—because I want to run things like game engines there—but I don't want Java, or NaCl, or some plugin. I want to use standard, platform-agnostic web technologies [Zakai 10].

That quote from Alon Zakai, Emscripten's creator, is quite telling of the motivations behind Emscripten. Alon simply wanted to run his favorite game engine in the web. Being a Mozilla employee, Alon recognized the prevalence and reach of JavaScript; almost everyone already had a JavaScript runtime installed on their devices. Being able to safely execute remote code is one of the web's greatest strengths; content that runs in JavaScript is not "silo-ed" to a particular browser or operating system.

Emscripten is a compiler for C and C++ code that plugs into the LLVM toolchain. The beauty of LLVM is that it's split into three logical parts: a lexer and parser (the frontend), which generates an intermediate representation (IR); an IR optimizer (middle), which performs all of the code-optimizing transforms (loop-invariant code motion, inlining, etc.); and the code generator (backend). The frontend is called Clang, from which the command line utility is named. Emscripten reuses the frontend for parsing C and C++, and all of the optimization passes, but replaces the native code generation with a module that generates a subset of JavaScript (JS) known as asm.js. Emscripten links against musl as its C runtime and libc++abi for its C++ runtime. Functions like sbrk are implemented in JS, which allows implementations of malloc and free to be compiled over from C [Zakai 13]. Not only can C/C++ be compiled over directly, but other languages whose runtimes are implemented in C/C++ can have their runtimes ported over. Thanks to Emscripten, languages like Python, Ruby, and Lua can all run in the browser.

In this chapter, I hope to explain what Emscripten and asm.js are, and how we can use them today to get our OpenGL ES 2.0-based renderers up and running in all web browsers with WebGL 1.0 support.

5.2 asm.js

Originally, Emscripten emitted idiomatic JavaScript. While the code was mostly correct, it had wild variation in performance and startup time as it created objects that would trigger frequent garbage collection pauses. In a traditional executable there's a segment for static memory, containing things such as static variables and strings. There's then a separate segment for the stack, which grows downward toward the heap (the free store of memory used for dynamic memory allocations). Emscripten emulates this approach and gets predictable speed by allocating a giant TypedArray up front and mapping onto it a memory representation similar to the one a native process would have. Not only is this faster than working with JavaScript objects, but using this kind of memory representation also helps a JS engine avoid allocating new objects during execution of the compiled code.
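As an illustrative sketch only (Emscripten's real layout, sizes, and names differ), the compiled module views one flat ArrayBuffer through several typed arrays and carves static data, stack, and heap regions out of it:

// Illustrative sketch; not the code Emscripten actually emits.
var buffer  = new ArrayBuffer(16 * 1024 * 1024); // one flat "memory"
var HEAP8   = new Int8Array(buffer);             // byte view
var HEAP32  = new Int32Array(buffer);            // 32-bit integer view
var HEAPF32 = new Float32Array(buffer);          // 32-bit float view

var STATIC_BASE = 8;        // static data (strings, globals) lives here
var STACK_BASE  = 1 << 20;  // then the stack
var HEAP_BASE   = 1 << 22;  // then the heap that malloc/free manage

// A C global `int counter;` placed at STATIC_BASE becomes a plain store:
HEAP32[STATIC_BASE >> 2] = 42;  // roughly *(int*)STATIC_BASE = 42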
In late 2012, Luke Wagner, a core engineer on Firefox's JavaScript virtual machine, realized that JavaScript Just-In-Time compilers (JITs) could do a better job optimizing the type of JavaScript generated by tools such as Emscripten by recognizing implicit static types in the code. Even though JavaScript doesn't have actual static types, the static types in the source language (C/C++) still show through in the generated JavaScript in the form of operators (such as | and +) that force the resulting dynamic values to have a single dynamic type. Recognizing the patterned use of these operators throughout the code, the JIT can compile the code with full static type knowledge, without any of the runtime profiling or chances of mis-speculation that accompany normal JS optimization. While this static-type-embedding subset of the language would be a restriction that throws out some of the best parts of the language, it would allow JavaScript to become an excellent compiler target. And being a subset of a widely deployed language, every modern browser would be able to run this subset, known as asm.js, whether or not it specifically optimized for it.

Because asm.js is a subset of the JavaScript language, pre- and postincrement and -decrement are not valid asm.js, and neither are += and related operators. One must use the notation x = x + 1. Does this mean ++x will not run? Absolutely not: ++x certainly is valid JS; using pre- and postincrement or -decrement operators will just prevent the current code block from validating as pure asm.js, so it will not take the hot path to the optimizing compiler in SpiderMonkey, Firefox's JS virtual machine.

In the code for this chapter* is an example of handwritten asm.js that normalizes a vector of four Float32s. We can see from that example that asm.js is cumbersome and unwieldy to write by hand. We end up referring to the asm.js spec often, both the sections on operators [asm.js §8] and value types [asm.js §2.1], to get our type conversions correct. It's about as much fun to write as assembly, hence the name. Handwriting asm.js code is difficult; asm.js serves us better as a compiler target. Small functions generally do not provide major improvements when converted to asm.js. Instead, when large programs or large utility functions are converted, asm.js really shines.

* https://github.com/WebGLInsights/WebGLInsights-1
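To give a flavor of the dialect, here is a toy handwritten module (a sketch for illustration, not the chapter's normalize example). Note how every parameter and return value is coerced with | or + so that its type is statically knowable:

function ToyModule(stdlib, foreign, heap) {
    "use asm";

    function add1(x) {
        x = x | 0;             // parameter declared as int
        return (x + 1) | 0;    // x = x + 1 style; no ++ or += allowed
    }

    function halve(x) {
        x = +x;                // parameter declared as double
        return +(x / 2.0);     // result coerced back to double
    }

    return { add1: add1, halve: halve };
}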
5.3 Hello World

Let's whet the appetite with a quick "hello world." Download the Emscripten SDK from the site, though I've also heard that "real hackers" compile their compilers from source. Grab Node.js as well, since Emscripten will use Node.js in its toolchain. Node.js provides an alternative JavaScript runtime to the one provided by a browser; it allows command line utilities to be written in JavaScript. Cloning Emscripten's source code would be useful, too. Throughout the chapter, we refer to files in Emscripten's source as emscripten/path/to/file. Emscripten's documentation site* will be a useful reference.

* https://kripken.github.io/emscripten-site/index.html

Once we're all set and have emcc in our $PATH, write a simple hello world example in C or C++, compile with emcc hello_world.c -o hello_world.js, run with node hello_world.js, and we'll see hello world! printed to stdout. We can invoke the compiler as emcc for both C and C++ code bases. We can also use -o hello_world.html and then open the resulting file in our browser.

5.4 Working with Third-Party Code

Just as we need the source of a third-party library to cross compile it for a new platform, third-party libraries also need to be cross compiled with Emscripten. If we use the -o file.bc option, Emscripten will simply emit LLVM bytecode, with its own libc/libcxx implementations. We can then statically link against it as we would native code. Emscripten does not currently support dynamic linking or calls to dlopen or equivalent. If we don't have access to the source code of a library, we should work with the vendor to provide a version that has been built with Emscripten. Emscripten already has built-in glue code for most of EGL, GLUT, GLEW, GLFW, and SDL.

Emscripten is great for compiling functions to JS and allowing us to call into them through a foreign function interface (FFI*), though it's more common to use Emscripten to have C/C++ code call into functions defined in JS "glue code." In general, if we use Emscripten, we may end up needing JS glue code. Before implementing bindings to all of the HTML5 APIs, know that Emscripten has already done so; see emscripten/system/include/emscripten/html5.h and emscripten/system/include/emscripten/emscripten.h. There are also utilities called "embind" and "WebIDL binder" for bridging C++ classes and JavaScript objects, which can make FFI easier, though their use is more advanced than what I'll be talking about here.

* https://kripken.github.io/emscripten-site/docs/porting/connecting_cpp_and_javascript/Interacting-with-code.html

5.5 OpenGL ES Support

WebGL 1.0 is very similar to OpenGL ES 2.0 (Figure 5.1), with a JavaScript-based API rather than a C-based API. In general, one should familiarize oneself with Section 6 of the WebGL 1.0 spec, which enumerates 24 "Differences Between WebGL and OpenGL ES 2.0" [WebGL 1.0 §6]. Most notable is that client-side arrays are not supported; one must use vertex buffer objects (VBOs). Emscripten can emulate this functionality with the compiler flag -s FULL_ES2=1, though Emscripten may not be able to emulate the most efficient use of VBOs. Support for some parts of the old fixed-function pipeline can work, at a severe performance penalty, with the compiler flag -s LEGACY_GL_EMULATION=1.

Figure 5.1 OpenGL family tree.

5.6 Porting OpenGL ES 2.0 to WebGL with Emscripten and asm.js

The work flow for getting our OpenGL ES 2.0 code base running in WebGL using Emscripten looks like this:

1. Get it building
2. Get it rendering
3. Get it animating

The process is unique with every code base, but this is the typical work flow. I've converted Dan Ginsburg's sample code from the OpenGL ES 2.0 Programming Guide [Munshi et al. 08]. The original code is in an SVN repo,* but if you prefer git, I've uploaded the code to GitHub.†

* https://code.google.com/p/opengles-book-samples/
† https://github.com/nickdesaulniers/opengles2-book
In the repo, we'll also find commits in the writing_follow_along branch that we can use to see what changes were made in this section. Please do clone the code and follow along. Your humble author is going to be using OS X 10.8.5, but we'll be working with the Linux code base because, as we'll see, Emscripten has great support for various GNU build utilities. Go ahead and open gles2-book/LinuxX11. If we simply run make as is, we should expect the project to fail, since OS X is missing quite a few of the required headers and libraries.

5.6.1 Get It Building

Emscripten ships with a tool called emmake. It's a Python script that invokes make with a few environment variables, like $CC, set to emcc (the Emscripten compiler). Let's try reinvoking the build steps using emmake: emmake make. We can see that the default make target is trying to link against EGL, which we know is definitely not available in OS X. Let's take a look at the Makefile. We'll just work with the first sample, "Hello Triangle," so we can make another make target and just edit the compiler command for Hello Triangle.

basic: ./Chapter_2/Hello_Triangle/CH02_HelloTriangle

From now on, for all of our changes, we need to remember to run emmake make basic. Second, let's replace all hard-coded references to gcc with the environment variable $(CC). This will allow emmake to set CC=emcc.

./Chapter_2/Hello_Triangle/CH02_HelloTriangle: ${COMMONSRC} ${COMMONHDR} ${CH02SRC}
	$(CC) ${COMMONSRC} ${CH02SRC} -o $@ ${INCDIR} ${LIBS}

We should see a bunch of warnings about the GLESv2, EGL, m, and X11 libraries not being found. That's OK, because Emscripten has linked in implementations of the functions of these libraries. We can remove the line defining LIBS from the Makefile. Finally, to silence all warnings, we can fix the pointer sign conversion warning in the code by changing the definitions of vShaderStr and fShaderStr in Chapter_2/Hello_Triangle/Hello_Triangle.c from type GLbyte[] to const char* on lines 83 and 90.

OK, so now we're building, but what do we get?

$ file Chapter_2/Hello_Triangle/CH02_HelloTriangle
Chapter_2/Hello_Triangle/CH02_HelloTriangle: data

It looks like CH02_HelloTriangle is binary data. The data look oddly like LLVM bytecode and, sure enough, if we run the LLVM disassembler on the binary file and open the resulting .ll file, we have LLVM IR.

$ llvm-dis Chapter_2/Hello_Triangle/CH02_HelloTriangle
$ cat Chapter_2/Hello_Triangle/CH02_HelloTriangle.ll
; ModuleID = 'Chapter_2/Hello_Triangle/CH02_HelloTriangle'
target datalayout = "e-p:32:32-i64:64-v128:32:128-n32-S128"
target triple = "asmjs-unknown-Emscripten"
...

Now we have LLVM bytecode; what can we do with that? Well, Emscripten generates different output based on the file extension of the compiler argument passed to -o. If -o is invoked without a file extension of .js or .html, as is the case here, then the output is simply LLVM bytecode. This bytecode is what we need to work with other libraries. Emscripten doesn't have support for dynamic linking ahead of time or at runtime, so for now we'll have to stick with static linking. That's essentially what we're doing here: compiling the code to an IR that can be linked statically. In this case, we don't want an equivalent to an archive; we want a full program that we can run.
Since Emscripten uses the file extension of the output option to generate different code, we need to tell it whether we want to generate code that can run in Node.js (just JavaScript) or in a browser (JavaScript and HTML). In our Makefile, for the ./Chapter_2/Hello_Triangle/CH02_HelloTriangle target, let's tack .html onto $@, which is Make syntax for the target.

	$(CC) ${COMMONSRC} ${CH02SRC} -o $@.html ${INCDIR} ${LIBS}

After running emmake make clean && emmake make basic, we can see that we've introduced two new warnings.

warning: unresolved symbol: XNextEvent
warning: unresolved symbol: XLookupString

Where did these come from, and why didn't we see them earlier? Before telling Emscripten that we wanted HTML, it was creating LLVM bytecode. It wasn't running the linking phase, just the frontend of Clang and the optimization passes. Though we didn't specify any optimization levels, Emscripten actually uses compiler passes added to LLVM from NaCl to break down the IR into more easily digested chunks. It's not until we tell Emscripten that we want runnable code via -o .js or -o .html that it actually runs the linking phase. In fact, if we look at the LLVM IR in text format (Chapter_2/Hello_Triangle/CH02_HelloTriangle.ll), we can see that a bunch of type definitions that will be resolved at link time are specified as "type opaque."

Right now, the linker is complaining that all of the symbols listed throughout had definitions except two functions, which look to me like they're related to X11 and that we pulled out from the linking options earlier. But we also removed explicit linking against GLES2 and EGL, so why aren't we getting warnings from those? Well, Emscripten will implicitly search its system/include directory for the GLES 2 and EGL headers that are bundled with Emscripten, and pick up JavaScript definitions of those methods from Emscripten's src dir; see emscripten/src/library_gl.js and emscripten/src/library_egl.js in Emscripten's source code or install location. Emscripten ships with a bunch of implicit definitions of functions implemented in JavaScript. Some of these are handwritten, but it's nice to use compiled versions where it makes sense. We don't get warnings for GLES or EGL functions because Emscripten has them implemented for us. We get a warning for the X window functions because we did not define them, and Emscripten does not have them built in. As we will see, we won't be using the X window bits, so tackling those linker warnings right now is a moot point. In general, if we get compiler warnings like this, we should make sure none of our important functions' definitions are missing.

Back to our project: we now have an .html file and a .js file. We could try to run the .js file in Node.js, but we'll run into issues trying to resize the canvas, since HTMLCanvasElement is not defined by the Node.js runtime. Let's see what happens if we try to run the code in the browser. We now have the code building; let's get it rendering.

5.6.2 Get It Rendering

Depending on our browser, we shall see either a slow script warning or a compositor lockup, caused by an infinite loop and resulting in no rendering. Let's try drawing just one frame, to make sure the issue is in rendering and not in animation. Let's wrap the final statement in main in #ifdefs for now to render only one frame when compiled with Emscripten:

Listing 5.1 Drawing one frame.
#ifdef __EMSCRIPTEN__
    Draw(&esContext);
#else
    esMainLoop(&esContext);
#endif

Remember to rerun emmake make basic between page loads. Now we don't get the long-running script error, but still nothing renders, so we know the issue is not just with animation, but also that something is broken in our render function. Checking in our console developer tool, we can see WebGL-implementation-specific errors printed to console.error:

Error: WebGL: vertexAttribPointer: must have valid GL_ARRAY_BUFFER binding CH02_HelloTriangle.js:1937
Error: WebGL: drawArrays: no VBO bound to enabled vertex attrib index 0! CH02_HelloTriangle.js:1974

Let's try rebuilding with debug symbols. In our Makefile, let's add a simple -g4 flag to the rule for our example:

	$(CC) ${COMMONSRC} ${CH02SRC} -o $@.html ${INCDIR} ${LIBS} -g4

Without any optimization flags passed to the compiler, Emscripten will not attempt to minify and compress the emitted code for anything less than -O2. We can use -g1 to preserve whitespace, -g2 to preserve function names, -g3 to preserve variable names, and -g4 to generate source maps. Source maps are an additional file, plus a special comment in the emitted JavaScript, that tell the developer tools how to map lines of the emitted code back to lines of the source. If we use -g4, we should now see an additional .html.map file, and a comment at the bottom of our JavaScript file:

//# sourceMappingURL=CH02_HelloTriangle.html.map

After reloading, you should be able to view the C source code, possibly in addition to the emitted JavaScript, depending on your browser and its support for source maps. See emcc --help for more compiler options.

Go ahead and set a breakpoint right before the call to Draw(&esContext), reload the page, step into Draw, then step over until we find the offending line that prints the warning in the console tab. We should find that the offending line is

glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, vVertices);

What is vVertices? If we're used to working with vertex buffer objects (VBOs), we're probably expecting the sixth argument to glVertexAttribPointer to be an offset. But we can see a few lines up that vVertices is declared as being of type GLfloat[]. This is referred to as client-side data, or rendering from client-side memory. If buffer object zero is bound with glBindBuffer, the sixth argument to glVertexAttribPointer is the address of vertex data in main memory; otherwise, it is an offset into the currently bound VBO in video memory [OpenGL ES 2.0 §2.9.1]. WebGL does not allow client-side data. We must use VBOs in WebGL, so this code should be converted to use VBOs rather than client-side data. It's possible to recompile with the flag -s FULL_ES2, see something render, and then call it a day. Let's not do that. While Emscripten can emulate client-side data, it might be slow. The fix might look like the following:

Listing 5.2 Using VBOs instead of client-side data.

#ifdef __EMSCRIPTEN__
    GLuint vertexPosObject;
    glGenBuffers(1, &vertexPosObject);
    glBindBuffer(GL_ARRAY_BUFFER, vertexPosObject);
    glBufferData(GL_ARRAY_BUFFER, 9 * 4, vVertices, GL_STATIC_DRAW);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, 0);
#else
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, vVertices);
#endif

This is not preferable, since we're recreating a new buffer every frame, but we'll see in a bit some useful tools for helping us catch redundant calls against the GL context. Rebuild and we should see our triangle!
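For readers who know WebGL better than GLES, the fix above corresponds roughly to the following direct WebGL calls. This is a hand-written sketch, not the code Emscripten's library_gl.js actually generates; it assumes gl is the WebGL context and vVertices is a nine-element array of triangle positions.

// Upload the vertices once, then point attribute 0 at offset 0 of the
// bound buffer instead of at client memory.
var vertexPosObject = gl.createBuffer();
gl.bindBuffer(gl.ARRAY_BUFFER, vertexPosObject);
gl.bufferData(gl.ARRAY_BUFFER, new Float32Array(vVertices), gl.STATIC_DRAW);
gl.enableVertexAttribArray(0);
gl.vertexAttribPointer(0, 3, gl.FLOAT, false, 0, 0); // last argument is an offset
gl.drawArrays(gl.TRIANGLES, 0, 3);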
5.6.3 Get It Animating

Now that we've fixed issues related to rendering, let's tackle animation. Let's revert the #ifdef change at the end of main that we made to render only one frame. We should now be running esMainLoop(&esContext), regardless of the compiler. Load this version up in a browser, and we'll get the slow script warning again. Go ahead and stop the script. This time, we'll notice that we at least have something rendering, though we still have to contend with an infinite loop somewhere.

First, let's talk a little bit about event loops. In our native C or C++ code, we can run code in an infinite loop and the OS will take care of keeping the system responsive by context switching in and out of the process. JavaScript works a little differently; instead it has an event loop. Because JavaScript is single threaded (WebWorkers can be used for multithreading), APIs for fetching data over the network (like XMLHttpRequest) are typically implemented asynchronously; they take a function as an argument to be invoked later. Such an argument is referred to as a callback function. This is similar to passing function pointers or function objects in C or C++. The callback is placed on the event queue and won't be executed until the triggering event (such as a network response) occurs and the tasks ahead of it in the queue have run. Because JavaScript is for the most part single threaded, and runs in the main thread of the browser, long-running functions or infinite loops will lock up the page. Functions such as setTimeout and setInterval push a callback onto the event queue, and indeed these were traditionally used in libraries like jQuery, a long time ago, to perform animation without blocking the main thread. There are issues with using those two functions for animation, so with the HTML5 standard we got the requestAnimationFrame function. requestAnimationFrame is what we want to use for animation, rather than a while (true) loop.

If we jump into the file Common/esUtil.c, and then down to the definition of esMainLoop on line 290, we can indeed find the source of our infinite loop:

Listing 5.3 Main animation loop.

while(userInterrupt(esContext) == GL_FALSE)
{
    ...
    esContext->updateFunc(esContext, deltatime);
    ...
    esContext->drawFunc(esContext);
    ...
}

In order to transform our code to play nice with JavaScript's event loop, we perform the following steps:

1. Put the body of the while loop into its own function.
2. Create a struct containing variables if the while loop references variables outside its local scope.
3. Use emscripten_set_main_loop_arg and the address of the populated argument struct.

emscripten_set_main_loop_arg has the signature

extern void emscripten_set_main_loop_arg(void (*func)(void*), void* arg, int fps, int simulate_infinite_loop)

and will invoke requestAnimationFrame. The front and back buffers will be switched by WebGL at the end of our requestAnimationFrame iteration. There's also a sister function, emscripten_set_main_loop, if we don't have any arguments to pass. This transformation should feel extremely familiar to anyone who has ever transformed an iterative operation to be threaded.
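Conceptually, what emscripten_set_main_loop_arg arranges on the JavaScript side boils down to something like the sketch below (heavily simplified; the real runtime also handles fixed-fps timing modes, cancellation, and exceptions):

// The compiled C callback is re-queued once per browser frame.
function setMainLoop(compiledFunc, argPointer) {
    function tick() {
        compiledFunc(argPointer);           // one iteration of the old while body
        window.requestAnimationFrame(tick); // yield to the browser, then repeat
    }
    window.requestAnimationFrame(tick);
}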
First, let's move the body of the while loop into its own function:

Listing 5.4 Separate render function.

static void render (void *data)
{
    ...
    esContext->updateFunc(esContext, deltatime);
    ...
    esContext->drawFunc(esContext);
    ...
}

void ESUTIL_API esMainLoop (ESContext *esContext)
{
#ifdef __EMSCRIPTEN__
    //TODO: step 3
    //Use emscripten_set_main_loop_arg and the address of the
    //populated argument struct.
#else
    while(userInterrupt(esContext) == GL_FALSE)
    {
        ...
        esContext->updateFunc(esContext, deltatime);
        ...
        esContext->drawFunc(esContext);
        ...
    }
#endif
}

Second, we'll create a struct for the variables that are initialized before the while loop:

Listing 5.5 Create a struct to house animation loop data.

struct loop_vars_t {
    struct timeval t1;
    struct timezone tz;
    float totaltime;
    unsigned int frames;
    ESContext* esContext;
};

static void render (void *data)
{
    struct loop_vars_t* args = (struct loop_vars_t*) data;
    struct timeval t2;
    float deltatime;
    ...
    args->esContext->updateFunc(args->esContext, deltatime);
    ...
    args->esContext->drawFunc(args->esContext);
    ...
}

Third and finally, we will use emscripten_set_main_loop_arg and the address of our populated argument struct:

Listing 5.6 Use emscripten_set_main_loop_arg.

void ESUTIL_API esMainLoop (ESContext *esContext)
{
#ifdef __EMSCRIPTEN__
    struct loop_vars_t args = {0};
    args.totaltime = 0.0f;
    args.frames = 0;
    args.esContext = esContext;
    gettimeofday(&args.t1, &args.tz);
    emscripten_set_main_loop_arg(render, &args, 0, 1);
#else
    ...

At this point, building should give us an error that emscripten_set_main_loop_arg has not yet been defined. Make sure to #include "emscripten.h" at the top of Common/esUtil.c. Emscripten will automatically know how to include this header; there's no need for additional -I arguments during the build. Now we should see our code properly animating, as in Figure 5.2! Granted, we're not updating any vertices between frames, but we should now see console output every 2 seconds with an average frames per second.

Figure 5.2 WebGL renderer side by side with OpenGL ES 2.0 renderer.

5.7 Texture Loading

Another aspect of working with Emscripten that is different from native development is the asynchronous loading of static assets. If we try compiling the Chapter 13 example from the OpenGL ES 2.0 Programming Guide, we should get an error: Error loading (smoke.tga) image. We have to change the Makefile to specify an .html file extension for the -o compiler flag and use VBOs instead of client-side data, as before. If we add the compiler option --preload-file ./Chapter_13/ParticleSystem/smoke.tga, we'll get an additional CH13_ParticleSystem.data file in Chapter_13/ParticleSystem/. This file is a static memory initializer that is produced in optimized builds and builds that use Emscripten's virtual filesystem. We'll still get the previous error, though. From emcc --help, under --preload-file: "The path is relative to the current directory at compile time." So we can either edit Chapter_13/ParticleSystem/ParticleSystem.c line 165 to "./Chapter_13/ParticleSystem/smoke.tga" or move smoke.tga to the current working directory. Because we previously fixed our renderer, we should now see the example render and animate correctly.

5.8 Developer Tools

We can use developer tools in Firefox to help improve our code. Some of my favorites are the shader editor, the sampling profiler, the timeline viewer, and the canvas debugger. The performance tab features a sampling profiler and an FPS-over-time graph (Figure 5.3). We will find inverting the call stack or call tree helpful in pinpointing which functions the program spent the most time in.
The longer we allow the profiler to run, the more accurate its samples will be. Don't be alarmed if the profiler seems to be missing some functions; those functions typically run so fast that the sampler misses them. We'll see extra functions that we didn't write ourselves if we have a few add-ons installed; it's best to run the profiler from a fresh profile that is free of add-ons.*

* https://support.mozilla.org/en-US/kb/profile-manager-create-and-remove-firefox-profiles

We can also manually add instrumentation to our code with calls to console.time and console.timeEnd. Browsers that embed the Blink rendering engine, like Chrome, Chromium, and Opera, have chrome://tracing, which is a fantastic structural profiler to use once we've added the instrumentation to our code (see Figure 5.4). Structural profilers give us insight into exactly how much time was spent in various function calls relative to one another, but burden us with adding instrumentation. The sampling profiler will miss short-running functions, but doesn't require instrumentation. It's good to start with a sampling profiler, and then switch to a structural profiler if needed. Firefox's Timeline dev tool will also show console.time blocks' duration relative to other things going on within the page. WebGL Inspector† is another nice tool, implemented both as a library and as a browser add-on or extension.

† http://www.realtimerendering.com/blog/webgl-debugging-and-profiling-tools/

Figure 5.3 Firefox sampling profiler developer tool.
Figure 5.4 Blink's structural profiler developer tool.

From Figure 5.5, we can see that the canvas tab is a wonderful debugger for animation loops. It records, then highlights clear and draw calls in green, and useProgram calls in pink. If we record an iteration of our animation loop, we can see that we're repeating things that only need to be done once. We don't need to be resetting the viewport size, setting the only shader program as the active one, creating a new buffer, rebuffering our data, or re-enabling our vertex attribute pointers. Before we move those functions from Draw into Init, let's remove the browser's 60 fps limitation on requestAnimationFrame to see a more dramatic before and after. In Firefox, if we open about:config in another tab, we can change the preference layout.frame_rate from -1 to 0 and layers.offmainthreadcomposition.frame-rate from -1 to 1000, then restart the browser. It's important that we restore these to defaults when done; otherwise, every site we visit that uses requestAnimationFrame will end up doing much more work than our monitor's refresh rate can keep up with.

Figure 5.5 Firefox's canvas debugger developer tool.

If we pull out the -g4 compiler flag in favor of -O3, and shrink the Firefox window down to the size of the canvas, we can update as high as 570 fps, with some large variations. This is an excellent technique that can be used for A/B testing. It's important to use optimization levels -O2 or above to emit valid asm.js code from Emscripten. Using -O2 with asm.js means that the available heap size will be fixed. Applications that require variable and/or large amounts of memory will often fail in such cases and cannot be reliably compiled with -O2. In these cases, the runtime will print a warning on how to compile the code so that this can be worked around, for an additional runtime cost. Other things, like new Error().stack in JavaScript, can help us print a stack trace.
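Both forms of manual instrumentation fit in a few lines; in this sketch, drawFrame is a hypothetical stand-in for whatever we want to measure.

// Time a block of work; the browser logs "drawFrame: 12.3ms" (or similar).
console.time('drawFrame');
drawFrame();
console.timeEnd('drawFrame');

// Print the current JavaScript call stack from anywhere.
function logStack(label) {
    console.log(label, new Error().stack);
}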
Emscripten also provides an EM_ASM macro that allows us to embed JavaScript directly in our C/C++.

5.9 Emscripten in Production

The Emscripten team worked with the folks at Epic in 2013 to bring Epic's Unreal Engine 3 (UE3) to the web, as seen in their Citadel and Unreal Tournament demos, and Unreal Engine 4 in 2014. Developers at Trendy Entertainment shipped Dungeon Defenders Eternity in 2014 using UE3. Unity also made a free feature in Unity 5 to export to the web, as seen with a demo of Mad Finger Games' Dead Trigger 2. In late 2014, Mozilla partnered with Humble Bundle to release the Humble Mozilla Bundle, releasing nine popular indie games like FTL: Faster Than Light and Voxatron, which together raised over half a million dollars in revenue. Screenshots from the games running in the browser are shown in Figure 5.6.

Figure 5.6 Clockwise from top left: Epic Unreal Engine 3 Citadel Demo, Trendy Entertainment Dungeon Defenders Eternity based on UE3, Mad Finger Games Dead Trigger 2 based on Unity 5, Subset Games FTL: Faster Than Light as part of the Humble Mozilla Bundle.

5.10 More Help

For more help, it's best to read over the official documentation.* The IRC channel #emscripten at irc.mozilla.org is also very active, and all of the people involved in Emscripten's development and a bunch of developers with porting experience hang out there. It's worthwhile to read emscripten/src/preamble.js for documentation on FFI functionality and emscripten/src/settings.js for compiler options. All of the code samples and figures are available online at the book's GitHub code repository.†

* https://kripken.github.io/emscripten-site/
† https://github.com/WebGLInsights/WebGLInsights-1

Bibliography

[asm.js] Dave Herman, Luke Wagner, and Alon Zakai. "asm.js." http://asmjs.org/spec/latest/, 2014.
[Jylänki 13] Jukka Jylänki. "Emscripten Memory Profiler." https://dl.dropboxusercontent.com/u/40949268/emcc/memoryprofiler/Geometry_d.html, 2013.
[Lamminen et al. 14] Turo Lamminen, Tuomas Närväinen, and Robert Nyman. "Porting to Emscripten." https://hacks.mozilla.org/2014/11/porting-to-emscripten/, 2014.
[Munshi et al. 08] Aaftab Munshi, Dan Ginsburg, and Dave Shreiner. OpenGL ES 2.0 Programming Guide. Addison-Wesley Professional, 2008.
[OpenGL ES 2.0] Mark Segal and Kurt Akeley. "The OpenGL® Graphics System: A Specification." https://www.opengl.org/documentation/specs/version2.0/glspec20.pdf, 2004.
[Typed Array] Dave Herman and Kenneth Russell. "Typed Array Specification." https://www.khronos.org/registry/typedarray/specs/latest/, 2013.
[Wagner 14] Luke Wagner. "asm.js AOT compilation and startup performance." https://blog.mozilla.org/luke/2014/01/14/asm-js-aot-compilation-and-startup-performance/, 2014.
[Wagner and Zakai 14] Luke Wagner and Alon Zakai. "Getting started with asm.js and Emscripten." GDC 2014.
[WebGL 1.0] Dean Jackson. "WebGL Specification." https://www.khronos.org/registry/webgl/specs/latest/1.0/, 2014.
[Zakai 10] Alon Zakai. "Experiments with 'Static' JavaScript: As Fast as Native Code?" http://mozakai.blogspot.com/2010/07/experiments-with-static-javascript-as.html, 2010.
[Zakai 13] Alon Zakai. "Emscripten: An LLVM-to-JavaScript Compiler." https://github.com/kripken/emscripten/blob/master/docs/paper.pdf?raw=true, 2013.
6 Data Visualization with WebGL: From Python to JavaScript

Cyrille Rossant and Almar Klein

6.1 Introduction
6.2 Overview of VisPy
6.3 GLIR: An Intermediate Representation for OpenGL
6.4 Online Renderer
6.5 Offline Renderer
6.6 Performance Considerations
6.7 Conclusion
Acknowledgments
Bibliography

6.1 Introduction

The deluge of data arising in science, engineering, finance, and many other disciplines calls for modern, innovative analysis methods. Whereas more and more processes can be automated, human supervision is nevertheless often required at most stages of analysis pipelines. The primary way humans apprehend data for explorative analysis is visualization. Effective big data visualization methods have to be interactive, fast, and scalable [Rossant 13].

Modern data sets may be large and high dimensional; thus no static, two-dimensional image can possibly convey all relevant information. A common technique is to create interactive visualizations, where the user explores the various dimensions and subsets of the data. For such data exploration to be most effective, the rendering frame rate needs to be optimal even with large data sets. Finally, big data visualization methods need to support distributed and remote technologies in order to scale to huge data sets stored in cloud architectures.

Figure 6.1 Screenshots from the VisPy gallery. From left to right: ray tracing example implemented in a fragment shader; a fake galaxy rendered with point sprites; a textured cube with custom postprocessing effects in the fragment shader; hundreds of digital signals updated in real time (the grid layout is generated entirely in the vertex shader for efficiency); a 3D mesh of a cortical surface.

We have been developing an OpenGL-based library in Python named VisPy to visualize large data sets interactively.* Using the graphics card via custom shaders allows us to smoothly display up to hundreds of millions of points. Figure 6.1 illustrates a few types of visualization supported by VisPy. VisPy is open source, BSD-licensed, and written in Python, one of the leading open platforms for data analytics [Oliphant 07]. Although Python is widely acclaimed for its expressivity and accessibility, it lags behind the web platform when it comes to document sharing and dissemination. WebGL is an appealing technology to us because it could provide a way to embed VisPy visualizations within web documents [WebGL]. Since VisPy and WebGL are both based on OpenGL ES 2.0, the main technical barrier is the programming language. Integrating Python and JavaScript indeed remains an open problem [Kelly 13].

* http://www.vispy.org

In this chapter, we present a set of techniques that we have been developing in order to bring VisPy to the browser thanks to WebGL. Before detailing these techniques, we give an overview of the VisPy project. VisPy's fundamental idea is to build data-oriented abstraction layers on top of OpenGL ES 2.0. VisPy brings the power of OpenGL to users who have complex visualization needs, but no knowledge of OpenGL. Although VisPy is primarily developed in Python, it is also of interest to WebGL developers who are working on data visualization projects in JavaScript.

VisPy focuses on speed and scalability. In particular, its architecture enables out-of-core visualization applications thanks to specific level-of-detail (LOD) techniques. None of these techniques are currently implemented in VisPy.
Rather, VisPy provides an infrastructure that makes these use cases possible. For example, one could dynamically stream various LOD models of a huge data set from a high-performance server (with Python and VisPy installed) to a lower end WebGL-enabled device (desktop computer, smartphone, and so on). VisPy proposes a client-server architecture and a communication protocol that is specifically adapted to such use cases. The purpose of this chapter is to detail this infrastructure.

6.2 Overview of VisPy

The power and flexibility of modern OpenGL and GLSL make them useful not only for video games and 3D modeling software, but also for 2D/3D data visualization. Examples of visualizations that can be rendered with OpenGL include scatter plots (point sprites), digital signals (polylines), images (textured quads), graphs, 3D surfaces, meshes (rasterization or volume rendering), and many others. The expressivity of GLSL makes it possible to create arbitrarily complex visualizations.

Whereas many scientists need to visualize large, complex, and high-dimensional data sets, few have the time and skills to create custom visualizations in OpenGL. Writing an OpenGL-based interactive visualization from the ground up requires knowledge of the GPU architecture, the rendering pipeline, GLSL, and the complex OpenGL API. This high barrier to entry motivated the development of VisPy. This library offers high-level visualization routines that let scientists visualize their data interactively and effectively. Specifically, VisPy defines several abstraction layers, from low-level, OpenGL-oriented interfaces to high-level, data-oriented interfaces. Whereas the high-level interfaces allow complex visualizations to be defined with minimal code, the lower level interfaces offer finer control and customization of the visualization process. We present these interfaces in this section.

6.2.1 Dealing with Differences between Desktop GL and GL ES 2.0

VisPy targets "regular" desktop OpenGL as well as OpenGL ES 2.0 (e.g., WebGL). This is realized by limiting ourselves to the subset of OpenGL that is available in both versions. Although both versions are mostly compatible, there are a few pitfalls, some of which we account for in VisPy so that our users can create applications and write GLSL that will work on both desktop GL and WebGL. The most significant differences are*

• Some function names are different in the two versions. Examples include the create/delete functions, and getParameter, which replaces getInteger/getFloat/getBoolean. These differences are not problematic in VisPy because we use a common representation language (GLIR) to represent GL commands (see later in this chapter).
• In desktop GL, point sprites are not enabled by default. In VisPy, we automatically enable point sprites if necessary.
• An ES 2.0 fragment shader must contain precision qualifiers. In VisPy, the medium precision level is currently used. For the future we do plan to let users specify the precision level they need.
• ES 2.0 (and WebGL) do not support 3D textures, which are used by some techniques, like volume rendering. This can be worked around by using 2D textures and implementing 3D sampling manually in the shaders. See Chapter 17.
• WebGL has a limit on the size of the index buffer, which can be problematic when visualizing large meshes. This can be worked around by using OES_element_index_uint or multiple buffers.
• WebGL does not allow sharing objects between contexts. This makes things simpler, but can be limiting for certain applications.
• Without a version pragma, the strictness of the GLSL compiler (e.g., allowing implicit type conversions) varies widely between various versions of desktop GL. In VisPy, the version pragma is automatically set to 120 for desktop GL to make the compiler behave similarly to ES 2.0. Further, ES 2.0 implementations are not required to support for-loops for which the number of iterations is not known at compile time.
• ES 2.0 has fewer data types for attributes and texture formats (e.g., no GL_RGB8).

* For a more complete list see https://github.com/vispy/vispy/wiki/Tech.-EScompat

Apart from allowing users to create code that runs on desktop GL as well as WebGL, we also want to allow users to harness the full potential of desktop GL (e.g., 3D textures and geometry shaders). These desktop-only features would be available only in the Python backends (Qt, wx, etc.) and not in the WebGL backends.

6.2.2 gloo: An Object-Oriented Interface to OpenGL

The OpenGL API can be verbose. Basic operations like compiling a shader or creating a data buffer require several commands with many parameters. Nevertheless, the core concepts underlying these APIs are relatively simple and can be expressed in a more compact way. VisPy provides an object-oriented interface to OpenGL ES, named gloo, that is implemented in both Python and JavaScript. By focusing on the central concepts and objects in OpenGL (shaders, GLSL variables, data buffers), gloo offers a natural way of creating visualizations. For example, Listing 6.1 is a Python script displaying static random points, and Figure 6.3 shows the output when running the code. (Figure 6.2 gives an overview of VisPy's abstraction layers.)

Listing 6.1 A simple visualization in VisPy.

from vispy import app, gloo
import numpy as np

vertex = """
attribute vec2 position;
void main() {
    gl_Position = vec4(position, 0.0, 1.0);
    gl_PointSize = 20.0;
}
"""

fragment = """
void main() {
    vec2 t = 2.0 * gl_PointCoord.xy - 1.0;
    float a = 1.0 - pow(t.x, 2.0) - pow(t.y, 2.0);
    gl_FragColor = vec4(0.1, 0.3, 0.6, a);
}
"""

canvas = app.Canvas()
program = gloo.Program(vertex, fragment)
data = 0.3 * np.random.randn(10000, 2)
program['position'] = data.astype(np.float32)

@canvas.connect
def on_resize(event):
    width, height = event.size
    gloo.set_viewport(0, 0, width, height)

@canvas.connect
def on_draw(event):
    gloo.set_state(clear_color=(0, 0, 0, 1), blend=True,
                   blend_func=('src_alpha', 'one'))
    gloo.clear()
    program.draw('points')

canvas.show()
app.run()

Figure 6.2 Overview of the graphics abstraction layers provided by VisPy. Levels more to the right trade flexibility for a higher level API. A declarative language called GLIR forms a common layer on top of desktop OpenGL and WebGL. The gloo module provides an object-oriented interface to OpenGL ES. Visuals encapsulate reusable graphical objects implemented with gloo. These visuals can be organized within a scene graph. Finally, a plotting API can be used to create a scene graph and visuals for common use cases.

A Program object is created with a vertex and a fragment shader. VisPy automatically parses the GLSL code, which allows the user to easily assign data buffers to attributes and values to uniforms. The program can then be rendered with any primitive type among those provided by OpenGL (points, lines, and triangles).
In Python, data buffers are commonly created and manipulated with the NumPy library [van der Walt 11]. This widely used library provides a typed ndarray object similar to JavaScript's TypedArray [Khronos 13], except that the ndarray can be multidimensional.

Shaders are extensively used in VisPy. Instead of implementing rendering routines in Python or C, we leverage GLSL as much as possible. Also, this paradigm makes the conversion to WebGL easier, as we have less code to translate. This is because GLSL is virtually the same language between desktop OpenGL and OpenGL ES 2.0/WebGL. As we will see later in this chapter, gloo plays a central role in our Python-to-WebGL translation processes.

Figure 6.3 Output when running the code in Listing 6.1.

6.2.3 Visuals

The gloo layer forms the foundation of VisPy's higher level graphics layers. Reusable visualization primitives are encapsulated in Visuals. Each visual represents a graphical object (line, image, mesh, and so on). It encapsulates several gloo objects and provides a Pythonic interface to control their appearance. Whereas the abstraction level of gloo corresponds to the OpenGL architecture, the abstraction level of a visual corresponds to a graphical object appearing in the scene. Therefore, users can create and manipulate visuals intuitively without any knowledge of OpenGL. Examples of visuals include common geometric shapes in 2D and 3D, high-quality antialiased polylines (implemented with a GLSL port of the antigrain geometry library [Rougier 13]), 3D meshes, images, point sprites, and others. Although VisPy provides a relatively rich set of visuals, users can also create their own visuals if necessary.

Visuals can share and reuse snippets of GLSL code thanks to an internal shader composition system that organizes shaders into modular, independent components. Reusable GLSL functions can be written with placeholder variables identified by a dollar ($) prefix. These variables can be replaced at runtime by gloo variables (attributes, uniforms, constants), using a simple Python API. VisPy takes care of generating the final GLSL code and resolving potential name conflicts by renaming some variables if necessary. This allows the user to modify the appearance of a visual by attaching custom GLSL coordinate transformations, color modifications, and more. For example, a pan/zoom transformation function is implemented with $translate and $scale template variables. These variables can then be bound to uniforms or attributes depending on the use case.

6.2.4 The Scene Graph

Visuals can be used directly, or they can be organized within a scene graph. In the latter case, the visuals are represented as nodes to form a hierarchical structure. Each node has a transformation that describes the relation between the parent and the current coordinate frame. These transformations have an implementation on both the CPU and the GPU. Further, by using the scene graph, one can make use of several built-in camera/interaction models.

6.2.5 The Plotting Interface

The plotting interface is the highest level layer in VisPy. It provides ready-to-use visualization routines for common plots like scatter plots, polylines, images, and others. In essence, it provides an easy way to create visuals and set up a scene graph for common use cases.
Tip: VisPy is a data visualization library built on top of OpenGL ES. Written in Python and JavaScript, it provides several abstraction layers. End users can choose between intuitive, but slightly limited, high-level interfaces and more flexible, but more complex, low-level interfaces.

6.3 GLIR: An Intermediate Representation for OpenGL

In order to bring VisPy visualizations written in Python to the browser, we need to translate Python code defining a visualization to JavaScript code. The challenging aspect is that we require this translation to be automatic, or at least to require as little human supervision as possible. Our end users are scientists who expect to write Python code that would work indifferently on the desktop or in the browser; generally, they would be unwilling to write JavaScript code for common use cases. Also, avoiding code duplication makes code easier to maintain.

Two issues arise when it comes to translating a visualization from Python to JavaScript. First, we have seen that visualizations may be specified at one of several abstraction levels. We need to choose a specific abstraction level in order to find a common representation that can be understood by both languages. Second, such a representation would only define a static visualization. Yet, in VisPy we are focusing on interactive visualizations. Conceptually, interactivity means updating the OpenGL objects in the scene in response to dynamic events such as user actions (mouse, keyboard, touch) or timers. These interactivity functions are typically written in a general-purpose programming language. In VisPy, the user can implement these functions in Python using a set of modular components provided by the library. For example, the scene graph defines cameras that can be moved and rotated. Finding a way to translate these custom routines from Python to JavaScript is one of the main challenges of the process. In this section, we focus on the issue of the representation level for translating visualizations from Python to JavaScript. In the next sections, we detail how interactivity can be translated either automatically or semiautomatically.

6.3.1 Finding an Adequate Representation Level

There is a trade-off between low-level and high-level representations when looking for an adequate translation level. Low-level representations are more expressive and require less code duplication between Python and JavaScript. However, dynamically extracting a low-level representation from a visualization defined in a high-level interface is challenging. High-level representations are easier to translate, but require much more code duplication between Python and JavaScript. Updating and maintaining both implementations at the same time would require a significant effort.

We initially investigated the lowest possible representation level, namely the OpenGL API, but several issues arose. First, this interface is complex and verbose, and often leads to boilerplate code. These problems make the translated code less readable and harder to debug. Second, there are slight differences between the desktop OpenGL API and the WebGL API. Third, code written in this representation generally requires nonlinear control flows like loops and conditional branches (error handling, for example). Translating this code automatically from Python to JavaScript proved to be challenging. Therefore, we decided on a new representation that corresponds to our gloo interface. We named this representation GLIR, for GL intermediate representation. GLIR is a simple declarative programming language that describes the dynamic creation and modification of gloo objects.
Our architecture features two components: the GLIR generator (frontend) and the GLIR interpreter (backend). The frontend generates GLIR commands reactively in response to user or timer events. The backend interprets and executes these commands within an OpenGL context, which can be desktop OpenGL or WebGL (Figure 6.4). Overall, the scene is initialized and dynamically updated in the OpenGL context while GLIR commands are streamed to the backend. An important feature of GLIR is that the communication is one-way, such that the frontend will never have to wait for the backend, allowing for higher performance. Errors that occur on the backend should be reported to the frontend, but this can occur asynchronously.

Figure 6.4 Diagram illustrating the different modes of operation of the WebGL backend: (a) with the desktop backend, the GLIR producer and interpreter operate in the same process on top of a GUI backend (Qt, for example); (b) with the online WebGL backend, the GLIR producer operates in a Python process and streams GLIR commands to the WebGL interpreter; (c) with the offline WebGL backend, a stand-alone web document containing all GLIR commands and interaction logic is generated semiautomatically.

Implementing this language required a minimal amount of code duplication between Python and JavaScript. Both Python and JavaScript need to parse and execute those commands. Since the GLIR command definitions are simple, we consider this representation relatively stable, and few modifications of GLIR and its implementations are expected in the future. Furthermore, our higher level interfaces are written on top of gloo, and thus are completely decoupled from the GLIR language and implementations. This means that all higher level layers of VisPy transparently work on both desktop GL and WebGL.

6.3.2 Example of a GLIR Representation

The Python script using gloo in Listing 6.1 would produce the GLIR representation in Listing 6.2.

Listing 6.2 GLIR representation of the scene described in Listing 6.1.

CREATE 1 Program
SHADERS 1 "attribute vec2 position;\n[...]" "void main() {\n[...]\n}"
CREATE 2 VertexBuffer
SIZE 2 80000
DATA 2 0 (10000, 1)
ATTRIBUTE 1 position vec2 (2, 8, 0)
FUNC glViewport 0 0 800 600
FUNC glClearColor 0.0 0.0 0.0 1.0
FUNC glBlendFuncSeparate src_alpha one src_alpha one
Some commands refer to the creation or modification of gloo objects. Each gloo object is represented by an identifier, unique within the current OpenGL context. It is up to the frontend—not the backend—to generate unique identifiers for the gloo objects. The CREATE command allows us to create a gloo object. There are currently six types of gloo objects: Program, VertexBuffer, IndexBuffer, Texture2D, RenderBuffer, and FrameBuffer. The first line of Listing 6.2 creates a Program object. In the second line, a vertex shader and a fragment shader are assigned to that program. Then, a vertex buffer is created, its size (SIZE command) is specified, and the data are uploaded into this vertex buffer (DATA command). The data buffer may be represented in several ways. The exact representation depends on the specific frontend and backend. For example, when translating a Python visualization to JavaScript, we can use a base64 encoding of this buffer. This JavaScript string is then converted into an ArrayBuffer object and passed directly to the bufferData() WebGL command. A more efficient method consists of using the binary WebSocket protocol to send the buffer from Python to the browser. With the ATTRIBUTE command, a program attribute is bound to the VertexBuffer objects that were just created (object #2), and the stride (8) and offset (0) are specified. Finally, a few OpenGL functions are called and we draw the program with the gl.POINTS primitive type. Additional GLIR commands include UNIFORM (to set a program’s uniform value), WRAPPING and INTERPOLATION (for textures), and ATTACH and FRAMEBUFFER (for framebuffers). Tip: GLIR is a simple declarative programming language used to create and manipulate gloo objects dynamically. A frontend module is responsible for generating a stream of GLIR commands at initialization time, and subsequently as a function of dynamic user events (mouse movements, keystrokes, timers, and so on). A backend module receives, interprets, and executes these commands within an OpenGL or WebGL context. This twotier architecture lets us decouple a visualization specification in Python from the rendering in the browser with WebGL. * https://github.com/vispy/vispy/wiki/Spec.-GLIR 6.3 GLIR: An Intermediate Representation for OpenGL 97 6.4 Online Renderer GLIR is the foundation of our Python-to-JavaScript architecture. By writing all higher level functionality on top of gloo in Python, we ensure that all VisPy visualizations can be translated to WebGL. In practice, there are several ways to bring a VisPy visualization to the browser. They differ according to whether a live Python server is available or not. In the online renderer, the browser and a live Python server are connected in a closed loop. The browser captures user events and streams them to the Python server. In return, the server generates the rendering commands and sends them to the browser. The Python server and the browser may or may not run within the same computer. In the offline renderer, the frontend module generates a stand-alone HTML/JavaScript document containing the entire interactive visualization. There is no need for an external Python server. From the user’s perspective, there is a trade-off between the online and offline renderer. The online renderer works in virtually all cases, and it allows one to keep some logic in Python, at the expense of a small communication overhead between the Python server and the browser. 
This overhead is due to the round-trip exchange of messages over the network (user events and GLIR commands). The offline renderer allows one to create a fully stand-alone web application that does not depend on a Python server, but this process is more complex and more limited, and it may require the user to write custom JavaScript code.

6.4.1 The IPython Notebook

In order to implement an online renderer, we need the following components:

• A Python process with NumPy and VisPy installed
• A Python server
• A WebGL-compatible web browser
• A communication channel and a protocol between Python and the browser

Although directly implementing all of these components would be possible, we chose to rely on an existing architecture that is widely used in scientific computing: the IPython notebook [Perez 07; Shen 14]. This tool provides a web interface to a Python server. Users can type code in this interface and get the results interactively (a read-eval-print loop, or REPL). Text, images, plots, and graphical widgets can also be created. Dynamic interaction between the notebook client in the browser and the underlying Python server is implemented by IPython. The architecture is based on the Tornado Python server, the ZeroMQ communication library, and the WebSocket protocol (Figure 6.5). We decided to use the IPython-provided functionality for creating widgets in the notebook, since this architecture implements all the components we need. The custom widget that we created contains a WebGL canvas to which we can render.

6.4.2 Distributed Event Loop

With a normal desktop OpenGL backend, GLIR commands are generated and interpreted by the same Python process. With the browser backend, the GLIR commands are generated by the frontend and proxied to the browser backend in real time via the IPython communication channels.

Figure 6.5 Example of VisPy in the IPython notebook.

As in all real-time rendering applications, there is an event loop that processes events and generates drawing calls. By design, this event loop is distributed between Python and the browser. In the browser, we use window.requestAnimationFrame() for the WebGL event loop. In addition, two JavaScript queues are implemented. The event queue contains the pending user events that are yet to be sent to the Python server, while the GLIR queue contains pending GLIR commands that are yet to be executed by the WebGL engine (a GLIR interpreter written in JavaScript). Events are produced using standard JavaScript event callbacks and jQuery. At every WebGL frame, the following operations occur:

1. A JSON event message is produced from the pending user events in the event queue. Consecutive messages of the same type (for example, mouse_move events) are merged into single events for performance reasons.
2. This message is sent to Python via the IPython-provided communication channels.
3. The pending GLIR commands are executed.

On the Python side, the JSON event messages are received and the events are injected into VisPy's event system, causing the user event callback functions to be called. These functions may be implemented directly on top of gloo, or they may involve the machinery implemented in VisPy's higher level layers (the cameras and the scene graph, for example). All GLIR commands that are generated by the Python process are automatically queued.
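On the browser side, the distributed loop just described might be sketched as follows; the queue names, the channel object, and executeGlirCommand() are hypothetical stand-ins, not VisPy's actual implementation:

// Hypothetical stand-ins for the IPython comm object and the JavaScript
// GLIR interpreter entry point.
var channel = { send: function (json) { /* proxied to the Python server */ } };
function executeGlirCommand(command) { /* interpret one GLIR command */ }

var eventQueue = [];  // user events not yet sent to Python
var glirQueue = [];   // GLIR commands not yet executed by WebGL

function onFrame() {
    // 1. Merge consecutive events of the same type (e.g., mouse_move),
    //    then pack the pending events into a single JSON message.
    var merged = [];
    eventQueue.forEach(function (ev) {
        var last = merged[merged.length - 1];
        if (last && last.type === ev.type && ev.type === "mouse_move") {
            merged[merged.length - 1] = ev;  // keep only the latest move
        } else {
            merged.push(ev);
        }
    });
    eventQueue.length = 0;

    // 2. Send the message to Python via the communication channel.
    if (merged.length > 0) {
        channel.send(JSON.stringify(merged));
    }

    // 3. Execute the pending GLIR commands.
    glirQueue.forEach(executeGlirCommand);
    glirQueue.length = 0;

    window.requestAnimationFrame(onFrame);
}
window.requestAnimationFrame(onFrame);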
Every time a draw event is issued by VisPy, a JSON message with the pending GLIR commands is sent to the browser. There, these commands populate the JavaScript GLIR queue and are executed during the next requestAnimationFrame() iteration. Time-based animations are implemented on the Python side using Tornado: a timer generates draw events at a regular interval, which causes programs to be drawn, GLIR commands to be sent, and so on.

6.4.3 Server-Side Offscreen Rendering

The process described above implements the online WebGL renderer in the IPython notebook. In addition, we implemented an offscreen renderer for low-end clients that don't support WebGL. Instead of being proxied to the browser, the GLIR commands are interpreted by the Python server, and the browser receives the locally rendered, rasterized image instead of the GLIR commands. This process bears some resemblance to the VNC protocol. The technique might also be used when data sets are too large to be transferred over the network, or even too large to fit in the client's RAM. However, custom level-of-detail (LOD) techniques might be more effective in these cases.

Tip: With the online renderer, the browser sends user events to the Python server in real time via JSON messages. VisPy processes the events and generates a stream of GLIR commands via the gloo interface. These commands are sent back to the browser (again via JSON messages), and they are processed by the WebGL GLIR backend in the requestAnimationFrame() function. This architecture is implemented within the IPython notebook, an interactive web interface to Python that is well adapted to scientific computing. Overall, this WebGL backend allows users to visualize large data sets smoothly and interactively, directly in their browser.

6.5 Offline Renderer

The online rendering methods described above work for all possible VisPy visualizations. However, they require a live Python server (to run VisPy). When IPython notebooks are viewed statically—for example, when they are exported to static, stand-alone HTML/JavaScript web documents—the visualizations do not work. It would be valuable for users to keep their interactive visualizations intact when they share notebooks with colleagues, or when they export notebooks to interactive web reports. One approach is to run an IPython kernel in the cloud (for example, using Docker*). However, this raises questions about who will run the server and who will pay for it. Therefore, we are currently working on an alternative approach that doesn't require a live Python server at all. Although more limited in scope than the online renderer, since all logic must be implemented on the client side, this approach has the benefit of being easier to deploy. In this section, we describe work in progress in this direction.

* https://www.docker.com

6.5.1 Overview

Since there are JavaScript implementations of both gloo and the GLIR backend, a visualization implemented directly on top of gloo can be manually reimplemented in JavaScript. Because this is a tedious process, we propose an assistant that helps users convert some of their Python code to JavaScript. Exporting a visualization occurs in two steps. First, the list of all created gloo objects is obtained dynamically by instantiating the scene and capturing all generated GLIR commands. Second, the event callback functions (which implement interactivity) are translated either manually or automatically.
Automatic conversion cannot be done dynamically, because the arguments of these functions are variable (mouse position, keystrokes, and so on). Instead, these functions are translated statically. We describe possible approaches here.

6.5.2 Python-to-JavaScript Translator

A first approach would be to use a static Python-to-JavaScript translator such as the open-source Pythonium library.* The translator parses the Python code with the native ast Python library and generates JavaScript code on the fly while visiting the AST (abstract syntax tree). This approach bears some resemblance to Emscripten (Chapter 5), which compiles C/C++ code to a subset of JavaScript through the LLVM compiler architecture. As far as Python is concerned, this approach is followed by several projects, including PyPy.js,† Pyston,‡ and Numba.§ We have not explored these options at this time, since we are currently interested in a small subset of Python that may not require the heavy machinery involved in these projects.

In our case, the only bits of code we need to convert are the user callback functions (on_mouse_move(), on_key_press(), etc.). Therefore, the translator is not intended to support the entire Python syntax or to cover all use cases. Rather, it aims at simplifying the end user's task of converting a VisPy visualization to a stand-alone HTML/JavaScript document. The generated code can then be improved manually.

The features of the Python language we would need for interactivity functions include standard Python statements, conditional branches, loops, and common mathematical and array operations. Scalar operations are useful when uniform values are modified, but array operations are sometimes required for complex interaction patterns that involve modifying vertex buffers and textures. Whereas Python's math module and JavaScript's Math object provide similar functionality for scalar operations, there is currently no NumPy support in JavaScript. Therefore, we would need to implement a light version of NumPy in JavaScript, as detailed below. Alternative approaches could be based on other Python-to-JavaScript projects such as Brython¶ or Skulpt.**

* https://github.com/rcarmo/pythonium
† https://github.com/rfk/pypyjs
‡ https://github.com/dropbox/pyston
§ http://numba.pydata.org/
¶ http://www.brython.info
** https://github.com/skulpt/skulpt

6.5.3 JavaScript Implementation of a NumPy-like Library

VisPy is extensively based on NumPy. The ndarray structure is easy to manipulate in Python. First, vectorized mathematical operations can be expressed concisely and executed efficiently; for example, the sum of two vectors A and B is just A + B. Second, arrays can be uploaded to OpenGL buffers and textures efficiently, without making any unnecessary copy on the GPU. JavaScript provides a few efficient structures for WebGL, namely ArrayBuffers and TypedArrays. Whereas these objects can easily be passed to WebGL functions, no mathematical operations are provided for them. For this reason, we would need to write a basic JavaScript port of NumPy. Beyond vectorized mathematical operations (+, *, -, /, and so on) and functions (exp, cos, and so on), this port would also provide the array creation and manipulation routines that are commonly used in NumPy. With this basic toolbox, a wide variety of interactivity routines written on top of gloo could be readily converted to JavaScript with little effort from the user.
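As an illustration of the kind of functionality such a port would need, here is a minimal sketch of vectorized operations over typed arrays; the names (ndarray, add, multiplyScalar) are hypothetical and not part of VisPy:

// A tiny NumPy-like wrapper around Float32Array (illustrative only).
function ndarray(data) {
    this.data = (data instanceof Float32Array) ? data : new Float32Array(data);
}

// Element-wise sum of two arrays: the equivalent of A + B in NumPy.
ndarray.prototype.add = function (other) {
    var out = new Float32Array(this.data.length);
    for (var i = 0; i < out.length; i++) {
        out[i] = this.data[i] + other.data[i];
    }
    return new ndarray(out);
};

// Element-wise scaling: the equivalent of A * s in NumPy.
ndarray.prototype.multiplyScalar = function (s) {
    var out = new Float32Array(this.data.length);
    for (var i = 0; i < out.length; i++) {
        out[i] = this.data[i] * s;
    }
    return new ndarray(out);
};

// The underlying Float32Array can be handed directly to WebGL:
// gl.bufferData(gl.ARRAY_BUFFER, c.data, gl.STATIC_DRAW);
var a = new ndarray([1, 2, 3]);
var b = new ndarray([4, 5, 6]);
var c = a.add(b).multiplyScalar(0.5);  // [2.5, 3.5, 4.5]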
6.5.4 Beyond gloo

Exporting all gloo objects by capturing the GLIR commands works the same way whether the visualization is written directly on top of gloo or on top of VisPy's higher level interfaces, because using VisPy's higher layers eventually results in GLIR commands. However, the translation process for the interactivity routines is static and does not involve any code execution in Python. For this reason, it only works when the callback functions are implemented directly on top of gloo. When these functions use VisPy's higher level interfaces, like the scene graph, an alternative approach is needed. One possibility would be to reimplement the common cameras in JavaScript. The interactivity functions would then become automatically translatable again, because they would use the same cameras in Python and in JavaScript.

Tip: Whereas a live Python server is required with the online renderer, the offline renderer offers facilities to translate a Python visualization into an interactive, stand-alone HTML/JavaScript document that executes entirely in the client's browser. A static scene can easily be exported from Python to JavaScript by capturing all generated GLIR commands at initialization. These commands can then be executed by the JavaScript GLIR interpreter. For interactivity, a static NumPy-aware Python-to-JavaScript translator helps the user convert event callbacks implemented on top of gloo. This is made possible by the fact that these functions often involve only basic Python constructs with mathematical operations on scalars and arrays.

6.6 Performance Considerations

We have evaluated the performance cost incurred by the architecture on a simple visualization example.* The example displays N normally distributed random points as white point sprites with point size 1.0.†

* The code is available at https://github.com/vispy/webgl-insights/tree/master/perf
† All benchmarks have been performed on a local machine: a GIGABYTE laptop with an Intel(R) Core(TM) i7-4700HQ CPU @ 2.40GHz, 16GB RAM, an Intel GPU with Intel(R) Haswell Mobile Mesa 10.5.0-devel OpenGL 3.0 driver, GLFW backend, VisPy development version 0.4.dev (95a87f6 commit), and Ubuntu 14.04.1 LTS.

First, we measured the average number of frames per second (FPS) by repeatedly updating the same scene over a 10-second period. An implementation using OpenGL API calls directly (wrapping OpenGL through Python's ctypes module) achieved 42.53 FPS ± 2.7% with N = 10,000,000 points. The same implementation using gloo and GLIR achieved 41.98 FPS ± 3.1% with the same number of points. The performance overhead is within the error range; it corresponds to the generation of GLIR commands in Python and the execution of these commands by the Python GLIR interpreter. There is no serialization or deserialization involved in this process.

Second, we evaluated the performance of the WebGL backend.* We ran the previous gloo example with N = 1,000,000 points (using N = 10,000,000 points crashed the browser). At every trial, JavaScript generates a draw event, which is sent to the Python server; there, the script's on_draw() function is called and the GLIR commands are generated, serialized, and sent back to the browser. These GLIR commands are then processed by the WebGL GLIR interpreter within the next requestAnimationFrame() event.
Using a single local machine, this entire process took 22 ms on average over all trials, showing that the lag incurred by our architecture amounts to no more than one or two frames, assuming a target of 60 FPS in the browser. Although we have not benchmarked more complex examples, we can expect this overhead to be relatively independent of scene complexity in many situations. Specifically, visualizations that only trigger uniform updates in response to events are not expected to suffer significant lag. This is because the communication overhead essentially depends on two factors: the volume of data to transfer between Python and the browser, and the network lag.

In many scientific visualizations, the data can be transferred to the browser only once—at initialization time. Interaction then occurs through uniform updates—for example, updating the u_translate and u_scale uniform variables for pan/zoom. In this case, there is virtually no data transfer between Python and the browser during interaction, since the size of the messages is negligible.

The situation is different when significant amounts of data need to be sent regularly from Python to the browser—for example, if the data are so large that an LOD technique has to be used. In this case, there can be a significant lag while the data are transferred. The frequency of these operations can be reduced—for example, to no more than one data update per second—so that the visualization feels responsive most of the time. The lag could be further reduced by using the binary WebSocket protocol instead of base64 (de)serialization for transferring array buffers, which is possible in IPython ≥ 3.0.

Finally, we have only conducted benchmarks on a local machine. When different machines are used for the Python server and the WebGL client, the network lag may harm the perceived performance. In this case, it is conceivable to implement some of the interactivity in the client rather than in Python. For example, pan and zoom may be implemented directly in JavaScript so that network communication is bypassed. This could be done using some of the techniques developed for the offline WebGL backend (Section 6.5).

* Using Chrome 39.0.2171.95 (64-bit) and IPython 3.0.dev (5bcd54d commit).

6.7 Conclusion

VisPy brings the power of graphics cards to scientists who want to visualize large volumes of data efficiently. VisPy lowers the barrier to entry for OpenGL by providing several visualization-oriented abstraction layers. Although VisPy is mainly written in Python, one of the major open-source languages for data analysis, it can also work in web browsers via the WebGL specification. The web platform is highly appealing for its ease of deployment and its multiplatform capabilities. Therefore, we have implemented tools to port VisPy visualizations from Python to the browser.

This work could be extended in several ways. First, it should be possible to write a hybrid renderer where light interactivity routines are implemented in the client, while more complex routines remain implemented in Python—for example, routines that involve accessing a large, remotely located data set. This would be useful when a significant network lag hinders perceived performance—for example, while panning and zooming in a plot.
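As a sketch of what such client-side pan/zoom could look like, assuming a scene whose vertex shader consumes the u_translate and u_scale uniforms mentioned above, and with canvas, gl, program, and numPoints already set up elsewhere:

// Client-side pan/zoom state, updated without any network round trip.
var translate = [0.0, 0.0];
var scale = 1.0;

canvas.addEventListener("mousemove", function (event) {
    if (event.buttons & 1) {  // drag with the left button: pan
        translate[0] += 2.0 * event.movementX / canvas.width;
        translate[1] -= 2.0 * event.movementY / canvas.height;
    }
});

canvas.addEventListener("wheel", function (event) {
    scale *= (event.deltaY < 0) ? 1.1 : 1.0 / 1.1;  // wheel: zoom
    event.preventDefault();
});

function draw() {
    gl.uniform2fv(gl.getUniformLocation(program, "u_translate"), translate);
    gl.uniform1f(gl.getUniformLocation(program, "u_scale"), scale);
    gl.clear(gl.COLOR_BUFFER_BIT);
    gl.drawArrays(gl.POINTS, 0, numPoints);
    window.requestAnimationFrame(draw);
}
window.requestAnimationFrame(draw);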
More generally, browser features like webcam support could be transparently and efficiently handled by the online renderer, without requiring round-trip transfers of large streams of data between the client and the server. It would also be interesting to combine the generated WebGL code with existing WebGL or JavaScript libraries like three.js or dat.GUI (graphical controls in JavaScript). This would let users benefit from both Python's analytics tools and JavaScript/WebGL libraries for creating interactive visualizations in the browser.

Also, a major direction of research is to support VisPy's higher level interfaces in the WebGL offline renderer. The scene layer provides several high-level interactivity routines and cameras that could be reimplemented in JavaScript. Users could therefore write custom and complex interactive visualizations using exclusively the scene layer, and generate WebGL versions of these visualizations fully automatically. The advantage of the scene layer over the lower level gloo interface is that the former does not require any knowledge of OpenGL. This is a critical point, since VisPy targets scientific end users who have complex visualization needs but no OpenGL skills.

Acknowledgments

We thank the rest of the VisPy development team (Luke Campagnola, Eric Larson, and Nicolas Rougier) and all VisPy contributors for their work on the project.

Bibliography

[Kelly 13] Ryan Kelly. "pypy-js-first-steps." https://www.rfk.id.au/blog/entry/pypy-js-first-steps, 2013.
[Khronos 13] "Typed Array Specification," work in progress. https://www.khronos.org/registry/typedarray/specs/latest/, 2013.
[Oliphant 07] Travis E. Oliphant. "Python for Scientific Computing." Computing in Science & Engineering 9:10–20, doi:10.1109/MCSE.2007.58, 2007.
[Perez 07] Fernando Pérez and Brian E. Granger. "IPython: A System for Interactive Scientific Computing." Computing in Science and Engineering 9 (3): 21–29, doi:10.1109/MCSE.2007.53. http://ipython.org, 2007.
[Rossant 13] C. Rossant and K. D. Harris. "Hardware-Accelerated Interactive Data Visualization for Neuroscience in Python." Frontiers in Neuroinformatics 7 (36), doi:10.3389/fninf.2013.00036, 2013.
[Rougier 13] Nicolas P. Rougier. "Shader-Based Antialiased Dashed Stroked Polylines." Journal of Computer Graphics Techniques 2 (2), 2013.
[Shen 14] Helen Shen. "Interactive Notebooks: Sharing the Code." Nature 515:7525, doi:10.1038/515151a, 2014.
[van der Walt 11] Stéfan van der Walt, S. Chris Colbert, and Gaël Varoquaux. "The NumPy Array: A Structure for Efficient Numerical Computation." Computing in Science & Engineering 13:22–30, doi:10.1109/MCSE.2011.37, 2011.
[WebGL] https://www.khronos.org/webgl/

7 Teaching an Introductory Computer Graphics Course with WebGL

Edward Angel and Dave Shreiner

7.1 Introduction
7.2 The Standard Course
7.3 WebGL and Desktop OpenGL
7.4 Application Organization and Ground Rules
7.5 Modeling a Cube
7.6 JavaScript Arrays
7.7 The HTML File
7.8 The JS File
7.9 MV.js
7.10 Input and Interaction
7.11 Textures and Render-to-Texture
7.12 Discussion
Bibliography

7.1 Introduction

For almost 20 years, OpenGL has been the standard API used in teaching computer graphics. Regardless of the approach used in a course, OpenGL has allowed instructors to support their courses with an API that is simple and close to the hardware.
The deprecation of the fixed-function pipeline in OpenGL 3.1 forced instructors to make some difficult decisions about which version to use. In OpenGL Insights [Angel 13], we argued for using a fully shader-based approach in a first course. Now, with the popularity of WebGL, instructors must decide whether or not to switch to yet another version. In this chapter, we argue that WebGL has major advantages and few disadvantages for teaching an introductory course in computer graphics to students in computer science and engineering. These conclusions are based on one of the authors' experience teaching the course for over 20 years, both authors' experience teaching SIGGRAPH courses, and the latest edition of our textbook [Angel 15]. In the following sections, we start with a comparison of a basic 3D application in WebGL with its shader-based desktop OpenGL counterpart. We then make this application interactive and add some other features, in particular texture mapping, which illustrate the advantages of WebGL over desktop OpenGL for teaching the first course.

7.2 The Standard Course

Computer graphics has been a standard course in almost all departments of computer science and engineering since the 1970s. Even though there have been enormous advances in both hardware and software, the core topics have changed little over 40 years. They include

• Geometry and modeling
• Transformations
• Viewing
• Lighting and shading
• Texture mapping and pixel processing
• Rasterization

Even as most courses have moved to a fully shader-based version of OpenGL, these topics remain the core of the standard course. We take that approach here and focus on the adjustments that have to be made in using WebGL. At the end, we discuss some alternatives that become attractive with the WebGL API.

7.3 WebGL and Desktop OpenGL

As readers of this work know, WebGL 1.0 is a JavaScript (JS) implementation of OpenGL ES 2.0. It is fully shader-based, lacks a fixed-function pipeline, and does not contain any of the functions deprecated with OpenGL 3.1. Although it may appear to be a simple conversion to take an application from desktop OpenGL to OpenGL ES and then to WebGL, a number of factors make the task nontrivial but interesting.

First, because OpenGL is concerned primarily with rendering, with desktop OpenGL we have to provide either platform-dependent functions for input and connection to the window system or use libraries such as GLUT and GLEW. With WebGL, we have to interact with the browser and the web. For teaching computer graphics, this change is a good thing. As OpenGL has developed, it has become more and more difficult to support students using a variety of platforms and hardware. Although an OpenGL application is platform independent in the sense that it can be recompiled on almost any platform, the surrounding libraries have become more of a problem. Most instructors have used GLUT or freeglut and, depending on the platform, GLEW. Finding a set of libraries that works with Windows, OS X, and various versions of Linux has become increasingly difficult. Some instructors find that they can get a set that works on 32-bit architectures but not on 64-bit architectures. Much more problematic is that applications that worked with previous versions of OpenGL cannot work with a core profile in recent versions, because the raster and bitblt functions have been deprecated.
For example, menus were easy to add in older versions of OpenGL using GLUT, but GLUT menus need the raster and bitblt functions. One of the major advantages of going to WebGL is the ease with which we can add interactivity to applications. We will return to this issue in Section 7.10.

Going from C or C++ to JavaScript and HTML requires some work. Although we can make our JavaScript code appear very similar to C/C++ code, there are some "gotchas" where significant differences may not be obvious. Books such as JavaScript: The Good Parts [Crockford 08] and JavaScript: The Definitive Guide [Flanagan 11] are excellent references for students who are already adept in high-level languages. There are also many good tutorials on the web, for example, the one by McGuire [McGuire 14]. A more serious issue at many schools can be the antipathy toward JS in computer science and engineering departments. Nevertheless, as we show, the benefits are significant, and students have few problems learning and using JS.

Having looked at some of the potential problem areas, before looking at the details of a simple example, let's consider briefly the benefits of going to WebGL for teaching. These benefits include

• Cross-platform broad deployment—web and mobile
• Low barrier to entry
• Fast iteration
• Variety of available tools
• Performance
• A more modern API
• Integration with other web APIs

WebGL is supported on all recent browsers, including those on most new smartphones. Because all these browsers can interpret JavaScript code, there is no explicit compilation stage; no need for libraries such as GLUT and GLEW that had to be recompiled for each architecture; no system-specific interfaces such as glx, wgl, and agl; and no dealing with changing versions of the operating system or any of the libraries. From an instructor's perspective, getting the course started is a breeze. From the student's perspective, she can work on the same project on multiple devices with no changes to code. Because the code is interpreted, students can work in an environment as simple as a standard text editor, allowing them to modify and rerun code almost instantaneously. In addition, developer tools within the browser, such as profilers and debuggers, make working with WebGL code much easier than working with desktop OpenGL.

Although WebGL code is interpreted rather than compiled, the differences in performance between WebGL and desktop OpenGL are smaller than one might expect. JavaScript engines in the browsers have improved markedly. More important for most classes is that not only do we tend not to assign problems where performance is an issue, but also, once the data are put on the GPU, the performance does not depend much on how the data got there.

Although in many ways WebGL inherits a somewhat dated programming model from the original version of OpenGL, a few added features make the API more modern. For example, as we show later, we can use images in the standard web image formats for textures. Also, JS often allows us to write clearer, more concise application code.

But perhaps the greatest advantage of using WebGL for teaching is that students can integrate their graphics with any other code that is compatible with HTML5. For example, they can use standards such as CSS for page design and jQuery for interaction.
We have found that even though our students come to the class with almost no knowledge of computer graphics, they arrive with experience in these and other web packages. Consequently, the interactive side of computer graphics, which had been dropping out of the course for reasons including the recent changes in desktop OpenGL, can be returned to the course with ease.

7.4 Application Organization and Ground Rules

A typical WebGL application is a combination of JS and HTML code. In addition, most real-world applications use CSS to design the pages and a package such as jQuery for interaction. An instructor must make some key decisions about which packages to use and how to organize applications. Although many students come to the graphics class with web experience and are familiar with CSS and jQuery, many are not. Consequently, we decided to use only JS and HTML, although students were welcome to use CSS and jQuery if they preferred. Note that HTML and JS are sufficient to develop applications with mouse input, buttons, sliders, and menus.

The next issue is how to organize the code. Minimally, every application must have a description of the web page that will display the graphics and any interactive tools, the JS code for the graphics, and two shaders. Although all four components can be put into a single HTML file, putting everything into a single file masks the different jobs of the constituent parts and hinders development of an application. We decided that each application should have an HTML file and a separate JS file. The HTML file contains the page description (including the canvas we will draw on), the shaders,* and the locations of the other files needed. The JS file contains the geometry and rendering code (i.e., the graphics part of the application).

Finally, we had to decide what, if any, helper code to give students. We decided, as we did in the examples in OpenGL Insights, to give students a function initShaders() that takes the identifiers of two shaders and produces a program object, as in

var program = initShaders(vertexShaderId, fragmentShaderId);

Our reason for doing so is that the various functions to read, compile, and link the shaders contribute little to understanding computer graphics and can be discussed later in the course.† We later provide a package with higher level functions, which is not needed initially and will be discussed later.

* Shaders can also be in separate files. However, if we do so, some browsers will complain about cross-origin requests if the application is run locally without a web server, and the shaders will then fail to load.
† A second version of initShaders that reads the shaders from files is also made available.

7.5 Modeling a Cube

Let's consider the example of rendering a colored cube. The example is simple but incorporates most of the elements needed in any WebGL application. We need both a vertex shader and a fragment shader. Although the output in Figure 7.1 shows both rotation and interaction, we can avoid introducing transformations at this point by specifying vertices in clip coordinates and aligning the faces of the cube with the coordinate axes. Thus, the vertex shader can simply pass through the positions, and the fragment shader needs only to set the color. We will add interaction and rotation later.

We will build a cube model in the standard way through a vertex list. Consider the cube in Figure 7.2 with its vertices numbered as shown.
The basic code for forming the cube data looks something like the following:

function colorCube() {
    quad(1, 0, 3, 2);
    quad(2, 3, 7, 6);
    quad(3, 0, 4, 7);
    quad(6, 5, 1, 2);
    quad(4, 5, 6, 7);
    quad(5, 4, 0, 1);
}

Figure 7.1 Rotating cube with button control.

The quad() function puts the vertex locations into an array that is sent to the GPU. The colorCube() function is the same whether we use WebGL or desktop OpenGL, JavaScript or C/C++. However, when we examine the WebGL version of quad(), we see the differences from OpenGL and some of the key decisions we made in redesigning our course. The first major issue we must confront is how to handle arrays.

7.6 JavaScript Arrays

JavaScript has only three atomic types: a single numeric type (a 64-bit floating-point number), strings, and Booleans. Everything else is an object. Objects inherit from a prototype and have methods and attributes.

Figure 7.2 Representing a cube.

Consider what happens if we use a JS array for our data, as in the code

var vertices = [
    -0.5, -0.5,
    -0.5,  0.5,
     0.5,  0.5,
     0.5, -0.5
];

If we attempt to send these data to the GPU by

gl.bufferData(gl.ARRAY_BUFFER, vertices, gl.STATIC_DRAW);

where gl is our WebGL context, we will get an error message, because WebGL expects a simple C-like array comprising floating-point numbers. There are two main ways to get around this problem. One is to use JS typed arrays, which are equivalent to C-type arrays, as in the code:

var vertices = new Float32Array([
    -0.5, -0.5,
    -0.5,  0.5,
     0.5,  0.5,
     0.5, -0.5
]);

This method works with gl.bufferData(). Typed arrays have become standard for numerical applications in JavaScript, as they are more efficient. There are packages such as glMatrix.js* for doing basic linear algebra with typed arrays. However, we believe typed arrays are not what we want to use for teaching computer graphics. Code developed with typed arrays often looks like C code with many loops, which tend to obscure the underlying geometric operations we try to stress in the course. Using JS arrays allows us to employ array methods to produce clean, concise code. We get around the problem of sending data to the GPU by using a function flatten() that takes a JS array as input and produces a typed array of floats, as in the following:

gl.bufferData(gl.ARRAY_BUFFER, flatten(vertices), gl.STATIC_DRAW);

Consider now the implementation of the quad() function in the JS file. The vertices and vertex colors can be specified by JS arrays as†

var vertices = [
    [-0.5, -0.5,  0.5, 1.0],
    [-0.5,  0.5,  0.5, 1.0],
    [ 0.5,  0.5,  0.5, 1.0],
    [ 0.5, -0.5,  0.5, 1.0],
    [-0.5, -0.5, -0.5, 1.0],
    [-0.5,  0.5, -0.5, 1.0],
    [ 0.5,  0.5, -0.5, 1.0],
    [ 0.5, -0.5, -0.5, 1.0]
];

var vertexColors = [
    [0.0, 0.0, 0.0, 1.0], //black
    [1.0, 0.0, 0.0, 1.0], //red
    [1.0, 1.0, 0.0, 1.0], //yellow
    [0.0, 1.0, 0.0, 1.0], //green
    [0.0, 0.0, 1.0, 1.0], //blue
    [1.0, 0.0, 1.0, 1.0], //magenta
    [0.0, 1.0, 1.0, 1.0], //cyan
    [1.0, 1.0, 1.0, 1.0]  //white
];

* https://github.com/toji/gl-matrix
† Note that using our flatten() function allows us to nest the data in the two preceding examples—something that is possible with JS arrays but is not possible with typed arrays.

Now if the point and color arrays are initialized as empty JS arrays
var points = [];
var colors = [];

and quad() uses two triangles for each quadrilateral, then quad() becomes

function quad(a, b, c, d) {
    var indices = [a, b, c, a, c, d];
    for (var i = 0; i < indices.length; ++i) {
        points.push(vertices[indices[i]]);
        colors.push(vertexColors[a]);
    }
}

If we use a triangle strip, we can use the more efficient

var indices = [b, c, a, d];

Although typed arrays are generally more efficient, we prefer to teach with JS arrays. In an intro course, we generally use small examples and don't change the data very often. Consequently, once the data get to the GPU, it does not matter much whether we used typed arrays or JS arrays to form them. For teaching computer graphics, this code is much clearer than the equivalent in desktop OpenGL or with typed arrays in WebGL.

7.7 The HTML File

The HTML file has three main parts. The first part contains the shaders. We can identify each type of script in the HTML file, and we assign an identifier to each shader so that we can refer to them in the JS file. Although every implementation of OpenGL ES 2.0, and thus WebGL 1.0, must support medium precision in the fragment shader (see Chapter 8), we are still required to have a precision declaration in the fragment shader:

precision mediump float;

The second part of the HTML file reads in four text files. The first is a set of standard utilities available on the web* that allow us to set up the WebGL context. The second file contains a JS function that will read, compile, and link two shaders into a program object. The third file contains the flatten() function. The fourth file is our JS application file.

<script type="text/javascript" src="../Common/webgl-utils.js"></script>
<script type="text/javascript" src="../Common/initShaders.js"></script>
<script type="text/javascript" src="../Common/flatten.js"></script>
<script type="text/javascript" src="cube.js"></script>

Finally, we set up the HTML5 canvas and give it an identifier so that it can be referred to in the JS file:

<canvas id="gl-canvas" width="512" height="512">
    Oops ... your browser doesn't support the HTML5 canvas element
</canvas>

7.8 The JS File

We now return to the JS file, which is organized much like a desktop OpenGL application. Thus, there is a lot of initialization to set up buffers, form a program object, and get data to the GPU. Once all that is done, the rendering function is very simple. We focus on the key parts that differ from desktop OpenGL. The first of these is the setting of the WebGL context, using the setupWebGL() function in the utility package. This function takes as input the canvas identifier we specified in the HTML file:

var gl; //WebGL context

window.onload = function init() {
    var canvas = document.getElementById("gl-canvas");
    gl = WebGLUtils.setupWebGL(canvas);
    if (!gl) { alert("WebGL isn't available"); }

The initialization function is executed once all the files have been loaded, through the onload event. The WebGL context is a JS object that includes all the WebGL functions and identifiers. Note that the window object is global.

* https://code.google.com/p/webglsamples/source/browse/book/webgl-utils.js?r=41401f8a69b1f8d32c6863ac8c1953c8e1e8eba0

Next, we add the vertex position data as we did in the previous section. We specify a viewport and clear color using the standard OpenGL/WebGL functions. There is one important difference, however: the functions and parameters are members of the WebGL context, whose name we chose (var gl) when we set up the context.
gl.viewport(0, 0, canvas.width, canvas.height);
gl.clearColor(0.0, 0.0, 0.0, 1.0);

Using the initShaders() function, we create a program object:

var program = initShaders(gl, "vertex-shader", "fragment-shader");
gl.useProgram(program);

Loading the data onto the GPU and associating JS variables with shader variables is the same as with desktop OpenGL:

var bufferId = gl.createBuffer();
gl.bindBuffer(gl.ARRAY_BUFFER, bufferId);
gl.bufferData(gl.ARRAY_BUFFER, flatten(vertices), gl.STATIC_DRAW);

var vPosition = gl.getAttribLocation(program, "vPosition");
gl.vertexAttribPointer(vPosition, 4, gl.FLOAT, false, 0, 0);
gl.enableVertexAttribArray(vPosition);

and we do the same for the vertex colors. Finally, we call our render function:

    render();
};

function render() {
    gl.clear(gl.COLOR_BUFFER_BIT);
    gl.drawArrays(gl.TRIANGLES, 0, numVertices);
    requestAnimationFrame(render);
}

Note that because this example is static, we don't need requestAnimationFrame(). We include it here because it will be needed in almost all other applications.

7.9 MV.js

We provide students with a matrix-vector package, MV.js, which is included in the HTML file:

<script type="text/javascript" src="../Common/MV.js"></script>

MV.js defines the standard types contained in GLSL (vec2, vec3, vec4, mat2, mat3, mat4) and functions for creating and manipulating them. Thus, we can create the data for our cube program by

var vertices = [
    vec4(-0.5, -0.5,  0.5, 1.0),
    vec4(-0.5,  0.5,  0.5, 1.0),
    vec4( 0.5,  0.5,  0.5, 1.0),
    vec4( 0.5, -0.5,  0.5, 1.0),
    vec4(-0.5, -0.5, -0.5, 1.0),
    vec4(-0.5,  0.5, -0.5, 1.0),
    vec4( 0.5,  0.5, -0.5, 1.0),
    vec4( 0.5, -0.5, -0.5, 1.0)
];

without altering the quad() function, and send the data to the GPU as before:

gl.bufferData(gl.ARRAY_BUFFER, flatten(vertices), gl.STATIC_DRAW);

Although this example is simple and may not illustrate the advantages of using MV.js, it lays the foundation for dealing with geometric types. Later, when we discuss algorithms for various graphics operations, we examine choices as to where to carry out those operations. For example, we can compute per-vertex lighting in the application or in the vertex shader. We can also carry out per-fragment lighting in the fragment shader. In all three cases, by using MV.js, students can test the options with virtually identical code.

Also included in MV.js are functions for producing the matrices that were part of the fixed-function pipeline, including transformation functions (rotate(), translate(), scale()) and viewing functions (ortho(), frustum(), perspective(), lookAt()). These functions form the matrices, which can then be sent to the shaders, for example:

var modelViewMatrix = lookAt(eye, at, up);
var projectionMatrix = ortho(left, right, bottom, ytop, near, far);
gl.uniformMatrix4fv(modelViewMatrixLoc, false, flatten(modelViewMatrix));
gl.uniformMatrix4fv(projectionMatrixLoc, false, flatten(projectionMatrix));

or we can create matrices starting with an identity matrix, as in

var instance = mat4();
instance = mult(instance, translate(displacement));

An instructor might want to delay using MV.js until the relevant topics have been covered, but using MV.js frees students from writing a lot of repetitious code.

7.10 Input and Interaction

Thus far, we have only shown that WebGL enables us to run code that is very similar to desktop OpenGL in a browser. That in itself is a major plus for using WebGL for teaching computer graphics.
At least as important is the ability to teach interactive computer graphics rather than just rendering—something that was becoming more and more difficult as desktop OpenGL evolved. The underlying problem is that desktop OpenGL requires additional libraries to support interaction and to provide the glue between OpenGL and the local window system. These libraries are constantly changing, are not standardized, and became less useful once many functions were deprecated, starting with OpenGL 3.1. With WebGL, we can use a variety of packages, such as jQuery, that have become standard for developing web applications and are not coupled to WebGL.

Even without extra packages, it is very simple to bring in basic interactive devices, including menus, sliders, text boxes, and buttons, with just HTML and JS. For example, a button for a toggle can be done with one line in the HTML file:

<button id="myButton">Toggle</button>

and in the JS file we change a Boolean variable toggle by

var a = document.getElementById("myButton");
a.addEventListener("click", function (event) { toggle = !toggle; });

In fact, we can do the equivalent with just a single line in the HTML file, but we prefer to separate the page description from the action. A slider is almost as simple. Here's one where a variable speed in the JS file goes from 0 to 100, requires a minimum step size of 10, and starts with a value of 50. In the HTML file, use the line

<input id="slider" type="range" min="0" max="100" step="10" value="50" />

and then in the JS file

document.getElementById("slider").onchange = function (event) {
    speed = 100 - event.target.value;
};

7.11 Textures and Render-to-Texture

The texture-mapping functions in WebGL are those from ES with one important addition—namely, that we can easily use images in the standard web formats (GIF, JPEG, PNG). For example, the texture-mapped cube in Figure 7.3 uses an image specified with an image tag (with the identifier texImage) in the HTML file; the texture is then configured in the JS file by

var image = document.getElementById("texImage");
gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGB, gl.RGB, gl.UNSIGNED_BYTE, image);

Figure 7.3 Texture mapped cube.

Some browsers support video formats, allowing animated texture mapping using MPEG files. It is also possible to use a canvas element as a texture. When we combine the ease of bringing in images with off-screen rendering, especially using render-to-texture, many possibilities for assignments and projects open up, even in a first class. Some examples include image processing, simulation, agent-based modeling, and GPGPU.
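To give a flavor of the render-to-texture setup behind such projects, here is a minimal sketch; the 256 × 256 size and the variable names are illustrative, not from this chapter:

// Create a texture to render into.
var targetTexture = gl.createTexture();
gl.bindTexture(gl.TEXTURE_2D, targetTexture);
gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, 256, 256, 0,
              gl.RGBA, gl.UNSIGNED_BYTE, null);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);

// Attach it to a framebuffer so that draw calls render off-screen.
var framebuffer = gl.createFramebuffer();
gl.bindFramebuffer(gl.FRAMEBUFFER, framebuffer);
gl.framebufferTexture2D(gl.FRAMEBUFFER, gl.COLOR_ATTACHMENT0,
                        gl.TEXTURE_2D, targetTexture, 0);

// ... issue the off-screen draw calls here ...

// Switch back to the default framebuffer and use the result as a texture.
gl.bindFramebuffer(gl.FRAMEBUFFER, null);
gl.bindTexture(gl.TEXTURE_2D, targetTexture);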
7.12 Discussion

Compared with 20 years of teaching with desktop OpenGL, the course has been a huge success, as measured by student feedback and by the quality of the required term projects. For example, "traditional" term projects such as building small CAD systems were enhanced by the ability to use web packages such as jQuery rather than trying to put together a user interface with the GLUT and GLEW OpenGL libraries. What was more interesting was that students were able to explore GPGPU in areas such as image processing and algorithm design, thus enhancing their understanding of both computer science fundamentals and the power of modern GPUs. Some questions that came up during and after the course merit discussion. Some of these issues were presented at SIGGRAPH 14 [Cozzi 14].

7.12.1 Should the Standard Curriculum Be Modified?

Although the core topics in an introductory course are well established, over the years not only has more material become important, but the emphasis has also changed. With programmable GPUs, many topics that were considered advanced or non-real-time are now easily programmed in shaders. One of the unexpected outcomes of our class was that some students did projects that were less traditional computer graphics and more GPGPU. What these students gained—a realization of the computing power of the GPU—went well beyond what the students doing more traditional projects took away. Our approach at this point is to get through much of the traditional material on geometry, shading, and lighting a little faster, so as to leave more time for interaction and discrete methods. A better long-term goal is to move toward two standard courses: one close to the traditional course and a second on GPU computing.

7.12.2 What Other JS Issues Haven't We Addressed?

When planning the switchover to WebGL, a major concern was using JavaScript rather than C or C++. JS is still not well regarded in academic CS and CE departments. Unfortunately, much of that bias is based on earlier versions of the language. In practice, the students had little trouble picking up the language (although the same cannot be said for many instructors!). Note that students who chose to program in Java and used translators to convert the code to JS wound up creating large amounts of inefficient code.

There are some "gotchas" in JS that should be discussed. One is that everything other than the three basic types is an object, and objects inherit from prototypes. A related issue is that JS is a very large language, and there are many ways to construct objects. These topics should be discussed, perhaps together with a discussion of scope in JS. They are well addressed in JavaScript: The Good Parts [Crockford 08], which can serve as either a reference or a required textbook for the course.

7.12.3 What about Other Types of Shaders?

Because WebGL is an implementation of OpenGL ES 2.0, it lacks geometry, tessellation, and compute shaders. Nor will any of these shaders be included in WebGL 2.0. For a first course in computer graphics, most instructors will not need any of these shaders. At worst, they can be discussed and sometimes replaced by clever code. For now, implementing algorithms with fragment shaders for applications such as image processing, using render-to-texture, can demonstrate the power of GPU computing without compute shaders.

7.12.4 Why Not three.js?

three.js (threejs.org) is a powerful scene-graph API built on top of WebGL and is an alternate API for a graphics course.* Some students chose to use it for their term projects. However, there are good reasons why it is not the best choice for a course in computer graphics for students in computer science and engineering. A scene-graph API is primarily for constructing models, as opposed to working with the underlying rendering concepts. Consequently, three.js and other scene-graph APIs are better suited to CAD-like courses and survey courses.

7.12.5 A Final Word

How we teach computer graphics is rapidly changing. Within the past 3 years, we have moved our course first to shader-based desktop OpenGL and then to WebGL. Although we cannot deny that these changes involved a lot of work, the course is greatly improved.
The ability to run the same application across virtually all platforms and devices is a powerful incentive for making the change, as is the ability to integrate with other web applications and to develop in almost any browser. As for some of the weaknesses of the present API, we look forward to upcoming changes that will go a long way toward overcoming any objections to using WebGL. One is the new version of JavaScript (ES6), which will bring JavaScript closer to other programming languages. Second, there is a clear path forward for WebGL: WebGL 2.0 should be available by the time this book is published. Looking slightly further ahead, OpenGL ES 3.1 has been released, and among its new features are compute shaders. Combined with ES6, WebGL will provide an even more powerful platform for teaching computer graphics in the coming years.

Bibliography

[Angel 13] E. Angel. "Teaching Computer Graphics with Shader-Based OpenGL." In OpenGL Insights, P. Cozzi and C. Riccio (Eds.), CRC Press, Boca Raton, FL, 2013, 3–16.
[Angel 15] E. Angel and D. Shreiner. Interactive Computer Graphics (7th ed.), Pearson Education, New York, 2015.
[Cozzi 14] https://github.com/pjcozzi/Articles/blob/master/SIGGRAPH/2014/Teaching-Intro-and-Advanced-Graphics-with-WebGL-Small.pptx
[Crockford 08] D. Crockford. JavaScript: The Good Parts, O'Reilly, Sebastopol, CA, 2008.
[Flanagan 11] D. Flanagan. JavaScript: The Definitive Guide, O'Reilly, Sebastopol, CA, 2011.
[McGuire 14] http://casual-effects.blogspot.com/2014/01/an-introduction-to-javascript-for.html

* www.udacity.com/course/cs291

Section III Mobile

One of WebGL's biggest strengths is that it runs on both desktop and mobile devices, including iOS, Android, and Windows Phone. This allows us to reach the widest audience with a single codebase. In fact, in many cases our applications will run just fine on mobile simply by adding touch events or relying on the emulated mouse events. However, a deeper understanding of WebGL on mobile can help us write more reliable and faster code.

In Chapter 8, "Bug-Free and Fast Mobile WebGL," Olli Etuaho draws from his experience at NVIDIA to present tips for mobile WebGL development, including browser tools for mobile testing, profiling, and debugging; the pitfalls of shader precision and framebuffer color attachment formats; improving performance and power usage by reducing WebGL calls, optimizing shaders, and reducing bandwidth; and selecting a mobile-friendly WebGL engine.

8 Bug-Free and Fast Mobile WebGL

Olli Etuaho

8.1 Introduction
8.2 Feature Compatibility
8.3 Performance
8.4 Resources
Acknowledgments
Bibliography

8.1 Introduction

WebGL availability on mobile browsers has recently taken significant leaps forward. On leading mobile platforms, WebGL has also reached a performance and feature level where it is a feasible target for porting desktop apps. Often, content written on desktop will simply work across platforms. Still, special care needs to be taken to make a WebGL application run well on mobile devices. There are some nuances in the WebGL specification that may go unnoticed when testing only on desktop platforms, and limited CPU performance means that JavaScript and API usage optimizations play a much larger role than on desktops and high-end notebooks. Mobile GPU architectures are also more varied than desktop ones, so there are more performance pitfalls to deal with (Figure 8.1).
Figure 8.1 An example of what is possible with WebGL on mobile platforms: a physically based rendering demo running at 30 FPS in Chrome on an NVIDIA Shield Tablet (Tegra K1 32-bit).

This chapter focuses on shader precision, which is not widely understood in the community and has thus been a frequent source of bugs even in professionally developed WebGL libraries and applications. I will also touch upon some other common sources of bugs, and provide an overview of optimization techniques that are particularly useful when writing applications with mobile in mind. As ARM SoCs make inroads into notebooks, such as the recent Tegra K1 Chromebooks, this material will become relevant even for applications that only target a mouse-and-keyboard form factor.

8.1.1 Developer Tools

Modern browsers include many developer tools that help with improving a WebGL app's mobile compatibility and performance. In Chrome 42, the JavaScript CPU Profile, Device mode, and Capture Canvas Frame tools are particularly useful for mobile WebGL development. They have equivalent alternatives in Firefox, where we can use the Profiler tab of the dev tools, the Responsive Design Mode, and the Canvas tab or the WebGL Inspector extension. Some equivalent tools are also found in Internet Explorer's F12 developer tools and Safari's Web Inspector. Familiarize yourself with these tools in your preferred development environment, as I'll be referring to them throughout the rest of the chapter. I'm going to use the tools found in Chrome as an example, but often enough the basic processes generalize to the tools provided by other browsers.

The Device mode tool mostly helps with getting viewport settings right and making sure the web application's UI can be used with touch. It does not emulate how the GPU behaves on the selected device, so it does not completely eliminate the need for testing on actual hardware, but it can help when getting started with mobile development. Firefox's Responsive Design Mode does not have the touch features of Chrome's Device mode, but can still help.

The JavaScript CPU Profile and Capture Canvas Frame tools help with finding application bottlenecks and optimization opportunities. In Chrome 39, Capture Canvas Frame is still experimental and needs to be enabled from the flags page* with #enable-devtools-experiments and from the developer tools settings. It's possible to use these inside the Inspect Devices tool† to perform the profiling on an Android device connected to your development workstation over USB.

* chrome://flags
† chrome://inspect

Firefox also has some tools that don't have exact counterparts in Chrome. In the about:config settings page in Firefox, there are two WebGL flags that are particularly useful. One is webgl.min_capability_mode. If that flag is enabled, all WebGL parameters, such as the maximum texture size and the maximum number of uniforms, are set to their minimum values across platforms where WebGL is exposed. This is very good for testing if very wide compatibility with older devices is required. More on capabilities can be found in the "Capabilities Exposed by getParameter" section of this chapter. The other interesting Firefox flag is webgl.disable-extensions. This one is self-explanatory: it is good for testing that the application has working fallbacks in case a WebGL extension it uses is not supported.
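For instance, a fallback path of the kind this flag helps to exercise might look like the following sketch, where gl is an existing WebGL context and the choice of a packed 8-bit fallback is illustrative rather than prescribed:

// Probe for the extension; getExtension() returns null when unsupported.
var floatExt = gl.getExtension("OES_texture_float");
var texWidth = 256, texHeight = 256;  // illustrative size
gl.bindTexture(gl.TEXTURE_2D, gl.createTexture());
if (floatExt) {
    // High-precision path: floating-point texture data.
    gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, texWidth, texHeight, 0,
                  gl.RGBA, gl.FLOAT, null);
} else {
    // Fallback path: pack values into 8-bit RGBA channels instead.
    gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, texWidth, texHeight, 0,
                  gl.RGBA, gl.UNSIGNED_BYTE, null);
}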
8.2 Feature Compatibility

8.2.1 Shader Precision

Shader precision is the most common cause of issues when WebGL content developed on desktop is ported to mobile. Consider the GLSL example in Listing 8.1, which is designed to render a grid of randomly blinking circles, as in Figure 8.2. It looks like the developer is aware that some mobile devices do not support highp precision in fragment shaders [ESSL100 §4.5.2], since the first three lines, intended to set mediump precision when running on mobile devices, have been included. But there is already a first small error: including #ifdef GL_ES is unnecessary in WebGL, where the GL_ES macro is always defined. It is only useful if the shader source is used directly in both OpenGL and WebGL environments. The critical error, however, is much more subtle, and it might come as a surprise that the pseudorandom number generator is completely broken on many mobile devices. See Figure 8.3 for the results the shader produces on one mobile chip.

Figure 8.2 Intended effect of Listing 8.1, captured on a notebook with an NVIDIA GeForce GPU.

Figure 8.3 The example shader running on a device that includes 16-bit floating-point hardware.

The reason for the different rendering is that the lowest floating-point precision most desktop GPUs implement in hardware is 32 bits, and all floating-point calculations, regardless of GLSL precision, run on that same 32-bit hardware. In contrast, the mobile device represented in Figure 8.3 includes both 32-bit and 16-bit floating-point hardware, and calculations specified as lowp or mediump are performed on the more power-efficient 16-bit hardware. The differences between mobile GPUs get a lot more varied than that, so on other devices we can see other kinds of flawed rendering. The GPUs use different bit counts in their floating-point representations, use different rounding modes, and may handle subnormal numbers differently [Olson 13]. This is all according to the specification, so the mobile devices are not doing anything wrong by performing the calculations in a more optimized way. The WebGL specification is just very loose when it comes to floating-point precision.

The calculation that goes wrong in this case is on line 8 of Listing 8.1—the return statement of rand(). The result of sin(x) is in the range [–1, 1], so sin(x) * 43758.5453 will be in the range [–43758.5453, 43758.5453]. The float value range in the minimum requirements for mediump precision is [–16384, 16384], and values outside this range might be clamped, which is the first possible cause of rendering issues. Even more importantly, the relative precision in the minimum requirements for mediump floats is 2⁻¹⁰, so only roughly the first four significant digits count. For example, the closest possible representation of 1024.5 in minimum mediump precision is 1024.0, so fract(1024.5) computed in mediump precision returns 0.0, not 0.5. To get a better feel for how lower-precision floating-point numbers behave, see the low-precision floating-point simulator.*

* http://oletus.github.io/float16-simulator.js/

There are two ways to fix Listing 8.1. The simplest is to replace the first three lines with

precision highp float;

and be done with it. As a result, the shader either works exactly as on common desktop GPUs, or does not compile at all if the platform does not support highp in fragment shaders.
This might raise worries about abandoning potential users with older devices, but the majority of mobile devices that have reasonable WebGL support also fully support highp. This includes recent SoCs from Qualcomm, NVIDIA, and Apple. In particular, all OpenGL ES 3.0 supporting devices fully support highp [ESSL300 §4.5.3], but many older devices also include support for it. In data collected by WebGL Stats in the beginning of December 2014, 85% of mobile clients reported full support for highp in fragment shaders.

Listing 8.1 Example of a shader with precision issues.

#ifdef GL_ES
precision mediump float;
#endif

uniform float time;

float rand(vec2 co) {
    return fract(sin(dot(co.xy, vec2(12.9898, 78.233))) * 43758.5453);
}

void main(void) {
    // Divide the coordinates into a grid of squares
    vec2 v = gl_FragCoord.xy / 20.0;
    // Calculate a pseudo-random brightness value for each square
    float brightness = fract(rand(floor(v)) + time);
    // Reduce brightness in pixels away from the square center
    brightness *= 0.5 - length(fract(v) - vec2(0.5, 0.5));
    gl_FragColor = vec4(brightness * 4.0, 0.0, 0.0, 1.0);
}

In vertex shaders, it is recommended to use highp exclusively, since it is universally supported and vertex shaders are the bottleneck less often than fragment shaders. The better performance that can be attained on a minority of mobile GPUs in particularly vertex-heavy applications usually isn't worth the risk of bugs from using mediump in vertex shaders.

Going the highp route does sacrifice some performance and power efficiency, along with some device compatibility. A better solution is to change the pseudorandom function. In this case, the fix needs to be verified on a platform that performs floating-point calculations in a lower precision. One option is using suitable mobile hardware. To find out which kind of hardware you have, see the shader precision section of the Khronos WebGL information page: https://www.khronos.org/registry/webgl/sdk/tests/extra/webgl-info.html.

Another convenient option is to use software emulation of lower precision floats. The ANGLE library implements such emulation by mutating GLSL shaders that are passed to WebGL. When the mutated shaders are run, they produce results that are very close to what one would see on most mobile devices. The emulation carries a large performance cost and may prevent very complex shaders from being compiled successfully, but it is typically fast enough to run content targeting mobile platforms at interactive rates on the desktop. Chrome 41 was the first browser to expose this emulation, through the command line flag --emulate-shader-precision. You'll find more information on precision emulation in the GitHub repository of WebGL Insights: https://github.com/WebGLInsights/WebGLInsights-1.

In the case of the example code, simply reducing the last coefficient in the pseudorandom function will make the shader work adequately in mediump precision. The code with the fixes in place is found in Listing 8.2.

Listing 8.2 The example shader with precision-related issues fixed.
precision mediump float;

uniform float time;

float rand(vec2 co) {
    return fract(sin(dot(co.xy, vec2(12.9898, 78.233))) * 137.5453);
}

void main(void) {
    // Divide the coordinates into a grid of squares
    vec2 v = gl_FragCoord.xy / 20.0;
    // Calculate a pseudo-random brightness value for each square
    float brightness = fract(rand(floor(v)) + time);
    // Reduce brightness in pixels away from the square center
    brightness *= 0.5 - length(fract(v) - vec2(0.5, 0.5));
    gl_FragColor = vec4(brightness * 4.0, 0.0, 0.0, 1.0);
}

The OpenGL ES Shading Language spec mandates that if both a vertex shader and a fragment shader reference the same uniform, the precision of the two uniform declarations must match. This can sometimes complicate precision-related fixes, especially if the WebGL shader code is not written directly but generated by some tool. If a project is hit by a compiler error because of a uniform precision mismatch, it is best to make the precision of the two uniform declarations the same by any means available, rather than working around the issue with two different uniforms that have different precision settings. Using two uniforms hinders the maintainability of the code and adds CPU overhead, which is always to be avoided on mobile platforms. Keep this issue in mind when making decisions about shader development tools. Code written in ESSL is much easier to convert to other languages than vice versa, so ESSL is a good choice for a primary shading language.

Tip: Using mediump precision in fragment shaders provides the widest device compatibility, but risks corrupted rendering if the shaders are not properly tested.

Tip: Using only highp precision prevents corrupted rendering at the cost of losing some efficiency and device compatibility. Prefer highp especially in vertex shaders.

Tip: To test device compatibility of shaders that use mediump or lowp precision, it is possible to use software emulation of lower precision.

Tip: fract() is an especially risky function when run at low precision.

Tip: Remember to define sampler precision when sampling from float textures.

8.2.2 Render Target Support

Attempting to render to an unsupported color buffer format is another common source of bugs when running WebGL content on mobile platforms. Desktop platforms typically support many formats as framebuffer color attachments, but 8-bit RGBA is the only format for which render target support is strictly guaranteed in WebGL [WebGL]. In the mobile world, one may find GLES 2.0 devices that implement only this bare minimum required by the spec. On devices that support GLES 3.0 the situation is improved, and both 8-bit RGBA and RGB are guaranteed to be supported in WebGL [GLES3.0 §3.8.3].

However, one needs to tread even more carefully when using floating-point render targets, which are available in WebGL through OES_texture_float and WEBGL_color_buffer_float. The OES_texture_float extension, which includes the possibility of implicit render target support, is somewhat open to interpretation, and the level of support varies. Some mobile devices do not support rendering to floating-point textures at all, but others are better: Some OpenGL ES 3.0 devices can expose rendering to 16-bit float RGBA, 16-bit float RGB, and 32-bit float RGBA, just like browsers on desktop.
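Because support varies this much, the reliable approach is to probe at runtime: Create the attachment, then ask checkFramebufferStatus whether the combination actually works. A minimal sketch for 32-bit float RGBA on a WebGL 1 context; the helper name and texture size are illustrative:

// Returns true if the context can render to a floating-point RGBA texture.
function canRenderToFloatTexture(gl) {
    if (!gl.getExtension('OES_texture_float')) {
        return false;
    }
    var texture = gl.createTexture();
    gl.bindTexture(gl.TEXTURE_2D, texture);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);
    gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, 4, 4, 0, gl.RGBA, gl.FLOAT, null);
    var fbo = gl.createFramebuffer();
    gl.bindFramebuffer(gl.FRAMEBUFFER, fbo);
    gl.framebufferTexture2D(gl.FRAMEBUFFER, gl.COLOR_ATTACHMENT0,
                            gl.TEXTURE_2D, texture, 0);
    var complete =
        gl.checkFramebufferStatus(gl.FRAMEBUFFER) === gl.FRAMEBUFFER_COMPLETE;
    gl.bindFramebuffer(gl.FRAMEBUFFER, null);
    gl.deleteFramebuffer(fbo);
    gl.deleteTexture(texture);
    return complete;
}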
Note that support for rendering to 32-bit float RGB textures was recently removed from the WebGL specifications altogether to unify the platform, but some browsers on desktop may still include this legacy feature. The level of render target support on different platforms is summarized in the following table for the most interesting formats. It is organized by the native APIs that WebGL implementations use as backends on different platforms. On older mobile devices made before 2013, the API supported is typically OpenGL ES 2.0. On mobile devices starting from late 2013, the API supported is typically OpenGL ES 3.0 or newer. On Apple computers, starting from some 2007 models, and on recent Linux PCs, the API is OpenGL 3.3 or newer. On Windows, browsers typically implement WebGL using a DirectX 9 or DirectX 11 backend; the Windows render target support data here are from ANGLE, which is the most common backend library for WebGL on Windows.

Format/platform       OpenGL ES 2.0   OpenGL ES 3.0/3.1   OpenGL 3.3/newer   ANGLE DirectX
8-bit RGB             Maybe           Yes                 Yes                Yes
8-bit RGBA (1)        Yes             Yes                 Yes                Yes
16-bit float RGB      Maybe (3)       Maybe (3)           Maybe              Yes (5)
16-bit float RGBA     Maybe (3)       Maybe (3, 4)        Yes                Yes
32-bit float RGB (2)  No              No (6)              Maybe              Maybe (5)
32-bit float RGBA     No              Maybe (4)           Yes                Yes

Notes:
(1) Guaranteed by the WebGL specification, even if it's not guaranteed by GLES2.
(2) Support was removed from WEBGL_color_buffer_float recently.
(3) Possible to implement using EXT_color_buffer_half_float.
(4) Possible to implement using EXT_color_buffer_float.
(5) Implemented using the corresponding RGBA format under the hood.
(6) Possible to implement by using RGBA under the hood, but browsers don't do this.

To find out which texture formats your device is able to use as framebuffer color attachments, use the Khronos WebGL information page: https://www.khronos.org/registry/webgl/sdk/tests/extra/webgl-info.html.

Tip: When using an RGB framebuffer, always implement a fallback to RGBA for when RGB is not supported. Use checkFramebufferStatus.

8.2.3 Capabilities Exposed by getParameter

The WebGL function getParameter can be used to determine capabilities of the underlying graphics stack. It will reveal the maximum texture sizes; how many uniform, varying, and attribute slots are available; and how many texture units can be used in fragment or vertex shaders. Most importantly, sampling textures in vertex shaders may not be supported at all—the value of MAX_VERTEX_TEXTURE_IMAGE_UNITS is 0 on many mobile devices.

The webgl.min_capability_mode flag in Firefox can be used to test whether an application is compatible with the minimum capabilities of WebGL. Note that this mode does not use the minimums mandated by the specification, as some of them are set to such low values that they are never encountered in the real world. For example, all devices where WebGL is available support texture sizes at least up to 1024×1024, even if the minimum value in the spec is only 64×64. OpenGL ES 3.0 devices must support textures at least up to 2048×2048 pixels in size.
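A short sketch of this kind of capability check; the specific limits required by the hypothetical rendering path are illustrative:

// Query a few limits and decide whether a rendering path is viable.
var maxTextureSize = gl.getParameter(gl.MAX_TEXTURE_SIZE);
var maxVertexTextures = gl.getParameter(gl.MAX_VERTEX_TEXTURE_IMAGE_UNITS);

// Illustrative requirements for a renderer that samples a heightmap in the
// vertex shader and packs its surface textures into one large atlas.
if (maxVertexTextures === 0) {
    // Fall back to vertex positions precomputed on the CPU.
}
if (maxTextureSize < 4096) {
    // Fall back to smaller atlases and more texture binds.
}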
8.3 Performance

When it comes to performance, one thing in particular sets mobile GPUs apart from their desktop counterparts: the limited main memory bandwidth, which is also shared with the SoC's CPU cores [Pranckevičius 11; Merry 12]. This has led Imagination, Qualcomm, and ARM to implement tiled rendering solutions in hardware, which reduce main memory bandwidth usage by caching a part of the framebuffer on-chip [Merry 12]. However, this approach also has drawbacks: It makes changing the framebuffer or blending state particularly expensive. The NVIDIA Tegra family of processors has a more desktop-like GPU architecture, so there are fewer surprises in store, but as a result the Tegra GPUs may be more sensitive to limited bandwidth in some cases.

In the case of WebGL, the CPU also often becomes a bottleneck on ARM-based mobile devices. Running application logic in JavaScript costs more than implementing the same thing in native code, and the validation of WebGL call parameters and data done by the browser also uses up CPU cycles. The situation is made worse by the fact that WebGL work can typically be spread across at most a few CPU cores, and single-core performance is growing more slowly than parallel-processing performance. The rate of single-core performance gains started slowing down on desktop processors around 10 years ago [Sutter 09], and recently mobile CPUs have started to show the same trend. Even before the past few years, the performance of mobile GPUs has grown faster than the performance of mobile CPUs. To use the performance figures Apple has quoted in marketing the iPhone as an example, its graphics performance has increased nearly 90% per year on average, whereas its CPU performance has grown only by 60% per year on average. This means that optimizations targeting the CPU have become more important as new hardware generations have arrived.

8.3.1 Getting Started with Performance Improvements

Performance improvements should always start with profiling to find where the bottlenecks are. For this, we combine multiple tools and approaches. A good starting point is obtaining a JavaScript CPU profile from a target device using the browser developer tools. Profiling on a few different devices is also useful, particularly if it is suspected that an application could be CPU limited on some devices and GPU limited on others. A profile measured on desktop doesn't help as much: The CPU/GPU balance is usually very different, and the CPU bottlenecks can be different as well.

If the idle time shown in the profile is near zero, the bottleneck is likely to be in JavaScript execution, and it is worthwhile to start optimizing application logic and API usage and seeing whether this has an effect on performance. Application logic is unique to each application, but the same API usage patterns are common among many WebGL applications, so see what to look for in the next section. This is going to be especially beneficial if the CPU profile clearly shows WebGL functions as bottlenecks; see Figure 8.4 for an example. Each of the optimizations detailed in the following sections can sometimes yield more than 10% performance improvements in a CPU-bound application. The GitHub repository for WebGL Insights contains a CPU-bound drawing test to demonstrate the effect of some of these optimizations. The test artificially stresses the CPU to simulate CPU load from application logic and then issues WebGL draw calls with options for different optimizations.

Figure 8.4
Example of a JavaScript CPU profile measured while doing heavy operations in the WebGL painting application CooPaint on a Nexus 5 (2013 phone). Some WebGL functions consume a large portion of CPU time. This application would very likely benefit from API usage optimizations such as trimming the number of uniform4fv calls.
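The outline of such a harness is simple; here is a minimal sketch, where the busy-wait loop and the state-setup helper are illustrative stand-ins rather than the actual test code from the repository:

// Simulate application CPU load, then issue draw calls, so the effect of
// API usage optimizations on a CPU-bound frame can be measured.
function renderTestFrame(gl, meshes, simulatedLogicMs) {
    var start = performance.now();
    while (performance.now() - start < simulatedLogicMs) {
        // Busy-wait standing in for application logic.
    }
    for (var i = 0; i < meshes.length; ++i) {
        bindStateFor(gl, meshes[i]); // hypothetical state-setup helper
        gl.drawElements(gl.TRIANGLES, meshes[i].indexCount,
                        gl.UNSIGNED_SHORT, 0);
    }
    requestAnimationFrame(function () {
        renderTestFrame(gl, meshes, simulatedLogicMs);
    });
}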
The test results for each optimization given later in this chapter were measured on a Nexus 5 (2013 phone with a Qualcomm Snapdragon 800) and a Shield Tablet (2014 tablet with an NVIDIA Tegra K1 32-bit).

If we see plenty of idle time on the CPU executing JavaScript, our application may be GPU bound or main memory bandwidth bound. In this case, we should look for ways to optimize shaders or think of different ways to achieve the desired rendering result. Full-screen effects are usually a good place to start optimizing in this case [McCaffrey 12].

8.3.2 Optimizing API Usage

Every WebGL call has some CPU overhead associated with it. The underlying graphics driver stack needs to validate every command for errors, manage resources, and possibly synchronize data between threads [Hillaire 12]. On top of that, WebGL duplicates a lot of the validation to eliminate incompatibilities between different drivers and to ensure stricter security guarantees. On some platforms, there is also overhead from translating WebGL calls to a completely different API, but on mobile platforms this is less of a concern, since the underlying API is typically OpenGL ES, which WebGL is derived from. For more details, see Chapter 2.

Since every API call has overhead, performance can often be improved by reducing the number of API calls. The basics are simple enough: An application should not have unnecessary "get" calls of any kind, especially getError, or frequent calls requiring synchronization, like readPixels, flush, or finish. Drawing a chunk of geometry with WebGL always has two steps: setting the GL state required for drawing and then calling drawElements or drawArrays. Setting the GL state can be broken down further into choosing a shader program by calling useProgram, setting flags and uniforms, and binding resources like textures and vertex buffer objects. An optimal system will change the GL state with the minimal number of API calls, rather than redundantly setting parts of the state that don't change for every single draw call. To start with a simple example, if an application needs blending to be disabled for all draw calls, it makes a lot more sense to call gl.disable(gl.BLEND) once at the beginning of drawing rather than repeatedly before every single draw call.

For binding resources, there are a few distinct ways to reduce API calls. Often the simplest is to use vertex array objects (VAOs), which can significantly improve performance on mobile devices. They are an API construct developed specifically to enable optimization, and are not to be confused with vertex buffer objects (VBOs). In our CPU-bound drawing test, the FPS increase from using VAOs was 10%–13%. For more information, see [Sellers 13]. A VAO encapsulates the state related to bound vertex arrays—that is, the state set by bindBuffer, enable/disableVertexAttribArray, and vertexAttribPointer. By using VAOs, we can replace calls to the aforementioned functions with a single bindVertexArrayOES call. VAOs are available in WebGL 1.0 with the OES_vertex_array_object extension, which is widely supported on mobile devices. As of early 2015, more than 80% of smartphone and tablet clients recorded by WebGL Stats (http://webglstats.com/) have it.

Adding a fallback for devices without VAO support is also straightforward. Let's call the code that binds buffers and sets vertex attrib pointers related to a specific mesh the binding block. If VAOs are supported, the code should initialize the VAO of each mesh using the binding block. Then, when the mesh is drawn, the code either binds the VAO if VAOs are supported, or executes the binding block if they are not. The only case where this becomes more complicated is when there's a different number of active vertex attribute arrays for different meshes—then the code should add disableVertexAttribArray calls where appropriate. For a complete code example, see an explanation of VAOs (http://blog.tojicode.com/2012/10/oesvertexarrayobject-extension.html) or an implementation of a fallback path in SceneJS (https://github.com/xeolabs/scenejs/blob/v4.0/src/core/display/chunks/geometryChunk.js).
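A condensed sketch of that pattern follows; the mesh fields and attribute layout are illustrative:

var vaoExt = gl.getExtension('OES_vertex_array_object');

// The "binding block": buffer bindings and attribute pointers for one mesh.
function executeBindingBlock(gl, mesh) {
    gl.bindBuffer(gl.ARRAY_BUFFER, mesh.vertexBuffer);
    gl.enableVertexAttribArray(0);
    gl.vertexAttribPointer(0, 3, gl.FLOAT, false, mesh.vertexStride, 0);
    gl.bindBuffer(gl.ELEMENT_ARRAY_BUFFER, mesh.indexBuffer);
}

function setupMesh(gl, mesh) {
    if (vaoExt) {
        mesh.vao = vaoExt.createVertexArrayOES();
        vaoExt.bindVertexArrayOES(mesh.vao);
        executeBindingBlock(gl, mesh); // recorded into the VAO once
        vaoExt.bindVertexArrayOES(null);
    }
}

function bindMeshForDraw(gl, mesh) {
    if (vaoExt) {
        vaoExt.bindVertexArrayOES(mesh.vao); // one call replaces the block
    } else {
        executeBindingBlock(gl, mesh); // fallback path
    }
}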
Lowering the number of vertex buffers helps to reduce CPU usage if VAOs are not a good fit for the code for some reason. This can be done by interleaving different types of vertex data for the same object: If we have, for example, positions, texture coordinates, and normals for each vertex, they can all be stored in the same vertex buffer in an interleaved fashion. In our CPU-bound drawing test, which uses four vertex attributes, interleaving the attributes increased the FPS by around 4%. The downside is that interleaving the data needs to be either handled by the content creation pipeline or done at load time; the latter may marginally slow down loading. Interleaving three attributes for a million vertices in a tight JS loop had a cost of around 200 ms on a Nexus 5 (2013 phone).

In some special cases, geometry instancing can also be used, but that is typically much harder to integrate into a general-purpose engine than, for example, using VAOs. It is only feasible if drawing objects is decoupled enough from the application logic for each object. Yet another way to reduce resource binding overhead is to combine the textures of different objects into large texture atlases, so that fewer texture binding changes are needed. This requires support from the content authoring pipeline and the engine, so implementing it is also quite involved.

Uniform values are a part of the shader program state. If we are using the same shader program to draw multiple objects, it is possible that some of the objects also share some of the same uniform values. In this case, it makes sense to avoid setting the same value to a uniform twice. Caching every single uniform value separately in JavaScript may not be a performance win, but if we can group, for example, lighting-related uniform values together, and only update them when the shader used to draw or the lighting changes, this can yield a large performance improvement. In our CPU-bound drawing test, updating one vec3 uniform had a cost on the order of 1 microsecond on the mobile devices tested. This is enough to make extra uniform updates add up to a significant performance decrease if they are done in large quantities.

Reducing the number of useProgram calls is often also possible by sorting draw calls so that the ones using the same program are grouped together. This also helps to reduce redundant uniform updates further. We may even benefit from using parts of the uniform state as a part of the sort key.
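A sketch of what sorting by program combined with grouped uniform caching can look like; the sort key, version counter, and field names are illustrative:

// Group draw calls by program so useProgram runs once per group, and skip
// uniform uploads whose cached values are already current.
drawCalls.sort(function (a, b) {
    return a.programId - b.programId; // illustrative sort key
});

var currentProgramId = -1;
var lastLightingVersion = -1;
for (var i = 0; i < drawCalls.length; ++i) {
    var call = drawCalls[i];
    if (call.programId !== currentProgramId) {
        gl.useProgram(call.program);
        currentProgramId = call.programId;
        lastLightingVersion = -1; // uniforms are per-program state
    }
    // Upload the lighting uniform group only when it has changed.
    if (call.lightingVersion !== lastLightingVersion) {
        gl.uniform3fv(call.lightDirLocation, call.lightDir);
        lastLightingVersion = call.lightingVersion;
    }
    gl.drawElements(gl.TRIANGLES, call.indexCount, gl.UNSIGNED_SHORT, 0);
}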
Software-driven deferred rendering is still a long shot on most mobile devices, but using WEBGL_draw_buffers brings it a bit closer within reach. The performance improvement from using WEBGL_draw_buffers varies, but it reduces CPU usage, since draw calls only need to be issued once, and it also reduces vertex shader load and depth buffer accesses. There should be clear benefits at least in complex scenes [Tian 14].

To find out which of these optimizations we might be able to make, it's useful to get a complete picture of our application's API usage by using Chrome's Capture Canvas Frame tool. The tool lists the sequence of WebGL calls done by our application to render a single frame. It helps to see which commands that set GL state are common among multiple draw calls, and how many calls are spent doing vertex array setup that might be better accomplished by using VAOs.

Tip: Sort draw calls according to shader program state, and reap the benefits by removing redundant useProgram calls and uniform updates.

Tip: Use vertex array objects (VAOs) and interleave static vertex data. This will save a lot of API calls.

8.3.3 Optimizing Shader Execution

When we suspect that our application is shader-bound, we can always perform a simple test to see if this really is the case: Replace all of the shaders with trivial ones that only render a single recognizable color and measure the performance. If the performance changes significantly, the application is likely shader-bound—either by GPU computation or by the texture fetches performed by the shaders.

There are a few different ways to optimize a shader-bound application. Many of the optimizations presented here are discussed in Pranckevičius's excellent SIGGRAPH talk [Pranckevičius 11]. The talk has some more details about micro-optimization, but here we will concentrate on more general guidelines.

If geometry is drawn in a random order, fragment shaders might be run unnecessarily for fragments that eventually get obscured by something else in the scene. This is commonly known as overdraw, and it can be reduced with the help of the early z-test hardware built into the GPU. This requires sorting opaque geometry front-to-back, so that the fragments that are obscured by something else get processed last [McCaffrey 12]. The sorting should be done on a relatively coarse level in order to avoid spending too much time on it and moving the bottleneck to the CPU. Too fine-grained a depth sort can also increase the number of shader state changes needed to render the scene, so it's often a balancing act between sorting by shader state and sorting by depth. Also see Chapter 10. The front-to-back sorting helps to improve performance on all but Imagination's PowerVR series of GPUs, which implement deferred fragment shading in hardware [Merry 12]. Also note that if shaders modify the fragment depth value or contain discard statements, the early z-test hardware won't help. Another possible technique is to use a z-prepass, but it's typically not worth the overhead [McGuire 13].

Another optimization strategy is to switch to a simpler lighting model. This may mean sacrificing some quality, but there are often alternative ways to do things, such as using lookup textures instead of shader computation. However, since mobile GPUs are typically even more bandwidth-bound than they are computation-bound, this kind of change may not always be a performance win.
Recent trends also point toward higher computational performance in mobile GPUs, with bandwidth being the more fundamental limit [McCaffrey 12]. Recent Tegra processors perform especially well in shader computation, so they become texture-bandwidth-bound more easily than other mobile GPUs.

Finally, if we want to squeeze out the last bits of performance, we can try lowering the precision of our shader computations. Just remember to test any shaders that use mediump or lowp carefully for correctness. The performance benefits that can be gained from this vary greatly between platforms: On the NVIDIA Tegra K1, precision does not matter at all, and on the Qualcomm Adreno GPU line, precision only makes a small difference, but large differences have been reported on some other mobile GPUs.

8.3.4 Reducing Bandwidth Usage

There are a few different ways to reduce bandwidth usage in a WebGL app, and some of them are WebGL-specific. First of all, we should make sure that extra copies don't happen when the WebGL canvas is composited. Keep the preserveDrawingBuffer context creation attribute at its default value, false. In some browsers, also setting the alpha context creation attribute to false may enable more efficient occlusion culling; it communicates to the browser compositor that the canvas element is opaque and does not need to be blended with whatever is underneath it on the page. However, if we need a static background other than a flat color for our WebGL content, it may make sense to render the background using other HTML elements and only render the parts that change on every frame on the main WebGL canvas, keeping alpha as true.

The more obvious ways to reduce bandwidth are reducing texture or framebuffer resolution. Reducing texture resolution can sometimes be done without sacrificing too much visual quality, especially if the textures have mostly low-frequency content [McCaffrey 12]. Many mobile device displays have extreme pixel densities these days, so it's worthwhile to consider whether rendering at the native resolution is worth losing performance or reducing the visual quality in other ways. Also see Chapter 14.

Implementing full-screen effects in an efficient way, or avoiding them altogether, can also enable huge bandwidth savings [McCaffrey 12; Pranckevičius 11]. In particular, it is better to combine different postprocessing filters into a single shader, or to add simple postprocessing effects directly into the shaders used to render geometry, when possible.

Using lots of small polygons also costs bandwidth on tiler architectures, since they need to access the vertex data separately for each tile [Merry 12]. Optimizing models to minimize the vertex and triangle count helps on these GPUs.

8.3.5 Choosing a WebGL Engine for Mobile

Using a higher level library or engine instead of writing directly against the WebGL API makes development more efficient. This also applies to mobile applications. The downside is that there's some overhead from using the library, and if the library is not a good fit for the content, its effect on performance on mobile platforms may be disastrous. There are other use cases for WebGL besides rendering 3D scenes, but here I'll give a brief overview of libraries that specifically target 3D rendering, from the mobile perspective.

three.js is one of the most popular WebGL libraries. However, it's not usually the best fit for developing applications for mobile.
three.js offers a very flexible API, but this comes at a cost: Lots of dynamic behavior inside the library adds a whole lot of CPU overhead. At the time of writing, it had only a few of the optimizations detailed in this chapter. It often does redundant or inefficient updates to the GL state, particularly to uniform values, and due to how it's structured, this is unlikely to improve much without some very significant changes. Still, if we're planning to render only relatively simple scenes with few objects, it's possible to get good performance from three.js, and it allows us to easily use advanced effects through built-in and custom shaders. It does have depth sorting to help with avoiding overdraw. With three.js, it is recommended to use the BufferGeometry class instead of the Geometry class to improve performance and memory use.

Some slightly less widely used libraries may be a better fit for your application, depending on what kind of content you're planning to render. Babylon.js (Chapter 9) and Turbulenz (Chapter 10) have both been demonstrated to run fairly complex 3D game content in an ARM SoC environment. Babylon.js has a robust framework for tracking GL state and making only the necessary updates, which saves a lot of CPU time. Turbulenz does some of this as well, and also uses VAOs to improve performance. Be aware of bugs, though: Demos using older versions of Babylon.js do not do justice to the newer versions, where lots of issues appearing on mobile platforms have been fixed. Turbulenz still has some open issues that affect mobile specifically, so more work is required to ship a complete product with it.

SceneJS is another production-proven WebGL library, which was heavily optimized in version 4.0. It lacks some of the flexibility of its peers and may not be suited to all types of 3D scenes, but makes up for this by using VAOs and state tracking to optimize rendering. If the application content consists of large amounts of static geometry, SceneJS may be an excellent pick.

Native engines that have been Emscripten-compiled to JavaScript (Chapter 5) are sadly not yet a viable alternative when targeting mobile devices. The memory cost is simply too high on devices with less than 4 GB of memory, and the runtimes have a lot of overhead. Future hardware generations, improvements to the typed array specification, JavaScript engines, and the 3D engines themselves might still change this, but so far better results can almost always be had by using JavaScript directly.

8.4 Resources

See this chapter's materials in the WebGL Insights GitHub repo for additional resources, such as:

• A JavaScript calculator that demonstrates the behavior of low-precision floats
• Tools that emulate lowp and mediump computations on desktop
• A WebGL test that demonstrates CPU optimizations

Acknowledgments

Thanks to Shannon Woods, Florian Bösch, and Dean Jackson for providing useful data for this chapter.

Bibliography

[ESSL100] Robert J. Simpson. "The OpenGL ES Shading Language, Version 1.00.17." Khronos Group, 2009.

[ESSL300] Robert J. Simpson. "The OpenGL ES Shading Language, Version 3.00.4." Khronos Group, 2013.

[GLES3.0] Benj Lipchak. "OpenGL ES, Version 3.0.4." Khronos Group, 2014.

[Hillaire 12] Sébastien Hillaire. "Improving Performance by Reducing Calls to the Driver." In OpenGL Insights. Edited by Patrick Cozzi and Christophe Riccio. Boca Raton, FL: CRC Press, 2012.

[McCaffrey 12] Jon McCaffrey. "Exploring Mobile vs. Desktop OpenGL Performance." In OpenGL Insights. Edited by Patrick Cozzi and Christophe Riccio. Boca Raton, FL: CRC Press, 2012.
[McGuire 13] Morgan McGuire. "Z-Prepass Considered Irrelevant." Casual Effects Blog. http://casual-effects.blogspot.fi/2013/08/z-prepass-considered-irrelevant.html, 2013.

[Merry 12] Bruce Merry. "Performance Tuning for Tile-Based Architectures." In OpenGL Insights. Edited by Patrick Cozzi and Christophe Riccio. Boca Raton, FL: CRC Press, 2012.

[Olson 13] Tom Olson. "Benchmarking Floating Point Precision in Mobile GPUs." http://community.arm.com/groups/arm-mali-graphics/blog/2013/05/29/benchmarking-floating-point-precision-in-mobile-gpus, 2013.

[Pranckevičius 11] Aras Pranckevičius. "Fast Mobile Shaders." http://aras-p.info/blog/2011/08/17/fast-mobile-shaders-or-i-did-a-talk-at-siggraph/, 2011.

[Resig 13] John Resig. "ASM.js: The JavaScript Compile Target." http://ejohn.org/blog/asmjs-javascript-compile-target/, 2013.

[Sellers 13] Graham Sellers. "Vertex Array Performance." OpenGL SuperBible Blog. http://www.openglsuperbible.com/2013/12/09/vertex-array-performance/, 2013.

[Sutter 09] Herb Sutter. "The Free Lunch Is Over: A Fundamental Turn toward Concurrency in Software." Originally appeared in Dr. Dobb's Journal 30(3), 2005. http://www.gotw.ca/publications/concurrency-ddj.htm, 2009.

[Tian 14] Sijie Tian, Yuqin Shao, and Patrick Cozzi. "WebGL Deferred Shading." Mozilla Hacks Blog. https://hacks.mozilla.org/2014/01/webgl-deferred-shading/, 2014.

[WebGL] Khronos WebGL Working Group. "WebGL Specification." https://www.khronos.org/registry/webgl/specs/latest/1.0/, 2014.

Section IV Engine Design

Most graphics developers—and I believe most WebGL users—build applications on top of game and graphics engines that hide the details of the graphics API and provide convenient abstractions for loading models, creating materials, culling, level of detail, camera navigation, and so on. This allows developers to focus on their specific problem domain and not, for example, have to write shaders and model loaders by hand. Given how many developers build on top of engines, it is important that engine developers make efficient use of the WebGL API and provide simple and flexible abstractions.

In this section, the creators of some of the most popular WebGL engines and platforms describe engine design and optimizations. Common themes throughout this section are understanding the CPU and GPU overhead of different WebGL API functions such as setting render state and uniforms, strategies for minimizing their cost without incurring too much CPU overhead, and shader pipelines that generate GLSL from higher-level material descriptions and shader libraries. Given the wide array of use cases, we see both ubershaders and shader graphs, as well as offline, online, and hybrid shader pipelines.

Babylon.js is an open-source WebGL engine from Microsoft focused on simplicity and performance. It was used, for example, to build Ubisoft's Assassin's Creed Pirates Race. In Chapter 9, "WebGL Engine Design in Babylon.js," David Catuhe, the engine's lead developer, explains the design philosophy, public API, and internals, including its render loop, shader generation with an ubershader for optimized shaders and fallbacks for low-end hardware, and caches for matrices, states, and programs.

I first learned about Turbulenz, an open-source WebGL game engine, at WebGL Camp Europe 2012.
David Galeano presented it, including an impressive demo of rendering Quake 4 assets with culling, state sorting, and many other optimizations. In Chapter 10, "Rendering Optimizations in the Turbulenz Engine," David goes into detail on the engine's optimizations. Using the Oort Online game as a use case, David covers culling, organization of the renderer, and efficiently sorting draw calls to minimize WebGL implementation overhead and overdraw.

Given the popularity of Blender as a 3D modeling and animation tool, it should be no surprise that an open-source WebGL engine, Blend4Web, was developed to easily bring Blender content to the web. In Chapter 11, "Performance and Rendering Algorithms in Blend4Web," Alexander Kovelenov, Evgeny Rodygin, and Ivan Lyubovnikov highlight parts of Blend4Web's implementation. Performance topics include runtime batching, culling, LOD, and performing physics in Web Workers with time synchronization. Blend4Web's shader pipeline is detailed with a focus on using a custom preprocessor to create new directives for code reuse. Finally, ocean simulation and shading is covered, including LOD, waves, and several fast and plausible shading techniques for refraction, caustics, reflections, foam, and subsurface scattering.

As for publishing 3D models to the web, Sketchfab has gained a lot of traction in this space. It supports a variety of model formats and converts them to a common 3D scene format for rendering in their WebGL viewer. In Chapter 12, "Sketchfab Material Pipeline: From File Variations to Shader Generation," Cedric Pinson and Paul Cheyrou-Lagrèze explain how materials are handled in Sketchfab. They describe the challenges of implementing robust material support for a wide array of 3D model formats, and present Sketchfab's material pipeline and optimizations. They describe how the scene is streamed to the viewer and how shaders are generated at runtime with a shader graph. The shader graph is a graph of nodes representing individual parts of a shader that are compiled together to generate a final shader, allowing developers to work only on individual nodes while still providing flexible materials for artists.

As shaders get larger, they become harder to manage. In Chapter 13, "glslify: A Module System for GLSL," Hugh Kennedy, Mikola Lysenko, Matt DesLauriers, and Chris Dickinson present an elegant solution: glslify. They show how modeling after npm, the Node.js package manager, provides a clean and flexible system for reusing and transforming GLSL code on the server or as part of the build process.

A typical frame budget, 16 or 33 milliseconds, can go by quickly, especially if our engine has a large processing job it tries to perform in a single frame. In Chapter 14, "Budgeting Frame Time," Philip Rideout provides WebGL performance tips, starting with a design for making all WebGL calls inside requestAnimationFrame. To stay within a frame budget, the chapter looks at strategies to amortize work across multiple frames, including using the new yield keyword in ECMAScript 6, offloading CPU-intensive tasks to Web Workers, and optimizing fillrate by using a low-resolution canvas.

9 WebGL Engine Design in Babylon.js

David Catuhe

9.1 Introduction
9.2 Global Engine Architecture
9.3 Engine Centralized Access
9.4 Smart Shader Engine
9.5 Caches
9.6 Conclusion

9.1 Introduction

About a year ago, I decided to sacrifice all my spare time for a project I had had in mind for a very long time: creating a pure JavaScript 3D engine using WebGL.
IE11 had just shipped with early WebGL support, which meant that all major modern browsers were now able to render accelerated 3D content. I had been writing 3D engines since I was 18; the very first one, which I wrote in C/C++, did all its rendering on the CPU. Then I switched to the Glide SDK (from 3DFX). It was my very first contact with 3D accelerated rendering, and I was absolutely blown away by the raw power I was able to control! With Windows 95, I decided to port my engine to DirectX and, using this engine (named Nova), I founded a company in 2002 named Vertice, where I remained for 9 years as CTO and lead developer. Moving to Microsoft in 2012, I created a new engine called Babylon using Silverlight 5 and XNA. This engine was used as the basis of Babylon.js (http://www.babylonjs.com), a pure JavaScript engine that relies on WebGL for rendering.

Using my experience with 3D rendering, I tried to create an engine built upon two foundations:

• Simplicity
• Performance

For me, a successful framework is not about beauty of the code or the heroic stuff you are doing. It is all about simplicity of use. I consider Babylon.js successful when users can tell me that "this is easy to use and understand." Also, performance matters, because we are dealing with real-time rendering. With these two concepts in mind, I tried to create an architecture that is both easy to use and performant. I really hope that web developers can use it without having to read documentation (or at least not too much!). This is why the engine ships with satellite tools:

• Playground (http://www.babylonjs.com/playground): The playground is a place where we can experiment with Babylon.js directly inside the browser. Examples are provided alongside a complete autocompletion and live documentation system that helps while we're typing (see Figure 9.1).
• CYOS (http://www.babylonjs.com/cyos): CYOS (Create Your Own Shader) is a tool where we can experiment with creating shaders that we can then use with Babylon.js (see Figure 9.2).
• Sandbox (http://www.babylonjs.com/sandbox): The sandbox is a page where we can drag and drop 3D scenes created with the Blender 3D or 3DS Max exporters.

In this chapter, I explain how things work under the hood. Writing an engine is not just about rendering triangles and shaders. For the sake of readability and brevity, we focus on the parts of the code related to WebGL.

Tip: Feel free to use the playground with the code examples used during this chapter to get live results.

9.2 Global Engine Architecture

Before digging into the core of Babylon.js, let's take a step back to see the global picture, as shown in Figure 9.3. The engine surface is wide, and in order to better understand how everything is linked, it is important to get a vision of all the actors involved.

The root of everything in Babylon.js is the Engine object. This is the link between the object model and WebGL. All orders sent to WebGL are centralized in the engine. To create it, developers just have to instantiate it with the rendering canvas as a parameter:

var engine = new BABYLON.Engine(canvas);

Figure 9.1
Playground with autocompletion screen opened.

The next important object is the Scene object. We can have as many scenes as we want.
Scenes are the containers for all other objects (meshes, lights, cameras, textures, materials, etc.):

var scene = new BABYLON.Scene(engine);

Once the scene is created, we can start adding components like a camera (i.e., the user's point of view):

var camera = new BABYLON.ArcRotateCamera("Camera", 0, 0, 10, new BABYLON.Vector3(0, 0, 0), scene);

The camera's constructor needs the scene in order to attach to it. This is true for all actors except Engine. Meshes contain geometry information (vertex and index buffers) alongside a world matrix:

var sphere = BABYLON.Mesh.CreateSphere("sphere1", 16, 2, scene);

Figure 9.2
CYOS screen.

Shaders are handled by materials. The most general one is the StandardMaterial:

var materialSphere3 = new BABYLON.StandardMaterial("texture3", scene);
materialSphere3.diffuseTexture = new BABYLON.Texture("textures/misc.jpg", scene);

In Section 9.4, we'll see how shaders are controlled by materials. Listing 9.1 shows a simple scene involving a sphere and a material. Figure 9.4 shows the result of executing the code in Listing 9.1.

Listing 9.1 Creating a simple scene.

var scene = new BABYLON.Scene(engine);
var camera = new BABYLON.FreeCamera("camera1", new BABYLON.Vector3(0, 5, -10), scene);
camera.setTarget(BABYLON.Vector3.Zero());
camera.attachControl(canvas, false);
var light = new BABYLON.HemisphericLight("light1", new BABYLON.Vector3(0, 1, 0), scene);
var sphere = BABYLON.Mesh.CreateSphere("sphere1", 16, 2, scene);
sphere.position.y = 1;
sphere.material = new BABYLON.StandardMaterial("red", scene);
sphere.material.diffuseColor = BABYLON.Color3.Red();

Figure 9.3
Global overview of Babylon.js classes and internal engines.

Figure 9.4
Simple scene.

These objects are the backbone of the framework and work together with several other objects (e.g., postprocesses, shadows, collisions, physics, serialization, etc.) to enable different scenarios. From the user's point of view, the complexities of shaders and WebGL are completely hidden under the API surface. When we call scene.render(), the following process is triggered:

• Using an octree, if activated, and frustum clipping, the scene establishes a list of visible meshes.
• The scene goes through all visible meshes and dispatches them into three groups:
  • Opaque meshes, sorted front to back
  • Alpha tested meshes, sorted front to back
  • Transparent meshes, sorted back to front
• Each group is rendered using associated engine states, such as alpha blending. To render an individual mesh, the following process is executed:
  • The mesh's material is activated: Inner shader, samplers, and uniforms are transmitted to WebGL.
  • The mesh's index and vertex buffers are sent to WebGL.
  • The draw command is executed (gl.drawElements).

Now that we have a clearer view of the engine, let's discuss the internal architecture and optimizations. (The complete code is available on Github: https://github.com/BabylonJS/Babylon.js.)

9.3 Engine Centralized Access

During the rendering process, everything related to WebGL is handled by the Engine object. It was architected this way to centralize cache and states. The Engine contains a reference to the WebGLContext object (gl).
If we open the following URL on Github, we will see that most of the code is about calling gl.xxxxx functions:

https://github.com/BabylonJS/Babylon.js/blob/master/Babylon/babylon.engine.js

The Engine is responsible for supporting the following WebGL features:

• Creating and updating vertex and index buffers
• Creating and updating textures (static, dynamic, video, render target)
• Shader creation, compilation, and linking into programs
• Communication with programs through uniform and sampler bindings
• Buffer and texture bindings
• State management
• Cache management
• Deleting all WebGL resources
• Full-screen management

It also drives frame rendering with a simple function, called runRenderLoop, that uses requestAnimationFrame to render each frame:

engine.runRenderLoop(function () {
    scene.render();
});

Capabilities detection is supported by a simple API to query for a specific extension:

if (engine.getCaps().s3tc) {...}

It keeps track of all compiled programs and all scenes to be able to clean everything up when engine.dispose() is called.

9.4 Smart Shader Engine

In order to achieve maximum performance, Babylon.js is built upon a system that tries to compile the most efficient shader for a given task. Let's imagine we need a material with only a diffuse color. From the user's point of view, the following code will do the job:

mesh.material = new BABYLON.StandardMaterial("red", scene);
mesh.material.diffuseColor = new BABYLON.Color3(1.0, 0, 0);

Under the hood, the StandardMaterial object creates a BABYLON.Effect object, which is responsible for everything related to shaders and programs. This effect is connected to the engine in order to get vertex and fragment shaders compiled and linked into a program.

The StandardMaterial object uses only one big shader—the uber shader—that supports all features (e.g., diffuse, emissive, bump, ambient, opacity, fog, bones, alpha, Fresnel, etc.). We made this choice because it is far easier for us to work with one big shader than to maintain multiple small shaders for each specific task. The drawback of using only one piece of code is the need to remove unwanted features. In our previous example, we only needed the diffuse part of the code and nothing else. Using conditions with uniforms was not an option, because that is not optimal for GPU performance in WebGL.

9.4.1 Removing Conditions from Compiled Programs

We decided to work with the compiler's preprocessor using #define.
Listing 9.2 is part of the fragment shader used by StandardMaterial (the complete shader is available on Github: https://github.com/BabylonJS/Babylon.js/blob/master/Babylon/Shaders/default.fragment.fx):

Listing 9.2 Fragment shader snippet used by StandardMaterial.

// Lighting
vec3 diffuseBase = vec3(0., 0., 0.);
vec3 specularBase = vec3(0., 0., 0.);
float shadow = 1.;

#ifdef LIGHT0
#ifdef SPOTLIGHT0
lightingInfo info = computeSpotLighting(viewDirectionW, normalW, vLightData0, vLightDirection0, vLightDiffuse0.rgb, vLightSpecular0, vLightDiffuse0.a);
#endif
#ifdef HEMILIGHT0
lightingInfo info = computeHemisphericLighting(viewDirectionW, normalW, vLightData0, vLightDiffuse0.rgb, vLightSpecular0, vLightGround0);
#endif
#ifdef POINTDIRLIGHT0
lightingInfo info = computeLighting(viewDirectionW, normalW, vLightData0, vLightDiffuse0.rgb, vLightSpecular0, vLightDiffuse0.a);
#endif
#ifdef SHADOW0
#ifdef SHADOWVSM0
shadow = computeShadowWithVSM(vPositionFromLight0, shadowSampler0);
#else
#ifdef SHADOWPCF0
shadow = computeShadowWithPCF(vPositionFromLight0, shadowSampler0);
#else
shadow = computeShadow(vPositionFromLight0, shadowSampler0, darkness0);
#endif
#endif
#else
shadow = 1.;
#endif
diffuseBase += info.diffuse * shadow;
specularBase += info.specular * shadow;
#endif

Depending on the active options, all defines are gathered and the effect used by the StandardMaterial is compiled. Listing 9.3 shows an example of how the material configures the list of defines:

Listing 9.3 Preparing a list of defines.

if (this.diffuseTexture && BABYLON.StandardMaterial.DiffuseTextureEnabled) {
    if (!this.diffuseTexture.isReady()) {
        return false;
    } else {
        defines.push("#define DIFFUSE");
    }
}

Before compiling the effects, all defines are gathered and transmitted to the effect constructor (Listing 9.4):

Listing 9.4 Compiling an effect.

var join = defines.join("\n");
this._effect = engine.createEffect(shaderName, attribs, [...], [...], join, fallbacks, this.onCompiled, this.onError);

The effect communicates with the engine to get the final program using the specific defines (Listing 9.5):

Listing 9.5 Compiling and linking a program using defines.

var vertexShader = compileShader(this._gl, vertexCode, "vertex", defines);
var fragmentShader = compileShader(this._gl, fragmentCode, "fragment", defines);
var shaderProgram = this._gl.createProgram();
this._gl.attachShader(shaderProgram, vertexShader);
this._gl.attachShader(shaderProgram, fragmentShader);
this._gl.linkProgram(shaderProgram);

Once this task is done, we are sure that the resulting program only contains the necessary code, with no extra conditional statements.

9.4.2 Supporting Low-End Devices through Fallbacks

Along with the performance benefits, using defines as a way to control the final program gives us a way to provide fallbacks when shaders are too complex for specific hardware. Babylon.js is a pure JavaScript library and thus can be used on mobile devices where GPUs are less powerful than on desktops. With simplicity in mind, I wanted to relieve the user of the burden of handling these issues. This is why the Effect object allows us to define a list of fallbacks. The idea is pretty simple: While preparing the list of defines to be used for the compilation, the developer can also set up a list of optional defines that can be removed if the shaders are not successfully compiled.
The main reason why a working shader cannot be compiled on a given piece of hardware is complexity: The number of instructions required to execute it is beyond the current limitations of the hardware. Based on this, the fallback system is used to remove some defines in order to produce simpler shaders. We can then automatically degrade the rendering in order to get a working shader. It is up to the developer to define which options to remove. Since the fallback system is a multistage system, the developer can also define priorities (Listing 9.6):

Listing 9.6 Showing how Fresnel is configured as an optional feature.

var fresnelRank = 1;
if (this.diffuseFresnelParameters && this.diffuseFresnelParameters.isEnabled) {
    defines.push("#define DIFFUSEFRESNEL");
    fallbacks.addFallback(fresnelRank, "DIFFUSEFRESNEL");
    fresnelRank++;
}
if (this.opacityFresnelParameters && this.opacityFresnelParameters.isEnabled) {
    defines.push("#define OPACITYFRESNEL");
    fallbacks.addFallback(fresnelRank, "OPACITYFRESNEL");
    fresnelRank++;
}
if (this.reflectionFresnelParameters && this.reflectionFresnelParameters.isEnabled) {
    defines.push("#define REFLECTIONFRESNEL");
    fallbacks.addFallback(fresnelRank, "REFLECTIONFRESNEL");
    fresnelRank++;
}
if (this.emissiveFresnelParameters && this.emissiveFresnelParameters.isEnabled) {
    defines.push("#define EMISSIVEFRESNEL");
    fallbacks.addFallback(fresnelRank, "EMISSIVEFRESNEL");
    fresnelRank++;
}

Developers can define a rank when creating a fallback. The Engine object will try to compile the shaders with all options, and if this fails, it will try again with some options removed based on ranking (Listing 9.7):

Listing 9.7 Program compilation code with fallback system. (The complete code is available on Github, in the effect and engine classes: https://github.com/BabylonJS/Babylon.js/blob/master/Babylon/babylon.engine.js and https://github.com/BabylonJS/Babylon.js/blob/master/Babylon/Materials/babylon.effect.js.)

try {
    var engine = this._engine;
    this._program = engine.createShaderProgram(vertexSourceCode, fragmentSourceCode, defines);
} catch (e) {
    if (fallbacks && fallbacks.isMoreFallbacks) {
        defines = fallbacks.reduce(defines);
        this._prepareEffect(vertexSourceCode, fragmentSourceCode, attributesNames, defines, fallbacks);
    } else {
        Tools.Error("Unable to compile effect: " + this.name);
        Tools.Error("Defines: " + defines);
        Tools.Error("Error: " + e.message);
    }
}

But even with the greatest shaders you can write, you also have to ensure that only the required commands are sent to WebGL. This is where caching systems come on stage.

9.5 Caches

Babylon.js contains many different caching systems. These systems are intended to reduce the overhead implied by recreating or resetting something already done. Through our testing, we identified several places where caching could be useful:

• World matrices computation
• WebGL states
• Textures
• Programs
• Uniforms

9.5.1 World Matrices

World matrices are computed per mesh in order to get the current position/orientation/scaling. These matrices depend not only on the mesh itself but also on its hierarchy. Before rendering a mesh, we have to compute this specific matrix. Instead of computing it on every frame, we decided to cache it and only compute it when required.
The caching overhead is insignificant compared to the cost of building a world matrix, where we have to:

• Compute the rotation matrix
• Compute the scaling matrix
• Compute the translation matrix
• Compute the pivot matrix
• Get the parent's world matrix
• Compute billboarding if activated
• Multiply all these matrices

Tip: You can see the complete code for this here: https://github.com/BabylonJS/Babylon.js/blob/master/Babylon/Mesh/babylon.abstractMesh.js#L321

Instead of this long code, Babylon.js can check properties against cached values (Listing 9.8):

Listing 9.8 Comparing current values against cached values.

Mesh.prototype._isSynchronized = function () {
    if (this.billboardMode !== AbstractMesh.BILLBOARDMODE_NONE)
        return false;
    if (this._cache.pivotMatrixUpdated) {
        return false;
    }
    if (this.infiniteDistance) {
        return false;
    }
    if (!this._cache.position.equals(this.position))
        return false;
    if (this.rotationQuaternion) {
        if (!this._cache.rotationQuaternion.equals(this.rotationQuaternion))
            return false;
    } else {
        if (!this._cache.rotation.equals(this.rotation))
            return false;
    }
    if (!this._cache.scaling.equals(this.scaling))
        return false;
    return true;
};

9.5.2 Caching WebGL States

On low-end devices, changing WebGL states can be extremely expensive due to the inner nature of the WebGL state machine. State objects in Babylon.js are used to centralize state management. Thus, instead of directly changing a WebGL state, the engine changes the corresponding value in a state object. This state object is then used to apply the effective updates just before a render. This saves a lot of unrequired changes—for example, when we go through a list of meshes that all require the same value for a specific state. Some browsers take care of not overwriting an already set value, but this behavior is pretty uncommon on mobile devices, for example.

Listing 9.9 shows how a state object dedicated to alpha management works. An internal dirty flag is kept up to date and is used to decide whether a specific internal state should be applied:

Listing 9.9 Partial alpha state management.

function _AlphaState() {
    this._isAlphaBlendDirty = false;
    this._alphaBlend = false;
}

Object.defineProperty(_AlphaState.prototype, "isDirty", {
    get: function () {
        return this._isAlphaBlendDirty;
    },
    enumerable: true,
    configurable: true
});

Object.defineProperty(_AlphaState.prototype, "alphaBlend", {
    get: function () {
        return this._alphaBlend;
    },
    set: function (value) {
        if (this._alphaBlend === value) {
            return;
        }
        this._alphaBlend = value;
        this._isAlphaBlendDirty = true;
    },
    enumerable: true,
    configurable: true
});

_AlphaState.prototype.apply = function (gl) {
    if (!this.isDirty) {
        return;
    }
    // Alpha blend
    if (this._isAlphaBlendDirty) {
        if (this._alphaBlend === true) {
            gl.enable(gl.BLEND);
        } else if (this._alphaBlend === false) {
            gl.disable(gl.BLEND);
        }
        this._isAlphaBlendDirty = false;
    }
};

The engine itself uses these state objects when drawElements is required, as shown in Listing 9.10:

Listing 9.10 State objects applied just before drawElements.

Engine.prototype.applyStates = function () {
    this._depthCullingState.apply(this._gl);
    this._alphaState.apply(this._gl);
};

Engine.prototype.draw = function (indexStart, indexCount) {
    this.applyStates();
    this._gl.drawElements(this._gl.TRIANGLES, indexCount, this._gl.UNSIGNED_SHORT, indexStart * 2);
};

9.5.3 Caching Programs, Textures, and Uniforms

Babylon.js also provides a caching system for various data related to WebGL.
For instance, each time an Effect object wants to compile a program, the Engine object checks whether the program was already compiled, by keeping a list of active programs. Textures are handled in the same way: Each time a new Texture object is instantiated, the Engine checks whether that specific url was already loaded, in order to share resources.

We applied the same policy to uniforms, as shown in Listing 9.11:

Listing 9.11 Setting a color3 uniform through the Effect object.

Effect.prototype.setColor3 = function (uniformName, color3) {
    if (this._valueCache[uniformName] &&
        this._valueCache[uniformName][0] === color3.r &&
        this._valueCache[uniformName][1] === color3.g &&
        this._valueCache[uniformName][2] === color3.b)
        return this;
    this._cacheFloat3(uniformName, color3.r, color3.g, color3.b);
    this._engine.setColor3(this.getUniform(uniformName), color3);
    return this;
};

Even this.getUniform is cached: All uniform locations are gathered when programs are compiled and linked, in order to optimize uniform access as well.
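To make the program cache concrete, here is a minimal sketch of the idea, assuming a cache keyed by source code plus defines. The key scheme and the getOrCreateProgram helper are illustrative assumptions, not the actual Babylon.js implementation; only createShaderProgram is taken from Listing 9.7:

// Hypothetical sketch of a compiled-program cache; not the engine's code.
var programCache = {};

function getOrCreateProgram(engine, vertexSourceCode, fragmentSourceCode, defines) {
    var key = defines + "\n" + vertexSourceCode + "\n" + fragmentSourceCode;
    var program = programCache[key];
    if (!program) {
        // First request for this combination: compile, link, and remember it.
        program = engine.createShaderProgram(vertexSourceCode, fragmentSourceCode, defines);
        programCache[key] = program;
    }
    // Subsequent requests with identical source and defines share the program.
    return program;
}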
9.6 Conclusion

This architecture, where we try to automatically optimize for end users, allowed us to create great demos like Assassin's Creed Pirates Race by Ubisoft (Figure 9.5) and a complete train attraction (Figure 9.6). See a complete list of astonishing demos at www.babylonjs.com.

Figure 9.5 Assassin's Creed Pirates Race by Ubisoft.

Figure 9.6 Train simulator.

If you want to discuss this chapter with me, feel free to ping me on Twitter (@deltakosh). I will be more than pleased to respond. Also, Babylon.js is an open-source engine, so if you want to contribute, please do! The GitHub repository can be found here: https://github.com/BabylonJS/Babylon.js.

10 Rendering Optimizations in the Turbulenz Engine

David Galeano

10.1 Introduction
10.2 Waste
10.3 Waste Avoidance
10.4 High-Level Filtering
10.5 Middle-Level Rendering Structures
10.6 Middle-Level Filtering
10.7 Low-Level Filtering
10.8 The Technique Object
10.9 Dispatching a Technique
10.10 Dispatching Buffers
10.11 Dispatching Textures
10.12 Dispatching Uniforms
10.13 Resources
Bibliography

10.1 Introduction

The Turbulenz Engine is a high-performance open-source game engine available in JavaScript and TypeScript for building high-quality 2D and 3D games. In order to extract maximum performance from both JavaScript and WebGL, the Turbulenz Engine needs to reduce waste to the minimum, for the benefit of both the CPU and the GPU. In this chapter, we focus on the rendering loop and the removal of waste originating from redundant and/or useless state changes. In order for this removal to be optimal, the engine employs several strategies at different levels, from high-level visibility culling, with grouping and sorting of geometries, to low-level state change filtering.

Throughout the chapter we give specific examples from our game Oort Online (Figure 10.1), a massively multiplayer game set in a sandbox universe of connected voxel worlds.

Figure 10.1 The game Oort Online.

10.2 Waste

JavaScript is now a high-performance language. It is still not as fast as some statically typed languages, but it is fast enough to use WebGL efficiently and achieve high-quality 3D graphics at interactive frame rates. Both JavaScript and WebGL are perfectly capable of saturating a GPU with work to do. See [Echterhoff 14] for an analysis of WebGL and JavaScript performance compared to a native implementation.

However, saturating a CPU or GPU with work that is either redundant or useless is a waste of resources, and identifying and removing that waste is our focus. We classify waste into several kinds:

1. Useless work: doing something that has no visible effect
2. Repetitive work: doing expensive state changes more than once per frame
3. Redundant work: doing exactly the same thing more than once in a row

We will tackle each one in turn, but first let's discuss the cost of removing unneeded work.

10.3 Waste Avoidance

Waste avoidance is simply avoiding the production of waste. It works on the principle that the greatest gains result from actions that remove or reduce resource utilization but deliver the same outcome.

There is always a price to pay for removing waste. In order to save CPU and GPU time, we first need to spend some CPU time, and there will always be a trade-off. Too much CPU time spent on filtering out useless work may actually make our application run slower than if there were no filtering at all, but the opposite is usually also true.

The Turbulenz Engine is data-oriented and mostly data-driven; some special rendering effects may be handwritten to call the low-level rendering API directly for performance reasons, but the bulk of the work is defined at runtime based on loaded data.

In general, it is much cheaper to avoid waste at a high level than at a lower level. The higher level has more context information to use and can detect redundancy at a larger scale. The lower levels can only deal with what is known at that instant. For example, the scene hierarchy and the spatial maps available at the high level together allow culling of nonvisible objects in groups, instead of having to check visibility individually for all objects.

The Turbulenz Engine avoids waste at different levels, each one trying to reduce work as much as possible for the lower ones. We will explain the different strategies used at every level.

10.4 High-Level Filtering

This is where the scene is managed, passing information to the middle-level renderer. At this level, we mostly focus on removing work that provides no usable result. Examples of this kind of waste are geometries that are not visible because they are

•• Out of the view frustum
•• Fully occluded by other geometries
•• Too far away and/or too small for the target resolution
•• Fully transparent

This list also applies to other game elements—for example, lights affecting the scene, animation of skinned geometries, entity AI, 3D sound emitters, etc.

To optimize away each of these cases, we require extra information that is only available at a high level, for example:

•• View frustum
•• Geometry AABB
•• Dimensions of the rendering target
•• Transparency information

The Turbulenz Engine filters out elements that are not visible using a two-step frustum culling system: First we cull the scene nodes and then their contents. Our scene hierarchy has an AABB on each node that contains renderable geometry or lights. Each scene node can contain an unlimited number of lights or renderable geometries. Figure 10.2 shows an example of a scene.

Figure 10.2 Scene example: a Scene contains SceneNodes, whose Lights and Renderables each reference DrawParameters.

These scene node AABBs are added to spatial maps, one for static objects and a separate one for dynamic objects.
They are separate because they are updated with different frequencies, and we can use different kinds of spatial maps for each case. The spatial maps supported out of the box are

•• AABB trees with different top-down building heuristics for static or dynamic objects
•• Dense grids
•• Sparse grids

For example, our game Oort Online uses AABB trees for static objects and dense grids for dynamic ones. Figure 10.3 shows a screen capture with the renderables' bounding boxes.

Figure 10.3 Bounding boxes of visible renderables.

Once a node is deemed visible, we proceed to check the visibility of each of the renderables and lights that it contains. When there is a single object on the node, there is no need to check anything else and it is added to the visible set. When there are multiple objects, we check whether the node was fully inside the frustum—if so, then all its contained objects are also visible; otherwise, we check visibility for each of them separately. For lights, we may go an extra step by projecting the bounding box onto the screen to calculate how many pixels it would actually light, discarding the light or disabling its shadow maps depending on its contribution to the scene.

However, all this work only removes objects outside the view frustum; it does not do anything about occluded objects. The main trade-off of occlusion culling is that doing it on the CPU is usually very expensive, and hence it is only applied in a limited number of cases. Our game Oort Online is made of cubes, and hence it is much simpler to calculate occluder frustums than in games using complex nonconvex geometry; but even in this case we only apply occlusion from chunks of 16 × 16 × 16 blocks of opaque cubes close to the camera. Unfortunately, it only helps in a limited number of situations—for example, looking into a mountain from its base, where most of the voxels inside the mountain will be discarded. In other games we employed portals to find the potentially visible set of elements from a particular camera viewpoint.

Once we have removed as much useless work as possible without spending too much time on it, it is time to pass that information to the next level down. The output from this level is a collection of arrays of visible objects: visible nodes, visible renderables, visible lights, visible entities, etc. We are going to focus just on the list of renderables.
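Before moving on, here is a minimal sketch of the node-level culling logic described above. The Frustum type, its testAABB method, and the field names are illustrative assumptions, not the Turbulenz interfaces:

// Hypothetical sketch of the two-step culling over visible nodes.
function collectVisibleRenderables(frustum, nodes) {
    var visible = [];
    for (var i = 0; i < nodes.length; i++) {
        var node = nodes[i];
        var result = frustum.testAABB(node.aabb); // OUTSIDE, INTERSECT, or INSIDE
        if (result === Frustum.OUTSIDE) {
            continue; // The node and all its contents are culled with one test.
        }
        var renderables = node.renderables;
        if (result === Frustum.INSIDE || renderables.length === 1) {
            // Fully inside the frustum, or a single object: accept directly.
            visible.push.apply(visible, renderables);
        } else {
            // Partially visible node: test each renderable separately.
            for (var j = 0; j < renderables.length; j++) {
                if (frustum.testAABB(renderables[j].aabb) !== Frustum.OUTSIDE) {
                    visible.push(renderables[j]);
                }
            }
        }
    }
    return visible;
}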
10.5 Middle-Level Rendering Structures

Each visible renderable can represent several rendering geometries. Each rendering geometry is defined by a DrawParameters object. Managing the collection of visible rendering geometries for optimal dispatching is the role of the next level down.

DrawParameters contains all the information required to render a particular geometry with a particular shading technique:

•• Vertex buffers: could be more than one
•• Vertex offsets: start of the used region on the vertex buffers
•• Vertex semantics: to match a particular vertex component with a particular shader input
•• Index buffer: optional, for indexed primitives
•• Index buffer offset: start of the used region on the index buffer
•• Primitives count: the number of primitives to render
•• Primitive type: triangle list, triangle strip, etc.
•• Technique: the shading technique to be used to render this geometry
•• Technique parameters: a dictionary containing the custom shading parameters for this geometry—for example, the world location, the material color, the diffuse texture, etc.
•• User data object: used to group geometries by framebuffer, opacity, etc.
•• Sort key: used to sort geometries belonging to the same group

Both the user data and the sort key are the keys to filtering out waste at this level.

10.5.1 Group Index on the User Data Object

The user data object can be used by the game to store anything about this particular instance of the geometry; the relevant part for this chapter is that the game generally uses this object to tell the renderer which group this geometry belongs to. In hindsight, we should have named this property differently, and we should have separated out its different components into different values.

The group information is stored as an integer index into the array of groups that the renderer manages; this is game- and renderer-specific. For example, these are the groups used for our game Oort Online:

•• Prepass: for smoothing the normals in screen space on geometry close to the camera
•• Opaque: for opaque voxels, entities, and opaque vegetation like tree trunks
•• Alpha cutouts (or decals): for grass, leaves, etc.
•• Transparent: for transparent voxels or entities
•• Effects: for particle systems like the smoke or flames from the torches
•• Water
•• Lava
•• Clouds

The different groups for nonopaque geometry that require alpha blending (water, lava, transparent, clouds, etc.) are rendered in a different order depending on the camera direction and vertical position, in order to get better visual results; otherwise, the different layers may render on top of each other incorrectly.

Other games may have more or fewer groups depending on the renderer. For example, games with dynamic shadows implemented using shadow mapping would have one or more groups for each of the shadow maps.

10.5.2 Sort Key

The DrawParameters sort key is a JavaScript number used to sort geometries belonging to the same group. As is often the case in JavaScript, it can be a floating-point value or an integer; it is up to the game renderer to set it correctly. This value can be used to sort the DrawParameters objects in ascending or descending order, depending on what makes more sense for a particular render group.

For example, in the game Oort Online, the keys are calculated as follows:

•• For opaque geometry or for additive-only blending:

(distance << 24) | (techniqueIndex << 14) | (materialIndex << 8) | (vbIndex & 255)

The distance value is the distance to the camera near plane quantized into 64 logarithmic-scaled buckets. The techniqueIndex value is the unique shading technique id. The materialIndex value is the unique material id. The vbIndex value is the unique vertex buffer id.

•• For transparent geometry:

(distance * 1024 * 16) | 0

The distance value is the distance to the camera near plane in world units.

For opaque geometries, the main reason for sorting is performance. We build the sort key in such a way that geometries with the same technique, material, or vertex buffer end up one after the other in the sorted list. A perfect sorting would require a multilevel bucket hierarchy, which could be too expensive to build; the sort key provides a cheap enough solution, being fast to generate and fast to sort by.
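As a concrete illustration, here is a sketch of how such a key might be assembled. The helper quantizeDistanceLog and the field names are assumptions for illustration; only the bit layout comes from the formula above:

// Illustrative sketch of the opaque sort-key packing described above.
function makeOpaqueSortKey(drawParameters, cameraNearPlane) {
    // quantizeDistanceLog() is a hypothetical helper returning a bucket in [0, 63].
    var distance = quantizeDistanceLog(drawParameters, cameraNearPlane); // 6 bits
    var techniqueIndex = drawParameters.technique.id;                    // 10 bits
    var materialIndex = drawParameters.materialId;                       // 6 bits
    var vbIndex = drawParameters.vertexBufferId;                         // low 8 bits
    // Most expensive state change in the highest bits, cheapest in the lowest.
    return (distance << 24) |
           ((techniqueIndex & 1023) << 14) |
           ((materialIndex & 63) << 8) |
           (vbIndex & 255);
}

Sorting an array of such keys with a plain numeric comparison then clusters geometries that share the same technique, material, and vertex buffer.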
The higher bits of the key are used to sort by the most important state change to optimize for, while the lower bits are left for the cheaper changes. The order will usually depend on the scene complexity. If overdraw is a significant issue, then we will use the distance to the camera in the higher bits, while in some other cases it is the shading technique that requires optimization, either because overdraw is not significant or because the hardware can deal with it efficiently—for example, on tile-based deferred GPUs.

For transparent geometries, the main reason for sorting is correctness. Transparent geometries are sorted by distance to get the correct rendering, from back to front. However, if the blending is additive-only, then this is not needed. In the case of our game Oort Online, the "effects" pass only contains geometries rendered with additive blending, which is correct no matter the order; this means that we can employ more efficient sorting for this group. Unfortunately, many rendering effects do not fall into this category; in our game Oort Online, only a handful of particle systems use additive blending—for example, the sparks coming out of a torch.

In general, the Turbulenz Engine tries to use integers smaller than 30 bits in order to force the JavaScript JIT compilers into using integer operations for sorting instead of floating-point ones. Comparing integers is significantly cheaper than comparing floating-point values. The JIT compiler will optimize for either integers or floating-point values, but performance is more predictable if we use the same types everywhere, every time. For more information about low-level JavaScript optimizations, see [Wilson 12].

10.6 Middle-Level Filtering

This is where the renderer groups and sorts DrawParameters to be passed down to the low-level rendering functions. This level focuses on removing repetitive work that is done multiple times but not in a row. Examples of this kind of waste include:

•• Binding the same framebuffer more than once
•• Enabling blending more than once

To remove this waste, we rely on the game and the renderer to provide enough information to group and sort DrawParameters in the most efficient way, reducing the number of state changes to the minimum. At this point it is worth explaining the real cost of making WebGL calls.

10.6.1 WebGL Costs

Making a function call is never totally free, but some functions are far more expensive than others. Sometimes the real cost of a function call is not immediately obvious; it may have big performance side effects that only appear in combination with other function calls, and most WebGL calls are in this category.

Older versions of rendering APIs like D3D9 or OpenGL ES 2 (which WebGL represents) do not actually reflect how modern GPUs work. A lot of state and command translation and validation is lazily evaluated behind the scenes. Changing a state may imply just storing the new value and setting a dirty flag that is only checked and processed when we issue a draw call. This lazy-evaluation cost depends heavily on the hardware and the different software layers that drive it, and there are big differences between them. For example, changing frequently from opaque rendering to transparent rendering and vice versa by changing the blend function is very expensive on tile-based deferred rendering architectures, because each change forces a flush of the rendering pipeline, which can be very wasteful if there is a significant amount of overdraw.
Defining the specific cost of each change for each piece of hardware is complicated, but there are some general rules that work fine on most hardware. The following are in order of cost of change, from high to low:

•• Framebuffer
•• Blend state
•• Shader program
•• Depth state
•• Other render states
•• Texture
•• Vertex buffer
•• Index buffer
•• Shader parameters (uniforms)

This ordering may change significantly from hardware to hardware, but as generic rules they are good enough for a start. For each particular platform we should profile and adapt. For more in-depth information, see [Forsyth 08].

Even if the geometries are rendered in the right order to minimize state changes, there is still one big source of GPU overhead that we are also trying to get rid of at this level: overdraw.

10.6.2 Overdraw

Overdraw is one of the main sources of wasted effort in complex 3D environments. Classified as useless work, some pixels are rendered but are never visible because other pixels are rendered on top of them. This is a massive waste in terms of data moving through the app and the graphics pipeline.

In theory, this kind of waste is cheaper to remove early on at a higher level; we should detect occluders and remove all occludees from the rendering list. However, this is a lot of work to do on the CPU, and most occludees are only half covered, which means that we should, in theory, split the triangles into visible and nonvisible parts. If we went down that route, most of our CPU time would be wasted on occlusion calculations while the GPU sits idle waiting for work to do.

Instead, we rely on grouping and sorting opaque geometry based on the distance to the camera near plane. By first rendering opaque geometry closer to the camera, we can use the depth test to discard shading for occluded pixels. This does not eliminate all the waste from vertex transformations and triangle rasterization, but at least it removes the cost of shading in a way that is relatively cheap for the CPU. This is one case where we optimize for the CPU, relying on the massive parallel performance of modern GPUs. This optimization makes sense because of early-z reject hardware implementations that provide fast early-out paths for fragments that are occluded; some hardware even employs hierarchical depth buffers that are able to cull several occluded fragments at the same time. This optimization makes less sense on tile-based deferred architectures that only shade the visible fragment when required, at the end of the rendering.

If the game is not CPU limited by the number of draw calls, then a z-only pass could be a good solution to remove most of the overdraw waste. Rendering only to the depth buffer can often be optimized to be significantly faster, but an excessive number of draw calls is too often the norm for complex games, so this solution is not usually better than just sorting opaque geometry front to back. A z-only pass is even more handicapped in JavaScript/WebGL because of the additional overhead compared to native code. And, as before, on tile-based deferred architectures, this solution may actually work against the optimizations done in hardware and result in more GPU overhead.

10.6.3 Benefits of Grouping

There are several reasons why we group DrawParameters into different buckets:
1. Because they are rendered into different framebuffers
•• For example, they gather screen depth or view-space per-pixel normal information into textures to be used by subsequent groups.
2. Because they need to be rendered in a specific order for correctness
•• For example, transparent geometry should be rendered after all the opaque geometry; otherwise, it may blend on top of the wrong pixel.
3. Because they need to be rendered in a specific order for performance
•• For example, opaque geometry closer to the camera could be rendered in its own group before the distant ones.
•• For example, enabling and disabling blending may be quite expensive on some hardware, so we try to do it only once per frame.

Points 2 and 3 require clarification. There is some overlap between the sort key and the group index. In theory, we could encode all the required information to sort DrawParameters for performance reasons into the sort key, but there is a performance limit on the key size: It is relatively easy to need more than 31 bits, and then sorting starts to be expensive for big collections. By moving part of the key information into the group index, you can increase the effective key size without penalties; this can be seen as a kind of high-level bucket sorting.

Once all our DrawParameters are grouped and sorted efficiently, they are passed to the low-level API for rendering. For more on draw call ordering, see [Ericson 08].

10.7 Low-Level Filtering

This is where we dispatch the DrawParameters changes to WebGL. Big arrays of DrawParameters objects are passed down to the low-level rendering functions for dispatching. This level focuses on removing redundant work that is done multiple times in a row. Examples of this kind of waste include:

•• Binding the same vertex buffer repeatedly
•• Binding the same texture on the same slot repeatedly
•• Binding the same shading technique repeatedly

To avoid this redundant work, we shadow all the internal WebGL state, including uniforms for shader programs, avoiding changes that do not alter it. There is overhead for keeping this shadow of the WebGL state, in both memory usage and CPU cost; but as the rendering has already been sorted to keep redundant changes close together, the check for superfluous work usually saves a lot of time, due to the high overhead of the underlying WebGL implementations.

To reduce the cost of state checks as much as possible, we need to store the rendering data in optimal ways. The main rendering structure at this level is the Technique.

10.8 The Technique Object

A Technique object contains the required information for shading a given geometry:

•• Vertex and fragment shaders
•• A unique program is linked for each combination and is cached and shared between techniques.
•• Render states
•• Contains only the delta from our predefined default render states:
−− DepthTestEnable: true
−− DepthFunc: LEQUAL
−− DepthMask: true
−− BlendEnable: false
−− BlendFunc: SRC_ALPHA, ONE_MINUS_SRC_ALPHA
−− CullFaceEnable: true
−− CullFace: BACK
−− FrontFace: CCW
−− ColorMask: 0xffffffff
−− StencilTestEnable: false
−− StencilFunc: ALWAYS, 0, 0xffffffff
−− StencilOp: KEEP, KEEP, KEEP
−− PolygonOffsetFillEnable: false
−− PolygonOffset: 0, 0
−− LineWidth: 1
•• Samplers
•• The texture samplers that the shaders would require.
•• They are matched by name to textures from the technique parameters objects passed in by the DrawParameters object.
•• The sampler object contains only the delta from our predefined sampling states:
−− MinFilter: LINEAR_MIPMAP_LINEAR
−− MagFilter: LINEAR
−− WrapS: REPEAT
−− WrapT: REPEAT
−− MaxAnisotropy: 1
•• Semantics
•• The vertex inputs that the vertex shader would require.
•• These are some of our predefined semantics:
−− POSITION
−− NORMAL
−− BLENDWEIGHT
−− TEXCOORD0
•• Uniforms
•• The uniform inputs that the program would require.
•• Matched by name to the values from the technique parameters object passed on by the DrawParameters object.

The Technique objects are immutable; if we want to use the same program with a different render state, we need to create a new Technique. This could potentially scale badly, but in practice none of our games use more than a couple dozen techniques. The main reason to have too many techniques would be the need to support too many toggleable rendering configurations—for example, water with or without reflections—but in that case we can just load the techniques that we actually need.

The Technique object contains lots of information; when we change a shading technique, we are potentially changing many WebGL states, which is why sorting by technique is so important in many cases, although we still need to optimize the change itself as much as possible.

10.9 Dispatching a Technique

When we have to dispatch a new Technique, we still need to reduce the state changes to the minimum. We keep track of the previously dispatched Technique, and we only update the delta between the two:

•• Program: only changed if different from the previous one.
•• Render states: the new state values are only applied if they are different from the values previously set. The render states set by the previous technique that are not present on the current technique are reset to their default values.
•• Samplers: the new sampling values are only applied if they are different from the values previously set. The sampling states set by the previous technique that are not present on the current technique are reset to their default values.

Once the main states have been updated, the remaining changes are the buffers and the uniforms.

10.10 Dispatching Buffers

Vertex and index buffers are only dispatched if they differ from the current value set by the previous DrawParameters object. The Turbulenz Engine creates big buffers that are shared among different geometries, which helps to reduce the number of buffer changes.

Index buffers are easy: If the current buffer is different from the last one, we change it. If the DrawParameters does not contain an index buffer because it is rendering an unindexed primitive, then we just keep the old buffer active because it may be needed again in the future.

Vertex buffers are more complicated. For each vertex buffer, we match each of its vertex components to the relevant vertex shader semantic, and then we check if, for that semantic, we have already used the same vertex buffer. If so, we do nothing; otherwise, we need to update the vertex attribute pointer for that semantic. Shader semantics are hardcoded to match specific vertex attributes; for example, POSITION is always on attribute zero. This makes dispatching simpler and faster because the same semantic will always be set on the same attribute, which means that in many cases we do not need to set a vertex attribute pointer again when rendering the same vertex buffer with multiple techniques.
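A minimal sketch of this kind of redundancy filter follows. The Graphics object and its members are illustrative assumptions, not the Turbulenz API, and the sketch is simplified: a real implementation would also compare offsets and vertex formats before skipping a bind:

// Hypothetical sketch of low-level filtering for buffer binds.
Graphics.prototype.setIndexBuffer = function (indexBuffer) {
    if (indexBuffer === this._currentIndexBuffer) {
        return; // Same buffer already bound: skip the WebGL call entirely.
    }
    this._gl.bindBuffer(this._gl.ELEMENT_ARRAY_BUFFER, indexBuffer.glBuffer);
    this._currentIndexBuffer = indexBuffer;
};

Graphics.prototype.setAttributePointer = function (semantic, vertexBuffer, offset) {
    // One cache slot per hardcoded semantic/attribute index.
    if (this._attributeBuffers[semantic] === vertexBuffer) {
        return; // This attribute already points into the same buffer.
    }
    this._gl.bindBuffer(this._gl.ARRAY_BUFFER, vertexBuffer.glBuffer);
    this._gl.vertexAttribPointer(semantic,
                                 vertexBuffer.numComponents(semantic), // hypothetical
                                 this._gl.FLOAT, false,
                                 vertexBuffer.strideInBytes, offset);
    this._attributeBuffers[semantic] = vertexBuffer;
};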
When the program is linked, we remap its attribute inputs to match our semantics table. This has potential issues when the vertex shader only supports a very limited number of attributes (which happens sometimes on mobile phones), and it means that the mapping between semantics and attributes has to be calculated right after creating the WebGL context; however, once it is done, it never changes.

If vertex array objects (VAOs) are supported, then we build one for each combination of vertex buffers and index buffer present in the DrawParameters objects. As we share the buffers between many different geometries, the actual number of combinations is usually quite low. This allows us, at dispatch time, to simplify all the buffer checks to a single equality comparison between the current VAO and the previous one. Even when the VAOs are different, setting them with WebGL is cheaper on the CPU than setting all the different buffers and vertex attribute pointers, which makes them a big win for complex scenes.

10.11 Dispatching Textures

Textures are nontrivial to set because of the stateful nature of WebGL. At loading time, we assign a specific texture unit to each sampler used by a shader. These units are shared by all the shaders: The first sampler on each shader will use unit zero, the second sampler on each shader will use unit one, and so on. When we need to set a texture on a sampler, first we need to activate the required texture unit (but only if it was not already the active one), and then we need to bind the texture to the required target (2D or CUBE_MAP). This requires a bit of juggling to avoid changing the active unit and the bound texture too many times when changing techniques. Our sort key includes a material id, which is derived from the collection of textures applied to a particular renderable; this way, we try to keep groups of textures together and to minimize the number of times a texture is bound.

An alternative system would bind every texture to a different unit at loading time, and then we would tell the shader which texture unit to use for each renderable at dispatch time. However, the maximum number of texture units supported varies heavily between video cards, some of them being limited to 32 or fewer. This means that, if we are using more textures than the limit, we still need to constantly bind textures to different units; we found this much slower in practice than our current system of texture units hardcoded at loading time.

10.12 Dispatching Uniforms

Dispatching the uniforms required for each renderable is cumulatively the most expensive thing our games do—not because changing a uniform is too expensive, but because we have lots of them.

Potentially, we could have a single big uniform array for each renderable in order to reduce the number of WebGL calls, but that would mean that any tiny difference in the array would require dispatching the whole array (which is a waste), so we group uniforms by frequency of change. Shading parameters that change infrequently between renderables are set in the same uniform array, while dynamic parameters are separated out into individual uniforms. This organization is quite effective at minimizing the number of changes, but it means that some techniques have dozens of individual parameters.
Values for each uniform are extracted at dispatch time from the technique parameters dictionaries stored in each DrawParameters object and then compared against the current values set for that particular uniform on the current active program. Only when the values differ do we update the uniform.

To avoid setting the same values twice, we employ a two-level filtering system. First, we store the JS array that was last set on the uniform; a quick equality check avoids setting the same object twice in a row for the same uniform. This works quite well in practice because of the sorting by material and because we reuse the same typed arrays for the same kinds of data as much as possible. Of course, this information is only valid within the function that dispatches an array of DrawParameters and needs to be reset once it finishes. If the uniform consists of a single value, then we skip this level.

Once the JS array is determined to be different from the previous one, we check each individual value in the array. By default we use equality checks for each value of each uniform, which seems slow but is vastly faster than actually changing the uniform when it is not required. However, when the uniform contains a single floating-point value, we do not use an equality check; we check whether the absolute difference between the new value and the old value is greater than 0.000001, and only then do we update it. This could be risky if the precision required for that value is higher than the threshold, but we have never found a problem in practice.

We could be more aggressive and apply the threshold update requirement to all uniforms, or even use a lower threshold, but we did find rendering issues when doing so. In particular, some rotations encoded in matrices require quite a lot of precision to be correct. However, this is something that could potentially be enabled per project; many users will not notice that an object did not move the tenth of a millimeter that it should have moved.
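The following sketch illustrates the two-level filter just described. The names are hypothetical and the sketch is simplified (the real engine also handles matrices, integer uniforms, and cache resets between dispatch batches):

// Hypothetical sketch of the two-level uniform filter.
var UNIFORM_EPSILON = 0.000001;

function setUniformArray(gl, uniform, newArray) {
    // Level 1: same JS (typed) array object as last time? Skip everything.
    if (uniform.lastArray === newArray) {
        return;
    }
    uniform.lastArray = newArray;
    // Level 2: compare each value against the shadowed copy.
    var shadow = uniform.shadowValues;
    var dirty = false;
    for (var i = 0; i < newArray.length; i++) {
        if (shadow[i] !== newArray[i]) {
            shadow[i] = newArray[i];
            dirty = true;
        }
    }
    if (dirty) {
        gl.uniform4fv(uniform.location, newArray); // Assuming a vec4 array uniform.
    }
}

function setUniformFloat(gl, uniform, value) {
    // Single floats skip level 1 and use a threshold instead of strict equality.
    if (Math.abs(uniform.shadowValue - value) > UNIFORM_EPSILON) {
        uniform.shadowValue = value;
        gl.uniform1f(uniform.location, value);
    }
}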
10.12.1 Performance Evaluation

A typical frame in the game Oort Online has metrics similar to these:

•• 9.60 ms total dispatch time
•• 3,316 uniform changes
•• 2,074 draw calls
•• 68 technique changes
•• 670 vertex array object changes
•• 3 index buffer changes
•• 19 vertex buffer changes
•• 23 vertex attribute changes
•• 78 render state changes
•• 181 texture changes
•• 32 framebuffer changes

These numbers were captured on a machine with the following specs:

•• NVIDIA GeForce GTX 750 Ti
•• Intel Core i5-4690 CPU 3.50 GHz
•• Ubuntu 14.10
•• Google Chrome version 39

After disabling the low-level checks for equality of values, the metrics change to

•• 10.53 ms total dispatch time
•• 7,477 uniform changes

Disabling also the higher level checks for JS array equality changes the metrics to

•• 11.56 ms total dispatch time
•• 16,242 uniform changes

These numbers vary a lot depending on the camera position in this heavily dynamic game, but the gains are consistent in any situation. Each of our filtering levels reduces the number of uniform changes by almost 50% and reduces the total dispatch time by about 10%.

10.13 Resources

The Turbulenz Engine is open source and can be found at https://github.com/turbulenz/turbulenz_engine. The documentation for the Turbulenz Engine is online and can be found at http://docs.turbulenz.com/.

Bibliography

[Echterhoff 14] Jonas Echterhoff. "Benchmarking Unity performance in WebGL." http://blogs.unity3d.com/2014/10/07/benchmarking-unity-performance-in-webgl/, 2014.
[Ericson 08] Christer Ericson. "Order your graphics draw calls around!" http://realtimecollisiondetection.net/blog/?p=86, 2008.
[Forsyth 08] Tom Forsyth. "Renderstate change costs." http://home.comcast.net/~tom_forsyth/blog.wiki.html#[[Renderstate%20change%20costs]], 2008.
[Wilson 12] Chris Wilson. "Performance Tips for JavaScript in V8." http://www.html5rocks.com/en/tutorials/speed/v8/, 2012.

11 Performance and Rendering Algorithms in Blend4Web

Alexander Kovelenov, Evgeny Rodygin, and Ivan Lyubovnikov

11.1 Introduction
11.2 Prerender Optimizations
11.3 Threaded Physics Simulation
11.4 Ocean Rendering
11.5 Shaders in Blend4Web
11.6 Resources
Bibliography

11.1 Introduction

Blend4Web is an open-source WebGL framework that uses Blender 3D as its primary authoring tool. It started as an experimental project to replace Adobe Flash and then evolved into a feature-rich platform for any kind of 3D web development. In this chapter, we share insights into the advanced techniques we have implemented in our engine. Among them are prerender optimizations performed on the CPU to increase overall engine performance, discussed in the next section; a worker-based physics engine; a fast shader technique to render realistic oceans; and a feature-rich shader compilation pipeline.

11.2 Prerender Optimizations

Reducing the number of draw calls has always been among the fundamental goals of rendering optimization. The history of the OpenGL API clearly indicates this tendency: Every consecutive version adds something new that allows us to do more in a single call. In WebGL 1, many such features are unavailable; however, some may be accessed through extensions, such as OES_vertex_array_object, ANGLE_instanced_arrays, OES_element_index_uint, and others. Requiring the engine to use such extensions may have compatibility costs, though, so they should be considered only as auxiliary means to increase performance on supported platforms.

11.2.1 Batching

Given that Blend4Web is implemented in JavaScript, we needed a straightforward and fast batching solution. This is especially important when rendering big scenes consisting of thousands of objects. As in the famous presentation by NVIDIA, "Batch, Batch, Batch" [Wloka 03], a batch includes the geometry from all objects that share the same state of the rendering pipeline. To implement this, we extract the state from Blender's materials; combine it with the geometry of separate objects (meshes in Blender terminology), forming batches on a per-object basis; and finally merge these per-object batches into the global batches that will be rendered in the scene.

A first cut might do this at the resource preparation stage and provide such "static" geometry inside exported resource files. Despite the obvious benefit of reducing the time needed to process geometry during scene loading, this has considerable drawbacks. The first is the inability to modify the batches (e.g., upon receiving information about the target platform or changing the user settings). The second and more essential drawback is the considerable increase in file size for scenes with instanced objects.
For example, if there are five trees and five stubs in the scene, the final geometry will be five times bigger than the original, resulting in implications such as increased loading time, especially over low-bandwidth mobile connections. Due to these drawbacks, we chose to construct batches during scene loading, and therefore we required a fast algorithm.

As shown in Figure 11.1, after per-object batches are constructed, we need a method to compare them in order to make further merging possible. We use a simple hashing algorithm. First, we compute the hashes for each object batch, and then we compare the hashes and merge the batches if they are equal. The key to performance is to keep the static-like typing of the JavaScript objects representing a batch, so that a modern JavaScript engine can optimize the code. This also allows developing a fast hashing function for every type, in the same manner as Java's hashCode() function. These hashes can be calculated very fast, but they are not 100% reliable because they are only 32 bits long and prone to hash collisions (i.e., two different JavaScript objects can have the same hash). Thus, we need to compare the object batches themselves at the final stage of batch construction. This comparison is made by iterating through all their properties.

Figure 11.1 The batching scheme: each object's Mesh and Material state produce per-object batches, which are merged into the scene batches.

We measured the performance of this on a machine with an i7-3770K 3.50 GHz CPU and a GeForce GTX 680 GPU using Chrome 39. The batches were created from low-polygonal objects containing 1,000 triangles each. All the objects have the same material assigned to them, making this test the worst-case scenario (the maximum number of intermediate batches joined into a single one). The size and complexity of the material only define the time to perform the hash calculation, and this stage does not exceed 10% of the entire batching time. Thus, 90% of the time is spent on operations such as copying and repositioning the geometry in the final batch. As shown in Figure 11.2, the presented algorithm has linear time complexity (i.e., the time to complete the calculations is proportional to the number of batches). This provides decent performance for real-world large scenes, considering that 5,000 batches consist of five million triangles.

Figure 11.2 Batching performance: batching time in seconds versus the number of batches.
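To illustrate the hashing just described, here is a minimal sketch of a 32-bit hash over a batch's pipeline state, in the spirit of Java's hashCode(). The field names are illustrative assumptions, not the actual Blend4Web batch structure:

// Hypothetical 32-bit hash combining a batch's rendering state.
function hashString(str) {
    var hash = 0;
    for (var i = 0; i < str.length; i++) {
        hash = (hash * 31 + str.charCodeAt(i)) | 0; // "| 0" keeps a 32-bit integer.
    }
    return hash;
}

function batchHash(batch) {
    var hash = hashString(batch.shaderName);
    hash = (hash * 31 + hashString(batch.textureName)) | 0;
    hash = (hash * 31 + (batch.blending ? 1 : 0)) | 0;
    hash = (hash * 31 + batch.renderPass) | 0;
    return hash;
}

// Equal hashes are only a hint: collisions force a full property-by-property
// comparison before two batches are actually merged.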
11.2.2 Frustum Culling and LODs

Besides static geometry, there are also dynamic objects that can freely move across the scene. For these, we use view frustum culling and level-of-detail (LOD) algorithms.

In our implementation of frustum culling, we use bounding ellipsoids to represent objects. This compromise was chosen because it is fast enough and allows us to effectively designate boundings for objects extended in one or two dimensions. In particular, a human body is well described by an ellipsoid extended in the up/down dimension. The algorithm itself is just an extension of a typical sphere–plane intersection test, where an effective radius is calculated for each direction toward one of the six planes making up the camera's frustum volume.

The LOD algorithm is fairly straightforward. The artists produce the LODs as separate models and also assign activation distances to them. Then, just before rendering, the algorithm selects the required LOD by checking the distance between its center and the camera.

The results are shown in Figure 11.3. The screenshot shows a typical scene view, which can be used to analyze overall scene performance and for which the given FPS plot was created. Batching is the most important feature for this scene. Culling has second priority, while LODs produce the best results on hardware whose performance is heavily bound by the GPU. The presented algorithms are reasonably inexpensive for the CPU: For the demo scene presented in the screenshot, which has more than 1,000 objects, frustum culling and LOD calculations take ~2% of all the time available to the browser's main execution thread.

Figure 11.3 Prerender optimization performance: screenshot and FPS on the Farm demo (without batching; batching only; batching + culling; batching + culling + LOD).

11.3 Threaded Physics Simulation

Physics is a vast field that embraces more than simple constrained motions and collision detection algorithms. A typical real-time physics engine implements fast and intricate algorithms to achieve decent quality with minimum computation overhead. We had a strong intention to find an existing solution instead of writing our own. Luckily, there was a great project, called ammo.js, to compile the Bullet open-source physics engine directly to JavaScript. By using the Emscripten compiler to translate intermediate LLVM bitcode directly to a JavaScript subset known as asm.js (see Chapter 5), some really impressive results were achieved.

However, ammo.js does not include several features we require. First, we needed floating objects and vehicles. Second, we required a fully asynchronous design, which is somewhat incompatible with ammo.js's class-based architecture. Third, we needed our own mechanism to interpolate physics simulation results instead of Bullet's built-in implementation based on Motion State. Thus, the uranium.js project was created.

We moved physics calculations to a single independent Web Worker process, freeing the main thread of execution from physics and allowing it to spend more time on rendering, animation, frustum culling, etc. The most difficult part of this approach was the mechanism of fast interworker communication (IWC) and time synchronization between the two threads.

Fast IWC is required for scenes with a lot of objects because of the browser's inherent overhead imposed on message passing. Our IWC implementation is based heavily on typed arrays and caching. All data frequently passing between the threads of execution are serialized to typed arrays (float or unsigned integer), which are created in advance and stored in a cache to minimize JavaScript garbage collector overhead. This approach is more streamlined than using transferable objects* because it is fully supported by all major browsers and there is no need to recreate arrays after each transfer. Also see Chapter 4.

IWC messages are formed using the following simple approach: The first position of the transferred array is a message identifier (message id); the second through last positions are occupied by the message payload.

* http://www.w3.org/html/wg/drafts/html/master/infrastructure.html#transferable-objects
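A minimal sketch of this serialization pattern follows; the message id, cache layout, and helper names here are illustrative assumptions, while the actual OUT_SET_TRANSFORM layout is described next:

// Hypothetical sketch of cached typed-array messages between threads.
var MSG_SET_TRANSFORM = 1; // Illustrative message id.
var _msg_cache = {};

function send_message(worker, msg_id, payload) {
    // Reuse one Float32Array per message id to avoid garbage collection.
    var len = 1 + payload.length;
    var arr = _msg_cache[msg_id];
    if (!arr || arr.length !== len)
        arr = _msg_cache[msg_id] = new Float32Array(len);
    arr[0] = msg_id;
    arr.set(payload, 1);
    worker.postMessage(arr);
}

// Usage: body id, position (3 floats), rotation quaternion (4 floats).
// send_message(physics_worker, MSG_SET_TRANSFORM,
//              [body_id, x, y, z, qx, qy, qz, qw]);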
For example, to set the position of some object in the physics scene, we send the message OUT_SET_TRANSFORM to the worker process. OUT_SET_TRANSFORM is a message id, so it is stored at the first position of the transferred array. The second position in the message array is occupied by a number representing a physics body id; next to it, the X, Y, Z coordinates of the object's position follow, and then X, Y, Z, W values form a quaternion representing the object's rotation. This is illustrated in Figure 11.4.

Figure 11.4 The OUT_SET_TRANSFORM message: MSG ID, BODY ID, X POS, Y POS, Z POS, X QUAT, Y QUAT, Z QUAT, W QUAT.

Time synchronization is needed because the threads use different clocks. The ticks of the main thread clock are generated by the browser and are linked with the browser's internal rendering process. The worker process clock is free from any constraints and may generate ticks as fast as the performance of the target platform allows. For example, a helicopter may fly at 250 km/h, or ~70 m/s. At 60 FPS, this is a displacement of more than a meter per frame. If the time is out of sync or acquired with considerable errors, there will be visual glitches in the flying object. For rapidly moving objects, we need time precision on the level of 1/10 of the frame duration, which is equivalent to several milliseconds. The widely supported method performance.now() returns times as floating-point numbers with up to microsecond precision. Also, with a dedicated Web Worker, synchronization is greatly simplified because it uses the same time origin as the main thread of execution. Not all browsers have built-in support for a high-precision timer in dedicated Web Workers, so for unsupported ones we had to use the old-fashioned Date object and synchronize time explicitly at worker startup.

For a simple benchmark, we used a scene with a varying number of cubes colliding with each other and a terrain mesh composed of 8,192 triangles. Each cube has a variety of forces acting upon it: There is a persistent gravity force and multiple transient collision forces per each point of interaction (pressing and friction forces). The scene is shown in Figure 11.5 along with the performance plot representing FPS for the given number of colliding objects. The performance decreases for many dynamic objects as the swarm of IWC messages rises [Priour 14]. However, it suits real-world scenarios well, where a reasonably small number of objects (usually below 100) are simulated simultaneously and where we need polygonal meshes to represent their shapes.

Figure 11.5 Physics performance: FPS versus the number of colliding objects.

11.4 Ocean Rendering

While working on one of our biggest WebGL demos, called "Capri," we encountered the complicated problem of reconstructing the sea surface with a number of effects previously not reproduced in browsers, including:

•• LOD system to reduce polygon count
•• Single draw call for the water object
•• Shoreline influence for rolling waves
•• Refraction and caustics on underwater objects
•• Dynamic reflections
•• Foam
•• Fake subsurface scattering
•• Realistic floating object behavior

11.4.1 Preparing the Mesh

In WebGL, it is important to move as many calculations as possible to the GPU. However, there are only two types of shaders available: vertex and fragment. This means that we need to prepare all the required geometry on the CPU.
Our approach is based on a geometry clipmap technique, which has proved to be a good method for rendering static meshes such as terrains deformed by height maps (also see Chapter 18) but had never been used for dynamic surfaces. The idea is to construct a static mesh from multiple square rings with different polygon densities. The mesh moves together with the camera. This method achieves a fairly precise simulation in regions close to the camera and delivers a reasonable polygon budget. The main difference, compared to the standard geoclipmapping technique [Hoppe et al. 04], is that there are no seams between the LOD levels. Such a mesh is shown in Figure 11.6, with the different LOD levels painted in three different colors.

Figure 11.6 Seamless mesh generated on the CPU.

11.4.2 Shore Parameters

In order to reproduce rolling waves and the water color gradient, the distance to the shore and the direction to the nearest shore point are needed. We wrote a Python script for Blender that, for a given terrain mesh and a plane representing the water surface, constructs an RGBA image. For every pixel of this image, it finds the nearest terrain vertex and stores the distance and the direction to it. Therefore, all the calculations are done on the CPU before the scene is exported, and there is no need for heavy calculations during scene loading. The direction is packed into the red and green channels, and the distance is packed into the blue and alpha channels (Figure 11.7).

Figure 11.7 Direction and distance to the shore baked into the RGBA image.

11.4.3 Waves

There are three types of waves applied to the water surface:

•• High distant waves, which have greater influence far from the shore
•• Small waves evenly distributed over the surface
•• Rolling waves moving toward the shore

11.4.3.1 Distant and Small Waves

Distant and small waves are generated by mixing several noise functions (Figure 11.8). For each noise, a different movement speed and scale are chosen. Distant waves use two simplex noises [McEwan 11]. Small waves are a combination of two cellular noise functions [Gustavson 11]. The GLSL calculations are shown in Listing 11.1.

Figure 11.8 Combination of small and distant waves (small waves; distant waves; combined waves).

Listing 11.1 Waves generation.

//waves far from the shore
float dist_waves =
    snoise(DST_NOISE_SCALE_0 * (pos.xz + DST_NOISE_FREQ_0 * time)) *
    snoise(DST_NOISE_SCALE_1 * (pos.zx - DST_NOISE_FREQ_1 * time));

//high resolution geometric noise waves
vec2 cel_coord1 = 20.0/WAVES_LENGTH * (pos.xz - 0.25 * time);
vec2 cel_coord2 = 17.0/WAVES_LENGTH * (pos.zx + 0.1 * time);
float small_waves = cellular2x2(cel_coord1).x + cellular2x2(cel_coord2).x - 1.0;

Summing up these functions, we get dynamic waves that are well suited for rendering the open ocean [Mátyás 06].

11.4.3.2 Rolling Waves

Closer to the shore, the length and amplitude of the waves should decrease. Also, as the underwater current moves in the opposite direction, each wave receives a little incline toward the shore. The approximate shape is shown in Figure 11.9.

Figure 11.9 Rolling waves.

To generate rolling waves, the parameters of the shore are extracted from the prepared RGBA texture. Because the distance is packed into the blue and alpha channels, we divide the texture's blue channel by 255 and add it to the value of the alpha channel. In other words, we reconstruct the original value from two parts: the bigger and less precise one (the alpha channel) and the smaller but more precise one (the blue channel). It is done with a bit_shift constant vector. This is a common technique for packing precise values into several channels of an image.
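As an illustration of the baking side of this trick, a normalized value can be split into two 8-bit channels and reconstructed exactly as the shader does. This is a sketch only: the actual Blend4Web baker is a Python script for Blender, and the quantization of the blue channel itself is ignored here:

// Hypothetical sketch: pack a value in [0, 1] into two channels.
function packToBlueAlpha(value) {
    var alpha = Math.floor(value * 255) / 255; // Coarse part, in 1/255 steps.
    var blue = (value - alpha) * 255;          // Fine remainder, rescaled to [0, 1).
    return { blue: blue, alpha: alpha };
}

// CPU-side mirror of the shader's dot(shore_params.ba, vec2(1.0/255.0, 1.0)).
function unpackFromBlueAlpha(blue, alpha) {
    return blue / 255 + alpha;
}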
The corresponding GLSL calculations are shown in Listing 11.2.

Listing 11.2 Shore parameters extraction.

//shore coordinates from world position
vec2 shore_coords = 0.5 + vec2(
     (pos.x - SHORE_MAP_CENTER_X)/SHORE_MAP_SIZE_X,
    -(pos.y + SHORE_MAP_CENTER_Y)/SHORE_MAP_SIZE_Y);

//unpack shore parameters from texture
vec4 shore_params = texture2D(u_shore_dist_map, shore_coords);
const vec2 bit_shift = vec2(1.0/255.0, 1.0);
float shore_dist = dot(shore_params.ba, bit_shift);
vec2 dir_to_shore = normalize(shore_params.rg * 2.0 - 1.0);

Given the distance to the shore and the direction to it, we can calculate the sinusoidal waves moving toward the shore, as shown in Listing 11.3. The shore_waves_length variable is divided by M_PI because the WAVES_LENGTH constant is defined in meters and needs to be converted to properly calculate the wave length.

Listing 11.3 Waves moving toward the shore.

float shore_waves_length = WAVES_LENGTH/MAX_SHORE_DIST/M_PI;
float dist_fact = sqrt(shore_dist);
float shore_dir_waves = max(shore_dist, DIR_MIN_SHR_FAC)
    * sin(dist_fact/shore_waves_length + DIR_FREQ*time)
    * max(snoise(DIR_NOISE_SCALE*(pos.xz + DIR_NOISE_FREQ*time)),
          DIR_MIN_NOISE_FAC);

To make the rolling waves more natural, the result is multiplied by one more snoise function value (Listing 11.4).

Listing 11.4 Additional noise factor for waves moving toward the shore.

float dir_noise = max(
    snoise(DIR_NOISE_SCALE*(pos.xz + DIR_NOISE_FREQ*time)),
    DIR_MIN_NOISE_FAC);
shore_dir_waves *= dir_noise;

11.4.3.3 Combining Waves

All wave types are mixed to get the final vertical offset, as shown in Listing 11.5.

Listing 11.5 Mixing all waves together.

float waves_height = WAVES_HEIGHT * mix(shore_dir_waves,
    dist_waves, max(dist_fact, DST_MIN_FAC));
waves_height += SMALL_WAVES_FAC * small_waves;

11.4.3.4 Waves Inclination

Higher vertices receive a greater inclination in the direction of the nearest shore point.

Listing 11.6 Horizontal offset for waves inclination.

float wave_factor = WAVES_HOR_FAC * shore_dir_waves
    * max(MAX_SHORE_DIST/35.0 * (0.05 - shore_dist), 0.0);
vec2 hor_offset = wave_factor * dir_to_shore;

11.4.3.5 Normal Calculation

In order to perform further shading, the normals need to be calculated in the vertex shader. Therefore, the "offset" function is called for three adjacent vertices by stepping with cascade steps in the x and y directions. This step is stored in the vertex's y coordinate. Using this information, we can calculate the bitangent, tangent, and normal vectors.

Listing 11.7 Normal calculation.

vec3 bitangent = normalize(neighbour1 - world.position);
vec3 tangent = normalize(neighbour2 - world.position);
vec3 normal = normalize(cross(tangent, bitangent));

11.4.4 Material Shading

Water color consists of the components displayed in Figure 11.10(a–d).

Figure 11.10 Water shading components. (a) Lambert shading, Wardiso specular; (b) + Fresnel reflection; (c) + subsurface scattering (SSS); and (d) + foam.

Reflections are produced by flipping the original camera vertically and rendering the needed objects into a new framebuffer, which has a lower resolution for greater performance.
SSS simulation is based on the light and camera directions and the surface normals [Seymour 12]. These simple calculations give us surprisingly realistic results.

Foam is influenced by three major factors used for mixing with the original color:

1. High waves
2. Rolling waves with a normal close to the direction toward the shore
3. Depth-based foam on objects close to the surface of the water

11.4.4.1 High Waves Foam

When water mesh vertices reach a specific height, foam starts to be mixed in, influencing the resulting color. The mask for this type of wave should look like the one in Figure 11.11, where white areas have more foam.

Figure 11.11 High waves foam factor.

11.4.4.2 Rolling Waves Foam

Based on the wave normal vector and the direction to the shore, the foam mix factor increases, so we see more realistic rolling waves, which generate foam in areas facing toward the shore. The GLSL equation is as follows:

float foam_shore = 1.25 * dot(normal, shore_dir) - 0.1;
foam_shore = max(foam_shore, 0.0);

As seen in Figure 11.12, this mask has much sharper edges.

Figure 11.12 Rolling waves foam factor.

11.4.4.3 Depth-Based Foam

The last component of the resulting foam is depth-based foam. Mathematically, a mask for this effect is calculated by subtracting the depth of the underwater object's pixel from the depth of the water surface pixel. Where the result is close to zero, the foam factor has its maximum value (Figure 11.13).

Figure 11.13 Depth-based foam factor.

11.4.4.4 Final Look of the Foam

By combining these three types of foam, we achieve plausible waves. The result is shown in Figure 11.14.

Figure 11.14 Final look of the foam.

11.4.5 Refractions and Caustics

11.4.5.1 Refractions

Refraction uses previously rendered underwater objects stored in a specially created framebuffer and distorts them in accordance with the normal vectors of the water's surface (Figure 11.15a, b).

Figure 11.15 Water refractions: (a) without refraction; (b) with refraction.

11.4.5.2 Caustics

In order to reduce texture memory usage, caustics are approximated by cellular noise with deformed texture coordinates (Figure 11.16a, b).

Figure 11.16 Water caustics: (a) without caustics; (b) with caustics.

11.4.6 Floating Objects Physics

In Blend4Web, every object that can float requires several points to be defined. They are described by metaobjects called "bobs." When a bob goes underwater, the buoyancy force is applied to its center and affects the floating object. In Figure 11.17, bobs are visualized as yellow spheres. In order to calculate the height of the waves at a specific point, the equations mentioned in Section 11.4.3 must be implemented in the physics thread, but with some simplifications.

11.5 Shaders in Blend4Web

Modern graphics applications may use a large number of shaders for creating different effects and improving the overall rendering quality. WebGL applications are no exception, despite the specifics imposed by the web platform (e.g., limitations on loading time and the need to compile the shaders upon application startup). GLSL code can be stored inside an HTML page using, for example,