Armin Ronacher's personal blog about programming, games and random thoughts that come to his mind. Armin Ronacher's Thoughts and Writings 2016-12-29T00:00:00Z http://lucumr.pocoo.org/ <p>This should have been obvious to me for a long time, but until earlier today I did not really realize the severity of the issues caused by <cite>str.format</cite> on untrusted user input. It came up as a way to bypass the Jinja2 Sandbox in a way that would permit retrieving information that you should not have access to, which is why I just pushed out a <a class="reference external" href="https://www.palletsprojects.com/blog/jinja-281-released/">security release</a> for it.</p> <p>However I think the general issue is quite severe and needs to be discussed because most people are most likely not aware of how easy it is to exploit.</p> <div class="section" id="the-core-issue"> <h2>The Core Issue</h2> <p>Starting with Python 2.6 a new format string syntax landed, inspired by .NET, which is also the same syntax that is supported by Rust and some other programming languages. It's available behind the <cite>.format()</cite> method on byte and unicode strings (on Python 3 just on unicode strings) and it's also mirrored in the more customizable <cite>string.Formatter</cite> API.</p> <p>One of the features of it is that you can address both positional and keyword arguments to the string formatting and you can explicitly reorder items at all times. However the bigger feature is that you can access attributes and items of objects.
The latter is what is causing the problem here.</p> <p>Essentially one can do things like the following:</p> <div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="s1">&#39;class of {0} is {0.__class__}&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="go">&quot;class of 42 is &lt;class &#39;int&#39;&gt;&quot;</span>
</pre></div> <p>In essence: whoever controls the format string can access potentially internal attributes of objects.</p> </div> <div class="section" id="where-does-it-happen"> <h2>Where does it Happen?</h2> <p>The first question is why anyone would control the format string. There are a few places where it shows up:</p> <ul class="simple"> <li>untrusted translators on string files. This is a big one because many applications that are translated into multiple languages will use new-style Python string formatting and not everybody will vet all the strings that come in.</li> <li>user exposed configuration. On some systems users might be permitted to configure some behavior and that might be exposed as format strings. In particular I have seen it where users can configure notification mails, log message formats or other basic templates in web applications.</li> </ul> </div> <div class="section" id="levels-of-danger"> <h2>Levels of Danger</h2> <p>For as long as only C interpreter objects are passed to the format string you are somewhat safe because the worst you can discover is some internal reprs like the fact that something is an integer class above.</p> <p>However it becomes tricky once Python objects are passed in. The reason for this is that the amount of stuff that is exposed from Python functions is pretty crazy.
Here is an example from a hypothetical web application setup that would leak the secret key:</p> <div class="highlight"><pre><span></span><span class="n">CONFIG</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s1">&#39;SECRET_KEY&#39;</span><span class="p">:</span> <span class="s1">&#39;super secret key&#39;</span>
<span class="p">}</span>

<span class="k">class</span> <span class="nc">Event</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="nb">id</span><span class="p">,</span> <span class="n">level</span><span class="p">,</span> <span class="n">message</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">id</span> <span class="o">=</span> <span class="nb">id</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">level</span> <span class="o">=</span> <span class="n">level</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">message</span> <span class="o">=</span> <span class="n">message</span>

<span class="k">def</span> <span class="nf">format_event</span><span class="p">(</span><span class="n">format_string</span><span class="p">,</span> <span class="n">event</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">format_string</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">event</span><span class="o">=</span><span class="n">event</span><span class="p">)</span>
</pre></div> <p>If the user can inject <cite>format_string</cite> here they could discover the secret string like this:</p> <div class="highlight"><pre><span></span>{event.__init__.__globals__[CONFIG][SECRET_KEY]}
</pre></div> </div> <div class="section" id="sandboxing-formatting"> <h2>Sandboxing Formatting</h2> <p>So what do you do if you do
need to let someone else provide format strings? You can use the somewhat undocumented internals to change the behavior.</p> <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">string</span> <span class="kn">import</span> <span class="n">Formatter</span>

<span class="c1"># Mapping lives in collections.abc on Python 3; the old location</span>
<span class="c1"># in collections is only needed for Python 2</span>
<span class="k">try</span><span class="p">:</span>
    <span class="kn">from</span> <span class="nn">collections.abc</span> <span class="kn">import</span> <span class="n">Mapping</span>
<span class="k">except</span> <span class="ne">ImportError</span><span class="p">:</span>
    <span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">Mapping</span>

<span class="k">class</span> <span class="nc">MagicFormatMapping</span><span class="p">(</span><span class="n">Mapping</span><span class="p">):</span>
    <span class="sd">&quot;&quot;&quot;This class implements a dummy wrapper to fix a bug in the Python</span>
<span class="sd">    standard library for string formatting.</span>

<span class="sd">    See http://bugs.python.org/issue13598 for information about why</span>
<span class="sd">    this is necessary.</span>
<span class="sd">    &quot;&quot;&quot;</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">args</span><span class="p">,</span> <span class="n">kwargs</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">_args</span> <span class="o">=</span> <span class="n">args</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">_kwargs</span> <span class="o">=</span> <span class="n">kwargs</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">_last_index</span> <span class="o">=</span> <span class="mi">0</span>

    <span class="k">def</span> <span class="nf">__getitem__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">key</span><span class="p">):</span>
        <span class="k">if</span> <span class="n">key</span> <span class="o">==</span> <span class="s1">&#39;&#39;</span><span class="p">:</span>
            <span class="n">idx</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_last_index</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">_last_index</span> <span class="o">+=</span> <span class="mi">1</span>
            <span class="k">try</span><span class="p">:</span>
                <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">_args</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span>
            <span class="k">except</span> <span class="ne">LookupError</span><span class="p">:</span>
                <span class="k">pass</span>
            <span class="n">key</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">idx</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">_kwargs</span><span class="p">[</span><span class="n">key</span><span class="p">]</span>

    <span class="k">def</span> <span class="nf">__iter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="nb">iter</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_kwargs</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">__len__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_kwargs</span><span class="p">)</span>

<span class="c1"># This is a necessary API but it&#39;s undocumented and moved around</span>
<span class="c1"># between Python releases</span>
<span class="k">try</span><span class="p">:</span>
    <span class="kn">from</span> <span class="nn">_string</span> <span class="kn">import</span> <span class="n">formatter_field_name_split</span>
<span class="k">except</span> <span class="ne">ImportError</span><span class="p">:</span>
    <span class="n">formatter_field_name_split</span> <span class="o">=</span> <span class="k">lambda</span> \
        <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">_formatter_field_name_split</span><span class="p">()</span>

<span class="k">class</span> <span class="nc">SafeFormatter</span><span class="p">(</span><span class="n">Formatter</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">get_field</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">field_name</span><span class="p">,</span> <span class="n">args</span><span class="p">,</span> <span class="n">kwargs</span><span class="p">):</span>
        <span class="n">first</span><span class="p">,</span> <span class="n">rest</span> <span class="o">=</span> <span class="n">formatter_field_name_split</span><span class="p">(</span><span class="n">field_name</span><span class="p">)</span>
        <span class="n">obj</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">get_value</span><span class="p">(</span><span class="n">first</span><span class="p">,</span> <span class="n">args</span><span class="p">,</span> <span class="n">kwargs</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">is_attr</span><span class="p">,</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">rest</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">is_attr</span><span class="p">:</span>
                <span class="n">obj</span> <span class="o">=</span> <span class="n">safe_getattr</span><span class="p">(</span><span class="n">obj</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="n">obj</span> <span class="o">=</span> <span class="n">obj</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
        <span class="k">return</span> <span class="n">obj</span><span class="p">,</span> <span class="n">first</span>

<span class="k">def</span> <span class="nf">safe_getattr</span><span class="p">(</span><span class="n">obj</span><span class="p">,</span> <span class="n">attr</span><span class="p">):</span>
    <span class="c1"># Expand the logic here. For instance on 2.x you will also need</span>
    <span class="c1"># to disallow func_globals, on 3.x you will also need to hide</span>
    <span class="c1"># things like cr_frame and others. So ideally have a list of</span>
    <span class="c1"># objects that are entirely unsafe to access.</span>
    <span class="k">if</span> <span class="n">attr</span><span class="p">[:</span><span class="mi">1</span><span class="p">]</span> <span class="o">==</span> <span class="s1">&#39;_&#39;</span><span class="p">:</span>
        <span class="k">raise</span> <span class="ne">AttributeError</span><span class="p">(</span><span class="n">attr</span><span class="p">)</span>
    <span class="k">return</span> <span class="nb">getattr</span><span class="p">(</span><span class="n">obj</span><span class="p">,</span> <span class="n">attr</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">safe_format</span><span class="p">(</span><span class="n">_string</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
    <span class="n">formatter</span> <span class="o">=</span> <span class="n">SafeFormatter</span><span class="p">()</span>
    <span class="n">kwargs</span> <span class="o">=</span> <span class="n">MagicFormatMapping</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="n">kwargs</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">formatter</span><span class="o">.</span><span class="n">vformat</span><span class="p">(</span><span class="n">_string</span><span class="p">,</span> <span
class="n">args</span><span class="p">,</span> <span class="n">kwargs</span><span class="p">)</span>
</pre></div> <p>Now you can use the <cite>safe_format</cite> function as a replacement for <cite>str.format</cite>:</p> <div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="s1">&#39;{0.__class__}&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="go">&quot;&lt;type &#39;int&#39;&gt;&quot;</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">safe_format</span><span class="p">(</span><span class="s1">&#39;{0.__class__}&#39;</span><span class="p">,</span> <span class="mi">42</span><span class="p">)</span>
<span class="gt">Traceback (most recent call last):</span>
  File <span class="nb">&quot;&lt;stdin&gt;&quot;</span>, line <span class="m">1</span>, in <span class="n">&lt;module&gt;</span>
<span class="gr">AttributeError</span>: <span class="n">__class__</span>
</pre></div> </div> Be Careful with Python's New-Style String Format http://lucumr.pocoo.org/2016/12/29/careful-with-str-format 2016-12-29T00:00:00Z <p>Over the last few months I have kept making the same observation over and over again in various different contexts: whenever you are confronted with a very strong opinion about a topic, otherwise reasonable discussions about the topic often involve arguments that have long become outdated or are no longer strictly relevant to the conversation.</p> <p>What I mean by that is that given a controversial topic, a valid argument for one side or the other is being repeated by a crowd of people that once heard it, even after that argument stops being valid. This happens because often the general situation changed and the argument references a reality that no longer exists in the same form.
Instead of reevaluating the environment however, goalposts are moved to restore the general sentiment of the opinion.</p> <p>To give you a practical example of this problem I can just go by a topic I have a very strong opinion about: Python 3. When Python 3 was not a huge thing yet I started having conversations with people in the community about the problems I see with splitting the community and the complexity of porting. Not just that, I also kept bringing up general questions about some of the text and byte decisions. I started doing talks about the topic and writing blog articles that kept being shared. Nowadays when I go to a conference I very quickly end up in conversations where other developers come to me and see me as the &quot;Does not like Python 3 guy&quot;. While I still am not a friend of some of the decisions in Python 3 I am very much aware that Python 3 in 2016 is a very different Python 3 than 6 years ago or earlier.</p> <p>In fact, I myself campaigned for some changes to Python 3 that made it possible to achieve better ports (like the reintroduction of the <cite>u</cite> prefix on Unicode string literals) and the bulk of my libraries has worked on Python 3 for many years now. It's a fact that in 2016 the problems that people have with Python 3 are different from the ones they used to have before.</p> <p>This leads to very interesting conversations where I can have a highly technical conversation about a very specific issue with Python 3 and thoughts about how to do it differently or deal with it (like some of the less obvious consequences of the new text storage model) and another person joins into the conversation with an argument against Python 3 that has long stopped being valid. Why? Because there is a cost to porting to Python 3 and a chance is not seen.
This means that a person with a general negativity towards Python 3 would seek me out and try to reaffirm their opposition to a port to it.</p> <p>The same thing is happening with JavaScript where there is a general negative sentiment about programming in it but not everybody has good arguments for it. There are some that actually program a lot in it and dislike specific things about the current state of the ecosystem, but generally acknowledge that the language is evolving, and then there are those that take advantage of unhappiness and bring their heavily outdated opposition against JavaScript into a conversation just to reaffirm their own opinion.</p> <p>This is hardly confined to the programming world. I made the same discovery about CETA. CETA is a free trade agreement between the European Union and Canada and it had the misfortune of being negotiated at the same time as the more controversial TTIP with the US. The story goes roughly like this: TTIP was negotiated in secrecy (as all trade agreements are) and there were strong disagreements between what the EU and what the US thought trade should look like. Those differences were about food safety standards and other highly sensitive topics. Various organizations on both the left and right extremes of the political scale started to grab any remotely controversial information that leaked out to shift the public opinion towards negativity to TTIP. Then the entire thing spiraled out of control: people not only railed against TTIP but took their opposition and looked for similar contracts and found CETA. Since both are trade agreements there is naturally a lot of common ground between them. The subtleties were quickly lost. Where the initial arguments against TTIP were food standards, public services and non-transparent ISDS courts, many of the critics failed to realize that CETA fundamentally was a different beast.
Not only was it already a much improved agreement from the start, but it kept being modified from the initial public version of it to the one that was finally sent to national parliaments.</p> <p>However, contrary to what I would have expected, namely that critics would acknowledge that their criticism was being heard, they instead slowly moved the goalposts. At this point there is so much emotion and misinformation in the general community that the goalposts moved all the way to not supporting further free trade at all. In the general conversation about ISDS and standards many people introduced their own opinions about free trade and their dislike towards corporations and multinationals.</p> <p>This I assume is human behavior. Admitting that you might be wrong is hard enough, but it's even harder when you had validation that you were right in the past. In particular, accepting that an argument against something might no longer be valid because that something has changed in the meantime is hard. I'm not sure what the solution to this is but I definitely realized over the last few years from my own behavior that one needs to be more careful about stating strong opinions in public. At the same time however I think we should all be more careful about dispelling misinformation in conversations even if the general mood supports our opinion. As an example while emotionally I like hearing stories about how JavaScript's packaging causes pain to developers since I experienced it first hand, I know from a rational point of view that the ecosystem is improving at tremendous speed. Yes I have been burned by npm but it's not like this is not tremendously improving.</p> <p>Something that has been put to paper once is hard to remove from people's minds.
In particular in the technological context, technology moves so fast that very likely something you read once might no longer be up to date as little as six months later.</p> <p>So I suppose my proposal to readers is not to fall into that trap and to assume that the environment around oneself keeps on changing.</p> Be Careful About What You Dislike http://lucumr.pocoo.org/2016/11/5/be-careful-about-what-you-dislike 2016-11-05T00:00:00Z <p>Recently I started looking into Python's new <a class="reference external" href="https://docs.python.org/3/library/asyncio.html">asyncio</a> module a bit more. The reason for this is that I needed to do something that works better with evented IO and I figured I might give the new hot thing in the Python world a try. Primarily what I learned from this exercise is that it's a much more complex system than I expected and I am now at the point where I am very confident that I do not know how to use it properly.</p> <p>It's not conceptually hard to understand and borrows a lot from Twisted, but it has so many elements that play into it that I'm not sure any more how the individual bits and pieces are supposed to go together. Since I'm not clever enough to actually propose anything better I just figured I would share my thoughts about what confuses me instead so that others might be able to use that in some capacity to understand it.</p> <div class="section" id="the-primitives"> <h2>The Primitives</h2> <p><cite>asyncio</cite> is supposed to implement asynchronous IO with the help of coroutines. Originally implemented as a library around the <cite>yield</cite> and <cite>yield from</cite> expressions it's now a much more complex beast as the language evolved at the same time.
So here is the current set of things that you need to know exist:</p> <ul class="simple"> <li>event loops</li> <li>event loop policies</li> <li>awaitables</li> <li>coroutine functions</li> <li>old style coroutine functions</li> <li>coroutines</li> <li>coroutine wrappers</li> <li>generators</li> <li>futures</li> <li>concurrent futures</li> <li>tasks</li> <li>handles</li> <li>executors</li> <li>transports</li> <li>protocols</li> </ul> <p>In addition the language gained a few special methods that are new:</p> <ul class="simple"> <li><tt class="docutils literal">__aenter__</tt> and <tt class="docutils literal">__aexit__</tt> for asynchronous <cite>with</cite> blocks</li> <li><tt class="docutils literal">__aiter__</tt> and <tt class="docutils literal">__anext__</tt> for asynchronous iterators (async loops and async comprehensions). For extra fun that protocol already changed once. In 3.5 it returns an awaitable (a coroutine); in Python 3.6 it will return a newfangled async generator.</li> <li><tt class="docutils literal">__await__</tt> for custom awaitables</li> </ul> <p>That's quite a bit to know and the documentation covers those parts. However here are some notes I made on some of those things to understand them better:</p> <div class="section" id="event-loops"> <h3>Event Loops</h3> <p>The event loop in asyncio is a bit different than you would expect at first look. On the surface it looks like each thread has one event loop but that's not really how it works. Here is how I think this works:</p> <ul class="simple"> <li>if you are the main thread an event loop is created when you call <tt class="docutils literal">asyncio.get_event_loop()</tt></li> <li>if you are any other thread, a runtime error is raised from <tt class="docutils literal">asyncio.get_event_loop()</tt></li> <li>You can at any point call <tt class="docutils literal">asyncio.set_event_loop()</tt> to bind an event loop to the current thread.
Such an event loop can be created with the <tt class="docutils literal">asyncio.new_event_loop()</tt> function.</li> <li>Event loops can be used without being bound to the current thread.</li> <li><tt class="docutils literal">asyncio.get_event_loop()</tt> returns the thread bound event loop, it does not return the currently running event loop.</li> </ul> <p>The combination of these behaviors is super confusing for a few reasons. First of all you need to know that these functions are delegates to the underlying event loop policy which is globally set. The default is to bind the event loop to the thread. Alternatively one could in theory bind the event loop to a greenlet or something similar if one would so desire. However it's important to know that library code does not control the policy and as such cannot reason that asyncio will scope to a thread.</p> <p>Secondly asyncio does not require event loops to be bound to the context through the policy. An event loop can work just fine in isolation. However this is the first problem for library code as a coroutine or something similar does not know which event loop is responsible for scheduling it. This means that if you call <tt class="docutils literal">asyncio.get_event_loop()</tt> from within a coroutine you might not get the event loop back that ran you. This is also the reason why all APIs take an optional explicit loop parameter. 
So for instance to figure out which coroutine is currently running one cannot invoke something like this:</p> <div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">get_task</span><span class="p">():</span>
    <span class="n">loop</span> <span class="o">=</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">get_event_loop</span><span class="p">()</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">Task</span><span class="o">.</span><span class="n">current_task</span><span class="p">(</span><span class="n">loop</span><span class="p">)</span>
    <span class="k">except</span> <span class="ne">RuntimeError</span><span class="p">:</span>
        <span class="k">return</span> <span class="bp">None</span>
</pre></div> <p>Instead the loop has to be passed explicitly. This furthermore requires you to pass through the loop explicitly everywhere in library code or very strange things will happen. Not sure what the thinking for that design is but if this is not being fixed (for instance by having <tt class="docutils literal">get_event_loop()</tt> return the actually running loop) then the only other change that makes sense is to explicitly disallow explicit loop passing and require it to be bound to the current context (thread etc.).</p> <p>Since the event loop policy does not provide an identifier for the current context it also is impossible for a library to &quot;key&quot; to the current context in any way. There are also no callbacks that would permit hooking the teardown of such a context which further limits what can be done realistically.</p> </div> <div class="section" id="awaitables-and-coroutines"> <h3>Awaitables and Coroutines</h3> <p>In my humble opinion the biggest design mistake of Python was to overload iterators so much. They are now being used not just for iteration but also for various types of coroutines.
One of the biggest design mistakes of iterators in Python is that <cite>StopIteration</cite> bubbles if not caught. This can cause very frustrating problems where an exception somewhere can cause a generator or coroutine elsewhere to abort. This is a long running issue that Jinja for instance has to fight with. The template engine internally renders into a generator and when a template for some reason raises a <cite>StopIteration</cite> the rendering just ends there.</p> <p>Python is slowly learning the lesson of overloading this system more. First of all in 3.something the asyncio module landed and did not have language support. So it was decorators and generators all the way down. To implement the <cite>yield from</cite> support and more, the <cite>StopIteration</cite> was overloaded once more. This led to surprising behavior like this:</p> <div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="k">def</span> <span class="nf">foo</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
<span class="gp">... </span>    <span class="k">if</span> <span class="n">n</span> <span class="ow">in</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">):</span>
<span class="gp">... </span>        <span class="k">return</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="gp">... </span>    <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
<span class="gp">... </span>        <span class="k">yield</span> <span class="n">item</span> <span class="o">*</span> <span class="mi">2</span>
<span class="gp">...</span>
<span class="gp">&gt;&gt;&gt; </span><span class="nb">list</span><span class="p">(</span><span class="n">foo</span><span class="p">(</span><span class="mi">0</span><span class="p">))</span>
<span class="go">[]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="nb">list</span><span class="p">(</span><span class="n">foo</span><span class="p">(</span><span class="mi">1</span><span class="p">))</span>
<span class="go">[]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="nb">list</span><span class="p">(</span><span class="n">foo</span><span class="p">(</span><span class="mi">2</span><span class="p">))</span>
<span class="go">[0, 2]</span>
</pre></div> <p>No error, no warning. Just not the behavior you expect. This is because a <cite>return</cite> with a value from a function that is a generator actually raises a <cite>StopIteration</cite> with a single arg that is not picked up by the iterator protocol but just handled in the coroutine code.</p> <p>With 3.5 and 3.6 a lot changed because now in addition to generators we have coroutine objects. Instead of making a coroutine by wrapping a generator there is now a separate object which creates a coroutine directly. It's implemented by prefixing a function with <tt class="docutils literal">async</tt>. For instance <tt class="docutils literal">async def x()</tt> will make such a coroutine. Now in 3.6 there will be separate async generators that will raise <cite>StopAsyncIteration</cite> to keep it apart. Additionally with Python 3.5 and later there is now a future import (<tt class="docutils literal">generator_stop</tt>) that will raise a <cite>RuntimeError</cite> if code raises <cite>StopIteration</cite> in an iteration step.</p> <p>Why am I mentioning all this? Because the old stuff does not really go away.
Generators still have <cite>send</cite> and <cite>throw</cite> and coroutines still largely behave like generators. That is a lot of stuff you need to know now for quite some time going forward.</p> <p>To unify a lot of this duplication we have a few more concepts in Python now:</p> <ul class="simple"> <li>awaitable: an object with an <tt class="docutils literal">__await__</tt> method. This is for instance implemented by native coroutines and old style coroutines and some others.</li> <li>coroutinefunction: a function that returns a native coroutine. Not to be confused with a function returning a coroutine.</li> <li>a coroutine: a native coroutine. Note that old asyncio coroutines are not considered coroutines by the current documentation as far as I can tell. At the very least <tt class="docutils literal">inspect.iscoroutine</tt> does not consider that a coroutine. It's however picked up by the future/awaitable branches.</li> </ul> <p>Particularly confusing is that <tt class="docutils literal">asyncio.iscoroutinefunction</tt> and <tt class="docutils literal">inspect.iscoroutinefunction</tt> are doing different things. Same with <tt class="docutils literal">inspect.iscoroutine</tt> and <tt class="docutils literal">inspect.iscoroutinefunction</tt>. Note that even though inspect does not know anything about asyncio legacy coroutine functions in the type check, it is apparently aware of them when you check for awaitable status even though it does not conform to <tt class="docutils literal">__await__</tt>.</p> </div> <div class="section" id="coroutine-wrappers"> <h3>Coroutine Wrappers</h3> <p>Whenever you call an <tt class="docutils literal">async def</tt> function Python invokes a thread local coroutine wrapper. It's set with <tt class="docutils literal">sys.set_coroutine_wrapper</tt> and it's a function that can wrap this.
Looks a bit like this:</p> <div class="highlight"><pre><span></span>&gt;&gt;&gt; import sys &gt;&gt;&gt; sys.set_coroutine_wrapper(lambda x: 42) &gt;&gt;&gt; async def foo(): ... pass ... &gt;&gt;&gt; foo() __main__:1: RuntimeWarning: coroutine &#39;foo&#39; was never awaited 42 </pre></div> <p>In this case I never actually invoke the original function and just give you a hint of what this can do. As far as I can tell this is always thread local so if you swap out the event loop policy you need to figure out separately how to make this coroutine wrapper sync up with the same context if that's something you want to do. New threads spawned will not inherit that flag from the parent thread.</p> <p>This is not to be confused with the asyncio coroutine wrapping code.</p> </div> <div class="section" id="awaitables-and-futures"> <h3>Awaitables and Futures</h3> <p>Some things are awaitables. As far as I can see the following things are considered awaitable:</p> <ul class="simple"> <li>native coroutines</li> <li>generators that have the fake <tt class="docutils literal">CO_ITERABLE_COROUTINE</tt> flag set (we will cover that)</li> <li>objects with an <tt class="docutils literal">__await__</tt> method</li> </ul> <p>Essentially these are all objects with an <tt class="docutils literal">__await__</tt> method except that the generators don't have one for legacy reasons. Where does the <tt class="docutils literal">CO_ITERABLE_COROUTINE</tt> flag come from? It comes from a coroutine wrapper (not to be confused with <tt class="docutils literal">sys.set_coroutine_wrapper</tt>) that is <tt class="docutils literal">&#64;asyncio.coroutine</tt>. 
That through some indirection will wrap the generator with <tt class="docutils literal">types.coroutine</tt> (not to be confused with <tt class="docutils literal">types.CoroutineType</tt> or <tt class="docutils literal">asyncio.coroutine</tt>) which will re-create the internal code object with the additional flag <tt class="docutils literal">CO_ITERABLE_COROUTINE</tt>.</p> <p>So now that we know what those things are, what are futures? First we need to clear up one thing: there are actually two (completely incompatible) types of futures in Python 3. <tt class="docutils literal">asyncio.futures.Future</tt> and <tt class="docutils literal">concurrent.futures.Future</tt>. One came before the other but they are also both still used even within asyncio. For instance <tt class="docutils literal">asyncio.run_coroutine_threadsafe()</tt> will dispatch a coroutine to an event loop running in another thread but it will then return a <tt class="docutils literal">concurrent.futures.Future</tt> object instead of an <tt class="docutils literal">asyncio.futures.Future</tt> object. This makes sense because only the <tt class="docutils literal">concurrent.futures.Future</tt> object is thread safe.</p> <p>So now that we know there are two incompatible futures we should clarify what futures are in asyncio. Honestly I'm not entirely sure where the differences are but I'm going to call this &quot;eventual&quot; for the moment. It's an object that eventually will hold a value and you can do some handling with that eventual result while it's still computing. Some variations of this are called deferreds, others are called promises. What the exact difference is is above my head.</p> <p>What can you do with a future? You can attach a callback that will be invoked once it's ready or you can attach a callback that will be invoked if the future fails. 
Additionally you can <tt class="docutils literal">await</tt> it (it implements <tt class="docutils literal">__await__</tt> and is thus awaitable). Additionally futures can be cancelled.</p> <p>So how do you get such a future? By calling <tt class="docutils literal">asyncio.ensure_future</tt> on an awaitable object. This will also make a good old generator into such a future. However if you read the docs you will read that <tt class="docutils literal">asyncio.ensure_future</tt> actually returns a <tt class="docutils literal">Task</tt>. So what's a task?</p> </div> <div class="section" id="tasks"> <h3>Tasks</h3> <p>A task is a future that is wrapping a coroutine in particular. It works like a future but it also has some extra methods to extract the current stack of the contained coroutine. We already saw the tasks mentioned earlier because it's the main way to figure out what an event loop is currently doing via <tt class="docutils literal">Task.current_task</tt>.</p> <p>There is also a difference in how cancellation works for tasks and futures but that's beyond the scope of this. Cancellation is its own entire beast. If you are in a coroutine and you know you are currently running you can get your own task through <tt class="docutils literal">Task.current_task</tt> as mentioned but this requires knowledge of what event loop you are dispatched on which might or might not be the thread bound one.</p> <p>It's not possible for a coroutine to know which loop goes with it. Also the <cite>Task</cite> does not provide that information through a public API. However if you did manage to get hold of a task you can currently access <tt class="docutils literal">task._loop</tt> to find your way back to the event loop.</p> </div> <div class="section" id="handles"> <h3>Handles</h3> <p>In addition to all of this there are handles. Handles are opaque objects of pending executions that cannot be awaited but they can be cancelled. 
In particular if you schedule the execution of a call with <tt class="docutils literal">call_soon</tt> or <tt class="docutils literal">call_soon_threadsafe</tt> (and some others) you get that handle you can then use to cancel the execution as a best effort attempt but you can't wait for the call to actually take place.</p> </div> <div class="section" id="executors"> <h3>Executors</h3> <p>Since you can have multiple event loops but it's not obvious what the use of more than one of those things per thread is, the obvious assumption can be made that a common setup is to have N threads with an event loop each. So how do you inform another event loop about doing some work? You cannot schedule a callback into an event loop in another thread <em>and</em> get the result back. For that you need to use executors instead.</p> <p>Executors come from <tt class="docutils literal">concurrent.futures</tt> for instance and they allow you to schedule work into threads that itself is not evented. For instance you can use <tt class="docutils literal">run_in_executor</tt> on the event loop to schedule a function to be called in another thread. The result is then returned as an asyncio coroutine instead of a concurrent coroutine like <tt class="docutils literal">run_coroutine_threadsafe</tt> would do. I did not yet have enough mental capacity to figure out why those APIs exist, how you are supposed to use them and when to use which one. The documentation suggests that the executor stuff could be used to build multiprocess things.</p> </div> <div class="section" id="transports-and-protocols"> <h3>Transports and Protocols</h3> <p>I always thought those would be the confusing things but that's basically a verbatim copy of the same concepts in Twisted. 
So read those docs if you want to understand them.</p> </div> </div> <div class="section" id="how-to-use-asyncio"> <h2>How to use asyncio</h2> <p>Now that we roughly understand asyncio I found a few patterns that people seem to use when they write asyncio code:</p> <ul class="simple"> <li>pass the event loop to all coroutines. That appears to be what a part of the community is doing. Giving a coroutine knowledge about what loop is going to schedule it makes it possible for the coroutine to learn about its task.</li> <li>alternatively you require that the loop is bound to the thread. That also lets a coroutine learn about that. Ideally support both. Sadly the community is already torn on what to do.</li> <li>If you want to use contextual data (think thread locals) you are a bit out of luck currently. The most popular workaround is apparently atlassian's <tt class="docutils literal">aiolocals</tt> which basically requires you to manually propagate contextual information into coroutines spawned since the interpreter does not provide support for this. This means that if you have a utility library spawning coroutines you will lose context.</li> <li>Ignore that the old coroutine stuff in Python exists. Use 3.5 only with the new <tt class="docutils literal">async def</tt> keyword and co. In particular you will need that anyways to somewhat enjoy the experience because with older versions you do not have async context managers which turn out to be very necessary for resource management.</li> <li>Learn to restart the event loop for cleanup. This is something that took me longer to realize than I wish it did but the sanest way to deal with cleanup logic that is written in async code is to restart the event loop a few times until nothing pending is left. Since sadly there is no common pattern to deal with this you will end up with some ugly workaround at times. 
For instance <cite>aiohttp</cite>'s web support also does this pattern so if you want to combine two cleanup logics you will probably have to reimplement the utility helper that it provides since that helper completely tears down the loop when it's done. This is also not the first library I saw do this :(</li> <li>Working with subprocesses is non obvious. You need to have an event loop running in the main thread which I suppose is listening in on signal events and then dispatches them to other event loops. This requires that the loop is notified via <tt class="docutils literal"><span class="pre">asyncio.get_child_watcher().attach_loop(...)</span></tt>.</li> <li>Writing code that supports both async and sync is somewhat of a lost cause. It also gets dangerous quickly when you start being clever and try to support <tt class="docutils literal">with</tt> and <tt class="docutils literal">async with</tt> on the same object for instance.</li> <li>If you want to give a coroutine a better name to figure out why it was not being awaited, setting <tt class="docutils literal">__name__</tt> doesn't help. You need to set <tt class="docutils literal">__qualname__</tt> instead which is what the error message printer uses.</li> <li>Sometimes internal type conversions can screw you over. In particular the <tt class="docutils literal">asyncio.wait()</tt> function will make sure all things passed are futures which means that if you pass coroutines instead you will have a hard time finding out if your coroutine finished or is pending since the input objects no longer match the output objects. In that case the only real sane thing to do is to ensure that everything is a future upfront.</li> </ul> </div> <div class="section" id="context-data"> <h2>Context Data</h2> <p>Aside from the insane complexity and lack of understanding on my part of how to best write APIs for it my biggest issue is the complete lack of consideration for context local data. 
This is something that the node community has learned by now. <tt class="docutils literal"><span class="pre">continuation-local-storage</span></tt> exists but has been accepted as implemented too late. Continuation local storage and similar concepts are regularly used to enforce security policies in a concurrent environment and corruption of that information can cause severe security issues.</p> <p>The fact that Python does not even have any store at all for this is more than disappointing. I was looking into this in particular because I'm investigating how to best support <a class="reference external" href="https://docs.sentry.io/learn/breadcrumbs/">Sentry's breadcrumbs</a> for asyncio and I do not see a sane way to do it. There is no concept of context in asyncio, there is no way to figure out which event loop you are working with from generic code and without monkeypatching the world this information will not be available.</p> <p>Node is currently going through the process of <a class="reference external" href="https://github.com/nodejs/node-eps/pull/18">finding a long term solution for this problem</a>. That this is not something to be left ignored can be seen by this being a recurring issue in all ecosystems. It comes up with JavaScript, Python and the .NET environment. The problem <a class="reference external" href="https://docs.google.com/document/d/1tlQ0R6wQFGqCS5KeIw0ddoLbaSYx6aU7vyXOkv-wvlM/edit">is named async context propagation</a> and solutions go by many names. In Go the context package needs to be used and explicitly passed to all goroutines (not a perfect solution but at least one). .NET has the best solution in the form of local call contexts. It can be a thread context, a web request context, or something similar but it's automatically propagating unless suppressed. This is the gold standard of what to aim for. 
Microsoft has had this solved for more than 15 years now I believe.</p> <p>I don't know if the ecosystem is still young enough that logical call contexts can be added but now might still be the time.</p> </div> <div class="section" id="personal-thoughts"> <h2>Personal Thoughts</h2> <p>Man that thing is complex and it keeps getting more complex. I do not have the mental capacity to casually work with asyncio. It requires constantly updating the knowledge with all language changes and it has tremendously complicated the language. It's impressive that an ecosystem is evolving around it but I can't help but get the impression that it will take quite a few more years for it to become a particularly enjoyable and stable development experience.</p> <p>What landed in 3.5 (the actual new coroutine objects) is great. In particular with the changes that will come up there is a sensible base that I wish would have been in earlier versions. The entire mess with overloading generators to be coroutines was a mistake in my mind. With regards to what's in asyncio I'm not sure of anything. It's an incredibly complex thing and super messy internally. It's hard to comprehend how it works in all details: when you can pass a generator, when it has to be a real coroutine, what futures are, what tasks are, how the loop works, and that does not even get to the actual IO part.</p> <p>The worst part is that asyncio is not even particularly fast. David Beazley's live demo hacked up asyncio replacement is twice as fast as it. There is an enormous amount of complexity that's hard to understand and reason about and then it fails on its main promise. 
I'm not sure what to think about it but I know at least that I don't understand asyncio enough to feel confident about giving people advice about how to structure code for it.</p> </div> I don't understand Python's Asyncio http://lucumr.pocoo.org/2016/10/30/i-dont-understand-asyncio 2016-10-30T00:00:00Z http://lucumr.pocoo.org/2016/10/30/i-dont-understand-asyncio Armin Ronacher's Thoughts and Writings <p>A few months back I decided to write a command line client for <a class="reference external" href="http://www.getsentry.com/">Sentry</a> because manually invoking the Sentry API for some common tasks (such as dsym or sourcemap management) is just no fun. Given the choice of languages available I went with Rust. The reason for this is that I want people to be able to download a single executable that links everything in statically. The choice was between C, C++, Go and Rust. There is no denying that I really like Rust so it was already a pretty good choice for me. However what made it even easier is that Rust has quite a potent ecosystem for what I wanted. So here are my lessons learned from this.</p> <div class="section" id="libraries-for-http"> <h2>Libraries for HTTP</h2> <p>To make an HTTP request you have a choice of libraries. In particular there are two in Rust you can try: <a class="reference external" href="http://hyper.rs/">hyper</a> and <a class="reference external" href="https://crates.io/crates/curl">rust-curl</a>. I tried both and there are some releases with the former but I settled on rust-curl in the end. The reason for this is twofold. The first is that curl (despite some of the oddities in how it does things) is very powerful and integrates really well with the system SSL libraries. This means that when I compile the executable I get the native TLS support right away. rust-curl also (despite being not a pure rust library) compiles really well out of the box on Windows, macOS and Linux. 
The second reason is that Hyper is currently undergoing a major shift in how it's structured and a bit in flux. I did not want to bet on that too much. When I started it also did not have proxy support which is not great.</p> <p>For JSON parsing and serializing I went with <a class="reference external" href="https://crates.io/crates/serde">serde</a>. I suppose that serde will eventually be the library of choice for all things serialization but right now it's not. It depends on compiler plugins and there are two ways to make it work right now. One is to go with nightly Rust (which is what I did), the other is to use the build script support in Rust. This is similar to what you do in Go where some code generation happens as part of the build. It definitely works but it's not nearly as nice as using serde with nightly Rust.</p> </div> <div class="section" id="api-design"> <h2>API Design</h2> <p>The next question is what a good API design for a Rust HTTP library is. I struggled with this quite a bit and it took multiple iterations to end up with something that I think is a good pattern. What I ended up with is a collection of multiple types:</p> <ul class="simple"> <li><tt class="docutils literal">Api</tt>: I have a basic client object which I call <tt class="docutils literal">Api</tt> internally. It manages the curl handles (right now it just caches one) and also exposes convenience methods to perform certain types of HTTP requests. On top of that it provides high level methods that send the right HTTP requests and handle the responses.</li> <li><tt class="docutils literal">ApiRequest</tt>: basically your request object. It's mostly a builder for making requests and has a method to send the request and get a response object.</li> <li><tt class="docutils literal">ApiResponse</tt>: contains the response from the HTTP request. 
This also provides various helpers to convert the response into different things.</li> <li><tt class="docutils literal">ApiResult&lt;T&gt;</tt>: this is a result object which is returned from most methods. The error is a special API error that converts from all the APIs we call into. This means it can hold curl errors, form errors, JSON errors, IO errors and more.</li> </ul> <p>To give you an idea of what this looks like I want to show you one of the high level methods that use most of the API:</p> <div class="highlight"><pre><span></span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="n">list_releases</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">org</span><span class="o">:</span><span class="w"> </span><span class="o">&amp;</span><span class="kt">str</span><span class="p">,</span><span class="w"> </span><span class="n">project</span><span class="o">:</span><span class="w"> </span><span class="o">&amp;</span><span class="kt">str</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="n">ApiResult</span><span class="o">&lt;</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="n">ReleaseInfo</span><span class="o">&gt;&gt;</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="nb">Ok</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="o">&amp;</span><span class="n">format</span><span class="o">!</span><span class="p">(</span><span class="s">&quot;/projects/{}/{}/releases/&quot;</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="n">PathArg</span><span class="p">(</span><span class="n">org</span><span 
class="p">),</span><span class="w"> </span><span class="n">PathArg</span><span class="p">(</span><span class="n">project</span><span class="p">)))</span><span class="o">?</span><span class="p">.</span><span class="n">convert</span><span class="p">()</span><span class="o">?</span><span class="p">)</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> <p>(Note that I'm using the new question mark syntax <tt class="docutils literal">?</tt> instead of the more familiar <tt class="docutils literal">try!</tt> macro here)</p> <p>So what is happening here?</p> <ol class="arabic simple"> <li>This is a method on the <tt class="docutils literal">Api</tt> struct. We use the <tt class="docutils literal">get()</tt> shorthand method to make an HTTP <cite>GET</cite> request. It takes one argument which is the URL to make the request to. We use standard string formatting to create the URL path here.</li> <li>The <cite>PathArg</cite> is a simple wrapper that customizes the formatting so that instead of just stringifying a value it also percent encodes it.</li> <li>The return value of the <cite>get</cite> method is an <tt class="docutils literal">ApiResult&lt;ApiResponse&gt;</tt> which provides a handy <tt class="docutils literal">convert()</tt> method which does both error handling and deserialization.</li> </ol> <p>How does the JSON handling take place here? The answer is that <tt class="docutils literal">convert()</tt> can do that because <tt class="docutils literal">Vec&lt;ReleaseInfo&gt;</tt> has an automatic deserializer implemented.</p> </div> <div class="section" id="the-error-onion"> <h2>The Error Onion</h2> <p>The bulk of the complexity is hidden behind multiple layers of error handling. It took me quite a long time to finally come up with this design which is why I'm particularly happy with finally having found one I really like. 
The reason error handling is so tricky with HTTP requests is because you want to have both the flexibility of responding to specific error conditions as well as automatically handling all the ones you are not interested in.</p> <p>The design I ended up with is that I have an <tt class="docutils literal">ApiError</tt> type. All the internal errors that the library could encounter (curl errors etc.) are automatically converted into an <tt class="docutils literal">ApiError</tt>. If you send a request the return value is as such <tt class="docutils literal">Result&lt;ApiResponse, ApiError&gt;</tt>. However the trick here is that at this level no HTTP error (other than connection errors) is actually stored as <tt class="docutils literal">ApiError</tt>. Instead also a failed response (because for instance of a 404) is stored as the actual response object.</p> <p>On the response object you can check the status of the response with these methods:</p> <div class="highlight"><pre><span></span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="n">status</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="kt">u32</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">status</span><span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="n">failed</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="kt">bool</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span 
class="p">.</span><span class="n">status</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="mi">400</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">status</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="mi">600</span><span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="n">ok</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="kt">bool</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="o">!</span><span class="bp">self</span><span class="p">.</span><span class="n">failed</span><span class="p">()</span><span class="w"> </span><span class="p">}</span><span class="w"></span> </pre></div> <p>However what's nice is that most of the time you don't have to do any of this. 
The response object also provides a method to convert unsuccessful responses into errors like this:</p> <div class="highlight"><pre><span></span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="n">to_result</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="n">ApiResult</span><span class="o">&lt;</span><span class="n">ApiResponse</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">ok</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="nb">Ok</span><span class="p">(</span><span class="bp">self</span><span class="p">);</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="nb">Ok</span><span class="p">(</span><span class="n">err</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">deserialize</span><span class="o">::&lt;</span><span class="n">ErrorInfo</span><span class="o">&gt;</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="nb">Some</span><span class="p">(</span><span class="n">detail</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span 
class="w"> </span><span class="n">err</span><span class="p">.</span><span class="n">detail</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="nb">Err</span><span class="p">(</span><span class="n">ApiError</span><span class="o">::</span><span class="n">Http</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">status</span><span class="p">(),</span><span class="w"> </span><span class="n">detail</span><span class="p">));</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="w"> </span><span class="nb">Err</span><span class="p">(</span><span class="n">ApiError</span><span class="o">::</span><span class="n">Http</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">status</span><span class="p">(),</span><span class="w"> </span><span class="s">&quot;generic error&quot;</span><span class="p">.</span><span class="n">into</span><span class="p">()))</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> <p>This method consumes the response and depending on the condition of the response returns different results. If everything was fine the response is returned unchanged. However if there was an error we first try to deserialize the body with our own <tt class="docutils literal">ErrorInfo</tt> which is the JSON response our API returns or otherwise we fall back to a generic error message and the status code.</p> <p>What's deserialize? 
It just invokes serde for deserialization:</p> <div class="highlight"><pre><span></span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="n">deserialize</span><span class="o">&lt;</span><span class="n">T</span><span class="o">:</span><span class="w"> </span><span class="n">Deserialize</span><span class="o">&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="n">ApiResult</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="nb">Ok</span><span class="p">(</span><span class="n">serde_json</span><span class="o">::</span><span class="n">from_reader</span><span class="p">(</span><span class="k">match</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">body</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="nb">Some</span><span class="p">(</span><span class="k">ref</span><span class="w"> </span><span class="n">body</span><span class="p">)</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="n">body</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="nb">None</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="o">&amp;</span><span class="s">b&quot;&quot;</span><span class="p">[..],</span><span class="w"></span> <span class="w"> </span><span class="p">})</span><span class="o">?</span><span class="p">)</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> <p>One thing you can see here is that the body is buffered into memory entirely. 
I was torn on this in the beginning but it actually turns out to make the API significantly nicer because it allows you to reason about the response better. Without buffering up everything in memory it becomes much harder to do conditional things based on the body. For the cases where we cannot deal with this limitation I have extra methods to stream the incoming data.</p> <p>On deserialization we match on the body. The body is an <tt class="docutils literal">Option&lt;Vec&lt;u8&gt;&gt;</tt> here which we convert into a <tt class="docutils literal">&amp;[u8]</tt> which satisfies the <tt class="docutils literal">Read</tt> interface which we can then use for deserialization.</p> <p>The nice thing about the aforementioned <tt class="docutils literal">to_result</tt> method is that it composes so nicely. The common case is to convert something into a result and to then deserialize the response if everything is fine, which is why we have this <tt class="docutils literal">convert</tt> method:</p> <div class="highlight"><pre><span></span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="n">convert</span><span class="o">&lt;</span><span class="n">T</span><span class="o">:</span><span class="w"> </span><span class="n">Deserialize</span><span class="o">&gt;</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="n">ApiResult</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">to_result</span><span class="p">().</span><span class="n">and_then</span><span class="p">(</span><span class="o">|</span><span class="n">x</span><span class="o">|</span><span class="w"> </span><span class="n">x</span><span 
class="p">.</span><span class="n">deserialize</span><span class="p">())</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <div class="section" id="complex-uses"> <h2>Complex Uses</h2> <p>There are some really nice uses for this. For instance here is how we check for updates from the GitHub API:</p> <div class="highlight"><pre><span></span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="n">get_latest_release</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="n">ApiResult</span><span class="o">&lt;</span><span class="nb">Option</span><span class="o">&lt;</span><span class="p">(</span><span class="nb">String</span><span class="p">,</span><span class="w"> </span><span class="nb">String</span><span class="p">)</span><span class="o">&gt;&gt;</span><span class="w"></span> <span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">resp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">&quot;https://api.github.com/repos/getsentry/sentry-cli/releases/latest&quot;</span><span class="p">)</span><span class="o">?</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="n">resp</span><span class="p">.</span><span class="n">status</span><span class="p">()</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="mi">404</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="kd">let</span><span class="w"> </span><span 
class="n">info</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="n">GitHubRelease</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">resp</span><span class="p">.</span><span class="n">to_result</span><span class="p">()</span><span class="o">?</span><span class="p">.</span><span class="n">convert</span><span class="p">()</span><span class="o">?</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">asset</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">info</span><span class="p">.</span><span class="n">assets</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="n">asset</span><span class="p">.</span><span class="n">name</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">REFERENCE_NAME</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="nb">Ok</span><span class="p">(</span><span class="nb">Some</span><span class="p">((</span><span class="w"></span> <span class="w"> </span><span class="n">info</span><span class="p">.</span><span class="n">tag_name</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="n">asset</span><span class="p">.</span><span class="n">browser_download_url</span><span class="w"></span> <span class="w"> </span><span class="p">)));</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="w"> </span><span class="nb">Ok</span><span 
class="p">(</span><span class="nb">None</span><span class="p">)</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> <p>Here we silently ignore a 404 but otherwise we parse the response as a <cite>GitHubRelease</cite> structure and then look through all the assets. The call to <cite>to_result</cite> does nothing on success but it will handle all the other response errors automatically.</p> <p>To get an idea of how structures like <cite>GitHubRelease</cite> are defined, this is all that is needed:</p> <div class="highlight"><pre><span></span><span class="cp">#[derive(Debug, Deserialize)]</span><span class="w"></span> <span class="k">struct</span><span class="w"> </span><span class="n">GitHubAsset</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">browser_download_url</span><span class="o">:</span><span class="w"> </span><span class="nb">String</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="n">name</span><span class="o">:</span><span class="w"> </span><span class="nb">String</span><span class="p">,</span><span class="w"></span> <span class="p">}</span><span class="w"></span> <span class="cp">#[derive(Debug, Deserialize)]</span><span class="w"></span> <span class="k">struct</span><span class="w"> </span><span class="n">GitHubRelease</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">tag_name</span><span class="o">:</span><span class="w"> </span><span class="nb">String</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="n">assets</span><span class="o">:</span><span class="w"> </span><span class="nb">Vec</span><span class="o">&lt;</span><span class="n">GitHubAsset</span><span class="o">&gt;</span><span class="p">,</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <div 
class="section" id="curl-handle-management"> <h2>Curl Handle Management</h2> <p>One thing that is not visible here is how I manage the curl handles. Curl is a C library and the Rust binding to it is quite low level. While it's well typed and does not require unsafe code to use, it still feels very much like a C library. In particular there is a curl &quot;easy&quot; handle object you are supposed to keep hanging around between requests to take advantage of keepalives. However the handles are stateful. Readers of this blog are aware that there are few things I hate as much as unnecessarily stateful APIs. So I made it as stateless as possible.</p> <p>The &quot;correct&quot; thing to do would be to have a pool of &quot;easy&quot; handles. However in my case I never have more than one request outstanding at a time so instead of going with something more complex I stuff away the &quot;easy&quot; handle in a <tt class="docutils literal">RefCell</tt>. A <tt class="docutils literal">RefCell</tt> is a smart pointer that moves the borrow checks that Rust normally enforces at compile time to runtime. 
This is roughly how this looks:</p> <div class="highlight"><pre><span></span><span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="n">ApiRequest</span><span class="o">&lt;</span><span class="na">&#39;a</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">handle</span><span class="o">:</span><span class="w"> </span><span class="n">RefMut</span><span class="o">&lt;</span><span class="na">&#39;a</span><span class="p">,</span><span class="w"> </span><span class="n">curl</span><span class="o">::</span><span class="n">easy</span><span class="o">::</span><span class="n">Easy</span><span class="o">&gt;</span><span class="w"></span> <span class="p">}</span><span class="w"></span> <span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="n">Api</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="n">shared_handle</span><span class="o">:</span><span class="w"> </span><span class="n">RefCell</span><span class="o">&lt;</span><span class="n">curl</span><span class="o">::</span><span class="n">easy</span><span class="o">::</span><span class="n">Easy</span><span class="o">&gt;</span><span class="p">,</span><span class="w"></span> <span class="w"> </span><span class="p">...</span><span class="w"></span> <span class="p">}</span><span class="w"></span> <span class="k">impl</span><span class="w"> </span><span class="n">Api</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="n">request</span><span class="o">&lt;</span><span class="na">&#39;a</span><span class="o">&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="na">&#39;a</span><span class="w"> </span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span 
class="o">:</span><span class="w"> </span><span class="n">Method</span><span class="p">,</span><span class="w"> </span><span class="n">url</span><span class="o">:</span><span class="w"> </span><span class="o">&amp;</span><span class="kt">str</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="n">ApiResult</span><span class="o">&lt;</span><span class="n">ApiRequest</span><span class="o">&lt;</span><span class="na">&#39;a</span><span class="o">&gt;&gt;</span><span class="w"></span> <span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="k">mut</span><span class="w"> </span><span class="n">handle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">shared_handle</span><span class="p">.</span><span class="n">borrow_mut</span><span class="p">();</span><span class="w"></span> <span class="w"> </span><span class="n">ApiRequest</span><span class="o">::</span><span class="n">new</span><span class="p">(</span><span class="n">handle</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="p">,</span><span class="w"> </span><span class="o">&amp;</span><span class="n">url</span><span class="p">)</span><span class="w"></span> <span class="w"> </span><span class="p">}</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> <p>This way if you call <cite>request</cite> twice you will get a runtime panic if the last request is still outstanding. This is fine for what I do. The <tt class="docutils literal">ApiRequest</tt> object itself implements a builder like pattern where you can modify the object with chaining calls. 
This is roughly what this looks like when used for a more complex situation:</p> <div class="highlight"><pre><span></span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="n">send_event</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">event</span><span class="o">:</span><span class="w"> </span><span class="o">&amp;</span><span class="n">Event</span><span class="p">)</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="n">ApiResult</span><span class="o">&lt;</span><span class="nb">String</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span><span class="w"></span> <span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">dsn</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">dsn</span><span class="p">.</span><span class="n">as_ref</span><span class="p">().</span><span class="n">ok_or</span><span class="p">(</span><span class="n">Error</span><span class="o">::</span><span class="n">NoDsn</span><span class="p">)</span><span class="o">?</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">event</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="n">EventInfo</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">request</span><span class="p">(</span><span class="n">Method</span><span class="o">::</span><span class="n">Post</span><span class="p">,</span><span class="w"> </span><span class="o">&amp;</span><span class="n">dsn</span><span 
class="p">.</span><span class="n">get_submit_url</span><span class="p">())</span><span class="o">?</span><span class="w"></span> <span class="w"> </span><span class="p">.</span><span class="n">with_header</span><span class="p">(</span><span class="s">&quot;X-Sentry-Auth&quot;</span><span class="p">,</span><span class="w"> </span><span class="o">&amp;</span><span class="n">dsn</span><span class="p">.</span><span class="n">get_auth_header</span><span class="p">(</span><span class="n">event</span><span class="p">.</span><span class="n">timestamp</span><span class="p">))</span><span class="o">?</span><span class="w"></span> <span class="w"> </span><span class="p">.</span><span class="n">with_json_body</span><span class="p">(</span><span class="o">&amp;</span><span class="n">event</span><span class="p">)</span><span class="o">?</span><span class="w"></span> <span class="w"> </span><span class="p">.</span><span class="n">send</span><span class="p">()</span><span class="o">?</span><span class="p">.</span><span class="n">convert</span><span class="p">()</span><span class="o">?</span><span class="p">;</span><span class="w"></span> <span class="w"> </span><span class="nb">Ok</span><span class="p">(</span><span class="n">event</span><span class="p">.</span><span class="n">id</span><span class="p">)</span><span class="w"></span> <span class="p">}</span><span class="w"></span> </pre></div> </div> <div class="section" id="lessons-learned"> <h2>Lessons Learned</h2> <p>My key takeaways from doing this in Rust so far have been:</p> <ul class="simple"> <li>Rust is definitely a great choice for building command line utilities. The ecosystem is getting stronger by the day and there are so many useful crates already for very common tasks.</li> <li>The cross platform support is superb. Getting the Windows build going was a piece of cake compared to the terror you generally go through with other languages (including Python).</li> <li>serde is a pretty damn good library. 
It's a shame it's not as nice to use on stable Rust. Can't wait for this stuff to get more stable.</li> <li>Result objects in Rust are great but sometimes it makes sense to not immediately convert data into a result object. I originally converted failure responses into errors immediately and that definitely hurt the convenience of the APIs tremendously.</li> <li>Don't be afraid of using C libraries like <cite>curl</cite> instead of native Rust things. It turns out that Rust's build support is pretty magnificent which makes installing the Rust curl library straightforward. It even takes care of compiling curl itself on Windows.</li> </ul> <p>If you want to see the code, the entire git repository of the client can be found online: <a class="reference external" href="http://github.com/getsentry/sentry-cli">getsentry/sentry-cli</a>.</p> </div> Rust and Rest http://lucumr.pocoo.org/2016/7/10/rust-rest 2016-07-10T00:00:00Z <p>Most of the readers of this blog are not from Europe, let alone Austria, the country I was born in. As such I'm not sure how many will actually care about Austrian politics here, especially if it's a lengthy post. But I would still like you to read it because I think the topic is important and not just because of Austria. Our problems here are not just ours, they are a general issue that affects all of Europe and the western world.</p> <p>So since you are probably in no way familiar with Austrian politics or the situation in the country I want to give you a brief overview. Austria recovered very quickly from being the war-torn country it was after World War 2 and emerged as one of the most powerful economies of Europe if looked at on a GDP per capita basis. 
It underwent a conversion from an agricultural country with some tourism attached to being dominated by the service industry and producing technology and parts (the economic tree map looks confusing because it's so heavily diversified).</p> <p>However as well as the country developed after the wars and as profitable as the creation of the Eurozone was, there was an end to this positive trend and it came with the financial crisis of 2007/2008 (although with a bit of a delay). The economy did recover, but it did not do so to the extent people wanted. At the same time necessary reforms were not implemented (or not implemented in the right ways) and as a result the country has suffered major blows in the last few years. From a personal point of view I cannot stress enough how disappointed I am that many of my colleagues went to other countries and started their companies there or work there. But it would be foolish to blame this on politics alone. This is as much a problem of politics as it is a problem of culture.</p> <p>We have now reached the point where cheap and populist ideas like reducing social welfare for non-citizens get popular support. In this environment right wing parties emerge and this Sunday Austrians will probably elect the first right wing leader of the country since the end of World War 2.</p> <p>But politics is not what I want to talk about. What I want to talk about is the erosion of civilized discourse in Austria and I think in all of Europe. 
A large part of the general public is unable to have civilized discussions on the basis of facts and instead conspiracy theories and emotions take over, and this is something that extends to politicians in Austria as well.</p> <div class="section" id="the-symptoms-and-problems"> <h2>The Symptoms and Problems</h2> <p>If you look at the emotional state of the country you can see a few symptoms and problems that help the populists to rise to power:</p> <ul> <li><p class="first"><strong>Inability (or unwillingness) to learn and understand</strong> how Austria and the world changed in the last few years. I think this is a big one for the people who want to leave the European Union and do similar crazy things to the Austrian economy. We're so intertwined with it that I doubt anyone can predict what would happen as a consequence of leaving it.</p> </li> <li><p class="first"><strong>Comparing things that cannot directly be compared</strong> is a closely related problem. As an example the Euro might have been a mistake for Germany but that does not mean that the Euro was not a gain for Austria. We were pegged to the Mark before, so for us not much changed. If anything the situation improved because we're an export nation and our export partners are other European countries and if they also use the Euro they cannot harm our exports by devaluing their own currencies.</p> <p>It's very hard to compare countries because they are structured so fundamentally differently, yet people will still do it in conversations. Switzerland is heralded as the great example of continental Europe in Austria but it's so specific because of its history that it's incredibly hard to copy or imitate.</p> </li> <li><p class="first"><strong>Not being able to consider the other side</strong>. I am shocked sometimes what people here in Austria think the US is like. 
The idea that both Europeans and Americans might have very similar fears or hopes for TTIP for instance does not seem to exist here.</p> </li> <li><p class="first"><strong>Fear of change</strong>. I think this is a typical Austrian problem but to a smaller extent it probably exists elsewhere too. Everything new is torpedoed until it cannot be avoided any more because every single other country already did it before. That applies to smoking bans as much as to the embrace of credit cards, online services, acceptance of homosexuality, Sunday shopping, flexible working hours and much more.</p> </li> <li><p class="first"><strong>Broad categorization</strong>. I think Austrians are masters at giving good/bad labels to large masses of people based on some categorization instead of considering the individual. Refugees are either good or bad, the industry is good or bad, corporations are good or bad, immigrants are good or bad. That individually a refugee could be good or bad is impossible to comprehend in the general discourse and if someone does bring it up, it often gets dismissed as an outlier.</p> </li> <li><p class="first"><strong>Inability to give credit</strong>. This is particularly a problem in Austrian politics. It's one party against the other and never ever would a ruling party give an opposition party credit or the other way round. Likewise social democratic voters would never give conservative parties credit for something or the other way round.</p> </li> </ul> </div> <div class="section" id="the-root-causes"> <h2>The Root Causes</h2> <p>But what causes this behavior? I think Austria's history has a lot to do with it. In recent history there were conservatives versus social democrats. This was combined with the fact that after the war Austria emerged not only as a loser but also as one with a lot of baggage, due to its support of national socialism and the complicated way it dealt with that after the war. As such the population was always split in two on this level. 
However they could unite at least somewhat by voting for one of the two large centrist parties. Because the country was doing really well, there was no reason to reevaluate this.</p> <p>However when disaster struck this rift became bigger instead of smaller and particularly with this upcoming presidential election only the most extreme candidates made it into the run-off. Voters did not vote for people they believed in as much as they voted tactically against the predictions. This has now led to one of the ugliest pre-elections I have seen.</p> <p>Politics are no longer about doing the right thing but about defending principles, even if they are completely unfounded. Even though everybody says they have the best for Austria in mind everyone is so stuck to their own opinion that not a single meter of compromise can be achieved. Newspapers paint scary pictures of the different outcomes of the election, how the country will be torn, how one candidate would mean European sanctions and how the other candidate would mean the end of a functioning society.</p> </div> <div class="section" id="a-path-forward"> <h2>A Path Forward</h2> <p>There are clearly many things wrong in this country but so it is everywhere else. We're not alone with the changes in the world and we cannot fall back to local solutions for these problems. But likewise we cannot pretend that problems don't exist. This behavior of ruling parties has helped the rise of the populists. It does not help to pretend that immigration without integration does not contribute to problems in society. We need a more honest approach with more talking to each other.</p> <p>Just a few days ago we got <a class="reference external" href="https://en.wikipedia.org/wiki/Christian_Kern">a new chancellor</a> and he has indicated that he wants to end the course of confrontation his predecessors had. This has been supported by all parties other than the right wing FPOe. 
I hope they reconsider and also want to constructively work together with the rest of the government to lead the country forward and to restore a positive way of thinking rather than the fear that has been going around for the last few years.</p> <p>This however is not a problem that needs to be solved in government. This is a problem that we as people in that country have and we need to talk to each other more. If we talk more to each other I hope it becomes clear that we share many core values, we just don't always agree on all of them.</p> <p>And to my friends in Austria: please vote. But more than that: please accept that if the outcome is not what you wanted, it does not mean the end of the country as you know it.</p> </div> A Europe For Our Children http://lucumr.pocoo.org/2016/5/18/for-our-children 2016-05-18T00:00:00Z <p>Like everybody else this week <a class="reference external" href="https://www.getsentry.com/">we</a> had fun with <a class="reference external" href="http://www.haneycodes.net/npm-left-pad-have-we-forgotten-how-to-program/">the left-pad disaster</a>. We're from the Python community and our exposure to the node ecosystem is primarily for the client side. We're big fans of the ecosystem that develops around React and as such quite a bit of our daily workflow involves npm.</p> <p>What frustrated me personally about this conversation that took place over the internets over the last few days however has nothing to do with npm, the guy who deleted his packages, any potential trademark disputes or the supposed inability of the JavaScript community to write functions to pad strings. 
It has more to do with how the ecosystem evolving around npm has created the most dangerous and irresponsible environment which in many ways leaves me scared.</p> <p>My opinion very quickly went from “<a class="reference external" href="https://twitter.com/mitsuhiko/status/712429716356124673">Oh that's funny</a>” to “<a class="reference external" href="https://twitter.com/mitsuhiko/status/712430645671280640">This concerns me</a>”.</p> <div class="section" id="dependency-explosion"> <h2>Dependency Explosion</h2> <p>When the &quot;left-pad&quot; disaster struck I had a brief look at Sentry's dependency tree. I should probably have done that before but for as long as things work you don't really tend to do that. At the time of writing we have 39 dependencies in our <tt class="docutils literal">package.json</tt>. These dependencies are strongly vetted in the sense that we do not include anything there we did not investigate properly. What we cannot do, however, is also investigate every single dependency there is. The reason for this is how these node dependencies explode. While we have 39 direct dependencies, we have more than a thousand dependencies in total as it turns out.</p> <p>To give you a comparison: the Sentry backend (Sentry server) has 45 direct dependencies. If you resolve all dependencies and install them as well you end up with a total of 65 packages which is significantly fewer. We only get a total of 20 packages over what we depend on ourselves. The typical Python project would be similar. For instance the Flask framework depends on three (soon to be four with Click added) other packages: Werkzeug, Jinja2 and itsdangerous. Jinja2 additionally depends on MarkupSafe. 
All of those packages are written by the same author, however, and are merely split by rough responsibilities.</p> <p>Why is that important?</p> <ul class="simple"> <li>dependencies incur cost.</li> <li>every dependency is a liability.</li> </ul> </div> <div class="section" id="the-cost-of-dependencies"> <h2>The Cost of Dependencies</h2> <p>Let's talk about the cost of dependencies first. There are a few costs associated with every dependency and most of you who have been programming for a few years will have encountered this.</p> <p>The most obvious cost is that packages need to be downloaded from somewhere. This corresponds to direct cost. The most shocking example I encountered for this is the <a class="reference external" href="https://www.npmjs.com/package/isarray">isarray</a> npm package. It's currently being downloaded just short of 19 million times a month from npm. The entire contents of that package can fit into a single line:</p> <div class="highlight"><pre><span></span><span class="nx">module</span><span class="p">.</span><span class="nx">exports</span> <span class="o">=</span> <span class="nb">Array</span><span class="p">.</span><span class="nx">isArray</span> <span class="o">||</span> <span class="kd">function</span><span class="p">(</span><span class="nx">a</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="p">{}.</span><span class="nx">toString</span><span class="p">.</span><span class="nx">call</span><span class="p">(</span><span class="nx">a</span><span class="p">)</span> <span class="o">==</span> <span class="s1">&#39;[object Array]&#39;</span> <span class="p">}</span> </pre></div> <p>However in addition to this stuff there is a bunch of extra content in it. You actually end up downloading a 2.5KB tarball because of all the extra metadata, readme, license file, travis config, unittests and makefile. On top of that npm adds 6KB for its own metadata. Let's round it to 8KB that need to be downloaded. 
Multiplied by the total number of downloads last month the node community downloaded about 140GB worth of isarray. Measured by size, that's half of Flask's monthly downloads.</p> <p>The footprint of Sentry's server component is big when you add up all the dependencies. Yet the entire installation of Sentry from PyPI takes about 30 seconds including compiling lxml. Installing the over 1000 dependencies for the UI though takes I think about 5 minutes even though you end up with a fraction of the code afterwards. Also the further you are away from the npm CDN node the higher the price you pay for the network roundtrips. I threw away my node cache for fun and ran npm install on Sentry. Takes about 4.5 minutes. And that's with good latency to npm, on an above average network connection and a top of the line MacBook Pro with an SSD. I don't want to know what the experience is for people on unreliable network connections. Afterwards I end up with 165MB in <cite>node_modules</cite>. For comparison the entirety of Sentry's backend dependencies on the file system and all metadata is 60MB.</p> <p>When we have a thousand different dependencies we have a thousand different licenses and copyright files. Really makes me wonder what the license screen of a node powered desktop application would look like. But it's not only a thousand licenses, it's also a huge number of independent developers.</p> </div> <div class="section" id="trust-and-auditing"> <h2>Trust and Auditing</h2> <p>This leads me to what my actual issue with micro-dependencies is: we do not have trust solved. Every once in a while people will bring up how we all would be better off if we PGP signed our Python packages. I think what a lot of people miss in the process is that signatures were never a technical problem but a trust and scaling problem.</p> <p>I want to give you a practical example of what I mean by this. Say you build a program based on the Flask framework. 
You pull in a total of 4-5 dependencies for Flask alone which are all signed off by me. The attack vectors to get untrusted code into Flask are:</p> <ul class="simple"> <li>get a backdoor into a pull request and get it merged</li> <li>steal my credentials to PyPI and publish a new release with a backdoor</li> <li>put a backdoor into one of my dependencies</li> </ul> <p>All of those attack vectors I cover. I use my own software and I monitor what releases are on PyPI, which is also the only place to install my software from. I 2FA all my logins where possible, I use long randomly generated passwords where I cannot etc. None of my libraries use a dependency I do not trust the developer of. In essence if you use Flask you only need to trust me to not be malicious or idiotic. Generally by vetting me as a person (or maybe at a later point an organization that releases my libraries) you can be reasonably sure that what you install is what you expect and not something dangerous. If you develop large scale Python applications you can do this for all your dependencies and you end up with a reasonably short list. More than that. Because Python's import system is very limited you end up with only one version of each library so when you want to go into detail and sign off on releases you only need to do it once.</p> <p>Back to Sentry's use of npm. It turns out we have four different versions of the same query string library because of different version pinning by different libraries. Fun.</p> <p>Those dependencies can easily end up being high value targets because of how few people know about them. juliangruber's &quot;isarray&quot; has 15 stars on GitHub and only two people watch the repository. It's downloaded 18 million times a month. Sentry depends on it 20 times. 14 times it's a pin for <tt class="docutils literal">0.0.1</tt>, once it's a pin for <tt class="docutils literal">^1.0.0</tt> and 5 times for <tt class="docutils literal">~1.0.0</tt>. 
Any pin for anything other than a strict version match is a disaster waiting to happen if someone managed to push out a point release for it by stealing juliangruber's credentials.</p> <p>Now one could argue that the same problem applies if people hack my account and push out a new Flask release. But I can promise you I will notice a release from one of my ~5 libraries because a) I monitor those packages and b) other people would notice a release. I doubt people would notice a new isarray release. Yet <tt class="docutils literal">isarray</tt> is not sandboxed and runs with the same rights as the rest of the code you have.</p> <p>For instance sindresorhus <a class="reference external" href="https://www.npmjs.com/~sindresorhus">maintains 827 npm packages</a>. Most of them are probably one-liners. I have no idea how good his opsec is, but my assumption is that it's significantly harder for him to ensure that all of those are actually his releases than it is for me as I only have to look over a handful.</p> </div> <div class="section" id="signatures"> <h2>Signatures</h2> <p>A common claim is that package signatures would solve a lot of those issues, but at the end of the day because of the trust we get from PyPI and npm we get very little extra security from a package signature compared to just trusting the username/password auth on package publish.</p> <p>Why package signatures are not the holy grail was <a class="reference external" href="https://caremad.io/2013/07/packaging-signing-not-holy-grail/">covered by Donald Stufft</a> aka Mr PyPI. You should definitely read that since he's describing the overarching issue much better than I could ever do.</p> </div> <div class="section" id="future-of-micro-dependencies"> <h2>Future of Micro-Dependencies</h2> <p>To be perfectly honest: I'm legitimately scared about the integrity of node's ecosystem and this worry does not go away. 
Among other things I'm using keybase and keybase uses unpinned node libraries left and right. keybase has 225 node dependencies from a quick look. Among those are many partially pinned one-liner libraries for which it would be easy enough to roll out a backdoored update if one gets hold of the credentials.</p> <p><em>Update: it has been pointed out that keybase shrinkwrapped in the node client and that the new client is written in Go.</em> <a class="reference external" href="https://twitter.com/maxtaco/status/713037656255557632">Source</a></p> <p>If micro-dependencies want to have a future then something must change in npm. Maybe they would have to get a specific tag so that the system can automatically run an analysis to spot unexpected updates. Probably they should require a CC0 license to simplify copyright dialogs etc.</p> <p>But as it stands right now I feel like this entire thing is a huge disaster waiting to happen and if you are not using node shrinkwrap yet you better get started quickly.</p> </div> Micropackages and Open Source Trust Scaling http://lucumr.pocoo.org/2016/3/24/open-source-trust-scaling 2016-03-24T00:00:00Z http://lucumr.pocoo.org/2016/3/24/open-source-trust-scaling Armin Ronacher's Thoughts and Writings <p>The longer I'm programming and creating software, the more I notice that I build a lot of stuff that requires maintenance even though it should not. In particular a topic that just keeps annoying me is how quickly technology moves forward and how much effort it is to maintain older code that still exists but now stands on ancient foundations.</p> <p>This is not a new discovery mind you. 
This blog you're reading started out as a Django application many, many years ago; made a transition to WordPress because I could not be bothered with updating Django; and then turned into two different static site generators because I did not want to bother with making database updates and rather wanted to track my content in a git repository.</p> <p>I like static website generators quite a bit. As everything needs a website these days, it's impossible to escape the work of creating one. For programmers it's possible to get away with building something with static website generators like Jekyll, Hexo, Hugo, Pelican, Hyde, Brunch, Middleman, Harp, Expose, …</p> <p>As you can see the list of tools available is endless. Unfortunately though these tools are all aimed at programmers and it's very hard to use them as someone without programming experience. Worse though: many of them are clones of each other just written in different programming languages with very similar designs. There is very little innovation in that space and that's a bit unfortunate because I like the flexibility I get from frameworks like Flask at times.</p> <div class="section" id="so-i-built-my-own"> <h2>So I Built My Own</h2> <p>This is by far not the first time I built a static website generator but I hope it will be the last time. This one however is different from any project I built before. The reason it exists is quite frankly that it's impossible to escape family duties. For me that means helping out with the website of my parents. I knew that I did not want that to be WordPress or something that needs security updates so about two years ago I started to investigate what options there are.</p> <p>After a ton of toying around I ended up using <a class="reference external" href="http://pythonhosted.org/Frozen-Flask/">Frozen-Flask</a> for that project. It was neat because it allowed me to structure the website exactly like I wanted. 
However it also meant that whenever text started to change I needed to spend time on it. Thus I had to investigate CMS solutions again. Countless weekends were wasted trying to make WordPress work again and looking at Statamic. However I found them quite a bit more complex to customize than what I was used to with Frozen-Flask and they did not really fit the format at all. Especially WordPress feels much more like a blog engine than a CMS.</p> <p>Finally I decided to sit down and build something completely different: a content management system that uses flat files as source files like most other systems, but it has a locally hosted admin panel that a non programmer can use. You install the application, double click on the project, a browser opens and you can edit the pages. It builds in the background into static HTML files and there is a publish button to ship it up to a server. For collaboration one can use Dropbox.</p> </div> <div class="section" id="enter-lektor"> <h2>Enter Lektor</h2> <p>I called this system Lektor and Open Sourced it initially a few months ago after not having cared about it in a year or so. However I had another run-in with a project which was the Sentry documentation. Sentry uses Sphinx for the documentation and customizing the docs for what we had in mind there turned out to be a complete waste of time and sanity. 
While Lektor is currently not in a position where it could replace Sphinx for Sentry it gave me enough motivation to hack on it again on weekends.</p> <p>So I figured I might retry Open Sourcing it and made a website for it with documentation and cleaned up some bad stuff in it.</p> <p>Here is what it looks like when you open up the admin panel:</p> <img alt="https://raw.githubusercontent.com/lektor/lektor-archive/master/screenshots/admin.png" src="https://raw.githubusercontent.com/lektor/lektor-archive/master/screenshots/admin.png" style="width: 100%;" /> </div> <div class="section" id="lektor-is-a-framework"> <h2>Lektor is a Framework</h2> <p>But what makes Lektor so much fun to work with is that Lektor is (while very opinionated) very, very flexible. It takes a lot of inspiration from ORMs like Django's. Instead of there being a &quot;blog component&quot; you can model your own blog posts and render them with the templates you want to use. There is not a single built-in template that you have to use. 
The only thing it gives you is a quickstart that sets up the folders and copies default minimalistic templates over.</p> <p>As an example, here is what a blog index template looks like:</p> <div class="highlight"><pre><span></span><span class="cp">{%</span> <span class="k">extends</span> <span class="s2">&quot;blog_layout.html&quot;</span> <span class="cp">%}</span>
<span class="cp">{%</span> <span class="k">from</span> <span class="s2">&quot;macros/pagination.html&quot;</span> <span class="k">import</span> <span class="nv">render_pagination</span> <span class="cp">%}</span>
<span class="cp">{%</span> <span class="k">block</span> <span class="nv">title</span> <span class="cp">%}</span>My Blog<span class="cp">{%</span> <span class="k">endblock</span> <span class="cp">%}</span>
<span class="cp">{%</span> <span class="k">block</span> <span class="nv">body</span> <span class="cp">%}</span>
  <span class="p">&lt;</span><span class="nt">h1</span><span class="p">&gt;</span>My Blog<span class="p">&lt;/</span><span class="nt">h1</span><span class="p">&gt;</span>

  <span class="p">&lt;</span><span class="nt">ul</span> <span class="na">class</span><span class="o">=</span><span class="s">&quot;blog-index&quot;</span><span class="p">&gt;</span>
  <span class="cp">{%</span> <span class="k">for</span> <span class="nv">post</span> <span class="k">in</span> <span class="nv">this.pagination.items</span> <span class="cp">%}</span>
    <span class="p">&lt;</span><span class="nt">li</span><span class="p">&gt;</span>
      <span class="p">&lt;</span><span class="nt">a</span> <span class="na">href</span><span class="o">=</span><span class="s">&quot;</span><span class="cp">{{</span> <span class="nv">post</span><span class="o">|</span><span class="nf">url</span> <span class="cp">}}</span><span class="s">&quot;</span><span class="p">&gt;</span><span class="cp">{{</span> <span class="nv">post.title</span> <span class="cp">}}</span><span class="p">&lt;/</span><span class="nt">a</span><span class="p">&gt;</span> — by <span class="cp">{{</span> <span class="nv">post.author</span> <span class="cp">}}</span> on <span class="cp">{{</span> <span class="nv">post.pub_date</span><span class="o">|</span><span class="nf">dateformat</span> <span class="cp">}}</span>
  <span class="cp">{%</span> <span class="k">endfor</span> <span class="cp">%}</span>
  <span class="p">&lt;/</span><span class="nt">ul</span><span class="p">&gt;</span>

  <span class="cp">{%</span> <span class="k">if</span> <span class="nv">this.pagination.pages</span> <span class="o">&gt;</span> <span class="m">1</span> <span class="cp">%}</span>
    <span class="cp">{{</span> <span class="nv">render_pagination</span><span class="o">(</span><span class="nv">this.pagination</span><span class="o">)</span> <span class="cp">}}</span>
  <span class="cp">{%</span> <span class="k">endif</span> <span class="cp">%}</span>
<span class="cp">{%</span> <span class="k">endblock</span> <span class="cp">%}</span>
</pre></div> <p>The system understands what the blog is, that it has child records, that those records are paginated, that it can provide pagination etc. However there is nothing in there that makes it a blog in itself. It just has a very flexible ORM-inspired component that gives access to the structured files on the file system. 
Programming for Lektor feels very much like programming something with Flask or Django.</p> </div> <div class="section" id="learn-more"> <h2>Learn More</h2> <p>If you want to learn more about it, there are quite a few resources at this point:</p> <ul class="simple"> <li><a class="reference external" href="https://www.getlektor.com/">The Lektor Website</a>, with documentation and all that cool stuff.</li> <li><a class="reference external" href="https://www.getlektor.com/blog/2015/12/hello-lektor/">Introduction Blog Post</a>, with some more back story and explanations of how it works.</li> <li><a class="reference external" href="https://www.getlektor.com/docs/guides/">A Few Guides</a> on how to build blogs, portfolio websites, etc.</li> <li><a class="reference external" href="https://www.getlektor.com/docs/quickstart/">A Quickstart</a> with a screencast to show the basics.</li> <li><a class="reference external" href="https://www.getlektor.com/docs/deployment/travisci/">A Deployment Guide for Lektor + GitHub Pages</a> that shows how to put something up with the help of Travis-CI (which also includes a short screencast).</li> </ul> </div> <div class="section" id="final-words"> <h2>Final Words</h2> <p>I hope people find it useful. I know that I enjoy using it a ton and I hope it makes others enjoy it similarly. Because I run so many Open Source projects and maintenance of all of them turns out to be tricky I figured I do this better this time around. Lektor belongs to a separate org and the project does not use any resources only I have access to (other than the domain name and the server travis-CI deploys to). 
So in case people want to help out, there is no single point of failure!</p> <p>I hope I can spend some time over Christmas to do the same to my other projects and alter the bus factor of them.</p> <p>There is far too much in Lektor to be able to cover it in a single blog post so I will probably write a bit more about some of the really cool things about it in the next few weeks. Enjoy!</p> </div> Introducing Lektor — A Static File Content Management System For Python http://lucumr.pocoo.org/2015/12/21/introducing-lektor 2015-12-21T00:00:00Z http://lucumr.pocoo.org/2015/12/21/introducing-lektor Armin Ronacher's Thoughts and Writings <p>There are many terrible modules in the Python standard library, but the Python <cite>re</cite> module is not one of them. While it's old and has not been updated in many years, it's one of the best of all dynamic languages I would argue.</p> <p>What I always found interesting about that module is that Python is one of the few dynamic languages which does not have language integrated regular expression support. However while it lacks syntax and interpreter support for it, it makes up for it with one of the better designed core systems from a pure API point of view. At the same time it's very bizarre. For instance the parser is written in pure Python which has some bizarre consequences if you ever try to trace Python while importing. You will discover that 90% of your time is probably spent in one of re's support modules.</p> <div class="section" id="old-but-proven"> <h2>Old But Proven</h2> <p>The regex module in Python is really old by now and one of the constants in the standard library. Ignoring Python 3 it has not really evolved since its inception other than gaining basic unicode support at one point. 
To this date it has a broken member enumeration (have a look at what <tt class="docutils literal">dir()</tt> returns on a regex pattern object).</p> <p>However one of the nice things about it being old is that it does not change between Python versions and is very reliable. Not once did I have to adjust something because the regex module changed. Given how many regular expressions I'm writing in Python this is good news.</p> <p>One of the interesting quirks about its design is that its parser and compiler are written in Python but the matcher is written in C. This means we can pass the internal structures of the parser into the compiler to bypass the regex parsing entirely if we feel like it. Not that this is documented. But it still works.</p> <p>There are many other things however that are undocumented or badly documented about the regular expression system, so I want to give some examples of why the regex module in Python is pretty cool.</p> </div> <div class="section" id="iterative-matching"> <h2>Iterative Matching</h2> <p>The best feature of the regex system in Python is without a doubt that it's making a clear distinction between matching and searching. Something that not many other regular expression engines do. 
In particular when you perform a match you can provide an index to offset the matching but the matching itself will be anchored to that position.</p> <p>This means you can do something like this:</p> <div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">pattern</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s1">&#39;bar&#39;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">string</span> <span class="o">=</span> <span class="s1">&#39;foobar&#39;</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">pattern</span><span class="o">.</span><span class="n">match</span><span class="p">(</span><span class="n">string</span><span class="p">)</span> <span class="ow">is</span> <span class="bp">None</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">pattern</span><span class="o">.</span><span class="n">match</span><span class="p">(</span><span class="n">string</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
<span class="go">&lt;_sre.SRE_Match object at 0x103c9a510&gt;</span>
</pre></div> <p>This is immensely useful for building lexers because you can continue to use the special <tt class="docutils literal">^</tt> symbol to indicate the beginning of a line or of the entire string. We just need to increase the index to match further. 
It also means we do not have to slice up the string ourselves which saves a ton of memory allocations and string copying in the process (not that Python is particularly good at that anyways).</p> <p>In addition to matching, Python can search, which means it will skip ahead until it finds a match:</p> <div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">pattern</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s1">&#39;bar&#39;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">pattern</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="s1">&#39;foobar&#39;</span><span class="p">)</span>
<span class="go">&lt;_sre.SRE_Match object at 0x103c9a578&gt;</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">_</span><span class="o">.</span><span class="n">start</span><span class="p">()</span>
<span class="go">3</span>
</pre></div> </div> <div class="section" id="not-matching-is-also-matching"> <h2>Not Matching is also Matching</h2> <p>A particularly common problem is that the absence of a match is expensive to handle in Python. Think of writing a tokenizer for a wiki-like language (like markdown for instance). Between the tokens that indicate formatting, there is a lot of text that also needs handling. So when we match some wiki syntax, between all the tokens we care about we have more tokens which need handling. So how do we skip to those?</p> <p>One method is to compile a bunch of regular expressions into a list and to then try them one by one. 
If none matches we skip a character ahead:</p> <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">re</span>

<span class="n">rules</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">(</span><span class="s1">&#39;bold&#39;</span><span class="p">,</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s1">r&#39;\*\*&#39;</span><span class="p">)),</span>
    <span class="p">(</span><span class="s1">&#39;link&#39;</span><span class="p">,</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s1">r&#39;\[\[(.*?)\]\]&#39;</span><span class="p">)),</span>
<span class="p">]</span>

<span class="k">def</span> <span class="nf">tokenize</span><span class="p">(</span><span class="n">string</span><span class="p">):</span>
    <span class="n">pos</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">last_end</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">while</span> <span class="mi">1</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">pos</span> <span class="o">&gt;=</span> <span class="nb">len</span><span class="p">(</span><span class="n">string</span><span class="p">):</span>
            <span class="k">break</span>
        <span class="k">for</span> <span class="n">tok</span><span class="p">,</span> <span class="n">rule</span> <span class="ow">in</span> <span class="n">rules</span><span class="p">:</span>
            <span class="n">match</span> <span class="o">=</span> <span class="n">rule</span><span class="o">.</span><span class="n">match</span><span class="p">(</span><span class="n">string</span><span class="p">,</span> <span class="n">pos</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">match</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
                <span class="n">start</span><span class="p">,</span> <span class="n">end</span> <span class="o">=</span> <span class="n">match</span><span class="o">.</span><span class="n">span</span><span class="p">()</span>
                <span class="k">if</span> <span class="n">start</span> <span class="o">&gt;</span> <span class="n">last_end</span><span class="p">:</span>
                    <span class="k">yield</span> <span class="s1">&#39;text&#39;</span><span class="p">,</span> <span class="n">string</span><span class="p">[</span><span class="n">last_end</span><span class="p">:</span><span class="n">start</span><span class="p">]</span>
                <span class="k">yield</span> <span class="n">tok</span><span class="p">,</span> <span class="n">match</span><span class="o">.</span><span class="n">group</span><span class="p">()</span>
                <span class="n">last_end</span> <span class="o">=</span> <span class="n">pos</span> <span class="o">=</span> <span class="n">match</span><span class="o">.</span><span class="n">end</span><span class="p">()</span>
                <span class="k">break</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">pos</span> <span class="o">+=</span> <span class="mi">1</span>
    <span class="k">if</span> <span class="n">last_end</span> <span class="o">&lt;</span> <span class="nb">len</span><span class="p">(</span><span class="n">string</span><span class="p">):</span>
        <span class="k">yield</span> <span class="s1">&#39;text&#39;</span><span class="p">,</span> <span class="n">string</span><span class="p">[</span><span class="n">last_end</span><span class="p">:]</span>
</pre></div> <p>This is not a particularly beautiful solution, and it's also not very fast. The more mismatches we have, the slower we get as we only advance one character at a time and that loop is in interpreted Python. We also are quite inflexible at the moment in how we handle this. For each token we only get the matched text, so if groups are involved we would have to extend this code a bit.</p> <p>So is there a better method to do this? 
What if we could indicate to the regular expression engine that we want it to scan for any of a few regular expressions?</p> <p>This is where it gets interesting. Fundamentally this is what we do when we write a regular expression with sub-patterns: <tt class="docutils literal">(a|b)</tt>. This will search for either <tt class="docutils literal">a</tt> or <tt class="docutils literal">b</tt>. So we could build a humongous regular expression out of all the expressions we have, and then match for that. The downside of this is that we will eventually get super confused with all the groups involved.</p> </div> <div class="section" id="enter-the-scanner"> <h2>Enter The Scanner</h2> <p>This is where the scanner comes in. For the last 15 years or so, there has been a completely undocumented feature in the regular expression engine: the scanner. The scanner is a property of the underlying SRE pattern object with which the engine, after it has found a match, keeps matching for the next one. There even exists an <tt class="docutils literal">re.Scanner</tt> class (also undocumented) which is built on top of the SRE pattern scanner and gives this a slightly higher level interface.</p> <p>The scanner as it exists in the <tt class="docutils literal">re</tt> module is unfortunately not very useful for making the 'not matching' part faster, but looking at its source code reveals how it's implemented: on top of the SRE primitives.</p> <p>The way it works is that it accepts a list of tuples of regular expression and callback. For each match it invokes the callback with the match object and then builds a result list out of it. When we look at how it's implemented, it manually creates SRE pattern and subpattern objects internally. (Basically it builds a larger regular expression without having to parse it). 
Armed with this knowledge we can extend this:</p> <div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">sre_parse</span> <span class="kn">import</span> <span class="n">Pattern</span><span class="p">,</span> <span class="n">SubPattern</span><span class="p">,</span> <span class="n">parse</span>
<span class="kn">from</span> <span class="nn">sre_compile</span> <span class="kn">import</span> <span class="nb">compile</span> <span class="k">as</span> <span class="n">sre_compile</span>
<span class="kn">from</span> <span class="nn">sre_constants</span> <span class="kn">import</span> <span class="n">BRANCH</span><span class="p">,</span> <span class="n">SUBPATTERN</span>


<span class="k">class</span> <span class="nc">Scanner</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">rules</span><span class="p">,</span> <span class="n">flags</span><span class="o">=</span><span class="mi">0</span><span class="p">):</span>
        <span class="n">pattern</span> <span class="o">=</span> <span class="n">Pattern</span><span class="p">()</span>
        <span class="n">pattern</span><span class="o">.</span><span class="n">flags</span> <span class="o">=</span> <span class="n">flags</span>
        <span class="n">pattern</span><span class="o">.</span><span class="n">groups</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">rules</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span>

        <span class="bp">self</span><span class="o">.</span><span class="n">rules</span> <span class="o">=</span> <span class="p">[</span><span class="n">name</span> <span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">rules</span><span class="p">]</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">_scanner</span> <span class="o">=</span> <span class="n">sre_compile</span><span class="p">(</span><span class="n">SubPattern</span><span class="p">(</span><span class="n">pattern</span><span class="p">,</span> <span class="p">[</span>
            <span class="p">(</span><span class="n">BRANCH</span><span class="p">,</span> <span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="p">[</span><span class="n">SubPattern</span><span class="p">(</span><span class="n">pattern</span><span class="p">,</span> <span class="p">[</span>
                <span class="p">(</span><span class="n">SUBPATTERN</span><span class="p">,</span> <span class="p">(</span><span class="n">group</span><span class="p">,</span> <span class="n">parse</span><span class="p">(</span><span class="n">regex</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="n">pattern</span><span class="p">))),</span>
            <span class="p">])</span> <span class="k">for</span> <span class="n">group</span><span class="p">,</span> <span class="p">(</span><span class="n">_</span><span class="p">,</span> <span class="n">regex</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">rules</span><span class="p">,</span> <span class="mi">1</span><span class="p">)]))</span>
        <span class="p">]))</span><span class="o">.</span><span class="n">scanner</span>

    <span class="k">def</span> <span class="nf">scan</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">string</span><span class="p">,</span> <span class="n">skip</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
        <span class="n">sc</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_scanner</span><span class="p">(</span><span class="n">string</span><span class="p">)</span>

        <span class="n">match</span> <span class="o">=</span> <span class="bp">None</span>
        <span class="k">for</span> <span class="n">match</span> <span class="ow">in</span> <span class="nb">iter</span><span class="p">(</span><span class="n">sc</span><span class="o">.</span><span class="n">search</span> <span class="k">if</span> <span class="n">skip</span> <span class="k">else</span> <span class="n">sc</span><span class="o">.</span><span class="n">match</span><span class="p">,</span> <span class="bp">None</span><span class="p">):</span>
            <span class="k">yield</span> <span class="bp">self</span><span class="o">.</span><span class="n">rules</span><span class="p">[</span><span class="n">match</span><span class="o">.</span><span class="n">lastindex</span> <span class="o">-</span> <span class="mi">1</span><span class="p">],</span> <span class="n">match</span>

        <span class="c1"># Without skip, anything left unlexed at the end is an error.</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">skip</span> <span class="ow">and</span> <span class="p">(</span><span class="n">match</span> <span class="ow">is</span> <span class="bp">None</span> <span class="ow">or</span> <span class="n">match</span><span class="o">.</span><span class="n">end</span><span class="p">()</span> <span class="o">&lt;</span> <span class="nb">len</span><span class="p">(</span><span class="n">string</span><span class="p">)):</span>
            <span class="k">raise</span> <span class="ne">EOFError</span><span class="p">(</span><span class="mi">0</span> <span class="k">if</span> <span class="n">match</span> <span class="ow">is</span> <span class="bp">None</span> <span class="k">else</span> <span class="n">match</span><span class="o">.</span><span class="n">end</span><span class="p">())</span>
</pre></div> <p>So how do we use this? 
Like this:</p> <div class="highlight"><pre><span></span><span class="n">scanner</span> <span class="o">=</span> <span class="n">Scanner</span><span class="p">([</span>
    <span class="p">(</span><span class="s1">&#39;whitespace&#39;</span><span class="p">,</span> <span class="s1">r&#39;\s+&#39;</span><span class="p">),</span>
    <span class="p">(</span><span class="s1">&#39;plus&#39;</span><span class="p">,</span> <span class="s1">r&#39;\+&#39;</span><span class="p">),</span>
    <span class="p">(</span><span class="s1">&#39;minus&#39;</span><span class="p">,</span> <span class="s1">r&#39;\-&#39;</span><span class="p">),</span>
    <span class="p">(</span><span class="s1">&#39;mult&#39;</span><span class="p">,</span> <span class="s1">r&#39;\*&#39;</span><span class="p">),</span>
    <span class="p">(</span><span class="s1">&#39;div&#39;</span><span class="p">,</span> <span class="s1">r&#39;/&#39;</span><span class="p">),</span>
    <span class="p">(</span><span class="s1">&#39;num&#39;</span><span class="p">,</span> <span class="s1">r&#39;\d+&#39;</span><span class="p">),</span>
    <span class="p">(</span><span class="s1">&#39;paren_open&#39;</span><span class="p">,</span> <span class="s1">r&#39;\(&#39;</span><span class="p">),</span>
    <span class="p">(</span><span class="s1">&#39;paren_close&#39;</span><span class="p">,</span> <span class="s1">r&#39;\)&#39;</span><span class="p">),</span>
<span class="p">])</span>

<span class="k">for</span> <span class="n">token</span><span class="p">,</span> <span class="n">match</span> <span class="ow">in</span> <span class="n">scanner</span><span class="o">.</span><span class="n">scan</span><span class="p">(</span><span class="s1">&#39;(1 + 2) * 3&#39;</span><span class="p">):</span>
    <span class="k">print</span> <span class="p">(</span><span class="n">token</span><span class="p">,</span> <span class="n">match</span><span class="o">.</span><span class="n">group</span><span class="p">())</span>
</pre></div> <p>In this form it will raise an 
<cite>EOFError</cite> in case it cannot lex something, but if you pass <tt class="docutils literal">skip=True</tt> then it skips over unlexable parts which is perfect for building things like wiki syntax lexers.</p> </div> <div class="section" id="scanning-with-holes"> <h2>Scanning with Holes</h2> <p>When we skip, we can use <tt class="docutils literal">match.start()</tt> and <tt class="docutils literal">match.end()</tt> to figure out which parts we skipped over. So here is the first example adjusted to do exactly that:</p> <div class="highlight"><pre><span></span><span class="n">scanner</span> <span class="o">=</span> <span class="n">Scanner</span><span class="p">([</span> <span class="p">(</span><span class="s1">&#39;bold&#39;</span><span class="p">,</span> <span class="s1">r&#39;\*\*&#39;</span><span class="p">),</span> <span class="p">(</span><span class="s1">&#39;link&#39;</span><span class="p">,</span> <span class="s1">r&#39;\[\[(.*?)\]\]&#39;</span><span class="p">),</span> <span class="p">])</span> <span class="k">def</span> <span class="nf">tokenize</span><span class="p">(</span><span class="n">string</span><span class="p">):</span> <span class="n">pos</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">for</span> <span class="n">rule</span><span class="p">,</span> <span class="n">match</span> <span class="ow">in</span> <span class="n">scanner</span><span class="o">.</span><span class="n">scan</span><span class="p">(</span><span class="n">string</span><span class="p">,</span> <span class="n">skip</span><span class="o">=</span><span class="bp">True</span><span class="p">):</span> <span class="n">hole</span> <span class="o">=</span> <span class="n">string</span><span class="p">[</span><span class="n">pos</span><span class="p">:</span><span class="n">match</span><span class="o">.</span><span class="n">start</span><span class="p">()]</span> <span class="k">if</span> <span class="n">hole</span><span class="p">:</span> <span 
class="k">yield</span> <span class="s1">&#39;text&#39;</span><span class="p">,</span> <span class="n">hole</span> <span class="k">yield</span> <span class="n">rule</span><span class="p">,</span> <span class="n">match</span><span class="o">.</span><span class="n">group</span><span class="p">()</span> <span class="n">pos</span> <span class="o">=</span> <span class="n">match</span><span class="o">.</span><span class="n">end</span><span class="p">()</span> <span class="n">hole</span> <span class="o">=</span> <span class="n">string</span><span class="p">[</span><span class="n">pos</span><span class="p">:]</span> <span class="k">if</span> <span class="n">hole</span><span class="p">:</span> <span class="k">yield</span> <span class="s1">&#39;text&#39;</span><span class="p">,</span> <span class="n">hole</span> </pre></div> </div> <div class="section" id="fixing-up-groups"> <h2>Fixing up Groups</h2> <p>One annoying thing is that our group indexes are not local to our own regular expression but to the combined one. This means if you have a rule like <tt class="docutils literal">(a|b)</tt> and you want to access that group by index, it will be wrong. This would require a bit of extra engineering with a class that wraps the SRE match object with a custom one that adjusts the indexes and group names. 
If you are curious about that I made a more complex version of the above solution that implements a proper match wrapper <a class="reference external" href="https://github.com/mitsuhiko/python-regex-scanner">in a GitHub repository</a> together with some samples of what you can do with it.</p> </div> Python's Hidden Regular Expression Gems http://lucumr.pocoo.org/2015/11/18/pythons-hidden-re-gems 2015-11-18T00:00:00Z http://lucumr.pocoo.org/2015/11/18/pythons-hidden-re-gems Armin Ronacher's Thoughts and Writings <p>In the Austrian internets <a class="reference external" href="http://www.politico.eu/wp-content/uploads/2015/10/schrems-judgment.pdf">the news about the end of the safe harbor act</a> has been universally welcomed it seems. Especially by non-technical folks that see this as a big win for their privacy. Surprisingly many technical people also welcomed this ruling. And hey, if Snowden says that's a good ruling, who will argue against it?</p> <p>I'm very torn about this issue because from a purely technical point of view it is very tricky to follow the ruling while keeping to the current state of our data center environments, particularly in the light of some other rulings.</p> <p>I'm as disappointed as everybody else that government agencies are operating above what seems reasonable from a privacy point of view, but we should be careful about how this field develops. Fundamentally sharing information on the internet and the right to privacy stand in conflict with each other and the topic is a lot more complex than just demanding more privacy without considering what this means on a technical level.</p> <div class="section" id="what-was-safe-harbor"> <h2>What Was Safe Harbor?</h2> <p>The US-EU Safe Harbor laws declared US soil as a safe location for user data to fulfill the European Privacy Directive. 
In a nutshell: this was the only reason any modern internet service could keep their primary user data in the United States on services like Amazon EC2 or Heroku.</p> <p>In essence Safe Harbor was a self assessment that an American company could sign to make itself subject to the European Data Protection Directive. At least in principle. Practically very few US companies cared about privacy which is probably a big reason why we ended up in this situation right now. The second one is the NSA surveillance but I want to cover this in particular separately a bit later.</p> </div> <div class="section" id="what-changed"> <h2>What Changed?</h2> <p>Maximilian Schrems, an Austrian citizen, started an investigation into Facebook and its data deletion policies a while ago and has been engaging with the Irish authorities on that matter ever since. The Irish authorities rejected the complaint by referring to the Safe Harbor act. What changed now is that the European Court of Justice ruled the following:</p> <blockquote> <p>In today’s judgment, the Court of Justice holds that the existence of a Commission decision finding that a third country ensures an adequate level of protection of the personal data transferred cannot eliminate or even reduce the powers available to the national supervisory authorities under the Charter of Fundamental Rights of the European Union and the directive.</p> <p>[…]</p> <p><strong>For all those reasons, the Court declares the Safe Harbour Decision invalid</strong>. 
This judgment has the consequence that the Irish supervisory authority is required to examine Mr Schrems’ complaint with all due diligence and, at the conclusion of its investigation, is to decide whether, pursuant to the directive, transfer of the data of Facebook’s European subscribers to the United States should be suspended on the ground that that country does not afford an adequate level of protection of personal data.</p> </blockquote> <p>The detailed ramifications of this are a bit unclear, but if you were relying on Safe Harbor so far, you probably have to move servers now.</p> </div> <div class="section" id="why-was-safe-harbor-useful"> <h2>Why Was Safe Harbor Useful?</h2> <p>So if you take the internet three years ago (before the Ukrainian situation happened) the most common way of legally running an international internet platform as a smallish startup was to put the servers somewhere in the US and fill out the safe harbor self assessment every 12 months.</p> <p>To understand why that was a common setup you need to consider why it was chosen in the first place. The European Data Protection Directive came into effect quite a long time ago. It dates from the end of 1995 and required user data to be either stored in EFTA states or optionally in another country if it can be ensured that the same laws are upheld. This is what safe harbor did. In absence of this, all data from European citizens must be stored on European soil.</p> <p>After the Ukrainian uprising and after Crimea fell to the Russian Federation a few things changed. International sanctions were put up against Russia and Russia decided to adopt the same provision as the European Union: Russian citizens' data has to be stored on Russian servers. 
This time however without an option to get exceptions to this rule.</p> <p>It's true that the US does not yet have a provision that requires US citizen data to be stored in the States, but this is something that has been discussed in the past and it's a requirement for working with the government already. However with both Russia and Europe we now have two large international players that set the precedent and it can only get worse from here.</p> </div> <div class="section" id="privacy-vs-data-control"> <h2>Privacy vs Data Control</h2> <p>The core of the issue currently is that data is considered power and privacy is a secondary issue there. While upholding privacy is an important and necessary goal, we need to be careful to not forget that the European countries are not any better. While it's nice to blame the NSA for world wide surveillance programs, we Europeans have our own governmental agencies that act with very little supervision and especially in the UK operate with the same invasiveness as in the US.</p> <p>A European cloud provider will have to comply with local law enforcement just as much as an American cloud provider will have to comply with US federal law enforcement. The main difference is just the institutions involved.</p> <p>The motivation for the Russian government is most likely related to law enforcement over privacy. I'm almost sure they care more about keeping certain power over companies doing business in Russia to protect themselves against international sanctions than about their citizens' privacy.</p> </div> <div class="section" id="data-locality-and-personal-data"> <h2>Data Locality and Personal Data</h2> <p>So what exactly is the problem with storing European citizens' data in Europe, data of Americans in the States and the data of Russians somewhere in the Russian Federation? 
Unsurprisingly this is a very hard problem to solve if you want to allow people from those different countries to interact with each other.</p> <p>Let's take a hypothetical startup here that wants to build some sort of Facebook for climbers. They have a very niche audience but they attract users from all over the world. Users of the platform can make international friendships, upload their climbing trips, exchange messages with each other and also purchase subscriptions for &quot;pro&quot; features like extra storage.</p> <p>So let's say we want to identify Russians, Americans and Europeans to keep the data local to each of their jurisdictions. The easy part is to set up some servers in all of those countries and make them talk to each other. The harder part is to figure out which user belongs to which jurisdiction. One way would be to make users upload their passport upon account creation and determine their main data center by their citizenship. This obviously would not cover dual citizens. A Russian-American might fall into two shards on a legal basis but they would only opt into one of them. So let's ignore those outliers. Let's also ignore what happens if the citizenship of a user changes because that process is quite involved and usually takes a few years and does not happen all that commonly.</p> <p>Now that we know where users are supposed to be stored, the question is how users are supposed to interact with each other. While distributed databases exist, they are not magic. Sending information from country to country takes a lot of time so operations that affect two users from different regions will involve quite a bit of delay. It also requires that the data temporarily crosses into another region. So if an American user sends data to a Russian user, that information will have to be processed somewhere.</p> <p>The problem however is if the information is not temporarily in flux. 
For instance a message sent from Russia to America could be seen as a duplicated message that falls under both the American and the Russian jurisdiction. It gets trickier with information that cannot be directly correlated to a single user, for instance who your friends are. Social relationships can only be modelled efficiently if the data is sufficiently local. We do not have magic in computing and we are bound to the laws of physics. If your friends are on the other side of the world (which nowadays they most likely are) it becomes impossible to handle.</p> <p>Credit card processing also falls into this. Just because you are British does not mean your credit card is. Many people live in other countries and have many different bank accounts. The data inherently flows from system to system to clear the transaction. Our world is very connected nowadays and the concept of legal data locality is very much at odds with the realities of our world.</p> <p>The big cloud services are out, because they are predominantly placed in the US. Like it or not, Silicon Valley is many, many years ahead of what European companies can do. While there are some tiny cloud service providers in Europe, they barely go further than providing you with elastically priced hardware. For European startups this is a significant disadvantage compared to their American counterparts when they can no longer use American servers.</p> </div> <div class="section" id="privacy-not-data-locality"> <h2>Privacy not Data Locality</h2> <p>The case has been made that this discussion is not supposed to be about data locality but about privacy. That is correct for sure, but unfortunately data centers fall into the jurisdiction of where they are placed. 
Unless we come up with a rule where data centers are placed on international soil where the computers within them are out of governments' reach, a lot of this privacy discussion is dishonest.</p> <p>What if the bad players are the corporations and not the governments? Well in that case that was the whole point of safe harbor to begin with: to enforce stricter privacy standards on foreign corporations for European citizens.</p> </div> <div class="section" id="how-to-comply"> <h2>How to Comply?</h2> <p>Now the question is how to comply with where this is heading. These new rules are more than implementable for Facebook size corporations, but it is incredibly hard to do for small startups. It's also not quite clear what can and what cannot be done with data now. At which point data is considered personal and at which point it is not is something that differs from country to country and is in some situations not even entirely clear. For instance according to the UK DPA user relationships are personal information if they have &quot;biographical significance&quot;.</p> </div> <div class="section" id="a-disconnected-world"> <h2>A Disconnected World</h2> <p>What worries me is that we are taking a huge step back from an interconnected world where people can share information with each other, to more and more incompatible decentralization. Computer games traditionally have already enforced shards where people from different countries could not play together because of legal reasons. For instance many of my Russian friends could never play a computer game with me, because they are forced to play in their own little online world.</p> <p>Solutions will be found, and this ruling will probably have no significance for the average user. Most likely companies will ignore the ruling entirely anyways because nobody is going to prosecute anyone unless they are Facebook size. 
However that decisions of this magnitude are made without considering the technical feasibility is problematic.</p> </div> <div class="section" id="the-workaround"> <h2>The Workaround</h2> <p>For all intents and purposes nothing will really change for large companies like Facebook anyways. They will have their lawyers argue that their system cannot be implemented in a way that complies with forcing data to live in Europe and as such will refer to Article 26 of the Data Protection Directive, which permits transferring personal data to an untrusted third country if either the user has given consent to this or the transfer is technically necessary for fulfilling the contract between user and service provider. The TOS will change, the lawyers will argue and in the end the only ones who will really have to pick up the shards are small scale companies which are already overwhelmed by all the prior rules.</p> <p>Today does not seem to be a good day for small cloud service providers.</p> </div> The End of Safe Harbor and a Scary Path Forward http://lucumr.pocoo.org/2015/10/6/end-of-safe-harbor 2015-10-06T00:00:00Z http://lucumr.pocoo.org/2015/10/6/end-of-safe-harbor Armin Ronacher's Thoughts and Writings <p>I have a weird obsession with payment systems. They fascinate me. I find it very satisfying to make a credit card transaction and to get a text message confirming the purchase on my phone a second afterwards. As someone obsessed with networks, scalability and user experience I find this a very interesting field even though it's embedded in probably the least agile and most regulated industry. But not just the technology is interesting, also the fraud aspect is. Fraud prevention is an equally interesting topic to ponder about.</p> <p>What makes fraud in payments so interesting is that there are many different payment protocols that exist throughout the world and your credit card is valid with almost all of them. 
The fraud vectors are huge and very often the only thing that keeps fraud rates down is random spot checks and common sense.</p> <p>The reason my interest got piqued again recently was Samsung Pay, particularly the MST part. MST, if you are not familiar with it, stands for magnetic secure transmission. The idea is that the phone emits a magnetic field that carries the information of track 2 on a credit card (at least in principle). What this means is that you can go to a lot of magstripe readers, hold your phone to it, and the reader thinks the card was swiped. (Assuming there are no other checks that a card is in a slot)</p> <p>From a fraud perspective this seems crazy. You scan someone's credit card, duplicate it onto your phone and off you go. Here are the results of my investigation into how this is supposed to be used securely.</p> <p>But for this we need to cover some ground.</p> <div class="section" id="a-bit-of-history"> <h2>A Bit of History</h2> <p>If we don't go too far back, the earliest forms of standardized credit card processing were based on a credit card number. The credit card number in itself is split into two parts. The first six digits are the IIN or Issuer Identification Number. It identifies the network of the card (MasterCard, AMEX, Visa, etc.) and might identify the bank within that network. The rest (the remaining 10-13 digits) are the PAN or Primary Account Number. IIN + PAN + expiration date + name of cardholder are the basic requirements for making a credit card transaction.</p> <p>However as you can guess, since all that information is on the card there is very little that actually protects a payment. That's why on most of those transactions done that way they will also ask for the signature of the cardholder. 
That signature really only plays a role if the transaction gets disputed.</p> </div> <div class="section" id="the-magstripe"> <h2>The Magstripe</h2> <p>What makes credit cards convenient for in-store purchases is that you do not need to write down numbers, instead you can &quot;swipe&quot; the card. At least you do that in the US ;) When you swipe the card, the reader reads the two tracks on the magstripe. They are almost the same with a different data density. Both tracks contain: IIN + PAN, country code, expiration date and a field for discretionary data. They also contain the service code. The service code tells the terminal how the card wants to be confirmed (does it work internationally, does it need online verification, does it need a PIN, ATM only etc.)</p> <p>Track 1 which has higher density also contains the card holder name and has a bit of extra space for the discretionary data. So if you swipe the card, you have pretty much all the info that's written on it. What's in the discretionary data we will cover later.</p> </div> <div class="section" id="transaction-types-and-security-codes"> <h2>Transaction Types and Security Codes</h2> <p>An important tool for understanding fraud and combating it is to split the one huge problem of credit card fraud into smaller sub-problems. In particular the most important split is &quot;card present&quot; or &quot;card not present&quot; (CNP) which should indicate if the physical card was present at the origin of the transaction or not. So how do you do that if the data is the same? The earliest form of trying to combat this was the addition of two security codes. They have various different names (CVC, CVV, CID) and on most cards they come in two flavors: code 1 and code 2. One is stored in the magstripe in the discretionary data field, the other is printed on the back of the card. The idea is that you can differentiate between transactions carrying no security code, or CVC1 or CVC2. 
If someone skimmed your card through a magstripe reader, they can get to all data with the exception of CVC2. If someone takes your card number via phone they won't get your CVC1.</p> <p>At this point you can already see that there are different types of transactions with different fraud parameters. If someone does not use a CVC code it does not mean that the transaction will be declined outright, but it indicates that something is fishy.</p> </div> <div class="section" id="emv"> <h2>EMV</h2> <p>EMV is the answer for all problems and has been for a long time. The reason it plays little role here is because EMV in itself is secure (bad chip implementations notwithstanding). However EMV is still not rolled out in the US and as such, there is a huge market where magstripe is still something people need to deal with. Also EMV without NFC support cannot support MST which is the topic of discussion here. We will come back to that later however.</p> </div> <div class="section" id="modern-transaction-types"> <h2>Modern Transaction Types</h2> <p>What should be clear now is that there are many different ways to make a credit card transaction. But what is that actual transaction? At one point you want your money. Whether you get your money as a merchant depends on whether the transaction was fraudulent and, if it was, whether you had a chance to detect the fraud yourself.</p> <p>At one point you need to actually try to charge the issuer of the card as a merchant. Ideally you do it as quickly as possible. If you do it at the time you swipe the card, you might directly go online and check with the card issuer if everything is in order. This happens in most terminals now where the terminal directly talks to the bank to record the transaction.</p> <p>A more evolved version of this method is to replace the magstripe with an EMV chip. 
That chip can play a challenge/response game with the payment terminal which means that each purchase is unique and skimming the data off the chip will not be any good for future transactions. That again will only work for transactions that actually use the EMV chip. If you just steal the magstripe and go to the US where all readers are magstripe, the chip will do absolutely nothing to stop you.</p> <p>Likewise for online payments many issuing banks will use 3D Secure for payment verification. The idea is that the online form for your credit card number also presents you an iframe with an extra input form by the bank. This allows a second factor to confirm the payment. For instance on my Austrian Erste Mastercard the second factor is a confirmation with a transaction code. The transaction will be declined unless I confirm the payment in the iframe with a unique token sent to my phone via SMS.</p> </div> <div class="section" id="tokenization-apple-pay-samsung-pay"> <h2>Tokenization: Apple Pay / Samsung Pay</h2> <p>In an ideal world the magstripe would no longer exist and all terminals would use the EMV chip and online transactions would require 3D Secure. However that's clearly not happening because the US seems to take bloody ages to replace its infrastructure. And not just the US. The idea to force everybody to newer and in this case kinda incompatible technologies did not work for many years, so an alternative had to appear.</p> <p>One alternative is what's often called &quot;Tokenization&quot; and oddly enough, it works by replacing the customer equipment rather than the merchant one. Instead of making all merchants upgrade their terminals to support EMV, you instead upgrade the customer's credit card to a phone.</p> <p>To understand why that's necessary you need to understand that NFC is not always NFC and in case of Samsung it might not even involve an actual RFID chip at all. 
In Europe when you use NFC for a payment the card transmits a response to a challenge like an EMV chip does. The transaction gets confirmed safely either directly by the card or in combination with the user's PIN. In either case the transaction gets confirmed through the issuer. In the United States however EMV often does not exist, so NFC has an alternative method where it transmits the MSD (magnetic stripe data) instead. Apple Pay can do that similar to how Samsung Pay can transmit the very same data via magnetic pulses or NFC.</p> <p>So how does that make anything any more secure? Because of tokenization. Remember how the credit card number is split into IIN and PAN and how the magstripe contains this extra discretionary data. The idea is that assuming the terminal is connected to the internet and verifies transactions with the issuing bank the phone can play a little trick. The bank provides the phone with a method to &quot;clone&quot; the card securely onto the phone. At this point the phone acts as a hardware token generator. Whenever it confirms a transaction it replaces the PAN with a uniquely generated one and places some extra data in the discretionary data part. Both pieces of information get transmitted to the issuing bank or TSP (token service provider, so MasterCard or Visa) where the token PAN (DPAN) gets replaced with the real PAN. The actual flow is a bit more complex than that, but in the end the transaction goes through like before.</p> </div> <div class="section" id="the-merchant-and-tokenization"> <h2>The Merchant and Tokenization</h2> <p>The important part here however is the merchant and this is where things get tricky. With Apple Pay the transaction is always done through a form of NFC. Either NFC with MSD or proper EMV NFC. It means that the merchant explicitly agrees with this form of payment and will introduce the system to the employees that accept the transactions. 
To confirm such a payment as a merchant you just make sure that the transaction is made from an iPhone and everything else &quot;should be secure&quot;. The only case of fraud is if someone managed to get a card on their phone which they were not entitled to, but that's the bank's problem because they should make that flow secure.</p> <p>The situation however is different with Samsung Pay and the reason for that is MST. As Samsung Pay works with non NFC POS terminals the question is how a merchant can tell a phone that uses tokenization properly from a fraudulent phone that just relays the magstripe tracks from a stolen card. In fact, the merchant can't really do anything there because, as far as I know, the two are indistinguishable from what is shown on the terminal. The only party that could reliably block the transaction is the issuer or TSP. This interestingly enough can be solved by supporting EMV :)</p> <p>A modern card (one that would be used with Samsung Pay) could come with magstripe and EMV and the magstripe could indicate that the card prefers the chip over swiping. In this case you could still clone the magstripe into your phone, but the transaction would be declined if it used neither tokenization nor the chip. For this to work however, all merchants need to support EMV which currently is not the case in the US.</p> </div> <div class="section" id="the-non-emv-apocalypse-of-2015"> <h2>The Non EMV Apocalypse of 2015</h2> <p>Something interesting is going to happen at the end of October 2015. The US will finally start to force merchants to upgrade to terminals that support EMV. From that point onwards any fraudulent transaction with a card that has an EMV chip, where the chip was not used for the transaction, will become the merchant's problem. 
Assuming Samsung Pay becomes widespread it could make this liability shift a bit more painful because as a merchant you can not tell a good Samsung phone from a bad Samsung phone, whereas you could probably tell an original credit card with embossed numbers from a fake card with mismatching numbers. Making your own embossed cards from all the cards you skimmed is also a lot more work than cloning a card into a phone.</p> <p>So maybe EMV will become a bigger thing as a result of Samsung Pay even if the technology in itself has some potential for magstripe abuse.</p> </div> <div class="section" id="death-of-msd"> <h2>Death of MSD</h2> <p>Interestingly enough the roll-out of EMV in the US might have some bad aspects for European travellers and others. Our cards have a very different fraud profile than American ones because domestic transactions have been done via EMV for nearly thirty years now, with the liability shift having happened more than 10 years ago. In Europe cards prefer chip and PIN for terminals and NFC is only supported for EMV transactions.</p> <p>The US terminals might use the MSD data for NFC however. So as a European customer you might see an NFC logo somewhere, but because it uses NFC MSD your European bank will decline the transaction because they only allow EMV based NFC. This remains to be seen however: right now NFC terminals in the US are still not very widespread and the liability shift has not happened yet.</p> </div> <div class="section" id="safety-of-samsung-pay"> <h2>Safety of Samsung Pay</h2> <p>So is it safe? Implemented correctly with tokenization Samsung Pay seems pretty safe.</p> <p>Will merchants like it? If they have EMV terminals, they will not have a problem with it. If they only have legacy terminals without chip support, they might become fraud magnets and they have little means to defend themselves against it.</p> <p>Will the magstripe finally die? 
Seems like magstripe found a second coming in the US thanks to tokenization, MSD NFC and maybe even Samsung Pay but most likely only as a transitional technology for EMV.</p> <p>I'm actually quite interested in if there are means of detecting a relayed magstripe track for a merchant. If someone knows, please let me know and I will amend the article to reflect that.</p> </div> Samsung Pay's MST Transactions and Merchant's Ability to Detect “Cloned” Magstripe Tracks http://lucumr.pocoo.org/2015/8/31/the-thing-about-samsung-pay 2015-08-31T00:00:00Z http://lucumr.pocoo.org/2015/8/31/the-thing-about-samsung-pay Armin Ronacher's Thoughts and Writings