Talk Notes: Wesley George on The Limits of the ORM

Wesley George is a technical lead at a startup in Canada called Clearbanc. The impetus for this talk came from his trying to use the ORM for a complex query and marveling at how slow it was. Thus, he put together this talk based on his adventures diving deeply into SQL.

An ORM is Cool, but…

ORMs shine when they remove boilerplate code. They provide decent aggregation, but generally only for single-table cases. For more complex tasks, writing complex SQL queries against a relational database allows you to create extremely powerful and performant aggregations.

Example 1: Simple Signups by Month

You might start with some code that looks like this:
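The slide’s actual code isn’t captured in these notes, but a minimal stand-in sketch (assuming Django’s stock User model, not the talk’s real schema) might look like this:

    # Hypothetical sketch, not the slide's code: count signups per month
    # by pulling every user row into Python and bucketing there.
    from collections import Counter
    from django.contrib.auth.models import User

    signups_by_month = Counter(
        user.date_joined.strftime('%Y-%m') for user in User.objects.all()
    )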

This would run extremely slowly against a sufficiently large or complex dataset, so it behooves us to dive into SQL, where we can run the computation in a much more performant way.

The WITH keyword allows us to essentially create ephemeral tables for use in a query. They are similar to subqueries, except they result in a much simpler final SELECT statement at the end.
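For illustration, a hedged sketch of the same monthly-signup aggregation using WITH, run through Django’s raw SQL cursor (assumes PostgreSQL and the default auth_user table; not the talk’s actual query):

    from django.db import connection

    QUERY = """
    WITH monthly_signups AS (
        SELECT date_trunc('month', date_joined) AS month,
               count(*) AS signups
        FROM auth_user
        GROUP BY 1
    )
    SELECT month, signups
    FROM monthly_signups
    ORDER BY month;
    """

    with connection.cursor() as cursor:
        cursor.execute(QUERY)  # the database does the heavy lifting, not Python
        rows = cursor.fetchall()

Here the final SELECT stays trivial because the WITH block already did the aggregation.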

Example 2: User Engagement Change Segmented by Campaign

For this example, we won’t even attempt to use the ORM.

Annotator’s note: Sorry, I’m not gonna type out the SQL for all this…

Concerning Efficiency and Managing Complexity

The technology inside of a relational database represents decades of computer science research. However, as your data grows into the terabyte and petabyte range, you will need to manage your complexity. Wesley describes a method and mentions a tool for this.

  • Data Warehousing is the process of storing intermediate aggregations or representations of the data on a periodic basis (hourly, daily, weekly, etc).
  • SQLAlchemy is a lower-level SQL python package, allowing a comfortable code-based medium between an ORM and raw queries.
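For a flavor of that middle ground, here is a minimal SQLAlchemy Core sketch (the connection URL, users table, and date_joined column are assumptions for illustration):

    from sqlalchemy import MetaData, Table, create_engine, func, select

    engine = create_engine('postgresql://localhost/mydb')
    users = Table('users', MetaData(), autoload_with=engine)  # reflect the table

    month = func.date_trunc('month', users.c.date_joined).label('month')
    stmt = (
        select(month, func.count().label('signups'))
        .group_by(month)
        .order_by(month)
    )

    with engine.connect() as conn:
        for row in conn.execute(stmt):
            print(row.month, row.signups)

You get composable, SQL-shaped queries in Python without hand-concatenating strings.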

Talk Notes: Andrew Godwin on Django Channels for the Real-Time Web

Andrew Godwin is a Django core developer who works at Eventbrite. In this talk he covers Django for the real-time web, by way of his project Django Channels.

The Traditional Method

The "old" way of sending and receiving requests, pre-WebSockets and Django Channels
The “old” way of sending and receiving requests, pre-WebSockets and Django Channels

Send a request, get a response. Even with HTTP/2, you can still treat your code the same way as in a WSGI-style request.

However, with WebSockets, things change. You can send without receiving, receive without sending, leave sockets open for hours, whatever. It’s the “wild wild west.” In Andrew’s mind, the way Django works with WebSockets should follow the standard Django contract: easy to use, secure by default, hard to break / deadlock, Python 2 & 3 compatible, and optional.

But… There Are Problems.

Python is… not good with concurrency, and Django is not asynchronous. At first glance, it might seem like the solution is something like message-passing via WSGI. However, WebSockets also have the additional features of events and broadcasting, which would require cross-thread or even cross process communication.

Enter Django Channels: Concepts

Channels sits between your user interface and Django and provides an asynchronous layer utilizing WebSockets.

Django Channels is a WebSockets package based on a few concepts.

  • Channels: named FIFO task queues
  • Groups: named sets of channels with add/remove/send operations
  • Messages: representations for HTTP and WebSocket operations.
This is the new way: send a message and receive zero or more messages. Views become Consumers. Messages can also go to Sockets or Workers.

With these concepts, you get 5 simple API endpoint operations:

  1. send('channel_name', {'ponies': True})
  2. receive_many(['channel_one', 'channel_two'])
  3. group_add('group_name', 'channel_name')
  4. group_discard('group_name', 'channel_name')
  5. send_group('group_name', {'ponies': True})
Much like Consumers parallel views, routing.py parallels urls.py.

Example: Live Blog

Suppose you want a blog where the readers can get new blog posts as they are published, without refreshing.

  1. The client opens a websocket when the page is opened, and that websocket is added to a group.
  2. When the BlogPost model is saved, we send the post to that group

Fully working example available on GitHub.
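Roughly, those two steps in code, using the Channels 1.x-era API (group name, model fields, and file layout are my guesses, not the talk’s exact code):

    # consumers.py -- step 1: each new socket joins the 'liveblog' group
    from channels import Group

    def ws_connect(message):
        Group('liveblog').add(message.reply_channel)

    def ws_disconnect(message):
        Group('liveblog').discard(message.reply_channel)

    # models.py -- step 2: saving a post pushes it to everyone in the group
    import json
    from django.db import models
    from channels import Group

    class BlogPost(models.Model):
        title = models.CharField(max_length=200)
        body = models.TextField()

        def save(self, *args, **kwargs):
            super().save(*args, **kwargs)
            Group('liveblog').send({'text': json.dumps({'title': self.title})})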

Example: Chat

The simplest chat: a person types a message, everybody gets it. This example is nearly identical to the above example but instead of using the save method on a model, we simply use the ws_receive method:

  1. The client opens a websocket when the page is opened, and that websocket is added to a group.
  2. When a chat message arrives via ws_receive, we send it to that group.

Fully working example available on GitHub.
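A correspondingly small sketch, with the same caveats as the live blog example, swapping the model save hook for ws_receive:

    from channels import Group

    def ws_connect(message):
        Group('chat').add(message.reply_channel)

    def ws_receive(message):
        # one person types a message, everybody in the group gets it
        Group('chat').send({'text': message.content['text']})

    def ws_disconnect(message):
        Group('chat').discard(message.reply_channel)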

Other Cool Stuff!

The ASGI Specification

Now that there’s a WebSocket medium for Django, we need a standard way of structuring channels and messages. Enter ASGI. This is an API specification for channel layer backends, as well as a message format for HTTP and WebSockets. ASGI is perfectly compatible with WSGI, and a number of other technologies as well.
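For a rough sense of the format (an illustrative shape recalled from the Channels 1.x era, not a quoted spec excerpt), a WebSocket receive message is a plain dict along these lines:

    # Illustrative only -- field set abbreviated, not an authoritative excerpt
    message = {
        'reply_channel': 'websocket.send!a1b2c3',  # where responses get sent
        'path': '/chat/',                          # URL path of this socket
        'text': 'hello',                           # the text frame payload
    }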

Scaling?

Interface servers scale horizontally, as do worker servers. Thus, the channel layer has to as well. Luckily Django Channels has consistent hash sharding built in. Andrew talks about how it will be part of Django soon, but it’s not quite mature enough yet.

Irene Chen’s Beginner’s Guide to Deep Learning

Deep Learning, the endeavor to make computers as smart as humans, and even its simpler cousin Machine Learning, can be incredibly overwhelming and daunting to learn. There’s either too much code or too much math. Luckily, Irene Chen put this talk together to teach beginners the fundamentals of deep learning.

If these Talk Notes are useful to you, become a patron!

Deep Learning: Why Now?

Neural networks have been around since the 1970s. Why the resurgence now? Three factors provide a new foundation for modern Deep Learning.

  1. Big data (aka the “fuel” of the rocket ship)
  2. Big processing power
  3. Robust neural networks (aka the “engine” of the rocket ship)

Because of this “perfect storm,” we are seeing a tremendous number of breakthroughs in ML / DL / AI (e.g., AlphaGo).

Neural Networks: The Avocado Classifier

Neurons are the cells that comprise the human brain. Synapses connect neurons together. Computer scientists have modeled this with a simplified graph called a neural network. In this graph, neurons are modeled as nodes, and synapses are modeled as edges.

Simple graph of a neural network, a critical data structure in deep learning

Note that some arrows are thicker than others – each edge has a weight, which is a measurement of the importance of the data passing through it. This is represented mathematically via a sigmoid function.

The hidden layers are everything between the input and output nodes.

Very simple example: Given an avocado and its height, “squishiness”, and the color of its skin, can you determine whether or not it is perfectly ripe? 

Forward propagation is the standard execution model of the neural network, from input to output. Likewise, backward propagation (or “backpropagation”) works from output back to input, adjusting the weights and values of the nodes and edges to improve your model.
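To make forward propagation concrete, here is a tiny numpy sketch of the avocado classifier (features, layer sizes, and weights are all made up):

    import numpy as np

    def sigmoid(z):
        # squashes any real number into the range (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    # made-up avocado features: height (cm), squishiness (0-1), skin darkness (0-1)
    x = np.array([7.5, 0.6, 0.8])

    # randomly initialized weights: 3 inputs -> 4 hidden nodes -> 1 output
    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(4, 3))
    W2 = rng.normal(size=(1, 4))

    # forward propagation: each layer is a weighted sum pushed through sigmoid
    hidden = sigmoid(W1 @ x)
    ripeness = sigmoid(W2 @ hidden)
    print(ripeness)  # a probability-like score that the avocado is ripe

Backpropagation would then compare this output against the true label and nudge W1 and W2 to shrink the error.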

Mathematical constructs used in deep learning

To reduce errors, “tune” your parameters experimentally or by using the above math. Convergence is when your error rate is “good enough” based on the number of iterations.

Deep Learning Tools and Communities

  • Scikit-learn – very beginner friendly, contains a number of ML algorithms
  • Caffe – UC Berkeley’s computer vision library. Contains “Zoo,” a group of pre-trained models
  • Theano – Efficient GPU powered math
  • IPython Notebook (Jupyter) – great for interactive coding
  • Kaggle – Casual ML cooperative with contests and such

If these Talk Notes are useful to you, become a patron!

Guido van Rossum on the Python Language at PyCon 2016

25 years ago, Guido van Rossum released the Python programming language. Since then he’s been the “benevolent dictator” of the language. In this keynote talk from PyCon 2016, Guido (pronounced Gee-doh) walks us through a number of topics around this amazing and amazingly popular language.

If these Talk Notes are useful to you, become a patron!

The “State of Python”

  • Python 2.7. Until 2020 only security fixes, support for new OS versions, maybe bug fixes. See http://pythonclock.org.
  • Python 3.5
    • Native coroutine syntax with async / await. (PEP 492)
    • Matrix multiply: A@B, __matmul__
    • Unpacking Syntax: x = [1, 2, *y]
    • Bytes formatting is back: b"Hello %s, %d" % (b"world", 42)
    • gradual typing support: def gcd(a: int, b: int) (PEP 484)
  • Python 3.6 (code freeze in September 2016, released around Christmas)
    • f-strings: x = "world"; y = "42"; print(f"Hello {x}, {y}")
    • Underscores in numbers: 100_000_000
    • __fspath__ protocol, os.fspath() (for pathlib)
    • secrets.py: randbits(), token_hex(), etc…
    • Local time disambiguation: datetime(..., fold=1)
    • Moving to GitHub!
  • Beyond Python 3.6 is only speculation, thus not included here. 
    • However, one interesting thing is that Larry Hastings is working on removing the GIL.
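A tiny, runnable demo combining a few of the features above (Python 3.6 or later; the gcd signature echoes the PEP 484 example from the slides):

    import asyncio

    def gcd(a: int, b: int) -> int:  # gradual typing (PEP 484)
        while b:
            a, b = b, a % b
        return a

    async def main() -> None:  # native coroutine syntax (PEP 492)
        big = 100_000_000      # underscores in numbers (3.6)
        nums = [big, 48, *[18, 30]]  # unpacking syntax (3.5)
        print(f"gcd is {gcd(nums[1], nums[2])}")  # f-strings (3.6)

    # the 3.6-era way to run a coroutine (asyncio.run arrived later, in 3.7)
    asyncio.get_event_loop().run_until_complete(main())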

What Else?

“Femail” Core developers

Guido still gets angry emails about this typo from last year’s talk, and as of this talk there still aren’t any female core devs 🙁

An Inspirational Story by Guido van Rossum

I won’t annotate this as it’s a very personal story, but you can watch it here.

If these Talk Notes are useful to you, become a patron!

Structured Data from Unstructured Text :: PyCon 2016

If these TalkNotes are useful to you, become a patron!

Talk by Van Lindberg (subbing for Smitha Milli)

Note: Since Van was subbing for Smitha, his talk was understandably much more ad-hoc than hers would have been. Not his fault! Technical difficulties made some of the examples non-workable but I hope to fill in the gaps with my own knowledge.

There is an explosion of data, and most of it is text. But it’s not just about text, it’s about language. Unstructured text is ambiguous; structured data is not.

Get the Data

You can get data from CSV files, Word docs, web pages, emails, almost anywhere.

Clean the Data

Remove any text that may confuse your program or that is meaningless to a computer – non-ASCII special characters, extra whitespace / newline characters, etc.
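A minimal sketch of that kind of cleanup using only the standard library:

    import re

    def clean(text):
        # drop non-ASCII characters, then collapse runs of whitespace/newlines
        text = text.encode('ascii', errors='ignore').decode('ascii')
        return re.sub(r'\s+', ' ', text).strip()

    print(clean('Caf\u00e9  menu,\nupdated!'))  # -> 'Caf menu, updated!'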

Tokenize

Tokenizing is the process of breaking text into meaningful chunks. For example, “The Queen” means something very different from “the” and “queen” taken separately. Tokenizers help you figure out the smallest piece of information to parse by – aka your “unit” of measurement. It may be one word, it may not.

  • single character – You can analyze text based on single characters. See this fun demo.
  • unigram, or word tokenizer – tokens that are exactly one word in length
  • bigram, n-gram, etc – tokens that are two or more words in length

Using Python and NLTK, you can use the word_tokenize function to achieve this. Note that tokenizers keep certain punctuation, like periods, because a period changes the meaning of the words before and after it.
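For example (assuming NLTK is installed and its ‘punkt’ tokenizer data has been downloaded):

    from nltk.tokenize import word_tokenize

    print(word_tokenize("The Queen visited Mr. Smith."))
    # ['The', 'Queen', 'visited', 'Mr.', 'Smith', '.']

Note how the abbreviation “Mr.” keeps its period while the sentence-final period becomes its own token.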

Diagram

Once you have your words, you can reduce ambiguity, starting with the meaning of the words. One way to start doing this is via sentence diagramming.

Sample Sentence Diagram

Google just released SyntaxNet, which uses TensorFlow. The sentence diagramming model is called Parsey McParseface, which approaches the accuracy of the best human linguists (roughly 97%).

However, this still doesn’t give us the context we need – specifically how each sentence relates to other sentences or how the words in the sentence relate to terms in the real world.

Extracting Further Meaning

Instead of thinking of a sentence as a syntactical collection of words, we think of the collection of words as a collection of features. In this context, a feature is the presence or absence of a word in a certain context.

Corpus – any collection of documents.

Dictionary – Maps words (as 1×N matrices) to vectors, allowing you to do matrix math and extract mathematical meaning from words. Example: King – man + woman = queen.

Types of Vector Analysis:

TF-IDF – Term Frequency / Inverse Document Frequency: how often a feature appears in a document, discounted by how often it appears across the whole corpus. In the standard formulation, tfidf(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. For example, “and” will not be important, but something like “docker” will be more important. This returns a list of floats showing the importance of each word.

LSI – Latent Semantic Indexing. Figuring out “what a document is about” by Principal Component Analysis across vectors. This will show you “Queen Elizabeth” is closely associated with “ships,” “crown,” “England,” etc.

LDA – Latent Dirichlet Allocation. Only mentioned briefly – this is a way of discovering topics in your document.

Models created using the above techniques: word2vec, lda2vec.
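A hedged gensim sketch of the Dictionary -> corpus -> TF-IDF -> LSI pipeline described above (the toy documents are made up):

    from gensim import corpora, models

    texts = [
        ['queen', 'elizabeth', 'crown', 'england'],
        ['queen', 'ships', 'england'],
        ['docker', 'containers', 'deployment'],
    ]

    dictionary = corpora.Dictionary(texts)           # maps each word to an id
    corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words vectors
    tfidf = models.TfidfModel(corpus)                # downweights common terms
    lsi = models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=2)

    for doc in corpus:
        print(lsi[tfidf[doc]])  # each document projected onto 2 latent topics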

If these TalkNotes are useful to you, become a patron!