perkeep/talks/2011-05-07-Camlistore-Sao-P.../index.html

692 lines
20 KiB
HTML
Executable File
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

<!DOCTYPE html>
<!--
TODO;
* sharing,
* sharing an object vs. blob vs. search
* replication queue/details,
* tree sync vs. graph sync,
* project status, impl details, google, go
- python app engine version
- android client
- android gpg
- perl test suite
Google I/O 2011 HTML slides template
URL: http://code.google.com/p/io-2011-slides/
-->
<html>
<head>
<title>Camlistore</title>
<meta charset='utf-8' />
<script src='slides.js'></script>
</head>
<style>
/* Your individual styles here, or just use inline styles if thats
what you want. */
.smaller {
font-size: 80%;
}
ul li ul {
margin-top: 1.5em;
margin-bottom: 1em;
}
ul li ul li {
margin-top: 1em;
font-size: 80%;
}
ul li ul.dense li {
margin-top: 0em;
margin-bottom: 0em;
font-size: 80%;
}
h1.center {
text-align: center;
font-style: italic;
}
</style>
<body style='display: none'>
<section class='slides layout-regular'>
<!-- Your slides (<article>s) go here. Delete or comment out the
slides below. -->
<article>
<h1>
Camlistore
</h1>
<p>
Brad Fitzpatrick
<br>
2011-05-07
</p>
</article>
<article>
<h3>
Who am I?
</h3>
<ul class='nobuild'>
<li>
Brad Fitzpatrick &lt;brad@danga.com&gt;
</li>
<li>Perl Hacker since 1994</li>
<li>Projects:
<table><tr valign='top'>
<th>Danga / 6A (Perl)</th>
<th>Google</th>
</tr>
<td class='nobuild'>
<div>LiveJournal</div>
<div>memcached</div>
<div>Perlbal</div>
<div>MogileFS</div>
<div class='blue'>OpenID</div>
</td>
<td class='nobuild'>
<div><nobr>Social Graph API (<span class='blue'>XFN / FOAF</a>)</nobr></div>
<div class='blue'>WebFinger</div>
<div class='blue'>PubSubHubbub</div>
<div>Android</div>
<div>Go</div>
</td>
</table>
<div style='font-size: 70%; margin-top: 1em'>* <span class='blue'>decentralized social</span></div>
</li>
</ul>
</article>
<article>
<h3>
But why am I in Brazil?
</h3>
<ul class='nobuild'>
<li>
<i>"Hey, want to come speak at a Perl conference in Brazil?"</i>
</li>
<li>"Yes, totally, but... I don't write much Perl these days. :-("</li> <!-- " -->
<li style="margin-top: 2em"><i>"You could speak on memcached."</i></li>
<li>"But that's an old topic, no?"</li>
<li style="margin-top: 2em"><i>"You have any new project you're excited about?"</i></li>
</li>
</ul>
</article>
<article>
<h1 align='center'>
Camlistore!
</h1>
</article>
<article>
<h3>
Camlistore
</h1>
<ul>
<li>New open source project</li>
<li>Almost a year old</li>
<li>Still in development</li>
<li>Starting to be useful :-)</li>
<li>Hard to easily describe...</li>
</article>
<article>
<q>
Camlistore is a way to store, sync, share, model and back up content
</q>
<div class='author'>
camlistore.org
</div>
</article>
<article>
<h3>
Motivation
</h3>
<ul>
<li>I've written too many Content Management Systems
<ul>
<li>blogs, comments, photos, emails, backups, scanned paperwork, ...</li>
<li>is a scanned photo a scan, a photo, or a blog post? who cares.</li>
<li>write <b>one CMS to rule them all</b></li>
<li>... or at least a good framework for higher-level CMSes</li>
</ul>
</li>
</ul>
</article>
<article>
<h3>
Motivation (cont)
</h3>
<ul>
<li>I still want to help solve the Decentralized Social Network Problem
<ul>
<li>protocols, not companies</li>
<li>gmail, hotmail: hosted versions of SMTP, IMAP</li>
<li>... but I can run my own SMTP/IMAP server if I want.</li>
<li>... or change my SMTP/IMAP provider</li>
</ul>
</li>
</ul>
</article>
<article>
<h3>
Motivation (cont)
</h3>
<ul>
<li>I wanted something conceptually simple.</li>
<li>HTTP interfaces, not language-specific</li>
<li>I use lots of machines; don't want to think about sync or conflicts.</li>
<li>Data archaeology: should be easy and obvious to
reconstruct in 20 or 100 years</li>
</ul>
</article>
<article>
<h3>
The Product
</h3>
<ul>
<li>one private dumping ground to store anything</li>
<li>backups, filesystems, objects, photos, likes, bookmarks, shares, my website, ...</li>
<li>live backup my phone</li>
<li>live replicate / sync my dumping group between my house & laptop & Amazon & Google</li>
<li>web UI (ala gmail, docs.google.com, etc) or FUSE filesystem</li>
<li>Easy for end-users; powerful for dorks</li>
</ul>
</article>
<article>
<h3>
Security Model
</h3>
<ul>
<li><i><b>your</b></i> private repo, for life</li>
<li>everything private by default</li>
<li>grant access to specific objects/trees with friends or the world</li>
<li>web UI or CLI tools let you share</li>
</ul>
</article>
<article>
<h1 class='center'>
So what's with the silly name?
</h1>
</article>
<article>
<h3>
Camlistore
</h3>
<ul>
<li>Content-</li>
<li>Addressable</li>
<li>Multi-</li>
<li>Layer-</li>
<li>Indexed</li>
<li>Storage</li>
</ul>
</article>
<article>
<h3>
Content-Addressable
</h3>
<ul>
<li>At the core, everything is stored &amp; addressed by its digest (e.g. SHA1, MD5, etc)</li>
<li>e.g. <tt class='smaller'>"sha1-0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33"</tt> for the blob <tt class='smaller'>"foo"</tt></li>
<li>Great properties:
<ul>
<li>no versions of content: change it changes the new digest too</li>
<li>no versions: no sync conflicts</li>
<li>no versions: perfect caching (have it or don't)</li>
</ul>
</li>
</ul>
</article>
<article>
<h3>
Multi-Layer, Indexed
</h3>
<ul>
<li>Unix philosophy: small pieces with well-defined interfaces that can be chained or composed</li>
<li>Camlistore pieces include:
<ul class='dense'>
<li style='margin-top: 1em'>Blob storage: memory, disk, S3, Google, MySQL index, etc</li>
<li>Schema</li>
<li>Signing</li>
<li>Replication</li>
<li>Indexing: (e.g. replicate from disk to MySQL index)</li>
<li>Search</li>
<li>HTML UI</li>
</ul>
</li>
</ul>
</article>
<article>
<h3>Logically</h3>
<img src='arch.png' width='100%'/>
</article>
<article>
<h3>In reality</h3>
<ul>
<li>End-users: use a hosted version</li>
<li>Dorks: single server binary with all the logical pieces</li>
</ul>
</article>
<article>
<h2>
From the bottom up...
</h2>
</article>
<article>
<h2>
Blob Server
</h2>
</article>
<article>
<h3>Blob Server: how dumb it is</h3>
<ul>
<li>"Blob" == zero or more bytes. <i class='red'>no</i> meta-data</li>
<li>private operations, to owner of data only:</li>
<ul>
<li class='green'>get(blobref) → blob</li>
<li class='green'>stat(blobref+) → [(blobref, size), ...]</li>
<li class='green'>put(blobref, blob)</li>
<li class='green'>enumerate(..) → [(blobref, size)...] (sorted by blobref)</li>
</ul>
<li>no public (non-owner) access</li>
<li>HTTP interface: <tt>GET /camli/sha1-xxxxxxx HTTP/1.1</tt></li>
<li><span class='green'>delete(blobref)</span> is disabled by default, privileged op for GC or replication queues only</li>
</ul>
</article>
<article>
<h3>Blob Server: seriously, no metadata</h3>
<ul>
<li>no filenames</li>
<li>no "mime types"</li>
<li>no "{create,mod,access} time"</li>
<li>size is implicit</li>
<li>blob: just some bytes</li>
<li>metadata? layers above.</li>
</ul>
</article>
<article>
<h1 class='center'>
Uh, what can you do with that?
</h1>
</article>
<article>
<h3>Uh, what can you do with that?</h3>
<ul>
<li>with just a blob server?</li>
<li>not much</li>
<li>but let's start with an easy example...</li>
</ul>
</article>
<article>
<h1 class='center'>
Filesystem Backups
</h1>
</article>
<article>
<h3>Filesystem Backups</h3>
<ul>
<li>previous project: brackup
<ul>
<li>good: Perl, slide/dice/encrypt S3 backup, content-addressed, good iterative backups</li>
<li>bad: large (several MB) "backup manifest" text files
</ul>
</li>
<li>fossil/venti, git, etc: directories content-addressed by content of their children, hash trees, etc</li>
<li>git: "tree objects", "commmit objects", etc</li>
<li>Camlistore: "schema blobs"</li>
</ul>
</article>
<article>
<h3>Schema: how to model your content</h3>
<ul>
<li>Camlistore defines <i>one possible</i> schema</li>
<li>but blobserver doesn't know about it all</li>
<li>tools generate schema,</li>
<li>indexer + search understand the schema.</li>
</ul>
</article>
<article>
<h3>Schema Blobs</h3>
<ul>
<li>so if all blobs are just dumb blobs of bytes with no metadata,</li>
<li>how do you store metadata?</li>
<li>as blobs themselves!</li>
</ul>
</article>
<article>
<h3>Minimal Schema Blob</h3>
<section>
<pre>{
"camliVersion": 1,
"camliType": "whatever"
}</pre>
</section>
<p>Whitespace doesn't matter. Just must be valid JSON in its
entirety. Use whatever JSON libraries you've got.</p>
<p>That one is named<br/><tt class='smaller'>sha1-19e851fe3eb3d1f3d9d1cefe9f92c6f3c7d754f6</tt></p>
<p>or perhaps: <tt class='smaller'>sha512-2c6746aba012337aaf113fd63c24d994a0703d33eb5d6ed58859e45dc4e02dcf<br/>dae5c4d46c5c757fb85d5aff342245fe4edb780c028a6f3c994c1295236c931e</tt></p>
</article>
<!-- END -->
<article>
<h3>Schema blob; type "file"</h3>
<section><pre>{"camliVersion": 1,
<span style='background: #fff'>"camliType": "file",</span>
"fileName": "foo.dat",
"unixPermission": "0644",
...,
"size": 6000133,
"contentParts": [
{"blobRef": "sha1-...dead", "size": 111},
{"blobRef": "sha1-...beef", "size": 5000000, "offset": 492 },
{"size": 1000000},
{"blobRef": "digalg-blobref", "size": 22},
]
}</pre></section>
</article>
<article>
<h3>Schema blob; type "directory"</h3>
<section><pre>{"camliVersion": 1,
<span style='background: #fff'>"camliType": "directory",</span>
"fileName": "foodir",
"unixPermission": "0755",
...,
"entries": <span style='background: #fff'>"sha1-c3764bc2138338d5e2936def18ff8cc9cda38455"</span>
}</pre></section>
</article>
<article>
<h3>Schema blob; type "static-set"</h3>
<section><pre>{"camliVersion": 1,
<span style='background: #fff'>"camliType": "static-set",</span>
"members": [
"sha1-xxxxxxxxxxxx",
"sha1-xxxxxxxxxxxx",
"sha1-xxxxxxxxxxxx",
"sha1-xxxxxxxxxxxx",
"sha1-xxxxxxxxxxxx",
"sha1-xxxxxxxxxxxx",
]
}</pre></section>
</article>
<article>
<h3>
Backup a directory...
</h3>
<section><pre>$ camput --file $HOME
sha1-8659a52f726588dc44d38dfb22d84a4da2902fed</pre></section>
<p>(like git/hg/fossil, that identifier represents everything down.)</p>
<p>Iterative backups are cheap, easy identifier to share, etc</p>
<p>But how will you remember that identifier? (later)</p>
</article>
<article>
<h3>
But what about mutable data?
</h3>
<ul>
<li>immutable data is easy to represent & reference</li>
<ul>
<li><tt class='smaller'>sha1-8659a52f726588dc44d38dfb22d84a4da2902fed</tt> is an immutable snapshot</li>
</ul>
<li>how to represent mutable data in an immutable, content-addressed world?</li>
<li>how to share a reference to a mutable object when changing an object mutates its name?</li>
</ul>
</article>
<article>
<h1 class='center'>
Objects & "Permanodes"
</h1>
</article>
<article>
<h3>
Terminology
</h3>
<ul>
<li><span class='red'>blob</span>: just dumb, immutable series of bytes</li>
<li><span class='red'>schema blob</span>: a blob that's a valid JSON object w/ camliVersion & camliType</li>
<li><span class='red'>signed schema blob</span> aka "<span class='red'>claim</span>": a schema blob with an embedded OpenPGP signature</li>
<li><span class='red'>object</span>: something mutable. represented as an anchor "<span class='blue'>permanode</span>" + a set of mutations (<span class='blue'>claims</span>)</li>
<li><span class='red'>permanode</span>: a stable reference. an anchor. just a <span class='blue'>signed schema blob</span>, but of almost no content...</li>
</ul>
</article>
<article>
<h3>
Permanode
</h3>
<section><pre><span style='font-weight: bold' class='blue'>$ camput --permanode</span>
sha1-ea799271abfbf85d8e22e4577f15f704c8349026
<span style='font-weight: bold' class="blue">$ camget sha1-ea799271abfbf85d8e22e4577f15f704c8349026</span>
<span style="background: #ff7">{"camliVersion": 1,
"camliSigner": "sha1-c4da9d771661563a27704b91b67989e7ea1e50b8",
<span style='font-weight: bold'>"camliType": "permanode"</span>,
"random": "oj)r}$Wa/[J|XQThNdhE"</span>
,"camliSig":"iQEcBAABAgAGBQJNRxceAAoJEGjzeDN/6vt8ihIH/Aov7FRIq4dODAPWGDwqL
1X9Ko2ZtSSO1lwHxCQVdCMquDtAdI3387fDlEG/ALoT/LhmtXQgYTt8QqDxVdu
EK1or6/jqo3RMQ8tTgZ+rW2cj9f3Q/dg7el0Ngoq03hyYXdo3whxCH2x0jajSt4RCc
gdXN6XmLlOgD/LVQEJ303Du1OhCvKX1A40BIdwe1zxBc5zkLmoa8rClAlHdqwo
gxYFY4cwFm+jJM5YhSPemNrDe8W7KT6r0oA7SVfOan1NbIQUel65xwIZBD0ah
CXBx6WXvfId6AdiahnbZiBup1fWSzxeeW7Y2/RQwv5IZ8UgfBqRHvnxcbNmScrzl
p3V3ZoY"}</pre></section>
</article>
<article>
<h3>
Backup a directory...
</h3>
<section><pre><span style='font-weight: bold'>$ camput --file $HOME</span>
sha1-8659a52f726588dc44d38dfb22d84a4da2902fed
<span style='font-weight: bold'>$ camput --permanode --file $HOME</span>
sha1-ea799271abfbf85d8e22e4577f15f704c8349026
<span style='font-weight: bold'>$ camput --permanode --name="Brad's home directory" --file $HOME</span>
sha1-ea799271abfbf85d8e22e4577f15f704c8349026</pre></section>
<ul>
<li>all the file data blobs, file/dir schema blobs,</li>
<li>a new permanode, owned by you</li>
<li>a mutation: permanode's content attribute == directory root</li>
<li>a mutation: permanode's name attribute == "Brad's home directory"</li>
</ul>
</article>
<article class='fill'>
<p><img src="fsbackup.png" height="100%"/></p>
</article>
<article>
<h1 class='center'>
Modeling non-filesystem objects
</h1>
</article>
<article>
<h3>Example: a photo gallery</h3>
<ul>
<li>Photos are objects</li>
<li>Galleries (sets) are objects</li>
<li>Photos are members of galleries</li>
<li>Photos & galleries have attributes (single-valued: "title", multi-valued: "tag")</li>
<li>Photos might be updated over time:
<ul>
<li>EXIF GPS updated, cropping, white balance</li>
<li>don't want to break links!</li>
</ul>
</li>
</ul>
</article>
<article class='fill'>
<p><img src="blobjects.png" width="100%"/></p>
</article>
<article>
<h1 class='center'>
How to make sense of that?
</h1>
</article>
<article>
<h1 class='center'>
Indexing & Search
</h1>
</article>
<article>
<h3>
Indexing: summary
</h3>
<p style='margin-top: 2em'>For each blob, build an index of:
<ul>
<li>directed graph of inter-blob references</li>
<li>(permanode, time) => resolved attributes</li>
<li>(permanode, time) => set memberships</li>
<li>etc...</li>
</ul>
</article>
<article>
<h3>
Indexing & Replication
</h3>
<ul>
<li>indexing is real-time, no polling</li>
<li>MySQL index speaks the blob server protocol</li>
<li>just replicated to MySQL (etc) just like Amazon S3 (etc)</li>
</ul>
<center><img src='repl.png' /></center>
</article>
<article>
<h3>
Search
</h3>
<ul>
<li>Permanodes created by $who, sorted by date desc, type "photo", tagged "funny"</li>
<li>My recent backups with attribute "hostname" == "camlistore.org",</l>
<li>All friends' galleries in which this photo appears,</li>
<li>etc...</li>
</ul>
<p>...similar to your email, or docs.google.com. "My stuff" or "My bookmarks".</p>
</article>
<article>
<h3>
Privacy Model
</h3>
<ul>
<li>all your blobs & objects & searches are private</li>
<li>nothing is public by default</li>
</ul>
</article>
<article>
<h1>
What if you want to share with friends, or globally publish something?
</h1>
</article>
<article>
<h3>
Sharing & Share Blobs
</h3>
<ul>
<li>the act of sharing involves creating a new <span class='red'>share claim</span>, just another blob, signed.</li>
</ul>
</article>
<article class='fill'>
<h3>
Image filling the slide (with optional header)
</h3>
<p>
<img src='images/example-cat.jpg'>
</p>
<div class='source white'>
Cat
</div>
</article>
<article class='nobackground'>
<h3>
A slide with an embed + title
</h3>
<iframe src='http://www.google.com/doodle4google/history.html'></iframe>
</article>
<article class='nobackground'>
<iframe src='http://www.google.com/doodle4google/history.html'></iframe>
</article>
<article class='fill'>
<h3>
Full-slide embed with (optional) slide title on top
</h3>
<iframe src='http://www.google.com/doodle4google/history.html'></iframe>
</article>
<article>
<h3>
Thank you!
</h3>
<ul>
<li>Brad Fitzpatrick, brad@danga.com</li>
<li>camlistore.org</li>
</ul>
</article>
</section>
</body>
</html>