


Making camera uploads for Android faster and more reliable

Camera uploads is a feature in our Android and iOS apps that automatically backs up a user's photos and videos from their mobile device to Dropbox. The feature was first introduced in 2012, and uploads millions of photos and videos for hundreds of thousands of users every day. People who use camera uploads are some of our most dedicated and engaged users. They care deeply about their photo libraries, and expect their backups to be quick and dependable every time. It's important that we offer a service they can trust.

Until recently, camera uploads was built on a C++ library shared between the Android and iOS Dropbox apps. This library served us well for a long time, uploading billions of images over many years. However, it had numerous problems. The shared code had grown polluted with complex platform-specific hacks that made it difficult to understand and risky to change. That risk was compounded by a lack of tooling support and a shortage of in-house C++ expertise. Plus, after more than five years in production, the C++ implementation was beginning to show its age. It was unaware of platform-specific restrictions on background processes, had bugs that could delay uploads for long periods of time, and made outage recovery difficult and time-consuming.

In 2019, we decided that rewriting the feature was the best way to offer a reliable, trustworthy user experience for years to come. This time, the Android and iOS implementations would be separate and would use platform-native languages (Kotlin and Swift, respectively) and libraries (such as WorkManager and Room for Android). The implementations could then be optimized for each platform and evolve independently, without being constrained by design decisions from the other.

This post is about some of the design, validation, and release decisions we made while building the new camera uploads feature for Android, which we released to all users during the summer of 2021. The project shipped successfully, with no outages or major issues; error rates went down, and upload performance greatly improved. If you haven't already enabled camera uploads, you should try it out for yourself.

Designing for background reliability

The main value proposition of camera uploads is that it works silently in the background. For users who don't open the app for weeks or even months at a time, new photos should still upload promptly.

How does this work? When someone takes a new photo or modifies an existing photo, the OS notifies the Dropbox mobile app. A background worker we call the scanner carefully identifies all the photos (or videos) that haven't yet been uploaded to Dropbox and queues them for upload. Then another background worker, the uploader, batch uploads all the photos in the queue.

Uploading is a two-step process. First, like many Dropbox systems, we break the file into 4 MB blocks, compute the hash of each block, and upload each block to the server. Once all the file blocks are uploaded, we make a final commit request to the server with a list of all block hashes in the file. This creates a new file consisting of those blocks in the user's Camera Uploads folder. Photos and videos uploaded to this folder can then be accessed from any linked device.
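
To make the two-step flow concrete, here is a minimal Kotlin sketch. The BlockServerClient interface, its method names, and the use of SHA-256 block hashes are illustrative assumptions for this post, not Dropbox's actual API.

    import java.io.File
    import java.security.MessageDigest

    const val BLOCK_SIZE = 4 * 1024 * 1024 // 4 MB blocks, as described above

    // Hypothetical client interface; real endpoint names and signatures will differ.
    interface BlockServerClient {
        suspend fun uploadBlock(hash: String, bytes: ByteArray)
        suspend fun commitFile(path: String, blockHashes: List<String>)
    }

    suspend fun uploadFile(client: BlockServerClient, file: File, remotePath: String) {
        val blockHashes = mutableListOf<String>()
        file.inputStream().use { input ->
            val buffer = ByteArray(BLOCK_SIZE)
            while (true) {
                val read = input.read(buffer)
                if (read <= 0) break
                val block = buffer.copyOf(read)
                // Hash the block (SHA-256 is an assumption here), then upload it.
                val hash = MessageDigest.getInstance("SHA-256").digest(block)
                    .joinToString("") { "%02x".format(it) }
                client.uploadBlock(hash, block)
                blockHashes += hash
            }
        }
        // Commit once every block is on the server, listing all block hashes in order.
        client.commitFile(remotePath, blockHashes)
    }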

One of our biggest challenges is that Android places strong constraints on how often apps can run in the background and what capabilities they have. For example, App Standby limits our background network access if the Dropbox app hasn't recently been foregrounded. This means we might only be allowed to access the network for a 10-minute interval once every 24 hours. These restrictions have grown stricter in recent versions of Android, and the cross-platform C++ version of camera uploads was not well equipped to handle them. It would sometimes attempt to perform uploads that were doomed to fail because of a lack of network access, or fail to restart uploads during the system-provided window when network access became available.

Our rewrite does not escape these background restrictions; they still apply unless the user chooses to disable them in Android's system settings. However, we reduce delays as much as possible by taking maximum advantage of the network access we do receive. We use WorkManager to handle these background constraints for us, guaranteeing that uploads are attempted if, and only if, network access becomes available. Unlike our C++ implementation, we also do as much work as possible while offline (for example, performing rudimentary checks on new photos for duplicates) before asking WorkManager to schedule us for network access.
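
A minimal sketch of this pattern, using a hypothetical UploadWorker and work name; the real scheduling in the app is more involved, but the network constraint is the key idea.

    import android.content.Context
    import androidx.work.*

    class UploadWorker(context: Context, params: WorkerParameters) :
        CoroutineWorker(context, params) {
        override suspend fun doWork(): Result {
            // Upload queued photos here; return Result.retry() on transient failure
            // so WorkManager reschedules us for the next window of network access.
            return Result.success()
        }
    }

    fun scheduleUploads(context: Context) {
        val request = OneTimeWorkRequestBuilder<UploadWorker>()
            .setConstraints(
                Constraints.Builder()
                    .setRequiredNetworkType(NetworkType.CONNECTED) // only run when the network is reachable
                    .build()
            )
            .build()
        WorkManager.getInstance(context)
            .enqueueUniqueWork("camera_uploads", ExistingWorkPolicy.KEEP, request)
    }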

Measuring interactions with our status banners helps us identify emerging issues in our apps, and is a helpful signal in our efforts to eliminate errors. After the rewrite was released, we saw users interacting with more "all done" statuses than usual, while the number of "waiting" or error status interactions went down. (This data reflects only paid users, but non-paying users show similar results.)

To further optimize use of our limited network access, we also refined our handling of failed uploads. C++ camera uploads aggressively retried failed uploads an unlimited number of times. In the rewrite we added backoff intervals between retry attempts, and also tuned our retry behavior for different error categories. If an error is likely to be transient, we retry multiple times. If it's likely to be permanent, we don't bother retrying at all. As a result, we make fewer overall retry attempts, which limits network and battery usage, and users see fewer errors.
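
A sketch of what category-aware retries with backoff can look like; the error taxonomy, attempt limit, and backoff values below are illustrative assumptions, not Dropbox's actual policy.

    import kotlinx.coroutines.delay

    enum class ErrorCategory { TRANSIENT, PERMANENT }

    suspend fun <T> retryUpload(
        maxAttempts: Int = 5,            // illustrative limit
        initialBackoffMs: Long = 1_000,  // illustrative starting backoff
        classify: (Throwable) -> ErrorCategory,
        attempt: suspend () -> T,
    ): T {
        var backoffMs = initialBackoffMs
        repeat(maxAttempts - 1) {
            try {
                return attempt()
            } catch (e: Exception) {
                // Don't waste network or battery on errors that will never succeed.
                if (classify(e) == ErrorCategory.PERMANENT) throw e
                delay(backoffMs)
                backoffMs *= 2 // exponential backoff between attempts
            }
        }
        return attempt() // Final attempt; let any failure propagate.
    }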

Designing for performance

Our users don't just expect camera uploads to work reliably. They also expect their photos to upload quickly, and without wasting system resources. We were able to make some big improvements here. For instance, first-time uploads of large photo libraries now finish up to four times faster. There are a few ways our new implementation achieves this.

Parallel uploads
First, we substantially improved performance by adding support for parallel uploads. The C++ version uploaded only one file at a time. Early in the rewrite, we collaborated with our iOS and backend infrastructure colleagues to design an updated commit endpoint with support for parallel uploads.

Once the server constraint was gone, Kotlin coroutines made it easy to run uploads concurrently. Although Kotlin Flows are typically processed sequentially, the available operators are flexible enough to serve as building blocks for powerful custom operators that support concurrent processing. These operators can be chained declaratively to produce code that's much simpler, and has less overhead, than the manual thread management that would've been necessary in C++.

    val uploadResults = mediaUploadStore
        .getPendingUploads()
        .unorderedConcurrentMap(concurrentUploadCount) {
            mediaUploader.upload(it)
        }
        .takeUntil {
            it != UploadTaskResult.SUCCESS
        }
        .toList()

A simple example of a concurrent upload pipeline. unorderedConcurrentMap is a custom operator that combines the built-in flatMapMerge and transform operators.
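
The custom operator itself isn't shown in the post. Under the assumption that it is built from flatMapMerge and transform as the caption says, one plausible sketch looks like this; the real Dropbox operator may differ, and the opt-in annotation depends on your coroutines version.

    import kotlinx.coroutines.ExperimentalCoroutinesApi
    import kotlinx.coroutines.flow.Flow
    import kotlinx.coroutines.flow.flatMapMerge
    import kotlinx.coroutines.flow.flowOf
    import kotlinx.coroutines.flow.transform

    @OptIn(ExperimentalCoroutinesApi::class) // flatMapMerge may require opt-in in some versions
    fun <T, R> Flow<T>.unorderedConcurrentMap(
        concurrency: Int,
        block: suspend (T) -> R,
    ): Flow<R> = flatMapMerge(concurrency) { value ->
        // Wrap each element in a tiny flow so flatMapMerge can run up to
        // `concurrency` of these transforms at the same time, emitting results
        // in completion order rather than input order.
        flowOf(value).transform { emit(block(it)) }
    }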

Optimizing memory use
After adding support for parallel uploads, we saw a big uptick in out-of-memory crashes from our early testers. A number of improvements were required to make parallel uploads stable enough for production.

First, we modified our uploader to dynamically vary the number of simultaneous uploads based on the amount of available system memory. This way, devices with lots of memory could enjoy the fastest possible uploads, while older devices would not be overwhelmed. However, we were still seeing much higher memory usage than we expected, so we used the memory profiler to take a closer look.
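
As a rough sketch of the dynamic-concurrency idea, available memory can be read from ActivityManager and mapped to an upload count; the thresholds below are illustrative assumptions, not Dropbox's actual values.

    import android.app.ActivityManager
    import android.content.Context

    fun chooseConcurrentUploadCount(context: Context): Int {
        val activityManager = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
        val memoryInfo = ActivityManager.MemoryInfo()
        activityManager.getMemoryInfo(memoryInfo)
        val availableMb = memoryInfo.availMem / (1024 * 1024)
        return when {
            memoryInfo.lowMemory -> 1 // don't compete with the rest of the system
            availableMb > 1_024 -> 4  // plenty of headroom: upload several files at once
            availableMb > 512 -> 2
            else -> 1
        }
    }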

The first thing we noticed was that memory consumption wasn't returning to its pre-upload baseline after all uploads were done. It turned out this was due to an unfortunate behavior of the Java NIO API. It created an in-memory cache on every thread where we read a file, and once created, the cache could never be destroyed. Since we read files with the threadpool-backed IO dispatcher, we typically ended up with many of these caches, one for each dispatcher thread we used. We resolved this by switching to direct byte buffers, which don't allocate this cache.

The next thing we noticed was big spikes in memory usage when uploading, especially with larger files. During each upload, we read the file in blocks, copying each block into a ByteArray for further processing. We never created a new byte array until the previous one had gone out of scope, so we expected only one to be in memory at a time. However, it turned out that when we allocated a large number of byte arrays in a short time, the garbage collector could not free them quickly enough, causing a transient memory spike. We resolved this issue by re-using the same buffer for all block reads.
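
A minimal sketch combining both fixes described above: a direct buffer avoids the per-thread NIO heap-buffer cache, and reusing one buffer for every block read avoids allocation spikes. The function shape and the onBlock callback are illustrative.

    import java.nio.ByteBuffer
    import java.nio.channels.FileChannel
    import java.nio.file.Path
    import java.nio.file.StandardOpenOption

    fun readFileInBlocks(path: Path, blockSize: Int = 4 * 1024 * 1024, onBlock: (ByteBuffer) -> Unit) {
        // Allocated once, off the Java heap, and reused for every block read.
        val buffer = ByteBuffer.allocateDirect(blockSize)
        FileChannel.open(path, StandardOpenOption.READ).use { channel ->
            while (true) {
                buffer.clear()
                if (channel.read(buffer) <= 0) break
                buffer.flip()   // prepare the buffer for reading by the consumer
                onBlock(buffer) // e.g. hash the block and hand it to the uploader
            }
        }
    }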

Parallel scanning and uploading
In the C++ implementation of camera uploads, uploading could not start until we finished scanning a user's photo library for changes. To avoid upload delays, each scan only looked at changes that were newer than what was seen in the previous scan.

This approach had downsides. There were some edge cases where photos with misleading timestamps could be skipped completely. If we ever missed photos due to a bug or OS change, shipping a fix wasn't enough to recover; we also had to clear affected users' saved scan timestamps to force a full re-scan. Plus, when camera uploads was first enabled, we still had to check everything before uploading anything. This wasn't a great first impression for new users.

In the rewrite, we ensured correctness by re-scanning the whole library after every change. We also parallelized uploading and scanning, so new photos can start uploading while we're still scanning older ones. This means that although re-scanning can take longer, the uploads themselves still start and finish promptly.
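
One way to decouple the scanner and uploader so uploads can begin before the scan completes is a producer/consumer queue between two coroutines. The MediaScanner and MediaUploader interfaces below are illustrative stand-ins, not the app's real types.

    import kotlinx.coroutines.channels.Channel
    import kotlinx.coroutines.coroutineScope
    import kotlinx.coroutines.launch

    data class PhotoToUpload(val uri: String)

    interface MediaScanner { suspend fun scanLibrary(emit: suspend (PhotoToUpload) -> Unit) }
    interface MediaUploader { suspend fun upload(photo: PhotoToUpload) }

    suspend fun scanAndUpload(scanner: MediaScanner, uploader: MediaUploader) = coroutineScope {
        val pending = Channel<PhotoToUpload>(capacity = Channel.BUFFERED)
        launch {
            // The scanner pushes new photos into the queue as soon as it finds them...
            scanner.scanLibrary { pending.send(it) }
            pending.close()
        }
        launch {
            // ...while the uploader drains the queue concurrently, without waiting
            // for the full re-scan to finish.
            for (photo in pending) uploader.upload(photo)
        }
    }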

Validation

A rewrite of this magnitude is risky to ship. It has dangerous failure modes that might only show up at scale, such as corrupting one out of every million uploads. Plus, as with most rewrites, we could not avoid introducing new bugs because we did not understand, or even know about, every edge case handled by the old system. We were reminded of this at the start of the project when we tried to remove some ancient camera uploads code that we thought was dead, and instead ended up DDOSing Dropbox's crash reporting service. 🙃

Hash validation in production
During early development, we validated many low-level components by running them in production alongside their C++ counterparts and then comparing the outputs. This let us confirm that the new components were working correctly before we started relying on their results.

One of those components was a Kotlin implementation of the hashing algorithms that we use to identify photos. Because these hashes are used for de-duplication, unexpected things could happen if the hashes change for even a tiny percentage of photos. For instance, we might re-upload old photos believing they are new. When we ran our Kotlin code alongside the C++ implementation, both implementations almost always returned matching hashes, but they differed about 0.005% of the time. Which implementation was wrong?

To answer this, we added some additional logging. In cases where Kotlin and C++ disagreed, we checked whether the server subsequently rejected the upload because of a hash mismatch, and if so, what hash it was expecting. We saw that the server was expecting the Kotlin hashes, giving us high confidence the C++ hashes were wrong. This was great news, since it meant we had fixed a rare problem we didn't even know we had.
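
The post doesn't show this code, but the side-by-side comparison it describes might be sketched like this: run both implementations, keep trusting the legacy result, and report any mismatch. The hasher interfaces and the mismatch-reporting hook are hypothetical.

    interface PhotoHasher { fun hash(photoBytes: ByteArray): String }

    class ShadowValidatingHasher(
        private val legacyCppHasher: PhotoHasher,
        private val newKotlinHasher: PhotoHasher,
        private val reportMismatch: (legacy: String, rewrite: String) -> Unit,
    ) : PhotoHasher {
        override fun hash(photoBytes: ByteArray): String {
            val legacy = legacyCppHasher.hash(photoBytes)
            val rewrite = newKotlinHasher.hash(photoBytes)
            // Log disagreements so they can be compared against server expectations.
            if (legacy != rewrite) reportMismatch(legacy, rewrite)
            return legacy // keep relying on the legacy result until the new one is proven
        }
    }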

Validating state transitions
Camera uploads uses a database to track each photo's upload state. Typically, the scanner adds photos in state NEW and then moves them to PENDING (or DONE if they don't need to be uploaded). The uploader tries to upload PENDING photos and then moves them to DONE or ERROR.

Since we parallelize so much work, it's normal for multiple parts of the system to read and write this state database simultaneously. Individual reads and writes are guaranteed to happen sequentially, but we're still vulnerable to subtle bugs where multiple workers try to change the state in redundant or contradictory ways. Since unit tests only cover single components in isolation, they won't catch these bugs. Even an integration test might miss rare race conditions.

In the rewritten version of camera uploads, we guard against this by validating every state update against a set of allowed state transitions. For instance, we stipulate that a photo can never move from ERROR to DONE without passing back through PENDING. Unexpected state transitions could indicate a serious bug, so if we see one, we stop uploading and report an exception.
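
A minimal sketch of this kind of guard. The state names come from the post; the allowed-transition table and the exception type are illustrative assumptions.

    enum class UploadState { NEW, PENDING, DONE, ERROR }

    // Which target states each state may legally move to.
    private val allowedTransitions = mapOf(
        UploadState.NEW to setOf(UploadState.PENDING, UploadState.DONE),
        UploadState.PENDING to setOf(UploadState.DONE, UploadState.ERROR),
        UploadState.ERROR to setOf(UploadState.PENDING),
        UploadState.DONE to emptySet<UploadState>(),
    )

    class IllegalStateTransitionException(from: UploadState, to: UploadState) :
        IllegalStateException("Unexpected upload state transition: $from -> $to")

    fun validateTransition(from: UploadState, to: UploadState) {
        if (to !in allowedTransitions.getValue(from)) {
            // Report and stop rather than silently accepting a contradictory update.
            throw IllegalStateTransitionException(from, to)
        }
    }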

These checks helped us detect a nasty bug early in our rollout. We started to see a high volume of exceptions in our logs that were caused when camera uploads tried to transition photos from DONE to DONE. This made us realize we were uploading some photos multiple times! The root cause was a surprising behavior in WorkManager where unique workers can restart before the previous instance is fully cancelled. No duplicate files were being created because the server rejects them, but the redundant uploads were wasting bandwidth and time. Once we fixed the issue, upload throughput dramatically improved.

Rolling it out

Even after all this validation, we still had to be cautious during the rollout. The fully-integrated system was more complex than its parts, and we'd also need to contend with a long tail of rare device types that are not represented in our internal user testing pool. We also needed to continue to meet or surpass the high expectations of all our users who rely on camera uploads.

To reduce this risk preemptively, we made sure to support rollbacks from the new version to the C++ version. For instance, we ensured that all user preference changes made in the new version would apply to the old version as well. In the end we never needed to roll back, but it was still worth the effort to have the option available in case of disaster.

We started our rollout with an opt-in pool of beta (Play Store early access) users who receive a new version of the Dropbox Android app every week. This pool of users was large enough to surface rare errors and collect key performance metrics such as upload success rate. We monitored these key metrics in this population for a number of months to gain confidence it was ready to ship widely. We discovered many bugs during this time, but the fast beta release cadence allowed us to iterate and fix them quickly.

We also monitored many metrics that could hint at future problems. To make sure our uploader wasn't falling behind over time, we watched for signs of ever-growing backlogs of photos waiting to upload. We tracked retry success rates by error type, and used this to fine-tune our retry algorithm. Last but not least, we also paid close attention to feedback and support tickets we received from users, which helped surface bugs that our metrics had missed.

When we finally released the new version of camera uploads to all users, it was clear our months spent in beta had paid off. Our metrics held steady through the rollout and we had no major surprises, with improved reliability and low error rates right out of the gate. In fact, we ended up finishing the rollout ahead of schedule. Since we'd front-loaded so much quality improvement work into the beta period (with its weekly releases), we didn't have any multi-week delays waiting for critical bug fixes to roll out in the stable releases.

So, was it worth it?

Rewriting a big legacy feature isn't always the right decision. Rewrites are extremely time-consuming (the Android version alone took two people working for two full years) and can easily cause major regressions or outages. In order to be worthwhile, a rewrite needs to deliver tangible value by improving the user experience, saving engineering time and effort in the long term, or both.

What advice do we have for others who are starting a project like this?

  • Define your goals and how you will measure them. At the start, this is important to make sure that the benefits will justify the effort. At the end, it will help you determine whether you got the results you wanted. Some goals (for example, future resilience against OS changes) may not be quantifiable, and that's OK, but it's good to spell out which ones are and aren't.
  • De-risk it. Identify the components (or system-wide interactions) that would cause the biggest issues if they failed, and guard against those failures from the very start. Build critical components first, and try to test them in production without waiting for the whole system to be finished. It's also worth doing extra work up front in order to be able to roll back if something goes wrong.
  • Don't rush. Shipping a rewrite is arguably riskier than shipping a new feature, since your audience is already relying on things to work as expected. Start by releasing to an audience that's just large enough to give you the data you need to evaluate success. Then, watch and wait (and fix stuff) until your data give you confidence to continue. Dealing with issues when the user base is small is much faster and less stressful in the long run.
  • Limit your scope. When doing a rewrite, it's tempting to tackle new feature requests, UI cleanup, and other backlog work at the same time. Consider whether this will actually be faster or easier than shipping the rewrite first and fast-following with the rest. During this rewrite we addressed issues linked to the core architecture (such as crashes intrinsic to the underlying data model) and deferred all other improvements. If you change the feature too much, not only does it take longer to implement, but it's also harder to notice regressions or roll back.

In this case, we feel good about the decision to rewrite. We were able to improve reliability right away, and more importantly, we set ourselves up to stay reliable in the future. As the iOS and Android operating systems continue to evolve in separate directions, it was only a matter of time before the C++ library broke badly enough to require fundamental systemic changes. Now that the rewrite is complete, we're able to build and iterate on camera uploads much faster, and offer a better experience for our users, too.

Also: We're hiring!

Are you a mobile engineer who wants to make software that's reliable and maintainable for the long haul? If so, we'd love to have you at Dropbox! Visit our jobs page to see current openings.

Source: https://dropbox.tech/mobile/making-camera-uploads-for-android-faster-and-more-reliable
