📖Solutions

How can we make use of these scattered computing resources?

Introduction

To address these challenges, we need to accomplish three key tasks. First, we must raise the transmission speed of traditional P2P networks so they fully utilize the bandwidth provided by network providers. Second, we need to compress the volume of communication data required for AI training. Third, we should simplify how users offer their computing power, turning it into a straightforward application rather than a complex code-and-command-line process.

For the first task, we developed a P2P network with an enhanced network layer that allows traditional peer-to-peer networks to reach the speeds provided by network providers. For the second task, we integrated several existing AI training data compression technologies, reducing the volume of communication data to less than 5% of the original. For the third task, we created a one-click-install computing node client that easily connects to the DEKUPER cluster and lets users provide cloud computing services to those who need computing power.

P2P Network

We have developed a P2P network: an industry-leading, high-speed, decentralized, and secure public chain with a throughput of over 12,000 TPS. This network includes a sophisticated peer-to-peer network layer that far exceeds the speed of traditional P2P networks, supporting the efficient network transmission that AI training requires.

To see how to increase the transmission speed of a P2P network, consider why home networks can access mainstream websites quickly, at the speeds promised by network providers: the servers of these websites sit on the backbone networks of telecommunications companies. Our solution, therefore, is to add relay nodes on the backbone network in each region. Nodes in the P2P network send data to these backbone relay nodes first, which then forward the data to the target nodes. To build this enhanced network, we spent five years writing over 800,000 lines of C++ code and invested tens of millions of dollars. Our tests show that the improved P2P network reaches the speeds claimed by network providers, hundreds of times faster than the worst-case transfer speeds of a traditional P2P overlay.
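The relay idea above can be sketched in a few lines: each node measures latency to the backbone relays and forwards traffic through the nearest one. This is a minimal illustration only; the relay names, latency figures, and function names are hypothetical, not part of DEKUPER's actual C++ implementation.

```python
def pick_relay(relays, latencies_ms):
    """Pick the backbone relay with the lowest measured latency."""
    return min(relays, key=lambda r: latencies_ms[r])

def route(source, target, relays, latencies_ms):
    """Return the hop sequence: source -> best relay -> target."""
    relay = pick_relay(relays, latencies_ms)
    return [source, relay, target]

# Hypothetical relays and measured round-trip latencies.
relays = ["relay-eu", "relay-us", "relay-asia"]
latencies = {"relay-eu": 28.0, "relay-us": 95.0, "relay-asia": 140.0}

path = route("home-node-A", "home-node-B", relays, latencies)
# path == ["home-node-A", "relay-eu", "home-node-B"]
```

In practice a node would re-measure latencies periodically and fail over to the next-best relay, but the routing decision itself stays this simple.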

Currently, many distributed computing projects use a fixed master node that assigns tasks to trusted sub-nodes, skips the verification stage, and writes results directly into the database. Such unverified data carries the risk of incorrect results and loss of user assets. By enforcing signature verification and a decentralized design for the master node and committee, our P2P network achieves more than six times the speed of other public blockchains that meet the same requirements, balancing secure decentralization with high-speed, stable transaction processing.

The P2P network employs a custom SDBFT (Simplified Decentralized Byzantine Fault Tolerance) consensus algorithm: a Byzantine fault-tolerant protocol that tolerates a bounded number of faulty or malicious nodes while simplifying the decentralized voting process. Compared with traditional BFT algorithms, it is easier to implement and delivers higher throughput with comparable fault tolerance.
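SDBFT's internals are not detailed here, but any Byzantine fault-tolerant committee design is bound by the same classic arithmetic, which is worth making concrete. The sketch below shows the standard BFT bounds, not SDBFT's specific protocol:

```python
def min_nodes(f):
    """A BFT system tolerating f faulty nodes needs at least 3f + 1 nodes."""
    return 3 * f + 1

def quorum(n):
    """Matching votes needed so any two quorums overlap in an honest
    node: 2f + 1, where f is the largest fault count n can tolerate."""
    f = (n - 1) // 3
    return 2 * f + 1

# Tolerating 1 faulty committee member requires 4 nodes and 3 matching votes.
assert min_nodes(1) == 4
assert quorum(4) == 3
```

Whatever simplifications SDBFT makes to the message flow, a committee sized below these thresholds cannot be Byzantine fault tolerant.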

Data compression for communication between GPUs

In distributed SGD, 99.9% of the gradient exchange is redundant, and the Deep Gradient Compression (DGC) algorithm exploits this to greatly reduce the required communication bandwidth. DGC can compress transmitted gradient data to roughly 1/600 of its original size, allowing a 1 Gbps link to deliver training performance comparable to an uncompressed 10 Gbps link.
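The core of DGC is top-k gradient sparsification with local residual accumulation: each step sends only the largest-magnitude gradients and carries the rest forward so nothing is lost. The sketch below shows this mechanism in isolation (the full DGC algorithm also adds momentum correction and gradient clipping, omitted here):

```python
import numpy as np

def dgc_step(grad, residual, sparsity=0.999):
    """One simplified DGC step: accumulate locally, send only top 0.1%."""
    acc = grad + residual                        # re-add gradients skipped earlier
    k = max(1, round(acc.size * (1 - sparsity))) # how many entries to transmit
    idx = np.argsort(np.abs(acc))[-k:]           # indices of the largest entries
    sent = np.zeros_like(acc)
    sent[idx] = acc[idx]                         # sparse tensor actually sent
    return sent, acc - sent                      # remainder kept as new residual

grads = np.random.randn(10_000)
sent, residual = dgc_step(grads, np.zeros_like(grads))
print(np.count_nonzero(sent))  # 10 values transmitted instead of 10,000
```

Because `sent + residual` always equals the accumulated gradient, every gradient eventually reaches the other workers; only its arrival is delayed.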

AQ-SGD compresses the change in activations for the same training example across epochs. In a decentralized network with slow connectivity (e.g., 100 Mbps), AQ-SGD is only 18% slower than an uncompressed approach running in a high-speed datacenter network (e.g., 10 Gbps).
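The key insight is that activations for the same example change little between epochs, so quantizing the delta loses far less precision than quantizing the raw activation. A minimal sketch of that idea, with a simple uniform quantizer standing in for AQ-SGD's actual scheme:

```python
import numpy as np

def quantize(x, bits=4):
    """Uniform quantization of x onto 2**bits levels over its range (sketch)."""
    lo, hi = x.min(), x.max()
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return np.round((x - lo) / scale) * scale + lo

def aq_sgd_send(activation, prev_reconstruction, bits=4):
    """AQ-SGD idea, simplified: quantize the *change* in activations for the
    same example across epochs, not the activation itself. Both sides keep
    prev_reconstruction, so the receiver rebuilds the same value."""
    delta_q = quantize(activation - prev_reconstruction, bits)
    return prev_reconstruction + delta_q

# Epoch 0 sent nothing yet; epoch 1 transmits a 4-bit quantized delta.
prev = np.zeros(8)
act = np.linspace(0.0, 1.0, 8)
reconstructed = aq_sgd_send(act, prev)
```

As training converges the deltas shrink, so the fixed quantization grid covers an ever-smaller range and the reconstruction error falls with it.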

ZeRO++ combines quantization with data and communication remapping to reduce total communication volume by 4x compared with ZeRO, without impacting model quality.
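The quantization half of that 4x is easy to see: casting fp32 values to int8 with a per-block scale shrinks the payload fourfold. The block below is a generic blockwise-quantization sketch in that spirit, not ZeRO++'s actual kernels:

```python
import numpy as np

def quantize_int8(block):
    """Map an fp32 block to int8 plus one scale: 4x fewer payload bytes."""
    scale = max(np.max(np.abs(block)) / 127, 1e-12)  # avoid division by zero
    q = np.round(block / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Receiver-side reconstruction from int8 values and the block scale."""
    return q.astype(np.float32) * scale

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_int8(w)
print(w.nbytes // q.nbytes)  # 4: payload shrinks 4x (plus one scale per block)
```

ZeRO++ pairs this kind of quantization with hierarchical partitioning and remapped all-to-all collectives; quantization alone accounts for the bandwidth savings shown here.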

Combined, these compression measures can reduce the amount of data communicated between GPUs for AI training to less than 5% of the original amount.

The main AI training frameworks today are PyTorch and TensorFlow. DeepSpeed already handles inter-GPU communication compression well in the PyTorch ecosystem, but the TensorFlow ecosystem has no comparable system. We are developing a memory-splitting and inter-GPU communication compression program for TensorFlow that combines the compression algorithms above; it will be available soon.

Easy-to-use client

We provide a one-click installation client that bundles management of local computing resources, cluster network registration and connection, and k8s container functions. Novice users without any professional computing background can install and use it directly, providing computing resources in exchange for rewards. It includes privacy protection features to keep data inside containers secure: if a host user attempts to read data from a container, the host's permissions are revoked and it is removed from the cluster.

Last updated