tensorflow / lattice

Lattice methods in TensorFlow

How to use multiple CPUs easily?

fuyanlu1989 opened this issue · comments

It is great to see such a good package. However, training is too slow for me.

I am using a Crystals ensemble model config with a tfl.estimators.CannedRegressor estimator. It seems that only one CPU is being used, even though I have 48 CPUs on the machine.

I have set up the dataset input functions with multiple threads:

feature_analysis_input_fn = tf.compat.v1.estimator.inputs.pandas_input_fn(
    x=train_xs.loc[feature_analysis_index].copy(), 
    y=train_ys.loc[feature_analysis_index].copy(), 
    batch_size=128, 
    num_epochs=1, 
    shuffle=True, 
    queue_capacity=1000,
    num_threads=40)

prefitting_input_fn = tf.compat.v1.estimator.inputs.pandas_input_fn(
    x=train_xs.loc[prefitting_index].copy(), 
    y=train_ys.loc[prefitting_index].copy(), 
    batch_size=128, 
    num_epochs=1, 
    shuffle=True, 
    queue_capacity=1000,
    num_threads=40)

train_input_fn = tf.compat.v1.estimator.inputs.pandas_input_fn(
    x=train_xs.loc[train_index].copy(), 
    y=train_ys.loc[train_index].copy(), 
    batch_size=128, 
    num_epochs=100, 
    shuffle=True, 
    queue_capacity=1000,
    num_threads=40)

CPU usage is still only about 1.25 cores. Any suggestions?

@mmilanifard Any suggestions?

Training a crystals model has multiple steps (the sketch after this list shows where each input_fn is used):

  • going over the feature_analysis_input_fn to calculate quantiles (will be single threaded)
  • running the prefitting training step using prefitting_input_fn
  • running the final training using train_input_fn
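To make the three steps concrete, here is a minimal sketch with hypothetical feature names and hyperparameter values (x0, x1, the keypoint counts and lattice sizes are placeholders, not a recommendation). The first two steps run inside the estimator constructor; only the third happens in train():

import tensorflow as tf
import tensorflow_lattice as tfl

# Hypothetical features and config, for illustration only.
feature_columns = [
    tf.feature_column.numeric_column('x0'),
    tf.feature_column.numeric_column('x1'),
]
model_config = tfl.configs.CalibratedLatticeEnsembleConfig(
    feature_configs=[
        tfl.configs.FeatureConfig(name='x0', pwl_calibration_num_keypoints=10),
        tfl.configs.FeatureConfig(name='x1', pwl_calibration_num_keypoints=10),
    ],
    lattices='crystals',   # crystals ensemble, as in the question
    num_lattices=5,
    lattice_rank=2)

# Steps 1 and 2 (quantile computation and prefitting) run in the
# constructor, before train() is ever called:
estimator = tfl.estimators.CannedRegressor(
    feature_columns=feature_columns,
    model_config=model_config,
    feature_analysis_input_fn=feature_analysis_input_fn,  # step 1
    prefitting_input_fn=prefitting_input_fn,              # step 2 (crystals only)
    optimizer=tf.keras.optimizers.Adam(0.01))

# Step 3 is the regular estimator training loop:
estimator.train(input_fn=train_input_fn)                  # step 3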

My suggestions:

  • Check whether a vanilla DNN model uses more cores, to make sure your overall TF runtime is set up correctly to use all the cores
  • Identify in which of the steps above you are seeing single-core usage. The first two steps happen in the init of the estimator, and the third step happens during the estimator.train() call. You can look at the stderr output to investigate further.
  • If your dataset is very large, use a subset for feature_analysis_input_fn.
  • Use lattices='random' instead of crystals to make sure prefitting is not the slow part of the process (see the sketch after this list).
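A rough sketch of those suggestions, assuming the variable names from the question (train_xs, train_ys, feature_analysis_index, train_input_fn), the feature_columns from the sketch above, and a feature_configs list like the one shown there; the 10,000-row sample size is an arbitrary example:

# Sanity check: a vanilla DNN estimator should already keep several cores
# busy if the TF runtime is set up correctly.
dnn = tf.estimator.DNNRegressor(feature_columns=feature_columns,
                                hidden_units=[64, 32])
dnn.train(input_fn=train_input_fn, max_steps=1000)

# If quantile computation over the full dataset is slow, a smaller sample
# is usually enough for feature analysis.
sample_index = feature_analysis_index[:10000]
feature_analysis_input_fn = tf.compat.v1.estimator.inputs.pandas_input_fn(
    x=train_xs.loc[sample_index].copy(),
    y=train_ys.loc[sample_index].copy(),
    batch_size=128,
    num_epochs=1,
    shuffle=False)

# If prefitting is the slow part, a random ensemble skips it entirely,
# so no prefitting_input_fn is needed.
model_config = tfl.configs.CalibratedLatticeEnsembleConfig(
    feature_configs=feature_configs,  # same feature configs as before
    lattices='random',
    num_lattices=5,
    lattice_rank=2)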

If the program used for a single task is multi-threaded or multi-process, the load can be balanced across the cores, which improves CPU utilization and reduces the time needed for data processing.

The main reason is that in the past CPUs were single-core, and programmers generally did not think in terms of multi-core programming, so much software still does not fully exploit multiple cores.

Some parts of a program can easily be split into multiple threads; for example, the user interface and the data processing of an application can run as two independent threads. Other parts are hard to split cleanly, and they raise problems that most PC programmers have never faced before. (A few programmers experienced in distributed computing, such as those building Google's backend data systems, have dealt with these problems, because they have to coordinate tens or even hundreds of thousands of servers for collaborative computing.) Data processing programs like the one you describe are especially affected.