PyPI error when trying to install gcsfs in Google Composer (Airflow)

I use Google composer-1.0.0-airflow-1.9.0. One of my DAGs uses Dask, and I wanted to set up Composer to run it. One of the required packages for this DAG is gcsfs. When I tried to install it via the web UI, I got the error below:

Composer Backend timed out. Currently running tasks are [stage: CP_COMPOSER_AGENT_RUNNING description: "Composer Agent Running. Latest Agent Stage: stage: DEPLOYMENTS_UPDATED\n ." response_timestamp { seconds: 1540331648 nanos: 860000000 } ].

Updated:

The error comes from this line of code, where Dask tries to read a file from a GCS bucket: dd.read_csv(bucket). Log:

     [2018-10-24 22:25:12,729] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/dask/bytes/core.py", line 350, in get_fs_token_paths
     [2018-10-24 22:25:12,733] {base_task_runner.py:98} INFO - Subtask:     fs, fs_token = get_fs(protocol, options)
     [2018-10-24 22:25:12,735] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/dask/bytes/core.py", line 473, in get_fs
     [2018-10-24 22:25:12,740] {base_task_runner.py:98} INFO - Subtask:     "Need to install `gcsfs` library for Google Cloud Storage support\n"
     [2018-10-24 22:25:12,741] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/dask/utils.py", line 94, in import_required
     [2018-10-24 22:25:12,748] {base_task_runner.py:98} INFO - Subtask:     raise RuntimeError(error_msg)
     [2018-10-24 22:25:12,751] {base_task_runner.py:98} INFO - Subtask: RuntimeError: Need to install `gcsfs` library for Google Cloud Storage support
     [2018-10-24 22:25:12,756] {base_task_runner.py:98} INFO - Subtask:     conda install gcsfs -c conda-forge
     [2018-10-24 22:25:12,758] {base_task_runner.py:98} INFO - Subtask:     or
     [2018-10-24 22:25:12,762] {base_task_runner.py:98} INFO - Subtask:     pip install gcsfs
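The missing dependency can be probed before the task runs; a minimal stdlib-only sketch (this helper is my own, not part of Dask):

```python
import importlib

def has_gcsfs():
    # Dask only imports gcsfs at read time; probe for it before calling dd.read_csv
    try:
        importlib.import_module('gcsfs')
        return True
    except ImportError:
        return False
```

If this returns False on a worker, any `dd.read_csv('gs://...')` call will raise the RuntimeError shown in the log above.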

When I tried to install gcsfs through the Google Composer UI using PyPI, I got the error below:

 {
   insertId:  "17ks763f726w1i"  
   logName:  "projects/xxxxxxxxx/logs/airflow-worker"  
   receiveTimestamp:  "2018-10-25T15:42:24.935880717Z"  
   resource: {…}  
   severity:  "ERROR"  
    textPayload:  "Traceback (most recent call last):
    File "/usr/local/bin/gcsfuse", line 7, in <module>
    from gcsfs.cli.gcsfuse import main
    File "/usr/local/lib/python2.7/site-packages/gcsfs/cli/gcsfuse.py", line 3, in <module>
      from fuse import FUSE
    ImportError: No module named fuse
    "  
   timestamp:  "2018-10-25T15:41:53Z"  
    }

3 Answers

Unfortunately, your error message doesn't mean much to me.

gcsfs is pure Python code, so it is very unlikely that anything is going wrong with installing it; that is done very commonly with pip or conda. Its dependencies are a set of Google libraries, some of which may require compilation (I don't know), so I would suggest finding out from the logs which one is stalling and taking it up with them. On the other hand, this kind of thing can often be a network or intermittent problem, so waiting may also fix things.

For the future, I recommend basing installations around conda, which never needs to compile anything and is generally better at dependency tracking.

5 months ago

This has to do with the fact that Composer and Airflow have silent dependencies that are not kept in sync. So if the gcsfs installation conflicts with an Airflow dependency, we get this error. More details here. The only workarounds (other than updating to the Nov 28 release of Composer) are:

Source: Thanks to Jake Biesinger ([email protected])

Use a separate Kubernetes Pod for running various jobs; that is a large change and requires infra we're not very familiar with (GKE). This particular issue can also be solved by installing dbt in a PythonVirtualEnvOperator, then having the python_callable re-use the virtualenv's bin dir, something like:

```
import os
import subprocess
import sys

def _run_cmd_in_virtual_env(cmd):
    # cmd lives in the virtualenv's bin dir, alongside the interpreter
    subprocess.check_call([os.path.join(os.path.split(sys.argv[0])[0], cmd)])

# Calls the temporarily-installed dbt binary, e.g. /tmp/virtualenv-asdasd/bin/dbt
task = PythonVirtualEnvOperator(task_id='run_dbt',
                                python_callable=_run_cmd_in_virtual_env,
                                op_args=('dbt',))
```
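The path construction in that snippet can be sketched and verified on its own; a minimal stdlib-only version (using sys.executable rather than sys.argv[0], which is a more reliable handle on the interpreter; the function name is my own):

```python
import os
import sys

def virtualenv_bin_path(cmd):
    # A binary installed into the virtualenv lives next to the interpreter,
    # e.g. /tmp/virtualenv-asdasd/bin/dbt next to /tmp/virtualenv-asdasd/bin/python
    return os.path.join(os.path.dirname(sys.executable), cmd)
```

Inside the operator's callable, this resolves to the temporary virtualenv's bin dir, so the freshly installed binary is found without touching Composer's own environment.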

4 months ago

I haven't tried this, but it might help you out. In general, installing arbitrary system packages (such as fuse, which is a dependency of what you are trying to install) is not supported by Google Composer. As discussed here: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!searchin/cloud-composer-discuss/sugimiyanto%7Csort:date/cloud-composer-discuss/jpxAGCPFkZo/mCx_P1LPCQAJ

However, you may be able to work around this by uploading the package folder you installed locally (i.e. fuse) into your Google Cloud Storage bucket, for example gs://<your_bucket_name>/libs, so that it becomes a shared library location. Then, set the LD_LIBRARY_PATH environment variable in Google Composer to /home/airflow/gcs/libs, so the dynamic loader looks for shared libraries in that directory.
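As a rough sketch of what that environment change amounts to (the helper, constant, and mount path here are illustrative assumptions, not a Composer API):

```python
import os

# Assumed local mount point: Composer syncs the bucket's contents to
# /home/airflow/gcs/ on workers, so gs://<bucket>/libs appears here
SHARED_LIBS = '/home/airflow/gcs/libs'

def with_shared_libs(base_env=None):
    # Return a copy of the environment with the shared-libs dir prepended
    # to LD_LIBRARY_PATH, so the dynamic loader searches it first
    env = dict(os.environ if base_env is None else base_env)
    env['LD_LIBRARY_PATH'] = SHARED_LIBS + ':' + env.get('LD_LIBRARY_PATH', '')
    return env
```

In Composer itself you would set this via the environment's configuration rather than in task code; the snippet only illustrates the resulting search path.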

Then, try reinstalling gcsfs via PyPI in Google Composer.

2 months ago