antonmks / Alenka

GPU database engine

trivial_copy_n: an illegal memory access was encountered

marklit opened this issue

Hi,

I'm loading an 8.7 GB, 20-million-line CSV file into Alenka. The import starts out well, with a number of the .hash files growing to 50 MB+, but around 10 minutes into the load command I get an "illegal memory access was encountered" error message.

I've compiled the master branch of Alenka (commit 59022b5) on Ubuntu 16.04 64-bit with CUDA 8 and I'm running it with an Nvidia GTX 1080 and the 367.48 driver.

Here are the steps I took that led up to the issue:

$ cat load.sql
A  :=  LOAD 'trips_xaa.csv' USING (',') AS (
    trip_id{1}:int,
    vendor_id{2}:varchar(3),

    pickup_datetime{3}:varchar(19),

    dropoff_datetime{4}:varchar(19),
    store_and_fwd_flag{5}:varchar(1),
    rate_code_id{6}:int,
    pickup_longitude{7}:DECIMAL(14,2),
    pickup_latitude{8}:DECIMAL(14,2),
    dropoff_longitude{9}:DECIMAL(14,2),
    dropoff_latitude{10}:DECIMAL(14,2),
    passenger_count{11}:int,
    trip_distance{12}:DECIMAL(14,2),
    fare_amount{13}:DECIMAL(14,2),
    extra{14}:DECIMAL(14,2),
    mta_tax{15}:DECIMAL(14,2),
    tip_amount{16}:DECIMAL(14,2),
    tolls_amount{17}:DECIMAL(14,2),
    ehail_fee{18}:DECIMAL(14,2),
    improvement_surcharge{19}:DECIMAL(14,2),
    total_amount{20}:DECIMAL(14,2),
    payment_type{21}:varchar(3),
    trip_type{22}:int,
    pickup{23}:varchar(50),
    dropoff{24}:varchar(50),

    dummy1{25}:varchar(50),
    dummy2{26}:varchar(50),

    cab_type{27}:varchar(6),

    precipitation{28}:int,
    snow_depth{29}:int,
    snowfall{30}:int,
    max_temperature{31}:int,
    min_temperature{32}:int,
    average_wind_speed{33}:int,

    pickup_nyct2010_gid{34}:int,
    pickup_ctlabel{35}:varchar(10),
    pickup_borocode{36}:int,
    pickup_boroname{37}:varchar(13),
    pickup_ct2010{38}:varchar(6) ,
    pickup_boroct2010{39}:varchar(7) ,
    pickup_cdeligibil{40}:varchar(1) ,
    pickup_ntacode{41}:varchar(4) ,
    pickup_ntaname{42}:varchar(56),
    pickup_puma{43}:varchar(4) ,

    dropoff_nyct2010_gid{44}:int,
    dropoff_ctlabel{45}:varchar(10),
    dropoff_borocode{46}:int,
    dropoff_boroname{47}:varchar(13),
    dropoff_ct2010{48}:varchar(6) ,
    dropoff_boroct2010{49}:varchar(7) ,
    dropoff_cdeligibil{50}:varchar(1) ,
    dropoff_ntacode{51}:varchar(4) ,
    dropoff_ntaname{52}:varchar(56),
    dropoff_puma{53}:varchar(4) 
);
STORE A INTO 'trips' BINARY;
$ ~/Alenka_master/alenka load.sql
GeForce GTX 1080 : 1835.000 Mhz   (Ordinal 0)
20 SMs enabled. Compute Capability sm_61
FreeMem:   6941MB   TotalMem:   8110MB   64-bit pointers.
Mem Clock: 5005.000 Mhz x 256 bits   (320.3 GB/s)
ECC Disabled


Executing file:
Couldn't open data dictionary
LOAD: A trips_xaa.csv 53  , 
Append 0
STORE: A trips 
set a piece to 1000000000 6276186112
processed recs 6441843 2350317568
processed recs 6441843 2337603584
processed recs 6441843 2323513344
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  failed synchronize in thrust::system::cuda::detail::trivial_copy_n: an illegal memory access was encountered
Aborted (core dumped)

A few minutes before the exception, nvidia-smi showed the following:

Sun Oct 16 17:34:57 2016       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:02:00.0      On |                  N/A |
| 24%   50C    P2    45W / 200W |   6163MiB /  8110MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0       960    G   /usr/lib/xorg/Xorg                             705MiB |
|    0     15016    G   compiz                                         194MiB |
|    0     21266    G   ...isallowFetchForDocWrittenScriptsInMainFra   107MiB |
|    0     21714    C   /home/mark/Alenka_master/alenka               5147MiB |
+-----------------------------------------------------------------------------+

These were the last files to be modified before the exception:

$ ls -alht | head
total 18G
-rw-rw-r--  1 mark mark  40K okt   16 18:22 trips.dropoff_puma
-rw-rw-r--  1 mark mark  13M okt   16 18:22 trips.dropoff_puma.2.idx
-rw-rw-r--  1 mark mark   20 okt   16 18:22 trips.dropoff_puma.header
drwxrwxr-x  2 mark mark  20K okt   16 18:22 .
-rw-rw-r--  1 mark mark  50M okt   16 18:22 trips.dropoff_puma.2.hash
-rw-rw-r--  1 mark mark 6,2M okt   16 18:22 trips.dropoff_ntaname.2.idx
-rw-rw-r--  1 mark mark   20 okt   16 18:22 trips.dropoff_ntaname.header
-rw-rw-r--  1 mark mark  50M okt   16 18:22 trips.dropoff_ntaname.2.hash
-rw-rw-r--  1 mark mark 6,2M okt   16 18:22 trips.dropoff_ntacode.2.idx

Here are the last few lines of the strace output:

...
clock_gettime(CLOCK_MONOTONIC_RAW, {30119, 174334632}) = 0
clock_gettime(CLOCK_MONOTONIC_RAW, {30119, 174345876}) = 0
clock_gettime(CLOCK_MONOTONIC_RAW, {30119, 174357176}) = 0
clock_gettime(CLOCK_MONOTONIC_RAW, {30119, 174368495}) = 0
clock_gettime(CLOCK_MONOTONIC_RAW, {30119, 174379727}) = 0
clock_gettime(CLOCK_MONOTONIC_RAW, {30119, 174394497}) = 0
ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7fffb66b4840) = 0
ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7fffb66b4810) = 0
write(16, "\253", 1)                    = 1
futex(0x7fbaa6088680, FUTEX_WAKE_PRIVATE, 2147483647) = 0
write(2, "terminate called after throwing "..., 48terminate called after throwing an instance of ') = 48
write(2, "thrust::system::system_error", 28thrust::system::system_error) = 28
write(2, "'\n", 2'
)                      = 2
write(2, "  what():  ", 11  what():  )             = 11
write(2, "failed synchronize in thrust::sy"..., 108failed synchronize in thrust::system::cuda::detail::trivial_copy_n: an illegal memory access was encountered) = 108
write(2, "\n", 1
)                       = 1
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
tgkill(22953, 22953, SIGABRT)           = 0
--- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=22953, si_uid=1000} ---
+++ killed by SIGABRT (core dumped) +++
Aborted (core dumped)

Any idea what might have caused this issue or what I can do to work around it? I'm happy to provide more telemetry if needed.

Cheers,
Mark

commented

I've never seen this error message before. Could I get that CSV file somewhere to test the load on my machine?
Also, do you really need to load all the fields? It seems that you need just a few of them for the queries. You could try loading just those and see if you still get the error.

I'll email you a link to the file.

I'll play around with loading a reduced set of data in the meantime.

commented

I loaded the data successfully. The only issue I had was an "out of memory" error, so I had to reduce the segment size to 500 MB:
./alenka -l 500 load_trips.sql
Can you try it?

Certainly, I'll try that and report back.

With 500 as the parameter I got the exception again, but with 200 it loaded just fine.

~/Alenka_master/alenka -l 500 load.sql
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  failed synchronize in thrust::system::cuda::detail::trivial_copy_n: an illegal memory access was encountered
~/Alenka_master/alenka -l 200 load.sql
~/Alenka_master/alenka query.sql
...
mRecCount=1 mcount = 1 term 1 limit=0 print_all=1
|20000046 |

Thanks for your help on that one.
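The workaround that emerged here is to lower the segment size passed with `-l` until the load completes (500 still failed on the GTX 1080, 200 worked). A minimal sketch that automates that retry, assuming only what the thread shows: that `alenka -l SIZE script.sql` is the invocation and that a failed load exits with a non-zero status (an abort from an uncaught exception does). The helper names and the extra fallback size of 100 are hypothetical, not part of Alenka.

```python
import subprocess
from typing import Callable, Iterable, Optional, Sequence


def run_alenka(cmd: Sequence[str]) -> int:
    """Run one Alenka invocation and return its exit status (negative if killed by a signal)."""
    return subprocess.run(list(cmd)).returncode


def load_with_fallback(
    script: str,
    sizes: Iterable[int] = (500, 200, 100),  # 500 and 200 come from this thread; 100 is an assumed extra fallback
    alenka: str = "alenka",
    run: Callable[[Sequence[str]], int] = run_alenka,
) -> Optional[int]:
    """Try `alenka -l SIZE script` for each size in turn.

    Returns the first segment size whose run exits with status 0,
    or None if every attempt fails.
    """
    for size in sizes:
        cmd = [alenka, "-l", str(size), script]
        status = run(cmd)
        print(f"{' '.join(cmd)} -> exit status {status}")
        if status == 0:
            return size
    return None
```

For example, `load_with_fallback("load.sql", alenka=os.path.expanduser("~/Alenka_master/alenka"))` would try 500, then 200, then 100. The sketch does not clean up between attempts; depending on how Alenka handles a half-written store, you may want to delete the partial `trips.*` files before each retry.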

commented

While running a query on your data I found and fixed a bug in Alenka. I updated the master branch, so please update if you have any issues.

Good stuff, I'll re-compile Alenka before I start the 1.1B record import. I've earmarked Saturday to get started on this.

commented

Don't forget to use APPEND when loading consecutive files !

commented

I made a few changes to Alenka, including the addition of a CAST operator needed for your queries.
Also, please note that in a load script the types must be specified in lower case, like "decimal", not "DECIMAL"; otherwise it won't work, because Alenka is case sensitive.
I tested your queries and a new load script; if you need them, I attached them all to this message.

Best regards,

Anton
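Because Alenka's type names are case sensitive, the `DECIMAL(14,2)` columns in the original load script need to become `decimal(14,2)`. A minimal sketch of a hypothetical helper (not part of Alenka) that lower-cases the type keyword of every column, assuming each column is declared as `name{field}:type` as in the script above; it does not check that the resulting type names are ones Alenka actually supports.

```python
import re
from typing import List

# Matches the type keyword that follows a column's field-position marker,
# e.g. the "DECIMAL" in "fare_amount{13}:DECIMAL(14,2)". Column names and
# precision arguments such as "(14,2)" are not captured, so they stay as-is.
_TYPE_AFTER_FIELD = re.compile(r"(\{\d+\}:\s*)([A-Za-z]+)")


def lowercase_types(script: str) -> str:
    """Return the load script with every column type keyword lower-cased."""
    return _TYPE_AFTER_FIELD.sub(lambda m: m.group(1) + m.group(2).lower(), script)


def find_uppercase_types(script: str) -> List[str]:
    """List the column type keywords that are not already lower case."""
    return [m.group(2) for m in _TYPE_AFTER_FIELD.finditer(script)
            if m.group(2) != m.group(2).lower()]


def fix_load_script(path: str) -> None:
    """Rewrite a load script file in place with lower-cased column types."""
    with open(path) as f:
        text = f.read()
    with open(path, "w") as f:
        f.write(lowercase_types(text))
```

Running `find_uppercase_types` on the script first gives a quick list of the offending declarations; only keywords directly after a `{n}:` marker are touched, so `LOAD`, `USING`, `STORE` and `BINARY` are left alone.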

Thanks Anton.

I'm not seeing the attachments here, could you send them over again please?

commented

That was an old message; I remember that after that I sent you all the queries as text in an email.
It might take me a few days to add what's needed to make queries 3 and 4 run; I'll try to do it this weekend.

Anton

Cool. I'll earmark Sunday evening again to have another go with all this.

commented

I fixed an issue with the APPEND and group-by operators, so Q1 should work. Unfortunately, you will have to reload the data.
I'll start working on the rest of the queries.

Great, I'll recompile and import the data again on Sunday and report back.